# Breast cancer diagnosis with kNN

Guided project for the Machine Learning course, based on the guided project in chapter 3 of Lantz's book *Machine Learning with R* (2013).

The project uses the k-nearest neighbors (kNN) algorithm to build a cancer-detection model from biopsy data of cells taken from women with abnormal breast masses.

The goal is to build a classifier that indicates whether a tumor is benign (B) or malignant (M). In total, the dataset has 32 fields describing each tumor: an ID, the diagnosis, and 30 numeric measurements.



## 1 - Importing the data and the dplyr library

In [17]:
# Importing the dataset (the file has no header row)
dataset_cancer <- read.csv("breast-cancer-wisconsin.data",
                           header = FALSE)

In [18]:
# Loading dplyr
# dplyr is the main tidyverse library for data manipulation
library(dplyr)

## 2 - First look at the dataset and initial cleanup

In [4]:
# Vector with the column names
colunas <- c("ID", "diagnosis", "mean radius", "mean texture", "mean perimeter", "mean area",
             "mean smoothness", "mean compactness", "mean concavity", "mean concave points",
             "mean symmetry", "mean fractal dimension", "radius SE", "texture SE", 
             "perimeter SE", "area SE", "smoothness SE", "compactness SE", "concavity SE",
             "concave points SE", "symmetry SE", "fractal dimension SE", "worst radius",
             "worst texture", "worst perimeter", "worst area", "worst smoothness",
             "worst compactness", "worst concavity", "worst concave points", "worst symmetry",
             "worst fractal dimension")

In [19]:
# Naming the dataset columns
names(dataset_cancer) <- colunas
head(dataset_cancer)

| | ID | diagnosis | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | ⋯ | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ⋯ | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 2 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ⋯ | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 3 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ⋯ | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 4 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ⋯ | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 5 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ⋯ | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |
| 6 | 843786 | M | 12.45 | 15.7 | 82.57 | 477.1 | 0.1278 | 0.17 | 0.1578 | 0.08089 | ⋯ | 15.47 | 23.75 | 103.4 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.1244 |


In [20]:
# Checking the dimensions of the data frame
dim(dataset_cancer)

In [21]:
# Inspecting the structure of the dataset
str(dataset_cancer)

'data.frame':	569 obs. of  32 variables:
 $ ID                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
 $ diagnosis              : chr  "M" "M" "M" "M" ...
 $ mean radius            : num  18 20.6 19.7 11.4 20.3 ...
 $ mean texture           : num  10.4 17.8 21.2 20.4 14.3 ...
 $ mean perimeter         : num  122.8 132.9 130 77.6 135.1 ...
 $ mean area              : num  1001 1326 1203 386 1297 ...
 $ mean smoothness        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
 $ mean compactness       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
 $ mean concavity         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
 $ mean concave points    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
 $ mean symmetry          : num  0.242 0.181 0.207 0.26 0.181 ...
 $ mean fractal dimension : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
 $ radius SE              : num  1.095 0.543 0.746 0.496 0.757 ...
 $ texture SE             : num  0.905 0.734 0.787 1

In [22]:
# How many samples each class has
table(dataset_cancer$diagnosis)


  B   M 
357 212 
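The counts above are easier to compare as percentages. A minimal sketch, assuming the two counts are copied by hand from the `table()` output (the names `counts` and `proportions` are illustrative, not part of the original notebook):

```r
# Class counts copied from the table() output above (B = benign, M = malignant)
counts <- c(B = 357, M = 212)

# prop.table() divides each count by the total; scale to percentages
proportions <- round(prop.table(counts) * 100, 1)
print(proportions)
```

With the real data frame, the same idea would be `prop.table(table(dataset_cancer$diagnosis))`.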

In [23]:
# Many machine learning classification algorithms in R require the target column
# to be encoded as a factor, so we encode the "diagnosis" column
dataset_cancer$diagnosis <- factor(dataset_cancer$diagnosis,
                                   levels = c("B", "M"),
                                   labels = c("Benigno", "Maligno"))

In [25]:
# Dropping the ID column, which is irrelevant to the algorithm
dataset_cancer <- dataset_cancer[-1]
head(dataset_cancer)

| | diagnosis | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | ⋯ | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | Maligno | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ⋯ | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 2 | Maligno | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ⋯ | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 3 | Maligno | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ⋯ | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 4 | Maligno | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ⋯ | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 5 | Maligno | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ⋯ | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |
| 6 | Maligno | 12.45 | 15.7 | 82.57 | 477.1 | 0.1278 | 0.17 | 0.1578 | 0.08089 | 0.2087 | ⋯ | 15.47 | 23.75 | 103.4 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.1244 |


## 3 - Normalizing the features

Apart from the `diagnosis` column, all columns are numeric, so we can apply the `summary()` function to them.

In [30]:
# Applying the summary function to three columns
summary(dataset_cancer[c("mean radius", "mean texture", "mean perimeter")])

  mean radius      mean texture   mean perimeter  
 Min.   : 6.981   Min.   : 9.71   Min.   : 43.79  
 1st Qu.:11.700   1st Qu.:16.17   1st Qu.: 75.17  
 Median :13.370   Median :18.84   Median : 86.24  
 Mean   :14.127   Mean   :19.29   Mean   : 91.97  
 3rd Qu.:15.780   3rd Qu.:21.80   3rd Qu.:104.10  
 Max.   :28.110   Max.   :39.28   Max.   :188.50  

Note that the scales of the columns differ widely. Because kNN depends heavily on distance calculations, the columns with larger ranges would have far more influence than the others. To solve this problem, we normalize every column to the same scale.
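To make the effect concrete, here is a small sketch that measures how much each feature contributes to the squared Euclidean distance between two tumors. The three values per tumor are taken from rows 1 and 4 of the `head()` output shown earlier (mean radius, mean area, mean smoothness); the names `a`, `b`, and `share` are only illustrative.

```r
# Raw (unnormalized) values for two tumors, taken from rows 1 and 4 of head()
a <- c(radius = 17.99, area = 1001.0, smoothness = 0.1184)
b <- c(radius = 11.42, area = 386.1,  smoothness = 0.1425)

# Squared difference per feature and the resulting Euclidean distance
sq_diff <- (a - b)^2
euclid  <- sqrt(sum(sq_diff))

# Fraction of the squared distance that comes from each feature
share <- sq_diff / sum(sq_diff)
print(euclid)
print(round(share, 4))
```

On the raw scale, `area` accounts for nearly all of the distance, so `radius` and `smoothness` would have almost no say in which neighbors are considered nearest.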

In [31]:
# Min-max normalization of one column (rescales it to the [0, 1] range)
norm <- function(col) {
  return((col - min(col)) / (max(col) - min(col)))
}
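A quick sanity check of the function on toy vectors, as a sketch (the function is redefined here only so the snippet runs on its own):

```r
# Same min-max function as in the cell above, repeated to keep this snippet self-contained
norm <- function(col) {
  return((col - min(col)) / (max(col) - min(col)))
}

# Vectors on different scales map to the same normalized values
small <- norm(c(1, 2, 3, 4, 5))       # 0.00 0.25 0.50 0.75 1.00
large <- norm(c(10, 20, 30, 40, 50))  # 0.00 0.25 0.50 0.75 1.00
print(small)
print(large)

# Caveat: a constant column has max == min, so the result is 0/0 = NaN
constant <- norm(c(3, 3, 3))
print(constant)
```

None of the 30 feature columns here is constant, but the `NaN` case is worth keeping in mind when reusing the function on other data.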

In [33]:
# Applying normalization to the feature columns of the dataset
# lapply applies the function to each element of the list (here, each column)
# as.data.frame converts the resulting list back into a data frame
dataset_cancer_norm <- as.data.frame(lapply(dataset_cancer[2:31], norm))

In [34]:
# Viewing the dataset after normalization
head(dataset_cancer_norm)

| | mean.radius | mean.texture | mean.perimeter | mean.area | mean.smoothness | mean.compactness | mean.concavity | mean.concave.points | mean.symmetry | mean.fractal.dimension | ⋯ | worst.radius | worst.texture | worst.perimeter | worst.area | worst.smoothness | worst.compactness | worst.concavity | worst.concave.points | worst.symmetry | worst.fractal.dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 0.5210374 | 0.0226581 | 0.5459885 | 0.3637328 | 0.5937528 | 0.7920373 | 0.7031396 | 0.7311133 | 0.6863636 | 0.6055181 | ⋯ | 0.6207755 | 0.1415245 | 0.6683102 | 0.45069799 | 0.6011358 | 0.6192916 | 0.5686102 | 0.9120275 | 0.5984624 | 0.418864 |
| 2 | 0.6431445 | 0.2725736 | 0.6157833 | 0.5015907 | 0.2898799 | 0.181768 | 0.2036082 | 0.3487575 | 0.379798 | 0.1413227 | ⋯ | 0.6069015 | 0.3035714 | 0.5398177 | 0.43521431 | 0.3475533 | 0.1545634 | 0.1929712 | 0.6391753 | 0.2335896 | 0.2228781 |
| 3 | 0.6014956 | 0.3902604 | 0.5957432 | 0.4494168 | 0.5143089 | 0.4310165 | 0.4625117 | 0.6356859 | 0.509596 | 0.2112468 | ⋯ | 0.5563856 | 0.3600746 | 0.5084417 | 0.37450845 | 0.4835898 | 0.3853751 | 0.3597444 | 0.8350515 | 0.4037059 | 0.213433 |
| 4 | 0.2100904 | 0.3608387 | 0.2335015 | 0.1029056 | 0.8113208 | 0.8113613 | 0.5656045 | 0.5228628 | 0.7762626 | 1.0 | ⋯ | 0.2483102 | 0.3859275 | 0.2413467 | 0.09400806 | 0.9154725 | 0.8140117 | 0.5486422 | 0.8848797 | 1.0 | 0.7737111 |
| 5 | 0.6298926 | 0.1565776 | 0.6309861 | 0.4892895 | 0.4303512 | 0.3478928 | 0.4639175 | 0.5183897 | 0.3782828 | 0.1868155 | ⋯ | 0.5197439 | 0.1239339 | 0.5069476 | 0.34157491 | 0.4373638 | 0.1724151 | 0.3194888 | 0.5584192 | 0.1575005 | 0.1425948 |
| 6 | 0.2588386 | 0.2025702 | 0.2679842 | 0.1415058 | 0.6786133 | 0.4619962 | 0.3697282 | 0.4020378 | 0.5186869 | 0.5511794 | ⋯ | 0.2682319 | 0.3126333 | 0.2639076 | 0.13674794 | 0.7127386 | 0.4827837 | 0.4277157 | 0.5982818 | 0.4770353 | 0.454939 |


In [35]:
# Checking that the normalization worked
summary(dataset_cancer_norm)

  mean.radius      mean.texture    mean.perimeter     mean.area     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.2233   1st Qu.:0.2185   1st Qu.:0.2168   1st Qu.:0.1174  
 Median :0.3024   Median :0.3088   Median :0.2933   Median :0.1729  
 Mean   :0.3382   Mean   :0.3240   Mean   :0.3329   Mean   :0.2169  
 3rd Qu.:0.4164   3rd Qu.:0.4089   3rd Qu.:0.4168   3rd Qu.:0.2711  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
 mean.smoothness  mean.compactness mean.concavity    mean.concave.points
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000     
 1st Qu.:0.3046   1st Qu.:0.1397   1st Qu.:0.06926   1st Qu.:0.1009     
 Median :0.3904   Median :0.2247   Median :0.14419   Median :0.1665     
 Mean   :0.3948   Mean   :0.2606   Mean   :0.20806   Mean   :0.2431     
 3rd Qu.:0.4755   3rd Qu.:0.3405   3rd Qu.:0.30623   3rd Qu.:0.3678     
 Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000     
 mean.

## 4 - Splitting into training and test datasets

In [36]:
# Setting a seed so the data split is reproducible
set.seed(50)

In [37]:
# Splitting by randomly selecting 75% of the dataset rows
# floor() makes the rounding explicit, since 569 * 0.75 is not a whole number
id_split <- sort(sample(nrow(dataset_cancer_norm),
                        floor(nrow(dataset_cancer_norm) * 0.75)))
id_split
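The mechanics of the split can be seen on a toy example. This is a sketch with a hypothetical 10-row dataset (`n`, `idx`, and `idx_again` are illustrative names): the seed makes the draw repeatable, and the rows not drawn form the test set.

```r
n <- 10  # hypothetical number of rows

# Draw 75% of the row indices (rounded down) after fixing the seed
set.seed(50)
idx <- sort(sample(n, floor(n * 0.75)))

# Resetting the same seed reproduces exactly the same draw
set.seed(50)
idx_again <- sort(sample(n, floor(n * 0.75)))

# Rows that were not drawn become the test set
test_idx <- setdiff(seq_len(n), idx)

print(idx)
print(test_idx)
print(identical(idx, idx_again))
```

Because `sample()` draws without replacement by default, the training and test indices never overlap.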

In [41]:
# Defining the feature datasets for training and testing
features <- dataset_cancer_norm
features_train <- features[id_split, ]
features_test <- features[-id_split, ]

In [46]:
# Defining the target vectors for training and testing
targets <- dataset_cancer[1]
targets_train <- targets[id_split, ]
targets_test <- targets[-id_split, ]

As a reminder, in R a blank value when selecting `[row, column]` means that all rows or all columns should be included.
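A short sketch of this indexing rule on a hypothetical data frame `df` (all names here are illustrative). It also shows why row-indexing a one-column data frame, as done for the targets, yields a plain vector rather than a data frame.

```r
# Hypothetical toy data frame
df <- data.frame(x = 1:4,
                 y = c("a", "b", "c", "d"),
                 stringsAsFactors = FALSE)

kept    <- df[c(1, 3), ]   # rows 1 and 3, all columns (blank column index)
dropped <- df[-c(1, 3), ]  # every row except 1 and 3, all columns
col_x   <- df[, "x"]       # all rows of column x (blank row index)

# A one-column data frame indexed by row is simplified to a vector
one_col <- df["y"]
as_vec  <- one_col[c(1, 3), ]

print(kept)
print(dropped)
print(col_x)
print(as_vec)
```

Passing `drop = FALSE` (for example `one_col[c(1, 3), , drop = FALSE]`) would keep the data frame structure instead.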

## 5 - Training and using the model

In [48]:
# Installing the package that implements the knn model
install.packages("class")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [53]:
# Loading the library
library("class")

In [54]:
# Training the model
# The distance used by this knn is Euclidean
cancer_pred <- knn(train = features_train, test = features_test, cl = targets_train, k = 7)

The `knn` function returns a vector (a factor) with the classes predicted for the test data passed as an argument, using a model built from the training features and targets that were also passed in.
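To illustrate what happens for each test row, here is a minimal hand-written sketch of the kNN idea: compute the Euclidean distance to every training row, take the `k` closest, and return the majority class. The function `knn_one` and the toy data are hypothetical and are not how the `class` package is implemented; in particular, this sketch resolves voting ties by taking the first level, whereas `class::knn` breaks them at random.

```r
# Hypothetical helper: classify a single query point by majority vote of its k nearest neighbors
knn_one <- function(train, labels, query, k) {
  # Euclidean distance from the query to every training row
  diffs <- sweep(as.matrix(train), 2, query)
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training rows
  nearest <- order(dists)[seq_len(k)]
  # Majority vote among their labels (ties go to the first level)
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

# Toy training set with two already-normalized features
train <- data.frame(f1 = c(0.10, 0.20, 0.15, 0.80, 0.90, 0.85),
                    f2 = c(0.10, 0.15, 0.20, 0.90, 0.80, 0.85))
labels <- factor(c("Benigno", "Benigno", "Benigno",
                   "Maligno", "Maligno", "Maligno"))

pred_low  <- knn_one(train, labels, query = c(0.2, 0.2), k = 3)
pred_high <- knn_one(train, labels, query = c(0.9, 0.9), k = 3)
print(pred_low)   # "Benigno"
print(pred_high)  # "Maligno"
```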

In [55]:
cancer_pred

## 6 - Evaluating the model

In [56]:
# Installing the package with evaluation metrics
install.packages("gmodels")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘gtools’, ‘gdata’




In [57]:
# Loading the package
library(gmodels)

In [58]:
# Evaluating with a confusion matrix
CrossTable(x = targets_test, y = cancer_pred, prop.chisq = FALSE)


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  143 

 
             | cancer_pred 
targets_test |   Benigno |   Maligno | Row Total | 
-------------|-----------|-----------|-----------|
     Benigno |        90 |         0 |        90 | 
             |     1.000 |     0.000 |     0.629 | 
             |     0.968 |     0.000 |           | 
             |     0.629 |     0.000 |           | 
-------------|-----------|-----------|-----------|
     Maligno |         3 |        50 |        53 | 
             |     0.057 |     0.943 |     0.371 | 
             |     0.032 |     1.000 |           | 
             |     0.021 |     0.350 |           | 
-------------|-----------|-----------|-----------|
Column Total |        93 |        50 |       143 | 
             |     0.650 |     0.350 |           | 
-------------|----