**Objective**: classify tumors in a breast cancer dataset as malignant or benign

In [27]:
# Install the required packages (uncomment on first run)
#install.packages("e1071")
#install.packages("caret")
#install.packages("mlbench")
#install.packages("mice")
library(caret)
library(mlbench)
library(mice)

In [9]:
### Read the data

temp_dados <- read.csv("databases/1 - Cancer de Mama - Dados.csv")

head(temp_dados)

,Id,Cl.thickness,Cell.size,Cell.shape,Marg.adhesion,Epith.c.size,Bare.nuclei,Bl.cromatin,Normal.nucleoli,Mitoses,Class
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
1,1,5,1,1,1,2,1,3,1,1,benign
2,2,5,4,4,5,7,10,3,2,1,benign
3,3,3,1,1,1,2,2,3,1,1,benign
4,4,6,8,8,1,3,4,3,7,1,benign
5,5,4,1,1,3,2,1,3,1,1,benign
6,6,8,10,10,8,7,10,9,7,1,malignant


In [15]:
# Drop the ID column and impute missing values
set.seed(123) 
temp_dados$Id <- NULL
imp <- mice(temp_dados)
dados <- complete(imp, 1) # 1 selects the first of the m imputed datasets (mice creates m = 5 by default)


 iter imp variable
  1   1
  1   2
  1   3
  1   4
  1   5
  2   1
  2   2
  2   3
  2   4
  2   5
  3   1
  3   2
  3   3
  3   4
  3   5
  4   1
  4   2
  4   3
  4   4
  4   5
  5   1
  5   2
  5   3
  5   4
  5   5


"Number of logged events: 1"

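The second argument of `complete()` selects which imputed dataset to extract. `mice()` builds `m = 5` completed datasets by default (the `imp` column in the log above runs from 1 to 5), and `complete(imp, 1)` returns the first of them. A minimal sketch of this behavior, assuming the `nhanes` example data that ships with `mice` rather than the tumor data; the names `imp_demo`, `first_copy`, and `second_copy` are illustrative only:

```r
library(mice)

# nhanes is a small example data frame bundled with mice; it contains NAs.
imp_demo <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Each index from 1 to m returns a different completed copy of the data.
first_copy  <- complete(imp_demo, 1)
second_copy <- complete(imp_demo, 2)

# Observed values are the same in every copy; only imputed cells may differ.
anyNA(first_copy)
identical(dim(first_copy), dim(nhanes))
```

Working with a single completed dataset is a simplification, since it ignores the variability between imputations. Comparing results across several of the `m` copies would be one way to check how sensitive the model is to the imputed values.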

In [17]:
# Create training and test sets with an 80%/20% split
set.seed(123)
indices <- createDataPartition(dados$Class, p = 0.8, list = FALSE)
treino <- dados[indices, ]
teste <- dados[-indices, ]

# Train the model on the hold-out training set (by default train() resamples it with 25 bootstrap reps to tune size and decay)
set.seed(123)
rna <- train(Class ~ ., data = treino, method = "nnet", trace = FALSE)
rna

Neural Network 

560 samples
  9 predictor
  2 classes: 'benign', 'malignant' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 560, 560, 560, 560, 560, 560, ... 
Resampling results across tuning parameters:

  size  decay  Accuracy   Kappa    
  1     0e+00  0.9023357  0.7486937
  1     1e-04  0.9412348  0.8589124
  1     1e-01  0.9539669  0.8992057
  3     0e+00  0.9533491  0.8970306
  3     1e-04  0.9523401  0.8947772
  3     1e-01  0.9528485  0.8949808
  5     0e+00  0.9366896  0.8531898
  5     1e-04  0.9494210  0.8871458
  5     1e-01  0.9513048  0.8908209

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 1 and decay = 0.1.

In [18]:
# Predict the classes of the test set
predict_rna <- predict(rna, teste)

# Confusion matrix
confusionMatrix(predict_rna, as.factor(teste$Class))

Confusion Matrix and Statistics

           Reference
Prediction  benign malignant
  benign        88         1
  malignant      3        47
                                         
               Accuracy : 0.9712         
                 95% CI : (0.928, 0.9921)
    No Information Rate : 0.6547         
    P-Value [Acc > NIR] : <2e-16         
                                         
                  Kappa : 0.937          
                                         
 Mcnemar's Test P-Value : 0.6171         
                                         
            Sensitivity : 0.9670         
            Specificity : 0.9792         
         Pos Pred Value : 0.9888         
         Neg Pred Value : 0.9400         
             Prevalence : 0.6547         
         Detection Rate : 0.6331         
   Detection Prevalence : 0.6403         
      Balanced Accuracy : 0.9731         
                                         
       'Positive' Class : benign         
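
The statistics reported by `confusionMatrix()` follow directly from the four counts in the table. The base-R sketch below recomputes the main ones by hand from the counts printed above; the names `cm`, `tp`, `fn`, `fp`, `tn`, and `metrics` are illustrative and not part of the original notebook.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column).
cm <- matrix(c(88, 3, 1, 47), nrow = 2,
             dimnames = list(Prediction = c("benign", "malignant"),
                             Reference  = c("benign", "malignant")))

# "benign" is the positive class, as stated in the caret output.
tp <- cm["benign", "benign"]        # benign predicted as benign
fn <- cm["malignant", "benign"]     # benign predicted as malignant
fp <- cm["benign", "malignant"]     # malignant predicted as benign
tn <- cm["malignant", "malignant"]  # malignant predicted as malignant

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  ppv         = tp / (tp + fp),
  npv         = tn / (tn + fn)
)
metrics["balanced_accuracy"] <-
  mean(metrics[c("sensitivity", "specificity")])

round(metrics, 4)
```

Because the positive class is `benign`, specificity here is the share of malignant tumors that were correctly flagged as malignant (47 of 48), which is worth keeping in mind when reading the report.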
                   

In [22]:
# Using cross-validation

# Set the resampling method to cv with 10 folds
ctrl <- trainControl(method = "cv", number = 10)

# Train the neural network with this control
set.seed(1234)
rna <- train(Class ~ ., data = treino, method = "nnet",
              trace = FALSE, trControl = ctrl)

predict_rna <- predict(rna, teste)
confusionMatrix(predict_rna, as.factor(teste$Class))

Confusion Matrix and Statistics

           Reference
Prediction  benign malignant
  benign        88         1
  malignant      3        47
                                         
               Accuracy : 0.9712         
                 95% CI : (0.928, 0.9921)
    No Information Rate : 0.6547         
    P-Value [Acc > NIR] : <2e-16         
                                         
                  Kappa : 0.937          
                                         
 Mcnemar's Test P-Value : 0.6171         
                                         
            Sensitivity : 0.9670         
            Specificity : 0.9792         
         Pos Pred Value : 0.9888         
         Neg Pred Value : 0.9400         
             Prevalence : 0.6547         
         Detection Rate : 0.6331         
   Detection Prevalence : 0.6403         
      Balanced Accuracy : 0.9731         
                                         
       'Positive' Class : benign         
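
With `trainControl(method = "cv", number = 10)`, the training rows are split into 10 folds; each candidate model is fit on 9 folds and validated on the remaining one. A rough base-R sketch of the fold assignment, assuming plain random folds (caret additionally balances the class proportions within each fold, which this sketch does not); `n`, `k`, `fold_id`, and `train_sizes` are illustrative names:

```r
set.seed(123)
n <- 560   # number of training rows reported by caret for this dataset
k <- 10    # number of folds

# Assign each row to one of k folds at random, with equal fold sizes here.
fold_id <- sample(rep(seq_len(k), length.out = n))

# For fold i: fit on rows where fold_id != i, validate on rows where fold_id == i.
train_sizes <- vapply(seq_len(k), function(i) sum(fold_id != i), integer(1))
train_sizes
```

The test set plays no part in this procedure; cross-validation only affects how the tuning parameters are chosen on the training data.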
                   

In [23]:
# Hyperparameter tuning for the neural network
# size: number of hidden units; decay: weight decay
grid <- expand.grid(size = seq(from = 1, to = 45, by = 10),
                    decay = seq(from = 0.1, to = 0.9, by = 0.3))

set.seed(123)
rna <- train(form = Class ~ ., data = treino, method = "nnet",
             tuneGrid = grid, trControl = ctrl, maxit = 2000, trace = FALSE)

In [24]:
rna

Neural Network 

560 samples
  9 predictor
  2 classes: 'benign', 'malignant' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 503, 503, 504, 505, 504, 505, ... 
Resampling results across tuning parameters:

  size  decay  Accuracy   Kappa    
   1    0.1    0.9641513  0.9217206
   1    0.4    0.9606123  0.9140039
   1    0.7    0.9605799  0.9140557
  11    0.1    0.9516513  0.8929857
  11    0.4    0.9623667  0.9175047
  11    0.7    0.9623656  0.9175021
  21    0.1    0.9533721  0.8966760
  21    0.4    0.9641524  0.9214360
  21    0.7    0.9623656  0.9173021
  31    0.1    0.9534359  0.8972229
  31    0.4    0.9605485  0.9134337
  31    0.7    0.9641200  0.9213982
  41    0.1    0.9588591  0.9092106
  41    0.4    0.9641200  0.9214069
  41    0.7    0.9659381  0.9254693

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 41 and decay = 0.7.
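
The 15 rows in the table above come from the tuning grid: `expand.grid()` crosses every `size` with every `decay`. Note that `seq(from = 0.1, to = 0.9, by = 0.3)` stops at 0.7, because the next step (1.0) would exceed the upper bound. A quick check in base R, using the illustrative name `grid_demo`:

```r
# Same call as in the notebook, stored under a different name for inspection.
grid_demo <- expand.grid(size = seq(from = 1, to = 45, by = 10),
                         decay = seq(from = 0.1, to = 0.9, by = 0.3))

nrow(grid_demo)           # number of size/decay combinations
unique(grid_demo$size)    # candidate hidden-layer sizes
unique(grid_demo$decay)   # candidate weight-decay values
```

With 10-fold cross-validation, each combination is fit 10 times, so this grid costs 150 fits before the final model is trained. The selected values (`size = 41`, `decay = 0.7`) sit at the edge of the grid, and the accuracies across the grid differ by less than 0.015, so the choice may not be very robust.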

In [25]:
# Predict on the test set with the tuned model
predict_rna <- predict(rna, teste)
confusionMatrix(predict_rna, as.factor(teste$Class))


Confusion Matrix and Statistics

           Reference
Prediction  benign malignant
  benign        89         1
  malignant      2        47
                                          
               Accuracy : 0.9784          
                 95% CI : (0.9382, 0.9955)
    No Information Rate : 0.6547          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9525          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.9780          
            Specificity : 0.9792          
         Pos Pred Value : 0.9889          
         Neg Pred Value : 0.9592          
             Prevalence : 0.6547          
         Detection Rate : 0.6403          
   Detection Prevalence : 0.6475          
      Balanced Accuracy : 0.9786          
                                          
       'Positive' Class : benign          

In [26]:
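# Load the new, unlabeled cases and drop the ID column
dados_novos_casos <- read.csv("databases/1 - Cancer de Mama - Dados - Novos Casos.csv")
dados_novos_casos$Id <- NULL
View(dados_novos_casos)

# Predict with the tuned network, then replace the placeholder Class column ("?") with the predictions
predict_rna <- predict(rna, dados_novos_casos)
dados_novos_casos$Class <- NULL
resultado <- cbind(dados_novos_casos, predict_rna)

View(resultado)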

Cl.thickness,Cell.size,Cell.shape,Marg.adhesion,Epith.c.size,Bare.nuclei,Bl.cromatin,Normal.nucleoli,Mitoses,Class
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
2,1,5,1,7,1,3,10,5,?
2,3,4,5,3,1,3,2,1,?
7,1,1,5,2,2,2,1,1,?


Cl.thickness,Cell.size,Cell.shape,Marg.adhesion,Epith.c.size,Bare.nuclei,Bl.cromatin,Normal.nucleoli,Mitoses,predict_rna
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
2,1,5,1,7,1,3,10,5,benign
2,3,4,5,3,1,3,2,1,benign
7,1,1,5,2,2,2,1,1,benign
