# Breast Cancer 

Introduction

## 1. Data

Data description

First of all, let's load the data into a data frame and print the first five rows so we can take a look at it.


In [1]:
library(dplyr)


In [2]:
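# Column names for the dataset, in file order
attribute_names <- c('sample_id', 
                     'Clump_Thickness', 
                     'Uniformity_of_Cell_Size', 
                     'Uniformity_of_Cell_Shape', 
                     'Marginal_Adhesion', 
                     'Single_Epithelial_Cell_Size', 
                     'Bare_Nuclei',
                     'Bland_Chromatin',
                     'Normal_Nucleoli',
                     'Mitoses',
                     'Class')  

# Note: read.csv treats the first line of the file as a header row by default
# (header = TRUE). If cancer.csv has no header line, pass header = FALSE so
# the first record is not consumed as column names.
cancer_data <- read.csv('cancer.csv', col.names = attribute_names)
head(cancer_data, 5)

| | sample_id | Clump_Thickness | Uniformity_of_Cell_Size | Uniformity_of_Cell_Shape | Marginal_Adhesion | Single_Epithelial_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| 5 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |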


### 1.1 Cleaning the Data

Before applying any machine learning algorithm or statistical method, we must clean the data. The data description tells us that missing values in the dataset are marked with the character '?'.

In [3]:
missing_val_spots <- cancer_data == '?'
cancer_data[missing_val_spots]

As we can see, there are 16 missing values. In *The Data Science Design Manual* (p. 77), Skiena describes several methods for dealing with missing values; we'll pick just one of them.
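A per-column count makes it easier to see where missing values sit. The sketch below uses a small made-up data frame (`toy`, which is not the real dataset) to show the idea; the same `colSums` call could presumably be applied to `cancer_data`.

```r
# Toy data frame standing in for cancer_data (values are made up for illustration)
toy <- data.frame(
    Clump_Thickness = c(5, 3, 6, 4),
    Bare_Nuclei     = c('10', '?', '4', '?'),
    Mitoses         = c(1, 1, 1, 1)
)

# toy == '?' is a logical matrix marking the '?' entries;
# colSums counts the TRUE values in each column
missing_per_column <- colSums(toy == '?')
print(missing_per_column)

# Total number of missing values in the whole data frame
total_missing <- sum(missing_per_column)
print(total_missing)
```

In this toy example only `Bare_Nuclei` contains '?' entries, so it is the only column with a non-zero count.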

The method we'll follow consists of replacing each missing value with the mean of the attribute it belongs to. So, let's apply this process to the missing values in every column.

In [4]:
for (col in seq_len(ncol(cancer_data))) {
    # TRUE for the rows where this column holds the missing-value marker
    is_missing <- cancer_data[[col]] %in% '?'
    if (any(is_missing)) {
        # Convert the column to numeric; the '?' entries become NA
        values <- suppressWarnings(as.numeric(cancer_data[[col]]))
        # Replace the missing entries with the mean of the observed values
        values[is_missing] <- mean(values, na.rm = TRUE)
        cancer_data[[col]] <- values
    }
}

In [5]:
cancer_data[missing_val_spots]
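
An alternative, shown here only as a sketch, is to let `read.csv` do the conversion: its `na.strings` argument turns the '?' markers into `NA` on load, so the column is read as numeric and the mean can be computed with `na.rm = TRUE`. The inline CSV text and the names `toy_csv` and `toy` are made up for illustration and are not part of the dataset.

```r
# Made-up CSV text standing in for the file (it includes a header row)
toy_csv <- "sample_id,Bare_Nuclei,Class
1,10,2
2,?,2
3,4,4
4,1,2"

# na.strings = '?' makes read.csv treat '?' as NA, so Bare_Nuclei is parsed as numeric
toy <- read.csv(text = toy_csv, na.strings = '?')

# Replace every NA in each numeric column with that column's mean
for (col in names(toy)) {
    if (is.numeric(toy[[col]]) && anyNA(toy[[col]])) {
        toy[[col]][is.na(toy[[col]])] <- mean(toy[[col]], na.rm = TRUE)
    }
}

print(toy)
```

With this approach the column never passes through a character type, so no explicit `as.numeric` conversion is needed afterwards.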