The goal of this assignment is to use a random forest to classify observations based on the diagnosis column.

In [1]:
import pandas as pd 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score

In [2]:
#Load the data and convert into a dataframe
data = 'data-breastCancer.csv'
df = pd.read_csv(data)


In [3]:
#display the first 5 rows of the data
df.head()

,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [4]:
#print information about the data, such as dtypes and the number of non-null values in each column
#df.info() prints directly and returns None, so it is not wrapped in print()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

In [5]:
#check the dimensions of the data, shows the number of columns and rows
print("Shape of data:", '\n', df.shape, '\n')

Shape of data: 
 (569, 33) 



In [6]:
#check the number of missing values per column
print("missing values:", '\n', df.isnull().sum())


missing values: 
 id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_wor

I noticed that the column "Unnamed: 32" contained missing values. Since it was the only column with missing values, I removed it from the dataframe.

In [7]:
# remove the column that contains all the missing values
df = df.drop(['Unnamed: 32'], axis = 1)


In [8]:
#check again to see if there are any other missing values in any of the other columns
print("Missing values:", '\n', df.isnull().sum())


Missing values: 
 id                         0
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64


In [9]:
#check the shape of the data after removing column
print("Shape of data:", '\n', df.shape )

Shape of data: 
 (569, 32)


In [10]:
#remove the id column since it is only an identifier and carries no information about a patient's diagnosis
df = df.drop(['id'], axis = 1)

In [11]:
#check the number of observations per class in the diagnosis column
df['diagnosis'].value_counts()

diagnosis
B    357
M    212
Name: count, dtype: int64

Before starting model development, I checked the number of observations per class in the diagnosis column. There is some imbalance (357 benign versus 212 malignant, roughly 63% to 37%), so I will take that into account when interpreting my results.

In [12]:
# All the variables besides diagnosis are predictors, and the response is the diagnosis column
X = df.drop(['diagnosis'],axis = 1)
Y = df['diagnosis']
# check the number of columns and rows for the predictors
print(X.shape)
# check the number of rows for the response variable 
print(Y.shape)

(569, 30)
(569,)


In [13]:
# perform the train test split on the predictor and response variable
seed = 7
x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size = 0.2, random_state=seed)



I created a train/test split to classify based on the diagnosis column: 80% of the data goes to the training subset, and 20% goes to the test subset.
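Because the two classes are not equally represented, one variation on this split is to stratify on the label so that both subsets keep roughly the same benign/malignant ratio. The sketch below is not what the notebook ran. It assumes a synthetic label list with the same class counts (357 B, 212 M) as a stand-in for `Y`, and the names `labels` and `features` are illustrative only.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Stand-in for the diagnosis column: same class counts as the dataset
labels = ["B"] * 357 + ["M"] * 212
# Placeholder single-feature matrix; only the labels matter for this illustration
features = [[i] for i in range(len(labels))]

# stratify=labels asks for the class ratio to be preserved in both subsets
x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=7, stratify=labels
)

overall_m = labels.count("M") / len(labels)
train_m = Counter(y_tr)["M"] / len(y_tr)
test_m = Counter(y_te)["M"] / len(y_te)
print(f"malignant share: overall={overall_m:.3f}, train={train_m:.3f}, test={test_m:.3f}")
```

With stratification, the malignant share in the training and test subsets should stay close to the overall share, which makes the per-class test metrics less dependent on the particular random seed.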

In [14]:
# perform grid search on random forest to determine best parameters. 
param_grid = {'max_features': [4,5,6], 'n_estimators': [200, 400, 600, 800], 'max_depth': [1,5,10,15]}

forest = RandomForestClassifier(random_state=7)

gridsearch = GridSearchCV(forest, param_grid=param_grid, cv = 5)
#run the grid search on the training data
gridsearch.fit(x_train, y_train)

I ran a grid search with 5-fold cross-validation to determine the best parameters for this random forest model.

In [15]:
#shows the best parameters for the random forest model 
print(gridsearch.best_estimator_)
#displays the mean cross-validated accuracy of the best parameter combination found by the grid search
print(gridsearch.best_score_)

RandomForestClassifier(max_depth=5, max_features=4, n_estimators=600,
                       random_state=7)
0.9582417582417582


The grid search selected a maximum tree depth of 5 (each tree is limited to 5 levels), a max_features value of 4, and 600 estimators. The max_features value is the maximum number of predictors considered at each split; 4 is fairly close to the rule of thumb of sqrt(p), which is about 5.48 for the 30 predictors here. The number of estimators is the number of trees in the forest, which plays a significant role in the bagging process because predictions are aggregated across trees. By default, each tree is fit on a bootstrap sample, drawn with replacement, that is the same size as the training set.
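The two quantities mentioned above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming 30 predictors, 455 training rows, and scikit-learn's default bootstrap behaviour (each tree samples as many rows as the training set, with replacement); the variable names are illustrative.

```python
import math

n_predictors = 30   # number of columns in X
n_train = 455       # rows in the training subset (80% of 569)

# Rule-of-thumb number of features to consider at each split
sqrt_rule = math.sqrt(n_predictors)
print(f"sqrt({n_predictors}) = {sqrt_rule:.2f}")

# Expected share of distinct training rows in one bootstrap sample
# of size n_train drawn with replacement: 1 - (1 - 1/n)^n
unique_share = 1 - (1 - 1 / n_train) ** n_train
print(f"expected share of distinct rows per tree: {unique_share:.3f}")
```

So although every tree draws as many rows as the training set contains, each tree sees only about 63% of the distinct rows on average, which is what makes the trees differ from one another.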

The best parameter combination from the grid search achieved a mean cross-validated accuracy of 95.8%, computed on the held-out folds of the training data.
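To make clear where this score comes from, the sketch below shows that `best_score_` is the mean held-out-fold score of the winning parameter combination, not an accuracy measured on the rows the model was fit on. It assumes a small synthetic dataset from `make_classification` and a reduced grid so it runs quickly; the data and the names `X_demo`, `y_demo`, and `search` are not from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption, not the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=7)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=7),
    param_grid={"max_depth": [2, 5], "max_features": [2, 3]},
    cv=5,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean held-out-fold score of the winning combination
best_mean = search.cv_results_["mean_test_score"][search.best_index_]
print(search.best_params_)
print(search.best_score_, best_mean)
```

`best_params_` returns the winning settings as a dictionary, which is convenient when the same settings need to be passed to a new estimator.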

In [24]:
#Uses n_estimators and max_features from the grid search; note that max_depth=5 is not passed here, so the trees grow to full depth
random_forest = RandomForestClassifier(random_state=7, n_estimators=600, max_features=4)
#fitted random forest using the training data
random_forest.fit(x_train, y_train)
#displays a classification report and confusion matrix based on the training dataset
pred = random_forest.predict(x_train)
print(classification_report(y_train, pred))
cm1 = confusion_matrix(y_train, pred)
print(cm1)

              precision    recall  f1-score   support

           B       1.00      1.00      1.00       283
           M       1.00      1.00      1.00       172

    accuracy                           1.00       455
   macro avg       1.00      1.00      1.00       455
weighted avg       1.00      1.00      1.00       455

[[283   0]
 [  0 172]]


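Using the n_estimators and max_features values recommended by the grid search, the model achieved an accuracy of 100% on the training data. Because max_depth was left at its default in this cell, the trees are fully grown, so a perfect training score is expected and says little about how well the model generalizes.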
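The effect of the depth limit on training accuracy can be illustrated on synthetic data. The sketch below is an assumption-laden example, not a rerun of the notebook: the dataset comes from `make_classification` with some label noise, and `tuned_params` is a hypothetical dictionary standing in for `gridsearch.best_params_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some label noise (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, flip_y=0.1, random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=7)

# Hypothetical tuned parameters, standing in for gridsearch.best_params_
tuned_params = {"max_depth": 2, "max_features": 3, "n_estimators": 50}

# Unpacking the dictionary passes every tuned parameter, including max_depth
shallow = RandomForestClassifier(random_state=7, **tuned_params).fit(X_tr, y_tr)

# Same forest without the depth limit, so trees grow until the leaves are pure
full_params = {k: v for k, v in tuned_params.items() if k != "max_depth"}
deep = RandomForestClassifier(random_state=7, **full_params).fit(X_tr, y_tr)

shallow_train_acc = shallow.score(X_tr, y_tr)
deep_train_acc = deep.score(X_tr, y_tr)
print(f"train accuracy with max_depth=2:    {shallow_train_acc:.3f}")
print(f"train accuracy with no depth limit: {deep_train_acc:.3f}")
```

Unpacking the parameter dictionary (or reusing `best_estimator_` directly) avoids accidentally dropping one of the tuned settings when the final model is rebuilt by hand.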

In [28]:
#displays a confusion matrix and classification report based on the test dataset
pred2 = random_forest.predict(x_test)
cm2 = confusion_matrix(y_test, pred2)
print(cm2)
print(classification_report(y_test, pred2))
class_report = classification_report(y_test, pred2, output_dict=True)
# Displays the macro-averaged recall (balanced accuracy) from the classification report; this is not the overall accuracy
print("macro average (accuracy):",class_report['macro avg']['recall']*100)

[[74  0]
 [ 3 37]]
              precision    recall  f1-score   support

           B       0.96      1.00      0.98        74
           M       1.00      0.93      0.96        40

    accuracy                           0.97       114
   macro avg       0.98      0.96      0.97       114
weighted avg       0.97      0.97      0.97       114

macro average (accuracy): 96.25


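On the test subset, the random forest achieved an overall accuracy of 97% (111 of 114 correct) and a macro-averaged recall of 96.25%. Treating malignant as the positive class, the model had a true positive rate of 93% (37 of 40) and a true negative rate of 100% (74 of 74). This means that the model does a better job of classifying benign tumors than malignant ones: all three errors are false negatives, that is, malignant tumors predicted as benign. I did not adjust the cutoff in this analysis, although in a diagnostic setting a false negative is usually the more costly error, so a lower cutoff for the malignant class may be worth evaluating.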
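The test-set figures can be recomputed by hand from the confusion matrix printed above. This minimal sketch assumes malignant (M) is treated as the positive class and that rows are actual labels and columns are predicted labels in the order [B, M], which is how scikit-learn orders string labels by default.

```python
# Test-set confusion matrix from the output above:
# rows = actual, columns = predicted, label order [B, M]
cm = [[74, 0],
      [3, 37]]

tn, fp = cm[0]   # actual benign: predicted B, predicted M
fn, tp = cm[1]   # actual malignant: predicted B, predicted M

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # recall for M (true positive rate)
specificity = tn / (tn + fp)   # recall for B (true negative rate)
balanced_accuracy = (sensitivity + specificity) / 2   # macro-averaged recall

print(f"accuracy          = {accuracy:.4f}")
print(f"sensitivity (M)   = {sensitivity:.4f}")
print(f"specificity (B)   = {specificity:.4f}")
print(f"balanced accuracy = {balanced_accuracy:.4f}")
```

The balanced accuracy of 0.9625 matches the macro-averaged recall printed by the notebook, while the overall accuracy is slightly higher because the larger benign class was classified perfectly.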

In [29]:
#10 fold cross validation 
k_fold = KFold(n_splits=10, shuffle=True, random_state=7)
# score the 10 fold cross validation using all of the data
results = cross_val_score(random_forest,X, Y, cv=k_fold, scoring='accuracy')
# average all of the cross validation scores. 
print("Cross validation accuracy score:", results.mean()*100)

Cross validation accuracy score: 96.66353383458647


I performed k-fold cross-validation with k = 10 to check how well the model generalizes. Across the 10 folds, the average accuracy was 96.66%, which is close to the test-set results (97% accuracy, 96.25% macro-averaged recall).
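When classes are imbalanced, a common variation is to use stratified folds and to report the spread of the fold scores alongside the mean. The sketch below is an illustration under assumptions, not the notebook's procedure: it uses synthetic, mildly imbalanced data from `make_classification`, and `StratifiedKFold` replaces the plain `KFold` used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced stand-in data (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.63, 0.37], random_state=7
)

model = RandomForestClassifier(n_estimators=50, random_state=7)
# Each fold keeps approximately the same class ratio as the full dataset
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(model, X_demo, y_demo, cv=folds, scoring="accuracy")

print(f"mean accuracy: {scores.mean():.3f}, std: {scores.std():.3f}")
```

Reporting the standard deviation gives a rough sense of how much the accuracy varies from fold to fold, which a single mean hides.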

In conclusion, although there are some signs of overfitting when comparing the training results with the test and cross-validation results, the accuracy is still high, and the model appears to generalize well to unseen data. The class imbalance may have affected classification: on the test subset, the true positive rate was lower than the true negative rate. I left the cutoff unchanged, but because every test error was a false negative, which is typically the riskier error for a cancer diagnosis, lowering the cutoff for the malignant class would be a reasonable next step.
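For reference, this is how a change of cutoff could be evaluated. The sketch assumes synthetic data in which class 1 plays the role of the malignant class, and the helper `recall_at` is a hypothetical name introduced for this example; it is not part of scikit-learn or of the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; class 1 plays the role of "malignant" (an assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.63, 0.37], flip_y=0.05, random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=100, random_state=7).fit(X_tr, y_tr)
# Column 1 of predict_proba is the estimated probability of class 1
proba_pos = model.predict_proba(X_te)[:, 1]

def recall_at(threshold):
    """Recall for class 1 when predicting 1 whenever proba >= threshold."""
    preds = (proba_pos >= threshold).astype(int)
    return recall_score(y_te, preds)

recall_default = recall_at(0.5)
recall_lowered = recall_at(0.3)
print(f"recall at 0.5 cutoff: {recall_default:.3f}")
print(f"recall at 0.3 cutoff: {recall_lowered:.3f}")
```

Lowering the cutoff can only keep or increase recall for the positive class, at the cost of more false positives, so the choice depends on which error is considered more costly.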