<a href="https://colab.research.google.com/github/andreacohen7/healthcare/blob/main/Breast_Cancer_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Breast Cancer Classification
- Andrea Cohen
- 01.02.23

## Task:
  - To classify the diagnosis as either malignant or benign

## Data Source:
  - https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

### Preliminary Steps

#### Mount the drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### Import libraries

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

#### Import data

In [3]:
path = '/content/cancer.csv'
df = pd.read_csv(path)
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


### Explore the data

In [4]:
display(df.info())
display(df.describe())
display(df['id'].nunique())
display(df['diagnosis'].value_counts())
print(f'There are {df.shape[0]} rows, and {df.shape[1]} columns.')
print(f'There are {df.duplicated().sum()} duplicate rows.')
print(f'There are {df.isna().sum().sum()} missing values.')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

None

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


569

B    357
M    212
Name: diagnosis, dtype: int64

There are 569 rows, and 32 columns.
There are 0 duplicate rows.
There are 0 missing values.


  - The column 'id' has as many unique values (569) as there are rows in the dataframe, so it identifies each observation rather than describing a property of it. The column will be dropped.
  - There are no duplicated rows.
  - The target column 'diagnosis' has datatype object (the string labels 'B' and 'M'); all remaining feature columns have datatype float64.
  - There are no missing values.

In [5]:
df = df.drop(columns='id')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    object 
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  5

### Find the class names and determine how balanced the classes are

In [6]:
df['diagnosis'].value_counts(normalize = True)

B    0.627417
M    0.372583
Name: diagnosis, dtype: float64
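
The class proportions above also give a reference point for the models that follow: a classifier that always predicts the majority class would already be right about 63% of the time. A minimal sketch of that baseline, using only the counts reported by `value_counts()` (the variable names are illustrative):

```python
from collections import Counter

# Class counts as reported by value_counts() in the exploration step.
class_counts = Counter({"B": 357, "M": 212})

total = sum(class_counts.values())
majority_label, majority_count = class_counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = majority_count / total
print(f"Always predicting {majority_label!r} scores {baseline_accuracy:.6f}")
```

Any model worth keeping should score clearly above this baseline on the test set.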

### Convert the string names of the classes to numeric values

In [7]:
df['diagnosis'] = df['diagnosis'].replace({'B':0, 'M':1})

### Arrange data into a features matrix and a target vector

In [8]:
y = df['diagnosis']
X = df.drop(columns='diagnosis')

### Train test split (model validation)

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
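
Because no `test_size` is passed, `train_test_split` falls back to its default of 25% for the test set. The sketch below works out the resulting set sizes by hand; it assumes scikit-learn's behavior of rounding the test count up and giving the remaining rows to the training set.

```python
import math

n_samples = 569    # rows in the dataset
test_size = 0.25   # scikit-learn's default when neither size is given

# For a float test_size, the test count is rounded up and the
# remaining rows go to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

With only 143 test rows, each misclassified test sample moves the test score by about 0.007.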

### Decision Tree Classifier

In [10]:
dec_tree = DecisionTreeClassifier(random_state = 42)
dec_tree.fit(X_train, y_train)
train_preds = dec_tree.predict(X_train)
test_preds = dec_tree.predict(X_test)
train_score = dec_tree.score(X_train, y_train)
test_score = dec_tree.score(X_test, y_test)
print(train_score)
print(test_score)

1.0
0.951048951048951


  - The default decision tree scored higher on the training data (accuracy 1.0) than on the test data (accuracy 0.951), which suggests the model is overfitting. For a classifier, `.score` reports mean accuracy, not R^2.
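
For scikit-learn classifiers, the number returned by `.score` is the fraction of samples classified correctly. The sketch below computes the same quantity by hand on a small set of made-up labels (hypothetical values, not taken from the dataset) to show what the metric measures.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels (hypothetical): 1 = malignant, 0 = benign.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

# Two of the eight predictions are wrong, so six are correct.
toy_accuracy = accuracy(y_true, y_pred)
print(toy_accuracy)
```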

In [11]:
dec_tree.get_depth()

7

  - The default tree had a depth of 7.

In [12]:
depths = list(range(2, 7))
scores = pd.DataFrame(index=depths, columns = ['Test Score', 'Train Score'])
for depth in depths:
  dec_tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
  dec_tree.fit(X_train, y_train)
  train_score = dec_tree.score(X_train, y_train)
  test_score = dec_tree.score(X_test, y_test)
  scores.loc[depth, 'Train Score'] = train_score
  scores.loc[depth, 'Test Score'] = test_score
scores.head()

| max_depth | Test Score | Train Score |
|---|---|---|
| 2 | 0.916084 | 0.946009 |
| 3 | 0.958042 | 0.971831 |
| 4 | 0.951049 | 0.995305 |
| 5 | 0.958042 | 0.995305 |
| 6 | 0.951049 | 0.997653 |


In [13]:
sorted_scores = scores.sort_values(by='Test Score', ascending=False)
sorted_scores.head()

| max_depth | Test Score | Train Score |
|---|---|---|
| 3 | 0.958042 | 0.971831 |
| 5 | 0.958042 | 0.995305 |
| 4 | 0.951049 | 0.995305 |
| 6 | 0.951049 | 0.997653 |
| 2 | 0.916084 | 0.946009 |


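  - Depths 3 and 5 tie for the best test accuracy (0.958). A max_depth of 3 is chosen because it is the shallower tree and has the smaller gap between training and test accuracy.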
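
Two depths share the top test score in the table above, so the choice depends on how ties are broken. One reasonable rule, sketched here with the scores copied from that table, is to prefer the shallowest tree among the tied candidates (the variable names are illustrative):

```python
# (test accuracy, train accuracy) per max_depth, copied from the table above.
scores_by_depth = {
    2: (0.916084, 0.946009),
    3: (0.958042, 0.971831),
    4: (0.951049, 0.995305),
    5: (0.958042, 0.995305),
    6: (0.951049, 0.997653),
}

best_test = max(test for test, _ in scores_by_depth.values())

# Collect every depth that reaches the best test accuracy, shallowest first.
tied_depths = sorted(
    depth for depth, (test, _) in scores_by_depth.items() if test == best_test
)
chosen_depth = tied_depths[0]
print(tied_depths, chosen_depth)
```

Preferring the simpler model when scores tie is a convention rather than a guarantee; with a test set this small, the tied depths are indistinguishable on this split.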

In [14]:
dec_tree_3 = DecisionTreeClassifier(max_depth = 3, random_state = 42)
dec_tree_3.fit(X_train, y_train)
train_3_score = dec_tree_3.score(X_train, y_train)
test_3_score = dec_tree_3.score(X_test, y_test)
print(train_3_score)
print(test_3_score)

0.971830985915493
0.958041958041958


  - The accuracy of the final model is 0.97 on the training set and 0.96 on the test set.
  - The training and test scores have moved closer to each other (a sign that overfitting was reduced). Most importantly, the test accuracy is higher (0.958 versus 0.951 for the default tree).
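
The reduction in overfitting can be put into numbers by comparing the gap between training and test scores before and after limiting the depth. A small sketch using the scores printed above (the dictionary keys are illustrative labels):

```python
# (train score, test score) as printed for each decision tree.
results = {
    "default (depth 7)": (1.0, 0.951048951048951),
    "max_depth=3": (0.971830985915493, 0.958041958041958),
}

# A smaller train-test gap suggests less overfitting.
gaps = {name: train - test for name, (train, test) in results.items()}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.3f}")
```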

### Bagging Classifier

In [15]:
bagclass = BaggingClassifier(random_state=42)
bagclass.fit(X_train, y_train)
bagclass.predict(X_test)
bagclass_train_score = bagclass.score(X_train, y_train)
bagclass_test_score = bagclass.score(X_test, y_test)
print(bagclass_train_score)
print(bagclass_test_score)

0.9929577464788732
0.951048951048951


  - The default bagging classifier scored higher on the training data (accuracy 0.993) than on the test data (accuracy 0.951), which suggests the model is overfitting.

In [16]:
bagclass.get_params()

{'base_estimator': None,
 'bootstrap': True,
 'bootstrap_features': False,
 'max_features': 1.0,
 'max_samples': 1.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [17]:
estimators = [10, 20, 30, 40, 50, 100]
scores2 = pd.DataFrame(index=estimators, columns = ['Test Score', 'Train Score'])
for num_estimators in estimators:
  bag_class = BaggingClassifier(n_estimators=num_estimators, random_state=42)
  bag_class.fit(X_train, y_train)
  train_score = bag_class.score(X_train, y_train)
  test_score = bag_class.score(X_test, y_test)
  scores2.loc[num_estimators, 'Train Score'] = train_score
  scores2.loc[num_estimators, 'Test Score'] = test_score
scores2.head()

| n_estimators | Test Score | Train Score |
|---|---|---|
| 10 | 0.951049 | 0.992958 |
| 20 | 0.958042 | 0.997653 |
| 30 | 0.958042 | 1.0 |
| 40 | 0.958042 | 0.997653 |
| 50 | 0.958042 | 1.0 |


In [18]:
scores2 = scores2.sort_values(by='Test Score', ascending = False)
scores2

| n_estimators | Test Score | Train Score |
|---|---|---|
| 20 | 0.958042 | 0.997653 |
| 30 | 0.958042 | 1.0 |
| 40 | 0.958042 | 0.997653 |
| 50 | 0.958042 | 1.0 |
| 100 | 0.958042 | 1.0 |
| 10 | 0.951049 | 0.992958 |


  - Every setting from 20 to 100 estimators gives the same test accuracy (0.958); 20 is chosen as the smallest of the tied values.

In [19]:
best_n_estimators = scores2.index[0]
bag_class_tuned = BaggingClassifier(n_estimators = best_n_estimators, random_state=42)
bag_class_tuned.fit(X_train, y_train)
print(bag_class_tuned.score(X_train, y_train))
print(bag_class_tuned.score(X_test, y_test))

0.9976525821596244
0.958041958041958


  - The accuracy of the final model is about 0.998 on the training set and 0.96 on the test set.
  - The gap between the training and test scores narrowed slightly (from 0.042 to 0.040), and the test accuracy is higher (0.958 versus 0.951 for the default bagging classifier).

### Random Forest Classifier

In [20]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
rf.predict(X_test)
rf_train_score = rf.score(X_train, y_train)
rf_test_score = rf.score(X_test, y_test)
print(rf_train_score)
print(rf_test_score)

1.0
0.965034965034965


  - The default random forest scored higher on the training data (accuracy 1.0) than on the test data (accuracy 0.965), which suggests the model is overfitting.

In [21]:
rf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [22]:
est_depths = [estimator.get_depth() for estimator in rf.estimators_]
max(est_depths)

11

In [23]:
depths = range(1, max(est_depths))
scores3 = pd.DataFrame(index=depths, columns=['Test Score', 'Train Score'])
for depth in depths:
  model = RandomForestClassifier(max_depth = depth, random_state=42)
  model.fit(X_train, y_train)
  scores3.loc[depth, 'Train Score'] = model.score(X_train, y_train)
  scores3.loc[depth, 'Test Score'] = model.score(X_test, y_test)
scores3.head()

| max_depth | Test Score | Train Score |
|---|---|---|
| 1 | 0.965035 | 0.920188 |
| 2 | 0.965035 | 0.955399 |
| 3 | 0.965035 | 0.981221 |
| 4 | 0.965035 | 0.99061 |
| 5 | 0.965035 | 0.992958 |


In [24]:
sorted_scores3 = scores3.sort_values(by='Test Score', ascending=False)
sorted_scores3.head()

| max_depth | Test Score | Train Score |
|---|---|---|
| 1 | 0.965035 | 0.920188 |
| 2 | 0.965035 | 0.955399 |
| 3 | 0.965035 | 0.981221 |
| 4 | 0.965035 | 0.99061 |
| 5 | 0.965035 | 0.992958 |


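  - At least five depths tie for the best test accuracy (0.965), which is also the test accuracy of the default forest. A max_depth of 1 is chosen as the shallowest of the tied values.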

In [25]:
n_ests = [50, 100, 150, 200, 250]
scores4 = pd.DataFrame(index=n_ests, columns=['Test Score', 'Train Score'])
for n in n_ests:
  model = RandomForestClassifier(max_depth=1, n_estimators=n, random_state=42)
  model.fit(X_train, y_train)
  scores4.loc[n, 'Train Score'] = model.score(X_train, y_train)
  scores4.loc[n, 'Test Score'] = model.score(X_test, y_test)
scores4.head()

| n_estimators | Test Score | Train Score |
|---|---|---|
| 50 | 0.951049 | 0.93662 |
| 100 | 0.965035 | 0.920188 |
| 150 | 0.965035 | 0.934272 |
| 200 | 0.965035 | 0.934272 |
| 250 | 0.965035 | 0.931925 |


In [26]:
sorted_scores4 = scores4.sort_values(by='Test Score', ascending = False)
sorted_scores4.head()

| n_estimators | Test Score | Train Score |
|---|---|---|
| 100 | 0.965035 | 0.920188 |
| 150 | 0.965035 | 0.934272 |
| 200 | 0.965035 | 0.934272 |
| 250 | 0.965035 | 0.931925 |
| 50 | 0.951049 | 0.93662 |


  - Every setting from 100 to 250 estimators gives the same test accuracy (0.965); 100 is chosen as the smallest of the tied values.

In [27]:
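best_n_estimators2 = sorted_scores4.index[0]
# Use RandomForestClassifier (not BaggingClassifier) with the tuned max_depth and n_estimators
rf_tuned = RandomForestClassifier(max_depth=1, n_estimators=best_n_estimators2, random_state=42)
rf_tuned.fit(X_train, y_train)
print(f'{rf_tuned.score(X_train, y_train):.6f}')
print(f'{rf_tuned.score(X_test, y_test):.6f}')

0.920188
0.965035


  - The accuracy of the tuned model is 0.92 on the training set and 0.97 on the test set, matching the row for 100 estimators in the table above.
  - Compared with the default random forest, the training accuracy dropped from 1.0 to 0.92 and the test accuracy is unchanged at 0.965.
  - The tuned model matches the default model on the test set while fitting the training data far less closely.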

In [28]:
print(f'Decision Tree Classifier Test Accuracy: {test_3_score}')
print(f'Bagging Classifier Test Accuracy: {bag_class_tuned.score(X_test, y_test)}')
print(f'Random Forest Classifier Test Accuracy: {(rf_test_score)}')

Decision Tree Classifier Test Accuracy: 0.958041958041958
Bagging Classifier Test Accuracy: 0.958041958041958
Random Forest Classifier Test Accuracy: 0.965034965034965


  - Good performance on the testing data is the most important consideration for choosing a model, so the three models are compared on their test accuracy.
  - The Random Forest Classifier had the highest test accuracy (0.965, versus 0.958 for the other two models).
  - The Random Forest Classifier was the best model, based on the classification accuracy for the testing data.
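
To put the size of these differences in perspective, the test scores can be converted back into counts of correctly classified samples. The sketch below assumes a test set of 143 rows, which is what the default 25% split of 569 rows would produce; the dictionary keys are illustrative labels.

```python
n_test = 143  # assumption: default 25% split of 569 rows, rounded up

# Test scores as printed in the final comparison.
test_scores = {
    "Decision Tree": 0.958041958041958,
    "Bagging": 0.958041958041958,
    "Random Forest": 0.965034965034965,
}

# Convert each score into a count of correctly classified test samples.
correct = {name: round(score * n_test) for name, score in test_scores.items()}
for name, count in correct.items():
    print(f"{name}: {count} of {n_test} correct")
```

Under that assumption the random forest's lead amounts to a single test sample, so the ranking may not hold on a different split.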