#  Cancer Prediction Problem

You need to Predict that the Cancer appears to be Benign or Malignant. The Breast cancer data takes all the variables that drive the potential probability of cancer lump being present.

Variables such as perimeter, radii of the lump and similar variables drive the probability of the lump being cancerous or not.

Herein lets Import the Data and Required Libraries to work.

In [17]:
import numpy as np
import pandas as pd
from sklearn.cross_validation import cross_val_score # Cross Validation Score
from sklearn.model_selection import GridSearchCV # Parameters of the Model
from sklearn.model_selection import RandomizedSearchCV # Tuning the Parameters
from sklearn.tree import DecisionTreeClassifier # Decision Tree Algo
from sklearn.ensemble import RandomForestClassifier # Random Forest Algo.
from sklearn.model_selection import train_test_split # helps in spliting the data in train and test set
from sklearn.metrics import accuracy_score # Calculating the Accuracy Score againts the Classes Predicted vs Actuals.
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier

In [2]:
# Importing the Dataset
cancer = pd.read_csv("~/Downloads/Excel and CSVs/Breast Cancer Classification Data.csv")
cancer.head() # Previewing dataset


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


#### Here we see that Diagnosis the Target Variable and Rest are Predictor Variables. We will divide the data in train and test by identifying our Xs and Ys... See the code below to understand.

In [3]:
# Defined my Xs and Ys
x = cancer.drop("diagnosis", axis = 1) # Dropping the Target Variable
y = cancer["diagnosis"] # Defining the Ys...

# Train & Test Split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.30, random_state = 123)

In [8]:
# Defining Tree Parameters For Grid Based Search:: For More Details Refer to Scikit Learn Documentation.
tree_param = {
    "criterion":["gini", "entropy"],
    "splitter":["best", "random"],
    "max_depth":[3,4,5,6],
    "max_features":["auto","sqrt","log2"],
    "random_state": [123]
}

### Why Grid Search 

Cross-validation is a method for robustly estimating test-set performance (generalization) of a model. Grid-search is a way to select the best of a family of models, parametrized by a grid of parameters.

Grid Search helps in identifying the Optimum value of the Parameters that can perform the best. We do Parameter tuning so that we can extract the most from our ML Algorithm and it is able to perform in the best possible manner. It should be able to perform a Grid Based Search and should return the best parameters that create a good model.


In [9]:
# Applying the Grid Search Algorithm
grid = GridSearchCV(tree, tree_param, cv = 10)

In [10]:
# Printing the Parameters after Grid Based Search
grid

GridSearchCV(cv=10, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
            max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=123,
            splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'criterion': ['gini', 'entropy'], 'splitter': ['best', 'random'], 'max_depth': [3, 4, 5, 6], 'max_features': ['auto', 'sqrt', 'log2'], 'random_state': [123]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [11]:
# Lets Fit into the Data so that It can let us know the correct Parameters 
best_parameter_search = grid.fit(x_train, y_train)

In [12]:
best_parameter_search.best_params_ # Printing the Best Parameters

{'criterion': 'gini',
 'max_depth': 5,
 'max_features': 'auto',
 'random_state': 123,
 'splitter': 'best'}

In [14]:
# Creating our First Model Called Decision Trees after Hyper Tuning
tree = DecisionTreeClassifier(criterion ='gini',
 max_depth=5,
 max_features= 'auto',
 random_state= 123,
 splitter='best')

In [15]:
# Developing the Model 
model_tree = tree.fit(x_train, y_train) # Fitting the Learner on Train Dataset.
pred_TREE = tree.predict(x_test) # Making Predictions
accuracy_score(y_test, pred_TREE) # Calculating Accuracy

0.9473684210526315

### Decision Model returns an Accuracy of 0.9473684210526315

In [20]:
# Lets Import ADABOOST & BAGGING Classifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
bagg = BaggingClassifier()

In [26]:
# Lets Fit in the Bagging Classifier and See what do we get
baggedmodel = bagg.fit(x_train, y_train)

In [27]:
# Making Predictions...
pred_bagged = bagg.predict(x_test)

In [28]:
# Calculating Accuracy
accuracy_score(y_test, pred_bagged)

0.9707602339181286

### Bagging Model returns an Accuracy of 0.9707602339181286

In [34]:
# Finding the Parameters of Adaboost
rf = RandomForestClassifier()
ada = AdaBoostClassifier()
boostparam={
    "base_estimator":["rf"],
    "n_estimators":[50,75,100],
    "learning_rate":[1,2,3,4,5],
    "algorithm":["SAMME", "SAMME.R"],
    "random_state" : [123]
}

In [37]:
grid_boost = GridSearchCV(ada,boostparam, cv = 5)

In [None]:
grid_boost.fit(x_train,y_train).best_params_

In [39]:
# Lets Apply Boosting Here

ada = AdaBoostClassifier(algorithm= 'SAMME',
 learning_rate= 1,
 n_estimators= 75,
 random_state= 123)

In [40]:
ada_model = ada.fit(x_train, y_train)
pred_ada = ada.predict(x_test)
accuracy_score(y_test, pred_ada)

0.9824561403508771

In [42]:
import xgboost as xgb
from xgboost import XGBClassifier
xg = XGBClassifier()

In [43]:
# Apply XGBOOST at Base Level
model_xgb = xg.fit(x_train, y_train)

In [44]:
pred_xgb = xg.predict(x_test)

  if diff:


In [45]:
accuracy_score(y_test,pred_xgb)

0.9766081871345029

# Conclusion : Random Forest with Adaboost after tuning the parameters gives the best accuracy here 0.9824

In [33]:
tree = DecisionTreeClassifier()
cross_val_score(tree, x, y, cv=10).mean()

0.9070964048051163

In [34]:
RF = RandomForestClassifier()
cross_val_score(RF, x, y, cv=10).mean()

0.9490547489413188

In [35]:
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression()
cross_val_score(lg, x, y, cv = 10).mean()

0.558916688272405

In [36]:
ada = AdaBoostClassifier() # Bias but also variance too...
cross_val_score(ada, x,y, cv = 10).mean()
# What if I make an ensemble of RF and ADABOOST.

0.9701700371618701

In [37]:
gbm = GradientBoostingClassifier()
cross_val_score(gbm, x, y, cv = 10).mean()

0.9597668740817561

In [38]:
from xgboost import XGBClassifier
xg = XGBClassifier()
cross_val_score(xg, x, y, cv = 10).mean()

  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:


0.973679889378619

In [39]:
from sklearn.ensemble import VotingClassifier
vc = VotingClassifier(estimators=[("ADA", ada),("GBM", gbm), ("XGB", xg)])

In [40]:
cross_val_score(vc, x, y, cv=5).mean()

  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:


0.9614928818776451

In [41]:
# Accuracy doesnt matter if its a stand alone model.
# Here we are using CV and hence we can count on this value
# Otherwise, you need to keep into Kappa Statistics of Every Model
# that you generate. 
# Confusion Matrix()