# PCA, Param Tuning - 98% Accuracy with SVM and Logistic Regression

**Breast Cancer Data Analysis**


![](https://www.uicc.org/sites/main/files/styles/uicc_news_main_image/public/thumbnails/image/BCAM2016_FA.jpg?itok=zimiEGKS)

In this tutorial, we use the data to predict whether a tumor is benign or malignant. We will use Python libraries such as NumPy, Pandas, and Plotly, and apply classification techniques to predict the target value (1 or 0) for each row of the dataset.

**Source** : https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Let's start off by importing the required libraries into our code.

In [1]:
import numpy as np
import pandas as pd 
import plotly.express as px
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
%matplotlib inline

In [2]:
plt.rcParams['font.size'] = 12
plt.rcParams['figure.figsize'] = (15, 10)

`sklearn` ships this dataset, so we will use it to load the data and then put it into a dataframe with the help of the `Pandas` library.

In [3]:
from sklearn.datasets import load_breast_cancer

In [4]:
data = load_breast_cancer()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df['target'] = data.target

In [5]:
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [6]:
df.describe()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.627417
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,1.0
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0


## Exploratory Data Analysis aka EDA

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

In [7]:
pd.DataFrame(df.corr().unstack().sort_values().drop_duplicates())

Unnamed: 0,Unnamed: 1,0
target,worst concave points,-0.793566
worst perimeter,target,-0.782914
target,mean concave points,-0.776614
worst radius,target,-0.776454
mean perimeter,target,-0.742636
...,...,...
mean perimeter,mean area,0.986507
mean radius,mean area,0.987357
worst perimeter,worst radius,0.993708
mean perimeter,mean radius,0.997855


The table above lists the pairwise correlations between the columns (the features and `target`), sorted from the most negative to the most positive.
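
To focus on how each feature relates to the diagnosis, one option is to pull out only the `target` column of the correlation matrix. The snippet below is a minimal, self-contained sketch: it reloads the dataset so it can run on its own, and the name `corr_with_target` is only illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Rebuild the dataframe so the snippet runs on its own.
data = load_breast_cancer()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df["target"] = data.target

# Correlation of every feature with the target, most negative first.
corr_with_target = df.corr()["target"].drop("target").sort_values()
print(corr_with_target.head())
```

Features at the top of this list have the strongest negative linear relationship with `target`.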

In [8]:
fig = px.histogram(df, 
                   x='mean area', 
                   marginal='box', 
                   color_discrete_sequence=['red'], 
                   title='Distribution of Mean Area of Cancer')
fig.update_layout(bargap=0.1)
fig.show()

In [9]:
fig = px.scatter(df,
                 x='mean texture',
                 color_discrete_sequence=['blue'],
                 title='Spread of Mean Texture of Cancer')
# bargap only affects bar/histogram traces, so it is not set for a scatter plot
fig.show()

In [10]:
for template in ["none"]:
    fig = px.scatter(df,
                     x="mean compactness", 
                     color="mean compactness",
                     log_x=True, 
                     template=template, 
                     title="Compactness Mean of the Cancer")
    fig.update_xaxes(showgrid=False)
    fig.update_yaxes(showgrid=False)
    fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)','paper_bgcolor': 'rgba(0, 0, 0, 0)'})
    fig.show()

In [11]:
fig = px.histogram(df, 
                   x='target', 
                   color_discrete_sequence=['blue'],
                   title='Diagnosis Count')
fig.update_layout(bargap=0.3)
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)','paper_bgcolor': 'rgba(0, 0, 0, 0)'})
fig.show()

From our analysis above, we saw that there are 357 benign cases and 212 malignant cases (in scikit-learn's copy of the dataset, `target` 1 is benign and 0 is malignant). Mean compactness is, on average, higher in the malignant cases than in the benign cases.
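
The class counts and the compactness comparison can be checked directly with a `groupby`. This is a small sketch that reloads the dataset to stay self-contained; the `diagnosis` column and the variable names are illustrative additions, not part of the cells above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df["target"] = data.target

# Map the integer codes to the label names shipped with the dataset
# (in scikit-learn's copy, 0 is 'malignant' and 1 is 'benign').
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
df["diagnosis"] = df["target"].map(label_names)

class_counts = df["diagnosis"].value_counts()
mean_compactness = df.groupby("diagnosis")["mean compactness"].mean()
print(class_counts)
print(mean_compactness)
```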

## Data Pre-processing
Now, let's start processing our data and make sure it is in line with what the machine learning libraries expect. In particular, we want to make sure there is no categorical data, since most machine learning algorithms (including the scikit-learn estimators used here) work on numeric input only.

Fortunately, our dataset has no categorical columns, so we simply slice the list of columns: everything except the last column goes into `input_cols`, and the last column becomes `target_col`.

In [12]:
input_cols = df.columns[:-1]
input_cols

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='object')

In [13]:
target_col =  df.columns[-1]
target_col

'target'

In [14]:
inputs_df = df[list(input_cols)].copy()
inputs_df

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [15]:
targets = df[target_col]
targets

0      0
1      0
2      0
3      0
4      0
      ..
564    0
565    0
566    0
567    0
568    1
Name: target, Length: 569, dtype: int64

### Data Scaling

We scale the data because many machine learning algorithms train faster and more reliably when the features are on similar ranges. When feature values differ by orders of magnitude (here, `mean area` reaches the thousands while `mean smoothness` stays below 1), distance-based and gradient-based models can be dominated by the large-valued features, which tends to slow training and lower accuracy.

In [16]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df[input_cols])
inputs_df[input_cols] = scaler.transform(inputs_df[input_cols])
inputs_df[input_cols].describe().loc[['min', 'max']]

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
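
`MinMaxScaler` rescales each column with `(x - min) / (max - min)`, which is why every column now runs from 0 to 1. A tiny sketch on made-up numbers (the array `x` is hypothetical, not taken from the dataset) shows the manual formula agreeing with the scaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column data, used only to illustrate the formula.
x = np.array([[1.0], [3.0], [5.0]])

# Manual min-max scaling, column by column.
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(x)

print(manual.ravel())
print(scaled.ravel())
```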


## Principal Component Analysis (PCA)
Principal Component Analysis is a way to reduce the number of variables while maintaining the majority of the important information. It transforms a number of variables that may be correlated into a smaller number of uncorrelated variables, known as principal components.

The main objective of PCA is to condense your model features into fewer components, which helps you visualize patterns in your data and helps your model run faster. PCA can also reduce the risk of overfitting, because it combines highly correlated features into a smaller set of uncorrelated components instead of feeding redundant inputs to the model.
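
The "uncorrelated" property can be checked numerically. The following sketch, which assumes the same standardize-then-PCA recipe used in this notebook and uses illustrative variable names, computes the correlation matrix of the component scores and looks at its largest off-diagonal entry:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Standardize the 30 features, then project onto 5 principal components.
X = scale(load_breast_cancer().data)
pc_scores = PCA(n_components=5).fit_transform(X)

# Correlation between the component scores (columns are variables).
corr = np.corrcoef(pc_scores, rowvar=False)
max_off_diag = np.abs(corr - np.eye(5)).max()
print("Largest off-diagonal correlation:", max_off_diag)
```

The off-diagonal values are zero up to floating-point error, even though many of the original features are strongly correlated with each other.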

In [17]:
from sklearn.preprocessing import scale
from sklearn import decomposition
X = scale(inputs_df)
pca = decomposition.PCA(n_components=5)
pca.fit(X)

PCA(n_components=5)

In [18]:
scores = pca.transform(X)
scores_df = pd.DataFrame(scores, columns=['PC1', 'PC2','PC3', 'PC4', 'PC5'])
target = pd.Series(targets, name='target')
result_df = pd.concat([scores_df, target], axis=1)
result_df.head()

Unnamed: 0,PC1,PC2,PC3,PC4,PC5,target
0,9.192837,1.948583,-1.123166,3.633731,-1.19511,0
1,2.387802,-3.768172,-0.529293,1.118264,0.621775,0
2,5.733896,-1.075174,-0.551748,0.912083,-0.177086,0
3,7.122953,10.275589,-3.23279,0.152547,-2.960878,0
4,3.935302,-1.948072,1.389767,2.940639,0.546747,0


In [19]:
scores.shape

(569, 5)

In [20]:
fig = px.scatter(result_df,
                 x='PC1',
                 y='PC2',
                 color='target',
                 title='PC Picturization')
# bargap only affects bar/histogram traces, so it is not set for a scatter plot
fig.show()

### Explained Variance Ratio

The explained variance ratio is the fraction of the total variance attributed to each of the selected components. A common rule of thumb is to keep adding components, summing their explained variance ratios, until the total reaches roughly 0.8 (80%), which keeps most of the information while limiting the risk of overfitting.

In [21]:
print('Variance of each component:', pca.explained_variance_ratio_)
print('\n Total Variance Explained:', round(sum(list(pca.explained_variance_ratio_))*100, 2))

Variance of each component: [0.44272026 0.18971182 0.09393163 0.06602135 0.05495768]

 Total Variance Explained: 84.73


We can see that our first 5 principal components explain the majority of the variance in this dataset (84.73%)! This is an indication of the total information represented compared to the original data.
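
The 80% rule of thumb can be applied programmatically by fitting PCA with all components and taking a cumulative sum of the ratios. This is a minimal sketch under the same standardization used above; `pca_full`, `cumulative`, and `n_for_80` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

X = scale(load_breast_cancer().data)

# Fit PCA with every component kept so the full variance profile is available.
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 0.80.
n_for_80 = int(np.searchsorted(cumulative, 0.80) + 1)
print(np.round(cumulative[:6], 4))
print("Components needed for 80%:", n_for_80)
```

On this dataset the threshold is first crossed at five components, which matches the 84.73% total reported above.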

## Splitting Data
Now that preprocessing is done, we can start training our models. Let's split the data into two parts: training data and validation data. The training data is used to fit each model, and the validation data is used to score it.

We use a test size of 0.25, so 25% of the rows are held out for validation. If we trained on the entire dataset, we would have no unseen data left to check how the model behaves when a new set of data is thrown at it.

In [22]:
from sklearn.model_selection import train_test_split
train_inputs, val_inputs, train_targets, val_targets = train_test_split(scores, 
                                                                        targets, 
                                                                        test_size=0.25, 
                                                                        random_state=42)

In [23]:
train_inputs.shape, train_targets.shape, val_inputs.shape, val_targets.shape

((426, 5), (426,), (143, 5), (143,))
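
One caveat: in the cells above, the scaler and PCA were fitted on all 569 rows before the split, so the validation rows influenced the preprocessing. A hedged alternative, which is not what this notebook does, is to split first and wrap the steps in a scikit-learn pipeline so they are fitted on the training rows only. The sketch below swaps in `StandardScaler` for the scaling step, and names such as `pipe` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Split the raw features first, using the same settings as the notebook.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scaling and PCA are learned from the training rows only, then applied to both splits.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(solver="liblinear"),
)
pipe.fit(X_train, y_train)
val_accuracy = pipe.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.3f}")
```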

## Training Models to find the best one

In [24]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [25]:
names = ['Logistic Regression', "Nearest_Neighbors", "Linear_SVM","Gradient_Boosting", "Decision_Tree", "Random_Forest"]
classifiers = [
    LogisticRegression(solver='liblinear'),
    KNeighborsClassifier(n_neighbors=3),
    SVC(kernel="linear", C=0.025),
    GradientBoostingClassifier(n_estimators=100),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=100)]

In [26]:
# Use a separate name so the PCA `scores` array from cell 18 is not overwritten
val_scores = []
for name, clf in zip(names, classifiers):
    clf.fit(train_inputs, train_targets)
    score = clf.score(val_inputs, val_targets)
    val_scores.append(score)

In [27]:
scores_df = pd.DataFrame()
scores_df['name'] = names
scores_df['score'] = val_scores
scores_df.sort_values('score', ascending= False)

Unnamed: 0,name,score
0,Logistic Regression,0.986014
2,Linear_SVM,0.979021
3,Gradient_Boosting,0.972028
1,Nearest_Neighbors,0.965035
4,Decision_Tree,0.965035
5,Random_Forest,0.965035


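We will leave out Gradient Boosting and Decision Tree and tune the remaining four models. In the table above all six models are within about two percentage points of each other on 143 validation rows, and the tree-based scores shift slightly from run to run because no `random_state` is set. Let's go ahead and tune some hyperparameters.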
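
When scores are this close, a single split can rank models differently from one run to the next. A cross-validated comparison is one way to get a steadier picture. The sketch below is an assumption-laden illustration rather than part of the original workflow: it reuses the notebook's standardize-then-PCA preprocessing on the full dataset, compares only three of the models, and fixes `random_state` on the tree.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import scale
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X = PCA(n_components=5).fit_transform(scale(data.data))
y = data.target

models = {
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "Linear_SVM": SVC(kernel="linear", C=0.025),
    "Decision_Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

# Mean accuracy over 5 folds for each model.
cv_means = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, mean_acc in sorted(cv_means.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {mean_acc:.3f}")
```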

## Hyperparameter Tuning for all the models

When creating a machine learning model, you'll be presented with design choices as to how to define your model architecture. Often, we don't immediately know what the optimal model architecture should be for a given model, and thus we'd like to be able to explore a range of possibilities.

In true machine learning fashion, we'll ideally ask the machine to perform this exploration and select the optimal model architecture automatically. Parameters which define the model architecture are referred to as hyperparameters, and thus this process of searching for the ideal model architecture is referred to as hyperparameter tuning, which is what we do below using `GridSearchCV` and `RandomizedSearchCV`.

In [28]:
from sklearn.model_selection import GridSearchCV
C_range = np.arange(1,11,1)
penalty_range= ['l2','l1']
max_iter_range = np.arange(100,210,10)
param_grid = dict(C=C_range, penalty=penalty_range, max_iter= max_iter_range)
model = LogisticRegression(solver='liblinear',)
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid.fit(train_inputs, train_targets)
print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))

The best parameters are {'C': 2, 'max_iter': 100, 'penalty': 'l2'} with a score of 0.97


In [29]:
neighbors_range = np.arange(1,7,1)
leaf_size_range = np.arange(10,40,10)
param_grid = dict(n_neighbors=neighbors_range, leaf_size=leaf_size_range)
model = KNeighborsClassifier(n_jobs=-1)
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid.fit(train_inputs, train_targets)
print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))

The best parameters are {'leaf_size': 10, 'n_neighbors': 5} with a score of 0.95


In [30]:
Kernel_range = ['linear','rbf']
C_range = np.arange(1,6,1)
param_grid = dict(kernel=Kernel_range, C= C_range)
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid.fit(train_inputs, train_targets)
print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))

The best parameters are {'C': 4, 'kernel': 'linear'} with a score of 0.96


In [31]:
max_depth_range = np.arange(1,8,1)
max_features_range= np.arange(1,6,1)
max_leaf_nodes_range = np.arange(2,100,10)
from sklearn.model_selection import RandomizedSearchCV
distributions = dict(max_depth=max_depth_range, max_features=max_features_range, max_leaf_nodes=max_leaf_nodes_range)
model = RandomForestClassifier(n_jobs=-1, random_state=42)
clf = RandomizedSearchCV(model, distributions, random_state=42)
clf.fit(train_inputs, train_targets)
print("The best parameters are %s with a score of %0.2f"
      % (clf.best_params_, clf.best_score_))

The best parameters are {'max_leaf_nodes': 42, 'max_features': 1, 'max_depth': 7} with a score of 0.95


Now that we are done tuning all our models, let's plug the tuned values in and see which model performs best on this dataset. Note that the Random Forest settings in the next cell differ from the search output printed above.

In [32]:
names = ['Logistic Regression', "Nearest_Neighbors", "Linear_SVM", "Random_Forest"]
classifiers = [
    LogisticRegression(C=2,max_iter=100, penalty='l2',solver='liblinear'),
    KNeighborsClassifier(leaf_size=10, n_neighbors=5),
    SVC(kernel="linear", C=4),
    RandomForestClassifier(max_leaf_nodes=82,max_features=4, max_depth=5)]

In [33]:
# Use a separate name so the PCA `scores` array from cell 18 is not overwritten
val_scores = []
for name, clf in zip(names, classifiers):
    clf.fit(train_inputs, train_targets)
    score = clf.score(val_inputs, val_targets)
    val_scores.append(score)

In [34]:
scores_df = pd.DataFrame()
scores_df['name'] = names
scores_df['score'] = val_scores
scores_df.sort_values('score', ascending= False)

Unnamed: 0,name,score
0,Logistic Regression,0.986014
2,Linear_SVM,0.986014
3,Random_Forest,0.972028
1,Nearest_Neighbors,0.958042
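
Copying tuned values by hand is easy to get wrong. A possible alternative is to reuse the fitted search object, since `GridSearchCV` refits the best configuration on the whole training set by default and exposes it as `best_estimator_`. The sketch below rebuilds the notebook's inputs so it can run on its own; the variable names are illustrative, and the exact parameters it selects may differ from the output shown earlier.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import scale
from sklearn.svm import SVC

data = load_breast_cancer()
X = PCA(n_components=5).fit_transform(scale(data.data))
y = data.target
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Same search space as the SVC cell above.
param_grid = {"kernel": ["linear", "rbf"], "C": np.arange(1, 6, 1)}
grid = GridSearchCV(estimator=SVC(), param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

# best_estimator_ is already refitted on the full training split (refit=True).
best_svc = grid.best_estimator_
val_accuracy = best_svc.score(X_val, y_val)
print(grid.best_params_, f"validation accuracy: {val_accuracy:.3f}")
```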


We can see that **Logistic Regression** and **SVM** tie for the best validation accuracy (98.6%).

**SUMMARY OF THE NOTEBOOK:-**

1. The dataset contains 357 benign cases and 212 malignant cases. Mean compactness is higher in the malignant cases than in the benign cases.
2. Depending on the size of the data and the available computational power, use GridSearch (exhaustive, slower) or RandomizedSearch (samples a fixed number of combinations, faster) for hyperparameter tuning.
3. PCA is a great way to shift from high dimensionality to low dimensionality. If we have more features than observations, then we run the risk of massively overfitting our model, which would generally result in terrible out-of-sample performance.
4. Relying on complex algorithms should not always be the way out. Sometimes even a simpler algorithm can work wonders.

**Resources**

- https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47
- https://scikit-learn.org/stable/modules/svm.html
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- https://www.youtube.com/c/DataProfessor

I HOPE THIS HELPED YOU IN SOME WAY. THANKS FOR TAKING OUT THE TIME TO GO THROUGH THE NOTEBOOK!!