# Breast Cancer Data Analysis

![](https://www.uicc.org/sites/main/files/styles/uicc_news_main_image/public/thumbnails/image/BCAM2016_FA.jpg?itok=zimiEGKS)

In this tutorial, we use tumor measurements to predict whether a breast tumor is benign or malignant. We will use Python libraries such as `NumPy`, `pandas` and `Plotly`, and we will train classification models (starting with logistic regression) on the dataset. Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Let's start off by importing the required libraries into our code.

In [3]:
import numpy as np
import pandas as pd 
import plotly.express as px
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
%matplotlib inline
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [4]:
plt.rcParams['font.size'] = 12
plt.rcParams['figure.figsize'] = (15, 10)

In [5]:
raw_df = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')
raw_df

Calling the `info` method on our dataframe shows how many non-null values each column has and what the datatype of each column is.

In [6]:
raw_df.info()
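
As a complement to `info()`, the number of missing values per column can be counted directly. The sketch below is only illustrative: `toy_df` is a hypothetical three-row stand-in for `raw_df` with made-up values, and its `Unnamed: 32` column is assumed to be completely empty.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for raw_df (values are made up)
toy_df = pd.DataFrame({
    'id': [1, 2, 3],
    'diagnosis': ['M', 'B', 'B'],
    'radius_mean': [17.9, 11.4, 12.1],
    'Unnamed: 32': [np.nan, np.nan, np.nan],
})

# Number of missing values in each column
null_counts = toy_df.isna().sum()

# Columns in which every single value is missing
all_null_cols = null_counts[null_counts == len(toy_df)].index.tolist()

print(null_counts)
print('Columns with no data at all:', all_null_cols)
```

A column that shows up in `all_null_cols` carries no information and is a natural candidate for dropping.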

We drop two columns, `id` and `Unnamed: 32`, since they add no value to training. We store the result in a new variable, `df`, and then draw some graphs to explore interesting facts about our data.

In [7]:
df = raw_df.drop(columns=['id', 'Unnamed: 32'])
df

## Exploratory Data Analysis

In [8]:
fig = px.bar(df, 
            x='radius_mean', 
            y='diagnosis', 
            color='radius_mean',
            hover_data=['radius_mean'], 
            title='Radius Mean vs Diagnosis')
fig.update_xaxes(showgrid=False)   #Turning the grid off
fig.update_yaxes(showgrid=False)   #Turning the grid off
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)','paper_bgcolor': 'rgba(0, 0, 0, 0)'})  #removing the background color
fig.show()

In [9]:
fig = px.bar(df,
             x="compactness_mean", 
             y="diagnosis", 
             color="compactness_mean",
             log_x=True, 
             template="none", 
             title="Compactness Mean Vs Diagnosis")
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)','paper_bgcolor': 'rgba(0, 0, 0, 0)'})
fig.show()

In [10]:
fig = px.histogram(df, 
                   x='diagnosis', 
                   color_discrete_sequence=['blue'],
                   title='Diagnosis Count')
fig.update_layout(bargap=0.3)
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)','paper_bgcolor': 'rgba(0, 0, 0, 0)'})
fig.show()

From the charts above, we see that there are 357 benign cases and 212 malignant cases. Compactness mean tends to be higher in the malignant cases than in the benign cases.
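
The two observations above can also be checked numerically. This is a minimal sketch that assumes scikit-learn's bundled copy of the Wisconsin dataset (`load_breast_cancer`) as a stand-in for `data.csv`; its column names differ (`mean compactness` instead of `compactness_mean`) and its target encoding is reversed (0 is malignant, 1 is benign), so the labels are mapped back to `M` and `B` first. `check_df` is a name introduced only for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: the bundled scikit-learn copy stands in for the Kaggle CSV.
# In this copy, target 0 = malignant and target 1 = benign.
data = load_breast_cancer(as_frame=True)
check_df = data.frame.copy()
check_df['diagnosis'] = check_df['target'].map({0: 'M', 1: 'B'})

# Class counts and the average compactness within each class
counts = check_df['diagnosis'].value_counts()
mean_compactness = check_df.groupby('diagnosis')['mean compactness'].mean()

print(counts)
print(mean_compactness)
```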

## Data Pre-processing
Next, we separate the columns of `df` into input columns and a target column. Most machine learning algorithms need numeric inputs, so any categorical columns have to be encoded first. In this dataset every input column is already numeric; the only categorical column is `diagnosis`, which is the target and sits in the first position. We can therefore slice `df.columns` to get `input_cols` (everything after the first column) and `target_col` (the first column).

In [11]:
input_cols = df.columns[1:]
input_cols

In [12]:
target_col =  df.columns[0]
target_col

We copy the input columns of `df` into a new dataframe, `inputs_df`, so that the scaling step below does not modify `df` itself. `inputs_df` contains the independent variables (the features), while `targets` contains the dependent variable (the diagnosis), whose value we want to predict from the features.

In [13]:
inputs_df = df[list(input_cols)].copy()
inputs_df

In [14]:
targets = df[target_col]
targets

### Scaling

In [15]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df[input_cols])
inputs_df[input_cols] = scaler.transform(inputs_df[input_cols])
inputs_df[input_cols].describe().loc[['min', 'max']]
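
The cell above fits the scaler on every row, including rows that later land in the validation set, so the validation data influences the minimum and maximum used for scaling. A stricter alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering under some assumptions: it uses scikit-learn's bundled copy of the dataset rather than `data.csv`, and `split_scaler`, `X_train_scaled` and `X_val_scaled` are names introduced only for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Assumption: the bundled scikit-learn copy stands in for the Kaggle CSV
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42)

split_scaler = MinMaxScaler()
# Learn each feature's min and max from the training rows only
X_train_scaled = split_scaler.fit_transform(X_train)
# Reuse the training min and max for the validation rows
X_val_scaled = split_scaler.transform(X_val)

print(X_train_scaled.min(), X_train_scaled.max())
print(X_val_scaled.min(), X_val_scaled.max())
```

With this ordering the validation features are allowed to fall slightly outside the [0, 1] range, which is expected: they are scaled with statistics they did not contribute to.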

### Label Encoding

In [16]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
targets1 = encoder.fit_transform(targets)
targets1
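
It helps to know which number the encoder assigns to which label. `LabelEncoder` sorts the labels, so `B` becomes 0 and `M` becomes 1. The sketch below demonstrates this on a short hypothetical list of labels; `demo_encoder`, `demo_codes` and `mapping` are names introduced only for this example.

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
# Hypothetical labels, standing in for the diagnosis column
demo_codes = demo_encoder.fit_transform(['M', 'B', 'B', 'M'])

# classes_ is sorted, so position 0 is 'B' (benign) and position 1 is 'M' (malignant)
mapping = {str(label): int(code) for code, label in enumerate(demo_encoder.classes_)}

print(mapping)
print(demo_encoder.inverse_transform([0, 1]))  # codes back to the original labels
```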

Now that preprocessing is done, we can start training our model. Let's go ahead and split the data into two parts: training data and validation data. The training data will be used to fit our model, and the validation data will be used to score it on examples it has not seen.

We use a test size of 0.25, which holds out a quarter of the rows. If we trained on the entire dataset, we would have no unseen data left to check whether the model generalizes when a new set of data is thrown at it.

## Splitting Data

In [17]:
from sklearn.model_selection import train_test_split
train_inputs, val_inputs, train_targets, val_targets = train_test_split(inputs_df, 
                                                                        targets1, 
                                                                        test_size=0.25, 
                                                                        random_state=42)

In [18]:
train_inputs.shape, train_targets.shape, val_inputs.shape, val_targets.shape

## Training our Logistic Regression Model

Logistic regression is a process of modeling the probability of a discrete outcome given an input variable. The most common logistic regression models a binary outcome; something that can take two values such as true/false, yes/no, and so on. Multinomial logistic regression can model scenarios where there are more than two possible discrete outcomes.

![](https://miro.medium.com/max/800/1*UgYbimgPXf6XXxMy2yqRLw.png)

Logistic regression is a useful analysis method for classification problems, where you are trying to determine which category a new sample fits best. Deciding whether a tumor is benign or malignant is exactly this kind of problem, which makes logistic regression a natural first model here.
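
To make the idea concrete: a binary logistic regression computes a weighted sum of the inputs and passes it through the sigmoid function to get a probability, then predicts class 1 when that probability is at least 0.5. The sketch below illustrates this with hypothetical weights, bias and inputs (`w`, `b` and `x` are made up, not values from the model trained in this notebook).

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias and one scaled sample (not the fitted model's values)
w = np.array([2.0, -1.0])
b = -0.5
x = np.array([0.8, 0.1])

# Linear score followed by the sigmoid gives the probability of class 1
p_malignant = sigmoid(w @ x + b)
# Predict class 1 when the probability reaches the 0.5 threshold
predicted_class = int(p_malignant >= 0.5)

print(round(float(p_malignant), 3), predicted_class)
```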

In [19]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear')
model.fit(train_inputs, train_targets)

### Making Predictions

In [20]:
train_preds = model.predict(train_inputs)
val_preds = model.predict(val_inputs)
train_preds, val_preds

### Accuracy Score

In [21]:
from sklearn.metrics import accuracy_score
base_train_score = accuracy_score(train_targets, train_preds)
print('The Accuracy score on our training set is {:.2f}%.'.format(base_train_score*100))

In [22]:
base_val_score = accuracy_score(val_targets, val_preds)
print('The Accuracy score on our validation set is {:.2f}%.'.format(base_val_score*100))

In [23]:
from matplotlib import pyplot
train_scores, test_scores = list(), list()
values = [i for i in range(1, 11)]
for i in values:
    # configure the model
    model = LogisticRegression(solver='liblinear', C=i)
    # fit model on the training dataset
    model.fit(train_inputs, train_targets)
    # evaluate on the train dataset
    train_preds = model.predict(train_inputs)
    train_acc = accuracy_score(train_targets, train_preds)
    train_scores.append(train_acc)
    # evaluate on the test dataset
    test_preds = model.predict(val_inputs)
    test_acc = accuracy_score(val_targets, test_preds)
    test_scores.append(test_acc)
    # summarize progress
    print('%d, train: %.3f, test: %.3f' % (i, train_acc, test_acc))
# plot of train and test scores vs C Value
pyplot.plot(values, train_scores, '-o', label='Train')
pyplot.plot(values, test_scores, '-o', label='Test')
pyplot.legend()
pyplot.show()

### Hyperparameter Tuning
Looking at the chart, the training score keeps improving after C = 4 while the validation score comes to a standstill, which suggests that 4 is probably a good value for C. Let's tune the hyperparameters more systematically now.

In [24]:
from sklearn.model_selection import GridSearchCV
C_range = np.arange(1,11,1)
penalty_range= ['l2','l1']
max_iter_range = np.arange(1,110,10)
param_grid = dict(C=C_range, penalty=penalty_range, max_iter= max_iter_range)
model = LogisticRegression(solver='liblinear')
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)

In [25]:
grid.fit(train_inputs, train_targets)
print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))
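
Instead of retyping the best parameters by hand, the fitted search object can be used directly: with the default `refit=True`, `GridSearchCV` retrains the best configuration on the whole training set and exposes it as `best_estimator_`. The sketch below shows this under some assumptions: it uses scikit-learn's bundled copy of the dataset, and a deliberately smaller hypothetical grid (`small_grid`) so that it runs quickly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler

# Assumption: the bundled scikit-learn copy stands in for the Kaggle CSV
X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)  # mirrors the notebook's scaling step
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Hypothetical, deliberately small grid to keep the example fast
small_grid = {'C': [1, 4, 10], 'penalty': ['l1', 'l2']}
search = GridSearchCV(LogisticRegression(solver='liblinear'), small_grid, cv=5)
search.fit(X_train, y_train)

# Already refit on all of X_train because refit=True is the default
best_model = search.best_estimator_
val_accuracy = best_model.score(X_val, y_val)

print(search.best_params_)
print('Validation accuracy: {:.2f}%'.format(val_accuracy * 100))
```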

In [26]:
model = LogisticRegression(solver='liblinear', C= 4, max_iter= 11, penalty= 'l1')
model.fit(train_inputs, train_targets)
train_preds = model.predict(train_inputs)
val_preds = model.predict(val_inputs)
train_score = accuracy_score(train_targets, train_preds)
val_score = accuracy_score(val_targets, val_preds)

In [27]:
# Wrap each scalar in a list so the dataframe has exactly one row
model_scores = pd.DataFrame({
    'Base Train Score': [base_train_score * 100],
    'Base Val Score': [base_val_score * 100],
    'New Train Score': [train_score * 100],
    'New Val Score': [val_score * 100]
                })
model_scores

## Confusion Matrix

The model achieves an accuracy of 96.71% on the training set. We can visualize the breakdown of correctly and incorrectly classified inputs using a confusion matrix.

<img src="https://i.imgur.com/UM28BCN.png" width="480">

In [28]:
from sklearn.metrics import confusion_matrix
confusionmatrix = confusion_matrix(train_targets, train_preds, normalize='true')
plt.figure(figsize=(14,7)) 
sns.heatmap(confusionmatrix*100, annot=True,annot_kws={"size": 18}, cmap="Blues")
plt.xlabel('Prediction',fontsize=16)
plt.ylabel('Target',fontsize=16)
plt.title('Confusion Matrix', fontsize=20)
plt.show()

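In this matrix, rows are the true classes and columns are the predicted classes. Because the label encoder maps `B` to 0 and `M` to 1, the top-left cell is the share of benign cases classified correctly and the bottom-right cell is the share of malignant cases classified correctly. From the matrix above, we conclude that the model does a good job on the first (top-left) cell, but it could do a better job on the malignant cases in the bottom-right cell.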
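
The four cells of a binary confusion matrix can be unpacked into named counts, which makes it easier to see which kind of error the model makes. The sketch below uses hypothetical labels and predictions (`y_true` and `y_pred` are made up) and treats class 1, malignant, as the positive class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions: 0 = benign, 1 = malignant
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes, both ordered [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # share of malignant cases that were caught
specificity = tn / (tn + fp)  # share of benign cases correctly cleared

print(tn, fp, fn, tp)
print(sensitivity, specificity)
```

In a medical setting a false negative (a malignant case predicted as benign) is usually the costlier mistake, so sensitivity is worth tracking alongside accuracy.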

### Feature Importance
Let's look at the weights assigned to different columns, to figure out which columns in the dataset are the most important.

In [29]:
weights = model.coef_
weights = weights.reshape(30)
weights_df = pd.DataFrame({
    'Feature': input_cols,
    'Weight': weights
                }).sort_values('Weight', ascending=False)
weights_df

In [30]:
fig = px.scatter(weights_df, x="Feature", y="Weight", title='Feature Importance Chart',color="Weight")
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)','paper_bgcolor': 'rgba(0, 0, 0, 0)'})
fig.show()

From the charts above, we see that `concave points_mean` has the highest weight, 2.388964, which makes it the feature that pushes a prediction most strongly towards a malignant diagnosis.
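
Sorting by the raw weight puts the largest positive coefficient first, but a large negative coefficient is just as influential in the opposite direction. One way to rank features by overall influence is to sort by absolute weight. The sketch below uses hypothetical feature names and coefficients (`demo_weights` is made up, not the fitted model's output).

```python
import pandas as pd

# Hypothetical coefficients for three features (not the fitted model's values)
demo_weights = pd.DataFrame({
    'Feature': ['feature_a', 'feature_b', 'feature_c'],
    'Weight': [2.4, -3.1, 0.2],
})

# The sign gives the direction of the effect, the magnitude gives its strength
demo_weights['AbsWeight'] = demo_weights['Weight'].abs()
ranked = demo_weights.sort_values('AbsWeight', ascending=False).reset_index(drop=True)

print(ranked)
```

This comparison is only meaningful because the inputs were scaled to a common range before training.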

## Decision Tree Classifier

In [31]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)
model.fit(train_inputs, train_targets)

In [32]:
from sklearn.metrics import accuracy_score, confusion_matrix
train_preds = model.predict(train_inputs)
train_score = model.score(train_inputs, train_targets)*100
train_score

In [33]:
val_preds = model.predict(val_inputs)
val_score = model.score(val_inputs, val_targets)*100
val_score

In [34]:
model.tree_.max_depth

In [35]:
from sklearn.tree import plot_tree, export_text
plt.figure(figsize=(80,20))
plot_tree(model, feature_names=train_inputs.columns, max_depth=2, filled=True);

In [36]:
Diagnosis = {
    'M' : '1',
    'B' : '0'
            }
Diagnosis

In [37]:
confusionmatrix = confusion_matrix(train_targets, train_preds, normalize='true')
plt.figure(figsize=(14,7)) 
sns.heatmap(confusionmatrix, annot=True,annot_kws={"size": 18}, cmap="Greens")
plt.xlabel('Prediction',fontsize=16)
plt.ylabel('Target',fontsize=16)
plt.title('Confusion Matrix', fontsize=20)
plt.show()

### Hyperparameter Tuning

In [38]:
max_depth_range = np.arange(1,8,1)
max_features_range= np.arange(1,31,1)
max_leaf_nodes_range = np.arange(2,100,10)
param_grid = dict(max_depth=max_depth_range, max_features=max_features_range, max_leaf_nodes=max_leaf_nodes_range)
model = DecisionTreeClassifier(random_state=42)
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid.fit(train_inputs, train_targets)
print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))

In [39]:
model = DecisionTreeClassifier(max_depth = 5, max_features= 3, max_leaf_nodes= 11, random_state=42)
model.fit(train_inputs, train_targets)
train_preds = model.predict(train_inputs)
val_preds = model.predict(val_inputs)
train_score = accuracy_score(train_targets, train_preds)
val_score = accuracy_score(val_targets, val_preds)

In [40]:
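# The base columns are the untuned logistic regression scores computed earlier.
# Wrap each scalar in a list so the dataframe has exactly one row.
model_scores = pd.DataFrame({
    'Base Train Score': [base_train_score * 100],
    'Base Val Score': [base_val_score * 100],
    'New Train Score': [train_score * 100],
    'New Val Score': [val_score * 100]
                })
model_scores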

## Random Forest Classifier

While tuning the hyperparameters of a single decision tree may lead to some improvements, a much more effective strategy is to combine the results of several decision trees trained with slightly different parameters. This is called a random forest model.

The key idea here is that each decision tree in the forest will make different kinds of errors, and upon averaging, many of their errors will cancel out. This idea is also commonly known as the "wisdom of the crowd".

A random forest works by averaging/combining the results of several decision trees:

![](https://1.bp.blogspot.com/-Ax59WK4DE8w/YK6o9bt_9jI/AAAAAAAAEQA/9KbBf9cdL6kOFkJnU39aUn4m8ydThPenwCLcBGAsYHQ/s0/Random%2BForest%2B03.gif)
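
The averaging idea can be sketched by hand: train several decision trees, each on a different bootstrap sample of the training rows, and let them vote. This is a simplified illustration, not scikit-learn's implementation (which averages predicted class probabilities rather than counting hard votes). It assumes scikit-learn's bundled copy of the dataset, and `n_trees`, `trees` and `votes` are names introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled scikit-learn copy stands in for the Kaggle CSV
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42)

rng = np.random.default_rng(42)
n_trees = 25  # hypothetical ensemble size (odd, so votes cannot tie)
trees = []
for seed in range(n_trees):
    # Bootstrap sample: draw training rows with replacement
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # Each split considers only a random subset of the features
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=seed)
    trees.append(tree.fit(X_train[rows], y_train[rows]))

# Majority vote: predict class 1 when more than half of the trees say so
votes = np.stack([tree.predict(X_val) for tree in trees])  # shape (n_trees, n_val)
ensemble_preds = (votes.mean(axis=0) > 0.5).astype(int)
ensemble_acc = (ensemble_preds == y_val).mean()

print('Hand-rolled forest validation accuracy: {:.2f}%'.format(ensemble_acc * 100))
```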



In [41]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_jobs=-1, random_state=42)
model.fit(train_inputs,train_targets)

In [42]:
model_train_score = model.score(train_inputs, train_targets)*100
model_val_score = model.score(val_inputs, val_targets)*100
print('Random Forest Training Score - {:.2f}%'.format(model_train_score))
print('Random Forest Validation Score - {:.2f}%'.format(model_val_score))

In [43]:
forest_weights = model.feature_importances_
weights_df = pd.DataFrame({
    'Feature': input_cols,
    'Weight': forest_weights
                }).sort_values('Weight', ascending=False)
plt.figure(figsize=(20,10))
plt.xticks(rotation=45)
plt.title('Feature Importance')
sns.barplot(data=weights_df.head(10), x='Feature', y='Weight');

### HyperParameter Tuning

In [44]:
max_depth_range = np.arange(1,8,1)
max_features_range= np.arange(1,31,1)
max_leaf_nodes_range = np.arange(2,100,10)
from sklearn.model_selection import RandomizedSearchCV
distributions = dict(max_depth=max_depth_range, max_features=max_features_range, max_leaf_nodes=max_leaf_nodes_range)
model = RandomForestClassifier(n_jobs=-1, random_state=42)
clf = RandomizedSearchCV(model, distributions, random_state=42)

In [45]:
clf.fit(train_inputs, train_targets)
print("The best parameters are %s with a score of %0.2f" % (clf.best_params_, clf.best_score_))

In [46]:
model = RandomForestClassifier(max_depth= 6, 
                               max_leaf_nodes= 82,
                               max_features=14,
                               n_jobs=-1,
                               random_state=42)
model.fit(train_inputs,train_targets)
model_train_score = model.score(train_inputs, train_targets)*100
model_val_score = model.score(val_inputs, val_targets)*100
print('Random Forest Training Score - {:.2f}%'.format(model_train_score))
print('Random Forest Validation Score - {:.2f}%'.format(model_val_score))

In [47]:
# Wrap each scalar in a list so the dataframe has exactly one row.
# The random forest scores were already multiplied by 100 above.
model_scores = pd.DataFrame({
    'Base Train Score': [base_train_score * 100],
    'Base Val Score': [base_val_score * 100],
    'New Train Score': [model_train_score],
    'New Val Score': [model_val_score]
                })
model_scores

## Principal Component Analysis (PCA)

Principal Component Analysis is a way to reduce the number of variables while maintaining the majority of the important information. It transforms a number of variables that may be correlated into a smaller number of uncorrelated variables, known as principal components.

The main objective of PCA is to simplify your model features into fewer components to help visualize patterns in your data and to help your model run faster. Using PCA also reduces the chance of overfitting your model by eliminating features with high correlation.

In [120]:
from sklearn.preprocessing import scale
from sklearn import decomposition #PCA
X = scale(inputs_df)
pca = decomposition.PCA(n_components=5)
pca.fit(X)

In [101]:
scores = pca.transform(X)
scores_df = pd.DataFrame(scores, columns=['PC1', 'PC2','PC3', 'PC4', 'PC5'])
target = pd.Series(targets1, name='target')
result_df = pd.concat([scores_df, target], axis=1)
result_df.head()

In [116]:
fig = plt.figure(figsize = (12,10))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel('First Principal Component ', fontsize = 15)
ax.set_ylabel('Second Principal Component ', fontsize = 15)

ax.set_title('Principal Component Analysis (5PCs) for Cancer Dataset', fontsize = 20)

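# Use fresh names so the `targets` and `target` Series defined earlier are not overwritten
class_labels = [0, 1]  # 0 = benign, 1 = malignant
colors = ['r', 'g']
for label, color in zip(class_labels, colors):
    indicesToKeep = targets1 == label
    ax.scatter(result_df.loc[indicesToKeep, 'PC1'], 
               result_df.loc[indicesToKeep, 'PC2'],
               c = color, 
               s = 50)
ax.legend(class_labels)
ax.grid()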

**Explained Variance Ratio**

The explained variance ratio is the share of the total variance that is attributed to each of the selected components. A common rule of thumb is to choose the number of components by adding up the explained variance ratio of each component until you reach a total of around 0.8 (80%), which keeps most of the information and can help to avoid overfitting.

In [121]:
print('Variance of each component:', pca.explained_variance_ratio_)
print('\n Total Variance Explained:', round(sum(list(pca.explained_variance_ratio_))*100, 2))

We can see that our five principal components together explain the majority of the variance in this dataset (84.73%). This is an indication of how much of the information in the original data the components retain.
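
The 80% rule of thumb can be applied programmatically by looking at the cumulative explained variance. The sketch below assumes scikit-learn's bundled copy of the dataset and standardized features; `full_pca`, `cumulative`, `n_for_80` and `auto_pca` are names introduced only for this example. It also shows that passing a fraction as `n_components` asks `PCA` to pick the number of components itself.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled scikit-learn copy stands in for the Kaggle CSV
X, _ = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Fit with all 30 components and accumulate their variance ratios
full_pca = PCA().fit(X_std)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 80%
n_for_80 = int(np.argmax(cumulative >= 0.80)) + 1
print(np.round(cumulative[:6], 4))
print('Components needed for 80%:', n_for_80)

# A float between 0 and 1 tells PCA to choose the count for that variance target
auto_pca = PCA(n_components=0.80).fit(X_std)
print('Components chosen by PCA:', auto_pca.n_components_)
```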

## Support Vector Machine

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points.

![](https://miro.medium.com/max/1400/0*ecA4Ls8kBYSM5nza.jpg)

Support vectors are the data points closest to the hyperplane, and they determine its position and orientation. Using these support vectors, we maximize the margin of the classifier. Deleting a support vector will change the position of the hyperplane. These are the points that help us build our SVM.


In the SVM algorithm, we are looking to maximize the margin between the data points and the hyperplane.

In [141]:
from sklearn.svm import SVC
model = SVC(kernel='linear')
model.fit(train_inputs, train_targets)
val_preds = model.predict(val_inputs)
SVMAccuracy = accuracy_score(val_targets,val_preds)*100
print('SVM Accuracy Score - {:.2f}%'.format(SVMAccuracy))

We can clearly see that SVM has given us the best validation accuracy so far, 98.60%, without much hyperparameter tuning.
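
To see support vectors and the margin in action, it helps to fit a linear SVM on a tiny two-dimensional example. The sketch below uses six hypothetical, linearly separable points (`X_toy`, `y_toy`) and a very large `C` to approximate a hard margin. For a linear SVM with weight vector `w`, the width of the margin is `2 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2-D points
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin classifier
svm = SVC(kernel='linear', C=1e6).fit(X_toy, y_toy)

w = svm.coef_[0]
# Distance between the two margin boundaries
margin_width = 2.0 / np.linalg.norm(w)

print(svm.support_vectors_)  # the points that pin down the hyperplane
print(margin_width)
```

Only the points nearest to the other class end up in `support_vectors_`; moving any of the remaining points (without crossing the margin) leaves the hyperplane unchanged.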


SUMMARY OF THE NOTEBOOK:
1. The dataset has 357 benign cases and 212 malignant cases. Compactness mean tends to be higher in the malignant cases than in the benign cases.
2. Depending on the size of the data and the available computational power, one should use either GridSearch or RandomizedSearch for hyperparameter tuning.
3. PCA is a great way to move from high dimensionality to low dimensionality. Choose the number of components by adding the explained variance ratio of each component until you reach a total of around 0.8 (80%) to avoid overfitting.
4. Complex algorithms are not always the answer. Sometimes even a simpler algorithm can work wonders.

**Further reading**
- https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47
- https://scikit-learn.org/stable/modules/svm.html
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
