# An Investigation into the Wisconsin Breast Cancer Dataset

***

![image.png](attachment:image.png)

by Ben Janning

## Project Aim

***

- Undertake an analysis/review of the dataset and present an overview and background.
- Provide a literature review on classifiers which have been applied to the dataset and compare their performance.
- Present a statistical analysis of the dataset.
- Using a range of machine learning algorithms, train a set of classifiers on the dataset and present classification performance results.
- Compare, contrast and critique results with reference to the literature.
- Discuss and investigate how the dataset could be extended by synthesising new tumour datapoints.

## An Overview of Breast Cancer

***

![image.png](attachment:image.png)

According to research from cancer.org, breast cancer is the leading cancer type in females in most countries in the world. Approximately 20% of females will be diagnosed with breast cancer during their lifetime, although this number varies significantly by country. Estimated incidence rates also vary widely worldwide, with an almost 400% difference between the areas ranked highest and lowest. [4]

Breast cancer is caused by cells in the breast growing in an abnormal manner. When these cells group together, they form a tumour.

Breast cancer is generally over ten times more common in women, but is also found in men. Each year in Ireland, more than 3,500 women and approximately 35 men are diagnosed with breast cancer.  Women over 50 are most likely to be diagnosed.

Depending on the type of breast cancer, treatment may involve surgery, radiotherapy, chemotherapy, hormone therapy and targeted therapies. [5]

## Dataset Background

***

The Wisconsin Breast Cancer Dataset was created by Dr William H. Wolberg at the University of Wisconsin Hospitals and made available online in 1992. The dataset contains 699 patient records, of which 458 (65.5%) are from patients with a benign breast cancer (BC) tumour and 241 (34.5%) are from patients with a malignant BC tumour. [6]

The Wisconsin Breast Cancer dataset is a classification dataset that records measurements for breast cancer cases. It has a dimensionality of 9. When the dataset is used for outlier detection, the malignant class is treated as the outliers, while points in the benign class are treated as inliers. [1]

## Attributes

***

Two datasets exist for the study: the Original dataset and the Diagnostic dataset. The main difference between them is the attributes they record.

All Original dataset features take integer values from 1 to 10, where 1 represents a normal state and 10 represents the most abnormal state.

The dependent feature of the Original dataset (Class) takes a value of either 2 or 4, where 2 denotes a benign BC tumour diagnosis and 4 denotes a malignant BC tumour diagnosis.

Of the 699 records, 16 have missing values for the ‘Bare Nuclei’ feature.
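
In the UCI file for the Original dataset, missing values are recorded as `?`, so they have to be declared as missing when the file is loaded. The following is a minimal sketch of that step. It uses a small made-up inline sample rather than the real file, and the shortened column names are illustrative only.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the style of the Original dataset;
# the values and column names are made up for illustration.
sample = io.StringIO(
    "clump_thickness,bare_nuclei,class\n"
    "5,1,2\n"
    "8,?,4\n"
    "3,2,2\n"
)

# Treat '?' as missing so the column is parsed as numeric.
original = pd.read_csv(sample, na_values="?")

# Count the missing values in each column.
missing = original.isnull().sum()
print(missing)
```

Applied to the real file, the same count for the Bare Nuclei column should match the 16 missing records mentioned above.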

### Original Dataset's Attributes: [2]

***

1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant)

### Diagnostic Dataset's Attributes: [3]

***

1) ID number
2) Diagnosis (M = malignant, B = benign)
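3-32) Thirty real-valued features

Ten real-valued features are computed for each cell nucleus. The mean, standard error and "worst" (largest) value of each one is recorded, which gives the 30 features in columns 3-32: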

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" - 1)
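
Each of the ten measurements is stored three times in the Diagnostic dataset: as a mean, a standard error and a "worst" value. As a rough sketch, the 30 feature column names can be rebuilt from those two lists. The spellings below are taken from the column names that appear later in this notebook (for example `concave points_mean` and `perimeter_se`); the variable names are illustrative.

```python
# The ten base measurements, spelled as in the dataset's column names.
measurements = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# The three statistics recorded for each measurement.
statistics = ["mean", "se", "worst"]

# One column per (statistic, measurement) combination.
feature_columns = [f"{m}_{s}" for s in statistics for m in measurements]

print(len(feature_columns))
print(feature_columns[:3])
```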


##  Classifiers Which Have Been Applied to the Dataset

***



## Load the Dataset

***

In [79]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [80]:
import warnings
warnings.filterwarnings('ignore')

In [81]:
# Load the Dataset

df = pd.read_csv('data/data.csv')

In [82]:
# View the first 5 rows

df.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | NaN |


In [83]:
# View the last 5 rows

df.tail()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 564 | 926424 | M | 21.56 | 22.39 | 142.0 | 1479.0 | 0.111 | 0.1159 | 0.2439 | 0.1389 | ... | 26.4 | 166.1 | 2027.0 | 0.141 | 0.2113 | 0.4107 | 0.2216 | 0.206 | 0.07115 | NaN |
| 565 | 926682 | M | 20.13 | 28.25 | 131.2 | 1261.0 | 0.0978 | 0.1034 | 0.144 | 0.09791 | ... | 38.25 | 155.0 | 1731.0 | 0.1166 | 0.1922 | 0.3215 | 0.1628 | 0.2572 | 0.06637 | NaN |
| 566 | 926954 | M | 16.6 | 28.08 | 108.3 | 858.1 | 0.08455 | 0.1023 | 0.09251 | 0.05302 | ... | 34.12 | 126.7 | 1124.0 | 0.1139 | 0.3094 | 0.3403 | 0.1418 | 0.2218 | 0.0782 | NaN |
| 567 | 927241 | M | 20.6 | 29.33 | 140.1 | 1265.0 | 0.1178 | 0.277 | 0.3514 | 0.152 | ... | 39.42 | 184.6 | 1821.0 | 0.165 | 0.8681 | 0.9387 | 0.265 | 0.4087 | 0.124 | NaN |
| 568 | 92751 | B | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.0 | 0.0 | ... | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0 | 0.0 | 0.2871 | 0.07039 | NaN |


In [84]:
# Use shape to determine how many rows and columns

df.shape

(569, 33)

In [85]:
# Statistical information. T = Transpose

df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 569.0 | 30371830.0 | 125020600.0 | 8670.0 | 869218.0 | 906024.0 | 8813129.0 | 911320500.0 |
| radius_mean | 569.0 | 14.12729 | 3.524049 | 6.981 | 11.7 | 13.37 | 15.78 | 28.11 |
| texture_mean | 569.0 | 19.28965 | 4.301036 | 9.71 | 16.17 | 18.84 | 21.8 | 39.28 |
| perimeter_mean | 569.0 | 91.96903 | 24.29898 | 43.79 | 75.17 | 86.24 | 104.1 | 188.5 |
| area_mean | 569.0 | 654.8891 | 351.9141 | 143.5 | 420.3 | 551.1 | 782.7 | 2501.0 |
| smoothness_mean | 569.0 | 0.09636028 | 0.01406413 | 0.05263 | 0.08637 | 0.09587 | 0.1053 | 0.1634 |
| compactness_mean | 569.0 | 0.104341 | 0.05281276 | 0.01938 | 0.06492 | 0.09263 | 0.1304 | 0.3454 |
| concavity_mean | 569.0 | 0.08879932 | 0.07971981 | 0.0 | 0.02956 | 0.06154 | 0.1307 | 0.4268 |
| concave points_mean | 569.0 | 0.04891915 | 0.03880284 | 0.0 | 0.02031 | 0.0335 | 0.074 | 0.2012 |
| symmetry_mean | 569.0 | 0.1811619 | 0.02741428 | 0.106 | 0.1619 | 0.1792 | 0.1957 | 0.304 |


In [86]:
# How many unique values are there? 2 - Malignant and Benign

df.diagnosis.unique()

array(['M', 'B'], dtype=object)

In [87]:
# How many of each are there?

df['diagnosis'].value_counts()

B    357
M    212
Name: diagnosis, dtype: int64
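
The counts above can also be expressed as class shares, which makes the class imbalance easier to compare with the Original dataset. This is a small sketch that starts from the two counts reported by `value_counts()`. On the real frame, `df['diagnosis'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Class counts as reported by value_counts() above.
counts = pd.Series({"B": 357, "M": 212})

# Share of each class as a percentage of all 569 records.
share = (counts / counts.sum() * 100).round(1)
print(share)
```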

In [88]:
# Use Seaborn to plot the values.
# Pass the column by name via x= and data=. Passing the Series positionally
# raised "ValueError: could not convert string to float: 'M'".

sns.countplot(x='diagnosis', data=df, palette='husl')

## Data Cleansing

***

In [None]:
# Drop the id column and the unnamed Column

df.drop('id', axis=1, inplace=True)
df.drop('Unnamed: 32', axis=1, inplace=True)
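
An `Unnamed: N` column that contains no values typically appears when the header row of a CSV file ends with a trailing comma. That this is the cause here is an assumption; it has not been checked against `data/data.csv`. The sketch below reproduces the effect on a made-up inline sample and removes the empty column with `dropna` instead of naming it by hand.

```python
import io
import pandas as pd

# Made-up two-row sample whose header line ends with a trailing comma.
sample = io.StringIO(
    "id,diagnosis,radius_mean,\n"
    "1,M,17.99\n"
    "2,B,7.76\n"
)
demo = pd.read_csv(sample)
print(list(demo.columns))

# Drop every column that contains no values at all.
cleaned = demo.dropna(axis=1, how="all")
print(list(cleaned.columns))
```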

In [None]:
df.head()

In [None]:
# Map the values M and B to numeric values. [7]

df['diagnosis'] = df['diagnosis'].map({'M':1, 'B':0})
df.head()
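
One property of `Series.map` with a dictionary is worth keeping in mind: any label that is not a key in the dictionary becomes `NaN` instead of raising an error. The sketch below shows this on made-up labels, so that an unexpected value can be caught before modelling. The variable names are illustrative.

```python
import pandas as pd

# Made-up labels, including one unexpected value ('m' in lower case).
labels = pd.Series(["M", "B", "m", "B"])

encoded = labels.map({"M": 1, "B": 0})

# Labels missing from the dictionary become NaN rather than raising an error.
unmapped = labels[encoded.isna()]
print(unmapped.tolist())
```

In this notebook the check is not strictly needed, because `df.diagnosis.unique()` already showed that only `M` and `B` occur.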

In [None]:
# How many missing (null) values are there in each column? [8]

df.isnull().sum()

In [None]:
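# Alternative to the map() call above. It is left commented out because
# 'diagnosis' has already been converted to 0/1 at this point.
#
# def diagnosis_value(diagnosis):
#     if diagnosis == 'M':
#         return 1
#     else:
#         return 0
#
# df['diagnosis'] = df['diagnosis'].apply(diagnosis_value)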

In [None]:
# Check for correlation between each pair of variables.

df.corr()

radius_mean, perimeter_mean and area_mean are highly correlated with a malignant diagnosis.
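
Reading the strongest relationships off a full correlation matrix is error-prone, so it can help to rank the features by their correlation with the label. The sketch below does this on a small synthetic frame, not on the real data; the `size` and `noise` columns are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 'size' depends on the label,
# 'noise' does not.
diagnosis = rng.integers(0, 2, size=200)
toy = pd.DataFrame({
    "diagnosis": diagnosis,
    "size": diagnosis * 5.0 + rng.normal(0, 1, size=200),
    "noise": rng.normal(0, 1, size=200),
})

# Correlation of every feature with the label, strongest first.
ranking = (
    toy.corr()["diagnosis"]
    .drop("diagnosis")
    .sort_values(key=abs, ascending=False)
)
print(ranking)
```

On the notebook's frame, the same expression with `df` in place of `toy` would list the features most strongly associated with a malignant diagnosis first.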

In [None]:
# Plot the results on a histogram

plt.hist(df['diagnosis'], color='g')
plt.title('Diagnosis M=1, B=0')
plt.show()

In [None]:
# Show the correlation on a Heatmap

plt.figure(figsize=(20,20))
sns.heatmap(df.corr(), annot=True)

In [None]:
# Take all the "mean" columns and generate a plot matrix

# Assigning the Columns

cols = ['diagnosis',
        'radius_mean', 
        'texture_mean', 
        'perimeter_mean', 
        'area_mean', 
        'smoothness_mean', 
        'compactness_mean', 
        'concavity_mean',
        'concave points_mean', 
        'symmetry_mean', 
        'fractal_dimension_mean']

sns.pairplot(data=df[cols], hue='diagnosis', palette='rocket')

We can see almost perfectly linear patterns between the radius, perimeter and area attributes, which hints at multicollinearity between these variables. Another set of variables that may be multicollinear is concavity, concave points and compactness.

### What is Multicollinearity?

Multicollinearity occurs when two or more independent variables (also known as predictors) in a regression model are highly correlated with one another.

This means that an independent variable can be predicted from another independent variable in a regression model. 
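
A minimal illustration of the idea, using synthetic circles rather than the real measurements: the perimeter and area of a circle are both functions of its radius, so the three columns carry almost the same information. The numbers below are generated for the example only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic circles: perimeter and area are both functions of the radius.
radius = rng.uniform(5, 25, size=300)
circles = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,
    "area": np.pi * radius ** 2,
})

# Pairwise Pearson correlations between the three "independent" variables.
print(circles.corr().round(3))
```

Because the perimeter is an exact linear function of the radius, their correlation is 1. The area is a quadratic function of the radius, so its correlation is slightly lower but still very high.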

In [None]:
# Correlation Matrix Visualised [9]

corr = df.corr().round(2)

# Generate a mask for the upper triangle.
# The np.bool alias was removed from NumPy, so the built-in bool is used.

mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set Figure Size
f, ax = plt.subplots(figsize=(20, 20))

# Define the colormap

cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Heatmap

sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

plt.tight_layout()

We can verify the presence of multicollinearity between some of the variables.
For instance, the radius_mean column has a correlation of 1 and 0.99 with the perimeter_mean and area_mean columns, respectively. This is because the three columns essentially contain the same information, which is the physical size of the observation (the cell).
Therefore, we should pick only one of the three columns for further analysis.

Multicollinearity is also apparent between the 'mean' columns and the 'worst' columns. For instance, the radius_mean column has a correlation of 0.97 with the radius_worst column.

There is also multicollinearity between the compactness, concavity and concave points attributes. We can choose just one of these. I have decided on compactness.
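
The pairs discussed above were read off the heatmap by eye. As a sketch of how the same step could be automated, the hypothetical helper `high_corr_pairs` below lists every pair of columns whose absolute correlation reaches a threshold. The 0.9 threshold and the synthetic data are assumptions for the example, not values taken from the analysis.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for every pair with |r| >= threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, round(float(r), 2)) for (a, b), r in pairs.items()]

# Synthetic stand-in for the real data.
rng = np.random.default_rng(2)
radius = rng.uniform(5, 25, size=300)
toy = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": 2 * np.pi * radius + rng.normal(0, 1, size=300),
    "texture_mean": rng.normal(19, 4, size=300),
})

result = high_corr_pairs(toy)
print(result)
```

Which column to keep from each reported pair remains a judgment call, as with the choice of compactness above.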

In [None]:
# first, drop all "worst" columns
cols = ['radius_worst', 
        'texture_worst', 
        'perimeter_worst', 
        'area_worst', 
        'smoothness_worst', 
        'compactness_worst', 
        'concavity_worst',
        'concave points_worst', 
        'symmetry_worst', 
        'fractal_dimension_worst']
df = df.drop(cols, axis=1)

# then, drop all columns related to the "perimeter" and "area" attributes
cols = ['perimeter_mean',
        'perimeter_se', 
        'area_mean', 
        'area_se']
df = df.drop(cols, axis=1)

# lastly, drop all columns related to the "concavity" and "concave points" attributes
cols = ['concavity_mean',
        'concavity_se', 
        'concave points_mean', 
        'concave points_se']
df = df.drop(cols, axis=1)

# verify remaining columns
df.columns

In [None]:
# Draw the heatmap again, with the new correlation matrix
# (cmap is reused from the earlier heatmap cell)
corr = df.corr().round(2)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)
plt.tight_layout()

## References

***

[1] Odds, Breast Cancer Wisconsin, Original Dataset. http://odds.cs.stonybrook.edu/breast-cancer-wisconsin-original-dataset/

[2] UCI, Breast Cancer Wisconsin Original. https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

[3] UCI, Breast Cancer Wisconsin Diagnostic. https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)

[4] Cancer.org, Breast Cancer.  https://canceratlas.cancer.org/the-burden/breast-cancer/

[5] Cancer.ie, Breast Cancer. https://www.cancer.ie/cancer-information-and-support/cancer-types/breast-cancer

[6] NCBI, A five-year analysis of studies focused on breast cancer prediction using machine learning. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7330506/

[7] Pandas Series Map. https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html

[8] Pandas DataFrame isnull. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html

[9] Many Pairwise Correlations. https://seaborn.pydata.org/examples/many_pairwise_correlations.html