# PfDA Project II. - Wisconsin Breast Cancer Dataset Investigation



#### Assignment for 22-23 Programming for Data Analysis by Mr Brian McGinley
    
#### Author: Eva Czeyda-Pommersheim

## I. Python Libraries

In [25]:
# Packages for data analysis
import numpy as np
import pandas as pd

from sklearn import svm

# Packages for visualization
import matplotlib.pyplot as plt
import seaborn as sns

## II. Introduction

According to the World Health Organization, breast cancer is a disease in which abnormal breast cells grow out of control and form tumours. If left unchecked, the tumours can spread throughout the body and become fatal.

Breast cancer cells begin inside the milk ducts and/or the milk-producing lobules of the breast. The earliest form (in situ) is not life-threatening. Cancer cells can spread into nearby breast tissue (invasion). This creates tumours that cause lumps or thickening.

In 2020, there were 2.3 million women diagnosed with breast cancer and 685 000 deaths globally. As of the end of 2020, there were 7.8 million women alive who were diagnosed with breast cancer in the past 5 years, making it the world’s most prevalent cancer. Breast cancer occurs in every country of the world in women at any age after puberty but with increasing rates in later life.[14]

The Wisconsin Breast Cancer (Diagnostics) Dataset (WBCD) was created by Dr. William H. Wolberg, W. Nick Street and Olvi L. Mangasarian at the University of Wisconsin Hospitals and made available online in 1992. It was donated to the UCI Machine Learning Repository in 1995.[16]<p>
The dataset is a classification dataset with multivariate characteristics. The features are taken and computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.[16]<p>
During a fine needle aspiration (FNA), a small amount of breast tissue or fluid is removed from a suspicious area with a thin, hollow needle and checked for cancer cells.[15]<p>
The features describe characteristics of the cell nuclei present in the image, for both cancerous (malignant) and non-cancerous (benign) samples.<p>
    
Below is an example of a digitized image of a fine needle aspirate taken from a breast mass.<p>
<img src="https://www.myvmc.com/uploads/VMC/DiseaseImages/742_FNA1.jpg" width=200><p>
    
There are several machine learning algorithms which can be applied to this dataset with the purpose of improving the diagnosis/early detection of breast cancer in patients.

## III. Literature Review

## IV. Statistical Analysis of Wisconsin Breast Cancer Dataset

### Importing the WBCD (Diagnostics) Dataset

Source used for the purpose of this project:<p>https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data?resource=download

The Wisconsin Breast Cancer Diagnostics Dataset is also available at:<p>https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

In [53]:
# Using Pandas read content of data.csv file which is saved in the same location in the
# repository as this Jupyter Notebook
dataset = pd.read_csv('data.csv')

In [54]:
# Review of the header and the first few rows of the dataset
dataset.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | NaN |


In [55]:
# Getting an overview of the dataset
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

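The file contains 33 columns in total: 32 attributes plus an empty trailing column, 'Unnamed: 32'.

1) ID number
2) Diagnosis (M = malignant = cancerous, B = benign = non-cancerous)
3-32) Thirty real-valued features

Ten real-valued features are computed for each cell nucleus, and the mean, standard error and "worst" (largest) value of each are recorded in the `_mean`, `_se` and `_worst` columns: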

	a) radius (mean of distances from center to points on the perimeter)
	b) texture (standard deviation of gray-scale values)
	c) perimeter
	d) area
	e) smoothness (local variation in radius lengths)
	f) compactness (perimeter^2 / area - 1.0)
	g) concavity (severity of concave portions of the contour)
	h) concave points (number of concave portions of the contour)
	i) symmetry 
	j) fractal dimension ("coastline approximation" - 1)[16]
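
As a rough illustration of how the ten measurements above map onto the thirty feature columns, the sketch below rebuilds the column names from the suffixes visible in the `dataset.info()` output. It assumes every measurement appears exactly once per suffix; the names `measurements`, `statistics` and `feature_columns` are introduced here for illustration only.

```python
# The ten measurements listed above, spelled as they appear in the column names
measurements = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Suffixes seen in the dataset.info() output
statistics = ["mean", "se", "worst"]

# One column per (statistic, measurement) pair, e.g. "radius_mean"
feature_columns = [f"{m}_{s}" for s in statistics for m in measurements]

print(len(feature_columns))   # 30
print(feature_columns[:3])    # ['radius_mean', 'texture_mean', 'perimeter_mean']
```

Together with the ID and diagnosis columns this gives the 32 attributes; the 33rd column reported by pandas is the empty 'Unnamed: 32'.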

The 'id' column only identifies the sample and carries no diagnostic information, so it is not useful for this analysis. There is also a column 'Unnamed: 32', which contains no values for any row.

In [56]:
# Remove the "id" column, as it carries no information useful for the analysis
dataset.drop(['id'], axis = 1, inplace=True)
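
The empty 'Unnamed: 32' column could be removed in a similar way. The sketch below is one possible approach, not the step used in this notebook: it works on a small stand-in DataFrame (`sample`, a hypothetical name, with a few values copied from the table above) and drops every column in which all values are missing.

```python
import numpy as np
import pandas as pd

# Small stand-in for the real dataset: an identifier column,
# two real columns and an entirely empty trailing column
sample = pd.DataFrame({
    "id": [842302, 842517, 84300903],
    "diagnosis": ["M", "M", "M"],
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Drop the identifier, then drop any column where every value is NaN
cleaned = sample.drop(columns=["id"]).dropna(axis=1, how="all")

print(cleaned.columns.tolist())  # ['diagnosis', 'radius_mean']
```

Using `dropna(axis=1, how="all")` rather than naming the column avoids relying on the exact label pandas assigns to the empty column.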

In [57]:
dataset.head()

| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |
| 3 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | NaN |
| 4 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | NaN |


In [58]:
dataset.shape

(569, 32)

In [59]:
dataset["diagnosis"].value_counts()

B    357
M    212
Name: diagnosis, dtype: int64
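
The counts above show that the two classes are not evenly represented. A minimal sketch of how the class proportions could be obtained is given below; to stay self-contained it rebuilds a diagnosis column from the reported counts instead of reading `data.csv`, and the names `diagnosis` and `proportions` are illustrative.

```python
import pandas as pd

# Rebuild a diagnosis column with the counts reported above
# (357 benign, 212 malignant)
diagnosis = pd.Series(["B"] * 357 + ["M"] * 212, name="diagnosis")

# normalize=True returns relative frequencies instead of raw counts
proportions = diagnosis.value_counts(normalize=True).round(3)

print(proportions)
```

Roughly 63% of the samples are benign and 37% malignant, which is worth keeping in mind when interpreting accuracy figures later on.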

In [60]:
dataset.describe()

| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 0.0 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 | NaN |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | 0.00706 | ... | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 | NaN |
| min | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | 0.04996 | ... | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 | NaN |
| 25% | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | 0.0577 | ... | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 | NaN |
| 50% | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | 0.06154 | ... | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 | NaN |
| 75% | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | 0.06612 | ... | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 | NaN |
| max | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | 0.09744 | ... | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 | NaN |


The 'diagnosis' attribute records whether the sampled tumour was diagnosed as malignant or benign. It is stored with the object data type. To make further analysis easier, it is useful to convert these values to integers, so that malignant (cancerous) is encoded as 1 and benign (non-cancerous) as 0.
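
One way this conversion could be done is with a dictionary passed to the pandas `Series.map` method. The sketch below is a minimal illustration on a hypothetical five-row frame (`sample`), not the real dataset; the mapping name `diagnosis_codes` is also an assumption made here.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the 'diagnosis' column
sample = pd.DataFrame({"diagnosis": ["M", "B", "B", "M", "B"]})

# Malignant -> 1, benign -> 0, as described above
diagnosis_codes = {"M": 1, "B": 0}
sample["diagnosis"] = sample["diagnosis"].map(diagnosis_codes).astype(int)

print(sample["diagnosis"].tolist())  # [1, 0, 0, 1, 0]
```

Any label missing from the dictionary would become NaN and make the `astype(int)` call fail, which acts as a simple check that only 'M' and 'B' occur in the column.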

## V. Modelling and Performance Review

## VI. Modelling and Performance Review

## VII. Comparison with Literature Review

## VIII. Possibilities for extending dataset - Data Synthesis

## IX. Conclusion

## X. References

(No date) UCI Machine Learning Repository: Breast Cancer wisconsin (diagnostic) data set. Available at: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic) (Accessed: 22 May 2023).

Fine needle aspiration (2022) ENT Health. Available at: https://www.enthealth.org/conditions/fine-needle-aspiration/#:~:text=Fine%20needle%20aspiration%20(FNA)%2C,)%20or%20noncancerous%20(benign). (Accessed: 22 May 2023). 

### Tutorials<p>

[1] Simplilearn (2018) Scikit-Learn Tutorial | Machine Learning with scikit-learn | sklearn | python tutorial | Simplilearn, YouTube. YouTube. Available at: https://www.youtube.com/watch?v=0Lt9w-BxKFQ (Accessed: January 15, 2023). <p>
[2] programmingwithmosh (2020) Python machine learning tutorial (data science), YouTube. YouTube. Available at: https://www.youtube.com/watch?v=7eh4d6sabA0 (Accessed: January 15, 2023).<p>
[3] Simplilearn (2020) Machine learning for absolute beginners 2020 | machine learning tutorial for beginners | simplilearn, YouTube. YouTube. Available at: https://www.youtube.com/watch?v=oxvtR803RhE (Accessed: January 15, 2023).<p>
[4] Simplilearn (2018) Machine learning tutorial part - 1 | machine learning tutorial for beginners part - 1 | simplilearn, YouTube. YouTube. Available at: https://www.youtube.com/watch?v=DWsJc1xnOZo (Accessed: January 15, 2023).<p>
[5] Simplilearn (2018) Machine learning tutorial part - 2 | machine learning tutorial for beginners part - 2 | simplilearn, YouTube. YouTube. Available at: https://www.youtube.com/watch?v=_Wkx_447zBM (Accessed: January 15, 2023).<p>
[6] Simplilearn (2018) Machine learning basics | what is machine learning? | introduction to machine learning | Simplilearn, YouTube. YouTube. Available at: https://www.youtube.com/watch?v=ukzFI9rgwfU&list=PLEiEAq2VkUUI73199L-Aym2MnKjBxJ-4X (Accessed: January 15, 2023). <p>
[7] An introduction to machine learning with scikit-learn (no date) scikit. Available at: https://scikit-learn.org/stable/tutorial/basic/tutorial.html (Accessed: January 15, 2023).<p>
[8] Learn intro to machine learning tutorials (no date) Kaggle. Available at: https://www.kaggle.com/learn/intro-to-machine-learning (Accessed: January 15, 2023).<p>

### Literature Review<p>
[9] Kadhim, R.R. and Kamil, M.Y., 2023. Comparison of machine learning models for breast cancer diagnosis. Int J Artif Intell, 12(1), pp.415-421.<p>
[10] Akbulut, S., Cicek, I.B. and Colak, C., 2022. Classification of Breast Cancer on the Strength of Potential Risk Factors with Boosting Models: A Public Health Informatics Application. Medical Bulletin of Haseki/Haseki Tip Bulteni, 60(3).<p>
[11] Sinha, N.K., Khulal, M., Gurung, M. and Lal, A., 2020. Developing a web based system for breast cancer prediction using xgboost classifier. Int J Eng Res, 9, pp.852-856.<p>
[12] Rahman, M.A., chandren Muniyandi, R., Albashish, D., Rahman, M.M. and Usman, O.L., 2021. Artificial neural network with Taguchi method for robust classification model to improve classification accuracy of breast cancer. PeerJ Computer Science, 7, p.e344.<p>
[13] Abdulkareem, S.A. and Abdulkareem, Z.O., 2021. An evaluation of the Wisconsin breast cancer dataset using ensemble classifiers and RFE feature selection. Int. J. Sci., Basic Appl. Res., 55(2), pp.67-80.<p>
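[14] Breast cancer (no date) World Health Organization. Available at: https://www.who.int/news-room/fact-sheets/detail/breast-cancer (Accessed: 25 July).<p>
[15] Fine needle aspiration biopsy of the breast (no date) American Cancer Society. Available at: https://www.cancer.org/cancer/types/breast-cancer/screening-tests-and-early-detection/breast-biopsy/fine-needle-aspiration-biopsy-of-the-breast.html (Accessed: 25 July).<p>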
[16] Breast cancer wisconsin (Diagnostic) (no date) UCI Machine Learning Repository. Available at: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic (Accessed: 27 July 2023). 
[17] Breast Cancer Wisconsin Data Analysis: Machine Learning Project | Exploratory Data Analysis (2020) YouTube. Available at: https://www.youtube.com/watch?v=2ncx2q5GHbQ (Accessed: 30 July 2023). ; Wisconsin Breast Cancer Dataset Python| how to build model in machine learning (2020) YouTube. Available at: https://www.youtube.com/watch?v=ShxCPedWCDk (Accessed: 30 July 2023). 
    
### Code

## END