# Programming for Data Analysis - Project 2

## Problem Statement

This project investigates the Wisconsin Breast Cancer dataset. The project requirements are
as follows:

- Undertake an analysis/review of the dataset and present an overview and background.
- Provide a literature review on classifiers which have been applied to the dataset and compare their performance
- Present a statistical analysis of the dataset
- Using a range of machine learning algorithms, train a set of classifiers on the dataset (using SKLearn etc.) and present classification performance results. Detail your rationale for the parameter selections you made while training the classifiers.
- Compare, contrast and critique your results with reference to the literature
- Discuss and investigate how the dataset could be extended – using data synthesis of new tumour datapoints
- Document your work in a Jupyter notebook.
- As a suggestion, you could use Pandas, Seaborn, SKLearn, etc. to perform your analysis.
- Please use GitHub to demonstrate research, progress and consistency.


## Project Overview
***

## Dataset Overview & Background
***

**Title:** Wisconsin Diagnostic Breast Cancer (WDBC) dataset

**Number of instances:** 569 

**Number of attributes:** 32 (ID, diagnosis, 30 real-valued input features)

**Attribute information:**

1) ID number

2) Diagnosis (M = malignant, B = benign)

3-32) Ten real-valued features are computed for each cell nucleus:

	a) radius (mean of distances from center to points on the perimeter)
	b) texture (standard deviation of gray-scale values)
	c) perimeter
	d) area
	e) smoothness (local variation in radius lengths)
	f) compactness (perimeter^2 / area - 1.0)
	g) concavity (severity of concave portions of the contour)
	h) concave points (number of concave portions of the contour)
	i) symmetry 
	j) fractal dimension ("coastline approximation" - 1)
   
The mean, standard error, and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features.  For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.

    
**Missing attribute values:** none

**Class distribution:** 357 benign, 212 malignant

## Libraries & Modules
***

In [2]:
import pandas as pd

# Numerical Arrays
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as ss

# Machine Learning
import sklearn as skl

## Dataset Import and Set Up
***

### Dataset Import

In [49]:
file = 'data/diagnostic.data'

data = pd.read_csv(file, header=None)
# print (data)

### Variables Set Up

In [50]:
columns = [
    'id',
    'diagnosis',
    'radius_mean', 
    'texture_mean', 
    'perimeter_mean',
    'area_mean',
    'smoothness_mean',
    'compactness_mean',
    'concavity_mean',
    'concave_points_mean',
    'symmetry_mean',
    'fractal_dimension_mean',
    'radius_se', # se = standard error
    'texture_se', 
    'perimeter_se',
    'area_se',
    'smoothness_se',
    'compactness_se',
    'concavity_se',
    'concave_points_se',
    'symmetry_se',
    'fractal_dimension_se',
    'radius_worst', # worst = mean of the three largest values
    'texture_worst', 
    'perimeter_worst',
    'area_worst',
    'smoothness_worst',
    'compactness_worst',
    'concavity_worst',
    'concave_points_worst',
    'symmetry_worst',
    'fractal_dimension_worst',
    
]

# Adding column names
data.columns = columns

# print(data.head())
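
The column list above can be cross-checked by generating the same 32 names from the ten base measurements and the three statistics. This is a minimal sketch that assumes the `<feature>_<statistic>` naming pattern used in this notebook; `base_features`, `statistics` and `generated_columns` are illustrative names and do not come from the dataset documentation.

```python
# Ten base measurements, in the order the dataset documentation lists them (a-j)
base_features = [
    'radius', 'texture', 'perimeter', 'area', 'smoothness',
    'compactness', 'concavity', 'concave_points', 'symmetry',
    'fractal_dimension',
]

# The three statistics, in file order; the suffixes follow this notebook's convention
statistics = ['mean', 'se', 'worst']

# All ten means come first, then all ten standard errors, then all ten "worst" values
feature_columns = [f'{feature}_{stat}' for stat in statistics for feature in base_features]
generated_columns = ['id', 'diagnosis'] + feature_columns

print(len(generated_columns))       # 32
print(len(set(generated_columns)))  # 32, so every name is unique
# Fields 3, 13 and 23 (1-based) are the three radius statistics
print(generated_columns[2], generated_columns[12], generated_columns[22])
```

Comparing `generated_columns` with the hand-written `columns` list (for example with `generated_columns == columns`) is one way to catch a duplicated or mistyped name before it reaches the data frame.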

In [51]:

#file2 = open('data/diagnostic.names', 'r')
#print(file2.read())

## Classifiers
***

- literature review
- performance

## Statistical Analysis of the Dataset
***

### Data Overview

In [52]:
print (data.sample(5))

          id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
215  8810987         M       13.860         16.93           90.96      578.9   
424   907145         B        9.742         19.12           61.93      289.7   
123   865432         B       14.500         10.89           94.28      640.7   
296   891936         B       10.910         12.35           69.14      363.7   
562   925622         M       15.220         30.62          103.40      716.9   

     smoothness_mean  compactness_mean  concavity_mean  concave_points_mean  \
215          0.10260           0.15170        0.099010              0.05602   
424          0.10750           0.08333        0.008934              0.01967   
123          0.11010           0.10990        0.088420              0.05778   
296          0.08518           0.04721        0.012360              0.01369   
562          0.10480           0.20870        0.255000              0.09429   

     ...  radius_worst  texture_worst  perim

### Basic Information

A basic overview of the data, including the number of entries, the number and names of the columns, the data types, and the null-value counts.

In [53]:
# info() prints its summary directly and returns None, so it is not wrapped in print()
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave_points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

#### Data Insights

1. There are 569 data entries in total.
2. The dataset does not contain any null values.
3. There are 32 columns of 3 data types:

    - 30 columns with float values
    - 1 column with integer values
    - 1 column with object values
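
The insights above can also be confirmed programmatically rather than read off the `info()` summary. The sketch below uses a small made-up frame called `sample` as a stand-in for the notebook's `data` (its values are illustrative only, not rows from the dataset); the same two expressions should apply unchanged to `data`.

```python
import pandas as pd

# Made-up stand-in for the notebook's `data` frame (values are illustrative only)
sample = pd.DataFrame({
    'id': [1, 2, 3],
    'diagnosis': ['M', 'B', 'B'],
    'radius_mean': [17.5, 12.1, 11.4],
    'texture_mean': [20.3, 15.8, 18.2],
})

# Total number of missing cells across the whole frame
total_missing = int(sample.isnull().sum().sum())

# Number of columns of each data type
dtype_counts = sample.dtypes.value_counts()

print(total_missing)
print(dtype_counts)
```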

### Diagnosis and Dataset Balance

In [57]:
# Grouping and adding up all instances of each diagnosis

print(data['diagnosis'].value_counts()) 

B    357
M    212
Name: diagnosis, dtype: int64


#### Data Insights

Of the 569 entries, 357 were categorised as benign and 212 as malignant.
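
The class counts can be turned into proportions to quantify the imbalance. This sketch rebuilds the counts by hand so that it runs on its own; in the notebook, `data['diagnosis'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Class counts as reported above
counts = pd.Series({'B': 357, 'M': 212}, name='diagnosis')

# Share of each class in the dataset
proportions = (counts / counts.sum()).round(3)
print(proportions)  # B 0.627, M 0.373
```

Roughly 63% of the samples are benign, so a classifier that always predicts "benign" would already reach about 63% accuracy. This is a useful baseline to keep in mind when reading classifier results.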

## Machine Learning
***

- Using a range of machine learning algorithms, train a set of classifiers on the dataset (using SKLearn etc.) and present classification performance results. Detail your rationale for the parameter selections you made while training the classifiers
- Compare, contrast and critique your results with reference to the literature
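
As a starting point for the first bullet, here is a minimal sketch of training and comparing several classifiers. It assumes scikit-learn's bundled copy of the WDBC data (`load_breast_cancer`) in place of `data/diagnostic.data`, so that the example is self-contained. The three models and their parameter values (`max_iter=1000`, `n_neighbors=5`, `n_estimators=100`) are illustrative defaults, not tuned choices or results from this project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled WDBC copy: 569 rows, 30 features.
# Note the label encoding here: 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

# Logistic regression and k-NN are sensitive to feature scale
# (area is in the hundreds, smoothness around 0.1), so they are
# wrapped in a pipeline that standardises the features first.
models = {
    'logistic_regression': make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    'knn': make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validation; for classifiers an integer `cv` uses stratified folds
scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    scores[name] = fold_scores.mean()
    print(f'{name}: {fold_scores.mean():.3f} (+/- {fold_scores.std():.3f})')
```

Putting the scaler inside the pipeline means it is refitted on each training fold, which avoids leaking information from the held-out fold. Because the classes are imbalanced, metrics such as recall for the malignant class may be worth reporting alongside accuracy.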

## Data Synthesis
***

- Discuss and investigate how the dataset could be extended – using data synthesis of new tumour datapoints
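
One simple way to explore data synthesis is to fit a multivariate normal distribution to each class and draw new rows from it. The sketch below assumes this Gaussian approach purely for illustration (it is not a method prescribed by the project brief), and again uses scikit-learn's bundled copy of the data; `synthesise` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Bundled WDBC copy; in scikit-learn's encoding 0 = malignant, 1 = benign
X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(seed=0)


def synthesise(X_class, n_samples, rng):
    """Draw new rows from a multivariate normal fitted to one class."""
    mean = X_class.mean(axis=0)
    # rowvar=False: columns are variables, rows are observations
    cov = np.cov(X_class, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)


# Generate 100 synthetic malignant tumour datapoints
synthetic_malignant = synthesise(X[y == 0], n_samples=100, rng=rng)
print(synthetic_malignant.shape)  # (100, 30)
```

Sampling from the full covariance matrix preserves the correlations between features, such as radius, perimeter and area. A Gaussian can, however, produce values that are physically impossible (for example, a negative area), so synthetic rows would need to be checked or clipped before use.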

## References
***

UCI (1995). *Breast Cancer Wisconsin (Diagnostic) Data Set.* https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)