# AML Coursework assessment

## Describing, visualising, and transforming data

### Wisconsin Diagnostic Breast Cancer Dataset

The Wisconsin Diagnostic Breast Cancer dataset describes measurements of cell nuclei taken from suspicious lumps in a woman's breast. The dataset, obtained from the University of Wisconsin Hospitals, includes 30 feature variables and 1 target variable. The features (input attributes) are numeric, and the target variable has two possible values: M for malignant and B for benign.

Ten features are computed for each cell nucleus:

- Radius (mean of distances from center to points on the perimeter)
- Texture (standard deviation of gray-scale values)
- Perimeter
- Area
- Smoothness (local variation in radius lengths)
- Compactness (perimeter^2 / area - 1.0)
- Concavity (severity of concave portions of the contour)
- Concave points (number of concave portions of the contour)
- Symmetry 
- Fractal dimension ("coastline approximation" - 1)
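
The 30 feature columns follow from these ten measurements: the dataset documentation reports three statistics per measurement (the mean, the standard error, and the "worst" value). The sketch below rebuilds the expected column names. It assumes the `ucimlrepo` naming visible later in this notebook (`radius1`, ..., `radius3`) and, as a further assumption, that the suffixes 1, 2, 3 stand for mean, standard error, and worst in that order. The helper `expected_columns` is hypothetical.

```python
# Base measurement names as they appear in the ucimlrepo column labels.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]

# Assumption: suffix 1 = mean, 2 = standard error, 3 = worst.
SUFFIX_MEANING = {1: "mean", 2: "standard error", 3: "worst"}


def expected_columns():
    """Return the 30 expected feature column names, grouped by statistic."""
    return [
        f"{name}{suffix}"
        for suffix in SUFFIX_MEANING
        for name in BASE_FEATURES
    ]


columns = expected_columns()
print(len(columns))
print(columns[:3], columns[-3:])
```

Once the data is loaded, comparing this list with `list(X.columns)` is a quick way to confirm that the naming assumption holds.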

Next, we load the dataset so we can examine it in detail. First, we install the required libraries.

In [None]:
!pip3 install matplotlib pandas numpy

In [None]:
!pip3 install -U ucimlrepo

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17) 
  
# data (as pandas dataframes) 
X = breast_cancer_wisconsin_diagnostic.data.features 
y = breast_cancer_wisconsin_diagnostic.data.targets 

data_combined = pd.concat([X, y], axis=1)
  
# metadata 
#print(breast_cancer_wisconsin_diagnostic.metadata) 
  
# variable information 
#print(breast_cancer_wisconsin_diagnostic.variables)



### Checking the data

We'll preview the first few rows to get a first look at the data.

In [None]:
# Print the first few rows of the data
#print("\nFirst few rows of X (features):")
#print(X.head())

#print("\nFirst few rows of y (targets):")
#print(y.head())

print("\nFirst few rows of Features and Target combined")
print(data_combined.head())

### The dimensionality of the dataset

The data loaded successfully from the UCI Machine Learning Repository. Next, we check the dimensionality of the dataset: the number of rows and the number of columns (features and target variable).

In [None]:
# Print the shape of the data

print("Shape of X (features):", X.shape)
print("Shape of y (targets):", y.shape)
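
Beyond printing the shapes, it can be worth confirming that features and targets line up row for row and that nothing is missing. The sketch below uses a tiny made-up frame in place of the real `X` and `y` so that it runs on its own. The helper name `check_alignment` and the toy values are hypothetical.

```python
import pandas as pd


def check_alignment(features: pd.DataFrame, targets: pd.DataFrame) -> dict:
    """Summarise row/column counts and fail early if row counts differ."""
    n_rows, n_features = features.shape
    if n_rows != targets.shape[0]:
        raise ValueError("features and targets have different row counts")
    n_missing = features.isna().sum().sum() + targets.isna().sum().sum()
    return {
        "rows": n_rows,
        "features": n_features,
        "targets": targets.shape[1],
        "missing_values": int(n_missing),
    }


# Toy stand-ins for X and y (values are invented for illustration only).
X_toy = pd.DataFrame({"radius1": [14.1, 20.3, 11.7],
                      "texture1": [19.3, 22.0, 16.2]})
y_toy = pd.DataFrame({"Diagnosis": ["B", "M", "B"]})

summary = check_alignment(X_toy, y_toy)
print(summary)
```

With the real data, the same call would be `check_alignment(X, y)`.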


### Statistical Summary

Let's retrieve the statistical properties of each attribute in order to get some insights.

In [32]:
pd.set_option('display.width', 100)
pd.set_option('display.precision', 3)

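# Summary statistics for the features and for the target
descriptionX = X.describe()
descriptionY = y.describe()
# describe() summarises numeric columns only by default,
# so the 'Diagnosis' column does not appear in this table
descriptionDataCombined = data_combined.describe()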

#print(descriptionX)
#print(descriptionY)

print(descriptionDataCombined)

       radius1  texture1  perimeter1     area1  smoothness1  compactness1  concavity1  \
count  569.000   569.000     569.000   569.000      569.000       569.000     569.000   
mean    14.127    19.290      91.969   654.889        0.096         0.104       0.089   
std      3.524     4.301      24.299   351.914        0.014         0.053       0.080   
min      6.981     9.710      43.790   143.500        0.053         0.019       0.000   
25%     11.700    16.170      75.170   420.300        0.086         0.065       0.030   
50%     13.370    18.840      86.240   551.100        0.096         0.093       0.062   
75%     15.780    21.800     104.100   782.700        0.105         0.130       0.131   
max     28.110    39.280     188.500  2501.000        0.163         0.345       0.427   

       concave_points1  symmetry1  fractal_dimension1  ...  radius3  texture3  perimeter3  \
count          569.000    569.000             569.000  ...  569.000   569.000     569.000   
mean        
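
One insight visible in the summary is that the features sit on very different scales: `area1` has a mean of about 655, while `smoothness1` has a mean of about 0.096. If a scale-sensitive model is used later, standardising (z-scoring) each feature is one common option. A minimal sketch, using only the `area1` and `smoothness1` statistics printed above; the helper `z_score` is hypothetical and not part of the original notebook.

```python
# Mean, standard deviation, and maximum copied from the describe() output.
SUMMARY = {
    "area1": {"mean": 654.889, "std": 351.914, "max": 2501.0},
    "smoothness1": {"mean": 0.096, "std": 0.014, "max": 0.163},
}


def z_score(value: float, mean: float, std: float) -> float:
    """Return how many standard deviations `value` lies from `mean`."""
    return (value - mean) / std


for name, stats in SUMMARY.items():
    z_max = z_score(stats["max"], stats["mean"], stats["std"])
    print(f"{name}: raw max = {stats['max']}, standardised max = {z_max:.2f}")
```

After standardising, the two maxima become comparable in magnitude even though the raw values differ by several orders of magnitude.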

### Class Distribution

As this is a classification problem, we need to check how many observations we have for each class. If the classes are highly imbalanced, we'll need to address this during data pre-processing.

In [31]:
class_counts = data_combined.groupby('Diagnosis').size()
print(class_counts)

total = class_counts.sum()
proportion_B = class_counts['B'] / total
proportion_M = class_counts['M'] / total

print(f"\nProportion of class B: {proportion_B:.2f}")
print(f"Proportion of class M: {proportion_M:.2f}")

Diagnosis
B    357
M    212
dtype: int64

Proportion of class B: 0.63
Proportion of class M: 0.37


We can see that 63% of the dataset instances are benign and 37% are malignant, which represents a moderate class imbalance.
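
To put a number on the imbalance, the counts printed above can be turned into a majority-to-minority ratio and a majority-class baseline (the accuracy obtained by always predicting the most frequent class). A minimal sketch using only the standard library; the variable names are illustrative.

```python
# Class counts copied from the output above.
class_counts = {"B": 357, "M": 212}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

majority = max(class_counts, key=class_counts.get)
minority = min(class_counts, key=class_counts.get)
imbalance_ratio = class_counts[majority] / class_counts[minority]

# Always predicting the majority class yields this accuracy.
baseline_accuracy = proportions[majority]

print(f"Total: {total}")
for label, p in proportions.items():
    print(f"{label}: {p:.1%}")
print(f"Imbalance ratio ({majority}:{minority}): {imbalance_ratio:.2f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained on this data should at least beat the majority-class baseline, which makes it a useful reference point when accuracy is reported.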