# Breast cancer dataset preprocessing
## Assignment: 01 Machine Learning, downloading and preprocessing a dataset

Author: Muhammad Faizan

Registration: 400941

Subject: Machine learning


I downloaded this dataset from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29). It is used here for learning purposes: the raw data is preprocessed and then split into training, validation, and test sets.

The following topics are explored in this notebook:
 - downloading the data
 - reading the data
 - using Python libraries to check for missing values
 - preprocessing
 - data visualization
 - data splitting

In [175]:
# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import os
import random
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

### Reading the dataset from a directory on the local machine


In [208]:

def check_path(path):
    """Check whether a path exists.

    Args:
        path: str, path to the file or directory
    """
    # Check the path that was passed in
    if os.path.exists(path):
        print("Dataset exists!!")
    else:
        print("Oops!! dataset doesn't exist, please download it")

In [209]:
# %ls

In [210]:
data_dir = "data"
breast_cancer_dataset = os.path.join(data_dir, "breast-cancer-wisconsin.data")
check_path(breast_cancer_dataset)

Dataset exists!!


In [211]:
data =  pd.read_csv(breast_cancer_dataset, sep=",")
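
The raw UCI file appears to have no header row. In that case the default `pd.read_csv` call treats the first record as column names, which would explain why this notebook reports 698 rows while the dataset documentation lists 699 instances. A minimal sketch of an alternative read, assuming the same column layout: the first two records are taken from the `data.head()` output in this notebook, the third is a made-up record with a `?`, and the shortened `column_names` list is for illustration only.

```python
import io

import pandas as pd

# In-memory stand-in for breast-cancer-wisconsin.data (illustrative sample only).
raw = io.StringIO(
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "1015425,3,1,1,1,2,2,3,1,1,2\n"
    "9999999,8,4,5,1,2,?,7,3,1,4\n"  # hypothetical record with a missing value
)

column_names = [
    "Sample code number", "Clump Thickness", "Uniformity of Cell Size",
    "Uniformity of Cell Shape", "Marginal Adhesion",
    "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin",
    "Normal Nucleoli", "Mitoses", "Class",
]

# header=None keeps the first record as data; names= assigns the columns;
# na_values="?" turns the placeholder into NaN while parsing.
sample = pd.read_csv(raw, header=None, names=column_names, na_values="?")

print(sample.shape)
print(sample["Bare Nuclei"].isna().sum())
```

Reading the file this way would also make the later `"?"` replacement step unnecessary, since the placeholder is already parsed as NaN.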


### Adding column names to the data from the names file

In [212]:
data.columns = ["Sample code number", "Clump Thickness", "Uniformity of Cell Size", "Uniformity of Cell Shape"
               , "Marginal Adhesion", "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli",
               "Mitoses", "Class"]

print("shape of data: {}".format(data.shape))
# print("data attributes info: {}".format(data.columns))
data.head()

shape of data: (698, 11)


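| | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 1 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 2 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 3 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| 4 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |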


### Replacing "?" with NaN

In [213]:

print("checking for missing values...")
data = data.replace(to_replace=r'\?', value=np.nan, regex=True)
print(f"Number of Na values in the dataset: {data.isna().sum()}")


checking for missing values...
Number of Na values in the dataset: Sample code number              0
Clump Thickness                 0
Uniformity of Cell Size         0
Uniformity of Cell Shape        0
Marginal Adhesion               0
Single Epithelial Cell Size     0
Bare Nuclei                    16
Bland Chromatin                 0
Normal Nucleoli                 0
Mitoses                         0
Class                           0
dtype: int64
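
Replacing `"?"` with NaN leaves the affected column stored as strings (pandas `object` dtype), because it was read as text in the first place. A small sketch of converting such a column to numbers and counting the gaps; the `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one text column containing the "?" placeholder, one numeric column.
toy = pd.DataFrame({
    "Bare Nuclei": ["10", "2", "?", "1"],
    "Mitoses": [1, 1, 2, 1],
})

# errors="coerce" converts anything unparseable (here "?") to NaN
# and yields a float column in one step.
toy["Bare Nuclei"] = pd.to_numeric(toy["Bare Nuclei"], errors="coerce")

missing_per_column = toy.isna().sum()
print(toy.dtypes)
print(missing_per_column)
```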


### Separate features from labels and split the dataset into train and test sets


In [214]:
X_features = data.iloc[:, :10]
print(f"Features shape: {X_features.shape}")


y_labels = data.iloc[:, 10:]
print(f"labels or targets shape: {y_labels.shape}")

Features shape: (698, 10)
labels or targets shape: (698, 1)
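
The positional slice `data.iloc[:, :10]` keeps "Sample code number" among the features. That column is an identifier rather than a measurement, so one option is to select features by name and leave it out. The sketch below assumes that choice; the `toy` frame is a cut-down, illustrative version of the data, not what the cells in this notebook do.

```python
import pandas as pd

# Cut-down illustration with only a few of the original columns.
toy = pd.DataFrame({
    "Sample code number": [1002945, 1015425, 1016277],
    "Clump Thickness": [5, 3, 6],
    "Mitoses": [1, 1, 1],
    "Class": [2, 2, 2],
})

# Drop the identifier and the label by name instead of by position.
X_toy = toy.drop(columns=["Sample code number", "Class"])
y_toy = toy["Class"]

print(X_toy.columns.tolist())
print(X_toy.shape, y_toy.shape)
```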


In [215]:
X_train, X_test, Y_train, Y_test = train_test_split(X_features, y_labels, test_size = 0.20, random_state = 0)

In [216]:
# check the shapes of the resulting splits
print("Splitting the dataset into training and test sets....")

print("Training features shape: {}".format(X_train.shape))
print("Training targets shape: {}".format(Y_train.shape))
print("Test features shape: {}".format(X_test.shape))
print("Test targets shape: {}".format(Y_test.shape))

Splitting the dataset into training and test sets....
Training features shape: (558, 10)
Training targets shape: (558, 1)
Test features shape: (140, 10)
Test targets shape: (140, 1)
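
The introduction mentions training, validation, and test sets, while the cell above produces only two splits. One common way to get a third is to call `train_test_split` twice. This is a sketch on synthetic data; the 60/20/20 proportions and the use of `stratify` are assumptions, not choices made in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 2 features, balanced labels 2 and 4.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([2, 4] * 50)

# First split: hold out 20% as the test set.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

# Second split: 25% of the remaining 80% becomes the validation set,
# which is 20% of the full data.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

print(X_tr.shape, X_val.shape, X_te.shape)
```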


### Normalizing the features to have zero mean and unit standard deviation
Feature scaling puts all features on the same scale. This matters most for gradient-based training, where it typically speeds up convergence and can improve the model's performance.

In [217]:
def normalize(X):
    """Normalize each feature (column) to zero mean and unit standard deviation.
    
    Args:
        X: np.ndarray 
    
    Returns:
        X_normalized: np.ndarray"""
    mean = np.mean(X, axis = 0)
    std = np.std(X, axis = 0)
    X_normalized = (X - mean)/std
    return X_normalized
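
The function above computes the z-score z = (x - mean) / std for each column. A small worked example on made-up numbers shows the effect; `zscore` is a hypothetical name for a copy of the same computation, so that the snippet runs on its own.

```python
import numpy as np


def zscore(X):
    """Column-wise z-score, mirroring normalize() (hypothetical helper name)."""
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)  # population standard deviation (ddof=0), NumPy's default
    return (X - mean) / std


# Two columns on very different scales.
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])

Z = zscore(X_demo)
print(Z)
print(Z.mean(axis=0), Z.std(axis=0))
```

After scaling, both columns contain the same values, since they differed only by a factor of 10. Note that a constant column has a standard deviation of 0 and would lead to a division by zero in this form.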

### Impute missing values with the mean of each column

In [218]:

def interpolate_missing_values(X):
    """Impute NaN values with the column mean using sklearn's SimpleImputer.
    
    Args:
        X: pd.DataFrame, input data
    Returns:
        X_dot: np.ndarray, transformed data"""
    
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    imputer = imputer.fit(X)

    X_dot = imputer.transform(X)
    return X_dot

In [219]:
X_train = interpolate_missing_values(X_train)
X_test  = interpolate_missing_values(X_test)

Y_train = Y_train.to_numpy()
Y_test = Y_test.to_numpy()
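
The cell above fits a separate imputer on each split, so the test set is filled with its own column means. A common alternative is to fit the imputer on the training data only and reuse those means for the test data. This is a minimal sketch on made-up arrays, not what the cell above does.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training and test arrays with missing entries.
X_tr_demo = np.array([[1.0, 2.0],
                      [3.0, np.nan],
                      [5.0, 6.0]])
X_te_demo = np.array([[np.nan, 10.0],
                      [7.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Learn the column means from the training split only...
X_tr_filled = imputer.fit_transform(X_tr_demo)
# ...and apply those same means to the test split.
X_te_filled = imputer.transform(X_te_demo)

print(imputer.statistics_)
print(X_te_filled)
```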

### Perform feature scaling 

In [188]:
print("before feature scaling")
print("Standard deviation and mean of X_train: {}, {}".format(X_train.std(), X_train.mean()))
print("Standard deviation and mean of X_test: {}, {}".format(X_test.std(), X_test.mean()))
print("\n")

X_train = normalize(X_train)
X_test = normalize(X_test)

print('After feature scaling')
print("Standard deviation and mean of X_train: {}, {}".format(X_train.std(), X_train.mean()))
print("Standard deviation and mean of X_test: {}, {}".format(X_test.std(), X_test.mean()))

before feature scaling
Standard deviation and mean of X_train: 388143.3019753379, 108078.07236479931
Standard deviation and mean of X_test: 323786.51788750227, 103618.00422918807


After feature scaling
Standard deviation and mean of X_train: 1.0, -8.91361854896183e-18
Standard deviation and mean of X_test: 1.0, -8.881784197001253e-18
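
The cells above scale the test set with its own mean and standard deviation, and the printed statistics are taken over all elements rather than per column. A sketch of an alternative, assuming scikit-learn's `StandardScaler` is acceptable here: the scaler is fitted on the training data and reused for the test data, and the statistics are checked per column. The data is synthetic.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on two very different scales.
rng = np.random.default_rng(0)
X_tr_demo = rng.normal(loc=[5.0, 1000.0], scale=[2.0, 300.0], size=(80, 2))
X_te_demo = rng.normal(loc=[5.0, 1000.0], scale=[2.0, 300.0], size=(20, 2))

# Fit on the training split, then apply the same transform to both splits.
scaler = StandardScaler().fit(X_tr_demo)
X_tr_scaled = scaler.transform(X_tr_demo)
X_te_scaled = scaler.transform(X_te_demo)

# Per-column statistics (axis=0) instead of one number for the whole array.
print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
print(X_te_scaled.mean(axis=0), X_te_scaled.std(axis=0))
```

With this approach the training columns come out with mean 0 and standard deviation 1, while the test columns are only approximately so, because they are scaled with the training statistics.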
