## 1 

Import the libraries we need.

- We know our file is a `.csv`, so let's import `pandas`.
- I will also import a few more libraries I expect to use. If it is not yet clear that you need them, don't import them; you can go back and add them once it becomes clear.


In [1]:
import pandas as pd
import numpy as np
import torch 

Import data and look at it.

In [2]:
df = pd.read_csv('breast_cancer_dataset_raw_manipulated.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,diagnosis
0,0,4021,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,benign
1,1,4664,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,benign
2,2,2967,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,benign
3,3,1289,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,benign
4,4,3968,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,benign


In [4]:
df.tail()

Unnamed: 0.1,Unnamed: 0,id,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,diagnosis
636,636,1507,14.61,15.69,92.68,664.9,0.07618,0.03515,0.01447,0.01877,...,21.75,103.7,840.8,0.1011,0.07087,0.04746,0.05813,0.253,0.05695,malignant
637,637,3129,10.29,27.61,65.67,321.4,0.0903,0.07658,0.05999,0.02738,...,34.91,69.57,357.6,0.1384,0.171,0.2,0.09127,0.2226,0.08283,malignant
638,638,2003,14.26,18.17,91.22,633.1,0.06576,0.0522,0.02475,0.01374,...,25.26,105.8,819.7,0.09445,0.2167,0.1565,0.0753,0.2636,0.07676,malignant
639,639,4844,11.25,14.78,71.38,390.0,0.08306,0.04458,0.000974,0.002941,...,22.06,82.08,492.7,0.1166,0.09794,0.005518,0.01667,0.2815,0.07418,malignant
640,640,3013,20.94,23.56,138.9,1364.0,0.1007,0.1606,0.2712,0.131,...,27.0,165.3,2010.0,0.1211,0.3172,0.6991,0.2105,0.3126,0.07849,benign


We can't see all the columns, so let's get the name of each one.

In [5]:
df.columns

Index(['Unnamed: 0', 'id', 'mean radius', 'mean texture', 'mean perimeter',
       'mean area', 'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension',
       'diagnosis'],
      dtype='object')

## 2

Let's note some observations.

- `Unnamed: 0` seems to be a duplicate of the index column.
- From the `README.txt` we know the `id` column is not going to help us with classification.
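
One plausible origin for a column like `Unnamed: 0` is a CSV that was written with its index and then read back without declaring it. The sketch below reproduces that on a made-up two-row frame (the values and the in-memory buffer are illustrative, not the real file) and shows that `index_col=0` avoids the extra column at read time.

```python
import io

import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({"id": [4021, 4664], "mean radius": [17.99, 20.57]})

# Writing with the default index=True stores the index as an unnamed first column.
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading it back naively turns that column into a feature named 'Unnamed: 0'.
buffer.seek(0)
naive = pd.read_csv(buffer)
print(naive.columns.tolist())

# Passing index_col=0 tells pandas to use that column as the index instead.
buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)
print(clean.columns.tolist())
```

Dropping the column after loading, as we do later, gives the same set of feature columns.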

Let's keep looking around and check the `diagnosis` column.

In [6]:
df['diagnosis'].unique()

array(['benign', 'malignant', nan], dtype=object)

Let's get the count of each "class".

In [7]:
counts = []
for diagnosis_type in df['diagnosis'].unique():
    count = (df['diagnosis'] == diagnosis_type).sum()
    counts.append(count.item())
    print(f"{diagnosis_type} has {count} occurrence")

print(f"\nTotal counts is: {sum(counts)}")
print(f"df is of size: {len(df)}")

benign has 239 occurrence
malignant has 399 occurrence
nan has 0 occurrence

Total counts is: 638
df is of size: 641
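
As a side note, pandas can produce these counts in one call. Here is a minimal sketch on a made-up label column (the `labels` Series is illustrative): `value_counts(dropna=False)` also counts missing values, so the total matches the length of the column.

```python
import numpy as np
import pandas as pd

# Made-up labels, including one missing value.
labels = pd.Series(["benign", "malignant", np.nan, "malignant"], name="diagnosis")

# dropna=False keeps NaN as its own category in the counts.
counts = labels.value_counts(dropna=False)
print(counts)

# With NaN included, the counts add up to the number of rows.
total = int(counts.sum())
print(total, len(labels))
```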


Interesting. Let's look into the `nan` and figure out why we are not counting it.

In [8]:
for diagnosis_type in df['diagnosis'].unique():
    print(f"{diagnosis_type} is of type: {type(diagnosis_type)}")

benign is of type: <class 'str'>
malignant is of type: <class 'str'>
nan is of type: <class 'float'>


In [9]:
type(np.nan)

float

Let's see if the base case checks out.

In [10]:
np.nan == np.nan

False


This makes sense now.

We don't catch `np.nan` in the for loop above because `np.nan != np.nan`, so `df['diagnosis'] == np.nan` is always `False`.

`np.nan != np.nan` because NaN (Not a Number) is defined to be unequal to everything, including itself.
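
Since equality cannot detect NaN, the usual approach is a dedicated check. A small sketch of the options, using a made-up Series `s` for illustration:

```python
import math

import numpy as np
import pandas as pd

value = np.nan

print(value == value)     # False: NaN never equals itself
print(math.isnan(value))  # True
print(pd.isna(value))     # True

# The same idea on a column: == finds nothing, isna() finds the missing entry.
s = pd.Series(["benign", np.nan, "malignant"])
n_equal = int((s == np.nan).sum())
n_missing = int(s.isna().sum())
print(n_equal, n_missing)
```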


In [11]:
print(f"We are missing: {len(df) - sum(counts)} values. ")

We are missing: 3 values. 


In [12]:
df['diagnosis'].isna().sum()

np.int64(3)


When we make our classification dataset, we can't have rows without a class, so we must drop these rows.

In [13]:
df = df[ ~df['diagnosis'].isna()]
len(df), df['diagnosis'].isna().sum()

(638, np.int64(0))
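
An equivalent way to write this filter is `dropna` with `subset`, which only looks at the named column. A minimal sketch on a made-up frame (`toy` and its values are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "mean radius": [17.99, np.nan, 11.42],
        "diagnosis": ["benign", "malignant", np.nan],
    }
)

# Only rows with a missing label are removed; a NaN in a feature column is kept.
labeled = toy.dropna(subset=["diagnosis"])
print(len(labeled))
print(labeled["mean radius"].isna().sum())
```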

## 3

Since `Unnamed: 0` is treated like another feature instead of the index that it is, let's go ahead and drop it.

In [14]:
df.columns

Index(['Unnamed: 0', 'id', 'mean radius', 'mean texture', 'mean perimeter',
       'mean area', 'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension',
       'diagnosis'],
      dtype='object')

In [15]:
df = df.drop(columns=['Unnamed: 0'])
df

Unnamed: 0,id,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,diagnosis
0,4021,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.300100,0.147100,0.2419,...,17.33,184.60,2019.0,0.16220,0.66560,0.711900,0.26540,0.4601,0.11890,benign
1,4664,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.086900,0.070170,0.1812,...,23.41,158.80,1956.0,0.12380,0.18660,0.241600,0.18600,0.2750,0.08902,benign
2,2967,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.197400,0.127900,0.2069,...,25.53,152.50,1709.0,0.14440,0.42450,0.450400,0.24300,0.3613,0.08758,benign
3,1289,11.42,20.38,77.58,386.1,0.14250,0.28390,0.241400,0.105200,0.2597,...,26.50,98.87,567.7,0.20980,0.86630,0.686900,0.25750,0.6638,0.17300,benign
4,3968,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.198000,0.104300,0.1809,...,16.67,152.20,1575.0,0.13740,0.20500,0.400000,0.16250,0.2364,0.07678,benign
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
636,1507,14.61,15.69,92.68,664.9,0.07618,0.03515,0.014470,0.018770,0.1632,...,21.75,103.70,840.8,0.10110,0.07087,0.047460,0.05813,0.2530,0.05695,malignant
637,3129,10.29,27.61,65.67,321.4,0.09030,0.07658,0.059990,0.027380,0.1593,...,34.91,69.57,357.6,0.13840,0.17100,0.200000,0.09127,0.2226,0.08283,malignant
638,2003,14.26,18.17,91.22,633.1,0.06576,0.05220,0.024750,0.013740,0.1635,...,25.26,105.80,819.7,0.09445,0.21670,0.156500,0.07530,0.2636,0.07676,malignant
639,4844,11.25,14.78,71.38,390.0,0.08306,0.04458,0.000974,0.002941,0.1773,...,22.06,82.08,492.7,0.11660,0.09794,0.005518,0.01667,0.2815,0.07418,malignant


## 4

Let's check for duplicate rows.

In [16]:
df.duplicated().sum()

np.int64(15)

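This is where you ask yourself a question that is specific to your dataset:
- *"Does having duplicates make sense in my context?"*

For us duplicates don't make sense, so they should be dropped, which is what `df.drop_duplicates()` does. Note that the cell below, as recorded, calls `df.dropna(axis=0)` instead: that removes rows containing a missing value, not duplicate rows, so the row count reported next reflects the removal of rows with NaNs rather than of duplicates.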

In [17]:
df = df.dropna(axis=0)

In [18]:
len(df)

584
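
For reference, here is a minimal sketch of removing duplicate rows with `drop_duplicates`, on a made-up frame (`toy` is illustrative and not the real data). With the default `keep='first'`, it removes exactly the rows that `duplicated()` flags.

```python
import pandas as pd

# Made-up frame where the second and third rows are identical.
toy = pd.DataFrame(
    {
        "id": [1, 2, 2, 3],
        "mean radius": [17.99, 20.57, 20.57, 11.42],
    }
)

n_dupes = int(toy.duplicated().sum())

# Keeps the first copy of each duplicated row and drops the rest.
deduped = toy.drop_duplicates()
print(n_dupes, len(toy), len(deduped))
```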

## 5

Let's check for missing data:
- Check across rows (missing values per column).
- Check across columns (rows that contain a missing value).


**5.1** Check across rows.

In [19]:
print(f"column id : # missing rows\n")
for col in df.columns:
    print(f"{col} : {df[col].isnull().sum()}")
    # print("-"* 50)

column id : # missing rows

id : 0
mean radius : 0
mean texture : 0
mean perimeter : 0
mean area : 0
mean smoothness : 0
mean compactness : 0
mean concavity : 0
mean concave points : 0
mean symmetry : 0
mean fractal dimension : 0
radius error : 0
texture error : 0
perimeter error : 0
area error : 0
smoothness error : 0
compactness error : 0
concavity error : 0
concave points error : 0
symmetry error : 0
fractal dimension error : 0
worst radius : 0
worst texture : 0
worst perimeter : 0
worst area : 0
worst smoothness : 0
worst compactness : 0
worst concavity : 0
worst concave points : 0
worst symmetry : 0
worst fractal dimension : 0
diagnosis : 0
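
The same per-column count is available without a loop. A minimal sketch on a made-up frame (`toy` is illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [np.nan, np.nan, 6.0],
        "c": [7.0, 8.0, 9.0],
    }
)

# Number of missing values in each column.
per_column = toy.isna().sum()
print(per_column)

# Show only the columns that actually have missing values.
print(per_column[per_column > 0])
```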


**5.2** Check across columns.

In [20]:
(df.isnull().sum(axis=1) > 0).sum()

np.int64(0)

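The count is 0 here because the rows containing a NaN were already removed by the `dropna` call in cell 17, which took the DataFrame from 638 rows to 584 (54 rows dropped).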

There are many ways to deal with missing data:
- Fill with the mean/median
- Forward fill (`ffill`): replace NaN values with the previous non-NaN value in the same column
- Backward fill (`bfill`): replace NaN values with the next non-NaN value in the same column
- Drop the rows
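
To make the options concrete, here is a small sketch applying each one to a made-up Series (`s` and its values are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Fill with the mean of the non-missing values (here 3.0).
mean_filled = s.fillna(s.mean())

# Forward fill: carry the previous valid value forward.
ffilled = s.ffill()

# Backward fill: pull the next valid value backward.
bfilled = s.bfill()

# Drop the missing entries entirely.
dropped = s.dropna()

print(mean_filled.tolist())
print(ffilled.tolist())
print(bfilled.tolist())
print(dropped.tolist())
```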


We will opt for dropping the rows.

In [21]:
df = df.dropna(axis=0)
df

Unnamed: 0,id,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,diagnosis
0,4021,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.300100,0.147100,0.2419,...,17.33,184.60,2019.0,0.16220,0.66560,0.711900,0.26540,0.4601,0.11890,benign
1,4664,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.086900,0.070170,0.1812,...,23.41,158.80,1956.0,0.12380,0.18660,0.241600,0.18600,0.2750,0.08902,benign
2,2967,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.197400,0.127900,0.2069,...,25.53,152.50,1709.0,0.14440,0.42450,0.450400,0.24300,0.3613,0.08758,benign
3,1289,11.42,20.38,77.58,386.1,0.14250,0.28390,0.241400,0.105200,0.2597,...,26.50,98.87,567.7,0.20980,0.86630,0.686900,0.25750,0.6638,0.17300,benign
4,3968,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.198000,0.104300,0.1809,...,16.67,152.20,1575.0,0.13740,0.20500,0.400000,0.16250,0.2364,0.07678,benign
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
636,1507,14.61,15.69,92.68,664.9,0.07618,0.03515,0.014470,0.018770,0.1632,...,21.75,103.70,840.8,0.10110,0.07087,0.047460,0.05813,0.2530,0.05695,malignant
637,3129,10.29,27.61,65.67,321.4,0.09030,0.07658,0.059990,0.027380,0.1593,...,34.91,69.57,357.6,0.13840,0.17100,0.200000,0.09127,0.2226,0.08283,malignant
638,2003,14.26,18.17,91.22,633.1,0.06576,0.05220,0.024750,0.013740,0.1635,...,25.26,105.80,819.7,0.09445,0.21670,0.156500,0.07530,0.2636,0.07676,malignant
639,4844,11.25,14.78,71.38,390.0,0.08306,0.04458,0.000974,0.002941,0.1773,...,22.06,82.08,492.7,0.11660,0.09794,0.005518,0.01667,0.2815,0.07418,malignant


In [22]:
len(df)

584

## 6

Let's check the data types.

In [23]:
df.dtypes

id                           int64
mean radius                float64
mean texture               float64
mean perimeter             float64
mean area                  float64
mean smoothness            float64
mean compactness           float64
mean concavity             float64
mean concave points        float64
mean symmetry              float64
mean fractal dimension     float64
radius error               float64
texture error              float64
perimeter error            float64
area error                 float64
smoothness error           float64
compactness error          float64
concavity error            float64
concave points error       float64
symmetry error             float64
fractal dimension error    float64
worst radius               float64
worst texture              float64
worst perimeter            float64
worst area                 float64
worst smoothness           float64
worst compactness          float64
worst concavity            float64
worst concave points

Those look good.
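
Instead of reading a long dtype listing by eye, one option is to ask pandas for the columns that are not numeric. A minimal sketch on a made-up frame (`toy` is illustrative); in a dataset like ours we would expect only the label column to show up.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "id": [4021, 4664],
        "mean radius": [17.99, 20.57],
        "diagnosis": ["benign", "malignant"],
    }
)

# Columns whose dtype is not a number (int, float, ...).
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```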

## 7
Great, now that our data is clean, let's make `X` and `y`.

1. We don't need `id` in `X`
2. We need `y` to be 0/1, not "benign"/"malignant"

**7.1**

Make the `X` matrix:

- Drop the `id` and `diagnosis` columns

- Turn into numpy array

In [24]:
X = df.drop(columns=['id', 'diagnosis']).to_numpy()
X.shape

(584, 30)

**7.2** 

Make the `y` vector:
- Use just column `diagnosis`

- Convert to be 0 or 1

- Turn into numpy array

In [25]:
y = df['diagnosis'].apply(lambda x: 0 if x == 'benign' else 1).to_numpy()
# y = np.where(df['diagnosis'] == 'benign', 0, 1)  # alternative: np.where already returns a NumPy array
y.shape

(584,)
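
One caveat with the `lambda` above: anything that is not exactly `'benign'` becomes 1, including typos. A possible alternative is an explicit mapping, sketched here on made-up labels (the `labels` Series and the misspelled entry are illustrative). Values missing from the mapping become NaN, so they are easy to spot.

```python
import pandas as pd

# Made-up labels with one malformed entry.
labels = pd.Series(["benign", "malignant", "Malignant ", "benign"])

mapping = {"benign": 0, "malignant": 1}

# Labels not present in the mapping are turned into NaN instead of silently becoming 1.
encoded = labels.map(mapping)

# Inspect the raw values that failed to map.
unexpected = labels[encoded.isna()]
print(unexpected.tolist())
```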

In [26]:
print(f"Shape of X: {X.shape}  and shape of y: {y.shape}")

Shape of X: (584, 30)  and shape of y: (584,)


**7.3**

Normalize the data. Two common options:
- Standardize to mean 0 and std 1
- Min-max scale to the range 0-1

We compute both below and keep the standardized version (`X = X_normalized_std`).

In [27]:
X_normalized_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) # min max
X_normalized_std = (X - X.mean(axis=0)) / (X.std(axis=0)) #mean 0 std 1

X = X_normalized_std
# from sklearn.preprocessing import MinMaxScaler

# scaler = MinMaxScaler()
# X_normalized = scaler.fit_transform(X)

**7.4**

Split them up 80/20 for Train/Test.

In [28]:
#shuffle the data first
np.random.seed(42) # for reproducibility
perm_idxs = np.random.permutation(len(X))
X = X[perm_idxs]
y = y[perm_idxs]

In [29]:
#split them 
train_size = int(len(X) * 0.8 )
X_train, y_train = X[:train_size], y[:train_size]
X_test, y_test = X[train_size:], y[train_size:]

In [30]:
print(f"Shape X train: {X_train.shape}  Shape y train: {y_train.shape}")
print(f"Shape X test: {X_test.shape}  Shape y test: {y_test.shape}")


Shape X train: (467, 30)  Shape y train: (467,)
Shape X test: (117, 30)  Shape y test: (117,)
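
The cells above standardize `X` with statistics from the full dataset before splitting. A common variant, sketched here on synthetic data, computes the mean and std on the training rows only and reuses them for the test rows, so that no information from the test set enters the preprocessing. The names `X_raw`, `X_train_raw`, and `X_test_raw` are illustrative and not part of this notebook.

```python
import numpy as np

# Synthetic stand-in for an unnormalized feature matrix.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Split first (80/20), then compute statistics.
train_size = int(len(X_raw) * 0.8)
X_train_raw, X_test_raw = X_raw[:train_size], X_raw[train_size:]

# Statistics come from the training rows only.
mean = X_train_raw.mean(axis=0)
std = X_train_raw.std(axis=0)

# The same transform is applied to both splits.
X_train_std = (X_train_raw - mean) / std
X_test_std = (X_test_raw - mean) / std

print(X_train_std.shape, X_test_std.shape)
```

With this variant the training split has mean 0 and std 1 per feature, while the test split is only approximately standardized.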


**7.5**

We now save our processed data so it can be used later.

In [31]:
np.save('X_train.npy', X_train)
np.save('y_train.npy', y_train)

np.save('X_test.npy', X_test)
np.save('y_test.npy', y_test)

Alternatively, we can save them as `.pt` files with `torch.save` (commented out here):

In [32]:
# torch.save(X_train, 'X_train.pt')
# torch.save(y_train, 'y_train.pt')

# torch.save(X_test, 'X_test.pt')
# torch.save(y_test, 'y_test.pt')
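
To use the `.npy` files later, `np.load` reads them back. Here is a minimal round-trip sketch that writes to a temporary directory so it does not touch the real files (the array `X_demo` and the file name are illustrative):

```python
import os
import tempfile

import numpy as np

X_demo = np.arange(6, dtype=np.float64).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_demo.npy")

    # Save, then load the array back from disk.
    np.save(path, X_demo)
    X_loaded = np.load(path)

# Shape, dtype, and values survive the round trip.
print(X_loaded.shape, X_loaded.dtype)
```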