<a id="top"></a>

<a id = '1.0'></a>
<h1 style = "font-family: garamond; font-size: 40px; font-style: normal;background-color: #2ab7ca; color : #fed766; border-radius: 5px 5px;padding:5px;text-align:center; font-weight: bold" >Quick Navigation</h1>

    
* [Dependencies and Configuration](#1)
* [Stage 1: Preliminary Data Inspection and Cleaning](#2)
    * [Load the dataset](#31)
    * [A brief look at the dataset](#31)
    * [Drop, drop, drop the columns!](#31)
    * [Data Types](#31)
    * [Summary Statistics](#31)
    * [Missing Data](#31)
    * [Save Data](#31)

# Dependencies and Configuration

In [16]:
import random
from typing import List
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import pandas as pd

In [17]:
class global_config:
    
    # File Path
    raw_data = "../data/raw/data.csv"
    processed_data_stage_1 = "../data/processed/data_stage_1.csv"
    processed_data_stage_2 = "../data/processed/data_stage_2.csv"
    processed_data_stage_3 = "../data/processed/data_stage_3.csv"

    # Data Information
    target = ["diagnosis"]
    unwanted_cols = ["id", "Unnamed: 32"]

    # Plotting
    colors = ["#fe4a49", "#2ab7ca", "#fed766", "#59981A"]
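    # "mako_r" is a seaborn colormap; plt.cm.get_cmap is deprecated in recent matplotlib releases
    cmap_reversed = sns.color_palette("mako_r", as_cmap=True)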
    
    # Seed Number
    seed = 1992

    # Cross Validation
    num_folds = 5
    cv_schema = "StratifiedKFold"
    split_size = {"train_size": 0.9, "test_size": 0.1}


def set_seeds(seed: int = 1234) -> None:
    """Set seeds for reproducibility."""
    np.random.seed(seed)
    random.seed(seed)

In [3]:
# set config
config = global_config

# set seeding for reproducibility
_ = set_seeds(seed = config.seed)
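
As a quick sanity check on the seeding above, the sketch below re-seeds twice with the same value and confirms that the draws repeat. It redefines `set_seeds` only so the snippet runs on its own. Note that this covers Python's `random` module and NumPy's global generator only; other libraries typically need their own seed or `random_state` argument.

```python
import random

import numpy as np


def set_seeds(seed: int = 1234) -> None:
    """Seed Python's and NumPy's global random number generators."""
    np.random.seed(seed)
    random.seed(seed)


# Re-seeding with the same value replays the same random draws.
set_seeds(1992)
first_draw = (random.random(), np.random.rand(3).tolist())

set_seeds(1992)
second_draw = (random.random(), np.random.rand(3).tolist())

print(first_draw == second_draw)
```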

# Stage 1: Preliminary Data Inspection and Cleaning

## Load the dataset

In [4]:
df = pd.read_csv(config.raw_data)

## A brief look at the dataset

<div class="alert alert-info" role="alert">
<li> We query the first five rows of the dataframe to get a feel for the dataset we are working with.
    
<li> We also call <code>df.info()</code> to see the data types of the columns and to briefly check whether there are any missing values in our data (more on that later).
</div>

```python
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    int64  
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
```

---

<div class="alert alert-block alert-danger">
<b>Importance of data types:</b> We must be sharp and ensure that each column is stored with the appropriate data type! In the real world, we often query "dirty" data from, say, a database, where numeric data are represented as strings. It is our duty to ensure sanity checks are in place!
</div>
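
To make the "dirty data" scenario concrete, here is a minimal sketch of one possible sanity check. The `dirty` dataframe and its values are made up for illustration and are not part of our dataset; `pd.to_numeric` with `errors="coerce"` is one way, among others, to convert string columns and surface entries that cannot be parsed.

```python
import pandas as pd

# Hypothetical "dirty" extract: numeric measurements arrive as strings.
dirty = pd.DataFrame(
    {
        "radius_mean": ["17.99", "20.57", "n/a"],
        "texture_mean": ["10.38", "17.77", "21.25"],
    }
)

# Both columns hold strings, not floats, so arithmetic would misbehave.
print(dirty.dtypes)

# errors="coerce" turns unparseable entries into NaN instead of raising.
cleaned = dirty.apply(pd.to_numeric, errors="coerce")

print(cleaned.dtypes)
print(cleaned.isna().sum())
```

Coercion hides bad values as NaN, so it is worth counting the NaNs afterwards (as in the last line) rather than assuming the conversion was clean.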

In [5]:
display(df.head())
# df.info()  # prints its report directly and returns None, so display() is not needed

|  | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | NaN |


A brief overview suggests that our data is in good shape. There is, however, an unnamed column with no values. This can stem from various data source issues; for now, we check the definition given in [UCI's Breast Cancer Wisconsin (Diagnostic) Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) and confirm that there should only be 32 columns. With this in mind, we can safely delete the column.
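
One common cause of such a column, though we have not confirmed it for this file, is a trailing comma at the end of every line of the CSV. The sketch below reproduces the effect on a tiny in-memory CSV (the text in `raw_text` is made up) and shows one way to list columns that are entirely empty.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text in which every line ends with a trailing comma.
raw_text = "id,radius_mean,\n842302,17.99,\n842517,20.57,\n"
toy = pd.read_csv(StringIO(raw_text))

# pandas gives the header-less trailing field an "Unnamed: <position>" name.
print(toy.columns.tolist())

# Columns in which every single value is missing.
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)
```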

---

We also note from the above that the **id** column is the identifier for each patient. We will drop this column as well, since it holds no predictive power.

<div class="alert alert-block alert-danger">
<b>When can ID be important?</b>
<li> We should question this move and justify it. In this dataset, we have to check that each <b>ID</b> is unique. If it is not, some patients may have multiple observations, which violates the <b>i.i.d.</b> assumption; we would then need to take this into account during cross-validation to avoid information leakage.
<li> Since the ID column is unique, we will delete it. We will keep this in the back of our minds in case we ever need it for feature engineering.
</div>
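
For illustration only, suppose the IDs were <i>not</i> unique. The sketch below uses made-up `records` to show how repeated IDs could be found, and one simple way to assign cross-validation folds per patient rather than per row so that a patient never straddles two folds. The fold assignment here is hand-rolled and is not what this notebook uses; scikit-learn's `GroupKFold` implements the same idea.

```python
import numpy as np
import pandas as pd

# Hypothetical records in which patient 101 appears twice.
records = pd.DataFrame(
    {
        "id": [101, 102, 101, 103, 104, 105],
        "radius_mean": [17.99, 20.57, 18.10, 11.42, 20.29, 12.45],
    }
)

# Rows whose id occurs more than once.
repeated = records[records["id"].duplicated(keep=False)]
print(repeated)

# Assign folds per unique id, then broadcast back to the rows, so that
# all rows of one patient always land in the same fold.
rng = np.random.default_rng(1992)
unique_ids = rng.permutation(records["id"].unique())
num_folds = 3
fold_of_id = {
    int(patient_id): fold
    for fold, chunk in enumerate(np.array_split(unique_ids, num_folds))
    for patient_id in chunk
}
records["fold"] = records["id"].map(fold_of_id)
print(records)
```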

In [6]:
print(f"The ID column is unique : {df['id'].is_unique}")

The ID column is unique : True


## Drop, drop, drop the columns!

Here we define a `drop_columns` function to drop the unwanted columns.

In [7]:
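def drop_columns(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
    """Return a copy of the dataframe without the given columns.

    Args:
        df (pd.DataFrame): The input dataframe; it is not modified.
        columns (List[str]): Names of the columns to drop.

    Returns:
        pd.DataFrame: A new dataframe without `columns` and with a fresh index.
    """

    # drop() returns a new dataframe by default, so no explicit copy is needed.
    return df.drop(columns=columns).reset_index(drop=True)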


In [8]:
df = drop_columns(df, columns=config.unwanted_cols)

## Data Types

Let us split the data types into a few umbrellas:

<div class="alert alert-info" role="alert">
<b> Categorical </b>
    <li> <b>diagnosis</b>: The target variable diagnosis is stored as a string in the dataframe but should be treated as categorical. Most machine learning models expect numeric input rather than strings, so we map the labels to 0 and 1, representing benign and malignant respectively. Since the target has only two unique values, a simple pandas <code>map</code> does the job.
</div>

In [9]:
class_dict = {"B" : 0, "M":1}
df['diagnosis'] = df['diagnosis'].map(class_dict)

We will make sure that our mapping is accurate by asserting the following.

In [10]:
assert df['diagnosis'].value_counts().to_dict()[0] == 357
assert df['diagnosis'].value_counts().to_dict()[1] == 212
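
The assertions above rely on knowing the class counts in advance. A complementary check, sketched below on made-up labels, uses the fact that `Series.map` with a dictionary returns NaN for any value missing from the dictionary instead of raising an error.

```python
import pandas as pd

class_dict = {"B": 0, "M": 1}

# Toy labels with one unexpected value (a lowercase "m").
labels = pd.Series(["M", "B", "B", "m"], name="diagnosis")
mapped = labels.map(class_dict)

# Labels missing from the dictionary become NaN rather than raising.
unmapped = labels[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list would indicate that every label was covered by the mapping.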

<div class="alert alert-info" role="alert">
    <b> Continuous </b>
    <li> <b>predictors</b>: A preliminary look seems to suggest all our predictors are continuous.
</div>

<div class="alert alert-success" role="alert">
From the brief overview, there do not seem to be any ordinal or nominal predictors. This suggests that we may not need to perform encoding in our preprocessing.
</div>
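
One way to back this impression with a check is to look for predictor columns that are not numeric. The sketch below runs on a small made-up `toy` frame; the same two lines could be applied to our `df`. A numeric dtype does not by itself prove a feature is continuous (integer-coded categories would pass too), so this is a first filter rather than a proof.

```python
import pandas as pd

# Toy frame with the same kind of columns as our cleaned data.
toy = pd.DataFrame(
    {
        "diagnosis": [1, 0, 0],
        "radius_mean": [17.99, 13.54, 13.08],
        "texture_mean": [10.38, 14.36, 15.71],
    }
)

predictors = toy.drop(columns=["diagnosis"])

# Columns that are not numeric would need encoding before modelling.
non_numeric = predictors.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```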

## Summary Statistics

We will use a simple, yet powerful function call to check on the summary statistics of our dataframe. We note to the readers that there are much more powerful libraries like [pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) to give us an even more thorough summary, but for our purpose, we will use the good ol' `df.describe()`.

In [11]:
display(df.describe(include='all'))

|  | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 0.372583 | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | ... | 16.26919 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 0.483918 | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 0.0 | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | ... | 7.93 | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 |
| 25% | 0.0 | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | ... | 13.01 | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 |
| 50% | 0.0 | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | ... | 14.97 | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 |
| 75% | 1.0 | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | ... | 18.79 | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 |
| max | 1.0 | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | ... | 36.04 | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 |


The table gives us a good overview. For example, a brief glance gives us the following observations:

1. The features **do not seem to be on the same scale**. This is a problem because some models perform poorly when features are on different scales. A prime example is a KNN model with Euclidean distance as the metric: differences along each feature are squared, so a feature with a wider range dominates one with a narrower range.

2. **area_mean** takes very large values, which suggests a squared term (possibly of **radius_mean**). We can look into this later through EDA.
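
To illustrate the first observation, the sketch below builds two hypothetical samples from the quartiles of **area_mean** and **smoothness_mean** in the summary table and measures how much each feature contributes to their squared Euclidean distance, before and after dividing by the standard deviations from the same table. The two samples are invented for illustration; they are not actual rows of the dataset.

```python
import numpy as np

# Two hypothetical samples described by (area_mean, smoothness_mean),
# built from the 25% and 75% rows of the summary table.
a = np.array([420.3, 0.1053])
b = np.array([782.7, 0.08637])

# Share of the squared Euclidean distance contributed by each feature.
squared_diff = (a - b) ** 2
raw_share = squared_diff / squared_diff.sum()
print(raw_share)

# Rescale each difference by the feature's standard deviation (std row).
std = np.array([351.914129, 0.014064])
scaled_diff = ((a - b) / std) ** 2
scaled_share = scaled_diff / scaled_diff.sum()
print(scaled_share)
```

On the raw scale, practically all of the distance comes from **area_mean**; after rescaling, both features contribute a comparable share.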

Humans are visual creatures, which is why we still need EDA later to draw our attention to anomalies in the dataset. And if the dataset has many columns, reading the summary statistics line by line may even slow you down.
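
As a rough check of the second observation, we can compare the quantiles of **area_mean** in the summary table with the area of a circle whose radius is the matching quantile of **radius_mean**. This is only a back-of-the-envelope sketch under the assumption that area behaves like π·r²; the proper look at this relationship belongs in the EDA.

```python
import numpy as np

# min, 25%, 50%, 75% and max rows of the summary table above.
radius_mean = np.array([6.981, 11.7, 13.37, 15.78, 28.11])
area_mean = np.array([143.5, 420.3, 551.1, 782.7, 2501.0])

# Area of a circle with the tabulated radius.
circle_area = np.pi * radius_mean**2

# Ratios close to 1 support the "squared radius" reading.
ratio = area_mean / circle_area
print(np.round(ratio, 3))
```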

## Missing Data

<div class="alert alert-block alert-danger">
<b>Missing Alert!</b> Although our analysis so far did not show any missing data, it is always good to remind ourselves to check. A simple function that does the job is as follows.
</div>

In [12]:
def report_missing(df: pd.DataFrame, columns: List) -> pd.DataFrame:
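    """Report the number and fraction of missing values per column.

    Args:
        df (pd.DataFrame): The dataframe to inspect.
        columns (List): Names of the columns to check.

    Returns:
        pd.DataFrame: One row per column, with the count of missing values
            ("missing num") and the fraction of rows missing, between 0 and 1
            ("missing percentage").
    """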
    missing_dict = {"missing num": [], "missing percentage": []}
    for col in columns:
        num_missing = df[col].isnull().sum()
        percentage_missing = num_missing / len(df)
        missing_dict["missing num"].append(num_missing)
        missing_dict["missing percentage"].append(percentage_missing)

    missing_data_df = pd.DataFrame(index=columns, data=missing_dict)

    return missing_data_df

In [13]:
missing_df = report_missing(df, columns=df.columns)
display(missing_df.head())

|  | missing num | missing percentage |
| --- | --- | --- |
| diagnosis | 0 | 0.0 |
| radius_mean | 0 | 0.0 |
| texture_mean | 0 | 0.0 |
| perimeter_mean | 0 | 0.0 |
| area_mean | 0 | 0.0 |
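
For comparison, the same report can be produced without an explicit loop. The sketch below runs on a made-up `toy` frame with one missing value; `isna().sum()` counts missing entries per column and `isna().mean()` gives the fraction of rows missing, matching the two columns returned by `report_missing`.

```python
import numpy as np
import pandas as pd

# Toy frame with a single missing value in radius_mean.
toy = pd.DataFrame(
    {
        "radius_mean": [17.99, np.nan, 19.69, 11.42],
        "texture_mean": [10.38, 17.77, 21.25, 20.38],
    }
)

# Count and fraction of missing values, one row per column.
missing_report = pd.DataFrame(
    {
        "missing num": toy.isna().sum(),
        "missing percentage": toy.isna().mean(),
    }
)
print(missing_report)
```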


## Save Data 

Once Stage 1 is done, we save the data to our processed folder as `data_stage_1.csv` (the path stored in `config.processed_data_stage_1`), indicating that the data has been processed through Stage 1.

In [14]:
df.to_csv(config.processed_data_stage_1, index=False)
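
A quick way to gain confidence in the saved file is a round-trip check: write, read back, and compare. The sketch below does this on a made-up `toy` frame in a temporary directory so that it does not touch the project's data folder. It also shows why `index=False` matters: without it, reloading would produce an extra `Unnamed: 0` column holding the old index.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the processed data.
toy = pd.DataFrame({"diagnosis": [1, 0], "radius_mean": [17.99, 13.54]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "data_stage_1.csv"
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Raises an AssertionError if columns, dtypes or values differ.
pd.testing.assert_frame_equal(toy, reloaded)
print(reloaded.columns.tolist())
```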