<div align="center">
<h1>Stage 1: Preliminary Data Inspection and Cleaning</h1>
by Hongnan Gao
<br>
</div>

<a id="top"></a>

<a id = '1.0'></a>
<h1 style = "font-family: garamond; font-size: 40px; font-style: normal;background-color: #2ab7ca; color : #fed766; border-radius: 5px 5px;padding:5px;text-align:center; font-weight: bold" >Quick Navigation</h1>

    
* [Dependencies and Configuration](#1)
* [Stage 1: Preliminary Data Inspection and Cleaning](#2)
    * [Load the dataset](#31)
    * [A brief look at the dataset](#31)
    * [Drop, drop, drop the columns!](#31)
    * [Data Types](#31)
    * [Summary Statistics](#31)
    * [Missing Data](#31)
    * [Save Data](#31)

## Dependencies and Configuration

In [3]:
import random
from dataclasses import dataclass, field
from typing import List, Dict

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [4]:
@dataclass
class config:
    raw_data: str = "https://storage.googleapis.com/reighns/reighns_ml_projects/docs/supervised_learning/classification/breast-cancer-wisconsin/data/raw/data.csv"
    processed_data: str = "https://storage.googleapis.com/reighns/reighns_ml_projects/docs/supervised_learning/classification/breast-cancer-wisconsin/data/processed/processed.csv"
    train_size: float = 0.9
    seed: int = 1992
    num_folds: int = 5
    cv_schema: str = "StratifiedKFold"
    classification_type: str = "binary"

    target_col: List[str] = field(default_factory=lambda: ["diagnosis"])
    unwanted_cols: List[str] = field(default_factory=lambda: ["id", "Unnamed: 32"])

    # Plotting
    colors: List[str] = field(
        default_factory=lambda: ["#fe4a49", "#2ab7ca", "#fed766", "#59981A"]
    )
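    # "mako_r" is a seaborn colormap; plt.cm.get_cmap is deprecated in recent matplotlib releases.
    cmap_reversed = sns.color_palette("mako_r", as_cmap=True)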

    def to_dict(self) -> Dict:
        """Convert the config object to a dictionary.

        Returns:
            Dict: The config object as a dictionary.
        """
        return {
            "raw_data": self.raw_data,
            "processed_data": self.processed_data,
            "train_size": self.train_size,
            "seed": self.seed,
            "num_folds": self.num_folds,
            "cv_schema": self.cv_schema,
            "classification_type": self.classification_type,
            "target_col": self.target_col,
            "unwanted_cols": self.unwanted_cols,
            "colors": self.colors,
            "cmap_reversed": self.cmap_reversed,
        }

In [5]:
def set_seeds(seed: int = 1234) -> None:
    """Set seeds for reproducibility."""
    np.random.seed(seed)
    random.seed(seed)

In [6]:
# set config
config = config()

# set seeding for reproducibility
_ = set_seeds(seed = config.seed)

## Stage 1: Preliminary Data Inspection and Cleaning

### Load the dataset

In [7]:
df = pd.read_csv(config.raw_data)

### A brief look at the dataset

!!! info
    - We will query the first five rows of the dataframe to get a feel for the dataset we are working on.

    - We also call `df.info()` to see the data types of the columns, and to briefly check whether there are any missing values in our data (more on that later).


```python
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    int64  
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
```

---

!!! danger "Importance of data types"
    We must be sharp and ensure that each column is indeed stored in its appropriate data type! In the real world, we often query "dirty" data from, say, a database, where numeric data are represented as strings. It is our duty to ensure sanity checks are in place!
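
As a minimal sketch of such a sanity check, the snippet below builds a small hypothetical frame (the `dirty` dataframe and its values are illustrative, not taken from this dataset) in which a numeric column arrives as strings, and coerces it with `pd.to_numeric`.

```python
import pandas as pd

# Toy frame standing in for "dirty" data: the numeric column arrives as strings.
dirty = pd.DataFrame({"radius_mean": ["17.99", "20.57", "n/a"]})

# errors="coerce" turns unparseable entries into NaN instead of raising.
dirty["radius_mean"] = pd.to_numeric(dirty["radius_mean"], errors="coerce")

# Sanity check: the column is now numeric, and the bad entry shows up as missing.
assert pd.api.types.is_numeric_dtype(dirty["radius_mean"])
num_unparseable = int(dirty["radius_mean"].isna().sum())
```

Counting the coerced `NaN` values afterwards tells us how many entries could not be parsed, which is usually worth investigating before moving on.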

In [15]:
display(df.head())
df.info(verbose=True)  # info() prints its report and returns None, so display() is not needed

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5


A brief overview tells us our data is alright! There is, however, one unnamed column that has no values. This can stem from various data source issues. For now, we quickly check the definition given by [UCI's Breast Cancer Wisconsin (Diagnostic) Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) and confirm that there should only be 32 columns. With this in mind, we can safely delete the column.
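
One plausible cause, given that every row printed above ends with a trailing comma, is that pandas reads the empty trailing field as an extra column. The sketch below reproduces this on a hypothetical two-row CSV (the `csv_text` string is made up for illustration) and lists the columns that are entirely empty.

```python
import io

import pandas as pd

# Hypothetical two-row CSV whose lines all end with a trailing comma.
csv_text = "id,diagnosis,\n1,M,\n2,B,\n"
toy = pd.read_csv(io.StringIO(csv_text))

# pandas names a column with an empty header "Unnamed: <position>".
empty_cols = [col for col in toy.columns if toy[col].isna().all()]
print(empty_cols)
```

Running the same all-missing check on the real dataframe is a cheap way to confirm that a column carries no information before dropping it.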

---

We also note from the above that the **id** column is the identifier for each patient. We will drop this column as well, since it holds no predictive power.

!!! info "When can ID be important?"
    - We should question our move and justify it. In this dataset, we have to ensure that each <b>ID</b> is unique. If it is not, some patients may have multiple observations, which violates the <b>i.i.d.</b> assumption; we would then need to account for this during cross-validation to avoid information leakage.
    - Since the ID column is unique, we will delete it. We will keep it at the back of our mind in case we ever need it for feature engineering.

In [16]:
print(f"The ID column is unique : {df['id'].is_unique}")

The ID column is unique : True
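
Had the IDs not been unique, one way to avoid leakage would be to assign folds per patient rather than per row. The sketch below illustrates the idea on a hypothetical frame with repeated IDs (the `toy` data and the round-robin assignment are assumptions for illustration; scikit-learn's `GroupKFold` offers a ready-made version of this idea).

```python
import pandas as pd

# Hypothetical frame in which patient ids repeat (not the case in this dataset).
toy = pd.DataFrame(
    {"id": [101, 101, 102, 102, 103, 103], "diagnosis": [0, 0, 1, 1, 0, 1]}
)

num_folds = 3
# Assign a fold to each unique id, then broadcast it to every row of that id.
unique_ids = toy["id"].drop_duplicates().reset_index(drop=True)
id_to_fold = {patient_id: i % num_folds for i, patient_id in enumerate(unique_ids)}
toy["fold"] = toy["id"].map(id_to_fold)

# Each id must live in exactly one fold, otherwise a patient's rows would leak
# between the training and validation splits.
folds_per_id = toy.groupby("id")["fold"].nunique()
assert (folds_per_id == 1).all()
```

This simple round-robin assignment ignores class balance; it is only meant to show why the ID column matters before we discard it.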


### Drop, drop, drop the columns!

Here we define a `drop_columns` function to drop the unwanted columns.

In [17]:
def drop_columns(df: pd.DataFrame, columns: List) -> pd.DataFrame:
    """Drop unwanted columns from dataframe.

    Args:
        df (pd.DataFrame): Dataframe to be cleaned
        columns (List): list of columns to be dropped

    Returns:
        df_copy (pd.DataFrame): Dataframe with unwanted columns dropped.
    """

    # drop() returns a new DataFrame by default, so the input is left untouched.
    df_copy = df.drop(columns=columns)
    return df_copy.reset_index(drop=True)

In [18]:
df = drop_columns(df, columns=config.unwanted_cols)

### Data Types

Let us group the data types under a few umbrellas:

!!! info 
    **Categorical Variables:**
    diagnosis: The target variable diagnosis is stored as a string in the dataframe but is really categorical. Most models expect numeric inputs rather than strings, so we map the labels to 0 and 1, representing benign and malignant respectively. Since the target has only two unique values, a simple pandas `map` does the job.

In [19]:
class_dict = {"B": 0, "M": 1}
df['diagnosis'] = df['diagnosis'].map(class_dict)

We will make sure that our mapping is accurate by asserting the following.

In [20]:
assert df['diagnosis'].value_counts().to_dict()[0] == 357
assert df['diagnosis'].value_counts().to_dict()[1] == 212
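
A complementary check, sketched here on a hypothetical label series (the lowercase `"m"` is an invented dirty value), relies on the fact that `Series.map` returns `NaN` for any label missing from the dictionary instead of raising an error.

```python
import pandas as pd

class_dict = {"B": 0, "M": 1}

# Toy labels including a hypothetical unexpected value "m" (lowercase).
labels = pd.Series(["B", "M", "m"])
mapped = labels.map(class_dict)

# Labels missing from the dictionary become NaN rather than raising an error.
unmapped = labels[mapped.isna()].unique().tolist()
```

An empty `unmapped` list means every label was covered by the mapping; anything else points to labels that need cleaning first.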

!!! info
    **Continuous Variables:**
    A preliminary look seems to suggest all our predictors are continuous.

!!! success
    From the brief overview, there do not seem to be any ordinal or nominal predictors. This suggests that we may not need to perform encoding in our preprocessing.

### Summary Statistics

We will use a simple, yet powerful function call to check on the summary statistics of our dataframe. We note to the readers that there are much more powerful libraries like [pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) to give us an even more thorough summary, but for our purpose, we will use the good ol' `df.describe()`.

In [21]:
display(df.describe(include='all'))

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,0.372583,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,0.405172,1.216853,2.866059,40.337079,0.007041,0.025478,0.031894,0.011796,0.020542,0.003795,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,0.483918,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,0.277313,0.551648,2.021855,45.491006,0.003003,0.017908,0.030186,0.00617,0.008266,0.002646,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,0.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,0.1115,0.3602,0.757,6.802,0.001713,0.002252,0.0,0.0,0.007882,0.000895,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,0.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,0.2324,0.8339,1.606,17.85,0.005169,0.01308,0.01509,0.007638,0.01516,0.002248,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,0.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,0.3242,1.108,2.287,24.53,0.00638,0.02045,0.02589,0.01093,0.01873,0.003187,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,1.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,0.4789,1.474,3.357,45.19,0.008146,0.03245,0.04205,0.01471,0.02348,0.004558,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,1.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,2.873,4.885,21.98,542.2,0.03113,0.1354,0.396,0.05279,0.07895,0.02984,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


The table does give us a good overview. For example, a brief glance gives us the following observations:

1. The features **do not seem to be on the same scale**. This is a problem because some models do not perform well when features are on different scales. A prime example is a KNN model with Euclidean distance as the metric: differences in feature ranges are amplified by the squared term, so a feature with a wider range dominates one with a narrower range.

2. **area_mean** is very large compared with the other features, and it likely involves a squared term (possibly from **radius_mean**). We can look into this later through EDA.
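
To make the first observation concrete, the sketch below compares how much each feature contributes to a squared Euclidean distance before and after standardisation. The two sample points are hypothetical, built from the 25% and 75% quartiles of **radius_mean** and **area_mean** in the table above; the means and standard deviations are copied from the same table.

```python
import numpy as np

# Two hypothetical samples with columns [radius_mean, area_mean].
a = np.array([11.70, 420.3])
b = np.array([15.78, 782.7])

# Mean and standard deviation of the two columns, copied from the summary table.
mean = np.array([14.127292, 654.889104])
std = np.array([3.524049, 351.914129])

# Share of the squared Euclidean distance contributed by each feature.
raw_sq = (a - b) ** 2
raw_share = raw_sq / raw_sq.sum()

# After standardisation (z-scores), both features contribute comparably.
scaled_sq = ((a - mean) / std - (b - mean) / std) ** 2
scaled_share = scaled_sq / scaled_sq.sum()
```

On the raw values, **area_mean** accounts for almost the entire distance; after scaling, the two features carry roughly equal weight.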

Humans are visual creatures, which is why we still need EDA later to draw our attention to any anomalies in the dataset. Moreover, if the dataset has many columns, reading the summary statistics line by line may slow you down.

### Missing Data

!!! danger "Missing Alert?"
    Although our analysis so far did not reveal any missing data, it is always good to remind ourselves to check. A simple function that does the job is as follows.

In [22]:
def report_missing(df: pd.DataFrame, columns: List) -> pd.DataFrame:
    """A function to check for missing data.

    Args:
        df (pd.DataFrame): The DataFrame to check.
        columns (List): The columns to check.

    Returns:
        missing_data_df (pd.DataFrame): Returns a DataFrame that reports missing data.
    """
    missing_dict = {"missing num": [], "missing percentage": []}
    for col in columns:
        num_missing = df[col].isnull().sum()
        # Multiply by 100 so the value is a percentage rather than a fraction.
        percentage_missing = num_missing / len(df) * 100
        missing_dict["missing num"].append(num_missing)
        missing_dict["missing percentage"].append(percentage_missing)

    missing_data_df = pd.DataFrame(index=columns, data=missing_dict)

    return missing_data_df

In [23]:
missing_df = report_missing(df, columns=df.columns)
display(missing_df.head())

| | missing num | missing percentage |
|---|---|---|
| diagnosis | 0 | 0.0 |
| radius_mean | 0 | 0.0 |
| texture_mean | 0 | 0.0 |
| perimeter_mean | 0 | 0.0 |
| area_mean | 0 | 0.0 |
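
The same kind of report can also be obtained without an explicit loop. This is a sketch of a vectorised alternative on a hypothetical toy frame; the `summary` name and the "missing fraction" column are illustrative choices, not part of the function above.

```python
import numpy as np
import pandas as pd

# Toy frame with one hypothetical missing value.
toy = pd.DataFrame({"radius_mean": [17.99, np.nan], "texture_mean": [10.38, 17.77]})

# isnull().sum() counts missing cells per column; isnull().mean() gives the fraction.
summary = pd.DataFrame(
    {"missing num": toy.isnull().sum(), "missing fraction": toy.isnull().mean()}
)
```

Sorting `summary` by "missing num" in descending order is handy when the dataframe has many columns and only a few contain gaps.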


### Save data

We save the cleaned data to the processed location so that subsequent notebooks can load it.
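
A minimal sketch of the save step follows. The output path is an assumption: `config.processed_data` is a URL to read from, so the sketch writes to a temporary local folder instead, and the small `cleaned` frame is only a stand-in for the `df` produced in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical local output path; in practice, point this at your project's processed folder.
processed_path = Path(tempfile.mkdtemp()) / "processed.csv"

# Stand-in for the cleaned dataframe `df` produced by this notebook.
cleaned = pd.DataFrame({"diagnosis": [1, 0], "radius_mean": [17.99, 13.54]})

# index=False avoids writing the RangeIndex as an extra unnamed column.
cleaned.to_csv(processed_path, index=False)

# Read the file back to confirm that it round-trips.
reloaded = pd.read_csv(processed_path)
```

Writing with `index=False` matters here: otherwise the next notebook would load an extra unnamed index column, much like the one we dropped earlier.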