<a href="https://colab.research.google.com/github/camohenry/WiDSproject/blob/main/WIDS_2024_Workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing and Transformation for Machine Learning

## Learning Objectives:

Key concepts and techniques in data preprocessing.

*   Explain why data preprocessing matters in machine learning.
*   Define data cleaning, data integration, data transformation, and feature selection.
*   Implement data preprocessing steps for a machine learning task.






## WiDS Datathon Data set
A rich, real-world dataset containing demographic, diagnosis, treatment, and insurance information for patients who were diagnosed with breast cancer. The dataset originated from Health Verity, one of the largest healthcare data ecosystems in the US. It was enriched with third-party geo-demographic data to provide views into the socioeconomic aspects that may contribute to health equity. For this challenge, the dataset was then further enriched with zip-code-level climate data.

Challenge task: You will be asked to predict how long it takes for patients to receive a metastatic cancer diagnosis.

Why is this important? Metastatic triple-negative breast cancer (TNBC) is considered the most aggressive form of TNBC and requires the most urgent and timely treatment. Unnecessary delays in diagnosis and subsequent treatment can have devastating effects in these difficult cancers. Differences in the wait time to get treatment are a good proxy for disparities in healthcare access.

The primary goal of building these models is to detect relationships between patient demographics and the likelihood of getting timely treatment. The secondary goal is to see whether climate patterns affect proper diagnosis and treatment.
  


## Step 1: Import the Libraries
The first step of data preprocessing is to import the libraries the notebook needs. A library is a collection of functions that your code can call. Many libraries are available for each programming language; this notebook uses Python libraries such as pandas, NumPy, seaborn, and scikit-learn.

## Step 2: Import Data and Perform Initial Analysis

The next step is to load the data that the machine learning algorithm will use. Every later step depends on it, so confirm that the data loads correctly before assessing it further.

Once the data is loaded, checking for noisy or missing content is important.

The code cells below import the libraries, mount Google Drive, and load the separate .csv files into the following two pandas DataFrames:

* `train_df`, which contains the training set
* `test_df`, which contains the test set

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import preprocessing
from sklearn.preprocessing import LabelBinarizer


import tensorflow as tf
from matplotlib import pyplot as plt
from tensorflow import feature_column

# The following lines adjust the granularity of reporting.
pd.options.display.max_rows = 10
pd.options.display.float_format = "{:.1f}".format

In [None]:
# Mount Google Drive once; the file paths below assume the /content/gdrive mount point
from google.colab import drive
drive.mount("/content/gdrive", force_remount=True)

In [None]:

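train_df = pd.read_csv("/content/gdrive/MyDrive/Data/train.csv")
test_df = pd.read_csv("/content/gdrive/MyDrive/Data/test.csv")
# Shuffle the training set. reindex returns a new DataFrame, so the result must be assigned;
# reset_index keeps a clean 0..n-1 index for later column-wise concatenation.
train_df = train_df.reindex(np.random.permutation(train_df.index)).reset_index(drop=True)
display(train_df.head(5))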

In [None]:
# print out the size of the data set
print("Number of rows and columns in training set:", train_df.shape)
print("Number of rows and columns in testing set:", test_df.shape)

In [None]:

# print out column names and types of the training set
# (info() prints its report directly and returns None, so no print() is needed)
train_df.info()

In [None]:
# print out different data types
# Categorical columns
cat_col = [col for col in train_df.columns if train_df[col].dtype == 'object']
print('Categorical columns :',cat_col)
# Numerical columns
num_col = [col for col in train_df.columns if train_df[col].dtype != 'object']
print('Numerical columns :',num_col)

In [None]:
# plot the columns whose share of missing values exceeds a threshold (default: 10%)
def plot_nas(df: pd.DataFrame, threshold: float = 10.0):
    na_pct = df.isnull().mean() * 100
    na_pct = na_pct[na_pct > 0].sort_values(ascending=False)
    if na_pct.empty:
        print('No NAs found')
        return
    missing_data = pd.DataFrame({'Missing Ratio %': na_pct})
    above_threshold = missing_data[missing_data['Missing Ratio %'] > threshold]
    if above_threshold.empty:
        print(f'No column has more than {threshold}% missing values')
        return
    above_threshold.plot(kind="barh")
    plt.show()

plot_nas(train_df)

#### Plot categorical columns

Categorical columns hold categorical variables in the statistical sense: a categorical variable takes on a limited, and usually fixed, number of possible values (categories). Examples are gender, social class, blood type, and country affiliation. Plotting the categories and the number of rows in each one reveals any balance issues the data set might have.





In [None]:
train_df['payer_type'].value_counts().plot(kind='bar')

#### Print statistical summary of numerical columns

Unlike categorical columns, numerical columns can take many distinct values, so a bar chart of counts per value is less informative. For numerical columns, we can print a statistical summary as an initial analysis.

In [None]:
train_df['bmi'].describe()

#### Plot the relationship between different columns (features)

In [None]:
# plot the relationship between a categorical column and a numerical column (mean BMI per race)
train_df.groupby(['patient_race'])["bmi"].mean().plot(kind='bar')


In [None]:
# plot the relationship between two categorical columns (payer type counts per race)
pd.crosstab(train_df['patient_race'], train_df['payer_type']).plot(kind="bar", stacked=True)

## Step 3: Fundamental data cleaning steps

While the techniques used for data cleaning vary with the type of data, there are several fundamental steps that we almost always perform, such as handling missing values and removing outliers.

Data cleaning is often a tedious process, but it is absolutely essential to get top results and powerful insights from your data.


#### Handle missing values

If missing values are found, there are two main ways to resolve the issue:

*   Remove every row that contains a missing value. This risks losing important data, so it is most appropriate when the dataset is very large.
*   Estimate (impute) the missing value, for example with the column's mean, median, or mode.

For categorical columns, a further option is to treat "missing" as its own category; this notebook does so by filling the gaps with "N/A".
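The two approaches can be sketched on a small toy DataFrame. The values below are hypothetical and are not taken from the WiDS dataset; the choice of median for the numeric column and mode for the categorical column is one reasonable convention, not the only one.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the WiDS dataset)
toy_df = pd.DataFrame({
    "bmi": [22.5, np.nan, 31.0, 27.4, np.nan],
    "payer_type": ["COMMERCIAL", "MEDICAID", None, "COMMERCIAL", "COMMERCIAL"],
})

# Option 1: drop every row that contains at least one missing value
dropped_df = toy_df.dropna()

# Option 2: impute, using the median for the numeric column
# and the mode (most frequent value) for the categorical column
imputed_df = toy_df.copy()
imputed_df["bmi"] = imputed_df["bmi"].fillna(imputed_df["bmi"].median())
imputed_df["payer_type"] = imputed_df["payer_type"].fillna(imputed_df["payer_type"].mode()[0])

print(len(toy_df), "rows before dropping,", len(dropped_df), "rows after")
print(imputed_df)
```

Dropping keeps only the fully populated rows, while imputing keeps every row at the cost of introducing estimated values.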

In [None]:
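# fill in empty values in the "patient_race" column with "N/A"
# (assigning the result back avoids chained in-place modification, which newer pandas versions warn about)
train_df["patient_race"] = train_df["patient_race"].fillna("N/A")

# fill in empty values in the "payer_type" column with "N/A"
train_df["payer_type"] = train_df["payer_type"].fillna("N/A")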

plot_nas(train_df)


#### Filter out data outliers

Outliers are data points that fall far outside the norm and may skew your analysis. For example, the average BMI in our data set is about 29.0, and the standard BMI categories are: underweight, below 18.5; normal weight, 18.5–24.9; overweight, 25–29.9; obesity, 30 or greater. Yet some values are very high (a maximum of 85, as shown in our data analysis). In such cases, consider removing these data points so that summary statistics such as the mean better reflect typical patients.
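This notebook uses a fixed cutoff to remove extreme BMI values. As an alternative illustration (not the method used in the cells that follow), a common rule of thumb flags values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to hypothetical BMI values.

```python
import pandas as pd

# Toy BMI values (hypothetical); 85.0 plays the role of the extreme value
bmi = pd.Series([21.0, 24.5, 27.0, 29.0, 31.5, 33.0, 85.0])

# Quartiles and interquartile range
q1, q3 = bmi.quantile(0.25), bmi.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (bmi < lower) | (bmi > upper)

print("Bounds:", lower, upper)
print(bmi[is_outlier])
```

A data-driven bound like this adapts to each column, whereas a fixed cutoff is easier to justify with domain knowledge.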

In [None]:
# Plot the BMI values to detect outliers
sns.boxplot(x=train_df['bmi'])

# outlier_train_df = train_df[train_df['bmi'] > 70]
# display(outlier_train_df)


In [None]:
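def removal_box_plot(df, column, threshold):
    # Keep rows at or below the threshold. Rows with a missing value are also kept,
    # because a plain `<=` comparison would silently drop them (NaN <= x is False).
    removed_outliers = df[df[column].isna() | (df[column] <= threshold)]

    sns.boxplot(x=removed_outliers[column])
    plt.title(f'Box Plot without Outliers of {column}')
    plt.show()
    return removed_outliers

threshold_value = 70
no_outliers = removal_box_plot(train_df, 'bmi', threshold_value)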

## Step 4: Data Transformation Techniques

Most machine learning models cannot work with non-numeric data directly. It is important to transform the data into numerical form to prevent problems at later stages.

#### Categorical columns

In this dataset, payer type is represented as a string (e.g. 'COMMERCIAL', 'MEDICAID', or 'MEDICARE ADVANTAGE'). We cannot feed strings directly to a model; we must first map them to numeric values. Two common techniques are Label Encoding and One-hot Encoding:

*   Label Encoding converts each unique value of the categorical variable into an integer. For example, 'COMMERCIAL', 'MEDICAID', and 'MEDICARE ADVANTAGE' would be encoded as 0, 1, and 2 respectively.

*   One-hot Encoding: label encoding works, but the integers imply an order that does not exist. A model may treat 'MEDICARE ADVANTAGE' (2) as "greater than" 'COMMERCIAL' (0), even though the categories are equally important. One-hot encoding avoids this potential bias by creating one binary (0/1) column per category instead of a single column of increasing integers.
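The difference between the two encodings is easiest to see side by side. The sketch below uses a hypothetical four-row series rather than the real column, with scikit-learn's `LabelEncoder` for label encoding and `pd.get_dummies` for one-hot encoding (the latter is an assumption for brevity; the notebook itself uses `LabelBinarizer`).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy payer types (hypothetical rows)
payer = pd.Series(
    ["COMMERCIAL", "MEDICAID", "MEDICARE ADVANTAGE", "COMMERCIAL"],
    name="payer_type",
)

# Label encoding: classes are sorted alphabetically, then numbered 0, 1, 2
label_codes = LabelEncoder().fit_transform(payer)
print(label_codes)  # prints [0 1 2 0]

# One-hot encoding: one 0/1 column per category, exactly one 1 in each row
one_hot = pd.get_dummies(payer, prefix="payer_type", dtype=int)
print(one_hot)
```

Label encoding produces a single compact column, while one-hot encoding adds one column per category and carries no implied order.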


In [None]:
#Label Encoding
le1 = preprocessing.LabelEncoder()
train_df['payer_type_label_encode'] =le1.fit_transform(train_df['payer_type'])
display(train_df)


In [None]:
#One-hot Encoding
#Get the categorical values
one_hot_encoder = LabelBinarizer()
one_hot_encoder.fit(train_df['payer_type'])
print(one_hot_encoder.classes_)

#Transform our payer_type column into 4 binary columns, one per category (including the "N/A" placeholder)
transformed = pd.DataFrame(one_hot_encoder.transform(train_df['payer_type']),
                           columns=one_hot_encoder.classes_,
                           index=train_df.index)  # same index, so concat aligns row by row
#Combine with original data frame
train_df = pd.concat([train_df, transformed], axis=1)
display(train_df)

### Bucketized columns
Often, you don't want to feed a number directly into the model, but instead split its values into categories based on numerical ranges. Consider raw data that represents a person's age. Instead of representing age as a numeric column, we can split it into several buckets. The cell below assigns each row the label of the age range it falls into.
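To make the bucketing concrete, here is a minimal sketch on a few hypothetical ages (not taken from the dataset). It uses `pd.cut` to assign a bucket label and then `pd.get_dummies` to turn the labels into one-hot columns. The bucket edges and label names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical ages
ages = pd.Series([1, 8, 17, 45, 82], name="patient_age")

# Half-open buckets: [0, 13), [13, 20), [20, 70), [70, 100)
age_group = pd.cut(
    ages,
    bins=[0, 13, 20, 70, 100],
    labels=["Child", "Teen", "Adult", "Older Adult"],
    right=False,
)

# One 0/1 column per bucket; each row has a 1 in the column of its bucket
age_one_hot = pd.get_dummies(age_group, prefix="age", dtype=int)

print(pd.concat([ages, age_group.rename("age_group"), age_one_hot], axis=1))
```

With `right=False`, each bucket includes its left edge and excludes its right edge, so an age of exactly 13 falls into "Teen".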

In [None]:
bins= [0,2,4,13,20,70,100]
labels = ['Infant','Toddler','Kid','Teen','Adult',"Old Adult"]
train_df['patient_age_group'] = pd.cut(train_df['patient_age'], bins=bins, labels=labels, right=False)
display (train_df)

### Scaling

Scaling is a preprocessing step that brings all the features in a dataset to similar ranges, making them more comparable and reducing the impact of different scales on machine learning algorithms. We can scale pandas DataFrame columns using methods such as min-max scaling, standardization, robust scaling, and log transformation. This notebook demonstrates min-max scaling on the `home_value` column.
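For comparison, the other methods named above can each be written in a line or two of pandas and NumPy. This is a sketch on hypothetical home values; the definitions used here (population standard deviation for standardization, median and interquartile range for robust scaling, `log1p` for the log transform) are common conventions and are stated as assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical home values with one very large entry
values = pd.Series([100_000.0, 150_000.0, 200_000.0, 250_000.0, 1_000_000.0])

# Standardization (z-score): subtract the mean, divide by the standard deviation
standardized = (values - values.mean()) / values.std(ddof=0)

# Robust scaling: subtract the median, divide by the interquartile range,
# so a single extreme value has little influence on the scale
iqr = values.quantile(0.75) - values.quantile(0.25)
robust = (values - values.median()) / iqr

# Log transformation: compresses large values; log1p computes log(1 + x)
logged = np.log1p(values)

print(pd.DataFrame({"raw": values, "standardized": standardized,
                    "robust": robust, "log1p": logged}))
```

Robust scaling and the log transform are often preferred when a column contains extreme values, since the mean, standard deviation, minimum, and maximum are all sensitive to them.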

In [None]:
display(train_df["home_value"])

#### Min-Max Scaling

Min-max scaling is also known as normalization. It rescales the data to a fixed range, typically between 0 and 1, by computing (x - min) / (max - min) for each value x. The shape of the original distribution is preserved: the minimum maps to 0 and the maximum maps to 1.

In [None]:
def min_max_scaling(df, column_name):
    min_value = df[column_name].min()
    max_value = df[column_name].max()
    df[column_name] = (df[column_name] - min_value) / (max_value - min_value)

# Apply min-max scaling to the 'home_value' column
min_max_scaling(train_df, 'home_value')

# Print the DataFrame after min-max scaling
print("DataFrame after Min-Max Scaling:")
display(train_df["home_value"])

In [None]:
# current version of pandas generates a bunch of warnings that we'll ignore
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

# sweetviz may need to be installed first (for example with `!pip install sweetviz`)
import sweetviz as sv

report = sv.analyze(train_df)
report.show_html('train.html')