# Preparation

## Import packages

In [1]:
# pandas is a package widely used for managing and organizing data
import pandas as pd
# numpy is a package used for numerical operations
import numpy as np
# matplotlib and seaborn are packages for plotting graphs
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
# # scikit-learn is a package with many machine learning applications
# import sklearn as sk 
# Suppressing warnings to keep the output readable
import warnings
warnings.filterwarnings("ignore")

## Versions

In [2]:
# Python version in use
!python --version

Python 3.7.11


In [3]:
print('The Pandas version utilized in this example is -',pd.__version__)
print('The Matplotlib version utilized in this example is -',mpl.__version__)
print('The Seaborn version utilized in this example is -',sns.__version__)
print('The Numpy version utilized in this example is -',np.__version__)

The Pandas version utilized in this example is - 1.1.5
The Matplotlib version utilized in this example is - 3.2.2
The Seaborn version utilized in this example is - 0.11.1
The Numpy version utilized in this example is - 1.19.5


Installing a more recent version of sklearn

In [4]:
!pip install scikit-learn==0.24.1 > y
import sklearn as sk
print(sk.__version__)
print('The Sklearn version utilized in this example is -',sk.__version__)

0.24.1
The Sklearn version utilized in this example is - 0.24.1


## Data

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are **females at least 21 years old** of Pima Indian heritage.

This dataset can be found on [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

In [5]:
# Getting the URL of the dataset stored on the GEMINI GitHub
url = "https://github.com/gemini-duke/toolbox/blob/main/machine-learning/data-preparation/diabetes_kaggle.zip?raw=true"

In [6]:
# reading data using the pandas package.
# the dataset is compressed in a zip file, so we need to use compression="zip"
data = pd.read_csv(url, compression="zip")

In [7]:
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Our dataset has 8 features (`Pregnancies`, `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, `BMI`, `DiabetesPedigreeFunction`, `Age`) and the `Outcome`.


### Recoding null values
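Some variables have missing values that were coded as `0` instead of `NaN`. A value of zero is not plausible for measurements such as glucose or blood pressure, so we recode those zeros as missing. `Pregnancies` is left out of the list because zero is a legitimate value there.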

In [8]:
fill_na = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
           'DiabetesPedigreeFunction']

In [9]:
data[fill_na] = data[fill_na].replace(0, np.nan)
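To see what this recoding does, here is a minimal sketch on a toy frame. The values and the names `toy` and `placeholder_cols` are made up for illustration and are not part of the diabetes data. Only the listed columns are touched, so a legitimate zero in `Pregnancies` survives.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): zero is a real count for Pregnancies,
# but a placeholder for "not measured" in Glucose and Insulin
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 1],
    "Glucose": [148, 0, 85],
    "Insulin": [0, 94, 0],
})

# Recode zeros as NaN only in the columns where zero means "missing"
placeholder_cols = ["Glucose", "Insulin"]
toy[placeholder_cols] = toy[placeholder_cols].replace(0, np.nan)

# Number of missing values per column after recoding
na_counts = toy.isna().sum()
print(na_counts)
```

After the recoding, `isna()` and the imputers used later in this notebook can recognize these entries as missing.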

### Converting numeric features to categorical

All of our features are numeric. I will build **extra features that are categorical** so we have different data types to handle.

#### Age categories

In [10]:
data["AgeClass"] = pd.cut(data.Age, bins=[18, 29,49,64, 500], 
                       labels=["18 to 29 years", "30 to 49 years", 
                               "50 to 64 years", "65 or more"],
                       include_lowest=True).astype("object")

In [11]:
data["AgeClass"].value_counts()

18 to 29 years    396
30 to 49 years    283
50 to 64 years     73
65 or more         16
Name: AgeClass, dtype: int64
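A small sketch of how `pd.cut` assigns the classes may help here. The `ages` series is hypothetical and was chosen to sit on the bin edges. By default each bin includes its right edge, and `include_lowest=True` also lets the first bin include its left edge.

```python
import pandas as pd

# Hypothetical ages chosen to sit exactly on the bin edges
ages = pd.Series([18, 29, 30, 49, 50, 64, 65])

age_class = pd.cut(
    ages,
    bins=[18, 29, 49, 64, 500],
    labels=["18 to 29 years", "30 to 49 years",
            "50 to 64 years", "65 or more"],
    include_lowest=True,  # without this, an age of exactly 18 becomes NaN
)

# Side-by-side view of each age and the class it received
print(pd.DataFrame({"age": ages, "class": age_class}))
```

An age of 29 stays in the first class and 30 moves to the second, which matches the labels.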

Dropping the original Age

In [12]:
data = data.drop('Age', axis=1)

#### BMI classification

In [13]:
data["BMIClass"] = pd.cut(data.BMI, bins=[0, 18.4, 24.9, 29.9, 34.9, 40, 500],
                      labels=["Underweight", "Normal weight", "Pre-obesity",
                              "Obesity class I","Obesity class II", "Obesity class III"],
                          include_lowest=True).astype("object")

In [14]:
data["BMIClass"].value_counts()

Obesity class I      224
Pre-obesity          179
Obesity class II     152
Normal weight        102
Obesity class III     96
Underweight            4
Name: BMIClass, dtype: int64

Dropping the original BMI

In [15]:
data = data.drop('BMI', axis=1)

## Splitting data into train and test sets

In [16]:
from sklearn.model_selection import train_test_split

Splitting our features and our target

In [17]:
# Features
X = data.drop('Outcome', axis=1).copy()
# Target
y = data.Outcome.copy()

Splitting the data into train and test sets. Since this is a classification problem, it is important to stratify the split on the target so that both sets keep the same class proportions.

In [18]:
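# np.random.seed() returns None, so there is nothing useful to assign.
# Seeding the global NumPy generator is what makes the split reproducible.
np.random.seed(42)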

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, 
                                                    test_size=0.2)
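The effect of `stratify` can be checked on a synthetic target. This is only a sketch: `X_toy` and `y_toy` are made-up arrays, and `random_state=0` is an arbitrary choice to make the example repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target: 25 positives and 75 negatives (not the diabetes data)
y_toy = np.array([1] * 25 + [0] * 75)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=0
)

# With stratification, the share of positives is preserved in both sets
print("share of positives, full :", y_toy.mean())
print("share of positives, train:", y_tr.mean())
print("share of positives, test :", y_te.mean())
```

Without `stratify`, the share of positives in a small test set can drift away from the share in the full data just by chance.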

# Transforming numeric data

In [19]:
print(f"Shape of the X_train set {X_train.shape}")
print(f"Shape of the X_test set {X_test.shape}\n")
print(f"Shape of the y_train set {y_train.shape}")
print(f"Shape of the y_test set {y_test.shape}")

Shape of the X_train set (614, 8)
Shape of the X_test set (154, 8)

Shape of the y_train set (614,)
Shape of the y_test set (154,)


Getting the numeric features

In [20]:
# Getting the numeric features (every column whose dtype is not object)
numeric = [col for col in X.columns if X[col].dtype != "object"]
numeric

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'DiabetesPedigreeFunction']

### Imputing missing values
The `SimpleImputer` class from scikit-learn can impute numeric data using the **mean, median, most frequent value**, or a constant chosen by the user.
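As a quick illustration of how the strategies differ, the sketch below fills the single missing entry of a made-up column (`col` is hypothetical) with each strategy and records the value used.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One made-up column whose last entry is missing
col = np.array([[1.0], [1.0], [2.0], [3.0], [8.0], [np.nan]])

filled_with = {}
for strategy in ("mean", "median", "most_frequent"):
    imp = SimpleImputer(strategy=strategy)
    # The last row is the missing one, so it holds the imputed value
    filled_with[strategy] = float(imp.fit_transform(col)[-1, 0])

# strategy="constant" ignores the data and uses fill_value instead
const = SimpleImputer(strategy="constant", fill_value=0.0)
filled_with["constant"] = float(const.fit_transform(col)[-1, 0])

print(filled_with)
```

On skewed columns the mean can be pulled far from typical values by a few extremes, which is one reason the median is often preferred.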

Checking missing values

In [21]:
data[numeric].isna().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
DiabetesPedigreeFunction      0
dtype: int64

Here we note that some features have a lot of missing information. **Note**: imputing so many values is usually not recommended, but we will ignore that here.

In [22]:
# Importing the imputer
from sklearn.impute import SimpleImputer

In [23]:
# defining the imputer: strategy="most_frequent" fills missing values with
# the most frequent value of each column (strategy="median" would use the median)
imputer = SimpleImputer(strategy="most_frequent")

Every transformer in sklearn has three methods:
* `fit()`: the transformer learns the classes or values to be used for transforming the data. **Should only be used on the train set** to avoid data leakage;
* `transform()`: transforms the variables using the information gathered in the fit step. **Should be used on both sets**;
* `fit_transform()`: performs the fit and transform steps in a single call. **Should only be used on the train set**, and then the `transform()` method can be used on the test set.
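The three methods can be seen in action on a toy column. This is a sketch with made-up values. The names `toy_train`, `toy_test`, and `toy_imputer` are hypothetical and kept separate from the notebook's own objects. The median strategy is used only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up values: the test column contains an extreme value (100)
toy_train = np.array([[1.0], [3.0], [np.nan]])
toy_test = np.array([[np.nan], [100.0]])

toy_imputer = SimpleImputer(strategy="median")

# fit() learns the median from the training column only
toy_imputer.fit(toy_train)
print("learned median:", toy_imputer.statistics_)

# transform() reuses that learned value on any data set
train_filled = toy_imputer.transform(toy_train)
test_filled = toy_imputer.transform(toy_test)
print(test_filled)

# fit_transform() is fit() followed by transform() on the same data
same_as_above = SimpleImputer(strategy="median").fit_transform(toy_train)
```

The missing test value is filled with the median learned from the train column, so the extreme test value never influences the imputation.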

***
Here is how it can be done with our train and test sets.

In [24]:
# Fitting the imputer on X_train and transforming it in one step
X_train[numeric] = imputer.fit_transform(X_train[numeric])

# The imputer is now fitted on the train set,
# so we only **transform** the test set
X_test[numeric] = imputer.transform(X_test[numeric])

Now we have no missing values 

In [25]:
X_train[numeric].isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
DiabetesPedigreeFunction    0
dtype: int64

In [26]:
X_test[numeric].isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
DiabetesPedigreeFunction    0
dtype: int64

### Scaling your data
ML models often have a hard time handling features on different scales, for instance the **distance from the patient's house to the healthcare facility in meters** vs. the **age of the patient in years**. For this reason, it is good practice to scale your data.

The **sklearn** package has the `StandardScaler` class that works by **subtracting the mean** and then **dividing by the standard deviation**.
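To make the formula concrete, the sketch below compares `StandardScaler` with the manual computation on made-up values (`toy` is a hypothetical array with a distance column in meters and an age column in years). The scaler uses the population standard deviation, which is NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up columns on very different scales: distance (m) and age (years)
toy = np.array([[500.0, 21.0],
                [12000.0, 35.0],
                [3500.0, 64.0]])

toy_scaler = StandardScaler().fit(toy)
scaled = toy_scaler.transform(toy)

# Manual version: (x - mean) / std, computed column by column
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
# After scaling, every column has mean 0 and standard deviation 1
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```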


Importing the scaler


In [27]:
from sklearn.preprocessing import StandardScaler

In [28]:
# instantiating the scaler
scaler = StandardScaler()

Fitting the scaler on the training data and transforming it

In [29]:
X_train[numeric] = scaler.fit_transform(X_train[numeric])

In [30]:
X_train[numeric]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,DiabetesPedigreeFunction
353,-0.851355,-1.050652,-0.820182,-1.992318,-1.054792,0.310794
711,0.356576,0.148048,0.483615,-0.322535,-1.314596,-0.116439
373,-0.549372,-0.551194,-1.146131,1.124611,-0.423839,-0.764862
46,-0.851355,0.813992,-1.309105,0.234060,-0.287751,0.262314
682,-1.153338,-0.884166,-0.657207,1.013292,-0.287751,-0.337630
...,...,...,...,...,...,...
451,-0.549372,0.414426,-0.168283,0.234060,-0.287751,0.195653
113,0.054593,-1.516813,-0.820182,0.234060,-0.287751,-0.261879
556,-0.851355,-0.817571,-0.168283,1.124611,-0.287751,-0.786072
667,1.866489,-0.351410,-0.168283,-0.322535,-0.287751,-1.019383


Transforming the test data

In [31]:
X_test[numeric] = scaler.transform(X_test[numeric])

In [32]:
X_test[numeric]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,DiabetesPedigreeFunction
44,0.960541,1.246856,-0.657207,0.234060,-0.287751,-0.555791
672,1.866489,-1.783190,2.765259,-0.767810,-0.980562,-0.583061
700,-0.549372,0.014859,0.320641,-0.322535,0.887553,0.016882
630,0.960541,-0.251519,-0.657207,0.234060,-0.287751,0.771356
81,-0.549372,-1.583407,-0.168283,0.234060,-0.287751,-1.137554
...,...,...,...,...,...,...
32,-0.247390,-1.117246,-1.146131,-2.103636,-0.918704,-0.637601
637,-0.549372,-0.917463,0.320641,-1.324404,-0.770244,0.519865
593,-0.549372,-1.317029,-1.635054,-0.879129,-0.164035,3.701382
425,0.054593,2.079286,0.483615,1.013292,1.840168,-0.646691


# Speeding it up with Pipelines and the ColumnTransformer
The process above is quite "messy" and confusing. Luckily, sklearn provides the `Pipeline` and `ColumnTransformer` classes (among others) that can greatly help you in training and evaluating machine learning models.

The **Pipeline** allows you to define a series of steps that need to be performed in the specified order. Here we will create the `numeric_transformer` to:
1. Impute missing values in the numeric variables, using the `imputer` defined above;
2. Apply the StandardScaler method to scale data.

The `Pipeline` takes a list (`[]`) of steps, and each step is a tuple (`()`). The tuple has two elements: the name of the step and the processor used. The final shape should be something like this: `Pipeline([("step1", processor1), ..., ("stepN", processorN)])`
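A minimal sketch of this shape on a made-up column follows. The names `toy`, `toy_pipe`, and the step names are hypothetical, and the median strategy is chosen only for illustration. It also checks that the pipeline gives the same result as running the two steps by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up column with one missing value
toy = np.array([[1.0], [np.nan], [3.0], [8.0]])

# A list of (name, processor) tuples, applied in order
toy_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
piped = toy_pipe.fit_transform(toy)

# The same two steps performed by hand, one after the other
imputed = SimpleImputer(strategy="median").fit_transform(toy)
by_hand = StandardScaler().fit_transform(imputed)
print(np.allclose(piped, by_hand))

# Each fitted step stays accessible through its name
learned_median = toy_pipe.named_steps["imputer"].statistics_[0]
print(learned_median)
```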

---

Next, we will use the **ColumnTransformer** to specify which columns will be transformed. This class is similar to the **Pipeline**, but each tuple takes a third element with the names of the columns to be transformed: `ColumnTransformer([("step1", processor1, ["columnA", "columnB"]), ..., ("stepN", processorN, ["columnC", "columnD"])])`
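Here is a small sketch of that three-element shape. The frame `toy`, its columns, and the step names are all hypothetical. Each processor only sees the columns listed for it, the outputs are concatenated in the order the steps are given, and with the default `remainder="drop"` any column that is not listed is left out of the result.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Made-up frame: "c" is not listed in any step below
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [10.0, np.nan, 30.0],
    "c": ["x", "y", "z"],
})

# A list of (name, processor, columns) tuples
toy_ct = ColumnTransformer([
    ("scale_a", StandardScaler(), ["a"]),
    ("impute_b", SimpleImputer(strategy="mean"), ["b"]),
])
out = toy_ct.fit_transform(toy)

# Two output columns: scaled "a", then imputed "b"; "c" was dropped
print(out.shape)
print(out)
```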

In [33]:
# Importing the Pipeline class
from sklearn.pipeline import Pipeline

In [34]:
# Defining the numeric_transformer, specifying the steps to be taken
# 1. First we will apply the imputer defined previously (using the SimpleImputer);
# 2. And then we will apply the scaler defined previously (using the StandardScaler)
numeric_transformer = Pipeline(
    [("imputer", imputer),
     ("scaler", scaler)]
)

The usage of the `numeric_transformer` is quite straightforward. You call `fit_transform()` on the numeric variables of the training data and only `transform()` on the numeric variables of the test data.

In [35]:
# Getting a clean version of the train/test X/y sets.
# Re-seeding the global NumPy generator so we get the same split as before
np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, 
                                                    test_size=0.2)

In [36]:
# Fit_transform the train data
X_train[numeric] = numeric_transformer.fit_transform(X_train[numeric])
X_train[numeric]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,DiabetesPedigreeFunction
353,-0.851355,-1.050652,-0.820182,-1.992318,-1.054792,0.310794
711,0.356576,0.148048,0.483615,-0.322535,-1.314596,-0.116439
373,-0.549372,-0.551194,-1.146131,1.124611,-0.423839,-0.764862
46,-0.851355,0.813992,-1.309105,0.234060,-0.287751,0.262314
682,-1.153338,-0.884166,-0.657207,1.013292,-0.287751,-0.337630
...,...,...,...,...,...,...
451,-0.549372,0.414426,-0.168283,0.234060,-0.287751,0.195653
113,0.054593,-1.516813,-0.820182,0.234060,-0.287751,-0.261879
556,-0.851355,-0.817571,-0.168283,1.124611,-0.287751,-0.786072
667,1.866489,-0.351410,-0.168283,-0.322535,-0.287751,-1.019383


In [37]:
# Transform test data
X_test[numeric] = numeric_transformer.transform(X_test[numeric])
X_test[numeric]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,DiabetesPedigreeFunction
44,0.960541,1.246856,-0.657207,0.234060,-0.287751,-0.555791
672,1.866489,-1.783190,2.765259,-0.767810,-0.980562,-0.583061
700,-0.549372,0.014859,0.320641,-0.322535,0.887553,0.016882
630,0.960541,-0.251519,-0.657207,0.234060,-0.287751,0.771356
81,-0.549372,-1.583407,-0.168283,0.234060,-0.287751,-1.137554
...,...,...,...,...,...,...
32,-0.247390,-1.117246,-1.146131,-2.103636,-0.918704,-0.637601
637,-0.549372,-0.917463,0.320641,-1.324404,-0.770244,0.519865
593,-0.549372,-1.317029,-1.635054,-0.879129,-0.164035,3.701382
425,0.054593,2.079286,0.483615,1.013292,1.840168,-0.646691


Now we define the `ColumnTransformer` and specify which columns each pipeline should transform.

**Note**: I will add an extra pipeline to transform categorical data for the sake of completeness. You can check out the step-by-step guide for preprocessing of categorical data on the [GEMINI](https://github.com/gemini-duke/toolbox/blob/main/machine-learning/data-preparation/Preprocessing_of_categorical_data_sklearn.ipynb) GitHub.

In [38]:
# Extra step for categorical variables -- see code in the link above
from sklearn.preprocessing import OneHotEncoder

cat_transformer = Pipeline(
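    [("imputer", SimpleImputer(strategy="most_frequent")),
     # `sparse=False` matches the scikit-learn 0.24.1 installed above;
     # releases from 1.2 onward call this parameter `sparse_output`
     ("ohe", OneHotEncoder(sparse=False, handle_unknown="ignore"))]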
)

In [39]:
# Getting the names of categorical columns
categorical = [col for col in X.columns if X[col].dtype == "object"]
categorical

['AgeClass', 'BMIClass']

In [40]:
# Here we specify which pipelines will be applied to each set of variables
from sklearn.compose import ColumnTransformer
transformer = ColumnTransformer(
    [("categoric_data", cat_transformer, categorical),
     ("numeric_data", numeric_transformer, numeric)])

Visualizing the pipeline

In [41]:
from sklearn import set_config
set_config(display="diagram")

display(transformer)

Now it is easy to obtain the transformed train and test versions of your dataset

In [42]:
# Re-seeding the global NumPy generator so we get the same split as before
np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, 
                                                    test_size=0.2)

In [43]:
# We will use fit_transform in the train set
X_train_prep = transformer.fit_transform(X_train)
# And we will only use the transform on the test set
X_test_prep = transformer.transform(X_test)

In [44]:
X_train_prep

array([[ 1.        ,  0.        ,  0.        , ..., -1.99231762,
        -1.05479169,  0.31079384],
       [ 0.        ,  1.        ,  0.        , ..., -0.32253463,
        -1.31459578, -0.11643851],
       [ 1.        ,  0.        ,  0.        , ...,  1.12461063,
        -0.42383891, -0.76486207],
       ...,
       [ 0.        ,  1.        ,  0.        , ...,  1.12461063,
        -0.28775106, -0.78607218],
       [ 0.        ,  1.        ,  0.        , ..., -0.32253463,
        -0.28775106, -1.01938346],
       [ 0.        ,  1.        ,  0.        , ..., -0.21121576,
         0.14525575, -0.57700104]])

## Bonus: creating a pipeline that transforms and trains a model

In [45]:
from sklearn.linear_model import LogisticRegression
logit = LogisticRegression()

In [46]:
final_pipe = Pipeline(
    [("transformations", transformer),
     ("logit", logit)]
)

In [47]:
display(final_pipe)

In [48]:
# Fitting the whole pipeline and model to the training set
final_pipe.fit(X_train, y_train)
# Predicting the target for the test set from X_test
y_pred = final_pipe.predict(X_test)

In [49]:
from sklearn.metrics import accuracy_score, recall_score, precision_score

In [50]:
acc = accuracy_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)

In [51]:
print("The results of our model are:")
print(f"Accuracy: {acc}")
print(f"Recall: {rec}")
print(f"Precision: {prec}")

The results of our model are:
Accuracy: 0.7402597402597403
Recall: 0.5555555555555556
Precision: 0.6521739130434783
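To read these three numbers, it can help to recompute them from the confusion matrix. The sketch below uses made-up labels (`y_true` and `y_hat` are hypothetical and unrelated to the model above) and compares the manual formulas with the sklearn functions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels: 3 real positives and 5 real negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

acc_manual = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
rec_manual = tp / (tp + fn)   # share of real positives that were found
prec_manual = tp / (tp + fp)  # share of predicted positives that are real

print(acc_manual, accuracy_score(y_true, y_hat))
print(rec_manual, recall_score(y_true, y_hat))
print(prec_manual, precision_score(y_true, y_hat))
```

A recall lower than the accuracy, as in the results above, means the model misses a sizeable share of the positive cases even though most predictions overall are correct.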
