# Preparation

## Import packages

In [1]:
# pandas is a package mostly used for managing and organizing data
import pandas as pd
# numpy is a package used for numerical operations
import numpy as np
# matplotlib and seaborn are packages for plotting graphs
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
# scikit-learn is a package with many machine learning applications;
# it is imported further below, after a specific version is installed
# import sklearn as sk
# Suppressing warnings to keep the output readable
import warnings
warnings.filterwarnings("ignore")

## Versions

In [2]:
# Python version in use
!python --version

Python 3.7.11


In [3]:
print('The Pandas version utilized in this example is -',pd.__version__)
print('The Matplotlib version utilized in this example is -',mpl.__version__)
print('The Seaborn version utilized in this example is -',sns.__version__)
print('The Numpy version utilized in this example is -',np.__version__)

The Pandas version utilized in this example is - 1.1.5
The Matplotlib version utilized in this example is - 3.2.2
The Seaborn version utilized in this example is - 0.11.1
The Numpy version utilized in this example is - 1.19.5


Installing a more recent version of sklearn

In [4]:
# pip's output is redirected to a file named "y" to keep the notebook clean
!pip install scikit-learn==0.24.1 > y
import sklearn as sk
print(sk.__version__)
print('The Sklearn version utilized in this example is -',sk.__version__)

0.24.1
The Sklearn version utilized in this example is - 0.24.1


## Data

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are **females at least 21 years** old of Pima Indian heritage.

This dataset can be found on [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

In [5]:
# URL of the dataset stored in the GEMINI GitHub repository
url = "https://github.com/gemini-duke/toolbox/blob/main/machine-learning/data-preparation/diabetes_kaggle.zip?raw=true"

In [6]:
# reading data using the pandas package.
# the dataset is compressed in a zip file, so we need to use compression="zip"
data = pd.read_csv(url, compression="zip")

In [7]:
data.head()

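| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |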


Our dataset has 8 features (`Pregnancies`, `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, `BMI`, `DiabetesPedigreeFunction`, `Age`) and the target, `Outcome`.


### Recoding null values
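Some variables have missing values that were coded as "0" instead of NaN. A value of 0 is not plausible for measurements such as glucose or blood pressure, so we recode those zeros as NaN.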

In [8]:
fill_na = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
           'DiabetesPedigreeFunction']

In [9]:
data[fill_na] = data[fill_na].replace(0, np.nan)
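To make the recoding step concrete, here is a minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset). It shows why only some columns are recoded: a 0 in `Pregnancies` is a real count, while a 0 in `Glucose` or `Insulin` stands for a missing measurement.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5],   # 0 is a valid value here
    "Glucose": [148, 0, 183],   # 0 stands for "not measured"
    "Insulin": [0, 94, 0],      # 0 stands for "not measured"
})

# Recode zeros as NaN only in the columns where 0 means "missing"
zero_means_missing = ["Glucose", "Insulin"]
toy[zero_means_missing] = toy[zero_means_missing].replace(0, np.nan)

missing_per_column = toy.isna().sum()
print(missing_per_column)
```

After the replacement the recoded columns become floats, because NaN is a floating-point value.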

### Changing numeric features to categorical

All of our features are numeric. I will build **extra features that are categorical** so we have different data types to handle.

#### Age categories

In [10]:
data["AgeClass"] = pd.cut(data.Age, bins=[18, 29,49,64, 500], 
                       labels=["18 to 29 years", "30 to 49 years", 
                               "50 to 64 years", "65 or more"],
                       include_lowest=True).astype("object")

In [11]:
data["AgeClass"].value_counts()

18 to 29 years    396
30 to 49 years    283
50 to 64 years     73
65 or more         16
Name: AgeClass, dtype: int64
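If the bin edges of `pd.cut` feel opaque, the short sketch below applies the same bins to a few hand-picked ages (my own toy values). Bins are closed on the right by default, so 29 falls in the first class and 30 in the second; `include_lowest=True` only affects the left edge of the first bin.

```python
import pandas as pd

# Hand-picked ages sitting on and around the bin edges
ages = pd.Series([21, 29, 30, 49, 50, 64, 65, 81])

age_class = pd.cut(
    ages,
    bins=[18, 29, 49, 64, 500],
    labels=["18 to 29 years", "30 to 49 years",
            "50 to 64 years", "65 or more"],
    include_lowest=True,
)

print(pd.DataFrame({"age": ages, "class": age_class}))
```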

Dropping the original Age

In [12]:
data = data.drop('Age', axis=1)

#### BMI classification

In [13]:
data["BMIClass"] = pd.cut(data.BMI, bins=[0, 18.4, 24.9, 29.9, 34.9, 40, 500],
                      labels=["Underweight", "Normal weight", "Pre-obesity",
                              "Obesity class I","Obesity class II", "Obesity class III"],
                          include_lowest=True).astype("object")

In [14]:
data["BMIClass"].value_counts()

Obesity class I      224
Pre-obesity          179
Obesity class II     152
Normal weight        102
Obesity class III     96
Underweight            4
Name: BMIClass, dtype: int64

Dropping the original BMI

In [15]:
data = data.drop('BMI', axis=1)

## Splitting data into train and test

In [16]:
from sklearn.model_selection import train_test_split

Splitting our features and our target

In [17]:
# Features
X = data.drop('Outcome', axis=1).copy()
# Target
y = data.Outcome.copy()

Splitting the data into train and test sets. Since this is a classification problem, we stratify on the target so that both sets keep the same proportion of positive and negative cases.

In [18]:
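# np.random.seed() returns None, so there is nothing useful to assign;
# seeding NumPy's global generator is what makes the split reproducible
np.random.seed(42)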

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, 
                                                    test_size=0.2)
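As a small illustration of what `stratify` does, the sketch below splits a made-up target with 30% positives and checks that both parts keep that proportion. Note that I pass `random_state` directly here, which is an alternative to seeding NumPy globally as done above; the toy arrays are my own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up target: 70 negatives and 30 positives
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Both proportions should match the 30% of the full toy target
print(f"Positives in train: {y_tr.mean():.2f}")
print(f"Positives in test:  {y_te.mean():.2f}")
```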

# Transforming categorical data

In [19]:
print(f"Shape of the X_train set {X_train.shape}")
print(f"Shape of the X_test set {X_test.shape}\n")
print(f"Shape of the y_train set {y_train.shape}")
print(f"Shape of the y_test set {y_test.shape}")

Shape of the X_train set (614, 8)
Shape of the X_test set (154, 8)

Shape of the y_train set (614,)
Shape of the y_test set (154,)


Getting the categorical features

In [20]:
# Getting the categorical features
categorical = [col for col in X.columns if X[col].dtype == "object"]
categorical

['AgeClass', 'BMIClass']

### Imputing missing values
The `SimpleImputer` class from scikit-learn can impute categorical data with either the **most frequent class** or a **specific value**. I will show examples of both.

Checking missing values

In [21]:
data[categorical].isna().sum()

AgeClass     0
BMIClass    11
dtype: int64

In [22]:
# Importing the imputer
from sklearn.impute import SimpleImputer

In [23]:
# defining the imputer to fill every null entry with the string "missing"
imp_value = SimpleImputer(strategy="constant", fill_value="missing")

# defining the imputer with most_frequent as the strategy
imp_frequent = SimpleImputer(strategy="most_frequent")
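To compare the two strategies side by side, here is a minimal sketch on a made-up single-column frame (the rows are illustrative). The `constant` strategy writes the placeholder string, while `most_frequent` copies the most common class.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up categorical column with one missing entry (row 2)
toy = pd.DataFrame(
    {"BMIClass": ["Normal weight", "Pre-obesity", np.nan, "Pre-obesity"]},
    dtype=object,
)

constant_imputer = SimpleImputer(strategy="constant", fill_value="missing")
frequent_imputer = SimpleImputer(strategy="most_frequent")

# Both calls return NumPy arrays with the same shape as the input
filled_constant = constant_imputer.fit_transform(toy)
filled_frequent = frequent_imputer.fit_transform(toy)

print(filled_constant.ravel())
print(filled_frequent.ravel())
```

Keeping an explicit "missing" class can be useful when the fact that a value is absent may itself carry information; the most frequent class is the simpler default.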


Every transformer in sklearn has three methods:
* `fit()`: the transformer learns the classes or values to be used for transforming the data. **Should only be used on the train set** to avoid data leakage;
* `transform()`: transforms the variables using the information gathered in the fit step. **Should be used on both sets**;
* `fit_transform()`: performs the fit and transform steps in a single call. **Should only be used on the train set**; the `transform()` method is then used on the test set.

***
Here is how it can be done.

In [24]:
# Fitting the imp_frequent object on X_train and transforming it in one step
X_train[categorical] = imp_frequent.fit_transform(X_train[categorical])

# After fitting on the categorical variables of the train set,
# we only **transform** the test set
X_test[categorical] = imp_frequent.transform(X_test[categorical])

Now the categorical features have no missing values.

In [25]:
X_train[categorical].isna().sum()

AgeClass    0
BMIClass    0
dtype: int64

In [26]:
X_test[categorical].isna().sum()

AgeClass    0
BMIClass    0
dtype: int64
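To see what "fit on the train set only" means in practice, the sketch below uses two made-up frames whose most frequent classes differ. The imputer stores what it learned in its `statistics_` attribute, and the test set is filled with the class learned from the train set, not with its own most frequent class.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data: "Pre-obesity" dominates the train set,
# "Normal weight" dominates the test set
train = pd.DataFrame(
    {"BMIClass": ["Pre-obesity", "Pre-obesity", "Normal weight", np.nan]},
    dtype=object,
)
test = pd.DataFrame(
    {"BMIClass": ["Normal weight", "Normal weight", np.nan]},
    dtype=object,
)

imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(train)                    # learns from the train set only
learned = imputer.statistics_[0]      # the class stored during fit

test_filled = imputer.transform(test)  # reuses the learned class

print("Learned from train:", learned)
print("Filled test column:", test_filled.ravel())
```

Calling `fit` on the test set instead would let information from the test data influence the preprocessing, which is the leakage the list above warns about.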

### One hot encoding (OHE)
OHE is a method that creates one new binary feature for each class present in a categorical variable. This is important to **represent categorical data** numerically, since most models cannot handle text labels directly, and it often **increases the performance of machine learning models**. OHE is also referred to as creating "dummy variables" or "dummification".

***

![img](https://i.imgur.com/mtimFxh.png)

Importing the encoder


In [27]:
from sklearn.preprocessing import OneHotEncoder

Here I specify that I do not want a sparse matrix and that unknown categories in the test set should be ignored, which encodes them as 0 in every dummy column.

In [28]:
# instantiating the encoder
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
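The effect of `handle_unknown='ignore'` is easier to see on a tiny made-up example. In this sketch I leave the output sparse and call `.toarray()`, because the keyword for dense output is `sparse` in the 0.24.1 release used here and, as far as I know, `sparse_output` in newer releases.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up data: "65 or more" never appears in the train set
train = np.array([["18 to 29 years"],
                  ["30 to 49 years"],
                  ["18 to 29 years"]], dtype=object)
test = np.array([["30 to 49 years"],
                 ["65 or more"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Convert the sparse result to a dense array for printing
encoded_test = encoder.transform(test).toarray()

print(encoder.categories_[0])
print(encoded_test)
```

The second test row is all zeros: the unseen class gets no column of its own and is silently treated as "none of the known classes".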

Fitting the encoder on the training data and transforming it

In [29]:
cat_train_transformed = ohe.fit_transform(X_train[categorical])

In [30]:
cat_train_transformed

array([[1., 0., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 0., 1., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 0., 1., 0.]])

Transforming the test data

In [31]:
cat_test_transformed = ohe.transform(X_test[categorical])

In [32]:
cat_test_transformed

array([[0., 1., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]])

The output of the transformation is a NumPy array. The lack of column names makes it difficult to work with, so we can convert it back to a pandas DataFrame, using the category names learned by the encoder as column names.

In [33]:
cols = list(ohe.categories_[0]) + list(ohe.categories_[1])
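Plain category names work here because the two variables share no class labels. As a hedged variation, the sketch below prefixes each category with its feature name, which avoids clashes when two variables have a class with the same label. The toy array and the `features` list are my own.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

features = ["AgeClass", "BMIClass"]
X_cat = np.array([["18 to 29 years", "Pre-obesity"],
                  ["30 to 49 years", "Normal weight"]], dtype=object)

encoder = OneHotEncoder().fit(X_cat)

# categories_ holds one array of classes per input feature, in column order
prefixed_cols = [
    f"{feature}_{category}"
    for feature, categories in zip(features, encoder.categories_)
    for category in categories
]

print(prefixed_cols)
```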

Dataframe for the train set

In [34]:
cat_train = pd.DataFrame(cat_train_transformed, index=X_train.index, columns=cols)

In [35]:
cat_train.head()

| | 18 to 29 years | 30 to 49 years | 50 to 64 years | 65 or more | Normal weight | Obesity class I | Obesity class II | Obesity class III | Pre-obesity | Underweight |
|---|---|---|---|---|---|---|---|---|---|---|
| 353 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 711 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 373 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 46 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 682 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |


Dataframe for the test set

In [36]:
cat_test = pd.DataFrame(cat_test_transformed, index=X_test.index, columns=cols)

In [37]:
cat_test.head()

| | 18 to 29 years | 30 to 49 years | 50 to 64 years | 65 or more | Normal weight | Obesity class I | Obesity class II | Obesity class III | Pre-obesity | Underweight |
|---|---|---|---|---|---|---|---|---|---|---|
| 44 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 672 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 700 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 630 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 81 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |


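Now we can concatenate the transformed features with the X_train and X_test sets. Note that the original `AgeClass` and `BMIClass` columns are still present after the concatenation.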

In [38]:
X_train_prep = pd.concat([X_train, cat_train], axis=1)

In [39]:
X_train_prep.head()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | DiabetesPedigreeFunction | AgeClass | BMIClass | 18 to 29 years | 30 to 49 years | 50 to 64 years | 65 or more | Normal weight | Obesity class I | Obesity class II | Obesity class III | Pre-obesity | Underweight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 353 | 1 | 90.0 | 62.0 | 12.0 | 43.0 | 0.58 | 18 to 29 years | Pre-obesity | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 711 | 5 | 126.0 | 78.0 | 27.0 | 22.0 | 0.439 | 30 to 49 years | Pre-obesity | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 373 | 2 | 105.0 | 58.0 | 40.0 | 94.0 | 0.225 | 18 to 29 years | Obesity class I | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 46 | 1 | 146.0 | 56.0 | NaN | NaN | 0.564 | 18 to 29 years | Pre-obesity | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 682 | 0 | 95.0 | 64.0 | 39.0 | 105.0 | 0.366 | 18 to 29 years | Obesity class III | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |


In [40]:
X_test_prep = pd.concat([X_test, cat_test], axis=1)

In [41]:
X_test_prep.head()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | DiabetesPedigreeFunction | AgeClass | BMIClass | 18 to 29 years | 30 to 49 years | 50 to 64 years | 65 or more | Normal weight | Obesity class I | Obesity class II | Obesity class III | Pre-obesity | Underweight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 44 | 7 | 159.0 | 64.0 | NaN | NaN | 0.294 | 30 to 49 years | Pre-obesity | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 672 | 10 | 68.0 | 106.0 | 23.0 | 49.0 | 0.285 | 30 to 49 years | Obesity class II | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 700 | 2 | 122.0 | 76.0 | 27.0 | 200.0 | 0.483 | 18 to 29 years | Obesity class II | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 630 | 7 | 114.0 | 64.0 | NaN | NaN | 0.732 | 30 to 49 years | Pre-obesity | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 81 | 2 | 74.0 | NaN | NaN | NaN | 0.102 | 18 to 29 years | Obesity class I | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
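The concatenated frames still carry the raw text columns next to their dummies. Before feeding such a frame to a model, one would typically drop the raw columns. A minimal sketch on a made-up two-row frame (only the index values and labels are borrowed from the output above; everything else is illustrative):

```python
import pandas as pd

# Made-up slice of a feature frame and its matching dummy columns
X_part = pd.DataFrame(
    {"Glucose": [90.0, 126.0],
     "AgeClass": ["18 to 29 years", "30 to 49 years"]},
    index=[353, 711],
)
dummies = pd.DataFrame(
    {"18 to 29 years": [1.0, 0.0],
     "30 to 49 years": [0.0, 1.0]},
    index=[353, 711],
)

# Drop the raw categorical column, then append the dummies;
# rows are aligned on the shared index
X_model = pd.concat([X_part.drop(columns=["AgeClass"]), dummies], axis=1)

print(X_model)
```

The shared index is what keeps the rows aligned, which is why the dummy frames above were built with `index=X_train.index` and `index=X_test.index`.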


# Speeding it up with Pipelines and the ColumnTransformer
The process above is quite "messy" and confusing. Luckily, sklearn provides the Pipeline and ColumnTransformer classes (among others), which can greatly help you in training and evaluating machine learning models.

The **Pipeline** allows you to define a series of steps that need to be performed in the specified order. Here we will create the `cat_transformer` to:
1. Impute missing values with the most frequent class in categorical variables;
2. Perform One-Hot-Encoding on your data.

The `Pipeline` takes a list (`[]`) of steps, and each step is organized in a tuple (`()`). The tuple takes two elements: the name of the step and the processor used. The final shape should be something like this: `Pipeline([("step1", processor1), ..., ("stepN", processorN)])`
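As a minimal sketch of that shape, the toy pipeline below chains an imputer and an encoder on a made-up single-column array. The step names are free-form labels of my own choosing; the output is converted with `.toarray()` because the encoder returns a sparse matrix by default.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up column with one missing value (row 2)
X_cat = np.array([["a"], ["b"], [np.nan], ["b"]], dtype=object)

# Steps run in the listed order: impute first, then encode
toy_pipe = Pipeline(
    [("imputer", SimpleImputer(strategy="most_frequent")),
     ("ohe", OneHotEncoder(handle_unknown="ignore"))]
)

encoded = toy_pipe.fit_transform(X_cat).toarray()

print(list(toy_pipe.named_steps))
print(encoded)
```

The missing row is filled with "b" (the most frequent class) before encoding, so no NaN ever reaches the encoder.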

---

Next, we will use the **ColumnTransformer** to specify which columns will be transformed. This class is similar to the **Pipeline**, but the difference is that each tuple takes a third element with the names of the columns to be transformed: `ColumnTransformer([("step1", processor1, ["columnA", "columnB"]), ..., ("stepN", processorN, ["columnC", "columnD"])])`
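Here is a small sketch of the three-element tuples on a made-up frame with one text column and one numeric column. I set `sparse_threshold=0` so the result is always a dense array, which is only a convenience for printing; the data and class labels are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up frame with one numeric and one categorical column
toy = pd.DataFrame({
    "Glucose": [80.0, 100.0, 120.0],
    "AgeClass": pd.Series(["young", "old", "young"], dtype=object),
})

# Each tuple: (name, processor, columns the processor is applied to)
toy_transformer = ColumnTransformer(
    [("categoric_data", OneHotEncoder(), ["AgeClass"]),
     ("numeric_data", StandardScaler(), ["Glucose"])],
    sparse_threshold=0,  # always return a dense array
)

out = toy_transformer.fit_transform(toy)

# Output columns follow the order of the tuples:
# the two dummies first, then the scaled Glucose
print(out)
```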

In [42]:
# Importing the Pipeline class
from sklearn.pipeline import Pipeline

In [43]:
# Defining the cat_transformer, specifying the steps to be taken
# 1. First we will apply the imp_frequent defined previously (using the SimpleImputer);
# 2. And then we will apply the ohe defined previously (using the OneHotEncoder)
cat_transformer = Pipeline(
    [("imputer", imp_frequent),
     ("ohe", ohe)]
)

Now we define the ColumnTransformer and specify which columns should be transformed.

**Note**: I will add an extra pipeline that scales the numeric variables for the sake of completeness. I will build a separate notebook with detailed information on numeric variables as well.

In [44]:
from sklearn.preprocessing import StandardScaler
numeric_transformer = Pipeline(
    [("imputer_median", SimpleImputer(strategy="median")),
     ("std_scaler", StandardScaler())]
     )
numeric = [cname for cname in X.columns if X[cname].dtype != "object"]

In [45]:
from sklearn.compose import ColumnTransformer
transformer = ColumnTransformer(
    [("categoric_data", cat_transformer, categorical),
     ("numeric_data", numeric_transformer, numeric)])

Visualizing the pipeline

In [46]:
from sklearn import set_config
set_config(display="diagram")

display(transformer)

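Now it is easy to obtain the transformed train and test versions of your dataset. We first split the data again, because the earlier cells overwrote the categorical columns of X_train and X_test with imputed values.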

In [47]:
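# Re-seeding NumPy's global generator so this split matches the earlier one
# (np.random.seed() returns None, so there is nothing useful to assign)
np.random.seed(42)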

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, 
                                                    test_size=0.2)

In [48]:
# We will use fit_transform in the train set
X_train_prep = transformer.fit_transform(X_train)
# And we will only use the transform on the test set
X_test_prep = transformer.transform(X_test)

In [49]:
X_train_prep

array([[ 1.        ,  0.        ,  0.        , ..., -1.91818693,
        -1.20336073,  0.31079384],
       [ 0.        ,  1.        ,  0.        , ..., -0.22987447,
        -1.47019479, -0.11643851],
       [ 1.        ,  0.        ,  0.        , ...,  1.23332967,
        -0.55533518, -0.76486207],
       ...,
       [ 0.        ,  1.        ,  0.        , ...,  1.23332967,
        -0.16143729, -0.78607218],
       [ 0.        ,  1.        ,  0.        , ..., -0.22987447,
        -0.16143729, -1.01938346],
       [ 0.        ,  1.        ,  0.        , ..., -0.1173203 ,
         0.02915846, -0.57700104]])

## Bonus: creating a pipeline that transforms and trains a model

In [50]:
from sklearn.linear_model import LogisticRegression
logit = LogisticRegression()

In [51]:
final_pipe = Pipeline(
    [("transformations", transformer),
     ("logit", logit)]
)

In [52]:
display(final_pipe)

In [53]:
# Fitting the whole pipeline and model to the training set
final_pipe.fit(X_train, y_train)
# Predicting the outcomes of the test set from X_test
y_pred = final_pipe.predict(X_test)

In [54]:
from sklearn.metrics import accuracy_score, recall_score, precision_score

In [55]:
acc = accuracy_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)

In [56]:
print("The results of our model are:")
print(f"Accuracy: {acc}")
print(f"Recall: {rec}")
print(f"Precision: {prec}")

The results of our model are:
Accuracy: 0.7467532467532467
Recall: 0.5555555555555556
Precision: 0.6666666666666666
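To connect these three numbers, the sketch below recomputes them by hand from confusion-matrix counts. The counts are not printed by the notebook: they are an inference of mine, being the integer counts that reproduce the three reported scores on a 154-row test set.

```python
# Counts inferred from the reported scores (an assumption, not notebook output)
tp = 30  # diabetic patients predicted as diabetic
fp = 15  # non-diabetic patients predicted as diabetic
fn = 24  # diabetic patients predicted as non-diabetic
tn = 85  # non-diabetic patients predicted as non-diabetic

total = tp + fp + fn + tn

accuracy = (tp + tn) / total  # share of all predictions that are correct
recall = tp / (tp + fn)       # share of actual positives that were found
precision = tp / (tp + fp)    # share of predicted positives that are right

print(f"Accuracy: {accuracy}")
print(f"Recall: {recall}")
print(f"Precision: {precision}")
```

Read this way, the model misses 24 of the 54 diabetic patients in the test set, which the accuracy alone does not reveal.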
