### **D2APR: Aprendizado de Máquina e Reconhecimento de Padrões** (IFSP, Campinas) <br/>
**Prof**: Samuel Martins (Samuka) <br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>. <br/><br/>

#### Custom CSS style

In [None]:
%%html
<style>
.dashed-box {
    border: 1px dashed black !important;
    /* font-size: var(--jp-content-font-size1) !important; */
}

.dashed-box table {

}

.dashed-box tr {
    background-color: white !important;
}
        
.alt-tab {
    background-color: black;
    color: #ffc351;
    padding: 4px;
    font-size: 1em;
    font-weight: bold;
    font-family: monospace;
}
/* add your CSS styling here */
</style>

<span style='font-size: 2.5em'><b>Cardiovascular Disease 💔</b></span><br/>
<span style='font-size: 1.5em'>Predict cardiovascular diseases</span>

<span style="background-color: #ffc351; padding: 4px; font-size: 1em;"><b>Sprint #3</b></span>

<img src="./imgs/cardio.png" width=300/>

---



## Before starting this notebook
This Jupyter notebook is designed for **experimental and teaching purposes**. <br/>
Although it is (relatively) well organized, it aims at solving the _target problem_ by evaluating (and documenting) _different solutions_ for some steps of the **machine learning pipeline** (see the [***Machine Learning Project Checklist by xavecoding***](https://github.com/xavecoding/IFSP-CMP-D2APR-2021.2/blob/main/cheat-sheets/machine-learning-project-checklist_by_xavecoding.pdf)). <br/>
We tried to make this notebook a literal _notebook_, so it contains notes, drafts, comments, etc.<br/>

For teaching purposes, some parts of the notebook may be _overcommented_. Moreover, to simulate a real development scenario, we will divide our solution and experiments into **"sprints"** in which each sprint has some goals (e.g., perform _feature selection_, train more ML models, ...). <br/>
The **sprint goal** will be stated at the beginning of the notebook.

A ***final notebook*** (or any other kind of presentation) that compiles and summarizes all sprints — the target problem, solutions, and findings — should be created later.

#### Conventions

<ul>
    <li>💡 indicates a tip. </li>
    <li> ⚠️ indicates a warning message. </li>
    <li><span class='alt-tab'>alt tab</span> indicates extra content (<i>e.g.</i>, slides) that explains a given concept.</li>
</ul>

---

## 🎯 Sprint Goals
- Evaluate different strategies/versions of Naive Bayes on the training set:
  + Strategy #1 - Gaussian Naive Bayes with only numerical features
  + Strategy #2 - Categorical Naive Bayes with only categorical features
  + Strategy #3 - Categorical Naive Bayes after converting numerical features to categorical
  + Strategy #4 - Mixed Naive Bayes
---

### 0. Imports and default settings for plotting

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")

params = {'legend.fontsize': 'x-large',
          'figure.figsize': (15, 5),
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'x-large',
          'ytick.labelsize': 'x-large'}
plt.rcParams.update(params)

## 🛠️ 5. Prepare the Data

Each strategy for Naive Bayes will require a specific preprocessing pipeline. So, let's skip this section and create a full pipeline with preprocessing and classifiers for each version.

**Preprocessing tasks**
- Fill in missing values (imputation)
- Add new features
- Feature Scaling
- One-Hot Encoding
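
Although the actual preprocessing is deferred to the pipelines of Section 6, the toy sketch below (hypothetical values, not taken from the cardio dataset) illustrates two of these tasks. It fills a missing value with the column median and then applies `RobustScaler`, which subtracts the median and divides by the interquartile range, so a single outlier does not dominate the scale.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# hypothetical toy column with one missing value and one outlier
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0]])

toy_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # NaN -> median of [1, 2, 4, 100] = 3
    ('robust_scaler', RobustScaler())               # (x - median) / IQR, here (x - 3) / 2
])

X_scaled = toy_pipeline.fit_transform(X_toy)
print(X_scaled.ravel())
```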

### 5.1. Load the cleaned training set
We assume the training and testing sets were already cleaned in Sprint #1.

In [None]:
cardio_train = pd.read_csv('./datasets/cardio_clean_train.csv')

In [None]:
cardio_train.head()

In [None]:
# A reminder of what the categorical variables look like
for cat_attribute in ['gender', 'cholesterol', 'gluc', 'smoke', 'alco', 'active']:
    print(cardio_train[cat_attribute].value_counts())
    print()

### 5.2. Separate the features and the classes (target outcome)

In [None]:
cardio_train.columns

In [None]:
# store the target outcome into a numpy array
y_train = cardio_train['cardio'].values

In [None]:
y_train

In [None]:
y_train.shape

In [None]:
# overwrite the dataframe with only the features  
cardio_train = cardio_train.drop(columns=['cardio'])

In [None]:
cardio_train.head()

In [None]:
cardio_train.shape

## 🏋️‍♀️ 6. Train ML Algorithms

In [None]:
# numerical variables
num_vars = ['age', 'height', 'weight', 'ap_hi', 'ap_lo']

# categorical binary variables
bin_vars = ['gender', 'smoke', 'alco', 'active']

# categorical variables
cat_vars = ['cholesterol', 'gluc']

In [None]:
## separating the features into specific dataset according to their type
cardio_train_num = cardio_train[num_vars]
cardio_train_bin = cardio_train[bin_vars]
cardio_train_cat = cardio_train[cat_vars]

In [None]:
from sklearn.model_selection import cross_val_score

# printing function
def display_scores(scores):
    print("Scores:", scores)
    print("\nMean:", scores.mean())
    print("Standard deviation:", scores.std())

### **6.1. Strategy #1 - Gaussian Naive Bayes with only numerical features**

#### **Training**

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.naive_bayes import GaussianNB

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

### numerical preprocessing pipeline
num_preprocessing_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('robust_scaler', RobustScaler())
])


### complete preprocessing pipeline
# in this case, we are using it just to filter the desired columns automatically during the full pipeline.
# one could do this manually outside the pipeline
preprocessing_pipeline_1 = ColumnTransformer([
    ('numerical_preprocessing', num_preprocessing_pipeline, num_vars)
])


### naive bayes model = preprocessing + naive bayes classifier
naive_bayes_1 = Pipeline([
    ('preprocessing', preprocessing_pipeline_1),
    ('naive_bayes', GaussianNB())
])


### training naive bayes
naive_bayes_1.fit(cardio_train, y_train)

#### **Validation on Training Set**

In [None]:
naive_bayes_1_accs = cross_val_score(naive_bayes_1, cardio_train, y_train, scoring="accuracy", cv=10)
display_scores(naive_bayes_1_accs)
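
To make explicit what Gaussian Naive Bayes learns, the sketch below uses synthetic data (an assumption for illustration, not the cardio dataset). For every class, the fitted model stores one mean per feature in `theta_`, which matches the per-class sample mean.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)

# synthetic data: two classes, two continuous features
X_class0 = rng.normal(loc=[50.0, 120.0], scale=[5.0, 10.0], size=(100, 2))
X_class1 = rng.normal(loc=[60.0, 140.0], scale=[5.0, 10.0], size=(100, 2))
X_toy = np.vstack([X_class0, X_class1])
y_toy = np.array([0] * 100 + [1] * 100)

model = GaussianNB()
model.fit(X_toy, y_toy)

# one row per class, one column per feature
print(model.theta_)
print(X_class0.mean(axis=0), X_class1.mean(axis=0))
```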

### **6.2. Strategy #2 - Categorical Naive Bayes with only categorical features**

#### **Training**

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import CategoricalNB

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

### categorical preprocessing pipeline
# there will not be OneHotEncoder because the CategoricalNB expects ordinal labels (0, 1, 2, 3, ...)
cat_preprocessing_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent'))  # as the categories are numbers, we can use the SimpleImputer
])


### complete preprocessing pipeline
# in this case, we are using it just to filter the desired columns automatically during the full pipeline.
# one could do this manually outside the pipeline
preprocessing_pipeline_2 = ColumnTransformer([
    ('categorical_preprocessing', cat_preprocessing_pipeline, cat_vars)
])


### naive bayes model = preprocessing + naive bayes classifier
naive_bayes_2 = Pipeline([
    ('preprocessing', preprocessing_pipeline_2),
    ('naive_bayes', CategoricalNB())
])


### training naive bayes
naive_bayes_2.fit(cardio_train, y_train)

#### **Validation on Training Set**

In [None]:
naive_bayes_2_accs = cross_val_score(naive_bayes_2, cardio_train, y_train, scoring="accuracy", cv=10)
display_scores(naive_bayes_2_accs)
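
`CategoricalNB` assumes that each feature is encoded as non-negative integers (0, 1, 2, ...). The sketch below uses a hypothetical toy feature with levels 1, 2, 3 (similar in spirit to `cholesterol`, but not real data), maps it to codes 0, 1, 2 with `OrdinalEncoder`, and inspects the per-class category counts that the model estimates its probabilities from.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# hypothetical toy data: one feature with levels 1, 2, 3
X_raw = np.array([[1], [1], [2], [3], [3], [3], [1], [2]])
y_toy = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# map the levels 1, 2, 3 to the codes 0, 1, 2
encoder = OrdinalEncoder()
X_codes = encoder.fit_transform(X_raw).astype(int)

model = CategoricalNB()
model.fit(X_codes, y_toy)

# counts for the first (and only) feature: rows = classes, columns = categories
counts = model.category_count_[0]
print(counts)
```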

### **6.3. Strategy #3 - Categorical Naive Bayes after converting numerical features to categorical**

#### **Training**

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import CategoricalNB

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

### numerical preprocessing pipeline
# we use 10 bins, but both n_bins and the binning strategy
# are hyperparameters that could be tuned
num_preprocessing_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('discretization', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform'))
])


### categorical preprocessing pipeline
# there will not be OneHotEncoder because the CategoricalNB expects ordinal labels (0, 1, 2, 3, ...)
cat_preprocessing_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent'))  # as the categories are numbers, we can use the SimpleImputer
])


### complete preprocessing pipeline
# in this case, we are using it just to filter the desired columns automatically during the full pipeline.
# one could do this manually outside the pipeline
preprocessing_pipeline_3 = ColumnTransformer([
    ('numerical_preprocessing', num_preprocessing_pipeline, num_vars),
    ('categorical_preprocessing', cat_preprocessing_pipeline, cat_vars)
])


### naive bayes model = preprocessing + naive bayes classifier
naive_bayes_3 = Pipeline([
    ('preprocessing', preprocessing_pipeline_3),
    ('naive_bayes', CategoricalNB())
])


### training naive bayes
naive_bayes_3.fit(cardio_train, y_train)

#### **Validation on Training Set**

In [None]:
naive_bayes_3_accs = cross_val_score(naive_bayes_3, cardio_train, y_train, scoring="accuracy", cv=10)
display_scores(naive_bayes_3_accs)
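
The sketch below shows how `KBinsDiscretizer` with `strategy='uniform'` turns a continuous feature into ordinal codes. The values are hypothetical. It also shows why the binning strategy matters: with equal-width bins, a single outlier stretches the range and pushes almost every sample into the first bin.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# hypothetical continuous feature (e.g., ages in years)
ages = np.array([[30.0], [35.0], [40.0], [45.0], [50.0], [55.0], [60.0], [65.0]])

discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
codes = discretizer.fit_transform(ages).ravel()
print(codes)  # equal-width bins over [30, 65]

# same feature, but the last value is replaced by an outlier
ages_outlier = ages.copy()
ages_outlier[-1, 0] = 200.0

codes_outlier = discretizer.fit_transform(ages_outlier).ravel()
print(codes_outlier)  # equal-width bins over [30, 200]
```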

### **6.4. Strategy #4 - Mixed Naive Bayes**

Install with `pip install mixed-naive-bayes`. <br/>
Repository: https://github.com/remykarem/mixed-naive-bayes

`Mixed Naive Bayes` allows us to use Naive Bayes with _continuous_ and _discrete/categorical data_. <br/>
For that, we need to specify the indices of all categorical features (including the binary ones).
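
As a sketch of the idea (assuming, as in Gaussian Naive Bayes, a normal likelihood for each continuous feature; the notation is ours and is not taken from the package), the decision rule adds the log-likelihood terms of both feature types under the usual conditional independence assumption:

$$
\hat{y} = \arg\max_{c} \left[ \log P(c) + \sum_{j \in J_{num}} \log \mathcal{N}\left(x_j \mid \mu_{jc}, \sigma_{jc}^2\right) + \sum_{j \in J_{cat}} \log P\left(x_j \mid c\right) \right]
$$

Here $J_{num}$ and $J_{cat}$ are the index sets of the continuous and categorical features, and $\mu_{jc}$ and $\sigma_{jc}^2$ are the mean and variance of feature $j$ within class $c$.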

When `ColumnTransformer` concatenates preprocessing pipelines, the **order of the pipelines** determines the order of their corresponding features in the final preprocessed data. <br/>
Let's assume the preprocessing pipelines are ordered as **continuous** first and then **discrete/categorical**. <br/>
As our pipelines do not create any new features, the final order will be:


[0] 'age', [1] 'height', [2] 'weight', [3] 'ap_hi', [4] 'ap_lo' <br/>
and then <br/>
[5] 'gender', [6] 'smoke', [7] 'alco', [8] 'active', [9] 'cholesterol', [10] 'gluc'

Therefore, the categorical features (binary ones included) are those with indices from 5 to 10.
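
Rather than hardcoding these positions, they can be derived from the variable lists. The sketch below assumes the transformers are declared in the order numerical, binary, categorical and that no step adds or drops columns. The names `ordered_features` and `categorical_names` are introduced here only for illustration.

```python
num_vars = ['age', 'height', 'weight', 'ap_hi', 'ap_lo']
bin_vars = ['gender', 'smoke', 'alco', 'active']
cat_vars = ['cholesterol', 'gluc']

# order in which the ColumnTransformer concatenates its outputs
ordered_features = num_vars + bin_vars + cat_vars

# positions of every binary/categorical feature in the preprocessed array
categorical_names = set(bin_vars + cat_vars)
categorical_features_indices = [
    idx for idx, name in enumerate(ordered_features) if name in categorical_names
]
print(categorical_features_indices)  # [5, 6, 7, 8, 9, 10]
```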

#### **Training - NOT WORKING... I NEED TO REVIEW IT**

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OrdinalEncoder
from mixed_naive_bayes import MixedNB

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

### numerical preprocessing pipeline
num_preprocessing_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('robust_scaler', RobustScaler())
])


### binary preprocessing pipeline
# assumption to verify: the MixedNB README asks for label-encoded categorical features (0, 1, 2, ...),
# and some columns here may start at 1, so we re-encode them with OrdinalEncoder.
# This is a possible cause of the failure, not a confirmed fix.
bin_preprocessing_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # as the categories are numbers, we can use the SimpleImputer
    ('ordinal_encoder', OrdinalEncoder())
])


### categorical preprocessing pipeline
# there will not be OneHotEncoder because MixedNB expects label-encoded categories (0, 1, 2, 3, ...)
cat_preprocessing_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # as the categories are numbers, we can use the SimpleImputer
    ('ordinal_encoder', OrdinalEncoder())
])


### complete preprocessing pipeline
# in this case, we are using it just to filter the desired columns automatically during the full pipeline.
# one could do this manually outside the pipeline
preprocessing_pipeline_4 = ColumnTransformer([
    ('numerical_preprocessing', num_preprocessing_pipeline, num_vars),
    ('bin_preprocessing', bin_preprocessing_pipeline, bin_vars),
    ('categorical_preprocessing', cat_preprocessing_pipeline, cat_vars)
])



categorical_features_indices = list(range(5, 11))


### naive bayes model = preprocessing + naive bayes classifier
naive_bayes_4 = Pipeline([
    ('preprocessing', preprocessing_pipeline_4),
    ('naive_bayes', MixedNB(categorical_features=categorical_features_indices))
])


### training naive bayes
naive_bayes_4.fit(cardio_train, y_train)

#### **Validation on Training Set**