# Step 1: Prepare the input dataset for ML modeling

The project is based on the [Heart Failure Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction).

This first notebook:

- Performs a quick exploratory analysis of the input dataset: it looks at the structure of the dataset and the distribution of the values in the different categorical and continuous columns.

- Uses the functions from the <a href="https://doc.dataiku.com/dss/latest/python/reusing-code.html#sharing-python-code-within-a-project">project Python library</a> to clean & prepare the input dataset before Machine Learning modeling. We will first clean categorical and continuous columns, then split the dataset into a train set and a test set.

Finally, we will transform this notebook into a Python recipe in the project Flow that will output the new train and test datasets.

<div class="alert alert-block alert-info">
<b>Tip:</b> The <a href="https://doc.dataiku.com/dss/latest/python/reusing-code.html#sharing-python-code-within-a-project">project libraries</a> allow you to build shared code repositories. They can be synchronized with an external Git repository.
</div>

## 0. Import packages

**Make sure you're using the `heart-attack-project` code environment** (see the prerequisites).

In [0]:
%matplotlib inline

In [0]:
import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
from heart_attack_library import data_processing
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split

In [0]:
import warnings
warnings.filterwarnings('ignore')

## 1. Import the data

Let's use the Dataiku Python API to import the input dataset. The same code retrieves the data regardless of where the dataset is stored (local filesystem, SQL database, cloud data lake, etc.).

In [0]:
dataset_heart_measures = dataiku.Dataset("heart_measures")
df = dataset_heart_measures.get_dataframe(limit=100000)

## 2. A quick audit of the dataset

### 2.1 Compute the shape of the dataset

In [0]:
print(f'The shape of the dataset is {df.shape}')

### 2.2 Look at a preview of the first rows of the dataset

In [0]:
df.head()

### 2.3 Inspect missing values & number of distinct values (cardinality) for each column

In [0]:
pdu.audit(df)
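
For readers working outside Dataiku, a similar per-column summary can be sketched with plain pandas. This is not the implementation of `pdu.audit`; the toy DataFrame and the `audit` table below are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the heart measures data (values are illustrative only)
toy_df = pd.DataFrame({
    "Age": [40, 49, 37, None],
    "Sex": ["M", "F", "M", "M"],
    "Cholesterol": [289, 180, 283, 214],
})

# One row per column: dtype, number of missing values, number of distinct values
audit = pd.DataFrame({
    "dtype": toy_df.dtypes.astype(str),
    "n_missing": toy_df.isna().sum(),
    "n_unique": toy_df.nunique(),  # NaN is not counted as a distinct value
})
print(audit)
```

Columns with a low `n_unique` are natural candidates for the categorical list defined in the next section.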

## 3. Exploratory data analysis

### 3.1 Define categorical & continuous columns

In [0]:
categorical_cols = ['Sex','ChestPainType', 'FastingBS', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
continuous_cols = ['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']

### 3.2 Look at the distribution of continuous features

In [0]:
nb_cols=2
fig = plt.figure(figsize=(8,6))
fig.suptitle('Distribution of continuous features', fontsize=11)
gs = fig.add_gridspec(math.ceil(len(continuous_cols)/nb_cols),nb_cols)
gs.update(wspace=0.3, hspace=0.4)
for i, col in enumerate(continuous_cols):
    ax = fig.add_subplot(gs[math.floor(i/nb_cols),i%nb_cols])
    sns.histplot(df[col], ax=ax)

### 3.3 Look at the distribution of categorical columns

In [0]:
nb_cols=2
fig = plt.figure(figsize=(8,6))
fig.suptitle('Distribution of categorical features', fontsize=11)
gs = fig.add_gridspec(math.ceil(len(categorical_cols)/nb_cols),nb_cols)
gs.update(wspace=0.3, hspace=0.4)
for i, col in enumerate(categorical_cols):
    ax = fig.add_subplot(gs[math.floor(i/nb_cols),i%nb_cols])
    # Pass the column as x and draw on the subplot created above
    sns.countplot(x=df[col], ax=ax)

### 3.4 Look at the distribution of the target variable

In [0]:
target = "HeartDisease"
fig = plt.figure(figsize=(4,2.5))
fig.suptitle('Distribution of the target (HeartDisease)', fontsize=11, y=1.11)
plot = sns.countplot(x=df[target])

<div class="alert alert-block alert-info">
<b>Tip:</b> To ease collaboration, all the insights you create from Jupyter Notebooks can be
shared with other users by publishing them on dashboards. See the <a href="https://doc.dataiku.com/dss/latest/dashboards/insights/jupyter-notebook.html">documentation</a> for more information.
</div>

## 4. Prepare data 

### 4.1 Clean categorical columns

In [0]:
# Convert the string values of the categorical columns to integers, using a function from the project library
df_cleaned = data_processing.transform_heart_categorical_measures(df, "ChestPainType", "RestingECG", 
                                                                  "ExerciseAngina", "ST_Slope", "Sex")

df_cleaned.head()
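
The body of `transform_heart_categorical_measures` lives in the project library and is not shown in this notebook. As a rough idea of what such a string-to-integer conversion might look like, here is a minimal sketch; the helper `encode_with_mapping` and the code dictionaries are hypothetical and may differ from what the library actually does.

```python
import pandas as pd

# Hypothetical mappings; the real project library may use different codes
SEX_CODES = {"M": 1, "F": 0}
ANGINA_CODES = {"N": 0, "Y": 1}


def encode_with_mapping(df, column, mapping):
    """Return a copy of df with `column` replaced by its integer codes."""
    out = df.copy()
    out[column] = out[column].map(mapping)
    return out


toy_df = pd.DataFrame({"Sex": ["M", "F", "M"], "ExerciseAngina": ["N", "Y", "N"]})
encoded_df = encode_with_mapping(toy_df, "Sex", SEX_CODES)
encoded_df = encode_with_mapping(encoded_df, "ExerciseAngina", ANGINA_CODES)
print(encoded_df)
```

Working on a copy keeps the original DataFrame untouched, which makes the cell safe to re-run.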

### 4.2 Transform categorical columns into dummies

In [0]:
df_cleaned = pd.get_dummies(df_cleaned, columns = categorical_cols, drop_first = True)

print("Shape after dummies transformation: " + str(df_cleaned.shape))
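
To see what `drop_first = True` does, here is a small sketch on a toy DataFrame (the values are illustrative, and the column is kept as strings for readability). With three categories, only two indicator columns are produced; the dropped category is implied when both indicators are zero.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "ST_Slope": ["Up", "Flat", "Down", "Flat"],
    "MaxHR": [172, 156, 98, 108],
})

# Categories are sorted, and the first one ("Down") is dropped
encoded = pd.get_dummies(toy_df, columns=["ST_Slope"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level avoids perfectly collinear indicator columns, which matters mostly for linear models.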

### 4.3 Scale continuous columns

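Let's use the Scikit-Learn `RobustScaler` to scale the continuous features. With its default settings, it centers each column on its median and divides by its interquartile range, which makes it less sensitive to outliers than mean/variance scaling.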

In [0]:
scaler = RobustScaler()
df_cleaned[continuous_cols] = scaler.fit_transform(df_cleaned[continuous_cols])
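
As a sanity check of what the scaler computes, the sketch below compares `RobustScaler` with a manual median/IQR computation on a toy column containing one outlier. The values are illustrative assumptions, and the scaler's default settings are assumed.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one obvious outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler()
scaled = scaler.fit_transform(values)

# Manual equivalent: subtract the median, divide by the interquartile range
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual = (values - median) / (q3 - q1)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The outlier barely affects the median and the quartiles, so the other values stay in a narrow range after scaling.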

## 5. Split the dataset into train and test

Let's now split the dataset into a train set, which will be used for experimenting with and training the Machine Learning models, and a test set, which will be used to evaluate the deployed model.

In [0]:
heart_measures_train_df, heart_measures_test_df = train_test_split(df_cleaned, test_size=0.2, stratify=df_cleaned.HeartDisease)
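
The `stratify` argument keeps the proportion of each target class roughly the same in both sets. Here is a minimal sketch on toy data; the 60/40 class balance and `random_state=42` are assumptions made for illustration (the call above does not fix a seed, so its split changes from run to run).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 60% positive and 40% negative targets
toy_df = pd.DataFrame({
    "feature": range(100),
    "HeartDisease": [1] * 60 + [0] * 40,
})

train_df, test_df = train_test_split(
    toy_df,
    test_size=0.2,
    stratify=toy_df["HeartDisease"],
    random_state=42,  # assumption: fixed seed for a reproducible split
)

# Share of positive cases in each set
print(train_df["HeartDisease"].mean(), test_df["HeartDisease"].mean())
```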

## 6. Next: use this notebook to create a new step in the project workflow

Now that our notebook is up and running, we can use it to create the first step of our pipeline in the Flow:

- Click on the **+ Create Recipe** button at the top right of the screen.

- Select the **Python recipe** option.

- Choose the `heart_measures` dataset as the input dataset and create two output datasets: `heart_measures_train` and `heart_measures_test`.

- Click on the **Create recipe** button.

- At the end of the recipe script, replace the last four lines of code with:


```python
heart_measures_train = dataiku.Dataset("heart_measures_train")
heart_measures_train.write_with_schema(heart_measures_train_df)
heart_measures_test = dataiku.Dataset("heart_measures_test")
heart_measures_test.write_with_schema(heart_measures_test_df)
```

- Run the recipe.


<div class="alert alert-block alert-success">
<b>Success:</b> Go back to the Flow: you will see an orange circle that represents your first step (called a 'Recipe') and its two output datasets.
</div>