# Assignment: End to End Machine Learning - Predicting the risk of heart disease

## Table of Contents

* [A. Importing Libraries](#libraries)
* [B. Importing data](#data)


* [2. MLC2: Data understanding](#data_understanding)
    * [2.1. MLC2.1.: Univariate data analysis](#univariate_data_analysis)
        * [2.1.1. Dataset size](#dataset_size)
        * [2.1.2. Direct visualization of the data](#direct_visualization)
        * [2.1.3. Types of variables available](#variable_types)
        * [2.1.4. Descriptive statistics](#descriptive_statistics)
        * [2.1.5. Null values](#null_values)
        * [2.1.6. Distribution of target](#target_distribution)
        * [2.1.7. Identification of Outliers](#identification_outliers)
    * [2.2. MLC2.2.: Multivariate data analysis](#multivariate_data_analysis)
        * [2.2.1. Distribution of variables 2 to 2 (scatter plot)](#dist_num_var)
        * [2.2.2. Correlation Between Variables 2 to 2 (linear correlation)](#cor_num_var)
        * [2.2.3. Cross-tabs](#cross_tab)
        * [2.2.4. Correlation between combinations of variables and the class](#corr_comb)
* [3. MLC3: Data preparation](#data_preparation)
    * [3.1. MLC 3.1. Data cleaning](#data_cleaning)
        * [3.1.1. Dealing with variable types](#dealing_variable_types)
        * [3.1.2. Imputation of null values](#nulls_imputation)
        * [3.1.3. Handling Outliers](#handling_outliers)
        * [3.1.4. Elimination of features with low variance or highly correlated](#ft_low_var)
    * [3.2. MLC 3.2. Data transformation](#data_transformation)
        * [3.2.1. Transformation of categorical variables](#transformation_categorical)
        * [3.2.2. Transformation of numerical variables](#transformation_num)
        * [3.2.3. Transformation of date variables](#transformation_date)
        * [3.2.4. Transformation of text variables](#transformation_text)

## A. Importing libraries<a class="anchor" id="libraries"></a>

In [None]:
# Import Libraries
# =========================================================
# import numpy and pandas
import numpy as np
import pandas as pd

# import libraries for plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
# import plotly.express as px
# from plotly.subplots import make_subplots
# import plotly.graph_objects as go

# Configuration of pandas
pd.set_option('display.max_rows', 500)
pd.options.display.float_format = '{:,.2f}'.format

# Configuration of matplotlib
# %matplotlib inline 
plt.rcParams['savefig.bbox'] = "tight"
# plt.style.use('ggplot')



# import sklearn for preprocessing functions
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler

In [None]:
# validate version of sklearn
# =========================================================
from sklearn import __version__ as sklearn_version
print('Installed version for scikit-learn: {}.'.format(sklearn_version))

## B. Importing data<a class="anchor" id="data"></a>

In [None]:
# Import data
# ======================================================================================
df_comm_act = pd.read_csv("data/commercial_activity_df.csv")
df_prod = pd.read_csv("data/products_df.csv")
df_socio_dem = pd.read_csv("data/sociodemographic_df.csv")

display(df_comm_act.shape)
display(df_comm_act.head())
display(df_prod.shape)
display(df_prod.head())
display(df_socio_dem.shape)
display(df_socio_dem.head())

## MLC2: Data Understanding<a class="anchor" id="data_understanding"></a>

## MLC2.1: Univariate data analysis<a class="anchor" id="univariate_data_analysis"></a>


### 2.1.1. Dataset size<a class="anchor" id="dataset_size"></a>


In [None]:
# Get the number of rows (records), attributes and in-memory size
# ======================================================================================

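A minimal sketch of how the dataset size could be reported. The frame `df_demo` and its columns (`age`, `sex`, `TenYearCHD`) are hypothetical stand-ins for the real data, not names taken from this notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
df_demo = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "sex": ["M", "F", "M", "F"],
    "TenYearCHD": [0, 0, 0, 1],
})

n_rows, n_cols = df_demo.shape
# deep=True also counts the memory held by string values
memory_bytes = df_demo.memory_usage(deep=True).sum()

print(f"Rows: {n_rows}, columns: {n_cols}, memory: {memory_bytes / 1024:.2f} KB")
```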

### 2.1.2. Direct visualization of the data<a class="anchor" id="direct_visualization"></a>

In [None]:
# Overview of the columns and their data
# ======================================================================================
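One possible way to take a first look at the rows, sketched on a hypothetical toy frame (`df_demo` and its columns are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
df_demo = pd.DataFrame({
    "age": [39, 46, 48, 61, 43],
    "sex": ["M", "F", "M", "F", "F"],
    "TenYearCHD": [0, 0, 0, 1, 0],
})

first_rows = df_demo.head(3)                       # first three rows
random_rows = df_demo.sample(n=2, random_state=0)  # reproducible random sample
column_names = df_demo.columns.tolist()

print(first_rows)
print(random_rows)
print(column_names)
```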

### 2.1.3. Types of variables available<a class="anchor" id="variable_types"></a>

In [None]:
# Data Types
# ======================================================================================
df.info(verbose=True)

### 2.1.4. Descriptive statistics<a class="anchor" id="descriptive_statistics"></a>

In [None]:
# Descriptive statistics for numerical values
# ======================================================================================
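A short sketch of descriptive statistics for the numerical columns, assuming a hypothetical toy frame `df_demo`. Transposing the result puts one variable per row, which tends to be easier to read when there are many columns.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
df_demo = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "sysBP": [110.0, 130.0, 120.0, 125.0],
    "sex": ["M", "F", "M", "F"],
})

# Keep only numeric columns, then summarise them (count, mean, std, quartiles)
numeric_summary = df_demo.select_dtypes(include="number").describe().T
print(numeric_summary)
```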

In [None]:
# Descriptive statistics for categorical values
# ======================================================================================
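For categorical columns, `describe()` reports the count, the number of distinct values, the most frequent value and its frequency. The sketch below assumes a hypothetical toy frame `df_demo`; the column names are illustrative.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
df_demo = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "sex": ["M", "F", "M", "M"],
    "education": ["primary", "secondary", "primary", "university"],
})

# Exclude numeric columns so only the categorical (text) columns are summarised
categorical_summary = df_demo.select_dtypes(exclude="number").describe().T
print(categorical_summary)
```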


### 2.1.5. Number/fractions of null values<a class="anchor" id="null_values"></a>

In [None]:
# % of null values in our data set
# ======================================================================================

total_values = len(df_chd_data)
for col in df_chd_data.columns:
    missing_values = df_chd_data[col].isna().sum()
    if missing_values != 0:
        percentage = round(missing_values / total_values * 100, 2)
        print('Column {} with {} missing values, {}% of {} values'.format(col, missing_values, percentage, total_values))
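The same information can also be collected in a single table without an explicit loop. This is a sketch on a hypothetical toy frame (`df_demo`, `glucose` and the other column names are assumptions).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing values
df_demo = pd.DataFrame({
    "age": [39, 46, np.nan, 61],
    "glucose": [77.0, np.nan, np.nan, 103.0],
    "TenYearCHD": [0, 0, 0, 1],
})

null_summary = pd.DataFrame({
    "missing": df_demo.isna().sum(),
    # The mean of a boolean mask is the fraction of True values
    "percent": (df_demo.isna().mean() * 100).round(2),
})
# Keep only columns that have missing values, worst first
null_summary = null_summary[null_summary["missing"] > 0].sort_values("percent", ascending=False)
print(null_summary)
```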

### 2.1.6. Distribution / range of target values<a class="anchor" id="target_distribution"></a>

In [1]:
# Create a target variable
# ======================================================================================
target = 2

target = target*3

In [None]:
# Understand how the values are split in our target variable and visualize it
# ======================================================================================
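A minimal sketch of how the class balance could be tabulated, assuming a binary target in a hypothetical column named `TenYearCHD`. The absolute counts and the percentage share together show how imbalanced the classes are.

```python
import pandas as pd

# Hypothetical toy frame; "TenYearCHD" is an assumed target column name
df_demo = pd.DataFrame({"TenYearCHD": [0, 0, 0, 1, 0, 1, 0, 0]})
target_col = "TenYearCHD"

class_counts = df_demo[target_col].value_counts()
class_share = df_demo[target_col].value_counts(normalize=True).mul(100).round(2)

print(class_counts)
print(class_share)
```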

### 2.1.7 Identification of Outliers<a class="anchor" id="identification_outliers"></a>

#### Numerical Values

In [None]:
# Check if there are outliers in the data (numerical values)
# ======================================================================================
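One common rule of thumb flags a value as an outlier when it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third. The sketch below applies that rule to a hypothetical toy frame; the rule and the factor 1.5 are assumptions, not a requirement of the assignment.

```python
import pandas as pd

# Hypothetical toy frame; 295 is an intentionally extreme value
df_demo = pd.DataFrame({
    "sysBP": [110, 118, 121, 125, 130, 132, 138, 295],
    "age": [39, 46, 48, 52, 55, 58, 61, 63],
})

numeric = df_demo.select_dtypes(include="number")
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# True where a value falls outside the IQR fences of its own column
outlier_mask = (numeric < lower) | (numeric > upper)
outlier_counts = outlier_mask.sum()
print(outlier_counts)
```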

In [None]:
# Boxplot for columns with outliers
# ======================================================================================


#### Categorical Values

In [None]:
# Understand the cardinality of the categorical features
# ======================================================================================
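Cardinality is the number of distinct values in each categorical column. A sketch on a hypothetical toy frame (column names are illustrative):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
df_demo = pd.DataFrame({
    "age": [39, 46, 48, 61, 43],
    "sex": ["M", "F", "M", "F", "M"],
    "education": ["primary", "secondary", "university", "primary", "secondary"],
})

categorical = df_demo.select_dtypes(exclude="number")
# Number of distinct values per categorical column, highest first
cardinality = categorical.nunique().sort_values(ascending=False)
print(cardinality)
```

Columns with very high cardinality usually need a different encoding strategy than columns with only a handful of categories.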

## MLC2.2: Multivariate data analysis<a class="anchor" id="multivariate_data_analysis"></a>


### 2.2.1 Distribution of variables 2 to 2 (scatter plot) <a class="anchor" id="dist_num_var"></a>

### 2.2.2 Correlation Between Variables 2 to 2 (linear correlation) <a class="anchor" id="cor_num_var"></a>

### 2.2.3 Cross-tabs <a class="anchor" id="cross_tab"></a>
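A cross-tab counts how often each combination of two categorical variables occurs. The sketch below uses `pd.crosstab` on a hypothetical toy frame; `currentSmoker` and `TenYearCHD` are assumed column names. With `normalize="index"` each row sums to 1, so the cells read as the share of each class within a group.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
df_demo = pd.DataFrame({
    "currentSmoker": [0, 0, 0, 0, 1, 1, 1, 1],
    "TenYearCHD":    [0, 0, 0, 1, 0, 1, 1, 1],
})

# Row-normalised cross-tab: share of each target class within each smoker group
ct = pd.crosstab(df_demo["currentSmoker"], df_demo["TenYearCHD"], normalize="index")
print(ct)
```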

### 2.2.4 Correlation between combinations of variables and the class <a class="anchor" id="corr_comb"></a>

In [None]:
# Correlation
# ======================================================================

# numeric_only=True restricts the correlation to numeric columns
correlation = df.corr(numeric_only=True).round(2)
plt.figure(figsize = (14,7))
sns.heatmap(correlation, 
            annot = True, 
            cmap = sns.diverging_palette(20, 220, n=200), 
            center=0.05)
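To focus on the class, the correlation of each feature with the target can be ranked by absolute value. This is a sketch on a hypothetical toy frame, with `TenYearCHD` as an assumed target column.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
df_demo = pd.DataFrame({
    "age": [40, 45, 50, 55, 60, 65],
    "heartRate": [75, 65, 70, 70, 65, 75],
    "TenYearCHD": [0, 0, 0, 1, 1, 1],
})

target_corr = (
    df_demo.corr(numeric_only=True)["TenYearCHD"]
    .drop("TenYearCHD")                                    # drop the self-correlation
    .sort_values(key=lambda s: s.abs(), ascending=False)   # strongest first
)
print(target_corr.round(2))
```

Pearson correlation only captures linear relationships, so a value near zero does not by itself rule out a useful feature.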

## MLC3: Data Preparation<a class="anchor" id="data_preparation"></a>


### MLC3.1: Data Cleaning<a class="anchor" id="data_cleaning"></a>


### 3.1.1. Dealing with variable types<a class="anchor" id="dealing_variable_types"></a>
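A sketch of two typical type fixes, under the assumption that a numeric column was read as text and that the target should be treated as categorical. The frame and column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: "cigsPerDay" was read as text and has an invalid entry
df_demo = pd.DataFrame({
    "cigsPerDay": ["20", "0", "n/a"],
    "TenYearCHD": [0, 1, 0],
})

df_typed = df_demo.copy()
# Values that cannot be parsed become NaN instead of raising an error
df_typed["cigsPerDay"] = pd.to_numeric(df_typed["cigsPerDay"], errors="coerce")
# Store the binary target as a categorical variable
df_typed["TenYearCHD"] = df_typed["TenYearCHD"].astype("category")

print(df_typed.dtypes)
```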

### 3.1.2. Imputation of null values<a class="anchor" id="nulls_imputation"></a>
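A simple baseline is to fill numeric columns with the median and categorical columns with the most frequent value. The sketch below assumes that strategy on a hypothetical toy frame; other strategies may suit the real data better.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values
df_demo = pd.DataFrame({
    "glucose": [77.0, np.nan, 85.0, 103.0],
    "education": ["primary", None, "primary", "university"],
})

df_imputed = df_demo.copy()
for col in df_imputed.columns:
    if df_imputed[col].isna().any():
        if pd.api.types.is_numeric_dtype(df_imputed[col]):
            fill_value = df_imputed[col].median()        # robust to outliers
        else:
            fill_value = df_imputed[col].mode().iloc[0]  # most frequent category
        df_imputed[col] = df_imputed[col].fillna(fill_value)

print(df_imputed)
```

When a train/test split is used, the fill values should be computed on the training portion only and then applied to the test portion, to avoid leaking information.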

### 3.1.3. Handling Outliers<a class="anchor" id="handling_outliers"></a>
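One option is to cap extreme values at the IQR fences instead of dropping rows. The sketch below assumes the 1.5 × IQR rule and a hypothetical `sysBP` column.

```python
import pandas as pd

# Hypothetical toy frame; 295 is an intentionally extreme value
df_demo = pd.DataFrame({"sysBP": [110, 118, 121, 125, 130, 132, 138, 295]})

q1 = df_demo["sysBP"].quantile(0.25)
q3 = df_demo["sysBP"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

df_capped = df_demo.copy()
# Values outside [lower, upper] are replaced by the nearest fence
df_capped["sysBP"] = df_capped["sysBP"].clip(lower=lower, upper=upper)
print(df_capped)
```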

### 3.1.4. Elimination of features with low variance or highly correlated<a class="anchor" id="ft_low_var"></a>
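A sketch of both filters on a hypothetical feature matrix `X`: scikit-learn's `VarianceThreshold` removes constant columns, and a scan of the upper triangle of the absolute correlation matrix drops one column from each highly correlated pair. The 0.95 cutoff is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical feature matrix: "constant" has zero variance,
# "age_months" is perfectly correlated with "age"
X = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "age_months": [468, 552, 576, 732],
    "constant": [1, 1, 1, 1],
    "sysBP": [110, 130, 120, 125],
})

# 1) Remove features whose variance is not above the threshold
selector = VarianceThreshold(threshold=0.0)
selector.fit(X)
kept = X.columns[selector.get_support()].tolist()
X_kept = X[kept]

# 2) Drop one feature from each highly correlated pair
corr_abs = X_kept.corr().abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X_kept.drop(columns=to_drop)

print(kept, to_drop, X_reduced.columns.tolist())
```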

### MLC3.2: Data transformation<a class="anchor" id="data_transformation"></a>

### 3.2.1. Transformation of categorical variables<a class="anchor" id="transformation_categorical"></a>
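One-hot encoding is a common choice for nominal variables with few categories. The sketch below uses `pd.get_dummies` on a hypothetical `education` column; whether one-hot or another encoding fits the real data is left open.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
df_demo = pd.DataFrame({
    "age": [39, 46, 48],
    "education": ["primary", "secondary", "university"],
})

# One indicator column per category; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(df_demo, columns=["education"], dtype=int)
print(encoded)
```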

### 3.2.2. Transformation of numerical variables<a class="anchor" id="transformation_num"></a>
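A sketch of min-max scaling with scikit-learn's `MinMaxScaler`, which maps each column to the range [0, 1]. The frame `X` and its columns are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric features
X = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "sysBP": [110.0, 130.0, 120.0, 125.0],
})

scaler = MinMaxScaler()
# fit_transform returns a NumPy array, so rebuild the DataFrame with its labels
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
print(X_scaled)
```

As with imputation, the scaler would normally be fitted on the training data only and then applied to the test data with `transform`.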

### 3.2.3. Transformation of date variables<a class="anchor" id="transformation_date"></a>
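Date columns are usually parsed to a datetime type and then split into numeric parts that a model can use. The sketch assumes a hypothetical `entry_date` column stored as ISO-formatted text.

```python
import pandas as pd

# Hypothetical toy frame with dates stored as text
df_demo = pd.DataFrame({"entry_date": ["2019-01-15", "2019-06-30", "2020-02-29"]})

df_dates = df_demo.copy()
df_dates["entry_date"] = pd.to_datetime(df_dates["entry_date"])
df_dates["entry_year"] = df_dates["entry_date"].dt.year
df_dates["entry_month"] = df_dates["entry_date"].dt.month
df_dates["entry_dayofweek"] = df_dates["entry_date"].dt.dayofweek  # Monday=0 ... Sunday=6

print(df_dates)
```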

### 3.2.4. Transformation of text variables<a class="anchor" id="transformation_text"></a>