# HW3: Regression and Classification

In this assignment you will preprocess a dataset and perform some basic regression and classification tasks. By the end, you should know how to pre-process a real-world dataset, carry out a supervised learning task, and understand some of the fundamental mechanisms behind these tasks.

## Grading:

Pass/Fail.

To pass this HW you need to provide a complete and correct solution that passes all the tests.

## OUTLINE: 

Data pre-processing, regression task and classification task

1. Reading the files
2. Missing Values
3. Imputing categorical variables
4. Imputing numerical variables
5. Classification with Decision Tree, single split
6. Classification with Decision Tree, Cross validation
7. Interpretation of the results

## Important instructions:

Each function you make will be considered during the grading, so it is important to strictly follow input and output instructions stated in the skeleton code.

You must not change the names of the functions, since, if you do, the tests will fail.

Since this homework is, in part, focused on having you implement creative solutions to impute missing data, you will fail a test designed to spot them if at any point you use functions like fillna(), SimpleImputer(), IterativeImputer(), or packages like fancyimpute, missingpy, or similar. Please do not try to circumvent this rule: even if you manage to pass the homework, a similar task might be in the exam, and there you would be spotted for sure.

## Homework Scenario: Cleaning and Preparing Heart Disease Data

You have recently joined the **Data Science and Analytics Unit** at the *Global Health Institute (GHI)*, a non-profit organization focused on improving cardiovascular disease diagnosis through data-driven research.  

A junior data analyst from your team, **Franco**, sends you a message:

> “Hey, welcome to the team! We’re preparing a predictive model to help doctors identify patients at risk of heart disease using clinical data from several hospitals.  
>   
> We have two related datasets:
> - **Cleveland dataset** → this will be used for **training and validation**
> - **Hungary dataset** → this will serve as our **independent test set**
>
> Unfortunately, it looks like something went wrong during the data collection process: some values appear to have been **corrupted or lost**. Before we can train any classification model, we need to **inspect and clean the data**, handle **missing or inconsistent values**, and make sure it’s ready for modeling. I'm completely lost and I have a lot of other work. Can you please help me with the cleaning and with creating some baseline classification models?”

Your task is to **analyze and clean the datasets** before **building a classifier** to predict whether a patient has heart disease.

In [None]:
# these are the libraries that you will need throughout the assignment
import numpy as np 
import pandas as pd
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline

from matplotlib.colors import ListedColormap

from HW import *

RSEED = 8

## *1.* Reading the files

### `Task: Read the datasets from the 'datasets' folder. Use the files called cleveland.csv and hungary.csv that you have downloaded in this archive.`

## Heart Disease Dataset — Column Descriptions

Someone has changed the names of some columns in the dataset, so make sure to use this description and refer to it for the "allowed" values.

Use common sense when evaluating the features. For example, this dataset has no column called weight, but if it had one, a value of 0 in that column would be implausible, since we are talking about humans and not ethereal beings. Such a value would most likely be a typo or a corrupted entry, and it would need to be cleaned in some way.

| **Column** | **Description** |
|-------------|-----------------|
| **Age** | Age of the patient (in years). This dataset only includes adult patients. |
| **Sex** | Biological sex of the patient: `1 = male`, `0 = female`. |
| **ChestPainType** | Type of chest pain experienced: <br>• `1` = typical angina <br>• `2` = atypical angina <br>• `3` = non-anginal pain <br>• `4` = asymptomatic. |
| **RestBP** | Resting blood pressure (in mm Hg) measured on admission to the hospital. |
| **Chol** | Serum cholesterol level (in mg/dl). |
| **FBS** | Fasting blood sugar: `1` if fasting blood sugar > 120 mg/dl, otherwise `0`. |
| **RestECG** | Resting electrocardiographic results: <br>• `0` = normal <br>• `1` = ST-T wave abnormality <br>• `2` = showing probable or definite left ventricular hypertrophy. |
| **MaxHR** | Maximum heart rate achieved during the exercise test. |
| **ExAng** | Exercise-induced angina: `1` = yes, `0` = no. |
| **Oldpeak** | ST depression induced by exercise relative to rest (a measure of exercise-induced ischemia). |
| **Slope** | Slope of the peak exercise ST segment: <br>• `1` = upsloping <br>• `2` = flat <br>• `3` = downsloping. |
| **Ca** | Number of major vessels (0–3) colored by fluoroscopy (a measure of blood flow). |
| **Thal** | Thalassemia test result: <br>• `3` = normal <br>• `6` = fixed defect <br>• `7` = reversible defect. |
| **Num** | Diagnosis of heart disease (target variable): <br>`0` = no heart disease, `1–4` = presence of heart disease with increasing severity. |
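
The allowed values in the table can be turned into an automatic check. The following is only a minimal sketch on a toy dataframe, not the `clean_data` function from HW.py: the helper name `flag_invalid_values` and the `ALLOWED_VALUES` dictionary are hypothetical, and the column names are assumed to match the table (in the real files some of them have been renamed).

```python
import pandas as pd

# Allowed values copied from the column description table above.
ALLOWED_VALUES = {
    "Sex": {0, 1},
    "ChestPainType": {1, 2, 3, 4},
    "FBS": {0, 1},
    "RestECG": {0, 1, 2},
    "ExAng": {0, 1},
    "Slope": {1, 2, 3},
    "Ca": {0, 1, 2, 3},
    "Thal": {3, 6, 7},
}


def flag_invalid_values(df, allowed=ALLOWED_VALUES):
    """Return a copy of df where values outside the allowed sets become NaN."""
    cleaned = df.copy()
    for column, valid in allowed.items():
        if column not in cleaned.columns:
            continue  # columns missing from this dataframe are skipped
        # Non-numeric placeholders (for example "?") are coerced to NaN first.
        numeric = pd.to_numeric(cleaned[column], errors="coerce")
        # Keep values found in the allowed set, replace everything else by NaN.
        cleaned[column] = numeric.where(numeric.isin(list(valid)))
    return cleaned


# Toy data: Sex contains an impossible 5, Thal contains a "?" placeholder.
toy = pd.DataFrame({
    "Sex": [1, 0, 5],
    "Thal": [3, "?", 7],
    "Age": [54, 61, 47],
})
flagged = flag_invalid_values(toy)
print(flagged)
```

Columns without a fixed set of allowed values, such as Age or Chol, are left untouched here; they would need range checks based on common sense instead.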


In [None]:
# From the folder 'datasets', read the files cleveland.csv and hungary.csv into the dataframes cleveland and test, respectively.

cleveland = pd.DataFrame()  # change this
test = pd.DataFrame()       # change this
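
As a reminder of how `pd.read_csv` works with a path inside a folder, here is a small self-contained sketch. The file `toy.csv` and its contents are made up for illustration and are written to a temporary directory; for the assignment the relevant files would be cleveland.csv and hungary.csv in the 'datasets' folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Build a throwaway 'datasets' folder containing a tiny CSV file.
    folder = Path(tmp) / "datasets"
    folder.mkdir()
    (folder / "toy.csv").write_text("Age,Sex,Chol\n63,1,233\n41,0,204\n")

    # read_csv accepts a string or a Path and uses the first row as the header.
    toy = pd.read_csv(folder / "toy.csv")

print(toy.shape)
print(toy.dtypes)
```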

In [4]:
# You can uncomment this to inspect the datasets
# cleveland.head(5)

In [5]:
# test.head(5)

In [6]:
# if you want to see information about the dataset, uncomment:
# cleveland.describe()

In [7]:
# if you want to see information about the dataset, uncomment:
# test.describe()

## *2.* Missing values

### `Task: use the function clean_data from the HW.py file to get a clean version of the cleveland and test dataframes.`

In [None]:
# Write your code here
cleveland_cleaned, missing_values_cleveland = pd.DataFrame(), {} # change this
test_cleaned, missing_values_test = pd.DataFrame(), {} # change this

## *3.* Imputing categorical variables

At the beginning of this file you can find the names of the columns and a description of their contents.

Determine which columns are categorical, and set their type to object.

Determine which columns are numerical, and set their type accordingly.

Do not include the target column in any of these lists!

In [9]:
categorical_columns = []        # change this
numerical_columns = []      # change this
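
Setting column types can be sketched as follows on a toy dataframe. Which columns belong in which list is for you to decide from the description table; the lists `toy_categorical` and `toy_numerical` below are only an illustrative assumption.

```python
import pandas as pd

toy = pd.DataFrame({"Sex": [1, 0, 1], "Chol": ["233", "204", "250"]})

toy_categorical = ["Sex"]   # hypothetical choice for the toy example
toy_numerical = ["Chol"]    # hypothetical choice for the toy example

# astype with a dict converts only the listed columns.
toy = toy.astype({column: object for column in toy_categorical})

# to_numeric turns strings into numbers; unparseable entries become NaN.
for column in toy_numerical:
    toy[column] = pd.to_numeric(toy[column], errors="coerce")

print(toy.dtypes)
```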

### ` Task: Split the cleveland_cleaned dataframe into a train and a validation set, using train_test_split from sklearn. `

The train set must be stored in X_train and y_train, and the validation set in X_val and y_val, as in the skeleton code. The size of the validation set must be 30% of the total size of the cleveland_cleaned dataframe. Use shuffle=True and stratify the split based on y_cleveland. Make sure that the train and validation sets are dataframes, and that the columns have the correct names. Reset the indexes of all four dataframes, using drop=True.

In [10]:
# Split the data into X and y, where X contains the features and y contains the target variable.
X_cleveland = pd.DataFrame()  # change this
y_cleveland = pd.DataFrame()  # change this

X_test = pd.DataFrame()       # change this
y_test = pd.DataFrame()       # change this

In [None]:
X_train, X_val, y_train, y_val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame(), pd.DataFrame() # change this
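
The split described above can be sketched on toy data as follows. The variable names are deliberately different from the ones required by the assignment, and passing `random_state=RSEED` is an assumption made here for reproducibility, not something the instructions state.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

RSEED = 8

# Toy features and a balanced toy target (10 zeros, 10 ones).
X_toy = pd.DataFrame({"Age": range(40, 60), "Chol": range(200, 220)})
y_toy = pd.DataFrame({"Num": [0, 1] * 10})

# 30% validation set, shuffled, with class proportions preserved.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy,
    y_toy,
    test_size=0.3,
    shuffle=True,
    stratify=y_toy["Num"],
    random_state=RSEED,
)

# The shuffled rows keep their old index labels, so reset all four indexes.
X_tr, X_va, y_tr, y_va = (
    part.reset_index(drop=True) for part in (X_tr, X_va, y_tr, y_va)
)

print(len(X_tr), len(X_va))
```

Because the inputs are dataframes, the outputs are dataframes with the original column names.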

In [12]:
# If you want to see information about the split dataset, uncomment:
# X_train.head(5)

In [13]:
# If you want to see information about the split dataset, uncomment:
# X_val.head(5)

In [14]:
# To make the classification task easier, transform the target variable into a binary variable.
# If the target variable is 0, it should remain 0. If the target variable is more than 0, it should be transformed into 1.
y_train = pd.DataFrame()  # change this
y_val = pd.DataFrame()    # change this
y_test = pd.DataFrame()   # change this
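
The binarization rule from the comments above can be illustrated on a toy target. This is just one possible way to express it, using a comparison followed by a cast to integers.

```python
import pandas as pd

y_toy = pd.DataFrame({"Num": [0, 2, 0, 4, 1]})

# (y > 0) gives booleans; casting to int maps False -> 0 and True -> 1.
y_binary = (y_toy > 0).astype(int)

print(y_binary["Num"].tolist())
```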

### ` Task: use the impute_missing_categorical function from the HW.py file to impute the missing data from the categorical features in your dataframes. `

In [None]:
# Write your code here
X_train_imputed_cat, X_val_imputed_cat, X_test_imputed_cat = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()
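
One important detail of imputation is that the replacement values should be learned from the train set only and then applied to the validation and test sets. The sketch below shows this with the simplest possible baseline, the most frequent training value, written with boolean masks instead of fillna(). The helper `impute_with_train_mode` is hypothetical: it is not the signature of `impute_missing_categorical` in HW.py, and the assignment asks for something more creative than this.

```python
import numpy as np
import pandas as pd


def impute_with_train_mode(train, others, columns):
    """Replace NaN with the most frequent training value, without fillna()."""
    train_out = train.copy()
    others_out = [frame.copy() for frame in others]
    for column in columns:
        # The statistic is computed on the train set only.
        most_frequent = train[column].mode(dropna=True).iloc[0]
        for frame in [train_out, *others_out]:
            frame.loc[frame[column].isna(), column] = most_frequent
    return train_out, others_out


toy_train = pd.DataFrame({"Thal": [3.0, 3.0, np.nan, 7.0]})
toy_val = pd.DataFrame({"Thal": [np.nan, 6.0]})

train_imp, (val_imp,) = impute_with_train_mode(toy_train, [toy_val], ["Thal"])
print(train_imp["Thal"].tolist(), val_imp["Thal"].tolist())
```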

## *4.* Imputing numerical variables

### ` Task: use the impute_missing_numeric function from the HW.py file to impute the missing data from the numeric features in your dataframes. `

In [None]:
# Write your code here

X_train_imputed_num, X_val_imputed_num, X_test_imputed_num = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()
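
For numerical features, a comparable baseline is the training median. The following sketch uses `Series.where` to keep observed values and substitute the median elsewhere. The helper `impute_with_train_median` is a hypothetical name and a deliberately plain strategy, not the expected implementation of `impute_missing_numeric`.

```python
import numpy as np
import pandas as pd


def impute_with_train_median(train, others, columns):
    """Replace NaN with the training median of each column, without fillna()."""
    # Medians come from the train set only (NaN values are skipped by default).
    medians = {column: train[column].median() for column in columns}

    def fill(frame):
        out = frame.copy()
        for column, value in medians.items():
            # where() keeps entries for which the condition holds.
            out[column] = out[column].where(out[column].notna(), value)
        return out

    return fill(train), [fill(frame) for frame in others]


toy_train = pd.DataFrame({"Chol": [200.0, np.nan, 240.0, 220.0]})
toy_val = pd.DataFrame({"Chol": [np.nan, 180.0]})

train_imp, (val_imp,) = impute_with_train_median(toy_train, [toy_val], ["Chol"])
print(train_imp["Chol"].tolist(), val_imp["Chol"].tolist())
```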


### ` Task: use the merge_imputed function from the HW.py file to merge your imputed dataframes. `

In [None]:
# Merge the X_train_imputed_cat and X_train_imputed_num datasets. Call the resulting dataset X_train_imputed.
# Merge the X_val_imputed_cat and X_val_imputed_num datasets. Call the resulting dataset X_val_imputed.
# Merge the X_test_imputed_cat and X_test_imputed_num datasets. Call the resulting dataset X_test_imputed.

# Write your code here
X_train_imputed = pd.DataFrame()
X_val_imputed = pd.DataFrame()
X_test_imputed = pd.DataFrame()
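
A column-wise merge of two partial dataframes can be sketched with `pd.concat(axis=1)`. This assumes both parts describe the same rows in the same order and share the same index, which is one reason to reset the indexes earlier. The names below are illustrative and say nothing about how `merge_imputed` in HW.py is specified.

```python
import pandas as pd

cat_part = pd.DataFrame({"Sex": [1, 0], "Thal": [3, 7]})
num_part = pd.DataFrame({"Age": [63, 41], "Chol": [233, 204]})

# Hypothetical original column order to restore after the merge.
original_order = ["Age", "Sex", "Chol", "Thal"]

# axis=1 places the dataframes side by side, aligning rows on the index.
merged = pd.concat([cat_part, num_part], axis=1)[original_order]

print(merged)
```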


## *5.* Classification, using a single split 

### ` Use the function train_and_evaluate_single_split to produce classification results for your test set.`

In [18]:
# The hyperparameters for the tree should be:
# criterion: ['gini', 'entropy']
# max_depth: [3, 5, 10]
# The hyperparameters for the logistic regression should be:
# penalty: ['l1', 'l2']
# C: [0.1, 10]
# solver: ['liblinear']

# For each combination of hyperparameters, train a classification pipeline using your function.


from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
import time

hyperparameters_tree = {} # change this
hyperparameters_logreg = {} # change this
performance_df = pd.DataFrame(columns=['params', 'F1 scores'])

# create a list from the grid of hyperparameters for each model, and create the models.

start = time.time() # DO NOT CHANGE/DELETE THIS LINE

for number in range(1, 11): # change this
    # call your function here, then concat the results to performance_df
    pass # remove this line

end = time.time() # DO NOT CHANGE/DELETE THIS LINE

print('Time elapsed to run the hyperparameter tuning with a single split: ', end - start) # DO NOT CHANGE/DELETE THIS LINE

Time elapsed to run the hyperparameter tuning with a single split:  3.4809112548828125e-05
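
To make the idea of a hyperparameter grid concrete, the sketch below expands the decision tree grid with `ParameterGrid` and scores each combination on a single validation split. It uses synthetic data from `make_classification` and calls scikit-learn directly, so it only illustrates the mechanism; it does not use or reproduce `train_and_evaluate_single_split`. The logistic regression grid can be expanded in the same way.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=8)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=8
)

tree_grid = {"criterion": ["gini", "entropy"], "max_depth": [3, 5, 10]}

rows = []
# ParameterGrid yields one dict per combination: 2 criteria x 3 depths = 6.
for params in ParameterGrid(tree_grid):
    model = DecisionTreeClassifier(random_state=8, **params).fit(X_tr, y_tr)
    score = f1_score(y_va, model.predict(X_va))
    rows.append({"params": params, "F1 scores": score})

toy_performance = pd.DataFrame(rows)
best_row = toy_performance.loc[toy_performance["F1 scores"].idxmax()]
print(toy_performance)
print("Best combination:", best_row["params"])
```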


In [None]:
# Concatenate the train and validation datasets. Call the resulting datasets X and y.
X = pd.DataFrame()  # change this
y = pd.DataFrame()  # change this

# retrain the model with the best hyperparameters on the whole training dataset.
# Remember to use the same preprocessing steps as before.

# Write your code here
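
Concatenating the train and validation sets and refitting can be sketched as follows on toy data. The dictionary `best_params` is a hypothetical placeholder for whatever combination scored best in your own tuning.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

X_tr = pd.DataFrame({"Age": [40, 50, 60, 70], "Chol": [200, 210, 250, 260]})
y_tr = pd.DataFrame({"Num": [0, 0, 1, 1]})
X_va = pd.DataFrame({"Age": [45, 65], "Chol": [205, 255]})
y_va = pd.DataFrame({"Num": [0, 1]})

# Stack rows; ignore_index=True gives a fresh 0..n-1 index.
X_full = pd.concat([X_tr, X_va], ignore_index=True)
y_full = pd.concat([y_tr, y_va], ignore_index=True)

best_params = {"criterion": "gini", "max_depth": 3}  # hypothetical values

# Refit a fresh model on all the available training data.
final_model = DecisionTreeClassifier(random_state=8, **best_params)
final_model.fit(X_full, y_full["Num"])

print(final_model.predict(X_full))
```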

## *6.* Classification with Decision Tree using Cross Validation

### ` Use the function train_and_evaluate_cross_validation to produce classification results for your test set.`

In [None]:
# 1. Use the same hyperparameters from the previous task.
# 2. Create a dataframe to store the performance of the model with cross-validation, containing the columns 'params' and 'Average F1 scores'
# 3. You can reuse the parameter grids from the previous step.
# 4. Run your function for each combination of hyperparameters, using 5-fold cross-validation.
# 5. Concatenate the results to the dataframe created in step 2.



X = [[5,6], [10,11], [15,16], [20,21], [25,26], [30,31], [35,36], [40,41], [45,46], [50,51]]    # Delete this line
y = [0,1,0,1,0,1,0,1,0,1]                                                                       # Delete this line

X = pd.DataFrame(X)                                                                             # Delete this line
y = pd.DataFrame(y)                                                                             # Delete this line


# DO NOT FORGET TO DELETE THE PREVIOUS LINES. They are only to make the empty assignment run without errors,
# but they will destroy the data you need.

performance_df_cv = pd.DataFrame(columns=['params', 'Average F1 scores'])

start_CV = time.time() # DO NOT CHANGE/DELETE THIS LINE

# call your function here, then concat the results to performance_df_cv

end_CV = time.time() # DO NOT CHANGE/DELETE THIS LINE

print('Time elapsed to run the hyperparameter tuning with Cross Validation: ', end_CV - start_CV) # DO NOT CHANGE/DELETE THIS LINE


Time elapsed to run the hyperparameter tuning with Cross Validation:  0.0020859241485595703
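
With cross-validation, each hyperparameter combination is scored on several train/validation partitions and the scores are averaged. A minimal sketch with 5 folds, again on synthetic data and with scikit-learn's `cross_val_score` rather than `train_and_evaluate_cross_validation`, could look like this. The use of a shuffled `StratifiedKFold` is an assumption of this sketch.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=8)

# 5 folds that preserve the class proportions in each fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

tree_grid = {"criterion": ["gini", "entropy"], "max_depth": [3, 5, 10]}

cv_rows = []
for params in ParameterGrid(tree_grid):
    model = DecisionTreeClassifier(random_state=8, **params)
    # One F1 score per fold; the model is refit from scratch on each fold.
    fold_scores = cross_val_score(model, X_toy, y_toy, cv=folds, scoring="f1")
    cv_rows.append({"params": params, "Average F1 scores": fold_scores.mean()})

toy_performance_cv = pd.DataFrame(cv_rows)
print(toy_performance_cv)
```

Each combination is fitted five times instead of once, which is why this step usually takes longer than the single split.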


In [None]:
# retrain the model with the best hyperparameters on the whole training dataset.
# Remember to use the same preprocessing steps as before.

## *7.* Interpretation of the results 

### ` Which model performs the best? `

Write your explanation here. Delete this text.

### ` Task: use the best model to produce predictions on the test set, then calculate the F1 score on the test set. What do you notice? `
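
As a reminder of what the F1 score measures, the toy example below computes it by hand from true positives, false positives and false negatives, and compares the result with `f1_score` from scikit-learn. The labels are made up.

```python
from sklearn.metrics import f1_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

# Count outcomes for the positive class (label 1).
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 2
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 1

precision = tp / (tp + fp)  # 2/3
recall = tp / (tp + fn)     # 2/3

# F1 is the harmonic mean of precision and recall.
manual_f1 = 2 * precision * recall / (precision + recall)
library_f1 = f1_score(y_true, y_pred)

print(manual_f1, library_f1)
```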

### ` What is a possible explanation? `