
# Discord Notes

Professor Patrick Happ says that we should try different approaches to determine which one best suits our problem/dataset. He suggests combining the following six approaches:


*   original dataset with/without feature selection
*   normalized dataset with/without feature selection
*   standardized dataset with/without feature selection
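
As a rough illustration of the six combinations listed above, the sketch below builds them for a small synthetic dataset. Everything in it is an assumption made for the example: the toy arrays `X_toy` and `y_toy`, the use of scikit-learn's `MinMaxScaler` for normalization and `StandardScaler` for standardization, and `SelectKBest` with `f_classif` as the feature selection method. None of these choices come from the professor's notes.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: 40 samples, 5 features, binary target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 5))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Three scaling variants of the same data
scaled = {
    "original": X_toy,
    "normalized": MinMaxScaler().fit_transform(X_toy),
    "standardized": StandardScaler().fit_transform(X_toy),
}

# Each variant with and without feature selection gives six views
views = {}
for name, data in scaled.items():
    views[f"{name}_all_features"] = data
    selector = SelectKBest(score_func=f_classif, k=3)  # k=3 is arbitrary
    views[f"{name}_selected"] = selector.fit_transform(data, y_toy)

for name, data in views.items():
    print(name, data.shape)
```

In a real experiment the scalers and the selector would normally be fitted on the training split only, so that no information from the test set leaks into the preprocessing.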



Professor Hugo Villamizar says that some feature selection methods (such as correlation analysis) are sensitive to data scaling and may not work well on a dataset that has not been normalized/standardized.
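
To make the scaling sensitivity concrete, here is a small sketch. It uses variance thresholding (scikit-learn's `VarianceThreshold`) as the example method, which is a swapped-in assumption and not the correlation analysis mentioned in the note. The two toy columns carry the same information on different scales, and the 0.5 threshold is arbitrary.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features: identical pattern, very different scales
X_toy = np.array([
    [100.0, 0.1],
    [200.0, 0.2],
    [300.0, 0.3],
    [400.0, 0.4],
])

# On the raw data, the small-scale column falls below the threshold
raw_support = VarianceThreshold(threshold=0.5).fit(X_toy).get_support()

# After standardization both columns have unit variance and are kept
X_std = StandardScaler().fit_transform(X_toy)
std_support = VarianceThreshold(threshold=0.5).fit(X_std).get_support()

print("raw:", raw_support.tolist())
print("standardized:", std_support.tolist())
```

The selector's decision changes only because of the units of measurement, which is the kind of effect the note warns about.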

Professor Hugo Villamizar says that the proximity between the mean and the median is not enough to determine whether the data is normally distributed; we must plot a histogram of each attribute to see whether its distribution is approximately normal, right-skewed, or left-skewed.
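
A minimal sketch of why the histogram is needed, using a made-up sample (`toy_feature` is hypothetical, not an attribute of our dataset): the mean and the median are identical, yet the histogram shows two separate peaks, so the distribution is clearly not bell-shaped.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical toy sample: symmetric around 6 but bimodal
sample = pd.Series([1, 2, 2, 3] * 25 + [9, 10, 10, 11] * 25, name="toy_feature")

mean_value = sample.mean()
median_value = sample.median()
skewness = sample.skew()  # close to 0 for a symmetric sample
print(f"mean={mean_value:.2f}, median={median_value:.2f}, skew={skewness:.2f}")

# The histogram reveals the two peaks that the summary statistics hide.
# In a notebook the figure renders inline; in a script, add plt.show().
ax = sample.plot.hist(bins=11, title="toy_feature")
ax.set_xlabel("value")
```

For a real dataframe, `df.hist()` would draw one histogram per numeric column in a single call.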

# MVP Checklist

## Problem Definition

Objective: understand and clearly describe the problem being solved.

* What is the problem description?
* Do you have premises or hypotheses about the problem? Which?
* What restrictions or conditions were imposed to select the data?
* Describe your dataset (attributes, images, annotations, etc.).

## Dataset preparation

Objective: perform data preparation operations.

* Separate the dataset between training and testing (and validation, if applicable).
* Does it make sense to use a cross-validation method? Justify if you do not use it.
* Check which data transformation operations (such as normalization and standardization, transforming images into tensors) are most appropriate for your problem and save different views of your dataset for later model evaluation.
* Refine the number of available attributes, carrying out the feature selection process appropriately.
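
The split and cross-validation items above could be sketched as follows. This is only an illustration on synthetic data: `X_toy` and `y_toy` are hypothetical, and the 80/20 split, the 5 folds, and the random seeds are arbitrary assumptions. The scaler is fitted on the training split only, to avoid leaking test information.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 100 samples, 4 features, balanced binary target
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 4))
y_toy = np.array([0, 1] * 50)

# Hold out 20% for testing, preserving the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=7
)

# Fit the scaler on the training split, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Stratified 5-fold cross-validation over the training split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
fold_sizes = [len(val_idx) for _, val_idx in cv.split(X_train, y_train)]
print(X_train.shape, X_test.shape, fold_sizes)
```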

## Modeling and training

Objective: build models to solve the problem at hand.

* Select the most suitable algorithms for the chosen problem and dataset, justifying your choices.
* Are there any initial settings for the hyperparameters?
* Was the model properly trained? Has an underfitting problem been observed?
* Is it possible to optimize the hyperparameters of any of the models? If yes, do so, justifying all choices.
* Are there any advanced or more complex methods that can be evaluated?
* Can I create a committee of different models for the problem (ensembles)?
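
As one possible reading of the hyperparameter and ensemble items above, the sketch below tunes a k-nearest-neighbors classifier with `GridSearchCV` and then combines it with two other models in a hard-voting committee. The algorithms, the parameter grid, and the synthetic data from `make_classification` are all assumptions chosen for brevity, not the models selected for this MVP.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic classification problem
X_toy, y_toy = make_classification(n_samples=120, n_features=6, random_state=0)

# Exhaustive search over a small, arbitrary grid with 5-fold cross-validation
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)
print("best parameters:", search.best_params_)

# Committee of three different models; each one casts a vote per sample
ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(**search.best_params_)),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)
ensemble.fit(X_toy, y_toy)
predictions = ensemble.predict(X_toy[:5])
print(predictions)
```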

## Results assessment

Objective: analyze the performance of the generated models on unseen data (the test set).

* Select the evaluation metrics consistent with the problem, justifying them.
* Train the chosen model on the entire training set, and test it on the test set.
* Do the results make sense?
* Have any overfitting issues been observed?
* Compare results from different models.
* Describe the best solution found, justifying it.
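
The assessment steps above might look like the sketch below, which trains a model on the full training split, scores it on the test split, and compares both accuracies as a simple overfitting check. The decision tree, the chosen metrics, and the synthetic data are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic classification problem
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)

# Train on the entire training split
model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Score on both splits; a large gap between them suggests overfitting
train_acc = accuracy_score(y_train, model.predict(X_train))
test_pred = model.predict(X_test)
test_acc = accuracy_score(y_test, test_pred)
test_f1 = f1_score(y_test, test_pred)
cm = confusion_matrix(y_test, test_pred)

print(f"train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")
print(f"test F1={test_f1:.3f}")
print(cm)
```

An unpruned decision tree typically fits the training split almost perfectly, so it is a convenient model for seeing the gap between training and test scores.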

# Notebook configuration

In [None]:
!pip install pandas



In [None]:
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.6-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.6


In [None]:
import json
import pandas as pd
from ucimlrepo import fetch_ucirepo
from matplotlib import pyplot as plt

# Loading dataset

In [None]:
# fetch dataset
heart_disease = fetch_ucirepo(name='Heart Disease')

print(heart_disease.variables)

        name     role         type demographic  \
0        age  Feature      Integer         Age   
1        sex  Feature  Categorical         Sex   
2         cp  Feature  Categorical        None   
3   trestbps  Feature      Integer        None   
4       chol  Feature      Integer        None   
5        fbs  Feature  Categorical        None   
6    restecg  Feature  Categorical        None   
7    thalach  Feature      Integer        None   
8      exang  Feature  Categorical        None   
9    oldpeak  Feature      Integer        None   
10     slope  Feature  Categorical        None   
11        ca  Feature      Integer        None   
12      thal  Feature  Categorical        None   
13       num   Target      Integer        None   

                                          description  units missing_values  
0                                                None  years             no  
1                                                None   None             no  
2              

In [None]:
X = heart_disease.data.features
y = heart_disease.data.target

In [None]:
X

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0
1,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0
2,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0
3,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45,1,1,110,264,0,0,132,0,1.2,2,0.0,7.0
299,68,1,4,144,193,1,0,141,0,3.4,2,2.0,7.0
300,57,1,4,130,131,0,0,115,1,1.2,2,1.0,7.0
301,57,0,2,130,236,0,2,174,0,0.0,2,1.0,3.0


In [None]:
y

Unnamed: 0,num
0,0
1,2
2,1
3,0
4,0
...,...
298,1
299,2
300,3
301,1


In [None]:
X.count()

age         303
sex         303
cp          303
trestbps    303
chol        303
fbs         303
restecg     303
thalach     303
exang       303
oldpeak     303
slope       303
ca          299
thal        301
dtype: int64

Using X.count(), we can observe that there are some NA values in the features dataset (4 rows in the "ca" column and 2 rows in the "thal" column). As neither column has a significant number of NA values, we will drop the rows that contain them. However, if we dropped the NA rows directly from X, the corresponding target rows would remain in y, so we will first concatenate the two dataframes and then drop the rows with NA values.
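
The idea can be checked on a tiny made-up example first (`X_toy` and `y_toy` below are hypothetical, not the real data). The sketch also shows an index-based alternative that appears to give the same result, since `dropna` preserves the original row index; it is included only for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features and target with NaN in different rows
X_toy = pd.DataFrame({
    "ca": [0.0, np.nan, 2.0, 1.0],
    "thal": [3.0, 7.0, np.nan, 6.0],
})
y_toy = pd.DataFrame({"num": [0, 2, 1, 0]})

# Concatenate, drop incomplete rows, then split back into features/target
combined = pd.concat([X_toy, y_toy], axis=1).dropna()
X_clean = combined.drop(columns="num")
y_clean = combined[["num"]]

# Alternative: drop from the features, then select the surviving index in y
X_alt = X_toy.dropna()
y_alt = y_toy.loc[X_alt.index]

print(X_clean.index.tolist(), y_clean["num"].tolist())
```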

In [None]:
df_heart_disease = pd.concat([X,y], axis=1)
df_heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
2,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
3,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45,1,1,110,264,0,0,132,0,1.2,2,0.0,7.0,1
299,68,1,4,144,193,1,0,141,0,3.4,2,2.0,7.0,2
300,57,1,4,130,131,0,0,115,1,1.2,2,1.0,7.0,3
301,57,0,2,130,236,0,2,174,0,0.0,2,1.0,3.0,1


In [None]:
df_heart_disease = df_heart_disease.dropna()

df_heart_disease.count()

age         297
sex         297
cp          297
trestbps    297
chol        297
fbs         297
restecg     297
thalach     297
exang       297
oldpeak     297
slope       297
ca          297
thal        297
num         297
dtype: int64

As the result above shows, the rows with NaN in "ca" and the rows with NaN in "thal" did not coincide, which is the worst-case scenario: all 6 rows are dropped, leaving 297 samples. Since we still have 98% of the original dataset (297 of 303 rows), the models should not be noticeably penalized by this.
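
A short sketch of how the overlap and the retained fraction can be checked. The `toy` frame is hypothetical and much smaller than the real dataset; only the final line uses the counts reported in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with NaN in different rows of each column
toy = pd.DataFrame({
    "ca": [0.0, np.nan, 2.0, 1.0, np.nan, 0.0],
    "thal": [3.0, 7.0, np.nan, 3.0, 6.0, 7.0],
})

missing_ca = toy["ca"].isna()
missing_thal = toy["thal"].isna()

# Rows missing both values vs. rows missing at least one
rows_missing_both = int((missing_ca & missing_thal).sum())
rows_missing_any = int((missing_ca | missing_thal).sum())
retained = len(toy.dropna()) / len(toy)
print(rows_missing_both, rows_missing_any, retained)

# Same arithmetic with the counts from this notebook: 297 of 303 rows kept
notebook_retained = round(297 / 303, 3)
print(notebook_retained)
```

When no row is missing both values, the number of dropped rows equals the sum of the per-column NaN counts, which is what happened here (4 + 2 = 6).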