# Electrocardiograms

👇 Import the [`electrocardiograms.csv`](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Electrocardiograms_dataset.csv) dataset and display its first 5 rows

In [None]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Electrocardiograms_dataset.csv")

data.head()

ℹ️ Each observation of the dataset is a numerically represented heartbeat, taken from a patient's electrocardiogram (ECG). The target is binary and defines whether the heartbeat is at risk of cardiovascular disease (`1`) or not (`0`).

# Data Exploration

👇 Plot an observation of each target class to get a visual idea of what the numbers represent.

In [None]:
import matplotlib.pyplot as plt

plt.plot(data.iloc[3])
plt.tick_params(labelbottom=False)
plt.title('At risk heartbeat')
plt.show()

In [None]:
plt.plot(data.iloc[19560])
plt.tick_params(labelbottom=False)
plt.title('Healthy heartbeat')
plt.show()

👇 How many observations of at-risk heartbeats are there? Save your answer as `at_risk_count`.

In [None]:
at_risk_count = data.target.value_counts()[1]
at_risk_count

👇 How many observations of healthy heartbeats are there? Save your answer as `healthy_count`.

In [None]:
healthy_count = data.target.value_counts()[0]
healthy_count

ℹ️ In certain cases, the class balance is representative of the true class distribution. This is the case here: the vast majority of people actually have healthy hearts. In such cases, we preserve the class distribution so that the model trains on realistic data, and we adapt our modeling approach accordingly.
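
The class balance can also be read as proportions rather than raw counts. The following is a minimal sketch, assuming a small made-up `target` Series (hypothetical values, not the real dataset) standing in for `data.target`:

```python
import pandas as pd

# Hypothetical toy target: 9 healthy (0) heartbeats and 1 at-risk (1) heartbeat
target = pd.Series([0] * 9 + [1], name="target")

# Absolute counts per class, as computed in the cells above
counts = target.value_counts()

# Relative frequency of each class (values sum to 1)
proportions = target.value_counts(normalize=True)

healthy_share = proportions[0]
at_risk_share = proportions[1]
print(counts.to_dict(), round(healthy_share, 2), round(at_risk_share, 2))
```

On the real data, the same call on `data.target` would give the share of at-risk heartbeats directly.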

### ☑️ Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('class_balance',
                         healthy = healthy_count,
                         at_risk = at_risk_count)
result.write()
print(result.check())

# Logistic Regression

🎯 Your task is to flag heartbeats that are at risk of cardiovascular diseases.

👇 Let's start by investigating the performance of a `LogisticRegression` on that task. Use cross validation to evaluate the model on the following metrics:
- Accuracy
- Recall
- Precision
- F1

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

# Prepare features X and target y
X = data.loc[:, 'x_1':'x_187']
y = data['target']

# 10-Fold Cross validate model
log_cv_results = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=10, 
                            scoring=['accuracy','recall','precision','f1'])
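
When several metrics are passed to `scoring`, `cross_validate` returns a dictionary with one array of per-fold scores per metric, stored under `test_<metric>`. Here is a small sketch of that structure, assuming synthetic stand-in data (`X_demo`, `y_demo` and `demo_results` are illustrative names, not part of the challenge):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data, not the ECG dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_results = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5,
    scoring=["accuracy", "recall", "precision", "f1"],
)

# One array of per-fold scores per metric, stored under "test_<metric>"
metric_keys = sorted(k for k in demo_results if k.startswith("test_"))

# Average each metric over the folds
mean_scores = {k: demo_results[k].mean() for k in metric_keys}
print(metric_keys)
```

Averaging each array with `.mean()` gives a single cross-validated score per metric.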

❓ What is the model's ratio of correct predictions? Save your answer under variable name `correct_pred_ratio`.

In [None]:
correct_pred_ratio = log_cv_results['test_accuracy'].mean()

correct_pred_ratio

❓ What percentage of at-risk heartbeats is the model able to flag? Save your answer under variable name `flag_ratio`.

In [None]:
flag_ratio = log_cv_results['test_recall'].mean()

flag_ratio

❓ When the model signals an at-risk heartbeat, how often is it correct? Save your answer under variable name `correct_detection_ratio`.

In [None]:
correct_detection_ratio = log_cv_results['test_precision'].mean()

correct_detection_ratio

❓ What is the model's ability to flag as many at-risk heartbeats as possible while limiting false alarms?  Save your answer under variable name `aggregated_metric`.

In [None]:
aggregated_metric = log_cv_results['test_f1'].mean()

aggregated_metric

ℹ️ By observing the different metrics, you should see that accuracy can be deceiving. To understand what is going on, we can observe a breakdown of the model's predictions in a confusion matrix.
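
To see how a high accuracy can hide a poor recall, consider a worked example with made-up counts (100 hypothetical heartbeats, 10 of them at risk; these numbers are illustrative and do not come from the dataset):

```python
# Hypothetical prediction counts for 100 heartbeats (illustrative only)
tp = 2   # at-risk heartbeats correctly flagged
fn = 8   # at-risk heartbeats missed
fp = 1   # healthy heartbeats wrongly flagged
tn = 89  # healthy heartbeats correctly left alone

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
recall = tp / (tp + fn)                     # share of at-risk heartbeats flagged
precision = tp / (tp + fp)                  # share of flags that are correct
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
```

With these counts the accuracy is 0.91 even though only 2 of the 10 at-risk heartbeats are flagged (recall 0.20).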

### ☑️ Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('logistic_regression_evaluation',
                         accuracy = correct_pred_ratio,
                         recall = flag_ratio,
                         precision = correct_detection_ratio,
                         f1 = aggregated_metric)
result.write()
print(result.check())

# Confusion Matrix

👇 Using `ConfusionMatrixDisplay.from_estimator` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html)), visualize the predictions breakdown of the Logistic Regression model. It replaces `plot_confusion_matrix`, which was removed in scikit-learn 1.2.

<details>
<summary>💡 Hints</summary>

- `ConfusionMatrixDisplay.from_estimator` takes as input a **trained model** and **test data**
    
- You'll need to go back to the **Holdout method!** You can use Sklearn's `train_test_split()` ([doc](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html))
    
- Look into the `normalize` parameter
  
</details>



In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Instantiate and train the model on train data
log_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Plot confusion matrix by passing trained model and test data
ConfusionMatrixDisplay.from_estimator(log_model, X_test, y_test)

ℹ️ The confusion matrix should show that the model is influenced by the class imbalance: it predicts heartbeats to be healthy most of the time. Due to this behaviour, the model is often correct and has a **high accuracy**. However, this also causes it to miss many at-risk heartbeats: it has **low recall**.

👉 This model is therefore poor at the task of **flagging at-risk observations**.

⚠️ Don't be fooled by the accuracy and look at the metric that corresponds to your task! ⚠️
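
Recall can be read straight off a row-normalized confusion matrix. The sketch below assumes small made-up label lists (`y_true` and `y_pred` are hypothetical, not model output) and uses `confusion_matrix` with `normalize="true"`:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels: 8 healthy (0), 2 at-risk (1)
y_true = [0] * 8 + [1] * 2
# Hypothetical predictions: one of the two at-risk heartbeats is missed
y_pred = [0] * 8 + [0, 1]

# Raw counts: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# normalize="true" divides each row by its total,
# so the diagonal holds the recall of each class
cm_norm = confusion_matrix(y_true, y_pred, normalize="true")

at_risk_recall = cm_norm[1, 1]
print(cm)
print(at_risk_recall)
```

In this toy case every healthy heartbeat is classified correctly, yet only half of the at-risk ones are flagged.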

# KNN Classifier

👇 Would a default KNN classifier perform better at the task of flagging at-risk observations?

Save your answer under `best_model` as "KNN" or "LogisticRegression".

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# 10-Fold Cross validate model and evaluate recall
knn_cv_results = cross_validate(KNeighborsClassifier(n_neighbors=5), X, y, cv=10, 
                            scoring=['recall']) 

knn_score = knn_cv_results['test_recall'].mean()
print(knn_score)

best_model = "KNN"

ℹ️ The KNN classifier should have a much higher recall than the LogisticRegression and therefore is better suited for the task.
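
The comparison can also be made programmatically instead of by eye. This is a rough sketch on synthetic, imbalanced stand-in data (`X_demo`, `y_demo`, `models` and `best_on_demo` are illustrative names); which model wins on this synthetic data says nothing about the ECG dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the ECG data (roughly 10% positives)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Mean cross-validated recall for each candidate model
recalls = {
    name: cross_validate(model, X_demo, y_demo, cv=5, scoring=["recall"])["test_recall"].mean()
    for name, model in models.items()
}

# Keep the name of the model with the highest mean recall
best_on_demo = max(recalls, key=recalls.get)
print(recalls, best_on_demo)
```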




### ☑️ Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('best_model',
                         model = best_model)
result.write()
print(result.check())

# Classification Report

Now that we know the KNN model has the higher recall, let's check out its performance across all the other classification metrics.

👇 Print out a `classification_report` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)) of the KNN model.

<details>
<summary> 💡 Hint  </summary>
    
You'll need to pass model predictions to `classification_report`. Sklearn's `cross_val_predict` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html)) might help 😉
</details>




In [None]:
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=5), X, y) # Make cross validated predictions of entire dataset

print(classification_report(y,y_pred)) # Pass predictions and true values to Classification report

❓ Looking at the classification report, what is the model's ratio of correctly predicted at-risk heartbeats? Save your answer as a float under `correct_atrisk_predictions`

In [None]:
correct_atrisk_predictions = 0.94
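
Rather than copying a number from the printed report, the value can be extracted in code. A minimal sketch, assuming made-up labels (`y_true` and `y_hat` are hypothetical) and the `output_dict=True` option of `classification_report`:

```python
from sklearn.metrics import classification_report

# Hypothetical true labels and predictions (not the ECG data)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 0]

# output_dict=True returns nested dicts keyed by the class label as a string
report = classification_report(y_true, y_hat, output_dict=True)

# Precision and recall of the at-risk class (label 1)
at_risk_precision = report["1"]["precision"]
at_risk_recall = report["1"]["recall"]
print(at_risk_precision, at_risk_recall)
```

Here 3 of the 4 heartbeats predicted as at-risk really are at risk, so the precision for class `1` is 0.75.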

### ☑️ Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('precision',
                         precision = correct_atrisk_predictions)
result.write()
print(result.check())

# Prediction

🎯 A patient comes to you for a second opinion on what he was told may be an at-risk heartbeat.  Download the data for his heartbeat [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Electrocardiograms_new_patient.csv).


❓ According to your optimal model, is he at-risk or not?  

Save the prediction of your model under variable name `prediction` as "at risk" or "healthy".

In [None]:
new_patient = pd.read_csv('https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Electrocardiograms_new_patient.csv')

new_patient

In [None]:
knn_model = KNeighborsClassifier().fit(X,y) # Fit the model you have found to be optimal (Default KNN)

model_prediction = knn_model.predict(new_patient)[0] # Make prediction
print(model_prediction)

prediction = "at risk"
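
The text label can be derived from the numeric prediction instead of being typed by hand. The sketch below assumes a tiny one-feature toy dataset (`X_toy`, `y_toy`, `knn_toy` and `labels` are hypothetical names, unrelated to the patient data):

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical one-feature training set: low values healthy (0), high values at risk (1)
X_toy = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y_toy = [0, 0, 0, 1, 1, 1]

knn_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# Map the numeric class to its text label
labels = {0: "healthy", 1: "at risk"}
new_sample = [[1.05]]
predicted_class = int(knn_toy.predict(new_sample)[0])
toy_prediction = labels[predicted_class]

# Share of the 3 nearest neighbours voting for each class
vote_shares = knn_toy.predict_proba(new_sample)[0]
print(toy_prediction, vote_shares)
```

`predict_proba` also shows how unanimous the neighbours are, which can be useful context for a second opinion.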

### ☑️ Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('prediction',
                         prediction = prediction)
result.write()
print(result.check())

# 🏁