## <font color="lime">BA 305 - Team Exercise - Data Exploration + Cleaning</font>

## Team Name:

---

## Scenario - Heart Disease

<img src="https://images.pexels.com/photos/4386466/pexels-photo-4386466.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=2" width="800" height="500" alt="Heart disease scenario image">


Machine learning can support heart disease classification by helping healthcare professionals interpret patient data. Models trained on such data can assist with diagnosis, inform treatment planning, and estimate risk levels. They can also support ongoing monitoring of patient health, longer-term prognosis, and the discovery of patterns in cardiac health research. In short, applying machine learning to heart disease classification gives physicians additional information and decision support, which can contribute to better patient care and to the development of new treatment approaches.

---

## Dataset Description

The ground truth label is the 'HeartDisease' column, which indicates whether the person has heart disease (1) or not (0).

**Files Included:**
- `team_exercise_data.csv` - the dataset for training and testing your model.
- `test.csv` - the actual test dataset provided for evaluation purposes by the teaching team. Your task is to predict the 'HeartDisease' variable for the 'ID's listed in this file.
- `sample_submission.csv` - a sample submission file in the correct format.

---

## Evaluation

Submissions are evaluated on the F1 Score.

- F1 Score: The F1 score is the harmonic mean of precision and recall. It combines the two into a single metric, which is particularly useful when the classes are imbalanced and you need to weigh correctly identifying the minority class against the cost of incorrectly labeling majority-class cases as the minority class.
https://www.v7labs.com/blog/f1-score-guide
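
As a minimal sketch of the metric described above, the snippet below computes precision, recall, and F1 by hand and compares the result with scikit-learn's `f1_score`. The label vectors `y_true` and `y_pred` are made-up values for illustration only; they are not taken from the exercise data.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels for illustration only (1 = heart disease, 0 = no heart disease)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

# The library function should agree with the manual calculation
f1_sklearn = f1_score(y_true, y_pred)
print(f1_manual, f1_sklearn)
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 by maximizing only precision or only recall.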

---

## <font color="lime">Notebook Objective</font>

The goal of this notebook is to conduct data exploration and data preparation alongside your team, setting the stage for the development and training of machine learning models.

---

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [9]:
# Import libraries

import seaborn as sns
import numpy as np
import pandas as pd

# TODO Add more

In [None]:
# Load Data
# TODO Change the path based on your set up

df = pd.read_csv('/content/drive/My Drive/team_exercise_data.csv')
df

---

## Data Exploration + Cleaning

Conduct data exploration and cleaning below. Feel free to use AI tools, online resources, and any other available technologies to accomplish the assigned tasks.

<font color="red">Make sure you split the data into training and testing sets before applying any scaling operations!</font>
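
One possible way to follow this order is sketched below, assuming scikit-learn's `train_test_split` and `StandardScaler`. The `demo` DataFrame and its `feature_a` / `feature_b` columns are hypothetical stand-ins; only the `HeartDisease` column name comes from the exercise, and the split ratio and random seed are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the exercise data; only 'HeartDisease' is a real column name
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature_a": rng.normal(50, 10, size=100),
    "feature_b": rng.normal(120, 15, size=100),
    "HeartDisease": rng.integers(0, 2, size=100),
})

X = demo.drop(columns="HeartDisease")
y = demo["HeartDisease"]

# 1) Split first, so the held-out rows never influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2) Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)
```

Scaling before the split would let the test rows contribute to the mean and standard deviation, which leaks information into training and tends to make test scores look better than they should.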

In [None]:
# TODO

---

## Save the Processed Data


In [None]:
# TODO Change the path based on your set up

processed_train.to_csv('/content/drive/My Drive/processed_train.csv', index=False)
processed_test.to_csv('/content/drive/My Drive/processed_test.csv', index=False)

In [None]:
# Check if the processed data was correctly saved.

check_train = pd.read_csv('/content/drive/My Drive/processed_train.csv')
check_train

In [None]:
check_test = pd.read_csv('/content/drive/My Drive/processed_test.csv')
check_test

---

## Repeat the Data Processing on test.csv

`test.csv` - the actual test dataset provided for evaluation purposes by the teaching team. Your task is to predict the 'HeartDisease' variable for the 'ID's listed in this file.

**Why?**

Ensuring the test data is processed in the same way as the training data is crucial for maintaining the accuracy and effectiveness of a machine learning model. This consistency in preprocessing steps, such as scaling, handling missing values, and encoding categorical variables, ensures that the model receives test data in the format it expects, based on its training. Without matching preprocessing, the model might misinterpret the test data, leading to inaccurate predictions. Essentially, for the model to apply what it has learned accurately to new, unseen data, the feature space of the test data must closely align with that of the training data.
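
To illustrate what aligning the feature space can look like for encoded categorical variables, here is a small sketch using `pd.get_dummies` and `DataFrame.reindex`. The `Age` and `ChestPain` columns and their values are hypothetical; substitute the actual columns from your data.

```python
import pandas as pd

# Hypothetical data; 'Age' and 'ChestPain' are made-up column names
train_demo = pd.DataFrame(
    {"Age": [54, 61, 47], "ChestPain": ["typical", "atypical", "none"]}
)
eval_demo = pd.DataFrame(
    {"Age": [58, 39], "ChestPain": ["typical", "typical"]}
)

# One-hot encode each frame separately
train_encoded = pd.get_dummies(train_demo, columns=["ChestPain"])
eval_encoded = pd.get_dummies(eval_demo, columns=["ChestPain"])

# The evaluation frame is missing the categories it never saw, so force it
# onto the training columns, filling the absent dummy columns with 0
eval_aligned = eval_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(eval_encoded.columns))
print(list(eval_aligned.columns))
```

Note that `reindex` also drops any column that appears only in the evaluation data, so categories unseen during training are silently ignored; whether that is acceptable depends on your data.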

<font color="red">If you've applied scaling earlier, remember to use the `transform` method instead of `fit_transform` for the `test.csv` data, as the scaler should be calibrated based on the original training data only.</font>
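
The difference between the two methods can be seen in a tiny sketch, assuming scikit-learn's `StandardScaler` was the scaler used earlier. The arrays `train_values` and `eval_values` are hypothetical numbers standing in for a single numeric feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numbers standing in for one numeric feature
train_values = np.array([[40.0], [50.0], [60.0]])
eval_values = np.array([[50.0], [70.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_values)  # learns mean and std from training data

# Correct: reuse the statistics learned from the training data
eval_scaled = scaler.transform(eval_values)

# Incorrect for evaluation data: refitting recenters on the evaluation data's own mean
eval_refit = StandardScaler().fit_transform(eval_values)

print(eval_scaled.ravel())
print(eval_refit.ravel())
```

With `transform`, a value of 50 maps to 0 because it equals the training mean. After refitting, the same value maps to a negative number, so the model would see it on a different scale than the one it was trained on.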

In [None]:
# TODO Change the path based on your set up

evaluation_df = pd.read_csv('/content/drive/My Drive/test.csv')
evaluation_df

In [None]:
# TODO Repeat Data Processing


---

## Save the Processed test.csv Data

In [None]:
# TODO Change the path based on your set up

processed_evaluation_df.to_csv('/content/drive/My Drive/processed_eval.csv', index=True)

In [None]:
# Check if the processed data was correctly saved.

check_df_eval = pd.read_csv('/content/drive/My Drive/processed_eval.csv', index_col=0)
check_df_eval