# Business Understanding
**Stakeholder:** 
* Medical Device Company

**Problem:** 
* The company is unable to produce effective monitoring and treatment technologies for myocardial infarction (MI) survivors. Being able to predict MI complications is essential for advancing technologies in this area. MI can occur without complications or with complications that do not worsen the long-term prognosis. However, about half of patients in the acute and subacute periods have complications that lead to worsening of the disease and even death. Predicting complications of MI is important so that medical devices under development can be designed with those complications in mind and support the necessary preventive measures.


# Data Understanding

**Source:**
* [University of California Irvine (UCI) Machine Learning Repository](https://archive.ics.uci.edu/)

**Dataset:**
* [Myocardial Infarction Complications](https://archive.ics.uci.edu/dataset/579/myocardial+infarction+complications)
    * 1700 rows, 124 columns 
    * 8 potential Target columns, 116 Feature columns
    * All column names and definitions can be found in the [Column Descriptions](../data/column_descriptions.md) file

**Targets (Myocardial Infarction Complications):**
* FIBR_PREDS (Atrial Fibrillation) 
    * Potentially life-threatening, irregular heartbeat caused by fast and irregular contractions in the upper chambers of the heart (atria). It prevents the atria from pumping blood effectively into the lower chambers (ventricles).
* PREDS_TAH (Supraventricular Tachycardia)
    * Irregular, fast, or erratic heartbeat that affects the heart's upper chambers. It's not usually serious and rarely causes sudden death, heart damage, or heart attacks. These outcomes can occur in extreme cases, but they are very unlikely.
* JELUD_TAH (Ventricular Tachycardia)
    * Irregular, fast, or erratic heartbeat that affects the heart's lower chambers. It can become life-threatening when an episode is sustained, conventionally meaning it lasts longer than about 30 seconds (sustained ventricular tachycardia).
* FIBR_JELUD (Ventricular Fibrillation)
    * Life-threatening, irregular heartbeat that affects the heart's ventricles. The lower heart chambers contract rapidly and in an uncoordinated manner. As a result, the heart cannot pump blood to the rest of the body.
* A_V_BLOK (Third-degree AV block)
    * Medical condition in which there is a complete loss of communication between the heart's atria and ventricles. In other words, electrical signals cannot pass from the atria to the ventricles.
* OTEK_LANC (Pulmonary edema)
    * Life-threatening condition caused by fluid buildup in the lungs. The fluid collects in the air sacs of the lungs, making it difficult to breathe.
* RAZRIV (Myocardial rupture)
    * Tear in the heart muscle that occurs after a heart attack. It is life-threatening but is a rare complication of a heart attack.
* DRESSLER (Dressler Syndrome)
    * Inflammation of the sac (pericardium) surrounding the heart. It is an immune system response to damage to the heart tissue or to the sac itself.

**Data Types:**

In this dataset, almost all nominal and ordinal columns have already been numerically encoded. The remaining numeric columns hold measured values (the lab results and age listed below). These need to be scaled so that their larger ranges don't dramatically impact the modeling results.

Continuous Data:
* NA_BLOOD (Serum Sodium Content)
* ALT_BLOOD (Serum AlAT Content)
* AST_BLOOD (Serum AsAT Content)
* KFK_BLOOD (Serum CPK Content)
* L_BLOOD (White Blood Cell Count)
* K_BLOOD (Serum Potassium Content)
* AGE (Age of patient)
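
The scaling step described above could be sketched as follows. This is a minimal example, not the notebook's actual preprocessing: it assumes the data has already been split into train and test frames, and `scale_continuous` and `CONTINUOUS_COLS` are hypothetical names introduced only for illustration. `MinMaxScaler` is used because the notebook imports it later.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Continuous columns taken from the list above
CONTINUOUS_COLS = ["NA_BLOOD", "ALT_BLOOD", "AST_BLOOD", "KFK_BLOOD",
                   "L_BLOOD", "K_BLOOD", "AGE"]


def scale_continuous(train_df, test_df, cols=CONTINUOUS_COLS):
    """Min-max scale `cols`, fitting on the training rows only (hypothetical helper)."""
    scaler = MinMaxScaler()
    train_scaled = train_df.copy()
    test_scaled = test_df.copy()
    # Fit on the training rows so nothing from the test rows leaks into the scaler
    train_scaled[cols] = scaler.fit_transform(train_df[cols])
    # Reuse the training min/max; test values may fall slightly outside [0, 1]
    test_scaled[cols] = scaler.transform(test_df[cols])
    return train_scaled, test_scaled, scaler
```

Fitting the scaler on the training rows only is a design choice that keeps the test set untouched until evaluation. The already-encoded nominal and ordinal columns are left as they are.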

Let's split the data into training and test sets. Because the classes are imbalanced, the split should be stratified so that the class distribution in the train and test subsets matches that of the original dataset.
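
For a single target, one way to check that a stratified split preserves the class distribution is sketched below. It assumes one target column is chosen at a time (for example `FIBR_PREDS`). `stratified_split_on` and `class_shares` are hypothetical helper names, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split_on(df, target, test_size=0.2, seed=42):
    """Split `df` so `target` keeps roughly the same class shares in both parts."""
    train_df, test_df = train_test_split(
        df, test_size=test_size, stratify=df[target], random_state=seed
    )
    return train_df, test_df


def class_shares(full, train, test, target):
    """Return class proportions of `target` side by side for a quick check."""
    return pd.DataFrame({
        "full": full[target].value_counts(normalize=True),
        "train": train[target].value_counts(normalize=True),
        "test": test[target].value_counts(normalize=True),
    })
```

Calling `class_shares(df, train_df, test_df, "FIBR_PREDS")` after the split should show nearly identical proportions in all three columns if the stratification worked.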

In [9]:
import pandas as pd
df = pd.read_csv("../data/data_cleaned.csv")

# Drop the first column, which is unnamed (likely an index saved with the CSV)
df = df.iloc[:, 1:]

In [10]:
df.head()

Unnamed: 0,ID,AGE,SEX,INF_ANAM,STENOK_AN,FK_STENOK,IBS_POST,GB,SIM_GIPERT,DLIT_AG,...,PREDS_TAH,JELUD_TAH,FIBR_JELUD,A_V_BLOK,OTEK_LANCRAZRIV,DRESSLER,ZSN,REC_IM,P_IM_STEN,LET_IS
0,1,77.0,1,2.0,1.0,1.0,2.0,3.0,0.0,7.0,...,0,0,0,0,0,0,0,0,0,0
1,2,55.0,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
2,3,52.0,1,0.0,0.0,0.0,2.0,2.0,0.0,2.0,...,0,0,0,0,0,0,0,0,0,0
3,5,60.0,1,0.0,0.0,0.0,2.0,3.0,0.0,7.0,...,0,0,0,0,0,0,0,0,0,0
4,6,64.0,1,0.0,1.0,2.0,1.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0


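In [None]:
# Import libraries
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit

# Create a StratifiedShuffleSplit object (one 80/20 split, fixed seed)
stratified_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# Define X, y using the eight target columns listed under "Targets"
target_cols = ["FIBR_PREDS", "PREDS_TAH", "JELUD_TAH", "FIBR_JELUD",
               "A_V_BLOK", "OTEK_LANC", "RAZRIV", "DRESSLER"]
X = df.drop(columns=target_cols)  # Features
y = df[target_cols]               # Targets

# Use the stratified split to create train and test indices.
# With a 2-D y, scikit-learn treats each distinct row of target values as one
# class and raises a ValueError if any combination occurs only once.
for train_index, test_index in stratified_split.split(X, y):
    # The returned indices are positional, so index the DataFrames with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]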
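
With eight binary targets, some combinations of complications may appear only once, which `StratifiedShuffleSplit` rejects. One possible workaround, sketched here as an assumption rather than the notebook's method, is to build a single stratification key per row and pool rare combinations into one group. `make_strat_key` and `multilabel_stratified_split` are hypothetical helpers.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit


def make_strat_key(y, min_count=2, rare_label="rare"):
    """Build one label per row from all target columns (hypothetical helper).

    Combinations seen fewer than `min_count` times are pooled under `rare_label`.
    """
    # Join the target values of each row into a string such as "0-1-0"
    key = y.astype(str).agg("-".join, axis=1)
    counts = key.map(key.value_counts())
    key = key.where(counts >= min_count, rare_label)
    # If the pooled group is itself too small, fold it into the largest group
    n_rare = (key == rare_label).sum()
    if 0 < n_rare < min_count:
        key = key.where(key != rare_label, key.value_counts().idxmax())
    return key


def multilabel_stratified_split(X, y, test_size=0.2, seed=42):
    """Stratify on the combined key, then return positional train/test slices."""
    key = make_strat_key(y)
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, key))
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]
```

Pooling trades exact stratification of the rarest combinations for a split that runs. The per-target class shares in train and test are worth checking afterwards.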