1. [Train Test Split](#Train-Test-Split)
2. [Feature Engineering](#Feature-Engineering)
   - [Feature Scaling](#Feature-Scaling)
   - [Feature Encoding](#Feature-Encoding)
   - [Missing Data](#Missing-Data)

In [1]:
import pandas as pd
from ycimpute.imputer import knnimput
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


from warnings import filterwarnings
filterwarnings('ignore')


In [2]:
heart_disease = pd.read_csv("../data/raw/heart_disease.csv")
df = heart_disease.copy()

## Train Test Split 

In [3]:
X = df.drop(["Heart Disease Status"], axis=1)
y = df["Heart Disease Status"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
X_train.head(2)

In [None]:
X_test.head(2)

In [None]:
y_train.head()

In [None]:
y_test.head()

In [None]:
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
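
The class proportions printed above can drift between the two splits when the target is imbalanced. As a hedged sketch (not what the cell above does), `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both splits. The toy frame and its column names below are hypothetical stand-ins for the heart disease table.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20% positive class, 80% negative class
toy = pd.DataFrame({
    "feature": range(100),
    "target": ["Yes"] * 20 + ["No"] * 80,
})

X_toy = toy.drop(columns=["target"])
y_toy = toy["target"]

# stratify=y_toy asks for the same class ratio in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

If the unstratified proportions already match closely, the plain split is fine; stratification matters most for small or strongly imbalanced datasets.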

## Feature Engineering

### Feature Scaling

In [None]:
numeric_cols = X_train.select_dtypes(include=["float64", "int64"]).columns.tolist()

scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
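
Scaling happens here before missing values are handled, so it helps to see how `StandardScaler` treats `NaN`. The small sketch below uses a hypothetical one-column frame: the scaler ignores missing values when fitting, leaves them missing when transforming, and applies the training statistics to the test data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column with a missing value in each split
train = pd.DataFrame({"age": [40.0, 50.0, np.nan, 60.0]})
test = pd.DataFrame({"age": [50.0, np.nan]})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # statistics come from train only
test_scaled = scaler.transform(test)        # test reuses the train statistics

print(scaler.mean_)   # mean of the non-missing training values
print(test_scaled)    # NaN stays NaN after scaling
```

Because the `NaN` entries survive scaling, the later imputation step still sees them.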

### Feature Encoding 

In [None]:
categorical_cols = X_train.select_dtypes(include=["object"]).columns.tolist()
# One short prefix per categorical column, in the same order as categorical_cols
pref = ["Gen", "Exercise", "Smok", "Family", "Dia", "HighB", "LowHDL", "HighLDL", "Alch", "Stress", "Sugar"]

# One-hot encode each split; drop_first removes one reference level per column
X_train = pd.get_dummies(data=X_train, columns=categorical_cols, prefix=pref, drop_first=True)
X_test = pd.get_dummies(data=X_test, columns=categorical_cols, prefix=pref, drop_first=True)

bool_columns = X_train.select_dtypes(include = "bool").columns
X_train[bool_columns] = X_train[bool_columns].astype(int)

bool_columns = X_test.select_dtypes(include = "bool").columns
X_test[bool_columns] = X_test[bool_columns].astype(int)
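
Encoding the two splits separately only yields identical columns when every category appears in both. One possible safeguard, sketched below on a hypothetical `Stress` column, is to fix the category set from the training data with `pd.CategoricalDtype` before calling `pd.get_dummies`, so that `drop_first` removes the same reference level in both splits.

```python
import pandas as pd

# Hypothetical toy splits: the test split only contains the "Low" level
train = pd.DataFrame({"Stress": ["Low", "High", "Medium"]})
test = pd.DataFrame({"Stress": ["Low", "Low"]})

# Fix the category set (and its order) from the training data
stress_type = pd.CategoricalDtype(
    categories=sorted(train["Stress"].dropna().unique())
)

train_enc = pd.get_dummies(
    train.astype({"Stress": stress_type}),
    prefix=["Stress"], drop_first=True, dtype=int,
)
test_enc = pd.get_dummies(
    test.astype({"Stress": stress_type}),
    prefix=["Stress"], drop_first=True, dtype=int,
)

print(train_enc.columns.tolist())
print(test_enc)
```

Passing `dtype=int` also removes the need for a separate boolean-to-integer conversion. Note that `pd.get_dummies` encodes a missing value as an all-zero row unless `dummy_na=True` is set, so missing entries in categorical columns are no longer visible to `isnull()` after encoding.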

In [None]:
X_train.head(2)

In [None]:
X_test.head(2)

### Missing Data

In [None]:
X_train.isnull().sum()

In [None]:
(X_train.isnull().sum() / len(X_train)) * 100

In [None]:
X_test.isnull().sum()

In [None]:
(X_test.isnull().sum() / len(X_test)) * 100

In [None]:
(df.isnull().sum()/ len(df)) * 100
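
The count and percentage checks above can be combined into one table per frame. The helper `missing_summary` below is a hypothetical convenience function, not part of the notebook, and the toy frame only illustrates its output.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, highest first."""
    summary = pd.DataFrame({
        "missing_count": frame.isnull().sum(),
        # isnull().mean() divides by the length of this frame, not another one
        "missing_pct": frame.isnull().mean() * 100,
    })
    return summary.sort_values("missing_pct", ascending=False)

# Hypothetical toy frame with different amounts of missing data per column
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```

Calling it on `X_train`, `X_test`, and `df` in turn would give each percentage relative to the right number of rows.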

**Overview of Missing Data**:

A detailed analysis of missing data was conducted on the dataset. The findings indicate that:

* The missing data percentage ranges between 0.19% and 0.30% for most features.

* The 'Alcohol Consumption' column has a significantly higher missing data percentage (25.86%), which is approximately one-fourth of the dataset.

* No evident correlation was found between missing values and other variables.

* There are no outliers in the dataset that could distort the missing data imputation process.

**Strategy for Handling Missing Data**:

1. Dropping 'Alcohol Consumption' Column:

* The missing rate is excessively high (~26%), making it impractical to impute.

* Retaining this column and applying imputation could introduce significant biases.

* Therefore, the best approach is to remove this column from the dataset.

2. Imputing Missing Values for Other Features:

* Given the low missing percentages (well below 1% per feature), imputation is preferred over deletion.

* KNN Imputer will be used for missing value imputation because:

    * The dataset has no outliers, making KNN a suitable method.

    * Methods like mean/median imputation could distort feature distributions.

    * Regression-based imputation might be excessive for a medium-sized dataset.

    * KNN Imputer leverages feature similarity to generate realistic missing value replacements.

**Final Decision**:

* 'Alcohol Consumption' will be removed from the dataset. KNN Imputer will be used to fill missing values in all other columns.

* This strategy ensures data integrity while minimizing information loss, thus maintaining the quality of the dataset for machine learning applications.
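
As an illustration of the KNN idea described above, here is a minimal sketch using scikit-learn's `KNNImputer`. This is a swapped-in alternative to the `ycimpute` imputer used in this notebook, shown on hypothetical toy data. It fits the imputer on the training split and reuses it for the test split, so test rows are filled from training neighbours.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy splits with a few gaps in already-scaled features
train = pd.DataFrame({
    "age": [0.1, 0.2, np.nan, 0.9, 1.0],
    "bmi": [0.1, 0.2, 0.15, 0.9, np.nan],
})
test = pd.DataFrame({
    "age": [np.nan, 0.95],
    "bmi": [0.12, np.nan],
})

imputer = KNNImputer(n_neighbors=2)

# Fit on train only, then apply the same fitted imputer to test
train_filled = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)
test_filled = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)

print(train_filled)
print(test_filled)
```

Each gap is replaced by the average of that feature over the nearest rows, with distance measured on the features both rows have observed. Observed values are left unchanged.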

In [None]:
X_train = X_train.drop(["Alch_Low", "Alch_Medium"], axis=1)
X_test = X_test.drop(["Alch_Low", "Alch_Medium"], axis=1)

In [None]:
# Impute the remaining gaps with KNN (k=5); keep each split's own columns and row index
X_train_filled = knnimput.KNN(k=5).complete(X_train.values)
X_train = pd.DataFrame(X_train_filled, columns=X_train.columns, index=X_train.index)

X_test_filled = knnimput.KNN(k=5).complete(X_test.values)
X_test = pd.DataFrame(X_test_filled, columns=X_test.columns, index=X_test.index)

In [None]:
print(X_test.isnull().sum())
print("----------------------------------------")
print(X_train.isnull().sum())

In [None]:
X_train.to_csv("../data/processed/X_train.csv", index=False)
X_test.to_csv("../data/processed/X_test.csv", index=False)
y_train.to_csv("../data/processed/y_train.csv", index=False)
y_test.to_csv("../data/processed/y_test.csv", index=False)