# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# If an import fails, run: pip install -r requirements.txt

# Data Extraction

Read **heart_train.csv** into a pandas DataFrame (call it df).

In [None]:
df = pd.read_csv('heart_train.csv')

# Data Visualization

Try viewing the first five rows of your data (hint: try the head function).

In [None]:
df.head()

Let's visualize our data a bit and see the number of people who have heart disease versus those who don't. In this particular dataset, more people have heart disease than not.

In [None]:
df['HeartDisease'].hist()
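
The histogram shows the class balance visually; the same comparison can be read off as numbers. The sketch below uses a small made-up label series as a stand-in for `df['HeartDisease']`, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical labels standing in for df['HeartDisease'] (1 = disease, 0 = no disease)
labels = pd.Series([1, 0, 1, 1, 0, 1], name="HeartDisease")

counts = labels.value_counts()                # number of rows per class
shares = labels.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares.round(2))
```

On the real data, the equivalent call would be `df['HeartDisease'].value_counts()`.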

## Data Cleaning/PreProcessing

Before we continue, let us do some preprocessing on our data. Preprocessing is the process a data scientist or ML engineer goes through to make sure the data is clean and ready for the model. One example is checking whether any column contains null values and replacing them.

In [None]:
df.isnull().values.any()
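
The check above only answers whether any null exists. If it ever returns True, a possible follow-up is to count the gaps per column and fill them. The sketch below uses a made-up frame (the column names mirror the heart dataset, the values do not), and filling with the column median is one reasonable choice, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps; values are invented for illustration
toy = pd.DataFrame({
    "Age": [54, np.nan, 61, 47],
    "Cholesterol": [230.0, 198.0, np.nan, 210.0],
})

null_counts = toy.isnull().sum()   # number of missing values in each column
filled = toy.fillna(toy.median())  # replace each gap with that column's median

print(null_counts)
print(filled)
```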

## Feature Engineering

Now it is time to do some feature engineering. Extract values from the columns you can use as features (hint: use numerical columns, or encode categorical columns as numbers first). Store them in a variable called X. Note: do not use PatientId, and remember to use .values to convert the result to a NumPy array.

In [None]:
# List of categorical columns to convert
categorical_cols = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']

# Apply One-Hot Encoding using pd.get_dummies
# This converts columns like 'Sex' (M/F) into 'Sex_F' and 'Sex_M' (0s and 1s)
df_clean = pd.get_dummies(df, columns=categorical_cols)

# Display the new columns to verify
print("New columns after encoding:")
print(df_clean.columns)

In [None]:
# Extract features into variable X
# We drop 'PatientId' (not a feature) and 'HeartDisease' (the target label)
# .values converts the Pandas DataFrame into a NumPy array, which is required for training
X = df_clean.drop(['PatientId', 'HeartDisease'], axis=1).values

Extract your labels into a variable called y (the HeartDisease column), using .values as above.

In [None]:
# Extract the target variable into y
y = df_clean['HeartDisease'].values

# Verify the shapes to ensure extraction was successful
print(f"\nShape of X: {X.shape}") # Should be (rows, number_of_features)
print(f"Shape of y: {y.shape}") # Should be (rows,)

# Data Normalization

We are now going to normalize our data. Scaling the features makes the model easier to train and more likely to converge on a good solution. Use the StandardScaler from sklearn to achieve this. Scale only the X variable, and store the result back into X.

In [None]:
sc = StandardScaler()

# Normalize X
# fit_transform calculates the mean and std dev for each feature,
# then subtracts the mean and divides by the std dev,
# so each feature ends up with mean 0 and std 1
X = sc.fit_transform(X)


In [None]:
# Verify the result (optional)
# Most values should now be small, roughly between -3 and 3.
print("First 5 rows of normalized X:")
print(X[:5])
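
To make the scaling step concrete, the sketch below checks on a small made-up array that StandardScaler matches the manual formula (value minus column mean, divided by column standard deviation). The array `toy` is an assumption for illustration and is unrelated to the heart data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature array with very different scales
toy = np.array([[1.0, 200.0], [2.0, 240.0], [3.0, 280.0]])

scaled = StandardScaler().fit_transform(toy)

# Same computation by hand; np.std defaults to the population std (ddof=0),
# which is what StandardScaler uses
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```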

# Train/Test Split

We are now going to split our data into train and test sets. This is important because a model can overfit, so we don't want to test on the same data we just trained on. We will use the **train_test_split** function to achieve this. This has already been imported for you. Store the result in variables *X_train, X_test, y_train, y_test* (the order in which the function returns them). Use an *80/20* split.

In [None]:
# Split the data into training and testing sets
# test_size=0.2 means 20% of the data goes to testing, 80% to training
# random_state=42 ensures the split is reproducible (same split every time you run it)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Let us view the shape of the train data. The first number represents how many rows, the second represents how many columns or features.

In [None]:
X_train.shape

Let us do the same for the test data.

In [None]:
X_test.shape
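
One common variation on the order used in this notebook is to split first and fit the scaler on the training rows only, so that no test-set statistics influence training. The sketch below shows that ordering on randomly generated stand-in data. The names `X_toy`, `y_toy`, `X_tr`, and `X_te` are hypothetical, and `stratify` is an optional extra that keeps the class ratio equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 rows, 3 features, perfectly balanced labels
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = np.array([0, 1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)  # mean and std come from the training rows only
X_te = scaler.transform(X_te)      # test rows reuse the training statistics

print(X_tr.shape, X_te.shape)
```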

## Logistic Regression

Let us create a model and fit it to the train dataset. We will use the LogisticRegression model from sklearn.

In [None]:
from sklearn.linear_model import LogisticRegression

# 1. Create the Logistic Regression model
# max_iter=1000 is often needed to ensure the solver converges on the solution
clf = LogisticRegression(random_state=42, max_iter=1000)

Call the fit function for the classifier on *X_train* and *y_train*.

In [None]:
# 2. Fit the model to the training data
clf.fit(X_train, y_train)

We are now going to test our model. Call the score function on the classifier and pass in X_test and y_test. The score you get represents the accuracy of the model, e.g. a score of 0.9 means the model classified 90% of the test rows correctly.

In [None]:
# 3. Test the results and report accuracy
# The .score() method predicts on X_test and compares to y_test automatically
accuracy = clf.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
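
For a classifier, `score` is the fraction of test rows whose predicted label equals the true label. The sketch below confirms this on synthetic data from `make_classification`; the data and the name `toy_clf` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (not the heart data)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

toy_clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

score = toy_clf.score(X_te, y_te)                # built-in accuracy
manual = np.mean(toy_clf.predict(X_te) == y_te)  # fraction of correct predictions

print(score, manual)
```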

# Neural Network

Now let's try the same with a neural network. We will create a small neural network with some hidden layers and an output layer. (Note: you are free to design this yourself.) The network should output one value (try using a sigmoid activation for the last layer).

In [None]:
# 1. Create the Neural Network
from tensorflow.keras.layers import Dropout # Import Dropout

model = Sequential()

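# Input: 20 features per row; the final layer will output 1 value
model.add(Dense(units=36, activation='relu', input_dim=X_train.shape[1]))
model.add(Dropout(0.5))  # randomly zero 50% of activations during training
model.add(Dense(units=16, activation='relu'))
model.add(Dropout(0.2))  # lighter dropout deeper in the network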
model.add(Dense(units=8, activation='relu'))

# Add the output layer
model.add(Dense(units=1, activation='sigmoid'))
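
As a rough size check for the architecture above: a Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout layers add no trainable parameters. The arithmetic below assumes 20 input features, as stated in the code comment; `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs, units):
    """Trainable parameters of a Dense layer: weights plus one bias per unit."""
    return (n_inputs + 1) * units

n_features = 20               # assumed input width
layer_units = [36, 16, 8, 1]  # Dense layers in the order they are added

per_layer = []
n_in = n_features
for units in layer_units:
    per_layer.append(dense_params(n_in, units))
    n_in = units  # this layer's output feeds the next layer

total = sum(per_layer)
print(per_layer, total)
```

Under that assumption, the per-layer counts are 756, 592, 136, and 9, for 1493 in total. Calling `model.summary()` on the real model should report the corresponding figures.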

In [None]:
# 2. The optimizer (AdamW: Adam with decoupled weight decay)
weight_decay = 0.004  # a reasonable starting point for small datasets
optimizer = tf.keras.optimizers.AdamW(
    learning_rate=0.001,
    weight_decay=weight_decay
)

# The loss (binary cross-entropy with label smoothing)
label_smoothing = 0.05  # discourages the model from being "overconfident"
loss = tf.keras.losses.BinaryCrossentropy(label_smoothing=label_smoothing)

model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
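
Label smoothing for binary cross-entropy squeezes the 0/1 targets toward 0.5 before the loss is computed. The NumPy sketch below reproduces that idea outside Keras. The helper names are hypothetical, and the smoothing formula is the one I understand Keras to apply for the binary case.

```python
import numpy as np

def smooth_labels(y, label_smoothing):
    """Squeeze hard 0/1 targets toward 0.5."""
    return y * (1.0 - label_smoothing) + 0.5 * label_smoothing

def binary_crossentropy(y_true, p):
    """Mean binary cross-entropy between targets and predicted probabilities."""
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

y_true = np.array([0.0, 1.0])
y_smooth = smooth_labels(y_true, 0.05)  # targets become 0.025 and 0.975

p_confident = np.array([0.001, 0.999])  # extremely confident predictions
p_moderate = np.array([0.025, 0.975])   # predictions that match the smoothed targets

loss_confident = binary_crossentropy(y_smooth, p_confident)
loss_moderate = binary_crossentropy(y_smooth, p_moderate)
print(y_smooth, loss_confident, loss_moderate)
```

With smoothed targets, the extremely confident predictions incur the higher loss, which is how smoothing discourages overconfidence.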

Train the model: call the fit function and pass in X_train and y_train.

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

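# 3. Fit the model to the training data
# Stop if validation loss doesn't improve for 10 epochs, then restore the best weights
callback = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# validation_split holds out 20% of the training rows so that val_loss exists for the callback
model.fit(X_train, y_train, epochs=100, batch_size=8, validation_split=0.2, callbacks=[callback])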
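
The patience rule behind early stopping can be sketched in plain Python: training stops once the monitored loss has failed to improve for `patience` consecutive epochs, and the best epoch is remembered. The function and the loss values below are hypothetical and simplify what the Keras callback does (for example, they ignore `min_delta`).

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stopped_epoch), both 0-based, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1  # never triggered: ran all epochs

# Made-up validation losses: the best value is at epoch 5, then three epochs without improvement
val_losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.54, 0.58, 0.59, 0.60, 0.61]
best_epoch, stopped_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stopped_epoch)
```

With `restore_best_weights=True`, the real callback would then roll the model back to the weights from the best epoch.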

Let us now test the model. Call the evaluate function and pass in X_test and y_test.

In [205]:
# 4. Test the results
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Neural Network Accuracy: {accuracy:.2f}")

6/6 ━━━━━━━━━━━━━━━━━━━━ 1s 38ms/step - accuracy: 0.8841 - loss: 0.4266
Neural Network Accuracy: 0.88


## Test

You are now going to test your model on the hold-out test set. There is a file called **heart_test.csv**. You will notice that this file does not have a HeartDisease column. You will have to use your model to make predictions on the test data. You will then create a file called submission.csv, which you will upload to Kaggle to see your results.

Read heart_test.csv into a data frame called test_df.

In [None]:
test_df = pd.read_csv('heart_test.csv')

Let us view the first five rows.

In [None]:
test_df.head()

Let us now extract the same features as we did above to test on. You can call the result X_new.

In [None]:
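test_df_clean = pd.get_dummies(test_df, columns=categorical_cols)

# Align with the training feature columns: a category absent from the test file gets an all-zero column
feature_cols = df_clean.drop(['PatientId', 'HeartDisease'], axis=1).columns
X_new = test_df_clean.reindex(columns=feature_cols, fill_value=0).values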
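
One thing to watch when one-hot encoding two files separately: if a category appears in the training file but not in the test file, `pd.get_dummies` produces fewer columns for the test data, and the scaler and model then see the wrong shape. The sketch below shows the mismatch on made-up frames and one way to realign them with `reindex`. The frames and their values are hypothetical.

```python
import pandas as pd

# Hypothetical data: 'NAP' occurs in training but not in the test file
toy_train = pd.DataFrame({"ChestPainType": ["ATA", "NAP", "ASY"]})
toy_test = pd.DataFrame({"ChestPainType": ["ATA", "ASY"]})

train_encoded = pd.get_dummies(toy_train, columns=["ChestPainType"])
test_encoded = pd.get_dummies(toy_test, columns=["ChestPainType"])
print(train_encoded.shape[1], test_encoded.shape[1])  # different column counts

# Reuse the training columns; any missing category becomes a column of zeros
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(list(test_aligned.columns))
```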

We now need to normalize the test data as well. Use the scaler that you created above called sc and call the transform function and pass in X_new. Store the result back into X_new.

In [None]:
X_new = sc.transform(X_new)  # Use transform, not fit_transform
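
The reason to call transform rather than fit_transform here is that new data must be scaled with the statistics learned from the training data. The toy sketch below, with made-up one-column arrays, shows how the two calls differ.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_col = np.array([[0.0], [2.0]])  # hypothetical training feature: mean 1, std 1
new_col = np.array([[3.0]])           # hypothetical new row

scaler = StandardScaler().fit(train_col)
reused = scaler.transform(new_col)               # (3 - 1) / 1 = 2, on the training scale
refit = StandardScaler().fit_transform(new_col)  # re-centred on itself, so it becomes 0

print(reused, refit)
```

Refitting discards the training scale, so the model would receive inputs it was never trained to interpret.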

Call the predict function on X_new to get the predictions.

In [206]:
predictions_tensor = model.predict(X_new)

4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 77ms/step


The neural network outputs probabilities, which we must convert to 1 or 0. A probability greater than or equal to 0.5 counts as a 1. Run the cell below if the model you chose as your final model is a neural net created using TensorFlow.

In [207]:
predictions = [1 if p >= 0.5 else 0 for p in predictions_tensor.squeeze()]
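
The same thresholding can also be written as a single vectorized NumPy expression. The sketch below uses a made-up array shaped like the output of `model.predict`, i.e. one probability per row in a column.

```python
import numpy as np

# Hypothetical model outputs with shape (n_rows, 1)
probs = np.array([[0.20], [0.50], [0.93]])

# Flatten to 1-D, compare against the 0.5 threshold, and convert True/False to 1/0
labels = (probs.reshape(-1) >= 0.5).astype(int)
print(labels)
```

Using `reshape(-1)` keeps the result one-dimensional even when there is only a single row.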

# Submission
Create a data frame with two columns, PatientId and HeartDisease (try the pd.DataFrame function). The PatientId column should have the same values as the PatientId column of the test_df data frame from above, and the HeartDisease column should hold the predictions you just created. Create a CSV file from this data frame (try the .to_csv function, and set the index flag to False so that row indexes are not written). This creates the CSV file that you submit to Kaggle.

In [208]:
# Create the Submission DataFrame
submission_df = pd.DataFrame({
    'PatientId': test_df['PatientId'],
    'HeartDisease': predictions
})
# Save to CSV
# index=False removes the row numbers, which Kaggle doesn't want
submission_df.to_csv('submission.csv', index=False)
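
Before uploading, it can be worth reading the file back and checking its shape. The sketch below does this for a made-up submission written to a temporary directory. The specific checks (column names, 0/1 labels, unique ids) are assumptions about what a sensible submission looks like, not Kaggle's documented requirements.

```python
import os
import tempfile

import pandas as pd

# Hypothetical submission with three patients
toy_submission = pd.DataFrame({"PatientId": [101, 102, 103], "HeartDisease": [1, 0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    toy_submission.to_csv(path, index=False)  # index=False: no extra row-number column
    check = pd.read_csv(path)

columns_ok = list(check.columns) == ["PatientId", "HeartDisease"]
labels_ok = bool(check["HeartDisease"].isin([0, 1]).all())
ids_unique = check["PatientId"].is_unique
print(columns_ok, labels_ok, ids_unique)
```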