# Assignment 2

### Objective:
This assignment aims to equip students with hands-on experience in predictive modeling using real-world clinical data related to heart failure. The students will implement regression and classification techniques to predict the severity of a condition and patient survival outcomes.

### Dataset:
The dataset contains records of 299 patients who suffered heart failure, with 13 clinical features and two target variables: DEATH_EVENT (binary: 0 or 1) and Severity (numerical). You can download the data from the course files section.

### Reference:
Davide Chicco & Giuseppe Jurman (2020): "Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone", BMC Medical Informatics and Decision Making, 20:16.

## Task 1: Regression Analysis (Predicting Severity)

### 1.1 Data Preparation
- Split the dataset into training (70%) and testing (30%) sets.
- Formulate: Describe the splitting mathematically

In [121]:
import pandas as pd

In [122]:
df = pd.read_csv('heart_failure_clinical_records_with_severity (3).csv')
df.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT,Severity
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1,6.6
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1,2.0
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1,6.4
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1,4.6
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1,8.8


In [123]:
data = df.iloc[:,:-2]
data.head()


Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8


In [124]:
Y=df.iloc[:,-1]
Y.head()

0    6.6
1    2.0
2    6.4
3    4.6
4    8.8
Name: Severity, dtype: float64

In [125]:
from sklearn.model_selection import train_test_split
# Split the dataset into training (70%) and testing (30%) sets.
# No random_state is passed, so the split (and every metric below) varies between runs.
x_train, x_test, y_train, y_test = train_test_split(data, Y, test_size=0.3)
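
The split can be described mathematically as follows. This is a sketch that assumes $n = 299$ rows, $d = 12$ features, and scikit-learn's rounding rule (the test size is rounded up), which gives 90 test and 209 training samples. The symbol $\pi$ denotes a uniformly random permutation of the row indices and is introduced here only for illustration.

$$
\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}, \qquad x_i \in \mathbb{R}^{d},\; y_i \in \mathbb{R}
$$

$$
n_{\text{test}} = \lceil 0.3\,n \rceil = 90, \qquad n_{\text{train}} = n - n_{\text{test}} = 209
$$

$$
\mathcal{D}_{\text{train}} = \{(x_{\pi(i)}, y_{\pi(i)})\}_{i=1}^{n_{\text{train}}}, \qquad
\mathcal{D}_{\text{test}} = \{(x_{\pi(i)}, y_{\pi(i)})\}_{i=n_{\text{train}}+1}^{n}
$$

$$
\mathcal{D}_{\text{train}} \cap \mathcal{D}_{\text{test}} = \emptyset, \qquad
\mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{test}} = \mathcal{D}
$$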

### 1.2 Linear Regression
- Train a Linear Regression model to predict Severity using all available clinical features except DEATH_EVENT.
- Formulate: Write the regression model. Derive the least squares solution. 

In [126]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Train a Linear Regression model to predict Severity using all available clinical features except DEATH_EVENT
L = LinearRegression()
L.fit(x_train, y_train) # training 
y_pred_L = L.predict(x_test) 
print("MSE [Linear]: ", mean_squared_error(y_test, y_pred_L), "\n")

MSE [Linear]:  0.7146626577226933 



- Formulate: Write the regression model. Derive the least squares solution. 
    - todo***
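
One way to fill in this formulation is sketched below. It assumes the design matrix $X \in \mathbb{R}^{n \times (d+1)}$ carries a leading column of ones for the intercept and that $X^\top X$ is invertible; the notation is not taken from the assignment.

$$
\hat{y} = Xw, \qquad L(w) = \lVert y - Xw \rVert_2^2 = (y - Xw)^\top (y - Xw)
$$

$$
\nabla_w L(w) = -2\,X^\top (y - Xw) = 0 \;\Longrightarrow\; X^\top X\, w = X^\top y
$$

$$
\hat{w} = (X^\top X)^{-1} X^\top y
$$

The Hessian $2\,X^\top X$ is positive semidefinite, so the stationary point is a minimizer.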

### 1.3 Ridge Regression
- Train a Ridge Regression model with regularization.
- Formulate: Derive the closed-form solution

In [127]:
from sklearn.linear_model import Ridge

R = Ridge(alpha=0.5)
R.fit(x_train, y_train) # training 
y_pred_R = R.predict(x_test) 
print("MSE [Ridge]: ", mean_squared_error(y_test, y_pred_R), "\n")

MSE [Ridge]:  0.7169162601473181 



- Formulate: Derive the closed-form solution.
    - todo***
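
The closed-form ridge solution can be sketched in NumPy. The objective assumed here is $\lVert y - Xw - b \rVert_2^2 + \lambda \lVert w \rVert_2^2$ with an unpenalized intercept $b$, which gives $\hat{w} = (X_c^\top X_c + \lambda I)^{-1} X_c^\top y_c$ on centered data. The function name `ridge_closed_form` and the synthetic data are hypothetical, and `lam` plays the role of `alpha` in the cell above.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return (w, b) minimizing ||y - Xw - b||^2 + lam * ||w||^2 (b is not penalized)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center features and target so the intercept drops out of the penalized problem
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    d = X.shape[1]
    # Solve (Xc^T Xc + lam * I) w = Xc^T yc instead of forming an explicit inverse
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Synthetic data for illustration only (not the heart-failure records)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=50)

w_ols, b_ols = ridge_closed_form(X_demo, y_demo, lam=0.0)      # lam = 0 recovers least squares
w_ridge, b_ridge = ridge_closed_form(X_demo, y_demo, lam=0.5)  # lam > 0 shrinks the weights
```

Setting `lam=0.0` reduces the system to the normal equations, which makes the link to Section 1.2 explicit.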

### 1.4 Lasso Regression
- Train a Lasso Regression model and identify the most important features.
- Formulate

In [128]:
from sklearn.linear_model import Lasso

L1 = Lasso()
L1.fit(x_train, y_train) # training 
y_pred_L1 = L1.predict(x_test) 
print("MSE [Lasso]: ", mean_squared_error(y_test, y_pred_L1), "\n")

# Important for feature selection: print the Lasso coefficient of each feature
for col, coef in zip(data.columns, L1.coef_):
    print(f"{col}: {coef}")

MSE [Lasso]:  1.2634285879433556 

age: 0.0443524006575267
anaemia: 0.0
creatinine_phosphokinase: -0.0001368905019204772
diabetes: 0.0
ejection_fraction: -0.06923999414332471
high_blood_pressure: 0.0
platelets: 4.484416825801935e-07
serum_creatinine: 0.0
serum_sodium: -0.10262036736432031
sex: -0.0
smoking: 0.0
time: -0.005368244923153997


- Identify the most important features:
    - age
    - creatinine_phosphokinase
    - ejection_fraction
    - serum_sodium
    - time

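- The features listed above have non-zero Lasso coefficients, so they drive the predictions. platelets also keeps a non-zero coefficient (about 4.5e-07); it looks negligible mainly because platelet counts are in the hundreds of thousands. The remaining six features were shrunk to exactly 0. Because the features were not standardized before fitting, raw coefficient magnitudes are not directly comparable across features.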

- Formulate: Write the Lasso regression model and its objective.
    - todo***
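
A possible formulation is sketched below, assuming the objective scaling documented for scikit-learn's `Lasso` (default $\alpha = 1$, as used in the cell above). Here $x_j$ is the $j$-th column of $X$ and $X_{-j}, w_{-j}$ drop that column and weight; this notation is introduced only for the sketch.

$$
\hat{w} = \arg\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The $\ell_1$ term is not differentiable at $w_j = 0$, so there is no general closed-form solution. Coordinate descent updates one weight at a time with the soft-thresholding operator $S$:

$$
\rho_j = \frac{1}{n}\, x_j^\top \left( y - X_{-j} w_{-j} \right), \qquad
z_j = \frac{1}{n} \lVert x_j \rVert_2^2, \qquad
w_j \leftarrow \frac{S(\rho_j, \alpha)}{z_j}
$$

$$
S(a, t) = \operatorname{sign}(a)\,\max(\lvert a \rvert - t,\, 0)
$$

Whenever $\lvert \rho_j \rvert \le \alpha$ the update sets $w_j = 0$ exactly, which is why Lasso performs feature selection.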

### 1.5 Kernel Regression
- Apply Kernel Regression with three kernels:
    - Linear
    - Polynomial
    - Radial Basis Function (RBF)
- Formulate: Derive the kernel regression estimator


In [129]:
from sklearn.kernel_ridge import KernelRidge

KL = KernelRidge(kernel='linear')
KL.fit(x_train, y_train) # training 
y_pred_KL = KL.predict(x_test) 
print("MSE [kernel ridge linear]: ", mean_squared_error(y_test, y_pred_KL), "\n")

MSE [kernel ridge linear]:  0.9044423599846368 



In [130]:
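# Note: with degree=1 the polynomial kernel (gamma * <x, x'> + coef0) is affine in the inputs,
# so this fit stays close to the linear kernel; a higher degree is needed for a nonlinear fit.
KP = KernelRidge(kernel='poly', degree=1)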
KP.fit(x_train, y_train) # training 
y_pred_KP = KP.predict(x_test) 
print("MSE [kernel ridge poly]: ", mean_squared_error(y_test, y_pred_KP), "\n")

MSE [kernel ridge poly]:  0.8741161686888543 



In [131]:
KR = KernelRidge(kernel='rbf', gamma=0.1)
KR.fit(x_train, y_train) # training 
y_pred_KR = KR.predict(x_test) 
print("MSE [kernel ridge rbf]: ", mean_squared_error(y_test, y_pred_KR), "\n")

MSE [kernel ridge rbf]:  11.833777775565697 
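
A plausible reading of the RBF result above is that the features were not scaled: with values such as platelets in the hundreds of thousands, $\exp(-\gamma \lVert x - x' \rVert^2)$ is essentially 0 between distinct patients, so test predictions collapse toward 0. The sketch below reproduces that effect on synthetic data with a plain NumPy kernel ridge (no intercept, which is assumed to match `KernelRidge`). All names and data are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """k(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows in A and B."""
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def kernel_ridge_predict(X_tr, y_tr, X_te, gamma=0.1, lam=1.0):
    """Kernel ridge without intercept: solve (K + lam * I) a = y, predict k(x)^T a."""
    K = rbf_kernel(X_tr, X_tr, gamma)
    coef = np.linalg.solve(K + lam * np.eye(len(X_tr)), y_tr)
    return rbf_kernel(X_te, X_tr, gamma) @ coef

# Synthetic stand-in: one feature on a platelets-like scale, one on a unit scale,
# and a target with non-zero mean (illustration only, not the clinical data).
rng = np.random.default_rng(0)
n = 300
X_demo = np.column_stack([rng.normal(250_000, 100_000, size=n), rng.normal(0.0, 1.0, size=n)])
y_demo = 3.0 + np.sin(X_demo[:, 1]) + 0.1 * rng.normal(size=n)
X_tr, X_te, y_tr, y_te = X_demo[:200], X_demo[200:], y_demo[:200], y_demo[200:]

# Unscaled features: the kernel between train and test points is ~0
mse_raw = np.mean((y_te - kernel_ridge_predict(X_tr, y_tr, X_te)) ** 2)

# Standardize with statistics from the training fold only, then refit
mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
pred_scaled = kernel_ridge_predict((X_tr - mu) / sigma, y_tr, (X_te - mu) / sigma)
mse_scaled = np.mean((y_te - pred_scaled) ** 2)

print(f"MSE raw: {mse_raw:.3f}, MSE scaled: {mse_scaled:.3f}")
```

In the notebook itself the same idea would amount to standardizing `x_train` and `x_test` before fitting `KernelRidge`; whether that fully explains the reported MSE would need to be checked on the real data.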



- Formulate: Derive the kernel regression estimator.
    - todo***
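
The estimator can be derived as follows. The sketch assumes "kernel regression" means kernel ridge regression, since the cells above use `KernelRidge`. The feature map $\phi$, the matrix $\Phi$ with rows $\phi(x_i)^\top$, and the kernel parameters $\gamma$, $c_0$, $p$ are notation introduced for this sketch.

$$
\min_{w} \; \lVert y - \Phi w \rVert_2^2 + \lambda \lVert w \rVert_2^2
\;\Longrightarrow\;
\hat{w} = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y = \Phi^\top (\Phi \Phi^\top + \lambda I)^{-1} y
$$

$$
K = \Phi \Phi^\top, \quad K_{ij} = k(x_i, x_j) = \phi(x_i)^\top \phi(x_j), \qquad
\alpha = (K + \lambda I)^{-1} y
$$

$$
\hat{f}(x) = \phi(x)^\top \hat{w} = \sum_{i=1}^{n} \alpha_i \, k(x_i, x)
$$

$$
k_{\text{lin}}(x, x') = x^\top x', \qquad
k_{\text{poly}}(x, x') = (\gamma\, x^\top x' + c_0)^{p}, \qquad
k_{\text{rbf}}(x, x') = \exp\!\left(-\gamma \lVert x - x' \rVert_2^2\right)
$$

The prediction depends on the data only through kernel evaluations, so $\phi$ never has to be computed explicitly.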

### 1.6 Evaluation
- Evaluate all models using:
    - Mean Squared Error (MSE)
    - R-squared 

In [132]:
print("Mean Squared Error:")
print("MSE [Linear]: ", mean_squared_error(y_test, y_pred_L))
print("MSE [Ridge]: ", mean_squared_error(y_test, y_pred_R))
print("MSE [Lasso]: ", mean_squared_error(y_test, y_pred_L1))
print("MSE [kernel ridge linear]: ", mean_squared_error(y_test, y_pred_KL))
print("MSE [kernel ridge poly]: ", mean_squared_error(y_test, y_pred_KP))
print("MSE [kernel ridge rbf]: ", mean_squared_error(y_test, y_pred_KR))

print("\nR-squared:")
print("R^2 [Linear]: ", r2_score(y_test, y_pred_L))
print("R^2 [Ridge]: ", r2_score(y_test, y_pred_R))
print("R^2 [Lasso]: ", r2_score(y_test, y_pred_L1))
print("R^2 [kernel ridge linear]: ", r2_score(y_test, y_pred_KL))
print("R^2 [kernel ridge poly]: ", r2_score(y_test, y_pred_KP))
print("R^2 [kernel ridge rbf]: ", r2_score(y_test, y_pred_KR))

Mean Squared Error:
MSE [Linear]:  0.7146626577226933
MSE [Ridge]:  0.7169162601473181
MSE [Lasso]:  1.2634285879433556
MSE [kernel ridge linear]:  0.9044423599846368
MSE [kernel ridge poly]:  0.8741161686888543
MSE [kernel ridge rbf]:  11.833777775565697

R-squared:
R^2 [Linear]:  0.7768937800793557
R^2 [Ridge]:  0.7761902415458549
R^2 [Lasso]:  0.6055778578190449
R^2 [kernel ridge linear]:  0.7176476007361545
R^2 [kernel ridge poly]:  0.7271149512846671
R^2 [kernel ridge rbf]:  -2.6943156303989384
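
To make the two metrics concrete, the sketch below computes them by hand in plain Python using $R^2 = 1 - SS_{\text{res}} / SS_{\text{tot}}$. The helper name `mse_and_r2` and the toy values are hypothetical. It also shows why $R^2$ can be negative, as for the RBF model above: the model does worse than always predicting the mean of the test targets.

```python
def mse_and_r2(y_true, y_pred):
    """Return (MSE, R^2) where R^2 = 1 - SS_res / SS_tot."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # spread around the mean
    return ss_res / n, 1.0 - ss_res / ss_tot

# Toy targets for illustration only
y_toy = [2.0, 4.0, 6.0, 8.0]

perfect = mse_and_r2(y_toy, [2.0, 4.0, 6.0, 8.0])    # exact predictions: R^2 is 1
mean_only = mse_and_r2(y_toy, [5.0, 5.0, 5.0, 5.0])  # always predict the mean: R^2 is 0
all_zero = mse_and_r2(y_toy, [0.0, 0.0, 0.0, 0.0])   # worse than the mean: R^2 is negative
```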


### 1.7 Discussion
- Discuss the pros and cons of Linear, Ridge, Lasso, and Kernel Regression models.
    - todo***

## Task 2: Classification Analysis (Predicting DEATH_EVENT)

### 2.1 Logistic Regression 
- Train a Logistic Regression model using clinical features to predict DEATH_EVENT
- Formulate

In [133]:
Y=df.iloc[:,-2]
Y.head()

0    1
1    1
2    1
3    1
4    1
Name: DEATH_EVENT, dtype: int64

In [134]:
# Split the dataset into training (70%) and testing (30%) sets
x_train, x_test, y_train, y_test = train_test_split(data, Y, test_size=0.3)

In [135]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(x_train, y_train) # training 
y_pred_LR = LR.predict(x_test) 
# For 0/1 labels and 0/1 predictions the MSE equals the misclassification rate, i.e. 1 - accuracy
print("MSE [LogisticRegression]: ", mean_squared_error(y_test, y_pred_LR), "\n")

MSE [LogisticRegression]:  0.2 



- Formulate: Write the logistic regression model and its loss function.
    - todo***
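
A possible formulation is sketched below, with $y_i \in \{0, 1\}$ denoting DEATH_EVENT. The notation is introduced here for illustration; scikit-learn's `LogisticRegression` additionally applies an L2 penalty by default, which is omitted for clarity.

$$
p_i = P(y_i = 1 \mid x_i) = \sigma(w^\top x_i + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
L(w, b) = -\sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
$$

$$
\nabla_w L = \sum_{i=1}^{n} (p_i - y_i)\, x_i, \qquad
\frac{\partial L}{\partial b} = \sum_{i=1}^{n} (p_i - y_i)
$$

Setting the gradient to zero has no closed-form solution, so the weights are found iteratively (for example by gradient descent or a quasi-Newton method). The predicted class is $\hat{y}_i = 1$ if $p_i \ge 0.5$ and $0$ otherwise.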

### 2.2 Classifier Comparison
- Train and compare the following classifiers:
    - Support Vector Machine (SVM)
        - Linear Kernel
        - RBF Kernel
        - Formulate: Write the SVM primal and dual problem. Derive the kernelized form.



In [136]:
from sklearn.svm import SVC

# Linear Kernel
SVC_L = SVC(kernel='linear')
SVC_L.fit(x_train, y_train) # training
y_pred_SVC_L = SVC_L.predict(x_test) 

In [137]:
#RBF Kernel
SVC_RBF = SVC(kernel='rbf', gamma='scale')
SVC_RBF.fit(x_train, y_train) # training
y_pred_SVC_RBF = SVC_RBF.predict(x_test) 

- Formulate: Write the SVM primal and dual problem. Derive the kernelized form.
    - todo***
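
A sketch of the soft-margin formulation follows. It assumes the labels are recoded as $y_i \in \{-1, +1\}$ and uses a feature map $\phi$ with kernel $k(x, x') = \phi(x)^\top \phi(x')$; these symbols are not taken from the assignment.

Primal problem:

$$
\min_{w,\, b,\, \xi} \; \frac{1}{2} \lVert w \rVert_2^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i \left( w^\top \phi(x_i) + b \right) \ge 1 - \xi_i, \;\; \xi_i \ge 0
$$

Introducing Lagrange multipliers $\alpha_i$ and eliminating $w$, $b$, and $\xi$ through the stationarity conditions $w = \sum_i \alpha_i y_i \phi(x_i)$ and $\sum_i \alpha_i y_i = 0$ gives the dual problem:

$$
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, k(x_i, x_j)
\quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0
$$

Kernelized decision function:

$$
\hat{f}(x) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i\, y_i\, k(x_i, x) + b \right)
$$

The linear kernel is $k(x, x') = x^\top x'$ and the RBF kernel is $k(x, x') = \exp(-\gamma \lVert x - x' \rVert_2^2)$. Only the support vectors ($\alpha_i > 0$) contribute to the sum.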

### 2.3 Evaluation
- Use the following metrics:
    - Accuracy
    - Precision
    - Recall

In [139]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

print("Accuracy:")
print("Accuracy [SVM Linear]: ", accuracy_score(y_test, y_pred_SVC_L))
print("Accuracy [SVM RBF]: ", accuracy_score(y_test, y_pred_SVC_RBF))

print("Precision:")
print("Precision [SVM Linear]: ", precision_score(y_test, y_pred_SVC_L, zero_division=0))
print("Precision [SVM RBF]: ", precision_score(y_test, y_pred_SVC_RBF, zero_division=0))

print("Recall:")
print("Recall [SVM Linear]: ", recall_score(y_test, y_pred_SVC_L, zero_division=0))
print("Recall [SVM RBF]: ", recall_score(y_test, y_pred_SVC_RBF, zero_division=0))


Accuracy:
Accuracy [SVM Linear]:  0.7666666666666667
Accuracy [SVM RBF]:  0.7
Precision:
Precision [SVM Linear]:  0.6363636363636364
Precision [SVM RBF]:  0.0
Recall:
Recall [SVM Linear]:  0.5185185185185185
Recall [SVM RBF]:  0.0
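
The three metrics can be recomputed from confusion-matrix counts, which helps interpret the output above. The counts below are a reconstruction inferred from the reported values under the assumption of 90 test samples with 27 deaths; they are not notebook output, and the helper name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0, mirroring
    zero_division=0 in the evaluation cell.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Inferred counts (assumption: 90 test samples, 27 of them with DEATH_EVENT = 1)
svm_linear = classification_metrics(tp=14, fp=8, fn=13, tn=55)
# Zero precision and recall at 0.7 accuracy is what a model that predicts 0 for every patient gives
svm_rbf = classification_metrics(tp=0, fp=0, fn=27, tn=63)
```

Under these assumptions the RBF model never predicts a death, so its 70% accuracy only reflects the share of survivors in the test set.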


### 2.4 Discussion
- Analyze the effectiveness of each classifier:
    - todo***
- Discuss which model might be most appropriate in a clinical setting:
    - todo***
