# Logistic Regression

## Load data

In [1231]:
import pandas as pd

df = pd.read_csv('data/heart.csv')

## Data analysis

In [1232]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [1233]:
df.head()

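| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |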


In [1234]:
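# Convert the categorical (object) columns to integer codes
def convert_to_numeric(df, column):
    """Replace df[column] in place with its category codes.

    Codes follow the sorted order of the labels, e.g. Down=0, Flat=1, Up=2.
    """
    df[column] = df[column].astype('category').cat.codes

columns = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']

for column in columns:
    convert_to_numeric(df, column)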

In [1235]:
df.head()

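| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 1 | 1 | 140 | 289 | 0 | 1 | 172 | 0 | 0.0 | 2 | 0 |
| 1 | 49 | 0 | 2 | 160 | 180 | 0 | 1 | 156 | 0 | 1.0 | 1 | 1 |
| 2 | 37 | 1 | 1 | 130 | 283 | 0 | 2 | 98 | 0 | 0.0 | 2 | 0 |
| 3 | 48 | 0 | 0 | 138 | 214 | 0 | 1 | 108 | 1 | 1.5 | 1 | 1 |
| 4 | 54 | 1 | 2 | 150 | 195 | 0 | 1 | 122 | 0 | 0.0 | 2 | 0 |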
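
To make the encoding easier to read, here is a minimal sketch of how the category codes are assigned. It uses a small toy series, not rows from heart.csv, and the variable names are only illustrative. The labels are sorted first, and each code is the position of a label in that sorted order.

```python
import pandas as pd

# Toy column with the three ST_Slope labels (illustrative, not rows from heart.csv)
st_slope = pd.Series(['Up', 'Flat', 'Down', 'Flat'])
as_category = st_slope.astype('category')

# Categories are stored in sorted order; a code is the index of the label in that order
code_to_label = dict(enumerate(as_category.cat.categories))
codes = as_category.cat.codes.tolist()

print(code_to_label)  # {0: 'Down', 1: 'Flat', 2: 'Up'}
print(codes)          # [2, 1, 0, 1]
```

The same lookup on the real columns (before conversion) would show which integer stands for which label in Sex, ChestPainType, RestingECG and ExerciseAngina.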


In [1236]:
df.corr()

| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1.0 | 0.05575 | -0.07715 | 0.254399 | -0.095282 | 0.198039 | -0.007484 | -0.382045 | 0.215793 | 0.258612 | -0.268264 | 0.282039 |
| Sex | 0.05575 | 1.0 | -0.126559 | 0.005133 | -0.200092 | 0.120076 | 0.071552 | -0.189186 | 0.190664 | 0.105734 | -0.150693 | 0.305445 |
| ChestPainType | -0.07715 | -0.126559 | 1.0 | -0.020647 | 0.06788 | -0.073151 | -0.072537 | 0.289123 | -0.354727 | -0.177377 | 0.213521 | -0.386828 |
| RestingBP | 0.254399 | 0.005133 | -0.020647 | 1.0 | 0.100893 | 0.070193 | 0.022656 | -0.112135 | 0.155101 | 0.164803 | -0.075162 | 0.107589 |
| Cholesterol | -0.095282 | -0.200092 | 0.06788 | 0.100893 | 1.0 | -0.260974 | -0.196544 | 0.235792 | -0.034166 | 0.050148 | 0.111471 | -0.232741 |
| FastingBS | 0.198039 | 0.120076 | -0.073151 | 0.070193 | -0.260974 | 1.0 | 0.08705 | -0.131438 | 0.060451 | 0.052698 | -0.175774 | 0.267291 |
| RestingECG | -0.007484 | 0.071552 | -0.072537 | 0.022656 | -0.196544 | 0.08705 | 1.0 | -0.179276 | 0.0775 | -0.020438 | -0.006778 | 0.057384 |
| MaxHR | -0.382045 | -0.189186 | 0.289123 | -0.112135 | 0.235792 | -0.131438 | -0.179276 | 1.0 | -0.370425 | -0.160691 | 0.343419 | -0.400421 |
| ExerciseAngina | 0.215793 | 0.190664 | -0.354727 | 0.155101 | -0.034166 | 0.060451 | 0.0775 | -0.370425 | 1.0 | 0.408752 | -0.428706 | 0.494282 |
| Oldpeak | 0.258612 | 0.105734 | -0.177377 | 0.164803 | 0.050148 | 0.052698 | -0.020438 | -0.160691 | 0.408752 | 1.0 | -0.501921 | 0.403951 |


## Feature selection

The target column is HeartDisease. The other columns are ranked below by the absolute value of their correlation with it.

In [1237]:
target = 'HeartDisease'

df.corr()[target].abs().sort_values(ascending=False)

HeartDisease      1.000000
ST_Slope          0.558771
ExerciseAngina    0.494282
Oldpeak           0.403951
MaxHR             0.400421
ChestPainType     0.386828
Sex               0.305445
Age               0.282039
FastingBS         0.267291
Cholesterol       0.232741
RestingBP         0.107589
RestingECG        0.057384
Name: HeartDisease, dtype: float64

As shown above, ST_Slope, ExerciseAngina, Oldpeak and MaxHR have the highest absolute correlation with the target column (all above 0.4).

- ST_Slope: the slope of the peak exercise ST segment [Up (2): upsloping, Flat (1): flat, Down (0): downsloping]

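ST_Slope is negatively correlated with the target column, which means that higher codes (flat, then upsloping) go with a lower chance of having heart disease. The codes follow the alphabetical order of the labels rather than a clinical scale, so the size of this coefficient should be read with some care.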

- ExerciseAngina: exercise-induced angina (1 = yes; 0 = no)

ExerciseAngina is positively correlated with the target column: patients with exercise-induced angina are more likely to have heart disease than patients without it.

- Oldpeak: ST depression induced by exercise relative to rest

Oldpeak is positively correlated with the target column, so the higher the ST depression induced by exercise relative to rest, the higher the chance of having heart disease.

- MaxHR: maximum heart rate achieved

MaxHR is negatively correlated with the target column, so the higher the maximum heart rate achieved, the lower the chance of having heart disease.
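
The selection rule used here (keep every feature whose absolute correlation with the target is at least 0.4) can be sketched as follows. The values are copied from the sorted output above; the names abs_corr, threshold and features are only illustrative.

```python
import pandas as pd

# Absolute correlations with HeartDisease, copied from the sorted output above
abs_corr = pd.Series({
    'ST_Slope': 0.558771,
    'ExerciseAngina': 0.494282,
    'Oldpeak': 0.403951,
    'MaxHR': 0.400421,
    'ChestPainType': 0.386828,
    'Sex': 0.305445,
    'Age': 0.282039,
    'FastingBS': 0.267291,
    'Cholesterol': 0.232741,
    'RestingBP': 0.107589,
    'RestingECG': 0.057384,
})

threshold = 0.4  # lower bound on |correlation| with the target
features = abs_corr[abs_corr >= threshold].index.tolist()

print(features)  # ['ST_Slope', 'ExerciseAngina', 'Oldpeak', 'MaxHR']
```

On the real frame, the same series could be obtained with df.corr()[target].abs().drop(target), so the feature list would not have to be typed by hand.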

## Model training

In [1238]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from LogisticRegressionCustom import LogisticRegression as LogisticRegressionCustom
from sklearn.metrics import accuracy_score

In [1239]:
y = df[target]

### On all features

In [1240]:
X = df.drop(target, axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [1241]:
model = LogisticRegression(random_state=0, max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
score = accuracy_score(y_test, y_pred)

print('Accuracy Score:', score)

Accuracy Score: 0.8695652173913043


In [1242]:
# The custom model indexes its inputs by position, so pass NumPy arrays
# instead of pandas objects (this avoids "KeyError: 0" inside fit)
y_train = y_train.to_numpy()
X_train = X_train.to_numpy()

model = LogisticRegressionCustom()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
score = accuracy_score(y_test, y_pred)

print('Accuracy Score:', score)

  return 1 / (1 + np.exp(-y))


Accuracy Score: 0.7210144927536232
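
The line "return 1 / (1 + np.exp(-y))" printed before the score looks like the tail of a NumPy warning raised inside the custom sigmoid. The first line of that warning is not shown, so the cause is an assumption here: np.exp overflows when its argument is a large positive number, which can happen with unscaled inputs. A minimal sketch of a sigmoid that avoids this, with the hypothetical name stable_sigmoid (the code of LogisticRegressionCustom itself is not shown in this notebook):

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never calls np.exp on a large positive number."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)

    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so it cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)), where exp(z) < 1
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

values = stable_sigmoid([-1000.0, 0.0, 1000.0])
print(values)
```

Both branches compute the same function; they only differ in which side of zero they are safe on.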


### On the selected feature subset

In [1243]:
# Columns whose absolute correlation with the target column is between 0.4 and 0.8
features = ['ST_Slope', 'ExerciseAngina', 'Oldpeak', 'MaxHR']
X = df[features]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [1244]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
score = accuracy_score(y_test, y_pred)

print('Accuracy Score:', score)

Accuracy Score: 0.7789855072463768


In [1245]:
# The custom model indexes its inputs by position, so pass NumPy arrays
# instead of pandas objects (this avoids "KeyError: 0" inside fit)
y_train = y_train.to_numpy()
X_train = X_train.to_numpy()

model = LogisticRegressionCustom()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
score = accuracy_score(y_test, y_pred)

print('Accuracy Score:', score)

Accuracy Score: 0.6521739130434783
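
One possible contributor to the gap between the two models, not verified in this notebook, is that the columns live on very different scales (for example MaxHR in the hundreds next to 0/1 flags), which tends to slow down or destabilize plain gradient descent. A minimal sketch of standardizing the inputs with StandardScaler follows. The arrays are synthetic stand-ins, not the heart data, and the numbers used to generate them are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins: one wide-range column and one 0/1 column (not heart.csv)
X_train = np.column_stack([rng.normal(130, 18, 300), rng.integers(0, 2, 300)]).astype(float)
X_test = np.column_stack([rng.normal(130, 18, 100), rng.integers(0, 2, 100)]).astype(float)

scaler = StandardScaler()
# Learn mean and standard deviation from the training split only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics so no information leaks from the test split
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

The scaled arrays could then be passed to either model in place of the raw ones. Whether this closes the accuracy gap would have to be checked on the real data.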


## Conclusion

Only one column (ST_Slope) had a correlation coefficient between 0.5 and 0.8 with the target column HeartDisease, so I lowered the threshold to 0.4.
The selected features are ST_Slope (slope of the peak exercise ST segment), ExerciseAngina (exercise-induced angina), Oldpeak (ST depression induced by exercise relative to rest) and MaxHR (maximum heart rate achieved).

The accuracy of the models trained on those 4 feature columns is lower than that of the models trained on the initial data (0.78 vs 0.87 for sklearn, 0.65 vs 0.72 for the custom model). Thus it can be concluded that other features in the initial dataset contribute to predicting the target column and that omitting them results in lower accuracy. Hence, it is important to consider all relevant features and not rely on correlation coefficients alone for feature selection.
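
The point about correlation-based selection can be illustrated on synthetic data. In the sketch below (made-up data, not heart.csv), both features correlate with the label well below the 0.4 threshold used above, yet together they separate the classes almost perfectly, because the label depends on their difference.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)   # x2 is almost a copy of x1
y = (x2 - x1 > 0).astype(int)        # the label depends only on the difference

# Individual correlations with the label are weak
corr_x1 = abs(np.corrcoef(x1, y)[0, 1])
corr_x2 = abs(np.corrcoef(x2, y)[0, 1])

X = np.column_stack([x1, x2])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

def holdout_accuracy(cols):
    """Fit on the given column indices and score on the held-out split."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:, cols], y_train)
    return accuracy_score(y_test, model.predict(X_test[:, cols]))

acc_x2_only = holdout_accuracy([1])
acc_both = holdout_accuracy([0, 1])

print('|corr| x1, x2:', round(corr_x1, 3), round(corr_x2, 3))
print('accuracy x2 only / both:', round(acc_x2_only, 3), round(acc_both, 3))
```

A threshold on pairwise correlation would discard both columns here, which is the failure mode described above in an exaggerated form.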

Also, the implementation of Logistic Regression from scratch is less accurate than the model from sklearn in both experiments (0.72 vs 0.87 on all features, 0.65 vs 0.78 on the 4 selected features).
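
For reference, a from-scratch logistic regression can be as small as the sketch below. This is a hypothetical minimal version trained by batch gradient descent, not the code of LogisticRegressionCustom (which is not shown in this notebook). The class name, learning rate and iteration count are assumptions chosen for illustration.

```python
import numpy as np

class TinyLogisticRegression:
    """Hypothetical minimal logistic regression (batch gradient descent)."""

    def __init__(self, lr=0.1, n_iter=1000):
        self.lr = lr          # step size of each gradient update
        self.n_iter = n_iter  # number of full passes over the data

    @staticmethod
    def _sigmoid(z):
        # Clip the argument so np.exp stays within floating-point range
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        n_samples, n_features = X.shape
        self.weights_ = np.zeros(n_features)
        self.bias_ = 0.0
        for _ in range(self.n_iter):
            error = self._sigmoid(X @ self.weights_ + self.bias_) - y
            # Gradient of the mean log loss with respect to weights and bias
            self.weights_ -= self.lr * (X.T @ error) / n_samples
            self.bias_ -= self.lr * error.mean()
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return (self._sigmoid(X @ self.weights_ + self.bias_) >= 0.5).astype(int)

# Usage on synthetic, linearly separable data (not heart.csv)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

demo_model = TinyLogisticRegression().fit(X_demo, y_demo)
demo_accuracy = (demo_model.predict(X_demo) == y_demo).mean()
print('Training accuracy on the toy data:', demo_accuracy)
```

Unlike sklearn's LogisticRegression, this sketch has no regularization, no convergence check and a fixed step size, which are the kinds of differences that can explain an accuracy gap on unscaled data.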