# Logistic Regression

Logistic regression is a widely used statistical model for classification. It is a linear model that predicts the probability of a binary outcome by passing a linear combination of the features through the logistic (sigmoid) function, which outputs values between 0 and 1. The model's coefficients are fit by maximum likelihood estimation.
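
As a minimal sketch of the logistic function described above, the snippet below computes a probability from a linear combination of features. The names `sigmoid` and `linear_probability`, and the example weights, are illustrative assumptions and are not part of this notebook or of scikit-learn.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps a real number to a value between 0 and 1."""
    # Branch on the sign of z so that exp() never receives a large positive argument
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def linear_probability(weights, bias, features):
    """P(y = 1 | x) = sigmoid(w . x + b) for a single observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for two features, for illustration only
p = linear_probability(weights=[0.5, -0.25], bias=0.0, features=[2.0, 4.0])
print(p)  # z = 0.5*2 - 0.25*4 = 0, so p = 0.5
```

Training a logistic regression model amounts to choosing the weights and bias that make the observed labels most likely under these probabilities.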

## Importing Modules

In [1]:
# loading dataset
import numpy as np
import pandas as pd
# visualization
import matplotlib.pyplot as plt
import seaborn as sns
# EDA
from ydata_profiling import ProfileReport
# splitting data
from sklearn.model_selection import train_test_split
# modeling data
from sklearn.linear_model import LogisticRegression
# evaluating model
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report



## Importing Dataset

In [2]:
data = pd.read_csv('data/framingham.csv')

## Objective

The objective is to use logistic regression to predict whether an individual will develop **CHD (Coronary Heart Disease)** within ten years. The predictors are the other features in the dataset, such as age, smoking habits, blood pressure, and BMI, and the target is the `TenYearCHD` column.

## Exploratory Data Analysis

In [3]:
# Print the first 5 rows of the dataframe
data.head()

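| | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |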


In [4]:
data.shape # shape of the data (rows, columns)

(4238, 16)

In [5]:
data.info() # quick info about the data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             4238 non-null   int64  
 1   age              4238 non-null   int64  
 2   education        4133 non-null   float64
 3   currentSmoker    4238 non-null   int64  
 4   cigsPerDay       4209 non-null   float64
 5   BPMeds           4185 non-null   float64
 6   prevalentStroke  4238 non-null   int64  
 7   prevalentHyp     4238 non-null   int64  
 8   diabetes         4238 non-null   int64  
 9   totChol          4188 non-null   float64
 10  sysBP            4238 non-null   float64
 11  diaBP            4238 non-null   float64
 12  BMI              4219 non-null   float64
 13  heartRate        4237 non-null   float64
 14  glucose          3850 non-null   float64
 15  TenYearCHD       4238 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 529.9 KB


In [6]:
data.describe() # summary statistics

| | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4238.0 | 4238.0 | 4133.0 | 4238.0 | 4209.0 | 4185.0 | 4238.0 | 4238.0 | 4238.0 | 4188.0 | 4238.0 | 4238.0 | 4219.0 | 4237.0 | 3850.0 | 4238.0 |
| mean | 0.429212 | 49.584946 | 1.97895 | 0.494101 | 9.003089 | 0.02963 | 0.005899 | 0.310524 | 0.02572 | 236.721585 | 132.352407 | 82.893464 | 25.802008 | 75.878924 | 81.966753 | 0.151958 |
| std | 0.495022 | 8.57216 | 1.019791 | 0.500024 | 11.920094 | 0.169584 | 0.076587 | 0.462763 | 0.158316 | 44.590334 | 22.038097 | 11.91085 | 4.080111 | 12.026596 | 23.959998 | 0.359023 |
| min | 0.0 | 32.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 107.0 | 83.5 | 48.0 | 15.54 | 44.0 | 40.0 | 0.0 |
| 25% | 0.0 | 42.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 206.0 | 117.0 | 75.0 | 23.07 | 68.0 | 71.0 | 0.0 |
| 50% | 0.0 | 49.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 234.0 | 128.0 | 82.0 | 25.4 | 75.0 | 78.0 | 0.0 |
| 75% | 1.0 | 56.0 | 3.0 | 1.0 | 20.0 | 0.0 | 0.0 | 1.0 | 0.0 | 263.0 | 144.0 | 89.875 | 28.04 | 83.0 | 87.0 | 0.0 |
| max | 1.0 | 70.0 | 4.0 | 1.0 | 70.0 | 1.0 | 1.0 | 1.0 | 1.0 | 696.0 | 295.0 | 142.5 | 56.8 | 143.0 | 394.0 | 1.0 |


In [7]:
profile = ProfileReport(data, title='Report on the dataset', explorative=True) # EDA report using ydata_profiling

In [8]:
profile.to_notebook_iframe() # display the EDA report


In [9]:
profile.to_file("data/report.html") # save the EDA report


## Data Preprocessing

In [10]:
# Check for missing values
data.isnull().sum()

male                 0
age                  0
education          105
currentSmoker        0
cigsPerDay          29
BPMeds              53
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             50
sysBP                0
diaBP                0
BMI                 19
heartRate            1
glucose            388
TenYearCHD           0
dtype: int64

In [11]:
data.mean() # mean of each column

male                 0.429212
age                 49.584946
education            1.978950
currentSmoker        0.494101
cigsPerDay           9.003089
BPMeds               0.029630
prevalentStroke      0.005899
prevalentHyp         0.310524
diabetes             0.025720
totChol            236.721585
sysBP              132.352407
diaBP               82.893464
BMI                 25.802008
heartRate           75.878924
glucose             81.966753
TenYearCHD           0.151958
dtype: float64

In [12]:
# Fill missing values with each column's mean
# (note: categorical columns such as education and BPMeds receive fractional values)
data = data.fillna(data.mean())

In [13]:
# Confirm that no missing values remain
data.isnull().sum()

male               0
age                0
education          0
currentSmoker      0
cigsPerDay         0
BPMeds             0
prevalentStroke    0
prevalentHyp       0
diabetes           0
totChol            0
sysBP              0
diaBP              0
BMI                0
heartRate          0
glucose            0
TenYearCHD         0
dtype: int64
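
The imputation above computes column means on the full dataset before the train/test split, so test rows influence the values filled into the training rows. One possible alternative, sketched here on synthetic data rather than the Framingham file, is to place a `SimpleImputer` inside a scikit-learn pipeline so the means are learned from the training rows only. The variable names (`demo`, `pipe`, and so on) and the generated values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (NOT the Framingham dataset): two features, one with gaps
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "glucose": rng.normal(82, 24, size=200),
    "BMI": rng.normal(26, 4, size=200),
})
demo.loc[rng.choice(200, size=20, replace=False), "glucose"] = np.nan
demo_y = rng.integers(0, 2, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo, demo_y, test_size=0.2, random_state=42
)

# The imputer learns its column means from the training rows only,
# then reuses those means when transforming the test rows.
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    LogisticRegression(max_iter=10000),
)
pipe.fit(Xd_train, yd_train)
print(pipe.predict(Xd_test).shape)  # one prediction per test row
```

For categorical columns, `SimpleImputer(strategy="most_frequent")` is an option that avoids fractional category values.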

In [14]:
# Check for duplicates
data.duplicated().sum()

np.int64(0)

In [15]:
# Drop duplicates if any
data = data.drop_duplicates()

## Splitting the Dataset

In [16]:
data.columns

Index(['male', 'age', 'education', 'currentSmoker', 'cigsPerDay', 'BPMeds',
       'prevalentStroke', 'prevalentHyp', 'diabetes', 'totChol', 'sysBP',
       'diaBP', 'BMI', 'heartRate', 'glucose', 'TenYearCHD'],
      dtype='object')

In [17]:
data.head()

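| | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |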


In [18]:
X = data.drop(['TenYearCHD'], axis=1)
y = data['TenYearCHD']

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
len(X_train), len(X_test), len(y_train), len(y_test)

(3390, 848, 3390, 848)
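
Only about 15% of rows have `TenYearCHD = 1` (the column mean is 0.151958), so a purely random split can leave the two subsets with slightly different positive rates. A minimal sketch of a stratified split, using synthetic labels rather than the notebook's data, is shown below. The array names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the roughly 15% positive rate seen in TenYearCHD
labels = np.array([1] * 150 + [0] * 850)
features = np.arange(1000).reshape(-1, 1)  # placeholder feature column

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratify=labels, both subsets keep the original class proportion
print(l_train.mean(), l_test.mean())
```

In this notebook the equivalent change would be passing `stratify=y` to the existing `train_test_split` call, which would alter the reported metrics.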

## Model Building

In [20]:
model = LogisticRegression(max_iter=10000, random_state=42)
model

In [21]:
model.fit(X_train, y_train)

In [22]:
y_pred = model.predict(X_test)
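
For a binary `LogisticRegression`, `predict` labels a row as class 1 when its estimated probability exceeds 0.5. The sketch below, on synthetic data rather than the Framingham file, shows how `predict_proba` exposes those probabilities and how a different cutoff could be applied. The 0.3 cutoff and all variable names are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data (NOT the Framingham dataset), about 15% positives
Xs, ys = make_classification(
    n_samples=1000, n_features=5, weights=[0.85, 0.15], random_state=42
)
clf = LogisticRegression(max_iter=10000).fit(Xs, ys)

# Column 1 of predict_proba holds the estimated probability of class 1
proba = clf.predict_proba(Xs)[:, 1]

# 0.5 mirrors the default behaviour; 0.3 is an arbitrary illustrative cutoff
pred_default = (proba >= 0.5).astype(int)
pred_lower = (proba >= 0.3).astype(int)

# Lowering the cutoff can only add predicted positives, never remove them
print(pred_default.sum(), pred_lower.sum())
```

A lower cutoff trades precision for recall, so the value would need to be chosen on validation data rather than the test set.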

## Model Evaluation

In [23]:
accuracy = accuracy_score(y_test, y_pred) # accuracy of the model
precision = precision_score(y_test, y_pred) # precision of the model
recall = recall_score(y_test, y_pred) # recall of the model
conf_matrix = confusion_matrix(y_test, y_pred) # confusion matrix
class_report = classification_report(y_test, y_pred) # classification report

In [24]:
print("Model Evaluation Metrics:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

Model Evaluation Metrics:
Accuracy: 0.86
Precision: 0.60
Recall: 0.07
Confusion Matrix:
[[718   6]
 [115   9]]
Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.99      0.92       724
           1       0.60      0.07      0.13       124

    accuracy                           0.86       848
   macro avg       0.73      0.53      0.53       848
weighted avg       0.82      0.86      0.81       848
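
The headline metrics can be checked by hand from the confusion matrix above. The sketch below repeats that arithmetic and adds the accuracy of a trivial baseline that always predicts class 0. The variable names are illustrative.

```python
# Confusion matrix reported above: rows are true classes, columns are predictions
tn, fp = 718, 6
fn, tp = 115, 9

total = tn + fp + fn + tp
acc = (tp + tn) / total        # share of all predictions that are correct
prec = tp / (tp + fp)          # of predicted CHD cases, how many were real
rec = tp / (tp + fn)           # of real CHD cases, how many were found
baseline = (tn + fp) / total   # accuracy of always predicting class 0

print(f"{acc:.3f} {prec:.3f} {rec:.3f} {baseline:.3f}")
# prints: 0.857 0.600 0.073 0.854
```

The model's accuracy is less than half a percentage point above the always-negative baseline, which is why recall for class 1 is the more informative number here.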



## Conclusion
Logistic regression reached 86% accuracy on the test set, but that figure is misleading on its own. Only 124 of the 848 test individuals (about 15%) developed CHD, so a model that always predicts class 0 would already score about 85%. The high recall of 0.99 for class 0 shows the model is good at identifying individuals without coronary heart disease (CHD). It struggles with the cases where CHD occurs within ten years: recall for class 1 is 0.07, meaning the model found only 9 of the 124 positive cases and missed the other 115. The imbalance between the classes in the dataset is a likely cause.

Overall, the model predicts non-CHD cases well but misses most CHD cases. Addressing the class imbalance (e.g., class weighting, resampling, or a lower decision threshold) or trying a different model could improve its ability to detect individuals at risk.
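
As one hedged illustration of class balancing, scikit-learn's `LogisticRegression` accepts `class_weight="balanced"`, which weights samples inversely to class frequency. The sketch below compares it with the unweighted model on synthetic data, not the Framingham file, so the numbers it prints say nothing about how this notebook's model would change. All variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (NOT the Framingham dataset), about 15% positives
Xb, yb = make_classification(
    n_samples=2000, n_features=8, weights=[0.85, 0.15],
    flip_y=0.05, random_state=42,
)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    Xb, yb, test_size=0.2, random_state=42, stratify=yb
)

plain = LogisticRegression(max_iter=10000).fit(Xb_train, yb_train)
# class_weight="balanced" reweights samples inversely to class frequency
weighted = LogisticRegression(max_iter=10000, class_weight="balanced").fit(
    Xb_train, yb_train
)

print("recall, unweighted:", recall_score(yb_test, plain.predict(Xb_test)))
print("recall, balanced:  ", recall_score(yb_test, weighted.predict(Xb_test)))
```

Class weighting typically raises recall for the minority class at the cost of precision, so both metrics should be reviewed together.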
