
# Logistic Regression in scikit-learn - Lab

## Introduction

In this lab, you will fit a logistic regression model to a dataset on heart disease. The column labeled `'target'` indicates whether a patient has heart disease: 1 means the patient has heart disease, and 0 means the patient does not.

## Objectives

In this lab you will:

- Fit a logistic regression model using scikit-learn


## Let's get started!

Run the following cells that import the necessary functions and import the dataset:

In [12]:
# Import necessary functions
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

In [2]:
# Import data
df = pd.read_csv('heart.csv')
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [5]:
df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

## Define appropriate `X` and `y`

Recall that the column labeled `'target'` records whether or not a patient has heart disease. With that in mind, define appropriate `X` (predictors) and `y` (target) to model whether or not a patient has heart disease.

In [26]:
# Split the data into target and predictors
y = df["target"].astype(int)
X = df.drop("target",axis=1)
df[:4]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1


In [49]:
df.shape

(303, 14)

In [11]:
df.isna().sum()

Unnamed: 0,0
age,0
sex,0
cp,0
trestbps,0
chol,0
fbs,0
restecg,0
thalach,0
exang,0
oldpeak,0


## Train-test split

- Split the data into training and test sets
- Assign 25% to the test set
- Set the `random_state` to 0

N.B. To avoid possible data leakage, it is best to split the data first, and then normalize.

In [23]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
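
As a quick check on the split sizes, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are hypothetical stand-ins with the same number of rows as the lab's data (303), not the heart disease data itself. It illustrates that a 25% test split of 303 rows gives 227 training rows and 76 test rows, and that leaving `test_size` unset falls back to the same 25% default.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the lab's data (303)
X_demo = np.arange(303 * 2).reshape(303, 2)
y_demo = np.arange(303) % 2

# Explicit 25% test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Leaving test_size unset uses the 25% default, so the split is identical
X_tr_default, X_te_default, _, _ = train_test_split(X_demo, y_demo, random_state=0)

print(X_tr.shape, X_te.shape)              # (227, 2) (76, 2)
print(np.array_equal(X_te, X_te_default))  # True
```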

## Normalize the data

Normalize the data (`X`) prior to fitting the model.

In [32]:
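# Your code here
# Standardize each predictor to zero mean and unit variance
scaler = StandardScaler()
cols = X.columns  # keep the column names
X_scaled = scaler.fit_transform(X)  # note: fit on all of X, not only X_train
X_scaled = pd.DataFrame(X_scaled, columns=cols)  # back to a labeled DataFrame
X_scaled.head()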

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,0.952197,0.681005,1.973123,0.763956,-0.256334,2.394438,-1.005832,0.015443,-0.696631,1.087338,-2.274579,-0.714429,-2.148873
1,-1.915313,0.681005,1.002577,-0.092738,0.072199,-0.417635,0.898962,1.633471,-0.696631,2.122573,-2.274579,-0.714429,-0.512922
2,-1.474158,-1.468418,0.032031,-0.092738,-0.816773,-0.417635,-1.005832,0.977514,-0.696631,0.310912,0.976352,-0.714429,-0.512922
3,0.180175,0.681005,0.032031,-0.663867,-0.198357,-0.417635,0.898962,1.239897,-0.696631,-0.206705,0.976352,-0.714429,-0.512922
4,0.290464,-1.468418,-0.938515,-0.663867,2.08205,-0.417635,0.898962,0.583939,1.435481,-0.379244,0.976352,-0.714429,-0.512922
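
The note in the train-test split section recommends splitting first and normalizing afterwards. A minimal sketch of that order follows, using synthetic data: `X_demo`, `y_demo`, and the two column names are illustrative assumptions, not the lab's data. The scaler learns its means and standard deviations from the training rows only and then applies the same transformation to the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "age": rng.normal(54, 9, 200),
    "chol": rng.normal(246, 50, 200),
})
y_demo = rng.integers(0, 2, 200)

# 1. Split first
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# 2. Fit the scaler on the training rows only
scaler = StandardScaler()
X_tr_scaled = pd.DataFrame(
    scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)

# 3. Reuse the training statistics to transform the test rows
X_te_scaled = pd.DataFrame(
    scaler.transform(X_te), columns=X_te.columns, index=X_te.index
)

# Training columns are centered at 0 with unit variance;
# test columns are only approximately so, which is expected.
print(X_tr_scaled.mean().round(6))
print(X_te_scaled.mean().round(3))
```

Under this order, the scaled training and test frames (here `X_tr_scaled` and `X_te_scaled`) would be the inputs passed to `fit` and `predict`.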


## Fit a model

- Instantiate `LogisticRegression`
  - Make sure you don't include the intercept  
  - set `C` to a very large number such as `1e12`
  - Use the `'liblinear'` solver
- Fit the model to the training data

In [39]:
# Instantiate the model
logreg = LogisticRegression(
    fit_intercept=False,  # no intercept
    C=1e12,  # very large C, so regularization is negligible
    solver="liblinear"  # solver that is suitable for small datasets
)


# Fit the model
logreg.fit(X_train, y_train)

#print("Training accuracy:", logreg.score(X_train,y_train) )
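
To illustrate why `C` is set to a very large number: in scikit-learn, `C` is the inverse of the regularization strength, so a value such as `1e12` leaves the coefficients essentially unpenalized. The sketch below is an illustration on synthetic data, not the lab's data. The names `X_demo`, `y_demo`, `true_w`, and `fit_coef_norm` are hypothetical. It compares the size of the fitted coefficients under weak and strong regularization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([1.5, -1.0, 0.5])
p = 1.0 / (1.0 + np.exp(-(X_demo @ true_w)))
y_demo = (rng.random(200) < p).astype(int)

def fit_coef_norm(C):
    """Fit a no-intercept liblinear model and return it with its coefficient norm."""
    model = LogisticRegression(fit_intercept=False, C=C, solver="liblinear")
    model.fit(X_demo, y_demo)
    return model, float(np.linalg.norm(model.coef_))

weak_model, weak_norm = fit_coef_norm(1e12)     # almost no regularization
strong_model, strong_norm = fit_coef_norm(0.01)  # strong regularization

# Stronger regularization shrinks the coefficients toward zero
print("norm with C=1e12:", weak_norm)
print("norm with C=0.01:", strong_norm)
```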

## Predict
Generate predictions for the training and test sets.

In [46]:
# Generate predictions
y_hat_train = logreg.predict(X_train)
y_hat_test = logreg.predict(X_test)


## How many times was the classifier correct on the training set?

In [47]:
# Your code here
correct_train = (y_hat_train == y_train).sum()
print("Number of correct predictions on training_set:", correct_train)

Number of correct predictions on training_set: 194


## How many times was the classifier correct on the test set?

In [48]:
# Your code here
correct_test = (y_hat_test == y_test).sum()
print("Number of correct predictions on testing_set:", correct_test)

Number of correct predictions on testing_set: 63
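
The counts of correct predictions can be turned into accuracies by dividing by the number of observations in each set. The sketch below assumes 227 training rows and 76 test rows, which is what a 25% test split of the 303 rows reported by `df.shape` produces. The small arrays `y_true` and `y_pred` are made up, and only show that the manual proportion matches scikit-learn's `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Set sizes assumed from a 25% test split of 303 rows
n_train, n_test = 227, 76

# Correct-prediction counts reported in the outputs above
correct_train, correct_test = 194, 63

train_acc = correct_train / n_train
test_acc = correct_test / n_test
print(f"Training accuracy: {train_acc:.3f}")  # Training accuracy: 0.855
print(f"Test accuracy: {test_acc:.3f}")       # Test accuracy: 0.829

# The same proportion computed from label arrays (made-up labels)
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])
manual = (y_true == y_pred).mean()
print(manual, accuracy_score(y_true, y_pred))  # 0.8 0.8
```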


## Analysis
Describe how well you think this initial model is performing based on the training and test performance. Within your description, make note of how you evaluated performance as compared to your previous work with regression.

In [None]:
# Your analysis here

The initial logistic regression model performs reasonably well: it classifies 194 of 227 training observations correctly (about 85%) and 63 of 76 test observations correctly (about 83%). The small gap between training and test accuracy suggests that the model generalizes well and is not noticeably overfitting. Unlike regression models, where performance is measured with metrics such as RMSE or R², this classification model was evaluated using accuracy, the proportion of correct predictions. Overall, the model shows good predictive capability for identifying patients with heart disease, though further improvements could be explored through feature analysis or alternative classifiers.

## Summary

In this lab, you practiced a standard data science pipeline: importing data, splitting it into training and test sets, and fitting a logistic regression model. In the upcoming labs and lessons, you'll continue to investigate how to analyze and tune these models for various scenarios.