# Logistic Regression in scikit-learn - Lab

## Introduction 

In this lab, you are going to fit a logistic regression model to a dataset concerning heart disease. Whether or not a patient has heart disease is indicated in the column labeled `'target'`: 1 means the patient has heart disease, and 0 means the patient does not.

## Objectives

In this lab you will: 

- Fit a logistic regression model using scikit-learn 


## Let's get started!

Run the following cells that import the necessary functions and import the dataset: 

In [1]:
# Import necessary functions
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [2]:
# Import data
df = pd.read_csv('heart.csv')
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


## Define appropriate `X` and `y` 

Recall that the column labeled `'target'` indicates whether or not a patient has heart disease. With that in mind, define appropriate `X` (predictors) and `y` (target) in order to model whether or not a patient has heart disease.

In [3]:
# Split the data into target and predictors
y = df['target']
X = df.drop('target', axis=1)

## Train-test split 

- Split the data into training and test sets 
- Assign 25% to the test set 
- Set the `random_state` to 0 

N.B. To avoid possible data leakage, it is best to split the data first, and then normalize.

In [4]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
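
As a rough check of what `test_size=0.25` does, the sketch below splits a made-up frame with 303 rows, the total implied by the 227 training rows and 76 test rows reported later in this lab. The names `X_demo` and `y_demo` and all values are illustrative assumptions, not the heart data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the heart data: 303 rows, 3 random predictors
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(303, 3)), columns=['a', 'b', 'c'])
y_demo = pd.Series(rng.integers(0, 2, size=303), name='target')

# 25% of 303 is 75.75, which scikit-learn rounds up to 76 test rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(X_tr.shape, X_te.shape)

# Reusing the same random_state reproduces the same split
X_tr_again, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(X_tr.index.equals(X_tr_again.index))
```

Fixing `random_state` is what makes the split, and therefore the accuracy figures computed from it, repeatable from run to run.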

## Normalize the data 

Normalize the data (`X`) prior to fitting the model. 

In [5]:
# Your code here
# Min-max scale X_train and X_test, each with its own min and max,
# so that no test-set statistics reach the training data
X_train = (X_train - X_train.min()) / (X_train.max() - X_train.min())
X_test = (X_test - X_test.min()) / (X_test.max() - X_test.min())
print(X_train.head())

          age  sex        cp  trestbps      chol  fbs  restecg   thalach  \
173  0.604167  1.0  0.666667  0.387755  0.214781  0.0      0.0  0.778626   
261  0.479167  1.0  0.000000  0.183673  0.228637  0.0      0.5  0.679389   
37   0.520833  1.0  0.666667  0.571429  0.233256  0.0      0.0  0.717557   
101  0.625000  1.0  1.000000  0.857143  0.321016  0.0      0.0  0.564885   
166  0.791667  1.0  0.000000  0.265306  0.226328  0.0      0.0  0.442748   

     exang   oldpeak  slope    ca      thal  
173    0.0  0.516129    1.0  0.50  1.000000  
261    0.0  0.000000    1.0  0.25  0.666667  
37     0.0  0.258065    1.0  0.00  1.000000  
101    0.0  0.677419    0.0  0.00  1.000000  
166    1.0  0.419355    0.5  0.50  1.000000  
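
The cell above scales each split with its own minimum and maximum. A common alternative, sketched here under the assumption that scikit-learn's `MinMaxScaler` is acceptable for this lab, learns the minimum and maximum from the training rows only and reuses them for the test rows, so both splits share one scale. The tiny `train` and `test` frames are made up for illustration and are not taken from `heart.csv`.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data (not from heart.csv)
train = pd.DataFrame({'age': [30, 50, 70], 'chol': [200, 250, 300]})
test = pd.DataFrame({'age': [40, 80], 'chol': [225, 180]})

scaler = MinMaxScaler()

# Learn min and max from the training rows only, then scale them
train_scaled = pd.DataFrame(
    scaler.fit_transform(train), columns=train.columns, index=train.index
)

# Reuse the training min and max for the test rows
test_scaled = pd.DataFrame(
    scaler.transform(test), columns=test.columns, index=test.index
)

print(train_scaled)
print(test_scaled)
```

With this approach, test values can fall outside the 0 to 1 range (here an age of 80 maps to 1.25), which is expected whenever a test row lies beyond the training minimum or maximum.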


## Fit a model

- Instantiate `LogisticRegression`
  - Do not fit an intercept (`fit_intercept=False`)
  - Set `C` to a very large number such as `1e12`, which makes the regularization penalty negligible
  - Use the `'liblinear'` solver 
- Fit the model to the training data 

In [6]:
# Instantiate the model
logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')

# Fit the model
logreg.fit(X_train, y_train)

LogisticRegression(C=1000000000000.0, fit_intercept=False, solver='liblinear')
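
To see what a very large `C` means in practice, the following sketch fits the same model twice on synthetic data from `make_classification` (an assumption made for illustration, not the heart data) and compares coefficient sizes. In scikit-learn, `C` is the inverse of the regularization strength, so `C=1e12` leaves the coefficients nearly unpenalized, while a small `C` shrinks them toward zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, made-up data standing in for the normalized predictors
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0,
    flip_y=0.1, random_state=0
)

# Same settings as the lab, plus a heavily regularized model for contrast
weak_penalty = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')
strong_penalty = LogisticRegression(fit_intercept=False, C=0.01, solver='liblinear')
weak_penalty.fit(X_demo, y_demo)
strong_penalty.fit(X_demo, y_demo)

# Smaller C means a stronger penalty and a smaller coefficient norm
print(np.linalg.norm(weak_penalty.coef_), np.linalg.norm(strong_penalty.coef_))

# With fit_intercept=False the intercept is fixed at zero
print(weak_penalty.intercept_)
```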

## Predict
Generate predictions for the training and test sets. 

In [7]:
# Generate predictions
y_hat_train = logreg.predict(X_train)
y_hat_test = logreg.predict(X_test)
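
For a binary problem, `predict` returns class labels, while `predict_proba` returns the estimated probability of each class. The sketch below, which again uses synthetic data rather than the heart dataset, illustrates that the predicted label is 1 whenever the estimated probability of class 1 exceeds 0.5. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, made-up data (not the heart dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0,
    flip_y=0.1, random_state=0
)
model = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')
model.fit(X_demo, y_demo)

# One row per sample, one column per class in the order of model.classes_
proba = model.predict_proba(X_demo)
labels = model.predict(X_demo)

# Thresholding the class-1 probability at 0.5 reproduces predict()
manual = (proba[:, 1] > 0.5).astype(int)
print(model.classes_)
print(np.array_equal(labels, manual))
```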

## How many times was the classifier correct on the training set?

In [8]:
# Your code here
train_correct = (y_hat_train == y_train).sum()
print("Training set correct predictions:", train_correct, f"out of {len(y_train)}")
print("Training accuracy:", train_correct / len(y_train))

Training set correct predictions: 194 out of 227
Training accuracy: 0.8546255506607929


## How many times was the classifier correct on the test set?

In [9]:
# Your code here
test_correct = (y_hat_test == y_test).sum()
print("Test set correct predictions:", test_correct, f"out of {len(y_test)}")
print("Test accuracy:", test_correct / len(y_test))

Test set correct predictions: 64 out of 76
Test accuracy: 0.8421052631578947
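
The counts above can also be obtained with scikit-learn's metric helpers. The following sketch uses a small hand-made pair of label arrays (hypothetical values, not the lab's predictions) to show that `accuracy_score` matches the manual count, and that `confusion_matrix` additionally breaks the errors down by class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for eight patients
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Manual count, as in the cells above
n_correct = (y_true == y_pred).sum()
print(n_correct, n_correct / len(y_true))   # 6 0.75

# Same accuracy from scikit-learn
print(accuracy_score(y_true, y_pred))       # 0.75

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
print(confusion_matrix(y_true, y_pred))
# [[3 1]
#  [1 3]]
```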


## Analysis
Describe how well you think this initial model is performing based on the training and test performance. Within your description, make note of how you evaluated performance as compared to your previous work with regression.

In [None]:
# Your analysis here
"""The logistic regression model performs well, achieving ~85.5% accuracy on the training set 
(194/227 correct) and ~84.2% on the test set (64/76 correct). 
The gap between training and test accuracy is just over one percentage point, which suggests little overfitting and good generalization. 
Unlike regression tasks (e.g., polynomial regression or Ames Housing), where performance was evaluated using Mean Squared Error (MSE) and R-squared for continuous targets, 
this classification task uses accuracy to measure the proportion of correct binary predictions (heart disease or not). 
Accuracy only counts whether each prediction is right or wrong, whereas MSE weighs how large each error is; in both cases, comparing the training and test values is what assesses generalization. 
The model’s test accuracy is solid but might improve with tuning (e.g., adding regularization by lowering `C`, with the value chosen by cross-validation)."""

## Summary

In this lab, you practiced a standard data science pipeline: importing data, splitting it into training and test sets, normalizing it, and fitting a logistic regression model. In the upcoming labs and lessons, you'll continue to investigate how to analyze and tune these models for various scenarios.