In [None]:
# Logistic Regression Assignment

Question 1: What is Logistic Regression, and how does it differ from Linear
Regression?

Answer:

Logistic Regression is a supervised machine learning algorithm used for classification problems (e.g., yes/no, spam/ham).

It predicts the probability of a class by applying the sigmoid function to a linear combination of input features.

Linear Regression, on the other hand, predicts a continuous numeric value (e.g., price, salary).

Key difference:

Linear Regression output: real numbers (−∞ to +∞).

Logistic Regression output: a probability (0 to 1), which is then thresholded to assign a class label.
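
The difference in output ranges can be seen on a small example. This is a minimal sketch on made-up 1-D data (the values and the names `lin`, `log`, and `X_new` are illustrative assumptions, not part of the assignment): the linear model's predictions leave the [0, 1] interval for extreme inputs, while the logistic model's predicted probabilities stay inside it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up toy data: one feature, binary label (0 for small x, 1 for large x)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

# Query points far outside and inside the training range
X_new = np.array([[-5.0], [3.5], [15.0]])

lin_out = lin.predict(X_new)              # unbounded real numbers
log_out = log.predict_proba(X_new)[:, 1]  # probabilities of class 1

print("Linear Regression output:  ", lin_out)
print("Logistic Regression output:", log_out)
```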

Question 2: Explain the role of the Sigmoid function in Logistic Regression.

Answer:

The Sigmoid function maps any real number into the range (0,1).

Formula:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$


In Logistic Regression:

The linear model produces a score (z).

The sigmoid converts z into a probability.

If the probability is > 0.5 (the default threshold) → class 1; else → class 0.
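
The steps above can be sketched in a few lines of NumPy. The scores in `z` are hypothetical values chosen for illustration; in a real model they would come from the linear combination of features and learned weights.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores for three samples
z = np.array([-2.0, 0.0, 2.0])
probs = sigmoid(z)

# Apply the 0.5 threshold: probability > 0.5 -> class 1, else class 0
labels = (probs > 0.5).astype(int)

print("Probabilities:", probs)
print("Labels:", labels)
```

Note that a score of exactly 0 gives a probability of exactly 0.5, which falls into class 0 under the strict `>` rule used here.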

Question 3: What is Regularization in Logistic Regression and why is it needed?

Answer:

Regularization is a technique to prevent overfitting by adding a penalty term to the loss function.

In Logistic Regression:

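L1 regularization (Lasso): adds the sum of the absolute values of the coefficients to the loss → can shrink some coefficients to exactly zero, which acts as feature selection.

L2 regularization (Ridge): adds the sum of the squared coefficients to the loss → keeps all features but shrinks their coefficients toward zero.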

Needed because:

It improves generalization.

It prevents the model from relying too heavily on a few features.
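
The different effect of the two penalties on the coefficients can be checked directly. This is a rough sketch, assuming the breast cancer dataset from scikit-learn, the `liblinear` solver (which supports both penalties), and an arbitrarily chosen `C=0.1`; it counts how many coefficients each penalty sets to exactly zero. The features are standardized on the full dataset only because the goal here is to inspect coefficients, not to evaluate the model.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Same solver and C for both models; only the penalty differs
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X_scaled, y)
l2_model = LogisticRegression(penalty="l2", C=0.1, solver="liblinear").fit(X_scaled, y)

# Count coefficients that were driven to exactly zero
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print("Zero coefficients with L1:", n_zero_l1)
print("Zero coefficients with L2:", n_zero_l2)
```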

Question 4: What are some common evaluation metrics for classification models, and
why are they important?

Answer:

Accuracy – proportion of correctly classified samples.

Precision – proportion of predicted positives that are actually positive (important in fraud/spam detection).

Recall (Sensitivity) – proportion of actual positives correctly identified (important in medical diagnosis).

F1-Score – harmonic mean of Precision & Recall (useful when dataset is imbalanced).

ROC-AUC – measures how well the model separates the classes across all possible thresholds.

These metrics are important because accuracy alone may be misleading on imbalanced datasets.
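
As a small worked example, the metrics above can be computed with `sklearn.metrics` on hand-made labels. The values in `y_true` and `y_score` are made up for illustration; thresholding the scores at 0.5 gives 3 true positives, 1 false negative, 1 false positive, and 5 true negatives.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Made-up ground truth and predicted probabilities for 10 samples
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]

# Hard predictions at the 0.5 threshold
y_pred = [int(s > 0.5) for s in y_score]

accuracy = accuracy_score(y_true, y_pred)    # (3 + 5) / 10 = 0.8
precision = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)                # 0.75
auc = roc_auc_score(y_true, y_score)         # 23 of 24 positive/negative pairs ranked correctly

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("ROC-AUC:", auc)
```

ROC-AUC is computed from the scores rather than the hard predictions, which is why it does not depend on the chosen threshold.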

Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)


In [1]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load dataset into a Pandas DataFrame (features plus a 'target' column)
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Separate features and labels
X = df.drop(columns="target").to_numpy()
y = df["target"].to_numpy()

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Logistic Regression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.9766081871345029


Question 6: Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)


In [2]:
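# Reuses LogisticRegression and the breast cancer train/test split from the previous cell

# Train with L2 regularization (C is the inverse of the regularization strength)
ridge_model = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=10000)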
ridge_model.fit(X_train, y_train)

# Print coefficients & accuracy
print("Model Coefficients:", ridge_model.coef_)
print("Accuracy:", ridge_model.score(X_test, y_test))


Model Coefficients: [[ 1.04679457e+00  2.31996752e-01 -3.86028605e-01  2.58310210e-02
  -1.36538685e-01 -2.36131088e-01 -5.12005351e-01 -2.73859052e-01
  -2.18614922e-01 -3.82641023e-02 -1.19166621e-01  1.36150732e+00
   4.48215042e-01 -1.44997753e-01 -1.83289744e-02  7.05601961e-03
  -6.65454859e-02 -3.50499383e-02 -4.41191713e-02  7.11959077e-04
   4.13217528e-02 -5.05454449e-01 -6.06121183e-02 -1.09313373e-02
  -2.74620760e-01 -7.15241331e-01 -1.33719798e+00 -4.94582619e-01
  -7.11943674e-01 -9.93596108e-02]]
Accuracy: 0.9766081871345029


Question 7: Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)


In [3]:
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report

# Load multiclass dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

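# Logistic Regression with One-vs-Rest
# (multi_class is deprecated since scikit-learn 1.5; it is kept here because the question asks for it)
ovr_model = LogisticRegression(multi_class='ovr', max_iter=10000)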
ovr_model.fit(X_train, y_train)

# Predictions
y_pred = ovr_model.predict(X_test)

# Classification Report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.85      0.92        13
           2       0.87      1.00      0.93        13

    accuracy                           0.96        45
   macro avg       0.96      0.95      0.95        45
weighted avg       0.96      0.96      0.96        45
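
Since the `multi_class` parameter is deprecated in recent scikit-learn releases, the same one-vs-rest strategy can also be expressed with the `OneVsRestClassifier` wrapper. This is a sketch of that alternative, not the form the question asks for; it repeats the iris split with the same `test_size` and `random_state`, and the name `ovr_clf` is chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# One binary Logistic Regression is trained per class (class k vs. the rest)
ovr_clf = OneVsRestClassifier(LogisticRegression(max_iter=10000))
ovr_clf.fit(X_train, y_train)

print("Number of binary classifiers:", len(ovr_clf.estimators_))
print(classification_report(y_test, ovr_clf.predict(X_test)))
```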





Question 8: Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)


In [4]:
from sklearn.model_selection import GridSearchCV

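# Define model & parameters (reuses the iris train/test split from the previous cell)
# The liblinear solver supports both the l1 and l2 penalties
log_reg = LogisticRegression(max_iter=10000, solver='liblinear')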
param_grid = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2']}

# GridSearch
grid = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Validation Accuracy:", grid.best_score_)


Best Parameters: {'C': 10, 'penalty': 'l2'}
Validation Accuracy: 0.9523809523809523


Question 9: Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)

In [5]:
from sklearn.preprocessing import StandardScaler

# Without scaling
base_model = LogisticRegression(max_iter=10000)
base_model.fit(X_train, y_train)
print("Accuracy without scaling:", base_model.score(X_test, y_test))

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

scaled_model = LogisticRegression(max_iter=10000)
scaled_model.fit(X_train_scaled, y_train)
print("Accuracy with scaling:", scaled_model.score(X_test_scaled, y_test))


Accuracy without scaling: 1.0
Accuracy with scaling: 1.0
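
Both accuracies above are 1.0 because the cell reuses the iris split, where the model is already perfect on the test set without scaling. To make the comparison less dependent on one easy dataset, the same experiment could be repeated on the breast cancer data, whose features span very different ranges. The sketch below assumes that dataset and uses a `Pipeline`, so that the scaler is fitted on the training data only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Without scaling
unscaled = LogisticRegression(max_iter=10000)
unscaled.fit(X_train, y_train)
acc_unscaled = unscaled.score(X_test, y_test)

# With scaling: the pipeline fits the scaler on X_train and reuses it on X_test
scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=10000)),
])
scaled.fit(X_train, y_train)
acc_scaled = scaled.score(X_test, y_test)

print("Accuracy without scaling:", acc_unscaled)
print("Accuracy with scaling:", acc_scaled)
```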


Question 10: Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model — including data handling, feature scaling, balancing
classes, hyperparameter tuning, and evaluating the model for this real-world business
use case.


Answer (approach):

Data Handling:

Collect customer demographics, browsing history, past purchases.

Handle missing values and encode categorical features (e.g., one-hot encoding).

Feature Scaling:

Apply StandardScaler or MinMaxScaler to numerical features, fitting the scaler on the training set only to avoid data leakage.

Balancing Classes:

Dataset is highly imbalanced (5% positive).

Use techniques like:

SMOTE (Synthetic Minority Oversampling Technique), applied to the training set only so the test set keeps the real class distribution.

Class weights in Logistic Regression (class_weight='balanced').

Hyperparameter Tuning:

Tune C (inverse of regularization strength) and penalty (L1/L2) using GridSearchCV.

Evaluation:

Don’t rely on accuracy (since 95% accuracy is possible by always predicting “No”).

Use Precision, Recall, F1-score, ROC-AUC.

Focus on Recall (catch as many responders as possible) or F1-score depending on business need.

Business Use:

Deploy model to predict which customers are likely to respond.

Use probabilities → target top % of customers with highest predicted probability → cost-effective marketing.
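
The approach above can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not a production implementation: the data is synthetic (`make_classification` with roughly 5% positives stands in for real customer data), class imbalance is handled with `class_weight='balanced'` rather than SMOTE, the grid is scored by F1, and the 10% targeting cutoff is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: about 5% responders (class 1)
X, y = make_classification(n_samples=4000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=42)

# Stratified split keeps the 5% response rate in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Scaling inside the pipeline so it is refitted on each CV training fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", solver="liblinear", max_iter=1000)),
])

# Tune C and penalty with stratified cross-validation, scored by F1
param_grid = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1")
search.fit(X_train, y_train)

# Evaluate with metrics suited to imbalanced data
probs = search.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, probs)
print("Best Parameters:", search.best_params_)
print("ROC-AUC:", roc_auc)
print(classification_report(y_test, search.predict(X_test)))

# Business use: target the top 10% of customers by predicted probability
top_k = int(0.10 * len(probs))
top_idx = probs.argsort()[::-1][:top_k]
print("Response rate in top 10%:", y_test[top_idx].mean())
print("Overall response rate:", y_test.mean())
```

Comparing the response rate in the targeted group against the overall rate gives a direct view of how much the ranking helps the campaign. If recall matters more than F1 for the business, `scoring="recall"` could be used in the grid search instead.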