# ML Logistic Regression Assignment

Question 1:  What is Logistic Regression, and how does it differ from Linear Regression?

Goal: Logistic → classify (probability of class), Linear → predict continuous value.

Model: Logistic uses $p(y=1 \mid x) = \sigma(w^\top x + b)$ with the sigmoid $\sigma$; Linear uses $\hat{y} = w^\top x + b$.

Loss: Logistic uses log-loss (maximum likelihood); Linear commonly uses MSE.

Output: Logistic in $[0,1]$ → threshold to label; Linear unbounded.

Assumptions: Logistic assumes log-odds are linear in features.
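
To make the contrast concrete, here is a minimal sketch that fits both models to the same binary labels. The 1-D toy data and the three query points are hypothetical, chosen only so that the linear model's unbounded output is visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical 1-D toy data: the label is 1 when the feature is positive.
X = np.linspace(-5, 5, 21).reshape(-1, 1)
y = (X.ravel() > 0).astype(int)

lin = LinearRegression().fit(X, y)    # minimizes squared error
log = LogisticRegression().fit(X, y)  # minimizes (regularized) log-loss

X_new = np.array([[-10.0], [2.0], [10.0]])
lin_pred = lin.predict(X_new)              # real-valued, not restricted to [0, 1]
log_prob = log.predict_proba(X_new)[:, 1]  # estimated p(y=1 | x)
log_label = log.predict(X_new)             # labels after thresholding at 0.5

print("Linear predictions:    ", lin_pred)
print("Logistic probabilities:", log_prob)
print("Logistic labels:       ", log_label)
```

Far from the training range the linear predictions leave the unit interval, while the logistic probabilities stay inside it.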

Question 2: Explain the role of the Sigmoid function in Logistic Regression.


Maps any real-valued score to a probability: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$, with $z = w^\top x + b$.

Monotonic → preserves ranking; differentiable → enables gradient-based optimization.

Decision rule: predict 1 if $\sigma(z) \ge \tau$ (often $\tau = 0.5$).
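
The sigmoid and the decision rule can be sketched in a few lines of NumPy. The weights `w`, bias `b`, and inputs `X` below are hypothetical values picked for illustration, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps real-valued scores into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and inputs, for illustration only.
w = np.array([1.5, -2.0])
b = 0.25
X = np.array([[2.0, 0.5],
              [0.0, 0.0],
              [-1.0, 1.0]])

z = X @ w + b                    # linear scores w^T x + b: 2.25, 0.25, -3.25
p = sigmoid(z)                   # probabilities of class 1
tau = 0.5                        # decision threshold
labels = (p >= tau).astype(int)  # -> [1, 1, 0]

print("scores:", z)
print("probabilities:", p)
print("labels:", labels)
```

Because the sigmoid is monotonic, thresholding `p` at 0.5 is the same as thresholding the score `z` at 0.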

Question 3: What is Regularization in Logistic Regression and why is it needed?

Adds a penalty on coefficients to control model complexity and prevent overfitting.

L2 (Ridge): $\lambda \lVert w \rVert_2^2$ → shrinks weights smoothly; handles multicollinearity.

L1 (Lasso): $\lambda \lVert w \rVert_1$ → drives some weights to zero (feature selection).

Hyperparameter C (in sklearn) is the inverse of regularization strength ($C \uparrow \Rightarrow$ weaker regularization).
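
The effect of the penalty type and of C can be checked empirically. The sketch below is one way to do it, assuming the breast cancer dataset with standardized features and two arbitrary C values; it counts exactly-zero coefficients and reports the coefficient norm for each setting.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # put features on a comparable scale

results = {}
for penalty in ("l1", "l2"):
    for C in (0.01, 1.0):  # arbitrary values: strong vs. moderate regularization
        clf = LogisticRegression(penalty=penalty, C=C, solver="liblinear", max_iter=1000)
        clf.fit(X, y)
        results[(penalty, C)] = {
            "n_zero": int(np.sum(clf.coef_ == 0)),       # coefficients driven exactly to zero
            "l2_norm": float(np.linalg.norm(clf.coef_)),  # overall size of the weight vector
        }
        print(penalty, C, results[(penalty, C)])
```

With a small C the L1 model is expected to zero out many coefficients, while the L2 model keeps all of them but with a smaller norm than at C=1.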

Question 4: What are some common evaluation metrics for classification models, and why are they important?

Common evaluation metrics for classification & why they matter
Accuracy: overall correctness; can mislead on imbalanced data.

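Precision / Recall / F1: more informative than accuracy under class imbalance; F1 is the harmonic mean of precision and recall.

ROC-AUC: ranking quality across all thresholds; suited to roughly balanced classes or when false-positive and false-negative costs are similar.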

PR-AUC: better when positives are rare.

Confusion matrix: shows TP/FP/TN/FN to diagnose errors.
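
As a small worked example, the metrics above can be computed with `sklearn.metrics`. The labels and scores below are made up for illustration; predictions are obtained by thresholding the scores at 0.5.

```python
from sklearn.metrics import (accuracy_score, average_precision_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Made-up ground truth and predicted probabilities for 8 samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]
y_pred = [int(s >= 0.5) for s in y_score]  # -> [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)                # 3 1 1 3
print("Accuracy :", accuracy_score(y_true, y_pred))     # 6/8 = 0.75
print("Precision:", precision_score(y_true, y_pred))    # 3/4 = 0.75
print("Recall   :", recall_score(y_true, y_pred))       # 3/4 = 0.75
print("F1       :", f1_score(y_true, y_pred))           # 0.75

# Threshold-free metrics use the scores, not the hard labels.
roc_auc = roc_auc_score(y_true, y_score)                # 15/16 = 0.9375
pr_auc = average_precision_score(y_true, y_score)
print("ROC-AUC  :", roc_auc)
print("PR-AUC (average precision):", pr_auc)
```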

Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.

(Use Dataset from sklearn package)

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the breast cancer dataset bundled with scikit-learn (features X, labels y)
X, y = load_breast_cancer(return_X_y=True)
# Hold out 20% for testing; stratify so both splits keep the class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
clf = LogisticRegression(max_iter=1000, solver='liblinear', random_state=42)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))

Accuracy: 0.956140350877193
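
Since the question asks for a CSV loaded into a Pandas DataFrame, a variant along those lines is sketched here. It first writes the same sklearn dataset to a temporary CSV so there is a file to read; the file name `breast_cancer.csv` and the temporary directory are assumptions made for this example.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Write the sklearn dataset to a CSV file (hypothetical file name) so it can be loaded back.
frame = load_breast_cancer(as_frame=True).frame  # 30 feature columns plus 'target'
csv_path = os.path.join(tempfile.mkdtemp(), "breast_cancer.csv")
frame.to_csv(csv_path, index=False)

# Load the CSV into a DataFrame and separate features from the label column.
data = pd.read_csv(csv_path)
X = data.drop(columns="target")
y = data["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
clf = LogisticRegression(max_iter=1000, solver="liblinear", random_state=42)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print("Accuracy:", acc)
```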


Question 6:  Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.

(Use Dataset from sklearn package)

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

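# Reuses X_train, X_test, y_train, y_test from the breast cancer split in In [1]
# penalty='l2' is the Ridge-style penalty (also scikit-learn's default)
clf = LogisticRegression(penalty='l2', solver='liblinear', max_iter=1000, random_state=42)
clf.fit(X_train, y_train)
print("Coefficients shape:", clf.coef_.shape)  # (1, n_features) for a binary problem
print("First 5 coefficients:", clf.coef_[0][:5])
print("Intercept:", clf.intercept_)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))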


Coefficients shape: (1, 30)
First 5 coefficients: [ 1.93035716e+00  7.14564675e-02 -5.03933094e-02 -1.69573330e-03
 -1.45563995e-01]
Intercept: [0.36481037]
Accuracy: 0.956140350877193
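
To read the coefficients more easily, they can be paired with feature names. The sketch below refits the same L2 model and lists the five coefficients with the largest absolute value; using a pandas Series for this is a presentation choice, not part of the original answer.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target)

clf = LogisticRegression(penalty="l2", solver="liblinear", max_iter=1000, random_state=42)
clf.fit(X_train, y_train)

# Label each coefficient with its feature name, then sort by absolute value.
coef = pd.Series(clf.coef_[0], index=data.feature_names)
top5 = coef.loc[coef.abs().sort_values(ascending=False).index[:5]]
print(top5)
```

Because the features are not standardized here, coefficient magnitudes reflect feature units as well as importance, so the ranking should be read with care.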


Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report.

(Use Dataset from sklearn package)

In [3]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

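# Iris has three classes, so this is a multiclass problem
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# One-vs-rest: one binary classifier is fitted per class
clf = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=1000, random_state=42)
clf.fit(X_train, y_train)
# Per-class precision, recall, F1, and support on the test set
print(classification_report(y_test, clf.predict(X_test)))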


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      0.90      0.95        10
           2       0.91      1.00      0.95        10

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30
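
The scikit-learn documentation marks the `multi_class` argument as deprecated since version 1.5 and suggests `OneVsRestClassifier` for one-vs-rest training. A sketch of that alternative on the same split follows; its scores are not guaranteed to match the report above exactly.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Wrap a binary logistic regression; one copy is fitted per class.
ovr = OneVsRestClassifier(
    LogisticRegression(solver="liblinear", max_iter=1000, random_state=42))
ovr.fit(X_train, y_train)

y_pred = ovr.predict(X_test)
print(classification_report(y_test, y_pred))
```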





Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.

(Use Dataset from sklearn package)

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# X, y are still the Iris data loaded in In [3]
# Manual grid search, scored on a single hold-out validation split
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
candidates = [(p, C) for p in ['l1', 'l2'] for C in [0.01, 0.1, 1, 10]]
best, best_params = -1, None
for penalty, C in candidates:
    clf = LogisticRegression(solver='liblinear', penalty=penalty, C=C, max_iter=1000, random_state=42)
    clf.fit(X_tr, y_tr)
    score = accuracy_score(y_val, clf.predict(X_val))
    # Keep the first combination that reaches the highest validation accuracy
    if score > best:
        best, best_params = score, {'penalty': penalty, 'C': C}
print("Best Params:", best_params)
print("Validation Accuracy:", round(best, 4))


Best Params: {'penalty': 'l2', 'C': 10}
Validation Accuracy: 0.9556
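
The same search can be expressed with `GridSearchCV`, which replaces the single hold-out split with cross-validation. The sketch below is an illustration on the binary breast cancer dataset with a scaling step added; the dataset choice, the 5-fold setting, and the pipeline are assumptions, so its results are not comparable with the numbers above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(solver="liblinear", max_iter=1000, random_state=42)),
])
# Parameter names are prefixed with the pipeline step name.
param_grid = {
    "logreg__penalty": ["l1", "l2"],
    "logreg__C": [0.01, 0.1, 1, 10],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)  # refits the best combination on all of X_train

print("Best Params:", grid.best_params_)
print("Mean CV Accuracy:", round(grid.best_score_, 4))
print("Held-out Accuracy:", round(grid.score(X_test, y_test), 4))
```

`best_score_` is the mean cross-validated accuracy of the best combination, which plays the role of the validation accuracy in the manual loop.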


Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.

(Use Dataset from sklearn package)

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X, y are still the Iris data loaded in In [3]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7, stratify=y)

# Baseline: logistic regression on the raw features
plain = LogisticRegression(max_iter=1000, solver='liblinear', random_state=7).fit(X_train, y_train)
acc_plain = accuracy_score(y_test, plain.predict(X_test))

# The pipeline fits the scaler on the training data only, then applies it to the test data
pipe = Pipeline([('scaler', StandardScaler()),
                 ('logreg', LogisticRegression(max_iter=1000, solver='liblinear', random_state=7))]).fit(X_train, y_train)
acc_scaled = accuracy_score(y_test, pipe.predict(X_test))

print("Without scaling:", round(acc_plain, 4))
print("With scaling:", round(acc_scaled, 4))


Without scaling: 0.9333
With scaling: 0.9333
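
A single 30-sample test set can hide small differences, so the comparison can also be run with cross-validation. The sketch below does this on the breast cancer dataset, whose features differ widely in scale; the dataset and the 5-fold setup are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

plain = LogisticRegression(max_iter=1000, solver="liblinear", random_state=7)
scaled = Pipeline([
    ("scaler", StandardScaler()),  # re-fitted on each training fold only
    ("logreg", LogisticRegression(max_iter=1000, solver="liblinear", random_state=7)),
])

# Same folds for both models, so the per-fold scores are directly comparable.
acc_plain = cross_val_score(plain, X, y, cv=cv, scoring="accuracy")
acc_scaled = cross_val_score(scaled, X, y, cv=cv, scoring="accuracy")

print("Without scaling: %.4f +/- %.4f" % (acc_plain.mean(), acc_plain.std()))
print("With scaling:    %.4f +/- %.4f" % (acc_scaled.mean(), acc_scaled.std()))
```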


Question 10: Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business
use case.

Data split & leakage control: Stratified train/val/test; keep campaign-time order if temporal.

Preprocess: Handle missing values; one-hot encode categoricals; cap/transform skewed features; remove leakage features.

Scaling: Standardize numeric features (especially with regularization).

Class imbalance:

Use class_weight='balanced' (built-in) and/or resampling (e.g., SMOTE on train only).

Consider threshold tuning to hit desired recall/precision or business uplift.

Hyperparameters: Tune C, penalty (L1/L2), solver; try interaction terms or monotonic bins.

Metrics: Favor PR-AUC, Recall@k, Precision/Recall/F1, and the Brier score (to check calibration); report the confusion matrix.

Calibration: Platt/Isotonic for well-calibrated probabilities → better targeting.

Explainability: Coefficients (sign/magnitude) or SHAP for stakeholder trust.

Deployment: Choose threshold by ROI curve (cost per contact vs expected uplift); monitor drift and re-train on fresh data.
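
The main steps above (stratified split, scaling, class weighting, tuning on PR-AUC, threshold selection) can be sketched end to end. Everything specific in this example is an assumption: the data are synthetic with roughly 5% positives, the grid values are arbitrary, and the 0.80 recall target is a hypothetical business requirement.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             precision_recall_curve)
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_predict, train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for campaign data with about 5% responders.
X, y = make_classification(n_samples=4000, n_features=20, n_informative=6,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(solver="liblinear", class_weight="balanced",
                                  max_iter=1000)),
])
param_grid = {"logreg__C": [0.01, 0.1, 1, 10], "logreg__penalty": ["l1", "l2"]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Tune on PR-AUC (average precision), which suits rare positives.
search = GridSearchCV(pipe, param_grid, scoring="average_precision", cv=cv)
search.fit(X_train, y_train)

# Out-of-fold probabilities on the training data, used only to pick a threshold.
oof = cross_val_predict(search.best_estimator_, X_train, y_train,
                        cv=cv, method="predict_proba")[:, 1]
precision, recall, thresholds = precision_recall_curve(y_train, oof)
target_recall = 0.80  # hypothetical business target
# recall[:-1] aligns with thresholds; take the highest threshold meeting the target.
threshold = thresholds[recall[:-1] >= target_recall].max()

# Final check on the untouched test set.
proba = search.predict_proba(X_test)[:, 1]
y_pred = (proba >= threshold).astype(int)
pr_auc = average_precision_score(y_test, proba)
print("Best params:", search.best_params_)
print("Chosen threshold:", round(float(threshold), 4))
print("Test PR-AUC:", round(pr_auc, 4))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```

In a real campaign the threshold would more likely come from a cost or ROI curve than from a fixed recall target, and calibration could be added before the probabilities are used for targeting.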