Question 1: What is Logistic Regression, and how does it differ from Linear Regression?


Answer:

Logistic Regression is a statistical method for binary classification, where the target variable takes one of two values (e.g., Yes/No, 0/1, Spam/Not Spam). It estimates the probability that an observation belongs to the positive class by passing a linear combination of the features through the logistic (sigmoid) function, which maps any real number to a value between 0 and 1.
Key differences from Linear Regression:
Output Type – Linear Regression predicts continuous numerical values, while Logistic Regression predicts probabilities (which are later converted to classes).


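Equation Used – Linear Regression models the target directly as a linear function of the features (y = β₀ + β₁x₁ + ...), whereas Logistic Regression models the log-odds of the positive class as that linear function and applies the logistic function to convert it into a probability.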


Loss Function – Linear Regression uses Mean Squared Error (MSE), whereas Logistic Regression uses Log Loss (cross-entropy loss).
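The difference between the two loss functions can be made concrete with a small sketch. The labels and predicted probabilities below are made-up values chosen for illustration, and the helper names `mse` and `log_loss` are defined here rather than taken from a library.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Illustrative labels and two sets of predicted probabilities (assumed values).
y_true = [1, 0, 1, 1]
mostly_right = [0.9, 0.1, 0.8, 0.7]
one_confident_miss = [0.9, 0.1, 0.8, 0.01]  # last positive predicted at 1%

for name, preds in [("mostly right", mostly_right),
                    ("one confident miss", one_confident_miss)]:
    print(name, "MSE:", round(mse(y_true, preds), 4),
          "Log loss:", round(log_loss(y_true, preds), 4))
```

The squared error of a single prediction can never exceed 1, while log loss grows without bound as the predicted probability of the true class approaches 0, so confident mistakes are penalized much more heavily.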


Question 2: Explain the role of the Sigmoid function in Logistic Regression.


Answer:

 The Sigmoid function is used to map predicted values from any real number to a range between 0 and 1. Its formula is:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Role in Logistic Regression:
Converts the linear combination of features into a probability.


Allows us to interpret the model’s output as the estimated probability of belonging to the positive class.


Facilitates decision-making by applying a threshold (commonly 0.5) to assign class labels.
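The three roles above can be sketched in a few lines of plain Python. The intercept, weights, and feature values are hypothetical numbers picked for illustration, not coefficients from any fitted model.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical model parameters and one observation (assumed values).
intercept = -1.0
weights = [0.8, -0.5]
features = [2.0, 1.0]

# Step 1: linear combination of the features (the log-odds).
z = intercept + sum(w * x for w, x in zip(weights, features))

# Step 2: convert the log-odds into a probability.
probability = sigmoid(z)

# Step 3: apply a decision threshold to get a class label.
threshold = 0.5
label = 1 if probability >= threshold else 0

print("z:", z, "probability:", probability, "label:", label)
```

Because σ(0) = 0.5, a threshold of 0.5 on the probability is the same as checking whether the linear combination z is non-negative.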


Question 3: What is Regularization in Logistic Regression and why is it needed?


Answer:

 Regularization is a technique to prevent overfitting by adding a penalty term to the loss function of the model.
Types in Logistic Regression:
L1 Regularization (Lasso) – Adds the sum of the absolute values of the coefficients to the loss, which tends to produce sparse models (some coefficients become exactly zero).


L2 Regularization (Ridge) – Adds the sum of the squared coefficients to the loss, which shrinks coefficients toward zero without making them exactly zero.


Need:
Controls large coefficients that may cause overfitting.


Improves generalization to unseen data.


Stabilizes coefficient estimates when features are highly correlated (multicollinearity).
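The sparsity difference between the two penalties can be checked with a short sketch. It assumes scikit-learn's bundled breast cancer dataset, standardized features, the `liblinear` solver (which supports both penalties), and an arbitrary strength `C=0.1`; in scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean a stronger penalty.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# Standardize so the penalty treats all features on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# Same strength, different penalty; C=0.1 is an illustrative choice.
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
l2_model = LogisticRegression(penalty="l2", C=0.1, solver="liblinear")
l1_model.fit(X_scaled, y)
l2_model.fit(X_scaled, y)

# Count coefficients that were driven exactly to zero.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print("Zero coefficients with L1:", n_zero_l1, "of", l1_model.coef_.size)
print("Zero coefficients with L2:", n_zero_l2, "of", l2_model.coef_.size)
```

The L1 model is expected to drop a number of the 30 features entirely, while the L2 model keeps all of them with smaller magnitudes.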


Question 4: What are some common evaluation metrics for classification models, and why are they important?

Answer:
 Common Metrics:
Accuracy – Fraction of all predictions that are correct.


Precision – Fraction of predicted positives that are actually positive: TP / (TP + FP).


Recall (Sensitivity) – Fraction of actual positives that the model correctly identifies: TP / (TP + FN).


F1-score – Harmonic mean of Precision and Recall, useful for imbalanced datasets.


ROC-AUC – Area under the ROC curve; measures how well the model ranks positives above negatives across all decision thresholds.


Importance:
Different metrics capture different aspects of performance.


In imbalanced datasets, accuracy may be misleading; metrics like Precision, Recall, and F1-score provide better insight.
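A small worked example shows why accuracy alone can mislead. The confusion-matrix counts below are hypothetical, chosen to represent 1,000 cases of which 50 are positive; `classification_metrics` is a helper defined here for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical model A: predicts "negative" for everyone.
always_negative = classification_metrics(tp=0, fp=0, fn=50, tn=950)

# Hypothetical model B: finds 30 of the 50 positives, with 60 false alarms.
useful_model = classification_metrics(tp=30, fp=60, fn=20, tn=890)

print("Always negative:", always_negative)
print("Useful model:   ", useful_model)
```

Model A scores higher on accuracy (0.95 versus 0.92) even though it never finds a positive case; its recall and F1 are 0, while model B reaches a recall of 0.6.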


Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pandas as pd


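# Load dataset (scikit-learn's built-in data stands in for a CSV here;
# for a real file, build the DataFrame with pd.read_csv(<path to file>))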
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target


# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42
)


# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)


# Accuracy
accuracy = model.score(X_test, y_test)
print("Accuracy:", accuracy)

Accuracy: 0.956140350877193


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
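The cell above builds the DataFrame from a bundled dataset rather than from a file, and its output includes a convergence warning. A possible variant that actually reads a CSV is sketched below. The file is written to a temporary directory first so the example is self-contained; the file name `breast_cancer.csv` and the `csv_`-prefixed variable names are assumptions, and the `StandardScaler` step is an addition suggested by the warning text, not part of the original cell.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Write the dataset to a CSV so there is a file to load (stand-in for real data).
csv_path = os.path.join(tempfile.mkdtemp(), "breast_cancer.csv")
load_breast_cancer(as_frame=True).frame.to_csv(csv_path, index=False)

# Load the CSV into a DataFrame and separate features from the target column.
csv_df = pd.read_csv(csv_path)
csv_X = csv_df.drop(columns="target")
csv_y = csv_df["target"]

csv_X_train, csv_X_test, csv_y_train, csv_y_test = train_test_split(
    csv_X, csv_y, test_size=0.2, random_state=42
)

# Scaling inside a pipeline helps the solver converge within max_iter.
csv_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
csv_model.fit(csv_X_train, csv_y_train)

csv_accuracy = csv_model.score(csv_X_test, csv_y_test)
print("Accuracy:", csv_accuracy)
```

Putting the scaler in a pipeline means it is fitted on the training split only and then reused unchanged on the test split.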


Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.

In [2]:
from sklearn.linear_model import LogisticRegression


# L2 regularization (default penalty='l2')
ridge_model = LogisticRegression(penalty='l2', max_iter=1000)
ridge_model.fit(X_train, y_train)


print("Coefficients:", ridge_model.coef_)
print("Accuracy:", ridge_model.score(X_test, y_test))

Coefficients: [[ 2.09981182  0.13248576 -0.10346836 -0.00255646 -0.17024348 -0.37984365
  -0.69120719 -0.4081069  -0.23506963 -0.02356426 -0.0854046   1.12246945
  -0.32575716 -0.06519356 -0.02371113  0.05960156  0.00452206 -0.04277587
  -0.04148042  0.01425051  0.96630267 -0.37712622 -0.05858253 -0.02395975
  -0.31765956 -1.00443507 -1.57134711 -0.69351401 -0.84095566 -0.09308282]]
Accuracy: 0.956140350877193


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report.

In [3]:
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report


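# Load iris dataset (this split overwrites the breast cancer
# X_train/X_test/y_train/y_test variables from the earlier cells)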
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)


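# One-vs-rest (OvR) model: one binary classifier per class.
# Note: the multi_class argument is deprecated as of scikit-learn 1.5.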
ovr_model = LogisticRegression(multi_class='ovr', max_iter=1000)
ovr_model.fit(X_train, y_train)


y_pred = ovr_model.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      0.89      0.94         9
           2       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
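In scikit-learn 1.5 and later the `multi_class` argument is deprecated, and the documentation points to `OneVsRestClassifier` for the one-vs-rest scheme. A minimal sketch of that alternative on the same Iris split follows; the variable names are chosen here for illustration, and the results are not guaranteed to match the report above exactly.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris_X, iris_y = load_iris(return_X_y=True)
iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(
    iris_X, iris_y, test_size=0.2, random_state=42
)

# Wrap a binary logistic regression; one copy is fitted per class.
ovr_wrapper = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_wrapper.fit(iris_X_train, iris_y_train)

iris_pred = ovr_wrapper.predict(iris_X_test)
ovr_accuracy = accuracy_score(iris_y_test, iris_pred)

print("Binary classifiers fitted:", len(ovr_wrapper.estimators_))
print(classification_report(iris_y_test, iris_pred))
```

Each of the fitted binary models separates one class from the other two, and the wrapper predicts the class whose model gives the highest score.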





Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy.

In [4]:
from sklearn.model_selection import GridSearchCV


params = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}


grid = GridSearchCV(LogisticRegression(max_iter=1000), params, cv=5)
grid.fit(X_train, y_train)


print("Best Parameters:", grid.best_params_)
print("Validation Accuracy:", grid.best_score_)

Best Parameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
Validation Accuracy: 0.9583333333333334


Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.

In [5]:
from sklearn.preprocessing import StandardScaler


# Without scaling (X_train/X_test are still the Iris split from In [3])
model_no_scale = LogisticRegression(max_iter=1000)
model_no_scale.fit(X_train, y_train)
acc_no_scale = model_no_scale.score(X_test, y_test)


# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


model_scaled = LogisticRegression(max_iter=1000)
model_scaled.fit(X_train_scaled, y_train)
acc_scaled = model_scaled.score(X_test_scaled, y_test)


print("Accuracy without scaling:", acc_no_scale)
print("Accuracy with scaling:", acc_scaled)

Accuracy without scaling: 1.0
Accuracy with scaling: 1.0
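Both accuracies above are 1.0 because the Iris features already share similar ranges, so this split does not show what scaling changes. A sketch on the breast cancer data, where feature ranges differ by orders of magnitude, is given below. It compares the number of solver iterations as well as accuracy; `max_iter=10000` is an arbitrary generous limit chosen so the unscaled fit has room to converge, and the `bc_`-prefixed names are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

bc_X, bc_y = load_breast_cancer(return_X_y=True)
bc_X_train, bc_X_test, bc_y_train, bc_y_test = train_test_split(
    bc_X, bc_y, test_size=0.2, random_state=42
)

# Same model settings, with and without standardization.
raw_model = LogisticRegression(max_iter=10000).fit(bc_X_train, bc_y_train)
scaled_model = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=10000)
).fit(bc_X_train, bc_y_train)

# n_iter_ holds the number of iterations the solver actually used.
raw_iters = int(raw_model.n_iter_[0])
scaled_iters = int(scaled_model[-1].n_iter_[0])

raw_acc = raw_model.score(bc_X_test, bc_y_test)
scaled_acc = scaled_model.score(bc_X_test, bc_y_test)

print("Without scaling: accuracy", raw_acc, "iterations", raw_iters)
print("With scaling:    accuracy", scaled_acc, "iterations", scaled_iters)
```

On data like this, the scaled model is expected to converge in far fewer iterations, which is the main practical benefit of standardizing before fitting a regularized logistic regression.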


Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

Answer:

 For an imbalanced dataset with only 5% positive responses:
Step-by-step approach:
Data Understanding & Cleaning – Handle missing values, outliers, and irrelevant features.


Feature Engineering & Scaling – Encode categorical variables and normalize/standardize numeric features, fitting the scaler on the training data only to avoid leakage.


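Class Balancing Techniques – Apply one of the following, and apply any resampling to the training data only, never to the validation or test data:


Oversampling the minority class (e.g., SMOTE).


Undersampling the majority class.


Class weights in Logistic Regression (class_weight='balanced').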


Model Training – Use Logistic Regression with hyperparameter tuning for C, penalty, and solver.


Evaluation Metrics – Focus on Precision, Recall, F1-score, and ROC-AUC instead of Accuracy.


Cross-validation – To ensure results generalize well to unseen data.


Business Consideration –


Lower the decision threshold below 0.5 to catch more responders (higher recall, at the cost of lower precision).


Use cost-sensitive learning if false negatives (missed responders) are more costly than false positives.
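The steps above can be sketched end to end. Since no campaign data is available here, the sketch generates a synthetic dataset with roughly 5% positives using `make_classification`; the feature counts, the parameter grid, the `average_precision` scoring choice, and the 0.3 threshold are all assumptions made for illustration, and class weighting is used as the balancing technique.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: about 5% responders.
camp_X, camp_y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)

# Stratify so train and test keep the same class ratio.
camp_X_train, camp_X_test, camp_y_train, camp_y_test = train_test_split(
    camp_X, camp_y, test_size=0.2, stratify=camp_y, random_state=42
)

# Scaling and the classifier live in one pipeline, so the scaler is
# refitted on the training part of every cross-validation fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced",
                               solver="liblinear", max_iter=1000)),
])

param_grid = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}
search = GridSearchCV(
    pipeline, param_grid, scoring="average_precision",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(camp_X_train, camp_y_train)

# Score the held-out set and apply a custom threshold (illustrative value).
camp_proba = search.predict_proba(camp_X_test)[:, 1]
threshold = 0.3
camp_pred = (camp_proba >= threshold).astype(int)
camp_auc = roc_auc_score(camp_y_test, camp_proba)

print("Best parameters:", search.best_params_)
print("ROC-AUC:", camp_auc)
print(classification_report(camp_y_test, camp_pred, zero_division=0))
```

In practice the threshold would be chosen from a precision-recall curve on validation data, weighing the cost of contacting a non-responder against the value of a missed responder.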
