Question 1 - What is Logistic Regression, and how does it differ from Linear
Regression?

Answer - Logistic Regression is a statistical method used to predict categorical outcomes, usually binary (like yes/no, spam/not spam). Instead of predicting an unbounded numeric value, as Linear Regression does, it predicts the probability that an observation belongs to a given class.

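Linear Regression fits a straight line (or hyperplane) and is used when the target is a continuous number (like price or height). Logistic Regression, on the other hand, passes the same kind of linear combination through the logistic (sigmoid) function, which squeezes the output between 0 and 1. That output can be read as a class probability and compared with a cutoff (0.5 by default) to assign a class, which is why it suits classification tasks.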

Question 2 - Explain the role of the Sigmoid function in Logistic Regression

Answer - The **sigmoid function** plays a key role in **Logistic Regression** because it transforms the linear output into a **probability** between **0 and 1**.

How it works:

* Logistic Regression first calculates a **linear combination** of input features (like Linear Regression):
  `z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b`

* Then, it passes `z` through the **sigmoid function**:
  `sigmoid(z) = 1 / (1 + e^(-z))`

This squashes the output to a range between **0 and 1**, which can be interpreted as the **probability** that the input belongs to the **positive class** (e.g., "yes" or "1").

So, the sigmoid function makes it possible for Logistic Regression to **predict probabilities** instead of unbounded raw scores.
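The two steps above can be sketched in a few lines of NumPy. The weights, bias, and input values here are hypothetical and chosen only to make the arithmetic easy to follow; a trained model would learn them from data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and one input row (illustration only).
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([2.0, 1.0])

z = np.dot(w, x) + b      # linear combination: 0.8*2.0 - 0.4*1.0 + 0.1 = 1.3
p = sigmoid(z)            # probability of the positive class
label = int(p >= 0.5)     # a 0.5 cutoff turns the probability into a class

print(f"z = {z:.2f}, probability = {p:.3f}, predicted class = {label}")
```

A score of `z = 0` maps to exactly 0.5, so the sign of `z` decides the class when the cutoff is 0.5.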


Question 3 - What is Regularization in Logistic Regression and why is it needed?

Answer - Regularization in Logistic Regression is a technique used to prevent overfitting by adding a penalty on large coefficients to the cost function the model minimizes.

**Why it's needed**:

Without regularization, the model might fit the training data too well, including noise, which hurts its performance on new, unseen data.

Regularization helps the model stay simpler and more general, improving its ability to make accurate predictions on test data.

**How it works**:

It adds a penalty term to the cost function:

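L1 regularization (Lasso) adds the sum of the absolute values of the coefficients, which can shrink some coefficients to exactly zero.

L2 regularization (Ridge) adds the sum of the squares of the coefficients, which shrinks all coefficients toward zero without eliminating them.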

This discourages the model from assigning too much weight to any one feature.
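As a rough illustration of the penalty term, the sketch below adds an L1 or L2 penalty to the log loss by hand. The labels, probabilities, weights, and the strength `lam` are all made-up values, and the function names are my own; scikit-learn expresses the same idea through `C`, the inverse of the regularization strength, so a smaller `C` means a heavier penalty.

```python
import numpy as np

def log_loss(y, p):
    """Average binary cross-entropy between labels y and probabilities p."""
    eps = 1e-12                          # guard against log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def penalized_cost(y, p, w, lam, kind="l2"):
    """Log loss plus an L1 or L2 penalty on the weights (bias excluded)."""
    if kind == "l1":
        penalty = lam * np.sum(np.abs(w))    # Lasso: sum of absolute values
    else:
        penalty = lam * np.sum(w ** 2)       # Ridge: sum of squares
    return log_loss(y, p) + penalty

# Hypothetical labels, predicted probabilities, and weights (illustration only).
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])
w = np.array([3.0, -0.5, 0.0])
lam = 0.1                                    # assumed penalty strength

print("unpenalized:", round(log_loss(y, p), 4))
print("with L1    :", round(penalized_cost(y, p, w, lam, "l1"), 4))
print("with L2    :", round(penalized_cost(y, p, w, lam, "l2"), 4))
```

The large weight (3.0) contributes 0.3 to the L1 penalty but 0.9 to the L2 penalty, which shows why L2 pushes hardest against any single dominant coefficient.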

Question 4 - What are some common evaluation metrics for classification models, and
why are they important?

Answer - Here are some common evaluation metrics for classification models and why they matter:

1. Accuracy

**What it measures**: Percentage of predictions that are correct.

**Why it's important**: Gives a quick snapshot of model performance, but can be misleading with imbalanced datasets: if only 5% of cases are positive, a model that always predicts "negative" still scores 95%.

2. Precision

**What it measures**: Of all predicted positives, how many were actually positive?

**Why it's important**: Useful when false positives are costly (e.g., spam filters).

3. Recall (Sensitivity)

**What it measures**: Of all actual positives, how many did the model correctly identify?

**Why it's important**: Crucial when missing a positive case is risky (e.g., detecting diseases).

4. F1 Score

**What it measures**: Harmonic mean of precision and recall.

**Why it's important**: A good balance when you need to consider both precision and recall.

5. ROC-AUC (Receiver Operating Characteristic - Area Under Curve)

**What it measures**: How well the model separates classes across all thresholds.

**Why it's important**: Helps compare models regardless of the chosen probability cutoff.
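To make the five metrics concrete, here is a small sketch on hand-made data. The labels and scores are hypothetical (2 positives among 10 samples) and were picked so the counts are easy to verify by hand: with a 0.5 cutoff there is 1 true positive, 1 false positive, 1 false negative, and 7 true negatives.

```python
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)

# Hypothetical labels and model scores: 2 positives among 10 samples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.6, 0.4, 0.7, 0.45]
y_pred = [int(s >= 0.5) for s in y_score]    # apply a 0.5 cutoff

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))   # uses scores, not labels

# Baseline that predicts "negative" for every sample.
baseline_acc = accuracy_score(y_true, [0] * len(y_true))
print("Always-negative accuracy:", baseline_acc)
```

The model and the always-negative baseline reach the same accuracy here, while precision, recall, and ROC-AUC show that only the model finds any positives at all.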

Question 5 - Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.

In [1]:
# Python Program: Logistic Regression with sklearn dataset

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
data = load_breast_cancer()

# Convert to pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print("First 5 rows of dataset:")
print(df.head())

# Split into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize Logistic Regression model
model = LogisticRegression(max_iter=5000)  # increase iterations for convergence
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("\nModel Accuracy:", accuracy)


First 5 rows of dataset:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  wor
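The program above builds the DataFrame directly from the scikit-learn dataset rather than from a CSV file. A minimal sketch of the CSV route follows. Since no file is given in the question, it first writes the same dataset to a temporary CSV; the file name and the assumption that the label column is called `target` are mine, and in practice `csv_path` would point at an existing file.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Write the dataset to a temporary CSV so the example is self-contained.
csv_path = os.path.join(tempfile.mkdtemp(), "breast_cancer.csv")
load_breast_cancer(as_frame=True).frame.to_csv(csv_path, index=False)

# Load the CSV into a DataFrame; the label column is assumed to be 'target'.
df = pd.read_csv(csv_path)
X = df.drop(columns="target")
y = df["target"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train and evaluate
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print("Model Accuracy:", accuracy)
```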

Question 6 - Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.

In [2]:
# Python Program: Logistic Regression with L2 regularization (Ridge)

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
data = load_breast_cancer()

# Convert to pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize Logistic Regression model with L2 regularization
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=5000)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print coefficients and accuracy
print("Model Coefficients (first 10 shown):")
print(model.coef_[0][:10])  # show first 10 coefficients for readability

print("\nModel Intercept:")
print(model.intercept_)

print("\nModel Accuracy:", accuracy)


Model Coefficients (first 10 shown):
[ 1.0274368   0.22145051 -0.36213488  0.0254667  -0.15623532 -0.23771256
 -0.53255786 -0.28369224 -0.22668189 -0.03649446]

Model Intercept:
[28.64871395]

Model Accuracy: 0.956140350877193


Question 7 - Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.

In [3]:
# Python Program: Logistic Regression for Multiclass Classification (One-vs-Rest)

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load dataset from sklearn (Iris dataset for multiclass classification)
data = load_iris()

# Convert to pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print("First 5 rows of dataset:")
print(df.head())

# Split into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize Logistic Regression model with One-vs-Rest strategy
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=200)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))


First 5 rows of dataset:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  





Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
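One caveat on the program above: as far as I know, recent scikit-learn releases (1.5 onward) deprecate the `multi_class` argument of `LogisticRegression`, so it may emit a warning or fail on newer versions. A version-independent sketch of the same One-vs-Rest strategy wraps the estimator in `OneVsRestClassifier`; the variable name `ovr_model` is my own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Wrap Logistic Regression so one binary classifier is fit per class.
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=200))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=data.target_names))
```

After fitting, `ovr_model.estimators_` holds one binary Logistic Regression per class, which makes the "one versus the rest" structure explicit.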



Question 8 - Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.

In [4]:
# Python Program: Hyperparameter Tuning for Logistic Regression using GridSearchCV

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Load dataset (Iris for multiclass classification)
data = load_iris()

# Convert to pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Train/test split (GridSearchCV cross-validates on the training set only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define parameter grid for Logistic Regression
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],      # Inverse of regularization strength (smaller = stronger)
    'penalty': ['l1', 'l2'],           # L1 = Lasso, L2 = Ridge
    'solver': ['liblinear']            # Solver that supports both l1 and l2
}

# Initialize model
log_reg = LogisticRegression(max_iter=5000)

# Apply GridSearchCV
grid = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Print best parameters and best validation score
print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Accuracy:", grid.best_score_)


Best Parameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
Best Cross-Validation Accuracy: 0.9583333333333334


Question 9 - Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling.

In [5]:
# Python Program: Logistic Regression with and without Standardization

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset (Breast Cancer for binary classification)
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ----------- Logistic Regression WITHOUT Scaling -----------
model_no_scaling = LogisticRegression(max_iter=5000)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# ----------- Logistic Regression WITH Scaling -----------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaling = LogisticRegression(max_iter=5000)
model_scaling.fit(X_train_scaled, y_train)
y_pred_scaling = model_scaling.predict(X_test_scaled)
acc_scaling = accuracy_score(y_test, y_pred_scaling)

# Results
print("Accuracy without Scaling:", acc_no_scaling)
print("Accuracy with Scaling:", acc_scaling)


Accuracy without Scaling: 0.956140350877193
Accuracy with Scaling: 0.9736842105263158
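On a 114-sample test set, the gap between 0.956 and 0.974 corresponds to two test samples (109 versus 111 correct), so a single split is thin evidence. A hedged way to check the scaled model more robustly is to put the scaler and the model in one pipeline and cross-validate it; this is a sketch of that idea, not part of the original program.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is refit on the training part of every fold, so no statistics
# from the held-out part leak into training.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))
```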


Question 10 - Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model — including data handling, feature scaling, balancing
classes, hyperparameter tuning, and evaluating the model for this real-world business
use case.

Answer - If I were working at an e-commerce company and needed to predict which customers will respond to a marketing campaign using Logistic Regression, here’s how I’d approach it step by step:

1. Understanding the Data

The first thing I’d do is explore the dataset. Since only 5% of customers respond, the data is highly imbalanced. If I just train a plain Logistic Regression model, it will likely predict “no response” for almost everyone and still show high accuracy — but that won’t be useful for the business. What we care about is finding the right 5%.

2. Feature Engineering & Scaling

Next, I’d prepare the features.

Customer data usually includes demographics (age, income, location), purchase history (frequency, average basket size, last purchase date), and engagement (email opens, clicks, app logins).

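Since regularized Logistic Regression penalizes coefficients by their size, and its solvers converge faster when features are on comparable scales, I’d standardize them using StandardScaler (fitted on the training data only) so that variables like “age” and “income” are comparable.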

3. Handling Class Imbalance

To deal with the 5% responders (minority class):

I could use SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic samples of responders, applied to the training set only so that the test set keeps the real 5% response rate.

Alternatively, I’d try class weighting (setting class_weight='balanced' in Logistic Regression). This tells the model to “pay more attention” to the minority class without changing the data.

Sometimes, a mix of undersampling the majority and oversampling the minority works best.

4. Model Training with Logistic Regression

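I’d start with a Logistic Regression model with L2 regularization (Ridge) to avoid overfitting.

I’d tune the hyperparameter C (inverse of regularization strength) using GridSearchCV with stratified folds, so each fold keeps roughly the same share of responders.

I’d also compare penalty='l1' against penalty='l2', using a solver that supports both (such as liblinear), to see which gives better performance.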

5. Evaluation Metrics

Since accuracy is misleading for imbalanced data, I’d focus on:

Precision (how many predicted responders were actually responders).

Recall (how many actual responders we correctly identified).

F1-score (balance between precision and recall).

ROC-AUC score (how well the model separates responders vs non-responders).

For the business case, recall might be more important because we don’t want to miss out on potential customers who could respond to the campaign.
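Steps 2 to 5 can be sketched end to end as follows. Since no real campaign data is available here, the sketch uses a synthetic dataset from `make_classification` with roughly 5% positives as a stand-in; the feature count, grid values, and step names (`scale`, `clf`) are my assumptions. It uses class weighting rather than SMOTE, because SMOTE needs the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: roughly 5% positives (responders).
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5, n_clusters_per_class=1,
    weights=[0.95, 0.05], random_state=42,
)

# Stratify so the train and test sets keep the same responder rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling and the class-weighted model live in one pipeline, so the scaler
# is fit only on the training folds during cross-validation.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(
        class_weight="balanced", solver="liblinear", max_iter=1000
    )),
])

param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],      # inverse of regularization strength
    "clf__penalty": ["l1", "l2"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, param_grid, cv=cv, scoring="roc_auc")
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
y_prob = grid.predict_proba(X_test)[:, 1]   # probability of responding

print("Best parameters:", grid.best_params_)
print(classification_report(
    y_test, y_pred, target_names=["no response", "response"]
))
print("Test ROC-AUC:", round(roc_auc_score(y_test, y_prob), 3))
```

Scoring the grid search on `roc_auc` instead of accuracy keeps the tuning from rewarding a model that simply predicts "no response" for everyone.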

6. Business Perspective

Finally, I’d explain the results to the marketing team in simple terms:

Instead of sending emails to all customers, the model will generate a ranked list of likely responders.

Even if we don’t perfectly predict all responders, reaching twice as many responders per email as random targeting (a lift of 2) would roughly halve the cost per response and increase ROI.
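The comparison against random targeting is usually reported as lift. Below is a minimal sketch of that calculation; the function name `lift_at_top` is my own, and the 20 customers and their scores are invented (with a 10% response rate rather than 5%, only to keep the example small).

```python
import numpy as np

def lift_at_top(y_true, y_prob, fraction=0.1):
    """Response rate in the top-scored fraction divided by the overall rate."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    n_top = max(1, int(len(y_true) * fraction))
    top_idx = np.argsort(y_prob)[::-1][:n_top]   # highest scores first
    return y_true[top_idx].mean() / y_true.mean()

# Hypothetical scores for 20 customers, 2 of whom responded.
y_true = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.05, 0.10, 0.80, 0.20, 0.15, 0.05, 0.30, 0.10, 0.05, 0.60,
          0.10, 0.70, 0.05, 0.20, 0.10, 0.05, 0.15, 0.10, 0.05, 0.25]

lift = lift_at_top(y_true, y_prob, fraction=0.2)
print("Lift in top 20%:", lift)
```

Here the four highest-scored customers contain both responders, so the response rate among them is 50% against a 10% base rate, a lift of 5.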