Question 1: What is Logistic Regression, and how does it differ from Linear Regression?

Answer:

Logistic Regression is a statistical technique for classification problems where the dependent variable is categorical (e.g., 0/1, yes/no). It passes a linear combination of the features through a sigmoid function to estimate the probability of a class, a value between 0 and 1. Linear Regression, on the other hand, predicts a continuous outcome by fitting a linear relationship between the features and the target.

In short, Linear Regression predicts continuous values, while Logistic Regression predicts class probabilities that are then converted into categorical outcomes.

Question 2: Explain the role of the Sigmoid function in Logistic Regression.

Answer:

The Sigmoid function in Logistic Regression transforms the linear combination of inputs into a probability value between 0 and 1. This makes it possible to interpret the output as the likelihood of a data point belonging to a particular class. It ensures that predictions remain within a valid probability range and helps in setting a threshold (e.g., 0.5) for classification decisions.
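The mapping described above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `sigmoid` and `predict_class`, and the weights and inputs, are made up for the example and are not taken from any library.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score z to a value between 0 and 1."""
    # Branch on the sign so exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_class(features, weights, bias, threshold=0.5):
    """Return (probability, label) for a single sample."""
    # Linear combination of the inputs, as in Linear Regression.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # The sigmoid turns the score into a probability of the positive class.
    p = sigmoid(z)
    # The threshold turns the probability into a class decision.
    return p, int(p >= threshold)

# Hypothetical weights and inputs, chosen only for illustration (z = 2.0).
prob, label = predict_class([3.0, 1.0], weights=[1.0, -1.0], bias=0.0)
print(prob, label)
```

A score of 0 maps to a probability of exactly 0.5, so the default threshold of 0.5 corresponds to the decision boundary where the linear combination equals zero.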

Question 3: What is Regularization in Logistic Regression and why is it needed?

Answer:

Regularization is a method for controlling model complexity and preventing overfitting. Logistic Regression fits the data by assigning a weight to each feature. If these weights become too large, the model may perform very well on the training data but fail on unseen data. Regularization addresses this by adding a penalty term to the cost function, which discourages very large coefficients.

There are two common types:

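L1 Regularization (Lasso): Penalizes the sum of the absolute values of the coefficients. It can shrink some coefficients to exactly zero, effectively performing feature selection.

L2 Regularization (Ridge): Penalizes the sum of the squared coefficients. It reduces the magnitude of all coefficients but does not usually set any of them to exactly zero.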

Regularization is needed because it improves the model's ability to generalize, reduces variance, and ensures more stable and reliable predictions on new data.
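To make the penalty term concrete, here is a minimal sketch of a penalized cost in plain Python. The function names, the example probabilities and weights, and the strength `lam` are all hypothetical values chosen for illustration. Note that scikit-learn expresses the strength through `C`, which is the inverse of the regularization strength, so a smaller `C` means a stronger penalty.

```python
import math

def log_loss(y_true, y_prob):
    """Average binary cross-entropy between labels and predicted probabilities."""
    eps = 1e-15  # keeps log() away from 0
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def l1_penalty(weights, lam):
    """L1 (Lasso) penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 (Ridge) penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

# Hypothetical labels, predicted probabilities, weights, and strength.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.1]
weights = [3.0, -0.5, 0.0]
lam = 0.1

base = log_loss(y_true, y_prob)
print("Unpenalized cost:", base)
print("Cost with L1 penalty:", base + l1_penalty(weights, lam))
print("Cost with L2 penalty:", base + l2_penalty(weights, lam))
```

With these numbers the large weight (3.0) dominates the L2 penalty because it is squared, which is why L2 pushes hardest against the largest coefficients.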

Question 4: What are some common evaluation metrics for classification models, and why are they important?

Answer:

Common evaluation metrics for classification models include:

Accuracy: Measures the percentage of correctly predicted instances. Useful but can be misleading with imbalanced data.

Precision: Proportion of correctly predicted positive cases out of all predicted positives. Important when false positives are costly.

Recall (Sensitivity): Proportion of correctly predicted positive cases out of all actual positives. Important when missing positives is critical.

F1-Score: Harmonic mean of precision and recall, providing a balanced measure when both false positives and false negatives matter.

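ROC-AUC: Area under the ROC curve, which plots the true positive rate against the false positive rate across all classification thresholds. It measures how well the model ranks positives above negatives, independent of any single threshold.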

These metrics are important because they give a deeper understanding of model performance beyond just accuracy, help handle class imbalance, and guide in choosing the best model for real-world applications.
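The threshold-based metrics above can be computed directly from the confusion counts. The sketch below uses plain Python and a small made-up label set (2 positives out of 20) to show how accuracy can hide poor recall; the helper names are illustrative, and in practice `sklearn.metrics` provides equivalent functions.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 as a dict."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up imbalanced labels: 2 positives, 18 negatives.
y_true = [1, 1] + [0] * 18
always_negative = [0] * 20              # never predicts the positive class
some_positives = [1, 0, 1] + [0] * 17   # finds one positive, one false alarm

print(classification_metrics(y_true, always_negative))
print(classification_metrics(y_true, some_positives))
```

Both predictors reach 90% accuracy here, but the first has a recall of 0 while the second has precision and recall of 0.5, which is the kind of difference accuracy alone does not reveal.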

Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy. (Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)

Answer:

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

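# Load the breast cancer dataset from sklearn and wrap it in a DataFrame
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Separate the features from the target column
X = df.drop('target', axis=1)
y = df['target']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# max_iter is raised so the solver can converge on the unscaled features
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy of Logistic Regression model:", accuracy)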

Accuracy of Logistic Regression model: 0.956140350877193


Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy. (Use Dataset from sklearn package)

(Include your Python code and output in the code box below.)

Answer:

In [2]:
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# L2 is the default penalty for LogisticRegression; it is set explicitly here for clarity
model = LogisticRegression(penalty='l2', max_iter=5000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model Coefficients:", model.coef_)
print("Accuracy of Logistic Regression with L2 Regularization:", accuracy)

Model Coefficients: [[ 1.0274368   0.22145051 -0.36213488  0.0254667  -0.15623532 -0.23771256
  -0.53255786 -0.28369224 -0.22668189 -0.03649446 -0.09710208  1.3705667
  -0.18140942 -0.08719575 -0.02245523  0.04736092 -0.04294784 -0.03240188
  -0.03473732  0.01160522  0.11165329 -0.50887722 -0.01555395 -0.016857
  -0.30773117 -0.77270908 -1.42859535 -0.51092923 -0.74689363 -0.10094404]]
Accuracy of Logistic Regression with L2 Regularization: 0.956140350877193


Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report. (Use Dataset from sklearn package)

(Include your Python code and output in the code box below.)

Answer:

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris


data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target


X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


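# One-vs-rest: OneVsRestClassifier fits one binary Logistic Regression per class.
# It is used here in place of multi_class='ovr', which recent scikit-learn releases deprecate.
base_model = LogisticRegression(max_iter=5000)
model = OneVsRestClassifier(base_model)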
model.fit(X_train, y_train)

y_pred = model.predict(X_test)


print("Classification Report:\n")
print(classification_report(y_test, y_pred))

Classification Report:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      0.89      0.94         9
           2       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30



Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy. (Use Dataset from sklearn package)

(Include your Python code and output in the code box below.)

Answer:

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# liblinear is chosen because it supports both the L1 and L2 penalties in the grid
model = LogisticRegression(max_iter=5000, solver='liblinear')

param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2']
}

grid = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Accuracy:", grid.best_score_)

Best Parameters: {'C': 10, 'penalty': 'l2'}
Best Cross-Validation Accuracy: 0.9626373626373628


Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling. (Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)

Answer:

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: train on the raw, unscaled features
model1 = LogisticRegression(max_iter=5000)
model1.fit(X_train, y_train)
y_pred1 = model1.predict(X_test)
acc_without_scaling = accuracy_score(y_test, y_pred1)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Same model, trained on the standardized features
model2 = LogisticRegression(max_iter=5000)
model2.fit(X_train_scaled, y_train)
y_pred2 = model2.predict(X_test_scaled)
acc_with_scaling = accuracy_score(y_test, y_pred2)

print("Accuracy without scaling:", acc_without_scaling)
print("Accuracy with scaling:", acc_with_scaling)

Accuracy without scaling: 0.956140350877193
Accuracy with scaling: 0.9736842105263158


Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model, including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

Answer:


To build a Logistic Regression model for an imbalanced marketing response dataset, the following approach would be effective:

Data Handling:

Clean and preprocess customer data (remove duplicates, handle missing values).

Encode categorical variables (e.g., gender, region) using one-hot encoding.

Split the data into stratified train and test sets before any scaling or resampling, so that the test set keeps the real 5% response rate.

Balancing Classes:

Since only 5% of customers respond, the dataset is highly imbalanced.

Apply techniques like SMOTE (Synthetic Minority Over-sampling Technique), undersampling the majority class, or class weights in Logistic Regression (class_weight='balanced') to address the imbalance. Any resampling should be applied to the training data only, never to the test set.

Feature Scaling:

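Standardize numerical features using StandardScaler, fitted on the training data only, so that the coefficients are comparable across features, the regularization penalty treats all features evenly, and the solver converges faster.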

Hyperparameter Tuning:

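Use GridSearchCV with stratified cross-validation to tune parameters such as C (the inverse of regularization strength) and penalty (L1/L2), scoring on a metric such as F1 or ROC-AUC rather than accuracy.

Include class_weight in the search if recall on responders remains low.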

Model Evaluation:

Avoid relying only on accuracy, since it can be misleading with imbalanced data.

Use metrics like Precision, Recall, F1-score, and ROC-AUC to assess performance.

Specifically, focus on Recall (to capture as many responders as possible) and Precision (to avoid targeting uninterested customers).

Business Application:

A high Recall ensures that most potential responders are captured, maximizing campaign reach.

A good Precision ensures that marketing costs are not wasted on uninterested customers.

The final balance depends on whether the company prioritizes maximizing reach or optimizing costs.

This approach keeps the model robust and business-oriented while handling class imbalance in a real-world scenario.
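The steps above can be sketched end to end with scikit-learn. This is a minimal illustration under several assumptions: the data is synthetic (generated with `make_classification` at roughly 5% positives, standing in for real campaign data), imbalance is handled with `class_weight='balanced'` rather than SMOTE, only `C` is tuned, and ROC-AUC is used as the tuning metric. A real project would add the cleaning and encoding steps and could extend the grid with `penalty`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: roughly 5% responders (assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)

# Stratify so the train and test sets keep the same response rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Putting the scaler inside the pipeline means it is refit on each CV
# training fold, so no information leaks from the validation folds.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=5000)),
])

# Tune the inverse regularization strength C of the "clf" step.
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="roc_auc")
search.fit(X_train, y_train)

# Evaluate with metrics that remain informative under class imbalance.
y_prob = search.predict_proba(X_test)[:, 1]
y_pred = search.predict(X_test)
auc = roc_auc_score(y_test, y_prob)

print("Best parameters:", search.best_params_)
print("Test ROC-AUC:", auc)
print(classification_report(y_test, y_pred))
```

The classification report shows precision and recall for the responder class separately, which is where the trade-off between campaign reach and cost becomes visible. Moving the decision threshold away from 0.5 on `y_prob` is another way to shift that balance.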