## Logistic Regression

### Question 1: What is Logistic Regression, and how does it differ from Linear Regression?


**Logistic Regression** is a statistical and machine learning technique used for **binary classification problems**, where the output variable is categorical (e.g., 0 or 1, Yes or No). It models the probability that a given input belongs to a particular class using the **sigmoid (logistic) function**.

The equation is:
$$
P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}
$$

**Difference from Linear Regression:**

* **Linear Regression** predicts continuous numeric values.
* **Logistic Regression** predicts probabilities and classifies outcomes into discrete categories.
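* Logistic Regression models the **logit (log-odds)** of the positive class as a linear function of the inputs. Applying the sigmoid, which is the inverse of the logit, maps that linear score to a probability between 0 and 1.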
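The contrast between the two models can be seen on a toy one-feature problem. The following is a minimal sketch, assuming made-up data (`x` and `y` are hypothetical) and scikit-learn's default settings: a linear regression fitted to 0/1 labels produces values outside the [0, 1] range, while logistic regression returns probabilities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: one feature, label is 1 when the feature is positive
x = np.linspace(-5, 5, 50).reshape(-1, 1)
y = (x.ravel() > 0).astype(int)

lin = LinearRegression().fit(x, y)
log = LogisticRegression().fit(x, y)

lin_pred = lin.predict(x)              # unbounded real values
log_prob = log.predict_proba(x)[:, 1]  # estimated P(Y=1 | X)

print("Linear regression range:  ", lin_pred.min(), lin_pred.max())
print("Logistic regression range:", log_prob.min(), log_prob.max())
```

The linear fit overshoots at both ends of the feature range, which is why its raw output cannot be read as a probability.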

### Question 2: Explain the role of the Sigmoid function in Logistic Regression.


The **Sigmoid function** converts any real-valued number into a value between **0 and 1**, representing a **probability**.

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

In Logistic Regression, it ensures the model outputs probabilities that can be thresholded (commonly at 0.5) to classify observations into binary outcomes (e.g., “Yes” if P ≥ 0.5, otherwise “No”).
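As a small illustration, the sigmoid and the 0.5 threshold can be written directly in NumPy. This is a sketch only: the scores in `z` are arbitrary example values, and the helper name `sigmoid` is chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])                # example linear scores
probs = sigmoid(z)                            # probabilities for class 1
labels = np.where(probs >= 0.5, "Yes", "No")  # threshold at 0.5

print(probs)
print(labels)
```

A score of 0 maps to exactly 0.5, so the threshold on the probability is equivalent to checking the sign of the linear score. For very large negative scores this naive formula can overflow; `scipy.special.expit` is a numerically safer alternative.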

### Question 3: What is Regularization in Logistic Regression and why is it needed?

**Regularization** is a technique used to **prevent overfitting** by adding a penalty term to the loss function, discouraging large coefficient values.

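* **L1 (Lasso):** Adds the sum of the absolute values of the coefficients to the loss → encourages sparsity (feature selection).
* **L2 (Ridge):** Adds the sum of the squared coefficients to the loss → stabilizes the model and reduces variance.

Regularization improves **generalization**, making the model perform better on unseen data. In scikit-learn the penalty strength is set through `C`, the inverse of the regularization strength, so a smaller `C` means a stronger penalty.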
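The sparsity effect of L1 can be checked empirically. The sketch below is illustrative only: the breast cancer dataset, the `liblinear` solver, and `C=0.05` are choices made for this example, not part of the original answer. It counts how many coefficients each penalty drives exactly to zero.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # put all 30 features on a comparable scale

# Same (strong) penalty strength, different penalty type
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X, y)

n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))

print("Zero coefficients with L1:", n_zero_l1, "of", l1.coef_.size)
print("Zero coefficients with L2:", n_zero_l2, "of", l2.coef_.size)
```

L2 shrinks coefficients toward zero without eliminating them, whereas L1 removes weakly informative features from the model entirely.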


### Question 4: What are some common evaluation metrics for classification models, and why are they important?


Common metrics include:

* **Accuracy:** Percentage of correctly classified instances.
* **Precision:** Proportion of true positives among predicted positives.
* **Recall (Sensitivity):** Proportion of actual positives that the model correctly identifies.
* **F1-Score:** Harmonic mean of Precision and Recall.
* **ROC-AUC:** Area under the ROC curve, which summarizes how well the model ranks positives above negatives across all classification thresholds.

Looking at several metrics matters most when the data is **imbalanced**, because accuracy alone can be high for a model that never predicts the minority class.
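A small worked example shows why accuracy alone can mislead. The labels below are hypothetical (2 positives among 20 samples): a classifier that always predicts the negative class and a classifier that finds one of the two positives reach the same accuracy, but differ in recall.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical ground truth: 2 positives, 18 negatives
y_true = [1, 1] + [0] * 18

# Classifier A: always predicts the negative class
y_always_neg = [0] * 20

# Classifier B: 1 true positive, 1 false negative, 1 false positive, 17 true negatives
y_model = [1, 0, 1] + [0] * 17

acc_a = accuracy_score(y_true, y_always_neg)
rec_a = recall_score(y_true, y_always_neg)

acc_b = accuracy_score(y_true, y_model)
prec_b = precision_score(y_true, y_model)
rec_b = recall_score(y_true, y_model)
f1_b = f1_score(y_true, y_model)
tn, fp, fn, tp = confusion_matrix(y_true, y_model).ravel()

print("A: accuracy", acc_a, "recall", rec_a)
print("B: accuracy", acc_b, "precision", prec_b, "recall", rec_b, "F1", f1_b)
```

Both classifiers are 90% accurate, yet classifier A has a recall of 0 and would never flag a single positive case.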

### Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package)

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset into a Pandas DataFrame
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

X = df.drop(columns="target")
y = (df["target"] == 0).astype(int)  # Binary classification: setosa (1) vs. the rest (0)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

```
Accuracy: 1.0
```


### Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.
(Use Dataset from sklearn package)

```python
# Data
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)

# Model with L2 regularization
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

print("Coefficients:", model.coef_)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

```
Coefficients: [[-0.39086522  0.92121445 -2.33169485 -0.9799742 ]
 [ 0.49862406 -0.30952765 -0.21642636 -0.73163851]
 [-0.10775883 -0.6116868   2.54812121  1.7116127 ]]
Accuracy: 1.0
```


### Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)

```python
from sklearn.metrics import classification_report

# Data
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)

# Model: one-vs-rest fits one binary classifier per class.
# Note: `multi_class` is deprecated since scikit-learn 1.5 and scheduled for removal;
# newer releases recommend OneVsRestClassifier(LogisticRegression(...)) instead.
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

```
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       1.00      0.91      0.95        11
           2       0.92      1.00      0.96        12

    accuracy                           0.97        38
   macro avg       0.97      0.97      0.97        38
weighted avg       0.98      0.97      0.97        38
```
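Because the `multi_class` argument is deprecated in recent scikit-learn releases, the same one-vs-rest strategy can also be expressed with the `OneVsRestClassifier` wrapper. The following is a sketch of that alternative, not the form the question asks for, and its scores are not guaranteed to match the report above digit for digit.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# One binary logistic regression per class, combined by the wrapper
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)

print(classification_report(y_test, ovr.predict(X_test)))
```

After fitting, `ovr.estimators_` holds the three underlying binary models, one per iris species.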





### Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy.
(Use Dataset from sklearn package)


```python
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# liblinear supports both the L1 and the L2 penalty
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X, y)

# best_score_ is the mean 5-fold cross-validated accuracy of the best combination
print("Best Parameters:", grid.best_params_)
print("Best Validation Accuracy:", grid.best_score_)
```

```
Best Parameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
Best Validation Accuracy: 0.9800000000000001
```


### Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.
(Use Dataset from sklearn package)


```python
from sklearn.preprocessing import StandardScaler

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Without scaling
model1 = LogisticRegression(max_iter=1000)
model1.fit(X_train, y_train)
acc1 = accuracy_score(y_test, model1.predict(X_test))

# With scaling: fit the scaler on the training set only, then transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model2 = LogisticRegression(max_iter=1000)
model2.fit(X_train_scaled, y_train)
# Score on the scaled test set, since model2 was trained on scaled features
acc2 = accuracy_score(y_test, model2.predict(X_test_scaled))

print("Accuracy without scaling:", acc1)
print("Accuracy with scaling:", acc2)
```

On this split the unscaled model reaches an accuracy of 1.0. The scaled model has to be scored on `X_test_scaled`: passing the unscaled `X_test` to `model2` instead yields 0.3157894736842105, because the model then receives features on a scale it was never trained on.
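One way to guard against scoring a scaled model on unscaled features is to bundle the scaler and the classifier in a scikit-learn pipeline, so the same transformation is applied automatically at both fit and predict time. The following is a minimal sketch of that design, using the same iris split; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The pipeline fits the scaler on the training data and reuses it for any later input
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

acc = pipe.score(X_test, y_test)  # raw X_test is scaled internally before prediction
print("Pipeline accuracy:", acc)
```

A pipeline also fits cleanly into `GridSearchCV`, where the scaler is then refitted inside each cross-validation fold.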


### Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

To build a reliable Logistic Regression model for imbalanced data:

1. **Data Handling:**

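   * Handle missing values and remove duplicates.
   * Encode categorical variables, preferring One-Hot Encoding for nominal features, since Label Encoding imposes an artificial order that a linear model treats as numeric.

2. **Feature Scaling:**

   * Standardize features using `StandardScaler` so that the regularization penalty treats all coefficients on a comparable scale. Fit the scaler on the training data only.

3. **Balancing Classes:**

   * Use **SMOTE (Synthetic Minority Oversampling Technique)** or **class_weight='balanced'** in Logistic Regression to handle imbalance. Apply any resampling to the training folds only, so that validation and test sets keep the real 5% response rate.

4. **Hyperparameter Tuning:**

   * Use **GridSearchCV** with stratified folds to optimize parameters like `C`, `penalty`, and `solver`, scoring on a metric suited to imbalance (such as F1 or ROC-AUC) rather than accuracy.

5. **Model Evaluation:**

   * Evaluate using **Precision, Recall, F1-score, ROC-AUC**, not just accuracy: with 5% responders, a model that predicts "no response" for everyone is already 95% accurate.

6. **Business Insight:**

   * Focus on **recall (true positive rate)**, since identifying as many responders as possible is key for marketing ROI. Track precision alongside it, because every false positive is a wasted contact, and adjust the decision threshold to balance the two.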
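The steps above can be sketched end to end. This is an illustration under stated assumptions, not a production recipe: the data is a synthetic stand-in generated with `make_classification` (roughly 5% positives), class imbalance is handled with `class_weight='balanced'` rather than SMOTE, and only `C` is tuned to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer data: about 5% responders (assumption)
X, y = make_classification(n_samples=4000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=42)

# Stratify so train and test keep the same response rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Scaling and the class-weighted model live in one pipeline,
# so the scaler is refitted inside every cross-validation fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

param_grid = {"clf__C": [0.01, 0.1, 1, 10]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="average_precision")
search.fit(X_train, y_train)

# Evaluate on the untouched test set with imbalance-aware metrics
probs = search.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, probs)

print("Best C:", search.best_params_["clf__C"])
print("ROC-AUC:", auc)
print(classification_report(y_test, search.predict(X_test)))
```

In a real campaign the default 0.5 threshold would rarely be kept: the predicted probabilities in `probs` can be thresholded lower to raise recall, at the cost of contacting more non-responders.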