# 1: What is Logistic Regression, and how does it differ from Linear Regression?
Answer:
Logistic Regression is a statistical method used for binary classification that models the probability that a given input belongs to a particular category. It uses the Sigmoid function to output values between 0 and 1.

Unlike Linear Regression, which predicts continuous outcomes, Logistic Regression is used for categorical outcomes, typically 0 or 1. Linear Regression fits a straight line (or hyperplane) directly to the target by minimizing squared error, while Logistic Regression passes the same kind of linear combination through the logistic (sigmoid) function and minimizes log loss, so its predictions stay strictly between 0 and 1.
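
To make the contrast concrete, here is a minimal sketch on a made-up one-feature dataset (the data and query points are assumptions chosen for illustration). It fits both models and compares their outputs far from the training range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (made up for illustration): one feature, binary label.
X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

# Query points, two of them far outside the training range.
X_new = np.array([[-10.0], [3.5], [20.0]])

lin_pred = linear.predict(X_new)                # unbounded real values
log_prob = logistic.predict_proba(X_new)[:, 1]  # estimated P(y = 1 | x)

print("Linear regression output:", lin_pred)
print("Logistic regression P(y=1):", log_prob)
```

The linear model's outputs fall below 0 and above 1 at the extreme points, so they cannot be read as probabilities, while the logistic model's outputs remain valid probabilities.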

# 2: Explain the role of the Sigmoid function in Logistic Regression.
Answer:

The **Sigmoid function** plays a central role in Logistic Regression by converting the raw output of the linear model into a **probability** between 0 and 1. This makes it suitable for **binary classification problems**, where the goal is to predict whether an instance belongs to class 1 or class 0.

#### 🔹 Formula:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

Where:

* $z = w^T x + b$ (i.e., the linear combination of inputs and weights)
* $\sigma(z)$ is the predicted probability of the positive class

#### 🔹 Role in Logistic Regression:

* It ensures the model output is a **probability**, which can be interpreted easily.
* Predictions are made by applying a **threshold** (e.g., 0.5):

  * If $\sigma(z) \geq 0.5$, predict class 1
  * If $\sigma(z) < 0.5$, predict class 0
* The sigmoid is a **non-linear** mapping from the score $z$ to a probability, but the decision boundary itself stays **linear** in the inputs: $\sigma(z) \geq 0.5$ exactly when $w^T x + b \geq 0$.

#### 🔹 Summary:

The Sigmoid function enables Logistic Regression to make **probabilistic predictions** by mapping any real-valued score into the bounded range $(0, 1)$, which is essential for binary classification.
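
The formula and threshold rule can be sketched in plain Python. The helper names `sigmoid` and `predict_class`, and the example weights, are illustrative assumptions rather than part of any library.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_class(x, w, b, threshold=0.5):
    """Return (probability, class label) for one sample x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # z = w^T x + b
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical weights and input, chosen only for illustration.
p, label = predict_class(x=[1.0, 1.0], w=[2.0, -1.0], b=0.5)
print(f"z = 1.5, probability = {p:.4f}, predicted class = {label}")
```

Because $\sigma(0) = 0.5$, thresholding the probability at 0.5 is the same as checking whether $z \geq 0$.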


# 3: What is Regularization in Logistic Regression and why is it needed?
Answer:
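Regularization adds a penalty term to the loss function to prevent overfitting. In Logistic Regression, L1 (Lasso) or L2 (Ridge) regularization is often used: L1 can shrink some coefficients to exactly zero, while L2 shrinks all of them smoothly toward zero.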
It discourages complex models by penalizing large coefficients, which leads to simpler and more generalizable models, especially useful when dealing with high-dimensional data.
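
As a rough illustration of how the penalty shrinks coefficients, the sketch below refits an L2-regularized model on the two-class Iris subset for several values of scikit-learn's `C` (the inverse of regularization strength). The choice of dataset and of the `C` values is an assumption made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X, y = X[y != 2], y[y != 2]  # keep two classes for a binary problem

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    # Smaller C means a stronger penalty on the coefficients.
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:<6} ||w|| = {coef_norms[C]:.3f}")
```

The coefficient norm grows as `C` increases, i.e. as the penalty gets weaker.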



# 4: What are some common evaluation metrics for classification models, and why are they important?
Answer:
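Common metrics include:

* **Accuracy**: (True Positives + True Negatives) / all predictions, i.e. overall correctness.
* **Precision**: True Positives / (True Positives + False Positives).
* **Recall (Sensitivity)**: True Positives / (True Positives + False Negatives).
* **F1 Score**: Harmonic mean of Precision and Recall.
* **ROC-AUC**: Area under the ROC curve, which measures how well the model ranks positives above negatives across all thresholds.

No single metric tells the whole story. This matters most on imbalanced datasets, where accuracy alone can look high while the minority class is mostly missed.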
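
The definitions above can be checked by hand from confusion-matrix counts. The counts below are hypothetical, picked to show accuracy looking strong while recall is modest.

```python
# Hypothetical confusion-matrix counts for a 1,000-sample test set.
tp, fp, fn, tn = 30, 10, 20, 940

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")
```

Here accuracy is 0.97 even though the model finds only 60% of the actual positives.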

In [1]:
# 5: Python code to load a dataset, split data, train logistic regression, and print accuracy
# (the built-in Iris dataset stands in for a CSV file here)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Binary classification: keep only classes 0 and 1
X, y = X[y != 2], y[y != 2]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict & evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 1.0


In [2]:
# 6: Train Logistic Regression with L2 regularization and print coefficients and accuracy
model = LogisticRegression(penalty='l2', solver='liblinear')
model.fit(X_train, y_train)

print("Coefficients:", model.coef_)
print("Accuracy:", model.score(X_test, y_test))


Coefficients: [[-0.3753915  -1.39664105  2.15250857  0.96423532]]
Accuracy: 1.0


In [3]:
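# 7: Multiclass classification using multi_class='ovr' and print classification report
# Note: the multi_class parameter is deprecated as of scikit-learn 1.5;
# newer versions suggest wrapping the model in OneVsRestClassifier instead.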
from sklearn.metrics import classification_report

# Full multiclass
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(multi_class='ovr', solver='liblinear')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30





In [4]:
# 8: Apply GridSearchCV to tune C and penalty hyperparameters
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
# best_score_ is the mean cross-validated accuracy for the best parameters, not test-set accuracy
print("Best Accuracy:", grid.best_score_)


Best Parameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
Best Accuracy: 0.9583333333333334


In [5]:
# 9: Standardize features and compare accuracy with/without scaling
from sklearn.preprocessing import StandardScaler

# Without scaling
model = LogisticRegression()
model.fit(X_train, y_train)
print("Without Scaling Accuracy:", model.score(X_test, y_test))

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LogisticRegression()
model_scaled.fit(X_train_scaled, y_train)
print("With Scaling Accuracy:", model_scaled.score(X_test_scaled, y_test))


Without Scaling Accuracy: 1.0
With Scaling Accuracy: 1.0


# 10: Approach to building a Logistic Regression model for an imbalanced marketing dataset



### **Answer:**

To build a Logistic Regression model for predicting customer response in an e-commerce marketing campaign (with only **5%** responders), I would follow a systematic machine learning pipeline that addresses both data quality and class imbalance:

---

### **1. Data Handling**

* **Clean the data**: Handle missing values, remove duplicates, and deal with outliers.
* **Encode categorical variables**: Use one-hot encoding for nominal variables and label encoding if appropriate.
* **Split the dataset**: Use `train_test_split(..., stratify=y)` (or `StratifiedShuffleSplit`) to preserve class proportions in both training and test sets.
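
One way to get a stratified split in scikit-learn is `train_test_split` with `stratify=y`. The sketch below uses placeholder features and hypothetical labels with 5% positives, since the actual campaign data is not shown.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1,000 customers, 50 of them responders (5%).
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 50 + [0] * 950)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both splits keep the 5% responder rate.
print("Train responder rate:", y_train.mean())
print("Test responder rate: ", y_test.mean())
```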

---

### **2. Feature Scaling**

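* Apply **StandardScaler** to standardize numeric features (zero mean, unit variance), fitting it on the training set only and reusing the fitted scaler on the test set.
* Scaling matters because the regularization penalty treats all coefficients alike, so features on larger scales would otherwise be penalized unevenly, and because gradient-based solvers tend to converge faster on standardized inputs.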
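
A common way to keep scaling free of test-set leakage is to put the scaler and the model in one pipeline, so the scaler is refit on the training part of each fold. This is a minimal sketch on synthetic data (the dataset and the inflated first feature are assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one feature is inflated to mimic mixed units.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X[:, 0] *= 1000.0

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Each CV fold fits the scaler on its own training portion only.
scores = cross_val_score(pipe, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())

# After fitting on all data, the scaled features have mean ~0 and std ~1.
pipe.fit(X, y)
X_scaled = pipe.named_steps["standardscaler"].transform(X)
print("Feature means:", np.round(X_scaled.mean(axis=0), 6))
print("Feature stds: ", np.round(X_scaled.std(axis=0), 6))
```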

---

### **3. Class Imbalance Handling**

Given that only 5% of the data belongs to the positive class:

* **Option 1: Resampling**

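  * Use **SMOTE** (Synthetic Minority Over-sampling Technique, available in the imbalanced-learn package) to oversample the minority class, applying it to the training set only so the test set keeps the real class distribution.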
  * Alternatively, undersample the majority class (with caution to avoid information loss).
* **Option 2: Class Weights**

  * Use `class_weight='balanced'` in Logistic Regression to automatically adjust the loss function to handle imbalance.
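
To see what `class_weight='balanced'` does, the sketch below computes the weights for hypothetical labels with 5% positives, both by the heuristic `n_samples / (n_classes * class_count)` and with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 950 non-responders (0) and 50 responders (1).
y = np.array([0] * 950 + [1] * 50)

# Manual version of the 'balanced' heuristic for two classes.
manual = len(y) / (2 * np.bincount(y))

# Same weights from scikit-learn's helper.
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)

print("Manual weights: ", manual)
print("sklearn weights:", weights)
```

Each responder ends up weighted about 19 times more heavily than each non-responder in the loss, which offsets the 19:1 class ratio.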

---

### **4. Model Training and Hyperparameter Tuning**

* Train a Logistic Regression model with regularization to prevent overfitting:

  ```python
  from sklearn.linear_model import LogisticRegression
  model = LogisticRegression(class_weight='balanced', penalty='l2', solver='liblinear')
  model.fit(X_train, y_train)
  ```
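* Tune hyperparameters using **GridSearchCV** to find the best combination of `C` (inverse of regularization strength) and `penalty` (L1 or L2, with a solver that supports both, such as `liblinear`), scoring with a metric suited to imbalance such as `'f1'` or `'average_precision'` rather than the default accuracy.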
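
A possible tuning setup is sketched below. The synthetic dataset from `make_classification` is a stand-in for the campaign data, and the grid values and the choice of `average_precision` as the scoring metric are assumptions, not requirements.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the campaign data: roughly 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]}

grid = GridSearchCV(
    LogisticRegression(class_weight="balanced", solver="liblinear"),
    param_grid,
    scoring="average_precision",  # threshold-free, focuses on the positive class
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best mean CV average precision:", grid.best_score_)
```

Stratified folds keep the responder rate similar in every fold, which makes the fold scores more comparable when positives are rare.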

---

### **5. Evaluation Metrics**

Since the dataset is imbalanced, **accuracy is misleading**: a model that predicts "no response" for every customer would already be 95% accurate. Instead, evaluate the model using:

* **Precision**: Proportion of predicted responders that are correct.
* **Recall**: Ability to identify actual responders.
* **F1-score**: Harmonic mean of precision and recall.
* **ROC-AUC**: Indicates how well the model separates classes.
* Use a **confusion matrix** to assess false positives and false negatives.
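
The gap between accuracy and the other metrics shows up clearly with a trivial baseline. In this sketch the labels are hypothetical (5% responders) and the "model" simply predicts class 0 for everyone.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical test labels: 50 responders, 950 non-responders.
y_true = np.array([1] * 50 + [0] * 950)

# Baseline that never predicts a responder.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print("Accuracy:", acc)
print("Recall:  ", rec)
print("Confusion matrix:\n", cm)
```

The baseline scores 95% accuracy but 0% recall, and the confusion matrix shows all 50 responders counted as false negatives.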

---

### **6. Business Application**

* Optimize for **recall** to capture as many responders as possible without overly sacrificing precision.
* Adjust the classification threshold to align with the campaign budget and expected ROI.
* Periodically retrain the model with updated data to reflect changing customer behavior.
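
Threshold selection can be tied to campaign economics with a simple profit calculation. Everything in this sketch is hypothetical: the predicted probabilities, the revenue per response, the cost per contact, and the helper name `expected_profit`.

```python
def expected_profit(y_true, proba, threshold, revenue_per_response, cost_per_contact):
    """Profit if we contact every customer whose score is >= threshold."""
    contacted = [p >= threshold for p in proba]
    n_contacted = sum(contacted)
    n_responders = sum(1 for c, y in zip(contacted, y_true) if c and y == 1)
    return n_responders * revenue_per_response - n_contacted * cost_per_contact

# Hypothetical validation labels and model scores.
y_true = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
proba = [0.9, 0.8, 0.6, 0.4, 0.35, 0.3, 0.25, 0.1, 0.05, 0.02]

candidates = [0.1, 0.2, 0.3, 0.5, 0.7]
profits = {
    t: expected_profit(y_true, proba, t, revenue_per_response=10, cost_per_contact=2)
    for t in candidates
}
best_threshold = max(profits, key=profits.get)

for t, profit in profits.items():
    print(f"threshold={t}: profit={profit}")
print("Best threshold:", best_threshold)
```

With these made-up numbers, a threshold below the default 0.5 is more profitable because the extra responder it captures is worth more than the added contact costs. In practice the threshold would be chosen on a validation set, not the test set.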

---

### ✅ **Conclusion**

A careful approach combining data preprocessing, resampling or weighted learning, feature scaling, hyperparameter tuning, and proper evaluation helps the Logistic Regression model perform well even on highly imbalanced marketing datasets, which in turn supports better targeting, higher response rates, and improved business outcomes.



