# Logistic Regression Assignment 

1.  What is Logistic Regression, and how does it differ from Linear Regression?

Logistic Regression is a classification algorithm used to predict the probability of a categorical outcome (usually binary: yes/no, spam/not spam, disease/no disease).

How it works: Instead of predicting a continuous value, it predicts the probability that a given input belongs to a certain class.

Logistic Regression uses the **logistic (sigmoid) function** to map predictions into a range between 0 and 1:

Let
$$
z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n
\qquad\text{and}\qquad
\sigma(z)=\frac{1}{1+e^{-z}}.
$$

Then
$$
P(y=1\mid x)=\sigma(z)=\frac{1}{1+e^{-(\beta_0+\beta_1 x_1+\dots+\beta_n x_n)}}.
$$

Decision rule with a threshold $\tau\in(0,1)$:
$$
\hat{y} =
\begin{cases}
1 & \text{if } P(y=1\mid x)\ge \tau,\\[6pt]
0 & \text{if } P(y=1\mid x)< \tau.
\end{cases}
$$

Equivalently (in terms of the linear predictor $z$):
$$
\hat{y} =
\begin{cases}
1 & \text{if } z \ge \log\!\left(\dfrac{\tau}{1-\tau}\right),\\[6pt]
0 & \text{if } z < \log\!\left(\dfrac{\tau}{1-\tau}\right).
\end{cases}
$$

For the common threshold $\tau = 0.5$ this simplifies to
$$
\hat{y} =
\begin{cases}
1 & \text{if } \beta_0+\beta_1 x_1+\dots+\beta_n x_n \ge 0,\\[6pt]
0 & \text{otherwise}.
\end{cases}
$$


In words: if the predicted probability is at least the threshold (commonly 0.5), we classify the input as 1; otherwise as 0.
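The decision rule above can be sketched in a few lines of plain Python. The coefficients and the input vector below are hypothetical values chosen only for illustration; the point is that thresholding the probability at $\tau$ gives the same label as comparing $z$ with $\log(\tau/(1-\tau))$.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(tau: float) -> float:
    """Log-odds of tau: the cut-off on z equivalent to P >= tau."""
    return math.log(tau / (1.0 - tau))

def predict(x, beta0, betas, tau=0.5):
    """Return (z, probability, label) for one sample under threshold tau."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    p = sigmoid(z)
    return z, p, int(p >= tau)

# Hypothetical coefficients and input (not taken from any fitted model).
beta0, betas = -1.0, [2.0, -0.5]
x = [1.2, 0.8]

for tau in (0.5, 0.8):
    z, p, label = predict(x, beta0, betas, tau)
    same_label = int(z >= logit(tau))  # rule expressed on the linear predictor
    print(f"tau={tau}: z={z:.2f}, P(y=1|x)={p:.3f}, label={label}, via z: {same_label}")
```

Raising the threshold from 0.5 to 0.8 flips this sample from class 1 to class 0, even though the predicted probability itself does not change.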


| Feature             | **Linear Regression**                   | **Logistic Regression**                            |
| ------------------- | --------------------------------------- | -------------------------------------------------- |
| **Type of Problem** | Regression (predicts continuous values) | Classification (predicts categories/probabilities) |
| **Output Range**    | $-\infty$ to $+\infty$                  | $0$ to $1$ (probability)                           |
| **Equation**        | Straight line (linear function)         | Sigmoid curve (logistic function)                  |
| **Loss Function**   | Mean Squared Error (MSE)                | Log Loss (Cross-Entropy Loss)                      |
| **Interpretation**  | "What value will $y$ take?"             | "What is the probability $y$ belongs to class 1?"  |

---

2.   Explain the role of the Sigmoid function in Logistic Regression.

The sigmoid (logistic) function is at the heart of logistic regression.


The sigmoid function is defined as:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}, \quad \text{where } z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n
$$


Without the sigmoid, the linear model can produce outputs in the range $(-\infty, +\infty)$. The sigmoid function squashes these outputs to lie strictly between 0 and 1, which makes them interpretable as probabilities.  

For classification, a threshold $\tau$ is applied:  

$$
\hat{y} =
\begin{cases}
1 & \text{if } \sigma(z) \ge \tau \\
0 & \text{if } \sigma(z) < \tau
\end{cases}
$$  

Typically, $\tau = 0.5$. This means if the probability is at least 0.5, the model predicts class 1, otherwise class 0.  

The sigmoid is also differentiable, and its derivative has a convenient form  

$$
\sigma'(z) = \sigma(z)\,(1 - \sigma(z)),
$$  

which keeps the gradient of the log loss simple to compute and makes optimization with gradient descent efficient.
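As a quick sanity check, the identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$ can be compared with a central finite difference. This is only a numerical illustration; the step size `h` and the test points are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

h = 1e-6  # finite-difference step (arbitrary small value)
for z in (-3.0, 0.0, 2.5):
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    analytic = sigmoid_grad(z)
    print(f"z={z:+.1f}  analytic={analytic:.6f}  numeric={numeric:.6f}")
```

The derivative peaks at $z = 0$, where it equals $0.25$, and shrinks toward zero as $|z|$ grows.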

---

3.   What is Regularization in Logistic Regression and why is it needed?

Regularization in logistic regression is a technique used to prevent overfitting and improve the generalization ability of the model. In logistic regression, the model tries to find the best coefficients (weights) for the features to minimize the log loss function. However, when the model is very complex or the number of features is large, the coefficients can take very high values, causing the model to fit the training data too closely. This leads to poor performance on unseen data.

The standard cost function for logistic regression is

$$
J(\beta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)} \log(h_\beta(x^{(i)})) + (1-y^{(i)}) \log(1-h_\beta(x^{(i)}))\Big],
$$

where

$$
h_\beta(x) = \frac{1}{1 + e^{-\beta^T x}}
$$

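is the sigmoid function applied to the linear predictor $\beta^T x$ (with $x_0 = 1$, so that $\beta_0$ acts as the intercept), and $m$ is the number of training examples.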

To control the size of the coefficients, a penalty term is added to this cost function. This is called regularization. The two most common types are:

1. **L2 Regularization (Ridge)**  
   The penalty is the sum of squares of the coefficients:  

   $$
   J(\beta) = -\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log(h_\beta(x^{(i)})) + (1-y^{(i)})\log(1-h_\beta(x^{(i)}))] + \frac{\lambda}{2m}\sum_{j=1}^n \beta_j^2
   $$

   This shrinks the coefficients but does not make them exactly zero.

2. **L1 Regularization (Lasso)**  
   The penalty is the sum of the absolute values of the coefficients:  

   $$
   J(\beta) = -\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log(h_\beta(x^{(i)})) + (1-y^{(i)})\log(1-h_\beta(x^{(i)}))] + \frac{\lambda}{m}\sum_{j=1}^n |\beta_j|
   $$

   This can shrink some coefficients exactly to zero, performing feature selection.
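The two penalized cost functions can be written out directly. The sketch below uses plain Python with made-up toy data and coefficients; `beta[0]` plays the role of the intercept and is left out of the penalty, matching the sums over $j = 1, \dots, n$ above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(beta, X, y):
    """Unregularized J(beta); beta[0] is the intercept, rows of X hold x_1..x_n."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], xi)))
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / len(y)

def l2_cost(beta, X, y, lam):
    """Log loss plus (lambda / 2m) * sum of squared coefficients."""
    return log_loss(beta, X, y) + lam / (2 * len(y)) * sum(b * b for b in beta[1:])

def l1_cost(beta, X, y, lam):
    """Log loss plus (lambda / m) * sum of absolute coefficients."""
    return log_loss(beta, X, y) + lam / len(y) * sum(abs(b) for b in beta[1:])

# Toy data and coefficients, invented for illustration only.
X = [[0.5, 1.0], [1.5, -0.5], [-1.0, 0.3], [2.0, 1.2]]
y = [0, 1, 0, 1]
beta = [0.1, 1.5, -0.4]
lam = 1.0

print("log loss:", round(log_loss(beta, X, y), 4))
print("L2 cost :", round(l2_cost(beta, X, y, lam), 4))
print("L1 cost :", round(l1_cost(beta, X, y, lam), 4))
```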

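The parameter $\lambda$ (regularization strength) controls the amount of penalty. In scikit-learn, `LogisticRegression` is parameterized by the inverse of the regularization strength, `C`, so a smaller `C` means stronger regularization.  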
- If $\lambda$ is very large, the model will underfit because coefficients are heavily penalized.  
- If $\lambda$ is very small (close to 0), the effect of regularization disappears and the model may overfit.  
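The effect of $\lambda$ can be seen on a tiny example. The following is a minimal sketch, assuming a single feature, a hand-made non-separable dataset, and plain batch gradient descent on the L2-penalized cost; the learning rate and step count are arbitrary choices that happen to converge here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_l2(xs, ys, lam, lr=0.1, steps=5000):
    """Batch gradient descent on the L2-penalized log loss (one feature).
    The intercept b0 is not penalized, matching the sum over j >= 1."""
    m = len(xs)
    b0 = b1 = 0.0
    for _ in range(steps):
        errs = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        g0 = sum(errs) / m
        g1 = sum(e * x for e, x in zip(errs, xs)) / m + lam / m * b1
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

# Hand-made data: the classes overlap, so the unpenalized optimum is finite.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

slopes = {}
for lam in (0.0, 1.0, 10.0):
    _, b1 = fit_l2(xs, ys, lam)
    slopes[lam] = b1
    print(f"lambda={lam:>4}: beta_1 = {b1:.3f}")
```

As $\lambda$ increases, the fitted slope is pulled toward zero, which is the shrinkage described above.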

Regularization is needed because:  
- It prevents overfitting by avoiding very large coefficients.  
- It reduces variance and improves stability of the model.  
- It improves prediction accuracy on unseen test data.  
- In the case of L1 regularization, it also performs feature selection by eliminating less important features.  

In summary, regularization in logistic regression helps the model stay simple and generalize well to new data rather than memorizing the training dataset; L1 regularization additionally aids interpretability by setting some coefficients to zero.


---

4.   What are some common evaluation metrics for classification models, and why are they important?

Evaluation metrics for classification models are used to measure how well the model performs in predicting categorical outcomes. They are important because accuracy alone is not always sufficient, especially when data is imbalanced. Common evaluation metrics include:

1. **Accuracy**  
   The proportion of correctly classified examples.  
   $$
   \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
   $$
   Useful when classes are balanced, but misleading when data is skewed.

2. **Precision**  
   The proportion of positive predictions that are actually correct.  
   $$
   \text{Precision} = \frac{TP}{TP + FP}
   $$
   Important when the cost of false positives is high (e.g., spam detection).

3. **Recall (Sensitivity or True Positive Rate)**  
   The proportion of actual positives that are correctly identified.  
   $$
   \text{Recall} = \frac{TP}{TP + FN}
   $$
   Important when the cost of false negatives is high (e.g., disease detection).

4. **F1 Score**  
   The harmonic mean of precision and recall.  
   $$
   F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
   $$
   Balances precision and recall, useful when both are important.

5. **ROC Curve and AUC (Area Under Curve)**  
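   The ROC curve plots the True Positive Rate against the False Positive Rate as the decision threshold varies.  
   AUC measures the area under this curve: 0.5 corresponds to random ranking, and values closer to 1 indicate better separation of the classes.  
   Useful for comparing classifiers independently of any single threshold.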
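One way to build intuition for AUC: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below computes it that way on made-up labels and scores; it is meant for illustration, not as a replacement for `sklearn.metrics.roc_auc_score`.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted scores.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.65, 0.90]

auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.3f}")  # 6 of the 9 pairs are ordered correctly
```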

These metrics are important because they provide a more complete picture of model performance. Depending on the application, one metric may be more relevant than others. For example, in medical diagnosis, recall is critical, while in spam detection, precision may be more important.
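To make the point about imbalanced data concrete, the formulas above can be evaluated on confusion-matrix counts. The counts below are hypothetical (1,000 examples, 50 of them positive); the convention of returning 0 when a denominator is zero is an assumption of this sketch.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical classifier that always predicts the negative class.
always_negative = classification_metrics(tp=0, fp=0, fn=50, tn=950)

# Hypothetical classifier that finds 30 of the 50 positives with 10 false alarms.
model = classification_metrics(tp=30, fp=10, fn=20, tn=940)

for name, m in (("always negative", always_negative), ("model", model)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The always-negative classifier reaches 95% accuracy while identifying no positives at all, which is why precision, recall, and F1 are reported alongside accuracy.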


---

5.   Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.


(Use Dataset from sklearn package)


(Include your Python code and output in the code box below.)

```python
# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
data = load_breast_cancer()

# Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print("First 5 rows of the dataset:")
print(df.head(), "\n")

# Split into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression(max_iter=5000)  # increased max_iter for convergence
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
acc = accuracy_score(y_test, y_pred)
print("Accuracy of Logistic Regression model on test data:", acc)
```

```text
First 5 rows of the dataset:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840
1        20.57         17.77          132.90     1326.0          0.08474
2        19.69         21.25          130.00     1203.0          0.10960
3        11.42         20.38           77.58      386.1          0.14250
4        20.29         14.34          135.10     1297.0          0.10030

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419
1           0.07864          0.0869              0.07017         0.1812
2           0.15990          0.1974              0.12790         0.2069
3           0.28390          0.2414              0.10520         0.2597
4           0.13280          0.1980              0.10430         0.1809

   mean fractal dimension  ...  worst texture  worst perimeter
```
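Since the question mentions loading a CSV file, here is one possible variant that routes the same sklearn dataset through CSV before training. It is a sketch under assumptions: an in-memory `io.StringIO` buffer stands in for a file on disk, and the variable names are illustrative. With a real file, `pd.read_csv("path/to/file.csv")` would take the place of the buffer.

```python
import io

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Export the sklearn dataset as CSV text (buffer used instead of a file on disk).
frame = load_breast_cancer(as_frame=True).frame  # 30 features plus a 'target' column
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# Load the CSV into a DataFrame, as the question asks.
df_csv = pd.read_csv(buffer)
X_csv = df_csv.drop(columns="target")
y_csv = df_csv["target"]

# Same split and model settings as in the cell above.
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_csv, y_csv, test_size=0.2, random_state=42
)
csv_model = LogisticRegression(max_iter=5000)
csv_model.fit(Xc_train, yc_train)

csv_acc = accuracy_score(yc_test, csv_model.predict(Xc_test))
print("Shape loaded from CSV:", df_csv.shape)
print("Test accuracy:", csv_acc)
```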

---

6.   Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.


(Use Dataset from sklearn package)


(Include your Python code and output in the code box below.)

```python
# Import required libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset (Breast Cancer dataset from sklearn)
data = load_breast_cancer()

# Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print("First 5 rows of the dataset:")
print(df.head(), "\n")

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Logistic Regression model with L2 regularization (default)
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=5000)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
acc = accuracy_score(y_test, y_pred)

print("Model Coefficients:")
print(model.coef_)
print("\nIntercept:")
print(model.intercept_)
print("\nAccuracy of Logistic Regression with L2 regularization:", acc)
```

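To see what the L2 penalty does to the coefficients, one option is to refit the model for a few values of `C` and compare coefficient norms. The values of `C` below are arbitrary, and the features are standardized on the full dataset purely so the penalty treats them alike; no test-set evaluation is done in this sketch.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X_all)  # illustration only, no held-out data here

norms = {}
for C in (0.01, 1.0, 100.0):
    # The default penalty is L2; smaller C means a stronger penalty.
    clf = LogisticRegression(C=C, max_iter=5000).fit(X_std, y_all)
    norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:<6} ||coef||_2 = {norms[C]:.3f}")
```

The norm grows as `C` increases, i.e. as the penalty weakens.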

---

7.   Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report.


(Use Dataset from sklearn package)


(Include your Python code and output in the code box below.)

```python
# Import required libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load dataset (Iris dataset for multiclass classification)
data = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print("First 5 rows of the dataset:")
print(df.head(), "\n")

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Logistic Regression model with One-vs-Rest strategy
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=5000)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
```

```text
First 5 rows of the dataset:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

   target
0       0
1       0
2       0
3       0
4       0

Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
```
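Recent scikit-learn releases deprecate the `multi_class` argument of `LogisticRegression`, so depending on the installed version the cell above may emit a warning. A possible alternative that keeps the One-vs-Rest strategy explicit is to wrap the estimator in `OneVsRestClassifier`. This is a sketch, not the assignment's required form, and its report is not guaranteed to match the one above digit for digit.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# One binary logistic regression is fitted per class (class k versus the rest).
ovr = OneVsRestClassifier(LogisticRegression(max_iter=5000))
ovr.fit(Xi_train, yi_train)
ovr_pred = ovr.predict(Xi_test)

print("Number of binary classifiers:", len(ovr.estimators_))
print(classification_report(yi_test, ovr_pred, target_names=iris.target_names))
```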





---

8.   Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy.


(Use Dataset from sklearn package)


(Include your Python code and output in the code box below.)

```python
# Import required libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()

# Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print("First 5 rows of the dataset:")
print(df.head(), "\n")

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define Logistic Regression model
log_reg = LogisticRegression(max_iter=5000, solver='liblinear')
# Note: 'liblinear' supports both l1 and l2 penalties

# Define parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],   # inverse of regularization strength
    'penalty': ['l1', 'l2']
}

# Apply GridSearchCV
grid = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Best parameters and score
print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Accuracy:", grid.best_score_)

# Evaluate on test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print("Test Set Accuracy:", test_acc)
```


---

9.   Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.


(Use Dataset from sklearn package)


(Include your Python code and output in the code box below.)

```python
# Import required libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()

# Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print("First 5 rows of the dataset:")
print(df.head(), "\n")

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -----------------------------
# Model without scaling
# -----------------------------
model_no_scaling = LogisticRegression(max_iter=5000)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# -----------------------------
# Standardize the features
# -----------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics

# Model with scaling
model_scaled = LogisticRegression(max_iter=5000)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# -----------------------------
# Print results
# -----------------------------
print("Accuracy without feature scaling:", acc_no_scaling)
print("Accuracy with feature scaling:", acc_scaled)
```


---

10.  Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

To predict which customers will respond to a marketing campaign using Logistic Regression on an imbalanced dataset (only 5% responders), the approach would involve the following steps:

1. **Data Exploration and Cleaning**  
   - Examine the dataset for missing values, outliers, and inconsistent entries.  
   - Encode categorical variables using one-hot encoding or ordinal encoding as appropriate.  
   - Remove irrelevant features or reduce dimensionality if necessary.

2. **Feature Scaling**  
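   - Standardize numerical features using `StandardScaler`, because the regularization penalty and gradient-based solvers are sensitive to feature scales. Fit the scaler on the training data only and reuse it to transform the test data.  
   - Scaling helps the optimization converge efficiently and makes coefficient magnitudes comparable across features.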

3. **Handling Imbalanced Classes**  
   - Since only 5% of customers respond, the dataset is highly imbalanced.  
   - Techniques to address imbalance:  
     - **Resampling**: Oversample the minority class (e.g., using SMOTE) or undersample the majority class.  
     - **Class weights**: Use `class_weight='balanced'` in `LogisticRegression` to penalize misclassification of the minority class more heavily.  
   - This ensures the model does not simply predict all customers as non-responders.

4. **Train-Test Split**  
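   - Split the data into training and testing sets (e.g., 80-20 split) using stratified sampling to preserve the class distribution. In practice, do this before fitting the scaler or resampling, so that no information from the test set leaks into training.  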

5. **Hyperparameter Tuning**  
   - Use `GridSearchCV` or `RandomizedSearchCV` to tune hyperparameters such as:  
     - Regularization strength `C`  
     - Penalty type (`l1` or `l2`)  
     - Solver choice (`liblinear` for small datasets, `saga` for large datasets)  
   - Include cross-validation to select hyperparameters that generalize well.

6. **Model Training**  
   - Train the Logistic Regression model with the chosen hyperparameters and class balancing strategy.  
   - Monitor convergence and ensure the model has not overfitted to the training data.

7. **Evaluation Metrics**  
   - Accuracy is not sufficient for imbalanced data: a model that predicts "no response" for every customer is already 95% accurate. Use metrics that focus on the minority class:  
     - **Precision**: Fraction of predicted responders who are actually responders.  
     - **Recall (Sensitivity)**: Fraction of actual responders correctly identified.  
     - **F1-score**: Harmonic mean of precision and recall.  
     - **ROC-AUC**: Measures model's ability to distinguish responders from non-responders across thresholds.  
   - Plot a **Precision-Recall curve** to analyze trade-offs between precision and recall.

8. **Threshold Tuning**  
   - Adjust the decision threshold based on business needs:  
     - If false negatives (missing potential responders) are costly, lower the threshold to increase recall.  
     - If false positives (sending unnecessary promotions) are costly, raise the threshold to increase precision.

9. **Business Considerations**  
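   - Ensure the model outputs interpretable probabilities so marketing teams can prioritize high-probability responders. Class weighting and resampling shift predicted probabilities away from the true 5% base rate, so treat the scores as rankings unless they are recalibrated.  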
   - Monitor model performance periodically and retrain with new customer behavior data to maintain accuracy.  

10. **Deployment**  
    - Integrate the model into the marketing pipeline to score customers and guide campaign targeting.  
    - Track campaign response rates and adjust the model or threshold based on real-world performance.
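The steps above can be sketched end to end with scikit-learn. Everything specific here is an assumption made for illustration: the data is synthetic (`make_classification` with roughly 5% positives stands in for the customer table), the parameter grid is small, `average_precision` is one reasonable scoring choice for imbalanced data, and the 70% recall target is a hypothetical business requirement.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer data: about 5% responders.
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=0)

# Stratified split keeps the responder rate similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scaling lives inside the pipeline, so it is refit on each training fold only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", solver="liblinear",
                               max_iter=5000)),
])

param_grid = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="average_precision", cv=cv)
search.fit(X_train, y_train)

proba = search.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, proba)
pr_auc = average_precision_score(y_test, proba)
print("Best params:", search.best_params_)
print(f"ROC-AUC: {roc_auc:.3f}  PR-AUC: {pr_auc:.3f}")

# Threshold tuning: highest threshold that still meets the recall target.
target_recall = 0.70  # hypothetical business requirement
precision, recall, thresholds = precision_recall_curve(y_test, proba)
meets_target = np.where(recall[:-1] >= target_recall)[0]
threshold = thresholds[meets_target[-1]]

y_pred = (proba >= threshold).astype(int)
final_recall = recall_score(y_test, y_pred)
final_precision = precision_score(y_test, y_pred)
print(f"Threshold: {threshold:.3f}  recall: {final_recall:.3f}  "
      f"precision: {final_precision:.3f}")
```

Because `class_weight="balanced"` is used, the predicted probabilities are better read as a ranking of customers than as literal response rates; the threshold is then chosen from the precision-recall trade-off rather than fixed at 0.5.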

---