# **Logistic Regression | Assignment**

1. **What is Logistic Regression, and how does it differ from Linear
Regression?**
-  **Logistic Regression** and **Linear Regression** are both statistical and machine learning algorithms used for predictive modeling under the supervised learning paradigm. They both aim to uncover the relationship between `independent (predictor) variables` and a `dependent (outcome) variable`. However, a key distinction lies in the nature of the outcome variable and the problems they are best suited to solve.
- **Logistic regression**
  - Purpose: Primarily used for classification problems, where the goal is to predict a categorical outcome.
  - Dependent Variable: Categorical, typically binary (e.g., yes/no, true/false, 0/1) but can extend to multiple unordered (multinomial) or ordered (ordinal) categories.
  - Output: Predicts the probability that an input belongs to a specific class (a value between 0 and 1). A decision threshold (often 0.5) is then used to assign the input to a particular category.
  - Mathematical Approach: Employs the logistic (sigmoid) function, an S-shaped curve, to transform a linear combination of inputs into a probability value between 0 and 1.
  - Relationship between variables: Assumes a linear relationship between the independent variables and the log-odds of the outcome, not the outcome itself. The predicted probability is therefore a non-linear (S-shaped) function of the inputs.
- **Linear regression**  
  - Purpose: Used for regression problems, where the objective is to predict a continuous numerical value.
  - Dependent Variable: Continuous (e.g., price, temperature, sales figures, test scores).
  - Output: Predicts a specific numerical value of the dependent variable.
  - Mathematical Approach: Uses a linear equation (Y = β0 + β1X + ε) to model the relationship between the independent and dependent variables.
  - Relationship between variables: Assumes a linear relationship exists between the variables.
-  Linear regression predicts a continuous numerical value (regression problems), while logistic regression predicts a categorical outcome (classification problems).
- This difference is reflected in their mathematical approaches; linear regression uses a linear equation, and logistic regression uses the logistic (sigmoid) function to output probabilities.
- While linear regression assumes the outcome itself is a linear function of the predictors, logistic regression assumes linearity only in the log-odds of the outcome, so the predicted probability is a non-linear function of the predictors.
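
To make the contrast concrete, here is a minimal sketch that fits both models to the same toy binary data. The "hours studied" values and pass/fail labels are made up for illustration and are not part of the assignment data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (made up for illustration): hours studied -> fail (0) / pass (1)
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

linear = LinearRegression().fit(hours, passed)
logistic = LogisticRegression().fit(hours, passed)

new_hours = np.array([[0], [4.5], [12]])

# Linear regression returns unbounded values that can fall outside [0, 1].
linear_out = linear.predict(new_hours)

# Logistic regression returns a probability in (0, 1) plus a class label.
logistic_prob = logistic.predict_proba(new_hours)[:, 1]
logistic_label = logistic.predict(new_hours)

print("Linear output:       ", np.round(linear_out, 2))
print("Logistic probability:", np.round(logistic_prob, 2))
print("Logistic label:      ", logistic_label)
```

For inputs far from the training range, the linear model's output drops below 0 or rises above 1, so it cannot be read as a probability, while the logistic model's output always stays between 0 and 1.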

2. **Explain the role of the Sigmoid function in Logistic Regression.**
- The Sigmoid function, also known as the logistic function, is a fundamental component of Logistic Regression, a statistical model used for binary classification tasks. It converts the raw output of the model (a linear combination of the features) into a probability value between 0 and 1. This transformation is essential because probabilities must always lie between 0 and 1, a range the sigmoid function is ideally suited to provide.

  **Role of the Sigmoid function**

- **Mapping to Probabilities**: In logistic regression, the goal is to predict the probability that a data point belongs to a specific class (e.g., whether an email is spam or not, or a tumor is malignant or not). However, the linear combination of features in a logistic regression model can produce any real-valued number, which is unsuitable for directly representing probabilities that are bounded between 0 and 1. The sigmoid function addresses this by transforming the linear output into a probability score within the valid range of 0 to 1.
- **S-shaped Curve**: The sigmoid function produces an "S"-shaped curve, which is ideal for modeling the probability of a binary outcome. As the input to the sigmoid function becomes increasingly positive, the output approaches 1, indicating a high probability of belonging to the positive class. Conversely, as the input becomes increasingly negative, the output approaches 0, indicating a high probability of belonging to the negative class.
- **Thresholding for Classification**: After the sigmoid function outputs a probability, a threshold (usually 0.5) is applied to classify the data point into one of the two classes. For example, if the probability is greater than or equal to 0.5, the data point is classified as belonging to class 1, otherwise it belongs to class 0.
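
The three roles above can be sketched in a few lines. The sigmoid is σ(z) = 1 / (1 + e^(-z)), where z is the linear combination of the features. The scores in `z` below are hypothetical values chosen for illustration, not outputs of a fitted model.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores z = b0 + b1*x1 + ... for five inputs
z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])

# Mapping to probabilities: large negative z -> near 0, large positive z -> near 1
probabilities = sigmoid(z)

# Thresholding: probability >= 0.5 is assigned to class 1, otherwise class 0
labels = (probabilities >= 0.5).astype(int)

for score, p, label in zip(z, probabilities, labels):
    print(f"z = {score:5.1f}  ->  p = {p:.4f}  ->  class {label}")
```

A score of exactly 0 maps to a probability of 0.5, which is why the 0.5 threshold corresponds to the decision boundary z = 0.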


3. **What is Regularization in Logistic Regression and why is it needed?**
- Regularization is a technique used in machine learning, including logistic regression, to prevent a common problem called overfitting.
- Overfitting occurs when a model learns the training data too well, including the noise and random fluctuations, rather than just the underlying patterns. This leads to excellent performance on the training data but poor performance on new data.
- Imagine fitting a complex curve to perfectly match every point in a dataset, even the outliers. While this might fit the training data perfectly, it won't generalize well if new data points don't follow the exact same "noise" patterns.

 **Why is regularization needed?**

- To prevent overfitting: As mentioned, logistic regression models can be prone to overfitting, especially with high-dimensional datasets or when the number of features exceeds the number of observations. Regularization helps to mitigate this issue, leading to a model that performs better on new, unseen data.
- To handle multicollinearity: Multicollinearity occurs when independent variables are highly correlated with each other, which can lead to unstable and unreliable estimates of regression coefficients. Regularization, particularly L2 regularization (Ridge), can help manage this by reducing the impact of highly correlated features and distributing their influence more evenly.
- To improve model interpretability (especially with L1 regularization): L1 regularization (Lasso) can drive the coefficients of less important features to zero, effectively performing feature selection and making the model easier to understand and interpret in terms of the most relevant features, according to Medium.
- To improve model generalizability: By preventing overfitting and dealing with multicollinearity, regularization helps create models that generalize well to new data, leading to more reliable predictions in real-world scenarios.  
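
As a rough illustration of the L1 versus L2 behaviour described above, the sketch below fits both penalties to scikit-learn's breast cancer dataset and counts the coefficients that end up exactly zero. The dataset and the value `C=0.1` are assumptions chosen for the demonstration. The features are scaled on the full dataset only because nothing is evaluated on held-out data here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # penalties are scale-sensitive

# C is the inverse of regularization strength: smaller C = stronger penalty.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.1)
l1_model.fit(X_scaled, y)
l2_model.fit(X_scaled, y)

# L1 (Lasso) can set coefficients to exactly zero; L2 (Ridge) only shrinks them.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print(f"Features: {X.shape[1]}")
print(f"Zero coefficients with L1: {n_zero_l1}")
print(f"Zero coefficients with L2: {n_zero_l2}")
```

The L1 model keeps only a subset of the features, which is the feature-selection effect mentioned above, while the L2 model retains all of them with smaller weights.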

4. **What are some common evaluation metrics for classification models, and
why are they important?**
- Classification models aim to categorize data into predefined labels or classes. Evaluating the performance of these models requires using appropriate metrics to understand their effectiveness and guide improvements.

 **Here are some common evaluation metrics for classification models and their importance:**

- **Accuracy**
  - What it is: The proportion of correctly predicted instances out of the total instances.
  - Importance: A good initial measure for overall performance when classes are balanced and misclassification costs are roughly equal for all classes.
  - Limitations: Can be misleading with imbalanced datasets. For example, if 99% of the data belongs to the majority class, a model that always predicts the majority class achieves 99% accuracy while never identifying the minority class, according to GeeksforGeeks.
- **Confusion matrix**
  - What it is: A table that summarizes predictions versus actual class labels. It shows True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
  - Importance: It provides a clear picture of the model's accuracy and misclassifications, helping to identify error patterns.
  - Components: This matrix includes:
    - True Positives (TP): Correctly predicted positive instances.
    - True Negatives (TN): Correctly predicted negative instances.
    - False Positives (FP) (Type I Error): Incorrectly predicted positive instances (actual negative).
    - False Negatives (FN) (Type II Error): Incorrectly predicted negative instances (actual positive).
- **Precision**
  - What it is: The ratio of true positives to the total number of positive predictions. It measures how many positively predicted instances were actually positive.
  - Importance: Precision is valuable when minimizing false positives is critical, such as in spam detection or financial fraud. High precision indicates a low rate of false alarms.
- **Recall (Sensitivity, True Positive Rate)**
  - What it is: The ratio of true positives to the total number of actual positive instances. It indicates how many of the actual positive cases were correctly identified by the model.
  - Importance: Recall is crucial when missing true positives has significant consequences, such as in disease detection or security threat identification.
- **F1 score**
  - What it is: The harmonic mean of precision and recall.
  - Importance: It offers a balanced measure of performance, especially when there's a trade-off between precision and recall, or with imbalanced datasets. It helps to balance the impact of false positives and false negatives.
- **ROC-AUC (Area Under the Receiver Operating Characteristic Curve)**
  - What it is: Measures the model's ability to distinguish between positive and negative classes across various thresholds. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), and the AUC is the area under this curve.
  - Importance: It provides insight into the model's discrimination power, particularly for binary classification problems. A higher AUC indicates better class separation.
- **Logarithmic Loss (Log Loss)**
  - What it is: The negative average log-likelihood of the true labels under the model's predicted probabilities. It penalizes confident but wrong predictions most heavily.
  - Importance: Useful for assessing the quality of predicted probabilities, especially when prediction confidence is important or with imbalanced datasets. Lower log loss suggests more accurate probability estimates.

  **Why these metrics are important:**

- Assessing Performance: They provide a way to quantitatively measure how well a model performs.
- Informing Decisions: Understanding these metrics helps determine if a model is suitable for deployment or requires further development.
- Identifying Strengths and Weaknesses: Each metric offers a different perspective on the model's behavior.
- Optimizing Performance: Analyzing metrics allows for iterative refinement to improve the model and align it with specific goals.
- Choosing the Right Metric: Different problems require different priorities. For example, a medical diagnosis model would prioritize recall, while a spam filter might prioritize precision. Selecting the appropriate metric ensures the model's performance aligns with the problem's context.   
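
The metrics above can be computed directly with `sklearn.metrics`. The following is a small sketch on hand-made labels and scores (illustrative values only, not from a trained model), followed by the accuracy limitation on an imbalanced set.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hand-made labels and scores (illustrative only): 4 positives, 6 negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.6, 0.55, 0.3, 0.2, 0.1, 0.05])
y_pred = (y_score >= 0.5).astype(int)  # apply the usual 0.5 threshold

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)  # uses scores, not thresholded labels

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"Accuracy={accuracy:.2f} Precision={precision:.2f} "
      f"Recall={recall:.2f} F1={f1:.2f} ROC-AUC={auc:.2f}")

# Accuracy on imbalanced data: 1 positive in 100, always predicting negative
y_imbalanced = np.array([1] + [0] * 99)
always_negative = np.zeros(100, dtype=int)
paradox_accuracy = accuracy_score(y_imbalanced, always_negative)
paradox_recall = recall_score(y_imbalanced, always_negative)
print(f"Always-negative model: accuracy={paradox_accuracy:.2f}, "
      f"recall={paradox_recall:.2f}")
```

With these labels there are 3 true positives, 1 false negative, 2 false positives and 4 true negatives, giving accuracy 0.70, precision 0.60 and recall 0.75. The always-negative model reaches 0.99 accuracy with a recall of 0, which is why accuracy alone is not enough for imbalanced problems.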

5. **Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package)**


```python
# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
iris = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Display first 5 rows
print("First 5 rows of the dataset:")
print(df.head())

# Split into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train Logistic Regression model
model = LogisticRegression(max_iter=200)  # Increased max_iter to ensure convergence
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.2f}")
```


```
First 5 rows of the dataset:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  

Model Accuracy: 1.00
```


6. **Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.
(Use Dataset from sklearn package)**


```python
# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
iris = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Logistic Regression model with L2 regularization
# (multi_class is left at its default; 'auto' was the default and the
# parameter is deprecated in recent scikit-learn releases)
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=200)

# Train the model
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print coefficients and accuracy
print("Model Coefficients (per class):")
print(model.coef_)
print("\nIntercepts (per class):")
print(model.intercept_)
print(f"\nModel Accuracy: {accuracy:.2f}")
```


```
Model Coefficients (per class):
[[-0.39345607  0.96251768 -2.37512436 -0.99874594]
 [ 0.50843279 -0.25482714 -0.21301129 -0.77574766]
 [-0.11497673 -0.70769055  2.58813565  1.7744936 ]]

Intercepts (per class):
[  9.00884295   1.86902164 -10.87786459]

Model Accuracy: 1.00
```




7. **Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)**

```python
# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load dataset
iris = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Logistic Regression model with one-vs-rest strategy
# Note: multi_class is deprecated in recent scikit-learn releases (1.5+);
# OneVsRestClassifier(LogisticRegression()) is the suggested replacement.
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=200)

# Train the model
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```


```
Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
```





8. **Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.
(Use Dataset from sklearn package)**

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Load dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

X = df.drop('target', axis=1)
y = df['target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Base logistic regression model
base_model = LogisticRegression(solver='liblinear', max_iter=1000)

# Wrap in One-vs-Rest classifier
ovr_model = OneVsRestClassifier(base_model)

# Parameter grid
param_grid = {
    'estimator__C': [0.01, 0.1, 1, 10, 100],
    'estimator__penalty': ['l1', 'l2']
}

# Grid search
grid = GridSearchCV(ovr_model, param_grid=param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print(f"Best Cross-Validation Accuracy: {grid.best_score_:.2f}")
```


```
Best Parameters: {'estimator__C': 10, 'estimator__penalty': 'l1'}
Best Cross-Validation Accuracy: 0.96
```


9. **Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling.
(Use Dataset from sklearn package)**

```python
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ---------------- Without Scaling ----------------
model_no_scaling = LogisticRegression(max_iter=200)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# ---------------- With Scaling ----------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaling = LogisticRegression(max_iter=200)
model_scaling.fit(X_train_scaled, y_train)
y_pred_scaling = model_scaling.predict(X_test_scaled)
acc_scaling = accuracy_score(y_test, y_pred_scaling)

# ---------------- Results ----------------
print(f"Accuracy without scaling: {acc_no_scaling:.2f}")
print(f"Accuracy with scaling   : {acc_scaling:.2f}")
```


```
Accuracy without scaling: 1.00
Accuracy with scaling   : 1.00
```
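
Both accuracies are identical on Iris, most likely because all four features are measured in centimetres over similar ranges. As a hedged side experiment (not part of the assignment), the sketch below repeats the comparison on scikit-learn's wine dataset, whose features differ in scale by orders of magnitude. It uses a pipeline with 5-fold cross-validation so the scaler is fit only on each training fold. The dataset choice and `max_iter=10000` are assumptions made for this illustration.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Same model, with and without a StandardScaler step in front of it.
# The unscaled model may still emit convergence warnings on this data.
unscaled = LogisticRegression(max_iter=10000)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# cross_val_score refits the whole pipeline per fold, so the scaler never
# sees the validation fold (no data leakage).
unscaled_mean = cross_val_score(unscaled, X, y, cv=5).mean()
scaled_mean = cross_val_score(scaled, X, y, cv=5).mean()

print(f"Mean CV accuracy without scaling: {unscaled_mean:.3f}")
print(f"Mean CV accuracy with scaling   : {scaled_mean:.3f}")
```

Cross-validated means are also less sensitive to a single lucky train/test split than the one-off comparison on a 30-sample test set.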


10. **Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.**

 **Marketing Campaign Response Prediction — Logistic Regression Plan**

- Problem Context
  - Business goal: Identify customers most likely to respond to a campaign.
  - Dataset: Highly imbalanced (5% responders, 95% non-responders).
  - Challenge: Standard accuracy will be misleading — we need a method that prioritizes finding responders while controlling costs.
- Data Handling
  - Data Cleaning
    - Remove duplicates.
    - Handle missing values (imputation for numeric/categorical fields).
    - Standardize data formats (e.g., date fields to datetime).
  - Feature Engineering
    - Customer activity features (recent purchases, website visits).
    - Engagement history (previous campaign responses).
    - Demographic features (location, age group, income bracket).
  - Encoding
    - One-hot encoding for categorical variables.
    - Avoid dummy variable trap (drop one category).
  - Feature Selection
    - Remove highly correlated features to avoid multicollinearity.
    - Use domain knowledge + statistical tests.  
- Feature Scaling
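  - Logistic Regression is fit with gradient-based optimization, and its regularization penalty depends on coefficient magnitudes. Scaling improves convergence and keeps the penalty from favoring features with large numeric ranges.
  - Apply StandardScaler to all numeric features.
  - Fit the scaler on the training data only (after the train-test split) and reuse it to transform the test data, to avoid data leakage.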
- Addressing Class Imbalance
  - Class Weights Approach: Set `class_weight='balanced'` so minority class gets proportionally higher weight in loss function.
  - Resampling Approach:
    - Oversample minority class with SMOTE, applied to the training data only (never to validation or test data).
    - Optionally undersample majority class for computational efficiency.
  - Why: Ensures the model doesn’t just predict "non-response" for everyone.
- Model Training & Hyperparameter Tuning
  - Base model:

     ```python
     from sklearn.linear_model import LogisticRegression

     base_model = LogisticRegression(solver='liblinear', max_iter=500)
     ```

  - Hyperparameters to tune:
    - C (inverse regularization strength): [0.01, 0.1, 1, 10]
    - penalty: ['l1', 'l2']
    - class_weight: ['balanced', None]
  - Tuning method:
    - Use GridSearchCV with StratifiedKFold (preserves class ratio in folds).
    - Optimize for F1-score (balances precision and recall).      
- Model Evaluation
  - Confusion Matrix: Understand FN (missed responders) and FP (extra campaign costs).
  - Metrics:
    - Precision: Of those predicted to respond, how many did.
    - Recall: Of actual responders, how many were found.
    - F1-score: Balance between precision and recall.
    - ROC-AUC: Class separation ability.
    - PR-AUC: Better for imbalanced problems.
  - Threshold tuning:
    - Default decision threshold is 0.5.
    - Adjust based on business priorities (e.g., set to 0.3 to increase recall).
- Deployment Plan
  - Integrate model into campaign management system.
  - Output a ranked list of customers with probability scores.
  - Let marketing choose a cutoff based on budget and response rate goals.
- Monitoring
  - Track actual campaign performance vs predicted probabilities.
  - Watch for data drift (e.g., change in purchase behavior over time).
  - Schedule retraining every quarter or after major market changes.
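
The plan above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not a production implementation: the customer data is replaced by a synthetic dataset from `make_classification` with roughly 5% positives, class imbalance is handled with `class_weight` rather than SMOTE (which would need the separate imbalanced-learn package), and the 0.3 threshold is only the example value mentioned in the plan.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer data: about 5% responders (class 1)
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=500)),
])

param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
    "clf__class_weight": ["balanced", None],
}

# Stratified folds preserve the 95/5 class ratio; F1 is the tuning objective
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="f1")
search.fit(X_train, y_train)

# Evaluate on the held-out test set using predicted probabilities
proba = search.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, proba)
pr_auc = average_precision_score(y_test, proba)

# Threshold tuning: lowering the cutoff trades precision for recall
recall_default = recall_score(y_test, (proba >= 0.5).astype(int))
recall_lowered = recall_score(y_test, (proba >= 0.3).astype(int))

print("Best parameters:", search.best_params_)
print(f"ROC-AUC: {roc_auc:.3f}  PR-AUC (average precision): {pr_auc:.3f}")
print(f"Recall at threshold 0.5: {recall_default:.3f}")
print(f"Recall at threshold 0.3: {recall_lowered:.3f}")
```

The probabilities in `proba` are also what the deployment step would sort to produce the ranked customer list, leaving the final cutoff to the marketing team.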