# Logistic Regression | Assignment

**Question 1: What is Logistic Regression, and how does it differ from Linear Regression?**

Logistic Regression is a statistical and machine learning method used to predict categorical outcomes, most commonly binary outcomes (0/1, Yes/No, True/False).

Examples:

*   Will a customer buy? (Yes/No)
*   Is an email spam? (Spam/Not Spam)
*   Will a loan be approved? (Approved/Rejected)

How it differs from Linear Regression:

| Feature        | **Linear Regression**                          | **Logistic Regression**                                 |
| -------------- | ---------------------------------------------- | ------------------------------------------------------- |
| Target         | Continuous value (e.g., price, temperature)    | Categorical outcome, most commonly binary (0/1)         |
| Output         | Any real number                                | Probability between 0 and 1, thresholded into a class   |
| Equation       | $Y = \beta_0 + \beta_1 X$                      | $P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$     |
| Fitted curve   | Straight line                                  | S-shaped (sigmoid) curve                                |
| Loss function  | Mean squared error                             | Log loss (binary cross-entropy)                         |
| Estimation     | Ordinary least squares                         | Maximum likelihood estimation                           |
| Interpretation | Coefficient = change in Y per unit change in X | Coefficient = change in log-odds per unit change in X   |
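
The contrast between the two models can be seen on a tiny example. The sketch below uses made-up data (hours studied vs. pass/fail, invented for illustration) and fits both models with scikit-learn; it is meant only to show that a linear regression fit can predict values outside [0, 1], while logistic regression returns probabilities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (made up for illustration): hours studied vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

# Predict for inputs inside and well outside the training range
X_new = np.array([[-5.0], [4.5], [20.0]])
linear_pred = linear.predict(X_new)                   # unbounded real values
logistic_prob = logistic.predict_proba(X_new)[:, 1]   # estimated P(y = 1)

print("Linear regression output:  ", np.round(linear_pred, 3))
print("Logistic regression P(y=1):", np.round(logistic_prob, 3))
```

The linear model extrapolates below 0 and above 1 for the extreme inputs, which cannot be read as probabilities, whereas the logistic model stays within the valid range.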


**Question 2: Explain the role of the Sigmoid function in Logistic Regression.**

The Sigmoid function (also called the logistic function) converts the linear output of Logistic Regression into a probability between 0 and 1.

Why do we need the Sigmoid function?

The linear equation in logistic regression,

$$z = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$$

can produce values from $-\infty$ to $+\infty$.

But classification needs probabilities, which must be between 0 and 1.
This is where the sigmoid function comes in.

Sigmoid Function Formula

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This transforms any value of $z$ into a probability $p = \sigma(z)$ with

$$0 < p < 1$$

Role of Sigmoid Function (Key Points)

1. Converts linear output to probability

The sigmoid maps any number to a value between 0 and 1, making it ideal for binary classification.

2. Helps in setting decision boundaries

If p ≥ 0.5 → Class = 1

If p < 0.5 → Class = 0

3. Ensures smooth gradient for optimization

The sigmoid function is differentiable, enabling gradient descent to learn model parameters efficiently.

4. Creates the S-shaped curve

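The curve makes the predicted probability a non-linear function of the features, while the log-odds $\log\frac{p}{1-p} = z$ remain linear in them. The decision boundary ($z = 0$, i.e. $p = 0.5$) is therefore a straight line or hyperplane in feature space.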
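
The key points above can be sketched in a few lines of NumPy. The coefficients `beta_0` and `beta` below are hypothetical values chosen only for illustration, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration
beta_0 = -1.0
beta = np.array([2.0, -0.5])

X = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [3.0, 2.0],
])

z = beta_0 + X @ beta              # linear scores (log-odds), any real number
p = sigmoid(z)                     # probabilities strictly between 0 and 1
labels = (p >= 0.5).astype(int)    # default 0.5 threshold

print("z      =", z)
print("p      =", np.round(p, 4))
print("labels =", labels)
```

Because $\sigma(0) = 0.5$, thresholding the probability at 0.5 is the same as checking whether the linear score $z$ is non-negative.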

**Question 3: What is Regularization in Logistic Regression and why is it needed?**

Regularization is a technique used to prevent overfitting in machine learning models, including Logistic Regression. It works by adding a penalty term to the loss function, discouraging the model from fitting too closely to the training data (i.e., keeping the model simpler and more generalizable).

Why is Regularization Needed?

Without regularization:

*   The model may learn noise or irrelevant patterns in the training data.
*   This leads to high accuracy on training data but poor performance on unseen data (overfitting).

With regularization:

*   The model is penalized for using large weights, which often indicate over-reliance on specific features.
*   It encourages simpler models that generalize better.

Types of Regularization in Logistic Regression

| **Regularization Type**       | **Penalty Term**                                                             | **Effect on Model**                                               | **When to Use**                                                           |
| ----------------------------- | ---------------------------------------------------------------------------- | ----------------------------------------------------------------- | ------------------------------------------------------------------------- |
| **L1 Regularization (Lasso)** | $\lambda \sum \lvert w_i \rvert$                                             | Shrinks some weights to **zero** → performs **feature selection** | When you want a **sparse model** or want to remove irrelevant features    |
| **L2 Regularization (Ridge)** | $\lambda \sum w_i^2$                                                         | Shrinks weights but **never to zero** → reduces overfitting       | When all features are important but the model is overfitting              |
| **Elastic Net**               | $\lambda \left(\alpha \sum \lvert w_i \rvert + (1-\alpha) \sum w_i^2\right)$ | Combination of L1 + L2 → balances feature selection + stability   | When you want both **feature selection** and **regularization stability** |
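
The sparsity difference between L1 and L2 in the table can be checked empirically. The following is a minimal sketch on scikit-learn's breast cancer dataset; the value `C=0.1` (scikit-learn's `C` is the inverse of the penalty strength $\lambda$) is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# Penalties act on coefficient size, so put features on a comparable scale first
X_scaled = StandardScaler().fit_transform(X)

# C = 0.1 is an illustrative value; smaller C means a stronger penalty
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X_scaled, y)
l2_model = LogisticRegression(penalty="l2", C=0.1, solver="liblinear").fit(X_scaled, y)

n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print(f"L1: {n_zero_l1} of {l1_model.coef_.size} coefficients are exactly zero")
print(f"L2: {n_zero_l2} of {l2_model.coef_.size} coefficients are exactly zero")
```

With the L1 penalty a number of coefficients are driven to exactly zero, while the L2 model keeps all features with smaller weights.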


**Question 4: What are some common evaluation metrics for classification models, and why are they important?**

Evaluation metrics help us measure the performance of a classification model. They are crucial because they show how well your model is working, and different metrics are useful depending on the problem context (e.g., fraud detection vs spam filtering).

Common Evaluation Metrics for Classification:

| **Metric**               | **Formula / Definition**                                                     | **What It Measures**                           | **Why It’s Important**                                                 |
| ------------------------ | ---------------------------------------------------------------------------- | ---------------------------------------------- | ---------------------------------------------------------------------- |
| **Accuracy**             | $(TP + TN) / (TP + TN + FP + FN)$                                            | Overall correctness                            | Good for balanced datasets                                             |
| **Precision**            | $TP / (TP + FP)$                                                             | Share of predicted positives that are correct  | Important when **False Positives** are costly (e.g., spam filtering)   |
| **Recall (Sensitivity)** | $TP / (TP + FN)$                                                             | Share of actual positives that are detected    | Important when **False Negatives** are risky (e.g., disease detection) |
| **F1-Score**             | $2 \cdot Precision \cdot Recall / (Precision + Recall)$ (harmonic mean)      | Balance between precision and recall           | Useful for **imbalanced datasets** where both error types matter       |
| **Confusion Matrix**     | Table of TP, TN, FP, FN                                                      | Detailed prediction breakdown                  | Helps understand the type of errors                                    |
| **AUC-ROC**              | Area under the ROC curve (true positive rate vs. false positive rate)        | Class separation ability across all thresholds | Helps compare models independently of a single threshold               |

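Confusion Matrix Terms:

|                     | Predicted Positive  | Predicted Negative  |
| ------------------- | ------------------- | ------------------- |
| **Actual Positive** | True Positive (TP)  | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN)  |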
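
To connect the formulas with the confusion matrix terms, the sketch below computes the metrics by hand from TP, TN, FP, and FN and compares them with scikit-learn's functions. The labels are made up for illustration. Note that scikit-learn orders the matrix by label value, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`, which differs from the table above (positive class first).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For labels (0, 1), ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")

# The hand-computed values agree with scikit-learn's implementations
sk_precision = precision_score(y_true, y_pred)
sk_recall = recall_score(y_true, y_pred)
sk_f1 = f1_score(y_true, y_pred)
```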

Why These Metrics Are Important:


1.   Accuracy alone can be misleading in imbalanced datasets (e.g., cancer detection where 99% are healthy).
2.   Precision vs Recall trade-off helps you choose what error type is more acceptable.
3. F1 Score balances both precision and recall for uneven classes.
4. ROC-AUC gives an overall performance measure, independent of threshold.
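
The first point can be made concrete with a small sketch. It builds made-up labels in which 99% of cases are negative and scores a trivial "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Made-up labels: 990 healthy (0) and 10 sick (1) patients
y_true = np.array([0] * 990 + [1] * 10)

# A trivial "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.2f}")  # high, although no positive case is found
print(f"Recall:   {recall:.2f}")
print(f"F1-score: {f1:.2f}")
```

Accuracy is 0.99 even though the model never detects a single positive case, which recall and F1 (both 0) make visible.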

Example:
In spam detection:

*   High precision = fewer legitimate emails marked as spam (low FP).
*   High recall = fewer spam emails go undetected (low FN).
*   F1 score helps balance both if needed.




**Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.**

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

# Load breast cancer dataset (bundled with scikit-learn; it stands in for a CSV file here)
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into train/test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features: fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)

# Predict and calculate accuracy
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy
print(f"Test Accuracy: {accuracy:.4f}")
```

```text
Test Accuracy: 0.9737
```
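
Since the question asks for a CSV file, the following is a minimal sketch of the same workflow starting from `pd.read_csv`. The file name `breast_cancer.csv` is hypothetical; the example writes the scikit-learn dataset to a temporary CSV first so that it stays self-contained, whereas in practice the file would already exist.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "breast_cancer.csv"  # hypothetical file name

    # Create the CSV so the example is self-contained
    load_breast_cancer(as_frame=True).frame.to_csv(csv_path, index=False)

    # Load the CSV file into a Pandas DataFrame
    df = pd.read_csv(csv_path)

# Separate the features from the label column ("target" in this dataset)
X = df.drop(columns="target")
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

csv_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test Accuracy: {csv_accuracy:.4f}")
```

For a different CSV file, only the path and the name of the label column should need to change.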


**Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.**

```python
# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model using L2 regularization (Ridge)
model = LogisticRegression(penalty='l2', C=1.0, solver='liblinear', max_iter=1000)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Print model coefficients
print("Model Coefficients (L2 Regularization):\n")
for feature, coef in zip(X.columns, model.coef_[0]):
    print(f"{feature}: {coef:.4f}")

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")
```

```text
Model Coefficients (L2 Regularization):

mean radius: 2.1325
mean texture: 0.1528
mean perimeter: -0.1451
mean area: -0.0008
mean smoothness: -0.1426
mean compactness: -0.4156
mean concavity: -0.6519
mean concave points: -0.3445
mean symmetry: -0.2076
mean fractal dimension: -0.0298
radius error: -0.0500
texture error: 1.4430
perimeter error: -0.3039
area error: -0.0726
smoothness error: -0.0162
compactness error: -0.0019
concavity error: -0.0449
concave points error: -0.0377
symmetry error: -0.0418
fractal dimension error: 0.0056
worst radius: 1.2321
worst texture: -0.4046
worst perimeter: -0.0362
worst area: -0.0271
worst smoothness: -0.2626
worst compactness: -1.2090
worst concavity: -1.6180
worst concave points: -0.6153
worst symmetry: -0.7428
worst fractal dimension: -0.1170

Model Accuracy: 0.9561
```
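
To see what the L2 penalty does to the coefficients, the sketch below refits the model for a few values of `C` and reports the overall size (Euclidean norm) of the coefficient vector. The `C` values are arbitrary illustrative choices, and the features are standardized here (unlike the cell above) so that the penalty treats all of them on the same scale.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:  # illustrative values; smaller C = stronger L2 penalty
    model = LogisticRegression(penalty="l2", C=C, solver="liblinear", max_iter=1000)
    model.fit(X_scaled, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<6} -> coefficient norm = {coef_norms[C]:.3f}")
```

The norm grows as `C` increases, i.e. as the penalty weakens, which is the shrinkage effect described for Ridge regularization.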


**Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report.**

```python
# Import required libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load the iris dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Logistic Regression model for multiclass classification using OvR
model = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=1000)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Print the classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))
```

```text
Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
```
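
Recent scikit-learn releases (1.5 onward, as far as the release notes indicate) deprecate the `multi_class` argument of `LogisticRegression`. A version-independent way to get one-vs-rest behaviour is to wrap the estimator in `OneVsRestClassifier`. The sketch below is an alternative under that assumption, not a replacement for the assignment's required call.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# One binary logistic regression is trained per class (class k vs. the rest)
ovr_model = OneVsRestClassifier(LogisticRegression(solver="liblinear", max_iter=1000))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
ovr_accuracy = accuracy_score(y_test, y_pred)

print(f"Number of binary classifiers: {len(ovr_model.estimators_)}")
print(classification_report(y_test, y_pred, target_names=data.target_names))
```

The fitted wrapper exposes the per-class binary models in `estimators_`, which makes the one-vs-rest structure explicit.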





**Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy.**

```python
import numpy as np
from sklearn.datasets import load_breast_cancer  # Using Breast Cancer dataset
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset from sklearn
print("Loading Breast Cancer dataset from sklearn...")
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

print(f"Dataset shape: {X.shape}")
print(f"Classes: {cancer.target_names}")
print(f"Class distribution: {np.bincount(y)}")

# Split into train/test sets (80/20, stratified to preserve the class ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features (important for regularization)
print("\nStandardizing features...")
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set shape: {X_train_scaled.shape}")
print(f"Test set shape: {X_test_scaled.shape}")

# Hyperparameter grid: inverse regularization strength C and penalty type
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

# liblinear supports both the 'l1' and 'l2' penalties
model = LogisticRegression(solver='liblinear', random_state=42, max_iter=1000)

print("\nPerforming GridSearchCV...")
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
# Fit GridSearchCV on the scaled training data
grid_search.fit(X_train_scaled, y_train)

print("GridSearchCV complete.\n")

# Print the best parameters found
print("Best parameters found by GridSearchCV:")
print(grid_search.best_params_)

# Print the best cross-validation score (accuracy)
print("\nBest cross-validation accuracy:")
print(f"{grid_search.best_score_:.4f}")

# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
y_pred_test = best_model.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_pred_test)

print("\nTest set accuracy with best parameters:")
print(f"{test_accuracy:.4f} ({test_accuracy*100:.2f}%)")
```

```text
Loading Breast Cancer dataset from sklearn...
Dataset shape: (569, 30)
Classes: ['malignant' 'benign']
Class distribution: [212 357]

Standardizing features...
Training set shape: (455, 30)
Test set shape: (114, 30)

Performing GridSearchCV...
GridSearchCV complete.

Best parameters found by GridSearchCV:
{'C': 1, 'penalty': 'l2'}

Best cross-validation accuracy:
0.9802

Test set accuracy with best parameters:
0.9825 (98.25%)
```
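
In the program above the scaler is fitted on the full training set before cross-validation, so each validation fold has already influenced the scaling statistics. One common way to avoid this is to put the scaler and the model in a `Pipeline` and tune the pipeline itself. The following is a sketch of that variant; the step names `scaler` and `clf` and the reduced grid are choices made here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is refitted inside every cross-validation fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validation accuracy: {search.best_score_:.4f}")
print(f"Test set accuracy: {search.score(X_test, y_test):.4f}")
```

On a dataset this small the difference is usually minor, but the pipeline form keeps the cross-validation estimate free of preprocessing leakage.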


**Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.**

```python
import numpy as np
from sklearn.datasets import load_breast_cancer  # Using Breast Cancer dataset
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset from sklearn
print("Loading Breast Cancer dataset from sklearn...")
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

print(f"Dataset shape: {X.shape}")
print(f"Classes: {cancer.target_names}")
print(f"Class distribution: {np.bincount(y)}")

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\nTraining set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

# --- Train model WITHOUT scaling ---
print("\nTraining Logistic Regression model WITHOUT scaling...")
# liblinear is a reasonable solver choice for a small dataset like this one
model_no_scale = LogisticRegression(random_state=42, max_iter=1000, solver='liblinear')
model_no_scale.fit(X_train, y_train)
y_pred_no_scale = model_no_scale.predict(X_test)
accuracy_no_scale = accuracy_score(y_test, y_pred_no_scale)

print(f"Accuracy WITHOUT scaling: {accuracy_no_scale:.4f} ({accuracy_no_scale*100:.2f}%)")

# --- Train model WITH scaling ---
print("\nStandardizing features...")
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Training Logistic Regression model WITH scaling...")
model_scaled = LogisticRegression(random_state=42, max_iter=1000, solver='liblinear')
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy WITH scaling:    {accuracy_scaled:.4f} ({accuracy_scaled*100:.2f}%)")

# --- Comparison ---
print("\nComparison of Accuracy:")
print(f"  Without Scaling: {accuracy_no_scale:.4f}")
print(f"  With Scaling:    {accuracy_scaled:.4f}")
```

```text
Loading Breast Cancer dataset from sklearn...
Dataset shape: (569, 30)
Classes: ['malignant' 'benign']
Class distribution: [212 357]

Training set shape: (455, 30)
Test set shape: (114, 30)

Training Logistic Regression model WITHOUT scaling...
Accuracy WITHOUT scaling: 0.9561 (95.61%)

Standardizing features...
Training Logistic Regression model WITH scaling...
Accuracy WITH scaling:    0.9825 (98.25%)

Comparison of Accuracy:
  Without Scaling: 0.9561
  With Scaling:    0.9825
```
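
A single 80/20 split gives only one estimate of each accuracy, and on a 114-sample test set a gap of a few points corresponds to a handful of predictions. As a hedged complement, the sketch below repeats the comparison with 5-fold stratified cross-validation; the fold count and random seed are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

unscaled = LogisticRegression(solver="liblinear", max_iter=1000)
# The pipeline refits the scaler on the training part of every fold
scaled = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", max_iter=1000),
)

scores_unscaled = cross_val_score(unscaled, X, y, cv=cv, scoring="accuracy")
scores_scaled = cross_val_score(scaled, X, y, cv=cv, scoring="accuracy")

print(f"Without scaling: {scores_unscaled.mean():.4f} (+/- {scores_unscaled.std():.4f})")
print(f"With scaling:    {scores_scaled.mean():.4f} (+/- {scores_scaled.std():.4f})")
```

Reporting the mean and spread across folds makes it easier to judge whether the improvement from scaling is consistent rather than an artifact of one split.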


**Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.**

**Approach to Building a Logistic Regression Model for Imbalanced Data (E-commerce Marketing Campaign)**

Given an imbalanced dataset where only 5% of customers respond to a marketing campaign, building a robust Logistic Regression model requires careful consideration of data handling, class imbalance, and evaluation. Here’s a step-by-step approach:

1. Data Handling and Preprocessing:

*   **Data Loading and Exploration**: Load the customer data (features like purchase history, demographics, browsing behavior, past campaign interactions, etc.) into a Pandas DataFrame. Perform initial exploratory data analysis (EDA) to understand feature distributions, identify missing values, and analyze the class distribution (confirm the 5% response rate).
*   **Feature Engineering**: Create new features that might be predictive of response. This could include:
    *   Recency, Frequency, Monetary (RFM) values.
    *   Number of visits in the last X days.
    *   Time spent on site.
    *   Category preferences.
    *   Interaction counts with previous campaigns.
*   **Handling Missing Values**: Impute or remove missing values based on their extent and the nature of the feature.
*   **Encoding Categorical Features**: Convert categorical variables (e.g., gender, location) into numerical formats using techniques like One-Hot Encoding.
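
The preprocessing steps above can be sketched with a `ColumnTransformer`. The customer table, its column names, and the choice of median imputation are all invented here for illustration; a real dataset would need its own column lists and imputation strategy.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical customer table; columns and values are made up
customers = pd.DataFrame({
    "recency_days": [5, 40, np.nan, 12],
    "order_count": [10, 1, 3, np.nan],
    "total_spend": [520.0, 35.5, 120.0, 80.0],
    "region": ["north", "south", "south", "north"],
})

numeric_cols = ["recency_days", "order_count", "total_spend"]
categorical_cols = ["region"]

preprocess = ColumnTransformer(
    transformers=[
        # Numeric features: fill missing values with the median, then standardize
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # Categorical features: one-hot encode, ignoring categories unseen during fit
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(customers)
print(features.shape)  # 3 numeric columns + 2 one-hot columns
```

In a full workflow the transformer would be fitted on the training split only and combined with the classifier in one pipeline.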

2. Feature Scaling:

*   **Standardization or Normalization**: Regularized Logistic Regression penalizes coefficient size, and its iterative solvers converge faster on comparably scaled inputs, so the numerical features should be scaled. StandardScaler (z-score standardization) or MinMaxScaler are common choices. Split the data first, fit the scaler on the training set only, and reuse it to transform the validation and test sets; this prevents information from the test set leaking into training.

3. Handling Class Imbalance:

This is a critical step for imbalanced datasets. Directly training on the imbalanced data will likely result in a model that predicts the majority class (non-responders) most of the time, leading to high accuracy but poor performance on the minority class (responders), which is the class of interest. Techniques include:

*   **Resampling Techniques** (applied to the training data only, never to the validation or test sets):
    *   **Oversampling the Minority Class**: Duplicate instances of the minority class (responders) to increase their representation, or use SMOTE (Synthetic Minority Oversampling Technique), which creates synthetic minority samples instead of exact copies.
    *   **Undersampling the Majority Class**: Randomly remove instances of the majority class (non-responders). This can discard valuable information.
    *   **Combination Approaches**: Techniques like SMOTEENN or SMOTETomek combine oversampling and undersampling.
*   **Using Class Weights**: Logistic Regression models in libraries like scikit-learn allow assigning higher weights to the minority class during training. This tells the model to penalize misclassifications of the minority class more heavily. This is often simpler than resampling and tends to perform comparably well.
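
As a minimal sketch of the class-weight option, the code below first shows the weights that `class_weight='balanced'` corresponds to for a 95/5 split, then compares recall with and without weighting. The data comes from `make_classification` and is a synthetic stand-in for the campaign data, not a real customer dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights follow n_samples / (n_classes * class_count)
y_demo = np.array([0] * 950 + [1] * 50)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_demo)
print("Balanced class weights:", np.round(weights, 3))

# Synthetic stand-in for the campaign data: roughly 5% responders
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))

print(f"Recall without class weights: {recall_plain:.3f}")
print(f"Recall with class weights:    {recall_weighted:.3f}")
```

Weighting typically raises recall on the minority class at the cost of more false positives, so precision should be checked alongside it.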

4. Model Training:

*   **Splitting Data**: Split the preprocessed data into training, validation (optional but recommended), and testing sets before any resampling, so that only the training portion is ever rebalanced. Common choices are a 70/15/15 train/validation/test split, or an 80/20 train/test split with cross-validation on the training portion during hyperparameter tuning. Stratify the split to maintain the class distribution in each set.
*   **Logistic Regression Model**: Initialize a LogisticRegression model from scikit-learn.

5. Hyperparameter Tuning:

*   **Parameters to Tune**: Key hyperparameters for Logistic Regression include:
    *   C: The inverse of regularization strength. Smaller values mean stronger regularization (L2 by default), which helps prevent overfitting.
    *   penalty: 'l1' or 'l2' regularization. L1 can lead to sparser coefficients (feature selection), while L2 shrinks coefficients.
    *   solver: The algorithm to use for optimization (e.g., 'liblinear', 'lbfgs', 'saga'). Not every solver supports every penalty: 'lbfgs' supports only 'l2', while 'liblinear' and 'saga' also support 'l1'.
    *   class_weight: Use 'balanced' to automatically adjust weights inversely proportional to class frequencies, or provide a dictionary of weights.
*   **Tuning Method**: Use techniques like GridSearchCV or RandomizedSearchCV with cross-validation on the training (or training + validation) data to find the combination of hyperparameters that maximizes a suitable evaluation metric, such as F1 or average precision rather than accuracy.
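
The tuning step might look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the campaign data, and the grid values, the step names `scaler` and `clf`, and the choice of `average_precision` as the scoring metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data: roughly 5% responders
X, y = make_classification(
    n_samples=3000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
])

param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
    "clf__class_weight": [None, "balanced"],
}

# Stratified folds keep the ~5% response rate in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="average_precision", cv=cv)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated average precision: {search.best_score_:.4f}")
```

Scoring on average precision instead of accuracy keeps the search focused on how well responders are ranked, which matters more than overall correctness at a 5% response rate.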

6. Model Evaluation:

*   **Choosing Appropriate Metrics**: Since the dataset is imbalanced, accuracy is not a good primary evaluation metric. A model that predicts 'non-responder' for all customers would have 95% accuracy. Focus on metrics that are sensitive to the performance on the minority class:
    *   Precision: Out of all customers predicted as responders, how many actually responded? High precision is important if false positives (contacting non-responders) are costly.
    *   Recall (Sensitivity): Out of all actual responders, how many did the model correctly identify? High recall is important if false negatives (missing potential responders) are costly. In a marketing campaign, recall is often very important to capture as many potential responders as possible, even if it means contacting some non-responders.
    *   F1-Score: The harmonic mean of precision and recall. Useful when both error types matter and a single balanced number is needed.
    *   ROC-AUC: Measures the model's ability to distinguish between the positive and negative classes across different probability thresholds. A higher AUC indicates better discriminative power. With only 5% positives, the area under the precision-recall curve (average precision) is often more informative, because ROC-AUC can look optimistic under heavy imbalance.
    *   Confusion Matrix: Provides a detailed breakdown of true positives, true negatives, false positives, and false negatives.
*   **Evaluation on Test Set**: Evaluate the best model found during hyperparameter tuning on the held-out test set using the chosen metrics. The test set provides an unbiased estimate of the model's performance on unseen data.
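
A sketch of this evaluation step is shown below. The data is synthetic (`make_classification` with about 5% positives) and the model settings are illustrative; the class names passed to the report are labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the campaign data: roughly 5% responders
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                # hard labels at the 0.5 threshold
y_score = model.predict_proba(X_test)[:, 1]   # predicted probability of responding

cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(
    y_test, y_pred, target_names=["non-responder", "responder"], zero_division=0
))

# Threshold-independent metrics are computed from probabilities, not hard labels
roc_auc = roc_auc_score(y_test, y_score)
pr_auc = average_precision_score(y_test, y_score)
print(f"ROC-AUC:           {roc_auc:.4f}")
print(f"Average precision: {pr_auc:.4f}")
```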

7. Threshold Adjustment:

*   **Optimizing for the Business Goal**: The default probability threshold for classification is 0.5. However, in an imbalanced scenario, you can adjust this threshold based on the business objective.
    *   If you prioritize recall (finding more responders), you might lower the threshold.
    *   If you prioritize precision (minimizing contact with non-responders), you might raise the threshold.
    *   Analyze the Precision-Recall curve or ROC curve to determine the optimal threshold that balances the trade-off between precision and recall for the specific campaign goals (e.g., maximizing the number of responders contacted while keeping the cost of contacting non-responders manageable).
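
One way to pick a threshold from the Precision-Recall curve is sketched below: among all thresholds that reach a target recall, choose the one with the highest precision. The 80% recall target is a hypothetical business requirement, the data is synthetic, and the threshold is chosen on a validation split so that the test set stays untouched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the campaign data: roughly 5% responders
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=7,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_scores = model.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, val_scores)

target_recall = 0.80  # hypothetical business requirement

# precision/recall have one more entry than thresholds; drop the final point
meets_target = recall[:-1] >= target_recall
# Among thresholds that reach the target recall, take the most precise one
best_idx = int(np.argmax(np.where(meets_target, precision[:-1], -1.0)))
chosen_threshold = float(thresholds[best_idx])

y_pred = (val_scores >= chosen_threshold).astype(int)
achieved_recall = recall_score(y_val, y_pred)
achieved_precision = precision_score(y_val, y_pred, zero_division=0)

print(f"Chosen threshold: {chosen_threshold:.3f}")
print(f"Recall:    {achieved_recall:.3f}")
print(f"Precision: {achieved_precision:.3f}")
```

The same pattern works for a precision target or for a cost-based rule, for example weighting the cost of a wasted contact against the value of a converted responder.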
  















