# Logistic Regression

1. What is Logistic Regression, and how does it differ from Linear
Regression?

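- Logistic Regression is a supervised machine learning algorithm for classification problems, where the outcome is categorical, such as "yes/no," "pass/fail," or "true/false." It models the probability of the positive class by passing a linear combination of the features through the sigmoid function.
- Linear Regression, by contrast, is used for regression: it predicts a continuous, unbounded value (like a stock price) and is typically fitted by minimizing squared error. Logistic Regression outputs a probability between 0 and 1, which is then thresholded into a class label, and is fitted by minimizing log loss.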
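
To make the contrast concrete, here is a minimal sketch on a made-up toy dataset (hours studied versus pass/fail; the numbers are illustrative assumptions, not real data). It fits both models to the same binary labels and shows that Linear Regression can return values below 0 or above 1, while Logistic Regression always returns a valid probability.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: hours studied (feature) and pass/fail (label)
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lin = LinearRegression().fit(hours, passed)
log = LogisticRegression().fit(hours, passed)

# Predict well outside the training range
new_hours = np.array([[0], [12]])
lin_pred = lin.predict(new_hours)              # unbounded real values
log_prob = log.predict_proba(new_hours)[:, 1]  # probability of class 1

print("Linear Regression output:   ", np.round(lin_pred, 3))
print("Logistic Regression P(pass):", np.round(log_prob, 3))
```

The linear model's outputs at 0 and 12 hours fall outside the [0, 1] interval, so they cannot be read as probabilities, whereas the logistic model's outputs can.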

2. Explain the role of the Sigmoid function in Logistic Regression.

- Probability Estimation: The primary role is to map the raw, continuous output of the linear model (the logit, z = w·x + b) into a probability. Since probabilities must lie between 0 and 1, the sigmoid function σ(z) = 1 / (1 + e^(-z)) is a natural fit: its output is always strictly within this range, regardless of the input.
- Modeling Binary Classification: For binary classification (e.g., predicting "yes" or "no"), the sigmoid function's output is the predicted probability that the input belongs to the positive class.
- Class Assignment: A threshold (commonly 0.5) is applied to the sigmoid's output. If the predicted probability is above the threshold, the instance is classified as belonging to Class 1; otherwise, it's assigned to Class 0.
- Squashing Function: The sigmoid function acts as a "squashing function" that compresses the entire range of real numbers into the narrow range of 0 to 1, preventing extreme output values that would be problematic for a probability.
- Enabling Differentiability: The sigmoid function is differentiable, meaning its derivative can be easily calculated. This is crucial for training the logistic regression model using gradient descent, an optimization algorithm that requires gradients to adjust the model's weights.
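
The points above can be sketched in a few lines of NumPy. This is an illustrative implementation rather than what any particular library does internally; the function names `sigmoid` and `sigmoid_derivative` and the sample logits are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued logits to probabilities in (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical logits produced by a linear model
logits = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(logits)

# Class assignment: probability strictly above 0.5 means Class 1
labels = (probs > 0.5).astype(int)

print("Probabilities:", np.round(probs, 3))
print("Labels:       ", labels)
print("Gradient at 0:", sigmoid_derivative(0.0))
```

The derivative has the simple closed form σ(z)(1 − σ(z)), which is what makes gradient-based training cheap to compute.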

3. What is Regularization in Logistic Regression and why is it needed?

- Regularization in Logistic Regression is a technique that adds a penalty term to the model's objective function, preventing it from becoming too complex by shrinking the feature coefficients.
- It is needed to combat overfitting, where a model learns the noise in the training data too well, leading to poor performance on new, unseen data. By adding this penalty, regularization creates a simpler model that has a better ability to generalize and make accurate predictions on new datasets.
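
As a rough illustration of the shrinking effect, the sketch below fits scikit-learn's `LogisticRegression` (L2 penalty by default) on the built-in breast cancer dataset with three values of `C`, the inverse regularization strength. The chosen `C` values are arbitrary, and the whole dataset is used only to compare coefficient sizes, not to measure accuracy.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    # Smaller C means a stronger penalty on the coefficients
    model = LogisticRegression(C=C, max_iter=5000)
    model.fit(X_scaled, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<6} L2 norm of coefficients: {coef_norms[C]:.3f}")
```

The coefficient norm grows as `C` increases, that is, as the penalty weakens.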

4. What are some common evaluation metrics for classification models, and
why are they important?

- Common classification evaluation metrics include Accuracy, Precision, Recall (Sensitivity), F1-Score, Specificity, the Confusion Matrix, and AUC-ROC.
- These metrics are important because they quantify a model's performance, allowing for informed decisions on model selection, tuning, and ensuring the model meets the specific goals and requirements of a project, especially in cases of imbalanced datasets where accuracy alone can be misleading.
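
A small worked example may help show how these metrics relate to the confusion matrix. The labels below are hand-made for illustration (3 true positives, 1 false negative, 1 false positive, 5 true negatives); only standard `sklearn.metrics` functions are used.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hand-made labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
specificity = tn / (tn + fp)                 # TN / (TN + FP)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"f1={f1:.3f} specificity={specificity:.3f}")
```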

5. Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification # For creating a sample dataset

# Create a small synthetic dataset and save it as a CSV so the example is self-contained
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
df_sample = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(10)])
df_sample['target'] = y
df_sample.to_csv('sample_data.csv', index=False)

try:
    df = pd.read_csv('sample_data.csv')
except FileNotFoundError:
    # Stop with a clear message; exit() is intended for interactive sessions only
    raise SystemExit("Error: 'sample_data.csv' not found. Please ensure the CSV file exists in the working directory.")

X = df.drop('target', axis=1)  # Assuming 'target' is the name of your target column
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Model Accuracy: {accuracy:.4f}")

Logistic Regression Model Accuracy: 0.8467


6. Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.
(Use Dataset from sklearn package)

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(penalty='l2', C=1.0, solver='liblinear', max_iter=1000, random_state=42)

model.fit(X_train, y_train)

print("Model Coefficients:")
print(model.coef_)
print("\nModel Intercept:")
print(model.intercept_)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")

Model Coefficients:
[[ 2.13248406e+00  1.52771940e-01 -1.45091255e-01 -8.28669349e-04
  -1.42636015e-01 -4.15568847e-01 -6.51940282e-01 -3.44456106e-01
  -2.07613380e-01 -2.97739324e-02 -5.00338038e-02  1.44298427e+00
  -3.03857384e-01 -7.25692126e-02 -1.61591524e-02 -1.90655332e-03
  -4.48855442e-02 -3.77188737e-02 -4.17516190e-02  5.61347410e-03
   1.23214996e+00 -4.04581097e-01 -3.62091502e-02 -2.70867580e-02
  -2.62630530e-01 -1.20898539e+00 -1.61796947e+00 -6.15250835e-01
  -7.42763610e-01 -1.16960181e-01]]

Model Intercept:
[0.40847797]

Model Accuracy: 0.9561


7. Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)

In [3]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

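# Note: `multi_class` is deprecated as of scikit-learn 1.5. On newer versions,
# sklearn.multiclass.OneVsRestClassifier(LogisticRegression(...)) gives the same one-vs-rest scheme.
model = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=1000)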
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))

Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30





8. Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation accuracy. (Use Dataset from sklearn package)


In [4]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer # Using Breast Cancer dataset
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

print("Loading Breast Cancer dataset from sklearn...")
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

print(f"Dataset shape: {X.shape}")
print(f"Classes: {cancer.target_names}")
print(f"Class distribution: {np.bincount(y)}")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("\nStandardizing features...")
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set shape: {X_train_scaled.shape}")
print(f"Test set shape: {X_test_scaled.shape}")


param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

model = LogisticRegression(solver='liblinear', random_state=42, max_iter=1000)

print("\nPerforming GridSearchCV...")
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

grid_search.fit(X_train_scaled, y_train)

print("GridSearchCV complete.\n")

print("Best parameters found by GridSearchCV:")
print(grid_search.best_params_)

print("\nBest cross-validation accuracy:")
print(f"{grid_search.best_score_:.4f}")

best_model = grid_search.best_estimator_
y_pred_test = best_model.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_pred_test)

print("\nTest set accuracy with best parameters:")
print(f"{test_accuracy:.4f} ({test_accuracy*100:.2f}%)")

from sklearn.metrics import classification_report
print("\nClassification Report on Test Set with Best Model:")
print(classification_report(y_test, y_pred_test, target_names=cancer.target_names))

Loading Breast Cancer dataset from sklearn...
Dataset shape: (569, 30)
Classes: ['malignant' 'benign']
Class distribution: [212 357]

Standardizing features...
Training set shape: (455, 30)
Test set shape: (114, 30)

Performing GridSearchCV...
GridSearchCV complete.

Best parameters found by GridSearchCV:
{'C': 1, 'penalty': 'l2'}

Best cross-validation accuracy:
0.9802

Test set accuracy with best parameters:
0.9825 (98.25%)

Classification Report on Test Set with Best Model:
              precision    recall  f1-score   support

   malignant       0.98      0.98      0.98        42
      benign       0.99      0.99      0.99        72

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114



9. Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling. (Use Dataset from sklearn package)

In [5]:

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer # Using Breast Cancer dataset
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

print("Loading Breast Cancer dataset from sklearn...")
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

print(f"Dataset shape: {X.shape}")
print(f"Classes: {cancer.target_names}")
print(f"Class distribution: {np.bincount(y)}")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\nTraining set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

print("\nTraining Logistic Regression model WITHOUT scaling...")
model_no_scale = LogisticRegression(random_state=42, max_iter=1000, solver='liblinear') # Use a suitable solver
model_no_scale.fit(X_train, y_train)
y_pred_no_scale = model_no_scale.predict(X_test)
accuracy_no_scale = accuracy_score(y_test, y_pred_no_scale)

print(f"Accuracy WITHOUT scaling: {accuracy_no_scale:.4f} ({accuracy_no_scale*100:.2f}%)")

print("\nStandardizing features...")
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Training Logistic Regression model WITH scaling...")
model_scaled = LogisticRegression(random_state=42, max_iter=1000, solver='liblinear')
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

print(f"Accuracy WITH scaling:    {accuracy_scaled:.4f} ({accuracy_scaled*100:.2f}%)")

print("\nComparison of Accuracy:")
print(f"  Without Scaling: {accuracy_no_scale:.4f}")
print(f"  With Scaling:    {accuracy_scaled:.4f}")

Loading Breast Cancer dataset from sklearn...
Dataset shape: (569, 30)
Classes: ['malignant' 'benign']
Class distribution: [212 357]

Training set shape: (455, 30)
Test set shape: (114, 30)

Training Logistic Regression model WITHOUT scaling...
Accuracy WITHOUT scaling: 0.9561 (95.61%)

Standardizing features...
Training Logistic Regression model WITH scaling...
Accuracy WITH scaling:    0.9825 (98.25%)

Comparison of Accuracy:
  Without Scaling: 0.9561
  With Scaling:    0.9825


10.  Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.


**Approach to Building a Logistic Regression Model for Imbalanced Data (E-commerce Marketing Campaign)**

Given an imbalanced dataset where only 5% of customers respond to a marketing campaign, building a robust Logistic Regression model requires careful consideration of data handling, class imbalance, and evaluation. Here’s a step-by-step approach:

1. Data Handling and Preprocessing:

  - **Data Loading and Exploration**: Load the customer data (features like purchase history, demographics, browsing behavior, past campaign interactions, etc.) into a Pandas DataFrame. Perform initial exploratory data analysis (EDA) to understand feature distributions, identify missing values, and analyze the class distribution (confirm the 5% response rate).
  - **Feature Engineering**: Create new features that might be predictive of response. This could include:
    - Recency, Frequency, Monetary (RFM) values.
    - Number of visits in the last X days.
    - Time spent on site.
    - Category preferences.
    - Interaction counts with previous campaigns.
  - **Handling Missing Values**: Impute or remove missing values based on their extent and the nature of the feature.
  - **Encoding Categorical Features**: Convert categorical variables (e.g., gender, location) into numerical formats using techniques like One-Hot Encoding.
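
The imputation and encoding steps can be sketched with a scikit-learn `ColumnTransformer`. The tiny DataFrame and its column names (`recency_days`, `total_spend`, `location`) are hypothetical stand-ins for real customer data, and median imputation is just one reasonable choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical customer records with a few missing numeric values
customers = pd.DataFrame({
    "recency_days": [5, 40, np.nan, 12],
    "total_spend": [250.0, 30.0, 75.0, np.nan],
    "location": ["north", "south", "south", "north"],
})

numeric_cols = ["recency_days", "total_spend"]
categorical_cols = ["location"]

preprocess = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="median"), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Columns: recency_days, total_spend, location_north, location_south
features = preprocess.fit_transform(customers)
print(features)
```

In a real project the transformer would be fitted on the training split only and then reused to transform the validation and test splits.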

2. Feature Scaling:

  - **Standardization or Normalization**: Regularized Logistic Regression is sensitive to feature scales, because the penalty treats all coefficients alike and gradient-based solvers converge more slowly on unscaled features. StandardScaler (z-score standardization) and MinMaxScaler are common choices. Split the data first, fit the scaler on the training set only, and then apply that fitted scaler to the test set, so that no test-set statistics leak into training.
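
One way to keep scaling leak-free is to put the scaler and the model in a single scikit-learn pipeline, so the scaler is only ever fitted on the data passed to `fit`. The sketch below uses a synthetic dataset from `make_classification` with roughly 5% positives as a stand-in for the campaign data (an assumption for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data (about 5% responders)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The scaler is fitted on X_train only, inside the pipeline
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

scaler = clf.named_steps["standardscaler"]
print("Scaler means match training means:",
      np.allclose(scaler.mean_, X_train.mean(axis=0)))

# predict() re-applies the training-set scaling to the test data
test_predictions = clf.predict(X_test)
```

The same pipeline object can be passed to cross-validation utilities, which then refit the scaler within each training fold.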

3. Handling Class Imbalance:

  - This is a critical step for imbalanced datasets. Directly training on the imbalanced data will likely result in a model that predicts the majority class (non-responders) most of the time, leading to high accuracy but poor performance on the minority class (responders), which is the class of interest. Techniques include:

  - **Resampling Techniques**:
    - **Oversampling the Minority Class**: Duplicate instances of the minority class (responders) to increase their representation. SMOTE (Synthetic Minority Oversampling Technique) is a popular method that creates synthetic samples of the minority class.
    - **Undersampling the Majority Class**: Randomly remove instances of the majority class (non-responders). This can lead to loss of valuable information.
    - **Combination Approaches**: Techniques like SMOTEENN or SMOTETomek combine oversampling and undersampling.
  - **Using Class Weights**: Logistic Regression models in libraries like scikit-learn allow assigning higher weights to the minority class during training. This tells the model to penalize misclassifications of the minority class more heavily. This is often simpler than resampling and frequently performs just as well.

**Choose the appropriate technique based on experimentation and cross-validation. Using class weights is often a good starting point.**
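
As a starting point, class weighting can be tried with very little code. The sketch below uses synthetic data with roughly 5% positives as a stand-in for the campaign dataset (an assumption), shows the weights that `class_weight='balanced'` corresponds to, and prints recall on the minority class with and without weighting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in for the campaign data (about 5% responders)
X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 'balanced' weights are n_samples / (n_classes * count of each class)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_train)
print("Balanced class weights:", dict(zip([0, 1], np.round(weights, 2))))

for cw in [None, "balanced"]:
    model = LogisticRegression(class_weight=cw, max_iter=1000)
    model.fit(X_train, y_train)
    recall = recall_score(y_test, model.predict(X_test))
    print(f"class_weight={cw!s:<9} recall on responders: {recall:.3f}")
```

Weighting usually raises recall on responders at the cost of some precision, so both should be checked before settling on an approach.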

4. **Model Training:**

    - **Splitting Data**: Split the preprocessed data into training, validation (optional but recommended), and test sets before any resampling. A common split is 70/15/15, or 80/20 for train/test with cross-validation on the training portion during hyperparameter tuning. Stratify the split so that each set keeps the original 5% response rate, and apply resampling (if used) to the training data only, so that the validation and test sets reflect the real class distribution.
    - **Logistic Regression Model**: Initialize a LogisticRegression model from scikit-learn.

5. Hyperparameter Tuning:

  - **Parameters to Tune**: Key hyperparameters for Logistic Regression include:
    - C: The inverse of regularization strength. Smaller values mean stronger regularization (L2 by default). This helps prevent overfitting.
    - penalty: 'l1' or 'l2' regularization. L1 can lead to sparser coefficients (feature selection), while L2 shrinks coefficients.
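    - solver: The algorithm to use for optimization (e.g., 'liblinear', 'lbfgs', 'saga'). Not every solver supports every penalty: 'lbfgs' handles only L2, while 'liblinear' and 'saga' also support L1.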
    - class_weight: Use 'balanced' to automatically adjust weights inversely proportional to class frequencies, or provide a dictionary of weights.
  - **Tuning Method**: Use GridSearchCV or RandomizedSearchCV with stratified cross-validation on the training (or training + validation) data to find the combination of hyperparameters that maximizes a metric suited to imbalanced data, such as F1 or average precision, rather than accuracy.
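
The tuning step could look roughly like the sketch below. It assumes synthetic data with about 5% positives in place of the real campaign data, a small illustrative grid, and `average_precision` as the scoring metric; all of these are choices made for the example, not requirements.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data (about 5% responders)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is refitted on each training fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
])

param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],            # both supported by liblinear
    "clf__class_weight": [None, "balanced"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="average_precision", cv=cv)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best CV average precision: {search.best_score_:.4f}")
```

Stratified folds keep the 5% response rate in every fold, which makes the cross-validated scores more stable.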

6. **Model Evaluation**:

  - **Choosing Appropriate Metrics**: Since the dataset is imbalanced, accuracy is not a good primary evaluation metric. A model that predicts 'non-responder' for all customers would have 95% accuracy. Focus on metrics that are sensitive to the performance on the minority class:
      - **Precision**: Out of all customers predicted as responders, how many actually responded? High precision is important if false positives (contacting non-responders) are costly.
      - **Recall (Sensitivity)**: Out of all actual responders, how many did the model correctly identify? High recall is important if false negatives (missing potential responders) are costly. In a marketing campaign, recall is often very important to capture as many potential responders as possible, even if it means contacting some non-responders.
      - **F1-Score**: The harmonic mean of precision and recall. Useful as a single summary number when both false positives and false negatives matter.
      - **ROC-AUC**: Measures the model's ability to distinguish between the positive and negative classes across different probability thresholds. A higher AUC indicates better discriminative power. With only 5% positives, ROC-AUC can look optimistic, so it is worth reporting the area under the Precision-Recall curve (average precision) alongside it.
      - **Confusion Matrix**: Provides a detailed breakdown of true positives, true negatives, false positives, and false negatives.
  - **Evaluation on Test Set**: Evaluate the best model found during hyperparameter tuning on the held-out test set using the chosen metrics. The test set provides an unbiased estimate of the model's performance on unseen data.
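
The 95% accuracy trap mentioned above is easy to reproduce. The sketch below builds a hypothetical set of 1,000 customers with exactly 5% responders and scores a "model" that predicts non-responder for everyone.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             recall_score, roc_auc_score)

# Hypothetical labels: 50 responders out of 1,000 customers
y_true = np.array([1] * 50 + [0] * 950)

# A trivial model that predicts "non-responder" for everyone
y_all_negative = np.zeros_like(y_true)
constant_scores = np.zeros(len(y_true), dtype=float)

acc = accuracy_score(y_true, y_all_negative)
rec = recall_score(y_true, y_all_negative)
auc = roc_auc_score(y_true, constant_scores)
ap = average_precision_score(y_true, constant_scores)

print(f"accuracy={acc:.2f}  recall={rec:.2f}  roc_auc={auc:.2f}  avg_precision={ap:.2f}")
```

Accuracy comes out at 0.95 even though the model finds no responders at all, while recall (0.0), ROC-AUC (0.5), and average precision (0.05, the base rate) all expose it as uninformative.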

7. **Threshold Adjustment**:

    - **Optimizing for the Business Goal**: The default probability threshold for classification is 0.5. However, in an imbalanced scenario, you can adjust this threshold based on the business objective.
      - If you prioritize recall (finding more responders), you might lower the threshold.
      - If you prioritize precision (minimizing contact with non-responders), you might raise the threshold.
      - Analyze the Precision-Recall curve or ROC curve to determine the optimal threshold that balances the trade-off between precision and recall for the specific campaign goals (e.g., maximizing the number of responders contacted while keeping the cost of contacting non-responders manageable).
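
Threshold selection from the Precision-Recall curve might be sketched as follows. The data is synthetic (about 5% positives), and the 80% recall target is a hypothetical business requirement chosen for the example. The threshold is picked on a validation split so that the test set stays untouched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the campaign data (about 5% responders)
X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
val_scores = model.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, val_scores)

target_recall = 0.80  # hypothetical business requirement

# precision/recall have one more entry than thresholds; drop the final point
meets_target = recall[:-1] >= target_recall
# Among thresholds meeting the recall target, take the most precise one
best_idx = np.argmax(np.where(meets_target, precision[:-1], -1.0))
chosen_threshold = thresholds[best_idx]

y_pred = (val_scores >= chosen_threshold).astype(int)
achieved_recall = recall_score(y_val, y_pred)
achieved_precision = precision_score(y_val, y_pred)

print(f"Chosen threshold: {chosen_threshold:.3f}")
print(f"Recall: {achieved_recall:.3f}  Precision: {achieved_precision:.3f}")
```

Swapping the roles of precision and recall in the selection step gives the opposite policy: the highest recall subject to a minimum precision.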

**Summary**:

Building a Logistic Regression model for an imbalanced marketing campaign dataset involves standard data processing, crucial handling of class imbalance (using techniques like class weights or resampling), hyperparameter tuning using appropriate metrics like F1-score or ROC-AUC, and careful evaluation on a held-out test set. Finally, adjusting the prediction threshold based on the business objective (balancing precision and recall) is key to deploying an effective model.

