# **Logistic Regression | Assignment**

## Question 1: What is Logistic Regression, and how does it differ from Linear Regression?
Logistic Regression is a supervised learning algorithm primarily used for classification tasks (often binary classification). Unlike Linear Regression, which predicts a continuous numerical value, Logistic Regression outputs the probability that an instance belongs to a class.
- In Linear Regression, we directly predict a continuous value. For example, predicting a house price based on features like square footage or location.
- In Logistic Regression, we pass the linear combination of the features through the Sigmoid function to obtain a probability between 0 and 1, which is then used for classification.

Mathematically, Linear Regression is:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ

Whereas Logistic Regression applies a Sigmoid (or logistic) function to the linear combination of features:

p = 1 / (1 + e^(-z)), where z = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
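
As a rough illustration of the two formulas above, the sketch below computes the linear score z and then the logistic probability p for a single sample. The intercept, coefficients, and feature values are made up for the example and are not fitted to any dataset.

```python
import math


def linear_score(intercept, coefs, features):
    """Return z = b0 + b1*x1 + ... + bn*xn."""
    return intercept + sum(b * x for b, x in zip(coefs, features))


def sigmoid(z):
    """Map a real-valued score to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and one sample (illustrative values only)
intercept = -1.0
coefs = [0.8, -0.5]
features = [2.0, 1.0]

z = linear_score(intercept, coefs, features)  # the kind of value Linear Regression predicts
p = sigmoid(z)                                # the probability Logistic Regression predicts

print(f"z = {z:.2f}")
print(f"p = {p:.4f}")
```

The same linear combination feeds both models; only the final Sigmoid step turns the unbounded score into a probability.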

## Question 2: Explain the role of the Sigmoid function in Logistic Regression.
The Sigmoid function, also known as the logistic function, maps any real-valued input into a number between 0 and 1. Specifically:

Sigmoid(z) = 1 / (1 + e^(-z))

- In Logistic Regression, we feed the linear combination of the input features (z) into the Sigmoid.
- The output of the Sigmoid function can be interpreted as the probability of the instance belonging to the positive class (e.g., “Class 1”).
- If the Sigmoid output is greater than 0.5, the instance is typically classified as Class 1; otherwise, it is classified as Class 0.
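
A short derivation, using only the Sigmoid definition above with p = Sigmoid(z), shows why the 0.5 cut-off corresponds to the sign of z:

```latex
p = \frac{1}{1 + e^{-z}}
\;\Longrightarrow\;
\frac{p}{1 - p} = e^{z}
\;\Longrightarrow\;
\log\frac{p}{1 - p} = z
```

So p > 0.5 exactly when z > 0, and p = 0.5 when z = 0. The decision boundary z = 0 is therefore linear in the input features.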

## Question 3: What is Regularization in Logistic Regression and why is it needed?
Regularization is a technique to prevent overfitting by penalizing large coefficients in the model. It forces the model to keep the coefficient values smaller, thus reducing variance and improving the model’s generalization capability.
- In Logistic Regression, common regularization techniques include L1 (Lasso) and L2 (Ridge).
- L2 Regularization (penalty='l2'): Adds a term proportional to the square of the magnitude of the coefficients.
- L1 Regularization (penalty='l1'): Adds a term proportional to the absolute value of the coefficients and can drive some coefficients to exactly zero, performing feature selection.

Regularization is critical to ensure the model does not memorize noise from training data and can generalize well to unseen data.
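
To make the difference between the two penalties concrete, the sketch below applies one shrinkage step of each kind to a made-up coefficient vector. It uses the standard proximal operators (soft-thresholding for L1, proportional shrinkage for L2) purely as an illustration. The coefficients and the strength `lam` are assumptions, and this is not meant to describe how scikit-learn's solvers work internally.

```python
def l1_shrink(w, lam):
    """Soft-thresholding: proximal step for the penalty lam * |w|."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # coefficients with |w| <= lam become exactly zero


def l2_shrink(w, lam):
    """Proximal step for the penalty (lam / 2) * w**2: scales w toward zero."""
    return w / (1.0 + lam)


coefs = [2.0, -0.3, 0.05, -1.2]  # hypothetical coefficients
lam = 0.5                        # hypothetical penalty strength

after_l1 = [l1_shrink(w, lam) for w in coefs]
after_l2 = [l2_shrink(w, lam) for w in coefs]

print("L1:", [round(w, 3) for w in after_l1])
print("L2:", [round(w, 3) for w in after_l2])
```

In this toy case the L1 step zeroes the two small coefficients, while the L2 step shrinks every coefficient but leaves all of them non-zero.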

## Question 4: What are some common evaluation metrics for classification models, and why are they important?
Common classification evaluation metrics include:
1.	Accuracy: The proportion of correct predictions (both true positives and true negatives) among the total number of predictions.
2.	Precision: The proportion of true positives among all predicted positives. Helps measure how precise the model is in correctly predicting positives.
3.	Recall (Sensitivity): The proportion of true positives among all actual positives. Reflects how many of the actual positives the model captures.
4.	F1-Score: The harmonic mean of precision and recall. It provides a single metric balancing both precision and recall, especially useful for imbalanced datasets.
5.	ROC-AUC (Receiver Operating Characteristic - Area Under Curve): Measures how well the model separates classes at different thresholds. A higher AUC indicates better separability of classes.

These metrics are important because they give deeper insights into the model’s performance beyond just accuracy, helping to identify its strengths and weaknesses in different aspects of classification.
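
The definitions above can be checked by hand on a small example. The sketch below uses hypothetical labels for an imbalanced test set (5 positives out of 100) and computes accuracy, precision, recall, and F1 from the confusion counts; the predictions are invented to show how accuracy can look good while recall is poor.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical test set: 5 positives followed by 95 negatives
y_true = [1] * 5 + [0] * 95
# Hypothetical predictions: 2 positives found, 3 missed, 1 false alarm
y_pred = [1, 1, 0, 0, 0] + [1] + [0] * 94

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-Score:  {f1:.2f}")
```

Here accuracy is 0.96 even though only 2 of the 5 actual positives are found (recall 0.40), which is the kind of gap the other metrics are meant to expose.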



## Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy. (Use Dataset from sklearn package)

Below is an example using the Breast Cancer dataset from sklearn, saved to a CSV file and then reloaded.

In [None]:
# Import required libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Step 1: Load a dataset from sklearn package
breast_cancer = load_breast_cancer()

# Convert to DataFrame
df = pd.DataFrame(data=breast_cancer.data, columns=breast_cancer.feature_names)
df['target'] = breast_cancer.target

# Save to CSV file
df.to_csv('breast_cancer_data.csv', index=False)
print("Dataset saved to 'breast_cancer_data.csv'")

# Step 2: Load the CSV file into a Pandas DataFrame
data = pd.read_csv('breast_cancer_data.csv')
print("\nDataset loaded successfully!")
print(f"Shape of dataset: {data.shape}")

# Step 3: Split the data into features (X) and target (y)
X = data.drop('target', axis=1)
y = data['target']

# Step 4: Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"\nTraining samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

# Step 5: Train a Logistic Regression model
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Step 6: Make predictions and calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print(f"\nLogistic Regression Model Accuracy: {accuracy:.4f}")
print(f"Accuracy in percentage: {accuracy * 100:.2f}%")

Dataset saved to 'breast_cancer_data.csv'

Dataset loaded successfully!
Shape of dataset: (569, 31)

Training samples: 455
Testing samples: 114

Logistic Regression Model Accuracy: 0.9561
Accuracy in percentage: 95.61%


## Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy. (Use Dataset from sklearn package)


In [None]:
# Import required libraries
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Step 1: Load the Wine dataset from sklearn
wine = load_wine()
X = wine.data
y = wine.target

print("Wine Dataset Information:")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"Number of classes: {len(np.unique(y))}")
print(f"Feature names: {wine.feature_names}")
print(f"Target names: {wine.target_names}")

# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Step 3: Scale the features (recommended for regularization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4: Train Logistic Regression with L2 regularization (Ridge)
# penalty='l2' is default, but explicitly mentioning for clarity
# C is the inverse of regularization strength (smaller C = stronger regularization)
model_l2 = LogisticRegression(
    penalty='l2',
    C=1.0,  # You can adjust this for different regularization strengths
    solver='lbfgs',
    max_iter=1000,
    random_state=42
)

model_l2.fit(X_train_scaled, y_train)

# Step 5: Print model coefficients
print("\n" + "="*60)
print("L2 REGULARIZED LOGISTIC REGRESSION RESULTS")
print("="*60)

print("\nModel Coefficients (for each class):")
for i, class_name in enumerate(wine.target_names):
    print(f"\nClass {i} ({class_name}):")
    coefficients = model_l2.coef_[i]
    for j, feature_name in enumerate(wine.feature_names):
        print(f"  {feature_name}: {coefficients[j]:.4f}")

print("\nIntercepts for each class:")
for i, class_name in enumerate(wine.target_names):
    print(f"  Class {i} ({class_name}): {model_l2.intercept_[i]:.4f}")

# Step 6: Make predictions and calculate accuracy
y_pred = model_l2.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

print("\n" + "-"*60)
print(f"Model Accuracy on Test Set: {accuracy:.4f}")
print(f"Accuracy Percentage: {accuracy * 100:.2f}%")
print(f"Number of correct predictions: {np.sum(y_pred == y_test)}/{len(y_test)}")

# Additional: Show the effect of different regularization strengths
print("\n" + "="*60)
print("EFFECT OF DIFFERENT REGULARIZATION STRENGTHS")
print("="*60)

C_values = [0.01, 0.1, 1, 10, 100]
for C in C_values:
    model = LogisticRegression(penalty='l2', C=C, solver='lbfgs', max_iter=1000, random_state=42)
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    print(f"C = {C:6.2f} (regularization = {1/C:6.2f}): Accuracy = {acc:.4f}")

Wine Dataset Information:
Number of samples: 178
Number of features: 13
Number of classes: 3
Feature names: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Target names: ['class_0' 'class_1' 'class_2']

L2 REGULARIZED LOGISTIC REGRESSION RESULTS

Model Coefficients (for each class):

Class 0 (class_0):
  alcohol: 0.7469
  malic_acid: 0.0919
  ash: 0.3996
  alcalinity_of_ash: -0.8327
  magnesium: 0.1115
  total_phenols: 0.3233
  flavanoids: 0.6736
  nonflavanoid_phenols: 0.0039
  proanthocyanins: -0.0021
  color_intensity: 0.0905
  hue: 0.0606
  od280/od315_of_diluted_wines: 0.5664
  proline: 0.8697

Class 1 (class_1):
  alcohol: -0.9548
  malic_acid: -0.3809
  ash: -0.7751
  alcalinity_of_ash: 0.5929
  magnesium: -0.1889
  total_phenols: -0.1210
  flavanoids: 0.1964
  nonflavanoid_phenols: -0.0053
  proanthocyanins: 0.5452
  c

## Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report. (Use Dataset from sklearn package)


In [None]:
# Import required Libraries
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

# Load Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

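# Train OvR Logistic Regression
# Note: the multi_class argument is deprecated in scikit-learn 1.5 and later,
# so this line may emit a FutureWarning on those versions
model = LogisticRegression(multi_class='ovr', max_iter=1000)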
model.fit(X_train_scaled, y_train)

# Predictions
y_pred = model.predict(X_test_scaled)

# Classification Report
print("Wine Dataset - Classification Report (OvR):")
print(classification_report(y_test, y_pred, target_names=wine.target_names))

Wine Dataset - Classification Report (OvR):
              precision    recall  f1-score   support

     class_0       1.00      1.00      1.00        19
     class_1       1.00      0.95      0.98        21
     class_2       0.93      1.00      0.97        14

    accuracy                           0.98        54
   macro avg       0.98      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54
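
A possible alternative, assuming a scikit-learn version where the `multi_class` argument is deprecated (1.5 or later), is to wrap the estimator in `OneVsRestClassifier`. The sketch below repeats the same split and scaling on the Wine dataset; the variable name `ovr_model` and the use of `make_pipeline` are choices made for this example, not part of the original solution.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the Wine dataset and use the same split settings as above
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)

# Scale the features, then fit one binary Logistic Regression per class
ovr_model = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
ovr_model.fit(X_train, y_train)

# Classification report on the held-out test set
y_pred = ovr_model.predict(X_test)
print("Wine Dataset - Classification Report (OneVsRestClassifier):")
print(classification_report(y_test, y_pred, target_names=wine.target_names))
```

The wrapper trains one binary classifier per class, which is the one-vs-rest strategy the question asks for, without relying on the deprecated argument.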





## Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy. (Use Dataset from sklearn package)


In [None]:
# Import required libraries
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Load the Breast Cancer dataset from sklearn
data = load_breast_cancer()
X = data.data
y = data.target

print("BREAST CANCER DATASET")
print("="*60)
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"Classes: {data.target_names}")
print(f"Class distribution: {np.bincount(y)}")

# Step 2: Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"\nTraining samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

# Step 3: Scale the features (important for regularization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4: Define the parameter grid for GridSearchCV
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength (smaller C = stronger penalty)
    'penalty': ['l1', 'l2'],  # Regularization type
    'solver': ['liblinear']  # Solver that supports both L1 and L2
}

print("\n" + "="*60)
print("HYPERPARAMETER GRID")
print("="*60)
print(f"C values: {param_grid['C']}")
print(f"Penalty types: {param_grid['penalty']}")
print(f"Total combinations: {len(param_grid['C']) * len(param_grid['penalty'])}")

# Step 5: Create Logistic Regression model
log_reg = LogisticRegression(max_iter=1000, random_state=42)

# Step 6: Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=log_reg,
    param_grid=param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,  # Use all available processors
    verbose=1
)

# Step 7: Fit GridSearchCV
print("\n" + "="*60)
print("PERFORMING GRID SEARCH...")
print("="*60)
grid_search.fit(X_train_scaled, y_train)

# Step 8: Print the best parameters and validation accuracy
print("\n" + "="*60)
print("GRID SEARCH RESULTS")
print("="*60)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}")

# Step 9: Get the best model and evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_pred)

print(f"\nTest Set Accuracy with Best Model: {test_accuracy:.4f}")

# Step 10: Show all results from grid search (listed in parameter-grid order)
print("\n" + "="*60)
print("ALL GRID SEARCH RESULTS (in parameter-grid order)")
print("="*60)
results = grid_search.cv_results_
for i in range(len(results['params'])):
    print(f"Rank {results['rank_test_score'][i]}: "
          f"C={results['params'][i]['C']}, "
          f"penalty={results['params'][i]['penalty']} - "
          f"CV Accuracy: {results['mean_test_score'][i]:.4f} "
          f"(+/- {results['std_test_score'][i]*2:.4f})")

# Step 11: Classification report with best model
print("\n" + "="*60)
print("CLASSIFICATION REPORT (Best Model)")
print("="*60)
print(classification_report(y_test, y_pred, target_names=data.target_names))

BREAST CANCER DATASET
Number of samples: 569
Number of features: 30
Classes: ['malignant' 'benign']
Class distribution: [212 357]

Training samples: 398
Testing samples: 171

HYPERPARAMETER GRID
C values: [0.001, 0.01, 0.1, 1, 10, 100]
Penalty types: ['l1', 'l2']
Total combinations: 12

PERFORMING GRID SEARCH...
Fitting 5 folds for each of 12 candidates, totalling 60 fits

GRID SEARCH RESULTS
Best Parameters: {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}
Best Cross-Validation Accuracy: 0.9799

Test Set Accuracy with Best Model: 0.9883

ALL GRID SEARCH RESULTS (in parameter-grid order)
Rank 12: C=0.001, penalty=l1 - CV Accuracy: 0.3718 (+/- 0.0078)
Rank 10: C=0.001, penalty=l2 - CV Accuracy: 0.9422 (+/- 0.0562)
Rank 11: C=0.01, penalty=l1 - CV Accuracy: 0.9146 (+/- 0.0796)
Rank 5: C=0.01, penalty=l2 - CV Accuracy: 0.9724 (+/- 0.0292)
Rank 4: C=0.1, penalty=l1 - CV Accuracy: 0.9749 (+/- 0.0386)
Rank 2: C=0.1, penalty=l2 - CV Accuracy: 0.9799 (+/- 0.0123)
Rank 3: C=1, penalty=l1 - CV A

## Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling. (Use Dataset from sklearn package)


In [None]:
# Import required libraries
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Step 1: Load the Digits dataset from sklearn
digits = load_digits()
X = digits.data
y = digits.target

print("DATASET: Handwritten Digits Recognition")
print("="*50)
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"Number of classes: {len(np.unique(y))}")

# Show feature value ranges (demonstrates need for scaling)
print(f"\nFeature value range: [{X.min():.2f}, {X.max():.2f}]")
print(f"Mean feature value: {X.mean():.2f}")
print(f"Std deviation: {X.std():.2f}")

# Step 2: Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"\nTrain set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

# Step 3: Train Logistic Regression WITHOUT Scaling
print("\n" + "="*50)
print("MODEL 1: LOGISTIC REGRESSION WITHOUT SCALING")
print("="*50)

# Create and train model without scaling
model_no_scale = LogisticRegression(max_iter=5000, random_state=42)
model_no_scale.fit(X_train, y_train)

# Predictions and accuracy
y_pred_no_scale = model_no_scale.predict(X_test)
accuracy_no_scale = accuracy_score(y_test, y_pred_no_scale)

print("Training completed!")
print(f"Accuracy without scaling: {accuracy_no_scale:.4f}")

# Step 4: Train Logistic Regression WITH Scaling
print("\n" + "="*50)
print("MODEL 2: LOGISTIC REGRESSION WITH STANDARDIZATION")
print("="*50)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Verify scaling worked
print(f"After scaling - Mean: {X_train_scaled.mean():.2e}")
print(f"After scaling - Std: {X_train_scaled.std():.2f}")

# Create and train model with scaling
model_with_scale = LogisticRegression(max_iter=5000, random_state=42)
model_with_scale.fit(X_train_scaled, y_train)

# Predictions and accuracy
y_pred_with_scale = model_with_scale.predict(X_test_scaled)
accuracy_with_scale = accuracy_score(y_test, y_pred_with_scale)

print("Training completed!")
print(f"Accuracy with scaling: {accuracy_with_scale:.4f}")

# Step 5: Compare Results
print("\n" + "="*50)
print("COMPARISON OF RESULTS")
print("="*50)
print(f"Accuracy WITHOUT Scaling: {accuracy_no_scale:.4f} ({accuracy_no_scale*100:.2f}%)")
print(f"Accuracy WITH Scaling:    {accuracy_with_scale:.4f} ({accuracy_with_scale*100:.2f}%)")
print(f"\nDifference: {abs(accuracy_with_scale - accuracy_no_scale):.4f}")

if accuracy_with_scale > accuracy_no_scale:
    improvement = ((accuracy_with_scale - accuracy_no_scale) / accuracy_no_scale) * 100
    print(f"Scaling improved accuracy by {improvement:.2f}%")
elif accuracy_no_scale > accuracy_with_scale:
    decrease = ((accuracy_no_scale - accuracy_with_scale) / accuracy_no_scale) * 100
    print(f"Scaling decreased accuracy by {decrease:.2f}%")
else:
    print("Both models achieved the same accuracy")

# Additional Analysis
print("\n" + "="*50)
print("ADDITIONAL INSIGHTS")
print("="*50)

# Check convergence
print(f"Iterations to converge (no scaling): {model_no_scale.n_iter_[0]}")
print(f"Iterations to converge (with scaling): {model_with_scale.n_iter_[0]}")

# Number of misclassified samples
misclassified_no_scale = np.sum(y_test != y_pred_no_scale)
misclassified_with_scale = np.sum(y_test != y_pred_with_scale)

print(f"\nMisclassified samples (no scaling): {misclassified_no_scale}/{len(y_test)}")
print(f"Misclassified samples (with scaling): {misclassified_with_scale}/{len(y_test)}")

DATASET: Handwritten Digits Recognition
Number of samples: 1797
Number of features: 64
Number of classes: 10

Feature value range: [0.00, 16.00]
Mean feature value: 4.88
Std deviation: 6.02

Train set size: 1257
Test set size: 540

MODEL 1: LOGISTIC REGRESSION WITHOUT SCALING
Training completed!
Accuracy without scaling: 0.9685

MODEL 2: LOGISTIC REGRESSION WITH STANDARDIZATION
After scaling - Mean: -2.83e-18
After scaling - Std: 0.98
Training completed!
Accuracy with scaling: 0.9704

COMPARISON OF RESULTS
Accuracy WITHOUT Scaling: 0.9685 (96.85%)
Accuracy WITH Scaling:    0.9704 (97.04%)

Difference: 0.0019
Scaling improved accuracy by 0.19%

ADDITIONAL INSIGHTS
Iterations to converge (no scaling): 140
Iterations to converge (with scaling): 32

Misclassified samples (no scaling): 17/540
Misclassified samples (with scaling): 16/540
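
For reference, the standardization step used above can be written out by hand. The sketch below applies z = (x - mean) / std to a single made-up feature column, using statistics learned from the training values only and then reused on the test values. The helper names and the sample numbers are assumptions for illustration.

```python
import statistics


def fit_standardizer(column):
    """Return (mean, std) of a training column, using the population std."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    # A constant column has std 0; dividing by 1 instead keeps it at 0 after centering
    return mean, (std if std > 0 else 1.0)


def standardize(column, mean, std):
    """Apply z = (x - mean) / std to every value in the column."""
    return [(x - mean) / std for x in column]


train_col = [0.0, 4.0, 8.0, 16.0]  # hypothetical pixel intensities (training)
test_col = [2.0, 12.0]             # hypothetical pixel intensities (test)

mean, std = fit_standardizer(train_col)
train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)  # reuse the training statistics

print(f"Train mean after scaling: {statistics.fmean(train_scaled):.2f}")
print(f"Train std after scaling:  {statistics.pstdev(train_scaled):.2f}")
print(f"Scaled test values: {[round(v, 3) for v in test_scaled]}")
```

Fitting on the training split and only transforming the test split mirrors the `fit_transform` / `transform` calls in the program. Constant columns stay at zero after scaling, which is probably why the overall standard deviation printed above is 0.98 rather than exactly 1.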


## Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model, including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

With only 5% of customers responding (the positive class), I need to handle the data and the model training carefully so that the model doesn't simply predict "no response" for everyone. My practical approach would involve:

1.	Data Collection and Cleaning:
I would gather relevant customer features (e.g., demographics, purchase history, browsing behavior, etc.). I would clean the data by handling missing values, outliers, and inconsistent entries. I would convert categorical variables to appropriate encodings (e.g., One-Hot Encoding).

2.	Balancing the Classes:
Since only 5% respond, the dataset is heavily imbalanced. Without addressing this, the model might predict "no response" for nearly everyone and have high accuracy but poor recall for responders. I would use techniques such as the following, applied to the training data only so that the test set keeps the real 5% response rate:

- Oversampling the minority class (e.g., SMOTE)
- Undersampling the majority class
- Adjusting class weights in Logistic Regression (parameter "class_weight='balanced'")

3.	Feature Scaling:
I would standardize or normalize numerical features using StandardScaler or MinMaxScaler to ensure that features with large numerical ranges do not dominate the model coefficients.

4.	Model Training with Logistic Regression:
I would start with a baseline logistic regression. I would incorporate regularization (L1 or L2) to handle overfitting, particularly if the feature space is large. If using scikit-learn, I would consider setting class_weight='balanced' to make the algorithm pay more attention to the minority class.

5.	Hyperparameter Tuning:
I would use GridSearchCV or RandomizedSearchCV to find the best regularization parameter (C) and penalty type (L1 or L2). I would possibly tune solver and class_weight to optimize performance.

6.	Evaluation Metrics:
Accuracy alone can be misleading with highly imbalanced data. Instead, I would focus on:

- Precision: Of those predicted to respond, how many actually responded?
- Recall (or Sensitivity): Of all actual responders, how many did we correctly identify?
- F1-Score: Balances precision and recall
- ROC-AUC: Measures overall separability across different thresholds
- PR (Precision-Recall) AUC: Particularly useful in heavily imbalanced scenarios

I would consider business requirements. For example, a higher Recall might be desired if I want to target as many potential responders as possible. However, that might also increase marketing costs if the precision is too low.
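
The trade-off between recall and marketing cost can be made concrete by scoring each candidate threshold in money terms. The sketch below is a toy example: the scores, the value of a response, and the cost of a contact are all invented numbers, and `campaign_profit` is a hypothetical helper rather than a library function.

```python
def campaign_profit(y_true, scores, threshold, value_per_response, cost_per_contact):
    """Profit if every customer with a score >= threshold is contacted."""
    contacted = [t for t, s in zip(y_true, scores) if s >= threshold]
    responders = sum(contacted)
    return responders * value_per_response - len(contacted) * cost_per_contact


# Hypothetical model scores for 10 customers (1 = responded, 0 = did not)
y_true = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
scores = [0.90, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.15, 0.10, 0.05]

value_per_response = 20.0  # assumed revenue from one responder
cost_per_contact = 2.0     # assumed cost of contacting one customer

thresholds = [0.1, 0.3, 0.5, 0.7]
profits = {
    t: campaign_profit(y_true, scores, t, value_per_response, cost_per_contact)
    for t in thresholds
}
best_threshold = max(profits, key=profits.get)

for t in thresholds:
    print(f"threshold = {t:.1f} -> profit = {profits[t]:.1f}")
print(f"Most profitable threshold: {best_threshold}")
```

With these made-up numbers the lowest threshold is the most profitable, because a response is worth ten times the cost of a contact. With a more expensive contact the balance would shift toward higher precision.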

**Iterative Improvement and Real-World Constraints:**
I would regularly update and retrain the model with new data to capture changing customer behaviors. It might be beneficial for me to create segmentation strategies, focusing on certain groups for more personalized campaigns. I would monitor the model's performance in production, track the real response rates, and adjust thresholds or re-balance as needed.

By following these steps — focusing on class balancing, proper evaluation metrics, and iterative tuning — my Logistic Regression model can better capture the minority class and provide valuable predictions for a marketing campaign.
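
As a minimal sketch of how these steps could fit together in scikit-learn, the example below uses synthetic data from `make_classification` as a stand-in for the campaign data (about 5% positives). The dataset parameters, the grid values, and the step names `scaler` and `clf` are assumptions for illustration. Scaling sits inside the pipeline so that it is refit on each cross-validation fold, and the search is scored with average precision (PR AUC) instead of accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data: roughly 5% positives (assumption)
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, weights=[0.95], random_state=42,
)

# Stratified split keeps the response rate similar in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scaling and the classifier in one pipeline, so each CV fold is scaled separately
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune the inverse regularization strength and the class weighting
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__class_weight": [None, "balanced"],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="average_precision",
)
search.fit(X_train, y_train)

# Evaluate on the untouched test set with imbalance-aware metrics
test_scores = search.predict_proba(X_test)[:, 1]
pr_auc = average_precision_score(y_test, test_scores)

print(f"Best parameters: {search.best_params_}")
print(f"Test PR AUC: {pr_auc:.3f} (positive rate: {y_test.mean():.3f})")
print(classification_report(y_test, search.predict(X_test), digits=3))
```

The positive rate printed next to the PR AUC is the score a random ranking would be expected to reach, so it gives a quick baseline for judging the model.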