1. What is Logistic Regression, and how does it differ from Linear Regression?

Logistic regression predicts categorical outcomes (like "yes/no" or "spam/not spam"), while linear regression predicts continuous numerical values (like "sales" or "temperature"). The key difference lies in their purpose: logistic regression is used for classification problems, mapping input variables to probabilities via a sigmoid (S-shaped) curve to produce a binary outcome. Linear regression, on the other hand, is for regression problems, using a linear equation to find a continuous relationship between variables. 

* Logistic Regression

Purpose: To predict a probability for a categorical dependent variable, often used for binary classification problems (e.g., disease present/absent). 

Output: A probability value between 0 and 1, which can then be thresholded to make a discrete classification. 

Equation: Applies the sigmoid (logistic) function, p = 1 / (1 + e^(-z)), to a linear combination z of the inputs, mapping it to a probability between 0 and 1. 

Applications: Identifying spam emails, predicting customer churn, or classifying tumors as cancerous or benign. 

* Linear Regression

Purpose: To predict a continuous dependent variable, where the outcome can take on any value within a range (e.g., predicting house prices). 

Output: A continuous numerical value, which can range from negative infinity to positive infinity. 

Equation: Fits a linear equation, y = b0 + b1*x1 + ... + bn*xn, to model the relationship between the independent variables and the dependent variable. 

Applications: Forecasting sales, predicting temperatures, or estimating the number of hospital days a patient might need. 
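
The contrast above can be illustrated with a minimal sketch on made-up toy data (the arrays and the query point 20 are assumptions for illustration only). Linear regression fitted to 0/1 labels returns an unbounded number, while logistic regression returns a probability.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (made up for illustration): one feature, binary label
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

x_new = np.array([[20]])
lin_pred = lin.predict(x_new)[0]           # unbounded: can go above 1
log_prob = log.predict_proba(x_new)[0, 1]  # probability of class 1, within [0, 1]
log_class = log.predict(x_new)[0]          # probability thresholded at 0.5

print(f"Linear regression output:   {lin_pred:.3f}")
print(f"Logistic regression P(y=1): {log_prob:.3f} -> class {log_class}")
```

The linear output at x = 20 is well above 1, so it cannot be read as a probability, which is the practical reason to prefer logistic regression for classification.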

2. Explain the role of the Sigmoid function in Logistic Regression.

In Logistic Regression, the Sigmoid (or logistic) function takes any real-valued input and "squashes" it into a probability between 0 and 1. This transformation is crucial because the output needs to be a probability, which by definition must be within this range. The S-shaped curve of the sigmoid function provides this, allowing for the interpretation of the output as the probability of a binary event (e.g., 90% chance of being spam). Because the function is smooth and differentiable everywhere, the model can also be trained with gradient-based optimization such as gradient descent. 

* Why it's needed

Probability Output: Logistic regression is used for classification problems, specifically binary classification (e.g., yes/no, spam/not spam). The raw output of a linear model can be any real number, which isn't suitable for representing a probability. 

Mapping to 0-1 Range: The sigmoid function, mathematically defined as 1 / (1 + e^(-z)) where z is the linear combination of input features, transforms any real-valued z into a value between 0 and 1. 

An S-Shaped Curve: The function's "S" shape is key. As the input z becomes very large and positive, the sigmoid output approaches 1. As z becomes very large and negative, the output approaches 0. For z=0, the output is 0.5. 
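
The behaviour described above can be checked with a short sketch. The function name `sigmoid` and the sample inputs are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input z to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, 0.0, 10.0])
p = sigmoid(z)

for zi, pi in zip(z, p):
    print(f"z = {zi:6.1f} -> sigmoid(z) = {pi:.5f}")
```

Large negative inputs land close to 0, large positive inputs land close to 1, and z = 0 gives exactly 0.5.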


3. What is Regularization in Logistic Regression and why is it needed?

Regularization in Logistic Regression is a technique to prevent overfitting by adding a penalty term to the model's cost function, which discourages overly complex models with large coefficients. 

It is needed because without it, logistic regression can become too sensitive to the training data, capturing noise and random fluctuations, leading to poor performance on new, unseen data. 

Regularization usually costs a slight increase in training error but often improves the model's ability to generalize to test data.  

* What Regularization Does

Penalizes Model Complexity: Regularization adds a penalty to the cost function based on the magnitude of the model's coefficients. 

Shrinks Coefficients: This penalty discourages the model from assigning large weights to features, effectively shrinking them towards zero. 

Creates Simpler Models: By keeping coefficients small, regularization promotes simpler models that are less sensitive to individual data points. 

* Why It's Needed

Prevents Overfitting: The primary reason is to avoid overfitting, where a model learns the training data too well and fails to make accurate predictions on new data. 

Improves Generalization: Regularization improves a model's ability to generalize, meaning it performs better on unseen data by focusing on the underlying patterns rather than the noise in the training set. 

Handles High-Dimensional Data: It is especially useful when dealing with many features, as it helps control the model's complexity and prevents it from becoming too tailored to the training data. 
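
As a rough illustration of coefficient shrinkage, the sketch below fits scikit-learn's `LogisticRegression` (L2 penalty by default) with different values of `C` on synthetic data. The dataset and the chosen `C` values are assumptions for this example. In scikit-learn, `C` is the inverse of the regularization strength, so a smaller `C` means a stronger penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, generated only for this illustration
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

norms = {}
for C in [0.01, 1.0, 100.0]:
    # Default penalty is L2; smaller C = stronger regularization
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C = {C:>6}: L2 norm of coefficients = {norms[C]:.4f}")
```

The strongest penalty (smallest `C`) should give the smallest coefficient norm, which is the shrinking effect described above.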

4. What are some common evaluation metrics for classification models, and why are they important? 

* Common evaluation metrics for classification models

1. Accuracy

Percentage of correctly predicted samples.

Good when classes are balanced, but misleading if one class dominates.

Example: If 95% of emails are not spam, a model that always predicts “not spam” gets 95% accuracy but is useless, because it never catches any spam.

2. Precision

Out of all the samples predicted as positive, how many were actually positive?

Focuses on quality of positive predictions.

Example: In spam detection, precision = how many emails predicted as spam were really spam.

Important when false positives are costly (e.g., marking an important email as spam).

3. Recall (Sensitivity / True Positive Rate)

Out of all the actual positive cases, how many did the model correctly predict?

Focuses on catching all positives.

Example: In disease detection, recall = how many sick patients were actually detected.

Important when false negatives are costly (e.g., missing a disease case).

4. F1 Score

Harmonic mean of Precision and Recall.

Good when you want a balance between precision and recall.

Example: Useful in fraud detection, where both false positives and false negatives are bad.

5. ROC Curve & AUC (Area Under Curve)

ROC curve shows how well the model separates the two classes at different thresholds.

AUC = overall ability of the model to distinguish between classes.

Closer to 1 = better model; 0.5 is no better than random guessing.

Example: AUC of 0.9 means that a randomly chosen spam email gets a higher score than a randomly chosen non-spam email about 90% of the time.

6. Confusion Matrix

A table that shows true positives, true negatives, false positives, and false negatives.

Gives a complete picture of model performance.
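
The metrics listed above can be computed on a small hand-made example. The labels below are made up so that the counts are easy to verify by hand: 3 true positives, 1 false negative, 1 false positive and 5 true negatives.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

acc = accuracy_score(y_true, y_pred)     # (TP + TN) / total = 8 / 10
prec = precision_score(y_true, y_pred)   # TP / (TP + FP)    = 3 / 4
rec = recall_score(y_true, y_pred)       # TP / (TP + FN)    = 3 / 4
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Accuracy={acc:.2f} Precision={prec:.2f} Recall={rec:.2f} F1={f1:.2f}")
```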

* Why Evaluation Metrics Are Important

Model Performance Assessment: They provide a quantitative way to measure how well a model is performing, revealing its strengths and weaknesses. 

Model Selection: Metrics help compare different classification models to determine which one is best suited for a specific problem. 

Hyperparameter Tuning: Metrics guide the tuning of model parameters to improve performance, ensuring the model aligns with the problem's goals. 

Understanding Trade-offs: Metrics like precision and recall highlight the trade-offs between different types of errors, allowing for informed decisions based on the cost of false positives versus false negatives. 

Identifying Issues: Metrics reveal issues like poor performance on specific classes or if the model is failing to detect relevant instances, which is particularly important with imbalanced datasets. 
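
To make the imbalanced-data point concrete, here is a small sketch with made-up labels (95 negatives, 5 positives) and a "model" that always predicts the majority class. Accuracy looks high while recall shows that no positive case is found.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up imbalanced labels: 95 negatives and 5 positives
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predict the majority class

baseline_acc = accuracy_score(y_true, y_pred)
baseline_rec = recall_score(y_true, y_pred)
# No positive predictions exist, so precision is undefined; report it as 0
baseline_prec = precision_score(y_true, y_pred, zero_division=0)

print(f"Accuracy:  {baseline_acc:.2f}")
print(f"Recall:    {baseline_rec:.2f}")
print(f"Precision: {baseline_prec:.2f}")
```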

5. Write a Python program that loads a CSV file into a Pandas DataFrame, 
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy. 
(Use Dataset from sklearn package) 

In [13]:
# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [3]:
# Load dataset from sklearn
data = load_breast_cancer()
data

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0

In [5]:
# Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [6]:
df['target'] = data.target  # Add target column

In [8]:
# Split into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

In [9]:
# Train/Test Split (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
# Create Logistic Regression model
model = LogisticRegression(max_iter=5000)  # Increased iterations for convergence

In [12]:
# Train the model
model.fit(X_train, y_train)

In [14]:
# Predict on test set
y_pred = model.predict(X_test)



In [15]:
# Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Logistic Regression model: {accuracy:.4f}")

Accuracy of Logistic Regression model: 0.9561
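
The cells above build the DataFrame directly from the sklearn dataset, so no CSV file is actually read. If a literal CSV step is wanted, one possible sketch is to write the dataset to a temporary CSV file and load it back with `pd.read_csv`. The file name `breast_cancer.csv` and the temporary directory are assumptions for this example.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Export the sklearn dataset (features + 'target' column) to a CSV file
frame = load_breast_cancer(as_frame=True).frame

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "breast_cancer.csv")  # hypothetical file name
    frame.to_csv(csv_path, index=False)
    # Load the CSV file into a Pandas DataFrame
    df = pd.read_csv(csv_path)

X = df.drop("target", axis=1)
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy of Logistic Regression model: {accuracy:.4f}")
```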


6. Write a Python program to train a Logistic Regression model using L2 
regularization (Ridge) and print the model coefficients and accuracy. 

In [16]:
# Import libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
# Load dataset
data = load_breast_cancer()

In [None]:
# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

In [None]:
# Split into features and target
X = df.drop('target', axis=1)
y = df['target']

In [None]:
# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [None]:
# Create Logistic Regression model with L2 Regularization (Ridge)
model = LogisticRegression(
    penalty='l2',       # L2 regularization
    solver='lbfgs',     # Solver that supports L2
    max_iter=5000       # Increase iterations for convergence
)


In [None]:
# Train model
model.fit(X_train, y_train)

In [None]:
# Predict on test data
y_pred = model.predict(X_test)


In [None]:
# Print Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")


Model Accuracy: 0.9561


In [None]:
# Print Coefficients
print("\nModel Coefficients (one per feature):")
for feature, coef in zip(X.columns, model.coef_[0]):
    print(f"{feature}: {coef:.4f}")



Model Coefficients (one per feature):
mean radius: 0.9654
mean texture: 0.2243
mean perimeter: -0.3643
mean area: 0.0263
mean smoothness: -0.1531
mean compactness: -0.2286
mean concavity: -0.5178
mean concave points: -0.2740
mean symmetry: -0.2208
mean fractal dimension: -0.0367
radius error: -0.0931
texture error: 1.3761
perimeter error: -0.1513
area error: -0.0897
smoothness error: -0.0222
compactness error: 0.0460
concavity error: -0.0439
concave points error: -0.0315
symmetry error: -0.0323
fractal dimension error: 0.0114
worst radius: 0.1030
worst texture: -0.5126
worst perimeter: -0.0193
worst area: -0.0165
worst smoothness: -0.3041
worst compactness: -0.7666
worst concavity: -1.4159
worst concave points: -0.4973
worst symmetry: -0.7248
worst fractal dimension: -0.1010


In [None]:
# Print Intercept
print(f"\nIntercept: {model.intercept_[0]:.4f}")


Intercept: 29.5015


7. Write a Python program to train a Logistic Regression model for multiclass 
classification using multi_class='ovr' and print the classification report. 
(Use Dataset from sklearn package)

In [27]:
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report



In [None]:
# Load dataset
data = load_iris()
data

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [None]:
# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target


In [None]:
# Split into features and target
X = df.drop('target', axis=1)
y = df['target']


In [None]:
# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [None]:
# Create Logistic Regression model for multiclass classification
model = LogisticRegression(
    multi_class='ovr',   # One-vs-Rest strategy (parameter deprecated since scikit-learn 1.5)
    solver='lbfgs',      # Solver that supports OvR
    max_iter=5000
)


In [None]:
# Train model
model.fit(X_train, y_train)


In [None]:
# Predict on test data
y_pred = model.predict(X_test)

In [None]:
# Print Classification Report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))

Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
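
Since scikit-learn 1.5 the `multi_class` parameter is deprecated, and the documentation suggests wrapping the estimator in `OneVsRestClassifier` to keep a one-vs-rest scheme. The following is a sketch of that alternative on the same dataset and split. The variable names are chosen here, and the scores may differ slightly from the report above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# One binary Logistic Regression is trained per class (One-vs-Rest)
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=5000))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
ovr_accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=data.target_names)
print(report)
```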



8. Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation 
accuracy. 

(Use Dataset from sklearn package) 

In [37]:
# Import libraries
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

In [38]:
# 1. Load dataset
data = load_wine()

In [39]:
# 2. Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

In [41]:
# 3. Split features and target
X = df.drop('target', axis=1)
y = df['target']

In [42]:
# 4. Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


In [None]:
# 5. Define Logistic Regression model
log_reg = LogisticRegression(solver='liblinear', max_iter=5000)


In [44]:
# 6. Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],   # Inverse of regularization strength (smaller C = stronger penalty)
    'penalty': ['l1', 'l2']         # L1 (Lasso) or L2 (Ridge)
}

In [46]:
# 7. Apply GridSearchCV
grid = GridSearchCV(
    estimator=log_reg,
    param_grid=param_grid,
    cv=5,              # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1
)


In [47]:
# 8. Fit the model
grid.fit(X_train, y_train)


In [48]:

# 9. Print Best Parameters and Best Validation Accuracy
print("Best Parameters:", grid.best_params_)
print(f"Best Cross-Validation Accuracy: {grid.best_score_:.4f}")


Best Parameters: {'C': 100, 'penalty': 'l1'}
Best Cross-Validation Accuracy: 0.9650


In [49]:

# 10. Test accuracy on unseen data
test_accuracy = grid.score(X_test, y_test)
print(f"Test Set Accuracy: {test_accuracy:.4f}")

Test Set Accuracy: 0.9167
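
A possible variation, sketched under a few assumptions: putting a `StandardScaler` and the classifier in a `Pipeline` lets GridSearchCV refit the scaler inside each fold, so the validation part of each fold never influences the scaling. This sketch swaps in the binary breast cancer dataset so that `liblinear` is used on a two-class problem. The step names `scaler` and `clf` are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=5000)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10, 100],  # inverse of regularization strength
    "clf__penalty": ["l1", "l2"],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

test_accuracy = grid.score(X_test, y_test)
print("Best Parameters:", grid.best_params_)
print(f"Best Cross-Validation Accuracy: {grid.best_score_:.4f}")
print(f"Test Set Accuracy: {test_accuracy:.4f}")
```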


8. Write a Python program to apply GridSearchCV to tune C and penalty 
hyperparameters for Logistic Regression and print the best parameters and validation 
accuracy.

In [50]:
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression


In [None]:
# Load dataset
data = load_iris()

In [None]:
# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

In [None]:
# Split into features and target
X = df.drop('target', axis=1)
y = df['target']


In [None]:
# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [None]:
# Define Logistic Regression model
log_reg = LogisticRegression(solver='liblinear', max_iter=5000)


In [None]:
# Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],   # Inverse of regularization strength (smaller C = stronger penalty)
    'penalty': ['l1', 'l2']         # L1 (Lasso) or L2 (Ridge)
}



In [None]:
#  Apply GridSearchCV
grid = GridSearchCV(
    estimator=log_reg,
    param_grid=param_grid,
    cv=5,              # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1
)


In [None]:
# Fit the model
grid.fit(X_train, y_train)

In [None]:
# Print Best Parameters and Best Validation Accuracy
print("Best Parameters:", grid.best_params_)
print(f"Best Cross-Validation Accuracy: {grid.best_score_:.4f}")

Best Parameters: {'C': 10, 'penalty': 'l1'}
Best Cross-Validation Accuracy: 0.9667


9. Write a Python program to standardize the features before training Logistic 
Regression and compare the model's accuracy with and without scaling. 

In [61]:
# Import libraries
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score


In [63]:
# 1. Load dataset
data = load_wine()

data

{'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
         1.065e+03],
        [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
         1.050e+03],
        [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
         1.185e+03],
        ...,
        [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
         8.350e+02],
        [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
         8.400e+02],
        [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
         5.600e+02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [64]:
X = data.data
y = data.target

In [66]:
# 2. Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [67]:
# ---------- Without Scaling ----------
model_no_scaling = LogisticRegression(max_iter=5000, solver='lbfgs', multi_class='ovr')
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

In [68]:
# ---------- With Scaling ----------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaling = LogisticRegression(max_iter=5000, solver='lbfgs', multi_class='ovr')
model_scaling.fit(X_train_scaled, y_train)
y_pred_scaling = model_scaling.predict(X_test_scaled)
accuracy_scaling = accuracy_score(y_test, y_pred_scaling)

In [69]:
# ---------- Print Results ----------
print(f"Accuracy without Scaling: {accuracy_no_scaling:.4f}")
print(f"Accuracy with Scaling:    {accuracy_scaling:.4f}")

Accuracy without Scaling: 0.9444
Accuracy with Scaling:    1.0000
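
To show what the scaling step does, here is a small sketch on the same dataset and split. `StandardScaler` subtracts each feature's training mean and divides by its training standard deviation, and those same training statistics are reused for the test set. The variable `manual` is introduced here only for comparison.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler().fit(X_train)      # learns mean and std from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # reuses the training mean and std

# The same transformation written out by hand
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print("Raw feature std range:   ", X_train.std(axis=0).min(), "to", X_train.std(axis=0).max())
print("Scaled feature std range:", X_train_scaled.std(axis=0).min(), "to", X_train_scaled.std(axis=0).max())
```

The raw features differ in spread by several orders of magnitude, which is a likely reason the unscaled model above scores lower.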


10. Imagine you are working at an e-commerce company that wants to 
predict which customers will respond to a marketing campaign. Given an imbalanced 
dataset (only 5% of customers respond), describe the approach you’d take to build a 
Logistic Regression model — including data handling, feature scaling, balancing 
classes, hyperparameter tuning, and evaluating the model for this real-world business 
use case.

Step 1: Data Handling

Data Cleaning

Handle missing values (imputation strategies).

Remove or cap outliers (extreme spending values, unusual behavior).

Feature Engineering

Create meaningful features like:

Recency (days since last purchase)

Frequency (number of purchases in last 6 months)

Monetary value (average spend per order)

Campaign interactions (email opens, clicks, etc.)

Feature Scaling

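Apply StandardScaler or MinMaxScaler so that all features are on a comparable scale. This matters for Logistic Regression because the regularization penalty acts on coefficient size, which depends on feature scale, and because scaled features help the solver converge. Fit the scaler on the training data only and reuse it for the test data.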

Step 2: Handling Class Imbalance

Since only 5% are responders:

Resampling techniques

Oversampling the minority class (e.g., SMOTE) to create synthetic responders. Apply it to the training data only, so the test set keeps the real 5% response rate.

Undersampling the majority class to balance, but carefully (don’t lose too much data).

Class weights

Logistic Regression allows class_weight='balanced' to penalize mistakes on minority class more heavily.

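In practice, I’d try both SMOTE and class_weight, compare them with cross-validation, and keep whichever works better; stacking the two can over-correct toward the minority class.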
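
As a sketch of what `class_weight='balanced'` means, scikit-learn sets each class weight to n_samples / (n_classes * class_count). The labels below are made up to mirror the 5% response rate.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 95 non-responders (0) and 5 responders (1)
y = np.array([0] * 95 + [1] * 5)
classes = np.array([0, 1])

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Same formula written out by hand
manual = len(y) / (len(classes) * np.bincount(y))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight = {w:.4f}")
```

With these counts, a mistake on a responder weighs 19 times as much as a mistake on a non-responder (10.0 versus about 0.53).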

Step 3: Model Building

Train a Logistic Regression model with:

Regularization (L1/L2) to prevent overfitting.

Feature scaling applied to all numeric features.

class_weight='balanced' to handle imbalance.

Step 4: Hyperparameter Tuning

Use GridSearchCV to tune:

C → Inverse of regularization strength (smaller C means a stronger penalty).

penalty → L1 or L2.

solver → Choose depending on penalty (liblinear or saga for L1; lbfgs for L2).

Use Stratified K-Fold Cross Validation (to preserve class ratios).
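
The tuning step could be sketched as follows. Synthetic data from `make_classification` stands in for the real customer table, and the grid values, step names and the `average_precision` scoring choice are assumptions for this example rather than a prescribed setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer data: roughly 5% positives
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(
        solver="liblinear", class_weight="balanced", max_iter=5000
    )),
])

param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
}

# Stratified folds keep the responder rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="average_precision")
search.fit(X, y)

print("Best Parameters:", search.best_params_)
print(f"Best CV average precision: {search.best_score_:.4f}")
```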

Step 5: Evaluation Metrics

Since the dataset is imbalanced, accuracy is misleading: a model that always predicts "no response" would score 95%.

Focus on:

Precision → Are predicted responders really responders? (important if contacting customers is costly).

Recall (Sensitivity) → Are we catching most responders? (important if missing responders means lost revenue).

F1 Score → Balance between precision and recall.

ROC-AUC → Overall ability to distinguish responders vs. non-responders.

PR Curve (Precision-Recall) → More informative in imbalanced settings.
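
ROC-AUC and the area under the PR curve are both computed from predicted scores rather than hard class labels. A tiny sketch with made-up labels and scores:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Made-up example: 1 = responder, scores = predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (responder, non-responder) pairs are ranked correctly -> 0.75
auc = roc_auc_score(y_true, scores)

# Average precision summarizes the precision-recall curve:
# 0.5 * 1.0 + 0.5 * (2/3) = 0.8333...
ap = average_precision_score(y_true, scores)

print(f"ROC-AUC:           {auc:.4f}")
print(f"Average precision: {ap:.4f}")
```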

Step 6: Business Integration

Use the predicted probabilities, not just classes.

Rank customers by probability of response.

The marketing team can target the top N% customers (budget-limited campaigns).

Example: If targeting top 10% yields 60% of all responders → campaign efficiency improves dramatically.
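
The ranking idea can be sketched with a small helper. The function name `capture_rate` and the toy arrays are hypothetical and used only for illustration. It computes the share of all responders found among the top-scored fraction of customers.

```python
import numpy as np

def capture_rate(y_true, proba, top_fraction):
    """Share of all responders found in the top `top_fraction` of customers by score."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    n_top = max(1, int(round(top_fraction * len(y_true))))
    top_idx = np.argsort(proba)[::-1][:n_top]  # highest probabilities first
    return y_true[top_idx].sum() / y_true.sum()

# Toy data (made up): 10 customers, 2 of them responders
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
proba = [0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.05, 0.15, 0.25, 0.35]

top10 = capture_rate(y_true, proba, 0.10)
top30 = capture_rate(y_true, proba, 0.30)

print(f"Top 10% captures {top10:.0%} of responders")
print(f"Top 30% captures {top30:.0%} of responders")
```

In a real campaign, `proba` would come from `model.predict_proba(X)[:, 1]` on held-out customers, and the marketing budget would decide the fraction to target.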