# Logistic Regression | Assignment

**Question 1:  What is Logistic Regression, and how does it differ from Linear
Regression?**

**Answer:-** **Logistic Regression** is a supervised machine learning algorithm primarily used for predicting categorical outcomes, especially binary classes (such as yes/no, 0/1) by estimating the probability that an observation belongs to a certain class. It differs from Linear Regression in both its purpose and mathematical formulation: linear regression predicts continuous values, while logistic regression predicts probabilities for discrete classes.

**Logistic Regression:**-
Logistic regression estimates the probability of a categorical event, usually binary, occurring based on independent variables.

It uses a logistic (sigmoid) function to transform the output of a linear combination of input features into a value between 0 and 1, representing probability.

The logistic regression equation is:-

$$y(x) = \frac{e^{a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_i x_i}}{1 + e^{a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_i x_i}}$$

where $y(x)$ is the probability estimate.
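
As a minimal sketch of the equation above, the snippet below evaluates y(x) for a single observation. The intercept, coefficients, and feature values are made up for illustration and do not come from a fitted model.

```python
import math

def logistic_probability(coefficients, intercept, features):
    """Return e^z / (1 + e^z) for z = a0 + a1*x1 + ... + ai*xi."""
    z = intercept + sum(a * x for a, x in zip(coefficients, features))
    return math.exp(z) / (1 + math.exp(z))

# Hypothetical values chosen only to illustrate the formula
a0 = -1.0          # intercept
a = [0.8, -0.5]    # coefficients a1, a2
x = [2.0, 1.0]     # feature values x1, x2

p = logistic_probability(a, a0, x)
print(round(p, 4))  # z = 0.1, so the probability is just above 0.5
```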


**Question 2: Explain the role of the Sigmoid function in Logistic Regression.**

**Answer:-** The sigmoid function is central to logistic regression because it converts the linear combination of input features into a probability value between 0 and 1, enabling meaningful classification for binary outcomes.

**What the Sigmoid Function Does**

The sigmoid function has the formula:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where z is the output of the linear equation (log-odds) computed from input features and model parameters.

The output is always between 0 and 1, making it ideal for representing probabilities.

If z is very large and positive, σ(z) approaches 1; if z is very large and negative, σ(z) approaches 0; when z = 0, σ(z) = 0.5.
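
The limiting behaviour described above can be checked numerically. This is a small sketch in plain Python; the function name `sigmoid` and the sample inputs are chosen here for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z) to avoid overflow
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Outputs move from near 0, through 0.5 at z = 0, to near 1
for z in (-10, -2, 0, 2, 10):
    print(z, round(sigmoid(z), 5))
```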

**Role of the Sigmoid function in Logistic Regression.**

>1. Logistic regression uses the sigmoid function to translate potentially unbounded real values (from the linear equation) into a valid probability range.

>2. This probability can then be thresholded (typically at 0.5) to classify an observation as class 0 or 1.

>3. The sigmoid's S-shaped curve is steepest around z = 0 (probability 0.5), so the predicted probability changes fastest near the decision threshold and flattens out for inputs far from it, pushing confident predictions toward 0 or 1.
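
The thresholding step in point 2 can be sketched as follows. The probabilities are hypothetical, and treating a probability of exactly 0.5 as class 1 is a convention assumed for this example.

```python
def classify(probabilities, threshold=0.5):
    """Assign class 1 when the probability reaches the threshold, else class 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical predicted probabilities for five observations
probs = [0.08, 0.47, 0.50, 0.62, 0.93]

labels_default = classify(probs)       # typical threshold of 0.5
labels_strict = classify(probs, 0.7)   # a stricter, hypothetical threshold
print(labels_default)  # [0, 0, 1, 1, 1]
print(labels_strict)   # [0, 0, 0, 0, 1]
```

Raising the threshold makes the model more conservative about predicting class 1, which matters when the two kinds of error have different costs.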




**Question 3: What is Regularization in Logistic Regression and why is it needed?**

**Answer:-** **Regularization in logistic regression** is a technique for penalizing model complexity to prevent overfitting and improve the ability to generalize predictions to unseen data.
Regularization modifies the loss function by adding a penalty term based on the magnitude of model coefficients.

**The two most common types are:**

**L1 regularization (Lasso):** Adds the sum of absolute values of coefficients as a penalty, which can reduce some coefficients entirely to zero, effectively performing feature selection.

**L2 regularization (Ridge):** Adds the sum of squared values of coefficients as a penalty, shrinking them toward zero but not exactly to zero.
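
To make the two penalty terms concrete, here is a small sketch that computes both for a made-up coefficient vector. The helper names are hypothetical, and the penalty strength that multiplies each term in the full loss is left out.

```python
def l1_penalty(coefficients):
    """Sum of absolute coefficient values (Lasso penalty)."""
    return sum(abs(w) for w in coefficients)

def l2_penalty(coefficients):
    """Sum of squared coefficient values (Ridge penalty)."""
    return sum(w * w for w in coefficients)

# Hypothetical coefficient vector
weights = [3.0, -0.5, 0.0, 0.25]

print(l1_penalty(weights))  # 3.75
print(l2_penalty(weights))  # 9.3125
```

Squaring makes the single large coefficient dominate the L2 penalty, while the L1 penalty grows linearly with each coefficient.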

**Why Regularization Is Needed:-**

>1. Prevents overfitting: Without regularization, logistic regression models can learn noise and idiosyncrasies of the training data, resulting in poor performance on new data.

>2. Improves generalizability: By constraining excessively large coefficients, the model focuses on more substantial patterns and avoids memorizing the training set.

>3. Handles multicollinearity: Regularization shrinks the coefficients of correlated predictors, which stabilizes the estimates and makes them easier to interpret.

>4. Balances bias-variance: Regularization increases bias slightly but can substantially lower variance, the main driver of overfitting.
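
One way to see the overfitting point: on perfectly separable data the unpenalized log loss keeps falling as the coefficient grows, so nothing stops the weights from becoming arbitrarily large. The sketch below uses a made-up one-feature dataset and a hypothetical L2 penalty strength `lam`; it is an illustration, not how any particular library implements the loss.

```python
import math

def log_loss(w, xs, ys):
    """Mean logistic loss for a one-feature model with coefficient w and no intercept."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = w * x
        margin = z if y == 1 else -z  # positive when the prediction agrees with the label
        total += math.log1p(math.exp(-margin))
    return total / len(xs)

# Made-up, perfectly separable data: negative x -> class 0, positive x -> class 1
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
lam = 0.01  # hypothetical L2 penalty strength

candidate_weights = (1.0, 10.0, 100.0)
plain = [log_loss(w, xs, ys) for w in candidate_weights]
penalized = [log_loss(w, xs, ys) + lam * w * w for w in candidate_weights]

print(plain)      # keeps shrinking as w grows
print(penalized)  # grows once the penalty outweighs the tiny gain in fit
```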




**Question 4: What are some common evaluation metrics for classification models, and
why are they important?**

**Answer:-** Common evaluation metrics for classification models include accuracy, precision, recall, F1 score, confusion matrix, and AUC-ROC. These metrics are crucial for understanding model performance beyond simple correctness, especially when data is imbalanced or when different types of errors carry different consequences.

**Key Classification Metrics:-**

**Accuracy:** Measures the proportion of correctly classified instances among all instances. Useful as a general measure, but can be misleading with imbalanced classes.

**Precision:** Indicates how many predicted positives are true positives. Important when the cost of false positives is high (e.g., spam filtering, where flagging a legitimate email is costly).

**Recall:** Tells how many actual positives were correctly identified by the model. Crucial when missing a positive instance has a high cost (e.g., fraud detection).

**F1 Score:** Harmonic mean of precision and recall. Provides a balance between precision and recall, especially useful when classes are imbalanced.

**Confusion Matrix:** Tabular representation of actual vs. predicted classifications, showing true positives, false positives, true negatives, and false negatives. Offers deeper insight into model performance across all outcomes.

**AUC-ROC:** Area Under the Receiver Operating Characteristic Curve reflects the model's ability to distinguish between classes across thresholds. Valuable for interpreting model ranking and robustness, independent of a single threshold.
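
The first four metrics can all be derived from the confusion-matrix counts. The following sketch does this in plain Python on a made-up imbalanced example; the function names are hypothetical, and in practice `sklearn.metrics` provides equivalents.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def summarize(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up imbalanced example: 95 negatives, 5 positives, only 1 positive found
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1, 0, 0, 0, 0]

metrics = summarize(y_true, y_pred)
print(metrics)
```

Here accuracy is 0.96 even though recall is only 0.2, which is the kind of gap that accuracy alone hides.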

**Importance of These Metrics:-**

>1. Metrics like precision and recall help pinpoint weaknesses, such as a model with high accuracy but poor recall for minority classes.

>2. Choosing appropriate metrics prevents misleading evaluation and aligns performance measurement with business or scientific goals (for example, maximizing recall in cancer screening; maximizing precision in spam detection).

>3. These metrics guide model selection, hyperparameter tuning, and deployment decisions to optimize real-world outcomes.
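
The precision/recall trade-off in point 2 is usually controlled through the decision threshold. A rough sketch with made-up scores and labels (none of these numbers come from a real model):

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model scores and true labels for ten observations
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

results = {t: precision_recall_at(scores, labels, t) for t in (0.25, 0.5, 0.8)}
for t, (p, r) in results.items():
    print(t, round(p, 3), round(r, 3))
```

On this toy data, lowering the threshold raises recall at the cost of precision, and raising it does the opposite.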

**Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package)**

In [None]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import load_iris

# Load dataset from scikit-learn into a DataFrame (stands in for reading a CSV with pd.read_csv)
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Split into features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into train test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Logistic Regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict and accuracy
y_pred = model.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9666666666666667


**Question 6:  Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.**

In [None]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import load_iris

# Load iris dataset
data = load_iris()
df = pd.DataFrame(data.data,columns=data.feature_names)
df["target"]=data.target

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Create and train logistic regression model with L2 regularization (default penalty='l2')
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l2', max_iter=200)
model.fit(X_train, y_train)

# Predict and measure accuracy
y_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)

# Get model coefficients and intercept
coefficients = model.coef_
intercept = model.intercept_

# Print results
print("Accuracy:", accuracy)
print("Coefficients:", coefficients)

print("Intercept:", intercept)

Accuracy: 0.9666666666666667
Coefficients: [[-0.43171259  0.82344651 -2.35119244 -0.96938012]
 [ 0.61818491 -0.42815386 -0.20595953 -0.82952283]
 [-0.18647232 -0.39529265  2.55715197  1.79890295]]
Intercept: [  9.49111756   1.6376074  -11.12872496]


**Question 7: Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)**

In [None]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import load_iris

# Load iris dataset
data= load_iris()
df = pd.DataFrame(data.data,columns=data.feature_names)
df["target"]=data.target

# Define features and target
X = df.drop('target', axis=1)
y = df['target']

# Split data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

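# Train Logistic Regression model with one-vs-rest strategy
# Note: newer scikit-learn releases deprecate multi_class; OneVsRestClassifier(LogisticRegression()) is the documented alternative
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(multi_class='ovr', max_iter=200)
model.fit(X_train, y_train)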

# Make predictions on test set
y_pred = model.predict(X_test)

# Print classification report (stored under a new name so the imported function is not shadowed)
from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred)
print(report)


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.69      0.82        13
           2       0.60      1.00      0.75         6

    accuracy                           0.87        30
   macro avg       0.87      0.90      0.86        30
weighted avg       0.92      0.87      0.87        30
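
One-vs-rest reduces the three-class problem to three binary problems, one per class, and predicts the class whose binary model scores highest. A rough sketch of that bookkeeping in plain Python, with made-up scores standing in for the per-class model outputs (the helper names are hypothetical):

```python
def one_vs_rest_targets(y, classes):
    """For each class, build a binary target: 1 for that class, 0 for all others."""
    return {c: [1 if label == c else 0 for label in y] for c in classes}

def predict_ovr(scores_per_class):
    """Pick the class whose binary model gives the highest score."""
    return max(scores_per_class, key=scores_per_class.get)

y = [0, 1, 2, 1, 0]
targets = one_vs_rest_targets(y, classes=[0, 1, 2])
print(targets[1])  # [0, 1, 0, 1, 0]

# Hypothetical scores from the three binary models for one sample
sample_scores = {0: 0.10, 1: 0.35, 2: 0.80}
print(predict_ovr(sample_scores))  # 2
```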



**Question 8: Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.**

In [None]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import load_iris

# Load iris dataset
data = load_iris()
df = pd.DataFrame(data.data,columns=data.feature_names)
df["target"]=data.target

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into train and validation sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Define the logistic regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=200, solver='liblinear')

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

# Setup GridSearchCV with 5-fold cross-validation
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')

# Fit GridSearchCV to training data
grid_search.fit(X_train, y_train)

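# Obtain best parameters and accuracy on the held-out set
# (the mean cross-validated accuracy of the best setting is available as grid_search.best_score_)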
best_params = grid_search.best_params_
y_pred = grid_search.predict(X_test)
from sklearn.metrics import accuracy_score
val_accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", best_params)
print("Validation Accuracy:", val_accuracy)

Best Parameters: {'C': 10, 'penalty': 'l2'}
Validation Accuracy: 0.9333333333333333
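
As a rough illustration of how much work the grid search above involves, the following sketch enumerates the candidate settings from the same `param_grid`. It does not call scikit-learn; it only counts combinations, and `cv_folds` mirrors the `cv=5` argument.

```python
from itertools import product

param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
}
cv_folds = 5

# Every combination of the listed values is one candidate setting
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))             # 10 candidate settings
print(len(candidates) * cv_folds)  # 50 cross-validation fits
print(candidates[0])               # {'C': 0.01, 'penalty': 'l1'}
```

With the default `refit=True`, GridSearchCV then fits once more on the full training set using the best setting.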


**Question 9: Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling.**

In [None]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Train logistic regression without scaling
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred)

# Standardize features (fit the scaler on the training data only, then apply it to the test data)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression with scaling
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
y_pred_scaled = model.predict(X_test_scaled)
accuracy_scaling = accuracy_score(y_test, y_pred_scaled)

print("Accuracy without scaling:", accuracy_no_scaling)
print("Accuracy with scaling:", accuracy_scaling)

Accuracy without scaling: 0.9777777777777777
Accuracy with scaling: 0.9555555555555556
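
For reference, standardization subtracts each feature's training mean and divides by its training standard deviation, and the same statistics are reused for the test data. A plain-Python sketch for a single made-up feature column (the function name `standardize` is hypothetical):

```python
import statistics

def standardize(train_column, test_column):
    """Scale both columns using the mean and standard deviation of the training column."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)  # population standard deviation (divide by n)
    scale = std if std > 0 else 1.0        # guard against constant columns
    train_scaled = [(v - mean) / scale for v in train_column]
    test_scaled = [(v - mean) / scale for v in test_column]
    return train_scaled, test_scaled

# Made-up values for a single feature
train = [4.0, 6.0, 8.0, 10.0]
test = [7.0, 12.0]

train_scaled, test_scaled = standardize(train, test)
print(train_scaled)  # mean 0 and standard deviation 1 on the training column
print(test_scaled)   # test values expressed in training-set units
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.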


**Question 10: Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model — including data handling, feature scaling, balancing
classes, hyperparameter tuning, and evaluating the model for this real-world business**

**Answer:-** The approach, implemented in the code below on a simulated dataset with a 5% positive rate, is as follows:

>1. Data handling: hold out a test set with a stratified split so that the training and test sets both keep the 5% response rate.

>2. Balancing classes: oversample the minority class with SMOTE on the training set only, leaving the test set at its original class ratio.

>3. Feature scaling: fit a StandardScaler on the training data and apply the same transformation to the test data.

>4. Hyperparameter tuning: search over C and the penalty type (L1 or L2) with 5-fold GridSearchCV, scored by F1 rather than accuracy.

>5. Evaluation: report per-class precision, recall, and F1 together with ROC AUC on the untouched test set, since accuracy alone is misleading at a 5% positive rate.

In [4]:
import warnings
warnings.filterwarnings('ignore')


from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score
from imblearn.over_sampling import SMOTE

# Step 1: Simulate imbalanced dataset (5% positive class)
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10000, n_features=20, n_classes=2,
                           weights=[0.95, 0.05], flip_y=0, random_state=1)

# Step 2: Split into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Step 3: Apply SMOTE to the training data only; the test set keeps its original 5% positive rate
smote = SMOTE(random_state=1)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

# Step 4: Feature scaling (fit on the resampled training data, then apply to the test data)
scaler = StandardScaler()
X_train_bal_scaled = scaler.fit_transform(X_train_bal)
X_test_scaled = scaler.transform(X_test)

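# Step 5: Logistic Regression with hyperparameter tuning and class weights
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # 'liblinear' supports both l1 and l2 penalties
}

# The training data is already balanced by SMOTE, so class_weight='balanced' has little extra effect here
logreg = LogisticRegression(class_weight='balanced', max_iter=500)

# F1 is used for model selection because accuracy is uninformative at a 5% positive rate.
# Note: the folds contain SMOTE-generated samples, so the cross-validated scores can be optimistic.
grid_search = GridSearchCV(logreg, param_grid, cv=5, scoring='f1', n_jobs=-1)
grid_search.fit(X_train_bal_scaled, y_train_bal)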

# Step 6: Evaluation on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_scaled)
y_pred_prob = best_model.predict_proba(X_test_scaled)[:, 1]

print("Best Hyperparameters:", grid_search.best_params_)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, y_pred_prob))

Best Hyperparameters: {'C': 0.1, 'penalty': 'l1', 'solver': 'liblinear'}
Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.92      0.95      2850
           1       0.36      0.84      0.50       150

    accuracy                           0.92      3000
   macro avg       0.67      0.88      0.73      3000
weighted avg       0.96      0.92      0.93      3000

ROC AUC Score: 0.933008187134503
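
The `class_weight='balanced'` setting used in Step 5 weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python on made-up label lists to show what it does at a 5% positive rate; the function name is hypothetical.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weights of the form n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Made-up labels with a 5% positive rate
y_imbalanced = [0] * 950 + [1] * 50
weights_imbalanced = balanced_class_weights(y_imbalanced)
print(weights_imbalanced)  # the minority class gets a weight of 10.0

# After resampling to equal class sizes, both weights become 1.0
y_resampled = [0] * 950 + [1] * 950
weights_resampled = balanced_class_weights(y_resampled)
print(weights_resampled)
```

This is why class weighting and oversampling are usually treated as alternatives: once the training set has been resampled to equal class sizes, the balanced weights no longer change anything.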
