# **Logistic Regression | Assignment**

**Question 1)  What is Logistic Regression, and how does it differ from Linear
Regression?**

**Answer 1)** **Logistic Regression** is a statistical method for classification problems. It predicts the probability that a given input belongs to a particular class (e.g., spam vs. not spam, yes vs. no) by passing a linear combination of the features through the sigmoid (logistic) function, which maps any real value into the range between 0 and 1.

**Linear Regression** is used for regression problems where the target is continuous (e.g., predicting house prices, salary, or temperature). It fits a straight line (a hyperplane when there are several features) to estimate numeric values, so its output is unbounded, whereas Logistic Regression outputs a probability that is then thresholded into a class label.
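
The contrast can be seen on a small example. The sketch below uses made-up toy data (hours studied vs. pass/fail, an assumption for illustration only) and fits both models to the same binary target: the linear model returns unbounded numbers, while the logistic model returns probabilities and class labels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (made up for illustration): hours studied vs. fail (0) / pass (1)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

# Inputs far outside the training range expose the difference
X_new = np.array([[-5.0], [4.5], [15.0]])
lin_pred = lin.predict(X_new)                # unbounded numeric estimates
log_proba = log.predict_proba(X_new)[:, 1]   # probabilities of class 1, always in [0, 1]
log_label = log.predict(X_new)               # class labels after the 0.5 threshold

print("Linear regression output :", lin_pred.round(3))
print("Logistic probabilities   :", log_proba.round(3))
print("Logistic class labels    :", log_label)
```

For the extreme inputs the linear model predicts values below 0 and above 1, which cannot be read as probabilities, while the logistic outputs stay inside the valid range.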

**Question 2) Explain the role of the Sigmoid function in Logistic Regression.**

**Answer 2)** The Sigmoid function plays a central role in Logistic Regression because it converts the linear combination of input features (which can take any real value) into a probability between 0 and 1.

* Logistic Regression first computes a linear equation:

$$z = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$$

* The Sigmoid function is then applied:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

* This output represents the probability of the event happening (class = 1).

* If the probability is greater than a chosen threshold (commonly 0.5), the observation is classified as 1 (positive class), otherwise 0 (negative class).
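
The steps above can be sketched in a few lines of NumPy. The coefficient values and the observation are hypothetical numbers chosen only to make the arithmetic easy to follow; they do not come from a trained model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one observation (illustrative values only)
b = np.array([-1.0, 0.8, 0.5])   # b0 (intercept), b1, b2
x = np.array([1.0, 2.0, 1.0])    # leading 1.0 multiplies the intercept; x1 = 2, x2 = 1

z = b @ x                        # z = b0 + b1*x1 + b2*x2 = -1 + 1.6 + 0.5 = 1.1
p = sigmoid(z)                   # probability of class 1
label = int(p > 0.5)             # apply the 0.5 threshold

print(f"z = {z:.2f}, probability = {p:.4f}, predicted class = {label}")
```

Since `sigmoid(0)` is exactly 0.5, thresholding the probability at 0.5 is the same as checking whether `z` is positive.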

**Question 3) What is Regularization in Logistic Regression and why is it needed?**

**Answer 3)** Regularization in Logistic Regression is a technique used to prevent the model from overfitting by adding a penalty to very large coefficient values.

* In Logistic Regression, the model learns weights (coefficients) for each feature. If these weights become too large, the model may fit the training data very well but perform poorly on unseen data (overfitting).

* Regularization adds a penalty term to the cost function so that the model prefers smaller, more balanced weights.
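
A rough way to see the penalty at work is to train the same model with different regularization strengths and compare the size of the learned weights. The sketch below assumes scikit-learn's `LogisticRegression`, where `C` is the inverse of the regularization strength, and uses the breast cancer dataset with standardized features; the specific `C` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:     # smaller C = stronger L2 penalty
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=5000))
    model.fit(X, y)
    # make_pipeline names each step after its lowercased class name
    coefs = model.named_steps["logisticregression"].coef_
    coef_norms[C] = float(np.linalg.norm(coefs))
    print(f"C={C:<6} L2 norm of coefficients: {coef_norms[C]:.3f}")
```

The coefficient norm should shrink as `C` decreases, which is the "smaller, more balanced weights" effect described above.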

**Question 4) What are some common evaluation metrics for classification models, and
why are they important?**

**Answer 4)** Common evaluation metrics for classification models measure how well the model is performing. They are important because each one captures a different kind of error, which matters especially when the data is imbalanced.

**1) Accuracy**

* Measures the percentage of correctly predicted observations.

* Formula, where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

* Useful when classes are balanced, but misleading for imbalanced datasets.

**2) Precision**

* Out of all predicted positives, how many are actually positive.

* Formula:

$$\text{Precision} = \frac{TP}{TP + FP}$$

* Important when the cost of false positives is high (e.g., spam detection).

**3) Recall (Sensitivity or True Positive Rate)**

* Out of all actual positives, how many were correctly predicted.

* Formula:

$$\text{Recall} = \frac{TP}{TP + FN}$$

* Important when the cost of false negatives is high (e.g., disease diagnosis).
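
The three formulas can be checked by hand against scikit-learn. The labels below are made up for illustration; the sketch reads TP, TN, FP, and FN from the confusion matrix and compares the manual results with the library functions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # (3 + 5) / 10
precision = tp / (tp + fp)                   # 3 / (3 + 1)
recall = tp / (tp + fn)                      # 3 / (3 + 1)

print(f"Manual : accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
print(f"sklearn: accuracy={accuracy_score(y_true, y_pred):.2f}, "
      f"precision={precision_score(y_true, y_pred):.2f}, "
      f"recall={recall_score(y_true, y_pred):.2f}")
```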

# Question 5) Write a Python program that loads a CSV file into a Pandas DataFrame,
# splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
# (Use Dataset from sklearn package)

# Logistic Regression with sklearn dataset

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()


df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Features (X) and Target (y)
X = df.drop('target', axis=1)
y = df['target']

# Split into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Logistic Regression model: {accuracy:.4f}")


Accuracy of Logistic Regression model: 0.9561


# Question 6) Write a Python program to train a Logistic Regression model using L2
# regularization (Ridge) and print the model coefficients and accuracy.



import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer


data = load_breast_cancer()

# Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Features (X) and Target (y)
X = df.drop('target', axis=1)
y = df['target']

# Split into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=5000)
model.fit(X_train, y_train)


y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)


print("Model Coefficients (per feature):")
for feature, coef in zip(X.columns, model.coef_[0]):
    print(f"{feature}: {coef:.4f}")

print(f"\nIntercept: {model.intercept_[0]:.4f}")
print(f"\nAccuracy of Logistic Regression with L2 Regularization: {accuracy:.4f}")


Model Coefficients (per feature):
mean radius: 1.0274
mean texture: 0.2215
mean perimeter: -0.3621
mean area: 0.0255
mean smoothness: -0.1562
mean compactness: -0.2377
mean concavity: -0.5326
mean concave points: -0.2837
mean symmetry: -0.2267
mean fractal dimension: -0.0365
radius error: -0.0971
texture error: 1.3706
perimeter error: -0.1814
area error: -0.0872
smoothness error: -0.0225
compactness error: 0.0474
concavity error: -0.0429
concave points error: -0.0324
symmetry error: -0.0347
fractal dimension error: 0.0116
worst radius: 0.1117
worst texture: -0.5089
worst perimeter: -0.0156
worst area: -0.0169
worst smoothness: -0.3077
worst compactness: -0.7727
worst concavity: -1.4286
worst concave points: -0.5109
worst symmetry: -0.7469
worst fractal dimension: -0.1009

Intercept: 28.6487

Accuracy of Logistic Regression with L2 Regularization: 0.9561


# Question 7) Write a Python program to train a Logistic Regression model for multiclass
# classification using multi_class='ovr' and print the classification report.



import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris

data = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


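# Note: multi_class is deprecated since scikit-learn 1.5; on newer versions,
# sklearn.multiclass.OneVsRestClassifier(LogisticRegression(...)) gives the same one-vs-rest setup.
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=5000)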
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Print classification report
print("Classification Report (One-vs-Rest Logistic Regression):\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report (One-vs-Rest Logistic Regression):

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30





# Question 8) Write a Python program to apply GridSearchCV to tune C and penalty
# hyperparameters for Logistic Regression and print the best parameters and validation
# accuracy.


# Question 8: Hyperparameter Tuning with GridSearchCV for Logistic Regression

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.drop('target', axis=1)
y = df['target']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


log_reg = LogisticRegression(solver='liblinear', max_iter=5000)

# Define parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}


grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)


print("Best Parameters:", grid_search.best_params_)
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}")

test_accuracy = grid_search.score(X_test, y_test)
print(f"Test Set Accuracy: {test_accuracy:.4f}")


Best Parameters: {'C': 100, 'penalty': 'l1'}
Best Cross-Validation Accuracy: 0.9670
Test Set Accuracy: 0.9825


# Question 9) Write a Python program to standardize the features before training Logistic
# Regression and compare the model's accuracy with and without scaling.



# Question 9: Logistic Regression Accuracy with and without Feature Scaling

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer


data = load_breast_cancer()


df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target


X = df.drop('target', axis=1)
y = df['target']

# Split into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model_no_scaling = LogisticRegression(max_iter=5000)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaling = LogisticRegression(max_iter=5000)
model_scaling.fit(X_train_scaled, y_train)
y_pred_scaling = model_scaling.predict(X_test_scaled)
accuracy_scaling = accuracy_score(y_test, y_pred_scaling)


print(f"Accuracy without Scaling: {accuracy_no_scaling:.4f}")
print(f"Accuracy with Scaling   : {accuracy_scaling:.4f}")


Accuracy without Scaling: 0.9561
Accuracy with Scaling   : 0.9737


**Question 10) Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model, including data handling, feature scaling, balancing
classes, hyperparameter tuning, and evaluating the model for this real-world business
use case.**


**Answer 10)**

**1) Data Handling**

* Data Cleaning: Handle missing values, remove duplicates, and treat outliers.

* Feature Engineering: Create useful features such as recency of last purchase, frequency of past purchases, or average spend.

* Encoding: Encode categorical variables using One-Hot Encoding (low cardinality) or Target Encoding (high cardinality).

**2) Feature Scaling**

* Logistic Regression is fitted with iterative solvers, and its regularization penalty treats all coefficients alike, so features should be on a similar scale.

* Apply StandardScaler (z-score normalization) or MinMaxScaler to numerical features for stable convergence.
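
A minimal sketch of z-score scaling with `StandardScaler`, assuming two hypothetical customer features on very different scales (average spend and number of past purchases; the values are invented). The scaler is fitted on the training rows only and then reused on the test row, so no test information leaks into the statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customer features (invented values):
# column 0 = average spend, column 1 = number of past purchases
X_train = np.array([[1200.0, 2.0], [300.0, 10.0], [4500.0, 1.0], [800.0, 5.0]])
X_test = np.array([[1000.0, 3.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)         # reuse the training mean/std

print("Train means after scaling:", X_train_scaled.mean(axis=0).round(6))
print("Train stds after scaling :", X_train_scaled.std(axis=0).round(6))
print("Scaled test row          :", X_test_scaled.round(3))
```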

**3) Balancing Classes**

* Since only 5% of customers respond, the dataset is highly imbalanced. To address this:

* Resampling techniques: Oversample the minority class (e.g., SMOTE) or undersample the majority class to balance proportions. Resample only the training data so that the test set keeps the real class ratio.

* Class Weights: Use class_weight='balanced' in Logistic Regression so the model penalizes misclassifications of the minority class more.
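
To make the class-weight option concrete, the sketch below computes the weights that `'balanced'` would assign for a hypothetical label vector with the 5% response rate from the question (950 non-responders and 50 responders, an assumed sample size). It uses scikit-learn's `compute_class_weight` helper, which applies the rule `n_samples / (n_classes * count_of_class)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels matching the scenario: 950 non-responders (0), 50 responders (1)
y = np.array([0] * 950 + [1] * 50)

# 'balanced' weight for a class = n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
weight_by_class = {c: round(float(w), 4) for c, w in zip([0, 1], weights)}

print("Class weights:", weight_by_class)
print(f"Weight ratio (responder / non-responder): {weights[1] / weights[0]:.1f}")
```

With these counts each responder counts 19 times as much as a non-responder in the loss, which offsets the 19:1 class ratio.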

**4) Model Training & Hyperparameter Tuning**

* Fit a Logistic Regression model with regularization (L1/L2) to prevent overfitting.

* Perform Grid Search or Random Search with cross-validation to tune C (inverse of regularization strength) and the penalty type (L1 or L2).
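
One possible way to wire the tuning step together is sketched below. Since no campaign data is available here, it assumes a synthetic stand-in generated with `make_classification` (roughly 5% positives); the step names `scaler` and `clf`, the grid values, and the choice of `average_precision` as the scoring metric are illustrative assumptions rather than a prescribed setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data: roughly 5% positives
X, y = make_classification(n_samples=4000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scaling lives inside the pipeline, so it is refitted on each training fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", class_weight="balanced", max_iter=5000)),
])

# liblinear supports both L1 and L2 penalties
param_grid = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="average_precision")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated average precision: {search.best_score_:.4f}")
```

Stratified folds keep the 5% response rate in every split, and putting the scaler inside the pipeline avoids leaking validation-fold statistics into training.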

**5) Evaluation Metrics**

Since accuracy can be misleading in imbalanced datasets (a model that predicts "no response" for every customer is already 95% accurate here), focus on:

* Precision, Recall, and F1-score (especially Recall, as missing a responder is costly),

* ROC-AUC Score to measure overall discrimination,

* PR-AUC (Precision-Recall AUC), which is more informative in highly imbalanced cases.
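
These metrics can be computed as in the sketch below, again on a synthetic stand-in dataset from `make_classification` (an assumption, since the real campaign data is not part of this assignment). PR-AUC is approximated here with scikit-learn's `average_precision_score`, and the class names passed to the report are hypothetical labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data: roughly 5% positives
X, y = make_classification(n_samples=4000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = make_pipeline(StandardScaler(),
                      LogisticRegression(class_weight="balanced", max_iter=5000))
model.fit(X_train, y_train)

y_proba = model.predict_proba(X_test)[:, 1]   # scores for the ranking-based metrics
y_pred = model.predict(X_test)                # labels at the default 0.5 threshold

roc_auc = roc_auc_score(y_test, y_proba)
pr_auc = average_precision_score(y_test, y_proba)
positive_rate = y_test.mean()                 # reference point: PR-AUC of a random ranking

print(classification_report(y_test, y_pred, target_names=["no response", "response"]))
print(f"ROC-AUC: {roc_auc:.4f}")
print(f"PR-AUC (average precision): {pr_auc:.4f}  vs. positive rate {positive_rate:.4f}")
```

Comparing PR-AUC with the positive rate gives a sense of how much better the model ranks responders than a random ordering would.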