1) What is Logistic Regression, and how does it differ from Linear Regression?
->
Logistic Regression is a classification method: it predicts the probability of a categorical outcome (such as "yes" or "no") by passing a linear combination of the features through a sigmoid function, which yields an S-shaped curve bounded between 0 and 1. Linear Regression, in contrast, predicts a continuous numerical outcome (such as sales or temperature) by fitting a linear relationship between the features and the target, so its output is unbounded. The key difference is the type of problem each solves: logistic for classification, linear for regression.
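
A minimal sketch of this difference, assuming a hypothetical toy dataset (hours studied versus pass/fail) that is not part of the original question. Both models are fit on the same binary labels to show how their outputs differ:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: hours studied (feature) and pass/fail label
hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lin = LinearRegression().fit(hours, passed)
log = LogisticRegression().fit(hours, passed)

# Linear regression returns an unbounded real number
lin_pred = lin.predict([[12.0]])[0]
# Logistic regression returns a probability in [0, 1] and a class label
log_prob = log.predict_proba([[12.0]])[0, 1]
log_label = log.predict([[12.0]])[0]

print("Linear output at 12 hours:", lin_pred)
print("Logistic P(pass) at 12 hours:", log_prob)
print("Logistic class at 12 hours:", log_label)
```

At 12 hours the linear model extrapolates past 1, which cannot be read as a probability, while the logistic model stays inside the 0 to 1 range.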

2) Explain the role of the Sigmoid function in Logistic Regression.
->
In Logistic Regression, the sigmoid (or logistic) function, σ(z) = 1 / (1 + e^(-z)), transforms the output of the linear model into a probability between 0 and 1, which makes it suitable for binary classification. Its S-shaped curve squashes any real-valued input into that range, so the model can estimate the likelihood of an event; a threshold (commonly 0.5) then converts the probability into a class label. The sigmoid is also differentiable everywhere, which gradient-based training methods such as gradient descent rely on.
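
The mapping from a linear score to a probability can be sketched as follows. The weights, bias, and input values are hypothetical and chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a two-feature model
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([2.0, 1.0])

z = np.dot(w, x) + b      # linear score: 0.8*2 - 0.5*1 + 0.1 = 1.2
p = sigmoid(z)            # estimated P(y = 1 | x)
label = int(p >= 0.5)     # class label at the usual 0.5 threshold

# Derivative of the sigmoid, sigmoid(z) * (1 - sigmoid(z)),
# which gradient-based training uses
grad = p * (1.0 - p)

print("score:", z, "probability:", p, "label:", label, "gradient:", grad)
```

Here the score 1.2 maps to a probability of roughly 0.769, so the example is assigned to class 1.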

3) What is Regularization in Logistic Regression and why is it needed?
->
Regularization in Logistic Regression prevents overfitting by adding a penalty term (commonly L1 or L2) to the model's loss function, which discourages extreme parameter values. It is needed because, without it, the model can keep pushing the training loss toward zero by growing its weights, especially when the classes are (nearly) linearly separable or there are many features, and so ends up fitting noise in the training data and performing poorly on new, unseen data. Regularization improves generalization by balancing fit to the training data against model simplicity.
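
A small sketch of the penalty's effect, assuming the Iris dataset and scikit-learn's default L2 penalty. In scikit-learn, `C` is the inverse of the regularization strength, so a smaller `C` means a stronger penalty and smaller coefficients:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Fit the same model with strong, moderate, and weak regularization
norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = np.linalg.norm(model.coef_)
    print(f"C={C:>6}: coefficient norm = {norms[C]:.3f}")
```

The coefficient norm grows as `C` increases, which is the shrinkage effect described above.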

4) What are some common evaluation metrics for classification models, and why are they important?
->
Common classification evaluation metrics include Accuracy, Precision, Recall, F1-Score, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC). They matter because each quantifies a different aspect of a model's predictions: the share of all classifications that are correct (Accuracy), the reliability of positive predictions (Precision), and the model's ability to identify all relevant cases (Recall). This supports informed decisions about model selection and hyperparameter tuning, and helps ensure the model aligns with specific business or scientific goals.


Common Evaluation Metrics

Accuracy: The ratio of correctly predicted instances (both true positives and true negatives) to the total number of instances.
Importance: A straightforward measure of overall performance, but can be misleading on imbalanced datasets.

Precision: The ratio of true positives (correctly predicted positives) to all instances predicted as positive.
Importance: Useful when minimizing false positives is crucial, such as in scenarios where a positive prediction carries a high cost.

Recall (or Sensitivity): The ratio of true positives to all actual positive instances in the dataset.
Importance: Important when minimizing false negatives is paramount, such as in medical diagnoses where failing to detect a condition can be very costly.

F1-Score: The harmonic mean of precision and recall, balancing the two in a single metric.
Importance: Provides a unified view of a model's performance on imbalanced datasets by considering both precision and recall.

AUC-ROC (Area Under the Receiver Operating Characteristic Curve): A single number summarizing the performance of a classification model across all possible thresholds. The ROC curve itself plots the true positive rate (recall) against the false positive rate.
Importance: Measures the model's ability to distinguish between classes independently of any single decision threshold. On heavily imbalanced datasets it can look optimistic, so the precision-recall AUC is often reported alongside it.
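
The definitions above can be checked by hand on a small example. The labels and scores below are hypothetical; the metrics are computed from the confusion-matrix counts:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical true labels and predicted probabilities for 10 samples
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1])
y_pred = (y_score >= 0.5).astype(int)   # labels at a 0.5 threshold

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
auc = roc_auc_score(y_true, y_score)    # uses scores, not thresholded labels

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("AUC-ROC:", auc)
```

With 3 true positives, 1 false positive, 1 false negative, and 5 true negatives, accuracy is 0.8 and precision, recall, and F1 are all 0.75. The AUC is 23/24 (about 0.958) because 23 of the 24 positive-negative pairs are ranked correctly.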


5) Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package)


In [1]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset (Iris dataset from sklearn)
iris = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features and target
X = df[iris.feature_names]
y = df['target']

# Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Logistic Regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Logistic Regression Model Accuracy:", accuracy)


Logistic Regression Model Accuracy: 1.0


In [2]:
#Write a Python program to train a Logistic Regression model using L2
#regularization (Ridge) and print the model coefficients and accuracy.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset (Iris)
iris = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features and target
X = df[iris.feature_names]
y = df['target']

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Logistic Regression with L2 regularization (the default penalty)
# multi_class is left at its default; the parameter is deprecated since scikit-learn 1.5
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=200)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Logistic Regression with L2 regularization (Ridge)")
print("Coefficients:\n", model.coef_)
print("Intercept:\n", model.intercept_)
print("Accuracy:", accuracy)



Logistic Regression with L2 regularization (Ridge)
Coefficients:
 [[-0.39345607  0.96251768 -2.37512436 -0.99874594]
 [ 0.50843279 -0.25482714 -0.21301129 -0.77574766]
 [-0.11497673 -0.70769055  2.58813565  1.7744936 ]]
Intercept:
 [  9.00884295   1.86902164 -10.87786459]
Accuracy: 1.0




In [3]:
'''Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)'''

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load dataset (Iris dataset)
iris = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features and target
X = df[iris.feature_names]
y = df['target']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

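# Train Logistic Regression with one-vs-rest (OvR)
# Note: multi_class is deprecated since scikit-learn 1.5; on newer versions,
# sklearn.multiclass.OneVsRestClassifier(LogisticRegression(...)) is the suggested replacement
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=200)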
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Classification Report
print("Logistic Regression with OvR (One-vs-Rest)")
print(classification_report(y_test, y_pred, target_names=iris.target_names))


Logistic Regression with OvR (One-vs-Rest)
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30





In [4]:
'''Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.
(Use Dataset from sklearn package)
'''

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Load dataset (Iris)
iris = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features and target
X = df[iris.feature_names]
y = df['target']

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Logistic Regression model
# 'liblinear' supports both l1 and l2 penalties (it handles multiclass targets one-vs-rest)
log_reg = LogisticRegression(max_iter=500, solver='liblinear')

# Define parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],       # Inverse of regularization strength (smaller = stronger)
    'penalty': ['l1', 'l2']             # Penalty types
}

# GridSearchCV
grid_search = GridSearchCV(
    estimator=log_reg,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit model
grid_search.fit(X_train, y_train)

# Results
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)
print("Test Accuracy:", grid_search.score(X_test, y_test))


Best Parameters: {'C': 10, 'penalty': 'l1'}
Best Cross-Validation Accuracy: 0.9583333333333334
Test Accuracy: 1.0


In [5]:
'''Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling.
(Use Dataset from sklearn package)
'''

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset (Iris)
iris = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features and target
X = df[iris.feature_names]
y = df['target']

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Logistic Regression without scaling
model_no_scaling = LogisticRegression(max_iter=500)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression with scaling
model_with_scaling = LogisticRegression(max_iter=500)
model_with_scaling.fit(X_train_scaled, y_train)
y_pred_with_scaling = model_with_scaling.predict(X_test_scaled)
accuracy_with_scaling = accuracy_score(y_test, y_pred_with_scaling)

# Results
print("Logistic Regression Accuracy (without scaling):", accuracy_no_scaling)
print("Logistic Regression Accuracy (with scaling):   ", accuracy_with_scaling)


Logistic Regression Accuracy (without scaling): 1.0
Logistic Regression Accuracy (with scaling):    1.0


In [6]:
'''Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model — including data handling, feature scaling, balancing
classes, hyperparameter tuning, and evaluating the model for this real-world business
use case.
'''

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, average_precision_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# -------------------------------
# 1. Load your dataset
# (Here we just simulate with sklearn dataset for demo purposes)
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=5000, n_features=20,
                           n_classes=2, weights=[0.95, 0.05],
                           random_state=42)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df['target'] = y

# -------------------------------
# 2. Train/Test Split (stratified to preserve imbalance)
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("target", axis=1),
    df["target"],
    test_size=0.2,
    stratify=df["target"],
    random_state=42
)

# -------------------------------
# 3. Build Pipeline: Scaling + SMOTE + Logistic Regression
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("logreg", LogisticRegression(solver="saga", max_iter=5000))
])

# -------------------------------
# 4. Define Hyperparameter Grid
param_grid = {
    "logreg__C": [0.01, 0.1, 1, 10],
    "logreg__penalty": ["l1", "l2"]
}

# -------------------------------
# 5. Grid Search with Stratified CV
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="roc_auc",   # focus on ranking positives higher
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

# -------------------------------
# 6. Evaluate on Test Set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

print("Best Parameters:", grid_search.best_params_)
print("Test ROC-AUC:", roc_auc_score(y_test, y_prob))
print("Test PR-AUC:", average_precision_score(y_test, y_prob))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Best Parameters: {'logreg__C': 0.01, 'logreg__penalty': 'l1'}
Test ROC-AUC: 0.8575853775853776
Test PR-AUC: 0.48348181189907835

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.87      0.92       945
           1       0.25      0.76      0.38        55

    accuracy                           0.86      1000
   macro avg       0.62      0.81      0.65      1000
weighted avg       0.94      0.86      0.89      1000
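
As an alternative to SMOTE for the class-balancing step, the classes can be reweighted inside the loss with `class_weight="balanced"`, which needs no extra package. The sketch below assumes the same simulated data as the cell above and only compares recall on the minority class; it is an illustration, not a tuned model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated imbalanced data (about 5% positives), as in the cell above
X, y = make_classification(n_samples=5000, n_features=20, n_classes=2,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: every sample has the same weight in the loss
plain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# Alternative to SMOTE: weight each class inversely to its frequency
balanced = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

plain.fit(X_train, y_train)
balanced.fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_balanced = recall_score(y_test, balanced.predict(X_test))
print("Recall on responders (unweighted):    ", recall_plain)
print("Recall on responders (class-weighted):", recall_balanced)
```

Class weighting typically raises recall on the minority class at the cost of precision, the same trade-off visible in the classification report above, so the decision threshold should still be chosen against the business cost of contacting non-responders.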

