In [3]:
'''

##Question 1:  What is Logistic Regression, and how does it differ from Linear
Regression?

Logistic Regression is a statistical / machine‑learning method used to predict categorical outcomes, especially when there are two possible classes (binary classification). It estimates the probability that a given input belongs to one class rather than another (for example, “spam” vs “not spam”, “disease” vs “no disease”). Based on that probability, one can decide a threshold (often 0.5) and assign a class label. It uses features (input variables) to compute how they influence the likelihood of each class.

In contrast, Linear Regression is used when the target/output is continuous (such as house prices, temperature, height). It models a direct linear relationship between input variables and the numerical output: given inputs, it predicts a value on a continuous scale.

Key differences:

Output type: Logistic gives probabilities / classes (categorical), Linear gives numerical values (continuous).

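Interpretation: In Linear Regression, each coefficient is the change in the predicted output per unit change in the input. In Logistic Regression, each coefficient is the change in the log‑odds per unit change in the input (equivalently, it multiplies the odds by a fixed factor), so its effect on the probability itself is nonlinear and depends on the starting point.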

Assumptions and model fitting: Linear Regression is typically fitted by least squares, and its classical inference assumes errors that are normally distributed with constant variance (homoscedasticity). Logistic Regression makes no normality assumption about the output; it models a binary outcome and finds its parameters by maximum likelihood estimation.
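
To make the interpretation difference concrete, here is a minimal sketch. The bias and coefficient are made-up values chosen for illustration, not fitted parameters. It shows that a one-unit step in the input adds a constant to the linear prediction, while in the logistic case it multiplies the odds by exp(coefficient).

```python
import math

# Hypothetical parameters, chosen only for illustration.
BIAS = -1.0
COEF = 0.8


def linear_prediction(x):
    """Linear regression: the output changes by COEF per unit of x."""
    return BIAS + COEF * x


def logistic_probability(x):
    """Logistic regression: the same linear score passed through the sigmoid."""
    return 1.0 / (1.0 + math.exp(-(BIAS + COEF * x)))


def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


for x in (0.0, 1.0, 2.0):
    p = logistic_probability(x)
    print(f"x={x:.0f}  linear={linear_prediction(x):+.2f}  prob={p:.3f}  odds={odds(p):.3f}")

# Each unit step in x multiplies the odds by exp(COEF), wherever you start.
odds_ratio = odds(logistic_probability(1.0)) / odds(logistic_probability(0.0))
print(f"odds ratio per unit of x: {odds_ratio:.4f}")
```

The probability column does not move by a fixed amount per step, which is why logistic coefficients are usually read as odds ratios rather than as direct changes in probability.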


##Question 2: Explain the role of the Sigmoid function in Logistic Regression.

Purpose of the Sigmoid in Logistic Regression

Mapping to Probability Values
When the model computes a raw score (a weighted sum of features plus a bias), that result can be any real number — positive, negative, large, or small. The sigmoid transforms that into something between 0 and 1. That output can be interpreted as the probability that the input belongs to the positive class.

Enabling Decision Making
Once we have a probability, we can pick a threshold (commonly 0.5) to decide whether to assign class “yes/1” or “no/0.” Without the sigmoid, the raw score would be unbounded and would have no probability interpretation, so there would be no natural scale on which to choose such a threshold.

Smooth Transition Around the Boundary
The sigmoid gives a smooth, “S‑shaped” transition from low probability to high probability as the raw score shifts. That means small changes near the decision boundary (where the model is uncertain) lead to gradual change in output. This allows the model to learn in a stable way when changing its parameters.

Suitability for Optimization
Because the sigmoid is smooth and differentiable everywhere, it supports gradient‑based optimization. The model’s training process (e.g. maximizing likelihood or minimizing a loss) needs to know how output changes as parameters change. The sigmoid’s smoothness helps compute those gradients reliably.
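
The points above can be sketched in a few lines. This is an illustrative implementation, not the one scikit-learn uses internally; the function names are my own. It computes the sigmoid in a numerically stable way, its derivative (the quantity gradient-based training relies on), and a thresholded label.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function; maps any real z toward the range 0..1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # avoids overflow for large negative z
    return ez / (1.0 + ez)


def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), largest at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def predict_label(z, threshold=0.5):
    """Turn a raw score into a class label via the probability threshold."""
    return 1 if sigmoid(z) >= threshold else 0


for z in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(f"z={z:+.1f}  p={sigmoid(z):.4f}  grad={sigmoid_grad(z):.4f}  label={predict_label(z)}")
```

The gradient is largest near z = 0, which matches the remark that the output changes most around the decision boundary and flattens out for confident predictions.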


##Question 3: What is Regularization in Logistic Regression and why is it needed?

Regularization in Logistic Regression is a technique used to reduce overfitting by constraining or penalizing the magnitude of the model’s parameters (coefficients). It helps the model generalize better to unseen data rather than just memorizing the training examples.

Why Regularization Is Needed

Models with many features or those that are very flexible can fit even random noise in the training data. That leads to high performance on training data but poor performance on new/unseen data. This is called overfitting.

Regularization prevents the model from relying too heavily on any one feature or giving extremely large weights to some features. That reduces variance in model predictions.


It improves stability: with regularization, small changes in training data or slight noise won’t cause wildly different parameter estimates. This makes the model more robust.

How Regularization Works (Conceptually)

A penalty on large parameter values is added to the training objective. The model then has to balance classifying the training data accurately against keeping its parameters “small” or “simple.”

There are two common styles:

L1 regularization (Lasso) encourages sparse solutions (making some parameters exactly zero → effectively ignoring less useful features)

L2 regularization (Ridge) shrinks all parameters toward zero without necessarily zeroing any of them out
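
The difference between the two styles can be seen in the shrinkage step each penalty induces in its simplest form. The sketch below is an illustration only, not a training loop: the weights and penalty strength are made-up numbers, and the function names are hypothetical.

```python
def l1_shrink(weights, strength):
    """Soft-thresholding: move each weight toward zero by `strength`, stopping at zero."""
    shrunk = []
    for w in weights:
        magnitude = max(abs(w) - strength, 0.0)
        if magnitude == 0.0:
            shrunk.append(0.0)          # small weights are removed entirely
        elif w > 0:
            shrunk.append(magnitude)
        else:
            shrunk.append(-magnitude)
    return shrunk


def l2_shrink(weights, strength):
    """Proportional shrinkage: scale every weight by 1 / (1 + strength)."""
    return [w / (1.0 + strength) for w in weights]


weights = [2.0, -0.3, 0.1, -1.5]  # hypothetical coefficients
strength = 0.5                    # hypothetical penalty strength

l1_result = l1_shrink(weights, strength)
l2_result = l2_shrink(weights, strength)
print("L1-style shrinkage:", l1_result)
print("L2-style shrinkage:", [round(w, 4) for w in l2_result])
```

The sparse-inducing step sets the two small weights to exactly zero, while the proportional step keeps every weight nonzero but smaller.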


##Question 4: What are some common evaluation metrics for classification models, and
why are they important?

The most common evaluation metrics for classification models are listed below, followed by the reasons they matter.

Common Metrics

Accuracy
Measures how often the model’s predictions are correct overall (both positive and negative). It is intuitive and widely used.

Precision
Of those cases the model predicts as positive, how many are truly positive. Useful when false alarms (false positives) are costly.

Recall (Sensitivity)
Of all actual positive cases, how many did the model correctly identify. Important when missing positive cases is more serious.

F1‑Score
The harmonic mean of precision and recall. It gives a single score that is high only when both are high, which is especially helpful when classes are imbalanced or when you need a trade‑off between false positives and false negatives.

ROC‑AUC (Receiver Operating Characteristic – Area Under Curve)
Considers how well the model distinguishes positive and negative classes across different thresholds. It reflects how good the ranking of predictions is, not just at one fixed cutoff.
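
The "ranking" view of ROC‑AUC can be made concrete: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it that way on a tiny made-up set of labels and scores; the function name is my own, and for real work a library routine would be the usual choice.

```python
def roc_auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities for illustration.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]

auc = roc_auc_by_ranking(labels, scores)
print(f"ROC-AUC by pairwise ranking: {auc:.3f}")
```

No threshold appears anywhere in the computation, which is why the metric describes ranking quality rather than performance at a single cutoff.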

Why They Are Important

Different costs & risks: In many real‑world problems, false positives and false negatives have very different consequences. For example in disease detection, missing a sick patient can be worse than wrongly diagnosing a healthy one. Choosing metrics like recall or precision helps align with these real costs.

Class imbalance: When one class is much more common than the other(s), accuracy can be misleading (a model that always predicts the majority class might have high accuracy but useless behavior). Metrics like F1, precision, recall, and ROC‑AUC give better insight under imbalance.

Model comparison & tuning: They allow comparing different models or settings (thresholds, regularization, features) in a meaningful way. One model might have better precision but worse recall; metrics help decide which trade‑offs are acceptable.

Business relevance: Ultimately, models often serve business or health or safety goals; metrics translate raw predictive performance into quantities easier to interpret by stakeholders: mistakes, risks, benefits. Choosing appropriate metrics ensures model evaluation ties back to what matters in practice.
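
The class-imbalance point is easy to check numerically. The sketch below uses made-up predictions on 100 examples with 5 positives: one "model" always predicts the majority class, the other finds 4 of the 5 positives at the cost of 6 false alarms. The helper function is hypothetical and written only for this illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# 5 positives followed by 95 negatives (made-up data).
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100
informed = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

baseline = classification_metrics(y_true, always_negative)
better = classification_metrics(y_true, informed)
for name, m in (("always negative", baseline), ("informed", better)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The majority-class baseline wins on accuracy yet has zero recall, while the informed model has lower accuracy but actually finds positives. That is the sense in which accuracy alone is misleading here.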





'''




In [5]:
#Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame,
#splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
#(Use Dataset from sklearn package)


import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def main():
    # Load Iris dataset
    iris = load_iris(as_frame=True)
    df = iris.frame  # pandas DataFrame including feature columns + target
    X = df.drop(columns=["target"])
    y = df["target"]

    # Split into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y
    )

    # Train logistic regression
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)

    # Predict and evaluate
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"Iris dataset — accuracy: {acc:.2f}")

if __name__ == "__main__":
    main()

Iris dataset — accuracy: 1.00
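
The question asks for a CSV file, whereas the cell above reads the DataFrame straight from scikit-learn. A possible variant, sketched under the assumption that the data lives in a CSV with a column named "target", is shown below. The file name iris.csv and the helper train_from_csv are hypothetical; the CSV is written to a temporary directory first so the example is self-contained.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_from_csv(csv_path, target_column="target"):
    """Load a CSV into a DataFrame, split it, fit a model, and return (df, accuracy)."""
    df = pd.read_csv(csv_path)
    X = df.drop(columns=[target_column])
    y = df[target_column]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y
    )
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    return df, accuracy_score(y_test, model.predict(X_test))


with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "iris.csv")  # hypothetical file name
    # Export the sklearn dataset so there is a real CSV to load.
    load_iris(as_frame=True).frame.to_csv(csv_path, index=False)
    df, acc = train_from_csv(csv_path)

print(f"Loaded {df.shape[0]} rows, {df.shape[1]} columns from CSV")
print(f"Accuracy: {acc:.2f}")
```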


In [6]:
#Question 6:  Write a Python program to train a Logistic Regression model using L2
#regularization (Ridge) and print the model coefficients and accuracy.
#(Use Dataset from sklearn package)

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

def main():
    # Load a built‑in dataset (multiclass)
    data = load_wine()
    X = data.data
    y = data.target

    # It is often good to standardize features when using regularization
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Split into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.2, random_state=0, stratify=y
    )

    # Train Logistic Regression with L2 regularization
    # penalty='l2' is scikit-learn's default; it is passed explicitly here for clarity
    model = LogisticRegression(
        penalty='l2',
        C=1.0,            # inverse of regularization strength
        solver='lbfgs',   # works well for multiclass
        max_iter=200
    )
    model.fit(X_train, y_train)

    # Print model coefficients (one set per class in multiclass case)
    print("Intercepts for each class:")
    print(model.intercept_)
    print("Coefficients for each feature and class:")
    print(model.coef_)

    # Predict and compute accuracy
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"Accuracy on test set: {acc:.2f}")

if __name__ == "__main__":
    main()

Intercepts for each class:
[ 0.38558157  0.76600918 -1.15159075]
Coefficients for each feature and class:
[[ 0.76902723  0.22076332  0.41423956 -0.80212986  0.10545623  0.19121707
   0.66221979 -0.16873448  0.2017629   0.11575909  0.19514148  0.61414432
   1.0467281 ]
 [-0.96407141 -0.43735909 -0.82801935  0.58153442 -0.14249886  0.06443012
   0.33515281  0.14281946  0.20936102 -0.9554117   0.58673576  0.12245181
  -1.09535428]
 [ 0.19504417  0.21659577  0.41377979  0.22059543  0.03704263 -0.25564719
  -0.9973726   0.02591502 -0.41112391  0.83965261 -0.78187724 -0.73659613
   0.04862618]]
Accuracy on test set: 1.00
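
To see what the L2 penalty does to the coefficients printed above, one can refit with different values of C and compare the overall coefficient size. The sketch below is illustrative only: the three C values are arbitrary, and it fits on the full dataset because the goal is to inspect shrinkage, not to estimate accuracy.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

coef_norms = {}
for C in (0.01, 1.0, 100.0):  # arbitrary values spanning strong to weak regularization
    # The default penalty is L2; C is the inverse of the regularization strength.
    model = LogisticRegression(C=C, max_iter=5000)
    model.fit(X_scaled, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:>6}: coefficient norm = {coef_norms[C]:.3f}")
```

Smaller C means a stronger penalty, so the coefficient norm should grow as C increases.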


In [7]:
#Question 7: Write a Python program to train a Logistic Regression model for multiclass
#classification using multi_class='ovr' and print the classification report.
#(Use Dataset from sklearn package)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

def main():
    # Load the Iris dataset
    iris = load_iris()
    X = iris.data
    y = iris.target
    class_names = iris.target_names

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )

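    # Instantiate Logistic Regression with a one‑vs‑rest multiclass strategy.
    # Note: multi_class is deprecated in scikit-learn 1.5 and later; wrapping the
    # estimator in sklearn.multiclass.OneVsRestClassifier is the documented replacement.
    model = LogisticRegression(
        multi_class='ovr', solver='liblinear', max_iter=200, random_state=42
    )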
    model.fit(X_train, y_train)

    # Predict on test set
    y_pred = model.predict(X_test)

    # Print accuracy
    acc = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {acc:.2f}")

    # Print classification report (precision, recall, f1 for each class)
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=class_names))

if __name__ == "__main__":
    main()

Accuracy: 0.91

Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        15
  versicolor       1.00      0.73      0.85        15
   virginica       0.79      1.00      0.88        15

    accuracy                           0.91        45
   macro avg       0.93      0.91      0.91        45
weighted avg       0.93      0.91      0.91        45
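
On scikit-learn versions where the multi_class argument is deprecated or unavailable, the same one-vs-rest strategy can be expressed with OneVsRestClassifier. The sketch below is one way to do that; it uses the default solver rather than liblinear, so its numbers need not match the report above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# One binary logistic regression is trained per class (class k vs. the rest).
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=200))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
ovr_accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {ovr_accuracy:.2f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

After fitting, ovr_model.estimators_ holds the three underlying binary classifiers, which makes the one-vs-rest structure explicit.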





In [10]:
#Question 8: Write a Python program to apply GridSearchCV to tune C and penalty
#hyperparameters for Logistic Regression and print the best parameters and validation
#accuracy.
#(Use Dataset from sklearn package)

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define the Logistic Regression model
log_reg = LogisticRegression(max_iter=10000)

# Define the parameter grid
param_grid = {
    'C': np.logspace(-4, 4, 20),  # Inverse of regularization strength (smaller C = stronger penalty)
    'penalty': ['l1', 'l2'],     # Regularization type
    'solver': ['liblinear', 'saga']  # Solvers that support both penalties
}

# Set up GridSearchCV
grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)

# Fit the model
grid_search.fit(X_train_scaled, y_train)

# Print the best parameters and validation accuracy
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}")

# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {test_accuracy:.4f}")

Fitting 5 folds for each of 80 candidates, totalling 400 fits
Best Parameters: {'C': np.float64(1.623776739188721), 'penalty': 'l1', 'solver': 'saga'}
Best Cross-Validation Accuracy: 0.9583
Test Accuracy: 1.0000
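
In the cell above the scaler is fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling statistics. A common way to avoid this is to put the scaler and the model in a Pipeline and search over that. The sketch below assumes a smaller, arbitrary grid and the step names "scaler" and "clf", which are my own choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is refitted inside every cross-validation fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="saga", max_iter=10000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0, 100.0],  # arbitrary illustrative values
    "clf__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validation accuracy: {search.best_score_:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

Because the fitted search object wraps the whole pipeline, calling search.score or search.predict on raw test features applies the training-set scaling automatically.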


In [11]:
#Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling. (Use Dataset from sklearn package) (Include your Python code and output in the code box below.)

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Digits dataset
digits = load_digits()
X = digits.data
y = digits.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=10000, random_state=42)

# Train and evaluate the model without scaling
model.fit(X_train, y_train)
y_pred_no_scaling = model.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# Apply Standard Scaling to the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train and evaluate the model with scaling
model.fit(X_train_scaled, y_train)
y_pred_scaled = model.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# Print the results
print(f"Accuracy without scaling: {accuracy_no_scaling:.4f}")
print(f"Accuracy with scaling: {accuracy_scaled:.4f}")

Accuracy without scaling: 0.9622
Accuracy with scaling: 0.9778


In [12]:
'''
Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

For an imbalanced e-commerce dataset in which only 5% of customers respond, I would build the Logistic Regression model in the following steps: data handling, class balancing, feature scaling, hyperparameter tuning, and evaluation tied to the business goal.

1. Data Handling

Missing Values: Impute missing data using median imputation for numerical features and mode imputation for categorical features.

Categorical Encoding: Apply one-hot encoding to nominal variables and ordinal encoding to ordinal variables.

Feature Engineering: Create interaction terms and aggregate features to capture non-linear relationships.

2. Class Balancing

Class Weights: Assign higher weights to the minority class during model training to penalize misclassifications of responders more heavily.

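SMOTE: Optionally apply the Synthetic Minority Over-sampling Technique to generate synthetic samples for the minority class. Resampling must be applied to the training data only (inside each cross-validation fold), never to the validation or test data, so that evaluation reflects the real 5% response rate.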
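
As an illustration of the class-weight idea, the sketch below computes weights with the formula n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for its "balanced" class-weight option. The label counts (50 responders out of 1000 customers) are made up to match the 5% response rate, and the function name is my own.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up labels: 50 responders (1) and 950 non-responders (0).
labels = [1] * 50 + [0] * 950
weights = balanced_class_weights(labels)

for cls in sorted(weights):
    print(f"class {cls}: weight = {weights[cls]:.3f}")
```

With these weights a misclassified responder costs about 19 times as much as a misclassified non-responder during training, so both classes contribute equally in total.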

3. Feature Scaling

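Standardization: Use StandardScaler to standardize numerical features so that each has a mean of 0 and a standard deviation of 1. This matters for regularized Logistic Regression because the penalty treats all coefficients alike, so features on larger scales would otherwise be penalized unevenly. Fit the scaler on the training data only and reuse it to transform the test data.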


4. Hyperparameter Tuning

GridSearchCV: Employ GridSearchCV to tune hyperparameters such as the regularization strength (C) and penalty type (l1 or l2). Use metrics like F1-score or AUC-ROC for evaluation to better capture performance on imbalanced data.


5. Model Evaluation

Metrics: Focus on precision, recall, F1-score, and AUC-ROC to assess model performance, as accuracy is misleading in imbalanced datasets.

Confusion Matrix: Analyze the confusion matrix to understand misclassifications and adjust thresholds accordingly.

6. Business Integration

Threshold Adjustment: Set a decision threshold that balances precision and recall, aligning with business objectives.

Cost-Benefit Analysis: Evaluate the model's performance in terms of return on investment, considering the cost of targeting non-responders.
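
The threshold and cost-benefit steps can be combined: choose the threshold that maximizes campaign profit on held-out data. The sketch below is a toy illustration. The labels, predicted probabilities, contact cost, and value per response are all made-up numbers, and the function names are hypothetical.

```python
def campaign_profit(y_true, probabilities, threshold, contact_cost, response_value):
    """Profit from contacting every customer whose predicted probability meets the threshold."""
    profit = 0.0
    for responded, p in zip(y_true, probabilities):
        if p >= threshold:
            profit -= contact_cost           # every contact costs money
            if responded == 1:
                profit += response_value     # only responders generate value
    return profit


def best_threshold(y_true, probabilities, candidates, contact_cost, response_value):
    """Return the candidate threshold with the highest profit on the given data."""
    return max(
        candidates,
        key=lambda t: campaign_profit(y_true, probabilities, t, contact_cost, response_value),
    )


# Made-up validation labels and predicted response probabilities.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
probabilities = [0.70, 0.40, 0.10, 0.35, 0.05, 0.20, 0.15, 0.55, 0.30, 0.02]
candidates = [0.1, 0.3, 0.5]
contact_cost = 1.0     # hypothetical cost per contacted customer
response_value = 5.0   # hypothetical value per responder

for t in candidates:
    profit = campaign_profit(y_true, probabilities, t, contact_cost, response_value)
    print(f"threshold={t:.1f}  profit={profit:+.1f}")

chosen = best_threshold(y_true, probabilities, candidates, contact_cost, response_value)
print(f"chosen threshold: {chosen}")
```

In this toy data the default 0.5 cutoff misses a responder, while a very low cutoff wastes money on contacts, so an intermediate threshold gives the highest profit.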

'''
