Logistic Regression Theory Questions and Answers

1. What is Logistic Regression, and how does it differ from Linear Regression?
Answer: Logistic Regression is a classification algorithm that predicts the probability of a categorical dependent variable. Unlike Linear Regression, which predicts a continuous outcome directly, Logistic Regression applies the logistic (sigmoid) function to the linear combination of inputs and outputs a probability between 0 and 1.

2. What is the mathematical equation of Logistic Regression?
Answer: The logistic regression model estimates the probability p = P(y = 1 | x) as:
p = 1 / (1 + e^(-z)), where z = β0 + β1x1 + β2x2 + ... + βnxn

3. Why do we use the Sigmoid function in Logistic Regression?
Answer: The Sigmoid function maps any real-valued number into the (0,1) interval, which is ideal for modeling probabilities.
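
As an illustration of the equation in Question 2 and the mapping described here, the following is a minimal sketch using only the standard library. The function names and the coefficient values are hypothetical, chosen for the example; the two-branch form of the sigmoid is one common way to avoid overflow for large negative inputs.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow in math.exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z is negative, so exp(z) <= 1
    return ez / (1.0 + ez)

def predict_proba(x, intercept, coefs):
    """p = sigmoid(b0 + b1*x1 + ... + bn*xn)."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return sigmoid(z)

# Hypothetical coefficients, for illustration only.
# z = 0.5 + 1.0*2.0 + 2.0*(-1.0) = 0.5
p = predict_proba([2.0, -1.0], intercept=0.5, coefs=[1.0, 2.0])
print(p)

# Extreme inputs still produce values inside [0, 1].
print(sigmoid(-1000.0), sigmoid(0.0), sigmoid(1000.0))
```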

4. What is the cost function of Logistic Regression?
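Answer: The cost function is the Log Loss (Binary Cross-Entropy): J = -(1/N) Σ [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ], where p_i is the predicted probability for sample i. It penalizes a wrong prediction more heavily the more confident that prediction was.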
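
To make the "confident and wrong costs more" behavior concrete, here is a small sketch of binary cross-entropy in plain Python. The `eps` clipping is an assumption added to avoid log(0); it is a common convention rather than part of the definition.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy over paired labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# True label is 1 in both cases.
print(log_loss([1], [0.9]))  # confident and right: small loss
print(log_loss([1], [0.1]))  # confident and wrong: much larger loss
```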

5. What is Regularization in Logistic Regression? Why is it needed?
Answer: Regularization adds a penalty term to the loss function to avoid overfitting by discouraging large coefficients.

6. Explain the difference between Lasso, Ridge, and Elastic Net regression.
Answer: Lasso (L1) adds absolute value penalties, leading to sparse models; Ridge (L2) adds squared penalties, leading to small but non-zero coefficients; Elastic Net combines both penalties.
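
The three penalty terms can be sketched as plain functions. This is a simplified illustration: the helper names are hypothetical, and libraries differ in scaling conventions (for example, some put a factor of 1/2 on the squared term), so the numbers below are not meant to match any particular implementation.

```python
def l1_penalty(coefs):
    """Lasso term: sum of absolute coefficient values."""
    return sum(abs(b) for b in coefs)

def l2_penalty(coefs):
    """Ridge term: sum of squared coefficient values."""
    return sum(b * b for b in coefs)

def elastic_net_penalty(coefs, l1_ratio):
    """Weighted mix of the two; l1_ratio=1 is pure L1, 0 is pure L2."""
    return l1_ratio * l1_penalty(coefs) + (1.0 - l1_ratio) * l2_penalty(coefs)

coefs = [0.5, -2.0, 0.0]  # made-up coefficients
print(l1_penalty(coefs))                # 0.5 + 2.0 + 0.0
print(l2_penalty(coefs))                # 0.25 + 4.0 + 0.0
print(elastic_net_penalty(coefs, 0.5))  # halfway between the two
```

Note that the large coefficient dominates the L2 term far more than the L1 term, which is why Ridge shrinks big coefficients hard while Lasso can push small ones exactly to zero.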

7. When should we use Elastic Net instead of Lasso or Ridge?
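Answer: Elastic Net is useful when several features are strongly correlated. Lasso tends to keep one feature from a correlated group and drop the rest, while Ridge keeps them all; Elastic Net mixes the L1 and L2 penalties, so it can still zero out irrelevant features while keeping correlated ones together.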

8. What is the impact of the regularization parameter (λ) in Logistic Regression?
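Answer: A higher λ strengthens the penalty, which shrinks the coefficients and reduces model complexity and variance at the cost of higher bias. In scikit-learn the parameter is C, the inverse of regularization strength (C = 1/λ), so a smaller C means stronger regularization.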

9. What are the key assumptions of Logistic Regression?
Answer: Logistic Regression assumes a linear relationship between the independent variables and the log-odds of the outcome, little or no multicollinearity among the predictors, and independent observations.

10. What are some alternatives to Logistic Regression for classification tasks?
Answer: Alternatives include Decision Trees, Random Forests, SVM, Neural Networks, and Naive Bayes.

11. What are Classification Evaluation Metrics?
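Answer: Common metrics are Accuracy, Precision, Recall, F1-Score, and ROC-AUC. The first four are computed from the Confusion Matrix; ROC-AUC is computed from the predicted probabilities across all decision thresholds.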
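
As a sketch of how the threshold-based metrics follow from the confusion matrix, the function below computes them from the four cell counts. The function name and the counts are made up for illustration; returning 0.0 on an empty denominator is an assumption, not a universal convention.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts: 40 true positives, 10 false positives,
# 5 false negatives, 45 true negatives.
m = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```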

12. How does class imbalance affect Logistic Regression?
Answer: It may bias the model toward the majority class, reducing recall on the minority class.
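
One common mitigation is to reweight classes inversely to their frequency. The sketch below reproduces the formula scikit-learn documents for class_weight='balanced', n_samples / (n_classes * count); the helper name and the 90/10 label split are hypothetical.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights)  # the minority class receives the larger weight
```

With these weights, each misclassified minority sample contributes more to the loss, which tends to raise minority-class recall at some cost in precision.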

13. What is Hyperparameter Tuning in Logistic Regression?
Answer: It is the process of selecting the hyperparameter values, such as regularization strength (C), penalty type, and solver, that give the best cross-validated performance.

14. What are different solvers in Logistic Regression? Which one should be used?
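Answer: Solvers include liblinear, saga, lbfgs, newton-cg, and sag. The choice depends on dataset size and the penalty used: lbfgs (the scikit-learn default), newton-cg, and sag support only the L2 penalty or no penalty; liblinear supports L1 and L2 and suits small datasets; saga supports L1, L2, and Elastic Net and, like sag, is aimed at large datasets with standardized features.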

15. How is Logistic Regression extended for multiclass classification?
Answer: By using One-vs-Rest (OvR) or Softmax (Multinomial) regression.

16. What are the advantages and disadvantages of Logistic Regression?
Answer: Advantages: simplicity, interpretability, efficiency. Disadvantages: limited to linear decision boundaries, struggles with complex patterns.

17. What are some use cases of Logistic Regression?
Answer: Credit scoring, disease diagnosis, spam detection, marketing response prediction.

18. What is the difference between Softmax Regression and Logistic Regression?
Answer: Softmax handles multiple classes by generalizing Logistic Regression, which is binary.
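
The generalization can be seen numerically: softmax turns a vector of class scores into probabilities that sum to 1, and with two classes (scores z and 0) it reduces to the sigmoid of z. A minimal standard-library sketch, with made-up scores:

```python
import math

def softmax(scores):
    """Convert class scores to probabilities; subtract max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # made-up scores for three classes
print(probs, sum(probs))

# Two-class special case: softmax([z, 0])[0] equals sigmoid(z).
z = 0.5
two_class = softmax([z, 0.0])[0]
sigmoid_z = 1.0 / (1.0 + math.exp(-z))
print(two_class, sigmoid_z)
```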

19. How do we choose between One-vs-Rest (OvR) and Softmax for multiclass classification?
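Answer: OvR trains one independent binary classifier per class, which is simple and works with any binary solver. Softmax fits all classes jointly, so its probabilities are normalized across classes, and it can be more accurate when the classes are mutually exclusive.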

20. How do we interpret coefficients in Logistic Regression?
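Answer: Each coefficient βj is the change in the log-odds of the outcome for a one-unit increase in predictor xj, holding the other predictors fixed. Equivalently, e^βj is the odds ratio: the factor by which the odds are multiplied per unit increase.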
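
A short numeric check of this interpretation, using a hypothetical one-feature model (the intercept and coefficient values are invented for the example): raising x by one unit should multiply the odds by e^β, regardless of the starting value of x.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)

intercept, beta = -1.0, 0.7  # hypothetical fitted values

p_before = sigmoid(intercept + beta * 1.0)  # x = 1
p_after = sigmoid(intercept + beta * 2.0)   # x = 2

ratio = odds(p_after) / odds(p_before)
odds_ratio = math.exp(beta)
print(ratio, odds_ratio)  # the two values agree up to rounding
```

Note that the change in probability itself is not constant: it depends on where on the sigmoid curve the observation sits, which is why coefficients are read on the odds or log-odds scale.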



In [1]:
# Logistic Regression Practical Examples with Question Numbers

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, RandomizedSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score, cohen_kappa_score, matthews_corrcoef, precision_recall_curve
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

# Q1: Write a Python program that loads a dataset, splits it, applies Logistic Regression, and prints accuracy
def q1_basic_logistic_regression():
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Q1 - Accuracy:", accuracy_score(y_test, y_pred))

# Q2: Write a Python program to apply L1 regularization (Lasso) and print accuracy
def q2_l1_regularization():
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    clf = LogisticRegression(penalty='l1', solver='saga', max_iter=1000)
    clf.fit(X_train, y_train)
    print("Q2 - L1 Regularization Accuracy:", clf.score(X_test, y_test))

# Q3: Write a Python program to train Logistic Regression with L2 regularization and print accuracy and coefficients
def q3_l2_regularization():
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    clf = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
    clf.fit(X_train, y_train)
    print("Q3 - L2 Regularization Accuracy:", clf.score(X_test, y_test))
    print("Coefficients:", clf.coef_)

# Q4: Write a Python program to train Logistic Regression with Elastic Net Regularization
def q4_elastic_net_regularization():
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    clf = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=1000)
    clf.fit(X_train, y_train)
    print("Q4 - Elastic Net Accuracy:", clf.score(X_test, y_test))

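# Q5: Logistic Regression model for multiclass classification (One-vs-Rest)
# Note: the multi_class parameter is deprecated as of scikit-learn 1.5;
# newer versions suggest OneVsRestClassifier(LogisticRegression(...)) instead.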
def q5_multiclass_ovr():
    from sklearn.datasets import load_iris
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    clf = LogisticRegression(multi_class='ovr', max_iter=1000)
    clf.fit(X_train, y_train)
    print("Q5 - OvR Multiclass Accuracy:", clf.score(X_test, y_test))

# Q6: Apply GridSearchCV to tune hyperparameters of Logistic Regression
def q6_gridsearchcv_tuning():
    X, y = load_breast_cancer(return_X_y=True)
    param_grid = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2'], 'solver': ['liblinear', 'saga']}
    clf = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
    clf.fit(X, y)
    print("Q6 - Best Params:", clf.best_params_)
    print("Q6 - Best Accuracy:", clf.best_score_)

# Q7: Stratified K-Fold Cross-Validation evaluation
def q7_stratified_kfold_cv():
    X, y = load_breast_cancer(return_X_y=True)
    clf = LogisticRegression(max_iter=1000)
    skf = StratifiedKFold(n_splits=5)
    scores = cross_val_score(clf, X, y, cv=skf)
    print("Q7 - Stratified K-Fold CV Accuracy:", scores.mean())

# Q8: Load CSV dataset, apply Logistic Regression, and evaluate accuracy (dummy example)
def q8_logistic_regression_csv():
    # Stand-in for reading a CSV file: build a DataFrame from the breast cancer dataset
    data = load_breast_cancer()
    df = pd.DataFrame(data.data, columns=data.feature_names)
    df['target'] = data.target
    X = df.drop('target', axis=1)
    y = df['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    print("Q8 - Accuracy on CSV dataset:", clf.score(X_test, y_test))


if __name__ == "__main__":
    q1_basic_logistic_regression()
    q2_l1_regularization()
    q3_l2_regularization()
    q4_elastic_net_regularization()
    q5_multiclass_ovr()
    q6_gridsearchcv_tuning()
    q7_stratified_kfold_cv()
    q8_logistic_regression_csv()


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
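
The warning above suggests scaling the data. A minimal sketch of that fix, assuming a scikit-learn Pipeline that standardizes the features before fitting (this is one option, not necessarily how the notebook's results were meant to be produced, and it will not reproduce the accuracy figures printed in this output):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The scaler is fit on the training split only, then applied to the test split.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# n_iter_ holds the iterations actually used by the solver.
n_iter = int(model.named_steps["logisticregression"].n_iter_[0])
accuracy = model.score(X_test, y_test)
print("Iterations used:", n_iter)
print("Accuracy with scaling:", accuracy)
```

Putting the scaler inside the pipeline also keeps cross-validation honest, because each fold refits the scaler on its own training portion.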


Q1 - Accuracy: 0.958041958041958




Q2 - L1 Regularization Accuracy: 0.958041958041958


Q3 - L2 Regularization Accuracy: 0.958041958041958
Coefficients: [[ 2.15532225  0.18714602 -0.24416747  0.00572091 -0.1497949  -0.37285194
  -0.68842716 -0.40105696 -0.21745825 -0.02431973 -0.10238349  1.21390283
   0.02970182 -0.10397078 -0.0207384   0.01710296 -0.04370813 -0.04714993
  -0.04576097  0.00424877  1.18521282 -0.41575137 -0.02673249 -0.02611631
  -0.28772255 -0.93912745 -1.60622852 -0.67726337 -0.78534399 -0.089987  ]]




Q4 - Elastic Net Accuracy: 0.958041958041958
Q5 - OvR Multiclass Accuracy: 0.9736842105263158




Q6 - Best Params: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
Q6 - Best Accuracy: 0.9578326346840551


Q7 - Stratified K-Fold CV Accuracy: 0.9525694767893185
Q8 - Accuracy on CSV dataset: 0.958041958041958

