What is Logistic Regression, and how does it differ from Linear Regression?

Logistic Regression is a fundamental machine learning algorithm used for classification tasks. Unlike Linear Regression, which predicts continuous values, Logistic Regression estimates the probability that a given instance belongs to a particular class. It does so by passing a linear combination of the features through the Sigmoid function.

3 Why do we use the Sigmoid function in Logistic Regression?
- The Sigmoid function, σ(z) = 1 / (1 + e^(-z)), maps any real-valued number to a value strictly between 0 and 1.
- Its output can be interpreted directly as a probability, which suits binary classification.
- The function is continuous and differentiable, which allows optimization with gradient-based methods.
- With the default threshold of 0.5, an instance is classified as 1 if its predicted probability is greater than 0.5; otherwise, it is classified as 0.
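
The mapping and the 0.5 threshold can be sketched in a few lines of NumPy. The `scores` array is made up for illustration and stands in for the linear scores a fitted model would produce.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative linear scores (w.x + b); these values are assumptions, not model output
scores = np.array([-4.0, 0.0, 4.0])

probs = sigmoid(scores)              # predicted probabilities
labels = (probs > 0.5).astype(int)   # class 1 only when the probability exceeds 0.5

print("Probabilities:", probs)
print("Predicted labels:", labels)
```

A score of exactly 0 gives a probability of exactly 0.5, which the "greater than 0.5" rule assigns to class 0.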


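4 What is the cost function of Logistic Regression?
The cost function in Logistic Regression is called Log Loss or Binary Cross-Entropy Loss. It measures how well the model's predicted probabilities match the actual labels, penalizing confident wrong predictions most heavily. For N examples with labels y_i and predicted probabilities p_i:
J = -(1/N) Σ [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]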
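
As a minimal sketch, Binary Cross-Entropy can be computed by hand as below. The helper name `log_loss_manual`, the clipping constant `eps`, and the example probabilities are assumptions chosen for illustration.

```python
import numpy as np

def log_loss_manual(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy: -mean(y*log(p) + (1-y)*log(1-p))."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so that log() stays finite
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]   # probabilities close to the true labels
confident_wrong = [0.1, 0.9, 0.2, 0.8]   # probabilities far from the true labels

loss_good = log_loss_manual(y_true, confident_right)
loss_bad = log_loss_manual(y_true, confident_wrong)

print("Log loss (good predictions):", loss_good)
print("Log loss (bad predictions):", loss_bad)
```

The loss grows sharply when the model is confident and wrong, which is the property that makes it suitable for training probabilistic classifiers.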


5 What is Regularization in Logistic Regression? Why is it needed?
Regularization in Logistic Regression is a technique used to prevent overfitting by adding a penalty term to the cost function. It helps the model generalize better to unseen data by discouraging overly complex models.


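6 Explain the difference between Lasso, Ridge, and Elastic Net regression.
- Lasso (L1) penalizes the sum of absolute coefficient values and can shrink some coefficients exactly to zero. Use it when you suspect some features are irrelevant and want automatic feature selection.
- Ridge (L2) penalizes the sum of squared coefficients, shrinking them toward zero without eliminating any. Use it when all features contribute to the model but need to be controlled to prevent overfitting.
- Elastic Net mixes the L1 and L2 penalties. Use it when features are correlated, as it combines the benefits of both Lasso and Ridge.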
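
The feature-selection effect of the L1 penalty can be illustrated with a small sketch. It assumes a synthetic dataset from `make_classification` with only 2 informative features out of 20 and a fairly strong penalty (`C=0.05`); both choices are illustrative, and Elastic Net is left out to keep the comparison short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for this sketch): 2 informative features, 18 noise features
X, y = make_classification(n_samples=500, n_features=20, n_informative=2,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

coefs = {}
for penalty in ("l1", "l2"):
    # liblinear supports both L1 and L2 penalties
    model = LogisticRegression(penalty=penalty, C=0.05, solver="liblinear")
    model.fit(X, y)
    coefs[penalty] = model.coef_.ravel()

n_zero_l1 = int(np.sum(coefs["l1"] == 0))
n_zero_l2 = int(np.sum(coefs["l2"] == 0))

print("Coefficients set exactly to zero with L1:", n_zero_l1)
print("Coefficients set exactly to zero with L2:", n_zero_l2)
```

L1 typically zeroes out the noise features, while L2 keeps every coefficient small but non-zero.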


7 When should we use Elastic Net instead of Lasso or Ridge?
Elastic Net is preferred over Lasso or Ridge when features are highly correlated. It combines the strengths of both Lasso (L1 regularization) and Ridge (L2 regularization) to balance feature selection and coefficient shrinkage.


8 What is the impact of the regularization parameter (λ) in Logistic Regression?
The regularization parameter (λ) controls the strength of regularization, which helps prevent overfitting and improves generalization. scikit-learn exposes it as C = 1/λ, so a small C corresponds to a large λ.
Small λ (Weak Regularization):
- The model relies heavily on the training data.
- Coefficients can become large, leading to overfitting.
- The model may perform well on training data but poorly on unseen data.
Large λ (Strong Regularization):
- Shrinks coefficients, reducing model complexity.
- Prevents overfitting but may lead to underfitting if too strong.
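
The shrinking effect can be checked with a short sketch on the Iris data, assuming scikit-learn's convention C = 1/λ. The three C values are arbitrary illustration points.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)   # put all features on the same scale

norms = {}
for C in (0.01, 1.0, 100.0):            # small C = large lambda = strong regularization
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<6} (lambda={1 / C:<6}) coefficient norm = {norms[C]:.3f}")
```

The coefficient norm grows as C increases, that is, as λ decreases and the penalty weakens.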


9 What are the key assumptions of Logistic Regression?
Key Assumptions:
- Binary or Categorical Dependent Variable: The target variable should be binary (e.g., Yes/No, 0/1) or categorical for multiclass classification.
- Independent Observations: Each observation (row) should be independent of the others, for example no repeated measurements of the same subject.
- No Multicollinearity: Predictor variables should not be highly correlated with each other, as multicollinearity can distort coefficient estimates.
- No Extreme Outliers: Outliers can significantly impact the model’s coefficients, so they should be handled appropriately.
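
One quick, informal way to screen for multicollinearity is to inspect pairwise feature correlations. The sketch below does this for the Iris features; the 0.9 cutoff is an illustrative assumption, not a standard threshold.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
names = data.feature_names

# 4x4 correlation matrix between features (columns are variables)
corr = np.corrcoef(data.data, rowvar=False)

threshold = 0.9   # illustrative cutoff for "highly correlated"
high_pairs = [
    (names[i], names[j], round(float(corr[i, j]), 3))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]

print("Highly correlated feature pairs:", high_pairs)
```

Pairs flagged this way are candidates for dropping one feature, combining them, or relying on regularization.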


10 What are some alternatives to Logistic Regression for classification tasks?
Machine Learning-Based Alternatives
- Decision Trees – Simple, interpretable models that split data based on feature values.
- Random Forest – An ensemble of decision trees that improves accuracy and reduces overfitting.
- Support Vector Machines (SVM) – Finds the optimal boundary between classes using hyperplanes.
- Naïve Bayes – A probabilistic classifier based on Bayes' theorem, useful for text classification.
- K-Nearest Neighbors (KNN) – Classifies based on the majority vote of nearest neighbors.


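11 What are Classification Evaluation Metrics?
Classification evaluation metrics assess the performance of a classification model. Key metrics include:
- Accuracy – the fraction of all predictions that are correct.
- Precision – of the instances predicted positive, the fraction that are actually positive.
- Recall – of the actual positive instances, the fraction the model identifies.
- F1-Score – the harmonic mean of precision and recall.
- ROC-AUC – the area under the ROC curve, measuring how well predicted probabilities rank positives above negatives.
- Matthews Correlation Coefficient (MCC) – a correlation between predicted and actual labels that remains informative on imbalanced data.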
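
As a small worked example, several of these metrics can be derived by hand from a confusion matrix. The label vectors below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 4 positives and 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
```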

12 How does class imbalance affect Logistic Regression?
Effects of Class Imbalance
- Biased Predictions – The model tends to classify most instances as the majority class, ignoring the minority class.
- Poor Recall for Minority Class – The model struggles to correctly identify minority class instances, leading to high false negatives.
- Misleading Accuracy – High accuracy may be deceptive if the model predicts the majority class most of the time.
- Skewed Decision Boundary – The model’s decision boundary may not be optimal, making it harder to separate classes effectively.
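
The "misleading accuracy" effect is easy to reproduce. The sketch below uses a made-up 90/10 label split and a trivial predictor that always outputs the majority class; no model is trained.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up imbalanced labels: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)

# A trivial "model" that always predicts the majority class
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
rec = recall_score(y_true, y_majority)   # recall for the minority (positive) class

print("Accuracy of majority-class predictor:", acc)
print("Recall on minority class:", rec)
```

High accuracy alongside zero minority-class recall is why recall, F1-score, or MCC are usually reported for imbalanced problems.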


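13 What is Hyperparameter Tuning in Logistic Regression?
Hyperparameter tuning is the search for the settings, fixed before training, that give the best validation performance.
Key Hyperparameters in Logistic Regression
- Regularization Strength (λ or C) – Controls the penalty applied to coefficients to prevent overfitting. In scikit-learn, C is the inverse of λ, so a smaller C means stronger regularization.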
- Penalty Type (L1, L2, Elastic Net) – Determines whether Lasso, Ridge, or Elastic Net regularization is used.
- Solver Choice – Different optimization algorithms (e.g., liblinear, saga, lbfgs) affect convergence speed and accuracy.
- Maximum Iterations (max_iter) – Defines how many iterations the solver runs before stopping.
- Class Weights (class_weight) – Adjusts weights for imbalanced datasets to improve minority class predictions.



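14 What are different solvers in Logistic Regression? Which one should be used?
- For small datasets → liblinear
- For large datasets → sag or saga (both converge fastest when features are on a similar scale)
- For general use, including multinomial problems → lbfgs (the scikit-learn default) or newton-cg
- For L1 regularization → liblinear or saga
- For Elastic Net regularization → saga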


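15 How is Logistic Regression extended for multiclass classification?
- One-vs-Rest (OvR): trains one binary classifier per class and picks the class with the highest score.
- Multinomial (Softmax) Logistic Regression: fits a single model that assigns probabilities to all classes at once.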

16 What are the advantages and disadvantages of Logistic Regression?
Advantages --------------
- Simple & Interpretable – Easy to understand and interpret compared to complex models.
- Efficient & Fast – Works well with small datasets and requires less computational power.
- Probabilistic Predictions – Outputs probabilities, making it useful for decision-making.
- Handles Categorical & Continuous Variables – Can work with different types of input features.
- Less Prone to Overfitting – Regularization techniques (L1, L2) help control complexity.
Disadvantages ---------
- Assumes Linearity – Requires a linear relationship between independent variables and log-odds.
- Sensitive to Outliers – Extreme values can distort predictions.
- Struggles with Complex Relationships – Cannot model highly non-linear patterns well.
- Requires Feature Engineering – Needs well-prepared input data for optimal performance.
- Not Ideal for Large Feature Sets – Can overfit when the number of features is much larger than observations.


17 What are some use cases of Logistic Regression?
1. Medical Diagnosis
- Predicting the likelihood of diseases (e.g., diabetes, heart disease).
- Identifying risk factors based on patient data.
2. Fraud Detection
- Detecting fraudulent transactions in banking and e-commerce.
- Identifying suspicious activities in cybersecurity.
3. Marketing & Customer Analytics
- Predicting customer churn (whether a customer will leave a service).
- Classifying potential leads as high or low conversion prospects.
4. Spam Detection
- Filtering spam emails based on text patterns.
- Identifying phishing attempts.
5. Credit Scoring & Risk Assessment


18 How do we choose between One-vs-Rest (OvR) and Softmax for multiclass classification?
One-vs-Rest (OvR)
- Trains multiple binary classifiers, one for each class.
- Each classifier predicts whether an instance belongs to a specific class or not.
- The class with the highest probability is chosen.
Softmax Regression
- Uses the Softmax function instead of the Sigmoid function.
- Computes probabilities for all classes simultaneously.
- The class with the highest probability is selected.
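
The difference between the two approaches shows up in how scores become probabilities. In this sketch the three class scores are invented for illustration: Softmax normalizes them jointly, while OvR applies a Sigmoid to each score independently.

```python
import numpy as np

def softmax(z):
    """Convert a vector of class scores into probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return exp_z / exp_z.sum()

def sigmoid(z):
    """Independent per-class probability, as used by each OvR classifier."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

# Invented scores for three classes (assumption for illustration)
scores = np.array([2.0, 1.0, -1.0])

softmax_probs = softmax(scores)   # joint distribution over the classes
ovr_probs = sigmoid(scores)       # one "class vs. rest" probability per class

print("Softmax probabilities:", softmax_probs, "sum =", softmax_probs.sum())
print("OvR sigmoid outputs:  ", ovr_probs, "sum =", ovr_probs.sum())
print("Predicted class (both):", int(np.argmax(softmax_probs)), int(np.argmax(ovr_probs)))
```

Raw OvR outputs need not sum to 1, so they are usually renormalized before being reported as class probabilities. Softmax is the natural choice when classes are mutually exclusive and calibrated joint probabilities matter.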


19 How do we interpret coefficients in Logistic Regression?
Log-Odds Representation
- Each coefficient represents the change in log-odds of the dependent variable for a one-unit increase in the predictor variable.
- If a coefficient is positive, it increases the probability of the event occurring.
- If a coefficient is negative, it decreases the probability of the event occurring.
Interpreting Categorical Variables
- For binary categorical predictors (e.g., Male vs. Female), the coefficient represents the difference in log-odds between the two categories.
- For multi-category predictors, one category is chosen as the reference, and coefficients represent changes relative to that reference.
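Impact of Large Coefficients
- Large coefficients (in absolute value) indicate a strong effect of the predictor on the log-odds; small coefficients suggest a weak or negligible effect.
- Magnitudes are only comparable across predictors measured on the same scale, so standardize features before ranking them by coefficient size.
- Exponentiating a coefficient gives the odds ratio: e^β is the factor by which the odds are multiplied for a one-unit increase in the predictor.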
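
A short numeric sketch of the log-odds interpretation follows. The coefficient value 0.7 and the baseline log-odds of 0 are hypothetical numbers chosen for illustration.

```python
import numpy as np

coef = 0.7               # hypothetical coefficient for one predictor
baseline_log_odds = 0.0  # hypothetical starting point (probability 0.5)

# Odds ratio: multiplicative change in the odds per one-unit increase in the predictor
odds_ratio = float(np.exp(coef))

def to_probability(log_odds):
    """Convert log-odds to a probability with the sigmoid function."""
    return float(1.0 / (1.0 + np.exp(-log_odds)))

p_before = to_probability(baseline_log_odds)
p_after = to_probability(baseline_log_odds + coef)   # after a one-unit increase

print("Odds ratio:", odds_ratio)
print("Probability before:", p_before)
print("Probability after: ", p_after)
```

The odds ratio is constant for every one-unit increase, but the change in probability depends on the starting point because the Sigmoid is non-linear.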




In [12]:
#  Load a dataset, split it, apply Logistic Regression, and print accuracy

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris


data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  


X = df.drop('target', axis=1)
y = df['target']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


model = LogisticRegression(max_iter=200)  # Increase iterations to prevent convergence warnings
model.fit(X_train, y_train)


y_pred = model.predict(X_test)
print("Model Accuracy:", accuracy_score(y_test, y_pred))

Model Accuracy: 1.0


In [14]:
# Write a Python program to train Logistic Regression with L2 regularization (Ridge) using
# LogisticRegression(penalty='l2'). Print model accuracy and coefficients


model = LogisticRegression(penalty='l2') 
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("L2 Regularization Model Accuracy:", accuracy_score(y_test, y_pred))
print("Model Coefficients:", model.coef_)

L2 Regularization Model Accuracy: 1.0
Model Coefficients: [[-0.39339961  0.96258869 -2.37510705 -0.99874611]
 [ 0.5084024  -0.25486663 -0.21301372 -0.77575531]
 [-0.11500279 -0.70772206  2.58812078  1.77450141]]


In [15]:
# Write a Python program to train Logistic Regression with Elastic Net Regularization (penalty='elasticnet')

model = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5)  # ElasticNet requires 'saga' solver
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Elastic Net Model Accuracy:", accuracy_score(y_test, y_pred))

Elastic Net Model Accuracy: 1.0




In [16]:
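# Write a Python program to train a Logistic Regression model for multiclass classification using
# multi_class='ovr'
# Note: recent scikit-learn releases deprecate the multi_class parameter;
# sklearn.multiclass.OneVsRestClassifier(LogisticRegression()) is the replacement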

model = LogisticRegression(multi_class='ovr', solver='lbfgs')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("OvR Model Accuracy:", accuracy_score(y_test, y_pred))

OvR Model Accuracy: 0.9666666666666667




In [17]:
# Write a Python program to apply GridSearchCV to tune the hyperparameters (C and penalty) of Logistic Regression.
# Print the best parameters and accuracy

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2'], 'solver': ['liblinear']}
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Accuracy:", grid_search.best_score_)

Best Parameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
Best Accuracy: 0.9583333333333334


In [18]:
# Write a Python program to evaluate Logistic Regression using Stratified K-Fold Cross-Validation. Print the average accuracy

from sklearn.model_selection import StratifiedKFold
import numpy as np

skf = StratifiedKFold(n_splits=5)
accuracies = []

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    model = LogisticRegression()
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred))

print("Average Accuracy using Stratified K-Fold:", np.mean(accuracies))

Average Accuracy using Stratified K-Fold: 0.9733333333333334


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [20]:
# Train Logistic Regression on a dataset held in a DataFrame and print accuracy
# (the Iris data stands in for a CSV file here; a real file would be loaded with pd.read_csv)

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  


X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("CSV Dataset Model Accuracy:", accuracy_score(y_test, y_pred))

CSV Dataset Model Accuracy: 1.0


In [22]:
# Write a Python program to implement One-vs-One (OvO) Multiclass Logistic Regression and print accuracy

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris


data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.drop('target', axis=1)
y = df['target']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


model = OneVsOneClassifier(LogisticRegression(max_iter=200))
model.fit(X_train, y_train)


y_pred = model.predict(X_test)


print("One-vs-One (OvO) Logistic Regression Model Accuracy:", accuracy_score(y_test, y_pred))

One-vs-One (OvO) Logistic Regression Model Accuracy: 1.0


In [24]:
# Write a Python program to train a Logistic Regression model and evaluate its performance using Precision, Recall, and F1-Score

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.datasets import load_iris


data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target


X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)


y_pred = model.predict(X_test)


print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average='weighted'))
print("Recall:", recall_score(y_test, y_pred, average='weighted'))
print("F1-Score:", f1_score(y_test, y_pred, average='weighted'))

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-Score: 1.0


In [25]:
# Write a Python program to train a Logistic Regression model on imbalanced data and apply class weights to improve model performance

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.datasets import make_classification


X, y = make_classification(n_classes=2, weights=[0.9, 0.1], n_samples=5000, random_state=42)


df = pd.DataFrame(X)
df['target'] = y


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)


model = LogisticRegression(class_weight='balanced', max_iter=200)
model.fit(X_train, y_train)


y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-Score:", f1_score(y_test, y_pred))


Accuracy: 0.852
Precision: 0.40271493212669685
Recall: 0.8476190476190476
F1-Score: 0.5460122699386503


In [26]:
# Write a Python program to apply feature scaling (Standardization) before training a Logistic Regression model. Evaluate its accuracy and compare results with and without scaling

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target


X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


model_no_scaling = LogisticRegression(max_iter=200)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


model_scaled = LogisticRegression(max_iter=200)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)


print("Accuracy WITHOUT scaling:", accuracy_no_scaling)
print("Accuracy WITH scaling:", accuracy_scaled)

Accuracy WITHOUT scaling: 1.0
Accuracy WITH scaling: 1.0


In [27]:
# Write a Python program to train Logistic Regression and evaluate its performance using ROC-AUC score

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.datasets import make_classification


X, y = make_classification(n_classes=2, weights=[0.7, 0.3], n_samples=5000, random_state=42)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)


y_prob = model.predict_proba(X_test)[:, 1] 


roc_auc = roc_auc_score(y_test, y_prob)
print("ROC-AUC Score:", roc_auc)

ROC-AUC Score: 0.9289692404030436


In [28]:
# Write a Python program to train Logistic Regression with a custom regularization strength (C=0.5) and evaluate
# accuracy. Note: C is the inverse of the regularization strength λ, not a learning rate

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris


data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target


X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


model = LogisticRegression(C=0.5, max_iter=200)
model.fit(X_train, y_train)


y_pred = model.predict(X_test)
print("Logistic Regression Model Accuracy with C=0.5:", accuracy_score(y_test, y_pred))

Logistic Regression Model Accuracy with C=0.5: 1.0


In [30]:
#  Write a Python program to train Logistic Regression with different solvers (liblinear, saga, lbfgs) and compare
# their accuracy

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris


data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target


X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


solvers = ['liblinear', 'saga', 'lbfgs']


for solver in solvers:
    model = LogisticRegression(solver=solver, max_iter=200)
    model.fit(X_train, y_train)
    
   
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    print(f"Solver: {solver} | Accuracy: {accuracy:.4f}")

Solver: liblinear | Accuracy: 1.0000
Solver: saga | Accuracy: 1.0000
Solver: lbfgs | Accuracy: 1.0000




In [31]:
# Write a Python program to train Logistic Regression and evaluate its performance using Matthews
# Correlation Coefficient (MCC)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef, accuracy_score
from sklearn.datasets import make_classification


X, y = make_classification(n_classes=2, weights=[0.8, 0.2], n_samples=5000, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)


model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)


mcc = matthews_corrcoef(y_test, y_pred)


print("Accuracy:", accuracy_score(y_test, y_pred))
print("Matthews Correlation Coefficient (MCC):", mcc)

Accuracy: 0.886
Matthews Correlation Coefficient (MCC): 0.6244685699066264


In [32]:
# Write a Python program to train Logistic Regression and find the optimal C (regularization strength) using
# cross-validation

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
import numpy as np


data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target


X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


C_values = np.logspace(-3, 3, 10)  

best_score = 0
best_C = None


for C in C_values:
    model = LogisticRegression(C=C, max_iter=200)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')  # 5-fold CV
    avg_score = scores.mean()

    if avg_score > best_score:
        best_score = avg_score
        best_C = C

print(f"Optimal C: {best_C}")
print(f"Best Cross-Validation Accuracy: {best_score:.4f}")


final_model = LogisticRegression(C=best_C, max_iter=200)
final_model.fit(X_train, y_train)

test_accuracy = final_model.score(X_test, y_test)
print(f"Test Accuracy with Optimal C: {test_accuracy:.4f}")

Optimal C: 0.46415888336127775
Best Cross-Validation Accuracy: 0.9583
Test Accuracy with Optimal C: 1.0000
