
*** Theoretical Questions ***

Q1. What is Logistic Regression, and how does it differ from Linear Regression?

Answer. Logistic Regression is a statistical method used for binary classification problems, predicting the probability of a binary outcome (0 or 1).
While Linear Regression predicts a continuous output, Logistic Regression predicts the probability of a categorical outcome.  It uses a sigmoid function to map the linear combination of inputs to a probability between 0 and 1.

Q2. What is the mathematical equation of Logistic Regression?

Answer. The mathematical equation of Logistic Regression is typically represented as $p = \frac{1}{1 + e^{-z}}$, where $p$ is the probability of the positive class, $e$ is the base of the natural logarithm, and $z$ is the linear combination of the input features and their coefficients ($z = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$).
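
As a rough illustration of the formula, the sketch below computes z and p for a single observation in plain Python. The coefficient and feature values are made up for the example, and the helper names `sigmoid` and `predict_probability` are hypothetical, not part of any library.

```python
import math

def sigmoid(z):
    """Map a real number z to a probability in (0, 1). Fine for moderate z."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(intercept, coefficients, features):
    """Compute p = sigmoid(b0 + b1*x1 + ... + bn*xn) for one observation."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)

# Illustrative (made-up) values: b0 = -1.0, b = [0.8, 0.5], x = [2.0, -1.0]
p = predict_probability(-1.0, [0.8, 0.5], [2.0, -1.0])
print(f"Probability of the positive class: {p:.4f}")
```

Here z works out to roughly 0.1, so the predicted probability sits just above 0.5.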

Q3. Why do we use the Sigmoid function in Logistic Regression?

Answer. The Sigmoid function is used in Logistic Regression to map the predicted linear combination of input features into a probability value between 0 and 1.  This makes it suitable for binary classification, where we need to predict the likelihood of an instance belonging to a particular class.

Q4. What is the cost function of Logistic Regression?

Answer. The cost function of Logistic Regression is typically the Cross-Entropy Loss (also known as Log Loss).  It measures the error between predicted probabilities and the actual binary outcomes.  
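
A minimal sketch of binary cross-entropy in plain Python may make the definition concrete. The function name `log_loss`, the `eps` clipping value, and the sample probabilities are assumptions chosen for illustration.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy: -mean(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]  # probabilities close to the true labels
mostly_wrong = [0.1, 0.9, 0.2, 0.8]  # probabilities far from the true labels

loss_right = log_loss(y_true, mostly_right)
loss_wrong = log_loss(y_true, mostly_wrong)
print(f"Loss when mostly right: {loss_right:.4f}")
print(f"Loss when mostly wrong: {loss_wrong:.4f}")
```

Confident but wrong predictions are penalized far more heavily than confident correct ones, which is why the second loss is much larger.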

Q5. What is Regularization in Logistic Regression? Why is it needed?

Answer. Regularization in Logistic Regression is a technique used to prevent overfitting.
It adds a penalty term to the cost function to discourage large coefficients, making the model simpler and less prone to fitting noise in the training data.  

Q6. Explain the difference between Lasso, Ridge, and Elastic Net regression.

Answer. Lasso (L1 regularization) adds the absolute value of the coefficients to the penalty term, which can shrink some coefficients to exactly zero, effectively performing feature selection.  
Ridge (L2 regularization) adds the squared value of the coefficients to the penalty term, which shrinks coefficients towards zero but rarely exactly to zero.   
Elastic Net is a combination of both L1 and L2 regularization, adding both the absolute and squared values of the coefficients to the penalty term.  It balances the feature selection of Lasso and the coefficient shrinkage of Ridge.
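
The three penalty terms can be sketched as small Python functions. The function names, the `l1_ratio` mixing weight, and the coefficient values are assumptions for illustration; libraries often scale these terms differently (for example with an extra factor of 1/2 on the squared term).

```python
def l1_penalty(coefs):
    """Lasso penalty: sum of absolute coefficient values."""
    return sum(abs(b) for b in coefs)

def l2_penalty(coefs):
    """Ridge penalty: sum of squared coefficient values."""
    return sum(b * b for b in coefs)

def elastic_net_penalty(coefs, l1_ratio=0.5):
    """Weighted mix of both: l1_ratio * L1 + (1 - l1_ratio) * L2."""
    return l1_ratio * l1_penalty(coefs) + (1 - l1_ratio) * l2_penalty(coefs)

coefs = [0.5, -2.0, 0.0, 1.5]  # made-up coefficients, intercept excluded

print("L1 penalty:", l1_penalty(coefs))
print("L2 penalty:", l2_penalty(coefs))
print("Elastic Net penalty:", elastic_net_penalty(coefs, l1_ratio=0.5))
```

Setting `l1_ratio=1.0` recovers the Lasso penalty and `l1_ratio=0.0` recovers the Ridge penalty in this sketch.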

Q7. When should we use Elastic Net instead of Lasso or Ridge?

Answer. Elastic Net is preferred when there are many correlated features.  It can group correlated features together, selecting some and shrinking others, unlike Lasso which tends to select only one.  It also provides a balance between Lasso and Ridge, potentially leading to better performance than either alone.  

Q8. What is the impact of the regularization parameter (λ) in Logistic Regression?

Answer. The regularization parameter λ controls the strength of regularization. scikit-learn exposes its inverse as the parameter 'C' (C = 1/λ).
A larger λ (smaller C) increases regularization, leading to simpler models with smaller coefficients, which can help prevent overfitting.  A smaller λ (larger C) reduces regularization, allowing the model to fit the training data more closely.
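
To see the shrinking effect in isolation, here is a toy sketch: a one-weight logistic model with no intercept, trained by plain gradient descent on an L2-penalized log loss. The function `fit_1d`, the tiny dataset, the learning rate, and the two λ values are all assumptions for illustration, not how scikit-learn solves the problem.

```python
import math

def fit_1d(xs, ys, lam, lr=0.1, steps=2000):
    """Gradient descent on mean log loss + (lam / 2) * w**2 (one weight, no intercept)."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean log loss with respect to w
        grad = sum((1 / (1 + math.exp(-w * x)) - y) * x for x, y in zip(xs, ys)) / len(xs)
        # The L2 penalty adds lam * w to the gradient
        w -= lr * (grad + lam * w)
    return w

# Made-up, non-separable data: larger x tends to mean class 1
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w_weak = fit_1d(xs, ys, lam=0.01)   # weak regularization (large C)
w_strong = fit_1d(xs, ys, lam=1.0)  # strong regularization (small C)
print(f"Weight with lambda=0.01: {w_weak:.4f}")
print(f"Weight with lambda=1.0:  {w_strong:.4f}")
```

The weight learned under the larger λ is pulled closer to zero, which is the shrinkage described above.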

Q9. What are the key assumptions of Logistic Regression?

Answer. Key assumptions of Logistic Regression include:
A binary dependent variable (an ordinal outcome calls for ordinal logistic regression).  
Independence of observations.  
Linearity between the independent variables and the log-odds of the outcome.  
Sufficiently large sample size.  
Little or no multicollinearity among independent variables.  

Q10. What are some alternatives to Logistic Regression for classification tasks?

Answer. Alternatives to Logistic Regression for classification tasks include:
Decision Trees.
Support Vector Machines (SVMs).
Naive Bayes.
K-Nearest Neighbors (KNN).
Random Forest.
Gradient Boosting algorithms.
Neural Networks.  

Q11. What are Classification Evaluation Metrics?

Answer. Classification evaluation metrics are used to assess the performance of a classification model.   
Common metrics include accuracy, precision, recall, F1-score, and ROC-AUC.
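
The count-based metrics follow directly from the four confusion-matrix cells. The sketch below uses made-up counts and a hypothetical helper `classification_metrics`; it assumes at least one predicted positive and one actual positive so that no division by zero occurs.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted positives, how many are correct
    recall = tp / (tp + fn)      # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts: 40 true positives, 10 false positives, 20 false negatives, 30 true negatives
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

ROC-AUC is not included here because it depends on the ranking of predicted probabilities rather than on a single set of counts.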

Q12. How does class imbalance affect Logistic Regression?

Answer. Class imbalance, where one class has significantly more samples than the other, can negatively impact Logistic Regression.  
It can lead to biased models that perform well on the majority class but poorly on the minority class.  
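
One common remedy is to weight each class inversely to its frequency. The sketch below computes such weights with the formula n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for `class_weight='balanced'`. The helper name `balanced_class_weights` and the 90/10 label split are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Made-up imbalanced labels: 90 samples of class 0, 10 samples of class 1
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, each minority-class error contributes about nine times as much to the loss as a majority-class error.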

Q13. What is Hyperparameter Tuning in Logistic Regression?

Answer. Hyperparameter tuning in Logistic Regression involves selecting the best values for parameters that are not learned from the data, such as the regularization strength (C) and the penalty type.  
Techniques like Grid Search and Randomized Search are used to find the optimal combination of hyperparameters.  

Q14. What are different solvers in Logistic Regression? Which one should be used?

Answer. Different solvers in Logistic Regression are algorithms used to optimize the cost function.  
Common solvers include 'liblinear', 'lbfgs', 'sag', 'saga', and 'newton-cg'.
The choice of solver depends on the dataset size, the penalty type, and whether you need multiclass classification.  For example, 'liblinear' is suitable for small datasets, while 'saga' is good for large datasets.  

Q15. How is Logistic Regression extended for multiclass classification?

Answer. Logistic Regression can be extended for multiclass classification using techniques like:
One-vs-Rest (OvR): Training a separate binary classifier for each class against all other classes.   
One-vs-One (OvO): Training a binary classifier for every pair of classes.
Softmax Regression: A generalization of Logistic Regression that directly handles multiple classes.

Q16. What are the advantages and disadvantages of Logistic Regression?

Answer. Advantages:
Simple to implement and interpret.
Efficient to train.
Provides probability estimates.
Disadvantages:
Assumes linearity between features and log-odds.  
Can struggle with complex non-linear relationships.   
Sensitive to multicollinearity.  

Q17. What are some use cases of Logistic Regression?

Answer. Use cases of Logistic Regression include:
Medical diagnosis (predicting disease presence).  
Spam detection.
Customer churn prediction.
Credit risk assessment.
Sentiment analysis.

Q18. What is the difference between Softmax Regression and Logistic Regression?

Answer. Logistic Regression is used for binary classification, while Softmax Regression is a generalization of Logistic Regression for multiclass classification.
Softmax Regression assigns probabilities to multiple classes, and the sum of probabilities across all classes is 1.
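
A short sketch of the softmax function shows both properties: the outputs sum to 1, and with two classes it reduces to the sigmoid. The function name `softmax` and the score values are assumptions for illustration.

```python
import math

def softmax(scores):
    """Convert a list of real-valued class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three classes
probs = softmax([2.0, 1.0, 0.1])
print("Class probabilities:", [round(p, 4) for p in probs])
print("Sum of probabilities:", sum(probs))

# With two classes and scores [z, 0], softmax matches the sigmoid of z
z = 0.7
two_class = softmax([z, 0.0])[0]
sigmoid_z = 1 / (1 + math.exp(-z))
print("Softmax:", two_class, "Sigmoid:", sigmoid_z)
```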

Q19. How do we choose between One-vs-Rest (OvR) and Softmax for multiclass classification?

Answer. The choice between OvR and Softmax depends on the specific problem.
Softmax is preferred when the classes are mutually exclusive.  
OvR can be suitable when the classes are not mutually exclusive.  

Q20. How do we interpret coefficients in Logistic Regression?

Answer. Coefficients in Logistic Regression represent the change in the log-odds of the outcome for a one-unit change in the predictor variable, holding other variables constant.  
A positive coefficient indicates that an increase in the predictor increases the log-odds of the outcome, while a negative coefficient indicates a decrease.
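
Because log-odds are hard to read directly, coefficients are often exponentiated into odds ratios. The sketch below uses a made-up coefficient of 0.7 and hypothetical helper names to show how a one-unit increase in a predictor changes the odds and the resulting probability.

```python
import math

def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in the predictor."""
    return math.exp(coefficient)

def probability_after_unit_increase(probability, coefficient):
    """Shift a starting probability by one unit of the predictor, other variables held constant."""
    odds = probability / (1 - probability)
    new_odds = odds * math.exp(coefficient)
    return new_odds / (1 + new_odds)

coef = 0.7  # made-up coefficient
ratio = odds_ratio(coef)
new_p = probability_after_unit_increase(0.5, coef)
print(f"Odds ratio: {ratio:.4f}")
print(f"Probability moves from 0.5 to {new_p:.4f}")
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient.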

In [None]:
# PRACTICAL
''' Q1. Write a Python program that loads a dataset, splits it into training and testing sets, applies Logistic Regression, and prints the model accuracy.'''
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris  # Using iris dataset for example

# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply Logistic Regression
model = LogisticRegression(solver='liblinear', multi_class='ovr')  # You can change the solver
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Print the model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

In [None]:
'''Q2. Write a Python program to apply L1 regularization (Lasso) on a dataset using LogisticRegression(penalty='l1') and print the model accuracy.'''
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply Logistic Regression with L1 regularization
model = LogisticRegression(penalty='l1', solver='liblinear', multi_class='ovr', C=1.0) # You can adjust C
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy with L1 Regularization:", accuracy)

In [None]:
'''Q3. Write a Python program to train Logistic Regression with L2 regularization (Ridge) using LogisticRegression(penalty='l2'). Print model accuracy and coefficients.'''
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression with L2 regularization
model = LogisticRegression(penalty='l2', solver='lbfgs', multi_class='ovr', C=1.0)  # You can adjust C and solver
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy with L2 Regularization:", accuracy)

# Print coefficients
print("Model Coefficients:", model.coef_)

In [None]:
#Q4. Write a Python program to train Logistic Regression with Elastic Net Regularization (penalty='elasticnet').
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression with Elastic Net regularization
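# The 'elasticnet' penalty is only supported by the 'saga' solver. Adjust C and l1_ratio as needed.
# max_iter is raised because saga can converge slowly on unscaled features.
model = LogisticRegression(penalty='elasticnet', solver='saga', multi_class='ovr', C=1.0, l1_ratio=0.5, max_iter=5000)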
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy with Elastic Net Regularization:", accuracy)

In [None]:
#Q5. Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr'.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression with OvR multiclass
model = LogisticRegression(solver='liblinear', multi_class='ovr', C=1.0)  # You can change the solver and C
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy with OvR Multiclass:", accuracy)

In [None]:
# Q6. Write a Python program to apply GridSearchCV to tune the hyperparameters (C and penalty) of Logistic Regression. Print the best parameters and accuracy.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

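# Define hyperparameter grid; each solver is paired only with penalties it supports
C_values = [0.001, 0.01, 0.1, 1, 10, 100]
param_grid = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': C_values},
    {'solver': ['saga'], 'penalty': ['l1', 'l2'], 'C': C_values},
    {'solver': ['saga'], 'penalty': ['elasticnet'], 'l1_ratio': [0.5], 'C': C_values},
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': C_values},
    {'solver': ['lbfgs'], 'penalty': [None]},  # No penalty, so C is not tuned here
]

# Apply GridSearchCV (max_iter raised so that saga and the unpenalized fit can converge)
grid_search = GridSearchCV(LogisticRegression(multi_class='ovr', max_iter=10000), param_grid, cv=3, verbose=0)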
grid_search.fit(X_train, y_train)

# Print best parameters
print("Best Parameters:", grid_search.best_params_)

# Predict with best model
y_pred = grid_search.best_estimator_.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Best Model Accuracy:", accuracy)

In [None]:
# Q7. Write a Python program to evaluate Logistic Regression using Stratified K-Fold Cross-Validation. Print the average accuracy.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Apply Stratified K-Fold Cross-Validation
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42) # You can change the number of splits
accuracies = []

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = LogisticRegression(solver='liblinear', multi_class='ovr')  # You can change the solver
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

# Print the average accuracy
average_accuracy = np.mean(accuracies)
print("Average Accuracy from Stratified K-Fold CV:", average_accuracy)


In [None]:
# Q8. Write a Python program to load a dataset from a CSV file, apply Logistic Regression, and evaluate its accuracy.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset from a CSV file
data = pd.read_csv('PeopleData.csv')

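# Separate features (X) and target (y)
# NOTE: replace 'Last Name' with the actual target column of your CSV.
# LogisticRegression needs numeric features, so encode any text columns
# (e.g. with pd.get_dummies) before fitting.
X = data.drop('Last Name', axis=1)
y = data['Last Name']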

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply Logistic Regression
model = LogisticRegression(solver='liblinear', multi_class='ovr')
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy on CSV Data:", accuracy)

In [None]:
# Q9. Write a Python program to apply RandomizedSearchCV for tuning hyperparameters (C, penalty, solver) in Logistic Regression. Print the best parameters and accuracy.

import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from scipy.stats import uniform, loguniform  # For parameter distributions

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

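# Define hyperparameter distributions for RandomizedSearchCV;
# each solver is paired only with penalties it supports
C_dist = loguniform(0.001, 100)  # Log-uniform distribution for C
param_distributions = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': C_dist},
    {'solver': ['saga'], 'penalty': ['l1', 'l2'], 'C': C_dist},
    {'solver': ['saga'], 'penalty': ['elasticnet'], 'l1_ratio': uniform(0, 1), 'C': C_dist},
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': C_dist},
    {'solver': ['lbfgs'], 'penalty': [None]},  # No penalty, so C is not sampled here
]

# Apply RandomizedSearchCV (max_iter raised so that saga and the unpenalized fit can converge)
random_search = RandomizedSearchCV(LogisticRegression(multi_class='ovr', max_iter=10000),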
                                     param_distributions,
                                     n_iter=10,  # Number of random samples
                                     cv=3,
                                     random_state=42,
                                     n_jobs=-1)  # Use all available cores
random_search.fit(X_train, y_train)

# Print best parameters
print("Best Parameters:", random_search.best_params_)

# Predict with best model
y_pred = random_search.best_estimator_.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Best Model Accuracy (RandomizedSearchCV):", accuracy)


In [None]:
# Q10. Write a Python program to implement One-vs-One (OvO) Multiclass Logistic Regression and print accuracy.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier

#Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Implement OvO Multiclass Logistic Regression
ovo_model = OneVsOneClassifier(LogisticRegression(solver='liblinear', C=1.0))  #  You can adjust solver and C
ovo_model.fit(X_train, y_train)

# Predict
y_pred = ovo_model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("OvO Multiclass Logistic Regression Accuracy:", accuracy)

In [None]:
# Q11. Write a Python program to train a Logistic Regression model and visualize the confusion matrix for binary classification.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.datasets import load_breast_cancer  # Using breast cancer dataset for binary classification
import seaborn as sns

# Load data (binary classification dataset)
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression(solver='liblinear')  # Good for binary classification
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

In [None]:


# Q12. Write a Python program to train a Logistic Regression model and evaluate its performance using Precision, Recall, and F1-Score.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.datasets import load_breast_cancer

# Load data (binary classification dataset)
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Calculate Precision, Recall, and F1-Score
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the metrics
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)

In [None]:

# Q13. Write a Python program to train a Logistic Regression model on imbalanced data and apply class weights to improve model performance.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import make_classification  # For generating an imbalanced dataset

# Generate an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, weights=[0.9, 0.1],  # 90% class 0, 10% class 1
                           random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)  # Stratify to keep imbalance

# Train Logistic Regression WITHOUT class weights (baseline)
model1 = LogisticRegression(solver='liblinear')
model1.fit(X_train, y_train)
y_pred1 = model1.predict(X_test)
print("--- Without Class Weights ---")
print(classification_report(y_test, y_pred1))


# Train Logistic Regression WITH class weights
model2 = LogisticRegression(solver='liblinear', class_weight='balanced')  # Apply class weights
model2.fit(X_train, y_train)
y_pred2 = model2.predict(X_test)
print("\n--- With Class Weights ---")
print(classification_report(y_test, y_pred2))


# Q14. Write a Python program to train Logistic Regression on the Titanic dataset, handle missing values, and evaluate performance.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.impute import SimpleImputer  # For handling missing values

# Load the Titanic dataset
titanic_data = pd.read_csv('titanic.csv')  # Replace 'titanic.csv' with the actual filename

# Select features and target (you might need to adjust these based on your dataset)
X = titanic_data[['Pclass', 'Age', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked']]
y = titanic_data['Survived']

# Handle categorical features (work on a copy to avoid pandas' SettingWithCopyWarning)
X = X.copy()
X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})
X['Embarked'] = X['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})  # Or use one-hot encoding

# Handle missing values using SimpleImputer
age_imputer = SimpleImputer(strategy='median')  # Or 'mean'
X['Age'] = age_imputer.fit_transform(X[['Age']]).ravel()
embarked_imputer = SimpleImputer(strategy='most_frequent')  # The mode suits a categorical code
X['Embarked'] = embarked_imputer.fit_transform(X[['Embarked']]).ravel()
X = X.fillna(X.median())  # Fill any remaining gaps (e.g. in Fare) with the column median


# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate performance
print("Titanic Dataset - Classification Report:\n", classification_report(y_test, y_pred))
print("Titanic Dataset - Accuracy:", accuracy_score(y_test, y_pred))

In [None]:


# Q15. Write a Python program to apply feature scaling (Standardization) before training a Logistic Regression model.
# Evaluate its accuracy and compare results with and without scaling.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression WITHOUT scaling
model_no_scale = LogisticRegression(solver='liblinear', multi_class='ovr')
model_no_scale.fit(X_train, y_train)
y_pred_no_scale = model_no_scale.predict(X_test)
accuracy_no_scale = accuracy_score(y_test, y_pred_no_scale)
print("Accuracy WITHOUT Scaling:", accuracy_no_scale)

# Apply StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression WITH scaling
model_scaled = LogisticRegression(solver='liblinear', multi_class='ovr')
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print("Accuracy WITH Scaling:", accuracy_scaled)

In [None]:
# Q16. Write a Python program to train Logistic Regression and evaluate its performance using ROC-AUC score.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.datasets import load_breast_cancer

# Load data (binary classification dataset)
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)

# Predict probabilities
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Calculate ROC-AUC score
roc_auc = roc_auc_score(y_test, y_pred_proba)
print("ROC-AUC Score:", roc_auc)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()


In [None]:
# Q17. Write a Python program to train Logistic Regression using a custom learning rate (C=0.5) and evaluate accuracy.
# Note: in scikit-learn, C is the inverse of regularization strength, not a learning rate.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression with custom C (inverse of regularization strength)
custom_C = 0.5
model = LogisticRegression(solver='liblinear', multi_class='ovr', C=custom_C)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with C={custom_C}:", accuracy)



In [None]:
# Q18. Write a Python program to train Logistic Regression and identify important features based on model coefficients.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression (using one-vs-rest for multiclass)
model = LogisticRegression(solver='liblinear', multi_class='ovr')
model.fit(X_train, y_train)

# Get coefficients (for OvR, there will be coefficients for each class)
coefficients = model.coef_

# Identify important features (magnitude of coefficients indicates importance)
print("Feature Importance (based on absolute coefficient magnitude for each class):")
for i, class_coefs in enumerate(coefficients):
    print(f"Class {i}:")
    feature_importance = sorted(zip(feature_names, abs(class_coefs)), key=lambda x: x[1], reverse=True)
    for feature, importance in feature_importance:
        print(f"  {feature}: {importance:.4f}")



In [None]:
# Q19. Write a Python program to train Logistic Regression and evaluate its performance using Cohen's Kappa Score.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression(solver='liblinear', multi_class='ovr')
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Calculate Cohen's Kappa Score
kappa_score = cohen_kappa_score(y_test, y_pred)
print("Cohen's Kappa Score:", kappa_score)


In [None]:
# Q20. Write a Python program to train Logistic Regression and visualize the Precision-Recall Curve for binary classification.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.datasets import load_breast_cancer

# Load data (binary classification dataset)
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)

# Predict probabilities for the positive class
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Calculate Precision-Recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)

# Calculate average precision score
average_precision = average_precision_score(y_test, y_pred_proba)
print(f"Average Precision Score: {average_precision:.2f}")

# Plot Precision-Recall curve
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, color='blue', lw=2, label=f'Precision-Recall curve (AP = {average_precision:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc="lower left")
plt.grid(True)
plt.show()

In [None]:
# Q21. Write a Python program to train Logistic Regression with different solvers (liblinear, saga, lbfgs) and compare their accuracy.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define solvers to compare
solvers = ['liblinear', 'saga', 'lbfgs']
accuracies = {}

# Train and evaluate Logistic Regression for each solver
for solver in solvers:
    model = LogisticRegression(solver=solver, multi_class='ovr', max_iter=10000)  # Increase max_iter to ensure convergence
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies[solver] = accuracy
    print(f"Accuracy with solver '{solver}': {accuracy}")

# Compare accuracies
best_solver = max(accuracies, key=accuracies.get)
print(f"\nBest Solver: {best_solver} with Accuracy: {accuracies[best_solver]}")



In [None]:
# Q22. Write a Python program to train Logistic Regression and evaluate its performance using Matthews Correlation Coefficient (MCC).

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.datasets import load_breast_cancer

# Load data (binary classification dataset)
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Calculate MCC
mcc = matthews_corrcoef(y_test, y_pred)
print("Matthews Correlation Coefficient (MCC):", mcc)
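
For a binary problem, MCC can also be computed directly from the confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below reuses the dataset and split from the cell above and checks that formula against `matthews_corrcoef`. The helper name `mcc_from_counts` is made up for illustration and is not part of scikit-learn.

```python
import math

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, matthews_corrcoef
from sklearn.model_selection import train_test_split


def mcc_from_counts(tp, tn, fp, fn):
    """Binary MCC from confusion-matrix counts (hypothetical helper)."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0 when a marginal count is zero and the ratio is undefined.
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_test, y_pred).ravel())

mcc_manual = mcc_from_counts(tp, tn, fp, fn)
mcc_sklearn = matthews_corrcoef(y_test, y_pred)
print(f"Manual MCC:  {mcc_manual:.6f}")
print(f"sklearn MCC: {mcc_sklearn:.6f}")
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which is why it is often preferred over accuracy on imbalanced data.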


In [None]:
# Q23. Write a Python program to train Logistic Regression on both raw and standardized data.
# Compare their accuracy to see the impact of feature scaling.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression on RAW data
# (multi_class is deprecated since scikit-learn 1.5; OneVsRestClassifier keeps the one-vs-rest scheme)
model_raw = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
model_raw.fit(X_train, y_train)
y_pred_raw = model_raw.predict(X_test)
accuracy_raw = accuracy_score(y_test, y_pred_raw)
print("Accuracy on Raw Data:", accuracy_raw)

# Standardize the data: fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression on STANDARDIZED data
model_scaled = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print("Accuracy on Standardized Data:", accuracy_scaled)

# Compare accuracies
print("\nComparison:")
print(f"Difference in Accuracy (standardized - raw): {accuracy_scaled - accuracy_raw:.4f}")
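
A single train/test split on 150 samples can hide or exaggerate the effect of scaling. The following is a minimal sketch, assuming 5-fold stratified cross-validation is an acceptable comparison protocol and using the default `lbfgs` solver rather than `liblinear`. Placing `StandardScaler` inside a pipeline means the scaler is refit on each training fold, so no statistics from the validation fold leak into training.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Use the same folds for both models so the comparison is like for like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

raw_model = LogisticRegression(max_iter=10000)
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

raw_scores = cross_val_score(raw_model, X, y, cv=cv, scoring='accuracy')
scaled_scores = cross_val_score(scaled_model, X, y, cv=cv, scoring='accuracy')

print(f"Raw features:          {raw_scores.mean():.3f} +/- {raw_scores.std():.3f}")
print(f"Standardized features: {scaled_scores.mean():.3f} +/- {scaled_scores.std():.3f}")
```

On a dataset like Iris, where all features share similar units, the two scores may be close; scaling tends to matter more when feature ranges differ by orders of magnitude.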


In [None]:
# Q24. Write a Python program to train Logistic Regression and find the optimal C (inverse of regularization strength) using cross-validation.

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a range of C values to test (smaller C means stronger regularization)
C_values = [0.001, 0.01, 0.1, 1, 10, 100]
best_C = None
best_accuracy = 0

# Use K-Fold Cross-Validation on the training split to find the optimal C
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # You can adjust the number of folds

for C in C_values:
    # multi_class is deprecated since scikit-learn 1.5; OneVsRestClassifier keeps the one-vs-rest scheme
    model = OneVsRestClassifier(LogisticRegression(solver='liblinear', C=C, max_iter=10000))
    scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy')
    avg_accuracy = np.mean(scores)

    print(f"C={C}, Average Accuracy (CV): {avg_accuracy}")

    if avg_accuracy > best_accuracy:
        best_accuracy = avg_accuracy
        best_C = C

print(f"\nOptimal C: {best_C} with Best Average Accuracy: {best_accuracy}")
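
The manual loop above can also be expressed with `GridSearchCV`, which runs the same search, refits the best model on the full training split, and leaves the held-out test split for a final check. This is a sketch under assumptions that differ from the cell above: it uses the default `lbfgs` solver, a stratified split, and `StratifiedKFold` instead of `KFold`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Same candidate values as the manual loop.
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    LogisticRegression(max_iter=10000),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X_train, y_train)  # refit=True by default: best model is retrained on X_train

best_C = search.best_params_["C"]
test_accuracy = search.score(X_test, y_test)

print(f"Optimal C: {best_C}")
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")
print(f"Test accuracy of refit model:  {test_accuracy:.4f}")
```

The cross-validated score is used only to choose C; the test accuracy is the less biased estimate of how the chosen model generalizes.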


In [None]:
# Q25. Write a Python program to train Logistic Regression, save the trained model using joblib, and load it again to make predictions.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
import joblib  # For saving and loading the model

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
# (multi_class is deprecated since scikit-learn 1.5; OneVsRestClassifier keeps the one-vs-rest scheme)
model = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
model.fit(X_train, y_train)

# Save the trained model
filename = 'logistic_regression_model.joblib'
joblib.dump(model, filename)
print(f"Model saved as {filename}")

# Load the model
loaded_model = joblib.load(filename)
print(f"Model loaded from {filename}")

# Make predictions using the loaded model
y_pred = loaded_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Loaded Model:", accuracy)
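
One way to confirm that saving and loading preserved the model is to compare the restored estimator against the original in memory. The sketch below assumes a temporary directory is an acceptable place for the file and uses the default `lbfgs` solver; the file name `model.joblib` is arbitrary.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=10000).fit(X_train, y_train)

# Write the model to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")  # arbitrary file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should carry identical parameters and predictions.
same_coefficients = np.array_equal(model.coef_, restored.coef_)
same_predictions = np.array_equal(model.predict(X_test), restored.predict(X_test))

print("Coefficients identical:", same_coefficients)
print("Predictions identical: ", same_predictions)
```

joblib files are pickle-based, so load them only from trusted sources, and preferably with the same scikit-learn version that produced them.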