<a href="https://colab.research.google.com/github/Yashgg10/LOGISTIC/blob/main/LOGISTIC_ASSIGNMENT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **1. What is Logistic Regression, and how does it differ from Linear Regression?**  
Logistic Regression is a classification algorithm that predicts categorical outcomes, whereas Linear Regression predicts continuous ones. Linear Regression fits a straight line (or hyperplane) to a continuous target; Logistic Regression passes the same kind of linear combination of features through the **Sigmoid function**, which maps it to a probability used for classification.  

### **2. What is the mathematical equation of Logistic Regression?**  
The equation for Logistic Regression is:  
\[
h_{\theta}(x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n)}}
\]  
where:  
- \( h_{\theta}(x) \) is the predicted probability,  
- \( \theta \) are the model parameters (weights),  
- \( x \) are the input features, and  
- \( e \) is Euler’s number (~2.718).  

### **3. Why do we use the Sigmoid function in Logistic Regression?**  
The Sigmoid function converts any real-valued number into a probability between 0 and 1. It ensures that the output can be interpreted as a probability and enables classification decisions based on a threshold (e.g., 0.5).  
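
As a minimal sketch of the idea above, the snippet below implements the Sigmoid function with the standard library and applies a 0.5 threshold. The helper names `sigmoid` and `predict_label` are illustrative, not part of any library; the branch on the sign of `z` is only there to avoid overflow for large negative inputs.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a value in [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(z) is small, so this form cannot overflow.
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_label(z: float, threshold: float = 0.5) -> int:
    """Turn the probability into a class label using a threshold."""
    return int(sigmoid(z) >= threshold)


for z in (-5.0, 0.0, 5.0):
    print(z, round(sigmoid(z), 4), predict_label(z))
```

Raising the threshold above 0.5 trades recall for precision, which is one reason the threshold is usually treated as a tunable choice rather than a fixed constant.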

### **4. What is the cost function of Logistic Regression?**  
The cost function in Logistic Regression is the **Log Loss** (Binary Cross-Entropy):  
\[
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)})) \right]
\]  
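This function penalizes confident but incorrect predictions heavily: the loss grows without bound as the predicted probability of the true class approaches 0. It is also convex in \( \theta \), which makes it well suited to gradient-based optimization.  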
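
The formula can be checked numerically with a short sketch. The function name `log_loss` and the clipping constant `eps` are assumptions made for this example (clipping avoids `log(0)`); the toy labels and probabilities are made up.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy J(theta), averaged over the m examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.9, 0.1]
confident_wrong = [0.1, 0.9, 0.1, 0.9]

print(log_loss(y_true, confident_right))  # small loss
print(log_loss(y_true, confident_wrong))  # much larger loss
```

Both prediction sets are equally "confident", but the wrong one costs roughly twenty times more, which is the behavior the cost function is designed to have.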

### **5. What is Regularization in Logistic Regression? Why is it needed?**  
Regularization is a technique to prevent overfitting by adding a penalty term to the cost function. In Logistic Regression, **L1 (Lasso) and L2 (Ridge) regularization** are commonly used. Regularization ensures that the model generalizes well to unseen data by controlling the complexity of the learned parameters.  

### **6. Explain the difference between Lasso, Ridge, and Elastic Net Regression.**  
- **Lasso (L1)**: Shrinks some coefficients to zero, performing feature selection.  
- **Ridge (L2)**: Distributes shrinkage across all coefficients but never makes them exactly zero.  
- **Elastic Net**: A combination of L1 and L2, useful when there are correlated features.  
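
To make the difference concrete, here is a small sketch of the three penalty terms and of the per-coefficient shrinkage each one induces. All function names are illustrative. The Elastic Net weighting (`l1_ratio` times L1 plus `(1 - l1_ratio) / 2` times squared L2) is one common convention and is an assumption here, and the two shrinkage helpers are the standard proximal steps for an L1 and a squared-L2 penalty, not a full training procedure.

```python
def l1_penalty(w):
    """Sum of absolute values (Lasso)."""
    return sum(abs(v) for v in w)


def l2_penalty(w):
    """Sum of squares (Ridge)."""
    return sum(v * v for v in w)


def elastic_net_penalty(w, l1_ratio):
    """Assumed convention: l1_ratio * L1 + (1 - l1_ratio) / 2 * L2."""
    return l1_ratio * l1_penalty(w) + 0.5 * (1.0 - l1_ratio) * l2_penalty(w)


def soft_threshold(v, t):
    """L1 shrinkage step: moves v toward 0 by t and clips at exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ridge_shrink(v, t):
    """Squared-L2 shrinkage step: rescales v, never reaching exactly 0."""
    return v / (1.0 + t)


weights = [3.0, -0.5, 0.2]
step = 1.0
print([soft_threshold(v, step) for v in weights])  # [2.0, 0.0, 0.0]
print([ridge_shrink(v, step) for v in weights])    # [1.5, -0.25, 0.1]
print(elastic_net_penalty(weights, l1_ratio=0.5))
```

The L1 step zeroes out the two small weights, while the L2 step keeps all three and only scales them down.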

### **7. When should we use Elastic Net instead of Lasso or Ridge?**  
Elastic Net is preferred when:  
- There are **many correlated features** (Lasso alone may select only one of them).  
- You want **both feature selection and coefficient shrinkage**.  

### **8. What is the impact of the regularization parameter (λ) in Logistic Regression?**  
- **Higher λ**: More regularization, simpler model (but may underfit).  
- **Lower λ**: Less regularization, more complex model (but may overfit).  
- **λ = 0**: Regularization is disabled, reducing Logistic Regression to its basic form.  
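
The effect of λ can be seen in a small experiment. The sketch below is an assumption-laden illustration, not a reference implementation: it fits an L2-regularized model by plain gradient descent on synthetic, noisy data (the function `fit_l2_logistic`, the learning rate, and the data are all made up for this example) and reports how the weight norm shrinks as λ grows. Scikit-Learn expresses the same knob through `C`, the inverse of the regularization strength.

```python
import numpy as np


def fit_l2_logistic(X, y, lam, lr=0.05, n_iter=5000):
    """Gradient descent on log loss + (lam / 2) * ||w||^2 (bias not penalized)."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad_w = X.T @ (p - y) / m + lam * w     # loss gradient + penalty gradient
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic, deliberately noisy data so the classes are not perfectly separable.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=200) > 0).astype(float)

lambdas = (0.0, 0.1, 1.0, 10.0)
norms = []
for lam in lambdas:
    w, _ = fit_l2_logistic(X, y, lam)
    norms.append(float(np.linalg.norm(w)))
    print(f"lambda={lam:<5} ||w|| = {norms[-1]:.4f}")
```

Larger λ pulls the weights toward zero, which is the "simpler model" referred to above.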

### **9. What are the key assumptions of Logistic Regression?**  
- The **dependent variable** is binary or categorical.  
- There is **little or no multicollinearity** among independent variables.  
- Observations are **independent** of each other.  
- The **log-odds of the outcome** have a **linear relationship** with the independent variables.  

### **10. What are some alternatives to Logistic Regression for classification tasks?**  
- Decision Trees  
- Random Forest  
- Support Vector Machines (SVM)  
- k-Nearest Neighbors (k-NN)  
- Naïve Bayes  
- Neural Networks (ANN, CNN)  

### **11. What are Classification Evaluation Metrics?**  
- **Accuracy**: Percentage of correctly predicted labels.  
- **Precision**: True Positives / (True Positives + False Positives).  
- **Recall (Sensitivity)**: True Positives / (True Positives + False Negatives).  
- **F1 Score**: Harmonic mean of Precision and Recall.  
- **AUC-ROC**: Measures the model’s ability to distinguish between classes.  
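
The count-based metrics above can be computed by hand for a binary problem, as in this sketch. The function name `binary_metrics` and the toy labels are illustrative; AUC-ROC is left out because it needs predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: 3 true positives, 1 false negative, 1 false positive, 5 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```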

### **12. How does class imbalance affect Logistic Regression?**  
Class imbalance can cause the model to be biased toward the majority class, reducing recall for the minority class. Solutions include:  
- Using **balanced class weights**.  
- **Oversampling** the minority class (SMOTE).  
- **Undersampling** the majority class.  
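
As a sketch of what "balanced" class weights mean, the snippet below computes weights inversely proportional to class frequency, using the heuristic `n_samples / (n_classes * class_count)`. This is the formula Scikit-Learn documents for `class_weight='balanced'`; the helper name `balanced_class_weights` and the 90/10 toy labels are assumptions for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


labels = [0] * 90 + [1] * 10  # 90% majority class, 10% minority class
weights = balanced_class_weights(labels)
print(weights)
```

Each minority example then counts nine times as much as a majority example in the loss, so errors on the rare class are no longer cheap to ignore.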

### **13. What is Hyperparameter Tuning in Logistic Regression?**  
Hyperparameter tuning involves selecting optimal values for parameters like:  
- **Regularization strength** (λ; Scikit-Learn exposes its inverse as `C`, so a smaller `C` means stronger regularization).  
- **Solver choice** (e.g., "lbfgs", "saga").  
Techniques like **Grid Search, Random Search, and Bayesian Optimization** can be used.  

### **14. What are different solvers in Logistic Regression? Which one should be used?**  
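- **lbfgs**: Default in Scikit-Learn; supports L2 or no penalty and works well for small to medium datasets.  
- **saga**: Supports L1, L2, and Elastic Net penalties; a good choice for large datasets.  
- **newton-cg**: Supports L2 or no penalty.  
- **liblinear**: Suitable for smaller datasets with L1 or L2 regularization; handles multiclass problems only through One-vs-Rest.  
- **sag**: Stochastic Average Gradient; supports L2 or no penalty and is good for large datasets. Both **sag** and **saga** converge fastest when features are on a similar scale.  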

### **15. How is Logistic Regression extended for multiclass classification?**  
- **One-vs-Rest (OvR)**: Trains multiple binary classifiers, one for each class.  
- **Softmax Regression (Multinomial Logistic Regression)**: Uses the Softmax function to predict probabilities for multiple classes in one model.  

### **16. What are the advantages and disadvantages of Logistic Regression?**  
✅ **Advantages**:  
- Simple and interpretable.  
- Works well on **linearly separable** data.  
- Efficient on small datasets.  

❌ **Disadvantages**:  
- Struggles with **non-linear relationships**.  
- **Sensitive to outliers**.  
- Can overfit **high-dimensional data** with few observations unless it is regularized.  

### **17. What are some use cases of Logistic Regression?**  
- **Medical Diagnosis** (e.g., predicting disease presence).  
- **Fraud Detection** (e.g., detecting fraudulent transactions).  
- **Spam Classification** (e.g., email filtering).  
- **Customer Churn Prediction**.  

### **18. What is the difference between Softmax Regression and Logistic Regression?**  
- **Logistic Regression** is used for **binary classification** (0 or 1).  
- **Softmax Regression** is an extension for **multiclass classification**, where it assigns probabilities to multiple categories.  
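
The relationship between the two can be sketched in a few lines. The helper names and the example scores below are made up; subtracting the maximum score before exponentiating is a standard trick for numerical stability and does not change the result.

```python
import math


def softmax(scores):
    """Turn a list of real-valued class scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# With two classes and the second score fixed at 0, Softmax reduces to the Sigmoid.
z = 1.5
print(softmax([z, 0.0])[0], sigmoid(z))
```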

### **19. How do we choose between One-vs-Rest (OvR) and Softmax for multiclass classification?**  
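- **OvR**: Often preferred when there are many classes, since each binary classifier is trained independently of the others.  
- **Softmax**: Better when classes are mutually exclusive and the number of classes is small, because a single model produces probabilities that sum to 1 across classes.  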

### **20. How do we interpret coefficients in Logistic Regression?**  
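Each coefficient \( \theta_i \) is the change in the **log-odds** of the positive class for a one-unit increase in feature \( x_i \), holding the other features fixed. Exponentiating it gives the **odds ratio**:  
\[
e^{\theta_i}
\]  
- If **\( e^{\theta_i} > 1 \)** (that is, \( \theta_i > 0 \)), a one-unit increase in the feature multiplies the odds by a factor greater than 1, so the probability of the positive class increases.  
- If **\( e^{\theta_i} < 1 \)** (that is, \( \theta_i < 0 \)), a one-unit increase in the feature multiplies the odds by a factor less than 1, so the probability of the positive class decreases.  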
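
A quick numerical check of this interpretation, using hypothetical parameters `theta0 = -1.0` and `theta1 = 0.7` (chosen only for illustration): whatever the starting value of the feature, raising it by one unit multiplies the odds by the same factor \( e^{\theta_1} \).

```python
import math

theta0, theta1 = -1.0, 0.7  # hypothetical intercept and coefficient


def probability(x):
    """Predicted probability of the positive class for a single feature x."""
    return 1.0 / (1.0 + math.exp(-(theta0 + theta1 * x)))


def odds(p):
    return p / (1.0 - p)


def odds_ratio(x):
    """Ratio of the odds at x + 1 to the odds at x."""
    return odds(probability(x + 1.0)) / odds(probability(x))


for x in (0.0, 1.0, 2.0):
    print(x, round(probability(x), 4), round(odds_ratio(x), 4))

print("exp(theta1) =", round(math.exp(theta1), 4))
```

The probability itself does not change by a constant amount per unit, only the odds do, which is why coefficients are reported as odds ratios.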

In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

In [22]:
# Write a Python program that loads a dataset, splits it into training and testing sets, applies Logistic Regression, and prints the model accuracy
from sklearn.datasets import load_iris
dataset = load_iris()


In [23]:
print(dataset.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

In [24]:
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target
df.head()

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0


In [25]:
X= df.drop('target', axis=1)
y= df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [26]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

In [27]:
y_pred = clf.predict(X_test)
y_pred

array([0, 1, 1, 0, 2, 1, 2, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 1, 1,
       2, 0, 2, 1, 0, 0, 1, 2, 1, 2, 1, 2, 2, 0, 1, 0, 1, 2, 2, 0, 2, 2,
       1])

In [28]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.9777777777777777

In [29]:
# Write a Python program to apply L1 regularization (Lasso) on a dataset using LogisticRegression(penalty='l1') and print the model accuracy
from sklearn.datasets import load_iris
dataset = load_iris()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target
X= df.drop('target', axis=1)
y= df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
clf = LogisticRegression(penalty='l1', solver='saga', C=1.5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.9777777777777777

In [30]:
# Write a Python program to train Logistic Regression with L2 regularization (Ridge) using LogisticRegression(penalty='l2'). Print model accuracy and coefficient
from sklearn.datasets import load_iris
dataset = load_iris()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target
X= df.drop('target', axis=1)
y= df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
clf = LogisticRegression(penalty='l2', solver='saga', C=0.5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(clf.coef_)

0.9555555555555556
[[ 0.41991059  1.11747107 -1.68042736 -0.76572796]
 [ 0.40429244 -0.45943723  0.02729104 -0.52749527]
 [-0.82420303 -0.65803384  1.65313632  1.29322323]]


In [31]:
# Write a Python program to train Logistic Regression with Elastic Net Regularization (penalty='elasticnet')
from sklearn.datasets import load_iris
dataset = load_iris()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target
X= df.drop('target', axis=1)
y= df['target']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5850)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

1.0

In [33]:
# Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr'
from sklearn.datasets import load_iris
dataset = load_iris()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target
X=df.drop('target',axis='columns')
y=df.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = LogisticRegression(multi_class='ovr')
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.8666666666666667

In [35]:
# Write a Python program to apply GridSearchCV to tune the hyperparameters (C and penalty) of Logistic Regression. Print the best parameters and accuracy
from sklearn.datasets import load_iris
dataset = load_iris()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target
X=df.drop('target',axis='columns')
y=df.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
from sklearn.model_selection import GridSearchCV
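# The default 'lbfgs' solver does not support penalty='l1': those candidates fail to fit and score NaN, so only the 'l2' candidates are actually compared
params = {'C': [0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2']}
grid_search = GridSearchCV(LogisticRegression(), params, cv=5, scoring='accuracy')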
grid_search.fit(X_train, y_train)
print("Best parameters: ", grid_search.best_params_)
print("Best accuracy: ", grid_search.best_score_)

Best parameters:  {'C': 1, 'penalty': 'l2'}
Best accuracy:  0.9714285714285715
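
For a search in which both penalties are really evaluated, the solver has to support L1. The following is one possible variant, not a definitive setup: it assumes the `saga` solver, adds feature scaling and a larger `max_iter` to help convergence, and uses a shorter `C` grid to keep the run fast. The step names `scale` and `clf` are arbitrary, and depending on your Scikit-Learn version the `penalty` argument may emit a deprecation warning.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Scaling inside the pipeline means each CV fold is scaled using only its own training part.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="saga", max_iter=5000)),  # saga supports l1 and l2
])
params = {"clf__C": [0.1, 1, 10], "clf__penalty": ["l1", "l2"]}

grid_search = GridSearchCV(pipe, params, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

print("Best parameters: ", grid_search.best_params_)
print("Best CV accuracy: ", grid_search.best_score_)
print("Test accuracy: ", grid_search.score(X_test, y_test))
```

Note that `best_score_` is the mean cross-validated accuracy on the training split; the held-out test accuracy is reported separately by `score`.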


In [36]:
# Write a Python program to train a Logistic Regression model and evaluate its performance using Precision, Recall, and F1-Score
from sklearn.datasets import load_iris
dataset = load_iris()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target
X=df.drop('target',axis='columns')
y=df.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
from sklearn.metrics import precision_score, recall_score, f1_score
print("Precision: ", precision_score(y_test, y_pred, average='weighted'))
print("Recall: ", recall_score(y_test, y_pred, average='weighted'))
print("F1-Score: ", f1_score(y_test, y_pred, average='weighted'))


Precision:  0.9793650793650793
Recall:  0.9777777777777777
F1-Score:  0.9778718400940623


In [42]:
#  Write a Python program to train Logistic Regression with different solvers (liblinear, saga, lbfgs) and compare their accuracy
from sklearn.datasets import load_iris
dataset = load_iris()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target
X=df.drop('target',axis='columns')
y=df.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
from sklearn.linear_model import LogisticRegression
# Train one model per solver on the same split and report its test accuracy
for solver in ['liblinear', 'saga', 'lbfgs']:
    model = LogisticRegression(solver=solver)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"Accuracy with {solver} solver: ", accuracy_score(y_test, y_pred))

Accuracy with liblinear solver:  0.8888888888888888
Accuracy with saga solver:  0.9777777777777777
Accuracy with lbfgs solver:  0.9777777777777777
