                    Logistic Regression

  
1.What is Logistic Regression, and how does it differ from Linear Regression?

- Logistic Regression is a statistical model used for binary classification problems, where the outcome is categorical and can have only two possible values (e.g., yes/no, true/false, spam/not spam). It uses a logistic function (sigmoid function) to model the probability of the outcome belonging to a particular category.

Linear Regression, on the other hand, is used for predicting a continuous outcome variable based on one or more predictor variables. It assumes a linear relationship between the input variables and the output variable.

Key differences:

Output: Logistic Regression predicts a probability (between 0 and 1), while Linear Regression predicts a continuous value.

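Relationship: Logistic Regression models the log-odds of the positive class as a linear function of the inputs and passes that value through a sigmoid function, so the predicted probability is a non-linear (S-shaped) function of the inputs. Linear Regression models the output itself as a linear function of the inputs.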

Loss Function: Logistic Regression typically uses the log loss (or cross-entropy loss) function, while Linear Regression uses the mean squared error (MSE) loss function.

Applications: Logistic Regression is used for classification tasks, while Linear Regression is used for regression tasks.
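
The two loss functions named above can be written out explicitly. The notation is an assumption introduced here for illustration: $n$ training examples, true labels $y_i$, predicted probabilities $\hat{p}_i$ for Logistic Regression, and predicted values $\hat{y}_i$ for Linear Regression.

$$ \text{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right] $$

$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

Log loss grows without bound as a confident prediction turns out wrong (for example $\hat{p}_i \to 0$ when $y_i = 1$), whereas the squared error for a probability is at most 1.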

2.Explain the role of the Sigmoid function in Logistic Regression?

- The sigmoid function, also known as the logistic function, plays a crucial role in Logistic Regression. It is a mathematical function that takes any real-valued number and maps it to a value between 0 and 1. In Logistic Regression, the sigmoid function is used to transform the linear output of the model into a probability.

Specifically, the linear combination of the input features and their corresponding weights (i.e., the output of the linear part of the model) is passed through the sigmoid function. The output of the sigmoid function is then interpreted as the probability that the instance belongs to the positive class (usually denoted as 1).

The formula for the sigmoid function is:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

where $z$ is the linear output of the model.

The sigmoid function has several desirable properties for this purpose:

- It is monotonic and differentiable, which is important for optimization algorithms like gradient descent.
- It squashes the output to a range between 0 and 1, which can be interpreted as a probability.
- It has a clear interpretation: values closer to 1 indicate a higher probability of belonging to the positive class, while values closer to 0 indicate a higher probability of belonging to the negative class.
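
The mapping from linear output to probability can be sketched in a few lines of NumPy. The weights `w`, bias `b`, and feature vector `x_example` below are hypothetical values chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to the interval (0, 1) via 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and feature vector (not taken from any fitted model).
w = np.array([0.5, -1.0])
b = 0.25
x_example = np.array([2.0, 1.0])

z = w @ x_example + b   # linear part: 0.5*2.0 - 1.0*1.0 + 0.25 = 0.25
p = sigmoid(z)          # interpreted as P(class = 1 | x_example)

print("z =", z, "probability =", p)
print("sigmoid at -5, 0, 5:", sigmoid([-5.0, 0.0, 5.0]))
```

For very large negative inputs this naive form can trigger overflow warnings in `np.exp`; `scipy.special.expit` is a numerically safer drop-in if SciPy is available.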

3.What is Regularization in Logistic Regression and why is it needed?

- Regularization is a technique used in Logistic Regression (and other machine learning models) to prevent overfitting. Overfitting occurs when a model fits the training data too closely, including its noise and random fluctuations, which leads to poor performance on unseen data.

In Logistic Regression, overfitting can happen when the model has many features relative to the number of training examples or when the features are highly correlated. Regularization addresses this by adding a penalty term to the loss function that the model minimizes during training. This penalty term discourages the model from assigning excessively large weights to the features.

There are two common types of regularization used in Logistic Regression:

L1 Regularization (Lasso Regularization): Adds a penalty proportional to the sum of the absolute values of the weights. This type of regularization can lead to sparse models, where some feature weights become exactly zero, effectively performing feature selection.

L2 Regularization (Ridge Regularization): Adds a penalty proportional to the sum of the squared weights. This type of regularization shrinks the weights towards zero but does not necessarily make them exactly zero.

Need for regularization:

Preventing Overfitting: The primary reason for using regularization is to prevent the model from overfitting the training data, which improves its generalization ability to unseen data.

Handling Multicollinearity: Regularization can help in situations where there is high correlation between features (multicollinearity). It can shrink the weights of correlated features, making the model more stable.

Improving Model Interpretability: L1 regularization, by setting some weights to zero, can help in identifying the most important features, leading to a more interpretable model.

Reducing Model Complexity: By discouraging large weights, regularization effectively reduces the complexity of the model, making it less sensitive to small variations in the training data.
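
The difference between the two penalties can be seen on a small experiment. This is a minimal sketch on synthetic data (an assumption for illustration: 20 features, of which only 3 are informative); the value `C=0.05` is arbitrary, and `liblinear` is used because it supports both penalties.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 3 of them carry signal.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=3, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Same regularization strength for both penalties.
l1_model = LogisticRegression(penalty="l1", C=0.05, solver="liblinear").fit(X_demo, y_demo)
l2_model = LogisticRegression(penalty="l2", C=0.05, solver="liblinear").fit(X_demo, y_demo)

# Count weights that are exactly zero under each penalty.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print("Zero weights with L1:", n_zero_l1, "of", l1_model.coef_.size)
print("Zero weights with L2:", n_zero_l2, "of", l2_model.coef_.size)
```

With L1 the uninformative features tend to drop out entirely, while L2 keeps every weight small but non-zero.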


4.What are some common evaluation metrics for classification models, and why are they important?

- Here are some common evaluation metrics:

Accuracy:
The proportion of correctly classified instances out of the total number of instances.
Importance: Provides a general overview of the model's performance. However, it can be misleading in cases of imbalanced datasets where one class is significantly more prevalent than others.

Precision:
The proportion of true positive predictions among all positive predictions.
Importance: Useful when the cost of false positives is high. It measures the model's ability to avoid incorrectly classifying negative instances as positive.

Recall (Sensitivity or True Positive Rate):
The proportion of true positive predictions among all actual positive instances.
Importance: Useful when the cost of false negatives is high. It measures the model's ability to find all positive instances.

F1-Score:
The harmonic mean of precision and recall.
Importance: Provides a balance between precision and recall, especially useful when there is an uneven class distribution.

Confusion Matrix:
A table that summarizes the performance of a classification model on a set of test data. It shows the number of true positives, true negatives, false positives, and false negatives.
Importance: Provides a detailed breakdown of the model's predictions and helps in understanding where the model is making errors.

AUC (Area Under the ROC Curve):
The area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate at various threshold settings.
Importance: Measures the model's ability to distinguish between positive and negative classes. A higher AUC indicates better performance.

Log Loss (Cross-Entropy Loss):
A metric that penalizes incorrect predictions based on the predicted probability. It measures the performance of a classification model whose output is a probability value between 0 and 1.
Importance: Useful in evaluating the confidence of the model's predictions. Lower log loss indicates better performance.
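
The definitions above can be checked by hand on a tiny example. The labels and predictions below are made up for illustration; the manual formulas are then compared with the scikit-learn functions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and predictions (10 instances, 4 actual positives).
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels {0, 1}, ravel() returns TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 8 / 10
precision = tp / (tp + fp)                   # 3 / 4
recall = tp / (tp + fn)                      # 3 / 4
f1 = 2 * precision * recall / (precision + recall)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall, "f1:", f1)

# Cross-check against scikit-learn's implementations.
print(accuracy_score(y_true_demo, y_pred_demo),
      precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```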

Why are they important?

Understanding Performance: Metrics provide a quantitative way to measure how well a model is performing on a given task.

Comparing Models: Different models can be compared based on their performance on various metrics to choose the best model for a specific problem.

Identifying Model Weaknesses: Analyzing different metrics can reveal specific areas where the model is struggling (e.g., high false positives or false negatives).

Guiding Model Improvement: Understanding the metrics can help in guiding the process of improving the model, such as tuning hyperparameters or selecting different features.

Communicating Results: Metrics provide a standardized way to communicate the performance of a model to stakeholders.




5.Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.datasets import load_iris
data=load_iris()
data
data.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [2]:
df=pd.DataFrame(data.data,columns=data.feature_names)
df['target']=data.target
df.head()
df.tail()
df.sample(1)
df.target.unique()

array([0, 1, 2])

In [3]:
# for binary classification only two classes are required
df=df[df.target!=2]
df

    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                 5.1               3.5                1.4               0.2       0
1                 4.9               3.0                1.4               0.2       0
2                 4.7               3.2                1.3               0.2       0
3                 4.6               3.1                1.5               0.2       0
4                 5.0               3.6                1.4               0.2       0
..                ...               ...                ...               ...     ...
95                5.7               3.0                4.2               1.2       1
96                5.7               2.9                4.2               1.3       1
97                6.2               2.9                4.3               1.3       1
98                5.1               2.5                3.0               1.1       1


In [4]:
df.target.unique()

array([0, 1])

In [5]:
x=df.iloc[:,:-1]
x
y=df.iloc[:,-1]
y

    target
0        0
1        0
2        0
3        0
4        0
..     ...
95       1
96       1
97       1
98       1


In [6]:
from sklearn.model_selection import train_test_split #Train test split process
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=1)
x_train.shape,x_test.shape,y_train.shape,y_test.shape

((80, 4), (20, 4), (80,), (20,))

In [7]:
# Model Training
from sklearn.linear_model import LogisticRegression
classifier=LogisticRegression()
classifier.fit(x_train,y_train)
classifier
y_pred=classifier.predict(x_test)
y_pred


array([1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0])

In [8]:
classifier_proba=classifier.predict_proba(x_test)
classifier_proba

array([[0.04043262, 0.95956738],
       [0.01046123, 0.98953877],
       [0.98706759, 0.01293241],
       [0.05440516, 0.94559484],
       [0.1383348 , 0.8616652 ],
       [0.97966131, 0.02033869],
       [0.98204504, 0.01795496],
       [0.03292232, 0.96707768],
       [0.03380573, 0.96619427],
       [0.00850516, 0.99149484],
       [0.02466034, 0.97533966],
       [0.97515854, 0.02484146],
       [0.00517837, 0.99482163],
       [0.00238366, 0.99761634],
       [0.0077453 , 0.9922547 ],
       [0.98619342, 0.01380658],
       [0.96597536, 0.03402464],
       [0.94907826, 0.05092174],
       [0.00735493, 0.99264507],
       [0.97742977, 0.02257023]])

In [43]:
# Metric evaluation
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
print("\n Accuracy Score:\n")
print(accuracy_score(y_test,y_pred))
print("\n Confusion Matrix:\n")
print(confusion_matrix(y_test,y_pred))
print("\n Classification Report:\n")
print(classification_report(y_test,y_pred))


 Accuracy Score:

1.0

 Confusion Matrix:

[[ 8  0]
 [ 0 12]]

 Classification Report:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         8
           1       1.00      1.00      1.00        12

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20
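
The question asks for a CSV file, while the cells above load the data with `load_iris`. A minimal sketch of the CSV route follows. It assumes the file has the feature columns plus a `target` column; the file name `iris_binary.csv` is hypothetical and is written to a temporary directory first so the example is self-contained. Variable names differ from the cells above so that nothing defined there is overwritten.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Create a CSV to read back (stand-in for a real file on disk).
iris_frame = load_iris(as_frame=True).frame
iris_frame = iris_frame[iris_frame["target"] != 2]   # keep two classes only
csv_path = os.path.join(tempfile.mkdtemp(), "iris_binary.csv")
iris_frame.to_csv(csv_path, index=False)

# Load the CSV into a DataFrame and separate features from the label.
df_csv = pd.read_csv(csv_path)
X_csv = df_csv.drop(columns="target")
y_csv = df_csv["target"]

# Split, train, and report accuracy on the held-out rows.
X_csv_train, X_csv_test, y_csv_train, y_csv_test = train_test_split(
    X_csv, y_csv, test_size=0.2, random_state=1, stratify=y_csv
)
csv_model = LogisticRegression(max_iter=1000)
csv_model.fit(X_csv_train, y_csv_train)
csv_accuracy = accuracy_score(y_csv_test, csv_model.predict(X_csv_test))
print("Accuracy:", csv_accuracy)
```

For a real dataset, replace `csv_path` with the path to the file and `"target"` with the name of its label column.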



6.Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.


In [33]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train a Logistic Regression model with L2 regularization
# C is the inverse of regularization strength; smaller values specify stronger regularization.
classifier_l2 = LogisticRegression(penalty='l2', C=1.0)
classifier_l2.fit(x_train, y_train)

# Print the model coefficients
print("\n Model Coefficients:\n")
print(classifier_l2.coef_)
print("\n Model Intercept:\n")
print(classifier_l2.intercept_)


# Make predictions on the test set
y_pred_l2 = classifier_l2.predict(x_test)

# Print the accuracy
accuracy_l2 = accuracy_score(y_test, y_pred_l2)
print(f"\nAccuracy with L2 regularization: {accuracy_l2}")


 Model Coefficients:

[[ 0.46100411 -0.78836575  2.18624929  0.92865666]]

 Model Intercept:

[-6.80586873]

Accuracy with L2 regularization: 1.0
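
To see what `C` does to the coefficients, the model can be refit at a few regularization strengths. This is a small sketch, not part of the original cells: it reloads the two-class iris data so it runs on its own, and the three `C` values are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Two-class subset of iris (classes 0 and 1), as in the cells above.
X_all, y_all = load_iris(return_X_y=True)
two_class = y_all != 2
X_bin, y_bin = X_all[two_class], y_all[two_class]

# Fit with the default L2 penalty at increasing values of C.
coef_norms = {}
for C in (0.01, 1.0, 100.0):
    l2_fit = LogisticRegression(C=C, max_iter=5000).fit(X_bin, y_bin)
    coef_norms[C] = float(np.linalg.norm(l2_fit.coef_))
    print(f"C={C:>6}: coefficients={np.round(l2_fit.coef_[0], 3)}, L2 norm={coef_norms[C]:.3f}")
```

Smaller `C` means a stronger penalty, so the coefficient norm shrinks as `C` decreases.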


7.Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report.

In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Use the full dataset for multiclass classification
# Reload the data to include all three classes
from sklearn.datasets import load_iris
data = load_iris()
df_multi = pd.DataFrame(data.data, columns=data.feature_names)
df_multi['target'] = data.target

x_multi = df_multi.iloc[:, :-1]
y_multi = df_multi.iloc[:, -1]

# Split the data into training and testing sets
x_train_multi, x_test_multi, y_train_multi, y_test_multi = train_test_split(x_multi, y_multi, test_size=0.2, random_state=1)

# Train a Logistic Regression model with multi_class='ovr'
classifier_ovr = LogisticRegression(multi_class='ovr', solver='liblinear') # 'liblinear' solver is suitable for 'ovr'
classifier_ovr.fit(x_train_multi, y_train_multi)

# Make predictions on the test set
y_pred_ovr = classifier_ovr.predict(x_test_multi)

# Print the classification report
print("\nClassification Report (multi_class='ovr'):\n")
print(classification_report(y_test_multi, y_pred_ovr))


Classification Report (multi_class='ovr'):

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.62      0.76        13
           2       0.55      1.00      0.71         6

    accuracy                           0.83        30
   macro avg       0.85      0.87      0.82        30
weighted avg       0.91      0.83      0.84        30
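
Recent scikit-learn releases (1.5 and later, to the best of my knowledge) emit a deprecation warning for the `multi_class` argument. A sketch of the same one-vs-rest idea using the `OneVsRestClassifier` wrapper follows; it is an alternative formulation, not the cell above, and its scores may differ slightly from the report shown.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Full three-class iris data, split the same way as above.
X_iris, y_iris = load_iris(return_X_y=True)
X_ovr_train, X_ovr_test, y_ovr_train, y_ovr_test = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=1
)

# One binary Logistic Regression per class; the class with the highest score wins.
ovr_wrapper = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
ovr_wrapper.fit(X_ovr_train, y_ovr_train)

y_ovr_pred = ovr_wrapper.predict(X_ovr_test)
ovr_accuracy = accuracy_score(y_ovr_test, y_ovr_pred)
print("Number of binary models:", len(ovr_wrapper.estimators_))
print(classification_report(y_ovr_test, y_ovr_pred))
```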



8.Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy.

In [44]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Define the parameter grid
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

# Create a Logistic Regression model
# Use a solver that supports both l1 and l2 penalties, like 'liblinear' or 'saga'
# For 'l1' penalty, 'liblinear' is generally faster for small datasets
model = LogisticRegression(solver='liblinear')

# Create GridSearchCV object
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')

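# Fit the grid search to the data
# Using the binary classification data (x, y) from previous steps.
# Note: x and y include the rows that make up x_test, so the test accuracy
# printed at the end is optimistic; fit on (x_train, y_train) for an unbiased estimate.
grid_search.fit(x, y)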

# Print the best parameters and best score (validation accuracy)
print("\nBest Parameters:\n")
print(grid_search.best_params_)
print("\nBest Validation Accuracy:\n")
print(grid_search.best_score_)

# You can also evaluate the best model on the test set
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(x_test)
test_accuracy = accuracy_score(y_test, y_pred_tuned)
print(f"\nTest Accuracy with best parameters: {test_accuracy}")


Best Parameters:

{'C': 0.01, 'penalty': 'l2'}

Best Validation Accuracy:

1.0

Test Accuracy with best parameters: 1.0
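
Because the cell above fits the search on `x` and `y`, which contain the test rows, its test accuracy is not an independent check. A sketch of a leakage-free variant follows: the search sees only the training split, and scaling sits inside a pipeline so it is refit on each fold. The step names `scale` and `clf` and the grid values are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two-class iris data, reloaded so the example runs on its own.
X_all, y_all = load_iris(return_X_y=True)
two_class = y_all != 2
X_bin, y_bin = X_all[two_class], y_all[two_class]

X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_bin, y_bin, test_size=0.2, random_state=1, stratify=y_bin
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),  # supports l1 and l2
])
grid = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}

# Cross-validation happens inside the training split only.
search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
search.fit(X_fit, y_fit)

holdout_accuracy = search.score(X_hold, y_hold)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
print("Held-out accuracy:", holdout_accuracy)
```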


9.Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.

In [45]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Use the binary classification data (x, y)
# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

# --- Model without scaling ---
print("\n--- Model without Scaling ---")
# Train a Logistic Regression model without scaling
model_no_scale = LogisticRegression()
model_no_scale.fit(x_train, y_train)

# Make predictions and evaluate
y_pred_no_scale = model_no_scale.predict(x_test)
accuracy_no_scale = accuracy_score(y_test, y_pred_no_scale)
print(f"Accuracy without scaling: {accuracy_no_scale}")

# --- Model with scaling ---
print("\n--- Model with Scaling ---")
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both training and testing data
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

# Train a Logistic Regression model with scaled data
model_scaled = LogisticRegression()
model_scaled.fit(x_train_scaled, y_train)

# Make predictions and evaluate
y_pred_scaled = model_scaled.predict(x_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy with scaling: {accuracy_scaled}")


--- Model without Scaling ---
Accuracy without scaling: 1.0

--- Model with Scaling ---
Accuracy with scaling: 1.0
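
Both models reach the same accuracy here, which is not surprising: the two iris classes are easy to separate and all four features are measured in centimetres. A single train/test split is also a coarse comparison. The sketch below, an addition that reloads the data to stay self-contained, compares the two setups with 5-fold cross-validation and uses `make_pipeline` so the scaler is fit only on each training fold.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two-class iris data.
X_all, y_all = load_iris(return_X_y=True)
two_class = y_all != 2
X_bin, y_bin = X_all[two_class], y_all[two_class]

unscaled_model = LogisticRegression(max_iter=1000)
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# cross_val_score clones and refits each estimator per fold.
scores_unscaled = cross_val_score(unscaled_model, X_bin, y_bin, cv=5, scoring="accuracy")
scores_scaled = cross_val_score(scaled_model, X_bin, y_bin, cv=5, scoring="accuracy")

print("Mean CV accuracy without scaling:", scores_unscaled.mean())
print("Mean CV accuracy with scaling:   ", scores_scaled.mean())
```

Scaling tends to matter more when features have very different ranges or when the regularization is strong.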


10.Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model, including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

- Breakdown of the approach:

1. Data Handling:

- Understand the Data: Thoroughly explore the dataset to understand the features available (e.g., customer demographics, purchase history, website activity, previous campaign interactions) and the target variable (customer response: 1 for responded, 0 for not responded). Identify missing values, outliers, and data types.
- Feature Engineering: Create new features that might be more informative for the model. For example, you could calculate metrics like:
  - Time since last purchase
  - Number of purchases in the last year
  - Average order value
  - Frequency of website visits
  - Interaction with previous marketing materials
- Handle Categorical Features: Encode categorical features using techniques like one-hot encoding or dummy encoding.

2. Feature Scaling:

- Standardization or Normalization: Regularized Logistic Regression is sensitive to the scale of features, because the penalty treats all weights alike, and gradient-based solvers converge faster on scaled inputs. Standardize (using StandardScaler) or normalize (using MinMaxScaler) the numerical features. Fit the scaler on the training data only, then use it to transform both the training and testing data; this prevents data leakage.

3. Balancing Classes:

Given the significant class imbalance (only 5% response rate), simply training the model on the raw data will likely lead to a model that predicts the majority class (no response) most of the time, resulting in high accuracy but poor performance in identifying the responders. To address this:

- Choose an Appropriate Technique:
  - Oversampling the Minority Class: Techniques like SMOTE (Synthetic Minority Over-sampling Technique) create synthetic samples of the minority class to increase its representation.
  - Undersampling the Majority Class: This involves randomly removing samples from the majority class. However, this can lead to loss of valuable information.
  - Combination of Oversampling and Undersampling: Using techniques like SMOTE followed by Tomek links or Edited Nearest Neighbors.
- Implementation: Apply the chosen balancing technique only to the training data. Do not balance the test set, as it should represent the real-world imbalanced distribution.
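
SMOTE lives in the separate imbalanced-learn package, so the sketch below swaps in a different balancing technique that needs only scikit-learn: class weighting via `class_weight='balanced'`. This is a substitute for illustration, not the resampling methods listed above. The data is synthetic (an assumption standing in for the campaign data) with roughly 5% positives.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the campaign data: about 5% of rows are responders.
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=10, n_informative=4,
    weights=[0.95, 0.05], class_sep=0.8, random_state=0,
)
X_bal_train, X_bal_test, y_bal_train, y_bal_test = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

# Same model with and without class weighting; the test set stays imbalanced.
plain_model = LogisticRegression(max_iter=1000).fit(X_bal_train, y_bal_train)
weighted_model = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_bal_train, y_bal_train
)

recall_plain = recall_score(y_bal_test, plain_model.predict(X_bal_test))
recall_weighted = recall_score(y_bal_test, weighted_model.predict(X_bal_test))
precision_weighted = precision_score(
    y_bal_test, weighted_model.predict(X_bal_test), zero_division=0
)

print("Recall without weighting:", recall_plain)
print("Recall with class_weight='balanced':", recall_weighted)
print("Precision with class_weight='balanced':", precision_weighted)
```

Weighting typically raises recall on the minority class at the cost of precision, which is the same trade-off resampling makes.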

4. Hyperparameter Tuning:

- Identify Important Hyperparameters: For Logistic Regression, key hyperparameters to tune include:
  - C: Inverse of regularization strength. Smaller values mean stronger regularization.
  - penalty: Specifies the norm used in the penalization ('l1' or 'l2'). 'l1' can lead to sparser models (feature selection).
  - solver: Algorithm to use for optimization (e.g., 'liblinear', 'saga', 'lbfgs'). Choose a solver that supports the chosen penalty.
- Use Cross-Validation: Employ GridSearchCV or RandomizedSearchCV with stratified cross-validation on the training data to find the best combination of hyperparameters. If resampling is used, apply it inside each training fold (for example through a pipeline) so that the validation folds keep the real class distribution.
- Specify Appropriate Scoring: When using GridSearchCV or RandomizedSearchCV, use evaluation metrics that are suitable for imbalanced datasets, such as:
  - F1-score
  - Precision-Recall AUC
  - Area Under the ROC Curve (AUC): less sensitive to imbalance than accuracy and still useful.
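
The tuning step can be sketched as follows, assuming synthetic data in place of the campaign training set and using `scoring='average_precision'` as the precision-recall summary. The grid values are arbitrary, and class weighting stands in for resampling so that only scikit-learn is needed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data with about 5% positives (an assumption for illustration).
X_tune, y_tune = make_classification(
    n_samples=3000, n_features=10, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)

# Scaling and the classifier live in one pipeline, so each fold is scaled separately.
tune_pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", class_weight="balanced"),
)
tune_grid = {
    "logisticregression__C": [0.01, 0.1, 1, 10],
    "logisticregression__penalty": ["l1", "l2"],
}

# Stratified folds keep the 5% positive rate in every validation fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
tune_search = GridSearchCV(tune_pipe, tune_grid, cv=folds, scoring="average_precision")
tune_search.fit(X_tune, y_tune)

print("Best parameters:", tune_search.best_params_)
print("Best cross-validated average precision:", tune_search.best_score_)
```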

5. Evaluating the Model:

Since accuracy is not a reliable metric for imbalanced datasets, focus on other evaluation metrics:

- Confusion Matrix: Analyze the confusion matrix to understand the number of true positives (correctly identified responders), false positives (non-responders flagged as responders), true negatives (correctly identified non-responders), and false negatives (responders the model missed).
- Precision and Recall:
  - Precision: The proportion of actual responders among those predicted as responders. Important if the cost of contacting a non-responder is high.
  - Recall: The proportion of actual responders that the model correctly predicts as responders. Important if you want to maximize the number of responders you reach, even if it means contacting some non-responders.
- F1-Score: The harmonic mean of precision and recall, providing a balanced measure.
- AUC-ROC and AUC-PR:
  - AUC-ROC: Measures the ability of the model to distinguish between the two classes.
  - AUC-PR (Area Under the Precision-Recall Curve): Particularly useful for imbalanced datasets as it focuses on the performance of the model on the minority class.
- Classification Report: Use sklearn.metrics.classification_report to get a summary of precision, recall, F1-score, and support for each class.
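
The ranking metrics above are computed from predicted probabilities rather than hard labels. A short sketch on synthetic data (an assumption, standing in for the campaign data) follows; `average_precision_score` is used here as a common summary of the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with about 5% positives.
X_eval, y_eval = make_classification(
    n_samples=5000, n_features=10, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)
X_ev_train, X_ev_test, y_ev_train, y_ev_test = train_test_split(
    X_eval, y_eval, test_size=0.2, stratify=y_eval, random_state=0
)

eval_model = LogisticRegression(max_iter=1000).fit(X_ev_train, y_ev_train)
positive_proba = eval_model.predict_proba(X_ev_test)[:, 1]  # P(class = 1)

roc_auc = roc_auc_score(y_ev_test, positive_proba)
pr_auc = average_precision_score(y_ev_test, positive_proba)
positive_rate = y_ev_test.mean()  # what a random ranking would score on AUC-PR

print("AUC-ROC:", roc_auc, "(random baseline: 0.5)")
print("AUC-PR: ", pr_auc, "(random baseline:", positive_rate, ")")
```

The AUC-PR baseline equals the positive rate, so on rare-event data it leaves far more room to show differences between models than AUC-ROC does.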

Real-World Business Use Case Considerations:

- Define the Objective: Clearly define what "successful" means for the marketing campaign. Is it maximizing the number of responders (high recall) or minimizing the cost of contacting non-responders (high precision)? This will guide your choice of evaluation metrics and potentially the balancing technique.
- Thresholding: The default threshold for Logistic Regression is 0.5. For imbalanced datasets, you might need to adjust this threshold to optimize for precision or recall based on the business objective.
- Business Impact: Evaluate the model's performance in terms of business impact. For example, what is the expected return on investment (ROI) based on the model's predictions?
- A/B Testing: Once you have a trained model, it's crucial to conduct A/B testing to compare the performance of the campaign targeting customers identified by the model versus a control group.
- Model Monitoring: Continuously monitor the model's performance in production as customer behavior and data patterns can change over time. Retrain the model periodically with new data.
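
The thresholding point can be made concrete by comparing the default cut-off with a lower one. This is a sketch on synthetic data (an assumption); the value 0.2 is arbitrary and in practice would be chosen on a validation set according to the business objective.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with about 5% positives.
X_thr, y_thr = make_classification(
    n_samples=5000, n_features=10, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)
X_thr_train, X_thr_test, y_thr_train, y_thr_test = train_test_split(
    X_thr, y_thr, test_size=0.2, stratify=y_thr, random_state=0
)

thr_model = LogisticRegression(max_iter=1000).fit(X_thr_train, y_thr_train)
responder_proba = thr_model.predict_proba(X_thr_test)[:, 1]

# Turn probabilities into labels at two different cut-offs.
threshold_results = {}
for threshold in (0.5, 0.2):
    labels = (responder_proba >= threshold).astype(int)
    threshold_results[threshold] = {
        "contacted": int(labels.sum()),
        "precision": precision_score(y_thr_test, labels, zero_division=0),
        "recall": recall_score(y_thr_test, labels),
    }
    print(threshold, threshold_results[threshold])
```

Lowering the threshold contacts more customers and can only keep or raise recall; whether the accompanying drop in precision is acceptable depends on the cost of each contact.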
