# <span style="color:darkblue;">[LDATS2350] - DATA MINING</span>

### <span style="color:darkred;">Python13 - Logistic Regression</span>

**Prof. Robin Van Oirbeek**  

<br/>

**<span style="color:darkgreen;">Guillaume Deside</span>** (<span style="color:gray;">guillaume.deside@uclouvain.be</span>)

---

## **🔹 What is Logistic Regression?**
Logistic Regression is a **supervised learning algorithm** used for **binary classification** problems. Unlike linear regression, which predicts continuous values, **logistic regression predicts the probability** of a given class.

### **Key Characteristics:**
✅ Used for **binary classification** (Yes/No, Spam/Not Spam, Churn/No Churn).  
✅ Outputs a probability score between **0 and 1**.  
✅ Uses the **sigmoid function** to map any real-valued number into the range **(0,1)**.  
✅ The decision boundary is defined by a threshold (e.g., **0.5**).  

---

## **🔹 The Logistic (Sigmoid) Function**
The logistic function is given by:


$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where:


$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \mathbf{w}^T \mathbf{x} + b$$

- $\mathbf{x} = (x_1, \dots, x_n)$ = feature vector  
- $\mathbf{w} = (w_1, \dots, w_n)$ = weight coefficients  
- $b$ = bias term  
- $ e $ = Euler’s number (**≈ 2.718**)  

The output of the function is a probability value between **0 and 1**.

**Decision Rule:**  
$$
\hat{y} =
\begin{cases}
1, & \text{if } \sigma(z) \geq 0.5 \\
0, & \text{otherwise}
\end{cases}
$$

🚀 **Interpretation:**  
- If $P(y=1 \mid X) \geq 0.5$, classify the input as **Class 1**.  
- If $P(y=1 \mid X) < 0.5$, classify the input as **Class 0**.
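
The sigmoid and the decision rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by `scikit-learn`; the helper names (`sigmoid`, `predict_proba`, `predict`) and the weights are hypothetical and chosen only for the example.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written so that large |z| does not overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, w, b):
    """P(y=1 | x) for one sample: sigmoid of the linear score w^T x + b."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)

def predict(x, w, b, threshold=0.5):
    """Decision rule: class 1 if the probability reaches the threshold, else class 0."""
    return 1 if predict_proba(x, w, b) >= threshold else 0

# Hypothetical weights and inputs, chosen only for illustration.
w, b = [2.0, -1.0], 0.5
print(predict_proba([1.0, 1.0], w, b))   # score z = 1.5, probability above 0.5
print(predict([1.0, 1.0], w, b))         # class 1
print(predict([-1.0, 1.0], w, b))        # score z = -2.5, class 0
```

Since $\sigma(0) = 0.5$, thresholding the probability at 0.5 is the same as checking the sign of the linear score $z$.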

---

## **🔹 Cost Function for Logistic Regression**
Logistic Regression **does not** use Mean Squared Error (MSE) as its cost function because, combined with the sigmoid, MSE leads to a **non-convex** function of the weights, making optimization difficult.

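Instead, the weights are estimated by maximizing the likelihood of the observed labels. Write $p(x_i) = P(Y_i = 1 \mid w, x_i) = \sigma(w^T x_i)$, with the bias $b$ absorbed into $w$ by adding a constant feature equal to 1. Assuming the $n$ observations are independent:
$$ L(w) := P(Y_1=y_1,\ldots,Y_n=y_n \mid w,x_1,\dots,x_n) = \prod_{i=1}^n P(Y_i=y_i \mid w,x_i) $$
$$ L(w) = \prod_{i \mid y_i=1} p(x_i) \prod_{i \mid y_i=0}(1- p(x_i)) $$
$$ L(w) = \prod_{i=1}^n p(x_i)^{y_i}(1- p(x_i))^{1-y_i} $$

Maximizing the likelihood is equivalent to maximizing the log-likelihood

$$l(w)=\sum_{i=1}^n \left( y_i \log p(x_i) + (1-y_i) \log(1- p(x_i)) \right)$$
$$ = \sum_{i=1}^n \left( y_i \log\frac{p(x_i)}{1-p(x_i)} + \log(1-p(x_i)) \right) $$
$$ = \sum_{i=1}^n \left( y_i\, w^T x_i - \log(1+e^{w^T x_i}) \right) $$

where the last step uses $\log\frac{p(x_i)}{1-p(x_i)} = w^T x_i$ and $1-p(x_i) = \frac{1}{1+e^{w^T x_i}}$.

Therefore, after adding an L2 penalty on the weights, we aim to solve the following optimization problem, in which the negative log-likelihood is minimized and $C$ is the inverse of the regularization strength:
$$\min_{w} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \left( \log(1+e^{w^T x_i}) - y_i\, w^T x_i \right) $$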
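
As a numerical sanity check, the sketch below evaluates the log-likelihood in two ways: in its probability form, $\sum_i y_i \log p(x_i) + (1-y_i)\log(1-p(x_i))$, and in its score form, $\sum_i y_i\, w^T x_i - \log(1+e^{w^T x_i})$, which follows from $p(x_i) = \sigma(w^T x_i)$. The toy data, the weights and the function names are assumptions made for illustration; the last feature is a constant 1 that plays the role of the bias.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood_prob_form(w, X, y):
    """Sum of y*log(p) + (1-y)*log(1-p), with p = sigmoid(w^T x)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def log_likelihood_score_form(w, X, y):
    """Sum of y*z - log(1 + e^z), with z = w^T x."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi))
        total += yi * z - math.log1p(math.exp(z))
    return total

def regularized_objective(w, X, y, C=1.0):
    """0.5*||w||^2 + C * (negative log-likelihood): the quantity to minimize."""
    penalty = 0.5 * sum(wj * wj for wj in w)
    return penalty - C * log_likelihood_score_form(w, X, y)

# Toy data (hypothetical): one real feature plus a constant 1 for the bias.
X_toy = [[0.5, 1.0], [-1.2, 1.0], [2.0, 1.0], [0.1, 1.0]]
y_toy = [1, 0, 1, 0]
w_toy = [0.8, -0.3]

ll_prob = log_likelihood_prob_form(w_toy, X_toy, y_toy)
ll_score = log_likelihood_score_form(w_toy, X_toy, y_toy)
print(ll_prob, ll_score)                            # the two forms agree
print(regularized_objective(w_toy, X_toy, y_toy))   # penalized negative log-likelihood
```

A larger `C` gives more weight to the data-fit term relative to the penalty, which is why `C` is described as the inverse of the regularization strength.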


## **🔹 Logistic Regression vs Linear Regression**
| Feature | Linear Regression | Logistic Regression |
|---------|------------------|----------------------|
| **Output Type** | Continuous | Probability (0-1) |
| **Function Used** | Linear Equation | Sigmoid Function |
| **Use Case** | Regression | Classification |
| **Loss Function** | Mean Squared Error (MSE) | Log Loss (Binary Cross Entropy) |
| **Decision Boundary** | Not applicable (output is any real number) | Threshold on the probability (0.5) |

---

## **🔹 Visual Representation of Logistic Regression**


![Logistic Regression Visualization](https://www.saedsayad.com/images/LogReg_1.png)



---

## **🎯 Summary**
✅ Logistic Regression is used for **classification** tasks, not regression.  
✅ Uses the **sigmoid function** to map real values to probabilities.  
✅ The **cost function is Log Loss**, typically optimized with gradient-based methods such as **Gradient Descent**.  
✅ The **decision threshold (0.5)** determines classification.  

🚀 **Next Steps:** Implement Logistic Regression in Python using `scikit-learn`! 🏆

---

# Data Loading

In [3]:
import pandas as pd
data = pd.read_csv('diabetes.csv')

In [5]:
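X = data.iloc[:, 0:-1]        # features: every column except the last
column_names = list(X)        # keep the feature names for later use
y = data.iloc[:, -1]          # target: the last column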

from sklearn.preprocessing import StandardScaler
#from sklearn.preprocessing import MinMaxScaler

# USING STANDARDSCALER:
# Standardize features by removing the mean and scaling to unit variance.
# Centering and scaling happen independently on each feature, using statistics
# computed on the samples passed to fit(). The mean and standard deviation are
# stored and reused on later data by the transform method.
# Standardization is a common requirement for many machine learning estimators:
# they might behave badly if the individual features do not look more or less
# like standard normally distributed data (zero mean, unit variance).

# Note: the scaler is fitted on the full dataset here; fitting it on the
# training set only would avoid leaking test-set statistics.
scaler = StandardScaler().fit(X)

X_scaled = scaler.transform(X)

# Wrap the scaled array in a DataFrame and restore the feature names
X_scaled = pd.DataFrame(X_scaled)
X_scaled.columns = column_names

from sklearn.model_selection import train_test_split

# Split the data into training and test sets.
# Note: the unscaled X is passed here; pass X_scaled instead to use the standardized features.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.30,    # the default would be a 75%/25% split
                                                    # shuffle=True by default
                                                    stratify=y,        # keep the class proportions in both sets
                                                    random_state=123)  # fix the random seed for reproducibility

print(X_train.shape)

(537, 8)
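
A common variant, sketched below under stated assumptions, is to split first and fit the scaler on the training set only, so that no test-set statistics enter the preprocessing. The data here are synthetic (generated with NumPy as a stand-in for `diabetes.csv`), and the names `X_demo`, `y_demo` and `scaler_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset: 200 samples, 3 unscaled features.
rng = np.random.default_rng(123)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = (X_demo[:, 0] > 50.0).astype(int)

# Split first, then fit the scaler on the training samples only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=123
)
scaler_demo = StandardScaler().fit(X_tr)

# Apply the training-set mean and standard deviation to both sets.
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))  # zero up to rounding error
print(X_tr_scaled.std(axis=0).round(6))   # one up to rounding error
print(X_te_scaled.mean(axis=0).round(3))  # close to zero, but not exactly zero
```

The test set is only approximately standardized, which is expected: it is transformed with statistics it did not contribute to.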


# Logistic Regression Model

<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">
 
## **Exercise: Logistic Regression Model Selection and Coefficients Analysis**

#### **Objective**
In this exercise, you will:
- Tune hyperparameters of **Logistic Regression** using **GridSearchCV**.
- Evaluate model performance using **recall** as the primary metric.
- Analyze **feature coefficients** to interpret the importance of each feature.

---

#### **Steps to Complete**

### **1️⃣ Train and Optimize a Logistic Regression Model**
1. Define a **Logistic Regression classifier**.
2. Set up a **GridSearchCV** to tune hyperparameters:
   - Regularization strength (**C**): `[0.01, 0.1, 1]`
   - Penalty: `'l2'`
   - Maximum iterations: `[50000, 10000]`
3. Use **3-fold cross-validation** (`cv=3`) and optimize based on **recall**.
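
One possible way to set up this step is sketched below. It is a hedged example, not a reference solution: it runs on synthetic data from `make_classification` instead of the diabetes dataset, and the variable names are assumptions. The L2 penalty is the default of `LogisticRegression`, so the sketch leaves it implicit and only searches over `C` and `max_iter`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, mildly imbalanced stand-in for the real data (8 features).
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, stratify=y_syn, random_state=123
)

# Grid from the exercise statement (L2 is the default penalty).
param_grid = {
    "C": [0.01, 0.1, 1],
    "max_iter": [50000, 10000],
}

# 3-fold cross-validation, selecting the combination with the best recall.
grid = GridSearchCV(LogisticRegression(), param_grid=param_grid,
                    cv=3, scoring="recall")
grid.fit(X_tr, y_tr)

print(grid.best_params_)
print(round(grid.best_score_, 3))   # mean cross-validated recall of the best combination
```

After fitting, `grid.best_estimator_` holds the model refitted on the whole training set with the selected hyperparameters.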


---

### **2️⃣ Evaluate Performance on the Test Set**
- Retrieve the best model from GridSearch.
- Predict on **both training and test sets**.
- Calculate **F1-score** to evaluate performance.
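
The evaluation step could look like the following sketch. To keep it self-contained, a directly fitted `LogisticRegression` on synthetic data stands in for the best model returned by the grid search; `best_model` and the other names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, stratify=y_syn, random_state=123
)

# Stand-in for the best estimator found by the grid search.
best_model = LogisticRegression(C=1, max_iter=10000).fit(X_tr, y_tr)

# Predict on both sets and compare the F1-scores.
f1_train = f1_score(y_tr, best_model.predict(X_tr))
f1_test = f1_score(y_te, best_model.predict(X_te))

print(f"F1 (train): {f1_train:.3f}")
print(f"F1 (test):  {f1_test:.3f}")
```

A training score far above the test score would suggest overfitting; similar scores suggest the model generalizes comparably to unseen data.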


---

### **3️⃣ Feature Importance: Analyzing Coefficients**
- Retrieve model coefficients.
- Zip feature names with their respective coefficients.
- **Plot feature importance** using a bar plot.
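
A minimal sketch of this step is given below, again on synthetic data with hypothetical feature names (`feature_0` ... `feature_7`). The features are standardized first so that coefficient magnitudes are comparable across features; this is an assumption of the sketch, not a requirement stated in the exercise.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data, with hypothetical feature names.
feature_names = [f"feature_{i}" for i in range(8)]
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=123
)
X_std = StandardScaler().fit_transform(X_syn)
model = LogisticRegression(max_iter=10000).fit(X_std, y_syn)

# For a binary problem coef_ has shape (1, n_features): take the single row,
# pair each coefficient with its feature name, and sort by value.
coefs = sorted(zip(feature_names, model.coef_[0]), key=lambda pair: pair[1])
names, values = zip(*coefs)

# Horizontal bar plot; in a notebook the figure renders automatically,
# in a script call plt.show() afterwards.
fig, ax = plt.subplots(figsize=(6, 4))
ax.barh(names, values)
ax.axvline(0.0, color="black", linewidth=0.8)
ax.set_xlabel("Coefficient (change in log-odds per standard deviation)")
ax.set_title("Logistic Regression coefficients")
fig.tight_layout()
```

A positive coefficient raises the log-odds of class 1 as the feature increases, and a negative one lowers them.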


---

### **4️⃣ Evaluate Model Performance with ROC Curve**
- Compute **ROC Curve** and **AUC score**.
- Plot **ROC Curve** to visualize model discrimination power.
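
The ROC analysis can be sketched as follows, using `roc_curve` and `roc_auc_score` from `sklearn.metrics`. The data are synthetic and the names are hypothetical. Note that the curve is built from the predicted probabilities of class 1, not from the hard 0/1 predictions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, stratify=y_syn, random_state=123
)
model = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)

# Probability of the positive class for each test sample.
scores = model.predict_proba(X_te)[:, 1]

# False/true positive rates over all thresholds, and the area under the curve.
fpr, tpr, thresholds = roc_curve(y_te, scores)
auc_value = roc_auc_score(y_te, scores)

# In a notebook the figure renders automatically; in a script call plt.show().
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(fpr, tpr, label=f"Logistic Regression (AUC = {auc_value:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="Chance level")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC Curve")
ax.legend(loc="lower right")
```

An AUC of 0.5 corresponds to the dashed chance line, and values closer to 1 indicate better separation between the two classes.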

---

### **Discussion Questions**
1. **Why do we optimize recall instead of accuracy in this case?**
2. **What does the sign of the coefficients tell us about the relationship between each feature and the target?**
3. **How does regularization (`C` parameter) affect feature importance?**
4. **What do you observe in the ROC Curve? Does the model have good discrimination power?** 

---

![features_importances_lr.png](attachment:30f957d0-5c86-4ccd-8f11-b2da2878869b.png)