# Session 2 – Supervised Learning

In this notebook, we will:
- Understand the concept of regression and classification
- Train a Linear Regression model using scikit-learn
- Evaluate model performance with appropriate metrics
- Visualize predictions vs real values
- Train and evaluate a Logistic Regression classifier
- Experiment with alternative classifiers

Dataset sources: the **Diabetes dataset** and the **Breast Cancer dataset**, both included in scikit-learn.

## Part 1: Regression with the Diabetes Dataset

We’ll use the **Diabetes dataset** included in scikit-learn.  
It contains data from 442 patients with diabetes: ten baseline variables covering age, sex, BMI, average blood pressure, and six blood serum measurements.

Our goal will be to **predict disease progression** (a continuous target variable) based on these features.


In [None]:
from sklearn.datasets import load_diabetes
import pandas as pd
import numpy as np

# Load the dataset
diabetes = load_diabetes()

# Create a DataFrame
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['target'] = diabetes.target

df.head()

### Inspect and Prepare the Data

Do we need any preprocessing this time?

In [None]:
# Check basic info
print(df.info())

# Basic statistics
print(df.describe())

# Check for missing values
print(df.isnull().sum())

### Split the Data

We’ll separate our dataset into:
- **Features (X)**: the 10 clinical variables  
- **Target (y)**: the disease progression measure  

Then we’ll split it into training and test sets to evaluate performance on unseen data.

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {X_train.shape[0]}, Testing samples: {X_test.shape[0]}")

### Train a Linear Regression Model

We’ll use **LinearRegression** from scikit-learn, which fits a linear function of the features (a straight line when there is a single feature) to predict a continuous target variable.

Let's first try to fit a regression with only the age variable:

In [None]:
X_train

In [None]:
from sklearn.linear_model import LinearRegression

X_train_age = X_train[['age']]
X_test_age = X_test[['age']]

# Initialize and fit the model
model = LinearRegression()
model.fit(X_train_age, y_train)

# Get predictions
y_pred = model.predict(X_test_age)

In [None]:
model.coef_

### Evaluate the Model

We’ll use **Mean Squared Error (MSE)** and **R² score** to measure performance.

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

### Visualize Predictions

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6,6))
sns.scatterplot(x=y_test, y=y_pred)
plt.xlabel("True Values")
plt.ylabel("Predicted Values")
plt.title("True vs Predicted Disease Progression")
# Dashed diagonal: points on this line would be perfect predictions
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.show()

### Exercise 1

1. Now fit a linear regression using all the features in the dataset
2. What does the model learn? Are the coefficients positive or negative? What might that mean biologically?

**1. Now fit a linear regression using all the features in the dataset**

**2. What does the model learn? Are the coefficients positive or negative? What might that mean biologically?**

### Exercise 2: Compute R² score, MAE and RMSE

Evaluate the model from Exercise 1 with the following metrics:

1. R² score
2. Mean Absolute Error (MAE)
3. Root Mean Squared Error (RMSE)
4. Compare these values on the training set and on the test set. On which set does the model perform better?

*Hint:* from sklearn.metrics import mean_absolute_error
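
As a quick illustration of how the three metrics are called, here is a minimal sketch on made-up numbers. The arrays `y_true` and `y_hat` are illustrative and are not taken from the Diabetes data. RMSE is computed as the square root of MSE, which should work on any recent scikit-learn version.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values for illustration only (not from the Diabetes dataset)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

mae = mean_absolute_error(y_true, y_hat)           # mean of |error|
rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # square root of mean squared error
r2 = r2_score(y_true, y_hat)                       # 1 - SS_res / SS_tot

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R2:   {r2:.3f}")
```

MAE and RMSE are expressed in the units of the target, while R² is unitless, so the first two are easier to read as "typical error size".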

**1. R² score**

**2. Mean Absolute Error (MAE)**

**3. Root Mean Squared Error (RMSE)**

**4. Compare these values on the training set and on the test set. On which set does the model perform better?**

### Exercise 3: Plot again the predictions vs targets

Are the results better this time?

### Exercise 4: Inspect the Residuals

The residuals are defined as the difference between the true values and the predicted values. 

1. Visualize the residuals in a histogram.
2. Plot a scatterplot of the residuals vs the predicted values.

Are the residuals centered around 0? Is there any pattern (for example heteroscedasticity, where the spread of the residuals changes with the predicted value)?
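
To make the definition concrete, here is a small sketch on synthetic data (the names `X_demo`, `y_demo` and `demo_model` are hypothetical and chosen to avoid overwriting the notebook's variables). It computes residuals as true minus predicted and checks two numerical properties that hold for ordinary least squares with an intercept when evaluated on the training data. On a held-out test set these properties are only approximate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a linear signal plus constant-variance noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

demo_model = LinearRegression().fit(X_demo, y_demo)
y_demo_pred = demo_model.predict(X_demo)

# Residual = true value - predicted value
residuals = y_demo - y_demo_pred

mean_residual = residuals.mean()
corr_with_pred = np.corrcoef(y_demo_pred, residuals)[0, 1]

print(f"Mean residual: {mean_residual:.6f}")
print(f"Correlation between predictions and residuals: {corr_with_pred:.6f}")
```

Because both numbers are essentially zero by construction on training data, a histogram and a scatterplot remain the more informative checks: they reveal skew, outliers, or a funnel shape that these summaries cannot show.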

**1. Visualize the residuals in a histogram.**

**2. Plot a scatterplot of the residuals vs the predicted values.**

### Exercise 5: Try Ridge Regression

Ridge regression adds an L2 penalty on the size of the coefficients, which shrinks them toward zero and can reduce overfitting.

1. Import and train a Ridge() model.
2. Look at the coefficients this time. How do they compare to the Linear Regression?
3. Compare its R² score with LinearRegression.

*Hint:* from sklearn.linear_model import Ridge
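
For reference, the objective that Ridge minimizes can be written as follows, where $w$ denotes the coefficients, $X$ the feature matrix, $y$ the targets, and $\alpha \ge 0$ the regularization strength (the alpha parameter of Ridge()). This is a sketch of the standard formulation; the intercept is not penalized.

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \, \lVert w \rVert_2^2
$$

With $\alpha = 0$ this reduces to ordinary least squares. Larger values of $\alpha$ pull the coefficients closer to zero.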

**1. Import and train a Ridge() model.**

**2. Look at the coefficients this time. How do they compare to the Linear Regression?**

**3. Compare its R² score with LinearRegression.**

### Exercise 6: Hyperparameter Tuning for Ridge

Use a small loop to see how alpha affects regularization and performance. What would be the optimal alpha range?
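
One possible shape for such a loop, sketched on synthetic data rather than on the Diabetes dataset. The variables `X_demo`, `true_coef`, `Xd_train` and the alpha grid are all assumptions made for illustration. The loop records the size of the coefficient vector together with train and test R² for each alpha.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression problem (hypothetical, for illustration only)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(120, 8))
true_coef = np.array([5.0, -3.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=2.0, size=120)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# A logarithmic grid is a common starting point for alpha
alphas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
coef_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(Xd_train, yd_train)
    coef_norms.append(np.linalg.norm(ridge.coef_))
    print(
        f"alpha={alpha:>7}: |coef|={coef_norms[-1]:.3f}, "
        f"train R2={ridge.score(Xd_train, yd_train):.3f}, "
        f"test R2={ridge.score(Xd_test, yd_test):.3f}"
    )
```

The coefficient norm decreases as alpha grows. A reasonable alpha range is where test R² stays close to its maximum while the coefficients are visibly smaller than in the unregularized fit.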

### Exercise 7: Experiment with Lasso

Try out Lasso regression and compare it with the previous models. Lasso uses an L1 penalty, which can set some coefficients exactly to zero.

### Exercise 8: Feature Importance

Sort and plot the regression coefficients to see which features contribute most.

*Hint:* Use sns.barplot() with the coefficients DataFrame.

## Additional Exercises

### Exercise 9: Polynomial Regression

Use PolynomialFeatures(degree=2) to add squared and pairwise interaction terms and see if performance improves.

### Exercise 10: Feature Scaling Comparison

Compare model performance with and without StandardScaler.
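
A possible way to set up this comparison is sketched below on synthetic features with very different scales (all names and values are assumptions for illustration). It uses `make_pipeline` so the scaler is fitted together with the model. It also illustrates a point worth keeping in mind: plain least squares gives the same predictions with or without scaling, whereas a penalized model such as Ridge does not.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(100, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = X_demo @ np.array([2.0, 0.05, 300.0]) + rng.normal(size=100)

# Ordinary least squares: scaling changes the coefficients but not the predictions
ols_raw = LinearRegression().fit(X_demo, y_demo)
ols_scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)
same_ols = np.allclose(ols_raw.predict(X_demo), ols_scaled.predict(X_demo))

# Ridge: the penalty depends on coefficient size, so feature scale matters
ridge_raw = Ridge(alpha=1.0).fit(X_demo, y_demo)
ridge_scaled = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_demo, y_demo)
same_ridge = np.allclose(ridge_raw.predict(X_demo), ridge_scaled.predict(X_demo))

print(f"OLS predictions identical:   {same_ols}")
print(f"Ridge predictions identical: {same_ridge}")
```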

## Part 2: Classification with the Breast Cancer Dataset

We’ll use the **Breast Cancer dataset** included in scikit-learn.  
This dataset contains **diagnostic features** of breast tissue (e.g., radius, texture, symmetry), computed from digitized images of fine needle aspirates (FNA) of breast masses.

The task is to predict whether a tumor is **malignant (0)** or **benign (1)**. Note that in scikit-learn's encoding of this dataset, label 1 is the benign class.

In [None]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np

# Load the dataset
cancer = load_breast_cancer()

# Create DataFrame
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

df.head()


In [None]:
df.info()
df.describe().T.head()

Let's check the distribution of the targets:

In [None]:
df['target'].value_counts()

### Split into Training and Test Sets

We’ll use an 80/20 split for training and testing.

This time we use *stratify=y* to maintain the same class proportion in both train and test sets.
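
The effect of stratification is easiest to see on an imbalanced toy example. The sketch below uses made-up labels (90 zeros and 10 ones, stored in the hypothetical arrays `X_demo` and `y_demo`) and checks that both splits keep the same share of the minority class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90% class 0, 10% class 1
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(f"Share of class 1 in train: {y_tr.mean():.2f}")
print(f"Share of class 1 in test:  {y_te.mean():.2f}")
```

Without `stratify`, a random split of a small or imbalanced dataset can leave the test set with too few samples of the minority class, which makes the evaluation metrics less reliable.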

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training samples: {X_train.shape[0]}, Test samples: {X_test.shape[0]}")

### Train a Logistic Regression Classifier

Logistic Regression models the **probability** of belonging to a class.
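
One way to see where the probability comes from: for a binary problem, the model computes a linear score and passes it through the sigmoid function. The sketch below checks this on synthetic data, assuming scikit-learn's default LogisticRegression settings; `X_demo`, `y_demo` and `clf` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

z = clf.decision_function(X_demo)            # linear score: w . x + b
p_manual = 1.0 / (1.0 + np.exp(-z))          # sigmoid of the score
p_sklearn = clf.predict_proba(X_demo)[:, 1]  # probability of class 1

print(np.allclose(p_manual, p_sklearn))
```

The predicted class follows from the same score: class 1 is predicted when the score is positive, which corresponds to a probability above 0.5.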

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=5000, solver='liblinear')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # probability for class 1 (benign)

In [None]:
coef_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]
}).sort_values(by='Coefficient', ascending=False)

coef_df.head(10)

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=cancer.target_names)
disp.plot(cmap='Blues')
plt.title("Confusion Matrix")
plt.show()


### Exercise 1

1. Use a k-nearest neighbors (KNN) classifier to perform the classification.
2. Plot the confusion matrix and compare it with the one obtained for the logistic regression.
3. (Optional) Try scaling features before fitting.

### Exercise 2

1. Do the same using Decision Trees. Important: Setting max_depth helps prevent overfitting.
2. Plot feature importance.
3. Decision Trees are intuitive and interpretable: they learn “if–then” rules. Visualize the tree and interpret it.

**1. Do the same using Decision Trees. Important: Setting max_depth helps prevent overfitting.**

**2. Plot feature importance.**

**3. Decision Trees are intuitive and interpretable: they learn “if–then” rules. Visualize the tree and interpret it.**

### Exercise 3

1. Do the same using Random Forest.
2. Plot feature importance.
3. Compare with the results obtained from the Decision Tree. What are the main differences?

### Exercise 4: Hyperparameter Search

1. Try different max_depth values for the Decision Tree model. Which value works best?
2. Try different numbers of trees for the Random Forest model. Which number works best?
3. Is overfitting noticeable?

*Hint:* GridSearchCV may be convenient here (from sklearn.model_selection import GridSearchCV).
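
A minimal sketch of how GridSearchCV can be used, shown on synthetic data so as not to give away the exercise. The grid values, the 5-fold setting and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate depths to try; None lets the tree grow until leaves are pure
param_grid = {"max_depth": [1, 2, 3, 5, 8, None]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation on the data passed to fit
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

In the exercise, fitting the search on the training set only and keeping the test set for the final check avoids tuning the hyperparameters on the data used for evaluation.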