<a href="https://colab.research.google.com/github/antndlcrx/Intro-to-Python-DPIR/blob/main/Week%206/W6_sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://cdn.githubraw.com/antndlcrx/Intro-to-Python-DPIR/main/images/logo_dpir.png?raw=true:,  width=35" alt="My Image" width=175>  

# Scikit Learn for Machine Learning

## **1**.&nbsp; Why Scikit-Learn?

<img src="https://cdn.githubraw.com/antndlcrx/Intro-to-Python-DPIR/main/images/W6/Scikit_learn_logo_small.png?raw=true:,  width=25" alt="My Image" width=175>

Scikit-learn is one of the most widely used Python libraries for machine learning. It offers a clean, consistent API that simplifies tasks like data preprocessing, model selection, and evaluation. Scikit-learn supports a broad range of algorithms—classification, regression, clustering, and more—and integrates seamlessly with libraries like NumPy and pandas. Last but not least, it is a very well-documented and well-maintained library.

See:
- [Scikit-learn User Guide](https://scikit-learn.org/stable/user_guide.html#user-guide).
- [Examples](https://scikit-learn.org/stable/auto_examples/index.html).



### A Reminder on Machine Learning

[**Supervised Learning**](https://scikit-learn.org/stable/supervised_learning.html)  
Supervised learning algorithms learn from labeled data where each sample has a known target output. They seek to predict or categorize new data based on patterns learned from these labels.  
- **Classification** focuses on predicting discrete categories or classes (e.g., spam vs. not spam).  
- **Regression** predicts continuous values (e.g., house prices).

[**Unsupervised Learning**](https://scikit-learn.org/stable/unsupervised_learning.html)

Unsupervised learning deals with unlabeled data. The algorithms aim to discover hidden structures or patterns without predefined targets.  
- **Clustering** groups similar samples together (e.g., grouping customers by purchasing behavior).  
- **Dimensionality Reduction** simplifies data by reducing its number of features while retaining important information (e.g., projecting data for visualisation in fewer dimensions).

**Semi-Supervised Learning**  
Semi-supervised learning combines both labeled and unlabeled data. The idea is to leverage a small amount of labeled data alongside larger amounts of unlabeled data to improve learning accuracy or reveal additional patterns when obtaining labels for every data point is costly or time-consuming.


### Data Representation Glossary in Scikit-Learn

In [1]:
!git clone https://github.com/antndlcrx/Intro-to-Python-DPIR.git

Cloning into 'Intro-to-Python-DPIR'...
remote: Enumerating objects: 124, done.
remote: Counting objects: 100% (124/124), done.
remote: Compressing objects: 100% (107/107), done.
remote: Total 124 (delta 42), reused 58 (delta 10), pack-reused 0 (from 0)
Receiving objects: 100% (124/124), 2.97 MiB | 9.93 MiB/s, done.
Resolving deltas: 100% (42/42), done.


In [2]:
import pandas as pd
import numpy as np
import seaborn as sns

In [3]:
qog_link = '/content/Intro-to-Python-DPIR/datasets/qog2022.csv'
qog = pd.read_csv(qog_link)

In ML, people often use slightly different terminology from what you may be used to in statistics. Variables are called **features** (the columns of your dataframe, although the term also applies beyond 2D tabular data). An **example** or **sample** is an individual data point (a row of your dataset). The table containing the values of all features for all examples is the **feature matrix**.
The variable we are predicting is called the **target** (your dependent variable).
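
To make the terminology concrete, here is a minimal sketch on a small made-up table. The column names and values are hypothetical and are not taken from the QoG data.

```python
import pandas as pd

# Hypothetical toy data: three examples (rows) and three columns
toy = pd.DataFrame({
    "gdp": [1.2, 3.4, 2.2],
    "literacy": [0.71, 0.95, 0.88],
    "life_exp": [61.0, 79.5, 72.3],
})

# Feature matrix: every column except the one we want to predict
X_toy = toy.drop("life_exp", axis=1)
# Target vector: the column we want to predict
y_toy = toy["life_exp"]

print(X_toy.shape)  # (n_samples, n_features)
print(y_toy.shape)  # (n_samples,)
```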






In [None]:
# make pairplot
sns.pairplot(qog);

## **2**.&nbsp; **Estimator API**

Scikit-learn provides a unified interface called the Estimator API. At its core, every algorithm in Scikit-learn is implemented as a class with a consistent set of methods—primarily `fit()`, `predict()`, and for some models, `transform()`. Here's the typical workflow:


1. Choose a Model Class and Import It.
2. Instantiate the Model with Desired Hyperparameters.
3. Arrange Data into a Feature Matrix (X) and a Target Vector (y).
    - X is usually a 2d array of shape (`n_samples`, `n_features`)
4. Fit the Model to Your Training Data.
5. Apply the Model to New Data.
    - predict for supervised learning
    - transform or predict for unsupervised

The Estimator API gives you consistency, ease of use, and clarity of code.
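
The five steps can be sketched on synthetic data as follows. The data and the variable names (`X_demo`, `y_demo`, `new_points`) are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression    # step 1: choose a model class

model = LinearRegression()                            # step 2: instantiate the model

# step 3: arrange data; here a synthetic, noise-free line y = 2 * x + 1
X_demo = np.arange(10, dtype=float).reshape(-1, 1)    # shape (n_samples, n_features)
y_demo = 2 * X_demo[:, 0] + 1

model.fit(X_demo, y_demo)                             # step 4: fit to training data

new_points = np.array([[10.0], [11.0]])
predictions = model.predict(new_points)               # step 5: apply to new data
print(predictions)
```

Swapping `LinearRegression` for another regressor would change only the import and the instantiation line; the `fit` and `predict` calls stay the same.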


## **3**.&nbsp; **Regression**

In [None]:
#@title Implementation Example

# 1 choose model
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# instantiate model
# mod = LinearRegression()
mod = KNeighborsRegressor(n_neighbors=3)

# arrange data
X = qog.drop(["hdi", "country", "region", "iso3c", "fh_status"], axis=1)
y = qog["hdi"]

X = X.fillna(X.mean())
# X = X.values.reshape(-1,1)
y = y.fillna(y.mean())

# fit model
mod.fit(X, y)
# predict
preds = mod.predict(X)

# eval
sns.scatterplot(x=preds, y=y);

# coeffs
# mod.coef_
# mod.intercept_

This fit command causes a number of model-dependent internal computations to take place, and the results of these computations are stored in model-specific attributes that the user can explore. In Scikit-Learn, by convention all model parameters that were learned during the fit process have trailing underscores.
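
As a small illustration of this naming convention, the sketch below fits a linear regression to synthetic data (an assumption for the example) and inspects the learned attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: y = 3.0 * x0 - 1.5 * x1 + 0.5
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + 0.5

# fit_intercept is a hyperparameter: set by the user, no trailing underscore
lin = LinearRegression(fit_intercept=True)

# Learned attributes do not exist until fit() has been called
has_coef_before_fit = hasattr(lin, "coef_")
print(has_coef_before_fit)

lin.fit(X_demo, y_demo)
print(lin.coef_)       # learned slopes, close to [3.0, -1.5]
print(lin.intercept_)  # learned intercept, close to 0.5
```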

Scikit-learn does not give you uncertainty estimates for model parameters. This is because the focus of the library is on prediction: interpreting model parameters is much more a statistical modelling question than a machine learning question. If you need statistical modelling functionality, refer to the [statsmodels library](https://www.statsmodels.org/stable/index.html).

## **4**.&nbsp; **Data Splitting**

A critical step in building and assessing machine learning models is to split your available dataset into different subsets for training and evaluation. Typically, you create a training set (used to fit the model) and a test set (used to evaluate how well the model generalises to unseen data). This approach helps you detect overfitting: if the model performs well on training data but poorly on the test set, it suggests your model has memorised the training data rather than learning generalisable patterns.

Scikit-learn provides a convenient utility function called `train_test_split` to help you partition your data in one line of code.
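
The sketch below shows how a split can reveal overfitting. It uses synthetic data and a 1-nearest-neighbour regressor, both chosen here purely for illustration: such a model reproduces its training targets exactly, so only the test score says anything about generalisation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: a noisy sine curve
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(X_tr.shape, X_te.shape)  # 150 training rows, 50 test rows

# With one neighbour, each training point is its own nearest neighbour
knn1 = KNeighborsRegressor(n_neighbors=1).fit(X_tr, y_tr)
train_r2 = knn1.score(X_tr, y_tr)  # score() returns R^2 for regressors
test_r2 = knn1.score(X_te, y_te)
print("train R^2:", train_r2)
print("test R^2:", test_r2)
```

A large gap between the two scores is the warning sign described above.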

In [None]:
#@title Sample Split Example

from sklearn.model_selection import train_test_split

# instantiate model
mod = LinearRegression()
# mod = KNeighborsRegressor(n_neighbors=2)

# arrange data
X = qog.drop(["hdi", "country", "region", "iso3c", "fh_status"], axis=1)
y = qog["hdi"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

X_train = X_train.fillna(X_train.mean())
# X = X.values.reshape(-1,1)
y_train = y_train.fillna(y_train.mean())

X_test = X_test.fillna(X_test.mean())
y_test = y_test.fillna(y_test.mean())


# fit model
mod.fit(X_train, y_train)
# predict
preds = mod.predict(X_test)

# eval
sns.scatterplot(x=preds, y=y_test);

## **5**.&nbsp; **Performance Metrics**

Evaluating your model with the right metric is crucial. Different tasks (classification vs. regression) and different data characteristics (imbalanced classes, outliers, etc.) often require different metrics. Scikit-learn provides a variety of metrics to help you assess model performance in a consistent manner.

The most commonly used metrics are summarised below, each with its formula, interpretation, and example syntax to get you started quickly.

**Regression Metrics**


**[Mean Squared Error (MSE)](https://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-error)**
$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
- **Interpretation**: Penalises large errors more heavily than small ones (due to squaring).
- **Code Example**:
  ```python
  from sklearn.metrics import mean_squared_error
  mse = mean_squared_error(y_true, y_pred)
  ```

**[Mean Absolute Error (MAE)](https://scikit-learn.org/stable/modules/model_evaluation.html#mean-absolute-error)**
$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left|y_i - \hat{y}_i\right|
$$
- **Interpretation**: Provides a direct measure of how far predictions deviate from actual values on average; more robust to outliers than MSE.
- **Code Example**:
  ```python
  from sklearn.metrics import mean_absolute_error
  mae = mean_absolute_error(y_true, y_pred)
  ```

**[$R^2$ Score](https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score)**
$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$
- **Interpretation**: Proportion of variance in $y$ explained by the model. A value of 1 is a perfect fit, 0 matches a model that always predicts the mean of $y$, and negative values mean the model does worse than that constant prediction.
- **Code Example**:
  ```python
  from sklearn.metrics import r2_score
  r2 = r2_score(y_true, y_pred)
  ```
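
To connect the formulas to the functions, the following sketch computes the three regression metrics by hand on four made-up values and checks them against scikit-learn. The numbers are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

errors = y_true - y_pred                          # [1, 0, -1, -3]
mse_manual = np.mean(errors ** 2)                 # (1 + 0 + 1 + 9) / 4 = 2.75
mae_manual = np.mean(np.abs(errors))              # (1 + 0 + 1 + 3) / 4 = 1.25

ss_res = np.sum(errors ** 2)                      # residual sum of squares = 11
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares = 20
r2_manual = 1 - ss_res / ss_tot                   # 1 - 11 / 20 = 0.45

print(mse_manual, mean_squared_error(y_true, y_pred))
print(mae_manual, mean_absolute_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

The single error of 3 contributes 9 of the 11 squared-error units but only 3 of the 5 absolute-error units, which is why MSE reacts more strongly to large misses.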





**Classification Metrics**

**[Accuracy](https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score)**

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

- **Interpretation**: Proportion of samples correctly predicted. Good for balanced datasets.
- **Code Example**:
  ```python
  from sklearn.metrics import accuracy_score
  acc = accuracy_score(y_true, y_pred)
  ```


**[Precision](https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics)**
$$
\text{Precision} = \frac{TP}{TP + FP}
$$
- **Interpretation**: Of all predicted positives, how many are actually positive? Useful when false positives are costly (e.g., spam detection).
- **Code Example**:
  ```python
  from sklearn.metrics import precision_score
  prec = precision_score(y_true, y_pred, average='binary')
  ```


**[Recall](https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics)**
$$
\text{Recall} = \frac{TP}{TP + FN}
$$
- **Interpretation**: Of all actual positives, how many did we correctly identify? Important in scenarios where missing positives is costly (e.g., disease screening).
- **Code Example**:
  ```python
  from sklearn.metrics import recall_score
  rec = recall_score(y_true, y_pred, average='binary')
  ```

**[F1 Score](https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics)**
$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$
- **Interpretation**: Harmonic mean of Precision and Recall, balancing both in one metric. Often used for imbalanced classification.
- **Code Example**:
  ```python
  from sklearn.metrics import f1_score
  f1 = f1_score(y_true, y_pred, average='binary')
  ```

> **Note**: For multi-class problems, specify `average='macro'`, `average='weighted'`, etc.
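
As a worked example, the sketch below derives precision, recall, and F1 from the confusion-matrix counts for ten made-up binary labels and compares the results with scikit-learn's functions.

```python
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, f1_score
)

# Hypothetical labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels 0/1, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_manual = tp / (tp + fp)   # 2 / 3
recall_manual = tp / (tp + fn)      # 2 / 4
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(precision_manual, precision_score(y_true, y_pred))
print(recall_manual, recall_score(y_true, y_pred))
print(f1_manual, f1_score(y_true, y_pred))
```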




**Key Takeaways**

- **Pick Metrics Wisely**: Choose metrics relevant to your problem and data characteristics. For example, use precision/recall/F1 for imbalanced classification tasks, or MAE if you want to be less sensitive to outliers in a regression context.  

- **Interpretation**: Always interpret metrics in the context of your domain. A 90% accuracy can be misleading if your classes are highly imbalanced.  

- **Compare Multiple Metrics**: Using more than one metric (e.g., accuracy + F1 score) often gives a more complete picture of model performance.  
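
The point about imbalanced classes can be illustrated with a hypothetical dataset of 90 negatives and 10 positives and a "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Hypothetical imbalanced labels: 90 negatives followed by 10 positives
y_true = np.array([0] * 90 + [1] * 10)
# A trivial baseline that never predicts the positive class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 sets undefined scores (no predicted positives) to 0 without a warning
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print("accuracy:", acc)   # 0.9, although no positive case is ever found
print("recall:", rec)     # 0.0
print("F1:", f1)          # 0.0
```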




In [None]:
#@title Metrics Example

from sklearn.metrics import mean_squared_error, r2_score

# pick and inst model
# mod = KNeighborsRegressor(n_neighbors=5)
mod = LinearRegression()

# data prep
y = qog["hdi"]
X = qog.drop(["country", "region", "iso3c", "fh_status", "hdi"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42,
                                                    test_size=0.3)

X_train = X_train.fillna(X_train.mean())
X_test = X_test.fillna(X_test.mean())
y_train = y_train.fillna(y_train.mean())
y_test = y_test.fillna(y_test.mean())

mod.fit(X_train, y_train)

preds = mod.predict(X_test)
print(mean_squared_error(y_test, preds),
      r2_score(y_test, preds))

In [None]:
#@title Exercise

# Use gdp_pc as a target, predict it using appropriate features from the data.
# explore how well your prediction is doing.

## **6**.&nbsp; **Data Preprocessing**



[Data preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html) is a crucial step in machine learning that ensures models work efficiently and produce reliable results. Raw data often contains inconsistencies such as missing values, different feature scales, or categorical variables that need to be transformed before training a model. Scikit-learn provides a variety of preprocessing tools to standardise, normalise, and encode data for better performance.

- **Scaling and Normalisation**:

    Many machine learning algorithms work better when numerical features are on similar scales. Scaling improves numerical stability and speeds up model convergence.

    - `StandardScaler`: Standardises features by removing the mean and scaling to unit variance. This changes the location and scale of each feature but not the shape of its distribution. See [docs](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).

    ```python
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    ```

    - `QuantileTransformer`: Transforms data to follow a uniform or normal distribution, useful when features contain outliers or are highly skewed. See [docs](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html#sklearn.preprocessing.QuantileTransformer).

    ```python
    from sklearn.preprocessing import QuantileTransformer
    transformer = QuantileTransformer(output_distribution='normal')
    X_transformed = transformer.fit_transform(X)
    ```

- **Encoding Categorical Data**:

    Many machine learning models cannot directly process categorical variables, so they must be converted into numerical representations. You might be familiar with this process as creating dummy variables.

    - `OneHotEncoder`: Converts categorical variables into binary columns, creating a separate column for each category. See [docs](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder).

    ```python
    from sklearn.preprocessing import OneHotEncoder
    encoder = OneHotEncoder()
    X_encoded = encoder.fit_transform(X_categorical)
    ```

- **Imputing Missing Data**

    Many real-world datasets contain missing values, which must be handled before training a model. One approach is to use an imputer to fill in missing values based on a chosen strategy. See [docs](https://scikit-learn.org/stable/api/sklearn.impute.html).

    - `SimpleImputer`:

        ```python
        from sklearn.impute import SimpleImputer
        imputer = SimpleImputer(strategy='mean')  # Replace missing values with the mean
        X_imputed = imputer.fit_transform(X)
        ```

        Common imputation strategies:
        - `'mean'` (default): replaces missing values with the column mean.
        - `'median'`: replaces missing values with the median, useful for skewed distributions.
        - `'most_frequent'`: replaces missing values with the most common value (mode).
        - `'constant'`: fills missing values with a specified constant.


- **Feature Engineering**:

    Feature engineering can improve model performance by transforming or creating new features.

    - `PolynomialFeatures`: Generates polynomial and interaction terms from existing numerical features, useful for capturing non-linear relationships.

        ```python
        from sklearn.preprocessing import PolynomialFeatures
        poly = PolynomialFeatures(degree=2, interaction_only=False)
        X_poly = poly.fit_transform(X)
        ```






    







In [None]:
#@title Data Scaling Example
from sklearn.preprocessing import StandardScaler, QuantileTransformer

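# original
sns.scatterplot(x="gdp_pc", y="hdi", data=qog);

# scaled (X_train and y_train come from the previous example)
scaler = StandardScaler()  # define the scaler before using it
X_ex = X_train['gdp_pc'].values.reshape(-1, 1)
X_scaled = scaler.fit_transform(X_ex)
sns.scatterplot(x=X_scaled[:, 0], y=y_train);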


scaler = QuantileTransformer()
mod = KNeighborsRegressor(n_neighbors=5)

# data prep
y = qog["hdi"]
X = qog.drop(["country", "region", "iso3c", "fh_status", "hdi"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42,
                                                    test_size=0.25)

X_train = X_train.fillna(X_train.mean())
X_train = scaler.fit_transform(X_train)

X_test = X_test.fillna(X_test.mean())
X_test = scaler.transform(X_test)

y_train = y_train.fillna(y_train.mean())
y_test = y_test.fillna(y_test.mean())

mod.fit(X_train, y_train)

preds = mod.predict(X_test)
print(mean_squared_error(y_test, preds),
      r2_score(y_test, preds))

In [None]:
#@title Exercise

# Try to improve your prediction from previous exercise by experimenting with
# different data scaling options!

# try including region (categorical variable) into the prediction.
# Refer to documentation and user-guide.

## **7**.&nbsp; **Pipeline**

The Pipeline in scikit-learn provides a structured way to automate the sequence of preprocessing steps and model training, ensuring consistency and preventing data leakage. Instead of manually applying transformations and then fitting a model separately, a pipeline chains multiple steps together, making the workflow cleaner and reproducible.

A pipeline is a sequence of named steps: one or more transformers (e.g., `StandardScaler`, `OneHotEncoder`, `PolynomialFeatures`) followed by a final estimator (e.g., `LogisticRegression`, `RandomForestClassifier`). Once defined, the entire pipeline can be treated like a single model: calling `fit()` fits every step in order, and calling `predict()` applies the fitted transformers before predicting.

Basic syntax is:



```python
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('step_name_1', transformer_1),
    ('step_name_2', transformer_2),
    ('model', estimator)
])
```

For more info, see [Pipeline Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). Also, see [ColumnTransformer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) for processing different column types.
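
The sketch below, on synthetic data chosen for illustration, shows why a pipeline helps against leakage: the scaler inside the pipeline learns its statistics from the training rows only, and the same fitted scaler is reused at prediction time.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends on the first of three features
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
])
demo_pipe.fit(X_tr, y_tr)  # every step is fitted on the training rows only

# Fitted steps can be inspected by name
fitted_scaler = demo_pipe.named_steps["scaler"]
print(fitted_scaler.mean_)           # equals the training-set column means
print(demo_pipe.score(X_te, y_te))   # R^2 on the held-out rows
```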

In [None]:
#@title Pipeline Example

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

X = qog.drop(["hdi", "country", "region", "iso3c", "fh_status"], axis=1)
y = qog["hdi"]


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

y_train = y_train.fillna(y_train.mean())
y_test = y_test.fillna(y_test.mean())


pipe = Pipeline(
    [("imputer", SimpleImputer(strategy="median")),
     ("scaler", StandardScaler()),
     ("knn", KNeighborsRegressor(n_neighbors=5))]
)

pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)

mean_squared_error(y_test, preds)


## with column transformer
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OneHotEncoder

# # keep the categorical columns this time so they can be encoded
# X = qog.drop(["hdi", "country", "iso3c"], axis=1)
# y = qog["hdi"]


# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# y_train = y_train.fillna(y_train.mean())
# y_test = y_test.fillna(y_test.mean())

# cat_cols = ["region", "fh_status"]
# num_cols = ["perc_wip", "gdp_pc", "corruption", "glob_index", "fh_polity"]

# # create separate pipes for each col type
# numeric_pipeline = Pipeline([
#     ("imputer", SimpleImputer(strategy="mean")),
#     ("scaler", StandardScaler())
# ])

# categorical_pipeline = Pipeline([
#     ("imputer", SimpleImputer(strategy="most_frequent")),
#     ("onehot", OneHotEncoder(handle_unknown="ignore"))
# ])

# # pull them into column transformer
# preprocessor = ColumnTransformer([
#     ("num", numeric_pipeline, num_cols),
#     ("cat", categorical_pipeline, cat_cols)
# ])

# # main pipe
# pipe = Pipeline([
#     ("preprocessor", preprocessor),
#     ("knn", KNeighborsRegressor(n_neighbors=5))
# ])

# pipe.fit(X_train, y_train)
# preds = pipe.predict(X_test)
# print("R^2:", r2_score(y_test, preds))


## **8**.&nbsp; **Cross Validation and Grid Search**

Machine learning models often have hyperparameters (e.g., `n_neighbors` in k-NN) that significantly impact performance. Instead of manually guessing these values, [Grid Search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) and [Cross-Validation](https://scikit-learn.org/stable/modules/cross_validation.html) help find the best combination automatically.

**Cross Validation**

Cross-validation (CV) helps evaluate model performance by splitting data into multiple training and validation subsets. A common approach is k-fold cross-validation, where the dataset is divided into k subsets (folds), and the model is trained k times, each time using a different fold as the validation set.
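
To see what the folds look like, the sketch below splits ten hypothetical samples with `KFold`. For a regressor, passing an integer such as `cv=5` to scikit-learn's cross-validation helpers uses this kind of unshuffled k-fold split.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # ten hypothetical samples

kf = KFold(n_splits=5)  # no shuffling, so folds are consecutive blocks
val_indices = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_demo)):
    print(f"fold {fold}: train={train_idx}, validation={val_idx}")
    val_indices.extend(val_idx.tolist())
```

Each sample lands in the validation set exactly once and in the training set in the other four rounds.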



In [None]:
from sklearn.model_selection import cross_val_score

X = qog.drop(["hdi", "country", "region", "iso3c", "fh_status"], axis=1)
y = qog["hdi"]


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

y_train = y_train.fillna(y_train.mean())
y_test = y_test.fillna(y_test.mean())


pipe = Pipeline(
    [("imputer", SimpleImputer(strategy="median")),
     ("scaler", StandardScaler()),
     ("knn", KNeighborsRegressor(n_neighbors=5))]
)

cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='r2')

print("Cross-Validation R^2 Scores:", cv_scores)
print("Mean R^2 Score:", cv_scores.mean())

Cross-Validation R^2 Scores: [0.6024366  0.76431027 0.63143364 0.78889832 0.64496714]
Mean R^2 Score: 0.6864091930582085


**Grid Search**

`GridSearchCV` from scikit-learn systematically tests multiple hyperparameter combinations, using cross-validation to evaluate each set of parameters, and selects the combination that yields the best performance according to a specified scoring metric. See [docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).




```python
from sklearn.model_selection import GridSearchCV

model_or_pipeline = Model()
param_grid = {
    "<step_or_model>__<parameter_name>": [value1, value2, ...],
    # add more parameters as needed
}

grid_search = GridSearchCV(
    estimator=model_or_pipeline,    # the model or pipeline
    param_grid=param_grid,       # the parameter combinations to try
    cv=...,                # how many cross-validation folds or a CV splitter
    scoring='...',          # metric to optimise (e.g., 'accuracy', 'r2')
    # other optional arguments like n_jobs, refit, etc.
)
```
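
The cost of a grid search grows quickly with the size of the grid. The sketch below uses `ParameterGrid` to enumerate a hypothetical grid with four values of one hyperparameter and two of another; the string labels for the scaler are placeholders used only for counting.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 4 values x 2 values
demo_grid = {
    "knn__n_neighbors": [3, 5, 7, 9],
    "scaler": ["standard", "none"],  # placeholder labels, for counting only
}

candidates = list(ParameterGrid(demo_grid))
n_folds = 5

print(len(candidates))            # 4 * 2 = 8 candidate settings
print(len(candidates) * n_folds)  # 40 fits during the search
for params in candidates[:2]:
    print(params)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best setting on the full training data.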



In [None]:
#@title Grid Search Example
from sklearn.model_selection import GridSearchCV

X = qog.drop(["hdi", "country", "region", "iso3c", "fh_status"], axis=1)
y = qog["hdi"]


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

y_train = y_train.fillna(y_train.mean())
y_test = y_test.fillna(y_test.mean())


pipe = Pipeline(
    [("imputer", SimpleImputer(strategy="median")),
     ("scaler", StandardScaler()),
     ("knn", KNeighborsRegressor(n_neighbors=5))]
)

# define param grid
param_grid = {
    'knn__n_neighbors': [3, 5, 7, 9],  # different k values for k-NN;
    'scaler': [StandardScaler(), None]  # scaling vs. no scaling
}

# do grid search
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best R^2 Score:", grid_search.best_score_)


# pd.DataFrame(grid_search.cv_results_)

## Classification

In [None]:
#@title Exercise: Implement Classification on your own!

# Predict "fh_status". For that, remove the continuous equivalent of fh_status first.
# Then, pick an appropriate estimator (refer to docs), preprocess data as you see fit.
# Use grid search to find the best performing configuration for your estimator.



## Homework

Read the description and download the [Credit Card Fraud Dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud). Your task is to predict the `Class` column, which records whether a transaction is an instance of credit card fraud or a legitimate transaction.

You need to implement data preprocessing and decide which estimator to use and which metric(s) to evaluate performance on. How you organise the pipeline (including sample splitting and grid search over hyperparameters) is entirely up to you.







