![](img/330-banner.png)

# Lecture 14: Midterm review

UBC 2022-23

Instructor: Varada Kolhatkar

## Imports

In [3]:
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import numpy.random as npr
import pandas as pd
from sklearn.compose import (
    ColumnTransformer,
    TransformedTargetRegressor,
    make_column_transformer,
)
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, RidgeCV
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor

## Terminologies

- **Supervised learning** (e.g., Gmail spam filtering)
    - Training a model from input data and its corresponding targets to predict targets for new examples.
    - **Classification vs. Regression**
        - **Classification problem**: predicting among two or more discrete classes
            - Example1: Predict whether a patient has a liver disease or not
            - Example2: Predict whether a student would get an A+ or not in quiz2.  
        - **Regression problem**: predicting a continuous value
            - Example1: Predict housing prices 
            - Example2: Predict a student's score in quiz2.
- **Unsupervised learning** ([Google News](https://news.google.com/))
    - Training a model to find patterns in a dataset, typically an unlabeled dataset.
    - training data consists of observations ($X$) **without any corresponding targets**
- Supervised machine learning is about function approximation, i.e., finding the mapping function between `X` and `y` whereas unsupervised machine learning is about concisely describing the data.   


- **Features** 
: columns of input data. Features are relevant characteristics of the problem, usually suggested by experts. Features are typically denoted by $X$ and the number of features is usually denoted by $d$.  

- **Target**
: Target is the feature we want to predict (typically denoted by $y$). 

- **Example** 
: A row of feature values. When people refer to an example, it may or may not include the target corresponding to the feature values, depending upon the context. The number of examples is usually denoted by $n$. 

- **Training**
: The process of learning the mapping between the features ($X$) and the target ($y$). Fitting.

### `DummyClassifier` 

- `sklearn`'s baseline model for classification 
- **Baseline**
: A simple machine learning algorithm based on simple rules of thumb. 

    - For example, most frequent baseline always predicts the most frequent label in the training set. 
    - Baselines provide a way to sanity check your machine learning model.  
    - Baselines serve as reference points in ML workflow. 
### Steps to train a classifier using `sklearn` 

1. Read the data
2. Create $X$ and $y$
3. Create a classifier object
4. `fit` the classifier
5. `predict` on new examples
6. `score` the model
    - `score` gives the **accuracy** of the model, i.e., proportion of correctly predicted targets. 
    - The **error** is usually reported as $1 - \text{accuracy}$.
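
The steps above can be sketched with a most-frequent baseline. This is a minimal illustration on a hypothetical toy dataframe (the column values below are made up for the example and are not the course data), so treat it as a sketch rather than the lecture's own code.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Step 1 (stand-in for reading data): a hypothetical toy dataframe
toy_df = pd.DataFrame(
    {
        "lab1": [92, 94, 78, 91, 77, 70],
        "quiz1": [92, 91, 80, 89, 85, 75],
        "quiz2": ["A+", "not A+", "not A+", "A+", "not A+", "not A+"],
    }
)

# Step 2: create X and y
X_toy = toy_df.drop(columns=["quiz2"])
y_toy = toy_df["quiz2"]

# Steps 3 and 4: create the classifier object and fit it
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_toy, y_toy)

# Step 5: predict (this baseline ignores the features entirely)
preds = dummy.predict(X_toy)

# Step 6: score returns accuracy; error is 1 - accuracy
accuracy = dummy.score(X_toy, y_toy)
error = 1 - accuracy
print(preds, accuracy, error)
```

Any model worth keeping should beat this accuracy on validation data.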

### [`DummyRegressor`](https://scikit-learn.org/0.15/modules/generated/sklearn.dummy.DummyRegressor.html)
For regression problems, the baseline is `DummyRegressor`, which predicts the mean or median of the training targets, or a user-specified constant, for all examples. 

### `DecisionTree` 
- Decision trees are models that make predictions by sequentially looking at features and checking whether they are above/below a threshold.
- They learn a hierarchy of if/else questions, similar to questions you might ask in a 20-questions game.       
- They learn axis-aligned decision boundaries (vertical and horizontal lines with 2 features).    
- One way to control the complexity of decision tree models is by using the depth hyperparameter (`max_depth` in `sklearn`). 
- **Predict**
    - Start at the top of the tree. Ask binary questions at each node and follow the appropriate path in the tree. Once you are at a leaf node, you have the prediction. 
    - Note that the model only considers the features which are in the learned tree and ignores all other features. 
- **fit**
    - Each node either represents a question or an answer. The terminal nodes (called leaf nodes) represent answers. 
    - Which features are most useful for classification? 
    - Minimize **impurity** at each question
    - Common criteria to minimize impurity: [gini index](https://scikit-learn.org/stable/modules/tree.html#classification-criteria), information gain, cross entropy
### Decision tree for regression problems

- We can also use decision tree algorithm for regression. 
- Instead of gini, we use [some other criteria](https://scikit-learn.org/stable/modules/tree.html#mathematical-formulation) for splitting. A common one is mean squared error (MSE). (More on this in later videos.)
- `scikit-learn` supports regression using decision trees with `DecisionTreeRegressor` 
    - `fit` and `predict` paradigms similar to classification
    - `score` returns something called [$R^2$ score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score).     
        - The maximum $R^2$ is 1 for perfect predictions. 
        - It can be negative which is very bad (worse than `DummyRegressor`). 
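
To make the $R^2$ discussion concrete, here is a small sketch comparing `DummyRegressor` and `DecisionTreeRegressor`. The data is synthetic and generated only for this illustration (the names `X_reg` and `y_reg` are hypothetical).

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy linear relationship (assumption for illustration)
rng = np.random.RandomState(123)
X_reg = rng.uniform(0, 10, size=(100, 1))
y_reg = 3 * X_reg[:, 0] + rng.normal(0, 1, size=100)

# Baseline: always predicts the mean of the training targets
dummy_reg = DummyRegressor(strategy="mean").fit(X_reg, y_reg)

# Decision tree regressor with limited depth
tree_reg = DecisionTreeRegressor(max_depth=3, random_state=123).fit(X_reg, y_reg)

# For regressors, score returns R^2 rather than accuracy
dummy_r2 = dummy_reg.score(X_reg, y_reg)
tree_r2 = tree_reg.score(X_reg, y_reg)
print(f"Dummy R^2: {dummy_r2:.3f}, tree R^2: {tree_r2:.3f}")
```

On the data it was fit on, the mean baseline gets an $R^2$ of 0, which is why a negative $R^2$ means "worse than `DummyRegressor`".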
### Parameters 

- The decision tree algorithm primarily learns two things: 
    - the best feature to split on
    - the threshold for the feature to split on at each node
- These are called **parameters** of the decision tree model.  

**Parameters**
: When you call `fit`, a bunch of values get set, like the features to split on and split thresholds. These are called **parameters**. These are learned by the algorithm from the data during training. We need them during prediction time. 

**Hyperparameters**
: Even before calling `fit` on a specific data set, we can set some "knobs" that control the learning. These are called **hyperparameters**. These are specified based on: expert knowledge, heuristics, or systematic/automated optimization (more on this in the coming lectures).  

### Max-depth is a **hyperparameter** of `DecisionTreeClassifier`. 
- With the default setting, the nodes are expanded until all leaves are "pure". 
- The decision tree is creating very specific rules, based on just one example from the data. 
- Is it possible to control the learning in any way? 
    - Yes! One way to do it is by controlling the **depth** of the tree, which is the length of the longest path from the tree root to a leaf. 
- **Decision stump**
: A decision tree with only one split (depth=1) is called a **decision stump**. 
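
The difference between parameters and hyperparameters can be seen in a short sketch. The dataset here is synthetic (an assumption for illustration); `max_depth` is the hyperparameter we set, while the split feature and threshold are parameters learned during `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (not the course data)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=123)

# Hyperparameter set before fit: a decision stump vs. an unrestricted tree
stump = DecisionTreeClassifier(max_depth=1, random_state=123).fit(X_syn, y_syn)
full_tree = DecisionTreeClassifier(max_depth=None, random_state=123).fit(X_syn, y_syn)

# Parameters learned during fit: feature index and threshold at the root node
root_feature = stump.tree_.feature[0]
root_threshold = stump.tree_.threshold[0]

stump_depth = stump.get_depth()
full_depth = full_tree.get_depth()
full_train_acc = full_tree.score(X_syn, y_syn)
print(root_feature, root_threshold, stump_depth, full_depth, full_train_acc)
```

With the default `max_depth=None`, the tree keeps splitting until the leaves are pure, so it fits this training set perfectly.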

### `k-nearest neighbours`
- Given a new data point, predict the class of the data point by finding the "closest" data point in the training set, i.e., by finding its "nearest neighbour" or majority vote of nearest neighbours. 
- In `sklearn`: `knn = KNeighborsClassifier(n_neighbors=k)`
- `weights` $\rightarrow$ When predicting label, you can assign higher weight to the examples which are closer to the query example.  
- For regular $k$-NN for supervised learning (not with sparse matrices), you should scale your features.
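
A minimal sketch of $k$-NN with scaling and distance-based weights follows. The data is synthetic and the particular values (`n_neighbors=5`, `weights="distance"`) are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, split into train and test portions
X_syn, y_syn = make_classification(n_samples=300, n_features=4, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=123
)

# Scale first, then k-NN; weights="distance" gives closer neighbours more say
knn_pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, weights="distance"),
)
knn_pipe.fit(X_tr, y_tr)

test_preds = knn_pipe.predict(X_te)
test_acc = knn_pipe.score(X_te, y_te)
print(test_acc)
```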
### Pros of $k$-NNs for supervised learning

- Easy to understand, interpret.
- Simple hyperparameter $k$ (`n_neighbors`) controlling the fundamental tradeoff.
- Can learn very complex functions given enough data.
- Lazy learning: Takes no time to `fit`
### Cons of $k$-NNs for supervised learning
- Can potentially be VERY slow during prediction time, especially when the training set is very large. 
- Often not that great test accuracy compared to the modern approaches.
- It does not work well on datasets with many features or where most feature values are 0 most of the time (sparse datasets).   
### Curse of dimensionality

- Affects all learners but especially bad for nearest-neighbour. 
- $k$-NN usually works well when the number of dimensions $d$ is small but things fall apart quickly as $d$ goes up.
- If there are many irrelevant attributes, $k$-NN is hopelessly confused because all of them contribute to finding similarity between examples. 
- With enough irrelevant attributes the accidental similarity swamps out meaningful similarity and $k$-NN is no better than random guessing. 

### `Support Vector Machines (SVMs) with RBF kernel`
- Another popular similarity-based algorithm is Support Vector Machines with RBF Kernel (SVM RBFs)
- Superficially, SVM RBFs are more like weighted $k$-NNs.
    - The decision boundary is defined by **a set of positive and negative examples** and **their weights** together with **their similarity measure**. 
    - A test example is labeled positive if on average it looks more like positive examples than the negative examples. 
<br>
- The primary difference between $k$-NNs and SVM RBFs is that 
    - Unlike $k$-NNs, SVM RBFs only remember the key examples (support vectors). 
    - SVMs use a different similarity metric which is called a "kernel". A popular kernel is Radial Basis Functions (RBFs)
    - They usually perform better than $k$-NNs! 
- We can think of SVM with RBF kernel as "smooth KNN". 
### Support vectors 

- Each training example either is or isn't a "support vector".
  - This gets decided during `fit`.

- **Main insight: the decision boundary only depends on the support vectors.**
### Hyperparameters of SVM 

- Key hyperparameters of `rbf` SVM are
    - `gamma`
        - #### Relation of `gamma` and the fundamental trade-off
            - `gamma` controls the complexity (fundamental trade-off), just like other hyperparameters we've seen.
              - larger `gamma` $\rightarrow$ more complex
              - smaller `gamma` $\rightarrow$ less complex
    - `C`
        - #### Relation of `C` and the fundamental trade-off
            - `C` _also_ affects the fundamental tradeoff
                - larger `C` $\rightarrow$ more complex 
                - smaller `C` $\rightarrow$ less complex 
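
The effect of `gamma` on complexity can be explored with a sketch like the following. It uses a synthetic two-moons dataset and a few arbitrary `gamma` values, both assumptions made for illustration; the same loop could be written for `C`.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic, noisy two-class data
X_moons, y_moons = make_moons(n_samples=200, noise=0.3, random_state=123)

train_scores = {}
n_support = {}
for gamma in [0.01, 1.0, 100.0]:
    svm = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_moons, y_moons)
    # Training accuracy for this value of gamma
    train_scores[gamma] = svm.score(X_moons, y_moons)
    # support_ holds the indices of the training examples kept as support vectors
    n_support[gamma] = len(svm.support_)

print(train_scores)
print(n_support)
```

A training score that climbs as `gamma` grows is the training-side half of the fundamental tradeoff; validation scores are needed to see the other half.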

### How to approximate generalization error? 

A common way is **data splitting**. 
- split on `X` and `y`
    - `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)`
- split on `df`
    - `train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)`
- validation data
    - for hyperparameter tuning
    - don't pass into fit
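
As a small runnable sketch of the dataframe-style split, assuming a hypothetical toy dataframe with a `target` column:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataframe with two features and a target
toy_df = pd.DataFrame(
    {"feat1": np.arange(10), "feat2": np.arange(10) * 2, "target": [0, 1] * 5}
)

# Split the dataframe first, then separate features and target
train_df, test_df = train_test_split(toy_df, test_size=0.2, random_state=123)
X_train, y_train = train_df.drop(columns=["target"]), train_df["target"]
X_test, y_test = test_df.drop(columns=["target"]), test_df["target"]

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible across runs.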

### Summary of train, validation, test, and deployment data

|         | `fit` | `score` | `predict` |
|----------|-------|---------|-----------|
| Train    | ✔️      | ✔️      | ✔️         |
| Validation |      | ✔️      | ✔️         |
| Test    |       |  once   | once         |
| Deployment    |       |       | ✔️         |

You can typically expect $E_{train} < E_{validation} < E_{test} < E_{deployment}$.

### Cross-validation
- Problem: when the dataset is small, the train and validation splits are also small, and you might end up with results that don't represent your test data well. This is when you want to use cross-validation.
- Split the data into $k$ folds ($k>2$, often $k=10$).
- Each fold gets a turn at being the validation set, and each fold gets a score.
- The purpose of cross-validation is to **evaluate** how well the model will generalize to unseen data. 
- Gives a more **robust** measure of error on unseen data.
#### `cross_val_score`
- Gives us a list of validation scores in each fold.
#### `cross_validate`
- Gives us access to training and validation scores. 
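
Both functions can be tried side by side in a short sketch. The data and the choice of a depth-3 decision tree are assumptions for illustration. Note that `cross_validate` stores the validation scores under the key `test_score`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and an example model
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=123)
model = DecisionTreeClassifier(max_depth=3, random_state=123)

# One validation score per fold
cv_scores = cross_val_score(model, X_syn, y_syn, cv=5)

# Fit/score times plus train and validation scores per fold
cv_results = cross_validate(model, X_syn, y_syn, cv=5, return_train_score=True)
results_df = pd.DataFrame(cv_results)
mean_scores = results_df[["train_score", "test_score"]].mean()

print(cv_scores)
print(mean_scores)
```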

#### Our typical supervised learning set up is as follows: 

- We are given training data with features `X` and target `y`
- We split the data into train and test portions: `X_train, y_train, X_test, y_test`
- We carry out hyperparameter optimization using cross-validation on the train portion: `X_train` and `y_train`. 
- We assess our best performing model on the test portion: `X_test` and `y_test`.  
- What we care about is the **test error**, which tells us how well our model can be generalized.
- If this test error is "reasonable" we deploy the model which will be used on new unseen examples.

### Types of errors
- $E_\textrm{train}$ is your training error (or mean train error from cross-validation).
- $E_\textrm{valid}$ is your validation error (or mean validation error from cross-validation).
- $E_\textrm{test}$ is your test error.
- $E_\textrm{best}$ is the best possible error you could get for a given problem.
### Underfitting 
- If your model is too simple, like `DummyClassifier` or `DecisionTreeClassifier` with `max_depth=1`, it won't pick up on random quirks in the data, but it also won't capture useful patterns in the training data.
- The model won't be very good in general. Both train and validation errors would be high (or train and validation scores are low). This is **underfitting**.
- The gap between train and validation error is going to be lower.
- $E_\textrm{best} \lt E_\textrm{train} \lesssim E_\textrm{valid}$
### Overfitting 
- If your model is very complex, like a `DecisionTreeClassifier(max_depth=None)`, then you will learn unreliable patterns in order to get every single training example correct.
- The training error is going to be very low (training score is high), but there will be a big gap between the training error and the validation error (big gap between training score and validation score). This is **overfitting**.
- In overfitting scenario, usually we'll see: 
$E_\textrm{train} \lt E_\textrm{best}  \lt E_\textrm{valid}$
- In general, if $E_\textrm{train}$ is low, we are likely to be in the overfitting scenario. It is fairly common to have at least a bit of this.
### The "fundamental tradeoff" of supervised learning:
- **As you increase model complexity, $E_\textrm{train}$ tends to go down (training score goes up) but $E_\textrm{valid}-E_\textrm{train}$ tends to go up (the gap between training score and validation score goes up).**
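
One way to observe the tradeoff is to vary a complexity hyperparameter and record the errors. The sketch below uses synthetic data with some label noise and a handful of `max_depth` values, all chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% of labels flipped at random
X_syn, y_syn = make_classification(
    n_samples=300, n_features=10, flip_y=0.1, random_state=123
)

tradeoff = {}
for depth in [1, 3, 5, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    res = cross_validate(tree, X_syn, y_syn, cv=5, return_train_score=True)
    e_train = 1 - res["train_score"].mean()  # mean training error
    e_valid = 1 - res["test_score"].mean()  # mean validation error
    tradeoff[depth] = (e_train, e_valid, e_valid - e_train)

for depth, (e_train, e_valid, gap) in tradeoff.items():
    print(f"max_depth={depth}: E_train={e_train:.3f}, E_valid={e_valid:.3f}, gap={gap:.3f}")
```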
### How to pick a model that would generalize best?
- There are many subtleties here and there is no perfect answer but a  common practice is to pick the model with minimum cross-validation error (best validation score). 
### The golden rule <a name="4"></a>
- Even though we care the most about test error **THE TEST DATA CANNOT INFLUENCE THE TRAINING PHASE IN ANY WAY**. 
- We should never call `fit` on validation or test data.
- To avoid violating the golden rule:
    - **Splitting**: Before doing anything, split the data `X` and `y` into `X_train`, `X_test`, `y_train`, `y_test` or `train_df` and `test_df` using `train_test_split`. 
    - **Select the best model using cross-validation**: Use `cross_validate` with `return_train_score = True` so that we can get access to training scores in each fold. (If we want to plot train vs validation error plots, for instance.) 
    - **Scoring on test data**: Finally score on the test data with the chosen hyperparameters to examine the generalization performance.

## ML fundamentals

- What are four types of data we have seen so far? 
    - train data, test data, validation data, deployment data
- What are the advantages of cross-validation?
    - Cross-validation splits the data into $k$ folds, and each fold takes a turn as the validation set. With a small dataset, a single train/validation split might not represent the test data well, so averaging over the folds gives a more robust estimate of model performance. 
    - It makes better use of the data.
    - It shows whether the model is too sensitive to the given training data.
- Why do we split data?
    - So that we can estimate how well the model generalizes to unseen examples, using data that was not used for training.
- Why is it important to look at sub-scores of cross-validation?
- What is the fundamental trade-off in supervised machine learning?
- What is the Golden rule? 

### Dimensions in ML problems 

In ML, usually we deal with high dimensional problems where examples are hard to visualize.  

- $d \approx 20$ is considered low dimensional
- $d \approx 1000$ is considered medium dimensional 
- $d \approx 100,000$ is considered high dimensional 

### Feature vectors 
**Feature vector**
: is composed of feature values associated with an example.
### Distance between feature vectors 

- A common way to calculate the distance between vectors is calculating the **Euclidean distance**. 
- The euclidean distance between vectors $u = <u_1, u_2, \dots, u_n>$ and $v = <v_1, v_2, \dots, v_n>$ is defined as: 

$$distance(u, v) = \sqrt{\sum_{i =1}^{n} (u_i - v_i)^2}$$ 
- Subtract the two vectors, square the differences, sum them up, and take the square root.
- In `sklearn`: `euclidean_distances(two_things)`
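
The formula can be checked against `sklearn` on two small vectors. The vectors `u` and `v` below are made-up examples.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Two hypothetical feature vectors
u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])

# Subtract, square, sum, take the square root
manual = np.sqrt(np.sum((u - v) ** 2))

# euclidean_distances expects 2D inputs (one row per example)
sk = euclidean_distances(u.reshape(1, -1), v.reshape(1, -1))[0, 0]

print(manual, sk)
```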

## Pros and cons of different ML models

- Decision trees
- KNNs, SVM RBFs
- Linear models 
- Random forests
- LGBM, CatBoost
- Stacking, averaging 

## Preprocessing

### `StandardScaler`
- We'll use `scikit-learn`'s `StandardScaler`, which is a `transformer`.   

| Approach | What it does | How to update $X$ (but see below!) | sklearn implementation | 
|---------|------------|-----------------------|----------------|
| standardization | sets sample mean to $0$, s.d. to $1$   | `X -= np.mean(X,axis=0)`<br>`X /=  np.std(X,axis=0)` | [`StandardScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) |
### `fit` and `transform` paradigm for transformers
- `sklearn` uses `fit` and `transform` paradigms for feature transformations. 
- We `fit` the transformer on the train split and then `transform` the train split as well as the test split. 
### `sklearn` API summary: transformers

```
transformer.fit(X_train, [y_train])
X_train_transformed = transformer.transform(X_train)
X_test_transformed = transformer.transform(X_test)
```  
- You can pass `y_train` in `fit` but it's usually ignored. It allows you to pass it just to be consistent with usual usage of `sklearn`'s `fit` method.   
- You can also carry out fitting and transforming in one call using `fit_transform`. But be mindful to use it only on the train split and **not** on the test split. 
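
Here is a minimal sketch of the `fit`/`transform` paradigm with `StandardScaler`. The tiny arrays are hypothetical and chosen so the numbers are easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test splits with two features on different scales
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_toy = np.array([[2.0, 400.0]])

scaler = StandardScaler()
scaler.fit(X_train_toy)  # learns mean and standard deviation from the train split only

X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)  # reuses the train statistics

print(scaler.mean_)
print(X_train_scaled)
print(X_test_scaled)
```

The test example is scaled with the training mean and standard deviation, so its scaled values need not have mean 0.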
### Common preprocessing techniques
- Imputation: Tackling missing values
- Scaling: Scaling of numeric features
- One-hot encoding: Tackling categorical variables 
### `SimpleImputer`
- `SimpleImputer` is a transformer in `sklearn` for dealing with missing values. For example, 
    - You can impute missing values in categorical columns with the most frequent value.
    - You can impute the missing values in numeric columns with the mean or median of the column.    
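
Both strategies can be sketched on a hypothetical toy dataframe with one numeric and one categorical column (the values are made up for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column
toy_df = pd.DataFrame(
    {
        "lab2": [93.0, 90.0, np.nan, 83.0],
        "major": pd.Series(["Math", np.nan, "Math", "Physics"], dtype=object),
    }
)

# Numeric column: fill with the median of the observed values
num_imputer = SimpleImputer(strategy="median")
lab2_imputed = num_imputer.fit_transform(toy_df[["lab2"]])

# Categorical column: fill with the most frequent category
cat_imputer = SimpleImputer(strategy="most_frequent")
major_imputed = cat_imputer.fit_transform(toy_df[["major"]])

print(lab2_imputed.ravel())
print(major_imputed.ravel())
```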
### Breaking the golden rule
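- If we transform the whole training set once and then run cross-validation on the transformed data, are we applying `fit_transform` on the train portion and `transform` on the validation portion in each fold?  
    - No. The transformer has already seen the validation portion of every fold, so you might be allowing information from the validation set to **leak** into the training step.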


### Pipelines

- `scikit-learn`'s `Pipeline` allows you to define a "pipeline" of transformers with a final estimator.
- Using a `Pipeline` takes care of applying the `fit_transform` on the train portion and only `transform` on the validation portion in each fold.   
- #### Pipeline() 
    - Syntax: pass in a list of steps.
    - The last step should be a **model/classifier/regressor**.
    - All the earlier steps should be **transformers**.
    ```
    from sklearn.neighbors import KNeighborsRegressor  # not in the imports cell above

    pipe = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
            ("regressor", KNeighborsRegressor()),
        ]
    )
    ```
- #### Make_pipeline() 
    - Does not permit naming steps
    - Instead the names of steps are set to lowercase of their types automatically; `StandardScaler()` would be named as `standardscaler`
    ```
    pipe = make_pipeline(
        SimpleImputer(strategy="median"), StandardScaler(), KNeighborsRegressor()
    )
    ```
- #### .named_steps["NAME OF THE STEP"]
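
Putting the pieces together, the sketch below cross-validates a pipeline and then inspects a fitted step through `named_steps`. The regression data is synthetic, an assumption made so the example is self-contained.

```python
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data
X_syn, y_syn = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=123)

pipe = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), KNeighborsRegressor()
)

# In each fold the transformers are fit on the train portion only
cv_results = cross_validate(pipe, X_syn, y_syn, cv=5, return_train_score=True)

# After fitting, individual steps are available by their auto-generated names
pipe.fit(X_syn, y_syn)
step_names = list(pipe.named_steps.keys())
scaler_means = pipe.named_steps["standardscaler"].mean_

print(step_names)
print(scaler_means)
```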

Let's bring back our quiz2 grades toy dataset. 

In [4]:
grades_df = pd.read_csv('data/quiz2-grade-toy-col-transformer.csv')
grades_df.head()

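|   | enjoy_course | ml_experience | major | class_attendance | university_years | lab1 | lab2 | lab3 | lab4 | quiz1 | quiz2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | yes | 1 | Computer Science | Excellent | 3 | 92 | 93.0 | 84 | 91 | 92 | A+ |
| 1 | yes | 1 | Mechanical Engineering | Average | 2 | 94 | 90.0 | 80 | 83 | 91 | not A+ |
| 2 | yes | 0 | Mathematics | Poor | 3 | 78 | 85.0 | 83 | 80 | 80 | not A+ |
| 3 | no | 0 | Mathematics | Excellent | 3 | 91 | NaN | 92 | 91 | 89 | A+ |
| 4 | yes | 0 | Psychology | Good | 4 | 77 | 83.0 | 90 | 92 | 85 | A+ |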


In [5]:
X, y = grades_df.drop(columns=['quiz2']), grades_df['quiz2']

In [6]:
numeric_feats = ["university_years", "lab1", "lab3", "lab4", "quiz1"]  # apply scaling
categorical_feats = ["major"]  # apply one-hot encoding
passthrough_feats = ["ml_experience"]  # do not apply any transformation
drop_feats = [
    "lab2",
    "class_attendance",
    "enjoy_course",
]  # do not include these features in modeling

- What's the difference between sklearn estimators and transformers?  
- Can you think of a better way to impute missing values compared to `SimpleImputer`? 

### One-hot encoding 
- Here we create new binary columns to represent our categories. If a column has $c$ categories, we create $c$ new binary columns. 
- What's the purpose of the following arguments of one-hot encoding?
    - `handle_unknown="ignore"`
        - Some categories show up very rarely and may not be present in the training data.
        - Without this argument, `transform` raises an error when it encounters a category it did not see during `fit`.
        - With it, an unknown category is encoded as 0 in all of the one-hot columns.
        - If you have many unknown values, all of them will be encoded as 0s.
    - `sparse=False`
        - Returns a regular dense array instead of a sparse matrix.
        - If you have many rows and categorical columns with many categories, you will not want to use this.
        - If you know that most values are 0 and only a few are nonzero, it is better to keep the sparse format, which only keeps track of the nonzero values.
    - `drop="if_binary"`    
        - When a column has only two categories, it is encoded as a single 0/1 column instead of two columns.
- How do you deal with categorical features with only two possible categories? 
### `OneHotEncoder` and sparse features 
- By default, `OneHotEncoder` also creates sparse features. 
- You could set `sparse=False` (renamed to `sparse_output=False` in `scikit-learn` 1.2 and later) to get a regular `numpy` array. 
- If there are a huge number of categories, it may be beneficial to keep them sparse.
- For smaller number of categories, it doesn't matter much.
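
The behaviour of `handle_unknown="ignore"` can be seen in a small sketch. The majors below are hypothetical, and the default sparse output is converted with `.toarray()` so the example does not depend on the name of the sparse argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train and test columns; "Linguistics" never appears in training
train_majors = pd.DataFrame({"major": ["Mathematics", "Psychology", "Mathematics"]})
test_majors = pd.DataFrame({"major": ["Psychology", "Linguistics"]})

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train_majors)

# The output is sparse by default; convert to a dense array for display
encoded_test = ohe.transform(test_majors).toarray()
categories = ohe.categories_[0].tolist()

print(categories)
print(encoded_test)
```

The unseen category becomes a row of all zeros rather than raising an error.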

### Ordinal encoding
- Here we simply assign an integer to each of our unique categorical labels.
- If we have $c$ categories in our column, they are encoded as the integers $0$ to $c-1$ in a single column.
- What's the difference between ordinal encoding and one-hot encoding? 
    - One-hot encoding creates a new binary column for each category, whereas ordinal encoding maps the categories to integers in a single column and so imposes an order on them.
- What happens if we do not order the categories when we apply ordinal encoding?  Does it matter if we order the categories in ascending or descending order? 
    - If we don't specify the order, the categories are ordered alphabetically, and we don't want that: we want our model to capture that excellent and good are closer to each other than excellent and poor.
    - Ascending versus descending order doesn't really matter.
- What would happen if an unknown category shows up during validation or test time during ordinal encoding? For example, for `class_attendance` feature what if a category called "super poor" shows up? 

<br><br><br><br>

#### OHE vs. ordinal encoding

- Since the `enjoy_course` feature is binary, you decide to apply one-hot encoding with `drop="if_binary"`. Your friend decides to apply ordinal encoding on it. Will it make any difference in the transformed data? 
    - There won't be any difference.

In [7]:
ohe = OneHotEncoder(drop="if_binary", sparse=False)
ohe_encoded = ohe.fit_transform(grades_df[['enjoy_course']]).ravel()

In [8]:
oe = OrdinalEncoder()
oe_encoded = oe.fit_transform(grades_df[['enjoy_course']]).ravel()

In [9]:
data = { "oe_encoded": oe_encoded, 
         "ohe_encoded": ohe_encoded}
pd.DataFrame(data)

|   | oe_encoded | ohe_encoded |
|---|---|---|
| 0 | 1.0 | 1.0 |
| 1 | 1.0 | 1.0 |
| 2 | 1.0 | 1.0 |
| 3 | 0.0 | 0.0 |
| 4 | 1.0 | 1.0 |
| 5 | 0.0 | 0.0 |
| 6 | 1.0 | 1.0 |
| 7 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 |
| 9 | 1.0 | 1.0 |


- In what scenarios is it OK to break the golden rule?
    - If you are using one-hot encoding or ordinal encoding and you know in advance all the possible categories of a feature, then it's OK to use that knowledge.
    - Sometimes it's OK to incorporate human knowledge in the model.
- What are possible ways to deal with categorical columns with a large number of categories? 
    - group together different categories
    - treat it as binary features
    - take the most frequently occurring values
- In what scenarios would you not include a feature in your model even if it's a good predictor? 

### `ColumnTransformer`
- We often want to apply different transformations to different columns. 
- Define lists of numeric, passthrough, categorical, and drop features.
- Create the transformer with `ColumnTransformer()` or `make_column_transformer()`:
```
ct = make_column_transformer(    
    (StandardScaler(), numeric_feats),  # scaling on numeric features
    ("passthrough", passthrough_feats),  # no transformations on the binary features    
    (OneHotEncoder(), categorical_feats),  # OHE on categorical features
    ("drop", drop_feats),  # drop the drop features
)
```
- viewing dataframe after transformation
    ```
    transformed = ct.fit_transform(X)  # fit the column transformer and transform X
    column_names = (
        numeric_feats
        + passthrough_feats
        + ct.named_transformers_["onehotencoder"].get_feature_names_out().tolist()
    )
    pd.DataFrame(transformed, columns=column_names)
    ```
#### Cases where it's OK to break the golden rule 

- If there is a fixed number of categories. For example, if it's something like provinces in Canada or majors taught at UBC. We know the categories in advance and this is one of the cases where it might be OK to violate the golden rule and get a list of all possible values for the **categorical variable**. 

### Raw text problems
- Some popular representations of raw text include: 
    - **Bag of words** 
    - TF-IDF
    - Embedding representations 
- #### Bag of words (BOW) representation
    - One of the most popular representations of raw text 
    - Ignores the syntax and word order
    - It has two components: 
        - The vocabulary (all unique words in all documents) 
        - A value indicating either the presence or absence or the count of each word in the document. 
- #### Extracting BOW features using `scikit-learn`
    - `CountVectorizer`
        - Converts a collection of text documents to a matrix of word counts.  
        - Each row represents a "document" (e.g., a text message in our example). 
        - Each column represents a word in the vocabulary (the set of unique words) in the training data. 
        - Each cell represents how often the word occurs in the document.
        - The vocabulary (mapping from feature indices to actual words) can be obtained using `get_feature_names_out()` (called `get_feature_names()` in older `scikit-learn` versions).
#### Important hyperparameters of `CountVectorizer` 

- `binary`
    - whether to use absence/presence feature values or counts
- `max_features`
    - only consider top `max_features` ordered by frequency in the corpus
- `max_df`
    - ignore features which occur in more than `max_df` documents 
- `min_df` 
    - ignore features which occur in fewer than `min_df` documents 
- `ngram_range`
    - consider word sequences in the given range 
### Why sparse matrices? 

- Most words do not appear in a given document.
- We get massive computational savings if we only store the nonzero elements.
- There is a bit of overhead, because we also need to store the locations:
    - e.g. "location (3,27): 1".
    
- However, if the fraction of nonzero is small, this is a huge win.

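- What's the problem with calling `fit_transform` on the test data in the context of `CountVectorizer`?
    - It breaks the golden rule: the vocabulary would be learned from the test data.
    - The columns of the transformed train and test data would correspond to different vocabularies, so the features would not line up.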
- Do we need to scale after applying bag-of-words representation? 

## Linear models
**Linear models** are a fundamental and widely used class of models. They are called **linear** because they make a prediction using a **linear function** of the input features.  

### Linear regression 

- Imagine a hypothetical regression problem of predicting weight of a snake given its length. 
- #### `Ridge()`
    - complexity hyperparameter `alpha`, controls the fundamental tradeoff
        - larger `alpha` $\rightarrow$ likely to underfit
        - smaller `alpha` $\rightarrow$ likely to overfit
- #### Prediction of linear regression
    - The prediction is the value on the fitted line at the given feature value (here, the predicted weight of the snake for a given length). 
    - We can access the slope (i.e., coefficient or weight) and the intercept using `coef_` and `intercept_`, respectively. 
        - `coef_`
            - Sign
                - Positive: as the feature value gets bigger, the prediction gets bigger
                - Negative: as the feature value gets bigger, the prediction gets smaller
            - Magnitude
                - Bigger magnitude $\rightarrow$ bigger impact on the prediction 
        - `intercept_`
    - Given a feature value $x_1$ and learned coefficient $w_1$ and intercept $b$, we can get the prediction $\hat{y}$ with the following formula:
$$\hat{y} = w_1x_1 + b$$
    - This gives the same value as calling `predict`.
    - ### Generalizing to more features
        For more features, the model is a higher dimensional hyperplane and the general prediction formula looks as follows: 

        $\hat{y} =$ <font color="red">$w_1$</font> <font color="blue">$x_1$ </font> $+ \dots +$ <font color="red">$w_d$</font> <font color="blue">$x_d$</font> + <font  color="green"> $b$</font>
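
The prediction formula can be verified against `predict` in a short sketch. The two-feature data is synthetic, generated from known weights purely for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data generated from y = 2*x1 - 1*x2 + 5 plus noise
rng = np.random.RandomState(123)
X_lin = rng.uniform(0, 10, size=(50, 2))
y_lin = 2.0 * X_lin[:, 0] - 1.0 * X_lin[:, 1] + 5.0 + rng.normal(0, 0.5, size=50)

ridge = Ridge(alpha=1.0).fit(X_lin, y_lin)

# y_hat = w_1 x_1 + ... + w_d x_d + b, computed by hand
manual_preds = X_lin @ ridge.coef_ + ridge.intercept_
sk_preds = ridge.predict(X_lin)

print(ridge.coef_, ridge.intercept_)
print(np.allclose(manual_preds, sk_preds))
```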
### Logistic regression

- A linear model for **classification**. 
- Similar to linear regression, it learns weights associated with each feature and the bias. 
- It applies a **threshold** on the raw output to decide whether the class is positive or negative. 
- A linear classifier learns **weights** or **coefficients** associated with the features.  
- So the prediction is based on the **weighted sum** of the input features.
So a linear classifier is a linear function of the input `X`, followed by a threshold. 

\begin{equation}
\begin{split}
z =& w_1x_1 + \dots + w_dx_d + b\\
=& w^Tx + b
\end{split}
\end{equation}

$$\hat{y} = \begin{cases}
         1, & \text{if } z \geq r\\
         -1, & \text{if } z < r
\end{cases}$$
- `coef_` and `intercept_`
- The `classes_` attribute tells us which class is considered negative and which one is considered positive
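
The weighted sum followed by a threshold can be reproduced by hand. The sketch below assumes synthetic binary data; in `scikit-learn` the threshold on the raw score $z$ is 0 and the labels come from `classes_` rather than being fixed to $-1$ and $1$.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data
X_syn, y_syn = make_classification(n_samples=200, n_features=4, random_state=123)
lr = LogisticRegression(C=1.0).fit(X_syn, y_syn)

# Raw score: z = w^T x + b
z = X_syn @ lr.coef_.ravel() + lr.intercept_[0]

# Threshold at 0: positive scores map to classes_[1], the rest to classes_[0]
manual_labels = np.where(z > 0, lr.classes_[1], lr.classes_[0])

print(lr.classes_)
print(np.array_equal(manual_labels, lr.predict(X_syn)))
```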
### Main hyperparameter of logistic regression 

- `C` is the main hyperparameter which controls the fundamental trade-off.
    - smaller `C` $\rightarrow$ might lead to underfitting
    - bigger `C` $\rightarrow$ might lead to overfitting
### Decision boundary of logistic regression

- The decision boundary of logistic regression is a **hyperplane** dividing the feature space in half. 
- For $d=2$, the decision boundary is a line (1-dimensional)
- For $d=3$, the decision boundary is a plane (2-dimensional)
- For $d\gt 3$, the decision boundary is a $(d-1)$-dimensional hyperplane
### Linear SVM 


- There is also a linear SVM. You can pass `kernel="linear"` to `SVC` to create a linear SVM. 
- The `predict` methods of linear SVM and logistic regression work the same way. 
- We can get `coef_` associated with the features and `intercept_` using a linear SVM model. 
### Interpretation of coefficients in linear models 
- the $j$th coefficient tells us how feature $j$ affects the prediction
- if $w_j > 0$ then increasing $x_{ij}$ moves us toward predicting $+1$
- if $w_j < 0$ then increasing $x_{ij}$ moves us toward predicting $-1$
- if $w_j = 0$ then the feature is not used in making a prediction
### Pros and Cons of Linear Models
- #### Strengths of linear models 
    - Fast to train and predict
    - Scale to large datasets and work well with sparse data 
    - Relatively easy to understand and interpret the predictions
    - Perform well when there is a large number of features 
- #### Limitations of linear models 
    - Is your data "linearly separable"? Can you draw a hyperplane between these datapoints that separates them with 0 error? 
        - If the training examples can be separated by a linear decision rule, they are **linearly separable**.


### `predict_proba`

- For most of the `scikit-learn` classification models we can access a confidence score or probability score using a method called `predict_proba`.  
- The output of `predict_proba` is the probability of each class. 
#### The sigmoid function 
- The sigmoid function "squashes" the raw model output from any number to the range $(0, 1)$ using the following formula, where $x$ is the raw model output. 
$$\frac{1}{1+e^{-x}}$$
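
The following sketch checks the formula above numerically for binary logistic regression: applying the sigmoid to the raw model output (`decision_function`) reproduces the positive-class column of `predict_proba`. The synthetic data and the helper name `sigmoid` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def sigmoid(x):
    """Squash raw scores into the interval (0, 1)."""
    return 1 / (1 + np.exp(-x))


# Synthetic binary data (an assumption for illustration only).
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
lr = LogisticRegression().fit(X, y)

raw = lr.decision_function(X)        # w^T x + b for each example
manual_proba = sigmoid(raw)          # probability of the positive class
sklearn_proba = lr.predict_proba(X)  # columns follow the order of classes_

print(np.allclose(manual_proba, sklearn_proba[:, 1]))
print(np.allclose(sklearn_proba.sum(axis=1), 1.0))  # each row sums to 1
```
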

## Hyperparameter optimization 
- ### Automated optimization methods
    - #### Exhaustive grid search: [`sklearn.model_selection.GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
        - For `GridSearchCV` we need
            - an instantiated model or a pipeline
            - a parameter grid: A user specifies a set of values for each hyperparameter. 
            - other optional arguments 
        - can call `fit`, `predict` or `score` on it
        - Fitting the `GridSearchCV` object 
            - Searches for the best hyperparameter values
            - You can access the best score and the best hyperparameters using `best_score_` and `best_params_` attributes, respectively. 
        - 
        ```
        grid_search = GridSearchCV(
            pipe, param_grid, cv=5, n_jobs=-1, return_train_score=True
        )
        ```
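        - `pd.DataFrame(grid_search.cv_results_)` shows the results for every configuration tried.
        - `pd.DataFrame(grid_search.cv_results_).set_index("rank_test_score").sort_index()` sorts them from best to worst.
        - `n_jobs=-1` uses all available CPU cores.
        - For a pipeline, the keys of the parameter grid follow the pattern `model__hyperparameterName` (step name, two underscores, hyperparameter name), e.g.,
            - `"svc__gamma": [0.001, 0.01, 0.1, 1.0, 10, 100]`
            - `"svc__C": [0.001, 0.01, 0.1, 1.0, 10, 100]`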
        - #### Problems with exhaustive grid search 
            - The required number of models to evaluate grows exponentially with the dimensionality of the configuration space. 
            - It might take a really long time

    - #### Randomized search: [`sklearn.model_selection.RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
        - Samples configurations at random until a certain budget (e.g., time) is exhausted 
        - 
        ```
        random_search = RandomizedSearchCV(
            pipe_svm, param_distributions=param_grid, n_jobs=-1, n_iter=20, cv=5, random_state=42
        )
        ```
        - 
        ```
        pd.DataFrame(random_search.cv_results_)[
            [
                "mean_test_score",
                "param_svc__gamma",
                "param_svc__C",
                "mean_fit_time",
                "rank_test_score",
            ]
        ].set_index("rank_test_score").sort_index().T
        ```
        - `n_iter`: the number of configurations to sample
            - Larger `n_iter` will take longer but it'll do more searching.
        - #### Advantages of `RandomizedSearchCV`
            - Faster compared to `GridSearchCV`.
            - Adding parameters that do not influence the performance does not affect efficiency.
            - Works better when some parameters are more important than others. 
    - #### Pros and Cons of Automated Optimization
        - Advantages 
            - reduce human effort
            - less prone to error and improve reproducibility
            - data-driven approaches may be effective
        - Disadvantages
            - may be hard to incorporate intuition
            - be careful about overfitting on the validation set <br><br>
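
As a self-contained sketch of the `GridSearchCV` workflow outlined above, the example below tunes `gamma` and `C` of an SVC inside a pipeline. The synthetic dataset, the small grid, and the variable names are assumptions made to keep the example quick to run.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

# make_pipeline names steps after the lowercased class: "standardscaler", "svc".
pipe = make_pipeline(StandardScaler(), SVC())

# 3 x 3 = 9 configurations, each evaluated with 5-fold cross-validation.
param_grid = {
    "svc__gamma": [0.01, 0.1, 1.0],
    "svc__C": [0.1, 1.0, 10],
}

grid_search = GridSearchCV(pipe, param_grid, cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)

print("best params:", grid_search.best_params_)
print("best CV score:", round(grid_search.best_score_, 3))

results = (
    pd.DataFrame(grid_search.cv_results_)[
        ["rank_test_score", "param_svc__gamma", "param_svc__C", "mean_test_score"]
    ]
    .set_index("rank_test_score")
    .sort_index()
)
print(results.head())

# After fitting, the object behaves like the best model refit on all of X_train.
print("test score:", round(grid_search.score(X_test, y_test), 3))
```
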

- What makes hyperparameter optimization a hard problem?
    - There are many models and transformers, each with its own hyperparameters, and we want to find the best combination.
    - It is a hard search problem because there are many possibilities.
- What are two different tools provided by sklearn for hyperparameter optimization?  
    - `GridSearchCV`
    - `RandomizedSearchCV`
        - not committed to running all experiments exhaustively
        - we can provide distributions instead of a grid
- What is optimization bias? 
    - overfit on validation set, found a hyperparameter that is not 
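
To illustrate providing distributions instead of a grid, here is a hedged sketch of `RandomizedSearchCV` that samples `gamma` and `C` from log-uniform distributions. The use of `scipy.stats.loguniform`, the ranges, and the synthetic data are assumptions for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=123)

pipe_svm = make_pipeline(StandardScaler(), SVC())

# Distributions rather than fixed lists: each iteration draws one value of each.
param_dist = {
    "svc__gamma": loguniform(1e-3, 1e2),
    "svc__C": loguniform(1e-3, 1e2),
}

random_search = RandomizedSearchCV(
    pipe_svm,
    param_distributions=param_dist,
    n_iter=10,        # the budget: number of sampled configurations
    cv=5,
    random_state=42,  # makes the sampled configurations reproducible
)
random_search.fit(X, y)

print("best params:", random_search.best_params_)
print("best CV score:", round(random_search.best_score_, 3))
```

The cost is controlled by `n_iter` regardless of how many hyperparameters are searched, whereas a grid grows multiplicatively with every hyperparameter added.
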

## Evaluation metrics
- ### `.score` by default returns accuracy which is 
    
    - $$\frac{\text{correct predictions}}{\text{total examples}}$$
    
    - misleading when you have class imbalance
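
A tiny sketch of why accuracy can mislead under class imbalance, using a made-up dataset with 1% positive examples (an assumption for illustration): a baseline that always predicts the majority class gets high accuracy while finding none of the positives.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)

# 1000 examples, only 10 of them positive (made-up data for illustration).
y = np.zeros(1000, dtype=int)
y[:10] = 1
X = rng.normal(size=(1000, 3))

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = dummy.predict(X)

acc = accuracy_score(y, pred)
rec = recall_score(y, pred)
print(f"accuracy = {acc:.2f}")  # 990 of 1000 correct
print(f"recall   = {rec:.2f}")  # none of the 10 positives identified
```
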
    
- ### Confusion matrix 
    - `confusion_matrix(y_train, cross_val_predict(pipe, X_train, y_train))`
    - 
    ```
    TN, FP, FN, TP = confusion_matrix(y_valid, predictions).ravel()
    plot_confusion_matrix_example(TN, FP, FN, TP)
    ```
- ### Recall
    - Among all positive examples, how many did you identify?
    $$ recall = \frac{TP}{TP+FN} = \frac{TP}{\#positives} $$
- ### Precision 
    - Among the positive examples you identified, how many were actually positive?
    $$ precision = \frac{TP}{TP+FP}$$
- ### F1-score
    - F1-score combines precision and recall to give one score, which could be used in hyperparameter optimization, for instance. 
    - F1-score is a harmonic mean of precision and recall. 
    $$ f1 = 2 \times \frac{ precision \times recall}{precision + recall}$$
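
The formulas for recall, precision, and F1-score above can be checked by hand on a small example. The labels below are made up for illustration; the manual values are compared against the corresponding `sklearn.metrics` functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels: 4 positives and 6 negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# For binary labels 0/1, ravel() returns the counts in this order.
TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={TN}, FP={FP}, FN={FN}, TP={TP}")  # TN=4, FP=2, FN=1, TP=3

recall = TP / (TP + FN)       # 3 / 4
precision = TP / (TP + FP)    # 3 / 5
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f}, precision={precision:.3f}, f1={f1:.3f}")
print(
    np.isclose(recall, recall_score(y_true, y_pred)),
    np.isclose(precision, precision_score(y_true, y_pred)),
    np.isclose(f1, f1_score(y_true, y_pred)),
)
```
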
- ### Macro average
    - You give equal importance to all classes and average over all classes.  
- ### Weighted average
    - Each class's score is weighted by the number of samples in that class. 
    - The weighted sum is then divided by the total number of samples. 
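
A small sketch of the difference between the macro and weighted averages, using recall on made-up imbalanced labels (an assumption for illustration). The manual averages are compared with `recall_score(..., average=...)`.

```python
import numpy as np
from sklearn.metrics import recall_score

# Made-up labels: 8 examples of class 0 and 2 examples of class 1.
y_true = np.array([0] * 8 + [1] * 2)
y_pred = np.array([0] * 8 + [1, 0])  # one of the two class-1 examples is missed

per_class = recall_score(y_true, y_pred, average=None)  # recall of each class
support = np.bincount(y_true)                           # samples per class

macro = per_class.mean()                          # every class counts equally
weighted = np.average(per_class, weights=support) # classes weighted by size

print("per-class recall:", per_class)  # [1.0, 0.5]
print("macro average:", macro)         # (1.0 + 0.5) / 2
print("weighted average:", weighted)   # (8 * 1.0 + 2 * 0.5) / 10
```

The weighted average is dominated by the majority class, so the macro average is usually the more informative of the two when the minority class matters.
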
- ### Cross validation with different metrics
    ```
    scoring = [
        "accuracy",
        "f1",
        "recall",
        "precision",
    ]  # scoring can be a string, a list, or a dictionary
    pipe = make_pipeline(StandardScaler(), LogisticRegression())
    scores = cross_validate(
        pipe, X_train_big, y_train_big, return_train_score=True, scoring=scoring
    )
    pd.DataFrame(scores)
    ```
- ### Sklearn API
    - accuracy: `accuracy_score(y_valid, pipe_lr.predict(X_valid))`
    - error: `1 - accuracy_score(y_valid, pipe_lr.predict(X_valid))`
    - precision: `precision_score(y_valid, pipe_lr.predict(X_valid), zero_division=1)`
    - recall: `recall_score(y_valid, pipe_lr.predict(X_valid))`
    - f1 score: `f1_score(y_valid, pipe_lr.predict(X_valid))`
    - classification report that gives all the above info
    ```
    print(
        classification_report(
            y_valid, pipe_lr.predict(X_valid), target_names=["non-fraud", "fraud"]
        )
    )
    ```
- ### Precision/Recall tradeoff 
    - There is a trade-off between precision and recall. 
    - If you identify more things as "fraud", recall is going to increase but there are likely to be more false positives, which lowers precision. 
    - #### Decreasing the threshold
        - Decreasing the threshold means a lower bar for predicting fraud. 
            - You are willing to risk more false positives in exchange for more true positives. 
            - recall would either stay the same or go up and precision is likely to go down
            - occasionally, precision may increase if all the new examples after decreasing the threshold are TPs. 
    - #### Increasing the threshold
        - Increasing the threshold means a higher bar for predicting fraud. 
            - recall would go down or stay the same but precision is likely to go up 
            - occasionally, precision may go down: raising the threshold can remove TPs as well as FPs, and precision is TP/(TP+FP).     
    - #### Precision-recall curve
        - The top-right would be a perfect classifier (precision = recall = 1).
        - Usually the goal is to keep recall high as precision goes up. 
    ```
    precision, recall, thresholds = precision_recall_curve(
        y_valid, pipe_lr.predict_proba(X_valid)[:, 1]
    )
    plt.plot(precision, recall, label="logistic regression: PR curve")
    plt.xlabel("Precision")
    plt.ylabel("Recall")
    plt.plot(
        precision_score(y_valid, pipe_lr.predict(X_valid)),
        recall_score(y_valid, pipe_lr.predict(X_valid)),
        "or",
        markersize=10,
        label="threshold 0.5",
    )
    plt.legend(loc="best");
    ```
    - #### AP score 
        - one number summarizing the PR plot (e.g., in hyperparameter optimization)
        - This is called **average precision** (AP score)
        - AP score has a value between 0 (worst) and 1 (best). 
        -
        ```
        ap_lr = average_precision_score(y_valid, pipe_lr.predict_proba(X_valid)[:, 1])
        print("Average precision of logistic regression: {:.3f}".format(ap_lr))
        ```
        - 
        ```
        PrecisionRecallDisplay.from_estimator(pipe_lr, X_valid, y_valid);
        ```
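
The precision/recall trade-off described above can be seen by applying different thresholds to `predict_proba` by hand. This is a minimal sketch on synthetic imbalanced data; the dataset, the thresholds, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 10% positives (an assumption).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, stratify=y, random_state=0
)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = lr.predict_proba(X_valid)[:, 1]  # probability of the positive class

recalls = []
for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    # Predict positive whenever the probability clears the threshold.
    pred = (proba >= threshold).astype(int)
    p = precision_score(y_valid, pred, zero_division=0)
    r = recall_score(y_valid, pred)
    recalls.append(r)
    print(f"threshold={threshold:.1f} precision={p:.3f} recall={r:.3f}")
```

Recall can only go down or stay the same as the threshold increases, because the set of predicted positives shrinks; precision usually, but not always, goes up.
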
- ### AP vs. F1-score
    - F1 score is for a given threshold and measures the quality of `predict`.
    - AP score is a summary across thresholds and measures the quality of `predict_proba`.
- ### Receiver Operating Characteristic (ROC) curve 

- Another commonly used tool to analyze the behavior of classifiers at different thresholds.  
- Similar to PR curve, it considers all possible thresholds for a given classifier given by `predict_proba` but instead of precision and recall it plots false positive rate (FPR) and true positive rate (TPR or recall).
$$ TPR = \frac{TP}{TP + FN}, FPR  = \frac{FP}{FP + TN}$$
- The ideal curve is close to the top left
    - Ideally, you want a classifier with high recall while keeping low false positive rate.  
```
fpr, tpr, thresholds = roc_curve(y_valid, pipe_lr.predict_proba(X_valid)[:, 1])
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")

default_threshold = np.argmin(np.abs(thresholds - 0.5))

plt.plot(
    fpr[default_threshold],
    tpr[default_threshold],
    "or",
    markersize=10,
    label="threshold 0.5",
)
plt.legend(loc="best");
```
- ### Area under the curve (AUC)
    - AUC provides a single meaningful number for the model performance. 
    ```
    roc_lr = roc_auc_score(y_valid, pipe_lr.predict_proba(X_valid)[:, 1])
    roc_svc = roc_auc_score(y_valid, pipe_svc.decision_function(X_valid))
    print("AUC for LR: {:.3f}".format(roc_lr))
    print("AUC for SVC: {:.3f}".format(roc_svc))
    ```
    - AUC of 0.5 means random chance. 
    - AUC can be interpreted as evaluating the **ranking** of positive examples.
    - It is the probability that a randomly picked positive point has a higher score according to the classifier than a randomly picked point from the negative class. 
    - AUC of 1.0 means all positive points have a higher score than all negative points. 
    ```
    RocCurveDisplay.from_estimator(pipe_lr, X_valid, y_valid);
    ```
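
The ranking interpretation of AUC can be verified directly: count the fraction of (positive, negative) pairs in which the positive example gets the higher score, counting ties as one half. The labels and scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and classifier scores, sorted by score for readability.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
scores = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# All pairwise differences: one row per positive, one column per negative.
diff = pos[:, None] - neg[None, :]

# Fraction of pairs ranked correctly (ties count as half).
pairwise = np.mean((diff > 0) + 0.5 * (diff == 0))

auc = roc_auc_score(y_true, scores)
print(f"pairwise ranking probability = {pairwise:.2f}")  # 21 of 25 pairs
print(f"roc_auc_score                = {auc:.2f}")
```
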
    

<br><br> 
- Why is accuracy not always enough?
    - If there is class imbalance, accuracy is misleading.
    - We care about how well we are performing on the class we are interested in. 
- Why is it useful to get prediction probabilities? 
- In what scenarios do you care more about precision or recall? 
    - recall: when false negatives are costly and you want to minimize them
    - precision: when false positives are costly
    - use F1 score to balance precision and recall
- What's the main difference between AP score and F1 score?
    - The difference is in how the threshold is treated. 
    - F1 is computed for a given threshold (e.g., the default 0.5) and measures the quality of `predict`.
    - AP is calculated over a range of thresholds and measures the quality of `predict_proba`.
- What are advantages of RMSE or MAPE over MSE? 
    - interpretability

### Class imbalance in training sets

- This typically refers to having many more examples of one class than another in one's training set.
- Real world data is often imbalanced. 
    - Our Credit Card Fraud dataset is imbalanced.
    - Ad clicking data is usually drastically imbalanced. (Only around 0.01% of ads are clicked.)
    - Spam classification datasets are also usually imbalanced.
### Handling imbalance
There are two common approaches for this: 
- **Changing the data (optional)** (not covered in this course)
   - Undersampling
   - Oversampling 
       - Random oversampling
       - SMOTE 
- **Changing the training procedure** 
    - `class_weight`
        - Many `sklearn` classifiers have a parameter called `class_weight`.
        - This allows you to specify that one class is more important than another.
    - `class_weight="balanced"`
        - A useful setting is `class_weight="balanced"`.
        - This sets the weights inversely proportional to the class frequencies, so that the classes are "equal".
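
As a hedged sketch of what `class_weight="balanced"` does, the example below computes the balanced weights with `compute_class_weight`, checks them against the formula `n_samples / (n_classes * np.bincount(y))`, and compares a logistic regression with and without class weights. The synthetic data and variable names are assumptions, and the size of the effect will vary by dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data: roughly 5% positives (an assumption).
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, stratify=y, random_state=1
)

classes = np.unique(y_train)
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)

# The same weights by hand: rare classes get proportionally larger weights.
manual = len(y_train) / (len(classes) * np.bincount(y_train))
print("balanced weights:", dict(zip(classes.tolist(), balanced.round(2).tolist())))

for cw in [None, "balanced"]:
    lr = LogisticRegression(class_weight=cw, max_iter=1000).fit(X_train, y_train)
    pred = lr.predict(X_valid)
    print(
        f"class_weight={cw}: "
        f"precision={precision_score(y_valid, pred, zero_division=0):.3f} "
        f"recall={recall_score(y_valid, pred):.3f}"
    )
```

Weighting the rare class more heavily typically raises recall at the cost of precision, which is the same trade-off as lowering the prediction threshold.
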

### `RidgeCV`

- automatically tunes `alpha` based on cross-validation.
```
ridgecv_pipe = make_pipeline(preprocessor, RidgeCV(alphas=alphas, cv=10))
ridgecv_pipe.fit(X_train, y_train);
best_alpha = ridgecv_pipe.named_steps['ridgecv'].alpha_
best_alpha
```

## Scoring functions for regression

- ### Mean squared error (MSE)
    - MSE (mean squared error) is in units of target squared, hard to interpret; 0 is best
    ```
    mean_squared_error(y_train, lr_tuned.predict(X_train))
    ```
- ### $R^2$
    - $R^2$ is what `.score()` returns by default for regressors; it is unitless, 0 is bad, 1 is best
    - The maximum is 1 for perfect predictions
    - Negative values are very bad: the model is doing worse than `DummyRegressor`
- ### Root mean squared error (RMSE)
    - RMSE (root mean squared error) is in the same units as the target; 0 is best
    ```
    np.sqrt(mean_squared_error(y_train, lr_tuned.predict(X_train)))
    ```
- ### Mean absolute percent error (MAPE)
    - MAPE (mean absolute percent error) is unitless; 0 is best, 1 is bad
    ```
    mean_absolute_percentage_error(y_train, pred_train)
    ```
    - to reduce MAPE, we can log-transform the target
    ```
    ttr = TransformedTargetRegressor(
        Ridge(alpha=best_alpha), func=np.log1p, inverse_func=np.expm1
    )  # transformer for log transforming the target
    ttr_pipe = make_pipeline(preprocessor, ttr)
    ```
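
To make the units and ranges of these scores concrete, the sketch below computes MSE, RMSE, $R^2$, and MAPE by hand on four made-up target values and compares them with the `sklearn.metrics` functions. The numbers are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

# Made-up targets and predictions (e.g., prices in dollars).
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

errors = y_true - y_pred

mse = np.mean(errors**2)                  # units: target squared
rmse = np.sqrt(mse)                       # units: same as the target
r2 = 1 - np.sum(errors**2) / np.sum((y_true - y_true.mean()) ** 2)  # unitless
mape = np.mean(np.abs(errors / y_true))   # unitless fraction of the true value

print(f"MSE={mse:.1f}, RMSE={rmse:.2f}, R^2={r2:.3f}, MAPE={mape:.4f}")

print(
    np.isclose(mse, mean_squared_error(y_true, y_pred)),
    np.isclose(r2, r2_score(y_true, y_pred)),
    np.isclose(mape, mean_absolute_percentage_error(y_true, y_pred)),
)
```

Here a MAPE of 0.0875 means the predictions are off by about 8.75% of the true value on average, which is easier to communicate than an MSE of 675 squared dollars.
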
- ### Sklearn API
    ```
    pd.DataFrame(
        cross_validate(
            lr_tuned,
            X_train,
            y_train,
            return_train_score=True,
            scoring=["neg_mean_squared_error", "neg_mean_absolute_percentage_error"]
        )
    )
    ```

## Ensembles
- **Ensembles** are models that combine multiple machine learning models to create more powerful models. 
- ### `RandomForestClassifier` 
    - A single decision tree is likely to overfit
    - Use a collection of diverse decision trees
    - Each tree overfits on some part of the data but we can reduce overfitting by averaging the results 
    - `n_estimators`: number of decision trees (higher = more complexity)
    - `max_depth`: max depth of each decision tree (higher = more complexity)
    - `max_features`: the number of features considered at each split (higher = more complexity)
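
A brief sketch of the averaging idea above, assuming synthetic data: it compares a single decision tree with a random forest under cross-validation, and then checks that the forest's `predict_proba` is the average of the `predict_proba` outputs of its individual trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration only).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(
        n_estimators=50, max_features="sqrt", random_state=0
    ),
}

for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, return_train_score=True)
    print(
        f"{name}: train={cv['train_score'].mean():.3f} "
        f"valid={cv['test_score'].mean():.3f}"
    )

# The forest's probabilities are the average over its individual trees.
forest = models["random forest"].fit(X, y)
tree_avg = np.mean([t.predict_proba(X) for t in forest.estimators_], axis=0)
print("n trees:", len(forest.estimators_))
print("forest = average of trees:", np.allclose(tree_avg, forest.predict_proba(X)))
```
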
- ### Strengths and weaknesses
    - Strengths
        - Usually one of the best performing off-the-shelf classifiers without heavy tuning of hyperparameters
        - Don't require scaling of data 
        - Less likely to overfit 
        - Slower than decision trees because we are fitting multiple trees but can easily parallelize training because all trees are independent of each other (that said, sklearn implementation is kind of slow)
        - In general, able to capture a much broader picture of the data compared to a single decision tree. 
    - Weaknesses
        - Require more memory 
        - Hard to interpret
        - Tend not to perform well on high dimensional sparse data such as text data
    
<br><br>
- How does a random forest model inject randomness in the model?
- What's the difference between random forests and gradient boosted trees?
- Why do we need averaging or stacking? 
- What are the benefits of stacking over averaging?  

### Feature importances and selection 

- What are the limitations of looking at simple correlations between features and targets? 
- How can you get feature importances for non-linear models?
- What might you need to explain a single prediction?
- What's the difference between feature engineering and feature selection? 
- Why do we need feature selection?
- What are the three possible ways we looked at for feature selection? 
