
# Supervised Learning

## Data preprocessing

Traditional computing paradigm: Rules + Data = Answers (through computing)

Machine Learning: Data + Answers = Rules (Learning!)

Using data is a process that repeats in a cycle:

1. Business understanding
2. Data understanding
    - Refer back to **1.** to ensure cohesiveness
    - Aiming to select variables and clean data (filter outliers, irrelevant variables, delete null values)
3. Data preparation
4. Modeling
    - Refer back to **3.** to ensure cohesiveness
5. Evaluation
    - Return to **1.** if needed
6. Deployment

Steps for cleaning data:

1. Create variables
2. Eliminate redundant variables
3. Treat null values
4. Treat outliers as valid or invalid

**Why do we want to clean data?**

- Dirty data does not make sense in our context (e.g., a negative age, a 0 that could be a true value or the absence of a value, incomplete records, duplicates)
- Missing values **can be kept**, since the fact that a value is missing may itself be informative. It is good practice to create an additional category or indicator variable for them
- Missing values **should be deleted** when there are too many of them
- Missing values **can be replaced** when you can estimate them using imputation procedures
    - For *continuous variables*, replace with median or mean (median more robust to outliers)
    - For *nominal variables*, replace with mode
    - With *regression* or *tree-based imputation*, predict the missing value from the other variables; the target class cannot be used as a predictor if missing values can also occur while the model is in use
- Outliers that are invalid observations, or that are out of distribution because few samples take such values, **can be kept** as null values or **deleted**
    - Some may be hidden in one-dimensional views of multidimensional data
    - Invalid observations are typically treated as null values, while out-of-distribution observations are typically removed, with a note made in the policy
- **Valid outliers** and their treatments
    - Truncation based on z-scores: replace every value whose z-score is above 3 or below $-3$ with the mean $\pm$ 3 standard deviations (the limit it exceeded)
    - Truncation based on inter-quartile range: replace every value beyond the limits $\text{Median} \pm 3s$, where $s = IQR/(2 \times 0.6745)$, with the limit it exceeded
    - Truncation using a sigmoid
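
The imputation and truncation rules above can be sketched with the Python standard library. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, `None` is assumed to mark a missing value, the sample standard deviation is used for z-scores, and the quartiles come from the default method of `statistics.quantiles`.

```python
import statistics


def impute_median(values):
    """Fill missing entries (None) with the median and keep a missing-value indicator."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in values]
    was_missing = [v is None for v in values]  # indicator variable for missingness
    return filled, was_missing


def impute_mode(values):
    """Fill missing entries (None) of a nominal variable with the most frequent category."""
    observed = [v for v in values if v is not None]
    mode = statistics.mode(observed)
    return [mode if v is None else v for v in values]


def truncate_zscore(values, limit=3.0):
    """Clip values to mean +/- limit * standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    low, high = mean - limit * sd, mean + limit * sd
    return [min(max(v, low), high) for v in values]


def truncate_iqr(values, limit=3.0):
    """Clip values to median +/- limit * s, with s = IQR / (2 * 0.6745)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    s = (q3 - q1) / (2 * 0.6745)
    low, high = median - limit * s, median + limit * s
    return [min(max(v, low), high) for v in values]
```

Because the median and IQR are barely affected by the outlier itself, the IQR-based limits are usually tighter than the z-score limits on the same data.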

## Statistical models

### Input data

Input variables come in several types:

- Continuous, categorical, integer

In general, we will assume that for each individual sample $i$, there is a vector $x_i$ that stores the values of the $p$ variables that represent that individual. We have a sample $X$, the design matrix, of cardinality $I$, such that for every $i$ we know its information $x_i$

### Target Variables

The core of *supervised learning* is that there is a variable $y$ representing the outcome

- Continuous $y$ is a supervised regression problem
- Discrete $y$ is a supervised classification problem
- Binary $y$ is the easiest version of a supervised classification problem

If no set of outcomes $Y$ exists, then the problem is *unsupervised*

## Defining a model and losses

A statistical model takes an input and gives an output. A model is fitted by minimizing a loss (or error) function, which measures the goodness of fit of the model's function. There are many examples of loss functions, such as *mean-square error* or *cross-entropy* (MLE)

**Mean-square error**: $L(Y,\hat{Y}) = \frac{1}{I}\sum_i (y_i - \hat{y}_i)^2$

- Appropriate for regression analysis

**Cross entropy**: When you have no understanding, then entropy (chaos) is at its worst. Each additional piece of information is worth less than the one before.

We start with a one-hot vector $y_i$, whose entry is 1 for the correct class and 0 otherwise. The classes are indexed by $k$

Eg, if you have 3 classes with probabilities $P_a, P_b, P_c$, a response vector $y_i = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$ and a prediction $f(x_i) = (P_a, P_b, P_c)$,

- You can isolate the probability of the correct class with the product $P_a^{y_a} \times P_b^{y_b} \times P_c^{y_c}$, since every exponent is 0 except the one for the correct class
- Taking the negative logarithm of that product gives the loss function: $L(y_i, \hat{y}_i) = -\sum_k y_i^k \log(\hat{y}_i^k)$
    - If $P_n$ is 0 for a class with $y_i^n = 0$, then we accept that $0\log(0) = 0$

$l(y_i, \hat{y}_i) = -y_i\log(\hat{y}_i) - (1-y_i)\log(1-\hat{y}_i)$ for a binary version

$l(y_i, \hat{y}_i) = - \sum_k y_i^k \log(\hat{y}_i^k)$ for multinomial cross-entropy, where $k$ indexes the classes
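
The losses above can be written out directly. The following is a small sketch with hypothetical function names; it averages the squared errors over the sample for the mean-square error and applies the $0\log(0) = 0$ convention explicitly.

```python
import math


def mean_squared_error(y, y_hat):
    """Average of squared differences between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def _y_log_p(y, p):
    """Return y * log(p), using the convention 0 * log(0) = 0."""
    return 0.0 if y == 0 else y * math.log(p)


def binary_cross_entropy(y, p):
    """Loss for one case with target y in {0, 1} and predicted probability p."""
    return -_y_log_p(y, p) - _y_log_p(1 - y, 1 - p)


def cross_entropy(y_onehot, probs):
    """Multinomial loss for one case: one-hot target vs. predicted class probabilities."""
    return -sum(_y_log_p(y, p) for y, p in zip(y_onehot, probs))
```

With a one-hot target only the term for the correct class survives, so `cross_entropy([1, 0, 0], [0.7, 0.2, 0.1])` reduces to `-log(0.7)`, and the binary version is the two-class special case.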

### Regression functions

The "best" model is the one that minimizes the expected error over the full distribution of inputs and outputs

The problem, mathematically, becomes $\min_\beta E[L(Y,f(x|\beta))]$

**Linear regression example**

For the whole sample:

$$Y = X\beta + \epsilon$$

We can then use least squares to estimate $\beta$
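
As a sketch of that least-squares step, assuming $X^TX$ is invertible (an assumption not stated above), the estimate minimizes the sum of squared residuals:

$$\hat{\beta} = \arg\min_\beta \, (Y - X\beta)^T (Y - X\beta)$$

Setting the gradient with respect to $\beta$ to zero gives the normal equations and their solution:

$$-2X^T(Y - X\hat{\beta}) = 0 \;\Rightarrow\; X^TX\hat{\beta} = X^TY \;\Rightarrow\; \hat{\beta} = (X^TX)^{-1}X^TY$$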



# Likelihood

"Given a distribution with parameters $\theta$, how likely are my data?"

Assumptions and scope:

- The data points are assumed to be independent
- The approach can be used with any distribution
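
To make the question concrete, here is a minimal sketch for a Bernoulli distribution. The data are made up and the function name is hypothetical. Independence lets the log-likelihood be written as a sum over the data points, and a simple grid search stands in for a proper optimizer.

```python
import math


def bernoulli_log_likelihood(theta, data):
    """Sum of log p(x | theta) over independent 0/1 observations."""
    return sum(math.log(theta) if x == 1 else math.log(1 - theta) for x in data)


# Made-up sample: 6 successes out of 10 trials
data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]

# Evaluate the log-likelihood on a grid of candidate parameters in (0, 1)
grid = [i / 100 for i in range(1, 100)]
theta_hat = max(grid, key=lambda t: bernoulli_log_likelihood(t, data))
```

For this distribution the maximizer coincides with the sample proportion of successes, which the grid search recovers.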

## Properties of MLE

- Functional invariance
    - $y=f(\theta) \rightarrow \hat y=f(\hat \theta)$
- Asymptotic properties
    - Estimate is asymptotically unbiased
    - Estimate is asymptotically efficient (it attains the lowest possible variance as the sample grows)
    - Estimate is asymptotically normally distributed



# Logistic regression

With logistic regression, we can identify how every variable impacts the model.

## Classification and regression

When you use least-squares for classification, there is no guarantee that the predicted response is between 0 and 1. Also, least-squares inference assumes normally distributed errors, which a binary target does not satisfy

Logistic regression eliminates ambiguity by pushing middle cases towards 0 or 1.

- Intuitively, the odds of an event grow exponentially with the linear combination of the inputs

MLE for logistic regression is **supervised**, so it needs a target. Cases in the sample are assumed to be independent
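
A minimal sketch of fitting a logistic regression by maximum likelihood, assuming a single input variable plus an intercept. The data are made up, the function names are hypothetical, and plain gradient ascent on the mean log-likelihood is used here for readability rather than the solvers a library would use.

```python
import math


def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Gradient ascent on the mean log-likelihood; returns (intercept, slope)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = y - sigmoid(b0 + b1 * x)  # target minus predicted probability
            grad0 += error
            grad1 += error * x
        b0 += learning_rate * grad0 / n
        b1 += learning_rate * grad1 / n
    return b0, b1


# Made-up, non-separable sample so the maximum-likelihood estimate is finite
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)

# The odds p / (1 - p) equal exp(b0 + b1 * x), so each unit of x multiplies them by exp(b1)
odds_ratio = math.exp(b1)
```

The coefficient `b1` is what lets us read off how the variable impacts the model: its sign gives the direction, and `exp(b1)` gives the multiplicative change in the odds per unit of the input.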



# Model validation

We can compare supervised models using test data. The model that generalizes better to unseen data is the better one.

## Training, validation, and test set

1. Training set: used to calibrate the parameters
2. Validation set: part of the training data, but used to make decisions about model construction
    - Deciding which variables to include
    - Deciding when to stop training
3. Test set: **only** used to calculate final metrics. No decision should be made on the test set
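
One way to sketch the three-way split is shown below. The function name, the 60/20/20 proportions, and the fixed seed are illustrative assumptions; the point is that the split happens once, up front, and the three parts never overlap.

```python
import random


def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut the data into disjoint train, validation, and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
```

Parameters are fitted on `train`, construction decisions are made on `val`, and `test` is set aside until the final metrics are computed.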

## Misuse of the test set

- Using it to help build your model, for example to remove variables or calibrate the model
    - Leads to overly optimistic performance estimates
- Correct use of the test set:
    1. Split at the very beginning
    2. The only decision made with it should be model selection
    3. If you need to go back to a previous step, resplit the training/validation/test sets
        - Resplitting is expensive and usually not feasible, but leakage can have serious consequences

## Conditional vs expected test error



# Uncertainty management

## Estimation and sampling distribution

## Confidence intervals

## The bootstrap

Make sure to save checkpoints and your model outputs so you don't have to rerun your training each time.
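
A minimal sketch of a bootstrap whose results are saved to disk, so they can be reloaded instead of recomputed. The notes do not spell out a procedure, so everything here is an assumption for illustration: the statistic (the mean), the percentile interval, the made-up data, the function names, and the JSON file in the system temp directory.

```python
import json
import os
import random
import statistics
import tempfile


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Resample the data with replacement and record the mean of each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]


def percentile_interval(estimates, alpha=0.05):
    """Take the alpha/2 and 1 - alpha/2 percentiles of the bootstrap estimates."""
    ordered = sorted(estimates)
    low = ordered[int(len(ordered) * alpha / 2)]
    high = ordered[int(len(ordered) * (1 - alpha / 2)) - 1]
    return low, high


def save_estimates(estimates, path):
    """Write the bootstrap estimates to disk so they need not be recomputed."""
    with open(path, "w") as f:
        json.dump(estimates, f)


def load_estimates(path):
    """Read previously saved bootstrap estimates."""
    with open(path) as f:
        return json.load(f)


# Made-up sample
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]
means = bootstrap_means(data)
low, high = percentile_interval(means)

# Hypothetical checkpoint location
checkpoint = os.path.join(tempfile.gettempdir(), "bootstrap_means.json")
save_estimates(means, checkpoint)
```

In a real project the expensive part would be refitting a model on each resample, which is where saving the outputs pays off.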

## Prediction uncertainty

Prediction uncertainty describes how confident you can be when making decisions based on your point estimate. Quantifying it requires saving the outputs of the model

# Row and columnar data formats

Columnar formats are a newer way of storing data, designed with data scientists in mind

## How data is stored

Data is stored linearly by the computer as a string of zeroes and ones.

We must choose how to lay out a table in that linear sequence so the computer can work with it:

- Row data storage
- Columnar data storage

### Row storage (for operational databases)

Used in operational databases: the values of each row of a table are stored next to each other, and the rows follow one another in a single line. Since everything is in one line, it is expensive to search through all the data to replace or delete values

Eg, Name|Age|Place|Name|Age|Place|Name|Age|Place|

Advantages:

- Values in the same row are close together in storage
- Deleting a row is easy, as is inserting a new row
- Searching for a record is easy
- Accessing a full record is easier

Disadvantages:

- Expensive and inefficient for analytics

### Columnar storage (Superior data method)

The values of each column of a table are stored next to each other, one column after another. This is much more efficient for calculating over columns. It also takes up less space on disk

Eg, Name|Name|Name|Age|Age|Age|Place|Place|Place

Advantages:

- Single-instruction operations over a whole column are very efficient
- Modifying columns is faster
- Working with uncompressed data is more efficient
- Working with compressed data is more efficient
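
The two layouts can be illustrated with plain Python lists. This is a toy sketch with made-up records; real engines work on raw bytes, but the ordering of the values is the same idea.

```python
# Made-up table with three records
fields = ("Name", "Age", "Place")
rows = [("Ana", 31, "Lima"), ("Bo", 25, "Oslo"), ("Cy", 40, "Rome")]

# Row storage: all values of one record sit next to each other
row_layout = [value for row in rows for value in row]

# Columnar storage: all values of one column sit next to each other
columns = {name: [row[i] for row in rows] for i, name in enumerate(fields)}
column_layout = [value for column in columns.values() for value in column]

# Reading one column from the row layout means skipping over the other fields
ages_from_rows = row_layout[1::len(fields)]

# Reading one column from the columnar layout is a single contiguous run
ages_from_columns = columns["Age"]

mean_age = sum(ages_from_columns) / len(ages_from_columns)
```

Appending a record touches one spot at the end of `row_layout` but one spot per column in the columnar layout, which mirrors the trade-off between operational and analytical workloads.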

## Why does it matter?

## Columnar vs row data formats

## Spark / Arrow / DuckDB

**Apache Parquet** (on disk) and **Apache Arrow** (in memory) are two formats that store data in columnar form

Implementations:

- DuckDB: a traditional database with columnar storage, optimized for data warehousing and storage
- Polars: a library for data manipulation, with a well-structured API that is expressive and easy to use