# **Constant Model, Loss, and Transformations**

### **The Modeling Process** ###

Modeling asks: how can we use a set of $x$ variables to predict the value of a $y$ variable?
* Can we use height to predict weight?
* Do more hours studying lead to better grades?


In order to predict target variables as functions of specific features, we establish the following workflow:
1) Choose a model - how should we represent the world?
2) Choose a loss function - how do we quantify prediction error?
3) Fit the model - how do we choose the best parameters of our model given our data?
4) Evaluate model performance - how do we evaluate whether this process gave rise to a good model?


Prediction vs Estimation: 
* **Prediction** is the task of using a model to predict outputs for unseen data
* **Estimation** is the task of using data to calculate model parameters 

For example, in Simple Linear Regression we have 

$$\hat{y} = \hat{\theta}_0 + \hat{\theta}_1 x$$
* We **estimate** the parameters by minimizing average loss (often through least squares estimation, i.e., minimizing MSE)
* We **predict** using these estimates
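
The split between estimation and prediction can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's reference implementation: the helper names `fit_slr` and `predict` and the data are made up here, and the intercept and slope are written as `theta_0` and `theta_1`.

```python
from statistics import mean

def fit_slr(x, y):
    """Estimation: least squares intercept and slope for y ~ theta_0 + theta_1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    theta_1 = sxy / sxx
    theta_0 = y_bar - theta_1 * x_bar
    return theta_0, theta_1

def predict(theta_0, theta_1, x_new):
    """Prediction: apply the fitted parameters to a (possibly unseen) input."""
    return theta_0 + theta_1 * x_new

# Hypothetical data lying exactly on the line y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

theta_0, theta_1 = fit_slr(x, y)          # estimate parameters from data
y_hat = predict(theta_0, theta_1, 5.0)    # predict for an input not in the data
```

Because the toy data are perfectly linear, the estimates recover the intercept 1 and slope 2, and the prediction at $x = 5$ is 11.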

### **Evaluating the SLR Model** ###

How "good" are the predictions made by this "best" fit model?

1) Visualize data and compute statistics 
    * Plot the original data 
    * Compute each column's mean and standard deviation (if those of our predictions are close to those of the original observed $y$ values, we might be inclined to say our model did well)
    * Compute $r$, the correlation coefficient. A large magnitude of $r$ between the feature and the response could also indicate our model has done well

2) Performance metrics 
    * We can take the **Root Mean Squared Error (RMSE)**
    * A lower RMSE indicates more "accurate" predictions, as we have lower "average loss" across the data 

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}$$

3) Visualization 
    * Look at the residual plot of $e_i = y_i - \hat{y}_i$ to visualize differences between actual and predicted values 
    * A good residual plot should not show any pattern between inputs $x$ and residual values $e_i$
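
As a rough sketch of the RMSE and residual computations above, using only the standard library. The function names and the observed and predicted values are hypothetical and chosen purely for illustration.

```python
import math

def rmse(y, y_hat):
    """Root mean squared error between observed y and predicted y_hat."""
    n = len(y)
    return math.sqrt(sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)) / n)

def residuals(y, y_hat):
    """Residuals e_i = y_i - y_hat_i, the quantity shown in a residual plot."""
    return [yi - yhi for yi, yhi in zip(y, y_hat)]

# Hypothetical observed values and model predictions
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 5.0, 8.0, 9.0]

e = residuals(y, y_hat)   # [1.0, 0.0, -1.0, 0.0]
score = rmse(y, y_hat)    # sqrt((1 + 0 + 1 + 0) / 4) = sqrt(0.5)
```

Plotting `e` against the inputs $x$ (for example with a scatter plot) gives the residual plot described in step 3.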

### **Constant Model + MSE** ###

* **Constant** models are also known as summary statistics
* A constant model always predicts the *same constant number*
* It employs a **simplifying assumption** and does not take into account the relationships between variables

$$\hat{y} = \theta_0$$

#### **What is the optimal** $\theta_0$ **for L2 Loss (MSE)** ####
* What number should we guess each time in order to have the lowest possible **average loss?**
* If we use Mean Squared Error as our loss function, then with a constant model our loss function is minimized at $\hat{\theta_0} = \bar{y}$
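
A short derivation of this result, writing $R(\theta_0)$ for the average squared loss of the constant model:

$$R(\theta_0) = \frac{1}{n} \sum_{i=1}^n (y_i - \theta_0)^2$$

$$\frac{dR}{d\theta_0} = \frac{1}{n} \sum_{i=1}^n -2\,(y_i - \theta_0) = -\frac{2}{n} \left( \sum_{i=1}^n y_i - n\theta_0 \right)$$

Setting the derivative equal to $0$ gives

$$\sum_{i=1}^n y_i - n\hat{\theta_0} = 0 \quad \Longrightarrow \quad \hat{\theta_0} = \frac{1}{n} \sum_{i=1}^n y_i = \bar{y}$$

The second derivative is $\frac{d^2R}{d\theta_0^2} = 2 > 0$, so this critical point is a minimum.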

#### **What is the optimal** $\theta_0$ **for L1 Loss (MAE)?** ####
* What number should we guess each time in order to have the lowest possible **average loss?**

* If we use Mean Absolute Error as our loss function, then with a constant model our loss function is minimized at $\hat{\theta_0} = \text{median}(y)$
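
MAE is not differentiable everywhere, so one way to build intuition is a numerical check rather than a derivation. The sketch below does a simple grid search over candidate constants on made-up data with an odd number of observations. The grid search is only an illustration device, not how the median is normally computed.

```python
from statistics import median

def mae(theta, y):
    """Average absolute loss of the constant prediction theta."""
    return sum(abs(yi - theta) for yi in y) / len(y)

# Hypothetical data with an odd number of observations
y = [1.0, 2.0, 4.0, 8.0, 20.0]

# Candidate constants 0.0, 0.1, ..., 25.0
candidates = [i / 10 for i in range(0, 251)]

# The candidate with the lowest average absolute loss
best = min(candidates, key=lambda t: mae(t, y))
```

Here `best` coincides with `median(y)`, the middle observation of the sorted data.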

#### **Summary: Loss Optimization, Calculus, and Critical Points** ####

First, define the **objective function** as average loss
* Plug in L1 or L2 loss 
* Plug in the model so that the resulting expression is a function of $\theta$

Then, find the minimum of the objective function
1) Differentiate with respect to $\theta$
2) Set equal to $0$
3) Solve for $\hat{\theta}$
4) (If we have multiple parameters) repeat 1-3 with partial derivatives

MSE has a property - **convexity** - that guarantees $R(\hat{\theta})$ is a global minimum
* MAE is also convex

#### **Comparing Loss Functions** ####

Shape
* The loss surface of MSE is smooth - easy to minimize
* The loss surface of MAE is piecewise linear, with a kink at each data point where it is not differentiable


<img src="https://ds100.org/course-notes/constant_model_loss_transformations/images/mse_loss_26.png" alt="Plot of the MSE loss surface" width="500" height="300">

<img src="https://ds100.org/course-notes/constant_model_loss_transformations/images/mae_loss_infinite.png" alt="Plot of the MAE loss surface" width="500" height="300">



Outliers: 

* Under MSE, the optimal parameter $\hat{\theta}$ is strongly affected by the presence of outliers 
* Under MAE, the optimal parameter is not as influenced by outlying data 

So, MSE is **sensitive** to outliers, while MAE is **robust** to outliers 
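
A small illustration of this difference for the constant model, where the MSE-optimal constant is the mean and the MAE-optimal constant is the median. The data and the outlier value are invented for the example.

```python
from statistics import mean, median

# Hypothetical data, and the same data with the last value replaced by an outlier
y = [10.0, 11.0, 12.0, 13.0, 14.0]
y_outlier = y[:-1] + [1000.0]

# Optimal constant under MSE (mean) before and after adding the outlier
theta_mse_clean = mean(y)
theta_mse_outlier = mean(y_outlier)

# Optimal constant under MAE (median) before and after adding the outlier
theta_mae_clean = median(y)
theta_mae_outlier = median(y_outlier)
```

The single outlier moves the mean from 12 to about 209, while the median stays at 12.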

Single optimal value
* MSE has a **unique** solution for $\hat{\theta}$ 

* MAE is not guaranteed to have a unique solution: when the number of observations in $y$ is even, any value between the two middle observations (inclusive) minimizes the MAE
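
The non-uniqueness under MAE can be checked directly on a tiny made-up dataset with an even number of observations. The sketch evaluates the average absolute loss at several constants between the two middle values, and at one constant outside them.

```python
def mae(theta, y):
    """Average absolute loss of the constant prediction theta."""
    return sum(abs(yi - theta) for yi in y) / len(y)

# Hypothetical data with an even number of observations; middle values are 2 and 6
y = [1.0, 2.0, 6.0, 9.0]

loss_at_2 = mae(2.0, y)      # lower middle value
loss_at_3_5 = mae(3.5, y)    # a point strictly between the middle values
loss_at_6 = mae(6.0, y)      # upper middle value
loss_at_7 = mae(7.0, y)      # outside the middle interval
```

The first three losses are identical, so every constant in $[2, 6]$ is optimal, while moving outside that interval increases the loss.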

### **Transformations to Fit Linear Models** ###

Even if we transform specific features, all of the ideas above still apply as long as the model remains **linear in its parameters** $\theta = [\theta_0, \theta_1]$

* In terms of specific transformations to use, refer to the Bulge Diagram from the Visualization Notes
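
As one hedged illustration, suppose the response grows exponentially in $x$. Taking the log of $y$ is an assumption made for this example (the right transformation depends on the data). It makes the relationship linear in $\theta_0$ and $\theta_1$, so ordinary least squares applies to the transformed data. The helper `fit_slr` and the data are hypothetical.

```python
import math
from statistics import mean

def fit_slr(x, y):
    """Least squares intercept and slope for y ~ theta_0 + theta_1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    theta_1 = sxy / sxx
    return y_bar - theta_1 * x_bar, theta_1

# Hypothetical data generated from y = 2 * exp(0.5 * x), which is not linear in x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

# Transform the response: log(y) = log(2) + 0.5 * x is linear in the parameters
log_y = [math.log(yi) for yi in y]
theta_0, theta_1 = fit_slr(x, log_y)

# Undo the transformation to get predictions on the original scale
y_hat = [math.exp(theta_0 + theta_1 * xi) for xi in x]
```

The fitted slope is approximately 0.5 and the fitted intercept approximately $\log 2$, matching the values used to generate the data.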