# Initial research and methodology

## 1. Overview

Since we are building a regression model, there are several models we can choose from:
- Linear Regression
- Ridge Regression
- Lasso Regression
- ElasticNet Regression
- Decision Tree Regression
- Random Forest Regression

We will use cross-validation to pick the best of these models. Before that, let's dive a little deeper into what we will be working with, to better understand our process.

## 2. Model and methodology research

### 2.1 How to evaluate the models - what metrics we are using

First, let's explore what metrics we are using to evaluate our models. For regression models, we typically use the following metrics:
- Mean Absolute Error (MAE): Average of the absolute differences between predictions and actual values.
- Mean Squared Error (MSE): Average of the squared differences between predictions and actual values.
- Root Mean Squared Error (RMSE): Square root of the MSE, which brings the error back to the unit of the target.
- R-squared (Coefficient of Determination): Measures the proportion of variance in the dependent variable that is predictable from the independent variables.

For our analysis, we will use MAE and RMSE to understand the prediction error in terms of our price unit (million VND), and R-squared to understand how well our model explains the variance in housing prices. We will also measure the time it takes to run each model (the cost in time).
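
As a minimal sketch of how the error metrics above are computed, here is a NumPy version. The function name `regression_errors` and the sample prices are made up for illustration and are not taken from our dataset.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MAE, MSE and RMSE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))   # average absolute error
    mse = np.mean(residuals ** 2)      # average squared error
    rmse = np.sqrt(mse)                # back in the unit of the target
    return {"mae": mae, "mse": mse, "rmse": rmse}

# Hypothetical prices in million VND (illustration only)
actual = [3000, 4500, 5200, 2800]
predicted = [3100, 4300, 5000, 3000]

errors = regression_errors(actual, predicted)
print(errors)  # mae = 175.0, mse = 32500.0, rmse is about 180.28
```

Because MAE and RMSE are in the same unit as the target, both numbers can be read directly as "million VND off, on average", with RMSE weighting large misses more heavily.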

I'd like to take some time to talk a bit more about R-squared, since it can be a bit difficult to understand at first. The formula for R-squared is:

$$R^2 = \frac{Var(mean) - Var(fit)}{Var(mean)}$$


Where `Var(mean)` describes the spread of the data points around their mean, while `Var(fit)` describes the spread of the data points around the regression line. To explain how R-squared works, let's imagine 2 scenarios.

1. If all the points fall on the regression line:

In this scenario, `Var(mean)` has some value, which we will call V, but `Var(fit)` has a value of 0. Because all the data points fall on the regression line, the distance from every point to the regression line is 0 (which also means the model, i.e. the regression line, makes perfect predictions).

In this case, the value of R-squared would be 1, because:

$$\frac{Var(mean) - Var(fit)}{Var(mean)} = \frac{Var(mean) - 0}{Var(mean)} = 1$$

Which we can interpret as: the dependent variable is perfectly explained by the independent variables.

2. If the regression line is no better than the mean line:

Let's pretend there is a scenario where the distance of the data points to the regression line is the same as their distance to the mean of the data set. This regression line must be a very bad model, because it does no better a job than the mean line of the data set.

In this case, the value of R-squared would be 0, because:

$$\frac{Var(mean) - Var(fit)}{Var(mean)} = \frac{Var(mean) - Var(mean)}{Var(mean)} = 0$$

Which we can interpret as: the model explains none of the variance in the dependent variable, so the independent variables (as used by this model) tell us nothing about it.

**In conclusion**: we interpret R-squared as how much of the variance in the dependent variable (the value we want to predict) can be explained by the independent variables (the values we use to predict). Its value usually ranges from 0 to 1. It can also be negative, which happens when the model fits worse than the mean line, but I'm not going to dive into that here.
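
The two scenarios can be checked numerically. The sketch below is only an illustration: the helper `r_squared` and the toy data are assumptions. It uses sums of squared distances instead of variances, which gives the same ratio because both terms are divided by the same number of points.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-squared as (Var(mean) - Var(fit)) / Var(mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    var_mean = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    var_fit = np.sum((y_true - y_pred) ** 2)          # spread around the fit
    return (var_mean - var_fit) / var_mean

y = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y, y)                            # scenario 1: every point on the line
mean_only = r_squared(y, np.full_like(y, y.mean()))  # scenario 2: no better than the mean
worse = r_squared(y, y[::-1])                        # predictions worse than the mean

print(perfect, mean_only, worse)  # 1.0 0.0 -3.0
```

The third case shows where a negative value comes from: the predictions are further from the data than the mean line is.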


### 2.2 How each model works

#### 2.2.1 Multiple Linear Regression

Multiple linear regression is linear regression applied to multiple dimensions (the number of dimensions depends on the number of predictors in our analysis).

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_p X_p + \varepsilon$$

Where:

- $Y$ is the dependent variable (the variable being predicted).
- $X_1, X_2, X_3, \ldots, X_p$ are the independent variables (predictors).
- $\beta_0$ is the intercept (constant term).
- $\beta_1, \beta_2, \beta_3, \ldots, \beta_p$ are the coefficients or slopes that represent the effect of each independent variable on the dependent variable.
- $\varepsilon$ represents the error term, capturing the difference between the actual and predicted values.

The main idea of multiple linear regression is pretty much the same as simple linear regression: we're trying to minimize the Mean Squared Error (MSE) of the fitted regression line (or plane, in higher dimensions).
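
To make the idea concrete, here is a small sketch that fits the coefficients by least squares with NumPy. The synthetic data and the "true" coefficients are assumptions chosen for illustration. They are not our housing data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 observations, 3 predictors (illustration only)
X = rng.normal(size=(200, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_beta + rng.normal(scale=0.1, size=200)

# Add a column of ones so the first coefficient plays the role of beta_0
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares: the coefficients that minimize the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(np.round(beta_hat, 2))  # close to [4.0, 2.0, -1.0, 0.5]
```

Minimizing the sum of squared residuals gives the same coefficients as minimizing the MSE, since the two only differ by the constant factor 1/n.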

#### 2.2.2 Ridge Regression

Ridge regression is another type of regression where, instead of only finding the regression that best fits the training data, the model adds a penalty term to the least squares method.

By doing this, the model aims to minimize the sum of squared residuals while also penalizing large coefficients, which can mitigate the problem of overfitting.

The quantity that Ridge Regression minimizes is:

$$\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip})^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

Where:
- $y_i$ is the observed value of the dependent variable.
- $x_{i1}, x_{i2}, \ldots, x_{ip}$ are the independent variables.
- $\beta_0, \beta_1, \ldots, \beta_p$ are the coefficients being estimated.
- $p$ is the number of predictors.
- $n$ is the number of observed data points.
- $\lambda$ is the regularization parameter (the hyperparameter of this model) that determines the strength of the penalty term.

This regression is very similar to Lasso Regression, as we will see below. Ridge regression is effective in situations where many predictors are relevant.
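
The effect of $\lambda$ can be seen with a short NumPy sketch. It uses the closed-form solution of the penalized least squares problem on centered data, so that the intercept is not penalized, matching the formula above. The function name `ridge_fit` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum of squared residuals + lam * sum(beta_j ** 2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centering keeps beta_0 unpenalized
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([3.0, -2.0, 0.0]) + rng.normal(scale=0.5, size=100)

lambdas = (0.0, 10.0, 1000.0)
fits = {lam: ridge_fit(X, y, lam) for lam in lambdas}
for lam, (intercept, beta) in fits.items():
    print(f"lambda={lam:>7}: beta={np.round(beta, 3)}")
```

With `lam=0.0` this is ordinary least squares. As `lam` grows, all coefficients shrink toward 0 together, but none of them becomes exactly 0.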

#### 2.2.3 Lasso Regression

Lasso Regression is very similar to Ridge Regression, as we will see from their formulas. The only difference in the formula is that instead of squaring the slopes, Lasso uses their absolute values.

The practical difference is that Ridge regression can shrink coefficients toward 0 but never makes them exactly 0. With Lasso Regression, the coefficients of irrelevant predictors can be shrunk all the way to 0, which excludes those predictors from the model.

The quantity that Lasso Regression minimizes is:

$$\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip})^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

Where:
- $y_i$ is the observed value of the dependent variable.
- $x_{i1}, x_{i2}, \ldots, x_{ip}$ are the independent variables.
- $\beta_0, \beta_1, \ldots, \beta_p$ are the coefficients being estimated.
- $p$ is the number of predictors.
- $n$ is the number of observed data points.
- $\lambda$ is the regularization parameter (the hyperparameter of this model) that determines the strength of the penalty term.

Lasso regression is effective in situations where there are many predictors but only a subset of them might be relevant.
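
To illustrate how Lasso sets coefficients to exactly 0, here is a minimal coordinate descent sketch in NumPy. Coordinate descent with soft-thresholding is one common way to solve the Lasso problem. It is shown here as an assumption for illustration, not as the solver we use. The names `soft_threshold` and `lasso_fit` and the synthetic data are hypothetical.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward 0 by threshold, and return 0 if it is within the threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_fit(X, y, lam, n_iter=200):
    """Minimize sum of squared residuals + lam * sum(|beta_j|)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centering keeps beta_0 unpenalized
    p = Xc.shape[1]
    beta = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of feature j removed
            partial_resid = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Synthetic data: only the first 2 of 5 predictors are relevant
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

intercept, beta = lasso_fit(X, y, lam=100.0)
print(np.round(beta, 3))  # the last three coefficients are exactly 0
```

The soft-threshold step is what produces exact zeros: if a predictor's correlation with the residual is smaller than the threshold, its coefficient is set to 0 instead of merely being made small.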


#### 2.2.4 ElasticNet Regression

ElasticNet Regression is a hybrid regularization technique that combines the penalties from both Lasso (L1 regularization) and Ridge (L2 regularization) regression. It aims to overcome the limitations of each method by providing a compromise between L1 and L2 regularization.

The quantity that ElasticNet Regression minimizes is:

$$\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip})^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2$$

Where:
- $y_i$ is the observed value of the dependent variable.
- $x_{i1}, x_{i2}, \ldots, x_{ip}$ are the independent variables.
- $\beta_0, \beta_1, \ldots, \beta_p$ are the coefficients being estimated.
- $p$ is the number of predictors.
- $n$ is the number of observed data points.
- $\lambda_1$ and $\lambda_2$ are the regularization parameters (the hyperparameters of this model) that determine the strength of the L1 and L2 penalty terms.

ElasticNet regression is a versatile technique that combines the strengths of Lasso and Ridge regularization methods, offering flexibility in handling multicollinearity, feature selection, and model complexity in linear regression settings.

Limitations:
- High computational complexity
- Difficulty in fine-tuning parameters
- Less effective on highly sparse or highly dense data (a sparse dataset where most predictors are irrelevant is better handled by Lasso Regression, and a dense dataset with a large number of highly correlated predictors is better handled by Ridge Regression).
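
As a hedged sketch of how the two penalties are set in practice, the example below assumes scikit-learn's `ElasticNet`. That class does not take $\lambda_1$ and $\lambda_2$ directly: it uses `alpha` (overall strength) and `l1_ratio` (share of the L1 penalty), and it scales the squared-error term by $1/(2n)$, so its values are not on the same scale as the formula above. The synthetic data is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data: only the first 2 of 5 predictors are relevant
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# alpha: overall penalty strength; l1_ratio: 1.0 is pure Lasso, 0.0 is pure Ridge
model = ElasticNet(alpha=0.5, l1_ratio=0.5)
model.fit(X, y)

coefs = model.coef_
print(np.round(coefs, 3))
```

The L1 part removes the irrelevant predictors, while the L2 part additionally shrinks the coefficients that remain. Having two values to tune instead of one is what makes fine-tuning harder than for Ridge or Lasso alone.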

#### 2.2.5 Decision Tree Regression

Decision tree regression is a non-parametric supervised learning method for regression tasks, which employs a tree structure to predict continuous numeric values.

A regression tree works by using a hierarchical, tree-like structure consisting of nodes and branches. Each internal node represents a test on an attribute, and each branch represents an outcome of that test. The predicted continuous values are stored in the leaf nodes.

At each node, the decision tree algorithm selects the best feature and the optimal threshold to split the dataset, where optimal means minimizing the variance of the target variable within the subsets that result from the split.

When working with many predictors, Decision Tree Regression evaluates candidate splits for every predictor, picks the candidate with the smallest sum of squared residuals to be the root, and continues to split the observations into smaller groups in the same way.

Splitting is very important in Decision Tree Regression, as it determines how well the tree groups observations around optimal thresholds. Usually, we can affect the splitting process by tuning the hyperparameters of the algorithm (maximum depth, minimum samples per leaf, minimum samples per split).
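
The search for a split on a single predictor can be sketched as follows. This is a simplified illustration, not a full tree implementation: the function name `best_split` and the tiny area/price data are made up.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, ssr) of the split on one feature with the smallest SSR."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_threshold, best_ssr = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # equal values cannot be separated by a threshold
        threshold = (x[i] + x[i - 1]) / 2.0
        left, right = y[:i], y[i:]
        # Sum of squared residuals around the mean of each side
        ssr = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if ssr < best_ssr:
            best_threshold, best_ssr = threshold, ssr
    return best_threshold, best_ssr

# Hypothetical data: area in square meters, price in million VND
area = [30, 35, 40, 80, 90, 100]
price = [1000, 1100, 1050, 3000, 3200, 3100]

threshold, ssr = best_split(area, price)
print(threshold, ssr)  # 60.0 25000.0
```

A full tree repeats this search for every predictor, keeps the best (predictor, threshold) pair, and then applies the same procedure to each resulting group until a stopping rule such as maximum depth or minimum samples per leaf is hit.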

Advantages:
- Interpretability: It is easy to interpret and visualize.
- Handling non-linearity: It can capture non-linear relationships between features and the target variable.

Limitations:
- Overfitting: Decision trees are prone to overfitting when the tree is too deep or the number of samples per leaf is too small.

#### 2.2.6 Random Forest Regression

Random Forest Regression is a type of non-parametric regression model that is built upon Decision Tree Regression.

The way it works is quite complicated compared to the other models, but also really cool.

1. Create random subsets of data and features:

Random Forest builds multiple decision trees by sampling the dataset with replacement (bootstrap sampling) to create a different training set for each tree.

Additionally, at each node of each tree, it considers only a random subset of features (a process called feature bagging).

2. Construct the decision trees:

Each decision tree in the Random Forest is grown independently, typically with a limited depth or other stopping criteria.
The trees are trained on different subsets of the data and features, leading to a diverse set of trees. At each step of building a tree, we only consider a random subset of k features and pick one of them to create the next split, until we reach a leaf (when the sample becomes too small to divide further).

3. Aggregate the predictions:

When making predictions for a new data point, each tree in the forest independently predicts the target variable.
For regression tasks, the final prediction is calculated by averaging (or taking the median of) the predictions from all individual trees.

It's worth noting that bootstrap sampling allows duplicates in a sample, which means that each tree is built without seeing a subset of the samples (typically about 1/3 of the data set). These are called the out-of-bag samples. We can use them to test the model, which helps us measure how accurate our forest is.
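
The "about 1/3" figure can be checked with a quick simulation. This sketch only draws one bootstrap sample of row indices with NumPy. The sample size of 10,000 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000
indices = np.arange(n_samples)

# Bootstrap: draw n_samples row indices with replacement (duplicates allowed)
bootstrap = rng.choice(indices, size=n_samples, replace=True)

# Rows that were never drawn are the out-of-bag samples for this tree
in_bag = np.unique(bootstrap)
out_of_bag = np.setdiff1d(indices, in_bag)

oob_fraction = len(out_of_bag) / n_samples
print(round(oob_fraction, 3))  # roughly 0.37
```

The fraction comes from the chance that a given row is never picked in n draws, $(1 - 1/n)^n$, which approaches $1/e \approx 0.368$ for large n.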

One cool thing about Random Forest is that it has a particularly effective way to fill in missing data, using a proximity matrix.

Advantages:
- Reduced overfitting: Combining multiple trees reduces overfitting compared to a single decision tree by averaging out individual tree predictions.
- High Predictive Accuracy: They often provide high predictive accuracy and generalize well to new data.
- Handling Large Feature Spaces: It can handle a large number of features and automatically select important features by using feature bagging.

Limitations:
- Complexity and Interpretability: Random Forests can be more complex and less interpretable than a single decision tree.
- Computationally Intensive: Training a large number of trees can be computationally expensive for large datasets. This will become a big deal as you will see with our analysis below.

Similar to Decision Tree Regression, Random Forest also has many hyperparameters to tune: `number of trees`, `maximum depth of trees`, `minimum samples per split`, `minimum samples per leaf`, `maximum features`.
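
As a rough sketch of how these hyperparameters look in code, the example below assumes scikit-learn's `RandomForestRegressor`. The parameter values and the synthetic non-linear data are arbitrary choices for illustration, not tuned settings from our analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic non-linear data (illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(
    n_estimators=100,     # number of trees
    max_depth=8,          # maximum depth of each tree
    min_samples_split=4,  # minimum samples needed to split a node
    min_samples_leaf=2,   # minimum samples allowed in a leaf
    max_features=2,       # features considered at each split
    oob_score=True,       # evaluate on the out-of-bag samples
    random_state=0,
)
forest.fit(X, y)

oob_r2 = forest.oob_score_  # R-squared measured on out-of-bag samples
print(round(oob_r2, 3))
```

With `oob_score=True`, each sample is scored only by the trees that did not see it during training, which gives an estimate of accuracy without setting aside a separate test set.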


<!-- Here's the `TL;DR` version -->

#### TL;DR

| Regression Model | What it is about | Number of hyperparameters |
| --- | --- | --- |
| Multiple Linear Regression | Fits the best line (or, in higher dimensions, a plane) to the sample points | 0 |
| Ridge Regression | Introduces bias to avoid overfitting the training data. Works well with highly correlated data | 1 |
| Lasso Regression | Similar to Ridge Regression, suitable for sparse datasets with potentially irrelevant features | 1 |
| ElasticNet Regression | Combines Ridge and Lasso Regression techniques | 2 |
| Decision Tree Regression | Uses a tree structure to handle non-linearity, can be prone to overfitting | many |
| Random Forest Regression | Ensemble of decision trees, improves upon Decision Tree Regression by using multiple trees | many |


### 2.3 Model selection

For model selection, we are going to use K-fold Cross Validation to evaluate the performance of each regression model and determine the most suitable one for our dataset.

#### 2.3.1 K-fold Cross Validation

The problem with the simple method of separating the data into 2 categories, `test` and `train`, is that the `train` data might not be representative of the population. The model can be mistrained that way, leading to a bad score when tested. This can happen when some feature values are missing from the `train` data or appear in an uneven ratio.

To solve this problem, we can use the K-fold split method. Unlike the simple method, the K-fold split method divides the data into multiple folds (a fold is a subset of the data, with the data divided equally by the number of folds) and trains the model on those folds. While this works better than the simple method, it can still run into the problem of missing feature values and uneven ratios.

To solve that, we use the Stratified K-fold split method, which works just like the K-fold split method but makes sure to take the feature values and their ratios into account when splitting the data set. This solves the problem of properly representing the features in the data used for `test` and `train`.

K-fold Cross Validation (note that we are using the Stratified K-fold splitting method) is a resampling technique used to evaluate the performance of a machine learning model. Essentially, it trains and tests the model on multiple splits of the dataset and combines the resulting scores to estimate model performance. This method is very handy for model selection and hyperparameter tuning (which is what we are trying to do now).
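
Putting the pieces together, a model comparison could look roughly like the sketch below. It assumes scikit-learn. The synthetic data, the hyperparameter values and the dictionary names are placeholders. Since scikit-learn's `StratifiedKFold` stratifies on class labels, the sketch bins the continuous target into quartiles and stratifies on those bins, which is one possible choice and an assumption here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (prices in million VND)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3000 + 500 * X[:, 0] - 300 * X[:, 1] + rng.normal(scale=100, size=300)

# Bin the target into quartiles so the folds have similar price distributions
edges = np.quantile(y, [0.25, 0.5, 0.75])
y_bins = np.digitize(y, edges)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
splits = list(skf.split(X, y_bins))

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),
    "DecisionTree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}
scoring = {
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
    "r2": "r2",
}

results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=splits, scoring=scoring)
    results[name] = {
        "mae": -cv["test_mae"].mean(),    # scorers are negated errors
        "rmse": -cv["test_rmse"].mean(),
        "r2": cv["test_r2"].mean(),
        "fit_time": cv["fit_time"].mean(),  # seconds per fold
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Every model is scored on the same five folds, so the averaged MAE, RMSE, R-squared and fit time can be compared directly across models.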