# Modelling Techniques in ML
Random Forest and Decision Tree Regression are both machine learning techniques used for regression tasks, where the goal is to predict continuous numerical values based on input features.

**Decision Tree Regression**: Decision trees are a popular and intuitive machine learning model. In Decision Tree Regression, the algorithm builds a tree structure where each internal node represents a decision based on a feature, and each leaf node holds the output value (the prediction). At each node the algorithm picks the split that most reduces the prediction error of the resulting groups, typically measured by mean squared error (variance reduction) or mean absolute error. Gini impurity and entropy are the analogous criteria for classification trees. Decision trees are easy to interpret and visualize, but they are prone to overfitting if not properly regularized.
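To make the split decision concrete, here is a minimal sketch of how one split on a single numeric feature could be chosen by minimizing the squared error of the two resulting groups. It is not scikit-learn's implementation; the helper names `sse` and `best_split` and the toy data are assumptions made for illustration.

```python
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) for the best single split on one feature.

    Hypothetical helper: tries the midpoint between each pair of
    consecutive distinct feature values.
    """
    pairs = sorted(zip(xs, ys))
    best = (None, sse(ys))  # no split: error of predicting the overall mean
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Toy data: number of rooms vs. price (made-up values)
rooms = [1, 2, 2, 3, 4, 5]
prices = [300, 320, 310, 600, 620, 640]
threshold, error = best_split(rooms, prices)
print(threshold, error)
```

A full tree repeats this search recursively inside each group and across all features; each leaf then predicts the mean target value of the training rows that reach it.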

**Random Forest Regression**: Random Forest is an ensemble learning technique that builds multiple decision trees and combines their predictions to produce a more robust and accurate model. In Random Forest Regression, each tree is trained on a random sample of the rows (a bootstrap sample, drawn with replacement), and each split can be restricted to a random subset of the features. During prediction, the output is the average of the individual tree predictions. Random Forests reduce overfitting compared to individual decision trees and generally offer better performance. They also provide feature importance scores, which can be useful for feature selection.
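The idea of training many trees on resampled data and averaging their predictions can be sketched in a few lines. The example below is a simplified illustration, not scikit-learn's algorithm: each "tree" is a one-split stump, there is only one feature (so feature subsampling is omitted), and all names and data are made up.

```python
import random
from statistics import fmean


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a "stump") on a single feature.

    Returns (threshold, left_mean, right_mean). Illustrative helper only.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        lm, rm = fmean(left), fmean(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, threshold, lm, rm)
    if best is None:  # all x values identical: predict the mean everywhere
        m = fmean(ys)
        return (xs[0], m, m)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return fmean([predict_stump(stump, x) for stump in forest])


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 12, 11, 13, 30, 32, 31, 33]
forest = fit_forest(xs, ys)
low, high = predict_forest(forest, 2), predict_forest(forest, 7)
print(low, high)
```

Because every stump sees a slightly different sample, individual stumps disagree a little; averaging smooths out those differences, which is the source of the robustness described above.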


This article briefly covers both techniques. Here are the reasons you might choose one technique over the other:

**Decision Tree Regression**:
- You want a simple and interpretable model that can be easily visualized.
- The dataset is relatively small and has low to moderate complexity.
- You want to understand the decision-making process of the model.
- You are not concerned about overfitting or have methods to mitigate it, such as pruning.

**Random Forest Regression**:
- You have a larger dataset with higher dimensionality.
- The dataset is noisy or contains outliers, as Random Forests are more robust to such issues.
- You are prioritizing predictive accuracy over interpretability.
- You want to reduce the risk of overfitting, as Random Forests tend to generalize better than individual decision trees.
- You want to assess feature importance or perform feature selection.

## Introduction to DataFrame: Review Your Data

In [7]:
import pandas as pd
from pandas import DataFrame

df: DataFrame = pd.read_csv('./data/melb_data.csv')
df.describe()  # get summary of dataframe     

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


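- count: the number of rows with non-missing values.
- mean: the average.
- std: the standard deviation, which measures how spread out the values are.
  - For example, homes in this dataset have about 3 rooms on average (mean 2.938, standard deviation 0.956).
- min: the smallest value.
- 25%: the 25th percentile (first quartile); a quarter of the values are at or below it.
- 50%: the 50th percentile (median); half of the values are at or below it.
- 75%: the 75th percentile (third quartile); three quarters of the values are at or below it.
- max: the largest value.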
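To see where these numbers come from, the same statistics can be computed by hand with Python's standard library. This is only a sketch on a made-up list of room counts, not the Melbourne data; `method="inclusive"` interpolates linearly between data points, which should agree with the pandas default for percentiles.

```python
from statistics import fmean, quantiles, stdev

rooms = [2, 2, 3, 3, 4, 1, 3, 5]  # made-up sample, not the Melbourne data

# Quartiles with linear interpolation between neighbouring data points
q1, median, q3 = quantiles(rooms, n=4, method="inclusive")

summary = {
    "count": len(rooms),
    "mean": fmean(rooms),
    "std": stdev(rooms),  # sample standard deviation (divides by n - 1)
    "min": min(rooms),
    "25%": q1,
    "50%": median,
    "75%": q3,
    "max": max(rooms),
}
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that `count` here is simply the list length; `describe()` additionally skips missing values, which is why some columns above report fewer than 13580 rows.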

In [8]:
# Get all columns name
print(df.columns)  # property


# Get Average Price
# df['Price'].mean()
print(df.Price.mean())


# Get the oldest and the newest building
print(df.YearBuilt.min())  # oldest: earliest year built
print(df.sort_values('YearBuilt', ascending=False)['YearBuilt'].iloc[0])  # newest: same value as df.YearBuilt.max()

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')
1075684.079455081
1196.0
2018.0


## Scikit-learn 
##### *to implement machine learning models and statistical modelling*
___

Scikit-learn is the most popular library for modeling the types of data typically stored in *DataFrames*.
Through scikit-learn, we can implement various machine learning models for regression, classification, clustering, and statistical tools for analyzing these models.


<div>
<img src="attachment:4f344ee5-b1c9-4cd8-b2a0-4517f35afd4f.png" width="500"/>
</div>

## Decision Tree Regressor

In [9]:
from sklearn.tree import DecisionTreeRegressor


y = df.Price
X = df[['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']]

print(X.describe())
print(X.head())


melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

              Rooms      Bathroom       Landsize     Lattitude    Longtitude
count  13580.000000  13580.000000   13580.000000  13580.000000  13580.000000
mean       2.937997      1.534242     558.416127    -37.809203    144.995216
std        0.955748      0.691712    3990.669241      0.079260      0.103916
min        1.000000      0.000000       0.000000    -38.182550    144.431810
25%        2.000000      1.000000     177.000000    -37.856822    144.929600
50%        3.000000      1.000000     440.000000    -37.802355    145.000100
75%        3.000000      2.000000     651.000000    -37.756400    145.058305
max       10.000000      8.000000  433014.000000    -37.408530    145.526350
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
0      2       1.0     202.0   -37.7996    144.9984
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
3      3       2.0      94.0   -37.7969    144.9969
4      4       1.0     120.0   -37.8072    144.

**melbourne_model.fit(X, y)**: fits (trains) the decision tree model on the training data, where X is the feature matrix and y is the target vector. The model learns to predict housing prices (y) based on the features (X).

After executing this code, melbourne_model will be a trained decision tree regression model capable of making predictions on new data. You can then use this model to predict housing prices for new instances or evaluate its performance using appropriate metrics.

In [10]:
print("Making predictions for the following 5 houses:")
print("Model Prediction:\n", melbourne_model.predict(X.head()))
print("Actual Prices: \n", list(df.Price.head()))

Making predictions for the following 5 houses:
Model Prediction:
 [1480000. 1035000. 1465000.  850000. 1600000.]
Actual Prices: 
 [1480000.0, 1035000.0, 1465000.0, 850000.0, 1600000.0]


Unsurprisingly, the predictions match the actual prices exactly, because we asked the model to predict houses it was trained on. Next, let's look at how to validate and test the model on data it has not seen.

## MAE
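There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (MAE). For each house, the prediction error is the difference between the actual and the predicted price: if a house cost $150,000 and you predicted it would cost $100,000, the absolute error is $50,000. MAE is the average of these absolute errors over all houses.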
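As a quick illustration of the definition, MAE can be computed by hand in plain Python. The function name `mean_absolute_error_manual` and the prices are made up for this sketch.

```python
def mean_absolute_error_manual(actual, predicted):
    """Average of |actual - predicted| over all pairs."""
    if len(actual) != len(predicted):
        raise ValueError("inputs must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual_prices = [150_000, 200_000, 310_000]     # made-up values
predicted_prices = [100_000, 210_000, 300_000]  # made-up values

# Absolute errors are 50,000, 10,000 and 10,000; MAE is their average
result = mean_absolute_error_manual(actual_prices, predicted_prices)
print(result)
```

Because the errors are taken as absolute values, over-predictions and under-predictions do not cancel each other out.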

Here is how we can calculate the MAE with scikit-learn:

In [11]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices_head = melbourne_model.predict(X.head())
print("MAE of head of dataset")
print(mean_absolute_error(df.Price.head(), predicted_home_prices_head))

print("MAE of entire dataset")
predicted_home_prices = melbourne_model.predict(X)
print(mean_absolute_error(y, predicted_home_prices))

MAE of head of dataset
0.0
MAE of entire dataset
1125.1804614629357


The measure we just computed is called an "in-sample" score: we used the same sample of houses both to build the model and to evaluate it.

This is misleading. The training data may contain spurious patterns that do not hold for new houses, and the model can learn them. Evaluated on the same data, those patterns look like accurate predictions, so the in-sample score overstates how well the model will perform on data it has not seen.

## Split Testing
The scikit-learn library has a function `train_test_split` to break up the data into two pieces. We'll use some of that data as training data to fit the model, and we'll use the other data as validation data to calculate mean_absolute_error.
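Conceptually, such a split just shuffles the rows and cuts them into two parts. The sketch below shows the idea in plain Python; `simple_train_test_split` is a hypothetical helper written for illustration and does not reproduce scikit-learn's exact shuffling or rounding.

```python
import random


def simple_train_test_split(rows, train_size=0.8, seed=1):
    """Shuffle row indices, then cut them into a training and a validation part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split reproducible
    cut = int(len(rows) * train_size)
    train = [rows[i] for i in indices[:cut]]
    val = [rows[i] for i in indices[cut:]]
    return train, val


rows = list(range(10))  # stand-in for 10 houses
train_rows, val_rows = simple_train_test_split(rows)
print(len(train_rows), len(val_rows))
```

The important property is that no row ends up in both parts, so the validation rows are genuinely unseen by the model.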

In [12]:
from sklearn.model_selection import train_test_split

"""Split arrays or matrices into random train and test subsets.
    Parameters
    ----------
    *arrays: sequence of indexables with same length / shape[0]
    test_size : float | int, default=None
        If float (from 0.0 to 1.0), represents the proportion of the dataset to include in the test split.
        If int, represents the absolute number of test samples.
        If None, it is set to the complement of train_size. If train_size is also None, test_size=0.25.
    train_size: float or int, default=None. The same as test_size, but for the train split.
    random_state: int | None. Controls the shuffling applied to the data before applying the split.
    shuffle: bool, default=True. Whether or not to shuffle the data before splitting.
    stratify: array-like, default=None. If not None, data is split in a stratified fashion, using
        this as the class labels.
        
    Returns
    -------
    splitting : list, length=2 * len(arrays)
"""
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, random_state=1)

# define the model and fit it on the training data only
melbourne_model = DecisionTreeRegressor(random_state=0)
melbourne_model.fit(X_train, y_train)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(X_val)  # type numpy.ndarray
print(type(val_predictions))
print("MAE:", mean_absolute_error(y_val, val_predictions))
print("Average Price:", df.Price.mean())

<class 'numpy.ndarray'>
MAE: 236582.11855670103
Average Price: 1075684.079455081


Now we can see that the model is far from perfectly accurate: on unseen data the MAE is about $237,000, roughly 22% of the average price.
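The relative size of the error is simply the validation MAE divided by the average price. A short check using the two numbers printed above:

```python
mae = 236582.11855670103           # validation MAE printed above
average_price = 1075684.079455081  # average price printed above

relative_error = mae / average_price
print(f"MAE is {relative_error:.1%} of the average price")
```

Comparing the MAE with the average price is only a rough yardstick, but it gives a sense of scale that the raw dollar figure lacks.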

## Underfitting and Overfitting
We can improve the model by finding the right number of leaf nodes, one that avoids both underfitting and overfitting.

<img src="./public/overfitting-lowerfitting.jpeg" width="500"/>

There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. But the max_leaf_nodes argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.

**Overfitting**: the model captures spurious patterns that won't recur in the future, leading to less accurate predictions.

**Underfitting**: the model fails to capture relevant patterns, again leading to less accurate predictions.
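The trade-off can be illustrated without scikit-learn. The sketch below is a toy stand-in for a tree, not a real decision tree: it sorts the training points by a single feature and groups them into "leaves" of a fixed size, each predicting the mean of its points. The synthetic data (y = 2x plus noise) and all helper names are assumptions made for illustration.

```python
import random
from statistics import fmean


def fit_leaves(xs, ys, leaf_size):
    """Group sorted training points into leaves of `leaf_size` points each.

    Returns a list of (upper_x_bound, mean_y) pairs.
    """
    pairs = sorted(zip(xs, ys))
    leaves = []
    for start in range(0, len(pairs), leaf_size):
        chunk = pairs[start:start + leaf_size]
        leaves.append((chunk[-1][0], fmean([y for _, y in chunk])))
    return leaves


def predict(leaves, x):
    """Return the mean of the first leaf whose upper bound covers x."""
    for upper, mean_y in leaves:
        if x <= upper:
            return mean_y
    return leaves[-1][1]  # x beyond the last bound: use the last leaf


def mae(leaves, xs, ys):
    return fmean([abs(y - predict(leaves, x)) for x, y in zip(xs, ys)])


rng = random.Random(0)
train_x = list(range(0, 40, 2))  # 20 training points
val_x = list(range(1, 40, 2))    # 20 validation points
train_y = [2 * x + rng.gauss(0, 5) for x in train_x]
val_y = [2 * x + rng.gauss(0, 5) for x in val_x]

for leaf_size in (20, 5, 1):  # fewer points per leaf means more leaves
    leaves = fit_leaves(train_x, train_y, leaf_size)
    print(f"leaves={len(leaves):2d} "
          f"train MAE={mae(leaves, train_x, train_y):.2f} "
          f"val MAE={mae(leaves, val_x, val_y):.2f}")
```

With a single leaf the model predicts one mean for every point (underfitting). With one point per leaf the training error drops to zero because each leaf memorizes its point, noise included, while the validation error does not drop to zero with it. That gap between training and validation error is the signature of overfitting.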

Let's now create a function to compare the accuracy of models built with different values for max_leaf_nodes.

In [13]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree limited to `max_leaf_nodes` leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    prediction = model.predict(val_X)
    mae = mean_absolute_error(val_y, prediction)
    return mae


In [14]:
for max_leaf_nodes in [5, 50, 100, 500, 1000, 10000, 15000, 50000]:
    mae = get_mae(max_leaf_nodes, X_train, X_val, y_train, y_val)
    print(f"Max leaf nodes: {max_leaf_nodes} | MAE: {mae}")

Max leaf nodes: 5 | MAE: 359783.60063327465
Max leaf nodes: 50 | MAE: 265943.7314522826
Max leaf nodes: 100 | MAE: 245165.74948283634
Max leaf nodes: 500 | MAE: 220984.7553960781
Max leaf nodes: 1000 | MAE: 222150.45614078475
Max leaf nodes: 10000 | MAE: 237319.17949189985
Max leaf nodes: 15000 | MAE: 237292.54731222385
Max leaf nodes: 50000 | MAE: 237292.54731222385


Among the values tried, the best max_leaf_nodes for this model is 500, with an MAE of about 220,985. That is better than the MAE of about 236,582 we got when we did not specify max_leaf_nodes. When max_leaf_nodes is set to None, the number of leaves is unlimited: nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. The tree then grows as large as the data allows, which tends to overfit, so the default does not guarantee the best fit.
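Picking the best candidate does not have to be done by eye. A small sketch, with the MAE values copied from the output above into a dictionary:

```python
# Validation MAE per max_leaf_nodes value, copied from the output above
scores = {
    5: 359783.60063327465,
    50: 265943.7314522826,
    100: 245165.74948283634,
    500: 220984.7553960781,
    1000: 222150.45614078475,
    10000: 237319.17949189985,
    15000: 237292.54731222385,
    50000: 237292.54731222385,
}

# The key with the smallest MAE is the best tree size among the candidates
best_size = min(scores, key=scores.get)
print(best_size, scores[best_size])
```

This only finds the best of the values that were tried; a finer grid between 100 and 1000 might find a slightly better setting.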

## Random Forest

In [6]:
from sklearn