## Intro to Machine Learning (Kaggle)  
#### Notes taken by: Sara Nasab / Github: @basanaras
Caution: There is no source dataset that I am extracting from. Running the kernel will result in errors.
 

### (1) Importing relevant data

In [None]:
# Import pandas library using conventional abbreviation
import pandas as pd
file = '../input/..'
data = pd.read_csv(file)

In [None]:
# To print summary of data
data.describe()

# To print the first five lines of dataframe: 
data.head() 
# Pro tip: To change the number N of lines shown, insert N in parentheses

# List column headers (used to see what kind of data is stored in df):
data.columns 

Extract data to make predictions: 

In [None]:
# Select the prediction target (the column we want to predict)
y = data.Price

# Choose the features of interest that factor into the prediction target
features = ['Rooms', 'Bathroom', 'Size']
X = data[features]

### (2) Creating a decision tree model: 

In [None]:
# Use decision tree from the scikit-learn library
from sklearn.tree import DecisionTreeRegressor

# Define model. The decision tree is initialized with a random_state to ensure reproducibility
data_model = DecisionTreeRegressor(random_state = 1)

# Fit model 
data_model.fit(X, y)

# Predictions of y using X
data_model.predict(X)
data_model.predict(X.head())

### (3) Model validation: How much can we trust the model that we created? 
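* Avoid making predictions on the *training data* and then comparing those predictions to the *training data* targets: the resulting score is too optimistic.
* We can measure predictive accuracy in different ways. One useful metric is the **Mean Absolute Error (MAE)**:
$$ \text{MAE} = \frac{\sum^{n}_{i=1} | y_i - \hat{y}_i |}{n} $$
where $y_i$ is the actual value and $\hat{y}_i$ is the model's prediction for data point $i$.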
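
As a quick check of the formula, here is a minimal sketch that computes the MAE by hand and compares it with scikit-learn's `mean_absolute_error`. The arrays `y_true` and `y_pred` are made-up numbers for illustration only, not values from any dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual values and model predictions (made-up numbers)
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 205.0, 230.0])

# MAE by hand: the average of the absolute errors |actual - predicted|
mae_manual = np.mean(np.abs(y_true - y_pred))

# The same quantity from scikit-learn
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

The absolute errors here are 10, 10, 5 and 20, so both values come out to 45 / 4 = 11.25.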

In [None]:
# How to use built-in MAE function from scikit-learn
from sklearn.metrics import mean_absolute_error

predicted_from_model = data_model.predict(X)
mean_absolute_error(y, predicted_from_model)

This is an "in-sample" score: the MAE is computed on the same data that the model was trained on. Such a score is usually overly optimistic, so the model will most likely perform worse in practice (on new data).

One way to get around this: Exclude some data points from the training data, and use them to test the model's accuracy. This held-out set of points is called **validation data**. 

We can use the **validation data** to compute the MAE, and ultimately measure the accuracy of our model. 
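
Since these notes have no dataset attached, the following is a self-contained sketch on synthetic data that contrasts the in-sample MAE with the validation MAE. The column names mirror the ones used above, but every value (and the formula that generates the prices) is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (all values are made up)
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=n),
    "Bathroom": rng.integers(1, 4, size=n),
    "Size": rng.uniform(40, 250, size=n),
})
# Hypothetical price: a linear trend plus random noise
y = (50_000 * X["Rooms"] + 30_000 * X["Bathroom"]
     + 2_000 * X["Size"] + rng.normal(0, 40_000, size=n))

# Hold out part of the data (25% by default) for validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# In-sample score: predictions on the data the model has already seen
in_sample_mae = mean_absolute_error(train_y, model.predict(train_X))
# Validation score: predictions on the held-out data
validation_mae = mean_absolute_error(val_y, model.predict(val_X))

print(f"In-sample MAE:  {in_sample_mae:,.0f}")
print(f"Validation MAE: {validation_mae:,.0f}")
```

An unrestricted tree can memorize the training rows, so the in-sample MAE comes out far lower than the validation MAE. The validation number is the more honest estimate of performance on new data.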

In [None]:
# Input requires X and y, and then returns split training and testing/validation data 
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0) # Insert random state 

# Create model 
data_model = DecisionTreeRegressor(random_state=1)
# Fit model 
data_model.fit(train_X, train_y) 

Just like before, we fit the model to the training data. *However*, this time the fit is only done on a subset of the available data, named `train_X` and `train_y`. 

We can then generate *predictions* for the *validation* data. 

In [None]:
val_predictions = data_model.predict(val_X)

# Calculate MAE correctly! 
mean_absolute_error(val_y, val_predictions)

### (4) Experimenting with different models: Overfitting vs. underfitting 

A decision tree is composed of splits (levels) and leaves. The *deeper* the tree, the more splits it has; the *shallower* the tree, the fewer splits it has. 

* **Overfitting**: When the tree has many splits (many leaves), the model fits the training set very well, but it often captures outliers and noise that are not indicative of new datasets.
* **Underfitting**: When the tree is too shallow (few leaves), the model is less likely to recognize distinct patterns in the data. 

Both cases are likely to result in **poor predictions** on both the validation and new data sets.

We can often avoid these issues by looking at how the number of leaves influences the MAE of our model, using `max_leaf_nodes`: 

`DecisionTreeRegressor(max_leaf_nodes = n, random_state = 0)`

By varying `max_leaf_nodes`, we can compare different models and choose the one that minimizes the MAE.
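
A minimal sketch of such a comparison, again on synthetic data. The helper name `get_mae`, the list of candidate sizes, and the data itself are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (all values are made up)
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=n),
    "Bathroom": rng.integers(1, 4, size=n),
    "Size": rng.uniform(40, 250, size=n),
})
y = (50_000 * X["Rooms"] + 30_000 * X["Bathroom"]
     + 2_000 * X["Size"] + rng.normal(0, 40_000, size=n))

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree with the given leaf limit and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Hypothetical candidate tree sizes, from very shallow to very deep
candidates = [5, 25, 50, 100, 250]
scores = {size: get_mae(size, train_X, val_X, train_y, val_y) for size in candidates}

for size, mae in scores.items():
    print(f"max_leaf_nodes = {size:>3}: validation MAE = {mae:,.0f}")

# Keep the size with the lowest validation MAE
best_tree_size = min(scores, key=scores.get)
print("Best max_leaf_nodes:", best_tree_size)
```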

**Note: We use the *validation* data to test the model!**

#### In practice: 
1. Find the appropriate `max_leaf_nodes` by comparing MAEs on the **validation set**. 
2. Once the optimal `max_leaf_nodes` is found, fit the model on the **entire** dataset. 
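
The final refit can be sketched as follows. The value `best_tree_size = 50` is a hypothetical result of the comparison step, and the data is synthetic; with a real dataset, both would come from your own work.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (all values are made up)
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=n),
    "Bathroom": rng.integers(1, 4, size=n),
    "Size": rng.uniform(40, 250, size=n),
})
y = (50_000 * X["Rooms"] + 30_000 * X["Bathroom"]
     + 2_000 * X["Size"] + rng.normal(0, 40_000, size=n))

# Hypothetical: assume the validation comparison selected this size
best_tree_size = 50

# The tree size is now settled, so no rows need to be held out:
# fit on ALL of the data to make use of every available example
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)
final_model.fit(X, y)

print("Leaves in final model:", final_model.get_n_leaves())
print(final_model.predict(X.head()))
```

Holding data out was only needed to choose the tree size; the final model benefits from being trained on the validation rows as well.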