## Notes for **Intro to Machine Learning** Course provided on Kaggle.

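- Decision Tree: A model that predicts by splitting the data into groups with a series of yes/no questions. The simplest tree has a single split, so it has only 2 possible predictions.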
- Fitting / Training: The process of capturing patterns from data (training data)
- The major shortcoming of such a simple Decision Tree is that it ignores most of the factors affecting the prediction. A tree with more "splits" (a deeper tree) can capture more factors.
- Leaf: A node at the bottom of the tree that has no children. This is where the prediction is made.
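
To make the "2 possible predictions" idea concrete, here is a minimal hand-written sketch of a tree with one split and two leaves. The feature (`bedrooms`), the threshold, and the two leaf values are made up for illustration; a real model learns them from the training data.

```python
# A hand-written decision tree with a single split (sometimes called a "stump").
# The feature name, threshold, and leaf values are hypothetical.

def predict_price(bedrooms):
    """Return one of two possible predictions, depending on one split."""
    if bedrooms <= 2:
        return 178_000   # leaf 1: made-up typical price of smaller homes
    return 188_000       # leaf 2: made-up typical price of larger homes

homes = [1, 2, 3, 4]     # number of bedrooms in four example homes
predictions = [predict_price(b) for b in homes]
print(predictions)
```

However many homes are passed in, the output can only contain the two leaf values, which is why a deeper tree is needed to capture more factors.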

Components in the data:
 - DataFrame: Holds data in the form of a table, similar to a sheet in Excel or a table in a SQL database.
 - This course uses the Pandas library to work with data.

In [None]:
import pandas as pd   # library for manipulating data

# file path 
file_path = "../file_Directory/file_SubDirectory/file.csv"

# import the data and store it in DataFrame df
df = pd.read_csv(file_path)

# summary of the data in df
df.describe()

Components in describe() for numeric columns:
- count: Number of rows with non-null values
- mean: Average value of that column
- std: Standard deviation (how widely the values are spread around the mean)
- min: Smallest value
- 25%, 50%, 75%: The 25th, 50th (median), and 75th percentiles
- max: Largest Value
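
As a rough illustration of what these statistics mean, the sketch below recomputes them by hand for one small made-up column using only the standard library. It assumes the sample standard deviation and linearly interpolated percentiles, which is what pandas uses by default; the column values are hypothetical.

```python
import statistics

# Hypothetical column with one missing value (None stands in for a null).
column = [1.0, 2.0, None, 3.0, 4.0, 5.0]
values = sorted(v for v in column if v is not None)

count = len(values)                  # number of rows with non-null values
mean = statistics.mean(values)       # average value
std = statistics.stdev(values)       # sample standard deviation
# "inclusive" interpolates linearly between the smallest and largest value
q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")

summary = {
    "count": count, "mean": mean, "std": std, "min": values[0],
    "25%": q25, "50%": q50, "75%": q75, "max": values[-1],
}
for name, value in summary.items():
    print(f"{name:>5}: {value:.3f}")
```

Note that `count` is 5 rather than 6, because the missing value is excluded.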

In [None]:
# display all columns in dataframe
df.columns

# dropping null values
df = df.dropna(axis = 0)

# 2 ways to access a single attribute/feature (a column in the DataFrame)
# both return a Series
att1 = df.att1      # dot-notation
att1 = df["att1"]   # bracket notation

# select several columns by passing a list of column names inside the square brackets
# (note the double brackets); this returns a DataFrame
many_att = df[["att1", "att2", "att3", "att4"]]

# the first n rows (5 rows by default)
df.head()   # first 5 rows
df.head(10)   # first 10 rows
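
Because the file path above is only a placeholder, the following sketch repeats the same operations on a tiny in-memory table so it can be run as-is. The column names and values are made up for illustration.

```python
import pandas as pd

# Small made-up table; the column names are placeholders, not course data.
df = pd.DataFrame({
    "att1": [1.0, 2.0, None, 4.0],
    "att2": [10, 20, 30, 40],
    "att3": ["a", "b", "c", "d"],
})

clean = df.dropna(axis=0)                # drops the row where att1 is missing
one_column = clean["att1"]               # single brackets -> Series
two_columns = clean[["att1", "att2"]]    # list inside brackets -> DataFrame

print(type(one_column).__name__, type(two_columns).__name__)
print(clean.shape)
```

Selecting with a list always gives back a DataFrame, even when the list has a single column name.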

- Use the scikit-learn library to create the machine learning model. Scikit-learn (written as sklearn in code) is the most popular library for modeling the types of data typically stored in DataFrames.
- Steps to build a model:
    1. Define
        - Choose the type of model and the parameters to use.
    2. Fit
        - Capture patterns (train) from the training data.
    3. Predict
        - Use the fitted model to predict the label for a set of features.
    4. Evaluate
        - Determine the accuracy of the model's predictions.



In [None]:
# import scikit-learn library and the model to be used
from sklearn.tree import DecisionTreeRegressor

# step 1: define the model
# specifying random_state ensures we get the same result on each run
# any fixed number works; model quality should not depend meaningfully on which one is chosen
df_model = DecisionTreeRegressor(random_state = 1)

# step 2: fit the model
# X holds the predictive features, y is the label (the value to predict)
# "att1".."att4" and "label" are placeholder column names; substitute the real ones
X = df[["att1", "att2", "att3", "att4"]]
y = df["label"]
df_model.fit(X, y)

# step 3: predict
# predicting on the training features here only demonstrates the call
df_model.predict(X)

Model Validation
 - Mean Absolute Error (MAE)
    - error = actual - predicted
    - MAE takes the absolute value of each error and then averages those absolute errors.
    - It is a measure of model quality: on average, the predictions are off by about the MAE.

 - Validation data (or test data): Data that is used only for measuring the model's accuracy. It is held out from training, so the model is scored on data it has not seen.
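
The MAE definition above can be worked through by hand. This is a minimal sketch with made-up actual and predicted values, not output from the course data.

```python
# Made-up values for illustration.
actual = [100, 200, 300]
predicted = [110, 190, 330]

# error = actual - predicted, one per row
errors = [a - p for a, p in zip(actual, predicted)]
# drop the sign so positive and negative errors cannot cancel out
absolute_errors = [abs(e) for e in errors]
# average of the absolute errors
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)
print(absolute_errors)
print(round(mae, 2))
```

Without the absolute value, the errors of -10 and +10 would cancel and make the model look better than it is.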

In [None]:
# get the MAE using methods in sklearn
from sklearn.metrics import mean_absolute_error

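# predictions on the same data the model was trained on
predictions = df_model.predict(X)

# MAE (an "in-sample" score, so it is likely to look better than performance on new data)
mean_absolute_error(y, predictions)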

In [None]:
# split data into train set and test set
from sklearn.model_selection import train_test_split

# random_state makes sure that we get the same split on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

# define the model; random_state keeps the fitted tree reproducible
df_model = DecisionTreeRegressor(random_state = 1)

# fit the model using train set
df_model.fit(X_train, y_train)

# use test set to predict
predictions = df_model.predict(X_test)

# get the MAE
print(mean_absolute_error(y_test, predictions))

Underfitting and Overfitting
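 - Overfitting
    - The model matches the training data almost perfectly but does poorly on validation data and other new data.
    - Usually occurs in a Decision Tree with a higher tree depth (many leaves, each holding few data points).
    - Low MAE on the training data, but high MAE on the validation data.
 - Underfitting
    - The model fails to capture important distinctions and patterns in the data, so it performs poorly even on the training data.
    - Usually occurs in a Decision Tree with a lower tree depth (few leaves).
    - High MAE on both the training data and the validation data.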
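
The two failure modes can be seen by comparing training and validation MAE for trees of different depth. The sketch below is only an illustration under stated assumptions: the data is a synthetic noisy sine curve (not course data), and `max_depth` is used as the most direct way to limit depth.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for this sketch): a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=400)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

results = {}
for name, depth in [("shallow", 1), ("medium", 4), ("deep", None)]:
    # max_depth=None lets the tree grow until every leaf is pure
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_mae = mean_absolute_error(y_train, model.predict(X_train))
    val_mae = mean_absolute_error(y_val, model.predict(X_val))
    results[name] = (train_mae, val_mae)
    print(f"{name:>8}: train MAE = {train_mae:.3f}, validation MAE = {val_mae:.3f}")
```

The shallow tree is expected to score poorly on both sets (underfitting), while the unrestricted tree memorizes the training set and shows a clear gap between its training and validation MAE (overfitting).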


How to control tree depth?
- The max_leaf_nodes argument provides a way to control overfitting and underfitting by limiting the number of leaves in the tree.
- The more leaves the model is allowed to have, the more we move from underfitting toward overfitting.

In [None]:
# compare MAE scores for model with different max_leaf_nodes
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def model_MAE(max_leaf_nodes, X, y):
    # fixed random_state so every max_leaf_nodes value is scored on the same split
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
    df_model = DecisionTreeRegressor(max_leaf_nodes = max_leaf_nodes, random_state = 42)
    df_model.fit(X_train, y_train)
    return mean_absolute_error(y_test, df_model.predict(X_test))

In [None]:
# get MAE for different max_leaf_nodes 
for n in [5, 50, 500, 5000]:
    mae = model_MAE(n, X, y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (n, mae))

Random Forest Model
 - Uses many trees and makes a prediction by averaging the predictions of each component tree.
 - Generally has much better predictive accuracy than a single decision tree.
 - Works well with default parameters. Other models can perform even better, but many of those are sensitive to getting the right parameters.
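
To illustrate the averaging described above, this sketch fits a small forest on synthetic data (an assumption, not course data) and compares the forest's prediction with the mean of its individual trees' predictions. The fitted trees are read from the `estimators_` attribute of scikit-learn's `RandomForestRegressor`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, made up for this sketch.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=1)
forest.fit(X, y)

X_new = X[:5]
# one row of predictions per component tree
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# compare the hand-computed average with the forest's own prediction
print(np.allclose(manual_average, forest.predict(X_new)))
```

Each tree sees a different bootstrap sample of the data, so the individual predictions differ; averaging them smooths out the errors of any single tree.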

In [None]:
# building random forest model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state = 1)
forest_model.fit(X_train, y_train)
predictions = forest_model.predict(X_test)
print("Mean absolute error of using Random Forest Model: {}".format(mean_absolute_error(y_test, predictions)))