You can find this course on Kaggle: [Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning).

# Introduction
In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) and see how you stack up against others taking this course.

This notebook focuses on the following steps:

1. Build a Random Forest model with all of your data (X and y).
2. Read in the "test" data, which doesn't include values for the target. Predict home values in the test data with your Random Forest model.
3. Submit those predictions to the competition and see your score.
4. Optionally, come back to see if you can improve your model by adding features or changing your model. Then you can resubmit to see how that stacks up on the competition leaderboard.

# Tasks
Before building our model and submitting predictions to the competition, we'll work step by step from loading the data to preparing it for our model.

These steps consist of:

1. Loading the data
2. Inspecting the data (i.e., identifying the prediction target and features)
3. Splitting the data
4. Building the model
5. Creating the model for the competition

# Task 1: Load the data

We'll use the Python library `pandas` to load and view our data.

In [1]:
import pandas as pd

# file path 
filepath = 'train.csv'

# read csv 
train_data = pd.read_csv(filepath)

# view first 5 rows of data
train_data.head()

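| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |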


# Task 2: Inspect data

Some columns, such as `Alley`, show nothing but `NaN` in these first rows. We'll disregard that here: according to `data_description.txt`, `NaN` in these columns indicates that the house does not have the feature named by the column. It does not mean that the data wasn't recorded.
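To see how widespread `NaN` is before deciding how to treat it, one option is to count the missing values per column. The sketch below uses a small made-up frame named `toy` in place of `train_data`; the values are illustrative and not taken from `train.csv`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for train_data; values are made up for illustration.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Alley": [np.nan, np.nan, np.nan],   # no alley access for any house
    "Fence": [np.nan, "MnPrv", np.nan],  # only one house has a fence
})

# Count missing entries per column, largest first.
missing_counts = toy.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```

The same two calls, `isnull().sum()`, should work unchanged on `train_data`.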

We can continue inspecting our data to see what our prediction target is and which features we want to select for our model.

In [2]:
# use .columns to view the column names in the data
train_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

## Task 2.1 Defining the prediction target and features
Following the Intro to Machine Learning course, I'll start with the features listed in the tutorial. As I advance through the course, I'll look for ways to improve my model and add more features.

For now, I'll use the following features: `['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']`

And since we're dealing with housing prices, our prediction target will be `SalePrice`.

In [3]:
# Edit date: 03/24/2021 

# defining prediction target
y = train_data.SalePrice

# choosing features
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# defining features
X = train_data[features]

# Task 3: Split the data

Before we start building our model and making predictions, we'll have to split the data into two sets: training data and validation data.

**Why?** When we fit the model, it derives its patterns from the training data, so if we make predictions on that same data the model will appear very accurate. Given new data, the same model could produce much less accurate predictions.

Therefore, we'll use the `train_test_split` function from `sklearn.model_selection` to split our data into two different datasets.
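A toy illustration of this effect, using made-up numbers: a "model" that simply memorizes its training rows scores perfectly on them, yet does noticeably worse on rows it has never seen. The `memorizer` function and both dictionaries are hypothetical and exist only for this sketch.

```python
# Hypothetical toy data: living area -> sale price.
train = {1000: 100_000, 1500: 150_000, 2000: 210_000}
validation = {1200: 125_000, 1800: 185_000}

mean_price = sum(train.values()) / len(train)

def memorizer(area):
    # Return the stored price for a known area, else fall back to the mean.
    return train.get(area, mean_price)

def average_absolute_error(data):
    errors = [abs(price - memorizer(area)) for area, price in data.items()]
    return sum(errors) / len(errors)

train_error = average_absolute_error(train)            # 0.0: rows reproduced exactly
validation_error = average_absolute_error(validation)  # much larger on unseen rows
print(train_error, validation_error)
```

Only the validation error says anything about how the model might behave on new houses.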

In [4]:
# import the train_test_split function
from sklearn.model_selection import train_test_split

# define our two separate X and y
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0) # specify a number for random_state to 
                                                                        # ensure same results each run
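As a small check of what `train_test_split` does with its defaults, the sketch below splits 100 placeholder rows twice with the same `random_state`. The `rows` and `targets` lists are stand-ins invented for this example.

```python
from sklearn.model_selection import train_test_split

rows = list(range(100))          # stand-in for 100 rows of features
targets = [r * 2 for r in rows]  # stand-in targets

a_train, a_val, _, _ = train_test_split(rows, targets, random_state=0)
b_train, b_val, _, _ = train_test_split(rows, targets, random_state=0)

print(len(a_train), len(a_val))  # default test_size is 0.25
print(a_val == b_val)            # same random_state, same split
```

With no `test_size` given, a quarter of the rows are held out for validation, and fixing `random_state` makes the split repeatable between runs.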

# Task 4: Build model with training data
### Subtasks:
1. Find `best_tree_size` from a list of candidate `max_leaf_nodes` values
2. Make validation predictions 
    * without specifying `max_leaf_nodes`
    * with `max_leaf_nodes` specified
    * using `RandomForestRegressor`
    
We'll build three different models on our training data to see which produces the lowest MAE (mean absolute error).

**Mean Absolute Error**: The average of the absolute differences between the model's predictions and the actual values.
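To make the definition concrete, MAE can be computed by hand: take the absolute difference between each actual value and its prediction, then average. The helper name `mean_absolute_error_manual` and the numbers below are made up for this sketch.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all rows."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)

actual = [200_000, 150_000, 300_000]
predicted = [210_000, 140_000, 270_000]

# The individual errors are 10,000, 10,000 and 30,000.
mae = mean_absolute_error_manual(actual, predicted)
print("{:,.0f}".format(mae))  # prints 16,667
```

On the same inputs, `sklearn.metrics.mean_absolute_error` should give the same number.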

In summary, `max_leaf_nodes` caps the number of leaves a tree may grow, which gives us a way to control underfitting vs. overfitting. We'll try a list of candidate values and pick the `best_tree_size`, the candidate with the lowest validation MAE.

I've already created a function called `get_mae` that performs that process, so all we need to do is import it.

In [6]:
from find_maxleafnode import get_mae
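The `find_maxleafnode` module itself is not shown in this notebook. A plausible sketch of `get_mae`, assuming it resembles the helper of the same name in the Kaggle course (fit a `DecisionTreeRegressor` capped at the given `max_leaf_nodes`, then return the validation MAE), might look like this:

```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree capped at max_leaf_nodes and return its validation MAE.

    Assumed implementation; the notebook's own helper is not shown.
    """
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    val_predictions = model.predict(val_X)
    return mean_absolute_error(val_y, val_predictions)
```

The argument order matches the call used in the next cell: `get_mae(leaf_node, train_X, val_X, train_y, val_y)`.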

Now, let's find our best tree size.

In [7]:
# candidate values for max_leaf_nodes
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]

# use a dict comprehension to map each max_leaf_nodes value to its validation MAE
scores = {leaf_node: get_mae(leaf_node, train_X, val_X, train_y, val_y) for leaf_node in candidate_max_leaf_nodes}

# define best_tree_size: the max_leaf_nodes value with the lowest MAE
best_tree_size = min(scores, key=scores.get)

# define best tree: best_tree_size and its MAE
best_tree = {best_tree_size: scores[best_tree_size]}

# view best_tree
best_tree

{50: 27825.888386265695}

## Task 4.1 Make predictions
Now that the hard part is over, all we have to do is plug our variables into our model and see what predictions we get.

We'll use `DecisionTreeRegressor` and `RandomForestRegressor` as our models.

**Decision Trees**: A decision tree predicts a value by repeatedly splitting the data on the given features until it reaches a "leaf" (a terminal node), which holds the prediction.

**Random Forest**: The Random Forest model uses many decision trees and makes a prediction by averaging the predictions of each component tree.
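The averaging described above can be checked directly in scikit-learn. The sketch below fits a small forest on synthetic data (invented for this example) and compares the forest's prediction with the mean of its component trees, which are exposed through the `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumed for illustration): y depends roughly linearly on x.
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = 3 * X_toy[:, 0] + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)

new_rows = np.array([[2.0], [7.5]])

# One row of predictions per component tree.
per_tree = np.array([tree.predict(new_rows) for tree in forest.estimators_])

# Compare the forest's prediction with the mean over its trees.
matches = np.allclose(forest.predict(new_rows), per_tree.mean(axis=0))
print(matches)
```

Individual trees can disagree quite a bit on a given row; averaging them tends to smooth out those differences.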

In [9]:
# import all the necessary scikit-learn functions
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

In [10]:
# define model 
train_model = DecisionTreeRegressor(random_state=0)

# fit model
train_model.fit(train_X, train_y)

DecisionTreeRegressor(random_state=0)

In [11]:
# make validation predictions and calculate mean absolute error without specifying max_leaf_nodes
val_predictions = train_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation Mean absolute error without specifying max_leaf_nodes: {:,.0f}".format(val_mae))

Validation Mean absolute error without specifying max_leaf_nodes: 32,411


In [12]:
# redefine model with best_tree_size
train_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)

# fit model
train_model.fit(train_X, train_y)

# calculate MAE using the best value of max_leaf_nodes
val_predictions = train_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation Mean absolute error for best value of max_leaf_nodes: {:,.0f}".format(val_mae))

Validation Mean absolute error for best value of max_leaf_nodes: 27,826


In [13]:
# define RandomForestRegressor model
rf_model = RandomForestRegressor(random_state=0)

# fit model
rf_model.fit(train_X, train_y)

# make predictions using RandomForestRegressor
rf_val_predictions = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, rf_val_predictions)
print("Validation Mean absolute error for RandomForest Model: {:,.0f}".format(rf_val_mae))

Validation Mean absolute error for RandomForest Model: 23,093


Of the three models, the `RandomForestRegressor` achieves the lowest validation MAE, at 23,093. With that information, I'll use the `RandomForestRegressor` to create a model for the competition.

# Task 5: Create model for competition

For better accuracy, I'll create a new model and train it on all the training data before making predictions on the data in `test.csv`.

In [14]:
# build RF model and train it on all X and y 
rf_model_full_data = RandomForestRegressor(random_state=0)
rf_model_full_data.fit(X,y)

RandomForestRegressor(random_state=0)

In [15]:
# load data from test.csv
test_data = pd.read_csv('test.csv')

# create test_X by selecting the same feature columns from test_data
test_X = test_data[features]

# make predictions for the competition
test_pred = rf_model_full_data.predict(test_X)

Since this was part of Kaggle's **Intro to Machine Learning** course, the final step is to submit our predictions to the *Housing Prices Competition for Kaggle Learn Users*.

In [16]:
# use pandas to create .csv with selected data
output = pd.DataFrame({'Id': test_data.Id,
                      'SalePrice': test_pred})
output.to_csv('submission.csv', index=False)
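Before uploading, it can be worth running a few basic checks on the submission frame. The sketch below uses made-up stand-ins (`toy_ids`, `toy_pred`) for `test_data.Id` and `test_pred`; the particular checks are a suggestion, not a competition requirement stated in this notebook.

```python
import pandas as pd

# Toy stand-ins for test_data.Id and test_pred (values are made up).
toy_ids = pd.Series([1461, 1462, 1463], name="Id")
toy_pred = [120000.0, 155000.0, 180000.0]

toy_output = pd.DataFrame({"Id": toy_ids, "SalePrice": toy_pred})

# Basic checks: expected columns, one row per Id, no missing predictions.
assert list(toy_output.columns) == ["Id", "SalePrice"]
assert len(toy_output) == len(toy_ids)
assert toy_output["SalePrice"].notnull().all()
print("submission frame looks consistent")
```

The same three assertions could be applied to `output` before calling `to_csv`.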

# Conclusion

This course went into the basics of:

1. Building our first model
2. Validating our model (i.e., splitting our data)
3. Making sure we don't underfit or overfit our model
4. Learning about **Decision Trees** & **Random Forests**
5. Understanding the importance of Mean Absolute Error
6. Testing our model

I learned enough to gain a general understanding of how machine learning works and of the overall workflow, from building a model to testing it on new data.

#### My next steps are:

1. Learning about the different types of Machine Learning (e.g., Supervised, Unsupervised, etc.)
2. Figuring out how to choose the best features for the model (Feature Engineering)
3. Creating Visualizations of the model 