# [Random Forests](https://www.kaggle.com/dansbecker/random-forests)

## Introduction

Decision trees leave you with a difficult decision.  
A deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf.  
But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.  
Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting.  
However, many models use clever ideas that lead to better performance.  
We'll look at the random forest as an example.  
The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree.  
It generally has much better predictive accuracy than a single decision tree and it works well with the default parameters.  
If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right settings.
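
To make the averaging idea concrete, here is a minimal sketch using scikit-learn. It assumes synthetic data from `make_regression` as a stand-in (not the housing data used in this tutorial), and the names `X_demo`, `y_demo`, and `forest` are illustrative. It checks that a fitted forest's prediction matches the mean of the predictions of its component trees, which scikit-learn exposes as `estimators_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption; not the Melbourne housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Predict with each component tree, then average across the trees
tree_predictions = np.array([tree.predict(X_demo) for tree in forest.estimators_])
averaged = tree_predictions.mean(axis=0)

# Compare the manual average with the forest's own prediction
matches = np.allclose(averaged, forest.predict(X_demo))
print(matches)
```

Each tree is trained on a different bootstrap sample of the rows, so the individual trees disagree with one another; averaging smooths out their individual errors.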

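```python
import pandas as pd

# Load the Melbourne housing data and drop rows with missing values
file_path = 'input/melbourne_data.csv'
mb_data = pd.read_csv(file_path)
mb_data = mb_data.dropna(axis=0)

# Prediction target and predictor columns.
# 'Lattitude' and 'Longtitude' are kept exactly as written here,
# since they must match the column names in the CSV.
y = mb_data.Price
mb_predictors = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']
X = mb_data[mb_predictors]
```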

```python
from sklearn.model_selection import train_test_split

# Hold out part of the data for validation; random_state makes the split repeatable
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
```

We build a random forest with scikit-learn's `RandomForestRegressor` in much the same way that we built a decision tree:

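```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae

# Fit a forest with default settings. No random_state is set,
# so the exact error will vary a little from run to run.
forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)

# Evaluate on the held-out validation data
melbourne_predictions = forest_model.predict(val_X)
print(mae(val_y, melbourne_predictions))
```

```
203752.1210673553
```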


## Conclusion

There is likely room for further improvement, but this is a big improvement over the best decision tree error of 250,000.  
There are parameters which allow you to change the performance of the Random Forest much as we changed the maximum depth of the single decision tree.  
But one of the best features of Random Forest models is that they generally work reasonably well even without this tuning.  
You'll soon learn the XGBoost model, which provides better performance when tuned well with the right parameters (but which requires some skill to get the right settings).
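
As a rough illustration of the tuning parameters mentioned above, the sketch below varies two real `RandomForestRegressor` arguments, `n_estimators` (number of trees) and `max_leaf_nodes` (a cap on tree size), and records the validation error for each combination. The data is synthetic (an assumption, generated with `make_regression`), so the numbers say nothing about the housing data; the names `demo_X`, `demo_y`, and `mae_by_setting` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Melbourne housing data)
demo_X, demo_y = make_regression(n_samples=500, n_features=7, noise=20.0, random_state=0)
demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_X, demo_y, random_state=0
)

# Try a few settings and record the validation MAE for each one
mae_by_setting = {}
for n_trees in [10, 50, 200]:
    for max_leaf_nodes in [50, None]:  # None means no cap on leaves
        model = RandomForestRegressor(
            n_estimators=n_trees, max_leaf_nodes=max_leaf_nodes, random_state=0
        )
        model.fit(demo_train_X, demo_train_y)
        predictions = model.predict(demo_val_X)
        mae_by_setting[(n_trees, max_leaf_nodes)] = mean_absolute_error(demo_val_y, predictions)

for setting, error in mae_by_setting.items():
    print(setting, round(error, 1))
```

On real data you would compare these errors against the default model before deciding whether the extra tuning is worth it.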

# [Submitting From A Kernel](https://www.kaggle.com/dansbecker/submitting-from-a-kernel)

Machine learning competitions are a great way to improve your skills and measure your progress as a data scientist.  
If you are using data from a competition on Kaggle, you can easily submit your predictions from your notebook.  
We're doing very minimal data setup here so we can focus on how to submit modeling results to competitions.  
Other tutorials will teach you how to build great models.  
So the model in this example will be fairly simple.  
We'll start with the sample code to read data, select predictors, and fit a model.
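
A minimal sketch of that first step might look like the following. The column names (`LotArea`, `YearBuilt`, `SalePrice`) and every value in the tiny in-memory CSV are made-up stand-ins so the example runs on its own; in a real notebook you would pass the path of the competition's training file to `pd.read_csv` instead.

```python
import io

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Tiny in-memory stand-in for the training file (all values are made up).
# In a competition notebook, pass the training CSV's path to pd.read_csv.
train_csv = io.StringIO(
    "Id,LotArea,YearBuilt,SalePrice\n"
    "1,8000,2001,200000\n"
    "2,9500,1975,180000\n"
    "3,11000,2000,220000\n"
    "4,9000,1915,140000\n"
    "5,14000,1999,250000\n"
)
train = pd.read_csv(train_csv)

# Pull the data into a target (y) and predictors (X)
train_y = train.SalePrice
predictor_cols = ['LotArea', 'YearBuilt']  # hypothetical choice of predictors
train_X = train[predictor_cols]

# Fit a simple model with default settings
my_model = RandomForestRegressor(random_state=0)
my_model.fit(train_X, train_y)
```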

In addition to your training data, there will be test data.  
This is frequently stored in a file named `test.csv`.  
This data won't include a column with your target (y), because that is what we'll have to predict and submit.  
Here is the sample code to do that.
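
One way this step could look is sketched below. To keep the example runnable on its own, it fits a small stand-in model first and reads the test data from an in-memory CSV; the column names and all values are made-up assumptions, and in practice you would read the competition's test file and reuse the model you already fitted.

```python
import io

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

predictor_cols = ['LotArea', 'YearBuilt']  # hypothetical predictors

# Stand-in for a model already fitted on the training data (values are made up)
train = pd.DataFrame({
    'LotArea': [8000, 9500, 11000, 9000, 14000],
    'YearBuilt': [2001, 1975, 2000, 1915, 1999],
    'SalePrice': [200000, 180000, 220000, 140000, 250000],
})
my_model = RandomForestRegressor(random_state=0)
my_model.fit(train[predictor_cols], train.SalePrice)

# Stand-in for the test file: it has the Id and predictors but no target column
test = pd.read_csv(io.StringIO(
    "Id,LotArea,YearBuilt\n"
    "1461,11500,1961\n"
    "1462,14500,1958\n"
    "1463,13800,1997\n"
))

# Use the same predictor columns as in training, in the same order
test_X = test[predictor_cols]
predicted_prices = my_model.predict(test_X)
print(predicted_prices)
```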

## Prepare the submission file

We make submissions in CSV files.  
Your submissions usually have two columns: an ID column and a prediction column.  
The ID field comes from the test data (keeping whatever name the ID field had in that data, which for the housing data is the string 'Id').  
The prediction column will use the name of the target field.  
We will create a DataFrame with this data, and then use the DataFrame's `to_csv` method to write our submission file.  
Explicitly include the argument `index=False` to prevent pandas from adding another column to our CSV file.
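
The steps above can be sketched as follows. The IDs and predictions are placeholder values, the target name `SalePrice` is an assumption about the competition's target field, and `submission.csv` is an illustrative file name.

```python
import pandas as pd

# Placeholder IDs and predictions; in practice these come from the
# test data's ID column and from your model's predict() call
test_ids = [1461, 1462, 1463]
predicted_prices = [120000.0, 155000.0, 180000.0]

# Two columns: the ID field and the target field (name assumed here)
my_submission = pd.DataFrame({'Id': test_ids, 'SalePrice': predicted_prices})

# index=False keeps pandas from writing the row index as an extra column
my_submission.to_csv('submission.csv', index=False)

# Read the file back to check which columns were written
written_columns = pd.read_csv('submission.csv').columns.tolist()
print(written_columns)
```

Without `index=False`, the file would gain an unnamed leading column holding the row index, which is not part of the two-column layout described above.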

Hit the blue **Publish** button at the top of your notebook screen.  
It will take some time for your kernel to run.  
When it has finished, the navigation bar at the top of the screen will have an **Output** tab.  
This only shows up if you have written an output file (as we did in the "Prepare the submission file" step).