# **Introduction**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X_train_full = pd.read_csv("../resources/datasets/train.csv", index_col="Id")
X_test_full = pd.read_csv("../resources/datasets/test.csv", index_col="Id")

# Obtain target and predictors
y = X_train_full.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = X_train_full[features].copy()
X_test_copy = X_test_full[features].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=42)

You will work with data from the Housing Prices Competition for Kaggle Learn Users to predict home prices in Iowa using 79 explanatory variables describing (almost) every aspect of the homes.

[Image: Ames Housing dataset]

The code cell above loads the training and validation features into X_train and X_test, along with the prediction targets in y_train and y_test. The features of the competition test set are loaded in X_test_copy.
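
As a quick sanity check on the split, the sketch below applies the same train_test_split arguments to a small synthetic frame and confirms that 80% of the rows go to training and 20% to validation. The toy_ names and their values are made up for illustration and are not taken from the competition data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real features and target
toy_X = pd.DataFrame({"LotArea": range(100, 200), "YearBuilt": range(1900, 2000)})
toy_y = pd.Series(range(100), name="SalePrice")

# Same split arguments as in the cell above
toy_X_train, toy_X_valid, toy_y_train, toy_y_valid = train_test_split(
    toy_X, toy_y, train_size=0.8, test_size=0.2, random_state=42
)

# Row counts of the two parts
print(toy_X_train.shape, toy_X_valid.shape)

# Features and targets keep matching indices after the split
print(toy_X_train.index.equals(toy_y_train.index))
```

Fixing random_state makes the split reproducible, so the MAE values of different models are compared on the same validation rows.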

Use the next cell to print the first several rows of the data. It's a nice way to get an overview of the data you will use in your price prediction model.

In [2]:
X_train.head()

| Id | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| 255 | 8400 | 1957 | 1314 | 0 | 1 | 3 | 5 |
| 1067 | 7837 | 1993 | 799 | 772 | 2 | 3 | 7 |
| 639 | 8777 | 1910 | 796 | 0 | 1 | 2 | 4 |
| 800 | 7200 | 1937 | 981 | 787 | 1 | 3 | 7 |
| 381 | 5000 | 1924 | 1026 | 665 | 2 | 3 | 6 |


The next code cell defines five different random forest models.  Run this code cell without changes.

In [3]:
from sklearn.ensemble import RandomForestRegressor

model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=180, random_state=42)
model_3 = RandomForestRegressor(n_estimators=100, criterion="absolute_error", random_state=0)
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=30, random_state=42)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=0)

model = [model_1, model_2, model_3, model_4, model_5]

To select the best model out of the five, we define a function score_model() below. This function returns the mean absolute error (MAE) from the validation set. Recall that the best model will obtain the lowest MAE.

In [4]:
from sklearn.metrics import mean_absolute_error

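def score_model(model, X_t=X_train, X_v=X_test, y_t=y_train, y_v=y_test):
    """Fit the model on the training data and return its MAE on the validation data."""
    model.fit(X_t, y_t)
    pred_val = model.predict(X_v)
    return mean_absolute_error(y_v, pred_val)

# Score each candidate model on the validation set
for i, candidate in enumerate(model, start=1):
    mae = score_model(candidate)
    print(f"Model {i} MAE: {mae}")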

Model 1 MAE: 22325.589543378996
Model 2 MAE: 22364.56139287526
Model 3 MAE: 22666.73301369863
Model 4 MAE: 23105.203705748143
Model 5 MAE: 22419.19760824819
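
For reference, MAE is the mean of the absolute differences between the true and the predicted values. The sketch below computes it by hand on a few made-up prices and compares the result with mean_absolute_error; the numbers are purely illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up sale prices and predictions, for illustration only
y_true = np.array([200000.0, 150000.0, 320000.0, 180000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 185000.0])

# MAE = mean(|y_true - y_pred|)
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Because MAE is in the same units as the target, a value around 22,000 means the predictions miss the sale price by roughly that many dollars on average.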


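The best model is the one with the lowest MAE. In the output above that is model_1 (MAE of about 22,325.59), narrowly ahead of model_2 (about 22,364.56).

In [5]:
best_model = model_1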
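
The comparison can also be made programmatically instead of by eye. A minimal sketch, using the MAE values printed by the scoring loop (the dictionary name maes and the string keys are introduced here for illustration):

```python
# MAE values copied from the output of the scoring loop
maes = {
    "model_1": 22325.589543378996,
    "model_2": 22364.56139287526,
    "model_3": 22666.73301369863,
    "model_4": 23105.203705748143,
    "model_5": 22419.19760824819,
}

# The key with the smallest MAE identifies the best model
best_name = min(maes, key=maes.get)
print(best_name, round(maes[best_name], 2))
```

The gap between the top two models is small relative to the MAE itself, so the ranking may change with a different validation split.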

# **Step 2: Generate test predictions**
Now it's time to go through the modeling process and make predictions. In the cell below, create a Random Forest model with the variable name my_model.

In [6]:
my_model = RandomForestRegressor(n_estimators=180, criterion="absolute_error", max_depth=12, min_samples_split=3, random_state=42)

The code below fits the model to the combined training and validation data (X and y), and then generates test predictions that are saved to a CSV file. These test predictions can be submitted directly to the competition!

In [7]:
# Fit model
my_model.fit(X, y)

# Generate test predictions
predt_val = my_model.predict(X_test_copy)

# Save predictions to CSV, with one row per test Id
output = pd.DataFrame({"Id": X_test_copy.index, "SalePrice": predt_val})
output.to_csv("../resources/result/prediction_sale_price.csv", index=False)
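
Before submitting, it can help to read the file back and check its layout. The sketch below does this with a temporary file and made-up predictions. The two-column Id/SalePrice layout follows the cell above; the toy values and the temporary path are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up ids and predictions standing in for X_test_copy.index and predt_val
toy_output = pd.DataFrame(
    {"Id": [1461, 1462, 1463], "SalePrice": [120000.0, 155000.5, 180250.0]}
)

# Write to a temporary directory and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "prediction_sale_price.csv")
    toy_output.to_csv(csv_path, index=False)
    submission = pd.read_csv(csv_path)

# Check the column layout and look for missing predictions
print(list(submission.columns))
print(submission["SalePrice"].isna().any())
```

Passing index=False keeps the DataFrame's row index out of the file, so the CSV contains only the Id and SalePrice columns.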