**This notebook is an exercise in the [Introduction to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/machine-learning-competitions).**

---

# Introduction

In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) to apply what you've learned and move up the leaderboard.

Begin by running the code cell below to set up code checking and the filepaths for the dataset.

In [1]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex7 import *

# Set up filepaths
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")

Here's some of the code you've written so far. Start by running it again.

In [2]:
# Import helpful libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Load the data, and separate the target
iowa_file_path = '../input/train.csv'
home_data = pd.read_csv(iowa_file_path)

In [3]:
# path to file you will use for predictions
test_data_path = '../input/test.csv'

# read test data file using pandas
test_data = pd.read_csv(test_data_path)

# Results of the default model

First 15 predictions:

[122657.0,
 156789.0,
 182959.0,
 178102.0,
 189049.0,
 180979.0,
 172797.0,
 173717.0,
 187535.0,
 116172.0,
 190798.0,
 93823.0,
 89249.0,
 145111.0,
 124696.0]
 
Validation MAE for DEFAULT Random Forest Model: **21,857**



# Model optimization

· We will clean the dataset, one-hot encode the categorical features, scale the numerical features, and train a new model.
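
As a rough illustration of how these steps fit together, the sketch below chains imputation, min-max scaling, one-hot encoding and a random forest in a single scikit-learn `Pipeline`. It is not the approach used in the cells that follow, which do each step by hand with pandas. The frame `toy`, its values, and the names `numeric_steps`, `categorical_steps` and `sketch_model` are assumptions made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Tiny made-up frame standing in for the housing data (illustrative values only).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, np.nan, 14260, 14115],
    "MSZoning": ["RL", "RL", "RM", "RL", np.nan, "RM"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})
X_toy = toy.drop(columns="SalePrice")
y_toy = toy["SalePrice"]

# Numeric columns: fill gaps with the mean, then scale to [0, 1].
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
])
# Categorical columns: fill gaps with a placeholder label, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="non-existent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, ["LotArea"]),
    ("cat", categorical_steps, ["MSZoning"]),
])

sketch_model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=10, random_state=1)),
])
sketch_model.fit(X_toy, y_toy)
toy_preds = sketch_model.predict(X_toy)
print(toy_preds.shape)
```

One reason to consider a pipeline is that the same fitted preprocessing is reused for the test data, which avoids repeating each cleaning cell for both files.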

## Cleaning dataset

In [4]:
import pandas as pd
import numpy as np

In [5]:
pd.set_option('display.max_rows', None)

In [6]:
home_data2 = home_data.copy()
home_data2.set_index('Id', inplace=True)

In [7]:
#Drop columns
threshold = 200

for col in home_data2.columns:
    if home_data2[col].isnull().sum() >= threshold:
        del(home_data2[col])
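
The same filter can be written without an explicit loop. The sketch below uses a made-up frame `toy` and a hypothetical `toy_threshold` chosen to suit its four rows. It keeps only the columns whose missing-value count is below the threshold.

```python
import numpy as np
import pandas as pd

# Made-up frame: one column is mostly missing, the other is complete.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "complete": [1, 2, 3, 4],
})
toy_threshold = 3  # hypothetical threshold sized for this tiny example

# Count NaNs per column and keep the columns below the threshold.
null_counts = toy.isnull().sum()
toy_kept = toy.loc[:, null_counts < toy_threshold]
print(list(toy_kept.columns))
```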

In [8]:
#Inspect remaining columns with NaNs
columns_with_nan = []

for col in home_data2.columns:
    if home_data2[col].isnull().sum() > 0:
        columns_with_nan.append(col)

home_data2[columns_with_nan].head(10)  
        

Id,MasVnrArea,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,Electrical,GarageType,GarageYrBlt,GarageFinish,GarageQual,GarageCond
1,196.0,Gd,TA,No,GLQ,Unf,SBrkr,Attchd,2003.0,RFn,TA,TA
2,0.0,Gd,TA,Gd,ALQ,Unf,SBrkr,Attchd,1976.0,RFn,TA,TA
3,162.0,Gd,TA,Mn,GLQ,Unf,SBrkr,Attchd,2001.0,RFn,TA,TA
4,0.0,TA,Gd,No,ALQ,Unf,SBrkr,Detchd,1998.0,Unf,TA,TA
5,350.0,Gd,TA,Av,GLQ,Unf,SBrkr,Attchd,2000.0,RFn,TA,TA
6,0.0,Gd,TA,No,GLQ,Unf,SBrkr,Attchd,1993.0,Unf,TA,TA
7,186.0,Ex,TA,Av,GLQ,Unf,SBrkr,Attchd,2004.0,RFn,TA,TA
8,240.0,Gd,TA,Mn,ALQ,BLQ,SBrkr,Attchd,1973.0,RFn,TA,TA
9,0.0,TA,TA,No,Unf,Unf,FuseF,Detchd,1931.0,Unf,Fa,TA
10,0.0,TA,TA,No,GLQ,Unf,SBrkr,Attchd,1939.0,RFn,Gd,TA


In [9]:
home_data3 = home_data2.copy()

#Fill NaNs with the mean for float columns and with "non-existent" for object columns
columns_with_nan = []

for col in home_data3.columns:
    if home_data3[col].isnull().sum() > 0 and home_data3[col].dtype == 'float64':
        home_data3[col] = home_data3[col].fillna(value = home_data3[col].mean())
    elif home_data3[col].isnull().sum() > 0 and home_data3[col].dtype == 'object':
        home_data3[col] = home_data3[col].fillna(value = "non-existent")
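
A loop-free variant of this fill, sketched on a made-up frame `toy` (column names and values are assumptions for illustration), selects the numeric and non-numeric columns by dtype and fills each group in one step.

```python
import numpy as np
import pandas as pd

# Made-up frame with one gap in a numeric column and one in a text column.
toy = pd.DataFrame({
    "area": [100.0, np.nan, 300.0],
    "quality": ["Gd", np.nan, "TA"],
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

toy_filled = toy.copy()
# Numeric gaps get the column mean; text gaps get a placeholder label.
toy_filled[num_cols] = toy_filled[num_cols].fillna(toy_filled[num_cols].mean())
toy_filled[cat_cols] = toy_filled[cat_cols].fillna("non-existent")
print(toy_filled)
```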
        

In [10]:
home_data3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1460 entries, 1 to 1460
Data columns (total 73 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotArea        1460 non-null   int64  
 3   Street         1460 non-null   object 
 4   LotShape       1460 non-null   object 
 5   LandContour    1460 non-null   object 
 6   Utilities      1460 non-null   object 
 7   LotConfig      1460 non-null   object 
 8   LandSlope      1460 non-null   object 
 9   Neighborhood   1460 non-null   object 
 10  Condition1     1460 non-null   object 
 11  Condition2     1460 non-null   object 
 12  BldgType       1460 non-null   object 
 13  HouseStyle     1460 non-null   object 
 14  OverallQual    1460 non-null   int64  
 15  OverallCond    1460 non-null   int64  
 16  YearBuilt      1460 non-null   int64  
 17  YearRemodAdd   1460 non-null   int64  
 18  RoofStyle    

# Group low-frequency categories into an 'other' category

In [11]:
def replace_less_frequent(df, list_col, threshold=0.02, new_value='other'):
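    # NOTE: rare values are pooled across every column in list_col, so a label that is
    # rare in one column is also replaced in any other column where it appears.
    vals_to_change = []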

    # Iterate over each column in the list
    for col in list_col:
        # Get values with frequency less than the threshold
        filtered_values = df[col].value_counts(normalize=True)
        filtered_values = filtered_values[filtered_values < threshold].index.tolist()
        vals_to_change.extend(filtered_values)  # Extend instead of append
    
    # Replace less frequent values with new_value
    for col in list_col:
        df[col] = np.where(df[col].isin(vals_to_change), new_value, df[col])
    
    # Print value counts for each modified column
    for col in list_col:
        print("\n", df[col].value_counts(normalize=True, dropna=False))
        print("\n",f'** NEW {col} created correctly**')
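
Because `replace_less_frequent` gathers the rare values of all columns into one list, a label such as `Gd` that is rare in one column would also be replaced in a column where it is common. If that is not intended, a per-column variant could look like the sketch below. The function name `replace_rare_per_column` and the toy frame are hypothetical, and switching to it would change the outputs shown in this notebook.

```python
import pandas as pd

def replace_rare_per_column(df, cols, threshold=0.02, new_value="other"):
    """Return a copy where values rarer than `threshold` are replaced, column by column."""
    out = df.copy()
    for col in cols:
        freq = out[col].value_counts(normalize=True)
        rare = freq[freq < threshold].index
        # Keep values that are not rare in this column; replace the rest.
        out[col] = out[col].where(~out[col].isin(rare), new_value)
    return out

# Made-up data: "Gd" is common in BsmtQual but rare in GarageQual.
toy = pd.DataFrame({
    "BsmtQual": ["Gd"] * 6 + ["TA"] * 4,
    "GarageQual": ["TA"] * 9 + ["Gd"],
})
toy_grouped = replace_rare_per_column(toy, ["BsmtQual", "GarageQual"], threshold=0.2)
print(toy_grouped["GarageQual"].value_counts())
```

Returning a copy rather than modifying `df` in place is a design choice of this sketch, not something the original function does.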

In [12]:
# create list of categorical variables
l_cat = []
for col in home_data3.columns:
    if home_data3[col].dtype.kind == 'O':
        l_cat.append(col)

In [13]:
for col in l_cat:
    print("\n", home_data3[col].value_counts(normalize=True, dropna=False))


 MSZoning
RL         0.788356
RM         0.149315
FV         0.044521
RH         0.010959
C (all)    0.006849
Name: proportion, dtype: float64

 Street
Pave    0.99589
Grvl    0.00411
Name: proportion, dtype: float64

 LotShape
Reg    0.633562
IR1    0.331507
IR2    0.028082
IR3    0.006849
Name: proportion, dtype: float64

 LandContour
Lvl    0.897945
Bnk    0.043151
HLS    0.034247
Low    0.024658
Name: proportion, dtype: float64

 Utilities
AllPub    0.999315
NoSeWa    0.000685
Name: proportion, dtype: float64

 LotConfig
Inside     0.720548
Corner     0.180137
CulDSac    0.064384
FR2        0.032192
FR3        0.002740
Name: proportion, dtype: float64

 LandSlope
Gtl    0.946575
Mod    0.044521
Sev    0.008904
Name: proportion, dtype: float64

 Neighborhood
NAmes      0.154110
CollgCr    0.102740
OldTown    0.077397
Edwards    0.068493
Somerst    0.058904
Gilbert    0.054110
NridgHt    0.052740
Sawyer     0.050685
NWAmes     0.050000
SawyerW    0.040411
BrkSide    0.039726
Crawfor

In [14]:
# replace low-frequency (<6%) values with "other"
replace_less_frequent(df=home_data3, list_col=l_cat, threshold=0.06)


 MSZoning
RL       0.788356
RM       0.149315
other    0.062329
Name: proportion, dtype: float64

 ** NEW MSZoning created correctly**

 Street
Pave     0.99589
other    0.00411
Name: proportion, dtype: float64

 ** NEW Street created correctly**

 LotShape
Reg      0.633562
IR1      0.331507
other    0.034932
Name: proportion, dtype: float64

 ** NEW LotShape created correctly**

 LandContour
Lvl      0.897945
other    0.102055
Name: proportion, dtype: float64

 ** NEW LandContour created correctly**

 Utilities
AllPub    0.999315
other     0.000685
Name: proportion, dtype: float64

 ** NEW Utilities created correctly**

 LotConfig
Inside     0.720548
Corner     0.180137
CulDSac    0.064384
other      0.034932
Name: proportion, dtype: float64

 ** NEW LotConfig created correctly**

 LandSlope
Gtl      0.946575
other    0.053425
Name: proportion, dtype: float64

 ** NEW LandSlope created correctly**

 Neighborhood
other      0.597260
NAmes      0.154110
CollgCr    0.102740
OldTown    

## One-hot encode categorical features and scale numerical features

In [15]:
home_data4 = home_data3.copy()

In [16]:
"""
#Minmax scale of numeric variables
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# create list of numerical variables
l_num = []
target = 'SalePrice'
for col in home_data4.columns:
    if col != target:
        if home_data4[col].dtype.kind == 'i' or home_data4[col].dtype.kind == 'f':
            l_num.append(col)
        
# scale numerical variables
home_data4[l_num] = scaler.fit_transform(home_data4[l_num])"""

# Target encoding of categorical variables with more than 4 categories

In [17]:
""" Target encode categorical variables
encoder = TargetEncoder()

for col in l_cat:
    if home_data4[col].dtype == 'object':
        home_data4[col] = encoder.fit_transform(home_data4[col], home_data4.SalePrice)
home_data4[l_cat] = encoder.fit_transform(home_data4[l_cat], home_data4.SalePrice) """

In [18]:
# create dummy variables
home_data_dummies = pd.get_dummies(home_data4, columns=l_cat, dtype='int32')

## Correlation and elimination of features

In [19]:
corr_matrix = home_data_dummies.corr()

# Mask the upper-right triangle (values become NaN) so each pair appears only once
upper_triangle_mask = np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
corr_matrix_upper_zero = corr_matrix.mask(upper_triangle_mask)

# Flatten to (feature, feature, correlation) rows, sorted from highest to lowest
corr_result = corr_matrix_upper_zero.unstack().sort_values(ascending=False)
corr_result_df = corr_result.reset_index()

In [20]:
print(corr_result_df[corr_result_df['level_0'] != corr_result_df['level_1']].head(50).to_string(index=False))

            level_0               level_1        0
       SaleType_New SaleCondition_Partial 0.986819
Exterior1st_VinylSd   Exterior2nd_VinylSd 0.977525
Exterior1st_MetalSd   Exterior2nd_MetalSd 0.973065
Exterior1st_HdBoard   Exterior2nd_HdBoard 0.883271
         GarageCars            GarageArea 0.882475
Exterior1st_Wd Sdng   Exterior2nd_Wd Sdng 0.859244
   GarageType_other    GarageFinish_other 0.828844
          GrLivArea          TotRmsAbvGrd 0.825489
        TotalBsmtSF              1stFlrSF 0.819530
           2ndFlrSF     HouseStyle_2Story 0.809150
        OverallQual             SalePrice 0.790982
   GarageQual_other      GarageCond_other 0.786216
      GarageQual_TA         GarageCond_TA 0.786216
  Exterior1st_other     Exterior2nd_other 0.781775
          YearBuilt           GarageYrBlt 0.780555
 GarageFinish_other      GarageCond_other 0.762394
Exterior1st_Plywood   Exterior2nd_Plywood 0.755085
 GarageFinish_other      GarageQual_other 0.718900
         BsmtFinSF2    BsmtFinT

In [21]:
corr_result_df[corr_result_df[0] < 0].head()

,level_0,level_1,0
4982,GarageYrBlt,GarageFinish_other,-1.409657e-15
4983,LandContour_other,Exterior2nd_other,-1.818293e-05
4984,RoofStyle_Gable,Heating_other,-9.304735e-05
4985,LotConfig_other,BsmtExposure_other,-9.510832e-05
4986,BsmtFullBath,3SsnPorch,-0.0001060915


In [22]:
# Keep features that appear in any pair with 0.2 < |corr| < 0.8 (all pairs, not only those involving SalePrice)

corr_result_df.columns = ['feature_1', 'feature_2', 'corr']
high_corr_features = corr_result_df[(np.abs(corr_result_df['corr']) > 0.2) & (np.abs(corr_result_df['corr']) < 0.8)]

high_corr_features_unique = set(high_corr_features['feature_1'].tolist() + high_corr_features['feature_2'].tolist())

In [23]:
len(high_corr_features_unique)

127

In [24]:
high_corr_features_unique.remove('SalePrice')
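
The selection above draws on every pairwise correlation, not only on correlations with `SalePrice`. If the aim is to keep features that correlate with the target, one option is to read the `SalePrice` column of the correlation matrix directly. The sketch below does this on synthetic data. The frame `toy`, its column names, and the 0.4 cutoff are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: "strong" drives the target, "noise" is unrelated to it.
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "strong": signal,
    "noise": rng.normal(size=n),
    "SalePrice": 3 * signal + rng.normal(scale=0.5, size=n),
})

# Correlation of every feature with the target, excluding the target itself.
target_corr = toy.corr()["SalePrice"].drop("SalePrice")
selected = target_corr[target_corr.abs() > 0.4].index.tolist()
print(selected)
```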

## Optimizing model

### Cleaning test data 

In [25]:
test_data = pd.read_csv(test_data_path)
test_data.set_index('Id', inplace=True)

In [26]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1459 entries, 1461 to 2919
Data columns (total 79 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1459 non-null   int64  
 1   MSZoning       1455 non-null   object 
 2   LotFrontage    1232 non-null   float64
 3   LotArea        1459 non-null   int64  
 4   Street         1459 non-null   object 
 5   Alley          107 non-null    object 
 6   LotShape       1459 non-null   object 
 7   LandContour    1459 non-null   object 
 8   Utilities      1457 non-null   object 
 9   LotConfig      1459 non-null   object 
 10  LandSlope      1459 non-null   object 
 11  Neighborhood   1459 non-null   object 
 12  Condition1     1459 non-null   object 
 13  Condition2     1459 non-null   object 
 14  BldgType       1459 non-null   object 
 15  HouseStyle     1459 non-null   object 
 16  OverallQual    1459 non-null   int64  
 17  OverallCond    1459 non-null   int64  
 18  YearBuilt 

In [27]:
#Drop the columns that reach the NaN threshold in the training data, so train and test keep matching columns
threshold = 200

for col in home_data.columns:
    if home_data[col].isnull().sum() >= threshold:
        del(test_data[col])

In [28]:
test_data2 = test_data.copy()

#Inspect remaining columns with NaNs
columns_with_nan = []

for col in test_data2.columns:
    if test_data2[col].isnull().sum() > 0:
        columns_with_nan.append(col)

test_data2[columns_with_nan].head(10)  

Id,MSZoning,Utilities,Exterior1st,Exterior2nd,MasVnrArea,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,...,KitchenQual,Functional,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,SaleType
1461,RH,AllPub,VinylSd,VinylSd,0.0,TA,TA,No,Rec,468.0,...,TA,Typ,Attchd,1961.0,Unf,1.0,730.0,TA,TA,WD
1462,RL,AllPub,Wd Sdng,Wd Sdng,108.0,TA,TA,No,ALQ,923.0,...,Gd,Typ,Attchd,1958.0,Unf,1.0,312.0,TA,TA,WD
1463,RL,AllPub,VinylSd,VinylSd,0.0,Gd,TA,No,GLQ,791.0,...,TA,Typ,Attchd,1997.0,Fin,2.0,482.0,TA,TA,WD
1464,RL,AllPub,VinylSd,VinylSd,20.0,TA,TA,No,GLQ,602.0,...,Gd,Typ,Attchd,1998.0,Fin,2.0,470.0,TA,TA,WD
1465,RL,AllPub,HdBoard,HdBoard,0.0,Gd,TA,No,ALQ,263.0,...,Gd,Typ,Attchd,1992.0,RFn,2.0,506.0,TA,TA,WD
1466,RL,AllPub,HdBoard,HdBoard,0.0,Gd,TA,No,Unf,0.0,...,TA,Typ,Attchd,1993.0,Fin,2.0,440.0,TA,TA,WD
1467,RL,AllPub,HdBoard,HdBoard,0.0,Gd,TA,No,ALQ,935.0,...,TA,Typ,Attchd,1992.0,Fin,2.0,420.0,TA,TA,WD
1468,RL,AllPub,VinylSd,VinylSd,0.0,Gd,TA,No,Unf,0.0,...,TA,Typ,Attchd,1998.0,Fin,2.0,393.0,TA,TA,WD
1469,RL,AllPub,HdBoard,HdBoard,0.0,Gd,TA,Gd,GLQ,637.0,...,Gd,Typ,Attchd,1990.0,Unf,2.0,506.0,TA,TA,WD
1470,RL,AllPub,Plywood,Plywood,0.0,TA,TA,No,ALQ,804.0,...,TA,Typ,Attchd,1970.0,Fin,2.0,525.0,TA,TA,WD


In [29]:
test_data3 = test_data2.copy()

#Fill NaNs with the mean for float columns and with "non-existent" for object columns
columns_with_nan = []

for col in test_data3.columns:
    if test_data3[col].isnull().sum() > 0 and test_data3[col].dtype == 'float64':
        test_data3[col] = test_data3[col].fillna(value = test_data3[col].mean())
    elif test_data3[col].isnull().sum() > 0 and test_data3[col].dtype == 'object':
        test_data3[col] = test_data3[col].fillna(value = "non-existent")

In [30]:
"""test_data4 = test_data3.copy()
#Minmax scale of numeric variables
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

l_num = []
for col in test_data4.columns:
    if test_data4[col].dtype.kind == 'i' or test_data4[col].dtype.kind == 'f':
        l_num.append(col)
        
# scale numerical variables
test_data4[l_num] = scaler.fit_transform(test_data4[l_num])"""

In [31]:
test_data4 = test_data3.copy()
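
The disabled scaling cells call `fit_transform` separately on the training and test data, which would map the two files onto different scales. If scaling is re-enabled, a common pattern is to fit the scaler on the training data only and reuse it on the test data, as in the sketch below (the frames `toy_train` and `toy_test` are made up). Tree-based models such as random forests are generally insensitive to this kind of rescaling, so the step may matter more for other model types.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up train and test frames sharing one numeric column.
toy_train = pd.DataFrame({"LotArea": [1000.0, 2000.0, 3000.0]})
toy_test = pd.DataFrame({"LotArea": [1500.0, 4000.0]})

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(toy_train)  # learns min=1000, max=3000
test_scaled = toy_scaler.transform(toy_test)        # reuses the training min and max

# Test values outside the training range fall outside [0, 1].
print(test_scaled.ravel())
```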

In [32]:
# create list of categorical variables
l_cat = []
for col in test_data4.columns:
    if test_data4[col].dtype.kind == 'O':
        l_cat.append(col)

# replace low-frequency (<6%) values with "other"
replace_less_frequent(df=test_data4, list_col=l_cat, threshold=0.06)

# create dummy variables
test_data_dummies = pd.get_dummies(test_data4, columns=l_cat, dtype='int32')


 MSZoning
RL       0.763537
RM       0.165867
other    0.070596
Name: proportion, dtype: float64

 ** NEW MSZoning created correctly**

 Street
Pave     0.995888
other    0.004112
Name: proportion, dtype: float64

 ** NEW Street created correctly**

 LotShape
Reg      0.640164
IR1      0.331734
other    0.028101
Name: proportion, dtype: float64

 ** NEW LotShape created correctly**

 LandContour
Lvl      0.898561
other    0.101439
Name: proportion, dtype: float64

 ** NEW LandContour created correctly**

 Utilities
AllPub    0.998629
other     0.001371
Name: proportion, dtype: float64

 ** NEW Utilities created correctly**

 LotConfig
Inside    0.740918
Corner    0.169979
other     0.089102
Name: proportion, dtype: float64

 ** NEW LotConfig created correctly**

 LandSlope
Gtl      0.95682
other    0.04318
Name: proportion, dtype: float64

 ** NEW LandSlope created correctly**

 Neighborhood
other      0.492803
NAmes      0.149417
OldTown    0.086361
CollgCr    0.080192
Somerst    0.0

In [33]:
test_data_dummies.head()

Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,PavedDrive_N,PavedDrive_Y,PavedDrive_other,SaleType_New,SaleType_WD,SaleType_other,SaleCondition_Abnorml,SaleCondition_Normal,SaleCondition_Partial,SaleCondition_other
1461,20,11622,5,6,1961,1961,0.0,468.0,144.0,270.0,...,0,1,0,0,1,0,0,1,0,0
1462,20,14267,6,6,1958,1958,108.0,923.0,0.0,406.0,...,0,1,0,0,1,0,0,1,0,0
1463,60,13830,5,5,1997,1998,0.0,791.0,0.0,137.0,...,0,1,0,0,1,0,0,1,0,0
1464,60,9978,6,6,1998,1998,20.0,602.0,0.0,324.0,...,0,1,0,0,1,0,0,1,0,0
1465,120,5005,8,5,1992,1992,0.0,263.0,0.0,1017.0,...,0,1,0,0,1,0,0,1,0,0


# Train a new model

In [34]:
import random
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

In [35]:
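# Function to optimize max_leaf_nodes
from sklearn.tree import DecisionTreeRegressor  # needed by get_mae below

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a single decision tree with the given size limit and score it on the validation split
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae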
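
A helper like `get_mae` is typically called once per candidate tree size, keeping the size with the lowest validation error. The sketch below repeats the helper with its imports so that it runs on its own, and tries it on synthetic data from `make_regression`. The candidate sizes and the `demo_` names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit one decision tree with the given size limit and return its validation MAE.
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Synthetic regression data standing in for the housing features.
demo_X, demo_y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_X, demo_y, random_state=1
)

candidates = [5, 25, 50, 100]  # hypothetical sizes to compare
scores = {
    size: get_mae(size, demo_train_X, demo_val_X, demo_train_y, demo_val_y)
    for size in candidates
}
best_size = min(scores, key=scores.get)
print(best_size, scores[best_size])
```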

In [36]:
"""
X = home_data_dummies[list(high_corr_features_unique)]
y = home_data_dummies.SalePrice

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

for max_leaf_nodes in range(10, 10000, 50):
    rf_model = RandomForestRegressor(random_state=1, max_leaf_nodes=max_leaf_nodes)
    rf_model.fit(train_X, train_y)
    rf_val_predictions = rf_model.predict(val_X)
    rf_val_mae = mean_absolute_error(rf_val_predictions, val_y)
    
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" %(max_leaf_nodes, rf_val_mae))
    
    if rf_val_mae < 15000:
        print("Stopping criteria reached.")
        break
"""

In [37]:
"""X = home_data_dummies[list(high_corr_features_unique)]
#X = home_data_dummies.drop(['SalePrice'], axis=1)
y = home_data_dummies.SalePrice

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Define hyperparameters to explore
n_estimators_list = [50, 100, 150]
max_depth_list = [None, 10, 20]
min_samples_split_list = [2, 5, 10]

best_mae = float('inf')
best_hyperparameters = {}

for n_estimators in n_estimators_list:
    for max_depth in max_depth_list:
        for min_samples_split in min_samples_split_list:
            rf_model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, min_samples_split=min_samples_split, random_state=1, max_leaf_nodes=300)
            rf_model.fit(train_X, train_y)
            rf_val_predictions = rf_model.predict(val_X)
            rf_val_mae = mean_absolute_error(rf_val_predictions, val_y)
            
            print("n_estimators: {}, max_depth: {}, min_samples_split: {} \t\t Mean Absolute Error: {}".format(n_estimators, max_depth, min_samples_split, rf_val_mae))
            
            if rf_val_mae < best_mae:
                best_mae = rf_val_mae
                best_hyperparameters = {'n_estimators': n_estimators, 'max_depth': max_depth, 'min_samples_split': min_samples_split}

print("\nBest Hyperparameters:")
print(best_hyperparameters)
print("Best Mean Absolute Error:", best_mae)"""

# Train on all data with the best hyperparameters

In [38]:
high_corr_features_unique_test = set(test_data_dummies.columns).intersection(high_corr_features_unique)

In [39]:
common_columns = list(set(home_data_dummies[list(high_corr_features_unique)].columns).intersection(test_data_dummies[list(high_corr_features_unique_test)].columns))
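
Taking the intersection drops any dummy column that exists in only one of the two frames. An alternative worth knowing, sketched below on made-up data, is to reindex the test dummies to the training columns and fill absent columns with zeros, which keeps every training feature in a fixed order. The frames and column values here are assumptions for illustration.

```python
import pandas as pd

# Made-up frames: the test data lacks "RM" and "FV" and has an unseen "RH".
toy_train = pd.DataFrame({"Zone": ["RL", "RM", "FV"], "LotArea": [1, 2, 3]})
toy_test = pd.DataFrame({"Zone": ["RL", "RH"], "LotArea": [4, 5]})

train_dummies = pd.get_dummies(toy_train, columns=["Zone"], dtype="int32")
test_dummies = pd.get_dummies(toy_test, columns=["Zone"], dtype="int32")

# Match the training layout: missing columns become 0, unseen ones are dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```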

In [40]:
X = home_data_dummies[common_columns] #already without SalePrice
y = home_data_dummies.SalePrice

# To improve accuracy, train on all training data
rf_model_on_full_data = RandomForestRegressor(random_state=1, max_leaf_nodes=300, n_estimators= 150, max_depth= 20, min_samples_split= 2)
rf_model_on_full_data.fit(X, y)
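
The hyperparameter search in the disabled cell earlier scores each combination on a single validation split. scikit-learn's `GridSearchCV` can run the same kind of search with cross-validation. The sketch below uses synthetic data and a deliberately small grid so that it runs quickly. The grid values and the `demo_` names are assumptions, not the settings used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the housing features.
demo_X, demo_y = make_regression(n_samples=120, n_features=6, noise=5.0, random_state=0)

# Small hypothetical grid over the same three hyperparameters.
param_grid = {
    "n_estimators": [10, 20],
    "max_depth": [None, 5],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, so MAE is negated
    cv=3,
)
search.fit(demo_X, demo_y)

best_params = search.best_params_
best_cv_mae = -search.best_score_
print(best_params, best_cv_mae)
```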

In [41]:
# make predictions which we will submit. 
test_preds_modelV2 = rf_model_on_full_data.predict(test_data_dummies[common_columns])

test_preds_modelV2[0:15].round(0).tolist()

[129840.0,
 155869.0,
 181660.0,
 183333.0,
 196375.0,
 183029.0,
 169810.0,
 175988.0,
 181028.0,
 120223.0,
 198745.0,
 97440.0,
 101878.0,
 153892.0,
 141264.0]

# Generate a submission

Run the code cell below to generate a CSV file with your predictions that you can use to submit to the competition.

In [42]:
# Run the code to save predictions in the format used for competition scoring

output = pd.DataFrame({'Id': test_data.index,
                       'SalePrice': test_preds_modelV2})
output.to_csv('submission_v5.csv', index=False)
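
Before uploading, it can help to confirm that the file has the expected two columns and no missing predictions. The sketch below runs those checks on a two-row frame built from the first two Ids and predictions shown earlier. It round-trips through CSV text in memory rather than reading `submission_v5.csv`, and the `checks` dictionary is a made-up helper for this example.

```python
import io

import pandas as pd

# First two Ids and predictions copied from the outputs shown earlier in this notebook.
demo_output = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [129840.0, 155869.0]})

# Round-trip through CSV text in memory instead of writing a file.
csv_text = demo_output.to_csv(index=False)
reloaded = pd.read_csv(io.StringIO(csv_text))

checks = {
    "has_expected_columns": list(reloaded.columns) == ["Id", "SalePrice"],
    "no_missing_predictions": bool(reloaded["SalePrice"].notna().all()),
    "ids_are_unique": bool(reloaded["Id"].is_unique),
}
print(checks)
```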

# Submit to the competition

To test your results, you'll need to join the competition (if you haven't already).  So open a new window by clicking on **[this link](https://www.kaggle.com/c/home-data-for-ml-course)**.  Then click on the **Join Competition** button.

![join competition image](https://storage.googleapis.com/kaggle-media/learn/images/axBzctl.png)

Next, follow the instructions below:
1. Begin by clicking on the **Save Version** button in the top right corner of the window.  This will generate a pop-up window.  
2. Ensure that the **Save and Run All** option is selected, and then click on the **Save** button.
3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
4. Click on the **Data** tab near the top of the screen.  Then, click on the file you would like to submit, and click on the **Submit** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

If you want to keep working to improve your performance, select the **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.


# Continue Your Progress
There are many ways to improve your model, and **experimenting is a great way to learn at this point.**

The best way to improve your model is to add features.  To add more features to the data, revisit the first code cell, and change this line of code to include more column names:
```python
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
```

Some features will cause errors because of issues like missing values or non-numeric data types.  Here is a complete list of potential columns that you might like to use, and that won't throw errors:
- 'MSSubClass'
- 'LotArea'
- 'OverallQual' 
- 'OverallCond' 
- 'YearBuilt'
- 'YearRemodAdd' 
- '1stFlrSF'
- '2ndFlrSF' 
- 'LowQualFinSF' 
- 'GrLivArea'
- 'FullBath'
- 'HalfBath'
- 'BedroomAbvGr' 
- 'KitchenAbvGr' 
- 'TotRmsAbvGrd' 
- 'Fireplaces' 
- 'WoodDeckSF' 
- 'OpenPorchSF'
- 'EnclosedPorch' 
- '3SsnPorch' 
- 'ScreenPorch' 
- 'PoolArea' 
- 'MiscVal' 
- 'MoSold' 
- 'YrSold'

Look at the list of columns and think about what might affect home prices.  To learn more about each of these features, take a look at the data description on the **[competition page](https://www.kaggle.com/c/home-data-for-ml-course/data)**.

After updating the code cell above that defines the features, re-run all of the code cells to evaluate the model and generate a new submission file.  


# What's next?

As mentioned above, some of the features will throw an error if you try to use them to train your model.  The **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course will teach you how to handle these types of features. You will also learn to use **xgboost**, a technique giving even better accuracy than Random Forest.

The **[Pandas](https://kaggle.com/Learn/Pandas)** course will give you the data manipulation skills to quickly go from conceptual idea to implementation in your data science projects. 

You are also ready for the **[Deep Learning](https://kaggle.com/Learn/intro-to-Deep-Learning)** course, where you will build models with better-than-human level performance at computer vision tasks.

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intro-to-machine-learning/discussion) to chat with other learners.*