**This notebook is an exercise in the [Introduction to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/machine-learning-competitions).**

---

# Introduction

In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) to apply what you've learned and move up the leaderboard.

Begin by running the code cell below to set up code checking and the filepaths for the dataset.

In [1]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex7 import *

# Set up filepaths
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")

Here's some of the code you've written so far. Start by running it again.

In [2]:
# Import helpful libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from category_encoders import TargetEncoder


# Load the data, and separate the target
iowa_file_path = '../input/train.csv'
home_data = pd.read_csv(iowa_file_path)

In [3]:
# path to file you will use for predictions
test_data_path = '../input/test.csv'

# read test data file using pandas
test_data = pd.read_csv(test_data_path)

# Results of the default model

First 15 predictions:

[122657.0,
 156789.0,
 182959.0,
 178102.0,
 189049.0,
 180979.0,
 172797.0,
 173717.0,
 187535.0,
 116172.0,
 190798.0,
 93823.0,
 89249.0,
 145111.0,
 124696.0]
 
Validation MAE for DEFAULT Random Forest Model: **21,857**



# Model optimization

· The steps below clean the dataset, encode the categorical features (target encoding and one-hot encoding), scale the numerical features, and train a new model.

## Cleaning dataset

In [4]:
import pandas as pd
import numpy as np

In [5]:
pd.set_option('display.max_rows', None)

In [6]:
home_data2 = home_data.copy()
home_data2.set_index('Id', inplace=True)

In [7]:
# Drop columns with at least `threshold` missing values
threshold = 200

for col in home_data2.columns:
    if home_data2[col].isnull().sum() >= threshold:
        del(home_data2[col])
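
The column-dropping loop above can also be written without an explicit loop. The following is a minimal sketch on a toy DataFrame (the values and the small `threshold` are made up for illustration, not taken from the competition data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for home_data2 (hypothetical values)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "MasVnrArea": [196.0, np.nan, 162.0, 0.0],
})

threshold = 2  # drop columns with at least this many missing values

# Count missing values per column, then keep the names at or above the threshold
null_counts = toy.isnull().sum()
cols_to_drop = null_counts[null_counts >= threshold].index.tolist()

toy_clean = toy.drop(columns=cols_to_drop)

print(cols_to_drop)
print(toy_clean.columns.tolist())
```

Keeping `cols_to_drop` as a list makes it easy to drop exactly the same columns from the test data later.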

In [8]:
#Inspect remaining columns with NaNs
columns_with_nan = []

for col in home_data2.columns:
    if home_data2[col].isnull().sum() > 0:
        columns_with_nan.append(col)

home_data2[columns_with_nan].head(10)  
        

Id,MasVnrArea,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,Electrical,GarageType,GarageYrBlt,GarageFinish,GarageQual,GarageCond
1,196.0,Gd,TA,No,GLQ,Unf,SBrkr,Attchd,2003.0,RFn,TA,TA
2,0.0,Gd,TA,Gd,ALQ,Unf,SBrkr,Attchd,1976.0,RFn,TA,TA
3,162.0,Gd,TA,Mn,GLQ,Unf,SBrkr,Attchd,2001.0,RFn,TA,TA
4,0.0,TA,Gd,No,ALQ,Unf,SBrkr,Detchd,1998.0,Unf,TA,TA
5,350.0,Gd,TA,Av,GLQ,Unf,SBrkr,Attchd,2000.0,RFn,TA,TA
6,0.0,Gd,TA,No,GLQ,Unf,SBrkr,Attchd,1993.0,Unf,TA,TA
7,186.0,Ex,TA,Av,GLQ,Unf,SBrkr,Attchd,2004.0,RFn,TA,TA
8,240.0,Gd,TA,Mn,ALQ,BLQ,SBrkr,Attchd,1973.0,RFn,TA,TA
9,0.0,TA,TA,No,Unf,Unf,FuseF,Detchd,1931.0,Unf,Fa,TA
10,0.0,TA,TA,No,GLQ,Unf,SBrkr,Attchd,1939.0,RFn,Gd,TA


In [9]:
home_data3 = home_data2.copy()

# Fill NaNs with the column mean for float columns and with the placeholder "non-existent" for object columns
columns_with_nan = []

for col in home_data3.columns:
    if home_data3[col].isnull().sum() > 0 and home_data3[col].dtype == 'float64':
        home_data3[col] = home_data3[col].fillna(value = home_data3[col].mean())
    elif home_data3[col].isnull().sum() > 0 and home_data3[col].dtype == 'object':
        home_data3[col] = home_data3[col].fillna(value = "non-existent")
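
One possible refinement, sketched here on toy data, is to compute the fill values on the training set only and reuse them for the test set, so both sets are imputed with the same numbers. The frames and column lists below are hypothetical stand-ins, not the notebook's actual variables:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature train/test frames
train = pd.DataFrame({
    "MasVnrArea": [100.0, np.nan, 200.0],
    "GarageType": ["Attchd", None, "Detchd"],
})
test = pd.DataFrame({
    "MasVnrArea": [np.nan, 50.0],
    "GarageType": [None, "Attchd"],
})

num_cols = ["MasVnrArea"]   # numeric columns to fill with the training mean
cat_cols = ["GarageType"]   # categorical columns to fill with a placeholder

# Fill values are learned from the training data only
train_means = train[num_cols].mean()

train_filled = train.copy()
test_filled = test.copy()
train_filled[num_cols] = train[num_cols].fillna(train_means)
test_filled[num_cols] = test[num_cols].fillna(train_means)
train_filled[cat_cols] = train[cat_cols].fillna("non-existent")
test_filled[cat_cols] = test[cat_cols].fillna("non-existent")

print(test_filled)
```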
        

In [10]:
home_data3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1460 entries, 1 to 1460
Data columns (total 73 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotArea        1460 non-null   int64  
 3   Street         1460 non-null   object 
 4   LotShape       1460 non-null   object 
 5   LandContour    1460 non-null   object 
 6   Utilities      1460 non-null   object 
 7   LotConfig      1460 non-null   object 
 8   LandSlope      1460 non-null   object 
 9   Neighborhood   1460 non-null   object 
 10  Condition1     1460 non-null   object 
 11  Condition2     1460 non-null   object 
 12  BldgType       1460 non-null   object 
 13  HouseStyle     1460 non-null   object 
 14  OverallQual    1460 non-null   int64  
 15  OverallCond    1460 non-null   int64  
 16  YearBuilt      1460 non-null   int64  
 17  YearRemodAdd   1460 non-null   int64  
 18  RoofStyle    

# Group low-frequency categories into an 'other' category

In [11]:
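def replace_less_frequent(df, list_col, threshold=0.02, new_value='other'):
    # Note: rare values are pooled across all columns in list_col, so a value that is
    # rare in one column is also replaced in every other column where it appears.
    vals_to_change = []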

    # Iterate over each column in the list
    for col in list_col:
        # Get values with frequency less than the threshold
        filtered_values = df[col].value_counts(normalize=True)
        filtered_values = filtered_values[filtered_values < threshold].index.tolist()
        vals_to_change.extend(filtered_values)  # Extend instead of append
    
    # Replace less frequent values with new_value
    for col in list_col:
        df[col] = np.where(df[col].isin(vals_to_change), new_value, df[col])
    
    # Print value counts for each modified column
    for col in list_col:
        print("\n", df[col].value_counts(normalize=True, dropna=False))
        print("\n",f'** NEW {col} created correctly**')
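
Because `replace_less_frequent` collects rare values from all columns into one list, a label such as `Gd` that is rare in one column is replaced in every column. A per-column variant is sketched below; the function name `group_rare_per_column` and the toy data are assumptions made for illustration, and this is not the function the rest of this notebook uses:

```python
import pandas as pd

def group_rare_per_column(df, cols, threshold=0.04, new_value="other"):
    """Return a copy of df where, within each column, categories whose
    relative frequency is below threshold are replaced by new_value."""
    out = df.copy()
    for col in cols:
        freqs = out[col].value_counts(normalize=True)
        rare = freqs[freqs < threshold].index
        # Keep values that are not rare in this column, replace the rest
        out[col] = out[col].where(~out[col].isin(rare), new_value)
    return out

# Toy data: 'Gd' is common in ExterQual but rare in GarageQual
toy = pd.DataFrame({
    "ExterQual": ["Gd"] * 40 + ["TA"] * 60,
    "GarageQual": ["TA"] * 97 + ["Gd"] * 3,
})

grouped = group_rare_per_column(toy, ["ExterQual", "GarageQual"])
print(grouped["ExterQual"].value_counts())
print(grouped["GarageQual"].value_counts())
```

In this sketch `Gd` survives in `ExterQual` and is only grouped into `other` in `GarageQual`.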

In [12]:
# create list of categorical variables
l_cat = []
for col in home_data3.columns:
    if home_data3[col].dtype.kind == 'O':
        l_cat.append(col)

In [13]:
# replace low-frequency (<4%) values with "other"
replace_less_frequent(df=home_data3, list_col=l_cat, threshold=0.04)


 MSZoning
RL       0.788356
RM       0.149315
FV       0.044521
other    0.017808
Name: proportion, dtype: float64

 ** NEW MSZoning created correctly**

 Street
Pave     0.99589
other    0.00411
Name: proportion, dtype: float64

 ** NEW Street created correctly**

 LotShape
Reg      0.633562
IR1      0.331507
other    0.034932
Name: proportion, dtype: float64

 ** NEW LotShape created correctly**

 LandContour
Lvl      0.897945
other    0.058904
Bnk      0.043151
Name: proportion, dtype: float64

 ** NEW LandContour created correctly**

 Utilities
AllPub    0.999315
other     0.000685
Name: proportion, dtype: float64

 ** NEW Utilities created correctly**

 LotConfig
Inside     0.720548
Corner     0.180137
CulDSac    0.064384
other      0.034932
Name: proportion, dtype: float64

 ** NEW LotConfig created correctly**

 LandSlope
Gtl      0.946575
other    0.053425
Name: proportion, dtype: float64

 ** NEW LandSlope created correctly**

 Neighborhood
other      0.290411
NAmes      0.15

## One-hot encode categorical features and scale numerical features

In [14]:
home_data4 = home_data3.copy()

In [15]:
l_num = []
for col in home_data4.columns:
    if 'Year' not in col and 'Yr' not in col:  # Exclude columns with 'Year' or 'Yr'
        if home_data4[col].dtype.kind in ['i', 'f']:  # Check if column is numeric (integer or float)
            if len(home_data4[col].unique()) > 10:  # Keep only numeric columns with more than 10 unique values
                if col != 'SalePrice':
                    l_num.append(col)

In [16]:
#Minmax scale of numeric variables
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
      
# scale numerical variables
home_data4[l_num] = scaler.fit_transform(home_data4[l_num])
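
A scaler learns its minimum and maximum from the data it is fitted on. The sketch below, on made-up numbers, shows the usual pattern of fitting on the training data and only transforming the test data, so that both end up on the same scale:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical miniature train/test frames
train = pd.DataFrame({"LotArea": [1000.0, 2000.0, 3000.0]})
test = pd.DataFrame({"LotArea": [1500.0, 4000.0]})

scaler = MinMaxScaler()

# fit_transform learns min/max from the training data
train_scaled = scaler.fit_transform(train[["LotArea"]])

# transform reuses the training min/max; test values outside the
# training range can fall outside [0, 1]
test_scaled = scaler.transform(test[["LotArea"]])

print(train_scaled.ravel())
print(test_scaled.ravel())
```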

# Target encoding of categorical variables with more than 3 categories

In [17]:
home_data5 = home_data4.copy()

In [18]:
# Drop the target variable from X
X = home_data5.drop('SalePrice', axis=1)
y = home_data5['SalePrice']

# Initialize an empty dictionary to store encoders for later use
encoders = {}

# Target encode categorical variables with more than 3 unique values
for col in X.select_dtypes(include=['object']).columns:
    if len(X[col].unique()) > 3:
        # Initialize the encoder
        encoder = TargetEncoder()
        # Fit the encoder
        encoder.fit(X[col], y)
        # Store the trained encoder
        encoders[col] = encoder

In [19]:
# Apply the fitted encoders to the training data ('home_data5')
for col, encoder in encoders.items():
    if col in home_data5.columns:
        home_data5[col] = encoder.transform(home_data5[col])
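
To make the idea behind target encoding concrete, here is a simplified sketch in plain pandas: each category is replaced by the mean target value seen for it in the training data. This is only an illustration on made-up numbers; the `TargetEncoder` from `category_encoders` additionally smooths category means toward the overall mean, which this sketch leaves out.

```python
import pandas as pd

# Hypothetical miniature training and test data
train = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "CollgCr", "CollgCr", "OldTown"],
    "SalePrice": [140000, 160000, 200000, 220000, 120000],
})
test = pd.DataFrame({"Neighborhood": ["CollgCr", "NAmes", "Somerst"]})

# Mean target per category, learned from the training data only
category_means = train.groupby("Neighborhood")["SalePrice"].mean()
global_mean = train["SalePrice"].mean()

# Categories unseen in training fall back to the overall mean
test_encoded = test["Neighborhood"].map(category_means).fillna(global_mean)

print(test_encoded.tolist())
```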

In [20]:
l_cat = []
for col in home_data5.columns:
    if home_data5[col].dtype == 'object':
        l_cat.append(col)
        
# create dummy variables
home_data_dummies = pd.get_dummies(home_data5, columns=l_cat, dtype='int32')
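
`pd.get_dummies` only creates columns for the categories present in the frame it is given, so training and test data can end up with different dummy columns. One way to line them up, sketched on toy data, is to reindex the test dummies to the training columns (this notebook instead keeps only the columns the two frames have in common):

```python
import pandas as pd

# Hypothetical miniature frames; 'other' never appears in the test data
train = pd.DataFrame({"SaleType": ["WD", "New", "other"]})
test = pd.DataFrame({"SaleType": ["WD", "WD", "New"]})

train_dummies = pd.get_dummies(train, columns=["SaleType"], dtype="int32")
test_dummies = pd.get_dummies(test, columns=["SaleType"], dtype="int32")

# Give the test frame exactly the training columns; missing ones are filled with 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_aligned)
```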

## Correlation and elimination of features

In [21]:
corr_matrix = home_data_dummies.corr()

# Mask the upper-right triangle (above the diagonal) with NaN so each pair appears only once
upper_triangle_mask = np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
corr_matrix_upper_zero = corr_matrix.mask(upper_triangle_mask)

# Sort all pairs of variables by correlation, highest first
corr_result = corr_matrix_upper_zero.unstack().sort_values(ascending=False)
corr_result_df = corr_result.reset_index()

In [22]:
print(corr_result_df[corr_result_df['level_0'] != corr_result_df['level_1']].head(50).to_string(index=False))

         level_0            level_1        0
     Exterior1st        Exterior2nd 0.966388
   SaleCondition       SaleType_New 0.951084
      GarageCars         GarageArea 0.882475
       GrLivArea       TotRmsAbvGrd 0.825489
     TotalBsmtSF           1stFlrSF 0.819530
     OverallQual          SalePrice 0.790982
GarageQual_other   GarageCond_other 0.786216
   GarageQual_TA      GarageCond_TA 0.786216
       YearBuilt        GarageYrBlt 0.780555
      BsmtFinSF2 BsmtFinType2_other 0.716235
       GrLivArea          SalePrice 0.708624
        2ndFlrSF          GrLivArea 0.687501
    BedroomAbvGr       TotRmsAbvGrd 0.676620
 ExterQual_other  KitchenQual_other 0.671600
    ExterQual_TA     KitchenQual_TA 0.671600
      GarageType       GarageFinish 0.651273
       YearBuilt   Foundation_PConc 0.651199
      BsmtFinSF1       BsmtFullBath 0.649212
     OverallQual    ExterQual_other 0.646247
     GarageYrBlt   Foundation_PConc 0.645394
      GarageCars          SalePrice 0.640409
       Yea

In [23]:
corr_result_df[corr_result_df[0] < 0].head()

Unnamed: 0,level_0,level_1,0
3126,KitchenAbvGr,LandContour_other,-7.2e-05
3127,RoofStyle_Gable,Heating_other,-9.3e-05
3128,BsmtFullBath,3SsnPorch,-0.000106
3129,TotalBsmtSF,BsmtHalfBath,-0.000315
3130,LandSlope_Gtl,BsmtCond_other,-0.0004


In [24]:
# Get features that appear in at least one pair with 0.2 < |corr| < 0.8

corr_result_df.columns = ['feature_1', 'feature_2', 'corr']
high_corr_features = corr_result_df[(np.abs(corr_result_df['corr']) > 0.2) & (np.abs(corr_result_df['corr']) < 0.8)]

high_corr_features_unique = set(high_corr_features['feature_1'].tolist() + high_corr_features['feature_2'].tolist())
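
The selection above keeps any feature that is moderately correlated with any other column. If the goal is to keep only features correlated with the target itself, a sketch like the following could be used. The toy data is constructed by hand, and the 0.2 cutoff simply mirrors the lower bound used above:

```python
import pandas as pd

# Hypothetical toy data: SalePrice is a linear function of OverallQual,
# while Noise is an unrelated alternating pattern
overall_qual = list(range(1, 11))
toy = pd.DataFrame({
    "OverallQual": overall_qual,
    "Noise": [1, -1, -1, 1, 1, -1, -1, 1, 1, -1],
    "SalePrice": [50000 + 20000 * q for q in overall_qual],
})

# Correlation of every feature with the target only
target_corr = toy.corr()["SalePrice"].drop("SalePrice")

# Keep features whose absolute correlation with the target exceeds the cutoff
selected = target_corr[target_corr.abs() > 0.2].index.tolist()

print(target_corr)
print(selected)
```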

In [25]:
len(high_corr_features_unique)

94

In [26]:
high_corr_features_unique.remove('SalePrice')

## Optimizing model

### Cleaning test data 

In [27]:
test_data = pd.read_csv(test_data_path)
test_data.set_index('Id', inplace=True)

In [28]:
# Drop the same columns that reached the missing-value threshold in the training data
threshold = 200

for col in home_data.columns:
    if home_data[col].isnull().sum() >= threshold:
        del(test_data[col])

In [29]:
test_data2 = test_data.copy()

#Inspect remaining columns with NaNs
columns_with_nan = []

for col in test_data2.columns:
    if test_data2[col].isnull().sum() > 0:
        columns_with_nan.append(col)

test_data2[columns_with_nan].head(10)  

Id,MSZoning,Utilities,Exterior1st,Exterior2nd,MasVnrArea,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,...,KitchenQual,Functional,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,SaleType
1461,RH,AllPub,VinylSd,VinylSd,0.0,TA,TA,No,Rec,468.0,...,TA,Typ,Attchd,1961.0,Unf,1.0,730.0,TA,TA,WD
1462,RL,AllPub,Wd Sdng,Wd Sdng,108.0,TA,TA,No,ALQ,923.0,...,Gd,Typ,Attchd,1958.0,Unf,1.0,312.0,TA,TA,WD
1463,RL,AllPub,VinylSd,VinylSd,0.0,Gd,TA,No,GLQ,791.0,...,TA,Typ,Attchd,1997.0,Fin,2.0,482.0,TA,TA,WD
1464,RL,AllPub,VinylSd,VinylSd,20.0,TA,TA,No,GLQ,602.0,...,Gd,Typ,Attchd,1998.0,Fin,2.0,470.0,TA,TA,WD
1465,RL,AllPub,HdBoard,HdBoard,0.0,Gd,TA,No,ALQ,263.0,...,Gd,Typ,Attchd,1992.0,RFn,2.0,506.0,TA,TA,WD
1466,RL,AllPub,HdBoard,HdBoard,0.0,Gd,TA,No,Unf,0.0,...,TA,Typ,Attchd,1993.0,Fin,2.0,440.0,TA,TA,WD
1467,RL,AllPub,HdBoard,HdBoard,0.0,Gd,TA,No,ALQ,935.0,...,TA,Typ,Attchd,1992.0,Fin,2.0,420.0,TA,TA,WD
1468,RL,AllPub,VinylSd,VinylSd,0.0,Gd,TA,No,Unf,0.0,...,TA,Typ,Attchd,1998.0,Fin,2.0,393.0,TA,TA,WD
1469,RL,AllPub,HdBoard,HdBoard,0.0,Gd,TA,Gd,GLQ,637.0,...,Gd,Typ,Attchd,1990.0,Unf,2.0,506.0,TA,TA,WD
1470,RL,AllPub,Plywood,Plywood,0.0,TA,TA,No,ALQ,804.0,...,TA,Typ,Attchd,1970.0,Fin,2.0,525.0,TA,TA,WD


In [30]:
test_data3 = test_data2.copy()

# Fill NaNs with the column mean for float columns and with the placeholder "non-existent" for object columns
columns_with_nan = []

for col in test_data3.columns:
    if test_data3[col].isnull().sum() > 0 and test_data3[col].dtype == 'float64':
        test_data3[col] = test_data3[col].fillna(value = test_data3[col].mean())
    elif test_data3[col].isnull().sum() > 0 and test_data3[col].dtype == 'object':
        test_data3[col] = test_data3[col].fillna(value = "non-existent")

In [31]:
test_data4 = test_data3.copy()

In [32]:
l_num = []
for col in test_data4.columns:
    if 'Year' not in col and 'Yr' not in col:  # Exclude columns with 'Year' or 'Yr'
        if test_data4[col].dtype.kind in ['i', 'f']:  # Check if column is numeric (integer or float)
            if len(test_data4[col].unique()) > 10:  # Keep only numeric columns with more than 10 unique values
                if col != 'SalePrice':
                    l_num.append(col)

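# scale numerical variables
# Note: fit_transform refits the scaler on the test data, so the test set is scaled
# with its own min/max rather than with the values learned from the training data.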
test_data4[l_num] = scaler.fit_transform(test_data4[l_num])

In [33]:
# create list of categorical variables
l_cat = []
for col in test_data4.columns:
    if test_data4[col].dtype.kind == 'O':
        l_cat.append(col)

# replace low-frequency (<4%) values with "other"
replace_less_frequent(df=test_data4, list_col=l_cat, threshold=0.04)


 MSZoning
RL       0.763537
RM       0.165867
FV       0.050720
other    0.019877
Name: proportion, dtype: float64

 ** NEW MSZoning created correctly**

 Street
Pave     0.995888
other    0.004112
Name: proportion, dtype: float64

 ** NEW Street created correctly**

 LotShape
Reg      0.640164
IR1      0.331734
other    0.028101
Name: proportion, dtype: float64

 ** NEW LotShape created correctly**

 LandContour
Lvl      0.898561
other    0.053461
HLS      0.047978
Name: proportion, dtype: float64

 ** NEW LandContour created correctly**

 Utilities
AllPub    0.998629
other     0.001371
Name: proportion, dtype: float64

 ** NEW Utilities created correctly**

 LotConfig
Inside     0.740918
Corner     0.169979
CulDSac    0.056203
other      0.032899
Name: proportion, dtype: float64

 ** NEW LotConfig created correctly**

 LandSlope
Gtl      0.95682
other    0.04318
Name: proportion, dtype: float64

 ** NEW LandSlope created correctly**

 Neighborhood
other      0.291295
NAmes      0.14

In [34]:
# Apply the encoders fitted on the training data ('home_data5') to the test data ('test_data4')
for col, encoder in encoders.items():
    if col in test_data4.columns:
        test_data4[col] = encoder.transform(test_data4[col])

In [35]:
l_cat = []
for col in test_data4.columns:
    if test_data4[col].dtype == 'object':
        l_cat.append(col)
        
# create dummy variables
test_data_dummies = pd.get_dummies(test_data4, columns=l_cat, dtype='int32')

In [36]:
test_data_dummies.head()

Id,MSSubClass,MSZoning,LotArea,LotConfig,Neighborhood,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,...,GarageQual_TA,GarageQual_other,GarageCond_TA,GarageCond_other,PavedDrive_N,PavedDrive_Y,PavedDrive_other,SaleType_New,SaleType_WD,SaleType_other
1461,0.0,134887.463418,0.184147,176938.047529,145847.080044,175985.477961,5,6,1961,1961,...,1,0,1,0,0,1,0,0,1,0
1462,0.0,191004.994787,0.232124,181623.425855,145847.080044,175985.477961,6,6,1958,1958,...,1,0,1,0,0,1,0,0,1,0
1463,0.235294,191004.994787,0.224197,176938.047529,192821.904993,210051.764045,5,5,1997,1998,...,1,0,1,0,0,1,0,0,1,0
1464,0.235294,191004.994787,0.154326,176938.047529,192821.904993,210051.764045,6,6,1998,1998,...,1,0,1,0,0,1,0,0,1,0
1465,0.588235,191004.994787,0.064121,176938.047529,189674.313679,175985.477961,8,5,1992,1992,...,1,0,1,0,0,1,0,0,1,0


# Train a new model

In [37]:
import random
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

In [38]:
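# Function to optimize max_leaf_nodes
from sklearn.tree import DecisionTreeRegressor  # needed by get_mae, not imported above

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a decision tree with the given number of leaves and return its validation MAE
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae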

In [39]:
"""
X = home_data_dummies[list(high_corr_features_unique)]
y = home_data_dummies.SalePrice

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

for max_leaf_nodes in range(10, 10000, 50):
    rf_model = RandomForestRegressor(random_state=1, max_leaf_nodes=max_leaf_nodes)
    rf_model.fit(train_X, train_y)
    rf_val_predictions = rf_model.predict(val_X)
    rf_val_mae = mean_absolute_error(rf_val_predictions, val_y)
    
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" %(max_leaf_nodes, rf_val_mae))
    
    if rf_val_mae < 15000:
        print("Stopping criteria reached.")
        break
"""


In [40]:
"""X = home_data_dummies[list(high_corr_features_unique)]
#X = home_data_dummies.drop(['SalePrice'], axis=1)
y = home_data_dummies.SalePrice

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Define hyperparameters to explore
n_estimators_list = [50, 100, 150, 200]  # Add more values
max_depth_list = [None, 10, 20, 30]  # Add more values
min_samples_split_list = [2, 5, 10, 15]  # Add more values

best_mae = float('inf')
best_hyperparameters = {}

for n_estimators in n_estimators_list:
    for max_depth in max_depth_list:
        for min_samples_split in min_samples_split_list:
            rf_model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, min_samples_split=min_samples_split, random_state=1, max_leaf_nodes=600)
            rf_model.fit(train_X, train_y)
            rf_val_predictions = rf_model.predict(val_X)
            rf_val_mae = mean_absolute_error(rf_val_predictions, val_y)
            
            print("n_estimators: {}, max_depth: {}, min_samples_split: {} \t\t Mean Absolute Error: {}".format(n_estimators, max_depth, min_samples_split, rf_val_mae))
            
            if rf_val_mae < best_mae:
                best_mae = rf_val_mae
                best_hyperparameters = {'n_estimators': n_estimators, 'max_depth': max_depth, 'min_samples_split': min_samples_split}

print("\nBest Hyperparameters:")
print(best_hyperparameters)
print("Best Mean Absolute Error:", best_mae)"""
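
The nested loops in the disabled cell above explore a grid of hyperparameters against a single validation split. scikit-learn's `GridSearchCV` does the same kind of search with cross-validation. The sketch below uses a small synthetic dataset and a deliberately tiny grid so that it runs quickly; the data and grid values are assumptions for illustration, not the settings used in this notebook:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X and y (hypothetical)
X_toy, y_toy = make_regression(n_samples=80, n_features=5, noise=10.0, random_state=1)

# A small grid; extend the lists to explore more values
param_grid = {
    "n_estimators": [10, 20],
    "max_depth": [None, 3],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, so MAE is negated
    cv=3,
)
search.fit(X_toy, y_toy)

best_mae = -search.best_score_
print("Best hyperparameters:", search.best_params_)
print("Best cross-validated MAE:", best_mae)
```

Cross-validation averages the error over several splits, which tends to give a steadier estimate than one train/validation split, at the cost of more training runs.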


# Train on all data with the best hyperparameters

In [41]:
high_corr_features_unique_test = set(test_data_dummies.columns).intersection(high_corr_features_unique)

In [42]:
common_columns = list(set(home_data_dummies[list(high_corr_features_unique)].columns).intersection(test_data_dummies[list(high_corr_features_unique_test)].columns))

In [43]:
X = home_data_dummies[common_columns] #already without SalePrice
y = home_data_dummies.SalePrice

# To improve accuracy, train on all training data
rf_model_on_full_data = RandomForestRegressor(random_state=1, max_leaf_nodes=600, n_estimators= 200, max_depth= None, min_samples_split= 2)
rf_model_on_full_data.fit(X, y)

In [44]:
# make predictions which we will submit. 
test_preds_modelV2 = rf_model_on_full_data.predict(test_data_dummies[common_columns])

test_preds_modelV2[0:15].round(0).tolist()

[135352.0,
 160086.0,
 195178.0,
 204484.0,
 216965.0,
 192208.0,
 168005.0,
 178850.0,
 218543.0,
 137296.0,
 222418.0,
 95906.0,
 108419.0,
 162114.0,
 144056.0]

# Generate a submission

Run the code cell below to generate a CSV file with your predictions that you can use to submit to the competition.

In [45]:
# Run the code to save predictions in the format used for competition scoring

output = pd.DataFrame({'Id': test_data.index,
                       'SalePrice': test_preds_modelV2})
output.to_csv('submission_v7.csv', index=False)
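
Before uploading, it can help to run a few quick sanity checks on the submission frame. The sketch below rebuilds a three-row stand-in for `output` (using the first three predictions shown above) and checks it; the specific checks are suggestions, not competition requirements stated in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_data.index and test_preds_modelV2
ids = pd.Index([1461, 1462, 1463], name="Id")
preds = np.array([135352.0, 160086.0, 195178.0])

output = pd.DataFrame({"Id": ids, "SalePrice": preds})

# Simple checks on the frame that would be written to CSV
checks = {
    "one row per Id": bool(output["Id"].is_unique),
    "no missing predictions": bool(output["SalePrice"].notna().all()),
    "all prices positive": bool((output["SalePrice"] > 0).all()),
}
print(checks)
```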

# Submit to the competition

To test your results, you'll need to join the competition (if you haven't already).  So open a new window by clicking on **[this link](https://www.kaggle.com/c/home-data-for-ml-course)**.  Then click on the **Join Competition** button.

![join competition image](https://storage.googleapis.com/kaggle-media/learn/images/axBzctl.png)

Next, follow the instructions below:
1. Begin by clicking on the **Save Version** button in the top right corner of the window.  This will generate a pop-up window.  
2. Ensure that the **Save and Run All** option is selected, and then click on the **Save** button.
3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
4. Click on the **Data** tab near the top of the screen.  Then, click on the file you would like to submit, and click on the **Submit** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

If you want to keep working to improve your performance, select the **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.


# Continue Your Progress
There are many ways to improve your model, and **experimenting is a great way to learn at this point.**

The best way to improve your model is to add features.  To add more features to the data, revisit the first code cell, and change this line of code to include more column names:
```python
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
```

Some features will cause errors because of issues like missing values or non-numeric data types.  Here is a complete list of potential columns that you might like to use, and that won't throw errors:
- 'MSSubClass'
- 'LotArea'
- 'OverallQual' 
- 'OverallCond' 
- 'YearBuilt'
- 'YearRemodAdd' 
- '1stFlrSF'
- '2ndFlrSF' 
- 'LowQualFinSF' 
- 'GrLivArea'
- 'FullBath'
- 'HalfBath'
- 'BedroomAbvGr' 
- 'KitchenAbvGr' 
- 'TotRmsAbvGrd' 
- 'Fireplaces' 
- 'WoodDeckSF' 
- 'OpenPorchSF'
- 'EnclosedPorch' 
- '3SsnPorch' 
- 'ScreenPorch' 
- 'PoolArea' 
- 'MiscVal' 
- 'MoSold' 
- 'YrSold'

Look at the list of columns and think about what might affect home prices.  To learn more about each of these features, take a look at the data description on the **[competition page](https://www.kaggle.com/c/home-data-for-ml-course/data)**.

After updating the code cell above that defines the features, re-run all of the code cells to evaluate the model and generate a new submission file.  


# What's next?

As mentioned above, some of the features will throw an error if you try to use them to train your model.  The **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course will teach you how to handle these types of features. You will also learn to use **xgboost**, a technique giving even better accuracy than Random Forest.

The **[Pandas](https://kaggle.com/Learn/Pandas)** course will give you the data manipulation skills to quickly go from conceptual idea to implementation in your data science projects. 

You are also ready for the **[Deep Learning](https://kaggle.com/Learn/intro-to-Deep-Learning)** course, where you will build models with better-than-human level performance at computer vision tasks.

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intro-to-machine-learning/discussion) to chat with other learners.*