**This notebook is an exercise in the [Introduction to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/dansbecker/machine-learning-competitions).**

---


# Introduction

In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) and see how you stack up against others taking this course.

The steps in this notebook are:
1. Build a Random Forest model with all of your data (**X** and **y**).
2. Read in the "test" data, which doesn't include values for the target.  Predict home values in the test data with your Random Forest model.
3. Submit those predictions to the competition and see your score.
4. Optionally, come back to see if you can improve your model by adding features or changing your model. Then you can resubmit to see how that stacks up on the competition leaderboard.

## Recap
Here's the code you've written so far. Start by running it again.

In [1]:
# Code you have previously used to load data
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read. We changed the directory structure to simplify submitting to a competition
iowa_file_path = './data/train.csv'

home_data = pd.read_csv(iowa_file_path)
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_predictions, val_y)
print("Validation MAE when not specifying max_leaf_nodes: {:,.0f}".format(val_mae))

# Using best value for max_leaf_nodes
iowa_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_predictions, val_y)
print("Validation MAE for best value of max_leaf_nodes: {:,.0f}".format(val_mae))

# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_predictions = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(rf_val_predictions, val_y)

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))


Validation MAE when not specifying max_leaf_nodes: 29,653
Validation MAE for best value of max_leaf_nodes: 27,283
Validation MAE for Random Forest Model: 21,857
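The three scores above are mean absolute errors (MAE): the average of the absolute differences between actual and predicted prices. As a minimal sketch of what `mean_absolute_error` computes, here is a plain-Python version. The function name `mean_absolute_error_manual` and the prices are hypothetical and are not taken from the Iowa data.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all rows."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical sale prices and predictions (illustration only)
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]

# Absolute errors are 10,000, 10,000 and 20,000, so the mean is 40,000 / 3
mae = mean_absolute_error_manual(actual, predicted)
print("MAE: {:,.0f}".format(mae))
```

Because the absolute value ignores sign, swapping the two arguments gives the same number. A lower MAE means the predictions are closer to the actual prices on average.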


# Creating a Model For the Competition

Build a Random Forest model and train it on all of **X** and **y**.

In [2]:
home_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [3]:
# home_data.info()


from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
# Pass a 1-D Series; a one-column DataFrame triggers a DataConversionWarning
home_data['Condition1_code'] = label_encoder.fit_transform(home_data['Condition1'])


print(home_data[['Condition1']].value_counts())
home_data['Condition1_code'].value_counts()

Condition1
Norm          1260
Feedr           81
Artery          48
RRAn            26
PosN            19
RRAe            11
PosA             8
RRNn             5
RRNe             2
dtype: int64



2    1260
1      81
0      48
6      26
4      19
5      11
3       8
8       5
7       2
Name: Condition1_code, dtype: int64
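The codes in the output above look arbitrary, but `LabelEncoder` assigns them by sorting the categories and numbering them from 0. The sketch below uses only the nine category names printed above (one row each, not the real data) and reads the mapping back from the fitted encoder's `classes_` attribute.

```python
from sklearn.preprocessing import LabelEncoder

# One row per Condition1 category shown in the output above (toy data)
conditions = ["Norm", "Feedr", "Artery", "RRAn", "PosN",
              "RRAe", "PosA", "RRNn", "RRNe"]

encoder = LabelEncoder()
encoder.fit(conditions)

# classes_ holds the categories in sorted order; a category's index is its code
mapping = {category: code for code, category in enumerate(encoder.classes_)}
print(mapping)
```

Note that the codes follow alphabetical order, not any real-world ranking, so a model may read an ordering into them that the categories do not have.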

In [4]:
home_data.CentralAir.value_counts()

Y    1365
N      95
Name: CentralAir, dtype: int64

In [5]:
h = home_data[["CentralAir", 'SalePrice']].copy()
h.groupby('CentralAir').mean()

| CentralAir | SalePrice |
|---|---|
| N | 105264.073684 |
| Y | 186186.70989 |


In [6]:
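from sklearn.preprocessing import LabelEncoder

def convert_categorical_to_numeric(df, col):
    """Add a `<col>_num` column with integer codes for the categories in `col`."""
    # A new encoder is fitted on every call
    label_encoder = LabelEncoder()
    # Pass a 1-D Series; a one-column DataFrame triggers a DataConversionWarning
    df[col + '_num'] = label_encoder.fit_transform(df[col])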
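
Because the helper above fits a fresh `LabelEncoder` on every call, the integer codes in two DataFrames agree only when both contain exactly the same set of categories. One way around this, sketched here as an alternative rather than as the notebook's method, is to fit a single `OrdinalEncoder` on the training data and reuse it on the test data. The miniature frames are hypothetical, and `handle_unknown="use_encoded_value"` assumes scikit-learn 0.24 or later.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical miniature frames; the real data comes from train.csv and test.csv
train = pd.DataFrame({"MSZoning": ["RL", "RM", "FV", "RL"]})
test = pd.DataFrame({"MSZoning": ["RM", "RL", "A"]})  # "A" never appears in train

# Fit once on the training data; categories unseen during fit are encoded as -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train["MSZoning_num"] = encoder.fit_transform(train[["MSZoning"]]).ravel()

# Reuse the fitted encoder so each category keeps the code it had in training
test["MSZoning_num"] = encoder.transform(test[["MSZoning"]]).ravel()

print(train["MSZoning_num"].tolist())
print(test["MSZoning_num"].tolist())
```

With this approach "RM" maps to the same code in both frames, and a category that was never seen in training is flagged with -1 instead of silently shifting the other codes.
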

In [7]:
# Convert categorical variables to numeric
convert_categorical_to_numeric(home_data, 'KitchenQual')
convert_categorical_to_numeric(home_data, 'MSZoning')
convert_categorical_to_numeric(home_data, 'CentralAir')


In [8]:
# To improve accuracy, create a new Random Forest model which you will train on all training data
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'TotRmsAbvGrd',
            'OverallCond', 'GrLivArea', 'PoolArea', 'KitchenQual_num', 'MSZoning_num', 'CentralAir_num']
X = home_data[features]

rf_model_on_full_data = RandomForestRegressor(random_state=1)

# fit rf_model_on_full_data on all data from the training data
rf_model_on_full_data.fit(X, y)

RandomForestRegressor(random_state=1)

# Make Predictions
Read the file of "test" data and apply your model to make predictions.

In [11]:
# path to file you will use for predictions
test_data_path = './data/test.csv'

# read test data file using pandas
test_data = pd.read_csv(test_data_path)

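# Fill missing categories before encoding. The fill values must also occur in the
# training data (e.g. 'RL' for MSZoning), otherwise the label codes shift between train and test.
test_data.fillna(value={'KitchenQual': 'TA', 'MSZoning': 'RL', 'CentralAir': 'N'}, inplace=True)

# Convert categorical variables to numeric
convert_categorical_to_numeric(test_data, 'KitchenQual')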
convert_categorical_to_numeric(test_data, 'MSZoning')
convert_categorical_to_numeric(test_data, 'CentralAir')

# create test_X which comes from test_data but includes only the columns you used for prediction.
# The list of columns is stored in a variable called features
test_X = test_data[features].fillna(0)

# make predictions which we will submit. 
test_preds = rf_model_on_full_data.predict(test_X)

# The lines below save the predictions in the format used for competition scoring

output = pd.DataFrame({'Id': test_data.Id,
                      'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)


### TASK : Create a data pipeline using sklearn

In [28]:
home_data = pd.read_csv('data/train.csv')

y = home_data.pop('SalePrice')

numerical_data = home_data.select_dtypes(include='number')
numerical_cols = numerical_data.columns

categorical_data = home_data.select_dtypes(exclude='number')
categorical_cols = categorical_data.columns

print(numerical_cols, categorical_cols)

Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold'],
      dtype='object') Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', '

In [44]:
categorical_data.nunique().sort_values()

Street            2
Alley             2
CentralAir        2
Utilities         2
LandSlope         3
PoolQC            3
PavedDrive        3
GarageFinish      3
BsmtQual          4
ExterQual         4
MasVnrType        4
KitchenQual       4
BsmtCond          4
BsmtExposure      4
Fence             4
MiscFeature       4
LandContour       4
LotShape          4
FireplaceQu       5
Electrical        5
HeatingQC         5
GarageQual        5
GarageCond        5
MSZoning          5
LotConfig         5
BldgType          5
ExterCond         5
BsmtFinType1      6
RoofStyle         6
GarageType        6
Foundation        6
Heating           6
BsmtFinType2      6
SaleCondition     6
Functional        7
RoofMatl          8
HouseStyle        8
Condition2        8
SaleType          9
Condition1        9
Exterior1st      15
Exterior2nd      16
Neighborhood     25
dtype: int64

In [56]:
# Get all columns relevant for OneHot encoding (fewer than 10 unique values)
val_less_than_10 = categorical_data.nunique() < 10

onehot_cols = val_less_than_10[val_less_than_10].index

# Columns with 10 or more unique values are dropped rather than encoded
drop_cols = val_less_than_10[~val_less_than_10].index

print("OneHot Columns : {}".format(onehot_cols))
print("Dropping Columns : {}".format(drop_cols))

OneHot Columns : Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'RoofStyle', 'RoofMatl', 'MasVnrType', 'ExterQual',
       'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
       'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir',
       'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType',
       'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC',
       'Fence', 'MiscFeature', 'SaleType', 'SaleCondition'],
      dtype='object')
Dropping Columns : Index(['Neighborhood', 'Exterior1st', 'Exterior2nd'], dtype='object')


In [59]:
home_data = pd.read_csv('data/train.csv')

y = home_data.pop('SalePrice')
home_data.drop(drop_cols, axis=1, inplace=True)
home_data.drop('Id', axis=1, inplace=True)

numerical_data = home_data.select_dtypes(include='number')
numerical_cols = numerical_data.columns

In [67]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor

preprocessing_num = SimpleImputer(strategy='median')

preprocessing_cat = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('OneHot', OneHotEncoder(handle_unknown='ignore'))
    ])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', preprocessing_num, numerical_cols),
        ('cat', preprocessing_cat, onehot_cols)
    ])

model = XGBRegressor(n_estimators=1000)

my_pipeline = Pipeline(
    steps=[
        ('preprocess', preprocessor),
        ('model', model)
    ])
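
To see what the preprocessing half of this pipeline does, it can help to run it on a tiny frame and inspect the resulting matrix. The sketch below uses the same transformers as the cell above but leaves out the model. The two-column frame, the unseen category "Dirt", and `sparse_threshold=0` (used only to force a dense array for printing) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature training frame with one missing value per column
train = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0, 9550.0],
    "Street": ["Pave", "Pave", np.nan, "Grvl"],
})
# A new row with a missing number and a category never seen during fit
new_rows = pd.DataFrame({"LotArea": [np.nan], "Street": ["Dirt"]})

cat_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), ["LotArea"]),
        ("cat", cat_steps, ["Street"]),
    ],
    sparse_threshold=0,  # always return a dense array (for printing only)
)

# Columns are: LotArea, Street=Grvl, Street=Pave
train_matrix = preprocessor.fit_transform(train)
new_matrix = preprocessor.transform(new_rows)
print(train_matrix)
print(new_matrix)
```

The missing `LotArea` is filled with the training median (9550), and the unseen street type becomes a row of zeros in the one-hot columns instead of raising an error, which is the effect of `handle_unknown='ignore'`.
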

In [68]:
X_train, X_val, y_train, y_val = train_test_split(home_data, y, test_size=0.2)
X_train.head()

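| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1328 | 50 | RM | 60.0 | 10440 | Pave | Grvl | Reg | Lvl | AllPub | Corner | ... | 480 | 0 | NaN | MnPrv | Shed | 1150 | 6 | 2008 | WD | Normal |
| 259 | 20 | RM | 70.0 | 12702 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal |
| 652 | 60 | RL | 70.0 | 8750 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 7 | 2009 | WD | Normal |
| 1310 | 20 | RL | 100.0 | 17500 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal |
| 1321 | 20 | RL | NaN | 6627 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | NaN | NaN | NaN | 0 | 7 | 2008 | WD | Normal |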


In [69]:
my_pipeline.fit(X_train, y_train)
val_preds = my_pipeline.predict(X_val)

print("Mean absolute error : {}".format(mean_absolute_error(val_preds, y_val)))

Mean absolute error : 14132.42924604024


In [72]:
# Predictions on test data
test_data = pd.read_csv(test_data_path)

test_data.drop(drop_cols, axis=1, inplace=True)
X_test = test_data.drop('Id', axis=1)

test_preds = my_pipeline.predict(X_test)

# Store in CSV
output = pd.DataFrame({'Id': test_data.Id,
                      'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)

Before submitting, run a check to make sure your `test_preds` have the right format.
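
One possible check is sketched below. It assumes the submission needs exactly the `Id` and `SalePrice` columns written in the cell above, one row per test row, and no missing or non-positive prices. The helper name `check_submission` and the three sample rows are hypothetical stand-ins for `test_data.Id` and `test_preds`.

```python
import pandas as pd


def check_submission(output, test_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(output.columns) != ["Id", "SalePrice"]:
        problems.append("columns must be exactly ['Id', 'SalePrice']")
        return problems
    if len(output) != len(test_ids):
        problems.append("row count differs from the test data")
    elif not (output["Id"].to_numpy() == pd.Series(test_ids).to_numpy()).all():
        problems.append("Id values do not match the test data")
    if output["SalePrice"].isna().any():
        problems.append("SalePrice contains missing values")
    if (output["SalePrice"] <= 0).any():
        problems.append("SalePrice contains non-positive values")
    return problems


# Hypothetical stand-ins for test_data.Id and test_preds
test_ids = pd.Series([1461, 1462, 1463])
test_preds = [169_000.0, 187_500.0, 183_900.0]
output = pd.DataFrame({"Id": test_ids, "SalePrice": test_preds})

print(check_submission(output, test_ids))
```

In the notebook itself you would call `check_submission(output, test_data.Id)` and expect an empty list before saving a version.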

# Test Your Work

To test your results, you'll need to join the competition (if you haven't already).  So open a new window by clicking on [this link](https://www.kaggle.com/c/home-data-for-ml-course).  Then click on the **Join Competition** button.

![join competition image](https://i.imgur.com/axBzctl.png)

Next, follow the instructions below:
1. Begin by clicking on the blue **Save Version** button in the top right corner of the window.  This will generate a pop-up window.  
2. Ensure that the **Save and Run All** option is selected, and then click on the blue **Save** button.
3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
4. Click on the **Output** tab on the right of the screen.  Then, click on the blue **Submit** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

If you want to keep working to improve your performance, select the blue **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.


# Continuing Your Progress
There are many ways to improve your model, and **experimenting is a great way to learn at this point.**

The best way to improve your model is to add features.  Look at the list of columns and think about what might affect home prices.  Some features will cause errors because of issues like missing values or non-numeric data types. 
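
As a rough way to find which columns can be added to `features` without extra handling, you can list the columns that are numeric and have no missing values. The sketch below uses a hypothetical three-column frame. Only the column names are borrowed from the housing data, and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the three kinds of column
home_data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],        # numeric, complete
    "LotFrontage": [65.0, np.nan, 68.0],   # numeric, has a missing value
    "MSZoning": ["RL", "RL", "RM"],        # non-numeric
})

numeric_cols = home_data.select_dtypes(include="number").columns
has_missing = home_data.isna().any()

# Numeric columns without missing values can be passed to the model as-is
ready_to_use = [col for col in numeric_cols if not has_missing[col]]
print(ready_to_use)
```

Columns that do not make this list are not useless. They need imputation or encoding first.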

The **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course will teach you how to handle these types of features. You will also learn to use **xgboost**, a technique that often gives even better accuracy than Random Forest.


# Other Courses
The **[Pandas](https://kaggle.com/Learn/Pandas)** course will give you the data manipulation skills to quickly go from conceptual idea to implementation in your data science projects. 

You are also ready for the **[Deep Learning](https://kaggle.com/Learn/Deep-Learning)** course, where you will build models with better-than-human level performance at computer vision tasks.

---




*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/161285) to chat with other Learners.*