**This notebook is an exercise in the [Introduction to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/machine-learning-competitions).**

---


# Introduction

In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) to apply what you've learned and move up the leaderboard.

Begin by running the code cell below to set up code checking and the filepaths for the dataset.

In [1]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex7 import *

# Set up filepaths
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 

Here's some of the code you've written so far. Start by running it again.

In [2]:
# Import helpful libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Load the data, and separate the target
iowa_file_path = '../input/train.csv'
home_data = pd.read_csv(iowa_file_path)

# Test Data Loading
test_data_path = '../input/test.csv'

# read test data file using pandas
test_data = pd.read_csv(test_data_path)

y = home_data.SalePrice

# Create X (After completing the exercise, you can return to modify this line!)
#features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# Select columns corresponding to features, and preview the data
#X = home_data[features]
#X.head()

# Split into validation and training data
#train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Define a random forest model
#rf_model = RandomForestRegressor(random_state=1)
#rf_model.fit(train_X, train_y)
#rf_val_predictions = rf_model.predict(val_X)
#rf_val_mae = mean_absolute_error(rf_val_predictions, val_y)

#print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))

# Exploration

In [3]:
print(home_data.head())
print(home_data.shape)

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  SaleType  SaleCondition  SalePrice  
0   2008        WD   

> The training set has 81 columns, including `Id` and the target `SalePrice`.

In [4]:
home_data[home_data.duplicated()]

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice


> No duplicate rows.
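
As a minimal sketch of what the duplicate check does, the toy frame below (hypothetical values, not the competition data) repeats one row. `duplicated()` flags the repeat, and `drop_duplicates()` would remove it if any were found.

```python
import pandas as pd

# Toy frame with made-up values: the third row repeats the first
frame = pd.DataFrame({
    "LotArea": [8450, 9600, 8450],
    "Street": ["Pave", "Grvl", "Pave"],
})

# duplicated() marks every repeat of an earlier row as True
duplicate_mask = frame.duplicated()
duplicate_rows = frame[duplicate_mask]

# drop_duplicates() keeps the first occurrence of each row
deduplicated = frame.drop_duplicates()

print(duplicate_mask.tolist())
print(len(duplicate_rows), len(deduplicated))
```

An empty result, as in the output above, means every row is unique.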

> `Id` is only a row identifier and carries no information about the price of a home, so it is dropped from the features below.

In [5]:
test_data_Id = test_data.Id     # for submission format

> The test `Id` column is saved here because the submission file needs it.

In [6]:
# Drop the identifier and the target from the features (y already holds SalePrice)
home_data.drop(['Id', 'SalePrice'], inplace=True, axis=1)
test_data.drop('Id', inplace=True, axis=1)

In [7]:
print(home_data.shape)
print(test_data.shape)

(1460, 79)
(1459, 79)


# Checking dtypes

In [8]:
print("Home Data")
print(home_data.dtypes.value_counts())
print("--------------------------------------")
print('Test Data')
print(test_data.dtypes.value_counts())

Home Data
object     43
int64      33
float64     3
dtype: int64
--------------------------------------
Test Data
object     43
int64      25
float64    11
dtype: int64


> Next, we need to see which `object` columns could be converted to other data types.

In [9]:
cols_with_obj_dtype_HD = [col for col in home_data.columns 
                                 if home_data[col].dtype =='object']  
print(cols_with_obj_dtype_HD)
print('-------------------------------------------------------------------------------------------')
cols_with_obj_dtype_TD = [col for col in test_data.columns 
                                 if test_data[col].dtype =='object']  
print(cols_with_obj_dtype_TD)

['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']
-------------------------------------------------------------------------------------------
['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinTyp

In [10]:
print(np.array_equal(cols_with_obj_dtype_HD,cols_with_obj_dtype_TD))

True


> So both datasets have the same set of `object` columns.

Now explore these `object` columns by comparing the number of unique values in the training data and the test data.

In [11]:
for col_name in cols_with_obj_dtype_HD:
    print('{:20s} {} {:8}'.format(col_name, home_data[col_name].nunique(), test_data[col_name].nunique()))


MSZoning             5        5
Street               2        2
Alley                2        2
LotShape             4        4
LandContour          4        4
Utilities            2        1
LotConfig            5        5
LandSlope            3        3
Neighborhood         25       25
Condition1           9        9
Condition2           8        5
BldgType             5        5
HouseStyle           8        7
RoofStyle            6        6
RoofMatl             8        4
Exterior1st          15       13
Exterior2nd          16       15
MasVnrType           4        4
ExterQual            4        4
ExterCond            5        5
Foundation           6        6
BsmtQual             4        4
BsmtCond             4        4
BsmtExposure         4        4
BsmtFinType1         6        6
BsmtFinType2         6        6
Heating              6        4
HeatingQC            5        5
CentralAir           2        2
Electrical           5        4
KitchenQual          4        4
Funct
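
The counts above show that some columns (for example `Utilities`, 2 versus 1) have fewer distinct values in the test data than in the training data. As a rough sketch, the helper below lists which categories appear only in the training data. The function name `categories_missing_from_test` and the toy frames are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for home_data and test_data (made-up rows)
train = pd.DataFrame({
    "Utilities": ["AllPub", "NoSeWa", "AllPub"],
    "Street": ["Pave", "Grvl", "Pave"],
})
test = pd.DataFrame({
    "Utilities": ["AllPub", "AllPub", None],
    "Street": ["Grvl", "Pave", "Pave"],
})


def categories_missing_from_test(train_df, test_df, columns):
    """Return {column: categories seen in train_df but never in test_df}."""
    missing = {}
    for col in columns:
        train_values = set(train_df[col].dropna().unique())
        test_values = set(test_df[col].dropna().unique())
        only_in_train = train_values - test_values
        if only_in_train:
            missing[col] = sorted(only_in_train)
    return missing


result = categories_missing_from_test(train, test, ["Utilities", "Street"])
print(result)
```

Swapping the two frames in the call gives the opposite check: categories that appear only in the test data, which matter more for encoding.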

# Checking Null Values

In [12]:
cols_with_missing = [col for col in home_data.columns 
                                 if home_data[col].isnull().any()]  
print(cols_with_missing)

['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']


In [13]:
for col_name in cols_with_missing:
    #if home_data[col_name].isnull().sum().any():
        #print(col_name,'\t','\t', home_data[col_name].isnull().sum())
    print('{:20s} {}'.format(col_name, home_data[col_name].isnull().sum()))

LotFrontage          259
Alley                1369
MasVnrType           8
MasVnrArea           8
BsmtQual             37
BsmtCond             37
BsmtExposure         38
BsmtFinType1         37
BsmtFinType2         38
Electrical           1
FireplaceQu          690
GarageType           81
GarageYrBlt          81
GarageFinish         81
GarageQual           81
GarageCond           81
PoolQC               1453
Fence                1179
MiscFeature          1406


> `PoolQC`, `MiscFeature`, `Fence`, and `Alley` each have more than 1,000 missing values out of 1,460 rows. We can drop them, since filling in that many values would bias the model heavily.
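
One way to act on this observation is to drop only the columns whose share of missing values exceeds a cutoff. The sketch below uses a toy frame and a hypothetical `threshold` of 0.5; both are assumptions for illustration, and the notebook itself does not use this cutoff.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: PoolQC is mostly missing
frame = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Fraction of missing values per column
missing_fraction = frame.isnull().mean()

# Hypothetical cutoff: drop columns that are more than half empty
threshold = 0.5
mostly_missing = missing_fraction[missing_fraction > threshold].index.tolist()
reduced = frame.drop(columns=mostly_missing)

print(mostly_missing)
print(list(reduced.columns))
```

Columns below the cutoff, such as `LotFrontage` here, stay in the data and can be imputed instead.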

In [14]:
new_home_data = home_data.drop(cols_with_missing, axis=1)  
print(new_home_data.shape)
new_test_data = test_data.drop(cols_with_missing, axis=1)
print(new_test_data.shape)

(1460, 60)
(1459, 60)


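> Here we simply remove every column that has missing values in the training data. This might not be a good strategy, because it also discards nearly complete columns such as `Electrical` (1 missing value). The result is therefore kept as a separate dataset (`new_home_data`, `new_test_data`) for comparison.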

In [15]:
# "cardinality" means the number of unique values in a column.
low_cardinality_cols = [cname for cname in new_home_data.columns if 
                                new_home_data[cname].nunique() < 10 and
                                new_home_data[cname].dtype == "object"]
print(low_cardinality_cols)
#for col in low_cardinality_cols:
#    print(col)
#    print(new_home_data[col].value_counts())

numeric_cols = [cname for cname in new_home_data.columns if 
                                new_home_data[cname].dtype in ['int64', 'float64']]
my_cols = low_cardinality_cols + numeric_cols
new_train_predictors = new_home_data[my_cols]
print('-------------------------------------------')
print(new_train_predictors.shape)
new_test_predictors = new_test_data[my_cols]
print(new_test_predictors.shape)

['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition']
-------------------------------------------
(1460, 57)
(1459, 57)


> The cell above selects the predictor columns for the new dataset (the one with missing-value columns removed).

# Replacing null values by mode

In [16]:
# Fill missing values in each training column with that column's most frequent value (mode)
for col_name in home_data.columns:
    if home_data[col_name].isnull().any():
        home_data[col_name] = home_data[col_name].fillna(home_data[col_name].mode()[0])

# The test data is filled with its own column modes
for col_name in test_data.columns:
    if test_data[col_name].isnull().any():
        test_data[col_name] = test_data[col_name].fillna(test_data[col_name].mode()[0])

In [17]:
print(home_data.isnull().sum().any())
print(test_data.isnull().sum().any())

False
False


> No null values remain in either dataset.
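
The cell above fills the test data with modes computed from the test data itself. A common alternative, shown here only as a sketch on toy data, is to compute the fill values once from the training data and reuse them for both sets, so the test set is treated exactly like the data the model was trained on. The names `fill_values`, `train_filled`, and `test_filled` are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for home_data and test_data (made-up values)
train = pd.DataFrame({
    "Alley": ["Grvl", np.nan, "Grvl", "Pave"],
    "LotFrontage": [65.0, 80.0, np.nan, 80.0],
})
test = pd.DataFrame({
    "Alley": [np.nan, "Pave"],
    "LotFrontage": [np.nan, 70.0],
})

# Most frequent value of each column, computed from the training data only
fill_values = {col: train[col].mode()[0] for col in train.columns}

# Reuse the same fill values for both datasets
train_filled = train.fillna(value=fill_values)
test_filled = test.fillna(value=fill_values)

print(fill_values)
print(test_filled)
```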

# Columns to be considered for encoding

In [18]:
low_cardinality_cols = [cname for cname in home_data.columns if 
                                home_data[cname].nunique() < 10 and
                                home_data[cname].dtype == "object"]
print(low_cardinality_cols)
#for col in low_cardinality_cols:
#    print(col)
#    print(home_data[col].value_counts())

numeric_cols = [cname for cname in home_data.columns if 
                                home_data[cname].dtype in ['int64', 'float64']]
my_cols = low_cardinality_cols + numeric_cols       # my_cols represents the columns you will use in model for predictions
train_predictors = home_data[my_cols]
print('-------------------------------------------')
print(train_predictors.shape)

['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']
-------------------------------------------
(1460, 76)


In [19]:
print(test_data.shape)
test_predictors = test_data[my_cols]
print('-------------------------------------------')
print(test_predictors.shape)

(1459, 79)
-------------------------------------------
(1459, 76)


# Dropping highly biased Columns

In [20]:
print(train_predictors.shape)

(1460, 76)


In [21]:
#home_data.drop(['Alley', 'PoolQC', 'Fence', 'MiscFeature'], inplace = True, axis = 1)
#print(train_predictors.shape)
#test_data.drop(['Alley', 'PoolQC', 'Fence', 'MiscFeature'], inplace = True, axis = 1)
#print(test_predictors.shape)


# Encoding Categorical Data

In [22]:
#one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
#print(one_hot_encoded_training_predictors)

In [23]:
#from sklearn.preprocessing import OneHotEncoder
#encoder = OneHotEncoder(categories = "auto", handle_unknown = 'ignore')
#train_encoded = encoder.fit_transform(train_predictors)
#test_encoded = encoder.transform(test_predictors)
#print(train_encoded.shape)
#print(test_encoded.shape)

# Encoding with Column Transformer

In [24]:
# Select the categorical (object dtype) features
cat = train_predictors.select_dtypes(include='O').keys()
# Display the selected column names
cat

Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'RoofStyle', 'RoofMatl', 'MasVnrType', 'ExterQual',
       'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
       'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir',
       'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType',
       'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC',
       'Fence', 'MiscFeature', 'SaleType', 'SaleCondition'],
      dtype='object')

> Features to be encoded

In [25]:
print(train_predictors.shape)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; the old name was removed in 1.4
transformer = ColumnTransformer(
    transformers=[('cat',
                   OneHotEncoder(categories="auto", handle_unknown='ignore', drop='first', sparse=False),
                   cat)],
    remainder='passthrough')
train_encoded = transformer.fit_transform(train_predictors)
print(train_encoded.shape)
test_encoded = transformer.transform(test_predictors)
print(test_encoded.shape)

(1460, 76)
(1460, 192)
(1459, 192)
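
To make the encoding step easier to follow, here is a small sketch of the same idea on toy data. It one-hot encodes one categorical column, passes the numeric column through unchanged, and shows what `handle_unknown='ignore'` does with a category that only appears in the test data. The toy frames are made up, and unlike the cell above the sketch leaves out `drop='first'` so that every category keeps its own column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (made-up values); "Dirt" never appears in the training rows
train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                      "LotArea": [8450, 9600, 11250]})
test = pd.DataFrame({"Street": ["Pave", "Dirt"],
                     "LotArea": [9000, 7000]})

transformer = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["Street"])],
    remainder="passthrough",   # keep LotArea as it is
    sparse_threshold=0,        # always return a dense array
)

# Fit on the training data only, then apply the same mapping to the test data
train_enc = transformer.fit_transform(train)
test_enc = transformer.transform(test)

print(train_enc.shape)
print(test_enc)
```

The encoded columns come out as `Street_Grvl`, `Street_Pave`, then `LotArea`. The unseen category `"Dirt"` is encoded as zeros in both `Street` columns instead of raising an error.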


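For the new dataset (missing-value columns removed); here every column, numeric ones included, is passed to the encoder: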

In [26]:
encoder = OneHotEncoder(categories = "auto", handle_unknown = 'ignore')
new_train_encoded = encoder.fit_transform(new_train_predictors)
new_test_encoded = encoder.transform(new_test_predictors)
print(new_train_encoded.shape)
print(new_test_encoded.shape)

(1460, 6969)
(1459, 6969)
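
The jump to 6969 columns is most likely because `OneHotEncoder` was applied to the whole frame, so every distinct value of every numeric column also became its own column. The sketch below reproduces that effect on a toy frame (made-up values) and compares it with encoding only the categorical column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame: one categorical column and one numeric column with 4 distinct values
frame = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Encoding the whole frame treats each LotArea value as a category
encoded_all = OneHotEncoder(handle_unknown="ignore").fit_transform(frame)

# Encoding only the categorical column keeps the column count small
encoded_cat = OneHotEncoder(handle_unknown="ignore").fit_transform(frame[["Street"]])

print(encoded_all.shape)  # 2 Street columns + 4 LotArea columns
print(encoded_cat.shape)  # 2 Street columns
```

Restricting the encoder to the categorical columns, as the `ColumnTransformer` cell does for the other dataset, avoids this blow-up.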


In [27]:
#one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)
#print(one_hot_encoded_test_predictors)

In [28]:
#train_cols = list(one_hot_encoded_training_predictors.columns)
#test_cols = list(one_hot_encoded_test_predictors.columns)
#cols_not_in_test = {c:0 for c in train_cols if c not in test_cols}
#one_hot_encoded_test_predictors = one_hot_encoded_test_predictors.assign(**cols_not_in_test)
#print(one_hot_encoded_test_predictors)


> This makes the number of columns in the final test set and the training set equal, but the columns are still not in the same order.
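
For the commented-out `pd.get_dummies` approach, one way to fix both the missing columns and their order in a single step is to reindex the test frame on the training columns. This is only a sketch on toy data; the frame names are assumptions.

```python
import pandas as pd

# Toy data (made-up values): the test rows never contain "Grvl"
train = pd.DataFrame({"Street": ["Pave", "Grvl"], "LotArea": [8450, 9600]})
test = pd.DataFrame({"Street": ["Pave", "Pave"], "LotArea": [9000, 7000]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# Use the training columns as the reference: columns missing from the test
# frame are added and filled with 0, and the order follows the training frame
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```

Any dummy column that exists only in the test frame is dropped by the reindex, which matches what a model fitted on the training columns expects.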

# Feature Scaling

> Scaling is not required for the Random Forest algorithm, since tree splits depend only on the ordering of feature values, not on their scale.

# Splitting Datasets

In [29]:
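# Hold out validation data from both versions of the training set
# (the validation splits are named test_X / test_Xn here)
train_X, test_X, train_y, test_y = train_test_split(train_encoded, y, random_state=1)
train_Xn, test_Xn, train_yn, test_yn = train_test_split(new_train_encoded, y, random_state=1)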
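
As a small illustration of what the split does, the sketch below runs `train_test_split` on toy arrays. With no `test_size` given, a quarter of the rows are held out, and a fixed `random_state` makes the split repeatable. The arrays are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 rows, where the first feature is always 2 * target
X = np.arange(16).reshape(8, 2)
y = np.arange(8)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Calling it again with the same random_state gives the same split
repeat = train_test_split(X, y, random_state=1)

print(train_X.shape, val_X.shape)
print(np.array_equal(val_y, repeat[3]))
```

Because both calls in the cell above use the same `y` and the same `random_state`, the two datasets are split on the same rows, which keeps their validation scores comparable.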

# Training model

# Dataset with null-value columns removed

In [30]:
reg_n = RandomForestRegressor(random_state=1)
reg_n.fit(train_Xn, train_yn)
reg_n_predictions = reg_n.predict(test_Xn)
reg_n_val_mae = mean_absolute_error(test_yn, reg_n_predictions)  # (y_true, y_pred) order

print("Validation MAE for Random Forest Model when null columns removed: {:,.0f}".format(reg_n_val_mae))

Validation MAE for Random Forest Model when null columns removed: 21,820


# Dataset with null values replaced

In [31]:
reg = RandomForestRegressor(random_state=1)
reg.fit(train_X, train_y)
reg_predictions = reg.predict(test_X)
reg_val_mae = mean_absolute_error(test_y, reg_predictions)  # (y_true, y_pred) order

print("Validation MAE for Random Forest Model: {:,.0f}".format(reg_val_mae))

Validation MAE for Random Forest Model: 16,572
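
Both validation scores above are mean absolute errors: the average size of the gap between predicted and actual sale prices, in dollars, so lower is better. The sketch below computes the metric by hand on three made-up prices and checks it against `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted sale prices
y_true = np.array([200000, 150000, 300000])
y_pred = np.array([210000, 140000, 325000])

# MAE by hand: average of the absolute differences
manual_mae = np.mean(np.abs(y_true - y_pred))

# Same value from scikit-learn; the argument order does not change the result
library_mae = mean_absolute_error(y_true, y_pred)
swapped_mae = mean_absolute_error(y_pred, y_true)

print("{:,.0f}".format(manual_mae))  # 15,000
```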


# Train a model for the competition

The code cell above trains a Random Forest model on **`train_X`** and **`train_y`**.  

Use the code cell below to build a Random Forest model and train it on all of the training data (in this notebook, **`train_encoded`** and **`y`**).

In [32]:
# To improve accuracy, create a new Random Forest model which you will train on all training data
rf_model_on_full_data = RandomForestRegressor(random_state = 1)

# fit rf_model_on_full_data on all data from the training data
rf_model_on_full_data.fit(train_encoded,y)

RandomForestRegressor(random_state=1)

Now, read the file of "test" data, and apply your model to make predictions.

In [33]:
# make predictions which we will submit. 
test_preds = rf_model_on_full_data.predict(test_encoded)

Before submitting, run a check to make sure your `test_preds` have the right format.

# For the dataset with null-value columns removed

In [34]:
# To improve accuracy, create a new Random Forest model which you will train on all training data
#rf_new_on_full_data = RandomForestRegressor(random_state = 1)

# fit rf_model_on_full_data on all data from the training data
#rf_new_on_full_data.fit(new_train_encoded,y)

#new_test_preds = rf_new_on_full_data.predict(new_test_encoded)

In [35]:
# Check your answer (To get credit for completing the exercise, you must get a "Correct" result!)
step_1.check()
# step_1.solution()

<IPython.core.display.Javascript object>

<span style="color:#33cc33">Correct</span>

# Generate a submission

Run the code cell below to generate a CSV file with your predictions that you can use to submit to the competition.

In [36]:
# Run the code to save predictions in the format used for competition scoring

output = pd.DataFrame({'Id': test_data_Id,
                       'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)
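
Before uploading, it can help to read the file back and confirm that it has the expected shape. The sketch below runs a few basic checks on a toy submission written to an in-memory buffer. The `ids` and `preds` values are hypothetical stand-ins for `test_data_Id` and `test_preds`, and the checks are a suggestion, not the competition's official validation.

```python
import io

import pandas as pd

# Hypothetical stand-ins for test_data_Id and test_preds
ids = pd.Series([1461, 1462, 1463], name="Id")
preds = [169000.5, 187500.0, 183250.25]

output = pd.DataFrame({"Id": ids, "SalePrice": preds})

# Write to an in-memory buffer instead of submission.csv, then read it back
buffer = io.StringIO()
output.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Collect any problems found in the written file
problems = []
if list(check.columns) != ["Id", "SalePrice"]:
    problems.append("unexpected columns")
if len(check) != len(ids):
    problems.append("row count does not match the number of test rows")
if check["SalePrice"].isnull().any():
    problems.append("missing predictions")
if check["Id"].duplicated().any():
    problems.append("duplicate Ids")

print(problems if problems else "submission looks well formed")
```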

# Submit to the competition

To test your results, you'll need to join the competition (if you haven't already).  So open a new window by clicking on **[this link](https://www.kaggle.com/c/home-data-for-ml-course)**.  Then click on the **Join Competition** button.

![join competition image](https://i.imgur.com/axBzctl.png)

Next, follow the instructions below:
1. Begin by clicking on the **Save Version** button in the top right corner of the window.  This will generate a pop-up window.  
2. Ensure that the **Save and Run All** option is selected, and then click on the **Save** button.
3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
4. Click on the **Output** tab on the right of the screen.  Then, click on the file you would like to submit, and click on the **Submit** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

If you want to keep working to improve your performance, select the **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.


# Continue Your Progress
There are many ways to improve your model, and **experimenting is a great way to learn at this point.**

The best way to improve your model is to add features.  To add more features to the data, revisit the first code cell, and change this line of code to include more column names:
```python
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
```

Some features will cause errors because of issues like missing values or non-numeric data types.  Here is a complete list of potential columns that you might like to use, and that won't throw errors:
- 'MSSubClass'
- 'LotArea'
- 'OverallQual' 
- 'OverallCond' 
- 'YearBuilt'
- 'YearRemodAdd' 
- '1stFlrSF'
- '2ndFlrSF' 
- 'LowQualFinSF' 
- 'GrLivArea'
- 'FullBath'
- 'HalfBath'
- 'BedroomAbvGr' 
- 'KitchenAbvGr' 
- 'TotRmsAbvGrd' 
- 'Fireplaces' 
- 'WoodDeckSF' 
- 'OpenPorchSF'
- 'EnclosedPorch' 
- '3SsnPorch' 
- 'ScreenPorch' 
- 'PoolArea' 
- 'MiscVal' 
- 'MoSold' 
- 'YrSold'

Look at the list of columns and think about what might affect home prices.  To learn more about each of these features, take a look at the data description on the **[competition page](https://www.kaggle.com/c/home-data-for-ml-course/data)**.

After updating the code cell above that defines the features, re-run all of the code cells to evaluate the model and generate a new submission file.  


# What's next?

As mentioned above, some of the features will throw an error if you try to use them to train your model.  The **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course will teach you how to handle these types of features. You will also learn to use **xgboost**, a technique giving even better accuracy than Random Forest.

The **[Pandas](https://kaggle.com/Learn/Pandas)** course will give you the data manipulation skills to quickly go from conceptual idea to implementation in your data science projects. 

You are also ready for the **[Deep Learning](https://kaggle.com/Learn/intro-to-Deep-Learning)** course, where you will build models with better-than-human level performance at computer vision tasks.

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intro-to-machine-learning/discussion) to chat with other learners.*