> Iowa housing data for house price prediction using advanced regression techniques


In [None]:
import pandas as pd

iowa_file_path = '../input/train.csv'
iowa_data = pd.read_csv(iowa_file_path)

To print summary statistics for the data, use the *.describe()* method of pandas

In [None]:
iowa_data.describe()

To view the column names of the dataset, use the *.columns* attribute of pandas

In [None]:
print(iowa_data.columns)

To view the first few rows of the dataset, use the *.head()* method of pandas. It returns 5 rows by default; pass a number to change that, e.g. *head(10)* to view the top 10 rows.

In [None]:
iowa_data.head()
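
As a small illustration of the three inspection calls above, here is a sketch on a toy DataFrame. The values are made up and merely stand in for *train.csv*. One thing it shows: *describe()* summarizes only numeric columns by default, so a text column such as `Street` is left out.

```python
import pandas as pd

# Toy data standing in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1993],
    "Street": ["Pave", "Pave", "Pave", "Pave", "Grvl", "Pave"],
})

print(toy.shape)            # (number of rows, number of columns)
print(toy.head(3))          # first 3 rows only
summary = toy.describe()    # numeric columns only by default
print(summary.columns.tolist())
```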

## **Selecting and Filtering Data Assignment**

Print all columns present in the dataset

In [None]:
iowa_data.columns

Create a new variable *iowa_sales_price* holding just the sale price data, then print the first 5 rows of that column

In [None]:
iowa_sales_price = iowa_data.SalePrice
iowa_sales_price.head(5)

Pick two variables (columns) and store them in a new DataFrame

In [None]:
columns_of_interest = ['MSSubClass', 'SalePrice']
two_columns_of_iowa_data = iowa_data[columns_of_interest]

Print summary statistics for just the two columns that caught our interest

In [None]:
two_columns_of_iowa_data.describe()
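
The two selection styles used above return different types, which the following sketch on made-up toy data shows. Selecting one column (by attribute or by a single name in brackets) gives a Series, while passing a list of names gives a DataFrame. Bracket notation is also the only option for names such as `1stFlrSF`, which are not valid Python attribute names.

```python
import pandas as pd

# Toy data with made-up values, using column names from the dataset.
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 60, 70],
    "SalePrice": [208500, 181500, 223500, 140000],
    "LotArea": [8450, 9600, 11250, 9550],
})

one_column = toy["SalePrice"]                    # same as toy.SalePrice: a Series
two_columns = toy[["MSSubClass", "SalePrice"]]   # list of names: a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(two_columns.columns.tolist())
```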

## Predicting sale price using DecisionTreeRegressor from scikit-learn

Define the prediction target. We will predict values for the *SalePrice* column.

In [None]:
y = iowa_data.SalePrice

Define the predictors. We will use *LotArea, YearBuilt, 1stFlrSF, 2ndFlrSF, FullBath, BedroomAbvGr, TotRmsAbvGrd* to predict *SalePrice*.

In [None]:
iowa_predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = iowa_data[iowa_predictors]

Use scikit-learn to build a model that predicts *SalePrice*.

In [None]:
from sklearn.tree import DecisionTreeRegressor

Define the DecisionTreeRegressor model

In [None]:
iowa_model = DecisionTreeRegressor()

Fit Model: Capture patterns from provided data.

In [None]:
iowa_model.fit(X, y)

Predict *SalePrice* for the first 5 houses

In [None]:
print("Making predictions for the following 5 houses : ")
print(X.head())
print("The predictions are : ")
print(iowa_model.predict(X.head()))
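
Predictions made on rows the model was trained on tend to look suspiciously good. The sketch below, on made-up toy data, shows why. A decision tree with no size limit keeps splitting until each leaf holds a single training row (as long as no two rows share identical predictor values), so it reproduces the training targets exactly.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy predictors and target with made-up values; every row is distinct.
toy_X = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
})
toy_y = pd.Series([208500, 181500, 223500, 140000, 250000])

model = DecisionTreeRegressor(random_state=0)  # no depth or leaf limit
model.fit(toy_X, toy_y)

in_sample = model.predict(toy_X)   # predictions on the training rows
print(in_sample)
print(toy_y.to_numpy())
```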

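We will calculate the Mean Absolute Error (MAE) to measure model quality. Note that this score is computed on the same data the model was trained on, so it will look better than the model's performance on new data.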

In [None]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = iowa_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
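
MAE is the average of the absolute differences between actual and predicted values. As a quick check of that definition, this sketch computes it by hand with NumPy on three made-up prices and compares the result with scikit-learn's *mean_absolute_error*.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted prices for three houses.
actual = np.array([200000, 150000, 300000])
predicted = np.array([210000, 140000, 330000])

# Absolute errors are 10000, 10000 and 30000; MAE is their mean.
manual_mae = np.mean(np.abs(actual - predicted))
sklearn_mae = mean_absolute_error(actual, predicted)

print(manual_mae)
print(sklearn_mae)
```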

We will now split the data into two parts: one for training the model and the other for validating it.

In [None]:
from sklearn.model_selection import train_test_split

Split the data into training and validation sets, for both predictors and target. The split is based on a random number generator; supplying a numeric value to the *random_state* argument guarantees we get the same split every time we run this script.

In [None]:
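# train_test_split returns, in order: training predictors, validation predictors,
# training target, validation target. Despite their names, the variables below hold
# exactly those pieces (column_name_* = training, column_value_* = validation).
column_name_X, column_value_X, column_name_y, column_value_y = train_test_split(X, y, random_state=0)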

# Define model after splitting the data
iowa_split_model = DecisionTreeRegressor()

# Fit model after splitting the data
iowa_split_model.fit(column_name_X, column_name_y)

# get predicted sales prices on validation data
sales_price_prediction = iowa_split_model.predict(column_value_X)
print(mean_absolute_error(column_value_y, sales_price_prediction))
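
To make the behaviour of *train_test_split* concrete, here is a sketch on made-up arrays. It shows the return order (training predictors, validation predictors, training target, validation target), the default 75/25 split, and that a fixed *random_state* reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 predictor columns.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# Same random_state twice: the two splits should be identical.
a_train, a_val, a_y_train, a_y_val = train_test_split(X_toy, y_toy, random_state=0)
b_train, b_val, b_y_train, b_y_val = train_test_split(X_toy, y_toy, random_state=0)

print(len(a_train), len(a_val))        # default holds out 25% for validation
print(np.array_equal(a_val, b_val))    # identical validation rows
```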

## Underfitting, Overfitting and Model Optimization

More about these topics can be found [here](https://www.kaggle.com/dansbecker/underfitting-overfitting-and-model-optimization).


In [None]:
def get_mae(max_leaf_nodes, column_name_X, column_value_X, column_name_y, column_value_y):
    """Return validation MAE for a tree limited to max_leaf_nodes leaves.

    column_name_* are the training predictors/target and
    column_value_* are the validation predictors/target.
    """
    iowa_function_model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    iowa_function_model.fit(column_name_X, column_name_y)
    sales_price_prediction = iowa_function_model.predict(column_value_X)
    mae = mean_absolute_error(column_value_y, sales_price_prediction)
    return mae

In [None]:
# compare MAE with differing values of max_leaf_nodes (10, 20, ..., 150)
for max_leaf_nodes in range(10, 151, 10):
    my_mae = get_mae(max_leaf_nodes, column_name_X, column_value_X, column_name_y, column_value_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" % (max_leaf_nodes, my_mae))
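
Rather than reading the best value off the printed table, the scores can be collected in a dictionary and the minimum picked programmatically. The sketch below does this on synthetic data; the helper `mae_for_leaf_count`, the candidate list and the data itself are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (made up): a noisy function of two predictors.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(300, 2))
y_syn = 3 * X_syn[:, 0] + np.sin(X_syn[:, 1]) + rng.normal(0, 0.5, size=300)

X_tr, X_val, y_tr, y_val = train_test_split(X_syn, y_syn, random_state=0)

def mae_for_leaf_count(max_leaf_nodes):
    """Hypothetical helper: validation MAE for a tree with the given leaf limit."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_val, model.predict(X_val))

candidates = [5, 10, 25, 50, 100]
scores = {n: mae_for_leaf_count(n) for n in candidates}

# The candidate with the lowest validation MAE.
best_leaf_count = min(scores, key=scores.get)
print(scores)
print(best_leaf_count)
```

Too few leaves underfit and too many overfit, so the lowest validation MAE usually sits somewhere between the extremes. Where exactly depends on the data.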

Build a random forest the same way we built a decision tree in scikit-learn.

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor()

forest_model.fit(column_name_X, column_name_y)
forest_sales_price_prediction = forest_model.predict(column_value_X)

print(mean_absolute_error(column_value_y, forest_sales_price_prediction))
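
A random forest draws bootstrap samples and random feature subsets while training, so the MAE printed above can change slightly from run to run unless *random_state* is fixed. The sketch below, on synthetic made-up data with an arbitrary `n_estimators=20`, shows that two forests built with the same seed give identical predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (made up) with three predictors.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(200, 3))
y_syn = 2 * X_syn[:, 0] + X_syn[:, 1] + rng.normal(0, 1, size=200)

# Two forests trained with the same seed.
first = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_syn, y_syn)
second = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_syn, y_syn)

same = np.allclose(first.predict(X_syn), second.predict(X_syn))
print(same)
```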