# Capstone Project

## PGDP - ML : 18/19 batch

### Members

Mr. Karan Mitra
Mr. Prabhakaran
Mr. Devanandh

### Mentor

Ms. Sulekha Aloorravi

## Problem Definition

A house's value is more than just its location and square footage. Like the features that make up a person, an educated party would want to know all the aspects that give a house its value. For example, suppose you want to sell a house but don't know what price to ask: it can't be too low or too high. To estimate it, you would usually look for similar properties in your neighbourhood and use the gathered data to assess your house's price. 

### Objective 

Take advantage of all the feature variables listed below, and use them to analyse and predict house prices. 

1.	cid: a notation for a house
2.	dayhours: Date house was sold
3.	room_bed: Number of Bedrooms/House
4.	room_bath: Number of bathrooms/bedrooms
5.	living_measure: square footage of the home
6.	lot_measure: square footage of the lot
7.	ceil: Total floors (levels) in house
8.	coast: House which has a view to a waterfront
9.	sight: Has been viewed
10.	condition: How good the condition is (Overall)
11.	quality: grade given to the housing unit, based on grading system
12.	ceil_measure: square footage of house apart from basement
13.	basement_measure: square footage of the basement
14.	yr_built: Built Year
15.	yr_renovated: Year when house was renovated
16.	zipcode: zip
17.	lat: Latitude coordinate
18.	long: Longitude coordinate
19.	living_measure15: Living room area in 2015 (implies some renovations). This might or might not have affected the lot size area
20.	lot_measure15: Lot size area in 2015 (implies some renovations)
21.	furnished: Based on the quality of room
22.	total_area: Measure of both living and lot

### Target Variable

price: Price is prediction target


In [None]:
# Basic library imports

%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Note: plotly.plotly moved to chart_studio.plotly in plotly >= 4.0
import plotly.plotly as py
import plotly.graph_objs as go
import plotly
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

from sklearn.ensemble import RandomForestClassifier

print(__version__)  # requires version >= 1.9.0
init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Importing dataset

house_df = pd.read_csv('innercity.csv')

# EDA and Pre-processing (Feature Engineering and Selection)

In [None]:
# Viewing first 10 entries in the dataframe
house_df.head(10)

In [None]:
# analyzing the size of the dataframe and the variable datatypes
house_df.info()
df_size=house_df.shape

There are a total of 21613 data points. 

Apart from "dayhours", the input variables are all of integer or float datatype. 

The target variable (price) is of integer datatype. 

Since price is a numeric quantity rather than a class label, we will look into regression-based models to predict it.

The "dayhours" column is of object type. Let's eyeball the data.



In [None]:
# Eyeballing dayhours column
house_df['dayhours'].head(5)

The first 4 digits give the year, the next 2 digits the month and the next 2 digits the day; the last 7 characters are probably the timestamp. Hence, we split this variable into its respective components.
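The layout described above can also be parsed as a timestamp rather than sliced by position. The sketch below assumes the values look like `20150427T000000` (8 date digits followed by a 7-character time part); the sample values are made up for illustration and are not taken from innercity.csv.

```python
import pandas as pd

# Hypothetical sample values in the assumed YYYYMMDD'T'HHMMSS layout
sample = pd.DataFrame({"dayhours": ["20150427T000000", "20141013T000000"]})

# Parse the full string as a datetime, then pull out the parts
parsed = pd.to_datetime(sample["dayhours"], format="%Y%m%dT%H%M%S")
sample["sold_year"] = parsed.dt.year
sample["sold_month"] = parsed.dt.month
sample["sold_date"] = parsed.dt.day
print(sample)
```

Parsing has the side benefit of failing loudly if any row does not follow the assumed layout, which positional slicing would silently accept.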

## 'Dayhours' data manipulation

In [None]:
# copying the source dataframe onto a new dataframe for manipulation
house_df_new=house_df.copy()

In [None]:
# creating a new column to mimic the timeframe
house_df_new['sold_date_full']=house_df_new['dayhours'].str[:8].astype('int64')

In [None]:
# Sold date versus price - Pairplot visualization
sns.pairplot(house_df_new,x_vars='sold_date_full',y_vars='price')

There are clusters forming in the time series. Hence, let's split the date into individual features: date, month and year.

In [None]:
# Creating separate features for sold date,month and year
house_df_new['sold_year']=house_df_new['dayhours'].str[:4].astype('int64')
house_df_new['sold_month']=house_df_new['dayhours'].str[4:6].astype('int64')
house_df_new['sold_date']=house_df_new['dayhours'].str[6:8].astype('int64')

In [None]:
# Evaluating feature - sold_year
house_df_new['sold_year'].head(5)

In [None]:
# evaluating feature - sold_month
house_df_new['sold_month'].head(5)

In [None]:
# evaluating feature- sold_date
house_df_new['sold_date'].head(5)

In [None]:
# having split the dayhours data, dropping the dayhours column from the dataframe

house_df_new=house_df_new.drop('dayhours',axis=1)

In [None]:
# looking into the new dataframe after dropping dayhours feature
house_df_new.info()

# Checking for missing values

In [None]:
# Check for NA values and count for each feature
house_df_new.isna().sum(axis=0)

There are no missing values in the dataset.

# Deliverable -1 (Exploratory data quality report reflecting the following)


# 1. Univariate analysis

Univariate analysis – data types and description of the independent attributes, which should include: name, meaning, range of values observed, central values (mean and median), standard deviation and quartiles, analysis of the body of distributions / tails, missing values, outliers

In [None]:
# bird's eye view of the numerical distribution of the dataframe

house_df_new.describe().transpose()

To get more clarity, let's evaluate each variable separately.

# Univariate Analysis 

### Check the input variable distribution and outliers 

# Bivariate Analysis

### In case of outlier presence, evaluate importance through correlation analysis with target price

## Variable: CID -> A notation for the house

In [None]:
# Verifying the distribution and histogram of the variable

sns.distplot(house_df_new['cid'])
house_df_new.head(3)

### Inference of distribution

There are two distinct distribution peaks in the column. Though this variable is used just as an identification variable, the histogram suggests there may be repeat/duplicate entries.

In [None]:
# creation of a copy dataframe to evaluate the duplicates

house_df_f=house_df_new.copy()

# duplicate entry extraction

house_df_f['cid_id']=house_df_new['cid'].duplicated()

In [None]:
# size of the duplicate dataframe

house_df_f.loc[house_df_f['cid_id'] == True].shape

There are a total of 177 duplicate entries in the dataframe. Let's look into the nature of the duplicates and check if they are really duplicates

In [None]:
# Listing the top 5 duplicate entries

house_df_f.loc[house_df_f['cid_id'] == True].sort_values(by='cid').head(5)

In [None]:
# returning all instances of the first 5 duplicate entries

house_df_f.loc[house_df_f['cid'].isin([1000102,7200179,109200390,123039336,251300110])].sort_values(by='cid')

### Inference of duplicates:

Certain cid values repeat, which indicates that the same property has been bought and resold: the other parameters are unchanged between entries. Hence, we are retaining them in the dataframe.
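One way to check the "same property, resold" reading is to count how many distinct values each column takes within a repeated cid. This is a minimal sketch on a made-up four-row frame; the values are hypothetical and only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical miniature of the dataset: cid 101 appears twice (a resale)
toy = pd.DataFrame({
    "cid": [101, 102, 101, 103],
    "sold_year": [2014, 2014, 2015, 2015],
    "price": [300000, 450000, 340000, 500000],
    "room_bed": [3, 4, 3, 2],
})

# keep=False flags every occurrence of a repeated cid, not just the later ones
repeats = toy[toy["cid"].duplicated(keep=False)].sort_values("cid")

# For each repeated cid, count distinct values per column:
# 1 means the attribute never changed between sales
changes = repeats.groupby("cid").nunique()
print(changes)
```

Columns that show a count of 1 are physical attributes that stayed fixed, while sale date and price would be expected to show more than 1 for a genuine resale.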

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['cid'],name='House property identifier',showlegend=True))]
plotly.offline.iplot(data)


In [None]:
# correlation of cid vs price
house_df_new['cid'].corr(house_df_new['price'])

### Inference of outliers

No outliers present in this variable. 

Note: This is an identification variable only, as also indicated by its lack of correlation with price; hence, this variable can be dropped for regression analysis.

## Variable room_bed -> No. of bedrooms/home

In [None]:
#evaluating the unique entries in the variable list

house_df_new['room_bed'].sort_values().unique()

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['room_bed'])
plt.show()

### Inference of distribution

The data is right skewed, indicating outliers, and the distinct peaks indicate clusters present in the data.

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['room_bed'],name='room_bed',showlegend=True))]
plotly.offline.iplot(data)


### Inference of outliers

The number of bedrooms varies from 0 to 33. 

As per the box plot, values below 2 and above 5 are outliers. Outlier treatment is required here.
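The box plot limits come from the usual 1.5 × IQR rule (Tukey fences). The sketch below reproduces that rule with pandas on hypothetical bedroom counts; plotly may compute quartiles with a slightly different interpolation, so limits read off the plot can differ a little from these.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical bedroom counts, including one extreme entry
beds = pd.Series([1, 2, 3, 3, 3, 4, 4, 4, 5, 33])
lower, upper = iqr_fences(beds)
outliers = beds[(beds < lower) | (beds > upper)]
print(lower, upper, outliers.tolist())
```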

In [None]:
# Evaluating the outliers in number of bedrooms - Case A 

# case A

room_bed_outlier=house_df_new[(house_df_new.room_bed>5)|(house_df_new.room_bed<2)]
room_bed_outlier.shape

In [None]:
# Case A correlation with price
room_bed_outlier['room_bed'].corr(house_df_new['price'])

There are 546 entries with more than 5 or fewer than 2 bedrooms. Within this subset, the bedroom count has a very low correlation with price. Let's check the number of entries with more than 5 or fewer than 1 bedroom.

In [None]:
# Case B

room_bed_outlier_1=house_df_new[(house_df_new.room_bed>5)|(house_df_new.room_bed<1)]
room_bed_outlier_1.shape

In [None]:
# Case B correlation with price
room_bed_outlier_1['room_bed'].corr(house_df_new['price'])

There are 347 entries with more than 5 or fewer than 1 bedroom. Within this subset, the bedroom count has no correlation with price. Hence, we can remove them from the analysis dataframe. If further improvement in model accuracy is required, the Case A entries can be removed as well.
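The correlation cells pass a filtered column together with the full price column. This works because `Series.corr` aligns both series on their index labels first, so only the rows in the subset are used. A small sketch with made-up numbers shows the equivalence:

```python
import pandas as pd

# Hypothetical values for illustration only
toy = pd.DataFrame({
    "room_bed": [0, 1, 3, 4, 6, 7, 9],
    "price": [200, 150, 400, 450, 520, 480, 900],
})

subset = toy[(toy["room_bed"] > 5) | (toy["room_bed"] < 2)]

# Series.corr aligns the two series on their index labels first
r_aligned = subset["room_bed"].corr(toy["price"])
r_explicit = subset["room_bed"].corr(subset["price"])
r_full = toy["room_bed"].corr(toy["price"])
print(r_aligned, r_explicit, r_full)
```

Note that the subset correlation describes only the flagged rows, so it can differ substantially from the correlation over the full column.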

## Variable room_bath -> No. of bathrooms/bedroom

In [None]:
#evaluating the unique entries in the variable list

house_df_new['room_bath'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['room_bath'].sort_values().unique().shape

The no. of room_bath vary from 0 to 8 with a total of 30 unique entries and there are decimal values too. 

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['room_bath'])
plt.show()

### Inference of distribution

The data is right skewed, indicating outliers, and the distinct peaks indicate clusters present in the data.

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['room_bath'],name='room_bath',showlegend=True))]
plotly.offline.iplot(data)

### Inference of suspected outliers

As per the box plot, values below 0.75 and above 3.5 are outliers. Outlier treatment is required here.

In [None]:
# creating outlier dataframe
room_bath_outlier=house_df_new[(house_df_new.room_bath>3.5)|(house_df_new.room_bath<0.75)]
room_bath_outlier.shape

In [None]:
# correlation of no. of bathrooms/bedrooms with price
room_bath_outlier['room_bath'].corr(house_df_new['price'])

There are 571 outlier entries, and they have only a moderate positive correlation with price. Hence, we can decide whether to retain or remove the outliers based on modelling accuracy.

## Variable living_measure -> square footage of the home

In [None]:
#evaluating the unique entries in the variable list

house_df_new['living_measure'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['living_measure'].sort_values().unique().shape

There are 1038 unique entries of living measure

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['living_measure'])
plt.show()

### Inference of distribution

The distribution is roughly bell-shaped but right skewed, indicating outliers at the high end.
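Skewness read off a histogram can be backed by a number. The sketch below uses synthetic log-normal values as a stand-in for a size-like feature (an assumption, not the real column) and checks two common signs of a right tail: positive sample skewness and a mean above the median.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for a size-like feature: log-normal values have a long right tail
living = pd.Series(rng.lognormal(mean=7.5, sigma=0.4, size=5000))

skewness = living.skew()                          # > 0 means a longer right tail
mean_gt_median = living.mean() > living.median()  # another sign of right skew
print(round(skewness, 2), mean_gt_median)
```

On the real data the same two lines could be applied to `house_df_new['living_measure']`.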

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['living_measure'],name='living_measure',showlegend=True))]
plotly.offline.iplot(data)

### Inference of suspected outliers

The outliers are present above 4230 as per the box plot.

In [None]:
# creating outlier dataframe
living_measure_outlier=house_df_new[(house_df_new.living_measure>4230)]
living_measure_outlier.shape

In [None]:
# correlation of living measure with price
living_measure_outlier['living_measure'].corr(house_df_new['price'])

There are 572 outlier entries. They also have a moderate positive correlation with price. Hence, the decision to remove or retain them needs to be taken based on model accuracy.

## Variable lot_measure -> square footage of the lot

In [None]:
#evaluating the unique entries in the variable list

house_df_new['lot_measure'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['lot_measure'].sort_values().unique().shape

There are 9782 unique entries of lot measure

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['lot_measure'])
plt.show()

### Inference of distribution

The distribution is unimodal and right skewed, indicating outliers. The histogram shows an abnormally high number of instances of smaller lot measures, indicating that such properties are the most common.

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['lot_measure'],name='lot_measure',showlegend=True))]
plotly.offline.iplot(data)

### Inference of suspected outliers

The outliers are present above 19141 as per the box plot.

In [None]:
# creating outlier dataframe
lot_measure_outlier=house_df_new[(house_df_new.lot_measure>19141)]
lot_measure_outlier.shape

In [None]:
# correlation of lot measure with price
lot_measure_outlier['lot_measure'].corr(house_df_new['price'])

There are 2425 outlier entries. They have no correlation with price. Hence, they can be removed from the analysis dataframe.

## Variable ceil -> Total floors in the house

In [None]:
#evaluating the unique entries in the variable list

house_df_new['ceil'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['ceil'].sort_values().unique().shape

There are 6 unique entries of total floors in house. There are decimal entries too. 

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['ceil'])
plt.show()

### Inference of distribution

The data has 4 peaks indicating 4 clusters. The maximum occurrence in the histogram is 1 floor, followed by 2 floors. The data is right skewed, indicating outliers at the higher numbers of floors.

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['ceil'],name='ceil',showlegend=True))]
plotly.offline.iplot(data)

### Inference of suspected outliers

Contrary to the inference of distribution plot, there are no outliers in the no. of floors in box plot. Hence, no outlier treatment required.

## Variable coast -> If the property is near a waterbody

In [None]:
#evaluating the unique entries in the variable list

house_df_new['coast'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['coast'].sort_values().unique().shape

This variable is of categorical type indicating if the property is facing a waterbody or not.

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['coast'])
plt.show()

### Inference of distribution

The data is extremely right skewed. The histogram shows an abnormally high number of properties not facing a waterbody.

In [None]:
# Number of houses not facing waterbody
coast_no=house_df_new[house_df_new.coast==0].shape
coast_no[0]

In [None]:
# Number of houses facing waterbody
coast_yes=house_df_new[house_df_new.coast==1].shape
coast_yes[0]

In [None]:
# Percentage of houses facing waterbody
print ('%3.2f'%(coast_yes[0]/df_size[0]*100),'%')

The data shows that only 163 houses face a waterbody, while the rest do not. That is, only 0.75% of the total houses face a waterbody.

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['coast'],name='coast',showlegend=True))]
plotly.offline.iplot(data)

### Inference of Suspected outliers

As the number of houses facing waterbody are only 0.75% of the total population in the database, they have been marked as outliers. Hence, let's evaluate the impact of facing waterbody against the price.

In [None]:
# Waterbody facing status versus price - Pairplot visualization
sns.pairplot(house_df_new,x_vars='coast',y_vars='price')

In [None]:
# Waterbody-facing status versus price - Correlation analysis
house_df_new['coast'].corr(house_df_new['price'])

The variable coast has a low correlation with the target price. Also, as the population of this sample is low, we can remove these entries from the analysis.
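For a 0/1 flag that covers very few rows, the Pearson correlation can stay low even when the flagged group is priced very differently, so a per-group summary is a useful companion check. This is a sketch on made-up prices; only the column names `coast` and `price` come from the notebook.

```python
import pandas as pd

# Hypothetical values for illustration only
toy = pd.DataFrame({
    "coast": [0, 0, 0, 0, 0, 0, 0, 1],
    "price": [300, 320, 350, 400, 410, 450, 500, 1500],
})

# Share of each class, and typical price per class
share = toy["coast"].value_counts(normalize=True)
summary = toy.groupby("coast")["price"].agg(["count", "median", "mean"])
print(share)
print(summary)
```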

In [None]:
# creating outlier dataframe
coast_outlier=house_df_new[(house_df_new.coast==1)]
coast_outlier.shape

## Variable sight -> If the property has been viewed

In [None]:
#evaluating the unique entries in the variable list

house_df_new['sight'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['sight'].sort_values().unique().shape

This variable is of categorical type indicating how many times the property has been viewed. 0 indicates the property has not been viewed earlier, while maximum number of views is 4.

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['sight'])
plt.show()

### Inference of distribution

The data is extremely right skewed. The histogram shows an abnormally high number of properties without previous viewings.

In [None]:
# Number of houses not viewed previously : Case C0
sight_no=house_df_new[house_df_new.sight==0].shape
sight_no[0]

In [None]:
# Number of houses previously viewed once : Case C1
sight_once=house_df_new[house_df_new.sight==1].shape
sight_once[0]

In [None]:
# Number of houses previously viewed twice : Case C2
sight_twice=house_df_new[house_df_new.sight==2].shape
sight_twice[0]

In [None]:
# Number of houses previously viewed thrice : Case C3
sight_thrice=house_df_new[house_df_new.sight==3].shape
sight_thrice[0]

In [None]:
# Number of houses previously viewed four times : Case C4
sight_four=house_df_new[house_df_new.sight==4].shape
sight_four[0]

In [None]:
# No. of houses viewed previously
print (sight_once[0]+sight_twice[0]+sight_thrice[0]+sight_four[0])

In [None]:
# Percentage of houses viewed previously
print ('%3.2f'%((sight_once[0]+sight_twice[0]+sight_thrice[0]+sight_four[0])/df_size[0]*100),'%')

The data shows only 2164 houses were viewed previously, accounting for 9.83% of the total house population. This is a significant number. Hence, the impact on pricing needs to be evaluated before deciding on outlier treatment.
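The per-value counts and the overall percentage can also be obtained in one pass, which makes the totals easier to cross-check. A minimal sketch on a made-up `sight` column (the values are hypothetical):

```python
import pandas as pd

# Hypothetical values for illustration only
sight = pd.Series([0, 0, 0, 0, 0, 0, 2, 0, 1, 4], name="sight")

counts = sight.value_counts().sort_index()   # rows per sight value
viewed = int((sight > 0).sum())              # rows with sight > 0
pct_viewed = (sight > 0).mean() * 100        # same share as a percentage
print(counts)
print(viewed, f"{pct_viewed:.2f} %")
```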

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['sight'],name='sight',showlegend=True))]
plotly.offline.iplot(data)

### Inference of Suspected outliers

As the number of houses viewed previously are only 9.83% of the total population in the database, they have been marked as outliers. Hence, let's evaluate the impact of property viewed previously against the price.

In [None]:
# House previously viewed status versus price - Pairplot visualization
sns.pairplot(house_df_new,x_vars='sight',y_vars='price')

In [None]:
# House previously viewed status versus price - Correlation analysis
house_df_new['sight'].corr(house_df_new['price'])

The variable sight has a low correlation with the target price. However, the call to remove or retain these entries can be taken based on modelling accuracy, since their number is sizeable.

In [None]:
# creating outlier dataframe
sight_outlier=house_df_new[(house_df_new.sight>0)]
sight_outlier.shape

## Variable condition -> How good the condition is (Overall)

In [None]:
#evaluating the unique entries in the variable list

house_df_new['condition'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['condition'].sort_values().unique().shape

This variable is of categorical type indicating the overall condition of the property with value ranging from 1 to 5. Probably 1 indicating poor condition and 5 indicating very good condition.  

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['condition'])
plt.show()

### Inference of distribution

The data has three peaks. With the rating of 3 having the maximum number of occurrences, the distribution is centrally spread. 

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['condition'],name='condition',showlegend=True))]
plotly.offline.iplot(data)

### Inference of Suspected outliers

The boxplot shows that the condition 1 is an outlier. Let's quantify the number of instances of condition 1.

In [None]:
# creating outlier dataframe
condition_outlier=house_df_new[(house_df_new.condition==1)]
condition_outlier.shape

There are only 30 outlier entities. Let us remove them from the analysis dataframe.

## Variable quality -> grade given to the housing unit, based on grading system

In [None]:
#evaluating the unique entries in the variable list

house_df_new['quality'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['quality'].sort_values().unique().shape

This variable is of categorical type indicating the grade of the property with value ranging from 1 to 13. 

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['quality'])
plt.show()

### Inference of distribution

The data has 7 peaks. With the rating of 7 having the maximum number of occurrences, the distribution is centrally spread. 

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['quality'],name='quality',showlegend=True))]
plotly.offline.iplot(data)

### Inference of Suspected outliers

The boxplot shows that quality values <6 and >9 are outliers. Let's quantify the number of outlier instances.

In [None]:
# creating outlier dataframe
quality_outlier=house_df_new[(house_df_new.quality<6)|(house_df_new.quality>9)]
quality_outlier.shape

There are 1911 outlier entities. Let us look into their correlation with price.

In [None]:
# correlation of quality with price
quality_outlier['quality'].corr(house_df_new['price'])

The outlier entries have a moderate positive correlation with price. Hence, the decision to remove or retain the outliers can be made based on model accuracy.

## Variable ceil_measure -> square footage apart from basement

In [None]:
# evaluating the number of unique entries
house_df_new['ceil_measure'].sort_values().unique().shape

There are 946 unique entries in the variable

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['ceil_measure'])
plt.show()

### Inference of distribution

The distribution is roughly bell-shaped with moderate right skewness, indicating the presence of outliers at high ceil measures.

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['ceil_measure'],name='ceil_measure',showlegend=True))]
plotly.offline.iplot(data)

### Inference of Suspected outliers

The boxplot shows that ceil measure > 3740 are outliers. Let's quantify the number of instances.

In [None]:
# creating outlier dataframe
ceil_measure_outlier=house_df_new[(house_df_new.ceil_measure>3740)]
ceil_measure_outlier.shape

In [None]:
# correlation of ceil_measure with price

ceil_measure_outlier['ceil_measure'].corr(house_df_new['price'])

There are 611 outlier entities and they have a moderate positive correlation with price. Hence, based on model accuracy we can choose to remove or retain them.

## Variable basement -> square footage of basement

In [None]:
#evaluating the unique entries in the variable list

house_df_new['basement'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['basement'].sort_values().unique().shape

There are 306 unique entries in the variable

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['basement'])
plt.show()

### Inference of distribution

The data has two peaks and is right skewed, indicating the presence of outliers at high basement measures. The histogram also indicates that most of the houses do not have a basement.

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['basement'],name='basement',showlegend=True))]
plotly.offline.iplot(data)

### Inference of Suspected outliers

The boxplot shows that basement measures > 1400 are outliers. Let's quantify the number of instances.

In [None]:
# creating outlier dataframe
basement_outlier=house_df_new[(house_df_new.basement>1400)]
basement_outlier.shape

In [None]:
# correlation of basement size with price

basement_outlier['basement'].corr(house_df_new['price'])

There are 496 outlier entities and they have a moderate positive correlation with price. Hence, based on modelling accuracy, let's take a call to remove or retain them.

## Variable yr_built-> Built year

In [None]:
#evaluating the unique entries in the variable list

house_df_new['yr_built'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['yr_built'].sort_values().unique().shape

There are 116 unique entries in the variable. The earliest house was built in 1900 and the latest in 2015.

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['yr_built'])
plt.show()

### Inference of distribution

The data shows that house construction followed an increasing trend from 1900, peaking in the 2000s. 

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['yr_built'],name='yr_built',showlegend=True))]
plotly.offline.iplot(data)

### Inference of Suspected outliers

The boxplot shows there are no outliers in the data.

## Variable yr_renovated -> Renovated year

In [None]:
#evaluating the unique entries in the variable list

house_df_new['yr_renovated'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['yr_renovated'].sort_values().unique().shape

There are 70 unique entries in the variable. A value of '0' typically means that the property was not renovated.

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['yr_renovated'])
plt.show()

### Inference of distribution

The data supports two inferences. First, most of the houses have not been renovated. Second, the renovations that did take place were mostly in the 2000s.
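Because 0 acts as a "not renovated" marker rather than a year, the column mixes two kinds of information. One possible way to separate them, shown here only as an illustrative sketch, is a binary flag plus a "years since last work" value. The names `renovated`, `last_work_year` and `years_since_work` are hypothetical and the rows are made up.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "yr_built": [1950, 1990, 1975, 2005],
    "yr_renovated": [0, 0, 2003, 0],   # 0 is read as "never renovated"
    "sold_year": [2014, 2015, 2014, 2015],
})

# Binary flag: was the house ever renovated?
toy["renovated"] = (toy["yr_renovated"] > 0).astype(int)

# Year of the last major work: renovation year if present, else build year
toy["last_work_year"] = toy["yr_renovated"].where(toy["yr_renovated"] > 0, toy["yr_built"])
toy["years_since_work"] = toy["sold_year"] - toy["last_work_year"]
print(toy)
```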

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['yr_renovated'],name='yr_renovated',showlegend=True))]
plotly.offline.iplot(data)

### Inference of Suspected outliers

The boxplot marks all renovated properties as outliers. Let's look into the number of such instances.

In [None]:
# creating outlier dataframe
yr_renovated_outlier=house_df_new[(house_df_new.yr_renovated>0)]
yr_renovated_outlier.shape

There are 914 outlier entities. Let us look into their correlation with price to decide whether or not to remove them from the analysis dataframe.

In [None]:
# correlation of yr_renovated with price
yr_renovated_outlier['yr_renovated'].corr(house_df_new['price'])

The correlation with price is very low. Hence, we can remove them from the analysis dataframe.

## Variable zipcode -> Property Zipcode

In [None]:
#evaluating the unique entries in the variable list

house_df_new['zipcode'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['zipcode'].sort_values().unique().shape

There are 70 unique zipcode entries in the dataset. On looking up the zipcodes, these are located in Seattle, Washington in the USA.

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['zipcode'])
plt.show()

### Inference of distribution

There are multiple peaks indicating clusters present in the data. Almost all zipcodes have multiple entries, indicating multiple house properties in a given area. 

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['zipcode'],name='zipcode',showlegend=True))]
plotly.offline.iplot(data)

### Inference of outliers

There are no outliers present in the dataset.

## Variable lat -> Property latitude

In [None]:
#evaluating the unique entries in the variable list

house_df_new['lat'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['lat'].sort_values().unique().shape

There are 5034 unique latitude entries in the dataset. 

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['lat'])
plt.show()

### Inference of distribution

There are three distinct peaks indicating 3 prominent latitude clusters present in the data. 

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['lat'],name='lat',showlegend=True))]
plotly.offline.iplot(data)

### Inference of suspected outliers

As per the boxplot, latitudes < 47.1622 are outliers, and thus there are two outliers present in the dataset. Let us look at them in conjunction with the longitude details to check if they are truly outliers.

## Suspected Outlier Verification - Latitude

In [None]:
# printing all rows with suspected outliers in Latitude and checking against longitude and zipcode

house_df_new[house_df_new.lat<47.1622][['lat','long','zipcode']]

### Outlier Conclusion

The suspected latitude coordinates were verified along with the respective longitudes and matched the zipcode provided. Hence, they are not cases of mis-entry, and these data points will be retained in the dataset.

#### Online verification of co-ordinates

https://www.melissa.com/v2/lookups/latlngzip4/index?lat=47.1559&lng=-121.646

https://www.melissa.com/v2/lookups/latlngzip4/index?lat=47.1593&lng=-121.957

## Variable long -> Property longitude

In [None]:
#evaluating the unique entries in the variable list

house_df_new['long'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['long'].sort_values().unique().shape

There are 752 unique longitude entries in the dataset. 

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['long'])
plt.show()

### Inference of distribution

There are five distinct peaks indicating 5 prominent longitude clusters present in the data. 

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['long'],name='long',showlegend=True))]
plotly.offline.iplot(data)

### Inference of suspected outliers

As per the boxplot, all values greater than -121.821 are marked as outliers. Let's evaluate them along with latitude to check if they are truly outliers.

## Suspected Outlier Verification - Longitude

In [None]:
# selecting all rows with suspected outliers in longitude and counting them

long_outlier=house_df_new[house_df_new.long>-121.821]

long_outlier.shape

There are a total of 256 suspected outliers in longitude

In [None]:
# verifying unique zipcodes of the suspected longitude outliers

long_outlier['zipcode'].unique()

### Evaluating one such coordinate in the uszipcode database to check the correctness of the longitude

ID: 21514 Lat: 47.4834 Long: -121.773 Zipcode: 98045

In [None]:
#!pip install uszipcode  #to install the uszipcode package

In [None]:
# using the SearchEngine module in the uszipcode package

from uszipcode import SearchEngine 
search = SearchEngine(simple_zipcode=True) #import only simple_zipcode package (9mb)

In [None]:
# Employing reverse Geocoding to evaluate the zipcode for the input lat and long
res=search.by_coordinates(47.4834,-121.773,radius=10,returns=0)
for match in res:
    print(match.zipcode)

In [None]:
# correlation of longitude with price

long_outlier['long'].corr(house_df_new['price'])

### Outlier inference

For this lat and long, the zipcode in the US database does not match the entry in our house database. Also, there is no correlation with price; hence, we'll remove these entries from the analysis dataframe.
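A zipcode mismatch can be put on a scale by measuring how far a listing sits from a reference point for its stated zipcode. The sketch below implements the standard haversine great-circle distance; the reference coordinates are a hypothetical placeholder, not a real zipcode centroid, and would need to come from a zipcode database.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Listing coordinates from the example above
listing = (47.4834, -121.773)
# Hypothetical reference point, for illustration only (not a real zipcode centroid)
reference = (47.50, -121.78)

print(round(haversine_km(*listing, *reference), 2), "km")
```

A distance of a few kilometres would point to a boundary effect rather than a data-entry error, which is worth ruling out before dropping rows.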


## Variable living_measure15 -> Living measure in 2015 (denotes some renovations)

In [None]:
#evaluating the unique entries in the variable list

house_df_new['living_measure15'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['living_measure15'].sort_values().unique().shape

There are 777 unique renovated living measure entries in the dataset.

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['living_measure15'])
plt.show()

### Inference of distribution

The data has a roughly bell-shaped distribution with only one peak. The distribution is slightly right skewed, indicating a possibility of outliers.

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['living_measure15'],name='Living measure renovated in 2015',showlegend=True))]
plotly.offline.iplot(data)

### Inference of Suspected outliers

As per the boxplot, values > 3660 are suspected outliers.
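
The cut-off above is read off the boxplot's upper fence. As a minimal sketch, assuming the usual Q3 + 1.5 × IQR rule, the fence can also be computed directly. The helper `upper_fence` and the toy values are hypothetical, and pandas' default quantile interpolation may differ slightly from the one plotly uses, so the numbers need not match the plot exactly.

```python
import pandas as pd

def upper_fence(values, k=1.5):
    """Return Q3 + k * IQR, the usual boxplot upper fence (hypothetical helper)."""
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return q3 + k * (q3 - q1)

# Toy data, not taken from the housing dataset
toy = pd.Series([10, 12, 14, 16, 18, 20, 100])
fence = upper_fence(toy)
suspected = toy[toy > fence]      # rows above the fence are suspected outliers
print(fence, suspected.tolist())
```

Computing the fence in code keeps the threshold reproducible if the dataframe changes.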

In [None]:
# creating outlier dataframe
liv_meas15_outlier=house_df_new[(house_df_new.living_measure15>3660)]
liv_meas15_outlier.shape

In [None]:
# correlation of living measure (2015) with price

liv_meas15_outlier['living_measure15'].corr(house_df_new['price'])

There are 544 outlier entries. They have no correlation with price. Hence, they can be removed from the analysis dataframe.
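
The correlation cells pass the full price column to `corr` while the left-hand side holds only the outlier rows. This works because pandas aligns both Series on their index first. The sketch below illustrates that on a made-up toy frame (the column values are assumptions, not housing data).

```python
import pandas as pd

toy = pd.DataFrame({
    "living_measure15": [1000, 1500, 2000, 3700, 4000, 4500],
    "price": [200, 260, 310, 900, 700, 950],
})
outliers = toy[toy.living_measure15 > 3660]

# Series.corr aligns both Series on their index before computing Pearson's r,
# so the full price column gives the same result as the subset's own column.
r_full = outliers["living_measure15"].corr(toy["price"])
r_subset = outliers["living_measure15"].corr(outliers["price"])
print(round(r_full, 4), round(r_subset, 4))
```

Note that the result describes only the rows above the threshold. A weak correlation inside this restricted range does not by itself say how the variable relates to price over the full range.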

## Variable lot_measure15 -> Lot measure in 2015 

In [None]:
#evaluating the unique entries in the variable list

house_df_new['lot_measure15'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['lot_measure15'].sort_values().unique().shape

There are 8689 unique lot measure (2015) values in the dataset.

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['lot_measure15'])
plt.show()

### Inference of distribution

The data has two peaks, indicating two clusters. The histogram shows an abnormally high number of instances among the smaller lot measures, indicating that such properties dominate the dataset. Also, the data is extremely right skewed, indicating the presence of outliers.

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['lot_measure15'],name='Lot measure in 2015',showlegend=True))]
plotly.offline.iplot(data)

### Inference of Suspected outliers

As per the boxplot, values > 17.55k are suspected outliers.

In [None]:
# creating outlier dataframe
lot_meas15_outlier=house_df_new[(house_df_new.lot_measure15>17550)]
lot_meas15_outlier.shape

In [None]:
# correlation of lot measure(2015) with price

lot_meas15_outlier['lot_measure15'].corr(house_df_new['price'])

There are 2194 outlier entries, and they have no correlation with price. Hence, they can be removed from the analysis dataframe.

## Variable Furnished -> Based on the quality of room 

In [None]:
#evaluating the unique entries in the variable list

house_df_new['furnished'].sort_values().unique()

In [None]:
# evaluating the number of unique entries
house_df_new['furnished'].sort_values().unique().shape

This is a categorical variable indicating whether the house is furnished or not.

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['furnished'])
plt.show()

### Inference of distribution

As the data is categorical, the distribution cannot be quantified. But based on the histogram, we can see that only a few entities are furnished, while the rest are unfurnished.

In [None]:
# Number of unfurnished houses
furn_no=house_df_new[house_df_new.furnished==0].shape
furn_no[0]

In [None]:
# Number of furnished houses
furn_yes=house_df_new[house_df_new.furnished==1].shape
furn_yes[0]

In [None]:
# Percentage of Furnished houses
print ('%3.2f'%(furn_yes[0]/df_size[0]*100),'%')

The data shows only 4251 houses are furnished, while the rest are unfurnished. That is, only 19.67% of the houses are furnished.
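
The counts and the percentage can also be obtained in one step with `value_counts`. The sketch below is an illustration only: it rebuilds a synthetic 0/1 column from the figures quoted in this notebook (4251 furnished out of 21613 rows) rather than using the real dataframe.

```python
import pandas as pd

# Synthetic 0/1 column rebuilt from the counts quoted in the text;
# this is not the real dataframe.
furnished = pd.Series([1] * 4251 + [0] * (21613 - 4251), name="furnished")

counts = furnished.value_counts()                      # absolute counts per class
share = furnished.value_counts(normalize=True) * 100   # percentage per class
print(counts[1], round(share[1], 2))
```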

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['furnished'],name='Furnished',showlegend=True))]
plotly.offline.iplot(data)

### Inference of Suspected outliers

As furnished houses make up only 19.67% of the total population in the database, the boxplot marks them as outliers. Hence, let's evaluate the impact of furnishing on the price.

In [None]:
# Furnished status versus price - Pairplot visualization
sns.pairplot(house_df_new,x_vars='furnished',y_vars='price')

In [None]:
# Furnished status versus price - Correlation analysis
house_df_new['furnished'].corr(house_df_new['price'])

The variable furnished has a moderate positive correlation with the target price. Also, the pairplot shows that furnished houses have a higher price. Hence, we will retain all the rows.


## Variable total_area -> Measure of both living and lot

In [None]:
#evaluating the unique entries in the variable list

house_df_new['total_area'].sort_values().unique()

In [None]:
# evaluating the number of unique entries

house_df_new['total_area'].sort_values().unique().shape

There are 11163 unique total area values in the dataset.

In [None]:
# Verifying the distribution of the variable

sns.distplot(house_df_new['total_area'])
plt.show()

### Inference of distribution

The data has a single distinguishable peak. The histogram shows an abnormally high number of instances among the smaller total area measures, indicating that such properties dominate the dataset. Also, the data is extremely right skewed, indicating the presence of outliers.

In [None]:
# Plotting boxplot to detect outliers

data=[(go.Box(x=house_df_new['total_area'],name='total_area',showlegend=True))]
plotly.offline.iplot(data)

### Inference of Suspected outliers

As per the boxplot, values > 21942 are suspected outliers.

In [None]:
# creating outlier dataframe
total_area_outlier=house_df_new[(house_df_new.total_area>21942)]
total_area_outlier.shape

In [None]:
# correlation of total area with price

total_area_outlier['total_area'].corr(house_df_new['price'])

There are 2419 outlier entries. They have no correlation with price. Hence, they can be removed from the analysis dataframe.

# Outlier Summary

Based on the univariate and bivariate analysis, the following decisions were taken. 

### Outliers to be removed:

1) total_area         (2419 entries)

2) lot_measure15      (2194 entries)

3) room_bed -> Case B (347 entries)

4) lot_measure        (2425 entries)

5) long               (256 entries)

6) coast              (163 entries)

7) condition          (30 entries)

8) yr_renovated       (914 entries)

9) cid                (entire column)



### Outliers decision to be taken based on modelling accuracy:



# Correlation analysis

In [None]:
X=house_df_new.drop('price',axis=1)

In [None]:
X_corr=X.corr()

In [None]:
fig, ax = plt.subplots(figsize=(20,20))   
sns.heatmap(X_corr,annot=True)

## Inference

lot_measure and total_area are very strongly correlated. Likewise, ceiling measure and living measure are highly correlated. As there are strong positive correlations between some of the input variables, we can do a PCA to reduce the dimensions.
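
Reading strong pairs off a 20x20 annotated heatmap is error-prone, so the pairs can also be listed programmatically. The following is a sketch on synthetic data: the helper `strong_pairs`, the 0.8 threshold and the toy columns are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def strong_pairs(corr, threshold=0.8):
    """List (col_a, col_b, r) for pairs with |r| >= threshold (hypothetical helper)."""
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= threshold]   # also discards any NaN entries
    return [(a, b, round(float(r), 3)) for (a, b), r in pairs.items()]

# Toy frame: total_area is built as living + lot, mirroring the column definition
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "living_measure": rng.normal(2000, 500, 200),
    "lot_measure": rng.normal(15000, 8000, 200),
})
toy["total_area"] = toy["living_measure"] + toy["lot_measure"]
found = strong_pairs(toy.corr())
print(found)
```

On the real data, the same helper could be called as `strong_pairs(X_corr)`.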

# Unsupervised learning to evaluate the number of clusters using k means clustering - Base model

In [None]:
house_df_base=house_df_new.copy()
attributes = house_df_base.drop('price',axis=1)

#Finding optimal no. of clusters
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
clusters=range(1,20)
meanDistortions=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(attributes)
    prediction=model.predict(attributes)
    meanDistortions.append(sum(np.min(cdist(attributes, model.cluster_centers_, 'euclidean'), axis=1)) / attributes.shape[0])

print (meanDistortions)    
plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')


In [None]:
#  K = 4
final_model=KMeans(n_clusters=4)
final_model.fit(attributes)
prediction=final_model.predict(attributes)

#Append the prediction 
house_df_base["GROUP"] = prediction
print("Groups Assigned : \n")
house_df_base[["price", "GROUP"]].head(5)

In [None]:
sns.pairplot(house_df_base,hue='GROUP',size=3)

In [None]:
house_df_base.groupby(by='GROUP',axis=0).count()

Groups 0 and 2 contain a similar number of samples; group 3, however, has very few samples.

There are a total of 21613 entities. Hence, let's upsample and downsample among the 4 groups to a mean sample size of about 5403 samples each.
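
One simple way to bring every group to the same size is plain random resampling, sketched below on a toy frame. This is an assumption about how the balancing could be done, not the SMOTE approach hinted at in the following cells. The helper `balance_groups` and the toy group sizes are hypothetical.

```python
import pandas as pd

def balance_groups(df, group_col, target, seed=0):
    """Resample every group to exactly `target` rows (hypothetical helper).

    Small groups are upsampled with replacement, large ones are downsampled
    without replacement. This is plain random resampling, not SMOTE.
    """
    parts = []
    for _, grp in df.groupby(group_col):
        parts.append(grp.sample(n=target, replace=len(grp) < target, random_state=seed))
    return pd.concat(parts, ignore_index=True)

toy = pd.DataFrame({"GROUP": [0] * 50 + [1] * 30 + [2] * 15 + [3] * 5,
                    "price": range(100)})
target = len(toy) // toy["GROUP"].nunique()   # mean group size: 100 // 4 = 25
balanced = balance_groups(toy, "GROUP", target)
sizes = balanced["GROUP"].value_counts().sort_index().tolist()
print(sizes)
```

Upsampling by duplication adds no new information, so very small groups remain only as varied as their original rows.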

In [None]:
#!pip install imbalanced-learn --user

In [None]:
#from imblearn.over_sampling import SMOTE

In [None]:
# outliers to be removed - grouping and creating a dataframe

house_df_out1=house_df_new.copy()
house_df_out1.info()

In [None]:
# outlier strategy 1
# dropping rows whose index appears in any of the outlier dataframes
house_df_out1=house_df_out1.drop(total_area_outlier['cid'].index|lot_meas15_outlier['cid'].index|
                                 liv_meas15_outlier['cid'].index|room_bed_outlier_1['cid'].index|
                                 lot_measure_outlier['cid'].index|long_outlier['cid'].index|
                                 coast_outlier['cid'].index|condition_outlier['cid'].index|
                                 yr_renovated_outlier['cid'].index,axis=0)
# dropping 'cid' from the dataframe
house_df_out1=house_df_out1.drop('cid',axis=1)
house_df_out1.info()

In [None]:
# Outlier Strategy 2
# creating new dataframe with suspected outliers to be removed based on model accuracy

house_df_out2=house_df_new.copy()
house_df_out2=house_df_out2.drop(total_area_outlier['cid'].index|lot_meas15_outlier['cid'].index|
                                 liv_meas15_outlier['cid'].index|room_bed_outlier_1['cid'].index|
                                 lot_measure_outlier['cid'].index|long_outlier['cid'].index|
                                 coast_outlier['cid'].index|condition_outlier['cid'].index|
                                 yr_renovated_outlier['cid'].index|room_bath_outlier['cid'].index|
                                 room_bed_outlier['cid'].index|living_measure_outlier['cid'].index|
                                 sight_outlier['cid'].index|quality_outlier['cid'].index|
                                 room_bath_outlier['cid'].index,axis=0)
# dropping 'cid' from the dataframe
house_df_out2=house_df_out2.drop('cid',axis=1)
house_df_out2.info()


With outlier treatment strategy 1, a total of 4,233 entities have been removed from the analysis dataframe. 

Outlier treatment strategy 2 removes an additional 2,200 entities. Initially, we'll use only outlier treatment strategy 1 for the analysis.
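
The number of removed rows is smaller than the sum of the individual outlier counts because a row flagged by several rules is dropped only once. A minimal sketch of that union-then-drop logic, using `Index.union` on made-up toy data (the thresholds are reused from this notebook only as examples):

```python
from functools import reduce
import pandas as pd

toy = pd.DataFrame({"total_area": [10, 50, 12, 60, 11, 13],
                    "long": [-122.3, -122.1, -121.7, -121.6, -122.2, -122.0]})

# Each outlier frame is a filtered copy; a row can be flagged by more than one rule
outlier_frames = [toy[toy.total_area > 40], toy[toy.long > -121.821]]

# Union of the row labels, so rows flagged several times are dropped only once
to_drop = reduce(lambda a, b: a.union(b), (f.index for f in outlier_frames))
cleaned = toy.drop(index=to_drop)
print(list(to_drop), len(toy) - len(cleaned))
```

Here the two rules flag two rows each, but only three distinct rows are removed.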

In [None]:
# With reference to univariate inferences, converting binary variables (coast, furnished) and date-derived variables (sold_year, sold_month, yr_renovated) into categorical variables by one-hot encoding

house_df_out2=pd.get_dummies(house_df_out2, columns= ['sold_year','sold_month','yr_renovated','coast','furnished'])
                                                                   

In [None]:
house_df_out2.info()

In [None]:
house_df_out2.head(5)

# Base Modelling - Benchmark

In [None]:
# input and target variable definition

X_base=house_df_new.drop('price',axis=1)
y_base=house_df_new['price']
print("Base X set size:", X_base.shape)
print("Base Y set size:",y_base.shape)
from sklearn.model_selection import train_test_split
X_train_base, X_test_base, y_train_base, y_test_base = train_test_split(X_base, y_base, test_size=0.30, random_state=10)
print("X_train set size:", X_train_base.shape)
print("Y_train set size:",y_train_base.shape)
print("X_test set size:", X_test_base.shape)
print("Y_test set size:",y_test_base.shape)


# Linear Regression - base model

In [None]:
# Linear Regression Model 
from sklearn.linear_model import LinearRegression

In [None]:
base_regression_model = LinearRegression()
base_regression_model.fit(X_train_base, y_train_base)

In [None]:
base_reg_train_acc=round((base_regression_model.score(X_train_base,y_train_base)*100),2)
print ('Train model accuracy:' ,base_reg_train_acc,'%')

In [None]:
base_reg_test_acc= round((base_regression_model.score(X_test_base,y_test_base)*100),2)
print ('Test model accuracy:', base_reg_test_acc,'%')

With base modeling, Linear Regression yields an accuracy (R² score) of 71%
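
The "accuracy" reported throughout this notebook comes from the `score` method of scikit-learn regressors, which returns the coefficient of determination R², not a classification accuracy. As a sketch of what that number means, R² can be computed by hand with NumPy. The function name `r_squared` and the toy values are illustrative assumptions.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Toy prices and predictions, not taken from the housing data
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 310, 390]
r2 = r_squared(y_true, y_pred)
print(round(r2 * 100, 2), '%')
```

An R² of 71% therefore means the model explains about 71% of the variance of the price in the evaluated set.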

In [None]:
# create a panda summary dataframe of results
#data=['Base model','Linear regression',base_reg_train_acc,base_reg_test_acc]
#data
data = {'Strategy': ['Base model'], 'Modelling method': ['Linear regression'],'Train model accuracy': 
     [base_reg_train_acc],'Test model':[base_reg_test_acc]}
acc_df = pd.DataFrame(data)
acc_df

# Decision Tree Regression - base model

In [None]:
# Base Model - Decision Tree Regression (with base dataframe)

# Decision Tree Regression model

from sklearn.tree import DecisionTreeRegressor

base_dt_model = DecisionTreeRegressor(max_depth=12,random_state=100)
# depth of 12 has been considered since the number of clusters is 12 in the dataframe
base_dt_model.fit(X_train_base,y_train_base)

In [None]:
base_dt_train_acc=round((base_dt_model.score(X_train_base,y_train_base)*100),2)
print ('Train model accuracy:' ,base_dt_train_acc,'%')

In [None]:
base_dt_test_acc= round((base_dt_model.score(X_test_base,y_test_base)*100),2)
print ('Test model accuracy:', base_dt_test_acc,'%')

#### With base modeling, Decision Tree Regression yields an accuracy of 73.5%

In [None]:
# append to accuracy summary table
row_add=['Base model','Decision Tree Regression',base_dt_train_acc,base_dt_test_acc]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=acc_df.columns)  # keep the column names
acc_df
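
Stacking rows through NumPy mixes strings and numbers in one array, so the scores can lose their numeric type. An alternative sketch is to collect one dict per model and rebuild the table from that list. The helper `record` is hypothetical and the scores below are placeholders for illustration, not results from this notebook.

```python
import pandas as pd

results = []   # one dict per fitted model

def record(strategy, method, train_acc, test_acc):
    """Append one row to the running results list (hypothetical helper)."""
    results.append({"Strategy": strategy, "Modelling method": method,
                    "Train model accuracy": train_acc, "Test model": test_acc})
    return pd.DataFrame(results)

# Placeholder scores for illustration only
record("Base model", "Linear regression", 80.0, 75.0)
summary = record("Base model", "Decision Tree Regression", 95.0, 78.0)
print(summary)
```

With this layout the score columns stay numeric, so the table can be sorted or plotted directly.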

# Random Forest Regression - Base model

In [None]:
# Random Forest Regression

from sklearn.ensemble import RandomForestRegressor
base_rtr_model=RandomForestRegressor(max_depth=12,random_state=100)
# depth of 12 has been considered since the number of clusters is 12 in the dataframe
base_rtr_model.fit(X_train_base,y_train_base)

In [None]:
base_rtr_train_acc=round((base_rtr_model.score(X_train_base,y_train_base)*100),2)
print ('Train model accuracy:' ,base_rtr_train_acc,'%')

In [None]:
base_rtr_test_acc= round((base_rtr_model.score(X_test_base,y_test_base)*100),2)
print ('Test model accuracy:', base_rtr_test_acc,'%')

#### With base modeling, Random forest Regression yields an accuracy of 87%

In [None]:
# append to accuracy summary table
row_add=['Base model','Random Forest Regression',base_rtr_train_acc,base_rtr_test_acc]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=acc_df.columns)  # keep the column names
acc_df

# Normalizing & Outlier Treated [Strategy 1] Dataframes - Modelling

In [None]:
# Normalizing the dataframe to evaluate the spread across the elements in a similar fashion

from scipy.stats import zscore

house_scaled_df_out_1 = house_df_out1.apply(zscore)
house_scaled_df_out_1.info()

In [None]:
house_scaled_df_out_1=house_scaled_df_out_1.fillna(0)
house_scaled_df_out_1[['yr_renovated','coast']].describe()
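
The `fillna(0)` step is needed because a z-score is undefined for a column with zero variance. Presumably this is what happens to columns such as `coast` once their minority rows have been dropped as outliers, though that is an assumption about this dataframe. A small sketch on toy data shows the effect:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"living_measure": [1000.0, 1500.0, 2000.0],
                    "coast": [0.0, 0.0, 0.0]})   # constant column (toy stand-in)

# Population z-score (ddof=0), the convention scipy.stats.zscore uses by default
with np.errstate(invalid="ignore", divide="ignore"):
    scaled = (toy - toy.mean()) / toy.std(ddof=0)

had_nan = bool(scaled["coast"].isna().all())   # 0 / 0 gives NaN for the constant column
scaled = scaled.fillna(0)
print(had_nan, scaled["coast"].tolist())
```

A constant column carries no information for the models, so dropping it would be an alternative to filling it with zeros.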

In [None]:
#apply(zscore) already returns a dataframe; re-wrapping it keeps the original column order

house_scaled_df_out_1 = pd.DataFrame(house_scaled_df_out_1, columns=house_df_out1.columns)
#Evaluating the scaled dataframe

house_scaled_df_out_1.shape

In [None]:
X_out_1=house_scaled_df_out_1.drop('price',axis=1)
y_out_1=house_scaled_df_out_1['price']
print("Out 1 X set size:", X_out_1.shape)
print("Out 1 Y set size:",y_out_1.shape)
X_train_out_1, X_test_out_1, y_train_out_1, y_test_out_1 = train_test_split(X_out_1, y_out_1, test_size=0.30, random_state=10)
print("Out 1 X_train set size:", X_train_out_1.shape)
print("Out 1 Y_train set size:",y_train_out_1.shape)
print("Out 1 X_test set size:", X_test_out_1.shape)
print("Out 1 Y_test set size:",y_test_out_1.shape)

# Linear Regression - Outlier Strategy 1 model

In [None]:
out_1_regression_model = LinearRegression()
out_1_regression_model.fit(X_train_out_1, y_train_out_1)

In [None]:
out_1_reg_train_acc=round((out_1_regression_model.score(X_train_out_1,y_train_out_1)*100),2)
print ('Train model accuracy:' ,out_1_reg_train_acc,'%')

In [None]:
out_1_reg_test_acc= round((out_1_regression_model.score(X_test_out_1,y_test_out_1)*100),2)
print ('Test model accuracy:', out_1_reg_test_acc,'%')

With outlier strategy 1 modeling, Linear Regression yields an accuracy of 69%

In [None]:
# append to accuracy summary table
row_add=['Out 1 model','Linear Regression',out_1_reg_train_acc,out_1_reg_test_acc]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=acc_df.columns)  # keep the column names
acc_df

# Decision Tree Regression - Outlier Strategy 1 model

In [None]:
# Decision Tree Regression model

out_1_dt_model = DecisionTreeRegressor(max_depth=12,random_state=100)
# depth of 12 has been considered since the number of clusters is 12 in the dataframe
out_1_dt_model.fit(X_train_out_1,y_train_out_1)

In [None]:
out_1_dt_train_acc=round((out_1_dt_model.score(X_train_out_1,y_train_out_1)*100),2)
print ('Train model accuracy:' ,out_1_dt_train_acc,'%')

In [None]:
out_1_dt_test_acc= round((out_1_dt_model.score(X_test_out_1,y_test_out_1)*100),2)
print ('Test model accuracy:', out_1_dt_test_acc,'%')

#### With Outlier 1 strategy modeling, Decision Tree Regression yields an accuracy of 76.5%

In [None]:
# append to accuracy summary table
row_add=['Out 1 model','Decision Tree Regression',out_1_dt_train_acc,out_1_dt_test_acc]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=acc_df.columns)  # keep the column names
acc_df

# Random Forest Regression - Outlier 1 strategy model

In [None]:
# Random Forest Regression

from sklearn.ensemble import RandomForestRegressor
out_1_rtr_model=RandomForestRegressor(max_depth=12,random_state=100)
# depth of 12 has been considered since the number of clusters is 12 in the dataframe
out_1_rtr_model.fit(X_train_out_1,y_train_out_1)

In [None]:
out_1_rtr_train_acc=round((out_1_rtr_model.score(X_train_out_1,y_train_out_1)*100),2)
print ('Train model accuracy:' ,out_1_rtr_train_acc,'%')

In [None]:
out_1_rtr_test_acc= round((out_1_rtr_model.score(X_test_out_1,y_test_out_1)*100),2)
print ('Test model accuracy:', out_1_rtr_test_acc,'%')

#### With outlier strategy 1 modeling, Random forest Regression yields an accuracy of 86.5%

In [None]:
# append to accuracy summary table
row_add=['Out 1 model','Random Forest Regression',out_1_rtr_train_acc,out_1_rtr_test_acc]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=acc_df.columns)  # keep the column names
acc_df

# KNN Regression - Outlier Strategy 1 model

In [None]:
from sklearn import neighbors
from sklearn.metrics import mean_squared_error 
from math import sqrt

In [None]:
rmse_val = [] #to store rmse values for different k
for K in range(1, 21):
    model = neighbors.KNeighborsRegressor(n_neighbors = K)

    model.fit(X_train_out_1, y_train_out_1)  #fit the model
    pred=model.predict(X_test_out_1) #make prediction on test set
    error = sqrt(mean_squared_error(y_test_out_1,pred)) #calculate rmse
    rmse_val.append(error) #store rmse values
    print('RMSE value for k= ' , K , 'is:', error)

In [None]:
#plotting the rmse values against k values
curve = pd.DataFrame(rmse_val, index=range(1, len(rmse_val)+1)) #elbow curve, x-axis shows the actual k
curve.plot()

The knee appears at a k value of 7

In [None]:
out_1_knn_model = neighbors.KNeighborsRegressor(n_neighbors = 7)
out_1_knn_model.fit(X_train_out_1,y_train_out_1)

In [None]:
out_1_knn_train_acc=round((out_1_knn_model.score(X_train_out_1,y_train_out_1)*100),2)
print ('Train model accuracy:' ,out_1_knn_train_acc,'%')

In [None]:
out_1_knn_test_acc= round((out_1_knn_model.score(X_test_out_1,y_test_out_1)*100),2)
print ('Test model accuracy:', out_1_knn_test_acc,'%')

#### With outlier 1 modeling, kNN Regression yields an accuracy of 75.5%

In [None]:
# append to accuracy summary table
row_add=['Out 1 model','kNN Regression',out_1_knn_train_acc,out_1_knn_test_acc]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=acc_df.columns)  # keep the column names
acc_df

## Hyperparameter tuning

For hyperparameter tuning we use the same 70:30 train/test split with Random Forest Regression, since it has the best accuracy so far.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

from pprint import pprint

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 1, stop = 100, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

pprint(random_grid)

In [None]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor(random_state = 42)
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                              n_iter = 100, scoring='neg_mean_absolute_error', 
                              cv = 3, verbose=2, random_state=42, n_jobs=-1,
                              return_train_score=True)

# Fit the random search model
rf_random.fit(X_train_out_1, y_train_out_1);

In [None]:
rf_random.best_params_
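
Instead of retyping the best parameters into a new model, the search object can be used directly: with the default `refit=True`, `RandomizedSearchCV` refits the winning configuration on the full training data and exposes it as `best_estimator_`. The sketch below shows this on a tiny synthetic problem. The data and the reduced parameter grid are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Tiny synthetic regression problem, only to keep the sketch fast
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=120)

grid = {"n_estimators": [10, 20], "max_depth": [3, None], "min_samples_leaf": [1, 2]}
search = RandomizedSearchCV(RandomForestRegressor(random_state=42), grid,
                            n_iter=4, cv=3, random_state=42)
search.fit(X, y)

# best_estimator_ is already fitted with the parameters in best_params_
tuned = search.best_estimator_
same = all(tuned.get_params()[k] == v for k, v in search.best_params_.items())
print(search.best_params_, same)
```

In this notebook the equivalent would be `rf_random.best_estimator_`, which avoids transcription errors when copying parameter values by hand.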

In [None]:
randomcv_rtr_model=RandomForestRegressor(n_estimators=100,min_samples_split=2,min_samples_leaf=2,max_features='auto',
                                         max_depth=90,random_state=100,bootstrap=True)
# hyperparameters entered manually from the random search output above
randomcv_rtr_model.fit(X_train_out_1,y_train_out_1)

In [None]:
randomcv_rtr_train_acc=round((randomcv_rtr_model.score(X_train_out_1,y_train_out_1)*100),2)
print ('Train model accuracy:' ,randomcv_rtr_train_acc,'%')

In [None]:
randomcv_rtr_test_acc= round((randomcv_rtr_model.score(X_test_out_1,y_test_out_1)*100),2)
print ('Test model accuracy:', randomcv_rtr_test_acc,'%')

#### With random search CV modeling, Random forest Regression yields an accuracy of 88%

In [None]:
# append to accuracy summary table
row_add=['Out 1 model','Randomsearch CV Forest Regression',randomcv_rtr_train_acc,randomcv_rtr_test_acc]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=acc_df.columns)  # keep the column names
acc_df