# Washington House Price Prediction  -- Capstone Project

## Table of Contents

###    1. Introduction
###    2. Exploratory Data Analysis
###    3. Data Processing
###    4. Model Building
###    5. Model Comparison
###    6. Conclusion

# 1. Introduction
## Problem Statement
A house's value depends on more than location and square footage. Like the traits that make up a person, many features contribute to a house's value, and an informed buyer or seller wants to know all of them. For example, someone selling a house needs an asking price that is neither too low nor too high. The usual approach is to find similar properties in the neighbourhood and estimate the price from the collected data.

#### Data Dictionary

    1. cid: a notation for a house (house id)
    2. dayhours: Date house was sold (date-month-year of sale)
    3. price: Price is prediction target (Target variable)
    4. room_bed: Number of Bedrooms/House (number of bedrooms per house)
    5. room_bath: Number of bathrooms/bedrooms (number of bathrooms per bedrooms)
    6. living_measure: square footage of the home
    7. lot_measure: square footage of the lot
    8. ceil: Total floors (levels) in house (how many floors, ground, 1st floor, 2nd floor ..)
    9. coast: House which has a view to a waterfront (house with or without coastview/seafacing)
    10. sight: Has been viewed (sight has been viewed before buying this house)
    11. condition: How good the condition is (Overall)
    12. quality: grade given to the housing unit, based on grading system
    13. ceil_measure: square footage of house apart from basement
    14. basement_measure: square footage of the basement
    15. yr_built: Built Year (year the house was built)
    16. yr_renovated: Year when house was renovated (year the house was renovated)
    17. zipcode: zip code of the area 
    18. lat: Latitude coordinate
    19. long: Longitude coordinate
    20. living_measure15: Living room area in 2015 (implies some renovations). This might or might not have affected the lot size area
    21. lot_measure15: Lot size area in 2015 (implies some renovations)
    22. furnished: Based on the quality of room
    23. total_area: Measure of both living and lot

## Initial data loading

#### Import Initial Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
import warnings
warnings.filterwarnings("ignore")

In [None]:
## import important libraries for model building

from scipy.stats import norm
from scipy import stats
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesClassifier

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor

from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from math import sqrt

#### Read Dataset

In [None]:
#read dataset
dataset_raw = pd.read_excel("innercity.xlsx")

In [None]:
#display head of the complete dataset (with all columns)
pd.set_option("display.max_columns", None)
dataset_raw.head()

In [None]:
print('These are the variable columns in the dataset: \n\n', dataset_raw.columns )

In [None]:
# number of columns and rows
print('The number of rows (observations) is',dataset_raw.shape[0],'\n''The number of columns (variables) is',dataset_raw.shape[1])

In [None]:
dataset_raw.info()

Note: 

1. There are 23 variables: 12 float, 4 integer and 7 object datatypes.
2. There are a few time-related columns: dayhours (the day the house was sold), yr_built (the year the house was built) and yr_renovated (the year the house was renovated). These variables are not in a datetime format, which we will change if required.
3. Object datatypes can also be converted to float to see their correlation with other variables.

In [None]:
## statistical summary of the raw dataset
dataset_raw.describe().round(2).T

### Dealing with location data

#### Using zipcodes to track city, county, state and population of the area.

In [None]:
us_zipcodes = pd.read_excel("USA_Zipcodes.xlsx")
us_zipcodes.head(3)

In [None]:
dataset_new = pd.merge(dataset_raw,us_zipcodes, how='left',left_on=['zipcode'],right_on=['zip']).dropna()
dataset = dataset_new
dataset.head(3)
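
The left join followed by `dropna()` silently removes any house whose zipcode has no match in the lookup table. A minimal sketch of how this could be checked before dropping rows, using made-up toy frames (`houses`, `zip_lookup` and their values are assumptions for illustration, not the project data):

```python
import pandas as pd

# Toy stand-ins for the house data and the zipcode lookup (values are made up).
houses = pd.DataFrame({"cid": [1, 2, 3], "zipcode": [98001, 98002, 99999]})
zip_lookup = pd.DataFrame({"zip": [98001, 98002], "city": ["city_a", "city_b"]})

# indicator=True adds a "_merge" column saying where each row came from.
merged = houses.merge(
    zip_lookup, how="left", left_on="zipcode", right_on="zip", indicator=True
)

# Rows marked "left_only" found no match in the lookup and would be lost by dropna().
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched), "of", len(houses), "rows have no zipcode match")
```

If the unmatched count is large, dropping those rows may bias the data towards well-covered areas.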

In [None]:
cols_unique = ('city', 'state_id', 'state_name', 'county_name', 'county_names_all')
for i in cols_unique:
    print('unique items in', i , ': \n', dataset[i].unique(), '\n')

In [None]:
# let's drop the unnecessary columns that came from us_zipcodes,
# drop the repeated lat/lng columns, and rename lat_x and long to latitude and longitude

dataset.drop(['lat_y', 'lng', 'state_id', 'state_name', 'county_name', 'county_names_all', 'zip'], axis=1, inplace=True)
dataset.rename(columns={'lat_x':'latitude', 'long':'longitude'}, inplace=True)
dataset.head()

## Basic Data Cleaning

### Convert datatypes for ease

In [None]:
## converting dayhours into timestamp
dataset['dayhours'] = pd.to_datetime(dataset['dayhours'], format='%Y-%m-%d')

In [None]:
## changing datatype of cid (house id) from int to object
## (np.object was removed from NumPy, so the builtin object is used)
dataset['cid']=dataset['cid'].astype(object)

In [None]:
## making sure zipcode is stored as an integer (int64)
dataset = dataset.astype({'zipcode':'int64'})

In [None]:
## printing total number of unique houses in the dataset
print('The number of unique house ids present in the dataset is', dataset['cid'].nunique(), ', \n as compared to a total of', len(dataset), 'rows in the dataset.')

This shows that some houses were sold more than once.

In [None]:
## list of columns which are object types
floatColumns = dataset.dtypes[dataset.dtypes == object]
print(floatColumns)

Here, total_area and longitude need to be converted into numeric values.

In [None]:
## printing unique values of columns which are object datatype but hold numeric values
## we are ignoring total_area, which is obviously wrongly stored as object

object_unique = ('ceil', 'coast', 'condition')
for i in object_unique:
    print('unique items in', i , ': \n', dataset[i].unique(), '\n')

In [None]:
## list of columns which are numeric types
numericColumns = dataset.dtypes[dataset.dtypes != object]
print(numericColumns)

In [None]:
## printing unique values of columns which are numeric but might really be categorical

## zipcode is an obvious categorical value which is recorded as a number
numeric_float_unique = ('room_bed', 'room_bath', 'sight', 'quality' , 'yr_renovated' , 'furnished')
for i in numeric_float_unique:
    print('unique items in', i , ': \n', dataset[i].unique(), '\n')

In [None]:
## Changing datatype from object to float; errors='coerce' turns unwanted strings into NaN values

change_datatype_cols = ['ceil', 'coast', 'condition', 'yr_built', 'longitude', 'total_area']

for i in change_datatype_cols:
    dataset[i] = pd.to_numeric(dataset[i], errors='coerce')
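
To make the effect of `errors='coerce'` concrete, here is a small sketch on a made-up series (the `"$"` entry is only an assumed example of a stray non-numeric value). It also counts how many new NaN values the conversion introduced, which is worth checking before imputation:

```python
import pandas as pd

# "$" stands in for a stray non-numeric entry in an otherwise numeric column.
raw = pd.Series(["1.5", "2", "$", "3"])

# Values that cannot be parsed become NaN instead of raising an error.
clean = pd.to_numeric(raw, errors="coerce")

# Number of NaN values created by the conversion itself.
n_new_nan = int(clean.isna().sum() - raw.isna().sum())
print(clean.tolist(), "new NaN values:", n_new_nan)
```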

In [None]:
## Converting other date variables from object/int datatype to date datatype
    # dataset['yr_built'] = pd.to_datetime(dataset['yr_built'], errors='coerce', format='%Y')
    # dataset['yr_renovated'] = pd.to_datetime(dataset['yr_renovated'], format='%Y')

In [None]:
## Remove unwanted string characters
   #dataset = dataset.replace("$","")

In [None]:
## checking datatypes after conversion
dataset.dtypes

### Renaming columns for ease of understanding

We rename the confusing column names for ease of understanding.

In [None]:
# creating reference dataframe of the list of columns which are renamed

cols_renaming = {'original_Column':['cid','dayhours','room_bed','room_bath','ceil', 'coast', 'sight', 'quality','living_measure','lot_measure','ceil_measure','basement','lat', 'long', 'living_measure15', 'lot_measure15', 'density'],
        'renamed_Column':['house_id','date','bedroom','ratio_bathroom','total_floors', 'seaface', 'sight_viewed', 'quality_grade','living_area','lot_area','floor_area','basement_area','latitude', 'longitude', 'living_area_2015', 'lot_area_2015', 'population_density']}
  
# Create DataFrame
df_renamedCols = pd.DataFrame(cols_renaming)
df_renamedCols

In [None]:
## renaming all the confusing column names
dataset.rename(columns={'cid':'house_id','dayhours':'date','room_bed':'bedroom','room_bath':'ratio_bathroom','ceil':'total_floors', 'coast':'seaface', 'sight':'sight_viewed', 'quality':'quality_grade','living_measure':'living_area','lot_measure':'lot_area','ceil_measure':'floor_area','basement':'basement_area','lat':'latitude', 'long':'longitude', 'living_measure15':'living_area_2015', 'lot_measure15':'lot_area_2015', 'density':'population_density'}, inplace=True)

## rechecking head after renaming the columns
dataset.head()

In [None]:
## sorting dataset based on date
dataset = dataset.sort_values('date')

In [None]:
## reset the index of the sorted dataset (drop=True discards the old index)
dataset.reset_index(drop=True, inplace=True)

In [None]:
dataset.head()

In [None]:
## checking duplicate rows
dataset.duplicated().sum()

In [None]:
## checking whether some houses were sold more than once (repeated house_id)
dataset.duplicated(subset=['house_id']).sum()

Note: This shows that a few of the houses were sold multiple times.

In [None]:
## house sold multiple times
id_count = dataset['house_id'].value_counts()
id_count[id_count>1]

### Deriving new columns

#### 1. Adding a new column 'prev_sold'
Note: the new column shows how many times a house was sold before the current sale.
1. 0 means the house was not resold, i.e. this is its first sale in the data.
2. 1 means the house was sold once before the current sale.
3. 2 means the house was sold twice before the current sale.

We do this because the price might increase or decrease when a property is resold multiple times.
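
As a small illustration of this counting scheme, the sketch below applies `groupby(...).cumcount()` to a made-up sales table (the ids and dates are assumptions for illustration). Sorting by date first matters, because `cumcount` numbers the rows in the order they appear:

```python
import pandas as pd

# Toy sales log: house 10 is sold three times, houses 20 and 30 once each.
sales = pd.DataFrame({
    "house_id": [10, 20, 10, 30, 10],
    "date": pd.to_datetime(
        ["2014-05-01", "2014-06-01", "2014-09-01", "2015-01-01", "2015-03-01"]
    ),
})

# Sort chronologically so the counter reflects earlier sales only.
sales = sales.sort_values("date").reset_index(drop=True)

# cumcount() numbers the rows within each house_id group starting at 0.
sales["prev_sold"] = sales.groupby("house_id").cumcount()
print(sales)
```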

In [None]:
## let's generate a new column that counts, in date order, how many times each house was sold before
dataset['prev_sold'] = dataset.groupby('house_id').cumcount()

In [None]:
## new dataset showing house resold no of times
dataset.head(3)

In [None]:
dataset[(dataset['prev_sold'] == 2)]

#### 2. from ratio of bathroom per bedroom to number of bathrooms

In [None]:
dataset['bathroom']=dataset['ratio_bathroom']*dataset['bedroom']

In [None]:
## new dataset showing new bathroom column
dataset.head(3)

#### 3. Counting years since the house was built and renovated

We count the years since a house was built and since it was renovated instead of keeping the calendar year. This will help us in further analysis.

In [None]:
## deriving house age from the year the house was built, taking the next year (2016) as the reference year.
## we have not taken 2015 as the reference year, or else many houses built in 2015 would get an age of 0.

dataset['house_age'] = 2016 - dataset['yr_built']

In [None]:
## deriving renovation age

dataset['renovation_yrs'] = np.where(dataset['yr_renovated']!= 0, 2016 - dataset['yr_renovated'], 0)

In [None]:
## deriving renovation status
dataset['renovated_orNot'] = np.where(dataset['renovation_yrs']!= 0, 1, 0)
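
The three derived columns can be traced on a two-row toy frame. This is only a sketch with made-up years; `REFERENCE_YEAR` is a name introduced here for the 2016 constant used above:

```python
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2016  # same reference year as in the cells above

# Row 0: built 1990, renovated 2010. Row 1: built 2015, never renovated (0).
toy = pd.DataFrame({"yr_built": [1990, 2015], "yr_renovated": [2010, 0]})

toy["house_age"] = REFERENCE_YEAR - toy["yr_built"]

# A yr_renovated of 0 means "never renovated", so the renovation age stays 0.
toy["renovation_yrs"] = np.where(
    toy["yr_renovated"] != 0, REFERENCE_YEAR - toy["yr_renovated"], 0
)
toy["renovated_orNot"] = np.where(toy["renovation_yrs"] != 0, 1, 0)
print(toy)
```

One consequence of this encoding is that a renovation age of 0 cannot be told apart from "never renovated", which is why the separate flag column is useful.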

#### 4. Month house was sold

In [None]:
## extracting the month of sale (as a two-digit string) from the date column

dataset['sold_month'] = pd.to_datetime(dataset['date']).dt.strftime('%m')
dataset['sold_month'].head()

In [None]:
## changing datatype of sold_month from string to integer
dataset = dataset.astype({'sold_month':'int64'})

In [None]:
dataset.head(3)

#### 5. deal with areas

In [None]:
areas=dataset[['living_area','lot_area', 'floor_area', 'basement_area', 'total_area', 'living_area_2015', 'lot_area_2015', 'renovated_orNot']]
areas.head()

##### Note:
Here, we can observe that:
1. total_area = living_area + lot_area
2. living_area = floor_area + basement_area
3. Even for houses that were not renovated, the 2015 living area and lot area differ from the original values.
4. Some houses may not have a basement, so we can derive a new column, basement_orNot.
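
The first two observations can be checked numerically rather than by eye. A minimal sketch on a made-up frame that uses the same column names (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy rows that follow the two identities stated above.
areas_toy = pd.DataFrame({
    "living_area": [2000.0, 1500.0],
    "lot_area": [5000.0, 4000.0],
    "floor_area": [1500.0, 1500.0],
    "basement_area": [500.0, 0.0],
    "total_area": [7000.0, 5500.0],
})

# np.isclose tolerates tiny floating point differences.
total_ok = np.isclose(
    areas_toy["total_area"], areas_toy["living_area"] + areas_toy["lot_area"]
)
living_ok = np.isclose(
    areas_toy["living_area"], areas_toy["floor_area"] + areas_toy["basement_area"]
)

# Share of rows where each identity holds (1.0 means every row).
print("total_area identity:", total_ok.mean())
print("living_area identity:", living_ok.mean())
```

On the real data a share below 1.0 would point to rows where a coerced NaN or a data-entry error breaks the relationship.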

#### 6. column mentioning Basement availability

In [None]:
dataset["basement_orNot"]=dataset["basement_area"].apply( lambda x:1 if x>0 else 0)
dataset.head(3)

### Eliminating unnecessary columns

Note: We are removing columns we no longer need: house_id, ratio_bathroom, yr_built, yr_renovated and date.
We have already extracted the meaningful values from these columns into new columns.

In [None]:
dataset = dataset.drop(['house_id', 'ratio_bathroom', 'yr_built', 'yr_renovated'], axis = 1)

In [None]:
dataset = dataset.drop(['date'], axis = 1)

#### Statistical Summary of the new dataset

In [None]:
#statistical summary
dataset.describe().round(2).T

In [None]:
dataset.info()

### Checking missing values in dataset

Check whether changing the datatypes replaced the unwanted special characters with NaN.

In [None]:
## missing values before datatype change and string manipulation
dataset_raw.isna().sum()

In [None]:
#for reference for renamed columns
df_renamedCols

In [None]:
## missing values after datatype change and string manipulation

dataset.isna().sum()

Note: There are a few missing values across the columns, but the number is small, so they can be either imputed or removed.
We will impute them before modeling.

# 2. Exploratory Data Analysis
Let's do some visual data analysis of the features

## Uni-variate Analysis

In [None]:
dataset.shape

In [None]:
dataset.columns

In [None]:
features = ['bedroom', 'bathroom', 'total_floors', 'seaface', 'sight_viewed', 'condition', 'quality_grade',
       'furnished', 'prev_sold', 'house_age', 'renovation_yrs', 'city']
list(enumerate(features))

In [None]:
#count plot

plt.figure(figsize = (30, 50))
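for idx, feat in enumerate(features):
    plt.subplot(7, 2, idx + 1)
    # pass the column by keyword; recent seaborn versions reject it as a positional argument
    sns.countplot(x = feat, data = dataset)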
    plt.xticks(rotation = 90)

In [None]:
plt.figure(figsize = (20, 5))
sns.countplot(data = dataset, x= 'sold_month');

Note: The most houses were sold in April 2015 and in June and July 2014.
The fewest were sold in May 2015.

## Bi-Variate Analysis

In [None]:
#correlation heatmap (numeric columns only, since 'city' is text)

corr = dataset.corr(numeric_only=True)
plt.figure(figsize = (20,15))
sns.heatmap(corr, annot=True, mask=np.triu(corr), cmap ='coolwarm', vmin = -1, vmax= 1);

Note: One variable from each of the following correlated pairs can be removed to avoid multi-collinearity.
1. total_area is totally correlated with lot_area.
2. floor_area is highly correlated with living_area.
3. quality_grade is correlated with furnished.
4. renovated_orNot is correlated with renovation_yrs.
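
Pairs like these can also be listed programmatically instead of being read off the heatmap. The sketch below is one possible way, run on synthetic data; the column names and the 0.8 threshold are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# "a_copy" is a linear function of "a", so the two are perfectly correlated.
toy = pd.DataFrame({"a": a, "a_copy": a * 2 + 1, "noise": rng.normal(size=200)})

corr = toy.corr().abs()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (row, column) pairs and keep those above the chosen threshold.
pairs = upper.stack()
high_pairs = pairs[pairs > 0.8]
print(high_pairs)
```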

### Analyzing Bivariate for Feature

#### 1. Months vs Price
Variation in Price over the period of Months

In [None]:
## In which month the price was higher or lower
## (factorplot was renamed to catplot and its size argument to height)
sns.catplot(x='sold_month', y='price', data=dataset, kind='point', height=8, aspect=2);

Note: Prices were lowest in February 2015, whereas they were highest in April.

In [None]:
dataset.columns

In [None]:
plotsizeX=8 
plotsizeY=5

In [None]:
plt.figure(figsize=(plotsizeX, plotsizeY))
sns.scatterplot(x=dataset['living_area'],y=dataset['price'],hue=dataset['sight_viewed'],palette='Paired',legend='full');

In [None]:
plt.figure(figsize=(plotsizeX, plotsizeY))
sns.scatterplot(x=dataset['living_area'],y=dataset['price'],hue=dataset['condition'],palette='Paired',legend='full');

In [None]:
plt.figure(figsize=(plotsizeX, plotsizeY))
sns.scatterplot(x=dataset['living_area'],y=dataset['price'],hue=dataset['quality_grade'],palette='Paired',legend='full');

In [None]:
plt.figure(figsize=(plotsizeX, plotsizeY))
sns.scatterplot(x=dataset['living_area'],y=dataset['price'],hue=dataset['furnished'],palette='Paired',legend='full');

In [None]:
sns.relplot(data=dataset, x='total_floors',  y='price', hue='seaface', kind='line');  

Note: Properties with a sea-facing front are costlier than houses without one. The plot also shows how much costlier sea-facing houses become as the number of floors increases.

In [None]:
sns.relplot(data=dataset, x='total_floors',  y='price', hue='furnished', kind='line');

Note: This shows that furnished houses cost more, and that once the number of floors goes beyond 3 the price of furnished houses shoots up.

In [None]:
import missingno as msn
import folium
from folium import plugins
import branca.colormap as cm


m = folium.Map([47 ,-122], zoom_start=5,width="100%",height="100%")
locations = list(zip(dataset.latitude, dataset.longitude))
cluster = plugins.MarkerCluster(locations=locations,popups=dataset["price"].tolist())
m.add_child(cluster)
m

# 3. Data Processing
Data processing before Modeling

### Imputing the missing values

In [None]:
dataset.isnull().sum()

In [None]:
dataset.isnull().sum().sum()

In [None]:
missing_col = ['total_floors', 'seaface', 'longitude', 'total_area']
 
# treating missing values by imputing each column's median
for i in missing_col:
    dataset.loc[dataset.loc[:,i].isnull(),i]=dataset.loc[:,i].median()

print("count of NULL values after imputation\n")
dataset.isnull().sum()
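
The same median imputation can be written without an explicit loop. A minimal sketch on a made-up frame (the values are assumptions), using `fillna` with the per-column medians:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "total_floors": [1.0, 2.0, np.nan, 3.0],
    "seaface": [0.0, np.nan, 0.0, 1.0],
})

# median() skips NaN by default and returns one value per column.
medians = toy.median()

# fillna with a Series fills each column with its own median.
filled = toy.fillna(medians)
print(filled)
```

For a binary column such as `seaface`, the median acts like the most common value, which is usually the intended fill.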

## Basic Linear Regression Modeling

In [None]:
df=dataset
df.head(3)

In [None]:
#for basic modeling we are dropping city
df=df.drop('city', axis=1)

# changing data type of zipcode from int to object
df=df.astype({'zipcode':'object'})
df.dtypes

In [None]:
df.head(2)

In [None]:
## outlier treatment

def remove_outlier(col):
    # returns the lower and upper 1.5 * IQR fences of a numeric column
    Q1,Q3=np.percentile(col,[25,75])
    IQR=Q3-Q1
    lower_range= Q1-(1.5 * IQR)
    upper_range= Q3+(1.5 * IQR)
    return lower_range, upper_range

In [None]:
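## cap every column except zipcode (including price) at its 25th and 75th percentiles
## note: this is stricter than the 1.5 * IQR fences from remove_outlier, which is not called here
out_df=df.drop('zipcode', axis=1).columns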

for i in out_df:
    lr = df[i].quantile(0.25)
    ur = df[i].quantile(0.75)
    df[i] = np.where(df[i] <lr, lr,df[i])
    df[i] = np.where(df[i] >ur, ur,df[i])
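
The loop above caps values at the quartiles themselves, so every value below Q1 or above Q3 is changed. For contrast, the sketch below caps only at the 1.5 * IQR fences, the quantity the `remove_outlier` helper computes. It runs on a made-up series, and `iqr_fences` is a name introduced here for illustration; this is not a claim about which treatment the reported results used:

```python
import numpy as np
import pandas as pd

def iqr_fences(col):
    """Return the lower and upper 1.5 * IQR fences of a numeric column."""
    q1, q3 = np.percentile(col, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Eight ordinary values and one extreme value.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

lower, upper = iqr_fences(values)

# clip() leaves values inside the fences untouched and caps the rest.
capped = values.clip(lower=lower, upper=upper)
print(lower, upper, capped.tolist())
```

Here Q1 is 3 and Q3 is 7, so the fences are -3 and 13: only the value 100 is capped, whereas quartile capping would also alter 1, 2 and 8.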

#### Feature Selection

In [None]:
df.head(3)

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(numeric_only=True), cmap="RdBu")
plt.title("Correlations Between Variables", size=15)
plt.show()

In [None]:
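# compute the correlations with price once and keep columns with |correlation| > 0.1
price_corr = df.corr(numeric_only=True)["price"]
important_num_cols = list(price_corr[(price_corr > 0.1) | (price_corr < -0.1)].index)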
cat_cols = ['zipcode']
important_cols = important_num_cols + cat_cols

df2 = df[important_cols]
df2.head(3)

In [None]:
df.shape

In [None]:
df2.shape

In [None]:
# changing data type of zipcode from obj to int64
df2=df2.astype({'zipcode':'int64'})
df2.head(2)

In [None]:
#categorical coding of zipcode

#df['zipcode'] = df_phase1['zipcode'].astype("category").cat.codes
#df.head(2)

In [None]:
## normalising the price distribution (log price)
#df2['price_log'] = np.log(df2.price+1)
#plt.figure(figsize=(7,5))
#sns.distplot(df2['price_log'], fit=norm)
#plt.title("Log-Price Distribution Plot",size=15, weight='bold')

In [None]:
#df3=df2.drop('price', axis=1)

#### X, y Split

In [None]:
#X = df3.drop("price_log", axis=1)
#y = df3["price_log"]

In [None]:
X = df2.drop("price", axis=1)
y = df2["price"]

In [None]:
## standarizing the data
X.columns

In [None]:
## scaling the data
#important_num_cols=['bedroom', 'living_area', 'total_floors', 'quality_grade', 'floor_area',
#       'basement_area', 'latitude', 'living_area_2015', 'total_area', 'city',
#       'population', 'population_density', 'bathroom', 'basement_orNot','zipcode']

#scaler = StandardScaler()
#X[important_num_cols] = scaler.fit_transform(X[important_num_cols])

In [None]:
X.head(2)

In [None]:
## Train-Test Split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### Linear Regression using statsmodels

In [None]:
# concatenate X and y into a single dataframe
data_train = pd.concat([X_train, y_train], axis=1)
data_test=pd.concat([X_test,y_test],axis=1)
data_train.head(3)

In [None]:
data_train.columns

In [None]:
expr= 'price ~ bedroom + living_area + total_floors + quality_grade + floor_area + basement_area + latitude + living_area_2015 + total_area  + population + population_density + bathroom + basement_orNot + zipcode'

In [None]:
import statsmodels.formula.api as smf
lm1 = smf.ols(formula= expr, data = data_train).fit()
lm1.params

In [None]:
print(lm1.summary())

In [None]:
# Calculate MSE
mse = np.mean((lm1.predict(data_train.drop('price',axis=1))-data_train['price'])**2)

#Root Mean Squared Error - RMSE
np.sqrt(mse)

In [None]:
# Prediction on Test data
y_pred = lm1.predict(data_test)

In [None]:
#plt.scatter(y_test['price'], y_pred)

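# set the style before creating the figure so that it applies to this plot
plt.style.use("fivethirtyeight")
plt.figure(figsize=(6,4))
# plot by position (.values) because the test rows keep their shuffled index
plt.plot(y_test.values,"blue", label="Actual")
plt.plot(y_pred.values,"red", label="Predicted")
plt.title("Actual Vs Predicted Price")
plt.xlabel ("Data points")
plt.ylabel ("Price")
plt.legend()
plt.grid(True, color ="k")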

In [None]:
for i,j in np.array(lm1.params.reset_index()):
    print('({}) * {} +'.format(round(j,2),i),end=' ')

In [None]:
pred_df=df

In [None]:
pred_df['predicted_price']= (-13035181.5) + (3758.52) * df.bedroom + (74.83) * df.living_area + (-1484.73) * df.total_floors + (51286.05) * df.quality_grade + (44.3) * df.floor_area + (3.95) * df.basement_area + (592247.9) * df.latitude + (61.13) * df.living_area_2015 + (0.75) * df.total_area + (-2.66) * df.population + (37.17) * df.population_density + (-2866.85) * df.bathroom + (19175.76) * df.basement_orNot + (-156.7) * df.zipcode

In [None]:
pred_df['%diff']=(pred_df.predicted_price - pred_df.price)*100/pred_df.price

In [None]:
pred_df[['price','predicted_price', '%diff']]

#### --------- Let's move on to advanced modeling

## Advanced Modeling

### Creating datasets for two-phase modelling
#### (Phase.1 without any treatment of outliers and multicollinearity, and Phase.2 with treatment of outliers and multicollinearity)

In [None]:
df_phase1=dataset ## (modeling without dealing with multicollinearity or outliers )

In [None]:
df_phase2=dataset ## (modeling with less features without signs of multi-collinearity)
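
One thing to keep in mind with the two assignments above: in pandas, plain assignment binds a second name to the same DataFrame, so a column changed through one name changes it for all of them. The sketch below shows the difference between an alias and `.copy()` on a made-up frame (names and values are assumptions for illustration):

```python
import pandas as pd

base = pd.DataFrame({"city": ["x", "y"]})

alias = base               # second name for the same object
independent = base.copy()  # separate object with its own data

# Encoding the column through the alias also changes `base`.
alias["city"] = alias["city"].astype("category").cat.codes

print(base["city"].tolist())         # encoded, because alias and base are one object
print(independent["city"].tolist())  # still the original strings
```

Whether the two phases should share the encoded `city` column is a design decision; `.copy()` is the way to keep them independent if that is wanted.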

## Phase.1 Modeling without treatment of outliers or multicollinearity

### Categorical encoding

In [None]:
#categorical coding of city
df_phase1['city'] = df_phase1['city'].astype("category").cat.codes

In [None]:
df_phase1.head()

### X, y Split
Splitting the data into X and y chunks

In [None]:
X = df_phase1.drop("price", axis=1)
y = df_phase1["price"]

### Train-Test Split
Splitting the data into Train and Test set

In [None]:
## splitting the data into 80:20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Machine Learning Models

In [None]:
## Defining several evaluation functions for convenience

def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)).mean()
    return rmse
    

def evaluation(y, predictions):
    mae = mean_absolute_error(y, predictions)
    mse = mean_squared_error(y, predictions)
    rmse = np.sqrt(mean_squared_error(y, predictions))
    r_squared = r2_score(y, predictions)
    return mae, mse, rmse, r_squared

In [None]:
models = pd.DataFrame(columns=["Model","MAE","MSE","RMSE","R2 Score","RMSE (Cross-Validation)"])

#### Model.1 Linear Regression

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
predictions = lin_reg.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(lin_reg)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "LinearRegression","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models = pd.concat([models, pd.DataFrame([new_row])], ignore_index=True)

#### Model.2 Ridge Regression

In [None]:
ridge = Ridge()
ridge.fit(X_train, y_train)
predictions = ridge.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(ridge)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "Ridge","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models = pd.concat([models, pd.DataFrame([new_row])], ignore_index=True)

#### Model.3 Lasso Regression

In [None]:
lasso = Lasso()
lasso.fit(X_train, y_train)
predictions = lasso.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(lasso)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "Lasso","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models = pd.concat([models, pd.DataFrame([new_row])], ignore_index=True)

#### Model.4 Elastic Net Regression

In [None]:
elastic_net = ElasticNet()
elastic_net.fit(X_train, y_train)
predictions = elastic_net.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(elastic_net)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "ElasticNet","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models = pd.concat([models, pd.DataFrame([new_row])], ignore_index=True)

#### Model.5 Support Vector Machines

In [None]:
svr = SVR(C=100000)
svr.fit(X_train, y_train)
predictions = svr.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(svr)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "SVR","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models = pd.concat([models, pd.DataFrame([new_row])], ignore_index=True)

#### Model.6 Random Forest Regressor

In [None]:
random_forest = RandomForestRegressor(n_estimators=100)
random_forest.fit(X_train, y_train)
predictions = random_forest.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(random_forest)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "RandomForestRegressor","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models = pd.concat([models, pd.DataFrame([new_row])], ignore_index=True)

#### Model.7 XGBoost Regressor

In [None]:
xgb = XGBRegressor(n_estimators=1000, learning_rate=0.01)
xgb.fit(X_train, y_train)
predictions = xgb.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(xgb)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "XGBRegressor","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models = pd.concat([models, pd.DataFrame([new_row])], ignore_index=True)

#### Model.8 Polynomial Regression (Degree=2)

In [None]:
from sklearn.pipeline import make_pipeline

# wrap the degree-2 expansion and the linear model in one pipeline so that
# the cross-validation in rmse_cv is also run on the polynomial features
poly_reg = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_reg.fit(X_train, y_train)
predictions = poly_reg.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(poly_reg)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "Polynomial Regression (degree=2)","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models = pd.concat([models, pd.DataFrame([new_row])], ignore_index=True)

### Model Comparison
1. Mean Absolute Error (MAE) is the average absolute difference between predictions and actual values.
2. Mean Squared Error (MSE) is the average squared difference between predictions and actual values, so large errors are penalised more heavily.
3. Root Mean Squared Error (RMSE) is the square root of MSE, expressed in the same units as price.
4. R^2 measures goodness of fit: the share of the variance in price that the model explains.
5. RMSE (cross-validation) is the RMSE averaged over the 5 cross-validation folds. The lower the RMSE, the better the model.
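
To make the definitions above concrete, the sketch below computes the four metrics by hand with NumPy on four made-up prices (all numbers are assumptions for illustration). The values should agree with what `mean_absolute_error`, `mean_squared_error` and `r2_score` return for the same inputs:

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 360.0])

errors = y_hat - y_true            # [10, -10, 30, -40]

mae = np.mean(np.abs(errors))      # average absolute error
mse = np.mean(errors ** 2)         # average squared error
rmse = np.sqrt(mse)                # back in the units of the target

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R2:", r2)
```

The single error of 40 contributes 1600 of the 2700 total squared error, which illustrates why MSE and RMSE react more strongly to large misses than MAE does.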

In [None]:
models.sort_values(by="RMSE (Cross-Validation)")

In [None]:
plt.figure(figsize=(15,8))
sns.barplot(x=models["Model"], y=models["RMSE (Cross-Validation)"])
plt.title("Models' RMSE Scores (Cross-Validated)", size=15)
plt.xticks(rotation=90, size=12)
plt.show()

#### Here we can see that the XGB Regressor gives us the best model of all.

#### ------------- Moving on to Phase 2.

## Phase.2 Modeling after treating the outliers and multicollinearity

### *Data Preparation before Modeling*
### 1. Outliers Treatment

In [None]:
df_phase2.columns

In [None]:
## let's boxplot all the numerical columns and see if there any outliers

data_plot=df_phase2[['price', 'bedroom', 'living_area', 'lot_area', 'total_floors',
       'seaface', 'sight_viewed', 'condition', 'quality_grade', 'floor_area',
       'basement_area', 'zipcode', 'latitude', 'longitude', 'living_area_2015',
       'lot_area_2015', 'furnished', 'total_area', 'population',
       'population_density', 'prev_sold', 'bathroom', 'house_age',
       'renovation_yrs', 'renovated_orNot', 'sold_month', 'basement_orNot']]

fig=plt.figure(figsize=(20,15));
for i in range(0,len(data_plot.columns)):
    ax=fig.add_subplot(7,4,i+1)
    sns.boxplot(x=data_plot[data_plot.columns[i]], ax=ax)
# adjust the layout once, after all subplots are drawn
plt.tight_layout()

Note: The boxplots show outliers that might affect our results. 
The variables in which outliers might affect our model are the following: 
1. price
2. bedroom
3. bathroom
4. living area
5. lot area
6. floor area
7. basement_area
8. total area
9. living area 2015
10. lot area 2015

Here, we can refer to the correlations between these variables and treat outliers based on them, so that we won't have a significant loss of data.

#### Trimming the extreme outliers without a significant loss of data

In [None]:
## bedroom distribution
print("Skewness", df_phase2.bedroom.skew())
plt.figure(figsize=(10,5))
sns.histplot(df_phase2.bedroom, kde=True)  # distplot is deprecated in seaborn
df_phase2.bedroom.describe()

In [None]:
## trimming extreme outliers where number of bedroom is more than 10
df_phase2=df_phase2[(df_phase2['bedroom']<=10)]
print('loss of data is', (1-(df_phase2.index.size/dataset.index.size))*100, 'percent')

In [None]:
## living area distribution
print("Skewness", df_phase2.living_area.skew())
plt.figure(figsize=(10,5))
sns.histplot(df_phase2.living_area, kde=True)  # distplot is deprecated in seaborn
df_phase2.living_area.describe()

In [None]:
## trimming extreme outliers
df_phase2=df_phase2[(df_phase2['living_area']<=8000)]
print('loss of data is', (1-(df_phase2.index.size/dataset.index.size))*100, 'percent')

In [None]:
## bathroom distribution
print("Skewness", df_phase2.bathroom.skew())
plt.figure(figsize=(10,5))
sns.histplot(df_phase2.bathroom, kde=True)  # distplot is deprecated in seaborn
df_phase2.bathroom.describe()

In [None]:
## trimming extreme outliers
df_phase2=df_phase2[(df_phase2['bathroom']<=30)]
print('loss of data is', (1-(df_phase2.index.size/dataset.index.size))*100, 'percent')

In [None]:
## price distribution
print("Skewness", df_phase2.price.skew())
plt.figure(figsize=(10,5))
sns.histplot(df_phase2.price, kde=True)  # distplot is deprecated in seaborn
df_phase2.price.describe()

In [None]:
## trimming extreme outliers
df_phase2=df_phase2[(df_phase2['price']<=4000000)]
print('loss of data is', (1-(df_phase2.index.size/dataset.index.size))*100, 'percent')

In [None]:
## After trimming lets check the boxplot

## let's boxplot all the numerical columns and see if there any outliers

data_plot=df_phase2[['price', 'bedroom', 'living_area', 'lot_area', 'total_floors',
       'seaface', 'sight_viewed', 'condition', 'quality_grade', 'floor_area',
       'basement_area', 'zipcode', 'latitude', 'longitude', 'living_area_2015',
       'lot_area_2015', 'furnished', 'total_area', 'population',
       'population_density', 'prev_sold', 'bathroom', 'house_age',
       'renovation_yrs', 'renovated_orNot', 'sold_month', 'basement_orNot']]

fig=plt.figure(figsize=(20,15));
for i in range(0,len(data_plot.columns)):
    ax=fig.add_subplot(7,4,i+1)
    sns.boxplot(data_plot[data_plot.columns[i]])
    plt.tight_layout()

#### Treating outliers only in the main variables

Highly correlated variables
1. total_area is perfectly correlated with lot_area (correlation of 1).
2. floor_area is highly correlated with living_area.
3. quality_grade is correlated with furnished.
4. renovated_orNot is correlated with renovation_yrs.

In [None]:
## drop total_area, which is perfectly correlated with lot_area
df_phase2.drop('total_area', axis=1, inplace=True)

In [None]:
## work on a copy so that the capping below does not modify df_phase2
df_treated = df_phase2.copy()

In [None]:
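## columns in which the remaining outliers are capped
out_data=df_treated[['bedroom','bathroom', 'living_area', 'lot_area', 'floor_area', 'basement_area', 'longitude','living_area_2015','lot_area_2015']]

## cap each column at its 25th and 75th percentiles:
## values below the lower bound or above the upper bound are replaced by that bound
for i in out_data:
    lr = df_treated[i].quantile(0.25)   # lower bound
    ur = df_treated[i].quantile(0.75)   # upper bound
    df_treated[i] = np.where(df_treated[i] <lr, lr,df_treated[i])
    df_treated[i] = np.where(df_treated[i] >ur, ur,df_treated[i])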
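
The loop above caps values at the 25th and 75th percentiles themselves. A more common convention caps at the Tukey fences, Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, so that only points beyond the box-plot whiskers are changed. The sketch below shows that convention on made-up data. The helper name `cap_at_tukey_fences` and the factor `k=1.5` are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def cap_at_tukey_fences(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy data: eight ordinary values and one extreme value
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)
capped = cap_at_tukey_fences(toy)
print(capped.tolist())
```

With this variant, the ordinary values stay as they are and only the extreme value is pulled back to the upper fence.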

In [None]:
out_plot=df_treated[['bedroom', 'living_area', 'lot_area', 'total_floors',
       'seaface', 'sight_viewed', 'condition', 'quality_grade', 'floor_area',
       'basement_area', 'zipcode', 'latitude', 'longitude', 'living_area_2015',
       'lot_area_2015', 'furnished', 'population',
       'population_density', 'prev_sold', 'bathroom', 'house_age',
       'renovation_yrs', 'renovated_orNot', 'sold_month', 'basement_orNot']]
fig=plt.figure(figsize=(20,15))
for i in range(0,len(out_plot.columns)):
    ax=fig.add_subplot(7,4,i+1)
    sns.boxplot(out_plot[out_plot.columns[i]], whis=3)
    plt.tight_layout()
print('Shape after Outliers Treatment',df_treated.shape)

In [None]:
## percentage of data lost after outlier treatment
print('loss of data is', (1-(df_treated.index.size/dataset.index.size))*100, 'percent.')

In [None]:
df_treated.head()

### Standardization of data

### Normalising the price distribution (if required)

In [None]:
## checking price distribution

plt.figure(figsize=(10,10))
sns.distplot(df_treated['price'], fit=norm)
plt.title("Price Distribution Plot",size=15, weight='bold')

In [None]:
## price distribution
print("Skewness", df_treated.price.skew())
df_treated.price.describe()

Note: The distribution plot above shows that price is right-skewed, that is, it has positive skewness. A log transformation will be used to make this feature less skewed, which makes interpretation easier and statistical analysis better behaved.

Since the logarithm of zero is undefined, a log(1 + x) transformation is the safer choice.

In [None]:
df_treated['price_log'] = np.log(df_treated.price+1)
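
If a model were trained on `price_log`, its predictions would be on the log scale and would need to be converted back before computing errors in currency units. The sketch below uses made-up prices to show the transformation and its inverse with NumPy's `log1p` and `expm1`. The array values are illustrative only.

```python
import numpy as np

prices = np.array([0.0, 250000.0, 4000000.0])  # made-up prices, including a zero

price_log = np.log1p(prices)     # same as np.log(prices + 1), defined at zero
recovered = np.expm1(price_log)  # inverse transform back to the price scale

print(price_log)
print(recovered)
```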


In [None]:
plt.figure(figsize=(10,10))
sns.distplot(df_treated['price_log'], fit=norm)
plt.title("Log-Price Distribution Plot",size=15, weight='bold')

In [None]:
df_treated.plot.scatter(x='price', y='price_log', figsize=(7,5));

## Dealing with Multi-Collinearity

In [None]:
## features most highly correlated with the house price
## (nlargest(17) returns price, price_log and the top 15 features)
df_treated.corr()["price"].nlargest(17)

These are the features most highly correlated with the price:
    1. quality_grade         
    2. living_area           
    3. furnished             
    4. living_area_2015      
    5. floor_area            
    6. bathroom              
    7. latitude              
    8. sight_viewed          
    9. bedroom               
    10. total_floors          
    11. basement_area         
    12. basement_orNot        
    13. population_density    
    14. seaface               
    15. renovated_orNot

#### Checking collinearity using correlation map

In [None]:
from dython import nominal
nominal.associations(df_treated,figsize=(30,20),mark_columns=True);

In [None]:
#zipcode & city is changed from object datatype to int
#dataset2 = dataset2.astype({'zipcode':'int64', 'city':'int64', 'sold_month':'int64'})

In [None]:
#correlation heatmap to understand the relation between the variables

plt.figure(figsize = (20,15))
sns.heatmap(df_treated.corr(), annot=True,mask=np.triu(df_treated.corr()),cmap ='coolwarm', vmin = -1, vmax= 1);

In [None]:
## checking multi-collinearity using eigen values

corr=df_treated.corr(method='pearson')

# Eigenvalues and eigenvectors of the correlation matrix;
# eigenvalues close to zero indicate multi-collinearity.
multicollinearity, V=np.linalg.eig(corr)
multicollinearity
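
To make the eigenvalue check easier to read, here is a small sketch on synthetic data in which one column is an exact sum of the other two. The column names and the random data are assumptions for illustration. It uses `np.linalg.eigvalsh`, which is intended for symmetric matrices such as a correlation matrix, and summarizes the result as a condition number.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# The third column is redundant: it is the sum of the first two
toy = pd.DataFrame({"a": a, "b": b, "a_plus_b": a + b})

corr = toy.corr(method="pearson")
eigenvalues = np.linalg.eigvalsh(corr.values)  # ascending order

# Ratio of largest to smallest eigenvalue; the floor avoids dividing by ~0
condition_number = eigenvalues.max() / max(eigenvalues.min(), 1e-12)

print(eigenvalues)
print(condition_number)
```

A smallest eigenvalue near zero, and hence a very large condition number, signals that at least one column is close to a linear combination of the others.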

### Multi-collinearity check using Variance Inflation Factor
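There might be redundant variables in the dataset; we use the Variance Inflation Factor (VIF) to identify and eliminate them. The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features, so a large VIF means the feature is largely explained by the rest.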

## Feature Selection

In [None]:
df_treated.columns

In [None]:
Selected_features = df_treated[['price','bedroom', 'living_area', 'lot_area', 'total_floors',
       'seaface', 'sight_viewed', 'condition', 'quality_grade', 'floor_area',
       'basement_area', 'zipcode', 'latitude', 'longitude', 'living_area_2015',
       'lot_area_2015', 'furnished', 'city', 'population',
       'population_density', 'prev_sold', 'bathroom', 'house_age',
       'renovation_yrs', 'renovated_orNot', 'sold_month', 'basement_orNot',
       ]]

from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)

In [None]:
X = Selected_features.drop('price', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

In [None]:
X = X.drop('zipcode', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

In [None]:
X = X.drop('latitude', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

In [None]:
X = X.drop('longitude', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

In [None]:
X = X.drop('living_area', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

In [None]:
X = X.drop('quality_grade', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

In [None]:
X = X.drop('bedroom', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

In [None]:
X = X.drop('lot_area_2015', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

In [None]:
X = X.drop('floor_area', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

In [None]:
X = X.drop('living_area_2015', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

In [None]:
X = X.drop('condition', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

In [None]:
X = X.drop('bathroom', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

In [None]:
X = X.drop('basement_orNot', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

In [None]:
X = X.drop('lot_area', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

In [None]:
X = X.drop('total_floors', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

In [None]:
X = X.drop('population', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

In [None]:
X = X.drop('city', axis = 1)
calc_vif(X).sort_values(by = 'VIF', ascending = False)

Using VIF, we arrived at the following factors, which are not affected by multi-collinearity:

1. house_age
2. population_density
3. sold_month
4. renovated_orNot
5. renovation_yrs
6. basement_area
7. sight_viewed
8. furnished
9. seaface
10. prev_sold
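
The manual sequence of drops above can also be expressed as a loop. The sketch below is one possible version on synthetic data, not the notebook's own procedure: the helper names `vif_table` and `drop_high_vif`, the threshold of 5 and the toy columns are all assumptions. It computes the VIF directly with NumPy and adds an intercept to each auxiliary regression, so the numbers can differ from the `statsmodels` call used above when the design matrix has no constant column.

```python
import numpy as np
import pandas as pd

def vif_table(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    rows = []
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        rows.append((col, np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)))
    table = pd.DataFrame(rows, columns=["variables", "VIF"])
    return table.sort_values("VIF", ascending=False)

def drop_high_vif(X, threshold=5.0):
    """Repeatedly drop the column with the largest VIF until all are <= threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        worst = vif_table(X).iloc[0]
        if worst["VIF"] <= threshold:
            break
        X = X.drop(columns=worst["variables"])
    return X

rng = np.random.default_rng(1)
toy = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
toy["x3"] = toy["x1"] + 0.01 * rng.normal(size=300)  # nearly a copy of x1

kept = drop_high_vif(toy)
print(vif_table(toy))
print(kept.columns.tolist())
```

Automating the loop makes the stopping rule explicit, whereas the manual cells leave the threshold implicit.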

# Model Building

### Model Building based on Feature selection using VIF

In [None]:
df_VIF = df_treated[['price','house_age', 'population_density', 'sold_month', 'renovated_orNot','renovation_yrs','basement_area','sight_viewed','furnished','seaface','prev_sold']]
df_VIF.head(3)

### X, y Split
Splitting the data into X and y chunks

In [None]:
X = df_VIF.drop("price", axis=1)
y = df_VIF["price"]

#### Standardizing the Data

In [None]:
scaler = StandardScaler()
X[X.columns] = scaler.fit_transform(X[X.columns])

In [None]:
## check the standardization
X.head(3)

### Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
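
In the cells above, the scaler is fitted on all rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on made-up data; the array names are assumptions and do not refer to the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(500, 3))  # made-up features
y_demo = rng.normal(size=500)                             # made-up target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics come from the training rows only
X_te_scaled = scaler.transform(X_te)      # test rows reuse those statistics

print(X_tr_scaled.mean(axis=0))
print(X_te_scaled.mean(axis=0))
```

The training columns end up with mean zero, while the test columns are only approximately centred, which is the expected behaviour when the test set is kept out of the fit.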

### Machine Learning Models

In [None]:
models_VIF = pd.DataFrame(columns=["Model","MAE","MSE","RMSE","R2 Score","RMSE (Cross-Validation)"])

#### Linear Regression

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
predictions = lin_reg.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(lin_reg)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "LinearRegression","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_VIF = pd.concat([models_VIF, pd.DataFrame([new_row])], ignore_index=True)

#### Ridge Regression

In [None]:
ridge = Ridge()
ridge.fit(X_train, y_train)
predictions = ridge.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(ridge)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "Ridge","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_VIF = pd.concat([models_VIF, pd.DataFrame([new_row])], ignore_index=True)

#### Lasso Regression

In [None]:
lasso = Lasso()
lasso.fit(X_train, y_train)
predictions = lasso.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(lasso)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "Lasso","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_VIF = pd.concat([models_VIF, pd.DataFrame([new_row])], ignore_index=True)

#### Elastic Net

In [None]:
elastic_net = ElasticNet()
elastic_net.fit(X_train, y_train)
predictions = elastic_net.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(elastic_net)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "ElasticNet","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_VIF = pd.concat([models_VIF, pd.DataFrame([new_row])], ignore_index=True)

#### Support Vector Machines

In [None]:
svr = SVR(C=100000)
svr.fit(X_train, y_train)
predictions = svr.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(svr)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "SVR","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_VIF = pd.concat([models_VIF, pd.DataFrame([new_row])], ignore_index=True)

#### Random Forest Regressor

In [None]:
random_forest = RandomForestRegressor(n_estimators=100)
random_forest.fit(X_train, y_train)
predictions = random_forest.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(random_forest)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "RandomForestRegressor","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_VIF = pd.concat([models_VIF, pd.DataFrame([new_row])], ignore_index=True)

#### XGBoost Regressor

In [None]:
xgb = XGBRegressor(n_estimators=1000, learning_rate=0.01)
xgb.fit(X_train, y_train)
predictions = xgb.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(xgb)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "XGBRegressor","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_VIF = pd.concat([models_VIF, pd.DataFrame([new_row])], ignore_index=True)

#### Polynomial Regression (Degree=2)

In [None]:
poly_reg = PolynomialFeatures(degree=2)
X_train_2d = poly_reg.fit_transform(X_train)
X_test_2d = poly_reg.transform(X_test)

lin_reg = LinearRegression()
lin_reg.fit(X_train_2d, y_train)
predictions = lin_reg.predict(X_test_2d)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(lin_reg)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "Polynomial Regression (degree=2)","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_VIF = pd.concat([models_VIF, pd.DataFrame([new_row])], ignore_index=True)
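
The helper `rmse_cv` is defined earlier in the notebook and receives only the estimator. If it cross-validates on the untransformed `X`, the cross-validated score reported for this model would describe a plain linear fit rather than the degree-2 features. One way to make sure the polynomial expansion is part of every fold is to wrap it in a pipeline. The sketch below shows this on synthetic data; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(300, 2))
# Target with a genuinely quadratic term plus a little noise
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# The pipeline expands the features inside each cross-validation fold
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
neg_mse_poly = cross_val_score(
    poly_model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)
rmse_cv_poly = np.sqrt(-neg_mse_poly).mean()

# Same cross-validation for a plain linear model, for comparison
neg_mse_lin = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)
rmse_cv_linear = np.sqrt(-neg_mse_lin).mean()

print(rmse_cv_poly, rmse_cv_linear)
```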

### Model Comparison_VIF

In [None]:
models_VIF.sort_values(by="RMSE (Cross-Validation)")

In [None]:
plt.figure(figsize=(12,8))
sns.barplot(x=models_VIF["Model"], y=models_VIF["RMSE (Cross-Validation)"])
plt.title("Models' RMSE Scores (Cross-Validated)", size=15)
plt.xticks(rotation=30, size=12)
plt.show()
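
The model cells above repeat the same fit, predict and evaluate steps. As a sketch of how the comparison table could be built in a single loop, the example below runs three of the linear models on synthetic data from `make_regression`. The data, the `candidates` dictionary and the column layout are assumptions chosen to mirror the table above, and the cross-validation column is left out because the notebook's `rmse_cv` helper is defined elsewhere.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    mse = mean_squared_error(y_te, pred)
    rows.append({
        "Model": name,
        "MAE": mean_absolute_error(y_te, pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2 Score": r2_score(y_te, pred),
    })

# One row per model, best RMSE first
comparison = pd.DataFrame(rows).sort_values(by="RMSE")
print(comparison)
```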

# Modeling with one hot encoding

In [None]:
df_phase3 = df_treated
df_phase3.head()

In [None]:
df_phase3.columns

In [None]:
df_phase3.shape

In [None]:
df_p3= df_phase3.drop('price_log', axis=1)

In [None]:
# Getting dummies for the categorical columns zipcode, city and sold_month
df_ohc = pd.get_dummies(df_p3, columns=['zipcode','city','sold_month'],drop_first=False)
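
With `drop_first=False`, the dummy columns of each encoded variable sum to 1 in every row, which duplicates the intercept in a linear model; tree-based models are not affected by this. The sketch below shows the difference on a made-up frame. The prices and zip codes are illustrative values only.

```python
import pandas as pd

toy = pd.DataFrame({
    "price": [300000, 450000, 520000, 610000],
    "zipcode": [98001, 98002, 98001, 98003],
})

# Keep every level: one dummy column per zip code
full = pd.get_dummies(toy, columns=["zipcode"], drop_first=False)

# Drop the first level: it becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=["zipcode"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```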

In [None]:
df_ohc.head()

In [None]:
df_ohc.shape

### Model Building

In [None]:
#Creating X, y for training and testing set
X = df_ohc.drop("price" , axis=1)
y = df_ohc["price"]

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

In [None]:
print(X_train.shape)
print(X_test.shape)

In [None]:
models_ohc = pd.DataFrame(columns=["Model","MAE","MSE","RMSE","R2 Score","RMSE (Cross-Validation)"])

### Linear Regression

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
predictions = lin_reg.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(lin_reg)
print("RMSE Cross-Validation:", rmse_cross_val)

In [None]:
new_row = {"Model": "LinearRegression","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_ohc = pd.concat([models_ohc, pd.DataFrame([new_row])], ignore_index=True)

### Ridge

In [None]:
ridge = Ridge()
ridge.fit(X_train, y_train)
predictions = ridge.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(ridge)
print("RMSE Cross-Validation:", rmse_cross_val)

In [None]:
new_row = {"Model": "Ridge","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_ohc = pd.concat([models_ohc, pd.DataFrame([new_row])], ignore_index=True)

### Lasso

In [None]:
lasso = Lasso()
lasso.fit(X_train, y_train)
predictions = lasso.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(lasso)
print("RMSE Cross-Validation:", rmse_cross_val)

In [None]:
new_row = {"Model": "Lasso","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_ohc = pd.concat([models_ohc, pd.DataFrame([new_row])], ignore_index=True)

### Elastic Net

In [None]:
elastic_net = ElasticNet()
elastic_net.fit(X_train, y_train)
predictions = elastic_net.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(elastic_net)
print("RMSE Cross-Validation:", rmse_cross_val)

In [None]:
new_row = {"Model": "ElasticNet","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_ohc = pd.concat([models_ohc, pd.DataFrame([new_row])], ignore_index=True)

### SVM

In [None]:
svr = SVR(C=100000)
svr.fit(X_train, y_train)
predictions = svr.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(svr)
print("RMSE Cross-Validation:", rmse_cross_val)

In [None]:
new_row = {"Model": "SVR","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_ohc = pd.concat([models_ohc, pd.DataFrame([new_row])], ignore_index=True)

### Random Forest

In [None]:
random_forest = RandomForestRegressor(n_estimators=100)
random_forest.fit(X_train, y_train)
predictions = random_forest.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(random_forest)
print("RMSE Cross-Validation:", rmse_cross_val)

In [None]:
new_row = {"Model": "RandomForestRegressor","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_ohc = pd.concat([models_ohc, pd.DataFrame([new_row])], ignore_index=True)

### XGBoost

In [None]:
xgb = XGBRegressor(n_estimators=1000, learning_rate=0.01)
xgb.fit(X_train, y_train)
predictions = xgb.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(xgb)
print("RMSE Cross-Validation:", rmse_cross_val)


In [None]:
new_row = {"Model": "XGBRegressor","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_ohc = pd.concat([models_ohc, pd.DataFrame([new_row])], ignore_index=True)

### KNN Regressor

In [None]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=4,weights='distance')
knn.fit(X_train, y_train)

#predicting result over test data
predictions= knn.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(knn)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "KNN_Regressor","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_ohc = pd.concat([models_ohc, pd.DataFrame([new_row])], ignore_index=True)


### Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor
DT = DecisionTreeRegressor()
DT.fit(X_train, y_train)

#predicting result over test data
predictions= DT.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(DT)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "DecisionTree_Regressor","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_ohc = pd.concat([models_ohc, pd.DataFrame([new_row])], ignore_index=True)

### Ensemble techniques

#### Bagging and Boosting

In [None]:
from sklearn.ensemble import GradientBoostingRegressor, BaggingRegressor
GB = GradientBoostingRegressor(n_estimators = 200, learning_rate = 0.1, random_state=22)
GB.fit(X_train, y_train)

#predicting result over test data
predictions= GB.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(GB)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "GradientBoosting_Regressor","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_ohc = pd.concat([models_ohc, pd.DataFrame([new_row])], ignore_index=True)


In [None]:
BGG=BaggingRegressor(n_estimators=50, oob_score= True,random_state=14)
BGG.fit(X_train, y_train)

#predicting result over test data
predictions= BGG.predict(X_test)

mae, mse, rmse, r_squared = evaluation(y_test, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-"*30)
rmse_cross_val = rmse_cv(BGG)
print("RMSE Cross-Validation:", rmse_cross_val)

new_row = {"Model": "BaggingRegressor_Regressor","MAE": mae, "MSE": mse, "RMSE": rmse, "R2 Score": r_squared, "RMSE (Cross-Validation)": rmse_cross_val}
models_ohc = pd.concat([models_ohc, pd.DataFrame([new_row])], ignore_index=True)

In [None]:
models_ohc.sort_values(by="RMSE (Cross-Validation)")

In [None]:
plt.figure(figsize=(12,8))
sns.barplot(x=models_ohc["Model"], y=models_ohc["RMSE (Cross-Validation)"])
plt.title("Models' RMSE Scores (Cross-Validated)", size=15)
plt.xticks(rotation=30, size=12)
plt.show()

### Feature importance

In [None]:
## importances come from the fitted XGBoost model (xgb); the index holds the one-hot encoded feature names
rf_imp_feature_1=pd.DataFrame(xgb.feature_importances_, columns = ["Imp"], index = X_train.columns)
## round to 5 decimals and sort from most to least important
rf_imp_feature_1['Imp'] = rf_imp_feature_1['Imp'].astype("float").round(5)
rf_imp_feature_1=rf_imp_feature_1.sort_values(by="Imp",ascending=False)

rf_imp_feature_1[:30].plot.bar(figsize=(10,5))

# The first 20 features have an importance of 90.5% and the first 30 an importance of 95.15%
print("First 20 feature importance:\t",(rf_imp_feature_1[:20].sum())*100)
print("First 30 feature importance:\t",(rf_imp_feature_1[:30].sum())*100)

Above are the top 30 features, which together account for about 95% of the total feature importance in the model.
We will analyse these further and tune the hyperparameters of the models for a better score.
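
The cut-off of 20 or 30 features can also be derived from a target share of importance instead of being fixed in advance. The sketch below does this on made-up importances; the values, the feature names and the 95% target are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up importances standing in for a fitted model's feature_importances_
imp = pd.Series(
    [0.40, 0.25, 0.15, 0.10, 0.06, 0.04],
    index=["f1", "f2", "f3", "f4", "f5", "f6"],
).sort_values(ascending=False)

cumulative = imp.cumsum()

# Smallest number of top features whose cumulative importance reaches the target
target = 0.95
n_needed = int(np.searchsorted(cumulative.to_numpy(), target) + 1)
selected = imp.index[:n_needed].tolist()

print(cumulative)
print(n_needed, selected)
```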

### Filtering out the important features for modeling

In [None]:
rf_imp_feature_1[:30]

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def result (model,pipe_model,X_train_set,y_train_set,X_val_set,y_val_set):
    ## fit the pipeline and return train and validation metrics as a one-row DataFrame
    pipe_model.fit(X_train_set,y_train_set)
    #predicting result over train and validation data
    y_train_predict= pipe_model.predict(X_train_set)
    y_val_predict= pipe_model.predict(X_val_set)

    trscore=r2_score(y_train_set,y_train_predict)
    trRMSE=np.sqrt(mean_squared_error(y_train_set,y_train_predict))
    trMSE=mean_squared_error(y_train_set,y_train_predict)
    trMAE=mean_absolute_error(y_train_set,y_train_predict)
    
    ## use the validation set passed in as an argument, not a global variable
    vlscore=r2_score(y_val_set,y_val_predict)
    vlRMSE=np.sqrt(mean_squared_error(y_val_set,y_val_predict))
    vlMSE=mean_squared_error(y_val_set,y_val_predict)
    vlMAE=mean_absolute_error(y_val_set,y_val_predict)
    result_df=pd.DataFrame({'Method':[model],'val score':vlscore,'RMSE_val':vlRMSE,'MSE_val':vlMSE,'MAE_val': vlMAE,
                          'train Score':trscore,'RMSE_tr': trRMSE,'MSE_tr': trMSE, 'MAE_tr': trMAE})  
    return result_df

In [None]:
## the held-out split created above (X_test, y_test) is used as the validation set
result_dff=pd.DataFrame()
pipe_LR = Pipeline([('LR', LinearRegression())])
result_dff=pd.concat([result_dff,result('LR',pipe_LR,X_train,y_train,X_test,y_test)])

pipe_knr = Pipeline([('KNNR', KNeighborsRegressor(n_neighbors=4,weights='distance'))])
result_dff=pd.concat([result_dff,result('KNNR',pipe_knr,X_train,y_train,X_test,y_test)])

pipe_DTR = Pipeline([('DTR', DecisionTreeRegressor())])
result_dff=pd.concat([result_dff,result('DTR',pipe_DTR,X_train,y_train,X_test,y_test)])

pipe_GBR = Pipeline([('GBR', GradientBoostingRegressor(n_estimators = 200, learning_rate = 0.1, random_state=22))])
result_dff=pd.concat([result_dff,result('GBR',pipe_GBR,X_train,y_train,X_test,y_test)])

pipe_BGR = Pipeline([('BGR', BaggingRegressor(n_estimators=50, oob_score= True,random_state=14))])
result_dff=pd.concat([result_dff,result('BGR',pipe_BGR,X_train,y_train,X_test,y_test)])

pipe_RFR = Pipeline([('RFR', RandomForestRegressor())])
result_dff=pd.concat([result_dff,result('RFR',pipe_RFR,X_train,y_train,X_test,y_test)])

result_dff

Note: The best model is still from Linear Regression

#### Modeling Summary
<li>The ensemble models performed well compared with the linear, KNN and SVR models.
<li>The best performance is given by the Gradient Boosting model: training (score 90%, RMSE 108130), validation (score 89.3%, RMSE 109201).
<li>The top key features that drive the price of the property are: 'furnished', 'latitude', 'zipcode', 'quality_grade', 'living_measure', 'quality_8', 'HouseLandRatio', 'lot_measure15', 'quality_9', 'ceil_measure', 'total_area'.
These findings are also supported by the bivariate analysis.
For further improvement, the outliers can be treated in different ways and the hyperparameters of the ensemble models can be tuned.
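
As a starting point for the tuning mentioned above, the sketch below runs a small grid search over a Gradient Boosting regressor with scikit-learn's `GridSearchCV`. It uses synthetic data, and the parameter grid is deliberately tiny and purely illustrative; suitable ranges for the housing data would have to be chosen by experiment.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Illustrative grid only; real ranges depend on the data
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=22),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_rmse = -search.best_score_  # scores are negated RMSE, so flip the sign

print(best_params, best_rmse)
```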