
# THE LEAGUE PROJECT
Authors:
 - Keith Maina
 - Vivian Kingasia
 - Ann Maureen
 - Brian Kigen
 - Charity Gakuru
 - Hannah Mutua
 - Mercy Ngila
 - Steve Troy

# 1 INTRODUCTION

##  1.1 Business Understanding

### 1.1.1 Introduction
In 2021, the real estate industry in the United States was valued at USD 3.69 trillion and was expected to grow at a compound annual growth rate of 5.2% between 2022 and 2030. This predicted growth, coupled with a rising US population, creates a lucrative opportunity for real estate investors, provided they <b>manage risk</b> and <b>make the right investments</b>.<br>
According to <a href= 'https://www.peoplescapitalgroup.com/average-roi-real-estate/'>People's Capital Group</a>, residential properties have an average annual return of 10.6% and commercial properties have a 9.5% average return.

### 1.1.2 Problem Statement
The stakeholder in this project is a real estate investment firm that is looking to construct residential homes in the top five locations in the US that would provide a high return on investment. This project is therefore a time series analysis of a Zillow dataset covering various locations around the United States.<br><br>

The project will involve analyzing the house sale prices from 1996 to 2018 to determine the top five locations with the highest Return on Investment (ROI).<br>
The stakeholder is also risk-averse and therefore the project involves recommending locations with low price volatility which can easily be predicted with the model.<br>

In our time series analysis, the metric of success to determine model viability will be RMSE/MSE.
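
As a reminder of what these metrics measure, here is a minimal sketch of MSE and RMSE in plain Python. The function names and the sample prices are hypothetical and serve only as illustration; they are not taken from the Zillow data or from the models built later.

```python
import math

def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error, expressed in the same units as the prices."""
    return math.sqrt(mse(actual, predicted))

# Hypothetical prices and forecasts, for illustration only
actual = [100_000, 102_000, 104_000]
predicted = [101_000, 101_000, 106_000]

print(mse(actual, predicted))
print(rmse(actual, predicted))
```

Because RMSE is in the same units as the house prices (USD), it is usually the easier of the two to interpret when comparing models.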

### 1.1.3 Project Scope
The primary goal of this project will be to conduct a time series analysis to predict the five best locations to invest in based on ROI.
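
The text does not spell out how ROI is computed, so the sketch below assumes the simple definition: final price minus initial price, divided by initial price. The function name `roi` is hypothetical. The two prices are the 1996-04 and 2018-04 values shown for zipcode 60657 in the data preview in Section 2.1.

```python
def roi(initial_price, final_price):
    """Simple return on investment as a fraction of the initial price (assumed definition)."""
    if initial_price <= 0:
        raise ValueError("initial_price must be positive")
    return (final_price - initial_price) / initial_price

# Values for zipcode 60657 taken from the data preview (1996-04 and 2018-04)
chicago_roi = roi(334200.0, 1030600.0)
print(f"ROI over the full period: {chicago_roi:.2%}")
```

This is a whole-period figure. An annualized or forecast-based ROI would need a different formula.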

### 1.1.4 Problem Questions
- What are the five best locations to invest in around the US?
- What makes these locations so lucrative?
- Does urbanization affect the prices of houses?
- How long does it take to cash out on the investment?
- How risky <b>(measured as ---)</b> is the investment?
- When are the prices most volatile <b>(measured as frequency of price change in a small timeframe)</b> and where are the locations of the houses with most volatility?
- How much profit can investors potentially make based on our predictions?
- Can we use the information gained from this project to gain clients and if so, how?


### 1.1.5 Project Objectives
1. Provide effective real estate investment recommendations to the stakeholder.
2. Increase the real estate investor’s customer base.


# 2 DATA PREPARATION

## 2.1 IMPORTING LIBRARIES AND LOADING DATA

In [51]:
# importing libraries
import pandas as pd 

In [52]:
# loading the dataset
df = pd.read_csv('./data/zillow_data.csv')
df.head()

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,...,2017-07,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04
0,84654,60657,Chicago,IL,Chicago,Cook,1,334200.0,335400.0,336500.0,...,1005500,1007500,1007800,1009600,1013300,1018700,1024400,1030700,1033800,1030600
1,90668,75070,McKinney,TX,Dallas-Fort Worth,Collin,2,235700.0,236900.0,236700.0,...,308000,310000,312500,314100,315000,316600,318100,319600,321100,321800
2,91982,77494,Katy,TX,Houston,Harris,3,210400.0,212200.0,212200.0,...,321000,320600,320200,320400,320800,321200,321200,323000,326900,329900
3,84616,60614,Chicago,IL,Chicago,Cook,4,498100.0,500900.0,503100.0,...,1289800,1287700,1287400,1291500,1296600,1299000,1302700,1306400,1308500,1307000
4,93144,79936,El Paso,TX,El Paso,El Paso,5,77300.0,77300.0,77300.0,...,119100,119400,120000,120300,120300,120300,120300,120500,121000,121500


In [3]:
# information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14723 entries, 0 to 14722
Columns: 272 entries, RegionID to 2018-04
dtypes: float64(219), int64(49), object(4)
memory usage: 30.6+ MB


In [53]:
# getting the columns in the dataset
df.columns

Index(['RegionID', 'RegionName', 'City', 'State', 'Metro', 'CountyName',
       'SizeRank', '1996-04', '1996-05', '1996-06',
       ...
       '2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12',
       '2018-01', '2018-02', '2018-03', '2018-04'],
      dtype='object', length=272)

## 2.2 DATA CLEANING 

###  2.2.1 Validity

##### The first step in the data cleaning process is checking for validity and dropping any unnecessary columns. Columns are also renamed here.

In [54]:
# Change the column name from RegionName to Zipcode because the column actually contains the zipcodes
df = df.rename(columns={'RegionName': 'Zipcode'})
df.head()

Unnamed: 0,RegionID,Zipcode,City,State,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,...,2017-07,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04
0,84654,60657,Chicago,IL,Chicago,Cook,1,334200.0,335400.0,336500.0,...,1005500,1007500,1007800,1009600,1013300,1018700,1024400,1030700,1033800,1030600
1,90668,75070,McKinney,TX,Dallas-Fort Worth,Collin,2,235700.0,236900.0,236700.0,...,308000,310000,312500,314100,315000,316600,318100,319600,321100,321800
2,91982,77494,Katy,TX,Houston,Harris,3,210400.0,212200.0,212200.0,...,321000,320600,320200,320400,320800,321200,321200,323000,326900,329900
3,84616,60614,Chicago,IL,Chicago,Cook,4,498100.0,500900.0,503100.0,...,1289800,1287700,1287400,1291500,1296600,1299000,1302700,1306400,1308500,1307000
4,93144,79936,El Paso,TX,El Paso,El Paso,5,77300.0,77300.0,77300.0,...,119100,119400,120000,120300,120300,120300,120300,120500,121000,121500


 - The column RegionName has been renamed to Zipcode, which is the true representation of what is captured in that column.
 - All columns here are useful, so none will be dropped at this point.

### 2.2.2 COMPLETENESS 

- The dataset that will be analyzed and modelled needs to be complete. All the missing values must be handled.


In [55]:
# Check for null values 
print(f'The data has {df.isna().sum().sum()} missing values')

The data has 157934 missing values


In [56]:
# checking for null values in the dataset
df.isna().sum()

RegionID       0
Zipcode        0
City           0
State          0
Metro       1043
            ... 
2017-12        0
2018-01        0
2018-02        0
2018-03        0
2018-04        0
Length: 272, dtype: int64

There are 1043 missing values in the Metro column. 

In [57]:
# imputing the missing values (assign the result back instead of using inplace on a column selection)
df['Metro'] = df['Metro'].fillna('None')

The Metro column is kept rather than dropped; its missing values are filled with the string 'None' to flag rows that have no metro area recorded. Missing prices in the monthly columns are dropped later, when the data is reshaped to long format.
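
Since the truncated output above hides most columns, it can help to list only the columns that actually contain gaps. The sketch below does this on a hypothetical miniature of the wide table; the `toy` frame and its values are illustrative and not the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the wide Zillow table
toy = pd.DataFrame({
    'Zipcode': ['60657', '75070', '77494'],
    'Metro': ['Chicago', None, 'Houston'],
    '1996-04': [334200.0, None, 210400.0],
    '1996-05': [335400.0, None, 212200.0],
})

# Count missing values per column and keep only the columns with gaps
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)

# Fill the categorical gap with a placeholder label, as done for Metro above
toy['Metro'] = toy['Metro'].fillna('None')
```

Applied to the real `df`, the same filter would show how the missing values are distributed across the columns.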

### 2.2.3 CONSISTENCY

The consistency of the data will be checked by looking for duplicate rows and handling any that are found.

In [58]:
# checking for duplicates

print(f'The data has {df.duplicated().sum()} duplicates')

The data has 0 duplicates


- The data has no duplicates, hence it is consistent.

### 2.2.4 UNIFORMITY

Uniformity of the data is important. This will be checked by looking at the data types of the different columns and ensuring they are correct.

In [59]:
# checking the column datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14723 entries, 0 to 14722
Columns: 272 entries, RegionID to 2018-04
dtypes: float64(219), int64(49), object(4)
memory usage: 30.6+ MB


In [60]:
df.dtypes['Zipcode']

dtype('int64')

In [61]:
# Convert all the zipcodes to strings 
df.Zipcode = df.Zipcode.astype('string')

All the zipcodes have been converted from integers to strings, since zipcodes are identifiers rather than numbers; they just happen to use a digits-only alphabet.

In [62]:
print(df.Zipcode.min())
print(df.Zipcode.max())

1001
99901


In [63]:
# The zipcodes need to be 5 digits long, so leading zeros are added to the shorter ones
# (vectorized; avoids chained assignment inside a Python loop)
df['Zipcode'] = df['Zipcode'].str.zfill(5)

Standard zipcodes are 5 digits long, which is why a leading zero had to be added to those that were four digits long.

In [64]:
print(df.Zipcode.min())

01001


All the zipcodes are now 5 digits long
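
The padding step can be checked on a small example. The sketch below uses a hypothetical Series of integer zipcodes (not the real column) and pads it with the pandas string method `str.zfill`.

```python
import pandas as pd

# Hypothetical integer zipcodes, two of which lost their leading zero
zips = pd.Series([1001, 60657, 2116], name='Zipcode')

# Convert to strings, then left-pad with zeros to a width of 5
padded = zips.astype('string').str.zfill(5)
print(padded.tolist())

# Sanity check: every zipcode is now exactly 5 characters long
print((padded.str.len() == 5).all())
```

Checking the string length directly is a stricter test than looking at the minimum, because string comparison is lexicographic rather than numeric.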

### Convert the data to time series

Now, we will create a time series by changing the data from wide view to long view, and indexing it by the Date.

In [65]:
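def melt_df(df):
    """Reshape the wide table to long format: one row per zipcode per month."""
    melted = pd.melt(df, id_vars=['RegionID', 'Zipcode', 'City', 'State', 'Metro', 'CountyName', 'SizeRank'],
                     var_name='Date')
    # Column labels look like '1996-04', so parse them with an explicit format
    melted['Date'] = pd.to_datetime(melted['Date'], format='%Y-%m')
    # Drop months with no recorded price
    melted = melted.dropna(subset=['value'])
    return melted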

In [66]:
melted_df = melt_df(df)

In [68]:
# Make sure the data type of the 'Date' column is datetime
# (melt_df has already parsed it, so no format string is needed here)
melted_df['Date'] = pd.to_datetime(melted_df['Date'])

# Set the 'Date' column as index
melted_df.set_index('Date', inplace=True)

In [69]:
melted_df.head()

Date,RegionID,Zipcode,City,State,Metro,CountyName,SizeRank,value
1996-04-01,84654,60657,Chicago,IL,Chicago,Cook,1,334200.0
1996-04-01,90668,75070,McKinney,TX,Dallas-Fort Worth,Collin,2,235700.0
1996-04-01,91982,77494,Katy,TX,Houston,Harris,3,210400.0
1996-04-01,84616,60614,Chicago,IL,Chicago,Cook,4,498100.0
1996-04-01,93144,79936,El Paso,TX,El Paso,El Paso,5,77300.0
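
To make the wide-to-long reshaping concrete, here is a small sketch on a two-zipcode, two-month frame. The prices are the 1996-04 and 1996-05 values for zipcodes 60657 and 75070 from the preview in Section 2.1; the names `wide`, `long_df` and `series_60657` are hypothetical.

```python
import pandas as pd

# Two rows and two monthly columns from the wide table
wide = pd.DataFrame({
    'Zipcode': ['60657', '75070'],
    '1996-04': [334200.0, 235700.0],
    '1996-05': [335400.0, 236900.0],
})

# Wide to long: one row per (zipcode, month) pair
long_df = wide.melt(id_vars=['Zipcode'], var_name='Date', value_name='value')
long_df['Date'] = pd.to_datetime(long_df['Date'], format='%Y-%m')
long_df = long_df.set_index('Date').sort_index()

# With a datetime index, the price series of a single zipcode is a simple filter
series_60657 = long_df.loc[long_df['Zipcode'] == '60657', 'value']
print(series_60657)
```

The same filter on `melted_df` would give the series for any one zipcode, which is the shape most time series models expect.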
