# Zillow Research Analysis

by: Armun Shakeri

In [1]:
# 3 year, 5 year, and 10 year
# forecast housing prices 
# ROI, highest and lowest
# Median Sale Price


# Questions for Monday
# Is it ok if I only look at tx times? 
# Should I remove Texas from here too? 
# Need help with separating date from dataset

## Overview and Business Problem

This project analyzes 3-bedroom homes in Texas to determine which 10 zip codes had the highest ROI over 3-, 5-, and 10-year spans.

## Data Understanding

This data represents median monthly housing sales for 3-bedroom homes across the United States.

Each row represents a unique zip code. Each record contains location info and the median housing sales price for each month.

There are 23404 rows and 281 variables:

RegionID: Unique index, 
<br />RegionName: Unique Zip Code,
<br />City: City in which the zip code is located,
<br />State: State in which the zip code is located,
<br />Metro: Metropolitan Area in which the zip code is located,
<br />CountyName: County in which the zip code is located,
<br />SizeRank: Numerical rank of size of zip code, ranked 1 through 23404,
<br />2000-01-31 through 2022-08-31: median housing sales values for January 2000 through August 2022, that is, 272 monthly data points for each zip code
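
The column layout described above (a handful of descriptive columns followed by one column per month) can be separated and reshaped as in the following minimal sketch. The small `wide` table and its values are made up for illustration, and the names `date_cols`, `id_cols`, and `long` are assumptions, not part of the original notebook.

```python
import pandas as pd

# Tiny stand-in for the Zillow table; the values are illustrative only.
wide = pd.DataFrame({
    "RegionID": [1, 2],
    "RegionName": [77449, 77494],
    "State": ["TX", "TX"],
    "2000-01-31": [100.0, 200.0],
    "2000-02-29": [110.0, None],
})

# A column counts as a monthly value column if its name parses as a date.
date_cols = [
    c for c in wide.columns
    if not pd.isna(pd.to_datetime(c, format="%Y-%m-%d", errors="coerce"))
]
id_cols = [c for c in wide.columns if c not in date_cols]

# Reshape to one row per (zip code, month) and parse the date strings.
long = wide.melt(id_vars=id_cols, value_vars=date_cols,
                 var_name="date", value_name="value")
long["date"] = pd.to_datetime(long["date"])
```

The long form is one possible way to separate the dates from the rest of the dataset; the wide form is kept for the rest of this notebook.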

## Import standard packages and data

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
import time
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

In [3]:
data = pd.read_csv("Data/Zip_zhvi_bdrmcnt_3_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv")
data.head()

Unnamed: 0,RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName,2000-01-31,...,2021-11-30,2021-12-31,2022-01-31,2022-02-28,2022-03-31,2022-04-30,2022-05-31,2022-06-30,2022-07-31,2022-08-31
0,91940,0,77449,zip,TX,TX,,"Houston-The Woodlands-Sugar Land, TX",Harris County,96603.0,...,232376.0,236021.0,239693.0,244103.0,249528.0,255561.0,261406.0,266140.0,269363.0,271087.0
1,91982,1,77494,zip,TX,TX,,"Houston-The Woodlands-Sugar Land, TX",Fort Bend County,163540.0,...,330128.0,334969.0,339733.0,346806.0,355476.0,365971.0,374539.0,379908.0,381471.0,380327.0
2,93144,2,79936,zip,TX,TX,El Paso,"El Paso, TX",El Paso County,87170.0,...,167760.0,169442.0,171444.0,173448.0,175569.0,178832.0,182377.0,185780.0,188090.0,189856.0
3,62080,3,11368,zip,NY,NY,New York,"New York-Newark-Jersey City, NY-NJ-PA",Queens County,324450.0,...,814606.0,815163.0,817786.0,818496.0,823195.0,827059.0,836542.0,842837.0,849412.0,852484.0
4,62093,4,11385,zip,NY,NY,New York,"New York-Newark-Jersey City, NY-NJ-PA",Queens County,279395.0,...,749033.0,750202.0,754601.0,758353.0,764191.0,766694.0,772804.0,778368.0,783167.0,785138.0


Obtain information regarding data columns.

In [4]:
data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23404 entries, 0 to 23403
Data columns (total 281 columns):
 #   Column      Dtype  
---  ------      -----  
 0   RegionID    int64  
 1   SizeRank    int64  
 2   RegionName  int64  
 3   RegionType  object 
 4   StateName   object 
 5   State       object 
 6   City        object 
 7   Metro       object 
 8   CountyName  object 
 9   2000-01-31  float64
 10  2000-02-29  float64
 11  2000-03-31  float64
 12  2000-04-30  float64
 13  2000-05-31  float64
 14  2000-06-30  float64
 15  2000-07-31  float64
 16  2000-08-31  float64
 17  2000-09-30  float64
 18  2000-10-31  float64
 19  2000-11-30  float64
 20  2000-12-31  float64
 21  2001-01-31  float64
 22  2001-02-28  float64
 23  2001-03-31  float64
 24  2001-04-30  float64
 25  2001-05-31  float64
 26  2001-06-30  float64
 27  2001-07-31  float64
 28  2001-08-31  float64
 29  2001-09-30  float64
 30  2001-10-31  float64
 31  2001-11-30  float64
 32  2001-12-31  float64
 33  2002-01-31

Drop all states outside of Texas. 

In [5]:
data = data[data['State'] == 'TX']
data.head()

Unnamed: 0,RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName,2000-01-31,...,2021-11-30,2021-12-31,2022-01-31,2022-02-28,2022-03-31,2022-04-30,2022-05-31,2022-06-30,2022-07-31,2022-08-31
0,91940,0,77449,zip,TX,TX,,"Houston-The Woodlands-Sugar Land, TX",Harris County,96603.0,...,232376.0,236021.0,239693.0,244103.0,249528.0,255561.0,261406.0,266140.0,269363.0,271087.0
1,91982,1,77494,zip,TX,TX,,"Houston-The Woodlands-Sugar Land, TX",Fort Bend County,163540.0,...,330128.0,334969.0,339733.0,346806.0,355476.0,365971.0,374539.0,379908.0,381471.0,380327.0
2,93144,2,79936,zip,TX,TX,El Paso,"El Paso, TX",El Paso County,87170.0,...,167760.0,169442.0,171444.0,173448.0,175569.0,178832.0,182377.0,185780.0,188090.0,189856.0
7,91733,7,77084,zip,TX,TX,Houston,"Houston-The Woodlands-Sugar Land, TX",Harris County,96608.0,...,227606.0,230914.0,234220.0,237989.0,242657.0,247697.0,252703.0,256619.0,259138.0,260480.0
17,92593,17,78660,zip,TX,TX,Pflugerville,"Austin-Round Rock-Georgetown, TX",Travis County,142867.0,...,404382.0,415007.0,424494.0,435592.0,446426.0,455652.0,462620.0,458565.0,449787.0,438317.0
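
A boolean filter like the one above keeps the original row labels (0, 1, 2, 7, 17, ...), and later column assignments on the filtered frame can trigger `SettingWithCopyWarning` in pandas versions that emit it. A hedged sketch of the same filter with an explicit `.copy()`, using a made-up three-row frame and the hypothetical name `tx`:

```python
import pandas as pd

# Illustrative data only; not taken from the Zillow file.
df = pd.DataFrame({
    "State": ["TX", "NY", "TX"],
    "RegionName": [77449, 11368, 79936],
})

# .copy() makes the Texas subset an independent frame, so adding or
# converting columns later does not touch (or warn about) the original.
tx = df[df["State"] == "TX"].copy()
tx["RegionName"] = tx["RegionName"].astype("string")
```

Note that `tx.index` is still `[0, 2]`; call `reset_index(drop=True)` if positional labels are wanted.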


We will drop 'StateName' and 'RegionType' because they are redundant, and 'SizeRank' because it is not relevant to this analysis.

In [6]:
data = data.drop(['StateName', 'RegionType', 'SizeRank'], axis=1)

We can see that there are 101,792 missing values within the dataset. 

In [7]:
data.isna().sum().sum()

101792
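
A single total does not show where the gaps sit. The sketch below, on a made-up frame, breaks the same count down per column and per row; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "City": ["Houston", None, "Austin"],
    "2000-01-31": [np.nan, np.nan, 100.0],
    "2000-02-29": [np.nan, 90.0, 101.0],
})

# Same grand total as data.isna().sum().sum()
total_missing = int(df.isna().sum().sum())

# Missing values per column (a Series indexed by column name)
missing_per_column = df.isna().sum()

# Number of rows that contain at least one missing value
rows_with_any_missing = int(df.isna().any(axis=1).sum())
```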

## Analyze 'RegionID'

There are 1293 unique values within RegionID.

In [8]:
print(data.RegionID.value_counts())
print(data.RegionID.nunique())
print(data.RegionID.min())
print(data.RegionID.max())

92157    1
90733    1
90759    1
92808    1
90761    1
        ..
91392    1
91395    1
91397    1
91398    1
92160    1
Name: RegionID, Length: 1293, dtype: int64
1293
90611
787971


In [9]:
data[data.RegionID >= 100000]

Unnamed: 0,RegionID,RegionName,State,City,Metro,CountyName,2000-01-31,2000-02-29,2000-03-31,2000-04-30,...,2021-11-30,2021-12-31,2022-01-31,2022-02-28,2022-03-31,2022-04-30,2022-05-31,2022-06-30,2022-07-31,2022-08-31
105,399727,78542,TX,Edinburg,"McAllen-Edinburg-Mission, TX",Hidalgo County,,,,,...,167788.0,170786.0,173059.0,175352.0,177976.0,181052.0,184327.0,187405.0,190153.0,193009.0
316,399724,77407,TX,Richmond,"Houston-The Woodlands-Sugar Land, TX",Fort Bend County,147866.0,148551.0,149188.0,150314.0,...,285231.0,289066.0,293108.0,299310.0,306808.0,315533.0,322869.0,327716.0,329450.0,328649.0
624,399725,77498,TX,Sugar Land,"Houston-The Woodlands-Sugar Land, TX",Fort Bend County,106128.0,106515.0,106830.0,107517.0,...,253347.0,256227.0,259172.0,263808.0,269430.0,276184.0,281709.0,285183.0,285964.0,284472.0
930,787970,75072,TX,McKinney,"Dallas-Fort Worth-Arlington, TX",Collin County,,,,,...,397635.0,409134.0,422416.0,432973.0,442135.0,453732.0,465911.0,476474.0,477881.0,473370.0
936,399638,78665,TX,Round Rock,"Austin-Round Rock-Georgetown, TX",Williamson County,,,,,...,435297.0,444927.0,457794.0,467751.0,476143.0,481926.0,486732.0,485045.0,477600.0,466026.0
1215,422746,75033,TX,Frisco,"Dallas-Fort Worth-Arlington, TX",Denton County,176703.0,175835.0,175682.0,175662.0,...,442829.0,453538.0,466795.0,478994.0,490201.0,504329.0,518516.0,531760.0,534340.0,532447.0
4240,399637,78633,TX,Georgetown,"Austin-Round Rock-Georgetown, TX",Williamson County,,,,,...,505507.0,512947.0,526379.0,537648.0,548173.0,555543.0,562518.0,563592.0,557858.0,547291.0
4645,787971,75036,TX,Frisco,"Dallas-Fort Worth-Arlington, TX",Denton County,,,,,...,416331.0,426120.0,438039.0,450001.0,461054.0,473164.0,483866.0,492962.0,494705.0,494559.0
5654,399726,77523,TX,,"Houston-The Woodlands-Sugar Land, TX",Chambers County,,,,,...,275669.0,281829.0,286846.0,295486.0,298133.0,300903.0,300833.0,301340.0,302540.0,305548.0


The RegionID values carry no numeric meaning and are all unique, so we will convert the column to the string dtype.

In [10]:
data.RegionID = data.RegionID.astype('string')

In [11]:
data.RegionID.unique()

<StringArray>
['91940', '91982', '93144', '91733', '92593', '92481', '90654', '91926',
 '91968', '92036',
 ...
 '91948', '92496', '92406', '91954', '91965', '92918', '92929', '92177',
 '92087', '91942']
Length: 1293, dtype: string

## Analyze 'RegionName'

Next we will look at RegionName, which is the zip code. We can see that there are 1293 unique values.

In [12]:
data.RegionName.value_counts() 

77880    1
76437    1
76476    1
78526    1
76085    1
        ..
78933    1
79226    1
77657    1
75134    1
77563    1
Name: RegionName, Length: 1293, dtype: int64

All zip codes are unique, so we will convert the column to the string dtype.

In [13]:
data.RegionName = data.RegionName.astype('string')

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1293 entries, 0 to 23400
Columns: 278 entries, RegionID to 2022-08-31
dtypes: float64(272), object(4), string(2)
memory usage: 2.8+ MB


In [15]:
data[['RegionName', 'State']].sort_values(by=['RegionName'])

Unnamed: 0,RegionName,State
7529,75001,TX
142,75002,TX
768,75006,TX
571,75007,TX
6689,75009,TX
...,...,...
3745,79932,TX
4607,79934,TX
7240,79935,TX
2,79936,TX


## Analyze 'City'

In [16]:
data.City.nunique()

701

In [17]:
data.City.isna().sum()

102

In [18]:
data.City.fillna('None', inplace=True)

## Analyze 'Metro'

Fill missing Metro values with the string 'None'.

In [19]:
print(data.Metro.value_counts())
print(data.Metro.nunique())

Dallas-Fort Worth-Arlington, TX         249
Houston-The Woodlands-Sugar Land, TX    213
San Antonio-New Braunfels, TX            94
Austin-Round Rock-Georgetown, TX         83
El Paso, TX                              26
                                       ... 
Zapata, TX                                1
Lamesa, TX                                1
Uvalde, TX                                1
Vernon, TX                                1
Andrews, TX                               1
Name: Metro, Length: 71, dtype: int64
71


In [20]:
data.Metro.fillna('None', inplace=True)

In [21]:
data.Metro.value_counts()

Dallas-Fort Worth-Arlington, TX         249
Houston-The Woodlands-Sugar Land, TX    213
None                                    189
San Antonio-New Braunfels, TX            94
Austin-Round Rock-Georgetown, TX         83
                                       ... 
Vernon, TX                                1
Andrews, TX                               1
Pampa, TX                                 1
Del Rio, TX                               1
Pearsall, TX                              1
Name: Metro, Length: 72, dtype: int64
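
The 'None' placeholder used for City and Metro can also be applied to several descriptive columns in one assignment, which avoids `inplace=True` on a column accessed through the frame. This is a sketch on made-up data; `label_cols` is a hypothetical name.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "City": ["Houston", None],
    "Metro": [None, "El Paso, TX"],
    "CountyName": ["Harris County", None],
})

# Fill every descriptive column with the same string placeholder.
label_cols = ["City", "Metro", "CountyName"]
df[label_cols] = df[label_cols].fillna("None")
```

Note that 'None' here is an ordinary string, so it shows up as its own category in `value_counts()`, as in the Metro output above.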

## Analyze 'CountyName'

In [22]:
data.CountyName.value_counts()

Harris County      129
Dallas County       77
Bexar County        65
Tarrant County      63
Travis County       45
                  ... 
Bailey County        1
Sabine County        1
Kendall County       1
Brewster County      1
Yoakum County        1
Name: CountyName, Length: 206, dtype: int64

In [23]:
data.isna().sum()

RegionID       0
RegionName     0
State          0
City           0
Metro          0
              ..
2022-04-30    16
2022-05-31    13
2022-06-30     6
2022-07-31     4
2022-08-31     0
Length: 278, dtype: int64

In [24]:
data.CountyName.fillna('None', inplace=True)

## Drop rows with missing sales data

In [25]:
data = data.dropna()

In [26]:
data.isna().sum()

RegionID      0
RegionName    0
State         0
City          0
Metro         0
             ..
2022-04-30    0
2022-05-31    0
2022-06-30    0
2022-07-31    0
2022-08-31    0
Length: 278, dtype: int64
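
`dropna()` with no arguments removes every row that has a gap in any column, which is why the row count falls from 1293 to the 448 rows seen in the ROI output below. A minimal sketch, on made-up data, of restricting the check to the monthly columns and recording how many rows survive; the digit-prefix test for date columns is an assumption that fits this file's column names.

```python
import numpy as np
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "RegionName": ["75001", "75002", "75006"],
    "2000-01-31": [np.nan, 100.0, 120.0],
    "2022-08-31": [300.0, 310.0, 330.0],
})

# Monthly columns are the ones whose names start with a four-digit year.
date_cols = [c for c in df.columns if c[:4].isdigit()]

# Drop only rows that are missing at least one monthly value.
complete = df.dropna(subset=date_cols)
rows_before, rows_after = len(df), len(complete)
```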

## EDA on ZipCodes

In [27]:
# Check out most recent 1 year ROI
data['recent_1_yr_ROI'] = (data['2022-08-31'] - data['2021-08-31'])/(data['2021-08-31'])
data['recent_1_yr_ROI']

0        0.232488
1        0.219513
2        0.182189
7        0.207553
17       0.123829
           ...   
16325    0.232613
16800    0.155550
17642    0.222251
17728    0.146788
18069    0.148290
Name: recent_1_yr_ROI, Length: 448, dtype: float64

In [28]:
# Lowest Values
data.sort_values('recent_1_yr_ROI').head()[['RegionName', 'City', 'recent_1_yr_ROI']]

Unnamed: 0,RegionName,City,recent_1_yr_ROI
5649,77056,Houston,0.055792
7674,77098,Houston,0.059896
5726,77006,Houston,0.061651
5653,78752,Austin,0.062984
1388,77057,Houston,0.067271


In [29]:
# Highest Values 
data.sort_values('recent_1_yr_ROI', ascending=False).head()[['RegionName', 'City', 'recent_1_yr_ROI']]

Unnamed: 0,RegionName,City,recent_1_yr_ROI
4647,75078,Prosper,0.318655
9627,75210,Dallas,0.311498
15500,75424,Blue Ridge,0.287703
798,75216,Dallas,0.287055
10256,75454,Melissa,0.286421
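
The ROI used here is (end value - start value) / start value. The sketch below wraps that formula in a hypothetical helper, `roi`, and uses `nlargest` / `nsmallest` as an alternative to sorting and taking `head()`. The three-row frame is made up.

```python
import pandas as pd

def roi(df, start_col, end_col):
    """Return on investment between two monthly columns, per row."""
    return (df[end_col] - df[start_col]) / df[start_col]

# Illustrative data only; the index mimics a filtered, non-contiguous index.
df = pd.DataFrame(
    {
        "RegionName": ["a", "b", "c"],
        "2021-08-31": [100.0, 200.0, 400.0],
        "2022-08-31": [130.0, 210.0, 480.0],
    },
    index=[0, 7, 17],
)

df["recent_1_yr_ROI"] = roi(df, "2021-08-31", "2022-08-31")

# Highest and lowest ROI rows without a full sort
top = df.nlargest(2, "recent_1_yr_ROI")
bottom = df.nsmallest(1, "recent_1_yr_ROI")
```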


In [36]:
# Find avg ROI for the past 3 years
def average_one_year_ROI(df):
    # Look up columns by date label: positional offsets such as iloc[i, -12]
    # shift once 'recent_1_yr_ROI' is appended, and df[col][i] raises KeyError
    # because the filtered index is no longer 0..n-1.
    year_1_ROI = df['recent_1_yr_ROI']
    year_2_ROI = (df['2021-08-31'] - df['2020-08-31'])/df['2020-08-31']
    year_3_ROI = (df['2020-08-31'] - df['2019-08-31'])/df['2019-08-31']
    # Vectorized mean of the three yearly ROIs, aligned on the DataFrame index
    return (year_1_ROI + year_2_ROI + year_3_ROI)/3

In [37]:
data['avg_one_yr_ROI'] = average_one_year_ROI(data)
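
For the 3-, 5-, and 10-year spans named in the overview, one possible approach is to derive the start column from the end date and compute the ROI on whole columns. Because the Texas filter leaves a non-contiguous index (0, 1, 2, 7, 17, ...), this sketch looks columns up by date label and lets pandas align rows by index instead of looping over integer positions. `horizon_roi` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def horizon_roi(df, end_col, years):
    """ROI over `years` years ending at the month-end column `end_col`."""
    # Step back the requested number of years, then snap to the month end
    # so the label matches the 'YYYY-MM-DD' month-end column names.
    start = pd.Timestamp(end_col) - pd.DateOffset(years=years)
    start = start + pd.offsets.MonthEnd(0)
    start_col = start.strftime("%Y-%m-%d")
    return (df[end_col] - df[start_col]) / df[start_col]

# Illustrative data only, with a non-contiguous index like the filtered frame.
df = pd.DataFrame(
    {
        "2012-08-31": [100.0, 200.0],
        "2017-08-31": [125.0, 250.0],
        "2019-08-31": [160.0, 400.0],
        "2022-08-31": [200.0, 500.0],
    },
    index=[7, 17],
)

for years in (3, 5, 10):
    df[f"ROI_{years}yr"] = horizon_roi(df, "2022-08-31", years)
```

Sorting by any of the `ROI_*yr` columns and taking the first 10 rows would then give a top-10 list for that horizon.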