# Zillow Research Analysis

by: Armun Shakeri

In [1]:
# 3 year, 5 year, and 10 year
# forecast housing prices 
# ROI, highest and lowest
# Median Sale Price


# Questions for Monday
# Is it ok if I only look at tx times? 
# Should I remove Texas from here too? 
# Need help with separating date from dataset

## Overview and Business Problem

This project analyzes 3-bedroom homes in the Dallas-Fort Worth (DFW) metroplex of Texas to determine which 10 zip codes had the highest ROI over 3-, 5-, and 10-year spans.
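To make the ROI target concrete, here is a minimal sketch. The notebook does not define ROI, so this assumes ROI over an n-year span is the fractional change `(end price - start price) / start price`, measured back from the most recent month. The helper name `roi_over_span`, the zip codes, and the toy prices are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def roi_over_span(prices: pd.DataFrame, years: int) -> pd.Series:
    """Fractional ROI per row between the last month and `years` * 12 months earlier.

    Assumes `prices` contains only monthly price columns in chronological order.
    """
    months = years * 12
    end = prices.iloc[:, -1]
    start = prices.iloc[:, -1 - months]
    return (end - start) / start

# Toy data (hypothetical): 37 monthly columns covering 2019-08 through 2022-08
month_labels = [str(p) for p in pd.period_range("2019-08", "2022-08", freq="M")]
toy = pd.DataFrame(
    [
        [100_000 + 1_000 * i for i in range(37)],  # steadily rising prices
        [200_000] * 37,                            # flat prices
    ],
    index=[75001, 75002],
    columns=month_labels,
)

roi_3yr = roi_over_span(toy, years=3)
top = roi_3yr.nlargest(1)  # on the real data this would be nlargest(10)
print(roi_3yr)
```

The same helper could be called with `years=5` and `years=10`, provided the table holds at least that many months of history.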

## Data Understanding

This data represents monthly median housing prices for 3-bedroom homes across the United States, taken from the Zillow ZHVI (Zillow Home Value Index) file loaded below.

Each row represents a unique zip code. Each record contains location information and the median housing price for each month.

There are 23404 rows and 281 variables:

RegionID: Unique index,
<br />SizeRank: Numerical rank of the zip code by size (ranks start at 0 in this file),
<br />RegionName: Unique zip code,
<br />RegionType: Type of region (`zip` in the rows shown below),
<br />StateName: State abbreviation (matches State in the rows shown below),
<br />State: State in which the zip code is located,
<br />City: City in which the zip code is located,
<br />Metro: Metropolitan area in which the zip code is located,
<br />CountyName: County in which the zip code is located,
<br />2000-01-31 through 2022-08-31: Median housing price for each month from January 2000 through August 2022, that is, 272 monthly data points for each zip code
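Because the monthly values are stored as columns whose names are date strings, it can help to separate the identifier columns from the date columns programmatically. The following is a sketch on a two-row toy frame (the values are copied from rows shown later in this notebook); the variable names `date_cols` and `id_cols` are illustrative assumptions.

```python
import pandas as pd

# Toy frame with the same wide layout as the Zillow file
toy = pd.DataFrame({
    "RegionName": [75052, 75217],
    "State": ["TX", "TX"],
    "2022-07-31": [328404.0, 233564.0],
    "2022-08-31": [327660.0, 233783.0],
})

# Column names that parse as YYYY-MM-DD dates are price columns;
# anything else becomes NaT and is treated as an identifier column.
parsed = pd.to_datetime(pd.Series(toy.columns), format="%Y-%m-%d", errors="coerce")
date_cols = [col for col, dt in zip(toy.columns, parsed) if pd.notna(dt)]
id_cols = [col for col in toy.columns if col not in date_cols]

print(id_cols)
print(date_cols)
```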

## Import standard packages and data

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from datetime import datetime

In [3]:
data = pd.read_csv("Data/Zip_zhvi_bdrmcnt_3_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv")
data.head()

Unnamed: 0,RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName,2000-01-31,...,2021-11-30,2021-12-31,2022-01-31,2022-02-28,2022-03-31,2022-04-30,2022-05-31,2022-06-30,2022-07-31,2022-08-31
0,91940,0,77449,zip,TX,TX,,"Houston-The Woodlands-Sugar Land, TX",Harris County,96603.0,...,232376.0,236021.0,239693.0,244103.0,249528.0,255561.0,261406.0,266140.0,269363.0,271087.0
1,91982,1,77494,zip,TX,TX,,"Houston-The Woodlands-Sugar Land, TX",Fort Bend County,163540.0,...,330128.0,334969.0,339733.0,346806.0,355476.0,365971.0,374539.0,379908.0,381471.0,380327.0
2,93144,2,79936,zip,TX,TX,El Paso,"El Paso, TX",El Paso County,87170.0,...,167760.0,169442.0,171444.0,173448.0,175569.0,178832.0,182377.0,185780.0,188090.0,189856.0
3,62080,3,11368,zip,NY,NY,New York,"New York-Newark-Jersey City, NY-NJ-PA",Queens County,324450.0,...,814606.0,815163.0,817786.0,818496.0,823195.0,827059.0,836542.0,842837.0,849412.0,852484.0
4,62093,4,11385,zip,NY,NY,New York,"New York-Newark-Jersey City, NY-NJ-PA",Queens County,279395.0,...,749033.0,750202.0,754601.0,758353.0,764191.0,766694.0,772804.0,778368.0,783167.0,785138.0


Obtain information regarding data columns.

In [4]:
data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23404 entries, 0 to 23403
Data columns (total 281 columns):
 #   Column      Dtype  
---  ------      -----  
 0   RegionID    int64  
 1   SizeRank    int64  
 2   RegionName  int64  
 3   RegionType  object 
 4   StateName   object 
 5   State       object 
 6   City        object 
 7   Metro       object 
 8   CountyName  object 
 9   2000-01-31  float64
 10  2000-02-29  float64
 11  2000-03-31  float64
 12  2000-04-30  float64
 13  2000-05-31  float64
 14  2000-06-30  float64
 15  2000-07-31  float64
 16  2000-08-31  float64
 17  2000-09-30  float64
 18  2000-10-31  float64
 19  2000-11-30  float64
 20  2000-12-31  float64
 21  2001-01-31  float64
 22  2001-02-28  float64
 23  2001-03-31  float64
 24  2001-04-30  float64
 25  2001-05-31  float64
 26  2001-06-30  float64
 27  2001-07-31  float64
 28  2001-08-31  float64
 29  2001-09-30  float64
 30  2001-10-31  float64
 31  2001-11-30  float64
 32  2001-12-31  float64
 33  2002-01-31

After examining the column information, we will remove the columns 'RegionID', 'SizeRank', 'RegionType', 'StateName', 'Metro', 'City', and 'CountyName'. Since the analysis is at the zip code level, these columns are not needed once the data has been filtered to the DFW metro area (the 'Metro' column is used for that filter first).
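The column removal described above is not shown in the cells that follow, so here is a minimal sketch of how it could be done with `DataFrame.drop`. It assumes the drop happens after the 'Metro' filter has been applied; the one-row frame is a toy stand-in built from a row that appears later in this notebook.

```python
import pandas as pd

cols_to_drop = ["RegionID", "SizeRank", "RegionType", "StateName",
                "Metro", "City", "CountyName"]

# One-row toy stand-in for the already filtered DFW data
toy = pd.DataFrame({
    "RegionID": [90654],
    "SizeRank": [34],
    "RegionName": [75052],
    "RegionType": ["zip"],
    "StateName": ["TX"],
    "State": ["TX"],
    "City": ["Grand Prairie"],
    "Metro": ["Dallas-Fort Worth-Arlington, TX"],
    "CountyName": ["Dallas County"],
    "2022-08-31": [327660.0],
})

# drop returns a new frame; the original is left untouched
trimmed = toy.drop(columns=cols_to_drop)
print(trimmed.columns.tolist())
```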

Drop all rows that contain N/A values.

In [5]:
data = data.dropna()
data.head()

Unnamed: 0,RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName,2000-01-31,...,2021-11-30,2021-12-31,2022-01-31,2022-02-28,2022-03-31,2022-04-30,2022-05-31,2022-06-30,2022-07-31,2022-08-31
2,93144,2,79936,zip,TX,TX,El Paso,"El Paso, TX",El Paso County,87170.0,...,167760.0,169442.0,171444.0,173448.0,175569.0,178832.0,182377.0,185780.0,188090.0,189856.0
6,84630,6,60629,zip,IL,IL,Chicago,"Chicago-Naperville-Elgin, IL-IN-WI",Cook County,133522.0,...,247985.0,250548.0,253484.0,255487.0,257701.0,259941.0,263459.0,265572.0,266131.0,265415.0
7,91733,7,77084,zip,TX,TX,Houston,"Houston-The Woodlands-Sugar Land, TX",Harris County,96608.0,...,227606.0,230914.0,234220.0,237989.0,242657.0,247697.0,252703.0,256619.0,259138.0,260480.0
8,96361,8,91331,zip,CA,CA,Los Angeles,"Los Angeles-Long Beach-Anaheim, CA",Los Angeles County,144773.0,...,659038.0,665242.0,669849.0,677464.0,687363.0,700572.0,711423.0,712829.0,709625.0,698236.0
9,96193,9,90650,zip,CA,CA,Norwalk,"Los Angeles-Long Beach-Anaheim, CA",Los Angeles County,172932.0,...,639038.0,644314.0,649243.0,657427.0,670768.0,687588.0,702447.0,705918.0,704289.0,692281.0
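Note that `dropna()` with no arguments removes every row that has a missing value in any column. In the output above, rows 0 and 1 are gone, and in the first preview those rows had an empty City. A sketch of how one might inspect missing values per column and, if desired, drop rows only when price data is missing is shown below. Whether that narrower rule suits this analysis is a judgment call; the toy frame reuses values from the previews above and `price_cols` is an illustrative name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "RegionName": [77449, 79936, 60629],
    "City": [np.nan, "El Paso", "Chicago"],
    "2022-07-31": [269363.0, 188090.0, 266131.0],
    "2022-08-31": [271087.0, 189856.0, 265415.0],
})

# Count missing values in each column before deciding what to drop
missing_per_column = toy.isna().sum()

# Default behavior: any missing value removes the row
dropped_any = toy.dropna()

# Alternative: only require the price columns to be complete
price_cols = ["2022-07-31", "2022-08-31"]
dropped_prices_only = toy.dropna(subset=price_cols)

print(missing_per_column)
print(len(dropped_any), len(dropped_prices_only))
```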


Keep only the zip codes in the Dallas-Fort Worth-Arlington, TX metro area.

In [6]:
data = data[data['Metro'] == 'Dallas-Fort Worth-Arlington, TX']
data.head()

Unnamed: 0,RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName,2000-01-31,...,2021-11-30,2021-12-31,2022-01-31,2022-02-28,2022-03-31,2022-04-30,2022-05-31,2022-06-30,2022-07-31,2022-08-31
34,90654,34,75052,zip,TX,TX,Grand Prairie,"Dallas-Fort Worth-Arlington, TX",Dallas County,103839.0,...,278604.0,282369.0,287535.0,294913.0,302589.0,311724.0,319510.0,326135.0,328404.0,327660.0
57,90769,57,75217,zip,TX,TX,Dallas,"Dallas-Fort Worth-Arlington, TX",Dallas County,70802.0,...,195299.0,198111.0,202856.0,208669.0,214852.0,220711.0,225957.0,231034.0,233564.0,233783.0
92,90764,94,75211,zip,TX,TX,Cockrell Hill,"Dallas-Fort Worth-Arlington, TX",Dallas County,89680.0,...,236396.0,238021.0,241675.0,248546.0,255200.0,262083.0,267357.0,271871.0,273548.0,273650.0
120,91221,124,76063,zip,TX,TX,Mansfield,"Dallas-Fort Worth-Arlington, TX",Tarrant County,124882.0,...,318619.0,323787.0,330083.0,338174.0,345494.0,354082.0,362034.0,368297.0,369942.0,367734.0
163,91325,167,76244,zip,TX,TX,Fort Worth,"Dallas-Fort Worth-Arlington, TX",Tarrant County,136538.0,...,323571.0,329714.0,336507.0,345259.0,353469.0,363158.0,372160.0,379177.0,380951.0,378390.0


In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 149 entries, 34 to 16800
Columns: 281 entries, RegionID to 2022-08-31
dtypes: float64(272), int64(3), object(6)
memory usage: 328.3+ KB


In [7]:
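# There is no 'Date' column yet: the dates are column headers in this wide table.
# data['Date'] = pd.to_datetime(data['Date']).dt.date  # valid only after reshaping to long format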
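One possible way to separate the dates from the dataset is to reshape the table from wide to long with `DataFrame.melt`, so that each row holds one zip code, one date, and one price. This is a sketch under assumptions, not the notebook's final approach: the column names `Date` and `Price` are made up here, and the toy frame reuses two DFW rows from the output above.

```python
import pandas as pd

# Wide layout: one row per zip code, one column per month
wide = pd.DataFrame({
    "RegionName": [75052, 75217],
    "2022-07-31": [328404.0, 233564.0],
    "2022-08-31": [327660.0, 233783.0],
})

# Long layout: the date headers become values in a single 'Date' column
long = wide.melt(id_vars=["RegionName"], var_name="Date", value_name="Price")
long["Date"] = pd.to_datetime(long["Date"], format="%Y-%m-%d")

# A DatetimeIndex is what most time series tools expect
series = long.set_index("Date").sort_index()
print(series)
```

With the real data, `id_vars` would list whichever identifier columns are kept (for example `RegionName` and `State`).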