The first dataset we decided to use was the Zillow Home Value Index (found [here](https://www.zillow.com/research/data/)), which "reflects the typical value for homes in the 35th to 65th percentile range for a given region." We thought this would be a good estimate of housing prices in various cities. This dataset is organized by metropolitan statistical area (MSA), which usually encompasses multiple counties surrounding a large city. We thought this would give us more accurate and varied data than data restricted to the city limits.



The cities we decided to focus on are: New York, NY; Boston, MA; Washington, DC; San Francisco, CA; Los Angeles, CA; Miami, FL; Louisville, KY; Cincinnati, OH; Houston, TX; Denver, CO; Chicago, IL; and Seattle, WA. We thought these choices offered a good variety of climate, geography, and demographics.

As for the factors that we want to compare with housing prices, we found a [useful table builder](https://data.census.gov/mdat/#/) on the US Census website that allowed us to gather data on education completion, household income, gross rent as a percentage of income (used to estimate home affordability), and property taxes for each of the intended MSAs. Unfortunately, these tables were somewhat difficult to organize because the data could only be organized by counties or districts, not MSAs. To get around this, we downloaded the CSV files and consolidated them below so that we had one data point for each category in each MSA.
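
One way to collapse county- or district-level rows into a single value per MSA is a weighted mean. The sketch below is only an illustration under stated assumptions: the helper name `weighted_mean_by_households`, the toy numbers, and the choice of household counts as weights are hypothetical and not necessarily what the consolidation in this notebook uses.

```python
import pandas as pd

def weighted_mean_by_households(df, value_col, weight_col="Households"):
    """Return the mean of value_col across rows, weighted by weight_col.

    Hypothetical helper: weighting by household count is an assumption.
    """
    weights = df[weight_col]
    return (df[value_col] * weights).sum() / weights.sum()

# Toy rows standing in for the sub-areas of one MSA (made-up numbers).
areas = pd.DataFrame({
    "Households": [100, 300],
    "Household income": [50000.0, 90000.0],
})

msa_income = weighted_mean_by_households(areas, "Household income")
print(msa_income)
```

With a weighted mean, larger sub-areas pull the MSA figure toward their own value, which a plain average of the rows would not do.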

We also wanted to see if weather or climate had an effect on housing prices, so we utilized the [NOAA National Weather Service Website](https://w2.weather.gov/climate/) to calculate the Mean High Temperature, Mean Low Temperature, and Mean Annual Temperature in 2018 for each city. **We manually inputted this data into our final dataset. (Temporary? Might change)** 

In addition to these factors, we wanted to analyze the effect of crime on housing prices. We found datasets on the [FBI website](https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2018/topic-pages/tables/table-6) organized by MSA and year. We chose to record Violent Crime per 100k, Robbery per 100k, Property Crime per 100k, and Burglary per 100k for our project. **Not all of our cities had data for 2018; in these cases we used data from 2017.**
**We manually inputted this data into our final dataset. (Temporary? Might change)** 
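
If the crime figures were loaded from files rather than typed in, the 2017 fallback could be expressed roughly as follows. This is a sketch under assumptions: the city labels, column names, and numbers are all made up for illustration and do not come from the FBI tables.

```python
import pandas as pd

# Made-up per-100k rates; the labels and column names are hypothetical.
crime = pd.DataFrame({
    "MSA": ["City A", "City B"],
    "Violent Crime per 100k (2018)": [350.0, None],
    "Violent Crime per 100k (2017)": [340.0, 410.0],
})

# Use the 2018 value where it exists, otherwise fall back to 2017.
crime["Violent Crime per 100k"] = crime["Violent Crime per 100k (2018)"].fillna(
    crime["Violent Crime per 100k (2017)"]
)

# Record which year each value came from so the substitution stays visible.
crime["Crime Data Year"] = (
    crime["Violent Crime per 100k (2018)"].notna().map({True: 2018, False: 2017})
)

print(crime[["MSA", "Violent Crime per 100k", "Crime Data Year"]])
```

Keeping a separate year column makes it easy to check later whether the cities with substituted data behave differently from the rest.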

We chose to focus on the year 2018, since it was the most recent year that was shared by all of our datasets.

In [1]:
import pandas as pd

# Load the Zillow Home Value Index: one row per region, one column per month-end date.
msazhvi = pd.read_csv("MSAZHVI.csv")

# Average the twelve 2018 month-end values into a single annual figure per region.
months_2018 = [
    "2018-01-31", "2018-02-28", "2018-03-31", "2018-04-30", "2018-05-31", "2018-06-30",
    "2018-07-31", "2018-08-31", "2018-09-30", "2018-10-31", "2018-11-30", "2018-12-31",
]
# skipna=False keeps the result NaN when any month is missing, matching a plain sum divided by 12.
msazhvi["2018 ZHVI"] = msazhvi[months_2018].mean(axis=1, skipna=False)
msazhvi = msazhvi[["RegionName", "2018 ZHVI"]].copy()
msazhvi.head()


Unnamed: 0,RegionName,2018 ZHVI
0,United States,232749.083333
1,"New York, NY",470289.833333
2,"Los Angeles-Long Beach-Anaheim, CA",658484.166667
3,"Chicago, IL",236369.0
4,"Dallas-Fort Worth, TX",240213.666667
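
To narrow the ZHVI table to the cities under study, one option is to filter on `RegionName`. The sketch below uses a small stand-in frame built from the rows shown in the output above. The `focus_regions` list is hypothetical and covers only names visible in that output, since the exact spelling of the other MSA names in the file is not shown here.

```python
import pandas as pd

# Stand-in for the msazhvi frame, using the rows from the output above.
zhvi = pd.DataFrame({
    "RegionName": [
        "United States",
        "New York, NY",
        "Los Angeles-Long Beach-Anaheim, CA",
        "Chicago, IL",
        "Dallas-Fort Worth, TX",
    ],
    "2018 ZHVI": [232749.083333, 470289.833333, 658484.166667, 236369.0, 240213.666667],
})

# Hypothetical subset of the focus MSAs, spelled exactly as in RegionName.
focus_regions = ["New York, NY", "Los Angeles-Long Beach-Anaheim, CA", "Chicago, IL"]

# Keep only the rows whose RegionName is in the focus list.
focus = zhvi[zhvi["RegionName"].isin(focus_regions)].reset_index(drop=True)

# Report any requested name that did not match, which catches spelling differences.
missing = sorted(set(focus_regions) - set(focus["RegionName"]))
print(focus)
print("Not found:", missing)
```

Checking for unmatched names matters because MSA labels often include several cities, so a short city name may not match the file's spelling.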


In [6]:
def MSA_data_frame_builder(household, education):
    """Combine one MSA's household and education CSVs into a single DataFrame.

    Assumes both files list the same geographies in the same row order.
    """
    ed = pd.read_csv(education)
    db = pd.read_csv(household)
    total = ed["Total"]
    # Each education group as a fraction (0 to 1) of the geography's total.
    hs_or_equivalent = (ed["Regular high school diploma"] + ed["GED or alternative credential"]) / total
    bachelors = ed["Bachelor's degree"] / total
    postgrad = (ed["Master's degree"] + ed["Professional degree beyond a bachelor's degree"] + ed["Doctorate degree"]) / total
    db["HS/Equivalent Education Percentage"] = hs_or_equivalent
    db["Bachelor's Percentage"] = bachelors
    db["Postgraduate Percentage"] = postgrad
    db.insert(loc=2, column="Total Population", value=total)
    return db

SF = MSA_data_frame_builder("Michael's Data/SF.csv", "Michael's Data/SFEducation.csv")
SF.head()

Unnamed: 0,Selected Geographies,Households,Total Population,"Household income (past 12 months, use ADJINC to adjust HINCP to constant dollars)",Property taxes (yearly real estate taxes),Gross rent as a percentage of household income past 12 months,HS/Equivalent Education Percentage,Bachelor's Percentage,Postgraduate Percentage
0,Alameda County (Northwest)--Oakland (Northwest...,84703,157376,82051.09937,1475.481943,24.56521,0.133731,0.257485,0.171456
1,Alameda County (Northeast)--Oakland (East) & P...,60634,125395,171262.5969,5657.404427,11.266962,0.085713,0.306671,0.271717
2,Alameda County (North Central)--Oakland City (...,45274,118915,54451.07788,1585.381058,21.780823,0.204583,0.100711,0.048665
3,"Alameda County (West)--San Leandro, Alameda & ...",66871,162789,106320.8521,3110.638005,16.015881,0.177039,0.188668,0.122539
4,"Alameda County (North Central)--Castro Valley,...",47279,130138,112788.4682,3315.345396,14.194526,0.184105,0.169966,0.070733


In [None]:
def weightedMeans(MSAdata):