# INFO 2950 Project Phase II 
By Colin Hoffer, Michael Wang, John Lo, and Patricio Fraga-Errecart

---
## Research Questions

 How do various factors (education, taxes, income) affect housing prices?
 Which factors best predict housing prices within a particular county?

---
## Data Collection and Cleaning

The first dataset we chose is the Zillow Home Value Index (found [here](https://www.zillow.com/research/data/)), which "reflects the typical value for homes in the 35th to 65th percentile range for a given region." We consider it a good estimate of housing prices in various cities. The dataset is organized by metropolitan statistical area (MSA), which usually encompasses multiple counties surrounding a large city. We expect this to give us more accurate and varied data than data restricted to each city's limits.



The cities we decided to focus on are: New York, NY; Boston, MA; Washington, DC; San Francisco, CA; Los Angeles, CA; Miami, FL; Louisville, KY; Cincinnati, OH; Houston, TX; Denver, CO; Chicago, IL; and Seattle, WA. We think these choices offer a good variety of climate, geography, and demographics.

For the factors we want to compare with housing prices, we found a [useful table builder](https://data.census.gov/mdat/#/) on the US Census website that let us gather data on education completion, household income, gross rent as a percentage of income (used to estimate home affordability), and property taxes for each of the intended MSAs. Unfortunately, these tables were somewhat difficult to organize because the data could only be broken down by county or district, not by MSA. To get around this, we downloaded the CSV files and consolidated them below so that we have one data point for each category in each MSA.
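One way to consolidate county-level tables into a single data point per MSA is to sum the raw counts across counties before computing percentages. The following is a minimal sketch with made-up numbers, not our final procedure. The `msa` and `county` columns and the helper `education_share_by_msa` are hypothetical; only the count columns `Total` and `Bachelor's degree` mirror names from our CSV files.

```python
import pandas as pd

def education_share_by_msa(counties: pd.DataFrame, column: str) -> pd.Series:
    """Percent of each MSA's population in `column`, from county-level counts."""
    # Sum the raw counts first so that larger counties carry more weight.
    sums = counties.groupby("msa")[[column, "Total"]].sum()
    return 100 * sums[column] / sums["Total"]

# Made-up county rows for illustration only (not real census figures).
counties = pd.DataFrame({
    "msa": ["A", "A", "B"],
    "county": ["A1", "A2", "B1"],
    "Total": [1000, 3000, 2000],
    "Bachelor's degree": [300, 600, 500],
})
shares = education_share_by_msa(counties, "Bachelor's degree")
print(shares.round(2))
```

Summing counts before dividing avoids averaging county percentages directly, which would treat a small county the same as a large one.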

We also wanted to see whether weather or climate has an effect on housing prices, so we used the [NOAA National Weather Service Website](https://w2.weather.gov/climate/) to calculate the Mean High Temperature, Mean Low Temperature, and Mean Annual Temperature in 2018 for each city. **We manually entered this data into our final dataset. (Temporary? Might change)**

In addition to these factors, we wanted to analyze the effect of crime on housing prices. We found datasets on the [FBI website](https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2018/topic-pages/tables/table-6) organized by MSA and year. We chose to record Violent Crime per 100k, Robbery per 100k, Property Crime per 100k, and Burglary per 100k for our project. **Not all of our cities had data for 2018; in these cases we used data from 2017.** 
**We manually entered this data into our final dataset. (Temporary? Might change)**
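
If the crime figures are later loaded programmatically instead of entered by hand, the 2017 fallback could be sketched as follows. This is an illustration under stated assumptions: the MSA labels and rates are made up, and the helper `rate_per_100k` and the `source_year` series are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

def rate_per_100k(count: float, population: float) -> float:
    """Convert a raw offense count into a rate per 100,000 residents."""
    return count / population * 100_000

# Made-up rates for illustration only; NaN marks an MSA missing from the 2018 table.
crime_2018 = pd.Series({"MSA A": 410.0, "MSA B": float("nan")}, name="violent_per_100k")
crime_2017 = pd.Series({"MSA A": 395.0, "MSA B": 520.0}, name="violent_per_100k")

# Prefer 2018 values and fall back to 2017 only where 2018 is missing.
combined = crime_2018.fillna(crime_2017)
# Record which year each value came from so the substitution stays visible.
source_year = crime_2018.notna().map({True: 2018, False: 2017})

print(combined)
print(source_year)
print(rate_per_100k(count=1_250, population=500_000))
```

Keeping a `source_year` column alongside the rates makes it easy to list the substituted cities later, for example in the Data Limitations section.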

We chose to focus on the year 2018 because it is the most recent year covered by all of our datasets.
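
To reduce the Zillow data to a single 2018 value per MSA, one option is to average the monthly columns for that year. The sketch below assumes a wide layout with a `RegionName` column and monthly columns labeled by date strings such as `2018-01-31`; that layout, the helper `mean_zhvi_for_year`, and all values shown are assumptions for illustration, not a description of the downloaded file.

```python
import pandas as pd

def mean_zhvi_for_year(zhvi: pd.DataFrame, year: int) -> pd.Series:
    """Average the monthly home value columns of one year for each region."""
    # Assumption: monthly columns are labeled with dates that start with the year.
    year_cols = [c for c in zhvi.columns if str(c).startswith(f"{year}-")]
    return zhvi.set_index("RegionName")[year_cols].mean(axis=1)

# Made-up values in the assumed wide layout (not real Zillow figures).
zhvi = pd.DataFrame({
    "RegionName": ["Metro A", "Metro B"],
    "2017-12-31": [190_000, 390_000],
    "2018-01-31": [200_000, 400_000],
    "2018-02-28": [210_000, 420_000],
})
zhvi_2018 = mean_zhvi_for_year(zhvi, 2018)
print(zhvi_2018)
```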

In [2]:
import pandas as pd
# Row 0 of the education table holds the counts used for these percentages.
SFEducation = pd.read_csv("Michael's Data/SFEducation.csv")
row = SFEducation.iloc[0]
for label, column in [("High School diploma", "Regular high school diploma"),
                      ("Bachelor's degree", "Bachelor's degree"),
                      ("Master's degree", "Master's degree")]:
    print(f"{label}: {100 * row[column] / row['Total']:.2f}%")

High School diploma: 12.33%
Bachelor's degree: 23.30%
Master's degree: 10.26%


In [3]:
SFIncome = pd.read_csv("Michael's Data/SFIncome.csv")
SFIncome = SFIncome.rename(columns={"Household income (past 12 months, use ADJINC to adjust HINCP to constant dollars)": "Income"})
print("Mean income: $" + "{:.2f}".format(SFIncome["Income"].mean()))

Mean income: $135406.49


In [9]:
SFRentIncome = pd.read_csv("Michael's Data/SFRentIncome.csv")
SFRentIncome = SFRentIncome.rename(columns={"Gross rent as a percentage of household income past 12 months": "GRPIP"})
print("Average GRPIP: " + "{:.2f}".format(SFRentIncome["GRPIP"].mean()) + "%")

Average GRPIP: 14.18%


In [11]:
SFTax = pd.read_csv("Michael's Data/SFTax.csv")
SFTax = SFTax.rename(columns={"Unnamed: 1": "Tax"})
print("Mean tax: $" + "{:.2f}".format(SFTax["Tax"].mean()))

Mean tax: $104114.76


In [40]:
#dataset 1 is your tax data, dataset 2 is your education data
def meanTax(dataset1, dataset2):
    tax = pd.read_csv(dataset1)
    ed_for_purpose_of_population = pd.read_csv(dataset2)
    tax = tax.rename(columns={"Unnamed: 1": "Tax"})
    avg = 0
    for i in range(len(tax["Selected Geographies"])-1):
        avg += tax["Tax"][i]*ed_for_purpose_of_population["Total"][i]/ed_for_purpose_of_population["Total"][0]
    print("Mean tax: $" + "{:.2f}".format(avg))

#dataset 1 is your income data, dataset 2 is your education data
def meanInc(dataset1, dataset2):
    inc = pd.read_csv(dataset1)
    ed_for_purpose_of_population = pd.read_csv(dataset2)
    inc = inc.rename(columns={"Household income (past 12 months, use ADJINC to adjust HINCP to constant dollars)": "Income"})
    avg = 0
    for i in range(len(inc["Selected Geographies"])-1):
        avg += inc["Income"][i]*ed_for_purpose_of_population["Total"][i]/ed_for_purpose_of_population["Total"][0]
    print("Mean income: $" + "{:.2f}".format(avg))
    
#dataset 1 is your GRPIP data, dataset 2 is your education data
def meanGR(dataset1, dataset2):
    grpip = pd.read_csv(dataset1)
    ed_for_purpose_of_population = pd.read_csv(dataset2)
    grpip = grpip.rename(columns={"Gross rent as a percentage of household income past 12 months": "GRPIP"})
    avg = 0
    for i in range(len(grpip["Selected Geographies"])-1):
        avg += grpip["GRPIP"][i]*ed_for_purpose_of_population["Total"][i]/ed_for_purpose_of_population["Total"][0]
    print("Mean GRPIP: " + "{:.2f}".format(avg) + "%")
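
The three helpers above weight each geography by its population. As a cross-check, a population-weighted mean can also be computed with `numpy.average`, which divides by the sum of the weights. The sketch below is a minimal illustration, not a replacement for the functions above: the helper name `weighted_mean` and the sample numbers are hypothetical, and it assumes the values and populations list the same geographies in the same order, with no aggregate row included.

```python
import numpy as np

def weighted_mean(values, weights) -> float:
    """Weighted mean: sum(v * w) / sum(w)."""
    return float(np.average(values, weights=weights))

# Made-up per-geography taxes and populations, for illustration only.
taxes = [4_000.0, 6_000.0, 10_000.0]
populations = [1_000, 3_000, 1_000]

result = weighted_mean(taxes, populations)
print(f"Mean tax: ${result:.2f}")
```

A weighted mean always lies between the smallest and largest input value, which gives a quick sanity check on any result.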

In [41]:
meanTax("Michael's Data/SFTax.csv", "Michael's Data/SFEducation.csv")
meanInc("Michael's Data/SFIncome.csv", "Michael's Data/SFEducation.csv")
mean

Mean tax: $1824477.66
Mean income: $208877.41


---
## Data Description

TODO

---
## Data Limitations

TODO

---
## Exploratory Data Analysis

TODO

---
## Questions For Reviewers