# Housing Hypothesis
Do the housing prices in college towns weather economic recession better than those in non-college towns? In this project, I seek to find an answer to that question. 

Four data sources will be used to support or refute the hypothesis that the housing market in college towns is more resistant to economic downturns: 
- university_towns.txt - list of college towns
- state_abbreviations.csv - list of state abbreviations
- City_Zhvi_AllHomes.csv - all cities with monthly mean housing prices
- gdplev.xls - quarterly US GDP info 1947-2016q2, and annual GDP 1929-2015

These data sources will be loaded into dataframes, cleaned and formatted, then used to run a t-test on the hypothesis.

In [1]:
import numpy as np
import pandas as pd

### 1. Reformat university_towns dataset
The university_towns dataset is in a format that is not conducive to comparison with the other datasets. 
- It has one row for the state, followed by several rows for the college towns that are in that state. This needs converting to a two-column dataframe with state and city. 
- The states are spelled out but the housing data states use abbreviations. Long names need converting to two-letter abbreviations.

In [2]:
utowns = pd.read_table('data/university_towns.txt', header=None, names = ['RegionName'])
utowns.head()

Unnamed: 0,RegionName
0,Alabama[edit]
1,Auburn (Auburn University)[1]
2,Florence (University of North Alabama)
3,Jacksonville (Jacksonville State University)[2]
4,Livingston (University of West Alabama)[2]


#### 1.1 Functions to strip extraneous characters and abbreviate states
- `fill_state()` and `clean_region()` functions will be used to strip off unneeded characters for the State and RegionName columns.
- `abbreviate_states()` is an all-purpose function to convert long state names to two-letter abbreviations.
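
As a rough illustration of the abbreviation step, the sketch below maps long state names through a small dictionary with `Series.map`. The two-entry `state_dict` and the `Abbrev` column are hypothetical stand-ins; the notebook builds its lookup from data/state_abbreviations.csv, and this sketch assumes (as the upper-casing in `abbreviate_states()` suggests) that the lookup keys are upper-case.

```python
import pandas as pd

# Hypothetical two-entry lookup; the notebook builds its dictionary from
# data/state_abbreviations.csv instead.
state_dict = {"ALABAMA": "AL", "ALASKA": "AK"}

towns = pd.DataFrame({"State": ["Alabama", "Alaska", "Atlantis"]})

# Upper-case the names so they match the dictionary keys, then look them up.
towns["Abbrev"] = towns["State"].str.upper().map(state_dict)

# Names missing from the lookup come back as NaN, so they are easy to audit.
unmapped = towns.loc[towns["Abbrev"].isna(), "State"].tolist()
print(towns)
print("Unmapped:", unmapped)
```

Checking for unmapped names after the lookup is a cheap way to catch spelling differences between the two files before any merge.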

In [3]:
def fill_state(row):
    """
    (dataframe row) -> str

    Return name of state extracted from RegionName column. 
    Expects that the dataframe was created from university_towns.txt file. 
    In that file, the first record for each region is the name of the state
    with "[edit]" appended. The regions in the following rows will be 
    regions for that state. 
    The purpose of this function is to support the creation of a new
    column with state values. 
    """
    rowval = row.loc['RegionName']
    sep = '['
    if 'edit' in rowval:
        global cur_state
        cur_state = rowval
    return cur_state.split(sep, 1)[0]

def clean_region(row):
    """
    (dataframe row -> str)

    Return name of the city (RegionName) stripped of its extraneous
    characters. 
    Expects that the dataframe was created from university_towns.txt file. 
    In that file, the RegionName values have extraneous characters starting
    with ' ('. 
    """
    rowval = row.loc['RegionName']
    sep = ' ('
    return rowval.split(sep, 1)[0]

def abbreviate_states(df):
    """
    (dataframe -> dataframe)

    Returns dataframe with abbreviated state names (eg Alabama -> AL).
    Dataframe must have a column named 'State'.
    Original long-form of state name will be replaced with two-letter
    abbreviation.
    """
    import csv
    # read the lookup table; the file is closed as soon as the dict is built
    with open('data/state_abbreviations.csv', 'r', newline='') as f:
        state_dict = {k: v for k, v in csv.reader(f)}
    
    df['State'] = df['State'].str.upper()
    df['State'] = df['State'].map(state_dict)
    return df

#### 1.2 Create university towns dataframe
The method used is:
1. Create a clean State column so that each city row has a state.
2. Remove rows that are state "header" rows with no city information.
3. Remove extraneous characters from the RegionName column.
4. Convert long state names to abbreviations.
5. Move State column to the first column position.
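
Steps 1 to 3 can also be sketched without a global variable by forward-filling the state header rows. This is an alternative illustration on a five-row toy sample, not the approach used in the notebook's own cell; the sample rows and the names `raw`, `is_header`, and `towns` are made up for the example.

```python
import pandas as pd

# Toy sample in the same layout as university_towns.txt.
raw = pd.DataFrame({"RegionName": [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
]})

# Header rows are the ones that end with "[edit]".
is_header = raw["RegionName"].str.endswith("[edit]")

# Keep the state name only on header rows, then carry it down to the towns below.
state = (
    raw["RegionName"]
    .where(is_header)
    .str.replace("[edit]", "", regex=False)
    .ffill()
)

# Drop the header rows and cut each town name at the first " (".
towns = pd.DataFrame({
    "State": state[~is_header],
    "RegionName": raw.loc[~is_header, "RegionName"].str.partition(" (")[0],
})
print(towns)
```

`str.partition` treats the separator as a literal string, which avoids having to escape the parenthesis as a regular expression.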

In [4]:
# create a clean state column
cur_state = ''
utowns['State'] = utowns.apply(fill_state, axis = 1)

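# remove rows that are state 'headers'; copy so later column assignments
# act on an independent dataframe rather than a view of the original
utowns = utowns[~utowns['RegionName'].str.endswith('[edit]')].copy()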

# clean regionname data values
utowns['RegionName'] = utowns.apply(clean_region, axis = 1)

# convert state names to abbreviated names
utowns = abbreviate_states(utowns)

# move state column to front
states = utowns['State']
utowns.drop(labels=['State'], axis=1,inplace = True)
utowns.insert(0, 'State', states)

utowns.head()

Unnamed: 0,State,RegionName
1,AL,Auburn
2,AL,Florence
3,AL,Jacksonville
4,AL,Livingston
5,AL,Montevallo


### 2. Reformat housing market dataset
The dataset with housing market data is monthly, from April 1996 to August 2016. 

In [5]:
housing = pd.read_csv('data/City_Zhvi_AllHomes.csv')
housing.head()

Unnamed: 0,RegionID,RegionName,State,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,1996-07,...,2015-11,2015-12,2016-01,2016-02,2016-03,2016-04,2016-05,2016-06,2016-07,2016-08
0,6181,New York,NY,New York,Queens,1,,,,,...,573600,576200,578400,582200,588000,592200,592500,590200,588000,586400
1,12447,Los Angeles,CA,Los Angeles-Long Beach-Anaheim,Los Angeles,2,155000.0,154600.0,154400.0,154200.0,...,558200,560800,562800,565600,569700,574000,577800,580600,583000,585100
2,17426,Chicago,IL,Chicago,Cook,3,109700.0,109400.0,109300.0,109300.0,...,207800,206900,206200,205800,206200,207300,208200,209100,211000,213000
3,13271,Philadelphia,PA,Philadelphia,Philadelphia,4,50000.0,49900.0,49600.0,49400.0,...,122300,121600,121800,123300,125200,126400,127000,127400,128300,129100
4,40326,Phoenix,AZ,Phoenix,Maricopa,5,87200.0,87700.0,88200.0,88400.0,...,183800,185300,186600,188000,189100,190200,191300,192800,194500,195900


#### 2.1 Convert monthly data into quarterly columns
To compare this against the GDP dataset, the monthly data must be converted to quarterly data. Only the period from January 2000 to August 2016 is used.
1. Define the start and end of the time period we want.
2. Create quarterly columns that average each quarter's monthly market values.
3. Remove all columns besides State, RegionName, and the quarterly columns.
4. Index the dataframe
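
The monthly-to-quarterly conversion can be sketched on a toy frame by labelling each monthly column with its quarter and averaging within each label. This is a compact illustration of the idea rather than the notebook's own loop; the town names and values are made up.

```python
import pandas as pd

# Toy monthly values: three months of 2000q1 and two months of 2000q2.
monthly = pd.DataFrame(
    {
        "2000-01": [100.0, 10.0],
        "2000-02": [110.0, 20.0],
        "2000-03": [120.0, 30.0],
        "2000-04": [130.0, 40.0],
        "2000-05": [140.0, 50.0],
    },
    index=["Town A", "Town B"],
)

# Turn each "YYYY-MM" column name into a "YYYYqN" label.
periods = pd.PeriodIndex(monthly.columns, freq="M")
labels = pd.Index([f"{p.year}q{p.quarter}" for p in periods])

# Transpose so months are rows, average the months within each quarter,
# then transpose back so each quarter is a column again.
quarterly = monthly.T.groupby(labels).mean().T
print(quarterly)
```

A quarter with fewer than three months available (2000q2 here) is averaged over the months that exist, so a partial quarter at the end of the data is not distorted.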

In [6]:
start = housing.columns.get_loc('2000-01')
end = housing.columns.get_loc('2016-08')

# snapshot the monthly column names before any quarterly columns are appended,
# so the final quarter cannot pick up a newly created column by position
month_cols = list(housing.columns[start:end + 1])

# iterate across every third monthly column from 2000-01 to 2016-08
columns_to_keep = ['RegionName', 'State']
for i in range(0, len(month_cols), 3):
    yr, mon = month_cols[i].split('-')  # get year and month
    q = (int(mon) - 1) // 3 + 1         # calculate quarter
    col_name = yr + 'q' + str(q)        # make column name for yr/qtr

    # months available for this quarter (2016q3 only has July and August)
    quarter_months = month_cols[i:i + 3]

    # average the quarter; skipna=False keeps NaN when any month is missing
    housing[col_name] = housing[quarter_months].astype(float).mean(axis=1, skipna=False)
    columns_to_keep.append(col_name)

# Keep quarterly data columns
housing = housing[columns_to_keep]

housing.head()

Unnamed: 0,RegionName,State,2000q1,2000q2,2000q3,2000q4,2001q1,2001q2,2001q3,2001q4,...,2014q2,2014q3,2014q4,2015q1,2015q2,2015q3,2015q4,2016q1,2016q2,2016q3
0,New York,NY,,,,,,,,,...,515466.666667,522800.0,528066.666667,532266.666667,540800.0,557200.0,572833.333333,582866.666667,591633.333333,587200.0
1,Los Angeles,CA,207066.666667,214466.666667,220966.666667,226166.666667,233000.0,239100.0,245066.666667,253033.333333,...,498033.333333,509066.666667,518866.666667,528800.0,538166.666667,547266.666667,557733.333333,566033.333333,577466.666667,584050.0
2,Chicago,IL,138400.0,143633.333333,147866.666667,152133.333333,156933.333333,161800.0,166400.0,170433.333333,...,192633.333333,195766.666667,201266.666667,201066.666667,206033.333333,208300.0,207900.0,206066.666667,208200.0,212000.0
3,Philadelphia,PA,53000.0,53633.333333,54133.333333,54700.0,55333.333333,55533.333333,56266.666667,57533.333333,...,113733.333333,115300.0,115666.666667,116200.0,117966.666667,121233.333333,122200.0,123433.333333,126933.333333,128700.0
4,Phoenix,AZ,111833.333333,114366.666667,116000.0,117400.0,119600.0,121566.666667,122700.0,124300.0,...,164266.666667,165366.666667,168500.0,171533.333333,174166.666667,179066.666667,183833.333333,187900.0,191433.333333,195200.0


### 3. Reformat GDP dataset
The GDP data has both yearly and quarterly data. The quarterly data will be selected for the dataframe and then analyzed to find the start, end, and bottom of the economic recession.

In this section, the functions are defined in a separate file (get_recession_period.py), then are called within the Jupyter cells.
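
The contents of get_recession_period.py are not shown here, so the sketch below is only one plausible way to locate a recession, under an assumed definition: the recession starts with the first of two consecutive quarterly GDP declines, ends with the second of two consecutive quarterly increases, and bottoms out at the lowest GDP in between. The function name `find_recession` and the toy series are hypothetical and are not taken from that module.

```python
import pandas as pd

def find_recession(gdp):
    """Return (start, bottom, end) quarter labels, or None if none is found.

    Assumed definition (not taken from get_recession_period.py):
    start  = first quarter of two consecutive quarterly declines,
    end    = second quarter of two consecutive quarterly increases,
    bottom = quarter with the lowest GDP between start and end.
    """
    values = gdp.tolist()
    quarters = gdp.index.tolist()

    # look for two declines in a row
    start = None
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i + 1] < values[i]:
            start = i
            break
    if start is None:
        return None

    # look for two increases in a row after the start
    end = None
    for j in range(start + 2, len(values)):
        if values[j] > values[j - 1] and values[j - 1] > values[j - 2]:
            end = j
            break
    if end is None:
        return None

    # lowest GDP value between start and end, inclusive
    bottom = min(range(start, end + 1), key=lambda k: values[k])
    return quarters[start], quarters[bottom], quarters[end]

# Made-up GDP levels, indexed by quarter label.
toy_gdp = pd.Series(
    [100, 102, 101, 99, 98, 99, 101, 103],
    index=["2000q1", "2000q2", "2000q3", "2000q4",
           "2001q1", "2001q2", "2001q3", "2001q4"],
)
result = find_recession(toy_gdp)
print(result)
```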

In [9]:
import get_recession_period as grp
gdp = pd.read_excel('data/gdplev.xls')
gdp = grp.clean_gdp(gdp)
gdp.head()

Unnamed: 0,Quarter,GDP Current,GDP Chained,GDP Change
0,2000q1,10031.0,12359.1,
1,2000q2,10278.3,12592.5,247.3
2,2000q3,10357.4,12607.7,79.1
3,2000q4,10472.3,12679.3,114.9
4,2001q1,10508.1,12643.3,35.8


In [10]:
rec_start = grp.get_recession_start(gdp)
rec_end = grp.get_recession_end(gdp, rec_start)
rec_bottom = grp.get_recession_bottom(gdp, rec_start, rec_end)
print('Recession start:  {}\nRecession end:    {}'.format(rec_start, rec_end))
print('Recession bottom: {}'.format(rec_bottom))

Recession start:  2008q3
Recession end:    2009q4
Recession bottom: 2009q2


### 4. Run t-test on housing hypothesis
1. Create new data showing the decline or growth of housing prices between the recession start and the recession bottom.
2. Run a t-test comparing the university town values to the non-university town values. Report whether the null hypothesis (that the two groups have the same mean price ratio) can be rejected, along with the p-value.
3. Produce three variables: different, p, and better:
   - different=True if the p-value is below 0.01 (we reject the null hypothesis)
   - different=False otherwise (we cannot reject the null hypothesis).
   - The variable 'p' is the exact p-value returned from scipy.stats.ttest_ind().
   - The value for 'better' is either "university town" or "non-university town" depending on which has a lower mean price ratio (which is equivalent to a smaller market loss).
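
The three variables can be sketched on made-up price ratios as follows. The two lists are invented for illustration and say nothing about the real data; only `scipy.stats.ttest_ind` and the 0.01 threshold come from the description above.

```python
from scipy.stats import ttest_ind

# Invented price ratios (start price / bottom price) for two small groups.
uni_ratios = [1.01, 1.02, 1.03, 1.02, 1.01, 1.02]
non_uni_ratios = [1.08, 1.10, 1.09, 1.11, 1.07, 1.10]

# Two-sample t-test; the second element of the result is the p-value.
p = ttest_ind(uni_ratios, non_uni_ratios)[1]

# Reject the null hypothesis of equal means only below the 0.01 threshold.
different = bool(p < 0.01)

# A lower mean ratio means prices fell less between start and bottom.
uni_mean = sum(uni_ratios) / len(uni_ratios)
non_uni_mean = sum(non_uni_ratios) / len(non_uni_ratios)
better = "university town" if uni_mean < non_uni_mean else "non-university town"

print(different, p, better)
```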

#### 4.1 Keep only the quarters corresponding to time of economic recession

In [11]:
cols = [x for x in housing.columns if rec_start <= x <= rec_bottom]
cols[0:0] = ['State', 'RegionName']
housing = housing[cols]
housing.head()

Unnamed: 0,State,RegionName,2008q3,2008q4,2009q1,2009q2
0,NY,New York,499766.666667,487933.333333,477733.333333,465833.333333
1,CA,Los Angeles,469500.0,443966.666667,426266.666667,413900.0
2,IL,Chicago,232000.0,227033.333333,223766.666667,219700.0
3,PA,Philadelphia,116933.333333,115866.666667,116200.0,116166.666667
4,AZ,Phoenix,193766.666667,183333.333333,177566.666667,168233.333333


#### 4.2 Create a PriceRatio column based on market data at start and bottom of recession.
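
As a worked example of how to read the ratio, the sketch below uses the Los Angeles values for 2008q3 and 2009q2 from the table in section 4.1. A ratio above 1 means prices fell between the start and the bottom; the conversion to a fractional decline is added here for interpretation and is not part of the notebook's own computation.

```python
# Los Angeles quarterly means taken from the table in section 4.1.
price_at_start = 469500.0    # 2008q3
price_at_bottom = 413900.0   # 2009q2

# Ratio of start price to bottom price; above 1 means prices dropped.
price_ratio = price_at_start / price_at_bottom

# The same change expressed as the fraction of value lost since the start.
fraction_lost = 1 - price_at_bottom / price_at_start

print(round(price_ratio, 6), round(fraction_lost, 4))
```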

In [12]:
housing['PriceRatio'] = housing[rec_start].div(housing[rec_bottom])
housing.head()

Unnamed: 0,State,RegionName,2008q3,2008q4,2009q1,2009q2,PriceRatio
0,NY,New York,499766.666667,487933.333333,477733.333333,465833.333333,1.072844
1,CA,Los Angeles,469500.0,443966.666667,426266.666667,413900.0,1.134332
2,IL,Chicago,232000.0,227033.333333,223766.666667,219700.0,1.055985
3,PA,Philadelphia,116933.333333,115866.666667,116200.0,116166.666667,1.0066
4,AZ,Phoenix,193766.666667,183333.333333,177566.666667,168233.333333,1.151773


#### 4.3 Create a column combining State and RegionName
This will make for easy merging of datasets. <BR>
Do this for both the housing and the university towns datasets.
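
An alternative to a concatenated key, sketched here on toy frames, is to merge on both columns directly and let the `indicator` option of `DataFrame.merge` flag which rows matched. The frames `housing_toy` and `utowns_toy` and their values are hypothetical; this is not the approach the following cells take.

```python
import pandas as pd

housing_toy = pd.DataFrame({
    "State": ["AL", "AL", "AZ"],
    "RegionName": ["Auburn", "Mobile", "Tempe"],
    "PriceRatio": [1.01, 1.05, 1.10],
})
utowns_toy = pd.DataFrame({
    "State": ["AL", "AZ"],
    "RegionName": ["Auburn", "Tempe"],
})

# Left merge keeps every housing row; "_merge" records whether a matching
# university town was found. Duplicate towns are dropped first so that no
# housing row is repeated.
flagged = housing_toy.merge(
    utowns_toy.drop_duplicates(),
    on=["State", "RegionName"],
    how="left",
    indicator=True,
)

is_uni = flagged["_merge"] == "both"
uni = flagged.loc[is_uni, "RegionName"].tolist()
non_uni = flagged.loc[~is_uni, "RegionName"].tolist()
print(uni, non_uni)
```

Merging on the two columns also avoids the `_x` and `_y` suffixes that appear when the same column names exist on both sides but are not part of the join key.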

In [13]:
# build column for easy merge
housing['StateRegion'] = housing['State'] + housing['RegionName']
housing.head()

Unnamed: 0,State,RegionName,2008q3,2008q4,2009q1,2009q2,PriceRatio,StateRegion
0,NY,New York,499766.666667,487933.333333,477733.333333,465833.333333,1.072844,NYNew York
1,CA,Los Angeles,469500.0,443966.666667,426266.666667,413900.0,1.134332,CALos Angeles
2,IL,Chicago,232000.0,227033.333333,223766.666667,219700.0,1.055985,ILChicago
3,PA,Philadelphia,116933.333333,115866.666667,116200.0,116166.666667,1.0066,PAPhiladelphia
4,AZ,Phoenix,193766.666667,183333.333333,177566.666667,168233.333333,1.151773,AZPhoenix


In [14]:
utowns['StateRegion'] = utowns['State'] + utowns['RegionName']
utowns.head()

Unnamed: 0,State,RegionName,StateRegion
1,AL,Auburn,ALAuburn
2,AL,Florence,ALFlorence
3,AL,Jacksonville,ALJacksonville
4,AL,Livingston,ALLivingston
5,AL,Montevallo,ALMontevallo


#### 4.4 Create a dataframe containing only college town market data...

In [15]:
# create dataframe for housing in college towns
housing_uni = pd.merge(utowns, housing, left_on='StateRegion', right_on='StateRegion')
housing_uni.head()

Unnamed: 0,State_x,RegionName_x,StateRegion,State_y,RegionName_y,2008q3,2008q4,2009q1,2009q2,PriceRatio
0,AL,Montevallo,ALMontevallo,AL,Montevallo,127266.666667,125800.0,124033.333333,125200.0,1.016507
1,AL,Tuscaloosa,ALTuscaloosa,AL,Tuscaloosa,139600.0,140100.0,139133.333333,136933.333333,1.019474
2,AK,Fairbanks,AKFairbanks,AK,Fairbanks,249966.666667,242900.0,234966.666667,225833.333333,1.106863
3,AZ,Flagstaff,AZFlagstaff,AZ,Flagstaff,322633.333333,318733.333333,309400.0,299600.0,1.07688
4,AZ,Tempe,AZTempe,AZ,Tempe,228133.333333,219766.666667,214666.666667,207500.0,1.099438


#### ... and a dataframe for non-college town market data.

In [16]:
# create dataframe for housing in non-college towns
housing_non_uni = housing[(~housing.StateRegion.isin(housing_uni.StateRegion))]
housing_non_uni.head()

Unnamed: 0,State,RegionName,2008q3,2008q4,2009q1,2009q2,PriceRatio,StateRegion
0,NY,New York,499766.666667,487933.333333,477733.333333,465833.333333,1.072844,NYNew York
1,CA,Los Angeles,469500.0,443966.666667,426266.666667,413900.0,1.134332,CALos Angeles
2,IL,Chicago,232000.0,227033.333333,223766.666667,219700.0,1.055985,ILChicago
3,PA,Philadelphia,116933.333333,115866.666667,116200.0,116166.666667,1.0066,PAPhiladelphia
4,AZ,Phoenix,193766.666667,183333.333333,177566.666667,168233.333333,1.151773,AZPhoenix


#### 4.5 Run t-test to test null hypothesis

In [17]:
from scipy.stats import ttest_ind

p = ttest_ind(housing_uni.dropna()['PriceRatio'], housing_non_uni.dropna()['PriceRatio'])[1]
different = p < 0.01  # check null hypothesis

better = "Non-university town"
if housing_uni['PriceRatio'].mean() < housing_non_uni['PriceRatio'].mean():
    better = "University town"
    
print("p-value of the t-test:", p)
print("Null hypothesis can be rejected:", different)
print(better, "housing markets weather economic recession better")

p-value of the t-test: 0.005992806213991878
Null hypothesis can be rejected: True
University town housing markets weather economic recession better
