# Planning Methods: Part II, Spring 2021

# Lab 5: Logistic Regressions

**About This Lab**
* We will be running through this notebook together. If you have a clarifying question or other question of broad interest, feel free to interrupt or use a pause to unmute and ask it! If you have a question that may result in a one-on-one breakout room (think: detailed inquiry, conceptual question, or help debugging), please ask it in the chat!
* We recognize learning Python via Zoom comes with its challenges and that there are many modes of learning. Please go with what works best for you. That might be printing out the Jupyter notebook, duplicating it so that you can refer to the original, or working directly in it. It's up to you! There isn't a single right way.
* This lab requires that you download the following files and place them in the same directory as this Jupyter notebook:
    * `clean_property_data.csv`
    * `properties_wtenancy.csv`
* The data cover properties that were sold through a real estate site (like Zillow) between 2001 and 2006 in Bogota. They include apartments and houses, characteristics of the structure like area and number of bathrooms, and characteristics of the neighborhood like density and a proxy for neighborhood income called SES.

## Objectives
By the end of this lab, you will have reviewed how to:
>1. Introduce robust errors

You will also learn how to:
>1. Run a logistic regression
>2. Analyze odds ratios
>3. Plot predicted probabilities

## 1 Import packages, load & clean data

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.discrete.discrete_model import Logit
from statsmodels.discrete.discrete_model import MNLogit

In [None]:
raw = pd.read_csv('clean_property_data.csv')

# create subdataframe
var_list = ['price_000','pop_dens','ses','house','area_m2','num_bath','pcn_green','thefts', 'year']
data = raw[var_list].copy()
data.head()

### 1.1 Recode variables
Imagine that you want to see how prices change over time while controlling for property characteristics. Controlling for property characteristics is important because housing markets change over time. In one part of the year single-family homes may be in demand; in another, apartments. In another year there may be a glut of small apartments and studios, so prices for those may go down. Using price per square foot to track market changes without adjusting for property attributes, as many realtors do, is inappropriate. Tracking prices over time while controlling for property attributes is not unusual. It is how price tracking apps like Zillow and Trulia work. One way to track prices over time is to create dummy variables per unit of time (months or years, for example). This is what we will do next, after first recoding SES into a dummy variable.

In [None]:
# create SES dummy variable: 1 if ses >= 5, otherwise 0
data['high_ses'] = np.where(data['ses']>=5, 1, 0)

Now, let's create dummy variables for the year the property was sold. There are 6 unique years, and we can create a new dummy column for each one. Doing this by hand would be tedious, but Python makes it easier for us!

In [None]:
# look at the unique values in the 'year' variable
data['year'].unique()

In [None]:
# create dummies from categorical variable year
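# dtype=int gives 0/1 columns; recent pandas versions return True/False by default,
# which statsmodels may not accept alongside numeric columns
dummies = pd.get_dummies(data['year'], prefix='yr', dtype=int)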
dummies.head()

In [None]:
# append the dummies to our larger dataframe
data = pd.concat([data, dummies], axis = 1)
data.head()
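
When dummies like these enter a regression that also has an intercept, one category has to be left out as the base. The sketch below uses a small made-up `toy` dataframe (not the lab data) to show why, and shows that `drop_first=True` is one way to leave out the earliest year automatically.

```python
import pandas as pd

# hypothetical toy data; the values only mimic the lab's sale years
toy = pd.DataFrame({'year': [2001, 2002, 2003, 2001, 2003]})

# full set of dummies: one column per year
full = pd.get_dummies(toy['year'], prefix='yr', dtype=int)

# every row of the full set sums to 1, which duplicates an intercept column
print(full.sum(axis=1).unique())

# drop_first=True omits the first category (2001), which becomes the base year
reduced = pd.get_dummies(toy['year'], prefix='yr', dtype=int, drop_first=True)
print(list(reduced.columns))
```

Keeping all the dummy columns in the dataframe and choosing which to exclude when listing the independent variables works just as well, and makes it easy to switch the base year later.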

## 2 Robust errors in multivariable regression
One way to mitigate the impacts of heteroskedasticity is to use robust standard errors. Remember from last lab and class what heteroskedasticity is: the variance of the errors is not constant, so the residuals do not form an even band around zero when plotted against the fitted values. A consequence of heteroskedasticity is that the conventional standard errors are biased, often too small, so you may inappropriately reject (or fail to reject) the null hypothesis that a coefficient is zero. Robust errors are not required; they are simply an option to mitigate this problem if it is present.

### 2.1 Without robust errors

In [None]:
# define independent variables
ind_var = ['high_ses', 'house', 'area_m2', 'num_bath', 'pcn_green', 
           'thefts','yr_2002','yr_2003','yr_2004','yr_2005','yr_2006'] 
# note that the year variable is categorical so we need to exclude one to prevent collinearity 
# within our model - we will exclude year 2001 - we chose to have the earlier year be our base year

x = data[ind_var].assign(Intercept = 1) # independent variables
y = data['price_000'] # dependent variable

model = sm.OLS(y, x).fit()
# save the results as "model" - this will be useful for other functions below.

model.summary2()

### 2.2 With robust errors
Take a look at the output - what's changed? What's stayed the same?

In [None]:
model = sm.OLS(y, x).fit(cov_type='HC0') # cov_type='HC0' requests heteroskedasticity-robust (White) standard errors
model.summary2()

## 3 Logistic regression (aka logit model)
Remember: logistic regressions are used when the dependent variable is categorical. The simplest case is a binary variable that takes only the values 0 or 1. For this example, we are going to transform price into a dummy variable, using the median as the cutoff, and use it as our dependent variable.
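
As a reminder of the mechanics, a logit model estimates a linear equation for the log-odds of the outcome and converts it to a probability with the logistic function. The sketch below is only an illustration: `logistic` is a helper defined here (not a statsmodels function) and the values in `z` are hypothetical.

```python
import numpy as np

def logistic(z):
    """Map a linear predictor z (the log-odds) to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical log-odds values
z = np.array([-2.0, 0.0, 2.0])
p = logistic(z)
print(p)

# going the other way: the log of the odds p / (1 - p) recovers z
print(np.log(p / (1 - p)))
```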

### 3.1 Create dependent variable

In [None]:
# identify dummy threshold
price_median = data['price_000'].median()
print(price_median)

# create dummy dependent variable
data['high_price'] = np.where(data['price_000']>price_median, 1, 0)
data.head()

### 3.2 Run logit model

In [None]:
y = data['high_price'] # dependent variable - it's a dummy!
x = data[ind_var].assign(Intercept = 1) # independent variables - same list as before

# define and run logit model
logit_model = Logit(y, x).fit()
logit_model.summary2()

### 3.3 Display odds ratios

In [None]:
# odds ratios
or_table = np.exp(logit_model.conf_int()) # exponentiate confidence intervals
or_table['Odds Ratio'] = np.exp(logit_model.params) # exponentiate coefficients

or_table.columns = ['2.5%', '97.5%', 'Odds Ratio'] # name columns
or_table
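
One way to read an odds ratio is as a percent change in the odds for a one-unit increase in the variable. The sketch below uses a hypothetical coefficient `beta` (not taken from the model output above) and also shows that a change in odds is not the same as a change in probability.

```python
import numpy as np

# hypothetical logit coefficient, for illustration only
beta = 0.40

odds_ratio = np.exp(beta)
pct_change_in_odds = (odds_ratio - 1) * 100
print(round(odds_ratio, 3), round(pct_change_in_odds, 1))

# the odds ratio multiplies the odds, not the probability:
# start from an assumed baseline probability of 0.50 (odds of 1)
p0 = 0.50
odds0 = p0 / (1 - p0)
odds1 = odds0 * odds_ratio
p1 = odds1 / (1 + odds1)
print(round(p1, 3))
```

An odds ratio below 1 works the same way and gives a negative percent change in the odds.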

## 4 Plot predicted probabilities

In [None]:
data['num_bath'].unique()

In [None]:
# predicted probabilities
df_predict = data.copy()
df_predict['pred_high_price'] = logit_model.predict()

# plot probabilities by key independent variable
df_predict2 = df_predict.groupby('num_bath')[['pred_high_price']].mean()
df_predict2.plot()

# plot with labels
plt.title('Predicted Probability of High Price Based on Number of Bathrooms')
plt.xlabel('Number of Bathrooms')
plt.ylabel('Probability')
positions = (1, 2, 3, 4, 5)
labels = ('1', '2', '3', '4', '5')
plt.xticks(positions, labels)
legend = ['Pr(High Price)']
plt.legend(legend)
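
The plot above averages the predicted probabilities of the observed properties within each bathroom count, so other characteristics vary along with the number of bathrooms. An alternative is to predict for hypothetical properties that differ only in the variable of interest. The sketch below shows that idea on simulated data; the dataframe `sim`, its two variables, and the coefficients used to generate it are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.discrete.discrete_model import Logit

rng = np.random.default_rng(1)
n = 400

# simulated properties (not the lab data)
sim = pd.DataFrame({
    'num_bath': rng.integers(1, 6, size=n),   # 1 to 5 bathrooms
    'area_m2': rng.uniform(40, 200, size=n),
})
z = -4 + 0.8 * sim['num_bath'] + 0.01 * sim['area_m2']   # assumed log-odds
sim['high_price'] = rng.binomial(1, 1 / (1 + np.exp(-z)))

X = sim[['num_bath', 'area_m2']].assign(Intercept=1)
fit = Logit(sim['high_price'], X).fit(disp=False)

# hypothetical properties: bathrooms vary, area is held at its mean
grid = pd.DataFrame({
    'num_bath': [1, 2, 3, 4, 5],
    'area_m2': sim['area_m2'].mean(),
    'Intercept': 1,
})
# columns must be in the same order as in X
grid['pred'] = fit.predict(grid[['num_bath', 'area_m2', 'Intercept']])
print(grid[['num_bath', 'pred']])
```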

## 5 Multinomial logistic regression
In this section, we're using a new version of the Bogota dataset with a tenancy variable in which 0 = renter-occupied, 1 = owner-occupied, and 2 = vacant.

In [None]:
data2 = pd.read_csv('properties_wtenancy.csv')

In [None]:
# define independent variables
ind_var = ['price', 'SES', 'area', 'dist']

### 5.1 Run multinomial logit model

In [None]:
y = data2['tenancy'] # dependent variable - it's a categorical variable!
x = data2[ind_var].assign(Intercept = 1) # independent variables - new for this dataset

# define and run multinomial logit model
mnlogit_model = MNLogit(y, x).fit()
summary = mnlogit_model.summary2()
summary
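
When reading this output, keep in mind that `MNLogit` treats the lowest category of the dependent variable as the reference, so a three-category outcome produces two sets of coefficients, each relative to that reference. The sketch below checks this on simulated data; the dataframe `sim`, the regressor `x1`, and the generating coefficients are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.discrete.discrete_model import MNLogit

rng = np.random.default_rng(2)
n = 600
sim = pd.DataFrame({'x1': rng.normal(size=n)})

# assumed linear predictors for categories 0, 1, 2 (category 0 is the reference)
u = np.column_stack([np.zeros(n), 0.5 + 1.0 * sim['x1'], -0.5 - 1.0 * sim['x1']])
p = np.exp(u) / np.exp(u).sum(axis=1, keepdims=True)
sim['tenancy'] = [rng.choice(3, p=row) for row in p]

X = sim[['x1']].assign(Intercept=1)
fit = MNLogit(sim['tenancy'], X).fit(disp=False)

print(fit.params.shape)     # rows: regressors; columns: one equation per non-reference category
print(fit.params)           # first column: category 1 vs 0; second column: category 2 vs 0
print(fit.predict().shape)  # one probability per observation and per category, reference included
```

The simulated effect of `x1` is positive for category 1 and negative for category 2, which makes it easy to confirm which column of `params` belongs to which category.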

### 5.2 Display odds ratios

In [None]:
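# extract coefficients and confidence interval values for the first equation
# MNLogit uses the lowest category (tenancy = 0, renter-occupied) as the reference,
# so this equation compares tenancy = 1 (owner-occupied) with renter-occupied
# (summary2() may label this table 'tenancy = 0' because it numbers equations from 0)
df = summary.tables[1]
conf_int = df[['[0.025','0.975]']].to_numpy()
coef = mnlogit_model.params.iloc[:,0].to_numpy()

In [None]:
# odds ratios for tenancy = 1 (owner-occupied) relative to tenancy = 0 (renter-occupied)
or_table = pd.DataFrame(data = np.exp(conf_int))
or_table['Odds Ratio'] = np.exp(coef)
or_table.columns = ['2.5%', '97.5%', 'Odds Ratio']
or_table.index = ['price', 'SES', 'area', 'dist', 'Intercept']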

In [None]:
or_table

In [None]:
# extract coefficients and confidence interval values for the second equation,
# which compares tenancy = 2 (vacant) with the reference, tenancy = 0 (renter-occupied)
df = summary.tables[2]
conf_int = df[['[0.025','0.975]']].to_numpy()
coef = mnlogit_model.params.iloc[:,1].to_numpy()

In [None]:
# odds ratios for tenancy = 2 (vacant) relative to tenancy = 0 (renter-occupied)
or_table = pd.DataFrame(data = np.exp(conf_int))
or_table['Odds Ratio'] = np.exp(coef)
or_table.columns = ['2.5%', '97.5%', 'Odds Ratio']
or_table.index = ['price', 'SES', 'area', 'dist', 'Intercept']

In [None]:
or_table

### 5.3 Plot predicted probabilities

In [None]:
# predicted probabilities for y = 0 (renter-occupied)
df_predict = data2.copy()
df_predict['pred_renter'] = mnlogit_model.predict()[:,0]

# plot probabilities by key independent variable
df_predict2 = df_predict.groupby('SES')[['pred_renter']].mean()
df_predict2.plot()

# plot with labels
plt.title('Predicted Probability of Renting Based on SES')
plt.xlabel('SES')
plt.ylabel('Probability')
positions = (1, 2, 3, 4, 5, 6)
labels = ('1', '2', '3', '4', '5', '6')
plt.xticks(positions, labels)
legend = ['Pr(renter-occupied)']
plt.legend(legend)

In [None]:
# predicted probabilities for y = 1 (owner-occupied)
df_predict = data2.copy()
df_predict['pred_owner'] = mnlogit_model.predict()[:,1]

# plot probabilities by key independent variable
df_predict2 = df_predict.groupby('SES')[['pred_owner']].mean()
df_predict2.plot()

# plot with labels
plt.title('Predicted Probability of Owning Based on SES')
plt.xlabel('SES')
plt.ylabel('Probability')
positions = (1, 2, 3, 4, 5, 6)
labels = ('1', '2', '3', '4', '5', '6')
plt.xticks(positions, labels)
legend = ['Pr(owner-occupied)']
plt.legend(legend)