# Planning Methods: Part II, Spring 2021

# Lab 3: Correlations and Regressions

**About This Lab**
* We will be running through this notebook together. If you have a clarifying question or other question of broad interest, feel free to interrupt or use a pause to unmute and ask it! If you have a question that may result in a one-on-one breakout room (think: detailed inquiry, conceptual question, or help debugging), please ask it in the chat!
* We recognize learning Python via Zoom comes with its challenges and that there are many modes of learning. Please go with what works best for you. That might be printing out the Jupyter notebook, duplicating it so that you can refer to the original, or working directly in it. It's up to you! There isn't a single right way.
* This lab requires that you download the following file and place it in the same directory as this Jupyter notebook:
    * `clean_property_data.csv`
* This dataset includes properties that were sold through a real estate site (like Zillow) between 2001 and 2006 in Bogota. It covers apartments and houses, with characteristics of the structure (such as area and number of bathrooms) and characteristics of the neighborhood (such as density and SES, a proxy for neighborhood income).

## Objectives
By the end of this lab, you will have reviewed how to:
>1. Call a correlation matrix
>2. Run a bivariate linear regression

You will also learn how to:
>1. Run a bivariate linear regression with a binary independent variable
>2. Run a multivariable linear regression with binary independent variables

As discussed in lecture, planners can use regressions in a multitude of ways, from diagnosing a problem or bringing it to light, to determining at what level to intervene, to assessing the impact of an intervention. The kind of regression you will run depends on what phenomenon you want to understand (your dependent variable) and what factor(s) you hypothesize may be associated with it (your independent variables; explanatory or control variables).

## 1 Import packages and data

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

import statsmodels.api as sm
from scipy.stats import pearsonr

In [None]:
# read in data
data = pd.read_csv('clean_property_data.csv')
data.head()

In [None]:
# create sub-dataframe
df = data[['price_000','pop_dens','ses','house','area_m2','num_bath','pcn_green','thefts']].copy()

## 2 Correlation matrix

In [None]:
# correlation matrix for all variables
df.corr()

In [None]:
# function for matrix of p-values (thank you, Jonathan!)
def pearsonr_pval(x,y):
    return pearsonr(x,y)[1]

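# note: with a callable method, pandas fills the diagonal with 1.0, so the diagonal is not a p-value
corr = df.corr(method=pearsonr_pval)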
corr
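
The coefficient matrix and the p-value matrix are easier to read together. Here is a minimal sketch, on made-up data, of blanking out every correlation whose p-value is above a threshold. The columns `a`, `b`, `c` and the 0.05 cutoff are assumptions for illustration only; they are not part of the lab dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# synthetic stand-in data (hypothetical columns, not the lab dataset)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.5, size=200),  # strongly related to a
    'c': rng.normal(size=200),                     # independent noise
})

def pearsonr_pval(x, y):
    return pearsonr(x, y)[1]

r = demo.corr()                      # correlation coefficients
p = demo.corr(method=pearsonr_pval)  # p-values (diagonal is filled with 1.0)

alpha = 0.05  # assumed significance threshold
# keep a coefficient only where its p-value is below alpha; everything else becomes NaN
# (the diagonal is blanked too, because its "p-value" entry is 1.0)
significant_r = r.where(p < alpha)
print(significant_r.round(2))
```

The same two-line pattern (`r.where(p < alpha)`) should work on `df` from the cell above, though which cells survive depends on the data.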

## 3 Bivariate linear regression
A simple linear regression, also known as a bivariate linear regression, has a continuous dependent variable and is “simple” only because it has a single independent (explanatory) variable. We’re going to start by exploring the possible relationship between two continuous variables to practice interpreting the output.
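
For reference, one common way to write the bivariate model is shown below. The symbol $b$ matches the coefficient discussed later in this lab; $a$ (the intercept) and $e$ (the error term) are notation assumed here for illustration.

$$
Y = a + bX + e
$$

Here $Y$ is the dependent variable, $X$ is the independent variable, and $b$ is the change in $Y$ associated with a one-unit increase in $X$.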

### 3.1 Hypothesis 1: neighborhood population density is associated with higher property values

#### 3.1.1 Explore your variables (describe and visualize)

In [None]:
# define a function to set up boxplot and histogram side by side

def plots(df, var, title, box_label, hist_label):
    # set up figure for two subplots: boxplot & histogram
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4)) # define subplots and figure size
    plt.suptitle(title) # title the figure
    
    ### boxplot
    ax1.boxplot(var)
    ax1.set_ylabel(box_label)
    ### format axes to include thousands separator
    ax1.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x1, loc: "{:,}".format(int(x1))))
    ax1.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x1, loc: "{:,}".format(int(x1))))

    ### histogram
    ax2.hist(var)
    ax2.set_xlabel(hist_label)
    ### format axes number to include thousands separator
    ax2.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x2, loc: "{:,}".format(int(x2))))
    ax2.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x2, loc: "{:,}".format(int(x2))))
    ax2.tick_params(axis='x', labelrotation=35)

    plt.show()

In [None]:
# set up variables
x = df['pop_dens'] # define independent variable
y = df['price_000'] # define dependent variable

In [None]:
# describe neighborhood density
var = x
title = 'Neighborhood Population Density'
box_label = '[Residents/Mi$^2$]'
hist_label = '[Residents/Mi$^2$]'

print(var.describe())
plots(df, var, title, box_label, hist_label)

In [None]:
# describe property price
var = y
title = 'Property Asking Price'
box_label = '[Thousands of Pesos]'
hist_label = '[Thousands of Pesos]'

print(var.describe())
plots(df, var, title, box_label, hist_label)

#### 3.1.2 Visualize the relationship (scatterplot)
As in the last lab, we’ll now draw a scatterplot of our two variables to get a visual sense of how they move together. We’ll also add a line of best fit: the straight line that minimizes the sum of squared vertical distances between the line and each data point, giving a general picture of the trend in the data.

We're using the Seaborn library here (as opposed to Matplotlib) because its `regplot` function draws the scatterplot, the line of best fit, and a confidence band around that line in a single call.
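
To make "line of best fit" concrete, here is a small sketch on made-up data. It fits a least-squares line with `np.polyfit` and checks that nudging the slope or intercept increases the sum of squared residuals. The data, the helper `sum_sq_resid`, and the nudge sizes are all hypothetical and chosen only for illustration.

```python
import numpy as np

# synthetic stand-in data (hypothetical, not the lab dataset)
rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 10, size=100)
y_demo = 3.0 + 2.0 * x_demo + rng.normal(scale=1.5, size=100)

def sum_sq_resid(slope, intercept):
    """Sum of squared vertical distances between the points and a line."""
    return np.sum((y_demo - (intercept + slope * x_demo)) ** 2)

# least-squares fit of a degree-1 polynomial; returns [slope, intercept]
slope, intercept = np.polyfit(x_demo, y_demo, 1)

best = sum_sq_resid(slope, intercept)
# shifting the fitted line in any of these directions should make the fit worse
nudged = [sum_sq_resid(slope + ds, intercept + di)
          for ds in (-0.1, 0.1) for di in (-0.5, 0.5)]

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"SSR at best fit: {best:.1f}; smallest nudged SSR: {min(nudged):.1f}")
```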

In [None]:
# define a function to show and style regression scatterplot
def reg_scatter(df, x, y, xlabel, ylabel, color):
    plt.figure(figsize=(10,6))

    # pass x and y as keywords: recent Seaborn versions no longer accept them positionally
    ax = sns.regplot(x=x, y=y, data=df, line_kws=color)

    ax.set(xlabel=xlabel, ylabel=ylabel)
    ax.get_xaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
    ax.get_yaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

In [None]:
# set up parameters for plot function
xlabel = 'Neighborhood Population Density [Residents/Mi$^2$]'
ylabel = 'Property Asking Price [Thousands of Pesos]'
color = {'color':'red'}

In [None]:
reg_scatter(df, x, y, xlabel, ylabel, color)

#### 3.1.3 Pairwise correlation
What do we notice about this scatterplot? Let’s run a correlation of these two variables (aka a 'pairwise' correlation) to see how strong the relationship is.

In [None]:
# returns a Pearson’s correlation coefficient for each pair specified
var_list = ['pop_dens', 'price_000']
df[var_list].corr()

In [None]:
# pairwise correlation with significance
# returns a Pearson’s correlation coefficient, 2-tailed p-value
pearsonr(x, y)

It looks like this isn't a statistically significant relationship, but let’s take a closer look with a regression anyway! We hypothesized that neighborhood density might be an explanatory variable for a property's asking price, so pop_dens will be our X and price_000 will be our Y.

#### 3.1.4 Bivariate linear regression
Before we run our regression, let’s review what we’re looking for:

> • RQ:  
> • H0:  
> • HA:  

In [None]:
# bivariate linear regression
x = df[['pop_dens']].assign(Intercept = 1) # redefine independent variable and include intercept

sm.OLS(y, x).fit().summary2()

Let's answer the following questions:
> • How can you explain the impact of population density on the asking price of a property?  
> • Is our model statistically significant?   
> • How can you tell?   

We're bound to run into insignificant findings, particularly as we work with complex topics and datasets in city planning. However, these results can still be revealing and teach us about our research question and/or the dataset.

### 3.2 Hypothesis 2: high SES neighborhoods (ses=6 is highest) are associated with higher property values

In [None]:
# set up variables
x = 
y = 

xlabel = 
ylabel = 

#### 3.2.1 Explore your variables (describe and recode)

In [None]:
# describe SES & price
stats = ['count','min','max','mean', 'median', 'std']
y.groupby(x).agg(stats)

We’re interested in the difference between high- and low-SES neighborhoods, so we can simplify this categorical variable into a dummy variable: neighborhoods rated 5 or 6 are coded 1, and all others are coded 0.
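
As a hint for the recode, here is one possible pattern shown on a toy series rather than the lab data. The values and the names `ses_demo` and `ses_dv_demo` are hypothetical, and a 1 to 6 rating scale is assumed.

```python
import pandas as pd

# toy SES ratings on an assumed 1-6 scale (hypothetical values, not the lab data)
ses_demo = pd.Series([1, 2, 3, 4, 5, 6, 5, 2], name='ses')

# the comparison yields True/False; astype(int) turns that into 1/0
ses_dv_demo = (ses_demo >= 5).astype(int).rename('ses_dv')

# cross-check the recode: each rating should fall under exactly one dummy value
print(pd.crosstab(ses_demo, ses_dv_demo))
```

The crosstab is a quick sanity check that ratings 5 and 6 land in the 1 column and everything else in the 0 column.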

In [None]:
# recode SES to high SES dummy variable
    # create dummy variable
    # reassign independent x variable

# describe ses_dv & price

#### 3.2.2 Visualize the relationship (scatterplot)

#### 3.2.3 Pairwise correlation

#### 3.2.4 Bivariate linear regression

In [None]:
# reassign independent variable w/ intercept

# run OLS regression

When interpreting our regression output, we are most interested in the value of **b** from our regression equation: the coefficient of each explanatory variable, shown as 'Coef.' above. The value of b tells us how much Y changes, on average, when X increases by one unit. Remember: it doesn’t necessarily imply a causal relationship, just an association.

> In this case, because X is a dummy variable, it changes from 0 (lower SES) to 1 (high SES). Take a moment to interpret these findings: how is high socioeconomic status associated with the price of a property?

## 4 Multivariable linear regression

After creating a “simple” model (i.e. one explanatory variable), we can start to build a more sophisticated model by including additional explanatory variables or “controls”. Let’s take our ses_dv variable from our simple linear regression: what else do we think might influence a property's listing price, in addition to whether it's in a high socioeconomic status neighborhood?  


From our extensive literature review, we know that there might be property-specific and neighborhood-level factors which influence a property's price. We hypothesize that a property's type, size, and number of bathrooms, in addition to the neighborhood's green space and theft rate, might be helpful predictors of a property's price.

In [None]:
# define independent variables
ind_var = ['ses_dv', 'house', 'area_m2', 'num_bath', 'pcn_green', 'thefts'] 
# generally, first variable is your 'key explanatory variable', followed by your control variables

x = df[ind_var].assign(Intercept = 1) # independent variables
y = df['price_000'] # dependent variable

sm.OLS(y, x).fit().summary2()

> • How would you interpret the coefficient for each variable?  
> • Which factors increase a property's asking price?    
> • Which decrease the asking price?  