# Lab 3. Correlation & Regressions in Python

### Learning Objectives
In this lab, we will be supporting you in running regressions for the first time in Python! By the end, we will have covered how to do the following in Python:  
> • Do a Simple Linear Regression (one independent variable)  
> • Do a Multivariate Linear Regression (multiple independent variables)   

As discussed in lecture, planners can use regressions in a multitude of ways, from diagnosing a problem or bringing it to light, to determining at what level to intervene, to assessing the impact of an intervention. The kind of regression you will run depends on what phenomenon you want to understand (your dependent variable) and what factor(s) you hypothesize may be associated with it (your independent variables; explanatory or control variables). 

# Connecting to Bogota Property Dataset

In [None]:
#Importing Libraries
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

import statsmodels.api as sm
from scipy.stats import pearsonr

In [None]:
#Reading in Data
data = pd.read_csv('Property Data.csv')
data.head()

In [None]:
#Creating Subset Dataframe w/ Variables of Focus
df = data[['price_000','pop_dens','ses','house','area_m2','num_bath','pcn_green','homicides']].copy()

### Defining Major Functions

In [None]:
# Defining Function to Set Up Boxplot & Histogram Side by Side

def plots(df, var, title, box_label, hist_label):
    #Plot Boxplot & Histogram
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4)) #Defines subplots & figure size
    plt.suptitle(title) #Title for overall figure

    ### Boxplot
    ax1.boxplot(var)
    ax1.set_ylabel(box_label)
    ### Formats axes to include thousands separator
    ax1.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x1, loc: "{:,}".format(int(x1))))
    ax1.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x1, loc: "{:,}".format(int(x1))))

    ### Histogram
    ax2.hist(var)
    ax2.set_xlabel(hist_label)
    ### Formats axes number to include thousands separator
    ax2.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x2, loc: "{:,}".format(int(x2))))
    ax2.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x2, loc: "{:,}".format(int(x2))))
    ax2.tick_params(axis='x', labelrotation=35)

    plt.show()

In [None]:
#Defining Function to Show Regression Scatterplot
def reg_scatter(df, x, y, xlabel, ylabel, color):
    plt.figure(figsize=(10,6))

    ax = sns.regplot(x=x, y=y, data=df, line_kws=color) #x & y passed as keywords (required by recent Seaborn versions)

    ax.set(xlabel=xlabel, ylabel=ylabel);
    ax.get_xaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
    ax.get_yaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

# Correlation Matrix

In [None]:
#Correlation Matrix of all df variables
df.corr()
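To get a feel for how to read a correlation matrix, here is a minimal sketch on a small made-up table. The column names and values are hypothetical and are not taken from the property dataset. Each cell is the Pearson correlation between the row variable and the column variable, so the diagonal is always 1 and the matrix is symmetric.

```python
import pandas as pd

# Tiny made-up table, purely for illustration (not the lab's data)
toy = pd.DataFrame({
    "price": [100, 150, 200, 250, 300],
    "area": [40, 55, 70, 90, 120],
    "crime": [9, 7, 6, 4, 1],
})

# Pearson correlation between every pair of columns
corr = toy.corr()
print(corr.round(2))
```

In this toy table, price rises with area (positive correlation) and falls as crime rises (negative correlation).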

# Linear Regression
A simple linear regression (also known as a bivariate linear regression) is one where the dependent variable is continuous; it is “simple” only because there is a single independent, or explanatory, variable to explore. We’re going to start by exploring the possible relationship between two continuous variables to practice that interpretation. For example, CHIS includes responses about people’s food consumption habits.

We may hypothesize that there is a relationship between how many times someone eats fast food each week and how many sodas they drink per week. Since fast food restaurants serve soda, it seems logical that these variables could covary. We can investigate this by performing a simple linear regression in Python. Remember that the equation for a linear regression with one explanatory variable is:

Y = a + bX + e

Here a is the intercept (the predicted value of Y when X is 0), b is the slope (the change in Y associated with a one-unit increase in X), and e is the error term. In this case, we believe that our independent variable, X, the number of times eating fast food per week, will have a relationship with our dependent variable, Y, the number of sodas drunk per week.

Before we investigate the relationship between any two variables, let’s see if we need to drop any missing or “inapplicable” values, and recode variables with names that make more sense so that we can more easily interpret our outputs. After recoding each one, we’ll summarize them to understand the units we’re working with.
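To make the equation Y = a + bX + e concrete, here is a minimal sketch using synthetic data. The numbers are invented for illustration and have nothing to do with CHIS or the property dataset. We generate Y from known values of a and b, then recover them with the standard least-squares formulas for one explanatory variable.

```python
import numpy as np

# Synthetic, made-up data purely for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)      # hypothetical explanatory variable
e = rng.normal(0, 1, size=200)        # random error term
Y = 2.0 + 0.5 * X + e                 # true a = 2.0, true b = 0.5

# Least-squares estimates with one explanatory variable:
# b = cov(X, Y) / var(X),  a = mean(Y) - b * mean(X)
b_hat = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
a_hat = Y.mean() - b_hat * X.mean()

# np.polyfit with degree 1 solves the same least-squares problem
b_np, a_np = np.polyfit(X, Y, 1)

print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")
```

The estimates land close to, but not exactly on, the true values because of the random error term.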

## Hypothesis #1
Neighborhood population density is associated with higher property values

In [None]:
#Hypothesis #1 Variables
x = df['pop_dens'] #Define independent variable
y = df['price_000'] #Define dependent variable
var_list = ['pop_dens', 'price_000']

xlabel = 'Neighborhood Population Density [Residents/Mi$^2$]'
ylabel = 'Property Asking Price [Thousands of Pesos]'
color = {'color':'red'}

In [None]:
#Describe Neighborhood Density
var = x
title = 'Neighborhood Population Density'
box_label = '[Residents/Mi$^2$]'
hist_label = '[Residents/Mi$^2$]'

print(x.describe())
plots(df, var, title, box_label, hist_label)

In [None]:
#Describe Property Price
var = y
title = 'Property Asking Price'
box_label = '[Thousands of Pesos]'
hist_label = '[Thousands of Pesos]'

print(y.describe())
plots(df, var, title, box_label, hist_label)

### Scatterplot
Like we did in class, we’ll first draw a scatterplot of our two variables to get a visual sense of how they move together. We’ll also add a line of best fit: the straight line through the data points that minimizes the sum of squared vertical distances between the line and each point, giving a general picture of the trend in the data.

We're using the Seaborn library here (as opposed to plain Matplotlib) because its regplot function draws the scatterplot and the line of best fit in a single call.
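What "best fit" means can be checked numerically. The sketch below uses synthetic data (not the property dataset) and a small helper, `ssr`, which is a name made up for this illustration. It compares the sum of squared vertical distances for the fitted line against a few other candidate lines.

```python
import numpy as np

# Synthetic, made-up data purely for illustration
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=50)
y = 10 + 3 * x + rng.normal(0, 20, size=50)

def ssr(intercept, slope):
    """Sum of squared vertical distances between the points and a line."""
    residuals = y - (intercept + slope * x)
    return float(np.sum(residuals ** 2))

# Line of best fit (np.polyfit returns slope first, then intercept)
slope_fit, intercept_fit = np.polyfit(x, y, 1)
best = ssr(intercept_fit, slope_fit)

# Any other line, e.g. a shifted or steeper one, has a larger sum of squares
print(best, ssr(intercept_fit + 5, slope_fit), ssr(intercept_fit, slope_fit * 1.1))
```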

In [None]:
reg_scatter(df, x, y, xlabel, ylabel, color)

### Pairwise Correlation
What do we notice about this scatterplot? Let’s run a correlation of these two variables (aka a 'pairwise' correlation) to see how strong the relationship is.

In [None]:
#Pairwise correlation
#Returns: Pearson’s correlation coefficient for each pair specified

df[var_list].corr()

In [None]:
#Pairwise correlation with significance
#Returns: (Pearson’s correlation coefficient, 2-tailed p-value)

pearsonr(x, y)

It looks like this isn't a statistically significant relationship, but let’s take a closer look with our first regression anyway! We hypothesized that neighborhood density might be an explanatory variable for a property's asking price, so pop_dens will be our X and price_000 will be our Y.
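As a side note on reading the pearsonr output: it can be unpacked into the correlation coefficient and the p-value, and the p-value is then compared against a chosen significance level. The sketch below uses synthetic data, and the 5% threshold is an assumed convention rather than something specified in this lab.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic, made-up data: b_related is built to depend on a
rng = np.random.default_rng(42)
a = rng.normal(size=300)
b_related = 0.8 * a + rng.normal(scale=0.5, size=300)

# Unpack the two values returned by pearsonr
r, p_value = pearsonr(a, b_related)

alpha = 0.05  # assumed significance level
print(f"r = {r:.3f}, p = {p_value:.4g}, significant at 5%: {p_value < alpha}")
```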

### Bivariate Linear Regression

Before we run our regression, let’s review what we’re looking for:

RQ:  

H0:  
HA:  

In [None]:
#Bivariate Linear Regression
x = df[['pop_dens']].assign(Intercept = 1) #Redefine independent variable including intercept

sm.OLS(y, x).fit().summary2()

Practice interpreting this regression output with your neighbor:  

> • How can you explain the relationship between population density and the asking price of a property?  
> • Is our model statistically significant?   
> • How can you tell?   

We're bound to run into insignificant findings, particularly as we work with complex topics and datasets in city planning. However, these results can still be revealing and teach us about our research question and/or the dataset!

## Hypothesis #2
Being in a high SES neighborhood (ses=6 is highest) is associated with higher property values

In [None]:
#Hypothesis #2 Variables
x = 
y = 

xlabel = 
ylabel = 

In [None]:
#Describe Variables
stats = ['count','min','max','mean', 'median', 'std']

#Describe SES & Price

We’re interested in the difference between high and low SES, so we can simplify this categorical variable into a dummy variable in which neighborhoods rated 5 or 6 are coded 1, and all others are coded 0.  
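One common way to build a dummy like this is a boolean comparison converted to integers. The sketch below works on a tiny made-up table rather than the lab's dataframe, so the exercise cell is still yours to fill in. It assumes a 1 to 6 rating scale as described above.

```python
import pandas as pd

# Tiny made-up table with an SES rating from 1 to 6
toy = pd.DataFrame({"ses": [1, 2, 3, 4, 5, 6, 5, 2]})

# True where the rating is 5 or 6, then converted to 1/0
toy["ses_dv"] = (toy["ses"] >= 5).astype(int)
print(toy)
```

Note that with this approach a missing rating would compare as False and be coded 0, so it is worth checking for missing values first.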

In [None]:
#Recode SES
  #Create Dummy Variable
  #Reassign Independent X Variable

#Describe ses_dv & price

### Scatterplot

In [None]:
#Scatterplot of ses_dv & price

### Pairwise Correlation

In [None]:
#Pearson Correlation

### Bivariate Linear Regression

In [None]:
#Redefine independent variable - and include intercept

#OLS Regression

When interpreting our regression output, we are most interested in the value of **b**, from our regression equation, which is the coefficient of each explanatory variable, shown as 'Coef.' above. The value of b tells us how much Y changes when X changes. Remember: It doesn’t necessarily imply a causal relationship – just an association!

In this case, because X is a dummy variable, it changes from 0 (lower SES) to 1 (high SES). Take a moment to interpret these findings with a neighbor: how is high socioeconomic status associated with the price of a property?

## Hypothesis #3
Neighborhood SES is negatively associated with neighborhood density

In [None]:
#Hypothesis #3 Variables
x = df['ses_dv']
y = df['pop_dens']

xlabel = 'Neighborhood Socioeconomic Status'
ylabel = 'Neighborhood Population Density [Residents/Mi$^2$]'

In [None]:
reg_scatter(df, x, y, xlabel, ylabel, color)

In [None]:
pearsonr(x, y)

In [None]:
#Regression
x = df[['ses_dv']].assign(Intercept = 1) #Redefine independent variable - and include intercept

sm.OLS(y, x, missing='drop').fit().summary2()

# Multivariate linear regression

After creating a “simple” model (i.e. one explanatory variable), we can start to “build” a more sophisticated model by including additional explanatory variables or “controls”. Let’s take our ses_dv variable from our simple linear regression: what else do we think might influence a property's listing price, in addition to whether it's in a high socioeconomic status neighborhood?  


From our extensive literature review, we know that there might be property-specific and neighborhood-level factors which influence a property's price. We hypothesize that a property's type, size, and number of bathrooms, in addition to the neighborhood's green space and homicide rate, might be helpful predictors of a property's price.

In [None]:
#Define Independent Variables
ind_var = ['ses_dv', 'house', 'area_m2', 'num_bath', 'pcn_green', 'homicides'] 
#Generally, first variable is your 'key explanatory variable', followed by your control variables

x = df[ind_var].assign(Intercept = 1) #Independent Variables
y = df['price_000'] #Dependent Variable

sm.OLS(y, x).fit().summary2()

With your neighbor, practice interpreting the coefficient for each variable.  
 
Which factors increase a property's asking price?    
Which decrease the asking price?   