# Planning Methods: Part II, Spring 2021 

# Lab 2: Plotting, Correlation and Regression 

**About This Lab**
* We will be running through this notebook together. If you have a clarifying question or other question of broad interest, feel free to interrupt or use a pause to unmute and ask it! If you have a question that may result in a one-on-one breakout room (think: detailed inquiry, conceptual question, or help debugging), please ask it in the chat!
* We recognize that learning Python via Zoom comes with its challenges and that there are many modes of learning. Please go with what works best for you. That might be printing out the Jupyter notebook, duplicating it so that you can refer to the original, or working directly in it. Up to you! There isn't a single right way.
* This lab requires that you download the following file and place it in the same directory as this Jupyter notebook:
    * `clean_property_data.csv`


## Objectives 
By the end of this lab, you will review how to:
1. Check for outliers 
2. Plot histograms 
3. Create subdataframes 
4. Create dummy variables

You will learn how to: 
1. Create scatterplots
2. Test correlation
3. Use a correlation matrix
4. Run a bivariate linear regression

## Import Packages

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

import statsmodels.api as sm
from scipy.stats import pearsonr

## Read Files 

In [None]:
df = pd.read_csv('clean_property_data.csv')
df.head()

In [None]:
df.dtypes

In [None]:
df['id'] = df['id'].astype(str)
df.dtypes

In [None]:
#We want to create a dataframe with the variables that we want to examine 
property_df = df[['price_000', 'pop_dens', 'ses', 'house', 'apt',
                  'pcnt_indu', 'pcnt_com', 'pcn_green', 'homicides',
                  'area_m2', 'num_bath', 'lnprice']].copy()

#Let's rename a few of these variables 
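Later cells refer to `price` and `bathrooms`, which are not among the columns selected above. One possible renaming is sketched below. The mapping (`price_000` to `price`, `num_bath` to `bathrooms`) is an assumption inferred from those later cells, and the data frame here is a made-up stand-in for `property_df`.

```python
import pandas as pd

# Tiny stand-in for property_df; the values are made up for illustration.
property_df = pd.DataFrame({
    "price_000": [120.0, 250.0, 90.0],
    "num_bath": [1, 2, 1],
    "area_m2": [60.0, 110.0, 45.0],
})

# Assumed mapping: later cells reference 'price' and 'bathrooms'.
property_df = property_df.rename(
    columns={"price_000": "price", "num_bath": "bathrooms"}
)

print(property_df.columns.tolist())
```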


## Histograms 

### Visualizing data in plots

In [None]:
#To look at the distribution of population density, we can first create a boxplot  
                               #Define population density as x 
                               #Create the boxplot 
plt.show()

#We can also look at a histogram of population density  
                     #Let's define the number of bins we want to use 
                    #Let's plot the histogram using x, and defined number of bins 
plt.show()
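As a rough sketch of what the two blanks above might look like, the cell below draws a boxplot and a histogram with `plt.boxplot` and `plt.hist`. It uses synthetic, right-skewed values in place of `property_df['pop_dens']`, and the bin count of 50 is an arbitrary choice for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, right-skewed stand-in for property_df['pop_dens'] (made-up values).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=8.0, sigma=0.6, size=1000)

# Boxplot: points drawn beyond the whiskers are candidate outliers.
plt.figure(figsize=(6, 3))
box = plt.boxplot(x)

# Histogram with an explicit number of bins.
n_bins = 50
plt.figure(figsize=(6, 4))
counts, bin_edges, _ = plt.hist(x, bins=n_bins)
```

In the notebook, end each plot with `plt.show()` as the surrounding cells do. `plt.hist` returns the per-bin counts and the bin edges, which can be handy for checking how sensitive the picture is to `n_bins`.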

### Personalizing the plot 

In [None]:
#We will create a histogram of population density
x =                      # We first define population density as x 
n_bins = 500             #Then we define the number of bins we want to use        

# Let's define graph size (x, y) in inches 
plt.figure(figsize = (6, 4))

#Using the matplotlib function, we can plot our histogram 


#We can create axis labels and a title for our histogram 
plt.xlabel(r'Population Density $[\mathregular{person/km^2}]$')
plt.ylabel('Frequency')
plt.title('Population Density')

#We can adjust the labels to the figure area
plt.tight_layout()

#Remember to save the figure to the project folder
plt.savefig('pop_dens_hist.jpg')

plt.show()

In [None]:
#We will repeat the exercise above, and plot a histogram of the natural log of price 
                 # We first define the natural log of price as x 
                 #Then we define the number of bins we want to use

#Let's define graph size (x, y) in inches 


#We can now plot our histogram using the matplotlib function 


#Create your axis labels and title


#We can adjust labels to the figure area
plt.tight_layout()

#Let's save the figure we have created 
plt.savefig('cost_hist.jpg')

plt.show()

## Dummy Variables 

### Creating a dummy from the SES category 

In [None]:
#Socioeconomic status (ses) is a categorical variable. 
#We can first look at its distribution using df.groupby().size()
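A minimal sketch of the `groupby().size()` pattern mentioned in the comment, using a small made-up `ses` column rather than the lab data:

```python
import pandas as pd

# Made-up socioeconomic status codes standing in for property_df['ses'].
property_df = pd.DataFrame({"ses": [1, 2, 2, 5, 6, 6, 6]})

# Number of rows in each ses category.
ses_counts = property_df.groupby("ses").size()
print(ses_counts)
```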


In [None]:
#We will create a dummy variable for high socioeconomic status 
property_df['high_ses'] = np.where((property_df['ses'] == 5) |
                                   (property_df['ses'] == 6), 1, 0)

In [None]:
#Let's now look at the distribution of properties that are in a neighborhood with high socioeconomic status 
#versus those that are not 
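The comparison described above could be sketched as follows. The `ses` values are made up, and `isin([5, 6])` is shown only as an equivalent alternative to the `==`/`|` condition used in the lab cell.

```python
import numpy as np
import pandas as pd

# Made-up ses codes standing in for the lab data.
property_df = pd.DataFrame({"ses": [1, 2, 2, 5, 6, 6, 6]})

# Dummy for high socioeconomic status (ses equal to 5 or 6).
property_df["high_ses"] = np.where(property_df["ses"].isin([5, 6]), 1, 0)

# The same dummy written with explicit comparisons, as in the lab cell.
by_or = np.where((property_df["ses"] == 5) | (property_df["ses"] == 6), 1, 0)

# Count of properties in each group (0 = not high SES, 1 = high SES).
high_ses_counts = property_df.groupby("high_ses").size()
print(high_ses_counts)
```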


### Creating a binary density variable 

In [None]:
#We will use population density and define its median 
pop_dens_med = 
pop_dens_med

In [None]:
#We can create a dummy for dense areas that have a population density higher than the median 
property_df['dens_pop_dv'] = 
                         
#We can now look at the distribution of this new dummy variable 
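One way the median and the density dummy above might be filled in, sketched with made-up densities. `Series.median()` and `np.where` are standard calls; the strict `>` comparison follows the comment's wording ("higher than the median").

```python
import numpy as np
import pandas as pd

# Made-up population densities standing in for property_df['pop_dens'].
property_df = pd.DataFrame(
    {"pop_dens": [1000.0, 2500.0, 4000.0, 8000.0, 12000.0]}
)

# Median population density.
pop_dens_med = property_df["pop_dens"].median()

# 1 if the density is strictly above the median, 0 otherwise.
property_df["dens_pop_dv"] = np.where(
    property_df["pop_dens"] > pop_dens_med, 1, 0
)

print(property_df.groupby("dens_pop_dv").size())
```

Note that rows exactly at the median fall into the 0 group under this definition.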


### Creating a different set of categories 

In [None]:
#Let's look at the distribution of the number of bathrooms using our categorical variables 
property_df.groupby('bathrooms').size()

In [None]:
#We create a new variable that groups the number of bathrooms into new categories 
property_df['bathrooms_cat'] =
#We can now look at the distribution of this new categorical variable 
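The lab does not say which categories to use, so the sketch below is only one possibility: it uses `pd.cut` to group bathroom counts into "1", "2", and "3+". The cut points, the labels, and the data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up bathroom counts standing in for property_df['bathrooms'].
property_df = pd.DataFrame({"bathrooms": [1, 2, 3, 4, 1, 2]})

# Hypothetical categories: (0, 1] -> "1", (1, 2] -> "2", (2, inf) -> "3+".
property_df["bathrooms_cat"] = pd.cut(
    property_df["bathrooms"],
    bins=[0, 1, 2, np.inf],
    labels=["1", "2", "3+"],
)

# Distribution of the new categories.
cat_counts = property_df.groupby("bathrooms_cat", observed=True).size()
print(cat_counts)
```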


## Scatterplots

In [None]:
#To create a scatterplot, we first need to define our x, and y variables of interest 
#Here we are interested in plotting the price of properties, and the percentage of industrial properties
#in a neighborhood 
x = 
y = 

#Let's make our scatter plot 
                               #We use this function to plot x, and y,
                               #define the color of our points, their transparency and size
        
#Let's create our axis labels and titles 



plt.show()
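A possible version of the scatterplot described above, using `plt.scatter` with a color, transparency (`alpha`), and marker size (`s`). The data are synthetic stand-ins for `pcnt_indu` and `price`, and the styling values are arbitrary.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-ins for the lab columns (made-up values).
rng = np.random.default_rng(1)
property_df = pd.DataFrame({
    "pcnt_indu": rng.uniform(0, 40, 200),
    "price": rng.lognormal(11, 0.5, 200),
})

x = property_df["pcnt_indu"]
y = property_df["price"]

plt.figure(figsize=(6, 4))
# Color, transparency, and point size are set through c, alpha, and s.
points = plt.scatter(x, y, c="blue", alpha=0.3, s=50)

plt.xlabel("Percentage of Industrial Properties")
plt.ylabel("Price")
plt.title("Industrial Share and Property Prices")
```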

In [None]:
#Let us create a scatterplot of the properties' area and their price
x = property_df['area_m2']
y = property_df['price']

#Set the index upon which the plot color will be based
c = x

#Let's create our scatterplot, and use cmap to color the points on our plot 

#Let's create our axis labels and titles 
plt.xlabel(r'Area $[\mathregular{m^2}]$')
plt.ylabel('Price')
plt.title('Property Area and Prices')

plt.show()
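The color-mapped scatterplot above might look roughly like this. Passing an array to `c` together with `cmap` colors each point by that array's value. The `viridis` colormap, the colorbar, and the synthetic data are choices made for this sketch, not something the lab specifies.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for area_m2 and price (made-up values).
rng = np.random.default_rng(2)
x = rng.uniform(40, 200, 200)
y = 1500 * x + 20000 + rng.normal(0, 20000, 200)

# Color each point by its area.
c = x

plt.figure(figsize=(6, 4))
points = plt.scatter(x, y, c=c, cmap="viridis", alpha=0.5, s=30)
plt.colorbar(points, label="Area (m2)")

plt.xlabel("Area (m2)")
plt.ylabel("Price")
plt.title("Property Area and Prices")
```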

In [None]:
#We can also plot a line of the relationship between our x, y (area, and price)


#Let's create our axis labels and titles 
plt.xlabel(r'Area $[\mathregular{m^2}]$')
plt.ylabel('Price')
plt.title('Property Area and Prices')

plt.show()
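One way to draw a fitted line is sketched below with `np.polyfit`, which fits a degree-1 polynomial by least squares. This is a swapped-in technique for illustration: since the notebook imports seaborn, the lab may intend `sns.regplot` instead. The data are synthetic, built around a true slope of 1500.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for area_m2 and price (made-up values).
rng = np.random.default_rng(2)
x = rng.uniform(40, 200, 200)
y = 1500 * x + 20000 + rng.normal(0, 20000, 200)

# Least-squares straight line: polyfit returns [slope, intercept] for deg=1.
slope, intercept = np.polyfit(x, y, deg=1)

plt.figure(figsize=(6, 4))
plt.scatter(x, y, alpha=0.3, s=30)

# Evaluate the fitted line over the observed range of x.
x_line = np.linspace(x.min(), x.max(), 100)
(line,) = plt.plot(x_line, slope * x_line + intercept, color="black")

plt.xlabel("Area (m2)")
plt.ylabel("Price")
plt.title("Property Area and Prices")
```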

In [None]:
#Use help() to read a function's documentation and see its parameters


In [None]:
#We want to create a figure with more than 1 scatterplot 

#Let's define x, and y for our first scatterplot 
x1 = 
y1 = 

#Let's define x, and y for our second scatterplot 
x2 = 
y2 = 

#We then create our figure with 2 subplots, and define its size 



#Create the first subplot
ax1.scatter(x1, y1, c = 'red', alpha = 0.3, s = 50)
#Let's create our axis labels 
ax1.set_xlabel('Percentage of Industrial Properties')
ax1.set_ylabel('Price')
#Format y-axis number to include thousands separator
ax1.get_yaxis().set_major_formatter(FuncFormatter(lambda value, loc: "{:,}".format(int(value))))

#Create the second subplot

#Create its axis labels 

#Format y-axis number to include thousands separator
ax2.get_yaxis().set_major_formatter(FuncFormatter(lambda value, loc: "{:,}".format(int(value))))

#Create title for the overall figure


#Save your figure 


plt.show()
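The two-panel figure above could be assembled along these lines with `plt.subplots`. The choice of variables for the second panel (commercial share versus price), the figure size, and the synthetic data are assumptions for this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Synthetic stand-ins for pcnt_indu, pcnt_com, and price (made-up values).
rng = np.random.default_rng(3)
x1 = rng.uniform(0, 40, 150)
x2 = rng.uniform(0, 60, 150)
y1 = y2 = rng.lognormal(11, 0.5, 150)

# One row, two columns of axes; figsize is (width, height) in inches.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Format tick values with a thousands separator.
thousands = FuncFormatter(lambda value, pos: "{:,}".format(int(value)))

ax1.scatter(x1, y1, c="red", alpha=0.3, s=50)
ax1.set_xlabel("Percentage of Industrial Properties")
ax1.set_ylabel("Price")
ax1.get_yaxis().set_major_formatter(thousands)

ax2.scatter(x2, y2, c="blue", alpha=0.3, s=50)
ax2.set_xlabel("Percentage of Commercial Properties")
ax2.set_ylabel("Price")
ax2.get_yaxis().set_major_formatter(thousands)

# Title for the whole figure, then tidy the spacing.
title = fig.suptitle("Land Use and Property Prices")
fig.tight_layout()
```

To save the result, `fig.savefig(...)` with a filename of your choice works the same way as `plt.savefig` in the earlier cells.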

## Correlation 

### Pairwise Correlation

In [None]:
#Define independent and dependent variables 
x = 
y = 

#Define list to run a pairwise correlation 
cor_list = 
property_df[cor_list].corr()

In [None]:
#The pearsonr function returns the correlation coefficient and the p-value
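A sketch of the two approaches above side by side: `DataFrame.corr()` for the pairwise table and `scipy.stats.pearsonr` for the coefficient with its p-value. The variable pair (`area_m2` and `price`) and the synthetic data are assumptions; the lab may use a different pair.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic stand-ins for area_m2 and price (made-up values).
rng = np.random.default_rng(2)
area = rng.uniform(40, 200, 200)
property_df = pd.DataFrame({
    "area_m2": area,
    "price": 1500 * area + 20000 + rng.normal(0, 20000, 200),
})

# Pairwise correlation table for the listed columns.
cor_list = ["area_m2", "price"]
corr_table = property_df[cor_list].corr()
print(corr_table)

# Pearson correlation coefficient and its two-sided p-value.
r, p_value = pearsonr(property_df["area_m2"], property_df["price"])
print(r, p_value)
```

The off-diagonal entry of `corr_table` and `r` are the same number; `pearsonr` adds the p-value for the null hypothesis of no linear correlation.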


### Correlation Matrix 

In [None]:
#We can also create a correlation matrix for our dataframe 
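Calling `.corr()` on the whole data frame gives the full correlation matrix. The sketch below also draws it as a colored grid using plain matplotlib (`imshow`). That plotting choice is a substitution for illustration: with seaborn imported, the lab may intend `sns.heatmap`. Data and column subset are made up.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-ins for three of the lab columns (made-up values).
rng = np.random.default_rng(4)
area = rng.uniform(40, 200, 300)
property_df = pd.DataFrame({
    "area_m2": area,
    "price": 1500 * area + rng.normal(0, 30000, 300),
    "pcnt_indu": rng.uniform(0, 40, 300),
})

# Correlation of every column with every other column.
corr_matrix = property_df.corr()
print(corr_matrix.round(2))

# Draw the matrix as a colored grid on a fixed -1 to 1 scale.
fig, ax = plt.subplots(figsize=(4, 4))
image = ax.imshow(corr_matrix.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr_matrix.columns)))
ax.set_xticklabels(corr_matrix.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr_matrix.index)))
ax.set_yticklabels(corr_matrix.index)
fig.colorbar(image, ax=ax, label="Pearson r")
```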


## Bivariate Linear Regression 

In [None]:
#Define the independent variable, and include the intercept 
x = 

#Define the dependent variable
y = 

#Run the regression
