In [None]:
from datascience import *
from math import *

import matplotlib
matplotlib.use('Agg')  # the `warn` keyword was removed in newer matplotlib releases
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Coding Review

Today, we are going to look at a dataset about urban counties in the United States. It includes variables such as the number of tech workers, the percentage of insured individuals, access to broadband, median rent, and more. Load the data in the cell below.

In [None]:
urban = Table().read_table("urban_2017_clean.csv")
urban.show(5)

In [None]:
## Q1: First, let's practice importing our libraries for analysis.
# Let's use NumPy and the stats module from SciPy. Import both with the names you prefer below. 


In [None]:
## Q2: Write a function that takes in a column label in urban and returns the following summary statistics as a list:
# Mean, median, standard deviation, interquartile range
# We also want to use the optional arguments: center, and spread
# if summary_stats is called with center = True, print the mean instead of the median 
# and if spread = True, print the standard deviation instead of the IQR

def summary_stats(label, center = True, spread = True):
    if ...:
        ...
    else:
        ...
    print("The center of " + label  + " is " + str(print_center) + " and the spread is " + str(print_spread))
    return ...

summary_stats("Median rent")
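
As a reference for the four statistics named in Q2, here is a minimal sketch on a made-up array rather than the `urban` table. The array `values` is hypothetical, and using `scipy.stats.iqr` for the interquartile range is one possible choice, not necessarily the one intended for this exercise.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data standing in for one column of a table
values = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0])

mean = np.mean(values)      # measure of center
median = np.median(values)  # measure of center, robust to outliers
sd = np.std(values)         # population SD (ddof=0 is NumPy's default)
iqr = stats.iqr(values)     # 75th percentile minus 25th percentile

summary = [mean, median, sd, iqr]
print(summary)
```

Note that `np.std` divides by n by default; pass `ddof=1` if you want the sample standard deviation instead.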

In [None]:
## Q3: I want to see how the number of tech workers is correlated with every other quantitative variable in the dataset
# Complete the following code so we get a 2 column table, with 1 column being each quant var. in the dataset
# and the other column being the corresponding r value for that variable and # of tech workers
# hint: use the pearsonr(x_arr, y_arr) function in SciPy in conjunction with other code we've learned! 

quant_vars = list(urban.drop("Urban area", "Has rapid transit", "Tech workers").labels)
cleaned_urban = urban.where("Tech workers", are.above(10000)) 
# Note: for purposes of this exercise, we are just going to look at urban areas that have more than
# 10,000 tech workers to help make the analysis easier + the viz clearer

def find_r(variable):
    """Takes in a variable name as a string and return the correlation coefficient
    for that variable and the number of tech workers"""
    ...
    
vars_by_r = ... # our final result table
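
To illustrate the hint in Q3, the sketch below calls `pearsonr` on small made-up arrays. The arrays and the dictionary `columns` are hypothetical and only show the shape of the return value: `pearsonr` gives back both r and a p-value, so index `[0]` picks out r.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: one x array compared against several y arrays
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
columns = {
    "perfectly linear": 2 * x + 1,
    "perfectly inverse": -3 * x + 10,
    "noisy": np.array([2.0, 1.0, 4.0, 3.0, 5.0]),
}

# pearsonr returns (r, p-value); we keep only r for each comparison
r_values = {name: stats.pearsonr(x, y)[0] for name, y in columns.items()}

for name, r_val in r_values.items():
    print(name, round(r_val, 3))
```

The same pattern (compute one r per column, collect the results) could then be turned into a two-column table.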

In [None]:
## Q4: What variable seems the most strongly correlated to number of tech workers?
# Check it manually or use code to figure it out below. 

...

In [None]:
## Q5: Before we actually begin creating a linear model, we need to check something VERY important:
# what needs to be true before we begin a regression analysis? 

cleaned_urban.scatter("Tech workers", ..., fit_line = True)
cleaned_urban.scatter("Tech workers", ..., fit_line = True)

In [None]:
## Q for thought: why would we see this pattern? what is an issue moving forward with this analysis? 
# should we move forward with a regression analysis?

## Linear Regression in Python

So, based on all of the work we did above, it looks like the number of tech workers in urban areas is most closely correlated with the proportion of people who use motor vehicles OR walk as their main mode of transportation, as well as with the median rent.

I think that the relationship between tech and rent is the most interesting, given that we live in the Bay Area, so let's go investigate that! 

Let's use what we've learned to build a linear regression model!

In [None]:
## Again, here is our dataset:
cleaned_urban.show(5)

## and our r-value
r = stats.pearsonr(cleaned_urban.column("Tech workers"), cleaned_urban.column("Median rent"))[0]
r

The derivations of the slope and intercept formulas are outside the scope of this course, but we can use the following results:

For the prediction line in standard units: y = mx + b

m = slope = r

b = 0

For the prediction line in original units:

m = slope = r * (SDy / SDx)

b = Yavg - m * Xavg
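
The formulas above can be checked on a small example. This is a sketch on made-up arrays, not the course solution: `x`, `y`, and the helper `standard_units` are hypothetical names, and `np.polyfit` is used only as an independent check on the result.

```python
import numpy as np

# Hypothetical toy data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

def standard_units(arr):
    """Convert an array to standard units: (value - mean) / SD."""
    return (arr - np.mean(arr)) / np.std(arr)

# r is the mean of the products of x and y in standard units
r = np.mean(standard_units(x) * standard_units(y))

# Slope and intercept in original units, following the formulas above
m = r * np.std(y) / np.std(x)
b = np.mean(y) - m * np.mean(x)

# Independent check: least-squares fit of a degree-1 polynomial
ls_slope, ls_intercept = np.polyfit(x, y, 1)
print(m, b, ls_slope, ls_intercept)
```

The slope and intercept from the formulas should agree with the least-squares fit up to floating-point error.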

In [None]:
# Q1: Calculate slope in original units 
def slope(tbl, x_label, y_label):
    ...

formula_slope = slope(cleaned_urban, "Tech workers", "Median rent")
formula_slope

In [None]:
# Q2: Calculate intercept in original units
# Don't forget about the DRY rule!

def intercept(tbl, x_label, y_label):
    ...

formula_int = intercept(cleaned_urban, "Tech workers", "Median rent")
formula_int

In [None]:
# Our formula line: 
print("Our line:" + str([formula_slope, formula_int]))

# If we wanted to predict the rent in an area with 100,644 tech workers (like SF), our line would report
# the following number as the median rent

formula_slope * 100644 + formula_int

In [None]:
## The algorithmic optimization approach:
# We can measure the "accuracy" of our line using a statistic: the mean-squared error
# Error = observed - predicted

def mse(slope, intercept):
    """Given a slope and intercept, report the mean squared error using the cleaned_urban table"""
    ...

## This function lets us compare the "accuracy" of various lines:
# MSE of our formula line
mse(formula_slope, formula_int)

In [None]:
# We can now use the minimize(func) function, which is a "higher order" function (a function that takes a function as input or returns one)
# which will use an algorithm to find the arguments for func that produce the lowest output
# in other words, we are numerically searching for the line that gives the lowest MSE!
minimize(mse)
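
To see the optimization idea in isolation, here is a sketch that swaps in `scipy.optimize.minimize` on made-up data; the course's `minimize` is assumed to behave similarly, but that is an assumption. The names `toy_mse`, `x`, and `y` are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize as scipy_minimize

# Hypothetical toy data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

def toy_mse(params):
    """Mean squared error of the line with the given (slope, intercept)."""
    slope, intercept = params
    predicted = slope * x + intercept
    return np.mean((y - predicted) ** 2)

# Start the search at slope = 0, intercept = 0 and let the optimizer improve it
result = scipy_minimize(toy_mse, x0=[0.0, 0.0])
best_slope, best_intercept = result.x
print(best_slope, best_intercept)
```

The optimizer's answer should land very close to the slope and intercept given by the formulas, since both describe the least-squares line.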


## Bringing it all together...

Remember that r is mathematically tied to the slope of the regression line (slope = r * SDy / SDx), and the slope describes the relationship between two variables X and Y (i.e., it is the rate of change of Y with respect to X).

Therefore, we can perform a bootstrap on the slope (and, by extension, r), treating the slope like any other statistic we've used before, to figure out whether there truly is a relationship between the number of tech workers and median rent. If the true slope is 0, there is no linear relationship, but if the slope is not 0, we can conclude there is some linear relationship.
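
The bootstrap idea described above can be sketched on synthetic data. Everything here is hypothetical: the data are generated with a known slope of 0.5, the helper `slope_of` is an illustrative name, and NumPy resampling stands in for whatever table-based resampling the course uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a known true slope of 0.5 (an assumption for illustration)
n = 200
x = rng.uniform(0, 10, n)
y = 0.5 * x + rng.normal(0, 1, n)

def slope_of(x_arr, y_arr):
    """Regression slope in original units: r * SD(y) / SD(x)."""
    r = np.corrcoef(x_arr, y_arr)[0, 1]
    return r * np.std(y_arr) / np.std(x_arr)

repetitions = 1000
boot_slopes = np.empty(repetitions)
for i in range(repetitions):
    # Resample row indices with replacement, keeping each (x, y) pair together
    idx = rng.integers(0, n, n)
    boot_slopes[i] = slope_of(x[idx], y[idx])

# 95% confidence interval: middle 95% of the bootstrapped slopes
lower, upper = np.percentile(boot_slopes, [2.5, 97.5])
contains_zero = lower <= 0 <= upper
print(lower, upper, contains_zero)
```

If the interval does not contain 0, the data are inconsistent with a true slope of 0 at the 5% level.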

Let's use the full urban table now for this analysis.

In [None]:
# Q1: What are the null and alternative hypotheses?



In [None]:
# Q2: Do 1000 bootstraps of the urban table, calculating the slope for each bootstrap.
# Save the slopes in the slopes array.
slopes = make_array()
repetitions = 1000

for i in np.arange(repetitions):
    ...

In [None]:
# Run this cell!
Table().with_column("Bootstrapped slopes", slopes).hist(0, bins = np.arange(-0.001, 0.008, 0.0005))

In [None]:
## Calculate a 95% confidence interval using the slopes array.
# Set reject_null to True if we REJECT the null or False if we fail to reject the null.

reject_null = None
lower_bound = ...
upper_bound = ...

[lower_bound, upper_bound]

In [None]:
## What does this mean in context?