# Regression, Focus on Kent

Last session, we introduced regression. Today we will finish the work from last time and do a number of exercises, both to expand our understanding of regression and as a review. This time, let’s look at a subset of the data, focusing only on Kent (COUNTY == 1), which has 24 parishes.

Let’s first look at UNEMP and RELIEF in just these 24 parishes. An important difference from correlation is that the regression coefficients change depending on which variable is treated as the explanatory (independent, X-axis) variable and which as the dependent (Y-axis) variable. We can see this by deriving the slope and the intercept both ways.

This is an important point, as you can see from these two images:

Here is a scatter of Unemployment and Relief in Kent; a trend line is also included:

<img src="Kent_UnempRelief.PNG">

Here is a scatter of Relief and Unemployment in Kent; a trend line is also included:

<img src="Kent_ReliefUnemp.PNG">

/1/ Unemployment and Relief: What do coefficients mean? 

/2/ Relief and Unemployment: What do coefficients mean?
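
Before working with the Kent data, the point about swapping the two variables can be sketched on made-up numbers. The block below is only an illustration: the arrays `x` and `y` are synthetic (they are not the UNEMP or RELIEF columns), and the helper functions take plain NumPy arrays rather than a `Table`, which is an assumption made to keep the example self-contained.

```python
import numpy as np

# Synthetic data for illustration only (NOT the Kent parishes).
rng = np.random.default_rng(0)
x = rng.normal(10, 2, size=200)
y = 30.0 + 5.0 * x + rng.normal(0, 15, size=200)

def standard_units(values):
    """Convert an array of numbers to standard units."""
    return (values - np.mean(values)) / np.std(values)

def correlation(a, b):
    """Correlation coefficient r of two arrays."""
    return np.mean(standard_units(a) * standard_units(b))

def slope(explanatory, dependent):
    """Slope of the regression line predicting `dependent` from `explanatory`."""
    r = correlation(explanatory, dependent)
    return r * np.std(dependent) / np.std(explanatory)

# Correlation does not care about the order of the variables ...
r_xy = correlation(x, y)
r_yx = correlation(y, x)

# ... but the regression slope does.
slope_y_on_x = slope(x, y)
slope_x_on_y = slope(y, x)

print("r(x, y) =", r_xy, " r(y, x) =", r_yx)
print("slope of y on x =", slope_y_on_x)
print("slope of x on y =", slope_x_on_y)
```

The second slope is not simply one over the first: multiplying the two slopes gives r squared, so they are reciprocals only when the points lie exactly on a line.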



In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

In [None]:
def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers))/np.std(any_numbers)  

def correlation(t, x, y):
    return np.mean(standard_units(t.column(x))*standard_units(t.column(y)))

def slope(table, x, y):
    r = correlation(table, x, y)
    return r * np.std(table.column(y))/np.std(table.column(x))

def intercept(table, x, y):
    a = slope(table, x, y)
    return np.mean(table.column(y)) - a * np.mean(table.column(x))

def fit(table, x, y):
    """Return the height of the regression line at each x value."""
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * table.column(x) + b

def mean_squared_error(table, x, y):
    def for_line(slope, intercept):
        fitted = (slope * table.column(x) + intercept)
        return np.average((table.column(y) - fitted) ** 2)
    return for_line

def residual_plot(table, x, y):
    fitted = fit(table, x, y)
    residuals = table.column(y) - fitted
    res_table = Table().with_columns([
            'fitted', fitted, 
            'residuals', residuals])
    res_table.scatter(0, 1, fit_line=True)
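
As a sanity check on the formulas used by `slope` and `intercept` above, the sketch below compares them with NumPy's least-squares fit. It assumes plain arrays instead of a `Table`, and the data are synthetic; it is meant as a quick check of the idea, not as part of the lab's own helpers.

```python
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(3)
x = rng.normal(50, 10, size=80)
y = 1.2 * x - 5 + rng.normal(0, 8, size=80)

def standard_units(values):
    """Convert an array of numbers to standard units."""
    return (values - np.mean(values)) / np.std(values)

# Slope and intercept from the correlation, as in the helper functions.
r = np.mean(standard_units(x) * standard_units(y))
slope_from_r = r * np.std(y) / np.std(x)
intercept_from_r = np.mean(y) - slope_from_r * np.mean(x)

# Slope and intercept from a degree-1 least-squares polynomial fit.
ls_slope, ls_intercept = np.polyfit(x, y, 1)

print("from r:        ", slope_from_r, intercept_from_r)
print("least squares: ", ls_slope, ls_intercept)
```

The two approaches should agree up to floating-point error, since the regression line defined through r is the least-squares line.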

In [None]:
wholeData = Table.read_table("https://github.com/data-8/history-connector/raw/gh-pages/Data4.csv")
# Since we focus only on Kent this time, keep only the rows where COUNTY == 1.
kentData = wholeData.where(wholeData.column("COUNTY") .... 1)
kentData.show()

In [None]:
#As a review, what would a correlation matrix reveal?
#kentData.to_df().corr()

/1/ First, write out the equation with UNEMP as the eXplanatory var and RELIEF as the dependent var.

*(double-click to edit)*

In [None]:
# Now, let's create a scatter with a trend line.
kentData.scatter("UNEMP", "RELIEF", fit_line=True, s=20)

# Or, alternatively (uncomment the next line):
#plots.scatter(kentData.column("UNEMP"), kentData.column("RELIEF"))

Question 1: Does it matter if we instead do a scatter of RELIEF and UNEMP?

*(double-click to edit)*

In [None]:
#Let's calculate the slope and intercept

kentslope = slope(kentData, ...., ....)
kentintercept = intercept(kentData, ...., .....)
kentslope, kentintercept

Question 2: If you had to summarize, how would you say that a unit increase in the eXplanatory variable affects the average value of the dependent variable? Consider writing this out as a statement.

*(double-click to edit)*
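
One way to think about the "unit increase" statement is to compare the prediction at some value of X with the prediction at X + 1. The sketch below uses a hypothetical slope and intercept chosen only for illustration (they are not the Kent results), and `predict` is a helper introduced here, not one of the lab's functions.

```python
# Hypothetical coefficients, for illustration only (NOT the Kent values).
example_slope = 2.5
example_intercept = 10.0

def predict(x_value, slope, intercept):
    """Height of the regression line at x_value."""
    return slope * x_value + intercept

# Compare predictions one unit apart, at two different starting points.
change_at_4 = predict(5.0, example_slope, example_intercept) - predict(4.0, example_slope, example_intercept)
change_at_20 = predict(21.0, example_slope, example_intercept) - predict(20.0, example_slope, example_intercept)

print(change_at_4, change_at_20)
```

Whatever the starting point, the predicted (average) value of the dependent variable changes by exactly the slope for each one-unit increase in the explanatory variable.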

In [None]:
#Optional: If you can, calculate the R^2; at least consider a residual plot, to test the linearity assumption.

residual_plot(kentData, '....', '....')

/2/ Now, let's write out the equation with RELIEF as the eXplanatory var and UNEMP as the dependent var.

In [None]:
# Again, let's do a scatter with a trend line.
kentData.scatter(".....", ".....", ...)

# Or, alternatively (uncomment the next line):
#plots.scatter(kentData.column("...."), kentData.column("....."))

In [None]:
#calculate the slope and intercept
kentslope2 = slope(kentData, "....", "....")
kentintercept2 = intercept(kentData, "....", "....")
kentslope2, kentintercept2

Question 3: As we did above, if you had to summarize, how would you say that a unit increase in the eXplanatory variable affects the average value of the dependent variable? Consider writing this out as a statement.

In [None]:
#Optional: If you can, calculate the R^2


Note that no matter which variable we set as the explanatory var and which as the dependent var, the R^2 is the same. The slope and intercept, however, take on different values.
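
A small sketch of this claim on synthetic data follows. It assumes R^2 is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares, and it uses `np.polyfit` for the line instead of the lab's `slope` and `intercept` helpers; the function name `r_squared` is introduced here for illustration.

```python
import numpy as np

# Synthetic data for illustration only (NOT the Kent parishes).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + rng.normal(0, 4, size=100)

def r_squared(explanatory, dependent):
    """R^2 of the least-squares line predicting `dependent` from `explanatory`."""
    slope, intercept = np.polyfit(explanatory, dependent, 1)
    fitted = slope * explanatory + intercept
    sse = np.sum((dependent - fitted) ** 2)              # unexplained variation
    sst = np.sum((dependent - np.mean(dependent)) ** 2)  # total variation
    return 1 - sse / sst

r2_y_on_x = r_squared(x, y)
r2_x_on_y = r_squared(y, x)
r = np.corrcoef(x, y)[0, 1]

print("R^2, y on x:", r2_y_on_x)
print("R^2, x on y:", r2_x_on_y)
print("r squared:  ", r ** 2)
```

For a simple linear regression, both values equal the square of the correlation coefficient, which is why the direction of the regression does not change R^2.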

Question 4: How about WEALTH and POP? Does it look like a linear regression would be a good fit here? How would you go about showing this?
*(double-click to edit)* 
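
One possible way to show whether a line is a good fit is to look for a pattern in the residuals. The sketch below is only a hint at the approach: it uses synthetic, deliberately curved data (not WEALTH or POP), and splitting the residuals into thirds is just one simple, assumed way to make a pattern visible without a plot.

```python
import numpy as np

# Synthetic, deliberately curved data for illustration only.
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 120)  # already sorted from low to high
y_curved = 0.5 * x ** 2 + rng.normal(0, 1, size=x.size)

# Fit a straight line anyway and compute the residuals.
line_slope, line_intercept = np.polyfit(x, y_curved, 1)
residuals = y_curved - (line_slope * x + line_intercept)

# Average residual for low, middle, and high values of x.
low, middle, high = np.array_split(residuals, 3)
print("mean residual (low x):   ", np.mean(low))
print("mean residual (middle x):", np.mean(middle))
print("mean residual (high x):  ", np.mean(high))
```

If a line were appropriate, the residuals would scatter around zero across the whole range of x. Here they are positive at both ends and negative in the middle, the kind of curved pattern a residual plot would reveal.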