# Soc 5 Spring 2019

## Discussion 2: Analyzing Quantitative Data II

Estimated time: 40 minutes

**Before you begin, run the following two cells to load the packages needed for the rest of the notebook**

In [None]:
# RUN THIS CELL or the notebook will not work properly
!pip install numpy
!pip install scipy
!pip install matplotlib
!pip install datascience
!pip install pandas

In [None]:
# RUN THIS CELL or the notebook will not work properly
%run Data/functions.py
%matplotlib inline
from scipy import stats
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

## Introduction

In this discussion, you will learn how to interpret both quantitative and qualitative data through the Chi-Square statistic, T-tests, linear regression, and the $r$ statistic.

### The Data <a id='data'></a>

In this notebook, we'll be revisiting the GSS 2014 data that you saw in Discussion 1. Let's first load in the data.

In [None]:
gss_survey_data = Table.read_table("Data/GSS_2014_cleaned.csv")
gss_survey_data

### The Codebook

Take a look at the `GSS 2014 codebook` PDF file to review what these variables mean. The file is contained within the "Data" folder, which is in the same folder as this notebook. 

If your browser has problems opening the PDF, try downloading it to your computer. To do this:

1. Navigate inside the Data folder
2. Check the box next to the `GSS 2014 codebook.pdf` file
3. Click the "Download" button on the top toolbar

---

## Section 1: Chi-Square Statistic  <a id='section 1'></a>

Here, we are going to learn how to interpret the Chi-Square Statistic.

In this particular example, we will find the Chi-Square Statistic for the responses to NATMASS vs. NATENVIR. Intuitively, we expect that respondents who favor spending more money on mass transportation would also support improving and protecting the environment. We can check whether that is the case by looking at the Chi-Square Statistic.

Firstly, we have to construct a contingency table of the 2 attributes (NATMASS and NATENVIR):

In [None]:
contigTable = generate_3x3_contingency_table(gss_survey_data, "NATMASS", "NATENVIR")
contigTable

The columns of the table correspond to the score a respondent gave for NATMASS, and the rows correspond to the score a respondent gave for NATENVIR. Each cell holds the total number of respondents who gave that particular combination of answers to NATMASS and NATENVIR.

Before we calculate the chi-squared statistic, let's first see the expected distribution of this table:

In [None]:
find_expected_dist(contigTable, "NATMASS", "NATENVIR")

In [None]:
expected = find_expected_dist(contigTable, "NATMASS", "NATENVIR")
expected

The expected distribution is what we would see under the Null Hypothesis, which assumes that there is no relationship between the 2 attributes. If we want to reject the Null Hypothesis, we will have to use the Chi-Square Statistic! <br><br>
(You do not need to know how we solve for the statistic, but you do need to know how to interpret it!)
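
For the curious, here is a minimal sketch of the arithmetic behind an expected distribution and a Chi-Square Statistic. The 3x3 table of counts is made up for illustration and is not the GSS data, and the notebook's own helper functions may compute things differently.

```python
import numpy as np
from scipy import stats

# Hypothetical 3x3 table of observed counts (NOT the GSS data).
observed = np.array([
    [30, 20, 10],
    [20, 30, 20],
    [10, 20, 40],
])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected count for each cell under the Null Hypothesis:
# (row total * column total) / grand total.
expected_counts = np.outer(row_totals, col_totals) / grand_total

# Chi-Square Statistic: sum over all cells of (observed - expected)^2 / expected.
chi_sq = ((observed - expected_counts) ** 2 / expected_counts).sum()

# Cross-check against SciPy's built-in contingency-table test.
scipy_chi_sq, scipy_p, scipy_dof, scipy_expected = stats.chi2_contingency(
    observed, correction=False
)

print(expected_counts)
print("chi-squared statistic =", chi_sq)
print("matches SciPy:", np.isclose(chi_sq, scipy_chi_sq))
```

The larger the gaps between the observed and expected counts, the larger the statistic becomes.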

In [None]:
chi_squared, degree_freedom = find_chi_square(contigTable)

print("chi-squared statistic = " + str(chi_squared))
print("degrees of freedom = " + str(degree_freedom))
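
A Chi-Square Statistic is turned into a p-value by asking how likely a value at least that large would be if the Null Hypothesis were true. The sketch below uses `scipy.stats.chi2.sf` for this. The statistic plugged in is a made-up number, so substitute the values printed by the cell above to try it on the real result.

```python
from scipy import stats

# Hypothetical values for illustration only.
example_chi_squared = 25.0
example_dof = 4

# Survival function: probability of a statistic at least this large
# under the Null Hypothesis.
example_p = stats.chi2.sf(example_chi_squared, example_dof)

print("p-value =", example_p)
```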

The p-value for the Chi-Square Statistic computed from the GSS contingency table (with 4 degrees of freedom) is less than .0001.
Knowing this, is there a significant relationship between NATENVIR and NATMASS? Explain your answer. 

**Answer:** ...

Explain why there are 4 degrees of freedom.

**Answer:** ...

---

## Section 2: T-tests <a id='section 2'></a>

Here, we are going to learn how to use t-tests for differences in means. 

In this example, we will be looking at the differences between the female and the male responses (or more specifically, their responses to NATFARE). 

Our Null Hypothesis would be that there should be no significant difference between the 2 groups' responses to NATFARE, whereas the Alternative Hypothesis is that there exists a significant difference between the 2 groups' responses to NATFARE.

In [None]:
columns_of_interest = ["NATEDUC", "NATFARE", "NATROAD", "NATMASS", "NATHEAL", "NATENVIR"]

females = gss_survey_data.where("SEX", are.equal_to(2))

means_female = generate_means_table(females, columns_of_interest)
means_female.relabel("category", "category (female)")

print("female sample size = " + str(females.num_rows))
means_female

In [None]:
males = gss_survey_data.where("SEX", are.equal_to(1))

means_male = generate_means_table(males, columns_of_interest)
means_male.relabel("category", "category (male)")

print("male sample size = " + str(males.num_rows))
means_male

As you can see above, the 2 tables separate the male responses from the female responses and average the responses within each group. However, we will only use the mean and standard deviation of NATFARE when calculating the t value.

(Again, you do not need to know how we solve for the statistic, but you do need to know how to interpret it!)
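
As a rough illustration, SciPy can run a two-sample t-test directly from each group's mean, standard deviation, and sample size. The numbers below are invented and are not the GSS values. The sketch also assumes the pooled-variance form of the test (`equal_var=True`), whose degrees of freedom are the two sample sizes added together minus 2. Whether `generate_t_value` uses exactly this formula is not confirmed here.

```python
from scipy import stats

# Hypothetical summary statistics (NOT the GSS values).
mean_a, sd_a, n_a = 2.30, 0.75, 800
mean_b, sd_b, n_b = 2.27, 0.78, 700

# Pooled-variance two-sample t-test; degrees of freedom = n_a + n_b - 2.
result = stats.ttest_ind_from_stats(
    mean_a, sd_a, n_a,
    mean_b, sd_b, n_b,
    equal_var=True,
)

print("t value =", result.statistic)
print("p-value =", result.pvalue)
```

A large p-value means a difference this big could easily arise by chance, so the data give little reason to reject the Null Hypothesis.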

In [None]:
t = generate_t_value(means_female, means_male, females.num_rows, males.num_rows, "NATFARE")

print("t value = " + str(t))

The p-value for the t value above, with 1533 degrees of freedom (female sample size + male sample size - 2), is .51.

Knowing this, what can we say about our null hypothesis?

**Answer:** ...

---

## Section 3: Linear Regression <a id='section 3'></a>

Here, we are going to learn how to interpret the linear regression line and its R-statistic (also known as the $r$ value or the Pearson correlation coefficient).

![pearson-r.png](attachment:pearson-r.png)

Here we have a diagram outlining different types of $r$ values. From the picture, we can see that data with a positive slope (positive correlation) has a positive $r$ value, and data with a negative slope has a negative $r$ value. Also, the closer the data is to the line, the further the $r$ value is from 0; if the data has essentially no correlation, the $r$ value is close to 0.
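
To make these cases concrete, the sketch below computes $r$ by hand for three small made-up datasets (not GSS data): points lying exactly on a rising line, points lying exactly on a falling line, and points with no linear trend. The helper `pearson_r` is written just for this illustration.

```python
import numpy as np

x = np.arange(8)

# Three made-up patterns (NOT GSS data).
y_up = 2 * x + 1                                  # exactly on a rising line
y_down = -0.5 * x + 3                             # exactly on a falling line
y_flat = np.array([1, -1, -1, 1, 1, -1, -1, 1])   # no linear trend

def pearson_r(a, b):
    """Sum of products of deviations, scaled by the spread of a and of b."""
    a_dev = a - a.mean()
    b_dev = b - b.mean()
    return (a_dev * b_dev).sum() / np.sqrt((a_dev ** 2).sum() * (b_dev ** 2).sum())

for name, y in [("rising", y_up), ("falling", y_down), ("no trend", y_flat)]:
    print(name, "r =", round(float(pearson_r(x, y)), 3))
```

The first two datasets give $r$ values at the extremes of the scale because every point sits on the line, while the third gives an $r$ value of essentially 0.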

In this example, we will see if there is a correlation between AGE and EDUC. The code below will make a scatter plot of the 2 attributes:

In [None]:
plt.scatter(gss_survey_data.column("AGE"), gss_survey_data.column("EDUC"), alpha=.2)
plt.xlabel('AGE')
plt.ylabel('EDUC')
plt.show()

Well, things aren't looking too good; it's very difficult to spot the correlation, if any, between EDUC and AGE. We can still try to find the best-fit line and its corresponding $r$ value:

In [None]:
m, c, r_value, p_value, std_err = stats.linregress(gss_survey_data["AGE"], gss_survey_data["EDUC"])

plt.plot(gss_survey_data["AGE"], gss_survey_data["EDUC"], 'o', label='Original data')
plt.plot(gss_survey_data["AGE"], m*gss_survey_data["AGE"] + c, 'r', label='Fitted line')

print("regression line : y=" + str(m) +"x + " + str(c))
print("r value = "+ str(r_value))
plt.xlabel('AGE')
plt.ylabel('EDUC')
plt.show()

From what you know about the Pearson $r$ statistic, what can you say about the data, given the $r$ value? Also, is the plot contrary to what you expect? Explain your answers.

**Answer:** ...

Since the correlation between AGE and EDUC is not great, let's look at something else. 

Let's now try grouping the table by EDUC, then plotting the mean NATEDUC response for each group. As a reminder, NATEDUC records respondents' views on how much money should be spent on improving the nation's education system.

In [None]:
grouped_educ = gss_survey_data.group("EDUC", np.mean)
grouped_educ
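
The grouping step can be pictured as collecting all the rows that share an EDUC value and averaging them. Here is a small sketch of that idea using plain NumPy arrays and invented values (not GSS data). It illustrates the concept only and is not how the `datascience` library implements `group`.

```python
import numpy as np

# Hypothetical responses (NOT GSS data): one entry per respondent.
educ = np.array([12, 12, 16, 16, 16, 20])
nateduc = np.array([1, 3, 1, 2, 3, 1])

# For each distinct EDUC level, average the NATEDUC answers of that group.
group_means = {
    int(level): float(nateduc[educ == level].mean())
    for level in np.unique(educ)
}

for level, mean_value in group_means.items():
    print("EDUC =", level, "-> mean NATEDUC =", mean_value)
```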

In [None]:
plt.scatter(grouped_educ["EDUC"], grouped_educ["NATEDUC mean"])
plt.xlabel('EDUC')
plt.ylabel('mean NATEDUC')
plt.show()

Immediately, we can see that there is a much stronger correlation between EDUC and the mean NATEDUC for each level of education. Now it's your turn to experiment and try to find the line of best fit! Remember that the line of best fit is the line that minimizes the error, which here is the average of the squared vertical distances between the line and the points.
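
As a small sketch of what "minimizing the error" means, the code below scores a couple of guessed lines against made-up points (not the grouped GSS data) and compares them with the least-squares line from `np.polyfit`. The function name `mean_squared_error` and the guessed slopes and intercepts are chosen only for this illustration.

```python
import numpy as np

# Made-up points (NOT the grouped GSS data).
x = np.array([0.0, 4.0, 8.0, 12.0, 16.0, 20.0])
y = np.array([1.95, 1.90, 1.88, 1.80, 1.78, 1.70])

def mean_squared_error(m, c):
    """Average squared vertical distance between the line y = m*x + c and the points."""
    return ((y - (m * x + c)) ** 2).mean()

# Least-squares slope and intercept (degree-1 polynomial fit).
best_m, best_c = np.polyfit(x, y, 1)

print("guess 1 error =", mean_squared_error(-0.02, 2.0))
print("guess 2 error =", mean_squared_error(-0.01, 1.95))
print("least-squares error =", mean_squared_error(best_m, best_c))
```

No guessed line can score a lower error than the least-squares line, which is why it is called the line of best fit.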

Below is an interactive plot that lets you control the slope and y-intercept of a line. Your job is to find the line that minimizes the error printed under the plot!

The plot may take some time to load, so be patient with it :)

Play around with the plot and try your best to minimize the error; you don't have to be exact! 

In [None]:
@interact(m=(-5/100, 0, 1/1000), c=(1.6, 2, 1/100))
def g(m, c):
    est = m*grouped_educ["EDUC"] + c
    plt.plot(grouped_educ["EDUC"], grouped_educ["NATEDUC mean"], 'o', label='Original data')
    plt.plot(grouped_educ["EDUC"], est, 'r', label='Fitted line')
    plt.xlabel('EDUC')
    plt.ylabel('mean NATEDUC')
    plt.show()
    
    error = ((grouped_educ["NATEDUC mean"] - (m*grouped_educ["EDUC"] + c))**2).mean()
    
    print("y = "+str(m)+"x+"+str(c))
    print("error = "+str(error))
    return

Now let's calculate the true line of best fit, and the corresponding error and $r$ value.

In [None]:
m_i, c_i, r_value, p_value, std_err = stats.linregress(grouped_educ["EDUC"], grouped_educ["NATEDUC mean"])
plt.plot(grouped_educ["EDUC"], grouped_educ["NATEDUC mean"], 'o', label='Original data')
plt.plot(grouped_educ["EDUC"], m_i*grouped_educ["EDUC"] + c_i, 'r', label='Fitted line')

error = ((grouped_educ["NATEDUC mean"] - (m_i*grouped_educ["EDUC"] + c_i))**2).mean()
print("regression line : y=" + str(m_i) +"x + " + str(c_i))
print("error = " + str(error))
print()
print("r value = "+ str(r_value))
plt.show()

How close was your line in the interactive plot to the true best fit line? Given the $r$ value, what can you say about the data?

**Answer:** ...

## Bibliography

- Pearson Product-Moment Correlation (picture). https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php

Notebook developed by: William Sheu

Data Science Modules: http://data.berkeley.edu/education/modules