# [SOC-5] Analysis of Quantitative Data Discussion

---

*Estimated Time: 50 minutes*

---

### Topics Covered
- Chi-Square Statistic
- t Tests
- Linear Least-squares Regression
    - R Statistic

### Table of Contents

[The Data](#data)<br>

[Context]<br>

1 - [Chi-Square Statistic](#section%201)<br>

2 - [t Tests](#section%202)<br>

3 - [Linear Least-squares Regression](#section%203)<br>

In [None]:
from datascience import *
%matplotlib inline
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

---

## The Data <a id='data'></a>
Explanation of the data the students will be working with. 

<b>Example</b>: In this notebook, you'll be working with ..............


In [None]:
# Load the GSS 2014 data.
gss_survey_data = Table.read_table("Data/GSS_2014_data.csv")

# Keep only the rows whose codes fall inside the range given for each variable.
gss_survey_data = gss_survey_data.where("AGE", are.between('0','89'))
gss_survey_data = gss_survey_data.where("SEX", are.between_or_equal_to('1','2'))
gss_survey_data = gss_survey_data.where("EDUC", are.between_or_equal_to('0','96'))
gss_survey_data = gss_survey_data.where("NATEDUC", are.between_or_equal_to('1','4'))
gss_survey_data = gss_survey_data.where("NATFARE", are.between_or_equal_to('1','4'))
gss_survey_data = gss_survey_data.where("NATROAD", are.between_or_equal_to('1','4'))
gss_survey_data = gss_survey_data.where("NATMASS", are.between_or_equal_to('1','4'))
gss_survey_data = gss_survey_data.where("NATHEAL", are.between_or_equal_to('1','4'))
gss_survey_data = gss_survey_data.where("NATENVIR", are.between_or_equal_to('1','4'))

# Convert every column to integers so the values can be used in calculations.
for label in gss_survey_data.labels:
    gss_survey_data = gss_survey_data.with_column(label, gss_survey_data.column(label).astype(int))

gss_survey_data

## Section 1: Chi-Square Statistic  <a id='section 1'></a>

Intro to section 1 here.

In [None]:
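# Cross-tabulate the counts: one column per NATMASS code, one row per NATENVIR code.
contigTable = gss_survey_data.groups(["NATMASS", "NATENVIR"]).pivot("NATMASS", "NATENVIR", values="count", collect=np.sum)
# Row totals: sum the counts in each row, skipping the NATENVIR code in position 0.
contigTable = contigTable.with_column("total", [np.sum(list(contigTable.row(n))[1:]) for n in np.arange(contigTable.num_rows)])
# Column totals and the grand total (hard-coded).
contigTable = contigTable.with_row(["total", 610, 795, 130, 1535])
contigTable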

In [None]:
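# Expected count for each cell under independence: row total * column total / grand total.
expected = [[contigTable.column("total")[row]*(contigTable.column(col)[3]/1535) for col in ["1", "2", "3"]] for row in [0, 1, 2]]
# Observed counts for the same cells, in the same order.
observed = [[contigTable.column(col)[row] for col in ["1", "2", "3"]] for row in [0, 1, 2]]
# Flatten both 3x3 nested lists into 1-D arrays.
expected = np.concatenate(expected).ravel()
observed = np.concatenate(observed).ravel()

# Chi-squared statistic: sum over all cells of (observed - expected)^2 / expected.
chi_squared = np.sum((expected-observed)**2/expected)
# Degrees of freedom: (rows - 1) * (columns - 1) = (3 - 1) * (3 - 1).
degree_freedom = 4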

print("chi-squared statistic = " + str(chi_squared))
print("degrees of freedom = " + str(degree_freedom))

The critical chi-squared value at the 0.05 significance level with 4 degrees of freedom is 9.488. Knowing this, is there a statistically significant association between NATENVIR and NATMASS? Explain your answer.
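
As a cross-check on the manual calculation, the sketch below computes the critical value and the test statistic with SciPy. The counts in `observed` are made-up placeholders, not the GSS table, and the variable names are chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical 3x3 contingency table (illustrative counts, NOT the GSS data).
observed = np.array([[30, 20, 10],
                     [20, 40, 15],
                     [10, 15, 25]])

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
expected = np.outer(row_totals, col_totals) / observed.sum()

# Chi-squared statistic and degrees of freedom, computed by hand.
chi_squared = np.sum((observed - expected) ** 2 / expected)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Critical value at the 0.05 significance level.
critical_value = stats.chi2.ppf(0.95, dof)

# SciPy's built-in test; the continuity correction only applies when dof == 1.
chi2_scipy, p_value, dof_scipy, expected_scipy = stats.chi2_contingency(observed)

print("chi-squared statistic =", chi_squared)
print("critical value =", critical_value)
print("p value =", p_value)
print("reject independence at 0.05:", chi_squared > critical_value)
```

If the statistic exceeds the critical value (equivalently, if the p value is below 0.05), the data are unlikely under the assumption that the two variables are independent.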

---

## Section 2: t tests <a id='section 2'></a>

Intro to section 2 here.

In [None]:
lst = ["NATEDUC", "NATFARE", "NATROAD", "NATMASS", "NATHEAL", "NATENVIR"]
females = gss_survey_data.where("SEX", are.equal_to(2))
data = [females[col] for col in lst]
mean_values = [np.mean(col) for col in data]
std_values = [np.std(col) for col in data]
means_female = Table().with_column("category (female)", lst).with_columns("mean", mean_values, "standard deviation", std_values)

print("female sample size = " + str(females.num_rows))
means_female

In [None]:
males = gss_survey_data.where("SEX", are.equal_to(1))
data = [males[col] for col in lst]
mean_values = [np.mean(col) for col in data]
std_values = [np.std(col) for col in data]
means_male = Table().with_column("category (male)", lst).with_columns("mean", mean_values, "standard deviation", std_values)

print("male sample size = " + str(males.num_rows))
means_male

In [None]:
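# Pooled standard deviation; the square root applies to the whole fraction.
s_p = (((females.num_rows - 1)*(0.56339)**2 + (males.num_rows - 1)*(0.602806)**2)/(males.num_rows + females.num_rows - 2))**.5
# t statistic for the difference between the two group means (values hard-coded).
t = (1.28996 - 1.36401)/(s_p*(1/females.num_rows + 1/males.num_rows)**.5)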
print("t value = " + str(t))

The critical t value for a one-tailed test at the 0.05 significance level with 1533 degrees of freedom (= female sample size + male sample size - 2) is 1.645848. Knowing this, what can we say about our null hypothesis that the female and male means are equal?
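
The pooled two-sample t statistic and its critical value can also be obtained from SciPy. The sketch below uses made-up summary statistics (not values from the GSS tables) and assumes equal variances in the two groups; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical summary statistics (illustrative values, NOT taken from the GSS data).
mean_a, std_a, n_a = 1.30, 0.55, 800
mean_b, std_b, n_b = 1.36, 0.60, 700

# Pooled standard deviation: the square root covers the whole fraction.
dof = n_a + n_b - 2
s_p = np.sqrt(((n_a - 1) * std_a**2 + (n_b - 1) * std_b**2) / dof)

# Two-sample t statistic with equal variances assumed.
t_manual = (mean_a - mean_b) / (s_p * np.sqrt(1 / n_a + 1 / n_b))

# SciPy computes the same statistic from summary statistics
# (it expects sample standard deviations, i.e. computed with ddof=1).
t_scipy, p_two_sided = stats.ttest_ind_from_stats(
    mean_a, std_a, n_a, mean_b, std_b, n_b, equal_var=True)

# One-tailed critical value at the 0.05 significance level.
t_critical = stats.t.ppf(0.95, dof)

print("t (manual) =", t_manual)
print("t (SciPy)  =", t_scipy)
print("one-tailed critical t =", t_critical)
```

Note that `np.std` uses `ddof=0` by default, so pass `ddof=1` when the sample standard deviation is wanted; with samples this large the difference is small.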

## Section 3: Linear Regression <a id='section 3'></a>

Intro to section 3 here.

### R statistic <a id='subsection 1'></a>

Intro to subsection 1 here.

In [None]:
plt.scatter(gss_survey_data["AGE"], gss_survey_data["EDUC"])
plt.show()

In [None]:
# Fit a least-squares line: m is the slope, c is the intercept.
m, c, r_value, p_value, std_err = stats.linregress(gss_survey_data["AGE"], gss_survey_data["EDUC"])
plt.plot(gss_survey_data["AGE"], gss_survey_data["EDUC"], 'o', label='Original data')
plt.plot(gss_survey_data["AGE"], m*gss_survey_data["AGE"] + c, 'r', label='Fitted line')
plt.legend()

print("regression line : y=" + str(m) +"x + " + str(c))
print("r value = "+ str(r_value))
plt.show()

In [None]:
grouped_educ = gss_survey_data.group("EDUC", np.mean)
plt.scatter(grouped_educ["EDUC"], grouped_educ["NATEDUC mean"])
plt.show()

In [None]:
def squared_error(ys_orig, ys_line):
    # Sum of squared differences between the line and the original values.
    return np.sum((np.asarray(ys_line) - np.asarray(ys_orig)) ** 2)

def coefficient_of_determination(ys_orig, ys_line):
    # r^2 = 1 - (squared error of the line) / (squared error of the mean line).
    y_mean_line = np.full(len(ys_orig), np.mean(ys_orig))
    squared_error_regr = squared_error(ys_orig, ys_line)
    squared_error_y_mean = squared_error(ys_orig, y_mean_line)
    return 1 - (squared_error_regr / squared_error_y_mean)

@interact(m=(-50, 0), c=(100, 200))
def g(m, c):
    est = m/1000*grouped_educ["EDUC"] + c/100
    plt.plot(grouped_educ["EDUC"], grouped_educ["NATEDUC mean"], 'o', label='Original data')
    plt.plot(grouped_educ["EDUC"], est, 'r', label='Fitted line')
    plt.show()
    
    return "y="+str(m/1000)+"x+"+str(c/100)+"     r^2="+str(coefficient_of_determination(grouped_educ["NATEDUC mean"], est))

In [None]:
# Fit a least-squares line to the group means: m is the slope, c is the intercept.
m, c, r_value, p_value, std_err = stats.linregress(grouped_educ["EDUC"], grouped_educ["NATEDUC mean"])
plt.plot(grouped_educ["EDUC"], grouped_educ["NATEDUC mean"], 'o', label='Original data')
plt.plot(grouped_educ["EDUC"], m*grouped_educ["EDUC"] + c, 'r', label='Fitted line')
plt.legend()

print("regression line : y=" + str(m) +"x + " + str(c))
print("r^2 value = "+ str(r_value**2))
plt.show()
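
For a least-squares line with an intercept, the r^2 computed as 1 minus the ratio of squared errors equals the square of the r value returned by `stats.linregress`. The sketch below checks this on made-up points (not the GSS data); the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical data (illustrative values, NOT the GSS data).
x = np.array([8, 10, 12, 14, 16, 18, 20], dtype=float)
y = np.array([1.60, 1.52, 1.45, 1.41, 1.33, 1.30, 1.21])

# Least-squares fit and the fitted values on the line.
result = stats.linregress(x, y)
y_hat = result.slope * x + result.intercept

# r^2 from the definition: 1 - (error around the line) / (error around the mean).
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r_squared = 1 - ss_res / ss_tot

print("r^2 from squared errors =", r_squared)
print("r^2 from linregress     =", result.rvalue ** 2)
```

Any other line, such as one chosen with the sliders above, gives a larger squared error and therefore a smaller r^2 than the least-squares line.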

---
## Survey

Have any feedback about this notebook? Please fill out our survey:

https://docs.google.com/forms/d/e/1FAIpQLSe54U3E64kYFWwQHSUpAvWYMuJOdKzbHDZjPa3nMUlHSSs0PQ/viewform

---

Notebook developed by: X, X, X

Data Science Modules: http://data.berkeley.edu/education/modules
