# [Global102] Lab III: Analyzing the Data we Collected!


---


### Professor: Tiffany Page

In Lab 3, you will import the data set our class created through our survey project, analyze it, create visualizations and interpret that data. You are encouraged to work on this in small groups. In Lab III, we have kept some of the explanatory cells from Labs I and II to aid you in your analysis.



Estimated Time: 2 hours

---

# Part 1: The Jupyter Notebook <a id='section 0'></a>

Before we start our lab, we want to give you a brief introduction to Jupyter Notebooks (like this one) where you will work on conducting your survey analysis. 

**Jupyter notebooks** are documents that can contain a seamless compilation of text, code, visualizations, and more. A notebook is composed of two types of rectangular **cells**:  markdown and code. A **markdown cell**, such as this one, contains text. A **code cell** contains code. All of the code in this notebook is in a programming language called **Python**. You can select any cell by clicking it once. After a cell is selected, you can navigate the notebook using the up and down arrow keys or by simply scrolling.

### 1.1 Run a cell <a id='subsection 0a'></a>
To run a code cell once it's been selected, 
- press `Shift` + `Enter`, or
- click the Run button in the toolbar at the top of the screen. 

If a code cell is running, you will see an asterisk (\*) appear in the square brackets to the left of the cell. Once the cell has finished running, a number corresponding to the order in which the cell was run will replace the asterisk and any output from the code will appear under the cell.

### 1.2 Writing Comments in a cell <a id='subsection 0b'></a>
You'll notice that many code cells contain lines of blue text that start with a `#`. These are *comments*. Comments often contain helpful information about what the code does or what you are supposed to do in the cell. The leading `#` tells the computer to ignore the rest of the line.
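
As a small illustration, the sketch below has one comment on its own line and one at the end of a line of code. The variable name `total` is made up for this example.

```python
# This whole line is a comment, so Python ignores it.
total = 2 + 3  # a comment can also follow code on the same line
print(total)
```

Only `total = 2 + 3` and `print(total)` are actually run; everything after each `#` is skipped.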

### 1.3 Editing a cell <a id='subsection 0c'></a>

**Question 1.3.1** You can edit a Markdown cell by clicking it twice. Text in Markdown cells is written in [**Markdown**](https://daringfireball.net/projects/markdown/), a formatting syntax for plain text, so you may see some funky symbols when you edit a text cell. Once you've made your changes, you can exit text editing mode by running the cell. 

### 1.4 Saving and loading <a id='subsection 0d'></a>


Your notebook can record all of your text and code edits, as well as any graphs you generate or calculations you make. You can save the notebook in its current state by pressing `Control-S`/`Command-S`, clicking the **floppy disc icon** in the toolbar at the top of the page, or by navigating to **File > Save and Checkpoint** in the menu bar.

The next time you open the notebook, it will look the same as when you last saved it.

**Note:** After loading a notebook you will see all the outputs (graphs, computations, etc) from your last session, but you won't be able to use any variables you assigned or functions you defined. You can get the functions and variables back by re-running the cells where they were defined – the easiest way is to **highlight the cell where you left off work, then go to Cell > Run all above** in the menu bar. You can also use this menu to run all cells in the notebook by clicking **Run all**.

# Part 2: Introduction To Python <a id='section 1'></a>

Now that you are comfortable with using Jupyter Notebooks, we also need to learn a programming language to communicate with the computer. 

**Programming** is giving the computer a set of step-by-step instructions to follow in order to execute a task. It's a lot like writing your own recipe book! In order to communicate with computers, we must talk to them in a way that they can understand us, via a **programming language**. 

There are many different kinds of programming languages, but we will be using **Python** in this lab because it is concise, simple to read, and applicable in a variety of projects – from web development to mobile apps to data analysis.

In [None]:
# Just simply run this cell (Don't worry if you do not understand it)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## 2.1 Data types <a id='subsection 1a'></a>
Almost all data you will work with broadly falls into two types: numbers and text. 

**Numerical data** shows up green in code cells and can be positive, negative, or include a decimal.

**Text data** (also called *strings*) shows up red in code cells. Strings are enclosed in double or single quotes. Note that numbers can appear in strings.

**Note:** In Python, the single quotation mark and double quotation mark can both denote the start and end of a string. A string must start and end with the same type of quotation mark. If your string contains either type of quotation mark, you can either "escape" this character using a `\` or simply surround the string with the other type of quotation mark. Keep in mind that apostrophes are equivalent to single quotation marks. In this lab, we will use double quotation marks (`"string"`) for consistency.
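
The sketch below shows both kinds of data, along with the two ways of putting a quotation mark inside a string. All of the names and values are made up for illustration.

```python
# Numerical data: positive, negative, or with a decimal point.
whole_number = 4
negative_decimal = -2.5

# Text data (strings): wrapped in matching quotation marks.
plain_string = "I have 2 cats"            # numbers can appear inside a string
escaped_quotes = "She said \"hello\""     # inner quotes escaped with a backslash
other_quotes = 'She said "hello"'         # or wrap the string in the other quote type

print(type(whole_number), type(negative_decimal), type(plain_string))
print(escaped_quotes == other_quotes)     # both spellings give the same string
```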

## 2.2  Variables <a id='subsection 1b'></a>
A variable is a named place in the computer's memory where a programmer can store data and later retrieve it using the variable name. We give variables values using an **assignment statement**. Storing a value in a variable lets us save an intermediate result and reuse it later.

The assignment statement has three parts. On the left is the *variable name* `income`. On the right is the *value* 10. The *equals sign* in the middle tells the computer to assign the value to the name. You'll notice that when you run the cell with the assignment, it doesn't print anything. But, if we try to access `income` and `tax` again in the future, they will have the value we assigned them.

We can also assign strings to variables.
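
A minimal sketch of the assignments described in this section. Only `income = 10` comes from the text; the value given to `tax` and the names `after_tax` and `course_name` are made up for illustration.

```python
income = 10                  # assignment: nothing is printed when this runs
tax = 2                      # hypothetical value, chosen only for this sketch
course_name = "Global102"    # strings can be assigned to variables too

# Retrieve the stored values by name and save an intermediate result.
after_tax = income - tax
print(after_tax)
print(course_name)
```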

## 2.3 Python  List <a id='subsection 1c'></a>

In the previous section, we introduced *variables*, which can be used to store data. What if we want to store more than one value in a variable? In Python, a `list` is a data structure that contains multiple data items. Lists in Python can be created by placing a sequence of data inside square brackets `[]`. The items in the list must be separated by commas.
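
For example, the sketch below creates two lists and checks how many items one of them holds. The list names and their contents are made up for illustration.

```python
# Items go inside square brackets, separated by commas.
cake_flavors = ["chocolate", "vanilla", "strawberry"]
cake_prices = [2, 4.5, 3]

print(cake_flavors)
print(len(cake_flavors))   # len() reports how many items the list holds
```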

## 2.4 Functions <a id='subsection 1d'></a>
A function is a procedure which works a lot like a machine: it takes an input, does something to it, and produces an output. The input is put between parentheses and can also be called the _argument_ or _parameter_. Functions can have multiple arguments. Defining functions can be helpful when you find yourself needing to do the same procedure multiple times with slightly different inputs.

Using a defined function is known as _calling_ a function. To call a function, simply write the name of the function followed by your input (the argument) in parentheses.
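
Here is a short sketch of defining a function and then calling it twice with different inputs. The function `percent_of_total` is a made-up example and is not one of the lab's helper functions.

```python
def percent_of_total(part, total):
    """Return what percent of `total` is made up by `part`."""
    return 100 * part / total

# Calling the function: its name, then the arguments in parentheses.
print(percent_of_total(25, 200))
print(percent_of_total(3, 12))
```

Because the procedure lives in one place, changing the inputs is all it takes to reuse it.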



## 2.5 Importing Python Packages <a id='subsection 1e'></a>


Most programming involves work that is very similar to work that has been done before. Since writing code is time-consuming, it's good to rely on others' published code when you can. Rather than copy-pasting, many programming languages allow us to **import packages**. A module is a file of Python code that defines variables and functions. A Python package is a collection of modules. By importing packages, we are able to use previously written functions in our own code.
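
To show what importing looks like, the sketch below uses `math`, a module that ships with Python. It was chosen only because it is always available; it is not one of the packages this lab relies on.

```python
import math             # import the whole module; use its names as math.<name>
from math import sqrt   # import a single name so it can be used directly

print(math.sqrt(16))
print(sqrt(16))
```

The form `from some_module import *`, which this lab uses, goes one step further and makes every public name in the module available without a prefix.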

Run the cell below to import all the packages that we'll use later to conduct our analysis.

In [None]:
# Run this cell to import the following modules 

from datascience import *
from utils import *

**Note:** Most of our data cleaning will be done using the methods of `Table` from the `datascience` module. However, the `datascience` module does not have methods for dealing with missing values.

For this lab, we wrote a module called `utils`, which contains helper functions for dealing with missing values, extracting substrings, converting descriptive answers to numerical values, and so on. Your task is to learn how to apply those functions to clean the survey data. You do not have to know exactly how those functions are written.

# Part 3: Table and Table operations <a id='section 2'></a>

### `Table`
Now that we have learned the basics of Python, we can use Python to analyze our dataset. But before getting into that, we need to learn about a useful structure for storing our data in a clean and organized way.

First, let's understand what `Table`s are. `Table`s are a way of representing tabular data and are part of the `datascience` package that we imported earlier. A `Table` can be viewed in two ways:
* a sequence of named columns that each describe a single attribute of all entries in a data set, or
* a sequence of rows that each contain all information about a single individual in a data set.


### 3.1 Create Tables <a id='subsection 2a'></a>
There are two general ways to create a table:
1. We can import data from another file and display it as a Table using the `Table().read_table("file_name")` method. We will talk more on this when we try to import our survey dataset below. 

2. We can also create a table from scratch. For example, let's say we have three lists, one with a list of different flavors of cake, one with a list of their prices, and another one with the rating for each flavor of the cake. Then, we can create a new `Table` with each of these lists as columns with the `.with_columns()` method: 


If you want to access specific rows, columns, or cells from the data, that's where `Table` methods come in handy! 

### 3.2 Table methods <a id='subsection 2b'></a>
`Table`s in the `datascience` module have functions associated with them; we call those functions **methods**. (The terms "method" and "function" are technically not the same thing.) The statement `from datascience import *` imports everything the `datascience` package makes available, including `Table` and its methods. A `Table` method is just like a function, but it must operate on a `Table`. An example call may look like:

`tbl.method(arguments)` 


**Hint:** `tbl.take([row_indices])` takes in a list of row indices and returns the rows associated with those indices.

**Note:** The first row of a Table in Python has index 0, the second row has index 1, and so on. If we want to access the first row of a Table, we need to call `tbl.take(0)`.
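
The same zero-based counting applies to ordinary Python lists, so a plain list can stand in for a table's rows. In this sketch the list `responses` and its contents are made up; the index list `[4, 5, 6, 7]` picks out the 5th through 8th items.

```python
# A plain list standing in for the rows of a table.
responses = ["row 1", "row 2", "row 3", "row 4",
             "row 5", "row 6", "row 7", "row 8"]

first = responses[0]                           # index 0 is the first item
row_indices = [4, 5, 6, 7]                     # the 5th through 8th items
picked = [responses[i] for i in row_indices]

print(first)
print(picked)
print(list(range(4, 8)) == row_indices)        # range(4, 8) builds the same indices
```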

Here, we only briefly talked about `Table` methods. In the following sections, we are going to learn how to use `Table` methods to clean and analyze our data. 

**Note:**
We understand that it is hard for most programming beginners to absorb all of this in one lab. We will walk you through the data cleaning and wrangling process step by step to make it as easy as possible.

Recommended Reading:
 * [Introduction to tables](https://www.inferentialthinking.com/chapters/03/4/Introduction_to_Tables)
 * [Data 8 online Textbook: Chapter 3](http://www.inferentialthinking.com/chapters/03/programming-in-python.html)
 * [datascience.tables documentation](http://data8.org/datascience/tables.html#tables-overview)
 
After this lab, if you still have questions, try reading the `datascience` package documentation for more explanations. For now, let's clean some data!

# Part 4: Acquiring Our Survey Data <a id='section 3'></a>


### Data Context 

In this section, you'll be working with the data we collected as a class. 


## 4.1 Import and display the data <a id='subsection 3a'></a>

The survey data is saved in a [csv](https://en.wikipedia.org/wiki/Comma-separated_values) file. The `datascience` package has a `read_table` method, which allows us to read in the data and display it as a table. In general, to import data from a `.csv` file, we write `Table.read_table("file_name")`.

**TASK:** Run the cell below to import the csv file `GS102Fall2022_cleaned.csv` which stores our data and display it as a table, and name the table `raw_data`.

In [None]:
raw_data = Table.read_table("GS102Fall2022_cleaned.csv")

Let's examine the table `raw_data` to see what data it contains. 

Calling the `tbl.show(n)` method displays the first `n` rows of the table. For example, `raw_data.show(5)` displays the first 5 rows of the table `raw_data`. Make sure not to call `.show()` without an argument, as this will crash your notebook!

**TASK:** Display the top five rows of the `raw_data` table. 

In [None]:
# show the first five rows of the raw_data table 
raw_data.show(5)

Instead of only displaying the top rows of the table, we may want to look at some arbitrary range of rows in the table. We can use the `tbl.take()` method.

**TASK:** Display the 5th row to the 8th row of the table by running the cell below. 

In [None]:
# display the 5th row to the 8th row
raw_data.take([4,5,6,7])

Well done! You successfully displayed the data as a table and now we can manage the data using Table methods!


## 4.2 Data Overview <a id='subsection 3b'></a>

Now we want more information about the data. First, we want to know the size of the table. For this data, the number of rows in the table is the number of people who answered the survey (assuming no duplicates). The number of columns corresponds to the number of questions in the survey. `Table`s have the property `num_rows`, which tells you how many rows are in a `Table`. (A "property" can be thought of as a method that doesn't need to be called by adding parentheses.)

Example call: `tbl.num_rows` 

**TASK:** Run the cell below to print out the number of rows in the table. 

In [None]:
# print the number of rows in the table 
num_surveydata_rows = raw_data.num_rows
print("The table has {} rows in it!".format(num_surveydata_rows))

Similarly, the property `num_columns` returns the number of columns in a table. Example call: `tbl.num_columns` 

**TASK:** Run the cell below to print out the number of columns in the table.

In [None]:
# Run this cell
num_surveydata_columns = raw_data.num_columns
print("The table has", num_surveydata_columns, "columns in it!")

#### How many survey responses do we have?

#### Answer:

#### 4.3.1.2 Select Columns 

We can use the `Table` method `tbl.select(column_names)` to choose only the columns that we want from `tbl`. It takes any number of arguments, as long as all are column names in our table. This method returns a new table with only those columns in it. *The columns are in the order in which they were listed as arguments*. Check the documentation of [Table.select](http://data8.org/datascience/_autosummary/datascience.tables.Table.select.html) for more detailed information. 

**TASK:** Run the cell below to get a table containing only two columns: 'Q12' and 'Q2'. 

In [None]:
# This is just an example to show how to obtain a new table with two columns 'Q12', 'Q2'
raw_data.select("Q12", "Q2")

Below, we give a Summary of `Table` methods that we used through this notebook:

### Summary of `Table` methods ###

    
|Name|Example|Purpose|
|-|-|-|
|`Table`|`Table()`|Create an empty table, usually to extend with data|
|`Table.read_table`|`Table.read_table("my_data.csv")`|Create a table from a csv file|
|`with_columns`|`tbl = Table().with_columns("cake price", [2, 4.5], "cake name", ["chocolate", "strawberry"])`|Create a copy of a table with more columns|
|`sort`|`tbl.sort("cake price")`|Create a copy of a table sorted by the values in a column|
|`where`|`tbl.where("cake price", are.above(2))`|Create a copy of a table with only the rows that match some *predicate*|
|`num_rows`|`tbl.num_rows`|Compute the number of rows in a table|
|`num_columns`|`tbl.num_columns`|Compute the number of columns in a table|
|`select`|`tbl.select("column name")`|Create a copy of a table with only some of the columns|
|`exclude`|`tbl.exclude([row_indices])`|Create a copy of a table without some of the rows|
|`take`|`tbl.take([row_indices])`|Create a copy of the table with only the rows whose indices are in the given array|

### Summary of `utils` functions ###

    
|Name|Example|Purpose|
|-|-|-|
|`encode_nans`|`encode_nans(table, column_name)`|Converts "nan" strings in the column_name to `None`|
|`encode_nans_table`|`encode_nans_table(table)`|Converts all "nan" strings in a `Table` to `None`|
|`get_first_selection`|`get_first_selection(table, column_name)`|Keep the first selection for a question|
|`get_mixed_category`|`get_mixed_category(table, column_name, string)`|Create a 'Mixed string' category|
|`missing_proportion`|`missing_proportion(table, column_name)`|Calculate the proportion of missing values in a column|
|`drop_nonserious_rows`|`drop_nonserious_rows(table, column_name)`|Delete the non-serious responses in the survey|

##### Dependencies: (Run the cell below before continuing)

In [None]:
from datascience import *
from utils import *
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
sns.set()

### 5.1 The Data <a id = 'section0'></a>

In [None]:
data = Table().read_table('GS102Fall2022_cleaned.csv')
data.show(3)

### 6.1 Relationship between demographic factors and involvement in political organizations on campus <a id = 'section1'></a>


#### Displaying Rows

Let's get an idea of the data we're working with.

TASK: Use the .select method to display only the columns pertaining to gender and involvement in political organizations on campus.

Remember that you can use Python's lists ([x, y]) to select more than one column at once. Save this new table into a name called politicalorgs_gender.

In [None]:
politicalorgs_gender = data.select(['Q13', 'Q4'])
politicalorgs_gender

In [None]:
print("The proportion of missing values in the first column is: {}".format(missing_proportion(politicalorgs_gender, 'Q13')))
print()
print("The proportion of missing values in the second column is: {}".format(missing_proportion(politicalorgs_gender, 'Q4')))

#### Pivot Tables

Pivot tables (also known as contingency tables) are data structures that allow us to summarize key points in our dataset. In our case, we are looking at the relationship between gender and involvement in political organizations. Our independent variable, the variable that we believe might influence the other, is gender. This variable should be presented along the columns of our pivot table. The dependent variable should be placed along the rows of the pivot table. Each cell of the table will hold the number of respondents with that combination of gender and involvement in political organizations at Berkeley.
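
To make the idea of a contingency table concrete, here is a sketch in plain Python rather than with the `datascience` package. The responses and category labels are made up, and `Counter` from Python's standard library stands in for the counting that a pivot table does.

```python
from collections import Counter

# Made-up (gender, involved in a political organization) answers.
responses = [
    ("Woman", "Yes"), ("Woman", "No"), ("Man", "No"),
    ("Woman", "Yes"), ("Man", "Yes"), ("Man", "No"),
]

# Count how many respondents gave each combination of answers.
pair_counts = Counter(responses)

genders = sorted({gender for gender, _ in responses})       # column labels
involvement = sorted({answer for _, answer in responses})   # row labels

# Print one row per involvement answer, one column per gender.
print("Involved", *genders)
for answer in involvement:
    row = [pair_counts[(gender, answer)] for gender in genders]
    print(answer, *row)
```

Each printed number is the count of respondents in one cell: the independent variable runs along the columns and the dependent variable along the rows.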

TASK: Use the Table method .pivot to create a pivot table between gender and involvement in political organizations at Berkeley. This method takes in two arguments: the column name to be displayed along the columns, and the column name to be displayed along the rows. Save the resulting pivot table into a name called pivoted_politicalorgs_gender.

In [None]:
pivoted_politicalorgs_gender = politicalorgs_gender.pivot('Q13', 'Q4')
pivoted_politicalorgs_gender

Let's look at the distribution of respondents across gender, year in school, major, and race/ethnicity. First we will look at race.

In [None]:
plt.figure(figsize = (18, 6))
plt.rcParams.update({'font.size': 15})
ax=sns.barplot(x = data.group('Q14').column('Q14'), y = data.group('Q14').column('count')); # Run this cell
plt.xlabel('Race/Ethnicity')
plt.ylabel('Counts')
plt.title('Race/Ethnicity of Students in Survey'); 

# To display count labels: 
# 1. set sns.barplot to ax (above)
# 2. use code below as template - you can adjust the numbers in ax.text to move/center labels
counts = data.group('Q14').column('count');
for i, p in enumerate(ax.patches):
    height = p.get_height();
    ax.text(p.get_x() + p.get_width() / 2, height + 1, counts[i], ha="center");
            
plt.xticks(rotation=90);
   

**Task:** Generate a bar chart for gender, major and year in school. If you go to the menu bar at the top of the page, select the Insert drop-down menu and select Insert Cell Below you will be able to add an empty cell. Then copy and paste the code above, but change the variable names, x-axis label and the title of the bar chart to reflect the variable you are charting.

What do you notice about the distribution of respondents across gender, race/ethnicity, major, year in school?

**Answer for Gender:**

**Answer for Race/ethnicity:**

**Answer for Major:**

**Answer for year in school:**

**How representative is our sample?**

In these bar charts we can see the total number of respondents by race/ethnicity, gender and major. With this information you can determine how representative of the larger population our sample is across these categories. Calculate the percentage of respondents for each race/ethnicity category and for gender. Compare that to the larger population by finding that information on this website: https://opa.berkeley.edu/campus-data/uc-berkeley-quick-facts. Remember we are just interested in the UCB undergraduate population, not all UCB students. Here's the data on majors (number of students in the total population for each major):

- College of Chemistry (1,079)
- College of Engineering (4,041)
- College of Environmental Design (672)
- Haas School of Business (1,044)
- L&S Administered Programs (3,899)
- L&S Arts & Humanities Division (1,376)
- L&S Biological Sciences Division (1,202)
- L&S Math & Physical Sciences Division (1,134)
- L&S Social Sciences Division (4,990)
- L&S Undeclared (11,496)
- L&S Undergrad Studies Division (616)
- Rausser College of Natural Resources (2,517)
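
One way to turn the counts listed above into percentages is sketched below. The dictionary name `population_by_major` is made up; the numbers are the ones listed above. The same arithmetic works for the counts shown in your bar charts.

```python
# Number of students in the total population for each major (from the list above).
population_by_major = {
    "College of Chemistry": 1079,
    "College of Engineering": 4041,
    "College of Environmental Design": 672,
    "Haas School of Business": 1044,
    "L&S Administered Programs": 3899,
    "L&S Arts & Humanities Division": 1376,
    "L&S Biological Sciences Division": 1202,
    "L&S Math & Physical Sciences Division": 1134,
    "L&S Social Sciences Division": 4990,
    "L&S Undeclared": 11496,
    "L&S Undergrad Studies Division": 616,
    "Rausser College of Natural Resources": 2517,
}

total_students = sum(population_by_major.values())

# Percentage of the population in each major, rounded to one decimal place.
for major, count in population_by_major.items():
    share = 100 * count / total_students
    print(f"{major}: {share:.1f}%")
```

Comparing these population percentages with the percentages in our sample shows which majors are over- or under-represented.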


Percent by race/ethnicity category in our survey sample.

**Answer:**


How do these percentages compare to the larger population? Is our sample representative along the lines of race/ethnicity?

**Answer:**

What should we be careful making claims about and why (based on our sample)?

**Answer:**


**Task** Add some additional cells and repeat this for gender and major. Be sure to answer all three questions above for gender and major as well.

#### 6.2: Processing Pipeline

In data science, when you repeat a set of tasks to analyze a dataset, you are creating a *data processing pipeline*. 

We will write a function that takes in a table, two column names of categorical variables in the table, a title, and a category, and outputs a bar graph displaying the relationship between those variables. The first column name is the independent variable.

In [None]:
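def categorical_variable_relationship(table, first, second, title, category):
    # Keep only the two columns we want to compare
    table = table.select([first, second])
    # Remove rows with a missing answer in either column
    table = drop_missing_rows(table, first)
    table = drop_missing_rows(table, second)
    # Count responses for each pair of answers (first variable along the columns)
    pivot = table.pivot(first, second)
    # Convert the counts to proportions and plot them as a bar chart
    proportion_pivot = counts_to_proportions(pivot)
    plot_bar_chart(proportion_pivot, proportion_pivot.labels[1:], title, category)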

#### **TASK:** Using your newly created `categorical_variable_relationship` function, plot a bar chart to find the relationship between gender **'Q13'** and involvement in political organizations on campus **'Q4'**.

In [None]:
categorical_variable_relationship(data, 'Q13', 'Q4',
                                  'Relationship between Gender and Involvement in Political Organizations',
                                  'Gender')
plt.xlabel("Involved in Political Organization on Campus");

**Interpret the bar chart** What do you see that is interesting? Careful: Note the order of the categories on the x-axis.

**Answer:**

**Task** Add some blank cells and repeat this with race/ethnicity and involvement in political organizations. Create the bar chart and interpret it. Then select at least 4 other variable combinations of your choice from our survey, create the bar charts and interpret them. Go back to our survey questions and identify variables combinations that you think might yield interesting patterns. Coordinate with your group so that each group member is selecting different combinations.

# 7.1 Significance Tests for Categorical Variables <a id = 'section2'></a>

At this point, you should have identified some differences along the lines of gender, race/ethnicity, major, year in school, etc. However, how do we know that these differences are not due to *random chance* alone? To answer this question, we turn to **hypothesis testing** for categorical variables. 

### Hypothesis Testing: The Basics <a id = 'subsection2a'></a>

Hypothesis tests are used when you observe some phenomenon and want to know whether it happened by random chance alone or due to a specific cause. A hypothesis is a guess about the world, based on available evidence. We want to test between two different hypotheses:

- **The Null Hypothesis:** My observation has arisen due to random chance alone.
- **The Alternative Hypothesis:** My observation has arisen due to a cause other than random chance alone.

### Chi-square Testing: Introduction and Case Study <a id = 'subsection2b'></a>

The Chi-square Test is a type of hypothesis test that works well with categorical data. It measures how well the observed distribution of data fits with the distribution that is expected if the variables are independent (Light, 2008). We need to convert our table data into a pivot table before we can conduct the chi-squared test, and the pivot table must contain counts rather than proportions.

### Step 1: Pivoting the Data

**TASK:** Use the `Table` method `.pivot` with the appropriate ordering of column names. Save this into a name called `pivoted_politicalorgs_gender`.

Don't worry about the ordering of values; just make sure to have the correct values on the vertical and horizontal axis. 

In [None]:
politicalorgs_gender = data.select(['Q13', 'Q4'])
politicalorgs_gender
pivoted_politicalorgs_gender = politicalorgs_gender.pivot('Q13', 'Q4')
pivoted_politicalorgs_gender

### Step 2: Adding Row and Column Totals

In order to calculate the expected counts under our null hypothesis of independence, we need to calculate row and column totals.

**Row totals** are horizontal sums added as the right-most column of the table. Because gender runs along the columns of our pivot table, each row total is the total number of respondents in one category of involvement in political organizations.

**Column totals** are vertical sums added as the bottom row of the table. In our example, they represent the total number of respondents of each gender identity in the dataset.
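
The sketch below computes both kinds of totals for a small table stored as a list of rows. The counts are made up, and this is plain Python rather than the `add_row_totals` and `add_column_totals` helpers used in this lab.

```python
# Made-up table of counts: two rows and two columns.
counts = [
    [12, 18],
    [28, 42],
]

row_totals = [sum(row) for row in counts]                 # one sum per row
column_totals = [sum(col) for col in zip(*counts)]        # one sum per column
grand_total = sum(row_totals)                             # every respondent

print(row_totals)
print(column_totals)
print(grand_total)
```

The row totals and the column totals each add up to the same grand total, which is a handy check.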

**TASK:** Create a table called `totals` which adds a Row Total and Column Total to the `pivoted_politicalorgs_gender` table. Use the function `add_row_totals`, which takes in a pivot table and returns an updated version with the row totals. Also use the function `add_column_totals`, which takes in a pivot table and returns an updated version with the column totals. 

In [None]:
totals = add_row_totals(pivoted_politicalorgs_gender)
totals = add_column_totals(totals)
totals

### Step 3: Calculating the Chi-square statistic and interpreting results

Once we have both our prepared data table and the expected values corresponding to each entry, we can calculate the chi-square statistic.

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
 
Under our null hypothesis of independence, this chi-squared statistic follows a chi-squared distribution. In general, most hypothesis tests involve the calculation of a test statistic, then seeing the probability of observing a more extreme test statistic value given the distribution implied by the null hypothesis. In this case, our null hypothesis implies a chi-squared distribution, which has an additional parameter called degrees of freedom. This parameter is determined by the shape of our pivot table.

$$\text{degrees of freedom} = (\text{Number of Columns} - 1) \cdot (\text{Number of Rows} - 1)$$
 
For our example, the degrees of freedom is just 1.

Let's break this down.
- The Greek letter chi is written $\chi$. Thus $\chi^2$ is the symbolic representation of the chi-squared statistic.
- The $O_i$ represents an observed count (e.g. the number of female Democrats in the dataset)
- The $E_i$ represents an expected count (e.g. expected number of female Democrats given random choice)
- The $\sum_{i=1}^{n}$ represents the summation over all $n$ groups (the cells of the pivot table)

In simple terms, we take the following steps to calculate the Chi-square statistic: 

1) Take the difference between the Observed and Expected counts for each unique group in the sample 

2) Square that difference and divide it by the expected value for that group

3) Add up all of those values 

4) Find the probability of observing a chi-squared statistic at least as large as ours under the chi-squared distribution.
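
The steps above can be sketched in plain Python for a table with two rows and two columns. Several things here are assumptions for illustration: the counts are made up; each expected count is computed as (row total × column total) / grand total, the usual formula under independence; and the last step uses a shortcut that is valid only when there is exactly 1 degree of freedom. This is not necessarily how the lab's `chisquaretest` function is written.

```python
import math

# Made-up observed counts: two rows and two columns.
observed = [
    [20, 30],
    [30, 20],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Steps 1-3: sum (observed - expected)^2 / expected over every cell.
chi_squared = 0.0
for i, row in enumerate(observed):
    for j, observed_count in enumerate(row):
        expected_count = row_totals[i] * column_totals[j] / grand_total
        chi_squared += (observed_count - expected_count) ** 2 / expected_count

degrees_of_freedom = (len(observed[0]) - 1) * (len(observed) - 1)

# Step 4: with exactly 1 degree of freedom, the probability of a statistic
# at least this large equals erfc(sqrt(chi_squared / 2)).
p_value = math.erfc(math.sqrt(chi_squared / 2))

print(chi_squared, degrees_of_freedom, round(p_value, 4))
```

For larger tables the degrees of freedom exceed 1, so the shortcut in the last step no longer applies and a chi-squared distribution table or a library function is needed instead.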

This might seem like a lot of computation. If doing this computation by hand, you would either have to look up a chi-squared distribution table to find the probability of seeing your chi-squared statistic or use a function that can calculate that for you. Instead, we have provided you a function that can do all these steps for you.

TASK: Use the function chisquaretest to calculate the Chi-square statistic. This function takes in a pivot table with added Row and Column Totals.

In [None]:
chisquaretest(totals)

You may have noticed that there is another output called the p-value. This is the probability of observing a test statistic at least as extreme as ours if the null hypothesis were true. In this case, our null hypothesis is that involvement in political organizations and gender are independent of each other.

By convention, we say that:

- If the p-value is less than or equal to 0.05, then we reject the null hypothesis. Essentially, what we are saying here is that a 1 in 20 chance of observing a test statistic this extreme is too unlikely for us to keep believing the null hypothesis.
- If the p-value is greater than 0.05, then we do not reject the null hypothesis.

Rejecting the null hypothesis means that we have evidence that supports the alternative hypothesis. In the case of Chi-square tests, it means that the two variables are inter-related. In either case, notice that we never accept that a hypothesis is true; rather, we simply reject or fail to reject it.
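
The convention can be written as a small function. The name `interpret_p_value` and the two p-values passed to it are made up for illustration; only the 0.05 cutoff comes from the text.

```python
def interpret_p_value(p_value, alpha=0.05):
    """Apply the conventional cutoff to a p-value."""
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

print(interpret_p_value(0.03))
print(interpret_p_value(0.20))
```

Note that neither outcome says a hypothesis has been proven true.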

**TASK:** Based on our p-value, what can we determine about the relationship between gender and involvement in political organizations? What can you say about the null hypothesis?

**Answer:** [Click on this cell and write your answer here]

**Task:** Add additional empty cells and repeat this for four other variable pairs. What relationships are statistically significant?

### Discussion: Correlation vs. Causation <a id = 'subsection2c'></a>

In any class involving statistics, you may have heard the adage, *"Correlation doesn't imply causation."* 

Let's clarify what that means and why it's so important. *Correlation* is the inter-relation in trends of two variables, whereas *causation* is an explicit statement that a change in one variable directly incites a change in the other variable (e.g., smoking and respiratory illness).


Let's look at some concrete examples of why correlation isn't the same as causation. For instance, there is a 95.8% correlation between the per capita consumption of mozzarella cheese and the number of Civil Engineering doctorates awarded in the US. Clearly, these are two completely unrelated events that aren't linked to one another. As such, we wouldn't use this correlation as evidence of causality between these variables. 

For more "Spurious Correlations", check out this link: https://www.tylervigen.com/spurious-correlations


In our discussion of Chi-square tests in Lab 2, we were able to find that the relationship between certain variables (e.g., gender and presidential voting choice) deviated significantly from the null hypothesis. Does this mean that gender *causes* the choice in presidential candidates? No. To establish that relationship, we must gather more evidence.

Typically, to establish a causal relationship between variables, you must perform a randomized controlled experiment. If you're interested in this topic, you can check out this link for more information: https://www.statisticssolutions.com/establishing-cause-and-effect/

### 8.1 Bias in Surveys <a id = 'section6'></a>

The goal of a survey is to provide information about a large population from a limited sample. In this notebook, we've gone quite in depth in how to analyze different variables present in a survey. However, we've operated under the assumption that our survey data was representative of the UC Berkeley student population. In the real world, data collection is messy and difficult, so we must be aware of sources of bias that may be present in our data. Here are a few common sources of bias in survey data:

Undercoverage bias: Certain groups of the population are left out of the sample, leading to an undercoverage of responses in the sample

Nonresponse bias: If the survey is optional, then certain respondents may not complete it. This may lead to skewed data.

Self-selection bias: If sample members volunteer themselves to take the survey, it may be the case that they are passionate about the issues asked about. This usually leads to an over-representation of individuals with strong opinions in the survey.

**TASK:** Given these sources of bias, do you see any problems with the methodology of our survey? If there are problems, what would you have done differently and why?

**Answer:** [Click on this cell and write your answer here]

--------------------------------------------------------------------------------------------------------------------------------

Good job! You have finished Lab 3: Analyzing the Data we Collected!

## Bibliography <a id = 'section7'></a>

• Caitlin Light - Adapted Chi-square case study. https://www.ling.upenn.edu/~clight/chisquared.htm

• Tyler Vigen - Incorporated example of "spurious" correlation. https://www.tylervigen.com/spurious-correlations

• Statistics Solutions - Referenced section on experimental design. https://www.statisticssolutions.com/establishing-cause-and-effect/

• Stat Trek - Adapted material on sources of survey bias. https://stattrek.com/survey-research/survey-bias.aspx

Some examples adapted from the UC Berkeley Data 8 textbook:

* [Tables](https://www.inferentialthinking.com/chapters/06/Tables.html)
* [Data 8 online Textbook: Chapter 3](http://www.inferentialthinking.com/chapters/03/programming-in-python.html)

- [Introduction to tables](https://www.inferentialthinking.com/chapters/03/4/Introduction_to_Tables)


Some term explanations adapted from the datascience documentation: 

- [datascience.tables documentation](http://data8.org/datascience/tables.html#tables-overview)


Some ideas in the sections of "Jupyter Notebook", "Introduction to Python" and "Tables and Table operations" adapted from materials in [UC Berkeley Data Science Modules core resources](http://github.com/ds-modules/core-resources):


- Shriya Vohra, Scott Lee, Pancham Yadav - intro-module-final
- Keeley Takimoto - Intro-to-Python-and-Jupyter

___
### Getting extra help

Interested in more help with learning Python or computational survey analysis? Check out  [Data Peer Consulting](https://data.berkeley.edu/education/data-peer-consulting) in Moffitt library for drop-in, one-on-one questions. For additional workshops designed for people new to computational analysis, take a look at the workshops at [The Dlab](https://dlab.berkeley.edu) (free for Berkeley students!). 

Good luck!

#### Note to Students: 
If you would like to use the utility provided by the Data Science Education Program team, simply copy the utils.py script to the folder where you are creating your analysis notebook. Good luck!