# Soc. 5 Spring 2019

## Functions

**Note:** You are not required to submit this notebook.

### Introduction

This notebook will walk you through the usage of many functions that you'll need for Project 5, so it's highly recommended that you go through this notebook first, before attempting the Project 5 notebook.

We will pick up where we left off in `Discussion 1`, so if you still feel shaky in using Jupyter notebooks, or are unsure what Python variables are, then be sure to review that notebook first, especially the "Intro to Python" section.

Now that you've hopefully covered the basics in the discussions, let's go over some functions you'll encounter on the project!

In [None]:
# Importing the functions we need
from functions import *
%matplotlib inline

### Barcharts:
To create a barchart, use the `barchart()` function. It takes six arguments, in this order: a list of categories, the frequency of each category, the x-axis label, the y-axis label, the title of the graph, and the filename for saving the graph.

In [None]:
# Example:
categories = array("gruyere", "brie", "cheddar", "provolone")
frequencies = array(10 , 30, 100, 60)

barchart(categories, frequencies, "Cheese Type", "Popularity", "Cheese Popularities", "cheese_chart")

### Histograms:
To create a histogram, use the `histogram()` function. It takes five arguments, in this order: an array of numerical values, the x-axis label, the y-axis label, the title of the graph, and the filename for saving the graph.

Let's use a histogram to plot the distribution of the sum of 5 dice rolls. We will first simulate these dice rolls with the function below. Don't worry about the code in the function. All you need to know is that a "sample" means rolling the dice 5 times and adding up the results, and here, we're taking 50 samples.

In [None]:
# Run and ignore
import numpy as np  # harmless if `functions` already provides np
def simulate_dice_rolls(n=5, num_samples=50, func=np.sum):
    results = []
    for _ in range(0, num_samples):
        results.append(func(np.random.randint(1, 6 + 1, size=n)))
    return results

In [None]:
# Example
dice_distribution = simulate_dice_rolls(5, 50)
histogram(dice_distribution, 
          "Result",
          "Frequencies",
          "The Distribution of the Sum of 5 Dice Rolls (50 Samples)", 
          "dice_roll_histogram")

**Note:** The figures that you produce using `histogram()` and `barchart()` are saved in the "Output" folder. Make sure you give each graph a unique filename so that they don't overwrite each other!
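A histogram is a tally of how many samples land on each value. The sketch below reproduces that tally for the dice example without the course's `histogram()` function. It uses Python's standard `random` module instead of `np.random`, and the names `roll_sum`, `samples`, and `tally` are illustrative assumptions, not part of the project code.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

def roll_sum(n=5):
    """Roll a six-sided die n times and return the total."""
    return sum(random.randint(1, 6) for _ in range(n))

# One "sample" is the sum of 5 rolls; take 50 samples.
samples = [roll_sum(5) for _ in range(50)]

# Count how many samples produced each total.
tally = Counter(samples)

# Print a rough text histogram, one row per observed total.
for total in sorted(tally):
    print(f"{total:2d} | {'#' * tally[total]}")
```

Each row of `#` marks corresponds to the height of one bar in the plotted histogram.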

### Grades Data

For the next couple of examples, let's use the following small grades dataset to illustrate some functions that might be helpful as you work on the project.

In [None]:
grade_and_score = Table().with_columns([
        'letter', array('a', 'b', 'c','d','f', 'i'),
        'count',  array( 9,   10,   7,  5,  4,  1),
        'points', array(10,   8,   6,  4,  2,  0),
        ])

grade_and_score

### Filtering Values:
To filter values from a table, use the `filter_values()` function. It takes three arguments, in this order: the table, the column that you're filtering, and an array of values to be removed.

Let's say you want to filter values 1 and 9 from the `count` column. `filter_values` will drop all rows where `count` has a value of 1 or 9.

In [None]:
filter_values(grade_and_score, 'count', array(1, 9))
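To make the row-dropping behavior concrete, here is a minimal plain-Python sketch of the same idea. It does not use the course's `filter_values()`; the helper `drop_rows_with_values` and the list-of-dictionaries layout are assumptions made for illustration.

```python
# The grades data from above, one dictionary per row.
rows = [
    {"letter": "a", "count": 9,  "points": 10},
    {"letter": "b", "count": 10, "points": 8},
    {"letter": "c", "count": 7,  "points": 6},
    {"letter": "d", "count": 5,  "points": 4},
    {"letter": "f", "count": 4,  "points": 2},
    {"letter": "i", "count": 1,  "points": 0},
]

def drop_rows_with_values(rows, column, values):
    """Return only the rows whose entry in `column` is not in `values`."""
    unwanted = set(values)
    return [row for row in rows if row[column] not in unwanted]

kept = drop_rows_with_values(rows, "count", [1, 9])
print([row["letter"] for row in kept])  # ['b', 'c', 'd', 'f']
```

The rows for `a` (count 9) and `i` (count 1) are dropped, and the original `rows` list is left unchanged.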

### Creating Categories:
To create a categorical variable from another numerical column, use the `create_categories()` function. This function takes in the name of the table, the column that contains the values you want to "categorize", and, finally, the endpoints of each category.

**Warning:** Make sure you cover the whole range of values that are in the column.

Let's say we want to create a category for the range that a value in the `count` column falls in.

In [None]:
# Min of the count column
np.min(grade_and_score.column("count"))

In [None]:
# Max of the count column
np.max(grade_and_score.column("count"))

Thus, our first endpoint should be no greater than 1, and the last endpoint should be at least 10. In between, we can divide the range however we choose.

Here, we're dividing the range of `count` into 2 groups, 0-5 and 6-10, so the endpoints are 0, 6, and 10.

In [None]:
# We can use the table from above to create a categorical variable for count
create_categories(grade_and_score, 'count', array(0, 6, 10))
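The sketch below shows one way endpoints can be turned into category labels. It is not the course's `create_categories()`: the helper `categorize` is hypothetical, it assumes whole-number values, and it assumes each bin includes its left endpoint and excludes its right one, except the last bin, which includes both.

```python
from bisect import bisect_right

def categorize(value, endpoints):
    """Return a label such as '0-5' for the bin that contains `value`.

    Assumption: bins are [low, high), except the last bin, which is [low, high].
    """
    if value < endpoints[0] or value > endpoints[-1]:
        raise ValueError(f"{value} is outside the endpoints {endpoints}")
    # Find the bin whose left endpoint is the largest one <= value,
    # clamping so that the final endpoint falls in the last bin.
    index = min(bisect_right(endpoints, value) - 1, len(endpoints) - 2)
    low, high = endpoints[index], endpoints[index + 1]
    is_last_bin = index == len(endpoints) - 2
    return f"{low}-{high if is_last_bin else high - 1}"

counts = [9, 10, 7, 5, 4, 1]
labels = [categorize(c, [0, 6, 10]) for c in counts]
print(labels)  # ['6-10', '6-10', '6-10', '0-5', '0-5', '0-5']
```

The `ValueError` is why the warning above matters: a value outside the endpoints has no category to fall into.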

### GSS 2014
For the next couple of examples, we'll use the familiar GSS 2014 data that we've been working with in the previous discussions.

In [None]:
data = Table.read_table("Data/GSS_2014_cleaned.csv")
data

### Cross-Tabulation:
To create a cross-tabulation of two columns of a table, use the `cross_tab()` function. It takes three arguments, in this order: the table, the column to use for the column values, and the column to use for the row values.

In [None]:
# This table shows the relationship between SEX and NATFARE.
x_tb = cross_tab(data, 'SEX', 'NATFARE')
x_tb
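A cross-tabulation counts how often each pair of values occurs together. The sketch below builds one by hand with `collections.Counter`. The responses are toy values invented for illustration (they are not GSS data), and the variable names are assumptions.

```python
from collections import Counter

# Toy responses invented for illustration; these are not GSS values.
sex = ["F", "M", "F", "F", "M", "M"]
opinion = ["too little", "too much", "too little",
           "about right", "too much", "too little"]

# Count each (row value, column value) pair.
pair_counts = Counter(zip(opinion, sex))

row_labels = sorted(set(opinion))  # one row per opinion
col_labels = sorted(set(sex))      # one column per sex

# A Counter returns 0 for pairs that never occur.
table = [[pair_counts[(r, c)] for c in col_labels] for r in row_labels]

for label, row in zip(row_labels, table):
    print(f"{label:12s} {row}")
```

Every respondent lands in exactly one cell, so the cells sum to the number of respondents.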

Using the two-way table `x_tb`, we can also compute the corresponding table of expected counts under the null hypothesis that the two columns aren't related. We do this using the `expected_counts()` function, which takes in the table returned by `cross_tab()`.

In [None]:
exp_tb = expected_counts(x_tb)
exp_tb
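Under the null hypothesis of independence, the expected count in each cell is (row total × column total) / grand total. The sketch below applies that formula to a small made-up table of observed counts; it illustrates the standard formula and is not necessarily how the course's `expected_counts()` is implemented.

```python
# Made-up observed counts: 2 rows by 2 columns.
observed = [[10, 20],
            [30, 40]]

row_totals = [sum(row) for row in observed]        # [30, 70]
col_totals = [sum(col) for col in zip(*observed)]  # [40, 60]
grand_total = sum(row_totals)                      # 100

# Expected count for each cell under independence.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
print(expected)  # [[12.0, 18.0], [28.0, 42.0]]
```

The expected table keeps the same row and column totals as the observed table; only the way the counts are spread across cells changes.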

### Chi-Square Statistic
Given the cross-tabulation and expected counts tables, `compute_chi_square()` returns the Chi-Square Statistic. It takes the cross_tab table first and the expected counts table second.

In [None]:
compute_chi_square(x_tb, exp_tb)
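The Chi-Square Statistic adds up (observed − expected)² / expected over every cell. The sketch below computes it for a made-up 2×2 table of observed counts and the expected counts that independence would give for that table. The numbers are illustrative, and the code is a sketch of the standard formula rather than the course's `compute_chi_square()`.

```python
# Made-up observed counts and their expected counts under independence.
observed = [[10, 20], [30, 40]]
expected = [[12.0, 18.0], [28.0, 42.0]]

# Sum (O - E)^2 / E over every cell of the table.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
print(round(chi_square, 4))  # 0.7937
```

A larger value means the observed counts sit farther from what independence would predict.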

### Grouping Tables:
To count how many times each value appears in a column, we can "group" the table by that column, using the table's `.group()` method with the column of interest as the argument. For example, let's see the count of each response to `NATFARE`.

In [None]:
data.group("NATFARE")

However, if we specify a function as the second parameter/argument to `.group()`, we can calculate other statistics over the groups.

In [None]:
data.group("NATFARE", np.mean)

However, taking the mean of some of these columns, such as the nominal `CASEID` column, doesn't make sense. It's preferable to first select only the columns that we care about.

In [None]:
subset = data.select(["EDUC", "AGE", "NATFARE"])
subset

Now if we group by `NATFARE`, we only get the columns we need. The table below shows the mean of the `EDUC` and `AGE` values for everyone who gave the corresponding `NATFARE` response.

In [None]:
subset.group("NATFARE", np.mean)
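Grouping with a function can be pictured as two steps: collect each column's values by group, then apply the function to each collection. The plain-Python sketch below does this for toy rows. The data and the helper `group_and_apply` are invented for illustration and are not part of the course code or the GSS data.

```python
from collections import defaultdict
from statistics import mean

# Toy rows invented for illustration; these are not GSS values.
rows = [
    {"NATFARE": "response A", "EDUC": 12, "AGE": 30},
    {"NATFARE": "response A", "EDUC": 16, "AGE": 50},
    {"NATFARE": "response B", "EDUC": 14, "AGE": 40},
]

def group_and_apply(rows, key, func):
    """Group rows by `key`, then apply `func` to every other column."""
    groups = defaultdict(lambda: defaultdict(list))
    # Step 1: collect each column's values, separately for each group.
    for row in rows:
        for column, value in row.items():
            if column != key:
                groups[row[key]][column].append(value)
    # Step 2: reduce each collected list to a single number.
    return {
        label: {column: func(values) for column, values in columns.items()}
        for label, columns in groups.items()
    }

means = group_and_apply(rows, "NATFARE", mean)
print(means)
```

Passing a different function, such as `statistics.median`, changes only step 2.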

Note that we can use other functions with `.group()`. Let's say we wanted the `median` or `range` instead.

In [None]:
subset.group("NATFARE", np.median)

In [None]:
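# Note: np.range is not part of standard NumPy. If it is unavailable in your
# environment, np.ptp (max minus min) computes the same statistic.
subset.group("NATFARE", np.range)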

As a reminder, here are all the array functions we discussed in the first discussion:
- _np.mean()_: calculates the mean of an array 
- _np.median()_: calculates the median of an array
- _np.mode()_: calculates the mode of an array 
- _np.var()_: calculates the variance of an array 
- _np.std()_: calculates the standard deviation of an array
- _np.range()_: calculates the range of an array
- _np.sum()_: calculates the sum of all values in an array

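All of these can be used with the `.group()` method. Note that `np.mode()` and `np.range()` are not part of standard NumPy, so they are presumably supplied by the course setup.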
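As a quick check of what each statistic in the list above means, the sketch below computes all seven on a small made-up array using Python's standard library instead of NumPy. The variance and standard deviation use the population versions (dividing by n), which matches the default behavior of `np.var()` and `np.std()`.

```python
import statistics

values = [2, 4, 4, 5, 10]  # made-up numbers for illustration

mean_value = statistics.mean(values)           # 5
median_value = statistics.median(values)       # 4
mode_value = statistics.mode(values)           # 4 (the most common value)
variance_value = statistics.pvariance(values)  # population variance, about 7.2
std_value = statistics.pstdev(values)          # square root of the variance
range_value = max(values) - min(values)        # 8
sum_value = sum(values)                        # 25

print(mean_value, median_value, mode_value,
      variance_value, std_value, range_value, sum_value)
```

In standard NumPy, the range (max minus min) is available as `np.ptp()`.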