# Descriptive Statistics
### Agenda:
1) Types of data  
2) Analyzing data  
3) Distributions  
4) Inference

## Types of Data

Once we have collected data, we need to understand it better. Common questions include "What is a 'normal' value for this data?" and "Is some of the data very different from the rest?"

Examples:   
1) Scores on your midterm  
2) Presidential election outcomes by party  
3) Gay marriage approval

In [150]:
import numpy as np
from scipy import stats
from datascience import Table
class_size = 300
grade_table = Table().with_columns(
    'student id', np.random.randint(300000000, 350000000, class_size),
    'midterm grade', np.random.randint(70, 90, class_size)
)
grade_table

student id,midterm grade
342109301,82
334421680,86
305650146,78
323466177,80
328982125,86
305709526,80
345744825,83
329704956,71
336314572,77
316896098,75


In [151]:
import pandas as pd
election_table = pd.read_excel('data/desc_stats.xlsx', sheet_name="Sheet2")[['year','political party']].dropna().head(10)
election_table

Unnamed: 0,year,political party
0,*1789*,no formally organized parties
13,*1792*,Federalist
18,*1796*,Federalist
31,*1800*,Democratic-Republican
36,*1804*,Democratic-Republican
38,*1808*,Democratic-Republican
42,*1812*,Democratic-Republican
45,*1816*,Democratic-Republican
48,*1820*,Democratic-Republican
51,*1824*,no distinct party designations


In [152]:
%matplotlib inline
# Possible responses, listed from most to least favorable
ord_vals = ['Like','Like Somewhat','Neutral','Dislike Somewhat','Dislike']
approval_table = Table().with_columns(
    'student id', np.random.randint(300000000, 350000000, 10),
    'policy approval', np.random.choice(ord_vals, 10)
)
approval_table

student id,policy approval
320155890,Dislike Somewhat
336041432,Dislike Somewhat
330715288,Dislike Somewhat
341898187,Like Somewhat
328506613,Dislike
342713283,Like Somewhat
341441587,Neutral
314689299,Neutral
305115159,Dislike Somewhat
341054602,Neutral


### Measurement Metric

Different variables can hold different kinds of values. In the examples above, "midterm grade" takes numeric values between 70 and 90, while "political party" can be any one of the political parties in US history.

**How can we differentiate between these kinds of data?**

The *measurement metric* of a variable is the type of value the variable can take on. Variables are one of three types: *categorical*, *ordinal*, and *continuous*. Let's take a look at each of these.

### Categorical Variables
Categorical variables take on one of a finite, unordered list of values. Above, the winning political party is a categorical variable because its values come from a finite set (Republican, Democrat, Whig, etc.) and there is no ordering between these parties.

Additional examples: religious identification, dog breed, colors.

**Check Your Understanding:**
Above, would midterm grades count as a categorical variable? Why or why not?

### Ordinal Variables
Ordinal variables take on one of a finite, *ordered* list of values. Above, the approval rating is an ordinal variable because its values come from a finite set (like, dislike, neutral, etc.) and there is an ordering between the values: "like" ranks higher than "neutral" or "dislike".

Additional examples: socioeconomic status (e.g., "middle class"), health status ("healthy", "somewhat sick", etc.).

**Check Your Understanding:**
Above, would midterm grades count as an ordinal variable? Why or why not? What about political party?

### Continuous Variables
Continuous variables take on one of a *potentially* infinite, ordered set of values. Above, the midterm grade variable is continuous because it can take any numeric value between 70 and 90.

Additional examples: income, age, temperature.

What's the difference between continuous and ordinal variables? While the two can appear to be the same, there is an important distinction: *equal unit difference*. For ordinal variables, the difference between adjacent values does not have to be the same. For continuous variables, it must be. For example, when measuring socioeconomic status (an ordinal variable), the difference between "lower class" and "middle class" is not necessarily the same as the difference between "middle class" and "upper class". In contrast, units of age are equidistant: the difference between age 11 and 12 is the same as between 29 and 30.

**Check Your Understanding:**
Above, would approval rating count as a continuous variable? Why or why not?

## Analyzing Data

There are two ways to describe a variable: measures of *central tendency* and *variation* (or *dispersion*). 



### Central Tendency
What are typical values of the variable?

There are 3 ways of measuring central tendency: mode, mean, median.  
1) *Mode*: the value that occurs most frequently in the data  
2) *Mean*: the average value, calculated by adding all values of the variable together and dividing by the number of values.  
3) *Median*: the middle value, calculated by ordering all values of the variable by their rank, and choosing the value in the middle.
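
As a quick illustration of the three definitions, here is a minimal sketch using Python's built-in `statistics` module. The `scores` list is hypothetical and is not taken from the tables above.

```python
import statistics

# Hypothetical scores, chosen only for illustration (not from grade_table).
scores = [70, 72, 72, 75, 81]

mode_value = statistics.mode(scores)      # most frequent value
mean_value = statistics.mean(scores)      # sum of the values / number of values
median_value = statistics.median(scores)  # middle value once the scores are sorted

print(mode_value, mean_value, median_value)  # 72 74 72
```

Here the mean (74) is pulled above the median (72) by the single high score of 81.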

#### Use Cases 

**Mode:**  
Can we use mean or median for categorical variables? No! Because categorical values have no rank and no order, we cannot sort them or apply arithmetic operations such as addition or division. Therefore, we must rely on the mode for categorical variables. 

*Check Your Understanding*: What's the mode value of election_table below?

In [153]:
election_table

Unnamed: 0,year,political party
0,*1789*,no formally organized parties
13,*1792*,Federalist
18,*1796*,Federalist
31,*1800*,Democratic-Republican
36,*1804*,Democratic-Republican
38,*1808*,Democratic-Republican
42,*1812*,Democratic-Republican
45,*1816*,Democratic-Republican
48,*1820*,Democratic-Republican
51,*1824*,no distinct party designations


**Median:**  
We can use the median measurement for ordinal and continuous variables. 

How can we calculate the median value for an ordinal variable? Because ordinal variables have a ranking, we can find the middle value.
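
One way to sketch this in plain Python is to sort the responses by their rank and pick the middle one. The `responses` list below is hypothetical, and the `ranking` list (lowest to highest approval) is an assumption of this sketch.

```python
# Ranking from lowest to highest approval (an assumption of this sketch).
ranking = ['Dislike', 'Dislike Somewhat', 'Neutral', 'Like Somewhat', 'Like']
rank_of = {label: position for position, label in enumerate(ranking)}

# Hypothetical responses (not the approval_table data).
responses = ['Like', 'Neutral', 'Dislike', 'Like Somewhat', 'Neutral']

# Sort by rank, then take the middle element.
ordered = sorted(responses, key=rank_of.get)
median_response = ordered[len(ordered) // 2]

print(median_response)  # Neutral
```

With an even number of responses there are two middle values. If they differ, they cannot be averaged the way numbers can, so one option is to report both.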

*Check Your Understanding*: What's the median value of "policy approval" below?

In [154]:
approval_table

student id,policy approval
320155890,Dislike Somewhat
336041432,Dislike Somewhat
330715288,Dislike Somewhat
341898187,Like Somewhat
328506613,Dislike
342713283,Like Somewhat
341441587,Neutral
314689299,Neutral
305115159,Dislike Somewhat
341054602,Neutral


**Mean:**  
We can use the mean measurement for continuous variables. For a variable $X$, its mean, $\overline{X}$, can be found thus:

$$\overline{X} = \frac{\sum_{i=1}^{n}X_i}{n}$$

*Check Your Understanding*: What's the mean value of "midterm grade" below?

In [155]:
grade_table.take[:10]

student id,midterm grade
342109301,82
334421680,86
305650146,78
323466177,80
328982125,86
305709526,80
345744825,83
329704956,71
336314572,77
316896098,75


### Using Python
Python makes calculating measures of central tendency easy!  
1) *Mode*: call stats.mode() on the table column of interest  
2) *Median*: call np.median() on the table column of interest  
3) *Mean*: call np.mean() on the table column of interest

In [156]:
#Mode
print("Mode: ", stats.mode(approval_table.column('policy approval')))
#Median
print("Median: ", np.median(grade_table.column('midterm grade')))
# Mean
print("Mean: ", np.mean(grade_table.column('midterm grade')))

Mode:  ModeResult(mode=array(['Dislike Somewhat'], dtype='<U16'), count=array([4]))
Median:  79.5
Mean:  79.44333333333333
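
Depending on the installed SciPy version, stats.mode() may not accept text values such as party names or approval labels. If that happens, counting the labels directly is a simple alternative. The sketch below uses `collections.Counter` from the standard library; the `parties` list is hypothetical.

```python
from collections import Counter

# Hypothetical party labels, for illustration only.
parties = ['Whig', 'Democratic', 'Whig', 'Republican', 'Whig', 'Democratic']

# Count how often each label occurs, then take the most frequent one.
counts = Counter(parties)
mode_label, mode_count = counts.most_common(1)[0]

print(mode_label, mode_count)  # Whig 3
```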


### Variation/Dispersion
How much do the values of the variable vary? The most common measurement is the *standard deviation*: the square root of the *variance*, which is (roughly) the average squared difference between each value and the mean. The formula below divides by $n-1$, which gives the *sample* standard deviation; dividing by $n$ instead gives the *population* standard deviation.
$$\text{SD}(X) = \sqrt{\text{Var}(X)} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \overline{X})^2}{n-1}}$$

Let's take a look at an example with the midterm scores:

In [159]:
# The values
grade_table.select('midterm grade').take[:10]

midterm grade
82
86
78
80
86
80
83
71
77
75


In [160]:
#The mean:
np.mean(grade_table.column('midterm grade'))

79.44333333333333

Given the two above, calculate the standard deviation!

### Using Python
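Python makes calculating measures of variation easy!  
1) *Standard Deviation*: call np.std() on the table column of interest  

By default, np.std() divides by $n$ (the population standard deviation); pass ddof=1 to divide by $n-1$ instead. The cell below shows that np.std() matches both the square root of np.var() and the calculation written out by hand with $n$ in the denominator: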

In [161]:
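grades_list = grade_table.column('midterm grade')
# Built-in standard deviation (divides by n by default)
print(np.std(grades_list))
# Square root of the variance
print(np.sqrt(np.var(grades_list)))
# The same calculation written out by hand
print(np.sqrt(sum((grades_list-np.mean(grades_list))**2)/(len(grades_list))))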

5.850936297342123
5.850936297342123
5.850936297342124


Let's review each of the functions used above:  
* np.std(): calculate the standard deviation of a list  
* np.sqrt(): calculate the square root of a value  
* np.var(): calculate the variance of a list  
* len(): calculate the number of items in a list (our *n*)
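
To see how much the choice of denominator matters, here is a small sketch that computes the standard deviation both ways by hand and compares against the standard library. The `values` list is hypothetical. `statistics.pstdev()` divides by *n* (like np.std() with its default ddof=0), and `statistics.stdev()` divides by *n - 1* (like np.std() with ddof=1).

```python
import math
import statistics

# Hypothetical values, for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)

mean = sum(values) / n                                  # 5.0
squared_devs = sum((v - mean) ** 2 for v in values)     # 32.0

population_sd = math.sqrt(squared_devs / n)        # divide by n
sample_sd = math.sqrt(squared_devs / (n - 1))      # divide by n - 1

print(population_sd, statistics.pstdev(values))
print(sample_sd, statistics.stdev(values))
```

The sample version is always a little larger, and the gap shrinks as *n* grows.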

## Distributions
We have covered how to calculate the key statistics for a variable. Most often, these variables represent samples of larger distributions, or sets of values. Collecting data from an entire population is expensive, and so most data sets will be limited to samples of the population. However, a problem arises from this: we normally want the statistics for the overall population, not just the sample.

For example, when determining presidential approval, surveys will draw samples of roughly 1,000 to 3,000 responses. However, we are rarely concerned with just the opinions of the survey respondents; we care about the broader American public. How can we determine the "true" statistical values for the population when we only have the sample values?
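
The gap between a sample statistic and the population value can be seen in a small simulation. This is only a sketch: the population below is made up, with an assumed 45% approval rate, and the seed is fixed so the run is repeatable.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1 = approves, 0 = does not (45% is an invented rate).
population = [1 if rng.random() < 0.45 else 0 for _ in range(100_000)]

# Survey 1,000 people chosen at random, without replacement.
sample = rng.sample(population, 1000)

population_rate = statistics.mean(population)
sample_rate = statistics.mean(sample)

print("Population approval:", population_rate)
print("Sample approval:    ", sample_rate)
```

The two numbers are usually close but not identical, and a different sample would give a slightly different answer.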

### Central Limit Theorem

To understand the central limit theorem, we should start with a simple example: rolling dice.

In [203]:
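#Central limit Theorem
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact

def show_dist(dist_func):
    # Plot the density of a random sample, with red lines at 1 and 2
    # standard deviations on either side of the sample mean
    def normal_dist(center, sample_size):
        x = dist_func(center, size=sample_size)
        # Note: distplot is deprecated in newer seaborn releases
        sns.distplot(x, hist=False, color='0')
        avg = np.mean(x)
        std_dev = np.std(x)
        for position in [avg - 1*std_dev, avg + 1*std_dev, avg - 2*std_dev, avg + 2*std_dev]:
            plt.axvline(position, color='r')
        plt.show()
    interact(normal_dist,center=(0,1000), sample_size=(1000,10000))
show_dist(np.random.normal)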

In [191]:
#Dice rolling
faces = [1,2,3,4,5,6]
interact(lambda rolls : sns.distplot(np.random.choice(faces, rolls), kde=False, bins=6), rolls=(10,1000))

In [193]:
#Dice Rolling Average
def mean_distribution(num_distributions):
    average_values = []
    for _ in range(num_distributions):
        faces = [1,2,3,4,5,6]
        rolls = np.random.choice(faces, 100)
        average_value = np.mean(rolls)
        average_values.append(average_value)
    sns.distplot(average_values, hist=False)
interact(mean_distribution, num_distributions=(100,10000))

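
The same dice experiment can be checked numerically without the interactive widgets. The sketch below uses only the standard library and a fixed seed; the helper name `mean_of_rolls` and the choice of 2,000 repetitions are assumptions made for illustration. A single fair die has mean 3.5 and standard deviation of about 1.71, so averages of 100 rolls should cluster around 3.5 with a spread of about 1.71 / 10 = 0.171.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable
faces = [1, 2, 3, 4, 5, 6]

def mean_of_rolls(num_rolls):
    """Roll a fair die num_rolls times and return the average face value."""
    return statistics.mean([rng.choice(faces) for _ in range(num_rolls)])

# Repeat the 100-roll experiment many times and collect the averages.
sample_means = [mean_of_rolls(100) for _ in range(2000)]

print("Mean of the averages:  ", statistics.mean(sample_means))
print("Spread of the averages:", statistics.stdev(sample_means))
```

Individual rolls are uniformly distributed, yet their averages pile up in a bell shape around 3.5, which is what the central limit theorem describes.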