# [MCB-163L] Introduction to Data Science in Python

*Estimated Time: 60 minutes*

### Table of Contents:


**Part I: Python**
 1. [Data](#section_data)
 2. [Expressions](#section_expr)
 3. [Names](#section_names)
 4. [Functions](#section_func)
 
**Part II: Tables**
1. [Sorting DataFrames](#section_sort)
2. [Column / Row Selection](#section_filter)
3. [Attributes](#section_attributes)

**Part III: Some Statistics**
1. [The Mean](#section_mean)
2. [The Distribution](#section_distr)

**Part IV: Statistical Analysis**
1. [The Bootstrap](#section_bootstrap)
2. [Confidence Interval](#section_ci)
3. [The P-Value](#section_pval)


# The Jupyter Notebook <a id='section_jupyter'></a>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
Welcome to the Jupyter Notebook! `Notebooks` are documents that can contain text, code, visualizations, and more. 

A notebook is composed of rectangular sections called `cells`. There are 2 kinds of cells: markdown and code. A `markdown cell`, such as this one, contains text. A `code cell` contains code in Python, a programming language that we will be using for the remainder of this module. You can select any cell by clicking it once. After a cell is selected, you can navigate the notebook using the up and down arrow keys.

To run a code cell once it's been selected:

<ul>

  <li>press Shift-Enter, or</li>
  <li>click the Run button in the toolbar at the top of the screen.</li>

</ul>  

If a code cell is running, you will see an asterisk (\*) appear in the square brackets to the left of the cell. Once the cell has finished running, a number will replace the asterisk and any output from the code will appear under the cell.
</div>

In [None]:
# run this cell
print("Hello World!")

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
Code cells can be edited any time after they are highlighted. Try editing the next code cell to print your name.

</div>

In [None]:
# edit the code to print your name
print("Hello: my name is Data Curriculum Staff")

### Saving and Loading


<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
Your notebook can record all of your text and code edits, as well as any graphs you generate or calculations you make. You can save the notebook in its current state by pressing Control-S, clicking the floppy disk icon in the toolbar at the top of the page, or going to the File menu and selecting "Save and Checkpoint".

The next time you open the notebook, it will look the same as when you last saved it.

<br><br>
<i><b>Note:</b></i> After loading a notebook you will see all the outputs (graphs, computations, etc.) from your last session, but you won't be able to use any variables you assigned or functions you defined. You can get the functions and variables back by re-running the cells where they were defined. The easiest way is to highlight the cell where you left off work, then go to the Cell menu at the top of the screen and click "Run all above". You can also use this menu to run all cells in the notebook by clicking "Run all".
</div>
    


<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">

Before we begin, we'll need a few extra tools to conduct our analysis. Run the next cell to load some code packages that we'll use later.


<br><br>
<i><b>Note: </b></i>this cell MUST be run in order for most of the rest of the notebook to work.

</div>

In [None]:
# dependencies: THIS CELL MUST BE RUN
#!pip install -r requirements

import numpy as np
import math
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Part 1: Python <a id='section_python'></a>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">

<code>Python</code> is a programming language: a way for us to communicate with the computer and give it instructions. 

Just like any language, Python has a <i>vocabulary</i> made up of words it can understand, and a <i>syntax</i> giving the rules for how to structure communication.
</div>


### Errors <a id="subsection error"></a>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">

<p>Python is a language, and like natural human languages, it has rules. When you write code, you will often accidentally break some of these rules. When you run a code cell that doesn't follow every rule exactly, Python will produce an <code>error message</code>.</p>

<p>Errors are <i>normal</i>; experienced programmers make many errors every day. Errors are also <i>not dangerous</i>; you will not break your computer by making an error (in fact, errors are a big part of how you learn a coding language). An error is nothing more than a message from the computer saying it doesn't understand you and asking you to rewrite your command.
</p>

<p>We have made an error in the next cell.  Run it and see what happens.
</p>

</div>

In [None]:
print("This line is missing something."

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
The last line of the error output attempts to tell you what went wrong. The <i>syntax</i> of a language is its structure, and this <code>SyntaxError</code> tells you that you have created an illegal structure. In older versions of Python the message mentions "<code>EOF</code>", which means "end of file": Python expected you to write something more (in this case, a right parenthesis) before finishing the cell. Newer versions may instead say directly that the parenthesis was never closed.

</div>

## Part 1.1: Data <a id='section_data'></a>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
<b>Data</b> is information: the "stuff" we manipulate to make and test hypotheses. 

Almost all data you will work with broadly falls into two types: numbers and text. <i>Numerical data</i> shows up green in code cells and can be positive, negative, or include a decimal.
</div>

In [None]:
# Numerical data

4

87623000983

-667

3.14159

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
Text data (also called <i>strings</i>) shows up red in code cells. Strings are enclosed in double or single quotes. Note that numbers can appear in strings.
</div>

In [None]:
# Strings
"a"

"Hi there!"

"We hold these truths to be self-evident that all men are created equal."

# this is a string, NOT numerical data
"3.14159"

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; "> 
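We can store different types of data in a container called an <code>array</code> (in plain Python, this square-bracket container is technically called a <code>list</code>). The items are written inside <code>[...]</code>, separated by commas. Set <code>my_array</code> to different types of data and run the cell.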

</div>
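
As a minimal sketch of what such a container can look like, the cell below mixes numbers and strings in one list. The name `example_array` and its contents are made up for illustration; they are not part of the lab data.

```python
# example_array is a hypothetical name used only for this illustration
example_array = [4, -667, 3.14159, "Hi there!"]

# len counts the items; square brackets with a position pick one item out
# (positions start at 0, so position 0 is the first item)
number_of_items = len(example_array)
first_item = example_array[0]
last_item = example_array[3]
print(number_of_items, first_item, last_item)
```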

In [None]:
my_array = [...]
my_array

## Part 1.2: Expressions <a id='section_expr'></a>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">

A bit of communication in Python is called an <code>expression</code>. It tells the computer what to do with the data we give it.

<p>Here's an example of an expression.</p>

</div>

In [None]:
# an expression
14 + 20

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
When you run the cell, the computer <code>evaluates</code> the expression and prints the result. Note that only the last line in a code cell will be printed, unless you explicitly tell the computer you want to print the result.
</div>

In [None]:
# more expressions. what gets printed and what doesn't?
100 / 10

print(4.3 + 10.98)

33 - 9 * (40000 + 1)

884

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
Many basic arithmetic operations are built into Python, like multiplication <code> * </code>, addition <code> + </code>, subtraction <code> - </code>, and division <code> / </code>. There are many others, which you can read about <a href="http://www.inferentialthinking.com/chapters/03/1/expressions.html">here</a>.

<p>The computer evaluates arithmetic according to the PEMDAS order of operations (just like you probably learned in middle school): anything in parentheses is done first, followed by exponents, then multiplication and division, and finally addition and subtraction.</p>
</div>
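
To see the order of operations one step at a time, here is a small sketch that evaluates a made-up expression piece by piece and then all at once. The expression and the names `inner`, `squared`, `product`, and `total` are chosen only for illustration; `**` is Python's exponent operator.

```python
# Step-by-step evaluation of a hypothetical expression: 2 + 3 * (8 - 6) ** 2
inner = 8 - 6            # parentheses first: 2
squared = inner ** 2     # then the exponent: 4
product = 3 * squared    # then multiplication: 12
total = 2 + product      # finally addition: 14

# Python applies the same order when it evaluates the whole expression at once
same_result = 2 + 3 * (8 - 6) ** 2
print(total, same_result)
```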

In [None]:
# before you run this cell, can you say what it should print?
4 - 2 * (1 + 6 / 3)

## Part 1.3: Names <a id='section_names'></a>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
Sometimes, the values you work with can get cumbersome: maybe the expression that gives the value is very complicated, or maybe the value itself is long. In these cases it's useful to give the value a <i>name</i>.

We can name values using what's called an <i>assignment</i> statement.
</div>

In [None]:
# assigns 442 to x
x = 442

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
The assignment statement has three parts. On the left is the <i>name</i> (<code>x</code>). On the right is the <i>value</i> (442). The <i>equals sign</i> in the middle tells the computer to assign the value to the name.

<p>You'll notice that when you run the cell with the assignment, it doesn't print anything. But, if we try to access <code>x</code> again in the future, it will have the value we assigned it.</p>
</div>

In [None]:
# print the value of x
x

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
You can also assign names to expressions. The computer will compute the expression and assign the name to the result of the computation.
</div>

In [None]:
y = 50 * 2 + 1
y

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
We can then use these names as if they were numbers.</div>

In [None]:
x - 42

In [None]:
x + y

## Part 1.4: Functions <a id='section_func'></a>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
We've seen that values can have names (often called <code>variables</code>), but operations may also have names. A named operation is called a <code>function</code>. Python has some functions built into it.
</div>

In [None]:
# a built-in function 
round

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
Functions get used in <i>call expressions</i>, where a function is named and given values to operate on inside a set of parentheses. The <code>round</code> function returns the number it was given, rounded to the nearest whole number.
</div>

In [None]:
# a call expression using round
round(1988.74699)

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
        
A function may also be called on more than one value (called <i>arguments</i>). For instance, the <b>min</b> function takes however many arguments you'd like and returns the smallest. Multiple arguments are separated by commas. (<i>max</i> works the same way. Can you guess what it does?)
</div>

In [None]:
min(9, -34, 0, 99)

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
Another example is the <code>sum</code> function. The difference is that the items you are summing must be in an <code>array</code> (mentioned in Part 1.1) and must all be numbers. In the cell below, set <code>my_other_array</code> to a list of numbers of your choice, and then use <code>sum</code> to add them up. 
</div>

In [None]:
my_other_array = [...]
my_other_array

In [None]:
sum(my_other_array)

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">

<b>Practice:</b>
<ul>
  <li>The <code>abs</code> function takes one argument (just like <code>round</code>)</li>
</ul> 
Try calling <code>abs</code> in the cell below. What does it do?

Also try calling each function <i>incorrectly</i>, such as with the wrong number of arguments. What kinds of error messages do you see?
</div>

In [None]:
# replace the ... with calls to abs 
...


### Dot Notation

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
        
Python has a lot of <a href="https://docs.python.org/3/library/functions.html">built-in functions</a> (that is, functions that are already named and defined in Python), but even more functions are stored in collections called <i>modules</i>. Earlier, we imported the <code>math</code> module so we could use it later. Once a module is imported, you can use its functions by typing the name of the module, then the name of the function you want from it, separated with a <code>.</code>.
</div>
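
As a small sketch of dot notation, the cell below uses two other names from the `math` module, `math.sqrt` and `math.pi`. The radius of 3 is a made-up value for illustration.

```python
import math

# math.sqrt and math.pi both live in the math module, so each is
# written as module name, dot, then the name we want
root = math.sqrt(16)
circle_area = math.pi * 3 ** 2   # area of a circle with a hypothetical radius of 3
print(root, circle_area)
```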

In [None]:
# a call expression with the factorial function from the math module
math.factorial(5)

# Part 2: Tables <a id='section_tables'></a>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
The last section covered four basic concepts of Python: data, expressions, names, and functions. In this next section, we'll see just how much we can do to examine and manipulate our data with only these minimal Python skills.
</div>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
<code>Tables</code> are fundamental ways of organizing and displaying data. Run the next cell to load the data.
</div>

In [None]:
# Run this cell
primary_auditory = pd.read_csv('./data/primary_auditory_area.csv')
primary_auditory.head()

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
        
This DataFrame (or table) is organized into <code>columns</code>, one for each <i>category</i> of information collected.

<p>You can also think about the table in terms of its <code>rows</code>. Each row represents all the information collected about a particular instance, which can be a person, location, action, or other unit. </p>

<p>In the <code>primary_auditory</code> rows, the instance is a projection from one brain structure to another. The columns then encompass different characteristics of this projection.</p>

<p>Using the function <code>.head()</code> gives us only the first five rows by default. Can you see how many rows there are in total?</p>
</div>
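
Here is a minimal sketch of rows, columns, and `.head()` on a tiny made-up table. The name `toy` and every number in it are invented for illustration and are not taken from the lab data. The `.shape` attribute is one way to read off the total number of rows and columns.

```python
import pandas as pd

# A small made-up table: each row is one projection, each column one characteristic
toy = pd.DataFrame({
    'structure_id': [101, 102, 103, 104, 105, 106, 107],
    'volume': [0.30, 0.12, 0.45, 0.08, 0.27, 0.19, 0.33],
})

first_rows = toy.head()      # first five rows by default
n_rows, n_cols = toy.shape   # (number of rows, number of columns)
print(first_rows)
print(n_rows, n_cols)
```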


## Part 2.1: Sorting DataFrames <a id='section_sort'></a>


### Sorting values in a column using `.sort_values`

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
        
The <code>.sort_values</code> function is used to sort the values in a column of a DataFrame. Here we give it two arguments: the <b>column label</b> <i>(in string form)</i> and <code>ascending</code> <i>(must equal True or False)</i>. To get values sorted from least to greatest, use <code>ascending = True</code>. To get values sorted from greatest to least, use <code>ascending = False</code>.

<p>Let's sort the values in the column <code>projection_density</code> from <i>greatest to least</i> from the <code>primary_auditory</code> DataFrame.</p>
</div>
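
The effect of `ascending` can be sketched on a tiny made-up table first. The name `toy` and its values are hypothetical; only the `.sort_values` call mirrors what we do with the real data.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'structure_id': [101, 102, 103, 104],
    'projection_density': [0.002, 0.150, 0.000, 0.031],
})

# ascending=True puts the smallest value first; ascending=False puts the largest first
smallest_first = toy.sort_values('projection_density', ascending=True)
largest_first = toy.sort_values('projection_density', ascending=False)

# sort_values returns a new table; toy itself keeps its original order
print(largest_first)
```

Because the sorted result is a new table, you need to assign it to a name if you want to keep using it.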

In [None]:
primary_auditory.sort_values('projection_density', ascending = False).head()

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">

<b>Practice:</b>

Try using <code>.sort_values</code> on <code>primary_auditory</code> in the cell below to sort the rows by <code>volume</code> from least to greatest. What is the <code>structure_id</code> of the experiment that had the least volume injected? If more than one experiment has the same volume, use the one nearest the top of the table. Assign this value to <code>least_volume</code>.
</div>

In [None]:
...

In [None]:
least_volume = ...
least_volume


## Part 2.2: Column/Row Selection <a id='section_filter'></a>


### Selecting columns with `[ ... ]`

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
        
Square brackets <b>[...]</b> placed after a DataFrame are used to get a <code>Series</code> containing one column and an index. Inside the brackets goes the name of a column in the form of a string.

Let's select the <code>projection_density</code> from the <code>primary_auditory</code> DataFrame and assign it to <code>proj_density</code>.
</div>
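
As a sketch of the pattern, here is column selection on a small made-up table. The names `toy` and `toy_volume` and all values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'structure_id': [101, 102, 103],
    'volume': [0.30, 0.12, 0.45],
})

# The column name goes inside the brackets as a string
toy_volume = toy['volume']
print(type(toy_volume).__name__)   # the kind of object we got back
print(toy_volume)
```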

In [None]:
# select a single column from the table
proj_density = ...
proj_density


## Part 2.3: Attributes <a id='section_attributes'></a>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
       
Columns have <code>attributes</code> that give information about them, like the values contained within them. These attributes are accessed using dot notation. But, since an attribute doesn't perform an operation on the table, there are no parentheses (like there would be in a call expression).

An attribute you'll use frequently is <code>values</code>, which gives us the values in the column in the form of an <code>array</code> (mentioned in Part 1.1).
</div>
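
A minimal sketch of the `.values` attribute on a made-up column (the name `toy_volume` and its numbers are hypothetical):

```python
import pandas as pd

toy_volume = pd.Series([0.30, 0.12, 0.45], name='volume')  # made-up numbers

# .values is an attribute, so there are no parentheses after it
volume_array = toy_volume.values
print(volume_array)
print(type(volume_array).__name__)
```

The result is a NumPy array, which behaves much like the square-bracket containers from Part 1.1 for our purposes.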

In [None]:
# Run this cell
proj_density.values

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
We can do several things with the values of the columns. Using the length function <code>len()</code>, we can find the number of values in our selected column. Run the cell below to see how many values are in <code>proj_density</code>.
</div>

In [None]:
# Run this cell
len(proj_density.values)

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
Something else we can do with this data is to obtain specific values. To obtain the first value, we use <code>proj_density.values[0]</code>. This is the first value in the order the table is stored; it is the greatest projection density only if the table is already ordered from greatest to least.
</div>

In [None]:
# Run this cell
proj_density.values[0]

# Part 3: Statistics + DataFrames <a id='section 2'></a>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
We can also apply our mathematical and statistical background along with our coding experience to find basic statistics about our data (e.g. mean, median, standard deviation). In this section, we will use DataFrames and functions to calculate these statistics.
</div>


## Part 3.1: The Mean<a id='section_mean'></a>

Let's first find the **mean** projection density in the primary auditory area. The formula for the mean is:

$$M = \frac{\sum_{i=1}^N x_i}{N}$$


In other words, we add up all the values in the projection density array, then divide the sum by the total number of values in the array. For example, in the context of this problem, we can assign these variables as follows:




* $M$ = Average projection density in the Primary Auditory Area
* $x_i$ = Data: i.e. each projection density value in the array
* $N$ = Total number of projection density values in the array
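
The formula can be sketched on a tiny made-up array before we apply it to the real data. The values in `x` are chosen only to make the arithmetic easy; the names `x`, `N`, `total`, and `M` simply mirror the symbols above.

```python
# Hypothetical values, chosen only to make the arithmetic easy
x = [0.0, 0.0, 0.25, 0.75]

N = len(x)      # number of values in the array
total = sum(x)  # the sum in the numerator of the formula
M = total / N   # the mean
print(M)
```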

To do this, we can use **proj_density** which we defined earlier and use `.values` to get the list of values. Then, we can sum up the values using the `sum()` function.

In [None]:
# Run this cell
sum_values = sum(proj_density.values)
sum_values

 
Next, we can use the `len()` function to find the number of values in the list of projection densities.

In [None]:
# Run this cell
num_values = len(proj_density.values)
num_values

    
Finally, we can divide `sum_values` by `num_values` to get the mean.

In [None]:
# Run this cell
mean = sum_values / num_values
mean


**Practice:** Now, try to find the mean projection volume of the primary auditory area. First, assign `proj_volumes` to the array of projection volumes using `.values`. Then, assign `sum_volumes` to the sum of these values using the `sum()` function. Next, assign `num_volumes` to the number of values using the `len()` function. Finally, assign `mean_volume` to `sum_volumes` divided by `num_volumes`.

In [None]:
proj_volumes = primary_auditory['volume'].values
sum_volumes = sum(proj_volumes)
num_volumes = len(proj_volumes)
mean_volume = sum_volumes / num_volumes
mean_volume

## Part 3.2: The Distribution<a id='section_distr'></a>

Lastly, we can visualize the distribution of the projection density values. A distribution shows how the values are spread out, drawn here as a smooth curve. Three common shapes are:
* A **right skewed** distribution has the majority of values concentrated on the left with a tail off to the right. 
* A **left skewed** distribution has the majority of values concentrated on the right with a tail off to the left. 
* A **normal (symmetric)** distribution has values concentrated at the center. It is also known as the "bell curve."

The graph below, taken from Lumen Learning, shows how values can be distributed.

<img src='./images/dist.png' width="700px"/>

Run the cell below to output a visual distribution of projection densities in the primary auditory area.

In [None]:
# Run this cell
plt.xlim(0, 0.21)
# kdeplot draws the same smooth density curve (distplot is deprecated in recent seaborn releases)
sns.kdeplot(proj_density);

We can clearly see that this distribution is skewed to the right: the projection density values are highly concentrated on the left, with a tail going off to the right. A familiar example of a right-skewed distribution is U.S. household income. You can learn more about distributions [here](https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/skewed-distribution/).

Consider the mean, median, and standard deviation of the projection densities:

* **Mean:** 0.0055
* **Median:** 3.6405e-05
* **Standard Deviation:** 2.7076e-16

Compare these values to the figure above and the visual distribution. Notice that our data contain many zeros as well as higher values below 0.025, which in turn affect the results when calculating these statistics.
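
To see how skew pulls the mean away from the median, here is a minimal sketch on made-up right-skewed data. The array `skewed` is hypothetical and is not the projection density data; `np.mean`, `np.median`, and `np.std` are the NumPy functions for these three statistics.

```python
import numpy as np

# Made-up right-skewed data: mostly zeros and small values, plus one large value
skewed = np.array([0.0, 0.0, 0.0, 0.001, 0.002, 0.003, 0.2])

skewed_mean = np.mean(skewed)      # pulled upward by the single large value
skewed_median = np.median(skewed)  # the middle value, not affected by how far the tail reaches
skewed_std = np.std(skewed)        # spread of the values around the mean
print(skewed_mean, skewed_median, skewed_std)
```

In this sketch the mean is many times larger than the median, which is the typical signature of a right-skewed distribution.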

# Part 4: Statistical Analysis 


## Part 4.1: The Bootstrap<a id='section_bootstrap'></a>

    
### Background

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
Before we dive into what bootstrapping is, we're going to define some terms that we will be using throughout the rest of the intro lab and the lab you will perform.
</div>

        
- **Sample**: A sample is a set of data collected from a population by a defined procedure. In the next cells, we draw a sample from a population using the `.sample` function and compare it with the original population.

   - *The type of sample that data scientists use most often is a simple random sample, where each member of the population has an equal chance of being chosen.*


- **Statistic**: A single measure of some attribute of a sample. In the next cells, you will be able to see how we obtain the mean of the population and the sample using the `.mean()` function.
 
   - *An example of a statistic that is used many times is the mean.*

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
        
Run the next couple of cells. In these cells, we will show you how a sample is drawn and how a statistic is computed from it. You do not need to understand the code in this section; you are only responsible for understanding the concept and analysis behind bootstrapping.

We will be using the grade distribution for the class Math 110 in Fall 2018. That semester, 402 people received a grade. The grades, shown on the x-axis, are expressed as the GPA points awarded for each grade (from 4.0 for an A or A+ down to 0.0 for an F).

We will also find a statistic of this data, the mean.
</div>

In [None]:
# Run this cell
math_110 = pd.read_csv('./data/math_110_grade_distribution.csv')
math_110.hist();

In [None]:
# Run this cell
# .iloc[0] picks the first column's mean by position
math_110_mean = math_110.mean().iloc[0]
math_110_mean

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
       
In the next cell, we will take a sample of the students who took the class. Our sample will be of 100 people. We will then obtain the mean and compare it with that of the population.
</div>

In [None]:
# Run this cell
sample_1 = math_110.sample(100)
sample_1.hist();

In [None]:
sample_1.mean().iloc[0]

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
        
How do the distributions from the population and the sample differ? How does the statistic differ? Write your answer below.
</div>

*If you run the two cells above again, you most likely will get a different distribution and a different statistic. This is because the statistic is only based off the sample and not the entire population.*

    
### Bootstrapping

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
       
A data scientist is using the data in a random sample to estimate an unknown parameter. She uses the sample to calculate the value of a statistic that she will use as her estimate.

Once she has calculated the observed value of her statistic, she could just present it as her estimate and go on her merry way. But she’s a data scientist. She knows that her random sample is just one of numerous possible random samples, and thus her estimate is just one of numerous plausible estimates.

By how much could those estimates vary? To answer this, it appears as though she needs to draw another sample from the population, and compute a new estimate based on the new sample. But she doesn’t have the resources to go back to the population and draw another sample.

It looks as though the data scientist is stuck.

Fortunately, a brilliant idea called the bootstrap can help her out. Since it is not feasible to generate new samples from the population, the bootstrap generates new random samples by a method called resampling: the new samples are drawn at random from the original sample.
</div>

<img src="./images/bootstrap.png" width="5000px"/>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
       
In our example above, the grades for all 402 students enrolled in Math 110 in Fall 2018 are our population. Bootstrapping is used when we don't have the whole population and only have a sample drawn from it. 

Pretending we don't have the population, in the next cells we will use the bootstrap on a sample of 100 students (provided by us) to estimate the population mean, which in this case we can check because we do have the population.
</div>

In [None]:
# Run this cell
math_110_sample = pd.read_csv('./data/math_110_sample.csv')
math_110_sample.hist();

In [None]:
# Run this cell
sample_mean = math_110_sample.mean().iloc[0]
sample_mean


Our overall goal is to use the resamples to calculate an interval in which the statistic (the mean) of the population is likely to be. To do this, we need to find the statistic for all the resamples, which we will do in the following cells.

Before we start, there are three keys to resampling that you should know about when performing a bootstrap:
- Draw at random from the original sample 
  - This just means that we are getting a random sample of the sample and that every individual has an equal possibility of being chosen. 
- Draw with replacement (replace = True)
  - Pretend you are drawing a card from a deck of 52 cards. You draw the first card. If you were to draw without replacement, then you would have 51 cards to draw from for the next turn. However, if you draw with replacement, you draw from all 52 cards again, meaning that you can draw the same card that you drew the first time.
- Draw as many values as the original sample contained 
  - For every resample, you will draw as many values as the original sample contained (in this case 100).

We will apply these rules when doing bootstrapping in the cells below.
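
The three rules can be sketched with a single resample of a tiny made-up sample. The array `original_sample` and the seed are assumptions for illustration; `rng.choice` is NumPy's random-draw function, used here in place of the DataFrame `.sample` method that the lab itself uses.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# A made-up original sample of GPA points
original_sample = np.array([4.0, 3.7, 3.3, 3.0, 2.7, 4.0, 3.3, 2.0])

# One resample: drawn at random, with replacement, same size as the original
resample = rng.choice(original_sample, size=len(original_sample), replace=True)
print(resample)
print(resample.mean())
```

Because the draw is with replacement, some values from the original sample may show up more than once in the resample and others not at all.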

In [None]:
# In this cell, we will use something called a **for loop** to easily 
# get the resamples and find the mean of each one. We want a large number
# of resamples, so we set it to 5000.
means = []
for i in np.arange(5000):
    # n = 100 because the original sample had 100 students;
    # replace = True means we draw with replacement
    resampled = math_110_sample.sample(n = 100, replace = True)
    # .iloc[0] picks out the mean of the first column as a single number
    resample_mean = resampled.mean().iloc[0]
    means = np.append(means, resample_mean)
# We obtain the means of the resamples in an array, shown below
means

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
        
Run the cell below to make a DataFrame with these means. 
</div>

In [None]:
# Run this cell
sample_table = pd.DataFrame(data={'Math 110 Averages': means})
sample_table.hist();


## Part 4.2: Confidence Interval<a id='section_ci'></a>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
<p>In the case of our lab, after bootstrapping is done, we want to find the confidence interval. Bootstrapping produces a range of estimates that accounts for chance variability in the random sample. We summarize that range with an interval built so that, if the whole process (drawing a sample and bootstrapping it) were repeated many times, about 95 out of every 100 intervals would contain the population parameter and about 5 would not. This is why we say we are 95% confident that the interval contains our population parameter.</p> 

<p>To build this interval, we create a <code>95% Confidence Interval</code> from the bootstrap. We do this by using the <code>.quantile</code> function, which takes in the left and right cutoffs. We choose these to be <i>0.025</i> and <i>0.975</i> so that we keep the middle 95% of the bootstrap values.</p>

</div>
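
As a sketch of what the two cutoffs do, the cell below takes the middle 95% of some stand-in values. The array `fake_means` is randomly generated for illustration and is not the Math 110 bootstrap; `np.quantile` is used here as the NumPy counterpart of the DataFrame `.quantile` method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for bootstrap means: 5000 made-up values, not the Math 110 results
fake_means = rng.normal(loc=3.0, scale=0.1, size=5000)

# Cut off 2.5% on each side to keep the middle 95%
left_end, right_end = np.quantile(fake_means, [0.025, 0.975])

# Check: the fraction of values between the two ends should be close to 0.95
inside = np.mean((fake_means >= left_end) & (fake_means <= right_end))
print(left_end, right_end, inside)
```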

In [None]:
# Run this cell
left_95 = 0.025
right_95 = 0.975
percentile = list(sample_table.quantile([left_95, right_95])['Math 110 Averages'])
percentile

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
       
Run the next cell to see the 95% Confidence Interval overlaid onto the bootstrap. 
</div>

In [None]:
# Run this cell
sample_table.hist()
plt.hlines(y=0, xmin=percentile[0], xmax=percentile[1], linewidth=10, color = 'y');


## Part 4.3: The P-value<a id='section_pval'></a>


### Background

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">

Before diving into the P-value, we need to go over some necessary background, namely the <code>null hypothesis</code> and the <code>alternative hypothesis</code>.
</div>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
All statistical tests attempt to choose between two views of the world. Specifically, the choice is between two views about how the data were generated. These two views are called hypotheses.

<code>The null hypothesis</code>: This is a clearly defined model about chances. It says that the data were generated at random under clearly specified assumptions about the randomness. The word “null” reinforces the idea that if the data look different from what the null hypothesis predicts, the difference is due to nothing but chance.

<code>The alternative hypothesis</code>: This says that some reason other than chance made the data differ from the predictions of the model in the null hypothesis.
</div>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
Let's say that in Fall 2018, the head of the math department wanted to sample the scores of 100 students from one of the math classes on campus. He did this, but forgot to write down what math class it was. The mean that he obtained from these 100 students was 3.408, which we will assign to <code>math_mean</code>. Did he take the sample from Math 110, or was it from another class?
</div>

In [None]:
# Run this cell
math_mean = 3.408

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
The first step is to come up with a <code>null hypothesis</code> and an <code>alternative hypothesis</code>.
</div>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">

<p><b>null hypothesis</b>: math_mean comes from the same sample distribution as that of Fall 2018 Math 110. Any variation is purely due to chance.</p>

<p><b>alternative hypothesis</b>: math_mean does not come from the same sample distribution as that of Fall 2018 Math 110.</p> 
</div>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
We can decide which hypothesis the data support better using something called a P-value. The P-value of a test is the chance, based on the model in the null hypothesis, that the test statistic will be equal to the observed value in the sample or even further in the direction that supports the alternative.

If a P-value is small, that means the tail beyond the observed statistic is small and so the observed statistic is far away from what the null predicts. This implies that the data support the alternative hypothesis better than they support the null. By convention, we say that anything less than 5% is "statistically significant".
</div>
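
The definition can be sketched with a handful of made-up numbers. Here `simulated` stands in for statistics generated under the null hypothesis and `observed` for the value seen in the sample; both are hypothetical. The sketch also assumes the alternative points toward larger values, so "further in the direction of the alternative" means "at least as large".

```python
import numpy as np

# Made-up statistics simulated under a null hypothesis, plus a made-up observed value
simulated = np.array([2.9, 3.0, 3.1, 3.0, 2.8, 3.2, 3.0, 3.1, 2.9, 3.3])
observed = 3.2

# Fraction of simulated statistics at least as large as the observed one
example_p_value = np.mean(simulated >= observed)
print(example_p_value)
```

Two of the ten simulated values are at least as large as the observed one, so this made-up P-value is well above the 5% cutoff and the null hypothesis would not be rejected here.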

In [None]:
# Run this cell
sample_table.hist(bins = 20)
plt.axvline(math_mean, color = 'k');

In [None]:
# Run this cell
p_value = np.average(sample_table > math_mean)
p_value

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
Our P-value ends up being 0.0 (or very close to it, since the resamples are random). Because our P-value is less than 0.05, we reject the <code>null hypothesis</code>. This suggests that <code>math_mean</code> does not come from the same sample distribution as that of Fall 2018 Math 110: <code>math_mean</code> was too large to reflect chance variation alone. However, we do not say that we have proven the <code>alternative hypothesis</code>; it could be true, but we can never be totally sure. We can only show that our data would be highly improbable if the <code>null hypothesis</code> were true, which is why we reject it. 
</div>

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
After doing some organizing in his office, the head of the math department found out which course had been sampled: it was actually Math 125A! We were right. The mean of the sampled class had not come from the same distribution as that of our own class, Math 110!
</div>

# Conclusion

<div style="border-left: 3px solid #003262; padding: 1px; padding-left: 10px; background: #ffffff; ">
    
It is important that you review these concepts before your in-class lab. Know the basics of working in Python and how to use functions and tools like <code>.sort_values</code>, <code>len()</code>, <code>sum()</code>, <code>.values</code> and <code>[...]</code>. Don't worry too much about the code for Part 4, but do know conceptually what a bootstrap does, what a 95% Confidence Interval is, and what the significance of a P-value is. 
</div>

<i>Notebook Developed by: Elias Saravia & Daniel Lopez</i>