# Berkeley Data Science Modules: Introduction to Data Science in Python

# The Jupyter Notebook <a id='section 0'></a>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
Welcome to the Jupyter Notebook! **Notebooks** are documents that can contain text, code, visualizations, and more. 

A notebook is composed of rectangular sections called **cells**. There are 2 kinds of cells: markdown and code. A **markdown cell**, such as this one, contains text. A **code cell** contains code in Python, a programming language that we will be using for the remainder of this module. You can select any cell by clicking it once. After a cell is selected, you can navigate the notebook using the up and down arrow keys.

To run a code cell once it's been selected, 
- press Shift-Enter, or
- click the Run button in the toolbar at the top of the screen. 

If a code cell is running, you will see an asterisk (\*) appear in the square brackets to the left of the cell. Once the cell has finished running, a number will replace the asterisk and any output from the code will appear under the cell.

In [2]:
# run this cell
print("Hello World!")

Hello World!


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
Code cells can be edited any time after they are highlighted. Try editing the next code cell to print your name.

In [3]:
# edit the code to print your name
print("Hello: my name is NAME")

Hello: my name is NAME


## Saving and Loading


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
Your notebook can record all of your text and code edits, as well as any graphs you generate or calculations you make. You can save the notebook in its current state by pressing Control-S, clicking the floppy disc icon in the toolbar at the top of the page, or going to the File menu and selecting "Save and Checkpoint".

The next time you open the notebook, it will look the same as when you last saved it.

**Note:** after loading a notebook you will see all the outputs (graphs, computations, etc.) from your last session, but you won't be able to use any variables you assigned or functions you defined. You can get the functions and variables back by re-running the cells where they were defined. The easiest way is to highlight the cell where you left off work, then go to the Cell menu at the top of the screen and click "Run all above". You can also use this menu to run all cells in the notebook by clicking "Run all".

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
Before we begin, we'll need a few extra tools to conduct our analysis. Run the next cell to load some code packages that we'll use later. 

Note: this cell MUST be run in order for most of the rest of the notebook to work.

In [4]:
# dependencies: THIS CELL MUST BE RUN
from datascience import *
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import ipywidgets as widgets
%matplotlib inline

# 1. Python <a id='section 1'></a>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">

**Python** is a programming language: a way for us to communicate with the computer and give it instructions.

Just like any language, Python has a *vocabulary* made up of words it can understand, and a *syntax* giving the rules for how to structure communication.


### Errors <a id="subsection error"></a>


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">

Python is a language, and like natural human languages, it has rules. When you write code, you will often accidentally break some of these rules. When you run a code cell that doesn't follow every rule exactly, Python will produce an **error message**.

Errors are *normal*; experienced programmers make many errors every day. Errors are also *not dangerous*; you will not break your computer by making an error (in fact, errors are a big part of how you learn a coding language). An error is nothing more than a message from the computer saying it doesn't understand you and asking you to rewrite your command.

We have made an error in the next cell.  Run it and see what happens.

In [5]:
print("This line is missing something."

SyntaxError: unexpected EOF while parsing (<ipython-input-5-c7b7223ecd08>, line 1)

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
The last line of the error output attempts to tell you what went wrong.  The *syntax* of a language is its structure, and this **SyntaxError** tells you that you have created an illegal structure.  "**EOF**" means "end of file," so the message is saying Python expected you to write something more (in this case, a right parenthesis) before finishing the cell.
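
For comparison, here is a minimal sketch of the repaired cell. The only change that matters is the closing parenthesis; the name `message` is introduced here purely for illustration and is not part of the original cell. Note that the exact wording of a SyntaxError message can differ between Python versions.

```python
# Storing the text under a name first (an illustrative choice, not required).
message = "This line is missing something."

# The closing parenthesis tells Python the call to print is complete.
print(message)
```

Running this version produces the text itself instead of an error message.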

## 1a. Data <a id='subsection 1a'></a>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
**Data** is information: the "stuff" we manipulate to make and test hypotheses.

Almost all data you will work with broadly falls into two types: numbers and text. *Numerical data* shows up green in code cells and can be positive, negative, or include a decimal.

In [6]:
# Numerical data

4

87623000983

-667

3.14159

3.14159

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
Text data (also called *strings*) shows up red in code cells. Strings are enclosed in double or single quotes. Note that numbers can appear in strings.

In [7]:
# Strings
"a"

"Hi there!"

"We hold these truths to be self-evident that all men are created equal."

# this is a string, NOT numerical data
"3.14159"

'3.14159'
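
To see why the difference between a number and a string of digits matters, the sketch below compares the two. The names `number`, `text`, and `converted` are chosen here for illustration; `type` and `float` are Python built-ins.

```python
number = 3.14159    # numerical data
text = "3.14159"    # text data: the quotes make it a string

# type tells us what kind of data a value is.
print(type(number))
print(type(text))

# Multiplying a number does arithmetic...
print(number * 2)
# ...but multiplying a string repeats its characters.
print(text * 2)

# float converts a numeric-looking string into an actual number.
converted = float(text)
print(converted == number)
```

The last line prints `True`: once converted, the string's contents behave like the number they spell out.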

## 1a. Expressions <a id='subsection 1a'></a>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">

A bit of communication in Python is called an **expression**. It tells the computer what to do with the data we give it.

Here's an example of an expression.

In [8]:
# an expression
14 + 20

34

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
When you run the cell, the computer **evaluates** the expression and displays the result. Note that only the value of the last line in a code cell is displayed automatically; to show anything else, you must explicitly tell the computer to print it with `print`.

In [9]:
# more expressions. what gets printed and what doesn't?
100 / 10

print(4.3 + 10.98)

33 - 9 * (40000 + 1)

884

15.280000000000001


884

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
Many basic arithmetic operations are built into Python, like **( * )**, **( + )**, **( - )**, and **( / )**. There are many others, which you can read about [here](http://www.inferentialthinking.com/chapters/03/1/expressions.html).

The computer evaluates arithmetic according to the PEMDAS order of operations (just like you probably learned in middle school): anything in parentheses is done first, followed by exponents, then multiplication and division, and finally addition and subtraction.

In [10]:
# before you run this cell, can you say what it should print?
4 - 2 * (1 + 6 / 3)

-2.0
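
One way to check the order of operations is to break the expression above into the steps the computer takes. The intermediate names below are made up for this sketch. Notice that `/` always produces a decimal number in Python 3, which is why the final answer is `-2.0` rather than `-2`.

```python
# Step 1: inside the parentheses, division comes before addition.
inner_division = 6 / 3              # 2.0
parentheses = 1 + inner_division    # 3.0

# Step 2: multiplication comes before subtraction.
product = 2 * parentheses           # 6.0

# Step 3: the subtraction is done last.
result = 4 - product                # -2.0

print(result)
print(result == 4 - 2 * (1 + 6 / 3))
```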

## 1b. Names <a id='subsection 1b'></a>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
Sometimes, the values you work with can get cumbersome: maybe the expression that gives the value is very complicated, or maybe the value itself is long. In these cases it's useful to give the value a **name**.

We can name values using what's called an *assignment* statement.

In [11]:
# assigns 442 to x
x = 442

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
The assignment statement has three parts. On the left is the *name* (**x**). On the right is the *value* (442). The *equals sign* in the middle tells the computer to assign the value to the name.

You'll notice that when you run the cell with the assignment, it doesn't print anything. But, if we try to access `x` again in the future, it will have the value we assigned it.

In [12]:
# print the value of x
x

442

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
You can also assign names to expressions. The computer will compute the expression and assign the name to the result of the computation.

In [13]:
y = 50 * 2 + 1
y

101

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
We can then use these names as if they were numbers.

In [14]:
x - 42

400

In [15]:
x + y

543
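
A name holds the *result* of its expression at the moment of assignment, not the expression itself. The short sketch below illustrates this with the same `x` and `y` as above; the name `total` is introduced only for this example.

```python
x = 442
y = 50 * 2 + 1    # y is assigned the value 101

total = x + y     # total is assigned the value 543
print(total)

# Giving x a new value later does not go back and change total.
x = 10
print(total)
```

Both `print` calls show `543`, because `total` was computed once, when `x` was still 442.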

## 1c. Functions <a id='subsection 1c'></a>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
We've seen that values can have names (often called **variables**), but operations may also have names. A named operation is called a **function**. Python has some functions built into it.

In [16]:
# a built-in function 
round

<function round(number, ndigits=None)>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
Functions get used in *call expressions*, where a function is named and given values to operate on inside a set of parentheses. The **round** function returns the number it was given, rounded to the nearest whole number.

In [17]:
# a call expression using round
round(1988.74699)

1989

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
A function may also be called on more than one value (called *arguments*). For instance, the **min** function takes however many arguments you'd like and returns the smallest. Multiple arguments are separated by commas.

In [18]:
min(9, -34, 0, 99)

-34
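
As a further sketch, the output of `round` earlier showed a second, optional argument called `ndigits`. Passing it asks for that many digits after the decimal point. The built-in `max` works like `min` but returns the largest argument, and call expressions can be nested inside one another. The names on the left are illustrative.

```python
# round with a second argument keeps that many decimal places.
rounded_two_places = round(1988.74699, 2)
print(rounded_two_places)    # 1988.75

# max returns the largest of its arguments.
largest = max(9, -34, 0, 99)
print(largest)               # 99

# The inner call is evaluated first, then its result is rounded.
nested = round(min(2.71, 3.14))
print(nested)                # 3
```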

### Dot Notation

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
Python has a lot of [built-in functions](https://docs.python.org/3/library/functions.html) (that is, functions that are already named and defined in Python), but even more functions are stored in collections called *modules*. Earlier, we imported the **math** module so we could use it later. Once a module is imported, you can use its functions by typing the name of the module, then the name of the function you want from it, separated with a `.`.

In [19]:
# a call expression with the factorial function from the math module
math.factorial(5)

120
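
The same dot notation works for anything else stored in a module. The sketch below uses two more functions and one constant from the standard **math** module; the names on the left are chosen only for illustration.

```python
import math

# sqrt returns the square root as a decimal number.
root = math.sqrt(16)
print(root)          # 4.0

# floor always rounds down to a whole number.
rounded_down = math.floor(1988.74699)
print(rounded_down)  # 1988

# Modules can also hold named values, such as pi. No parentheses are needed.
print(math.pi)
```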

# 2. Tables <a id='section 2'></a>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">

The last section covered four basic concepts of Python: data, expressions, names, and functions. In this next section, we'll see just how much we can do to examine and manipulate our data with only these minimal Python skills.

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
**Tables** are fundamental ways of organizing and displaying data. Run the next cell to load the data.

In [20]:
# Run this cell
primary_auditory = pd.read_csv('primary_auditory_area.csv')
primary_auditory.head()

| | id | projection_density | projection_intensity | experiment_id | structure_id | volume |
|---|---|---|---|---|---|---|
| 0 | 633910317 | 0.369275 | 2317.957628 | 146858006 | 1002 | 2.305407 |
| 1 | 633909357 | 4e-06 | 41.824506 | 146858006 | 63 | 0.144033 |
| 2 | 633910180 | 0.012837 | 508.76271 | 146858006 | 895 | 1.887742 |
| 3 | 633909650 | 0.000197 | 167.747616 | 146858006 | 338 | 0.022839 |
| 4 | 633909678 | 2.7e-05 | 209.261534 | 146858006 | 362 | 1.525187 |


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
This DataFrame (or table) is organized into **columns**, one for each *category* of information collected.

You can also think about the table in terms of its **rows**. Each row represents all the information collected about a particular instance, which can be a person, location, action, or other unit. 

Using the function **.head()** gives us the first five rows by default. Can you see how many rows there are in total?

## 2a. Table Attributes <a id='subsection 2a'></a>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
Every table has **attributes** that give information about the table, like the number of rows and the number of columns. Table attributes are accessed using dot notation. But since an attribute doesn't perform an operation on the table, there are no parentheses (like there would be in a call expression).

Attributes you'll use frequently include **index** and **columns**, which identify the rows and columns in the table, respectively. Using the length function **len(...)**, we can find the number of rows and columns in the table. 

In [21]:
# get the number of columns
len(primary_auditory.columns)

6

In [22]:
# get the number of rows
len(primary_auditory.index)

1264
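
To try these attributes without the data file, here is a minimal sketch on a tiny stand-in table. The DataFrame `toy` and its values are made up for illustration and are not taken from the real data. The sketch also shows the pandas attribute `shape`, which reports rows and columns together.

```python
import pandas as pd

# A hypothetical three-row table with invented values.
toy = pd.DataFrame({
    "structure_id": [1002, 63, 895],
    "volume": [2.3, 0.1, 1.9],
})

# Attributes use dot notation without parentheses.
print(len(toy.index))      # number of rows: 3
print(len(toy.columns))    # number of columns: 2

# shape gives (rows, columns) in a single step.
print(toy.shape)           # (3, 2)
```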

## 2b. Sorting DataFrames <a id='subsection 2b'></a>

### Sorting values in a column using  `.sort_values`

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
The **.sort_values** function sorts a DataFrame by the values in one of its columns. Here we give it two arguments: the **column label** *(in string form)* and **ascending** *(must equal True or False)*. To sort values from least to greatest, use **ascending = True** (this is the default if you leave the argument out). To sort values from greatest to least, use **ascending = False**.

Let's sort the values in the column **projection_density** from *greatest to least* from the **primary_auditory** DataFrame.

In [23]:
primary_auditory.sort_values('projection_density', ascending = False).head()

| | id | projection_density | projection_intensity | experiment_id | structure_id | volume |
|---|---|---|---|---|---|---|
| 237 | 633910344 | 0.50376 | 4533.668548 | 146858006 | 1027 | 0.625498 |
| 0 | 633910317 | 0.369275 | 2317.957628 | 146858006 | 1002 | 2.305407 |
| 739 | 633117226 | 0.274526 | 1992.324495 | 120491896 | 1002 | 2.275487 |
| 950 | 631208331 | 0.158768 | 2455.250805 | 100149109 | 1002 | 2.306901 |
| 419 | 633424814 | 0.142119 | 1435.964484 | 116903230 | 1002 | 2.291676 |
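
The sketch below repeats the same kind of sort on a tiny stand-in table so both directions can be compared side by side. The DataFrame `toy`, its `label` column, and all of its values are invented for illustration. It also shows that `.sort_values` returns a new sorted table and leaves the original in its starting order.

```python
import pandas as pd

# A hypothetical table with made-up projection densities.
toy = pd.DataFrame({
    "label": ["a", "b", "c", "d"],
    "projection_density": [0.01, 0.50, 0.00, 0.27],
})

# Least to greatest (ascending is True when left out).
ascending_sorted = toy.sort_values("projection_density")

# Greatest to least.
descending_sorted = toy.sort_values("projection_density", ascending=False)

print(descending_sorted)

# The original table keeps its original row order.
print(toy)
```

To keep a sorted table for later use, assign it to a name, as done with `descending_sorted` here.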


## 2c. Column/Row Selection <a id='subsection 2b'></a>

### Selecting columns with `[ ... ]`

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
Square brackets **[...]** are used to get a **Series** containing one column and the index. Inside the brackets goes the name of a column, in the form of a string.

Let's select the **projection_density** column from the **primary_auditory** DataFrame.

In [24]:
# select a single column by its label
primary_auditory['projection_density']

0       3.692751e-01
1       4.123313e-06
2       1.283747e-02
3       1.972891e-04
4       2.664822e-05
5       3.834191e-04
6       4.992553e-04
7       7.612440e-06
8       8.460778e-04
9       7.893355e-05
10      1.320373e-04
11      4.315364e-06
12      6.639497e-02
13      1.077242e-04
14      1.864043e-03
15      0.000000e+00
16      7.855044e-07
17      2.479754e-04
18      0.000000e+00
19      3.221951e-05
20      8.637651e-05
21      0.000000e+00
22      7.261229e-03
23      6.018361e-02
24      6.312055e-03
25      4.234803e-04
26      6.771754e-04
27      6.093192e-02
28      0.000000e+00
29      9.126371e-04
            ...     
1234    6.116316e-03
1235    1.020957e-03
1236    3.521699e-04
1237    6.276456e-09
1238    3.261592e-02
1239    7.158385e-08
1240    5.796269e-07
1241    3.296581e-02
1242    0.000000e+00
1243    1.779570e-05
1244    8.520625e-03
1245    5.562547e-03
1246    0.000000e+00
1247    1.629792e-06
1248    1.006289e-06
1249    1.901035e-03
1250    0.000

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
Using **.values** will give us the values in the column in the form of an **array**. An **array** is simply a sequence of values such as integers, decimals, or strings.

In [25]:
# Run this cell
primary_auditory['projection_density'].values

array([3.69275150e-01, 4.12331250e-06, 1.28374696e-02, ...,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00])

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">

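To access the first value of an array, we use **[...]** again with the number 0, because Python starts counting positions from 0.

Suppose n is the number of values in our array. Because counting starts at 0, the last value is at position n - 1, so to get it we would put n - 1 inside the brackets.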

In [26]:
# Run this cell
primary_auditory['projection_density'].values[0]

0.369275149579306
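
Here is a minimal sketch of both ideas, first and last position, on a small array. The array `values` holds made-up numbers, not the real column. The sketch also shows a shortcut: position `-1` counts from the end, so it reaches the last value without needing to know n.

```python
import numpy as np

# A hypothetical array of four invented values.
values = np.array([0.37, 0.004, 0.013, 0.0002])

n = len(values)          # number of values in the array: 4

first = values[0]        # positions start at 0
last = values[n - 1]     # so the last position is n - 1
also_last = values[-1]   # negative positions count from the end

print(first)
print(last)
print(last == also_last)
```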

# 3. Statistical Analysis 

## 3.a Bootstrapping

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
In this lab, we will be using a technique called bootstrapping. Statisticians use this technique when they want to learn about a population but all they have is a sample.

A sample is a set of data collected from a population by a defined procedure. In this lab, our sample will be the different experiments conducted in a certain area of the brain and their projection densities. 

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
The technique simulates repeated random sampling from the population. We assume that since our original sample is large and random, it resembles the population, so we sample from it! In this case, we are assuming that the experiments are a good representation of the overall population of projection densities in the specific area that we are injecting.

Below are a couple of illustrations showing the concept.

<img src="bootstrap_1.png" width="500px"/>

<img src="bootstrap_2.png" width="500px"/>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #f0f0ff; ">
    
There are three keys to resampling correctly while performing a bootstrap:
- Draw at random from the original sample 
- Draw with replacement (replace = True)
- Draw as many values as the original sample contained 
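
The three keys above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the array `sample` holds made-up values rather than the real projection densities, the random generator is seeded so the result is repeatable, and the choice of 1000 repetitions and of the mean as the statistic are arbitrary examples.

```python
import numpy as np

# Seeded random generator so the sketch gives the same result every run.
rng = np.random.default_rng(seed=0)

# A hypothetical original sample with invented values.
sample = np.array([0.37, 0.004, 0.013, 0.0002, 0.066, 0.0])

# One bootstrap resample: drawn at random, with replacement,
# and the same size as the original sample.
resample = rng.choice(sample, size=len(sample), replace=True)
print(resample)

# Repeating the resampling many times gives a collection of
# bootstrap means, one per resample.
bootstrap_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(1000)
])
print(len(bootstrap_means))
```

Because values are drawn with replacement, a resample can contain some original values more than once and miss others entirely, which is what makes each resample slightly different.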

## 3.b Confidence Interval

In the case of our lab, after bootstrapping is done, we want to find out the confidence interval, 