## The Jupyter Notebook

The Jupyter Notebook (formerly the IPython Notebook) is an interactive environment specifically designed with scientific computing in mind. If you've used Mathematica before, it might look familiar. It lets you not only run code, but also write and organize it in an intuitive, appealing manner, and document it as you go along -- in fact, this document was created entirely as a Jupyter notebook!

The Notebook is divided into Cells, which can be run independently. Each cell's output will be displayed below it. Just enter some Python code into a cell, and either choose *Cell -> Run* from the menu at the top of the page, press the 'Play'-looking button, or press Shift+Enter on your keyboard.

For example:

In [1]:
print("Hello world")

Hello world


In [2]:
1 + 1

2

In [3]:
# Cells can also contain multiple lines of code.
# Note that anything following a # sign (like this line) is a comment.
1 + 1 # You can also comment specific lines of code, like this.
# Comments are a good way to document as you go.
a = 2 # Here we assign the number 2 to variable a, for example.
print(a) # Print the value of variable a
a = a + 3 # Now add 3 to a.
print(a) # Print out a's new value.

2
5


Working with cells lets you edit and test blocks of code as you go, instead of needing to either run your code one line at a time or all at once. As you'll find, one of the keys to data analysis is constant tinkering -- both to figure out what you want to do with your data, and to get your code doing what you want it to do.

Cells can also contain plain text, or [Markdown](http://daringfireball.net/projects/markdown/)-formatted text. This cell is a Markdown cell, for example. This lets you use the Notebook as your "lab notebook", adding explanations or commentary for yourself or for others as you go. To turn a cell from a code cell to a text or Markdown one, choose the appropriate format from the *Cell -> Cell Type* menu at the top of the page.

To edit a text cell once you've created it, just double-click on the text.

## Python Basics

Let's quickly review some of the basics of the Python programming language. If you already have had some experience with Python, this will just be a quick refresher. If you're brand-new to Python or to coding in general, this will only cover a subset of what you can do with it.

### Variables

A variable is just a chunk of data you've given a name, so that you can refer to it later. A variable might refer to a single number or some text, or to a massive table with many rows and columns. 

For the most part, you don't need to declare a variable in advance the way you do in many other languages; you just start using it. For example:

In [None]:
a = 5 # Create a variable called 'a' and assign the number 5 to it.

In [None]:
print(a) # Output the value of a

Notice that variables carry over from cell to cell in the Notebook. Even though you're running the cells one at a time, all the data created in the notebook you're using is being held in memory. Variables don't carry over between notebooks, or between sessions.

In [None]:
print(a*2) # Multiply a by 2, and output the result 

In [None]:
b = a + 5 # Create a new variable 'b', and set it to a+5.

In [None]:
a = 7 # Change a's value to 7.

*Do you think this will affect the value of **b**?*

In [None]:
print(a, b) # Output the values of variables a and b

*Nope. **b**'s value has been set, and doesn't depend on **a** anymore.*

Notice the comma in the print command. You can combine multiple things to print on the same line, separated by a comma.

Variables in Python are **dynamically typed**, meaning that the type of data a variable stores isn't fixed. So you can assign a number to a variable one line, and a string (the computer-sciency word for text) to it the next. For example:

In [None]:
print(a)
a = "Hello, Python!" 
print(a)

Note that the old value of **a** is gone now. Just because you *can* use the same variable name to refer to different data doesn't mean that you should. It's generally a good idea to have each variable serve one particular purpose in your code.

You can't use a variable before you've assigned something to it. Since the computer doesn't know what kind of data your variable is going to hold, it can't prepare a default value for it. If you get this wrong, the computer will usually tell you. For example:

In [None]:
print(c) # We haven't assigned anything to c yet.

This is an error. Errors are annoying, but they're your friends: they let you know that you probably made a mistake, and hopefully will help you fix it. You'll probably see these a lot. Don't worry, you haven't broken anything. 

The error above, for example, is a **NameError**, and lets you know that you tried to use a variable that wasn't defined.

Naturally, you can also use variables together. For example:

In [None]:
a = 7
b = 2
print("a = ", a)
print("b = ", b)
print("a + b = ", a+b)
print("a - b = ", a-b)
print("a * b = ", a*b)
print("a / b = ", a/b)

Python (like most programming languages) doesn't treat all numbers the same. There are basically two types of numbers, as far as Python is concerned: whole numbers (integers, or ``int``s) and numbers with values after the decimal point (floating-point numbers, or just ``float``s). You can see the ``type`` of a variable using the ``type`` command.

When you divide an integer by an integer, Python automatically converts the result to a float, even if the result is a whole number.

In [None]:
# These are ints
a = 10
b = 5

In [None]:
type(a)

In [None]:
type(a/b)
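
If you want a whole-number result instead, Python also has a floor-division operator, ``//``, and a remainder operator, ``%``. The cell below is a minimal sketch of the difference; the variable names ``true_quotient`` and ``floor_quotient`` are made up for illustration.

```python
# Illustrative values; any two ints would do.
a = 10
b = 5

true_quotient = a / b     # true division: the result is a float
floor_quotient = a // b   # floor division: the result stays an int for int inputs

print(true_quotient, type(true_quotient))
print(floor_quotient, type(floor_quotient))

# With numbers that don't divide evenly, // rounds down and % gives the remainder.
print(7 // 2, 7 % 2)
```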

Another type of variable is a **string**, or ``str``, which is one or more text characters. *"Hello world"* is a string, for example. So is *"7"*. By default, Python doesn't differentiate between those two examples: anything in quotation marks is just some text. For example:

In [None]:
a = "7"
a / 2

However, you can convert a string to an int or float using, appropriately enough, the ``int`` or ``float`` commands. These will try to convert the string to a number, or raise an error if they don't know how. Converting values from type to type is called **casting**.

In [None]:
int(a) / 2

In [None]:
int("Hi")

Python figured out how to convert the character "7" to a number, but not the characters *"Hi"*.

You can also convert values the other way. For example:

In [None]:
str(7.5000)

**A quick note on decimal and thousands separators:** Python follows the American convention of using a period (.) for the beginning of the fractional portion of the number. That means that:

In [None]:
float("1.25")

works, but 

In [None]:
float("1,25")

does not.
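
If your data does use a comma as the decimal separator, one possible workaround is to swap the comma for a period before casting. This is only a sketch: the variable names are made up, and it assumes the string contains no thousands separators.

```python
european_number = "1,25"  # hypothetical input that uses a decimal comma

# Replace the comma with a period, then cast as usual.
# Assumption: the string has no thousands separators.
converted = float(european_number.replace(",", "."))
print(converted)
```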

## From datum to data: Lists and Dictionaries

This is a course on data analysis, and most of the time the data you'll want to analyze will be more than just one or two numbers. If we're dealing with 100 (or 100,000) values, we don't want to create separate variables for each of them. There are two standard data types which can store multiple values together: **lists** (similar to other languages' arrays), and **dictionaries** (hash maps or hash tables, in other languages).

### Lists

Lists simply store a set of values in order, like this:

In [None]:
some_numbers = [1, 5, 4, 3, 0] # Some meaningless numbers.

You can access specific values within the list using square brackets. Note that list positions start counting from **0**, so if your list is 5 items long (like the one above), the positions run from 0 to 4. So:

In [None]:
print("First item in the list:", some_numbers[0])
print("Third item in the list:", some_numbers[2])
print("Last item in the list:", some_numbers[4])

If you try to access a position at or beyond the length of the list, you'll get an error (an **IndexError**):

In [None]:
print(some_numbers[10])

You can also create a new list by subsetting another list, using the colon **:** symbol. Think of it as taking values from the position before the colon up to, but not including, the position after it.

In [None]:
first_list = [1, 4, 5, 7, 12, 15]
print(first_list)

In [None]:
second_list = first_list[2:6]
print(second_list)

If you leave out the number before or after the colon, it indicates 'from the beginning' or 'to the end', respectively. So:

In [None]:
print(first_list[:4])

In [None]:
print(first_list[4:])

You can also use negative numbers as list positions, which indicate counting from the **end** of the list. -1 is the index for the *last* element in a list, for example:

In [None]:
print(first_list[-1])

Similarly, if you wanted to get the last two elements of a list, you could write:

In [None]:
print(first_list[-2:])
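
To pull the slicing rules together, here is a short recap sketch. It reuses ``first_list`` from above; the other variable names are only illustrative.

```python
first_list = [1, 4, 5, 7, 12, 15]

middle = first_list[1:3]        # positions 1 and 2; position 3 is not included
print(middle)

last_three = first_list[-3:]    # from the third-to-last item to the end
print(last_three)

all_but_last = first_list[:-1]  # from the beginning up to, but not including, the last item
print(all_but_last)

print(first_list)               # slicing builds a new list; the original is unchanged
```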

Lists can store data of all different types together. For example, the list below has an integer, a float and a string:

In [None]:
mixed_list = [1, 1.0, "One"]
print(mixed_list)

Lists can even store other lists. This will be important later, since it's how we can put together multi-dimensional objects like tables and matrices.

In [None]:
first_row = [1, 0]
second_row = [0, 1]
table = [first_row, second_row]

print("Position (1,0)", table[1][0])
print("Position (0,1)", table[0][1])

Note that the names 'row' and 'table' are arbitrary -- all **table** really is, is a list of lists. When we write table[1] we are getting the second entry in the list, which in this case just happens to be a list as well. So table[1][1]  just means 'The second entry in the second entry in the list **table**'.

We can also create these lists directly, and nest them as deep as we want. Below, I create a 3 X 2 X 2 table:

In [None]:
# Notice that we can put line-breaks when they're enclosed in brackets
threedee = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]

print(threedee[1][1][1])

We can add items to the end of a list using the **.append(...)** operation, like this:

In [None]:
my_list = [1, 2, 3]
my_list.append(4)
print(my_list)

Notice the syntax *append* uses: variable dot operation. The operation following the dot is called a **method**. We'll see this pretty often in Python (and elsewhere). It indicates an operation (method) associated with a particular variable of a particular type. In this case, all list variables have an append method associated with them. 

(Advanced note: technically, lists (like almost everything else in Python) are objects, with append as one of their methods. If you don't know what that means, don't worry about it.)

We can count the number of values in a list using **len**, like this:

In [None]:
print(len(my_list))

Pop quiz: What do you think the result of  *len(threedee)* will be?

In [None]:
print(len(threedee))

As far as Python is concerned, **threedee** has only 3 values; each of those values happens to be another list. So:

In [None]:
print(len(threedee[0]))

In [None]:
print(len(threedee[0][0]))

In [None]:
print(len(threedee[0][0][0])) # This raises a TypeError: threedee[0][0][0] is the integer 1, and integers have no length

### Dictionaries

Dictionaries are a type of data structure that associates unique keys with values. Values can be any type of data, but keys must be immutable, meaning not subject to change. A number or a string can be a key, for example, but a list cannot.

Dictionaries are declared with {curly braces}, but are accessed with [square brackets], like this:

In [None]:
number_names = {"One": 1, "Two": 2} # The string 'One' is the key, and integer 1 the value.
print(number_names["One"])
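
To see the restriction on keys in action, the sketch below tries a number, a string and a list as keys. The dictionary name ``examples`` is made up, and the ``try`` / ``except`` block (not otherwise covered here) is used only so that the cell keeps running after the error.

```python
examples = {}  # hypothetical dictionary, just for this demonstration

examples[1] = "an int key"      # numbers are immutable, so this works
examples["one"] = "a str key"   # strings are immutable too

try:
    examples[[1, 2]] = "a list key"  # lists are mutable, so Python refuses
except TypeError as error:
    print("TypeError:", error)

print(examples)  # only the first two entries were stored
```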

The traditional example used to explain dictionaries is a telephone book, mapping each unique name to a number:

In [None]:
phone_book = {} # Let the computer know that phone_book is an empty dictionary

phone_book["Alice"] = "555-123-4567"
phone_book["Bob"] = "555-987-6543"

Unlike with lists, we don't need a special command to add data. If we assign a value to a new key, it's automatically added to the dictionary. If we assign a new value to an existing key, it overwrites it. So:

In [None]:
print(phone_book)
phone_book["Carol"] = "555-314-1519"
print(phone_book)
phone_book["Carol"] = "555-271-8281"
print(phone_book)

Notice the order in which the dictionary items (key-value pairs) are displayed. In older versions of Python, dictionaries were inherently unordered: items weren't displayed in any particular order, not even necessarily the order in which we added them.

(**Note:** from Python 3.7 onward, dictionaries are guaranteed to preserve the order in which keys were added; CPython 3.6 already behaved this way as an implementation detail. However, it's best not to rely on dictionary ordering if you ever expect to need to work with an earlier version of Python).

We can get all the keys and all the values in a dictionary using **.keys()** and **.values()** , like this:

In [None]:
print(phone_book.keys())
print(phone_book.values())

Notice that these methods return special types, ``dict_keys`` and ``dict_values``, which need to be converted to lists before they can be indexed by position the way we did with lists. Indexing them directly raises an error:

In [None]:
print(phone_book.keys()[1])

In [None]:
dict_keys = list(phone_book.keys())
print(dict_keys[1])

Okay, now we can store values individually, or inside lists and dictionaries. So now what?

## Moving through your code: loops and conditions 

### For loops
Of course, we don't usually want to operate on values one at a time; we're analyzing data, so we want to operate on multiple values together. We also want to write as little code as possible, both to make our lives easier and reduce the chance for typos and bugs. The basic way to do this is a **for loop**.

The idea of a for loop is to take a block of code, and repeat (iterate) it for each value in a list, in order. The simplest example would look like this:

In [None]:
some_numbers = [2, 5, 7, 1, 101, 9, 9]

for x in some_numbers:
    print(x)

This is telling Python to assign each value in the list to the variable **x**, in order, and then print the value of **x**. The variable we assign in the loop is just an ordinary variable, and can have any name we want. 

Suppose we wanted to double each value and print it. We could do something like:

In [None]:
my_var = "This will be overwritten"

for my_var in some_numbers:
    # These lines will be executed for each value in the some_numbers list
    a = my_var * 2
    print(a)

print("Current value of my_var", my_var) # This line will be executed only once

### Sidebar: Indentation

Indentation is very important, and is part of what makes Python different from many other languages. Python treats indents / whitespace the way many other languages treat curly braces. Notice how the lines below the **for** command are indented? That tells Python that those lines are part of the for-loop's block. It will repeat those lines, and only then go to the next line at the previous level of indentation.

Your indentations need to be consistent. You can use tabs or a certain number of spaces, but you need to use the same throughout your program. The convention is to offset blocks with four spaces. The Jupyter Notebook, and most good programming text editors, will let you choose to automatically insert four spaces when you hit the TAB key. Jupyter Notebook does this by default.

You will probably get this wrong at some point. Python will generally tell you, like this:

In [None]:
for x in some_numbers:
print(x)

But sometimes the mistake will be something more subtle, like this:

In [None]:
for x in some_numbers:
    a = x * 2
print(a)

The code is valid, it just probably won't do what you had wanted it to.
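
To make the difference concrete, the sketch below runs both versions and collects what each one produces. The shortened list and the names ``inside`` and ``after`` are only for illustration.

```python
some_numbers = [2, 5, 7]  # a shortened, illustrative list

inside = []
for x in some_numbers:
    a = x * 2
    inside.append(a)   # indented: part of the loop, runs once per item

after = []
for x in some_numbers:
    a = x * 2
after.append(a)        # not indented: runs once, after the loop has finished

print(inside)  # every doubled value
print(after)   # only the last doubled value
```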

### End Sidebar

Okay, now let's see how to use a for loop to compute the mean (average) of a list of numbers. To do that, we need to sum all the values in the list, then divide by the list's length.

In [None]:
some_numbers = [1, 2, 10, 99, 37, 45, 62, 78, 19]

total = 0 # Prepare a new variable to store the sum
# Sum up the numbers
for x in some_numbers:
    total += x # The += sign means 'add value and assign', like total = total + x
mean = total / len(some_numbers)
print("The mean of the list is ", mean)
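
As a cross-check, Python's built-in ``sum`` function adds up a list in one step. The sketch below compares it with the loop; ``loop_mean`` and ``builtin_mean`` are illustrative names.

```python
some_numbers = [1, 2, 10, 99, 37, 45, 62, 78, 19]

# The loop version, as above.
total = 0
for x in some_numbers:
    total += x
loop_mean = total / len(some_numbers)

# The same calculation using the built-in sum function.
builtin_mean = sum(some_numbers) / len(some_numbers)

print(loop_mean, builtin_mean)
```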

We can do something similar with dictionaries, too. Suppose we have a dictionary associating several stock market shares with their current value, and we wanted to find the average. It would look something like this:

In [None]:
prices = {"AAPL": 454.45, "MSFT": 32.70, "AMZN": 297.26, "ORCL": 32.92}

total = 0.0
for stock in prices: # When we use a for loop on a dictionary, it iterates over the *keys*
    total += prices[stock]
mean = total / len(prices)
print("The average stock price is", mean)

Remember how we can get the dictionary keys and values? If we don't want to look up the dictionary value every time, we can write:

In [None]:
prices = {"AAPL": 454.45, "MSFT": 32.70, "AMZN": 297.26, "ORCL": 32.92}

total = 0.0
for price in prices.values(): # Iterate over the dictionary's values directly
    total += price
mean = total / len(prices)
print("The average stock price is", mean)
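
If you need both the key and the value in each iteration, dictionaries also have an **.items()** method, which yields key-value pairs. A minimal sketch, reusing the ``prices`` dictionary from above (the list ``lines`` is a made-up name):

```python
prices = {"AAPL": 454.45, "MSFT": 32.70, "AMZN": 297.26, "ORCL": 32.92}

lines = []
for stock, price in prices.items():  # each item is a (key, value) pair
    lines.append(stock + ": " + str(price))
    print(stock, price)
```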

### Conditions: If-else statements

Another important thing we want to do is to have some code execute only if a certain condition is true. We do this using the **if** command, followed by a condition for the system to check. Everything indented under the **if** command will execute only if the condition is true.

In [None]:
if 5 > 0:
    print("True!")
    

We can follow an **if**  with an **else**, giving code to execute if the condition is false.

In [None]:
a = 5
b = 4
if a < b:
    print("a smaller than b") # this code won't get executed here.
else:
    print("b smaller than or equal to a") # This code will.

There's one last command that can go along with an **if**: **elif**, for else if. This means 'if the previous condition is false AND this condition is true'. For example:

In [None]:
a = 5
b = 5

if a < b:
    print("a smaller than b")
elif a > b:
    print("a greater than b")
else:
    print("a and b are equal!")

If we have a few conditions we want to check together, we can combine them with the keywords **and**, meaning only if all the conditions are true; and **or**, meaning if any of the conditions are true.

In [None]:
if 5 > 0 and 2 > 5:
    print("Never gonna happen")
else:
    print("Here")

In [None]:
if 5 > 0 or 2 > 5:
    print("Now this will print")
else:
    print("Nope.")

We can also check to see whether a certain item appears in a list or dictionary or not, using the **in** keyword. For example:

In [None]:
some_numbers = [1, 5, -2, 18, 6.312375]

if 2 in some_numbers:
    print("Yes")
else:
    print("No.")
    
if 5 in some_numbers:
    print("And yes.")

In [None]:
some_values = {"One": 1, "Two": 2, "Three": 3}

if "One" in some_values:
    print("Yes.")

if 1 in some_values: 
    print("And yes.")
else:
    print("But no")

Notice that with dictionaries, the **in** keyword checks only the keys, not the values. 
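
If you do want to check the values, one option is to combine **in** with **.values()**. A short sketch, with ``key_found`` and ``value_found`` as illustrative names:

```python
some_values = {"One": 1, "Two": 2, "Three": 3}

key_found = 1 in some_values             # checks the keys only
value_found = 1 in some_values.values()  # checks the values instead

print(key_found)
print(value_found)
```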

#### Putting it all together
Suppose we wanted to take the stock prices above, and categorize them into above-average and below-average. Here's one way we could do it:

In [None]:
prices = {"AAPL": 454.45, "MSFT": 32.70, "AMZN": 297.26, "ORCL": 32.92}

# Finding the average:
total = 0.0
for stock in prices: # When we use a for loop on a dictionary, it iterates over the *keys*
    total += prices[stock]
mean = total / len(prices)

# Categorizing them:
# Create a new dictionary with the categories as keys, and empty lists as values.
stock_types = {"Above average": [], "Below average": []}
for stock in prices:
    if prices[stock] > mean:
        stock_types["Above average"].append(stock)
    else:
        stock_types["Below average"].append(stock)

print(stock_types)

## Functions

Functions are blocks of code that you define that do something specific with certain parameters you give them. They often (but not always) take inputs (called arguments) and return some output.

Functions are defined using the **def** keyword, with an indented block of code underneath. 

Here's a basic function that takes no inputs and doesn't return a value, and just prints "Hello, world":

In [None]:
def hello_world():
    print("Hello, world!")

Once we've defined a function, we can call it like this:

In [None]:
hello_world()

Of course, a function like that isn't very useful. Let's give it some arguments, and make it more useful. Arguments go inside the parentheses after the function name; they are variables which have values assigned to them when the function is called.

In [None]:
def addition(a, b):
    '''
    Add numbers a, b and print the results.
    
    You can add documentation to a function using three single- or double-quotes
    at the top of the function, like this. This is good practice to document
    what your function does.
    '''
    print(a + b)

In [None]:
addition(1, 2)

In [None]:
addition(5, 6)

Note that the variables **a** and **b** above are only defined inside the function. Variables inside and outside the function are different, even if they have the same name.

In [None]:
a = 5
b = 10
addition(1, 1)
print(a, b)

Note that our addition function is just printing the result, but not saving it anywhere. To store the result, we need to have the function **return** a value.

In [None]:
c = addition(2, 3)
print(c)

The function prints the sum but doesn't return anything, so there isn't any value to assign to **c**; Python sets it to the special value ``None`` instead. We could rewrite the function like this:

In [None]:
def addition(a, b):
    return a + b

In [None]:
c = addition(2, 3)
print(c)
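
The sketch below puts the two versions side by side, under the made-up names ``print_sum`` and ``return_sum``, to show what each one hands back to the caller.

```python
def print_sum(a, b):
    print(a + b)    # displays the result, but returns nothing

def return_sum(a, b):
    return a + b    # hands the result back to the caller

printed = print_sum(2, 3)    # the sum is printed here, as a side effect
returned = return_sum(2, 3)  # nothing is printed here

print(printed is None)  # a function without a return statement gives back None
print(returned)
```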

Obviously, you don't need to write your own function just to handle adding two numbers. You'll want to use them when you have more complicated code that you think you'll need to run again and again in different places in your program. For example, suppose you expect to need to find the average for many lists -- a pretty standard thing. You could write a function to do it, like this:

In [None]:
def find_mean(num_list):
    '''
    Find the average of num_list
    '''
    total = 0
    for x in num_list:
        total += x
    return total / len(num_list)

(Note: the ``range`` command lets you loop over a sequence of whole numbers from a specified start up to, but not including, a specified end. With a single argument, as below, it starts from 0.)

In [None]:
print(find_mean(range(10))) # Find the average of numbers from 0 to 9
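
To see exactly which numbers ``range`` produces, you can convert it to a list. A small sketch with illustrative variable names:

```python
zero_to_four = list(range(5))     # one argument: start at 0, stop before 5
two_to_five = list(range(2, 6))   # two arguments: start at 2, stop before 6
evens = list(range(0, 10, 2))     # an optional third argument sets the step size

print(zero_to_four)
print(two_to_five)
print(evens)
```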

In [None]:
some_numbers = [100, 202, 303, 404, 505, 606, 707, 909]
print(find_mean(some_numbers))

Now let's rewrite our stock categorizer using the find_mean function we've defined:

In [None]:
prices = {"AAPL": 454.45, "MSFT": 32.70, "AMZN": 297.26, "ORCL": 32.92}

# Categorizing the stocks:
# Create a new dictionary with the categories as keys, and empty lists as values.
stock_types = {"Above average": [], "Below average": []}
for stock in prices:
    if prices[stock] > find_mean(prices.values()):
        stock_types["Above average"].append(stock)
    else:
        stock_types["Below average"].append(stock)

print(stock_types)

Notice that right now, the find_mean function is being called and executed for every iteration of the for loop. Really, we only need to compute it once, and store the results in a variable. This isn't a big deal here with only 4 iterations, but when you get into the thousands or millions of iterations, it starts to add up:

In [None]:
prices = {"AAPL": 454.45, "MSFT": 32.70, "AMZN": 297.26, "ORCL": 32.92}

mean_price = find_mean(prices.values()) # Calculate the mean in advance
# Categorizing the stocks:
# Create a new dictionary with the categories as keys, and empty lists as values.
stock_types = {"Above average": [], "Below average": []}
for stock in prices:
    if prices[stock] > mean_price:
        stock_types["Above average"].append(stock)
    else:
        stock_types["Below average"].append(stock)

print(stock_types)