![calnerds](http://calnerds.berkeley.edu/css/images/logo.jpg)

# Advanced Python Bootcamp


## Overview

In the first Python workshop, we learned how to use Python as a calculator and how to assign variables. We also covered the fundamental datatypes (numbers, strings, lists, and booleans), numpy arrays, pyplot, functions, loops, conditionals, and recursion. The notebook is available on my GitHub if you want to review it in depth: https://github.com/dawnia/intropython09232017

Today we'll build on what we learned as well as introduce completely new material, with a focus on methods in data science.

An overview of today's topics is below:

1. Dictionaries
2. List comprehensions
3. Object-Oriented Programming
4. Monte Carlo methods
5. Reading, analyzing, plotting, and writing data from and to files
6. Bonus: Regular expressions

## Dictionaries

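A dictionary is a collection of key-value pairs. Unlike the elements of a list, pairs in a dictionary don't have numeric indices; to get a value we index by its key. (Since Python 3.7 a dictionary remembers the order in which pairs were inserted, but we still look values up by key, not by position.)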

A dictionary looks something like:
<pre>
dict = {key_1: value_1, key_2: value_2, key_3: value_3}
</pre>
It is enclosed in curly brackets, with each key connected to its value by a colon and the pairs separated by commas.

Here we add a new pair to the dictionary, get the value associated with the first key, delete the first pair, and get the number of pairs in the entire dictionary:
<pre>
dict[key_4] = value_4     # we modify existing values in the same way
first_val = dict[key_1]   # first_val = value_1
del dict[key_1]
len(dict)
</pre>

To loop through a dictionary's keys, we use a for loop:
<pre>
for key in dict:
</pre>
and to loop through both keys and values,
<pre>
for key, value in dict.items():
</pre>
Remember, you can call key and value whatever variable names you want.

Here's an example where we use a Python dictionary to create a very small literal English to German dictionary.

In [None]:
engToGer = {'thank you':'danke', 'please':'bitte', 'red':'rot', 'correct':'richtig', 'tongue twister':'zungenbrecher'}

What's the German word for 'red'? We just index into the dictionary with the key 'red'.

Oops, I spelled the word for "correct" wrong. Let's change that value:
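As a minimal sketch of these two steps, the cell below builds a smaller standalone dictionary so it runs on its own. The name `engToGer_demo` and the misspelled starting value `'rictig'` are hypothetical, introduced only to show a lookup followed by an update.

```python
# Smaller standalone dictionary; 'rictig' is a deliberate, hypothetical typo
# used only to demonstrate how a value is corrected.
engToGer_demo = {'thank you': 'danke', 'please': 'bitte', 'red': 'rot', 'correct': 'rictig'}

# Look up a value by indexing with its key
print(engToGer_demo['red'])

# Assigning to an existing key overwrites its value
engToGer_demo['correct'] = 'richtig'
print(engToGer_demo['correct'])
```

Assigning to a key that is not yet in the dictionary would add a new pair instead of overwriting one.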

### Exercise: Dictionaries

Make a new code box below with the __+__ symbol in the upper left of your notebook, and code the following:

1. Create a dictionary with names as keys and favorite boba places in Berkeley as values. Use your own name and the names of two partners in class.
2. Loop through the dictionary and print out all the names.
3. Change your own favorite place to Crossroads.

## List Comprehensions

List comprehensions are a quick way of using one list (or any iterable) to create another. They allow us to apply a for loop on every element of a given list to create a new list. The basic list comprehension looks like this:

In [None]:
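# Sample list of names (an assumption, so that this cell runs on its own)
presidents = ['Washington', 'Adams', 'Jefferson', 'Madison']
# Makes a list of each initial in presidents
initials = [p[0] for p in presidents]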

print(initials)

If we want, we can add an if condition to filter elements, and nest for loops when we're working with a matrix-like structure.
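Here is a minimal sketch of those variations. The 2x3 `matrix` is a made-up example, and the variable names are only illustrative.

```python
# Hypothetical 2x3 matrix stored as a list of rows
matrix = [[1, 2, 3],
          [4, 5, 6]]

# Nested for loops: visit every number in every row (this flattens the matrix)
flat = [num for row in matrix for num in row]

# Add a condition: keep only the odd numbers
odds = [num for row in matrix for num in row if num % 2 == 1]

# A conditional expression chooses between two values for each element
labels = ['even' if num % 2 == 0 else 'odd' for num in flat]

print(flat)
print(odds)
print(labels)
```

Note that the nested loops read left to right in the same order as they would be written as ordinary for loops.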

Here's another example. Below we first create evens_squared out of old_things with an ordinary for loop and if statement, squaring only the even numbers.

In [None]:
old_things = [1, 2, 3, 4, 5, 6]

evens_squared = []
for item in old_things:
    if item % 2 == 0:   # if item is even
        evens_squared.append(item**2)

print(evens_squared)        

We can rewrite the for loop above as the following list comprehension, creating better_evens_squared in only one line.

In [None]:
# better_evens_squared is populated with the square of each even item in old_things
better_evens_squared = [item**2 for item in old_things if item % 2 == 0]   

print(better_evens_squared)

### Mini-Exercise

Write a list comprehension that takes in a list <code>toFilter</code> and creates a new list containing only objects that are not strings from the original list. To check the type of an object, use <code>type(someThing)</code>. The type of a string is <code>str</code>.

In [None]:
toFilter = ['time', 1, 4, 12, 16, 'end']

newList = # Your code here

print(newList)

### Exercise: Household Income Deviation in Alameda County

You are given the dictionary incomes, containing cities in Alameda County as keys and their median family incomes (as of 2011) as values. We want to find the standard deviation to measure how spread out the cities' family incomes are.

You may need:
* new variables (can initially set to 0) to store the mean and variance
* for loop over dictionary using .items()
* len() to find number of key-value pairs in dictionary
* for loop over dictionary inside a list comprehension
* math.sqrt()

In [None]:
import math   # for math.sqrt()
incomes = {'Alameda':93349, 'Albany':87500, 'Berkeley':102976, 'Dublin':121380, 'Emeryville':99954, 
           'Fremont':109853, 'Hayward':69044, 'Livermore':108406, 'Newark':84244, 'Oakland':58237, 
           'Piedmont':221875, 'Pleasanton':136464, 'San Leandro':72080, 'Union City':91176}

Broken down into smaller problems:

1) Find the mean (sum of values divided by number of values) median family income of all the cities. 

2) In one line of code, make a new list containing each city's difference (or distance) from the mean.

3) Calculate the variance: the mean of each difference squared.

4) Print the standard deviation, which is the square root of the variance.
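Before applying these steps to incomes, it may help to see them on a small list. The values below are hypothetical and unrelated to the income data; they are chosen only so the arithmetic comes out cleanly.

```python
import math

# Hypothetical sample values, unrelated to the incomes data
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)                    # step 1: the mean
diffs = [v - mean for v in values]                  # step 2: differences from the mean
variance = sum(d**2 for d in diffs) / len(diffs)    # step 3: mean of the squared differences
std_dev = math.sqrt(variance)                       # step 4: square root of the variance

print(mean, variance, std_dev)
```

For the exercise, the main difference is that the numbers live in a dictionary's values instead of a list.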

## Object-Oriented Programming

So far you know about Python's core data types: strings, numbers, lists, and dictionaries. In this section you will learn about the last major data structure, classes. Classes allow us to treat data like objects, just as in real life! 

Say we want to simulate planets in the Milky Way. How could we represent this kind of environment? A string, number, or list alone doesn't seem to be the right data type to use. We need to create a new class, which is a template for creating objects.

Below we create the class Planet. This will allow us to create objects, or instances, from the class, like Earth and Pluto. Classes have variables and methods, which we access via dot notation. You have been using dot notation! When we add an element to some list my_list with my_list.insert(), insert is a method specific to lists, and we are calling insert on the instance my_list. We can also use dot notation to access and modify variables.

There are two kinds of variables, instance and class. Instance variables are specific to each instance. So Earth and Pluto would each have their own values for the boolean hasLife. Class variables are shared by all instances of a class. We're defining all objects of Planet to be in the Milky Way, so galaxy and star will be the same for each planet.

A method is an action (function) that all instances of a class may perform. We define methods for objects in a class by writing def statements inside the class. Each method must accept at least one parameter, conventionally named self. This is a reference to the instance that is calling the method and allows us to refer to that instance's variables and methods. In planetInfo() we print the variables of the Planet instance planetInfo() was called on, so we use self.name to represent that planet's name, etc.

The method that initializes objects has a special name in Python, \__init\__ (two underscores on each side of the word "init"), and is called the constructor for the class. Like all methods, it always takes in the parameter self, which is bound to the new instance, and can take in other parameters to initialize instance variables. We set instance variables using self.someInstanceVariable.
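To make the difference between class and instance variables concrete, here is a minimal sketch. The class `Moon` and its attributes are hypothetical and separate from the Planet class used in this section.

```python
# Hypothetical class used only to contrast the two kinds of variables
class Moon:
    orbits_star = "Sun"          # class variable: one value shared by all instances

    def __init__(self, name):
        self.name = name         # instance variable: each instance gets its own

luna = Moon("Luna")
titan = Moon("Titan")

print(luna.name, titan.name)                  # different for each instance
print(luna.orbits_star, titan.orbits_star)    # the same for both

# Changing the class variable on the class is visible from every instance
Moon.orbits_star = "Sol"
print(luna.orbits_star, titan.orbits_star)
```

Assigning `luna.orbits_star = ...` on one instance would instead create a new instance variable that hides the class variable for that instance only.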

For more information, refer to: http://composingprograms.com/pages/25-object-oriented-programming.html 

In [None]:
class Planet:
    # Class variables: Shared by all instances #
    galaxy = "Milky Way"
    star = "Sun"
    
    # Instance variables: Specific to each instance, created in __init__ #
    #   name        the planet's name
    #   hoursInDay  rotation period: length of day in hours
    #   moons       list of moons
    #   hasLife     boolean
    # (Assigning them here in the class body would make them class variables,
    #  and a shared list like moons = [] would be shared by every planet.)
    def __init__(self, n, day, moonList, life):  # can call these parameters anything
        self.name = n
        self.hoursInDay = day
        self.moons = moonList
        self.hasLife = life
    
    # Write methods here #
    def planetInfo(self):
        moonString = ", ".join(self.moons)   # format moons list
        print("Planet {} in the galaxy {} has moons {}".format(self.name, self.galaxy, moonString))

earth = Planet("Earth", 24, ["Moon"], True)   # create instance earth
print("Life on", earth.name, earth.hasLife)
earth.planetInfo()

### Exercise: Planets

Your tasks:

1. Add information from the instance variable hoursInDay and the class variable star to the print statement in planetInfo.
2. Create a new instance representing Pluto (153.3 hours in a day and moons Charon, Nix, Hydra, Kerberos, and Styx).
3. Use the planetInfo method to print Pluto's information. 
4. Define a new method that updates hoursInDay, adding 1 to its current value.

# Break 

![bear](https://raw.githubusercontent.com/dawnia/python-workshops/master/bear.png)

## Monte Carlo Methods

Monte Carlo methods are a class of computational techniques that use random sampling and many, many trials to solve problems. They are useful for modeling systems in particle physics, risk analysis in business and engineering, computational biology, mathematical optimization, and artificial intelligence.

### Exercise: Use Monte Carlo to Estimate Pi

In this exercise, you will use a Monte Carlo method to estimate the value of pi, ~3.14. 

The idea is to approximate an area by counting dots. As an example, consider the image below. We want to determine the area of the smaller square compared to the bigger one. The ratio between the two areas is 1/4, so if we randomly distribute 1000 dots then we expect about 250 dots to lie within the smaller square.

![pi1](https://raw.githubusercontent.com/dawnia/python-workshops/master/squares.png)
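The dot-counting idea for the two squares can be sketched in a few lines. This sketch assumes the bigger square is the unit square and the smaller one occupies its lower-left quarter (x and y both below 0.5); the placement in the figure may differ, but only the area ratio matters. The seed and the number of dots are arbitrary choices.

```python
import random

random.seed(0)   # fixed seed so the run is repeatable

n = 100000       # number of random dots
inside = 0       # dots that land in the smaller square

for _ in range(n):
    x = random.random()   # random.random() returns a float in [0, 1)
    y = random.random()
    if x < 0.5 and y < 0.5:
        inside += 1

ratio = inside / n
print(ratio)     # expected to be close to 1/4
```

With more dots the ratio tends to settle closer to 1/4, which is the same behavior the pi estimate relies on.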

We know that the area of a circle is pi * r^2 and that of a square is (2r)^2. The ratio of both areas is:
![eq1](https://s0.wp.com/latex.php?latex=%5Cfrac%7BA_%7Bcircle%7D%7D%7BA_%7Bsquare%7D%7D+%3D+%5Cfrac%7B%5Cpi+r%5E2%7D+%7B%282r%29%5E2%7D+%3D+%5Cfrac%7B%5Cpi+r%5E2%7D%7B4r%5E2%7D+%3D+%5Cfrac%7B%5Cpi%7D%7B4%7D+&bg=fafcff&fg=2a2a2a&s=2)
Solving for pi:
![eq2](https://s0.wp.com/latex.php?latex=%5Cpi+%3D+4+%2A+%5Cfrac%7BA_%7Bcircle%7D%7D%7BA_%7Bsquare%7D%7D+&bg=fafcff&fg=2a2a2a&s=2)
This means if we can calculate the ratio A_circle / A_square on the right hand side of the equation, we also get the value of pi. What we will do is approximate this ratio by choosing points (x,y) randomly in the square. 
![pi2](https://raw.githubusercontent.com/dawnia/python-workshops/master/inscribed.png)

(x,y) will either lie within or outside the circle. If we randomly pick points within this space, we can assume that the probability of randomly picking a point within the circle is proportional to the area of the circle. If we determine the ratio of the points inside the circle to the total number of points (points in the square), we can approximate a value for pi.

![pi3](https://raw.githubusercontent.com/dawnia/python-workshops/master/dots.png)

We will run our simulation over only the first quadrant, which does not change the ratio. We define the circle to have radius 1, so the quarter circle sits inside a square of side length 1 and we only generate x and y values between 0 and 1.

We want to do four things:
1. Randomly generate a point, represented by an x and y variable, n times. You can use random.random() to get a random number between 0 and 1.
2. Track the number of points that have fallen within the circle (use isInCircle()) as well as within the total (square's) area.
3. If the point lands within the circle, count it by increasing the variable circleCount by 1.
4. At the end, compare the number of points within the circle to the total number of points, using the equation above to find your estimated value of pi. Try different values of n (100, 1000, 100000, etc.) and see how your estimate changes.

In [None]:
import random

# Returns a boolean determining if the point lies within a circle of radius 1. #
def isInCircle(x,y):
    # a point is in the circle if x^2 + y^2 < radius^2 (here the radius is 1).
    # The comparison itself already evaluates to True or False.
    return x**2 + y**2 < 1

# Your code here    

## Files

Whether you're a social scientist, mathematician, or physicist, it's almost certain that you'll have to work with data sets stored in text files. Here we learn the fundamentals of reading data from different kinds of files so that we can later analyze or plot their information.

### Reading Data from Files

To read a file, we use open(), which returns something called a file object. File objects have their own pre-defined methods, such as read(), write(), and seek(). Here we're naming that file object infile. 

Below are three of many ways of reading the file data.txt into a string format. data.txt contains one line of comment and four rows of three columns of data. Each method is useful for different situations, depending on what you need from your data. Spend a minute reading through each method, thinking about what might be causing its print output. We'll see more examples of reading files soon, so don't worry if the code is difficult to follow.

In [None]:
# read the whole file at once #
with open('data.txt') as infile:
    contents = infile.read()
    print(contents)
print(contents.splitlines())   # .splitlines() breaks contents up by line into a list
    
# read each line at a time with a for loop #    
with open('data.txt') as infile:
    for line in infile:
        print(line)
        
# put each line into a list #
with open('data.txt') as infile:
    lines = infile.readlines()
    print(lines)        

After we open a file and read its contents, we should close it once we're done with it. The with statement closes the file for us automatically when its block ends, although we could also do this manually by calling open() and later close(). If we tried to read a file after closing it, either explicitly with close() or outside the with block, we would get an error.
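The closing behavior can be checked with a short sketch. It writes its own throwaway file (the name `closed_demo.txt` and its location in the system's temporary folder are assumptions) so that it does not depend on data.txt.

```python
import os
import tempfile

# Hypothetical throwaway file in the system's temporary folder
path = os.path.join(tempfile.gettempdir(), 'closed_demo.txt')

with open(path, 'w') as outfile:
    outfile.write('first line\nsecond line\n')

with open(path) as infile:
    first = infile.readline()   # fine: the file is open inside the with block

print(infile.closed)            # the with block closed the file when it ended

try:
    infile.read()               # reading a closed file raises a ValueError
except ValueError as err:
    print('Error:', err)
```

The variable infile still exists after the with block; it simply refers to a file object that can no longer be read.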

Note: The \n's in the last printout represent line breaks, which we could individually remove with .strip().

Let's focus on the second method, which reads each line individually. Say we want to get numerical data from the first column, pressure. We can use the string method split(), which breaks up a string at whitespace and returns a list of the segments. See split()'s behavior below:

In [None]:
some_string = "If life gives you lemons"
spl = some_string.split()
print(spl)
first_word = spl[0]
print(first_word)

In [None]:
with open('data.txt') as infile:
    pressures = []
    for line in infile:
        spl = line.split()        # split the line at whitespaces. str.split returns a list
        pressures.append(spl[0])  # append the first item to the output list

print(pressures)        

### Flash Exercise!

Replace the four lines in the with statement with a single-line list comprehension!

In [None]:
with open('data.txt') as infile:
    #pressures = ??? # YOUR CODE HERE

print(pressures)

We're not done yet. We only want lines with numbers, and moreover we have to convert them from strings to numbers. There are many ways to avoid the lines without numbers. We can simply skip the first line using readline(), as shown below. 

Notice that whenever we read a line, whether with read(), readline(), readlines(), or a for loop, once Python reads a line, it automatically moves to the next line. So after calling readline(), the next line is the second line in the file, and the for loop begins there.

In [None]:
with open('data.txt') as infile:
    pressures = []
    infile.readline()
    for line in infile:
        val = line.split()[0]
        pressures.append(float(val))   # cast string to float 

print(pressures)        

Above we cast a string to a float. "Casting" transforms one data type into another, if possible. A float is Python's type for numbers with a decimal point (floating-point numbers), as opposed to an int, which holds whole numbers only.
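A few casts can be sketched as follows; the values and variable names are made up for illustration. The last cast fails on purpose to show what "if possible" means.

```python
as_float = float("3.5")     # string -> float
as_int = int("42")          # string -> integer
int_to_float = float(7)     # integer -> float
as_text = str(2.5)          # float -> string

print(as_float, as_int, int_to_float, as_text)

# Casting fails when the text does not look like a number
try:
    float("pressure")
except ValueError as err:
    print("Cannot cast:", err)
```

This is why a header line such as the comment in data.txt has to be skipped before casting.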

However, what if we're not sure which lines don't have numbers? Maybe our data set is big, and we can't manually check each line ourselves. There are many ways to achieve this, and at the end of this notebook is one commonly used way: regular expressions, or regex. Check out this section if you have extra time, want to read ahead, or plan to do data science in the future!

### Word Counting

Let's look at a real-world example of what we can do with files that we read. pp.txt contains the book Pride and Prejudice by Jane Austen, downloaded from gutenberg.org:
https://www.gutenberg.org/files/1342/1342-0.txt

We import Counter from collections. Counter is a kind of dictionary made for counting: it stores elements as keys and their counts as values. Counts can be any integer, including zero or negative numbers. See the documentation for more information: https://docs.python.org/3.6/library/collections.html#collections.Counter
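As a small sketch of how a Counter behaves, the cell below counts the words of a made-up sentence. The sentence and the name `demo` are hypothetical.

```python
from collections import Counter

# Hypothetical sentence, split into a list of words
words = "the cat and the hat and the bat".split()

demo = Counter(words)        # count every element of the list in one step
print(demo['the'])           # number of times 'the' appears
print(demo['dog'])           # a missing key counts as 0 instead of raising an error
print(demo.most_common(2))   # the two most frequent words with their counts
```

Unlike an ordinary dictionary, looking up a key that was never counted does not raise a KeyError.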

We create ctr, a new Counter object, and total, which will count words (and any other characters or symbols bounded by whitespace).

In [None]:
from collections import Counter

ctr = Counter()
total = 0

Let's take a peek at the text. If you didn't catch this above, \n represents a line break.

In [None]:
with open('pp.txt') as infile:
    book = infile.read()

book[:150]   # see the first 150 characters

In [None]:
book[1000:1500]   # see some section inside the book

Let's read pp.txt and loop through each line. As a reminder, strip() removes whitespace at the beginning and end of a string, and split() splits up words by whitespace, returning a list. We use total to count the total number of words and track how many times each word occurs with Counter ctr. So if we come across the word "love" 42 times, ctr["love"] = 42.

### Interactive Exercises

Code the two lines with comments to update total and ctr.

In [None]:
with open('pp.txt') as infile:
    for line in infile:
        line_words = line.strip().split()
        for word in line_words:
            # add 1 to total
            # add 1 to that specific word's count

This is the total number of words in the book:

In [None]:
total

Which should equal:

In [None]:
sum(ctr.values())

Think about why this is true.

The length of ctr gives us the total number of unique words in the book.

In [None]:
len(ctr)

Let's use most_common, a method of Counter, to see the top 10 most-used words in the book:

In [None]:
ctr.most_common(10)

Pride and Prejudice follows the romance between protagonist Elizabeth and the wealthy, aloof bachelor Darcy. Let's see how many times their names are mentioned.

In [None]:
ctr['Elizabeth'] 

In [None]:
ctr['Darcy']

In [None]:
ctr['she'] + ctr['She']

In [None]:
ctr['he'] + ctr['He']

This is a potentially interesting statistic; Darcy's mention count is a little over half that of Elizabeth's. Those who know the book will remember that Elizabeth is in many more scenes than Darcy. The fact that Darcy's mention count is over half that of Elizabeth's suggests that he is mentioned in a lot of scenes in which he isn't present.

For a more rigorous analysis of gender in text, check out the Beauty and Beast Speech Analysis Jupyter Notebook. It compares male and female characters' speech frequency in the movies Beauty and the Beast and Toy Story. This notebook is more advanced and uses some additional imports, but we can still identify fundamental elements from this workshop, including: reading through lines, split(), and regular expressions. Also note how few lines it takes to plot data using matplotlib!

### Exercises, continued

Estimate the average number of characters per word. Hint: use book, which is a single string containing the whole book, and total, the number of words in the book.

Bonus exercise: Get rid of the Gutenberg ebook information at the beginning and end of the book when reading in the text file.

### Plotting Data from Files

Here we use the packages numpy and matplotlib to plot data from a text file in just a few lines of code. 

We want to plot information from sampledata.txt, a file containing two columns of numbers. We load the data in the file into a 2D array using numpy's <code>loadtxt()</code>, a fast reader for simply formatted files. We're doing this just to see what the data looks like.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('sampledata.txt')    # load and show data with numpy
print(data)

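Let's plot the two columns with matplotlib.pyplot. Older tutorials use pyplot's <code>plotfile</code> for this, but it was deprecated in Matplotlib 3.1 and removed in 3.3, so here we load the columns with numpy's <code>loadtxt()</code> and pass them to <code>plot()</code>.

One important argument of loadtxt() is delimiter, which defines the string used to separate values. By default, values are split at whitespace. Often text files separate values by commas or other symbols, so we would have to pass something else into this argument.

In [None]:
x, y = np.loadtxt('sampledata.txt', unpack=True)   # unpack=True returns one array per column
plt.plot(x, y)
plt.xlabel('x'); plt.ylabel('y')
plt.show()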
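To see the effect of a delimiter without needing an extra file, here is a minimal sketch using numpy's loadtxt, which accepts a delimiter argument. The comma-separated values are made up, and io.StringIO stands in for a file on disk.

```python
import io
import numpy as np

# Hypothetical comma-separated data held in memory instead of a file on disk
csv_text = io.StringIO("1.0,2.0\n3.0,4.0\n5.0,6.0\n")

# delimiter=',' tells loadtxt to split each line at commas rather than whitespace
csv_data = np.loadtxt(csv_text, delimiter=',')
print(csv_data)
print(csv_data.shape)
```

Leaving out the delimiter here would fail, because each line would be treated as a single value that cannot be converted to a number.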

### Writing numerical data to file

Numpy's savetxt() method is very useful for saving arrays to a text file. 

Suppose that you’ve read two columns of data into the arrays t for time and v for voltage from a pressure sensor. Also, suppose that the manual for the sensor gives the following equation to find the pressure in atmospheres from the voltage reading: 

`p = 0.15 + v/10.0`

Recall that this single Python command will calculate an array p with the same length as the array v. 

In [None]:
import numpy as np

t = np.arange(10)  # integers from 0 to 9
v = np.array([7.4, 3.4, 0.6, 2, 9.2, 11, 6.3, 6.6, 5.3, 2.8])  # voltage values
p = 0.15 + v/10.0    # pressure values

print(t)
print(p)

Once you’ve calculated the pressures, you want to write the times and pressures to a text file for later analysis. We can do this with savetxt(). This function takes in a filename and an array-like object, which you can think of as an array or something that can be made into an array with np.array, including regular numbers.__\*__ The filename can refer to an existing file, which will be overwritten (be careful!), or to a file that doesn't exist yet, which will be created with the arrays we pass in. Unless the filename includes a path, the file is saved in the current working directory, which for a notebook is normally the folder containing the notebook.

See the documentation for savetxt() here for optional arguments and some examples: https://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html

__\*__ Here's a discussion on array-like objects if you're interested in what else counts: https://stackoverflow.com/questions/40378427/numpy-formal-definition-of-array-like-objects

In [None]:
np.savetxt('new.txt', (t,p))

Let's check whether it saved correctly.

In [None]:
checkOutput = np.loadtxt('new.txt')    
print(checkOutput)

Try saving your own arrays to file in the code box below!

In new.txt, each array appears in a different row. We can use the column_stack() function to write each array into a different column, which may be more desirable if we're working with large data sets.

In [None]:
dataOut = np.column_stack((t,p))
np.savetxt('new2.txt', dataOut)
checkOutput2 = np.loadtxt('new2.txt')    
print(checkOutput2)

By default, the numbers will be written in scientific notation. The fmt argument can be used to specify the formatting. If one format is supplied, it will be used for all of the numbers. The general form of the fmt argument is

`fmt = '%(width).(precision)(specifier)'`

where width specifies the minimum total number of characters in the output (shorter numbers are padded with spaces), precision specifies the number of digits after the decimal point, and the possibilities for specifier are shown below. For integer formatting, the precision is normally omitted. For scientific notation and floating point formatting, the width argument is optional.

![fmt](https://raw.githubusercontent.com/dawnia/python-workshops/master/fmt.png)
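The same format strings work with Python's % operator on ordinary strings, which makes them easy to try out before passing them to savetxt(). The numbers below are made up, and the square brackets are only there to make the padding visible.

```python
pi_ish = 3.14159
big = 12345.678

f_wide = '[%8.3f]' % pi_ish    # width 8, 3 digits after the decimal point
f_plain = '[%.2f]' % pi_ish    # no width given, 2 digits after the decimal point
i_wide = '[%5i]' % 42          # integer padded with spaces to width 5
sci = '[%.3e]' % big           # scientific notation, 3 digits after the decimal point

print(f_wide)
print(f_plain)
print(i_wide)
print(sci)
```

Note that the width comes before the specifier letter, as in `%5i`.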

A format can also be provided for each column (two in this case):

In [None]:
np.savetxt('new3.txt', dataOut, fmt=('%3i', '%4.3f'))   # width goes before the specifier: %3i, not %i3

You may need to do some Googling or documentation-reading depending on the data you want to write. For example, see here for a discussion on transposing an output of three arrays:
https://stackoverflow.com/questions/15192847/saving-arrays-as-columns-with-np-savetxt

Stack Overflow is extremely useful for programming help and has a very active Python community.

## Bonus: Regular Expressions (Regex)

Regex, available in Python through the re module, is a tiny, powerful pattern language used to specify rules for the set of possible strings you want to match. You can then ask questions such as "Does this string match the pattern?", or "Is there a match for the pattern anywhere in this string?" You can also use regex to modify a string or to split it apart in various ways. Regex is extremely widespread (some might say pervasive): it is found in many programming languages and is especially important for data science.

Here we read data.txt, only adding values to pressures if they are a number. Given some string val, how do we know if it's a number? It must be in one of these formats, where x equals one or more digits. Also, a number can have + or - in front of it. 

* x (e.g. 5, 749)
* x. (e.g. 5., 0.)
* x.x (e.g. 5.0, 273.09)

Our pattern will turn out to be `[+-]?\d+\.?\d*`

But what in the world does `r'[+-]?\d+\.?\d*'` mean? Let's break it down, and feel free to reference the list of operators below.

* `[+-]?` zero or one of either + or -

* `\d+` one or more digits

* `\.?` optionally a decimal point

* `\d*` zero or more digits

There is more than one way to write this regex pattern--try to define your own! 

After importing re, we use its compile() method to create the regex pattern, called exp. We write r __\*__ before the regex pattern, which itself is in quotes. Then while reading the file, we use match() to only add a value val if it matches exp.

__\*__ The r stands for raw string notation, meaning that backslashes shouldn't be interpreted as escape characters for special characters, which Python normally does. As you see below, backslashes have a different meaning for regex, denoting special symbols like \d.
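To get a feel for the pattern, the sketch below tries it on a few made-up strings. It compares match(), which only requires the pattern to fit at the start of the string, with fullmatch(), which requires the whole string to fit. The sample strings and the name `number` are assumptions for illustration.

```python
import re

number = re.compile(r'[+-]?\d+\.?\d*')   # same pattern as in the text

samples = ['5', '5.', '273.09', '-3.2', '+7', 'abc', '.5', '5abc']

for s in samples:
    starts_ok = number.match(s) is not None       # pattern fits at the start of s
    whole_ok = number.fullmatch(s) is not None    # pattern fits all of s
    print(repr(s), starts_ok, whole_ok)
```

A string like '5abc' passes match() but not fullmatch(). Since float('5abc') would raise a ValueError, fullmatch() may be the safer check if a file could contain such values.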

### Exercise

Go into data.txt and add a couple new lines throughout of both numerical data and non-numbers. Test your modified file on the code below!

Note: Always remember to add a new line at the end of your file. For a discussion on this, see: https://stackoverflow.com/questions/729692/why-should-text-files-end-with-a-newline

In [None]:
import re
exp = re.compile(r'[+-]?\d+\.?\d*')   # create regex pattern

with open('data.txt') as infile:
    pressures = []
    infile.readline()
    for line in infile:
        val = line.split()[0]
        if re.match(exp, val):        # check if our string matches exp
            pressures.append(float(val))   

print(pressures)        

Here are some fundamental operators in regex. Don't feel the need to read the syntax thoroughly; use it as a reference instead.

* [] - enclose a set of characters you wish to match. For example, [abc] matches a, b, or c. [a-z] matches letters from a to z.
* A backslash can be followed by various characters to signal a special sequence. It’s also used preceding metacharacters (such as [) so you can still match their literal value in patterns.
* \d - any decimal digit
* \D - any non-digit character
* \w - any alphanumeric character or underscore
* \W - any character that is not alphanumeric or an underscore
* \s - any whitespace character
* \S - any non-whitespace character
* . - anything except a newline character
* A * following a character matches that character zero or more times.
* A + following a character matches that character one or more times.
* A ? following a character matches that character zero or one times.
* {x,y} following a character will match that character at least x and at most y (inclusive) times.
* ^ matches some pattern at the start of a string, and $ matches some pattern at the end.
* | in between strings gives the option of matching one of the specified strings.
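A few of these operators can be tried out with the re module's findall(), search(), and fullmatch() functions. The strings below are made up for illustration.

```python
import re

digits = re.findall(r'\d+', 'a1b22c333')           # + : one or more digits per match
vowels = re.findall(r'[aeiou]', 'regex')           # [] : any one character from the set
pets = re.findall(r'cat|dog', 'cat dog bird')      # | : either alternative
too_long = re.fullmatch(r'\d{2,3}', '1234')        # {2,3} : 2 to 3 digits, so 4 is too many
at_start = re.search(r'^Py', 'Python')             # ^ : pattern must be at the start
at_end = re.search(r'on$', 'Python')               # $ : pattern must be at the end

print(digits)
print(vowels)
print(pets)
print(too_long is None, at_start is not None, at_end is not None)
```

findall() returns a list of every non-overlapping match, while search() and fullmatch() return a match object, or None when nothing matches.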

Here are some resources to further explore regex, with better-formatted syntax and many examples:
* https://developers.google.com/edu/python/regular-expressions
* https://docs.python.org/3.6/howto/regex.html

## Congratulations, you have finished the Advanced Python Bootcamp!

Image credits:

* https://learntofish.wordpress.com/2010/10/13/calculating-pi-with-the-monte-carlo-method/
* https://www.physast.uga.edu/classes/phys3330/zhao/downloads/download/664