<center>*TA contact info*</center>

<center>
I am available for all programming and math-related questions throughout the course. The best way to contact me is by email:</center>

<center>**deverett@princeton.edu**</center>

---

# Lab #2

Last class, we learned about the basics of Python programming, including variables, lists, loops, and if statements. 

Today, we will build on that knowledge, jumping forward to look at functions, Python packages, and then at the Twitter API.

*Note: if you didn't manage to attend the Intro to Plotting session, <a href="https://drive.google.com/open?id=0B2N8L8DcfG0NMHFoaXBjZERTY28">here is the notebook</a>.*

In [None]:
# a couple of useful tricks:

print( 5 in [1, 2, 3] )
print( 5 in [1, 2, 5] )

sentence = 'hello,this,is,a,sentence'

split_sentence = sentence.split(',')
print ( split_sentence )

joined_sentence = 'X'.join(split_sentence)
print ( joined_sentence )

print( 'There are', sentence.count(','), 'commas in the sentence' )
print( 'There are', split_sentence.count(','), 'commas in the list of words' )

---

## Functions

A function is simply a block of general-purpose code that can be run by calling its name.

In most cases, this block of code can take some inputs, do its job on them, and then return some outputs.

Let's look at a simple function:

In [None]:
def add(x, y):
    result = x + y
    return result

In [None]:
a = 3
b = 4

z = add(a, b)

print(z)

In the previous example, `x` and `y` are **parameters** (also known as **arguments**) to the `add` function.

Two other important concepts about functions:
* default values for parameters
* returning multiple objects

In [None]:
def split_string(string, n=2):
    """Split a string into n chunks
    
    Parameters
    ----------
    string : str
        the string to be split
    n : int
        the number of chunks into which to split the string
        
    Returns
    -------
    chunks : a list of individual chunks made by splitting string
    """
    if len(string) % n != 0:
        print('Length of string is not divisible by n; expect some letters to be cut off.')
    
    chunk_size = len(string)//n
    chunks = []
    for i in range(n):
        chunk = string[i*chunk_size : i*chunk_size + chunk_size]
        chunks.append(chunk)
    return chunks

In [None]:
print( split_string('hello world!', 4) )
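
The `split_string` function above illustrates default values: calling it without `n` uses `n=2`. Returning multiple objects can be sketched as follows. The function `shortest_and_longest` is a made-up example, not part of the lab code; it returns two values separated by a comma, and the caller unpacks them into two variables.

```python
def shortest_and_longest(words):
    """Return the shortest and the longest string in a list of strings."""
    shortest = min(words, key=len)
    longest = max(words, key=len)
    # values separated by a comma are returned together as a tuple
    return shortest, longest

# the two returned values can be unpacked into two variables
short_word, long_word = shortest_and_longest(['hello', 'this', 'is', 'a', 'sentence'])
print(short_word, long_word)
```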

  <font color='green'>   
**Exercise**: Write a function that takes as parameters 3 strings and returns the answer to the question: "are the lowercase versions of these 3 strings all the same?"
 </font>

---

## Python packages

Technically, it is possible to do almost any analysis using just what we have already learned in Python. But as you can imagine, some complex tasks would take years of work and millions of lines of code.

Fortunately, many generous developers have written general-purpose code in the form of "packages." Many such packages are free, widely available, and carefully maintained, so that we can use them and rely on them.

Today we will look briefly at 2 such packages that are very popular in scientific computing: `numpy` and `pandas`.

At the end, we will learn about a relatively new and useful package called `tweepy`, which will allow us to interact with Twitter.

### NumPy

`numpy` is the numerical computing package for Python. It is often one of the most confusing topics for beginner Python programmers. In this course, we will only make limited use of `numpy`, but it is very important to know.

To import numpy (or any package), we use this syntax:

In [None]:
import numpy as np

This command makes all of the `numpy` functionality available to us under the shortcut name `np`.

The primary concept to understand with `numpy` is the `array`. Arrays act very much like Python lists, which we learned about last week. However, arrays are specialized for math.

A `numpy` array can be created from any list, like this:

In [None]:
my_list = [5, 10, 11, 32]
my_array = np.array(my_list)

# or, equivalently:

my_array = np.array([5, 10, 11, 32])

To understand the difference between lists and arrays, consider the following cells:

In [None]:
print( my_list * 2 )

In [None]:
print( my_array * 2 )

In [None]:
print( my_list + 2 )

In [None]:
print( my_array + 2 )

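These cells show what it means for `numpy` arrays to be specialized for math. Multiplying a list by 2 repeats its contents, and adding 2 to a list raises a `TypeError`, whereas the same operations on an array are applied to every element.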

You can always check the size and dimensions of an array using the `shape` attribute.

In [None]:
print( my_array.shape )

`numpy` offers a function similar to `range` that also accepts non-integer step sizes:

In [None]:
my_range = np.arange(0, 1000, 0.1)

print(my_range)

`numpy` also has functions for producing arrays of random numbers:

In [None]:
my_rand = np.random.random(10000)

my_normal = np.random.normal(0, 1., size=10000)

In [None]:
import matplotlib.pyplot as pl
pl.style.use('default')
%matplotlib notebook

fig,axs = pl.subplots(2,1)
axs[0].hist(my_rand, bins=100)
axs[1].hist(my_normal, bins=100)

Because `numpy` is a math package, lots of important mathematical functions are implemented:

In [None]:
mean = np.mean(my_rand)
print('Mean is', mean)

std = np.std(my_rand)
print('Standard deviation is', std)

var = np.var(my_rand)
print('Variance is', var)
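
As a quick check on how these statistics relate, here is a small sketch on a hand-made array (`sample` is invented for illustration), chosen so that the results are easy to verify by hand. It also confirms that the variance is the square of the standard deviation.

```python
import numpy as np

# made-up data: the mean is 5, the variance is 4, the standard deviation is 2
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_mean = np.mean(sample)
sample_std = np.std(sample)
sample_var = np.var(sample)
print(sample_mean, sample_std, sample_var)

# the variance is the square of the standard deviation
print(np.isclose(sample_std ** 2, sample_var))
```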

Arrays are not limited to one dimension; in fact, they can be any number of dimensions. It is common to represent matrices of data using a 2D `numpy` array:

In [None]:
my_matrix = np.ones([8,12])

print('The shape of this matrix is: ', my_matrix.shape)
print(my_matrix)

As we learned, when you index lists, you can retrieve or set the n'th element using:

`my_list[n]`

For 2-dimensional arrays, that convention is extended as follows:

`my_array[row_index, column_index]`

In other words, the rows of a 2D matrix are the first **axis**, and the columns are the second **axis**.

In [None]:
top_left_element = my_matrix[0,0]

first_row = my_matrix[0,:]

first_column = my_matrix[:,0]
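
Because every element of `my_matrix` is 1, it is hard to see which elements the indexing picks out. The sketch below repeats the same kind of indexing on a small made-up array (`grid`, not part of the lab data) whose values are all different.

```python
import numpy as np

# a 3x4 matrix holding the numbers 0 through 11, filled row by row
grid = np.arange(12).reshape(3, 4)
print(grid)

print(grid[0, 0])    # top-left element
print(grid[1, 2])    # row 1, column 2 (counting from 0)
print(grid[0, :])    # the whole first row
print(grid[:, 0])    # the whole first column

# the same indexing works for setting a value
grid[2, 3] = 100
print(grid[2, :])
```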

<font color='green'>   
**Exercise**: create a 2D array in which the values of all elements are 42. Then, set just the diagonal elements to be 0.
</font>

When an array has more than 1 dimension, functions like `np.mean` compute over all of the data by default. The optional `axis` parameter names the axis to compute along: `axis=1` averages across the columns, giving one mean per row, and `axis=0` averages down the rows, giving one mean per column:

In [None]:
mean_of_rows = np.mean(my_matrix, axis=1)

print(mean_of_rows.shape)

In [None]:
mean_of_columns = np.mean(my_matrix, axis=0)

print(mean_of_columns.shape)
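
To make the effect of `axis` concrete, here is a minimal sketch on a tiny made-up array (`scores` is invented for illustration) where the means can be checked by hand.

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])   # 2 rows, 3 columns

overall_mean = np.mean(scores)       # one number: the mean of all 6 values
row_means = np.mean(scores, axis=1)  # one mean per row: 2 values
col_means = np.mean(scores, axis=0)  # one mean per column: 3 values

print(overall_mean)
print(row_means)
print(col_means)
```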

`numpy` arrays can be saved to files and loaded back into Python.

In [None]:
np.save('my_array_file.npy', my_matrix)

In [None]:
my_loaded_data = np.load('my_array_file.npy')

Furthermore, `numpy` arrays can be saved to and loaded from CSV files too:

In [None]:
np.savetxt("my_data.csv", my_matrix, delimiter=",")

In [None]:
my_data = np.genfromtxt('my_data.csv', delimiter=',')
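
A simple way to convince yourself that saving and loading preserve the data is a round trip: save an array, load it back, and compare. The sketch below does this for both file formats; the array `original` and the file names are made up, and the files go into a temporary folder so nothing existing is overwritten.

```python
import os
import tempfile
import numpy as np

original = np.array([[1.5, 2.0], [3.25, 4.0]])

# write into a temporary folder so that no existing file is overwritten
folder = tempfile.mkdtemp()
npy_path = os.path.join(folder, 'roundtrip_example.npy')
csv_path = os.path.join(folder, 'roundtrip_example.csv')

np.save(npy_path, original)
np.savetxt(csv_path, original, delimiter=',')

from_npy = np.load(npy_path)
from_csv = np.genfromtxt(csv_path, delimiter=',')

# both loaded arrays should match the original
print(np.array_equal(original, from_npy))
print(np.allclose(original, from_csv))
```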

<font color='green'>   
**Exercise**: 

The file "temperature_anomalies.npy" contains a `numpy` array. The rows represent years, from 1880 through 2016. The columns represent months, January through December. The values represent the temperature anomaly in the given month and year, gathered from <a href="https://www.ncdc.noaa.gov/cag/data-info/global">this source</a>.  
 
<ul>

<li>Load the temperature data into Python
<li>Compute the mean and standard deviation of the temperature anomaly for each year
<li>Compute the mean and standard deviation of the temperature anomaly by month, averaged over years 
<li>Bonus: plot these trends and comment on the meaning of the finding

</ul>
</font>

### Pandas

Whereas `numpy` is the go-to package for math in Python, `pandas` is another popular package used for analyzing timeseries and tabular data.

To be clear, `pandas` is built *on top* of `numpy`, so if you have a `pandas` object, it usually has all the available functions that a `numpy` array would. That is one of the reasons that knowing `numpy` is so important.

For our purposes, the `pandas` DataFrame is the most important object with which to be familiar. It is basically a 2D numpy array, but it has some special properties, like names for the columns.

In [None]:
import pandas as pd

# build the table from a dictionary that maps each column name to its values
people = pd.DataFrame({
    'name': ['Adam', 'Bart', 'Cynthia', 'Dolores', 'Edwin', 'Frances'],
    'age': [22, 21, 22, 22, 20, 21],
    'is_female': [False, False, True, True, False, True],
    'height': [5.8, 5.9, 5.8, 5.7, 6., 5.5],
})

people

In [None]:
people.loc[2,'name'] = 'Zorba'

people

Because DataFrames are basically just fancy arrays, we can always retrieve their raw values:

In [None]:
print(people.values)

`pandas` offers a suite of very useful features, such as the ability to group your table by a property, and compute statistics over the other properties for each group.

In [None]:
# numeric_only=True leaves out the text column 'name', which has no mean
people.groupby('age').mean(numeric_only=True)
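
Grouping can also be applied to a single column, and more than one statistic can be computed at a time. The sketch below uses a made-up table (`pets`, unrelated to the lab data) to show both patterns with `groupby` and `agg`.

```python
import pandas as pd

# made-up data, unrelated to the `people` table
pets = pd.DataFrame({
    'species': ['cat', 'dog', 'cat', 'dog', 'cat'],
    'weight': [4.0, 20.0, 5.0, 30.0, 6.0],
})

# mean weight for each species
mean_weight = pets.groupby('species')['weight'].mean()
print(mean_weight)

# several statistics at once: one row per species, one column per statistic
summary = pets.groupby('species')['weight'].agg(['min', 'max'])
print(summary)
```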

DataFrames can also be saved to and read from CSV files:

In [None]:
people.to_csv('people.csv')
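
Reading a CSV file back in is done with `pd.read_csv`. The sketch below saves a small made-up table (`table`, written to a temporary folder under an invented file name) and loads it again.

```python
import os
import tempfile
import pandas as pd

table = pd.DataFrame({'name': ['Adam', 'Bart'], 'age': [22, 21]})

# write into a temporary folder so that no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), 'table_example.csv')
table.to_csv(path)

# to_csv also writes the row labels as the first column;
# index_col=0 tells read_csv to use that column as the row labels again
loaded = pd.read_csv(path, index_col=0)
print(loaded)
```

Without `index_col=0`, the saved row labels would come back as an extra, unnamed data column.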

In fact, pandas even offers functions to save and read Excel files:

In [None]:
# recent versions of pandas write the .xlsx format (this needs the openpyxl package)
people.to_excel('people.xlsx')

<font color='green'>   
**Exercise**: the file `2016_donations.csv` contains approximately one million records of campaign donations from individuals in the 2015 and 2016 election cycles (they are real data from <a href="http://www.fec.gov/finance/disclosure/ftpdet.shtml#a2015_2016">this source</a>).

* Load the data into a pandas DataFrame
* Print the number of individual unique donors in the list
* Print the average donation amount
* Create a new table in which there is one row per state, and the columns are the mean, standard deviation, minimum, and maximum donations for that state
* Save your new table to an Excel spreadsheet or CSV file

</font>


### tweepy

Many companies and services offer APIs: Application Programming Interfaces. These are basically a set of standard commands you can use to query the company's database and retrieve information that you want.

Twitter has an excellent API, and developers have built Python-accessible bindings for it, such that we can use Python to retrieve publicly available tweets.

All that's required in order to do this is a set of keys provided by Twitter to anyone who requests them.

<br/>
<font color='green'> 
    **Exercise**: 

    <ul>
    <li> In a terminal window, run `pip install tweepy` </li>
    <li> go to https://apps.twitter.com </li>
    <li> Create New App, with content of your choosing </li>
    <li> Go to Keys and Access Tokens, Create My Access Token</li>
    <li> In the next code cell, define 4 variables for your:
        <ul>
        <li> API Key </li>
        <li> API Secret </li>
        <li> Access Token </li>
        <li> Access Token Secret </li>
        </ul>
    </li>
    </ul>

</font>


In [None]:
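import tweepy as tw

# authenticate with the 4 keys from the exercise above
# (assumed here to be stored in variables with these names)
auth = tw.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth)

n_posts = 15

for tweet in tw.Cursor(api.home_timeline).items(n_posts):
    author = tweet.author.name
    sname = tweet.author.screen_name
    text = tweet.text
    
    print('{} (@{})\n{}\n'.format(author, sname, text))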

The API gives us the ability to search Twitter, just like the search bar on the website itself.

In [None]:
for tweet in tw.Cursor(api.search, q='obama', lang='en').items(10):
    print(tweet.text, '\n')

In [None]:
# note that the query can be a list of terms

for tweet in tw.Cursor(api.search, q=['ukraine', 'madonna', 'joe trudeau'], lang='en').items(10):
    print(tweet.text, '\n')

In [None]:
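# in a notebook, type "tweet." and press Tab to browse the attributes of a tweet; dir() lists the same names
dir(tweet)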

The available options for the Twitter API's search functionality are documented <a href="https://dev.twitter.com/rest/public/search">here</a>.

An important limitation to know is that the Twitter search API does not let you search arbitrarily far back in time on all of Twitter (because it is not feasible on their end). It does, however, let you scroll through a given user's timeline without limit, and it does allow you to search all of Twitter from ~1 week ago until the present moment.

Also note that Twitter <a href="https://dev.twitter.com/rest/public/rate-limiting">imposes a limit</a> on the rate at which you can query the API, set at 15 queries within a given 15-minute window. When you do the following exercise, you'll want to first ensure that you have a reliable way to store the tweets you retrieve, so that you don't waste your API calls.  
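
One possible way to store tweets reliably is to append each batch to a CSV file as soon as it arrives, so that an interruption does not lose earlier results. The sketch below is only an illustration under stated assumptions: `save_tweets` is a made-up helper, the records are invented stand-ins for real tweets, and the file name is hypothetical.

```python
import os
import tempfile
import pandas as pd

def save_tweets(records, path):
    """Append a batch of tweet records (a list of dicts) to a CSV file."""
    batch = pd.DataFrame(records)
    # write the column names only when the file does not exist yet
    write_header = not os.path.exists(path)
    batch.to_csv(path, mode='a', header=write_header, index=False)

# made-up records standing in for tweets retrieved from the API
first_batch = [{'id': 1, 'text': 'first example tweet'},
               {'id': 2, 'text': 'second example tweet'}]
second_batch = [{'id': 3, 'text': 'third example tweet'}]

path = os.path.join(tempfile.mkdtemp(), 'tweets_example.csv')
save_tweets(first_batch, path)
save_tweets(second_batch, path)

# everything saved so far can be read back at any time
stored = pd.read_csv(path)
print(len(stored), 'tweets stored')
```

With real data, each record would be built from the fields of a tweet you care about, such as its text and author.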
  
<font color='green'>   
**Exercise**: 

Consider this list of 9 keywords:
<ul>
<li> trump
<li> spicer
<li> conway
<li> russia
<li> alzheimer
<li> neuron
<li> memory
<li> superbowl
<li> beatles
</ul>

<ol>
<li>For each keyword, retrieve 1000 unique tweets that contain the keyword. 
<li> Save those tweets into a `pandas` DataFrame, and save that DataFrame to a file.
<li> For each pairwise comparison of the 9 keywords (81 comparisons), compute the proportion of tweets containing keyword A that also contain keyword B. 
<li>Store the results in a 2D `numpy` matrix or a `pandas` DataFrame, and save the data to a file.

<li> Bonus: display and save a pseudocolor plot to visualize this co-occurrence data.

</ol>

</font>
