# Programming in Python

This notebook offers an introduction to programming in the Python language. It's impossible to cover it all in a single notebook (or a single class!); however, this notebook highlights core aspects of Python that are important for this class. I highly recommend the (free and online!) book <a href=https://python.swaroopch.com/><i>A Byte of Python</i></a> if you would like to further study the ideas outlined in this notebook.

## Hello world!

As is customary when learning a new programming language, we can start with a hello world program:

In [None]:
print("hello world!")

We can also use single quotes to specify a string:

In [None]:
print('hello world!')

## Comments

It is absolutely essential to comment your code when writing a program in any language and this is no different for Python. You can easily add inline and multi-line comments in Python. Consider the following inline comments:

In [None]:
# You can put a comment on a newline
print('I love you Python.') # You can also put a comment here

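The Python interpreter ignores everything after the hash symbol. Python has no dedicated syntax for multi-line comments. Instead, text that spans several lines is written as a string using 3 consecutive quotation marks (either double or single quotes), and such a string can be printed or simply left in the code to act as a comment: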

In [None]:
print('''Here is one of my all time favorite Trump
tweets on climate change.''')

#print('The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.')

For more on best practices regarding commenting code, see <a href="https://google.github.io/styleguide/pyguide.html#Comments">Google's Python style guide</a>. I will illustrate (you need to hold me to this!) these best practices throughout the course. In short, as outlined in the book <i>A Byte of Python</i>, use as many useful comments as you can in your program to:
* explain assumptions
* explain important decisions
* explain important details
* explain problems you're trying to solve
* explain problems you're trying to overcome in your program, etc.

<a href="https://blog.codinghorror.com/code-tells-you-how-comments-tell-you-why/">Code tells you how, comments should tell you why</a>.

## String methods and concatenation

Strings -- sequences of characters -- are obviously important when dealing with text data. We have already seen how to specify a string in Python, using both single and double quotation marks. (<b>Note</b>: I tend to use single quotes, as this is easier on my keyboard. You can use whichever you like best; however, be consistent.) Strings also have a number of "methods" that will prove useful throughout the term. For instance, say that we want to convert a string to all lowercase:

In [None]:
print('University of Exeter'.lower())

Or all uppercase letters,

In [None]:
'University of Exeter'.upper()

Another method that we will rely on heavily throughout the course is the <b><span style="color:green">split()</span></b> method,

In [None]:
'University of Exeter'.split(' ')

In [None]:
len('University of Exeter'.split(' '))

Here, we "split" the string based on space (i.e., the ' '). There are a bunch of other string methods (see this <a href="https://www.shortcutfoo.com/app/dojos/python-strings/cheatsheet">cheatsheet</a> for more information) and we will use several of these methods throughout the course. We will often want to combine (or concatenate) strings together.

In [None]:
# This code illustrates one way to concatenate a string
'data' + 'science'

In [None]:
'data' + ' ' + 'science'

In [None]:
'data ' + 'science'

You can also use formatted string literals (or "f-strings"), available from Python 3.6, to insert the value of a variable into a string.

In [None]:
term1 = 'analyst'
term2 = 'sexy'
print(f'Data {term1} is the new {term2} job.')
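
To see how these approaches compare, here is a minimal sketch that builds the same sentence with `+`, with the `str.format()` method, and with an f-string. The variable names and values are placeholders chosen only for illustration.

```python
# Three ways to build the same sentence (names and values are illustrative)
field = 'science'
year = 2017

# 1. Concatenation: non-string values must be converted with str()
with_plus = 'Data ' + field + ' in ' + str(year)

# 2. The str.format() method fills each {} placeholder in order
with_format = 'Data {} in {}'.format(field, year)

# 3. An f-string evaluates the expressions inside the braces
with_fstring = f'Data {field} in {year}'

print(with_fstring)
print(with_plus == with_format == with_fstring)
```

All three produce the same string; f-strings tend to be the easiest to read once several variables are involved.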

## Numbers

Here is the description of numbers in <i>A Byte of Python</i>:

"Numbers are mainly of two types -- integers and floats.
An example of an integer is 2 which is just a whole number.
Examples of floating point numbers (or floats for short) are 3.23 and 52.3E-4. The E notation indicates powers of 10. In this case, 52.3E-4 means 52.3 * 10<sup>-4</sup>."

That pretty much sums it up!
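
A short sketch may help make the two types concrete. It uses the values from the quote above; the variable names are assumptions made for this illustration, and `math.isclose` is used because floats are stored approximately.

```python
import math

whole = 2          # an integer
decimal = 3.23     # a float
small = 52.3E-4    # E notation: 52.3 times 10 to the power -4

print(type(whole))
print(type(decimal))

# Floats are stored approximately, so compare them with a tolerance
same_value = math.isclose(small, 52.3 * 10 ** -4)
print(same_value)
```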

## Variables and operators

Often, we want to store numbers and strings in <b>variables</b> and perform "operations" on those variables.

In [None]:
# Assigning variables is easy in Python
a = 'data'
b = 'science'

# And we can 'do things' with these variables
a + b

In [None]:
# Works the same way for numbers
c = 2
d = 4

# And we can add these variables
c + d

In [None]:
# We can also assign a new variable based on an operation
x = c + d
print(x)

Be careful, however, when trying to mix types:

In [None]:
# Try to concatenate a string and an integer
a + c

In [None]:
# Instead, we need to perform the operation using consistent types
a + str(c)
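
Type conversion works in both directions. As a rough sketch (the variable names here are illustrative, not part of the examples above), `str()` turns a number into a string, while `int()` and `float()` turn suitable strings into numbers:

```python
label = 'data'
count = 2

# str() turns a number into a string so it can be concatenated
combined = label + str(count)

# int() and float() go the other way, from a string to a number
parsed_int = int('4')
parsed_float = float('3.5')
total = count + parsed_int

print(combined, total, parsed_float)
```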

### Operators

Python includes all of the arithmetic (for integers and floats), relational, and logical operators that you will need (<a href="https://www.tutorialspoint.com/python/python_basic_operators.htm">click here for a complete list of operators</a>). Let's look at the main <b>arithmetic</b> operators.

In [None]:
3 + 5 # addition

In [None]:
3 - 5 # subtraction

In [None]:
3 * 5 # multiplication

In [None]:
3 / 5 # division

There are a number of other arithmetic operators that we could run into throughout the term, such as:

* Power: ``` 5 ** 3 ``` outputs ``` 125 ```.
* Modulo: ``` 100 % 10 ``` outputs ```0```.
* And so on and so forth (again, see <a href="https://www.tutorialspoint.com/python/python_basic_operators.htm">here</a> for more info).
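
The operators in this list can be tried out directly. The sketch below also includes floor division (`//`), which is not listed above but often appears alongside modulo; the variable names are placeholders.

```python
power = 5 ** 3          # exponentiation
remainder = 100 % 10    # modulo (remainder after division)
true_div = 7 / 2        # division always returns a float in Python 3
floor_div = 7 // 2      # floor division rounds down to a whole number
odd_check = 7 % 2 == 1  # a common use of modulo: testing for odd numbers

print(power, remainder, true_div, floor_div, odd_check)
```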

We will also often make use of <b>relational</b> operators. For instance, the relational "equals" operator is important for testing the equality between two objects:

In [None]:
a = 2
b = 3

# Are a and b equal?
a == b

In [None]:
# And if we re-assign variable a to 3?
a = 3
a == b

Here are the other relational operators that we will use:

* `!=` (not equal to)
* `<`  (less than)
* `>`  (greater than)
* `<=` (less than or equal to)
* `>=` (greater than or equal to)
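
As a quick sketch of the operators just listed, each comparison below evaluates to either `True` or `False`. The result names are chosen only for this illustration.

```python
a = 2
b = 3

not_equal = a != b         # True, because 2 and 3 differ
less_than = a < b          # True
greater_than = a > b       # False
less_or_equal = a <= b     # True
greater_or_equal = a >= b  # False

print(not_equal, less_than, greater_than, less_or_equal, greater_or_equal)
```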

Finally, Python also provides a set of <b>logical</b> and <b>membership</b> operators:

* `and` (boolean AND)
* `or`  (boolean OR)
* `not` (boolean NOT)
* `in` (membership)

So, for instance,


In [None]:
# Are both a and b equal to 3?
a == 3 and b == 3

In [None]:
tokens = 'University of Exeter'.split(' ')

In [None]:
'Exeter' in tokens

In [None]:
'exeter' in 'University of Exeter'.lower()

In [None]:
'Travis' in 'University of Exeter'.split(' ')
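
One detail worth noting is that `in` behaves differently on strings and lists. The following sketch (with illustrative variable names) shows the difference and combines membership tests with the logical operators:

```python
name = 'University of Exeter'
tokens = name.split(' ')

# On a string, `in` looks for a substring anywhere in the text
in_string = 'Exe' in name

# On a list, `in` looks for an element that matches exactly
in_tokens = 'Exe' in tokens
whole_word = 'Exeter' in tokens

# Logical operators combine several tests into one
both = ('Exeter' in tokens) and not ('Travis' in tokens)

print(in_string, in_tokens, whole_word, both)
```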

We will also occasionally use the following <b>assignment</b> operator to increment a counter (more on this when we get to "loops"),

In [None]:
# Assignment for i
i = 0
print(i)

# Increment i by 1
i += 1

print(i)

## Control flow

For simple programs -- such as those outlined in the code above -- executing code from top to bottom works just fine. However, for everything else, we will need a bit more control. This is where <b>control flow</b> statements come in handy. In this section, we will introduce Python's three control flow statements: `if`, `for`, and `while`.

### The `if` statement

The value of the various logical and relational operators outlined above really comes into focus when combined with the `if` statement in Python. Let's take a look at several examples.

<b>Example 1</b>: Simple if/else statement. Check whether our Trump tweet includes the phrase "global warming."

In [None]:
# Initialize our program
keyword = 'global warming'
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'

if keyword in tweet.lower():
    print(f'Found {keyword} in tweet.')
else:
    print(f'Could not find {keyword} in tweet')

<b>Example 2</b>: Nested if/else statements. First, check if the string has 140 or fewer characters (i.e., consistent with Twitter limits). If this is true, check whether our Trump tweet includes the phrase "global warming."

In [None]:
# Initialize our program
keyword = 'global warming'
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'

if len(tweet) <= 140:
    if keyword in tweet:
        print(f'Found {keyword} in tweet')
    else:
        print(f'Could not find {keyword} in tweet')
else:
    print('Not a tweet!')
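
Nested `if` statements can get hard to follow as the number of cases grows. One alternative, sketched here under the assumption that we want the same three outcomes as in Example 2, is the `elif` ("else if") clause, which checks conditions in order and runs the first branch that matches. The `result` variable is introduced only for this illustration.

```python
keyword = 'global warming'
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'

if len(tweet) > 140:
    # Too long to be a tweet, so we never look for the keyword
    result = 'Not a tweet!'
elif keyword in tweet.lower():
    result = f'Found {keyword} in tweet'
else:
    result = f'Could not find {keyword} in tweet'

print(result)
```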

### The `for` loop

We often want to make repeated calculations and this is where the idea of a "loop" comes in. Let's start by taking a look at a `for` loop, which allows you to <i>iterate over a sequence of objects</i>.

In [None]:
# Split (or tokenize) the Trump tweet into words
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'
words = tweet.split(' ')

# Iterate over the sequence of words and print
for word in words:
    print(word.lower())

In [None]:
words[0:3]

The variable "word" holds each object in the sequence, one at a time. Note that you can name this anything you want (e.g., 'travis' or 'token' or whatever).

In [None]:
for trav in words:
    print(trav.lower())

As another example, say that we wanted to iterate over a sequence of numbers, such as the positions (0 to 18) of the 19 words in our list. How can we do this in Python?

In [None]:
list(range(len(words)))

In [None]:
# The range() function creates the "sequence of objects" to iterate over.
# By default, range() iterates from 0 in increments of 1.
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'
words = tweet.split(' ')

for i in range(len(words)):
    print(i, words[i])

In [None]:
range(10)

Iterating over lists of objects (such as our words above) or numbers is a common task. Sometimes you want to iterate over a list of objects AND keep a counter to track where you are in the list. This is where the `enumerate` function comes in handy.

<b>Example 3</b>: The `enumerate` function. Print the first 5 words in our Trump tweet.

In [None]:
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'
words = tweet.split(' ')

for i, word in enumerate(words):
    if i < 5:
        print(i, word)
    else:
        break

We can also get the same answer without the `enumerate` function by instead using `range` and indexing into the "words" list:

In [None]:
for i in range(len(words)):
    if i < 5:
        print(i, words[i])
    else:
        break

The `break` statement exits the loop immediately, so once the counter reaches 5 the remaining words are skipped.

### The `while` loop

Whenever possible, it is good to use a `for` loop to iterate over sequences. However, there are times when you do not know the length of the sequence you are iterating over ahead of time. This is when a `while` loop is useful. Let's revisit <b>Example 3</b>, but this time using a `while` loop.

In [None]:
# We need to initialize a counter to hold our iterations
i = 0

while i < len(words):
    print(words[i])
    # Need to update the counter. Otherwise, we get trapped in an
    # "infinite loop"!
    i += 1
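
To illustrate the case where the number of iterations is not known in advance, here is a small sketch (unrelated to the tweet data, with made-up starting values) that keeps halving a number until it drops below 1:

```python
# Keep halving a number until it drops below 1. We do not know in
# advance how many steps this takes, which is where `while` helps.
value = 100
steps = 0

while value >= 1:
    value = value / 2
    steps += 1

print(steps, value)
```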

## Exceptions

Sometimes we need to catch errors when they happen, so that they do not stop our program. We do so using ``try`` and ``except`` in Python (see <a href="https://python.swaroopch.com/exceptions.html"><i>Byte of Python</i> on Exceptions</a> for more information). For instance, consider the following ``while`` loop:

In [None]:
# Initialize counter
i = 0

# This is called an infinite loop -- be careful!
while True:
    print(words[i])
    i += 1

Once we run out of words, the code breaks -- it errors out with an ``IndexError``. If we ran into this error in one of our programs, the program would stop executing. Instead, we can "catch" the error, using a ``try`` and ``except`` sequence:

In [None]:
# Initialize counter
i = 0

# This is called an infinite loop -- be careful!
while True:
    # Try to print a word
    try:
        print(words[i])
        i += 1
    # Catch the exception if we run past the end of the list
    except IndexError:
        print('We ran out of words!')
        break

This code, instead, catches the error -- our program could continue doing other things if we wanted. There are times when catching errors can be super helpful.
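
As a further sketch of the same idea, the example below catches a specific error type, `ValueError`, which `int()` raises when a string is not a whole number. The input values are made up for illustration.

```python
raw_values = ['10', '25', 'n/a', '7']
numbers = []
skipped = []

for raw in raw_values:
    try:
        numbers.append(int(raw))
    except ValueError:
        # int() raises ValueError when the text is not a whole number
        skipped.append(raw)

print(numbers)
print(skipped)
```

Naming the exception type means that any other, unexpected error still stops the program, which usually makes problems easier to spot.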

## Functions

Often when writing programs and doing analysis, we want to reuse pieces (or blocks) of code. We do so by declaring a function using the `def` statement. We have already used several of Python's built-in functions earlier in this tutorial. For instance, we "called" the `len` function to get the number of characters in a string. Python, however, makes it super easy to define your own functions.

<b>Example 4</b>: Looking up words in a tweet, any tweet. We can extend our code in <b>Example 1</b> to make it reusable for any tweet by defining a function.

In [None]:
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'

def lookup(tweet, keyword):
    '''This function takes a tweet and keyword, and returns True if the 
       keyword is present and False otherwise '''
    
    #print('Is %s found in the tweet?' % keyword)
    
    if keyword in tweet:
        return True
    else:
        return False

In [None]:
print(lookup(tweet, 'travis'))

In [None]:
lookup(tweet, keyword)

Our function above illustrates features that most functions have:

* <b> Parameters </b>: `keyword` and `tweet` are parameters in our function (i.e., information sent to the function that it needs to run).
* <b> Return value(s) </b>: Most, but not all, functions return a value (or set of values).
* <b> Doc string </b>: String explaining what the function does (i.e., "documenting" the function), which appears just under the function definition.

Doc strings are helpful, as they allow a user (including yourself!) to get help on what the function does:

In [None]:
help(lookup)

We can also specify <b>default values for parameters</b>. For instance, we can add the following:

In [None]:
def lookup(tweet, keyword = 'covfefe'):
    '''This function takes a tweet and keyword, and returns True if the 
       keyword is present and False otherwise '''
    
    print('Is %s found in the tweet?' % keyword)
    
    if keyword in tweet:
        return True
    else:
        return False

In [None]:
lookup(tweet)

In [None]:
lookup(tweet, keyword='global warming')

Your functions can get quite complex and you can even include an <a href=https://www.geeksforgeeks.org/args-kwargs-python/>arbitrary number of arguments</a>. However, we are not going to worry about the complexities at this point. And don't worry, we will be using functions throughout this course, so you will get many (many!) opportunities to practice their use (for more on functions, click <a href = "https://python.swaroopch.com/functions.html">here</a>).
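
For the curious, here is a rough sketch of what an arbitrary number of arguments might look like. The `lookup_any` function and the example sentence are hypothetical and are not used elsewhere in this notebook; the `*keywords` parameter collects any extra arguments into a tuple.

```python
def lookup_any(tweet, *keywords):
    '''Return True if any of the keywords appears in the tweet
       (ignoring case) and False otherwise.'''
    text = tweet.lower()
    for keyword in keywords:
        if keyword.lower() in text:
            return True
    return False


# A made-up sentence used only for this illustration
example = 'Global warming is in the news again.'

print(lookup_any(example, 'climate', 'warming'))
print(lookup_any(example, 'covfefe'))
```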

## Data structures

Python offers a number of alternatives (or "structures") for storing data. There are four built-in data structures: `list`, `dict`, `tuple`, and `set`. We will look at each of these in turn.

### Lists

A list is just that -- a list of objects. These "objects" can be numbers, strings, and even other data structures. For instance, when we "split" the Trump tweet above into separate words, Python returned a list:

In [None]:
words = tweet.split(' ')
print(words)
print(len(words))

This list has 19 elements and we can look up a particular element in the list using the appropriate index. Once again, note that Python indexes lists starting at 0, and moves left to right. So if we wanted to look up the 5th element in this list, we would type:

In [None]:
print(words[4])

We can also index a list from the opposite direction by using negative indices. So to get the last and second to last word in the list, we could type:

In [None]:
# Last word
print(words[-1])

# Second to last word
print(words[-2])
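
Indexing also extends to "slices", which pull out several elements at once (we used `words[0:3]` earlier). A minimal sketch, using a shortened word list as a stand-in for the full tweet:

```python
words = ['The', 'concept', 'of', 'global', 'warming']

first = words[0]          # indexing starts at 0
fifth = words[4]          # so the 5th element has index 4
last = words[-1]          # negative indices count from the end
first_three = words[0:3]  # a slice includes the start index but not the stop index
last_two = words[-2:]     # leaving out the stop index runs to the end of the list

print(first, fifth, last)
print(first_three, last_two)
```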

We can also `append` objects to the end of a list:

In [None]:
# Add an additional word to the end of our list
words.append('crazy')
print(words)

or `insert` an object based on an index:

In [None]:
# Add a word to the beginning of the list
words.insert(0, 'trump')
print(words)

Lists are super flexible and store just about anything. For example, we will often need to work with "lists of lists".

In [None]:
# Define a list to hold two Trump tweets
tweets = ["Let's continue to destroy the competitiveness of our factories & manufacturing so we can fight mythical global warming. China is so happy!",
          "The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive."]

# Loop over the "tweets" and tokenize
tokenized_tweets = []
for tweet in tweets:
    tokenized_tweets.append(tweet.split(' '))

print(tokenized_tweets)

We can then access an individual element within our nested lists as follows:

In [None]:
# Get word 12 in tweet 2
print(tokenized_tweets[1][11])

# Get the last word in tweet 1
print(tokenized_tweets[0][-1])

### List comprehension

While we are on the subject of lists, it is good to introduce the idea of "list comprehension" in Python. I think of list comprehension as a special type of loop. This procedure takes a list, loops over it (typically modifying it in some way), and then returns a new list. The advantage of using list comprehension rather than, say, a `for` loop is that it often leads to efficient, easy to read code. Let's take a look.

In [None]:
nums = [i for i in range(10)]
print(nums)

We could also do the same thing using a loop, but it's a bit long-winded:

In [None]:
nums = []
for i in range(10):
    nums.append(i)

print(nums)

Or take our tweet example above:

In [None]:
tweets = ['The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.',
          'This is also a tweet.']
tokenized_tweets = [tweet.split(' ') for tweet in tweets]
print(tokenized_tweets)

We will see many more examples of using list comprehension in Python.
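
A list comprehension can also filter as it goes by adding an `if` clause at the end. The sketch below uses a shortened version of the tweet and an arbitrary length cut-off of 3 characters, both chosen only for illustration, and shows the equivalent loop for comparison.

```python
tweet = 'The concept of global warming was created by and for the Chinese'
words = tweet.split(' ')

# Keep only the words longer than 3 characters, in lowercase
long_words = [word.lower() for word in words if len(word) > 3]

# The equivalent loop, for comparison
long_words_loop = []
for word in words:
    if len(word) > 3:
        long_words_loop.append(word.lower())

print(long_words)
```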

### Dictionaries

In addition to lists, dictionaries are one of the most often used data structures in Python programs. As aptly described in <i>Byte of Python</i>, 

> "A dictionary is like an address-book where you can find the address or contact details of a person by knowing only his/her name i.e. we associate <i>keys</i> (name) with <i>values</i> (details). Note that the key must be unique just like you cannot find out the correct information if you have two persons with the exact same name."

How do dictionaries work in practice? Let's go back to the two tweets from Donald Trump above. Say that we wanted to store the `tweets` list along with a list of tweet IDs. We could do so using a dictionary as follows:

In [None]:
# Define a list of tweet ids
ids = [1, 2]

# Combine the ids and tweets into a dictionary
tweets_dict = {'tweets': tweets, 'ids': ids}

print(tweets_dict)

And we can now call up each list using the `tweets_dict` and the relevant key.

In [None]:
# Grab the first tweet to view
print(tweets_dict['tweets'][0])

As with lists, dictionaries are super flexible. I often store data as a list of dictionaries as follows:

In [None]:
tweets = [{'id': 1, 
           'tweet': "Let's continue to destroy the competitiveness of our factories & manufacturing so we can fight mythical global warming. China is so happy!"},
          {'id': 2,
           'tweet': "The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive."}
         ]


This allows you to call up a particular tweet using the "tweet" key, instead of having to remember which index in a list holds the tweet element.

In [None]:
for row in tweets:
    row['tokens'] = row['tweet'].split(' ')

In [None]:
tweets[0]
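
A few other dictionary operations come up often. The sketch below works on a single made-up row; the `source` key is deliberately missing to show how `get()` falls back to a default value.

```python
row = {'id': 1, 'tweet': 'This is also a tweet.'}

# Add a new key, or overwrite an existing one, by assignment
row['tokens'] = row['tweet'].split(' ')

# get() returns a fallback value instead of raising a KeyError
source = row.get('source', 'unknown')

# items() yields each key together with its value
for key, value in row.items():
    print(key, '->', value)

keys = list(row.keys())
```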

### Tuples

I tend to use tuples less often than lists and dictionaries, but they are still quite useful in certain circumstances. You can think of a tuple as a stripped down version of a list, with the added feature that they are <a href="https://medium.com/@meghamohan/mutable-and-immutable-side-of-python-c2145cf72747">immutable</a> (don't worry about this concept too much at this point). Basically, we can use tuples when we really, really want objects to remain together and we don't want them to be changed.

You define a tuple in a very similar way to a list:

In [None]:
tweet = (1, "Let's continue to destroy the competitiveness of our factories & manufacturing so we can fight mythical global warming. China is so happy!")
print(tweet)

We can still iterate over this tuple and call individual elements based on their index; however, there is no `append` or `insert` method for tuples. They are "hard to change" by design!
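
As a small sketch of what "hard to change" means in practice, the code below unpacks a tuple into separate variables and then shows that assigning to one of its elements raises a `TypeError`. The tuple contents are made up for illustration.

```python
tweet = (1, 'This is also a tweet.')

# Unpacking assigns each element of the tuple to its own variable
tweet_id, text = tweet

# Trying to change an element raises a TypeError
try:
    tweet[0] = 2
    changed = True
except TypeError:
    changed = False

print(tweet_id, text, changed)
```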

### Sets

A set is useful when you want a <i>unique</i>, unordered collection of Python objects. For example,

In [None]:
names = ['travis', 'travis', 'travis', 'riley', 'riley', 'dreolin']
names_set = list(set(names))
print(names_set)
#names_set.append('ranu')
#(names_set)

In [None]:
set(names)

Where the use of sets really helps us is when checking for membership in a collection of objects. For instance, if I wanted to know whether 'travis' was included in this list of names, I could use the membership operator above on the list of names directly:

In [None]:
print('travis' in names)

However, when the list of names is large and you need to check for membership many times, it becomes much more efficient to convert the list to a set once and then check against the set:

In [None]:
names_lookup = set(names)
print('travis' in names_lookup)
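
Sets also support operations for comparing two collections. The following sketch assumes a second, made-up list of names (`others`) and shows intersection, union, and difference:

```python
names = ['travis', 'travis', 'travis', 'riley', 'riley', 'dreolin']
others = ['riley', 'ranu']

# Build each set once, then reuse it for every membership check
names_lookup = set(names)
others_lookup = set(others)

found = 'travis' in names_lookup
in_both = names_lookup & others_lookup     # intersection
in_either = names_lookup | others_lookup   # union
only_first = names_lookup - others_lookup  # difference

# Sets are unordered, so sort before printing for a stable display
print(found, sorted(in_both), sorted(in_either), sorted(only_first))
```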

## Input and output 

Most of your scripts and programs will need to read and write data. Let's jump right in with an example of reading and writing a CSV file in Python using the  ``pandas`` library. After learning how to read and write a CSV formatted file, we will look at some other useful file formats.

<b>Example 5</b>: Reading, processing, and then writing data. Open the trump_tweets_2017.csv file, flag tweets about "fake news", and write these tweets to disk.

We need to start by downloading the trump_tweets_2017.csv data and storing it in a location that you can find. I downloaded it to the following folder on my machine: /Users/tcoan/git_repos/notebooks/data. If you want to avoid typing the entire (absolute) path each time you read and write data, you can set the working directory using the `os` module (similar to `setwd()` in **R**).


In [None]:
import os
os.chdir('/Users/tcoan/git_repos/notebooks')

Next, we use the `pandas` library to load the data. For example, this code will load our tweets CSV into a new data type unique to `pandas`: the "data frame". 

In [None]:
import pandas as pd
trump_df = pd.read_csv('data/trump_tweets_2017.csv')

Like in **R**, we can look at the first few rows by using the `head` method:

In [None]:
trump_df.head(10)

Dataframes represent tabular data organized by variable (or what `pandas` refers to as "series"). Calling `head` without an argument shows the first 5 rows:

In [None]:
trump_df.head() # prints the first 5 rows by default
# You can print a different number of rows by passing
# values to the head function:
# trump_df.head(10)

We can also look at the `tail` of the dataset:

In [None]:
trump_df.tail()

There's a ton that you can do with `pandas` and there are many great [tutorials](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/) online to become more familiar with the library's power. For now, however, let's look at some of the most useful functions for "getting to know" your data:

In [None]:
# Get the shape of a dataframe (rows, columns)
trump_df.shape

In [None]:
# Get the column (or variable) names
trump_df.columns

In [None]:
# You can subset columns (or variables) in a dataframe by passing a list of variable names
trump_df[['source', 'retweet_count']].head()

In [None]:
# You can get a frequency table using the value_counts() function:
trump_df['source'].value_counts()

In [None]:
# And crosstabs using the crosstab() function:
pd.crosstab(trump_df['source'], trump_df['is_retweet'])

In [None]:
# You can use pandas to get descriptive statistics for variables (or groups of variables)
trump_df['retweet_count'].mean()

In [None]:
# And can pull back a handful of different descriptive statistics, using the ".agg()" function:
trump_df['retweet_count'].agg(['count', 'mean', 'median', 'std'])

### Converting between `pandas` DataFrames and Python data structures

When processing text, I like to use native Python data structures -- lists and dictionaries -- as they offer additional flexibility. Almost all of the code used to process and analyse text data that you see in this course assumes that our data are "lists of dictionaries". To convert from a dataframe to a list of dictionaries:

In [None]:
trump = trump_df.to_dict(orient="records")

The `trump` object is a list holding the tweets:

In [None]:
print(len(trump))

And each element of the list is a dictionary with information on the tweet:

In [None]:
print(trump[0])

So if we wanted to view the `text` for the 10th tweet in our dataset, we would use:

In [None]:
print(trump[9]['text'])

We can also convert back to a `pandas` dataframe by using the `DataFrame()` function:

In [None]:
df = pd.DataFrame(trump)
df.head()

### Writing data using `pandas`

Just as it's easy to **read** data using the `.read_` suite of functions, we can also use the `to_` set of functions to write data to disk. For instance, to write a CSV to disk that only includes the `source` and `text` from the Trump tweets data:

In [None]:
trump_df[['source', 'text']].to_csv('data/source_text_trump.csv')

### JSON formatted files

Many APIs (e.g., the Twitter API) return JSON formatted files. <a href="https://en.wikipedia.org/wiki/JSON">Wikipedia</a> describes JSON files as follows:

> "In computing, JavaScript Object Notation or JSON (/ˈdʒeɪsən/ JAY-sən) is an open-standard file format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value). It is a very common data format used for asynchronous browser–server communication, including as a replacement for XML in some AJAX-style systems."

The JSON file format looks a lot like a Python dictionary. For example,

[
   {
      "source":"Twitter for iPhone",
      "text":"Jobs are kicking in and companies are coming back to the U.S. Unnecessary regulations and high taxes are being dramatically Cut, and it will only get better. MUCH MORE TO COME!",
      "created_at":"Sat Dec 30 22:42:09 +0000 2017",
      "retweet_count":24332,
      "favorite_count":117013,
      "is_retweet":false,
      "id_str":"947236393184628741"
   }
]

We load JSON files in Python using the ``json`` module. As an example, we can load the JSON version of the 2017 Trump Twitter data (again, stored on my system in /Users/tcoan/git_repos/notebooks/data):

In [None]:
import json

# Read JSON formatted data
with open('data/trump_tweets_2017.json', 'r', encoding='utf-8') as jfile:
    jdata = json.load(jfile)

jdata[0]

In [None]:
jdata[0].keys()

We write (or dump) JSON files in the usual way. When writing JSON, I like to pass a handful of additional options to `json.dump`:

In [None]:
with open('data/pretty.json', 'w') as jfile:
    json.dump(jdata[0:10], jfile, indent=4, separators=(',', ': '), sort_keys=True)
    # Add trailing newline for POSIX compatibility
    jfile.write('\n')
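
The `json` module can also work with strings rather than files, which is handy for seeing what the options do. This sketch uses a cut-down record based on the example above; `json.dumps` and `json.loads` are the string counterparts of `json.dump` and `json.load`.

```python
import json

# A cut-down record based on the JSON example shown earlier
record = {
    'source': 'Twitter for iPhone',
    'retweet_count': 24332,
    'is_retweet': False,
}

# dumps() serialises a Python object to a JSON-formatted string
text = json.dumps(record, indent=4, sort_keys=True)
print(text)

# loads() parses a JSON string back into Python objects
restored = json.loads(text)
print(restored == record)
```

Note how Python's `False` is written as `false` in the JSON text and converted back on loading.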

Again, you can also open JSON formatted files with `pandas`:

In [None]:
# Read a json file
trump_df_json = pd.read_json('data/trump_tweets_2017.json')
trump_df_json.head()

### Pickle files

The last file format that we will use is the so-called "pickle" file. Here is how the <a href="https://docs.python.org/3/library/pickle.html">Python docs describe pickling</a>:

> “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” or “flattening”; however, to avoid confusion, the terms used here are “pickling” and “unpickling”.

We read and write pickle files (surprise, surprise) using the `pickle` module. As an example, let's "serialize" our `jdata` file and write it to disk.

In [None]:
import pickle

with open('data/trump_tweets_2017.pkl', 'wb') as pfile:
    pickle.dump(jdata, pfile)

This creates a "pickled" file in the `data` directory. Pickle files are not human readable, but they are super useful because they preserve, exactly, the Python object that you are writing to disk. We can then, at a later point, load the same object back into Python for further analysis. For example,

In [None]:
with open('data/trump_tweets_2017.pkl', 'rb') as pfile:
    tweets = pickle.load(pfile)

In [None]:
tweets == jdata

Or with `pandas`:

In [None]:
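# read_pickle() returns the pickled object itself (here, a list of
# dictionaries), so we wrap it in DataFrame() to get a data frame
tweets_df_pickle = pd.DataFrame(pd.read_pickle('data/trump_tweets_2017.pkl'))
tweets_df_pickle.head()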
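
To see the "preserves the object exactly" point without touching the disk, here is a small sketch using `pickle.dumps` and `pickle.loads`, which work with bytes in memory. The record is hypothetical and deliberately contains a tuple and a set, two types that the `json` module does not round-trip by default.

```python
import pickle

# A hypothetical record containing a tuple and a set
record = {'id': 1, 'tokens': ('fake', 'news'), 'seen': {'cnn', 'nyt'}}

# dumps() returns the pickled bytes instead of writing to a file
payload = pickle.dumps(record)
restored = pickle.loads(payload)

print(type(payload))
print(restored == record)
print(type(restored['tokens']), type(restored['seen']))
```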

## Our first program: pulling it all together

The final step in our whirlwind tour of Python is to pull our code together into a single "program" (i.e., a collection of functions that, when executed, perform a task). We will stick with our Twitter example.

<b>Example 6</b>: Write a Python program that reads a CSV file of Tweets, searches for a particular keyword, and returns the relevant Tweets.

In [None]:
# Import dependencies
import csv

# Let's write a function to read a CSV file
def read_csv(path):
    '''This function takes an (absolute) path to a CSV file and
       returns a utf-8 encoded list of tweets and the header of
       field "labels" associated with the CSV. Note that we assume
       that the first row of the input file is the header.
    
       Arguments:
       ----------
       path: absolute path to the CSV file.
       
      
       Returns:
       --------
       A dictionary with the header labels and the Tweets.
    '''
    
    with open(path, 'r', encoding='utf-8', newline='') as csvfile:
        # Connect to file
        csvreader = csv.reader(csvfile)

        # Read the tweets
        tweets = [row for row in csvreader]
    
    return {'header': tweets[0], 'tweets': tweets[1:]}


# See if a tweet includes the relevant keyword
def lookup(tweet, keyword):
    '''This function takes a tweet and keyword, and returns True if the 
       keyword is present and False otherwise.
       
       Arguments:
       ----------
       tweet: The text of a Tweet
       keyword: The keyword of interest to lookup
       
       Returns:
       --------
       True if the keyword is present and False otherwise
    '''
    
    # Standardize keyword and Tweet to use lowercase
    if keyword.lower() in tweet.lower():
        return True
    else:
        return False


# Main function to search a CSV of tweets
def search_tweets(keyword, path, text_idx = 2):
    '''This function takes a keyword and absolute path to
       a CSV file of Tweets and returns a new list of
       Tweets that contain the keyword.
       
       Arguments:
       ----------
       keyword:  The keyword of interest to lookup in the Tweet
       path: The absolute path to the CSV file holding the Tweets
       text_idx: Is the index for the element holding the Tweet text
                 (defaults to index = 2)
    '''
    
    # Read CSV content
    content = read_csv(path)
    
    # Search Tweets for keyword
    key_tweets = [tweet for tweet in content['tweets']
                  if lookup(tweet[text_idx], keyword)]
    
    print('Found %s tweets about %s' % (len(key_tweets), keyword))
    
    return key_tweets


We can now execute the `search_tweets` function to search a CSV of Tweets for a particular keyword:

In [None]:
res = search_tweets('CNN', 'data/trump_tweets_2017.csv')

And we can inspect the individual Tweets as per usual:

In [None]:
print(res[0])
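
One possible variation, sketched below, is to look up the text by column name rather than by position, using `csv.DictReader` from the standard library. The `search_rows` function, the column names, and the two rows of CSV text are all assumptions made for this illustration; an in-memory string stands in for a file on disk.

```python
import csv
import io

# Hypothetical CSV content standing in for a file on disk
csv_text = (
    'source,text\n'
    'Twitter for iPhone,"CNN is in the news again"\n'
    'Twitter Web Client,"Jobs are coming back"\n'
)


def search_rows(csvfile, keyword, field='text'):
    '''Return the rows whose `field` contains the keyword (ignoring case).'''
    # DictReader uses the first row as the header and yields one
    # dictionary per remaining row
    reader = csv.DictReader(csvfile)
    return [row for row in reader if keyword.lower() in row[field].lower()]


matches = search_rows(io.StringIO(csv_text), 'cnn')
print(len(matches), matches[0]['source'])
```

Referring to columns by name avoids having to know which index holds the tweet text, at the cost of assuming the header row uses the expected labels.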

## You try!

Let's reinforce our Python programming skills with an in-class activity. We will use the `facebook_ads_climate.csv` data, which is a random sample of 500 Facebook advertisements related to climate change in the U.S. The data was collected via the [Meta Ads Library](https://www.facebook.com/ads/library/api/), which allows you to collect "all active and inactive ads about social issues, elections or politics". Using these data, please do the following tasks:

1. Load `facebook_ads_climate.csv` using `pandas`. Use the `head` function to view the first 10 rows of the data (`head` shows only 5 rows by default, so pass it the number 10). **Note**: when reading the CSV file, please be sure to add the argument `keep_default_na=False`.

2. Convert the loaded `DataFrame` to a "list of dictionaries". Print the length of this list and ensure that it is equal to 500.

3. Let's see how many of the sampled ads are related to "energy". To do so, loop over the `ad_creative_body` text and complete the following:
    + Convert the text to lowercase.
    + Check if the text includes the word "energy". If yes, add a new field (or variable) to each row named "energy" that is equal to 1; otherwise, set "energy" equal to 0.

4. Convert your "list of dictionaries" back to a pandas data frame using the `pd.DataFrame()` function. Use the `value_counts` function to count how many of the ads are about energy.

5. Use `pandas` to export your new dataset as a CSV to disk.
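
As a hint, the same sequence of steps can be sketched on a tiny made-up table (not the exercise data). The column names `id` and `body`, the keyword "solar", and the file name `toy_output.csv` are all placeholders chosen for illustration; you will need to adapt them to the real file.

```python
import io
import os
import tempfile

import pandas as pd

# Made-up stand-in for a CSV file on disk (row 2 has an empty body)
toy_csv = io.StringIO(
    "id,body\n"
    "1,Solar power is growing\n"
    "2,\n"
    "3,Wind and SOLAR together\n"
)

# keep_default_na=False keeps empty cells as '' instead of NaN,
# so string methods such as .lower() work on every row
df = pd.read_csv(toy_csv, keep_default_na=False)
print(df.head(10))

# DataFrame -> list of dictionaries (one dictionary per row)
rows = df.to_dict('records')
print(len(rows))

# Add a 0/1 field to each row
for row in rows:
    if 'solar' in row['body'].lower():
        row['solar'] = 1
    else:
        row['solar'] = 0

# List of dictionaries -> DataFrame, then count the 0s and 1s
df_new = pd.DataFrame(rows)
counts = df_new['solar'].value_counts()
print(counts)

# Export to disk; index=False leaves out the row numbers.
# The temporary folder is used here only to keep the sketch tidy.
out_path = os.path.join(tempfile.gettempdir(), 'toy_output.csv')
df_new.to_csv(out_path, index=False)
```

The empty `body` in the second row shows why `keep_default_na=False` matters: without it, pandas would read that cell as a missing value (`NaN`), and calling `.lower()` on it would fail.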

In [None]:
#1. Load the facebook_ads_climate.csv using pandas. Use the head function to view the first 10 rows of the data.



In [None]:
#2. Convert the loaded DataFrame to a "list of dictionaries". Print the length of this list and ensure that it is equal to 500.



In [None]:
#3. Let's see how many of the sampled ads are related to "energy".



In [None]:
#4. Convert your "list of dictionaries" back to a pandas data frame using the pd.DataFrame() function. Use the value_counts function to count how many of the ads are about energy.



In [None]:
#5. Use pandas to export your new dataset as a CSV to disk.



## Object oriented programming (optional material if we have time!)

In the geekier corners of the internet (or the university campus), there's an ongoing debate on the benefits and drawbacks of functional programming (FP) versus object-oriented programming (OOP). You can ignore these debates! However, when using Python, you will often run into "<b>classes</b>", so it is important to have some knowledge of what a "class" is. Providing that knowledge is the goal of this section. (Note: for an excellent introduction to classes in Python, see the <i>Byte of Python</i> chapter on <a href="https://python.swaroopch.com/oop.html">Object Oriented Programming</a>.)

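We have actually already run into classes. For instance, the `csv.reader` code that we used to import a CSV file above gave us an object that belongs to a "class." Strings are objects too, which is why we can call methods such as `lower` on them: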

In [None]:
print(csvreader)

In [None]:
"Good bye Donald!".lower()

The first printout tells us that the `csvreader` that we assigned above is an object of a reader `class` (in Python 3, it typically reads `<_csv.reader object at ...>`). Great, but what does all this actually mean?
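
One quick way to check which class any object belongs to is the built-in `type` function. The short sketch below uses a string and a list rather than `csvreader`, so it runs on its own:

```python
text = "Good bye Donald!"

# Every value in Python is an object that belongs to some class
print(type(text))        # <class 'str'>
print(type([1, 2, 3]))   # <class 'list'>

# Methods are functions attached to the class; here, str.lower
print(text.lower())      # good bye donald!
```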

### The `class` keyword

OOP is a programming paradigm built on the idea of classes of <b>objects</b>, i.e., structures that hold data (often referred to as "attributes") and functions or procedures (often referred to as "methods"). As an example, say we were interested in defining "classes" of people walking around this university. There are different types of people and these people do different things. We can define a `professor` class as follows:

In [None]:
# Define the "professor" class. 
class professor:
    pass

In [None]:
prof = professor()

In [None]:
print(prof)

We now have a professor class, but it doesn't actually do anything yet. We can add a <b>method</b>, as follows:

In [None]:
# Define the "professor" class. 
class professor:
    def pontificate(self):
        print("I'm a professor. Blah, blah, blah.")
        

Now our professor does what professors do best: pontificate! We can now instantiate our class and call the `pontificate` method:

In [None]:
prof = professor()

In [None]:
prof.pontificate()

In [None]:
# Initialize class
travis = professor()

# Call method
travis.pontificate()

Great, but we still have a bunch of unanswered questions. What's this `self` thingy? How do I store and pass <b>attributes</b> to my professor class? Let's start with the second question. Say we wanted to add two attributes to our professor `class`: a `name` and a `subject` attribute.

In [None]:
# Define the "professor" class. 
class professor:
    def __init__(self, name, subject):
        self.name = name
        self.subject = subject
    
    def pontificate(self):
        print("My name is %s. I teach %s. Blah, blah, blah." % (self.name, self.subject))

As shown above, we can add attributes to our class by defining an `__init__` method, which Python runs whenever we create a new object, and assigning the values to `self`. Now if we instantiate and call `pontificate`:

In [None]:
# Initialize class
travis = professor('jason', 'public opinion')
print(travis)
# Call method
travis.pontificate()
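
As for the `self` question: `self` is simply the particular object that the method was called on. Python passes it in automatically as the first argument, which is how each professor gets at its own `name` and `subject`. Here is a minimal sketch that repeats the class so it runs on its own; the second professor, `sarah`, is a made-up example.

```python
class professor:
    def __init__(self, name, subject):
        # self is the new object being created; store the values on it
        self.name = name
        self.subject = subject

    def pontificate(self):
        print("My name is %s. I teach %s. Blah, blah, blah." % (self.name, self.subject))

travis = professor('jason', 'public opinion')
sarah = professor('sarah', 'statistics')  # hypothetical second object

# These two lines do the same thing: in the first, Python fills in
# self for us; in the second, we pass the object explicitly
travis.pontificate()
professor.pontificate(travis)

# Same method, different self, different attributes
sarah.pontificate()
```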

Again, it is not super important for you to understand how classes work for this class (no pun intended!). You just need to know that they exist, you initialize them with a set of attributes, and then "use" them by calling their methods.