# Week 2 Homework: Loading data

Create a new IPython Notebook and use it to answer the following assignments. Note that the **c** sections for each are meant to be a little more challenging.

## Question 1

Load the MovieData.csv dataset as described in this week's lesson, and use it to find the following values:

a. What is the average US Gross of movies in the dataset?

b. How many movies in the dataset have budgets greater than $20 million?

c. How many movies were released by each film distributor? (Hint: this could be a good place to use dictionaries)

### Solution

To solve this problem, first load the data as we did in the lesson. Then figure out what operation you need to perform on each row: summing for (a), checking a condition and then counting for (b), and adding to a dictionary and then counting for (c). Once you've determined that, iterate over all the rows and apply your operation.

### Loading the data

In [1]:
# First we import the tools we'll need
import datetime as dt

In [2]:
# Next, we define the utility functions we're going to use:

def convert_to_int(text):
    '''
    Convert a string to an integer
    '''
    try:
        return int(text)
    except (ValueError, TypeError): # Text that isn't a whole number
        return None

def make_date(date_str):
    '''
    Turn a MM/DD/YY string into a datetime object
    '''
    m, d, y = date_str.split("/")
    m = int(m)
    d = int(d)
    y = int(y)
    if y > 13:
        y += 1900
    else:
        y += 2000
    return dt.datetime(y, m, d)
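
The two-digit year handling in make_date is easy to misread, so here is a minimal sketch of the same pivot rule applied to two made-up dates (they are not taken from MovieData.csv). The function is restated compactly so the block runs on its own:

```python
import datetime as dt

def make_date(date_str):
    """Turn a MM/DD/YY string into a datetime, pivoting two-digit years at 13."""
    m, d, y = (int(part) for part in date_str.split("/"))
    # Years 14-99 are read as 1914-1999; years 00-13 as 2000-2013
    y += 1900 if y > 13 else 2000
    return dt.datetime(y, m, d)

# Hypothetical inputs, for illustration only
print(make_date("12/25/99"))  # 1999-12-25 00:00:00
print(make_date("6/15/12"))   # 2012-06-15 00:00:00
```

The cutoff of 13 only makes sense for a dataset that ends around 2013; a file with later releases would need a different pivot.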



In [3]:
# Now we open the data:
data = []
with open("MovieData.csv") as f: # The file is closed automatically after the block
    print(f.readline()) # Print the header row, which also skips it
    for row in f:
        row = row.split("\t")
        row[0] = make_date(row[0])
        row[3] = convert_to_int(row[3]) # Budget
        row[4] = convert_to_int(row[4]) # US Gross
        row[5] = convert_to_int(row[5]) # Worldwide Gross
        data.append(row)

Release_Date	Movie	Distributor	Budget	US Gross	Worldwide Gross



### a) 

To calculate the average movie gross, we first need to get all the movie grosses. Remember (or check) that the US Gross is at index 4 of each row (the fifth column).

Note: if you try to add None to a number you'll get an error; you need to check for that, or use try/except, to handle None entries. We also should count only the rows where the gross is present, and not those where it is missing.

In [4]:
total = 0.0 # Make it a float, to make division work
count = 0 # Increase by 1 for every row with a gross.
for row in data:
    try:
        total += row[4]
        count += 1
    except TypeError:
        pass # row[4] is None (gross missing), so skip this row

print(total / count)

44185793.44224795
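
The same idea can be written with an explicit None check instead of try/except. This is a minimal sketch on a small made-up list of values, not the real US Gross column; `average_present` is a hypothetical helper name:

```python
# Hypothetical sample values standing in for the US Gross column;
# None marks a row where the gross was missing.
sample_grosses = [100, None, 300, 200, None]

def average_present(values):
    """Average the entries that are not None; return None if nothing is present."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # Avoid dividing by zero when every entry is missing
    return sum(present) / len(present)

print(average_present(sample_grosses))  # 200.0
```

Filtering first keeps the count and the total in step, since both come from the same list of present values.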


### b)

This is even simpler; for each row, check the budget and keep a count of how many budgets are greater than 20 million.

In [5]:
big_budget = 0 # Counter for how many rows meet our criteria
for row in data:
    if row[3] is not None and row[3] > 20000000: # Skip missing budgets, then check the size
        big_budget += 1 
print(big_budget)

1657


### c)

For this assignment, we need to do two things: keep track of the distributors, and associate a running count with each one. This is exactly the sort of task dictionaries were developed for. Remember that a dict associates some fixed key (in this case, a distributor) with a changeable value (like a count). 

We can use a single dictionary to track all the distributors. For each row, check whether that distributor is already in the dictionary. If so, add 1 to its value; if not, start the count at 1.

In [6]:
distributors = {} # Empty dictionary
for row in data:
    distributor = row[2] # Get the film distributor
    # If we've already seen this distributor before:
    if distributor in distributors:
        distributors[distributor] += 1 
    # Otherwise, start a new counter:
    else:
        distributors[distributor] = 1 

# Now print the results:
for distributor in distributors:
    print(distributor, distributors[distributor])

 659
Paramount Pictures 258
Freestyle 2
MGM/UA 137
MGM 3
Screen Gems 1
Dimension Films 2
Testimony Pictures 1
Walt Disney Pictures 1
RKO 2
Galaxy 1
Orion 17
Paladin 1
Fine Line 16
Shooting Gallery 1
Island/Alive 1
Cloud Ten Pictures 1
Anchor Bay 4
Jerry Gross Organization 1
Zion 1
ART 1
Avco Embassy 5
Black Diamond Pictures 1
Samuel Goldwyn 9
Paramount Classics 12
RS Entertainment 1
Yash Raj 1
Sony Classics 78
Big Pictures 1
Live Entertainment 3
Sony Pictures 2
Disney 1
Off Hollywood Pictures 1
Freestyle/Darko 1
IFC Films 20
Rogue Pictures 1
American International Pictures 1
Five and Two Pictures 1
Magnolia 15
New Line 138
Videos 1
Legacy 1
New World Pictures 1
Excel Entertainment 5
Wellspring 1
PION 1
Picturehouse 6
Bigger Picture 1
Outrider Pictures 1
Overture Films 1
J.F. Prods 1
Walt Disney Co. 9
Gramercy 12
CBS Films 3
New Yorker 2
Off-Hollywood Distribution 1
Weinstein 1
Weinstein/Dimension 1
Film Sales Co. 1
Small Planet 1
Monterey Media 1
Universal/Arenas Entertainment 1
Warner
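
As an aside, the standard library's collections.Counter does this kind of tallying for you. The sketch below uses a short made-up list of distributor names rather than the real column, and also shows the equivalent plain-dictionary version using .get:

```python
from collections import Counter

# Hypothetical distributor column, for illustration only
sample_distributors = ["Orion", "MGM", "Orion", "RKO", "MGM", "Orion"]

# Counter builds the name -> count mapping in one step
counts = Counter(sample_distributors)
print(counts["Orion"])  # 3

# The same counting with a plain dict: .get returns 0 for unseen names
manual = {}
for name in sample_distributors:
    manual[name] = manual.get(name, 0) + 1
print(manual == dict(counts))  # True
```

Either form gives the same mapping as the if/else loop in the solution; the loop is just the most explicit of the three.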

## Question 2

Load the earthquake data in QuakeData.csv, and use it to answer the following questions:

a. How many earthquakes are in the dataset?

b. What is the average magnitude of earthquakes in the dataset?

c. The DateTime format in this dataset is a bit trickier than it was for movies. Try to parse it into a datetime object. Do most earthquakes happen between midnight and noon, or noon to midnight?

### Solution

This question calls on you to apply the techniques you learned in the lesson to a new dataset, with different columns. The steps will be roughly the same, with some tweaking to fit the new data.

### Loading the data

Like before, the first thing to do is to look at the first few rows of data and see how it's structured.

In [7]:
# Print the top few rows to see the format
with open("QuakeData.csv") as f:
    for i in range(5):
        print(f.readline()) # Header row, then the first four records

DateTime,Latitude,Longitude,Depth,Magnitude,MagType,NbStations,Gap,Distance,RMS,Source,EventID,Version

2012-01-01T00:30:08.770+00:00,12.008,143.487,35.0,5.1,mb,178,45,,1.20,pde,pde20120101003008770_35,1363392487731

2012-01-01T00:43:42.770+00:00,12.014,143.536,35.0,4.4,mb,29,121,,0.98,pde,pde20120101004342770_35,1363392488431

2012-01-01T00:50:08.040+00:00,-11.366,166.218,67.5,5.3,mb,143,43,,0.82,pde,pde20120101005008040_67,1363392488479

2012-01-01T01:22:07.660+00:00,-6.747,130.008,145.0,4.2,mb,14,112,,1.16,pde,pde20120101012207660_145,1363392488594



Again, we see that the first row contains the column names. Unlike the Movie data, however, the values are separated with commas, not tabs. 

Since it isn't clear what data type to expect from each column, I'm going to do something a bit different here: a function that tries to convert each record to a floating-point number, and returns the original string if that fails. I'll also use a list comprehension to make the row conversion more concise.

In [8]:
def convert(record):
    try:
        return float(record)
    except ValueError: # Not a number, so keep the text as-is
        return record

In [9]:
data = []
with open("QuakeData.csv") as f:
    print(f.readline().split(",")) # Print the column headers
    for row in f:
        row = row.split(",") # Split on commas
        row = [convert(x) for x in row] # Convert each entry
        data.append(row)

['DateTime', 'Latitude', 'Longitude', 'Depth', 'Magnitude', 'MagType', 'NbStations', 'Gap', 'Distance', 'RMS', 'Source', 'EventID', 'Version\n']
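
Notice that the last header above came out as 'Version\n': splitting by hand leaves the line break attached to the final field. Python's built-in csv module is one way to avoid that. The following is a minimal sketch that reads from an in-memory string (the first record from the preview above, shortened to five columns) instead of the real file:

```python
import csv
import io

# One record from the preview, abbreviated to the first five columns
# and held in memory so the sketch runs without QuakeData.csv
sample = io.StringIO(
    "DateTime,Latitude,Longitude,Depth,Magnitude\n"
    "2012-01-01T00:30:08.770+00:00,12.008,143.487,35.0,5.1\n"
)

def convert(record):
    """Return the record as a float if possible, otherwise unchanged."""
    try:
        return float(record)
    except ValueError:
        return record

reader = csv.reader(sample)   # Splits on commas and drops the line break
header = next(reader)         # First row holds the column names
rows = [[convert(x) for x in row] for row in reader]

print(header)
print(rows[0])
```

With the real file, the StringIO object would be replaced by the open file handle.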


### a)

Each row corresponds to a single earthquake, so a count of quakes is just a count of rows -- which is just the length of the *data* list:

In [10]:
print(len(data))

12684


### b)

To find the average magnitude, you need to make a list of all the magnitudes and then average it. Look at the columns, above, and see which one is Magnitude (it's the 5th one, so index 4 within each row). Note that you can use the built-in **sum(...)** function to sum a list of numbers. Divide the sum by the length of the list, making sure not to round it to an integer, and you have the answer.

In [11]:
quake_sizes = [row[4] for row in data] # Make a list of all magnitudes
total = sum(quake_sizes) * 1.0 # Multiply by 1.0 to make sure it's a float
print(total / len(quake_sizes))

4.558483128350625


### c)

The first step to parsing a date is figuring out its format. The **convert** function that I created above left dates as strings; let's look at some arbitrary date and see how it's formatted.

In [12]:
test_date = data[100][0] # Some arbitrary record
print(test_date)

2012-01-04T12:48:42.100+00:00


It looks like the format is Year-Month-Day, followed by the letter 'T', followed by the time. This might be a good place to use string slicing, and figure out where in the string each piece of information is. The year is the first four characters, the month is characters 6 and 7 (indices 5 and 6), etc. Test this out and make sure you've got the right indices:

In [13]:
year = test_date[:4]
month = test_date[5:7]
day = test_date[8:10]
hour = test_date[11:13]
minute = test_date[14:16]
second = test_date[17:19]

print(year, month, day, hour, minute, second)

2012 01 04 12 48 42


Now you can create a function to split the date-time string and turn it into a datetime object:

In [14]:
def make_datetime(record):
    year = int(record[:4])
    month = int(record[5:7])
    day = int(record[8:10])
    hour = int(record[11:13])
    minute = int(record[14:16])
    second = int(record[17:19])
    return dt.datetime(year, month, day, hour, minute, second)

In [15]:
# Testing the function on our test date:
print(make_datetime(test_date))

2012-01-04 12:48:42
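
For comparison, the datetime module can also do the parsing itself through strptime and a format string. This sketch is an alternative to the slicing approach, not what the solution uses; `make_datetime_strptime` is a hypothetical name, and it drops the fractional seconds and UTC offset just as the slicing version does:

```python
import datetime as dt

def make_datetime_strptime(record):
    """Parse the leading YYYY-MM-DDTHH:MM:SS part of a timestamp string."""
    # The first 19 characters cover the date, the 'T', and the time to the second
    return dt.datetime.strptime(record[:19], "%Y-%m-%dT%H:%M:%S")

print(make_datetime_strptime("2012-01-04T12:48:42.100+00:00"))  # 2012-01-04 12:48:42
```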


Now we can use make_datetime to create a list of datetime objects, one for each earthquake. We can then check the hour for each, and count how many happen in each time block.

In [16]:
# Now for the real deal:
quake_times = [make_datetime(row[0]) for row in data]
morning_count = 0
evening_count = 0
for time in quake_times:
    if time.hour < 12: 
        morning_count += 1
    else:
        evening_count += 1
print(morning_count)
print(evening_count)

6241
6443

The split is close, but slightly more earthquakes in this dataset occur between noon and midnight (6,443) than between midnight and noon (6,241). Note that the timestamps carry a +00:00 offset, so these hours are in UTC rather than local time at each earthquake's location.


## Question 3

In this question, you're asked to use what you've learned to analyze unstructured data -- a plain text file, *data.txt*, containing Facebook's description of its business in its [mandatory annual report to shareholders](http://investor.fb.com/secfiling.cfm?filingID=1326801-15-6) (called a Form 10-K).

**Bonus:** *sorted(...)* is a built-in Python function for sorting lists according to some order. For extra credit, figure out how to sort dictionary keys based on the values in descending order, and sort all your answers below. To read more about sorting, start at the Python documentation here:
https://docs.python.org/3.5/howto/sorting.html#sortinghowto

a) Read in the text in the file, and count how many times each individual word appears. For this exercise, words are separated by whitespace -- don't worry about colons, dashes, etc.

b) [Stop words](https://en.wikipedia.org/wiki/Stop_words) are words that appear so often in a language that they aren't useful for analysis (you may have noticed them in your results for **a**). Below is a list of stop words taken from [NLTK](http://www.nltk.org/), the Natural Language Toolkit for Python. Read in the document and count words again -- but this time, convert all the words to lower-case, and only include the words that *aren't* on the stop word list. Also remove any punctuation (the characters .,?!-) from the beginning and end of words. Finally, only output the words which appear more than once.

c) Write the results of the previous section (including words that appear only once) to a csv file. There should be two columns, one for the word and the other for the count. For example:

    Word,Count
    million,2
    objectives,1
    ...
    

In [17]:
STOP_WORDS = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 
'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

### Solution

### a)

For this section, we need to load *data.txt*, loop over each row, and separate each row into words. Remember, we can split on whitespace using the *.split()* method. Next, we use a dictionary to count how many times each word appears.

In [18]:
f = open("data.txt")
word_counts = {}
for row in f:
    for word in row.split():
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1
f.close()

Once we have the dictionary, we can just loop over the words and print out the word and count for each:

In [19]:
for word in word_counts:
    print(word, word_counts[word])

DAUs 1
website 1
device 1
driving 1
objectives, 1
Android, 2
types 1
in 8
whether 1
WhatsApp 1
34% 1
plugins, 1
them 6
receive 2
similarly 1
designed 1
within 2
including 3
what 1
Create 3
efforts 1
, 2
Atlas, 1
How 3
medium-sized 1
programming 1
right-hand 1
everywhere 1
computers, 1
discovery 1
payments 1
trusted 1
and 42
products 1
help 6
available 3
platform's 1
campaigns. 1
focuses 2
connect 1
others 1
Facebook. 1
Our 5
active 1
open 1
set 1
who 6
improve 1
showing 1
objectives 1
We 10
Phone 1
engaging 1
places 1
photos 2
then 1
Messenger 3
such 3
compared 2
top 1
Finally, 1
marketers, 2
exposure, 1
Instagram. 1
goods 1
Developers 1
Audience 3
million 2
creating 1
based 2
drive 1
developers. 2
generate 3
websites 1
social 1
value 2
at 1
LiveRail. 1
goals. 1
about 1
modifications 1
grow 1
performance 1
(SMS) 1
invites, 1
throughout 1
grow, 1
mission 1
connected. 1
marketing 1
chat 1
Network, 2
Payments 2
By 1
across 1
but 1
accessing 1
those 2
infrastructure 2
connected 1
distribut

For the extra credit, we can use *sorted* to sort the words. To sort a dictionary, we need to tell it to sort the keys based on their associated values; to sort from highest to lowest (instead of the default low-to-high order), we set the *reverse* parameter to True:

In [20]:
for word in sorted(word_counts, key=lambda w: word_counts[w], reverse=True):
    print(word, word_counts[word])

and 42
to 33
on 19
their 14
Facebook 13
mobile 13
of 13
the 12
people 12
from 11
We 10
a 9
ads 9
developers 9
in 8
with 8
is 8
our 7
application 7
them 6
help 6
who 6
ad 6
personal 6
for 6
applications 6
that 6
Our 5
computers. 5
enable 5
an 5
also 5
or 5
marketers 4
December 4
can 4
as 4
online 4
share 4
revenue 4
web 4
by 4
we 4
use 4
including 3
Create 3
How 3
available 3
Messenger 3
such 3
Audience 3
generate 3
providing 3
Value 3
business 3
Marketers 3
increase 3
monetize 3
devices. 3
other 3
tools 3
messaging 3
devices 3
Android, 2
receive 2
within 2
, 2
focuses 2
photos 2
compared 2
marketers, 2
million 2
based 2
developers. 2
value 2
Network, 2
Payments 2
those 2
infrastructure 2
sales, 2
in-store 2
purchase 2
insights 2
2014 2
friends 2
world 2
applications. 2
create 2
reach 2
advertisers 2
Facebook, 2
show 2
they 2
each 2
are 2
videos, 2
had 2
enables 2
Windows 2
average 2
Instagram 2
make 2
DAUs 1
website 1
device 1
driving 1
objectives, 1
types 1
whether 1
WhatsApp 1
34% 1
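
Another common way to write the same sort is to sort the (word, count) pairs from .items() directly, so there is no second dictionary lookup inside the loop. A minimal sketch, using four of the counts from the output above as sample data:

```python
# A few counts taken from the output above, as sample data
sample_counts = {"and": 42, "to": 33, "DAUs": 1, "on": 19}

# Each item is a (word, count) pair; sort on the count, highest first
ranked = sorted(sample_counts.items(), key=lambda pair: pair[1], reverse=True)

for word, count in ranked:
    print(word, count)
```

Words with equal counts keep their original dictionary order, because Python's sort is stable.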


### b)

For this problem, I reload the data and add two steps before adding each word to the count dictionary: first I lower-case the word, then I use *strip* to remove the punctuation characters .,!?- (and spaces) from its beginning and end. Afterwards, I check whether it's a stop word. If it isn't, I add it to the dictionary:

In [21]:
f = open("data.txt")
word_counts = {}
for row in f:
    for word in row.split():
        word = word.lower()
        word = word.strip(".,!?- ")
        if word not in STOP_WORDS:
            if word in word_counts:
                word_counts[word] += 1
            else:
                word_counts[word] = 1
f.close()

A more streamlined way is possible using the keyword *continue*, which skips the rest of the loop body and goes straight to the next iteration of the for loop:

In [22]:
f = open("data.txt")
word_counts = {}
for row in f:
    for word in row.split():
        word = word.lower()
        word = word.strip(".,!?- ")
        if word in STOP_WORDS:
            continue # Skip to the next word in the row
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1
f.close()
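
It helps to see exactly what *strip* does and doesn't remove. The sketch below applies the same lower-casing and stripping to a handful of tokens that appear in the part (a) output, plus a lone dash as an assumed edge case:

```python
# Tokens of the kind seen in the part (a) output; the lone "-" is an assumed edge case
tokens = ["Facebook.", "objectives,", "in-store", "-", "(SMS)"]

# strip only trims the listed characters from the two ends of the string
cleaned = [t.lower().strip(".,!?- ") for t in tokens]
print(cleaned)  # ['facebook', 'objectives', 'in-store', '', '(sms)']

# Tokens made up entirely of punctuation become empty strings
non_empty = [w for w in cleaned if w]
print(non_empty)  # ['facebook', 'objectives', 'in-store', '(sms)']
```

Interior hyphens survive, characters outside the strip set (such as parentheses) are left alone, and a token that is nothing but punctuation turns into an empty string. That empty string gets counted like any other word unless it is filtered out.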

Next, we can sort the words, this time storing them in a list.

In [23]:
sorted_words = sorted(word_counts, key=lambda w: word_counts[w], reverse=True)

Now we can loop over the sorted list, and print out those with the desired count:

In [24]:
for word in sorted_words:
    count = word_counts[word]
    if count > 1:
        print(word, count)

facebook 16
people 14
mobile 13
developers 12
ads 10
marketers 10
application 8
applications 8
devices 6
help 6
computers 6
personal 6
ad 6
share 5
enable 5
create 5
value 5
use 5
also 5
revenue 4
web 4
tools 4
messenger 4
online 4
december 4
messaging 4
 3
friends 3
increase 3
world 3
including 3
payments 3
available 3
monetize 3
reach 3
network 3
instagram 3
providing 3
generate 3
audience 3
business 3
engagement 2
phone 2
results 2
windows 2
receive 2
insights 2
within 2
campaigns 2
2014 2
videos 2
focuses 2
connect 2
whatsapp 2
objectives 2
compared 2
photos 2
million 2
based 2
advertisers 2
ios 2
brand 2
infrastructure 2
connected 2
show 2
discover 2
2013 2
android 2
products 2
sales 2
enables 2
grow 2
build 2
in-store 2
purchase 2
average 2
make 2
feed 2


### c) 

Now we need to output the results. We create a file called, in this example, "hw_3c.txt" (the contents are comma-separated regardless of the extension), add headers, and repeat the loop above. To avoid any white space between the entries and the commas, we concatenate them into a single string before writing to the file. Don't forget that we also need to add a '\n' to each line to make sure there's a line break in the text file.

In [25]:
f = open("hw_3c.txt", "w")
header = "Word,Count\n"
f.write(header)

for word in sorted_words:
    count = str(word_counts[word]) # Convert the number to a string
    row = word + "," + count + "\n"
    f.write(row)

f.close() # Don't forget to close the file!
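
The csv module can also handle the writing, which takes care of the commas and line breaks for you. This is a minimal sketch that writes to an in-memory buffer with two sample counts; with a real file you would pass something like open("hw_3c.csv", "w", newline="") instead, where the file name is only an example:

```python
import csv
import io

# Sample counts matching the example rows in the question
sample_counts = {"million": 2, "objectives": 1}

buffer = io.StringIO()  # Stands in for an open output file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Word", "Count"])   # Header row
for word, count in sample_counts.items():
    writer.writerow([word, count])   # Numbers are converted to text automatically

print(buffer.getvalue())
```

The writer also quotes any field that itself contains a comma, which plain string concatenation would not do.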