# Homework 1 - Data Structures and Sorting


## Due: Friday, January 19, 2018,  11:59:00pm

### Submission instructions
After completing this homework, you will turn in two files via Canvas -> Assignments -> Homework 1:
* your notebook, named si330-hw1-YOUR_UNIQUE_NAME.ipynb
* the HTML file, named si330-hw1-YOUR_UNIQUE_NAME.html

### Name:  Dingan Chen
### Uniqname: dinganc
### People you worked with:  "I worked by myself"

## Objectives
After completing this homework assignment, you should know how to
* use compound data structures
* perform simple and complex sorting
* use lambda functions

In addition, this assignment will provide an opportunity to work with a large (100,000 row) data set.

## Background

Massive Open Online Courses (MOOCs) are a popular way for people to learn new skills.  The University of Michigan
offers many different MOOCs, which are produced by faculty members and supported by the Office of Academic 
Innovation.

MOOCs tend to be used by hundreds to hundreds of thousands of users.  These users leave "digital exhaust" when
they work through the MOOC in the form of web server log entries.  We have obtained a small sample of these data
files from Prof. Chris Brooks, who is a colleague here at UMSI.  The data files are de-identified: anything
that could identify a person, such as their UMID or their IP address, is "hashed" (replaced with a scrambled, one-way value).  Each line in the
data file represents a "page view" by a user.  The schema for each line is:

```umich_user_id, hashed_session_cookie_id, server_timestamp, hashed_ip, user_agent, url, initial_referrer_url, browser_language, course_id, country_cd, region_cd, timezone, os, browser, key, value```

Of note are the ```UMICH_USER_ID```, which identifies each user, and the ```HASHED_SESSION_COOKIE_ID```, which identifies a session.
Sessions are important:  they represent a collection of pageviews between the time that a user logs in and,
usually, when they log out.

We've already used the "mooc_small.csv" file in our lab.  We recommend that you continue to use that file for your
development work and then switch to the "mooc_big.csv" file (which contains 100,000 lines of data) for your
final analysis.  **Note that you must use the mooc_big.csv file for the work you submit.** 

In the lab, we went through the motions of some manipulation of the MOOC log data.  For this assignment, we'll be
asking three real-world questions:

1. How many different countries (based on ```COUNTRY_CD```) are represented in the data file?
2. Which countries do most of the page views come from?
3. For people accessing the MOOC from the US, what is the average number of page views per session?

In addition to the MOOC data file, you're also going to use a file called ```countrycodes.tsv``` to map
2-letter country codes to the full name of the country.  Why?  Because not everyone knows that PF is 
"French Polynesia".

The rest of the notebook contains specific steps that you need to follow and complete.  Places where you need to 
do something are indicated in <font color="magenta">magenta</font>.

First, let's load up the ```csv``` library; we're going to need it to read the comma- and tab-separated 
values files.

In [1]:
import csv
from collections import defaultdict

#### Step 1. Import the data

You'll load the data from the two files ```mooc_big.csv``` and ```countrycodes.tsv``` into two separate 
data structures. 

Let's start with ```countrycodes.tsv```.  Remember, we're going to use that file to map from the 
2-letter country code to the country name (e.g. from "CA" to "Canada").  
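
As a minimal sketch of this kind of lookup, the example below reads a tiny in-memory table instead of the real file. The column headers are the ones the loading cell in this notebook uses; the two sample rows and the names `sample_tsv` and `code_to_name` are made up for illustration.

```python
import csv
import io

# Hypothetical two-row sample standing in for countrycodes.tsv.
sample_tsv = (
    "ISO ALPHA-2 Code\tCountry or Area Name\n"
    "CA\tCanada\n"
    "PF\tFrench Polynesia\n"
)

# A dict gives direct lookup from country code to country name.
code_to_name = {}
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
for row in reader:
    code_to_name[row["ISO ALPHA-2 Code"]] = row["Country or Area Name"]

print(code_to_name["CA"])                 # Canada
print(code_to_name.get("ZZ", "Unknown"))  # Unknown
```

Using `.get` with a default is one way to avoid a `KeyError` if a code in the log file has no entry in the lookup table.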




 <font color="magenta">Modify the next block of code so that it loads ```countrycodes.tsv``` into a data structure
    that would allow you to look up the country name that corresponds to the 2-digit country code.</font>

In [2]:
country_names = {}  # maps 2-letter country code -> country name

with open("countrycodes.tsv", "r") as csvfile:
    reader = csv.DictReader(csvfile, delimiter = "\t", quotechar = '"')
    for row in reader:
        # "ISO ALPHA-2 Code" holds the country code for each row
        # "Country or Area Name" holds the country name for each row
        country_names[row['ISO ALPHA-2 Code']] = row['Country or Area Name']

Now load the MOOC log data into an appropriate data structure (start with the mooc_small.csv file, then remember to change to mooc_big.csv). For this file, you should store all the rows in a data structure.


 <font color="magenta">Modify the next block of code so that it loads the MOOC log data into a data structure 
   that will allow you to answer the three real-world questions.</font>

In [3]:
mooc_data_file_name = "mooc.csv" # Remember to change this later

mooc_data = []  # list of rows; each row is a dictionary keyed by column name

with open(mooc_data_file_name, "r") as csvfile:
    reader = csv.DictReader(csvfile, delimiter = ",", quotechar = '"')
    for row in reader:
        # store each row dictionary in the list
        mooc_data.append(row)

#### Step 2. Manipulating and interpreting the data to answer our questions

Now that we have our data loaded, we can start to answer the real-world questions.

Recall that the first question
is <b>"How many different countries (based on COUNTRY_CD) are represented in the data file?"</b>

To do this, you're going to have to figure out how many unique country codes there are in the MOOC log file. There are a few different ways to do this, but you probably want to use either a ```set``` or a ```dict```.
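
The two options can be sketched on a handful of made-up rows. The list `sample_rows` and the names `codes_seen` and `code_counts` are hypothetical; real rows come from `csv.DictReader` and carry many more keys.

```python
# Hypothetical rows; only the country_cd column matters for this count.
sample_rows = [
    {"country_cd": "US"},
    {"country_cd": "CA"},
    {"country_cd": "US"},
    {"country_cd": "DE"},
]

# Option 1: a set keeps a single copy of each code.
codes_seen = set()
for row in sample_rows:
    codes_seen.add(row["country_cd"])

# Option 2: a dict keyed by code also records how often each code appears.
code_counts = {}
for row in sample_rows:
    code = row["country_cd"]
    code_counts[code] = code_counts.get(code, 0) + 1

print(len(codes_seen))    # 3
print(len(code_counts))   # 3
print(code_counts["US"])  # 2
```

Both give the same number of unique codes; the dict is the better fit only if the per-code counts are needed afterwards.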

<font color="magenta">Modify the following code block so that the print statement at the end prints
    the number of countries represented in the MOOC log file.</font>

In [4]:
countries = set()  # a set stores each country code only once

for row in mooc_data:
    # adding a code that is already in the set has no effect
    countries.add(row['country_cd'])

# Do not change the following line
print("There are {0} unique countries in the MOOC log data file.".format(len(countries)))

There are 19 unique countries in the MOOC log data file.


Next you want to find out the <b>top 5 countries with the most page views</b>. For this, you should implement a composite data structure that stores, for each country, the details of each log entry: ```umich_user_id``` and ```hashed_session_cookie_id```. There are different ways to do this. One way is a dictionary of lists, keyed by country code. Think about how you would populate each list.
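
One possible shape for such a dictionary of lists is sketched below on made-up rows. Storing each log entry as a `(user id, session id)` tuple is an assumption chosen for illustration, as are the names `sample_rows` and `logs_by_country`.

```python
from collections import defaultdict

# Hypothetical rows with only the three columns used here.
sample_rows = [
    {"country_cd": "US", "umich_user_id": "u1", "hashed_session_cookie_id": "s1"},
    {"country_cd": "US", "umich_user_id": "u1", "hashed_session_cookie_id": "s1"},
    {"country_cd": "CA", "umich_user_id": "u2", "hashed_session_cookie_id": "s2"},
]

# defaultdict(list) creates an empty list the first time a country code is seen.
logs_by_country = defaultdict(list)
for row in sample_rows:
    logs_by_country[row["country_cd"]].append(
        (row["umich_user_id"], row["hashed_session_cookie_id"])
    )

print(len(logs_by_country["US"]))  # 2
print(logs_by_country["CA"])       # [('u2', 's2')]
```

With this layout, the length of each list is the number of page views from that country, and the session ids stay available for later grouping.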

After that you will sort the data structure using ```sorted```. You will need to pass ```sorted``` a ```key``` argument written as a ```lambda``` function, which specifies what the data structure is sorted by.
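
A small sketch of sorting a dictionary of lists by list length follows. The dictionary contents and the name `ranked` are invented for the example.

```python
# Hypothetical dictionary of lists: country code -> one entry per page view.
logs_by_country = {
    "CA": ["u2"],
    "US": ["u1", "u1", "u3"],
    "DE": ["u4", "u5"],
}

# .items() yields (code, list) pairs; the lambda picks the list and measures it.
ranked = sorted(
    logs_by_country.items(),
    key=lambda pair: len(pair[1]),
    reverse=True,  # largest list first
)

for code, logs in ranked[:2]:
    print(code, len(logs))
# US 3
# DE 2
```

Because `sorted` returns a new list of pairs, the original dictionary is left unchanged.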

<font color="magenta">Modify the following code block so that the print statement at the end prints
    the top 5 countries represented in the MOOC log file, and the corresponding number of users.</font>

In [5]:
country_user_data = defaultdict(list)  # country code -> list with one entry per log row

for i in mooc_data:
    country_user_data[i['country_cd']].append(i['umich_user_id'])

# Sort the (country code, list) pairs by list length, largest first.
sorted_country_user_data = sorted(country_user_data.items(), key=lambda x: len(x[1]), reverse=True)
# Do not change the following lines of code. 
# This should output the top 5 countries, along with the number of users from each of those countries.
for i in range(5):
    print(country_names[sorted_country_user_data[i][0]], len(sorted_country_user_data[i][1]))

United States of America 44
Canada 10
Germany 8
French Polynesia 5
Belarus 5


From this step on, you will be working with the ```country_user_data``` data structure.

Here, you will need to <b>filter the data so you only have entries from the US (i.e. where COUNTRY_CD is US)</b>. You need to count the log entries in each session, i.e. the rows that share the same ```hashed_session_cookie_id```.

From the ```country_user_data``` data structure, retrieve the entries from the US. Using a ```defaultdict```, count the number of log entries (rows) for each ```hashed_session_cookie_id``` in a new data structure. The number of rows gives you the number of pages the user viewed in that session.
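
The counting step can be sketched as follows, assuming each US log entry is stored as a `(user id, session id)` tuple. That layout, the sample values, and the names `us_logs` and `views_per_session` are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical US log entries, one (user id, session id) tuple per page view.
us_logs = [
    ("u1", "s1"),
    ("u1", "s1"),
    ("u1", "s2"),
    ("u3", "s3"),
    ("u3", "s3"),
    ("u3", "s3"),
]

# defaultdict(int) starts every new session id at 0, so += 1 always works.
views_per_session = defaultdict(int)
for _user_id, session_id in us_logs:
    views_per_session[session_id] += 1

print(dict(views_per_session))  # {'s1': 2, 's2': 1, 's3': 3}
```

Keying the counter on the session id rather than the user id matters here: user `u1` has two separate sessions, and each gets its own count.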

<font color="magenta">Modify the following code block so the data structure us_data contains only the entries for people from the US.</font>

In [6]:
us_data = defaultdict(int)  # key -> number of US log entries stored under that key
for row in country_user_data['US']:
    us_data[row] += 1  # every entry stored for the US adds one to its key's count

Now, you need to calculate the <b>average number of pageviews per session</b> for users in the US. ```numpy```, which will be covered later, has a built-in function for this. For now, you will iterate over the values, sum them up, and divide by the number of values. Recall the ```sum``` and ```len``` functions in Python.
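
A minimal sketch of the calculation, assuming a dictionary that already maps each session id to its page-view count. The dictionary contents and the names `views_per_session`, `counts`, and `average` are made up for illustration.

```python
# Hypothetical per-session counts: session id -> number of page views.
views_per_session = {"s1": 2, "s2": 1, "s3": 3}

# Each session contributes exactly one value, however many views it has.
counts = list(views_per_session.values())

# Guard against an empty dictionary to avoid dividing by zero.
average = sum(counts) / len(counts) if counts else 0.0

print(average)  # 2.0
```

Here there are 6 page views across 3 sessions, so the average is 2.0 views per session.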

<font color="magenta">In the following block of code put in the formula for calculating the average.</font>

In [7]:
# look up the stored count for every US log entry and collect the counts in a list
# sum(list) gives the total
# len(list) gives the number of elements in it
counts = [us_data[row] for row in country_user_data['US']]
avg_page_views_per_session = sum(counts) / len(counts)
print(avg_page_views_per_session)

2.590909090909091


Finally, you want to <b>sort the sessions to retrieve the ones that have the most log entries</b>. Call ```sorted```, pass the appropriate ```lambda``` function to the ```key``` parameter, and store the result in the data structure ```sorted_us_data```.
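
One way to rank sessions so that both the id and the count are kept is sketched below. The sample counts, the name `ranked_sessions`, and the tie-breaking rule (alphabetical by session id) are assumptions for the example, not requirements of the assignment.

```python
# Hypothetical per-session counts: session id -> number of log entries.
views_per_session = {"s1": 2, "s2": 1, "s3": 3, "s4": 2}

# Sort (id, count) pairs by count descending, then by id to break ties.
# Negating the count sorts largest-first without needing reverse=True.
ranked_sessions = sorted(
    views_per_session.items(),
    key=lambda item: (-item[1], item[0]),
)

for session_id, views in ranked_sessions[:3]:
    print(session_id, views)
# s3 3
# s1 2
# s4 2
```

Sorting `.items()` rather than the dictionary itself keeps each count next to its session id, so both can be printed together.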

<font color="magenta">In the following block, write down the code for the sorted function. The print statement should output the top 5 hashed_session_cookie_id values and the corresponding number of logs for each session.</font>

In [8]:
# reverse=True puts the keys with the largest counts first

sorted_us_data = sorted(us_data, key=lambda x: us_data[x], reverse=True)  # list of keys ordered by count

for i in range(5):
    print(sorted_us_data[i])  # prints the key of each of the top 5 entries

c7e0b7e873392815abee61a53c231a1d5866a659
1066a697903937dcb2bba46698b65c9067602b13
44185055eece5d1bc7986d743d240a7633d968ff
95cffc5948af183853735930299a0bc48c1cdc6c
70d530b2e677aa82a680b36ba534dbabc884e010
