(In order to load the stylesheet of this notebook, execute the last code cell in this notebook)

# Stack Overflow

## Introduction 

In this assignment, we will look at some posts on Stack Overflow during the year of 2015 and measure the similarity of users by looking at the types of questions they answer. We will also analyze the creation dates of questions.

## Step 0. Preparation

Before we start working on the notebook, let's make sure that everything is set up properly. You should have downloaded and installed:
* [Anaconda](https://store.continuum.io/cshop/anaconda/)
* [Git](http://git-scm.com/downloads)

If you are working from the undergraduate lab (on a linux machine) these are both installed, but you need to follow the instructions [from here](https://github.com/datascience16/lectures/blob/master/Lecture2/Getting-Started.ipynb).



## Step 1. Getting the data

Let's make a sample request to retrieve the questions posted on Stack Exchange on the first day of 2015. Documentation of the Stack Exchange API can be found [here](https://api.stackexchange.com/docs).

In [2]:
import requests

start_time = 1420070400 # 01-01-2015 at 00:00:00
end_time   = 1420156800 # 01-02-2015 at 00:00:00

response = requests.get("https://api.stackexchange.com/2.2/questions?pagesize=100" +
                        "&fromdate=" + str(start_time) + "&todate=" + str(end_time) +
                        "&order=asc&sort=creation&site=stackoverflow")
print(response)

<Response [200]>


All dates in the Stack Exchange API are in [unix epoch time](https://en.wikipedia.org/wiki/Unix_time). The format for the request string is specified [here](https://api.stackexchange.com/docs/questions).
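The two timestamps used in the request above can be derived rather than hard-coded. The following is a minimal Python 3 sketch; the helper name `to_epoch` is an assumption introduced here for illustration and is not part of the assignment or of the Stack Exchange API.

```python
from datetime import datetime, timezone

def to_epoch(year, month, day):
    """Seconds since 1970-01-01 00:00:00 UTC for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

start_time = to_epoch(2015, 1, 1)  # 01-01-2015 at 00:00:00 UTC
end_time = to_epoch(2015, 1, 2)    # 01-02-2015 at 00:00:00 UTC
print(start_time, end_time)
```

Passing `tzinfo=timezone.utc` explicitly avoids depending on the local time zone of the machine that runs the notebook.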

We can try to print the response that Stack Exchange returns.

In [None]:
print(response.text)

The raw response is hard to read. Instead, we decode it as JSON and use the `json` library to pretty-print it.

In [None]:
import json

print(json.dumps(response.json(), indent=2))

Now we can easily see that the response consists of a list of question items. For each of these items, we get information about its attributes such as its `creation_date`, `answer_count`, `owner`, `title`, etc.

Notice that `has_more` is true. To get more items, we can [request the next page](https://api.stackexchange.com/docs/paging).
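Paging can be sketched as a loop that keeps requesting pages while `has_more` is true. The example below is only an illustration under stated assumptions: `iter_items`, `fetch_page`, and `max_pages` are hypothetical names, the `backoff` field is assumed to behave as described in the throttle documentation, and a small in-memory dictionary stands in for the real API so that the sketch runs offline.

```python
import time

def iter_items(fetch_page, max_pages=50):
    """Yield every item across pages.

    fetch_page(page) is a hypothetical callable that returns the decoded
    JSON wrapper (a dict) for the given 1-based page number.
    """
    for page in range(1, max_pages + 1):
        wrapper = fetch_page(page)
        for item in wrapper.get("items", []):
            yield item
        if not wrapper.get("has_more", False):
            break
        # Assumption: when present, "backoff" is a number of seconds to wait
        time.sleep(wrapper.get("backoff", 0))

# Offline stand-in for the API: two made-up pages
fake_pages = {
    1: {"items": [{"creation_date": 1420070458}], "has_more": True},
    2: {"items": [{"creation_date": 1420070503}], "has_more": False},
}
dates = [item["creation_date"] for item in iter_items(fake_pages.get)]
print(dates)
```

With `requests`, `fetch_page` could be a small function that calls `requests.get(...)` with the page number added to the query string and returns `.json()`. Capping the loop with `max_pages` is one way to stay within the rate limit while experimenting.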

-----------------

## Step 2. Parsing the responses

In this section, we practice some of the basic Python tools that we learned in class and the powerful string handling methods that Python offers. Our goal is to be able to pick the interesting parts of the response and transform them in a format that will be useful to us.

First let's isolate the creation_date in the response. Fill in the rest of the ```print_creation_dates_json()``` function that reads the response and prints the creation dates. Notice that a JSON object is basically a dictionary. **(5 pts)**

I used a simple for loop to process each item in the json object, indexing for the creation_date. This is basically traversing a dictionary and getting the value of a specific key in each item.


In [3]:
def print_creation_dates_json(response):
    """
    Prints the creation_date of all the questions in the response.
    
    Parameters:
        response: Response object
    """
    rjson = response.json()

    # Iterate over the items directly so pages with fewer than 100 items also work
    for item in rjson["items"]:
        print(item["creation_date"])

print_creation_dates_json(response)


1420070458
1420070503
1420070552
1420070577
1420070611
1420070641
1420070703
1420070727
1420070734
1420070777
1420070801
1420070848
1420070859
1420070866
1420070968
1420071005
1420071029
1420071103
1420071122
1420071175
1420071184
1420071212
1420071230
1420071340
1420071431
1420071530
1420071736
1420071794
1420071830
1420071868
1420071907
1420071929
1420071939
1420072002
1420072021
1420072074
1420072129
1420072243
1420072342
1420072354
1420072397
1420072430
1420072455
1420072481
1420072610
1420072638
1420072667
1420072685
1420072777
1420072779
1420072902
1420072924
1420072976
1420072979
1420072997
1420073055
1420073169
1420073273
1420073276
1420073352
1420073383
1420073425
1420073455
1420073456
1420073492
1420073510
1420073524
1420073787
1420073851
1420073932
1420074037
1420074057
1420074085
1420074170
1420074204
1420074224
1420074226
1420074269
1420074320
1420074334
1420074356
1420074436
1420074492
1420074515
1420074596
1420074602
1420074640
1420074817
1420074822
1420074825
1420074859


Write the code that calls the ```print_creation_dates_json()``` function to print out all the creation dates of questions posted on the first day in 2015. Please be aware of Stack Exchange's [rate limit](https://api.stackexchange.com/docs/throttle). **(5 pts)**

This is a slight modification of the previous code: a while loop retrieves each following page until none are left, printing the dates at each iteration.

In [4]:
def print_creation_dates_json(response):
    """
    Prints the creation_date of all the questions in the response,
    then requests and prints every following page until has_more is false.
    
    Parameters:
        response: Response object for the first page
    """
    
    rjson = response.json()
    pageNo = 2
    
    while True:
        # Iterate over the items directly; the last page may hold fewer than 100
        for item in rjson["items"]:
            print(item["creation_date"])
        if not rjson["has_more"]:
            break
        response = requests.get("https://api.stackexchange.com/2.2/questions?page=" + str(pageNo) + "&pagesize=100" +
                        "&fromdate=" + str(start_time) + "&todate=" + str(end_time) +
                        "&order=asc&sort=creation&site=stackoverflow")
        pageNo = pageNo + 1
        rjson = response.json()
    
print_creation_dates_json(response)
    

1420070458
1420070503
1420070552
1420070577
1420070611
1420070641
1420070703
1420070727
1420070734
1420070777
1420070801
1420070848
1420070859
1420070866
1420070968
1420071005
1420071029
1420071103
1420071122
1420071175
1420071184
1420071212
1420071230
1420071340
1420071431
1420071530
1420071736
1420071794
1420071830
1420071868
1420071907
1420071929
1420071939
1420072002
1420072021
1420072074
1420072129
1420072243
1420072342
1420072354
1420072397
1420072430
1420072455
1420072481
1420072610
1420072638
1420072667
1420072685
1420072777
1420072779
1420072902
1420072924
1420072976
1420072979
1420072997
1420073055
1420073169
1420073273
1420073276
1420073352
1420073383
1420073425
1420073455
1420073456
1420073492
1420073510
1420073524
1420073787
1420073851
1420073932
1420074037
1420074057
1420074085
1420074170
1420074204
1420074224
1420074226
1420074269
1420074320
1420074334
1420074356
1420074436
1420074492
1420074515
1420074596
1420074602
1420074640
1420074817
1420074822
1420074825
1420074859

Due to time constraints, we have downloaded the [data dump](http://cs-people.bu.edu/kzhao/teaching/stackoverflow-posts-2015.tar.gz) for Stack Overflow's posts in 2015. Note that this file is 10GB. If you don't have space on your computer, you can download it into `/scratch` on one of the machines in the undergrad lab or you can download it onto a USB drive. You may also want to work with a subset of this data at first, but your solution should be efficient enough to work with the whole dataset. For example, if you call `read()` on this file, you will get a `MemoryError`.

Write a function to parse out the questions posted in 2015. These are posts with `PostTypeId=1`. Make a `pandas DataFrame` with 4 columns: `Id`, `CreationDate`, `OwnerUserId`, and the first tag in `Tags`. Save the `DataFrame` to a file named `question_dataframe.csv`. **(10 pts)**

The code parses the XML file row by row to collect all the necessary attributes. The data is first stored in the `data` dictionary, then converted into a DataFrame, and finally written to a CSV file. I attempted to bypass the DataFrame and write straight to the CSV, but the module took too many steps and I wasn't able to fully understand the process, so I stuck with my approach. It took around 10-15 minutes to finish running on my computer; however, this isn't surprising given the age of my laptop and the relative inefficiency of passing the data through 3 data structures.

In [84]:
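import pandas as pd
from lxml import etree

xmlDirectory = 'stackoverflow-posts-2015.xml'

data = {'Id':[],'CreationDate':[],'OwnerUserId':[],'Tags':[]}

# iterparse streams the file element by element, so the dump never has to fit in memory
for event, element in etree.iterparse(xmlDirectory):
    try:
        if element.attrib["PostTypeId"] == "1":
            # Read every attribute before appending so the four lists stay the same length
            post_id = element.attrib["Id"]
            creation_date = element.attrib["CreationDate"]

            # Fall back to the display name when the post has no OwnerUserId
            try:
                owner = element.attrib["OwnerUserId"]
            except KeyError:
                owner = element.attrib["OwnerDisplayName"]

            # Strip the '<' characters and keep the text before the first '>'
            tag = element.attrib["Tags"].replace('<', '').split('>')[0]

            data['Id'].append(post_id)
            data['CreationDate'].append(creation_date)
            data['OwnerUserId'].append(owner)
            data['Tags'].append(tag)

    except KeyError:
        # Skip elements that lack a required attribute
        pass

    # Release every processed element, not only the questions, to keep memory bounded
    element.clear()

dataframe = pd.DataFrame.from_dict(data)
dataframe.to_csv('question_dataframe.csv')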
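The streaming idea in the cell above can be tried on a tiny in-memory document before running it on the full dump. This is a hedged sketch, not the method used in the cell: it swaps in the standard library's `xml.etree.ElementTree` for `lxml`, the helper names `first_tag` and `iter_questions` are hypothetical, and the two sample rows are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def first_tag(tags):
    """Return the first tag from a string such as '<python><pandas>'."""
    return tags.lstrip("<").split(">")[0]

def iter_questions(source):
    """Yield one dict per question (PostTypeId == '1') while streaming the XML."""
    for _event, element in ET.iterparse(source):
        attrib = element.attrib
        if attrib.get("PostTypeId") == "1":
            yield {
                "Id": attrib["Id"],
                "CreationDate": attrib["CreationDate"],
                "OwnerUserId": attrib.get("OwnerUserId", attrib.get("OwnerDisplayName")),
                "Tags": first_tag(attrib.get("Tags", "")),
            }
        # Drop the element's contents once it has been processed
        element.clear()

# Made-up sample: one question and one answer
sample = io.BytesIO(b"""<posts>
  <row Id="1" PostTypeId="1" CreationDate="2015-01-01T00:00:58.253" OwnerUserId="10" Tags="&lt;python&gt;&lt;pandas&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" CreationDate="2015-01-01T00:05:00.000" OwnerUserId="11" />
</posts>""")

rows = list(iter_questions(sample))
print(rows)
```

Writing the parser as a generator keeps only one row in memory at a time, so the same function could feed a DataFrame in chunks or write rows straight to a file.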

In [85]:
question_df = pd.read_csv('question_dataframe.csv')
print(question_df)

         Unnamed: 0             CreationDate        Id OwnerUserId  \
0                 0  2015-01-01T00:00:58.253  27727385     3210431   
1                 1  2015-01-01T00:01:43.673  27727388      868779   
2                 2  2015-01-01T00:02:32.123  27727391     4372672   
3                 3  2015-01-01T00:02:57.983  27727393     2482149   
4                 4  2015-01-01T00:03:31.337  27727394     4263870   
5                 5  2015-01-01T00:04:01.407  27727396     4409381   
6                 6  2015-01-01T00:05:03.773  27727406      875317   
7                 7  2015-01-01T00:05:27.167  27727407      821742   
8                 8  2015-01-01T00:05:34.733  27727408     2595033   
9                 9  2015-01-01T00:06:17.720  27727409     1815395   
10               10  2015-01-01T00:06:41.067  27727410      541091   
11               11  2015-01-01T00:07:28.747  27727414     1210038   
12               12  2015-01-01T00:07:39.243  27727418     3674356   
13               13 

-----------------

## Step 3. Putting it all together

We are now ready to tackle our original problem. Write a function to measure the similarity of the top 1000 users with the most answer posts. Compare the users based on the types of questions they answer. We will categorize the questions by looking at the first tag in each question. You may choose to implement any one of the similarity/distance measures we discussed in class. Document your findings. **(30 pts)**

Note that answers are posts with `PostTypeId=2`. The ID of the question in answer posts is the `ParentId`.

You may find the [sklearn.feature_extraction module](http://scikit-learn.org/stable/modules/feature_extraction.html) helpful.

This step was similar to Step 2, except that it performs a few additional checks against the questions CSV file we generated previously. First, an iterparse pass runs through the original XML file, counting the number of answers posted by each user ID. Once the counts are compiled, they are sorted and the top 1000 users are kept for a second iterparse pass. The second pass records the tags associated with each of these users by cross-referencing the parent ID of each answer with the IDs in the question DataFrame CSV.

In [30]:
import pandas as pd
from lxml import etree

xmlDirectory = 'stackoverflow-posts-2015.xml'
question_df = pd.read_csv('question_dataframe.csv')

# Map each question Id to its first tag once, so every lookup is a dictionary access
# instead of a scan over the whole DataFrame
question_tags = dict(zip(question_df["Id"], question_df["Tags"]))

# First pass: count the answers posted by each user
user_count = {}

for event, element in etree.iterparse(xmlDirectory):
    attrib = element.attrib
    if attrib.get("PostTypeId") == "2" and "OwnerUserId" in attrib:
        user = attrib["OwnerUserId"]
        user_count[user] = user_count.get(user, 0) + 1
    # Release every processed element to keep memory bounded
    element.clear()
    
user_count = pd.DataFrame(user_count,index=[0]).transpose()
user_count.columns=['Count']
user_count = user_count.sort_values('Count', ascending = False)
user_count = user_count.head(1000)

users_top = frozenset(user_count.index.values)

# Second pass: for each top user, count the first tags of the questions they answered
user_df = {}

for event, element in etree.iterparse(xmlDirectory):
    attrib = element.attrib
    if (attrib.get("PostTypeId") == "2" and attrib.get("OwnerUserId") in users_top
            and "ParentId" in attrib):
        qid = int(attrib["ParentId"])
        user = attrib["OwnerUserId"]
        # Skip answers whose question is not in the question file
        if qid in question_tags:
            tag = question_tags[qid]
            tag_counts = user_df.setdefault(user, {})
            tag_counts[tag] = tag_counts.get(tag, 0) + 1
    element.clear()

# Rows are tags, columns are user IDs
userData = pd.DataFrame.from_dict(user_df)
userData.to_csv('user_dataframe.csv')

print(userData)

                  Unnamed: 0  1000933  100297  1003142  1009479  1009603  \
0                  .htaccess        0       0        0        0        0   
1                       .net        0       0        0        0        0   
2                   .net-3.5        0       0        0        0        0   
3                   .net-4.0        0       0        0        0        0   
4     .net-framework-version        0       0        0        0        0   
5                         2d        0       0        0        0        0   
6                     32-bit        0       0        0        0        0   
7                         3d        0       0        0        0        0   
8                      64bit        0       0        0        0        0   
9       abstract-syntax-tree        0       0        0        0        0   
10             accelerometer        0       0        0        0        0   
11              access-token        0       0        0        0        0   
12          

The second part involves creating a 1000x1000 matrix to generate a Jaccard similarity index for every combination of user data. The data was drawn from indexing userData to the number of appearances of each tag, and then each jaccard similarity value is calculated and placed down in the matrix.
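For reference, the Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|. The sketch below shows one way to apply it to users, assuming each user is represented by the set of tags they answered at least once. The function name `jaccard_on_tags` and the two example users are hypothetical and are not taken from the dataset.

```python
def jaccard_on_tags(counts_a, counts_b):
    """Jaccard similarity of the sets of tags with a nonzero count.

    counts_a and counts_b map tag name -> number of answers.
    """
    tags_a = {tag for tag, n in counts_a.items() if n > 0}
    tags_b = {tag for tag, n in counts_b.items() if n > 0}
    union = tags_a | tags_b
    if not union:
        # Two users without any tags: define the similarity as 0
        return 0.0
    return len(tags_a & tags_b) / float(len(union))

# Made-up users for illustration
alice = {"python": 12, "pandas": 3, "numpy": 0}
bob = {"python": 5, "java": 7}
similarity = jaccard_on_tags(alice, bob)
print(similarity)
```

Here the users share one tag out of three distinct tags overall, so the similarity is one third. This variant ignores how often a tag was answered; a weighted measure would be needed to take the counts into account.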

In [61]:
import pandas as pd

userData = pd.read_csv('user_dataframe.csv')

def jaccard_sim(x,y):
    """
    Jaccard similarity of the sets of distinct values in x and y.
    Here x and y are the per-tag answer counts of two users, so the sets
    hold count values rather than tag names.
    """
    sx, sy = set(x), set(y)
    return len(sx & sy)/float(len(sx | sy))

userData = userData.fillna(0) 
# The first column holds the tag names; the remaining columns are user IDs
userList = userData.columns.values[1:]
jaccard = pd.DataFrame(index= userList, columns= userList)

for i in userList:
    for j in userList:
        # .loc avoids chained indexing, which may assign to a temporary copy
        jaccard.loc[j, i] = jaccard_sim(userData[i].values, userData[j].values)

print(jaccard)

jaccard.to_csv('jaccard_df.csv')

           1000933    100297   1003142    1009479   1009603   1011527  \
1000933          1    0.3125  0.214286       0.25  0.266667  0.285714   
100297      0.3125         1  0.272727   0.222222  0.333333  0.363636   
1003142   0.214286  0.272727         1   0.285714  0.333333     0.375   
1009479       0.25  0.222222  0.285714          1      0.25  0.266667   
1009603   0.266667  0.333333  0.333333       0.25         1  0.444444   
1011527   0.285714  0.363636     0.375   0.266667  0.444444         1   
1012053   0.263158       0.4  0.214286   0.315789  0.461538  0.285714   
101361        0.25  0.307692       0.3     0.3125  0.363636       0.4   
1016435   0.294118  0.266667      0.25   0.277778  0.307692  0.333333   
1025118        0.2  0.363636     0.375     0.1875       0.3  0.333333   
1030675   0.291667  0.272727  0.142857       0.28  0.181818  0.190476   
103167    0.333333       0.4  0.214286   0.190476  0.357143  0.285714   
1038015        0.3  0.352941    0.1875   0.227273  

Let's plot a subset of the distance matrix. Order the pairwise distance in your distance matrix (excluding the entries along the diagonal) in increasing order and pick user pairs until you have 100 unique users. See [Lecture 3](https://github.com/datascience16/lectures/blob/master/Lecture3/Distance-Functions.ipynb) for examples. **(10 pts)**
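The selection procedure described above can be sketched as follows. This is an illustration under assumptions: `pick_closest_users` is a hypothetical helper, the distance matrix is assumed to be a symmetric NumPy array (for example 1 minus a similarity), and the 4x4 matrix is made up.

```python
import numpy as np

def pick_closest_users(dist, k):
    """Walk the off-diagonal pairs in increasing distance and collect
    row/column indices until k unique users have been picked."""
    n = dist.shape[0]
    # Upper triangle without the diagonal: each unordered pair exactly once
    rows, cols = np.triu_indices(n, k=1)
    order = np.argsort(dist[rows, cols], kind="stable")
    chosen, seen = [], set()
    for idx in order:
        for user in (int(rows[idx]), int(cols[idx])):
            if user not in seen and len(chosen) < k:
                seen.add(user)
                chosen.append(user)
        if len(chosen) >= k:
            break
    return chosen

# Made-up symmetric distance matrix for four users
dist = np.array([
    [0.0, 0.9, 0.1, 0.5],
    [0.9, 0.0, 0.8, 0.2],
    [0.1, 0.8, 0.0, 0.7],
    [0.5, 0.2, 0.7, 0.0],
])
picked = pick_closest_users(dist, 3)
print(picked)
```

The returned indices can then be used to slice the full matrix, for example `dist[np.ix_(picked, picked)]`, before plotting.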

Again, we reuse the code from above, but add random sampling of `userList` to pick 100 users. A set is used to make sure there are no duplicates. Then a smaller 100 by 100 matrix is generated with the sampled users as both rows and columns, and the Jaccard similarity index is computed for each pair of users. Finally, a `pcolor` heat map from the matplotlib library plots this subset.

In [None]:
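import matplotlib.pyplot as plt
import numpy as np

# Sample user IDs at random; the set guarantees 100 distinct users
userListSample = set()

while len(userListSample) < 100:
    j = np.random.randint(0,high=len(userList))
    userListSample.add(userList[j])

# A DataFrame index needs an ordered sequence, so convert the set to a sorted list
userListSample = sorted(userListSample)

jaccard_sample = pd.DataFrame(index= userListSample, columns= userListSample)

for i in userListSample:
    for j in userListSample:
        # .loc avoids chained indexing, which may assign to a temporary copy
        jaccard_sample.loc[j, i] = jaccard_sim(userData[i].values, userData[j].values)

jaccard_sample.to_csv('jaccard_sample.csv')
print(jaccard_sample)

# pcolor needs numeric data; the frame was created with object dtype
plt.pcolor(jaccard_sample.astype(float))
plt.yticks(np.arange(0.1, len(jaccard_sample.index), 1), jaccard_sample.index)
plt.xticks(np.arange(0.1, len(jaccard_sample.columns), 1), jaccard_sample.columns)
plt.show()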


Next, let's create some time series from the data. Look at the top 100 users with the most question posts. For each user, your time series will be the `CreationDate` of the questions posted by that user. You may want to make multiple time series for each user based on the first tag of the questions. Compare the time series using one of the methods discussed in class. Document your findings. **(30 pts)**

You may find the [pandas.DataFrame.resample method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html) helpful.
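One possible starting point is to turn a user's `CreationDate` values into a regularly spaced count series with `resample`. The sketch below is only an illustration: the helper `weekly_question_counts`, the choice of weekly bins, and the three sample rows are assumptions, not part of the assignment data.

```python
import pandas as pd

def weekly_question_counts(df, user_id):
    """Number of questions per calendar week for one user."""
    user_rows = df[df["OwnerUserId"] == user_id]
    dates = pd.to_datetime(user_rows["CreationDate"])
    # One event per question, indexed by time, summed per week
    return pd.Series(1, index=dates).resample("W").sum()

# Made-up rows in the same shape as question_dataframe.csv
sample_df = pd.DataFrame({
    "Id": [1, 2, 3],
    "CreationDate": ["2015-01-01T00:00:58.253",
                     "2015-01-02T10:15:00.000",
                     "2015-01-10T08:30:00.000"],
    "OwnerUserId": [10, 10, 10],
    "Tags": ["python", "python", "pandas"],
})
counts = weekly_question_counts(sample_df, 10)
print(counts)
```

Resampling every user onto the same bins gives series of comparable shape, which makes it easier to apply a distance measure afterwards. Filtering `df` by `Tags` before calling the helper would give one series per user and tag.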

In [None]:
question_df = pd.read_csv('question_dataframe.csv')



Plot the 2 most similar and the 2 most different time series. **(10 pts)**

In [None]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("../theme/custom.css", "r").read()
    return HTML(styles)
css_styling()