<h1>Find Your Expert</h1>
<br>
The purpose of our application is to identify the best comments for a user defined article. In order to provide the best and most relevent comments for articles that have somehow escapped the reddit verse, find similar articles that have been posted on and we present the best comments. We identify the best comments through a prediction model.
<br>
<h2>Prediction Strategy</h2>
<img src="PredictionStrat.png">
The above proccess identifies our prediction strategy but we requrie a slightly different approach. In order to train our model we need to collect articles that *Do* exist in the reddit verse and the top comments related to that article. We can then use this data to extract the features from the comments and the relation to the article, and construct a model to predict the score of the comments. 
<br>
<h2>Training Strategy</h2>
<img src="TrainingStrat.png">
<br>
<br>
The collection strategy was designed to maximise effective data collection while adhearing to the <a href="https://www.reddit.com/wiki/licensing">**Reddit API Guidlines**</a>.
In order to provide the most effective training data the collection strategy was designed with the following coniderations
<br>
* Only collect comments directly in response to an orginal article post. We restrict our collection to first level responses in an effort to avoid collecting comments not directly related to the orginal article.
* Collect from specific sub-reddits where the orginal content is usually an external source ie. r/news, r/TIL
* Avoid sub-reddits where there is no orginal content like AMA's, or ELI5
* Avoid multi-media subreddits like r/pics, r/videos

<h1>Things!</h1>

The Reddit API consits of several *things* (the base class). The API provided documentation on several different <a href='https://github.com/reddit/reddit/wiki/JSON'> *Things!*</a>; however we are most interested in the following:

<h2>Links</h2>
Links are the orginal post, all of the comments we are interested in our in response to a specific link. A link object has several tags that can all be found in the <a hfref= 'https://github.com/reddit/reddit/wiki/JSON'> *things documentation*</a>. The most relevent tags to our collection strategy are:

|**Type**|**Tag**|**Description**|
|--|--|-------------------------------|
|String|id|this links identifier, e.g. "8xwlg"|
|String|name|Fullname of link|
|String|author|the account name of the poster.|
|String|domain|the domain of this link. Self posts will be self.subreddit while other examples include en.wikipedia.org and s3.amazon.com
|boolean|over_18|true if the post is tagged as NSFW. False if otherwise|
|int|num_comments|the number of comments that belong to this link. includes removed comments.|
|String|permalink|relative URL of the permanent link for this link|
|int|score|the net-score of the link. note: A submission's score is simply the number of upvotes minus the number of downvotes. If five users like the submission and three users don't it will have a score of 2. Please note that the vote numbers are not "real" numbers, they have been "fuzzed" to prevent spam bots etc. So taking the above example, if five users upvoted the submission, and three users downvote it, the upvote/downvote numbers may say 23 upvotes and 21 downvotes, or 12 upvotes, and 10 downvotes. The points score is correct, but the vote totals are "fuzzed".|
|String|subreddit|subreddit of thing excluding the /r/ prefix. "pics"|
|String|title|the title of the link. may contain newlines for some reason|

<br>
<h2>Comments</h2>
Comments are the actual comments made in regard to a specific link. The most relevent tags to our collection strategy are:

|**Type**|**Tag**|**Description**|
|--|--|-------------------------------|
|String|id|this comment identifier, e.g. "8xwlg"|
|String|name|Fullname of comment|
|String|author|the account name of the poster.|
|String|body|the raw text. this is the unformatted text which includes the raw markup characters are escaped.|
|int|gilded|the number of times this comment received reddit gold|
|String|link_id|ID of the link this comment is in|
|String|parent_id|ID of the thing this comment is a reply to, either the link or a comment in it|
|int|score|the net-score of the comment|


<h2>Listings</h2>
The last *Thing* we need is the listing class. In order to facilitate bulk data collection we can make use of the listing class to request list of the things we are interested in. The listing class literaly consists of a list of other things. A listing object is restricted to 100 Things so we also make use of the paging concept to get longer lists of things we want. The paging concept of the listing class is controlled by the following tags:

|**Type**|**Tag**|**Description**|
|--|--|-----------------| 
|String|before|The fullname of the listing that follows before this page. null if there is no previous page|
|String|after|The fullname of the listing that follows after this page. null if there is no next page.|


<h2>Header Information</h2>
In addtion to the JSON objects themselves we also make use of the response headers to stay under the rate limiting. Reddit nicely asks to make no more than 30 requests per minute. We use the following response headers to ensure we play by the rules. 

|**Header Tag**|**Description**|
|----|----------| 
|X-Ratelimit-Used|Approximate number of requests used in this period|
|X-Ratelimit-Remaining| Approximate number of requests left to use|
|X-Ratelimit-Reset|Approximate number of seconds to end of period|


<h2>Create a Reddit API Request Token</h2>
<br>
Before we can start looking at *Things!* we need to we need to create a scripting application and an authorized token. The application is our vehicle for making authorized requests to the reddit API. The application **ExpertCollector/0.1 by cs109-2015** was created using https://github.com/reddit/reddit/wiki/OAuth2-Quick-Start-Example and named using reddit user agent naming conventions.
<br>
Once the application was created we can use the user and application credentials to request a token. User and application credentials are not to be shared and will only be distributed as required. The credentials text file is of the following format:
<br>
<br>
client_id: application id<br>
username: username<br>
pw: password<br>
secret_id<br>

In [1]:
import requests
import requests.auth
import time
import os
import json
import pandas as pd
import sys

from IPython.display import clear_output

In [2]:
def getToken(creds):
    client_auth = requests.auth.HTTPBasicAuth(creds['client_id'], creds['secret_id'])
    post_data = {"grant_type": "password", "username": creds['username'], "password": creds['pw']}
    headers = {"User-Agent": creds['user_agent']}
    response = requests.post("https://www.reddit.com/api/v1/access_token", auth=client_auth, data=post_data, headers=headers)
    if response.status_code == 200:
        print 'Credentials Verified: Token Recived'
    else:
        print 'Invalid Creds'

    auth = response.json()['token_type']+' '+response.json()['access_token']
    return {"Authorization": auth, "User-Agent": creds['user_agent']}

We can use a created token and our application name to construct the authorized header for application requests. Our request token is valid for 1 hour so we may have to do this more than once.

In [3]:
with open("creds.txt") as f:
    creds = dict([line.strip().split(':') for line in f])
token = getToken(creds)

Credentials Verified: Token Recived


We can now use the created token to make authorized requests, through our application, directlty to the reddit API. As an example let's see how active our cs109-2015 reddit account has been.

In [4]:
response = requests.get("https://oauth.reddit.com/api/v1/me", headers=token)
response.json()

{u'comment_karma': 0,
 u'created': 1448151333.0,
 u'created_utc': 1448122533.0,
 u'gold_creddits': 0,
 u'gold_expiration': None,
 u'has_mail': True,
 u'has_mod_mail': False,
 u'has_verified_email': False,
 u'hide_from_robots': False,
 u'id': u's9gd4',
 u'inbox_count': 1,
 u'is_gold': False,
 u'is_mod': False,
 u'is_suspended': False,
 u'link_karma': 1,
 u'name': u'cs109-2015',
 u'over_18': False,
 u'suspension_expiration_utc': None}

<h2>Collecting Training Links</h2>
<br>
Now that we can make requests, our first step is to find links (articles) we can use for our training data. In order to get a broad set of relevent training data we aim to collect the Top *n* Links, in a specified time period,for a set of specific sub-reddits we are interested in. The following function lets us do this.

In [26]:
'''
GetTrainingLinks(subreddit_list,n_links,time_window)

Description:
############
Collectes the top n links, in the specified time window, from each subreddit provided.

Runtime(seconds) = n_links/50 * length(subreddit_list)

Parameters:
###########
subreddit_list: a list of subreddit names

n_links: total number of links per sub-reddit to collect

time_window: one of ('hour', 'day', 'week', 'month', 'year', 'all')

outfile: path to output location

Outputs:
#########
Writes a single JSON per line to the outfile. 

Returns:
##########


'''
url = "https://oauth.reddit.com/r"

def getTrainingLinks(subreddit_list,n_links,time_window,outfile):
    token = getToken(creds)
    for subreddit in subreddit_list:
        n=0
        while n < n_links:
            if n == 0:
                query= 'top?limit=100&t={0}'.format(time_window)
            else:
                query= 'top?limit=100&t={0}&after={1}'.format(time_window,after)
            request_url= "/".join([url,subreddit,query])
            response = requests.get(request_url, headers=token)
            after = response.json()['data']['after']
            n+=100
            if not os.path.exists(os.path.dirname(outfile)):
                os.makedirs(os.path.dirname(outfile))
            with open(outfile,'a') as link_file:
                   for link in response.json()['data']['children']:
                        json.dump(link['data'],link_file)
                        link_file.write('\n')
            clear_output()
            sys.stdout.write('{0} r/{1} Links Collected'.format(n,subreddit))
            sys.stdout.flush()
            time.sleep(2) # So we respect the rate limits! 
    
    clear_output()
    print 'Done.'
    return outfile

<h2> Sample Collection</h2>
Here we specify our collection parameters, this is just a sample. The actual training collection prameters will be more extensive.

In [27]:
subreddit_list = ['science','news','worldnews','dataisbeautiful','todayilearned']
n_links = 1000
time_window ='month'
outfile ='./links.txt'

training_links = getTrainingLinks(subreddit_list,n_links,time_window,outfile)

Done.


In [28]:
links = []
#Tags to keep
tags= [ u'author',u'created_utc', u'domain', u'downs', u'gilded',u'is_self', u'likes', u'media', 'id',
 u'num_comments', u'num_reports', u'over_18', u'permalink',u'score', u'selftext', u'subreddit', u'thumbnail', u'title', u'ups', u'url']

with open(training_links) as data_file:
    for link in data_file:
        links.append(pd.read_json(link,orient='records',typ='series')[tags])
linkdf = pd.concat(links,axis=1).transpose()
print linkdf.shape
linkdf.head()

(3988, 20)


Unnamed: 0,author,created_utc,domain,downs,gilded,is_self,likes,media,id,num_comments,num_reports,over_18,permalink,score,selftext,subreddit,thumbnail,title,ups,url
0,Bloomsey,1446232000.0,thelatestnews.com,0,0,False,,,3qvj7a,824,,False,/r/science/comments/3qvj7a/researchers_have_de...,8201,,science,http://b.thumbs.redditmedia.com/RyxLNlGq2NUKfz...,Researchers have developed a blood test that c...,8201,http://www.thelatestnews.com/single-drop-of-bl...
1,trpftw,1446582000.0,cnn.com,0,0,False,,,3recdd,3965,,False,/r/science/comments/3recdd/mass_killings_and_s...,7174,,science,http://b.thumbs.redditmedia.com/JA2kgMdHb-dCuQ...,"Mass killings and school shootings spread ""con...",7174,http://www.cnn.com/2015/07/02/health/contagiou...
2,the_phet,1447239000.0,ibtimes.co.uk,0,0,False,,,3se6lu,1078,,False,/r/science/comments/3se6lu/algae_has_been_gene...,6588,,science,http://b.thumbs.redditmedia.com/y1CGKgl69hKw-s...,Algae has been genetically engineered to kill ...,6588,http://www.ibtimes.co.uk/algae-genetically-eng...
3,drewiepoodle,1445600000.0,phys.org,0,0,False,,,3pw7xy,1913,,False,/r/science/comments/3pw7xy/one_of_the_oddest_p...,6259,,science,http://b.thumbs.redditmedia.com/pnxM2olTL3M4el...,One of the oddest predictions of quantum theor...,6259,http://phys.org/news/2015-10-zeno-effect-verif...
4,godsenfrik,1447707000.0,news.stanford.edu,0,0,False,,,3t2exx,708,,False,/r/science/comments/3t2exx/when_scientists_fal...,6151,,science,http://b.thumbs.redditmedia.com/FQPskQP8EejVh9...,"When scientists falsify data, they try to cove...",6151,http://news.stanford.edu/news/2015/november/fr...


<h2>Finding Usefull Links</h2>
<br>
Now we have a dataframe of potential links but before we can collect the comments we do some prunning. We remove links based on the following criteria:

* Remove any potential NSFW links using the over_18 tag
* Remove any links with no article using the is_self tag
* Remove links with fewer than 10 comments
* Remove any duplicate URLs. 

We write the list of unique article urls so we can go get the original article. 

In [29]:
training_links = linkdf.loc[(~linkdf.is_self.fillna(False)) & (~linkdf.over_18.fillna(False)) & (linkdf.num_comments > 10),:]
training_links.url.unique().tofile('train_urls.txt',"\n")

<h2> Collecting Training Comments</h2>

Now that we know what links (articles) we are interested in. We need to go get the top comments. To maximise the relevencey to the original article we restrict our comment collector to the direct children of the article. We also want to control how many comments for each article we recive. The following function let's us do this.

In [30]:
'''
GetTrainingComments(link_list,depth, outfile)

Description:
############
Collectes the first children for each provided link. 

Runtime(seconds) = len(link_list)/2

Parameters:
###########
linkdf: a dataframe where each record is a link. 
        Required columns: subreddit, name

outfile: path to output location

depth: the max depth of comments to return. 


Outputs:
#########
Writes a single comment JSON per line to the outfile. 

Returns:
##########


'''
url = "https://oauth.reddit.com/r"

def getTrainingLinks(df,depth, outfile):
    token = getToken(creds)
    for subreddit, name in [tuple(x) for x in df[['subreddit','id']].values]:
            query= 'comments/{0}/?depth={1}'.format(name,depth)
            request_url= "/".join([url,subreddit,query])
            response = requests.get(request_url, headers=token)
            if not os.path.exists(os.path.dirname(outfile)):
                os.makedirs(os.path.dirname(outfile))
            with open(outfile,'a') as comment_file:
                   for comment in response.json()[1]['data']['children'][:-1]:
                        json.dump(comment['data'],comment_file)
                        comment_file.write('\n')
    time.sleep(2) # So we respect the rate limits! 
    print 'Done.'
    return outfile

In [31]:
training_comments = getTrainingLinks(linkdf,1,'./comments.txt')

Credentials Verified: Token Recived
Done.


In [32]:
comments = []
#Tags to keep
tags= [ u'author', u'body', u'body_html', u'controversiality', u'created', u'created_utc', u'distinguished', u'downs',
 u'edited', u'gilded', u'id', u'likes', u'link_id', u'name', u'num_reports', u'parent_id', u'replies', u'score',
 u'subreddit', u'ups']

with open(training_comments) as data_file:
    for comment in data_file:
        comments.append(pd.read_json(comment,orient='records',typ='series')[tags])
commentdf = pd.concat(comments,axis=1).transpose()
commentdf.head()

Unnamed: 0,author,body,body_html,controversiality,created,created_utc,distinguished,downs,edited,gilded,id,likes,link_id,name,num_reports,parent_id,replies,score,subreddit,ups
0,Voerendaalse,&gt; Subsequent validation using a separate va...,"&lt;div class=""md""&gt;&lt;blockquote&gt;\n&lt;...",0,1446262000.0,1446233000.0,,0,1.446234e+09,0,cwipf39,,t3_3qvj7a,t1_cwipf39,,t3_3qvj7a,"{u'kind': u'Listing', u'data': {u'modhash': No...",1685,science,1685
1,Bloomsey,Peer-reviewed article: http://www.cell.com/can...,"&lt;div class=""md""&gt;&lt;p&gt;Peer-reviewed a...",0,1446261000.0,1446232000.0,,0,False,0,cwiohv5,False,t3_3qvj7a,t1_cwiohv5,,t3_3qvj7a,"{u'kind': u'Listing', u'data': {u'modhash': No...",660,science,660
2,schnupfndrache7,is this as revolutionary as it sounds?,"&lt;div class=""md""&gt;&lt;p&gt;is this as revo...",0,1446264000.0,1446235000.0,,0,False,0,cwiqorg,,t3_3qvj7a,t1_cwiqorg,,t3_3qvj7a,"{u'kind': u'Listing', u'data': {u'modhash': No...",275,science,275
3,geoffp82,And then what? Full body MRI?,"&lt;div class=""md""&gt;&lt;p&gt;And then what? ...",0,1446262000.0,1446233000.0,,0,False,0,cwipio1,,t3_3qvj7a,t1_cwipio1,,t3_3qvj7a,"{u'kind': u'Listing', u'data': {u'modhash': No...",64,science,64
4,momoneymoproblemss,Is it weird I would like to take this test on ...,"&lt;div class=""md""&gt;&lt;p&gt;Is it weird I w...",0,1446263000.0,1446235000.0,,0,False,0,cwiqags,,t3_3qvj7a,t1_cwiqags,,t3_3qvj7a,"{u'kind': u'Listing', u'data': {u'modhash': No...",131,science,131


In [39]:
print commentdf.shape
commentdf.to_csv('./training_comments.csv',sep=',',encoding='utf-8',index=False)

(109964, 20)


In [40]:
comments = pd.read_csv('./training_comments.csv',encoding='utf-8')
comments.head()

  data = self._reader.read(nrows)


Unnamed: 0,author,body,body_html,controversiality,created,created_utc,distinguished,downs,edited,gilded,id,likes,link_id,name,num_reports,parent_id,replies,score,subreddit,ups
0,Voerendaalse,&gt; Subsequent validation using a separate va...,"&lt;div class=""md""&gt;&lt;blockquote&gt;\n&lt;...",0,1446262140,1446233340,,0,1446234176.0,0,cwipf39,,t3_3qvj7a,t1_cwipf39,,t3_3qvj7a,"{u'kind': u'Listing', u'data': {u'modhash': No...",1685,science,1685
1,Bloomsey,Peer-reviewed article: http://www.cell.com/can...,"&lt;div class=""md""&gt;&lt;p&gt;Peer-reviewed a...",0,1446260748,1446231948,,0,False,0,cwiohv5,False,t3_3qvj7a,t1_cwiohv5,,t3_3qvj7a,"{u'kind': u'Listing', u'data': {u'modhash': No...",660,science,660
2,schnupfndrache7,is this as revolutionary as it sounds?,"&lt;div class=""md""&gt;&lt;p&gt;is this as revo...",0,1446264074,1446235274,,0,False,0,cwiqorg,,t3_3qvj7a,t1_cwiqorg,,t3_3qvj7a,"{u'kind': u'Listing', u'data': {u'modhash': No...",275,science,275
3,geoffp82,And then what? Full body MRI?,"&lt;div class=""md""&gt;&lt;p&gt;And then what? ...",0,1446262287,1446233487,,0,False,0,cwipio1,,t3_3qvj7a,t1_cwipio1,,t3_3qvj7a,"{u'kind': u'Listing', u'data': {u'modhash': No...",64,science,64
4,momoneymoproblemss,Is it weird I would like to take this test on ...,"&lt;div class=""md""&gt;&lt;p&gt;Is it weird I w...",0,1446263469,1446234669,,0,False,0,cwiqags,,t3_3qvj7a,t1_cwiqags,,t3_3qvj7a,"{u'kind': u'Listing', u'data': {u'modhash': No...",131,science,131


<h1>Collecting Prediction Links!</h1>
Use the *Search* URL to collect the links most similar to the requested article!