<h1>Find Your Expert</h1>
<br>
The purpose of our application is to identify the best comments for a user-defined article. To provide the best and most relevant comments for articles that have somehow escaped the Reddit-verse, we find similar articles that have been posted on Reddit and present their best comments. We identify the best comments through a prediction model.
<br>
<h2>Prediction Strategy</h2>
<img src="PredictionStrat.png">
The above process describes our prediction strategy, but training requires a slightly different approach. To train our model we need to collect articles that *do* exist in the Reddit-verse, together with the top comments on each article. We can then use this data to extract features from the comments and their relation to the article, and construct a model to predict the score of the comments. 
<br>
<h2>Training Strategy</h2>
<img src="TrainingStrat.png">
<br>
<br>
The collection strategy was designed to maximise effective data collection while adhering to the <a href="https://www.reddit.com/wiki/licensing">**Reddit API Guidelines**</a>.
To provide the most effective training data, the collection strategy was designed with the following considerations:
<br>
* Only collect comments directly in response to an original article post. We restrict our collection to first-level responses in an effort to avoid collecting comments not directly related to the original article.
* Collect from specific sub-reddits where the original content is usually an external source, e.g. r/news, r/TIL
* Avoid sub-reddits where there is no original content, like AMAs or ELI5
* Avoid multi-media subreddits like r/pics, r/videos

<h1>Things!</h1>

The Reddit API consists of several *things* (the base class). The API documentation describes several different <a href='https://github.com/reddit/reddit/wiki/JSON'> *Things!*</a>; however, we are most interested in the following:

<h2>Links</h2>
Links are the original posts; all of the comments we are interested in are in response to a specific link. A link object has several tags that can all be found in the <a href='https://github.com/reddit/reddit/wiki/JSON'> *things documentation*</a>. The most relevant tags to our collection strategy are:

|**Type**|**Tag**|**Description**|
|--|--|-------------------------------|
|String|id|this links identifier, e.g. "8xwlg"|
|String|name|Fullname of link|
|String|author|the account name of the poster.|
|String|domain|the domain of this link. Self posts will be self.subreddit while other examples include en.wikipedia.org and s3.amazon.com|
|boolean|over_18|true if the post is tagged as NSFW. False if otherwise|
|int|num_comments|the number of comments that belong to this link. includes removed comments.|
|String|permalink|relative URL of the permanent link for this link|
|int|score|the net-score of the link. note: A submission's score is simply the number of upvotes minus the number of downvotes. If five users like the submission and three users don't it will have a score of 2. Please note that the vote numbers are not "real" numbers, they have been "fuzzed" to prevent spam bots etc. So taking the above example, if five users upvoted the submission, and three users downvote it, the upvote/downvote numbers may say 23 upvotes and 21 downvotes, or 12 upvotes, and 10 downvotes. The points score is correct, but the vote totals are "fuzzed".|
|String|subreddit|subreddit of thing excluding the /r/ prefix. "pics"|
|String|title|the title of the link. may contain newlines for some reason|

<br>
<h2>Comments</h2>
Comments are the actual comments made in regard to a specific link. The most relevant tags to our collection strategy are:

|**Type**|**Tag**|**Description**|
|--|--|-------------------------------|
|String|id|this comment identifier, e.g. "8xwlg"|
|String|name|Fullname of comment|
|String|author|the account name of the poster.|
|String|body|the raw, unformatted text of the comment, including the raw markup characters; HTML special characters are escaped.|
|int|gilded|the number of times this comment received reddit gold|
|String|link_id|ID of the link this comment is in|
|String|parent_id|ID of the thing this comment is a reply to, either the link or a comment in it|
|int|score|the net-score of the comment|

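Fullnames carry a type prefix: in the sample data shown later, link fullnames start with `t3_` and comment fullnames with `t1_`. A first-level comment can therefore be recognised by comparing its `parent_id` with its `link_id`. The following is a minimal sketch, assuming comments are plain dicts carrying the tags listed above; the helper name `is_top_level` and the second sample record are hypothetical.

```python
def is_top_level(comment):
    """Return True when the comment replies directly to the link itself."""
    # A direct reply has the link's fullname as its parent.
    return comment["parent_id"] == comment["link_id"]


comments = [
    # A first-level comment: parent and link are the same thing.
    {"id": "cwwhtv7", "link_id": "t3_3se6lu", "parent_id": "t3_3se6lu"},
    # A made-up reply to the comment above: its parent is a t1_ fullname.
    {"id": "zzzzzzz", "link_id": "t3_3se6lu", "parent_id": "t1_cwwhtv7"},
]

top_level_ids = [c["id"] for c in comments if is_top_level(c)]
```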

<h2>Listings</h2>
The last *Thing* we need is the listing class. To facilitate bulk data collection we can use the listing class to request lists of the things we are interested in. The listing class literally consists of a list of other things. A listing object is restricted to 100 Things, so we also make use of the paging concept to get longer lists of the things we want. The paging concept of the listing class is controlled by the following tags:

|**Type**|**Tag**|**Description**|
|--|--|-----------------| 
|String|before|The fullname of the listing that follows before this page. null if there is no previous page|
|String|after|The fullname of the listing that follows after this page. null if there is no next page.|

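The paging idea can be sketched independently of the network layer. The example below is an illustration only: `collect_listing` and `PAGES` are hypothetical names, and the `fetch_page` argument stands in for whatever function performs the real request and returns one listing as a dict.

```python
def collect_listing(fetch_page, max_items):
    """Follow `after` cursors until max_items are gathered or pages run out."""
    items, after = [], None
    while len(items) < max_items:
        page = fetch_page(after)                 # one listing, as a dict
        items.extend(page["data"]["children"])
        after = page["data"]["after"]
        if after is None:                        # null `after`: no next page
            break
    return items[:max_items]


# Stand-in for the API: three pages of two things each (made-up data).
PAGES = {
    None: {"data": {"children": ["a", "b"], "after": "t3_b"}},
    "t3_b": {"data": {"children": ["c", "d"], "after": "t3_d"}},
    "t3_d": {"data": {"children": ["e", "f"], "after": None}},
}

first_five = collect_listing(PAGES.get, 5)
everything = collect_listing(PAGES.get, 100)
```

Stopping when `after` is null matters: requesting again without a cursor would start over from the first page and collect duplicates.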

<h2>Header Information</h2>
In addition to the JSON objects themselves, we also use the response headers to stay under the rate limit. Reddit nicely asks clients to make no more than 30 requests per minute. We use the following response headers to ensure we play by the rules. 

|**Header Tag**|**Description**|
|----|----------| 
|X-Ratelimit-Used|Approximate number of requests used in this period|
|X-Ratelimit-Remaining| Approximate number of requests left to use|
|X-Ratelimit-Reset|Approximate number of seconds to end of period|

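One way to use these headers is to spread the remaining request budget over the time left in the period. This is a sketch under stated assumptions rather than the collector's actual logic: `seconds_to_wait` is a hypothetical helper, and it takes a plain dict, so the header names must match exactly.

```python
def seconds_to_wait(headers, floor=0.0):
    """Suggest a pause before the next request from the rate-limit headers."""
    try:
        remaining = float(headers["X-Ratelimit-Remaining"])
        reset = float(headers["X-Ratelimit-Reset"])
    except (KeyError, ValueError):
        return floor                      # headers absent or unreadable
    if remaining <= 0:
        return max(reset, floor)          # budget spent: wait out the period
    return max(reset / remaining, floor)  # spread what is left evenly


# 10 requests left and 20 seconds to go: pause 2 seconds between requests.
steady = seconds_to_wait({"X-Ratelimit-Remaining": "10", "X-Ratelimit-Reset": "20"})
# Nothing left: wait for the whole remainder of the period.
exhausted = seconds_to_wait({"X-Ratelimit-Remaining": "0", "X-Ratelimit-Reset": "45"})
# No headers at all: fall back to the fixed floor.
fallback = seconds_to_wait({}, floor=2.0)
```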

<h2>Create a Reddit API Request Token</h2>
<br>
Before we can start looking at *Things!* we need to create a scripting application and an authorized token. The application is our vehicle for making authorized requests to the reddit API. The application **ExpertCollector/0.1 by cs109-2015** was created using https://github.com/reddit/reddit/wiki/OAuth2-Quick-Start-Example and named using reddit user agent naming conventions.
<br>
Once the application has been created, we can use the user and application credentials to request a token. User and application credentials are not to be shared and will only be distributed as required. The credentials text file has the following format:
<br>
<br>
client_id: application id<br>
username: username<br>
pw: password<br>
secret_id: application secret<br>
user_agent: application user agent<br>
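
A file in this `key: value` layout can be read into a dict with a few lines of code. The sketch below is illustrative: `parse_creds` is a hypothetical helper and every value in `sample` is a placeholder, not a real credential.

```python
def parse_creds(lines):
    """Turn 'key: value' lines into a dict, ignoring blank lines."""
    creds = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first ':' only, so a value may itself contain ':'.
        key, value = line.split(":", 1)
        creds[key.strip()] = value.strip()
    return creds


sample = [
    "client_id: my-app-id\n",
    "username: some_user\n",
    "pw: not-a-real-password\n",
    "secret_id: my-app-secret\n",
    "\n",
]
creds_example = parse_creds(sample)
```

Stripping whitespace around each part keeps a leading space after the colon from ending up inside the credential.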

In [1]:
import requests
import requests.auth
import time
import os
import json
import pandas as pd
import sys
import newspaper
import nltk

from IPython.display import clear_output

In [2]:
def getToken(creds):
    # Exchange the application and user credentials for an OAuth2 token.
    client_auth = requests.auth.HTTPBasicAuth(creds['client_id'], creds['secret_id'])
    post_data = {"grant_type": "password", "username": creds['username'], "password": creds['pw']}
    headers = {"User-Agent": creds['user_agent']}
    response = requests.post("https://www.reddit.com/api/v1/access_token", auth=client_auth, data=post_data, headers=headers)
    if response.status_code == 200:
        print 'Credentials Verified: Token Recived'
    else:
        print 'Invalid Creds'
        response.raise_for_status()  # stop on an HTTP error instead of failing on a missing token below

    # Build the headers that every authorized request must carry.
    token = response.json()
    auth = token['token_type']+' '+token['access_token']
    return {"Authorization": auth, "User-Agent": creds['user_agent']}

We can use a created token and our application name to construct the authorized header for application requests. Our request token is valid for 1 hour so we may have to do this more than once.
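
Because the token expires, a long collection run needs some way to renew it. One possible approach, sketched here as an assumption rather than as part of the collector, is a small cache that fetches a fresh token shortly before the old one runs out. `TokenCache` and `fake_fetch` are hypothetical names; the injected `clock` only exists so the example can simulate the passage of time.

```python
import time


class TokenCache(object):
    """Hand out a cached token, fetching a new one shortly before expiry."""

    def __init__(self, fetch_token, lifetime=3600, margin=60, clock=time.time):
        self._fetch = fetch_token      # callable returning the header dict
        self._lifetime = lifetime      # token validity in seconds
        self._margin = margin          # renew this many seconds early
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch()
            self._expires_at = now + self._lifetime - self._margin
        return self._token


# Simulated clock and token source, for illustration only.
fake_now = [0.0]
issued = []

def fake_fetch():
    issued.append(fake_now[0])
    return {"Authorization": "bearer demo-%d" % len(issued)}

cache = TokenCache(fake_fetch, clock=lambda: fake_now[0])
first = cache.get()       # nothing cached yet, so a token is fetched
fake_now[0] = 1800.0
second = cache.get()      # half an hour later the cached token is reused
fake_now[0] = 3550.0
third = cache.get()       # inside the renewal margin, so a new token is fetched
```

In the notebook this could be wired up as `TokenCache(lambda: getToken(creds))`, with `cache.get()` supplying the headers for each request.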

In [3]:
with open("creds.txt") as f:
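    # Split each line on the first ':' only and trim whitespace around keys and values
    creds = dict([[part.strip() for part in line.split(':', 1)] for line in f if line.strip()])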
token = getToken(creds)

Credentials Verified: Token Recived


We can now use the created token to make authorized requests, through our application, directly to the reddit API. As an example, let's see how active our cs109-2015 reddit account has been.

In [4]:
response = requests.get("https://oauth.reddit.com/api/v1/me", headers=token)
response.json()

{u'comment_karma': 0,
 u'created': 1448151333.0,
 u'created_utc': 1448122533.0,
 u'gold_creddits': 0,
 u'gold_expiration': None,
 u'has_mail': True,
 u'has_mod_mail': False,
 u'has_verified_email': False,
 u'hide_from_robots': False,
 u'id': u's9gd4',
 u'inbox_count': 1,
 u'is_gold': False,
 u'is_mod': False,
 u'is_suspended': False,
 u'link_karma': 1,
 u'name': u'cs109-2015',
 u'over_18': False,
 u'suspension_expiration_utc': None}

<h2>Collecting Training Links</h2>
<br>
Now that we can make requests, our first step is to find links (articles) we can use for our training data. To get a broad set of relevant training data we aim to collect the top *n* links, in a specified time period, for a set of specific sub-reddits we are interested in. The following function lets us do this.

In [5]:
'''
getTrainingLinks(subreddit_list,n_links,time_window,outfile)

Description:
############
Collects the top n links, in the specified time window, from each subreddit provided.

Runtime(seconds) = n_links/50 * length(subreddit_list)

Parameters:
###########
subreddit_list: a list of subreddit names

n_links: total number of links per sub-reddit to collect

time_window: one of ('hour', 'day', 'week', 'month', 'year', 'all')

outfile: path to output location

Outputs:
#########
Writes a single JSON per line to the outfile. 

Returns:
##########
outfile: the path the links were written to

'''
url = "https://oauth.reddit.com/r"

def getTrainingLinks(subreddit_list,n_links,time_window,outfile):
    token = getToken(creds)
    # Create the output directory once, before any requests are made.
    out_dir = os.path.dirname(outfile)
    if out_dir and not os.path.exists(out_dir):
        os.makedirs(out_dir)
    for subreddit in subreddit_list:
        n=0
        after = None
        while n < n_links:
            # Each request returns at most 100 links; 'after' pages through the listing.
            if after is None:
                query= 'top?limit=100&t={0}'.format(time_window)
            else:
                query= 'top?limit=100&t={0}&after={1}'.format(time_window,after)
            request_url= "/".join([url,subreddit,query])
            response = requests.get(request_url, headers=token)
            listing = response.json()['data']
            after = listing['after']
            n+=100
            with open(outfile,'a') as link_file:
                for link in listing['children']:
                    json.dump(link['data'],link_file)
                    link_file.write('\n')
            clear_output()
            sys.stdout.write('{0} r/{1} Links Collected'.format(n,subreddit))
            sys.stdout.flush()
            time.sleep(2) # So we respect the rate limits! 
            if after is None:
                break # no further pages for this subreddit
    
    clear_output()
    print 'Done.'
    return outfile

<h2> Sample Collection</h2>
Here we specify our collection parameters. This is just a sample; the actual training collection parameters will be more extensive.

In [6]:
subreddit_list = ['science','news','worldnews','dataisbeautiful','todayilearned','politics',
                  'technology','space','internetisbeautiful','nottheonion','gadgets',
                  'documentaries','upliftingnews','programming','europe','datascience',
                  'uspolitics','ukpolitics','CanadaPolitics','explainlikeimfive','liberal','conservative',
                  'nba','soccer','nfl','food','SubredditSimulator','askscience','askhistorians']

n_links = 1000
time_window ='year'
outfile ='./links.txt'

training_links = getTrainingLinks(subreddit_list,n_links,time_window,outfile)

Done.


In [2]:
links = []
#Tags to keep
tags= [ u'author',u'created_utc', u'domain', u'downs', u'gilded',u'is_self', u'likes', u'media', 'id',
 u'num_comments', u'num_reports', u'over_18', u'permalink',u'score', u'selftext', u'subreddit', u'thumbnail', u'title', u'ups', u'url']

with open('./links.txt') as data_file:
    for link in data_file:
        links.append(pd.read_json(link,orient='records',typ='series')[tags])
linkdf = pd.concat(links,axis=1).transpose()
print linkdf.shape
linkdf.head()

(34798, 20)


Unnamed: 0,author,created_utc,domain,downs,gilded,is_self,likes,media,id,num_comments,num_reports,over_18,permalink,score,selftext,subreddit,thumbnail,title,ups,url
0,the_phet,1447239000.0,ibtimes.co.uk,0,0,False,,,3se6lu,1073,,False,/r/science/comments/3se6lu/algae_has_been_gene...,6705,,science,http://b.thumbs.redditmedia.com/y1CGKgl69hKw-s...,Algae has been genetically engineered to kill ...,6705,http://www.ibtimes.co.uk/algae-genetically-eng...
1,skoalbrother,1448903000.0,phys.org,0,0,False,,,3uvg0o,2216,,False,/r/science/comments/3uvg0o/researchers_find_ne...,6777,,science,http://b.thumbs.redditmedia.com/hZrhEdBoJp22oE...,"Researchers find new phase of carbon, make dia...",6777,http://phys.org/news/2015-11-phase-carbon-diam...
2,godsenfrik,1447707000.0,news.stanford.edu,0,0,False,,,3t2exx,703,,False,/r/science/comments/3t2exx/when_scientists_fal...,6259,,science,http://b.thumbs.redditmedia.com/FQPskQP8EejVh9...,"When scientists falsify data, they try to cove...",6259,http://news.stanford.edu/news/2015/november/fr...
3,avogadros_number,1447108000.0,phys.org,0,0,False,,,3s6xe6,657,,False,/r/science/comments/3s6xe6/dispersants_did_not...,6146,,science,http://b.thumbs.redditmedia.com/mFQzb4d2QNiyaE...,Dispersants did not help oil degrade in BP spi...,6146,http://phys.org/news/2015-11-dispersants-oil-d...
4,Letmeirkyou,1447870000.0,popularmechanics.com,0,0,False,,,3tbkv6,580,,False,/r/science/comments/3tbkv6/scientists_have_dis...,6020,,science,http://b.thumbs.redditmedia.com/spD7SnbKlAEY1y...,Scientists have discovered an exoplanet still ...,6020,http://www.popularmechanics.com/space/deep-spa...


In [3]:
print linkdf.subreddit.value_counts()

politics               1200
Conservative           1200
todayilearned          1200
science                1200
uspolitics             1200
askscience             1200
Documentaries          1200
soccer                 1200
food                   1200
Liberal                1200
CanadaPolitics         1200
datascience            1200
europe                 1200
news                   1200
ukpolitics             1200
InternetIsBeautiful    1200
nottheonion            1200
UpliftingNews          1200
worldnews              1200
AskHistorians          1200
dataisbeautiful        1200
nfl                    1200
space                  1200
programming            1200
SubredditSimulator     1200
technology             1200
nba                    1200
explainlikeimfive      1199
gadgets                1199
dtype: int64


<h2>Finding Useful Links</h2>
<br>
Now we have a dataframe of potential links, but before we can collect the comments we do some pruning. We remove links based on the following criteria:

* Remove any potential NSFW links using the over_18 tag
* Remove any links with no article using the is_self tag
* Remove links with 10 or fewer comments
* Remove any duplicate URLs. 

We write the list of unique article URLs so we can go get the original article. 
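
The same pruning rules can be expressed on plain dicts, which makes each rule easy to check in isolation. This is an illustrative sketch, not the notebook's pandas filter: `prune_links` is a hypothetical helper, the sample links are made up, and the threshold keeps links with more than `min_comments` comments, matching the `> 10` comparison in the filtering cell. Unlike that cell, it also drops repeated URLs in the same pass, keeping the first link seen for each.

```python
def prune_links(links, min_comments=10):
    """Apply the pruning criteria, keeping the first link seen per URL."""
    seen, kept = set(), []
    for link in links:
        if link.get("over_18") or link.get("is_self"):
            continue                        # NSFW, or a self post with no article
        if link.get("num_comments", 0) <= min_comments:
            continue                        # too few comments to learn from
        if link["url"] in seen:
            continue                        # duplicate article URL
        seen.add(link["url"])
        kept.append(link)
    return kept


sample_links = [
    {"url": "http://example.com/a", "is_self": False, "over_18": False, "num_comments": 50},
    {"url": "http://example.com/a", "is_self": False, "over_18": False, "num_comments": 30},
    {"url": "http://example.com/b", "is_self": True, "over_18": False, "num_comments": 99},
    {"url": "http://example.com/c", "is_self": False, "over_18": True, "num_comments": 99},
    {"url": "http://example.com/d", "is_self": False, "over_18": False, "num_comments": 10},
]
kept_urls = [link["url"] for link in prune_links(sample_links)]
```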

In [4]:
training_links = linkdf.loc[(~linkdf.is_self.fillna(False)) & (~linkdf.over_18.fillna(False)) & (linkdf.num_comments > 10),:]
urls = training_links.url.unique()

In [10]:
# Download and parse a single article, returning its text, keywords, authors, summary and publish date.

def getArticle(url):
    article = newspaper.Article(url, fetch_images = False)
    article.download()
    article.parse()
    article.nlp()
    return {"url":article.url, "text":article.text,"keywords":newspaper.nlp.keywords(article.text), 
            "authors":article.authors,"summary":article.summary,
            "publish_date":str(article.publish_date)}

In [11]:
# Try the article collector on the first URL.
url = urls[0]
getArticle(url)

{'authors': [u'Derek Keats Flickr', u'Hannah Osborne', u'Marc Cirera'],
 'keywords': {u'algae': 1.0436046511627908,
  u'cancer': 1.0261627906976745,
  u'cells': 1.0261627906976745,
  u'chemotherapeutic': 1.0130813953488371,
  u'drugs': 1.0305232558139534,
  u'engineered': 1.0130813953488371,
  u'genetically': 1.0174418604651163,
  u'nanoparticles': 1.0130813953488371,
  u'toxic': 1.0130813953488371,
  u'tumours': 1.0130813953488371},
 'publish_date': '2015-11-10 16:00:00+00:00',
 'summary': u'Algae has been genetically engineered to kill cancer cells without harming healthy cells.\nThe algae nanoparticles, created by scientists in Australia, were found to kill 90% of cancer cells in cultured human cells.\nThe antibody binds only to molecules found on cancer cells, thus delivering the toxic drug specifically to the target cells.\nIn turn, the antibody binds only to molecules found on cancer cells, meaning it could deliver drugs to the target cells.\nResearchers genetically engineered th

In [19]:
#Collect ALL Articles and write each JSON to a new line
def writeArticleJSON(urls,outfile):
    # Appends one article JSON per line; URLs that fail to download or parse are reported and skipped.
    if not os.path.exists(os.path.dirname(outfile)):
        os.makedirs(os.path.dirname(outfile))
    with open(outfile,'a') as article_file:
        for url in urls :
            try:
                # Serialise fully before writing so a failure never leaves a partial line behind.
                record = json.dumps(getArticle(url))
                article_file.write(record + '\n')
            except Exception:
                print 'error: ' + url
    print 'done.'
    return outfile


In [5]:
#Since we already have some articles, let's not get duplicates.
articles = []
with open('./articles.txt') as data_file:
    for article in data_file:
        try:
            articles.append(pd.read_json(article,orient='records',typ='series'))
        except:
            print 'inlvalid JSON'
articledf = pd.concat(articles,axis=1).transpose()
print articledf.shape
articledf.head()

inlvalid JSON
(11472, 6)


Unnamed: 0,authors,keywords,publish_date,summary,text,url
0,"[Derek Keats Flickr, Hannah Osborne, Marc Cirera]","{u'toxic': 1.01308139535, u'cancer': 1.0261627...",2015-11-10 16:00:00+00:00,Algae has been genetically engineered to kill ...,Algae has been genetically engineered to kill ...,http://www.ibtimes.co.uk/algae-genetically-eng...
1,[],"{u'diamond': 1.04316546763, u'laser': 1.008633...",,"If Q-carbon is harder than diamond, why would ...","This is a collection of 0.02, 0.03 and 0.04 ca...",http://phys.org/news/2015-11-phase-carbon-diam...
2,[Bjorn Carey],"{u'obfuscation': 1.01181102362, u'fraud': 1.01...",2015-11-16 00:00:00,Stanford researchers uncover patterns in how s...,Stanford researchers uncover patterns in how s...,http://news.stanford.edu/news/2015/november/fr...
3,[],"{u'oil': 1.05142857143, u'microbes': 1.0114285...",,And that leads to more questions about where m...,"Samantha Joye, a professor of marine sciences ...",http://phys.org/news/2015-11-dispersants-oil-d...
4,[More From],"{u'star': 1.02124645892, u'forming': 1.0106232...",2015-11-18 06:00:00,"Together, the two observations allowed the sci...",Shares Share\n\nTweet\n\nE-mail\n\n​Of the tho...,http://www.popularmechanics.com/space/deep-spa...


In [6]:
new_urls = articledf.url[~articledf.url.isin(urls)]
len(new_urls)

104

In [20]:
article_data = writeArticleJSON(new_urls,'./articles.txt')

Traceback (most recent call last):
  File "/home/paul/anaconda/lib/python2.7/site-packages/newspaper/parsers.py", line 54, in fromstring
    cls.doc = lxml.html.fromstring(html)
  File "/home/paul/anaconda/lib/python2.7/site-packages/lxml/html/__init__.py", line 706, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/home/paul/anaconda/lib/python2.7/site-packages/lxml/html/__init__.py", line 600, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src/lxml/lxml.etree.c:68121)
  File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102470)
  File "parser.pxi", line 1667, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:101229)
  File "parser.pxi", line 1035, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:96139)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
  File "parser.pxi", line 631, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91904)
XMLSyntaxError: line 1241: htmlParseEntityRef: expecting ';'


You must download and parse an article before parsing it!
error: http://www.rappad.co/freestyle
done.


Some of the articles couldn't be downloaded; the reasons vary from the article no longer being available to rate limiting, bandwidth, etc. All of these issues could be explored, but one of the joys of working with big data is being able to accept a certain loss of data. Here we can see how many training threads we lost due to issues downloading the article. 
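
If the failures are worth a closer look later, the loop can record them instead of only printing them. The sketch below assumes nothing about the real downloader: `collect_with_failures` is a hypothetical helper, and `fake_fetch` stands in for an article fetcher that raises on a bad page.

```python
def collect_with_failures(urls, fetch):
    """Fetch every URL, returning the results and a list of (url, error name)."""
    results, failed = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:
            # Remember what went wrong so the losses can be analysed later.
            failed.append((url, type(exc).__name__))
    return results, failed


def fake_fetch(url):
    # Stand-in fetcher: pretend any URL containing 'broken' cannot be parsed.
    if "broken" in url:
        raise ValueError("could not parse")
    return {"url": url}


results, failed = collect_with_failures(
    ["http://example.com/ok", "http://example.com/broken"], fake_fetch)
```

Counting the error names in `failed` would show whether the losses are dominated by one cause.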

In [22]:
articles = []
with open('./articles.txt') as data_file:
    for article in data_file:
        try:
            articles.append(pd.read_json(article,orient='records',typ='series'))
        except:
            print 'inlvalid JSON'
articledf = pd.concat(articles,axis=1).transpose()
print articledf.shape
articledf.head()

inlvalid JSON
(11472, 6)


Unnamed: 0,authors,keywords,publish_date,summary,text,url
0,"[Derek Keats Flickr, Hannah Osborne, Marc Cirera]","{u'toxic': 1.01308139535, u'cancer': 1.0261627...",2015-11-10 16:00:00+00:00,Algae has been genetically engineered to kill ...,Algae has been genetically engineered to kill ...,http://www.ibtimes.co.uk/algae-genetically-eng...
1,[],"{u'diamond': 1.04316546763, u'laser': 1.008633...",,"If Q-carbon is harder than diamond, why would ...","This is a collection of 0.02, 0.03 and 0.04 ca...",http://phys.org/news/2015-11-phase-carbon-diam...
2,[Bjorn Carey],"{u'obfuscation': 1.01181102362, u'fraud': 1.01...",2015-11-16 00:00:00,Stanford researchers uncover patterns in how s...,Stanford researchers uncover patterns in how s...,http://news.stanford.edu/news/2015/november/fr...
3,[],"{u'oil': 1.05142857143, u'microbes': 1.0114285...",,And that leads to more questions about where m...,"Samantha Joye, a professor of marine sciences ...",http://phys.org/news/2015-11-dispersants-oil-d...
4,[More From],"{u'star': 1.02124645892, u'forming': 1.0106232...",2015-11-18 06:00:00,"Together, the two observations allowed the sci...",Shares Share\n\nTweet\n\nE-mail\n\n​Of the tho...,http://www.popularmechanics.com/space/deep-spa...


In [8]:
print str(len(training_links.url.unique())) + ' Article URLs'
linkdf = linkdf.loc[linkdf.url.isin(articledf.url)]
print str(len(articledf.url.unique())) + ' Sucsessfull Downloads'

21268 Article URLs
11420 Sucsessfull Downloads


<h2> Collecting Training Comments</h2>

Now that we know which links (articles) we are interested in, we need to go get the top comments. To maximise the relevance to the original article we restrict our comment collector to the direct children of the article. We also want to control how many comments we receive for each article. The following function lets us do this.

In [24]:
'''
getTrainingComments(df,depth,outfile)

Description:
############
Collects the first children for each provided link. 

Runtime(seconds) >= len(df) (one request plus a 1 second pause per link)

Parameters:
###########
df: a dataframe where each record is a link. 
        Required columns: subreddit, id

outfile: path to output location

depth: the max depth of comments to return. 


Outputs:
#########
Writes a single comment JSON per line to the outfile. 

Returns:
##########
outfile: the path the comments were written to

'''
url = "https://oauth.reddit.com/r"

def getTrainingComments(df,depth, outfile):
    token = getToken(creds)
    out_dir = os.path.dirname(outfile)
    if out_dir and not os.path.exists(out_dir):
        os.makedirs(out_dir)
    for subreddit, link_id in [tuple(x) for x in df[['subreddit','id']].values]:
        query= 'comments/{0}/?depth={1}'.format(link_id,depth)
        request_url= "/".join([url,subreddit,query])
        response = requests.get(request_url, headers=token)
        with open(outfile,'a') as comment_file:
            # Element 0 of the response describes the link; element 1 lists its comments.
            # The final child is dropped, as it is usually the 'more' placeholder rather than a comment.
            for comment in response.json()[1]['data']['children'][:-1]:
                try:
                    comment_file.write(json.dumps(comment['data'])+'\n')
                except Exception:
                    print 'skipped'
        time.sleep(1) # So we respect the rate limits! 
    print 'done.'
    return outfile

In [25]:
training_comments = getTrainingComments(linkdf,1,'./comment_data.txt')

done.


In [9]:
comments = []
#Tags to keep
tags= [ u'author', u'body', u'body_html', u'controversiality', u'created', u'created_utc', u'distinguished', u'downs',
 u'edited', u'gilded', u'id', u'likes', u'link_id', u'name', u'num_reports', u'parent_id', u'replies', u'score',
 u'subreddit', u'ups']

with open('./comment_data.txt') as data_file:
    for comment in data_file:
        comments.append(pd.read_json(comment,orient='records',typ='series')[tags])
commentdf = pd.concat(comments,axis=1).transpose()

#Remove ['removed'] comments
commentdf = commentdf.loc[commentdf['body']!='[removed]']
commentdf.head()

Unnamed: 0,author,body,body_html,controversiality,created,created_utc,distinguished,downs,edited,gilded,id,likes,link_id,name,num_reports,parent_id,replies,score,subreddit,ups
0,SirT6,The title sort of misses the point of the stud...,"&lt;div class=""md""&gt;&lt;p&gt;The title sort ...",0,1447280000.0,1447251000.0,,0,False,1,cwwhtv7,,t3_3se6lu,t1_cwwhtv7,,t3_3se6lu,"{u'kind': u'Listing', u'data': {u'modhash': No...",1359,science,1359
1,DrBiochemistry,Just want to point out that until I see a deli...,"&lt;div class=""md""&gt;&lt;p&gt;Just want to po...",0,1447277000.0,1447249000.0,,0,False,0,cwwgxle,,t3_3se6lu,t1_cwwgxle,,t3_3se6lu,"{u'kind': u'Listing', u'data': {u'modhash': No...",3209,science,3209
2,Frogblood,It's an interesting idea but the in vitro and ...,"&lt;div class=""md""&gt;&lt;p&gt;It&amp;#39;s an...",0,1447276000.0,1447247000.0,,0,False,0,cwwggxu,,t3_3se6lu,t1_cwwggxu,,t3_3se6lu,"{u'kind': u'Listing', u'data': {u'modhash': No...",133,science,133
3,mijn_ikke,Just waiting until somebody smarter than me co...,"&lt;div class=""md""&gt;&lt;p&gt;Just waiting un...",0,1447276000.0,1447247000.0,,0,1.447249e+09,1,cwwga6g,,t3_3se6lu,t1_cwwga6g,,t3_3se6lu,"{u'kind': u'Listing', u'data': {u'modhash': No...",773,science,773
4,awhitt8,Yes the title is sensationalized.\n\n&gt;The m...,"&lt;div class=""md""&gt;&lt;p&gt;Yes the title i...",0,1447285000.0,1447256000.0,,0,1.447259e+09,0,cwwkopn,,t3_3se6lu,t1_cwwkopn,,t3_3se6lu,"{u'kind': u'Listing', u'data': {u'modhash': No...",16,science,16


Now we have links, articles and comments. 

Links and articles are really just different features of the thread origin, so let's join them.

Now we have all of the data we need: articles and the corresponding comments. We created a few descriptive features for both the articles and the comments, and we want to build a few more. But first we want to remove any incomplete records where we couldn't collect the article or the comments. 
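
The idea of a complete record can be illustrated without pandas. In this sketch a link is complete when its article was downloaded and at least one comment points at it; `link_id_of` and `complete_records` are hypothetical helpers and all of the sample data is made up.

```python
def link_id_of(comment):
    """Strip the type prefix from parent_id, e.g. 't3_3se6lu' -> '3se6lu'."""
    return comment["parent_id"].split("_", 1)[1]


def complete_records(links, article_urls, comments):
    """Keep links that have both an article and comments, plus those comments."""
    commented = set(link_id_of(c) for c in comments)
    kept_links = [l for l in links
                  if l["url"] in article_urls and l["id"] in commented]
    kept_ids = set(l["id"] for l in kept_links)
    kept_comments = [c for c in comments if link_id_of(c) in kept_ids]
    return kept_links, kept_comments


links = [
    {"id": "aaa", "url": "http://example.com/1"},
    {"id": "bbb", "url": "http://example.com/2"},   # article never downloaded
    {"id": "ccc", "url": "http://example.com/3"},   # no comments collected
]
article_urls = {"http://example.com/1", "http://example.com/3"}
comments = [
    {"id": "c1", "parent_id": "t3_aaa"},
    {"id": "c2", "parent_id": "t3_bbb"},
]

kept_links, kept_comments = complete_records(links, article_urls, comments)
```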

In [10]:
#Subset For Complete Records
#Remove links with no articles
linkdf = linkdf.loc[linkdf.url.isin(articledf.url)]
#Remove links with no comments
commentdf['pid'] = commentdf.parent_id.apply(lambda x: str.split(str(x),'_')[1])
train_links = linkdf.loc[linkdf['id'].isin(commentdf['pid'])]

#Join Link Features and Article Features
#TODO: join on pid as opposed to url
train_articles = articledf.merge(train_links, on='url',how='left')

train_comments = commentdf[commentdf.pid.isin(train_links['id'])]
print str(train_articles.shape[0])+' Sample Articles'
print str(train_comments.shape[0])+ ' Sample Comments'

15611 Sample Articles
293701 Sample Comments


Let's write out the training data to CSV.

In [11]:
train_articles.to_csv('./train_articles.csv',sep=',',index=False,encoding = 'utf-8')
train_comments.to_csv('./train_comments.csv',sep=',',index=False,encoding='utf-8')

Collection of the training data is complete; now we move on to feature creation.