<h1>Prediction Data Factory</h1>

This notebook demonstrates how we take the keywords extracted from the article at a URL and build a feature data frame of comments, from which we can predict the comment likely to score highest in relation to the new article. It is intended to demonstrate the back end of the prediction process, not to complete the prediction itself.
<br>
<br>
These functions demonstrate the entire process performed in the Collection and Feature Factory for Prediction Data.
<br>
<br>
This is done so that we have comments from articles similar to the input URL. We then create features identical to the training data so that we can use our model to predict the most relevant of these comments. The only difference here is that the similarities will be computed between the comment and the article of interest, as opposed to the article the comment was originally about.
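
As a rough illustration of that last point, the sketch below scores a few candidate comments against a single article of interest and ranks them. The helper name `cosine`, the variable names, and all token counts are made up for the example; it assumes each text has already been reduced to a bag of stemmed tokens.

```python
import math
from collections import Counter

def cosine(vec1, vec2):
    # Cosine similarity between two bags of words (token -> count)
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[t] * vec2[t] for t in shared)
    norm1 = math.sqrt(sum(v * v for v in vec1.values()))
    norm2 = math.sqrt(sum(v * v for v in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0

# Hypothetical token counts for the article we want to predict comments for
interest_article = Counter({"cancer": 2, "cell": 3, "alga": 1})

# Hypothetical comments collected from *other*, similar articles
comments = {
    "c1": Counter({"cancer": 1, "cell": 1}),
    "c2": Counter({"football": 2, "goal": 1}),
    "c3": Counter({"alga": 1, "bloom": 1, "lake": 1}),
}

# Every comment is compared with the article of interest, not its own parent article
scores = {cid: cosine(tokens, interest_article) for cid, tokens in comments.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy data `c1` shares the most vocabulary with the article, so it ranks first, while `c2` shares none and scores 0.0.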

In [216]:
import requests
import requests.auth
import time
import os
import json
import pandas as pd
import sys
import newspaper
import nltk
import operator
import re, string
from nltk.corpus import stopwords
from collections import Counter
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer


from IPython.display import clear_output

<h2>Get API Request Token</h2>

In [128]:
def getToken(creds):
    client_auth = requests.auth.HTTPBasicAuth(creds['client_id'], creds['secret_id'])
    post_data = {"grant_type": "password", "username": creds['username'], "password": creds['pw']}
    headers = {"User-Agent": creds['user_agent']}
    response = requests.post("https://www.reddit.com/api/v1/access_token", auth=client_auth, data=post_data, headers=headers)
    if response.status_code == 200:
        print 'Credentials Verified: Token Recived'
    else:
        print 'Invalid Creds'

    auth = response.json()['token_type']+' '+response.json()['access_token']
    return {"Authorization": auth, "User-Agent": creds['user_agent']}

with open("creds.txt") as f:
    #Split on the first ':' only, so values that contain a colon are kept whole
    creds = dict([line.strip().split(':', 1) for line in f])
token = getToken(creds)

Credentials Verified: Token Recived
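
The credentials file is read as one `key:value` pair per line. Below is a minimal sketch of that parsing, assuming the five keys that `getToken` looks up (`client_id`, `secret_id`, `username`, `pw`, `user_agent`). The helper name `parse_creds` and every value shown are placeholders, not real credentials.

```python
def parse_creds(lines):
    """Parse 'key:value' lines into a dict, splitting on the first colon only."""
    creds = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, value = line.split(":", 1)
        creds[key.strip()] = value.strip()
    return creds

# Placeholder content standing in for creds.txt
sample_lines = [
    "client_id:abc123",
    "secret_id:s3cr3t",
    "username:example_user",
    "pw:pass:with:colons",
    "user_agent:prediction-factory/0.1 by example_user",
]

creds = parse_creds(sample_lines)
print(sorted(creds.keys()))
```

Splitting on the first colon only means a password or user agent that itself contains a colon is not truncated or rejected.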


<h2>Define Required Functions</h2>

All of these functions are used in the Collection Factory or the Feature Factory to create training data. We reuse them here to ensure the prediction data has a structure identical to the training data.

In [221]:
#Gets the Article from the URL of the reddit thread
def getArticle(url):
    article = newspaper.Article(url, fetch_images = False)
    article.download()
    article.parse()
    article.nlp()
    return {"url":article.url, "text":article.text,"keywords":newspaper.nlp.keywords(article.text), 
            "authors":article.authors,"summary":article.summary,
            "publish_date":str(article.publish_date)}

#Creates Clean, Parsed Tokens (a bag of words) from a block of text
def get_tokens(text):
    lowers = text.lower()
    clean = regex.sub('',lowers)
    tokens=nltk.word_tokenize(clean)
    return [w for w in tokens if w not in stopset]

#Stems tokens to the root word
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

#Computes the Cosine Similarity between two bags of words
import math
def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

#Computes the Cosine between the Comment Keywords and the Article Keywords
#Keywords are produced from the newspaper.nlp module
#Note: reads a module-level `articles` data frame; any lookup error falls back to 0.0
def keyword_sim(comment_row):
    try:
        c_kw = comment_row['keywords']
        a_kw = articles[articles['id']==comment_row['pid']]['keywords'].iloc[0]
        sim = get_cosine(c_kw,a_kw)
    except:
        sim = 0.0
    return sim

#Computes Cosine between Comment Tokens and Article Tokens
#Tokens are produced by the stem_tokens and get_tokens functions defined above
def token_sim(comment_row):
    try:
        c = comment_row['tokens']
        a = articles[articles['id']==comment_row['pid']]['tokens'].iloc[0]
        sim = get_cosine(c,a)
    except:
        sim = 0.0
    return sim

#Computes the cosine between all of a user's tokens and the tokens of a single article
#Note: reads a module-level `user_vocab` dict (author -> token Counter)
def user_vocab_tokens_sim(comment_row):
    try:
        c = user_vocab[comment_row.author]
        a = articles[articles['id']==comment_row['pid']]['tokens'].iloc[0]
        sim = get_cosine(c,a)
    except:
        sim = 0.0
    return sim

#Computes the cosine between a single user's vocabulary and the keywords of a single article
def user_vocab_kw_sim(comment_row):
    try:
        c = user_vocab[comment_row.author]
        #Use positional .iloc[0] (as in the functions above); [0] is a label lookup and fails unless the row label is 0
        a = articles[articles['id']==comment_row['pid']]['keywords'].iloc[0]
        sim = get_cosine(c,a)
    except:
        sim = 0.0
    return sim
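
To make the bag-of-words step concrete, here is a simplified stand-in for `get_tokens` that needs no NLTK data. It reuses the same punctuation regex as this notebook but swaps `nltk.word_tokenize` for a plain whitespace split, uses a tiny hand-picked stopword set, and skips stemming. The names `simple_tokens` and `STOPWORDS` are invented for this sketch.

```python
import re
import string
from collections import Counter

# Same pattern as the notebook: matches any single punctuation character
punct = re.compile('[%s]' % re.escape(string.punctuation))

# Tiny illustrative stopword set (the notebook uses NLTK's English list)
STOPWORDS = {"the", "to", "only", "on", "a", "of"}

def simple_tokens(text):
    # Lowercase, delete punctuation, split on whitespace, drop stopwords
    clean = punct.sub('', text.lower())
    return [w for w in clean.split() if w not in STOPWORDS]

bag = Counter(simple_tokens("The antibody binds only to cancer cells; cancer cells die."))
print(bag)

# Punctuation is deleted rather than replaced by a space, so hyphenated words merge
print(simple_tokens("antibody-binding"))
```

Because punctuation is removed rather than replaced with a space, a hyphenated term such as `antibody-binding` becomes the single token `antibodybinding`.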


In [224]:
'''
Description:
############
Collects comments from articles similar to 'keywords' and creates features for modelling.

Parameters:
###########
keywords: dict mapping keyword -> score (as returned by newspaper.nlp.keywords);
          the three highest-scoring keywords become the search terms

n_links: number of links (articles) to request from the search

n_comments: number of comments per article (not currently used in the function body)

time_window: one of ('hour', 'day', 'week', 'month', 'year', 'all')


Returns:
##########
dict of feature dataframes:
'pred_articles': article data frame
'pred_comments': comments data frame, including the similarity features

'''
url = "https://oauth.reddit.com/r"
regex = re.compile('[%s]' % re.escape(string.punctuation))
stopset = set(stopwords.words('english'))
stemmer = PorterStemmer()


def getPredictionData(keywords,n_links,n_comments,time_window):
    #Get Token
    token = getToken(creds)
    #Create Search words
    top3keywords = [x[0] for x in sorted(keywords.items(), key=operator.itemgetter(1),reverse=True)[:3]]
    searchterms= " ".join(top3keywords)
    #Create Query String
    query = "all/search?limit={0}&type=link&t={1}&sort=relevance&q='{2}'".format(n_links,time_window,searchterms)
    request_url= "/".join([url,query])
    #Make API Request to get Articles similar to the keywords
    response = requests.get(request_url, headers=token)
    tags= [ u'author',u'created_utc', u'domain', u'downs', u'gilded',u'is_self', u'likes', u'media', 'id',
         u'num_comments', u'num_reports', u'over_18', u'permalink',u'score', u'selftext', u'subreddit', u'thumbnail', u'title', u'ups', u'url']
    #Get The links for all of the articles
    links = []
    for link in response.json()['data']['children']:
        links.append(pd.DataFrame.from_dict(link['data'],orient='index'))
    #Build the link data frame once, after all links are collected
    linkdf = pd.concat(links,axis=1).transpose()[tags]
    #Go get the actual articles
    articles = []
    for link_url in linkdf.url.unique():
        articles.append(pd.DataFrame.from_dict(getArticle(link_url),orient='index'))
    #Build the article data frame once, after all articles are downloaded
    articledf = pd.concat(articles,axis=1).transpose()
    #Go Get the comments for these articles
    comments = []
    tags= [ u'author', u'body', u'body_html', u'controversiality', u'created', u'created_utc', u'distinguished', u'downs',
            u'edited', u'gilded', u'id', u'likes', u'link_id', u'name', u'num_reports', u'parent_id', u'replies', u'score',
            u'subreddit', u'ups']
    for subreddit, name in [tuple(x) for x in linkdf[['subreddit','id']].values]:
            query= 'comments/{0}/?depth=1'.format(name)
            request_url= "/".join([url,subreddit,query])
            response = requests.get(request_url, headers=token)
            for comment in response.json()[1]['data']['children'][:-1]:
                try:
                    comments.append(pd.DataFrame.from_dict(comment['data'],orient='index'))
                    commentdf = pd.concat(comments,axis=1).transpose()[tags]
                except:
                    print 'skipped'
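    #Keep only links whose URL matches a downloaded article
    linkdf = linkdf.loc[linkdf.url.isin(articledf.url)]
    #Extract the parent link id from parent_id (e.g. 't3_abc123' -> 'abc123')
    commentdf['pid'] = commentdf.parent_id.apply(lambda x: str(x).split('_')[1])
    #Remove links with no comments
    train_links = linkdf.loc[linkdf['id'].isin(commentdf['pid'])]
    #Join Link Features and Article Features
    pred_articles = articledf.merge(train_links, on='url',how='left')
    #Copy so the feature columns added below do not write into a slice of commentdf
    pred_comments = commentdf[commentdf.pid.isin(train_links['id'])].copy()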
    #Count Stemmed Tokens
    pred_comments['tokens'] = pred_comments.body.apply(lambda text: Counter(stem_tokens(get_tokens(text), stemmer)))
    #Comment Length
    pred_comments['comment_length'] = pred_comments.body.apply(len)
    #Number of Words
    pred_comments['n_tokens'] = pred_comments.tokens.apply(len)
    #Comment Keywords
    pred_comments['keywords'] = pred_comments.body.apply(newspaper.nlp.keywords)
    #Create Article Features
    #########################
    #Tokens
    pred_articles['tokens'] = pred_articles['text'].apply(lambda text: Counter(stem_tokens(get_tokens(text), stemmer)))
    #Article Length
    pred_articles['article_len'] = pred_articles['text'].apply(len)
    #Number of distinct stemmed tokens (mirrors n_tokens for comments; previously repeated the text length)
    pred_articles['n_tokens'] = pred_articles['tokens'].apply(len)

    #Create Vocabularies for each user
    user_vocab = {}
    for author,vocab in pred_comments[['author','tokens']].groupby('author'):
        user_vocab[author] = sum((Counter(dict(x)) for x in vocab.tokens),Counter())
    
    #Create Similarity Features
    #########################
    pred_comments['keyword_sim']=pred_comments.apply(keyword_sim,axis=1)
    pred_comments['token_sim']=pred_comments.apply(token_sim,axis=1) 
    pred_comments['user_vocab_tokens']=pred_comments.apply(user_vocab_tokens_sim,axis=1)
    pred_comments['user_vocab_kw']=pred_comments.apply(user_vocab_kw_sim,axis=1)
    
    #Return dict of DataFrames
    outdfs = {'pred_articles':pred_articles,'pred_comments':pred_comments}
    print 'Done.'
    return outdfs
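
The per-user vocabulary step inside `getPredictionData` merges every token count a given author produced into one `Counter`. The sketch below shows the same idea without pandas, on hypothetical `(author, tokens)` pairs standing in for rows of `pred_comments`; the author names and counts are invented.

```python
from collections import Counter, defaultdict

# Hypothetical (author, token counts) pairs, one per comment
rows = [
    ("alice", Counter({"cancer": 2, "cell": 1})),
    ("bob", Counter({"alga": 1})),
    ("alice", Counter({"cell": 2, "drug": 1})),
]

# Group each author's per-comment Counters together
by_author = defaultdict(list)
for author, tokens in rows:
    by_author[author].append(tokens)

# Summing Counters adds counts token by token; Counter() is the empty starting value
user_vocab = {author: sum(counters, Counter()) for author, counters in by_author.items()}
print(user_vocab["alice"])
```

Here the two comments by `alice` combine into a single vocabulary in which `cell` has a count of 3.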

<h2>Demonstrate with Sample Keywords</h2>

In [98]:
keywords = newspaper.nlp.keywords(u'Algae has been genetically engineered to kill cancer cells without harming healthy cells.\nThe algae nanoparticles, created by scientists in Australia, were found to kill 90% of cancer cells in cultured human cells.\nThe antibody binds only to molecules found on cancer cells, thus delivering the toxic drug specifically to the target cells.\nIn turn, the antibody binds only to molecules found on cancer cells, meaning it could deliver drugs to the target cells.\nResearchers genetically engineered the algae to produce an antibody-binding protein on the surface of their shells.')
keywords

{u'algae': 1.052325581395349,
 u'binds': 1.0348837209302326,
 u'cancer': 1.069767441860465,
 u'cells': 1.069767441860465,
 u'cellsthe': 1.0348837209302326,
 u'engineered': 1.0348837209302326,
 u'genetically': 1.0348837209302326,
 u'kill': 1.0348837209302326,
 u'molecules': 1.0348837209302326,
 u'target': 1.0348837209302326}
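
For reference, this is how the search terms would be derived from the keyword scores above, following the same sort-and-slice logic as `getPredictionData`. The dictionary is copied from the output, and the `limit` and `t` values match the sample run later in this notebook; nothing is sent to the API here.

```python
import operator

# Keyword scores copied from the output above
keywords = {
    'algae': 1.052325581395349,
    'binds': 1.0348837209302326,
    'cancer': 1.069767441860465,
    'cells': 1.069767441860465,
    'cellsthe': 1.0348837209302326,
    'engineered': 1.0348837209302326,
    'genetically': 1.0348837209302326,
    'kill': 1.0348837209302326,
    'molecules': 1.0348837209302326,
    'target': 1.0348837209302326,
}

# Sort by score, highest first, and keep the three best keywords
ranked = sorted(keywords.items(), key=operator.itemgetter(1), reverse=True)
top3 = [word for word, score in ranked[:3]]
search_terms = " ".join(top3)

# Same query template as getPredictionData, filled with the sample-run arguments
query = "all/search?limit={0}&type=link&t={1}&sort=relevance&q='{2}'".format(5, "year", search_terms)
print(query)
```

The three highest-scoring words are `cancer`, `cells` and `algae`; because `cancer` and `cells` have equal scores, their relative order depends on dictionary ordering.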

<h2>Run getPredictionData</h2>

This sample collects only 5 similar articles. As this number increases, the collection takes longer, but there are more comments to predict from.

In [225]:
sample_out = getPredictionData(keywords,5,1,'year')

Credentials Verified: Token Recived
Done.


In [220]:
print sample_out['pred_comments'].columns
print sample_out['pred_articles'].columns

Index([          u'author',             u'body',        u'body_html',
       u'controversiality',          u'created',      u'created_utc',
          u'distinguished',            u'downs',           u'edited',
                 u'gilded',               u'id',            u'likes',
                u'link_id',             u'name',      u'num_reports',
              u'parent_id',          u'replies',            u'score',
              u'subreddit',              u'ups',              u'pid',
                 u'tokens',   u'comment_length',         u'n_tokens',
               u'keywords'],
      dtype='object')