<h2>Feature Factory</h2>
This is the mechanism for extracting useful attributes from the comments, the articles, and the relationships between them. Some of the features were created at collection time to reduce the amount of data that would be stored, and we expand on those features here. The premise of our application is that the top comments will be syntactically similar to the articles they describe. For example, comments and users that use the same language as the article may increase the overall score of that article. The feature factory produces a large variety of these comparisons, and we use exploratory data analysis (EDA) to determine the ones with the most predictive power.

In [2]:
import time
import os
import pandas as pd
import sys
import nltk
from nltk.corpus import stopwords
from collections import Counter
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
import newspaper
import ast

from IPython.display import clear_output

<h2>Read in the Collected Comments</h2>

This reads in the csv produced by the collection factory, reformats some of the field types and removes any duplicated records. 

In [3]:
comments = pd.read_csv('./train_comments.csv',low_memory=False)
comments[['body']] = comments[['body']].astype(str)
comments['keywords'] = comments.keywords.apply(ast.literal_eval)
comments.drop_duplicates(inplace=True,subset='id')
comments.reset_index(drop=True,inplace=True)
print comments.shape
comments.head()

(275686, 25)


Unnamed: 0,author,body,body_html,controversiality,created,created_utc,distinguished,downs,edited,gilded,...,parent_id,replies,score,subreddit,ups,pid,tokens,comment_length,n_tokens,keywords
0,SirT6,The title sort of misses the point of the stud...,"&lt;div class=""md""&gt;&lt;p&gt;The title sort ...",0,1447279564,1447250764,,0,False,1,...,t3_3se6lu,"{u'kind': u'Listing', u'data': {u'modhash': No...",1359,science,1359,3se6lu,"Counter({'alga': 5, 'cancer': 4, 'cell': 4, 'd...",869,52,"{u'toxinalgae': 1.00993377483, u'cancer': 1.03..."
1,DrBiochemistry,Just want to point out that until I see a deli...,"&lt;div class=""md""&gt;&lt;p&gt;Just want to po...",0,1447277409,1447248609,,0,False,0,...,t3_3se6lu,"{u'kind': u'Listing', u'data': {u'modhash': No...",3209,science,3209,3se6lu,"Counter({'kill': 2, 'deliveri': 2, 'cancer': 1...",307,30,"{u'survives': 1.02941176471, u'thing': 1.02941..."
2,Frogblood,It's an interesting idea but the in vitro and ...,"&lt;div class=""md""&gt;&lt;p&gt;It&amp;#39;s an...",0,1447276156,1447247356,,0,False,0,...,t3_3se6lu,"{u'kind': u'Listing', u'data': {u'modhash': No...",133,science,133,3se6lu,"Counter({'idea': 2, 'target': 2, 'overexcit': ...",432,39,"{u'tumour': 1.02173913043, u'targeting': 1.043..."
3,mijn_ikke,Just waiting until somebody smarter than me co...,"&lt;div class=""md""&gt;&lt;p&gt;Just waiting un...",0,1447275611,1447246811,,0,1447248944.0,1,...,t3_3se6lu,"{u'kind': u'Listing', u'data': {u'modhash': No...",773,science,773,3se6lu,"Counter({'thank': 1, 'gold': 1, 'point': 1, 'e...",163,12,"{u'somebody': 1.05172413793, u'gold': 1.051724..."
4,awhitt8,Yes the title is sensationalized.\n\n&gt;The m...,"&lt;div class=""md""&gt;&lt;p&gt;Yes the title i...",0,1447284967,1447256167,,0,1447259263.0,0,...,t3_3se6lu,"{u'kind': u'Listing', u'data': {u'modhash': No...",16,science,16,3se6lu,"Counter({'drug': 5, 'deliveri': 4, 'materi': 3...",1447,104,"{u'silicon': 1.01530612245, u'tissue': 1.01530..."


<h2>Read in the Articles</h2>
Here we read in the article csv produced by the collection factory. 

In [None]:
articles = pd.read_csv('./train_articles.csv',low_memory=False)
articles[['text','summary']] = articles[['text','summary']].astype(str)
articles.drop_duplicates(inplace=True,subset = 'url')
articles['keywords'] = articles.keywords.apply(ast.literal_eval)
articles.reset_index(drop=True,inplace=True)
print articles.shape
articles.head()

<h2> Create Comment Features</h2>
Here we create features strictly from the comments. A large variety of attribute features were collected in the collection factory; however, we are most interested in examining the content of the comments. To examine the content, we create parsed, stemmed tokens for each block of text. We also use newspaper.nlp.keywords() on the comment text so that we can compare the keywords of the comment to the keywords of the articles.

In [5]:
#Create Comment Features
########################
#Tokenized Comments

import re, string
regex = re.compile('[%s]' % re.escape(string.punctuation))
stopset = set(stopwords.words('english'))
stemmer = PorterStemmer()

def get_tokens(text):
    #Lowercase, strip punctuation, tokenize, then drop English stopwords
    lowers = text.lower()
    clean = regex.sub('',lowers)
    tokens=nltk.word_tokenize(clean)
    return [w for w in tokens if w not in stopset]

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

#Count Stemmed Tokens
comments['tokens'] = comments.body.apply(lambda text: Counter(stem_tokens(get_tokens(text), stemmer)))

#Comment Length
comments['comment_length'] = comments.body.apply(len)

#Number of Distinct Stemmed Tokens (tokens is a Counter, so len counts unique stems)
comments['n_tokens'] = comments.tokens.apply(len)

#Comment Keywords
comments['keywords'] = comments.body.apply(newspaper.nlp.keywords)

print comments.columns

Index([u'author', u'body', u'body_html', u'controversiality', u'created',
       u'created_utc', u'distinguished', u'downs', u'edited', u'gilded', u'id',
       u'likes', u'link_id', u'name', u'num_reports', u'parent_id', u'replies',
       u'score', u'subreddit', u'ups', u'pid', u'tokens', u'comment_length',
       u'n_tokens', u'keywords'],
      dtype='object')


<h2>Create Article Features</h2>
Here we create features strictly from the articles. A large variety of attribute features were collected in the collection factory; however, we are most interested in examining the content of the articles. To examine the content, we create parsed, stemmed tokens for each block of text.

In [6]:
#Create Article Features
#########################
#Tokens
articles['tokens'] = articles['text'].apply(lambda text: Counter(stem_tokens(get_tokens(text), stemmer)))

#Article Length
articles['article_len'] = articles['text'].apply(len)

#Number of Distinct Stemmed Tokens (computed from the token Counter, mirroring the comment feature)
articles['n_tokens'] = articles['tokens'].apply(len)

articles.columns

Index([u'authors', u'keywords', u'publish_date', u'summary', u'text', u'url',
       u'author', u'created_utc', u'domain', u'downs', u'gilded', u'is_self',
       u'likes', u'media', u'id', u'num_comments', u'num_reports', u'over_18',
       u'permalink', u'score', u'selftext', u'subreddit', u'thumbnail',
       u'title', u'ups', u'tokens', u'article_len', u'n_tokens'],
      dtype='object')

Let's write the final article and comment data to file.

In [76]:
articles.to_csv('./train_articles.csv',sep=',',index=False)
comments.to_csv('./train_comments.csv',sep=',',index=False)

Now we have raw content, stemmed tokens, and keywords for each comment and article, so we can start to look at the relationships between comment content and article content. The features we created are largely bag-of-words representations of either the full text or the keywords of the articles and the comments. One way to compare these bags of words is cosine similarity.
<br>
<br>
The cosine similarity measures the similarity between two bags of words, or vectors, where A and B represent the two vectors.

$$\frac{A \cdot B}{\|A\|\,\|B\|}$$

Cosine similarity was chosen because it normalizes by the length (magnitude) of the vectors. This is important when we are comparing articles, which tend to be quite long, to comments, which can be quite short.
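
As a worked example with made-up counts (not taken from the collected data), take a short comment A = {cancer: 1, cell: 1} and a longer article B = {cancer: 10, cell: 10, drug: 5}:

$$\frac{A \cdot B}{\|A\|\,\|B\|} = \frac{1 \cdot 10 + 1 \cdot 10}{\sqrt{1^2 + 1^2}\,\sqrt{10^2 + 10^2 + 5^2}} = \frac{20}{\sqrt{2} \cdot 15} \approx 0.94$$

Although the article's counts are ten times larger than the comment's, the score stays high because only the direction of the two vectors matters, not their magnitude.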

The downside of using cosine similarity is that it does not account for the global popularity of a term. This means that two vectors with a few rare terms in common score the same as two vectors with a few popular words in common. This is an obvious issue we would like to address by weighting each word by its term frequency-inverse document frequency (TF-IDF).
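
One way this weighting could look is sketched here. It is not part of the feature factory: the function names (`idf_weights`, `tfidf`, `cosine`), the smoothed IDF formula, and the toy corpus are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Smoothed inverse document frequency for each term in a list of Counters."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())  # each term counts once per document
    return {term: math.log((1.0 + n_docs) / (1.0 + df)) + 1.0
            for term, df in doc_freq.items()}

def tfidf(doc, idf):
    """Scale raw term counts by their IDF weight."""
    return {term: count * idf.get(term, 1.0) for term, count in doc.items()}

def cosine(vec1, vec2):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[t] * vec2[t] for t in shared)
    norm1 = math.sqrt(sum(v * v for v in vec1.values()))
    norm2 = math.sqrt(sum(v * v for v in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0

# Hypothetical data: 'studi' is common across the corpus, 'alga' is rare.
article = Counter({'studi': 4, 'alga': 4, 'cancer': 2})
common_overlap = Counter({'studi': 2, 'peopl': 1})
rare_overlap = Counter({'alga': 2, 'peopl': 1})
background = [Counter({'studi': 1, 'peopl': 2}),
              Counter({'studi': 2, 'climat': 1})]
corpus = [article, common_overlap, rare_overlap] + background

idf = idf_weights(corpus)

# Raw counts cannot tell the two comments apart ...
raw_common = cosine(article, common_overlap)
raw_rare = cosine(article, rare_overlap)

# ... while the weighted vectors favour the comment sharing the rarer term.
weighted_common = cosine(tfidf(article, idf), tfidf(common_overlap, idf))
weighted_rare = cosine(tfidf(article, idf), tfidf(rare_overlap, idf))
```

On this toy corpus the two raw scores are identical, while the weighted score for `rare_overlap` comes out higher than the one for `common_overlap`.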


In [7]:
import math
def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

The following functions define the construction of several similarity features. Each function applies the get_cosine() function to different bags of words for the article and the comment.
<br>

In [8]:
#Compute the similarity between the keywords of the comment and its corresponding article.
def keyword_sim(comment_row):
    try:
        c_kw = comment_row['keywords']
        a_kw = articles[articles['id']==comment_row['pid']]['keywords'].iloc[0]
        sim = get_cosine(c_kw,a_kw)
    except Exception:
        #No matching article or malformed keywords: fall back to zero similarity
        sim = 0.0
    return sim

comments['keyword_sim']=comments.apply(keyword_sim,axis=1)

In [9]:
#Compute the similarity between the tokens of the comment and its corresponding article.
def token_sim(comment_row):
    try:
        c = comment_row['tokens']
        a = articles[articles['id']==comment_row['pid']]['tokens'].iloc[0]
        sim = get_cosine(c,a)
    except Exception:
        #No matching article or malformed tokens: fall back to zero similarity
        sim = 0.0
    return sim

comments['token_sim']=comments.apply(token_sim,axis=1)    
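
The functions above filter the articles table once per comment, which can get slow with a few hundred thousand comments. One possible alternative, sketched here with plain dictionaries standing in for the DataFrame rows, is to build the article lookup once and reuse it. The names `tokens_by_article` and `token_sim_lookup` and the toy records are hypothetical; with the real tables the mapping could presumably be built with something like `dict(zip(articles['id'], articles['tokens']))`.

```python
import math
from collections import Counter

def cosine(vec1, vec2):
    """Cosine similarity between two sparse count vectors."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[t] * vec2[t] for t in shared)
    norm1 = math.sqrt(sum(v * v for v in vec1.values()))
    norm2 = math.sqrt(sum(v * v for v in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0

# Toy stand-ins for the articles and comments tables (made-up data).
toy_articles = [
    {'id': 'a1', 'tokens': Counter({'cancer': 4, 'cell': 2})},
    {'id': 'a2', 'tokens': Counter({'climat': 3, 'ocean': 1})},
]
toy_comments = [
    {'pid': 'a1', 'tokens': Counter({'cancer': 1, 'drug': 1})},
    {'pid': 'a2', 'tokens': Counter({'ocean': 2})},
    {'pid': 'missing', 'tokens': Counter({'cell': 1})},
]

# Build the id -> tokens mapping once instead of filtering per comment.
tokens_by_article = {a['id']: a['tokens'] for a in toy_articles}

def token_sim_lookup(comment):
    """Similarity between a comment and its article, 0.0 if the article is unknown."""
    article_tokens = tokens_by_article.get(comment['pid'])
    if article_tokens is None:
        return 0.0
    return cosine(comment['tokens'], article_tokens)

toy_sims = [token_sim_lookup(c) for c in toy_comments]
```

The explicit `None` check plays the role of the `try`/`except` fallback: a comment whose article was not collected gets a similarity of zero.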

<h2>User Similarity</h2>
In order to determine a user's overall knowledge of a topic, we compute a user vocabulary. The user vocabulary is a bag of words built from *all* of the comments a user has made (in the articles we collected).

In [10]:
#Here We Build a Vocab for each user
user_vocab = {}
for author,vocab in comments[['author','tokens']].groupby('author'):
    user_vocab[author] = sum((Counter(dict(x)) for x in vocab.tokens),Counter())
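
To make the aggregation concrete, here is a small sketch with made-up authors and token counts. It accumulates counts in place with `Counter.update`, which adds counts rather than replacing them and avoids building a new `Counter` for every addition. The names `toy_rows` and `toy_vocab` are hypothetical.

```python
from collections import Counter, defaultdict

# (author, per-comment token counts) pairs standing in for the comments table.
toy_rows = [
    ('alice', Counter({'cancer': 2, 'cell': 1})),
    ('bob', Counter({'drug': 1})),
    ('alice', Counter({'cell': 3, 'alga': 1})),
]

# Merge every comment's counts into its author's running vocabulary.
toy_vocab = defaultdict(Counter)
for author, tokens in toy_rows:
    toy_vocab[author].update(tokens)
```

After the loop, `toy_vocab['alice']` holds the combined counts of both of her comments, so `cell` appears four times.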

In [11]:
#Compute the similarity between the posting user's vocabulary and the tokens of the article corresponding to this comment.
def user_vocab_tokens_sim(comment_row):
    try:
        c = user_vocab[comment_row.author]
        a = articles[articles['id']==comment_row['pid']]['tokens'].iloc[0]
        sim = get_cosine(c,a)
    except Exception:
        #Unknown author or no matching article: fall back to zero similarity
        sim = 0.0
    return sim

comments['user_vocab_tokens']=comments.apply(user_vocab_tokens_sim,axis=1)

In [22]:
#Compute the similarity between the posting user's vocabulary and the keywords of the article corresponding to this comment.
def user_vocab_kw_sim(comment_row):
    try:
        c = user_vocab[comment_row.author]
        a = articles[articles['id']==comment_row['pid']]['keywords'].iloc[0]
        sim = get_cosine(c,a)
    except Exception:
        #Unknown author or no matching article: fall back to zero similarity
        sim = 0.0
    return sim

comments['user_vocab_kw']=comments.apply(user_vocab_kw_sim,axis=1)

<h2>Reduce the output</h2>
Here we filter the output to attributes and the newly computed similarity features.

In [23]:
sim_features = comments[['id','likes','link_id','name','parent_id','score','subreddit','ups','pid','comment_length','n_tokens','keyword_sim','token_sim','user_vocab_tokens','user_vocab_kw']]
sim_features.columns

Index([u'id', u'likes', u'link_id', u'name', u'parent_id', u'score',
       u'subreddit', u'ups', u'pid', u'comment_length', u'n_tokens',
       u'keyword_sim', u'token_sim', u'user_vocab_tokens', u'user_vocab_kw'],
      dtype='object')

Write to disk

In [25]:
sim_features.to_csv('./new_sim_features.csv',sep=',',index=False)

In [26]:
sim_features = pd.read_csv('./new_sim_features.csv')