<b>Scratch Pad</b>  
User interaction data (ie # of comments, per subreddit etc)  - Can be used as word move weight/count  
Architecture - WMD fed into net with features as top X closest users (ie user 1,8,56,123) which are ordered by their inherent embedding distances from a reference user (ie a default sub user), could also include furthest users, add NLP/content features for fine tuning suggestions from close users.  
Need to deal with aging problem of recommender systems  
Need to update network gexf file  
identify users very far from you to introduce outside perspectives  
Center of gravity of user comments over time of user account age  
comment -> subreddit sequenced RNN, predict next subreddit to comment in?  
Use subreddit tags and related subreddits for network connections in embedding network  
things learned - reading code from python packages  
temporal EDA (distribution of post month/year)  
Chrome extension to show visualization of selected users network compared to yours  
explore dataset size over accuracy



In [1]:
import json
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from ggplot import *
import networkx as nx
from networkx.readwrite import json_graph
from operator import itemgetter
from collections import Counter
from bs4 import BeautifulSoup
from pyemd import emd
import random
from sklearn.metrics import euclidean_distances
import tensorflow as tf
from tflearn.data_utils import to_categorical, pad_sequences
import tflearn
import SubRecommender
from sklearn.preprocessing import normalize
%matplotlib inline

ImportError: No module named 'rnn'

<h1>Introduction</h1>
In this notebook, we explore a dataset of historical user subreddit comments collected through Reddit's PRAW API. The goal of this analysis is to inform the development of a Recurrent Neural Network model that can serve as a recommender system, suggesting new subreddits to users based on their historical subreddit commenting patterns.  

<h2>Dataset</h2>

The dataset was compiled using a Python scraper built on Reddit's PRAW API. The raw data is a list of 3-element records of the form [username, subreddit, utc timestamp]. Each row represents a single comment made by the user.
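
As a rough illustration of this row layout, the sketch below groups a handful of made-up rows into one chronologically ordered subreddit list per user. The sample rows and the helper name `user_sequences` are hypothetical and are not part of the scraper or of `SubRecommender`.

```python
from collections import defaultdict

# Made-up rows in the [username, subreddit, utc timestamp] layout
rows = [
    ["alice", "python", 1500000300],
    ["bob", "aww", 1500000100],
    ["alice", "AskReddit", 1500000000],
    ["alice", "python", 1500000200],
]

def user_sequences(rows):
    """Group comments by user, ordered by timestamp within each user."""
    by_user = defaultdict(list)
    for user, subreddit, _utc in sorted(rows, key=lambda r: (r[0], r[2])):
        by_user[user].append(subreddit)
    return dict(by_user)

seqs = user_sequences(rows)
```

Each value in `seqs` is the kind of ordered interaction sequence that the model section later splits into chunks.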

In [None]:
with open('data/train_reddit_data.json','r') as data_file:    
    reddit_data = json.load(data_file)

In [None]:
df = pd.DataFrame(reddit_data,columns=['user','subreddit','utc_stamp'])
df['utc_stamp'] = pd.to_datetime(df['utc_stamp'],unit='s')
df.head()

In [None]:
print("Unique Users = " + str(df['user'].nunique()))
print("Unique Subreddits = " + str(df['subreddit'].nunique()))
print("Total User Comments = " + str(df.shape[0]))

<h2>Subreddit Data</h2>

In [None]:
user_subs = df.groupby(['user'])['subreddit'].nunique()
plt.hist(user_subs.values, bins=100)
plt.title("User vs Sub Counts Histogram")
plt.show()

In [None]:
sub_users = df.groupby(['subreddit'])['user'].nunique()
data_tuple = pd.DataFrame([(sub,count) for sub,count in sub_users.items()],columns=["sub","user_count"])
sorted_df = data_tuple.sort_values(by='user_count',ascending=False)
sorted_df.head(50)

In [None]:
user_summary = df.groupby(by=['user'])['user']
plt.hist(user_summary.value_counts(), bins=100)
plt.show()

In [None]:
plt.hist(user_summary.value_counts(), bins=100)
axes = plt.gca()
axes.set_ylim([0,500])
plt.show()

In [None]:
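# Walk the comment list once (rows are assumed to be grouped by user) and track:
#   users_vs_subs      - cumulative unique subreddits seen each time a user ends
#   sub_discovery_time - per user, the number of subreddit switches between
#                        successive first-time ("discovered") subreddits
users_vs_subs = []
current_user = reddit_data[0][0]
subs = []
interaction_count = 0
sub_discovery_time = []
usr_sub_discovery_time = [0]
user_subs_list = []
for i,comment in enumerate(reddit_data):
    if comment[0] != current_user:
        # New user: store the previous user's discovery gaps and reset the state
        user_subs_list = [comment[1]]
        sub_discovery_time.append(usr_sub_discovery_time)
        usr_sub_discovery_time = []
        interaction_count = 0
        users_vs_subs.append(len(subs))
    elif comment[1] not in user_subs_list:
        # Same user, first comment in this subreddit: record the gap since the last discovery
        usr_sub_discovery_time.append(interaction_count)
        interaction_count = 0
        user_subs_list.append(comment[1])
    if comment[1] not in subs:
        subs.append(comment[1])
    current_user = comment[0]
    # Count only changes of subreddit, so runs of repeated comments are ignored.
    # The i == 0 guard keeps reddit_data[i-1] from wrapping around to the last row.
    if i == 0 or comment[1] != reddit_data[i-1][1]:
        interaction_count = interaction_count + 1
# Flush the final user, which the loop itself never appends
sub_discovery_time.append(usr_sub_discovery_time)
users_vs_subs.append(len(subs))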

In [None]:
indexes = np.arange(len(users_vs_subs))
plt.plot(indexes,users_vs_subs)
plt.title("Unique Subreddits vs User Count")
plt.show()

<h1>Model Architecture</h1>

The hypothesis of the recommender model is that, given an ordered sequence of user subreddit interactions, patterns will emerge that favour the discovery of particular new subreddits given that historical user interaction sequence. The intuition is that as users interact with the Reddit ecosystem, they discover new subreddits of interest, but these new discoveries are influenced by the communities they have previously interacted with. We can then train a model to recognize these emergent subreddit discoveries based on users' historical subreddit discovery patterns. When the model is presented with a new sequence of user interactions, it "remembers" other users who historically had similar interaction habits and recommends their subreddits that the current user has yet to discover.  

To build the training dataset, the subreddit interaction sequence for each user can be ordered and then split into chunks representing different periods of Reddit interaction and discovery. From each chunk, we can randomly remove a single subreddit from the interaction as the "discovered" subreddit and use it as our training label for the interaction sequences. This formulation brings with it a hyperparameter that will require tuning, namely the sequence size of each chunk of user interaction periods. 
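
The chunk-and-hold-out step can be sketched as follows. This is a minimal illustration under stated assumptions, not the `SubRecommender` implementation: the helper names `chunk_sequence` and `hold_out_label`, the sample history, and the uniform random choice of the held-out position are all hypothetical.

```python
import random

def chunk_sequence(subs, chunk_size, min_len):
    """Split one user's ordered subreddit list into consecutive chunks,
    dropping chunks shorter than min_len."""
    chunks = [subs[i:i + chunk_size] for i in range(0, len(subs), chunk_size)]
    return [c for c in chunks if len(c) >= min_len]

def hold_out_label(chunk, rng):
    """Remove one randomly chosen position from the chunk.
    Returns (remaining sequence, held-out subreddit)."""
    idx = rng.randrange(len(chunk))
    return chunk[:idx] + chunk[idx + 1:], chunk[idx]

rng = random.Random(0)  # fixed seed so the example is repeatable
history = ["python", "learnpython", "datascience", "MachineLearning",
           "statistics", "dataisbeautiful", "askscience"]
chunks = chunk_sequence(history, chunk_size=4, min_len=3)
pairs = [hold_out_label(c, rng) for c in chunks]
```

Here `chunk_size` plays the role of the sequence-size hyperparameter mentioned above: larger chunks give the model more context per example but yield fewer training pairs per user.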

There are also a couple of design decisions that build inherent assumptions into the model. One is whether the labelled "discovered" subreddit should be chosen uniformly at random from each interaction sequence or through a more structured selection. The proposed model uses the distribution of subreddits in the dataset to weight the random selection of the sequence label, giving rarer subreddits a higher probability of selection. This smooths the distribution of training labels across the model's vocabulary of subreddits. In addition, each user's interaction sequence is compressed to the sequence of non-repeating subreddits. This eliminates the repetitive structure of a user constantly commenting in a single subreddit while still capturing the user's habits across the Reddit ecosystem more generally, allowing the model to distinguish broader patterns from the compressed sequences.
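
The two decisions above might look like the sketch below. It assumes that "non-repeating" means collapsing runs of consecutive comments in the same subreddit, and that the label weight is the inverse of the subreddit's dataset count; the text does not specify the exact weighting, so `1 / count` is only one plausible choice. The helper names and the counts are made up.

```python
import random
from collections import Counter
from itertools import groupby

def compress_repeats(subs):
    """Collapse runs of consecutive comments in the same subreddit into one entry."""
    return [sub for sub, _run in groupby(subs)]

def pick_label_index(chunk, sub_counts, rng):
    """Pick a position with probability proportional to 1 / dataset count,
    so rarer subreddits are more likely to become the label."""
    weights = [1.0 / sub_counts[sub] for sub in chunk]
    return rng.choices(range(len(chunk)), weights=weights, k=1)[0]

comments = ["AskReddit", "AskReddit", "python", "python", "AskReddit", "rust"]
sequence = compress_repeats(comments)

# Made-up dataset-wide comment counts per subreddit
sub_counts = Counter({"AskReddit": 1000, "python": 100, "rust": 10})

rng = random.Random(0)
picks = Counter(sequence[pick_label_index(sequence, sub_counts, rng)]
                for _ in range(2000))
```

Over many draws `picks` is dominated by the rarest subreddit, which is the smoothing effect described above: popular subreddits no longer crowd out the rest of the label vocabulary.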

These subreddit sequence/subreddit label pairs are then passed to various RNN architectures (shallow and deep LSTM/GRU networks), with hyperparameter optimization used to select the best model for recommending new subreddits of interest to Reddit users.

In [None]:
disc_times = [usr_dts[-1] for usr_dts in sub_discovery_time if usr_dts and usr_dts[-1] > 1 and  usr_dts[-1] <100]
plt.hist(disc_times , bins=90)
plt.title("New Sub Discovery Time Steps")
plt.show()

In [None]:
np.percentile(disc_times,95)

In [None]:
non_rep_interaction = [sum(usr_dts) for usr_dts in sub_discovery_time]
plt.hist(non_rep_interaction , bins=90)
plt.title("Total Non-Repeating User Interactions")
plt.show()

In [None]:
np.mean(non_rep_interaction)

In [None]:
flt_disc_times = [dt for usr_dts in sub_discovery_time for dt in usr_dts[50:] if dt < 50]
plt.hist(flt_disc_times , bins=50)
plt.title("Truncated New Sub Discovery Time Steps")
plt.show()

In [None]:
np.mean(flt_disc_times)

<h3>Model Testing</h3>

In [None]:
import SubRecommender
import json
from collections import Counter
tst = SubRecommender.SubRecommender('data/train_reddit_data.json',
                                    sequence_chunk_size=15,min_seq_length=5,min_count_thresh=100)
train_df = tst.load_train_df(load_file="data/training_sequences/329_15_5_sequence_data.json")
tst.train(num_epochs=10,npartitions=12)

In [None]:
label_counter = Counter(tst.training_labels)
label_counter.most_common(1)

In [None]:
len(label_counter.keys())

<h3>Baseline KNN Model</h3>

In [None]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from scipy.sparse import csr_matrix


train,test = tst.split_train_test(train_df,0.8)
X_train = np.array(train['sub_seqs'])
y_train = np.array(train['sub_label']).astype(np.uint16)
X_test = np.array(test['sub_seqs'])
y_test =np.array(test['sub_label']).astype(np.uint16)

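# n_values was deprecated in scikit-learn 0.20 and later removed; on newer versions the
# rough equivalent is categories=[list(range(tst.vocab_size))] * tst.sequence_chunk_size
enc = OneHotEncoder(n_values=tst.vocab_size)
neigh = KNeighborsClassifier(n_neighbors=1,n_jobs=6)

# Pad every sequence to the chunk size so the encoder sees a fixed number of columns
X_train = pad_sequences(X_train, maxlen=tst.sequence_chunk_size, value=0.,padding='post')
X_test = pad_sequences(X_test, maxlen=tst.sequence_chunk_size, value=0.,padding='post')

enc.fit(X_train)

# transform() returns a sparse matrix by default, so no dense round trip is needed
X_train = csr_matrix(enc.transform(X_train))
X_test = csr_matrix(enc.transform(X_test))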

clf = neigh.fit(X_train, y_train)

In [None]:
accuracy_score(y_test, clf.predict(X_test))

In [None]:
def recommendation_accuracy(user_data,model,clf,enc=None):
    """Mean fraction of each user's unique recommendations that match a held-out label."""
    accuracies = []
    for usr,user_subs in user_data.items():
        user_seqs = tst.build_training_sequences(user_subs)
        training_sequences = [data[0] for data in user_seqs]
        training_labels = [data[1] for data in user_seqs]
        # Users without any sequences are skipped entirely
        if training_sequences:
            X_test = pad_sequences(training_sequences, maxlen=tst.sequence_chunk_size, value=0.,padding='post')
            if model == 'knn':
                X_test = csr_matrix(enc.transform(X_test))
                recs = set(clf.predict(X_test))
            elif model == 'rnn':
                sub_probs = clf.predict(X_test)
                # np.argmax works whether predict() returns lists or NumPy arrays
                recs = set(int(np.argmax(probs)) for probs in sub_probs)
            elif model == 'most_common':
                # Here clf is simply the label id of the most common subreddit
                recs = [clf]
            else:
                raise ValueError("Unknown model type: " + str(model))
            accuracy = sum([1 if rec in training_labels else 0 for rec in recs])/len(recs)
            accuracies.append(accuracy)
    return np.mean(accuracies)

In [None]:
with open('data/test_user_comment_sequence_cache.json','r') as data_file:    
    user_seqs = json.load(data_file)

In [None]:
recommendation_accuracy(user_seqs,'rnn',tst.model)

In [None]:
recommendation_accuracy(user_seqs,'most_common',4)

<h3>Word Mover's Distance</h3>

In [None]:
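def pairwise_emd(user_A,user_B,graph_cords):
    """Earth mover's distance between two users' subreddit comment counts."""
    # Union of both users' subreddits, limited to those with embedding coordinates
    set_subs = [sub for sub in set(list(user_A.keys())+list(user_B.keys())) if sub in graph_cords.keys()]
    sub_cords = np.array([graph_cords[sub] for sub in set_subs])
    # Per-subreddit comment counts for each user, 0 where the user never commented
    A_interacts = np.array([user_A[sub] if sub in user_A.keys() else 0 for sub in set_subs])
    B_interacts = np.array([user_B[sub] if sub in user_B.keys() else 0 for sub in set_subs])
    # Ground distance: Euclidean distance between subreddit coordinates
    euc_dists = euclidean_distances(sub_cords,sub_cords)
    emd_dist = emd(A_interacts.astype(np.double), B_interacts.astype(np.double), euc_dists.astype(np.double))
    return emd_dist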

In [None]:
pairwise_emd(grouped['subreddit']['count']['-DEAD-'].sort_values(ascending=False),
             grouped['subreddit']['count']['-Doomcrow-'].sort_values(ascending=False),graph_cords)