## Data exploration 

In the following sections we explore the dataset to gather basic statistics, such as the number of subreddits, the number of users, and how many comments were made. These statistics will help determine the direction of further exploration and a suitable subsampling of the comments, from which we will build the final recommender.

As a baseline, we treat each comment by a user in a subreddit as implicit feedback that the user interacts with that subreddit.
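
To make the implicit-feedback assumption concrete, the sketch below counts (user, subreddit) pairs for a few hypothetical comment records. The `comments` list and its author names are invented for illustration; only the `author` and `subreddit` keys come from the data.

```python
from collections import Counter

# Hypothetical comment records; only the two fields used here are shown.
comments = [
    {"author": "alice", "subreddit": "python"},
    {"author": "alice", "subreddit": "python"},
    {"author": "bob", "subreddit": "exmormon"},
]

# Each comment counts as one implicit interaction between a user and a subreddit.
interactions = Counter((c["author"], c["subreddit"]) for c in comments)

# Binarise: any positive count means "the user interacts with this subreddit".
implicit_feedback = {pair: 1 for pair in interactions}

print(interactions)
print(implicit_feedback)
```

Whether to keep the raw counts or the binarised signal is a modelling choice; both are shown here only as possible starting points.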

In [50]:
import ijson
import json
import numpy as np
from tqdm import tqdm_notebook
from collections import defaultdict

In [26]:
# sample json from the reddit data to get the keys
sample_json = """{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}"""
json_keys = list(json.loads(sample_json))  # iterating a dict yields its keys

In [27]:
json_keys

['score_hidden',
 'name',
 'link_id',
 'body',
 'downs',
 'created_utc',
 'score',
 'author',
 'distinguished',
 'id',
 'archived',
 'parent_id',
 'subreddit',
 'author_flair_css_class',
 'author_flair_text',
 'gilded',
 'retrieved_on',
 'ups',
 'controversiality',
 'subreddit_id',
 'edited']

In [45]:
subreddits = defaultdict(int) # number of comments per subreddit
authors = defaultdict(int) # number of comments per user

with open('data/RC_2015-01') as infile:
    for line in tqdm_notebook(infile):
        comment = json.loads(line)
        subreddits[comment['subreddit']] += 1
        authors[comment['author']] += 1
        del comment

In [52]:
print('Number of users: %s' % len(authors))
print('Number of subreddits: %s' % len(subreddits))

Number of users: 2512123
Number of subreddits: 47172


In [56]:
avg_comments_per_subreddit = np.mean(list(subreddits.values()))
avg_comments_per_user = np.mean(list(authors.values()))

In [59]:
print('Average number of comments per subreddit: %.3f' % avg_comments_per_subreddit)
print('Average number of comments per user: %.3f' % avg_comments_per_user)

Average number of comments per subreddit: 1141.600
Average number of comments per user: 21.437
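
Averages can be dominated by a handful of very active subreddits or users, so it may help to compare the mean with the median before choosing a subsampling cutoff. The sketch below uses a small hypothetical `toy_counts` dictionary in place of `subreddits`, and `min_comments` is an illustrative threshold, not a value taken from the data.

```python
from statistics import mean, median

# Hypothetical per-subreddit comment counts standing in for `subreddits`.
toy_counts = {"a": 1, "b": 2, "c": 3, "d": 1000}

mean_count = mean(toy_counts.values())
median_count = median(toy_counts.values())
print("mean: %.1f, median: %.1f" % (mean_count, median_count))

# Illustrative cutoff (an assumption): keep only subreddits with enough comments.
min_comments = 2
kept = {name: n for name, n in toy_counts.items() if n >= min_comments}
print(kept)
```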


In [None]:
# number of comments per (subreddit, user) pair
subreddit_user_interaction = defaultdict(lambda: defaultdict(int))

with open('data/RC_2015-01') as infile:
    for line in tqdm_notebook(infile):
        comment = json.loads(line)
        subreddit = comment['subreddit']
        author = comment['author']
        subreddit_user_interaction[subreddit][author] += 1

In [69]:
y = defaultdict(lambda: defaultdict(lambda: 0))

In [71]:
y

defaultdict(<function __main__.<lambda>()>, {})

In [72]:
y['a']

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {})

In [74]:
y['a']['b'] += 1

In [75]:
y

defaultdict(<function __main__.<lambda>()>,
            {'a': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
                         {'b': 1})})

In [76]:
y['a']['c'] += 1

In [77]:
y.values

<function defaultdict.values>

In [79]:
dict(y)

{'a': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
             {'b': 1, 'c': 1})}

In [80]:
y

defaultdict(<function __main__.<lambda>()>,
            {'a': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
                         {'b': 1, 'c': 1})})

In [81]:
y['z']['c'] += 1

In [82]:
y

defaultdict(<function __main__.<lambda>()>,
            {'a': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
                         {'b': 1, 'c': 1}),
             'z': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
                         {'c': 1})})
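
As the `dict(y)` output above shows, converting the outer mapping leaves the inner values as `defaultdict` objects. If plain dictionaries are wanted for inspection or serialisation, a conversion along the following lines could be used; `to_plain_dict` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
from collections import defaultdict


def to_plain_dict(nested):
    """Convert a two-level defaultdict into plain nested dicts."""
    return {outer: dict(inner) for outer, inner in nested.items()}


# Rebuild the toy structure from the cells above.
y = defaultdict(lambda: defaultdict(int))
y['a']['b'] += 1
y['a']['c'] += 1
y['z']['c'] += 1

plain = to_plain_dict(y)
print(plain)
```

Unlike a `defaultdict`, the converted result raises `KeyError` on a missing key instead of silently creating an entry.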