## Data exploration 

In the following sections we explore the dataset to gather basic statistics, such as the number of subreddits, the number of users, and how many comments were made. This will help determine the direction of further exploration and a suitable subsample of the comments with which to build the final recommender.

As a baseline, we treat each comment by a user in a subreddit as implicit feedback that the user interacts with that subreddit.
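
To make the implicit-feedback idea concrete, here is a minimal sketch that counts comments per (author, subreddit) pair. The helper `count_interactions` and the toy comments are hypothetical and only illustrate the idea; the notebook's own cells work on the full dump.

```python
from collections import Counter

def count_interactions(comments):
    """Count comments per (author, subreddit) pair as implicit feedback."""
    counts = Counter()
    for comment in comments:
        counts[(comment['author'], comment['subreddit'])] += 1
    return counts

# Hypothetical toy comments; only the 'author' and 'subreddit' keys are used.
toy_comments = [
    {'author': 'alice', 'subreddit': 'python'},
    {'author': 'alice', 'subreddit': 'python'},
    {'author': 'alice', 'subreddit': 'exmormon'},
    {'author': 'bob', 'subreddit': 'python'},
]
toy_counts = count_interactions(toy_comments)
print(toy_counts[('alice', 'python')])
```

A higher count is read here as a stronger signal of interest; no explicit rating is involved.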

In [50]:
import json
import numpy as np
from tqdm import tqdm_notebook
from collections import defaultdict

In [26]:
# sample json from the reddit data to get the keys
sample_json = """{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}"""
json_keys = list(json.loads(sample_json))  # iterating a dict yields its keys

In [27]:
json_keys

['score_hidden',
 'name',
 'link_id',
 'body',
 'downs',
 'created_utc',
 'score',
 'author',
 'distinguished',
 'id',
 'archived',
 'parent_id',
 'subreddit',
 'author_flair_css_class',
 'author_flair_text',
 'gilded',
 'retrieved_on',
 'ups',
 'controversiality',
 'subreddit_id',
 'edited']

In [45]:
subreddits = defaultdict(int) # number of comments per subreddit
authors = defaultdict(int) # number of comments per user

with open('data/RC_2015-01') as infile:
    for line in tqdm_notebook(infile):
        comment = json.loads(line)
        subreddits[comment['subreddit']] += 1
        authors[comment['author']] += 1
        del comment

In [52]:
print('Number of users: %s' % len(authors))
print('Number of subreddits: %s' % len(subreddits))

Number of users: 2512123
Number of subreddits: 47172


In [56]:
avg_comments_per_subreddit = np.mean(list(subreddits.values()))
avg_comments_per_user = np.mean(list(authors.values()))

In [59]:
print('Average number of comments per subreddit: %.3f' % avg_comments_per_subreddit)
print('Average number of comments per user: %.3f' % avg_comments_per_user)

Average number of comments per subreddit: 1141.600
Average number of comments per user: 21.437
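
As a quick sanity check, every comment is counted once under its subreddit and once under its author, so each average multiplied by its group count should recover the same total number of comments. The sketch below assumes only the figures printed above; the variable names are illustrative.

```python
# Figures copied from the printed output above.
n_subreddits = 47172
n_users = 2512123
avg_per_subreddit = 1141.600
avg_per_user = 21.437

total_from_subreddits = n_subreddits * avg_per_subreddit
total_from_users = n_users * avg_per_user

# The two estimates differ only because the averages were rounded to three decimals.
relative_gap = abs(total_from_subreddits - total_from_users) / total_from_users
print(f'{total_from_subreddits:,.0f} vs {total_from_users:,.0f} (gap {relative_gap:.1e})')
```

Both products come out at roughly 53.85 million comments.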


In [103]:
# number of comments per (subreddit, author) pair
subreddit_user_interaction = defaultdict(lambda: defaultdict(int))

with open('data/RC_2015-01') as infile:
    for line in tqdm_notebook(infile):
        comment = json.loads(line)
        subreddit = comment['subreddit']
        author = comment['author']
        subreddit_user_interaction[subreddit][author] += 1
        del comment

In [104]:
# One "subreddit author count" record per line; 'w' avoids duplicate rows when the cell is re-run.
with open('interactions', 'w') as f:
    for subreddit, author_counts in subreddit_user_interaction.items():
        for author, count in author_counts.items():
            f.write(f'{subreddit} {author} {count}\n')
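
The saved file can be read back later without re-parsing the raw dump. The following is a rough sketch, assuming the whitespace-separated `subreddit author count` layout written above; `read_interactions` is a hypothetical helper, not part of the notebook.

```python
import io
from collections import defaultdict

def read_interactions(lines):
    """Parse 'subreddit author count' records into a nested dict."""
    interactions = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        subreddit, author, count = fields
        interactions[subreddit][author] = int(count)
    return interactions

# Hypothetical sample in the same layout as the 'interactions' file.
sample = io.StringIO('exmormon YoungModern 3 \npython alice 1 \n')
loaded = read_interactions(sample)
print(loaded['exmormon']['YoungModern'])
```

With the real file, an open file handle can be passed in place of the `StringIO` object. Splitting on arbitrary whitespace keeps the parser tolerant of a trailing space before the newline.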

In [119]:
# comments per (subreddit, author) pair, counting only bodies longer than 30 characters
quality_subreddit_user_interaction = defaultdict(lambda: defaultdict(int))

with open('data/RC_2015-01') as infile:
    for line in tqdm_notebook(infile):
        comment = json.loads(line)
        subreddit = comment['subreddit']
        author = comment['author']
        if len(comment['body']) > 30:
            quality_subreddit_user_interaction[subreddit][author] += 1
        del comment

In [121]:
# Same record layout as 'interactions'; 'w' avoids duplicate rows when the cell is re-run.
with open('interactions_30_ch', 'w') as f:
    for subreddit, author_counts in quality_subreddit_user_interaction.items():
        for author, count in author_counts.items():
            f.write(f'{subreddit} {author} {count}\n')

In [135]:
user_comments_30 = defaultdict(int)  # comments longer than 30 characters per user

for author_counts in quality_subreddit_user_interaction.values():
    for user, count in author_counts.items():
        user_comments_30[user] += count

In [136]:
# users ordered by comment count, most active first
user_comments_30_sorted = sorted(user_comments_30.items(), key=lambda kv: kv[1], reverse=True)

In [145]:
user_comments_30_sorted

[('[deleted]', 524990),
 ('AutoModerator', 233144),
 ('PoliticBot', 61889),
 ('autowikibot', 22599),
 ('TweetPoster', 16325),
 ('havoc_bot', 14186),
 ('MTGCardFetcher', 12305),
 ('imgurtranscriber', 10302),
 ('RPBot', 10014),
 ('pineapple_lumps', 6721),
 ('totes_meta_bot', 6700),
 ('dogetipbot', 6663),
 ('hit_bot', 6616),
 ('aGoldenWhale', 6552),
 ('TweetsInCommentsBot', 6153),
 ('Metaboss84', 6075),
 ('PriceZombie', 5853),
 ('TotalWarfare', 5171),
 ('Late_Night_Grumbler', 4963),
 ('_RegiBot', 4897),
 ('Marvelvsdc00', 4663),
 ('RealtechPostBot', 4548),
 ('MultiFunctionBot', 4301),
 ('Tucan_Sam_', 4221),
 ('catgirl64', 4214),
 ('MayTentacleBeWithYee', 4148),
 ('BluePotterExpress', 4129),
 ('ClearlyInvsible', 4006),
 ('timewaitsforsome', 3782),
 ('20141220', 3779),
 ('ravenluna', 3771),
 ('ttumblrbots', 3720),
 ('anbeav', 3622),
 ('VeryAwesome69', 3619),
 ('youtubefactsbot', 3615),
 ('changetip', 3565),
 ('subredditreports', 3444),
 ('angelofthedoctor', 3397),
 ('Thrice_Berg', 3394),
 ...]

### Dealing with Bots

Reddit has many bots that periodically check for new comments or messages and respond based on their content.

We don't want to recommend subreddits to bots. There is no particular signature that identifies a user as a bot, but as a general rule of thumb, if the name ends with 'bot', it is a bot. To keep things simple, we work with this hypothesis and remove all users whose name ends with 'bot' (case-insensitive). This can remove some legitimate users and miss other users that are actually bots, but it is the assumption we work with from here on. We also remove '[deleted]' and AutoModerator, which are not real users either. From the top commenters we identified a few more bots, which have been added to the bots set.

In [171]:
bots = set(user for user, _ in user_comments_30_sorted if user.lower().endswith('bot'))
bots.add('[deleted]')
bots.add('AutoModerator')
bots.add('TweetPoster')
bots.add('imgurtranscriber')
bots.add('MTGCardFetcher')
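
To see what the suffix rule catches and what it misses, it can be tried on a few names from the top-commenter list. This is only an illustrative sketch; `looks_like_bot` is a hypothetical helper that mirrors the rule above plus an optional hand-picked set.

```python
def looks_like_bot(username, known_bots=frozenset()):
    """Suffix rule from the text plus an optional set of hand-picked names."""
    return username.lower().endswith('bot') or username in known_bots

# Names taken from the top-commenter output above.
names = ['PoliticBot', 'autowikibot', 'ttumblrbots', 'TweetPoster', 'pineapple_lumps']
suffix_only = {name: looks_like_bot(name) for name in names}
with_manual = {name: looks_like_bot(name, {'TweetPoster'}) for name in names}

print(suffix_only)
print(with_manual)
```

Under the suffix rule alone, 'ttumblrbots' is not flagged because it ends in 'bots', and 'TweetPoster' is caught only through the manual set.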

In [172]:
# comments per (subreddit, author) pair, longer than 30 characters and not written by a known bot
interaction_without_bots = defaultdict(lambda: defaultdict(int))

with open('data/RC_2015-01') as infile:
    for line in tqdm_notebook(infile):
        comment = json.loads(line)
        subreddit = comment['subreddit']
        author = comment['author']
        if len(comment['body']) > 30 and author not in bots:
            interaction_without_bots[subreddit][author] += 1
        del comment

In [173]:
# Same record layout as 'interactions'; 'w' avoids duplicate rows when the cell is re-run.
with open('interactions_30_ch_no_bots', 'w') as f:
    for subreddit, author_counts in interaction_without_bots.items():
        for author, count in author_counts.items():
            f.write(f'{subreddit} {author} {count}\n')

In [174]:
user_comments_30_no_bots = defaultdict(int)  # per-user totals after bot removal

for author_counts in interaction_without_bots.values():
    for user, count in author_counts.items():
        user_comments_30_no_bots[user] += count

In [175]:
# remaining users ordered by comment count, most active first
user_comments_30_no_bots_sorted = sorted(user_comments_30_no_bots.items(), key=lambda kv: kv[1], reverse=True)

In [176]:
user_comments_30_no_bots_sorted

[('pineapple_lumps', 6721),
 ('aGoldenWhale', 6552),
 ('Metaboss84', 6075),
 ('PriceZombie', 5853),
 ('TotalWarfare', 5171),
 ('Late_Night_Grumbler', 4963),
 ('Marvelvsdc00', 4663),
 ('Tucan_Sam_', 4221),
 ('catgirl64', 4214),
 ('MayTentacleBeWithYee', 4148),
 ('BluePotterExpress', 4129),
 ('ClearlyInvsible', 4006),
 ('timewaitsforsome', 3782),
 ('20141220', 3779),
 ('ravenluna', 3771),
 ('ttumblrbots', 3720),
 ('anbeav', 3622),
 ('VeryAwesome69', 3619),
 ('changetip', 3565),
 ('subredditreports', 3444),
 ('angelofthedoctor', 3397),
 ('Thrice_Berg', 3394),
 ('MovieGuide', 3283),
 ('ImaginaryMan', 3280),
 ('DastardlyGifts', 3249),
 ('Hanzo_Ishimura', 3226),
 ('braveonion', 3210),
 ('Aidan_Fikri', 3170),
 ('acini', 3156),
 ('DolphinDoom', 3126),
 ('Thief39', 3124),
 ('lolhaibai', 3087),
 ('Luna_Lockheart', 3029),
 ('king_kalamari', 2963),
 ('mikailgirl', 2919),
 ('roosterblue72', 2882),
 ('ApiContraption', 2810),
 ('redditbots', 2798),
 ('SadPandaFace00', 2792),
 ('Jakomako', 2777),
 ('M