# Reddit Discussion Recommender Bot
## Dataset Creation

In general usage, this Reddit recommender system is centered around the concept of finding *ongoing* discussions that are similar to selected discussion.
This involves an ongoing and real-time collection and creation of the dataset.

Furthermore, for the recommendation to work well, it is helpful to have a reasonably large dataset.  In practice, this means collecting comments for an hour or more.

For the purposes of testing and demonstration, it may prove rather helpful to have somewhat a contrived dataset.

We have created one or more mocked up Reddit discussions.

Here we'll created a corresponding dataset.

This should help to demonstrate some of the components of the recommender system.

In [2]:
import whoosh
import psaw

from whoosh.fields import Schema, TEXT, ID, STORED
from whoosh.analysis import StemmingAnalyzer, RegexTokenizer, LowercaseFilter, StopFilter, StemFilter
import os, os.path
from whoosh import index

## Create the Index

We need to create the index and prepare it.

### Create the Analyzer

The analyzer in Whoosh is analagous to the filter chain in MeTA.

This analyzer is relatively simple.  It does the following:

In [3]:
analyzer = RegexTokenizer() | LowercaseFilter() | StopFilter() | StemFilter()

### Create the Schema

The schema in Whoosh is describes how things are stored in the (inverted) index.

Several fields can be stored in the index.  Fields can be stored in the index, analagous to a postings file.

In this case, we don't need the actual documents (comment text body) stored.  We need that indexed.

But we do need the several other pieces of data about the comment stored so we can use them later to create the recommendation after analysis.

In [4]:
schema = Schema(
    comment_id=ID(stored=True),
    parent_id=ID(stored=True),
    submission_id=ID(stored=True),
    subreddit_id=ID(stored=True),
    content=TEXT(analyzer=analyzer)
)

### Initiate the Index


In [50]:
index_dir = "../data/processed"
if not os.path.exists(index_dir):
    os.mkdir(index_dir)
else:
    # Do I need to do anything to clean up?
    # No... it will clean things up once we write the first comment
    pass
ix = index.create_in(index_dir, schema)

## Populate the Index

In normal usage, the recommender system uses the psaw interface to Pushshift.io to ingest comments on an ongoing fashion.

Here, we'll change the way we use PSAW.

Our primary goal here will be to slurp up identified submissions

In [49]:
submission_list = [
    # https://www.reddit.com/r/askscience/comments/bpf6mx/earth_has_seasons_because_our_planets_axis_of/
    'bpf6mx',
    # https://www.reddit.com/r/explainlikeimfive/comments/4ajypn/eli5why_does_earth_axial_tilt_dictate_seasons_but/
    '4ajypn',
    # https://www.reddit.com/r/askscience/comments/3wooz2/how_common_are_planets_with_tilts_like_ours_are/
    '3wooz2',
    # https://www.reddit.com/r/OutOfTheLoop/comments/4xzwwv/what_is_3d_chess/
    '4xzwwv'
    # https://www.reddit.com/r/DaystromInstitute/comments/8n48s6/how_do_you_play_threedimensional_chess/
    '8n48s6'
]

In [12]:
submission_list

['bpf6mx']

### Get the api to Pushshift.io

In [13]:
api = psaw.PushshiftAPI()

### First... an example
Before we stuff things into the inverted index, let's take a look at what we get from psaw.

Here we query on the submission id.  We can sort the results.  But more importantly we filter the query
to ensure we only get the fields we require.

I know this first submission (at last at the moment I'm typing this) only has a couple comments...

In [24]:
for submission_id in submission_list[:1]:
    results = api.search_comments(
        link_id = submission_id,
        sort='asc',
        sort_type='created_utc',
        filter=['id','parent_id','link_id','subreddit_id', 'body','permalink']
    )
    results = list(results)
    for comment in results:
        print(comment)

comment(body="There's a range of axial tilts [within the solar system](https://en.wikipedia.org/wiki/Axial_tilt#Solar_System_bodies). There has been [some work](https://phys.org/news/2012-01-loss-planetary-tilt-doom-alien.html) suggesting that no axial tilt would result in an inhospitable planet with the tropics too hot and the poles too cold. Conversely, too tipped may result in most of the planet having 6 months of light and 6 months of dark, which again falls into extremes that may be inhospitable.", created_utc=1558077470, id='envjb1h', link_id='t3_bpf6mx', parent_id='t3_bpf6mx', permalink='/r/askscience/comments/bpf6mx/earth_has_seasons_because_our_planets_axis_of/envjb1h/', subreddit_id='t5_2qm4e', created=1558099070.0, d_={'body': "There's a range of axial tilts [within the solar system](https://en.wikipedia.org/wiki/Axial_tilt#Solar_System_bodies). There has been [some work](https://phys.org/news/2012-01-loss-planetary-tilt-doom-alien.html) suggesting that no axial tilt would r

Now... let's see what happens to our comment text via our analyzer...

In [25]:
[token.text for token in analyzer("Hello there, this is a TEST")]

['hello', 'there', 'test']

In [27]:
[token.text for token in analyzer(results[0].body)]

['there',
 'rang',
 'axial',
 'tilt',
 'within',
 'solar',
 'system',
 'http',
 'en.wikipedia.org',
 'wiki',
 'axial_tilt',
 'solar_system_bodi',
 'there',
 'ha',
 'been',
 'some',
 'work',
 'http',
 'phys.org',
 'new',
 '2012',
 '01',
 'loss',
 'planetari',
 'tilt',
 'doom',
 'alien.html',
 'suggest',
 'no',
 'axial',
 'tilt',
 'would',
 'result',
 'inhospit',
 'planet',
 'tropic',
 'too',
 'hot',
 'pole',
 'too',
 'cold',
 'convers',
 'too',
 'tipp',
 'result',
 'most',
 'planet',
 'have',
 'month',
 'light',
 'month',
 'dark',
 'which',
 'again',
 'fall',
 'into',
 'extrem',
 'inhospit']

### Populate the index

In [51]:
with ix.writer() as writer:
    for submission_id in submission_list:
        results = api.search_comments(
            link_id = submission_id,
            sort='asc',
            sort_type='created_utc',
            filter=['id','parent_id','link_id','subreddit_id', 'body','permalink']
        )
        for comment in results:
            writer.add_document(
                comment_id = comment.id,
                parent_id = comment.parent_id,
                submission_id = comment.link_id,
                subreddit_id = comment.subreddit_id,
                content = comment.body
            )

Let's check the size of our inverted index...

In [52]:
ix.doc_count()

147