# The Deep-Learning Reddit Chatbot


## What is Reddit?

Inspired by [pythonprogramming.net](https://pythonprogramming.net/bidirectional-attention-mechanism-chatbot-deep-learning-python-tensorflow/?completed=/training-model-chatbot-deep-learning-python-tensorflow/) and the TensorFlow [neural machine translation (NMT) tutorial](https://github.com/tensorflow/nmt).


[Reddit](https://www.reddit.com/) is a massive internet forum, famous for its many diverse user-made "subreddits".


[A famous Reddit post](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/?st=j9udbxta&sh=69e4fee7) made 1.7 billion Reddit comments publicly available, compressed to about 250 GB. That is convenient, because collecting comments one by one through the Reddit API (via PRAW) or by scraping is not worth the hassle.


Another user was kind enough to [load](https://www.reddit.com/r/bigquery/comments/3cej2b/17_billion_reddit_comments_loaded_on_bigquery/?st=jae26i99&sh=3d53e174) them into Google BigQuery. Querying them there is cost-prohibitive at the moment, so I might have to use Shamu.

## Advantages 

  * Reddit stores its comments in a parent-child format: every reply records the ID of the comment (or post) it responds to, so a parent and its reply form a natural conversational pair.

  * Another unique feature is voting: users vote on comments, so every comment carries a score.

  * Certain subreddits have their own culture and conversational style, written by real people, which may be useful for a chatbot. We may be able to use the voting system as a filter on comments to separate good replies from poor ones.
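
As a rough illustration of the voting filter, the sketch below keeps only the highest-scored reply for each parent and ignores replies under a minimum score. The function name `best_replies`, the dictionary keys, and the sample comments are made up for this example; the threshold of 2 is an assumption, chosen to match the cutoff the database script further down uses.

```python
def best_replies(comments, min_score=2):
    """Keep the highest-scored reply for each parent comment.

    `comments` is an iterable of dicts with the (hypothetical) keys
    'parent_id', 'body' and 'score'.
    """
    best = {}
    for comment in comments:
        if comment["score"] < min_score:
            continue  # too few votes to count as a good reply
        current = best.get(comment["parent_id"])
        if current is None or comment["score"] > current["score"]:
            best[comment["parent_id"]] = comment
    return best


# Made-up sample data, only for illustration.
sample = [
    {"parent_id": "a1", "body": "first reply", "score": 3},
    {"parent_id": "a1", "body": "better reply", "score": 12},
    {"parent_id": "b2", "body": "ignored reply", "score": 1},
]
filtered = best_replies(sample)
print(filtered["a1"]["body"])  # better reply
```

Keeping a single reply per parent gives one clean question/answer pair, at the cost of throwing away other reasonable replies.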

## Roadmap

1. We're going to download a one-month sample of the data. There is no need to download everything, although it would be interesting to toy around with the full set on Shamu. (Ask Dr. Richardson for help.)

2. Make a database to BUFFER the data. It's SO big that we can't just read it into our puny 32 GB of RAM to build the training set. Even a single month is big data (Reddit is massive). For SQLite3, we define a set of helper functions that build SQL commands and run them against one big database (the one provided by pythonprogramming.net is shown below). **Help! Dr. Richardson!**

3. We're going to train a sequence-to-sequence recurrent network on our data, following the [neural machine translation](https://github.com/tensorflow/nmt) tutorial. It uses something called "attention mechanisms" on top of [Long Short-Term Memory (LSTM) networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/). **Help! Dr. Richardson!**
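
The buffering idea in step 2 can be sketched as a generator that parses one comment at a time, so memory use does not grow with the size of the dump. This assumes the dump holds one JSON object per line, which is how the script below reads it. The name `iter_comments` and the sample rows are hypothetical.

```python
import io
import json


def iter_comments(lines):
    """Yield one parsed comment at a time from newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed rows instead of aborting the run


# Stand-in for a dump file; a real run would pass an open file object.
fake_dump = io.StringIO(
    '{"id": "c1", "parent_id": "t3_p1", "body": "hello", "score": 5}\n'
    'not json\n'
    '{"id": "c2", "parent_id": "t1_c1", "body": "hi there", "score": 2}\n'
)

# list() is used only to inspect the demo; a real run would loop over the generator.
comments = list(iter_comments(fake_dump))
print(len(comments))  # 2
```

Each yielded dictionary can then be handed to the database functions one row at a time.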

LSTMs can remember sequences of roughly 10-20 tokens fairly well. Beyond that point, their performance drops. In practice, a "bidirectional" recurrent neural network, which reads the sequence both forwards and backwards, does pretty well.
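
To make "bidirectional" concrete, here is a toy sketch in plain Python. It is not an LSTM and not the NMT tutorial's code: it runs a one-unit tanh recurrence with fixed, made-up weights once left-to-right and once right-to-left, then pairs the two states at each position. The function names and numbers are assumptions for illustration only.

```python
import math


def rnn_pass(inputs, w_in=0.5, w_rec=0.8):
    """Run a one-unit tanh recurrence and return the state at each step."""
    state = 0.0
    states = []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states


def bidirectional_states(inputs):
    """Pair each position's forward state with its backward state."""
    forward = rnn_pass(inputs)
    # Read the sequence right-to-left, then flip the states back into place.
    backward = rnn_pass(inputs[::-1])[::-1]
    return list(zip(forward, backward))


tokens = [1.0, 0.0, -1.0, 0.5]  # stand-in for embedded tokens
states = bidirectional_states(tokens)
for forward_state, backward_state in states:
    print(round(forward_state, 3), round(backward_state, 3))
```

The point is that every position ends up with a summary of what came before it and what comes after it, which a purely left-to-right network cannot provide.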

We're going to build a database that stores parent comments together with their replies. The reason is that these files are too big to simply read into RAM and turn into training files, even for a single month (Reddit has many users). And chances are that, if you want to create a really nice chatbot, you will eventually want to work with many months of data, possibly the billions of comments you have at your disposal. When that is the case, we want some sort of database. For our purposes here, just to keep things simple, we use SQLite.

Below is the code that builds the database:

import sqlite3
import json
from datetime import datetime
import time

timeframe = '2017-03'
sql_transaction = []
start_row = 0
cleanup = 1000000

connection = sqlite3.connect('{}.db'.format(timeframe))
c = connection.cursor()

def create_table():
    c.execute("CREATE TABLE IF NOT EXISTS parent_reply(parent_id TEXT PRIMARY KEY, comment_id TEXT UNIQUE, parent TEXT, comment TEXT, subreddit TEXT, unix INT, score INT)")

def format_data(data):
    data = data.replace('\n',' newlinechar ').replace('\r',' newlinechar ').replace('"',"'")
    return data

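def transaction_bldr(sql):
    """Queue a SQL statement; once more than 1000 are queued, run them as one transaction."""
    global sql_transaction
    sql_transaction.append(sql)
    if len(sql_transaction) > 1000:
        c.execute('BEGIN TRANSACTION')
        for s in sql_transaction:
            try:
                c.execute(s)
            except Exception:
                # Skip statements that fail (for example, a duplicate key).
                pass
        connection.commit()
        sql_transaction = []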

def sql_insert_replace_comment(commentid,parentid,parent,comment,subreddit,time,score):
    try:
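        # str.format() only fills {} fields, so the original "?" placeholders were never substituted.
        sql = """UPDATE parent_reply SET parent_id = "{}", comment_id = "{}", parent = "{}", comment = "{}", subreddit = "{}", unix = {}, score = {} WHERE parent_id = "{}";""".format(parentid, commentid, parent, comment, subreddit, int(time), score, parentid)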
        transaction_bldr(sql)
    except Exception as e:
        print('s0 insertion',str(e))

def sql_insert_has_parent(commentid,parentid,parent,comment,subreddit,time,score):
    try:
        sql = """INSERT INTO parent_reply (parent_id, comment_id, parent, comment, subreddit, unix, score) VALUES ("{}","{}","{}","{}","{}",{},{});""".format(parentid, commentid, parent, comment, subreddit, int(time), score)
        transaction_bldr(sql)
    except Exception as e:
        print('s0 insertion',str(e))

def sql_insert_no_parent(commentid,parentid,comment,subreddit,time,score):
    try:
        sql = """INSERT INTO parent_reply (parent_id, comment_id, comment, subreddit, unix, score) VALUES ("{}","{}","{}","{}",{},{});""".format(parentid, commentid, comment, subreddit, int(time), score)
        transaction_bldr(sql)
    except Exception as e:
        print('s0 insertion',str(e))

def acceptable(data):
    if len(data.split(' ')) > 1000 or len(data) < 1:
        return False
    elif len(data) > 32000:
        return False
    elif data == '[deleted]':
        return False
    elif data == '[removed]':
        return False
    else:
        return True

def find_parent(pid):
    try:
        sql = "SELECT comment FROM parent_reply WHERE comment_id = '{}' LIMIT 1".format(pid)
        c.execute(sql)
        result = c.fetchone()
        if result is not None:
            return result[0]
        else:
            return False
    except Exception as e:
        #print(str(e))
        return False

def find_existing_score(pid):
    try:
        sql = "SELECT score FROM parent_reply WHERE parent_id = '{}' LIMIT 1".format(pid)
        c.execute(sql)
        result = c.fetchone()
        if result is not None:
            return result[0]
        else:
            return False
    except Exception as e:
        #print(str(e))
        return False
    
if __name__ == '__main__':
    create_table()
    row_counter = 0
    paired_rows = 0

    #with open('J:/chatdata/reddit_data/{}/RC_{}'.format(timeframe.split('-')[0],timeframe), buffering=1000) as f:
    with open('/home/paperspace/reddit_comment_dumps/RC_{}'.format(timeframe), buffering=1000) as f:
        for row in f:
            #print(row)
            #time.sleep(555)
            row_counter += 1

            if row_counter > start_row:
                try:
                    row = json.loads(row)
                    parent_id = row['parent_id'].split('_')[1]
                    body = format_data(row['body'])
                    created_utc = row['created_utc']
                    score = row['score']
                    
                    comment_id = row['id']
                    
                    subreddit = row['subreddit']
                    parent_data = find_parent(parent_id)
                    
                    existing_comment_score = find_existing_score(parent_id)
                    if existing_comment_score:
                        if score > existing_comment_score:
                            if acceptable(body):
                                sql_insert_replace_comment(comment_id,parent_id,parent_data,body,subreddit,created_utc,score)
                                
                    else:
                        if acceptable(body):
                            if parent_data:
                                if score >= 2:
                                    sql_insert_has_parent(comment_id,parent_id,parent_data,body,subreddit,created_utc,score)
                                    paired_rows += 1
                            else:
                                sql_insert_no_parent(comment_id,parent_id,body,subreddit,created_utc,score)
                except Exception as e:
                    print(str(e))
                            
            if row_counter % 100000 == 0:
                print('Total Rows Read: {}, Paired Rows: {}, Time: {}'.format(row_counter, paired_rows, str(datetime.now())))

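            if row_counter > start_row:
                if row_counter % cleanup == 0:
                    print("Cleaning up!")
                    # Drop rows that have no parent text, then reclaim the freed space.
                    sql = "DELETE FROM parent_reply WHERE parent IS NULL"
                    c.execute(sql)
                    connection.commit()
                    c.execute("VACUUM")
                    connection.commit()

    # Run the statements still queued in the final partial batch, then close.
    for s in sql_transaction:
        try:
            c.execute(s)
        except Exception:
            pass
    connection.commit()
    connection.close()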
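
The script above builds its SQL with string formatting, which is why `format_data` has to replace double quotes. As a hedged alternative sketch (not the tutorial's code), `sqlite3` can bind values through `?` placeholders, which handles quotes without rewriting the text. The example uses an in-memory database and made-up rows, then reads back the parent/reply pairs that would feed the training files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for the demo only
cur = conn.cursor()
cur.execute(
    "CREATE TABLE parent_reply (parent_id TEXT PRIMARY KEY, comment_id TEXT UNIQUE, "
    "parent TEXT, comment TEXT, subreddit TEXT, unix INT, score INT)"
)

# Made-up rows; the second one has no parent text yet.
rows = [
    ("p1", "c1", 'He said "hi"', "It's a reply", "python", 1488326400, 5),
    ("p2", "c2", None, "orphan comment", "python", 1488326401, 3),
]

# The ? placeholders let sqlite3 bind the values, quotes included.
cur.executemany(
    "INSERT INTO parent_reply (parent_id, comment_id, parent, comment, subreddit, unix, score) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    rows,
)
conn.commit()

# Only rows with parent text form a usable (parent, reply) training pair.
cur.execute("SELECT parent, comment FROM parent_reply WHERE parent IS NOT NULL")
pairs = cur.fetchall()
print(pairs)
conn.close()
```

`executemany` also batches the inserts, which serves the same purpose as the manual queue in `transaction_bldr`.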