# 1. Data Acquisition

We use the publicly available [Reddit dump](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/). The dataset is hosted on Google BigQuery and is divided into monthly tables covering December 2015 to October 2018. BigQuery lets us run low-latency queries on massive datasets; [this table](https://bigquery.cloud.google.com/table/fh-bigquery:reddit_posts.2018_08) is one example. Unfortunately, the posts are not stored together with their comments, so in addition to BigQuery we use [PRAW](https://praw.readthedocs.io/en/latest/) to fetch the comments.

The idea is to query a subset of posts from December 2015 to October 2018, sample from it at random, and then use PRAW to fetch the comments for each sampled post.

We are considering 11 flairs:
```
1. AskIndia
2. Politics
3. Sports
4. Food
5. [R]eddiquette
6. Non-Political
7. Scheduled
8. Business/Finance
9. Science/Technology
10. Photography
11. Policy/Economy 
```

### Note: This notebook requires a GCP account, a Reddit account, and the Cloud SDK installed.
See the [Cloud SDK documentation](https://cloud.google.com/sdk/) for more info.

Complete the following steps before running this notebook:
```
1. Install the BigQuery client library locally using: pip install --upgrade google-cloud-bigquery
2. In the GCP Console, go to the Create service account key page.
3. From the Service account drop-down list, select New service account.
4. In the Service account name field, enter a name.
5. From the Role drop-down list, select Project > Owner.
6. Click Create. A JSON file that contains your key downloads to your computer.
7. In a new session, execute the following command:
    export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"

```
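Before creating the BigQuery client, it can help to confirm that step 7 took effect in the session running the notebook. The following is a minimal sketch, not part of the original notebook; the helper name `credentials_problem` is hypothetical, and it only checks that the variable is set and points to an existing file, not that the key is valid.

```python
import os
from pathlib import Path


def credentials_problem(env=None):
    """Describe what is wrong with the credentials setup, or return None.

    `env` defaults to the real environment; passing a dict makes the
    check easy to try out without touching the session.
    """
    env = os.environ if env is None else env
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not Path(key_path).is_file():
        return f"key file not found: {key_path}"
    return None


# Report the state of the current session.
problem = credentials_problem()
print(problem or "credentials variable points to an existing file")
```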



### To set up Reddit credentials
```
1. Go to https://www.reddit.com/prefs/apps
2. Click create app at the bottom
3. Enter an app name, choose 'script' and enter http://localhost:8080 as the redirect URI
4. Save the client id and client secret
```

### Run the MongoDB server in a different session
```
1. In a separate terminal session, run ./mongod before proceeding.
```
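A quick way to confirm that `mongod` is actually listening before running the rest of the notebook is to probe its port. The sketch below uses only the standard library; the helper name `port_is_open` is hypothetical, and it assumes the default MongoDB address `localhost:27017`. It checks only that something accepts TCP connections on that port, not that it is a healthy MongoDB instance.

```python
import socket


def port_is_open(host="localhost", port=27017, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if port_is_open():
    print("something is listening on localhost:27017")
else:
    print("nothing is listening on localhost:27017; start mongod first")
```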

##### Importing all libraries

We will be using:
1. PyMongo - the Python driver for MongoDB, used to build our train and test datasets
2. PRAW - a Python wrapper for the Reddit API
3. NumPy
4. pandas
5. google-cloud-bigquery - the Python client for BigQuery

In [3]:
import pymongo
from pymongo import MongoClient
import numpy as np
import praw
import pandas as pd
from google.cloud import bigquery

Initializing the praw.Reddit(), bigquery.Client() and MongoClient() objects

In [3]:
# Enter your credentials here
client_id = ''
client_secret = ''
user_agent = ''
username = ''
password = ''

reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret,
                     user_agent=user_agent,
                     username=username,
                     password=password)

client = bigquery.Client()
mongo_client = MongoClient()

#### Here we query the 2015-2018 tables, limit the results to 100,000 records, and save them to a dataframe

In [43]:
QUERY_POSTS = (
'SELECT * except (domain, subreddit, author_flair_css_class, link_flair_css_class, author_flair_text,'
                 'from_kind, saved, hide_score, archived, from_id, name, quarantine, distinguished, stickied,'
                 'thumbnail, is_self, retrieved_on, gilded, subreddit_id) '
'FROM `fh-bigquery.reddit_posts.201*` '
'WHERE subreddit = "india" and link_flair_text in ("Sports", "Politics", "AskIndia", "Business/Finance", "Food",' 
    '"Science/Technology", "Non-Political", "Photography", "Policy/Economy", "Scheduled", "[R]eddiquette") ' 
'LIMIT 100000'
)

query_job = client.query(QUERY_POSTS)
query = query_job.result().to_dataframe()

### Building our train and test sets

To build a balanced dataset, we cap the number of samples per flair at 2,000 and sample randomly, without replacement, from the extracted dataset.

In [33]:
keep = []
data = query
# Collect the distinct flairs present in the data, skipping missing values
flairs = [flair for flair in data['link_flair_text'].unique() if not str(flair) == 'nan']
print(flairs)
for flair in flairs:
    # Cap each flair at 2000 posts
    l = min(len(data[data['link_flair_text'] == flair]), 2000)
    idx = list(data[data['link_flair_text'] == flair]['id'])
    # Sample post ids without replacement
    c = np.random.choice(idx, l, replace=False)
    keep.extend(c)

print(len(keep))

['Policy/Economy', '[R]eddiquette', 'Sports', 'Business/Finance', 'Photography', 'Politics', 'Scheduled', 'Food', 'Science/Technology', 'AskIndia', 'Non-Political']
20608
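The per-flair cap can also be written without an explicit loop. The sketch below is an alternative formulation, not the notebook's own code: it shuffles the rows once and keeps the first `cap` rows of each flair, which gives a random sample without replacement per flair. The function name `cap_per_flair` and the toy dataframe are illustrative assumptions.

```python
import pandas as pd


def cap_per_flair(posts, cap=2000, seed=42):
    """Keep at most `cap` randomly chosen rows for each flair."""
    # Shuffle all rows, then take the first `cap` rows of every group.
    shuffled = posts.sample(frac=1, random_state=seed)
    return shuffled.groupby("link_flair_text").head(cap)


# Toy data: 7 Food posts and 3 Sports posts.
toy = pd.DataFrame({
    "id": [f"p{i}" for i in range(10)],
    "link_flair_text": ["Food"] * 7 + ["Sports"] * 3,
})
balanced = cap_per_flair(toy, cap=5)
print(balanced["link_flair_text"].value_counts().to_dict())
```

With a cap of 5, the toy example keeps 5 of the 7 Food posts and all 3 Sports posts.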


We keep only these samples and discard the others.

In [34]:
data = data[data['id'].isin(keep)]

['81cfa2', '61zg8u', '7md8i9', '701ysh', '7o9rjd', '6zt3kv', '6ko0s4', '8s6mm2', '98sra2', '68sy9z']


| Unnamed: 0 | created_utc | author | url | num_comments | score | title | selftext | id | over_18 | permalink | link_flair_text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 1524926782 | [deleted] | https://www.reddit.com/r/india/comments/8fkbkh... | 29 | 50 | [R] Losing the will to live every single day | So, I didn't do well in JEE Mains, even after ... | 8fkbkh | False | /r/india/comments/8fkbkh/r_losing_the_will_to_... | [R]eddiquette |


### Saving the dataset to a MongoDB collection

Here we define a MongoDB database, `dataset`, and dump the dataframe into the collection `reddit_data`. Before doing this, we use PRAW to fetch the comments for each post and add them to our dataset as a `comments` field. For each post, we keep at most 10 randomly chosen top-level comments.

In [38]:
mongo_client = MongoClient('mongodb://localhost:27017/')
db = mongo_client.dataset
collection = db['reddit_data']

In [40]:
import time
start = time.time()
np.random.seed(42)

for _, row in data.iterrows():
    comments = []
    num_comm = 10

    submission = reddit.submission(id=row['id'])
    # Drop "load more comments" placeholders, which have no body
    submission.comments.replace_more(limit=0)
    l = len(submission.comments)

    if l > 0:
        if l < 10:
            num_comm = l
        # Pick up to 10 top-level comments at random, without replacement
        r = np.random.choice(l, num_comm, replace=False)
        for j in r:
            comments.append(submission.comments[j].body)

    t = {'created_utc': row['created_utc'],
        'title': row['title'],
        'selftext': row['selftext'],
        'author': row['author'],
        'num_comments': row['num_comments'],
        'id': row['id'],
        'link_flair_text': row['link_flair_text'],
        'comments': comments,
        'url': row['url'],
        'score': row['score'],
        'over_18': row['over_18']}
    # insert_one replaces the deprecated Collection.insert
    collection.insert_one(t)

# Elapsed time in minutes
print((time.time() - start) / 60)



153.4097396651904
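The comment-sampling step inside the loop above can be isolated into a small helper, which makes the "at most 10, chosen at random without replacement" rule easy to check without calling the Reddit API. This is a sketch using the standard library's `random` module rather than NumPy; the name `sample_bodies` is hypothetical, and the input is assumed to be a plain list of comment body strings.

```python
import random


def sample_bodies(bodies, k=10, rng=None):
    """Return up to `k` comment bodies chosen at random without replacement."""
    if rng is None:
        rng = random.Random()
    bodies = list(bodies)
    # Never ask for more items than are available.
    return rng.sample(bodies, min(k, len(bodies)))


rng = random.Random(42)
few = sample_bodies(["a", "b", "c"], rng=rng)
many = sample_bodies([f"comment {i}" for i in range(25)], rng=rng)
print(len(few), len(many))
```

A post with 3 comments yields all 3 (in random order), a post with 25 comments yields 10 distinct ones, and a post with no comments yields an empty list.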


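We can now export this dataset as a JSON file. The collection name passed to `-c` must match the one created above (`reddit_data`).

In [45]:
!mongoexport --db dataset -c reddit_data --out ./reddit_data.json

2019-02-05T17:55:48.216+0530	connected to: localhost
2019-02-05T17:55:48.216+0530	exported 0 records

The log above was recorded from a run that passed `-c reddit_dataset`, a collection that does not exist, which is why it reports 0 records.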


### Note: We manually split this collection into training and test sets with an 80/20 split, saved as train.json and test.json in the GitHub repo
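The split itself was done manually, so the following is only one way it could be reproduced. The sketch assumes the export is in `mongoexport`'s default format of one JSON document per line; the function names `split_records` and `split_jsonl`, the seed, and the rounding rule are illustrative assumptions, not the method used for the files in the repo.

```python
import json
import random


def split_records(records, test_fraction=0.2, seed=42):
    """Shuffle records and return (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_jsonl(src, train_path, test_path, test_fraction=0.2, seed=42):
    """Split a one-document-per-line JSON file into train and test files."""
    with open(src, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    train, test = split_records(records, test_fraction, seed)
    for path, rows in ((train_path, train), (test_path, test)):
        with open(path, "w", encoding="utf-8") as out:
            for row in rows:
                out.write(json.dumps(row) + "\n")
    return len(train), len(test)
```

For example, `split_jsonl("reddit_data.json", "train.json", "test.json")` would write the two files and return their record counts.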