# Team (◕‿◕✿)'s Project Process

## 1. Scraping and cleaning our data

We will scrape and clean our data with the following steps:

1. Filter to AskReddit threads made in the last year.
2. Get the thread IDs of the search results from step (1).
3. Use praw to scrape the parent comments from each thread.

### 1.1 Filter using reddit search

We want a decent sample size over a year, so we split a year's worth of data (11/01/2014 - 10/31/2015) into four parts of three months each. We then convert the start and end of each time range into Unix epochs, and format the range for use in URLs.

In [None]:
import time

time_ranges = [('11-01-2014', '01-31-2015'), ('02-01-2015', '04-30-2015'), ('05-01-2015', '07-31-2015'), ('08-01-2015', '10-31-2015')] 
f_epoch = lambda x: int(time.mktime(time.strptime(x, '%m-%d-%Y')))
time_ranges = [str(f_epoch(start)) + '..' + str(f_epoch(end)) for (start, end) in time_ranges]
print time_ranges
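
Note that `time.mktime` interprets the parsed date in the machine's local timezone, so the epochs above can shift by a few hours depending on where the notebook runs. Below is a minimal sketch of a timezone-independent variant, assuming we want each date to mean midnight UTC. The helper name `to_epoch_utc` is ours and does not appear in the original code.

```python
import calendar
import time

def to_epoch_utc(date_str):
    """Convert an 'MM-DD-YYYY' string to a Unix epoch, treating it as midnight UTC."""
    return calendar.timegm(time.strptime(date_str, '%m-%d-%Y'))

# Same first quarter as above, formatted as 'start..end' for the search URL.
start, end = '11-01-2014', '01-31-2015'
utc_range = '{0}..{1}'.format(to_epoch_utc(start), to_epoch_utc(end))
print(utc_range)
```

Whether local time or UTC is the better choice depends on how reddit's search interprets the timestamps; the sketch only makes the conversion reproducible across machines.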

For each element in ```time_ranges```, we create a URL that filters to AskReddit threads in that time range. The URL for the filtering is:

https://www.reddit.com/r/AskReddit/search?sort=comments&q=timestamp%3A[time_range]&restrict_sr=on&syntax=cloudsearch

Breaking down the above URL:
* **/r/AskReddit**: filter to subreddit AskReddit
* **sort=comments**: the search results will be sorted based on number of comments
* **q=timestamp%3A[time_range]**: restrict to threads posted within the time range, given as two epochs joined by ```..``` (```%3A``` is a URL-encoded colon). For example, ```timestamp%3A1410739200..1411171200``` covers 09/15/2014 00:00:00 to 09/20/2014 00:00:00 UTC.
* **restrict_sr=on**: restrict the search to the current subreddit.
* **syntax=cloudsearch**: reddit's search is infamously bad. One reason is that its regular search syntax, Lucene, doesn't support timestamp searches. We therefore use CloudSearch syntax instead, which does.

reddit will return at most 1000 results for our search. 
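
As a rough sketch, the URL described above can also be assembled from its parameters rather than by string concatenation, which takes care of the ```%3A``` encoding automatically. The function name `build_search_url` is hypothetical; the parameter names and values are the ones from the URL above.

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

def build_search_url(time_range):
    """Build the AskReddit search URL for a 'start..end' epoch range."""
    params = [
        ('sort', 'comments'),              # most-commented threads first
        ('q', 'timestamp:' + time_range),  # ':' is encoded as %3A by urlencode
        ('restrict_sr', 'on'),             # stay inside the subreddit
        ('syntax', 'cloudsearch'),         # needed for timestamp searches
    ]
    return 'https://www.reddit.com/r/AskReddit/search?' + urlencode(params)

print(build_search_url('1410739200..1411171200'))
```

A list of pairs is used instead of a dictionary so that the parameters keep the same order as in the hand-written URL.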

### 1.2 Get thread IDs

Now, we want to get all the search result URLs from our search. This sounds like a job for Beautiful Soup! First, using Developer Tools, we find our search contents:

![Image of HTML searching](images/part1_1.png)
![Image of HTML searching (2)](images/part1_2.png)

We next parse the HTML using Beautiful Soup to get all the thread URLs for the search results. After getting the thread URLs, we append '.json' to the end of each URL, and use a regex to extract the thread ID from the resulting JSON. Each time range query (4 in total) returns 25 threads, for a total of 100 threads, so the resulting ```thread_ids``` has 100 strings.
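
One possible shortcut, sketched here under the assumption that every search-result link follows reddit's usual `/r/<subreddit>/comments/<id>/<slug>/` layout: the thread ID can be read straight from the URL, which would avoid one extra HTTP request per thread. The helper `thread_id_from_url` and the example URL are illustrative only (the ID is taken from ```thread_ids```, the slug is made up).

```python
import re

# Matches the ID segment in /comments/<id>/ ; reddit IDs are lowercase base-36 strings.
THREAD_ID_RE = re.compile(r'/comments/([a-z0-9]+)/')

def thread_id_from_url(url):
    """Return the thread ID embedded in a reddit thread URL, or None if absent."""
    match = THREAD_ID_RE.search(url)
    return match.group(1) if match else None

# Hypothetical URL for illustration; only the ID segment matters here.
example_url = 'https://www.reddit.com/r/AskReddit/comments/2rb0pa/example_title/'
print(thread_id_from_url(example_url))
```

This is an alternative to the JSON-based extraction, not a description of what the scraping code in this section does.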

(Note: the for-loops below could be written more nicely if reddit's servers could handle more load! However, we still get HTTP error 429s even after sleeping for 10 seconds. We work around this by checking how many URLs in ```url_list``` we managed to parse before hitting error 429, and continuing to append from there, which is why the loops slice into ```time_ranges``` and ```url_list```.)

In [66]:
from bs4 import BeautifulSoup
import urllib2
import re

url_str1 = "https://www.reddit.com/r/AskReddit/search?sort=comments&q=timestamp%3A"
url_str2 = "&restrict_sr=on&syntax=cloudsearch"
url_list = []
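# The scraping code below is wrapped in a string literal so that it does not
# re-run; the IDs it produced are hard-coded in thread_ids further down.
'''
for tr in time_ranges[2:]:  # slice is a resume point after an earlier HTTP 429
    time.sleep(30) # avoid HTTP request error
    url = url_str1 + tr + url_str2
    data = BeautifulSoup(urllib2.urlopen(url).read(), 'html.parser')
    # each search result links to its thread through an anchor with this class
    rows = data.findAll("a", attrs = {'class': 'search-comments may-blank'}, limit=None)
    # drop the last 17 characters of the link and request the thread as JSON
    url_list += [str(r.attrs['href'])[:-17] + '.json' for r in rows]
print "Done getting thread links!"

thread_ids = []
for elt in url_list[18:]:  # slice is a resume point after an earlier HTTP 429
    time.sleep(10)
    json_text = urllib2.urlopen(elt).read()
    # the first "id" field found in the JSON is taken as the thread ID
    thread_ids.append(re.search(r'\"id\": \"([a-zA-Z0-9_]+)\"', json_text).group(1))

'''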
thread_ids = ['2rb0pa', '2qc6x6', '2pgxhq', '2nuals', '2ndc2r', '2qoe5v', '2nbuxe', '2noe2r', '2tv0hj', '2m64tg', 
              '2q8osf', '2rvvqw', '2sbi17', '2tedew', '2tx67q', '2typ2r', '2l9g71', '2podo2', '2q5zy0', '2ma5w6', 
              '2llwhe', "2s5g55", "2lbckf", "2r7crn", "2tbexp", "348vlx", "2yhxa9", "34aqsn", "2usqzx", "2z8krp", 
              "31ifqb", "31q682", "335on6", "2xdqg2", "32i92t", "31nbca", "32g128", "349i6w", "2zqim0", "308dzi", 
              "3249ff", "2v39v2", "32xyr1", "2uibgu", "30wygs", "2x3qbr", "2veuez", "316ose", "32kf6l", "2xrzhg", 
              "3enigz", "37c2p3", "36959m", "38o5au", "39a7r8", "39fq7n", "3csgjk", "3862j5", "3bcd9y", "3f6k5e", 
              "351azq", "3b8brt", "3cii40", "3cbo1v", "3cfcxh", "36ih74", "395il3", "3amxh2", "37kq41", "3d4kpu", 
              "39twrl", "350629", "3dhkkp", "3a75dg", "3b29ew", "3g4blw", "3p03f5", "3qhw47", "3o0k0p", "3gljgr", 
              "3lf4rg", "3q5aw1", "3johsm", "3lnuyo", "3knx4k", "3prc2q", "3ld69q", "3l8ag4", "3hpxbx", "3fuyw1", 
              "3midk5", "3gaz4f", "3jm138", "3l5tvv", "3jqmuf", "3lx3rp", "3hfrb5", "3mowrl", "3iv9xy", "3hu28g"]

100 100
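
The note above describes resuming by hand after each HTTP error 429. A minimal sketch of an automated alternative is a retry loop that waits longer after each failure. Everything here is an assumption for illustration: the name `fetch_with_retry`, its parameters, and the doubling wait are ours, and `fetch` stands for any function that downloads a URL.

```python
import time

def fetch_with_retry(fetch, url, max_tries=5, base_delay=10, sleep=time.sleep):
    """Call fetch(url), waiting and retrying whenever it raises an HTTP 429.

    The wait doubles after each failure (base_delay, 2 * base_delay, ...).
    Any other error, or a 429 on the final attempt, is re-raised.
    """
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except Exception as err:
            # urllib2.HTTPError exposes the status code as .code
            is_rate_limit = getattr(err, 'code', None) == 429
            if not is_rate_limit or attempt == max_tries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

In the loop above, the download line could then become `json_text = fetch_with_retry(lambda u: urllib2.urlopen(u).read(), elt)`, so that the full ```url_list``` is processed in one pass.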


### 1.3 Scrape comments using praw

Given a thread ID, praw (Python Reddit API Wrapper) will scrape the parent comments from that thread. We will store the scraped comments in a dictionary: the keys are the thread IDs, and the values are the lists of parent comments in each thread.

Unfortunately, the call to ```replace_more_comments``` below takes a long time because of reddit's API request limits. It is possible to pass ```limit=None``` to ```replace_more_comments``` to get all the comments in the thread, but that often leads to HTTP error 429. We thus cap the number of requests at 100.

In [None]:
import praw
r = praw.Reddit('Getting comments for CS109 Final Project')

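# Create the dictionary only on the first run, so that re-running this cell
# after an HTTP error keeps the comments already scraped.
if 'thread_d' not in globals():
    thread_d = {}
# Note: the [1:] slice skips the first thread ID.
for t_id in thread_ids[1:]: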
    time.sleep(10) # lots of HTTP request errors!
    submission = r.get_submission(submission_id=t_id)
    submission.replace_more_comments(limit=100, threshold=0)
    all_comments = submission.comments
    thread_d[t_id] = all_comments
    print "Got " + str(len(all_comments)) + " comments for " + t_id

Phew! We don't want to go through that again, so let's pickle our dictionary into the file ```all_comments_dict.p```, so we have it ready-to-go for the next run.

In [226]:
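import cPickle

# Save the dictionary, then load it back to confirm the round trip works.
with open('all_comments_dict.p', 'wb') as f:
    cPickle.dump(thread_d, f)
with open('all_comments_dict.p', 'rb') as f:
    thread_d_load = cPickle.load(f)
print thread_d_load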
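
To make "ready-to-go for the next run" concrete, here is a small sketch of a load-or-build helper: it reads the pickle if the file exists and only runs the expensive step otherwise. The name `load_or_build` and the demo data are hypothetical, and the sketch uses `pickle`, which offers the same `dump`/`load` interface as `cPickle`.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Return the pickled object at path, calling build() and saving it if absent."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = build()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result

# Demo with a stand-in dictionary written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_cache.p')
first = load_or_build(demo_path, lambda: {'2rb0pa': ['a comment']})
second = load_or_build(demo_path, lambda: {})  # build() is skipped: the file exists
print(first == second)
```

For this project, `build` would be a function wrapping the praw loop, and `path` would be ```all_comments_dict.p```.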

### 1.4 Convert praw to JSON

(Note to self: on second thought, it might not be useful to convert to JSON)

There are a lot of interesting attributes we can get from a praw object. For now, let's get: 
* the gold count
* number of upvotes
* time posted
* comment body
* author

In [227]:
#from pprint import pprint
#print all_comments
#pprint (vars(all_comments[0]))
#print all_comments[0].author

{'3sndsr': [<praw.objects.Comment object at 0x108509dd0>, <praw.objects.Comment object at 0x107db0a10>], '37kr5n': [<praw.objects.Comment object at 0x1082fedd0>]}
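
If we do decide to convert to JSON, one way to do it is sketched below: copy the five attributes listed above into a plain dictionary. The attribute names (`gilded`, `ups`, `created_utc`, `body`, `author`) are assumed to mirror the fields in reddit's comment JSON and should be checked against `vars()` of a real comment. `FakeComment` and its values are made-up stand-ins so the sketch runs without praw.

```python
import json
from collections import namedtuple

# Stand-in for a praw comment; a real one would come from thread_d.
FakeComment = namedtuple('FakeComment', 'gilded ups created_utc body author')

def comment_to_dict(comment):
    """Copy the five attributes of interest into a JSON-serializable dict."""
    author = comment.author
    return {
        'gold_count': comment.gilded,
        'upvotes': comment.ups,
        'time_posted': comment.created_utc,
        'body': comment.body,
        # the author may be an object or None, so store its string form
        'author': str(author) if author is not None else None,
    }

# Illustrative values only, not real data.
sample = FakeComment(gilded=1, ups=42, created_utc=1420000000.0,
                     body='Hello', author='some_user')
print(json.dumps(comment_to_dict(sample), sort_keys=True))
```

Applying `comment_to_dict` to every comment in each list of ```thread_d``` would give a structure that `json.dump` can write directly.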
