# Team (◕‿◕✿)'s Project Process

## 1. Scraping and cleaning our data

We will scrape and clean our data with the following steps:

1. Filter to AskReddit threads posted between 10/01/2014 and 10/01/2015.
2. Get the thread IDs of the search results from step (1).
3. Use praw to scrape the parent (top-level) comments from each thread.
4. Convert each praw object from the scraping to JSON (NOTE: might not be useful, might remove).

### 1.1 Filter using reddit search

We first narrow our search down to AskReddit threads posted within a given time range (our target range is 10/01/2014 - 10/01/2015). The URL for the filtering is:

https://www.reddit.com/r/AskReddit/search?sort=comments&q=timestamp%3A1410739200..1411171200&restrict_sr=on&syntax=cloudsearch

Breaking down the above URL:
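* **/r/AskReddit**: search from within the AskReddit subreddit
* **sort=comments**: sort the search results by number of comments
* **q=timestamp%3A1410739200..1411171200**: restrict results to a time range, with the start and end given as Unix epochs (`%3A` is a URL-encoded colon). Note that the epochs in this particular URL correspond to 09/15/2014 00:00:00 UTC and 09/20/2014 00:00:00 UTC, a five-day window; the full range of 10/01/2014 00:00:00 to 10/01/2015 00:00:00 UTC would be 1412121600..1443657600.
* **restrict_sr=on**: limit the results to this subreddit rather than all of reddit
* **syntax=cloudsearch**: reddit's search is infamously bad. One reason is that its regular search syntax, Lucene, doesn't allow timestamp searches. We therefore use the CloudSearch syntax instead, which supports features (like timestamp searches) that Lucene does not.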
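
As a rough sketch of how the pieces above fit together, the snippet below converts calendar dates to Unix epochs (midnight UTC) and assembles a search URL of the same shape. The helper names `to_epoch` and `build_search_url` are ours for illustration and are not part of any reddit API; the code uses only the standard library and should run under both Python 2.7 and Python 3.

```python
import calendar
from datetime import datetime


def to_epoch(year, month, day):
    """Return the Unix epoch (in seconds) for midnight UTC on the given date."""
    return calendar.timegm(datetime(year, month, day).timetuple())


def build_search_url(subreddit, start_epoch, end_epoch):
    """Assemble a CloudSearch-syntax search URL restricted to one subreddit."""
    # %3A is the URL-encoded colon in "timestamp:start..end".
    query = "timestamp%3A{0}..{1}".format(start_epoch, end_epoch)
    return ("https://www.reddit.com/r/{0}/search"
            "?sort=comments&q={1}&restrict_sr=on&syntax=cloudsearch"
            ).format(subreddit, query)


# Epochs for our target range, 10/01/2014 to 10/01/2015 (midnight UTC).
start = to_epoch(2014, 10, 1)
end = to_epoch(2015, 10, 1)
search_url = build_search_url("AskReddit", start, end)
```

Calling `build_search_url("AskReddit", 1410739200, 1411171200)` reproduces the URL shown above.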

reddit will return at most 1000 results for our search. 

### 1.2 Get thread IDs

Now, we want to get all the search result URLs from our search. This sounds like a job for Beautiful Soup! First, using Developer Tools, we find our search contents:

![Image of HTML searching](images/part1_1.png)
![Image of HTML searching (2)](images/part1_2.png)

We next parse the HTML using Beautiful Soup to get the thread URLs for all the search results. We then append '.json' to the end of each URL and use a regex to extract the thread ID from the resulting JSON.
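
To illustrate the extraction step, here is a minimal sketch that applies the same kind of regular expression to a small made-up JSON string, and also shows an alternative that parses the JSON instead. The sample text and the nested layout it mimics (a list of listings whose first child holds the thread) are assumptions for illustration, not captured reddit output, and `id_from_regex` and `id_from_json` are hypothetical helper names.

```python
import json
import re

# Made-up stand-in for the text returned by a thread's '.json' URL.
sample_json_text = json.dumps([
    {"kind": "Listing",
     "data": {"children": [{"kind": "t3",
                            "data": {"id": "2gutes", "title": "Example"}}]}},
    {"kind": "Listing", "data": {"children": []}},
])


def id_from_regex(json_text):
    """Return the first "id" value found in the raw JSON text, or None."""
    match = re.search(r'"id": "([a-zA-Z0-9_]+)"', json_text)
    return match.group(1) if match else None


def id_from_json(json_text):
    """Return the thread ID by walking the parsed JSON (assumed layout)."""
    listings = json.loads(json_text)
    return listings[0]["data"]["children"][0]["data"]["id"]
```

The regex returns whichever `"id"` field happens to appear first in the text, so the parsed version is less sensitive to key order, at the cost of depending on the assumed layout.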

(Note: the for-loop below could be written more cleanly if reddit's servers could handle more load. Even with a 10-second sleep between requests, we still get HTTP 429 errors. We work around this by checking how many URLs in ```url_list``` we managed to parse before hitting the 429, and continuing to append from there.)

In [206]:
from bs4 import BeautifulSoup
import urllib2
import re
import time

'''
url = "https://www.reddit.com/r/AskReddit/search?sort=comments&q=timestamp%3A1410739200..1411171200&restrict_sr=on&syntax=cloudsearch"
data = BeautifulSoup(urllib2.urlopen(url).read(), 'html.parser')
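# Each search result has a comments link carrying these CSS classes.
rows = data.findAll("a", attrs = {'class': 'search-comments may-blank'}, limit=None)
# Drop the last 17 characters of each href, then append '.json'.
url_list = [str(r.attrs['href'])[:-17] + '.json' for r in rows]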

print "Done getting thread links!"

thread_ids = []
for i in range(len(url_list)):
    time.sleep(10) # reddit servers can't handle much load, so we sleep for a while
    json_text = urllib2.urlopen(url_list[i]).read()
    thread_ids.append(re.search(r'\"id\": \"([a-zA-Z0-9_]+)\"', json_text).group(1))
'''

['2gutes', '2gsgtl', '2ghddz', '2grdq2', '2gjgy8', '2gnn3m', '2ggypn', '2gg2r4', '2ge8ed', '2griq5', '2gp9iv', '2gf7lv', '2gnjmx', '2guikf', '2gfbm2', '2goanl', '2gkcuj', '2gni9r', '2gslf0', '2guify', '2gj6ir', '2gjur0', '2gltd7', '2guuuq', '2gk1pg']
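
One way to avoid restarting the loop by hand after an HTTP 429 is to retry the same URL with a growing delay. The sketch below is not what we ran: `fetch_with_retry` is a hypothetical helper, the retry count and delays are arbitrary, and the `fetch` argument stands in for whatever function downloads a URL. It is assumed to raise an exception carrying a `code` attribute, as `urllib2.HTTPError` does.

```python
import time


def fetch_with_retry(fetch, url, max_tries=5, base_delay=10, sleep=time.sleep):
    """Call fetch(url), retrying with a doubling delay on HTTP 429 errors."""
    delay = base_delay
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except Exception as err:
            # Re-raise anything that is not "429 Too Many Requests",
            # and give up after the last allowed attempt.
            if getattr(err, "code", None) != 429 or attempt == max_tries - 1:
                raise
            sleep(delay)
            delay *= 2
```

In the loop above, this could replace the direct download with something like `fetch_with_retry(lambda u: urllib2.urlopen(u).read(), url_list[i])`.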


### 1.3 Scrape comments using praw

Given a thread ID, praw (Python Reddit API Wrapper) scrapes all the parent comments from that thread. We store the scraped comments in a dictionary: the keys are the thread IDs, and each value is the list of parent comments in that thread.

Unfortunately, the call to ```replace_more_comments``` below takes a long time because of reddit's API request limits.

In [None]:
import praw
r = praw.Reddit('Getting comments for CS109 Final Project')

thread_d = {}
for t_id in thread_ids:
    print "Getting comments for " + t_id + " ..."
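    submission = r.get_submission(submission_id=t_id)
    # Replace every "load more comments" placeholder with the actual comments.
    submission.replace_more_comments(limit=None, threshold=0)
    # submission.comments holds the parent (top-level) comments of the thread.
    all_comments = submission.comments
    thread_d[t_id] = all_comments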

Phew! We don't want to go through that again, so let's pickle our dictionary into the file ```all_comments_dict.p```, so we have it ready-to-go for the next run.

In [226]:
import cPickle

# Save the scraped comments, then reload them to check the round trip.
with open('all_comments_dict.p', 'wb') as f:
    cPickle.dump(thread_d, f)
with open('all_comments_dict.p', 'rb') as f:
    thread_d_load = cPickle.load(f)
print thread_d_load
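
A common way to have the data "ready-to-go for the next run" is to load the pickle when it exists and only scrape otherwise. The following is a minimal sketch of that pattern, not code from our notebook; `load_or_build` and its `build` callback are hypothetical names, and it uses the standard `pickle` module so that it runs under both Python 2.7 and Python 3.

```python
import os
import pickle


def load_or_build(path, build):
    """Return the object pickled at path, building and saving it if absent."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = build()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result
```

Here `build` would be a function that runs the scraping loop and returns `thread_d`, so the slow praw calls only happen when `all_comments_dict.p` is missing.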

### 1.4 Convert praw to JSON

(Note to self: on second thought, it might not be useful to convert to JSON)

There are a lot of interesting attributes we can get from a praw object. For now, let's get: 
* the gold count
* number of upvotes
* time posted
* comment body
* author

In [227]:
#from pprint import pprint
#print all_comments
#pprint (vars(all_comments[0]))
#print all_comments[0].author

{'3sndsr': [<praw.objects.Comment object at 0x108509dd0>, <praw.objects.Comment object at 0x107db0a10>], '37kr5n': [<praw.objects.Comment object at 0x1082fedd0>]}
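
As a sketch of what the conversion could look like, the function below copies the five attributes listed above into a plain dictionary that `json.dumps` can serialize. The attribute names `gilded`, `ups`, `created_utc`, `body`, and `author` are our assumption about what the praw comment objects expose, `comment_to_dict` is a hypothetical helper, and `FakeComment` is a made-up stand-in so the example runs without contacting reddit.

```python
import json


class FakeComment(object):
    """Made-up stand-in for a praw comment object (illustration only)."""
    def __init__(self, gilded, ups, created_utc, body, author):
        self.gilded = gilded
        self.ups = ups
        self.created_utc = created_utc
        self.body = body
        self.author = author


def comment_to_dict(comment):
    """Copy the attributes of interest into a JSON-serializable dict."""
    author = getattr(comment, "author", None)
    return {
        "gold_count": getattr(comment, "gilded", None),
        "upvotes": getattr(comment, "ups", None),
        "time_posted": getattr(comment, "created_utc", None),
        "body": getattr(comment, "body", None),
        # The author may be an object or missing; it is assumed here that
        # converting it to a string yields the username.
        "author": None if author is None else str(author),
    }


example = FakeComment(1, 42, 1410739200.0, "An example comment.", "some_user")
as_json = json.dumps(comment_to_dict(example), sort_keys=True)
```

Applied to our data, this would be mapped over each list in `thread_d`, giving a dictionary of plain lists that can be written out as JSON.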
