## Problem Statement

The main goal of this project is to collect data by scraping Reddit and then build a binary classifier that identifies which subreddit a given post came from. <br> 
For the project I chose two closely related subreddits:
 - Dog lovers subreddit - Dogs https://www.reddit.com/r/dogs/
 - Dog haters subreddit - Dogfree https://www.reddit.com/r/Dogfree

> *This model could be used to identify dog lovers and dog haters based on their posts on different social networks. This information could then be used for ad targeting.*

## Data Collection

The goal of this step is to collect enough posts from the 2 chosen subreddits. 

### Import Libraries

In [10]:
import requests
import time
import pandas as pd
import os.path

### Functions for parsing subreddits 

Data was gathered from Reddit's API, using the Python requests library. Reddit's API returns a JSON file with the page’s content.
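
As a rough illustration of the JSON shape that the collection code relies on, here is a hand-written, heavily trimmed listing. All field values are invented, the helper name `extract_posts` is hypothetical, and a real response carries many more fields per post. The sketch only shows where the posts and the paging cursors live:

```python
# Hand-written sample that mimics the shape of a Reddit listing response.
# All values are invented for illustration.
sample_listing = {
    'kind': 'Listing',
    'data': {
        'after': 't3_bbb222',   # cursor for the next (older) page, None on the last page
        'before': None,         # cursor for the previous (newer) page
        'children': [
            {'kind': 't3', 'data': {'name': 't3_aaa111', 'title': 'First post', 'selftext': 'Hello'}},
            {'kind': 't3', 'data': {'name': 't3_bbb222', 'title': 'Second post', 'selftext': ''}},
        ],
    },
}

def extract_posts(listing):
    """Return the per-post dictionaries stored under data -> children -> data."""
    return [child['data'] for child in listing['data']['children']]

posts = extract_posts(sample_listing)
next_cursor = sample_listing['data']['after']
print(len(posts), next_cursor)
```

Passing `next_cursor` back as the `after` parameter of the next request is what moves the collection one page further.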

I've defined 2 functions for collecting data. <br> 

**load_posts** - accepts a list to collect posts into, a paging direction ('after'/'before'), a limit (up to 100) and a URL as arguments. The function sends requests to Reddit's API and parses posts from the returned JSON. <br>

**load_subreddit** - accepts the name of a subreddit as an argument and checks whether a file with that subreddit's posts already exists. If there is no file, the function parses all available posts from the subreddit. If there is a file, the function parses only the latest posts and deletes duplicates. It then creates a DataFrame and saves it as CSV. 

In [4]:
def load_posts(posts, direction, limit, url):
    headers = {'User-agent': 'Bleep bot 0.1'}
    pagingId = None
    #loop until the 'after'/'before' cursor returned by the API is None,
    #which means there are no more pages; stopping there avoids collecting duplicates
    while True:
        #the first request has no cursor yet, so send only the limit
        if pagingId is None:
            params = {'limit': limit}
        else:
            params = {direction: pagingId, 'limit': limit}
        #send the request
        res = requests.get(url, params=params, headers=headers)
        #if there is no error, collect posts until the cursor becomes None again
        if res.status_code == 200:
            the_json = res.json()
            posts.extend(the_json['data']['children'])
            if the_json['data'][direction] is None:
                break
            pagingId = the_json['data'][direction]
        #if we get an error, print the status code and break the loop
        else:
            print(res.status_code)
            break
        #wait 3 seconds between requests in order to follow API access rules (up to 60 requests per minute)
        time.sleep(3)

def load_subreddit(name):
    posts = [] #empty list for collecting data
    url = 'https://www.reddit.com/r/' + name + '/new/.json' #build the url from the subreddit name
    path = '../data/' + name + '.csv' #file with previously saved posts of the subreddit
    #if there is no file yet, parse all available posts and create a new dataframe
    if not os.path.exists(path):
        load_posts(posts, 'after', 100, url)
        df = pd.DataFrame([p['data'] for p in posts]).drop_duplicates(subset='name')
    #if there is a file, load it, parse the newest posts,
    #append them to the existing posts and delete duplicates
    else:
        old_posts_df = pd.read_csv(path)
        old_posts_df.drop(['Unnamed: 0'], axis=1, inplace=True)
        load_posts(posts, 'before', 25, url)
        new_posts_df = pd.DataFrame([p['data'] for p in posts]).drop_duplicates(subset='name')
        df = pd.concat([old_posts_df, new_posts_df], sort=False).drop_duplicates(subset='name')
    #save data to csv
    df.to_csv(path)
    #check how many posts we have
    print(name, df.shape)
    return (name, df.shape[0]) #name of the subreddit and its post count, for the stats file
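
To make the merge step easier to follow, here is a minimal sketch of the same `concat` plus `drop_duplicates` pattern on two toy frames. The rows are invented and stand in for the saved CSV and a freshly scraped batch. It suggests that when a post appears in both frames, the previously saved row is the one that survives:

```python
import pandas as pd

# Toy stand-ins for the saved CSV and a freshly scraped batch (values are invented).
old_posts_df = pd.DataFrame({'name': ['t3_a', 't3_b'], 'title': ['old a', 'old b']})
new_posts_df = pd.DataFrame({'name': ['t3_b', 't3_c'], 'title': ['new b', 'new c']})

# Same pattern as in load_subreddit: stack both frames, then keep the first row per 'name'.
merged = pd.concat([old_posts_df, new_posts_df], sort=False).drop_duplicates(subset='name')

print(merged['name'].tolist())
```

Because `drop_duplicates` keeps the first occurrence by default, putting the old frame first in `concat` preserves the stored version of a post over the re-scraped one.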

### Data Collection

Create a list of topics I'd like to parse. Before I chose the final 2 subreddits, I collected data from a range of subreddits.

In [5]:
sport_topics = ['nba', 'baseball', 'soccer','mls', 'hockey', 'mma', 'boxing', 'FIFA']  
other_topics = ['news', 'Futurology','AskEngineers','AskReddit','AskScience','History','gameofthrones','gottheories','Dogfree','aww','dogs']
tech_topics =  ['apple','applehate','android','MacSucks','mac','iphone']

In [11]:
for x in sport_topics + other_topics + tech_topics:
    load_subreddit(x)

nba (4838, 100)
baseball (2032, 104)
soccer (3099, 100)
mls (1377, 104)
hockey (1535, 100)
mma (1650, 106)
boxing (1209, 104)
FIFA (2582, 98)
news (339, 97)
Futurology (1336, 103)
AskEngineers (1179, 99)
AskReddit (10526, 97)
AskScience (1227, 97)
History (1089, 99)
gameofthrones (1114, 101)
gottheories (1014, 103)
Dogfree (1119, 104)
aww (6809, 102)
dogs (1285, 100)
apple (1027, 103)
applehate (61, 98)
android (878, 101)
MacSucks (166, 95)
mac (1271, 104)
iphone (1087, 103)


### Automation

In order to collect more data I automated the collection. I put the script 'reddit_collect.py' on an AWS EC2 instance and ran it every hour using a cron task<br> 

(0 * * * * /home/ubuntu/anaconda3/bin/python /home/ubuntu/project/reddit_collect.py).<br>


To monitor the AWS data collection I defined a function for saving statistics and added it to my script 'reddit_collect.py'.

In [7]:
def save_stat(stats):
    #open the file in append mode, creating it if it does not exist yet
    with open('../data/stat.txt','a+') as f:
        f.write('***********' + str(time.ctime()) + os.linesep) #time of this collection run
        for stat in stats:
            name = stat[0]
            count = str(stat[1])
            f.write(name +', ' + count + os.linesep) #subreddit's name and total number of posts in my files

In [None]:
#code I changed in 'reddit_collect.py' to save statistics
stats = []
for x in sport_topics + other_topics + tech_topics:
    stats.append(load_subreddit(x))
save_stat(stats)
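
For checking the saved statistics later, a small parser along these lines could turn the stat file back into rows. This is only a sketch: `parse_stats` is a hypothetical helper that is not part of 'reddit_collect.py', and the timestamp in the sample is made up. It assumes the layout written by `save_stat`, a line of asterisks followed by the time, then one `name, count` line per subreddit:

```python
def parse_stats(lines):
    """Turn stat-file lines into (timestamp, subreddit, count) tuples."""
    rows = []
    timestamp = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('*'):
            # Header line: a run of asterisks followed by the time.ctime() string
            timestamp = line.lstrip('*')
        else:
            # Data line: subreddit name and post count separated by ', '
            name, count = line.rsplit(', ', 1)
            rows.append((timestamp, name, int(count)))
    return rows

# Sample lines in the assumed layout; the timestamp is invented.
sample = [
    '***********Mon Jan  1 10:00:00 2024',
    'dogs, 1285',
    'Dogfree, 1119',
]
rows = parse_stats(sample)
print(rows)
```

With the real file, the lines could come from `open('../data/stat.txt').readlines()`, and the resulting tuples can be passed to `pd.DataFrame` to track how each subreddit's count grows between runs.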

### Conclusions and next steps

 - I collected around 1200 posts for each of the chosen subreddits (Dogs and Dogfree)
 - The largest number of posts came from AskReddit - around 11,000 posts 
 - Automated the collection using AWS and cron
 
Further improvements for the data collection:
 - Use PRAW: The Python Reddit API Wrapper for parsing data from Reddit
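
A rough sketch of what the PRAW-based collection might look like is shown below. It was not used in this project. The helper names `make_client` and `fetch_new_posts` are hypothetical, the choice of fields is an assumption, and the credentials have to come from a Reddit app registered by the user:

```python
def make_client(client_id, client_secret, user_agent):
    """Build a read-only PRAW client from the credentials of a registered Reddit app."""
    import praw  # imported lazily so the rest of the sketch runs without PRAW installed
    return praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)

def fetch_new_posts(reddit, name, limit=100):
    """Collect a few basic fields of the newest submissions in a subreddit."""
    rows = []
    for submission in reddit.subreddit(name).new(limit=limit):
        rows.append({
            'id': submission.id,
            'title': submission.title,
            'selftext': submission.selftext,
        })
    return rows
```

A call such as `fetch_new_posts(make_client(...), 'dogs')` would return a list of dictionaries that can be passed to `pd.DataFrame`, with PRAW taking care of paging and request pacing.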