## Data Collection

The goal of this step is to collect posts from any two subreddits. <br>
For my project I chose sports topics.

### Import Libraries

In [2]:
import requests
import time
import pandas as pd
import os.path

### Functions for parsing subreddits 

To collect the data I defined two functions. <br>

load_posts - accepts a list to collect posts into, a paging direction ('after' or 'before'), a limit (up to 100 posts per request) and a URL. The function sends requests to Reddit's API and parses the posts from the returned JSON. <br>

load_subreddit - accepts the name of a subreddit and checks whether a file with that subreddit's posts already exists. If there is no file, the function parses all available posts from the subreddit. If there is a file, the function parses only the latest posts and deletes duplicates. It then creates a DataFrame and saves it as CSV.
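
The paging step that load_posts relies on can be sketched without any network access. The snippet below is only an illustration: `fake_page` is a hand-made stand-in that contains just the fields load_posts reads (real responses carry many more), and `next_params` is a hypothetical helper that is not part of the notebook.

```python
# A hand-made stand-in for one page of a Reddit listing response.
# Only the fields read by load_posts are included (assumption: simplified shape).
fake_page = {
    'data': {
        'children': [
            {'kind': 't3', 'data': {'name': 't3_aaa', 'title': 'First post'}},
            {'kind': 't3', 'data': {'name': 't3_bbb', 'title': 'Second post'}},
        ],
        'after': 't3_bbb',  # cursor for the next page; None when there is no further page
        'before': None,     # cursor for the previous page; None when there is none
    }
}


def next_params(page, direction, limit):
    """Hypothetical helper: build the parameters for the next request, or None to stop."""
    cursor = page['data'][direction]
    if cursor is None:
        return None
    return {direction: cursor, 'limit': limit}


# Collect the posts from this page, as load_posts does with posts.extend(...).
posts = []
posts.extend(fake_page['data']['children'])

print(next_params(fake_page, 'after', 100))
print(next_params(fake_page, 'before', 25))
```

With this page, paging in the 'after' direction continues from `t3_bbb`, while paging in the 'before' direction stops because the cursor is None.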

In [3]:
def load_posts(posts, direction, limit, url):
    headers = {'User-agent': 'Bleep bot 0.1'}
    paging_id = None
    # Loop until the 'after'/'before' cursor comes back as None.
    # Stopping there avoids collecting duplicates.
    while True:
        # The first request carries no cursor; later ones pass the last cursor seen.
        if paging_id is None:
            params = {'limit': limit}
        else:
            params = {direction: paging_id, 'limit': limit}
        # Send the request.
        res = requests.get(url, params=params, headers=headers)
        # On success, collect the posts and read the next cursor.
        if res.status_code == 200:
            the_json = res.json()
            posts.extend(the_json['data']['children'])
            if the_json['data'][direction] is None:
                break
            paging_id = the_json['data'][direction]
        # On an error, print the status code and stop.
        else:
            print(res.status_code)
            break
        # Pause between requests.
        time.sleep(3)

def load_subreddit(name):
    posts = []  # list for collecting raw posts
    url = 'https://www.reddit.com/r/' + name + '/.json'  # listing URL built from the name
    path = '../data/' + name + '.csv'
    # If there is no file for this subreddit yet,
    # parse all available posts and create a new DataFrame.
    if not os.path.exists(path):
        load_posts(posts, 'after', 100, url)
        df = pd.DataFrame([p['data'] for p in posts]).drop_duplicates(subset='name')
    # If the file exists, load it, parse the newest posts,
    # append them to the existing posts and drop duplicates.
    else:
        old_posts_df = pd.read_csv(path)
        # to_csv below writes the index, which read_csv returns as 'Unnamed: 0'.
        old_posts_df.drop(['Unnamed: 0'], axis=1, inplace=True)
        load_posts(posts, 'before', 25, url)
        new_posts_df = pd.DataFrame([p['data'] for p in posts]).drop_duplicates(subset='name')
        df = pd.concat([old_posts_df, new_posts_df], sort=False).drop_duplicates(subset='name')
    # Save the data to CSV.
    df.to_csv(path)
    # Report how many posts (rows) and fields (columns) we have.
    print(name, df.shape)
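
The merge step in load_subreddit can be checked on toy data. This is a minimal sketch with invented post names and titles (not real Reddit data). It also shows why the 'Unnamed: 0' column appears when a CSV written with the default index is read back.

```python
import io

import pandas as pd

# Toy stand-ins for previously saved posts and freshly parsed posts (invented values).
old_posts_df = pd.DataFrame({'name': ['t3_aaa', 't3_bbb'], 'title': ['First', 'Second']})
new_posts_df = pd.DataFrame({'name': ['t3_bbb', 't3_ccc'], 'title': ['Second', 'Third']})

# Same pattern as in load_subreddit: concatenate, then keep one row per post name.
merged = pd.concat([old_posts_df, new_posts_df], sort=False).drop_duplicates(subset='name')
print(merged['name'].tolist())

# Round-trip through CSV in memory: to_csv writes the index by default,
# and read_csv brings it back as a column called 'Unnamed: 0'.
buf = io.StringIO()
merged.to_csv(buf)
buf.seek(0)
reloaded = pd.read_csv(buf)
print(reloaded.columns.tolist())
```

The post `t3_bbb` appears in both frames but only once in `merged`, and the reloaded frame has the extra 'Unnamed: 0' column that load_subreddit drops.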

### Data Collection

Create a list of the topics I'd like to parse.

In [4]:
sport_topics = ['nba', 'baseball', 'soccer','mls', 'hockey', 'mma', 'boxing', 'FIFA']  

In [8]:
for x in sport_topics:
    load_subreddit(x)

nba (577, 100)
baseball (920, 103)
soccer (826, 100)
mls (951, 104)
hockey (948, 100)
mma (924, 105)
boxing (995, 104)
FIFA (601, 98)


In [55]:
for x in sport_topics:
    load_subreddit(x)

nba (677, 100)
baseball (1032, 104)
soccer (924, 100)
mls (1050, 104)
hockey (1033, 100)
mma (1019, 105)
boxing (1044, 104)
FIFA (703, 98)


In [23]:
for x in sport_topics:
    load_subreddit(x)

nba (667, 100)
baseball (1024, 103)
soccer (918, 100)
mls (1042, 104)
hockey (1024, 100)
mma (1015, 105)
boxing (1042, 104)
FIFA (696, 98)


In [54]:
load_subreddit('climateskeptics')

climateskeptics (983, 99)


In [53]:
load_subreddit('Futurology')

Futurology (913, 103)


In [52]:
load_subreddit('ShowerThoughts')

ShowerThoughts (764, 100)


In [51]:
load_subreddit('AskReddit')

AskReddit (994, 97)


In [50]:
load_subreddit('AskScience')

AskScience (1000, 97)


In [49]:
load_subreddit('History')

History (998, 99)


In [None]:
To collect more data, I automated the collection by putting the script on AWS.
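
The deployed script is not shown in this notebook. As a rough sketch only, a scheduled job could call an entry point like the one below. `collect_all` and `fake_loader` are hypothetical names introduced for illustration. In the real script the loader would presumably be load_subreddit.

```python
def collect_all(names, loader):
    """Hypothetical entry point: run loader on every name and return the names that failed."""
    failed = []
    for name in names:
        try:
            loader(name)
        except Exception as exc:
            # Keep going so that one failing subreddit does not stop the others.
            print(name, 'failed:', exc)
            failed.append(name)
    return failed


def fake_loader(name):
    """Stand-in for load_subreddit that fails for one name, to exercise the error path."""
    if name == 'mls':
        raise ValueError('simulated failure')


failed = collect_all(['nba', 'mls', 'hockey'], fake_loader)
print(failed)
```

Returning the failed names lets a later run retry only those subreddits.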

### Conclusions and next steps

PRAW documentation: https://praw.readthedocs.io/en/latest/

To collect more data, I automated the collection.