# Project 3: Data Collection

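This notebook collects posts, along with their first- and second-level comments, from r/CasualConversation and r/SeriousConversation using Reddit's JSON API, then saves the results as CSV files in the data folder.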

About the API
Reddit's API is fairly straightforward. For example, if I want the posts from /r/boardgames, all I have to do is add .json to the end of the url: https://www.reddit.com/r/boardgames.json
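
As a minimal sketch of the request described above, a listing URL can be built from a subreddit name, and post titles can be pulled from the `data` > `children` structure that the rest of this notebook relies on. The helper names `listing_url` and `post_titles` and the sample payload are illustrative assumptions, not part of the assignment or of Reddit's API.

```python
def listing_url(subreddit):
    """Build the JSON listing URL for a subreddit (hypothetical helper)."""
    return f'https://www.reddit.com/r/{subreddit}.json'


def post_titles(listing):
    """Return the title of every post in a parsed listing dictionary."""
    return [child['data']['title'] for child in listing['data']['children']]


# Hand-written stand-in for a parsed response; a real one has many more fields
sample_listing = {
    'data': {
        'after': 't3_abc123',
        'children': [
            {'kind': 't3', 'data': {'title': 'First post'}},
            {'kind': 't3', 'data': {'title': 'Second post'}},
        ],
    }
}

print(listing_url('boardgames'))
print(post_titles(sample_listing))
```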

TIPS:
- Reddit will give you 25 posts per request. To get enough data, you'll need to hit Reddit's API repeatedly (most likely in a for loop). Be sure to use the time.sleep() function at the end of your loop to allow for a break in between requests. THIS IS CRUCIAL
- The API will cap you at 1,000 posts for each subreddit (assuming the subreddit has that many posts).
- At the end of each loop, be sure to save the results from your scrape as a csv: JSON from Reddit > Pandas DataFrame > CSV. That way, if something goes wrong in your loop, you won't lose all your data.
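
The tips above can be sketched as a single loop: request a page, checkpoint everything collected so far to CSV, then sleep. This is only a sketch under stated assumptions. `collect_pages` and `fetch_page` are hypothetical names, and `fetch_page` stands in for whatever function performs the real request and returns the parsed JSON. Here it is replaced by a small dictionary of fake pages so the loop can be tried offline.

```python
import os
import tempfile
import time

import pandas as pd


def collect_pages(fetch_page, csv_path, max_pages=40, pause=1.0):
    """Page through a listing, saving a CSV checkpoint after every request.

    fetch_page is a hypothetical callable: it takes the `after` token
    (None for the first page) and returns the parsed JSON listing.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(child['data'] for child in listing['data']['children'])
        # Checkpoint: JSON > DataFrame > CSV, overwritten on each iteration
        pd.DataFrame(posts).to_csv(csv_path, index=False)
        after = listing['data']['after']
        if after is None:  # no further pages to request
            break
        time.sleep(pause)  # break between requests
    return posts


# Fake pages keyed by `after` token, standing in for the network call
fake_pages = {
    None: {'data': {'after': 't3_b',
                    'children': [{'data': {'name': 't3_a'}},
                                 {'data': {'name': 't3_b'}}]}},
    't3_b': {'data': {'after': None,
                      'children': [{'data': {'name': 't3_c'}}]}},
}

csv_path = os.path.join(tempfile.gettempdir(), 'checkpoint_demo.csv')
posts = collect_pages(fake_pages.get, csv_path, pause=0)
print(len(posts), 'posts saved to', csv_path)
```

Because the CSV is rewritten on every pass, an error partway through the loop still leaves the most recent checkpoint on disk.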

To help you get started, we have a primer video on how to use Reddit's API: https://www.youtube.com/watch?v=5Y3ZE26Ciuk

**Imports**

In [6]:
# Imports
import pandas as pd
import requests       
import time

### Function to Get Comments

In [6]:
# Gets first- and second-level comments on a post (stops after roughly 30 per post)
def get_comments(permalink):

    # Get URL to comments page using permalink found in post information
    url = 'https://www.reddit.com' + permalink + '.json'
    headers = {'User-agent': 'swhaterBot 0.1'}
    res = requests.get(url, headers=headers)
    
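    # Verify status code is 200; on failure, report it and return an empty
    # result so callers that merge this dictionary into a post do not crash
    if res.status_code == 200:
        comments = res.json()
    else:
        print('ERROR: ', res.status_code)
        return {'comts_first': [], 'comts_second': [], 'comts_num_collected': 0}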
    
    # Creating dictionary to house all comments per post
    all_comments = {}
    # Creating lists to house comments by comment level (1, 2)
    level1_coms = []
    level2_coms = []
    num_comments = 0
    
    #### LEVEL 1 ####
    # Begin by getting top level comments
    coms1_length = len(comments[1]['data']['children'])
    
    #print('# of Comments: ', coms1_length)
    for i in range(coms1_length):
        coms_sect1 = comments[1]['data']['children'][i]['data']
        #print('Body in Keys: ', 'body' in coms_sect1.keys())
        if 'body' in coms_sect1.keys():
            top_level_com = coms_sect1['body']
            level1_coms.append(top_level_com)
            
            if coms_sect1['replies'] != '':
                
                #### LEVEL 2 ####
                coms2_length = len(coms_sect1['replies']['data']['children'])
                
                for j in range(coms2_length):
                    coms_sect2 = coms_sect1['replies']['data']['children'][j]['data']
                    if 'body' in coms_sect2.keys():
                        sec_level_com = coms_sect2['body']
                        level2_coms.append(sec_level_com)
                
        num_comments = len(level1_coms) + len(level2_coms)
        
        if num_comments > 30:
            break
        
    all_comments['comts_first'] = level1_coms
    all_comments['comts_second'] = level2_coms
    all_comments['comts_num_collected'] = num_comments

    return all_comments
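
The nested loops in `get_comments` could also be written as one recursive walk over the comment tree, which makes the depth limit a parameter instead of a fixed number of loops. This is a sketch of an alternative, not the approach used in this notebook. `walk_comments` is a hypothetical name, and the sample tree is hand-written to mimic the `body` and `replies` keys used above.

```python
def walk_comments(children, max_depth=2, level=1):
    """Yield (level, body) pairs for comments down to max_depth."""
    for child in children:
        data = child['data']
        if 'body' not in data:  # skip entries that are not actual comments
            continue
        yield level, data['body']
        replies = data.get('replies')
        # An empty string means no replies, matching the check used above
        if replies and level < max_depth:
            yield from walk_comments(replies['data']['children'],
                                     max_depth, level + 1)


# Hand-written stand-in for comments[1]['data']['children']
deep_reply = {'data': {'body': 'deep A1a', 'replies': ''}}
reply_a1 = {'data': {'body': 'reply A1',
                     'replies': {'data': {'children': [deep_reply]}}}}
no_body = {'data': {'count': 3}}
sample_children = [
    {'data': {'body': 'top A',
              'replies': {'data': {'children': [reply_a1, no_body]}}}},
    {'data': {'body': 'top B', 'replies': ''}},
]

found = list(walk_comments(sample_children))
print(found)
```

Raising `max_depth` to 3 would also pick up third-level comments without adding another loop.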

In [19]:
# Gets comments and information for all 1-2 level comments on a post (limit 30 per post)
def get_comments_df(permalink):

    # Get URL to comments page using permalink found in post information
    url = 'https://www.reddit.com' + permalink + '.json'
    headers = {'User-agent': 'swhaterBot 0.1'}
    res = requests.get(url, headers=headers)
    
    # Verify status code is 200
    if res.status_code == 200:
        comments = res.json()
    else:
        return print('ERROR: ', res.status_code)
    
    # Creating list to house a dictionary for each row of comments
    comments_list = []
    counter = 0 # keep track of comments...max at 30 per post
    
    #### LEVEL 1 ####
    coms1_length = len(comments[1]['data']['children'])
    
    #print('# of Comments: ', coms1_length)
    for i in range(coms1_length):
        comments_dict = {}
        coms_sect1 = comments[1]['data']['children'][i]['data']
   
        if 'body' in coms_sect1.keys():
            comments_dict['comment'] = coms_sect1['body'] # actual comment
            comments_dict['level'] = 1
            comments_dict['subreddit'] = permalink # permalink contains the subreddit name and can clean later
            
            comments_list.append(comments_dict)
            counter += 1
            
            if coms_sect1['replies'] != '':
                
                #### LEVEL 2 ####
                coms2_length = len(coms_sect1['replies']['data']['children'])
                
                for j in range(coms2_length):
                    comments_dict = {}
                    coms_sect2 = coms_sect1['replies']['data']['children'][j]['data']
                    if 'body' in coms_sect2.keys():
                        comments_dict['comment'] = coms_sect2['body'] # actual comment
                        comments_dict['level'] = 2
                        comments_dict['subreddit'] = permalink # permalink contains the subreddit name and can clean later

                        comments_list.append(comments_dict)
                        counter += 1
                        
        if counter > 30:
            break

    return pd.DataFrame(comments_list)

In [None]:
test = '/r/SeriousConversation/comments/b8sgzc/person_from_my_graduating_class_died_the_other/'
test1 = '/r/SeriousConversation/comments/b8t1zs/how_do_i_accept_the_fact_that_im_growing_up/'

In [None]:
comments_df = pd.DataFrame()
for permalink in [test, test1]:
    df = get_comments_df(permalink)
    comments_df = pd.concat([comments_df, df], axis=0)

In [None]:
# Reset the index
comments_df.reset_index(drop=True, inplace=True)

Unnamed: 0,comment,level,subreddit
0,I'd like to think he'd be happy you remembered...,1,/r/SeriousConversation/comments/b8sgzc/person_...
1,I’d like to think so too.,2,/r/SeriousConversation/comments/b8sgzc/person_...
2,rest in peace stranger \n\nthank you for this ...,1,/r/SeriousConversation/comments/b8sgzc/person_...
3,You’re welcome:),2,/r/SeriousConversation/comments/b8sgzc/person_...
4,True. I think that's why people are often enco...,1,/r/SeriousConversation/comments/b8sgzc/person_...
5,"From my strange stranger's pov, that's the way...",1,/r/SeriousConversation/comments/b8sgzc/person_...
6,Something similar led me to think about life a...,1,/r/SeriousConversation/comments/b8sgzc/person_...
7,I can relate to this. I just turned 24 and Ive...,1,/r/SeriousConversation/comments/b8sgzc/person_...
8,I found out recently that an old co-worker of ...,1,/r/SeriousConversation/comments/b8sgzc/person_...
9,I work at a high school and this year has been...,1,/r/SeriousConversation/comments/b8sgzc/person_...


### Function to get Posts using the comments function

In [7]:
def get_posts(subreddit):
   
    # Gather each post from a subreddit
    posts = []
    after = True
    for i in range(1,81):

        if after is None: # no further pages available, so break the loop
            break
        else:
            params = {'after': after}
        
        url = 'https://www.reddit.com/r/' + subreddit + '.json'
        headers = {'User-agent': 'swhaterBot 0.1'}
        res = requests.get(url, params=params, headers=headers)

        if res.status_code == 200:
            the_json = res.json()
            posts.extend(the_json['data']['children'])
            after = the_json['data']['after']
        else:
            return print(res.status_code)

        # keep status of how many posts collected
        if i % 4 == 0:
            print(f'{i*25} posts collected')

        # sleep for 1 second
        time.sleep(1)  
        
    # Verify no duplicates
    if len(posts) != len(set([p['data']['name'] for p in posts])):
        print('Duplicate Present')
    else:
        print(f'FINAL: {len(posts)} posts collected')
            
        
    posts_list = []
    for counter, post in enumerate(posts):
    
        single_post = {}
        
        # collect desired information from each post
        single_post['title'] = post['data']['title']
        single_post['selftext'] = post['data']['selftext']
        single_post['permalink'] = post['data']['permalink']
        single_post['subreddit'] = post['data']['subreddit']
        single_post['author'] = post['data']['author']
        single_post['url'] = post['data']['url']
        single_post['name'] = post['data']['name']    
        single_post['created'] = post['data']['created']
        single_post['num_comments'] = post['data']['num_comments']
        single_post['ups'] = post['data']['ups']
        single_post['domain'] = post['data']['domain']
        single_post['is_original_content'] = post['data']['is_original_content']
        single_post['edited'] = post['data']['edited']
        single_post['media_only'] = post['data']['media_only']
        single_post['is_video'] = post['data']['is_video']

        # Collect roughly 30 comments if they exist (only first and second level)
        if post['data']['num_comments'] != 0:
            permalink = post['data']['permalink']
            comments_dict = get_comments(permalink)
            # Add comments dictionary for the post to the single_post dictionary
            single_post.update(comments_dict)
        else:
            single_post['comts_first'] = []
            single_post['comts_second'] = []
            single_post['comts_num_collected'] = 0

        posts_list.append(single_post) 
        
        if counter % 100 == 0:
            total_posts = len(posts)
            pct_posts_collected = round((counter / total_posts) * 100, 2)
            print(f'{pct_posts_collected}% posts with comments collected')

    return pd.DataFrame(posts_list)
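
The field-by-field assignments in `get_posts` could also be driven by a list of keys. The sketch below assumes hypothetical names (`POST_FIELDS`, `extract_post`) and uses `dict.get`, so a post that lacks one of the fields yields `None` instead of raising a `KeyError`. The sample post is hand-written and far smaller than a real one.

```python
# Subset of the fields collected above, kept short for illustration
POST_FIELDS = ['title', 'selftext', 'permalink', 'subreddit',
               'author', 'num_comments', 'ups']


def extract_post(post, fields=POST_FIELDS):
    """Copy the requested fields out of one post; missing fields become None."""
    data = post['data']
    return {field: data.get(field) for field in fields}


sample_post = {'data': {'title': 'Hello', 'selftext': '', 'num_comments': 2,
                        'permalink': '/r/example/comments/abc/hello/'}}
row = extract_post(sample_post)
print(row)
```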

### Collect & save data for r/CasualConversation

In [9]:
casual = get_posts('casualconversation')

100 posts collected
200 posts collected
300 posts collected
400 posts collected
500 posts collected
600 posts collected
700 posts collected
800 posts collected
FINAL: 802 posts collected
0.0% posts with comments collected
12.47% posts with comments collected
24.94% posts with comments collected
37.41% posts with comments collected
49.88% posts with comments collected
62.34% posts with comments collected
74.81% posts with comments collected
87.28% posts with comments collected
99.75% posts with comments collected


In [10]:
casual.to_csv('../data/casual_coms.csv')

### Collect & save data for r/SeriousConversation

In [11]:
serious = get_posts('SeriousConversation')

100 posts collected
200 posts collected
300 posts collected
400 posts collected
500 posts collected
600 posts collected
700 posts collected
800 posts collected
900 posts collected
FINAL: 927 posts collected
0.0% posts with comments collected
10.79% posts with comments collected
21.57% posts with comments collected
32.36% posts with comments collected
43.15% posts with comments collected
53.94% posts with comments collected
64.72% posts with comments collected
75.51% posts with comments collected
86.3% posts with comments collected
97.09% posts with comments collected


In [12]:
serious.to_csv('../data/serious_coms.csv')

### Combine & save data

In [13]:
casual.shape

(802, 18)

In [14]:
serious.shape

(927, 18)

In [15]:
pd.concat([casual, serious], axis = 0).shape

(1729, 18)

In [16]:
conversations = pd.concat([casual, serious], axis = 0)

In [17]:
# Save to data folder
conversations.to_csv('../data/conversations.csv', index = False)
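
One thing to keep in mind when loading this file later: the `comts_first` and `comts_second` columns hold Python lists, and a CSV stores them as their string representation. The sketch below uses a tiny made-up DataFrame and an in-memory buffer instead of the real file, and shows one way to turn the strings back into lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Made-up row with the same kind of list column as the collected posts
df = pd.DataFrame({'title': ['a post'],
                   'comts_first': [['first', 'second']]})

# Round-trip through CSV in memory rather than writing to the data folder
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After loading, the column contains strings, not lists
print(type(loaded.loc[0, 'comts_first']))

# Parse each string back into a list
restored = loaded['comts_first'].apply(ast.literal_eval)
print(restored[0])
```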

### Create Dataframe for Only Comments

In [30]:
# Collects first- and second-level comments for every post in a subreddit (stops after roughly 30 per post)
def get_comments_df(subreddit):
    
    # Gather each post from a subreddit
    posts = []
    all_comments_df = pd.DataFrame()
    after = True
    for i in range(1,81):

        if after is None: # no further pages available, so break the loop
            break
        else:
            params = {'after': after}
        
        url = 'https://www.reddit.com/r/' + subreddit + '.json'
        headers = {'User-agent': 'swhaterBot 0.1'}
        res = requests.get(url, params=params, headers=headers)

        if res.status_code == 200:
            the_json = res.json()
            posts.extend(the_json['data']['children'])
            after = the_json['data']['after']
        else:
            return print(res.status_code)

        # sleep for 1 second
        time.sleep(1)  
        
    for counter, post in enumerate(posts):

        permalink = post['data']['permalink']
        num_comments = post['data']['num_comments']
        
        if num_comments != 0:

            # Get URL to comments page using permalink found in post information
            url = 'https://www.reddit.com' + permalink + '.json'
            headers = {'User-agent': 'swhaterBot 0.1'}
            res = requests.get(url, headers=headers)

            # Verify status code is 200; skip this post on failure rather than
            # discarding everything collected so far
            if res.status_code == 200:
                comments = res.json()
            else:
                print('ERROR: ', res.status_code)
                continue

            # Creating list to house a dictionary for each row of comments
            comments_list = []
            num_comments = 0 # keep track of comments...max at 30 per post

            #### LEVEL 1 ####
            coms1_length = len(comments[1]['data']['children'])

            #print('# of Comments: ', coms1_length)
            for i in range(coms1_length):
                comments_dict = {}
                coms_sect1 = comments[1]['data']['children'][i]['data']

                if 'body' in coms_sect1.keys():
                    comments_dict['comment'] = coms_sect1['body'] # actual comment
                    comments_dict['level'] = 1
                    comments_dict['subreddit'] = permalink # permalink contains the subreddit name and can clean later

                    comments_list.append(comments_dict)
                    num_comments += 1

                    if coms_sect1['replies'] != '':

                        #### LEVEL 2 ####
                        coms2_length = len(coms_sect1['replies']['data']['children'])

                        for j in range(coms2_length):
                            comments_dict = {}
                            coms_sect2 = coms_sect1['replies']['data']['children'][j]['data']
                            if 'body' in coms_sect2.keys():
                                comments_dict['comment'] = coms_sect2['body'] # actual comment
                                comments_dict['level'] = 2
                                comments_dict['subreddit'] = permalink # permalink contains the subreddit name and can clean later

                                comments_list.append(comments_dict)
                                num_comments += 1

                if num_comments > 30:
                    break
        
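            # Build this post's DataFrame inside the `if` block so posts without
            # comments neither reuse an earlier comments_list nor hit an undefined one
            a_post_comments_df = pd.DataFrame(comments_list)
            all_comments_df = pd.concat([all_comments_df, a_post_comments_df], axis=0)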
    
    return all_comments_df

In [33]:
serious_coms = get_comments_df('SeriousConversation')

In [34]:
casual_coms = get_comments_df('CasualConversation')

In [42]:
print(serious_coms.shape)
serious_coms.head()

(9596, 3)


Unnamed: 0,comment,level,subreddit
0,I wish it didn't overcrowd my feed,1,/r/SeriousConversation/comments/ayvt9u/looking...
0,My 33yr old sister maced me(29yr old) last nig...,1,/r/SeriousConversation/comments/b7ysjq/megathr...
1,"I'm concocting an utterly evil prank, now all ...",1,/r/SeriousConversation/comments/b7ysjq/megathr...
2,I am coming to the sad realization that I will...,1,/r/SeriousConversation/comments/b7ysjq/megathr...
3,Let's say you're having a discussion with an i...,1,/r/SeriousConversation/comments/b7ysjq/megathr...


In [43]:
print(casual_coms.shape)
casual_coms.head()

(7885, 3)


Unnamed: 0,comment,level,subreddit
0,It helps if you make it a clickable link...\n\...,1,/r/CasualConversation/comments/ayvu98/looking_...
1,It is clickable... it's a link post.,2,/r/CasualConversation/comments/ayvu98/looking_...
2,That sub should have a Reddit chat.,1,/r/CasualConversation/comments/ayvu98/looking_...
3,Agree!,2,/r/CasualConversation/comments/ayvu98/looking_...
4,"hi, id be happy to talk to you :)",1,/r/CasualConversation/comments/ayvu98/looking_...


In [44]:
comments = pd.concat([serious_coms, casual_coms], axis=0)
comments.shape

(17481, 3)

In [47]:
# Save to a new CSV in the data folder
comments.to_csv('../data/comments.csv', index = False)
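
The `subreddit` column above still holds the full permalink, which the code comments note can be cleaned later. A minimal sketch of that cleanup follows, assuming the permalinks keep the `/r/<subreddit>/comments/...` layout seen in the outputs. `subreddit_from_permalink` is a hypothetical helper name.

```python
def subreddit_from_permalink(permalink):
    """Return the subreddit name from a permalink, or None if the layout differs."""
    parts = permalink.strip('/').split('/')
    # Expected layout: r/<subreddit>/comments/<id>/<slug>
    if len(parts) >= 2 and parts[0] == 'r':
        return parts[1]
    return None


example = '/r/SeriousConversation/comments/b8sgzc/person_from_my_graduating_class_died_the_other/'
print(subreddit_from_permalink(example))
```

With pandas, the same function could be applied to the whole column, for example `comments['subreddit'].apply(subreddit_from_permalink)`.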