# 1a. Data Collection and Storage - from Dad Jokes subreddit

This notebook is the first of four in the Reddit API scrape and classification project.
The work is performed in three sections:
 1. **Data scrape:** `requests.get` is used to download >1500 unique posts from the Dad Jokes subreddit
 2. **.json data converted to dataframe:** Only the relevant fields from the json response are converted to a dataframe
 3. **Data archived:** The dataframe is saved as a .csv file and used in notebook #3


A total of 1738 unique posts were obtained from the following two links (`.json` is added to each URL for the scrape):
   - https://www.reddit.com/r/dadjokes/top/?t=month
   - https://www.reddit.com/r/dadjokes/new/
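
The `.json` step can be sketched as a small helper. This is only an illustration: `to_json_url` is a hypothetical name, not part of the notebook, and it assumes the `.json` suffix belongs at the end of the URL path, before any query string.

```python
from urllib.parse import urlsplit, urlunsplit

def to_json_url(page_url):
    """Append .json to the path of a listing URL, keeping the query string."""
    parts = urlsplit(page_url)
    # drop a trailing slash so "top/" becomes "top.json"
    path = parts.path.rstrip("/") + ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

top_json = to_json_url("https://www.reddit.com/r/dadjokes/top/?t=month")
new_json = to_json_url("https://www.reddit.com/r/dadjokes/new/")
print(top_json)
print(new_json)
```

The first result matches `url_1` used later in this notebook. The second comes out as `new.json` rather than the `new/.json` form used for `url_2`.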

Import libraries

In [1]:
import requests
import time
import pandas as pd

### Section 1: Data scrape
>**Run the next cell the first time only. COMMENT IT OUT AFTER RUNNING; DON'T RUN IT AGAIN**
<br>The cell resets `posts` and `after`, and the for loop is run twice, so re-running the cell would discard the posts already downloaded.
   - Each subreddit link (top vs. hot vs. new vs. controversial, etc.) can only download a maximum of 1000 posts because that is all that is stored in the API.
   - We will use the for loop with one dad jokes link (top) to download up to 1000 posts.
   - We will then use the for loop again to download up to 1000 more posts from another dad jokes link (new).
   - This is repeated until a sufficient number of posts has been obtained.
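
Because the same post can appear under more than one link, the combined downloads need to be de-duplicated. A minimal sketch of that idea follows, assuming each post carries its identifier under `['data']['name']` as in the cells further down. `merge_unique` and the sample posts are made up for illustration.

```python
def merge_unique(*listings):
    """Combine lists of posts, keeping the first copy of each post name."""
    seen = set()
    merged = []
    for listing in listings:
        for post in listing:
            name = post["data"]["name"]
            if name not in seen:
                seen.add(name)
                merged.append(post)
    return merged

# made-up stand-ins for the "top" and "new" downloads
top_posts = [{"data": {"name": "t3_a"}}, {"data": {"name": "t3_b"}}]
new_posts = [{"data": {"name": "t3_b"}}, {"data": {"name": "t3_c"}}]

combined = merge_unique(top_posts, new_posts)
print([p["data"]["name"] for p in combined])
```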

In [2]:
# posts is an empty list that collects the posts from each json response, 25 posts at a time during the for loop
# if after = None, the download starts from the beginning of the API's listing

# posts = []
# after = None

__Instructions for code blocks below:__ 
   1. Set `url = url_1`
   2. Add a `user_agent` name to the header. The name is arbitrary. 
   3. Run the for loop to submit 1000 requests. Each request will contain a max of 25 posts. The total number of unique posts that can be downloaded per link is 1000.
   4. Check the number of unique posts.
   5. Repeat step #1 with a new link (set `url = url_2`); repeat steps 2-4 until >1500 unique posts are obtained.
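
As a rough check on the numbers in step 3, the figures stated there (25 posts per request, 1000 posts per link) give the number of requests needed to reach the cap. The constant names below are illustrative only.

```python
import math

POSTS_PER_REQUEST = 25   # page size stated in step 3
LISTING_CAP = 1000       # maximum unique posts per link, as stated in step 3

# requests needed before the cap is reached
requests_to_reach_cap = math.ceil(LISTING_CAP / POSTS_PER_REQUEST)
print(requests_to_reach_cap)  # 40
```

If those figures hold, most of the 1000 iterations presumably re-download posts that were already collected, which would be consistent with the duplicates counted in step 4.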

Step 1:

In [7]:
url_1 = "https://www.reddit.com/r/dadjokes/top.json?t=month"
url_2 = "https://www.reddit.com/r/dadjokes/new/.json"
#url_3 = 'https://www.reddit.com/r/dadjokes/hot/.json'

url = url_2

Step 2:

In [4]:
# the HTTP header is spelled "User-Agent"; the name is arbitrary, change it if necessary to collect more posts
header = {"User-Agent": "amytaylor"}

Step 3:

In [None]:
for i in range(1000):
    print(i)
    # on the first request there is no cursor yet, so send no parameters
    if after is None:
        params = {}
    else:
        params = {'after': after}
    res = requests.get(url, params=params, headers=header)
    if res.status_code == 200:
        the_json = res.json()
        # add this page of posts and remember the cursor for the next page
        posts.extend(the_json['data']['children'])
        after = the_json['data']['after']

    # pause between requests
    time.sleep(0.4)
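
The loop above always runs all 1000 iterations. A possible variation stops as soon as the listing runs out. This is a sketch under two assumptions: the API reports `after` as null on the last page, and `fetch_page` (a hypothetical callable, not part of the notebook) returns the parsed json for a given cursor or `None` on failure. The fake pages stand in for the real API so the example runs offline.

```python
import time

def collect_posts(fetch_page, max_requests=40, pause=0.0):
    """Follow `after` cursors until the listing is exhausted or a request fails."""
    posts, after = [], None
    for _ in range(max_requests):
        page = fetch_page(after)      # parsed json dict, or None on failure
        if page is None:
            break
        posts.extend(page["data"]["children"])
        after = page["data"]["after"]
        if after is None:             # no further page: stop instead of starting over
            break
        time.sleep(pause)
    return posts

# made-up stand-in for the API: two pages keyed by cursor
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"name": "t3_a"}},
                                 {"data": {"name": "t3_b"}}],
                    "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"name": "t3_c"}}],
                      "after": None}},
}

def fake_fetch(after):
    return FAKE_PAGES.get(after)

demo = collect_posts(fake_fetch)
print([p["data"]["name"] for p in demo])
```

With the real API, `fetch_page` would wrap the `requests.get` call from the loop above and return `res.json()` when the status code is 200.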

Step 4:

In [9]:
# check the length of total posts downloaded and unique posts
print(len(posts))
len(set([p['data']['name'] for p in posts]))

3447


1738

So out of 3447 posts, 1738 are unique. This is enough to perform my classification analysis.
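
To see which posts were downloaded more than once, the same `name` field can be tallied. A small sketch on made-up data follows; `sample_posts` is not the real download.

```python
from collections import Counter

# made-up posts: "t3_a" was downloaded three times
sample_posts = [{"data": {"name": n}} for n in ["t3_a", "t3_b", "t3_a", "t3_c", "t3_a"]]

counts = Counter(p["data"]["name"] for p in sample_posts)
total = sum(counts.values())   # every downloaded post, duplicates included
unique = len(counts)           # distinct post names
repeated = {name: n for name, n in counts.items() if n > 1}
print(total, unique, repeated)
```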


---
### Section 2: Convert json to a dataframe

In [None]:
df = [x['data'] for x in posts]
df = pd.DataFrame(df)
df.head()

In [11]:
df.shape

(3447, 99)

**Examine the column names and decide which columns to keep in the dataframe**

In [12]:
df.columns

Index(['approved_at_utc', 'approved_by', 'archived', 'author',
       'author_cakeday', 'author_flair_background_color',
       'author_flair_css_class', 'author_flair_richtext',
       'author_flair_template_id', 'author_flair_text',
       'author_flair_text_color', 'author_flair_type', 'author_fullname',
       'author_patreon_flair', 'banned_at_utc', 'banned_by', 'can_gild',
       'can_mod_post', 'category', 'clicked', 'content_categories',
       'contest_mode', 'created', 'created_utc', 'crosspost_parent',
       'crosspost_parent_list', 'distinguished', 'domain', 'downs', 'edited',
       'gilded', 'gildings', 'hidden', 'hide_score', 'id', 'is_crosspostable',
       'is_meta', 'is_original_content', 'is_reddit_media_domain',
       'is_robot_indexable', 'is_self', 'is_video', 'likes',
       'link_flair_background_color', 'link_flair_css_class',
       'link_flair_richtext', 'link_flair_template_id', 'link_flair_text',
       'link_flair_text_color', 'link_flair_type', 'locked'

**Drop duplicate posts**

In [14]:
df = df.drop_duplicates(['name'])

In [15]:
df.shape

(1738, 99)

**List of columns to keep:**
    'name', 'title', 'selftext', 
    'subreddit', 'created', 'author',
    'num_comments', 'ups', 'downs', 'score'
    

In [None]:
# create a data frame from the name column
df = [p['data']['name'] for p in posts]
df = pd.DataFrame(df, columns=['name'])

# add REQUIRED columns
df['title'] = [p['data']['title'] for p in posts]
df['selftext'] = [p['data']['selftext'] for p in posts]

# add ADDITIONAL columns (just for fun)
df['subreddit'] = [p['data']['subreddit'] for p in posts]
df['created'] = [p['data']['created'] for p in posts]
df['author'] = [p['data']['author'] for p in posts]
df['num_comments'] = [p['data']['num_comments'] for p in posts]
df['ups'] = [p['data']['ups'] for p in posts]
df['downs'] = [p['data']['downs'] for p in posts]
df['score'] = [p['data']['score'] for p in posts]

# posts still contains the duplicates, so drop them again after rebuilding the dataframe
df = df.drop_duplicates(subset='name').reset_index(drop=True)

df.head()
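
The column-by-column cell above could also be written as one pass over the posts. This is a sketch of that alternative, not what was actually run: `to_rows` is a hypothetical helper, the sample post is made up, and missing fields are filled with `None` rather than raising an error.

```python
KEEP = ["name", "title", "selftext", "subreddit", "created",
        "author", "num_comments", "ups", "downs", "score"]

def to_rows(posts, keep=KEEP):
    """Reduce each post to a dict holding only the columns listed in keep."""
    return [{key: post["data"].get(key) for key in keep} for post in posts]

# made-up post with one field that should be ignored and several that are missing
sample = [{"data": {"name": "t3_a", "title": "A joke", "selftext": "punchline",
                    "score": 12, "extra_field": "ignored"}}]

rows = to_rows(sample)
print(rows[0]["title"], rows[0]["score"], rows[0]["ups"])
```

Passing `rows` to `pd.DataFrame` should give a dataframe with the same ten columns.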

### Section 3: Save dataframe

In [None]:
# NOTE: REMEMBER TO COMMENT OUT AFTER EXECUTING. index=False saves the dataframe without writing the index as an extra column
# df.to_csv("./datasets/dad.csv", index = False)
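
Instead of commenting the save line in and out, one option is to write the file only when it does not exist yet. A minimal sketch of that guard follows: `save_once` is a hypothetical helper, and a plain text write stands in for `df.to_csv` so the example runs without pandas.

```python
from pathlib import Path
import tempfile

def save_once(path, write):
    """Call write(path) only if the file does not exist yet; return True if written."""
    path = Path(path)
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    write(path)
    return True

# demonstrate in a temporary folder so nothing real is overwritten
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "datasets" / "dad.csv"
    first = save_once(target, lambda p: p.write_text("name,title\n"))
    second = save_once(target, lambda p: p.write_text("overwritten\n"))
    content = target.read_text()

print(first, second, repr(content))
```

In this notebook the call would look something like `save_once("./datasets/dad.csv", lambda p: df.to_csv(p, index=False))`.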