# Project 3 - Creation vs Evolution

## Problem Statement

In this project, I use Reddit's API to scrape post data from two subreddits, r/Evolution and r/Creation. My goal is to build a model that classifies post titles by the subreddit they came from. Using NLP techniques, how accurately can we predict the subreddit from a post title alone?

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from IPython.core.display import HTML
import pylast
import requests
import time

### Importing JSON data from _reddit.com_

In [4]:
creation_url = 'https://www.reddit.com/r/Creation.json'
evolution_url = 'https://www.reddit.com/r/evolution.json'

In [5]:
headers = {'User-agent': 'harmon get 0.1'}

In [6]:
res_c = requests.get(creation_url, headers=headers)
res_c.status_code

200

In [7]:
res_e = requests.get(evolution_url, headers=headers)
res_e.status_code

200

In [8]:
creation_json = res_c.json()

In [9]:
evolution_json = res_e.json()

### Extracting from the JSON

In [10]:
sorted(creation_json.keys())

['data', 'kind']

In [11]:
sorted(creation_json['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [12]:
# returning list of 'names' of posts
[post['data']['name'] for post in creation_json['data']['children']]

['t3_b8qqma',
 't3_b8r1fb',
 't3_b8jrue',
 't3_b8l1to',
 't3_b87jy5',
 't3_b7spxi',
 't3_b7pf5f',
 't3_b7ojdq',
 't3_b7hsel',
 't3_b72t55',
 't3_b6o9a9',
 't3_b69294',
 't3_b5e196',
 't3_b50v81',
 't3_b4rkix',
 't3_b4ih2k',
 't3_b3sbxb',
 't3_b3p2zl',
 't3_b3e0ys',
 't3_b33s7p',
 't3_b2y783',
 't3_b1je35',
 't3_b1fbzt',
 't3_b16s5l',
 't3_b100ov']
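
The same list comprehension works for any per-post field, so it can be wrapped in a small helper. The sketch below is only an illustration: `extract_field` is a hypothetical name, and `sample_listing` is a hand-built miniature of the listing structure with placeholder titles, not real scraped data.

```python
# Hypothetical miniature of the dict returned by res_c.json();
# the real listing holds 25 posts with many more fields each.
sample_listing = {
    'kind': 'Listing',
    'data': {
        'after': 't3_b8r1fb',
        'children': [
            {'kind': 't3', 'data': {'name': 't3_b8qqma', 'title': 'First placeholder title'}},
            {'kind': 't3', 'data': {'name': 't3_b8r1fb', 'title': 'Second placeholder title'}},
        ],
    },
}

def extract_field(listing, field):
    """Return the value of `field` for every post in a listing dict.

    Posts that lack the field yield None instead of raising KeyError.
    """
    return [post['data'].get(field) for post in listing['data']['children']]

names = extract_field(sample_listing, 'name')
titles = extract_field(sample_listing, 'title')
print(names)
print(titles)
```

Using `.get()` rather than indexing is a design choice here: some keys (for example `post_hint`) may not be present on every post.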

In [13]:
pd.DataFrame(creation_json['data']['children'])

Unnamed: 0,data,kind
0,"{'approved_at_utc': None, 'subreddit': 'Creati...",t3
1,"{'approved_at_utc': None, 'subreddit': 'Creati...",t3
2,"{'approved_at_utc': None, 'subreddit': 'Creati...",t3
3,"{'approved_at_utc': None, 'subreddit': 'Creati...",t3
4,"{'approved_at_utc': None, 'subreddit': 'Creati...",t3
5,"{'approved_at_utc': None, 'subreddit': 'Creati...",t3
6,"{'approved_at_utc': None, 'subreddit': 'Creati...",t3
7,"{'approved_at_utc': None, 'subreddit': 'Creati...",t3
8,"{'approved_at_utc': None, 'subreddit': 'Creati...",t3
9,"{'approved_at_utc': None, 'subreddit': 'Creati...",t3


In [14]:
creation_json['data']['children'][0]['data'].keys()

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'thumbnail_height', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'author_flair_background_color', 'subreddit_type', 'ups', 'domain', 'media_embed', 'thumbnail_width', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'gildings', 'post_hint', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'banned_by', 'author_flair_type', 'contest_mode', 'selftext_html', 'likes', 'suggested_sort', 'banned_at_utc', 'view_count', 'archived', 'no_follow', 'is_crosspostable', 'pinned', 'over_18'

In [15]:
creation_json['data']['after']

't3_b100ov'
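
The `after` value is the name of the last post on the page, and passing it back as a request parameter asks for the next page. A minimal sketch of that idea follows, with the network call replaced by a stand-in so it runs offline. `collect_posts`, `fetch_page`, and `fake_pages` are hypothetical names, and the sketch assumes the listing reports `after` as `None` once there are no further pages.

```python
def collect_posts(fetch_page, max_pages=50):
    """Follow the `after` token from page to page.

    `fetch_page` is a hypothetical callable that takes the current `after`
    token (None for the first page) and returns the parsed JSON dict.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(listing['data']['children'])
        after = listing['data']['after']
        if after is None:   # no further pages; stop rather than start over
            break
    return posts

# Stand-in for the network call: three fake pages keyed by their token.
fake_pages = {
    None: {'data': {'children': [{'data': {'name': 't3_a'}}], 'after': 't3_a'}},
    't3_a': {'data': {'children': [{'data': {'name': 't3_b'}}], 'after': 't3_b'}},
    't3_b': {'data': {'children': [{'data': {'name': 't3_c'}}], 'after': None}},
}

posts = collect_posts(lambda after: fake_pages[after])
print([post['data']['name'] for post in posts])
```

Against the live site, `fetch_page` could wrap a call such as `requests.get(url, params={'after': after}, headers=headers).json()` followed by a short `time.sleep()`. Stopping when `after` is `None` avoids re-requesting the first page, which is one plausible source of repeated posts.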

### Getting multiple pages of posts from r/creation
- Scraping performed on 3/31/2019

In [16]:
creation_posts = []
after = None
for i in range(50):     # 50 pages of posts
    if i % 5 == 0:
        print(f'Parsing page number {i}')
    if after is None:   # first request: no pagination token yet
        params = {}
    else:
        params = {'after': after}
    creation_url = 'https://www.reddit.com/r/Creation.json'
    res_c = requests.get(creation_url, params=params, headers=headers)
    if res_c.status_code == 200:
        creation_json = res_c.json()
        creation_posts.extend(creation_json['data']['children'])
        after = creation_json['data']['after']
    else:
        print(res_c.status_code)
        print('Error fetching reddit json')
        break
    time.sleep(3)       # don't want to be labeled as a DDOS attack

Parsing page numbers 0
Parsing page numbers 1
Parsing page numbers 2
Parsing page numbers 3
Parsing page numbers 4
Parsing page numbers 5
Parsing page numbers 6
Parsing page numbers 7
Parsing page numbers 8
Parsing page numbers 9
Parsing page numbers 10
Parsing page numbers 11
Parsing page numbers 12
Parsing page numbers 13
Parsing page numbers 14
Parsing page numbers 15
Parsing page numbers 16
Parsing page numbers 17
Parsing page numbers 18
Parsing page numbers 19
Parsing page numbers 20
Parsing page numbers 21
Parsing page numbers 22
Parsing page numbers 23
Parsing page numbers 24
Parsing page numbers 25
Parsing page numbers 26
Parsing page numbers 27
Parsing page numbers 28
Parsing page numbers 29
Parsing page numbers 30
Parsing page numbers 31
Parsing page numbers 32
Parsing page numbers 33
Parsing page numbers 34
Parsing page numbers 35
Parsing page numbers 36
Parsing page numbers 37
Parsing page numbers 38
Parsing page numbers 39
Parsing page numbers 40
Parsing page numbers 41
Pa

In [17]:
len(creation_posts)

1246

In [18]:
# are there any duplicate posts?
len(set([post['data']['title'] for post in creation_posts]))

995
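
A `set` removes repeated titles but does not keep them in any particular order, so the row order of a dataframe built from it can differ between runs. As a hedged alternative, the sketch below removes repeats while keeping the first occurrence in its original position. `sample_posts` is a hand-built stand-in shaped like the entries of `creation_posts`.

```python
# Placeholder posts shaped like the entries in creation_posts.
sample_posts = [
    {'data': {'title': 'Liars for Darwin'}},
    {'data': {'title': 'Intelligent Design as a Theory of Information'}},
    {'data': {'title': 'Liars for Darwin'}},   # repeated title
]

titles = [post['data']['title'] for post in sample_posts]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats while keeping the first occurrence in place.
unique_titles = list(dict.fromkeys(titles))

print(len(titles), len(unique_titles))
print(unique_titles)
```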

In [19]:
# dataframe from set of unique post titles
crea_df = pd.DataFrame(set([post['data']['title'] for post in creation_posts]))

In [20]:
crea_df.head()

Unnamed: 0,0
0,Between Knowing and Believing: Can we be certa...
1,What things around you do you observe that mak...
2,Intelligent Design as a Theory of Information
3,A great explanation of entropy (with sheep!)
4,Liars for Darwin


In [21]:
crea_df.rename(columns={0:'title'}, inplace=True)

In [22]:
crea_df['is_Evolution'] = 0

In [41]:
crea_df.head(3)

Unnamed: 0,title,is_Evolution
0,Between Knowing and Believing: Can we be certa...,0
1,What things around you do you observe that mak...,0
2,Intelligent Design as a Theory of Information,0


### Getting multiple pages of posts from r/evolution
- Scraping performed on 3/31/2019

In [24]:
evolution_posts = []
after = None
for i in range(50):
    if i % 5 == 0:
        print(f'Parsing page number {i}')
    if after is None:   # first request: no pagination token yet
        params = {}
    else:
        params = {'after': after}
    evolution_url = 'https://www.reddit.com/r/evolution.json'
    res_e = requests.get(evolution_url, params=params, headers=headers)
    if res_e.status_code == 200:
        evolution_json = res_e.json()
        evolution_posts.extend(evolution_json['data']['children'])
        after = evolution_json['data']['after']
    else:
        print(res_e.status_code)
        print('Error fetching reddit json')
        break
    time.sleep(3)

Parsing page number 0
Parsing page number 1
Parsing page number 2
Parsing page number 3
Parsing page number 4
Parsing page number 5
Parsing page number 6
Parsing page number 7
Parsing page number 8
Parsing page number 9
Parsing page number 10
Parsing page number 11
Parsing page number 12
Parsing page number 13
Parsing page number 14
Parsing page number 15
Parsing page number 16
Parsing page number 17
Parsing page number 18
Parsing page number 19
Parsing page number 20
Parsing page number 21
Parsing page number 22
Parsing page number 23
Parsing page number 24
Parsing page number 25
Parsing page number 26
Parsing page number 27
Parsing page number 28
Parsing page number 29
Parsing page number 30
Parsing page number 31
Parsing page number 32
Parsing page number 33
Parsing page number 34
Parsing page number 35
Parsing page number 36
Parsing page number 37
Parsing page number 38
Parsing page number 39
Parsing page number 40
Parsing page number 41
Parsing page number 42
Parsing page number 4

In [25]:
len(evolution_posts)

1232

In [26]:
# are there any duplicate posts?
len(set([post['data']['title'] for post in evolution_posts]))

974

In [27]:
# dataframe from set of unique post titles
evo_df = pd.DataFrame(set([post['data']['title'] for post in evolution_posts]))

In [29]:
evo_df.rename(columns={0:'title'}, inplace=True)

In [30]:
evo_df['is_Evolution'] = 1

In [31]:
evo_df.head()

Unnamed: 0,title,is_Evolution
0,Anti-evolution courses on Udemy,1
1,When Birds Stopped Flying PBS Eons,1
2,I am currently researching creationism and evo...,1
3,The flaws that I see with the Savanna hypothes...,1
4,These Female Insects Evolved Penises,1
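
Both subreddits go through the same three steps: collect unique titles, name the column, and add the label. A possible way to express that once is sketched below. `titles_frame` is a hypothetical helper, not something defined in this notebook, and `sample_posts` is a small stand-in for `evolution_posts`.

```python
import pandas as pd

def titles_frame(posts, label):
    """Build a frame of unique titles with a constant is_Evolution label.

    `titles_frame` is a hypothetical helper, not part of the notebook.
    """
    # dict.fromkeys drops repeated titles and keeps first-seen order.
    titles = list(dict.fromkeys(post['data']['title'] for post in posts))
    frame = pd.DataFrame({'title': titles})
    frame['is_Evolution'] = label
    return frame

# Small stand-in for evolution_posts, including one repeated title.
sample_posts = [
    {'data': {'title': 'Anti-evolution courses on Udemy'}},
    {'data': {'title': 'These Female Insects Evolved Penises'}},
    {'data': {'title': 'Anti-evolution courses on Udemy'}},
]
evo_sample = titles_frame(sample_posts, label=1)
print(evo_sample.shape)
print(evo_sample.columns.tolist())
```

With the real data this would be called once per subreddit, for example with `label=1` for r/evolution and `label=0` for r/Creation.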


### Combining post title dataframes

In [42]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([evo_df, crea_df])

In [43]:
df.head(3)

Unnamed: 0,title,is_Evolution
0,Anti-evolution courses on Udemy,1
1,When Birds Stopped Flying PBS Eons,1
2,I am currently researching creationism and evo...,1


In [44]:
df.tail(3)

Unnamed: 0,title,is_Evolution
992,"I just started a blog about Ecology, Environme...",0
993,Replacing Darwin - An Interview with Nathaniel...,0
994,Is there any evidence that rapid speciation to...,0


The dataframe should have 1969 rows (974 r/evolution titles plus 995 r/Creation titles), but the tail shows an index that only reaches 994. The two dataframes were stacked with each one keeping its own 0-based index, so most index labels now refer to two different rows.

To resolve this, we use `.reset_index(drop=True, inplace=True)` to give the combined dataframe a fresh, unique index. Passing `drop=True` discards the old labels instead of keeping them as a new column.

In [45]:
# resetting the index so that every row has a unique label
df.reset_index(drop=True, inplace=True)
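
The combine and re-index steps can also be done in a single call. The sketch below shows this with `pd.concat(..., ignore_index=True)` on two tiny stand-in frames. `evo_small` and `crea_small` are hypothetical names holding a few of the titles shown earlier, not the full scraped data.

```python
import pandas as pd

# Tiny stand-ins for evo_df and crea_df.
evo_small = pd.DataFrame({'title': ['Anti-evolution courses on Udemy',
                                    'These Female Insects Evolved Penises'],
                          'is_Evolution': 1})
crea_small = pd.DataFrame({'title': ['Liars for Darwin'],
                           'is_Evolution': 0})

# ignore_index=True discards the per-frame labels and numbers rows 0..n-1.
combined = pd.concat([evo_small, crea_small], ignore_index=True)

print(combined.index.tolist())
print(combined['is_Evolution'].tolist())
```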

In [46]:
df.head(3)

Unnamed: 0,title,is_Evolution
0,Anti-evolution courses on Udemy,1
1,When Birds Stopped Flying PBS Eons,1
2,I am currently researching creationism and evo...,1


In [47]:
df.tail(3)

Unnamed: 0,title,is_Evolution
1966,"I just started a blog about Ecology, Environme...",0
1967,Replacing Darwin - An Interview with Nathaniel...,0
1968,Is there any evidence that rapid speciation to...,0


### Saving the "raw" dataframe

In [57]:
df.to_csv('./data/raw_titles.csv')
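
By default `to_csv` also writes the index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below illustrates the difference using in-memory buffers instead of a file on disk. `small` is a one-row stand-in for `df`, and whether to drop the index is a judgment call that depends on how the file is read later.

```python
import io
import pandas as pd

# One-row stand-in for the combined dataframe.
small = pd.DataFrame({'title': ['Liars for Darwin'], 'is_Evolution': [0]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
small.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
small.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

An alternative that leaves the saved file unchanged is to read it back with `pd.read_csv(path, index_col=0)`.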