# Using Reddit's API to collect posts

In this project, I am attempting to answer the question: do Republicans and Democrats discuss different topics? To answer it, I will scrape posts from both Republican and Democratic subreddits and then build a classification model to see whether I can predict which subreddit a post came from. If the model can do so with a high degree of accuracy, this suggests that Democrats and Republicans discuss different topics, or discuss the same topics with differing sentiments. I will also look at histograms of the most popular words from both subreddits and perform t-tests on the top words that appear in both, to see whether their mean frequencies differ significantly.
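
The t-test step could be sketched roughly as follows. This is only an illustration, not the analysis itself: the two lists of per-post word counts are made-up placeholder numbers, `welch_t` is a hypothetical helper, and the unequal-variance (Welch) form of the statistic is an assumption on my part. The sketch computes only the t statistic; turning it into a p-value additionally needs a t distribution, for example from `scipy.stats`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic for two independent samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    # variance() is the sample variance (n - 1 in the denominator)
    standard_error = sqrt(variance(sample_a) / n_a + variance(sample_b) / n_b)
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Placeholder data: occurrences of one word per post in each subreddit (made up)
freq_rep = [2, 0, 1, 3, 2]
freq_dem = [0, 1, 0, 1, 0]

t_stat = welch_t(freq_rep, freq_dem)
print(round(t_stat, 3))
```

A large absolute value of the statistic indicates that the word's mean frequency differs between the two groups by more than the within-group variation would suggest.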

### Scraping Thread Info from Reddit.com

In [2]:
import requests
import json
import pandas as pd
import time

In [2]:
headers = {'User-agent': 'Carl Lehman Bot 0.1'}

In [3]:
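posts = []
after = None  # pagination cursor returned by the previous page
URL1 = 'https://www.reddit.com/r/Republican.json'
for i in range(40):
    print(i)
    # The first request has no cursor; later requests pass the previous page's 'after' value
    params = {} if after is None else {'after': after}
    res = requests.get(URL1, params=params, headers=headers)
    if res.status_code == 200:
        data = res.json()
        posts.extend(data['data']['children'])
        after = data['data']['after']
    else:
        # Report the failing status code and stop requesting
        print(res.status_code)
        break
    # Pause between requests to avoid hammering the server
    time.sleep(3)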



In [4]:
len(posts)

976
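
The request loop follows Reddit's `after` cursor from one page to the next, and 40 requests returning 976 posts works out to roughly 25 posts per request. Below is a minimal sketch of that pagination pattern, separated from the network call so it can be tried offline. `collect_listing` and `fake_fetch` are hypothetical names, and the fake pages are placeholders standing in for real responses. Unlike the loop in this notebook, the sketch stops as soon as the cursor comes back empty, on the assumption that an `after` value of `None` means there are no further pages.

```python
def collect_listing(fetch_page, max_pages=40):
    """Follow the 'after' cursor through a paginated listing.

    fetch_page(after) must return the parsed JSON of one page
    (a dict shaped like Reddit's listing response).
    """
    posts = []
    after = None
    for _ in range(max_pages):
        data = fetch_page(after)
        posts.extend(data['data']['children'])
        after = data['data']['after']
        if after is None:
            # No cursor returned: assume this was the last page
            break
    return posts


# Offline stand-in for requests.get(...).json(): three tiny fake pages
FAKE_PAGES = {
    None: {'data': {'children': [{'data': {'name': 't3_a'}}], 'after': 't3_a'}},
    't3_a': {'data': {'children': [{'data': {'name': 't3_b'}}], 'after': 't3_b'}},
    't3_b': {'data': {'children': [{'data': {'name': 't3_c'}}], 'after': None}},
}


def fake_fetch(after):
    return FAKE_PAGES[after]


names = [post['data']['name'] for post in collect_listing(fake_fetch)]
print(names)
```

With real requests, `fetch_page` could wrap the same `requests.get` call, headers, and `time.sleep(3)` pause used in this notebook.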

### Saving my results as JSON

In [14]:
# Export the raw posts to JSON
with open('../Data/data_republican_9_5.json', 'w') as f:
    json.dump(posts, f)

### Opening my two JSON files (scraped at different times) and joining them

In [3]:
with open('../Data/data_republican_9_5.json', 'r') as f:
    r95 = json.load(f)

In [4]:
with open('../Data/data_republican_8_29.json', 'r') as f:
    r829 = json.load(f)

In [8]:
rep1 = pd.DataFrame([post['data']['selftext'] for post in r95], index = [post['data']['name'] for post in r95])

In [10]:
rep1['title'] = ([post['data']['title'] for post in r95])

In [13]:
rep2 = pd.DataFrame([post['data']['selftext'] for post in r829], index = [post['data']['name'] for post in r829])
rep2['title'] = ([post['data']['title'] for post in r829])

### Making one DataFrame out of my two Republican DataFrames and dropping the duplicates

In [14]:
Republican = pd.concat([rep2,rep1], axis=0)

In [18]:
# Class label for the classifier: 1 marks posts from r/Republican
Republican['subreddit'] = 1

In [19]:
Republican.drop_duplicates(inplace=True)
len(Republican)

710
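
Note that `drop_duplicates` compares the column values of each row and ignores the index, so two different posts with the same title and an empty `selftext` are also collapsed into one row. If the aim is only to remove posts that were collected in both scrapes, one alternative is to deduplicate on the index, which holds each post's `name`. Here is a minimal sketch on made-up rows; the post names and titles are placeholders, and the text column is named `selftext` here purely for readability.

```python
import pandas as pd

# Made-up rows standing in for two overlapping scrapes
older = pd.DataFrame(
    {'selftext': ['', ''], 'title': ['Same headline', 'Another post']},
    index=['t3_a', 't3_b'],
)
newer = pd.DataFrame(
    {'selftext': ['', ''], 'title': ['Another post', 'Same headline']},
    index=['t3_b', 't3_c'],
)
combined = pd.concat([older, newer], axis=0)

# Value-based: t3_c is dropped because its text matches t3_a
by_values = combined.drop_duplicates()

# Index-based: only the repeated post name t3_b is dropped
by_name = combined[~combined.index.duplicated(keep='first')]

print(len(by_values), len(by_name))
```

Which behavior is preferable depends on the goal: value-based deduplication also removes reposts of identical text, while index-based deduplication keeps every distinct post.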

### Saving my result as a CSV

In [20]:
Republican.to_csv('../Data/republican.csv')