## Initial Data Collection

Because collecting large amounts of data is time-intensive, posts from both subreddits in question were collected in two batches, which limits the impact of occasional connection issues. Likewise, to keep the Jupyter notebooks clean and readable, the process is divided between this notebook and the next (`02_collection_continued.ipynb`).

Initial imports:

In [1]:
import requests
import pandas as pd
import time
import datetime as dt

Below is a standard function for pulling subreddit data from Pushshift. The function takes a string (the name of the subreddit), builds the request URL, and anchors the search at a Unix timestamp that was, at the time of collection, the present moment. With a `time.sleep` call before each request to avoid any server dust-ups, it goes on to build a dataframe of up to 5,000 rows, 100 rows at a time (50 requests in total). A `print()` statement provides a ticker to track the collection process. Finally, the data is written to a `.csv` file in the `datasets` folder.

In [2]:
def reddit_scrape(subreddit):
    """Collect up to 5,000 submissions from a subreddit and save them as a CSV."""
    s_type = 'submission'
    columns = ['title', 'author', 'created_utc', 'selftext', 'subreddit']
    # Anchor: only posts created before this Unix timestamp are collected
    unix_time_stamp = 1626825653

    url = f'https://api.pushshift.io/reddit/search/{s_type}/?subreddit={subreddit}&before={unix_time_stamp}&size=100'

    # Pause before each request to go easy on the server
    time.sleep(5)
    res = requests.get(url)
    post_list = res.json()['data']
    full_df = pd.DataFrame(post_list)[columns]
    print(1)  # progress ticker

    for n in range(1, 50):
        # Continue from the oldest post collected so far
        uts = full_df['created_utc'].min()
        url = f'https://api.pushshift.io/reddit/search/{s_type}/?subreddit={subreddit}&before={uts}&size=100'

        time.sleep(5)
        res = requests.get(url)

        post_list = res.json()['data']
        temp_df = pd.DataFrame(post_list)[columns]

        full_df = pd.concat([full_df, temp_df])

        print(n + 1)

    # Save the combined posts without the dataframe index
    full_df.to_csv('../datasets/' + subreddit + '.csv', index=False)
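
The paging logic in `reddit_scrape` can also be exercised without touching the network. The following is a minimal sketch, not the function used for this project: `collect_pages`, `fetch_page`, and `fake_fetch` are hypothetical names, and `fetch_page(before)` is assumed to return a list of post dictionaries created before the given timestamp, much like the `data` field of the response above. As an added assumption, the sketch stops early when a page comes back empty, a case the original function does not handle.

```python
import pandas as pd

COLUMNS = ['title', 'author', 'created_utc', 'selftext', 'subreddit']


def collect_pages(fetch_page, before, pages=50):
    """Page backwards in time with a caller-supplied fetch function.

    `fetch_page` is a hypothetical callable: it takes a Unix timestamp and
    returns a list of post dicts created before that timestamp.
    """
    frames = []
    for _ in range(pages):
        post_list = fetch_page(before)
        if not post_list:
            # Nothing older is available, so stop instead of raising a KeyError
            break
        page_df = pd.DataFrame(post_list)[COLUMNS]
        frames.append(page_df)
        # The next request starts just before the oldest post seen so far
        before = int(page_df['created_utc'].min())
    if not frames:
        return pd.DataFrame(columns=COLUMNS)
    return pd.concat(frames, ignore_index=True)


def fake_fetch(before, size=3):
    """Stand-in for the API: posts exist at timestamps 1 through 7 only."""
    stamps = [t for t in range(7, 0, -1) if t < before][:size]
    return [{'title': f'post {t}', 'author': 'someone', 'created_utc': t,
             'selftext': '', 'subreddit': 'example'} for t in stamps]


demo_df = collect_pages(fake_fetch, before=8, pages=50)
print(len(demo_df))
```

With the real API, `fetch_page` would wrap the `requests.get` call and the `time.sleep` pause shown in the cell above.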

First, the Catholicism subreddit:

In [5]:
reddit_scrape('Catholicism')

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
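
Once the ticker reaches 50, a quick check on the saved data can be worthwhile. This is a sketch under stated assumptions: `summarize_posts` is a hypothetical helper, and the small frame below is made-up demonstration data rather than collected posts. It reports the row count, the number of exact duplicate rows, and the date range covered, converting `created_utc` from Unix seconds with `pd.to_datetime`.

```python
import pandas as pd


def summarize_posts(df):
    """Return row count, duplicate count, and the UTC date range of a posts frame."""
    # created_utc holds Unix seconds; unit='s' converts them to timestamps
    created = pd.to_datetime(df['created_utc'], unit='s')
    return {
        'rows': len(df),
        'duplicates': int(df.duplicated().sum()),
        'oldest': created.min(),
        'newest': created.max(),
    }


# Made-up rows for illustration; the real input would be
# pd.read_csv('../datasets/Catholicism.csv')
demo = pd.DataFrame({
    'title': ['a', 'b', 'b'],
    'author': ['x', 'y', 'y'],
    'created_utc': [1626825653, 1626739253, 1626739253],
    'selftext': ['', 'text', 'text'],
    'subreddit': ['example'] * 3,
})

summary = summarize_posts(demo)
print(summary)
```

The first demonstration timestamp is the anchor used in `reddit_scrape`; 1626825653 converts to 2021-07-21 00:00:53 UTC, so every collected post should be older than that.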


Next, Orthodoxy:

In [8]:
reddit_scrape('OrthodoxChristianity')

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
