## Additional Data Collection

As indicated in the previous notebook (`01-collection.ipynb`), the data collection process was done in two batches to ensure an adequate sample size for each subreddit. Because this and other phases of the survey are time-intensive, some related processes were split across more than one notebook so that information could be refreshed and cleaned as efficiently as possible.

Imports:

In [1]:
import requests
import pandas as pd
import time
import datetime as dt

The previous dataset for the Catholicism subreddit is read in to extract its lowest timestamp (i.e., the `created_utc` value of the oldest post included).

In [3]:
buffer = pd.read_csv('../datasets/Catholicism.csv')

In [6]:
uts = buffer['created_utc'].min()

In [7]:
uts

1623577170
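As a sanity check, a raw Unix timestamp such as the one above can be converted to a readable UTC date. The following is a minimal sketch using only the standard library; the helper name `utc_from_unix` is hypothetical and not part of the original notebook.

```python
import datetime as dt

def utc_from_unix(ts):
    """Convert a Unix timestamp (seconds since the epoch) to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(ts, tz=dt.timezone.utc)

# Oldest Catholicism post in the first batch
print(utc_from_unix(1623577170).isoformat())  # 2021-06-13T09:39:30+00:00
```

Passing `tz=dt.timezone.utc` keeps the result independent of the machine's local time zone.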

Below, a function similar to that in `01-collection.ipynb` constructs a Pushshift URL from each subreddit's particular information. In this case, a Unix timestamp parameter (`before`) is added so that only posts older than the first batch are requested. Because only one extra batch of posts was needed for each subreddit, the `2` at the end of the resulting `.csv` name is simply hardcoded.

In [14]:
def reddit_scrape(subreddit, unix_time_stamp):
    s_type = 'submission'
    columns = ['title', 'author', 'created_utc', 'selftext', 'subreddit']

    # First request: up to 100 posts older than the supplied timestamp
    url = f'https://api.pushshift.io/reddit/search/{s_type}/?subreddit={subreddit}&before={unix_time_stamp}&size=100'

    time.sleep(5)  # pause between requests to avoid overloading the API
    res = requests.get(url)
    post_list = res.json()['data']
    full_df = pd.DataFrame(post_list)[columns]
    print(1)

    # Each further request pages backwards from the oldest post collected so far
    for n in range(1, 45):
        uts = full_df['created_utc'].min()
        url = f'https://api.pushshift.io/reddit/search/{s_type}/?subreddit={subreddit}&before={uts}&size=100'

        time.sleep(5)
        res = requests.get(url)

        post_list = res.json()['data']
        if not post_list:  # no older posts left; stop instead of failing on an empty frame
            break
        temp_df = pd.DataFrame(post_list)[columns]

        full_df = pd.concat([full_df, temp_df])

        print(n+1)  # progress counter: number of batches fetched so far

    # The '2' suffix marks this as the second batch for the subreddit
    full_df.to_csv('../datasets/' + subreddit + '2.csv', index=False)
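The backwards-paging idea in the function above (request a page, take the smallest `created_utc`, use it as the next `before` value) can be separated from the network call so that it is easy to test offline. The following is a minimal sketch, not the notebook's actual implementation: the names `build_url`, `collect_before`, and the `fetch_page` callable are hypothetical, and the URL layout is copied from the function above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.pushshift.io/reddit/search/submission/'

def build_url(subreddit, before, size=100):
    """Assemble the query string from a dict instead of an f-string."""
    params = {'subreddit': subreddit, 'before': before, 'size': size}
    return f'{BASE_URL}?{urlencode(params)}'

def collect_before(fetch_page, subreddit, start_uts, max_pages=45):
    """Page backwards in time, one request per page.

    fetch_page is an assumed callable that takes a URL and returns
    a list of post dicts, each with a 'created_utc' key.
    """
    posts = []
    cursor = start_uts
    for _ in range(max_pages):
        page = fetch_page(build_url(subreddit, cursor))
        if not page:  # nothing older is available, so stop early
            break
        posts.extend(page)
        # The oldest post on this page becomes the next 'before' cursor
        cursor = min(post['created_utc'] for post in page)
    return posts
```

In the notebook's setting, `fetch_page` could wrap the same calls used above, for example a small function that sleeps for five seconds and then returns `requests.get(url).json()['data']`. Injecting it as a parameter is a design choice that lets the paging logic be exercised with fake data.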

Additional data gathered from the Catholicism subreddit:

In [9]:
reddit_scrape('Catholicism', uts)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50


Similarly, more data is gathered from the OrthodoxChristianity subreddit. Again, the oldest Unix timestamp in the existing dataset is read in and the function is called, resulting in an additional `.csv`.

In [10]:
buffer = pd.read_csv('../datasets/OrthodoxChristianity.csv')

In [11]:
uts = buffer['created_utc'].min()

In [12]:
uts

1607351421

In [15]:
reddit_scrape('OrthodoxChristianity', uts)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45


The separate dataframes are concatenated in the next notebook (`03-cleaning.ipynb`).