# Subreddit Classification Using Language Processing
## Part 1 of 4: Data Collection

#### Notebooks
- [01_data_collection](./01_data_collection.ipynb)
- [02_eda_and_cleaning](./02_eda_and_cleaning.ipynb)
- [03_visualizing](./03_visualizing.ipynb)
- [04_modeling](./04_modeling.ipynb)

#### This Notebook's Contents
- [Resources](#Resources)
- [Pulling Subreddit Data](#Pulling-Subreddit-Data)

## Resources
1. [Artificial Intelligence Subreddit](https://www.reddit.com/r/artificial/)  
2. [Data Science Subreddit](https://www.reddit.com/r/datascience/)  
3. [Pushshift API Wrapper Documentation](https://pushshift.io/api-parameters/)
4. [Reddit API Documentation](https://www.reddit.com/dev/api/)
5. [Epoch Converter](https://www.epochconverter.com/)
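
The epoch converter listed above is used to produce the integer timestamp that the data-pulling function in the next section takes as its starting point. The same conversion can be done with Python's standard library. The sketch below is one way to do it; the helper names `to_epoch` and `from_epoch` are illustrative and are not part of this project.

```python
from datetime import datetime, timezone


def to_epoch(dt):
    """Convert a timezone-aware datetime to an integer epoch (seconds)."""
    return int(dt.timestamp())


def from_epoch(epoch):
    """Convert an integer epoch (seconds) to a UTC datetime."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


# The starting epoch passed to get_posts later in this notebook.
start = datetime(2020, 10, 8, 19, 47, 11, tzinfo=timezone.utc)
print(to_epoch(start))                    # 1602186431
print(from_epoch(1602186431).isoformat()) # 2020-10-08T19:47:11+00:00
```

Using a timezone-aware datetime avoids the result depending on the local clock settings of the machine running the notebook.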

## Pulling Subreddit Data

In [1]:
# Import the necessary libraries.
import pandas as pd
import numpy as np
import requests
import time

In [2]:
# Build a function to get posts from a subreddit.
def get_posts(subreddit, n_iter, epoch_right_now, filepath): 
    """
    (str, int, int, str) -> None
    Pulls posts from a subreddit and writes them to a csv file in the
    specified filepath. Parameters are the subreddit name, the number of
    requests to make, the epoch timestamp to start from (posts are pulled
    backward in time from this point), and the directory for the output file.
    """

    # Create a base url variable. The subreddit is passed through params below.
    base_url = 'https://api.pushshift.io/reddit/search/submission/'
    
    # Instantiate an empty list for the reddit dataframes.    
    df_list = []

    # Save the current epoch, used to iterate in reverse through time.
    current_time = epoch_right_now
    
    # Iterate through the number specified.
    for post in range(n_iter):
        
        try:
            # Instantiate a get request.
            res = requests.get(
                # requests.get takes the base_url and
                base_url,            
                # parameters for the get request.
                params = {
                    # Specify the subreddit.
                    'subreddit': subreddit,
                    # Specify the number of posts to pull.
                    'size': 100,
                    # Specify true for language.
                    'lang': True,                
                    # Pull everything from current time backward.
                    'before': current_time})

            # Extract the data from the most recent request and store it as a dataframe.
            df = pd.DataFrame(res.json()['data'])

            # Pull specific columns from the dataframe for NLP analysis.
            df = df.loc[:, ['title',
                            'created_utc',
                            'selftext',
                            'subreddit',
                            'author',
                            'media_only',
                            'permalink']]

            # Append the dataframe to the dataframe list.
            df_list.append(df)

            # Set the current time counter back to the last epoch in the most recently 
            # grabbed dataframe. This ensures it will not keep grabbing the same posts.
            # This will be the oldest post in the dataframe.
            current_time = df['created_utc'].min() 

            # Set a time delay between requests to avoid overloading the API.
            time.sleep(10)
            
            print(post)
        
        # If an exception is raised (for example, an empty response with no
        # columns to select), print the current_time and move on.
        except Exception:
            print(current_time)
    
    # Concatenate the list of dataframes.
    final_df = pd.concat(df_list, axis=0)
    
    # Write the concatenated dataframe to a subreddit output file.
    final_df.to_csv(filepath + subreddit + '.csv', index=False)
    
    # Return.
    return
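
# The function above pages backward through time: each request asks for posts
# older than `before`, and the cursor is then moved to the oldest timestamp in
# the page just received. If a page comes back empty, the column selection
# raises a KeyError, the cursor never moves, and the same epoch is printed on
# every remaining iteration. The sketch below isolates that cursor logic and
# stops as soon as a page is empty. It makes no network calls: `fetch_page`,
# `fake_fetch`, and `FAKE_POSTS` are hypothetical stand-ins for the API request,
# not part of Pushshift or of this project.

```python
def paginate_backward(fetch_page, start_epoch, max_pages):
    """Collect posts page by page, moving the `before` cursor back in time."""
    posts = []
    before = start_epoch
    for _ in range(max_pages):
        page = fetch_page(before)
        # An empty page means there is nothing older left to pull.
        if not page:
            break
        posts.extend(page)
        # Move the cursor to the oldest post in the page just received.
        before = min(p['created_utc'] for p in page)
    return posts


# Hypothetical data: 25 posts with distinct, increasing timestamps.
FAKE_POSTS = [{'title': f'post {i}', 'created_utc': 1000 + i} for i in range(25)]


def fake_fetch(before, size=10):
    """Return up to `size` posts strictly older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    older.sort(key=lambda p: p['created_utc'], reverse=True)
    return older[:size]


collected = paginate_backward(fake_fetch, start_epoch=2000, max_pages=600)
print(len(collected))  # 25
```

Stopping on the first empty page keeps the loop from repeating the same request until `max_pages` is exhausted.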

In [3]:
# Create a list of subreddits.
subreddits = ['artificial', 'datascience']

# Iterate through the list of subreddits, and generate csv files.
for subreddit in subreddits:
    get_posts(subreddit, 600, 1602186431, '../data/')

0
1
2
...
186
187
1526950599
1526950599
...
152695