# **Explain Like I'm Not a Scientist**
### *An exploration of (not so) scientific communication*
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|Emily K. Sanders| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |Project 3: NLP|
|DSB-318| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |May 3, 2024|
---
###### *A report for the 2024 Greater Lafayette Association for Data Science Conference on Activism for a Thriving Society*

## Notebook 1 Summary

In notebook 1, I introduced the purpose of this work and summarized relevant background information.  I then gave an overview of the method and apparatuses.

In this notebook, I will demonstrate how I scraped the data from reddit, including `python` code.

## Method: Scraping procedure for posts

Below is the syntax I used to scrape the main feeds of each subreddit.  The comments on these posts had to be scraped separately, and will be addressed in the next notebook.

Note: The code below is presented in a Jupyter notebook for the purposes of readability and easy distribution.  In my behind-the-scenes work, all of this was done in `.py` text files.  I encourage anyone looking to reproduce this work to do the same, so that the entire code can be easily executed at the push of a button.

Much of the syntax on this page has been borrowed from a [notebook by Alanna Besaw](https://git.generalassemb.ly/dsb-318/breakfast-hour/tree/master/6_week/wed).  Thanks, Alanna!

### Set Up

In [2]:
# Imports
import pandas as pd
import requests
import getpass
from datetime import datetime  # only `datetime` is used; importing `time` here would be shadowed by the module below
import time
import os

### Getting Authorized

The first step towards scraping reddit was to set up [an application](https://www.reddit.com/prefs/apps) that can interface with the API.  That application then assigned me several authorization keys that, along with my login information, I could use to connect to the API and request my data.

Because my keys and login information are specific to me, it is not possible to create reproducible code for this portion of the script.  Anyone wishing to replicate my work must obtain and enter their own credentials.  I have included `getpass` cells below where this information can be entered without it being visible in the notebook.

In [32]:
# Enter authorization keys

In [33]:
client_id = getpass.getpass() # Listed as "personal use script" in your application

 ········


In [34]:
client_secret = getpass.getpass() # Listed as "secret" in your application

 ········


In [35]:
user_agent = getpass.getpass() # The name of your application

 ········


In [36]:
username = getpass.getpass() # The reddit username associated with your application

 ········


In [37]:
password = getpass.getpass() # The reddit password associated with your application

 ········
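When the same steps are run from a `.py` file, as recommended above, interactive prompts may be less convenient than in a notebook.  Below is a minimal sketch of one alternative, assuming the credentials have been stored in environment variables beforehand.  The variable names and the `load_credentials` helper are hypothetical and were not part of my original scripts.

```python
import os

# Hypothetical environment variable names; choose your own when replicating
REQUIRED_VARS = (
    'REDDIT_CLIENT_ID',
    'REDDIT_CLIENT_SECRET',
    'REDDIT_USERNAME',
    'REDDIT_PASSWORD',
)

def load_credentials(env=None):
    """Return a dict of credentials read from a mapping (os.environ by default).

    Raises KeyError naming every missing variable, so that a misconfigured
    environment fails before any request is sent."""
    env = os.environ if env is None else env
    # Collect every variable that is absent or empty
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f'missing credentials: {missing}')
    return {name: env[name] for name in REQUIRED_VARS}
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.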


Once those credentials were defined, it was necessary to request authorization to access the API, provide some more information, and receive an access token from the API.  Successful connections report a `200` status.  In the code below, I have used a boolean `True`/`False` evaluation within a print statement to provide a more readable status update.

In [7]:
# Authorize
auth = requests.auth.HTTPBasicAuth(client_id, client_secret)

# Set up authorization dictionary
data = {
  'grant_type': 'password','username': username, 'password': password}

# Create a header for scrapes - please change this value if replicating!
headers = {'User-Agent': 'EKS-DSB-318/Project-3'}

# Connect to the reddit API
res = requests.post(
    'https://www.reddit.com/api/v1/access_token',
    auth=auth, data=data, headers=headers)

# Check the API connection
print(f'The initial hook-in was successful? {res.status_code == 200}')

The initial hook-in was successful? True


This initial connection provides a token that authorizes further connections and data retrieval.  When the token has been retrieved correctly, a test request made with it also returns a `200` status code.

In [8]:
# Retrieve the access token
token = res.json()['access_token']

# Add the token to the headers
headers['Authorization'] = f'bearer {token}'

# Check that the token works
print(
  f'''The token is retrieved? {requests.get(
  'https://oauth.reddit.com/api/v1/me', headers=headers).status_code == 200}''')

The token is retrieved? True
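The cells above assume that the token request succeeded.  The sketch below shows a slightly more defensive version of the same step, on the assumption that the decoded response might not contain an `access_token` entry.  The helper name `headers_from_token_response` is hypothetical.

```python
def headers_from_token_response(payload, user_agent):
    """Build request headers from the decoded JSON of the token request.

    payload: dict, e.g. the result of res.json() in the cell above
    user_agent: string identifying the application"""
    # Fail with a readable message if no token came back
    if 'access_token' not in payload:
        raise ValueError(f'no access token in response: {payload}')
    return {
        'User-Agent': user_agent,
        'Authorization': f"bearer {payload['access_token']}",
    }
```

In the notebook, this could be called as `headers_from_token_response(res.json(), 'EKS-DSB-318/Project-3')`.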


It was then necessary for me to define some variables to make the scraping process smoother.  In the next cell, I defined the paths to the subreddits, which were the same in every scrape.  In the following cells, I entered the `after` parameter from the previous day's scraping, which tells the API where to pick up so as to minimize the number of duplicate posts.  Because reddit limits the number of posts an application can scrape per day, it was important to save these `after` values at the end of each day of scraping, and update them before resuming the next day.

In [9]:
# Define things for the requests - static
base_url = 'https://oauth.reddit.com'
subreddit1 = '/r/explainlikeimfive' 
subreddit2 = '/r/askphysics'

In [29]:
# Define things for the requests - varied day-to-day
after_eli5 = input('Enter eli5 after code: \n')
# For an example, use 't3_1cdeqbx'

Enter eli5 after code: 
 t3_1cdeqbx


In [30]:
# Define things for the requests - varied day-to-day
after_askp = input('Enter askp after code: \n')
# For an example, use 't3_1cd2wg4'

Enter askp after code: 
 t3_1cd2wg4
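Since the `after` values had to be carried over from one day to the next, one option is to write them to a small file at the end of each session rather than copying them by hand.  This is only a sketch of that idea; the function names and the JSON layout are assumptions, not what I actually used.

```python
import json
from pathlib import Path

def save_afters(path, after_eli5, after_askp):
    """Write the two `after` values to a small JSON file."""
    Path(path).write_text(json.dumps({'eli5': after_eli5, 'askp': after_askp}))

def load_afters(path):
    """Read the `after` values back; returns (after_eli5, after_askp)."""
    saved = json.loads(Path(path).read_text())
    return saved['eli5'], saved['askp']
```

At the start of the next session, `after_eli5, after_askp = load_afters(path)` would take the place of the two `input` cells.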


## Posts: Data In

I conducted my scraping with a `while` loop, as shown below.  I first constructed two empty lists to hold the results of each scrape and set a counter to `0`.  Then, with each repetition of the loop, and for each subreddit, I:
1. (re)defined the parameters of the request, either from the value entered above or from a value extracted from the previous scrape,
2. used `requests.get` to scrape the data from the API,
3. used `.json()` to convert the raw `JSON` object to a dictionary (full of other, nested dictionaries),
4. appended the dictionary to the list,
5. extracted the `after` value and assigned it to a variable (this variable appears in step 1),
6. moved the counter, and finally,
7. waited 5 seconds before the next request, so as to not overburden the API server.

The purpose of this approach was to gather posts at the maximum allowable rate until my daily limit was used up.  This was accomplishable by setting the `limit` parameter to 100 (reddit's maximum) for each subreddit per loop (i.e., 200 posts total per loop), and repeating the loop until it reached the per-day limit, which is 1000 posts per day, per subreddit.
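The arithmetic behind the number of repetitions can be sketched as follows, using the figures quoted above (1000 posts per day per subreddit, 100 posts per request, 5 seconds between requests).  The helper name `loops_needed` is only illustrative.

```python
import math

def loops_needed(daily_cap, per_request_limit):
    """Number of loop repetitions needed to reach the daily cap for one subreddit."""
    return math.ceil(daily_cap / per_request_limit)

loops = loops_needed(1000, 100)
print(loops)      # 10 repetitions
print(loops * 5)  # 50 seconds spent sleeping, at 5 seconds per repetition
```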

Unfortunately, in the spirit of transparent science, I must here admit to two errors.  First, as a novice to reddit, I misinterpreted the 1000-post limit as an overall cap, not a per-subreddit cap, and limited myself to 500 posts per subreddit per day for a while.  This is regrettable, as it means I lost out on half of my potential data from those days.  Compounding this, a typographical error in the first version of the code below assigned the `after` value from AskPhysics to both `after_askp` and `after_eli5`, instead of assigning each variable the `after` value corresponding to its own subreddit.  By the time I discovered them, these errors had together resulted in my collecting only a little over 1000 posts for AskPhysics, rather than the 2000 I could have had, and a mere 561 for eli5.

I paused collection of AskPhysics posts so as to not further imbalance the classes, and spent several days trying to make up the gap by scraping more eli5 posts, but this resulted in mostly duplicate posts.  In the interest of not storing more data than necessary, when writing my data to CSV files, I had descended to the level within the nested dictionaries that contained the content of the posts and discarded the rest, unfortunately including the `after` values that I erroneously believed to have already been used.  This would not have been an error had the code been written correctly, but I certainly would not do it the same way again, nor advise anyone attempting to replicate my work to do so.  Without the correct `after` values, it was all but impossible to find my place in the API and extract new, unique posts.

At the point when I ran out of time to try, I had only 740 unique eli5 posts.  This is a less dramatic gap (740 to 1028, vs. 561 to 1028), but still an imbalance.  Fortuitously though, as mentioned above, the comments are the more valuable source of data, and they greatly outnumbered their parent posts.  Furthermore, eli5 produced more than twice as many comments as AskPhysics, which more than closed the gap (more on that later).
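For anyone replicating this work, one way to avoid losing one's place is to store each page's `after` values together with its posts.  The sketch below is not the code I ran; `collect_pages` and its `fetch_page` argument are hypothetical, with `fetch_page` standing in for a wrapper around `requests.get(...).json()`.  It also assumes that an `after` of `None` means there is nothing further to request.

```python
def collect_pages(fetch_page, after, max_pages):
    """Page through a listing, keeping each page's `after` with its posts.

    fetch_page: callable that takes an `after` value and returns the decoded
        JSON dictionary for one page (hypothetical)
    after: the `after` value to start from (None starts at the top)
    max_pages: the most pages to request"""
    pages = []
    for _ in range(max_pages):
        page = fetch_page(after)
        next_after = page['data']['after']
        # Store the pagination values alongside the posts themselves
        pages.append({
            'requested_after': after,
            'next_after': next_after,
            'children': page['data']['children'],
        })
        # Assumption: no `after` means the listing has been exhausted
        if next_after is None:
            break
        after = next_after
    return pages
```

Because each subreddit gets its own call with its own `after`, the two values cannot be crossed in the way described above.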

Finally, below is the final version of the scraping code.  As mentioned previously, I did most of my "behind the scenes" work with unglamorous text files, where I set the parameters as detailed above.  For the purposes of demonstration, however, I have set them both to 2 below: scrape two posts, two times.  I made this change so as not to needlessly burden the server while editing the notebook.

In [31]:
# Create lists to store the scrapes
eli5_scrapes = []
askp_scrapes = []

# Set a counter
i = 0 

# Set the loop to repeat 2 times (was 5 during data collection, should have been 10)
while i<2:
    # Show signs of life (this process can take a while)
    print(i)
    # (Re-)Define the parameter dictionaries
    # On the first iteration, 'after' will be the values defined above
    params_eli5 = {'limit': 2, 'after': after_eli5} #(was 'limit': 100 during data collection)
    params_askp = {'limit': 2, 'after': after_askp} #(was 'limit': 100 during data collection)
    # Make the requests
    eli5 = requests.get(base_url+subreddit1, headers=headers, params=params_eli5)
    askp = requests.get(base_url+subreddit2, headers=headers, params=params_askp)
    # Append to lists
    askp_scrapes.append(askp.json())
    eli5_scrapes.append(eli5.json())
    # Update the afters
    after_eli5 = eli5.json()['data']['after']
    after_askp = askp.json()['data']['after']
    # Move the counter
    i += 1
    # Always tip your servers
    time.sleep(5)
print('done scraping')

0
1
done scraping


At the end of running that code each day, I had two lists full of dictionaries, themselves full of more dictionaries, with each top-level dictionary corresponding to about 100 posts.  Before ending the `python` session, I needed to export this data to external files for storage.

## Posts: Data Out

Because the `JSON`-turned-dictionary objects have such a layered structure, it was necessary to coerce them a bit to get them into a clean dataframe format.  Furthermore, because I was doing 5 scrapes per day for 2 subreddits (or at least, so I thought), I needed to be able to do this in a loop rather than one at a time.  I used a `for` loop to accomplish this, but also incorporated some counters for organizational purposes.  These counters are not inherently doing anything in the code; they were simply a strategy for me to keep track of what was going on while the loop was running, and which data came from which scrape.  I later discovered a way to encode this information during the scraping itself, but to once again be truthful, that epiphany came too late for much of the data.  For most iterations, the date and scrape number were recorded in the file name, and then added as variables later when those files were read back in and concatenated into one dataframe.

At the level of the nested dictionaries with the actual data, there were two entries: one, `['data']`, that contained the information I wanted, and another, `['kind']`, that only contained a one-word descriptor of the type of post.  I wanted to preserve both sets of information, but upon conversion to a dataframe, each dictionary became a separate column, and `kind` was the same for every row.  To resolve this, I wrote a function to double check that `kind` was the same in every row (it always was), to add and populate a row for `kind`, and then drop the `kind` column.  This function is presented below.

In [13]:
# Define a useful function
def be_kind(df):
    """Double check that there's only one 'kind' per scrape,
    then streamline the resulting dataframe.
    
    Arg: {df}, a dataframe created from a scrape from the reddit API
    Return: {df}, the same dataframe, altered
    Raise: nothing so far! It would raise all sorts of errors if 
    applied to a different kind of dataframe, though."""

    # If it's just the one value
    if len(df.loc[:,'kind'].unique())==1:
        # Create a new row in the 'data' column and populate it
        # with whatever's in the 'kind' column
        df.loc['kind','data'] = df.loc[:,'kind'].unique()[0]
        # Get rid of the ['kind'] column
        df.drop(columns = 'kind', inplace = True)       
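    # If there's more than one (this never happened)
    else:
        # Tell me; the dataframe is then returned unchanged
        # Note: `i` is not an argument, it is read from the enclosing loop's counter
        print(f'multiple kinds in {i}')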
    # Return
    return df

At the end of processing, when each set of scrapes was combined into a CSV, I wanted to save it with a name that recorded the source, date, and scrape number (i.e., which iteration the loop was on) of its creation.  I defined two functions to do this.  The first uses capabilities from the `datetime` module to produce the date and time of its execution in a custom format.  The second function 
- gathers the production information,
- orients the dataframe the correct way,
- extracts column names from the first row, which had been an index column before transposing,
- assigns column headers,
- and saves the CSV to the working directory under the appropriate name, including a timestamp.  

These functions are displayed below.

In [14]:
# Define a useful function
def my_date():
  return datetime.now().strftime('%Y-%m-%d_h%H-m%M-s%S')
print(my_date()) # test it

2024-05-02_h22-m30-s50
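Because the timestamp uses a fixed format, it can be turned back into a `datetime` object later with the same format string.  The short sketch below illustrates this with the output shown above; `parse_stamp` is a hypothetical helper, not one I used.

```python
from datetime import datetime

# Same format string as in my_date() above
STAMP_FORMAT = '%Y-%m-%d_h%H-m%M-s%S'

def parse_stamp(stamp):
    """Convert a my_date()-style string back into a datetime object."""
    return datetime.strptime(stamp, STAMP_FORMAT)

print(parse_stamp('2024-05-02_h22-m30-s50'))  # 2024-05-02 22:30:50
```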


In [33]:
# Define another useful function
def post_csvs(l, df, k2, i):
    '''Convert the dataframes of scrapes into meaningfully-named CSVs.
    Requires `my_date()` to also be defined.
  
    Arg:
        l: the storage list the df came from; a proxy for the subreddit
        df: the dataframe to be converted
        k2: my per-subreddit counter
        i: the individual post counter
    Return:
        df in the environment
        a CSV file saved to the working directory
    Raise:
        fingers crossed'''
    
    # Use l to create text names
    # (`is` checks identity, so two lists with equal contents are not confused)
    if l is askp_scrapes:
        sub = 'askp'
    elif l is eli5_scrapes:
        sub = 'eli5'
    # Bypass that weird copy thing
    # Transpose the dataframe so each scrape is a row, not a column
    df2 = df.copy(deep = True).transpose()
    # Make the first row the column names
    df2.columns = list(df2.iloc[0,:])
    # Drop the leftover 'key_0' row (a column before transposing),
    # which only exists if at least one merge took place
    if i>0:
        df2.drop(index = 'key_0', inplace = True)
    # Assign meaningful names (source, datetime, scrape iteration) and write to CSV
    df2.to_csv(f'../data/output/posts/{sub}_{my_date()}_scrape{k2}.csv')
    # Return
    return df2

These functions are put to use in the following `for` loop.  Within this loop, I:
1. selected one of the two lists of scrapes,
2. selected one of the scrapes (i.e., iterations of `i` in the above `while` loop) within that list,
3. and selected one of the posts within that scrape.  Then, for that post, I:
4. dug through the dictionaries to the level with the data,
5. assigned that dictionary to a dataframe called `x` (`y`, in posts other than the first),
6. applied `be_kind()` to `x` (`y`), and
7. printed an update for debugging and "proof of life."

I then moved the counter, and looped back up to step 3 and started the next post within the scrape.  All posts that were not the first one in their scrape were assigned to a dataframe named `y` instead of `x`, and then the two dataframes were merged together.  For the second post in a scrape (that is, the first `y`), they had to be merged just on their indices.  This merge created a column called `key_0` out of what had been the original index of `x`, and all subsequent `y`s were merged on that column.
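The appearance of `key_0` can be reproduced on a toy example.  The two dataframes below are invented stand-ins for `x` and `y` after `be_kind()` has been applied, each with a single `data` column and the field names as the index.

```python
import pandas as pd

# Two toy single-column dataframes shaped like `x` and `y`
x = pd.DataFrame({'data': ['abc', 'First post']}, index=['id', 'title'])
y = pd.DataFrame({'data': ['def', 'Second post']}, index=['id', 'title'])

# Merging on the index values (passed as an array) produces a 'key_0' column
merged = x.merge(y, how='outer', on=x.index, suffixes=[None, '_1'])
print(list(merged.columns))
```

The printed column names include `key_0`, which holds the former index values and is what later merges were keyed on.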

After repeating steps 3-7 for each post within the scrape selected at step 2 and merging them all into one dataframe `x`, I applied my function `post_csvs()` to do some final tidying up, generate the file name, and save the file to the working directory.  Then, I directed the loop back up to step 2 to move to the next scrape within the list, and process its posts the same way.  Once all of the scrapes in that list had been processed, I looped back up to step 1 to do the same for the other list of scrapes.  At the end of all of that, I used a print statement to announce that it was done, and report how many posts had been processed.  The code is below.

**Note:** Because this cell has already been run for demonstration purposes, and thus already produced files in the `data/output/posts` folder, **future cells that draw from that folder will produce different results if run again.**  To replicate the results, please fork the repository, **delete** the contents of `data/output/posts`, and then try.

In [34]:
# Set an overall counter
k=0
# For each set of scrapes
for l in [askp_scrapes, eli5_scrapes]: 
    # Set a per-subreddit counter
    k2=0
    # For each scrape
    for j in l:
        # Count
        k2+=1
        # For each individual post scraped, note how much "digging" occurs here
        for i in list(range(len(j['data']['children']))):
            # If it's the first one we're processing
            if i==0:
                # It gets its own dataframe made of its dictionary
                x = pd.DataFrame(j['data']['children'][i])
                # Handle the 'kind' column
                x = be_kind(x)               
                # Print information for debugging, signs of life
                print(f'at the end of the first one, x.shape is {x.shape}')
            # If it's not the first one
            else:
                # It becomes a dataframe with a different name
                y = pd.DataFrame(j['data']['children'][i])
                # Handle the 'kind' column
                y = be_kind(y)               
                # Print information for debugging, signs of life
                print(f'y.shape for {k} (i = {i}) is {y.shape}')
                # Merge the dataframes together
                if i==1:
                    x = x.merge(y, how = 'outer', on = x.index, suffixes = [None, f'_{i}'])
                elif i>1:
                    x = x.merge(
                        y, how = 'outer', left_on = 'key_0', right_on = y.index, suffixes = [
                            None, f'_{i}'])
            # Count
            k+=1
        # Tidy up the dataframe, write to CSV
        x = post_csvs(l, x, k2, i)
print(f'All done, {k} posts processed')

at the end of the first one, x.shape is (105, 1)
y.shape for 1 (i = 1) is (105, 1)
at the end of the first one, x.shape is (105, 1)
y.shape for 3 (i = 1) is (105, 1)
at the end of the first one, x.shape is (106, 1)
y.shape for 5 (i = 1) is (106, 1)
at the end of the first one, x.shape is (106, 1)
y.shape for 7 (i = 1) is (106, 1)
All done, 8 posts processed
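For comparison, the same reshaping can be sketched without repeated merges, by collecting one dictionary per post and building the dataframe once.  This is an alternative technique, not the method I used for the data in this report, and `flatten_listing` is a hypothetical name.  The toy listing only mimics the nesting of the real responses.

```python
import pandas as pd

def flatten_listing(listing):
    """Turn one scraped listing into a dataframe with one row per post."""
    rows = []
    for child in listing['data']['children']:
        row = dict(child['data'])    # copy the post's fields
        row['kind'] = child['kind']  # keep the one-word type descriptor
        rows.append(row)
    return pd.DataFrame(rows)

# Toy listing with the same nesting as the API responses described above
toy = {'data': {'after': 't3_def', 'children': [
    {'kind': 't3', 'data': {'id': 'abc', 'title': 'First post'}},
    {'kind': 't3', 'data': {'id': 'def', 'title': 'Second post'}},
]}}
print(flatten_listing(toy).shape)  # (2, 3)
```

Each post arrives as a row from the start, so no transposing or `key_0` handling is needed in this version.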


Below is the resulting dataframe, `x`, at the end of that loop.  Note that it only contains two rows because `x` is overwritten each time the loop cycles through a set of scrapes.  All 4 of the CSVs created in that loop can be found in the `data/output/posts` folder.

In [35]:
pd.set_option('display.max_columns', None)
x

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,banned_at_utc,banned_by,can_gild,can_mod_post,category,clicked,content_categories,contest_mode,created,created_utc,discussion_type,distinguished,domain,downs,edited,gilded,gildings,hidden,hide_score,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,likes,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media,media_embed,media_only,mod_note,mod_reason_by,mod_reason_title,mod_reports,name,no_follow,num_comments,num_crossposts,num_reports,over_18,parent_whitelist_status,permalink,pinned,pwls,quarantine,removal_reason,removed_by,removed_by_category,report_reasons,saved,score,secure_media,secure_media_embed,selftext,selftext_html,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,title,top_awarded_type,total_awards_received,treatment_tags,ups,upvote_ratio,url,user_reports,view_count,visited,whitelist_status,wls,kind
data,[],False,,,False,cyberchief,,,[],,,,text,t2_914nb,False,False,False,[],,,False,False,,False,,False,1713965901.0,1713965901.0,,,self.explainlikeimfive,0,False,0,{},False,False,1cbyfa1,False,True,False,False,False,True,True,False,,#014980,Economics,"[{'e': 'text', 't': 'Economics'}]",f8e2fe5c-19e0-11e6-981d-0e2b7c7bd64b,Economics,light,richtext,False,,{},False,,,,[],t3_1cbyfa1,False,697,0,,False,all_ads,/r/explainlikeimfive/comments/1cbyfa1/eli5_why...,False,6,False,,,,,False,2928,,{},,,True,False,False,explainlikeimfive,t5_2sokd,r/explainlikeimfive,22830639,public,confidence,,ELI5: Why are business expenses deductible fro...,,0,[],2928,0.89,https://www.reddit.com/r/explainlikeimfive/com...,[],,False,all_ads,6,t3
data_1,[],False,,,False,Cold-Economist2858,,,[],,,,text,t2_vjx5tzgoa,False,False,False,[],,,False,False,,False,,False,1714152786.0,1714152786.0,,,self.explainlikeimfive,0,False,0,{},False,False,1cdrdus,False,True,False,False,False,True,True,False,,#0079d3,Physics,"[{'e': 'text', 't': 'Physics'}]",ef508d1e-19e0-11e6-8b82-0eea2757e675,Physics,light,richtext,False,,{},False,,,,[],t3_1cdrdus,True,9,0,,False,all_ads,/r/explainlikeimfive/comments/1cdrdus/eli5_wha...,False,6,False,,,,,False,0,,{},"When a tank of gas (nitrogen, co2, argon, etc....","&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,False,explainlikeimfive,t5_2sokd,r/explainlikeimfive,22830639,public,confidence,,eli5: What's left in spent gas tanks?,,0,[],0,0.38,https://www.reddit.com/r/explainlikeimfive/com...,[],,False,all_ads,6,t3


## Posts: Data Back In

Because each scrape was saved as an individual CSV, it was necessary to concatenate them together for analysis.  I concatenated all the posts together before scraping comments, so that there would be only one source of post IDs to iterate through during comment scraping.

To do this, I first read in a list of the names of the CSVs and created a placeholder list to receive them all.  I then used a `for` loop to read in each CSV as a dataframe and create columns within it for the date it was scraped, its scrape number, and its source, which were recorded in the file name in a previous step.  I then appended its contents to the list.  Finally, I concatenated the list into a dataframe.  The code for this process is below.

In [3]:
# Get a list of the CSVs
previous_scrapes = os.listdir('../data/output/posts/')
print(f"{len(previous_scrapes)} CSVs to do")

4 CSVs to do


In [44]:
# Create a placeholder list
scrapes = []

In [45]:
# Loop through the CSVs
for file in previous_scrapes:
  path = "../data/output/posts/" + file  # same folder that was listed above
  df = pd.read_csv(path)
  file_name = file.split('.')[0] #drop the .csv
  file_name = file_name.split('_') #break into chunks
  df['source'] = file_name[0]
  df['date_scraped'] = file_name[1]
  df['scrape_num'] = file_name[-1]
  scrapes.append(df)
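The file-name handling in that loop can also be written as a small helper, sketched below.  `parse_scrape_filename` is a hypothetical name, and the example file name is only illustrative of the pattern produced by `post_csvs()`.

```python
from pathlib import Path

def parse_scrape_filename(file_name):
    """Split a name like 'askp_2024-05-02_h22-m30-s50_scrape1.csv' into parts."""
    # The stem drops the '.csv'; the pattern has exactly four '_'-separated chunks
    source, date_scraped, time_scraped, scrape_num = Path(file_name).stem.split('_')
    return {
        'source': source,
        'date_scraped': date_scraped,
        'time_scraped': time_scraped,
        'scrape_num': scrape_num,
    }
```

A file that does not follow the pattern raises a `ValueError` when unpacked, which flags stray files in the folder instead of silently mislabelling them.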

In [46]:
# Concatenate into one dataframe
scrapes = pd.concat(scrapes, sort=False, axis=0, join='outer', ignore_index=True)
print(f'scrapes is now a dataframe of dimensions {scrapes.shape}')
scrapes.head(1)

scrapes is now a dataframe of dimensions (8, 110)


Unnamed: 0.1,Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,banned_at_utc,banned_by,can_gild,can_mod_post,category,clicked,content_categories,contest_mode,created,created_utc,discussion_type,distinguished,domain,downs,edited,gilded,gildings,hidden,hide_score,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,likes,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_text,link_flair_text_color,link_flair_type,locked,media,media_embed,media_only,mod_note,mod_reason_by,mod_reason_title,mod_reports,name,no_follow,num_comments,num_crossposts,num_reports,over_18,parent_whitelist_status,permalink,pinned,pwls,quarantine,removal_reason,removed_by,removed_by_category,report_reasons,saved,score,secure_media,secure_media_embed,selftext,selftext_html,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,title,top_awarded_type,total_awards_received,treatment_tags,ups,upvote_ratio,url,user_reports,view_count,visited,whitelist_status,wls,kind,source,date_scraped,scrape_num,link_flair_template_id
0,data,[],False,,,False,CDNZimmy,,,[],,,,text,t2_5qwkjeko,False,False,False,[],,,False,False,,False,,False,1714098000.0,1714098000.0,,,self.AskPhysics,0,False,0,{},False,False,1cd9r3s,False,True,False,False,False,True,True,False,,,,[],,dark,text,False,,{},False,,,,[],t3_1cd9r3s,False,5,0,,False,all_ads,/r/AskPhysics/comments/1cd9r3s/keplers_constant/,False,6,False,,,,,False,2,,{},I’m looking for an explanation of a weird prob...,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,False,AskPhysics,t5_2sumo,r/AskPhysics,607255,public,,,Keplers constant,,0,[],2,0.67,https://www.reddit.com/r/AskPhysics/comments/1...,[],,False,all_ads,6,t3,askp,2024-05-02,scrape1,
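Given the duplicate posts mentioned earlier, a quick count of unique posts per subreddit is a natural check at this stage.  The sketch below is an assumption about how such a check could look, not a cell from my original work.  It treats the `name` column (e.g., `t3_1cd9r3s`) as the post identifier and uses invented toy values.

```python
import pandas as pd

def unique_posts_per_source(df):
    """Drop repeated posts (same `name`) and count what remains per source."""
    deduped = df.drop_duplicates(subset='name')
    counts = deduped.groupby('source')['name'].count()
    # Convert to plain Python ints for a clean dictionary
    return {source: int(n) for source, n in counts.items()}

# Toy data: one eli5 post appears twice
toy = pd.DataFrame({
    'name': ['t3_a', 't3_a', 't3_b', 't3_c'],
    'source': ['eli5', 'eli5', 'eli5', 'askp'],
})
print(unique_posts_per_source(toy))  # {'askp': 1, 'eli5': 2}
```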


Once all of the posts were concatenated into one dataframe, I conducted some preliminary exploratory data analysis, which I will briefly summarize in the exploratory data analysis section.  The final step of processing this data was to save it back out as a combined CSV.  That CSV can be found in the `data/input` folder, but because of its size, it could not easily be pushed to GitHub for distribution.  Therefore, this CSV, as well as the concatenated CSVs for each subreddit's comments, has been compressed into a zip file called `concatted-scrapes-separate-csvs-by-source.zip`.  Within that file, this CSV is called `combined-as-of_2024-04-30_h17-m21-s57.csv`.

For the purposes of demonstration, I will save the sample scrapes generated above into a CSV in the `data/output/concatted-wholes` folder.

In [49]:
# Save the giant df as csv
scrapes.to_csv(f'../data/output/concatted-wholes/combined-as-of_{my_date()}.csv', index = False)

In this notebook, I have walked through my process for scraping, saving, and processing posts from each of my chosen subreddits.  In the next notebook, I will explain this process for comments, and how I then combined posts and comments into one dataframe.