# **Explain Like I'm Not a Scientist**
### *An exploration of (not so) scientific communication*
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|Emily K. Sanders| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |Project 3: NLP|
|DSB-318| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |May 3, 2024|
---
###### *A report for the 2024 Greater Lafayette Association for Data Science Conference on Activism for a Thriving Society*

## Prior Notebooks Summary

In the previous two notebooks, I introduced the purpose of this work and summarized relevant background information, then gave an overview of the Method section up to the end of scraping the posts.

In this notebook, I will demonstrate how I scraped the comments of these posts, including `python` code.

## Method: Scraping procedure for comments

Below is the syntax I used to scrape the comments for each post.  Where it is the same as the syntax for scraping the posts, it is presented without further elaboration.  Where it is different, I have explained why.

The note and attribution from the previous notebook apply here too.

### Set Up

In [122]:
# Imports
import pandas as pd
import requests
import getpass
from datetime import date, datetime  # datetime.time is left out so it cannot clash with the time module below
import time
import os

### Getting Authorized

In [123]:
# Enter authorization keys

In [124]:
client_id = getpass.getpass() # Listed as "personal use script" in your application

 ········


In [125]:
client_secret = getpass.getpass() # Listed as "secret" in your application

 ········


In [126]:
user_agent = getpass.getpass() # The name of your application

 ········


In [127]:
username = getpass.getpass() # The reddit username associated with your application

 ········


In [128]:
password = getpass.getpass() # The reddit password associated with your application

 ········


In [129]:
# Authorize
auth = requests.auth.HTTPBasicAuth(client_id, client_secret)

# Set up authorization dictionary
data = {
  'grant_type': 'password','username': username, 'password': password}

# Create a header for scrapes - please change this value if replicating!
headers = {'User-Agent': 'EKS-DSB-318/Project-3'}

# Connect to the reddit API
res = requests.post(
    'https://www.reddit.com/api/v1/access_token',
    auth=auth, data=data, headers=headers)

# Check the API connection
print(f'The initial hook-in was successful? {res.status_code == 200}')

The initial hook-in was successful? True


In [130]:
# Retrieve the access token
token = res.json()['access_token']

# Add the token to the headers
headers['Authorization'] = f'bearer {token}'

# Check that the token works
print(
  f'''The token is retrieved? {requests.get(
  'https://oauth.reddit.com/api/v1/me', headers=headers).status_code == 200}''')

The token is retrieved? True


In [131]:
# Define things for the requests - static
base_url = 'https://oauth.reddit.com'
subreddit1 = '/r/explainlikeimfive' 
subreddit2 = '/r/askphysics'

## Comments: Data In

This is where the paths diverge.  Whereas posts can be scraped from the subreddit in large batches, comments must be scraped from each individual post on which they were left.  This meant that rather than doing a `while` loop with a few iterations, I had to iterate through each individual post and place a separate `get` request for its comments.  Despite this difference, the syntax of the loop is very similar to the previous one, and is therefore presented with fewer annotations.  For the purpose of demonstration, I have imported the CSV created in the last notebook and processed it in this way.  In the "behind-the-scenes" work, this was the combined dataframe of all of the posts.

In [153]:
# Import the dataframe of posts
scrapes = pd.read_csv('../data/output/concatted-wholes/combined-as-of_2024-05-02_h23-m28-s15.csv')
scrapes.shape

(8, 110)

In [154]:
# Create placeholder lists
askp_comments_scrapes = []
eli5_comments_scrapes = []

In [155]:
# Iteratively request each post's comments
for i in scrapes.index:
  link = scrapes.loc[i, 'permalink']  # identify the correct link
  sub = scrapes.loc[i, 'source']  # identify which subreddit it came from
  comments = requests.get(base_url+link, headers=headers)
  if sub=='eli5':
    eli5_comments_scrapes.append(comments.json())
  elif sub=='askp':
    askp_comments_scrapes.append(comments.json())
  print(f"post {scrapes.loc[i, 'id']} complete")
  time.sleep(5)

post 1cd9r3s complete
post 1cddwhq complete
post 1cdlxli complete
post 1cd1hxh complete
post 1cdrrks complete
post 1cdnk4p complete
post 1cbyfa1 complete
post 1cdrdus complete
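
The loop above assumes that every request succeeds and that every post comes from one of the two expected sources. A minimal sketch of a more defensive version follows. The names `collect_comments`, `fetch`, and `fake_fetch` are hypothetical and were not part of the original notebook; passing the request function in as an argument simply lets the routing logic be tried without touching the network.

```python
import time


def collect_comments(posts, fetch, pause=0.0):
    """Route each post's comment payload into a per-subreddit list.

    posts: iterable of dicts with 'permalink', 'source', and 'id' keys
    fetch: callable taking a permalink, returning (status_code, payload)
    pause: seconds to wait between requests
    """
    buckets = {'eli5': [], 'askp': []}
    failed = []
    for post in posts:
        status, payload = fetch(post['permalink'])
        if status != 200 or post['source'] not in buckets:
            # Keep a record rather than silently skipping the post
            failed.append(post['id'])
        else:
            buckets[post['source']].append(payload)
        time.sleep(pause)
    return buckets, failed


# Stand-in for a real request, so the sketch runs offline
def fake_fetch(permalink):
    if 'broken' in permalink:
        return 503, None
    return 200, [{'kind': 'Listing'}, {'kind': 'Listing'}]


demo_posts = [
    {'permalink': '/r/explainlikeimfive/comments/a/', 'source': 'eli5', 'id': 'a'},
    {'permalink': '/r/askphysics/comments/b/', 'source': 'askp', 'id': 'b'},
    {'permalink': '/r/askphysics/comments/broken/', 'source': 'askp', 'id': 'c'},
]
buckets, failed = collect_comments(demo_posts, fake_fetch)
print(len(buckets['eli5']), len(buckets['askp']), failed)
```

In a live session, `fetch` could wrap `requests.get(base_url + link, headers=headers)` and return the response's `status_code` together with its `.json()` result.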


As with the posts, running that code left me with two lists full of dictionaries, themselves full of more dictionaries, with each entry corresponding to the comments of a single post.  Before ending the `python` session, I again needed to export this data to external files for storage.  The process was very similar to the one used for the posts, but it required a few modifications to the `post_csvs()` function (adding a "comment" indicator) and allowed a slight simplification of the `for` loop, because each scrape contained only one dictionary of information rather than 100.
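
To make the nesting concrete, here is a minimal sketch of how one scrape can be unpacked. It assumes, as the later processing loop does, that the comments sit under `[1]['data']['children']`; the function name `top_level_comments` and the miniature payload are invented for illustration and are not real API output.

```python
def top_level_comments(payload):
    """Return the 'data' dict of each top-level comment in one scrape.

    payload: the list returned for a single post; element 1 is assumed
    to hold that post's comments.
    """
    children = payload[1]['data']['children']
    # 't1' is the kind attached to the comments shown in this notebook;
    # any other kind (an assumption: e.g. a "load more" stub) is skipped
    return [child['data'] for child in children if child.get('kind') == 't1']


# Hand-made miniature of a scrape
demo_payload = [
    {'kind': 'Listing', 'data': {'children': [
        {'kind': 't3', 'data': {'id': 'post1'}},
    ]}},
    {'kind': 'Listing', 'data': {'children': [
        {'kind': 't1', 'data': {'id': 'c1', 'body': 'First answer'}},
        {'kind': 't1', 'data': {'id': 'c2', 'body': 'Second answer'}},
        {'kind': 'more', 'data': {'count': 3}},
    ]}},
]
for comment in top_level_comments(demo_payload):
    print(comment['id'], comment['body'])
```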

## Comments: Data Out

Like the posts, the comments were saved as `JSON`-turned-dictionary objects and needed some coercing into dataframes that could be exported to CSVs for storage.

In [135]:
# Define a useful function
def be_kind(df):
    """Double check that there's only one 'kind' per scrape,
    then streamline the resulting dataframe.
    
    Arg: {df}, a dataframe created from a scrape from the reddit API
    Return: {df}, the same dataframe, altered
    Raise: nothing so far! It would raise all sorts of errors if 
    applied to a different kind of dataframe, though."""

    # If it's just the one value
    if len(df.loc[:,'kind'].unique())==1:
        # Create a new row in the 'data' column and populate it
        # with whatever's in the 'kind' column
        df.loc['kind','data'] = df.loc[:,'kind'].unique()[0]
        # Get rid of the ['kind'] column
        df.drop(columns = 'kind', inplace = True)       
    # If there's more than one (this never happened)
    else:
        # Tell me, then stop
        print(f"multiple kinds found: {list(df.loc[:, 'kind'].unique())}")
    # Return
    return df

In [136]:
# Define a useful function
def my_date():
  return datetime.now().strftime('%Y-%m-%d_h%H-m%M-s%S')
print(my_date()) # test it

2024-05-03_h05-m39-s21


In [137]:
# Define a useful function
def post_csvs(l, df, k2, i):
    '''Convert the dataframes of scrapes into meaningfully-named CSVs.
    Requires `my_date()` to also be defined.
  
    Arg:
        l: the storage list the df came from; a proxy for the subreddit
        df: the dataframe to be converted
        k2: my per-subreddit counter
        i: the individual post counter
    Return:
        df in the environment
        a CSV file saved to the working directory
    Raise:
        fingers crossed'''
    
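    # Use l to create text names (identity check, so two lists that
    # happen to have equal contents are not confused with each other)
    if l is askp_comments_scrapes:
        sub = 'askp-comments'
    elif l is eli5_comments_scrapes:
        sub = 'eli5-comments'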
    # Bypass that weird copy thing
    # Transpose the dataframe so each scrape is a row, not a column
    df2 = df.copy(deep = True).transpose()
    # Make the first row the column names
    df2.columns = list(df2.iloc[0,:])
    # Drop an unnecessary column that appears after multiple merges
    if i>0:
        df2.drop(index = 'key_0', inplace = True)
    # Assign meaningful names (source, datetime, scrape iteration) and write to CSV
    df2.to_csv(f'../data/output/comments/{sub}_{my_date()}_scrape{k2}.csv')
    # Return
    return df2

In [157]:
for l in [askp_comments_scrapes, eli5_comments_scrapes]:
  print(len(l))

4
4


**Note:** Because this cell has already been run for demonstration purposes, and thus already produced files in the `data/output/comments` folder, **future cells that draw from that folder will produce different results if run again.**  To replicate the results, please fork the repository, **delete** the contents of `data/output/comments`, and then try.
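
Clearing the folder by hand works; for anyone who prefers to script it, a small helper is one option. This is only a sketch: `clear_csvs` is a hypothetical name, and the demonstration runs against a throwaway temporary directory rather than the real `data/output/comments` folder.

```python
from pathlib import Path
import tempfile


def clear_csvs(folder):
    """Delete every .csv file directly inside `folder`; return how many."""
    removed = 0
    # Materialize the matches first so files are not deleted mid-iteration
    for path in list(Path(folder).glob('*.csv')):
        path.unlink()
        removed += 1
    return removed


# Demonstrate on a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ['askp-comments_a.csv', 'eli5-comments_b.csv', 'notes.txt']:
        (Path(tmp) / name).write_text('placeholder')
    n_removed = clear_csvs(tmp)
    left_over = sorted(p.name for p in Path(tmp).iterdir())
print(n_removed, left_over)
```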

In [158]:
# Set an overall counter
k=0
# For each set of scrapes
for l in [askp_comments_scrapes, eli5_comments_scrapes]: 
    # Set a per-subreddit counter
    k2=0
    # For each scrape
    for p in l:
        # Count
        k+=1
        j = p[1]['data']['children'] 
        # For each individual post scraped, note how much "digging" occurs here
        for i in list(range(len(j))):
            # If it's the first one we're processing
            if i==0:
                # It gets its own dataframe made of its dictionary
                x = pd.DataFrame(j[i])
                # Handle the 'kind' column
                x = be_kind(x)               
            # If it's not the first one
            else:
                # It becomes a dataframe with a different name
                y = pd.DataFrame(j[i])
                # Handle the 'kind' column
                y = be_kind(y)               
                # Merge the dataframes together
                if i==1:
                    x = x.merge(y, how = 'outer', left_on = x.index, right_on = y.index, suffixes = [None, f'_{i}'])
                elif i>1:
                    x = x.merge(y, how = 'outer', left_on = 'key_0', right_on = y.index, suffixes = [None, f'_{i}'])
            # Count
            k2+=1
        # Tidy up the dataframe, write to CSV
        x = post_csvs(l, x, k2, i)
print(f'All done, {k} posts processed')

All done, 8 posts processed


Below is the resulting dataframe, `x`, at the end of that loop.  As with the posts, `x` is still overwritten each time the loop cycles through a set of scrapes, but its length now varies from scrape to scrape.  This `x` corresponds to only one post, but has 4 rows because that post received 4 top-level comments.  As can be seen in the `['replies']` column, two of those comments received further sub-comments, and two of them did not.  Early on in the project I extracted many of the sub-comments, and I would be happy to make them available to other scholars.  However, due to time and computing constraints, they were not included in the current work.
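
For readers curious about the sub-comments, the sketch below counts the direct replies nested under a comment. It assumes the `replies` field is a nested listing when sub-comments exist and an empty string or `None` when they do not; `count_direct_replies` and the sample dictionaries are hypothetical.

```python
def count_direct_replies(comment):
    """Count the immediate sub-comments nested under one comment dict."""
    replies = comment.get('replies')
    # Assumption: a comment with no replies carries '' or None here
    if not isinstance(replies, dict):
        return 0
    children = replies['data']['children']
    return sum(1 for child in children if child.get('kind') == 't1')


demo_comments = [
    {'id': 'c1', 'replies': {'kind': 'Listing', 'data': {'children': [
        {'kind': 't1', 'data': {'id': 'r1'}},
        {'kind': 't1', 'data': {'id': 'r2'}},
    ]}}},
    {'id': 'c2', 'replies': ''},
]
reply_counts = {c['id']: count_direct_replies(c) for c in demo_comments}
print(reply_counts)
```

This works on the in-memory dictionaries only. After a round trip through a CSV, the column holds text rather than dictionaries.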

In [140]:
pd.set_option('display.max_columns', None)
x

Unnamed: 0,all_awardings,approved_at_utc,approved_by,archived,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,banned_at_utc,banned_by,body,body_html,can_gild,can_mod_post,collapsed,collapsed_because_crowd_control,collapsed_reason,collapsed_reason_code,comment_type,controversiality,created,created_utc,depth,distinguished,downs,edited,gilded,gildings,id,is_submitter,likes,link_id,locked,mod_note,mod_reason_by,mod_reason_title,mod_reports,name,no_follow,num_reports,parent_id,permalink,removal_reason,replies,report_reasons,saved,score,score_hidden,send_replies,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_type,top_awarded_type,total_awards_received,treatment_tags,unrepliable_reason,ups,user_reports,kind
data,[],,,False,,ezekielraiden,,,[],,,,text,t2_16nhu9,False,False,False,[],,,"It depends what you mean by ""completely voided...","&lt;div class=""md""&gt;&lt;p&gt;It depends what...",False,False,False,,,,,0,1714153061.0,1714153061.0,0,,0,False,0,{},l1drwig,False,,t3_1cdrdus,False,,,,[],t1_l1drwig,False,,t3_1cdrdus,/r/explainlikeimfive/comments/1cdrdus/eli5_wha...,,"{'kind': 'Listing', 'data': {'after': None, 'd...",,False,15,False,True,False,explainlikeimfive,t5_2sokd,r/explainlikeimfive,public,,0,[],,15,[],t1
data_1,[],,,False,,TheJeeronian,,,[],,,,text,t2_3spgt7bd,False,False,False,[],,,Under normal use a tank will not be completely...,"&lt;div class=""md""&gt;&lt;p&gt;Under normal us...",False,False,False,,,,,0,1714152968.0,1714152968.0,0,,0,False,0,{},l1drmne,False,,t3_1cdrdus,False,,,,[],t1_l1drmne,False,,t3_1cdrdus,/r/explainlikeimfive/comments/1cdrdus/eli5_wha...,,"{'kind': 'Listing', 'data': {'after': None, 'd...",,False,6,False,True,False,explainlikeimfive,t5_2sokd,r/explainlikeimfive,public,,0,[],,6,[],t1
data_2,[],,,False,,DBDude,,,[],,,,text,t2_9eu97,False,False,False,[],,,It's not empty. It's just gas at the same pres...,"&lt;div class=""md""&gt;&lt;p&gt;It&amp;#39;s no...",False,False,False,,,,,0,1714154977.0,1714154977.0,0,,0,False,0,{},l1dxlee,False,,t3_1cdrdus,False,,,,[],t1_l1dxlee,True,,t3_1cdrdus,/r/explainlikeimfive/comments/1cdrdus/eli5_wha...,,,,False,2,False,True,False,explainlikeimfive,t5_2sokd,r/explainlikeimfive,public,,0,[],,2,[],t1
data_3,[],,,False,,kanakamaoli,,,[],,,,text,t2_egn6g,False,False,False,[],,,Empty gas tanks may still have a small amount ...,"&lt;div class=""md""&gt;&lt;p&gt;Empty gas tanks...",False,False,False,,,,,0,1714181528.0,1714181528.0,0,,0,False,0,{},l1fwaud,False,,t3_1cdrdus,False,,,,[],t1_l1fwaud,True,,t3_1cdrdus,/r/explainlikeimfive/comments/1cdrdus/eli5_wha...,,,,False,1,False,True,False,explainlikeimfive,t5_2sokd,r/explainlikeimfive,public,,0,[],,1,[],t1


## Comments: Data Back In

The individual comments CSVs had to be read in and concatenated just as the posts had.  Unfortunately, for reasons I was never able to pinpoint, some comments were exported without column headers, and a few became badly distorted, which caused problems upon re-importation.  Some of these also had two extra columns, but they were always unpopulated.  To work around this, I devised some tests so that the CSVs that could be processed smoothly were, and the others were quarantined.

In [159]:
# Define the values that should always be in certain locations
t61 = ['AskPhysics', 'explainlikeimfive']
t62 = ['t5_2sumo', 't5_2sokd']
t63 = ['r/AskPhysics', 'r/explainlikeimfive']

# Define the tests
tests = ( 
  (df.iloc[1,0]=='data') & (df.iloc[1,61] in t61) & (
    df.iloc[1,62] in t62) & (df.iloc[1,63] in t63))
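
Because the expression above is evaluated once, at the moment the cell runs, `tests` holds a single `True` or `False` computed from whichever `df` exists at that point. One possible way to re-evaluate the same checks for each file is to wrap them in a function. The sketch below does that; `passes_position_tests` is a hypothetical name, and the size guard is an addition not present in the original checks.

```python
import pandas as pd

t61 = ['AskPhysics', 'explainlikeimfive']
t62 = ['t5_2sumo', 't5_2sokd']
t63 = ['r/AskPhysics', 'r/explainlikeimfive']


def passes_position_tests(df):
    """Apply the four positional checks to one dataframe."""
    # Guard against frames too small to index at [1, 63]
    if df.shape[0] < 2 or df.shape[1] < 64:
        return False
    return bool(
        (df.iloc[1, 0] == 'data')
        and (df.iloc[1, 61] in t61)
        and (df.iloc[1, 62] in t62)
        and (df.iloc[1, 63] in t63)
    )


# Tiny synthetic frame: 2 rows, 64 columns, filled only where the checks look
demo = pd.DataFrame([[''] * 64 for _ in range(2)])
demo.iloc[1, 0] = 'data'
demo.iloc[1, 61] = 'AskPhysics'
demo.iloc[1, 62] = 't5_2sumo'
demo.iloc[1, 63] = 'r/AskPhysics'
print(passes_position_tests(demo))
print(passes_position_tests(demo.iloc[:, :10]))
```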

A few more tests, related to the number of columns in the CSVs, were written directly into the loop.  Unfortunately, this required explicitly defining a list of the columns that were supposed to appear.

In [160]:
com_col_names = [
  'Unnamed: 0', 'all_awardings', 'approved_at_utc', 'approved_by', 'archived', 'associated_award', 
  'author', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_richtext', 
  'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'author_flair_type',
  'author_fullname', 'author_is_blocked', 'author_patreon_flair', 'author_premium', 'awarders', 
  'banned_at_utc', 'banned_by', 'body','body_html', 'can_gild', 'can_mod_post', 'collapsed',
  'collapsed_because_crowd_control', 'collapsed_reason', 'collapsed_reason_code', 'comment_type', 
  'controversiality', 'created', 'created_utc', 'depth', 'distinguished', 'downs', 'edited', 
  'gilded', 'gildings', 'id', 'is_submitter', 'likes', 'link_id', 'locked', 'mod_note', 
  'mod_reason_by', 'mod_reason_title', 'mod_reports', 'name', 'no_follow', 'num_reports', 
  'parent_id', 'permalink', 'removal_reason', 'replies', 'report_reasons', 'saved', 'score', 
  'score_hidden', 'send_replies', 'stickied', 'subreddit', 'subreddit_id', 
  'subreddit_name_prefixed', 'subreddit_type', 'top_awarded_type', 'total_awards_received', 
  'treatment_tags', 'unrepliable_reason', 'ups', 'user_reports', 'kind']

In [161]:
# Get a list of the CSVs
previous_scrapes = os.listdir('../data/output/comments/')
print(f"{len(previous_scrapes)} CSVs to do")

8 CSVs to do


Below is the modified loop for reading in comments.

In [162]:
# Create placeholder lists
scrapes = [] # good CSVs
trouble = [] # bad CSVs

In [163]:
# Loop through the CSVs
for file in previous_scrapes:
  path = "../data/output/comments/" + file
  df = pd.read_csv(path)
  # Run tests
  n = [col for col in df.columns if col not in com_col_names]
  m = [col for col in com_col_names if col not in df.columns]
  # For the problem CSVs
  if ((len(n)!=len(m)) | (len(n)>1) | (len(m)>1)) & (len(df)!=2) & (tests==False):
    # Create a list of numbers, convert to strings
    x = list(range(len(df.columns)))
    y = [str(b) for b in x]
    # Read the CSV in again (note: header=0 still treats the first row as
    # the header; those names are overwritten with numbers just below)
    df = pd.read_csv(path, header=0)
    # Apply the numbers as column names
    df.columns = y
    # Put it in jail
    trouble.append(df)
  else:  # Otherwise do the normal stuff
    file_name = file.split('.')[0] #drop the .csv
    file_name = file_name.split('_') #break into chunks
    df['source'] = file_name[0]
    df['date_scraped'] = file_name[1]
    df['scrape_num'] = file_name[-1]
    scrapes.append(df)
print(f'{len(scrapes)} good CSVs')
print(f'{len(trouble)} bad CSVs')

6 good CSVs
2 bad CSVs
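
Because `my_date()` itself contains an underscore, splitting a file name on `_` yields four pieces, so index `[1]` captures the date without the time. If the full timestamp is wanted, a sketch like the following could recover it. The function name `parse_scrape_filename` and the example file name are hypothetical, though the name follows the pattern written by `post_csvs()`.

```python
from datetime import datetime


def parse_scrape_filename(file):
    """Split '<source>_<date>_<time>_scrape<N>.csv' into its parts."""
    stem = file.rsplit('.', 1)[0]  # drop the extension
    source, date_part, time_part, scrape = stem.split('_')
    stamp = datetime.strptime(
        f'{date_part}_{time_part}', '%Y-%m-%d_h%H-m%M-s%S')
    return {'source': source, 'scraped_at': stamp, 'scrape_num': scrape}


parsed = parse_scrape_filename(
    'askp-comments_2024-05-03_h05-m39-s21_scrape2.csv')
print(parsed['source'], parsed['scraped_at'].isoformat(), parsed['scrape_num'])
```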


Because the deformed CSVs always followed the same pattern, I was able to devise a standard process for dealing with them.  They had the correct columns in the correct order, and just needed the correct headers.  Some had extra columns, which needed a name for smoother processing, but were always blank and could be safely deleted.
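
The repair described above can be sketched as a small function: name the known columns, give any extras a placeholder name, and drop the extras only if they are entirely empty. `restore_headers` and the `extra_` prefix are hypothetical, and the two-column example stands in for the full header list.

```python
import pandas as pd


def restore_headers(df, expected):
    """Name the first len(expected) columns; drop extras that are all empty."""
    out = df.copy()
    n_extra = out.shape[1] - len(expected)
    if n_extra < 0:
        raise ValueError('fewer columns than expected headers')
    extras = [f'extra_{k}' for k in range(n_extra)]  # placeholder names
    out.columns = list(expected) + extras
    # Only discard an extra column when every value in it is missing
    empty_extras = [c for c in extras if out[c].isna().all()]
    return out.drop(columns=empty_extras)


demo = pd.DataFrame([['data', 'hello', None], ['data_1', 'world', None]])
fixed = restore_headers(demo, ['Unnamed: 0', 'body'])
print(list(fixed.columns), fixed.shape)
```

Checking that an extra column is empty before dropping it guards against discarding real data if a file ever breaks the expected pattern.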

Unfortunately, because these dataframes had to take a different path without any column headers, they could not be tagged with the information from their filenames.  Because of the limited number of posts, I was able to scrape all of the comments in the same day, and could therefore assign them the correct date manually.  (I will assign the same date here, even though these demonstration comments were actually scraped whenever the notebook was last run.)  The `['source']` column could also be inferred, this time from the `['subreddit']` column that came prepackaged with the scrapes.  The scrape number, however, could not be retraced, and thus I filled it with "unknown."
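
As an illustration of that inference, the sketch below fills the three bookkeeping columns on a toy dataframe. The mapping name `SOURCE_BY_SUBREDDIT` is hypothetical, the labels mirror the file-name prefixes of the well-formed CSVs, and the date is a placeholder rather than a value confirmed for any particular file.

```python
import pandas as pd

# Labels match the file-name prefixes used for the well-formed CSVs
SOURCE_BY_SUBREDDIT = {
    'AskPhysics': 'askp-comments',
    'explainlikeimfive': 'eli5-comments',
}

demo = pd.DataFrame(
    {'subreddit': ['AskPhysics', 'explainlikeimfive', 'AskPhysics']})
demo['source'] = demo['subreddit'].map(SOURCE_BY_SUBREDDIT)
demo['date_scraped'] = '2024-05-03'  # placeholder date, assigned by hand
demo['scrape_num'] = 'unknown'
print(demo['source'].tolist())
```

A subreddit missing from the mapping would come out as a missing value, which makes unexpected rows easy to spot.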

In [168]:
# Add extra column names
com_col_names.append('source')
com_col_names.append('date_scraped')

In [None]:
# Concatenate the bad dataframes
trouble = pd.concat(trouble, sort=False, axis=0, join='outer', ignore_index=False)

In [166]:
# Add the column names
trouble.columns = com_col_names

# Add the 'scrape_num' column
trouble['scrape_num'] = 'unknown'
print(trouble.shape)

(35, 75)


In [171]:
# Apply some rules to weed out malformed rows
trouble = trouble[trouble['all_awardings']=='[]']
trouble = trouble[trouble['body'].notna()]
print(trouble.shape)
trouble.head(1)

(34, 75)


Unnamed: 0.1,Unnamed: 0,all_awardings,approved_at_utc,approved_by,archived,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,banned_at_utc,banned_by,body,body_html,can_gild,can_mod_post,collapsed,collapsed_because_crowd_control,collapsed_reason,collapsed_reason_code,comment_type,controversiality,created,created_utc,depth,distinguished,downs,edited,gilded,gildings,id,is_submitter,likes,link_id,locked,mod_note,mod_reason_by,mod_reason_title,mod_reports,name,no_follow,num_reports,parent_id,permalink,removal_reason,replies,report_reasons,saved,score,score_hidden,send_replies,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_type,top_awarded_type,total_awards_received,treatment_tags,unrepliable_reason,ups,user_reports,kind,source,date_scraped,scrape_num
0,data,[],,,False,,ross_ns7f,,,[],,,,text,t2_s7p2fdqk,False,False,False,[],,,"If you don't understand astrophysics and QM, h...","&lt;div class=""md""&gt;&lt;p&gt;If you don&amp;...",False,False,False,,,,,0.0,1714153000.0,1714153000.0,0,,0.0,False,0.0,{},l1drmsx,False,,t3_1cdlxli,False,,,,[],t1_l1drmsx,False,,t3_1cdlxli,/r/AskPhysics/comments/1cdlxli/looking_for_par...,,"{'kind': 'Listing', 'data': {'after': None, 'd...",,False,4.0,False,True,False,AskPhysics,t5_2sumo,r/AskPhysics,public,,0.0,[],,4.0,[],t1,,,unknown


In [172]:
# Concatenate the good CSVs
scrapes = pd.concat(scrapes, sort=False, axis=0, join='outer', ignore_index=True)
print(scrapes.shape)
scrapes.head(1)

(17, 75)


Unnamed: 0.1,Unnamed: 0,all_awardings,approved_at_utc,approved_by,archived,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,banned_at_utc,banned_by,body,body_html,can_gild,can_mod_post,collapsed,collapsed_because_crowd_control,collapsed_reason,collapsed_reason_code,comment_type,controversiality,created,created_utc,depth,distinguished,downs,edited,gilded,gildings,id,is_submitter,likes,link_id,locked,mod_note,mod_reason_by,mod_reason_title,mod_reports,name,no_follow,num_reports,parent_id,permalink,removal_reason,replies,report_reasons,saved,score,score_hidden,send_replies,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_type,top_awarded_type,total_awards_received,treatment_tags,unrepliable_reason,ups,user_reports,kind,source,date_scraped,scrape_num
0,data,[],,,False,,mtauraso,,misc,[],9a9f6614-c6b2-11e4-986d-22000bc08516,Graduate,dark,text,t2_3f8ok,False,False,False,[],,,I'd double check your units. Just plugging va...,"&lt;div class=""md""&gt;&lt;p&gt;I&amp;#39;d dou...",False,False,False,,,,,0,1714100000.0,1714100000.0,0,,0,False,0,{},l1aocrt,False,,t3_1cd9r3s,False,,,,[],t1_l1aocrt,False,,t3_1cd9r3s,/r/AskPhysics/comments/1cd9r3s/keplers_constan...,,,,False,4,False,True,False,AskPhysics,t5_2sumo,r/AskPhysics,public,,0,[],,4,[],t1,askp-comments,2024-05-03,scrape2


In [173]:
# Concatenate into one dataframe
scrapes = pd.concat([scrapes, trouble], sort=False, axis=0, join='outer', ignore_index=True)
print(f'scrapes is now a dataframe of dimensions {scrapes.shape}')
scrapes.head(1)

scrapes is now a dataframe of dimensions (51, 75)


Unnamed: 0.1,Unnamed: 0,all_awardings,approved_at_utc,approved_by,archived,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,banned_at_utc,banned_by,body,body_html,can_gild,can_mod_post,collapsed,collapsed_because_crowd_control,collapsed_reason,collapsed_reason_code,comment_type,controversiality,created,created_utc,depth,distinguished,downs,edited,gilded,gildings,id,is_submitter,likes,link_id,locked,mod_note,mod_reason_by,mod_reason_title,mod_reports,name,no_follow,num_reports,parent_id,permalink,removal_reason,replies,report_reasons,saved,score,score_hidden,send_replies,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_type,top_awarded_type,total_awards_received,treatment_tags,unrepliable_reason,ups,user_reports,kind,source,date_scraped,scrape_num
0,data,[],,,False,,mtauraso,,misc,[],9a9f6614-c6b2-11e4-986d-22000bc08516,Graduate,dark,text,t2_3f8ok,False,False,False,[],,,I'd double check your units. Just plugging va...,"&lt;div class=""md""&gt;&lt;p&gt;I&amp;#39;d dou...",False,False,False,,,,,0.0,1714100000.0,1714100000.0,0,,0.0,False,0.0,{},l1aocrt,False,,t3_1cd9r3s,False,,,,[],t1_l1aocrt,False,,t3_1cd9r3s,/r/AskPhysics/comments/1cd9r3s/keplers_constan...,,,,False,4.0,False,True,False,AskPhysics,t5_2sumo,r/AskPhysics,public,,0.0,[],,4.0,[],t1,askp-comments,2024-05-03,scrape2


In the "behind the scenes" work, because I was discovering the aforementioned problems and figuring out how to fix them as I went, I processed the batch of comments from each subreddit separately.  These can be found in the `data/input` folder, within `concatted-scrapes-separate-csvs-by-source.zip`.  Their file names begin with `askp` and `eli5`, respectively.

For the purposes of demonstration, I will save the sample comments generated above into a CSV in the `data/output` folder.

In [174]:
# Save the giant df as csv
scrapes.to_csv(f'../data/output/concatted-wholes/comments-combined-as-of_{my_date()}.csv', index = False)

In this notebook, I have walked through my process for scraping, saving, and processing comments from each of my chosen subreddits.  In the next notebook, I will show how I combined them into one dataframe, and explain the feature selection and engineering process.