# <font color='blue'>Introduction</font>

## <font color='green'>Foreword</font>

Christopher Denq

5/5/2023

Notebook 1 of 3

Contains the scraper functions used to generate the data for the NLP project.

## <font color='green'>Code Setup</font>

### Imports

In [2]:
import requests
import pandas as pd

### Loading Dataset & Global Variables

In [3]:
# URL Setup
url = "https://api.pushshift.io/reddit/search/submission"
# subreddits = ["lawschooladmissions", "gradadmissions"]
subreddits_2 = ["AskReddit", "askscience"] # switched to these subreddits due to lack of activity in the previous ones
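
As a rough illustration of how the endpoint and its query parameters combine, the sketch below builds a request URL with the standard library only. The parameter names `subreddit`, `limit`, and `before` mirror those used by the scraper in this notebook; `build_query_url` is a hypothetical helper introduced for illustration, not part of the notebook or of any Pushshift client, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Same endpoint as the `url` variable above
BASE_URL = "https://api.pushshift.io/reddit/search/submission"


def build_query_url(subreddit: str, limit: int = 1000, **extra) -> str:
    """Hypothetical helper: return the full GET URL for one subreddit query."""
    params = {"subreddit": subreddit, "limit": limit, **extra}
    return f"{BASE_URL}?{urlencode(params)}"


print(build_query_url("AskReddit"))
# The timestamp below is a made-up value, used only to show an extra filter
print(build_query_url("askscience", before=1680000000))
```

Printing the URL before sending a request is a cheap way to confirm that the filters are what you expect.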

### Custom Functions

In [4]:
# Custom function for checking data
def check_size(subreddits: list) -> None:
    """
    Prints the number of rows in each subreddit's database.

    Parameters:
        subreddits: list
            A list containing the subreddit names as strings; used to locate each CSV file

    Returns:
        None
            Prints the number of rows in each subreddit database
    """
    for subreddit in subreddits:
        df = pd.read_csv(f"../scrapped_data/{subreddit.lower()}_data.csv", encoding="utf-8-sig")
        print(f"/r/{subreddit.lower()} has {df.shape[0]} entries total")

In [5]:
# Custom function for scraping data
def scrape(subreddits: list, mode: str = "new", initial: bool = False) -> None:
    """
    Scrapes the provided subreddits and creates or updates the CSV database of Reddit posts.

    Parameters:
        subreddits: list
            A list containing the subreddit names as strings; used in the pushshift url

        mode: str, default="new"
            "new" indicates scraping posts that were generated after the newest post in the database
            "old" indicates scraping posts that were generated before the oldest post in the database
            Ignored when initial=True

        initial: bool, default=False
            True indicates that this is the first scrape (creating the database)
            False indicates that this is a subsequent scrape (updating the database)

    Returns:
        None
            Appropriate database is either created or updated
    """
    # Validate mode up front so a typo never triggers a request
    if mode not in ("new", "old"):
        print('Please input "old" or "new" for the function.')
        return

    columns = ['subreddit', 'selftext', 'title', 'author_flair_text', 'created_utc', 'url']

    # Loop through subreddits to fetch data
    for subreddit in subreddits:
        path = f"../scrapped_data/{subreddit.lower()}_data.csv"
        print(f"Starting /r/{subreddit.lower()}...")

        # Setup URL params and make connection
        params = {
            "subreddit": subreddit,
            "limit": 1000,
        }
        existing_df = None
        if not initial:  # Change params depending on whether this is our first extraction
            existing_df = pd.read_csv(path, encoding="utf-8-sig")
            # Rows are saved in ascending created_utc order, so the last row is the newest
            newest_time, oldest_time = existing_df["created_utc"].values[-1], existing_df["created_utc"].values[0]
            if mode == "new":
                params["after"] = newest_time
            else:
                params["before"] = oldest_time
        response = requests.get(url, params=params)

        # Loading data
        if response.status_code == 200:
            print("Successful connection!")
            # Passing columns= keeps only the fields we need and tolerates an empty result
            df = pd.DataFrame(response.json()['data'], columns=columns)
        else:
            print("Failed connection...")
            print(f"Leaving /r/{subreddit.lower()}!")
            print("=" * 10)
            continue

        # Combining datasets (on the first scrape there is nothing to combine with)
        if existing_df is None:
            combined_df = df
        else:
            combined_df = pd.concat([existing_df, df], axis=0)
        combined_df = combined_df.sort_values(by="created_utc", ascending=True)

        # Save data to CSV
        combined_df.to_csv(path, encoding="utf-8-sig", index=False)
        print(f"Added {df.shape[0]} entries!")
        print("=" * 10)
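
The way `mode` selects a time filter can be checked offline. The sketch below isolates that step in a pure function, assuming the stored `created_utc` values are sorted in ascending order, as they are when the CSV is saved. `time_filter` is a hypothetical name introduced only for illustration, and the timestamps are made up.

```python
def time_filter(created_utc: list, mode: str = "new") -> dict:
    """Hypothetical helper: pick the extra query parameter for an update scrape.

    Assumes `created_utc` is sorted ascending (oldest first, newest last).
    """
    if mode == "new":
        # Only ask for posts newer than the newest stored post
        return {"after": created_utc[-1]}
    if mode == "old":
        # Only ask for posts older than the oldest stored post
        return {"before": created_utc[0]}
    raise ValueError('mode must be "old" or "new"')


stored = [1683000000, 1683000500, 1683001000]  # made-up timestamps
print(time_filter(stored, "new"))
print(time_filter(stored, "old"))
```

Raising an error for an unknown mode, rather than printing and returning, is one possible design choice; it makes a mistyped argument harder to miss.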

## <font color='green'>Code</font>

In [6]:
# RUN ONLY IF FIRST TIME OR ELSE IT WILL WIPE YOUR DATASET
# scrape(subreddits_2, initial=True)

In [7]:
# scrape(subreddits_2, "old") # Scrape historical posts
# scrape(subreddits_2) # Scrape new posts
check_size(subreddits_2)

/r/askreddit has 13149 entries total
/r/askscience has 12314 entries total
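
Since the same post could in principle be fetched by more than one scrape, it may be worth checking the saved data for repeats alongside the row counts. A minimal sketch on a made-up DataFrame, assuming a post is identified by its `url` and `created_utc` (an assumption for illustration; the notebook does not define a key):

```python
import pandas as pd

# Made-up rows standing in for a saved subreddit CSV
df = pd.DataFrame({
    "title": ["a", "b", "b", "c"],
    "created_utc": [100, 200, 200, 300],
    "url": ["u1", "u2", "u2", "u3"],
})

key = ["url", "created_utc"]  # assumed identifying columns

# duplicated() marks every repeat after the first occurrence of a key
n_dupes = int(df.duplicated(subset=key).sum())
deduped = df.drop_duplicates(subset=key).reset_index(drop=True)

print(f"{n_dupes} duplicate rows, {len(deduped)} unique posts")
```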
