# Web APIs & NLP - Data Collection & Cleaning

## Problem Statement
Reddit has hired my team of data scientists to prototype a machine learning model that uses Natural Language Processing (NLP) to determine which subreddit a post originated from, given only its content (i.e. no title, no comments, no metadata). The model will be a binary classifier for two specific subreddits, built with a bag-of-words approach.
Reddit has set two requirements for measuring success:
1. Accuracy of classification > 95%
2. Both Precision and Recall must remain > 90%

## Background
Identifying the subreddit origin of a random post on Reddit is a challenging task that requires an in-depth understanding of the language, style, and content of different subreddits. In a world where misinformation and disinformation are prevalent, being able to accurately classify posts to their original source can significantly impact the credibility and reliability of the information being shared.

Reddit wants a prototype model that uses NLP to determine which subreddit a post comes from, given only its content (i.e. no title, no comments, no metadata). This is desirable because the same concept can be extended to other user-generated inputs on Reddit and applied to moderation, targeted marketing, and trend analysis.

Because this is a proof-of-concept, we have confirmed with Reddit that the model will be trained for binary classification on two specific subreddits whose content is related. The rationale for limiting the scope this way is to prove that the concept works on a small set of data before significant financial resources are allocated to a generalized model for the entire website. It is critical that the subreddit topics be related, to show that the method works even when posts from the two subreddits look similar. If the proof-of-concept is successful, Reddit will then commit more resources to improving the model and generalizing it to the entire website.

As mentioned in the problem statement, Reddit's metrics for success are an accuracy above 95% with both precision and recall above 90%. The accuracy target ensures the model is correct overall, and the precision and recall targets ensure that its performance is balanced rather than favoring one subreddit.
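
The success criteria can be written down as a small check. The following is a minimal sketch in plain Python, not the evaluation code used in this project; the function names `classification_metrics` and `meets_targets` and the toy labels are assumptions made for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


def meets_targets(accuracy, precision, recall):
    """Check the stated targets: accuracy > 0.95, precision and recall > 0.90."""
    return accuracy > 0.95 and precision > 0.90 and recall > 0.90


# Toy labels (1 = first subreddit, 0 = second subreddit); illustrative only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
acc, prec, rec = classification_metrics(y_true, y_pred)
passed = meets_targets(acc, prec, rec)
```

All three thresholds must hold at once, so a model with high accuracy but weak recall on one class still fails the check.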

To achieve these targets, our team will use a bag-of-words approach. We will use a common API to scrape the text of posts from the chosen subreddits and then test several pre-processing methods. We will also test several models and look at ensembling the best ones.
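
To make the bag-of-words idea concrete, here is a minimal sketch using only the Python standard library. It is an illustration of the representation, not the vectorizer used in the modeling notebooks; the tokenizer regex, the helper names `tokenize` and `bag_of_words`, and the two example posts are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(docs):
    """Build a sorted vocabulary and one count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    # Counter returns 0 for words a document does not contain
    vectors = [[c[word] for word in vocab] for c in counts]
    return vocab, vectors


# Two made-up posts; word order is discarded, only counts are kept
docs = [
    "My amp hums when the guitar is plugged in",
    "Which amp pairs well with these speakers",
]
vocab, vectors = bag_of_words(docs)
```

Each post becomes a fixed-length vector of word counts over a shared vocabulary, which is the input a classifier would train on.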

This repository contains the code in the form of a scientific notebook report, the data, and the models that were used.

## Contents:
- [Problem Statement](#Problem-Statement)
- [Background](#Background)
- [Datasets](#Datasets)
- [Imports](#Imports)
- [Data Collection](#Data-Collection)
- [Save Collected Data](#Save-Collected-Data)
- [Data Collection Summary](#Data-Collection-Summary)

## Datasets
The datasets were collected from two subreddits on [Reddit.com](https://reddit.com) by scraping posts with the [Pushshift API](https://github.com/pushshift/api). The two subreddits selected were [*r/audiophile*](https://www.reddit.com/r/audiophile) and [*r/guitar*](https://www.reddit.com/r/guitar), because both have a sufficient number of posts and their topics are related.

### Data Dictionary:
The data dictionary contains all features, provided and engineered, that were used in the models.


|Feature|Type|Dataset|Description|
|---|---|---|---|
|**subreddit**|*str*|Reddit|The name of the subreddit a post was scraped from|
|**created_utc**|*int*|Reddit|The epoch timestamp (in seconds) of when the post was created; used only to page through posts during collection and dropped before saving|
|**post_length**|*int*|Reddit|The length of each post by character|
|**post_word_count**|*int*|Reddit|The length of each post by word count|
|**selftext**|*str*|Reddit|The text from a post on a subreddit|
|**cleaned_selftext**|*str*|Reddit|The cleaned text from a post on a subreddit (used for testing, but not in final model)|
|**no_stop_selftext**|*str*|Reddit|The cleaned text from a post on a subreddit with stopwords removed (used for testing, but not in final model)|
|**stem_selftext**|*str*|Reddit|The cleaned and stemmed text from a post on a subreddit with stopwords removed (used for testing, but not in final model)|
|**lemmatize_selftext**|*str*|Reddit|The cleaned and lemmatized text from a post on a subreddit with stopwords removed (used for testing, but not in final model)|
|**no_shared_stem_selftext**|*str*|Reddit|The cleaned and stemmed text from a post on a subreddit with stopwords and commonly shared words removed|
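
As an illustration of the two length features, the sketch below derives a character count and a whitespace-delimited word count from a post's text. How these columns were actually computed is not shown in this notebook, so treat `length_features` and its counting rules as assumptions.

```python
def length_features(text):
    """Return character and word counts for a single post."""
    return {
        "post_length": len(text),              # characters, including spaces
        "post_word_count": len(text.split()),  # whitespace-delimited words
    }


# Made-up post text for illustration
features = length_features("I recently put new strings on my guitar")
```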

## Imports

#### Libraries

In [1]:
# Imports
import pandas as pd
import requests

## Data Collection

Data collection from Reddit is self-contained within a function in order to promote efficiency when scraping data.

[Epoch Converter](https://www.epochconverter.com/) was used to convert the hardcoded dates to and from epoch timestamps.
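
The same conversion can also be done locally. This is a small sketch using the standard library's `datetime` module; the helper names `to_epoch` and `from_epoch` are assumptions, and the two dates are the ones used as `before` values in this notebook.

```python
from datetime import datetime, timezone


def to_epoch(year, month, day, hour=0, minute=0, second=0):
    """Convert a UTC date and time to an integer epoch timestamp (seconds)."""
    dt = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(dt.timestamp())


def from_epoch(ts):
    """Convert an epoch timestamp (seconds) back to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


default_before = to_epoch(2023, 1, 1, 1)   # 1/1/23 1:00:00 GMT
run_before = to_epoch(2023, 2, 27, 1)      # 2/27/23 1:00:00 GMT
```

Passing `tzinfo=timezone.utc` explicitly avoids the result depending on the local timezone of the machine running the notebook.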

### get_subreddit_posts
- This function uses the Pushshift API to collect data from Reddit.
- It performs some preliminary data cleaning to ensure that the collected samples contain no blanks and are unique from one another.
- Collected posts must contain at least 10 words so the model has enough text to train on.
- As it iterates, it prints the number of samples collected so far so that progress can be monitored.
- When the function has finished running, it prints the shape of each dataframe that was collected.
- NOTE: If a subreddit does not have enough posts to satisfy the request, the function will crash. This is why the number of queried entries is also displayed: if that number dips far below the requested size (e.g. 500), choose a smaller n.

**Parameters:**
- subreddits (list): List of subreddits (str) to query
- size (int): The number of posts to be queried at a time (max: 500) (default: 100)
- n (int): Desired number of entries that each subreddit df will return (default: 1000)
- before (int): Epoch timestamp of the most recent time that posts will be pulled from (default: 1/1/23 1:00:00 GMT)

**Returns:**
- dfs (list): A list of pandas DataFrames, one per subreddit, each with up to n rows
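
The cleaning rules described above can be tried on a small in-memory batch without calling the API. The sketch below is a simplified stand-in for the filters inside `get_subreddit_posts`, not the function itself; `clean_posts`, `min_words`, and the sample rows are assumptions made for illustration.

```python
import pandas as pd


def clean_posts(df, subreddit, min_words=10):
    """Apply the removed, empty, short-post, and duplicate filters to one batch."""
    df = df.dropna(subset=["selftext"]).copy()
    df["subreddit"] = df["subreddit"].str.lower()

    keep = (
        (df["subreddit"] == subreddit.lower())                  # correct subreddit
        & (df["selftext"] != "[removed]")                       # post was not removed
        & (df["selftext"] != "")                                # post has body text
        & (df["selftext"].str.split().str.len() >= min_words)   # long enough
    )
    df = df.loc[keep].drop_duplicates(subset="selftext")
    return df.reset_index(drop=True)


# Made-up batch: one valid post, one duplicate, and several rows that should be dropped
long_post = "I am looking for a new power amp for my bookshelf speakers"
raw = pd.DataFrame({
    "subreddit": ["Guitar", "guitar", "guitar", "guitar", "guitar", "audiophile"],
    "selftext": [long_post, long_post, "[removed]", "", "too short", long_post],
})
cleaned = clean_posts(raw, "guitar")
```

Combining the conditions into one boolean mask keeps each rule visible on its own line and avoids re-slicing the DataFrame once per filter.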

In [2]:
def get_subreddit_posts(subreddits, size=100, n=1000, before=1672534800):
    """This function queries subreddit posts using the Pushshift API
    
    Parameters:
        subreddits (list): List of subreddits (str) to query
        size (int): The number of posts to be queried at a time (max: 500) (default: 100)
        n (int): Desired number of entries that each subreddit df will return (default: 1000)
        before (int): Epoch timestamp of the most recent time that posts will be pulled from (default: 1/1/23 1:00:00 GMT)
    
    Returns:
        dfs (list): A list of pandas DataFrames, one per subreddit, each with up to n rows
     """

    # Define api url
    url = 'https://api.pushshift.io/reddit/search/submission'

    dfs = [] # initialize list of dfs
    
    for subreddit in subreddits:

        # Define parameters
        count = 0 # initialize df entry count
        before_itr = before # reset updated 'before' to original timestamp
        
        # Loop until df is filled with desired amount of unique rows
        while count <= n:
            # Parameters for pushshift api
            params = {
                'subreddit': subreddit, # name of subreddit
                'size': size, # target number of posts to be queried
                'before': before_itr # grabs most recent entries before this date (epoch int format)
            }

            # Request data
            res = requests.get(url, params=params)
            res.raise_for_status() # stop early with a clear error on a failed HTTP request

            # Convert to json and pull 'data' from dict
            data = res.json()['data']

            # Convert to df for concatenation and only keep important columns
            df_temp = pd.DataFrame(data)
            df_temp = df_temp[['subreddit', 'selftext', 'created_utc']]

            # append newly returned data to df
            if count == 0:
                df = df_temp # initialize df
            else:
                df = pd.concat([df, df_temp], axis=0) # concatenate to df
                        
            # Drop any rows that have the post removed
            df = df.loc[df['selftext'] != '[removed]']

            # Drop duplicated posts, keeping the first occurrence
            df = df.loc[~df['selftext'].duplicated()]

            # Remove empty posts (most are removed in duplicates)
            df = df.loc[df['selftext'] != '']

            # Normalize subreddit names to lowercase and drop rows that do not match the input
            df = df.assign(subreddit=df['subreddit'].str.lower())
            df = df.loc[df['subreddit'] == subreddit.lower()]

            # Update date and count
            # Note: 'count' is measured before the filters below are applied to the newest batch
            before_itr = int(pd.to_datetime(df_temp['created_utc'], unit='s').min().timestamp())
            count = df.shape[0]
            print(f"Number of posts collected: {count} ({len(data)} queried)")

            # Drop NaNs
            df = df.dropna()

            # Drop any rows that have fewer than 10 words
            df = df.loc[df['selftext'].str.split().str.len() >= 10]
            
            # Drop one specific post that was found later to contain only table markup and no actual words
            df = df.loc[df['selftext'] != "\\^        \\^         \\^         \\^        \\^\n\n|        |         |         |        |\n\n|        |         |         |        |"]

        # Reset index
        df = df.reset_index(drop=True)

        # Drop any rows above target number of rows
        df = df.iloc[:n]
        
        # Drop 'created_utc' column
        # This was used for collecting the data in a sequential manner, but is not needed for the model analysis
        df = df.drop(columns='created_utc')
                
        # Create a list of dfs for different subreddits
        dfs.append(df)

        # Print collection update
        print(f"Data from the '{subreddit}' subreddit successfully collected.\n")
    
    # Print summary (one line per subreddit, however many were requested)
    print("DataFrame shapes:")
    for name, df_sub in zip(subreddits, dfs):
        print(f"{name}: {df_sub.shape}")
    
    # Return dfs
    return dfs
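
The paging logic in the function above (query, then move `before` back to the oldest timestamp in the batch) can be exercised offline. The sketch below is a simplified illustration, not a replacement for `get_subreddit_posts`: `collect`, `fake_fetch`, and `ARCHIVE` are hypothetical names, and the in-memory archive stands in for the API. Unlike the original function, this version stops cleanly when a batch comes back empty.

```python
def collect(fetch, n, before):
    """Page backwards in time until n posts are collected or the data runs out."""
    posts = []
    while len(posts) < n:
        batch = fetch(before)
        if not batch:  # nothing older is available: stop instead of raising
            break
        posts.extend(batch)
        # Move the cursor to the oldest post seen so the next query goes further back
        before = min(p["created_utc"] for p in batch)
    return posts[:n]


# Hypothetical stand-in for the API: 25 posts with timestamps 1..25
ARCHIVE = [{"created_utc": t, "selftext": f"post {t}"} for t in range(1, 26)]


def fake_fetch(before, size=10):
    """Return up to `size` of the newest posts created strictly before `before`."""
    older = [p for p in ARCHIVE if p["created_utc"] < before]
    return sorted(older, key=lambda p: p["created_utc"], reverse=True)[:size]


got = collect(fake_fetch, n=40, before=100)  # asks for more than the archive holds
```

Because each query only returns posts strictly older than the cursor, successive batches never overlap, and requesting more posts than exist simply returns everything available.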

In [3]:
# Define parameters for function
size = 500
num_rows = 4000
before = 1677459600 # 2/27/23 1:00:00 AM GMT Starting time
subreddits = ['audiophile', 'guitar']

In [4]:
# Run function to scrape subreddits
dfs = get_subreddit_posts(subreddits, size=size, n=num_rows, before=before)

Number of posts collected: 273 (499 queried)
Number of posts collected: 558 (500 queried)
Number of posts collected: 862 (500 queried)
Number of posts collected: 1155 (500 queried)
Number of posts collected: 1418 (500 queried)
Number of posts collected: 1693 (499 queried)
Number of posts collected: 1957 (500 queried)
Number of posts collected: 2261 (500 queried)
Number of posts collected: 2559 (500 queried)
Number of posts collected: 2855 (500 queried)
Number of posts collected: 3146 (500 queried)
Number of posts collected: 3430 (500 queried)
Number of posts collected: 3732 (500 queried)
Number of posts collected: 3983 (499 queried)
Number of posts collected: 4209 (499 queried)
Data from the 'audiophile' subreddit successfully collected.

Number of posts collected: 232 (500 queried)
Number of posts collected: 461 (498 queried)
Number of posts collected: 700 (500 queried)
Number of posts collected: 926 (500 queried)
Number of posts collected: 1153 (500 queried)
Number of posts collected

In [5]:
# Review the first df
dfs[0].head()

| |subreddit|selftext|
|---|---|---|
|0|audiophile|Hey everyone,\n\nMy late father was a DJ in so...|
|1|audiophile|I recently bought a Dual Kicker Comp R 12” sub...|
|2|audiophile|Hi I'm a beginner to this space and I just wan...|
|3|audiophile|I'm currently looking for a new power amp, and...|
|4|audiophile|i was wondering which is the best option for o...|


In [6]:
# Review the second df
dfs[1].head()

| |subreddit|selftext|
|---|---|---|
|0|guitar|Hi guys, so I have my guitar connected to thre...|
|1|guitar|So I’m playing with a lot of gain but that sho...|
|2|guitar|Recently bought a multiscale 8 string and play...|
|3|guitar|Will it sound uneven compared to the still-pac...|
|4|guitar|I recently put ernie ball cobalts on my guitar...|


In [7]:
# Combine dfs into single final df with a fresh, non-repeating index
df_final = pd.concat(dfs, axis=0, ignore_index=True)
print(df_final.shape)

(8000, 2)


## Save Collected Data

In [8]:
# Save to csv
filename = 'reddit.csv'
df_final.to_csv(f"../data/{filename}", index=False)

## Data Collection Summary
- In this notebook, posts from two subreddits (r/audiophile and r/guitar) were collected to be processed in later notebooks.
- The same number of posts (4,000) was collected from each subreddit, for 8,000 posts in total.
- All posts contain body text (not just a title) of at least 10 words and are unique within their subreddit.
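
The summary claims can be verified programmatically before moving on to later notebooks. The following is a hedged sketch rather than part of the original pipeline: `check_collection` is a hypothetical helper, and the small `sample` frame stands in for the saved `reddit.csv`. Note that it checks uniqueness across the combined data, which is slightly stricter than the per-subreddit de-duplication performed during collection.

```python
import pandas as pd


def check_collection(df, text_col="selftext", label_col="subreddit"):
    """Report whether classes are balanced and posts are non-empty and unique."""
    counts = df[label_col].value_counts()
    return {
        "balanced": counts.nunique() == 1,  # every subreddit has the same row count
        "no_empty": bool((df[text_col].str.strip() != "").all()),
        "unique": bool(df[text_col].is_unique),
    }


# Made-up stand-in for the combined dataset
sample = pd.DataFrame({
    "subreddit": ["audiophile", "audiophile", "guitar", "guitar"],
    "selftext": [
        "Looking for a new power amp",
        "Which DAC pairs well with these speakers",
        "My strings keep going out of tune",
        "How do I reduce hum at high gain",
    ],
})
report = check_collection(sample)
```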