# Reddit NLP Classifier

## Data Collection (1/4) 

## Contents
- [Data Collection](#Data-Collection)
- [Data Dictionary](#Data-Dictionary)

## Data Collection

For this project, the `r/malefashionadvice` and `r/femalefashionadvice` subreddits were selected. Comments from both subreddits were collected using the Pushshift API and the `requests` library.


### All libraries

In [1]:
# Import libraries
import pandas as pd
import requests
import time

### Create a reusable function

A reusable function was created to retrieve comments from each subreddit. The function reports an error when the HTTP status code is not 200, and it removes comments posted by AutoModerator. Since Pushshift limits each request to 500 results, the function loops, requesting progressively older comments, until it has collected the requested number. This matters because the project required at least 1,000 posts from each subreddit. A `time.sleep()` call gives the server a short break between queries. Lastly, the `before` parameter restricts results to comments created on or before March 1, 2023, so that the same comments are retrieved each time the code is run.

In [2]:
# Reference: https://youtu.be/AcrjEWsMi_E
# Reference: https://github.com/pushshift/api.git
# Reference: https://www.epochconverter.com/
# Reference: GA 503 API Solution Code Lesson 
# Reference: https://stackoverflow.com/questions/40045545/pandas-query-string-where-column-name-contains-special-characters

def subreddit_comment(subreddit, num_post):
    # Target web page 
    url = 'https://api.pushshift.io/reddit/search/comment'   
    # Set the parameters 
    params = {
        'subreddit': subreddit,
        'size': 500,
        'before': 1677718897 # Set to March 1, 2023
    }
    
    # Establish the connection to the web page
    res = requests.get(url, params)
    # Raise an error if HTTP status response is not 200 
    if res.status_code != 200:
        raise requests.HTTPError(
            f"Error {res.status_code}: "
            f"Unable to retrieve data from {subreddit}. Please try again."
        )
    else:
        # Store data in json form 
        data = res.json()
        # Store data in data column 
        posts = data['data']
        # Save it in dataframe 
        data_df = pd.DataFrame(posts)
        # Remove comments by AutoModerator
        df1 = data_df.query('author != "AutoModerator"')

    # Loop above process if data size is smaller than `num_post`
    while len(df1) < num_post:
        # Use the timestamp of the last comment collected so far as the new cutoff
        prev_post = int(df1['created_utc'].iloc[-1])
        # Set the parameters
        params = {
            'subreddit': subreddit,
            'size': 500,
            'before': prev_post
        }
        
        # Establish the connection to the web page
        res = requests.get(url, params)
        # Raise an error if HTTP status response is not 200 
        if res.status_code != 200:
            raise requests.HTTPError(
                f"Error {res.status_code}: "
                f"Unable to retrieve data from {subreddit}. Please try again."
            )
        else:
            # Store data in json form 
            data = res.json()
            # Store data in data column 
            posts = data['data']
            # Save it in dataframe 
            data_df = pd.DataFrame(posts)
            # Remove comments by AutoModerator
            df2 = data_df.query('author != "AutoModerator"')
            
            # Concatenate datasets 
            df_concat = pd.concat([df1, df2])
            df1 = df_concat.drop_duplicates(subset='body')
        # Provide the server with a short break between queries
        time.sleep(5) 
    # Return df1 (already de-duplicated) so the result is defined even if the loop never runs
    return df1.reset_index(drop=True)[['subreddit', 'body', 'created_utc']]
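
The paging logic in the function above can also be written with the network call factored out, which makes the loop easy to exercise offline. The following is a minimal sketch under stated assumptions, not the code used to build the dataset: `collect_comments`, `fetch_page`, and `fake_fetch` are hypothetical names, and `fake_fetch` only imitates an API that returns the newest rows older than `before`. Unlike the original loop, the sketch also stops when a page comes back empty.

```python
import pandas as pd

COLUMNS = ["subreddit", "body", "created_utc"]


def collect_comments(fetch_page, num_post, before):
    """Page backwards in time until `num_post` unique comments are collected.

    `fetch_page(before)` is a hypothetical callable that returns a list of
    dicts with 'author', 'subreddit', 'body', and 'created_utc' keys.
    """
    frames = []
    collected = pd.DataFrame(columns=COLUMNS)
    while len(collected) < num_post:
        rows = fetch_page(before)
        if not rows:
            # Nothing older is available, so stop instead of looping forever
            break
        page = pd.DataFrame(rows)
        # Move the cutoff to the oldest timestamp seen in this page
        before = int(page["created_utc"].min())
        # Drop AutoModerator comments, as the original function does
        kept = page[page["author"] != "AutoModerator"]
        if not kept.empty:
            frames.append(kept[COLUMNS])
            collected = pd.concat(frames, ignore_index=True)
            collected = collected.drop_duplicates(subset="body")
    return collected.reset_index(drop=True)


def fake_fetch(before):
    """Hypothetical stand-in for the API: up to three rows older than `before`."""
    stamps = [t for t in range(before - 1, before - 4, -1) if t > 0]
    return [
        {
            "author": "AutoModerator" if t % 5 == 0 else "user",
            "subreddit": "demo",
            "body": f"comment {t}",
            "created_utc": t,
        }
        for t in stamps
    ]


sample = collect_comments(fake_fetch, num_post=10, before=20)
print(sample.shape)
```

As in the original function, the result can overshoot `num_post`, because whole pages are appended before the length is checked again.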


### Collect data using the above function

Comments from the two subreddits, `r/malefashionadvice` and `r/femalefashionadvice`, were collected using the above function.

In [8]:
# Use the function to retrieve comments from r/malefashionadvice
malefashion = subreddit_comment('malefashionadvice', 2500)
# Check the shape of the data
malefashion.shape

(2778, 3)

In [9]:
# Use the function to retrieve comments from r/femalefashionadvice
femalefashion = subreddit_comment('femalefashionadvice', 2500)
# Check the shape of the data
femalefashion.shape

(2793, 3)

### Check post creation date

In [10]:
# Check the post date range 
pd.concat([malefashion.iloc[[0]], malefashion.iloc[[-1]]])

# 1677718806 - Wednesday, March 1, 2023 (CT)
# 1677456121 - Sunday, February 26, 2023 (CT)

| |subreddit|body|created_utc|
|----|----|----|----|
|0|malefashionadvice|Definitely agree there’s personality there. Se...|1677718806|
|2777|malefashionadvice|Walmart brand now. Saw them today|1677456121|


In [11]:
# Check the post date range 
pd.concat([femalefashion.iloc[[0]], femalefashion.iloc[[-1]]])

# 1677718502 - Wednesday, March 1, 2023 (CT)
# 1677390987 - Saturday, February 25, 2023 (CT)

| |subreddit|body|created_utc|
|----|----|----|----|
|0|femalefashionadvice|30th is auto-permission to go wild. Btw, 40 wa...|1677718502|
|2792|femalefashionadvice|Any ideas on linen pants that aren't see throu...|1677390987|
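
The `created_utc` values are seconds since the Unix epoch, so they can be converted to readable dates directly in pandas rather than with an external converter. A minimal sketch, assuming US Central Time is the intended zone (the comments above say "CT") and using the two `femalefashion` timestamps shown in the output:

```python
import pandas as pd

# Timestamps copied from the first and last rows of the femalefashion output
timestamps = pd.Series([1677718502, 1677390987], name="created_utc")

# Interpret the values as seconds since the Unix epoch (UTC),
# then convert to US Central Time
central = pd.to_datetime(timestamps, unit="s", utc=True).dt.tz_convert("America/Chicago")

# Show weekday and calendar date for each timestamp
print(central.dt.strftime("%A, %B %d, %Y"))
```

The same two calls could be applied to the whole `created_utc` column to inspect the full date range before the column is dropped.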


After checking the creation dates, the `created_utc` column was dropped from the clean dataset because it is not relevant to the project.

In [12]:
# Delete created_utc from both subreddits 
malefashion.drop(columns = 'created_utc', inplace=True)
femalefashion.drop(columns = 'created_utc', inplace=True)

### Combine data from the two subreddits

The comments from both subreddits were then combined into a single dataset. 

In [13]:
# Combine two subreddits together
df = pd.concat([malefashion, femalefashion], ignore_index=True)

In [14]:
# Check data info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5571 entries, 0 to 5570
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  5571 non-null   object
 1   body       5571 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


### Save dataset

The combined dataset was saved in CSV format. 

In [15]:
# Save the dataframe 
df.to_csv('../data/subreddits_combined.csv', index=False)
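
Because comment text often contains commas and line breaks, it can be worth confirming that the saved file reads back unchanged. The sketch below is illustrative only: it uses two made-up rows and an in-memory buffer instead of the real `../data/subreddits_combined.csv` file.

```python
import io

import pandas as pd

# Toy rows standing in for the combined dataset (hypothetical text)
toy = pd.DataFrame({
    "subreddit": ["malefashionadvice", "femalefashionadvice"],
    "body": ["Boots, not sneakers.", "Line one\nLine two"],
})

# Write to an in-memory buffer the same way the notebook writes to disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read the CSV text back and compare it with the original rows
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored["body"].tolist() == toy["body"].tolist())
```

`to_csv` quotes fields that contain the delimiter or a line break by default, which is why the text survives the round trip.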

## Data Dictionary

|Feature|Type|Dataset|Description|
|----|----|----|----|
|subreddit|object|Reddit's two subreddits (r/malefashionadvice & r/femalefashionadvice)|Subreddit (Reddit's community) name|
|body|object|Reddit's two subreddits (r/malefashionadvice & r/femalefashionadvice)|Text of the comment|
|created_utc|int64|Reddit's two subreddits (r/malefashionadvice & r/femalefashionadvice)|Creation time of the comment (Unix epoch time); dropped before the dataset was saved|