<div>
<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px" width=100>
</div>

# Project 3: Web APIs & NLP

## Background

[Reddit](https://www.reddit.com) is a network of communities where people can dive into their interests, hobbies and passions. Subreddits are user-created channels that organize discussion around a particular interest, hobby or passion. According to [Metrics For Reddit](https://frontpagemetrics.com/history), there were over 3.2 million subreddits as of December 2021, with hundreds more being created every day.

## Problem Statement

Because Reddit hosts so many subreddits, and because interests, hobbies and passions often overlap, many subreddits cover similar ground. Someone new to writing and posting on Reddit can easily be unsure which subreddit a post belongs in.

This project aims to help a new Reddit user decide which subreddit to post in, using classification models built from an analysis of the posts in each subreddit.

For this project, the post describes a scary experience, and the new user is choosing between [nosleep](https://www.reddit.com/r/nosleep/) and [paranormal](https://www.reddit.com/r/paranormal/), two subreddits that cater to scary personal experiences and to paranormal experiences, thoughts and theories.

The classification model will be judged on its **accuracy** and its **specificity**. Accuracy matters because sending a post to the wrong subreddit is a poor outcome in either direction. Specificity matters because the `nosleep` subreddit has strict posting requirements, so we want to reduce the chance of a post being wrongly directed to `nosleep` and then removed.
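To make the two metrics concrete, here is a minimal sketch of how they can be computed from true and predicted labels. It assumes `nosleep` is treated as the positive class, which the text above implies but does not state outright; the helper name `accuracy_and_specificity` and the toy labels are invented for illustration.

```python
def accuracy_and_specificity(y_true, y_pred, positive="nosleep"):
    """Return (accuracy, specificity), treating `positive` as the positive class.

    Assumes the inputs are non-empty and contain at least one negative example.
    """
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1   # correctly sent to nosleep
            else:
                fp += 1   # paranormal post wrongly sent to nosleep
        else:
            if truth == positive:
                fn += 1   # nosleep post wrongly sent to paranormal
            else:
                tn += 1   # correctly sent to paranormal

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    specificity = tn / (tn + fp)
    return accuracy, specificity


# Toy labels, invented for illustration only
y_true = ["nosleep", "nosleep", "paranormal", "paranormal", "paranormal"]
y_pred = ["nosleep", "paranormal", "paranormal", "nosleep", "paranormal"]

acc, spec = accuracy_and_specificity(y_true, y_pred)
print(acc, spec)
```

Under this convention, a false positive is a `paranormal` post predicted as `nosleep`, so a higher specificity means fewer posts at risk of removal.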

At the end of this project, by identifying the top root-word features of each subreddit, we will be better able to guide the user's decision for a given post.

## Part 1: Data Collection

### 1. Imports (All imported libraries are added here)

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

### 2. Parameters for Data collection

In [2]:
# url for the pushshift api
url = "https://api.pushshift.io/reddit/search/submission"   

# subreddits chosen
subreddits = ['nosleep','paranormal']

# Posts before 31 December 2021, 23:59:59 in UTC+8 (15:59:59 UTC), for the initial point of collection (time in Unix epoch time)
last_time = 1640966399     
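
Epoch timestamps are easy to misread, so it can help to convert `last_time` back into a readable date. The following is a minimal sketch using only the standard library; the UTC+8 offset is an assumption about the local time zone the value was written in, not something the notebook states.

```python
from datetime import datetime, timedelta, timezone

last_time = 1640966399  # same value as in the parameters cell

# Interpret the epoch value as a UTC datetime
as_utc = datetime.fromtimestamp(last_time, tz=timezone.utc)

# View the same instant in UTC+8 (assumed local time zone)
local_tz = timezone(timedelta(hours=8))
as_local = as_utc.astimezone(local_tz)

print(as_utc.isoformat())    # 2021-12-31T15:59:59+00:00
print(as_local.isoformat())  # 2021-12-31T23:59:59+08:00
```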

### 3. Function for Data collection

In [26]:
def get_posts(sub, num_posts):
    # Empty list for the data collection to output
    posts = []

    # initialize the `before` cursor at the starting point of collection
    last = last_time

    # while loop to collect posts until num_posts have been gathered
    while len(posts) < num_posts:
        # parameters for the pushshift api: subreddit, size of 100 posts each loop, before time
        params = {
            'subreddit': sub,
            'size': 100,
            'before': last
        }

        # api request; raise on HTTP errors, then parse the json payload
        request = requests.get(url, params=params)
        request.raise_for_status()
        data = request.json()['data']

        # stop early if the api has no older posts to return
        if not data:
            break

        # add posts to list
        posts.extend(data)

        # move the cursor to the last (oldest) post of this batch
        last = int(data[-1]['created_utc'])

    # trim in case the final batch overshoots the requested number
    return posts[:num_posts]
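
The `before` cursor is the core of the loop above: each request asks only for posts older than the last post already collected. Below is a minimal offline sketch of that idea. The `paginate` helper and the `fake_fetch` stand-in are hypothetical (neither is part of Pushshift or this notebook); they exist only so the cursor logic can be checked without network access.

```python
def paginate(fetch, start, num_posts, page_size=100):
    """Collect up to num_posts items by walking a `before` cursor back in time."""
    posts, before = [], start
    while len(posts) < num_posts:
        page = fetch(before, page_size)
        if not page:                       # nothing older is available
            break
        posts.extend(page)
        before = page[-1]["created_utc"]   # oldest post in this page
    return posts[:num_posts]


# Hypothetical in-memory "API": 250 fake posts, newest first
FAKE_POSTS = [{"id": i, "created_utc": i} for i in range(250, 0, -1)]


def fake_fetch(before, size):
    """Return up to `size` fake posts created strictly before `before`."""
    older = [p for p in FAKE_POSTS if p["created_utc"] < before]
    return older[:size]


collected = paginate(fake_fetch, start=251, num_posts=220)
print(len(collected))                                             # 220
print(collected[0]["created_utc"], collected[-1]["created_utc"])  # 250 31
```

Counting the posts actually collected, rather than assuming 100 per request, keeps the total honest when a page comes back short or empty.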

### 4. Data Collection

In [27]:
%%time
# data collection for the first subreddit nosleep
nosleep_posts = get_posts(subreddits[0], 1000)

Wall time: 1min 43s


In [20]:
%%time
# data collection for the second subreddit paranormal
paranormal_posts = get_posts(subreddits[1], 1000)

Wall time: 42.1 s


### 5. Output to DataFrame and Export

In [21]:
# output to dataframe
df_nosleep = pd.DataFrame(nosleep_posts)
df_paranormal = pd.DataFrame(paranormal_posts)
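
Before exporting, it may be worth confirming how many posts were collected and whether any appear more than once. The following is a minimal sketch on toy data; it assumes each post carries a unique `id` field, and the rows and the `df_toy` name are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the collected posts (invented rows)
toy_posts = [
    {"id": "a1", "title": "First", "created_utc": 1640966000},
    {"id": "b2", "title": "Second", "created_utc": 1640965000},
    {"id": "a1", "title": "First", "created_utc": 1640966000},  # repeated post
]
df_toy = pd.DataFrame(toy_posts)

# Count rows whose id has already been seen, then keep the first occurrence only
n_duplicates = int(df_toy.duplicated(subset="id").sum())
df_unique = df_toy.drop_duplicates(subset="id").reset_index(drop=True)

print(df_toy.shape, n_duplicates, df_unique.shape)
```

The same two calls could be applied to `df_nosleep` and `df_paranormal` if their `id` column is present.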

In [22]:
# save to csv
df_nosleep.to_csv('../datasets/nosleep_posts.csv', index=False)
df_paranormal.to_csv('../datasets/paranormal_posts.csv', index=False)

### 6. Progress thus far
At this point, we have collected posts from the two subreddits through the Pushshift API and saved them as CSV files, ready for cleaning and analysis.

We will continue next in Part 2: [EDA_and_Data_Cleaning](./02_EDA_and_Data_Cleaning.ipynb)
