<div>
<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px" width=100>
</div>

# Project 3: Web APIs & NLP

## Background

[Reddit](https://www.reddit.com) is a network of communities where people can dive into their interests, hobbies and passions. Subreddits are user-created channels that organize discussion around a particular interest, hobby or passion. According to [Metrics For Reddit](https://frontpagemetrics.com/history), there were over 3.2 million subreddits as of December 2021, with hundreds more being created every day.

1. Using [Pushshift's](https://github.com/pushshift/api) API, posts are collected from two subreddits: [nosleep](https://www.reddit.com/r/nosleep/) and [paranormal](https://www.reddit.com/r/paranormal/).
2. NLP is then used to train a classifier that predicts which subreddit a given post came from.

## Problem Statement

Reddit hosts a very large number of subreddits, and because interests, hobbies and passions overlap, many subreddits cover similar ground. Someone new to writing and posting on Reddit can easily be unsure which subreddit a post belongs in. The aim of this project is to help a new Reddit user decide which subreddit to post in.

In this project, the post is an account of a scary experience, and the two candidate subreddits are [nosleep](https://www.reddit.com/r/nosleep/) and [paranormal](https://www.reddit.com/r/paranormal/), two subreddits that cater to scary personal experiences and to paranormal experiences, thoughts and theories.

## Part 1: Data Collection

### 1. Imports (All imported libraries are added here)

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

### 2. Parameters for Data collection

In [2]:
# url for the pushshift api
url = "https://api.pushshift.io/reddit/search/submission"   

# subreddits chosen
subreddits = ['nosleep','paranormal']

# Initial point of collection: posts before 31 December 2021, 23:59:59 (UTC+8), as Unix epoch time
# (the same instant is 15:59:59 UTC)
last_time = 1640966399     
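
To double-check the cutoff, the epoch value can be converted back to a readable timestamp. The sketch below uses only the standard library. The UTC+8 offset is an assumption inferred from the value itself (it lands on 23:59:59 only at that offset); it is not stated elsewhere in the notebook.

```python
from datetime import datetime, timedelta, timezone

last_time = 1640966399  # same value as in the parameters cell above

# interpret the epoch value in UTC
cutoff_utc = datetime.fromtimestamp(last_time, tz=timezone.utc)

# interpret the same instant in UTC+8 (assumed local offset)
cutoff_local = cutoff_utc.astimezone(timezone(timedelta(hours=8)))

print(cutoff_utc.isoformat())
print(cutoff_local.isoformat())
```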

### 3. Function for Data collection

In [26]:
def get_posts(sub, num_posts):
    # Empty list for the data collection to output
    posts = []

    # initialize the "before" timestamp with the initial point of collection
    last = last_time

    # while loop to collect posts until num_posts have been gathered
    while len(posts) < num_posts:
        # parameters for the pushshift api: subreddit, size of 100 posts each loop, before time
        params = {
            'subreddit': sub,
            'size': 100,
            'before': last
        }

        # api request; raise an error on a bad HTTP status, then parse the json
        response = requests.get(url, params=params)
        response.raise_for_status()
        data = response.json()['data']

        # stop if the api returns no more posts (avoids an IndexError below)
        if not data:
            break

        # add posts to list
        posts.extend(data)

        # the next request collects posts older than the last post of this batch
        last = int(data[-1]['created_utc'])

    # trim in case the final batch overshoots num_posts
    return posts[:num_posts]
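
The loop above pages backwards in time: each request asks for posts created before the oldest post seen so far. The sketch below isolates that cursor logic so it can be run without network access. `paginate_before`, `fetch_page` and `fake_fetch` are hypothetical names standing in for the Pushshift request, and the fake data is generated only for illustration.

```python
def paginate_before(fetch_page, start, num_items, page_size=100):
    """Collect up to num_items records, newest first, using a 'before' cursor."""
    items = []
    before = start
    while len(items) < num_items:
        page = fetch_page(before, page_size)
        if not page:  # source exhausted: stop instead of indexing an empty list
            break
        items.extend(page)
        # the oldest record of this page becomes the cursor for the next request
        before = page[-1]['created_utc']
    return items[:num_items]


# hypothetical in-memory source: 250 records with timestamps 2500, 2490, ..., 10
FAKE_POSTS = [{'created_utc': t} for t in range(2500, 0, -10)]


def fake_fetch(before, size):
    """Return up to `size` records strictly older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return older[:size]


collected = paginate_before(fake_fetch, start=3000, num_items=230)
print(len(collected), collected[0]['created_utc'], collected[-1]['created_utc'])
```

The fake source treats `before` as strictly exclusive. Whether the real endpoint behaves the same way is not verified here; if it were inclusive, the post at each batch boundary could be collected twice.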

### 4. Data Collection

In [27]:
%%time
# data collection for the first subreddit nosleep
nosleep_posts = get_posts(subreddits[0], 1000)

Wall time: 1min 43s


In [20]:
%%time
# data collection for the second subreddit paranormal
paranormal_posts = get_posts(subreddits[1], 1000)

Wall time: 42.1 s
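
Each collection run of 1000 posts issues ten sequential requests of 100 posts, so a single failed request would abort the whole run. One possible safeguard, which is not part of the original notebook, is to retry a failed call after a growing pause. The sketch below is a minimal illustration; `with_retries` and `flaky` are hypothetical names, and the delay values are arbitrary assumptions.

```python
import time


def with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# hypothetical flaky call: fails twice, then succeeds
calls = {'count': 0}


def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('temporary failure')
    return 'ok'


delays = []
result = with_retries(flaky, sleep=delays.append)  # record the delays instead of sleeping
print(result, delays)
```

In the notebook, the request inside `get_posts` could be wrapped in the same way, for example `with_retries(lambda: requests.get(url, params=params))`.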


In [6]:
%%time
# data collection for the subreddit Ghoststories
# ghoststories_posts = get_posts('Ghoststories', 1000)

Wall time: 0 ns


In [7]:
%%time
# data collection for the subreddit WritingPrompts
# writingprompts_posts = get_posts('WritingPrompts', 1000)

Wall time: 0 ns


In [8]:
%%time
# data collection for the subreddit scarystories
# scarystories_posts = get_posts('scarystories', 1000)

Wall time: 0 ns


In [9]:
%%time
# data collection for the subreddit Glitch_in_the_Matrix
# glitchmatrix_posts = get_posts('Glitch_in_the_Matrix', 1000)

Wall time: 0 ns


### 5. Output to DataFrame and Export

In [21]:
# output to dataframe
df_nosleep = pd.DataFrame(nosleep_posts)
df_paranormal = pd.DataFrame(paranormal_posts)

In [25]:
df_nosleep.iloc[99:101]['created_utc']

99     1640809216
100    1640808343
Name: created_utc, dtype: int64
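
The output above compares the last post of the first batch (index 99) with the first post of the second batch (index 100); the later index has the earlier timestamp, as expected when paging backwards in time. A fuller check over every row could look like the sketch below. It runs on a small made-up list so that it is self-contained. `check_collection` is a hypothetical helper, and applying it to `nosleep_posts` assumes each post dict has `id` and `created_utc` keys.

```python
def check_collection(posts):
    """Return (number of duplicate ids, True if timestamps never increase)."""
    ids = [p['id'] for p in posts]
    n_duplicates = len(ids) - len(set(ids))
    times = [p['created_utc'] for p in posts]
    non_increasing = all(a >= b for a, b in zip(times, times[1:]))
    return n_duplicates, non_increasing


# made-up sample: the ids are invented, the timestamps are the two shown in the output above
sample = [
    {'id': 'a1', 'created_utc': 1640809216},
    {'id': 'b2', 'created_utc': 1640808343},
]
print(check_collection(sample))
```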

In [11]:
## Placeholder for the alternative subreddits

# df_ghoststories = pd.DataFrame(ghoststories_posts)
# df_ghoststories.to_csv('../datasets/ghoststories_posts.csv', index=False)

# df_writingprompts = pd.DataFrame(writingprompts_posts)
# df_writingprompts.to_csv('../datasets/writingprompts_posts.csv', index=False)

# df_scarystories = pd.DataFrame(scarystories_posts)
# df_scarystories.to_csv('../datasets/scarystories_posts.csv', index=False)

# df_glitchmatrix = pd.DataFrame(glitchmatrix_posts)
# df_glitchmatrix.to_csv('../datasets/glitchmatrix_posts.csv', index=False)

In [22]:
# save to csv
df_nosleep.to_csv('../datasets/nosleep_posts.csv', index=False)
df_paranormal.to_csv('../datasets/paranormal_posts.csv', index=False)

### 6. Progress thus far
At this point, we have collected posts from the two subreddits through the Pushshift API and saved each set to its own CSV file for analysis. The project continues in Part 2: [EDA_and_Data_Cleaning](./02_EDA_and_Data_Cleaning.ipynb)
