# Problem statement

Two subreddits were chosen: r/nosleep and r/HFY.
- r/nosleep is a subreddit for horror and supernatural short stories.
- r/HFY is a subreddit for science fiction and fantasy short stories.

Both subreddits host short stories, but in different genres.

However, the two genres share some themes (e.g. some short stories in r/HFY contain horror elements, and vice versa).
Using natural language processing, classification models were built to test whether a post's text is enough to tell which of the two subreddits it came from.

# Executive Summary

- Around 1,000 posts from each of the chosen subreddits were scraped using the Reddit API.

- The posts were saved to one .csv file per subreddit and then loaded into dataframes. The two dataframes were combined, and the scraped posts were preprocessed and cleaned (using regular expressions, tokenization and lemmatization).

- As this is a binary classification problem, Multinomial Naive Bayes and Logistic Regression were chosen as the two models. For each model, two text vectorization methods were used (count vectorization and TF-IDF). Each model was built as a pipeline, and its hyperparameters were tuned using GridSearchCV.

- All four models performed well. The Logistic Regression model with count vectorization performed best, though not significantly better than the others.

- Confusion matrices were constructed for each model; however, accuracy was judged to be the most important metric in the context of this project.
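
To make the last point concrete, accuracy can be read directly off a confusion matrix. The sketch below is illustrative only: the counts are made-up placeholders rather than results from this project, and the function name `accuracy_from_confusion` is hypothetical.

```python
def accuracy_from_confusion(tn, fp, fn, tp):
    """Accuracy = correctly classified posts / all classified posts."""
    total = tn + fp + fn + tp
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Placeholder counts (NOT results from this project),
# with r/nosleep treated as the positive class (1).
example = {"tn": 45, "fp": 5, "fn": 3, "tp": 47}
print(accuracy_from_confusion(**example))  # (45 + 47) / 100
```

Accuracy is likely a fair summary here because the two classes are roughly balanced (around 1,000 posts each) and the summary above does not treat one kind of misclassification as costlier than the other.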



# Data Dictionary

- Data dictionary for the features used in the models:

|Feature|Data type|Description|
|-------|---------|-----------|
|selftext|*str*|The content of the reddit post (the predictor variable)|
|nosleep|*int*|Whether the post is from r/nosleep (1) or r/HFY (0); this is the target variable|
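
As an illustration of the two columns above, the sketch below labels a few posts in plain Python. The sample texts and the helper name `label_posts` are hypothetical, and the notebook itself works with pandas dataframes rather than lists of dicts.

```python
def label_posts(posts, subreddit):
    """Keep only the predictor (selftext) and add the target (nosleep)."""
    return [
        {"selftext": post["selftext"], "nosleep": int(subreddit == "nosleep")}
        for post in posts
    ]


# Made-up stand-ins for scraped posts.
hfy_posts = [{"selftext": "The humans repaired their ship with tape."}]
nosleep_posts = [{"selftext": "Something was scratching inside the wall."}]

rows = label_posts(hfy_posts, "HFY") + label_posts(nosleep_posts, "nosleep")
for row in rows:
    print(row["nosleep"], row["selftext"])
```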

# Contents

- [Data gathering](#Data-gathering)

# Data gathering

In [2]:
import requests
import pandas as pd
import time
import random

In [3]:
# Define the 2 subreddit APIs
url_1 = 'https://www.reddit.com/r/HFY.json'
url_2 = 'https://www.reddit.com/r/nosleep.json'
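
The base URLs above return the first page of each listing. Reddit listing endpoints also accept query parameters such as `after` (a paging token) and `limit` (posts per page). The helper below is a minimal sketch of building such URLs with the standard library; the name `build_listing_url` and the token `t3_abc123` are hypothetical and are not part of the original notebook.

```python
from urllib.parse import urlencode


def build_listing_url(base_url, after=None, limit=None):
    """Append optional listing query parameters to a subreddit .json URL."""
    params = {}
    if limit is not None:
        params["limit"] = limit
    if after is not None:
        params["after"] = after
    # Only add '?' when there is at least one parameter to send.
    return base_url + "?" + urlencode(params) if params else base_url


print(build_listing_url("https://www.reddit.com/r/HFY.json"))
print(build_listing_url("https://www.reddit.com/r/HFY.json", after="t3_abc123", limit=100))
```

Using `urlencode` rather than string concatenation keeps the URL valid when more than one parameter is present.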

In [4]:
# Assign responses to variables
res_1 = requests.get(url_1, headers={'User-agent': 'Pony Inc 1.0'})
res_2 = requests.get(url_2, headers={'User-agent': 'Pony Inc 2.0'})

In [5]:
# Check status codes to make sure the APIs are responding
print(res_1.status_code)
print(res_2.status_code)

200
200


In [6]:
# Convert res_1 to json and assign to dict1
dict1 = res_1.json()

In [7]:
# Get an idea of what dict1 looks like
dict1

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 26,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'HFY',
     'selftext': 'What, you guys are back already? My experiments aren\'t done yet! I just sent out my NANITES to fetch more reagents! Uh....here, have an MWC Update for this month\'s MWC Theme: **[Hallows 6]**. We\'ve got a bunch of stories for the [Bump in the Night] category already, but only a couple for the other two! So...you guys get to writing, and I\'ll get back to my experiments, deal?! Good ok....go read and write and stuff!\n\n\n---\n \n**Writers**: Make sure to tag the theme in your post title and mention the category in the post body. You must use include **[Hallows6]** in the title of your post, and include name of the category that you want to submit your story to inside the main body of your story. See the **[FAQ](https://www.reddit.com/r/hfy/wiki/ref/faq#wiki_what_is_the_mwc.3F)** for more info on tagging your post. Fo

In [8]:
# dict1 has 2 keys: kind and data
dict1.keys()

dict_keys(['kind', 'data'])

In [9]:
# Most information is kept in the 'children' key in the data dict
dict1['data']['children']

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'HFY',
   'selftext': 'What, you guys are back already? My experiments aren\'t done yet! I just sent out my NANITES to fetch more reagents! Uh....here, have an MWC Update for this month\'s MWC Theme: **[Hallows 6]**. We\'ve got a bunch of stories for the [Bump in the Night] category already, but only a couple for the other two! So...you guys get to writing, and I\'ll get back to my experiments, deal?! Good ok....go read and write and stuff!\n\n\n---\n \n**Writers**: Make sure to tag the theme in your post title and mention the category in the post body. You must use include **[Hallows6]** in the title of your post, and include name of the category that you want to submit your story to inside the main body of your story. See the **[FAQ](https://www.reddit.com/r/hfy/wiki/ref/faq#wiki_what_is_the_mwc.3F)** for more info on tagging your post. For example, if you want to submit your story under the "[Oktoberfest]" category,
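
The nesting shown above can be hard to read in raw form. The sketch below uses a tiny stand-in dictionary with the same shape (all values are made up) to show how the post fields and the paging token can be pulled out.

```python
# A small stand-in with the same nesting as the listing above (values are made up).
listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_example2",
        "children": [
            {"kind": "t3", "data": {"subreddit": "HFY", "selftext": "First story"}},
            {"kind": "t3", "data": {"subreddit": "HFY", "selftext": "Second story"}},
        ],
    },
}

# Each child wraps one post; the post's fields live under its own 'data' key.
posts = [child["data"] for child in listing["data"]["children"]]
texts = [post["selftext"] for post in posts]

# The 'after' token marks where the next page of the listing starts.
next_after = listing["data"]["after"]

print(texts)
print(next_after)
```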

### Automating the Reddit API calls

```python
# Use a for loop to repeatedly hit the 2 subreddits' APIs 40 times (each time there should be around 25 posts)

posts_1 = []
posts_2 = []
after_1 = None
after_2 = None

for a in range(40):
    # On the first request there is no 'after' token yet, so use the base urls defined above
    # On later requests, append each subreddit's own 'after' token to page through its posts
    current_url_1 = url_1 if after_1 is None else url_1 + '?after=' + after_1
    current_url_2 = url_2 if after_2 is None else url_2 + '?after=' + after_2
        
    print('current url:', current_url_1)
    print('current url:', current_url_2)
    res_1 = requests.get(current_url_1, headers={'User-agent': 'Pony Inc 1.0'})
    res_2 = requests.get(current_url_2, headers={'User-agent': 'Pony Inc 2.0'})
    
    # If either status_code is not 200, report both codes and break out of the for loop
    if (res_1.status_code != 200) or (res_2.status_code != 200):
        print('Status error', res_1.status_code, res_2.status_code)
        break
    
    # Get the response for both subreddits and assign to the respective current_dict and current_posts
    current_dict_1 = res_1.json()
    current_posts_1 = [p['data'] for p in current_dict_1['data']['children']]
    posts_1.extend(current_posts_1)
    after_1 = current_dict_1['data']['after']
    
    current_dict_2 = res_2.json()
    current_posts_2 = [p['data'] for p in current_dict_2['data']['children']]
    posts_2.extend(current_posts_2)
    after_2 = current_dict_2['data']['after']
    
    # Output the posts to csv files 
    pd.DataFrame(posts_1).to_csv('./datasets/HFY.csv', index = False)
    pd.DataFrame(posts_2).to_csv('./datasets/nosleep.csv', index = False)
    
    # Wait a random 2 to 10 seconds between requests so as not to overload the servers or trip rate limits
    sleep_duration = random.randint(2,10)
    print(sleep_duration)
    time.sleep(sleep_duration)
```
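
One edge case in the loop above is worth sketching separately: when a listing runs out, its `after` token comes back as `None`, and requesting the base URL again would return the first page a second time. The sketch below is a hedged, offline illustration of stopping at that point and skipping repeated posts. The function `collect_posts`, the `fetch_page` callable and the fake `PAGES` data are all hypothetical, and deduplicating on the post's `name` field is an assumption about the response, not something the original notebook does.

```python
def collect_posts(fetch_page, max_pages=40):
    """Collect posts page by page until max_pages is reached or the listing ends.

    fetch_page(after) must return a dict shaped like Reddit's listing JSON.
    """
    posts, seen, after = [], set(), None
    for _ in range(max_pages):
        data = fetch_page(after)["data"]
        for child in data["children"]:
            post = child["data"]
            if post["name"] not in seen:  # skip posts already collected
                seen.add(post["name"])
                posts.append(post)
        after = data["after"]
        if after is None:  # no further pages, so stop instead of restarting
            break
    return posts


# Fake two-page listing used in place of real HTTP requests (made-up data).
PAGES = {
    None: {"data": {"after": "t3_b", "children": [
        {"data": {"name": "t3_a", "selftext": "one"}},
        {"data": {"name": "t3_b", "selftext": "two"}}]}},
    "t3_b": {"data": {"after": None, "children": [
        {"data": {"name": "t3_b", "selftext": "two"}},
        {"data": {"name": "t3_c", "selftext": "three"}}]}},
}

collected = collect_posts(lambda after: PAGES[after])
print([p["name"] for p in collected])
```

In real use, `fetch_page` would wrap the `requests.get(...).json()` call from the loop above, with the same randomized pause between requests.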

**This project is continued in Project 3 Code2 - Data Preprocessing/Cleaning & Modelling.**