# Project 3 - Predicting Subreddit Posts

# Problem Statement

A power outage affected some of the servers that store posts from the subreddits r/travel and r/backpacking. As a result, the affected posts were stored incorrectly on the servers.

As an employee of Reddit, I have been tasked by my boss with training a classifier model to identify the subreddit to which each post belongs.

We will train 2 models (Naive Bayes and Random Forest) on about 2,000 Reddit posts scraped from the two subreddits (about 1,000 posts from each).


## Project Workflow

The project has been divided into 3 parts:

1. This notebook contains the data collection for the 2 subreddits: r/backpacking and r/travel.
2. The second notebook contains the data cleaning and EDA of data retrieved from the 2 subreddits. 
3. The last notebook covers preprocessing of the datasets and modeling using Logistic Regression and a Random Forest Classifier. We then choose the model by comparing accuracy scores when predicting which subreddit a post came from.

# Import Libraries

In [1]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
from random import randint

In [2]:
pd.set_option("display.max_columns", 15)

In [3]:
pd.set_option("display.max_rows", 4000)

In [4]:
pd.options.display.max_colwidth = 400

# Pushshift API

While experimenting with the Pushshift API, I found that by default it returns some fields containing empty lists or empty dictionaries. When writing the function to scrape Reddit data, we therefore keep only the columns that are meaningful for the project and drop the rest, such as metadata that is not relevant here.
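
The column selection described above can be sketched on a small example. This is a minimal illustration, not the notebook's own code: the records, the `keep` list, and the helper `select_columns` are hypothetical, and the extra fields only stand in for the kind of empty list or dictionary values mentioned above.

```python
import pandas as pd

# Hypothetical records shaped like API payloads (illustrative, not real data)
records = [
    {"id": "a1", "title": "First", "selftext": "hello", "extra_list": [], "extra_dict": {}},
    {"id": "b2", "title": "Second", "extra_list": [], "extra_dict": {}},
]

# Columns we want to keep (an assumed subset for this example)
keep = ["id", "title", "selftext"]


def select_columns(records, keep):
    """Build a DataFrame that holds only the columns listed in `keep`."""
    df = pd.DataFrame(records)
    # reindex fills a column that is absent from the data with NaN
    # instead of raising a KeyError, as df[keep] would
    return df.reindex(columns=keep)


subset = select_columns(records, keep)
print(subset)
```

Using `reindex` rather than plain indexing is a design choice for this sketch: it tolerates batches in which one of the expected fields is missing entirely.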

In [5]:
# create a function to scrape submission data from a subreddit
def get_redditpost(subreddit, size, loop):
    '''
    subreddit: str, name of subreddit to search for
    size: int, number of posts per request (capped at 100 by the API)
    loop: int, number of times to repeat the request
    returns: DataFrame of the selected columns, de-duplicated on selftext
    '''   
    # columns to return for submission
    col = ['author', 'author_fullname', 'created_utc', 'id', 'is_self', 'num_comments', 'permalink', 
               'score', 'selftext', 'subreddit', 'title', 'url'] 
    
    # instantiate list for post data
    post_lists = []
    url_initial = f'https://api.pushshift.io/reddit/search/submission/?subreddit={subreddit}&size={size}'
    after = 1
       
    for i in range(loop):
        url = f'{url_initial}&after={i + after}d'
        # confirm that the loop is progressing
        print(f'Batch {i} of data from {url}')
        # get data
        res = requests.get(url)
        # print the HTTP status code (200 means success)
        print(res.status_code)
        data = res.json()
        post_lists.extend(data['data'])
        # pause 2-5 seconds between requests to avoid overloading the Pushshift server
        time.sleep(randint(2,5))

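    df = pd.DataFrame(post_lists)
    # keep only the columns of interest
    df = df[col]
        
    # drop posts with duplicate selftext, keeping the first occurrence
    df.drop_duplicates(subset='selftext', inplace=True)   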
        
    return df
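
The batching and de-duplication steps can be exercised without touching the network. The sketch below is a hedged variant, not the function above: `collect_posts` and `fake_fetch` are hypothetical names, the fetcher is passed in as an argument so a stand-in can replace `requests.get`, and duplicates are dropped on `id` rather than on `selftext`. That last change is an alternative technique, shown because posts with an empty body all share the same `selftext` value.

```python
import pandas as pd


def collect_posts(fetch, subreddit, size, loop):
    """Gather `loop` batches through `fetch(url)` and de-duplicate on post id."""
    base = (
        "https://api.pushshift.io/reddit/search/submission/"
        f"?subreddit={subreddit}&size={size}"
    )
    posts = []
    for i in range(loop):
        url = f"{base}&after={i + 1}d"
        # fetch is expected to return a list of post dictionaries
        posts.extend(fetch(url))
    df = pd.DataFrame(posts)
    # keep the first occurrence of every post id
    return df.drop_duplicates(subset="id").reset_index(drop=True)


def fake_fetch(url):
    """Stand-in for the network call: returns overlapping batches of made-up posts."""
    days = int(url.rsplit("after=", 1)[1].rstrip("d"))
    return [{"id": f"post{n}", "selftext": ""} for n in range(days + 1)]


df = collect_posts(fake_fetch, "backpacking", 100, 3)
print(len(df))
```

In this made-up data every post has an empty body, so dropping duplicates on `selftext` would keep a single row, while dropping on `id` keeps each distinct post once.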


# Retrieve Reddit Posts and Save to CSV

## Backpacking submission posts

In [8]:
backpack_subs = get_redditpost('backpacking', 100, 100)

Batch 0 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=backpacking&size=100&after=1d
200
Batch 1 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=backpacking&size=100&after=2d
200
Batch 2 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=backpacking&size=100&after=3d
200
Batch 3 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=backpacking&size=100&after=4d
200
Batch 4 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=backpacking&size=100&after=5d
200
Batch 5 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=backpacking&size=100&after=6d
200
Batch 6 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=backpacking&size=100&after=7d
200
Batch 7 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=backpacking&size=100&after=8d
200
Batch 8 of data from https://api.pushshift.io/reddit/search/submission/?

200
Batch 70 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=backpacking&size=100&after=71d
200
Batch 71 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=backpacking&size=100&after=72d
200
Batch 72 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=backpacking&size=100&after=73d
200
Batch 73 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=backpacking&size=100&after=74d
200
Batch 74 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=backpacking&size=100&after=75d
200
Batch 75 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=backpacking&size=100&after=76d
200
Batch 76 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=backpacking&size=100&after=77d
200
Batch 77 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=backpacking&size=100&after=78d
200
Batch 78 of data from https://api.pushshift.io/reddi

In [9]:
backpack_subs.head(200)

Unnamed: 0,author,author_fullname,created_utc,id,is_self,num_comments,permalink,score,selftext,subreddit,title,url
0,solarhikes,t2_66soeyfs,1632314331,pt6nrw,False,1,/r/backpacking/comments/pt6nrw/2021_colorado_trail_thru_hike/,1,,backpacking,2021 Colorado Trail Thru Hike,https://youtu.be/JEzAjlGIVtI
5,lepeskin,t2_nnkkh,1632322776,pt9bqx,True,1,/r/backpacking/comments/pt9bqx/siguniangshan_tibet_four_sisters_mountain/,1,[removed],backpacking,"Siguniangshan, Tibet, Four Sisters Mountain",https://www.reddit.com/r/backpacking/comments/pt9bqx/siguniangshan_tibet_four_sisters_mountain/
10,natureboy234,t2_200p9g3k,1632330538,ptc2t2,True,0,/r/backpacking/comments/ptc2t2/pack_on_a_plane/,1,I’m taking a trip to Europe within a few days and am planning on taking my 65L Rei backpack. I know it’s too large to be a carry on but I really don’t want to check it as I have multiple layovers and haven’t had the best experience with bags at airports in the past. \nAre there any other options I have to make sure my pack doesn’t get lost or stolen?,backpacking,Pack on a plane?,https://www.reddit.com/r/backpacking/comments/ptc2t2/pack_on_a_plane/
18,greeneyedcat711,t2_3os7b9po,1632349599,pthcsv,True,0,/r/backpacking/comments/pthcsv/psa_double_check_your_nesting_cookware_before/,6,"A friend and I went backcountry camping in NM. We literally hiked out after two nights in the backcountry and then we drove straight to the airport and dropped off the rental car. We had to repack our gear by the ticket counter and put certain things in our checked luggage (tent poles, stakes, knife, etc). Unfortunately, in our scramble, we failed to check our cookware that can house the burne...",backpacking,PSA: Double check your nesting cookware before flying,https://www.reddit.com/r/backpacking/comments/pthcsv/psa_double_check_your_nesting_cookware_before/
25,wiscogirl2185,t2_4iqoe16j,1632358955,ptk68w,True,1,/r/backpacking/comments/ptk68w/water_suggestions/,1,"Taking my first backcountry trip in the Badlands next week; it will be four days long. From what I’ve read, there is not potable water where we will be. I have not done a hike where there weren’t water sources so looking for suggestions on what type of vessel to take (bladder, Smart Water bottles, Nalgene, etc) and also how much to take. I will have electrolyte mix to get some bang for my buck.",backpacking,Water suggestions,https://www.reddit.com/r/backpacking/comments/ptk68w/water_suggestions/
27,Easy-Try-1351,t2_8j0dzkih,1632380118,ptpkwe,True,0,/r/backpacking/comments/ptpkwe/looking_for_souvenirs_from_poland_croatia_germany/,1,"Hi all,\n\nI am a fellow traveller from Poland. Nice to meet you. Long story short, due to various reasons I've been unable to purchase souvenirs from some places that I've been to. I am looking for those from the cities below (no postcards, magnets please!). I'll cover costs of the purchase and shipping &amp; pay you for the trouble as well. As a proof, I am attaching pictures of my current c...",backpacking,"Looking for souvenirs from Poland, Croatia, Germany &amp; Lithuania",https://www.reddit.com/r/backpacking/comments/ptpkwe/looking_for_souvenirs_from_poland_croatia_germany/
40,FitPandaBear,t2_36znvh34,1632243421,psmw7r,True,4,/r/backpacking/comments/psmw7r/finding_fellow_backpackers_and_solo_travelers_in/,1,"After all my years of traveling around, what has been most exciting for me are the people I met and the friends I made. I learned that for some solo travelers and nomads sometimes it's challenging for them to meet new people every time they switch to a new city. That is why I created Nomad Friend Groups, free telegram groups that you can hop in and meet other travelers with over 200+ cities ar...",backpacking,Finding Fellow Backpackers And Solo Travelers In Every New City!,https://www.reddit.com/r/backpacking/comments/psmw7r/finding_fellow_backpackers_and_solo_travelers_in/
43,cruisedummy,t2_10qxwws1,1632249260,psox6x,True,1,/r/backpacking/comments/psox6x/2_weeks_in_europe/,1,"I am a Canadian Looking to travel to Europe for 2 weeks in November/December. Any suggestions on countries to visit? \n\nI’m looking for affordability, drinking/social scene, and scenery, or some combination. Warm climate is a bonus, but not necessary",backpacking,2 weeks in Europe,https://www.reddit.com/r/backpacking/comments/psox6x/2_weeks_in_europe/
45,cruisedummy,t2_10qxwws1,1632251024,pspir7,True,5,/r/backpacking/comments/pspir7/2_weeks_in_europe_suggestions/,1,"I am 25m from Canada looking to go to Europe in Nov/Dec for 2 weeks and looking for suggestions on countries and cities to visit during that 2 weeks. \n\nLooking for affordability, scenery, and good social/drinking scene. With only 2 weeks, I don’t want to cram too much",backpacking,"2 weeks in Europe, suggestions?",https://www.reddit.com/r/backpacking/comments/pspir7/2_weeks_in_europe_suggestions/
46,s3Nq,t2_sdlba,1632256462,psrefs,True,1,/r/backpacking/comments/psrefs/anyone_have_expieriance_with_the_rook_15_bag/,1,Its on sale right now on backcountry but i cant find any reviews on it and id rather not jump in blind. And if you have any suggestions for a bag around 200$ ill take all the help i can get lol,backpacking,Anyone have expieriance with the rook 15 bag?,https://www.reddit.com/r/backpacking/comments/psrefs/anyone_have_expieriance_with_the_rook_15_bag/


In [10]:
backpack_subs['created_utc'].duplicated().sum()

0

In [11]:
backpack_subs[['selftext']]

Unnamed: 0,selftext
0,
5,[removed]
10,I’m taking a trip to Europe within a few days and am planning on taking my 65L Rei backpack. I know it’s too large to be a carry on but I really don’t want to check it as I have multiple layovers and haven’t had the best experience with bags at airports in the past. \nAre there any other options I have to make sure my pack doesn’t get lost or stolen?
18,"A friend and I went backcountry camping in NM. We literally hiked out after two nights in the backcountry and then we drove straight to the airport and dropped off the rental car. We had to repack our gear by the ticket counter and put certain things in our checked luggage (tent poles, stakes, knife, etc). Unfortunately, in our scramble, we failed to check our cookware that can house the burne..."
25,"Taking my first backcountry trip in the Badlands next week; it will be four days long. From what I’ve read, there is not potable water where we will be. I have not done a hike where there weren’t water sources so looking for suggestions on what type of vessel to take (bladder, Smart Water bottles, Nalgene, etc) and also how much to take. I will have electrolyte mix to get some bang for my buck."
27,"Hi all,\n\nI am a fellow traveller from Poland. Nice to meet you. Long story short, due to various reasons I've been unable to purchase souvenirs from some places that I've been to. I am looking for those from the cities below (no postcards, magnets please!). I'll cover costs of the purchase and shipping &amp; pay you for the trouble as well. As a proof, I am attaching pictures of my current c..."
40,"After all my years of traveling around, what has been most exciting for me are the people I met and the friends I made. I learned that for some solo travelers and nomads sometimes it's challenging for them to meet new people every time they switch to a new city. That is why I created Nomad Friend Groups, free telegram groups that you can hop in and meet other travelers with over 200+ cities ar..."
43,"I am a Canadian Looking to travel to Europe for 2 weeks in November/December. Any suggestions on countries to visit? \n\nI’m looking for affordability, drinking/social scene, and scenery, or some combination. Warm climate is a bonus, but not necessary"
45,"I am 25m from Canada looking to go to Europe in Nov/Dec for 2 weeks and looking for suggestions on countries and cities to visit during that 2 weeks. \n\nLooking for affordability, scenery, and good social/drinking scene. With only 2 weeks, I don’t want to cram too much"
46,Its on sale right now on backcountry but i cant find any reviews on it and id rather not jump in blind. And if you have any suggestions for a bag around 200$ ill take all the help i can get lol


In [12]:
backpack_subs.shape

(998, 12)

In [13]:
backpack_subs.isnull().sum()

author             0
author_fullname    2
created_utc        0
id                 0
is_self            0
num_comments       0
permalink          0
score              0
selftext           1
subreddit          0
title              0
url                0
dtype: int64
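
Note that `isnull()` counts only missing values, so placeholder bodies such as `[removed]` (visible in the preview above) are not reflected in these counts. A minimal sketch of counting them follows; the sample frame, the helper `count_unusable`, and the placeholder list (including `[deleted]` and the empty string) are assumptions made for illustration, not values taken from the scraped data.

```python
import pandas as pd

# Hypothetical sample, not taken from the scraped data
sample = pd.DataFrame(
    {"selftext": ["A real question", "[removed]", "", None, "[deleted]"]}
)


def count_unusable(df, column="selftext", placeholders=("[removed]", "[deleted]", "")):
    """Count rows whose text is missing, empty, or a known placeholder."""
    # treat missing values as empty strings and ignore surrounding whitespace
    text = df[column].fillna("").str.strip()
    return int(text.isin(list(placeholders)).sum())


print(count_unusable(sample))
```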

In [14]:
backpack_subs.to_csv('../datasets/backpack_subs.csv', index=False)
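
The effect of `index=False` can be checked with an in-memory round trip. This is a small sketch on a made-up frame: it writes to `io.StringIO` buffers instead of the `../datasets/` path, and the frame and variable names are hypothetical.

```python
import io

import pandas as pd

# Made-up frame with a non-contiguous index, similar to one left after dropping rows
df = pd.DataFrame({"id": ["a1", "b2"], "title": ["First", "Second"]}, index=[0, 5])

# Write once with the index and once without it
with_index = io.StringIO()
df.to_csv(with_index)
without_index = io.StringIO()
df.to_csv(without_index, index=False)

# Rewind the buffers and read the CSV text back
with_index.seek(0)
without_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

Writing the index produces an extra unnamed first column on reload, which is why saving with `index=False` keeps the file limited to the 12 selected columns.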

## Travel submission posts

In [16]:
travel_subs = get_redditpost('travel', 100, 40)

Batch 0 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=travel&size=100&after=1d
200
Batch 1 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=travel&size=100&after=2d
200
Batch 2 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=travel&size=100&after=3d
200
Batch 3 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=travel&size=100&after=4d
200
Batch 4 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=travel&size=100&after=5d
200
Batch 5 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=travel&size=100&after=6d
200
Batch 6 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=travel&size=100&after=7d
200
Batch 7 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=travel&size=100&after=8d
200
Batch 8 of data from https://api.pushshift.io/reddit/search/submission/?subreddit=travel&size=100&after=9d
200
B

In [17]:
travel_subs.shape

(1051, 12)

In [18]:
travel_subs.head()

Unnamed: 0,author,author_fullname,created_utc,id,is_self,num_comments,permalink,score,selftext,subreddit,title,url
0,jgoat25,t2_e7fz9qh5,1632315586,pt70sh,True,0,/r/travel/comments/pt70sh/help/,1,"I live in Canada and want to travel to Miami Florida, The problem is I got the Moderna vaccine and the Pfizer one as my second. So i am mixed vaxxed, im inquiring if i would be able to travel there by plane? Would it help if i get a 3rd vaccine lol, or not?",travel,Help!,https://www.reddit.com/r/travel/comments/pt70sh/help/
1,tobe4funas,t2_nn9jz,1632315920,pt74nd,True,1,/r/travel/comments/pt74nd/suggestions_for_planning_2_months_long_stay_in/,1,[removed],travel,Suggestions for planning ~2 months long stay in Japan?,https://www.reddit.com/r/travel/comments/pt74nd/suggestions_for_planning_2_months_long_stay_in/
3,AlarmingInstance,t2_6p3apelj,1632316235,pt780a,True,1,/r/travel/comments/pt780a/italy_october_itinerary/,1,"Hi all!\n\nI'm going to Italy in October and need help deciding where to go my first day and a half there. I've been doing some research on my own but I would love the opinions of locals or people that have been to Italy before. \n\nMy favorite thing about traveling is finding those hidden gems that aren't as well known, so I would like to do something that's less of a photo op, and more of a ...",travel,Italy October Itinerary,https://www.reddit.com/r/travel/comments/pt780a/italy_october_itinerary/
4,cru_jonze,t2_93h7h,1632316489,pt7arq,True,0,/r/travel/comments/pt7arq/best_nyc_movie_tour_or_any_movies_shooting_in_nyc/,1,"I will be in the city this weekend and wanted to visit some iconic locations, or better yet, watch something currently shooting in the city. I live on the east coast so I have visited a lot of the standards already (Ghostbusters HQ, Rockefeller, Bethesda Terrace) but looking for something a bit more off the beaten path.",travel,Best NYC movie tour or any movies shooting in NYC right now?,https://www.reddit.com/r/travel/comments/pt7arq/best_nyc_movie_tour_or_any_movies_shooting_in_nyc/
6,RobinHoodProtocol,t2_cs279771,1632316825,pt7eag,False,0,/r/travel/comments/pt7eag/chhatrapati_shivaji_maharaj_terminus_ex_victoria/,1,,travel,"Chhatrapati Shivaji Maharaj Terminus (ex Victoria Terminus), Mumbai (ex Bombay). Shantaram’s traces [OC]",https://i.redd.it/xcxg627g22p71.jpg


In [19]:
travel_subs[['selftext']]

Unnamed: 0,selftext
0,"I live in Canada and want to travel to Miami Florida, The problem is I got the Moderna vaccine and the Pfizer one as my second. So i am mixed vaxxed, im inquiring if i would be able to travel there by plane? Would it help if i get a 3rd vaccine lol, or not?"
1,[removed]
3,"Hi all!\n\nI'm going to Italy in October and need help deciding where to go my first day and a half there. I've been doing some research on my own but I would love the opinions of locals or people that have been to Italy before. \n\nMy favorite thing about traveling is finding those hidden gems that aren't as well known, so I would like to do something that's less of a photo op, and more of a ..."
4,"I will be in the city this weekend and wanted to visit some iconic locations, or better yet, watch something currently shooting in the city. I live on the east coast so I have visited a lot of the standards already (Ghostbusters HQ, Rockefeller, Bethesda Terrace) but looking for something a bit more off the beaten path."
6,
16,"Hello everybody,\n\n&amp;#x200B;\n\nRight now I'm planning a trip to either Mexico or Costa Rica for roughly 4 weeks for October/November. But I'm not quite sure, if these destinations will be a good idea during that time of the year, since it is during the rain season and looking at the weather online it seems to rain a lot? \n\nSo my first question is, will it be worth to travel to one of th..."
22,I'm going to the Caribbean for my boyfriend's 30th with a group of six people and staying at an all inclusive with a private pool in our room. I already bought an inflatable pong table to surprise him. What other fun things can I bring to make this trip the best it can be?
25,"I am planning a three week road trip up the east coast of Australia. Last time I forgot a heap of things like solar lights, a sun umbrella and other inconveniences. I am putting all my money toward this trip and planners cost $30 plus dollars. I found these online https://Etsy.com/shop/roorark I really like the look of them, they are digital downloads. Should I get these or is it worth investi..."
29,"Hi! This is happening right now, so please help ASAP if you can!\nI booked a return flight with United (all domestic), there is a stopover in both legs. \nThe stopover was relatively close to my destination, so on the way there, someone picked me up at the airport and I didn’t use my connecting flight\n\nNow I’m at the gate and the airline is telling me they cancelled my return flights because..."
34,"So I have the opportunity to travel to Georgia near the end of next month. If I go, I'd like to do the classic nature treks - Svaneti, Ushguli, Kazbegi, etc. Any of you been there in autumn, and is it worth hiking at that time? Or should I just postpone the whole thing to May/June next year?"


In [20]:
travel_subs.isnull().sum()

author             0
author_fullname    2
created_utc        0
id                 0
is_self            0
num_comments       0
permalink          0
score              0
selftext           1
subreddit          0
title              0
url                0
dtype: int64

In [21]:
travel_subs.to_csv('../datasets/travel_subs.csv', index=False)