## Reddit NLP Project 

__The objective of this project is two-fold:__

__1.__ Using [Pushshift's](https://github.com/pushshift/api) API, I'll collect posts from two subreddits of my choosing.

__2.__ I'll then use NLP to train a classifier to predict which subreddit a given post came from. This is a binary classification problem.

To build these models, I've opted for text-heavy subreddits so that the models have as much raw text as possible to train and test on. In the 'legaladvice' subreddit, people solicit legal advice on issues they are facing that may warrant legal intervention. The 'casualconversation' subreddit ranges over a number of topics; posters can simply rant about their day or ask for feedback on thoughts or issues that may cross their mind. Since both have an element of advice, I'm interested in comparing the language used in everyday, mundane conversation with the language used for presumably more pressing issues that require legal guidance of some sort. I've organized the project as follows:

* __Part 1a/b__ - Web scraping of text data from Reddit
* __Part 2__ - Cleaning of unstructured text data
* __Part 3__ - Exploratory Data Analysis (EDA)
* __Part 4a__ - Preprocessing of cleaned data; lemmatizing and stemming using NLTK
* __Part 4b__ - Modeling; Logistic Regression and Random Forest

__Part 1: Web scraping posts from two subreddits__

### Imports

In [1]:
import pandas as pd
import numpy as np
import requests
import time

In [2]:
base_url = 'https://api.pushshift.io/reddit/search/'

__The legal advice subreddit__

In [3]:
params = {
    'subreddit': 'legaladvice',
    'size': 75
}
res = requests.get(base_url + 'submission/', params=params)

In [4]:
res.status_code

200

In [5]:
data = res.json()

In [6]:
data.keys()

dict_keys(['data'])

In [7]:
# Convert the posts into a DataFrame
posts= pd.DataFrame(data['data'])
posts

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_background_color,author_flair_css_class,author_flair_text,author_flair_text_color,author_is_blocked,awarders,can_mod_post,...,whitelist_status,wls,author_flair_richtext,author_flair_type,author_fullname,author_patreon_flair,author_premium,post_hint,preview,banned_by
0,[],True,[deleted],,,,dark,False,[],False,...,all_ads,6,,,,,,,,
1,[],False,banksnosons,,,,,False,[],False,...,all_ads,6,[],text,t2_3ha8i7b2,False,False,,,
2,[],False,Hangman_Matt,,,,,False,[],False,...,all_ads,6,[],text,t2_8ew841k,False,False,,,
3,[],False,Throwawaaaaaay526289,,,,,False,[],False,...,all_ads,6,[],text,t2_e0mruzq3,False,False,,,
4,[],False,unholychalice,,,,,False,[],False,...,all_ads,6,[],text,t2_80teju2,False,False,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70,[],False,throwawayRA_121,,,,,False,[],False,...,all_ads,6,[],text,t2_95weryw0,False,False,,,
71,[],False,NotoriousVIV_,,,,,False,[],False,...,all_ads,6,[],text,t2_5ucl4dnx,False,False,,,
72,[],False,[deleted],,,,dark,False,[],False,...,all_ads,6,,,,,,,,
73,[],False,CamoCanna,,,,,False,[],False,...,all_ads,6,[],text,t2_1uanvj1m,False,False,,,


In [8]:
posts['created_utc'].sort_values(ascending= False)

0     1629570682
1     1629570605
2     1629570478
3     1629570313
4     1629570134
         ...    
70    1629561553
71    1629560835
72    1629560809
73    1629560764
74    1629560761
Name: created_utc, Length: 75, dtype: int64

In [9]:
early = posts['created_utc'][-1:]
early

74    1629560761
Name: created_utc, dtype: int64
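
The `created_utc` values above are Unix epoch seconds, and the oldest one is what gets passed as `before` when requesting the next page. As a minimal sketch (the two sample values are copied from the output above; the variable names `created_dt` and `oldest` are my own), they can be converted to readable timestamps, and the oldest value can be taken with `min()` so it does not depend on row order:

```python
import pandas as pd

# Sample values copied from the output above (Unix epoch seconds).
created_utc = pd.Series([1629570682, 1629560761], name='created_utc')

# unit='s' tells pandas the integers are seconds since 1970-01-01 UTC.
created_dt = pd.to_datetime(created_utc, unit='s', utc=True)

# The smallest timestamp belongs to the oldest post; as a plain int it can be
# used directly as the next `before` parameter.
oldest = int(created_utc.min())

print(created_dt)
print(oldest)
```

Note that `posts['created_utc'][-1:]` returns a one-row Series rather than a scalar, which is why a plain integer is extracted here.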

In [14]:
#For loop to gather 3000 total posts from this subreddit
# posts = pd.DataFrame()
# early = None 
# for i in range(40):
#     params = {
#     'subreddit': 'legaladvice',
#     'size': 75,
#     #'before': early
#     }
#     if early is not None:
#         params['before']=early
#     res = requests.get(base_url + 'submission/', params=params)
#     data = res.json()
#     posts= pd.concat([posts,
#                      pd.DataFrame(data['data'])])
#     early = posts['created_utc'].values[-1] 
#     time.sleep(3)

I commented out the code above because I ran it one too many times and received an error on the last run. I've already gathered the necessary data and saved it to CSV files.
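
One way to make the loop safer to re-run is to wrap it in a function that checks the HTTP status and stops when a page comes back empty. The following is only a sketch under assumptions: `fetch_page` and `collect_posts` are hypothetical helper names of my own, the endpoint and parameters are the ones used in the cells above, and it assumes the API still responds as it did there.

```python
import time

import pandas as pd
import requests

BASE_URL = 'https://api.pushshift.io/reddit/search/'  # same endpoint as above


def fetch_page(subreddit, before=None, size=75):
    """Request one page of submissions and return a list of post dicts."""
    params = {'subreddit': subreddit, 'size': size}
    if before is not None:
        params['before'] = int(before)
    res = requests.get(BASE_URL + 'submission/', params=params, timeout=30)
    res.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return res.json()['data']


def collect_posts(subreddit, n_pages, fetch=fetch_page, pause=3):
    """Page backwards through a subreddit (hypothetical helper)."""
    frames = []
    before = None
    for _ in range(n_pages):
        page = fetch(subreddit, before=before)
        if not page:  # an empty page means there is nothing older to fetch
            break
        frame = pd.DataFrame(page)
        frames.append(frame)
        # The oldest timestamp on this page bounds the next request.
        before = int(frame['created_utc'].min())
        time.sleep(pause)  # be polite to the API between requests
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

With this in place, the 40-page run would be `posts = collect_posts('legaladvice', n_pages=40)`. Passing the fetch function as an argument also lets the paging logic be tried out with fake data, without hitting the network.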

In [15]:
#posts.shape

(600, 69)

__Saving the 3000 posts gathered into a CSV file__

In [None]:
#posts.to_csv('./Data/legaladvice_reddit.csv')
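
A small sketch of the save step, under assumptions: the toy DataFrame below stands in for the scraped posts, it assumes each post carries an `id` column, and a temporary directory stands in for `./Data/`. Dropping duplicate ids guards against posts fetched twice across pages, and `index=False` keeps pandas from writing the row index, which would otherwise come back as an extra `Unnamed: 0` column when the file is read again.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the scraped DataFrame (hypothetical values).
posts = pd.DataFrame({
    'id': ['a1', 'b2', 'b2'],
    'created_utc': [1629570682, 1629570605, 1629570605],
})

# Drop posts that were fetched more than once across pages.
posts = posts.drop_duplicates(subset='id').reset_index(drop=True)

# A temporary directory stands in for './Data/' in this sketch.
out_path = os.path.join(tempfile.mkdtemp(), 'legaladvice_reddit.csv')
posts.to_csv(out_path, index=False)  # do not write the row index as a column

# Reading the file back gives the same columns, with no extra index column.
reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```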