# Subreddit Classification - Data Collection

## Original Data Sources

<b>nosleep and scarystories subreddits</b><br>
[nosleep](https://www.reddit.com/r/nosleep/)<br>
[scarystories](https://www.reddit.com/r/scarystories/)

### Importing necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

from nltk.sentiment.vader import SentimentIntensityAnalyzer

import warnings
import requests
import time

# Silence library warnings and fix the NumPy seed for reproducibility
warnings.filterwarnings('ignore')
np.random.seed(824)

### Grabbing data from nosleep subreddit

In [2]:
round(time.time())

1587682540
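
The value above is a Unix epoch timestamp in seconds, which is the form the collection loops pass as the `before` parameter. As a quick sanity check, the sketch below converts that number back into a readable UTC date using only the standard library; the variable names are illustrative and not part of the original notebook.

```python
from datetime import datetime, timezone

# Epoch seconds printed by round(time.time()) in the cell above
stamp = 1587682540

# Interpret the epoch value as a timezone-aware UTC datetime
as_utc = datetime.fromtimestamp(stamp, tz=timezone.utc)

print(as_utc.isoformat())  # 2020-04-23T22:55:40+00:00
```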

In [3]:
base_url = 'https://api.pushshift.io/reddit/submission/search'

before_timestamp = round(time.time())

all_dfs = []

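for _ in range(10):
    # Request up to 1000 submissions posted before the current cutoff
    params = {
        'subreddit': 'nosleep',
        'size': 1000,
        'after': None,
        'before': before_timestamp
    }

    res = requests.get(base_url, params)

    if res.status_code == 200:
        data = res.json()['data']

        # Stop if the API returns an empty page (data[-1] would raise IndexError)
        if not data:
            break

        # data is a list of dictionaries; the last one is the oldest submission,
        # so its timestamp becomes the cutoff for the next request
        before_timestamp = data[-1]['created_utc']

        df = pd.DataFrame(data)[['title', 'selftext']]
        all_dfs.append(df)

    else:
        # Report the failing status code and stop collecting
        print(f'ERROR CODE FOR RES: {res.status_code}')
        break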

In [4]:
all_dfs

[                                                 title  \
 0                      7 rules for surviving The Crack   
 1    I was followed home one night and got a tape t...   
 2    Don't go looking for the door... or you'll fin...   
 3    i am a prisoner at a special security prison, ...   
 4    my friend told me a scary story 10 years ago, ...   
 ..                                                 ...   
 995                              The Ghost in My House   
 996                         Mister Payne's Marketplace   
 997                   Heading Towards The End Part Two   
 998               Society of the Divine Dragon, part 2   
 999                                  It wasn’t a dream   
 
                                               selftext  
 0    “We have to rewrite it, there is no other poss...  
 1    It was a warm night in a town just outside of ...  
 2    [PART ONE](https://www.reddit.com/r/nosleep/co...  
 3    I am a prisoner at a special security prison ....  


In [5]:
no_sleep = pd.concat(all_dfs)

In [6]:
no_sleep

Unnamed: 0,title,selftext
0,7 rules for surviving The Crack,"“We have to rewrite it, there is no other poss..."
1,I was followed home one night and got a tape t...,It was a warm night in a town just outside of ...
2,Don't go looking for the door... or you'll fin...,[PART ONE](https://www.reddit.com/r/nosleep/co...
3,"i am a prisoner at a special security prison, ...",I am a prisoner at a special security prison ....
4,"my friend told me a scary story 10 years ago, ...","About 10 years ago, on this day, this happened..."
...,...,...
995,My house.,As I’m writing this I still stay in the same h...
996,Im a Retired Priest. Over the years I’ve heard...,| [First Sin](https://www.reddit.com/r/nosleep...
997,Somethings up at work,I work for a company called *̷͕͒̍̔̈̀̿̉͘#̶̡̠̥̏ͅ...
998,My boyfriend won't let me see my friends anymore,I feel like I might be going crazy. I don't kn...


In [7]:
no_sleep.reset_index(drop = True, inplace = True)

In [8]:
no_sleep['subreddit'] = 'nosleep'

In [9]:
no_sleep

Unnamed: 0,title,selftext,subreddit
0,7 rules for surviving The Crack,"“We have to rewrite it, there is no other poss...",nosleep
1,I was followed home one night and got a tape t...,It was a warm night in a town just outside of ...,nosleep
2,Don't go looking for the door... or you'll fin...,[PART ONE](https://www.reddit.com/r/nosleep/co...,nosleep
3,"i am a prisoner at a special security prison, ...",I am a prisoner at a special security prison ....,nosleep
4,"my friend told me a scary story 10 years ago, ...","About 10 years ago, on this day, this happened...",nosleep
...,...,...,...
9995,My house.,As I’m writing this I still stay in the same h...,nosleep
9996,Im a Retired Priest. Over the years I’ve heard...,| [First Sin](https://www.reddit.com/r/nosleep...,nosleep
9997,Somethings up at work,I work for a company called *̷͕͒̍̔̈̀̿̉͘#̶̡̠̥̏ͅ...,nosleep
9998,My boyfriend won't let me see my friends anymore,I feel like I might be going crazy. I don't kn...,nosleep


### Grabbing data from scarystories subreddit


In [11]:
base_url = 'https://api.pushshift.io/reddit/submission/search'

before_timestamp = round(time.time())

all_dfs = []

for _ in range(10):
    # Request up to 1000 submissions posted before the current cutoff
    params = {
        'subreddit': 'scarystories',
        'size': 1000,
        'after': None,
        'before': before_timestamp
    }

    res = requests.get(base_url, params)

    if res.status_code == 200:
        data = res.json()['data']

        # Stop if the API returns an empty page (data[-1] would raise IndexError)
        if not data:
            break

        # data is a list of dictionaries; the last one is the oldest submission,
        # so its timestamp becomes the cutoff for the next request
        before_timestamp = data[-1]['created_utc']

        df = pd.DataFrame(data)[['title', 'selftext']]
        all_dfs.append(df)

    else:
        # Report the failing status code and stop collecting
        print(f'ERROR CODE FOR RES: {res.status_code}')
        break
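
The two collection loops differ only in the subreddit name, and both follow the same pattern: fetch a page, take the timestamp of its last item, and use it as the next `before` cutoff. The sketch below isolates that pattern in a hypothetical helper, `collect_pages`, which is not part of the original notebook. It takes any function that returns a page for a given cutoff, so the logic can be exercised without network access; `fake_fetch` is a stand-in that assumes pages arrive newest first and contain only items strictly older than the cutoff.

```python
from typing import Callable, Dict, List

Submission = Dict[str, object]


def collect_pages(
    fetch_page: Callable[[int], List[Submission]],
    start_before: int,
    max_pages: int = 10,
) -> List[Submission]:
    """Walk backwards in time, moving the cutoff to the oldest item of each page."""
    before = start_before
    rows: List[Submission] = []
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:
            # Nothing older is left, so stop early
            break
        rows.extend(page)
        # The last item of the page is treated as the oldest one
        before = page[-1]["created_utc"]
    return rows


def fake_fetch(before: int) -> List[Submission]:
    """Stand-in for the HTTP call: seven posts with timestamps 7..1, three per page."""
    stamps = [t for t in range(7, 0, -1) if t < before][:3]
    return [{"title": f"post {t}", "selftext": "", "created_utc": t} for t in stamps]


rows = collect_pages(fake_fetch, start_before=8)
print(len(rows))                         # 7
print([r["created_utc"] for r in rows])  # [7, 6, 5, 4, 3, 2, 1]
```

In the notebook itself, `fetch_page` could wrap the existing `requests.get(base_url, params)` call and return `res.json()['data']`, with the subreddit name passed in as an argument.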

In [12]:
all_dfs

[                                           title  \
 0                          The Man In The Corner   
 1    A Dogman Sighting From Osage County, Kansas   
 2                                    bvgn;lxgkzf   
 3                                  Rocking Chair   
 4                              creepy farm house   
 ..                                           ...   
 995                               Yes (Reupload)   
 996                                      Falling   
 997         I actually think my house is haunted   
 998                                   Dream Girl   
 999       Guy talking about way he could kill me   
 
                                               selftext  
 0    A question we usually tend to ask ourselves is...  
 1                                                       
 2                                               ur mom  
 3    When I was a kid, I was always terrified of sl...  
 4    about a year ago i was 16 me and my friend whe...  
 ..           

In [13]:
scary_stories = pd.concat(all_dfs)

In [14]:
scary_stories.reset_index(drop = True, inplace = True)

In [15]:
scary_stories

Unnamed: 0,title,selftext
0,The Man In The Corner,A question we usually tend to ask ourselves is...
1,"A Dogman Sighting From Osage County, Kansas",
2,bvgn;lxgkzf,ur mom
3,Rocking Chair,"When I was a kid, I was always terrified of sl..."
4,creepy farm house,about a year ago i was 16 me and my friend whe...
...,...,...
9995,2nd part of ( What Did i see ? ).,So it has been 2 days that i have seen that do...
9996,Conspiracy Theory,So this isn't that scary. It's a little bit bl...
9997,SCARIEST DARK WEB STORIES,
9998,Prologue,"Steve woke up in the middle of the night, swea..."


In [16]:
scary_stories['subreddit'] = 'scarystories'

In [17]:
scary_stories

Unnamed: 0,title,selftext,subreddit
0,The Man In The Corner,A question we usually tend to ask ourselves is...,scarystories
1,"A Dogman Sighting From Osage County, Kansas",,scarystories
2,bvgn;lxgkzf,ur mom,scarystories
3,Rocking Chair,"When I was a kid, I was always terrified of sl...",scarystories
4,creepy farm house,about a year ago i was 16 me and my friend whe...,scarystories
...,...,...,...
9995,2nd part of ( What Did i see ? ).,So it has been 2 days that i have seen that do...,scarystories
9996,Conspiracy Theory,So this isn't that scary. It's a little bit bl...,scarystories
9997,SCARIEST DARK WEB STORIES,,scarystories
9998,Prologue,"Steve woke up in the middle of the night, swea...",scarystories


### Concatenating the two individual subreddit dataframes into one dataframe

In [19]:
main_df = pd.concat([no_sleep, scary_stories])

In [20]:
main_df.shape

(20000, 3)
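
`pd.concat` keeps each frame's own row labels unless told otherwise, so the combined frame here has 20000 rows but its labels run from 0 twice (the display further down ends at 9998 rather than 19998). The sketch below shows the effect on two toy frames, which are assumptions for illustration and not the real data, and how `ignore_index=True` renumbers the rows.

```python
import pandas as pd

# Toy stand-ins for the two per-subreddit frames
a = pd.DataFrame({"title": ["a0", "a1"], "subreddit": "nosleep"})
b = pd.DataFrame({"title": ["b0", "b1"], "subreddit": "scarystories"})

stacked = pd.concat([a, b])                        # labels are kept as-is
renumbered = pd.concat([a, b], ignore_index=True)  # labels are rebuilt 0..n-1

print(list(stacked.index))               # [0, 1, 0, 1]
print(list(renumbered.index))            # [0, 1, 2, 3]

# With duplicate labels, a label lookup returns one row from each frame
print(stacked.loc[0, "title"].tolist())  # ['a0', 'b0']
```

Whether to renumber depends on how later steps use the index; label-based lookups such as `.loc` are the ones affected by duplicates.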

### Encoding the target variable (nosleep = 1, scarystories = 0)

In [21]:
main_df['subreddit'] = main_df['subreddit'].map({'nosleep':1, 'scarystories':0})
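
`Series.map` with a dictionary turns any value missing from the dictionary into `NaN`, so a quick check after encoding can confirm every row received a label. The sketch below uses a toy series, with one deliberately misspelled entry as an assumption for illustration, to show what an unmapped value looks like.

```python
import pandas as pd

# Toy labels; 'NoSleep' is intentionally not a key in the mapping
labels = pd.Series(["nosleep", "scarystories", "NoSleep"])

encoded = labels.map({"nosleep": 1, "scarystories": 0})

# Count the rows that did not match any key
print(int(encoded.isna().sum()))  # 1
```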

In [22]:
df = main_df.copy()

In [23]:
df

Unnamed: 0,title,selftext,subreddit
0,7 rules for surviving The Crack,"“We have to rewrite it, there is no other poss...",1
1,I was followed home one night and got a tape t...,It was a warm night in a town just outside of ...,1
2,Don't go looking for the door... or you'll fin...,[PART ONE](https://www.reddit.com/r/nosleep/co...,1
3,"i am a prisoner at a special security prison, ...",I am a prisoner at a special security prison ....,1
4,"my friend told me a scary story 10 years ago, ...","About 10 years ago, on this day, this happened...",1
...,...,...,...
9995,2nd part of ( What Did i see ? ).,So it has been 2 days that i have seen that do...,0
9996,Conspiracy Theory,So this isn't that scary. It's a little bit bl...,0
9997,SCARIEST DARK WEB STORIES,,0
9998,Prologue,"Steve woke up in the middle of the night, swea...",0


### Saving the combined raw data to a CSV file

In [24]:
df.to_csv('../datasets/combined_raw_data.csv')
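
By default `to_csv` also writes the row index as an unnamed first column, and reading the file back then produces an `Unnamed: 0` column. The sketch below compares the default with `index=False` on a toy frame written to an in-memory buffer; the frame and buffer names are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

frame = pd.DataFrame({"title": ["t0", "t1"], "subreddit": [1, 0]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)
print(cols_with_index)     # ['Unnamed: 0', 'title', 'subreddit']

# index=False leaves the index out of the file
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)
print(cols_without_index)  # ['title', 'subreddit']
```

If the index is kept in the file, passing `index_col=0` to `pd.read_csv` restores it as the index instead of a data column.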