<div>
<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px" width=100>
</div>

# Project 3: Web APIs & NLP

## Progress so far

In [Part 1](./01_Data_Collection.ipynb) of the project, 500 posts were collected from each of two subreddits, [nosleep](https://www.reddit.com/r/nosleep/) and [paranormal](https://www.reddit.com/r/paranormal/), using the [Pushshift API](https://github.com/pushshift/api), and each subreddit's posts were saved to a separate CSV. In this part, the data is cleaned: the columns needed for the analysis are extracted, and posts that are not relevant are dropped.

## Part 2: EDA and Data Cleaning

### 1. Imports (All imported libraries are added here)

In [1]:
import pandas as pd
import numpy as np
import re

from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

### 2. Importing the Datasets

In [2]:
# Importing the 'nosleep' dataset
df_nosleep = pd.read_csv('../datasets/nosleep_posts.csv')

# Importing the 'paranormal' dataset
df_paranormal = pd.read_csv('../datasets/paranormal_posts.csv') 

### 3. EDA

In [3]:
# Display the top 5 rows and shape for the two dataframes
display(df_nosleep.shape)
display(df_nosleep.head())

display(df_paranormal.shape)
display(df_paranormal.head())

(1000, 72)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,link_flair_template_id,link_flair_text,post_hint,preview,author_flair_background_color,author_flair_text_color,banned_by,url_overridden_by_dest,author_cakeday,distinguished
0,[],False,TheThirteenShadows,,[],,text,t2_gi05b7v5,False,False,...,,,,,,,,,,
1,[],False,ColourlessWind,,[],,text,t2_9jhd8xd2,False,False,...,,,,,,,,,,
2,[],False,No_Speed_1244,,[],,text,t2_9z9puy5t,False,False,...,,,,,,,,,,
3,[],False,Zithero,,[],,text,t2_t2i4w,False,False,...,8beec82a-dcc1-11e8-a09f-0e09eae1a1c0,Series,self,"{'enabled': False, 'images': [{'id': 'j-zztRob...",,,,,,
4,[],False,mir07,,[],,text,t2_z5tf4,False,False,...,,,,,,,,,,


(1000, 79)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,media,media_embed,secure_media,secure_media_embed,is_gallery,author_flair_background_color,author_flair_template_id,author_flair_text_color,media_metadata,author_cakeday
0,[],False,f0rthef0g,,[],,text,t2_fs0ax1ru,False,False,...,,,,,,,,,,
1,[],False,andyf123123,,[],,text,t2_jkthz65,False,False,...,,,,,,,,,,
2,[],False,gopherpoet,,[],,text,t2_dc7hm,False,False,...,,,,,,,,,,
3,[],False,Jacquelinedesire,,[],,text,t2_i2y473zc,False,False,...,,,,,,,,,,
4,[],False,MortemDaKlondikebarr,,[],,text,t2_5zxxwa0s,False,False,...,,,,,,,,,,


In [4]:
# display the column names
display(df_nosleep.columns)
display(df_paranormal.columns)

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'content_categories', 'contest_mode', 'created_utc', 'domain',
       'full_link', 'gildings', 'id', 'is_created_from_ads_ui',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subreddit_type', 'thumbnail'

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id',
       'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
       'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_sub

There are many columns in both dataframes, but not all of them are useful. Since we will be predicting which subreddit a post belongs to, only three columns are needed: `subreddit`, `title` and `selftext`.

In [5]:
# Creating new dataframes based on the chosen columns subreddit, selftext, title
df_sub1 = df_nosleep[['subreddit', 'title', 'selftext']].copy()
df_sub2 = df_paranormal[['subreddit', 'title', 'selftext']].copy()

# displaying the first 5 rows again
display(df_sub1.head())
display(df_sub2.head())

Unnamed: 0,subreddit,title,selftext
0,nosleep,I created a new plant. It's gonna hurt a lot o...,I created a new plant. It’s gonna hurt a lot...
1,nosleep,The people I kill won't stay dead.,I'm not writing this as some sort of confessio...
2,nosleep,"Can’t sleep, and don’t know why - help me",[removed]
3,nosleep,Don't got to the Magic Show at the Gypsy Carni...,[Part 1](https://www.reddit.com/r/nosleep/comm...
4,nosleep,"""Intent: The Truth"" - Randonauting is not that...","""What the hell?!"" Ella exclaims as we walk tow..."


Unnamed: 0,subreddit,title,selftext
0,Paranormal,"How sensitive are you to “energy” in a place, ...","When I was a child, something I learned from w..."
1,Paranormal,"Roars from the dragon below, the ancient serpe...",[removed]
2,Paranormal,Weird Dream - Sort Of...,(Am cross-posting this from r/HighStrangeness ...
3,Paranormal,Michael Jackson close friend recorded This in ...,
4,Paranormal,My friend and I were recording vocals in my ro...,So I live in a 2 bedroom apartment with my fri...


#### 3.1 Missing values

In [6]:
# Checking for missing values
display(df_sub1.isnull().sum())
display(df_sub2.isnull().sum())

subreddit    0
title        0
selftext     5
dtype: int64

subreddit      0
title          0
selftext     144
dtype: int64

In [7]:
# Displaying the missing values in df_sub1
df_sub1[df_sub1['selftext'].isnull()]

Unnamed: 0,subreddit,title,selftext
179,nosleep,Celebrating Christmas in Bethlehem: a double b...,
273,nosleep,I,
425,nosleep,November 2021 Contest Nominations,
550,nosleep,Bishops call for days of prayer for Philippine...,
930,nosleep,Nosleep is looking for new mods!,


In [8]:
# Filtering the posts with no selftext, and a title of less than 10 words
df_sub1[(df_sub1['selftext'].isnull()) & (df_sub1['title'].str.split().str.len() < 10)]

Unnamed: 0,subreddit,title,selftext
179,nosleep,Celebrating Christmas in Bethlehem: a double b...,
273,nosleep,I,
425,nosleep,November 2021 Contest Nominations,
930,nosleep,Nosleep is looking for new mods!,


In [9]:
# Displaying the missing values in df_sub2
df_sub2[df_sub2['selftext'].isnull()]

Unnamed: 0,subreddit,title,selftext
3,Paranormal,Michael Jackson close friend recorded This in ...,
6,Paranormal,The red mask isn’t there in real life.,
7,Paranormal,Weird image taken with my phone.. real or glitch?,
19,Paranormal,Do you see the face behind my 4 year old? What...,
21,Paranormal,What does this mean?,
...,...,...,...
961,Paranormal,This was sent to our team. Very detailed account.,
965,Paranormal,Is this just a conveniently located lense flar...,
967,Paranormal,Is this something paranormal or just my brain ...,
978,Paranormal,Was watching some Netflix and this randomly ap...,


In [10]:
# Filtering the posts with no selftext, and a title of less than 10 words
df_sub2[(df_sub2['selftext'].isnull()) & (df_sub2['title'].str.split().str.len() < 10)]

Unnamed: 0,subreddit,title,selftext
6,Paranormal,The red mask isn’t there in real life.,
7,Paranormal,Weird image taken with my phone.. real or glitch?,
21,Paranormal,What does this mean?,
25,Paranormal,Scary Stories - DEADLINE by Richard Matheson (...,
31,Paranormal,"Shadow in my dads home, see comments for story.",
...,...,...,...
897,Paranormal,WTF is this 😭😭 appeared on bathroom door,
922,Paranormal,Orbs in my moms living room,
949,Paranormal,The duality of Man,
961,Paranormal,This was sent to our team. Very detailed account.,


Each post with a missing `selftext` has a unique `title`. Some of these appear to be image or video posts, whose content does not show up as text, and others may be of little use. For now, posts with a missing `selftext` and a short `title` (fewer than 10 words) will be dropped, and the remaining missing values will be imputed as empty strings.

In [11]:
# Imputing missing values as empty strings
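# (assigning the result back avoids calling fillna(inplace=True) on a selected column,
# which newer pandas versions warn about and may not apply to the dataframe)
df_sub1['selftext'] = df_sub1['selftext'].fillna('')
df_sub2['selftext'] = df_sub2['selftext'].fillna('')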
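
The drop-then-impute rule described above can be sketched on a toy dataframe. The rows below are made up for illustration and `toy` / `toy_kept` are hypothetical names; the actual drop is applied later inside the cleaning function.

```python
import pandas as pd

# Toy stand-in for df_sub1 / df_sub2 (made-up rows, not the real data)
toy = pd.DataFrame({
    'title': ['What does this mean?',
              'A long title with well over ten words in it to keep this post around',
              'Normal post'],
    'selftext': [None, None, 'Some body text'],
})

# Rows with no selftext AND a title shorter than 10 words
short_and_empty = toy['selftext'].isnull() & (toy['title'].str.split().str.len() < 10)

# Drop those rows, then impute the remaining missing selftext with ''
toy_kept = toy[~short_and_empty].copy()
toy_kept['selftext'] = toy_kept['selftext'].fillna('')

print(toy_kept)
```

Only the first row is dropped: the second row also has no `selftext`, but its long title is kept and its `selftext` becomes an empty string.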

#### 3.2 Posts of Interest
Other than the posts with missing `selftext`, there are many other posts with `[removed]` in their `selftext`. 

In [12]:
# Displaying the posts with removed selftext in df_sub1
df_sub1[df_sub1['selftext'] == '[removed]']

Unnamed: 0,subreddit,title,selftext
2,nosleep,"Can’t sleep, and don’t know why - help me",[removed]
8,nosleep,"If I die, please avenge me",[removed]
9,nosleep,The Story Generator,[removed]
14,nosleep,plagued by doubts,[removed]
15,nosleep,plagued by doubts,[removed]
...,...,...,...
986,nosleep, The first ever  Gaming and  NFT $Elite Fo...,[removed]
989,nosleep, The first ever  Gaming and  NFT $Elite Fo...,[removed]
990,nosleep, The first ever  Gaming and  NFT $Elite Fo...,[removed]
995,nosleep, The first ever  Gaming and  NFT $Elite Fo...,[removed]


In [13]:
# Displaying the posts with removed selftext in df_sub2
df_sub2[df_sub2['selftext'] == '[removed]']

Unnamed: 0,subreddit,title,selftext
1,Paranormal,"Roars from the dragon below, the ancient serpe...",[removed]
14,Paranormal,Continuing My Relationship with My Deceased Fa...,[removed]
16,Paranormal,"Cryptids, mimics, skinwalkers",[removed]
22,Paranormal,Hunting went Terrifying?,[removed]
23,Paranormal,wtf is erratas?,[removed]
...,...,...,...
980,Paranormal,Castle Zvikov on Google Maps,[removed]
981,Paranormal,Grandfather's Voice?,[removed]
982,Paranormal,"So I know people will read this, this is power...",[removed]
987,Paranormal,Has anyone been to the Villisca Axe Murder House,[removed]


These posts with `[removed]` in their `selftext` were removed by the moderators of the subreddits. This suggests that they were spam or did not meet the requirements of the subreddit, so it is prudent to drop them from the data, as they will not be useful to the analysis.
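
As a minimal sketch of that removal, a boolean mask can both count and drop the affected rows. The two rows below are toy data and `removed_mask` / `toy_kept` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for one subreddit's dataframe (made-up rows)
toy = pd.DataFrame({
    'title': ['plagued by doubts', 'The Rest Stop'],
    'selftext': ['[removed]', 'I had been driving for a few hours'],
})

# True for posts whose body was replaced by the [removed] placeholder
removed_mask = toy['selftext'] == '[removed]'
print(removed_mask.sum())  # how many posts would be dropped

# Keep everything else and renumber the rows
toy_kept = toy[~removed_mask].reset_index(drop=True)
```

Counting before dropping makes it easy to see how much of each subreddit is lost at this step.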

#### 3.3 Posts with duplicate titles in selftext
A redditor may not have decided on a `title` in advance when making a post. In these cases, the redditor may simply reuse the first few lines of the `selftext` as the `title`.

In [14]:
# Flagging the posts whose selftext starts with the title, for the two dataframes
# (.loc[i, 'col'] is used instead of chained indexing, which may assign to a temporary copy)
df_sub1['dupe_title'] = ""
for i in df_sub1.index:
    if df_sub1.loc[i, 'selftext'].strip().startswith(df_sub1.loc[i, 'title'].strip()):
        df_sub1.loc[i, 'dupe_title'] = 'dupe'
    else:
        df_sub1.loc[i, 'dupe_title'] = 'not_dupe'

df_sub2['dupe_title'] = ""
for i in df_sub2.index:
    if df_sub2.loc[i, 'selftext'].strip().startswith(df_sub2.loc[i, 'title'].strip()):
        df_sub2.loc[i, 'dupe_title'] = 'dupe'
    else:
        df_sub2.loc[i, 'dupe_title'] = 'not_dupe'

In [15]:
df_sub1[df_sub1['dupe_title'] == 'dupe']

Unnamed: 0,subreddit,title,selftext,dupe_title
186,nosleep,Has anyone ever heard of what happens on the f...,Has anyone ever heard of what happens on the f...,dupe
203,nosleep,The Snatcher,The Snatcher\n\nA soft rain covered the quiet ...,dupe
313,nosleep,Lyrium is where the broken goes,\n\nLyrium is where the broken goes.\n\nIt wa...,dupe
382,nosleep,My Best Friend,"My Best Friend\n\nMy name is Lucy, my best fri...",dupe
391,nosleep,Lyrium is where the broken goes,\n\nLyrium is where the broken goes. \n\nIt ...,dupe
510,nosleep,October 5th,"October 5th, 2003, a date that never seems to ...",dupe
611,nosleep,I will never ride a malfunctioning elevator again,I will never ride a malfunctioning elevator ...,dupe
749,nosleep,THE SNOWMEN HUNT AT MIDNIGHT,THE SNOWMEN HUNT AT MIDNIGHT\n\nWe found the m...,dupe
797,nosleep,The problem with Jeremy,The problem with Jeremy was never an obvious o...,dupe
798,nosleep,The problem with Jeremy,The problem with Jeremy was never an obvious ...,dupe


In [16]:
df_sub2[df_sub2['dupe_title'] == 'dupe']

Unnamed: 0,subreddit,title,selftext,dupe_title
27,Paranormal,I’m in your walls,"I’m in your walls, I’m in your walls, I’m in y...",dupe
527,Paranormal,I think I met the devil,I think I met the devil.\n\nI think I had a 2 ...,dupe


Since these titles are repeated at the start of the `selftext`, they will be left out when the `title` and `selftext` are combined for analysis; otherwise the repeated words would be counted twice and bias the word counts.
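
One way to apply this rule without a row-by-row loop is sketched below. It assumes the same "selftext starts with title" check used above; the rows are toy data and `is_dupe` is a hypothetical name.

```python
import pandas as pd

# Toy rows: the first repeats its title at the start of the selftext
toy = pd.DataFrame({
    'title': ['The Snatcher', 'Haunted house'],
    'selftext': ['The Snatcher\n\nA soft rain covered the quiet town.',
                 'Five years ago on a cold winters evening'],
})

# True where the stripped selftext starts with the stripped title
is_dupe = pd.Series(
    [body.strip().startswith(title.strip())
     for title, body in zip(toy['title'], toy['selftext'])],
    index=toy.index,
)

# Keep title + selftext for normal rows; fall back to selftext alone for flagged rows
combined = toy['title'] + ' ' + toy['selftext']
toy['text'] = combined.where(~is_dupe, toy['selftext'])
```

`Series.where` keeps the combined text where the condition holds (not a dupe) and substitutes the plain `selftext` elsewhere, so the title words are counted only once.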

#### 3.4 Duplicated Posts

Duplicated posts occur when a redditor accidentally submits the same post more than once, or decides to post it again to attract more attention.

In [17]:
# Listing the duplicated posts with respect to selftext, 
# note that [removed] and empty selftexts are excluded since they are dealt with earlier
df_sub1[df_sub1['selftext'].duplicated() & (df_sub1['selftext'] != '[removed]') & (df_sub1['selftext'] != '')]

Unnamed: 0,subreddit,title,selftext,dupe_title
113,nosleep,I used to work at a gas station at night. This...,To answer some of your questions I have never ...,not_dupe
134,nosleep,Autostop,It was 1 am when I finished my shift. I worked...,not_dupe
251,nosleep,"Eleven Years Ago, I Did Something Horrible. An...","\n\nDear Anonymous Friend,\n\nCan you keep a...",not_dupe
257,nosleep,Depression has Manifested...and it whispers to...,"\nDay 1 \n""We are happiness..."" a voice whisp...",not_dupe
399,nosleep,HELP ME!! I THINK MY TV IS TRYING TO KILL ME!!...,***Part I of V***\n\nI’d been really sad latel...,not_dupe
467,nosleep,The Body (Inside the Mind of a Serial Killer),I was sitting on the recliner in my living roo...,not_dupe
527,nosleep,A Silent Night.,"There was a thin layer of hoar frost outside, ...",not_dupe
625,nosleep,What Hides in Dark Skies (Part 1 of 2),In all my years of interviewing everyday peopl...,not_dupe
884,nosleep,The Rest Stop,I had been driving for a few hours before my s...,not_dupe
885,nosleep,The Rest Stop,I had been driving for a few hours before my s...,not_dupe


In [18]:
# Listing the duplicated posts with respect to title
# note that [removed] selftexts are excluded since they are dealt with earlier
df_sub1[df_sub1['title'].duplicated() & (df_sub1['selftext'] != '[removed]')]

Unnamed: 0,subreddit,title,selftext,dupe_title
86,nosleep,I used to work at a gas station in the middle ...,Not all of the customers that used to come in ...,not_dupe
91,nosleep,What really happened?,"So, this is the first time I’ve ever openly s...",not_dupe
112,nosleep,I used to work at a gas station in the middle ...,To answer some of your questions I have never ...,not_dupe
132,nosleep,Dad.,My Dad always kept a dark secret.\n\nOne that ...,not_dupe
190,nosleep,Haunted house,Five years ago on a cold winters evening I was...,not_dupe
257,nosleep,Depression has Manifested...and it whispers to...,"\nDay 1 \n""We are happiness..."" a voice whisp...",not_dupe
329,nosleep,Don't open your eyes,I struggle with having nightmares. But recentl...,not_dupe
383,nosleep,My Best Friend,"My name is Lucy, my best friend Carol just upd...",not_dupe
391,nosleep,Lyrium is where the broken goes,\n\nLyrium is where the broken goes. \n\nIt ...,dupe
394,nosleep,"My School is a Prison, there are indescribable...","\n\nMy name is Zach, and I'm 14, and my schoo...",not_dupe


In [19]:
# Listing the duplicated posts with respect to selftext, 
# note that [removed] and empty selftexts are excluded since they are dealt with earlier
df_sub2[df_sub2['selftext'].duplicated() & (df_sub2['selftext'] != '[removed]') & (df_sub2['selftext'] != '')]

Unnamed: 0,subreddit,title,selftext,dupe_title
148,Paranormal,Piano playing at Grandparents while alone,So back in 2010 my grandma had a severe stroke...,not_dupe
422,Paranormal,I just had some weird Deja Vu during the Matrix 4,"Before I start, normally I would be eh so what...",not_dupe
795,Paranormal,How do I get this out of my house?,"This is crazy, but I just had an experience. I...",not_dupe
939,Paranormal,Boot full of popcorn kernels,So a couple of months ago I went through all m...,not_dupe


In [20]:
# Listing the duplicated posts with respect to title
# note that [removed] selftexts are excluded since they are dealt with earlier
df_sub2[df_sub2['title'].duplicated() & (df_sub2['selftext'] != '[removed]')]

Unnamed: 0,subreddit,title,selftext,dupe_title
26,Paranormal,"Roars from the dragon below, the ancient serpe...",Have you ever taken the idea of sin seriously?...,not_dupe
233,Paranormal,Pareidolia Or Paranormal?,Working a cold missing hiker case in ID: We en...,not_dupe
273,Paranormal,"PSA: Here’s how I get rid of ghosts, demons, a...","Been lurking here for a bit, and I’ve noticed ...",not_dupe
360,Paranormal,A negative experience in a rented house,*This is a real story and I wouldn't consider ...,not_dupe
390,Paranormal,Stubborn mom from beyond,"To properly understand this story, you have to...",not_dupe
586,Paranormal,Weird encounter while half awake,"Hello, so I had a really weird encounter the o...",not_dupe
704,Paranormal,Sleep paralysis,This event happened at least two weeks ago. It...,not_dupe
712,Paranormal,I think my house is haunted,"I've always believes in ghosts, like my whole ...",not_dupe
793,Paranormal,Creepy experience,"Honestly, I don't know what to make of this ex...",not_dupe
795,Paranormal,How do I get this out of my house?,"This is crazy, but I just had an experience. I...",not_dupe


Duplicate posts will lead to data leakage if the same post appears in both the training and testing datasets later on, so they have to be removed.

For duplicated posts with the same `selftext`, the duplication matters because the `selftext` holds most of the words in a post, so repeating it biases the data.

For duplicated posts with the same `title`, the redditor may have made slight alterations to the `selftext` when reposting while keeping the title unchanged. Removing these posts serves the same purpose as above.
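
The two de-duplication rules can be sketched together on toy data. The rows are made up, and `dupe_body` / `deduped` are hypothetical names. Empty `selftext` values are deliberately not treated as duplicates of each other, since image or video posts all share an empty body.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['The Rest Stop',      # original post
              'The Rest Stop',      # repost: same title, slightly altered body
              'Rest stop story',    # repost: new title, same body
              'Photo post',         # empty body
              'Another photo'],     # empty body, different title
    'selftext': ['I had been driving',
                 'I had been driving for hours',
                 'I had been driving',
                 '',
                 ''],
})

# Repeated selftext, ignoring empty strings
dupe_body = toy['selftext'].duplicated() & (toy['selftext'] != '')

# Drop repeated bodies first, then keep the first post for each title
deduped = toy[~dupe_body].drop_duplicates(subset=['title'])

print(deduped.index.tolist())  # [0, 3, 4]
```

Both reposts are removed while the two distinct photo posts survive, which is the behaviour the text above asks for.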

#### 3.5 Posts with urls
Some posts contain url links in their `selftext` or `title`.

In [21]:
# Displaying the posts with url links in selftext and title for the two dataframes
# (raw strings keep '\.' as a regex escape rather than a Python string escape)
display(df_sub1[df_sub1['title'].str.contains(r'http|\.com')])
display(df_sub1[df_sub1['selftext'].str.contains(r'http|\.com')])

Unnamed: 0,subreddit,title,selftext,dupe_title
427,nosleep,mortis .com,[removed],not_dupe
536,nosleep,https://youtu.be/ZTHDEqNjpTg,[removed],not_dupe


Unnamed: 0,subreddit,title,selftext,dupe_title
3,nosleep,Don't got to the Magic Show at the Gypsy Carni...,[Part 1](https://www.reddit.com/r/nosleep/comm...,not_dupe
22,nosleep,I didn’t know we had a basement. Update: My no...,Let me tell y’all about the weird-ass Christm...,not_dupe
27,nosleep,"My wife went out to the shore, and didn’t retu...",[Part 3](https://www.reddit.com/r/nosleep/comm...,not_dupe
44,nosleep,I tried to keep the vultures from my daughter,My daughter was all grown up. It was a challe...,not_dupe
49,nosleep,I joined the adult's table for the first time ...,"""It's not blood that binds us but tradition. C...",not_dupe
...,...,...,...,...
956,nosleep,I'm a Deliveryman for Monsters. Some Customers...,"Hi, my name's Jay and I'm a delivery guy. \n\...",not_dupe
958,nosleep,"I was visited by a ghost, had a monster in my ...","You think the title is funny and clickbaity, n...",not_dupe
960,nosleep,Don't got to the Magic Show at the Gypsy Carni...,I’m typing this with one good hand and another...,not_dupe
977,nosleep,True Love Is When You Stalk People and You Can...,"Ah love. What even is that? I, personally, wi...",not_dupe


In [22]:
display(df_sub2[df_sub2['title'].str.contains(r'http|\.com')])
display(df_sub2[df_sub2['selftext'].str.contains(r'http|\.com')])

Unnamed: 0,subreddit,title,selftext,dupe_title


Unnamed: 0,subreddit,title,selftext,dupe_title
15,Paranormal,Trial Run for Official r/Paranormal Discord Se...,We're going to try a Discord Server for the r/...,not_dupe
57,Paranormal,The house spirit's favorite toy,https://youtu.be/JTqWtPyHOfw\n\nThis bookcase ...,not_dupe
67,Paranormal,I made a previous post years ago here that hel...,I believe that I stated that I would post anot...,not_dupe
84,Paranormal,I keep seeing the same thing at the same spot ...,"hey everyone,this is my first post here,i join...",not_dupe
89,Paranormal,Am I overreacting? Or is there something here?,https://imgur.com/a/JuXoyL2\n\nThis is the vid...,not_dupe
102,Paranormal,"HELP... Rotten smells, overt encounters, sensi...",Here's a post I made last week when the entity...,not_dupe
138,Paranormal,Sad update to my haunted childhood home,"So last year, I wrote about [good spirits in m...",not_dupe
149,Paranormal,Strange Wooden Broom?,This is my first post to Reddit so if layout i...,not_dupe
181,Paranormal,Movement of Christmas Tree topper caught on Ho...,This was the battery pack for our Star on the ...,not_dupe
183,Paranormal,my friend is having paranormal experiences,ok so first he was watching youtube when his d...,not_dupe


Since these urls would inflate the number of features when `CountVectorizer` is applied, it is prudent to remove them first (i.e. get rid of potential tokens such as http, www and com).
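
To illustrate how a single url adds tokens, the sketch below splits a sentence on `\w+` (the same pattern later passed to `RegexpTokenizer`) before and after stripping the link. `URL_PATTERN` is a deliberately simplified, hypothetical pattern for illustration, not the regex used in the cleaning function.

```python
import re

# Simplified url pattern (an assumption for this sketch): http:// or https://
# followed by everything up to the next whitespace or closing bracket
URL_PATTERN = re.compile(r'https?://[^\s)\]]+')

sample = "This is the video https://imgur.com/a/JuXoyL2 of the bookcase"

# Tokens when the url is left in the text
tokens_with_url = re.findall(r'\w+', sample.lower())

# Tokens after the url has been stripped out
tokens_without_url = re.findall(r'\w+', URL_PATTERN.sub('', sample).lower())

# Tokens that exist only because of the url
print(sorted(set(tokens_with_url) - set(tokens_without_url)))
```

The url alone contributes fragments such as `https`, `imgur` and `com`, none of which says anything about the content of the post.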

#### 3.6 Combining the `title` and the `selftext`
For this project, the analysis is based on both the `title` and the `selftext` of each post, so it is easier to combine these two columns into a single column.

In [23]:
# creating the text column from the two columns
df_sub1['text'] = df_sub1['title'] + ' ' + df_sub1['selftext']
df_sub2['text'] = df_sub2['title'] + ' ' + df_sub2['selftext']

# displaying the top 5 rows again
display(df_sub1.head())
display(df_sub2.head())

Unnamed: 0,subreddit,title,selftext,dupe_title,text
0,nosleep,I created a new plant. It's gonna hurt a lot o...,I created a new plant. It’s gonna hurt a lot...,not_dupe,I created a new plant. It's gonna hurt a lot o...
1,nosleep,The people I kill won't stay dead.,I'm not writing this as some sort of confessio...,not_dupe,The people I kill won't stay dead. I'm not wri...
2,nosleep,"Can’t sleep, and don’t know why - help me",[removed],not_dupe,"Can’t sleep, and don’t know why - help me [rem..."
3,nosleep,Don't got to the Magic Show at the Gypsy Carni...,[Part 1](https://www.reddit.com/r/nosleep/comm...,not_dupe,Don't got to the Magic Show at the Gypsy Carni...
4,nosleep,"""Intent: The Truth"" - Randonauting is not that...","""What the hell?!"" Ella exclaims as we walk tow...",not_dupe,"""Intent: The Truth"" - Randonauting is not that..."


Unnamed: 0,subreddit,title,selftext,dupe_title,text
0,Paranormal,"How sensitive are you to “energy” in a place, ...","When I was a child, something I learned from w...",not_dupe,"How sensitive are you to “energy” in a place, ..."
1,Paranormal,"Roars from the dragon below, the ancient serpe...",[removed],not_dupe,"Roars from the dragon below, the ancient serpe..."
2,Paranormal,Weird Dream - Sort Of...,(Am cross-posting this from r/HighStrangeness ...,not_dupe,Weird Dream - Sort Of... (Am cross-posting thi...
3,Paranormal,Michael Jackson close friend recorded This in ...,,not_dupe,Michael Jackson close friend recorded This in ...
4,Paranormal,My friend and I were recording vocals in my ro...,So I live in a 2 bedroom apartment with my fri...,not_dupe,My friend and I were recording vocals in my ro...


#### 3.7 Setting the target class
Since this is a classification problem, the target column `subreddit` needs numeric labels. We will set `nosleep` to 1 and `Paranormal` to 0.

In [24]:
# target column of subreddit, with values of nosleep and Paranormal
# setting nosleep to 1, Paranormal to 0
df_sub1['subreddit'] = df_sub1['subreddit'].map({'nosleep': 1, 'Paranormal': 0})
df_sub2['subreddit'] = df_sub2['subreddit'].map({'nosleep': 1, 'Paranormal': 0})

# displaying the top 5 rows again
display(df_sub1.head())
display(df_sub2.head())

Unnamed: 0,subreddit,title,selftext,dupe_title,text
0,1,I created a new plant. It's gonna hurt a lot o...,I created a new plant. It’s gonna hurt a lot...,not_dupe,I created a new plant. It's gonna hurt a lot o...
1,1,The people I kill won't stay dead.,I'm not writing this as some sort of confessio...,not_dupe,The people I kill won't stay dead. I'm not wri...
2,1,"Can’t sleep, and don’t know why - help me",[removed],not_dupe,"Can’t sleep, and don’t know why - help me [rem..."
3,1,Don't got to the Magic Show at the Gypsy Carni...,[Part 1](https://www.reddit.com/r/nosleep/comm...,not_dupe,Don't got to the Magic Show at the Gypsy Carni...
4,1,"""Intent: The Truth"" - Randonauting is not that...","""What the hell?!"" Ella exclaims as we walk tow...",not_dupe,"""Intent: The Truth"" - Randonauting is not that..."


Unnamed: 0,subreddit,title,selftext,dupe_title,text
0,0,"How sensitive are you to “energy” in a place, ...","When I was a child, something I learned from w...",not_dupe,"How sensitive are you to “energy” in a place, ..."
1,0,"Roars from the dragon below, the ancient serpe...",[removed],not_dupe,"Roars from the dragon below, the ancient serpe..."
2,0,Weird Dream - Sort Of...,(Am cross-posting this from r/HighStrangeness ...,not_dupe,Weird Dream - Sort Of... (Am cross-posting thi...
3,0,Michael Jackson close friend recorded This in ...,,not_dupe,Michael Jackson close friend recorded This in ...
4,0,My friend and I were recording vocals in my ro...,So I live in a 2 bedroom apartment with my fri...,not_dupe,My friend and I were recording vocals in my ro...
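
A quick sanity check on the label mapping is sketched below. `Series.map` with a dictionary yields `NaN` for any value that is not a key, so a casing mismatch (for example `paranormal` instead of `Paranormal`) would silently produce missing labels. The `labels` series is toy data with one deliberate mismatch.

```python
import pandas as pd

# Toy labels; the last value is a deliberate casing mismatch
labels = pd.Series(['nosleep', 'Paranormal', 'paranormal'])

mapped = labels.map({'nosleep': 1, 'Paranormal': 0})

# Any NaN here means a label was not covered by the mapping
unmapped = int(mapped.isnull().sum())
print(unmapped)  # 1
```

Running the same `isnull().sum()` check on the real `subreddit` column after mapping would confirm that every post received a label.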


### 4. Data Cleaning

#### 4.1 Functions for data cleaning

In [25]:
# Function for removing links
def remove_urls(text):
    text = re.sub(r"(https?:\/\/(www\.)?[a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:;%_\+.~#?&//=]*))", "", text)
    return text

In [26]:
# Function for cleaning the dataframe
def clean_df(dataframe):    
    ## Selecting the chosen columns for the dataframe
    dataframe = dataframe[['subreddit', 'selftext', 'title']].copy()
    
    # Dropping the rows with missing selftext and short titles
    dataframe.drop(dataframe[(dataframe['selftext'].isnull()) & (dataframe['title'].str.split().str.len() < 10)].index, inplace = True)
    
    # Imputing the missing values in the selftext column with empty strings
    dataframe['selftext'] = dataframe['selftext'].fillna('')
    
    # Dropping the rows with selftext [removed]
    dataframe.drop(dataframe[dataframe['selftext'] == '[removed]'].index, inplace = True)
    
    # Dropping duplicate rows
    dataframe.drop(dataframe[dataframe['selftext'].duplicated() & (dataframe['selftext'] != '')].index, inplace = True)
    dataframe.drop_duplicates(subset = ['title'], inplace=True)
    
    # Creating the text column from the title and selftext column
    dataframe['text'] = dataframe['title'] + ' ' + dataframe['selftext']
    dataframe.reset_index(drop=True, inplace=True) # Needed since rows are dropped
    
    ## To remove the dupe title in selftext  
    # creating a new column for dupe titles in selftext
    # (.loc[i, 'col'] avoids chained indexing, which may assign to a temporary copy)
    dataframe['dupe_title'] = ""
    for i in dataframe.index:
        if dataframe.loc[i, 'selftext'].strip().startswith(dataframe.loc[i, 'title'].strip()):
            dataframe.loc[i, 'dupe_title'] = 'dupe'
        else:
            dataframe.loc[i, 'dupe_title'] = 'not_dupe'
    
    # reverting the text column to selftext for the dupe rows        
    for i in dataframe[dataframe['dupe_title'] == 'dupe'].index:
        dataframe.loc[i, 'text'] = dataframe.loc[i, 'selftext']

    # Remove url links in text
    dataframe['text'] = dataframe['text'].map(remove_urls)   
    
    # reclassing target column of subreddit, with values of nosleep and Paranormal, setting nosleep to 1, Paranormal to 0
    dataframe['subreddit'] = dataframe['subreddit'].map({'nosleep': 1, 'Paranormal': 0})
    
    return dataframe[['subreddit','text']]

#### 4.2 Cleaning the DataFrames

In [27]:
# Running the data-cleaning functions
df_nosleep_clean = clean_df(df_nosleep)
df_paranormal_clean = clean_df(df_paranormal)

In [28]:
# Displaying the shapes of the cleaned dataframes
display(df_nosleep_clean.shape)
display(df_paranormal_clean.shape)

(625, 2)

(825, 2)

In [29]:
# Displaying the top 5 rows
display(df_nosleep_clean.head())
display(df_paranormal_clean.head())

Unnamed: 0,subreddit,text
0,1,I created a new plant. It's gonna hurt a lot o...
1,1,The people I kill won't stay dead. I'm not wri...
2,1,Don't got to the Magic Show at the Gypsy Carni...
3,1,"""Intent: The Truth"" - Randonauting is not that..."
4,1,Accused in the Woods I have had the privilege ...


Unnamed: 0,subreddit,text
0,0,"How sensitive are you to “energy” in a place, ..."
1,0,Weird Dream - Sort Of... (Am cross-posting thi...
2,0,Michael Jackson close friend recorded This in ...
3,0,My friend and I were recording vocals in my ro...
4,0,I remember when I was 5 clearly As a child I w...


In [30]:
# Concatenate the two dataframes to create one single dataframe
df_clean = pd.concat([df_nosleep_clean, df_paranormal_clean])
df_clean.shape

(1450, 2)

In [31]:
# save to csv
df_clean.to_csv('../datasets/data_clean.csv', index=False)

### 5. Pre-processing

When dealing with text data, the following are common pre-processing steps, though it should be noted that not all of these steps would be used.

- Remove special characters
- Tokenizing
- Lemmatizing/Stemming
- Stop word removal

For now, we will do these pre-processing steps on a single post to see how it goes.
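
Before going through the NLTK steps one at a time, the overall flow can be sketched in plain Python. This is a simplified stand-in, not the notebook's pipeline: `STOP_WORDS` is a tiny hypothetical list rather than NLTK's English stop words, `preprocess` is a hypothetical helper, and lemmatizing/stemming is left out. Subreddit names are filtered as well, since they would leak the label.

```python
import re

# Tiny illustrative stop word set (an assumption; NLTK's English list is much longer)
STOP_WORDS = {'i', 'a', 'it', 's', 'of', 'the', 'to', 'and'}

# Subreddit names would give the label away, so they are removed too
LEAKAGE_WORDS = {'nosleep', 'paranormal'}

def preprocess(text, stop_words=STOP_WORDS | LEAKAGE_WORDS):
    """Lowercase, tokenize on word characters, and drop stop words."""
    # \w+ keeps runs of letters, digits and underscores, which also strips punctuation
    tokens = re.findall(r'\w+', text.lower())
    return [tok for tok in tokens if tok not in stop_words]

print(preprocess("I created a new plant. It's gonna hurt a lot of people."))
# ['created', 'new', 'plant', 'gonna', 'hurt', 'lot', 'people']
```

Wrapping the steps in one function makes it straightforward to apply the same treatment to every post later, for example through `df['text'].map(preprocess)`.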

In [32]:
# Instantiate RegExpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

# Instantiate lemmatizer
lemmatizer = WordNetLemmatizer()

# Instantiate PorterStemmer
p_stemmer = PorterStemmer()

In [33]:
# Import the dataset again
df = pd.read_csv('../datasets/data_clean.csv')

#### 5.1 Tokenizing and Removal of special characters

In [34]:
# Run the tokenizer on the first post
tokens_0 = tokenizer.tokenize(df['text'][0].lower())
print(tokens_0)

['i', 'created', 'a', 'new', 'plant', 'it', 's', 'gonna', 'hurt', 'a', 'lot', 'of', 'people', 'i', 'created', 'a', 'new', 'plant', 'it', 's', 'gonna', 'hurt', 'a', 'lot', 'of', 'people', 'i', 'am', 'a', 'scientist', 'a', 'genetic', 'engineer', 'working', 'in', 'a', 'secret', 'facility', 'hidden', 'deep', 'within', 'a', 'remote', 'area', 'for', 'privacy', 'reasons', 'not', 'to', 'mention', 'the', 'safety', 'of', 'the', 'general', 'public', 'and', 'my', 'colleagues', 'reputations', 'i', 'will', 'not', 'disclose', 'the', 'exact', 'location', 'of', 'the', 'site', 'the', 'facility', 'was', 'small', 'with', 'two', 'sleeping', 'chambers', 'fit', 'for', 'housing', 'at', 'most', 'three', 'people', 'each', 'alongside', 'a', 'testing', 'chamber', 'nearby', 'a', 'common', 'room', 'fitted', 'with', 'uncomfortable', 'dull', 'couches', 'and', 'a', 'wooden', 'ebony', 'table', 'the', 'cafeteria', 'was', 'connected', 'to', 'it', 'by', 'a', 'narrow', 'hallway', 'and', 'had', 'enough', 'food', 'to', 'supp

#### 5.2 Stop Words Removal

In [35]:
# Remove the stop words from the tokens, along with leakage words (the names of the subreddits)
print(len(tokens_0))
leakage_words = ['nosleep', 'paranormal']
tokens_0_stop = [token for token in tokens_0 if token not in stopwords.words("english") + leakage_words]
print(len(tokens_0_stop))

1690
837


#### 5.3 Lemmatizing

In [36]:
# lemmatize the tokens
tokens_0_stoplem = [lemmatizer.lemmatize(i) for i in tokens_0_stop]
print(tokens_0_stoplem)

['created', 'new', 'plant', 'gonna', 'hurt', 'lot', 'people', 'created', 'new', 'plant', 'gonna', 'hurt', 'lot', 'people', 'scientist', 'genetic', 'engineer', 'working', 'secret', 'facility', 'hidden', 'deep', 'within', 'remote', 'area', 'privacy', 'reason', 'mention', 'safety', 'general', 'public', 'colleague', 'reputation', 'disclose', 'exact', 'location', 'site', 'facility', 'small', 'two', 'sleeping', 'chamber', 'fit', 'housing', 'three', 'people', 'alongside', 'testing', 'chamber', 'nearby', 'common', 'room', 'fitted', 'uncomfortable', 'dull', 'couch', 'wooden', 'ebony', 'table', 'cafeteria', 'connected', 'narrow', 'hallway', 'enough', 'food', 'supply', 'u', 'month', 'purpose', 'facility', 'design', 'new', 'specie', 'plant', 'life', 'though', 'meant', 'solution', 'global', 'warming', 'plant', 'photosynthesized', 'much', 'faster', 'rate', 'compared', 'others', 'placed', 'strategically', 'across', 'globe', 'accompanied', 'special', 'growth', 'enzyme', 'could', 'theory', 'remove', 'e

#### 5.4 Stemming

In [37]:
# stem the tokens
tokens_0_stopstem = [p_stemmer.stem(i) for i in tokens_0_stop]
print(tokens_0_stopstem)

['creat', 'new', 'plant', 'gonna', 'hurt', 'lot', 'peopl', 'creat', 'new', 'plant', 'gonna', 'hurt', 'lot', 'peopl', 'scientist', 'genet', 'engin', 'work', 'secret', 'facil', 'hidden', 'deep', 'within', 'remot', 'area', 'privaci', 'reason', 'mention', 'safeti', 'gener', 'public', 'colleagu', 'reput', 'disclos', 'exact', 'locat', 'site', 'facil', 'small', 'two', 'sleep', 'chamber', 'fit', 'hous', 'three', 'peopl', 'alongsid', 'test', 'chamber', 'nearbi', 'common', 'room', 'fit', 'uncomfort', 'dull', 'couch', 'wooden', 'eboni', 'tabl', 'cafeteria', 'connect', 'narrow', 'hallway', 'enough', 'food', 'suppli', 'us', 'month', 'purpos', 'facil', 'design', 'new', 'speci', 'plant', 'life', 'though', 'meant', 'solut', 'global', 'warm', 'plant', 'photosynthes', 'much', 'faster', 'rate', 'compar', 'other', 'place', 'strateg', 'across', 'globe', 'accompani', 'special', 'growth', 'enzym', 'could', 'theori', 'remov', 'enough', 'carbon', 'dioxid', 'atmospher', 'return', 'earth', 'natur', 'temperatur',

#### 5.5 Functions for pre-processing

In [38]:
# Function for Tokenizer and StopWords Removal
def text_stoptokens(text):
    # Instantiate RegexpTokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    
    # generate the tokens
    tokens = tokenizer.tokenize(text.lower())
    
    # Build the stop word list: the usual English stop words plus the names of the subreddits
    # The extra entries ('amp', 'x200b', 'nbsp') are fragments of spacing/HTML entities left behind by the tokenizer
    leakage_words = ['nosleep', 'paranormal', 'amp', 'x200b', 'nbsp']
    tokens_stop = [token for token in tokens if token not in stopwords.words("english") + leakage_words]
    
    return tokens_stop

In [39]:
# Function for Lemmatization
def text_lem(tokens):
    # Instantiate lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # lemmatize the tokens
    words = [lemmatizer.lemmatize(token) for token in tokens]
    
    return words

In [40]:
# Function for stemming
def text_stem(tokens):
    # Instantiate PorterStemmer
    p_stemmer = PorterStemmer()
    
    # stem the tokens
    words = [p_stemmer.stem(token) for token in tokens]
    
    return words

In [41]:
# Function for joining the words in list back to string
def list_to_string(tokens):
    return ' '.join(tokens)

#### 5.6 Preprocessing the data
To pre-process the full dataset, we first tokenize each post and remove stop words, then apply lemmatizing and stemming separately to the resulting tokens.

In [42]:
%%time
# Pre-processing the dataframe by creating new column after tokenization and stop words removal
df['text_stop'] = df['text'].map(text_stoptokens)

Wall time: 6min 18s
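Most of the runtime above likely comes from `text_stoptokens` rebuilding the stop word list for every token and then searching that list linearly. A minimal sketch of a faster variant follows, building the lookup once as a set. To keep it self-contained, `re.findall` stands in for `RegexpTokenizer`, and `BASE_STOP_WORDS` is a small hypothetical sample; in the notebook it would be `stopwords.words("english")`. The name `text_stoptokens_fast` is also illustrative.

```python
import re

# Hypothetical stand-in for stopwords.words("english") from NLTK
BASE_STOP_WORDS = ["i", "a", "it", "s", "of", "the", "and", "to"]

# Subreddit names and entity fragments, as in text_stoptokens
LEAKAGE_WORDS = ["nosleep", "paranormal", "amp", "x200b", "nbsp"]

# Build the lookup once; set membership tests take constant time on average
STOP_WORDS = frozenset(BASE_STOP_WORDS + LEAKAGE_WORDS)


def text_stoptokens_fast(text):
    """Tokenize on word characters and drop stop words and leakage words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


example = "I created a new plant &amp; posted it to nosleep"
print(text_stoptokens_fast(example))
```

It could be applied in the same way as the original function, for example with `df['text'].map(text_stoptokens_fast)`.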


In [43]:
%%time
# Pre-processing the dataframe by creating new columns for the lemmatized tokens and the stemmed tokens
df['text_lem'] = df['text_stop'].map(text_lem)
df['text_stem'] = df['text_stop'].map(text_stem)

Wall time: 16.6 s


In [47]:
%%time
# Turning the lists of words into strings (map returns a new Series, so no copy is needed)
df['text_stop'] = df['text_stop'].map(list_to_string)
df['text_lem'] = df['text_lem'].map(list_to_string)
df['text_stem'] = df['text_stem'].map(list_to_string)

Wall time: 78.9 ms
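
Joining the tokens back into strings before saving matters because a CSV cell can only hold text: a list written to CSV comes back as the string form of that list rather than as a list. The sketch below illustrates this with the standard-library `csv` module and an in-memory buffer (both chosen here purely for illustration; pandas writes list cells using their string representation in the same way).

```python
import csv
import io

tokens = ["created", "new", "plant"]

# Write one row holding a raw list and one holding the joined string
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["as_list", tokens])
writer.writerow(["as_string", " ".join(tokens)])

# Read the rows back: every cell is now a plain string
buffer.seek(0)
rows = {label: value for label, value in csv.reader(buffer)}

print(repr(rows["as_list"]))
print(repr(rows["as_string"]))
```

The space-joined form can be split back into tokens directly, or passed as-is to a text vectorizer that expects one string per document.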


In [48]:
# Checking the dataframe again
display(df.shape)
df.head()

(1450, 5)

Unnamed: 0,subreddit,text,text_stop,text_lem,text_stem
0,1,I created a new plant. It's gonna hurt a lot o...,created new plant gonna hurt lot people create...,created new plant gonna hurt lot people create...,creat new plant gonna hurt lot peopl creat new...
1,1,The people I kill won't stay dead. I'm not wri...,people kill stay dead writing sort confession ...,people kill stay dead writing sort confession ...,peopl kill stay dead write sort confess event ...
2,1,Don't got to the Magic Show at the Gypsy Carni...,got magic show gypsy carnival final part 1 l p...,got magic show gypsy carnival final part 1 l p...,got magic show gypsi carniv final part 1 l par...
3,1,"""Intent: The Truth"" - Randonauting is not that...",intent truth randonauting fun hell ella exclai...,intent truth randonauting fun hell ella exclai...,intent truth randonaut fun hell ella exclaim w...
4,1,Accused in the Woods I have had the privilege ...,accused woods privilege living america country...,accused wood privilege living america country ...,accus wood privileg live america countri one f...


In [46]:
# save to csv again for modelling
df.to_csv('../datasets/data_model.csv', index=False)

### 6. Progress thus far
At this point, we have completed the basic EDA and cleaned the two datasets in preparation for analysis. The two datasets were then combined into a single dataframe with only two columns: `subreddit` and `text`.

After that, the dataframe was pre-processed through tokenization, stop word removal, lemmatization and stemming, and the final dataframe was saved for the next part of the project. Note that the original `text` column and the three processed text columns (`text_stop`, `text_lem` and `text_stem`) are all kept in the final dataframe. Although it is likely that only the final processed text column will be used for modelling and analysis, the other columns remain available if they are needed.

We will continue next in Part 3: [Modelling and Analysis](./03_Modelling_and_Analysis.ipynb)