# Project 3 - Web APIs and Natural Language Processing

## Cleaning data from Reddit 

In [1]:
# Importing the libraries that I need. 

import requests
import pandas as pd
import regex as re

from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

In [2]:
# Reading in the data to clean it 
two_subreddits = pd.read_csv('./two_subreddits.csv')

In [3]:
# Looking at what I have, again
two_subreddits.head()

Unnamed: 0,author,created_utc,full_link,is_video,media_only,num_comments,num_crossposts,over_18,score,selftext,subreddit,title,url,crosspost_parent,crosspost_parent_list
0,KAMI_aka,1580305052,https://www.reddit.com/r/careerguidance/commen...,False,False,0,0.0,False,1,Im in my final year of my undergraduate degree...,careerguidance,Can I pursue a master's in engineering managem...,https://www.reddit.com/r/careerguidance/commen...,,
1,LostAMO,1580304222,https://www.reddit.com/r/careerguidance/commen...,False,False,2,0.0,False,1,[removed],careerguidance,Need advice on career change and from friends ...,https://www.reddit.com/r/careerguidance/commen...,,
2,PMMeYourMortys,1580302245,https://www.reddit.com/r/careerguidance/commen...,False,False,1,0.0,False,1,I’m utterly burning out. Every day for the pas...,careerguidance,Burnout: What freelance jobs can I do if I qui...,https://www.reddit.com/r/careerguidance/commen...,,
3,NotJobObsessed,1580301838,https://www.reddit.com/r/careerguidance/commen...,False,False,2,0.0,False,1,"Sometime ago, we moved from the north east to ...",careerguidance,Do I lack work ethic or am I being gaslighted?,https://www.reddit.com/r/careerguidance/commen...,,
4,NoxiousToxic,1580300094,https://www.reddit.com/r/careerguidance/commen...,False,False,1,0.0,False,1,"If this isn’t the place to ask, I will thank t...",careerguidance,I was curious: Can you exchange pay for a plac...,https://www.reddit.com/r/careerguidance/commen...,,


In [4]:
# Checking again that I have 20,000 rows
two_subreddits.shape

(20000, 15)

In [5]:
# Looking at a broad description of the data. 

two_subreddits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 15 columns):
author                   20000 non-null object
created_utc              20000 non-null int64
full_link                20000 non-null object
is_video                 15390 non-null object
media_only               13159 non-null object
num_comments             20000 non-null int64
num_crossposts           14674 non-null float64
over_18                  20000 non-null bool
score                    20000 non-null int64
selftext                 11623 non-null object
subreddit                20000 non-null object
title                    20000 non-null object
url                      20000 non-null object
crosspost_parent         288 non-null object
crosspost_parent_list    288 non-null object
dtypes: bool(1), float64(1), int64(3), object(10)
memory usage: 2.2+ MB
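
The non-null counts above can also be read as shares of missing values, which makes the sparsest columns easier to spot. A minimal sketch on toy data follows; the small frame is made up and only borrows three column names from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two_subreddits (values are invented for illustration)
posts = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "selftext": ["text", np.nan, np.nan, "more"],
    "is_video": [False, np.nan, False, False],
})

# isnull() gives a boolean frame; the column mean is the share of missing rows
null_share = posts.isnull().mean().sort_values(ascending=False)
print(null_share)
```

Columns with a high share of nulls (selftext and the cross-post columns in the real data) are the ones that need a decision before modeling.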


### Cleaning data using nulls on filtering variables 

**Personal note:**    
In this first look at the data, there are two types of variables: those that I will use to explore the data and do further cleaning (is_video, media_only, author), and those I will use to perform the natural language processing. I am starting with those selected to inform the cleaning process: 

- author: I dropped 371 posts that seemed to be material for a specific course (in accounting).
- is_video: I dropped two rows identified as video only. I want to focus on text.  
- media_only: I think this is a way to identify posts that consist mostly of hyperlinks. I'm keeping all of them.
- over_18: 


**Author:**

In [6]:
# Looking at the distribution of authors. Making sure that there are no serial posters of ads
two_subreddits['author'].value_counts()

goviewyou              2006
rellotscire             472
RedditGreenit           461
manishmathur6928        371
[deleted]               342
                       ... 
HalfBack122               1
GnomeErcy                 1
LeadingSoupBoss           1
yanishpatwari             1
TheLonerMillionaire       1
Name: author, Length: 8943, dtype: int64
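
Scanning the full value_counts output by eye works, but a cutoff makes the list of candidates explicit. The sketch below is one way to do it, assuming a hypothetical helper `frequent_authors` and a made-up threshold; the author names are invented and neither the helper nor the threshold comes from the notebook.

```python
import pandas as pd

# Toy data standing in for two_subreddits; author names are made up
posts = pd.DataFrame({
    "author": ["news_bot", "news_bot", "news_bot", "alice", "bob", "news_bot"],
    "title": ["t1", "t2", "t3", "t4", "t5", "t6"],
})

def frequent_authors(df, min_posts):
    """Return post counts for authors with at least `min_posts` posts."""
    counts = df["author"].value_counts()
    return counts[counts >= min_posts]

# min_posts=3 is an arbitrary choice for the toy data
serial = frequent_authors(posts, min_posts=3)
print(serial)
```

Each flagged author still needs a manual look, as in the cells that follow, because a frequent poster is not necessarily an irrelevant one.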

In [7]:
# Goviewyou is definitely a serial poster. But it posts education news, so it is relevant 
two_subreddits[two_subreddits['author']=='goviewyou']

Unnamed: 0,author,created_utc,full_link,is_video,media_only,num_comments,num_crossposts,over_18,score,selftext,subreddit,title,url,crosspost_parent,crosspost_parent_list
13344,goviewyou,1522761086,https://www.reddit.com/r/highereducation/comme...,False,,0,0.0,False,1,,highereducation,"ViewYou Education and Career Tips, March 28 – ...",http://viewyou.com/post/viewyou-education-and-...,,
13345,goviewyou,1522760845,https://www.reddit.com/r/highereducation/comme...,False,,0,0.0,False,1,,highereducation,Your Career Q&amp;A: How to Turn Job Interview...,https://flipboard.com/@goviewyou/goviewyou-4%2...,,
13346,goviewyou,1522758070,https://www.reddit.com/r/highereducation/comme...,False,,0,0.0,False,1,,highereducation,7 undeniable reasons why quick adaptation to c...,https://paper.li/GoViewYou/1356102065?edition_...,,
13350,goviewyou,1522685415,https://www.reddit.com/r/highereducation/comme...,False,,0,0.0,False,1,,highereducation,The surprising hand gesture that could help la...,https://flipboard.com/@goviewyou/goviewyou-4%2...,,
13352,goviewyou,1522672653,https://www.reddit.com/r/highereducation/comme...,False,,0,0.0,False,1,,highereducation,Why You Should Always Read Your Employment Con...,https://paper.li/GoViewYou/1356102065?edition_...,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19970,goviewyou,1441555788,https://www.reddit.com/r/highereducation/comme...,,,0,,False,1,,highereducation,ViewYou.com #Education #News! ViewYouGlobal.co...,https://paper.li/GoViewYou/1356102065?edition_...,,
19973,goviewyou,1441470183,https://www.reddit.com/r/highereducation/comme...,,,0,,False,1,,highereducation,ViewYou.com #Education #News! ViewYouGlobal.co...,https://paper.li/GoViewYou/1356102065?edition_...,,
19976,goviewyou,1441382974,https://www.reddit.com/r/highereducation/comme...,,,0,,False,1,,highereducation,ViewYou.com #Education #News! ViewYouGlobal.co...,https://paper.li/GoViewYou/1356102065?edition_...,,
19983,goviewyou,1441296876,https://www.reddit.com/r/highereducation/comme...,,,0,,False,1,,highereducation,ViewYou.com #Education #News! ViewYouGlobal.co...,https://paper.li/GoViewYou/1356102065?edition_...,,


In [8]:
# Another serial poster. Also in the higher education reddit. This one shares news from other sources
two_subreddits[two_subreddits['author']=='rellotscire']

Unnamed: 0,author,created_utc,full_link,is_video,media_only,num_comments,num_crossposts,over_18,score,selftext,subreddit,title,url,crosspost_parent,crosspost_parent_list
10103,rellotscire,1579023906,https://www.reddit.com/r/highereducation/comme...,False,False,1,0.0,False,1,,highereducation,Essay on the tension of online learning and co...,https://www.insidehighered.com/blogs/confessio...,,
10451,rellotscire,1574456577,https://www.reddit.com/r/highereducation/comme...,False,False,43,0.0,False,1,,highereducation,Indiana University condemns professor's racist...,https://www.insidehighered.com/news/2019/11/22...,,
10674,rellotscire,1572290730,https://www.reddit.com/r/highereducation/comme...,False,False,0,0.0,False,6,,highereducation,Benedict Students Told to Stay in Dorms During...,https://www.insidehighered.com/quicktakes/2019...,,
10721,rellotscire,1571597434,https://www.reddit.com/r/highereducation/comme...,False,False,1,0.0,False,0,,highereducation,It’s time to end the obsession with college ex...,https://www.washingtonpost.com/opinions/2019/1...,,
10775,rellotscire,1571062742,https://www.reddit.com/r/highereducation/comme...,False,False,0,0.0,False,6,,highereducation,University of Alabama to Pay Former Dean,https://www.insidehighered.com/quicktakes/2019...,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19986,rellotscire,1441274814,https://www.reddit.com/r/highereducation/comme...,,,25,,False,15,,highereducation,Harvard University allows students to pick new...,http://www.bostonglobe.com/metro/2015/09/02/ha...,,
19987,rellotscire,1441274107,https://www.reddit.com/r/highereducation/comme...,,,0,,False,5,,highereducation,What’s the real value of higher education?,http://www.newyorker.com/magazine/2015/09/07/c...,,
19997,rellotscire,1441192885,https://www.reddit.com/r/highereducation/comme...,,,0,,False,3,,highereducation,Are we nearing the end of college tuition pric...,http://www.washingtonpost.com/news/grade-point...,,
19998,rellotscire,1441192140,https://www.reddit.com/r/highereducation/comme...,,,2,,False,9,,highereducation,Why Students With Smallest Debts Have the Larg...,http://www.nytimes.com/2015/09/01/upshot/why-s...,,


In [9]:
# Exploring another serial poster
two_subreddits[two_subreddits['author'] == 'RedditGreenit']

Unnamed: 0,author,created_utc,full_link,is_video,media_only,num_comments,num_crossposts,over_18,score,selftext,subreddit,title,url,crosspost_parent,crosspost_parent_list
10006,RedditGreenit,1580270042,https://www.reddit.com/r/highereducation/comme...,False,False,0,0.0,False,1,,highereducation,Appeals court says Catholic university not obl...,https://www.ncronline.org/news/quick-reads/app...,,
10051,RedditGreenit,1579737401,https://www.reddit.com/r/highereducation/comme...,False,False,6,0.0,False,1,,highereducation,Bernie Sanders Introduces the Respect Graduate...,https://diverseeducation.com/article/164585/,,
10070,RedditGreenit,1579527905,https://www.reddit.com/r/highereducation/comme...,False,False,0,0.0,False,1,,highereducation,UC Berkeley student workers awarded millions i...,https://www.nbcnews.com/news/us-news/uc-berkel...,t3_eqvnj6,"[{'all_awardings': [], 'allow_live_comments': ..."
10082,RedditGreenit,1579394142,https://www.reddit.com/r/highereducation/comme...,False,False,0,0.0,False,1,,highereducation,Strike wins big gains for faculty at Clark Col...,https://nwlaborpress.org/2020/01/strike-wins-b...,,
10106,RedditGreenit,1578967506,https://www.reddit.com/r/highereducation/comme...,False,False,2,0.0,False,1,,highereducation,University of Pittsburgh's bill for 'union avo...,https://pittnews.com/article/153827/news/pitts...,t3_eoeg7o,"[{'all_awardings': [], 'allow_live_comments': ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19947,RedditGreenit,1441760897,https://www.reddit.com/r/highereducation/comme...,,,0,,False,10,,highereducation,Penn State graduate students spend Labor Day t...,http://www.centredaily.com/2015/09/07/4909196/...,,
19965,RedditGreenit,1441577237,https://www.reddit.com/r/highereducation/comme...,,,0,,False,6,,highereducation,University of Missouri graduate student employ...,http://www.columbiatribune.com/news/education/...,,
19974,RedditGreenit,1441398578,https://www.reddit.com/r/highereducation/comme...,,,1,,False,22,,highereducation,The Social Injustice Done to Adjunct Faculty: ...,http://www.thepublicdiscourse.com/2015/09/14452/,,
19990,RedditGreenit,1441220198,https://www.reddit.com/r/highereducation/comme...,,,0,,False,2,,highereducation,Ohio University grad student prez: Union push ...,http://www.athensmessenger.com/news/ou-grad-st...,,


In [10]:
# Yet another serial poster. I will delete this one as it seems to be specific to a course 
two_subreddits1 = two_subreddits[two_subreddits['author'] != 'manishmathur6928']

In [11]:
# Checking that I don't have these values anymore
two_subreddits1.shape

(19629, 15)
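
When more than one account has to go, the same filter can take a list. The sketch below is an illustration on toy data, not the notebook's own code: the author names and the `authors_to_drop` list are hypothetical. It also calls `.copy()` on the result, which is one common way to get an independent frame so that later in-place edits do not trigger pandas' SettingWithCopyWarning.

```python
import pandas as pd

# Toy data; author names are invented
posts = pd.DataFrame({
    "author": ["course_spam", "alice", "ad_account", "bob"],
    "title": ["Lecture 1", "Career change?", "Buy now", "PhD advice"],
})

# Hypothetical list of accounts to exclude
authors_to_drop = ["course_spam", "ad_account"]

# ~isin keeps rows whose author is NOT in the list;
# .copy() makes the result independent of the original frame
cleaned = posts[~posts["author"].isin(authors_to_drop)].copy()

# Adding a column now modifies only `cleaned`
cleaned["title_length"] = cleaned["title"].str.len()
print(cleaned)
```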

In [12]:
# Many rows have [deleted] as the author, so I'm exploring those.
# Most of what is shown below seems to be actual content. I'll keep it. 
two_subreddits1[two_subreddits1['author']=='[deleted]']


Unnamed: 0,author,created_utc,full_link,is_video,media_only,num_comments,num_crossposts,over_18,score,selftext,subreddit,title,url,crosspost_parent,crosspost_parent_list
32,[deleted],1580268245,https://www.reddit.com/r/careerguidance/commen...,False,False,2,0.0,False,1,,careerguidance,Remote job is making me lonely and unhappy and...,https://www.reddit.com/r/careerguidance/commen...,,
36,[deleted],1580267943,https://www.reddit.com/r/careerguidance/commen...,False,False,2,0.0,False,1,,careerguidance,lost my path,https://www.reddit.com/r/careerguidance/commen...,,
39,[deleted],1580266235,https://www.reddit.com/r/careerguidance/commen...,False,False,2,0.0,False,1,,careerguidance,Official vs Unofficial Titles,https://www.reddit.com/r/careerguidance/commen...,,
44,[deleted],1580264523,https://www.reddit.com/r/careerguidance/commen...,False,False,2,0.0,False,1,[deleted],careerguidance,Job prospects. Advice needed please.,/r/homesecurity/comments/evg0fh/job_prospects_...,t3_evg0fh,"[{'all_awardings': [], 'allow_live_comments': ..."
46,[deleted],1580263156,https://www.reddit.com/r/careerguidance/commen...,False,False,2,0.0,False,1,,careerguidance,"Odd career path, not sure where to go from here",https://www.reddit.com/r/careerguidance/commen...,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19852,[deleted],1442792092,https://www.reddit.com/r/highereducation/comme...,,,0,,False,0,[deleted],highereducation,Tips for teaching hospitality revenue management,https://www.reddit.com/r/highereducation/comme...,,
19930,[deleted],1441930398,https://www.reddit.com/r/highereducation/comme...,,,0,,False,1,[deleted],highereducation,What Is the Point of College? by Kwame Anthony...,http://www.nytimes.com/2015/09/13/magazine/wha...,,
19952,[deleted],1441729384,https://www.reddit.com/r/highereducation/comme...,,,0,,False,1,[deleted],highereducation,The N-O Man 67 29 14 The University of Iowa’s ...,http://www.slate.com/articles/life/education/2...,,
19963,[deleted],1441643202,https://www.reddit.com/r/highereducation/comme...,,,0,,False,1,[deleted],highereducation,ViewYou.com #Education #News! ViewYouGlobal.co...,https://paper.li/GoViewYou/1356102065?edition_...,,


**Videos:**

In [13]:
# Looking at the distribution of values among those that have information 
two_subreddits1['is_video'].value_counts()

False    15017
True         2
Name: is_video, dtype: int64

In [14]:
# Looking at the two posts with videos in detail. They do not have any useful information for me, so I 
# will drop them 
two_subreddits1[two_subreddits1['is_video']==True]

Unnamed: 0,author,created_utc,full_link,is_video,media_only,num_comments,num_crossposts,over_18,score,selftext,subreddit,title,url,crosspost_parent,crosspost_parent_list
11949,waitforcom,1554080977,https://www.reddit.com/r/highereducation/comme...,True,False,1,0.0,False,0,,highereducation,Aluminum atoms,https://v.redd.it/jora29mmwjp21,,
12721,Zestebookstore,1537761438,https://www.reddit.com/r/highereducation/comme...,True,False,0,0.0,False,2,,highereducation,Home Schooling,https://v.redd.it/h08b77ba04o11,,


In [15]:
# Dropping the two video posts. Reassigning the result (rather than using inplace=True) avoids
# pandas' SettingWithCopyWarning, since two_subreddits1 was created by filtering two_subreddits
two_subreddits1 = two_subreddits1.drop(11949)


In [16]:
two_subreddits1 = two_subreddits1.drop(12721)

In [17]:
#Checking it worked
two_subreddits1[two_subreddits1['is_video']==True]

Unnamed: 0,author,created_utc,full_link,is_video,media_only,num_comments,num_crossposts,over_18,score,selftext,subreddit,title,url,crosspost_parent,crosspost_parent_list
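
Dropping by hard-coded index labels works for two rows, but a boolean mask does the same job without looking the labels up first. A minimal sketch on toy data (the frame and its index labels are made up):

```python
import numpy as np
import pandas as pd

# Toy data: is_video holds False, True, and missing values, as in the notebook
posts = pd.DataFrame(
    {
        "is_video": [False, True, np.nan, False, True],
        "title": ["a", "b", "c", "d", "e"],
    },
    index=[10, 11, 12, 13, 14],
)

# `== True` evaluates to False for NaN, so rows with a missing
# is_video value are kept rather than dropped
is_video = posts["is_video"] == True  # noqa: E712
no_videos = posts[~is_video].copy()
print(no_videos)
```

The mask version also keeps working if the data is re-pulled and the video posts land on different index labels.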


In [18]:
# Looking at the null values for the is_video category. I cannot see any pattern in the data for this filter 
# I'll look for others and come back to it. 
two_subreddits1[two_subreddits1['is_video'].isnull()]

Unnamed: 0,author,created_utc,full_link,is_video,media_only,num_comments,num_crossposts,over_18,score,selftext,subreddit,title,url,crosspost_parent,crosspost_parent_list
15390,hallenjolie,1493791191,https://www.reddit.com/r/highereducation/comme...,,,0,,False,1,,highereducation,Harvard University Admission,http://www.worldcollegedegrees.com/2017/04/Har...,,
15391,hallenjolie,1493789499,https://www.reddit.com/r/highereducation/comme...,,,0,,False,0,,highereducation,Oxford University Admission,http://www.worldcollegedegrees.com/2017/04/Oxf...,,
15392,lpez33,1493788756,https://www.reddit.com/r/highereducation/comme...,,,6,,False,8,To those who have taken the time to lend me so...,highereducation,Applying for a PhD in Biostatistics and I have...,https://www.reddit.com/r/highereducation/comme...,,
15393,hallenjolie,1493787863,https://www.reddit.com/r/highereducation/comme...,,,0,,False,0,,highereducation,College Courses List,http://www.worldcollegedegrees.com/2017/04/Col...,,
15394,hallenjolie,1493786679,https://www.reddit.com/r/highereducation/comme...,,,0,,False,0,,highereducation,Brown University Admission,http://www.worldcollegedegrees.com/2017/04/Bro...,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,ESB605,1441202756,https://www.reddit.com/r/highereducation/comme...,,,0,,False,3,,highereducation,"Survey Examines Cooperation Between Faculty, L...",https://www.insidehighered.com/quicktakes/2015...,,
19996,percytrappe,1441198833,https://www.reddit.com/r/highereducation/comme...,,,1,,False,3,,highereducation,University Humor – Erskine Bowles,https://academicanchor.wordpress.com/2012/12/0...,,
19997,rellotscire,1441192885,https://www.reddit.com/r/highereducation/comme...,,,0,,False,3,,highereducation,Are we nearing the end of college tuition pric...,http://www.washingtonpost.com/news/grade-point...,,
19998,rellotscire,1441192140,https://www.reddit.com/r/highereducation/comme...,,,2,,False,9,,highereducation,Why Students With Smallest Debts Have the Larg...,http://www.nytimes.com/2015/09/01/upshot/why-s...,,


**Media only**:

In [19]:
# Looking at the distribution of values. It seems we only have NaN or False. 
two_subreddits1['media_only'].value_counts()

False    12786
Name: media_only, dtype: int64

In [20]:
# The tag 'media_only' seems to apply to posts that only include links to other articles. I'm keeping all those.
two_subreddits1[two_subreddits1['media_only'].isnull()]

Unnamed: 0,author,created_utc,full_link,is_video,media_only,num_comments,num_crossposts,over_18,score,selftext,subreddit,title,url,crosspost_parent,crosspost_parent_list
13159,Academous,1527012444,https://www.reddit.com/r/highereducation/comme...,False,,0,0.0,False,1,,highereducation,Will Macron clarify his university networks vi...,http://www.universityworldnews.com/article.php...,,
13160,Ebenezerschool,1526978756,https://www.reddit.com/r/highereducation/comme...,False,,1,0.0,False,1,,highereducation,A unique Learning system for Every Child | Ebe...,http://ebenezerirs.org/about-us/,,
13161,--onceinalifetime--,1526953663,https://www.reddit.com/r/highereducation/comme...,False,,7,0.0,False,0,The higher ed environment is a snake pit of in...,highereducation,Informers in our midst,https://www.reddit.com/r/highereducation/comme...,,
13162,RedditGreenit,1526890394,https://www.reddit.com/r/highereducation/comme...,False,,1,0.0,False,21,,highereducation,Getting organized: Oregon State University fac...,http://www.gazettetimes.com/news/local/getting...,,
13163,ccb621,1526838772,https://www.reddit.com/r/highereducation/comme...,False,,5,0.0,False,5,,highereducation,"I've Paid $18,000 To A $24,000 Student Loan, &...",https://www.bustle.com/p/ive-paid-18000-to-a-2...,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,ESB605,1441202756,https://www.reddit.com/r/highereducation/comme...,,,0,,False,3,,highereducation,"Survey Examines Cooperation Between Faculty, L...",https://www.insidehighered.com/quicktakes/2015...,,
19996,percytrappe,1441198833,https://www.reddit.com/r/highereducation/comme...,,,1,,False,3,,highereducation,University Humor – Erskine Bowles,https://academicanchor.wordpress.com/2012/12/0...,,
19997,rellotscire,1441192885,https://www.reddit.com/r/highereducation/comme...,,,0,,False,3,,highereducation,Are we nearing the end of college tuition pric...,http://www.washingtonpost.com/news/grade-point...,,
19998,rellotscire,1441192140,https://www.reddit.com/r/highereducation/comme...,,,2,,False,9,,highereducation,Why Students With Smallest Debts Have the Larg...,http://www.nytimes.com/2015/09/01/upshot/why-s...,,


In [21]:
# To avoid running into trouble when modeling, I will fill the NaN with 'True'.
# Note that this is the string 'True', so the column now mixes booleans and strings.
# assign() returns a new frame, which avoids pandas' SettingWithCopyWarning
two_subreddits1 = two_subreddits1.assign(media_only=two_subreddits1['media_only'].fillna('True'))


In [22]:
# Checking it worked 
two_subreddits1.isnull().sum()

author                       0
created_utc                  0
full_link                    0
is_video                  4610
media_only                   0
num_comments                 0
num_crossposts            5326
over_18                      0
score                        0
selftext                  8308
subreddit                    0
title                        0
url                          0
crosspost_parent         19339
crosspost_parent_list    19339
dtype: int64
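
The media_only fill removes the nulls, but filling with the string 'True' leaves a column that holds both booleans and strings. The sketch below shows the difference on a toy Series and one possible alternative that yields a plain boolean column. The alternative is an assumption about what may be convenient for modeling, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy column shaped like media_only: False or missing, stored as object
media_only = pd.Series([False, np.nan, False], dtype=object)

# Filling with the string 'True' leaves two Python types in the column
filled_str = media_only.fillna("True")
print(set(filled_str.map(type)))

# Alternative: map missing values to the boolean True and keep a bool dtype
filled_bool = media_only.map(lambda v: True if pd.isna(v) else bool(v))
print(filled_bool.dtype)
```

A single-typed column is easier to encode later, since a comparison such as `== True` would not match the string 'True'.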

**Eliminating duplicates:**

In [23]:
# Looking for duplicates in the data (Thanks Adi!)
two_subreddits1[two_subreddits1['title'].duplicated(keep=False)]

Unnamed: 0,author,created_utc,full_link,is_video,media_only,num_comments,num_crossposts,over_18,score,selftext,subreddit,title,url,crosspost_parent,crosspost_parent_list
95,digestive-biscuit19,1580244279,https://www.reddit.com/r/careerguidance/commen...,False,False,2,0.0,False,1,[removed],careerguidance,Career Change,https://www.reddit.com/r/careerguidance/commen...,,
101,eucalypitus,1580241787,https://www.reddit.com/r/careerguidance/commen...,False,False,2,0.0,False,1,[removed],careerguidance,Advice Please,https://www.reddit.com/r/careerguidance/commen...,,
225,sanpellegrinoa,1580182443,https://www.reddit.com/r/careerguidance/commen...,False,False,1,0.0,False,1,\n\nI got a call requesting a follow up interv...,careerguidance,What should I expect for a follow up interview?,https://www.reddit.com/r/careerguidance/commen...,,
288,sanpellegrinoa,1580159141,https://www.reddit.com/r/careerguidance/commen...,False,False,0,0.0,False,1,\n\nI got a call requesting a follow up interv...,careerguidance,What should I expect for a follow up interview?,https://www.reddit.com/r/careerguidance/commen...,,
305,sanpellegrinoa,1580153694,https://www.reddit.com/r/careerguidance/commen...,False,False,0,0.0,False,1,I got a call requesting a follow up interview....,careerguidance,What should I expect for a follow up interview?,https://www.reddit.com/r/careerguidance/commen...,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19749,Sybles,1444134794,https://www.reddit.com/r/highereducation/comme...,,True,0,,False,2,,highereducation,Campuses Debate Rising Demands for ‘Comfort An...,http://www.nytimes.com/2015/10/05/us/four-legg...,,
19750,rellotscire,1444132322,https://www.reddit.com/r/highereducation/comme...,,True,0,,False,3,,highereducation,Campuses Debate Rising Demands for ‘Comfort An...,http://www.nytimes.com/2015/10/05/us/four-legg...,,
19826,goviewyou,1443114218,https://www.reddit.com/r/highereducation/comme...,,True,0,,False,1,,highereducation,ViewYou.com #Education #News! ViewYouGlobal.co...,https://paper.li/GoViewYou/1356102065?edition_...,,
19866,Sybles,1442597383,https://www.reddit.com/r/highereducation/comme...,,True,0,,False,14,,highereducation,Social sciences and humanities faculties to cl...,https://www.timeshighereducation.com/news/soci...,,


In [24]:
# Dropping the duplicates, keeping the first occurrence of each title.
# Reassigning the result avoids pandas' SettingWithCopyWarning
two_subreddits1 = two_subreddits1.drop_duplicates(subset='title', keep='first')


In [25]:
# Looking at the resulting size of the data frame
two_subreddits1.shape

(19152, 15)
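
Deduplicating on title alone also collapses posts by different authors that happen to share a generic title (for example "Career Change" in the listing above). The sketch below contrasts title-only deduplication with deduplication on author plus title. It uses made-up rows, and the second variant is only an alternative to consider, not the notebook's choice.

```python
import pandas as pd

# Toy data: one author reposting, and two authors sharing a generic title
posts = pd.DataFrame({
    "author": ["a", "a", "b", "c"],
    "title": [
        "Follow-up interview?",
        "Follow-up interview?",
        "Career change",
        "Career change",
    ],
})

# keep=False marks every member of a duplicate group, useful for inspection
all_dupes = posts[posts["title"].duplicated(keep=False)]

# Title-only: keeps the first row of each title, regardless of author
by_title = posts.drop_duplicates(subset="title", keep="first")

# Author + title: removes reposts but keeps same-title posts by different people
by_author_title = posts.drop_duplicates(subset=["author", "title"], keep="first")

print(len(all_dupes), len(by_title), len(by_author_title))
```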

### Cleaning data using pre-selected analysis variables 

In [26]:
two_subreddits1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19152 entries, 0 to 19999
Data columns (total 15 columns):
author                   19152 non-null object
created_utc              19152 non-null int64
full_link                19152 non-null object
is_video                 14641 non-null object
media_only               19152 non-null object
num_comments             19152 non-null int64
num_crossposts           13964 non-null float64
over_18                  19152 non-null bool
score                    19152 non-null int64
selftext                 11160 non-null object
subreddit                19152 non-null object
title                    19152 non-null object
url                      19152 non-null object
crosspost_parent         283 non-null object
crosspost_parent_list    283 non-null object
dtypes: bool(1), float64(1), int64(3), object(10)
memory usage: 2.2+ MB


**Null cross-posts:**

In [27]:
# Looking at the distribution of values. Not much interesting there. 
two_subreddits1['num_crossposts'].value_counts()

0.0    13939
1.0       18
2.0        6
3.0        1
Name: num_crossposts, dtype: int64

In [28]:
# The cross posts are so few that I am going to look at them to decide whether to drop the cross-posting columns
two_subreddits1[two_subreddits1['num_crossposts'].isin([1, 2, 3])]

Unnamed: 0,author,created_utc,full_link,is_video,media_only,num_comments,num_crossposts,over_18,score,selftext,subreddit,title,url,crosspost_parent,crosspost_parent_list
11626,LysanderSporker,1559679261,https://www.reddit.com/r/highereducation/comme...,False,False,1,1.0,False,24,,highereducation,"Historians are a great resource. Journalists, ...",https://www.cjr.org/criticism/historians-journ...,,
12704,texlorax,1538003190,https://www.reddit.com/r/highereducation/comme...,False,False,0,1.0,False,67,,highereducation,U.S. Students Spend More Time Working Paid Job...,https://www.bloomberg.com/news/articles/2018-0...,,
12763,10to1000,1536863783,https://www.reddit.com/r/highereducation/comme...,False,False,3,2.0,False,4,I teach Economics at CUNY and my department do...,highereducation,LiveStream vs Recorded Video Lectures - lookin...,https://www.reddit.com/r/highereducation/comme...,,
12875,misbehavingeconomist,1534712563,https://www.reddit.com/r/highereducation/comme...,False,False,2,1.0,False,29,,highereducation,'I'm being exploited': the underpaid workers i...,https://www.theage.com.au/national/victoria/i-...,,
12888,texlorax,1534188505,https://www.reddit.com/r/highereducation/comme...,False,False,6,2.0,False,48,,highereducation,Tenured professor who exposes colleagues' publ...,https://edmontonjournal.com/opinion/columnists...,,
12937,flyoverokie,1532909856,https://www.reddit.com/r/highereducation/comme...,False,False,0,1.0,False,16,,highereducation,How do you get into Harvard? For the lucky few...,https://www.bostonglobe.com/metro/2018/07/28/h...,,
12940,15mgSodium,1532708142,https://www.reddit.com/r/highereducation/comme...,False,False,1,1.0,False,58,,highereducation,xkcd: Peer Review,https://xkcd.com/2025/,,
12994,Chino_Blanco,1531196052,https://www.reddit.com/r/highereducation/comme...,False,False,0,1.0,False,16,,highereducation,Memo to Dallin Oaks and David Bednar: Universi...,http://faithpromotingrumor.com/2018/05/02/fait...,,
13062,Sophia_H,1529053880,https://www.reddit.com/r/highereducation/comme...,False,False,1,1.0,False,1,,highereducation,"No classes, no professors: the alternative to ...",https://www.ft.com/content/45ade73e-5aac-11e8-...,,
13146,RedditGreenit,1527374677,https://www.reddit.com/r/highereducation/comme...,False,False,0,2.0,False,13,,highereducation,"At the New School, Labor Struggles Unite Stude...",https://psmag.com/education/at-the-new-school-...,,


In [29]:
# I am dropping the columns related to cross posts because they do not provide specific information on which subreddits the posts are actually cross-posted to. 
two_subreddits1.drop(columns=['num_crossposts', 'crosspost_parent', 'crosspost_parent_list'], inplace=True)

In [30]:
#Checking it worked.
two_subreddits1.shape

(19152, 12)

**Self text nulls:**

In [31]:
# Exploring the null values in the self text column by subreddit - Career guidance 
# The values shown below do not seem to follow any pattern
two_subreddits1[(two_subreddits1['selftext'].isnull()) & (two_subreddits1['subreddit']=='careerguidance')]

Unnamed: 0,author,created_utc,full_link,is_video,media_only,num_comments,over_18,score,selftext,subreddit,title,url
32,[deleted],1580268245,https://www.reddit.com/r/careerguidance/commen...,False,False,2,False,1,,careerguidance,Remote job is making me lonely and unhappy and...,https://www.reddit.com/r/careerguidance/commen...
36,[deleted],1580267943,https://www.reddit.com/r/careerguidance/commen...,False,False,2,False,1,,careerguidance,lost my path,https://www.reddit.com/r/careerguidance/commen...
39,[deleted],1580266235,https://www.reddit.com/r/careerguidance/commen...,False,False,2,False,1,,careerguidance,Official vs Unofficial Titles,https://www.reddit.com/r/careerguidance/commen...
43,itgetsbetter888,1580265124,https://www.reddit.com/r/careerguidance/commen...,False,False,0,False,1,,careerguidance,Any alarm company job prospects?,/r/homesecurity/comments/evg0fh/job_prospects_...
46,[deleted],1580263156,https://www.reddit.com/r/careerguidance/commen...,False,False,2,False,1,,careerguidance,"Odd career path, not sure where to go from here",https://www.reddit.com/r/careerguidance/commen...
...,...,...,...,...,...,...,...,...,...,...,...,...
9891,mountainmonkey2,1573406403,https://www.reddit.com/r/careerguidance/commen...,False,False,1,False,1,,careerguidance,Looking to get a 2-year Associates in IT when ...,https://www.reddit.com/r/careerguidance/commen...
9950,arhat050,1573336176,https://www.reddit.com/r/careerguidance/commen...,False,False,1,False,1,,careerguidance,"Hey, could use some feedback on improving my R...",https://www.reddit.com/r/resumes/comments/daar...
9975,nwalandgod,1573319660,https://www.reddit.com/r/careerguidance/commen...,False,False,195,False,1,,careerguidance,What jobs offer sufficient free time or time a...,https://www.reddit.com/r/careerguidance/commen...
9994,Puffymar1234,1573279295,https://www.reddit.com/r/careerguidance/commen...,False,False,7,False,1,,careerguidance,Will I seem more competitive in the workplace ...,https://www.reddit.com/r/careerguidance/commen...


In [32]:
# Exploring the null values in the self text column by subreddit - Higher Education
# This uncovers a few advertisers: sixsigmaedu and Adamscots. I'm exploring them further 

two_subreddits1[(two_subreddits1['selftext'].isnull()) & (two_subreddits1['subreddit']=='highereducation')]

Unnamed: 0,author,created_utc,full_link,is_video,media_only,num_comments,over_18,score,selftext,subreddit,title,url
10000,sixsigmaedu,1580291550,https://www.reddit.com/r/highereducation/comme...,False,False,0,False,1,,highereducation,Top UK education consultants in Hyderabad | to...,https://www.sixsigmaedu.com/blog/top-uk-educat...
10002,Epistaxis,1580279418,https://www.reddit.com/r/highereducation/comme...,False,False,0,False,1,,highereducation,Harvard Chemistry Chairman Charged on Alleged ...,https://www.wsj.com/articles/harvards-chemistr...
10003,Adamscots,1580278542,https://www.reddit.com/r/highereducation/comme...,False,False,0,False,1,,highereducation,पढ़ने में एकाग्रता कैसे लाएं?,https://hi.letsdiskuss.com/how-to-bring-concen...
10005,sixsigmaedu,1580277042,https://www.reddit.com/r/highereducation/comme...,False,False,0,False,1,,highereducation,Study in UK without IELTS for Indian students ...,https://www.sixsigmaedu.com/blog/study-in-uk-w...
10006,RedditGreenit,1580270042,https://www.reddit.com/r/highereducation/comme...,False,False,0,False,1,,highereducation,Appeals court says Catholic university not obl...,https://www.ncronline.org/news/quick-reads/app...
...,...,...,...,...,...,...,...,...,...,...,...,...
19995,ESB605,1441202756,https://www.reddit.com/r/highereducation/comme...,,True,0,False,3,,highereducation,"Survey Examines Cooperation Between Faculty, L...",https://www.insidehighered.com/quicktakes/2015...
19996,percytrappe,1441198833,https://www.reddit.com/r/highereducation/comme...,,True,1,False,3,,highereducation,University Humor – Erskine Bowles,https://academicanchor.wordpress.com/2012/12/0...
19997,rellotscire,1441192885,https://www.reddit.com/r/highereducation/comme...,,True,0,False,3,,highereducation,Are we nearing the end of college tuition pric...,http://www.washingtonpost.com/news/grade-point...
19998,rellotscire,1441192140,https://www.reddit.com/r/highereducation/comme...,,True,2,False,9,,highereducation,Why Students With Smallest Debts Have the Larg...,http://www.nytimes.com/2015/09/01/upshot/why-s...


In [33]:
# Looking at sixsigmaedu: this account is indeed an advertiser, so I am dropping its posts
two_subreddits2 = two_subreddits1[two_subreddits1['author'] != 'sixsigmaedu']

In [34]:
# Looking at Adamscots. This is an interesting user: he posts in two different scripts and 
# mostly on topics that are irrelevant to highereducation, so I am dropping his posts too 
two_subreddits3 = two_subreddits2[two_subreddits2['author'] != 'Adamscots']
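The two drops above could also be done in one step. This is a minimal sketch on a toy dataframe; `posts`, `advertisers` and `filtered` are hypothetical names, not ones used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe (illustrative rows only).
posts = pd.DataFrame({
    'author': ['sixsigmaedu', 'Epistaxis', 'Adamscots', 'RedditGreenit'],
    'subreddit': ['highereducation'] * 4,
})

# Accounts flagged as advertisers or off-topic posters.
advertisers = ['sixsigmaedu', 'Adamscots']

# isin() marks rows written by a flagged account; ~ inverts the mask to keep the rest.
filtered = posts[~posts['author'].isin(advertisers)]
print(filtered['author'].tolist())
```

Keeping the flagged accounts in one list makes it easy to add more names later without creating another intermediate dataframe.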

In [35]:
# Confirming this worked 
two_subreddits3.shape

(19028, 12)

In [36]:
two_subreddits3.head()

Unnamed: 0,author,created_utc,full_link,is_video,media_only,num_comments,over_18,score,selftext,subreddit,title,url
0,KAMI_aka,1580305052,https://www.reddit.com/r/careerguidance/commen...,False,False,0,False,1,Im in my final year of my undergraduate degree...,careerguidance,Can I pursue a master's in engineering managem...,https://www.reddit.com/r/careerguidance/commen...
1,LostAMO,1580304222,https://www.reddit.com/r/careerguidance/commen...,False,False,2,False,1,[removed],careerguidance,Need advice on career change and from friends ...,https://www.reddit.com/r/careerguidance/commen...
2,PMMeYourMortys,1580302245,https://www.reddit.com/r/careerguidance/commen...,False,False,1,False,1,I’m utterly burning out. Every day for the pas...,careerguidance,Burnout: What freelance jobs can I do if I qui...,https://www.reddit.com/r/careerguidance/commen...
3,NotJobObsessed,1580301838,https://www.reddit.com/r/careerguidance/commen...,False,False,2,False,1,"Sometime ago, we moved from the north east to ...",careerguidance,Do I lack work ethic or am I being gaslighted?,https://www.reddit.com/r/careerguidance/commen...
4,NoxiousToxic,1580300094,https://www.reddit.com/r/careerguidance/commen...,False,False,1,False,1,"If this isn’t the place to ask, I will thank t...",careerguidance,I was curious: Can you exchange pay for a plac...,https://www.reddit.com/r/careerguidance/commen...


In [37]:
# Given that the higher education subreddit has many more advertisers than the career guidance subreddit,
# I am going to explore the authors with the most posts (about 100 or more) to catch the most serial advertisers 
two_subreddits3[two_subreddits3['subreddit'] == 'highereducation']['author'].value_counts().head(10)

goviewyou        1903
rellotscire       469
RedditGreenit     461
[deleted]         247
JohnKimble111     197
15mgSodium        194
Sybles            176
trot-trot         150
texlorax          105
Betsy514          100
Name: author, dtype: int64
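`head(10)` shows the ten most frequent authors regardless of how many posts they have. If an explicit cutoff is preferred, the counts can be filtered with a boolean mask. A small sketch, assuming a toy `authors` Series and a hypothetical `min_posts` threshold (the notebook comment mentions 100):

```python
import pandas as pd

# Toy author column: 'a' posted three times, 'b' twice, 'c' once.
authors = pd.Series(['a'] * 3 + ['b'] * 2 + ['c'], name='author')

# Number of posts per author, sorted from most to least frequent.
counts = authors.value_counts()

# Hypothetical threshold; on the real data this could be 100.
min_posts = 2

# Keep only the authors with at least min_posts posts.
frequent = counts[counts >= min_posts]
print(frequent)
```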

In [38]:
# Looking for more serial posters. After inspecting them all, it seems that I 
two_subreddits3[two_subreddits3['author'] == 'Betsy514']

Unnamed: 0,author,created_utc,full_link,is_video,media_only,num_comments,over_18,score,selftext,subreddit,title,url
12423,Betsy514,1544544166,https://www.reddit.com/r/highereducation/comme...,False,False,0,False,1,,highereducation,The Cautionary Tale of Correspondence Schools,https://www.newamerica.org/education-policy/re...
12568,Betsy514,1541597053,https://www.reddit.com/r/highereducation/comme...,False,False,3,False,1,,highereducation,Democratic House will trigger tougher oversigh...,https://www.insidehighered.com/news/2018/11/07...
12765,Betsy514,1536803868,https://www.reddit.com/r/highereducation/comme...,False,False,1,False,3,,highereducation,"List of guidance links for students, schools a...",https://ifap.ed.gov/ifap/disaster.jsp
12766,Betsy514,1536796647,https://www.reddit.com/r/highereducation/comme...,False,False,3,False,63,,highereducation,Betsy DeVos Loses Student Loan Lawsuit Brought...,https://www.bloomberg.com/news/articles/2018-0...
12838,Betsy514,1535386046,https://www.reddit.com/r/highereducation/comme...,False,False,2,False,8,,highereducation,The Student Debt Problem Is Worse Than We Imag...,https://www.nytimes.com/interactive/2018/08/25...
...,...,...,...,...,...,...,...,...,...,...,...,...
19795,Betsy514,1443555159,https://www.reddit.com/r/highereducation/comme...,,True,0,False,1,,highereducation,CFPB Issues Report On Student Loan Servicing,http://files.consumerfinance.gov/f/201509_cfpb...
19868,Betsy514,1442589197,https://www.reddit.com/r/highereducation/comme...,,True,0,False,10,,highereducation,More Americans Falling Behind on Student Loans,http://www.huffingtonpost.com/entry/student-lo...
19910,Betsy514,1442186500,https://www.reddit.com/r/highereducation/comme...,,True,1,False,11,,highereducation,President signs executive order to allow prior...,http://hosted.ap.org/dynamic/stories/U/US_OBAM...
19943,Betsy514,1441822026,https://www.reddit.com/r/highereducation/comme...,,True,1,False,3,,highereducation,Research 4 College Stats Before Making Student...,http://t.usnews.com/Zcyhog?src=usn_rd


In [39]:
# Finally, I need to fill nulls in the selftext data so that I can use those rows in the model without a problem. 
two_subreddits3.isnull().sum()

author             0
created_utc        0
full_link          0
is_video        4511
media_only         0
num_comments       0
over_18            0
score              0
selftext        7871
subreddit          0
title              0
url                0
dtype: int64

In [69]:
two_subreddits3['selftext'].fillna(0, inplace=True)

In [70]:
# Checking it worked 
two_subreddits3.isnull().sum()

author             0
created_utc        0
full_link          0
is_video        4511
media_only         0
num_comments       0
over_18            0
score              0
selftext           0
subreddit          0
title              0
url                0
dtype: int64
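Filling with `0` puts an integer into a column that otherwise holds text. One alternative, sketched here on a toy dataframe with hypothetical names (`posts`, `subset`), is to fill with an empty string on an explicit copy and assign the result back, which also avoids the chained `inplace` call. One caveat: by default `pd.read_csv` reads empty fields back as nulls, so this choice matters if the data is saved and reloaded.

```python
import pandas as pd

# Toy data: a link post has no selftext, a text post does.
posts = pd.DataFrame({
    'title': ['Link post', 'Text post'],
    'selftext': [None, 'Body text'],
})

# .copy() makes the filtered frame independent of the original,
# so the assignment below does not act on a view of `posts`.
subset = posts[posts['title'] != ''].copy()

# Replace missing selftext with an empty string so the column stays all text.
subset['selftext'] = subset['selftext'].fillna('')
print(subset['selftext'].isnull().sum())
```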

**Looking at preselected variables to ensure they have enough variation to help my model:**

**Num_comments:**

In [71]:
# Looking at the distribution of the number of comments by subreddit. The two distributions differ
# enough to make num_comments worth keeping in the model 
two_subreddits3[two_subreddits3['subreddit']=='careerguidance']['num_comments'].value_counts().head()

2    2673
1    2344
0    1536
3     738
4     535
Name: num_comments, dtype: int64

In [72]:
two_subreddits3[two_subreddits3['subreddit']=='highereducation']['num_comments'].value_counts().head()

0    6090
1    1023
2     513
3     327
4     226
Name: num_comments, dtype: int64
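Because the two subreddits do not have exactly the same number of rows, shares can be easier to compare than raw counts. A minimal sketch on toy data (the `posts` frame and `zero_share` name are hypothetical) that computes the share of posts with no comments in each subreddit:

```python
import pandas as pd

# Toy data: four posts per subreddit with different comment counts.
posts = pd.DataFrame({
    'subreddit': ['careerguidance'] * 4 + ['highereducation'] * 4,
    'num_comments': [2, 2, 1, 0, 0, 0, 0, 1],
})

# Boolean flag for "no comments", averaged within each subreddit.
# The mean of a boolean column is the share of True values.
zero_share = (posts['num_comments'] == 0).groupby(posts['subreddit']).mean()
print(zero_share)
```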

**Over_18**:

In [73]:
# Note: over_18 flags posts marked as adult content (NSFW); it is not the age of the author. 
# I am checking how often it is set in the dataframe as a whole. I suspect there is very little variation

two_subreddits3['over_18'].value_counts()

False    19022
True         6
Name: over_18, dtype: int64
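The same check can be split by subreddit. This is only a sketch on toy data with hypothetical names (`posts`, `flagged`); with so few `True` values overall, the split is unlikely to change the conclusion.

```python
import pandas as pd

# Toy data: one flagged post in careerguidance, none in highereducation.
posts = pd.DataFrame({
    'subreddit': ['careerguidance'] * 3 + ['highereducation'] * 2,
    'over_18': [False, True, False, False, False],
})

# For a boolean column, 'sum' counts the True values and 'count' the total rows.
flagged = posts.groupby('subreddit')['over_18'].agg(['sum', 'count'])
print(flagged)
```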

**Score:**

In [74]:
# Similarly, I am checking the score in the entire dataframe first to see if there is any substantial 
# difference among values. There is. 

two_subreddits3['score'].value_counts()

1      14635
2        645
0        604
3        444
4        252
       ...  
102        1
118        1
103        1
87         1
143        1
Name: score, Length: 88, dtype: int64

In [75]:
# Exploring if the distribution is different between subreddits 
two_subreddits3[two_subreddits3['subreddit']=='careerguidance']['score'].value_counts().head(5)

1    9721
2      76
3      30
0      14
5      10
Name: score, dtype: int64

In [76]:
two_subreddits3[two_subreddits3['subreddit']=='highereducation']['score'].value_counts().head(5)

1    4914
0     590
2     569
3     414
4     243
Name: score, dtype: int64

#### Preparing dataframe for modeling: 

In [77]:
two_subreddits3.columns 

Index(['author', 'created_utc', 'full_link', 'is_video', 'media_only',
       'num_comments', 'over_18', 'score', 'selftext', 'subreddit', 'title',
       'url'],
      dtype='object')

In [78]:
# Of the 16 preselected variables, 8 will be taken into consideration for the model. 
clean_subreddit = two_subreddits3[['author', 'created_utc', 'media_only','num_comments', 
                                    'score', 'selftext', 'subreddit', 'title']]

In [79]:
# Making sure I still have a dataframe 
type(clean_subreddit)

pandas.core.frame.DataFrame

In [80]:
# Looking at the shape of the final dataframe 
clean_subreddit.shape

(19028, 8)

In [81]:
# Checking the datatypes of all the selected variables 
clean_subreddit.dtypes

author          object
created_utc      int64
media_only      object
num_comments     int64
score            int64
selftext        object
subreddit       object
title           object
dtype: object

In [86]:
# Since I want to include in my model both the words used in the title and those used in the selftext, I am going to 
# merge them into a single column called 'full_text'. 
# .astype(str) converts each row separately; str() on the whole Series would return one string for the entire column
clean_subreddit['full_text'] = clean_subreddit['title'] + ' AND ' + clean_subreddit['selftext'].astype(str)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
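One thing to watch when concatenating text columns: `str()` applied to a Series returns a single string describing the whole column, while `.astype(str)` converts each row on its own. A small sketch on toy data (the `posts` frame, the `whole` variable and the space separator are illustrative assumptions):

```python
import pandas as pd

# Toy data: one text body and one filled-in 0, as in a column filled with fillna(0).
posts = pd.DataFrame({
    'title': ['First title', 'Second title'],
    'selftext': ['Body one', 0],
})

# str() on a Series gives one string: the printed summary of every row.
whole = str(posts['selftext'])

# astype(str) converts element by element, so each title gets its own body.
posts['full_text'] = posts['title'] + ' ' + posts['selftext'].astype(str)
print(posts['full_text'].tolist())
```

Working on an explicit `.copy()` of the selected columns would also avoid the `SettingWithCopyWarning` shown above.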


In [87]:
#Checking it worked
clean_subreddit.columns

Index(['author', 'created_utc', 'media_only', 'num_comments', 'score',
       'selftext', 'subreddit', 'title', 'full_text'],
      dtype='object')

In [88]:
clean_subreddit.isnull().sum()

author          0
created_utc     0
media_only      0
num_comments    0
score           0
selftext        0
subreddit       0
title           0
full_text       0
dtype: int64

In [89]:
# Saving this dataframe to a csv that I can use in the modeling notebook. 
clean_subreddit.to_csv('./clean_subreddit.csv', index=False)
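Before relying on the saved file in another notebook, it can help to check how text columns survive the round trip. A sketch using an in-memory buffer instead of a file; the toy `posts` frame is hypothetical, and the empty string stands in for any blank text field:

```python
import io
import pandas as pd

# Toy data with one empty text field.
posts = pd.DataFrame({
    'title': ['Link post', 'Text post'],
    'selftext': ['', 'Body text'],
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
posts.to_csv(buffer, index=False)

# Default read: empty fields come back as missing values.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False leaves empty fields as empty strings.
buffer.seek(0)
kept = pd.read_csv(buffer, keep_default_na=False)

print(default_read['selftext'].isnull().sum(), kept['selftext'].tolist())
```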