![](https://techcrunch.com/wp-content/uploads/2019/02/Reddit-Header.png?w=1390&crop=1)
# Project 3: Web APIs & Classification - Subreddit

In [1]:
#import libraries
import requests
import time
import random
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

## Problem Statement

Travelling has long been considered essential to living in modern society. When wanderlust kicks in, the urge to scour the internet for possible travel destinations is insatiable. With resources like blogs, websites, travel guides and videos, a plethora of travel information is just a click away. However, with so much information available, can we quickly sort it so that users obtain the information that is relevant to them?

One such source is Reddit, an American social news aggregation, web content rating, and discussion website. It works like a forum: users look for information by browsing a subreddit (a community dedicated to a particular topic), and they can post, comment on, and upvote content there.

Therefore, to give travellers quick access to information, we will build a model that classifies posts from 2 subreddits. This could help travel companies disseminate travel information to their users more quickly, improving customer satisfaction, and could free up employees to focus on other productive work.

The 2 subreddits that we have chosen are [r/JapanTravel](https://www.reddit.com/r/JapanTravel/) and [r/SoloTravel](https://www.reddit.com/r/solotravel). They were chosen for their similarity, and we will see whether our model can successfully tell them apart.

## Executive Summary

We will first use the Reddit API to pull data from both the r/JapanTravel and r/SoloTravel subreddits and put it into dataframes for easier analysis. Next, we will perform feature selection, keeping only the features of the pulled data that are relevant to us, followed by data cleaning such as checking for nulls and duplicates.

Before we embark on data exploration, we will apply some preprocessing to the posts. This is part of Natural Language Processing (NLP), which allows computers to work with human language. Some of the steps are:
 * Remove links using regex
 * Remove non-letters using regex
 * Convert text to lower case
 * Lemmatize words which gives us the base words
 * Remove stopwords, which are common words in English sentence structure such as "the" and "them".
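
The steps above can be sketched in a few lines. This is a minimal illustration rather than the notebook's actual preprocessing: `clean_post` and the tiny `STOPWORDS` set are hypothetical names introduced here, and lemmatization is left out because it needs an external library and corpus (for example NLTK).

```python
import re

# Tiny illustrative stopword set (an assumption; a real run would use a full
# list such as the one shipped with NLTK or scikit-learn).
STOPWORDS = {"the", "them", "a", "an", "and", "to", "in", "of", "is", "i"}


def clean_post(text):
    """Remove links and non-letters, lower-case, and drop stopwords."""
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"[^a-zA-Z]", " ", text)     # remove non-letters
    words = text.lower().split()               # lower-case, split on whitespace
    return " ".join(w for w in words if w not in STOPWORDS)


print(clean_post("Visit https://example.com in Tokyo, the BEST city!"))
```

Links are removed first because stripping non-letters first would break a URL into ordinary-looking words that would then survive the cleaning.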

Next we will perform some exploratory data analysis on the processed posts to identify interesting trends and relationships between the words of each subreddit.

We will then perform our modelling: after a train-test split, we will fit and compare 2 models, Logistic Regression and Multinomial Naive Bayes. Each model will be run with 2 vectorizers, CountVectorizer and TfidfVectorizer, to see which combination gives the best accuracy.

Our goal is to achieve at least **85%** accuracy given that both topics are very similar in nature. We will choose the model that comes closest to this goal.

We will then evaluate the model based on metrics such as confusion matrix, ROC AUC and F1 to analyze our chosen model better.
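
As a reminder of how these metrics relate, accuracy, precision, recall and F1 can all be derived from the four cells of a binary confusion matrix. The sketch below uses a hypothetical helper, `classification_metrics`, with made-up counts. They are not results from this project, and the function assumes none of the denominators is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive common metrics from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=40, fp=10, fn=10, tn=40))
```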

Lastly, we will present our recommendations and findings from our model that we have generated.

## Contents:

* [Webscraping from Reddit API](#Webscraping-from-Reddit-API)
 * [r/JapanTravel](#r/JapanTravel)
 * [JapanTravel Dataframe](#JapanTravel-Dataframe)
 * [r/SoloTravel](#r/SoloTravel)
 * [SoloTravel Dataframe](#SoloTravel-Dataframe)
* [Data Cleaning](#Data-Cleaning)
 * [Feature Selection](#Feature-Selection)
 * [Check for null values](#Check-for-null-values)
 * [Check for any blank rows](#Check-for-any-blank-rows)
 * [Check for duplicates](#Check-for-duplicates)
 * [Check for outliers](#Check-for-outliers)
 * [Check for bot posts](#Check-for-bot-posts)
* [Export Clean Datasets](#Export-Clean-Datasets)
 

## Webscraping from Reddit API

##### r/JapanTravel

In [2]:
#start an empty list to store posts
japan_posts = []

#param empty for the first iteration
after = None

for a in range(50):
    url = 'https://www.reddit.com/r/JapanTravel/.json'
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    
    #send request to url
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})
    
    #in case of error
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    #get post
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    japan_posts.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a sleep duration to look more 'natural'
    sleep_duration = random.randint(1,3)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/JapanTravel/.json
1
https://www.reddit.com/r/JapanTravel/.json?after=t3_kkverw
3
https://www.reddit.com/r/JapanTravel/.json?after=t3_k5xxyt
2
https://www.reddit.com/r/JapanTravel/.json?after=t3_jvgdbd
1
https://www.reddit.com/r/JapanTravel/.json?after=t3_jl3o7r
3
https://www.reddit.com/r/JapanTravel/.json?after=t3_j9b22d
2
https://www.reddit.com/r/JapanTravel/.json?after=t3_iv1ctn
1
https://www.reddit.com/r/JapanTravel/.json?after=t3_ig98oa
2
https://www.reddit.com/r/JapanTravel/.json?after=t3_i2is41
3
https://www.reddit.com/r/JapanTravel/.json?after=t3_hsuxck
1
https://www.reddit.com/r/JapanTravel/.json?after=t3_hkt0gh
2
https://www.reddit.com/r/JapanTravel/.json?after=t3_haaan0
3
https://www.reddit.com/r/JapanTravel/.json?after=t3_gq5wnr
3
https://www.reddit.com/r/JapanTravel/.json?after=t3_g8rt3o
1
https://www.reddit.com/r/JapanTravel/.json?after=t3_fw6ih9
2
https://www.reddit.com/r/JapanTravel/.json?after=t3_fjelye
3
https://www.reddit.com/r/JapanTravel/.js

In [3]:
len(japan_posts)

1232
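
One thing to note about the loop above: Reddit returns `after` as null on the last page of a listing. When that happens, the loop falls back to the first-page URL and starts collecting the same posts again, which may account for some of the duplicates removed later. Below is a minimal sketch of a variant that stops at the last page. `fetch_reddit_page` and `collect_posts` are hypothetical helpers that are not part of the original notebook, and the fetch function is passed in so the paging logic can be exercised without network access.

```python
import time


def fetch_reddit_page(subreddit, after=None):
    """Fetch one listing page as a dict (same endpoint and header as above)."""
    import requests  # imported lazily so collect_posts can be tested offline

    params = {"after": after} if after else {}
    res = requests.get(
        f"https://www.reddit.com/r/{subreddit}/.json",
        params=params,
        headers={"User-agent": "Pony Inc 1.0"},
    )
    res.raise_for_status()  # raise on any HTTP error status
    return res.json()


def collect_posts(fetch_page, max_pages=50, pause=0.0):
    """Collect posts page by page, stopping when the listing is exhausted."""
    posts = []
    after = None  # no cursor for the first request
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:  # last page reached: stop instead of restarting
            break
        time.sleep(pause)  # be polite between requests
    return posts


# Example usage (performs real HTTP requests, so it is left commented out):
# japan_posts = collect_posts(
#     lambda after: fetch_reddit_page("JapanTravel", after), pause=2
# )
```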

##### JapanTravel Dataframe

In [4]:
#convert posts to dataframe
japan_df = pd.DataFrame(japan_posts)
japan_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,url_overridden_by_dest,author_cakeday
0,,JapanTravel,##**January 2021 - [**Japan has again closed t...,t2_e5ic9,False,,0,False,"Japan Travel, COVID-19, And You: Guidelines On...",[],...,all_ads,True,https://www.reddit.com/r/JapanTravel/comments/...,1558523,1609458000.0,0,,False,,
1,,JapanTravel,[**Original Article Here.**](http://www.asahi....,t2_e5ic9,False,,0,False,Discussion: Organizers Express Doubts About Ho...,[],...,all_ads,True,https://www.reddit.com/r/JapanTravel/comments/...,1558523,1610251000.0,0,,False,,
2,,JapanTravel,"Like many here, I am desperately wishing I cou...",t2_5jw67axr,False,,0,False,Reflecting on last year’s (less than ordinary)...,[],...,all_ads,False,https://www.reddit.com/r/JapanTravel/comments/...,1558523,1610305000.0,0,,False,,
3,,JapanTravel,"Hello!\n\nIn 2016, on a day trip from Osaka, I...",t2_7qcifpmj,False,,0,False,Name of soba restaurant in Arima,[],...,all_ads,False,https://www.reddit.com/r/JapanTravel/comments/...,1558523,1610365000.0,0,,False,,
4,,JapanTravel,I will be traveling with my husband and 2 smal...,t2_6fq4xbhy,False,,0,False,Hyogo in March (focus on nature),[],...,all_ads,False,https://www.reddit.com/r/JapanTravel/comments/...,1558523,1610341000.0,0,,False,,


In [5]:
japan_df.shape

(1232, 109)

##### r/SoloTravel

In [6]:
#start an empty list to store posts
solo_posts = []

#param empty for the first iteration
after = None

for a in range(50):
    url = 'https://www.reddit.com/r/solotravel/.json'
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    
    #send request to url
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})
    
    #in case of error
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    #get post
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    solo_posts.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(1,3)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/solotravel/.json
2
https://www.reddit.com/r/solotravel/.json?after=t3_kt107k
1
https://www.reddit.com/r/solotravel/.json?after=t3_krff7s
2
https://www.reddit.com/r/solotravel/.json?after=t3_kohus8
3
https://www.reddit.com/r/solotravel/.json?after=t3_kkp3k4
1
https://www.reddit.com/r/solotravel/.json?after=t3_kiwjjn
1
https://www.reddit.com/r/solotravel/.json?after=t3_kgqq04
2
https://www.reddit.com/r/solotravel/.json?after=t3_kf0vvs
3
https://www.reddit.com/r/solotravel/.json?after=t3_kcvji5
2
https://www.reddit.com/r/solotravel/.json?after=t3_kaa4er
3
https://www.reddit.com/r/solotravel/.json?after=t3_k83zcj
3
https://www.reddit.com/r/solotravel/.json?after=t3_k5y6lh
3
https://www.reddit.com/r/solotravel/.json?after=t3_k3wmjp
3
https://www.reddit.com/r/solotravel/.json?after=t3_k11qxn
3
https://www.reddit.com/r/solotravel/.json?after=t3_jym75o
3
https://www.reddit.com/r/solotravel/.json?after=t3_jv8iy1
1
https://www.reddit.com/r/solotravel/.json?after=t3_jtlps

In [None]:
len(solo_posts)

##### SoloTravel Dataframe

In [7]:
solo_df = pd.DataFrame(solo_posts)
solo_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,num_crossposts,media,is_video,link_flair_template_id,post_hint,preview,media_metadata,poll_data,author_cakeday,url_overridden_by_dest
0,,solotravel,**!!NEW!!**\n\n* **Are you planning your first...,t2_6l4z3,False,,0,False,New to solo travel? Post here for introduction...,[],...,0,,False,,,,,,,
1,,solotravel,I recently realized that part of the reason I ...,t2_15fd72,False,,0,False,Travel is the ultimate game,[],...,0,,False,58fbe66a-c2a5-11e6-a8fd-0e8c4ad9b1dc,,,,,,
2,,solotravel,"Before things got crazy this past year, I was ...",t2_9gwo6u6o,False,,0,False,Using solo travel to help combat depression,[],...,0,,False,58fbe66a-c2a5-11e6-a8fd-0e8c4ad9b1dc,,,,,,
3,,solotravel,I'm going to be road tripping across the US fo...,t2_3t4a8a6u,False,,0,False,Can small dogs handle big hikes? Tips apprecia...,[],...,0,,False,c54faa40-08e1-11e7-b9e4-0e4b8b955122,,,,,,
4,,solotravel,This website meets my needs so perfectly im wo...,t2_4po26mc6,False,,0,False,Opinions of rentberry.com?,[],...,0,,False,558fec4c-c2a5-11e6-9730-0eabbe333632,,,,,,


In [8]:
solo_df.shape

(1231, 111)

## Data Cleaning

### Feature Selection

Based on the quick overview of the dataframes above, not all the features are relevant for our analysis.

In [9]:
pd.set_option("display.max_columns", 200)

In [10]:
japan_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,thumbnail_height,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,thumbnail_width,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,post_hint,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,preview,all_awardings,awarders,media_only,link_flair_template_id,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,url_overridden_by_dest,author_cakeday
0,,JapanTravel,##**January 2021 - [**Japan has again closed t...,t2_e5ic9,False,,0,False,"Japan Travel, COVID-19, And You: Guidelines On...",[],r/JapanTravel,False,6,nine,0,,,False,t3_ko0lv1,False,dark,0.98,,public,286,1,{},,,False,[],,False,False,,{},Travel Alert,False,286,,False,self,1.61038e+09,green,[],{},self,,True,,1609486000.0,text,6,,,text,self.JapanTravel,True,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,{'images': [{'source': {'url': 'https://extern...,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,841cc97a-cc76-11e5-bf3b-0e738b837a3d,False,False,True,moderator,[],False,,,,t5_2uylr,,,,ko0lv1,True,,amyranthlovely,,0,True,all_ads,False,[],False,dark,/r/JapanTravel/comments/ko0lv1/japan_travel_co...,all_ads,True,https://www.reddit.com/r/JapanTravel/comments/...,1558523,1609458000.0,0,,False,,
1,,JapanTravel,[**Original Article Here.**](http://www.asahi....,t2_e5ic9,False,,0,False,Discussion: Organizers Express Doubts About Ho...,[],r/JapanTravel,False,6,nine,0,,,False,t3_ku6tea,False,dark,0.97,,public,216,1,{},,,False,[],,False,False,,{},Travel Alert,False,216,,False,self,False,green,[],{},self,,True,,1610280000.0,text,6,,,text,self.JapanTravel,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,{'images': [{'source': {'url': 'https://extern...,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,841cc97a-cc76-11e5-bf3b-0e738b837a3d,False,False,False,moderator,[],False,,,,t5_2uylr,,,,ku6tea,True,,amyranthlovely,,146,True,all_ads,False,[],False,dark,/r/JapanTravel/comments/ku6tea/discussion_orga...,all_ads,True,https://www.reddit.com/r/JapanTravel/comments/...,1558523,1610251000.0,0,,False,,
2,,JapanTravel,"Like many here, I am desperately wishing I cou...",t2_5jw67axr,False,,0,False,Reflecting on last year’s (less than ordinary)...,[],r/JapanTravel,False,6,seven,0,,,False,t3_kukc1x,False,dark,0.93,,public,236,1,{},,,False,[],,False,False,,{},Trip Report,False,236,,False,self,1.61038e+09,,[],{},,,True,,1610334000.0,text,6,,,text,self.JapanTravel,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,1c83b898-76d0-11e4-acb6-12313b0abe67,False,False,False,,[],False,,,,t5_2uylr,,,,kukc1x,True,,Shell_fly,,59,True,all_ads,False,[],False,,/r/JapanTravel/comments/kukc1x/reflecting_on_l...,all_ads,False,https://www.reddit.com/r/JapanTravel/comments/...,1558523,1610305000.0,0,,False,,
3,,JapanTravel,"Hello!\n\nIn 2016, on a day trip from Osaka, I...",t2_7qcifpmj,False,,0,False,Name of soba restaurant in Arima,[],r/JapanTravel,False,6,two,0,,,False,t3_kv126t,False,dark,0.7,,public,5,0,{},,,False,[],,False,False,,{},Question,False,5,,False,self,False,,[],{},,,True,,1610394000.0,text,6,,,text,self.JapanTravel,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,,[],[],False,58745c5a-714d-11e4-9ee8-12313b0e92a7,False,False,False,,[],False,,,,t5_2uylr,,,,kv126t,True,,somyotdisodomcia,,7,True,all_ads,False,[],False,,/r/JapanTravel/comments/kv126t/name_of_soba_re...,all_ads,False,https://www.reddit.com/r/JapanTravel/comments/...,1558523,1610365000.0,0,,False,,
4,,JapanTravel,I will be traveling with my husband and 2 smal...,t2_6fq4xbhy,False,,0,False,Hyogo in March (focus on nature),[],r/JapanTravel,False,6,four,0,,,False,t3_kuvndd,False,dark,0.8,,public,12,0,{},,,False,[],,False,False,,{},Recommendations,False,12,,False,self,False,,[],{},,,True,,1610370000.0,text,6,,,text,self.JapanTravel,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,,[],[],False,53abfb24-714d-11e4-8f92-12313d14791d,False,False,False,,[],False,,,,t5_2uylr,,,,kuvndd,True,,namaehanandesuka,,22,True,all_ads,False,[],False,,/r/JapanTravel/comments/kuvndd/hyogo_in_march_...,all_ads,False,https://www.reddit.com/r/JapanTravel/comments/...,1558523,1610341000.0,0,,False,,


We have therefore selected the following features on which to perform our analysis:
* author
 * user who created the post
* title
 * title of the post
* selftext
 * text in the post
* subreddit
 * name of the subreddit

#### r/JapanTravel

In [12]:
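#selecting only relevant features for our dataframe
#(.copy() gives an independent dataframe, so adding columns later does not raise SettingWithCopyWarning)
japan_df = japan_df[['author','title','selftext','subreddit']].copy()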

In [13]:
#first few rows of the dataframe
japan_df.head()

Unnamed: 0,author,title,selftext,subreddit
0,amyranthlovely,"Japan Travel, COVID-19, And You: Guidelines On...",##**January 2021 - [**Japan has again closed t...,JapanTravel
1,amyranthlovely,Discussion: Organizers Express Doubts About Ho...,[**Original Article Here.**](http://www.asahi....,JapanTravel
2,Shell_fly,Reflecting on last year’s (less than ordinary)...,"Like many here, I am desperately wishing I cou...",JapanTravel
3,somyotdisodomcia,Name of soba restaurant in Arima,"Hello!\n\nIn 2016, on a day trip from Osaka, I...",JapanTravel
4,namaehanandesuka,Hyogo in March (focus on nature),I will be traveling with my husband and 2 smal...,JapanTravel


In [14]:
#check the shape of the dataframe
japan_df.shape

(1232, 4)

#### r/SoloTravel

In [15]:
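#selecting only relevant features for our dataframe
#(.copy() gives an independent dataframe, so adding columns later does not raise SettingWithCopyWarning)
solo_df = solo_df[['author','title','selftext','subreddit']].copy()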
solo_df.head()

Unnamed: 0,author,title,selftext,subreddit
0,AutoModerator,New to solo travel? Post here for introduction...,**!!NEW!!**\n\n* **Are you planning your first...,solotravel
1,lostkarma4anonymity,Travel is the ultimate game,I recently realized that part of the reason I ...,solotravel
2,Ihatemygoddamnguts,Using solo travel to help combat depression,"Before things got crazy this past year, I was ...",solotravel
3,TheEntertainer17,Can small dogs handle big hikes? Tips apprecia...,I'm going to be road tripping across the US fo...,solotravel
4,redwithblackspots527,Opinions of rentberry.com?,This website meets my needs so perfectly im wo...,solotravel


In [16]:
solo_df.shape

(1231, 4)

### Check for null values

#### r/JapanTravel

In [17]:
#check for all nulls in each column
japan_df.isnull().sum()

author       0
title        0
selftext     0
subreddit    0
dtype: int64

#### r/SoloTravel

In [18]:
#check for all nulls in each column
solo_df.isnull().sum()

author       0
title        0
selftext     0
subreddit    0
dtype: int64

### Check for any blank rows 

Here we will check for any blank rows in both subreddits and analyze the best strategy to deal with them.

In [19]:
#define function to calculate blanks in features
def blanks(df):
    for col in df:
        print(col)
        print("Number of blanks: " + str((df[col] == '').sum()))
        print("\n")

In [20]:
blanks(japan_df)

author
Number of blanks: 0


title
Number of blanks: 0


selftext
Number of blanks: 3


subreddit
Number of blanks: 0




In [21]:
blanks(solo_df)

author
Number of blanks: 0


title
Number of blanks: 0


selftext
Number of blanks: 8


subreddit
Number of blanks: 0




Since the blanks in both datasets are found only in the selftext column, we will simply combine the title and selftext columns into a single 'posts' column that represents the complete post made by a subreddit user.

In [22]:
#combine 'title' & 'selftext' columns
japan_df['posts'] = japan_df[['title', 'selftext']].agg(' '.join, axis=1)
solo_df['posts'] = solo_df[['title', 'selftext']].agg(' '.join, axis=1)

In [26]:
pd.set_option("max_colwidth", 100)
japan_df.head()

Unnamed: 0,author,title,selftext,subreddit,posts
0,amyranthlovely,"Japan Travel, COVID-19, And You: Guidelines On Travel &amp; Pandemic News Update Thread - Januar...","##**January 2021 - [**Japan has again closed their borders to all new entries at this time, due ...",JapanTravel,"Japan Travel, COVID-19, And You: Guidelines On Travel &amp; Pandemic News Update Thread - Januar..."
1,amyranthlovely,Discussion: Organizers Express Doubts About Hosting Tokyo Olympics &amp; The Future Of Travel To...,[**Original Article Here.**](http://www.asahi.com/ajw/articles/14091366)\n\n##We are opening thi...,JapanTravel,Discussion: Organizers Express Doubts About Hosting Tokyo Olympics &amp; The Future Of Travel To...
2,Shell_fly,Reflecting on last year’s (less than ordinary) trip to Japan,"Like many here, I am desperately wishing I could be traveling to Japan this year. With the incre...",JapanTravel,"Reflecting on last year’s (less than ordinary) trip to Japan Like many here, I am desperately wi..."
3,somyotdisodomcia,Name of soba restaurant in Arima,"Hello!\n\nIn 2016, on a day trip from Osaka, I visited a soba restaurant in the small town of Ar...",JapanTravel,"Name of soba restaurant in Arima Hello!\n\nIn 2016, on a day trip from Osaka, I visited a soba r..."
4,namaehanandesuka,Hyogo in March (focus on nature),I will be traveling with my husband and 2 small children to Kobe in March from another prefectur...,JapanTravel,Hyogo in March (focus on nature) I will be traveling with my husband and 2 small children to Kob...


In [27]:
solo_df.head()

Unnamed: 0,author,title,selftext,subreddit,posts
0,AutoModerator,"New to solo travel? Post here for introductions, newbie questions, anxiety and excitement - Week...","**!!NEW!!**\n\n* **Are you planning your first big trip to Europe? Check out our [brand-new, det...",solotravel,"New to solo travel? Post here for introductions, newbie questions, anxiety and excitement - Week..."
1,lostkarma4anonymity,Travel is the ultimate game,I recently realized that part of the reason I like solo travel so much is because it gives me a ...,solotravel,Travel is the ultimate game I recently realized that part of the reason I like solo travel so mu...
2,Ihatemygoddamnguts,Using solo travel to help combat depression,"Before things got crazy this past year, I was an avid traveler and took off whenever the chance ...",solotravel,"Using solo travel to help combat depression Before things got crazy this past year, I was an avi..."
3,TheEntertainer17,Can small dogs handle big hikes? Tips appreciated.,I'm going to be road tripping across the US for the next six months and exploring lots of nation...,solotravel,Can small dogs handle big hikes? Tips appreciated. I'm going to be road tripping across the US f...
4,redwithblackspots527,Opinions of rentberry.com?,This website meets my needs so perfectly im worried it’s too good to be true so if anyone’s used...,solotravel,Opinions of rentberry.com? This website meets my needs so perfectly im worried it’s too good to ...


In [28]:
pd.reset_option("max_colwidth")

### Check for duplicates

Here we will check for any duplicate posts in both subreddits and drop them as they will distort our results. 

#### r/JapanTravel

In [29]:
#drop all duplicates if any
japan_df.drop_duplicates(inplace = True)

#check shape of df
japan_df.shape

(980, 5)

252 duplicated posts were removed from the JapanTravel subreddit (1232 rows before, 980 after).

#### r/SoloTravel

In [30]:
#drop all duplicates if any
solo_df.drop_duplicates(inplace = True)

#check shape of df
solo_df.shape

(930, 5)

301 duplicated posts were removed from the SoloTravel subreddit (1231 rows before, 930 after).
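
`drop_duplicates` compares the values of the columns kept after feature selection. An alternative is to drop repeats before the dataframe is built, keyed on Reddit's per-post `name` field (values such as `t3_ko0lv1` appear in the raw dataframe above). A minimal sketch, where `dedupe_posts` is a hypothetical helper that is not part of the original notebook:

```python
def dedupe_posts(posts, key="name"):
    """Keep the first occurrence of each post, identified by post[key]."""
    seen = set()
    unique = []
    for post in posts:
        if post[key] not in seen:
            seen.add(post[key])
            unique.append(post)
    return unique


# Made-up records for illustration; real entries come from the API response.
raw = [
    {"name": "t3_aaa", "title": "first"},
    {"name": "t3_bbb", "title": "second"},
    {"name": "t3_aaa", "title": "first"},
]
print(len(dedupe_posts(raw)))
```

Keying on the identifier only removes the same post fetched twice, whereas comparing text columns would also merge distinct posts that happen to share an author, title and body.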

### Check for outliers

We will first review the rows of each dataset and their contents.

In [31]:
pd.set_option("display.unicode.east_asian_width", True)
pd.set_option("display.max_rows", 1000)

In [33]:
japan_df

Unnamed: 0,author,title,selftext,subreddit,posts
0,amyranthlovely,"Japan Travel, COVID-19, And You: Guidelines On...",##**January 2021 - [**Japan has again closed t...,JapanTravel,"Japan Travel, COVID-19, And You: Guidelines On..."
1,amyranthlovely,Discussion: Organizers Express Doubts About Ho...,[**Original Article Here.**](http://www.asahi....,JapanTravel,Discussion: Organizers Express Doubts About Ho...
2,Shell_fly,Reflecting on last year’s (less than ordinary)...,"Like many here, I am desperately wishing I cou...",JapanTravel,Reflecting on last year’s (less than ordinary)...
3,somyotdisodomcia,Name of soba restaurant in Arima,"Hello!\n\nIn 2016, on a day trip from Osaka, I...",JapanTravel,Name of soba restaurant in Arima Hello!\n\nIn ...
4,namaehanandesuka,Hyogo in March (focus on nature),I will be traveling with my husband and 2 smal...,JapanTravel,Hyogo in March (focus on nature) I will be tra...
5,Waffleboy,October 2021 Itinerary Question,"Hey all, I was hoping to get some feedback on ...",JapanTravel,"October 2021 Itinerary Question Hey all, I was..."
6,VanillaWinter,1 Week Itinerary for April 2022. Can anyone sa...,Me and (hopefully) 3 other friends are going t...,JapanTravel,1 Week Itinerary for April 2022. Can anyone sa...
7,Theiiaa,Drive your own Car/Motorcycle in Japan - How?,"Hi guys, I would like to ask you for help as I...",JapanTravel,Drive your own Car/Motorcycle in Japan - How? ...
8,Plus294,Question: Car rental in Hokkaido for 6 people,"Hi there fellow travellers,\n\nWe are thinking...",JapanTravel,Question: Car rental in Hokkaido for 6 people ...
9,johnnynjohnjohn,Google Reviews Restaurants,Hello! \n\nI have noticed that the ‘google sco...,JapanTravel,Google Reviews Restaurants Hello! \n\nI have n...


In [36]:
solo_df

Unnamed: 0,author,title,selftext,subreddit,posts
0,AutoModerator,New to solo travel? Post here for introduction...,**!!NEW!!**\n\n* **Are you planning your first...,solotravel,New to solo travel? Post here for introduction...
1,lostkarma4anonymity,Travel is the ultimate game,I recently realized that part of the reason I ...,solotravel,Travel is the ultimate game I recently realize...
2,Ihatemygoddamnguts,Using solo travel to help combat depression,"Before things got crazy this past year, I was ...",solotravel,Using solo travel to help combat depression Be...
3,TheEntertainer17,Can small dogs handle big hikes? Tips apprecia...,I'm going to be road tripping across the US fo...,solotravel,Can small dogs handle big hikes? Tips apprecia...
4,redwithblackspots527,Opinions of rentberry.com?,This website meets my needs so perfectly im wo...,solotravel,Opinions of rentberry.com? This website meets ...
5,wonderfullywell,Does anyone here have experience traveling wit...,Hey guys.\n\nI'm going to be traveling to Sout...,solotravel,Does anyone here have experience traveling wit...
6,newbikerzz7184,Solo Traveling Across North American Remote Ar...,"Before I start, I apologize if some of my ques...",solotravel,Solo Traveling Across North American Remote Ar...
7,AliveandDrive,How concerned are you about plane accidents?,In light of the accident involving the Sriwija...,solotravel,How concerned are you about plane accidents? I...
8,bobbricks1,Central Asia and South/East Africa - worth doi...,Currently in the midst of planning out some so...,solotravel,Central Asia and South/East Africa - worth doi...
9,gzmdza,ISO: RV companies that do one way rentals.,"Hello, I’m in search of guidance/recommendatio...",solotravel,ISO: RV companies that do one way rentals. Hel...


Reviewing the authors in the 2 subreddit datasets yields 2 observations:

**AutoModerator**
 * These posts are generated by the AutoModerator bot rather than by actual subreddit users, and hence we will remove them.
 
**[deleted]**
 * These are posts whose authors' accounts are no longer active. The posts themselves are still relevant, and hence we will keep them.

### Check for bot posts

In [34]:
japan_df[japan_df['author']=='AutoModerator'].count()

author       0
title        0
selftext     0
subreddit    0
posts        0
dtype: int64

In [35]:
solo_df[solo_df['author']=='AutoModerator'].count()

author       38
title        38
selftext     38
subreddit    38
posts        38
dtype: int64

In [37]:
solo_df = solo_df.drop(solo_df[solo_df['author']=='AutoModerator'].index)

In [38]:
solo_df.shape

(892, 5)

In [39]:
#final check that all 'AutoModerator' posts have been dropped
solo_df[solo_df['author']=='AutoModerator'].count()

author       0
title        0
selftext     0
subreddit    0
posts        0
dtype: int64

## Export Clean Datasets

In [40]:
japan_df.to_csv("../datasets/japan_travel.csv", index = False)

In [41]:
solo_df.to_csv("../datasets/solo_travel.csv", index = False)