# Project 3: Web APIs & NLP

---

## Part 1: Data Collection

### Contents:

- [Problem Statement](#Problem-Statement)
- [Data Collection](#Data-Collection)

---

## Problem Statement

The PlayStation 5 (PS5) is the latest video gaming console from Sony Interactive Entertainment (Sony) and was released in Nov 2020. Almost 2 years after its release, demand for the PS5 is still strong, and the global chip shortage caused by the pandemic has limited Sony's production of the console. As a result, gamers are going back to its predecessor, the PlayStation 4 (PS4), for its extensive range of game titles and [continued support](https://pledgetimes.com/playstation-4-sonys-support-will-last-another-3-years/) by Sony, even though it was released in Nov 2013.

Gamers deciding whether to purchase the PS4 or PS5 can check out [r/PS5](https://www.reddit.com/r/PS5/) and [r/PS4](https://www.reddit.com/r/PS4/) on Reddit for discussions on these gaming consoles, or create their own posts. To help new users decide which subreddit is more suitable to post on, we will build a classifier on posts collected through [Pushshift's](https://github.com/pushshift/api) API. Posts will be collected from r/PS5 and r/PS4, and classification models will be trained to predict which subreddit a given post came from.

We will train Logistic Regression, Naive Bayes, Decision Tree and Random Forest models, and select the best-performing model based on its accuracy score.

## Data Collection

#### Import libraries

In [1]:
import requests
import pandas as pd
import datetime
import time
import random


In [2]:
pd.set_option('display.max_columns', 200)

#### Testing of Pushshift Reddit API

In [3]:
url = 'https://api.pushshift.io/reddit/search/submission'

In [4]:
params = {
    'subreddit': ['PS4'],
    'size': 100,
}

In [5]:
res = requests.get(url, params)
res.status_code

200
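A status code of 200 means the request succeeded. As a minimal sketch, the check can be folded into a small helper that refuses to parse a failed response. The helper name `extract_posts` and the `StubResponse` class are hypothetical and only stand in for a real `requests` response so the example runs offline; `raise_for_status()` and `json()` are the standard `requests.Response` methods being mimicked.

```python
def extract_posts(response):
    """Return the list of posts from a Pushshift-style response, or raise on failure."""
    response.raise_for_status()       # raises for 4xx/5xx responses on a real requests.Response
    payload = response.json()
    return payload.get("data", [])    # assumption: posts sit under the 'data' key, as in the cells here


class StubResponse:
    """Hypothetical offline stand-in for requests.Response (illustration only)."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def raise_for_status(self):
        if self.status_code >= 400:
            raise RuntimeError(f"request failed with status {self.status_code}")

    def json(self):
        return self._payload


ok_posts = extract_posts(StubResponse(200, {"data": [{"id": "w54bc4"}]}))
```

With a real request, the same helper would be called as `extract_posts(requests.get(url, params=params))`.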

In [6]:
data = res.json()

In [7]:
posts = data['data']

In [8]:
df = pd.DataFrame(posts)

In [9]:
df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,removed_by_category,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,author_flair_background_color,author_flair_template_id,author_flair_text_color,media,media_embed,secure_media,secure_media_embed,thumbnail_height,thumbnail_width,url_overridden_by_dest
0,[],False,oitullopsutinos,,[],,text,t2_2d2l37fu,False,False,False,[],False,False,1658474858,self.PS4,https://www.reddit.com/r/PS4/comments/w54bc4/n...,{},w54bc4,False,False,False,False,False,False,True,False,#373c3f,"[{'a': ':patch:', 'e': 'emoji', 'u': 'https://...",9f028410-6e55-11eb-91b3-0e4c6846b9fd,:patch: Help &amp; Tech Support,light,richtext,False,False,True,1,0,False,all_ads,/r/PS4/comments/w54bc4/need_help_with_a_dualsh...,False,6,moderator,1658474869,1,[removed],True,False,False,PS4,t5_2rrlp,5504974,public,self,Need help with a DualShock issue,0,[],1.0,https://www.reddit.com/r/PS4/comments/w54bc4/n...,all_ads,6,,,,,,,,,,,,
1,[],False,VELOCETTES,,[],,text,t2_kfdj8pza,False,False,False,[],False,False,1658474651,self.PS4,https://www.reddit.com/r/PS4/comments/w549ea/d...,{},w549ea,False,False,False,False,False,False,True,False,#373c3f,"[{'a': ':patch:', 'e': 'emoji', 'u': 'https://...",9f028410-6e55-11eb-91b3-0e4c6846b9fd,:patch: Help &amp; Tech Support,light,richtext,False,False,True,1,0,False,all_ads,/r/PS4/comments/w549ea/dns_issues/,False,6,moderator,1658474662,1,[removed],True,False,False,PS4,t5_2rrlp,5504973,public,self,DNS issues,0,[],1.0,https://www.reddit.com/r/PS4/comments/w549ea/d...,all_ads,6,,,,,,,,,,,,
2,[],False,SIXARIUS,,[],,text,t2_2myy8ulq,False,False,False,[],False,False,1658474527,self.PS4,https://www.reddit.com/r/PS4/comments/w54878/h...,{},w54878,False,True,False,False,False,True,True,False,#4d82be,"[{'a': ':game:', 'e': 'emoji', 'u': 'https://e...",6644f6fe-89af-11e3-b610-12313d27e9a3,:game: Game Discussion,light,richtext,False,False,True,0,0,False,all_ads,/r/PS4/comments/w54878/how_does_gta_the_trilog...,False,6,,1658474537,1,I never played GTA 3 and never finished vice c...,True,False,False,PS4,t5_2rrlp,5504972,public,self,How does GTA the trilogy DE run on base PS4?,0,[],1.0,https://www.reddit.com/r/PS4/comments/w54878/h...,all_ads,6,,,,,,,,,,,,
3,[],False,SIXARIUS,,[],,text,t2_2myy8ulq,False,False,False,[],False,False,1658474361,self.PS4,https://www.reddit.com/r/PS4/comments/w546jf/g...,{},w546jf,False,False,False,False,False,False,True,False,#373c3f,"[{'a': ':info:', 'e': 'emoji', 'u': 'https://e...",92947472-6e55-11eb-9307-0ea1d6002db3,:info: General Question,light,richtext,False,False,True,1,0,False,all_ads,/r/PS4/comments/w546jf/gta_the_trilogy_the_def...,False,6,moderator,1658474371,1,[removed],True,False,False,PS4,t5_2rrlp,5504970,public,self,GTA the trilogy the definitive edition runs well?,0,[],1.0,https://www.reddit.com/r/PS4/comments/w546jf/g...,all_ads,6,,,,,,,,,,,,
4,[],False,MrGeek767,,[],,text,t2_5lxgahgo,False,False,False,[],False,False,1658474320,self.PS4,https://www.reddit.com/r/PS4/comments/w5464h/c...,{},w5464h,False,False,False,False,False,False,True,False,#373c3f,"[{'a': ':info:', 'e': 'emoji', 'u': 'https://e...",92947472-6e55-11eb-9307-0ea1d6002db3,:info: General Question,light,richtext,False,False,True,1,0,False,all_ads,/r/PS4/comments/w5464h/can_i_use_a_us_card_on_...,False,6,moderator,1658474330,1,[removed],True,False,False,PS4,t5_2rrlp,5504967,public,self,Can I use a US card on my KSA account?,0,[],1.0,https://www.reddit.com/r/PS4/comments/w5464h/c...,all_ads,6,,,,,,,,,,,,


In [10]:
df['subreddit'].value_counts()

PS4    100
Name: subreddit, dtype: int64

In [11]:
df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_richtext', 'link_flair_template_id', 'link_flair_text',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'removed_by_category', 'retrieved_on', 'score', 'selftext',
       'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id',
       'subredd

In [12]:
posts[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'oitullopsutinos',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_2d2l37fu',
 'author_is_blocked': False,
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1658474858,
 'domain': 'self.PS4',
 'full_link': 'https://www.reddit.com/r/PS4/comments/w54bc4/need_help_with_a_dualshock_issue/',
 'gildings': {},
 'id': 'w54bc4',
 'is_created_from_ads_ui': False,
 'is_crosspostable': False,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': False,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '#373c3f',
 'link_flair_richtext': [{'a': ':patch:',
   'e': 'emoji',
   'u': 'https://emoji.redditmedia.com/jyy1ago822n41_t5_2rrlp/patch'},
  {'e': 'text', 't': ' Help &amp; Tech Su

After examining a sample post from Reddit, we will keep the following features, which should be useful for analysis: <br>
`subreddit`, `id`: identifier for subreddit and post<br>
`title`: title of post<br>
`selftext`: description within post<br>
`removed_by_category`: whether the post was removed by Reddit/moderator due to violation of subreddit rules<br>
`created_utc`: post creation time (in Unix epoch format)<br>
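The feature selection above can be sketched on a single post. The values below are copied from the sample post shown earlier; the extra `score` key and the `created_at` column name are only there for illustration. Using `reindex` instead of `.loc` is an assumption on my part: it fills a missing column with NaN rather than raising a `KeyError`.

```python
import pandas as pd

# One post, reduced to a few keys taken from the sample output above
sample_post = {
    "subreddit": "PS4",
    "id": "w54bc4",
    "title": "Need help with a DualShock issue",
    "selftext": "[removed]",
    "removed_by_category": "moderator",
    "created_utc": 1658474858,
    "score": 1,  # an extra field that we do not keep
}

features = ["subreddit", "id", "title", "selftext", "removed_by_category", "created_utc"]

# Keep only the selected features, in a fixed order
sample_df = pd.DataFrame([sample_post]).reindex(columns=features)

# Convert the Unix epoch (seconds) into a timezone-aware UTC timestamp
sample_df["created_at"] = pd.to_datetime(sample_df["created_utc"], unit="s", utc=True)

print(sample_df[["id", "created_utc", "created_at"]])
```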

#### Create function to extract submissions from Reddit

In [13]:
# define function to get submissions from subreddit

def submission(subreddit, iteration):
    """Collect `iteration` pages of up to 100 posts each from `subreddit`, newest first."""
    # starting cutoff: 1 Jul 2022, 00:00 (naive datetime, so timestamp() assumes local time)
    current_time = int(datetime.datetime(2022, 7, 1, 0, 0).timestamp())
    reddit_list = []
    features_list = ['subreddit',                    # columns to keep for export
                     'id',
                     'title',
                     'selftext',
                     'removed_by_category',
                     'created_utc'
                    ]

    for n in range(iteration):
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': current_time                   # only fetch posts older than the last page
        }
        res = requests.get(url, params=params)       # `url` is the Pushshift endpoint defined above
        time.sleep(random.randint(1, 3))             # random 1-3 s delay between requests
        df = pd.DataFrame(res.json()['data'])
        df = df.loc[:, features_list]
        reddit_list.append(df)
        current_time = int(df['created_utc'].min())  # next page starts before the earliest post seen
        if (n + 1) % 10 == 0:                        # print status update every 10 iterations
            print(f"{n+1} out of {iteration} iteration.")

    return pd.concat(reddit_list, axis=0)
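The `before` cursor is what lets each request continue where the previous one stopped. The sketch below isolates that paging logic and runs it against a fake, in-memory data source, so no network is needed. `paginate_before`, `fake_fetch` and `fake_posts` are hypothetical names, and the early stop on an empty page is an addition that the function above does not have.

```python
import pandas as pd


def paginate_before(fetch_page, start_before, max_pages):
    """Walk backwards in time, one page per call, until max_pages or no posts remain."""
    before = start_before
    pages = []
    for _ in range(max_pages):
        posts = fetch_page(before)                  # list of dicts, newest first
        if not posts:                               # nothing older left: stop early
            break
        page = pd.DataFrame(posts)
        pages.append(page)
        before = int(page["created_utc"].min())     # move the cursor to the oldest post seen
    if not pages:
        return pd.DataFrame()
    return pd.concat(pages, ignore_index=True)


# Hypothetical stand-in for the API: ten posts with timestamps 100, 90, ..., 10
fake_posts = [{"id": f"p{t}", "created_utc": t} for t in range(100, 0, -10)]


def fake_fetch(before, size=4):
    """Return up to `size` posts strictly older than `before`, newest first."""
    older = [p for p in fake_posts if p["created_utc"] < before]
    return older[:size]


result = paginate_before(fake_fetch, start_before=101, max_pages=5)
print(len(result), result["created_utc"].min())
```

Because the comparison is strictly "older than", posts that share the exact timestamp of a page's oldest post could be skipped; this may be part of why a run of 100 pages returns slightly fewer than 10,000 rows.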

#### Extract from PS4 subreddit

In [14]:
ps4_df = submission('PS4', 100)

10 out of 100 iteration.
20 out of 100 iteration.
30 out of 100 iteration.
40 out of 100 iteration.
50 out of 100 iteration.
60 out of 100 iteration.
70 out of 100 iteration.
80 out of 100 iteration.
90 out of 100 iteration.
100 out of 100 iteration.


In [15]:
ps4_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9992 entries, 0 to 99
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   subreddit            9992 non-null   object
 1   id                   9992 non-null   object
 2   title                9992 non-null   object
 3   selftext             9969 non-null   object
 4   removed_by_category  7109 non-null   object
 5   created_utc          9992 non-null   int64 
dtypes: int64(1), object(5)
memory usage: 546.4+ KB
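The index range `0 to 99` in the summary above shows that `pd.concat` kept each page's own row labels, so the same label appears once per page. A minimal sketch of tidying this up before export, using two tiny made-up pages (`page_1`, `page_2` are illustrative, not real data); whether the collected data actually contains duplicate `id` values is not something the output above tells us.

```python
import pandas as pd

# Two made-up pages that overlap on post "b"
page_1 = pd.DataFrame({"id": ["a", "b"], "created_utc": [20, 10]})
page_2 = pd.DataFrame({"id": ["b", "c"], "created_utc": [10, 5]})

stacked = pd.concat([page_1, page_2], axis=0)
print(list(stacked.index))           # row labels repeat across pages

# Drop repeated posts by id, then rebuild a clean 0..n-1 index
cleaned = stacked.drop_duplicates(subset="id").reset_index(drop=True)
print(cleaned)
```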


In [17]:
ps4_df.to_csv("../datasets/ps4.csv", index=False)

#### Extract from PS5 subreddit

In [18]:
ps5_df = submission('PS5', 100)

10 out of 100 iteration.
20 out of 100 iteration.
30 out of 100 iteration.
40 out of 100 iteration.
50 out of 100 iteration.
60 out of 100 iteration.
70 out of 100 iteration.
80 out of 100 iteration.
90 out of 100 iteration.
100 out of 100 iteration.


In [20]:
ps5_df.to_csv("../datasets/ps5.csv", index=False)

In [21]:
ps5_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9992 entries, 0 to 99
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   subreddit            9992 non-null   object
 1   id                   9992 non-null   object
 2   title                9992 non-null   object
 3   selftext             9985 non-null   object
 4   removed_by_category  3092 non-null   object
 5   created_utc          9992 non-null   int64 
dtypes: int64(1), object(5)
memory usage: 546.4+ KB
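The two summaries differ most in `removed_by_category`. As a rough sketch, the non-null counts can be turned into shares of removed posts, assuming (per the feature description above) that a non-null value means the post was removed. The counts are copied from the `info()` outputs; the variable names are illustrative.

```python
# Counts copied from the info() outputs for the two subreddits
total_posts = 9992
removed_non_null = {"PS4": 7109, "PS5": 3092}

# Share of collected posts that carry a removal category
removed_share = {name: count / total_posts for name, count in removed_non_null.items()}

for name, share in removed_share.items():
    print(f"r/{name}: {share:.1%} of collected posts have a removal category")
```

If removed posts are later dropped during cleaning, the two classes would become noticeably imbalanced, which is worth keeping in mind when accuracy is the chosen metric.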
