<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

### Contents:
- [Problem Statement](#Problem-Statement)
- [Data Collection](#Data-Collection)
- [Data Cleaning & EDA](#Data-Cleaning-&-EDA)
- [Preprocessing & Modeling](#Preprocessing-&-Modeling)
- [Evaluation and Conceptual Understanding](#Evaluation-and-Conceptual-Understanding)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

This notebook covers only the 'Problem Statement' and 'Data Collection' sections.

## Problem Statement

Men in Black (MIB) is a secret association that keeps top-secret information about extraterrestrials under wraps, especially at Area 51. Still, nothing is foolproof, and information gets leaked.

Some of this information leaks onto the internet via Reddit, mostly to the aliens and space subreddits.
Instead of shutting down Reddit completely or deleting the subreddits, MIB sees a subreddit like r/aliens as a useful forum: it channels all such discussion into one place, so that MIB can monitor what people are talking about.

MIB has asked the Reddit data science team to identify the key words that best capture the attention of aliens fans and differentiate r/aliens from r/space. This would support MIB's marketing efforts on social media, events and podcasts and, to a certain extent, allow MIB to steer which topics trend or to stop further leaks.

Therefore, the goal of the project is to discover the key words that best differentiate, or predict, whether a post belongs to the aliens or the space subreddit.

Three classification models (Naive Bayes, Logistic Regression and Random Forest) will be developed to address the problem statement.

The success of the model will be assessed based on its precision and F1 score on unseen test data.
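
To make the two metrics concrete, here is a minimal sketch of how precision and F1 follow from confusion-matrix counts. The function name `precision_recall_f1` and the counts are hypothetical, and treating r/aliens as the positive class is an assumption rather than something fixed by this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive and false-negative counts."""
    # Precision: of the posts predicted as the positive class, the share that really are positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the truly positive posts, the share the model found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical counts, with r/aliens assumed to be the positive class
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
```

Precision penalises space posts wrongly labelled as aliens, while F1 also accounts for aliens posts the model missed, so reporting both gives a more balanced view than either one alone.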

Reddit has roughly 395K aliens fans and 19.5 million space fans, spread all over the world. The models to be developed are expected to classify posts from the two subreddits accurately. There are enough unique posts in each subreddit to identify the key words needed for our research goal. Therefore, the scope of the project is appropriate.

This project is important to MIB and the relevant associations because the key words identified can be used to develop a successful marketing campaign and grow the fan base. The secondary stakeholders are the major aliens clubs and associations, which stand to benefit through club membership, events, shows and podcasts.


## Data Collection

- Was enough data gathered to generate a significant result?
- Was data collected that was useful and relevant to the project?
- Was data collection and storage optimized through custom functions, pipelines, and/or automation?
- Was thought given to the server receiving the requests, such as the number of requests per second?

### __Analysis__
**We collect 1,500 posts from each of the two subreddits, r/space and r/aliens, to have enough data for a meaningful analysis. The data is relevant to the project because it contains the post titles and text needed to find the words that distinguish the two subreddits. Each request returns only 100 posts, so a for loop repeats the request until 1,500 posts have been collected. To be considerate of the server receiving the requests, a one-second pause is built in after each retrieval. This also helps ensure that the retrieval is not treated as an attack.**
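
The retrieval loop described above can also be wrapped in a reusable function. What follows is a minimal sketch, not the code used in this notebook. The names `collect_posts` and `default_fetch` and the `fetch` parameter are hypothetical, introduced so the paging logic can be tried without network access. It assumes the Pushshift endpoint used here is reachable and returns posts under a `data` key, each with a `created_utc` field.

```python
import time


def default_fetch(url, params):
    """Fetch one page of posts from the API (requires the `requests` package)."""
    import requests  # imported lazily so collect_posts can be tested offline

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["data"]


def collect_posts(subreddit, n_batches=15, size=100, pause=1.0, fetch=default_fetch):
    """Collect up to n_batches * size posts, paging backwards in time via `before`."""
    url = "https://api.pushshift.io/reddit/search/submission"
    posts = []
    before = None  # the first request has no upper time bound
    for _ in range(n_batches):
        params = {"subreddit": subreddit, "size": size}
        if before is not None:
            params["before"] = before
        batch = fetch(url, params)
        if not batch:
            break  # no older posts are available
        posts.extend(batch)
        # The next request asks only for posts older than the last one received
        before = batch[-1]["created_utc"]
        # Pause between requests to limit the load on the server
        time.sleep(pause)
    return posts
```

Under these assumptions, `pd.DataFrame(collect_posts('space'))` would yield a DataFrame comparable to the one built step by step below, and the same call with `'aliens'` would replace the second loop.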

---

In [1]:
import pandas as pd 
import requests
import time

# Set max display of columns and rows
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

*Requesting for subreddit r/space*

In [2]:
# To check the response is ok (200)
url = 'https://api.pushshift.io/reddit/search/submission'
params = {'subreddit' : 'space', 'size' : 100}
response = requests.get(url, params)
response.status_code

200

In [3]:
data = response.json()
posts = data['data']
len(posts)

100

In [4]:
df = pd.DataFrame(posts)
df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,post_hint,preview,pwls,removed_by_category,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,thumbnail_height,thumbnail_width,title,total_awards_received,treatment_tags,upvote_ratio,url,url_overridden_by_dest,whitelist_status,wls,media,media_embed,secure_media,secure_media_embed,gallery_data,is_gallery,media_metadata
0,[],False,Nova-Nightjar,,[],,text,t2_fux8lubh,False,False,False,[],False,False,1640277774,i.redd.it,https://www.reddit.com/r/space/comments/rmzn01...,{},rmzn01,False,False,False,False,True,False,False,False,,imagegif,[],image/gif,dark,text,False,False,True,1,0,False,all_ads,/r/space/comments/rmzn01/does_anyone_know_what...,False,image,"{'enabled': True, 'images': [{'id': 'CYatFLWJp...",6,moderator,1640277784,1,,True,False,False,space,t5_2qh87,19430352,public,https://b.thumbs.redditmedia.com/ZRSUF7zPAJHPm...,105.0,140.0,Does anyone know what this ive seen?,0,[],1.0,https://i.redd.it/oh9lqc4gmb781.jpg,https://i.redd.it/oh9lqc4gmb781.jpg,all_ads,6,,,,,,,
1,[],False,freedemocracy2021,,[],,text,t2_afzozxgm,False,False,False,[],False,False,1640277640,minddebris.com,https://www.reddit.com/r/space/comments/rmzl73...,{},rmzl73,False,False,False,False,False,False,False,False,,,[],,dark,text,False,False,True,0,0,False,all_ads,/r/space/comments/rmzl73/nasa_plans_a_nuclear_...,False,link,"{'enabled': False, 'images': [{'id': '8boSuore...",6,reddit,1640277651,1,,True,False,False,space,t5_2qh87,19430342,public,https://b.thumbs.redditmedia.com/rQMOaJjRwXho7...,83.0,140.0,NASA Plans a Nuclear Plant on Moon,0,[],1.0,https://www.minddebris.com/nuclear/,https://www.minddebris.com/nuclear/,all_ads,6,,,,,,,
2,[],False,charmed_ridge96,,[],,text,t2_b466oaog,False,False,False,[],False,False,1640276136,self.space,https://www.reddit.com/r/space/comments/rmz1sr...,{},rmz1sr,False,True,False,False,False,True,True,False,,black,[],Discussion,dark,text,False,False,True,0,0,False,all_ads,/r/space/comments/rmz1sr/i_think_black_holes_m...,False,,,6,,1640276146,1,"Pockets specifically for holding things in ""fo...",True,False,False,space,t5_2qh87,19430234,public,self,,,I think black holes might be pockets,0,[],1.0,https://www.reddit.com/r/space/comments/rmz1sr...,,all_ads,6,,,,,,,
3,[],False,MistWeaver80,,[],,text,t2_2vdtqcmq,False,False,True,[],False,False,1640272323,icrar.org,https://www.reddit.com/r/space/comments/rmxpmy...,{},rmxpmy,False,True,False,False,False,True,False,False,,,[],,dark,text,False,False,True,0,0,False,all_ads,/r/space/comments/rmxpmy/astronomers_have_prod...,False,link,"{'enabled': False, 'images': [{'id': 'w1hbOZHi...",6,,1640272333,1,,True,False,False,space,t5_2qh87,19430002,public,https://a.thumbs.redditmedia.com/g3ski_paZSxaM...,140.0,140.0,Astronomers have produced the most comprehensi...,0,[],1.0,https://www.icrar.org/centaurus/,https://www.icrar.org/centaurus/,all_ads,6,,,,,,,
4,[],False,daryavaseum,,[],,text,t2_64p27r3n,False,False,True,[],False,False,1640272192,instagram.com,https://www.reddit.com/r/space/comments/rmxo1b...,{},rmxo1b,False,True,False,False,False,True,False,False,,,[],,dark,text,False,False,False,0,0,False,all_ads,/r/space/comments/rmxo1b/i_considered_this_one...,False,,,6,,1640272203,1,,True,False,False,space,t5_2qh87,19429993,public,default,,,I considered this one of the sharpest moon ima...,0,[],1.0,https://www.instagram.com/p/CXd5Q3oMuKw/?utm_m...,https://www.instagram.com/p/CXd5Q3oMuKw/?utm_m...,all_ads,6,,,,,,,


In [5]:
df.shape

(100, 74)

In [6]:
df[['subreddit', 'selftext', 'title']].head()

Unnamed: 0,subreddit,selftext,title
0,space,,Does anyone know what this ive seen?
1,space,,NASA Plans a Nuclear Plant on Moon
2,space,"Pockets specifically for holding things in ""fo...",I think black holes might be pockets
3,space,,Astronomers have produced the most comprehensi...
4,space,,I considered this one of the sharpest moon ima...


In [10]:
# For initial checking if it works
# df.to_csv('./data/dfspace.csv')

In [7]:
dfspace = df.copy()

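url = 'https://api.pushshift.io/reddit/search/submission'

for i in range(14):
    params = {
        'subreddit' : 'space', 
        'size' : 100, 
        'before' : dfspace['created_utc'].iloc[-1]}
    
    # Request the next batch of 100 posts older than the last post retrieved
    response = requests.get(url, params=params)
    
    # Pause for one second after every request, successful or not, to limit load on the Pushshift server
    time.sleep(1)
    
    if response.status_code != 200:
        # A failed request is skipped, which leaves the final dataset 100 posts short
        print(f'Batch {i} failed with status {response.status_code}')
        continue
    
    dataspace = response.json()
    dfspace = pd.DataFrame(dataspace['data'])
    
    # Append the new batch to the posts collected so far
    df = pd.concat([df, dfspace], ignore_index=True)
        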
        
#         For testing only
#         dfspace.to_csv('./data/dfspace'+str(i)+'.csv')

In [8]:
df.shape

(1500, 81)

In [9]:
df.to_csv('./data/space.csv')

*Requesting for subreddit r/aliens*

In [10]:
# To check the response is ok (200)
url = 'https://api.pushshift.io/reddit/search/submission'
params = {'subreddit' : 'aliens', 'size' : 100}
response = requests.get(url, params)
response.status_code

200

In [11]:
data = response.json()
posts = data['data']
len(posts)

100

In [12]:
df = pd.DataFrame(posts)
df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,removed_by_category,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,url_overridden_by_dest,whitelist_status,wls,link_flair_css_class,link_flair_template_id,link_flair_text,author_flair_background_color,author_flair_text_color,thumbnail_height,thumbnail_width,post_hint,preview,media,media_embed,secure_media,secure_media_embed,media_metadata,is_gallery,author_flair_template_id
0,[],False,organicfrog328,,[],,text,t2_g0k75nu4,False,False,False,[],False,False,1640268023,youtu.be,https://www.reddit.com/r/aliens/comments/rmwa8...,{},rmwa8h,False,False,False,False,False,False,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/aliens/comments/rmwa8h/astronomers_detected...,False,6,reddit,1640268033,1,,True,False,False,aliens,t5_2qktn,391220,public,default,Astronomers detected strange signal from Proxi...,0,[],1.0,https://youtu.be/v4cLECMVQZU,https://youtu.be/v4cLECMVQZU,all_ads,6,,,,,,,,,,,,,,,,
1,[],False,Jubilant-Vision2,,[],,text,t2_5polnc6r,False,False,False,[],False,False,1640263836,self.aliens,https://www.reddit.com/r/aliens/comments/rmv0y...,{},rmv0y8,False,True,False,False,False,True,True,False,#20692e,"[{'e': 'text', 't': 'Discussion '}, {'a': ':ta...",light,richtext,False,False,True,0,0,False,all_ads,/r/aliens/comments/rmv0y8/i_love_alien_films_a...,False,6,,1640263847,1,The earth gets infected by a space cloud or ne...,True,False,False,aliens,t5_2qktn,391220,public,self,I love alien films and stories. I came up with...,0,[],1.0,https://www.reddit.com/r/aliens/comments/rmv0y...,,all_ads,6,discussion,bf719cba-094d-11e3-a1e5-12313d096169,Discussion :table:,,,,,,,,,,,,,
2,[],False,SpotDeusVult,,[],,text,t2_d5xhe3fo,False,False,False,[],False,False,1640260803,self.aliens,https://www.reddit.com/r/aliens/comments/rmu72...,{},rmu72o,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/aliens/comments/rmu72o/sorry_guys_but_i_don...,False,6,,1640260813,1,"Common forms of life yes, but not life like us...",True,False,False,aliens,t5_2qktn,391217,public,self,"Sorry guys, but i don't think inteligent life ...",0,[],1.0,https://www.reddit.com/r/aliens/comments/rmu72...,,all_ads,6,,,,,,,,,,,,,,,,
3,[],False,Soggy-Investigator53,,[],,text,t2_gjl4p3xr,False,False,False,[],False,False,1640259457,self.aliens,https://www.reddit.com/r/aliens/comments/rmtuo...,{},rmtuoz,False,True,False,False,False,True,True,False,#20692e,"[{'e': 'text', 't': 'Discussion '}, {'a': ':ta...",light,richtext,False,False,True,0,0,False,all_ads,/r/aliens/comments/rmtuoz/some_philosophical_d...,False,6,,1640259468,1,\n\n1.There is no real time.\n\n2.Time is an...,True,False,False,aliens,t5_2qktn,391213,public,self,Some philosophical deduction if we live in Fra...,0,[],1.0,https://www.reddit.com/r/aliens/comments/rmtuo...,,all_ads,6,discussion,bf719cba-094d-11e3-a1e5-12313d096169,Discussion :table:,,,,,,,,,,,,,
4,[],False,[deleted],,,,,,False,,,[],False,False,1640256943,24newstodays.com,https://www.reddit.com/r/aliens/comments/rmt74...,{},rmt74v,False,False,False,False,False,False,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/aliens/comments/rmt74v/𝟸𝟶𝟶ʏᴇᴀʀᴏʟᴅ_ᴘᴀɪɴᴛɪɴɢ_...,False,6,deleted,1640256954,1,[deleted],True,False,False,aliens,t5_2qktn,391215,public,default,𝟸𝟶𝟶-ʏᴇᴀʀ-ᴏʟᴅ ᴘᴀɪɴᴛɪɴɢ ᴅᴇsᴄʀɪʙᴇs sʜᴀᴘᴇsʜɪғᴛɪɴɢ ...,0,[],1.0,https://24newstodays.com/?p=371,https://24newstodays.com/?p=371,all_ads,6,,,,,dark,73.0,140.0,,,,,,,,,


In [13]:
df.shape

(100, 77)

In [14]:
df[['subreddit', 'selftext', 'title']].head()

Unnamed: 0,subreddit,selftext,title
0,aliens,,Astronomers detected strange signal from Proxi...
1,aliens,The earth gets infected by a space cloud or ne...,I love alien films and stories. I came up with...
2,aliens,"Common forms of life yes, but not life like us...","Sorry guys, but i don't think inteligent life ..."
3,aliens,\n\n1.There is no real time.\n\n2.Time is an...,Some philosophical deduction if we live in Fra...
4,aliens,[deleted],𝟸𝟶𝟶-ʏᴇᴀʀ-ᴏʟᴅ ᴘᴀɪɴᴛɪɴɢ ᴅᴇsᴄʀɪʙᴇs sʜᴀᴘᴇsʜɪғᴛɪɴɢ ...


In [22]:
# For initial checking if it works
# df.to_csv('./data/dfaliens.csv')

In [15]:
dfaliens = df.copy()

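url = 'https://api.pushshift.io/reddit/search/submission'

for i in range(14):
    params = {
        'subreddit' : 'aliens', 
        'size' : 100, 
        'before' : dfaliens['created_utc'].iloc[-1]}
    
    # Request the next batch of 100 posts older than the last post retrieved
    response = requests.get(url, params=params)
    
    # Pause for one second after every request, successful or not, to limit load on the Pushshift server
    time.sleep(1)
    
    if response.status_code != 200:
        # A failed request is skipped, which leaves the final dataset 100 posts short
        print(f'Batch {i} failed with status {response.status_code}')
        continue
    
    dataaliens = response.json()
    dfaliens = pd.DataFrame(dataaliens['data'])
    
    # Append the new batch to the posts collected so far
    df = pd.concat([df, dfaliens], ignore_index=True)
        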
        
#         For testing only        
#         dfaliens.to_csv('./data/dfaliens'+str(i)+'.csv')

In [16]:
df.shape

(1500, 82)

In [17]:
df.to_csv('./data/aliens.csv')