# Project 3 : Web APIs & NLP
#### Author: Najiha Boosra (DSI-NYC)

## Problem Statement

This is a binary classification problem. Our goal is two-fold:
- Use an API to collect posts from two subreddits.
- Use NLP to train a classifier that predicts which subreddit a given post came from.


We gather and prepare our data using the `requests` library, then create and compare two models. One of these must be a Bayes classifier; the other can be logistic regression, a bagging classifier, a random forest, extra trees, etc. We will pick one based on the results. Because this is a classification problem, success will be measured by the accuracy score.



## Executive Summary 

After pulling data through the API, we work with posts from r/baby and r/Pets on Reddit. The text contains a lot of punctuation, symbols, extra spaces, etc., so we clean the data as much as possible before performing EDA, and then remove stopwords. In the modeling part we create a baseline model and then several classification models. From these models we get train, test, and cross-validation scores, which are the key to choosing the model to evaluate. We then derive the desired metric from the confusion matrix. Finally, we extract coefficients from an interpretable model, logistic regression.

## Contents:

- **[Case Study: Using `requests` to scrape](#Case-Study)**.

- **[Import Libraries](#Data-Import-Libraries)**.  

- **[Converting to Pandas Dataframe](#Converting-to-Pandas-Dataframe)**.

- **[Automating Multiple Pull Requests](#Automating-Multiple-Pull-Requests)**.


### Case Study: Using `requests` to scrape

#### Import Libraries 

In [188]:
import requests

import pandas as pd
import numpy as np

import datetime as dt
import time

In [189]:
base_url = "https://api.pushshift.io/reddit/search/submission/"

In [190]:
res = requests.get(base_url)

Checking the status code

In [191]:
res.status_code

200

Requesting posts from two subreddits, baby and Pets

In [192]:
res_baby = requests.get(base_url,
                       params ={"subreddit" : "baby",
                               "size" : 500
                               })
res_Pets = requests.get(base_url,
                       params={"subreddit" : "Pets",
                              "size" : 500
                              })
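As a rough illustration of what the `params` dictionary does, the sketch below builds the same kind of URL with the standard library only. `build_query_url` is a hypothetical helper introduced here for illustration, not part of `requests`; `requests` performs a comparable encoding internally when `params` is passed.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_query_url(base_url, params):
    """Append URL-encoded query parameters to a base URL (hypothetical helper)."""
    return f"{base_url}?{urlencode(params)}"


baby_url = build_query_url(BASE_URL, {"subreddit": "baby", "size": 500})
pets_url = build_query_url(BASE_URL, {"subreddit": "Pets", "size": 500})
print(baby_url)
print(pets_url)
```

Printing the URL this way can help when debugging a request, since the exact query string is visible before anything is sent.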

Parse the JSON response body into Python objects (a list of dictionaries, one per post)

In [193]:
data_baby = res_baby.json()["data"]
data_Pets = res_Pets.json()["data"]
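To make the shape of the parsed response concrete, here is a small sketch that runs without network access. The payload string is a hypothetical, heavily trimmed stand-in for a real response (only `id`, `subreddit`, and `is_self` are kept), and `json.loads` is used as a rough equivalent of what `res.json()` does.

```python
import json

# Hypothetical, trimmed payload; real responses carry many more fields per post.
sample_body = (
    '{"data": ['
    '{"id": "g5csbd", "subreddit": "baby", "is_self": false}, '
    '{"id": "g5eec1", "subreddit": "Pets", "is_self": true}'
    "]}"
)

payload = json.loads(sample_body)  # roughly what res.json() returns
posts = payload["data"]            # a list of dicts, one per post

# Each post is a plain dict, so fields are read by key.
self_post_ids = [post["id"] for post in posts if post["is_self"]]
print(len(posts), self_post_ids)
```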

In [251]:
#data_baby

[{'all_awardings': [],
  'allow_live_comments': False,
  'author': 'lisharathi',
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_text': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_5dabh8b0',
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'can_mod_post': False,
  'contest_mode': False,
  'created_utc': 1587464966,
  'domain': 'lishaspune.blogspot.com',
  'full_link': 'https://www.reddit.com/r/baby/comments/g5csbd/best_dohale_jevan_ideas/',
  'gildings': {},
  'id': 'g5csbd',
  'is_crosspostable': False,
  'is_meta': False,
  'is_original_content': False,
  'is_reddit_media_domain': False,
  'is_robot_indexable': False,
  'is_self': False,
  'is_video': False,
  'link_flair_background_color': '',
  'link_flair_richtext': [],
  'link_flair_text_color': 'dark',
  'link_flair_type': 'text',
  'locked': False,
  'media_only': False,
  'no_follow': True,
  'num_comments': 0,
  'num_crossposts': 0,
  'over_18':

Checking how many posts were returned for baby

In [195]:
len(data_baby)

500

In [196]:
#data_Pets

[{'all_awardings': [],
  'allow_live_comments': False,
  'author': 'bs_05',
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_text': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_50egdvye',
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'can_mod_post': False,
  'contest_mode': False,
  'created_utc': 1587472445,
  'domain': 'self.Pets',
  'full_link': 'https://www.reddit.com/r/Pets/comments/g5eec1/cat_peeing_the_bed/',
  'gildings': {},
  'id': 'g5eec1',
  'is_crosspostable': True,
  'is_meta': False,
  'is_original_content': False,
  'is_reddit_media_domain': False,
  'is_robot_indexable': True,
  'is_self': True,
  'is_video': False,
  'link_flair_background_color': '',
  'link_flair_richtext': [],
  'link_flair_text_color': 'dark',
  'link_flair_type': 'text',
  'locked': False,
  'media_only': False,
  'no_follow': True,
  'num_comments': 1,
  'num_crossposts': 0,
  'over_18': False,
  'parent_whitelist

Checking how many posts were returned for Pets

In [197]:
len(data_Pets)

500

### Converting to Pandas Dataframe

In [198]:
baby_df = pd.DataFrame(data_baby)
baby_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,edited,media,secure_media,media_embed,secure_media_embed,author_flair_background_color,author_flair_text_color,crosspost_parent,crosspost_parent_list,media_metadata
0,[],False,lisharathi,,[],,text,t2_5dabh8b0,False,False,...,,,,,,,,,,
1,[],False,babeeclothing,,[],,text,t2_5dd5i7si,False,False,...,,,,,,,,,,
2,[],False,Hazzay88,,[],,text,t2_5c73kp0,False,False,...,,,,,,,,,,
3,[],False,babeeclothing,,[],,text,t2_5dd5i7si,False,False,...,,,,,,,,,,
4,[],False,babeeclothing,,[],,text,t2_5dd5i7si,False,False,...,,,,,,,,,,


In [199]:
Pets_df = pd.DataFrame(data_Pets)
Pets_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,author_flair_background_color,author_flair_text_color,edited,crosspost_parent,crosspost_parent_list,author_cakeday,poll_data,media_metadata,banned_by,author_flair_template_id
0,[],False,bs_05,,[],,text,t2_50egdvye,False,False,...,,,,,,,,,,
1,[],False,Fredz161099,,[],,text,t2_dy4r02d,False,False,...,,,,,,,,,,
2,[],False,parad0x88,,[],,text,t2_k326c,False,False,...,,,,,,,,,,
3,[],False,milkowu,,[],,text,t2_5auj1ko7,False,False,...,,,,,,,,,,
4,[],False,Floofy_hbjb,,[],,text,t2_q4emoxw,False,False,...,,,,,,,,,,


Concatenating both DataFrames

In [200]:
result_df = pd.concat([baby_df, Pets_df])

In [201]:
result_df.shape

(1000, 75)

In [202]:
result_df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'permalink',
       'pinned', 'removed_by_category', 'retrieved_on', 'score', 'selftext',
       'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id',
       'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'title',
       'total_awards_received', 'treatment_tags', 'url', 'post_hint',
       'previe

Picking some columns which seem important for our analysis

In [224]:
subfields =['subreddit',"subreddit_type", 'author',
           'created_utc','link_flair_text_color', 
           'link_flair_type','retrieved_on', 'score',
            'subreddit_subscribers', 'subreddit_type',  
           'title', 'domain', 'full_link', 'url', 
        'is_reddit_media_domain', 'no_follow', 'send_replies',   
          'can_mod_post', 'contest_mode',
            'is_crosspostable', 'is_meta', 
        'is_original_content', 'is_robot_indexable', 
            'is_self', 'is_video','locked', 'media_only',
        'over_18','pinned','spoiler', 'stickied']
result_df = result_df[subfields]

In [225]:
result_df.head()

Unnamed: 0,subreddit,subreddit_type,subreddit_type.1,author,created_utc,link_flair_text_color,link_flair_type,retrieved_on,score,subreddit_subscribers,...,is_original_content,is_robot_indexable,is_self,is_video,locked,media_only,over_18,pinned,spoiler,stickied
0,Pets,public,public,bs_05,1587472445,dark,text,1587472447,1,124006,...,False,True,True,False,False,False,False,False,False,False
1,Pets,public,public,Fredz161099,1587471627,dark,text,1587471629,1,124006,...,False,False,True,False,False,False,False,False,False,False
2,baby,public,public,Hazzay88,1587449519,dark,text,1587449521,1,7055,...,False,True,True,False,False,False,False,False,False,False
2,Pets,public,public,parad0x88,1587471378,dark,text,1587471380,1,124005,...,False,True,True,False,False,False,False,False,False,False
4,Pets,public,public,Floofy_hbjb,1587470008,dark,text,1587470010,1,124004,...,False,True,True,False,False,False,False,False,False,False


In [226]:
result_df.tail()

Unnamed: 0,subreddit,subreddit_type,subreddit_type.1,author,created_utc,link_flair_text_color,link_flair_type,retrieved_on,score,subreddit_subscribers,...,is_original_content,is_robot_indexable,is_self,is_video,locked,media_only,over_18,pinned,spoiler,stickied
496,Pets,public,public,[deleted],1586889930,dark,text,1586956197,1,123337,...,False,False,True,False,False,False,False,False,False,False
497,Pets,public,public,theblowfish5,1586889928,dark,text,1586956197,1,123337,...,False,True,True,False,False,False,False,False,False,False
498,Pets,public,public,[deleted],1586888557,dark,text,1586954994,1,123335,...,False,False,True,False,False,False,False,False,False,False
499,baby,public,public,Mhairib,1583768787,dark,text,1583768795,1,6774,...,False,True,True,False,False,False,False,False,False,False
499,Pets,public,public,controlisanillusion,1586888362,dark,text,1586954821,1,123335,...,False,True,True,False,False,False,False,False,False,False


In [227]:
result_df.shape

(912, 33)

Remove duplicates

In [228]:
result_df = result_df.loc[result_df.astype(str).drop_duplicates().index]
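One thing to watch for here: `pd.concat` keeps each frame's original index, so index labels repeat in the combined frame. Selecting by label with `.loc` can then return every row that shares a kept label, which may leave repeated rows in place. The sketch below uses toy data (not the notebook's frames) to compare label-based selection with a boolean mask; it illustrates one possible alternative and is not the notebook's actual method.

```python
import pandas as pd

# Toy frames standing in for baby_df and Pets_df. The "tags" column holds lists,
# which are unhashable, hence the cast to str before comparing rows.
first = pd.DataFrame({"id": ["x1", "x2"], "tags": [[], []]})
second = pd.DataFrame({"id": ["x2", "x3"], "tags": [[], []]})
combined = pd.concat([first, second])  # index labels are 0, 1, 0, 1

# Label-based selection: every kept label matches two rows, so rows multiply.
kept_labels = combined.astype(str).drop_duplicates().index
by_label = combined.loc[kept_labels]

# Position-based boolean mask: keeps only the first occurrence of each row.
is_duplicate = combined.astype(str).duplicated()
by_mask = combined[~is_duplicate.to_numpy()].reset_index(drop=True)

print(len(by_label), len(by_mask))
```

Passing `ignore_index=True` to `pd.concat` is another way to avoid repeated labels in the first place.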

Keeping only the rows where `is_self` is True

In [229]:
# filter only `is_self` == True
result_df = result_df[result_df["is_self"]]

In [230]:
result_df.head()

Unnamed: 0,subreddit,subreddit_type,subreddit_type.1,author,created_utc,link_flair_text_color,link_flair_type,retrieved_on,score,subreddit_subscribers,...,is_original_content,is_robot_indexable,is_self,is_video,locked,media_only,over_18,pinned,spoiler,stickied
0,Pets,public,public,bs_05,1587472445,dark,text,1587472447,1,124006,...,False,True,True,False,False,False,False,False,False,False
0,Pets,public,public,bs_05,1587472445,dark,text,1587472447,1,124006,...,False,True,True,False,False,False,False,False,False,False
1,Pets,public,public,Fredz161099,1587471627,dark,text,1587471629,1,124006,...,False,False,True,False,False,False,False,False,False,False
1,Pets,public,public,Fredz161099,1587471627,dark,text,1587471629,1,124006,...,False,False,True,False,False,False,False,False,False,False
2,baby,public,public,Hazzay88,1587449519,dark,text,1587449521,1,7055,...,False,True,True,False,False,False,False,False,False,False


Converting epoch time to a date

In [231]:
dt.date.fromtimestamp(1587473233)

datetime.date(2020, 4, 21)

Creating `timestamp` column using `created_utc` column

In [232]:
result_df["timestamp"] = result_df["created_utc"].map(dt.date.fromtimestamp)

result_df["timestamp"].head()

0    2020-04-21
0    2020-04-21
1    2020-04-21
1    2020-04-21
2    2020-04-21
Name: timestamp, dtype: object
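Note that `dt.date.fromtimestamp` interprets the epoch value in the local timezone of the machine running the notebook, so a post created near midnight can land on a different date on another machine. A minimal sketch, assuming UTC dates are wanted; the helper name `epoch_to_utc_date` is made up for illustration:

```python
import datetime as dt


def epoch_to_utc_date(epoch_seconds):
    """Convert seconds since the Unix epoch to a calendar date in UTC."""
    return dt.datetime.fromtimestamp(epoch_seconds, tz=dt.timezone.utc).date()


# Two `created_utc` values taken from the rows shown earlier.
created_utc_values = [1587472445, 1587449519]
utc_dates = [epoch_to_utc_date(value) for value in created_utc_values]
print(utc_dates)
```

With a DataFrame, the same function could be passed to `.map` in place of `dt.date.fromtimestamp`.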

In [233]:
result_df.head()

Unnamed: 0,subreddit,subreddit_type,subreddit_type.1,author,created_utc,link_flair_text_color,link_flair_type,retrieved_on,score,subreddit_subscribers,...,is_robot_indexable,is_self,is_video,locked,media_only,over_18,pinned,spoiler,stickied,timestamp
0,Pets,public,public,bs_05,1587472445,dark,text,1587472447,1,124006,...,True,True,False,False,False,False,False,False,False,2020-04-21
0,Pets,public,public,bs_05,1587472445,dark,text,1587472447,1,124006,...,True,True,False,False,False,False,False,False,False,2020-04-21
1,Pets,public,public,Fredz161099,1587471627,dark,text,1587471629,1,124006,...,False,True,False,False,False,False,False,False,False,2020-04-21
1,Pets,public,public,Fredz161099,1587471627,dark,text,1587471629,1,124006,...,False,True,False,False,False,False,False,False,False,2020-04-21
2,baby,public,public,Hazzay88,1587449519,dark,text,1587449521,1,7055,...,True,True,False,False,False,False,False,False,False,2020-04-21


Using `time.sleep()` to pause between iterations

In [234]:
for i in range(5):
    print(i)
    time.sleep(1)

0
1
2
3
4
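The same pause can be placed between API calls so that requests are spaced out. The sketch below is one way to structure that, under the assumption that a fixed delay is sufficient; `fetch_with_delay` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `requests.get` so the example runs offline.

```python
import time


def fetch_with_delay(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # pause before every call except the first
        results.append(fetch(url))
    return results


def fake_fetch(url):
    """Stand-in for requests.get; returns a dict instead of a real response."""
    return {"url": url, "status_code": 200}


responses = fetch_with_delay(["page-1", "page-2", "page-3"], fake_fetch, delay_seconds=0.1)
print([response["url"] for response in responses])
```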


### Automating Multiple Pull Requests

Putting it all together: 

In [237]:
#Credit to Mahdi Shadkam-Farrokhi for function
#The below function obtains and "cleans" the data from a subreddit. 
#The below function utilizes the pushshift API
def query_pushshift(subreddit, kind = 'submission', day_window = 30, n = 5):
    SUBFIELDS = ['subreddit',"subreddit_type", 'author',
           'created_utc','link_flair_text_color', 
           'link_flair_type','retrieved_on', 'score',
            'subreddit_subscribers', 'subreddit_type',  
           'title', 'domain', 'full_link', 'url', 
        'is_reddit_media_domain', 'no_follow', 'send_replies',   
          'can_mod_post', 'contest_mode','is_crosspostable', 'is_meta', 
        'is_original_content', 'is_robot_indexable', 
            'is_self', 'is_video','locked', 'media_only',
        'over_18','pinned','spoiler', 'stickied']

    # establish base url and stem
    BASE_URL = f"https://api.pushshift.io/reddit/search/{kind}" # also known as the "API endpoint" 
    stem = f"{BASE_URL}?subreddit={subreddit}&size=500" # always pulling max of 500
    # instantiate empty list for temp storage
    posts = []
    # implement for loop with `time.sleep(2)`
    for i in range(1, n + 1):
        URL = "{}&after={}d".format(stem, day_window * i)
        print("Querying from: " + URL)
        response = requests.get(URL)
        assert response.status_code == 200
        mine = response.json()['data']
        df = pd.DataFrame.from_dict(mine)
        posts.append(df)
        time.sleep(2)
    # pd.concat storage list
    full = pd.concat(posts, sort=False)
    # if submission
    if kind == "submission":
        # select desired columns
        full = full[SUBFIELDS]
        # drop duplicates
        #full.drop_duplicates(inplace = True)
        # select `is_self` == True
        full = full.loc[full['is_self'] == True]
    # create `timestamp` column
    full['timestamp'] = full["created_utc"].map(dt.date.fromtimestamp)
    print("Query Complete!")    
    return full 

In [238]:
result1 = query_pushshift("baby", n= 10)

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=baby&size=500&after=30d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=baby&size=500&after=60d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=baby&size=500&after=90d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=baby&size=500&after=120d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=baby&size=500&after=150d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=baby&size=500&after=180d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=baby&size=500&after=210d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=baby&size=500&after=240d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=baby&size=500&after=270d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=baby&size=500&after=300d
Que
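The URLs printed above follow a simple pattern: the same stem with an `after` value that grows by `day_window` days on each iteration. The sketch below reproduces only that URL-building step, with no network calls; `build_window_urls` is a hypothetical helper that mirrors the string formatting inside `query_pushshift`.

```python
def build_window_urls(subreddit, kind="submission", day_window=30, n=5):
    """Return the query URLs for n windows of day_window days each."""
    base_url = f"https://api.pushshift.io/reddit/search/{kind}"
    stem = f"{base_url}?subreddit={subreddit}&size=500"
    # `after=30d`, `after=60d`, ... up to `after={day_window * n}d`
    return [f"{stem}&after={day_window * i}d" for i in range(1, n + 1)]


urls = build_window_urls("baby", n=10)
print(urls[0])
print(urls[-1])
```

Separating URL construction from the request loop makes it possible to inspect or test the queries before sending them.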

In [239]:
result1.shape

(1724, 32)

In [248]:
result1.to_csv("../data/result1.csv", index = False)

In [240]:
result2 = query_pushshift("Pets", n= 10)

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=Pets&size=500&after=30d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=Pets&size=500&after=60d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=Pets&size=500&after=90d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=Pets&size=500&after=120d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=Pets&size=500&after=150d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=Pets&size=500&after=180d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=Pets&size=500&after=210d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=Pets&size=500&after=240d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=Pets&size=500&after=270d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=Pets&size=500&after=300d
Que

In [241]:
result2.shape

(2959, 32)

In [249]:
result2.to_csv("../data/result2.csv", index = False)

Combining Data and checking shape

In [242]:
combined_db = pd.concat([result1, result2], sort = False)
combined_db.shape

(4683, 32)
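Since success is measured by accuracy, the class balance of the combined data gives a rough reference point: always predicting the larger class would score the majority share. The sketch below computes that share from the row counts reported above. These counts are taken before any further cleaning, so the baseline used in later modeling may differ.

```python
# Row counts taken from the shapes of result1 (baby) and result2 (Pets).
row_counts = {"baby": 1724, "Pets": 2959}

total_rows = sum(row_counts.values())
majority_class = max(row_counts, key=row_counts.get)
baseline_accuracy = row_counts[majority_class] / total_rows

print(total_rows, majority_class, round(baseline_accuracy, 3))
```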

Saving the combined data to CSV

In [244]:
combined_db.to_csv("../data/combined_db.csv", index = False)

We will continue with the next steps in the next notebook.