# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 3: Web APIs & NLP

## Problem Statement
---

Interest in Tolkien's works has increased recently, driven by the highly anticipated Lord of the Rings (LOTR) series arriving in September.
With that, LOTR fans are revisiting Peter Jackson's two trilogies, LOTR and The Hobbit. Both LOTR and Tolkien fans are discussing the works on Reddit and many other online discussion forums.
<br>

However, any Tolkien fan knows how extensive and in-depth Tolkien's creation is. Furthermore, characters can exist and stay involved across several ages of Tolkien's world.
<br>

Hence, the production team wants to help LOTR fans differentiate online discussions of the two movie trilogies (LOTR and The Hobbit), so that LOTR fans get the correct base knowledge before the series is released.
<br>

_Note: Fans who have only watched the Peter Jackson movies are referred to as LOTR fans in the statement above, while hardcore Tolkien nerds who have read the books are referred to as Tolkien fans._

## Gather data
---

In [1]:
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup
from random import randint, seed

Set the URL to pull the data

In [2]:
url = "https://api.pushshift.io/reddit/search/submission"

In [3]:
dataset = pd.DataFrame()

Create a function to pull the data

In [4]:
def data_pull(utc_time, topic, x_iter):
    # Page backwards through a subreddit, 100 submissions per request
    ds_list = []
    for i in range(x_iter):
        params = {
            'subreddit': topic,
            'size': 100,
            'before': utc_time,
        }

        res = requests.get(url, params=params)

        print(f'Pull number: {i+1}')
        print(f'Status code: {res.status_code}')

        posts = res.json()['data']
        if not posts:
            # No older submissions left, so stop early
            break
        ds_list.append(pd.DataFrame(posts))

        # The next request asks for submissions older than the last one seen
        utc_time = posts[-1]["created_utc"]

        # Wait a random 3 to 8 seconds between requests
        # (re-seeding inside the loop would make every wait identical)
        time.sleep(randint(3, 8))

    dataset = pd.concat(ds_list, axis=0)

    return dataset
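
The `before` parameter is what lets the loop page backwards through time: each request asks for submissions older than the oldest one already collected. Below is a minimal offline sketch of that idea. `paginate_before`, `fake_fetch` and `FAKE_POSTS` are hypothetical names used only for illustration, and the fake fetcher stands in for the real HTTP call so the cursor logic can be checked without network access.

```python
import pandas as pd


def paginate_before(fetch, start_utc, max_pages):
    """Collect pages by moving a `before` cursor back in time.

    `fetch(before)` is a hypothetical callable that returns a list of
    post dicts, newest first, each with a `created_utc` key.
    """
    frames = []
    before = start_utc
    for _ in range(max_pages):
        posts = fetch(before)
        if not posts:  # nothing older is left, so stop early
            break
        frames.append(pd.DataFrame(posts))
        before = posts[-1]["created_utc"]  # oldest post in this page
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Stand-in for the API: ten made-up posts with timestamps 10 down to 1.
FAKE_POSTS = [{"title": f"post {t}", "created_utc": t} for t in range(10, 0, -1)]


def fake_fetch(before, size=4):
    """Return up to `size` fake posts strictly older than `before`."""
    older = [p for p in FAKE_POSTS if p["created_utc"] < before]
    return older[:size]


result = paginate_before(fake_fetch, start_utc=11, max_pages=5)
print(len(result), result["created_utc"].tolist())
```

Passing the fetcher in as an argument is a design choice for this sketch only; it keeps the paging logic separate from the network call so it can be exercised with fake data.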

<br>
Pull data for Lord of the Rings

In [5]:
dataset = data_pull(1658237975, "lordoftherings", 18)

Pull number: 1
Status code: 200
Pull number: 2
Status code: 200
Pull number: 3
Status code: 200
Pull number: 4
Status code: 200
Pull number: 5
Status code: 200
Pull number: 6
Status code: 200
Pull number: 7
Status code: 200
Pull number: 8
Status code: 200
Pull number: 9
Status code: 200
Pull number: 10
Status code: 200
Pull number: 11
Status code: 200
Pull number: 12
Status code: 200
Pull number: 13
Status code: 200
Pull number: 14
Status code: 200
Pull number: 15
Status code: 200
Pull number: 16
Status code: 200
Pull number: 17
Status code: 200
Pull number: 18
Status code: 200


In [6]:
dataset

| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | media_metadata | removed_by_category | discussion_type | suggested_sort | author_flair_background_color | author_flair_text_color | poll_data | author_cakeday | author_flair_template_id | banned_by |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | eHarder | | [] | | text | t2_1680jg | False | False | ... | | | | | | | | | | |
| 1 | [] | False | storywriter109 | | [] | | text | t2_3yaarop9 | False | False | ... | | | | | | | | | | |
| 2 | [] | False | WerbStudio | | [] | | text | t2_bjdj3201 | False | False | ... | | | | | | | | | | |
| 3 | [] | False | Late_Stage_PhD | | [] | | text | t2_j9ej76cc | False | False | ... | | | | | | | | | | |
| 4 | [] | False | MD_Dreamer53214 | | [] | | text | t2_7uj1xhii | False | False | ... | | | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | [] | False | Affectionate_Fox_288 | | [] | | text | t2_hhsosuxb | False | False | ... | | | | | | | | | | |
| 96 | [] | False | Wolf_of_Hoth | | [] | | text | t2_4wbq38 | False | False | ... | | | | | | | | | | |
| 97 | [] | False | hurrydual | | [] | | text | t2_k3n6gqp2 | False | False | ... | | | | | | | | | | |
| 98 | [] | False | luvirinqewqewewq | | [] | | text | t2_89d12uc7 | False | False | ... | | | | | | | | | | |


Retrieve only the required columns

In [7]:
dataset = dataset[["subreddit", "title", "selftext"]]

In [8]:
dataset

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | lordoftherings | How this sub acts in every comment section abo... | |
| 1 | lordoftherings | how forgiving is Eru | I am making a maia that fell from grace due to... |
| 2 | lordoftherings | Music changes everything | |
| 3 | lordoftherings | Everything the RoP showrunners have said in in... | Some common critiques fans have for the showru... |
| 4 | lordoftherings | Fights when well choreographed despite being s... | |
| ... | ... | ... | ... |
| 95 | lordoftherings | Gondolin by Donato Giancola | |
| 96 | lordoftherings | Question about that scene in Two Towers with A... | hello! it’s my first time watching the movies ... |
| 97 | lordoftherings | It's probably the coolest piece I own now. | |
| 98 | lordoftherings | My most favorite one | |
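
The row labels in the output above stop at 98 even though 18 pulls were combined, which suggests that each pull kept its own 0-based index after `pd.concat`. A small sketch with made-up frames shows the difference `ignore_index=True` makes; the frame contents here are illustrative and are not data from the pull.

```python
import pandas as pd

# Two made-up "pulls", each with its own 0-based index.
pull_a = pd.DataFrame({"title": ["a0", "a1"]})
pull_b = pd.DataFrame({"title": ["b0", "b1"]})

# Default behaviour: the original row labels are kept.
kept = pd.concat([pull_a, pull_b], axis=0)

# ignore_index=True: rows are relabelled as one continuous range.
renumbered = pd.concat([pull_a, pull_b], axis=0, ignore_index=True)

print(kept.index.tolist())        # labels repeat across pulls
print(renumbered.index.tolist())  # one continuous range
```

Calling `reset_index(drop=True)` on an already combined frame gives the same continuous labels.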


Exporting to CSV

In [9]:
dataset.to_csv("lordoftherings.csv", index=False)
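
Many submissions have no `selftext`. One thing worth knowing before reading the file back: with pandas defaults, empty CSV fields are loaded as `NaN` rather than as empty strings. Below is a minimal sketch of that round trip. It uses an in-memory buffer and a made-up two-row frame instead of the real file, so the rows and the `sample`, `loaded` and `cleaned` names are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up rows: one submission without body text, one with.
sample = pd.DataFrame({
    "subreddit": ["lordoftherings", "lordoftherings"],
    "title": ["Music changes everything", "how forgiving is Eru"],
    "selftext": ["", "some body text"],
})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # index=False keeps the row labels out of the file
buffer.seek(0)

loaded = pd.read_csv(buffer)
print(loaded["selftext"].isna().tolist())  # which rows lost their empty string

# Replace missing text with empty strings before any text processing.
cleaned = loaded.fillna({"selftext": ""})
print(cleaned["selftext"].tolist())
```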

<br>
Pull data for The Hobbit and repeat the same steps

In [10]:
dataset_2 = data_pull(1658237975, "thehobbit", 18)

Pull number: 1
Status code: 200
Pull number: 2
Status code: 200
Pull number: 3
Status code: 200
Pull number: 4
Status code: 200
Pull number: 5
Status code: 200
Pull number: 6
Status code: 200
Pull number: 7
Status code: 200
Pull number: 8
Status code: 200
Pull number: 9
Status code: 200
Pull number: 10
Status code: 200
Pull number: 11
Status code: 200
Pull number: 12
Status code: 200
Pull number: 13
Status code: 200
Pull number: 14
Status code: 200
Pull number: 15
Status code: 200
Pull number: 16
Status code: 200
Pull number: 17
Status code: 200
Pull number: 18
Status code: 200


In [11]:
dataset_2 = dataset_2[["subreddit", "title", "selftext"]]

In [12]:
dataset_2

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | TheHobbit | The Complete SECOND AGE of Middle-earth Timeli... | |
| 1 | TheHobbit | Rings of Power Playlist - full of Middle-Earth... | |
| 2 | TheHobbit | Tolkien taught us Physics | 5am thoughts brought to you by Tolkien loving ... |
| 3 | TheHobbit | It took a while to get here but it was worth it 😍 | |
| 4 | TheHobbit | Are there any more Audiobooks like The Hobbit ... | if there are any please send for all of the bo... |
| ... | ... | ... | ... |
| 95 | TheHobbit | GOLLUM READS, Harry Potter and the Sorcerer's ... | |
| 96 | TheHobbit | DAGOR DAGORATH (J.R.R Tolkien Short Film) Reac... | |
| 97 | TheHobbit | OMG.... They look soo nice!! | |
| 98 | TheHobbit | Found myself a perfect t-shirt | |


In [13]:
dataset_2.to_csv("thehobbits.csv", index = False)