# Project 3 - Web APIs & NLP

### *Executive Summary*

ABC Travel aims to help travellers easily gather information during the COVID-19 pandemic. Rules and regulations are constantly changing, and the cost of travel has gone up due to the various restrictions.

To build our information portal, we have decided that the best "hacks" and tips come from people with first-hand experience. For phase 1 of the portal, we will scrape data from two subreddits: r/shoestring for budget travel, and r/solotravel for solo travellers.

We will then use NLP models to classify the posts into two tags on our portal, solo and budgetbarbies, and compare a couple of models to see which gives the best results.

## Introduction

### Background

The [recovery forecast](https://www.bain.com/insights/air-travel-forecast-when-will-airlines-recover-from-covid-19-interactive/) for air travel looks promising. However, COVID-19 has drastically changed the travel landscape, and planning a trip now requires a lot more research.

### Problem Statement

Trip planning draws on too many information sources, making it hard to tell which are accurate or up to date.

Our company, ABC Travel, has decided to build a one-stop information portal, drawing on travellers' prior experiences, to make trip planning hassle-free in the time of COVID-19.

### Contents

* [Webscraping](#link1)
* [Data Cleaning](01_cleaning_eda.ipynb#link2)
* [EDA](01_cleaning_eda.ipynb#link3)
* [Model Preparation](02_modelling.ipynb#link6)
* [Modelling](02_modelling.ipynb#link4)
* [Recommendations](02_modelling.ipynb#link5)

### Data Dictionary

#### Data Sets
* [`shoestring.csv`](data/shoestring.csv) -- original data from r/shoestring
* [`solotravel.csv`](data/solotravel.csv) -- original data from r/solotravel
* [`data_vec.csv`](data/data_vec.csv) -- cleaned data used for EDA and Modelling


#### Data Dictionary (for data_vec)
* `titletext` -- post title and body text joined into one column
* `lemmatized` -- text lemmatized using spaCy
* `stemmed` -- text stemmed using SnowballStemmer

## Data Collection

In [1]:
# Import libraries

import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

import warnings
warnings.filterwarnings('ignore')

<a id='link1'></a>

### Pushshift search function

In [2]:
def ps(subreddit, loops=20, size=100, skip=30):
    """Scrape submissions from a subreddit via the Pushshift API.

    subreddit: subreddit to scrape from
    loops: number of requests to make
    size: posts per request (defaults to the maximum of 100)
    skip: step, in days, added to the `after` offset on each request;
          set to 30 as we don't need the posts chronologically
    """
    fields = ["author", "id", "title", "selftext", "subreddit", "created_utc"]

    posts = []
    url_orig = "https://api.pushshift.io/reddit/search/submission/?subreddit={}&size={}".format(
        subreddit, size
    )

    for i in range(loops):
        url = "{}&after={}d".format(url_orig, skip * i)
        print(i, url)
        response = requests.get(url)
        display(response.status_code)
        posts.extend(response.json()["data"])

    # put into dataframe, keeping the selected fields only
    df = pd.DataFrame(posts)[fields]

    # remove duplicate posts, judged by identical selftext
    df = df.drop_duplicates(subset=["selftext"])

    return df
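
The URLs that `ps` requests depend only on its arguments, so they can be previewed without touching the network. The sketch below reproduces just the URL-building step; `build_urls` is a hypothetical helper name introduced for illustration and is not part of the original notebook.

```python
def build_urls(subreddit, loops=20, size=100, skip=30):
    """Return the request URLs that a ps-style loop would visit, in order."""
    base = (
        "https://api.pushshift.io/reddit/search/submission/"
        "?subreddit={}&size={}".format(subreddit, size)
    )
    # One URL per iteration; the `after` offset grows by `skip` days each time.
    return ["{}&after={}d".format(base, skip * i) for i in range(loops)]


urls = build_urls("shoestring", loops=24)
print(len(urls))
print(urls[0])
print(urls[-1])
```

Previewing the list this way makes it easy to check that `loops` and `skip` cover the intended offsets before any requests are sent.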

In [3]:
# loops set to 24 so as to scrape more posts, as many of them have selftext removed or blank
budget = ps("shoestring", loops=24)

0 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=0d


200

1 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=30d


200

2 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=60d


200

3 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=90d


200

4 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=120d


200

5 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=150d


200

6 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=180d


200

7 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=210d


200

8 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=240d


200

9 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=270d


200

10 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=300d


200

11 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=330d


200

12 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=360d


200

13 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=390d


200

14 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=420d


200

15 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=450d


200

16 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=480d


200

17 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=510d


200

18 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=540d


200

19 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=570d


200

20 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=600d


200

21 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=630d


200

22 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=660d


200

23 https://api.pushshift.io/reddit/search/submission/?subreddit=shoestring&size=100&after=690d


200
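
Every request in the run above returned 200, but the loop in `ps` reads `response.json()["data"]` without checking the status first, so a single failed reply could abort the whole scrape. A minimal sketch of a more defensive fetch follows. The helper name `fetch_posts`, its `retries` and `wait` parameters, and the choice to return an empty list after repeated failures are all assumptions made for illustration, not part of the original notebook.

```python
import time


def fetch_posts(get, url, retries=3, wait=1.0):
    """Call get(url) until it returns status 200, then return its "data" list.

    `get` is any callable that returns an object with `.status_code` and
    `.json()`, for example `requests.get`. Returns an empty list if every
    attempt fails.
    """
    for attempt in range(retries):
        response = get(url)
        if response.status_code == 200:
            return response.json().get("data", [])
        # Wait a little longer after each failed attempt before retrying.
        time.sleep(wait * (attempt + 1))
    return []
```

Inside the scraping loop this could be used as `posts.extend(fetch_posts(requests.get, url))`. Passing the request function in as an argument also lets the helper be exercised with a stub instead of a live API call.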

In [4]:
shoestring = budget[
    (budget["selftext"] != "[removed]")
    & (budget["selftext"] != "[deleted]")
    & (budget["selftext"] != "")
    & (budget["selftext"] != " ")
    & (budget["selftext"] != ".")
    & (budget["selftext"].notnull())
    & (budget["author"] != "AutoModerator")
]

display(shoestring.shape)

shoestring.to_csv("data/shoestring.csv", index=False)

(1309, 6)

In [7]:
solo = ps("solotravel", loops=31)

0 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=0d


200

1 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=30d


200

2 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=60d


200

3 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=90d


200

4 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=120d


200

5 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=150d


200

6 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=180d


200

7 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=210d


200

8 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=240d


200

9 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=270d


200

10 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=300d


200

11 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=330d


200

12 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=360d


200

13 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=390d


200

14 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=420d


200

15 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=450d


200

16 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=480d


200

17 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=510d


200

18 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=540d


200

19 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=570d


200

20 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=600d


200

21 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=630d


200

22 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=660d


200

23 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=690d


200

24 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=720d


200

25 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=750d


200

26 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=780d


200

27 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=810d


200

28 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=840d


200

29 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=870d


200

30 https://api.pushshift.io/reddit/search/submission/?subreddit=solotravel&size=100&after=900d


200

In [8]:
solotravel = solo[
    (solo["selftext"] != "[removed]")
    & (solo["selftext"] != "[deleted]")
    & (solo["selftext"] != "")
    & (solo["selftext"] != " ")
    & (solo["selftext"] != ".")
    & (solo["selftext"].notnull())
    & (solo["author"] != 'AutoModerator')
    & (solo["author"] != 'SoloTravelMods')
]

display(solotravel.shape)

solotravel.to_csv("data/solotravel.csv", index=False)

(1348, 6)
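
The two filter cells apply the same rules and differ only in which authors are excluded. As a sketch, the same rules can be written once as a predicate over raw post records (the dictionaries returned in the API's `data` list). The names `PLACEHOLDERS`, `is_usable` and `blocked_authors` are hypothetical, and the sample records are made up purely for illustration.

```python
# Body-text values that carry no usable content.
PLACEHOLDERS = {"[removed]", "[deleted]", "", " ", "."}


def is_usable(post, blocked_authors=("AutoModerator",)):
    """Return True if a raw post record has real body text and an allowed author."""
    text = post.get("selftext")
    if text is None or text in PLACEHOLDERS:
        return False
    return post.get("author") not in blocked_authors


# Made-up records for illustration only.
sample = [
    {"author": "alice", "selftext": "Two weeks in Peru on a tight budget"},
    {"author": "bob", "selftext": "[removed]"},
    {"author": "AutoModerator", "selftext": "Weekly thread"},
    {"author": "SoloTravelMods", "selftext": "Rules reminder"},
    {"author": "carol", "selftext": None},
]

kept = [p for p in sample if is_usable(p, ("AutoModerator", "SoloTravelMods"))]
print([p["author"] for p in kept])  # ['alice']
```

Keeping the excluded authors as a parameter means the r/shoestring and r/solotravel filters share one definition, which reduces the chance of the two copies drifting apart.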