# Project 3: Web APIs and NLP 

## Problem Statement


**To determine whether the beverage company I work for should focus on the coffee market or the tea market in its business expansion, we need to find out whether people in the area post more on social media about coffee or about tea.**


### Contents:
- [Background](#Background)
- [Webscraping](#Webscraping)



## Background
For hot beverage consumption, there are two primary contenders, and both have fiercely loyal fans. There are coffee lovers and espresso enthusiasts on one side, and green tea and chai lovers on the other. Coffee and tea lovers are therefore the main target customers.

The beverage company I work for is seeking to expand. They want to find out whether people in the area post more on social media about coffee or about tea. They will then use this information to decide whether to enter the coffee market or the tea market.

While I wish this project were a systematic literature review, the aim here is to find the buzzwords associated with each beverage. These will help determine whether the author of an online post is a coffee drinker or a tea drinker. To achieve this, a classification model is needed to classify posts as either coffee or tea.
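
As a rough illustration of what "buzzwords" could mean in practice, the sketch below counts the most frequent non-trivial words in a handful of posts. It is not the classification model itself. The function name `top_words`, the tiny stop-word list, and the sample posts are all assumptions made up for this example and do not come from the scraped data.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch)
STOP_WORDS = frozenset({'the', 'a', 'i', 'my', 'is', 'to', 'and', 'of', 'for', 'in'})

def top_words(texts, n=3):
    """Return the n most frequent words across texts, ignoring stop words."""
    counts = Counter()
    for text in texts:
        # lowercase and keep only runs of letters/apostrophes
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Made-up sample posts, not taken from Reddit
coffee_posts = ["My espresso grinder is great", "Best grinder for espresso and pour over"]
tea_posts = ["Green tea steeping time", "Steeping oolong and green tea"]

print(top_words(coffee_posts, 2))
print(top_words(tea_posts, 3))
```

Words that rank high for one beverage and low for the other are the kind of signal a text classifier can pick up on.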


*Libraries added here*

In [1]:
import requests
import numpy as np
import pandas as pd

## Webscraping 

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

In [7]:
def search_word(searchword):
    '''Scrape Reddit posts/submissions from the subreddit named `searchword` via the Pushshift API.'''
    posts = []
    before = None
    loop = 30

    for nr in range(loop):
        # request up to 100 posts; after the first page, only posts older than `before`
        params = {'subreddit': searchword, 'size': 100}
        if before is not None:
            params['before'] = before

        res = requests.get(url, params=params)
        res.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        data = res.json()['data']
        if not data:
            # no older posts left, so stop paginating
            break
        before = data[-1]['created_utc']
        print(f'loop: {nr} ')
        posts.extend(data)

    # create df
    df = pd.DataFrame(posts)
    df = df[['subreddit','selftext','title']]

    # remove empty cells
    df['selftext'] = df['selftext'].replace('', np.nan)
    df = df.dropna()

    # remove ['deleted','removed']
    df = df[~df['selftext'].isin(['[deleted]', '[removed]'])]

    # check for duplicates
    df = df.drop_duplicates(subset=['selftext','title'])

    # combine columns 'selftext' and 'title' (space-separated so words do not fuse)
    df['text'] = df['selftext'] + ' ' + df['title']

    return df
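
The pagination pattern used in `search_word` (pass the oldest `created_utc` seen so far as `before` to get the next, older page) can be exercised without touching the network. The sketch below is a minimal illustration under that assumption. `collect_pages` and `fake_fetch` are hypothetical names that do not appear in the notebook, and `fake_fetch` only imitates the API with five made-up posts served two per page.

```python
def collect_pages(fetch_page, max_pages=30):
    """Collect posts page by page until max_pages is reached or a page is empty."""
    posts = []
    before = None
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:
            # an empty page means there is nothing older to fetch
            break
        posts.extend(page)
        # the last post on a page is the oldest one seen so far
        before = page[-1]['created_utc']
    return posts

def fake_fetch(before):
    """Hypothetical stand-in for the API: newest first, two posts per page."""
    all_posts = [{'created_utc': t, 'title': f'post {t}'} for t in range(5, 0, -1)]
    older = [p for p in all_posts if before is None or p['created_utc'] < before]
    return older[:2]

posts = collect_pages(fake_fetch)
print([p['created_utc'] for p in posts])
```

Separating the paging loop from the HTTP call in this way is one possible design choice. It keeps the stopping condition easy to check in isolation.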

### Searchword 'coffee'

In [4]:
coffee = search_word('coffee')

loop: 0 
loop: 1 
loop: 2 
loop: 3 
loop: 4 
loop: 5 
loop: 6 
loop: 7 
loop: 8 
loop: 9 
loop: 10 
loop: 11 
loop: 12 
loop: 13 
loop: 14 
loop: 15 
loop: 16 
loop: 17 
loop: 18 
loop: 19 


In [5]:
coffee.head()

| | subreddit | selftext | title | text |
|---|---|---|---|---|
| 1 | Coffee | caffeine wise? i haven’t been able to sleep bu... | this may be a silly question but is black coff... | caffeine wise? i haven’t been able to sleep bu... |
| 4 | Coffee | Thanks to COVID, shops near me put a moratoriu... | Current state of coffee in Seattle? | Thanks to COVID, shops near me put a moratoriu... |
| 6 | Coffee | I started with Kicking Horse &amp; Ethical Bea... | French Press Sour | I started with Kicking Horse &amp; Ethical Bea... |
| 7 | Coffee | I just started making cold brew about a week a... | Faster way of filtering cold brew | I just started making cold brew about a week a... |
| 8 | Coffee | im literally felling some effects of molly but... | coffe after exercises | im literally felling some effects of molly but... |


In [6]:
coffee.shape

(1264, 4)

### Searchword 'tea'

In [9]:
tea = search_word('tea')

loop: 0 
loop: 1 
loop: 2 
loop: 3 
loop: 4 
loop: 5 
loop: 6 
loop: 7 
loop: 8 
loop: 9 
loop: 10 
loop: 11 
loop: 12 
loop: 13 
loop: 14 
loop: 15 
loop: 16 
loop: 17 
loop: 18 
loop: 19 
loop: 20 
loop: 21 
loop: 22 
loop: 23 
loop: 24 
loop: 25 
loop: 26 
loop: 27 
loop: 28 
loop: 29 


In [10]:
tea.head()

| | subreddit | selftext | title | text |
|---|---|---|---|---|
| 3 | tea | Me and my mother got into an argument about th... | Should Chamomile tea be only for old men and w... | Me and my mother got into an argument about th... |
| 4 | tea | Could i make green tea the night before, put i... | Question about green tea | Could i make green tea the night before, put i... |
| 5 | tea | A few years ago I bought a cast iron tea pot w... | Need help ID'ing stamp on bottom of cast iron ... | A few years ago I bought a cast iron tea pot w... |
| 14 | tea | I recently tried "Pukka Relax", which has no s... | Which of these ingredients is causing sweetness? | I recently tried "Pukka Relax", which has no s... |
| 15 | tea | I don’t really like ripe but I’ve enjoyed all ... | What pu’erh would you recommend | I don’t really like ripe but I’ve enjoyed all ... |


In [11]:
tea.shape

(1094, 4)
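
With 1264 coffee posts and 1094 tea posts left after cleaning, the two classes are roughly balanced. The short calculation below works out the accuracy of always guessing the larger class, which is a natural baseline for a later classifier. The variable names are illustrative. Only the two row counts come from the outputs above.

```python
n_coffee = 1264  # rows in the cleaned coffee DataFrame
n_tea = 1094     # rows in the cleaned tea DataFrame

total = n_coffee + n_tea
# accuracy of a model that always predicts the majority class
baseline = max(n_coffee, n_tea) / total

print(f'{total} posts, majority-class baseline accuracy: {baseline:.3f}')
```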

## Save search to CSV file

In [12]:
coffee.to_csv('../datasets/coffeesearch.csv')

In [13]:
tea.to_csv('../datasets/teasearch.csv')
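
By default, `to_csv` also writes the DataFrame index as an unnamed first column. When the file is read back with a plain `pd.read_csv`, that column appears as `Unnamed: 0`. The sketch below shows the round trip on a small made-up frame, using an in-memory buffer in place of the files under `../datasets/`. It assumes pandas is installed, as in the rest of the notebook.

```python
import io
import pandas as pd

# Made-up two-row frame with a non-default index, standing in for the scraped data
df = pd.DataFrame({'subreddit': ['tea', 'tea'], 'title': ['a', 'b']}, index=[3, 4])

buf = io.StringIO()
df.to_csv(buf)  # the index is written as an unnamed first column

buf.seek(0)
with_default = pd.read_csv(buf)           # index comes back as 'Unnamed: 0'

buf.seek(0)
restored = pd.read_csv(buf, index_col=0)  # first column is used as the index again

print(list(with_default.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv` is an alternative when the original row labels are not needed later.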