# <img src="../images/vegan-logo-resized.png" style="float: right; margin: 10px;">
# Data Collection

Author: Gifford Tompkins

---

Project 03 | Notebook 0 of 6

# Objective
The objective of this notebook is to scrape a substantial body of posts from the two subreddits [r/Vegan](https://www.reddit.com/r/vegan/) and [r/Vegetarian](https://www.reddit.com/r/vegetarian/) and to save them into a CSV file for preprocessing and data cleaning, to be completed in the next notebook.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Collection" data-toc-modified-id="Data-Collection-1">Data Collection</a></span></li><li><span><a href="#Objective" data-toc-modified-id="Objective-2">Objective</a></span></li><li><span><a href="#Import-Libraries" data-toc-modified-id="Import-Libraries-3">Import Libraries</a></span></li><li><span><a href="#Collecting-Data-Using-pushshift.io" data-toc-modified-id="Collecting-Data-Using-pushshift.io-4">Collecting Data Using <em>pushshift.io</em></a></span><ul class="toc-item"><li><span><a href="#Creating-the-corpus" data-toc-modified-id="Creating-the-corpus-4.1">Creating the corpus</a></span></li><li><span><a href="#Confirm-the-corpus" data-toc-modified-id="Confirm-the-corpus-4.2">Confirm the corpus</a></span></li></ul></li><li><span><a href="#Summary-and-Conclusion" data-toc-modified-id="Summary-and-Conclusion-5">Summary and Conclusion</a></span></li></ul></div>

# Import Libraries 

In [1]:
import pandas as pd
import requests
import numpy as np
import time
from datetime import datetime, timedelta

In [2]:
# Custom functions saved in repository
from project_functions.data_collection import get_data_json, create_doc, append_to_corpus    

# Collecting Data Using _pushshift.io_

We will collect posts (referred to as documents in this notebook) through the [pushshift.io API](https://pushshift.io/). To do so, we submit a base `url` together with a dictionary of parameters, one of which is the name of the subreddit, either `r/Vegan` or `r/Vegetarian`.

In [5]:
url = 'https://api.pushshift.io/reddit/submission/search'
subreddits = ['Vegetarian','Vegan']  
corpus_csv = '../data/corpus.csv'
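
As a rough illustration of how the base `url` and a parameter dictionary combine into the request URLs printed in the log further down, the sketch below builds the query string with the standard library. It makes no network call, and `build_request_url` is a hypothetical helper, not one of the repository's functions.

```python
from urllib.parse import urlencode

url = 'https://api.pushshift.io/reddit/submission/search'


def build_request_url(base_url, params):
    """Hypothetical helper: join a base URL and a parameter dict into one URL."""
    return f"{base_url}?{urlencode(params)}"


# First pull for a subreddit: no 'before' parameter yet.
first_url = build_request_url(url, {'subreddit': 'Vegetarian', 'size': 500})
print(first_url)

# Later pulls add 'before' (a Unix timestamp) to page further back in time.
next_url = build_request_url(
    url, {'subreddit': 'Vegetarian', 'size': 500, 'before': 1596757646}
)
print(next_url)
```

With `requests`, passing the same dictionary as `params=` to `requests.get` produces an equivalent URL.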

## Creating the corpus
We will create an empty dataframe and save it as our `corpus.csv`, to which we can append each new batch of documents. Since we will be doing multiple pulls and hope to collect up to 20,000 rows of data, we save our data as we go in case our connection breaks or our function or program times out.

For custom functions used in this section, see the [data collection code](./project_functions/data_collection.py).

In [6]:
# Create empty dataframe to store our corpus as a csv file
df = pd.DataFrame(columns=['title','selftext','vegan'])
df.to_csv(corpus_csv,index=False)
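
The save-as-you-go idea can be sketched with the standard library's `csv` module: write the header once, then open the file in append mode for each batch. This is only an illustration under assumed names (`init_corpus`, `append_batch`, and the temporary file path are hypothetical); the repository's `append_to_corpus` may work differently.

```python
import csv
import os
import tempfile

COLUMNS = ['title', 'selftext', 'vegan']


def init_corpus(path):
    """Create the CSV with only a header row (hypothetical helper)."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        csv.DictWriter(f, fieldnames=COLUMNS).writeheader()


def append_batch(documents, path):
    """Append a batch of document dicts without rewriting earlier rows."""
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writerows(documents)
    return len(documents)


# Demo on a temporary file so nothing in ../data is touched.
demo_path = os.path.join(tempfile.mkdtemp(), 'corpus_demo.csv')
init_corpus(demo_path)
append_batch([{'title': 'Post A', 'selftext': '', 'vegan': 0}], demo_path)
append_batch([{'title': 'Post B', 'selftext': 'Body text', 'vegan': 1}], demo_path)

# Read the file back to confirm both batches were kept.
with open(demo_path, newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))
print(len(rows), rows[-1]['title'])
```

Because each batch is written to disk before the next request, an interrupted run loses at most the batch in flight.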

In [13]:
# Loop through the two subreddits and pull the document batches with the pushshift.io API.
for i, sub in enumerate(subreddits):
    
    # Instantiate and define the url parameters dictionary.
    url_params = {
        'subreddit': sub,
        'size': 500
    }
    
    # Pull 20 batches of posts for each subreddit.
    for count in range(20):
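        # Use the function name imported above (`get_data_json`).
        posts = get_data_json(count, url, url_params)
        # Stop early on an empty batch; `posts[-1]` below would raise an IndexError.
        if not posts:
            break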
        documents = []
        
        for post in posts:
            # Create a document from the post with the custom function `create_doc`,
            # passing the label for the post: 1 for a Vegan post, 0 for a Vegetarian post.
            document = create_doc(i, post)
            documents.append(document)
            
        append_to_corpus(documents, corpus_csv)
        
        # Update the 'before' parameter in url_params before running the next batch.
        url_params['before'] = posts[-1]['created_utc']

        # Put a delay on our loop to slow down api usage
        time.sleep(10)
        
    # Remove the 'before' parameter so the next subreddit starts from the newest posts.
    # The default avoids a KeyError if 'before' was never set.
    url_params.pop('before', None)

r/Vegetarian: Pull number 1
https://api.pushshift.io/reddit/submission/search?subreddit=Vegetarian&size=500
100 documents appended to corpus.
r/Vegetarian: Pull number 2
https://api.pushshift.io/reddit/submission/search?subreddit=Vegetarian&size=500&before=1596757646
100 documents appended to corpus.
r/Vegetarian: Pull number 3
https://api.pushshift.io/reddit/submission/search?subreddit=Vegetarian&size=500&before=1596569187
100 documents appended to corpus.
r/Vegetarian: Pull number 4
https://api.pushshift.io/reddit/submission/search?subreddit=Vegetarian&size=500&before=1596399920
100 documents appended to corpus.
r/Vegetarian: Pull number 5
https://api.pushshift.io/reddit/submission/search?subreddit=Vegetarian&size=500&before=1596205339
100 documents appended to corpus.
r/Vegetarian: Pull number 6
https://api.pushshift.io/reddit/submission/search?subreddit=Vegetarian&size=500&before=1596037705
100 documents appended to corpus.
r/Vegetarian: Pull number 7
https://api.pushshift.io/reddi

r/Vegan: Pull number 17
https://api.pushshift.io/reddit/submission/search?subreddit=Vegan&size=500&before=1596310151
100 documents appended to corpus.
r/Vegan: Pull number 18
https://api.pushshift.io/reddit/submission/search?subreddit=Vegan&size=500&before=1596264342
100 documents appended to corpus.
r/Vegan: Pull number 19
https://api.pushshift.io/reddit/submission/search?subreddit=Vegan&size=500&before=1596219493
100 documents appended to corpus.
r/Vegan: Pull number 20
https://api.pushshift.io/reddit/submission/search?subreddit=Vegan&size=500&before=1596184048
100 documents appended to corpus.
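
The `before` parameter drives the pagination in the loop above: each request asks for posts older than the last post of the previous batch. The log also shows 100 documents per pull even though `size` was 500, which suggests the server caps the batch size. The sketch below simulates this offline with made-up posts; `fake_fetch`, `SERVER_CAP`, and the timestamps are assumptions for illustration, not the Pushshift API or the repository's code. It also stops when a batch comes back empty, a case where `posts[-1]` would otherwise raise an `IndexError`.

```python
# Made-up posts with descending timestamps, newest first (illustrative only).
FAKE_POSTS = [{'title': f'post {n}', 'created_utc': 1_000_000 - n} for n in range(250)]
SERVER_CAP = 100  # assumed server-side limit on batch size


def fake_fetch(params):
    """Stand-in for an API call: return the newest posts older than 'before'."""
    before = params.get('before', float('inf'))
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return older[:min(params['size'], SERVER_CAP)]


def collect(max_pulls):
    """Page backwards through the posts until max_pulls or an empty batch."""
    params = {'subreddit': 'Vegetarian', 'size': 500}
    collected = []
    for _ in range(max_pulls):
        posts = fake_fetch(params)
        if not posts:  # nothing older is left; avoid posts[-1] on an empty list
            break
        collected.extend(posts)
        # The oldest post in this batch becomes the cutoff for the next one.
        params['before'] = posts[-1]['created_utc']
    return collected


collected = collect(max_pulls=20)
print(len(collected))
```

In this simulation the loop ends after three non-empty batches, well before the 20-pull limit, because the fourth request finds nothing older to return.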


## Confirm the corpus
Confirm that the posts have been pulled and that the data looks appropriate. We will do more extensive data-cleaning and EDA in the next notebook.

In [14]:
data = pd.read_csv('../data/corpus.csv')

In [15]:
data.head()

|  | title | selftext | vegan |
|---|---|---|---|
| 0 | My ‘100 calories club’ which helps people visu... |  | 0 |
| 1 | Chilli fritters stuffed with a mix of raw onio... |  | 0 |
| 2 | Lentil and mushroom gravy with sweetpotato mas... |  | 0 |
| 3 | How to get excited about vegetables? | I stopped eating meat after 24 years due to a ... | 0 |
| 4 | Homemade Rasgulla - Bengali Spongy Milk Sweets... |  | 0 |


In [16]:
data.tail()

|  | title | selftext | vegan |
|---|---|---|---|
| 4095 | Buffalo tofu, seared green beans and broccoli ... |  | 1 |
| 4096 | I got lazy tonight, and just a whole block of ... | This was my entire dinner:\nhttps://cookieandk... | 1 |
| 4097 | Vegan meals that don’t require cooking? | Hi! I’m looking for some healthy vegan meal id... | 1 |
| 4098 | What are your views on cross contamination? | What are your views when it comes to eating fo... | 1 |
| 4099 | Vegan Pink Dragonfruit Smoothie |  | 1 |


In [17]:
data['vegan'].value_counts()

0    2100
1    2000
Name: vegan, dtype: int64

In [19]:
data.shape

(4100, 3)

In [20]:
data.isnull().mean()

title       0.000000
selftext    0.586585
vegan       0.000000
dtype: float64

**Observations:**
> There appear to be some duplicates, and roughly 59% of the `selftext` values are null, but all items are stored properly in [`corpus.csv`](../data/corpus.csv). The first posts are vegetarian while the last posts are vegan. The classes are close to balanced but not equal: 2,100 vegetarian posts and 2,000 vegan posts.  
>
> The corpus looks ready to go.
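
One way to put a number on the suspected duplicates is to count repeated `(title, selftext)` pairs. The sketch below does this on a few made-up rows with the standard library; the sample rows and the `count_duplicates` name are hypothetical. In pandas, `data.duplicated(subset=['title', 'selftext']).sum()` gives a comparable count.

```python
from collections import Counter


def count_duplicates(rows):
    """Count rows that repeat an earlier (title, selftext) pair."""
    counts = Counter((row['title'], row['selftext']) for row in rows)
    return sum(n - 1 for n in counts.values())


# Made-up sample rows (illustrative only, not taken from the corpus).
sample = [
    {'title': 'Lentil soup', 'selftext': '', 'vegan': 0},
    {'title': 'Lentil soup', 'selftext': '', 'vegan': 0},
    {'title': 'Tofu scramble', 'selftext': 'Recipe inside', 'vegan': 1},
    {'title': 'Lentil soup', 'selftext': 'Different body', 'vegan': 0},
]

n_duplicates = count_duplicates(sample)
print(n_duplicates)
```

Only exact repeats of both fields are counted here, so a post with the same title but a different body is treated as distinct.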

# Summary and Conclusion
We have gathered a corpus of 4,100 posts for our NLP analysis. We did this by submitting batch requests through the pushshift.io API. We then stored the data of interest in a corpus saved as a CSV file.

In the next notebook we will clean our data and do some initial exploratory analysis. After that we will have a better sense of direction for building our Vegan classification model.