<h1 align="center"><span style='font-weight: bold;'>Whole-Food Plant-Based vs. Paleo:</span><br />Identifying the best-performing classification model</h1>

---
<h2 align="center"><span style='font-weight: bold;'>01: Web Scraping</span></h2>

---

## **Purpose**

Lifestyle Eating aims to help people live healthier and happier lives through their diets. The idea is to build a platform that fosters a supportive community of users for a variety of diets. Within each diet community, users will be able to ask questions, share recipes and provide personal updates.

Although users could simply state their diet when they post, the product development team wants to launch the platform with auto-classification technology that detects which diet a user follows, since they expect it to be useful for later platform features.

To see how viable this auto-classification technology will be when applied to user posts, they've asked us, the data science team, to look into it.

As lead data scientist on phase 1 of Project Auto-Classify, I’ve chosen to pull submissions from the [PlantBasedDiet](https://www.reddit.com/r/PlantBasedDiet/) and [Paleo](https://www.reddit.com/r/Paleo/) subreddits, as the submissions are very similar to what users would post on the Lifestyle Eating platform. These two diets were chosen because, although they differ in major ways, they share enough similarities to make the project a moderately challenging one.

The purpose of phase 1 is to identify the classification model that most accurately detects the diet behind each submission. Each candidate model will be evaluated by its accuracy scores on the training and testing datasets. The results and findings of this phase will serve as the building blocks for the remainder of the project’s phases.
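
As a minimal sketch of that evaluation, accuracy is the share of submissions whose predicted subreddit matches the true one, computed separately for the training and testing data. The `accuracy` helper and the labels below are made up for illustration and are not project results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels and predictions, for illustration only (not project results)
train_true = ["PlantBasedDiet", "Paleo", "Paleo", "PlantBasedDiet"]
train_pred = ["PlantBasedDiet", "Paleo", "Paleo", "PlantBasedDiet"]
test_true = ["Paleo", "PlantBasedDiet", "Paleo", "PlantBasedDiet"]
test_pred = ["Paleo", "Paleo", "Paleo", "PlantBasedDiet"]

train_acc = accuracy(train_true, train_pred)
test_acc = accuracy(test_true, test_pred)

# A training score far above the testing score suggests overfitting
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Comparing the two scores side by side is the point: a model that scores well on the training data but noticeably worse on the testing data is unlikely to generalize to new posts.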

---

## **Data**

### Whole-Food Plant-Based vs. Paleo

As previously mentioned, the plant-based and paleo diets were chosen because while they have major differences, they share many similarities. So, what exactly do they consist of?

A **whole-food, plant-based diet** is one that focuses on natural foods that come from plants, are not heavily processed (minimally refined) and are free of animal ingredients such as meat, dairy, eggs and honey. 

The main food groups for this diet are fruits, vegetables, whole grains and legumes. Additional acceptable foods include nuts, seeds, tofu, tempeh, and plant-based milks. 

A **paleolithic diet**, commonly referred to as the caveman diet, focuses on natural foods that were consumed before the Neolithic or Agricultural Revolution (around 10,000 B.C.), when farming became the primary method of obtaining food.

The main food groups for this diet are lean meats, fish, fruits, vegetables and nuts. It excludes all processed foods, dairy, grains, legumes, and carbs that don’t come from fruits or vegetables.

The two diets are similar in that both focus on natural foods like fruits, vegetables and nuts, while avoiding processed foods and dairy.

They differ in that the paleo diet allows meat and fish but avoids grains and legumes, whereas the plant-based diet allows grains and legumes but excludes meat and fish.

---

### Data Source

The subreddit submissions will be obtained through the Pushshift API. Because the API returns at most 100 submissions per request, a function will be created to automate pulling the desired number of submissions and return a concatenated DataFrame for a given subreddit.

In the end, there will be two CSV files created, one for the PlantBasedDiet subreddit and one for the Paleo subreddit. Each will contain approximately the 5,000 most recent submissions, and include the following features:

* **subreddit** - name of the subreddit the submission belonged to
* **title** - title of the submission
* **selftext** - body text of the submission
* **created_utc** - time of submission in unix time (seconds since 01-01-1970 UTC)
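
Since `created_utc` is stored as Unix time, a small conversion makes the values readable. This is a minimal sketch using only the standard library; the helper name `utc_from_unix` is made up for illustration.

```python
from datetime import datetime, timezone


def utc_from_unix(seconds):
    """Convert Unix time (seconds since 1970-01-01 UTC) to a timezone-aware datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# created_utc of the first PlantBasedDiet row shown later in this notebook
posted = utc_from_unix(1652222200)
print(posted.isoformat())
```

Passing `tz=timezone.utc` keeps the result independent of the machine's local time zone.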

---

## Imports

In [2]:
import pandas as pd
import requests
import time

## Webscrape Function

In [3]:
# Help from here: https://medium.com/a-chatbots-life/nlp-classification-part-1-f0034d0a64a3

def get_data(subreddit, n_iter):
    """Return a concatenated DataFrame of submissions for a given subreddit.

    Given that the Pushshift API limits to 100 submissions at a time, the function
    makes up to n_iter requests, each one paging further back in time."""

    # Specifying the features to be included in the returned DataFrame
    columns = ['subreddit', 'title', 'selftext', 'created_utc']

    # Initializing an empty list that will contain all DataFrames to concatenate
    df_list = []

    # Establishing the unix time to start with
    current_time = 1652224086

    # Each iteration pulls the 100 submissions created before current_time, then moves
    # current_time back to the oldest submission pulled so the next iteration continues from there.
    for _ in range(n_iter):
        response = requests.get(
            'https://api.pushshift.io/reddit/search/submission',
            params={'subreddit': subreddit, 'size': 100, 'before': current_time}
        )

        # Failing loudly on an HTTP error instead of trying to parse an error page
        response.raise_for_status()
        submissions = response.json()['data']

        # Stopping early once there are no older submissions left to pull
        if not submissions:
            break

        # reindex keeps the four features and fills any that are missing with NaN
        df = pd.DataFrame(submissions).reindex(columns=columns)
        df_list.append(df)

        # Resetting the current time to that of the oldest submission in the current iteration
        current_time = df['created_utc'].min()

        # Setting time in seconds to wait before the next iteration to prevent exceeding the API limit
        time.sleep(3)

    # Returning an empty DataFrame if nothing was pulled, since pd.concat rejects an empty list
    if not df_list:
        return pd.DataFrame(columns=columns)
    return pd.concat(df_list, axis=0)
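
The paging logic can be checked offline without calling the API. The sketch below is an illustration under stated assumptions: `fake_search` is a hypothetical stand-in for the search endpoint, it assumes results come back newest first, and it assumes the `before` filter is exclusive. Neither helper is part of Pushshift or of this notebook.

```python
def fake_search(posts, before, size=100):
    """Hypothetical stand-in for the API: newest-first posts created before `before`."""
    older = [p for p in posts if p["created_utc"] < before]
    older.sort(key=lambda p: p["created_utc"], reverse=True)
    return older[:size]


def paginate(posts, start_time, n_iter, size=100):
    """Collect up to n_iter pages, moving the cursor back after each page."""
    collected = []
    cursor = start_time
    for _ in range(n_iter):
        page = fake_search(posts, before=cursor, size=size)
        if not page:  # nothing older is left
            break
        collected.extend(page)
        # The oldest timestamp on this page becomes the next `before` value
        cursor = min(p["created_utc"] for p in page)
    return collected


# 250 made-up posts with distinct timestamps 1..250
posts = [{"created_utc": t} for t in range(1, 251)]
pulled = paginate(posts, start_time=1_000, n_iter=5)
print(len(pulled))
```

Under the exclusive-`before` assumption, submissions that share the oldest timestamp on a page but fall on the next page would be skipped, which is one possible reason a pull can return slightly fewer rows than `n_iter * 100`.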

## Creating Datasets (~ 5,000 submissions each)

In [3]:
plant_based = get_data('PlantBasedDiet', 50).reset_index(drop=True)

In [5]:
plant_based

| | subreddit | title | selftext | created_utc |
|---|---|---|---|---|
| 0 | PlantBasedDiet | I'm having so many health problems from this diet | **Warning, long post ahead** Okay so I had a ... | 1652222200 |
| 1 | PlantBasedDiet | Mercy For Animals encourages White House suppo... | | 1652215236 |
| 2 | PlantBasedDiet | Struggling with social &amp; familial stigma/b... | [removed] | 1652215221 |
| 3 | PlantBasedDiet | Ideas for lunch to take to uni | [removed] | 1652213780 |
| 4 | PlantBasedDiet | $10-25k grants to promote climate-friendly pla... | Not sure if this is the best sub but I saw tha... | 1652210182 |
| ... | ... | ... | ... | ... |
| 4994 | PlantBasedDiet | Light easy meal | | 1617475971 |
| 4995 | PlantBasedDiet | Following PBD with (for) PCOS, any advice | Hi Everyone!\n\nI've started following a vegan... | 1617471936 |
| 4996 | PlantBasedDiet | Cookbooks | [removed] | 1617469132 |
| 4997 | PlantBasedDiet | How to soak, cook dried Kidney Beans? | [removed] | 1617466938 |


In [5]:
paleo = get_data('Paleo', 50).reset_index(drop=True)

In [6]:
paleo

| | subreddit | title | selftext | created_utc |
|---|---|---|---|---|
| 0 | Paleo | This is my meal for the day. Let me hope that ... | | 1652208666 |
| 1 | Paleo | Coconut/Cassava/Arrowroot flours in Shakes for... | I am working out like crazy and need to up my ... | 1652205599 |
| 2 | Paleo | I’ve been using monk fruit sweetener for every... | | 1652204018 |
| 3 | Paleo | Grain free UNSWEETENED granola I can buy? Reco... | | 1652203919 |
| 4 | Paleo | Bananas! They aren’t paleo bc of the sugar con... | | 1652203879 |
| ... | ... | ... | ... | ... |
| 4993 | Paleo | How can I eat paleo when you’re a teen and you... | [removed] | 1540926854 |
| 4994 | Paleo | Battle of the Proteins. Taking on the suppleme... | | 1540908230 |
| 4995 | Paleo | [Food Pic] Roasted herb chicken, chicken gizza... | | 1540874039 |
| 4996 | Paleo | [Question] Gut bacteria / Paleo | 44 yrs old, male, 5’8, 147lbs\n\nLately I’ve b... | 1540863894 |


## Saving to CSV

In [9]:
# Commented out to prevent overwriting the original files

# plant_based.to_csv('../data/plant_based.csv', index=False)
# paleo.to_csv('../data/paleo.csv', index=False)
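
One alternative to commenting the calls out, sketched here as an assumption rather than what this notebook does, is a small guard that only writes when the target file does not exist yet. The helper name `save_if_absent` is made up for illustration; it expects any object with a pandas-style `to_csv` method.

```python
from pathlib import Path


def save_if_absent(df, path):
    """Write df to path as CSV only when no file exists there yet.

    Returns True if the file was written, False if it was left untouched."""
    path = Path(path)
    if path.exists():
        return False
    # Creating the target folder first so the write does not fail on a missing directory
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return True
```

With this in place, a call such as `save_if_absent(plant_based, '../data/plant_based.csv')` could stay uncommented, since rerunning the cell would leave the original file as it is.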