<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# DSI 37 Project 3

<a id='part_i'></a>
[Part II](Part_2-Cleaning_and_EDA.ipynb#part_ii) <br>
[Part III](Part_3-Modelling.ipynb#part_iii)

# Part I: Reddit API Access And Scraping


## Contents

[1. Intro](#intro)<br>
[2. Glossary](#glossary)<br>
[3. Imports](#imports)<br>
[4. Code](#code)<br>


## 1. Intro

<a id='intro'></a>

## Problem Statement

We are members of a data science team working for a specialised diet food company.
Understanding which diets our customers are interested in, and the unique preferences of each diet group, is key to effective marketing, targeted advertising, product development, and profit generation. This project has two purposes.

Firstly, we aim to use NLP techniques to generate insights into the characteristics and preferences of the Keto and Paleo communities, identifying the key patterns, trends, and distinguishing features of each. We chose these two communities because, despite rising interest in both diets, few goods and services are targeted at them. This is the gap in the market that we hope to fill.


<img src="../images/keto stats 3.png" style='float: left; margin: 20px; width: 410px'>
<img src="../images/paleo stats 3.png" style='float: left; margin: 20px; width: 410px'>


Fig 1: number of subscribers per year from 2012 to 2023 for r/keto (left) and r/paleo (right)

Secondly, we aim to create a robust binary classifier that can effectively differentiate between posts from these two communities. To do this, we will use posts from the two subreddits, [r/keto](https://www.reddit.com/r/keto/) and [r/paleo](https://www.reddit.com/r/Paleo/). These posts will serve as the foundation for our classification model.


Ultimately, the developed model will empower our company's product and marketing teams to precisely identify the needs and preferences of our clientele, helping us to tailor our offerings to Keto or Paleo diet followers. This data-driven, personalised marketing strategy will enhance customer satisfaction and drive business growth.

Throughout the project, we will undertake crucial steps such as preprocessing the subreddit posts and conducting Exploratory Data Analysis (EDA). Finally, we will evaluate our models by F1 score and select the highest-scoring one.
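To make the metric concrete, here is a minimal sketch of how an F1 score is computed from confusion-matrix counts for one class. The function name and the counts are made up for illustration and are not results from this project.

```python
def f1_score_from_counts(tp, fp, fn):
    """Return the F1 score, the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# hypothetical counts, treating 'keto' as the positive class
example_f1 = f1_score_from_counts(tp=80, fp=20, fn=10)
print(round(example_f1, 3))
```

Unlike plain accuracy, F1 ignores true negatives, so it penalises a model that does well on one subreddit only by neglecting the other.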

## Goals

* to be able to classify text as either 'keto' or 'paleo' with an accuracy of at least 80%
* to learn what key terms these two communities are focused on and whether we can turn them into products 
* to see if there are any other unexpected insights we can glean 

## Description of this codebook

This is part 1 of our overall code for this project. This part concerns the methods used to access the Reddit API and scrape data from posts in two subreddits, r/paleo and r/keto. This codebook also explains some of the methods we used to circumvent Reddit's attempts to prevent data scraping.

## 2. Glossary

<a id='glossary'></a>

### Reddit:
Reddit is a social news website and forum where content is socially curated and promoted by site members through voting. It was founded in 2005 and ranks as the 10th most visited website in the world and 6th most visited website in the US. Anyone can create an account with just an email address. The key feature of Reddit is the ability to create 'subreddits', individual sub-forums for special interest groups, and for any user to contribute content to these subreddits. As such, people with niche interests tend to congregate on Reddit to seek or give advice, information, and affirmation.

Posts within a subreddit can be sorted by default, hot, new, or top (by time period).

### Paleo Diet:
The Paleo diet is short for the 'paleolithic diet'. It's an eating plan based on foods humans might have eaten during the Paleolithic Era (around 2.5 million to 10,000 years ago). It's also known as a 'hunter-gatherer diet' because it excludes foods that only became more common when small-scale farming was invented. The theory behind this diet is that human bodies have not evolved as quickly as agricultural technology has and have not yet adapted to eating these modern foods. Followers of this diet believe that eating these modern foods causes health issues like obesity and diabetes.

As such, followers of the paleo diet can eat:

* fruits
* vegetables
* lean meat
* fish
* nuts
* eggs
* seeds

They cannot eat:

* grains
* legumes
* dairy
* refined or added sugar
* added salt
* highly processed foods
* certain vegetables that are high in starch like corn, peas, and potatoes

r/Paleo has around 167k members.

### Keto Diet:
The Keto diet is short for the 'ketogenic diet'. It's an eating plan that involves consuming a very low amount of carbohydrates and replacing them with fat to help the body enter a metabolic state known as 'ketosis' where fat is burned rapidly for energy. While there are many variations, the Standard Ketogenic Diet recommends a split of 70% fat, 20% protein, and only 10% carbs. Since it's based on macronutrients, the keto diet has fewer restrictions than the paleo diet.

Recommended foods:
* meat 
* fatty fish
* eggs
* butter and cream
* cheese
* nuts and seeds
* healthy oils (e.g. extra virgin olive oil, avocado oil, etc.)
* avocados
* low-carb vegetables

Foods to avoid:
* sugary foods
* foods that use artificial sweeteners
* grains or starches
* fruit
* beans or legumes
* root vegetables
* low-fat products 
* unhealthy fats (e.g. mayonnaise, processed vegetable oils, etc.)
* sauces (e.g. bbq, honey mustard, etc.)
* alcohol

r/Keto has around 3.26mil members.


## 3. Imports (Libraries)

<a id='imports'></a>

In [19]:
# for pulling data from reddit

import requests
import pandas as pd
import time
import random


## 4. Code

<a id='code'></a>

### 1. importing posts from r/Paleo and r/Keto

We can directly access the page info as JSON by adding '.json' to the end of the URL. After experimenting with scraping methods, we discovered that we could collect more unique posts, despite Reddit's attempts to block scraping, by changing the sort method.
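As a minimal sketch of what that JSON looks like once loaded, the helper below pulls the posts and the pagination token out of a listing dictionary. The function name `parse_listing` and the sample payload are hypothetical. The payload is heavily trimmed and only mirrors the nesting that the scraping function in this notebook relies on (`data` → `children` → `data`, plus `data` → `after`).

```python
def parse_listing(payload):
    """Return (posts, after) from a Reddit listing dictionary."""
    data = payload['data']
    posts = [child['data'] for child in data['children']]
    return posts, data.get('after')

# hypothetical, heavily trimmed payload with the same nesting as res.json()
sample_payload = {
    'kind': 'Listing',
    'data': {
        'after': 't3_abc123',
        'children': [
            {'kind': 't3', 'data': {'name': 't3_aaa111', 'title': 'First post'}},
            {'kind': 't3', 'data': {'name': 't3_abc123', 'title': 'Second post'}},
        ],
    },
}

sample_posts, sample_after = parse_listing(sample_payload)
print(len(sample_posts), sample_after)
```

The `after` value is what gets appended to the next request as `?after=...` to fetch the following page.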

Step 1: defining the urls for the 2 subreddits we want to scrape

In [20]:
p_url = 'https://www.reddit.com/r/Paleo'

In [21]:
k_url = 'https://www.reddit.com/r/Keto'

Step 2: create a list of the different ways of sorting reddit posts. We discovered from previous scraping attempts that r/Keto would start duplicating posts after ~600 posts, so each sort order is scraped in turn.

In [22]:
url_ext = ['', '/new','/hot', '/top']
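
To illustrate how these suffixes turn into request URLs, here is a small sketch. The helper `build_listing_url` and the names `sort_suffixes` and `example_base` are illustrative assumptions, not part of the original notebook. The resulting URLs follow the same pattern as the ones printed by the scraper.

```python
from urllib.parse import urlencode

def build_listing_url(base_url, sort_suffix, after=None):
    """Build the .json URL for one sort order, optionally continuing from `after`."""
    listing_url = base_url + sort_suffix + '.json'
    if after is not None:
        listing_url += '?' + urlencode({'after': after})
    return listing_url

sort_suffixes = ['', '/new', '/hot', '/top']  # mirrors url_ext
example_base = 'https://www.reddit.com/r/Paleo'

first_pages = [build_listing_url(example_base, s) for s in sort_suffixes]
next_page = build_listing_url(example_base, '/new', after='t3_13zbgq2')
print(first_pages)
print(next_page)
```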

Step 3: write a function that will iterate through the different ways of sorting reddit posts (in url_ext) 

In [23]:
def reddit_importer(url, n, sub):
    """Scrape up to n pages per sort order in url_ext and save all posts to <sub>.csv."""
    headers = {'User-agent': 'Shokupan Inc 1.0'}
    posts = []

    for i in url_ext:
        # restart pagination for every sort order
        after = None
        # top comes without an 'after' field for some reason, so only its first page is requested
        pages = 1 if i == '/top' else n

        for a in range(pages):
            current_url = url + i + '.json'
            if after is not None:
                current_url += '?after=' + after
            print(current_url)
            res = requests.get(current_url, headers=headers)

            if res.status_code != 200:
                print('Status error', res.status_code)
                break

            current_dict = res.json()
            current_posts = [p['data'] for p in current_dict['data']['children']]
            posts.extend(current_posts)
            after = current_dict['data']['after']

            # rewrite the csv after every page so an interrupted run keeps its progress
            pd.DataFrame(posts).to_csv(f'{sub}.csv', index = False)

            # generate a random sleep duration to look more 'natural'
            sleep_duration = random.randint(2,7)
            print(sleep_duration)
            time.sleep(sleep_duration)

            # stop paging this sort order once reddit reports no further pages
            if after is None:
                break
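
Because Reddit eventually starts returning posts we have already collected, one possible refinement is to stop paging a sort order as soon as a page contains nothing new. The sketch below is not part of `reddit_importer`. The helper `keep_new_posts` and the sample pages are hypothetical, and it assumes every post dictionary carries the `name` field used for deduplication later in this notebook.

```python
def keep_new_posts(page_posts, seen_names):
    """Return posts whose 'name' has not been seen yet and record them as seen."""
    fresh = []
    for post in page_posts:
        if post['name'] not in seen_names:
            seen_names.add(post['name'])
            fresh.append(post)
    return fresh

# hypothetical pages standing in for successive API responses
pages = [
    [{'name': 't3_a'}, {'name': 't3_b'}],
    [{'name': 't3_b'}, {'name': 't3_c'}],
    [{'name': 't3_a'}, {'name': 't3_c'}],  # nothing new: a signal to switch sort order
    [{'name': 't3_d'}],                    # never reached
]

seen = set()
collected = []
for page in pages:
    fresh = keep_new_posts(page, seen)
    if not fresh:
        break
    collected.extend(fresh)

print([p['name'] for p in collected])
```

Stopping early like this would save requests, at the cost of missing any new posts that happen to sit behind a fully duplicated page.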
        

Step 4: use the function written in step 3 to import from one subreddit

In [6]:
reddit_importer(p_url, 40, 'paleo')

https://www.reddit.com/r/Paleo.json
5
https://www.reddit.com/r/Paleo.json?after=t3_13zbgq2
2
https://www.reddit.com/r/Paleo.json?after=t3_13ihnku
7
https://www.reddit.com/r/Paleo.json?after=t3_131gpc1
4
https://www.reddit.com/r/Paleo.json?after=t3_12g7ka2
2
https://www.reddit.com/r/Paleo.json?after=t3_11yq5df
6
https://www.reddit.com/r/Paleo.json?after=t3_11o6n44
2
https://www.reddit.com/r/Paleo.json?after=t3_11ahtr0
2
https://www.reddit.com/r/Paleo.json?after=t3_10xyl16
3
https://www.reddit.com/r/Paleo.json?after=t3_10qh7ec
6
https://www.reddit.com/r/Paleo.json?after=t3_10gwggc
7
https://www.reddit.com/r/Paleo.json?after=t3_104g0mk
5
https://www.reddit.com/r/Paleo.json?after=t3_zqjncz
4
https://www.reddit.com/r/Paleo.json?after=t3_zbh879
7
https://www.reddit.com/r/Paleo.json?after=t3_z0cp3x
7
https://www.reddit.com/r/Paleo.json?after=t3_yhvmfd
2
https://www.reddit.com/r/Paleo.json?after=t3_y52d2j
2
https://www.reddit.com/r/Paleo.json?after=t3_xp8uw7
2
https://www.reddit.com/r/Paleo.js

KeyboardInterrupt: 

Step 5: import the .csv file just written in step 4 to check that it worked, and also to see how many unique entries we have (since reddit starts duplicating posts after a while of scraping)

In [16]:
p_df = pd.read_csv('../data/paleo.csv')
print(f'The original shape is {p_df.shape}.')

# we used the subset 'name' because we noticed that drop_duplicates alone did not get rid of all the duplicates
p_df.drop_duplicates(subset = 'name', ignore_index = True, inplace = True)

print(f'The shape after dropping duplicates is {p_df.shape}.')

The original shape is (2484, 118).
The shape after dropping duplicates is (979, 118).
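
The toy example below sketches why `drop_duplicates` on whole rows can miss repeats while `subset = 'name'` catches them. The data is invented. The explanation it illustrates (that a field such as the score changes between two scrapes of the same post) is our assumption, since we did not record which columns actually differed.

```python
import pandas as pd

# hypothetical rows: the same post scraped twice, with its score changing in between
scraped = pd.DataFrame({
    'name': ['t3_a', 't3_a', 't3_b'],
    'title': ['Meal prep ideas', 'Meal prep ideas', 'Is honey paleo?'],
    'score': [10, 12, 5],
})

# comparing whole rows keeps both copies of t3_a because their scores differ
full_row_dedup = scraped.drop_duplicates(ignore_index=True)

# comparing only the post's unique 'name' keeps the first copy of each post
by_name_dedup = scraped.drop_duplicates(subset='name', ignore_index=True)

print(len(full_row_dedup), len(by_name_dedup))
```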


Step 6: repeat steps 4-5 on the other subreddit

In [24]:
reddit_importer(k_url, 40, 'keto')

https://www.reddit.com/r/Keto.json
6
https://www.reddit.com/r/Keto.json?after=t3_14drlyt
7
https://www.reddit.com/r/Keto.json?after=t3_14cru1d
3
https://www.reddit.com/r/Keto.json?after=t3_14btcu1
7
https://www.reddit.com/r/Keto.json?after=t3_14az9hx
5
https://www.reddit.com/r/Keto.json?after=t3_14ainre
4
https://www.reddit.com/r/Keto.json?after=t3_14a8ki2
5
https://www.reddit.com/r/Keto.json?after=t3_148jsey
4
https://www.reddit.com/r/Keto.json?after=t3_148fmao
3
https://www.reddit.com/r/Keto.json?after=t3_1481k0u
3
https://www.reddit.com/r/Keto.json?after=t3_146w86j
6
https://www.reddit.com/r/Keto.json?after=t3_145t0tl
7
https://www.reddit.com/r/Keto.json?after=t3_144xmyy
6
https://www.reddit.com/r/Keto.json?after=t3_143xo8w
2
https://www.reddit.com/r/Keto.json?after=t3_143hhav
3
https://www.reddit.com/r/Keto.json?after=t3_142b9nx
2
https://www.reddit.com/r/Keto.json?after=t3_14273kl
5
https://www.reddit.com/r/Keto.json?after=t3_140mo6u
2
https://www.reddit.com/r/Keto.json?after=t3_1

In [25]:
k_df = pd.read_csv('../data/keto.csv')
print(f'The original shape is {k_df.shape}.')

k_df.drop_duplicates(subset = 'name', ignore_index = True, inplace = True)


print(f'The shape after dropping duplicates is {k_df.shape}.')

The original shape is (2943, 113).
The shape after dropping duplicates is (975, 113).


Since both .csv files have close to 1,000 unique posts (979 for r/Paleo and 975 for r/Keto), we have enough data to proceed. The next codebook will cover cleaning and EDA.

In [None]:
# p_df and k_df are already DataFrames, so they can be saved directly
p_df.to_csv('../data/paleo2.csv', index = False)
k_df.to_csv('../data/keto2.csv', index = False)

<b> End of Part I</b> <br>
[Part II](Part_2-Cleaning_and_EDA.ipynb#part_ii)