# Problem Statement

We are a group of data scientists representing a yoga studio. We want to differentiate our approach and maximise the effectiveness of our marketing campaigns to yoga enthusiasts.

Our approach is detailed below:
- Grouping posts from the subreddits r/yoga and r/Meditation (which are similar in nature)
- Exploring the similarities, while focusing on exploiting the differences between the two subreddits
- Building an effective *classification* model to better target yoga enthusiasts (i.e. r/yoga users) and get the most out of our marketing spend
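
Because the model separates two classes, one reference point for judging it is the accuracy of always predicting the most frequent class. The sketch below assumes accuracy is the headline metric (the notebook imports `accuracy_score`, but the metric is not stated here) and uses the 1500 posts per subreddit collected in the next section. `majority_baseline` and the 1/0 label encoding are hypothetical choices for illustration.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Assumed encoding: r/yoga posts = 1, r/Meditation posts = 0 (1500 of each)
labels = [1] * 1500 + [0] * 1500
baseline = majority_baseline(labels)
print(f"Baseline accuracy: {baseline:.2f}")
```

With balanced classes the baseline is 0.50, so a useful classifier should score well above that.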


*Rubric (tailor accordingly)*
- Is it clear what the goal of the project is? (Y)
- What type of model will be developed?
- How will success be evaluated?
- Is the scope of the project appropriate?
- Is it clear who cares about this or why this is important to investigate?
- Does the student consider the audience and the primary and secondary stakeholders?

# Data Collection

With our problem statement defined above, we can now start the data collection process. We will use the *Pushshift* API to extract posts from two fairly similar subreddits (r/yoga and r/Meditation).

In [1]:
# Import libraries
import requests
import pandas as pd
from datetime import datetime
import time
import redditcleaner
import csv

from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import re
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore') # Ensuring the notebook remains tidy

Given the limitations of the *Pushshift* API (we request 100 posts per call), we need a custom function (*see below*) to loop our calls. We also standardise our retrieval cutoff time so that the dataset remains fixed.
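
To make the fixed cutoff concrete, the short sketch below converts the `before` value used in the function that follows (1640908800) into a readable UTC timestamp. The constant name `CUTOFF_UTC` is introduced here for illustration only.

```python
from datetime import datetime, timezone

CUTOFF_UTC = 1640908800  # the `before` value passed to the API (hypothetical constant name)

# Timezone-aware conversion, independent of the machine's local timezone
cutoff = datetime.fromtimestamp(CUTOFF_UTC, tz=timezone.utc)
print(cutoff.isoformat())  # 2021-12-31T00:00:00+00:00
```

Note that `datetime.fromtimestamp(x)` without a `tz` argument returns local time, so timestamps printed by the scraping function depend on the timezone of the machine that ran it.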

In [2]:
# Scraping posts
## Loop that pulls n_posts batches of 100 posts each from a given subreddit
def scrap_posts(subreddit, n_posts):
    posts = []
    url = 'https://api.pushshift.io/reddit/search/submission' # Utilising the Pushshift API
    
    bef_dict = {'before': 1640908800} # Standardising our UTC (With other group members)
    
    for i in range(n_posts):
        params = {
                'subreddit':subreddit,
                'size': 100,
                'before': bef_dict['before']
                }
            
        res = requests.get(url, params)
        
        if res.status_code != 200:
            print(f'Error Code {res.status_code}, {res.reason}') # Terminating if we see an error code
            break
        
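        data = res.json()
        if not data['data']: # Stop paging if no older posts are returned
            break
        posts.extend(data['data'])
            
        bef_dict['before'] = data['data'][-1]['created_utc'] # Next call fetches posts older than this one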
        time.sleep(0.5) # Adding in some delay so as not to overload the server
    
    print(f"r/{subreddit} - Code:{res.status_code}, Status:{res.reason}")
    
    # Create dataframe for scraped posts
    df = pd.DataFrame(posts)
    df['created'] = df['created_utc'].apply(lambda x: datetime.fromtimestamp(x))
    
    # Timestamps of the newest (first) and oldest (last) posts retrieved
    latest_post_stamped = datetime.fromtimestamp(df['created_utc'].iloc[0])
    last_post_stamped = datetime.fromtimestamp(df['created_utc'].iloc[-1])
    
    print(f"Scrapped {df.shape[0]} posts from {latest_post_stamped} to {last_post_stamped}")
    print()
    
    return df
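
Because the loop pages on timestamps, it may be worth confirming after a run that no post was collected twice and that posts arrive newest first. The sketch below assumes each record carries an `id` field alongside `created_utc`. `check_posts` is a hypothetical helper, and the sample records are invented for illustration, not real Pushshift data.

```python
def check_posts(posts):
    """Return (n_duplicates, is_descending) for a list of post dicts."""
    ids = [p['id'] for p in posts]
    n_duplicates = len(ids) - len(set(ids))
    times = [p['created_utc'] for p in posts]
    # Newest-first means each timestamp is >= the one that follows it
    is_descending = all(a >= b for a, b in zip(times, times[1:]))
    return n_duplicates, is_descending

# Invented records for illustration only
sample = [
    {'id': 'a1', 'created_utc': 300},
    {'id': 'a2', 'created_utc': 200},
    {'id': 'a2', 'created_utc': 200},  # repeated post
    {'id': 'a3', 'created_utc': 100},
]
print(check_posts(sample))  # (1, True)
```

On the returned dataframe, a comparable check would be `df['id'].duplicated().sum()`, again assuming the `id` column is present.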

In [3]:
%%time

# Pulling out 1500 posts for r/yoga
yoga = scrap_posts('yoga', 15)

r/yoga - Code:200, Status:OK
Scrapped 1500 posts from 2021-12-31 07:55:04 to 2021-11-07 19:16:53

Wall time: 2min 19s


In [4]:
# Let's take a quick look at our data
print(yoga.shape)
yoga.info()

(1500, 81)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 81 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   all_awardings                  1500 non-null   object        
 1   allow_live_comments            1500 non-null   bool          
 2   author                         1500 non-null   object        
 3   author_flair_css_class         0 non-null      object        
 4   author_flair_richtext          1484 non-null   object        
 5   author_flair_text              26 non-null     object        
 6   author_flair_type              1484 non-null   object        
 7   author_fullname                1484 non-null   object        
 8   author_is_blocked              1500 non-null   bool          
 9   author_patreon_flair           1484 non-null   object        
 10  author_premium                 1484 non-null   object        
 11  awarde

In [5]:
# Write to csv
yoga.to_csv('./data/yoga.csv')
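
Since `to_csv` is called with its defaults, the row index is written as an unnamed first column. The sketch below uses a small toy frame and an in-memory buffer (both assumptions for illustration, not the scraped data) to show how that column appears on reload and how `index_col=0` restores the original layout.

```python
import io
import pandas as pd

df = pd.DataFrame({'title': ['post a', 'post b'], 'score': [3, 5]})  # toy frame

buf = io.StringIO()
df.to_csv(buf)                            # default: index written as first column
buf.seek(0)
naive = pd.read_csv(buf)                  # index comes back as a data column
buf.seek(0)
restored = pd.read_csv(buf, index_col=0)  # index restored, columns unchanged

print(list(naive.columns))
print(list(restored.columns))
```

Reading the saved files later with `index_col=0` (or writing them with `index=False`) avoids carrying a stray `Unnamed: 0` column into the analysis.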

In [6]:
%%time

# Pulling out 1500 posts for r/meditation
meditation = scrap_posts('meditation', 15)

r/meditation - Code:200, Status:OK
Scrapped 1500 posts from 2021-12-31 07:38:50 to 2021-12-03 11:39:22

Wall time: 3min 37s


In [7]:
# Let's take a quick look at our data
print(meditation.shape)
meditation.info()

(1500, 86)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 86 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   all_awardings                  1500 non-null   object        
 1   allow_live_comments            1500 non-null   bool          
 2   author                         1500 non-null   object        
 3   author_flair_css_class         0 non-null      object        
 4   author_flair_richtext          1485 non-null   object        
 5   author_flair_text              8 non-null      object        
 6   author_flair_type              1485 non-null   object        
 7   author_fullname                1485 non-null   object        
 8   author_is_blocked              1500 non-null   bool          
 9   author_patreon_flair           1485 non-null   object        
 10  author_premium                 1485 non-null   object        
 11  awarde

In [8]:
# Write to csv
meditation.to_csv('./data/meditation.csv')