<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Classification of Subreddit Posts (Part 1 of 2) 

This project consists of two notebooks. The first notebook covers:
- Problem Statement and Introduction, and
- Scraping process for the subreddit posts

The second notebook covers:
- Dataset Cleaning
- Exploratory Data Analysis (EDA)
- Modelling, and
- Conclusion & Recommendation

## Problem Statement :

As in-house data scientists at Coftea, we aim to develop a classification model that gives insight into the top keywords for tea and coffee. The keywords will be used to optimize the recommender system that will be deployed to the website's search bar. The model will also support our social media analysis, both to gauge sentiment towards our products and to classify posts or reviews so that the company can improve. We evaluate the model on accuracy, because it is equally important to identify tea and coffee products correctly.

## Introduction :

Most people start their day with a cup of coffee or tea. Whether it is a freshly roasted black coffee or a cup of English Breakfast, these drinks are an energy booster for many. After water, tea and coffee are the most popular drinks worldwide. One reason for their popularity is the health benefits they bring [[1]](https://www.pennmedicine.org/updates/blogs/health-and-wellness/2019/december/health-benefits-of-tea#:~:text=Numerous%20studies%20have%20shown%20that,lasting%20impact%20on%20your%20wellness.) [[2]](https://www.psypost.org/2022/05/brain-imaging-study-suggests-that-drinking-coffee-enhances-neurocognitive-function-63213).

In 2021, according to a [report](https://www.teausa.com/14655/tea-fact-sheet) by the Tea Association of the U.S.A. Inc., around 80% of Americans consumed tea, amounting to approximately 85 billion servings or 3.9 billion gallons. The coffee market is also predicted to grow by 4% annually [[3]](https://finance.yahoo.com/news/coffee-market-revenue-reach-157-121900670.html).

Furthermore, a new [report](https://www.grandviewresearch.com/industry-analysis/ready-to-drink-tea-and-ready-to-drink-coffee-market?utm_source=prnewswire&utm_medium=referral&utm_campaign=FMCG_16-May-22&utm_term=ready_to_drink_tea_and_ready_to_drink_coffee_market&utm_content=rd1) by Grand View Research, Inc. projects that the global RTD (ready-to-drink) tea and coffee market will reach USD 167.88 billion by 2030, growing 6.2% annually. The Asia Pacific region may also hold the largest share of the market.

Coftea is a beverage company launched in 2018 that offers a variety of unique tea and coffee lines for customers who want a different experience. It started as a retail shop in Singapore and now has several branches across Southeast Asia. Recently, it launched a new website with an online shop to reach a global audience.

## 1. Importing Libraries

In [1]:
import requests
import string
import re
import pandas as pd
import time
import matplotlib.pyplot as plt

# set the max columns to none
pd.set_option('display.max_columns', None)

## 2. Scraping Datasets

For the datasets, we scrape the subreddits r/tea and r/Coffee to gain insight into what people are currently talking about. The scraping uses the PushShift API. As a single request returns at most 100 posts, we build a function that loops over multiple requests with a delay between iterations.

We further filter out posts that were removed or deleted, which can be identified using the `removed_by_category` column, and we also drop posts with no `selftext`. The removed posts and the posts with no selftext are each saved in a separate dataframe.
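The three-way split described above can be illustrated on a toy batch. This is a minimal sketch: the rows are made up for illustration, and the variable names (`batch`, `usable`, `removed`, `no_text`) are placeholders rather than the names used in the scraping function.

```python
import numpy as np
import pandas as pd

# Toy batch standing in for one API response (values are made up)
batch = pd.DataFrame({
    'id': ['a1', 'a2', 'a3', 'a4'],
    'selftext': ['I love oolong', '', '[removed]', 'Any grinder tips?'],
    'removed_by_category': [np.nan, np.nan, 'moderator', np.nan],
})

not_removed = batch['removed_by_category'].isna()
has_text = batch['selftext'] != ''

usable = batch[not_removed & has_text]    # kept for modelling
removed = batch[~not_removed]             # removed or deleted posts
no_text = batch[not_removed & ~has_text]  # live posts with an empty body

print(len(usable), len(removed), len(no_text))
```

Every post lands in exactly one of the three dataframes, so their lengths add up to the size of the batch.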

### - Function to Scrape From Subreddit using PushShift API

In [2]:
def ScrapSubreddit(sub_title, count, url='https://api.pushshift.io/reddit/search/submission'):
    use = []
    rmv = []
    nontxt = []
    post = 0
    utc = 0
    num = 0
    
    #while loop until the post retrieved reached the count we wanted
    while post < count:
        # set up the parameters
        if utc == 0:
            params = {
                'subreddit': sub_title,
                'size': 100
            }
        else:
            params = {
                'subreddit': sub_title,
                'size': 100,
                'before' : utc
            }
        
        # request to the API
        res = requests.get(url, params)
        
        # check status code; raise so the caller does not try to unpack None
        if res.status_code != 200:
            raise RuntimeError(f'Error, status code: {res.status_code}')
        
        # convert data to json file type
        data = res.json()
        posts = data['data']
        
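        # create dataframe from the json file
        df = pd.DataFrame(posts)
        
        # stop early if the API has no older posts to return
        if df.empty:
            print('No more posts available')
            break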
        
        if post == 0:
            t = pd.to_datetime(df['created_utc'].iloc[0],unit='s')
            print('Latest post: ', t)
                  
        utc = df['created_utc'].iloc[-1]
        
        if 'removed_by_category' in df.columns:
            
            # dataframe with post that is not removed
            df_u = df[(df['removed_by_category'].isna()) & (df['selftext'] != '')]
        
            # dataframe with post that is removed
            df_r = df[df['removed_by_category'].notna()]
        
            # dataframe with post that has no selftext
            df_nontext = df[(df['removed_by_category'].isna()) & (df['selftext'] == '')]
            
            rmv.append(df_r)
        
        else:
            # dataframe with post that has selftext
            df_u = df[df['selftext'] != '']
        
            # dataframe with post that has no selftext
            df_nontext = df[df['selftext'] == '']
        
        post += len(df_u)
        
        
        # append the df to the list
        use.append(df_u)
        nontxt.append(df_nontext)
        
        num += 1
        print('Loop number',num, 'done')
        print('Posts retrieved :', post)
        
        # sleep time before starting over the while loop
        time.sleep(3)
    
    # concat the list of dfs into one dataframe
    df_use = pd.concat(use)
    # rmv stays empty when no batch had a removed_by_category column
    df_rmv = pd.concat(rmv) if rmv else pd.DataFrame()
    df_ntext = pd.concat(nontxt)
    
    print('Scraping completed!')
    
    return df_use, df_rmv, df_ntext
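
The `before` cursor is the core of the loop: each request asks only for posts older than the oldest post seen so far. A minimal offline sketch of that idea follows. `paginate`, `fake_fetch` and `ALL_POSTS` are hypothetical names introduced here for illustration, and the in-memory list stands in for the real API.

```python
def paginate(fetch_page, target, page_size=100):
    """Collect posts newest-first until `target` is reached or pages run out."""
    collected = []
    before = None  # None means "start from the newest post"
    while len(collected) < target:
        page = fetch_page(before=before, size=page_size)
        if not page:  # nothing older is left, so stop
            break
        collected.extend(page)
        before = page[-1]['created_utc']  # oldest timestamp seen so far
    return collected

# Hypothetical in-memory stand-in for the API: 250 posts, newest first
ALL_POSTS = [{'id': i, 'created_utc': 1_000_000 - i} for i in range(250)]

def fake_fetch(before=None, size=100):
    older = [p for p in ALL_POSTS if before is None or p['created_utc'] < before]
    return older[:size]

posts = paginate(fake_fetch, target=1200)
print(len(posts))  # 250
```

Because the cursor only ever moves backwards in time, no post is returned twice, and the loop ends cleanly when the source has fewer posts than the target.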
    
        

### - Scrape 1200 posts from r/tea Subreddit

In [3]:
# set the variable of subreddit1 and count
subreddit1 = 'tea'
count = 1200

In [4]:
df_tea, df_tea_removed, df_tea_nontext = ScrapSubreddit(subreddit1, count)

Latest post:  2022-05-24 07:35:46
Loop number 1 done
Posts retrieved : 33
Loop number 2 done
Posts retrieved : 68
Loop number 3 done
Posts retrieved : 100
Loop number 4 done
Posts retrieved : 133
Loop number 5 done
Posts retrieved : 158
Loop number 6 done
Posts retrieved : 189
Loop number 7 done
Posts retrieved : 231
Loop number 8 done
Posts retrieved : 258
Loop number 9 done
Posts retrieved : 294
Loop number 10 done
Posts retrieved : 331
Loop number 11 done
Posts retrieved : 378
Loop number 12 done
Posts retrieved : 411
Loop number 13 done
Posts retrieved : 451
Loop number 14 done
Posts retrieved : 494
Loop number 15 done
Posts retrieved : 530
Loop number 16 done
Posts retrieved : 570
Loop number 17 done
Posts retrieved : 606
Loop number 18 done
Posts retrieved : 644
Loop number 19 done
Posts retrieved : 683
Loop number 20 done
Posts retrieved : 716
Loop number 21 done
Posts retrieved : 754
Loop number 22 done
Posts retrieved : 791
Loop number 23 done
Posts retrieved : 836
Loop number

In [5]:
df_tea.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,removed_by_category,thumbnail_height,thumbnail_width,url_overridden_by_dest,gallery_data,is_gallery,media_metadata,media,media_embed,secure_media,secure_media_embed,crosspost_parent,crosspost_parent_list,author_flair_template_id,author_flair_text_color,author_flair_background_color,banned_by,poll_data,author_cakeday
0,[],False,cobaltjay,,[],,text,t2_4i4nw2ff,False,False,False,[],False,False,1653377746,self.tea,https://www.reddit.com/r/tea/comments/uwlc69/w...,{},uwlc69,False,True,False,False,False,True,True,False,,reco,"[{'e': 'text', 't': 'Recommendation'}]",7863b26c-9f57-11e4-a2b0-22000bc1889b,Recommendation,dark,richtext,False,False,True,0,0,False,all_ads,/r/tea/comments/uwlc69/whats_your_favourite_br...,False,6,1653377757,1,I have recently started some new medication th...,True,False,False,tea,t5_2qq5e,659158,public,self,What's your favourite brand for decaf tea?,0,[],1.0,https://www.reddit.com/r/tea/comments/uwlc69/w...,all_ads,6,,,,,,,,,,,,,,,,,,,,,
3,[],False,kibbles16,,[],,text,t2_m8mu0vez,False,False,False,[],False,False,1653372437,self.tea,https://www.reddit.com/r/tea/comments/uwk4pc/h...,{},uwk4pc,False,True,False,False,False,True,True,False,,,[],,,dark,text,False,False,True,0,0,False,all_ads,/r/tea/comments/uwk4pc/how_do_you_make_tea_tas...,False,6,1653372448,1,"I love tea, especially green and jasmine tea. ...",True,False,False,tea,t5_2qq5e,659124,public,self,How do you make tea taste refreshing?,0,[],1.0,https://www.reddit.com/r/tea/comments/uwk4pc/h...,all_ads,6,,,,,,,,,,,,,,,,,,,,,
7,[],False,-___ari___-,,[],,text,t2_82epe397,False,False,False,[],False,False,1653364473,self.tea,https://www.reddit.com/r/tea/comments/uwi2xw/h...,{},uwi2xw,False,True,False,False,False,True,True,False,,help,"[{'e': 'text', 't': 'Question/Help'}]",64c60b7e-9f57-11e4-adfe-22000b680aa5,Question/Help,dark,richtext,False,False,True,0,0,False,all_ads,/r/tea/comments/uwi2xw/how_is_the_first_flush_...,False,6,1653364483,1,"hi, just recently getting into specialty teas,...",True,False,False,tea,t5_2qq5e,659088,public,self,how is the first flush experience? is it that ...,0,[],1.0,https://www.reddit.com/r/tea/comments/uwi2xw/h...,all_ads,6,,,,,,,,,,,,,,,,,,,,,
8,[],False,UsernameNumberThree,,[],,text,t2_fcgfe,False,False,False,[],False,False,1653350251,self.tea,https://www.reddit.com/r/tea/comments/uwdvoh/w...,{},uwdvoh,False,True,False,False,False,True,True,False,,,[],,,dark,text,False,False,True,0,0,False,all_ads,/r/tea/comments/uwdvoh/what_are_your_most_succ...,False,6,1653350262,1,I purchased some from a specialty shop in NYC....,True,False,False,tea,t5_2qq5e,659042,public,self,What are your most successful concoctions with...,0,[],1.0,https://www.reddit.com/r/tea/comments/uwdvoh/w...,all_ads,6,,,,,,,,,,,,,,,,,,,,,
11,[],False,mkmkatreddit,,[],,text,t2_17bti9pn,False,False,False,[],False,False,1653336633,self.tea,https://www.reddit.com/r/tea/comments/uw95qw/p...,{},uw95qw,False,True,False,False,False,True,True,False,,help,"[{'e': 'text', 't': 'Question/Help'}]",64c60b7e-9f57-11e4-adfe-22000b680aa5,Question/Help,dark,richtext,False,False,True,0,0,False,all_ads,/r/tea/comments/uw95qw/please_help_is_my_yixin...,False,6,1653336643,1,"Hi,\ncan someone please help me asses whether ...",True,False,False,tea,t5_2qq5e,658994,public,self,"Please help, is my Yixing teapot genuine?",0,[],1.0,https://www.reddit.com/r/tea/comments/uw95qw/p...,all_ads,6,self,"{'enabled': False, 'images': [{'id': 's96yESlA...",,,,,,,,,,,,,,,,,,,


In [6]:
df_tea.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id',
       'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
       'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_sub

In [7]:
df_tea[df_tea['selftext'] == '']

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,removed_by_category,thumbnail_height,thumbnail_width,url_overridden_by_dest,gallery_data,is_gallery,media_metadata,media,media_embed,secure_media,secure_media_embed,crosspost_parent,crosspost_parent_list,author_flair_template_id,author_flair_text_color,author_flair_background_color,banned_by,poll_data,author_cakeday


The dataframe only includes posts with content. Next, we will filter out the HTML inside the selftext.
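
One possible way to strip HTML from a text field is sketched below. This is an assumption about the approach, not necessarily the cleaning used in Part 2, and `strip_html` is a hypothetical helper name.

```python
import html
import re

def strip_html(text):
    """Decode HTML entities, drop tags, and collapse whitespace."""
    text = html.unescape(text)            # e.g. '&amp;' becomes '&'
    text = re.sub(r'<[^>]+>', ' ', text)  # replace tags with a space
    return re.sub(r'\s+', ' ', text).strip()

print(strip_html('Tea &amp; <b>coffee</b>\n\nrock'))  # Tea & coffee rock
```

On a dataframe this could be applied column-wise, for example with `df['selftext'].apply(strip_html)`.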

In [8]:
df_tea_removed['removed_by_category'].value_counts()

automod_filtered    480
reddit              110
moderator            20
deleted               8
Name: removed_by_category, dtype: int64

Across our 33 loops, 618 posts were removed, which accounts for **18.7%** of the posts we scraped.

In [9]:
df_tea_nontext.shape

(1458, 83)

The scraped data also contains many posts without any content: 1,458 posts, or **44.2%** of the posts scraped.
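
The two r/tea percentages can be reproduced from the counts in the outputs above. The sketch assumes that each of the 33 requests returned a full page of 100 posts; the variable names are illustrative.

```python
REQUEST_SIZE = 100  # posts requested per API call
loops = 33
fetched = loops * REQUEST_SIZE  # assumes every request returned a full page

# Counts copied from the value_counts() output
removed_counts = {'automod_filtered': 480, 'reddit': 110, 'moderator': 20, 'deleted': 8}
removed = sum(removed_counts.values())
no_text = 1458  # rows in df_tea_nontext

print(f'removed: {removed} ({removed / fetched:.1%})')        # removed: 618 (18.7%)
print(f'no selftext: {no_text} ({no_text / fetched:.1%})')    # no selftext: 1458 (44.2%)
```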

### - Scrape 1200 posts from r/coffee Subreddit

In [10]:
# set the variable of subreddit2 and count
subreddit2 = 'coffee'

In [11]:
df_coffee, df_coffee_removed, df_coffee_notext = ScrapSubreddit(subreddit2, count)

Latest post:  2022-05-24 07:41:32
Loop number 1 done
Posts retrieved : 71
Loop number 2 done
Posts retrieved : 128
Loop number 3 done
Posts retrieved : 182
Loop number 4 done
Posts retrieved : 241
Loop number 5 done
Posts retrieved : 308
Loop number 6 done
Posts retrieved : 363
Loop number 7 done
Posts retrieved : 424
Loop number 8 done
Posts retrieved : 483
Loop number 9 done
Posts retrieved : 544
Loop number 10 done
Posts retrieved : 611
Loop number 11 done
Posts retrieved : 670
Loop number 12 done
Posts retrieved : 736
Loop number 13 done
Posts retrieved : 797
Loop number 14 done
Posts retrieved : 854
Loop number 15 done
Posts retrieved : 914
Loop number 16 done
Posts retrieved : 972
Loop number 17 done
Posts retrieved : 1031
Loop number 18 done
Posts retrieved : 1100
Loop number 19 done
Posts retrieved : 1166
Loop number 20 done
Posts retrieved : 1230
Scraping completed!


Output log (in case the output is missing)  
Latest post:  2022-05-22 13:08:28  
Loop number 1 done
Posts retrieved : 33  
Loop number 2 done
Posts retrieved : 64  
Loop number 3 done
Posts retrieved : 100  
Loop number 4 done
Posts retrieved : 134  
Loop number 5 done
Posts retrieved : 166  
Loop number 6 done
Posts retrieved : 202  
Loop number 7 done
Posts retrieved : 245  
Loop number 8 done
Posts retrieved : 284  
Loop number 9 done
Posts retrieved : 316  
Loop number 10 done
Posts retrieved : 354  
Loop number 11 done
Posts retrieved : 401  
Loop number 12 done
Posts retrieved : 439  
Loop number 13 done
Posts retrieved : 481  
Loop number 14 done
Posts retrieved : 527  
Loop number 15 done
Posts retrieved : 566  
Loop number 16 done
Posts retrieved : 604  
Loop number 17 done
Posts retrieved : 639  
Loop number 18 done
Posts retrieved : 678  
Loop number 19 done
Posts retrieved : 716  
Loop number 20 done
Posts retrieved : 758  
Loop number 21 done
Posts retrieved : 795  
Loop number 22 done
Posts retrieved : 817  
Loop number 23 done
Posts retrieved : 852  
Loop number 24 done
Posts retrieved : 885  
Loop number 25 done
Posts retrieved : 911  
Loop number 26 done
Posts retrieved : 950  
Loop number 27 done
Posts retrieved : 996  
Loop number 28 done
Posts retrieved : 1037  
Loop number 29 done
Posts retrieved : 1063  
Loop number 30 done
Posts retrieved : 1093  
Loop number 31 done
Posts retrieved : 1136  
Loop number 32 done
Posts retrieved : 1174  
Loop number 33 done
Posts retrieved : 1209  
Scraping completed!

In [12]:
df_coffee.shape

(1230, 81)

In [13]:
df_coffee.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_richtext', 'link_flair_text_color', 'link_flair_type',
       'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts',
       'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subreddit_type', 'thumbnail', 'title', 'total_awards_rece

In [14]:
df_coffee_removed['removed_by_category'].value_counts()

reddit              134
moderator            45
automod_filtered      7
deleted               4
Name: removed_by_category, dtype: int64

Of the 2,000 posts scraped, 190 were removed, which accounts for **9.5%** of the total scraped data.

In [15]:
df_coffee_notext.shape

(580, 81)

We have also set aside 580 posts that have no selftext, which accounts for **29%** of the total scraped data.
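
As a quick consistency check, the three r/coffee dataframes should together account for every post fetched. The sketch below assumes that each of the 20 requests returned 100 posts, and takes the row counts from the outputs above.

```python
fetched = 20 * 100  # 20 loops, assuming a full page of 100 posts each

usable = 1230   # rows in df_coffee
removed = 190   # 134 + 45 + 7 + 4 from value_counts()
no_text = 580   # rows in df_coffee_notext

total = usable + removed + no_text
print(total == fetched)  # True

for name, n in [('removed', removed), ('no selftext', no_text)]:
    print(f'{name}: {n / fetched:.1%}')
```

The counts add up to 2,000, so under this assumption the split neither drops nor duplicates any post.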

### - Converting Dataframe to CSV

We also save the dataframes of the removed posts and the no-text posts as CSV files.

In [17]:
df_tea_removed.to_csv('data/df_tea_removed.csv', index=False)
df_tea_nontext.to_csv('data/df_tea_notext.csv', index=False)
df_coffee_removed.to_csv('data/df_coffee_removed.csv', index=False)
df_coffee_notext.to_csv('data/df_coffee_notext.csv', index=False)

As for df_tea and df_coffee, we combine the two and save them as a single CSV file.

In [18]:
df_combined = pd.concat([df_tea,df_coffee])

In [19]:
df_combined.to_csv('data/df_tea&coffee_220524.csv', index= False)

For the purpose of this analysis, we will use df_tea&coffee_220524.csv so that both notebooks work from the same dataset.
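
Since the model is judged on accuracy, it can help to confirm the class balance of the combined data before modelling. The sketch below uses two toy frames in place of df_tea and df_coffee; the rows and label values are made up, and `ignore_index=True` is an optional choice that gives the combined frame a fresh index.

```python
import pandas as pd

# Toy frames standing in for df_tea and df_coffee (values are made up)
tea = pd.DataFrame({'subreddit': ['tea', 'tea'], 'selftext': ['a', 'b']}, index=[0, 3])
coffee = pd.DataFrame({'subreddit': ['Coffee'], 'selftext': ['c']}, index=[0])

combined = pd.concat([tea, coffee], ignore_index=True)  # fresh 0..n-1 index
class_counts = combined['subreddit'].value_counts()     # posts per subreddit label
print(class_counts)
```

Because the CSV is written with `index=False`, the duplicated row labels from the two source frames do not reach the file either way; resetting the index mainly keeps the in-memory frame tidy.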

This is the end of Part 1 notebook. Please refer to "Project 3: Classification of Subreddit Posts (Part 2 of 2)" notebook for the data cleaning, EDA, modelling and conclusion.