# Project 3: Reddit API & Classification (Dating Apps)


<img src="../images/cmbbumblelogo.png" style="display: block; margin: 20px auto; height: 50px">

# Problem Statement


The proliferation of online dating apps since 2013 has revolutionised the <a href='https://www.businessofapps.com/data/dating-app-market/'>online dating industry</a>, where users are spoilt for choice with about 20 different apps to choose from. As part of a consultancy firm offering data-driven analysis and insights to dating companies, we identify the words most frequently used by app users, so that we can determine the most talked-about features of each app and recommend ways for our stakeholders to increase their market share. To kickstart our analysis, we turn to one of the most popular social media platforms, Reddit.

With over <a href='https://sg.oberlo.com/blog/reddit-statistics'>430 million monthly active users worldwide</a>, Reddit houses one of the largest social networking communities and is home to a massive 2.2 million subreddits, of which about 130,000 are currently active. In this project, we narrow down our search to the subreddits of 2 popular dating apps - Coffee Meets Bagel (<a href='https://www.reddit.com/r/coffeemeetsbagel'>/r/coffeemeetsbagel</a>) and Bumble (<a href='https://www.reddit.com/r/bumble'>/r/bumble</a>). Using Natural Language Processing, we select and train 5 classification models to determine which model performs best at classifying posts into their subreddits based on the text the posts contain. The classification metrics - accuracy, recall, precision and specificity - will be used to evaluate the best model's performance.



# Executive Summary 

Reddit has exploded in popularity since its inception in 2005 and has remained one of the most popular social media communities, with 58% of its users between the <a href='https://websitebuilder.org/blog/reddit-statistics/'>ages of 18-29</a>. This age profile is consistent with the advent of social media coinciding with internet usage gaining widespread adoption among millennials (born roughly between 1982 and 1996). Furthermore, the mobile applications (apps) revolution accompanied the rapid increase in mobile phone ownership, which was driven by the accessibility, convenience and general quality-of-life improvements of mobile phones.

The online dating scene eventually rose to prominence and, naturally, subreddits were created for the dating apps, where redditors gather to seek advice and share their experiences. 2 of the most widely used and text-heavy apps - Coffee Meets Bagel (CMB) and Bumble - were picked for this classification analysis (Tinder was skipped as its subreddit was full of images).

We began by scraping each subreddit for at least 750 posts to account for the elimination of non-useable posts (images only, no text, etc.). Tinder was an initial pick but, as seen from the scraping results below, only a small share of its posts were useable. CMB and Bumble were the next 2 subreddits in line, with a significantly higher proportion of useable posts.

Scraping Results (Useable posts as a %) 
- Tinder 129/750 = 17% (Not used) 
- CMB 691/750 = 92% (Used)
- Bumble 501/987 = 51% (Used)
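
The percentages above are useable posts divided by scraped posts. A minimal sketch of that arithmetic, using the counts from the list (the function name `usable_share` is illustrative and not part of the original notebook):

```python
def usable_share(usable: int, scraped: int) -> float:
    """Return useable posts as a percentage of scraped posts."""
    if scraped <= 0:
        raise ValueError("scraped must be positive")
    return 100 * usable / scraped


# (useable, scraped) counts as reported in the scraping results
scraping_results = {
    "Tinder": (129, 750),
    "CMB": (691, 750),
    "Bumble": (501, 987),
}

for name, (usable, scraped) in scraping_results.items():
    share = usable_share(usable, scraped)
    print(f"{name}: {usable}/{scraped} = {share:.0f}%")
```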

After extensive data cleaning and text preprocessing, which involved combining the title and text of each post and removing stop words (common words that are not meaningful in our analysis), we explored the word frequencies of each subreddit using word clouds and bar charts. Several keywords stood out from the respective subreddits: 

- CMB: "discover", "suggested", "bean" 
- Bumble: "woman", "dating"
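
As a rough illustration of how such keyword counts can be obtained, the sketch below tokenises text, drops stop words and counts what remains. The stop-word set and the sample posts are made up for the example, and the tokeniser is an assumption. The stop-word list and vectorizer actually used are not shown in this section.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (an assumption, not the list used in the analysis)
STOP_WORDS = {"a", "an", "and", "i", "in", "is", "it", "my", "of", "on", "the", "to"}


def word_frequencies(texts, stop_words=STOP_WORDS):
    """Count lowercase word tokens across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts


# Made-up example posts, not taken from the scraped data
sample_posts = [
    "The discover tab is empty",
    "I liked a suggested bagel in discover",
]

top_words = word_frequencies(sample_posts).most_common(3)
print(top_words)
```

The resulting counts could then feed a bar chart or a word cloud.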

The model selection process involved building pipelines consisting of either a CountVectorizer or a term frequency-inverse document frequency (TF-IDF) vectorizer, followed by a classification model. Hyperparameter tuning was then done to obtain the best training and cross-validated scores for each pipeline. By comparing the cross-validated score and the accuracy score on the test (validation) data set across the pipelines, it was determined that the best model is the Multinomial Naive Bayes model, which scored 77.0% and 73.6% respectively. 
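
A minimal sketch of one such pipeline with a grid search, assuming scikit-learn. The toy corpus, the step names `vec` and `clf`, and the parameter grid are illustrative assumptions. The grids actually searched are not listed in this section, and `TfidfVectorizer` could be swapped in for `CountVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus (1 = CMB, 0 = Bumble); not the scraped data
texts = [
    "bagel in discover", "suggested bagel today", "beans for discover",
    "liked a suggested bagel", "spent beans on a bagel", "discover section beans",
    "she messaged first", "beeline shows likes", "match expired after a day",
    "women message first", "extend the match", "bee logo and beeline",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Vectorizer followed by classifier, tuned as a single estimator
pipe = Pipeline([
    ("vec", CountVectorizer(stop_words="english")),
    ("clf", MultinomialNB()),
])

# Illustrative grid only (an assumption)
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.5, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)            # best hyperparameter combination
print(round(search.best_score_, 3))   # mean cross-validated accuracy
```

Wrapping the vectorizer inside the pipeline means it is refitted on each training fold, so the cross-validated score is not inflated by vocabulary learned from the held-out fold.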

The results of the classification metrics are as follows: 
- Accuracy - 73.6% 
- Misclassification rate - 26.4%
- Recall - 76.8% 
- Precision - 77.4% 
- Specificity - 69.3% 
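
For reference, all five metrics follow from the four confusion-matrix counts. The sketch below shows the formulas. The counts passed in are hypothetical and are not the project's confusion matrix.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the five reported metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "misclassification_rate": 1 - accuracy,
        "recall": tp / (tp + fn),       # share of actual positives found
        "precision": tp / (tp + fp),    # share of predicted positives that are correct
        "specificity": tn / (tn + fp),  # share of actual negatives found
    }


# Hypothetical counts for illustration only
metrics = classification_metrics(tp=80, fp=20, tn=70, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```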

The accuracy score of 73.6% is not especially high, suggesting that posts made in the /r/bumble and /r/coffeemeetsbagel subreddits do not differ substantially in their wording.

The recall score (76.8%) is higher than the specificity (69.3%), which shows that the model performs better at classifying posts belonging to /r/coffeemeetsbagel than posts belonging to /r/bumble. A misclassification analysis was conducted by reading through the misclassified posts to make sense of the 26.4% misclassification rate. A common trend emerged from their contents: users seeking advice on relationships, sharing their dating experiences on the apps, or ranting about scammers. It is understandable that these topics appear in both subreddits, as both groups of users are in similar situations and would thus share similar experiences. 

Word importance was also explored by determining the words with the top coefficients for each subreddit. The words which influenced the classification model the most for CMB include 'liked', 'discover' and 'suggested', while the result for Bumble was rather inconclusive, with 265 words sharing the joint-highest coefficient. Closer inspection of these words did reveal several keywords unique to the Bumble app, such as 'bee', 'beeline', 'bi', 'binary' and 'non binary'. However, they did not appear frequently enough for our model to pick up their importance. 
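
One way to see how many words can end up sharing the same coefficient is to compute smoothed per-class log-probabilities by hand, as a Naive Bayes model does internally. This is a sketch under stated assumptions: the tokenised posts are made up, the helper name `smoothed_log_probs` is hypothetical, and the log-ratio score is not necessarily how the coefficients in this project were extracted.

```python
import math
from collections import Counter


def smoothed_log_probs(token_lists, vocabulary, alpha=1.0):
    """Log P(word | class) with additive (Laplace) smoothing."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: math.log((counts[w] + alpha) / total) for w in vocabulary}


# Made-up tokenised posts, not the scraped data
cmb_docs = [["discover", "bagel"], ["suggested", "bagel"], ["discover", "match"]]
bumble_docs = [["beeline", "match"], ["bee", "match"], ["match", "expired"]]

vocab = sorted({tok for doc in cmb_docs + bumble_docs for tok in doc})
cmb_lp = smoothed_log_probs(cmb_docs, vocab)
bumble_lp = smoothed_log_probs(bumble_docs, vocab)

# Positive scores lean towards CMB, negative scores towards Bumble
scores = {w: cmb_lp[w] - bumble_lp[w] for w in vocab}
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:10s} {score:+.3f}")
```

In this toy data, words with identical counts in both classes receive identical scores, so rare words that each appear only once tie with one another.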

These keywords are important because they refer to features unique to the respective apps. The results suggest that CMB has more differentiating factors than Bumble, and that Bumble does not have features which are prominent or unique enough to stand out in its subreddit. 




## Contents 

- [1. Extract Datasets from Reddit](#1.-Extract-Datasets-From-Reddit)

### Import Libraries 

In [1]:
# import statements 
import pandas as pd
import requests
import time
import random
import string
import nltk
import regex as re
import numpy as np
# from pprint import pprint

%matplotlib inline

In [2]:
# show every column when displaying wide DataFrames
pd.set_option('display.max_columns', None)

## 1. Extract Datasets From Reddit 

In [3]:
def secret_agent_generator(N):
    # build a random N-character alphanumeric string to send as the User-agent header
    return ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(N))

### CoffeeMeetsBagel

In [None]:
url = 'https://www.reddit.com/r/coffeemeetsbagel.json'

In [None]:
cmb_posts = []
after = None

for _ in range(30):  # request up to 30 listing pages
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': secret_agent_generator(10)})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    cmb_posts.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,60)
    print(sleep_duration)
    time.sleep(sleep_duration)
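
The loop above is repeated for each subreddit, so it could be wrapped in a reusable function. The sketch below is one possible refactoring, not the notebook's own code: `random_user_agent`, `fetch_json` and `scrape_listing` are hypothetical names, and it assumes the listing returns a null `after` token once there are no further pages. Unlike the loop above, it stops at that point instead of requesting the first page again, and it raises on a non-200 response rather than printing and breaking.

```python
import random
import string
import time


def random_user_agent(n=10):
    """Random alphanumeric string, mirroring secret_agent_generator."""
    return ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(n))


def fetch_json(url):
    """Fetch one listing page as a dict (requests is imported lazily)."""
    import requests
    res = requests.get(url, headers={'User-agent': random_user_agent()})
    res.raise_for_status()
    return res.json()


def scrape_listing(base_url, max_pages, get_json=fetch_json, pause=time.sleep):
    """Collect post dicts page by page, stopping when the listing runs out."""
    posts, after = [], None
    for _ in range(max_pages):
        page_url = base_url if after is None else f"{base_url}?after={after}"
        data = get_json(page_url)['data']
        posts.extend(child['data'] for child in data['children'])
        after = data['after']
        if after is None:  # no further pages, so do not loop back to page one
            break
        pause(random.randint(2, 60))  # random delay between requests
    return posts
```

With this helper, the cell above would reduce to `cmb_posts = scrape_listing(url, 30)`. Passing `get_json` and `pause` as parameters makes the function testable without network access.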

In [None]:
df_cmb = pd.DataFrame(cmb_posts)
df_cmb.head()

In [None]:
df_cmb.shape

In [None]:
df_cmb.to_csv("../datasets/cmb_raw", index=False)

In [11]:
columns_to_keep = ['title', 'selftext', 'subreddit',]

In [None]:
df_cmb = df_cmb[columns_to_keep]
df_cmb.head()

In [None]:
df_cmb[df_cmb['selftext'] == '']

In [None]:
df_cmb = df_cmb.drop_duplicates(subset='selftext')
df_cmb.head()

In [None]:
# keep only posts with body text; .copy() avoids a SettingWithCopyWarning when adding columns later
df_cmb = df_cmb[df_cmb["selftext"] != ""].copy()

In [None]:
df_cmb['title_selftext'] = df_cmb['title'] + ' ' + df_cmb['selftext']
df_cmb.head()

In [None]:
df_cmb = df_cmb.drop(columns=['title', 'selftext'])
df_cmb.head()

In [None]:
df_cmb.to_csv("../datasets/cmb_cleaned", index=False)
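
The cleaning steps above can be collected into one function so that the same logic applies to both subreddits. This is a sketch only: `clean_posts` is a hypothetical helper, the sample rows are made up, and it differs slightly from the cells above in that it also treats missing `selftext` values as empty and resets the index.

```python
import pandas as pd


def clean_posts(df: pd.DataFrame) -> pd.DataFrame:
    """Keep text posts only and merge title and body into one column."""
    out = df[['title', 'selftext', 'subreddit']]
    # drop image-only or empty posts
    out = out[out['selftext'].fillna('') != '']
    # drop reposts with identical body text, keeping the first occurrence
    out = out.drop_duplicates(subset='selftext').copy()
    out['title_selftext'] = out['title'] + ' ' + out['selftext']
    return out.drop(columns=['title', 'selftext']).reset_index(drop=True)


# Made-up rows for illustration, not scraped data
raw = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'selftext': ['hello', '', 'hello', 'world'],
    'subreddit': ['coffeemeetsbagel'] * 4,
    'score': [1, 2, 3, 4],
})

cleaned = clean_posts(raw)
print(cleaned)
```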

### Bumble 

In [4]:
url_2 = 'https://www.reddit.com/r/bumble.json'

In [5]:
bumble_posts = []
after = None

for _ in range(50):  # request up to 50 listing pages
    if after is None:
        current_url = url_2
    else:
        current_url = url_2 + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': secret_agent_generator(10)})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    bumble_posts.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,60)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/bumble.json
56
https://www.reddit.com/r/bumble.json?after=t3_m7ndsk
15
https://www.reddit.com/r/bumble.json?after=t3_m717zg
19
https://www.reddit.com/r/bumble.json?after=t3_m6urnz
58
https://www.reddit.com/r/bumble.json?after=t3_m5uzya
4
https://www.reddit.com/r/bumble.json?after=t3_m5qpvu
29
https://www.reddit.com/r/bumble.json?after=t3_m4r0t3
25
https://www.reddit.com/r/bumble.json?after=t3_m3zlqb
60
https://www.reddit.com/r/bumble.json?after=t3_m2w3oc
37
https://www.reddit.com/r/bumble.json?after=t3_m2knfd
13
https://www.reddit.com/r/bumble.json?after=t3_m21jio
39
https://www.reddit.com/r/bumble.json?after=t3_m0vc7g
58
https://www.reddit.com/r/bumble.json?after=t3_lzcgpo
35
https://www.reddit.com/r/bumble.json?after=t3_lzgqjh
46
https://www.reddit.com/r/bumble.json?after=t3_lz9qu9
56
https://www.reddit.com/r/bumble.json?after=t3_lytcqw
33
https://www.reddit.com/r/bumble.json?after=t3_lxt605
56
https://www.reddit.com/r/bumble.json?after=t3_lxdcid
47
https://w

In [6]:
df_bumble = pd.DataFrame(bumble_posts)
df_bumble.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,url_overridden_by_dest,is_gallery,media_metadata,gallery_data,poll_data,author_cakeday,crosspost_parent_list,crosspost_parent
0,,Bumble,\n\nPlease post any profile critique requests...,t2_6l4z3,False,,0,False,Weekly Profile Critique,[],r/Bumble,False,6,,0,,False,t3_m652h9,False,dark,0.86,,public,5,0,{},,False,[],,False,False,,{},,False,5,,True,,False,,[],{},,True,,1615916000.0,text,6,,,text,self.Bumble,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,new,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_3531l,,,,m652h9,True,,AutoModerator,,203,True,all_ads,False,[],False,,/r/Bumble/comments/m652h9/weekly_profile_criti...,all_ads,True,https://www.reddit.com/r/Bumble/comments/m652h...,161181,1615887000.0,0,,False,,,,,,,,
1,,Bumble,,t2_7x7aihk3,False,,0,False,how to get unmatched in 2 seconds — a message ...,[],r/Bumble,False,6,,0,,False,t3_m7xqub,False,dark,0.95,,public,1657,3,{},,False,[],,True,False,,{},,False,1657,,False,,False,,[],{'gid_1': 1},,False,,1616122000.0,text,6,,,text,i.redd.it,True,,,,,,False,False,False,False,False,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,False,False,False,,[],False,,,,t5_3531l,,,,m7xqub,True,,seasonsch4nge,,202,True,all_ads,False,[],False,,/r/Bumble/comments/m7xqub/how_to_get_unmatched...,all_ads,False,https://i.redd.it/yw61ss6r0un61.jpg,161181,1616093000.0,0,,False,https://i.redd.it/yw61ss6r0un61.jpg,,,,,,,
2,,Bumble,"Matched with a really pretty girl, she had a Y...",t2_amrqi7fb,False,,0,False,Should I let a match know she is giving up too...,[],r/Bumble,False,6,,0,,False,t3_m7gz9d,False,dark,0.99,,public,997,3,{},,False,[],,False,False,,{},,False,997,,False,,1616042145.0,,[],{'gid_1': 1},,True,,1616063000.0,text,6,,,text,self.Bumble,True,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,False,False,False,,[],False,,,,t5_3531l,,,,m7gz9d,True,,sombrerocabbage,,120,True,all_ads,False,[],False,,/r/Bumble/comments/m7gz9d/should_i_let_a_match...,all_ads,False,https://www.reddit.com/r/Bumble/comments/m7gz9...,161181,1616034000.0,0,,False,,,,,,,,
3,,Bumble,,t2_11e06l,False,,0,False,When a guy gets more than one quality match in...,[],r/Bumble,False,6,,0,,False,t3_m7zunx,False,dark,0.88,,public,31,0,{},,False,[],,True,False,,{},,False,31,,False,,False,,[],{},,False,,1616128000.0,text,6,,,text,i.redd.it,False,,,,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_3531l,,,,m7zunx,True,,Shmallory0,,8,True,all_ads,False,[],False,,/r/Bumble/comments/m7zunx/when_a_guy_gets_more...,all_ads,False,https://i.redd.it/z5awibk4jun61.jpg,161181,1616099000.0,0,,False,https://i.redd.it/z5awibk4jun61.jpg,,,,,,,
4,,Bumble,,t2_11hje6h3,False,,0,False,Mhmm. I see.,[],r/Bumble,False,6,,0,,False,t3_m80yd8,False,dark,0.97,,public,23,0,{},,False,[],,True,False,,{},,False,23,,False,,False,,[],{},,False,,1616131000.0,text,6,,,text,i.redd.it,False,,,,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_3531l,,,,m80yd8,True,,KSP927,,11,True,all_ads,False,[],False,,/r/Bumble/comments/m80yd8/mhmm_i_see/,all_ads,False,https://i.redd.it/tynsl7p2sun61.jpg,161181,1616102000.0,0,,False,https://i.redd.it/tynsl7p2sun61.jpg,,,,,,,


In [7]:
df_bumble.shape

(1245, 110)

In [9]:
df_bumble.to_csv("../datasets/bumble_raw", index=False)

In [12]:
df_bumble = df_bumble[columns_to_keep]
df_bumble.head()

Unnamed: 0,title,selftext,subreddit
0,Weekly Profile Critique,\n\nPlease post any profile critique requests...,Bumble
1,how to get unmatched in 2 seconds — a message ...,,Bumble
2,Should I let a match know she is giving up too...,"Matched with a really pretty girl, she had a Y...",Bumble
3,When a guy gets more than one quality match in...,,Bumble
4,Mhmm. I see.,,Bumble


In [13]:
df_bumble[df_bumble['selftext'] == '']

Unnamed: 0,title,selftext,subreddit
1,how to get unmatched in 2 seconds — a message ...,,Bumble
3,When a guy gets more than one quality match in...,,Bumble
4,Mhmm. I see.,,Bumble
5,Anyone else feel burnt out when every opener i...,,Bumble
6,Toe thumbs,,Bumble
...,...,...,...
1235,Does Bumble update your location in your profi...,,Bumble
1237,"I love some good, light-hearted interactions :)",,Bumble
1238,Sir this is not it,,Bumble
1239,Cannot win on this app,,Bumble


In [14]:
df_bumble = df_bumble.drop_duplicates(subset='selftext')
df_bumble

Unnamed: 0,title,selftext,subreddit
0,Weekly Profile Critique,\n\nPlease post any profile critique requests...,Bumble
1,how to get unmatched in 2 seconds — a message ...,,Bumble
2,Should I let a match know she is giving up too...,"Matched with a really pretty girl, she had a Y...",Bumble
8,Is she losing interest?,"Long story short, a mere 3 weeks ago met a gir...",Bumble
15,"Photo shown when you match with someone, does ...",You know how when you swipe and match with som...,Bumble
...,...,...,...
860,"Don't want kids, but dating nearing 30? [m]",Not sure if others have had this experience be...,Bumble
861,How long does it usually take for men to respond?,"I'm new to this, so I'm curious. I send my mes...",Bumble
864,How to change the position of the Instagram feed?,I see a lot of ladies have their Instagram fee...,Bumble
867,"Car selfies are big no, what if my car is fancy?",Like a Porsche? I really struggling to get any...,Bumble


In [None]:
# keep only posts with body text; .copy() avoids a SettingWithCopyWarning when adding columns later
df_bumble = df_bumble[df_bumble["selftext"] != ""].copy()

In [None]:
df_bumble['title_selftext'] = df_bumble['title'] + ' ' + df_bumble['selftext']
df_bumble

In [None]:
df_bumble = df_bumble.drop(columns=['title', 'selftext'])
df_bumble.head()

In [None]:
df_bumble.to_csv("../datasets/bumble_cleaned", index=False)