# Project 3: Web APIs & NLP

## Part One: Data Collection & Data Cleaning

### Problem Statement:
Reddit is a massive collection of forums where people can share news and content or comment on other people’s posts. It is broken up into more than a million communities known as “subreddits,” each of which covers a different topic. The name of a subreddit begins with “r/,” which is part of the URL that Reddit uses. For example, r/nba is a subreddit where people talk about the National Basketball Association. A "post" is how the community shares content such as stories, links, images, and videos. A "comment" adds discussion to a post. Both posts and comments can be scored by being upvoted or downvoted.

Yet there is a dilemma: what if we wanted to gather data from multiple subreddits and model it? Without a classifier, it is difficult to compare such information.

Thus, can we use supervised machine learning to classify similar content from two different web sources?

How do we investigate this problem? 

I scraped about 4,000 posts from two chosen subreddits, about 2,000 from each, using the Pushshift API, which serves archived Reddit data. Then, I used natural language processing to train a classifier to predict which subreddit each post came from. The classification models I decided to use were Logistic Regression, Bernoulli Naive Bayes, Bagged Decision Tree, and Random Forest, which I evaluated on accuracy scores and confusion matrices.


### Executive Summary:
I begin by pulling the data from the two subreddits, r/mbti and r/astrology, using the Pushshift API. The data arrives in JSON format, so I loaded it into pandas dataframes to make it easier to clean and manipulate.

After looking through the dataframes, I selected particular subfields using the Reddit API data dictionary. I focused on the title, created_utc, author, selftext, and subreddit features, because I wanted the features most useful for modeling.

Next, I did some data cleaning. I checked for duplicate posts and missing values in each of the dataframes. Lastly, I combined the two subreddits into one dataframe, named 'subreddit_combined'.

Then, I did some exploratory data analysis. I first showed the date range of the posts I scraped. Thinking back to my problem statement, I want to determine whether the two datasets contain similar content, so I looked at the most frequently occurring words in each dataframe. I did this using stemming and CountVectorizer. I chose to display bar graphs of the top 10 most frequent unigrams and bigrams in each subreddit.
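As a rough illustration of the unigram and bigram counts described above, here is a minimal sketch that uses only the standard library. It is a stand-in for scikit-learn's CountVectorizer, not the code used in this project: it splits on whitespace and does no stemming or stopword removal. The helper name `top_ngrams` and the sample titles are made up for the example.

```python
from collections import Counter


def top_ngrams(texts, n=1, k=10):
    """Return the k most common n-grams across a list of strings."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # zip over n shifted copies of the token list to form n-grams
        grams = zip(*(tokens[i:] for i in range(n)))
        counts.update(' '.join(gram) for gram in grams)
    return counts.most_common(k)


# Made-up titles, only to show the shape of the result
titles = ['sun in leo', 'moon in leo', 'sun in virgo']
print(top_ngrams(titles, n=1, k=3))  # most frequent single words
print(top_ngrams(titles, n=2, k=3))  # most frequent bigrams
```

The returned list of `(ngram, count)` pairs can be passed straight to a bar chart.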

Next, I preprocessed my data. I dropped the author and selftext features because I do not need them for modeling. I mapped the target variable, subreddit, to a binary class. I did some more NLP processing, using lemmatization, stemming, and stopwords to analyze the dataset further. Then, I created the X feature and target variable and did a train-test split. I decided to use the lemmatized version of the X feature for modeling. Lastly, I determined the baseline score to compare against the models' results.
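The binary mapping and baseline score mentioned above can be sketched as follows. This is a toy example under stated assumptions: the column name `target` and the choice of 0 for mbti and 1 for astrology are illustrative, and the baseline is taken to be the share of the majority class.

```python
import pandas as pd

# Toy data; the real frame has one row per post
toy = pd.DataFrame({
    'subreddit': ['mbti', 'mbti', 'astrology', 'astrology', 'astrology']
})

# Assumed encoding: mbti -> 0, astrology -> 1
toy['target'] = toy['subreddit'].map({'mbti': 0, 'astrology': 1})

# Baseline accuracy: always predict the most frequent class
baseline = toy['target'].value_counts(normalize=True).max()
print(baseline)
```

A model is only useful for this problem if its test accuracy beats this baseline.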

Finally, I modeled four different classification models. I modeled Logistic Regression, Bernoulli Naive Bayes, Bagged Decision Tree, and Random Forest. I also created a confusion matrix for each model to have further insights on each of my models. I wanted to see how well our models were able to correctly classify where each post came from. In the end, I focused on accuracy score and the bias-variance tradeoff from each model to determined which model was the best to answer my problem statement.


### Contents:
* Information on the Two Subreddits
* Data Collection
    - Import Libraries
    - Create Function to Retrieve Data from Reddit API
    - Gathering r/mbti & r/astrology data
    - Export r/mbti & r/astrology data before cleaning
    
* Data Cleaning
    - Create subfield
    - Checking for Duplicate Posts
    - Checking for Missing Values
    - Saved Clean Datasets
    - Combine both datasets

### Information on the Two Subreddits

I chose r/mbti and r/astrology as the two subreddits to answer my problem statement. Each subreddit covers anything relating to its topic, MBTI or astrology.

The r/mbti subreddit has about 434k subscribers. It was created on December 30, 2010. MBTI stands for Myers-Briggs Type Indicator, a tool frequently used to help individuals understand their own communication preferences and how they interact with others. Having an awareness of MBTI can help you adapt your interpersonal approach to different situations and audiences.

The r/astrology subreddit has about 287k subscribers. It was created on May 27, 2008. Astrology has existed for hundreds if not thousands of years, and has been and still is practiced by many different cultures. The astrology that is popular within many online spaces is based on the Western interpretation of the practice, which is founded on the movements and positions of the sun, moon, and planets. It is through interpreting these celestial bodies that astrology can explore relationship patterns, personalities, and life cycles.

I believe that these subreddits are similar in content because both aim to provide insight into an individual’s personality.

We should also consider, in our data science process, how the pre-filtering controls on each subreddit will affect our overall results.

### Data Collection
#### Import Libraries

In [1]:
import requests
import pandas as pd
import numpy as np
import time


#### Create Function to Retrieve Data from Reddit API

In [2]:
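def subreddit_data(subreddit, num_posts):
    """Collect roughly `num_posts` submissions from a subreddit via Pushshift."""
    df_combined = pd.DataFrame()
    num_segments = num_posts // 100  # the API is queried in batches of 100
    url = 'https://api.pushshift.io/reddit/search/submission'
    start_post = 1658560688  # epoch seconds; only posts created before this are requested
    for i in range(num_segments):
        res = requests.get(url,
                           params={
                               'subreddit': subreddit,
                               'size': 100,
                               'before': start_post
                           })
        if res.status_code == 200:
            posts = res.json()['data']
            if not posts:
                # no older posts left; stop instead of failing on posts[-1]
                break
            df_combined = pd.concat([df_combined, pd.DataFrame(posts)])
            # the last (oldest) post in this batch becomes the cursor for the next request
            start_post = posts[-1]['created_utc']
            time.sleep(1)  # pause between requests
        else:
            print('Error:', res.status_code)

    return df_combined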
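
The function above pages backwards through time: each request asks for posts created before `start_post`, and the timestamp of the last post in the batch becomes the next cursor. Here is a minimal offline sketch of that cursor logic. The names `fake_fetch` and `paginate_before` are hypothetical, and `fake_fetch` only imitates the behavior I assume from the API (newest first, `before` exclusive) so the example runs without network access.

```python
def fake_fetch(before, size):
    """Hypothetical stand-in for the API call: posts older than `before`, newest first."""
    all_posts = [{'id': i, 'created_utc': 1000 + i} for i in range(25)]
    older = [p for p in all_posts if p['created_utc'] < before]
    older.sort(key=lambda p: p['created_utc'], reverse=True)
    return older[:size]


def paginate_before(fetch_page, start, size, max_pages):
    """Walk backwards in time, one page per request, until no posts remain."""
    collected = []
    cursor = start
    for _ in range(max_pages):
        batch = fetch_page(cursor, size)
        if not batch:
            break  # nothing older than the cursor
        collected.extend(batch)
        cursor = batch[-1]['created_utc']  # oldest post seen so far
    return collected


posts = paginate_before(fake_fetch, start=2000, size=10, max_pages=5)
print(len(posts))
```

Because the cursor always moves to the oldest timestamp already collected, no post is fetched twice as long as timestamps are unique.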

#### Gathering r/mbti Data

In [3]:
mbti = subreddit_data('mbti', 2_000)

In [4]:
mbti

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,url_overridden_by_dest,gallery_data,is_gallery,media_metadata,media,media_embed,secure_media,secure_media_embed,author_cakeday,banned_by
0,[],False,bunnymarzz,,[],,text,t2_oty2xuhm,False,False,...,,,,,,,,,,
1,[],False,dreamingonastar1,,[],,text,t2_naundn36,False,False,...,,,,,,,,,,
2,[],False,Real_Marsupial8984,,[],,text,t2_p9mevfwv,False,False,...,,,,,,,,,,
3,[],False,depressedgod13,isfj,"[{'e': 'text', 't': 'ISFJ'}]",ISFJ,richtext,t2_qa8vpcfk,False,False,...,,,,,,,,,,
4,[],False,Hydra-Sagaria,,[],,text,t2_62x532ir,False,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,[],False,EternalSerpentofHate,,[],,text,t2_psoopjsd,False,False,...,,,,,,,,,,
96,[],False,TuefelRabbit,isfp,"[{'e': 'text', 't': 'ISFP'}]",ISFP,richtext,t2_44801nmd,False,False,...,https://i.redd.it/f1udjcq7mna91.jpg,,,,,,,,True,
97,[],False,Hot-Newspaper-6322,,[],,text,t2_g2e634nz,False,False,...,,,,,,,,,,
98,[],False,Mariek_26,,[],,text,t2_map1k5wc,False,False,...,,,,,,,,,,


In [5]:
mbti.reset_index(inplace=True)

#### Export Data to CSV Before Cleaning

In [6]:
mbti.to_csv('./datasets/mbti.csv', index=False)

#### Gathering r/astrology Data

In [7]:
astro = subreddit_data('astrology', 2_000)

In [8]:
astro

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,...,post_hint,preview,secure_media,secure_media_embed,thumbnail_height,thumbnail_width,url_overridden_by_dest,author_flair_background_color,banned_by,author_cakeday
0,[],False,barbiesbloodline,taurus2,[],a22328b6-c450-11e5-8d0e-0e209de10c6d,,dark,text,t2_5vdfcx00,...,,,,,,,,,,
1,[],False,Galoreinsider,,[],,,,text,t2_84duswyj,...,,,,,,,,,,
2,[],False,YazzySanches,,[],,,,text,t2_7aysd9f6,...,,,,,,,,,,
3,[],False,Jeg_spider_salt,,[],,,,text,t2_prrzktig,...,,,,,,,,,,
4,[],False,Jeg_spider_salt,,[],,,,text,t2_prrzktig,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94,[],False,not-cheetos,,[],,,,text,t2_3mdd92xr,...,,,,,,,,,,
95,[],False,Inside-Grape-2447,,[],,,,text,t2_78ewlv65,...,,,,,,,,,,
96,[],False,sarabrinley,,[],,,,text,t2_msfifo8o,...,,,,,,,,,,
97,[],False,smalldaisies,,[],,,,text,t2_lkbutnq4,...,,,,,,,,,,


In [9]:
astro.reset_index(inplace=True)

#### Export Data to CSV Before Cleaning

In [10]:
astro.to_csv('./datasets/astrology.csv', index=False)

### Data Cleaning 
#### Create subfield

Using the data dictionary, we can see which features are best to put in a subfield. I decided on 5 features.

The subfield consists of:
- title, to know what the post is called.
- author, to know who created the post.
- created_utc, to know the date posted.
- selftext, to know the content of the post.
- subreddit, to determine which subreddit each post came from.

In [11]:
subfield = ['title', 'author', 'created_utc','selftext', 'subreddit'] 
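
The created_utc feature is stored as Unix epoch seconds, so it needs converting before it reads as a date. A minimal sketch of that conversion follows; the toy frame and the `created` column name are illustrative, and the two timestamps are copied from outputs elsewhere in this notebook (the newest r/mbti post and the oldest r/astrology post shown).

```python
import pandas as pd

# Toy frame with two created_utc values taken from this notebook's outputs
df = pd.DataFrame({'created_utc': [1658560552, 1653453207]})

# unit='s' tells pandas the integers are seconds since the Unix epoch
df['created'] = pd.to_datetime(df['created_utc'], unit='s', utc=True)

# Date range covered by the toy frame
print(df['created'].min(), df['created'].max())
```

Taking the minimum and maximum of the converted column gives the date range of the scraped posts.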

#### Cleaning r/mbti data

In [12]:
mbti = pd.read_csv('./datasets/mbti.csv')

In [13]:
mbti.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999 entries, 0 to 1998
Data columns (total 81 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   index                          1999 non-null   int64  
 1   all_awardings                  1999 non-null   object 
 2   allow_live_comments            1999 non-null   bool   
 3   author                         1999 non-null   object 
 4   author_flair_css_class         784 non-null    object 
 5   author_flair_richtext          1994 non-null   object 
 6   author_flair_text              784 non-null    object 
 7   author_flair_type              1994 non-null   object 
 8   author_fullname                1994 non-null   object 
 9   author_is_blocked              1999 non-null   bool   
 10  author_patreon_flair           1994 non-null   object 
 11  author_premium                 1994 non-null   object 
 12  awarders                       1999 non-null   o

In [14]:
#create subfield
mbti = mbti[subfield]

In [15]:
mbti.head()

Unnamed: 0,title,author,created_utc,selftext,subreddit
0,I recently found out my MTBI. I went to the su...,bunnymarzz,1658560552,,mbti
1,Do you have any advice for the soon to be newl...,dreamingonastar1,1658560509,I learn more and more every day about the pers...,mbti
2,Tp type personality pattern (ex:soccer),Real_Marsupial8984,1658560467,Estp: Experience (Se) first and then create yo...,mbti
3,May Pang,depressedgod13,1658559322,John Lennon’s temporary beau.\n\n[View Poll](h...,mbti
4,Which type is the most “neutral” on their beli...,Hydra-Sagaria,1658558060,\n\n[View Poll](https://www.reddit.com/poll/w5...,mbti


In [16]:
mbti.shape

(1999, 5)

In [17]:
#dropping duplicates
mbti = mbti.drop_duplicates() 

In [18]:
#there are no duplicates in the data
mbti.shape

(1999, 5)

In [19]:
#remove null values
mbti = mbti.dropna()

In [20]:
mbti.shape

(1325, 5)
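
Dropping nulls took the frame from 1999 rows to 1325, so it helps to see which column the missing values come from before removing them. The first row of the head output has an empty selftext, which suggests that column is the main source, though that is an assumption. A minimal sketch on a toy frame:

```python
import pandas as pd

# Toy frame; None marks a post with no body text
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'selftext': ['text', None, 'more text', None],
    'subreddit': ['mbti'] * 4,
})

# Count missing values per column before dropping anything
null_counts = toy.isna().sum()
print(null_counts)

# Drop only the rows where the chosen column is missing
cleaned = toy.dropna(subset=['selftext'])
print(cleaned.shape)
```

Using `subset` makes it explicit which column decides whether a row is kept.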

In [21]:
#export clean data to csv
mbti.to_csv('./datasets/mbti_clean.csv', index=False)

#### Cleaning r/astrology data

In [22]:
astro = pd.read_csv('./datasets/astrology.csv')

In [23]:
astro.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1998 entries, 0 to 1997
Data columns (total 78 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   index                          1998 non-null   int64  
 1   all_awardings                  1998 non-null   object 
 2   allow_live_comments            1998 non-null   bool   
 3   author                         1998 non-null   object 
 4   author_flair_css_class         58 non-null     object 
 5   author_flair_richtext          1996 non-null   object 
 6   author_flair_template_id       57 non-null     object 
 7   author_flair_text              47 non-null     object 
 8   author_flair_text_color        61 non-null     object 
 9   author_flair_type              1996 non-null   object 
 10  author_fullname                1996 non-null   object 
 11  author_is_blocked              1998 non-null   bool   
 12  author_patreon_flair           1996 non-null   o

In [24]:
#create subfield
astro = astro[subfield]

In [25]:
astro.head()

Unnamed: 0,title,author,created_utc,selftext,subreddit
0,can 12h placements attract liars?,barbiesbloodline,1658558627,[removed],astrology
1,Is this Big 6 placements a red flag?,Galoreinsider,1658551502,[removed],astrology
2,"Electional astrology, applying aspects only or...",YazzySanches,1658551378,[removed],astrology
3,How far should you live from a planetary line ...,Jeg_spider_salt,1658548308,[removed],astrology
4,How far from a planetary line can you feel its...,Jeg_spider_salt,1658547912,[removed],astrology


In [26]:
astro.shape

(1998, 5)

In [27]:
#dropping duplicates
astro = astro.drop_duplicates() 

In [28]:
#two duplicate rows were removed (1998 -> 1996)
astro.shape 

(1996, 5)
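
The shape goes from 1998 to 1996 rows, so `drop_duplicates()` removed two rows here. Rather than comparing shapes, the number of duplicates can be counted directly. A minimal sketch on a toy frame (the data is made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['x', 'x', 'y', 'y', 'z'],
    'author': ['u1', 'u1', 'u2', 'u2', 'u3'],
})

# duplicated() marks every repeat of an earlier identical row
n_dupes = int(toy.duplicated().sum())
print(n_dupes)

deduped = toy.drop_duplicates()
print(deduped.shape)
```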

In [29]:
#remove null values
astro = astro.dropna()

In [30]:
astro.shape

(1779, 5)
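
Note that `dropna()` only removes true nulls. The head output above shows selftext values of `[removed]`, which are ordinary strings and therefore survive this step. A minimal sketch of how such placeholders could be counted and filtered; the toy data is made up, and including `[deleted]` in the placeholder list is my assumption:

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['t1', 't2', 't3'],
    'selftext': ['[removed]', 'A real post body', '[deleted]'],
})

# Assumed placeholder strings that stand in for missing post bodies
placeholders = ['[removed]', '[deleted]']

is_placeholder = toy['selftext'].isin(placeholders)
n_placeholder = int(is_placeholder.sum())
print(n_placeholder)

# Keep only rows that still have real text
with_text = toy[~is_placeholder]
print(with_text.shape)
```

Whether to drop these rows depends on which column is used for modeling, since the titles of such posts are still intact.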

In [31]:
astro.to_csv('./datasets/astrology_clean.csv', index=False)

#### Combining the Two Subreddits into One DataFrame

In [32]:
subreddit_combined = pd.concat([mbti,astro])

In [33]:
subreddit_combined

Unnamed: 0,title,author,created_utc,selftext,subreddit
1,Do you have any advice for the soon to be newl...,dreamingonastar1,1658560509,I learn more and more every day about the pers...,mbti
2,Tp type personality pattern (ex:soccer),Real_Marsupial8984,1658560467,Estp: Experience (Se) first and then create yo...,mbti
3,May Pang,depressedgod13,1658559322,John Lennon’s temporary beau.\n\n[View Poll](h...,mbti
4,Which type is the most “neutral” on their beli...,Hydra-Sagaria,1658558060,\n\n[View Poll](https://www.reddit.com/poll/w5...,mbti
5,Does this seem Si dom?,akuasrA,1658557876,Do these traits seem to fit with the definitio...,mbti
...,...,...,...,...,...
1993,"My north node, sun, and midheaven are all in l...",not-cheetos,1653456049,[removed],astrology
1994,1st and 8th House Pluto ?,Inside-Grape-2447,1653455877,[removed],astrology
1995,What does it mean if your venus is in the 12th...,sarabrinley,1653454096,[removed],astrology
1996,how to know which house will be affected durin...,smalldaisies,1653453207,[removed],astrology


In [34]:
subreddit_combined.to_csv('./datasets/subreddit_combined.csv', index=False)
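
`pd.concat` keeps each frame's original row labels, so the combined frame can contain repeated index values (the output above starts at label 1 for mbti and ends at 1996 for astrology). A minimal sketch of the `ignore_index=True` option, which assigns a fresh 0..n-1 index; the toy frames are made up:

```python
import pandas as pd

a = pd.DataFrame({'title': ['m1', 'm2'], 'subreddit': 'mbti'}, index=[1, 2])
b = pd.DataFrame({'title': ['a1', 'a2'], 'subreddit': 'astrology'}, index=[1, 3])

kept = pd.concat([a, b])                      # labels 1, 2, 1, 3
fresh = pd.concat([a, b], ignore_index=True)  # labels 0, 1, 2, 3

print(kept.index.is_unique, fresh.index.is_unique)

# Rows per subreddit after combining
counts = fresh['subreddit'].value_counts()
print(counts)
```

Since the CSV export uses `index=False`, the saved file is unaffected either way; a unique index mainly matters for label-based selection on the in-memory frame.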