<div style="text-align: center;">
    <img src="../images/ga_logo_large.png">
</div>

---
## **Project 3: Web APIs and NLP**

-----
### Problem Statement

How smart can a classification algorithm be?<br>
The human mind understands and processes written language quite efficiently.  Given a random blurb with enough words and context, a person can usually guess its topic with a high degree of accuracy.  Can a classification model mimic such prowess?  This project aims to answer that question.<br>
To do so, several classification models will be built and trained on shuffled posts from two different subreddits and then asked to predict which subreddit each post came from.  The task is easy for a person; can the machine match or beat that performance?<br>
Various metrics will be used to evaluate each model's performance, with a key focus on the misclassification rate.  The winning model will be the one with the lowest misclassification rate.

----
### Data

There are four datasets available in the data folder, two for each subreddit.  For each subreddit, one file is an older scrape and the other is a newer one.  The older scrapes for both subreddits are missing timestamps, but the text they contain is still valuable for this analysis.  The newer scrapes do have timestamps.  During the data cleaning process, the old and new scrapes were combined to preserve the text data.  The following is a breakdown of the data files:<br>
* `reddit_realestate.csv`:  This data was scraped from the <a href = 'https://www.reddit.com/r/RealEstate/'>real estate subreddit</a>. This dataset **does not** have timestamps.
* `reddit--RealEstate.csv`:  This data was scraped from the <a href = 'https://www.reddit.com/r/RealEstate/'>real estate subreddit</a>. This dataset does have timestamps.
* `reddit_travel.csv`: This data was scraped from the <a href = 'https://www.reddit.com/r/travel/'>travel subreddit</a>. This dataset **does not** have timestamps.
* `reddit--travel.csv`: This data was scraped from the <a href = 'https://www.reddit.com/r/travel/'>travel subreddit</a>. This dataset does have timestamps.

----
### Consulted Sources

This analysis relies on the data collected from the aforementioned subreddits and on the material learned in class and practiced during labs.  Other than reading documentation and doc strings, no outside sources were needed.

----
### Data Import and Cleaning

In [11]:
# needed libraries for this notebook
import numpy as np
import pandas as pd

In [12]:
# read in files
file_path1 = '../data/reddit--RealEstate.csv'
file_path1a = '../data/reddit_realestate.csv'

file_path2 = '../data/reddit--travel.csv'
file_path2a = '../data/reddit_travel.csv'

realestate1 = pd.read_csv(file_path1)
realestate2 = pd.read_csv(file_path1a)

travel1 = pd.read_csv(file_path2)
travel2 = pd.read_csv(file_path2a)

# check dimensions
print(realestate1.shape)
print(realestate2.shape)
print('*'*10)
print(travel1.shape)
print(travel2.shape)

(5198, 5)
(1031, 3)
**********
(4579, 5)
(1091, 3)


In [13]:
# append the old dataset to the new dataset
realestate = pd.concat([realestate1, realestate2], ignore_index = True, sort = False)
travel = pd.concat([travel1, travel2], ignore_index = True, sort = False)

In [14]:
print(realestate.shape)
print(travel.shape)

(6229, 5)
(5670, 5)
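The older scrapes have three columns and the newer ones have five, yet the combined frames have five columns. A minimal sketch of why, using made-up rows rather than rows from the actual files: `pd.concat` aligns on column names and fills the columns missing from one frame with `NaN`.

```python
import pandas as pd

# Toy stand-ins for the two scrapes (hypothetical values, not real posts).
new_scrape = pd.DataFrame({
    "post_id": ["t3_a", "t3_b"],
    "post_title": ["title a", "title b"],
    "post_text": ["text a", "text b"],
    "published_on": [1728318000.0, 1728347000.0],
    "scraped_on": ["2024-10-08 10:34:06", "2024-10-08 10:34:06"],
})
old_scrape = pd.DataFrame({
    "post_id": ["t3_c"],
    "post_title": ["title c"],
    "post_text": ["text c"],
})

# Columns absent from old_scrape are filled with NaN in the result.
combined = pd.concat([new_scrape, old_scrape], ignore_index=True, sort=False)
print(combined.shape)
print(combined[["published_on", "scraped_on"]].isnull().sum())
```

This is why rows that come from the older scrapes show empty `published_on` and `scraped_on` values in the previews that follow.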


In [15]:
# check last three rows for realestate df
realestate.tail(3)

Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on
6226,t3_1fvse4j,Any real estate companies that lets people bui...,I know there will be a lot of building codes a...,,
6227,t3_1fvdxx7,Buying property previously owned by a church,We are purchasing a home previously owned by a...,,
6228,t3_1fvzlcy,Air conditioning unit stolen,My realtor went to see a new build for me and ...,,


In [16]:
# check first three rows for travel df
travel.head(3)

Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on
0,t3_1dzc3zh,All Layover Questions - READ THIS NOTICE,**READ THE NEW LAYOVER FAQ:** [**https://www.r...,1720556000.0,2024-10-08 10:34:14
1,t3_1fya9jq,A few favs from Herzegovina,,1728315000.0,2024-10-08 10:34:14
2,t3_1fypldc,Missing a flight because you get too comfortab...,Is there a name for this phenomenon?\n\nAsking...,1728355000.0,2024-10-08 10:34:14


Having scraped the data myself, I noticed that the first row of the `travel` dataframe is a general notice for users.  It will be removed because it is not a user's post per se.  Since it may have been duplicated in subsequent scrapes, every row where `post_id` is equal to `t3_1dzc3zh` is located first.

In [18]:
# remove rows in travel df with post id: t3_1dzc3zh
# find the rows first
rows_to_remove = travel[travel['post_id'] == 't3_1dzc3zh'].index
rows_to_remove

Index([0, 764, 1528, 2294, 3060, 3821, 4579, 5269], dtype='int64')

In [19]:
# remove from df
travel.drop(index = rows_to_remove, inplace = True)
travel.reset_index(drop = True, inplace = True)

In [20]:
# confirm
print(travel.shape)
travel.head(2)

(5662, 5)


Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on
0,t3_1fya9jq,A few favs from Herzegovina,,1728315000.0,2024-10-08 10:34:14
1,t3_1fypldc,Missing a flight because you get too comfortab...,Is there a name for this phenomenon?\n\nAsking...,1728355000.0,2024-10-08 10:34:14


----
**Duplicates**<br>
Because the data was scraped at different time intervals, there's a possibility of having duplicate posts.  The first step in cleaning up the data will be to remove any duplicates from both dataframes. This will be accomplished using the `post_id` column.

In [22]:
# drop duplicates from realestate df
print(f'Number of rows before removing duplicates: {realestate.shape[0]}')
realestate.drop_duplicates(subset = ['post_id'], inplace = True)
print(f'Number of rows after removing duplicates: {realestate.shape[0]}')

Number of rows before removing duplicates: 6229
Number of rows after removing duplicates: 1098


In [23]:
# drop duplicates from travel df
print(f'Number of rows before removing duplicates: {travel.shape[0]}')
travel.drop_duplicates(subset = ['post_id'], inplace = True)
print(f'Number of rows after removing duplicates: {travel.shape[0]}')

Number of rows before removing duplicates: 5662
Number of rows after removing duplicates: 1162
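One detail worth keeping in mind: `drop_duplicates` keeps the first occurrence of each `post_id` by default. Because the timestamped scrape was placed first in the earlier `pd.concat` call, the retained copy of a repeated post should be the one with timestamps whenever one exists. The sketch below illustrates this with hypothetical rows.

```python
import pandas as pd

# Hypothetical rows: post t3_a appears once with a timestamp and once without.
posts = pd.DataFrame({
    "post_id": ["t3_a", "t3_b", "t3_a"],
    "published_on": [1728318000.0, 1728347000.0, None],
})

# keep="first" is the default, so the earlier (timestamped) row survives.
deduped = posts.drop_duplicates(subset=["post_id"])
print(deduped)
```

Note that `drop_duplicates` leaves the original index labels in place, which is why later previews in this notebook show gaps in the index.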


---
**Missing Values and Data Types**: `realestate`

In [25]:
realestate.isnull().sum()

post_id           0
post_title        0
post_text        19
published_on    183
scraped_on      183
dtype: int64

The missing timestamps are not particularly concerning at this juncture, as this is a binary classification task with text data.<br>
Only 19 posts have no text besides the title.  These 19 rows can be dropped.

In [27]:
print(f'rows before: {realestate.shape[0]}')
realestate.dropna(subset = ['post_text'], inplace = True)
print(f'rows after: {realestate.shape[0]}')

rows before: 1098
rows after: 1079


In [28]:
realestate.isnull().sum()

post_id           0
post_title        0
post_text         0
published_on    178
scraped_on      178
dtype: int64

Looks good.  Check data types below.

In [30]:
realestate.dtypes

post_id          object
post_title       object
post_text        object
published_on    float64
scraped_on       object
dtype: object

All good!
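The `published_on` column is stored as `float64`. If the timestamps were ever needed, they could be converted as sketched below. This assumes the values are Unix epoch seconds in UTC, which the notebook does not state explicitly; the sample values are illustrative.

```python
import pandas as pd

# Assumption: published_on holds Unix epoch seconds; NaN marks rows from the older scrapes.
published_on = pd.Series([1728318000.0, None])

# unit="s" interprets the floats as seconds since 1970-01-01; missing values become NaT.
published_dt = pd.to_datetime(published_on, unit="s", utc=True)
print(published_dt)
```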

---
**Missing Values and Data Types**: `travel`

In [33]:
travel.isnull().sum()

post_id           0
post_title        0
post_text        15
published_on    330
scraped_on      330
dtype: int64

Follow the same approach as for `realestate`: drop the rows that have no post text.

In [35]:
print(f'rows before: {travel.shape[0]}')
travel.dropna(subset = ['post_text'], inplace = True)
print(f'rows after: {travel.shape[0]}')

rows before: 1162
rows after: 1147


In [36]:
travel.isnull().sum()

post_id           0
post_title        0
post_text         0
published_on    327
scraped_on      327
dtype: int64

In [37]:
travel.dtypes

post_id          object
post_title       object
post_text        object
published_on    float64
scraped_on       object
dtype: object

All good!

---
**Word counts and post lengths**

Having numerical data about all this text data may come in handy later on.  New columns will be added to store the length and number of words for each post title and post text.

---
###### _Post Title Lengths_

In [42]:
# calculate post title lengths and store in new col
realestate['post_title_length'] = realestate['post_title'].map(lambda x: len(x))
travel['post_title_length'] = travel['post_title'].map(lambda x: len(x))

In [43]:
# confirm realestate df
realestate.head(1)

Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on,post_title_length
0,t3_1fybuha,I jointly inherited a property with someone wh...,My mother recently passed away and she had sig...,1728318000.0,2024-10-08 10:34:06,67


In [44]:
# confirm travel df
travel.head(1)

Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on,post_title_length
1,t3_1fypldc,Missing a flight because you get too comfortab...,Is there a name for this phenomenon?\n\nAsking...,1728355000.0,2024-10-08 10:34:14,71


---
###### _Post Title Word Counts_

In [46]:
# count words on each title and store in new col
realestate['post_title_wc'] = realestate['post_title'].map(lambda x: len(x.split(' ')))
travel['post_title_wc'] = travel['post_title'].map(lambda x: len(x.split(' ')))

In [47]:
# confirm realestate df
realestate.head(2)

Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on,post_title_length,post_title_wc
0,t3_1fybuha,I jointly inherited a property with someone wh...,My mother recently passed away and she had sig...,1728318000.0,2024-10-08 10:34:06,67,13
1,t3_1fymxpq,Talk to me like I’m dumb about buying a house,We took out a 30-year mortgage in 2016 for $19...,1728347000.0,2024-10-08 10:34:06,46,11


In [48]:
# confirm travel df
travel.head(2)

Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on,post_title_length,post_title_wc
1,t3_1fypldc,Missing a flight because you get too comfortab...,Is there a name for this phenomenon?\n\nAsking...,1728355000.0,2024-10-08 10:34:14,71,13
2,t3_1fy6r62,"A week's long trip to Iceland, September 2024","For a while now I wanted to go to Iceland, and...",1728305000.0,2024-10-08 10:34:14,45,8
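A caveat on the word counts: `split(' ')` splits on single spaces only, so repeated spaces produce empty strings and words separated by newlines are counted as one. The sketch below, on made-up strings, compares it with `split()` (no argument), which splits on any run of whitespace. It is offered as an alternative to consider, not as the method used in this notebook.

```python
import pandas as pd

# Hypothetical examples of text that trips up split(' ').
texts = pd.Series([
    "Missing  a flight",                        # double space
    "Is there a name?\n\nAsking for a friend",  # newlines between words
])

naive = texts.map(lambda x: len(x.split(" ")))  # approach used in this notebook
robust = texts.str.split().str.len()            # splits on any whitespace run

print(pd.DataFrame({"naive": naive, "robust": robust}))
```

For short titles the two counts usually agree; the difference matters more for post text, where paragraph breaks are common.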


---
###### _Post Text Lengths_

In [50]:
# calculate post text lengths and store in new col
realestate['post_text_length'] = realestate['post_text'].map(lambda x: len(x))
travel['post_text_length'] = travel['post_text'].map(lambda x: len(x))

In [51]:
# confirm realestate df
realestate.tail(2)

Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on,post_title_length,post_title_wc,post_text_length
6032,t3_1fxh31c,"Closing on a house on Monday, getting hit by a...",First time homebuyer here closing on my house ...,,,76,14,493
6089,t3_1fxuh8x,Am I allowed to say no to open houses and show...,I recently found out that my landlord will be ...,,,51,11,932


In [52]:
# confirm travel df
travel.tail(2)

Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on,post_title_length,post_title_wc,post_text_length
5418,t3_1fxez6r,I'm from a developing country and have no inte...,I am from the Philippines but already migrated...,,,115,20,1035
5487,t3_1fwbn1o,Iceland for new years eve advice,Hello all. Recently decided to go solo travell...,,,32,6,611


---
###### _Post Text Word Counts_

In [54]:
# count words in each post text and store in new col
realestate['post_text_wc'] = realestate['post_text'].map(lambda x: len(x.split(' ')))
travel['post_text_wc'] = travel['post_text'].map(lambda x: len(x.split(' ')))

In [55]:
# confirm realestate df
realestate.tail(2)

Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on,post_title_length,post_title_wc,post_text_length,post_text_wc
6032,t3_1fxh31c,"Closing on a house on Monday, getting hit by a...",First time homebuyer here closing on my house ...,,,76,14,493,87
6089,t3_1fxuh8x,Am I allowed to say no to open houses and show...,I recently found out that my landlord will be ...,,,51,11,932,194


In [56]:
# confirm travel df
travel.tail(2)

Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on,post_title_length,post_title_wc,post_text_length,post_text_wc
5418,t3_1fxez6r,I'm from a developing country and have no inte...,I am from the Philippines but already migrated...,,,115,20,1035,178
5487,t3_1fwbn1o,Iceland for new years eve advice,Hello all. Recently decided to go solo travell...,,,32,6,611,118


New columns with post title/text lengths and word counts have been successfully added to the dataframes.

----
**Merge Dataframes and Create Target Column**

Since the task at hand is to build a binary classification model that predicts which subreddit (`travel` or `realestate`) a post comes from, both datasets will be merged into one.<br>
First, though, another column (`topic`) will be appended to each one.  This column will be all `0s` for the `realestate` dataframe and all `1s` for the `travel` dataframe.<br>
Once they are merged, this column will serve as the binary classification target for the combined dataframe: rows where `topic` is equal to `0` are Real Estate posts, and rows where `topic` is equal to `1` are Travel posts.

In [60]:
# add topic col to both DFs
realestate['topic'] = 0
travel['topic'] = 1

In [61]:
# confirm realestate DF
print(realestate.shape)
realestate.head(2)

(1079, 10)


Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on,post_title_length,post_title_wc,post_text_length,post_text_wc,topic
0,t3_1fybuha,I jointly inherited a property with someone wh...,My mother recently passed away and she had sig...,1728318000.0,2024-10-08 10:34:06,67,13,852,168,0
1,t3_1fymxpq,Talk to me like I’m dumb about buying a house,We took out a 30-year mortgage in 2016 for $19...,1728347000.0,2024-10-08 10:34:06,46,11,768,146,0


In [62]:
# confirm travel DF
print(travel.shape)
travel.tail(2)

(1147, 10)


Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on,post_title_length,post_title_wc,post_text_length,post_text_wc,topic
5418,t3_1fxez6r,I'm from a developing country and have no inte...,I am from the Philippines but already migrated...,,,115,20,1035,178,1
5487,t3_1fwbn1o,Iceland for new years eve advice,Hello all. Recently decided to go solo travell...,,,32,6,611,118,1


In [63]:
# merge DFs
reddit = pd.concat([realestate, travel], ignore_index = True)

# confirm shape
reddit.shape

(2226, 10)

Both dataframes have been merged, but the first 1,079 rows all belong to `realestate` and the remaining 1,147 rows all belong to `travel`. Let's shuffle the rows so that the two classes are no longer grouped by position.

In [65]:
# shuffle rows in new DF
# source: https://stackoverflow.com/questions/29576430/shuffle-dataframe-rows
reddit = reddit.sample(frac = 1).reset_index(drop = True)

In [66]:
# take a look at a sample slice
reddit.loc[501:505,:]

Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on,post_title_length,post_title_wc,post_text_length,post_text_wc,topic
501,t3_1fw3r80,Brussels - is it as bad as they say? Has anyon...,So looking to travel to Europe as an American ...,1728060000.0,2024-10-08 10:34:16,59,13,2983,523,1
502,t3_1fy4rhm,Advice on travel itinerary - SEA + LatAm,"Hi all,\n\nMy partner and I are planning a 5-6...",1728298000.0,2024-10-08 10:34:15,40,8,1516,285,1
503,t3_1fz1li1,Travel to Laos and Thailand,I am traveling to Laos and Thailand for the fi...,1728399000.0,2024-10-08 12:04:37,27,5,508,98,1
504,t3_1fq3f3o,Connecting through Singapore Changi. Diff airl...,"Hello friends, \n\nI looked through the layove...",,,117,18,1131,211,1
505,t3_1fqrctd,Visas with an emergency passport,I am currently in Madagascar and will soon be ...,,,33,6,649,129,1


Combined dataframe rows are now shuffled.
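The shuffle above produces a different order on every run. If a reproducible order is wanted, `DataFrame.sample` accepts a `random_state` argument. A minimal sketch on a toy frame follows; the seed value is arbitrary.

```python
import pandas as pd

# Toy frame standing in for the merged dataframe.
df = pd.DataFrame({"post_text": list("abcdef"), "topic": [0, 0, 0, 1, 1, 1]})

SEED = 42  # hypothetical seed; any fixed integer works

# The same seed yields the same row order on every run.
shuffled_a = df.sample(frac=1, random_state=SEED).reset_index(drop=True)
shuffled_b = df.sample(frac=1, random_state=SEED).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))
```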

----
**Revisit Some Columns**<br>
Let's revisit the `published_on` and `scraped_on` columns again.

In [69]:
reddit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2226 entries, 0 to 2225
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   post_id            2226 non-null   object 
 1   post_title         2226 non-null   object 
 2   post_text          2226 non-null   object 
 3   published_on       1721 non-null   float64
 4   scraped_on         1721 non-null   object 
 5   post_title_length  2226 non-null   int64  
 6   post_title_wc      2226 non-null   int64  
 7   post_text_length   2226 non-null   int64  
 8   post_text_wc       2226 non-null   int64  
 9   topic              2226 non-null   int64  
dtypes: float64(1), int64(5), object(4)
memory usage: 174.0+ KB


Again, since this is a binary classification task, it looks like those two columns may not be needed.  Another column that may not be needed is `post_id`.  Drop those three columns from the dataframe.

In [71]:
reddit.drop(columns = ['post_id', 'published_on', 'scraped_on'], inplace = True)

In [72]:
# confirm
print(reddit.shape)
reddit.head(2)

(2226, 7)


Unnamed: 0,post_title,post_text,post_title_length,post_title_wc,post_text_length,post_text_wc,topic
0,Monemvasia in December?,Im thinking about travelling to Monemvasia wit...,23,3,209,38,1
1,Vietjetair.com. DOB format is a problem,I attempted to enter my information on [Vietje...,40,7,506,85,1


In [73]:
reddit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2226 entries, 0 to 2225
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   post_title         2226 non-null   object
 1   post_text          2226 non-null   object
 2   post_title_length  2226 non-null   int64 
 3   post_title_wc      2226 non-null   int64 
 4   post_text_length   2226 non-null   int64 
 5   post_text_wc       2226 non-null   int64 
 6   topic              2226 non-null   int64 
dtypes: int64(5), object(2)
memory usage: 121.9+ KB


Combined dataframe looks ready for analysis and modeling.  Save a clean copy.

---
**Save Clean Data**

In [76]:
output_file = '../data/clean_data/reddit.csv'
reddit.to_csv(output_file, index = False)

Cleaned file successfully saved.

----
### Conclusion

Every single model exceeded the baseline accuracy, which was calculated at 51.5%.<br><br>
The worst performing model was the `KNeighborsClassifier` when coupled with the `CountVectorizer` transformer; its accuracy rate was 77.8%.  The key focus, though, was the misclassification rate, and that same model misclassified travel posts a staggering 22.2% of the time!<br><br>
Interestingly, `KNeighborsClassifier` performed better when paired with the `TfidfVectorizer` transformer. That said, in that iteration the hyperparameter grid covered a wider range of _**k**_ values, which may account for part of the improvement.<br><br>
The best performing model was `LogisticRegression` when paired with `TfidfVectorizer`.  This model achieved the lowest misclassification rate at 5.7%, which is still relatively high given that travel and real estate are very distinct topics. The most interesting insight for this model, as far as the hyperparameters were concerned, was that it was given the choice to also tokenize the text into bi-grams through the `ngram_range` parameter.  Based on the returned best parameters, the grid search selected that option: the model did better with bi-grams.<br><br>
The `RandomForestClassifier` models didn't do so well when graded by misclassification rate (over 13% in both iterations).  More fine-tuning of the hyperparameters may be needed, but the grid searches for these classifiers consume a lot of time, so this is recommended for a future iteration of this project.
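As a sanity check on the figures above, the 51.5% baseline is consistent with the share of the majority class (`travel`) in the cleaned data. This section does not show how the baseline was actually computed, so treating it as the majority-class share is an assumption. The misclassification rate is taken here as one minus accuracy.

```python
# Class counts taken from the cleaned dataframes shown earlier in this notebook.
n_realestate = 1079
n_travel = 1147
n_total = n_realestate + n_travel

# Assumption: baseline = accuracy of always predicting the majority class.
baseline_accuracy = max(n_realestate, n_travel) / n_total


def misclassification_rate(accuracy):
    """Overall misclassification rate, defined as 1 - accuracy."""
    return 1 - accuracy


print(f"baseline accuracy: {baseline_accuracy:.1%}")
print(f"misclassification at 77.8% accuracy: {misclassification_rate(0.778):.1%}")
```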

----
### Recommendations

* Continue fine-tuning the hyperparameters to see if a misclassification rate of less than 1% can be achieved.
* Scrape data from other subreddits, especially two that are similar in topic (e.g. Travel vs. Travel Hacks), to see if the performance suffers or stays the same.
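
A possible starting point for the first recommendation is sketched below: a `Pipeline` that pairs `TfidfVectorizer` with `LogisticRegression`, tuned with `GridSearchCV`. The toy corpus, the grid values, and the step names `tfidf` and `logreg` are illustrative assumptions, not the settings used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny hypothetical corpus: 0 = real estate, 1 = travel.
texts = [
    "closing on a house next week", "mortgage rate for first time buyer",
    "realtor fees when selling a home", "landlord is selling the property",
    "home inspection before closing", "buying a house with a mortgage",
    "layover in singapore airport", "visa for a trip to thailand",
    "flight delayed on my trip", "itinerary for a week in iceland",
    "passport lost before my flight", "airport layover with a visa",
]
labels = [0] * 6 + [1] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression()),
])

# Illustrative grid: unigrams vs. unigrams + bi-grams, and regularization strength.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "logreg__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(f"cross-validated misclassification rate: {1 - search.best_score_:.3f}")
```

On the real data, the same structure would take `reddit['post_text']` and `reddit['topic']` in place of the toy lists, with a larger grid as time allows.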