<table align="center">
    <tr>
        <td>
            <img src="../images/ga_logo_large.png" width="400">
        </td>
        <td>
            <img src="../images/pipistrello.png" width="400">
        </td>
    </tr>
</table>

---
## **Project 3: Web APIs and NLP**

-----
### Problem Statement

Lorem ipsum, dolor sit...

----
### Data

Two datasets were used for this analysis...
* `reddit_realestate.csv`:  This data was scraped...
* `reddit_travel.csv`: This data was scraped...

----
### Consulted Sources

Lorem ipsum, vulgar latin, romance languages...

----
### Functions

----
### Data Import and Cleaning

In [11]:
# needed libraries for this notebook
import numpy as np
import pandas as pd

In [12]:
# read in files
file_path1 = '../data/reddit--RealEstate.csv'
file_path1a = '../data/reddit_realestate.csv'

file_path2 = '../data/reddit--travel.csv'
file_path2a = '../data/reddit_travel.csv'

realestate1 = pd.read_csv(file_path1)
realestate2 = pd.read_csv(file_path1a)

travel1 = pd.read_csv(file_path2)
travel2 = pd.read_csv(file_path2a)

# check dimensions
print(realestate1.shape)
print(realestate2.shape)
print('*'*10)
print(travel1.shape)
print(travel2.shape)

(3463, 5)
(1031, 3)
**********
(3060, 5)
(1091, 3)


In [13]:
# append the old dataset to the new dataset
realestate = pd.concat([realestate1, realestate2], ignore_index = True, sort = False)
travel = pd.concat([travel1, travel2], ignore_index = True, sort = False)

In [14]:
print(realestate.shape)
print(travel.shape)

(4494, 5)
(4151, 5)


In [15]:
# check last three rows for realestate df
realestate.tail(3)

Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on
4491,t3_1fvse4j,Any real estate companies that lets people bui...,I know there will be a lot of building codes a...,,
4492,t3_1fvdxx7,Buying property previously owned by a church,We are purchasing a home previously owned by a...,,
4493,t3_1fvzlcy,Air conditioning unit stolen,My realtor went to see a new build for me and ...,,


In [16]:
# check first three rows for travel df
travel.head(3)

Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on
0,t3_1dzc3zh,All Layover Questions - READ THIS NOTICE,**READ THE NEW LAYOVER FAQ:** [**https://www.r...,1720556000.0,2024-10-08 10:34:14
1,t3_1fya9jq,A few favs from Herzegovina,,1728315000.0,2024-10-08 10:34:14
2,t3_1fypldc,Missing a flight because you get too comfortab...,Is there a name for this phenomenon?\n\nAsking...,1728355000.0,2024-10-08 10:34:14


Having scraped the data myself, I noticed that the first row of the `travel` dataframe is a general notice for users rather than a user's post, so it will be removed.  Because later scrapes may have captured the same notice again, every row where `post_id` equals `t3_1dzc3zh` is located first.

In [18]:
# remove rows in travel df with post id: t3_1dzc3zh
# find the rows first
rows_to_remove = travel[travel['post_id'] == 't3_1dzc3zh'].index
rows_to_remove

Index([0, 764, 1528, 2294, 3060, 3750], dtype='int64')

In [19]:
# remove from df
travel.drop(index = rows_to_remove, inplace = True)
travel.reset_index(drop = True, inplace = True)

In [20]:
# confirm
print(travel.shape)
travel.head(2)

(4145, 5)


Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on
0,t3_1fya9jq,A few favs from Herzegovina,,1728315000.0,2024-10-08 10:34:14
1,t3_1fypldc,Missing a flight because you get too comfortab...,Is there a name for this phenomenon?\n\nAsking...,1728355000.0,2024-10-08 10:34:14


----
**Duplicates**<br>
Because the data was scraped more than once, the same post may appear several times.  The first cleaning step is to remove duplicates from both dataframes, using the `post_id` column to identify them.
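
As a quick illustration of how `post_id` can drive the de-duplication, here is a minimal sketch on a made-up toy frame.  The `demo` frame and its values are hypothetical and not part of the scraped data.  It counts the repeated ids before dropping them, which is a useful sanity check on the before/after row counts.

```python
import pandas as pd

# toy frame standing in for repeated scrapes of one subreddit (values are made up)
demo = pd.DataFrame({
    'post_id': ['t3_a', 't3_b', 't3_a', 't3_c', 't3_b'],
    'post_title': ['first', 'second', 'first', 'third', 'second'],
})

# count rows whose post_id already appeared earlier in the frame
n_dupes = demo.duplicated(subset = ['post_id']).sum()
print(f'duplicate rows: {n_dupes}')

# keep the first occurrence of each post_id and renumber the index
deduped = demo.drop_duplicates(subset = ['post_id']).reset_index(drop = True)
print(deduped.shape)
```

The number of rows removed by `drop_duplicates` should match the count reported by `duplicated`.  Resetting the index is optional; without it the surviving rows keep their original labels.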

In [22]:
# drop duplicates from realestate df
print(f'Number of rows before removing duplicates: {realestate.shape[0]}')
realestate.drop_duplicates(subset = ['post_id'], inplace = True)
print(f'Number of rows after removing duplicates: {realestate.shape[0]}')

Number of rows before removing duplicates: 4494
Number of rows after removing duplicates: 1055


In [23]:
# drop duplicates from travel df
print(f'Number of rows before removing duplicates: {travel.shape[0]}')
travel.drop_duplicates(subset = ['post_id'], inplace = True)
print(f'Number of rows after removing duplicates: {travel.shape[0]}')

Number of rows before removing duplicates: 4145
Number of rows after removing duplicates: 1108


---
**Missing Values and Data Types**: `realestate`

In [25]:
realestate.isnull().sum()

post_id           0
post_title        0
post_text        17
published_on    183
scraped_on      183
dtype: int64

The missing timestamps are not a concern at this stage, since this is a binary classification task on text data.<br>
Only 17 posts have no text besides the title.  These 17 rows can be dropped.

In [27]:
print(f'rows before: {realestate.shape[0]}')
realestate.dropna(subset = ['post_text'], inplace = True)
print(f'rows after: {realestate.shape[0]}')

rows before: 1055
rows after: 1038


In [75]:
realestate.isnull().sum()

post_id           0
post_title        0
post_text         0
published_on    178
scraped_on      178
dtype: int64

No missing post text remains.  Data types are checked below.

In [60]:
realestate.dtypes

post_id          object
post_title       object
post_text        object
published_on    float64
scraped_on       object
dtype: object

All good!
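
These dtypes are sufficient for a text classification task.  If the timestamps are needed later, one possible conversion is sketched below on a toy frame.  It assumes that `published_on` holds Unix epoch seconds and that `scraped_on` holds `YYYY-MM-DD HH:MM:SS` strings, as the sample rows above suggest; neither assumption has been verified against the full data.

```python
import pandas as pd

# toy frame mirroring the two timestamp columns, including a missing row
demo = pd.DataFrame({
    'published_on': [1728315000.0, None],
    'scraped_on': ['2024-10-08 10:34:14', None],
})

# assumption: published_on is seconds since the Unix epoch
demo['published_on'] = pd.to_datetime(demo['published_on'], unit = 's')

# assumption: scraped_on is a standard date-time string
demo['scraped_on'] = pd.to_datetime(demo['scraped_on'])

print(demo.dtypes)
```

Missing values become `NaT`, so the null counts are unchanged by the conversion.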

---
**Missing Values and Data Types**: `travel`

In [67]:
travel.isnull().sum()

post_id           0
post_title        0
post_text        14
published_on    330
scraped_on      330
dtype: int64

Follow the same approach as for `realestate`: drop the rows with no post text.

In [70]:
print(f'rows before: {travel.shape[0]}')
travel.dropna(subset = ['post_text'], inplace = True)
print(f'rows after: {travel.shape[0]}')

rows before: 1108
rows after: 1094


In [77]:
travel.isnull().sum()

post_id           0
post_title        0
post_text         0
published_on    327
scraped_on      327
dtype: int64

In [72]:
travel.dtypes

post_id          object
post_title       object
post_text        object
published_on    float64
scraped_on       object
dtype: object

All good!

---
**Word counts and post lengths**

Having numerical data about all this text data may come in handy later on.  New columns will be added to store the length and number of words for each post title and post text.
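
To make the two measures concrete, here is a minimal sketch on a toy frame built from two of the titles shown earlier.  Length is the number of characters, and the word count is taken as the number of whitespace-separated tokens, which is an assumption about how "words" are defined here.  The column name `post_title_word_count` is hypothetical.

```python
import pandas as pd

# toy frame with two titles taken from the dataframes above
demo = pd.DataFrame({
    'post_title': ['Air conditioning unit stolen', 'A few favs from Herzegovina'],
})

# number of characters in each title
demo['post_title_length'] = demo['post_title'].str.len()

# number of whitespace-separated tokens in each title (hypothetical column name)
demo['post_title_word_count'] = demo['post_title'].str.split().str.len()

print(demo)
```

Splitting on whitespace treats punctuation attached to a word as part of that word, which is usually acceptable for a rough count.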

<u>Post Titles</u>

In [87]:
# calculate post title lengths and store in new col
realestate['post_title_length'] = realestate['post_title'].str.len()
travel['post_title_length'] = travel['post_title'].str.len()

In [93]:
# confirm realestate df
realestate.head(1)

Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on,post_title_length
0,t3_1fybuha,I jointly inherited a property with someone wh...,My mother recently passed away and she had sig...,1728318000.0,2024-10-08 10:34:06,67


In [97]:
# confirm travel df
travel.head(1)

Unnamed: 0,post_id,post_title,post_text,published_on,scraped_on,post_title_length
1,t3_1fypldc,Missing a flight because you get too comfortab...,Is there a name for this phenomenon?\n\nAsking...,1728355000.0,2024-10-08 10:34:14,71
