## Prepare the dataset 

Prepare the dataset so that it can be labeled.

- Keep only posts created in 2024 through 2025
- Combine the title and text into `combine_text`
- Collapse extra whitespace in `combine_text`
- Add a unique identifier column
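
The four steps above can be sketched end to end on a toy DataFrame. This is a minimal illustration, not the notebook's actual code: the `prepare` function and the toy rows are hypothetical, and only the column names (`title`, `text`, `created_at`, `combine_text`, `unique_identifier`) come from the cells below.

```python
import pandas as pd

# Toy stand-in for the Reddit export (hypothetical rows).
toy = pd.DataFrame({
    "title": ["First post", "Old post", "Second   post"],
    "text": ["Body with\n\nline breaks", "Too early", "  padded body  "],
    "created_at": ["2024-03-01 10:00:00", "2023-12-31 23:59:59", "2025-06-15 08:30:00"],
})

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that mirrors the four preparation steps."""
    out = df.copy()
    out["created_at"] = pd.to_datetime(out["created_at"], errors="coerce")
    # 1. Keep only posts created in 2024 through 2025.
    out = out[out["created_at"].dt.year.between(2024, 2025)].copy()
    # 2. Combine the title and text.
    out["combine_text"] = out["title"] + ". " + out["text"]
    # 3. Trim the ends and collapse runs of whitespace into one space.
    out["combine_text"] = (
        out["combine_text"].str.strip().str.replace(r"\s+", " ", regex=True)
    )
    # 4. Add a sequential identifier as the first column.
    out = out.reset_index(drop=True)
    out.insert(0, "unique_identifier", range(len(out)))
    return out

prepared = prepare(toy)
print(prepared[["unique_identifier", "combine_text"]])
```

The 2023 row is dropped, and the remaining rows are numbered from 0.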


In [1]:
%%capture
pip install -r ../../requirements.txt

In [2]:
# Import the required packages
import sys 
import json #needed to translate JSON data
import requests #needed to perform HTTP GET and POST requests
import pandas as pd
import pprint # allows us to print more readable JSON data
from datetime import datetime 
import time 
import io

# NLP
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Need this set to None otherwise text columns will truncate!
pd.set_option('display.max_colwidth', None) 

In [3]:
import sys

# Add the scripts directory to the path so that we can reference the common data locations
sys.path.append("../../scripts/")
from process_text_data import text_embeddings, compute_similarity_scores

In [4]:
from data_collection import authenticate_google_drive, grab_google_drive_folder_data

drive = authenticate_google_drive('../0_data_collection/credentials/google_drive_client_secret.json')
df = grab_google_drive_folder_data(
    drive=drive,
    credential_file="../0_data_collection/credentials/google_drive_folder_id.json",
    filename="reddit_data.csv",
)

Successfully loaded 'reddit_data.csv' into a DataFrame!


In [5]:
# Convert the 'created_at' column from object (string) to datetime; unparseable values become NaT
df['created_at'] = pd.to_datetime(df['created_at'], errors='coerce')

# Filter data between 2024 and 2025
filtered_df = df[(df['created_at'].dt.year >= 2024) & (df['created_at'].dt.year <= 2025)]
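
One consequence of `errors='coerce'` is worth checking: values that cannot be parsed become `NaT`, and the year comparison then quietly excludes them. The sketch below, using made-up sample values, shows one way to count such rows before filtering. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical raw values: one valid, one malformed, one missing, one out of range.
raw = pd.Series(["2024-12-31 15:12:06", "not a date", None, "2026-01-01 00:00:00"])

parsed = pd.to_datetime(raw, errors="coerce")

# Rows that are NaT after parsing (malformed or missing in the source).
n_unparsed = int(parsed.isna().sum())

# NaT has no year, so it never satisfies the range check.
in_range = parsed.dt.year.between(2024, 2025)

print(f"unparsed rows: {n_unparsed}")
print(f"rows kept by the filter: {int(in_range.sum())}")
```

If the unparsed count is larger than expected, the source timestamps may need inspection before the filter is trusted.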


In [6]:
# Combine title and text, then trim and collapse repeated whitespace
filtered_df['combine_text'] = filtered_df['title']+". "+ filtered_df['text']
filtered_df['combine_text'] = filtered_df['combine_text'].str.strip().str.replace(r'\s+', ' ', regex=True)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df['combine_text'] = filtered_df['title']+". "+ filtered_df['text']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df['combine_text'] = filtered_df['combine_text'].str.strip().str.replace(r'\s+', ' ', regex=True)
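
The warnings above appear because `filtered_df` was produced by boolean indexing on `df`, so pandas cannot tell whether the assignment targets a copy or the original. A common way to avoid this is to take an explicit `.copy()` after filtering. The sketch below uses toy data and also fills missing `text` values with an empty string, since string concatenation with a missing value yields a missing `combine_text`. The `fillna("")` step is an assumption added for illustration, not part of the notebook.

```python
import pandas as pd

# Toy data (hypothetical rows); the second post has no body text.
df = pd.DataFrame({
    "title": ["Avoid lender X", "Link-only post", "Old post"],
    "text": ["Long   story\n\nhere", None, "ignored"],
    "created_at": pd.to_datetime(["2024-12-31", "2025-01-05", "2023-06-01"]),
})

mask = df["created_at"].dt.year.between(2024, 2025)
filtered_df = df[mask].copy()  # explicit copy, so later assignments are unambiguous

# Assumption: treat a missing body as an empty string instead of propagating NaN.
text = filtered_df["text"].fillna("")

filtered_df["combine_text"] = (
    (filtered_df["title"] + ". " + text)
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
)

print(filtered_df["combine_text"].tolist())
```

Because the copy is explicit, `df` itself is left unchanged and does not gain a `combine_text` column.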


In [7]:
# Add a column that can be used as a unique identifier
filtered_df.reset_index(drop=True, inplace=True)
filtered_df.reset_index(inplace=True)
filtered_df.rename(columns = {'index': 'unique_identifier'}, inplace=True)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df.rename(columns = {'index': 'unique_identifier'}, inplace=True)


In [8]:
# Check the shape of the filtered data
filtered_df.shape

(2823, 11)
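
Beyond the shape, a few quick checks can confirm that the preparation steps held before the data is saved. The `validate_prepared` helper below is hypothetical and only sketches the idea. It assumes the column names used in this notebook.

```python
import pandas as pd

def validate_prepared(df: pd.DataFrame) -> list:
    """Hypothetical helper: return a list of problems found (empty if none)."""
    problems = []
    if not df["unique_identifier"].is_unique:
        problems.append("unique_identifier contains duplicates")
    years = pd.to_datetime(df["created_at"], errors="coerce").dt.year
    if not years.between(2024, 2025).all():
        problems.append("created_at has values outside 2024-2025")
    if df["combine_text"].isna().any():
        problems.append("combine_text has missing values")
    # Any run of two or more whitespace characters, or a newline/tab, means
    # the whitespace collapse did not fully apply.
    messy = df["combine_text"].str.contains(r"\s{2,}|[\n\t]", regex=True, na=False)
    if messy.any():
        problems.append("combine_text has uncollapsed whitespace")
    return problems

# Toy frames (hypothetical rows) showing a clean and a problematic case.
good = pd.DataFrame({
    "unique_identifier": [0, 1],
    "created_at": ["2024-12-31 15:12:06", "2024-11-19 01:59:25"],
    "combine_text": ["a b", "c d"],
})
bad = pd.DataFrame({
    "unique_identifier": [0, 0],
    "created_at": ["2023-01-01 00:00:00", "2024-01-01 00:00:00"],
    "combine_text": ["a  b", None],
})

print(validate_prepared(good))
print(validate_prepared(bad))
```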

In [9]:
filtered_df.head(2) 


Unnamed: 0,unique_identifier,submission_id,subredit_topic,search_query,title,text,score,num_comments,username,created_at,combine_text
0,0,1hqgmvb,FirstTimeHomeBuyer,Rocket Mortgage,Avoid Rocket Mortgage,"I have a RM mortgage and they call me once a year and ask where I am employed, what my salary is, if the house needs any repairs, and they run a soft credit check while I am on the phone with them and run through all my credit card debt. They also sell this information. According to the mortgage agreement you are required to cooperate with them. No other mortgage company does this.\n\nDuring the mortgage application process, they tried to charge me 3 points (!!). I got a competitive quote from another mortgage lender at a much lower cost and RM matched the offer. I wish I went with the other lender. The only reason I went with RM at that point was because I was further along in the process and rates where trending higher.",607,191,maz4499,2024-12-31 15:12:06,"Avoid Rocket Mortgage. I have a RM mortgage and they call me once a year and ask where I am employed, what my salary is, if the house needs any repairs, and they run a soft credit check while I am on the phone with them and run through all my credit card debt. They also sell this information. According to the mortgage agreement you are required to cooperate with them. No other mortgage company does this. During the mortgage application process, they tried to charge me 3 points (!!). I got a competitive quote from another mortgage lender at a much lower cost and RM matched the offer. I wish I went with the other lender. The only reason I went with RM at that point was because I was further along in the process and rates where trending higher."
1,1,1gum4oc,FirstTimeHomeBuyer,Rocket Mortgage,Is rocket mortgage that bad?,"I’m in the market to buy a house now (21m). I have roughly 22k saved up cash with another 3k in stocks if need be. 12k in house savings, 6k in emergency savings, and 4k in car savings. I make 72k a year salary. I have inquired from Rocket mortgage, Chase bank, and Wells Fargo. Rocket mortgage has the most friendly loan officer but haven’t given me an official quote, loan office estimated mid 6 range. Wells Fargo quoted me 6.875 interest and 7.22 APR, and Chase bank seems to have ghosted me. I have - 760-790 credit score depending on bureau. I’m just lost in the process right now. Looking in the 200-250k range for a house. Everyone I have talked to has been confident I can afford 250k and under and have quickly gotten approval letters from rocket mortgage and Wells Fargo.\n\nA side note. With my job I get a company truck that I am allowed to use as personal vehicle so I hardly use my 2017 accord and 2004 tundra which are both paid off. So no outstanding debts and no foreseeable need to purchase or repair a vehicle.",0,21,Valuable-Pilot-2818,2024-11-19 01:59:25,"Is rocket mortgage that bad?. I’m in the market to buy a house now (21m). I have roughly 22k saved up cash with another 3k in stocks if need be. 12k in house savings, 6k in emergency savings, and 4k in car savings. I make 72k a year salary. I have inquired from Rocket mortgage, Chase bank, and Wells Fargo. Rocket mortgage has the most friendly loan officer but haven’t given me an official quote, loan office estimated mid 6 range. Wells Fargo quoted me 6.875 interest and 7.22 APR, and Chase bank seems to have ghosted me. I have - 760-790 credit score depending on bureau. I’m just lost in the process right now. Looking in the 200-250k range for a house. Everyone I have talked to has been confident I can afford 250k and under and have quickly gotten approval letters from rocket mortgage and Wells Fargo. A side note. With my job I get a company truck that I am allowed to use as personal vehicle so I hardly use my 2017 accord and 2004 tundra which are both paid off. So no outstanding debts and no foreseeable need to purchase or repair a vehicle."


In [10]:
from data_collection import authenticate_google_drive, save_google_drive_data


In [11]:
credentials_path="../0_data_collection/credentials/google_drive_client_secret.json"
folder_path="../0_data_collection/credentials/google_drive_folder_id.json"

In [12]:
# Grab the Google Drive object
drive = authenticate_google_drive(credentials_path=credentials_path)


In [13]:
# Save the filtered data to the Google Drive folder
save_google_drive_data(
    drive=drive,
    credential_file=folder_path,
    dataframe=filtered_df,
    filename="reddit_filtered_data.csv",
)

File 'reddit_filtered_data.csv' uploaded successfully to folder 1kJ6TrI9MVT5mfnnYvS-OpRMJFVbIQ6Tl!
