## Get Reddit Data

This notebook collects all the Reddit data used. Before running it, set up the following:
1. **[PRAW API](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html)** credentials, created and saved in this directory in a file named "credentials.json".
    - These allow access to the Reddit data.
2. **[Google Drive API](https://developers.google.com/drive/api/quickstart/go)** credentials, created and saved in this directory in a file named "client_secrets.json".
    - These allow access to Google Drive, where the data is saved.
    - Also create the Google Drive folder ID credentials file ("google_drive_credentials.json"), which contains the ID of the Google Drive folder you'd like to save the data to.
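
The setup above can be checked before any cell is run. The sketch below is a hypothetical helper (`missing_setup_files` is not part of `data_collection`) that reports which of the three files named above are absent or are not valid JSON. It deliberately makes no assumption about the keys inside each file.

```python
import json
from pathlib import Path

# File names taken from the setup list above.
REQUIRED_FILES = (
    "credentials.json",
    "client_secrets.json",
    "google_drive_credentials.json",
)


def missing_setup_files(directory="."):
    """Return the names of required files that are absent or not valid JSON.

    Hypothetical helper for illustration; not part of data_collection.
    """
    problems = []
    for name in REQUIRED_FILES:
        path = Path(directory) / name
        try:
            with path.open(encoding="utf-8") as handle:
                json.load(handle)
        except (OSError, ValueError):
            # OSError: file missing or unreadable.
            # ValueError: contents are not valid JSON (or not valid UTF-8).
            problems.append(name)
    return problems
```

Calling `missing_setup_files()` from the notebook's directory should return an empty list once all three files are in place.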

    



In [1]:
%%capture
pip install -r ../../requirements.txt

In [2]:
%load_ext autoreload
%autoreload 2
%reload_ext autoreload


In [3]:
import sys

# Add this to the path so that we can reference the common data locations
sys.path.append("../../scripts/")
from data_collection import load_credentials, combine_subreddits, posts_to_comments


In [4]:
# Import the required packages
import sys
import json  # needed to translate JSON data
import requests  # needed to perform HTTP GET and POST requests
import pandas as pd
import pprint  # allows us to print more readable JSON data
from datetime import datetime
import time
import io



In [5]:
pd.set_option('display.max_colwidth', 100) # Need this otherwise text columns will truncate!


In [6]:
# Load in the credentials
reddit = load_credentials('credentials.json')

Version 7.7.1 of praw is outdated. Version 7.8.1 was released Friday October 25, 2024.


Logged into Reddit successfully


In [17]:
n_posts = 5 # Start small!
sub_reddits = ['FirstTimeHomeBuyer', 'RealEstate','loanoriginators', 
                'homeowners', 'Mortgages', 'personalfinance', 'realtor']
search_terms = ["Rocket", "Fargo"]

start_collect_time = time.time()

# grab the posts
reddit_data = combine_subreddits(reddit, n_posts, sub_reddits, search_terms)
end_collect_time = time.time()
time_elapsed = round((end_collect_time - start_collect_time)/60.0, 2)
print("--- Grabbed the data in: %s minutes ---" % (time_elapsed))

# grab comments from posts
# comments_combined_df = posts_to_comments(reddit_data)
# time_elapsed = round(time.time() - end_collect_time, 3)
# print("--- Grabbed comments from the posts in: %s seconds ---" % (time_elapsed))



Done pulling FirstTimeHomeBuyer subreddit for search term Rocket!
Done pulling RealEstate subreddit for search term Rocket!
Done pulling loanoriginators subreddit for search term Rocket!
Done pulling homeowners subreddit for search term Rocket!
Done pulling Mortgages subreddit for search term Rocket!
Done pulling personalfinance subreddit for search term Rocket!
Done pulling realtor subreddit for search term Rocket!
Done pulling FirstTimeHomeBuyer subreddit for search term Fargo!
Done pulling RealEstate subreddit for search term Fargo!
Done pulling loanoriginators subreddit for search term Fargo!
Done pulling homeowners subreddit for search term Fargo!
Done pulling Mortgages subreddit for search term Fargo!
Done pulling personalfinance subreddit for search term Fargo!
Done pulling realtor subreddit for search term Fargo!
====DONE!====
(60, 9)
--- Grabbed the data in: 0.07 minutes ---
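
The output above shows one pull per subreddit per search term. A loop of that shape could look roughly like the sketch below. This is not the actual `combine_subreddits` from `data_collection`, whose source is not shown here; `collect_posts` and the fields it keeps are assumptions for illustration. It relies only on PRAW's `reddit.subreddit(name).search(query, limit=n)` call, and it de-duplicates by submission `id` because the same post can match more than one search term.

```python
def collect_posts(reddit, n_posts, sub_reddits, search_terms):
    """Search each subreddit for each term and return one dict per unique post.

    Illustrative sketch only: the real combine_subreddits may differ.
    `reddit` is expected to behave like an authenticated praw.Reddit instance.
    """
    rows = {}
    for term in search_terms:
        for name in sub_reddits:
            # Subreddit.search yields Submission objects; limit caps the count.
            for post in reddit.subreddit(name).search(term, limit=n_posts):
                # Key on the submission id so a post matching several terms
                # is stored only once (the first match wins).
                rows.setdefault(post.id, {
                    "post_id": post.id,
                    "subreddit": name,
                    "search_term": term,
                    "title": post.title,
                    "selftext": post.selftext,
                    "score": post.score,
                    "num_comments": post.num_comments,
                    "created_utc": post.created_utc,
                })
            print(f"Done pulling {name} subreddit for search term {term}!")
    return list(rows.values())
```

Passing the result to `pd.DataFrame(...)` would give a table with one row per post, comparable in shape to `reddit_data`.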


## Export the data 

In [8]:
from data_collection import authenticate_google_drive, save_google_drive_data

In [9]:
# Grab the Google Drive object
drive = authenticate_google_drive()


Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?client_id=209917166075-saupbq0ls0he9jdlpjrtscaio8kf1m7p.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&access_type=offline&response_type=code

Authentication successful.


In [18]:
# Save the data in the Google Drive location
save_google_drive_data(drive=drive,
                       credential_file="google_drive_credentials.json",
                       dataframe=reddit_data,
                       filename="reddit_data.csv")

# save_google_drive_data(drive=drive,
#                        credential_file="google_drive_credentials.json",
#                        dataframe=comments_combined_df,
#                        filename="reddit_comments_data.csv")


Existing file 'reddit_data.csv' deleted.
File 'reddit_data.csv' uploaded successfully to folder 1kJ6TrI9MVT5mfnnYvS-OpRMJFVbIQ6Tl!
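
The output above shows an existing file being deleted before the new one is uploaded. A sketch of that delete-then-upload pattern follows, assuming `drive` is a PyDrive-style `GoogleDrive` object (an assumption; the source of `save_google_drive_data` is not shown). `replace_csv_in_folder` is a hypothetical name, and the folder ID is passed in directly rather than read from "google_drive_credentials.json", because that file's layout is not shown here.

```python
def replace_csv_in_folder(drive, folder_id, csv_text, filename):
    """Delete same-named files in a Drive folder, then upload csv_text.

    Illustrative sketch only; assumes a PyDrive-style GoogleDrive object.
    Returns the number of existing files that were deleted.
    """
    # Drive v2 query syntax, as accepted by PyDrive's ListFile.
    query = (
        f"'{folder_id}' in parents and title = '{filename}' and trashed = false"
    )
    existing = drive.ListFile({"q": query}).GetList()
    for old_file in existing:
        old_file.Delete()
        print(f"Existing file '{filename}' deleted.")

    # Create the replacement file inside the target folder and upload it.
    new_file = drive.CreateFile({"title": filename, "parents": [{"id": folder_id}]})
    new_file.SetContentString(csv_text)
    new_file.Upload()
    print(f"File '{filename}' uploaded successfully to folder {folder_id}!")
    return len(existing)
```

With a DataFrame in hand, the CSV text could come from `reddit_data.to_csv(index=False)`, which returns a string when no path is given.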


In [19]:
# The data may only be saved locally on a temporary basis and must be deleted
# afterwards so that we comply with Reddit policy
# comments_combined_df.to_csv('../../data/comments.csv', index = False)
# reddit_data.to_csv('../../data/posts.csv', index = False)
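
One way to keep a local copy strictly temporary, as the comments above require, is to write it to a location that is removed automatically. The sketch below uses only the standard library; `temporary_csv_path` is a hypothetical helper and is not part of `data_collection`.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_csv_path(filename="posts.csv"):
    """Yield a path inside a temporary directory that is deleted on exit."""
    # TemporaryDirectory removes the directory and everything in it when the
    # with-block ends, even if the caller's code raised an exception.
    with tempfile.TemporaryDirectory() as tmp_dir:
        yield os.path.join(tmp_dir, filename)
```

In the notebook this could be used as `with temporary_csv_path() as path: reddit_data.to_csv(path, index=False)`, with any inspection of the file done inside the `with` block.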


In [20]:
# Test: code to grab the data back from Google Drive
# from data_collection import grab_google_drive_folder_data

# df = grab_google_drive_folder_data(filename="reddit_data.csv")