## Get Reddit Data

This notebook will grab all the reddit data used. In order to run this notebook, the following need to be set up: 
1. **[PRAW API](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html)** created and saved within this directory in the file called "credentials.json". 
    - This will allow access to the Reddit data
2. **[Google Drive API](https://developers.google.com/drive/api/quickstart/go)** created and saved within this directory in the file called "client_secrets.json"
    - This will allow access to Google Drive to save the data in
    - Also create the Google Drive Folder ID Credentials file ("google_drive_credentials.json") that will contain the folder id of the folder in google drive you'd like to save it to 

    



In [1]:
%%capture
pip install -r ../../requirements.txt

In [10]:
%load_ext autoreload
%autoreload 2
%reload_ext autoreload


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [27]:
import sys

# set this on the path so that we can reference the commong data locations
sys.path.append("../../scripts/")
from data_collection import load_credentials, combine_subreddits, posts_to_comments


In [53]:
# Install the required packages
import sys 
import json #needed to translate JSON data
import requests #needed to perform HTTP GET and POST requests
import pandas as pd
import pprint # allows us to print more readable JSON data
from datetime import datetime 
import time 
import io



In [20]:
pd.set_option('display.max_colwidth', 100) # Need this otherwise text columns will truncate!


In [24]:
# Load in the credentials
reddit = load_credentials('credentials.json')

Logged into Reddit successfully


In [48]:
n_posts = 5 # Start small!
sub_reddits = ['FirstTimeHomeBuyer', 'RealEstate']#, 'loanoriginators', 
                # 'homeowners', 'Mortgages', 'personalfinance']
search_terms = ["Rocket", "Fargo"]

start_collect_time = time.time()

# grab the posts
reddit_data = combine_subreddits(reddit, n_posts, sub_reddits, search_terms)
end_collect_time = time.time()
time_elapsed = round((end_collect_time - start_time)/60.0, 2)
print("--- Grabbed the data in: %s minutes ---" % (time_elapsed))

# grab comments from posts
comments_combined_df = posts_to_comments(reddit_data)
time_elapsed = round(time.time() - end_collect_time, 3)
print("--- Grabbed comments from the posts in: %s seconds ---" % (time_elapsed))



Done pulling FirstTimeHomeBuyer subreddit for search term Rocket!
Done pulling RealEstate subreddit for search term Rocket!
Done pulling FirstTimeHomeBuyer subreddit for search term Fargo!
Done pulling RealEstate subreddit for search term Fargo!
====DONE!====
(20, 10)
--- Grabbed the data in: 13.46 minutes ---
--- Grabbed comments from the posts in: 0.105 seconds ---
CPU times: user 612 ms, sys: 34.6 ms, total: 646 ms
Wall time: 22.1 s


## Export the data 

In [50]:
# Can only save the data location locally temporarily and will need to delete 
# so that we can comply to Reddit policy 
comments_combined_df.to_csv('../../data/comments.csv', index = False)
reddit_data.to_csv('../../data/posts.csv', index = False)


In [49]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

# Authenticate with Google
gauth = GoogleAuth()
gauth.LocalWebserverAuth()  # Opens a browser for authentication

# Create GoogleDrive instance
drive = GoogleDrive(gauth)

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?client_id=209917166075-saupbq0ls0he9jdlpjrtscaio8kf1m7p.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&access_type=offline&response_type=code

Authentication successful.


In [61]:
# Grab the Folder Id of the google drive where the data will be saved
with open("google_drive_credentials.json", 'r') as file:
    google_drive_credentials = json.load(file)
folder_id = google_drive_credentials["folder_id"]

In [63]:
def save_data_google_drive(dataframe, filename): 
    csv_buffer = io.StringIO()
    dataframe.to_csv(csv_buffer, index=False)
    csv_buffer.seek(0)

    file = drive.CreateFile({'title': filename, 'parents': [{'id': folder_id}]})
    file.SetContentString(csv_buffer.getvalue())  # Set content from memory buffer
    file.Upload()
    print(f"File '{filename}' uploaded successfully to folder {folder_id}!")


In [64]:
save_data_google_drive(reddit_data, "reddit_data.csv")


File 'reddit_data.csv' uploaded successfully to folder 1kJ6TrI9MVT5mfnnYvS-OpRMJFVbIQ6Tl!


In [65]:
save_data_google_drive(comments_combined_df, "reddit_comments_data.csv")


File 'reddit_comments_data.csv' uploaded successfully to folder 1kJ6TrI9MVT5mfnnYvS-OpRMJFVbIQ6Tl!


In [60]:

# # Search for the file in the folder
# file_list = drive.ListFile({'q': f"'{folder_id}' in parents and trashed=false"}).GetList()

# # Find the specific file by name
# filename = "posts.csv"
# file_id = None
# for file in file_list:
#     if file['title'] == filename:
#         file_id = file['id']
#         break

# if file_id:
#     # Download file content into memory
#     file = drive.CreateFile({'id': file_id})
#     file_content = io.StringIO(file.GetContentString())

#     # Load into a Pandas DataFrame
#     df = pd.read_csv(file_content)

#     print(f"Successfully loaded '{filename}' into a DataFrame!")
#     print(df.head())  # Print first few rows
# else:
#     print(f"File '{filename}' not found in folder {folder_id}.")