### MADS Capstone Project: Text Analysis of Reddit Data for Mortgage Companies

#### Authors: 
###### Andreea Serban and Chris McAllister
----
##### The goal of this notebook is to access the Reddit API to see how people feel about different mortgage companies. The output will be two dataframes, which we then save to our data folder:
1) One for the Reddit posts that match our search terms
2) One for all the comments associated with those posts
----
##### Once we have the data, we'll analyze it to answer the following questions:
1) Sentiment trends over time
2) Comparison of sentiment across mortgage companies
3) Similar brands, based on what Redditors are saying (Coca-Cola, Xfinity, etc.)
4) Other interests of top posters

Documentation for PRAW, the Python wrapper for the Reddit API: https://praw.readthedocs.io/en/stable/getting_started/quick_start.html

In [1]:
# Import the required packages (PRAW is installed further down if it is missing)
import sys 
from IPython.display import clear_output
import json #needed to translate JSON data
import requests #needed to perform HTTP GET and POST requests
import pandas as pd
pd.set_option('display.max_colwidth', None) # Need this otherwise text columns will truncate!
import pprint #allows us to print more readable JSON data
from datetime import datetime 
import time 

# We will use the PRAW library to access Reddit data, version 7.7.1
try:
    import praw
except ImportError:
    # Install the pinned version of PRAW only if it is not already available
    !pip3 install praw==7.7.1
    import praw

clear_output()


In [2]:
print("praw version:",praw.__version__) #it can be helpful to confirm the version we're using for our project

praw version: 7.7.1


In [3]:
import json

# Grab these credentials from: https://www.reddit.com/prefs/apps
# Function to load credentials from a JSON file
def load_credentials(file_path):
    with open(file_path, 'r') as file:
        credentials = json.load(file)
    return credentials

# Path to the credentials file
file_path = 'credentials.json'

# Load the credentials
credentials = load_credentials(file_path)

# Assigning the credentials to variables
REDDIT_USERNAME = credentials['REDDIT_USERNAME']
REDDIT_PASSWORD = credentials['REDDIT_PASSWORD']
APP_ID = credentials['APP_ID']
APP_SECRET = credentials['APP_SECRET']
APP_NAME = credentials['APP_NAME']
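
The cell above expects `credentials.json` to contain five keys. As a minimal sketch, assuming we want to fail early with a readable message when a key is missing, the check below wraps the same loading logic. The helper name `load_and_check_credentials` and the placeholder values are hypothetical and only for illustration.

```python
import json
import os
import tempfile

# Keys that the notebook reads from credentials.json
REQUIRED_KEYS = ["REDDIT_USERNAME", "REDDIT_PASSWORD", "APP_ID", "APP_SECRET", "APP_NAME"]


def load_and_check_credentials(file_path):
    """Hypothetical helper: load the JSON file and fail early if a key is missing."""
    with open(file_path, "r") as file:
        credentials = json.load(file)
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise KeyError(f"credentials file is missing keys: {missing}")
    return credentials


# Demonstration with placeholder values written to a temporary file,
# so no real credentials are needed to try it out
placeholder = {key: "placeholder" for key in REQUIRED_KEYS}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "credentials.json")
    with open(demo_path, "w") as file:
        json.dump(placeholder, file)
    demo_credentials = load_and_check_credentials(demo_path)

print(sorted(demo_credentials))
```

Keeping the real `credentials.json` out of version control is a sensible companion to this check.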

In [4]:
#Generate your reddit instance
reddit = praw.Reddit(
    client_id=APP_ID,
    client_secret=APP_SECRET,
    user_agent=APP_NAME,
    username=REDDIT_USERNAME, 
    password=REDDIT_PASSWORD,
    check_for_async=False # This additional parameter suppresses a warning about "Asynchronous PRAW"
)

Version 7.7.1 of praw is outdated. Version 7.8.1 was released Friday October 25, 2024.


In [5]:
def create_submission_df(subredit_topic="FirstTimeHomeBuyer", n_posts=5, search_query='Rocket', include_edit = False): 
    
    """
    This function accesses a subreddit page and searches it for a given phrase. It then takes the first n_posts search results and stores each post as a record in a dataframe.

    args:
    subredit_topic (str): The name of the subreddit we want to access (i.e. r/subredit_topic).
    n_posts (int): The maximum number of posts we want to get from that subreddit.
    search_query (str): The phrase we want to use to extract posts.
    include_edit (boolean): False will not filter the post. True will remove the text after the phrase "EDIT: ".
    
    returns:
    a Pandas DataFrame with one row per post: its title, text, score, comment count, author, creation time, and a dictionary of its top-level comments.
    """
    
    subreddit = reddit.subreddit(subredit_topic)
    submissions = subreddit.search(search_query, limit=n_posts)

    # submissions = subreddit.hot(limit=n_posts)
    data = {
        'submission_id': [],
        'subredit_topic': [],
        'search_query': [],
        'title': [],
        'text': [],
        'score': [],
        'num_comments': [],
        'username': [],
        'created_at': [],
        # Added 
        'comment_dict': []
    }
    for submission in submissions:
        data['submission_id'].append(submission.id)
        data['subredit_topic'].append(subredit_topic)
        data['search_query'].append(search_query)
        data['title'].append(submission.title)
        data['text'].append(submission.selftext)
        data['score'].append(submission.score)
        data['num_comments'].append(submission.num_comments)
        data['username'].append(submission.author.name if submission.author else 'Deleted')
        data['created_at'].append(submission.created_utc)

        # added to get a dict of comments saved as a column
        comment_dict = {}
        for i, comment in enumerate(submission.comments):
        
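            # Placeholders for collapsed threads ("load more comments") have no body, so skip them
            try:
                comment_i = comment.body
                comment_dict[i] = comment_i
            except AttributeError:
                pass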

        data['comment_dict'].append(comment_dict)

    # Create the dataframe from our dictionary
    submission_df = pd.DataFrame(data)
    submission_df['created_at'] = pd.to_datetime(submission_df['created_at'], unit='s') 
    
    # Remove \n\n using str.replace()
    submission_df['text'] = submission_df['text'].str.replace(r'\n\n', '', regex=True)

    # Remove everything after an "EDIT: " marker if include_edit is True
    # (the (?s) flag lets '.' match newlines, so later lines are removed as well)
    if include_edit:
        submission_df['text'] = submission_df['text'].str.replace(r'(?is)edit: .*', '', regex=True)

    # Specify search term at end of dataset creation
    submission_df['search_query'] = search_query
    
    return submission_df
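
The comment loop in `create_submission_df()` keeps only the items that expose a `body`. A minimal sketch of that idea, using stand-in objects instead of live PRAW comments so it runs without credentials: `collect_comment_bodies` and the sample comments are hypothetical, and the object without a `body` plays the role of a "load more comments" placeholder that a comment listing can contain.

```python
from types import SimpleNamespace


def collect_comment_bodies(comments):
    """Hypothetical helper mirroring the comment loop: keep items that have a body."""
    comment_dict = {}
    for i, comment in enumerate(comments):
        body = getattr(comment, "body", None)
        if body is not None:
            # The key is the position in the listing, so skipped items leave gaps
            comment_dict[i] = body
    return comment_dict


# Stand-ins for PRAW objects: two comments and one placeholder without a body
fake_comments = [
    SimpleNamespace(body="Great rate, slow closing."),
    SimpleNamespace(count=12),
    SimpleNamespace(body="Would not use again."),
]

comment_dict = collect_comment_bodies(fake_comments)
print(comment_dict)
# {0: 'Great rate, slow closing.', 2: 'Would not use again.'}
```

Because skipped items still consume an index, the keys are positions in the listing rather than a consecutive count of comments.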

In [6]:
n_posts = 5 # We recommend lowering this number while you build your code. Start with 1-2!

submission_df = create_submission_df("FirstTimeHomeBuyer", n_posts, "Rocket", False)

In [7]:
# submission_df.head(1)
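
To see what the text cleaning does to a single post, here is a minimal sketch of the same two steps using the standard `re` module instead of pandas. The helper `clean_post_text` and the sample text are hypothetical. Two details are assumptions added for this sketch: the `\b` word boundary, so that a phrase such as "credit: " is not mistaken for an edit marker, and the `(?s)` flag, so that everything after the marker is removed even when it spans several lines.

```python
import re


def clean_post_text(text, strip_edits=False):
    """Hypothetical helper: drop blank-line breaks and, optionally, trailing edit notes."""
    # Remove paragraph breaks (double newlines)
    text = text.replace("\n\n", "")
    if strip_edits:
        # (?i) ignores case, (?s) lets '.' match newlines, \b avoids matching "credit: "
        text = re.sub(r"(?is)\bedit: .*", "", text)
    return text


raw = "My credit: 720.\n\nEDIT: approved!\nWill update soon."

kept = clean_post_text(raw)
stripped = clean_post_text(raw, strip_edits=True)

print(repr(kept))      # 'My credit: 720.EDIT: approved!\nWill update soon.'
print(repr(stripped))  # 'My credit: 720.'
```

Note that removing `\n\n` joins paragraphs without a space; replacing it with a single space instead may be preferable before tokenizing the text.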

In [8]:
# Code to hit multiple subreddits and companies
def combine_subreddits(n_posts, sub_reddits, search_terms):

    """
    This function combines data from multiple sub-reddits and search terms to create one large dataframe for analysis. 

    args:
    n_posts (int): The number of posts that we want to access for each subreddit / search combination
    sub_reddits (list): A list of sub reddits we want to access
    search_terms (list): A list of terms we want to search on each subreddit (usually the name of a company). 

    returns:
    a pandas DataFrame with all the data about the posts in one big dataframe.
    """
    

    dfs = []
    for company in search_terms:
    
        for sr in sub_reddits:
        
            sr_data = create_submission_df(subredit_topic = sr, n_posts = n_posts, search_query = company)
            dfs.append(sr_data)
        
            print("Done pulling " + sr + " subreddit for search term " + company + "!")
    
    # Union all dataframes together
    df_final = pd.concat(dfs)
    print('====DONE!====')
    print(df_final.shape)

    return df_final

In [9]:
n_posts = 5 # Start small!
sub_reddits = ['FirstTimeHomeBuyer', 'RealEstate', 'loanoriginators', 'homeowners', 'Mortgages', 'personalfinance']
search_terms = ["Rocket", "Fargo"]


reddit_data = combine_subreddits(n_posts, sub_reddits, search_terms)

Done pulling FirstTimeHomeBuyer subreddit for search term Rocket!
Done pulling RealEstate subreddit for search term Rocket!
Done pulling loanoriginators subreddit for search term Rocket!
Done pulling homeowners subreddit for search term Rocket!
Done pulling Mortgages subreddit for search term Rocket!
Done pulling personalfinance subreddit for search term Rocket!
Done pulling FirstTimeHomeBuyer subreddit for search term Fargo!
Done pulling RealEstate subreddit for search term Fargo!
Done pulling loanoriginators subreddit for search term Fargo!
Done pulling homeowners subreddit for search term Fargo!
Done pulling Mortgages subreddit for search term Fargo!
Done pulling personalfinance subreddit for search term Fargo!
====DONE!====
(60, 10)
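
The row count in the shape above follows from the number of search term / subreddit combinations. A small sketch of that arithmetic, using the same lists as the cell above:

```python
from itertools import product

n_posts = 5
sub_reddits = ['FirstTimeHomeBuyer', 'RealEstate', 'loanoriginators', 'homeowners', 'Mortgages', 'personalfinance']
search_terms = ["Rocket", "Fargo"]

# Every search term is run against every subreddit
combinations = list(product(search_terms, sub_reddits))

# n_posts is an upper limit per search, so this is the maximum number of rows
max_rows = len(combinations) * n_posts

print(len(combinations), max_rows)
# 12 60
```

Since `n_posts` is only an upper limit per search, a shape with fewer rows than this maximum would mean that some searches returned fewer results.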


In [10]:
# Extract comments as a df
def posts_to_comments(data):

    """
    The goal of this function is to extract the comments from each post. In create_submission_df(), which combine_subreddits() calls,
    we created a column that stores each post's comments as a dictionary.

    This function takes that dataset as input and converts it into a df of comments. We will have 1 row for each comment in this new df.

    args:
    data (Pandas DataFrame): A dataset of reddit posts. The output of combine_subreddits()

    returns:
    a Pandas DataFrame of comments. 

    """

    dfs_list = []
    for index, row in data.iterrows():
        df_i = pd.DataFrame([row['comment_dict']]).T.reset_index()
        df_i = df_i.rename({'index':'comment_index', 0: 'comment_text'}, axis = 1)
        df_i['post_id'] = row['submission_id']
        df_i['sub_reddit'] = row['subredit_topic']
        df_i['post_time'] = row['created_at']
        df_i['post_comment_count'] = row['num_comments']
        df_i['poster_username'] = row['username']
    
        dfs_list.append(df_i)
    
    comment_df = pd.concat(dfs_list)
    
    # Remove \n\n using str.replace()
    comment_df['comment_text'] = comment_df['comment_text'].str.replace(r'\n\n', '', regex=True)
    comment_df['comment_text'] = comment_df['comment_text'].str.replace(r'\*. \n\*', '', regex=True)


    
    
    return comment_df

comments_df = posts_to_comments(reddit_data)

In [11]:
comments_df.shape

(3738, 7)
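
The reshaping in `posts_to_comments()` turns one row per post into one row per comment. Here is a minimal sketch of the same idea in plain Python, assuming each post is a dictionary with a `comment_dict` entry. The helper `flatten_comments`, the post IDs, and the comment texts are made up for illustration.

```python
# Two made-up posts: one with two comments, one with none
posts = [
    {"submission_id": "abc123", "subredit_topic": "Mortgages",
     "comment_dict": {0: "First comment", 2: "Third comment"}},
    {"submission_id": "def456", "subredit_topic": "RealEstate",
     "comment_dict": {}},
]


def flatten_comments(posts):
    """Hypothetical helper: emit one record per comment, tagged with its post."""
    rows = []
    for post in posts:
        for comment_index, comment_text in post["comment_dict"].items():
            rows.append({
                "comment_index": comment_index,
                "comment_text": comment_text,
                "post_id": post["submission_id"],
                "sub_reddit": post["subredit_topic"],
            })
    return rows


comment_rows = flatten_comments(posts)
for row in comment_rows:
    print(row)
```

In this sketch a post with an empty `comment_dict` contributes no rows, and the resulting list of records could be passed to `pd.DataFrame(...)` to get a dataframe.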

In [12]:
# reddit_data.sample(1)

In [13]:
# comments_df.sample()

In [14]:
# Join comments to the original posts dataset to get some more info
comments_combined = pd.merge(left = comments_df,
                             right = reddit_data,
                             how = 'left',
                             left_on = 'post_id',
                             right_on = 'submission_id'
                            )

comments_combined = comments_combined[['comment_index', 'comment_text', 'post_id', 'sub_reddit', 'post_time', 'post_comment_count',
                                       'poster_username', 'subredit_topic', 'search_query', 'title']]


comments_combined.sample()

Unnamed: 0,comment_index,comment_text,post_id,sub_reddit,post_time,post_comment_count,poster_username,subredit_topic,search_query,title
122,16,"FYI to everyone who is as entertained at this post as I am, OP posted in this sub a few months ago asking about Rocket, and people in this sub warned her that they would take on a bunch of fees right before closing. And now she is big mad that they did it, after being warned: https://www.reddit.com/r/FirstTimeHomeBuyer/comments/11mjqqx/what_mortgage_lender_would_you_recommend/jbjw1lg/?utm_source=share&utm_medium=ios_app&utm_name=ioscss&utm_content=1&utm_term=1&context=3",1541el1,FirstTimeHomeBuyer,2023-07-19 17:31:01,471,justmeAlonekitty,FirstTimeHomeBuyer,Rocket,WARNING- do NOT work with rocket mortgage!!!
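
One thing to watch with this join: if the same post is returned for more than one search term, its `submission_id` appears more than once in the posts table, and a left merge repeats each of its comments once per copy. The sketch below shows the effect on made-up data and one possible guard, assuming we are willing to keep only the first search term for such a post. The IDs, titles, and comments are invented for illustration.

```python
import pandas as pd

# Made-up posts: "abc123" was returned for two different search terms
posts = pd.DataFrame({
    "submission_id": ["abc123", "abc123", "def456"],
    "search_query": ["Rocket", "Fargo", "Rocket"],
    "title": ["Lender comparison", "Lender comparison", "Closing costs"],
})
comments = pd.DataFrame({
    "post_id": ["abc123", "def456"],
    "comment_text": ["Went with a credit union.", "Ask for a fee sheet."],
})

# Naive join: the comment on "abc123" is repeated once per matching post row
naive = pd.merge(comments, posts, how="left",
                 left_on="post_id", right_on="submission_id")

# Guarded join: keep one row per post, and let pandas verify that assumption
unique_posts = posts.drop_duplicates(subset="submission_id")
checked = pd.merge(comments, unique_posts, how="left",
                   left_on="post_id", right_on="submission_id",
                   validate="many_to_one")

print(len(naive), len(checked))
# 3 2
```

Dropping duplicates loses the information that a post matched several search terms, so whether this trade-off is acceptable depends on the analysis that follows.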


In [15]:
# Save data to our data folder
comments_combined.to_csv('../../data/comments.csv', index = False)
reddit_data.to_csv('../../data/posts.csv', index = False)
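
`to_csv()` does not create missing folders, so the save step fails if `../../data` does not exist yet. A minimal sketch of a guard, assuming we want to create the folder on demand: the helper `ensure_parent_dir` is hypothetical, and the demonstration writes to a temporary directory rather than the real data folder.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(file_path):
    """Hypothetical helper: create the parent folder of file_path if it is missing."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration in a temporary directory so nothing is written to the project
with tempfile.TemporaryDirectory() as tmp_dir:
    target = ensure_parent_dir(Path(tmp_dir) / "data" / "comments.csv")
    target.write_text("comment_index,comment_text\n0,example\n")
    created = target.exists()

print(created)
# True
```

In the notebook, the returned path could be passed straight to `comments_combined.to_csv(...)`.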