# Data Extraction from Reddit

## Overview
This notebook extracts comments from Reddit using the Python Reddit API Wrapper (PRAW). The primary goal is to gather comments from the subreddit `r/ETFs` that mention the keyword `VOO`.

## Importing Libraries
- **Purpose**: Import necessary Python libraries.
- **Libraries Used**: `praw` for interacting with Reddit's API, `pandas` for tabulating the extracted comments, and `datetime` for handling timestamps.

In [1]:
import pandas as pd

import praw
from datetime import datetime as dt

## Setting up Reddit Client
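- **Function**: `setup_reddit_client` (defined alongside the extraction function further down; the cell directly below only stores the credentials passed to it)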
- **Purpose**: Initialize the Reddit API client with necessary credentials.
- **Inputs**: Client ID, client secret, and user agent.
- **Output**: A Reddit instance for API interactions.

In [2]:
client_id = "IHRxOIozokYZBI-72YpKfw"
client_secret = "a19KLEDV6oxrBZXz3FZ_h-2VtU_nIw"
user_agent = "Scraper 1.0 by /u/sc1015miniproject"
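
Because the credentials above are written directly into the notebook, anyone who can read the file can read them too. One possible alternative, sketched below, is to load them from environment variables. The variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT`, and the helper `load_reddit_credentials`, are assumptions made for illustration and are not part of the original notebook.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def load_reddit_credentials(env=None):
    """Return the three credential values as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }
```

The keys of the returned dict match the parameter names of `setup_reddit_client`, so the result could be unpacked as `setup_reddit_client(**load_reddit_credentials())`.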

## Defining the Comment Extraction Function
- **Function**: `extract_comments_voo`
- **Purpose**: Extract comments containing the keyword "VOO" from the subreddit `r/ETFs`.
- **Inputs**: Reddit instance, subreddit name, keyword, and a limit for the number of comments.
- **The `limit` Parameter**:
  - **Role**: The `limit` parameter sets an upper bound on the number of matching comments the function collects. Extraction stops as soon as that many have been gathered.
  - **Value (1000)**: The default is 1000, chosen to keep each run modest relative to the Reddit API's rate limits and fair-usage policies.
  - **Impact**: Capping the output bounds run time and local resource usage, since expanding large comment trees requires many requests and can be slow. The cap applies to collected comments, not to API requests, so it reduces the amount of work the script does but is not a rate limiter in itself. It is also separate from the `limit` argument that PRAW's listing methods such as `subreddit.search` accept.
- **Process**:
  - Searches the specified subreddit for submissions matching the keyword, newest first, and keeps those whose title contains it.
  - Extracts the comments from these submissions that contain the keyword (case-insensitive).
  - Records comment attributes: author, ID, creation time, permalink, body, score, and subreddit name.
  - Stops once the limit on extracted comments is reached.
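
The filter-then-cap logic described above can be tried without any Reddit access. The following is a minimal sketch on made-up in-memory data; the helper `matching_comments` and the sample submissions are hypothetical stand-ins, not PRAW objects or rows from the dataset.

```python
from itertools import islice


def matching_comments(submissions, keyword):
    """Yield comment bodies mentioning keyword, from submissions whose title mentions it."""
    needle = keyword.lower()
    for submission in submissions:
        # Skip submissions whose title does not contain the keyword
        if needle not in submission["title"].lower():
            continue
        for body in submission["comments"]:
            if needle in body.lower():
                yield body


# Made-up sample data standing in for Reddit submissions.
sample = [
    {"title": "VOO question", "comments": ["VOO is fine", "I prefer bonds", "voo and chill"]},
    {"title": "Bond ladders", "comments": ["VOO again"]},
    {"title": "SCHD vs VOO", "comments": ["Both track large caps", "Go with VOO"]},
]

# islice plays the role of the limit: stop after the first two matches.
first_two = list(islice(matching_comments(sample, "VOO"), 2))
all_matches = list(matching_comments(sample, "VOO"))
print(first_two)
print(all_matches)
```

Here the comment under "Bond ladders" is skipped even though it mentions the keyword, because the submission title does not.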

In [3]:
def setup_reddit_client(client_id, client_secret, user_agent):
    return praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)

def extract_comments_voo(reddit, subreddit_name, keyword, limit=1000):
    """
    Extract comments mentioning `keyword` from submissions in the given
    subreddit whose titles also contain `keyword` (case-insensitive).
    Stop once `limit` matching comments have been collected.
    """
    subreddit = reddit.subreddit(subreddit_name)
    comments_data = []
    comment_count = 0

    for submission in subreddit.search(keyword, sort='new'):
        # Only process submissions whose title contains the keyword
        if keyword.lower() in submission.title.lower():
            # Expand every "load more comments" placeholder so the full tree is scanned
            submission.comments.replace_more(limit=None)
            for comment in submission.comments.list():
                if comment_count >= limit:
                    break

                if keyword.lower() in comment.body.lower():
                    comment_data = {
                        'author': str(comment.author),
                        'id': comment.id,
                        'created_utc': comment.created_utc,
                        'permalink': comment.permalink,
                        'body': comment.body,
                        'score': comment.score,
                        'subreddit': str(comment.subreddit)
                    }
                    comments_data.append(comment_data)
                    comment_count += 1

        if comment_count >= limit:
            break

    return comments_data
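
The substring test `keyword.lower() in comment.body.lower()` also matches longer words that happen to contain the keyword, for example the ticker VOOG. If that matters for the analysis, a whole-word check is one option. The sketch below uses a hypothetical helper, `mentions_ticker`, which is not part of the notebook.

```python
import re


def mentions_ticker(text, ticker):
    """Return True if ticker appears in text as a whole word, ignoring case."""
    # \b marks a word boundary, so "VOO" matches "$VOO" and "voo." but not "VOOG"
    pattern = r"\b" + re.escape(ticker) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(mentions_ticker("$VOO is for people who hate money", "VOO"))
print(mentions_ticker("I hold VOOG instead", "VOO"))
```

The first call reports a match and the second does not, whereas the plain substring test would accept both.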

In [4]:
# Initialize and extract comments
reddit = setup_reddit_client(client_id, client_secret, user_agent)
comments = extract_comments_voo(reddit, "ETFs", "VOO")
comments_voo_df = pd.DataFrame(comments)

In [6]:
display(comments_voo_df)

| | author | id | created_utc | permalink | body | score | subreddit |
|---|---|---|---|---|---|---|---|
| 0 | DaemonTargaryen2024 | ky4tvav | 1.712298e+09 | /r/ETFs/comments/1bwa9np/voo_question/ky4tvav/ | VOO is already an index fund with a low expens... | 8 | ETFs |
| 1 | anbu-black-ops | ky4ugsl | 1.712298e+09 | /r/ETFs/comments/1bwa9np/voo_question/ky4ugsl/ | Splg is another alternative to voo if you cant... | 1 | ETFs |
| 2 | LAW9960 | ky2zvyq | 1.712269e+09 | /r/ETFs/comments/1bw0523/i_noticed_that_when_p... | Some platforms like M1 Finance or Fidelity Go ... | 4 | ETFs |
| 3 | coinslinger88 | kxwqx9i | 1.712176e+09 | /r/ETFs/comments/1buybes/sso_vs_spyvoo/kxwqx9i/ | \$VOO is for people who hate money | 1 | ETFs |
| 4 | coinslinger88 | kxwr28w | 1.712176e+09 | /r/ETFs/comments/1buvulh/just_starting_brokera... | \$VOO is for homeless people | 0 | ETFs |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 208 | Hancock02 | kwwa20h | 1.711592e+09 | /r/ETFs/comments/1bpdnmd/schg_schd_vs_voo/kwwa... | In the end, they will probably track about the... | 1 | ETFs |
| 209 | HiNdSiGhT1982 | kxbhxko | 1.711839e+09 | /r/ETFs/comments/1bpdnmd/schg_schd_vs_voo/kxbh... | good points but the cost on booth schd and sch... | 1 | ETFs |
| 210 | NativeTxn7 | kwwq3xp | 1.711599e+09 | /r/ETFs/comments/1bpdnmd/schg_schd_vs_voo/kwwq... | No way to know honestly, Large growth has had ... | 4 | ETFs |
| 211 | rem14 | kwvnzn8 | 1.711583e+09 | /r/ETFs/comments/1bpdnmd/schg_schd_vs_voo/kwvn... | 13% overlap with VOO and less than 1% overlap ... | 2 | ETFs |


In [7]:
comments_voo_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213 entries, 0 to 212
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   author       213 non-null    object 
 1   id           213 non-null    object 
 2   created_utc  213 non-null    float64
 3   permalink    213 non-null    object 
 4   body         213 non-null    object 
 5   score        213 non-null    int64  
 6   subreddit    213 non-null    object 
dtypes: float64(1), int64(1), object(5)
memory usage: 11.8+ KB
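
The `created_utc` column is stored as `float64` because Reddit reports creation times as seconds since the Unix epoch, in UTC. A minimal sketch of turning such a value into a readable, timezone-aware timestamp with the standard `datetime` module follows. The helper name `utc_from_epoch` is hypothetical, and the sample value is illustrative rather than an exact row from the dataset.

```python
from datetime import datetime, timezone


def utc_from_epoch(seconds):
    """Convert seconds since the Unix epoch into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Illustrative value in the same range as the created_utc column
example = utc_from_epoch(1712298000.0)
print(example.isoformat())  # 2024-04-05T06:20:00+00:00
```

Passing `tz=timezone.utc` avoids depending on the local timezone of the machine running the notebook.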


## Saving the DataFrame as a CSV File

In [8]:
comments_voo_df.to_csv('../datasets/reddit_comment_voo.csv', index=False)