<a href="https://www.kaggle.com/code/gpreda/collect-wallstreetbets-data?scriptVersionId=230080152" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Collect and Update r/wallstreetbets Data on Reddit

The process has four steps:

- Run the collection
- Load the current data
- Merge the existing data with the newly collected data
- Save the new version

We schedule the collection to run daily.

To make this work, we also need to set the credentials for the Reddit application using the Kaggle feature that allows us to store secrets.
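
As a minimal sketch of the idea, the helper below reads a secret from Kaggle when the `kaggle_secrets` module is importable and falls back to an environment variable elsewhere. The name `get_secret` is a hypothetical helper introduced for illustration and is not part of this notebook.

```python
import os


def get_secret(name):
    """Return a secret from Kaggle User Secrets, or from the environment elsewhere."""
    try:
        # kaggle_secrets is only importable inside a Kaggle notebook session.
        from kaggle_secrets import UserSecretsClient
    except ImportError:
        # Outside Kaggle: raises KeyError if the variable is not set.
        return os.environ[name]
    return UserSecretsClient().get_secret(name)
```

For example, `get_secret("REDDIT_APP_NAME")` would return the stored application name in either environment.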

Note: we also monitor the activity of this collection and dataset-update notebook through an integration with neptune.ai.


# Initializations

## Install praw

In [1]:
!pip install praw

Collecting praw
  Downloading praw-7.7.1-py3-none-any.whl (191 kB)
     |████████████████████████████████| 191 kB 5.1 MB/s            
Collecting prawcore<3,>=2.1
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Installing collected packages: prawcore, praw
Successfully installed praw-7.7.1 prawcore-2.3.0


## Install neptune

In [2]:
!pip3 install neptune-client==1.2.0

Collecting neptune-client==1.2.0
  Downloading neptune_client-1.2.0-py3-none-any.whl (448 kB)
     |████████████████████████████████| 448 kB 5.1 MB/s            
Collecting bravado<12.0.0,>=11.0.0
  Downloading bravado-11.1.0-py2.py3-none-any.whl (37 kB)
Collecting swagger-spec-validator>=2.7.4
  Downloading swagger_spec_validator-3.0.3-py2.py3-none-any.whl (27 kB)
Collecting monotonic
  Downloading monotonic-1.6-py2.py3-none-any.whl (8.2 kB)
Collecting bravado-core>=5.16.1
  Downloading bravado-core-6.1.1.tar.gz (63 kB)
     |████████████████████████████████| 63 kB 1.8 MB/s
  Preparing metadata (setup.py) ... done
Collecting jsonref
  Downloading jsonref-1.1.0-py3-none-any.whl (9.4 kB)
Building wheels for collected packages: bravado-core
  Building wheel for bravado-core (setup.py) ... done
  Created wheel for bravado-core: filename=bravado_core-6.1.1-py2.py3-none-any.whl size=67696 sha256=f36df6ba6572fbaac74e30133dec7e

## Packages used

In [3]:
import os
import praw
import neptune.new as neptune
import pandas as pd
import datetime as dt
from tqdm import tqdm
import time



# Environment setup for Reddit and neptune.ai secrets

Here is a simple tutorial about using secrets with Kaggle: [Feature Launch: User Secrets](https://www.kaggle.com/product-feedback/114053)

In [4]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

# Neptune.ai initialization

In [5]:
neptune_api_token = user_secrets.get_secret("neptune_api")
run = None
try:
    run = neptune.init_model(
        project="preda/WallStreetBets",
        api_token=neptune_api_token,
    )
except Exception as ex:
    print(f"Exception: {ex}")

Exception: 

----NeptuneMissingRequiredInitParameter---------------------------------------

neptune.init_model() invocation was missing key.
If you want to create a new object using init_model, key is required:
https://docs.neptune.ai/api/neptune#init_model

Need help?-> https://docs.neptune.ai/getting_help

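The exception above reports that `init_model` requires a `key`, so `run` stays `None` and the later `if run:` checks skip all logging. Since the rest of the notebook treats the object as a run, one possible approach is to create it with `neptune.init_run` instead. The sketch below wraps the initialization so a failure still leaves the notebook usable; `init_tracking` is a hypothetical helper, and the commented usage assumes `neptune.init_run` accepts the same `project` and `api_token` arguments in the installed client version.

```python
def init_tracking(init_fn, **kwargs):
    """Call a tracker's init function; return None instead of raising on failure."""
    try:
        return init_fn(**kwargs)
    except Exception as ex:
        print(f"Tracking disabled: {ex}")
        return None


# Assumed usage inside the notebook (not executed here):
# run = init_tracking(
#     neptune.init_run,
#     project="preda/WallStreetBets",
#     api_token=neptune_api_token,
# )
```

Passing the init function as an argument keeps the helper independent of the tracking library.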


# Utility functions

In [6]:
def get_date(created):
    return dt.datetime.fromtimestamp(created)


def reddit_connection(environment="Kaggle"):
    
    if environment == "Kaggle":
        personal_use_script = user_secrets.get_secret("REDDIT_PERSONAL_USE_SCRIPT_14_CHARS")
        client_secret = user_secrets.get_secret("REDDIT_SECRET_KEY_27_CHARS")
        user_agent = user_secrets.get_secret("REDDIT_APP_NAME")
        username = user_secrets.get_secret("REDDIT_USER_NAME")
        password = user_secrets.get_secret("REDDIT_LOGIN_PASSWORD")
         
    else: #local (Linux/Windows) environment
        personal_use_script = os.environ["REDDIT_PERSONAL_USE_SCRIPT_14_CHARS"]
        client_secret = os.environ["REDDIT_SECRET_KEY_27_CHARS"]
        user_agent = os.environ["REDDIT_APP_NAME"]
        username = os.environ["REDDIT_USER_NAME"]
        password = os.environ["REDDIT_LOGIN_PASSWORD"]

    # Pass the retrieved password instead of an empty string
    reddit = praw.Reddit(client_id=personal_use_script,
                         client_secret=client_secret,
                         user_agent=user_agent,
                         username=username,
                         password=password)
    return reddit
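
Note that `dt.datetime.fromtimestamp` without a `tz` argument returns a naive datetime in the local time zone of the machine running the code, so the `timestamp` column depends on where the notebook runs. A timezone-aware variant could look like the sketch below; `get_date_utc` is a hypothetical name, not a function from this notebook.

```python
import datetime as dt


def get_date_utc(created):
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(created, tz=dt.timezone.utc)


print(get_date_utc(0))  # 1970-01-01 00:00:00+00:00
```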


# Build the dataset (daily update)

In [7]:
def build_dataset(reddit, search_words='wallstreetbets', items_limit=4000):
    
    # Collect reddit posts
    subreddit = reddit.subreddit(search_words)
    new_subreddit = subreddit.new(limit=items_limit)
    topics_dict = { "title":[],
                "score":[],
                "id":[], "url":[],
                "comms_num": [],
                "created": [],
                "body":[]}
    
    print("retrieve new reddit posts ...")
    for submission in tqdm(new_subreddit):
        topics_dict["title"].append(submission.title)
        topics_dict["score"].append(submission.score)
        topics_dict["id"].append(submission.id)
        topics_dict["url"].append(submission.url)
        topics_dict["comms_num"].append(submission.num_comments)
        topics_dict["created"].append(submission.created)
        topics_dict["body"].append(submission.selftext)

    for comment in tqdm(subreddit.comments(limit=items_limit)):
        topics_dict["title"].append("Comment")
        topics_dict["score"].append(comment.score)
        topics_dict["id"].append(comment.id)
        topics_dict["url"].append("")
        topics_dict["comms_num"].append(0)
        topics_dict["created"].append(comment.created)
        topics_dict["body"].append(comment.body)

    topics_df = pd.DataFrame(topics_dict)
    print(f"new reddit posts retrieved: {len(topics_df)}")
    topics_df['timestamp'] = topics_df['created'].apply(get_date)

    return topics_df
   

# Update and save dataset

We perform the following actions:  
* Load old dataset  
* Merge the two datasets  
* Save the merged data

Here we also log information about the updated dataset.
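
The merge step can be illustrated on toy data (the two small frames below are made up for this example). Concatenating old and new rows and then dropping duplicates on `id` with `keep="last"` means that, for an item present in both, the newly collected row with its more recent score wins.

```python
import pandas as pd

# Toy stand-ins for the stored dataset and the freshly collected one.
old_df = pd.DataFrame({"id": ["a1", "b2"], "score": [10, 5]})
new_df = pd.DataFrame({"id": ["b2", "c3"], "score": [7, 1]})

# Stack old rows first, then new rows, so "last" refers to the new copy.
merged_df = pd.concat([old_df, new_df], axis=0)
deduped_df = merged_df.drop_duplicates(subset=["id"], keep="last")

print(deduped_df.reset_index(drop=True))
```

Here `b2` appears once in the result, with the score `7` from the new collection.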

In [8]:
def update_and_save_dataset(topics_df):   
    file_path = "../input/wallstreetbets-2022/wallstreetbets_2022.csv"
    out_file_path = "wallstreetbets_2022.csv"
    if run:
        run["rows_new"] = topics_df.shape[0]
        run["cols_new"] = topics_df.shape[1]
    if os.path.exists(file_path):
        topics_old_df = pd.read_csv(file_path)
        if run:
            run["rows_old"] = topics_old_df.shape[0]
            run["cols_old"] = topics_old_df.shape[1]
        print(f"past reddit posts: {topics_old_df.shape}")
        topics_all_df = pd.concat([topics_old_df, topics_df], axis=0)
        print(f"new reddit posts: {topics_df.shape[0]} past posts: {topics_old_df.shape[0]} all posts: {topics_all_df.shape[0]}")
        topics_new_df = topics_all_df.drop_duplicates(subset = ["id"], keep='last', inplace=False)
        print(f"all reddit posts: {topics_new_df.shape}")
        if run:
            # Log the size of the merged (deduplicated) dataset, not the old one
            run["rows_merged"] = topics_new_df.shape[0]
            run["cols_merged"] = topics_new_df.shape[1]
        topics_new_df.to_csv(out_file_path, index=False)
    else:
        print(f"reddit posts: {topics_df.shape}")
        topics_df.to_csv(out_file_path, index=False)

# Run it all

We perform the following actions:  
* Initialize connection  
* Build the dataset  
* Update and save the dataset


In [9]:
reddit = reddit_connection()
topics_data_df = build_dataset(reddit)
update_and_save_dataset(topics_data_df)

Version 7.7.1 of praw is outdated. Version 7.8.1 was released Friday October 25, 2024.


retrieve new reddit posts ...


797it [00:10, 73.11it/s]
917it [00:05, 175.63it/s]


new reddit posts retrieved: 1714




past reddit posts: (1098269, 8)
new reddit posts: 1714 past posts: 1098269 all posts: 1099983
all reddit posts: (1099205, 8)
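
The counts printed above can be cross-checked with a little arithmetic. The numbers are copied from this run's output; reading the difference as "items that were already in the dataset" assumes the stored file contained no duplicate ids of its own.

```python
posts_and_comments = 797 + 917      # items reported by the two tqdm progress bars
past_rows = 1098269
concatenated = past_rows + 1714     # rows before deduplication
after_dedup = 1099205               # rows in the merged dataset

duplicates_removed = concatenated - after_dedup
net_new_rows = after_dedup - past_rows
print(posts_and_comments, concatenated, duplicates_removed, net_new_rows)
# prints: 1714 1099983 778 936
```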


# Stop neptune.ai session

Make sure to stop the neptune.ai session before exiting the run.

In [10]:
if run:
    run.stop()