### Clone Reference Repository
This command clones the StockEmotions GitHub repository, which contains financial sentiment data and models used as a reference for this project.

In [None]:
!git clone https://github.com/adlnlp/StockEmotions.git

This cell loads the tweet-level sentiment data from the StockEmotions repository (train, val, and test splits).
It keeps the relevant columns, renames them for consistency, concatenates the splits, parses the dates, and maps sentiment labels (0 → bearish, 1 → bullish).

In [1]:
import pandas as pd
from pathlib import Path

base = Path("StockEmotions/tweet")
df_list = []
for split in ["train_stockemo", "val_stockemo", "test_stockemo"]:
    path = base / f"{split}.csv"
    split_df = pd.read_csv(path)
    # Keep only the columns used downstream and give them consistent names
    split_df = split_df[['date', 'ticker', 'original', 'senti_label']].rename(
        columns={'original': 'text', 'senti_label': 'sentiment'}
    )
    df_list.append(split_df)
# Merge the three splits into a single frame
df = pd.concat(df_list, ignore_index=True)
df['date'] = pd.to_datetime(df['date']).dt.date
# Series.map returns NaN for any value that is not a key of this dict
df['sentiment'] = df['sentiment'].map({0: 'bearish', 1: 'bullish'})
df.head()


,date,ticker,text,sentiment
0,2020-01-01,AMZN,$AMZN Dow futures up by 100 points already 🥳,
1,2020-01-01,TSLA,$TSLA Daddy's drinkin' eArly tonight! Here's t...,
2,2020-01-01,AAPL,$AAPL We’ll been riding since last December fr...,
3,2020-01-01,TSLA,"$TSLA happy new year, 2020, everyone🍷🎉🙏",
4,2020-01-01,TSLA,"$TSLA haha just a collection of greats...""Mars...",
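
The empty `sentiment` column in the preview above is what `Series.map` produces when a value is not a key of the mapping dict. One possible cause, stated here as an assumption and not verified against the repository, is that `senti_label` already holds the strings `bearish`/`bullish` instead of 0/1. A minimal sketch of a mapping that tolerates both encodings, using a toy series and a hypothetical helper `normalize_sentiment`:

```python
import pandas as pd

def normalize_sentiment(labels: pd.Series) -> pd.Series:
    """Map 0/1 or existing text labels to 'bearish'/'bullish' (hypothetical helper)."""
    mapping = {
        "0": "bearish",
        "1": "bullish",
        "bearish": "bearish",
        "bullish": "bullish",
    }
    # Casting to lowercase strings lets one dict cover ints, numeric strings and text labels.
    # Anything outside the mapping becomes NaN so it is easy to spot.
    return labels.astype(str).str.strip().str.lower().map(mapping)

# Toy input mixing both encodings plus one unexpected value
toy = pd.Series([0, 1, "bullish", "Bearish", "unknown"])
out = normalize_sentiment(toy)
print(out.tolist())
print("unmapped rows:", int(out.isna().sum()))
```

Counting the unmapped rows right after the mapping step makes a silent all-NaN column visible before any later analysis depends on it.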


Prints the dataset shape (rows, columns) and the number of unique tickers.

In [2]:
print(df.shape)
print(df['ticker'].nunique())

(10000, 4)
37


Loads the daily price data for each ticker and stores it in a dictionary keyed by ticker symbol, with each DataFrame indexed by date.

In [3]:
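price_dir = Path("StockEmotions/price")
price_dfs = {}
for file in price_dir.glob("*.csv"):
    tk = file.stem  # the file name without its extension is used as the ticker key
    dfp = pd.read_csv(file, parse_dates=['Date'])
    # Index by plain datetime.date so the keys match the tweet 'date' column
    dfp['date'] = dfp['Date'].dt.date
    dfp.set_index('date', inplace=True)
    price_dfs[tk] = dfp[['Open','Close','Adj Close']]
# Number of tickers loaded and a sample of their symbols
len(price_dfs), list(price_dfs.keys())[:5]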


(41, ['AAPL', 'ABNB', 'AMT', 'AMZN', 'BA'])

In [4]:
price_dfs["AMZN"]

date,Open,Close,Adj Close
2019-12-31,92.099998,92.391998,92.391998
2020-01-02,93.750000,94.900497,94.900497
2020-01-03,93.224998,93.748497,93.748497
2020-01-06,93.000000,95.143997,95.143997
2020-01-07,95.224998,95.343002,95.343002
...,...,...,...
2020-12-24,159.695007,158.634506,158.634506
2020-12-28,159.699997,164.197998,164.197998
2020-12-29,165.496994,166.100006,166.100006
2020-12-30,167.050003,164.292496,164.292496
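
Tweets can be dated on days when the market is closed: the tweet preview starts on 2020-01-01, while the AMZN price index jumps from 2019-12-31 to 2020-01-02. A minimal sketch of one way to handle this, assuming the intent is to pair each tweet with the next available trading day (the notebook does not state an alignment rule). `next_trading_day` is a hypothetical helper, and the toy frame reuses the first AMZN rows shown above:

```python
import datetime as dt
from typing import Optional

import pandas as pd

def next_trading_day(prices: pd.DataFrame, day: dt.date) -> Optional[dt.date]:
    """Return the first index date on or after `day`, or None if there is none.

    Assumes `prices` is indexed by datetime.date values in ascending order.
    """
    dates = pd.to_datetime(pd.Index(prices.index))
    # Position of the first date that is >= day
    pos = dates.searchsorted(pd.Timestamp(day), side="left")
    if pos >= len(dates):
        return None
    return dates[pos].date()

# Toy price frame built from the first AMZN rows in the output above
toy_prices = pd.DataFrame(
    {
        "Open": [92.099998, 93.750000, 93.224998],
        "Close": [92.391998, 94.900497, 93.748497],
    },
    index=[dt.date(2019, 12, 31), dt.date(2020, 1, 2), dt.date(2020, 1, 3)],
)
toy_prices.index.name = "date"

aligned = next_trading_day(toy_prices, dt.date(2020, 1, 1))
print(aligned)
```

Whether a holiday tweet should map forward to the next session or backward to the previous close depends on the question being asked, so the direction is a design choice and not something the data dictates.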


#### Filter by Available Price Data
Removes tweets for tickers with no matching price history.

In [5]:
tickers = set(df['ticker'])
available = set(price_dfs.keys())
print("In tweets but missing price data:", tickers - available)
df = df[df['ticker'].isin(available)].reset_index(drop=True)

In tweets but missing price data: {'BRK.B'}
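
Before dropping rows it can help to see how many tweets each missing ticker accounts for. A small sketch on toy data: the rows and counts below are made up for illustration, and `split_by_price_coverage` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def split_by_price_coverage(tweets: pd.DataFrame, available: set):
    """Return (kept_rows, dropped_counts), where dropped_counts is per ticker."""
    has_price = tweets["ticker"].isin(available)
    # Count the rows that are about to be removed, grouped by ticker
    dropped_counts = tweets.loc[~has_price, "ticker"].value_counts()
    kept = tweets[has_price].reset_index(drop=True)
    return kept, dropped_counts

# Toy tweets: two of the four rows belong to a ticker without price data
toy = pd.DataFrame(
    {"ticker": ["AMZN", "BRK.B", "TSLA", "BRK.B"], "text": ["a", "b", "c", "d"]}
)
kept, dropped = split_by_price_coverage(toy, {"AMZN", "TSLA"})
print(dropped.to_dict())
print(len(kept))
```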


### Load Pretrained Sentiment Model
Loads the twitter-roberta-base-sentiment model and tokenizer to classify tweet sentiment.

In [6]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "cardiffnlp/twitter-roberta-base-sentiment"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Class names in the order of the model's output indices (0, 1, 2)
labels = ['negative', 'neutral', 'positive']
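
A minimal sketch of how the loaded tokenizer and model could be used to score a batch of tweets. The helper names `preprocess` and `classify` are hypothetical. The handle and link masking approximates, with simple regexes, the preprocessing described for this model family, and should be read as an assumption. The label order is taken from the `labels` list above.

```python
import re

import torch

LABELS = ["negative", "neutral", "positive"]  # assumed to match the model's output index order

def preprocess(text: str) -> str:
    """Mask user handles and links (regex approximation, see assumptions above)."""
    text = re.sub(r"@\w+", "@user", text)
    return re.sub(r"https?://\S+", "http", text)

def classify(texts, tokenizer, model, labels=LABELS):
    """Return one (label, probability) pair per input text."""
    enc = tokenizer(
        [preprocess(t) for t in texts],
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    model.eval()  # disable dropout for inference
    with torch.no_grad():  # gradients are not needed when only predicting
        logits = model(**enc).logits
    probs = torch.softmax(logits, dim=-1)
    conf, idx = probs.max(dim=-1)
    return [(labels[i], float(c)) for i, c in zip(idx.tolist(), conf.tolist())]
```

With the objects from the cell above, a call would look like `classify(df['text'].head().tolist(), tokenizer, model)`. Note that this model predicts negative/neutral/positive, which is a different label set from the bearish/bullish labels in the dataset, so any comparison between the two needs an explicit mapping.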


