# StockTwits-Crypto Dataset

## Idea
- As discussed earlier while exploring the classical ML approach, the idea is to use an external dataset from a similar domain to fine-tune a large language model for binary classification.
- This particular dataset contains all cryptocurrency-related posts from the StockTwits website, from 1 November 2021 to 15 June 2022.
- In total, __1.3 million__ tweets were collected over this period.
- Although the labeling process is unknown, I decided to try fine-tuning the model on it.

Reference - https://huggingface.co/datasets/ElKulako/stocktwits-crypto

## Data Description

- Stats:
    - The dataset holds __1.3 million tweets__ annotated with __3__ labels.
    - Sentiments: __Bearish, Bullish, Neutral__
- I've hypothesized that Bearish corresponds to negative sentiment and Bullish to positive sentiment.
- After dropping tweets with the Neutral label and mapping Bearish to the Negative class and Bullish to the Positive class, the counts are:
    - No. of __Negative__ Samples - __124,451__
    - No. of __Positive__ Samples - __676,701__
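
The filtering and mapping described above can be sketched as follows. This is a minimal illustration on a tiny made-up frame, not the notebook's actual code. The label coding (0 = Bearish, 1 = Neutral, 2 = Bullish) is an assumption inferred from the class counts shown later in this notebook, and `SENTIMENT_MAP`, `toy_df` and `binary_df` are hypothetical names.

```python
import pandas as pd

# Assumed label coding, inferred from the class counts in this notebook:
# 0 = Bearish, 1 = Neutral, 2 = Bullish.
SENTIMENT_MAP = {0: "Negative", 2: "Positive"}

# Tiny hypothetical frame standing in for the real dataset.
toy_df = pd.DataFrame(
    {
        "text": ["time to short", "sideways again", "to the moon", "buy the dip"],
        "label": [0, 1, 2, 2],
    }
)

# Drop Neutral rows, then map the remaining codes to sentiment names.
binary_df = toy_df[toy_df["label"].isin(list(SENTIMENT_MAP))].copy()
binary_df["sentiment"] = binary_df["label"].map(SENTIMENT_MAP)

print(binary_df["sentiment"].value_counts())
```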

In [1]:
import pandas as pd
import numpy as np

## Loading Dataset

In [4]:
data1_df = pd.read_excel("../data/stocktwits-crypto/st-data-full.xlsx", sheet_name="stocktwits_1")
data2_df = pd.read_excel("../data/stocktwits-crypto/st-data-full.xlsx", sheet_name="stocktwits_2")

In [8]:
data_df = pd.concat([data1_df, data2_df], axis=0)
data_df

| | text | label |
|---|---|---|
| 0 | if you were curious, price chose the lowest ch... | 1 |
| 1 | true, not even 10k followers here yet. | 1 |
| 2 | dogecoin co-founder billy markus hits back at ... | 1 |
| 3 | i’m curious, do any bulls have a price where ... | 1 |
| 4 | friday everybody buy 10 more on friday | 2 |
| ... | ... | ... |
| 731692 | i tried well now the haters are -45% or liquid... | 0 |
| 731693 | i'd be pretty happy if bitcoin ended the year... | 2 |
| 731694 | will jump to 88 000 in no time 😬✈️✈️✈️ | 2 |
| 731695 | set it and forget it, i’ll see you guys at 😉😉 | 2 |
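
Note that the last index shown is 731695, while the label counts below add up to 1,331,697 rows. This suggests that the index of the second sheet restarts and index values repeat after concatenation. A minimal sketch of the effect and of one possible remedy, `ignore_index=True`, on two hypothetical toy frames (`part1`, `part2`):

```python
import pandas as pd

# Two hypothetical frames standing in for the two Excel sheets.
part1 = pd.DataFrame({"text": ["a", "b"], "label": [1, 2]})
part2 = pd.DataFrame({"text": ["c", "d"], "label": [0, 2]})

# Plain concatenation keeps each frame's own index, so labels repeat.
stacked = pd.concat([part1, part2], axis=0)
print(stacked.index.is_unique)  # False

# ignore_index=True assigns a fresh 0..n-1 index instead.
renumbered = pd.concat([part1, part2], axis=0, ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

The duplicated index does not affect the steps below, because the sampled frame gets its index reset later anyway.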


In [9]:
data_df['label'].value_counts()

2    676701
1    530545
0    124451
Name: label, dtype: int64

## Filtering Positive and Negative Labels

In [10]:
data_df = data_df[data_df['label'].isin([0, 2])]
data_df['label'].value_counts()

2    676701
0    124451
Name: label, dtype: int64

In [12]:
data_df['text_len'] = data_df['text'].astype(str).apply(lambda x: len(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_df['text_len'] = data_df['text'].astype(str).apply(lambda x: len(x))
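
The message above is pandas' `SettingWithCopyWarning`: the filtered frame may be a view of the original, so pandas cannot guarantee where the new column ends up. One common way to avoid it, sketched here on a hypothetical toy frame rather than the real data, is to take an explicit `.copy()` when filtering:

```python
import pandas as pd

# Hypothetical stand-in for the full dataset.
toy_df = pd.DataFrame({"text": ["short it", "meh", "moon"], "label": [0, 1, 2]})

# .copy() makes the filtered frame independent of the original,
# so adding a column to it is unambiguous.
filtered_df = toy_df[toy_df["label"].isin([0, 2])].copy()

# Length in characters; .str.len() is equivalent to apply(len) on strings.
filtered_df["text_len"] = filtered_df["text"].astype(str).str.len()

print(filtered_df)
```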


## Inspecting Distribution of Sequence Length

In [14]:
data_df.describe()

| | label | text_len |
|---|---|---|
| count | 801152.0 | 801152.0 |
| mean | 1.68932 | 75.838547 |
| std | 0.724458 | 78.228417 |
| min | 0.0 | 1.0 |
| 25% | 2.0 | 32.0 |
| 50% | 2.0 | 51.0 |
| 75% | 2.0 | 89.0 |
| max | 2.0 | 1007.0 |
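
The summary shows a long right tail: the median length is 51 characters, but the maximum is 1007. Note that `text_len` counts characters, not tokens. A possible follow-up, sketched here on made-up lengths, is to look at upper percentiles and at the share of texts that would fit under a given cutoff. The 95th percentile and the names `lengths`, `cutoff` and `share_kept` are illustrative assumptions, not choices made in this notebook.

```python
import pandas as pd

# Hypothetical character lengths with a long right tail.
lengths = pd.Series([1, 22, 30, 42, 55, 93, 100, 993], name="text_len")

# Upper percentiles give a feel for how heavy the tail is.
print(lengths.quantile([0.90, 0.95, 0.99]))

# Candidate cutoff (assumed here: 95th percentile) and the share of
# texts that would not be truncated under it.
cutoff = int(lengths.quantile(0.95))
share_kept = (lengths <= cutoff).mean()
print(cutoff, share_kept)
```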


## Sampling a Final Set of 100,000 Samples

- Equally distributed across both classes (50k each)

In [15]:
# Draw 50,000 rows per label; the fixed seed makes the draw reproducible
sampled_df = data_df.groupby('label').apply(lambda x: x.sample(50000, random_state=42))

In [18]:
sampled_df.reset_index(drop=True, inplace=True)

In [19]:
sampled_df

| | text | label | text_len |
|---|---|---|---|
| 0 | 7 ways to short bitcoins bear | 0 | 30 |
| 1 | yahoo shows bitty 30k | 0 | 22 |
| 2 | can anyone instruct me how to short shib? | 0 | 42 |
| 3 | bulls need to learn what bearish flags look l... | 0 | 93 |
| 4 | all the solid alt coins at breaching ath's. st... | 0 | 100 |
| ... | ... | ... | ... |
| 99995 | another bare flag daaam pick me up at 65k boy... | 2 | 55 |
| 99996 | the bears have hopes | 2 | 21 |
| 99997 | who has cool nft shiba images😁 | 2 | 32 |
| 99998 | nice slow recovery i’ll take it | 2 | 32 |


In [20]:
sampled_df["label"].value_counts()

0    50000
2    50000
Name: label, dtype: int64
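
For reference, the same kind of balanced draw can be sketched with `DataFrameGroupBy.sample` (available since pandas 1.1). This is an alternative shown on a hypothetical toy frame, not the call used above, and it may select different rows than the `apply`-based version even with the same seed.

```python
import pandas as pd

# Hypothetical imbalanced frame: 3 negative rows, 7 positive rows.
toy_df = pd.DataFrame(
    {
        "text": [f"post {i}" for i in range(10)],
        "label": [0] * 3 + [2] * 7,
    }
)

# Sampling without replacement needs n <= size of the smallest class.
n_per_class = 2
assert n_per_class <= toy_df["label"].value_counts().min()

# Draw the same number of rows from each label, then renumber the index.
balanced = (
    toy_df.groupby("label")
    .sample(n=n_per_class, random_state=42)
    .reset_index(drop=True)
)

print(balanced["label"].value_counts())
```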

## Saving the Dataset to Disk

In [21]:
sampled_df.to_csv("../data/stocktwits-crypto/st-data-mini.csv", index=False)

In [23]:
sampled_df.describe()

| | label | text_len |
|---|---|---|
| count | 100000.0 | 100000.0 |
| mean | 1.0 | 74.58398 |
| std | 1.000005 | 74.295816 |
| min | 0.0 | 1.0 |
| 25% | 0.0 | 32.0 |
| 50% | 1.0 | 52.0 |
| 75% | 2.0 | 88.0 |
| max | 2.0 | 993.0 |
