Applied random forests to classify sentiment of over 1M cryptocurrency-related messages on StockTwits posted between 28/11/2014 and 25/07/2020

dang-trung/stocktwits-sentiment-classifier

StockTwits Sentiment Classifier

An Application of Random Forest!

MIT License GitHub LinkedIn

Project Description

Introduction

  • Objective: Project for my internship at Research Center VERA, Ca' Foscari University of Venice.

  • Abstract: 2,045,322 cryptocurrency-related messages (~287 MB) were retrieved using the StockTwits API. The messages were posted between 28/11/2014 and 25/07/2020. Nearly half of them are labelled with a sentiment (Bullish or Bearish). A Random Forest model is trained on this labelled subset to classify the sentiment of messages about cryptocurrencies, reaching 74.75% prediction accuracy on the test set.

  • Status: Completed.
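
The pipeline described in the abstract (labelled messages, vectorized text, Random Forest) can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the toy messages are invented, `CountVectorizer` is an assumed choice of vectorizer, and the hyperparameters are the ones listed under Results, with `ntree` taken to mean scikit-learn's `n_estimators`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy labelled messages (invented for illustration, not taken from the dataset).
messages = [
    "btc to the moon, buying more",
    "eth breakout incoming, very bullish",
    "great news for bitcoin, going long",
    "loading up on ltc before the rally",
    "btc is crashing, sell everything",
    "eth looks weak, going short",
    "this coin is a scam, dumping it",
    "bear market ahead, stay out of crypto",
]
labels = ["Bullish"] * 4 + ["Bearish"] * 4

# Assumption: plain bag-of-words counts; the repo's actual vectorizer may differ.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Hyperparameters as reported under Results ("ntree" read as n_estimators).
model = RandomForestClassifier(
    n_estimators=500, max_depth=20, max_samples=0.75, random_state=0
)
model.fit(X, labels)

# Classify unseen messages with the same fitted vocabulary.
new_messages = ["buying more btc", "sell eth now"]
predictions = model.predict(vectorizer.transform(new_messages))
print(list(zip(new_messages, predictions)))
```

With so few training examples the predictions are not meaningful; the sketch only shows how the pieces fit together.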

Methods Used

Dependencies

  • Python 3
  • numpy==1.18.5
  • pandas==1.0.5
  • scikit-learn==0.23.2
  • requests==2.24.0

Table of Contents

Getting Started

How to Run

  1. Clone this repo: git clone https://github.com/dang-trung/stocktwits-sentiment-classifier

  2. Create your environment (virtualenv):
    virtualenv -p python3 venv
    source venv/bin/activate (bash) or venv\Scripts\activate (windows)
    (venv) cd stocktwits-sentiment-classifier
    (venv) pip install -e .

    Or (conda):
    conda env create -f environment.yml
    conda activate stocktwits-sentiment-classifier

  3. Run in terminal:
    python -m sentiment_classifier
    Note that due to API limits, downloading all 2M+ cryptocurrency-related messages posted on StockTwits from 2014 to 2020 takes several days.
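
To get a feel for why the download takes so long, here is a back-of-the-envelope estimate. The message count comes from the abstract; the page size (30 messages per request) and the rate limit (200 requests per hour) are assumptions made for illustration and should be checked against the current StockTwits API terms.

```python
import math


def estimate_download_days(total_messages, messages_per_request, requests_per_hour):
    """Rough lower bound on download time when requests are rate limited."""
    requests_needed = math.ceil(total_messages / messages_per_request)
    hours = requests_needed / requests_per_hour
    return hours / 24


# 2,045,322 is the message count from the abstract.
# 30 and 200 are assumed limits, not values confirmed by this README.
days = estimate_download_days(
    2_045_322, messages_per_request=30, requests_per_hour=200
)
print(f"about {days:.1f} days")
```

Under these assumptions the download is bounded by the rate limit rather than by bandwidth, so running the script on a faster connection would not shorten it.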

Data Storage

  1. Downloaded messages will be stored in data/01_raw.
  2. Processed messages (keeping only the information relevant to sentiment) will be stored in data/02_processed.
  3. Vectorized text messages are stored in data/03_vectorized. Since this file is small compared to the files generated in steps 1 and 2, it is already included in the repo.
  4. External files (crypto symbols and text-processing rules) are stored in data/04_external.

Results

  • Model parameters: ntree=500, max_depth=20, max_samples=0.75
  • Confusion matrix of training set

                         Actual Bearish   Actual Bullish
    Predicted Bearish        82,208            8,426
    Predicted Bullish         5,269           85,365

  • Confusion matrix of test set (~74.75% accuracy)

                         Actual Bearish   Actual Bullish
    Predicted Bearish        59,888           30,747
    Predicted Bullish       175,937          551,880
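
The reported accuracy can be checked directly from the confusion matrices above: accuracy is the sum of the diagonal (correct predictions) divided by the total count. The short sketch below does this in plain Python; the function name is illustrative and not part of the repository.

```python
def accuracy(matrix):
    """Accuracy of a 2x2 confusion matrix given as [[a, b], [c, d]].

    The diagonal (a, d) holds the correctly classified messages, so the
    result does not depend on whether rows are predicted or actual classes.
    """
    (a, b), (c, d) = matrix
    return (a + d) / (a + b + c + d)


# Counts copied from the confusion matrices in this section.
train_matrix = [[82_208, 8_426], [5_269, 85_365]]
test_matrix = [[59_888, 30_747], [175_937, 551_880]]

train_acc = accuracy(train_matrix)
test_acc = accuracy(test_matrix)
print(f"train accuracy: {train_acc:.2%}")
print(f"test accuracy:  {test_acc:.2%}")
```

The test-set figure matches the ~74.75% quoted above. Training accuracy is noticeably higher, which is the usual gap between fitted and held-out data.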

Read More

For a fuller description of the project, see the report.
