# Predicting Cryptocurrency Price Change with Twitter Sentiment Analysis

## 0.1 Background

This work builds on numerous academic papers that have found that tweet sentiment and volume are able to succesfully predict changes in cryptocurrency price (Stenqvist & Lonno, 2017, Coliani, Rosales & Signorotti, 2015). This work is unique in that it focuses on making predictions during shorter intervals than are typically examined by the majority of academic papers on this topic. Additionally, instead of only performing prediction modeling for Bitcoin, tweets related to Ethereum and Ethereum market data were collected alongside data pertaining to Bitcoin.

## 0.2 Intent

In this notebook previously collected data consisting of tweets related to Bitcoin and Ethereum as well as Bitcoin and Ethereum market data (including price and volume) are first cleaned and prepared for further use. Machine learning models are then instantiated and trained to make predictions about the direction of price changes using this data. Model accuracy is assessed using repeated k-fold cross validation. A variety of time intervals (time periods over which data is aggregated in order to predict cryptocurrency prices at the end of the next interval) are tested and results compared.

## 0.3 Data Collection

Data collection scripts are available in the 'data-gatherer' subdirectory of this repository.

Data collection scripts were loaded to a dedicated server to continual data collecting. Using the [tweepy API](https://www.tweepy.org/) two real-time streams of tweets containing "bitcoin" (for the bitcoin stream) or "ethereum" (for the ethereum stream) were instantiated. Tweet data, consisting of a timestamp and tweet text were saved to respective CSV files. Tweet streams were closed and re-opened every 15 minutes to prevent potential errors from prematurely ending data collection.

[Coinmarketcap API](https://coinmarketcap.com/api/) was accessed every 15 minutes, and Bitcoin and Ethereum price and 24hr volume collected and saved to respective CSV files.

Using these methods data was collected for a period of 19 days between 2/18/21 and 3/09/21. 

## 0.4 Description of Data

During the collection period 1,048,575 tweets containing 'bitcoin' were collected and 115,395 tweets containing 'ethereum'.

**Tweet Data**:

- Timestamp
- Text (converted to sentiment score in this notebook)
- Volume (calculated in this notebook)

**Cryptocurrency Market Data**:

- Timestamp
- Current Price (averaged over numerous exchanges)
- 24HR volume (averaged over numerous exchanges)


## 1. Loading and Cleaning Data

The data is loaded from CSV files into Pandas dataframes. Whitespace and links are removed from the tweet text. Tweets containing words that were observed to occur frequently in "spam tweets" (advertisements that are tweeted on a frequent basis) are removed from the dataset. Finally, duplicate tweets are removed due to being more likely to be an advertisement or a twitter bot providing numerical cryptocurrency information on a regular basis. A column containing the text without alphanumeric characters is used for the purpose of duplicate removal in order to remove tweets from accounts sharing numerical data.

In [1]:
import pandas as pd
import numpy as np

# Token: 'BTC' for bitcoin, 'ETH' for ethereum
def loadTweetData(token): 

    data = pd.read_csv('{}-tweets.csv'.format(token),parse_dates=['datetime'])
    return data

In [2]:
# Retreives the market data. Required input: interval in 15 minutes increments i.e. 1 hour = 4

def loadMarketData(token,sample):
    
    marketdata = pd.read_csv('{}-quotes.csv'.format(token),parse_dates=['datetime'],index_col = "datetime")
    marketdata = marketdata.resample(sample).first()

    return marketdata

In [3]:
import re 

def cleanData(data):
    
    text = data.text.values
    junk = np.zeros(len(text))
    
    filterwords = ['giveaway','latokens','want to experience','retweet']

    for i in range(len(text)):
        # Removes whitespace
        text[i] = text[i].strip()
        text[i] = ' '.join(text[i].split())
        # Removes ... from end of text if present
        if(text[i] != '' and text[i][-1] == '…'):
            text[i] = text[i][:-1]
        # Removes links
        text[i] = re.sub(r'http\S+', '', text[i])
        # Removes whitespace
        text[i] = text[i].strip()
        text[i] = ' '.join(text[i].split())
        
        for word in filterwords:
            if word in text[i].lower():
                junk[i] = 1.0
        
    # Replaces data text with whitespace-removed text    
        
    data['text'] = pd.DataFrame(text)
        
    # Adds "junk" vector to data
    data['junk'] = pd.DataFrame(junk)
        
    # Removes tweets where junk = 1.0
    data = data[data.junk == 0]
        
    data = data.drop('junk',axis=1)
    
    # Removes duplicates, first removing non-alphabetical characters
    
    text = data['text'].values
    
    regex = re.compile('[^a-zA-Z]')
    
    for i in range(len(text)):
        text[i] = (regex.sub(' ' ,text[i]))
        text[i] = ' '.join(text[i].split())
        text[i] = text[i].lower()
        
    data['alphatext'] = pd.DataFrame(text)
    
    data = data.drop_duplicates(subset='alphatext')
    
    # Reset index
    
    data = data.reset_index(drop=True)
        
    return data.drop(['alphatext'],axis=1)

## 2. Add Sentiment Scores

Sentiment scores are calculated using the [Vader Sentiment Intensity Analyzer](https://github.com/cjhutto/vaderSentiment). Scores for each tweet are calculated and added as a column to the data.

In [4]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

def addSentimentScores(data):

    analyzer = SentimentIntensityAnalyzer()
    
    text = data['text'].values
    
    sentiment = np.zeros(len(text))
    
    for i in range(len(text)):
        compound = analyzer.polarity_scores(text[i])['compound']
        sentiment[i] = compound
        
    data['sentiment'] = pd.DataFrame(sentiment)
    
    return data

## 3. Get Mean Tweet Sentiment, Tweet Volume Change, and Price Change

Twitter data is then aggregated over the selected timesample period (15 minutes, 2 hours, etc.). Sentiment scores of the tweets within each time interval are averaged. Tweet volume for the interval is calculated. In addition, the price and volume change during each interval, calculated as the difference between the price and volume values for the next timestamp and the values for the current timestamp) are added as columns. An increase in the price/volume during the interval is coded as 1 and decrease or no change as 0.

In [5]:
def signFunction(x):
    if(x)>0:
        return 1
    else:
        return 0

In [6]:
def getIntervalData(token,sample):
    
    data = loadTweetData(token)
    data = cleanData(data)
    data = addSentimentScores(data)
    
    prices = loadMarketData(token,sample)
    
    data = data.set_index('datetime')
    data = data.resample(sample).agg({'sentiment':'mean','text':'count'})
    data = data.join(prices)
    data = data.dropna()
    
    price_volume = data[['price','volume']].diff(periods=-1)
    price_volume = -price_volume
    
    
    price_volume['price'] = price_volume['price'].apply(signFunction)
    price_volume['volume'] = price_volume['volume'].apply(signFunction)
    
    data['price'] = price_volume['price']
    data['volume'] = price_volume['volume']
    
    data = data.dropna()
    
    return data

## 4. Predict Direction of Price Movement

Using mean sentiment score, tweet volume, and trading volume change as features and change in price as the targer, Logistic Regression and 3 machine learning models are instantiated. Performance is evaluated using repeated k-fold cross validation. 

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score

def predictPriceDirection(X,Y):

    scores = {}
    
    model = LogisticRegression()

    cv = RepeatedKFold(n_splits=10, n_repeats=3,random_state=1)
    score = cross_val_score(model,X,Y,cv=cv,n_jobs=1)
    score = sum(score)/len(score) 
    
    scores['logistic_regression'] = score

    model = BernoulliNB()

    cv = RepeatedKFold(n_splits=10, n_repeats=3,random_state=1)
    score = cross_val_score(model,X,Y,cv=cv,n_jobs=1)
    score = sum(score)/len(score) 

    scores['bernoulli_nb'] = score
    
    model = RandomForestClassifier()

    cv = RepeatedKFold(n_splits=10, n_repeats=3,random_state=1)
    score = cross_val_score(model,X,Y,cv=cv,n_jobs=1)
    score = sum(score)/len(score) 

    scores['random_forest'] = score

    model = AdaBoostClassifier()

    cv = RepeatedKFold(n_splits=10, n_repeats=3,random_state=1)
    score = cross_val_score(model,X,Y,cv=cv,n_jobs=1)
    score = sum(score)/len(score) 

    scores['adaboost'] = score
    
    return scores

## 5. Test Different Time Intervals

Model performance for 4 time intervals is collected and displayed.

In [8]:
# Test each of the specified intervals for each of the specified tokens

# Tokens: list of tokens to test, Intervals: List of intervals to test

def testIntervals(tokens,intervals):
    
    for t in tokens:
        for i in intervals:
            data = getIntervalData(t,i)
            X = data[['text','sentiment','volume']]
            Y = data['price'].values
            scores = predictPriceDirection(X,Y)
            
            print('''{} --- {} --- Logistic Regression: {} --- Bernoulli NB: {} --- Random Forest: {} --- ADABoost: {}
                    '''.format(t,i,scores['logistic_regression'],scores['bernoulli_nb'],scores['random_forest'],
                        scores['adaboost']))    

In [9]:
testIntervals(['BTC','ETH'],['15MIN','30MIN','1H','2H'])

BTC --- 15MIN --- Logistic Regression: 0.6662847363887756 --- Bernoulli NB: 0.6662847363887756 --- Random Forest: 0.5916700666394669 --- ADABoost: 0.6544029647762818
                    
BTC --- 30MIN --- Logistic Regression: 0.6134792521659824 --- Bernoulli NB: 0.6158093935248518 --- Random Forest: 0.5048791609667121 --- ADABoost: 0.5836251709986321
                    
BTC --- 1H --- Logistic Regression: 0.5612403100775192 --- Bernoulli NB: 0.5790697674418602 --- Random Forest: 0.5015503875968993 --- ADABoost: 0.5263565891472869
                    
BTC --- 2H --- Logistic Regression: 0.5463924963924963 --- Bernoulli NB: 0.5826118326118327 --- Random Forest: 0.5826118326118326 --- ADABoost: 0.5036075036075036
                    
ETH --- 15MIN --- Logistic Regression: 0.6576144816364482 --- Bernoulli NB: 0.6583118317061831 --- Random Forest: 0.5569979079497908 --- ADABoost: 0.6532932357043236
                    
ETH --- 30MIN --- Logistic Regression: 0.6300390266299358 --- Bernoulli

## 6. Conclusion

For both Bitcoin and Ethereum, the models performed best when predicting at 15 minute intervals, with decreasing accuracy after adding time to the interval. Logistic regression performed better than the machine learning models and resulted in ~67% prediction accuracy for Bitcoin and ~66% prediction accuracy for Ethereum with 15 minute interval data.

This may mean that price is more sensitive to short-term changes in tweet sentiment, tweet volume, and trade volume change, or alternatively that smaller time samples provided more training data and subsequently better models. These questions could be answered by more frequent collection of market data as well as by collecting data for a longer period of time.

## 7. Work Cited

1. Stenqvist, Evita, and Jacob Lönnö. "Predicting Bitcoin price fluctuation with Twitter sentiment analysis." (2017).
2. Colianni, Stuart, Stephanie Rosales, and Michael Signorotti. "Algorithmic trading of cryptocurrency based on Twitter sentiment analysis." CS229 Project (2015): 1-5.
