# Elon Musk Tweets Sentiment Analysis

I will attempt to write an algorithmic trading bot that goes long or short on TSLA stock depending on the sentiment of Elon Musk's tweets. To see whether such an algorithm would perform well, I will gather a [dataset](https://www.kaggle.com/ayhmrba/elon-musk-tweets-2010-2021) of his tweets from 2011-2021 and use it to backtest the strategy on QuantConnect.


In [1]:
# from google.colab import files
# from google.colab import drive
# drive.mount('/content/gdrive')
# files.upload()

# !cp kaggle.json ~/.kaggle/
# !chmod 600 /root/.kaggle/kaggle.json

# !kaggle datasets download -d ayhmrba/elon-musk-tweets-2010-2021
# !unzip /content/elon-musk-tweets-2010-2021.zip

In [2]:
# !pip install transformers

In [3]:
# import dependencies
import pandas as pd
import re

from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [4]:
df = pd.read_csv("D:/Code/QuantConnect/ElonMuskTweetSentimentAnalysis/data/2021.csv")
df

Unnamed: 0.1,Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,...,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,0,1373819373090050048,1373669212271566858,1.616379e+12,2021-03-22 02:10:37,0,,@bluemoondance74 @NASASpaceflight Going well. ...,en,[],...,,,,,,"[{'screen_name': 'bluemoondance74', 'name': 'R...",,,,
1,1,1373735946244431873,1373669212271566858,1.616359e+12,2021-03-21 20:39:07,0,,@NASASpaceflight Hopefully will happen this de...,en,[],...,,,,,,"[{'screen_name': 'NASASpaceflight', 'name': 'C...",,,,
2,2,1373555480870621188,1373328330041229312,1.616316e+12,2021-03-21 08:42:00,0,,@newscientist True,en,[],...,,,,,,"[{'screen_name': 'newscientist', 'name': 'New ...",,,,
3,3,1373507545315172357,1373263440391864323,1.616305e+12,2021-03-21 05:31:31,0,,@cleantechnica I am accumulating resources to ...,en,[],...,,,,,,"[{'screen_name': 'cleantechnica', 'name': 'Cle...",,,,
4,4,1373492611231535111,1373357995288051718,1.616301e+12,2021-03-21 04:32:11,0,,@CathieDWood When vast amounts of manufacturin...,en,[],...,,,,,,"[{'screen_name': 'CathieDWood', 'name': 'Cathi...",,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12333,12333,143171132814671872,143171132814671872,1.322970e+12,2011-12-04 03:33:52,0,,Am reading a great biography of Ben Franklin b...,en,[],...,,,,,,[],,,,
12334,12334,142881284019060736,142881284019060736,1.322901e+12,2011-12-03 08:22:07,0,,That was a total non sequitur btw,en,[],...,,,,,,[],,,,
12335,12335,142880871391838208,142880871391838208,1.322900e+12,2011-12-03 08:20:28,0,,"Great Voltaire quote, arguably better than Twa...",en,[],...,,,,,,[],,,,
12336,12336,142188458125963264,142188458125963264,1.322735e+12,2011-12-01 10:29:04,0,,I made the volume on the Model S http://t.co/...,en,[],...,,,,,,[],,,,


The only columns we need are 'date' and 'tweet'. We also need to reverse the row order for the algorithm's sake: the tweets should run from oldest to latest, not the other way around.

In [5]:
df = df[['date', 'tweet']]
df = df[::-1].reset_index(drop=True)
df

Unnamed: 0,date,tweet
0,2011-12-01 09:55:11,Went to Iceland on Sat to ride bumper cars on ...
1,2011-12-01 10:29:04,I made the volume on the Model S http://t.co/...
2,2011-12-03 08:20:28,"Great Voltaire quote, arguably better than Twa..."
3,2011-12-03 08:22:07,That was a total non sequitur btw
4,2011-12-04 03:33:52,Am reading a great biography of Ben Franklin b...
...,...,...
12333,2021-03-21 04:32:11,@CathieDWood When vast amounts of manufacturin...
12334,2021-03-21 05:31:31,@cleantechnica I am accumulating resources to ...
12335,2021-03-21 08:42:00,@newscientist True
12336,2021-03-21 20:39:07,@NASASpaceflight Hopefully will happen this de...


Much better. Let's look at a few random tweets.

In [6]:
print(df['tweet'][115])
print(df['tweet'][742])
print(df['tweet'][1215])
print(df['tweet'][9211])
print(df['tweet'][11501])

Happy bday to my old and dear friend @adeoressi! U do parties better than a rockstar. For Berlin ...  http://t.co/vJZkJU13
Two teams from Tesla aiming to set a cross-country EV speed record this week. Departing Fri from LA, arriving Sun in NY.
Jeff maybe unaware SpaceX suborbital VTOL flight began 2013. Orbital water landing 2014. Orbital land landing next.  https://t.co/S6WMRnEFY5
@CNN @GavinNewsom  https://t.co/OP6l8DBf7r
@mirojurcevic @TashaARK This is a misperception. SpaceX developed &amp; continues to use lidar for Dragon docking with @Space_Station.   Just pointless imo for self-driving. If you’re going to do active photon generation, use an occlusion penetrating wavelength, like precision radar at ~4mm.


First, a lot of Elon's tweets are about random things or about his other companies, so we will need to filter the tweets and decide for each one whether it should be used.

Second, Elon sometimes shares good news about Tesla, which may push the TSLA stock price up and give us a bit of alpha if we are quick enough.

Third, many of the tweets contain URLs, which could be detrimental to the sentiment analyzer, so we will have to remove them.

In [7]:
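# Well-known URL pattern; the last alternative matches percent-encoded bytes such as %20
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Replace every URL with a '{URL}' placeholder in one vectorized pass;
# this also avoids the chained assignment df["tweet"][i] = ..., which pandas warns about
df["tweet"] = df["tweet"].str.replace(url_pattern, '{URL}', regex=True)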

In [8]:
print(df['tweet'][115])
print(df['tweet'][742])
print(df['tweet'][1215])
print(df['tweet'][9211])
print(df['tweet'][11501])

Happy bday to my old and dear friend @adeoressi! U do parties better than a rockstar. For Berlin ...  {URL}
Two teams from Tesla aiming to set a cross-country EV speed record this week. Departing Fri from LA, arriving Sun in NY.
Jeff maybe unaware SpaceX suborbital VTOL flight began 2013. Orbital water landing 2014. Orbital land landing next.  {URL}
@CNN @GavinNewsom  {URL}
@mirojurcevic @TashaARK This is a misperception. SpaceX developed &amp; continues to use lidar for Dragon docking with @Space_Station.   Just pointless imo for self-driving. If you’re going to do active photon generation, use an occlusion penetrating wavelength, like precision radar at ~4mm.


Much better. I will save this new DataFrame as its own CSV for later.

In [9]:
df.to_csv("D:/Code/QuantConnect/ElonMuskTweetSentimentAnalysis/data/ElonMuskTweetsPreProcessed.csv", index=False)
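
The first point above, keeping only the tweets that are relevant to Tesla, is not implemented in this notebook. One simple way to approach it is a keyword filter. The sketch below is only an illustration: the keyword list and the names `TESLA_KEYWORDS`, `TESLA_PATTERN`, and `is_tesla_related` are my own assumptions, and a real filter would need tuning against the data.

```python
import re

# Hypothetical keyword list (an assumption); tune it against the real tweets.
TESLA_KEYWORDS = ["tesla", "tsla", "model s", "model 3", "model x", "model y",
                  "cybertruck", "gigafactory", "autopilot", "fsd"]

# Word-boundary match so short keywords do not fire inside longer words.
TESLA_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in TESLA_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def is_tesla_related(tweet: str) -> bool:
    """Return True if the tweet mentions any keyword in TESLA_KEYWORDS."""
    return TESLA_PATTERN.search(tweet) is not None

# Sample tweets taken from the ones printed earlier in this notebook.
samples = [
    "Two teams from Tesla aiming to set a cross-country EV speed record this week.",
    "Jeff maybe unaware SpaceX suborbital VTOL flight began 2013.",
    "@CNN @GavinNewsom  {URL}",
]
tesla_only = [t for t in samples if is_tesla_related(t)]
print(tesla_only)
```

On the DataFrame, the same function could be applied with `df[df["tweet"].apply(is_tesla_related)]` to keep only the matching rows.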

Now, since QuantConnect does not let us import the transformers library into its environment, we will have to run the sentiment analysis on the data beforehand and save the result as a new CSV that holds scores instead of tweets.

In [10]:
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')



The BERT sentiment analyzer works in two steps. First, you pass the sentence/passage/tweet into the tokenizer, which returns a tensor of token IDs. Then you pass those token IDs into the model, which returns 5 scores (logits), one per star rating, much like a 1 to 5 star review. The rating with the highest score is the most likely one. Let's see an example.

In [11]:
good_sentence = "Wow that is amazing, I cannot believe it! Incredible! Fantastic!"
bad_sentence = "That was the most horrible thing I have ever experienced. Terrible! Never again!"

# encode() returns a tensor of token IDs, including the special start and end tokens
good_tokens = tokenizer.encode(good_sentence, return_tensors="pt")
bad_tokens = tokenizer.encode(bad_sentence, return_tensors="pt")

print(good_tokens)
print(bad_tokens)

tensor([[  101, 94608, 10203, 10127, 39854,   117,   151, 25004, 22142, 10197,
           106, 81981,   106, 47088,   106,   102]])
tensor([[  101, 10203, 10140, 10103, 10889, 36129, 45795, 10301, 21973,   151,
         10574, 15765, 39183,   119, 50334,   106, 13362, 12590,   106,   102]])


In [12]:
good_results = model(good_tokens)
bad_results = model(bad_tokens)

print(good_results)
print(bad_results)

SequenceClassifierOutput(loss=None, logits=tensor([[-1.6161, -2.4433, -1.5049,  0.4739,  4.2215]],
       grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
SequenceClassifierOutput(loss=None, logits=tensor([[ 5.0263,  1.4021, -1.1469, -3.1746, -1.4358]],
       grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)


As we can see, the good sentence has by far its highest value in the last logit, which corresponds to 5 stars, while the bad sentence has its highest value in the first logit, which corresponds to 1 star!
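
To make the reading of the logits concrete, here is a small sketch that turns the two logit vectors printed above into a star rating and into probabilities. It uses plain Python rather than torch so it runs on its own; the helper names `softmax` and `star_rating` are my own, not part of the transformers API.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def star_rating(logits):
    """Return the 1-5 star class whose logit is largest."""
    return max(range(len(logits)), key=lambda i: logits[i]) + 1

# Logits copied from the model output above.
good_logits = [-1.6161, -2.4433, -1.5049, 0.4739, 4.2215]
bad_logits = [5.0263, 1.4021, -1.1469, -3.1746, -1.4358]

print(star_rating(good_logits), [round(p, 3) for p in softmax(good_logits)])
print(star_rating(bad_logits), [round(p, 3) for p in softmax(bad_logits)])
```

The star ratings come out as 5 and 1 respectively. The softmax probabilities are useful if a confidence value is wanted alongside the rating.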

Now let's try it with a tweet from Elon. I'll use one that shares good news about Tesla.

In [13]:
tweet = "The Model S beta endurance car just passed 150,000 miles on a single battery pack!"
tokens = tokenizer.encode(tweet, return_tensors="pt")
result = model(tokens)
print(result)

SequenceClassifierOutput(loss=None, logits=tensor([[-0.5236, -0.9150, -0.6577,  0.3347,  1.3500]],
       grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)


Although the scores aren't as clear-cut as in the good sentence example I provided, we can see that the analyzer still picks up that this is in fact good news and places the highest sentiment score on 5 stars!

Now let's write a new CSV that pairs each tweet's timestamp with a 'score' derived from the model's output for that tweet.

In [None]:
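import csv
import torch

input_file = "D:/Code/QuantConnect/ElonMuskTweetSentimentAnalysis/data/ElonMuskTweetsPreProcessed.csv"
output_file = "D:/Code/QuantConnect/ElonMuskTweetSentimentAnalysis/data/ElonMuskTweetsScored.csv"

with open(input_file, 'r', newline='', encoding='utf-8') as f_in, open(output_file, 'w', newline='', encoding='utf-8') as f_out:
    reader = csv.reader(f_in, delimiter=',')
    writer = csv.writer(f_out, delimiter=',')
    next(reader)  # skip the header row so it is not scored as a tweet
    writer.writerow(["date", "score"])
    for line in reader:
        time = line[0]
        tweet = line[1]
        # Truncate to BERT's 512-token input limit
        tokens = tokenizer.encode(tweet, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():  # inference only, no gradients needed
            logits = model(tokens).logits
        # Assumption: reduce the 5 logits to one 1-5 star rating (index of the largest logit + 1)
        score = int(torch.argmax(logits)) + 1
        writer.writerow([time, score])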
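
Once the scored CSV exists, the backtest needs a rule that turns scores into long or short decisions. The sketch below shows one possible rule, assuming each score has been reduced to a 1-5 star rating (for example, the index of the largest logit plus one). The function name `daily_signals`, the per-day averaging, and both thresholds are my own illustrative assumptions, not tuned values and not QuantConnect API.

```python
from collections import defaultdict
from statistics import mean

def daily_signals(rows, long_above=3.5, short_below=2.5):
    """Map (timestamp, star_score) rows to one signal per calendar day.

    Returns {date: +1 (long), -1 (short), or 0 (no position)}.
    The thresholds are illustrative assumptions, not tuned values.
    """
    by_day = defaultdict(list)
    for timestamp, score in rows:
        day = timestamp[:10]  # "YYYY-MM-DD" prefix of "YYYY-MM-DD HH:MM:SS"
        by_day[day].append(score)

    signals = {}
    for day, scores in by_day.items():
        avg = mean(scores)
        if avg > long_above:
            signals[day] = 1
        elif avg < short_below:
            signals[day] = -1
        else:
            signals[day] = 0
    return signals

# Made-up scores for illustration; timestamps follow the dataset's format.
rows = [
    ("2021-03-21 04:32:11", 5),
    ("2021-03-21 05:31:31", 4),
    ("2021-03-22 02:10:37", 1),
    ("2021-03-23 10:00:00", 3),
]
print(daily_signals(rows))
```

Averaging per day keeps a single noisy tweet from flipping the position, at the cost of reacting more slowly than a tweet-by-tweet rule would.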