# CryptoBERT

CryptoBERT is a pre-trained NLP model for analysing the language and sentiment of cryptocurrency-related social media posts and messages. In this notebook we use it to predict sentiment labels for the preprocessed news headlines file.

In [27]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [28]:
from transformers import TextClassificationPipeline, AutoModelForSequenceClassification, AutoTokenizer #Crypto pre-trained model
import pandas as pd #To read files
import csv #To write into csv file
import os #To write in a specific directory

The CryptoBERT pipeline has:

Input = a string or a list of strings, each to be classified as Bearish, Neutral or Bullish.
Output = a list of dictionaries, one per input string, each containing the predicted label and a score (the score is not used in this analysis).

In [29]:
#https://huggingface.co/ElKulako/cryptobert

model_name = "ElKulako/cryptobert"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3) # three possible labels: Bearish, Neutral, Bullish
cryptobert = TextClassificationPipeline(model=model, tokenizer=tokenizer, max_length=64, truncation=True, padding='max_length') # suggested settings

A few example strings to see how the model behaves.

In [30]:
post_1 = " see y'all tomorrow and can't wait to see ada in the morning, i wonder what price it is going to be at. 😎🐂🤠💯😴, bitcoin is looking good go for it and flash by that 45k. "
print(cryptobert(post_1))

post_2 = "SpiceJet to issue 6.4 crore warrants to promoters"
post_3 = "Genpact appoints Edward J Fitzpatrick as CFO"
post_4 = "SpiceJet to issue 6.4 crore warrants to promoters"
df_posts = [post_2, post_3, post_4]
preds = cryptobert(df_posts)

print(preds[0:3])
#print(preds[2]['score']) #To access individual elements

[{'label': 'Bullish', 'score': 0.8734585642814636}]
[{'label': 'Neutral', 'score': 0.9430785179138184}, {'label': 'Bullish', 'score': 0.5652675032615662}, {'label': 'Neutral', 'score': 0.9430785179138184}]


Testing the model on a dataset of news headlines with sentiment annotations. The data has been preprocessed using the CleanData.ipynb notebook.

In [31]:
news_path = '/content/drive/MyDrive/Turing Machine Learining/Sentiment_Analysis/News_headlines_tagged.csv'
df = pd.read_csv(news_path)
print(df.head(3))

headlines_news = list(df["Title"])
decisions = list(df["Decisions"])

                                               Title  Decisions
0  SpiceJet to issue 6.4 crore warrants to promoters          1
1                  MMTC Q2 net loss at Rs 10.4 crore          1
2  Mid-cap funds can deliver more, stay put: Experts          2


Predict labels for the headlines with the CryptoBERT model. The cell below runs for around 30 minutes on the full dataset. For a partial run, set num_pred to the desired number of predictions.

In [32]:
num_pred = len(headlines_news) #Change to the desired number of predictions
print("Testing sample is: ", num_pred)

preds = cryptobert(headlines_news[:num_pred])

Testing sample is:  5
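
Because the full run takes around 30 minutes, one option is to feed the headlines to the model in chunks so that progress is visible and results accumulate as the run proceeds. The cell below is only a sketch of that idea, not part of the original notebook: `predict_in_chunks` and `fake_classifier` are hypothetical names, and `fake_classifier` is a stand-in that mimics the output shape of the `cryptobert` pipeline so the example runs without loading the model.

```python
def predict_in_chunks(classifier, texts, chunk_size=256):
    """Yield (start_index, predictions) for consecutive slices of texts.

    classifier is any callable that maps a list of strings to a list of
    dictionaries, for example the cryptobert pipeline defined above.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        yield start, classifier(chunk)


def fake_classifier(batch):
    # Hypothetical stand-in for cryptobert: always answers 'Neutral'.
    return [{'label': 'Neutral', 'score': 1.0} for _ in batch]


# The first five headlines from the dataset preview above.
headlines = [
    "SpiceJet to issue 6.4 crore warrants to promoters",
    "MMTC Q2 net loss at Rs 10.4 crore",
    "Mid-cap funds can deliver more, stay put: Experts",
    "Mid caps now turn into market darlings",
    "Market seeing patience, if not conviction: Prakash Diwan",
]

all_preds = []
for start, chunk_preds in predict_in_chunks(fake_classifier, headlines, chunk_size=2):
    all_preds.extend(chunk_preds)
    print(f"Processed {len(all_preds)} of {len(headlines)} headlines")
```

To use it with the real model, pass `cryptobert` and `headlines_news[:num_pred]` instead of the stand-ins. The chunk size of 256 is an arbitrary default.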


Output the results

In [35]:
#A few examples of testing data
print("Number of headlines: ", num_pred)
for i in range(min(5, num_pred)): #Never index past the end of a short partial run
  print(headlines_news[i], end = ' -> ')
  print(decisions[i])


pred_labels = [d['label'] for d in preds] #Extract the list of labels from the list of dictionaries returned by CryptoBERT

print()
print("Predicted labels (original tags): \n", pred_labels[:5])

#Predicted labels are originally Bearish/Neutral/Bullish -> convert to 0/1/2 respectively
label_to_tag = {'Bearish': 0, 'Neutral': 1, 'Bullish': 2}
pred_labels_tagged = [label_to_tag.get(label, label) for label in pred_labels] #Unknown labels are kept unchanged
print("Predicted labels (adjusted tags): \n", pred_labels_tagged[:5])


Number of headlines:  5
SpiceJet to issue 6.4 crore warrants to promoters -> 1
MMTC Q2 net loss at Rs 10.4 crore -> 1
Mid-cap funds can deliver more, stay put: Experts -> 2
Mid caps now turn into market darlings -> 2
Market seeing patience, if not conviction: Prakash Diwan -> 1

Predicted labels (original tags): 
 ['Neutral', 'Neutral', 'Neutral', 'Neutral', 'Neutral']
Predicted labels (adjusted tags): 
 [1, 1, 1, 1, 1]
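
Since the headlines carry annotated tags in the same 0/1/2 scheme, the predicted tags can be compared with them directly. The cell below is a minimal sketch of such a comparison, assuming plain accuracy and simple pair counts are the quantities of interest; `accuracy` and `confusion_counts` are hypothetical helper names, and the two sample lists are the five annotated and predicted tags printed above.

```python
from collections import Counter


def accuracy(true_tags, predicted_tags):
    """Fraction of positions where the predicted tag equals the annotated tag."""
    if len(true_tags) != len(predicted_tags):
        raise ValueError("true_tags and predicted_tags must have the same length")
    if not true_tags:
        raise ValueError("cannot compute the accuracy of an empty sample")
    matches = sum(t == p for t, p in zip(true_tags, predicted_tags))
    return matches / len(true_tags)


def confusion_counts(true_tags, predicted_tags):
    """Count how often each (annotated tag, predicted tag) pair occurs."""
    return Counter(zip(true_tags, predicted_tags))


# Annotated and predicted tags of the five headlines shown above.
sample_decisions = [1, 1, 2, 2, 1]
sample_predictions = [1, 1, 1, 1, 1]

print("Accuracy:", accuracy(sample_decisions, sample_predictions))  # 0.6
print("Pair counts:", dict(confusion_counts(sample_decisions, sample_predictions)))
```

In the notebook itself the same functions would be called as `accuracy(decisions[:num_pred], pred_labels_tagged)`. Five headlines are far too few to say anything about the model's quality.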


Given a list of headlines and a list of labels predicted by the CryptoBERT model, write them into a new CSV file.

In [36]:
def write_lists_to_csv(list1, list2, directory, filename):
    """Write two parallel lists to a CSV file as the columns 'Headline' and 'Predicted label'."""
    # Combine directory path and filename
    filepath = os.path.join(directory, filename)
    with open(filepath, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Headline', 'Predicted label'])  # Write header
        # zip stops at the shorter list, so after a partial run only the predicted headlines are written
        for item1, item2 in zip(list1, list2):
            writer.writerow([item1, item2])

Write the results into a new file so that the data analysis does not depend on the running time of CryptoBERT.

In [26]:
directory = '/content/drive/MyDrive/Turing Machine Learining/Sentiment_Analysis/'
csv_filename = 'News_headlines_tagged_cryptobert_partial.csv' #For partial predictions change the name, so as not to overwrite the full predictions file
write_lists_to_csv(headlines_news, pred_labels_tagged, directory, csv_filename)
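
A later analysis can then load the saved predictions instead of rerunning the model. The cell below sketches one way to read such a file back. `read_predictions_csv` is a hypothetical helper, and the demonstration writes a small temporary file with the same two-column header rather than touching the Drive folder. It assumes UTF-8 encoding and that every stored label is one of the integers 0/1/2.

```python
import csv
import os
import tempfile


def read_predictions_csv(filepath):
    """Read a 'Headline' / 'Predicted label' CSV back into two parallel lists."""
    headlines, tags = [], []
    with open(filepath, newline='', encoding='utf-8') as csvfile:
        for row in csv.DictReader(csvfile):
            headlines.append(row['Headline'])
            tags.append(int(row['Predicted label']))  # Assumes labels were mapped to 0/1/2
    return headlines, tags


# Demonstration on a temporary file with the same layout as the file written above.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_predictions.csv')
    with open(demo_path, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Headline', 'Predicted label'])
        writer.writerow(['SpiceJet to issue 6.4 crore warrants to promoters', 1])
        writer.writerow(['MMTC Q2 net loss at Rs 10.4 crore', 1])
    headlines_back, tags_back = read_predictions_csv(demo_path)

print(headlines_back)
print(tags_back)
```

If a label was left unmapped (the conversion step keeps unknown labels as strings), `int(...)` raises a `ValueError`, which flags the problem early.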