# BERTweet word embedding extractor

by Björn, Harmen, Joris, Oscar

This notebook extracts the CLS embedding of each tweet and exports these embeddings so they can be used to train other models. It is meant to be run on [Google Colab](https://colab.research.google.com/) with GPU hardware acceleration enabled, which gives the best speed when working from a less powerful device.

To enable GPU hardware acceleration, go to 'Runtime' in the menu bar, choose 'Change runtime type', and select 'GPU' under 'Hardware accelerator'.


## Useful information:
Here are some additional resources that we found useful while making this notebook:

[Illustrated guide on how to use BERT](http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)

[BERTweet Git repo with example code](https://github.com/VinAIResearch/BERTweet#preprocess)

[Explanation of the \[CLS\] token](https://datascience.stackexchange.com/questions/66207/what-is-purpose-of-the-cls-token-and-why-is-its-encoding-output-important#:~:text=21-,%5Bcls%5D,-stands%20for%20classification)

[Hugging Face pipeline documentation](https://huggingface.co/docs/transformers/main_classes/pipelines)
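
To make the idea of "the CLS embedding of a tweet" concrete, here is a minimal sketch of where that vector sits in a feature-extraction output. It assumes the output for one tweet is a nested list shaped `[1][num_tokens][hidden_size]`. The `fake_output` values and the helper `first_token_embedding` are made up for illustration; real BERTweet vectors are much longer, and since BERTweet follows RoBERTa, the first position holds the `<s>` token that plays the role of `[CLS]`.

```python
# Hypothetical output for one tweet: 1 sequence, 3 tokens, 4-dimensional vectors.
fake_output = [[
    [0.1, 0.2, 0.3, 0.4],  # first token: the classification (CLS) position
    [1.0, 1.1, 1.2, 1.3],
    [2.0, 2.1, 2.2, 2.3],
]]


def first_token_embedding(output):
    """Return the vector of the first token of the first (only) sequence."""
    return output[0][0]


cls_embedding = first_token_embedding(fake_output)
print(cls_embedding)
```

Keeping only this one vector per tweet gives a fixed-size representation, regardless of how many tokens the tweet contains.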

In [None]:
!pip3 install datasets
!pip3 install nltk emoji==0.6.0
!pip3 install transformers

In [10]:
# get required files
import os
if not os.path.exists('./BERTweet/'):
  !git clone https://github.com/VinAIResearch/BERTweet/

if not os.path.exists('./Semeval2018-Task2-EmojiPrediction/'):
  !wget https://github.com/fvancesco/Semeval2018-Task2-Emoji-Detection/blob/master/dataset/Semeval2018-Task2-EmojiPrediction.zip?raw=true
  !unzip -q Semeval2018-Task2-EmojiPrediction.zip\?raw\=true
  !rm -r sample_data __MACOSX/ Semeval2018-Task2-EmojiPrediction.zip\?raw\=true

In [3]:
import datasets
import torch
import numpy as np
from transformers import pipeline
import pickle
import copy

from sklearn.pipeline import make_pipeline

import sys
sys.path.append("./BERTweet")
from TweetNormalizer import normalizeTweet

from pprint import pprint
from tqdm.notebook import tqdm

In [None]:
# Load dataset from Hugging Face

dataset = datasets.load_dataset('tweet_eval', 'emoji')

# pre-process the dataset (normalizeTweet is from BERTweet github)

def preprocess(tweet):
  """Uses the same method as BERTweet to pre-process the tweets
  Tweet is of format dict[str, str | int], and so is the output
  """
  tweet['text'] = normalizeTweet(tweet['text'])
  return tweet

tokenized_dataset = dataset.map(preprocess)


print('\nBefore pre-processing:')
pprint(dataset['train']['text'][:5])

print('\nAfter pre-processing:')
pprint(tokenized_dataset['train']['text'][:5])

print()
dataset
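
The cell above relies on `normalizeTweet` from the BERTweet repository. As a rough illustration of the kind of rewriting it performs, the sketch below uses a simplified stand-in. `toy_normalize` and `toy_preprocess` are hypothetical names, and the stand-in only masks user mentions and URLs; the real function does more, so treat this as an approximation rather than a replacement.

```python
def toy_normalize(text):
    """Simplified stand-in for normalizeTweet: mask mentions and URLs only."""
    tokens = []
    for token in text.split():
        lowered = token.lower()
        if token.startswith('@'):
            tokens.append('@USER')
        elif lowered.startswith('http') or lowered.startswith('www'):
            tokens.append('HTTPURL')
        else:
            tokens.append(token)
    return ' '.join(tokens)


def toy_preprocess(example):
    """Mirror preprocess(): rewrite the 'text' field and keep the other fields."""
    return {**example, 'text': toy_normalize(example['text'])}


# Made-up example row with the same fields as the dataset ('text' and 'label').
rows = [{'text': 'Sunset with @alice https://example.com', 'label': 3}]
print([toy_preprocess(row) for row in rows])
```

Unlike `preprocess` above, this sketch returns a new dict instead of modifying its argument, which keeps the original row available for comparison.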

In [None]:
# Check whether the test sets of Hugging Face and SemEval contain the same tweets

hugging_face_testset = dataset['test']['text']

with open('Semeval2018-Task2-EmojiPrediction/test/us_test.text', 'r') as inp:
  semeval_testset = [tweet.rstrip() for tweet in inp]

print('First 5 sentences of Hugging Face:')
pprint(hugging_face_testset[:5])
print(f'Length of Hugging Face testset: {len(hugging_face_testset)}')

print('--'*30)

print('First 5 sentences of SemEval:')
pprint(semeval_testset[:5])
print(f'Length of SemEval testset: {len(semeval_testset)}')

# compute amount of overlap
overlap_percentage = len(set(hugging_face_testset) & set(semeval_testset)) / len(hugging_face_testset) * 100
print(f'\nThe amount of overlapping tweets is {overlap_percentage:.2f}%')

# Conclusion: we cannot use the Hugging Face test set as extra training data,
# because the test set from SemEval (our final measure) is exactly the same.
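
The overlap check can be tried on toy data first. The sketch below is a slight variant of the computation above, not the same formula: it divides by the number of unique items rather than by the list length, so duplicated tweets do not lower the percentage. The function name `overlap_pct` and the example lists are made up for illustration.

```python
def overlap_pct(reference, other):
    """Percentage of unique items in `reference` that also occur in `other`."""
    reference_set = set(reference)
    if not reference_set:
        return 0.0
    return len(reference_set & set(other)) / len(reference_set) * 100


list_a = ['good morning', 'nice view', 'nice view', 'coffee time']
list_b = ['nice view', 'coffee time', 'late again']

# Two of the three unique items in list_a also appear in list_b.
print(f'{overlap_pct(list_a, list_b):.2f}%')
```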

In [None]:
# initialize pipeline
print(torch.cuda.current_device()) # index of the active GPU; pass this value as the device argument below

pipe = pipeline('feature-extraction', 'vinai/bertweet-large', device=0)
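
The cell above assumes a GPU is present, and `torch.cuda.current_device()` can fail when none is available. A minimal sketch of a fallback is shown below. It relies on the `transformers` pipeline convention that `device=-1` means CPU and a non-negative integer selects a GPU; the helper name `pick_pipeline_device` is hypothetical.

```python
def pick_pipeline_device(cuda_available):
    """Return 0 (first GPU) when CUDA is available, otherwise -1 (CPU)."""
    return 0 if cuda_available else -1


try:
    import torch
    cuda_available = torch.cuda.is_available()
except ImportError:
    # Assumption for this sketch: without torch installed, treat it as CPU-only.
    cuda_available = False

device = pick_pipeline_device(cuda_available)
print(f'Using device index {device}')
```

The resulting `device` could then be passed to `pipeline(..., device=device)`, although extracting embeddings for the full dataset on CPU would be far slower than the timings mentioned in this notebook.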

In [None]:
# use pipeline to extract CLS embeddings (this takes about 40 minutes, 20 per set)

train_cls = []
test_cls = []

train_tweets = tokenized_dataset['train']['text']
test_tweets = tokenized_dataset['test']['text']

print('Creating CLS embeddings for training tweets')
for tweet in tqdm(train_tweets):
  # pipe(tweet) is nested as [1][num_tokens][hidden_size]; [0][0] is the first (CLS) token
  train_cls.append(pipe(tweet)[0][0])


print('\nCreating CLS embeddings for test tweets')
for tweet in tqdm(test_tweets):
  test_cls.append(pipe(tweet)[0][0])
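
The loop above sends one tweet at a time to the pipeline. A possible alternative is to pass tweets in batches. The sketch below shows only the collection logic, under the assumption that the extractor, when given a list of tweets, returns one `[1][num_tokens][hidden_size]` nested list per tweet. `chunks`, `collect_cls` and `fake_extract` are hypothetical names, and `fake_extract` is a stand-in so the example runs without a model; check the real pipeline output before relying on this shape.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_cls(extract, tweets, batch_size=32):
    """Run `extract` on batches of tweets and keep each first-token vector."""
    cls_vectors = []
    for batch in chunks(tweets, batch_size):
        for output in extract(batch):
            cls_vectors.append(output[0][0])
    return cls_vectors


def fake_extract(batch):
    """Stand-in for the pipeline: one [1][2 tokens][2 dims] list per tweet."""
    return [[[[float(len(tweet)), 0.0], [1.0, 1.0]]] for tweet in batch]


demo = collect_cls(fake_extract, ['hi', 'hello', 'hey there'], batch_size=2)
print(demo)
```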

In [None]:
# write pickles to Google Drive

from google.colab import drive
drive.mount('/content/gdrive')

# create required directory
os.makedirs(os.path.dirname('/content/gdrive/My Drive/mlp/'), exist_ok=True)

# write files
with open('/content/gdrive/My Drive/mlp/train_cls.pickle', 'wb') as outp:
  pickle.dump(train_cls, outp)

with open('/content/gdrive/My Drive/mlp/test_cls.pickle', 'wb') as outp:
  pickle.dump(test_cls, outp)

# To prevent a runtime disconnection while downloading the files, we wrote them
# to Google Drive and downloaded them from there.

# Add these files to ./data/ in the repository.

In [13]:
# write pickles to local drive 
# (useful when running this notebook locally, or with lack of space on Drive)

# create required directory
os.makedirs(os.path.dirname('./data/'), exist_ok=True)

# write files
with open('./data/train_cls.pickle', 'wb') as outp:
  pickle.dump(train_cls, outp)

with open('./data/test_cls.pickle', 'wb') as outp:
  pickle.dump(test_cls, outp)
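
To use the exported embeddings elsewhere, the pickles have to be read back. Below is a minimal round-trip sketch that writes a small made-up list to a temporary directory and loads it again. The `embeddings` values are placeholders for `train_cls`, and the temporary path is used only so the example does not touch `./data/`.

```python
import os
import pickle
import tempfile

embeddings = [[0.1, 0.2], [0.3, 0.4]]  # made-up stand-in for train_cls

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train_cls.pickle')

    # Write the list in binary mode, as in the cells above.
    with open(path, 'wb') as outp:
        pickle.dump(embeddings, outp)

    # Read it back in binary mode.
    with open(path, 'rb') as inp:
        restored = pickle.load(inp)

print(restored == embeddings)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.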