Tweet Sentiment Analysis
To run experiments, parameters must be specified in hparams.yaml. The parameters used for the best-performing variant of each model are already specified. This project was built using Python 3.10.10.

Two important parameters relate to tokenization. The first, "use_existing_tokenizer", loads a tokenizer that has already been built. The second, "use_existing_tokens", reloads previously tokenized data. These options matter most for the CustomBPETokenizer, because both building this tokenizer and encoding text with it are extremely computationally expensive.

Note that no internal checks are performed to ensure that reloaded tokenizations conform to the model/tokenization parameters currently specified. For example, if the number of samples is changed, particularly if it is made smaller, "use_existing_tokenizer" and "use_existing_tokens" must both be False so that the tokenizer is rebuilt and no "peeking" at the test set occurs while building it.

Saved tokenizations of the data are tagged with the split (train/test/val) and with whether or not emojis have been removed. Once a CustomBPETokenizer has been built with a set number of samples, "use_existing_tokenizer" and "use_existing_tokens" can be set to True to reuse the existing tokenizer and tokens. The same tokenizer and/or tokenized data can be reused across all classifier models, provided the number of samples and the emoji setting are the same. Where necessary, padding is performed AFTER tokenization on a model-specific basis, so saved tokens are never padded.
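Because the project performs no such check itself, one way to guard against reusing a stale cache is to store the build parameters next to it and compare them before reloading. The following is only a sketch of that idea, not part of this repository: the function names, the metadata file name and the keys "n_samples" and "remove_emojis" are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_cache_meta(meta_path, params):
    """Record the parameters a tokenizer/token cache was built with."""
    Path(meta_path).write_text(json.dumps(params, sort_keys=True))


def cache_matches(meta_path, params):
    """Return True only if a metadata file exists and matches `params` exactly."""
    meta_path = Path(meta_path)
    if not meta_path.exists():
        return False
    return json.loads(meta_path.read_text()) == params


# Demo in a temporary directory so nothing in the project is touched.
with tempfile.TemporaryDirectory() as tmp:
    meta = Path(tmp) / "tokenizer_meta.json"  # hypothetical file name
    built_with = {"n_samples": 500_000, "remove_emojis": True}  # hypothetical keys
    write_cache_meta(meta, built_with)

    same_run = cache_matches(meta, {"n_samples": 500_000, "remove_emojis": True})
    fewer_samples = cache_matches(meta, {"n_samples": 100_000, "remove_emojis": True})

print(same_run, fewer_samples)  # True False
```

With a check like this, the two "use_existing_*" options would only be honoured when `cache_matches` returns True; otherwise the tokenizer and tokens would be rebuilt.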

Tokenization must be carried out on the first model training run. For further training, I would strongly recommend always passing "use_existing_tokenizer: True" for any model using the CustomBPETokenizer, to avoid rebuilding and overwriting an existing CustomBPETokenizer. Be warned that tokenizing the data is very time-consuming; there is definitely room for optimisation here. 500,000 samples are specified by default. If tokenization parameters are changed, existing test tokens under /saved_tokenizers should be manually deleted before "use_existing_tokens" is set to True. This ensures that test tokens computed with an old tokenizer are not reused during evaluation.
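The manual clean-up of old test tokens could be scripted along the following lines. This is a hedged sketch only: it assumes the saved test tokens can be recognised by "test" appearing in the file name, which the tagging described above suggests but does not guarantee, and the file names in the demo are invented. It defaults to a dry run so that nothing is deleted until the matches have been inspected.

```python
import tempfile
from pathlib import Path


def stale_test_tokens(directory, pattern="*test*", delete=False):
    """List (and optionally delete) files in `directory` whose names match `pattern`."""
    matches = sorted(p for p in Path(directory).glob(pattern) if p.is_file())
    if delete:
        for path in matches:
            path.unlink()
    return matches


# Demo on throwaway files; these file names are invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("tokens_train.pt", "tokens_val.pt", "tokens_test.pt"):
        (Path(tmp) / name).touch()

    found = [p.name for p in stale_test_tokens(tmp)]  # dry run: list only
    stale_test_tokens(tmp, delete=True)               # now actually delete
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(found)      # ['tokens_test.pt']
print(remaining)  # ['tokens_train.pt', 'tokens_val.pt']
```

In the project itself the call would point at the saved_tokenizers directory; check the dry-run listing first, since a pattern this broad could also match files that are not token caches.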

The dataset used is available at https://www.kaggle.com/datasets/kosweet/cleaned-emotion-extraction-dataset-from-twitter. It should be provided as "dataset(clean).csv" in the data directory. On the first run, "--write_data True" should be passed as a command-line argument. This specifies that the dataset needs to be split into train/val/test. Once this has been done, there is no need to pass "--write_data True" again unless the number of samples being used has changed.
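To make the role of the split concrete, here is a minimal sketch of a seeded train/val/test split over row indices. It is not the project's own splitting code. The 80/10/10 ratio is an assumption: it is chosen because, with the default 500,000 samples, it yields the 50,000 test examples reported as total support in the evaluation output in this document.

```python
import random


def split_indices(n_samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle row indices and cut them into disjoint train/val/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = round(n_samples * test_frac)
    n_val = round(n_samples * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train, val, test = split_indices(500_000)
print(len(train), len(val), len(test))  # 400000 50000 50000
```

Changing the number of samples changes which rows land in each split, which is why the data has to be rewritten (and the tokenizer rebuilt) whenever that parameter changes.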

Training is carried out by running "train.py <model_name>". The state dict of the best model as ranked by validation loss is saved.
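The "best model by validation loss" rule can be sketched as follows. This is an illustrative, framework-free version rather than the code in train.py: the class name is hypothetical, and a plain dictionary stands in for the model's state dict. The validation losses are the first eight values from the MLP training log shown later in this document.

```python
import copy


class BestCheckpoint:
    """Keep a copy of the state seen at the lowest validation loss so far."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_epoch = None
        self.best_state = None

    def update(self, epoch, val_loss, state_dict):
        """Store `state_dict` if `val_loss` improves on the best; return True if stored."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.best_state = copy.deepcopy(state_dict)  # snapshot, not a live reference
            return True
        return False


# Validation losses for MLP epochs 1-8, copied from the log in this document.
val_losses = [0.7287, 0.7169, 0.7027, 0.6982, 0.7018, 0.6923, 0.6936, 0.6928]

tracker = BestCheckpoint()
for epoch, loss in enumerate(val_losses, start=1):
    tracker.update(epoch, loss, {"epoch": epoch})  # dict stands in for a state dict

print(tracker.best_epoch, tracker.best_loss)  # 6 0.6923
```

Over these eight epochs the saved state would be the one from epoch 6; epochs 5, 7 and 8 do not overwrite it because their validation loss is higher than the best seen so far.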

Once a model is trained it can be evaluated using "eval.py <model_name>". This loads the best model state dict and evaluates it on the test set. This code also relies on the parameters specified in hparams.yaml. These params should not be changed between training and testing a model.

Also provided is the notebook "hyperparam_search.ipynb" that demonstrates the hyperparameter search completed using the Optuna package.

Once a decoder model has been built, tweets can be generated from it using the notebook "gen_tweets.py".

Below are two examples of training the MLP and the Encoder. Similar commands can be used to run experiments for any of the other models. Tokenization had already been completed for the models below at the time this was run. If re-running, tokens will be recomputed when the first training run is started. Although no model weights are provided, logs and figures of some training runs are under /logs and /figures.

In [None]:
! pip install -r requirements.txt

In [6]:
! python train.py MLP --write_data True

Building tokenizer...
Vocab size: 11,767
Building datasets...
model has : 446,435 parameters
Epoch 1,: Learning Rate: 0.001, Train Loss: 0.7159, Train Acc: 0.6888, Val Loss: 0.7287, Val Acc: 0.6777
Epoch 2,: Learning Rate: 0.00099726, Train Loss: 0.6991, Train Acc: 0.6987, Val Loss: 0.7169, Val Acc: 0.6855
Epoch 3,: Learning Rate: 0.00098907, Train Loss: 0.6838, Train Acc: 0.7042, Val Loss: 0.7027, Val Acc: 0.6897
Epoch 4,: Learning Rate: 0.00097553, Train Loss: 0.6796, Train Acc: 0.7087, Val Loss: 0.6982, Val Acc: 0.6957
Epoch 5,: Learning Rate: 0.00095677, Train Loss: 0.6802, Train Acc: 0.7046, Val Loss: 0.7018, Val Acc: 0.6917
Epoch 6,: Learning Rate: 0.00093301, Train Loss: 0.6701, Train Acc: 0.7094, Val Loss: 0.6923, Val Acc: 0.6958
Epoch 7,: Learning Rate: 0.00090451, Train Loss: 0.6731, Train Acc: 0.7086, Val Loss: 0.6936, Val Acc: 0.6951
Epoch 8,: Learning Rate: 0.00087157, Train Loss: 0.6709, Train Acc: 0.7127, Val Loss: 0.6928, Val Acc: 0.6976
Epoch 9,: Learning Rate: 0.00083
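The learning rates logged above are consistent with a cosine annealing schedule, lr(t) = eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * t / T_max)), with base_lr = 0.001, eta_min = 0 and T_max = 30. This is an inference from the logged numbers only; the scheduler and its settings are not stated in this document, so the parameter values below are assumptions. The sketch recomputes the first eight values and compares them with the log.

```python
import math


def cosine_lr(step, base_lr=0.001, t_max=30, eta_min=0.0):
    """Closed-form cosine annealing; t_max and eta_min are inferred, not confirmed."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))


# Learning rates logged for epochs 1-8 (scheduler steps 0-7).
logged = [0.001, 0.00099726, 0.00098907, 0.00097553,
          0.00095677, 0.00093301, 0.00090451, 0.00087157]

computed = [round(cosine_lr(step), 8) for step in range(len(logged))]
matches = all(abs(c - l) < 1e-8 for c, l in zip(computed, logged))

print(matches)  # True
```

Under the same assumptions the rate would reach zero at step 30, which would suggest a 30-epoch training budget.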



In [7]:
! python eval.py MLP

              precision    recall  f1-score   support

dissapointed      0.733     0.715     0.723     17061
       happy      0.741     0.714     0.727     16462
       angry      0.668     0.710     0.689     16477

    accuracy                          0.713     50000
   macro avg      0.714     0.713     0.713     50000
weighted avg      0.714     0.713     0.713     50000

Total Average Loss: 0.668
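The summary rows of the report above can be reproduced from the per-class rows, which is a useful sanity check when comparing models. The sketch below uses only the numbers printed in the report: the macro average is the unweighted mean of the per-class F1 scores, the weighted average weights each class by its support, and the accuracy equals the support-weighted mean of the per-class recall. The dictionary layout is just a convenient choice for this example.

```python
# Per-class values copied from the MLP report above (label spelling as printed).
report = {
    "dissapointed": {"recall": 0.715, "f1": 0.723, "support": 17061},
    "happy":        {"recall": 0.714, "f1": 0.727, "support": 16462},
    "angry":        {"recall": 0.710, "f1": 0.689, "support": 16477},
}

total = sum(row["support"] for row in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total

# Accuracy: correctly classified samples (recall * support per class) over all samples.
accuracy = sum(row["recall"] * row["support"] for row in report.values()) / total

print(total, round(macro_f1, 3), round(weighted_f1, 3), round(accuracy, 3))
# 50000 0.713 0.713 0.713
```

The three classes are almost balanced, so the macro and weighted averages agree to three decimals here; they would diverge on a skewed test set.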


In [8]:
! python train.py Encoder

Building tokenizer...
Vocab size: 11,767
Building datasets...
model has : 405,603 parameters
Epoch 1,: Learning Rate: 0.001, Train Loss: 0.7122, Train Acc: 0.6908, Val Loss: 0.7258, Val Acc: 0.6778
Epoch 2,: Learning Rate: 0.00099726, Train Loss: 0.6646, Train Acc: 0.7053, Val Loss: 0.6834, Val Acc: 0.6953
Epoch 3,: Learning Rate: 0.00098907, Train Loss: 0.6452, Train Acc: 0.7152, Val Loss: 0.6689, Val Acc: 0.6997
Epoch 4,: Learning Rate: 0.00097553, Train Loss: 0.6367, Train Acc: 0.7107, Val Loss: 0.6589, Val Acc: 0.6976
Epoch 5,: Learning Rate: 0.00095677, Train Loss: 0.6187, Train Acc: 0.7227, Val Loss: 0.6423, Val Acc: 0.7054
Epoch 6,: Learning Rate: 0.00093301, Train Loss: 0.6147, Train Acc: 0.7254, Val Loss: 0.6355, Val Acc: 0.7124
Epoch 7,: Learning Rate: 0.00090451, Train Loss: 0.6067, Train Acc: 0.7293, Val Loss: 0.633, Val Acc: 0.7128
Epoch 8,: Learning Rate: 0.00087157, Train Loss: 0.6042, Train Acc: 0.7294, Val Loss: 0.6313, Val Acc: 0.7131
Epoch 9,: Learning Rate: 0.000834



In [9]:
! python eval.py Encoder

              precision    recall  f1-score   support

dissapointed      0.773     0.724     0.748     17061
       happy      0.757     0.753     0.755     16462
       angry      0.689     0.738     0.713     16477

    accuracy                          0.738     50000
   macro avg      0.740     0.738     0.738     50000
weighted avg      0.740     0.738     0.739     50000

Total Average Loss: 0.589
