# Installs

In [1]:
!pip install sentence-transformers
!pip install transformers
!pip install datasets

Collecting sentence-transformers
  Downloading sentence-transformers-2.1.0.tar.gz (78 kB)
[K     |████████████████████████████████| 78 kB 5.3 MB/s 
[?25hCollecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
[K     |████████████████████████████████| 3.1 MB 32.5 MB/s 
[?25hCollecting tokenizers>=0.10.3
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 49.3 MB/s 
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
[K     |████████████████████████████████| 1.2 MB 71.8 MB/s 
[?25hCollecting huggingface-hub
  Downloading huggingface_hub-0.2.0-py3-none-any.whl (61 kB)
[K     |████████████████████████████████| 61 kB 717 kB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
[K     |█████████████████████

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Loading Data and Training

I chose to use transformers since they been have shown to be state of the art in many down stream tasks. Plus the availibility of pretrained multilingual models for transformers makes the choice more complelling and will facilitate the training/finetuning process. I first chose distillbert as it is a light weight Bert-like model which performs much faster than BERT and according to author provides more accuracy on some Benchmarks. XLM-Roberta also is faster and is more lightweight than BERT so I decided to test it as well. The models can be trained for more epochs, and could achieve higher accuracy, but due to time constraint and since the point of this challenge is just proof of concept I ran traning for only one epoch. distilbert and XLM-Roberta were trained in an unsupervised manner (masked word prediction) from crawled data. There exists also some open source models which were finetuned on the specific task of sentimenet analysis. One of the few models is called twitter-xlm-roberta. This is a multilingual model which was finetuned on twitter data for sentiment analysis. Usually twitter data will reflect more how people talk instead of for example wikipedia so it should capture the sentimenets with higher accuracy, plus it's already finetuned for this task. So we will try twitter-xlm-roberta as well.

In addition, there is one class which unassigned so I just drop all utterances which belong to this class.

In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer
from datasets import ClassLabel
from datasets import Dataset
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer
import numpy as np
from datasets import load_metric
import torch
import pandas as pd

labels = ClassLabel(num_classes=3, names=['negative', 'positive','neutral'])

def tokenize_train_function(examples):
    tokens = tokenizer(examples["content"], max_length= 512, padding="max_length", truncation=True)
    tokens['label'] = labels.str2int(examples['label'])
    return tokens

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels,average="weighted")

In [5]:
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer
import numpy as np
from datasets import load_metric



model_types = ["distilbert-base-multilingual-cased","xlm-roberta-base","cardiffnlp/twitter-xlm-roberta-base-sentiment"]
data_path = "/content/drive/MyDrive/Synthesio-test/sentiment-analysis-test/data/"
model_path = "/content/drive/MyDrive/Synthesio-test/sentiment-analysis-test/models/"
train_df = pd.read_csv(data_path + "train.csv")
train_df = train_df[train_df["sentiment"]!="unassigned"]
train_dataset = Dataset.from_pandas(train_df).rename_column("sentiment", "label").train_test_split(test_size=0.2)
metric = load_metric("f1")
scores = {}
runned_models = {}
for model_type in model_types:
  tokenizer = AutoTokenizer.from_pretrained(model_type)
  tokenized_train_datasets = train_dataset.map(tokenize_train_function, batched=True)

  # small_train_dataset = tokenized_train_datasets['train'].shuffle(seed=42).select(range(100))
  # small_eval_dataset = tokenized_train_datasets["test"].shuffle(seed=42).select(range(100))
  full_train_dataset = tokenized_train_datasets["train"]
  full_eval_dataset = tokenized_train_datasets["test"]

  model = AutoModelForSequenceClassification.from_pretrained(model_type, num_labels=3)
  training_args = TrainingArguments("test_trainer", num_train_epochs=1.0)
  trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=full_train_dataset,
      eval_dataset=full_eval_dataset,
      compute_metrics=compute_metrics
  )
  trainer.train()
  model.save_pretrained(save_directory=model_path + model_type)
  scores[model_type] = trainer.evaluate()

loading configuration file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/cf37a9dc282a679f121734d06f003625d14cfdaf55c14358c4c0b8e7e2b89ac9.7a727bd85e40715bec919a39cdd6f0aba27a8cd488f2d4e0f512448dcd02bf0f
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.12.5",
  "vocab_size": 119547
}

loading file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/28e5b750bf4f39cc620367720e105

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/5 [00:00<?, ?ba/s]

loading configuration file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/cf37a9dc282a679f121734d06f003625d14cfdaf55c14358c4c0b8e7e2b89ac9.7a727bd85e40715bec919a39cdd6f0aba27a8cd488f2d4e0f512448dcd02bf0f
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.12.5",
  "vocab_size": 119547
}

loading weights file 

Step,Training Loss
500,0.6924
1000,0.5792
1500,0.5173
2000,0.4901
2500,0.4616


Saving model checkpoint to test_trainer/checkpoint-500
Configuration saved in test_trainer/checkpoint-500/config.json
Model weights saved in test_trainer/checkpoint-500/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-1000
Configuration saved in test_trainer/checkpoint-1000/config.json
Model weights saved in test_trainer/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-1500
Configuration saved in test_trainer/checkpoint-1500/config.json
Model weights saved in test_trainer/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-2000
Configuration saved in test_trainer/checkpoint-2000/config.json
Model weights saved in test_trainer/checkpoint-2000/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-2500
Configuration saved in test_trainer/checkpoint-2500/config.json
Model weights saved in test_trainer/checkpoint-2500/pytorch_model.bin


Training completed. Do not forget to share your mod

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/xlm-roberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/87683eb92ea383b0475fecf99970e950a03c9ff5e51648d6eee56fb754612465.ab95cf27f9419a99cce4f19d09e655aba382a2bafe2fe26d0cc24c18cf1a1af6
Model config XLMRobertaConfig {
  "architectures": [
    "XLMRobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.12.5",
  "type_vocab_size": 1,
  "use_ca

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/5 [00:00<?, ?ba/s]

loading configuration file https://huggingface.co/xlm-roberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/87683eb92ea383b0475fecf99970e950a03c9ff5e51648d6eee56fb754612465.ab95cf27f9419a99cce4f19d09e655aba382a2bafe2fe26d0cc24c18cf1a1af6
Model config XLMRobertaConfig {
  "architectures": [
    "XLMRobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "trans

Step,Training Loss
500,1.1086
1000,1.1044
1500,1.1021
2000,1.1033
2500,1.1003


Saving model checkpoint to test_trainer/checkpoint-500
Configuration saved in test_trainer/checkpoint-500/config.json
Model weights saved in test_trainer/checkpoint-500/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-1000
Configuration saved in test_trainer/checkpoint-1000/config.json
Model weights saved in test_trainer/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-1500
Configuration saved in test_trainer/checkpoint-1500/config.json
Model weights saved in test_trainer/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-2000
Configuration saved in test_trainer/checkpoint-2000/config.json
Model weights saved in test_trainer/checkpoint-2000/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-2500
Configuration saved in test_trainer/checkpoint-2500/config.json
Model weights saved in test_trainer/checkpoint-2500/pytorch_model.bin


Training completed. Do not forget to share your mod

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/9628a03bf91a381b0f93e02e13ed34077a805ede6a568ad868817f87437a55ea.ea50decabb7db740257ca1cdefd63c25ffafb958ec595a0ff0c8dbac3f4b1ae6
Model config XLMRobertaConfig {
  "_name_or_path": "/home/jupyter/misc/tweeteval/TweetEval_models/xlm-twitter/local-twitter-xlm-roberta-base-sentiment/",
  "architectures": [
    "XLMRobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "Negative",
    "1": "Neutral",
    "2": "Positive"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/5 [00:00<?, ?ba/s]

loading configuration file https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/9628a03bf91a381b0f93e02e13ed34077a805ede6a568ad868817f87437a55ea.ea50decabb7db740257ca1cdefd63c25ffafb958ec595a0ff0c8dbac3f4b1ae6
Model config XLMRobertaConfig {
  "_name_or_path": "/home/jupyter/misc/tweeteval/TweetEval_models/xlm-twitter/local-twitter-xlm-roberta-base-sentiment/",
  "architectures": [
    "XLMRobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "Negative",
    "1": "Neutral",
    "2": "Positive"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "Negative": 0,
    "Neutral": 1,
    "Positive": 2
  },
  "layer_norm_eps": 1e-05,
  "max_pos

Step,Training Loss
500,0.5928
1000,0.4658
1500,0.4555
2000,0.4338
2500,0.3635


Saving model checkpoint to test_trainer/checkpoint-500
Configuration saved in test_trainer/checkpoint-500/config.json
Model weights saved in test_trainer/checkpoint-500/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-1000
Configuration saved in test_trainer/checkpoint-1000/config.json
Model weights saved in test_trainer/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-1500
Configuration saved in test_trainer/checkpoint-1500/config.json
Model weights saved in test_trainer/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-2000
Configuration saved in test_trainer/checkpoint-2000/config.json
Model weights saved in test_trainer/checkpoint-2000/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-2500
Configuration saved in test_trainer/checkpoint-2500/config.json
Model weights saved in test_trainer/checkpoint-2500/pytorch_model.bin


Training completed. Do not forget to share your mod

In [6]:
scores

{'cardiffnlp/twitter-xlm-roberta-base-sentiment': {'epoch': 1.0,
  'eval_f1': 0.8709104072682577,
  'eval_loss': 0.3887748420238495,
  'eval_runtime': 98.8391,
  'eval_samples_per_second': 50.587,
  'eval_steps_per_second': 6.323},
 'distilbert-base-multilingual-cased': {'epoch': 1.0,
  'eval_f1': 0.8195441519707125,
  'eval_loss': 0.4402909278869629,
  'eval_runtime': 51.0261,
  'eval_samples_per_second': 97.989,
  'eval_steps_per_second': 12.249},
 'xlm-roberta-base': {'epoch': 1.0,
  'eval_f1': 0.18528672278836217,
  'eval_loss': 1.0975550413131714,
  'eval_runtime': 98.6036,
  'eval_samples_per_second': 50.708,
  'eval_steps_per_second': 6.339}}

# Generate Predictions with Best model Found

The results on the validation dataset matched expectations and the model trained on twitter sentiment analysis achieved best f1 score, so I will use it to generate predictions.

In [9]:
from datasets import load_dataset
from transformers import AutoTokenizer
from datasets import ClassLabel
from datasets import Dataset
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer
import numpy as np
from datasets import load_metric
import pandas as pd

def tokenize_test_function(examples):
    tokens = tokenizer(examples["content"], padding="max_length", truncation=True)
    return tokens

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, max_length= 512, references=labels,average="weighted")

model_path = "/content/drive/MyDrive/Synthesio-test/sentiment-analysis-test/models/"
data_path = "/content/drive/MyDrive/Synthesio-test/sentiment-analysis-test/data/"
labels = ClassLabel(num_classes=3, names=['negative', 'positive','neutral'])
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
test_dataset = load_dataset('csv', data_files= data_path+'test.csv')
full_test_set = test_dataset.map(tokenize_test_function, batched=True)['train'].shuffle(seed=42)
model = AutoModelForSequenceClassification.from_pretrained(model_path+"cardiffnlp/twitter-xlm-roberta-base-sentiment", num_labels=3)
training_args = TrainingArguments("test_trainer")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=full_test_set,
    eval_dataset=full_test_set,
    compute_metrics=compute_metrics
)
predictions,_,_ = trainer.predict(full_test_set)
predictions = np.argmax(predictions, axis=1)
test_df = pd.read_csv(data_path + "test.csv")
test_df["sentiment"] = labels.int2str(predictions)
test_df.to_csv(data_path+"predictions.csv")

loading configuration file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/cf37a9dc282a679f121734d06f003625d14cfdaf55c14358c4c0b8e7e2b89ac9.7a727bd85e40715bec919a39cdd6f0aba27a8cd488f2d4e0f512448dcd02bf0f
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.12.5",
  "vocab_size": 119547
}

loading file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/28e5b750bf4f39cc620367720e105

  0%|          | 0/1 [00:00<?, ?it/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-f69c411cacf2f64b/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-6c1c285e0d0ea6a3.arrow
Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/csv/default-f69c411cacf2f64b/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-2001893b44499dc0.arrow
loading configuration file /content/drive/MyDrive/Synthesio-test/sentiment-analysis-test/models/cardiffnlp/twitter-xlm-roberta-base-sentiment/config.json
Model config XLMRobertaConfig {
  "_name_or_path": "cardiffnlp/twitter-xlm-roberta-base-sentiment",
  "architectures": [
    "XLMRobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "Negative",
    "1": "

# Generate Requirements.txt

In [11]:
!pip freeze > '/content/drive/MyDrive/Synthesio-test/sentiment-analysis-test/requirements.txt'