# DistilBERT on GoEmotions dataset

#### Download Dataset and label texts


In [1]:
!wget https://raw.githubusercontent.com/google-research/google-research/master/goemotions/data/train.tsv
!wget https://raw.githubusercontent.com/google-research/google-research/master/goemotions/data/dev.tsv
!wget https://raw.githubusercontent.com/google-research/google-research/master/goemotions/data/test.tsv
!wget https://raw.githubusercontent.com/google-research/google-research/master/goemotions/data/emotions.txt

--2021-04-22 02:25:58--  https://raw.githubusercontent.com/google-research/google-research/master/goemotions/data/train.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3519053 (3.4M) [text/plain]
Saving to: ‘train.tsv’


2021-04-22 02:25:58 (13.5 MB/s) - ‘train.tsv’ saved [3519053/3519053]

--2021-04-22 02:25:59--  https://raw.githubusercontent.com/google-research/google-research/master/goemotions/data/dev.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 439059 (429K) [text/plain]
Saving to: ‘dev.tsv’


2021-04-22 02:25:59 

#### Load dataset from tsv

In [9]:
import pandas as pd
import numpy as np

In [10]:
df_train = pd.read_csv("/content/train.tsv", sep='\t', header=None)
df_test = pd.read_csv("/content/test.tsv", sep='\t', header=None)
df_dev = pd.read_csv("/content/dev.tsv", sep='\t', header=None)

In [11]:
df_dev.head()

Unnamed: 0,0,1,2
0,Is this in New Orleans?? I really feel like th...,27,edgurhb
1,"You know the answer man, you are programmed to...",427,ee84bjg
2,I've never been this sad in my life!,25,edcu99z
3,The economy is heavily controlled and subsidiz...,427,edc32e2
4,He could have easily taken a real camera from ...,20,eepig6r
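
The three unnamed columns appear to be the comment text, a comma-separated list of label ids, and a comment id. A minimal standard-library sketch of that layout follows; the sample row is made up for illustration and is not taken from the dataset.

```python
import csv
import io

# Hypothetical row in the same three-column layout: text, label ids, comment id
sample_tsv = "Thanks, that really helped!\t15,0\tabc1234\n"

# QUOTE_NONE treats quote characters in the text as ordinary characters
reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [
    {"text": text, "labels": labels.split(","), "id": comment_id}
    for text, labels, comment_id in reader
]

print(rows[0]["labels"])
```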


#### Data Preprocessing

In [11]:
df_train = df_train.drop([2], axis=1)
df_test = df_test.drop([2], axis=1)
df_dev = df_dev.drop([2], axis=1)

In [13]:
def makeclass(x):
  # Labels are comma-separated ids (e.g. "4,27"); keep only the first one
  x = x.split(',')
  return x[0]

In [14]:
df_train[1] = df_train[1].map(makeclass)
df_test[1] = df_test[1].map(makeclass)
df_dev[1] = df_dev[1].map(makeclass)
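
`makeclass` keeps only the first id when a comment carries several labels, which turns the task into single-label classification. The sketch below shows that reduction next to a multi-hot alternative that keeps every label. `first_label` and `multi_hot` are hypothetical helper names, and the multi-hot variant is not what this notebook trains on.

```python
NUM_LABELS = 28  # GoEmotions label ids run from 0 to 27

def first_label(label_field):
    """Same reduction as makeclass: keep the first comma-separated id."""
    return label_field.split(",")[0]

def multi_hot(label_field, num_labels=NUM_LABELS):
    """Hypothetical alternative: mark every listed id in one vector."""
    vector = [0] * num_labels
    for label_id in label_field.split(","):
        vector[int(label_id)] = 1
    return vector

example = "4,27"  # a comment tagged with two label ids
print(first_label(example))
print(sum(multi_hot(example)))
```

Keeping every label would also call for a multi-label loss (for example a sigmoid-based binary cross-entropy) instead of the softmax loss used later.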

In [15]:
df_train.columns = ['text', 'labels']
df_test.columns = ['text', 'labels']
df_dev.columns = ['text', 'labels']

In [16]:
df_test

Unnamed: 0,text,labels
0,I’m really sorry about your situation :( Altho...,25
1,It's wonderful because it's awful. At not with.,0
2,"Kings fan here, good luck to you guys! Will be...",13
3,"I didn't know that, thank you for teaching me ...",15
4,They got bored from haunting earth for thousan...,27
...,...,...
5422,Thanks. I was diagnosed with BP 1 after the ho...,15
5423,Well that makes sense.,4
5424,Daddy issues [NAME],27
5425,So glad I discovered that subreddit a couple m...,0


In [17]:
# Map label ids (as strings) to emotion names

id2labels = {}
with open('/content/emotions.txt', 'r') as f:
  lines = f.readlines()
  lines = list(map(lambda x: x.strip("\n"), lines))

for i in range(28):
  id2labels[str(i)] = lines[i]

In [18]:
id2labels

{'0': 'admiration',
 '1': 'amusement',
 '10': 'disapproval',
 '11': 'disgust',
 '12': 'embarrassment',
 '13': 'excitement',
 '14': 'fear',
 '15': 'gratitude',
 '16': 'grief',
 '17': 'joy',
 '18': 'love',
 '19': 'nervousness',
 '2': 'anger',
 '20': 'optimism',
 '21': 'pride',
 '22': 'realization',
 '23': 'relief',
 '24': 'remorse',
 '25': 'sadness',
 '26': 'surprise',
 '27': 'neutral',
 '3': 'annoyance',
 '4': 'approval',
 '5': 'caring',
 '6': 'confusion',
 '7': 'curiosity',
 '8': 'desire',
 '9': 'disappointment'}

In [19]:
from sklearn.preprocessing import LabelBinarizer

In [20]:
onehot = LabelBinarizer()

In [21]:
train_onehot = onehot.fit_transform(df_train['labels'])
test_onehot = onehot.transform(df_test['labels'])
val_onehot = onehot.transform(df_dev['labels'])
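
Because the labels are strings, `LabelBinarizer` keeps its classes in sorted string order, so one-hot column `k` does not correspond to label id `k`. The pure-Python sketch below reproduces that ordering with `sorted`, on the assumption that it mirrors how scikit-learn orders the unique labels, to show which column each id lands in.

```python
# Label ids as strings, as they are after makeclass
label_ids = [str(i) for i in range(28)]

# Sorted string order: '0', '1', '10', '11', ..., '19', '2', '20', ...
classes = sorted(label_ids)
column_of = {label: index for index, label in enumerate(classes)}

print(column_of["27"])  # column that holds the 1 for label id '27'
print(column_of["2"])   # column that holds the 1 for label id '2'
```

Casting the labels to `int` before fitting would make column `k` line up with id `k`, but the `id2labels` lookup at inference would then need integer keys. As written, the notebook stays consistent because it decodes predictions with `inverse_transform`.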

In [20]:
df_train = pd.concat([df_train, pd.DataFrame(train_onehot)], axis=1)
df_test = pd.concat([df_test, pd.DataFrame(test_onehot)], axis=1)
df_dev = pd.concat([df_dev, pd.DataFrame(val_onehot)], axis=1)

In [21]:
df_train

Unnamed: 0,text,labels,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27
0,My favourite food is anything I didn't have to...,27,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,"Now if he does off himself, everyone will thin...",27,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,WHY THE FUCK IS BAYLESS ISOING,2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,To make her feel threatened,14,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Dirty Southern Wankers,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43405,Added you mate well I’ve just got the bow and ...,18,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
43406,Always thought that was funny but is it a refe...,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
43407,What are you talking about? Anything bad that ...,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
43408,"More like a baptism, with sexy results!",13,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [22]:
# Save the processed dataset for future use
df_train.to_csv("/content/train.csv")
df_test.to_csv("/content/test.csv")
df_dev.to_csv("/content/dev.csv")

#### Installing Dependencies

In [23]:
! pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Installing collected packages: sacremoses, tokenizers, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1


In [24]:
! pip install datasets

Collecting datasets
  Downloading https://files.pythonhosted.org/packages/da/d6/a3d2c55b940a7c556e88f5598b401990805fc0f0a28b2fc9870cf0b8c761/datasets-1.6.0-py3-none-any.whl (202kB)
Collecting xxhash
  Downloading https://files.pythonhosted.org/packages/7d/4f/0a862cad26aa2ed7a7cd87178cbbfa824fc1383e472d63596a0d018374e7/xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243kB)
Collecting huggingface-hub<0.1.0
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting fsspec
  Downloading https://files.pythonhosted.org/packages/e9/91/2ef649137816850fa4f4c97c6f2eabb1a79bf0aa2c8ed198e387e373455e/fsspec-2021.4.0-py3-none-any.whl (108kB)
Installing collected packages: xxhash, huggingface-hub, 

#### Model loading

In [25]:
import transformers
import datasets

In [26]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

#### Data preprocessing for model

In [30]:
train_encodings = tokenizer(df_train['text'].tolist(), truncation=True, padding=True)
val_encodings = tokenizer(df_dev['text'].tolist(), truncation=True, padding=True)
test_encodings = tokenizer(df_test['text'].tolist(), truncation=True, padding=True)
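
With `padding=True`, each call pads every sequence to the longest one in that list, and the attention mask marks real tokens with 1 and padding with 0. The toy sketch below illustrates the idea in plain Python. The token ids are made up, `pad_batch` is a hypothetical helper rather than part of the tokenizer API, and a pad id of 0 is assumed.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for two sequences of different lengths
batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 2307, 102]])
print(batch["input_ids"][0])
print(batch["attention_mask"][0])
```

Since the three splits are tokenized separately, each is padded to its own longest sequence, so their widths can differ; the training split ends up 316 tokens wide in this run.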

In [31]:
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    df_train.iloc[:,2:]
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    df_dev.iloc[:,2:]
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    df_test.iloc[:,2:]
))

In [None]:
train_dataset

<TensorSliceDataset shapes: ({input_ids: (316,), attention_mask: (316,)}, (28,)), types: ({input_ids: tf.int32, attention_mask: tf.int32}, tf.int64)>

#### Model Training

In [None]:
import tensorflow as tf
from transformers import TFDistilBertForSequenceClassification

# 28 output classes, one per GoEmotions label (27 emotions plus neutral)
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=28)

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
# Targets are one-hot vectors and the model returns raw logits
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"]) # can also use any keras loss fn
# The dataset is already batched, so batch_size is not passed to fit()
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3)
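
With `from_logits=True`, the loss applies a softmax to the logits and then takes the negative log-probability of the true class. The plain-Python sketch below works through that calculation for a single example. The logits are made up, and the function names are illustrative rather than Keras APIs.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy_from_logits(one_hot, logits):
    # Negative log-probability of the class marked in the one-hot target
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

logits = [2.0, 0.5, -1.0]  # made-up scores for three classes
target = [1, 0, 0]         # one-hot target: class 0 is correct
loss_value = categorical_crossentropy_from_logits(target, logits)
print(round(loss_value, 4))
```

The loss shrinks toward zero as the logit of the correct class grows relative to the others, which is what the optimizer pushes for during `fit`.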

In [None]:
model.save_pretrained('./pretrained')

In [None]:
tokenizer.save_pretrained("./pretrained")

('./pretrained/tokenizer_config.json',
 './pretrained/special_tokens_map.json',
 './pretrained/vocab.txt',
 './pretrained/added_tokens.json')

In [None]:
import pickle
with open('./pretrained/labelbin.pickle', 'wb') as f:
    pickle.dump(onehot, f)

### Inference

In [None]:
model1 = TFDistilBertForSequenceClassification.from_pretrained('./pretrained/')

Some layers from the model checkpoint at ./pretrained/ were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at ./pretrained/ and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
print(df_test.loc[0,'text'])
print(df_test.loc[0,'labels'])

I’m really sorry about your situation :( Although I love the names Sapphira, Cirilla, and Scarlett!
25


In [None]:
inputs = tokenizer("I’m really sorry about your situation :( Although I love the names Sapphira, Cirilla, and Scarlett!", return_tensors="tf")

In [None]:
output = model1(inputs)

In [None]:
output = tf.nn.softmax(output['logits'])
output.numpy()

In [25]:
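# One-hot vector with a 1 at the most probable class
out = np.zeros(28, dtype='int')
ind = np.argmax(output.numpy())
out[ind] = 1
# inverse_transform maps the one-hot row back to the original label id (a string)
finallabel = onehot.inverse_transform(out.reshape([1,28]))
print("Predicted emotion: " + id2labels[finallabel[0]])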

Predicted emotion: sadness
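
The decoding step can also be written without the one-hot round trip: take the argmax, look up the class at that position, then map the id to its name. The plain-Python sketch below assumes the class order is the sorted string order of the label ids. The logits are made up, and `decode` is a hypothetical helper.

```python
emotions = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]
id2labels = {str(i): name for i, name in enumerate(emotions)}

# Assumed to match the fitted binarizer's class order (sorted string ids)
classes = sorted(id2labels)

def decode(logits):
    """Return the emotion name for the highest-scoring output position."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2labels[classes[best]]

# Made-up logits whose largest value sits at output position 18
logits = [0.0] * 28
logits[18] = 5.0
print(decode(logits))
```

Softmax preserves the ordering of the logits, so the argmax can be taken on either the logits or the probabilities.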


### Download Trained Model

In [None]:
! zip -r /content/pretrained.zip /content/pretrained

  adding: content/pretrained/ (stored 0%)
  adding: content/pretrained/tokenizer_config.json (deflated 38%)
  adding: content/pretrained/tf_model.h5 (deflated 8%)
  adding: content/pretrained/labelbin.pickle (deflated 41%)
  adding: content/pretrained/config.json (deflated 65%)
  adding: content/pretrained/special_tokens_map.json (deflated 40%)
  adding: content/pretrained/vocab.txt (deflated 53%)
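
The same archive can be produced from Python with `shutil.make_archive`, which avoids shelling out. The minimal sketch below builds a throwaway directory so that it runs anywhere; the paths are placeholders rather than the notebook's `/content/pretrained`.

```python
import os
import shutil
import tempfile
import zipfile

# Throwaway stand-in for the saved model directory
work_dir = tempfile.mkdtemp()
model_dir = os.path.join(work_dir, "pretrained")
os.makedirs(model_dir)
with open(os.path.join(model_dir, "config.json"), "w") as f:
    f.write("{}")

# Creates <work_dir>/pretrained.zip containing the pretrained/ folder
archive_path = shutil.make_archive(
    os.path.join(work_dir, "pretrained"), "zip",
    root_dir=work_dir, base_dir="pretrained",
)

# List the archive contents to confirm what was packed
with zipfile.ZipFile(archive_path) as zf:
    names = zf.namelist()
print(names)
```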
