# Emotion Identification in Sentences
This model identifies the emotion expressed in a sentence written in English.

It was trained on the dair-ai/emotion dataset, which was extracted from Twitter messages. It identifies 6 emotions, each encoded as a label id: sadness - 0, joy - 1, love - 2, anger - 3, fear - 4, and surprise - 5.
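
The label ids above can be kept in a small lookup table so that numeric predictions map back to emotion names. This is a minimal sketch; the names `id2label` and `label2id` are introduced here for illustration and do not appear in the original notebook.

```python
# Label ids as listed above for the dair-ai/emotion dataset
id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

# Reverse lookup: emotion name -> label id
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[3])
print(label2id["joy"])
```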

## Step 1: Install dependencies and download dataset

In [3]:
#Dependencies
!pip install datasets transformers



In [4]:
from datasets import load_dataset

dataset = load_dataset("dair-ai/emotion")

Dataset emotion downloaded and prepared to /root/.cache/huggingface/datasets/dair-ai___emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd. Subsequent calls will reuse this data.



In [41]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2000
    })
})

## Step 2: Train our model


In [7]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

The base model is bert-base-cased from Hugging Face, which is pretrained on English text and therefore already has a good grasp of the language.


In [8]:
model_nm="bert-base-cased"

Load the tokenizer that matches the base model


In [9]:
tokenizer = AutoTokenizer.from_pretrained(model_nm);


Define a function that tokenizes the "text" field, which holds each sentence

In [10]:
def tokenizerFunc(x): return tokenizer(x["text"])

Tokenize our dataset

In [42]:
dataset_tk = dataset.map(tokenizerFunc, batched=True);
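
With batched=True, map calls the function on a dictionary of columns (lists of values) instead of one row at a time, and the lists it returns become new columns. The sketch below imitates that behaviour in plain Python. `toy_tokenize` and `map_batched` are hypothetical stand-ins written for illustration, not the real tokenizer or the datasets implementation, and the batch size of 1000 is assumed to match the library default.

```python
def toy_tokenize(batch):
    # Hypothetical stand-in for tokenizerFunc: whitespace split, not WordPiece
    return {"tokens": [text.split() for text in batch["text"]]}


def map_batched(columns, func, batch_size=1000):
    # Imitates Dataset.map(func, batched=True): call func on slices of every column
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in func(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Existing columns are kept; columns returned by func are added (or replaced)
    return {**columns, **new_columns}


rows = {"text": ["i feel great", "i am so scared"], "label": [1, 4]}
result = map_batched(rows, toy_tokenize, batch_size=1)
print(result["tokens"])
```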


Now let's train our model

In [12]:
from transformers import TrainingArguments,Trainer

In [13]:
bs = 128
epochs = 4

In [14]:
lr = 8e-5

In [18]:
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
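
To make the warmup_ratio and lr_scheduler_type settings concrete, the sketch below computes the learning rate that a linear warmup followed by a half-cosine decay would produce at a given step. It assumes training on a single device, so 16000 training rows at batch size 128 give 125 steps per epoch and 500 steps in total. `lr_at` is a hypothetical helper written for illustration, not a transformers function.

```python
import math

lr, bs, epochs, warmup_ratio = 8e-5, 128, 4, 0.1

steps_per_epoch = math.ceil(16000 / bs)               # optimizer steps per epoch
total_steps = steps_per_epoch * epochs                # steps over the whole run
warmup_steps = math.ceil(total_steps * warmup_ratio)  # steps spent warming up


def lr_at(step):
    # Linear warmup from 0 to lr, then half-cosine decay from lr down to 0
    if step < warmup_steps:
        return lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 25, 50, 275, 500):
    print(step, f"{lr_at(step):.2e}")
```

Under the same single-device assumption, the 500 total steps would coincide with the Trainer's default logging interval of 500 steps, which would be consistent with the training loss appearing only in the last epoch of the log further down.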

In [43]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=6)
trainer = Trainer(model, args, train_dataset=dataset_tk['train'], eval_dataset=dataset_tk['test'],
                  tokenizer=tokenizer);

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [44]:
trainer.train();

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 0.259161 |
| 2 | No log | 0.147063 |
| 3 | No log | 0.133407 |
| 4 | 0.266800 | 0.138947 |


In [60]:
preds = trainer.predict(dataset_tk['validation']).predictions.astype(float);
preds

array([[ 7.11328125, -2.52539062, -0.83447266, -1.25097656, -1.3984375 ,
        -1.41210938],
       [ 7.39453125, -2.15625   , -1.78320312, -0.75976562, -1.63378906,
        -1.09277344],
       [-2.07226562,  4.2421875 ,  4.86328125, -1.79589844, -2.70898438,
        -2.21484375],
       ...,
       [-1.85351562,  7.484375  , -0.91015625, -1.77929688, -1.99609375,
        -1.37304688],
       [-2.08007812,  5.62890625,  3.40234375, -1.53320312, -2.51171875,
        -2.32617188],
       [-1.83007812,  7.4375    , -1.01953125, -1.81542969, -1.94238281,
        -1.44628906]])

In [58]:
import numpy as np

# Clamp every score into the range [0, 5]
preds = np.clip(preds, 0, 5);
preds

array([[5.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [5.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.        , 4.2421875 , 4.86328125, 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 5.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.        , 5.        , 3.40234375, 0.        , 0.        ,
        0.        ],
       [0.        , 5.        , 0.        , 0.        , 0.        ,
        0.        ]])
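
Each row returned by trainer.predict holds six raw scores (logits), one per label id. A common way to read them is to take the index of the largest score as the predicted emotion, and to apply a softmax if probabilities are wanted. The sketch below does this in plain Python for three of the rows printed above, rounded to two decimals. This is one standard way of interpreting classifier logits rather than a step from the original notebook, and `id2label` and `softmax` are names introduced here.

```python
import math

id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

# Three logit rows copied (rounded) from the predictions printed above
logits = [
    [7.11, -2.53, -0.83, -1.25, -1.40, -1.41],
    [-2.07, 4.24, 4.86, -1.80, -2.71, -2.21],
    [-1.85, 7.48, -0.91, -1.78, -2.00, -1.37],
]


def softmax(row):
    # Subtract the max before exponentiating for numerical stability
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


predicted = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(row)), key=row.__getitem__)  # argmax over the six scores
    predicted.append(id2label[best])
    print(id2label[best], round(probs[best], 3))
```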

## Step 3: Saving our model

In [69]:
save_directory = "./emobot"
tokenizer.save_pretrained(save_directory);
trainer.save_model(save_directory);

Let's compress the saved model into a zip archive

In [71]:
import os
import zipfile

# Walk the saved model directory and store each file under its path relative to it
with zipfile.ZipFile("emobot.zip", 'w', zipfile.ZIP_DEFLATED) as zipf:
    for root, _, files in os.walk(save_directory):
        for file in files:
            file_path = os.path.join(root, file)
            zipf.write(file_path, os.path.relpath(file_path, save_directory))
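
As a quick sanity check, the archive can be reopened and its contents listed. The sketch below wraps the same walk-and-write loop in a helper and runs it on a temporary directory containing a placeholder file, so it works without a trained model. `zip_directory` and the placeholder file name config.json are illustrative assumptions.

```python
import os
import tempfile
import zipfile


def zip_directory(src_dir, zip_path):
    # Store every file under src_dir using its path relative to src_dir
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
        for root, _, files in os.walk(src_dir):
            for file in files:
                file_path = os.path.join(root, file)
                zipf.write(file_path, os.path.relpath(file_path, src_dir))


with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "emobot")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as fh:
        fh.write("{}")  # placeholder file standing in for the saved model

    zip_path = os.path.join(tmp, "emobot.zip")
    zip_directory(model_dir, zip_path)

    with zipfile.ZipFile(zip_path) as zipf:
        names = zipf.namelist()       # member paths stored in the archive
        bad_member = zipf.testzip()   # None when every member passes its CRC check

print(names, bad_member)
```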