
## Quantitative Text Analysis Lab Session: Week 12

### Topic: Fine-tuning Flair NLP with Irish Environmental Policies

-----

Instructors: Yen-Chieh Liao and Stefan Müller

Date: 22 April 2024

##### __Preprocessing Data__

All Packages

In [1]:
from flair.data import Sentence
from flair.datasets import SentenceDataset
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
from flair.embeddings import TransformerWordEmbeddings, TransformerDocumentEmbeddings
from flair.data import Corpus
from torch.utils.data import DataLoader
from datasets import load_dataset
import numpy as np
import torch
import flair
import random


Check if GPU is Available

In [2]:
# check if GPU is available
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps' 
else:
    device = 'cpu'

print('GPU Device:',device)

GPU Device: mps


Load the Formatted Dataset

In [3]:
from datasets import DatasetDict
dataset = DatasetDict.load_from_disk('irish_environmental_policies')
train, validation, test = dataset['train'], dataset['validation'], dataset['test']

Conversion of Rows to Labelled Sentences

In [4]:
def prepare_dataset(dataset):
    """
    Convert each row of a dataset into a labelled Flair Sentence.

    The function iterates over the dataset, builds a Sentence object from the
    'text' field, casts the 'label' field to a string (Flair expects string
    label values), attaches it under the label type 'label', and collects the
    sentences in a list.

    Args:
        dataset (iterable): A collection of rows, each with a 'text' and a 'label' field.

    Returns:
        list: A list of Sentence objects, each carrying the string-converted 'label'.
    """
    sentences = []
    for row in dataset:
        # Build a Sentence object from the raw text.
        sentence = Sentence(row['text'])

        # Cast the label to a string and attach it under the label type 'label'.
        sentence.add_label('label', str(row['label']))
        
        # Append the processed sentence to the list.
        sentences.append(sentence)
    return sentences

In [5]:
random.seed(42)
train_dataset = prepare_dataset(train)
validation_dataset = prepare_dataset(validation)
test_dataset = prepare_dataset(test)

# Optional: subsample each split for a quicker run (uncomment to use).
# train_dataset = random.sample(train_dataset, 800)
# validation_dataset = random.sample(validation_dataset, 200)
# test_dataset = random.sample(test_dataset, 200)

Creation of the Corpus and Label

In [6]:
corpus = Corpus(train=train_dataset, dev=validation_dataset, test=test_dataset)
label_dict = corpus.make_label_dictionary(label_type='label')

2024-04-19 17:38:47,041 Computing label dictionary. Progress:


0it [00:00, ?it/s]
1880it [00:00, 48979.70it/s]

2024-04-19 17:38:47,101 Dictionary created for label 'label' with 2 values: 0 (seen 1772 times), 1 (seen 108 times)





Check the Mapping from Label to Index

In [7]:
from collections import Counter

print("----- Inspecting Label Types and Frequency Distribution ------")
print("Label Check: {}".format(label_dict))

# Count how often each label value occurs across train, dev, and test.
label_counter = Counter()
for sentence in corpus.get_all_sentences():
    labels = sentence.get_labels('label')
    label_counter.update([label.value for label in labels])
for label, frequency in label_counter.items():
    print(f"Label '{label}': {frequency} times")

----- Inspecting Label Types and Frequency Distribution ------
Label Check: Dictionary with 2 tags: 0, 1
Label '0': 2937 times
Label '1': 197 times
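
The counts above show a strong class imbalance: label 1 makes up roughly 6% of all sentences. The sketch below quantifies this and derives inverse-frequency class weights. The weighting formula is a common heuristic and an assumption here, not something this lab applies; the `counts` dictionary is copied by hand from the output above.

```python
# Label counts copied from the output above (train + dev + test).
counts = {"0": 2937, "1": 197}

total = sum(counts.values())
n_classes = len(counts)

# Share of each class in the full corpus.
shares = {label: n / total for label, n in counts.items()}

# "Balanced" inverse-frequency weights: total / (n_classes * count).
# Assumption: a common heuristic, not part of the lab's pipeline.
class_weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label in counts:
    print(f"Label {label}: share={shares[label]:.3f}, weight={class_weights[label]:.2f}")
```

Weights of this kind could be handed to a classifier that supports per-class loss weights. Flair's `TextClassifier` appears to accept a `loss_weights` dictionary for this purpose, but check the documentation of your installed version before relying on it.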


In [8]:
print("------ Check the Mapping Order of idx2item and item2idx -------")
print("Check idx2item Mapping : {}".format(label_dict.idx2item))
print("Check item2idx Mapping : {}".format(label_dict.item2idx))

------ Check the Mapping Order of idx2item and item2idx -------
Check idx2item Mapping : [b'0', b'1']
Check item2idx Mapping : {b'0': 0, b'1': 1}
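
The mapping above stores its items as bytes (`b'0'`, `b'1'`), not as regular strings. A minimal sketch of how such a mapping can be decoded and inverted in plain Python is shown below. The list `idx2item` is copied by hand from the output, and the names `idx2label` and `label2idx` are hypothetical helpers introduced only for illustration.

```python
# Mapping copied from the output above; the items are stored as bytes.
idx2item = [b"0", b"1"]

# Decode each item to a regular string, keeping the index order.
idx2label = [item.decode("utf-8") for item in idx2item]

# Rebuild the reverse lookup from label string to index.
label2idx = {label: idx for idx, label in enumerate(idx2label)}

print(idx2label)
print(label2idx)
```

The index order matters because the classifier's output positions follow it: position 0 corresponds to label `0` and position 1 to label `1`.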


In [9]:
sbert_embeddings = TransformerDocumentEmbeddings('sentence-transformers/distiluse-base-multilingual-cased-v2', fine_tune=True)

In [10]:
sbert_classifier = TextClassifier(sbert_embeddings, 
                                  label_dictionary=label_dict,
                                  label_type='label')
# sbert_classifier = sbert_classifier.to(device)

Start Training

In [11]:
sbert_trainer = ModelTrainer(sbert_classifier, corpus)
sbert_trainer.train('qta_flair_python_model',
                    shuffle=True,
                    patience=3,
                    learning_rate=0.02,
                    mini_batch_size=16,
                    write_weights=True,
                    max_epochs=3)

2024-04-19 17:38:51,829 ----------------------------------------------------------------------------------------------------
2024-04-19 17:38:51,830 Model: "TextClassifier(
  (embeddings): TransformerDocumentEmbeddings(
    (model): DistilBertModel(
      (embeddings): Embeddings(
        (word_embeddings): Embedding(119548, 768)
        (position_embeddings): Embedding(512, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (transformer): Transformer(
        (layer): ModuleList(
          (0-5): 6 x TransformerBlock(
            (attention): MultiHeadSelfAttention(
              (dropout): Dropout(p=0.1, inplace=False)
              (q_lin): Linear(in_features=768, out_features=768, bias=True)
              (k_lin): Linear(in_features=768, out_features=768, bias=True)
              (v_lin): Linear(in_features=768, out_features=768, bias=True)
              (out_lin): Linear(in_features=768, ou

100%|██████████| 10/10 [00:10<00:00,  1.10s/it]

2024-04-19 17:40:42,337 DEV : loss 0.15755552053451538 - f1-score (micro avg)  0.9522
2024-04-19 17:40:42,350  - 0 epochs without improvement
2024-04-19 17:40:42,363 saving best model





2024-04-19 17:40:43,763 ----------------------------------------------------------------------------------------------------
2024-04-19 17:40:54,457 epoch 2 - iter 11/118 - loss 0.12667524 - time (sec): 10.69 - samples/sec: 16.46 - lr: 0.020000 - momentum: 0.000000
2024-04-19 17:41:02,568 epoch 2 - iter 22/118 - loss 0.08819670 - time (sec): 18.80 - samples/sec: 18.72 - lr: 0.020000 - momentum: 0.000000
2024-04-19 17:41:11,563 epoch 2 - iter 33/118 - loss 0.10831859 - time (sec): 27.80 - samples/sec: 18.99 - lr: 0.020000 - momentum: 0.000000
2024-04-19 17:41:21,261 epoch 2 - iter 44/118 - loss 0.11948033 - time (sec): 37.50 - samples/sec: 18.78 - lr: 0.020000 - momentum: 0.000000
2024-04-19 17:41:30,470 epoch 2 - iter 55/118 - loss 0.13686667 - time (sec): 46.71 - samples/sec: 18.84 - lr: 0.020000 - momentum: 0.000000
2024-04-19 17:41:38,939 epoch 2 - iter 66/118 - loss 0.12287068 - time (sec): 55.17 - samples/sec: 19.14 - lr: 0.020000 - momentum: 0.000000
2024-04-19 17:41:48,112 epoch

100%|██████████| 10/10 [00:00<00:00, 69.73it/s]

2024-04-19 17:42:23,658 DEV : loss 0.15862014889717102 - f1-score (micro avg)  0.9522
2024-04-19 17:42:23,669  - 1 epochs without improvement





2024-04-19 17:42:23,673 ----------------------------------------------------------------------------------------------------
2024-04-19 17:42:34,298 epoch 3 - iter 11/118 - loss 0.12860246 - time (sec): 10.62 - samples/sec: 16.57 - lr: 0.020000 - momentum: 0.000000
2024-04-19 17:42:44,343 epoch 3 - iter 22/118 - loss 0.11649046 - time (sec): 20.67 - samples/sec: 17.03 - lr: 0.020000 - momentum: 0.000000
2024-04-19 17:42:53,604 epoch 3 - iter 33/118 - loss 0.08789279 - time (sec): 29.93 - samples/sec: 17.64 - lr: 0.020000 - momentum: 0.000000
2024-04-19 17:43:02,535 epoch 3 - iter 44/118 - loss 0.10416080 - time (sec): 38.86 - samples/sec: 18.12 - lr: 0.020000 - momentum: 0.000000
2024-04-19 17:43:12,201 epoch 3 - iter 55/118 - loss 0.10782092 - time (sec): 48.53 - samples/sec: 18.13 - lr: 0.020000 - momentum: 0.000000
2024-04-19 17:43:22,118 epoch 3 - iter 66/118 - loss 0.11130539 - time (sec): 58.44 - samples/sec: 18.07 - lr: 0.020000 - momentum: 0.000000
2024-04-19 17:43:30,934 epoch

100%|██████████| 10/10 [00:00<00:00, 152.41it/s]

2024-04-19 17:44:02,252 DEV : loss 0.16160906851291656 - f1-score (micro avg)  0.9522
2024-04-19 17:44:02,260  - 2 epochs without improvement





2024-04-19 17:44:03,500 ----------------------------------------------------------------------------------------------------
2024-04-19 17:44:03,501 Loading model from best epoch ...


100%|██████████| 10/10 [00:09<00:00,  1.06it/s]

2024-04-19 17:44:17,409 
Results:
- F-score (micro) 0.9442
- F-score (macro) 0.7075
- Accuracy 0.9442

By class:
              precision    recall  f1-score   support

           0     0.9507    0.9914    0.9706       583
           1     0.7368    0.3182    0.4444        44

    accuracy                         0.9442       627
   macro avg     0.8438    0.6548    0.7075       627
weighted avg     0.9357    0.9442    0.9337       627

2024-04-19 17:44:17,409 ----------------------------------------------------------------------------------------------------





{'test_score': 0.9441786283891547}
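
The report above combines a high micro F-score (0.9442) with a much lower macro F-score (0.7075), because recall for label 1 is only 0.3182. The sketch below recomputes these figures from raw counts and compares accuracy with a baseline that always predicts label 0. The confusion counts are an assumption: they are inferred from the reported support, precision, and recall, not taken from the model's actual predictions.

```python
# Confusion counts for label 1, inferred from the report above (assumption):
# support 44 with recall 0.3182 -> 14 hits; precision 0.7368 -> 19 predicted.
tp, fp, fn, tn = 14, 5, 30, 578


def f1(tp, fp, fn):
    """F1 score from raw counts."""
    return 2 * tp / (2 * tp + fp + fn)


f1_pos = f1(tp, fp, fn)   # label 1
f1_neg = f1(tn, fn, fp)   # label 0: the roles of the two classes swap
macro_f1 = (f1_pos + f1_neg) / 2
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Baseline that always predicts the majority label 0.
majority_accuracy = (tn + fp) / (tp + fp + fn + tn)

print(f"F1 label 1:        {f1_pos:.4f}")
print(f"F1 label 0:        {f1_neg:.4f}")
print(f"Macro F1:          {macro_f1:.4f}")
print(f"Accuracy:          {accuracy:.4f}")
print(f"Majority baseline: {majority_accuracy:.4f}")
```

For single-label classification, micro F1 equals accuracy, which is why both appear as 0.9442 in the report. Since the majority baseline already reaches about 0.93, the macro F-score and the per-class recall are the more informative figures for this imbalanced dataset.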