<a href="https://colab.research.google.com/github/benoitmialet/Statistical-and-data-analysis-using-R-/blob/main/Copie_de_lab_nlp_01_understanding_encoder_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LAB NLP 01 Understanding encoder models

In [None]:
# cd ..

/app


## Intro

In this lab session we are going to use our first transformer model!

Before using any model for inference or training, you need to understand its basic structure and how it works. That is the objective of this lab. In the next one, we will work on Natural Language Processing tasks, where transformer models have shown outstanding results for several years.

We will focus on Encoder-only models in this first lab. Then, we will tackle Encoder-Decoder models and Decoder-only models in the next labs.

Throughout the labs, we will make heavy use of the Hugging Face Hub and Hugging Face libraries to try out different transformer models.
* **Hugging Face Hub** is a huge repository of Open Source models: https://huggingface.co/models. Anyone can upload a model, with public or private access. A search bar lets you add filters to make model exploration easier: model tasks, languages (NLP), licenses, etc.
* **Hugging Face libraries** offer an all-in-one implementation that lets you use any model from the Hub, provided your hardware can handle it: https://huggingface.co/docs. With a fairly simple syntax, you will be able to perform most NLP tasks. We will essentially use the Transformers and Datasets libraries for this purpose.

Note: if you face an "out of memory" error at any moment, just restart the notebook kernel. It is also recommended to restart the kernel each time you load another model, to avoid a memory crash. If a memory crash occurs, your virtual system could freeze. In this case, if you work locally, stop your container in Docker Desktop then restart it; if you are online, simply reset your Google Colab runtime.

## Discovering BERT

The best example to begin with NLP encoder models is BERT (Google, 2018).
As a reminder, the BERT encoder was pre-trained on a masked language modeling (MLM) task in English, which makes it good at feature extraction, that is to say at extracting the semantic meaning of a text and the information it contains (in English).


The first thing to do before trying to use a transformer model is to understand how it works. Go to the BERT model page and read the model card: https://huggingface.co/google-bert/bert-base-uncased/. See how well documented this one is!


Next, you have to check the main model files on the "Files and versions" tab and understand their purpose. **To use NLP models, you will basically need these files:**

* **config.json** shows model parameters configuration.

* **tokenizer_config.json** indicates the maximum length of the inputs you can feed the model with. It's one of the most important pieces of information. So, keep in mind "**512** tokens".

* **tokenizer.json** is the most important file. It contains the whole vocabulary and the mapping between tokens and their ids. It also contains all the special tokens. Read them carefully.

* **vocab.txt** references all the tokens of the vocabulary.

* At least one model file. Most of the time, several model file types are available in the files folder. Each of them contains all the weights of the model. You don't need to download all of them: you only have to pick one. The most common types are:
  * **pytorch_model.bin** is the standard PyTorch binary format (pickle-based). Many model repositories contain this version.
  * **model.safetensors** is a format created by Hugging Face. It is faster to load, especially on CPU. On the Hub, clicking the little icon next to the file also displays all the model layers.



## Loading the model on device

All methods that are used in the lab are in the next cell. If you lose one, reload the cell.

In [None]:
from transformers import (
    AutoConfig, BertConfig,
    AutoTokenizer,
    AutoModel, AutoModelForMaskedLM, BertForMaskedLM,
    pipeline
)
import torch
import os
import pandas as pd

# model_path = "homedata/models/llm_encoders/distilbert-base-uncased"
# model_path = "homedata/models/llm_encoders/camembert-base/"
# model_path = "homedata/models/llm_encoders/xlm-roberta-base"

# this line of code will be useful in any of your projects, to check GPU availability
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

The `AutoModel` class can load any model if a model name or path is provided. It automatically identifies the model type and loads it, thanks to the `from_pretrained()` method. This method has plenty of optional parameters that can be found in the documentation, or with `help(AutoModel.from_pretrained)`.

In [None]:
help(AutoModel.from_pretrained)

There are plenty of methods to load one specific type of model. However, most of the time, AutoModel high-level class works perfectly to load any model from the hub.

There are two options to load files from the Hub:
1. Download them locally using `git clone`, `wget`... and provide the directory path to the import method.
2. Provide the `model_id` to the same import method. `model_id` is simply the `author_account/model_name` reference displayed on the hub (*e.g.* "google-bert/bert-base-uncased"). The model will be downloaded in a cache directory, then loaded in RAM or vRAM.

The `AutoModel.from_pretrained()` method can be used like this:
* `use_safetensors=True` loads the model from safetensors files. Other weight formats (e.g. TensorFlow checkpoints) can also be loaded with other parameters.
* the `to()` method moves the model to a specific device (CPU, GPU). Be careful: **model and input data MUST be on the same device**. The tokenizer itself stays on CPU.

In [None]:
model_id = "google-bert/bert-base-uncased"
bert_model = AutoModel.from_pretrained(model_id, use_safetensors=True).to(device)

WARNING: during model loading, if a warning message states that some layers were randomly initialized, the checkpoint does not contain weights for those layers, often because the loading class does not match the architecture the checkpoint was saved with. Unless you do it on purpose (for instance to initialize a new head and train it), you will need to use a more specific import class.

Looking at the BERT architecture, we observe:
* The **embedding** part, which outputs token embeddings.
* The **"body"**, which updates token embeddings (feature extraction), with 12 blocks ("layers"), each containing several attention heads.
* The **"head"**, which in this case pools the token embeddings into one single vector, with only one pooler layer.

Note: the model body can be seen as a drill, and the head as a drill bit. For the same body, a wide variety of heads can be trained. **Body does the feature extraction and head does the model task**. The task can be pooling, pooling + classification, or anything else.

In [None]:
bert_model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

More details, like the number of attention heads per encoder block, can be found by loading the model config, or directly by checking the `config.json` file.
`AutoConfig` is a generalist loader like `AutoModel` or `AutoTokenizer` and can detect the model type. `BertConfig` is a specialist loader that will work only for BERT architecture models. Both `AutoConfig` and `BertConfig` will return the same object with our model.

In [None]:
from transformers import AutoConfig, BertConfig
config = AutoConfig.from_pretrained(model_id)
print(config)

BertConfig {
  "_name_or_path": "google-bert/bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.42.4",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
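A few derived quantities help to read this configuration. The sketch below is only an illustration: it copies some values from the printed config into a small JSON string (the string itself is an assumption made for the example, not a file from the Hub) and computes the size of each attention head and of the word embedding matrix.

```python
import json

# Values copied from the BertConfig printed above; the JSON string is a toy stand-in for config.json.
config_json = """
{
  "hidden_size": 768,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "vocab_size": 30522
}
"""
cfg = json.loads(config_json)

# Each attention head works on an equal slice of the hidden vector.
head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]

# The word embedding matrix has one row of hidden_size values per vocabulary token.
word_embedding_params = cfg["vocab_size"] * cfg["hidden_size"]

print(head_dim)               # 64
print(word_embedding_params)  # 23440896
```

These numbers match the `Embedding(30522, 768)` layer and the 768-wide query, key and value projections shown in the model printout.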



## Using a Tokenizer

Tokenization is the first step of processing input data (text), before converting it into a numeric format.

Each model family (e.g. BERT) has its own tokenizer, so you should identify which tokenizer each model uses and how it works.

Three main tokenization techniques exist:
* Character tokenization: splitting text into characters
* Word tokenization: splitting text into words
* **Subword tokenization**: splitting text into words or smaller units. It combines the best aspects of word and character tokenization and is now broadly adopted.

N.B.: Word and subword tokenizers are themselves trained on a text corpus, usually the one used to pre-train the model, before the model training starts.
N.B.: Today, subword tokenization is the main technique used in most NLP models.

--

The tokenizer must be loaded into its own object, separate from the model object, to prepare input data.

`AutoTokenizer.from_pretrained()` automatically detects the type of tokenizer a model uses, just by providing its path or model_id.
If we print the tokenizer, we get access to all the important details. Let's have a look:


### WordPiece Tokenizer

The WordPiece tokenizer is generally used by models focusing on one language, mostly English, but it can be another European language too. Examples: BERT, DistilBERT

In [None]:
from transformers import AutoTokenizer, AutoModel

model_id = "distilbert/distilbert-base-uncased"
dbert_tokenizer = AutoTokenizer.from_pretrained(model_id)
dbert_tokenizer

DistilBertTokenizerFast(name_or_path='distilbert/distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

Some characteristics, like the vocabulary size, and plenty of others can be found using `help()`. That's where the WordPiece type of the tokenizer appears.

In [None]:
help(dbert_tokenizer)

#### Special tokens

Special tokens are particular cases, not always related to words. You need to know them to understand tokenization. Special tokens differ from one model to another. Here we take a WordPiece tokenizer as an example. Its special tokens are:

* [CLS] is the classification token. It is put at the beginning of the input sequence and integrates the global context of the sequence. When the model task is classification, its final vector is used as the sequence representation (the other token vectors are then ignored by the classification head).
* [SEP] is the separator token. It marks where each sentence ends in a sequence. It also marks the end of the input sequence.
* [UNK] is the unknown token. It is used to handle words or subwords that are not in the model's vocabulary.
* [PAD] is the padding token. When providing a batch of multiple samples as input to the model, all the tensors must have the same size. A max size is set (by default or manually), and all sequences shorter than this are completed with this token.
* [MASK] is the mask token. During Masked Language Modeling training, it randomly replaces some tokens of the sequence, and the model learns to predict the original ones.
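To make the role of [MASK] more concrete, here is a simplified sketch of random masking as used to build MLM training examples. The function name `mask_tokens`, the `mask_prob` default and the seed are assumptions made for illustration. BERT-style training typically selects about 15% of the tokens, and the original recipe also sometimes keeps or randomly replaces a selected token, which this sketch leaves out.

```python
import random

# Special tokens are never selected for masking.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of regular tokens with [MASK] and keep the originals as labels."""
    rng = random.Random(seed)  # seeded so that the example is reproducible
    masked, labels = [], []
    for tok in tokens:
        if tok not in SPECIAL_TOKENS and rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)  # position not used as a training target
    return masked, labels

tokens = ["[CLS]", "i", "learn", "nl", "##p", ".", "[SEP]"]
masked, labels = mask_tokens(tokens, mask_prob=0.5)
print(masked)
print(labels)
```

Only the positions where `labels` is not `None` would contribute to the training loss.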


Let's tokenize a sentence:

In [None]:
text = "tokenizing text is a core task in NLP."

tokens = dbert_tokenizer(text)

print(tokens.tokens())

['[CLS]', 'token', '##izing', 'text', 'is', 'a', 'core', 'task', 'in', 'nl', '##p', '.', '[SEP]']


We can see [CLS], [SEP] and some tokens beginning with `##`, which indicates that they are subword tokens attached to the previous token.
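The way a word is split into `##` pieces can be sketched with a greedy longest-match search over the vocabulary, which is the strategy WordPiece applies at tokenization time. The vocabulary below is a tiny hypothetical one built for this example, and `wordpiece_tokenize` is an illustrative helper, not a function from the Transformers library.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into WordPiece tokens by greedy longest match."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary (the real BERT vocabulary has 30522 entries).
toy_vocab = {"token", "##izing", "text", "nl", "##p", "n"}

print(wordpiece_tokenize("tokenizing", toy_vocab))  # ['token', '##izing']
print(wordpiece_tokenize("nlp", toy_vocab))         # ['nl', '##p']
print(wordpiece_tokenize("zebra", toy_vocab))       # ['[UNK]']
```

With this toy vocabulary, the splits reproduce the ones observed for "tokenizing" and "nlp" in the DistilBERT output.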

We can also observe that each token has its own mapped id in the tokenizer vocabulary:

In [None]:
import pandas as pd
pd.DataFrame({'token': tokens.tokens(), 'id': tokens.input_ids})

Unnamed: 0,token,id
0,[CLS],101
1,token,19204
2,##izing,6026
3,text,3793
4,is,2003
5,a,1037
6,core,4563
7,task,4708
8,in,1999
9,nl,17953


### SentencePiece tokenizer (XLM-RoBERTa)

The SentencePiece tokenizer is generally (not exclusively) used by models that can handle multiple languages, including non-European languages. Examples: XLM-RoBERTa, mT5 (mBERT, despite being multilingual, uses WordPiece).

In [None]:
model_id = "FacebookAI/xlm-roberta-base"
xlmroberta_tokenizer = AutoTokenizer.from_pretrained(model_id)
xlmroberta_tokenizer

XLMRobertaTokenizerFast(name_or_path='FacebookAI/xlm-roberta-base', vocab_size=250002, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	250001: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}

Here, we see different special tokens, but thanks to the `special_tokens` dictionary, we can understand their roles. For instance, the BOS/CLS and SEP/EOS tokens are now `<s>` and `</s>`.

In [None]:
text = "tokenizing text is a core task in NLP."

tokens = xlmroberta_tokenizer(text)

print(tokens.tokens())

['<s>', '▁to', 'ken', 'izing', '▁text', '▁is', '▁a', '▁core', '▁task', '▁in', '▁N', 'LP', '.', '</s>']


The situation is now different from WordPiece. Tokens corresponding to the beginning of a word start with `▁`, and those that are attached to the previous one don't. The SentencePiece tokenizer is agnostic to accents and punctuation, and is better suited to languages without whitespace, like Japanese.
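One practical consequence of the two marker conventions can be shown by rebuilding the text from the tokens. The sketch below is a simplified illustration, not the decoders used by the library: the two helper functions are hypothetical, and the token lists are copied from the outputs above without their special tokens.

```python
MARKER = "\u2581"  # the word-boundary character printed as ▁ in SentencePiece tokens

# Tokens from the XLM-RoBERTa output, without <s> and </s>.
sp_tokens = [MARKER + "to", "ken", "izing", MARKER + "text", MARKER + "is", MARKER + "a",
             MARKER + "core", MARKER + "task", MARKER + "in", MARKER + "N", "LP", "."]

# Tokens from the DistilBERT output, without [CLS] and [SEP].
wp_tokens = ["token", "##izing", "text", "is", "a", "core", "task", "in", "nl", "##p", "."]

def detokenize_sentencepiece(tokens):
    """Concatenate the pieces, then turn every word-boundary marker into a space."""
    return "".join(tokens).replace(MARKER, " ").strip()

def detokenize_wordpiece(tokens):
    """Glue ## pieces to the previous word, then join words with spaces."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)

print(detokenize_sentencepiece(sp_tokens))  # tokenizing text is a core task in NLP.
print(detokenize_wordpiece(wp_tokens))      # tokenizing text is a core task in nlp .
```

Because SentencePiece encodes whitespace inside the tokens, the original spacing is recovered exactly, while this naive WordPiece reconstruction has to guess where spaces go (here it adds one before the final period).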

## Understanding BERT model input and output

Let's use the famous BERT model and try to understand the output data.

### Tokenization (input)

In [None]:
model_id = "google-bert/bert-base-uncased"
bert_tokenizer = AutoTokenizer.from_pretrained(model_id)

Tokenized data can be returned in many formats. We will use the PyTorch one.

Don't forget to move the tokenized data, which will be the model input, to the same device as the model.

In [None]:
text = "I learn NLP. It's a pain everyday. It is so hard!"
tokens = bert_tokenizer(text, return_tensors='pt').to(device)
tokens

{'input_ids': tensor([[  101,  1045,  4553, 17953,  2361,  1012,  2009,  1005,  1055,  1037,
          3255, 10126,  1012,  2009,  2003,  2061,  2524,   999,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

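`token_type_ids` and `attention_mask` are returned alongside `input_ids`. We will not need to handle them manually in this lab. For your information:
* `token_type_ids` distinguishes the segments of a paired input, such as the question and the context in question answering.
* `attention_mask` indicates which tokens the model should attend to (1) and which are padding to be ignored (0). It is unrelated to the [MASK] token used for MLM training.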
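In Hugging Face tokenizers, `attention_mask` marks real tokens with 1 and padding positions with 0. The following minimal sketch shows how a batch of two sequences of different lengths could be padded by hand. The helper `pad_batch` is hypothetical, the token ids are taken from the tokenizer output above, and padding is added on the right as for the BERT tokenizer.

```python
PAD_ID = 0  # id of [PAD] in the BERT vocabulary

def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

batch = [
    [101, 1045, 4553, 17953, 2361, 1012, 102],  # [CLS] i learn nl ##p . [SEP]
    [101, 2009, 2003, 2524, 102],               # [CLS] it is hard [SEP]
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[1])       # [101, 2009, 2003, 2524, 102, 0, 0]
print(attention_mask[1])  # [1, 1, 1, 1, 1, 0, 0]
```

In practice the tokenizer does this for you when it is called on a list of texts with `padding=True`.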

We see some special tokens (101, 102) among them. Let's decode them:

In [None]:
bert_tokenizer.decode(tokens.input_ids.squeeze())

"[CLS] i learn nlp. it's a pain everyday. it is so hard! [SEP]"

### Output

Before using a model for inference, putting the code in a `torch.no_grad()` context avoids computing gradients for nothing. It can make a big difference if you are working with big models, or on CPU.

In [None]:
%%timeit -n 5 -r 10
output = bert_model(tokens.input_ids)

218 ms ± 62.5 ms per loop (mean ± std. dev. of 10 runs, 5 loops each)


In [None]:
%%timeit -n 5 -r 10
with torch.no_grad():
    output = bert_model(tokens.input_ids)

128 ms ± 29.4 ms per loop (mean ± std. dev. of 10 runs, 5 loops each)


Reminder: encoder models are built in two parts:
* A **"body"** for feature extraction, with several blocks, each containing several attention heads
* A **"head"** for pooling, classification, or anything else

So, model output provides body and head outputs.
* body output is given by `last_hidden_state` attribute
* head output is given by `pooler_output`

In [None]:
output = bert_model(tokens.input_ids)
output

In [None]:
output.last_hidden_state.shape

torch.Size([1, 19, 768])

`last_hidden_state` holds the contextualized embedding of each token (the extracted text context).

Dimensions are:
* batch size (here, 1 sentence)
* input sequence length (1 vector per token)
* embedding size

In [None]:
output.pooler_output.shape

torch.Size([1, 768])

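`pooler_output` has the same size whatever the input length. The pooler layer takes the final hidden state of the first token ([CLS]) and passes it through a dense layer with a tanh activation, so the sequence is summarized in one single vector.
This vector carries the text context and can be used for other purposes such as sentence similarity or text classification.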

Dimensions are:
* batch size (here, 1 sentence)
* embedding size
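As a rough illustration of what the pooler computes in `BertModel`, the sketch below applies a dense layer followed by tanh to the vector of the first token. It uses plain Python lists with a toy embedding size of 3, and the weights and hidden states are made-up values, not BERT's trained parameters.

```python
import math

def pooler(hidden_states, weight, bias):
    """Toy pooler: dense layer + tanh applied to the first token vector only."""
    cls_vector = hidden_states[0]  # vector of the first token ([CLS])
    pooled = []
    for row, b in zip(weight, bias):
        z = sum(w * x for w, x in zip(row, cls_vector)) + b
        pooled.append(math.tanh(z))
    return pooled

# Hypothetical hidden states for a 3-token sequence, embedding size 3.
hidden_states = [
    [0.5, -1.0, 2.0],  # [CLS]
    [0.1, 0.2, 0.3],
    [1.0, 1.0, 1.0],
]
# Hypothetical weights: identity matrix and zero bias, so the output is tanh of the [CLS] vector.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]

pooled = pooler(hidden_states, weight, bias)
print(len(pooled))  # 3: one value per embedding dimension, whatever the sequence length
```

The output size depends only on the embedding size, which is why `pooler_output` has shape (batch size, 768) for BERT base regardless of the number of tokens.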

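Hugging Face Transformers can also return the outputs of all encoder blocks and their attention matrices. `hidden_states` contains one more entry than `attentions` because it also includes the output of the embedding layer: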

In [None]:
output = bert_model(
    tokens.input_ids,
    output_hidden_states=True,
    output_attentions=True,
)



In [None]:
print(len(output.attentions), len(output.hidden_states))

12 13


## Masked Language Modeling

In [None]:
from transformers import AutoModelForMaskedLM, BertForMaskedLM
model_id = "google-bert/bert-base-uncased"
bert_model_mlm = BertForMaskedLM.from_pretrained(model_id).to(device)

Some weights of the model checkpoint at google-bert/bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
text = "I learn NLP. It's a pain everyday. It is so [MASK]!"  #original masked token: "hard"
tokens = bert_tokenizer(text, return_tensors='pt').input_ids.to(device)
bert_tokenizer.decode(tokens.squeeze())

"[CLS] i learn nlp. it's a pain everyday. it is so [MASK]! [SEP]"

In [None]:
with torch.no_grad():
    output = bert_model_mlm(tokens)

Let's check what MLM output looks like

In [None]:
output

MaskedLMOutput(loss=None, logits=tensor([[[ -6.7952,  -6.7584,  -6.7695,  ...,  -6.1543,  -5.9597,  -4.0777],
         [-13.2081, -12.6957, -12.8394,  ..., -11.5705,  -9.7691, -12.9378],
         [ -7.6904,  -7.9168,  -7.8124,  ...,  -7.7851,  -5.8529,  -6.5248],
         ...,
         [ -5.1325,  -5.0025,  -5.0526,  ...,  -5.0298,  -4.9050,  -3.2654],
         [-10.0085, -10.0155, -10.3207,  ...,  -9.8141,  -8.8576,  -4.7096],
         [-14.5361, -14.9923, -14.8336,  ..., -14.0959, -11.2791,  -9.7337]]]), hidden_states=None, attentions=None)

In [None]:
output.logits.shape

torch.Size([1, 19, 30522])

It seems there is 1 vector of 30522 values for each of the tokens. These are the logits (the scores) for each token of the vocabulary. The logits that interest us are the logits for the [MASK] token that BERT learned to fill.
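To make the link between logits and probabilities concrete, here is a minimal pure-Python sketch that applies a softmax to a handful of scores and ranks the candidates. The five-word vocabulary and the logit values are hypothetical. In the real model, the same operation runs over the 30522 logits of the [MASK] position.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, id_to_token, k=3):
    """Return the k most probable tokens with their probabilities."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id_to_token[i], probs[i]) for i in ranked[:k]]

# Hypothetical mini-vocabulary and scores for the masked position.
toy_vocab = ["hard", "good", "painful", "easy", "bad"]
toy_logits = [2.0, 1.7, 1.3, 0.9, 0.8]

for token, prob in top_k(toy_logits, toy_vocab):
    print(f"{token}: {prob:.3f}")
```

Since softmax preserves the ordering of the scores, taking the argmax of the logits or of the probabilities selects the same token.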

Let's parse the logits for each of the vectors and get each time the token with the highest probability. We will see what the model filled instead of the [MASK] token:

In [None]:
id_list = [torch.argmax(logit) for logit in output.logits.squeeze()]
bert_tokenizer.decode(id_list, clean_up_tokenization_spaces=True)

". i learn nlp. it's a pain everyday. it is so hard!."

#### Using Transformers pipe

Transformers `pipeline()` is a high-level function that lets you perform almost any task with any transformer model found on the Hub, in a single line. It is very practical to quickly try out some models or build POCs (proofs of concept).

https://huggingface.co/docs/transformers/main/en/quicktour#pipeline

More detailed documentation: https://huggingface.co/docs/transformers/main/en/main_classes/pipelines



In [None]:
from transformers import pipeline
# model_id was set above to "google-bert/bert-base-uncased"; a local directory path works too
pipe = pipeline(model=model_id, task='fill-mask')

Some weights of the model checkpoint at homedata/models/llm_encoders/bert-base-uncased/ were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Let's try to do the same thing as before with only one line of code:

In [None]:
text = "I learn NLP. It's a pain everyday. It is so [MASK]!"  #original masked token: hard
pipe(text)

[{'score': 0.11025796085596085,
  'token': 2524,
  'token_str': 'hard',
  'sequence': "i learn nlp. it's a pain everyday. it is so hard!"},
 {'score': 0.07964936643838882,
  'token': 2204,
  'token_str': 'good',
  'sequence': "i learn nlp. it's a pain everyday. it is so good!"},
 {'score': 0.054610494524240494,
  'token': 9145,
  'token_str': 'painful',
  'sequence': "i learn nlp. it's a pain everyday. it is so painful!"},
 {'score': 0.03537808358669281,
  'token': 3733,
  'token_str': 'easy',
  'sequence': "i learn nlp. it's a pain everyday. it is so easy!"},
 {'score': 0.03381006792187691,
  'token': 2919,
  'token_str': 'bad',
  'sequence': "i learn nlp. it's a pain everyday. it is so bad!"}]

## Use BERT output for Sentence similarity

In [None]:
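text1 = "Replace me by any text you'd like."
# encode text1 (not the earlier `text` variable) and run the model without tracking gradients
encoded_input = bert_tokenizer(text1, return_tensors='pt').to(device)
with torch.no_grad():
    output = bert_model(**encoded_input)

In [None]:
# the pooled sentence vector keeps its batch dimension: (batch size, embedding size)
output.pooler_output.shape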

torch.Size([1, 768])

In [None]:
def embed_text(text, model, tokenizer, device):
    encoded_input = tokenizer(text, return_tensors='pt').to(device)
    with torch.no_grad():
        output = model(**encoded_input)
    embedding = output.pooler_output.squeeze()

    return embedding

In [None]:
text1 = "I liked this movie very much"
text2 = "This movie is one oof the best I've seen"
vector1 = embed_text(text1, bert_model, bert_tokenizer, device)
vector2 = embed_text(text2, bert_model, bert_tokenizer, device)

In [None]:
from torch.nn.functional import cosine_similarity

cosine_similarity(vector1.unsqueeze(0), vector2.unsqueeze(0))

tensor([0.9423], device='cuda:0')
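For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below re-implements it in plain Python on small made-up vectors, only to show what `torch.nn.functional.cosine_similarity` computes. The helper name `cosine` is an assumption for this example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

The score only depends on the direction of the vectors, not on their length, which is why it is a common choice for comparing embeddings.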

In [None]:
text_reference = "I liked this movie very much"
text_collection = [
    "This movie is one of the best I've seen",
    "I like to go to the movie theater",
    "Germany has an access to the open sea",
    "5 miles roughly equals 8 kilometers",
    "I hate this movie"
]
vector_reference = embed_text(text_reference, bert_model, bert_tokenizer, device)
vector_dict = {text: embed_text(text, bert_model, bert_tokenizer, device) for text in text_collection}
{t:float(cosine_similarity(vector_reference.unsqueeze(0), v.unsqueeze(0))) for t,v in vector_dict.items()}


{"This movie is one of the best I've seen": 0.9913552403450012,
 'I like to go to the movie theater': 0.9847801923751831,
 'Germany has an access to the open sea': 0.9568362832069397,
 '5 miles roughly equals 8 kilometers': 0.846796989440918,
 'I hate this movie': 0.8964381217956543}

## Use Sentence Transformers

# Visualizing self-attention mechanism

Check this tutorial in a Colab notebook:
    
https://colab.research.google.com/drive/1hXIQ77A4TYS4y3UthWF-Ci7V7vVUoxmQ?usp=sharing#scrollTo=YLAhBxDSScmV

# ---------------

In [None]:
from datasets import get_dataset_config_names
domains = get_dataset_config_names("subjqa")
domains

['books', 'electronics', 'grocery', 'movies', 'restaurants', 'tripadvisor']

In [None]:
from datasets import load_dataset
subjqa = load_dataset("subjqa", name="electronics")

In [None]:
import pandas as pd
dfs = {split: dset.to_pandas() for split, dset in subjqa.flatten().items()}

In [None]:
for split, df in dfs.items():
    print(f"Number of questions in {split}: {df['id'].nunique()}")

Number of questions in train: 1295
Number of questions in test: 358
Number of questions in validation: 255


In [None]:
#hide_output
qa_cols = ["title", "question", "answers.text",
           "answers.answer_start", "context"]
sample_df = dfs["train"][qa_cols].sample(2, random_state=7)
sample_df

Unnamed: 0,title,question,answers.text,answers.answer_start,context
791,B005DKZTMG,Does the keyboard lightweight?,[this keyboard is compact],[215],I really like this keyboard. I give it 4 star...
1159,B00AAIPT76,How is the battery?,[],[],I bought this after the first spare gopro batt...


In [None]:
dfs['train'].head()

Unnamed: 0,domain,nn_mod,nn_asp,query_mod,query_asp,q_reviews_id,question_subj_level,ques_subj_score,is_ques_subjective,review_id,id,title,context,question,answers.text,answers.answer_start,answers.answer_subj_level,answers.ans_subj_score,answers.is_ans_subjective
0,electronics,great,bass response,excellent,bass,0514ee34b672623dff659334a25b599b,5,0.5,False,882b1e2745a4779c8f17b3d4406b91c7,2543d296da9766d8d17d040ecc781699,B00001P4ZH,"I have had Koss headphones in the past, Pro 4A...",How is the bass?,[],[],[],[],[]
1,electronics,harsh,high,not strong,bass,7c46670208f7bf5497480fbdbb44561a,1,0.5,False,ce76793f036494eabe07b33a9a67288a,d476830bf9282e2b9033e2bb44bbb995,B00001P4ZH,To anyone who hasn't tried all the various typ...,Is this music song have a goo bass?,"[Bass is weak as expected, Bass is weak as exp...","[1302, 1302]","[1, 1]","[0.5083333, 0.5083333]","[True, True]"
2,electronics,neutral,sound,present,bass,8fbf26792c438aa83178c2d507af5d77,1,0.5,False,d040f2713caa2aff0ce95affb40e12c2,455575557886d6dfeea5aa19577e5de4,B00001P4ZH,I have had many sub-$100 headphones from $5 Pa...,How is the bass?,[The only fault in the sound is the bass],[650],[2],[0.6333333],[True]
3,electronics,muddy,bass,awesome,bass,9876fd06ed8f075fcad70d1e30e7e8be,1,0.5,False,043e7162df91f6ea916c790c8a6f6b22,6895a59b470d8feee0f39da6c53a92e5,B00001WRSJ,My sister's Bose headphones finally died and s...,How is the audio bass?,[the best of all of them],[1609],[1],[0.3],[False]
4,electronics,perfect,bass,incredible,sound,16506b53e2d4c2b6a65881d9462256c2,1,0.65,True,29ccd7e690050e2951be49289e915382,7a2173c502da97c5bd5950eae7cd7430,B00001WRSJ,Wow. Just wow. I'm a 22 yr old with a crazy ob...,Why do I have an incredible sound?,"[The sound is so crisp, crazy obsession with s...","[141, 38]","[1, 1]","[0.40833333, 0.40833333]","[False, False]"


# ---------------

In [None]:
import pandas as pd

data = {
    'Evaluated on': ['de', 'fr', 'it', 'en'],
    'Fine-tune on de': [0.8677, 0.7141, 0.6923, 0.5890],
    'Fine-tune on each': [0.8677, 0.8505, 0.8192, 0.7068],
    'Fine-tune on all': [0.8682, 0.8647, 0.8575, 0.7870]
}

df = pd.DataFrame(data)
print(df.T)

                        0       1       2       3
Evaluated on           de      fr      it      en
Fine-tune on de    0.8677  0.7141  0.6923   0.589
Fine-tune on each  0.8677  0.8505  0.8192  0.7068
Fine-tune on all   0.8682  0.8647  0.8575   0.787


# ---------------