#### Guy Yanko and Shahaf Priente
<br>

### Final project in the class: Information-theoretic analysis of neural language models
#### Reichman University, Fall 2022-2023

In [None]:
! pip install transformers datasets

In [None]:
!pip install scipy

# Hebrew Language models

# Bert Transformer

https://huggingface.co/docs/transformers/model_doc/bert


![image-2.png](attachment:image-2.png)

![image.png](attachment:image.png)


# AlephBert

Training data:

OSCAR (Ortiz, 2019) Hebrew section (10 GB text, 20 million sentences).
Hebrew dump of Wikipedia (650 MB text, 3 million sentences).
Hebrew Tweets collected from the Twitter sample stream (7 GB text, 70 million sentences).


Training procedure:

Trained on a DGX machine (8 V100 GPUs) using the standard huggingface training procedure.

Since the larger part of our training data is based on tweets we decided to start by optimizing using Masked Language Model loss only.

To optimize training time we split the data into 4 sections based on max number of tokens:

num tokens < 32 (70M sentences)
32 <= num tokens < 64 (12M sentences)
64 <= num tokens < 128 (10M sentences)
128 <= num tokens < 512 (1.5M sentences)

Each section was first trained for 5 epochs with an initial learning rate set to 1e-4. Then each section was trained for another 5 epochs with an initial learning rate set to 1e-5, for a total of 10 epochs.
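The length-based split described above can be sketched as follows. This is only an illustration, not the AlephBERT authors' code: the helper name `bucket_by_length` and the toy token counts are assumptions, and only the section boundaries (32, 64, 128, 512) come from the description.

```python
from bisect import bisect_right
from collections import defaultdict

# Exclusive upper bounds of the four sections listed above.
BOUNDARIES = [32, 64, 128, 512]

def bucket_by_length(token_counts):
    """Group sentence indices by the section their token count falls into.

    Sentences with 512 tokens or more belong to no section and are skipped.
    """
    buckets = defaultdict(list)
    for idx, n in enumerate(token_counts):
        section = bisect_right(BOUNDARIES, n)  # number of boundaries <= n
        if section < len(BOUNDARIES):
            buckets[BOUNDARIES[section]].append(idx)
    return dict(buckets)

# Hypothetical token counts for eight sentences.
toy_counts = [5, 31, 32, 100, 127, 128, 511, 600]
sections = bucket_by_length(toy_counts)
print(sections)
```

Grouping sentences of similar length keeps padding low within a batch, which is presumably why the split shortens training time.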

Total training time was 8 days.

![image.png](attachment:image.png)

# HeBert

HeBert was trained on three datasets:

A Hebrew version of OSCAR (Ortiz, 2019): ~9.8 GB of data, including 1 billion words and over 20.8 million sentences.

A Hebrew dump of Wikipedia: ~650 MB of data, including over 63 million words and 3.8 million sentences.

Emotion UGC data that was collected for the purpose of this study (described below). We evaluated the model on emotion recognition and sentiment analysis as downstream tasks.

https://oscar-corpus.com/

Hyperparameters:

4 epochs with lr=5e-5, batch size=128


![image.png](attachment:image.png)

# MBert

A BERT-base-multilingual model tuned so that its embedding space, for 69 languages, matches the embedding space of the CLIP text encoder that accompanies the ViT-B/32 vision encoder.

A full list of the 100 languages used during pre-training can be found here, and a list of the 69 languages used during fine-tuning can be found in SupportedLanguages.md.

Training data pairs were generated by sampling 40k sentences for each language from the combined descriptions of GCC + MSCOCO + VizWiz and translating them into the corresponding language.

All translation was done using the AWS Translate service. The quality of these translations has not yet been analyzed, but one can assume that it varies between the 69 languages.

# Hebrew Data

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)


# Perplexity of fixed-length models (Bidirectional)

Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note
that the metric applies specifically to classical language models (sometimes called autoregressive or causal language
models) and is not well defined for masked language models like BERT (see [summary of the models](https://huggingface.co/docs/transformers/main/en/model_summary)).

Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. For a bidirectional (masked) model we use the analogous quantity in which each token is predicted from all the other tokens in the sequence. If we have a tokenized
sequence $X = (x_1, x_2, \dots, x_t)$, then the log-perplexity of $X$, in bits per token, is

$$\text{LPPL}(X) = -\frac{1}{t}\sum_{i=1}^{t} \log_2 p_\theta (x_i \mid x_{\neq i})$$

and the perplexity itself is $2^{\text{LPPL}(X)}$. Here $\log_2 p_\theta (x_i \mid x_{\neq i})$ is the log-likelihood of the $i$-th token conditioned on all the other tokens $x_{\neq i}$ according to our model. Intuitively, it can be thought of as an evaluation of the model's ability to predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization procedure has a direct impact on a model's perplexity, which should always be taken into consideration when comparing different models.
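As a small numeric illustration of the definition above, the sketch below computes the log-perplexity in bits and the corresponding perplexity from a list of per-token probabilities. The function name `log_perplexity_bits` and the probabilities are hypothetical; in practice the probabilities would come from the model's prediction for each masked token.

```python
import math

def log_perplexity_bits(token_probs):
    """Average negative log2-probability of the true tokens (bits per token)."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities assigned to each true token when it is masked.
toy_probs = [0.5, 0.25, 0.125, 0.125]

lppl = log_perplexity_bits(toy_probs)  # (1 + 2 + 3 + 3) / 4 bits
ppl = 2 ** lppl                        # perplexity is 2 to the log-perplexity
print(lppl, ppl)

# Sanity check: a model that guesses uniformly over 8 tokens has perplexity 8.
uniform_ppl = 2 ** log_perplexity_bits([1 / 8] * 5)
```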

This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions. For more
intuition about perplexity and its relationship to Bits Per Character (BPC) and data compression, check out this
[fantastic blog post on The Gradient](https://thegradient.pub/understanding-evaluation-metrics-for-language-models/).

## Calculating PPL with fixed-length models

If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.

<img width="600" alt="Full decomposition of a sequence with unlimited context length" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/ppl_full.gif"/>

When working with approximate models, however, we typically have a constraint on the number of tokens the model can
process. The largest version of [GPT-2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2), for example, has a fixed length of 1024 tokens, so we
cannot calculate $p_\theta(x_t|x_{<t})$ directly when $t$ is greater than 1024.

Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max
input size is $k$, we then approximate the likelihood of a token $x_t$ by conditioning only on the
$k-1$ tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
log-likelihoods of each segment independently.

<img width="600" alt="Suboptimal PPL not taking advantage of full available context" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/ppl_chunked.gif"/>

This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor
approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
have less context at most of the prediction steps.

Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
sliding the context window so that the model has more context when making each prediction.

<img width="600" alt="Sliding window PPL taking advantage of all available context" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/ppl_sliding.gif"/>

This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
1 token a time. This allows computation to proceed much faster while still giving the model a large context to make
predictions at each step.
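The strided sliding window can be sketched in terms of window boundaries alone, without a model. The helper `strided_windows` below is an illustrative assumption, not a library function: for each window it returns where the window begins and ends and how many of its trailing tokens are newly scored, while the earlier tokens serve only as context.

```python
def strided_windows(seq_len, max_length, stride):
    """Return a list of (begin, end, num_targets), one entry per window.

    Each window covers tokens [begin, end). Only its last num_targets tokens
    are scored; the tokens before them were scored by an earlier window.
    """
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        num_targets = end - prev_end  # tokens not yet scored
        windows.append((begin, end, num_targets))
        prev_end = end
        if end == seq_len:
            break
    return windows

# Toy setting: 10 tokens, a context of 4 tokens, moving 2 tokens at a time.
windows = strided_windows(seq_len=10, max_length=4, stride=2)
print(windows)

# With stride == max_length the windows degenerate into disjoint chunks.
disjoint = strided_windows(seq_len=10, max_length=4, stride=4)
```

Every token is scored exactly once, so the per-window losses can be combined into a single average. A smaller stride gives each scored token more context at the cost of more forward passes.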

## Calculating perplexity with BERT Hebrew LMs

Dataset Summary:

HebrewSentiment is a data set consisting of 12,804 user comments on posts on the official Facebook page of Israel’s president, Mr. Reuven Rivlin.

In October 2015, we used the open software application Netvizz (Rieder, 2013) to scrape all the comments to all of the president’s posts in the period of June – August 2014, the first three months of Rivlin’s presidency.

While the president’s posts aimed at reconciling tensions and called for tolerance and empathy, the sentiment expressed in the comments to the president’s posts was polarized between citizens who warmly thanked the president, and citizens that fiercely critiqued his policy.

Of the 12,804 comments, 370 are neutral; 8,512 are positive, 3,922 negative.

![image.png](attachment:image.png)


In [None]:
from datasets import load_dataset
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
import numpy as np
import random
import pandas as pd
from tqdm import tqdm

he_sentences = load_dataset("hebrew_sentiment", split="test")

In [1]:
print(*random.choices(he_sentences['text'],k=5), sep='\n\n')


In [None]:
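def get_rank(x, indices):
    # x: [num_tokens, vocab_size] logits; indices: the true token id at each position.
    # The rank of a true token is the number of vocabulary entries with a strictly
    # higher logit, so rank 0 means the model's top guess was correct.
    vals = x[range(len(x)), indices]
    return (x > vals[:, None]).long().sum(1)

def score(model, tokenizer, sentence, verbose = False):
    # Encode with the special tokens: [CLS] w_1 ... w_n [SEP].
    tensor_input = tokenizer.encode(sentence, return_tensors='pt',truncation=True)

    # Build one copy of the sentence per real token; copy i has position i+1 masked.
    # The offset of 1 skips [CLS], and the [:-2] slice keeps [CLS] and [SEP] unmasked.
    repeat_input = tensor_input.repeat(tensor_input.size(-1)-2, 1)
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
    # Positions labelled -100 are ignored by the loss, so only masked tokens are scored.
    labels = repeat_input.masked_fill( masked_input != tokenizer.mask_token_id, -100)

    with torch.inference_mode():
        output = model(masked_input, labels=labels)
        num_tokens = tensor_input.shape[1]-2
        # Row i of the batch, position i+1: the logits predicted for the masked token.
        pred_logits = output.logits[list(range(num_tokens)),list(range(1,num_tokens+1)),:]
        ranks = get_rank(pred_logits,tensor_input.squeeze().tolist()[1:-1])
        # Mean cross-entropy over the masked positions, in nats.
        loss = output.loss

    # Convert nats to bits: log2(x) = ln(x) / ln(2).
    log_ppl = loss.item()/np.log(2)

    if verbose:
        print('input:\n',tensor_input)
        print('\nduplicate input:\n',repeat_input)
        print('\nmasked input:\n',masked_input)
        print('\nlabels:\n',labels)
        print('\ncross entropy loss:\n',loss.item())
        print('\nlog perplexity:\n',log_ppl)
        print('\nlogits shape (num tokens, vocab size):\n',pred_logits.shape)

    return log_ppl, ranks, pred_logits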
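To make the masking scheme in `score` easier to follow, here is a plain-Python sketch of the same idea with no model or tokenizer involved. The helper name `build_masked_batch`, the placeholder `MASK_ID`, and the toy token ids are assumptions for illustration; a real tokenizer supplies its own `mask_token_id`.

```python
MASK_ID = -1         # placeholder; a real tokenizer supplies tokenizer.mask_token_id
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def build_masked_batch(token_ids):
    """Return (masked_rows, label_rows) with one row per real token.

    token_ids is assumed to start with [CLS] and end with [SEP]; those two
    positions are never masked.
    """
    masked_rows, label_rows = [], []
    for pos in range(1, len(token_ids) - 1):
        masked = list(token_ids)
        masked[pos] = MASK_ID                  # hide exactly one token
        labels = [IGNORE_INDEX] * len(token_ids)
        labels[pos] = token_ids[pos]           # score only the hidden token
        masked_rows.append(masked)
        label_rows.append(labels)
    return masked_rows, label_rows

toy_ids = [2, 10, 11, 12, 3]  # hypothetical ids: [CLS] w1 w2 w3 [SEP]
masked_rows, label_rows = build_masked_batch(toy_ids)
for row in masked_rows:
    print(row)
```

Each row costs one sequence in the batch, so a sentence of n tokens needs n masked copies. This is what makes the score a per-token average over single-token predictions.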

In [None]:
model_name = 'onlplab/alephbert-base'
model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
print('MASK: ',tokenizer.mask_token_id)
print('SOS (CLS): ',tokenizer.cls_token_id)
print('EOS (SEP): ',tokenizer.sep_token_id)

In [None]:
sent = 'עברית שפה קשה .'
ppl,ranks,pred_logits = score(sentence=sent, model=model, tokenizer=tokenizer, verbose=True)

In [None]:
average_word_len = np.mean([len(word) for word in tokenizer.vocab.keys()])
average_word_len

In [None]:
models_ppl = {'onlplab/alephbert-base':0,
              'avichr/heBERT':0}

he_100_sentences = random.choices(he_sentences['text'],k=100)

for model_name in tqdm(models_ppl.keys()):
    print(f'calculating for: {model_name}')
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    total_ppl = 0
    all_ranks = []
    all_logits = []
    for sent in tqdm(he_100_sentences):
        sent_ppl,sent_ranks,pred_logits = score(sentence=sent, model=model, tokenizer=tokenizer)
        total_ppl += sent_ppl
        all_ranks+=sent_ranks.tolist()
        all_logits.append(pred_logits)
    # Store (mean log-perplexity in bits, ranks of all true tokens, logits of all masked positions).
    models_ppl[model_name] = (total_ppl/len(he_100_sentences),all_ranks,torch.cat(all_logits,dim=0))

In [None]:
perplexities = {name: res[0] for name, res in models_ppl.items()}
pd.DataFrame([perplexities]).plot(kind='bar')
print(perplexities)

In [None]:
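# Rough estimate: divide bits per token by the mean length of a vocabulary entry
# (average_word_len was computed above from the AlephBERT tokenizer).
for name, ppl in perplexities.items():
    print(f'{name} bits per letter: {ppl/average_word_len}')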
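The estimate above divides by the mean length of a vocabulary entry. A possible alternative, sketched here under assumptions, is to divide the total number of bits by the total number of characters in the evaluated sentences. The helper `bits_per_character` and the toy statistics are hypothetical and are not part of the notebook's pipeline.

```python
def bits_per_character(sentence_stats):
    """sentence_stats: iterable of (bits_per_token, num_tokens, num_chars)."""
    stats = list(sentence_stats)
    total_bits = sum(bpt * n_tok for bpt, n_tok, _ in stats)
    total_chars = sum(n_chars for _, _, n_chars in stats)
    return total_bits / total_chars

# Hypothetical sentences: (log-perplexity in bits, token count, character count).
toy_stats = [(4.0, 5, 20), (6.0, 10, 60)]
bpc = bits_per_character(toy_stats)  # (20 + 60) bits over 80 characters
print(bpc)
```

Weighting by token count means long sentences contribute in proportion to their length, which a plain average of per-sentence scores does not do.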

In [None]:
from collections import Counter
from scipy.stats import entropy

def get_ranks_probs(ranks:list):
    r_len = len(ranks)
    count_ranks = Counter(ranks)
    return {rank:count/r_len for rank, count in count_ranks.items()}

ranks = {name: res[1] for name, res in models_ppl.items()}
ranks_probs = {name:get_ranks_probs(ranks) for name, ranks in ranks.items()}
for name, probs in ranks_probs.items():
    ent = entropy(list(probs.values()), base=2)
    pd.DataFrame([probs]).T.sort_values(by = 0,ascending=False)[:30].plot(title = f'{name} | Guessing series entropy: {ent}', kind = 'bar', figsize = (10,5))
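The "guessing series entropy" in the plot title is the Shannon entropy of the empirical rank distribution. The sketch below reproduces that calculation with the standard library only, on hypothetical ranks; the helper name `rank_entropy_bits` is an assumption.

```python
import math
from collections import Counter

def rank_entropy_bits(ranks):
    """Shannon entropy (base 2) of the empirical distribution of ranks."""
    counts = Counter(ranks)
    total = len(ranks)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical ranks: the true token was the top guess (rank 0) half of the time.
toy_ranks = [0, 0, 0, 0, 1, 1, 2, 3]
ent = rank_entropy_bits(toy_ranks)  # 0.5*1 + 0.25*2 + 2*(0.125*3) bits
print(ent)
```

A low entropy means the ranks concentrate on a few values (ideally rank 0), so the model's guesses are more predictable.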

In [None]:
from torch.nn import Softmax
from torch import topk


def get_mean_probablities_per_token(probs):
    probs = Softmax(dim = 1)(probs)
    return torch.mean(probs,dim=0)

def get_topk_tokens(probs,tokenizer,k=30):
    top = torch.topk(probs,k)
    top_tokens=[tokenizer.decode(idx) for idx in top.indices]
    return {token.item():prob.item() for token, prob in zip(top.indices,top.values)},top_tokens
    

token_probabilities = {name:get_mean_probablities_per_token(res[2]) for name,res in models_ppl.items()}

for name, probs in token_probabilities.items():
    tokenizer = AutoTokenizer.from_pretrained(name)
    top_probs,top_tokens = get_topk_tokens(probs,tokenizer)
    print(f'{name}\n',top_tokens)
    pd.DataFrame([top_probs]).T.plot(title = f'{name}', kind = 'bar', figsize = (10,5))

# Hebrew Difficulties

In [None]:
model_name = 'onlplab/alephbert-base'
model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
tokenizer.encode('נשיא')

In [None]:
tokenizer.encode('נשיאה')

The same sentence in its masculine and feminine forms:

In [None]:
print(tokenizer.encode('מזל טוב לנשיא החדש, בטוח שיהיה טוב לאזרחי המדינה'))
score(model=model, tokenizer=tokenizer, sentence='מזל טוב לנשיא החדש, בטוח שיהיה טוב לאזרחי המדינה')[0]

In [None]:
print(tokenizer.encode('מזל טוב לנשיאה החדשה, בטוח שתהיה טובה לאזרחי המדינה'))
score(model=model, tokenizer=tokenizer, sentence='מזל טוב לנשיאה החדשה, בטוח שתהיה טובה לאזרחי המדינה')[0]

Here is an example of a single written form that has several completely different meanings, **הקפה**: "lap" in the first sentence, "the coffee" in the second, and "its circumference" in the third.

In [None]:
print(tokenizer.encode('המתחרים סיימו הקפה מלאה'))
print(score(model=model, tokenizer=tokenizer, sentence='המתחרים סיימו הקפה מלאה')[0])

In [None]:
print(tokenizer.encode('המלצר שאל אם סיימתי עם הקפה'))
print(score(model=model, tokenizer=tokenizer, sentence='המלצר שאל אם סיימתי עם הקפה')[0])

In [None]:
print(tokenizer.encode('הקפה של המכונית גדול מהמצופה'))
print(score(model=model, tokenizer=tokenizer, sentence='הקפה של המכונית גדול מהמצופה')[0])