<a href="https://colab.research.google.com/github/chiamaka249/IgboNER/blob/main/Projection/Working_Pratikalu_Awesome_align_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AWESOME: Aligning Word Embedding Spaces of Multilingual Encoders

[``awesome-align``](https://github.com/neulab/awesome-align) is a tool that can extract word alignments from multilingual BERT (mBERT) and allows you to fine-tune mBERT on parallel corpora for better alignment quality (see [our paper](https://arxiv.org/abs/2101.08231) for more details).

This is a simple demo of how `awesome-align` extracts word alignments from mBERT.

First, install and import the following packages. (Note that the original `awesome-align` tool does not require the `transformers` package.)

In [1]:
!pip install transformers==3.1.0
import torch
import transformers
import itertools

Collecting transformers==3.1.0
  Downloading transformers-3.1.0-py3-none-any.whl (884 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting sentencepiece!=0.1.92
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting tokenizers==0.8.1.rc2
  Downloading tokenizers-0.8.1rc2-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB)
Installing collected packages: tokenizers, sentencepiece, sacremoses, transformers
Successfully installed sacremoses-0.0.47 sentencepiece-0.1.96 tokenizers-0.8.1rc2 transformers-3.1.0


Load the multilingual BERT model and its tokenizer.

In [2]:
model = transformers.BertModel.from_pretrained('bert-base-multilingual-cased')
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-multilingual-cased')


### Prepare input sentences


1.   Read the English text file and store its sentences in a list: `ensents = [en_sent1, en_sent2, ..., en_sentn]`.
2.   Do the same for the Igbo text file: `igsents = [ig_sent1, ig_sent2, ..., ig_sentn]`.
3.   Pair the sentences up and store the tuples in a list: `en_ig_sents = [(en_sent1, ig_sent1), (en_sent2, ig_sent2), ..., (en_sentn, ig_sentn)]`.
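
The three steps above can be sketched as a single helper. This is only an illustration under stated assumptions: the function name `load_parallel`, the temporary file names, and the two toy sentences are hypothetical and are not part of the notebook. Unlike the cells below, the sketch also checks that both files have the same number of lines and skips blank lines.

```python
import os
import tempfile


def load_parallel(src_path, tgt_path):
    """Read two line-aligned files and return a list of (source, target) pairs."""
    with open(src_path, encoding="utf-8") as f:
        src_lines = f.read().splitlines()
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = f.read().splitlines()
    # zip() silently truncates to the shorter list, so check the lengths first.
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line count mismatch: {len(src_lines)} vs {len(tgt_lines)}"
        )
    # Skip pairs where either side is empty (e.g. a blank line in the file).
    return [(s, t) for s, t in zip(src_lines, tgt_lines) if s.strip() and t.strip()]


# Tiny demonstration with temporary files; the sentences are toy examples.
with tempfile.TemporaryDirectory() as tmp:
    en_path = os.path.join(tmp, "toy.en")
    ig_path = os.path.join(tmp, "toy.ig")
    with open(en_path, "w", encoding="utf-8") as f:
        f.write("good morning\nthank you\n")
    with open(ig_path, "w", encoding="utf-8") as f:
        f.write("ụtụtụ ọma\ndaalụ\n")
    toy_pairs = load_parallel(en_path, ig_path)

print(toy_pairs)
```

Raising on a length mismatch is a design choice for this sketch: misaligned files would otherwise pair unrelated sentences without any warning.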



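Download the source (English) and target (Igbo) sentence files. Each line holds one sentence; the alignment code later splits every sentence into words on whitespace.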

In [3]:
!wget -c https://raw.githubusercontent.com/Chiamakac/TRAININGS/main/Alignment/nnwale.en
!wget -c https://raw.githubusercontent.com/Chiamakac/TRAININGS/main/Alignment/nnwale.ig

--2022-03-10 12:19:05--  https://raw.githubusercontent.com/Chiamakac/TRAININGS/main/Alignment/nnwale.en
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14382 (14K) [text/plain]
Saving to: ‘nnwale.en’


2022-03-10 12:19:06 (73.7 MB/s) - ‘nnwale.en’ saved [14382/14382]

--2022-03-10 12:19:06--  https://raw.githubusercontent.com/Chiamakac/TRAININGS/main/Alignment/nnwale.ig
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16403 (16K) [text/plain]
Saving to: ‘nnwale.ig’


2022-03-10 12:19:06 (12.9 MB/s) - ‘nnwale.ig’ saved [16403/16403]

In [4]:
# Open the English file, read it, and split the text
# into one sentence per line (splitting on '\n').
with open("/content/nnwale.en", "r", encoding="utf-8") as en_file:
    ensents = en_file.read().split("\n")
# print(ensents)

In [5]:
# Open the Igbo file, read it, and split the text
# into one sentence per line (splitting on '\n').
with open("/content/nnwale.ig", "r", encoding="utf-8") as ig_file:
    igsents = ig_file.read().split("\n")
# print(igsents)

In [6]:
# program to convert igsents list into a tuple
igtuple=tuple(igsents)
print(igtuple)


("Aga- eme elimozu Pius Adesanmi n'ụbọchị Satọdee", "Adesanmi, nke amụrụ n'ọnwa February 27, 1972, bụ onye dere akwụkwọ amaara nke ọma bụ Naija No Dey Carry Last, mkpokọta ederede nkatọ nke afọ 2015 ", "Ezinaụlọ Prof Pius Adesanmi nwụrụ anwụ ekwupụtala maka elimozu nwa afọ Naịjirịa nke gụrụ akwụkwọ Kanada, ode akwụkwọ, onye edemede, nke nwụrụ n'abalị nke iri n'ọnwa Maachị n'afọ 2019 oge Ụgbọelu nke Etiopịa gbariri ka nwa oge o fepụrụ gasịrị.", "Adesanmi, nke amụrụ n'ọnwa February 27, 1972, bụ onye dere akwụkwọ amaara nke ọma bụ Naija No Dey Carry Last, mkpokọta ederede nkatọ nke afọ 2015 ", "Adesanmi so  n'otu n'ime ndị French Institute for Research n' Africa site n'afọ 1993 rue 1997, na French Institute nke Ndịda Afrịka n'afọ1998 na 2000.", "malite n'afọ 2002 rue 2005, ọ bụ osote ọkammụta na Comparative Literature na Mahadum Steeti Pennsylvania , United States.", "N'afọ 2006, o sonyere mahadum Carleton dị n' Ottawa, Canada, dịka ọkammụta nke Literature and African Studies.", 'Ọ bụ ony

In [7]:
# program to convert ensents list into a tuple
entuple=tuple(ensents)
print(entuple)


('Pius Adesanmi To Be Buried On Saturday', 'Adesanmi, born February 27, 1972, was author of the popular book Naija No Dey Carry Last, a 2015 collection of satirical essays.', 'The family of late Prof Pius Adesanmi has announced the burial of the Nigerian-born Canadian scholar, writer, literary critic and columnist, who died on March 10, 2019 when an Ethiopian Airline aircraft crashed shortly after take-off.', 'Adesanmi, born February 27, 1972, was author of the popular book Naija No Dey Carry Last, a 2015 collection of satirical essays.', 'Adesanmi was a Fellow of the French Institute for Research in Africa from 1993 to 1997, and of the French Institute of South Africa in 1998 and 2000.', 'From 2002 to 2005, he was Assistant Professor of Comparative Literature at the Pennsylvania State University, United States.', 'In 2006, he joined Carleton University in Ottawa, Canada, as a professor of Literature and African Studies.', "He was the director of the university's Institute of African S

In [8]:
# zip() pairs the i-th English sentence with the i-th Igbo sentence;
# list() turns the result into a list of (English, Igbo) tuples.
en_ig_sents = list(zip(entuple, igtuple))
print(en_ig_sents)

[('Pius Adesanmi To Be Buried On Saturday', "Aga- eme elimozu Pius Adesanmi n'ụbọchị Satọdee"), ('Adesanmi, born February 27, 1972, was author of the popular book Naija No Dey Carry Last, a 2015 collection of satirical essays.', "Adesanmi, nke amụrụ n'ọnwa February 27, 1972, bụ onye dere akwụkwọ amaara nke ọma bụ Naija No Dey Carry Last, mkpokọta ederede nkatọ nke afọ 2015 "), ('The family of late Prof Pius Adesanmi has announced the burial of the Nigerian-born Canadian scholar, writer, literary critic and columnist, who died on March 10, 2019 when an Ethiopian Airline aircraft crashed shortly after take-off.', "Ezinaụlọ Prof Pius Adesanmi nwụrụ anwụ ekwupụtala maka elimozu nwa afọ Naịjirịa nke gụrụ akwụkwọ Kanada, ode akwụkwọ, onye edemede, nke nwụrụ n'abalị nke iri n'ọnwa Maachị n'afọ 2019 oge Ụgbọelu nke Etiopịa gbariri ka nwa oge o fepụrụ gasịrị."), ('Adesanmi, born February 27, 1972, was author of the popular book Naija No Dey Carry Last, a 2015 collection of satirical essays.', "

```
for src, tgt in en_ig_sents:
  # perform the alignment
```
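
What happens inside that loop can be illustrated without the model. The sketch below is a pure-Python toy, not the notebook's code: the similarity scores and the subword-to-word maps are hypothetical values, and `align_from_similarity` is an illustrative name. It mirrors the idea used in the cell further down: apply a softmax to the similarity matrix in both directions, keep subword pairs that exceed the threshold in both, then map subword indices back to word indices.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def align_from_similarity(sim, threshold=1e-3):
    """Return (i, j) pairs that pass the threshold in both softmax directions."""
    n_src, n_tgt = len(sim), len(sim[0])
    # Source-to-target: normalise each row over the target positions.
    src2tgt = [softmax(row) for row in sim]
    # Target-to-source: normalise each column over the source positions.
    cols = [softmax([sim[i][j] for i in range(n_src)]) for j in range(n_tgt)]
    tgt2src = [[cols[j][i] for j in range(n_tgt)] for i in range(n_src)]
    return {
        (i, j)
        for i in range(n_src)
        for j in range(n_tgt)
        if src2tgt[i][j] > threshold and tgt2src[i][j] > threshold
    }


# Hypothetical dot-product scores for 2 source and 3 target subwords.
toy_sim = [
    [12.0, 1.0, 0.5],
    [0.0, 11.0, 10.5],
]
# Hypothetical subword-to-word maps: each source subword is its own word;
# target subwords 1 and 2 are pieces of the same word (word 1).
sub2word_src = [0, 1]
sub2word_tgt = [0, 1, 1]

subword_links = align_from_similarity(toy_sim)
word_links = {(sub2word_src[i], sub2word_tgt[j]) for i, j in subword_links}
print(sorted(subword_links))
print(sorted(word_links))
```

Because the word-level links are stored in a set, two subword links that fall on the same word pair collapse into a single word alignment.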

Run the model and print the resulting alignments. Each output line has the form `source_word,target_word`.

In [9]:
for src,tgt in en_ig_sents:
  sent_src, sent_tgt = src.strip().split(), tgt.strip().split()
  token_src, token_tgt = [tokenizer.tokenize(word) for word in sent_src], [tokenizer.tokenize(word) for word in sent_tgt]
  wid_src, wid_tgt = [tokenizer.convert_tokens_to_ids(x) for x in token_src], [tokenizer.convert_tokens_to_ids(x) for x in token_tgt]
  ids_src = tokenizer.prepare_for_model(list(itertools.chain(*wid_src)), return_tensors='pt', model_max_length=tokenizer.model_max_length, truncation=True)['input_ids']
  ids_tgt = tokenizer.prepare_for_model(list(itertools.chain(*wid_tgt)), return_tensors='pt', truncation=True, model_max_length=tokenizer.model_max_length)['input_ids']
  sub2word_map_src = []
  for i, word_list in enumerate(token_src):
    sub2word_map_src += [i for x in word_list]
  sub2word_map_tgt = []
  for i, word_list in enumerate(token_tgt):
    sub2word_map_tgt += [i for x in word_list]

  # alignment
  align_layer = 8
  threshold = 1e-3
  model.eval()
  with torch.no_grad():
    out_src = model(ids_src.unsqueeze(0), output_hidden_states=True)[2][align_layer][0, 1:-1]
    out_tgt = model(ids_tgt.unsqueeze(0), output_hidden_states=True)[2][align_layer][0, 1:-1]

    dot_prod = torch.matmul(out_src, out_tgt.transpose(-1, -2))

    softmax_srctgt = torch.nn.Softmax(dim=-1)(dot_prod)
    softmax_tgtsrc = torch.nn.Softmax(dim=-2)(dot_prod)

    softmax_inter = (softmax_srctgt > threshold)*(softmax_tgtsrc > threshold)

  align_subwords = torch.nonzero(softmax_inter, as_tuple=False)
  align_words = set()
  for i, j in align_subwords:
    align_words.add( (sub2word_map_src[i], sub2word_map_tgt[j]) )

  # printing
  class color:
    PURPLE = '\033[95m'
    CYAN = '\033[96m'
    DARKCYAN = '\033[36m'
    BLUE = '\033[94m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    END = '\033[0m'

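# Note: this loop is not indented under `for src,tgt in en_ig_sents`, so it
# prints only the alignments of the last sentence pair. Indent it to print all pairs.
for i, j in sorted(align_words):
  print(f'{sent_src[i]},{sent_tgt[j]}')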

    

The,Akwụkwọ
Determination,Akwụkwọ
Determination,Sepụtemba
envisaged,gaa
envisaged,enyere
the,gaa
implementation,gaa
the,tinyere
USSD,USSD
charges,site
the,'mobile
mobile,'mobile
network,network
operators,operators'
(MNOs),(MNOs)
September,n'afọ
1,,n'afọ
2019,,2019
valid,Sepụtemba
licensees,ahụ.
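
The word pairs printed above are easy to read but awkward to process further. As far as we know, the `awesome-align` command-line tool writes its alignments in the `i-j` Pharaoh format, where `i` and `j` are zero-based word indices in the source and target sentence. A minimal sketch of that conversion follows; the helper name `to_pharaoh` and the toy links are hypothetical, and in the notebook you would pass the `align_words` set instead.

```python
def to_pharaoh(align_words):
    """Format a set of (src_index, tgt_index) pairs as 'i-j' tokens on one line."""
    return " ".join(f"{i}-{j}" for i, j in sorted(align_words))


# Hypothetical word-level links for a three-word sentence pair.
toy_links = {(2, 1), (0, 0), (1, 2)}
pharaoh_line = to_pharaoh(toy_links)
print(pharaoh_line)
```

Storing indices rather than word strings also avoids ambiguity when a word itself contains a comma, as in the `1,,n'afọ` line above.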
