<a href="https://colab.research.google.com/github/chiamaka249/IgboNER/blob/main/Pratikalu_Awesome_align.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**AWESOME: Aligning Word Embedding Spaces of Multilingual Encoders**

awesome-align is a tool that can extract word alignments from multilingual BERT (mBERT) and allows you to fine-tune mBERT on parallel corpora for better alignment quality (see our paper for more details).

This is a simple demo of how awesome-align extracts word alignments from mBERT.




*First, install and import the following packages. (Note that the original awesome-align tool does not require the transformers package.)*

In [None]:
!pip install transformers==3.1.0
import torch
import transformers
import itertools

Load the multilingual BERT model and its tokenizer.

In [None]:
model = transformers.BertModel.from_pretrained('bert-base-multilingual-cased')
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-multilingual-cased')

Prepare input sentences

Read the English text file and store its sentences in a list: `ensents = [en_sent1, en_sent2, ..., en_sentn]`.

Do the same for the Igbo text file: `igsents = [ig_sent1, ig_sent2, ..., ig_sentn]`.

Then store each sentence pair as a tuple in a list: `en_ig_sents = [(en_sent1, ig_sent1), (en_sent2, ig_sent2), ..., (en_sentn, ig_sentn)]`.
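
As a minimal sketch of the pairing step, assuming both files hold one sentence per line in the same order, the helper below zips two lists and fails loudly when their lengths differ, because `zip` would otherwise silently drop the unmatched tail. The name `pair_sentences` and the toy data are hypothetical and only serve the illustration.

```python
def pair_sentences(src_sents, tgt_sents):
    """Pair source and target sentences position by position."""
    if len(src_sents) != len(tgt_sents):
        raise ValueError(
            f"length mismatch: {len(src_sents)} source vs "
            f"{len(tgt_sents)} target sentences"
        )
    return list(zip(src_sents, tgt_sents))


# Toy placeholder data, not taken from the benchmark files.
toy_en = ["en_sent1", "en_sent2"]
toy_ig = ["ig_sent1", "ig_sent2"]
toy_pairs = pair_sentences(toy_en, toy_ig)
```

A length check of this kind is cheap and catches a stray blank line in one of the files before it shifts every later pair.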

*Input tokenized source and target sentences.*

In [None]:
!wget -c https://raw.githubusercontent.com/Chiamakac/IGBONLP/master/ig_en_mt/benchmark_dataset/test.en
!wget -c https://raw.githubusercontent.com/Chiamakac/IGBONLP/master/ig_en_mt/benchmark_dataset/test.ig
!wget -c https://raw.githubusercontent.com/Chiamakac/IGBONLP/master/ig_en_mt/benchmark_dataset/train.en
!wget -c https://raw.githubusercontent.com/Chiamakac/IGBONLP/master/ig_en_mt/benchmark_dataset/train.ig
!wget -c https://raw.githubusercontent.com/Chiamakac/IGBONLP/master/ig_en_mt/benchmark_dataset/val.en
!wget -c https://raw.githubusercontent.com/Chiamakac/IGBONLP/master/ig_en_mt/benchmark_dataset/val.ig

In [None]:
# English splits to concatenate, in a fixed order
filenames = ['/content/test.en', '/content/train.en', '/content/val.en']

# Open the combined output file in write mode
with open('english.txt', 'w', encoding='utf-8') as outfile:
    for name in filenames:
        # Read one split and append it to the combined file
        with open(name, encoding='utf-8') as infile:
            text = infile.read()
        outfile.write(text)
        # Add a newline only if the split does not already end with one,
        # so that no blank line appears between splits
        if text and not text.endswith('\n'):
            outfile.write('\n')

In [None]:
# Igbo splits to concatenate, in the same order as the English splits
filenames = ['/content/test.ig', '/content/train.ig', '/content/val.ig']

# Open the combined output file in write mode
with open('igbo.txt', 'w', encoding='utf-8') as outfile:
    for name in filenames:
        # Read one split and append it to the combined file
        with open(name, encoding='utf-8') as infile:
            text = infile.read()
        outfile.write(text)
        # Add a newline only if the split does not already end with one,
        # so that no blank line appears between splits
        if text and not text.endswith('\n'):
            outfile.write('\n')

In [None]:
# Read the combined English file and split it into one sentence per line.
# The file is closed automatically when the with-block ends.
with open("/content/english.txt", "r", encoding="utf-8") as en_file:
    ensents = en_file.read().split("\n")
# print(ensents)


In [None]:
# Helper that converts a list into a tuple.
# It is not needed for the alignment; ensents stays a list.
def convert(sentences):
    return tuple(sentences)

# print(convert(ensents))


In [None]:
# Read the combined Igbo file and split it into one sentence per line.
# The file is closed automatically when the with-block ends.
with open("/content/igbo.txt", "r", encoding="utf-8") as ig_file:
    igsents = ig_file.read().split("\n")
# print(igsents)

In [None]:
# Helper that converts a list into a tuple.
# It is not needed for the alignment; igsents stays a list.
def convert(sentences):
    return tuple(sentences)

# print(convert(igsents))

In [None]:
# zip() pairs the two lists position by position;
# list(zip(a, b)) gives a list of (a_i, b_i) tuples.
en_ig_sents = list(zip(ensents, igsents))
print(en_ig_sents[:5])  # preview the first few pairs instead of the whole corpus

For each `(src, tgt)` pair in `en_ig_sents`, perform the alignment shown in the next code cell.
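
A rough sketch of such a loop is given below. It assumes the per-pair alignment is wrapped in a callable that returns a set of `(source_word_index, target_word_index)` tuples. The names `align_corpus` and `diagonal_align` are hypothetical, and `diagonal_align` is only a stand-in so the loop can be run without loading mBERT. Each result is written as `i-j` links, the widely used Pharaoh format.

```python
def align_corpus(sentence_pairs, align_fn):
    """Apply align_fn to every pair and return one 'i-j ...' line per pair."""
    lines = []
    for src, tgt in sentence_pairs:
        if not src.strip() or not tgt.strip():
            lines.append("")  # keep output lines in step with the input
            continue
        links = align_fn(src, tgt)
        lines.append(" ".join(f"{i}-{j}" for i, j in sorted(links)))
    return lines


# Stand-in aligner for illustration only: links words at equal positions.
def diagonal_align(src, tgt):
    n = min(len(src.split()), len(tgt.split()))
    return {(k, k) for k in range(n)}


demo_lines = align_corpus([("a b c", "x y"), ("", "z")], diagonal_align)
```

Skipping empty sentences while still emitting a blank line keeps the output aligned line by line with the input files.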

Run the model and print the resulting alignments.
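
The pre-processing in the next cell splits each word into subwords and records, for every subword, the index of the word it came from, so that subword-level links can later be projected back to words. The toy example below illustrates only that bookkeeping. `toy_tokenize` is a hypothetical stand-in that cuts words into fixed-size pieces; it is not the mBERT tokenizer.

```python
def toy_tokenize(word, piece_len=3):
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks."""
    pieces = [word[k:k + piece_len] for k in range(0, len(word), piece_len)]
    # Mark continuation pieces with '##', as WordPiece vocabularies do.
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


words = "alignment is fun".split()
subwords = [toy_tokenize(w) for w in words]

# One word index per subword, in order.
sub2word = []
for word_idx, pieces in enumerate(subwords):
    sub2word += [word_idx] * len(pieces)
```

Here `subwords` holds three pieces for the first word and one piece each for the other two, so `sub2word` repeats index 0 three times before moving on to 1 and 2.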

In [None]:
# pre-processing
# Pick one sentence pair to align; the first pair serves as the example here.
src, tgt = en_ig_sents[0]
sent_src, sent_tgt = src.strip().split(), tgt.strip().split()
# Split every word into subwords and convert the subwords to vocabulary ids.
token_src = [tokenizer.tokenize(word) for word in sent_src]
token_tgt = [tokenizer.tokenize(word) for word in sent_tgt]
wid_src = [tokenizer.convert_tokens_to_ids(x) for x in token_src]
wid_tgt = [tokenizer.convert_tokens_to_ids(x) for x in token_tgt]
# Flatten the ids and add the special tokens, truncating to the model's maximum length.
ids_src = tokenizer.prepare_for_model(list(itertools.chain(*wid_src)), return_tensors='pt', model_max_length=tokenizer.model_max_length, truncation=True)['input_ids']
ids_tgt = tokenizer.prepare_for_model(list(itertools.chain(*wid_tgt)), return_tensors='pt', model_max_length=tokenizer.model_max_length, truncation=True)['input_ids']
# Map each subword position back to the index of the word it belongs to.
sub2word_map_src = []
for i, word_list in enumerate(token_src):
  sub2word_map_src += [i for x in word_list]
sub2word_map_tgt = []
for i, word_list in enumerate(token_tgt):
  sub2word_map_tgt += [i for x in word_list]

# alignment
align_layer = 8
threshold = 1e-3
model.eval()
with torch.no_grad():
  out_src = model(ids_src.unsqueeze(0), output_hidden_states=True)[2][align_layer][0, 1:-1]
  out_tgt = model(ids_tgt.unsqueeze(0), output_hidden_states=True)[2][align_layer][0, 1:-1]

  dot_prod = torch.matmul(out_src, out_tgt.transpose(-1, -2))

  softmax_srctgt = torch.nn.Softmax(dim=-1)(dot_prod)
  softmax_tgtsrc = torch.nn.Softmax(dim=-2)(dot_prod)

  softmax_inter = (softmax_srctgt > threshold)*(softmax_tgtsrc > threshold)

align_subwords = torch.nonzero(softmax_inter, as_tuple=False)
align_words = set()
for i, j in align_subwords:
  align_words.add( (sub2word_map_src[i], sub2word_map_tgt[j]) )

# printing
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

for i, j in sorted(align_words):
  print(f'{color.BOLD}{color.BLUE}{sent_src[i]}{color.END}==={color.BOLD}{color.RED}{sent_tgt[j]}{color.END}')
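
The extraction step in the cell above keeps a subword link only when its softmax probability exceeds the threshold in both directions (source to target and target to source). The sketch below reproduces that intersection in plain Python on a hand-made similarity matrix, so it can be followed without the model. The names `softmax`, `intersect_alignments`, and `toy_sim` are illustrative and not part of awesome-align.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def intersect_alignments(sim, threshold=1e-3):
    """Return (i, j) links that pass the threshold in both directions."""
    n_src, n_tgt = len(sim), len(sim[0])
    # Row-wise softmax: each source item distributes mass over target items.
    src2tgt = [softmax(row) for row in sim]
    # Column-wise softmax: each target item distributes mass over source items.
    tgt2src = [softmax([sim[i][j] for i in range(n_src)]) for j in range(n_tgt)]
    return {
        (i, j)
        for i in range(n_src)
        for j in range(n_tgt)
        if src2tgt[i][j] > threshold and tgt2src[j][i] > threshold
    }


# Hand-made similarity scores: item 0 matches item 0, item 1 matches item 1.
toy_sim = [[20.0, 0.0], [0.0, 20.0]]
toy_links = intersect_alignments(toy_sim)
```

With these scores the off-diagonal probabilities fall far below `1e-3`, so only the two diagonal links survive the intersection.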