<a href="https://colab.research.google.com/github/felerminoali/Emakhuwa-to-Portuguese-Corpus/blob/main/report_ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Project Report**

***Title: Distant Supervision and Transfer Learning for Emakhuwa NER***.

**Facts About Emakhuwa**

1.   It is widely spoken in northern and central Mozambique by around 6 million people.
2.   It is a Bantu language.
3.   Emakhuwa speakers find it difficult to pronounce the plosive consonants b, d and g, so they substitute the approximate sounds p, t and c, respectively. Refer to (https://ria.ua.pt/handle/10773/14249).
4.   Emakhuwa has been enriched by borrowing many words from Portuguese. In particular, locations and names are written or pronounced similarly in both languages. For example: Portuguese - roma, Emakhuwa - oroma; Portuguese - éfeso, Emakhuwa - wefeso.
5.   In Emakhuwa, the letter "o" or "w" expresses a circumstantial complement of place, i.e., it is added at the beginning of a word representing a location or place. "o" is used when the place name starts with a consonant, whereas "w" is used when it starts with a vowel. For instance, the sentence "I am in Maputo/Alemanha" would be translated as "Kiri **o**Maputo/**w**Alemanha".
6.   Emakhuwa has 8 variants. The ISO code for the widely spoken variant of Emakhuwa is "vmw".
7.   There is no annotated data for the Emakhuwa language.
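
The locative prefix rule from the list above can be sketched in a few lines of Python. This is only an illustration: the function name `add_locative_prefix` is hypothetical, and treating an accented first letter (such as "é") as a vowel is an assumption made here, not something stated in the facts above.

```python
import unicodedata

VOWELS = set("aeiou")

def add_locative_prefix(place):
    """Prefix a place name with 'w' (vowel-initial) or 'o' (consonant-initial)."""
    if not place:
        return place
    # Assumption: strip any accent from the first letter before testing it,
    # so that "é" counts as the vowel "e".
    first = unicodedata.normalize("NFD", place[0])[0].lower()
    return ("w" if first in VOWELS else "o") + place

print(add_locative_prefix("Maputo"))    # oMaputo
print(add_locative_prefix("Alemanha"))  # wAlemanha
```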



**Objective: Apply distant supervision to automatically generate annotations for Named Entity Recognition (NER) in the Emakhuwa language**


**Dataset**

**Distant supervision** will be applied to a parallel corpus of Portuguese (pt) and Emakhuwa (vmw) containing 47415 sentences. Please refer to (https://arxiv.org/pdf/2104.05753.pdf) for more information.

The dataset is kept in two folders, one for Portuguese and the other for Emakhuwa: "/content/drive/MyDrive/NLP/NER/folds-pt/" and "/content/drive/MyDrive/NLP/NER/folds-vmw/", respectively. Each folder contains text files of at most 100 sentences each. The files are named with multiples of 100, i.e., the first file is "100", the next is "200", and so on. The last file is named "47400".
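
As a small sketch of the naming convention just described, the paired file paths can be generated rather than typed by hand. The helper name `fold_paths` and its parameters are hypothetical; the folder names and the "100" to "47400" range come from the description above.

```python
from pathlib import Path

BASE = Path("/content/drive/MyDrive/NLP/NER")

def fold_paths(last=47400, step=100):
    """Yield (Portuguese file, Emakhuwa file) path pairs: 100, 200, ..., last."""
    for n in range(step, last + 1, step):
        yield BASE / "folds-pt" / str(n), BASE / "folds-vmw" / str(n)

pairs = list(fold_paths())
print(len(pairs))        # number of file pairs
print(pairs[0][0].name)  # name of the first Portuguese file
```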

# Loading the Dataset

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [11]:
#TODO: Skip for retrain
import pandas as pd

path = '/content/drive/MyDrive/NLP/NER/'
# Load the parallel text files into a DataFrame
source_file = path + 'folds-pt/100' 
target_file = path + 'folds-vmw/100'
source = []
target = []
skip_lines = []  # Collect the line numbers of the source portion to skip the same lines for the target portion.
with open(source_file) as f:
    for i, line in enumerate(f):
      source.append(line.strip())
                 
with open(target_file) as f:
    for j, line in enumerate(f):
        target.append(line.strip())
    
print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(skip_lines), i))
    
df = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])
# If you get "TypeError: data argument can't be an iterator", your pandas version may not accept a zip object; run the line below instead
#df = pd.DataFrame(list(zip(source, target)), columns=['source_sentence', 'target_sentence'])
df.head(10)

Loaded data and skipped 0/100 lines since contained in test set.


| | source_sentence | target_sentence |
|---|---|---|
| 0 | dessa forma , a morte , o sepultamento e a res... | tivonto okhwa , ovithiwa ni ohihimuxiwa wa yes... |
| 1 | confirmada por muitas testemunhas | aahooniwa ni atthu anceene |
| 2 | o que ajuda os cristãos a ter certeza de que j... | etthu xeeni enaakupaliha makristau wira yesu a... |
| 3 | para acreditarmos que vai haver uma ressurreiç... | wira nikupali wira onookhala ohihimuxiwa nihaa... |
| 4 | por que podemos ter certeza de que isso aconte... | etthu xeeni ennikupaliha wira yesu aahihihimux... |
| 5 | a primeira testemunha mencionada por paulo foi... | mutthu oopacerya onweha yesu onilavuliwa ni pa... |
| 6 | um grupo de discípulos confirmou que pedro tin... | nuumala-vo ekrupu ya awiixutti yaahihimya wira... |
| 7 | além disso , os doze , ou seja , os apóstolos ... | moottharelana paulo ohimmye wira nuumala yesu ... |
| 8 | daí , cristo apareceu a mais de 500 irmãos de ... | nuumala-vo kristu ahìsoniherya-tho okathi omos... |
| 9 | jesus também apareceu a tiago . | woonasa wene waari okathi yoole yesu aapanke a... |


# Using spaCy for NER on the Portuguese corpus

In [1]:
!pip install -U pip setuptools wheel



In [2]:
# installing spacy
!pip install -U spacy



In [3]:
!pip install -U spacy-lookups-data



In [9]:
!python -m spacy download pt_core_news_sm

Collecting pt-core-news-sm==3.2.0
  Using cached https://github.com/explosion/spacy-models/releases/download/pt_core_news_sm-3.2.0/pt_core_news_sm-3.2.0-py3-none-any.whl (22.2 MB)
✔ Download and installation successful
You can now load the package via spacy.load('pt_core_news_sm')


In [23]:
import spacy
from spacy import displacy

def ner_on_pt_test():
  nlp = spacy.load('pt_core_news_sm')

  # Use only the first file ("100") as a sample
  with open('/content/drive/MyDrive/NLP/NER/folds-pt/100', encoding='utf-8') as f:
    corpus = f.read()
  doc = nlp(corpus)

  # Keep only location (LOC) and person (PER) entities
  detalhes_entidade = [(entidade.orth_, entidade.label_) for entidade in doc.ents if entidade.label_ in ('LOC', 'PER')]
  #print(detalhes_entidade)

  # Highlight the recognized entities inline in the notebook
  displacy.render(doc, jupyter=True, style="ent")

In [24]:
ner_on_pt_test()

**Note**

Although the model provided by spaCy for Portuguese NER is far from perfect, it is able to identify most of the entities in the corpus we provided. For example, it identified "Jesus" and "Paulo" as persons. In other cases it makes mistakes: for instance, "espírito santo" was identified as a location.

**Distant supervision Algorithm**

The algorithm exploits the fact that Emakhuwa borrows names and locations from Portuguese. The idea is to loop over each Portuguese sentence and, for every entity found in it, try to align the entity with its corresponding translation in the parallel Emakhuwa sentence. To match the words we follow two procedures:
1.   **Makhuatize**: transform Portuguese words into an Emakhuwa-like lexicon.
2.   **Phonetic comparison**: in the parallel Emakhuwa sentence, look for the word that sounds most similar to the words obtained from the *Makhuatize* process.
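
To make the alignment step concrete, here is a rough, self-contained sketch. It is not the method used in this report: the candidate spellings are written by hand instead of being produced by the Makhuatize rules, and plain string similarity from the standard library (`difflib.SequenceMatcher`) stands in for the phonetic comparison. The function name `align_entity` and the `threshold` value are assumptions for illustration.

```python
from difflib import SequenceMatcher

def align_entity(candidates, target_tokens, threshold=0.8):
    """Return the target token most similar to any candidate, or None."""
    best_token, best_score = None, 0.0
    for token in target_tokens:
        for candidate in candidates:
            # Stand-in for the phonetic check: character-level similarity in [0, 1]
            score = SequenceMatcher(None, candidate, token.lower()).ratio()
            if score > best_score:
                best_token, best_score = token, score
    return best_token if best_score >= threshold else None

# Hand-written candidates for the Portuguese entity "corinto" (LOC)
candidates = {"ocorinto", "okurintu", "corinto"}
tokens = "atthu akina a epooma ya okorinto".split()
print(align_entity(candidates, tokens))
```

A threshold keeps the function from returning a token when nothing in the sentence resembles the entity, which matters for sentences where the entity was translated rather than borrowed.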




# Makhuatize function

This is a set of rules learned from the Emakhuwa language that try to guess the transformation of a Portuguese word into an Emakhuwa-like lexicon.

In [25]:
import re

def addregex(regex, array):
  # Append a substitution rule to the list and return the list
  array.append(regex)
  return array

def makuwatiwe(ner):

  standards = [{'b' : 'p'}, {'d': 't'}, {'z': 's'}, {'á' : 'a'}, {'ã' : 'a'}, {'à' : 'a'}, {'é' : 'e'}, {'í' : 'i'}, {'ó' : 'o'}, {'ú' : 'u'}, {'ç' : 's'}, {'qu' : 'kh'}]
  regexes = [ 
            standards + [{'j':'y'}] + [{'o':'u'}] + [{'ia':'ya'}] +  [{'r':'ru'}]  , 
            standards + [{'j':'x'}]+ [{'o':'u'}] + [{'ia':'ya'}] +  [{'r':'ru'}]  , 
            standards + [{'o':'u'}] + [{'ia':'ya'}] + [{'r':'ru'}] +  [{'j':'x'}],
            standards + [{'o':'u'}] + [{'ia':'ya'}] + [{'r':'ru'}] +  [{'j':'y'}],
            standards + [{'o': 'ho'}] + [{'ia':'ya'}] + [{'ia':'ya'}] + [{'r':'ru'}] +  [{'j':'x'}],
            standards + [{'o': 'ho'}] + [{'ia':'ya'}] + [{'ia':'ya'}] + [{'r':'ru'}] +  [{'j':'y'}],
            standards + [{'o': 'ho'}] + [{'r':'ru'}] + [{'ia':'ya'}] + [{'r':'ru'}] +  [{'j':'x'}], 
            standards + [{'o': 'ho'}] + [{'r':'ru'}] + [{'ia':'ya'}] + [{'r':'ru'}] +  [{'j':'y'}],
            standards + [{'o': 'ho'}] + [{'g': 'k'}] + [{'ia':'ya'}] + [{'r':'ru'}] +  [{'j':'y'}],
            standards + [{'o': 'ho'}] + [{'g': 'k'}] + [{'ia':'ya'}] + [{'r':'ru'}] +  [{'j':'y'}],
            standards + [{'o': 'ho'}] + [{'g': 'k'}] + [{'ia':'ya'}] + [{'r':'ru'}] +  [{'j':'X'}],
            standards + [{'o': 'ho'}] + [{'g': 'k'}] + [{'ia':'ya'}] + [{'r':'ru'}] +  [{'j':'X'}]    
            ]
  
  # Apply each rule set in order to the lowercased entity text
  suggestions = []
  suggestion = ner[0].lower()
  for regex in regexes:
    for letter in regex:
      for key, value in letter.items(): 
        suggestion = re.sub(key, value, suggestion)    
    suggestions.append(suggestion)
    suggestion = ner[0].lower()

  suggestions.append(ner[0].lower())
  
  # Locations get the prefix 'w' before a vowel and 'o' otherwise
  vowels = ['a', 'e', 'i', 'o', 'u']
  if ner[1] == 'LOC':
    suggestions = ['w'+seg if seg[0] in vowels else 'o'+seg for seg in suggestions]
    suggestions.append(ner[0].lower())

  return set(suggestions)
      

In [26]:
makuwatiwe(('Moçambique', 'LOC'))

{'moçambique', 'omhosampikhe', 'omoçambique', 'omusampikhe'}

In [27]:
makuwatiwe(('América', 'LOC'))

{'américa', 'wameruica', 'wameruuica', 'wamérica'}

In [29]:
makuwatiwe(('Jesus', 'PER'))

{'Xesus', 'jesus', 'xesus', 'yesus'}

# Phonetics comparison

It uses a phonetic algorithm to check whether two words sound similar or not.
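
For intuition about what a phonetic code captures, below is a simplified sketch of the textbook American Soundex algorithm. It is written from the classic description and is not the `pyphonetics` implementation, whose details may differ; dropping every character outside a-z (including accented letters) is a simplification assumed here.

```python
# Consonant groups that share a Soundex digit
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return a 4-character code: first letter plus up to three digits."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    first = letters[0]
    encoded = []
    previous = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of identical codes
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            encoded.append(digit)
        previous = digit  # a vowel resets the run, so a repeated code counts again
    return (first.upper() + "".join(encoded) + "000")[:4]

print(soundex("wefesu"), soundex("wefeso"))  # W120 W120
```

Two words are treated as sounding alike when their codes are equal, which is why vowel changes such as the final "u" versus "o" do not affect the comparison.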

In [31]:
!pip install pyphonetics

Collecting pyphonetics
  Downloading pyphonetics-0.5.3-py2.py3-none-any.whl (10 kB)
Collecting unidecode<2,>=1
  Downloading Unidecode-1.3.2-py3-none-any.whl (235 kB)
Installing collected packages: unidecode, pyphonetics
Successfully installed pyphonetics-0.5.3 unidecode-1.3.2


In [40]:
from pyphonetics import Soundex

def phonetic_comparision(guesses, actual_word):
  soundex = Soundex()
  for guess in guesses:
    print(guess + ' sounds like '+ actual_word + '  --- '+str(soundex.sounds_like(guess, actual_word)))

In [39]:
phonetic_comparision(makuwatiwe(('éfeso', 'LOC')), 'wefeso')

wefesu sounds like wefeso  --- True
oéfeso sounds like wefeso  --- False
éfeso sounds like wefeso  --- False
wefesho sounds like wefeso  --- True


In [41]:
phonetic_comparision(makuwatiwe(('jerusalém', 'LOC')), 'oyerusalemu')

ojerusalém sounds like oyerusalemu  --- False
oyeruuusalem sounds like oyerusalemu  --- True
oyeruusalem sounds like oyerusalemu  --- True
oxeruusalem sounds like oyerusalemu  --- False
oxeruuusalem sounds like oyerusalemu  --- False
oXeruusalem sounds like oyerusalemu  --- False
jerusalém sounds like oyerusalemu  --- False


In [43]:
phonetic_comparision(makuwatiwe(('ezequiel', 'PER')), 'ezekiyeli')

esekhiel sounds like ezekiyeli  --- True
ezequiel sounds like ezekiyeli  --- True


In [44]:
phonetic_comparision(makuwatiwe(('miguel', 'PER')), 'mikhayeli')

miguel sounds like mikhayeli  --- True
mikuel sounds like mikhayeli  --- True


# Putting It All Together

In [45]:
import re

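# Get the start and end index of every appearance of a word in a given sentence
def index_match(word, sentence):
  result = []
  # re.escape makes the word match as literal text rather than as a regex pattern
  for match in re.finditer(re.escape(word), sentence):
    result.append((match.start(), match.end()))
  return result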
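
The character offsets returned by `index_match` are the building blocks of the training annotations. As a hedged sketch, such offsets can be collected into ordinary Python lists and dictionaries and serialized with the standard `json` module, which takes care of quoting and commas. The helper name `find_spans` and the exact layout of the `annotations` list are assumptions made for illustration; the sentence is taken from the corpus.

```python
import json
import re

def find_spans(word, sentence, label):
    """Return [start, end, label] for every literal occurrence of word."""
    return [[m.start(), m.end(), label]
            for m in re.finditer(re.escape(word), sentence)]

sentence = "nto muupuwelo owo waahaahapuxa atthu akina okorinto ."
annotation = [sentence, {"entities": find_spans("okorinto", sentence, "LOC")}]
corpus = {"classes": ["LOC", "PERSON"], "annotations": [annotation]}

# ensure_ascii=False keeps accented characters readable in the output file
text = json.dumps(corpus, ensure_ascii=False)
print(text)
```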

In [46]:
def ner_recognition(docu):
  # Keep only the location entities found by spaCy, as (text, label) pairs
  detalhes_entidade = [(entidade.orth_, entidade.label_)  for entidade in docu.ents if entidade.label_ == 'LOC']
  #detalhes_entidade = [(entidade.orth_, entidade.label_)  for entidade in docu.ents if entidade.label_ == 'PER']
  return detalhes_entidade

In [55]:
import spacy
import re
from pyphonetics import Soundex
nlp = spacy.load('pt_core_news_sm')

file = 100

ners = {}
dictionary = {}
corpus = {}
line_out = []
#while (file <= 47400):
while (file <= 200):
  file_pt = open('/content/drive/MyDrive/NLP/NER/folds-pt/'+str(file))
  file_vmw = open('/content/drive/MyDrive/NLP/NER/folds-vmw/'+str(file))
  
  pt_sentence = file_pt.readlines()
  vmw_sentence = file_vmw.readlines()
  
  i=0
 

  
  while (i < len(pt_sentence) and i < len(vmw_sentence) and (len(vmw_sentence) == len(pt_sentence))): 
    doc_pt = nlp(pt_sentence[i])
    ners.update(ner_recognition(doc_pt))
    
    line_number = (int(file)-100) + i + 2 if int(file) > 100 else i + 1
    
    if len(ners.items())>0:
      corpus[vmw_sentence[i]] = {'entities':[]}
    
    for ner, types in ners.items():
      # Makhuatize: generate Emakhuwa-like candidate spellings
      words = makuwatiwe((ner, types))
      doc_vmw = nlp(vmw_sentence[i])
      soundex = Soundex()
      tokens = [token.orth_ for token in doc_vmw if token.is_alpha]

      for token in tokens:
        for word in words:
          # Phonetic comparison
          if soundex.sounds_like(word, token):
            total =  soundex.distance(ner, token, metric='levenshtein')          
            # print(ner + " "+token + " "+types + " "+str(line_number) + ' '+ str(total)+'\n')
            outcsv = '['           
            for match in index_match(token, vmw_sentence[i]):
              outcsv += str(match[0]) + ', ' + str(match[1]) + ', "'+types+'"],\n'
              corpus[vmw_sentence[i]]['entities']+= [(match[0], match[1], types)]
            

            line_out.append(ner + ", "+token + ", "+types + ", "+str(line_number) + ', '+ str(total)+'\n')        
            dictionary[ner] = list(set(dictionary[ner] + [token])) if ner in dictionary.keys() else [token]
            break

    
    ners = {}
    i += 1

  
  file_pt.close()
  file_vmw.close()
  file += 100


# Display the first 3 lines of the report (fewer if the report is shorter)
print("Displaying 3 line-report")
print("   pt "+ "     vmw "+ " type "+ ' line# '+ ' edit-distance \n')
for report_line in line_out[:3]:
  print(report_line)


corpout = '{ "classes": ["LOC", "PERSON"], "annotations": ['
corpout += '['
for key, values in corpus.items():
  if len(values['entities']) > 0 :
    corpout += '["'+re.sub('[\n]', '', key)+'", { "entities": ['
    for v in values['entities']:
      corpout += "[" + str(v[0])+", "+ str(v[1]) + ', "'+str(v[2])+'"],' 
    corpout += '] }],'
corpout += ']'
corpout += ']}'


print("\nCorpus for spacy NER training")
print(corpout)

print("\nGazetteers generated")
print(dictionary)


# Write the line report
with open("/content/drive/MyDrive/NLP/NER/line-report.txt", "w") as file1:
  file1.writelines(line_out)

# Write the NER training corpus
with open("/content/drive/MyDrive/NLP/NER/ner-corpus.json", "w") as file3:
  file3.write(corpout)

# Format each gazetteer entry as "pt: vmw1, vmw2, . "
dic_out = []
for key, values in dictionary.items():
  l = key+': '
  for value in values:
    l += value + ', '
  l += '. \n'
  dic_out.append(l)

# Write the gazetteers
with open("/content/drive/MyDrive/NLP/NER/gazetteers.txt", "w") as file2:
  file2.writelines(dic_out)


Displaying 3 line-report
   pt      vmw  type  line#  edit-distance 

damasco, odamaasiko, LOC, 17, 3

corinto, okorinto, LOC, 25, 3

corinto, okorinto, LOC, 28, 3


Corpus for spacy NER training
{ "classes": ["LOC", "PERSON"], "annotations": [[["okathi aarowa awe odamaasiko paulo wala saulo aahiiwa nsu na yesu ni aahoonihiwa kristu ori ene wiirimu .", { "entities": [[18, 28, "LOC"],] }],["atthu akina a epooma ya okorinto , yaahikhalana moonelo woohiloka voohimya sa ohihimuxiwa .", { "entities": [[24, 32, "LOC"],] }],["nto muupuwelo owo waahaahapuxa atthu akina okorinto .", { "entities": [[43, 51, "LOC"],] }],["awiixutti akina yaarina nlipelelo na orowa okhalaka wiirimu ti tome , yakobo , lidia , yohani , maria ni paulo .", { "entities": [[79, 84, "LOC"],] }],["nave aalempe so : elapo ela ya wéfeso , miyo kowana ntoko kawana n inama sowali .", { "entities": [[31, 37, "LOC"],] }],["woonasa wene , paulo aahimya inama soowali aawananne awe warena epooma ya weefeso .", { "entities": [[74, 