### Train Lingala Embeddings

In this notebook, we will try to learn Lingala word embeddings using fastText and character-based word embeddings.

We will use four datasets collected from different sources:

- The JW300 Lingala dataset.
- The Lingala news dataset, collected from the VOA Lingala website and Karisma TV.
- The Lingala PDF corpus, collected from assorted Lingala PDFs such as the DRC constitution and other documents from related domains.
- The song lyrics dataset, with lyrics of around 20 Lingala songs scraped online.

The data was cleaned using different cleaning procedures, which can be found in separate notebooks.

The goal of this research is to apply these embedding techniques to learn Lingala word vectors and see whether the word similarities produced by the models make sense to native speakers.

#### I. Reading the dataset

In [1]:
from pathlib import Path
import re

In [2]:
data_path = Path.cwd().parent.joinpath("data", "processed")

In [3]:
corpus_files = data_path.glob("*.ln")

In [4]:
list(corpus_files)

[PosixPath('/Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/data/processed/news.lingala.ln'),
 PosixPath('/Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/data/processed/from_pdf_cleanned.ln'),
 PosixPath('/Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/data/processed/JW.ln'),
 PosixPath('/Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/data/processed/songs_lingala.ln')]

In [None]:
We can do better by ignoring the JW corpus.

In [5]:
language_pattern = r".*([a-zA-Z]+)\.ln"  # raw string, so that \. is not treated as an invalid string escape
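A quick way to check what `language_pattern` selects is to run it against the four file names listed above. The sketch below uses only `re`. `jw_exclude_pattern` is a hypothetical name, not part of the original notebook, for a pattern that would leave the JW file out. gensim's `TextDirectoryCorpus` also accepts an `exclude_pattern` argument where such a pattern could be passed, but that has not been tried here.

```python
import re

# File names taken from the directory listing above.
file_names = ["news.lingala.ln", "from_pdf_cleanned.ln", "JW.ln", "songs_lingala.ln"]

language_pattern = re.compile(r".*([a-zA-Z]+)\.ln")
# Hypothetical pattern for skipping the JW corpus (an assumption, not in the notebook).
jw_exclude_pattern = re.compile(r"JW\.ln")

# Files selected by the inclusion pattern.
matched = [name for name in file_names if language_pattern.match(name)]
# Files left once the JW corpus is excluded.
kept = [name for name in matched if not jw_exclude_pattern.match(name)]

print(matched)
print(kept)
```

All four files match the inclusion pattern, so dropping the JW corpus needs an explicit exclusion like the one sketched here.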

In [6]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

In [7]:
from nltk.corpus.reader.api import StreamBackedCorpusView

In [8]:
import random 

In [9]:
from nltk.corpus.reader.util import concat

In [10]:
from gensim.corpora.textcorpus import TextDirectoryCorpus
from gensim.utils import deaccent

In [11]:
class CustomCorpusReader(TextDirectoryCorpus):
    def raw(self):
        """
        :return: the given file(s) as a single string.
        :rtype: str
        """
        raw_texts = []
        for file in self.iter_filepaths():
            print(file, 10 * "88=")
            with open(file, 'r') as file_content:
                raw_texts.append(file_content.read())
                
        return concat(raw_texts)
    
    
    def getstream(self):
        """Generate documents from the underlying plain text collection (of one or more files).

        Yields
        ------
        str
            One document (if lines_are_documents - True), otherwise - each file is one document.

        """
        num_texts = 0
        for path in self.iter_filepaths():
            with open(path, 'rt') as f:
                if self.lines_are_documents:
                    for line in f:
                        line = line.replace("ɛ", 'e') # this character is not handled by deaccent
                        yield deaccent(line.strip())
                        num_texts += 1
                else:
                    yield f.read().strip()
                    num_texts += 1

        self.length = num_texts
            
    def save_text(self, path):
        """
        save the corpus text to the given path
        """
        raw_text = self.raw()
        with open(path.joinpath("ln.txt"), 'w') as output:
            output.write(raw_text)

In [12]:
lingala_corpus = CustomCorpusReader(data_path, pattern=language_pattern, lines_are_documents=True)

This is the number of documents in the corpus {{lingala_corpus.dictionary.num_docs}}

And the vocabulary size is {{len(lingala_corpus.dictionary.token2id)}}



In [13]:
deaccent("botɛmɛli")

'botɛmɛli'
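`deaccent` leaves `ɛ` untouched because it strips combining accent marks, and `ɛ` is a separate letter with no Unicode decomposition. The same holds for `ɔ`. The sketch below shows one way to handle both with the standard library only. `normalize_lingala` and `OPEN_VOWELS` are hypothetical names, and mapping `ɔ` to `o` is an assumption (the notebook itself only replaces `ɛ`).

```python
import unicodedata

# Open vowels that have no accent-free decomposition in Unicode.
# Mapping them to "e"/"o" is an assumption made for this sketch.
OPEN_VOWELS = str.maketrans({"ɛ": "e", "ɔ": "o", "Ɛ": "E", "Ɔ": "O"})


def normalize_lingala(text):
    """Replace open vowels, then strip combining accent marks."""
    text = text.translate(OPEN_VOWELS)
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers the combining marks that carry accents and tones.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


print(normalize_lingala("botɛmɛli"))
print(normalize_lingala("lobɔkɔ"))
```

Whether collapsing `ɛ`/`ɔ` into `e`/`o` is desirable depends on the task, since it merges words that differ only by vowel quality.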

In [14]:
for text in lingala_corpus.sample_texts(5):
    print(text)

['yehova', 'alingi', 'ete', 'bato', 'mabota', 'nyonso', 'bayeba', 'mikano']
['yango', 'tokomaki', 'liste', 'nkombo', 'moto', 'nyonso', 'lisanga', 'oyo', 'ndako', 'ebebisamaki']
['ouganda', 'tokokuta', 'likambo', 'oyo', 'emonanaka', 'bikolo', 'mingi', 'okoki', 'kolongwa', 'esika', 'molunge', 'mpe', 'nsima', 'kotambola', 'kaka', 'mwa', 'moke', 'okomi', 'esika', 'malili', 'makasi']
['longola', 'kobondela', 'mpo', 'kosenga', 'elimo', 'yango', 'tosengeli', 'komileisa', 'mpenza', 'liloba', 'nzambe', 'oyo', 'ekomamaki', 'litambwisi', 'elimo', 'santu']
['ntango', 'nayebaki', 'yango', 'nazalaki', 'lisusu', 'kobanga', 'nzambe', 'ndenge', 'mabe']


In [15]:
# save the corpus text now

In [16]:
outpath = Path.cwd().parent.parent.joinpath('lacuna_pos_ner', 'language_corpus', 'ln')

In [17]:
outpath

PosixPath('/Users/es.py/Projects/Personal/lacuna_pos_ner/language_corpus/ln')

In [18]:
lingala_corpus.save_text(outpath)

/Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/data/processed/news.lingala.ln 88=88=88=88=88=88=88=88=88=88=
/Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/data/processed/from_pdf_cleanned.ln 88=88=88=88=88=88=88=88=88=88=
/Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/data/processed/JW.ln 88=88=88=88=88=88=88=88=88=88=
/Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/data/processed/songs_lingala.ln 88=88=88=88=88=88=88=88=88=88=


In [19]:
lingala_corpus.length

615294

In [20]:
import logging

# Enable logging at the `INFO` level and set a custom format--the
# default log format is pretty wordy. 
logging.basicConfig(
    format='%(asctime)s : %(message)s', # Display just time and message.
    datefmt='%H:%M:%S', # Display time, but not the date.
    level=logging.INFO)


### III. Building Embeddings 

With our corpus ready, it is now time to build our embeddings.

In [21]:
from gensim.models import FastText

In [22]:
from multiprocessing import cpu_count
cpu_cores = cpu_count()

In [23]:
fasttext_model = FastText(
    sentences=None,     # Don't provide the sentences yet, otherwise
                        # training kicks off automatically.
    size=100,           # Number of features in each word vector
                        # (named `vector_size` in gensim >= 4.0).
    window=10,          # Context window size in each direction (default is 5).
    min_count=2,        # Words must appear this many times to be in the vocab (default is 5).
    workers=cpu_cores,  # Training thread count.
    sg=1,               # 0: CBOW, 1: skip-gram (default is 0, CBOW).
    hs=1,               # 1: use hierarchical softmax (default is 0).
    negative=5,         # Number of negative samples (default is 5). With hs=1 and
                        # negative > 0, both objectives are used during training.
    sample=1e-3,        # Threshold for subsampling frequent words.
    word_ngrams=1,      # Turn on character n-grams.
    min_n=5,            # Min character n-gram length (default is 3).
    max_n=10,           # Max character n-gram length (default is 6).
    bucket=2000000,     # Number of buckets for the n-gram hash table.
)

20:45:11 : resetting layer weights
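To get a feel for what `min_n=5` and `max_n=10` mean, the sketch below lists the character n-grams of a word in the fastText style, with `<` and `>` as word-boundary markers. `char_ngrams` is a hypothetical helper written for illustration. It is not gensim's implementation, which additionally hashes each n-gram into one of the `bucket` slots.

```python
def char_ngrams(word, min_n, max_n):
    """Return all character n-grams of `word`, including boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(wrapped) - n + 1)
    ]


# Settings used in this notebook.
print(char_ngrams("loba", 5, 10))
# Library defaults, for comparison.
print(char_ngrams("loba", 3, 6))
```

With `min_n=5`, a short word such as `loba` yields only three n-grams (`<loba`, `loba>` and `<loba>`), against ten with the default range of 3 to 6. Short Lingala roots therefore share fewer subwords with their inflected forms under the settings used here, which may be worth revisiting.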


In [24]:
sentences = list(lingala_corpus.get_texts())

In [25]:
%%time
fasttext_model.build_vocab(
    sentences, 
    progress_per=20000
)

20:46:25 : collecting all words and their counts
20:46:25 : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
20:46:26 : PROGRESS: at sentence #20000, processed 376766 words, keeping 24962 word types
20:46:26 : PROGRESS: at sentence #40000, processed 633878 words, keeping 31992 word types
20:46:26 : PROGRESS: at sentence #60000, processed 896471 words, keeping 36212 word types
20:46:26 : PROGRESS: at sentence #80000, processed 1121251 words, keeping 38934 word types
20:46:26 : PROGRESS: at sentence #100000, processed 1344700 words, keeping 41143 word types
20:46:26 : PROGRESS: at sentence #120000, processed 1604725 words, keeping 44701 word types
20:46:26 : PROGRESS: at sentence #140000, processed 1871174 words, keeping 47583 word types
20:46:26 : PROGRESS: at sentence #160000, processed 2136724 words, keeping 49722 word types
20:46:26 : PROGRESS: at sentence #180000, processed 2398166 words, keeping 51801 word types
20:46:26 : PROGRESS: at sentence #200000, processed 2

CPU times: user 20.2 s, sys: 2.25 s, total: 22.5 s
Wall time: 22.7 s


In [26]:
lingala_corpus.length

615294

In [27]:
print('Training the model...')
fasttext_model.train(
    sentences,
    total_examples=len(sentences),
    epochs=50,        # How many training passes to take.
    report_delay=10.0 # Report progress every 10 seconds.
)

print('  Done.')
print('')

20:46:44 : training model with 8 workers on 42417 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=10


Training the model...


20:46:45 : EPOCH 1 - PROGRESS: at 0.64% examples, 67683 words/s, in_qsize 15, out_qsize 0
20:46:55 : EPOCH 1 - PROGRESS: at 21.52% examples, 123991 words/s, in_qsize 16, out_qsize 0
20:47:05 : EPOCH 1 - PROGRESS: at 42.05% examples, 125138 words/s, in_qsize 16, out_qsize 0
20:47:16 : EPOCH 1 - PROGRESS: at 61.99% examples, 122352 words/s, in_qsize 15, out_qsize 0
20:47:26 : EPOCH 1 - PROGRESS: at 85.29% examples, 122077 words/s, in_qsize 14, out_qsize 1
20:47:32 : worker thread finished; awaiting finish of 7 more threads
20:47:32 : worker thread finished; awaiting finish of 6 more threads
20:47:32 : worker thread finished; awaiting finish of 5 more threads
20:47:32 : worker thread finished; awaiting finish of 4 more threads
20:47:32 : worker thread finished; awaiting finish of 3 more threads
20:47:32 : worker thread finished; awaiting finish of 2 more threads
20:47:32 : worker thread finished; awaiting finish of 1 more threads
20:47:32 : worker thread finished; awaiting finish of 0 mor

20:52:36 : EPOCH 8 - PROGRESS: at 70.33% examples, 135343 words/s, in_qsize 14, out_qsize 1
20:52:46 : EPOCH 8 - PROGRESS: at 98.78% examples, 138535 words/s, in_qsize 10, out_qsize 0
20:52:46 : worker thread finished; awaiting finish of 7 more threads
20:52:46 : worker thread finished; awaiting finish of 6 more threads
20:52:46 : worker thread finished; awaiting finish of 5 more threads
20:52:46 : worker thread finished; awaiting finish of 4 more threads
20:52:46 : worker thread finished; awaiting finish of 3 more threads
20:52:46 : worker thread finished; awaiting finish of 2 more threads
20:52:46 : worker thread finished; awaiting finish of 1 more threads
20:52:46 : worker thread finished; awaiting finish of 0 more threads
20:52:46 : EPOCH - 8 : training on 7585746 raw words (5833891 effective words) took 42.1s, 138619 effective words/s
20:52:47 : EPOCH 9 - PROGRESS: at 0.64% examples, 62166 words/s, in_qsize 16, out_qsize 0
20:52:57 : EPOCH 9 - PROGRESS: at 22.36% examples, 127871 

20:57:42 : worker thread finished; awaiting finish of 5 more threads
20:57:42 : worker thread finished; awaiting finish of 4 more threads
20:57:42 : worker thread finished; awaiting finish of 3 more threads
20:57:42 : worker thread finished; awaiting finish of 2 more threads
20:57:42 : worker thread finished; awaiting finish of 1 more threads
20:57:42 : worker thread finished; awaiting finish of 0 more threads
20:57:42 : EPOCH - 15 : training on 7585746 raw words (5831917 effective words) took 42.0s, 138765 effective words/s
20:57:43 : EPOCH 16 - PROGRESS: at 0.63% examples, 62112 words/s, in_qsize 14, out_qsize 1
20:57:54 : EPOCH 16 - PROGRESS: at 22.96% examples, 131206 words/s, in_qsize 16, out_qsize 0
20:58:04 : EPOCH 16 - PROGRESS: at 45.70% examples, 135025 words/s, in_qsize 15, out_qsize 0
20:58:14 : EPOCH 16 - PROGRESS: at 71.40% examples, 137984 words/s, in_qsize 16, out_qsize 1
20:58:23 : worker thread finished; awaiting finish of 7 more threads
20:58:23 : worker thread finis

21:02:42 : EPOCH 23 - PROGRESS: at 0.64% examples, 62379 words/s, in_qsize 16, out_qsize 0
21:02:52 : EPOCH 23 - PROGRESS: at 22.96% examples, 131677 words/s, in_qsize 15, out_qsize 0
21:03:02 : EPOCH 23 - PROGRESS: at 45.93% examples, 135926 words/s, in_qsize 15, out_qsize 0
21:03:12 : EPOCH 23 - PROGRESS: at 72.31% examples, 139674 words/s, in_qsize 15, out_qsize 0
21:03:22 : worker thread finished; awaiting finish of 7 more threads
21:03:22 : worker thread finished; awaiting finish of 6 more threads
21:03:22 : worker thread finished; awaiting finish of 5 more threads
21:03:22 : worker thread finished; awaiting finish of 4 more threads
21:03:22 : worker thread finished; awaiting finish of 3 more threads
21:03:22 : worker thread finished; awaiting finish of 2 more threads
21:03:22 : worker thread finished; awaiting finish of 1 more threads
21:03:22 : worker thread finished; awaiting finish of 0 more threads
21:03:22 : EPOCH - 23 : training on 7585746 raw words (5834247 effective words

21:08:09 : worker thread finished; awaiting finish of 1 more threads
21:08:09 : worker thread finished; awaiting finish of 0 more threads
21:08:09 : EPOCH - 30 : training on 7585746 raw words (5833337 effective words) took 41.0s, 142421 effective words/s
21:08:10 : EPOCH 31 - PROGRESS: at 0.63% examples, 64347 words/s, in_qsize 16, out_qsize 0
21:08:20 : EPOCH 31 - PROGRESS: at 22.36% examples, 128760 words/s, in_qsize 15, out_qsize 0
21:08:30 : EPOCH 31 - PROGRESS: at 42.89% examples, 127673 words/s, in_qsize 15, out_qsize 0
21:08:40 : EPOCH 31 - PROGRESS: at 68.67% examples, 134020 words/s, in_qsize 15, out_qsize 0
21:08:50 : EPOCH 31 - PROGRESS: at 97.77% examples, 138146 words/s, in_qsize 14, out_qsize 1
21:08:51 : worker thread finished; awaiting finish of 7 more threads
21:08:51 : worker thread finished; awaiting finish of 6 more threads
21:08:51 : worker thread finished; awaiting finish of 5 more threads
21:08:51 : worker thread finished; awaiting finish of 4 more threads
21:08:

21:22:23 : EPOCH 38 - PROGRESS: at 46.06% examples, 136085 words/s, in_qsize 15, out_qsize 0
21:22:33 : EPOCH 38 - PROGRESS: at 72.62% examples, 140002 words/s, in_qsize 15, out_qsize 0
21:22:42 : worker thread finished; awaiting finish of 7 more threads
21:22:42 : worker thread finished; awaiting finish of 6 more threads
21:22:42 : worker thread finished; awaiting finish of 5 more threads
21:22:42 : worker thread finished; awaiting finish of 4 more threads
21:22:42 : worker thread finished; awaiting finish of 3 more threads
21:22:42 : worker thread finished; awaiting finish of 2 more threads
21:22:42 : worker thread finished; awaiting finish of 1 more threads
21:22:43 : worker thread finished; awaiting finish of 0 more threads
21:22:43 : EPOCH - 38 : training on 7585746 raw words (5833382 effective words) took 40.8s, 143136 effective words/s
21:22:44 : EPOCH 39 - PROGRESS: at 0.63% examples, 64520 words/s, in_qsize 14, out_qsize 1
21:22:54 : EPOCH 39 - PROGRESS: at 23.10% examples, 13

21:48:13 : worker thread finished; awaiting finish of 3 more threads
21:48:13 : worker thread finished; awaiting finish of 2 more threads
21:48:13 : worker thread finished; awaiting finish of 1 more threads
21:48:13 : worker thread finished; awaiting finish of 0 more threads
21:48:13 : EPOCH - 45 : training on 7585746 raw words (5833858 effective words) took 52.9s, 110241 effective words/s
21:48:15 : EPOCH 46 - PROGRESS: at 0.63% examples, 46783 words/s, in_qsize 16, out_qsize 1
21:48:25 : EPOCH 46 - PROGRESS: at 18.08% examples, 100504 words/s, in_qsize 15, out_qsize 0
21:48:35 : EPOCH 46 - PROGRESS: at 35.32% examples, 104177 words/s, in_qsize 15, out_qsize 0
21:48:45 : EPOCH 46 - PROGRESS: at 57.36% examples, 112212 words/s, in_qsize 13, out_qsize 2
21:48:55 : EPOCH 46 - PROGRESS: at 82.52% examples, 116995 words/s, in_qsize 15, out_qsize 0
21:49:01 : worker thread finished; awaiting finish of 7 more threads
21:49:01 : worker thread finished; awaiting finish of 6 more threads
21:49:

  Done.



In [None]:
# need to come back to this

In [41]:
fasttext_model.wv.most_similar('loba', topn=20)

[('yoka', 0.6531233787536621),
 ('oboyi', 0.639123797416687),
 ('olobi', 0.632038950920105),
 ('yebisa', 0.6240906715393066),
 ('kenda', 0.6033507585525513),
 ('lela', 0.6004784107208252),
 ('kende', 0.5960424542427063),
 ('nga', 0.5856990814208984),
 ('yaka', 0.5788900852203369),
 ('oko', 0.5777283906936646),
 ('olobaki', 0.5737762451171875),
 ('nayo', 0.5720775127410889),
 ('oyebi', 0.5718715190887451),
 ('oyebisa', 0.5716909766197205),
 ('olalisi', 0.571544349193573),
 ('fanda', 0.5672937631607056),
 ('yeba', 0.5636211037635803),
 ('nako', 0.5590487122535706),
 ('olobaka', 0.5553001761436462),
 ('opesi', 0.5536763072013855)]

With the model trained, we now need to evaluate it to see whether training went well. We will evaluate the model on word similarity and see whether we can benchmark it on more words.

In [42]:
model_path = str(Path.cwd().parent.joinpath('models', 'lingala_embeddings_fasttext', 'embedding_50_all_lingala_corpus.bin'))

In [43]:
fasttext_model.save(model_path)

13:33:12 : saving FastText object under /Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/models/embedding_50_all_lingala_corpus.bin, separately None
13:33:12 : storing np array 'vectors_ngrams' to /Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/models/embedding_50_all_lingala_corpus.bin.wv.vectors_ngrams.npy
13:33:16 : not storing attribute vectors_ngrams_norm
13:33:16 : not storing attribute vectors_norm
13:33:16 : not storing attribute vectors_vocab_norm
13:33:16 : not storing attribute buckets_word
13:33:16 : storing np array 'vectors_ngrams_lockf' to /Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/models/embedding_50_all_lingala_corpus.bin.trainables.vectors_ngrams_lockf.npy
13:33:21 : saved /Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/models/embedding_50_all_lingala_corpus.bin


### Evaluation of the Model Results

We will now evaluate the model's results using the [wordsim-353](http://alfonseca.org/eng/research/wordsim353.html) test set to see how the model performs.
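Benchmarks such as WordSim-353 are usually scored with the Spearman rank correlation between human ratings and model similarities for the same word pairs. The following is a minimal sketch of that score using only the standard library. `ranks` and `spearman` are hypothetical helpers, and the numbers in the usage lines are made-up placeholders, not results from this notebook. gensim's `KeyedVectors.evaluate_word_pairs` offers a comparable evaluation from a word-pair file, which would presumably require a Lingala version of the test set.

```python
import math


def ranks(values):
    """Return 1-based ranks of `values`, averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank lists."""
    rx, ry = ranks(xs), ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (sd_x * sd_y)


# Placeholder scores for four word pairs (illustrative values only).
human_scores = [9.0, 7.5, 3.0, 1.0]
model_scores = [0.80, 0.60, 0.65, 0.10]
print(spearman(human_scores, model_scores))
```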

In [107]:
fasttext_model.wv.most_similar('libala', topn=20)

[('babalani', 0.734166145324707),
 ('kobalana', 0.6804149150848389),
 ('mabala', 0.6746077537536621),
 ('kobala', 0.6322838068008423),
 ('abali', 0.6285107135772705),
 ('bafianse', 0.6059749126434326),
 ('abala', 0.5939195156097412),
 ('kokabwana', 0.58404541015625),
 ('akobala', 0.5825542211532593),
 ('mobali', 0.5815414190292358),
 ('babalana', 0.57380211353302),
 ('molongani', 0.5736296772956848),
 ('mobalani', 0.5734422206878662),
 ('bonzemba', 0.5729975700378418),
 ('bamoniselana', 0.561253011226654),
 ('balongani', 0.5600625276565552),
 ('aniversere', 0.5555722713470459),
 ('ndai', 0.5520299673080444),
 ('komoniselana', 0.5491830110549927),
 ('balingani', 0.545312762260437)]

In [278]:
fasttext_model.wv.similarity('mokuse', 'munene')

0.1714239
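The value returned by `similarity` is the cosine of the angle between the two word vectors: 1 means the vectors point in the same direction, 0 means they are orthogonal. A minimal sketch of the computation on toy vectors (the vectors are placeholders, not taken from the trained model, and `cosine_similarity` is a hypothetical helper):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 2-dimensional vectors for illustration.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

A score of about 0.17 for `mokuse` and `munene` therefore says the two vectors are close to orthogonal, that is, the model sees little relation between the two words.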

In [354]:
'mayele' in correct_word

False

In [359]:
fasttext_model.wv.most_similar('', topn=25)

[('mapeka', 0.6647143959999084),
 ('abwakeli', 0.5789598822593689),
 ('lipeka', 0.5772056579589844),
 ('natutaki', 0.5740435123443604),
 ('elongi', 0.5642611384391785),
 ('sakosi', 0.5619938969612122),
 ('kopale', 0.5568258762359619),
 ('libenga', 0.5465318560600281),
 ('niati', 0.5460203289985657),
 ('koningisa', 0.5440565347671509),
 ('lokolo', 0.5393606424331665),
 ('libaya', 0.538963794708252),
 ('simisi', 0.535095751285553),
 ('navimbaki', 0.5325905084609985),
 ('loketo', 0.531934380531311),
 ('asimbi', 0.530055582523346),
 ('lisasi', 0.5297955274581909),
 ('elamba', 0.5292395353317261),
 ('lizita', 0.5230260491371155),
 ('avimbivimbi', 0.5224087238311768),
 ('aningisi', 0.5212208032608032),
 ('lobɔkɔ', 0.5183290243148804),
 ('molangi', 0.518175482749939),
 ('fukama', 0.5165168046951294),
 ('epasukaka', 0.515727162361145)]

In [316]:
correct_word = ['mbongo', 'kelasi', 'libala', 'wenge',
                'pesa', 'kosala', 'kulutu', 'kende', 
                'mawa', 'ndeko', 'makila',
                'senga', 'mpasi', 'mawa', 'ekolo', 'jammal', 'kamerhe', 
                'ozuwa', 'lokuta', 'etat', 'depute', 'zala', 'mawa', 'feti', 
                'lata', 'facture', 'nganda', 'mikanda', 'limbisa', 'sila', 
                'bilamba', 'koma', 'loba', 'bana', 'mwinda', 'ndako', 
                'mosala', 'zando', 'zoba', 'beta', 'maladie', 'mpongo', 'nzita',
                'rwanda', 'uganda', 'kabila', 'kelela', 'futa', 'eloko', 'mungwa', 
                'nzembo', 'bilamba', 'somba', 'futa', 'mawa', 'yaka', 'bilei', 'sabuni',
                'mabe', 'zonga', 'masumu', 'kanda', 'mosapi']
incorect_words = ['esengo', 'nzela', 'tika', 'salongo', 'mokuse', 
                  'tiya', 'mbulatari', 'solola', 'mona', 'munene', 
                  'mokuse', 'tanga', 'lelo', 'lobi', 'molayi', 'zonga', 'sengi', 'seka']
not_sure = ['liputa', 'zamba', 'lokuta', 'oyebi', 'losambo', 
            'ezui', 'ndenge', 'yuma', 'mpondu', 'ofele',
            'ndoki', 'nzoto', 'mayele']

In [317]:
len(set(correct_word))

51
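The three lists contain repeated words (`mawa` appears several times in `correct_word`), and some words sit in more than one list (`lokuta` is both in `correct_word` and `not_sure`, `zonga` both in `correct_word` and `incorect_words`). A small sketch for tallying the judgements after deduplication and flagging such conflicts follows. `summarize_judgements` is a hypothetical helper, and the sample below is a short excerpt of the lists above, not the full data.

```python
from collections import Counter


def summarize_judgements(labels):
    """Return the deduplicated size of each list and the words found in several lists."""
    unique = {name: set(words) for name, words in labels.items()}
    seen = Counter(word for words in unique.values() for word in words)
    sizes = {name: len(words) for name, words in unique.items()}
    conflicts = sorted(word for word, count in seen.items() if count > 1)
    return sizes, conflicts


# Short excerpt of the manual judgement lists.
sample = {
    "correct": ["libala", "zonga", "mawa", "mawa", "lokuta"],
    "incorrect": ["zonga", "tanga"],
    "not_sure": ["lokuta", "mayele"],
}
sizes, conflicts = summarize_judgements(sample)
print(sizes)
print(conflicts)
```

Resolving the conflicting words before computing any accuracy figure keeps the three categories mutually exclusive.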

### Where to go from here?

Since the word similarities seem to make sense, we can try using the embeddings in machine translation, or visualize them to check whether they make sense.

In [88]:
from vec2graph import visualize

### Training using character word embeddings

To train character word embeddings, we will use the character-based word embedding model from [this source](https://github.com/Leonard-Xu/CWE).

We can leverage its C code to train the model.

The first step is to update our corpus by adding `<s>` and `</s>` tags to each line of the corpus.
The second step is to train the model using the following command:
    
`./cwe -corpus -output-word cwelp/cbow20/twi_bible.txt -output-char cwelp/char_c20/twi_bible_char.txt -size 100 -window 10 -sample 1e-4 -negative 5 -hs 0 -iter 50 -min-count 1 -cwe-type 4 cbow 1`

I. Updating the corpus

In [143]:
lingala_corpus

<__main__.CustomCorpusReader at 0x11ec80f98>

In [144]:
outpath = Path.cwd().parent.joinpath('data', 'processed')

In [168]:
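def save_text(corpus, path):
    """
    Save the corpus text to the given path, wrapping each line in <s> ... </s> tags.
    """
    # Open with 'w' rather than 'a' so that re-running the cell does not duplicate the corpus.
    with open(path.joinpath("lingala_corpus_with_delimeters.ln"), 'w', encoding='utf-8') as output:
        for tokens in corpus.get_texts():
            line = ' '.join(tokens)
            output.write(f"<s> {line} </s> \n")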

In [170]:
save_text(lingala_corpus, outpath)

In [152]:
lingala_corpus.get_texts()

<generator object TextCorpus.get_texts at 0x13aaa71a8>

Training is done with the following command.

Shouldn't we use the Chinese characters to train?

`../CWE/src/cwe -train ./data/processed/lingala_corpus_with_delimeters.ln   -output-word ./models/lingala_embeddings_cwe/vectors.txt -output-char ./models/lingala_embeddings_cwe/chars.txt -size 100 -window 10 -sample 1e-4 -negative 5 -hs 0 -iter 50 -min-count 1 -cwe-type 4 cbow 1`

Now that we have trained the embeddings, let's see how to load them and check how they perform.

Once training is done, the embeddings are saved as text files, with each word and its vector representation.
Let's now load them and check our similarities.

In [183]:
model_cwe_data_path = Path.cwd().parent.joinpath('models', "lingala_embeddings_cwe", "vectors.txt")
model_cwe_output_data_path = Path.cwd().parent.joinpath('models', "lingala_embeddings_cwe", "embedding_50_all_lingala_cwe_corpus.bin")

In [195]:
import numpy as np

In [184]:
import pandas as pd 
model_cwe_df = pd.read_csv(model_cwe_data_path, header=None, index_col=0, sep='\t', skiprows=3)

In [187]:
model_cwe_df = model_cwe_df.drop(101, axis='columns')

In [191]:
cwe_embedding_dict = model_cwe_df.T.to_dict('list')

In [198]:
from gensim.utils import to_utf8
from smart_open import open as smart_open
from tqdm import tqdm
from gensim.models import KeyedVectors

In [193]:
def save_word2vec_format(fname, vocab, vector_size, binary=True):
    """Store the input-hidden weight matrix in the same format used by the original
    C word2vec-tool, for compatibility.

    Parameters
    ----------
    fname : str
        The file path used to save the vectors in.
    vocab : dict
        The vocabulary of words.
    vector_size : int
        The number of dimensions of word vectors.
    binary : bool, optional
        If True, the data will be saved in binary word2vec format, else it will be saved in plain text.


    """
    
    total_vec = len(vocab)
    with smart_open(fname, 'wb') as fout:
        print(total_vec, vector_size)
        fout.write(to_utf8("%s %s\n" % (total_vec, vector_size)))
        # write the entries in the iteration order of `vocab`
        for word, row in tqdm(vocab.items()):
            if binary:
                row  = np.array(row)
                word = str(word)
                row = row.astype(np.float32)
                # tobytes() replaces the deprecated tostring()
                fout.write(to_utf8(word) + b" " + row.tobytes())
            else:
                fout.write(to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))
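The binary layout written above is simple: a text header with the vocabulary size and the vector size, then for each entry the word, a space, and the raw float32 values. The sketch below round-trips that layout through an in-memory buffer using only the standard library. `write_word2vec_binary` and `read_word2vec_binary` are hypothetical names, the toy vectors are placeholders, and little-endian byte order is an assumption (it matches what NumPy produces on common hardware).

```python
import io
import struct


def write_word2vec_binary(stream, vectors, vector_size):
    """Write a header line, then `word + space + float32 bytes` per entry."""
    stream.write(f"{len(vectors)} {vector_size}\n".encode("utf-8"))
    for word, row in vectors.items():
        payload = struct.pack(f"<{vector_size}f", *row)
        stream.write(word.encode("utf-8") + b" " + payload)


def read_word2vec_binary(stream):
    """Read back the layout produced by write_word2vec_binary."""
    header = stream.readline().decode("utf-8")
    count, vector_size = (int(part) for part in header.split())
    vectors = {}
    for _ in range(count):
        word = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise ValueError("unexpected end of file")
            if ch == b" ":
                break
            if ch != b"\n":  # tolerate a newline left before a word
                word.extend(ch)
        payload = stream.read(4 * vector_size)  # 4 bytes per float32
        vectors[word.decode("utf-8")] = list(struct.unpack(f"<{vector_size}f", payload))
    return vectors


# Toy vectors for illustration only.
toy = {"a": [0.5, -1.25], "b": [2.0, 0.0]}
buffer = io.BytesIO()
write_word2vec_binary(buffer, toy, 2)
buffer.seek(0)
restored = read_word2vec_binary(buffer)
print(restored)
```

Because the word is terminated by a space, tokens containing spaces cannot be stored in this format.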

In [196]:
save_word2vec_format(binary=True, fname=model_cwe_output_data_path, vocab=cwe_embedding_dict, vector_size=100)

 30%|██▉       | 20872/70404 [00:00<00:00, 104366.57it/s]

70404 100


100%|██████████| 70404/70404 [00:00<00:00, 107783.87it/s]


In [199]:
cwe_embedding_model = KeyedVectors.load_word2vec_format(model_cwe_output_data_path, binary=True)

19:22:26 : loading projection weights from /Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/models/lingala_embeddings_cwe/embedding_50_all_lingala_cwe_corpus.bin
19:22:27 : loaded (70404, 100) matrix from /Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/models/lingala_embeddings_cwe/embedding_50_all_lingala_cwe_corpus.bin


In [356]:
cwe_embedding_model.most_similar('mayele', topn=25)

[('bwanya', 0.6193627119064331),
 ('bososoli', 0.5501773357391357),
 ('siansi', 0.5497913360595703),
 ('makambo', 0.5102851390838623),
 ('bazoba', 0.49367839097976685),
 ('makanisi', 0.4743432402610779),
 ('istware', 0.4647260904312134),
 ('oyo', 0.45812565088272095),
 ('bafilozofe', 0.4552847146987915),
 ('filozofi', 0.4538910984992981),
 ('bioloji', 0.45348161458969116),
 ('makoki', 0.45278969407081604),
 ('satinover', 0.4410244822502136),
 ('likanisi', 0.43894192576408386),
 ('bongolabongola', 0.43477779626846313),
 ('rips', 0.43345093727111816),
 ('finegan', 0.43294334411621094),
 ('nzokande', 0.4289731979370117),
 ('pisikoloji', 0.4273372292518616),
 ('quirke', 0.4267514646053314),
 ('ete', 0.4220777451992035),
 ('makesenisaki', 0.4190809726715088),
 ('gnostiques', 0.4141996502876282),
 ('mpenzampenza', 0.41356658935546875),
 ('curtin', 0.4102662205696106)]

In [286]:
fasttext_model.wv.most_similar('tanga', topn=25)

[('luka', 0.601347804069519),
 ('bakorinti', 0.588614821434021),
 ('baebre', 0.5854250192642212),
 ('matai', 0.5801754593849182),
 ('yoane', 0.5589373111724854),
 ('mokapo', 0.5534628033638),
 ('emoniseli', 0.547387957572937),
 ('marko', 0.5470973253250122),
 ('baebele', 0.534039318561554),
 ('bakolinti', 0.5234471559524536),
 ('nzembo', 0.512860894203186),
 ('malako', 0.5122451782226562),
 ('bafilipi', 0.5116504430770874),
 ('batesaloniki', 0.5111090540885925),
 ('bagalatia', 0.5006153583526611),
 ('mazwami', 0.49731481075286865),
 ('misala', 0.49448102712631226),
 ('yisaya', 0.49362242221832275),
 ('elobelami', 0.4762505292892456),
 ('baverse', 0.47381070256233215),
 ('malaki', 0.4723062217235565),
 ('ekolendisa', 0.47178348898887634),
 ('bakolose', 0.47125673294067383),
 ('emonisami', 0.4711441993713379),
 ('talela', 0.47108471393585205)]

For now, I found that it doesn't have an effect on the embeddings; we need to run more experiments.

Now that we have everything, we need to run the embeddings and do some visualization to see how they perform.
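One possible follow-up experiment, alongside visualization, is to compare the two models directly by measuring how much their nearest-neighbour lists overlap for the same query word. This is a minimal sketch under that assumption. `neighbour_overlap` is a hypothetical helper that takes lists in the `(word, score)` format returned by `most_similar`, and the lists in the usage lines are placeholders rather than model output.

```python
def neighbour_overlap(neighbours_a, neighbours_b):
    """Jaccard overlap between two nearest-neighbour lists of (word, score) pairs."""
    words_a = {word for word, _ in neighbours_a}
    words_b = {word for word, _ in neighbours_b}
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


# Placeholder neighbour lists (illustrative only).
from_fasttext = [("x", 0.9), ("y", 0.8)]
from_cwe = [("y", 0.7), ("z", 0.6)]
print(neighbour_overlap(from_fasttext, from_cwe))
```

Averaging this score over a fixed list of query words would give a single number for how differently the fastText and CWE models organize the vocabulary.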