<a href="https://colab.research.google.com/github/devennn/word_embeddings/blob/master/MYTweet_wordNotInVocab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import os
import tqdm
import pandas as pd
from gensim.models import Word2Vec

In [0]:
path = '/content/drive/My Drive/datasets/tweets/'
model_path = '/content/drive/My Drive/pretrained_model'

In [3]:
# Malaysian COVID tweets from an open-source dataset on Kaggle
# My Kaggle notebook: https://www.kaggle.com/deventommy96/covid19-tweet-dataset-malaysia
with open(os.path.join(path, 'cleaned_tweets.txt'), 'r') as f:
  covid = [s.strip() for s in f.readlines()]

# This model was trained on a Malaysian Twitter dataset of more than 2 million tweets
# Hyperparameters: window = 5, CBOW, vector_size = 300
model = Word2Vec.load(os.path.join(model_path, 'twtV2_w2v.model'))



In [4]:
covid[:10]

['when will this be over',
 'i miss those days when i sneeze people would politely say bless you now people just say get the fuck out of here',
 'bond movie postponed cuz nobody wants to die',
 'all these days software was scanned for virus now software engineers are scanned for virus',
 'maklumat terkini mengenai sehingga jam pm petang tadi stay safe guys jangan lupa untuk utamakan kesihatan ok info',
 'malaysia s covid update march total cases recovered currently under treatment out of cases cases are the st generation contact from case while cases are the nd generation contact of case',
 'please come forward if you re one of these contacts stop',
 'susulan covid kerajaan arab saudi telah mengambil langkah pencegahan ya allah ya allah yang maha melindungi lindungi kami daripada wabak penyakit ini',
 'did not reveal the until a deputy minister got it after that it allowed cotizens who traveled to the infected region to go back but did not stamp their passports some were tested positiv

In [0]:
# Collect unique words, keeping first-seen order
all_words = list(dict.fromkeys(w for s in covid for w in s.split()))

In [6]:
len(all_words)

16499

In [7]:
# Collect words that are not in the model vocab
# (most_similar raises KeyError for an unknown word)
not_in_vocab = {}
for w in tqdm.tqdm(all_words):
  try:
    model.wv.most_similar(w)
  except KeyError:
    if w not in not_in_vocab:
      not_in_vocab[w] = 1
    else:
      not_in_vocab[w] += 1

100%|██████████| 16499/16499 [06:26<00:00, 42.70it/s]
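The loop above is slow most likely because `most_similar` runs a full nearest-neighbour search for every known word just to find out whether a `KeyError` is raised. A minimal sketch of a cheaper check, assuming a plain membership test is all that is needed. `toy_vocab` and `split_by_vocab` are hypothetical names used for illustration; with the loaded model, `w in model.wv` should play the role of `w in vocab`.

```python
def split_by_vocab(words, vocab):
    """Split words into (known, unknown) using a membership test only."""
    known, unknown = [], []
    for w in words:
        # No similarity search is run; this is a dictionary/set lookup.
        (known if w in vocab else unknown).append(w)
    return known, unknown


# Toy stand-in for the model vocabulary (hypothetical data).
toy_vocab = {"molek", "laris", "session", "jaga", "diri"}
known, unknown = split_by_vocab(["jaga", "diri", "moleq", "lariss"], toy_vocab)
print(unknown)
```

The unknown words come back in their original order, so the result can be used in place of `not_in_vocab_list`.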


In [0]:
# Sort by count, descending
not_in_vocab = dict(sorted(not_in_vocab.items(), key=lambda item: item[1], reverse=True))

In [0]:
not_in_vocab_list = [w for w in not_in_vocab] # Convert to list

In [10]:
len(not_in_vocab_list)

1280

In [0]:
# Find a sentence containing the specified word
# Only the first matching sentence is returned (None if there is no match)
def get_sentence(w):
  for s in covid:
    if w in s.split():
      return s

# Find closest string

In [0]:
import difflib
import random

In [0]:
def find_closest_string(word):
  result = difflib.get_close_matches(word, model_vocab, n=7)
  print("{} -> {}".format(word, result))

In [14]:
model_vocab = list(model.wv.vocab)
len(model_vocab)

187985

The model has a huge vocabulary because it was trained on a social media dataset. Spelling variation is what makes the corpus interesting: a word with a single missing letter is counted as a new word.
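Because every spelling variant has its own vocabulary entry, string matching against the vocabulary returns both correct words and other misspellings. A minimal sketch of how `difflib.get_close_matches` ranks candidates, using a tiny made-up vocabulary (`toy_vocab` is hypothetical, not the model's vocabulary). Candidates below the default similarity cutoff of 0.6 are dropped.

```python
import difflib

# Hypothetical mini-vocabulary containing a correct word, a misspelt
# variant, a loosely related word and an unrelated word.
toy_vocab = ["perkembangan", "prkembangan", "pertemanan", "makan"]

# Returns at most n candidates, best match first.
matches = difflib.get_close_matches("perkemvangan", toy_vocab, n=7)
print(matches)
```

Here the correct spelling ranks first and the unrelated word is filtered out, but the misspelt variant is still returned as a candidate.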

In [15]:
# For viewing
# Randomly choose unknown words to test
for _ in range(30): 
  find_closest_string(random.choice(not_in_vocab_list))

shike -> ['sike', 'shik', 'shie', 'hike', 'shitake', 'spike', 'siket']
perkemvangan -> ['perkembangan', 'prkembangan', 'berkembangan', 'pkembangan', 'pertemanan', 'diperkembangkan', 'perlembangaan']
inpatients -> ['inpatient', 'patients', 'impatient', 'patient', 'patents', 'liptints', 'inepties']
inpact -> ['intact', 'infact', 'impact', 'pact', 'ipat', 'incapacit', 'inat']
pihok -> ['piok', 'pitok', 'pisok', 'pipok', 'pikok', 'pijok', 'pihak']
cavernous -> ['saverinus', 'cancerous', 'aventus', 'verns', 'venus', 'caves', 'avenu']
bebit -> ['bebait', 'ebit', 'beit', 'bebi', 'belibit', 'bebirat', 'bebelit']
quarantinealife -> ['quarantini', 'quarantine', 'quarantina', 'quarantinekan', 'quarantines', 'quarantined', 'quarantaine']
airbourne -> ['airborne', 'bourne', 'harbour', 'hairbun', 'aircorn', 'tambourine', 'painborneo']
mncegah -> ['mencegah', 'mncelah', 'menegah', 'mngah', 'megah', 'cegah', 'pencegah']
yourswlf -> ['yourslef', 'yourself', 'youself', 'yoursef', 'ourself', 'yours', 'yo

# Comparing similar words to other words in the text to predict unknown word

- The other words in the sentence are treated as context. Compare each of them with the difflib candidates.
- For each context word, keep the candidate with the highest embedding similarity.
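The two steps above can be sketched on toy data. This is an illustration under stated assumptions, not the notebook's implementation: the 2-d vectors in `toy_vectors` are made up, and `cosine` and `vote` are hypothetical helpers standing in for the model's similarity function and the counting step.

```python
import math
from collections import Counter

# Made-up 2-d "embeddings", for illustration only.
toy_vectors = {
    "molek": (0.9, 0.1),
    "mole": (0.1, 0.9),
    "jaga": (1.0, 0.0),
    "diri": (0.8, 0.2),
    "gym": (0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def vote(candidates, context_tokens, vectors):
    """Each known context token votes for its most similar candidate."""
    votes = Counter()
    for token in context_tokens:
        if token not in vectors:
            continue  # context words outside the vocab cannot vote
        best = max(candidates, key=lambda c: cosine(vectors[c], vectors[token]))
        votes[best] += 1
    return votes


# "blako" is not in toy_vectors, so it is skipped.
votes = vote(["mole", "molek"], ["gym", "jaga", "diri", "blako"], toy_vectors)
print(votes.most_common())
```

The candidate with the most votes is taken as the prediction for the unknown word.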

In [0]:
# Find the most similar words for every word in a sentence.
# Words that don't exist in the vocab are marked as <NE>.
def get_all_close_word_embeddings(sentence):
  tokens = sentence.split()
  sequence = pd.DataFrame()
  topn = 10
  for token in tokens:
    try:
      result = model.wv.most_similar(token, topn=topn)
      result = pd.DataFrame([w for w, s in result], columns=[token])
    except KeyError:
      # Token is not in the vocab: fill its column with the <NE> marker
      result = pd.DataFrame(['<NE>'] * topn, columns=[token])
    sequence = pd.concat([sequence, result], axis=1)
  return sequence

In [0]:
def count_words(words):
  d = {}
  for w in words:
    if w in d: d[w] += 1
    else: d[w] = 1
  return {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}

def find_closest_string_with_embeddings(word, sentence):
  tokens = sentence.split()
  results = difflib.get_close_matches(word, model_vocab, n=7)
  similarity_score, similar_word = [], []
  for token in tokens:
    try:
      # Similarity between this context token and every difflib candidate
      similarity_score = [model.wv.similarity(r, token) for r in results]
      # Keep the candidate that scores highest against this token
      similar_word.append(results[similarity_score.index(max(similarity_score))])
    except KeyError:
      # Token is not in the vocab (this includes the unknown word itself): skip it
      pass

  print("Unknown Word -> {}".format(word))
  print("Full sentence -> {}".format(sentence))
  print("Similar words -> {}".format(results))
  print("Similar by embedding -> {}".format(similar_word))
  print("Most similar word -> {}".format(count_words(similar_word)))

# Correct results

In [18]:
word_tests = ['moleq', 'lariss', 'sesssion', 'dipermudohkn', 'dinobat']
for word in word_tests:
  text = get_sentence(word)
  find_closest_string_with_embeddings(word, text)
  print()

Unknown Word -> moleq
Full sentence -> gym pom dh sepi last day before officially lock down yg mane diberi cuti tu sila duk diam kt rumah yg mane kerja tu selamat bekerja jaga diri moleq blako jangan kerana corona hilang iman pedoman
Similar words -> ['mole', 'moole', 'moles', 'molep', 'molek', 'moleh', 'mohle']
Similar by embedding -> ['mole', 'molek', 'molek', 'moleh', 'mole', 'moles', 'mole', 'mohle', 'mole', 'mole', 'molek', 'molek', 'moleh', 'molek', 'molek', 'molek', 'molek', 'molek', 'molek', 'molek', 'molek', 'molek', 'molek', 'molek', 'moole', 'moleh', 'molek', 'molek', 'molep', 'molek', 'moleh', 'mole', 'molek', 'molek', 'mole']
Most similar word -> {'molek': 20, 'mole': 7, 'moleh': 4, 'moles': 1, 'mohle': 1, 'moole': 1, 'molep': 1}

Unknown Word -> lariss
Full sentence -> harap ketupat paling lariss y
Similar words -> ['larissa', 'larss', 'laris', 'laiss', 'clarissa', 'mariss', 'laisse']
Similar by embedding -> ['laris', 'laris', 'laris', 'laiss']
Most similar word -> {'lari

1) For *'moleq'*, *'lariss'* and *'sesssion'*, the most similar word produced by difflib is wrong, but the correct word is still in the candidate list. Comparing word embeddings recovers the correct word.

2) For *'sesssion'*, *'dipermudohkn'* and *'dinobat'*, difflib alone works well: almost all of its candidates, including the most similar one, are correct. When combined with embeddings, the word that best fits the context is chosen. This is caused by the frequency of the word in the training corpus.

# Wrong results

In [19]:
word_tests = ['reassemble', 'compasses', 'boleeeeeh', 'meningatkan', 'bomathi']
for word in word_tests:
  text = get_sentence(word)
  find_closest_string_with_embeddings(word, text)
  print()

Unknown Word -> reassemble
Full sentence -> day avengers reassemble psst admin kka sorg tu blh start charging suit dah mrs potts stark kan from malaysiaigers level pkp fasa polis fasa tentera fasa
Similar words -> ['ressemble', 'ressemblez', 'ressembles', 'ressembler', 'resemble', 'assemble', 'ressemblent']
Similar by embedding -> ['resemble', 'ressemblent', 'ressemblez', 'ressemblent', 'ressemble', 'ressembles', 'ressemblent', 'ressembles', 'resemble', 'assemble', 'ressembles', 'ressemblent', 'ressembles', 'ressembler', 'ressembles', 'assemble', 'assemble', 'assemble', 'assemble', 'ressembler', 'assemble', 'ressembler', 'assemble']
Most similar word -> {'assemble': 7, 'ressembles': 5, 'ressemblent': 4, 'ressembler': 3, 'resemble': 2, 'ressemblez': 1, 'ressemble': 1}

Unknown Word -> compasses
Full sentence -> thank you and all workers for being there for us in all over the world it s time to reset our compasses ramp up compassion kindness to everyone we can beat this
Similar words -> 

1) For *'reassemble'*, *'compasses'*, *'boleeeeeh'* and *'meningatkan'*, difflib does a good job of finding the most similar word. Most of the embedding scores also pick the correct word. However, since this experiment uses every word in the sentence, more noise is included in the "Similar by embedding" list, and the noise outnumbers the correct word. This can be observed in the "Most similar word" list, where the second word for *'compasses'*, *'boleeeeeh'* and *'meningatkan'* is the correct word.

2) *'bomathi'* is not a Malay or English word (or is it?). When plotting every word in the sentence, the result shows that most of the words are not in the vocab. Although it still gives an output, I have no idea whether that output is correct.
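For cases like *'bomathi'*, one possible sanity check (not part of the experiment above) is to measure how much of the sentence the model actually knows before trusting the vote. A minimal sketch, assuming a simple ratio is informative enough; `context_coverage` and `toy_vocab` are hypothetical names, and with the loaded model `t in model.wv` should stand in for `t in vocab`.

```python
def context_coverage(sentence, word, vocab):
    """Fraction of context tokens (all tokens except `word`) found in vocab."""
    context = [t for t in sentence.split() if t != word]
    if not context:
        return 0.0  # no context at all, so nothing supports the prediction
    return sum(t in vocab for t in context) / len(context)


# Toy stand-in for the model vocabulary (hypothetical data).
toy_vocab = {"jaga", "diri"}
coverage = context_coverage("jaga diri moleq blako", "moleq", toy_vocab)
print(round(coverage, 2))
```

A low value means few context words could vote, so the predicted word deserves less trust.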