- The original word-wise dict does not include inflected variants of a word.
- Example: `valley` can be looked up, but `valleys` cannot.
- Solution:
    - spaCy can lemmatize each word before the lookup: `valleys` --<lemmatized>--> `valley`.
    - But that is slow: a book may contain hundreds of thousands of words.
    - So I add the variants of each word to the dict beforehand.
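The idea above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: `add_variants`, `base_dict`, and `variant_to_lemma` are hypothetical names, and the variant-to-lemma mapping is hard-coded here where the notebook gets it from spaCy.

```python
def add_variants(base_dict, variant_to_lemma):
    """Return a copy of base_dict that also maps each known variant to its lemma's entry."""
    expanded = dict(base_dict)
    for variant, lemma in variant_to_lemma.items():
        # Only add a variant if it is not already an entry and its lemma is in the dict.
        if variant not in expanded and lemma in base_dict:
            expanded[variant] = base_dict[lemma]
    return expanded


# Toy data (illustrative only).
base_dict = {"valley": {"short_def": "thung lũng"}}
variant_to_lemma = {"valleys": "valley", "rivers": "river"}

expanded = add_variants(base_dict, variant_to_lemma)
print("valleys" in expanded)  # the variant can now be looked up directly
print("rivers" in expanded)   # its lemma is unknown, so it is left out
```

After this one-off step, looking up a word while reading a book is a plain dict access with no lemmatization at lookup time.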

In [None]:
import pandas as pd
import numpy as np
import spacy
import pickle
from tqdm import tqdm

In [3]:
# load spacy
with open("../data/en_core_web_sm.pkl", "rb") as f:
    sp = pickle.load(f)

In [129]:
# Load ww dict file with no word variants
df00_wwdict = (
    pd.read_csv("../data/vi_novar.csv", dtype=object)
    .dropna(subset=["word"])
    .drop_duplicates(subset=["word"])
)

In [117]:
df00_wwdict.head()

Unnamed: 0,id,word,full_def,short_def,example_sentence,hint_level
0,39438,from A to Z,including everything,từ A đến Z,The book is titled `Home Repairs From A to Z.`,1
1,30988,from (point) A to (point) B,from one place to another,từ (điểm) A đến (điểm) B,I don't care about the scenery. I'm only inter...,1
2,30749,aardvark,a large African animal that has a long nose an...,lợn đất,,2
3,13279,abacus,a device used for counting and calculating by ...,bàn tính,,2
4,30998,abalone,a type of shellfish that is eaten as food and ...,bào ngư,,1


In [133]:
wwdict_novar00 = df00_wwdict.set_index("word").to_dict(orient="index")
print(len(wwdict_novar00))

55698


In [34]:
# Load the large (~460K words) word-frequency list built from Google Books
# https://github.com/possibly-wrong/word-frequency
df01_wfreq = (pd.read_csv("../data/word-frequency.txt", sep="\t", names=["word", "unknown", "freq"])
    .dropna(subset=["word"])
)

In [35]:
df01_wfreq.head()

Unnamed: 0,word,unknown,freq
0,the,26548583149,109892823605
1,of,15482969531,66814250204
2,and,11315969857,47936995099
3,to,9673642739,40339918761
4,in,8445476198,34866779823


In [36]:
words_var01 = df01_wfreq.word.values
print(len(words_var01))

458341


In [46]:
# Words in df01 that are missing from df00 are either
#   - variants of words from df00 (set A), or
#   - new words that do not exist in df00 (set B)
# Lemmatize each word and check whether its lemma exists in df00
#   - if yes -> it belongs to set A
#   - if no -> save it for later processing (call the Google Translate API)

df02_in01_notin00 = pd.merge(df00_wwdict, df01_wfreq, on="word", how="right")
df02_in01_notin00 = df02_in01_notin00[df02_in01_notin00.short_def.isnull()]
print(df02_in01_notin00.shape)

(407244, 8)
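Filtering on a null `short_def` also keeps any dictionary entry whose `short_def` happens to be empty. An alternative, sketched below on toy data, is the `indicator=True` option of `pd.merge`, which marks rows found only in the right-hand frame. The frames `base` and `freq` are made up for illustration, and whether the real dictionary has empty `short_def` values is an assumption not checked here.

```python
import pandas as pd

# Toy stand-ins for df00_wwdict and df01_wfreq (illustrative data only).
base = pd.DataFrame(
    {
        "word": ["valley", "abacus", "aardvark"],
        "short_def": ["thung lũng", "bàn tính", None],  # one entry lacks a short_def
    }
)
freq = pd.DataFrame({"word": ["valley", "valleys", "aardvark"], "freq": [100, 40, 5]})

merged = pd.merge(base, freq, on="word", how="right", indicator=True)

# Anti-join: rows that exist only in the frequency list.
missing_words = merged.loc[merged["_merge"] == "right_only", "word"].tolist()

# Null-based filter, as in the cell above, for comparison.
null_filtered_words = merged.loc[merged["short_def"].isnull(), "word"].tolist()

print(missing_words)
print(null_filtered_words)
```

On this toy data the two filters differ only for `aardvark`, which is in the dictionary but has no `short_def`.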


In [151]:
df02_in01_notin00.freq.quantile([0.5, 0.75, 0.8, 0.9, 0.95, 0.99])

0.50        2023.00
0.75       21927.25
0.80       40464.40
0.90      212593.90
0.95      956871.95
0.99    29142203.21
Name: freq, dtype: float64

In [142]:
words_in01_notin00 = df02_in01_notin00.word.values
setA = {}  # variant -> dict entry of its lemma
setB = []  # words whose lemma is not in the dict

for word in tqdm(words_in01_notin00):
    spw = sp(word)
    if len(spw) != 1:
        # Skip: spaCy split the input into several tokens, so there is no single lemma
        continue

    lemma = spw[0].lemma_

    if lemma in wwdict_novar00:
        setA[word] = wwdict_novar00[lemma]
    else:
        setB.append(word)

100%|█████████████████████████████████████████████████████████████████████████| 407244/407244 [26:07<00:00, 259.74it/s]
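The loop above calls the pipeline once per word and takes about 26 minutes. spaCy's `Language.pipe` processes texts in batches and is usually faster for this kind of bulk job. The sketch below is an untimed variant of the same loop: `split_by_lemma` is a hypothetical helper, and the `skipped` list is an addition so that multi-token inputs are recorded instead of silently dropped.

```python
def split_by_lemma(words, nlp, base_dict, batch_size=1000):
    """Split words into set A (lemma found in base_dict), set B (not found), and skipped.

    `words` must be a list (it is iterated twice); `nlp` is a loaded spaCy pipeline.
    """
    set_a, set_b, skipped = {}, [], []
    # nlp.pipe yields one Doc per input text, in the same order as the input.
    for word, doc in zip(words, nlp.pipe(words, batch_size=batch_size)):
        if len(doc) != 1:
            # spaCy split the input into several tokens, so there is no single lemma.
            skipped.append(word)
            continue
        lemma = doc[0].lemma_
        if lemma in base_dict:
            set_a[word] = base_dict[lemma]
        else:
            set_b.append(word)
    return set_a, set_b, skipped
```

With the notebook's objects, the call would look like `setA, setB, skipped = split_by_lemma(list(words_in01_notin00), sp, wwdict_novar00)`. Since only lemmas are needed, disabling unused components such as the parser and NER may speed this up further, though that was not measured here.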


In [184]:
# Process set A
df03B_setA = (
    pd.DataFrame(data=setA.values(), index=setA.keys())
    .reset_index()
    .rename(columns={"index": "word"})
)
df04_setA = pd.concat([df00_wwdict, df03B_setA])
df04_setA.to_csv("../data/vi.csv")

In [183]:
# Process set B
with open("../data/words_in01_notin00_lemma_notin00.pkl", "wb") as f:
    pickle.dump(setB, f)

## Manual data curation
- `datum`: Google Translate gave `mốc thời gian`; fixed to `dữ kiện`
- `data`: fixed from `mốc thời gian` to `dữ liệu`
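Manual fixes like these are easier to reproduce when kept in code rather than edited by hand in the CSV. A minimal sketch, assuming the dict has the same `word -> record` shape as `wwdict_novar00`; `MANUAL_SHORT_DEFS` and `apply_manual_fixes` are hypothetical names:

```python
# Hand-curated overrides: word -> corrected short_def.
MANUAL_SHORT_DEFS = {
    "datum": "dữ kiện",
    "data": "dữ liệu",
}


def apply_manual_fixes(word_dict, fixes):
    """Return a copy of word_dict with short_def replaced for every word listed in fixes."""
    patched = {word: dict(record) for word, record in word_dict.items()}
    for word, short_def in fixes.items():
        if word in patched:  # ignore overrides for words that are not in the dict
            patched[word]["short_def"] = short_def
    return patched


# Toy input reproducing the mistranslation noted above.
toy_dict = {
    "datum": {"short_def": "mốc thời gian"},
    "data": {"short_def": "mốc thời gian"},
    "valley": {"short_def": "thung lũng"},
}
fixed = apply_manual_fixes(toy_dict, MANUAL_SHORT_DEFS)
print(fixed["datum"]["short_def"], "|", fixed["data"]["short_def"])
```

Re-running the override step after regenerating the dict keeps the curated translations from being lost.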