# Japanese Sentences by JLPT Level

In this notebook we extract, from a parallel set of Japanese-English sentences, the sentences that cover a given vocabulary list.

The steps are:
* Download the parallel data.
* Grade sentences by JLPT level.
* As an example, extract the top 100 sentences that cover as many N3 words as possible.

## Download data

We take parallel sentences from the [OPUS](http://opus.nlpl.eu) project, for example the [Tatoeba](http://opus.nlpl.eu/Tatoeba-v2020-11-09.php) parallel set.

The data can be downloaded and processed by running the following:


```console
wget https://object.pouta.csc.fi/OPUS-Tatoeba/v2020-11-09/moses/en-ja.txt.zip
unzip -d tatoeba en-ja.txt.zip
mkdir -p data
paste -d "\t" tatoeba/Tatoeba.en-ja.ja tatoeba/Tatoeba.en-ja.en > data/parallel.ja-en
rm -r tatoeba
rm en-ja.txt.zip
```

Running these commands produces the file `data/parallel.ja-en`, with one tab-separated Japanese-English sentence pair per line.
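
Because each line is just two sentences joined by a tab, quotation marks inside a sentence can confuse a CSV parser that treats them as field delimiters. The following is a minimal sketch, using two invented sample lines rather than real Tatoeba data, of reading this layout with quote handling switched off. Whether the real file contains such sentences is an assumption.

```python
import csv
import io

# Two made-up lines in the same layout as data/parallel.ja-en:
# Japanese sentence, a tab, then the English sentence.
sample = (
    '地球は丸い。\tThe earth is round.\n'
    '"はい"と言った。\tHe said "yes".\n'
)

# QUOTE_NONE treats double quotes as ordinary characters, so a sentence that
# starts with a quotation mark is not interpreted as a quoted field.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))

for ja, en in rows:
    print(ja, "|", en)
```

`pandas.read_csv` accepts the same `quoting=csv.QUOTE_NONE` argument, which may be worth trying if some sentence pairs appear merged or truncated after loading.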

## Grade sentences

As a first step, we estimate the difficulty level of each sentence from the kanji it contains.

In [1]:
import numpy as np
import pandas as pd
import bisect

HIRAGANA = list('ぁあぃいぅうぇえぉおかがきぎくぐけげこごさざしじすず'
                'せぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴ'
                'ふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろわ'
                'をんーゎゐゑゕゖゔ')

KATAKANA = list('ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズ'
                'セゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピ'
                'フブプヘベペホボポマミムメモャヤュユョヨラリルレロワ'
                'ヲンーヮヰヱ')

ASCII_chars = list('ゝゞ・「」。、!！"#$%&\'()*+,-./:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789０１２３４５６７８９'
                  '[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ')


JLPT_vocab_path='../HSKandJLPTkanji/data/JLPT_vocab.txt'
pair_sentences_path='data/parallel.ja-en'
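
The lists above define kanji implicitly, as any character that is not listed. A character missing from the lists, such as the fullwidth question mark `？`, is therefore looked up as if it were a kanji. As a hedged alternative, and not what this notebook does, kanji can be detected positively by Unicode range. The helper names `is_kanji` and `kanji_in` are hypothetical, and the sketch only covers the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so rarer characters and the iteration mark `々` are not matched.

```python
def is_kanji(ch):
    """True for characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def kanji_in(text):
    """Return the kanji of a string, in order of appearance (duplicates kept)."""
    return [ch for ch in text if is_kanji(ch)]


# Kana, fullwidth digits and fullwidth punctuation all fall outside the range.
print(kanji_in("今日は６月１８日ですか？"))
```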


In [2]:

JLPT_vocab = pd.read_csv(JLPT_vocab_path, header = None, sep="\t", names=["kanji","hiragana","English","grade"])


non_kanji_set=set(HIRAGANA+KATAKANA+ASCII_chars)


# kanji of each vocabulary entry: every character that is not kana, ASCII or punctuation
JLPT_kanji_set = JLPT_vocab["kanji"].astype(str).apply(lambda x:  set(x) - non_kanji_set )
# build [kanji, grade] pairs for every vocabulary entry
kanji_set=pd.concat([JLPT_kanji_set,JLPT_vocab["grade"]], axis=1)
kanji_grade_list_nested=kanji_set.apply(lambda row: [[x,row['grade']] for x in list(row['kanji'])], axis=1).tolist()
# flatten the list
kanji_grade_list=[item for sublist in kanji_grade_list_nested for item in sublist]
# store in a dictionary where each kanji keeps its highest (easiest) JLPT grade
kanji_grade_dict=dict()
for kj,gr in kanji_grade_list:
    kanji_grade_dict[kj]=max(kanji_grade_dict.get(kj,0),gr)


In [3]:
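def grade_sentence(s):
    # grade of a sentence = lowest (hardest) grade among its kanji;
    # kanji missing from the JLPT list count as 0, and so do sentences without kanji
    s_kanji_level=[kanji_grade_dict.get(x,0) for x in s if x not in non_kanji_set]
    if len(s_kanji_level)==0:
        return 0
    else:
        return min(s_kanji_level)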

par_sentences = pd.read_csv(pair_sentences_path, header = None, sep="\t", names=["jp","en"] )
JLPT_level=par_sentences["jp"].astype(str).apply(lambda x: grade_sentence(x))
par_sentences["JLPT_level"]=JLPT_level
# remove sentences that have no kanji or that contain a kanji missing from the JLPT list
par_sentences=par_sentences[par_sentences["JLPT_level"]>0]
par_sentences.head()

| | jp | en | JLPT_level |
|---|---|---|---|
| 0 | 何かしてみましょう。 | Let's try something. | 5 |
| 1 | 私は眠らなければなりません。 | I have to go to sleep. | 4 |
| 2 | そろそろ寝なくちゃ。 | I have to go to sleep. | 5 |
| 3 | 今日は６月１８日で、ムーリエルの誕生日です！ | Today is June 18th and it is Muiriel's birthday! | 5 |
| 4 | ムーリエルは２０歳になりました。 | Muiriel is 20 now. | 5 |
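
To make the grading rule concrete, here is a small self-contained sketch of the same logic on a toy vocabulary. The word grades below are made up for illustration and are not taken from the JLPT list, and `NON_KANJI`, `toy_vocab`, `kanji_grade` and `grade` are hypothetical names. Each kanji keeps the easiest grade of any word that contains it, and a sentence gets the hardest grade among its kanji, or 0 if it has no kanji or an unknown one.

```python
# Hypothetical mini vocabulary: (word, grade). Grades are illustrative only.
toy_vocab = [("私", 5), ("眠る", 4), ("何", 5), ("誕生日", 4), ("日", 5)]

# Every non-kanji character used in this example.
NON_KANJI = set("はらなければりません。かしてみょうでする")

# Each kanji keeps the highest (easiest) grade of any word containing it.
kanji_grade = {}
for word, g in toy_vocab:
    for ch in set(word) - NON_KANJI:
        kanji_grade[ch] = max(kanji_grade.get(ch, 0), g)


def grade(sentence):
    """Lowest (hardest) kanji grade in the sentence; 0 if none or unknown."""
    levels = [kanji_grade.get(ch, 0) for ch in sentence if ch not in NON_KANJI]
    return min(levels) if levels else 0


for s in ["私は眠らなければなりません。", "何かしてみましょう。",
          "今日は誕生日です。", "かな。"]:
    print(s, grade(s))
```

In this toy setup `日` ends up with grade 5 even though it also appears in a grade-4 word, and the third sentence is graded 0 because `今` is not covered by the toy vocabulary.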


## Sample sentences

Extract sentences that cover the vocabulary.

For example, to cover the N3 vocabulary, set `vocab` to the N3 words:

In [4]:
level=3

df_sent_level=par_sentences[par_sentences["JLPT_level"]==level].reset_index()
sentence_list=list(df_sent_level["jp"])
vocab=list( JLPT_vocab[JLPT_vocab["grade"]==level]["kanji"] )
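
Note that the entries of `vocab` are later matched against sentences by plain substring search (`str(v) in s` in `getvocab`), without any tokenization. A minimal sketch of what that implies, using a made-up word list and the hypothetical helper name `words_found`:

```python
def words_found(sentence, words):
    """Words that occur anywhere in the sentence as a substring, without duplicates."""
    return sorted({w for w in words if str(w) in sentence})


# Invented word list: the single kanji 日 is also a substring of 今日.
toy_words = ["日", "今日", "収入"]

print(words_found("今日は収入がある。", toy_words))
```

Here `日` is counted as covered although it only occurs inside `今日`, so short vocabulary entries may be credited to sentences that do not really use them.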

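Extract the top 100 sentences. We follow an approach based on Feature Decay Algorithms: each sentence is scored by the summed weight of the vocabulary words it contains, divided by its length in characters, and every time a sentence is selected the weights of the words it covers are halved.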

In [5]:
N=100

v_val=dict()
for v in vocab:
    v_val[v]=1.0

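# candidate record: (score, sentence index, vocabulary words found, sentence length);
# the score comes first so that tuples are ordered by score
def create_tuple(index,list_w,val,slen):
    return (val,index,list_w,slen)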

def get_val(tupl):
    return tupl[0]

def get_index(tupl):
    return tupl[1]

def get_list_w(tupl):
    return tupl[2]

def get_len(tupl):
    return tupl[3]

def update_val(tupl):
    list_w=get_list_w(tupl)
    slen=get_len(tupl)
    if len(list_w)==0:
        return (0.0,tupl[1],tupl[2],tupl[3])
    s_val=float(sum([v_val.get(x,0.0) for x in list_w]))
    new_val= s_val/slen
    return (new_val,tupl[1],tupl[2],tupl[3])

def decay(v_list):
    for v in v_list:
        old_val=v_val.get(v,0.0)
        new_val=old_val/2.0
        v_val[v]=new_val

def getvocab(s,vocab):
    list_w=[]
    for v in vocab:
        if str(v) in s:
            list_w.append(v)
    return list(set(list_w))

tuples=[]
for i in range(len(sentence_list)):
    s=sentence_list[i]
    list_w=getvocab(s,vocab)
    slen=len(s)
    cur_tuple=create_tuple(i,list_w,0,slen)
    cur_tuple=update_val(cur_tuple)
    tuples.append(cur_tuple)


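# sort ascending by score, so the best candidate sits at the end
tuples.sort(key=lambda x: x[0])


i=len(tuples)-1

selected=[]
while i>=0 and len(selected)<N:
    top_tuple=tuples[i]
    old_val=get_val(top_tuple)
    # re-score the candidate with the current (possibly decayed) weights
    top_tuple=update_val(top_tuple)
    new_val=get_val(top_tuple)
    if new_val!=old_val:
        # stale score: insert the updated record at its new position
        # (at or before i), then look at position i again
        bisect.insort(tuples, top_tuple )
    else:
        # the score is up to date, so this is the best remaining sentence
        i=i-1
        selected.append(get_index(top_tuple))
        decay(get_list_w(top_tuple))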
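
The loop above re-scores candidates lazily. For comparison, the same selection idea can be written as a plain greedy procedure that re-scores every remaining sentence in each round. This is only a sketch on invented data: `select_fda` is a hypothetical name, it is slower than the lazy version, and it breaks ties by lowest index, which may differ from the order produced above.

```python
def select_fda(sentences, vocab, n):
    """Greedy selection: score = sum of word weights / sentence length in characters.
    After each pick, the weight of every word in the picked sentence is halved."""
    weight = {w: 1.0 for w in vocab}
    # vocabulary words found in each sentence (substring match, each counted once)
    words_in = [[w for w in vocab if w in s] for s in sentences]
    remaining = set(range(len(sentences)))
    picked = []

    def score(i):
        return sum(weight[w] for w in words_in[i]) / max(len(sentences[i]), 1)

    while remaining and len(picked) < n:
        best = max(sorted(remaining), key=score)
        picked.append(best)
        remaining.remove(best)
        for w in words_in[best]:
            weight[w] /= 2.0
    return picked


toy_sentences = ["地球は丸い。", "時間の単位は何か。", "地球の時間。"]
toy_vocab = ["地球", "時間", "単位"]
print(select_fda(toy_sentences, toy_vocab, 2))
```

In the toy run the third sentence is picked first because it covers two words in six characters. After the decay, the second sentence wins the next round because it still contains the untouched word `単位`.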


In [6]:
dfOut=df_sent_level[["jp","en"]].loc[selected]
dfOut.to_csv('data/out',sep='\t')
dfOut


| | jp | en |
|---|---|---|
| 66596 | 劇場内禁煙。 | No smoking in the theater. |
| 42842 | 地球は丸い。 | The earth is round. |
| 16529 | この場合は、翻訳は事実上不可能だ。 | In this case, translation is, in effect, impos... |
| 48656 | 時間の単位は何か。 | What are the measures of time? |
| 45145 | 生物学者はその現象の観察に集中した。 | The biologist concentrated on observing the ph... |
| 66247 | 王様は裸だ！ | The king is naked! |
| 3159 | 回数券を下さい。 | May I have coupon tickets? |
| 45057 | 税金は収入に基づく。 | Taxation is based on income. |
| 55814 | 今日中にでも嵐が来そうだ。 | We are liable to get a storm before the day is... |
| 67487 | 絶対！ | Absolutely! |
