# Kanji JLPT

In this article we analyze kanji by their JLPT level and by their grade in the jōyō kanji list. We build a table with each kanji, its JLPT level and its grade. Then we create a list of example words for each kanji.

We start by loading the two datasets: the JLPT vocabulary and the kanji grades.

In [1]:
import pandas as pd


# JLPT vocabulary: one word per line, tab-separated, no header row
JLPT_vocab_path='../HSKandJLPTkanji/data/JLPT_vocab.txt'
JLPT_vocab = pd.read_csv(JLPT_vocab_path, header = None, sep="\t", names=["kanji","hiragana","English","grade"])

# kanji grades: one kanji per line, tab-separated, no header row
kanji_grades_path="../Kanji_readings/data/kanji_grades.txt"
kanji_grades = pd.read_csv(kanji_grades_path, header = None, sep="\t", names=["kanji","grade","English","reading_kana","reading_romaji"])
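
As a quick check of the loading step, here is a minimal sketch that parses a small in-memory sample with the same `read_csv` arguments. The two sample lines are an assumption about the file layout (tab-separated, no header row), inferred from the column names used above, and the name `sample` is only illustrative.

```python
import io

import pandas as pd

# Hypothetical sample in the layout we assume for JLPT_vocab.txt:
# word, reading, meaning and level, separated by tabs, with no header row
sample = "会う\tあう\tto meet\t5\n猫\tねこ\tcat\t5\n"

vocab = pd.read_csv(
    io.StringIO(sample),
    header=None,  # the first line is data, not column names
    sep="\t",
    names=["kanji", "hiragana", "English", "grade"],
)

print(vocab.dtypes)
```

Printing the dtypes is a cheap way to confirm that the level column was parsed as a number rather than as text.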


Preprocess the vocabulary to add, for each word, a list of the kanji it contains.

In [2]:
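# every character that appears in any reading (kana, plus separators such as "/" and spaces)
hiragana_set = set("".join(JLPT_vocab["hiragana"].tolist()))
# keep only words that have a kanji spelling, sorted from level 5 (N5, the easiest) down to level 1
japanese_vocabulary=JLPT_vocab[~ JLPT_vocab["kanji"].isna()].sort_values(by=['grade'], ascending=False).reset_index(drop=True)
# characters of the word that never occur in a reading are treated as its kanji
japanese_vocabulary["JLPT_kanji_set"]=japanese_vocabulary["kanji"].astype(str).apply(lambda x:  list(set(x) - hiragana_set) )
japanese_vocabulary.head()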

|   | kanji | hiragana | English | grade | JLPT_kanji_set |
|---|---|---|---|---|---|
| 0 | 会う | あう | to meet | 5 | [会] |
| 1 | 並べる | ならべる | to line up; to set up | 5 | [並] |
| 2 | 登る | のぼる | to climb | 5 | [登] |
| 3 | 寝る | ねる | to go to bed; to sleep | 5 | [寝] |
| 4 | 猫 | ねこ | cat | 5 | [猫] |
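
The set difference above removes only characters that occur somewhere in the reading column, so any other non-kanji character in a word would be kept. As an alternative, here is a minimal sketch that selects kanji by Unicode range instead. It assumes that "kanji" means the CJK Unified Ideographs block (U+4E00 to U+9FFF), which leaves out the iteration mark 々 and the extension blocks. The names `KANJI_RE` and `extract_kanji` are hypothetical and not part of the notebook.

```python
import re

# Assumption: a kanji is any character in the CJK Unified Ideographs block
KANJI_RE = re.compile(r"[\u4e00-\u9fff]")


def extract_kanji(word):
    """Return the distinct kanji in `word`, in order of first appearance."""
    # dict.fromkeys removes duplicates while preserving order
    return list(dict.fromkeys(KANJI_RE.findall(word)))


print(extract_kanji("並べる"))
print(extract_kanji("コーヒー"))
```

Unlike `set(x) - hiragana_set`, this version does not depend on the contents of the reading column and returns the kanji in a stable order.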


## Level of each kanji
Build a dict that maps each kanji to its JLPT level.

In [3]:
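# [kanji, level] pairs for every kanji of every word
kanji_JLPT_list_nested=japanese_vocabulary.apply(lambda row: [[x,row['grade']] for x in list(row['JLPT_kanji_set'])], axis=1).tolist()
# flatten the per-word lists into a single list of pairs
kanji_JLPT_list=[item for sublist in kanji_JLPT_list_nested for item in sublist]
JLPT_characters=pd.DataFrame(kanji_JLPT_list)
JLPT_characters.columns = ['kanji', 'JLPT']
# a kanji can appear at several levels; max() keeps the highest number, i.e. the easiest level (N5 is the easiest)
kanji_JLPT_pair=JLPT_characters.groupby(['kanji']).max().reset_index()
kanji_JLPT_dict = dict(zip(kanji_JLPT_pair.kanji, kanji_JLPT_pair.JLPT))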
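
The same aggregation can also be sketched with `DataFrame.explode`, which turns each list element into its own row and avoids the manual flattening. The data below is a toy example with illustrative levels, not taken from the dataset, and the names `words`, `pairs` and `level_by_kanji` are hypothetical.

```python
import pandas as pd

# Toy data: each word with its JLPT level and the list of its kanji
words = pd.DataFrame({
    "kanji": ["会う", "会社", "社会"],
    "grade": [5, 4, 4],
    "JLPT_kanji_set": [["会"], ["会", "社"], ["社", "会"]],
})

# One row per (word, kanji) pair
pairs = words.explode("JLPT_kanji_set")

# Highest level number per kanji, i.e. the easiest level at which it appears
level_by_kanji = pairs.groupby("JLPT_kanji_set")["grade"].max().to_dict()
print(level_by_kanji)
```

Here 会 appears in a level 5 word and in level 4 words, so it is assigned level 5, while 社 only appears at level 4.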

In [4]:
kanji_grade_dict = dict(zip(kanji_grades.kanji, kanji_grades.grade))

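Obtain a table with each kanji and its corresponding JLPT level and grade. Kanji that do not appear in the JLPT vocabulary get level 0, and kanji that are missing from the grade list get the grade `EXTRA`.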

In [5]:
all_kanjis=list(set( list(kanji_JLPT_dict.keys()) + list(kanji_grade_dict.keys())))
JLPT_grade=pd.DataFrame( [[k,kanji_JLPT_dict.get(k,0),kanji_grade_dict.get(k,"EXTRA")] for k in all_kanjis] )
JLPT_grade.columns = ['kanji', 'JLPT','grade']
JLPT_grade

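|   | kanji | JLPT | grade |
|---|---|---|---|
| 0 | 蝋 | 2 | EXTRA |
| 1 | 拙 | 0 | S |
| 2 | 組 | 4 | 2 |
| 3 | 挟 | 2 | S |
| 4 | 防 | 3 | 5 |
| ... | ... | ... | ... |
| 2390 | 浄 | 0 | S |
| 2391 | 焚 | 2 | EXTRA |
| 2392 | 叩 | 3 | EXTRA |
| 2393 | 懲 | 1 | S |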
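
Combining the two lookups is an outer join on the kanji, so it can also be sketched with `DataFrame.merge`. The four kanji below reuse values shown in the output above. The frame names `jlpt`, `grades` and `table` are hypothetical, and this is only one possible way to express the step.

```python
import pandas as pd

# Small lookups using values from the table above
jlpt = pd.DataFrame({"kanji": ["組", "防", "蝋"], "JLPT": [4, 3, 2]})
grades = pd.DataFrame({"kanji": ["組", "防", "拙"], "grade": ["2", "5", "S"]})

# The outer join keeps kanji that appear in either lookup
table = jlpt.merge(grades, on="kanji", how="outer")

# Same defaults as the dict-based version: level 0 and grade "EXTRA"
table["JLPT"] = table["JLPT"].fillna(0).astype(int)
table["grade"] = table["grade"].fillna("EXTRA")
print(table)
```

The merge introduces missing values for kanji found in only one lookup, which is why the `JLPT` column is filled and cast back to integers.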


## Kanji with examples

We build a table with example words for each kanji.

In [6]:
NUM_EXAMPLES=4
kanji_map=dict()
# keep up to NUM_EXAMPLES words per kanji; japanese_vocabulary is sorted with
# the easiest level first, so the examples come from the easiest levels available
for row in japanese_vocabulary.itertuples():
    [index,kanji,hiragana,English ,grade,word_kanji_set]=list(row) 
    word_tuple=[kanji,hiragana,English ,str(grade)]
    for k in word_kanji_set:
        w_list=kanji_map.get(k,[])
        if len(w_list)<NUM_EXAMPLES:
            w_list.append(word_tuple)
            kanji_map[k]=w_list
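
The "first few words per kanji" pattern can also be sketched with `collections.defaultdict`, which removes the explicit `get` and reassignment. The rows below are toy data and the name `examples` is hypothetical. The cap is lowered to 2 only to make the effect visible.

```python
from collections import defaultdict

NUM_EXAMPLES = 2  # small cap for the illustration

# Toy rows, already sorted easiest first: (word, reading, meaning, level, kanji list)
rows = [
    ("水", "みず", "water", 5, ["水"]),
    ("水道", "すいどう", "water supply", 4, ["水", "道"]),
    ("水泳", "すいえい", "swimming", 4, ["水", "泳"]),
]

examples = defaultdict(list)
for word, reading, meaning, level, kanji_list in rows:
    for k in kanji_list:
        # Stop adding words once a kanji has NUM_EXAMPLES of them
        if len(examples[k]) < NUM_EXAMPLES:
            examples[k].append((word, reading, meaning, str(level)))

print(dict(examples))
```

With the cap at 2, 水 keeps only its first two words, while 泳 still receives 水泳 because its own list was empty.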


In [7]:
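# format each example as "word|reading|meaning|level"
examples_column=JLPT_grade["kanji"].apply(lambda x:["|".join(t) for t in kanji_map.get(x,[])])
# first example, or an empty string when the kanji has none
JLPT_grade["example1"]=examples_column.apply(lambda x: x[0] if x else "")
# remaining examples; x[2:] starts at the third entry, so the second one (index 1) is left out
JLPT_grade["other examples"]=examples_column.apply(lambda x: x[2:])
# highest JLPT number (easiest level) first, then by grade
JLPT_grade=JLPT_grade.sort_values(by=['JLPT','grade'], ascending=[False,True]).reset_index(drop=True)

JLPT_grade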

|   | kanji | JLPT | grade | example1 | other examples |
|---|---|---|---|---|---|
| 0 | 円 | 5 | 1 | 丸い/円い\|まるい\|round; circular\|5 | [楕円\|だえん\|ellipse\|2, 円い\|まるい\|round; circular; sph... |
| 1 | 七 | 5 | 1 | 七日\|なのか\|seven days; the seventh day\|5 | [七\|しち / なな\|seven\|5, 七\|なな\|seven\|3] |
| 2 | 水 | 5 | 1 | 水\|みず\|water\|5 | [水道\|すいどう \|water supply\|4, 水泳\|すいえい \|swimming\|4] |
| 3 | 立 | 5 | 1 | 立つ\|たつ\|to stand\|5 | [役に立つ\|やくにたつ \|to be helpful\|4, 独立\|どくりつ\|independ... |
| 4 | 白 | 5 | 1 | 白い\|しろい\|white\|5 | [真っ白\|まっしろ\|pure white\|2, 青白い\|あおじろい\|pale; pallid\|2] |
| ... | ... | ... | ... | ... | ... |
| 2390 | 拶 | 0 | S |  | [] |
| 2391 | 辣 | 0 | S |  | [] |
| 2392 | 俊 | 0 | S |  | [] |
| 2393 | 菊 | 0 | S |  | [] |
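
In cell [7] the slice `x[2:]` starts at the third entry, so the second collected example appears in neither column. If the intent is to keep every remaining example, the split could be sketched as below. The helper name `split_examples` is hypothetical, and whether the second example should be kept is an assumption about the intended result.

```python
def split_examples(examples):
    """Return (first example, all remaining examples) for a list of strings."""
    first = examples[0] if examples else ""
    # examples[1:] keeps everything after the first entry, including index 1
    return first, examples[1:]


first, rest = split_examples(["a|1", "b|2", "c|3", "d|4"])
print(first)
print(rest)
```

An empty list yields an empty string and an empty list, which matches how kanji without examples appear in the table.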


In [8]:
JLPT_grade.to_csv('data/out.txt',sep='\t', index=False)
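
Note that `to_csv` writes a column of Python lists as the text form of each list. If the file is meant to be read by other tools, one option is to join each list into a single string first. The sketch below writes to an in-memory buffer instead of `data/out.txt`. The separator `" // "` is an assumption, chosen because `"; "` already occurs inside the English meanings.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "kanji": ["水"],
    "other examples": [["水道|すいどう|water supply|4", "水泳|すいえい|swimming|4"]],
})

# Join each list into one string so the file holds plain text, not a list repr
df["other examples"] = df["other examples"].apply(lambda items: " // ".join(items))

# Write to a buffer and read it back to check the round trip
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")

print(restored.loc[0, "other examples"].split(" // "))
```

Splitting on the same separator after reading recovers the original list of examples.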
