# Kanji readings

In this notebook we explore how many Japanese words are needed to cover all the readings (onyomi and kunyomi) of all the kanji.

To do this, we use the JLPT vocabulary and the [Jōyō Kanji](https://en.wikipedia.org/wiki/List_of_j%C5%8Dy%C5%8D_kanji) list, which contains:
* Kyōiku kanji (1026 kanji): the kanji that Japanese students learn in elementary school (grades 1 to 6).
* Secondary school kanji (1110 kanji): the kanji taught in middle and high school (grade 'S').

In [2]:
import pandas as pd 
import romkan

JLPT_vocab = pd.read_csv('../HSKandJLPTkanji/data/JLPT_vocab.txt', header = None, sep="\t", names=["kanji","hiragana","English","grade"])
kanji_grades = pd.read_csv('data/kanji_grades.txt', header = None, sep="\t", names=["kanji","grade","English","reading_kana","reading_romaji"])


The vocabulary dataset is processed first. Two columns are added:
* "JLPT_kanji_set": the list of kanji in each word
* "romaji": the transliteration of the word's reading

In [3]:
# Process the vocabulary set.
# Every character that appears in the "hiragana" (reading) column is treated as kana.
hiragana_set = set("".join(JLPT_vocab["hiragana"].tolist()))

# Keep only the words that have a kanji form, sorted from JLPT level 5 (easiest) to 1 (hardest).
japanese_vocabulary = JLPT_vocab[~JLPT_vocab["kanji"].isna()].sort_values(by=["grade"], ascending=False).reset_index(drop=True)
# Remove the kana from each word, leaving the list of its kanji.
japanese_vocabulary["JLPT_kanji_set"] = japanese_vocabulary["kanji"].astype(str).apply(lambda x: list(set(x) - hiragana_set))
# Transliterate the kana reading into Hepburn romaji.
japanese_vocabulary["romaji"] = japanese_vocabulary["hiragana"].astype(str).apply(lambda x: romkan.to_hepburn(x))
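
The kanji extraction above subtracts every character seen in the reading column, so characters that never appear there (for example the "/" in 丸い/円い) may remain in the set. They are harmless later on because they have no entry in the readings dictionary. As a hedged alternative, the sketch below keeps only characters in the CJK Unified Ideographs block; the helper name `extract_kanji` and the simplified range test are assumptions for illustration, not part of the notebook.

```python
def extract_kanji(word):
    """Return the distinct kanji in word, in order of first appearance."""
    seen = []
    for ch in word:
        # Assumption: U+4E00..U+9FFF (CJK Unified Ideographs) is a good enough
        # test for "is a kanji" in this vocabulary.
        if "\u4e00" <= ch <= "\u9fff" and ch not in seen:
            seen.append(ch)
    return seen


print(extract_kanji("もう一度"))   # ['一', '度']
print(extract_kanji("丸い/円い"))  # ['丸', '円']
```

Unlike the set difference, this version does not depend on which kana happen to occur in the dataset, and it preserves the order of the kanji in the word.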


Likewise, we process the kanji dataset, adding a new column ("reading_romaji_set") that contains a list of each kanji's distinct readings (in romaji).

In [4]:
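# Process the kanji set.
reading_set = kanji_grades["reading_kana"].astype(str).apply(
    # the readings are separated by the ideographic comma
    lambda x: x.split("、")
).apply(
    # skip parenthesized readings and drop the okurigana that follows "-"
    lambda x: [item.split("-")[0].strip() for item in x if not item.startswith("（")]
).apply(
    # deduplicate and transliterate each reading into Hepburn romaji
    lambda x: [romkan.to_hepburn(r) for r in set(x)]
)

kanji_grades["reading_romaji_set"] = reading_set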
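
To make the parsing step concrete, here is a minimal sketch of the split and filter logic applied to two reading strings that appear in the output further down. The function name `parse_readings` is hypothetical; unlike the cell above, it strips whitespace before testing for the parenthesis, preserves the original order, and leaves out the romaji conversion.

```python
def parse_readings(reading_kana):
    """Split a readings string and return the distinct base readings, in order."""
    result = []
    for item in reading_kana.split("、"):
        item = item.strip()
        if item.startswith("（"):
            continue  # skip readings written in full-width parentheses
        base = item.split("-")[0].strip()  # "ひと-つ" becomes "ひと"
        if base and base not in result:
            result.append(base)
    return result


print(parse_readings("イチ、イツ、ひと、ひと-つ"))  # ['イチ', 'イツ', 'ひと']
print(parse_readings("ウ、あめ、（あま）"))          # ['ウ', 'あめ']
```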

To join the kanji and vocabulary datasets, we create a dictionary `kanjireadings_words_dict`, keyed by (kanji, reading) pairs, that we will populate with example words.

In [5]:
# Build a dictionary kanji -> readings and another (kanji, reading) -> example word.
kanjireadings_dict = dict(zip(kanji_grades["kanji"], reading_set))

# Every (kanji, reading) pair starts with an empty list, meaning "no example yet".
kanjireadings_words_dict = dict()
for kanj, readings in zip(kanji_grades["kanji"], reading_set):
    for wred in readings:
        kanjireadings_words_dict[(kanj, wred)] = []

The output is built from `kanjireadings_words_dict`, so we now add its entries. We store one example word for each reading of each kanji; the easier the example, the better.

To do this, we iterate over the vocabulary from the easiest to the most difficult words (according to the JLPT level) and add each word to the matching (kanji, reading) entry if that entry is still empty.

In [6]:
# Populate kanjireadings_words_dict

# Given a row, retrieve the pronunciation of each of its kanji
def get_word_kanji_reading(row):
    res = []
    kanji_list = row["JLPT_kanji_set"]
    pronounc = row["romaji"]
    for k in kanji_list:
        readings_k = kanjireadings_dict.get(k, [])
        # keep the readings that occur in the word's romaji, longest first
        sorted_pronounc = sorted([(x, len(x)) for x in readings_k if x in pronounc], key=lambda x: -x[1])
        if len(sorted_pronounc) > 0:
            kanji_pronounc = sorted_pronounc[0][0]  # get the longest match (only the text, discard the length)
            res.append((k, kanji_pronounc))
    return res


# The vocabulary is sorted from easiest to hardest, so the first word stored for a pair is the easiest one
for i in range(0, len(japanese_vocabulary)):
    row = japanese_vocabulary.loc[i]
    kanji_readings = get_word_kanji_reading(row)
    # add the word to kanjireadings_words_dict (if the entry is still empty)
    for kr in kanji_readings:
        if kr in kanjireadings_words_dict:
            example = kanjireadings_words_dict[kr]
            if len(example) == 0:
                kanjireadings_words_dict[kr] = row[["kanji", "grade", "hiragana", "English"]]
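
The two ideas in this cell, longest substring match and "first word wins", can be seen in isolation in the following sketch. It uses plain Python and hand-written romaji strings (assumed for illustration, not produced by `romkan` here); the names `longest_matching_reading`, `readings_by_kanji` and `vocabulary` are hypothetical.

```python
def longest_matching_reading(readings, word_romaji):
    """Return the longest reading that occurs in word_romaji, or None."""
    candidates = [r for r in readings if r in word_romaji]
    return max(candidates, key=len) if candidates else None


# Assumed romaji readings for a single kanji.
readings_by_kanji = {"一": ["ichi", "itsu", "hito"]}

# (word, romaji, kanji in the word), ordered from easiest to hardest.
vocabulary = [
    ("もう一度", "mouichido", ["一", "度"]),
    ("一人", "hitori", ["一", "人"]),
    ("同一", "douitsu", ["同", "一"]),
    ("一番", "ichiban", ["一", "番"]),
]

examples = {}
for word, romaji, kanji_in_word in vocabulary:
    for k in kanji_in_word:
        reading = longest_matching_reading(readings_by_kanji.get(k, []), romaji)
        if reading is not None:
            # setdefault keeps the first (easiest) word stored for each pair
            examples.setdefault((k, reading), word)

print(examples)
```

Here 一番 is ignored because the pair (一, ichi) already has an easier example. Note that the match is a plain substring test on the whole word, so a reading can be attributed to a kanji even when it is actually contributed by a neighbouring character; the longest-match rule reduces this but does not rule it out.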


Iterate over the kanji set and prepare the output dataset.

In [7]:
output=[]
for i in range(len(kanji_grades)):
    kanji_grade_row=kanji_grades.loc[i]
    kanji=kanji_grade_row["kanji"]
    reading_romaji_set=kanji_grade_row["reading_romaji_set"]
    for r in reading_romaji_set:
        word_example=kanjireadings_words_dict.get((kanji,r),[])
        if len(word_example)>0:
            out_example = [str(i) ,
                kanji,
                kanji_grade_row["grade"],
                kanji_grade_row["English"],
                kanji_grade_row["reading_kana"],
                word_example["kanji"],
                word_example["grade"],
                word_example["hiragana"],
                word_example["English"] ]
            output.append(out_example)

dfOut = pd.DataFrame(output,columns=["idx","kanji","grade","meaning","readings","word_example","JLPT","word_hiragana","word_meaning"]) 


In [8]:
dfOut.head()

Unnamed: 0,idx,kanji,grade,meaning,readings,word_example,JLPT,word_hiragana,word_meaning
0,0,一,1,one,イチ、イツ、ひと、ひと-つ,もう一度,5,もういちど,again
1,0,一,1,one,イチ、イツ、ひと、ひと-つ,一人,5,ひとり,one person
2,0,一,1,one,イチ、イツ、ひと、ひと-つ,同一,3,どういつ,identity; sameness; similarity
3,1,右,1,right,ウ、ユウ、みぎ,左右,3,さゆう,left and right; influence; control; domination
4,1,右,1,right,ウ、ユウ、みぎ,右,5,みぎ,right side


In [9]:
dfOut.to_csv('data/out',sep='\t')

In [10]:
pd.set_option('display.max_rows', None)
dfOut

Unnamed: 0,idx,kanji,grade,meaning,readings,word_example,JLPT,word_hiragana,word_meaning
0,0,一,1,one,イチ、イツ、ひと、ひと-つ,もう一度,5,もういちど,again
1,0,一,1,one,イチ、イツ、ひと、ひと-つ,一人,5,ひとり,one person
2,0,一,1,one,イチ、イツ、ひと、ひと-つ,同一,3,どういつ,identity; sameness; similarity
3,1,右,1,right,ウ、ユウ、みぎ,左右,3,さゆう,left and right; influence; control; domination
4,1,右,1,right,ウ、ユウ、みぎ,右,5,みぎ,right side
5,2,雨,1,rain,ウ、あめ、（あま）,雨,5,あめ,rain
6,2,雨,1,rain,ウ、あめ、（あま）,梅雨,3,つゆ,rainy season; rain during the rainy season
7,3,円,1,"round, yen",エン、まる-い,丸い/円い,5,まるい,round; circular
8,3,円,1,"round, yen",エン、まる-い,円,3,えん,circle; money
9,4,王,1,king,オウ,王子,3,おうじ,prince


## Analysis of the output

In [11]:
print("Number of pronunciations: " + str(len(dfOut)))

def num_dif_words(df):
    return str(len(set(df["word_example"].to_list())))


print("Number of different words: " + num_dif_words(dfOut))
for g in ["1", "2", "3", "4", "5", "6", "S"]:
    print("Number of different words (grade " + g + "): " + num_dif_words(dfOut[dfOut["grade"] == g]))



Number of pronunciations: 2834
Number of different words: 2526
Number of different words (grade 1): 176
Number of different words (grade 2): 312
Number of different words (grade 3): 345
Number of different words (grade 4): 314
Number of different words (grade 5): 278
Number of different words (grade 6): 282
Number of different words (grade S): 984
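
The per-grade counts add up to 2691, which is more than the 2526 distinct words overall. One explanation consistent with the code is that the same word can serve as the example for kanji of different grades, so it is counted once in each of those grades but only once overall. A small sketch with made-up rows (the grades and word labels below are hypothetical, not taken from the real output):

```python
from collections import defaultdict

# Hypothetical (kanji grade, example word) rows.
rows = [("1", "w1"), ("1", "w2"), ("2", "w1"), ("2", "w3")]

# Distinct words per grade.
words_by_grade = defaultdict(set)
for grade, word in rows:
    words_by_grade[grade].add(word)

per_grade = {g: len(words) for g, words in words_by_grade.items()}
overall = len({word for _, word in rows})

print(per_grade)                # {'1': 2, '2': 2}
print(sum(per_grade.values()))  # 4
print(overall)                  # 3
```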


Number of examples (one per kanji reading) by kanji grade and by JLPT level of the example word:

In [12]:
grade_JLPT_conf=dfOut.pivot_table(index='JLPT', columns='grade',values='word_example', aggfunc='count')
grade_JLPT_conf

JLPT,grade 1,grade 2,grade 3,grade 4,grade 5,grade 6,grade S
1,12,27,29,43,44,64,411
2,16,26,34,43,34,55,244
3,30,82,129,150,157,102,253
4,29,57,83,64,43,32,72
5,98,146,92,34,12,37,50
