# Chinese HSK Vocabulary

In this document we explore the Chinese vocabulary of the HSK word lists. We look for a reduced set of words that together covers all of the hanzi (characters) in those lists.

We also build a dataset of the complete vocabulary, annotated with example words. In particular, for each character of a word we attach the word from the reduced set that contains that character.

In [1]:
import pandas as pd
import re

#Load data
vocab='data/allHSKwords'
HSK_words = pd.read_csv(vocab,sep="\t",header=None, names=["level","word","English","pinyin","POS"], dtype=str)


In [2]:
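#Get hanzi info dictionary: hanzi -> [pinyin, level, [word, English, pinyin of the word]]
hanzi_info=dict()
#Iterate in reverse so that the first occurrence of each hanzi in the file is the one kept
for i in reversed(range(0,len(HSK_words))):
    word=HSK_words.loc[i]["word"]
    wordl=list(word)
    level=HSK_words.loc[i]["level"]
    pinyin=HSK_words.loc[i]["pinyin"]
    english=HSK_words.loc[i]["English"]
    #One syllable per tone digit; the raw string avoids the invalid "\d" escape warning
    pinyin_split=re.findall(r'.*?\d', pinyin)
    for li in range(0,len(wordl)):
        #strip() drops the leading space left by pinyin such as "da3 dian4hua4"
        hanzi_info[wordl[li]]=[pinyin_split[li].strip(),int(level),[word,english,pinyin]]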
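
The pinyin splitting in the cell above relies on every syllable ending in a tone digit. The following is a minimal sketch of that step in isolation; `split_pinyin` is a hypothetical helper name, not part of the notebook, and it assumes numbered pinyin with one digit per syllable (neutral tone written as `5`).

```python
import re

def split_pinyin(pinyin):
    """Split numbered pinyin into syllables, one per tone digit (hypothetical helper)."""
    # Non-greedy match up to and including the next digit, then drop stray spaces.
    return [syllable.strip() for syllable in re.findall(r".*?\d", pinyin)]

print(split_pinyin("huo3che1zhan4"))  # ['huo3', 'che1', 'zhan4']
print(split_pinyin("da3 dian4hua4"))  # ['da3', 'dian4', 'hua4']
```

If a syllable had no tone digit, it would be merged into the next match, and the number of syllables would no longer equal the number of characters, so a length check before indexing may be worth adding.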


## Dataset with information of the hanzi

In this first section we build a dataset of the characters. For each character we record its pinyin, the HSK level at which it first appears, and an example word that contains it.
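
The "first appears" level comes from filling the dictionary while walking the word list backwards, so that earlier rows overwrite later ones. A small sketch on made-up rows illustrates the idea; the rows and the name `first_seen` are illustrative assumptions, and the sketch assumes the list is sorted by ascending level.

```python
# Made-up (level, word) rows, assumed sorted by ascending level.
rows = [
    (1, "今天"),
    (1, "天气"),
    (2, "每天"),
]

first_seen = {}
# Walk from the last row to the first, so earlier rows overwrite later ones.
for level, word in reversed(rows):
    for char in word:
        first_seen[char] = (level, word)

print(first_seen["天"])  # (1, '今天')
print(first_seen["每"])  # (2, '每天')
```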

In [3]:
#Convert the dictionary to a dataframe
hanzi_info_list=list()
for k in hanzi_info.keys():
    [pinyin,level,word_sample]=hanzi_info[k]
    item=[k,pinyin,level," | ".join(word_sample)]
    hanzi_info_list.append(item)

hanzi_info_pd = pd.DataFrame(hanzi_info_list) 
hanzi_info_pd.columns =['hanzi', 'pinyin', 'level', 'word_example']
hanzi_info_pd=hanzi_info_pd.sort_values(by='level').reset_index(drop=True)

#Save and print
hanzi_info_pd.to_csv('output/hanzi_level',sep='\t')
hanzi_info_pd


Unnamed: 0,hanzi,pinyin,level,word_example
0,做,zuo4,1,做 | to do | zuo4
1,火,huo3,1,火车站 | railway station | huo3che1zhan4
2,四,si4,1,四 | four | si4
3,太,tai4,1,太 | too | tai4
4,天,tian1,1,今天 | today | jin1tian1
5,机,ji1,1,飞机 | airplane | fei1ji1
6,星,xing1,1,星期 | week | xing1qi1
7,打,da3,1,打電話 | to telephone | da3 dian4hua4
8,老,lao3,1,老师 | teacher | lao3shi1
9,见,jian4,1,看见 | to see | kan4jian4


## Vocabulary Unique Hanzi

In this section we build a dataset with an (almost) smallest vocabulary that covers all the hanzi: the distinct example words collected in the previous section. For each word we also include the minimum and maximum of the HSK levels at which its characters first appear.
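
As a small illustration of the two level columns, the sketch below computes them for one word. The levels in `char_level` are made-up values for illustration, and `level_range` is a hypothetical helper name.

```python
# Made-up first-appearance levels for three characters (illustrative only).
char_level = {"图": 3, "书": 1, "馆": 2}

def level_range(word, char_level):
    """Return (min, max) of the first-appearance levels of the word's characters."""
    levels = [char_level[c] for c in word]
    return min(levels), max(levels)

print(level_range("图书馆", char_level))  # (1, 3)
```

Read this way, `max_level` is the level by which every character of the word has already appeared.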

In [4]:
#Distinct example words; split on the same " | " separator that was used to join them
word_example_uniq=list(set(hanzi_info_pd['word_example']))
word_example = pd.DataFrame([x.split(" | ") for x in word_example_uniq])


min_level=[]
max_level=[]
for i in range(0,len(word_example)):
    [word,English,pinyin]=word_example.loc[i]
    levels=[hanzi_info[x][1] for x in list(word.strip())]
    min_level.append(min(levels))
    max_level.append(max(levels))   

word_example['min_level']=min_level
word_example['max_level']=max_level

word_example=word_example.sort_values(by='max_level').reset_index(drop=True)
word_example.to_csv('output/word_example',sep='\t')
word_example

Unnamed: 0,0,1,2,min_level,max_level
0,现在,now,xian4zai4,1,1
1,医院,hospital,yi1yuan4,1,1
2,今天,today,jin1tian1,1,1
3,来,to come,lai2,1,1
4,请,please,qing3,1,1
5,岁,"years, age",sui4,1,1
6,七,seven,qi1,1,1
7,上,above,shang4,1,1
8,饭店,restaurant,fan4dian4,1,1
9,对不起,sorry,dui4bu5qi3,1,1
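
The selection above keeps one example word per character and removes duplicates, which is why the result is only almost minimal. For comparison, a greedy set-cover heuristic is one common alternative. The sketch below is not the notebook's method; `greedy_cover` and the toy word list are assumptions made for illustration.

```python
def greedy_cover(words):
    """Greedy set-cover heuristic: keep picking the word that adds the most uncovered characters."""
    uncovered = set("".join(words))
    chosen = []
    while uncovered:
        # max() returns the first word with the largest gain, so ties go to list order.
        best = max(words, key=lambda w: len(uncovered & set(w)))
        chosen.append(best)
        uncovered -= set(best)
    return chosen

words = ["今天", "天气", "今", "气", "天"]
print(greedy_cover(words))  # ['今天', '天气']
```

The greedy choice is also a heuristic and does not guarantee a minimum cover, and it ignores HSK level, which the notebook's approach uses to prefer early words.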


## Dataset with Information of the Vocabulary

Finally, we build a dataset of the full vocabulary. To each word we attach, for each of its characters, the example word recorded for that character.

In [5]:
def get_example_words(word):
    #For each character, join its example [word, English, pinyin] with "|";
    #the examples of different characters are separated by "|||"
    list_words=[]
    for char in list(word):
        list_words.append("|".join(hanzi_info[char][2]))
    return "|||".join(list_words)

HSK_words_wordcolumn=HSK_words['word'].tolist()
#Split each word into characters and look up each character's example word (index 2 of its hanzi_info entry)
HSK_words_wordcolumn_set=[get_example_words(word) for word in HSK_words_wordcolumn]

HSK_words['wordset']=HSK_words_wordcolumn_set

#Save and print
HSK_words.to_csv('output/HSK_words_with_examples',sep='\t')
HSK_words

Unnamed: 0,level,word,English,pinyin,POS,wordset
0,1,爱,to love,ai4,verb,爱|to love|ai4
1,1,八,eight,ba1,numeral,八|eight|ba1
2,1,爸爸,father,ba4ba5,noun,爸爸|father|ba4ba5|||爸爸|father|ba4ba5
3,1,杯子,cup,bei1zi5,noun,杯子|cup|bei1zi5|||杯子|cup|bei1zi5
4,1,北京,Beijing,Bei3jing1,noun,北京|Beijing|Bei3jing1|||北京|Beijing|Bei3jing1
5,1,本,volume,ben3,measure word,本|volume|ben3
6,1,不客气,you're welcome,bu2ke4qi5,verb,不客气|you're welcome|bu2ke4qi5|||不客气|you're welc...
7,1,不,no,bu4,adverb,不客气|you're welcome|bu2ke4qi5
8,1,菜,dish,cai4,noun,菜|dish|cai4
9,1,茶,tea,cha2,noun,茶|tea|cha2
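
The `wordset` column packs one example per character, with fields separated by `|` and examples separated by `|||`. A minimal sketch of unpacking it when reading the file back follows; `parse_wordset` is a hypothetical helper, and it assumes no English gloss contains a `|` character.

```python
def parse_wordset(wordset):
    """Unpack a wordset string into (word, English, pinyin) tuples, one per character."""
    # Split on the example separator first, then on the field separator.
    return [tuple(entry.split("|")) for entry in wordset.split("|||")]

print(parse_wordset("北京|Beijing|Bei3jing1|||北京|Beijing|Bei3jing1"))
# [('北京', 'Beijing', 'Bei3jing1'), ('北京', 'Beijing', 'Bei3jing1')]
```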
