# Chinese Character Statistics

This notebook explores statistics of Chinese characters. I'll look at how the frequency of characters in the book compares to the frequency of Chinese characters in the language overall.

Here are some basic things to figure out:

- Get a good list of the most frequently occurring Chinese characters

Here are some questions to answer:

- How many characters are there in Chinese? Find an authoritative source on this.
- How many unique characters are used in the book?
- How many Chinese words are used in the book? Can this be answered from the parsed data?
- What is the average character length of a word?
- Which characters are used in the most words?

And here are some things that will be interesting to visualize:
- What is the frequency distribution of the characters used?
- Histogram of paragraph lengths. How long are paragraphs in the book?
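
As a rough sketch of the paragraph-length question, the lengths can be bucketed into fixed-width bins before plotting. The function name `paragraph_length_histogram`, the `bin_size` parameter, and the sample paragraphs are assumptions for illustration and are not taken from the book data.

```python
from collections import Counter

def paragraph_length_histogram(paragraphs, bin_size=50):
    """Bucket paragraph lengths (in characters) into bins of width bin_size."""
    bins = Counter()
    for paragraph in paragraphs:
        # Lower edge of the bin this paragraph falls into
        bin_start = (len(paragraph) // bin_size) * bin_size
        bins[bin_start] += 1
    return dict(sorted(bins.items()))

# Hypothetical sample paragraphs, not taken from the book
sample_paragraphs = ["你好世界", "你好", "世界" * 30]
histogram = paragraph_length_histogram(sample_paragraphs, bin_size=50)

# Crude text rendering: one '#' per paragraph in each bin
for bin_start, n in histogram.items():
    print(f"{bin_start:>5}+ {'#' * n}")
```

With the real data, the input would presumably be the `paragraphs` list of each chapter in `book_data.json`.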

In [1]:
import spacy

In [2]:
spacy.cli.download('zh_core_web_lg')

Collecting zh-core-web-lg==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/zh_core_web_lg-3.6.0/zh_core_web_lg-3.6.0-py3-none-any.whl (603.0 MB)
✔ Download and installation successful
You can now load the package via spacy.load('zh_core_web_lg')


In [3]:

zh_parser = spacy.load('zh_core_web_lg') ## disable=["parser"]

In [4]:
from spacy.lang.zh.examples import sentences

In [12]:
import json

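# Read-only access is enough here; the book data is UTF-8 encoded JSON
with open("../data/books/three_body/book_data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Only parse the first chapter for now
chapters = data["chapters"][:1]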

for chapter in chapters:
    chapter_text = "".join(chapter["paragraphs"])
    print(chapter_text)
    doc = zh_parser(chapter_text)
    for token in doc:
        # https://spacy.io/usage/linguistic-features#pos-tagging
        print(token.text, token.pos_, token.dep_)

“基石”是个平实的词，不够“炫”，却能够准确传达我们对构建中的中国科幻繁华巨厦的情感与信心，因此，我们用它来作为这套原创丛书的名字。最近十年，是科幻创作飞速发展的十年。王晋康、刘慈欣、何宏伟、韩松等一大批科幻作家发表了大量深受读者喜爱、极具开拓与探索价值的科幻佳作。科幻文学的龙头期刊更是从一本传统的《科幻世界》，发展壮大成为涵盖各个读者层的系列刊物。与此同时，科幻文学的市场环境也有了改善，省会级城市的大型书店里终于有了属于科幻的领地。仍然有人经常问及中国科幻与美国科幻的差距，但现在的答案已与十年前不同。在很多作品上（它们不再是那种毫无文学技巧与色彩、想象力拘谨的幼稚故事），这种比较已经变成了人家的牛排之于我们的土豆牛肉。差距是明显的——更准确地说，应该是“差别”——却已经无法再为它们排个名次。口味问题有了实际意义，这正是我们的科幻走向成熟的标志。与美国科幻的差距，实际上是市场化程度的差距。美国科幻从期刊到图书到影视再到游戏和玩具，已经形成了一条完整的产业链，动力十足；而我们的图书出版却仍然处于这样一种局面：读者的阅读需求不能满足的同时，出版者却感叹于科幻书那区区几千册的销量。结果，我们基本上只有为热爱而创作的科幻作家，鲜有为版税而创作的科幻作家。这不是有责任心的出版人所乐于看到的现状。科幻世界作为我国最有影响力的专业科幻出版机构，一直致力于对中国科幻的全方位推动。科幻图书出版是其中的重点之一。中国科幻需要长远眼光，需要一种务实精神，需要引入更市场化的手段，因而我们着眼于远景，而着手之处则在于一块块“基石”。需要特别说明的是，对于基石，我们并没有什么限定。因为，要建一座大厦需要各种各样的石料。对于那样一座大厦，我们满怀期待。（姚海军，《科幻世界》副总编）
“ PUNCT punct
基石 NOUN nsubj
” PUNCT punct
是 VERB cop
个 NUM nummod
平实 ADJ amod
的 PART case
词 NOUN ROOT
， PUNCT punct
不够 ADV advmod
“ PUNCT punct
炫 VERB conj
” PUNCT punct
， PUNCT punct
却 ADV advmod
能够 VERB aux:modal
准确 ADV advmod
传达 VERB conj
我们 PRON dep
对 ADP c
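
The word-level questions from the introduction (average word length, and which characters appear in the most words) could be answered from the token texts along these lines. This is a minimal sketch: `is_cjk` and `word_stats` are hypothetical helpers, the range check only covers the basic CJK Unified Ideographs block, and the average is taken over distinct words rather than over every occurrence.

```python
from collections import defaultdict

def is_cjk(char):
    # Basic CJK Unified Ideographs block only (an assumption; extensions are ignored)
    return "\u4e00" <= char <= "\u9fff"

def word_stats(words):
    """Return (average length of distinct words, {char: number of distinct words containing it})."""
    # Keep only words made up entirely of Chinese characters (drops punctuation, Latin text, ...)
    unique_words = {w for w in words if w and all(is_cjk(c) for c in w)}
    if not unique_words:
        return 0.0, {}
    avg_len = sum(len(w) for w in unique_words) / len(unique_words)
    words_per_char = defaultdict(set)
    for w in unique_words:
        for c in w:
            words_per_char[c].add(w)
    return avg_len, {c: len(ws) for c, ws in words_per_char.items()}

# Hypothetical token texts; with spaCy this could be [token.text for token in doc]
tokens = ["科幻", "作家", "科幻", "，", "科学", "的"]
avg_len, char_word_counts = word_stats(tokens)
print(avg_len, char_word_counts)
```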

In [6]:
import unicodedata

def is_simplified_chinese(char):
    # Note: this matches any CJK Unified Ideograph, so traditional characters pass too
    return 'CJK UNIFIED IDEOGRAPH' in unicodedata.name(char, '')

print(is_simplified_chinese('你')) # True

True


In [20]:
print(is_simplified_chinese('鱄')) # True: any CJK Unified Ideograph passes the check

True
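
The Unicode-name test does not separate simplified from traditional forms, because both live in the same CJK Unified Ideographs blocks. The sketch below illustrates this with a simplified/traditional pair; `is_cjk_ideograph` is a hypothetical name that applies the same test as the function above.

```python
import unicodedata

def is_cjk_ideograph(char):
    # Same test as above: the Unicode character name contains "CJK UNIFIED IDEOGRAPH"
    return "CJK UNIFIED IDEOGRAPH" in unicodedata.name(char, "")

# 体 is the simplified form and 體 the traditional form of the same character;
# the full-width comma and the Latin letter are included as negative cases
results = {char: is_cjk_ideograph(char) for char in "体體，a"}
print(results)
```

Telling simplified and traditional characters apart would need an external mapping table, which is outside what `unicodedata` provides.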


In [15]:
import unicodedata

def is_simplified_chinese(char):
    return 'CJK UNIFIED IDEOGRAPH' in unicodedata.name(char, '')

def count_characters(file_path):
    """
    Counts the characters in the given file.

    Args:
    file_path (str): The path to the file.

    Returns:
    dict: A dictionary with characters as keys and their counts as values.
    """
    char_count = {}

    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            for line in file:
                for char in line:
                    if is_simplified_chinese(char):
                        char_count[char] = char_count.get(char, 0) + 1
    except FileNotFoundError:
        print(f"The file at {file_path} was not found.")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

    return char_count
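
For comparison, the same tally can be written with `collections.Counter`. This is an alternative sketch, not the function used in the rest of the notebook; `count_cjk_characters` is a hypothetical name, and it takes a string rather than a file path so it can be tried without any data files.

```python
import unicodedata
from collections import Counter

def count_cjk_characters(text):
    """Count CJK Unified Ideographs in a string, ignoring punctuation and other characters."""
    return Counter(
        char for char in text
        if "CJK UNIFIED IDEOGRAPH" in unicodedata.name(char, "")
    )

sample_counts = count_cjk_characters("你好，世界！你好！")
# most_common(n) returns the n highest counts as (character, count) pairs
print(sample_counts.most_common(2))
```

A `Counter` is a `dict` subclass, so code that expects the plain dictionary returned by `count_characters` should keep working with it.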

In [16]:
# Example usage
file_path = '../data/sample/file.txt'
print(count_characters(file_path))

{'你': 2, '好': 2, '世': 1, '界': 1}


In [17]:
counts = count_characters('../data/books/three_body/full_text_cn.txt')

In [18]:
len(counts)

2845

In [19]:
counts["汪"]

671
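
One way to start on the frequency-distribution question is to ask what share of the text the most common characters cover. The sketch below assumes a dictionary shaped like `counts`; `cumulative_coverage` and the sample numbers are made up for illustration.

```python
def cumulative_coverage(counts, top_n):
    """Fraction of all counted characters accounted for by the top_n most frequent ones."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sorted(counts.values(), reverse=True)[:top_n]
    return sum(top) / total

# Hypothetical counts; in the notebook this would be the `counts` dict built above
sample = {"的": 5, "一": 3, "是": 1, "了": 1}
coverage = cumulative_coverage(sample, top_n=2)
print(f"top 2 characters cover {coverage:.0%} of the sample")
```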

In [33]:
# create a dictionary of characters with frequency ranks
# https://lingua.mtsu.edu/chinese-computing/statistics/char/list.php?Which=MO


import csv

def get_frequency_dict(file_path):
    """
    This function assumes that the order of the columns is:

    order, character, frequency, percentage, pinyin, english
    """

    d = {}

    # Open the TSV file (assumes it is UTF-8 encoded, like the book text)
    with open(file_path, encoding="utf-8", newline="") as file:
        # Parse tab-separated rows
        tsv_file = csv.reader(file, delimiter="\t")

        for line in tsv_file:
            order = line[0]
            character = line[1]
            frequency = line[2]
            percentage = line[3]
            # pinyin and english are missing on some rows
            pinyin = line[4] if len(line) > 4 else None
            english = line[5] if len(line) > 5 else None

            d[line[1]] = {
                'order': order,
                'character': character,
                'frequency': frequency,
                'percentage': percentage,
                'pinyin': pinyin,
                'english': english,
            }

        return d

In [None]:
FILE_PATH = "../data/language/frequency.tsv"
frequencies = get_frequency_dict(FILE_PATH)
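
With both dictionaries available, the comparison this notebook is after (rank in the book versus rank in the language overall) might look like the sketch below. `rank_differences` is a hypothetical helper, the sample ranks are invented, and it assumes `order` is a string holding an integer rank, as `get_frequency_dict` leaves it.

```python
def rank_differences(counts, frequencies):
    """Return [(char, book_rank, overall_rank, overall_rank - book_rank)], largest gap first."""
    by_count = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows = []
    for book_rank, (char, _) in enumerate(by_count, start=1):
        entry = frequencies.get(char)
        if entry is None:
            continue  # character is not in the reference list
        try:
            overall_rank = int(entry["order"])
        except (TypeError, ValueError):
            continue  # non-numeric rank, e.g. a header row
        rows.append((char, book_rank, overall_rank, overall_rank - book_rank))
    # Stable sort: ties keep their book-rank order
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Hypothetical data shaped like `counts` and `frequencies`; the ranks are made up
sample_counts = {"的": 10, "汪": 7, "是": 5}
sample_frequencies = {"的": {"order": "1"}, "是": {"order": "3"}, "汪": {"order": "2000"}}
rows = rank_differences(sample_counts, sample_frequencies)
print(rows[0])
```

Characters with a large positive gap are far more common in the book than in the language overall, which is where names such as 汪 and 淼 would be expected to show up.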

In [70]:
# combine counts and frequencies into a single list of lists, one per character:
# [<character>, <occurrences>, <overall_rank>, <english>, <pinyin>]
# save this list of lists to a json file
# use this file for the visualization (replace old file, output.json)
import json

def build_output(counts, frequencies):
    ret = []
    for c in sorted(counts.items(), key=lambda x: x[1], reverse=True):
        print(c)
        obj = [
            c[0], # character
            c[1], # occurrences
            frequencies.get(c[0], {"order": -1}).get("order"), # rank
            frequencies.get(c[0], {"english": "no translation"}).get("english"), # english
            frequencies.get(c[0], {"pinyin": "-"}).get("pinyin"), # pinyin
        ]

        ret.append(obj)

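    # ensure_ascii=False writes raw Chinese characters, so set the encoding explicitly
    with open("../data/language/counts.json", "w", encoding="utf-8") as f: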
        json.dump(ret, f, ensure_ascii=False)

In [71]:
build_output(counts, frequencies)

('的', 7851)
('一', 3208)
('是', 2582)
('了', 2528)
('在', 2099)
('这', 1938)
('个', 1697)
('不', 1639)
('有', 1499)
('我', 1312)
('到', 1300)
('人', 1279)
('上', 1222)
('他', 1218)
('大', 1178)
('地', 1106)
('中', 1102)
('那', 1099)
('们', 1096)
('文', 978)
('来', 961)
('说', 946)
('着', 905)
('时', 891)
('出', 857)
('你', 847)
('就', 812)
('看', 810)
('能', 749)
('后', 705)
('三', 700)
('下', 695)
('她', 679)
('子', 677)
('汪', 671)
('和', 630)
('没', 622)
('体', 619)
('都', 610)
('过', 602)
('对', 599)
('现', 594)
('淼', 592)
('很', 577)
('发', 574)
('么', 573)
('也', 563)
('可', 556)
('成', 536)
('但', 533)
('面', 523)
('天', 520)
('叶', 510)
('要', 488)
('以', 485)
('为', 483)
('学', 483)
('洁', 470)
('自', 463)
('只', 452)
('去', 450)
('然', 450)
('太', 448)
('明', 447)
('行', 435)
('会', 429)
('道', 423)
('得', 421)
('想', 420)
('些', 413)
('它', 408)
('样', 407)
('起', 407)
('多', 404)
('生', 401)
('里', 396)
('于', 391)
('开', 389)
('动', 389)
('星', 388)
('还', 377)
('知', 367)
('问', 364)
('最', 364)
('球', 361)
('种', 361)
('小', 360)
('什', 359)
('间', 359)
('