# Introduction

https://mlbootcamp.ru/round/26/tasks/

Chinese is a wonderful language. Let's see what we can build to understand the meaning of a sentence and make a proper translation system.

Our main task is to produce a Russian translation of a Chinese text.

In [7]:
import pandas as pd
import numpy as np
import jieba
import pinyin.cedict
import pickle

from matplotlib import pyplot as plt

Here are a few random sentences from BBC News.

In [19]:
text = "摆在唐宁街面前的是一系列重大决策：何时行动？解除哪些限制？如何控制病毒传播？如何权衡短期内挽救生命、长期内挽救经济和社会？"

A typical Chinese word consists of 2 or 3 characters, and there are no spaces between words. Let's use the jieba library to split the text into words.

In [28]:
words = jieba.lcut(text)

In [48]:
words[:14]

['摆在',
 '唐宁街',
 '面前',
 '的',
 '是',
 '一系列',
 '重大',
 '决策',
 '：',
 '何时',
 '行动',
 '？',
 '解除',
 '哪些']
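
To illustrate why segmentation needs a vocabulary at all, here is a minimal sketch of forward maximum matching, a classic dictionary-based baseline. It is not how jieba works internally; the function name and the toy vocabulary (taken from the tokens shown above) are assumptions made for illustration only.

```python
def forward_max_match(text, vocab, max_len=3):
    """Greedy left-to-right segmentation: always take the longest vocabulary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in vocab:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy vocabulary (hypothetical): multi-character words from the output above.
toy_vocab = {"摆在", "唐宁街", "面前", "一系列", "重大", "决策"}
sample = "摆在唐宁街面前的是一系列重大决策"

segmented = forward_max_match(sample, toy_vocab)
print(segmented)
```

With this vocabulary the greedy matcher reproduces the first eight tokens of the jieba output, but it fails as soon as a word is missing from the vocabulary, which is why a real segmenter relies on a large dictionary and statistics.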

We can get a rough idea of what the text is about by looking up each word in the Chinese-English dictionary bundled with the pinyin library.

In [44]:
trans = []
for word in words:
    tr = pinyin.cedict.translate_word(word)
    if tr:
        trans.append(tr[0])
    else:
        trans.append("—")

In [47]:
trans[:14]

['—',
 'Downing Street (London)',
 'in front of',
 'aim',
 'variant of 是[shi4]',
 'a series of',
 'great',
 'strategic decision',
 '—',
 'when',
 'operation',
 '—',
 'to remove',
 'which ones?']

# Baseline

A naive approach is to translate word by word with a dictionary. We take the first Chinese-Russian dictionary that we can find, clean it, and use it for translation.

In [24]:
with open("zh_ru.pkl", "rb") as f:
    zh_ru = pickle.load(f)

In [52]:
len(zh_ru)

2616

In [53]:
def translate(word):
    # Return the list of Russian translations, or the placeholder
    # "пушкин" ("Pushkin") for words missing from the dictionary.
    return zh_ru.get(word, "пушкин")

In [56]:
for word in words:
    print(word, translate(word))

摆在 пушкин
唐宁街 пушкин
面前 пушкин
的 пушкин
是 ['представлять', 'представить']
一系列 ['серия']
重大 пушкин
决策 пушкин
： пушкин
何时 пушкин
行动 пушкин
？ пушкин
解除 пушкин
哪些 пушкин
限制 пушкин
？ пушкин
如何 ['как?']
控制 пушкин
病毒传播 пушкин
？ пушкин
如何 ['как?']
权衡 пушкин
短期内 пушкин
挽救 пушкин
生命 ['жизнь']
、 пушкин
长期 пушкин
内 пушкин
挽救 пушкин
经济 ['экономика']
和 ['с']
社会 пушкин
？ пушкин
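
The printout above suggests the dictionary covers only a small part of the text. One way to quantify this is token coverage: the share of tokens that have a dictionary entry. The sketch below is a rough illustration; the helper names `coverage` and `is_punctuation` are hypothetical, and the token list and dictionary entries are copied from the output above rather than loaded from zh_ru.pkl.

```python
import unicodedata


def is_punctuation(token):
    """True if every character of the token is in a Unicode punctuation category."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def coverage(tokens, dictionary, skip_punctuation=False):
    """Return (hits, total): how many tokens have a dictionary entry."""
    if skip_punctuation:
        tokens = [t for t in tokens if not is_punctuation(t)]
    hits = sum(1 for t in tokens if t in dictionary)
    return hits, len(tokens)


# Tokens exactly as printed above.
tokens = [
    "摆在", "唐宁街", "面前", "的", "是", "一系列", "重大", "决策", "：",
    "何时", "行动", "？", "解除", "哪些", "限制", "？", "如何", "控制",
    "病毒传播", "？", "如何", "权衡", "短期内", "挽救", "生命", "、",
    "长期", "内", "挽救", "经济", "和", "社会", "？",
]

# Only the entries that were found in the printout above.
sample_dict = {
    "是": ["представлять", "представить"],
    "一系列": ["серия"],
    "如何": ["как?"],
    "生命": ["жизнь"],
    "经济": ["экономика"],
    "和": ["с"],
}

hits_all, total_all = coverage(tokens, sample_dict)
hits_words, total_words = coverage(tokens, sample_dict, skip_punctuation=True)
print(f"all tokens: {hits_all}/{total_all}")
print(f"without punctuation: {hits_words}/{total_words}")
```

On this text only 7 of 33 tokens are translated (7 of 27 once punctuation is ignored), which gives a concrete number to beat with a better dictionary or model.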


In [68]:
import json

# ensure_ascii=False writes Chinese and Russian characters as-is instead of \uXXXX escapes.
with open('data.json', 'w', encoding='utf8') as f:
    json.dump(zh_ru, f, ensure_ascii=False)

In [70]:
with open('data.json', 'r', encoding='utf8') as f:
    a = json.load(f)

In [71]:
a['而']

['а']
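
To show what the ensure_ascii flag changes, here is a small sketch that serializes the single entry looked up above both ways. It works on an in-memory string instead of data.json, and the variable names are chosen only for illustration.

```python
import json

# The entry looked up above.
entry = {"而": ["а"]}

escaped = json.dumps(entry)                       # default: ensure_ascii=True
readable = json.dumps(entry, ensure_ascii=False)  # keeps the original characters

print(escaped)
print(readable)

# Both forms decode back to the same dictionary.
restored = json.loads(escaped)
print(restored == json.loads(readable) == entry)
```

Both variants are valid JSON and load back identically; the readable form is simply easier to inspect by eye and is smaller on disk for non-ASCII text.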