# Работа с текстами на русском языке

В языках мира существуют две основных группы способов выражения грамматических значений: 

- синтетические способы
- аналитические способы

Для **синтетических способов** характерно соединение грамматического показателя с самим словом. Таким показателем, вносящим грамматическое значение ≪внутрь слова≫, могут быть окончание, суффикс, приставка, внутренняя флексия и т.д.

Общей чертой **аналитических способов** является выражение грамматического значения за пределами слова, отдельно от него — например, с помощью предлогов, союзов, артиклей, вспомогательных глаголов и других служебных слов, а также с помощью порядка слов.

Существеное отличие в работе с русским и английским текстом заключается в том, что для русского языка более хврактерен синтетический способ, а для английского -- аналитический.

### Постановка задачи

При работе с большим объёмом данных важно поддерживать их чистоту. А при заполнении заявки на банковский продукт необходимо указывать полные паспортные данные, в том числе и поле «кем выдан паспорт», число различных вариантов написаний одного и того же отделения потенциальными клиентами может достигать нескольких сотен. Важно понимать, не ошибся ли клиент, заполняя другие поля: «код подразделения», «серию/номер паспорта». Для этого необходимо сверять «код подразделения» и «кем выдан паспорт».
Задача заключается в том, чтобы проставить коды подразделений для записей из тестовой выборки, основываясь на обучающей выборке.

https://habr.com/ru/post/205360/

#### Предварительная обработка данных

Загрузим данные и посмотрим, что мы имеем:

In [1]:
from pandas import read_csv
import pymorphy2
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.decomposition import PCA

train = read_csv('https://static.tcsbank.ru/documents/olymp/passport_training_set.csv',';', index_col='id' ,encoding='cp1251')
train.head(5)

Unnamed: 0_level_0,passport_div_code,passport_issuer_name,passport_issue_month/year
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,422008,БЕЛОВСКИМ УВД КЕМЕРОВСКОЙ ОБЛАСТИ,11M2001
2,500112,ТП №2 В ГОР. ОРЕХОВО-ЗУЕВО ОУФМС РОССИИ ПО МО ...,03M2009
3,642001,ВОЛЖСКИМ РОВД ГОР.САРАТОВА,04M2002
4,162004,УВД МОСКОВСКОГО РАЙОНА Г.КАЗАНЬ,12M2002
5,80001,ОТДЕЛОМ ОФМС РОССИИ ПО РЕСП КАЛМЫКИЯ В Г ЭЛИСТА,08M2009


Теперь можно посмотреть как пользователи записывают поле «кем выдан паспорт» на примере какого-либо подразделения:


In [2]:
example_code = train.passport_div_code[train.passport_div_code.duplicated()].values[0]
for i in train.passport_issuer_name[train.passport_div_code == example_code].drop_duplicates():
    print(i)

ОТДЕЛЕНИЕМ УФМС РОССИИ ПО РЕСПУБЛИКЕ КАРЕЛИЯ В МЕДВЕЖ. Р-Е
ОТДЕЛЕНИЕМ УФМС РОССИИ ПО Р. КАРЕЛИЯ В МЕДВЕЖЬЕГОРСКОМ РАЙОНЕ
ОТДЕЛЕНИЕМ УФМС РОССИИ ПО РЕСП КАРЕЛИЯ В МЕДВЕЖЬЕГОРСКОМ Р-НЕ
ОТДЕЛЕНИЕМ УФМС РОССИИ ПО РЕСПУБЛИКЕ КАРЕЛИЯ В МЕДВЕЖЬЕГОРСКОМ РАЙОНЕ
ОУФМС РОССИИ ПО РЕСПУБЛИКЕ КАРЕЛИЯ В МЕДВЕЖЬЕГОРСКОМ РАЙОНЕ
УФМС РОССИИ ПО РК В МЕДВЕЖЬЕГОРСКОМ РАЙОНЕ
ОТДЕЛЕНИЕМ УФМС РОССИИ ПО РЕСПУБЛИКЕ КАРЕЛИЯ МЕДВЕЖЬЕГОРСКОМ Р-ОНЕ
ОТДЕЛЕНИЕМ УФМС РОССИИ ПО РК В МЕДВЕЖЬЕГОРСКОМ РАЙОНЕ
ОТДЕЛЕНИЕМ УФМС РОССИИ ПО РЕСПУБЛИКЕ КОРЕЛИЯ В МЕДВЕЖИГОРСКОМ РАЙОНЕ
УФМС РОССИИ ПО Р. КАРЕЛИЯ МЕДВЕЖЬЕГОРСКОГО Р-НА
ОТДЕЛОМ УФМС РОССИИ ПО РЕСПУБЛИКЕ КАРЕЛИЯ В МЕДВЕЖЬЕГОРСКОМ
УФМС РЕСПУБЛИКИ КАРЕЛИИ МЕДВЕЖЬЕГОРСКОГО Р-ОН
МЕДВЕЖЬЕГОРСКИМ ОВД


Как можно заметить нужно на поле действительно заполняется криво. Но для нормально кодирования мы должны привести это поле к более-менее нормальному (однозначному) виду.
Для начала я бы предложил привести все записи к одному регистру, например, чтобы все буквы стали строчными. Это легко сделать с помощью атрибута str, столбца DataFrame'a. Этот атрибут позволяет работать со столбцом как с строкой, а также выполнять различного рода поиск и замену по регулярным выражениям:

In [3]:
train.passport_issuer_name = train.passport_issuer_name.str.lower()
train[train.passport_div_code == example_code].head(5)

Unnamed: 0_level_0,passport_div_code,passport_issuer_name,passport_issue_month/year
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
19,100010,отделением уфмс россии по республике карелия в...,04M2008
22,100010,отделением уфмс россии по р. карелия в медвежь...,10M2009
5642,100010,отделением уфмс россии по респ карелия в медве...,08M2008
6668,100010,отделением уфмс россии по республике карелия в...,08M2011
8732,100010,отделением уфмс россии по республике карелия в...,08M2012


C регистром определились. Далее надо по возможности избавиться от популярных сокращений, например район, город и т.д. Сделаем это с помощью регулярных выражений. Pandas предоставляет удобное использование регулярных выражений применительно к каждому столбцу. Это выглядит так:

In [4]:
train.passport_issuer_name = train.passport_issuer_name.str.replace(u'р-(а|й|о|н|е)*',u'район')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' г( |\.|(ор(\.| )))', u' город ')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' р(\.|есп )', u' республика ')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' адм([а-я]*)(\.)?', u' административный ')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' окр(\.| |уга( )?)', u' округ ')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' ао ', u' административный округ ')

Теперь избавимся от всех лишних символов, кроме русских букв, дефисов и пробелов. Это связано с тем, что паспорт о одинаковым подразделением может выдаваться отделами с разными номерами, и это ухудшит дальнейшую кодировку:

In [5]:
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' - ?', u'-')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u'[^а-я -]','')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u'- ',' ')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u'  *',' ')

На следующем шаге, надо расшифровать аббревиатуры, типа УВД, УФНС, ЦАО, ВАО и т.д., т.к. этих их в принципе не много, но на качестве дальнейшего кодирования это скажется положительно. Например если у нас будет две записи «УВД» и «управление внутренних дел», то закодированы они будут по разному, т. к. для компьютера это разные значения.
Итак перейдем к расшифровке. И, для начала, заведем словарь сокращений, с помощью которого мы и сделаем расшифровку:

In [6]:
sokr = {u'нао': u'ненецкий автономный округ',
u'хмао': u'ханты-мансийский автономный округ',
u'чао': u'чукотский автономный округ',
u'янао': u'ямало-ненецкий автономный округ',
u'вао': u'восточный административный округ',
u'цао': u'центральный административный округ',
u'зао': u'западный административный округ',
u'cао': u'северный административный округ',
u'юао': u'южный административный округ',
u'юзао': u'юго-западный округ',
u'ювао': u'юго-восточный округ',
u'свао': u'северо-восточный округ',
u'сзао': u'северо-западный округ',
u'оуфмс': u'отдел управление федеральной миграционной службы',
u'офмс': u'отдел федеральной миграционной службы',
u'уфмс': u'управление федеральной миграционной службы',
u'увд': u'управление внутренних дел',
u'ровд': u'районный отдел внутренних дел',
u'говд': u'городской отдел внутренних дел',
u'рувд': u'районное управление внутренних дел',
u'овд': u'отдел внутренних дел',
u'оувд': u'отдел управления внутренних дел',
u'мро': u'межрайонный отдел',
u'пс': u'паспортный стол',
u'тп': u'территориальный пункт'}

Теперь, собственно произведем расшифровку абривеатур и отформатируем полученные записи:

In [7]:
for key in sokr:
    train.passport_issuer_name = train.passport_issuer_name.str.replace(u'( %s )|(^%s)|(%s$)' % (i,i,i), u' %s ' % (sokr[key]))
    
#удалим лишние пробелы в конце и начале строки
train.passport_issuer_name = train.passport_issuer_name.str.lstrip()
train.passport_issuer_name = train.passport_issuer_name.str.rstrip()

Предварительный этап обработки поля «кем выдан паспорт» на этом закончим. И перейдем к полю, в котором находится дата выдачи.
Как можно заметить данные в нем хранятся в виде: месяцMгод.
Соответственно можно просто убрать букву «M» и привести поле к числовому типу. Но если хорошо подумать, то это поле можно удалить, т.к. на один месяц в году может приходиться несколько подразделений выдававших паспорт, и соответственно это может испортить нашу модель. Исходя из этого удалим его из выборки:


In [8]:
train = train.drop(['passport_issue_month/year'], axis=1)

#### Анализ данных

Итак, данные для построения модели у нас есть, но они находятся в текстовом виде. Для построения модели хорошо бы было их закодировать в числовом виде.

Авторы пакета scikit-learn заботливо о нас позаботились и добавили несколько способов для извлечения и кодирования текстовых данных:

- FeatureHasher
- CountVectorizer
- HashingVectorizer

**FeatureHasher** преобразовывает строку в числовой массив заданной длинной с помощью хэш-функции (32-разрядная версия Murmurhash3)

**CountVectorizer** преобразовывает входной текст в матрицу, значениями которой, являются количества вхождения данного ключа(слова) в текст. В отличие от FeatureHasher имеет больше настраиваемых параметров(например можно задать токенизатор), но работает медленнее.

**HashingVectorizer** является смесью двух выше описанных методов. В нем можно и регулировать размер закодированной строки (как в FeatureHasher) и настраивать токенизатор (как в CountVectorizer). К тому же его производительность ближе к FeatureHasher.

Для начала CountVectorizer собирает уникальные ключи из всех записей.

Длина списка из уникальных ключей и будет длиной нашего закодированного текста. А номера элементов будут соответствовать, количеству раз встречи данного ключа с данным номером в строке.

**Пример:**

Дано:
- раз два три
- три четыре два два
- раз раз раз четыре

Список уникальных ключей: [раз, два, три, четыре]

- раз два три --> [1,1,1,0]
- три четыре два два --> [0,2,1,1]


Итак, вернемся к анализу. Если мы посмотрим по внимательнее на наш набор данных то можно заметить, что есть похожие строки но записанные по разному например: "… республика карелия..." и "… по республике карелия...".

Соответственно, если мы попробуем применить один из методов кодирования сейчас мы получим очень похожие значения. Такие случаем можно минимизировать если все слова в записи мы приведем к нормальной форме.

Для этой задачи хорошо подходит pymorphy или nltk. Я буду использовать первый, т.к. он изначально создавался для работы с русским языком. Итак, функция которая будет отвечать за нормализацию и очиску строки выглядит так:

In [9]:
def f_tokenizer(s):
    morph = pymorphy2.MorphAnalyzer()
    if type(s) == "unicode":
        t = s.split(' ')
    else:
        t = s
    f = []
    for j in t:
        m = morph.parse(j.replace('.',''))
        if len(m) != 0:
            wrd = m[0]
            if wrd.tag.POS not in ('NUMR','PREP','CONJ','PRCL','INTJ'):
                f.append(wrd.normal_form)
    return(f)

Функция делает следующее:
- Сначала она преобразовывает строку в список
- Затем для всех слов производит разбор
- Если слово является числительным, предикативном, предлогом, союзом, частицей или междометием не включаем его в конечный набор
- Если слово не попало в предыдущий список, берем его нормальную форму и добавляем в финальный набор

Теперь, когда есть функция для нормализации можно приступить к кодированию с помощью метода CountVectorizer. Он выбран потому, что ему можно передать нашу функцию, как токенизатор и он составит список ключей по значениям полученным в результате работы нашей функции:


In [10]:
coder = HashingVectorizer(tokenizer=f_tokenizer, n_features=256)

Как можно заметить при создании метода кроме токенизатора мы задаем еще один параметр n_features. Через данный параметр задается длина закодированной строки (в нашем случае строка кодируется при помощи 256 столбцов). Кроме того, у HashingVectorizer есть еще одно преимущество перед CountVectorizer, но сразу может выполнять нормализацию значений, что хорошо для таких алгоритмов, как SVM.
Теперь применим наш кодировщик к обучающему набору:

In [11]:
TrainNotDuble = train.drop_duplicates()
trn = coder.fit_transform(TrainNotDuble.passport_issuer_name.tolist()).toarray()

#### Построение модели

Для начала нам надо задать значения для столбца, в котором будут содержаться метки классов:

In [12]:
target = TrainNotDuble.passport_div_code.values

In [13]:
pca = PCA(n_components = 15)
trn = pca.fit_transform(trn)

In [17]:
model = RandomForestClassifier(n_estimators = 100, criterion='entropy')

TRNtrain, TRNtest, TARtrain, TARtest = train_test_split(trn, target, test_size=0.4)
model.fit(TRNtrain, TARtrain)
print('accuracy_score: ', accuracy_score(TARtest, model.predict(TRNtest)))

accuracy_score:  0.7379523720232515


# Анализ тональности текстов с помощью сверточных нейронных сетей

<img width=600 src="net.gif">
https://habr.com/ru/company/mailru/blog/417767/

Представьте, что у вас есть абзац текста. Можно ли понять, какую эмоцию несет этот текст: радость, грусть, гнев? Можно. Упростим себе задачу и будем классифицировать эмоцию как позитивную или как негативную, без уточнений. Есть много способов решать такую задачу, и один из них — свёрточные нейронные сети (Convolutional Neural Networks). CNN изначально были разработаны для обработки изображений, однако они успешно справляются с решением задач в сфере автоматической обработки текстов. 

Для обучения я выбрал корпус коротких текстов Юлии Рубцовой, сформированный на основе русскоязычных сообщений из Twitter. Он содержит 114 991 положительных, 111 923 отрицательных твитов, а также базу неразмеченных твитов объемом 17 639 674 сообщений.

In [1]:
import pandas as pd
import numpy as np

# Считываем данные
n = ['id', 'date', 'name', 'text', 'typr', 'rep', 'rtw', 'faw', 'stcount', 'foll', 'frien', 'listcount']
data_positive = pd.read_csv('data/positive.csv', sep=';', error_bad_lines=False, names=n, usecols=['text'])
data_negative = pd.read_csv('data/negative.csv', sep=';', error_bad_lines=False, names=n, usecols=['text'])

# Формируем сбалансированный датасет
sample_size = min(data_positive.shape[0], data_negative.shape[0])
raw_data = np.concatenate((data_positive['text'].values[:sample_size],
                           data_negative['text'].values[:sample_size]), axis=0)
labels = [1] * sample_size + [0] * sample_size

Перед началом обучения тексты прошли процедуру предварительной обработки:

- приведение к нижнему регистру;
- замена «ё» на «е»;
- замена ссылок на токен «URL»;
- замена упоминания пользователя на токен «USER»;
- удаление знаков пунктуации.

In [13]:
import re

def preprocess_text(text):
    text = text.lower().replace("ё", "е")
    text = re.sub('((www\.[^\s]+)|(https?://[^\s]+))', 'URL', text)
    text = re.sub('@[^\s]+', 'USER', text)
    text = re.sub('[^a-zA-Zа-яА-Я1-9]+', ' ', text)
    text = re.sub(' +', ' ', text)
    return text.strip()


data = [preprocess_text(t) for t in raw_data]

In [14]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=1)

#### Векторное отображение слов

Входными данными сверточной нейронной сети является матрица с фиксированной высотой n, где каждая строка представляет собой векторное отображение слова в признаковое пространство размерности k.

Для формирования embedding-слоя нейронной сети используем утилиту дистрибутивной семантики Word2Vec, предназначенную для отображения семантического значения слов в векторное пространство. 

Word2Vec находит взаимосвязи между словами согласно предположению, что в похожих контекстах встречаются семантически близкие слова. Подробнее о Word2Vec можно прочитать в оригинальной статье, а также тут и тут. Поскольку твитам характерна авторская пунктуация и эмотиконы, определение границ предложений становится достаточно трудоемкой задачей. 

In [15]:
import sqlite3

# Открываем SQLite базу данных
conn = sqlite3.connect('mysqlite3.db')
c = conn.cursor()

In [17]:
with open('data/tweets.txt', 'w', encoding='utf-8') as f:
    # Считываем тексты твитов 
    for row in c.execute('SELECT ttext FROM sentiment'):
        if row[0]:
            tweet = preprocess_text(row[0])
            # Записываем предобработанные твиты в файл
            print(tweet, file=f)

Далее с помощью библиотеки Gensim обучаем Word2Vec-модель со следующими параметрами: 

- size = 200 — размерность признакового пространства;
- window = 5 — количество слов из контекста, которое анализирует алгоритм;
- min_count = 3 — слово должно встречаться минимум три раза, чтобы модель его учитывала.

In [18]:
import logging
import multiprocessing
import gensim
from gensim.models import Word2Vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Считываем файл с предобработанными твитами
data = gensim.models.word2vec.LineSentence('data/tweets.txt')
# Обучаем модель 
model = Word2Vec(data, size=200, window=5, min_count=3, workers=multiprocessing.cpu_count())


2019-08-14 07:13:33,867 : INFO : collecting all words and their counts
2019-08-14 07:13:33,870 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-08-14 07:13:33,954 : INFO : PROGRESS: at sentence #10000, processed 98345 words, keeping 24248 word types
2019-08-14 07:13:34,026 : INFO : PROGRESS: at sentence #20000, processed 194782 words, keeping 38847 word types
2019-08-14 07:13:34,098 : INFO : PROGRESS: at sentence #30000, processed 290743 words, keeping 50531 word types
2019-08-14 07:13:34,167 : INFO : PROGRESS: at sentence #40000, processed 386269 words, keeping 60653 word types
2019-08-14 07:13:34,231 : INFO : PROGRESS: at sentence #50000, processed 482755 words, keeping 69689 word types
2019-08-14 07:13:34,305 : INFO : PROGRESS: at sentence #60000, processed 578104 words, keeping 78235 word types
2019-08-14 07:13:34,375 : INFO : PROGRESS: at sentence #70000, processed 672686 words, keeping 85840 word types
2019-08-14 07:13:34,474 : INFO : PROGRESS: at s

2019-08-14 07:13:39,197 : INFO : PROGRESS: at sentence #720000, processed 6856226 words, keeping 329362 word types
2019-08-14 07:13:39,261 : INFO : PROGRESS: at sentence #730000, processed 6953689 words, keeping 331913 word types
2019-08-14 07:13:39,335 : INFO : PROGRESS: at sentence #740000, processed 7054203 words, keeping 334539 word types
2019-08-14 07:13:39,403 : INFO : PROGRESS: at sentence #750000, processed 7153523 words, keeping 336844 word types
2019-08-14 07:13:39,483 : INFO : PROGRESS: at sentence #760000, processed 7253230 words, keeping 339124 word types
2019-08-14 07:13:39,552 : INFO : PROGRESS: at sentence #770000, processed 7352424 words, keeping 341418 word types
2019-08-14 07:13:39,623 : INFO : PROGRESS: at sentence #780000, processed 7450987 words, keeping 343877 word types
2019-08-14 07:13:39,699 : INFO : PROGRESS: at sentence #790000, processed 7545233 words, keeping 346595 word types
2019-08-14 07:13:39,763 : INFO : PROGRESS: at sentence #800000, processed 763865

2019-08-14 07:13:44,536 : INFO : PROGRESS: at sentence #1430000, processed 13698594 words, keeping 494012 word types
2019-08-14 07:13:44,613 : INFO : PROGRESS: at sentence #1440000, processed 13799436 words, keeping 496301 word types
2019-08-14 07:13:44,686 : INFO : PROGRESS: at sentence #1450000, processed 13898908 words, keeping 498556 word types
2019-08-14 07:13:44,764 : INFO : PROGRESS: at sentence #1460000, processed 13996483 words, keeping 500687 word types
2019-08-14 07:13:44,836 : INFO : PROGRESS: at sentence #1470000, processed 14094360 words, keeping 502819 word types
2019-08-14 07:13:44,908 : INFO : PROGRESS: at sentence #1480000, processed 14192679 words, keeping 504855 word types
2019-08-14 07:13:45,001 : INFO : PROGRESS: at sentence #1490000, processed 14291910 words, keeping 507157 word types
2019-08-14 07:13:45,089 : INFO : PROGRESS: at sentence #1500000, processed 14391159 words, keeping 509297 word types
2019-08-14 07:13:45,160 : INFO : PROGRESS: at sentence #1510000,

2019-08-14 07:13:49,945 : INFO : PROGRESS: at sentence #2140000, processed 20623466 words, keeping 625917 word types
2019-08-14 07:13:50,025 : INFO : PROGRESS: at sentence #2150000, processed 20722248 words, keeping 627462 word types
2019-08-14 07:13:50,094 : INFO : PROGRESS: at sentence #2160000, processed 20819863 words, keeping 629122 word types
2019-08-14 07:13:50,180 : INFO : PROGRESS: at sentence #2170000, processed 20917554 words, keeping 630708 word types
2019-08-14 07:13:50,251 : INFO : PROGRESS: at sentence #2180000, processed 21015641 words, keeping 632393 word types
2019-08-14 07:13:50,334 : INFO : PROGRESS: at sentence #2190000, processed 21113827 words, keeping 634124 word types
2019-08-14 07:13:50,403 : INFO : PROGRESS: at sentence #2200000, processed 21212464 words, keeping 635844 word types
2019-08-14 07:13:50,489 : INFO : PROGRESS: at sentence #2210000, processed 21310897 words, keeping 637585 word types
2019-08-14 07:13:50,569 : INFO : PROGRESS: at sentence #2220000,

2019-08-14 07:13:55,355 : INFO : PROGRESS: at sentence #2850000, processed 27587700 words, keeping 742350 word types
2019-08-14 07:13:55,439 : INFO : PROGRESS: at sentence #2860000, processed 27686628 words, keeping 744033 word types
2019-08-14 07:13:55,517 : INFO : PROGRESS: at sentence #2870000, processed 27788789 words, keeping 745708 word types
2019-08-14 07:13:55,595 : INFO : PROGRESS: at sentence #2880000, processed 27885614 words, keeping 747258 word types
2019-08-14 07:13:55,660 : INFO : PROGRESS: at sentence #2890000, processed 27979645 words, keeping 748829 word types
2019-08-14 07:13:55,745 : INFO : PROGRESS: at sentence #2900000, processed 28075614 words, keeping 750449 word types
2019-08-14 07:13:55,815 : INFO : PROGRESS: at sentence #2910000, processed 28172950 words, keeping 752004 word types
2019-08-14 07:13:55,888 : INFO : PROGRESS: at sentence #2920000, processed 28270410 words, keeping 753472 word types
2019-08-14 07:13:55,959 : INFO : PROGRESS: at sentence #2930000,

2019-08-14 07:14:00,584 : INFO : PROGRESS: at sentence #3560000, processed 34438798 words, keeping 843762 word types
2019-08-14 07:14:00,661 : INFO : PROGRESS: at sentence #3570000, processed 34531015 words, keeping 844977 word types
2019-08-14 07:14:00,727 : INFO : PROGRESS: at sentence #3580000, processed 34624100 words, keeping 846165 word types
2019-08-14 07:14:00,820 : INFO : PROGRESS: at sentence #3590000, processed 34719148 words, keeping 847374 word types
2019-08-14 07:14:00,895 : INFO : PROGRESS: at sentence #3600000, processed 34819053 words, keeping 848814 word types
2019-08-14 07:14:00,979 : INFO : PROGRESS: at sentence #3610000, processed 34918864 words, keeping 850243 word types
2019-08-14 07:14:01,047 : INFO : PROGRESS: at sentence #3620000, processed 35018239 words, keeping 851702 word types
2019-08-14 07:14:01,135 : INFO : PROGRESS: at sentence #3630000, processed 35117362 words, keeping 853081 word types
2019-08-14 07:14:01,205 : INFO : PROGRESS: at sentence #3640000,

2019-08-14 07:14:06,025 : INFO : PROGRESS: at sentence #4270000, processed 41464980 words, keeping 938539 word types
2019-08-14 07:14:06,099 : INFO : PROGRESS: at sentence #4280000, processed 41568173 words, keeping 939803 word types
2019-08-14 07:14:06,180 : INFO : PROGRESS: at sentence #4290000, processed 41671892 words, keeping 941077 word types
2019-08-14 07:14:06,250 : INFO : PROGRESS: at sentence #4300000, processed 41775515 words, keeping 942401 word types
2019-08-14 07:14:06,339 : INFO : PROGRESS: at sentence #4310000, processed 41878294 words, keeping 943781 word types
2019-08-14 07:14:06,413 : INFO : PROGRESS: at sentence #4320000, processed 41981059 words, keeping 945058 word types
2019-08-14 07:14:06,486 : INFO : PROGRESS: at sentence #4330000, processed 42081929 words, keeping 946331 word types
2019-08-14 07:14:06,557 : INFO : PROGRESS: at sentence #4340000, processed 42185413 words, keeping 947628 word types
2019-08-14 07:14:06,646 : INFO : PROGRESS: at sentence #4350000,

2019-08-14 07:14:11,448 : INFO : PROGRESS: at sentence #4970000, processed 48525276 words, keeping 1025892 word types
2019-08-14 07:14:11,524 : INFO : PROGRESS: at sentence #4980000, processed 48623603 words, keeping 1027015 word types
2019-08-14 07:14:11,596 : INFO : PROGRESS: at sentence #4990000, processed 48721784 words, keeping 1028147 word types
2019-08-14 07:14:11,674 : INFO : PROGRESS: at sentence #5000000, processed 48820650 words, keeping 1029291 word types
2019-08-14 07:14:11,746 : INFO : PROGRESS: at sentence #5010000, processed 48919694 words, keeping 1030453 word types
2019-08-14 07:14:11,817 : INFO : PROGRESS: at sentence #5020000, processed 49015912 words, keeping 1031664 word types
2019-08-14 07:14:11,893 : INFO : PROGRESS: at sentence #5030000, processed 49114805 words, keeping 1032739 word types
2019-08-14 07:14:11,974 : INFO : PROGRESS: at sentence #5040000, processed 49214888 words, keeping 1033813 word types
2019-08-14 07:14:12,046 : INFO : PROGRESS: at sentence #

2019-08-14 07:14:16,834 : INFO : PROGRESS: at sentence #5670000, processed 55510788 words, keeping 1106006 word types
2019-08-14 07:14:16,901 : INFO : PROGRESS: at sentence #5680000, processed 55607005 words, keeping 1107034 word types
2019-08-14 07:14:16,973 : INFO : PROGRESS: at sentence #5690000, processed 55701085 words, keeping 1108151 word types
2019-08-14 07:14:17,051 : INFO : PROGRESS: at sentence #5700000, processed 55796372 words, keeping 1109193 word types
2019-08-14 07:14:17,118 : INFO : PROGRESS: at sentence #5710000, processed 55894523 words, keeping 1110264 word types
2019-08-14 07:14:17,204 : INFO : PROGRESS: at sentence #5720000, processed 55991793 words, keeping 1111332 word types
2019-08-14 07:14:17,270 : INFO : PROGRESS: at sentence #5730000, processed 56089952 words, keeping 1112456 word types
2019-08-14 07:14:17,355 : INFO : PROGRESS: at sentence #5740000, processed 56186683 words, keeping 1113552 word types
2019-08-14 07:14:17,423 : INFO : PROGRESS: at sentence #

2019-08-14 07:14:22,015 : INFO : PROGRESS: at sentence #6370000, processed 62203029 words, keeping 1181585 word types
2019-08-14 07:14:22,091 : INFO : PROGRESS: at sentence #6380000, processed 62302854 words, keeping 1182501 word types
2019-08-14 07:14:22,162 : INFO : PROGRESS: at sentence #6390000, processed 62401570 words, keeping 1183421 word types
2019-08-14 07:14:22,235 : INFO : PROGRESS: at sentence #6400000, processed 62500855 words, keeping 1184453 word types
2019-08-14 07:14:22,304 : INFO : PROGRESS: at sentence #6410000, processed 62600552 words, keeping 1185543 word types
2019-08-14 07:14:22,382 : INFO : PROGRESS: at sentence #6420000, processed 62701173 words, keeping 1186512 word types
2019-08-14 07:14:22,452 : INFO : PROGRESS: at sentence #6430000, processed 62803158 words, keeping 1187464 word types
2019-08-14 07:14:22,538 : INFO : PROGRESS: at sentence #6440000, processed 62902217 words, keeping 1188466 word types
2019-08-14 07:14:22,606 : INFO : PROGRESS: at sentence #

2019-08-14 07:14:27,277 : INFO : PROGRESS: at sentence #7070000, processed 69035200 words, keeping 1252053 word types
2019-08-14 07:14:27,347 : INFO : PROGRESS: at sentence #7080000, processed 69130297 words, keeping 1253007 word types
2019-08-14 07:14:27,422 : INFO : PROGRESS: at sentence #7090000, processed 69223480 words, keeping 1253889 word types
2019-08-14 07:14:27,490 : INFO : PROGRESS: at sentence #7100000, processed 69317953 words, keeping 1254842 word types
2019-08-14 07:14:27,568 : INFO : PROGRESS: at sentence #7110000, processed 69415198 words, keeping 1255833 word types
2019-08-14 07:14:27,640 : INFO : PROGRESS: at sentence #7120000, processed 69513574 words, keeping 1256819 word types
2019-08-14 07:14:27,718 : INFO : PROGRESS: at sentence #7130000, processed 69611254 words, keeping 1257787 word types
2019-08-14 07:14:27,789 : INFO : PROGRESS: at sentence #7140000, processed 69707222 words, keeping 1258763 word types
2019-08-14 07:14:27,870 : INFO : PROGRESS: at sentence #

2019-08-14 07:14:32,512 : INFO : PROGRESS: at sentence #7760000, processed 75810648 words, keeping 1317734 word types
2019-08-14 07:14:32,594 : INFO : PROGRESS: at sentence #7770000, processed 75909381 words, keeping 1318646 word types
2019-08-14 07:14:32,666 : INFO : PROGRESS: at sentence #7780000, processed 76010253 words, keeping 1319488 word types
2019-08-14 07:14:32,747 : INFO : PROGRESS: at sentence #7790000, processed 76111534 words, keeping 1320375 word types
2019-08-14 07:14:32,819 : INFO : PROGRESS: at sentence #7800000, processed 76211486 words, keeping 1321392 word types
2019-08-14 07:14:32,899 : INFO : PROGRESS: at sentence #7810000, processed 76310716 words, keeping 1322475 word types
2019-08-14 07:14:32,968 : INFO : PROGRESS: at sentence #7820000, processed 76409693 words, keeping 1323457 word types
2019-08-14 07:14:33,053 : INFO : PROGRESS: at sentence #7830000, processed 76509025 words, keeping 1324473 word types
2019-08-14 07:14:33,122 : INFO : PROGRESS: at sentence #

2019-08-14 07:14:37,874 : INFO : PROGRESS: at sentence #8460000, processed 82712992 words, keeping 1385201 word types
2019-08-14 07:14:37,953 : INFO : PROGRESS: at sentence #8470000, processed 82809572 words, keeping 1386159 word types
2019-08-14 07:14:38,019 : INFO : PROGRESS: at sentence #8480000, processed 82904271 words, keeping 1387102 word types
2019-08-14 07:14:38,103 : INFO : PROGRESS: at sentence #8490000, processed 82999867 words, keeping 1388060 word types
2019-08-14 07:14:38,170 : INFO : PROGRESS: at sentence #8500000, processed 83094558 words, keeping 1389032 word types
2019-08-14 07:14:38,254 : INFO : PROGRESS: at sentence #8510000, processed 83189128 words, keeping 1389945 word types
2019-08-14 07:14:38,323 : INFO : PROGRESS: at sentence #8520000, processed 83282635 words, keeping 1390838 word types
2019-08-14 07:14:38,401 : INFO : PROGRESS: at sentence #8530000, processed 83374490 words, keeping 1391762 word types
2019-08-14 07:14:38,467 : INFO : PROGRESS: at sentence #

2019-08-14 07:14:43,114 : INFO : PROGRESS: at sentence #9160000, processed 89188612 words, keeping 1450411 word types
2019-08-14 07:14:43,185 : INFO : PROGRESS: at sentence #9170000, processed 89285232 words, keeping 1451251 word types
2019-08-14 07:14:43,263 : INFO : PROGRESS: at sentence #9180000, processed 89380235 words, keeping 1452107 word types
2019-08-14 07:14:43,330 : INFO : PROGRESS: at sentence #9190000, processed 89477117 words, keeping 1452964 word types
2019-08-14 07:14:43,418 : INFO : PROGRESS: at sentence #9200000, processed 89575575 words, keeping 1453890 word types
2019-08-14 07:14:43,486 : INFO : PROGRESS: at sentence #9210000, processed 89672559 words, keeping 1454738 word types
2019-08-14 07:14:43,572 : INFO : PROGRESS: at sentence #9220000, processed 89770833 words, keeping 1455595 word types
2019-08-14 07:14:43,640 : INFO : PROGRESS: at sentence #9230000, processed 89870226 words, keeping 1456390 word types
2019-08-14 07:14:43,727 : INFO : PROGRESS: at sentence #

2019-08-14 07:14:48,280 : INFO : PROGRESS: at sentence #9860000, processed 95905914 words, keeping 1513953 word types
2019-08-14 07:14:48,361 : INFO : PROGRESS: at sentence #9870000, processed 96004917 words, keeping 1514911 word types
2019-08-14 07:14:48,432 : INFO : PROGRESS: at sentence #9880000, processed 96102008 words, keeping 1515861 word types
2019-08-14 07:14:48,508 : INFO : PROGRESS: at sentence #9890000, processed 96196101 words, keeping 1516875 word types
2019-08-14 07:14:48,577 : INFO : PROGRESS: at sentence #9900000, processed 96291175 words, keeping 1517864 word types
2019-08-14 07:14:48,644 : INFO : PROGRESS: at sentence #9910000, processed 96388460 words, keeping 1518948 word types
2019-08-14 07:14:48,723 : INFO : PROGRESS: at sentence #9920000, processed 96481603 words, keeping 1519991 word types
2019-08-14 07:14:48,789 : INFO : PROGRESS: at sentence #9930000, processed 96576704 words, keeping 1521064 word types
2019-08-14 07:14:48,877 : INFO : PROGRESS: at sentence #

2019-08-14 07:14:53,454 : INFO : PROGRESS: at sentence #10550000, processed 102512236 words, keeping 1575772 word types
2019-08-14 07:14:53,524 : INFO : PROGRESS: at sentence #10560000, processed 102606905 words, keeping 1576612 word types
2019-08-14 07:14:53,601 : INFO : PROGRESS: at sentence #10570000, processed 102702567 words, keeping 1577476 word types
2019-08-14 07:14:53,666 : INFO : PROGRESS: at sentence #10580000, processed 102797396 words, keeping 1578351 word types
2019-08-14 07:14:53,751 : INFO : PROGRESS: at sentence #10590000, processed 102891834 words, keeping 1579282 word types
2019-08-14 07:14:53,821 : INFO : PROGRESS: at sentence #10600000, processed 102986390 words, keeping 1580138 word types
2019-08-14 07:14:53,903 : INFO : PROGRESS: at sentence #10610000, processed 103084646 words, keeping 1580969 word types
2019-08-14 07:14:53,969 : INFO : PROGRESS: at sentence #10620000, processed 103180341 words, keeping 1581750 word types
2019-08-14 07:14:54,040 : INFO : PROGRES

2019-08-14 07:14:58,544 : INFO : PROGRESS: at sentence #11240000, processed 109059916 words, keeping 1637464 word types
2019-08-14 07:14:58,629 : INFO : PROGRESS: at sentence #11250000, processed 109154721 words, keeping 1638430 word types
2019-08-14 07:14:58,697 : INFO : PROGRESS: at sentence #11260000, processed 109251155 words, keeping 1639412 word types
2019-08-14 07:14:58,781 : INFO : PROGRESS: at sentence #11270000, processed 109346147 words, keeping 1640374 word types
2019-08-14 07:14:58,847 : INFO : PROGRESS: at sentence #11280000, processed 109441538 words, keeping 1641315 word types
2019-08-14 07:14:58,931 : INFO : PROGRESS: at sentence #11290000, processed 109534608 words, keeping 1642267 word types
2019-08-14 07:14:59,001 : INFO : PROGRESS: at sentence #11300000, processed 109626733 words, keeping 1643231 word types
2019-08-14 07:14:59,069 : INFO : PROGRESS: at sentence #11310000, processed 109719845 words, keeping 1644169 word types
2019-08-14 07:14:59,145 : INFO : PROGRES

2019-08-14 07:15:03,736 : INFO : PROGRESS: at sentence #11930000, processed 115740763 words, keeping 1698969 word types
2019-08-14 07:15:03,812 : INFO : PROGRESS: at sentence #11940000, processed 115841478 words, keeping 1699989 word types
2019-08-14 07:15:03,897 : INFO : PROGRESS: at sentence #11950000, processed 115942635 words, keeping 1700988 word types
2019-08-14 07:15:03,972 : INFO : PROGRESS: at sentence #11960000, processed 116043980 words, keeping 1702043 word types
2019-08-14 07:15:04,056 : INFO : PROGRESS: at sentence #11970000, processed 116143661 words, keeping 1703109 word types
2019-08-14 07:15:04,131 : INFO : PROGRESS: at sentence #11980000, processed 116244703 words, keeping 1704093 word types
2019-08-14 07:15:04,203 : INFO : PROGRESS: at sentence #11990000, processed 116343337 words, keeping 1704940 word types
2019-08-14 07:15:04,275 : INFO : PROGRESS: at sentence #12000000, processed 116441841 words, keeping 1705775 word types
2019-08-14 07:15:04,343 : INFO : PROGRES

2019-08-14 07:15:08,996 : INFO : PROGRESS: at sentence #12620000, processed 122495888 words, keeping 1760880 word types
2019-08-14 07:15:09,067 : INFO : PROGRESS: at sentence #12630000, processed 122598017 words, keeping 1761715 word types
2019-08-14 07:15:09,158 : INFO : PROGRESS: at sentence #12640000, processed 122700678 words, keeping 1762468 word types
2019-08-14 07:15:09,229 : INFO : PROGRESS: at sentence #12650000, processed 122800259 words, keeping 1763297 word types
2019-08-14 07:15:09,318 : INFO : PROGRESS: at sentence #12660000, processed 122900785 words, keeping 1764112 word types
2019-08-14 07:15:09,391 : INFO : PROGRESS: at sentence #12670000, processed 122999604 words, keeping 1765094 word types
2019-08-14 07:15:09,478 : INFO : PROGRESS: at sentence #12680000, processed 123098977 words, keeping 1766136 word types
2019-08-14 07:15:09,546 : INFO : PROGRESS: at sentence #12690000, processed 123197184 words, keeping 1767258 word types
2019-08-14 07:15:09,635 : INFO : PROGRES

2019-08-14 07:15:14,326 : INFO : PROGRESS: at sentence #13310000, processed 129228244 words, keeping 1826112 word types
2019-08-14 07:15:14,409 : INFO : PROGRESS: at sentence #13320000, processed 129325171 words, keeping 1826916 word types
2019-08-14 07:15:14,479 : INFO : PROGRESS: at sentence #13330000, processed 129420674 words, keeping 1827695 word types
2019-08-14 07:15:14,561 : INFO : PROGRESS: at sentence #13340000, processed 129513554 words, keeping 1828526 word types
2019-08-14 07:15:14,634 : INFO : PROGRESS: at sentence #13350000, processed 129610443 words, keeping 1829378 word types
2019-08-14 07:15:14,706 : INFO : PROGRESS: at sentence #13360000, processed 129707380 words, keeping 1830268 word types
2019-08-14 07:15:14,784 : INFO : PROGRESS: at sentence #13370000, processed 129802457 words, keeping 1831081 word types
2019-08-14 07:15:14,853 : INFO : PROGRESS: at sentence #13380000, processed 129894732 words, keeping 1831910 word types
2019-08-14 07:15:14,936 : INFO : PROGRES

2019-08-14 07:15:19,642 : INFO : PROGRESS: at sentence #14000000, processed 135974374 words, keeping 1886458 word types
2019-08-14 07:15:19,723 : INFO : PROGRESS: at sentence #14010000, processed 136070958 words, keeping 1887448 word types
2019-08-14 07:15:19,805 : INFO : PROGRESS: at sentence #14020000, processed 136166863 words, keeping 1888284 word types
2019-08-14 07:15:19,888 : INFO : PROGRESS: at sentence #14030000, processed 136263379 words, keeping 1889143 word types
2019-08-14 07:15:19,960 : INFO : PROGRESS: at sentence #14040000, processed 136359728 words, keeping 1890046 word types
2019-08-14 07:15:20,031 : INFO : PROGRESS: at sentence #14050000, processed 136455220 words, keeping 1890886 word types
2019-08-14 07:15:20,102 : INFO : PROGRESS: at sentence #14060000, processed 136550414 words, keeping 1891731 word types
2019-08-14 07:15:20,179 : INFO : PROGRESS: at sentence #14070000, processed 136646325 words, keeping 1892609 word types
2019-08-14 07:15:20,250 : INFO : PROGRES

2019-08-14 07:15:24,970 : INFO : PROGRESS: at sentence #14690000, processed 142902935 words, keeping 1943040 word types
2019-08-14 07:15:25,040 : INFO : PROGRESS: at sentence #14700000, processed 143006905 words, keeping 1943821 word types
2019-08-14 07:15:25,120 : INFO : PROGRESS: at sentence #14710000, processed 143109019 words, keeping 1944582 word types
2019-08-14 07:15:25,189 : INFO : PROGRESS: at sentence #14720000, processed 143208086 words, keeping 1945346 word types
2019-08-14 07:15:25,272 : INFO : PROGRESS: at sentence #14730000, processed 143311173 words, keeping 1946104 word types
2019-08-14 07:15:25,344 : INFO : PROGRESS: at sentence #14740000, processed 143411355 words, keeping 1946950 word types
2019-08-14 07:15:25,428 : INFO : PROGRESS: at sentence #14750000, processed 143512200 words, keeping 1947724 word types
2019-08-14 07:15:25,512 : INFO : PROGRESS: at sentence #14760000, processed 143614032 words, keeping 1948501 word types
2019-08-14 07:15:25,582 : INFO : PROGRES

2019-08-14 07:15:30,305 : INFO : PROGRESS: at sentence #15380000, processed 149918423 words, keeping 1993344 word types
2019-08-14 07:15:30,375 : INFO : PROGRESS: at sentence #15390000, processed 150023182 words, keeping 1994076 word types
2019-08-14 07:15:30,460 : INFO : PROGRESS: at sentence #15400000, processed 150128098 words, keeping 1994769 word types
2019-08-14 07:15:30,550 : INFO : PROGRESS: at sentence #15410000, processed 150236432 words, keeping 1995359 word types
2019-08-14 07:15:30,621 : INFO : PROGRESS: at sentence #15420000, processed 150342291 words, keeping 1996092 word types
2019-08-14 07:15:30,713 : INFO : PROGRESS: at sentence #15430000, processed 150446301 words, keeping 1996796 word types
2019-08-14 07:15:30,785 : INFO : PROGRESS: at sentence #15440000, processed 150548098 words, keeping 1997480 word types
2019-08-14 07:15:30,864 : INFO : PROGRESS: at sentence #15450000, processed 150652984 words, keeping 1998217 word types
2019-08-14 07:15:30,938 : INFO : PROGRES

2019-08-14 07:15:35,746 : INFO : PROGRESS: at sentence #16070000, processed 157028851 words, keeping 2042296 word types
2019-08-14 07:15:35,813 : INFO : PROGRESS: at sentence #16080000, processed 157120810 words, keeping 2042640 word types
2019-08-14 07:15:35,895 : INFO : PROGRESS: at sentence #16090000, processed 157215938 words, keeping 2043071 word types
2019-08-14 07:15:35,967 : INFO : PROGRESS: at sentence #16100000, processed 157312632 words, keeping 2043552 word types
2019-08-14 07:15:36,047 : INFO : PROGRESS: at sentence #16110000, processed 157412784 words, keeping 2043984 word types
2019-08-14 07:15:36,115 : INFO : PROGRESS: at sentence #16120000, processed 157510619 words, keeping 2044459 word types
2019-08-14 07:15:36,203 : INFO : PROGRESS: at sentence #16130000, processed 157611753 words, keeping 2044961 word types
2019-08-14 07:15:36,277 : INFO : PROGRESS: at sentence #16140000, processed 157717081 words, keeping 2045650 word types
2019-08-14 07:15:36,358 : INFO : PROGRES

2019-08-14 07:15:41,142 : INFO : PROGRESS: at sentence #16760000, processed 164130278 words, keeping 2088536 word types
2019-08-14 07:15:41,216 : INFO : PROGRESS: at sentence #16770000, processed 164238020 words, keeping 2089234 word types
2019-08-14 07:15:41,301 : INFO : PROGRESS: at sentence #16780000, processed 164350429 words, keeping 2089880 word types
2019-08-14 07:15:41,385 : INFO : PROGRESS: at sentence #16790000, processed 164462354 words, keeping 2090593 word types
2019-08-14 07:15:41,463 : INFO : PROGRESS: at sentence #16800000, processed 164574381 words, keeping 2091332 word types
2019-08-14 07:15:41,549 : INFO : PROGRESS: at sentence #16810000, processed 164689265 words, keeping 2092032 word types
2019-08-14 07:15:41,622 : INFO : PROGRESS: at sentence #16820000, processed 164799490 words, keeping 2092718 word types
2019-08-14 07:15:41,716 : INFO : PROGRESS: at sentence #16830000, processed 164916238 words, keeping 2093339 word types
2019-08-14 07:15:41,799 : INFO : PROGRES

2019-08-14 07:15:46,705 : INFO : PROGRESS: at sentence #17450000, processed 171349380 words, keeping 2135254 word types
2019-08-14 07:15:46,782 : INFO : PROGRESS: at sentence #17460000, processed 171453734 words, keeping 2135864 word types
2019-08-14 07:15:46,870 : INFO : PROGRESS: at sentence #17470000, processed 171557228 words, keeping 2136414 word types
2019-08-14 07:15:46,947 : INFO : PROGRESS: at sentence #17480000, processed 171663024 words, keeping 2137015 word types
2019-08-14 07:15:47,025 : INFO : PROGRESS: at sentence #17490000, processed 171770139 words, keeping 2137569 word types
2019-08-14 07:15:47,108 : INFO : PROGRESS: at sentence #17500000, processed 171875301 words, keeping 2138118 word types
2019-08-14 07:15:47,193 : INFO : PROGRESS: at sentence #17510000, processed 171980078 words, keeping 2138687 word types
2019-08-14 07:15:47,271 : INFO : PROGRESS: at sentence #17520000, processed 172081530 words, keeping 2139410 word types
2019-08-14 07:15:47,347 : INFO : PROGRES

2019-08-14 07:16:52,365 : INFO : EPOCH 1 - PROGRESS: at 12.15% examples, 366271 words/s, in_qsize 1, out_qsize 1
2019-08-14 07:16:53,370 : INFO : EPOCH 1 - PROGRESS: at 12.40% examples, 366004 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:16:54,384 : INFO : EPOCH 1 - PROGRESS: at 12.65% examples, 365703 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:16:55,395 : INFO : EPOCH 1 - PROGRESS: at 12.94% examples, 366248 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:16:56,398 : INFO : EPOCH 1 - PROGRESS: at 13.17% examples, 365503 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:16:57,412 : INFO : EPOCH 1 - PROGRESS: at 13.47% examples, 366121 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:16:58,420 : INFO : EPOCH 1 - PROGRESS: at 13.75% examples, 366648 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:16:59,432 : INFO : EPOCH 1 - PROGRESS: at 14.04% examples, 367329 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:17:00,452 : INFO : EPOCH 1 - PROGRESS: at 14.32% examples, 367778 words/s, in_qsiz

2019-08-14 07:18:06,917 : INFO : EPOCH 1 - PROGRESS: at 32.31% examples, 375607 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:18:07,919 : INFO : EPOCH 1 - PROGRESS: at 32.60% examples, 375797 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:18:08,924 : INFO : EPOCH 1 - PROGRESS: at 32.89% examples, 375961 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:18:09,933 : INFO : EPOCH 1 - PROGRESS: at 33.18% examples, 376101 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:18:10,951 : INFO : EPOCH 1 - PROGRESS: at 33.45% examples, 376077 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:18:11,957 : INFO : EPOCH 1 - PROGRESS: at 33.70% examples, 375686 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:18:12,970 : INFO : EPOCH 1 - PROGRESS: at 33.97% examples, 375556 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:18:14,001 : INFO : EPOCH 1 - PROGRESS: at 34.25% examples, 375503 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:18:15,028 : INFO : EPOCH 1 - PROGRESS: at 34.54% examples, 375509 words/s, in_qsiz

2019-08-14 07:19:21,673 : INFO : EPOCH 1 - PROGRESS: at 50.50% examples, 361143 words/s, in_qsize 8, out_qsize 1
2019-08-14 07:19:22,684 : INFO : EPOCH 1 - PROGRESS: at 50.82% examples, 361432 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:19:23,705 : INFO : EPOCH 1 - PROGRESS: at 51.10% examples, 361422 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:19:24,706 : INFO : EPOCH 1 - PROGRESS: at 51.38% examples, 361441 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:19:25,712 : INFO : EPOCH 1 - PROGRESS: at 51.68% examples, 361535 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:19:26,718 : INFO : EPOCH 1 - PROGRESS: at 51.98% examples, 361681 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:19:27,728 : INFO : EPOCH 1 - PROGRESS: at 52.27% examples, 361859 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:19:28,733 : INFO : EPOCH 1 - PROGRESS: at 52.54% examples, 361974 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:19:29,735 : INFO : EPOCH 1 - PROGRESS: at 52.83% examples, 362158 words/s, in_qsiz

2019-08-14 07:20:36,064 : INFO : EPOCH 1 - PROGRESS: at 70.96% examples, 365087 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:20:37,079 : INFO : EPOCH 1 - PROGRESS: at 71.26% examples, 365199 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:20:38,095 : INFO : EPOCH 1 - PROGRESS: at 71.54% examples, 365276 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:20:39,098 : INFO : EPOCH 1 - PROGRESS: at 71.81% examples, 365392 words/s, in_qsize 1, out_qsize 0
2019-08-14 07:20:40,110 : INFO : EPOCH 1 - PROGRESS: at 72.08% examples, 365465 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:20:41,114 : INFO : EPOCH 1 - PROGRESS: at 72.38% examples, 365602 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:20:42,130 : INFO : EPOCH 1 - PROGRESS: at 72.61% examples, 365457 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:20:43,138 : INFO : EPOCH 1 - PROGRESS: at 72.91% examples, 365583 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:20:44,158 : INFO : EPOCH 1 - PROGRESS: at 73.21% examples, 365663 words/s, in_qsiz

2019-08-14 07:21:50,176 : INFO : EPOCH 1 - PROGRESS: at 90.62% examples, 368450 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:21:51,182 : INFO : EPOCH 1 - PROGRESS: at 90.89% examples, 368522 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:21:52,294 : INFO : EPOCH 1 - PROGRESS: at 91.10% examples, 368251 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:21:53,316 : INFO : EPOCH 1 - PROGRESS: at 91.39% examples, 368320 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:21:54,320 : INFO : EPOCH 1 - PROGRESS: at 91.67% examples, 368441 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:21:55,328 : INFO : EPOCH 1 - PROGRESS: at 91.93% examples, 368525 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:21:56,337 : INFO : EPOCH 1 - PROGRESS: at 92.14% examples, 368422 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:21:57,347 : INFO : EPOCH 1 - PROGRESS: at 92.42% examples, 368461 words/s, in_qsize 6, out_qsize 0
2019-08-14 07:21:58,354 : INFO : EPOCH 1 - PROGRESS: at 92.70% examples, 368615 words/s, in_qsiz

2019-08-14 07:23:01,077 : INFO : EPOCH 2 - PROGRESS: at 9.69% examples, 387587 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:23:02,086 : INFO : EPOCH 2 - PROGRESS: at 9.97% examples, 387750 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:23:03,090 : INFO : EPOCH 2 - PROGRESS: at 10.25% examples, 387684 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:23:04,096 : INFO : EPOCH 2 - PROGRESS: at 10.54% examples, 387991 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:23:05,099 : INFO : EPOCH 2 - PROGRESS: at 10.82% examples, 388300 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:23:06,107 : INFO : EPOCH 2 - PROGRESS: at 11.11% examples, 388322 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:23:07,122 : INFO : EPOCH 2 - PROGRESS: at 11.33% examples, 386169 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:23:08,123 : INFO : EPOCH 2 - PROGRESS: at 11.61% examples, 386066 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:23:09,127 : INFO : EPOCH 2 - PROGRESS: at 11.89% examples, 386174 words/s, in_qsize 

2019-08-14 07:24:16,606 : INFO : EPOCH 2 - PROGRESS: at 28.87% examples, 366451 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:24:17,615 : INFO : EPOCH 2 - PROGRESS: at 29.13% examples, 366656 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:24:18,651 : INFO : EPOCH 2 - PROGRESS: at 29.40% examples, 366775 words/s, in_qsize 6, out_qsize 1
2019-08-14 07:24:19,681 : INFO : EPOCH 2 - PROGRESS: at 29.68% examples, 367038 words/s, in_qsize 5, out_qsize 0
2019-08-14 07:24:20,683 : INFO : EPOCH 2 - PROGRESS: at 29.83% examples, 365744 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:24:21,696 : INFO : EPOCH 2 - PROGRESS: at 30.09% examples, 365781 words/s, in_qsize 1, out_qsize 0
2019-08-14 07:24:22,732 : INFO : EPOCH 2 - PROGRESS: at 30.39% examples, 366218 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:24:23,734 : INFO : EPOCH 2 - PROGRESS: at 30.67% examples, 366609 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:24:24,746 : INFO : EPOCH 2 - PROGRESS: at 30.95% examples, 366754 words/s, in_qsiz

2019-08-14 07:25:30,804 : INFO : EPOCH 2 - PROGRESS: at 48.32% examples, 367202 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:25:31,825 : INFO : EPOCH 2 - PROGRESS: at 48.62% examples, 367286 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:25:32,839 : INFO : EPOCH 2 - PROGRESS: at 48.92% examples, 367415 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:25:33,848 : INFO : EPOCH 2 - PROGRESS: at 49.22% examples, 367480 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:25:34,871 : INFO : EPOCH 2 - PROGRESS: at 49.51% examples, 367565 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:25:35,887 : INFO : EPOCH 2 - PROGRESS: at 49.76% examples, 367282 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:25:36,889 : INFO : EPOCH 2 - PROGRESS: at 50.06% examples, 367353 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:25:37,920 : INFO : EPOCH 2 - PROGRESS: at 50.37% examples, 367494 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:25:38,920 : INFO : EPOCH 2 - PROGRESS: at 50.68% examples, 367693 words/s, in_qsiz

2019-08-14 07:26:45,997 : INFO : EPOCH 2 - PROGRESS: at 68.59% examples, 366572 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:26:47,026 : INFO : EPOCH 2 - PROGRESS: at 68.77% examples, 366094 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:26:48,031 : INFO : EPOCH 2 - PROGRESS: at 69.03% examples, 366035 words/s, in_qsize 1, out_qsize 0
2019-08-14 07:26:49,052 : INFO : EPOCH 2 - PROGRESS: at 69.34% examples, 366201 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:26:50,110 : INFO : EPOCH 2 - PROGRESS: at 69.59% examples, 366026 words/s, in_qsize 1, out_qsize 0
2019-08-14 07:26:51,121 : INFO : EPOCH 2 - PROGRESS: at 69.86% examples, 366037 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:26:52,130 : INFO : EPOCH 2 - PROGRESS: at 70.05% examples, 365714 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:26:53,167 : INFO : EPOCH 2 - PROGRESS: at 70.24% examples, 365326 words/s, in_qsize 3, out_qsize 0
2019-08-14 07:26:54,235 : INFO : EPOCH 2 - PROGRESS: at 70.40% examples, 364758 words/s, in_qsiz

2019-08-14 07:28:01,597 : INFO : EPOCH 2 - PROGRESS: at 85.28% examples, 354296 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:28:02,607 : INFO : EPOCH 2 - PROGRESS: at 85.53% examples, 354317 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:28:03,625 : INFO : EPOCH 2 - PROGRESS: at 85.77% examples, 354279 words/s, in_qsize 3, out_qsize 0
2019-08-14 07:28:04,636 : INFO : EPOCH 2 - PROGRESS: at 85.92% examples, 353893 words/s, in_qsize 6, out_qsize 1
2019-08-14 07:28:05,639 : INFO : EPOCH 2 - PROGRESS: at 86.18% examples, 353933 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:28:06,645 : INFO : EPOCH 2 - PROGRESS: at 86.42% examples, 353947 words/s, in_qsize 0, out_qsize 1
2019-08-14 07:28:07,655 : INFO : EPOCH 2 - PROGRESS: at 86.64% examples, 353820 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:28:08,665 : INFO : EPOCH 2 - PROGRESS: at 86.81% examples, 353492 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:28:09,682 : INFO : EPOCH 2 - PROGRESS: at 87.01% examples, 353345 words/s, in_qsiz

2019-08-14 07:29:12,815 : INFO : EPOCH 3 - PROGRESS: at 1.13% examples, 388651 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:29:13,837 : INFO : EPOCH 3 - PROGRESS: at 1.42% examples, 384102 words/s, in_qsize 6, out_qsize 1
2019-08-14 07:29:14,840 : INFO : EPOCH 3 - PROGRESS: at 1.71% examples, 383375 words/s, in_qsize 7, out_qsize 0
2019-08-14 07:29:15,847 : INFO : EPOCH 3 - PROGRESS: at 1.86% examples, 356260 words/s, in_qsize 2, out_qsize 0
2019-08-14 07:29:16,848 : INFO : EPOCH 3 - PROGRESS: at 2.17% examples, 363295 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:29:17,852 : INFO : EPOCH 3 - PROGRESS: at 2.45% examples, 365871 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:29:18,855 : INFO : EPOCH 3 - PROGRESS: at 2.73% examples, 368056 words/s, in_qsize 1, out_qsize 0
2019-08-14 07:29:19,875 : INFO : EPOCH 3 - PROGRESS: at 2.95% examples, 361268 words/s, in_qsize 1, out_qsize 1
2019-08-14 07:29:20,886 : INFO : EPOCH 3 - PROGRESS: at 3.20% examples, 359794 words/s, in_qsize 0, out_

2019-08-14 07:30:27,565 : INFO : EPOCH 3 - PROGRESS: at 20.19% examples, 353728 words/s, in_qsize 6, out_qsize 1
2019-08-14 07:30:28,568 : INFO : EPOCH 3 - PROGRESS: at 20.50% examples, 354591 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:30:29,575 : INFO : EPOCH 3 - PROGRESS: at 20.78% examples, 355154 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:30:30,579 : INFO : EPOCH 3 - PROGRESS: at 21.06% examples, 355524 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:30:31,586 : INFO : EPOCH 3 - PROGRESS: at 21.34% examples, 355862 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:30:32,598 : INFO : EPOCH 3 - PROGRESS: at 21.63% examples, 356342 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:30:33,601 : INFO : EPOCH 3 - PROGRESS: at 21.90% examples, 356368 words/s, in_qsize 5, out_qsize 2
2019-08-14 07:30:34,639 : INFO : EPOCH 3 - PROGRESS: at 22.14% examples, 356275 words/s, in_qsize 7, out_qsize 0
2019-08-14 07:30:35,684 : INFO : EPOCH 3 - PROGRESS: at 22.44% examples, 356984 words/s, in_qsiz

2019-08-14 07:31:42,409 : INFO : EPOCH 3 - PROGRESS: at 40.39% examples, 366908 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:31:43,424 : INFO : EPOCH 3 - PROGRESS: at 40.68% examples, 367069 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:31:44,436 : INFO : EPOCH 3 - PROGRESS: at 40.90% examples, 366639 words/s, in_qsize 0, out_qsize 1
2019-08-14 07:31:45,441 : INFO : EPOCH 3 - PROGRESS: at 41.16% examples, 366655 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:31:46,445 : INFO : EPOCH 3 - PROGRESS: at 41.38% examples, 366350 words/s, in_qsize 5, out_qsize 0
2019-08-14 07:31:47,450 : INFO : EPOCH 3 - PROGRESS: at 41.64% examples, 366402 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:31:48,475 : INFO : EPOCH 3 - PROGRESS: at 41.89% examples, 366308 words/s, in_qsize 3, out_qsize 0
2019-08-14 07:31:49,490 : INFO : EPOCH 3 - PROGRESS: at 42.18% examples, 366488 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:31:50,495 : INFO : EPOCH 3 - PROGRESS: at 42.45% examples, 366581 words/s, in_qsiz

2019-08-14 07:32:57,817 : INFO : EPOCH 3 - PROGRESS: at 58.55% examples, 355020 words/s, in_qsize 0, out_qsize 1
2019-08-14 07:32:58,833 : INFO : EPOCH 3 - PROGRESS: at 58.73% examples, 354466 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:32:59,892 : INFO : EPOCH 3 - PROGRESS: at 58.96% examples, 354234 words/s, in_qsize 4, out_qsize 3
2019-08-14 07:33:00,898 : INFO : EPOCH 3 - PROGRESS: at 59.21% examples, 354148 words/s, in_qsize 1, out_qsize 0
2019-08-14 07:33:01,912 : INFO : EPOCH 3 - PROGRESS: at 59.46% examples, 354086 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:33:02,928 : INFO : EPOCH 3 - PROGRESS: at 59.75% examples, 354153 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:33:03,943 : INFO : EPOCH 3 - PROGRESS: at 60.03% examples, 354241 words/s, in_qsize 4, out_qsize 0
2019-08-14 07:33:04,958 : INFO : EPOCH 3 - PROGRESS: at 60.30% examples, 354295 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:33:05,960 : INFO : EPOCH 3 - PROGRESS: at 60.58% examples, 354431 words/s, in_qsiz

2019-08-14 07:34:15,008 : INFO : EPOCH 3 - PROGRESS: at 73.34% examples, 332007 words/s, in_qsize 1, out_qsize 1
2019-08-14 07:34:16,017 : INFO : EPOCH 3 - PROGRESS: at 73.55% examples, 331915 words/s, in_qsize 2, out_qsize 0
2019-08-14 07:34:17,021 : INFO : EPOCH 3 - PROGRESS: at 73.79% examples, 331981 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:34:18,048 : INFO : EPOCH 3 - PROGRESS: at 74.05% examples, 332070 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:34:19,057 : INFO : EPOCH 3 - PROGRESS: at 74.30% examples, 332176 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:34:20,063 : INFO : EPOCH 3 - PROGRESS: at 74.59% examples, 332359 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:34:21,120 : INFO : EPOCH 3 - PROGRESS: at 74.83% examples, 332327 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:34:22,132 : INFO : EPOCH 3 - PROGRESS: at 75.03% examples, 332159 words/s, in_qsize 3, out_qsize 0
2019-08-14 07:34:23,133 : INFO : EPOCH 3 - PROGRESS: at 75.31% examples, 332305 words/s, in_qsiz

2019-08-14 07:35:32,473 : INFO : EPOCH 3 - PROGRESS: at 89.24% examples, 324378 words/s, in_qsize 6, out_qsize 1
2019-08-14 07:35:33,496 : INFO : EPOCH 3 - PROGRESS: at 89.39% examples, 324066 words/s, in_qsize 7, out_qsize 0
2019-08-14 07:35:42,058 : INFO : EPOCH 3 - PROGRESS: at 89.49% examples, 317385 words/s, in_qsize 5, out_qsize 3
2019-08-14 07:35:43,069 : INFO : EPOCH 3 - PROGRESS: at 89.68% examples, 317323 words/s, in_qsize 3, out_qsize 0
2019-08-14 07:35:44,069 : INFO : EPOCH 3 - PROGRESS: at 89.92% examples, 317467 words/s, in_qsize 5, out_qsize 0
2019-08-14 07:35:45,073 : INFO : EPOCH 3 - PROGRESS: at 90.16% examples, 317581 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:35:46,085 : INFO : EPOCH 3 - PROGRESS: at 90.35% examples, 317518 words/s, in_qsize 7, out_qsize 0
2019-08-14 07:35:47,102 : INFO : EPOCH 3 - PROGRESS: at 90.53% examples, 317428 words/s, in_qsize 8, out_qsize 0
2019-08-14 07:35:48,158 : INFO : EPOCH 3 - PROGRESS: at 90.70% examples, 317246 words/s, in_qsiz

2019-08-14 07:36:52,612 : INFO : EPOCH 4 - PROGRESS: at 1.71% examples, 284029 words/s, in_qsize 6, out_qsize 0
2019-08-14 07:36:53,619 : INFO : EPOCH 4 - PROGRESS: at 1.97% examples, 290771 words/s, in_qsize 6, out_qsize 0
2019-08-14 07:36:54,645 : INFO : EPOCH 4 - PROGRESS: at 2.23% examples, 295704 words/s, in_qsize 6, out_qsize 0
2019-08-14 07:36:55,783 : INFO : EPOCH 4 - PROGRESS: at 2.48% examples, 296675 words/s, in_qsize 2, out_qsize 1
2019-08-14 07:36:56,797 : INFO : EPOCH 4 - PROGRESS: at 2.73% examples, 299899 words/s, in_qsize 7, out_qsize 0
2019-08-14 07:36:57,824 : INFO : EPOCH 4 - PROGRESS: at 3.00% examples, 305328 words/s, in_qsize 3, out_qsize 0
2019-08-14 07:36:58,839 : INFO : EPOCH 4 - PROGRESS: at 3.25% examples, 307386 words/s, in_qsize 7, out_qsize 0
2019-08-14 07:36:59,841 : INFO : EPOCH 4 - PROGRESS: at 3.50% examples, 309525 words/s, in_qsize 6, out_qsize 0
2019-08-14 07:37:00,848 : INFO : EPOCH 4 - PROGRESS: at 3.75% examples, 311913 words/s, in_qsize 6, out_

2019-08-14 07:38:08,966 : INFO : EPOCH 4 - PROGRESS: at 15.61% examples, 255244 words/s, in_qsize 6, out_qsize 0
2019-08-14 07:38:10,216 : INFO : EPOCH 4 - PROGRESS: at 15.87% examples, 255758 words/s, in_qsize 5, out_qsize 2
2019-08-14 07:38:11,234 : INFO : EPOCH 4 - PROGRESS: at 16.15% examples, 257559 words/s, in_qsize 4, out_qsize 0
2019-08-14 07:38:12,273 : INFO : EPOCH 4 - PROGRESS: at 16.40% examples, 258377 words/s, in_qsize 3, out_qsize 0
2019-08-14 07:38:13,276 : INFO : EPOCH 4 - PROGRESS: at 16.62% examples, 258900 words/s, in_qsize 1, out_qsize 0
2019-08-14 07:38:14,277 : INFO : EPOCH 4 - PROGRESS: at 16.90% examples, 260222 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:38:15,314 : INFO : EPOCH 4 - PROGRESS: at 17.09% examples, 260166 words/s, in_qsize 2, out_qsize 0
2019-08-14 07:38:16,320 : INFO : EPOCH 4 - PROGRESS: at 17.34% examples, 261084 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:38:17,326 : INFO : EPOCH 4 - PROGRESS: at 17.61% examples, 262126 words/s, in_qsiz

2019-08-14 07:39:24,281 : INFO : EPOCH 4 - PROGRESS: at 34.03% examples, 297411 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:39:25,287 : INFO : EPOCH 4 - PROGRESS: at 34.33% examples, 298014 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:39:26,307 : INFO : EPOCH 4 - PROGRESS: at 34.61% examples, 298523 words/s, in_qsize 0, out_qsize 1
2019-08-14 07:39:27,318 : INFO : EPOCH 4 - PROGRESS: at 34.91% examples, 299132 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:39:28,318 : INFO : EPOCH 4 - PROGRESS: at 35.13% examples, 299113 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:39:29,371 : INFO : EPOCH 4 - PROGRESS: at 35.40% examples, 299430 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:39:30,375 : INFO : EPOCH 4 - PROGRESS: at 35.71% examples, 300090 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:39:31,393 : INFO : EPOCH 4 - PROGRESS: at 36.00% examples, 300665 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:39:32,452 : INFO : EPOCH 4 - PROGRESS: at 36.24% examples, 300823 words/s, in_qsiz

2019-08-14 07:40:39,787 : INFO : EPOCH 4 - PROGRESS: at 52.50% examples, 310567 words/s, in_qsize 5, out_qsize 1
2019-08-14 07:40:41,275 : INFO : EPOCH 4 - PROGRESS: at 52.71% examples, 309866 words/s, in_qsize 0, out_qsize 2
2019-08-14 07:40:42,291 : INFO : EPOCH 4 - PROGRESS: at 52.91% examples, 309755 words/s, in_qsize 6, out_qsize 0
2019-08-14 07:40:43,305 : INFO : EPOCH 4 - PROGRESS: at 53.22% examples, 310179 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:40:44,315 : INFO : EPOCH 4 - PROGRESS: at 53.51% examples, 310627 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:40:45,331 : INFO : EPOCH 4 - PROGRESS: at 53.80% examples, 310988 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:40:46,351 : INFO : EPOCH 4 - PROGRESS: at 54.10% examples, 311360 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:40:47,364 : INFO : EPOCH 4 - PROGRESS: at 54.39% examples, 311668 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:40:48,365 : INFO : EPOCH 4 - PROGRESS: at 54.69% examples, 312009 words/s, in_qsiz

2019-08-14 07:41:55,422 : INFO : EPOCH 4 - PROGRESS: at 72.04% examples, 321215 words/s, in_qsize 4, out_qsize 2
2019-08-14 07:41:56,432 : INFO : EPOCH 4 - PROGRESS: at 72.34% examples, 321499 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:41:57,437 : INFO : EPOCH 4 - PROGRESS: at 72.56% examples, 321450 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:41:58,448 : INFO : EPOCH 4 - PROGRESS: at 72.84% examples, 321595 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:41:59,466 : INFO : EPOCH 4 - PROGRESS: at 73.13% examples, 321805 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:42:00,505 : INFO : EPOCH 4 - PROGRESS: at 73.41% examples, 322023 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:42:01,555 : INFO : EPOCH 4 - PROGRESS: at 73.68% examples, 322208 words/s, in_qsize 1, out_qsize 0
2019-08-14 07:42:02,555 : INFO : EPOCH 4 - PROGRESS: at 73.97% examples, 322482 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:42:03,566 : INFO : EPOCH 4 - PROGRESS: at 74.18% examples, 322408 words/s, in_qsiz

2019-08-14 07:43:09,460 : INFO : EPOCH 4 - PROGRESS: at 91.18% examples, 330860 words/s, in_qsize 6, out_qsize 1
2019-08-14 07:43:10,466 : INFO : EPOCH 4 - PROGRESS: at 91.47% examples, 331087 words/s, in_qsize 7, out_qsize 0
2019-08-14 07:43:11,470 : INFO : EPOCH 4 - PROGRESS: at 91.70% examples, 331139 words/s, in_qsize 5, out_qsize 0
2019-08-14 07:43:12,493 : INFO : EPOCH 4 - PROGRESS: at 91.95% examples, 331238 words/s, in_qsize 6, out_qsize 0
2019-08-14 07:43:13,499 : INFO : EPOCH 4 - PROGRESS: at 92.21% examples, 331389 words/s, in_qsize 2, out_qsize 0
2019-08-14 07:43:14,532 : INFO : EPOCH 4 - PROGRESS: at 92.49% examples, 331542 words/s, in_qsize 8, out_qsize 0
2019-08-14 07:43:15,541 : INFO : EPOCH 4 - PROGRESS: at 92.73% examples, 331584 words/s, in_qsize 3, out_qsize 0
2019-08-14 07:43:16,551 : INFO : EPOCH 4 - PROGRESS: at 92.98% examples, 331712 words/s, in_qsize 8, out_qsize 0
2019-08-14 07:43:17,553 : INFO : EPOCH 4 - PROGRESS: at 93.25% examples, 331871 words/s, in_qsiz

2019-08-14 07:44:19,978 : INFO : EPOCH 5 - PROGRESS: at 10.00% examples, 376854 words/s, in_qsize 3, out_qsize 0
2019-08-14 07:44:20,997 : INFO : EPOCH 5 - PROGRESS: at 10.29% examples, 377360 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:44:22,019 : INFO : EPOCH 5 - PROGRESS: at 10.54% examples, 376313 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:44:23,036 : INFO : EPOCH 5 - PROGRESS: at 10.82% examples, 376565 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:44:24,048 : INFO : EPOCH 5 - PROGRESS: at 11.11% examples, 376848 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:44:25,059 : INFO : EPOCH 5 - PROGRESS: at 11.39% examples, 377211 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:44:26,070 : INFO : EPOCH 5 - PROGRESS: at 11.67% examples, 377245 words/s, in_qsize 4, out_qsize 0
2019-08-14 07:44:27,087 : INFO : EPOCH 5 - PROGRESS: at 11.96% examples, 377613 words/s, in_qsize 6, out_qsize 1
2019-08-14 07:44:28,089 : INFO : EPOCH 5 - PROGRESS: at 12.25% examples, 378237 words/s, in_qsiz

2019-08-14 07:45:34,299 : INFO : EPOCH 5 - PROGRESS: at 30.13% examples, 379838 words/s, in_qsize 2, out_qsize 0
2019-08-14 07:45:35,317 : INFO : EPOCH 5 - PROGRESS: at 30.39% examples, 379792 words/s, in_qsize 6, out_qsize 1
2019-08-14 07:45:36,338 : INFO : EPOCH 5 - PROGRESS: at 30.69% examples, 380152 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:45:37,349 : INFO : EPOCH 5 - PROGRESS: at 30.98% examples, 380399 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:45:38,352 : INFO : EPOCH 5 - PROGRESS: at 31.21% examples, 380026 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:45:39,370 : INFO : EPOCH 5 - PROGRESS: at 31.50% examples, 380159 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:45:40,376 : INFO : EPOCH 5 - PROGRESS: at 31.78% examples, 380185 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:45:41,379 : INFO : EPOCH 5 - PROGRESS: at 32.06% examples, 380175 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:45:42,387 : INFO : EPOCH 5 - PROGRESS: at 32.33% examples, 380042 words/s, in_qsiz

2019-08-14 07:46:48,314 : INFO : EPOCH 5 - PROGRESS: at 50.38% examples, 380129 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:46:49,361 : INFO : EPOCH 5 - PROGRESS: at 50.67% examples, 380085 words/s, in_qsize 8, out_qsize 0
2019-08-14 07:46:50,370 : INFO : EPOCH 5 - PROGRESS: at 50.97% examples, 380166 words/s, in_qsize 2, out_qsize 1
2019-08-14 07:46:51,422 : INFO : EPOCH 5 - PROGRESS: at 51.31% examples, 380327 words/s, in_qsize 5, out_qsize 0
2019-08-14 07:46:52,426 : INFO : EPOCH 5 - PROGRESS: at 51.56% examples, 380074 words/s, in_qsize 3, out_qsize 1
2019-08-14 07:46:53,429 : INFO : EPOCH 5 - PROGRESS: at 51.89% examples, 380254 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:46:54,478 : INFO : EPOCH 5 - PROGRESS: at 52.18% examples, 380214 words/s, in_qsize 5, out_qsize 0
2019-08-14 07:46:55,481 : INFO : EPOCH 5 - PROGRESS: at 52.46% examples, 380339 words/s, in_qsize 4, out_qsize 0
2019-08-14 07:46:56,485 : INFO : EPOCH 5 - PROGRESS: at 52.75% examples, 380436 words/s, in_qsiz

2019-08-14 07:48:03,832 : INFO : EPOCH 5 - PROGRESS: at 69.74% examples, 371066 words/s, in_qsize 2, out_qsize 0
2019-08-14 07:48:04,836 : INFO : EPOCH 5 - PROGRESS: at 70.01% examples, 371121 words/s, in_qsize 5, out_qsize 1
2019-08-14 07:48:05,871 : INFO : EPOCH 5 - PROGRESS: at 70.28% examples, 371160 words/s, in_qsize 6, out_qsize 1
2019-08-14 07:48:06,885 : INFO : EPOCH 5 - PROGRESS: at 70.56% examples, 371270 words/s, in_qsize 7, out_qsize 0
2019-08-14 07:48:07,906 : INFO : EPOCH 5 - PROGRESS: at 70.87% examples, 371450 words/s, in_qsize 5, out_qsize 0
2019-08-14 07:48:08,911 : INFO : EPOCH 5 - PROGRESS: at 71.12% examples, 371345 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:48:09,914 : INFO : EPOCH 5 - PROGRESS: at 71.39% examples, 371301 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:48:10,921 : INFO : EPOCH 5 - PROGRESS: at 71.59% examples, 370990 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:48:11,934 : INFO : EPOCH 5 - PROGRESS: at 71.86% examples, 371093 words/s, in_qsiz

2019-08-14 07:49:18,200 : INFO : EPOCH 5 - PROGRESS: at 88.52% examples, 368679 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:49:19,205 : INFO : EPOCH 5 - PROGRESS: at 88.79% examples, 368729 words/s, in_qsize 4, out_qsize 1
2019-08-14 07:49:20,211 : INFO : EPOCH 5 - PROGRESS: at 89.07% examples, 368811 words/s, in_qsize 6, out_qsize 1
2019-08-14 07:49:21,216 : INFO : EPOCH 5 - PROGRESS: at 89.35% examples, 368918 words/s, in_qsize 4, out_qsize 0
2019-08-14 07:49:22,233 : INFO : EPOCH 5 - PROGRESS: at 89.64% examples, 369096 words/s, in_qsize 0, out_qsize 0
2019-08-14 07:49:23,636 : INFO : EPOCH 5 - PROGRESS: at 89.88% examples, 368669 words/s, in_qsize 4, out_qsize 0
2019-08-14 07:49:24,646 : INFO : EPOCH 5 - PROGRESS: at 90.14% examples, 368744 words/s, in_qsize 5, out_qsize 2
2019-08-14 07:49:25,652 : INFO : EPOCH 5 - PROGRESS: at 90.39% examples, 368815 words/s, in_qsize 6, out_qsize 0
2019-08-14 07:49:26,657 : INFO : EPOCH 5 - PROGRESS: at 90.68% examples, 369025 words/s, in_qsiz

FileNotFoundError: [Errno 2] No such file or directory: 'models/w2v/model.w2v.wv.vectors.npy'

In [19]:
model.save("models/w2v/model.w2v")

2019-08-14 07:56:41,659 : INFO : saving Word2Vec object under models/w2v/model.w2v, separately None
2019-08-14 07:56:41,665 : INFO : storing np array 'vectors' to models/w2v/model.w2v.wv.vectors.npy
2019-08-14 07:56:45,115 : INFO : not storing attribute vectors_norm
2019-08-14 07:56:45,121 : INFO : storing np array 'syn1neg' to models/w2v/model.w2v.trainables.syn1neg.npy
2019-08-14 07:56:53,674 : INFO : saved models/w2v/model.w2v


In [20]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Высота матрицы (максимальное количество слов в твите)
SENTENCE_LENGTH = 26
# Размер словаря
NUM = 100000

def get_sequences(tokenizer, x):
    sequences = tokenizer.texts_to_sequences(x)
    return pad_sequences(sequences, maxlen=SENTENCE_LENGTH)

# Cоздаем и обучаем токенизатор
tokenizer = Tokenizer(num_words=NUM)
tokenizer.fit_on_texts(x_train)

# Отображаем каждый текст в массив идентификаторов токенов
x_train_seq = get_sequences(tokenizer, x_train)
x_test_seq = get_sequences(tokenizer, x_test)

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [23]:
from gensim.models import Word2Vec
# Загружаем обученную модель
w2v_model = Word2Vec.load('models/w2v/model.w2v')
DIM = w2v_model.vector_size 
# Инициализируем матрицу embedding слоя нулями
embedding_matrix = np.zeros((NUM, DIM))
# Добавляем NUM=100000 наиболее часто встречающихся слов из обучающей выборки в embedding слой
for word, i in tokenizer.word_index.items():
    if i >= NUM:
        break
    if word in w2v_model.wv.vocab.keys():
        embedding_matrix[i] = w2v_model.wv[word]

2019-08-14 08:00:20,252 : INFO : loading Word2Vec object from models/w2v/model.w2v
2019-08-14 08:00:25,078 : INFO : loading wv recursively from models/w2v/model.w2v.wv.* with mmap=None
2019-08-14 08:00:25,080 : INFO : loading vectors from models/w2v/model.w2v.wv.vectors.npy with mmap=None
2019-08-14 08:00:28,270 : INFO : setting ignored attribute vectors_norm to None
2019-08-14 08:00:28,301 : INFO : loading vocabulary recursively from models/w2v/model.w2v.vocabulary.* with mmap=None
2019-08-14 08:00:28,304 : INFO : loading trainables recursively from models/w2v/model.w2v.trainables.* with mmap=None
2019-08-14 08:00:28,308 : INFO : loading syn1neg from models/w2v/model.w2v.trainables.syn1neg.npy with mmap=None
2019-08-14 08:00:33,061 : INFO : loaded models/w2v/model.w2v


In [27]:
from keras import backend as K


def precision(y_true, y_pred):
    """Precision metric.

    Only computes a batch-wise average of precision.

    Computes the precision, a metric for multi-label classification of
    how many selected items are relevant.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision


def recall(y_true, y_pred):
    """Recall metric.

    Only computes a batch-wise average of recall.

    Computes the recall, a metric for multi-label classification of
    how many relevant items are selected.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall


def f1(y_true, y_pred):
    def recall(y_true, y_pred):
        """Recall metric.

        Only computes a batch-wise average of recall.

        Computes the recall, a metric for multi-label classification of
        how many relevant items are selected.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

    def precision(y_true, y_pred):
        """Precision metric.

        Only computes a batch-wise average of precision.

        Computes the precision, a metric for multi-label classification of
        how many selected items are relevant.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision

    precision = precision(y_true, y_pred)
    recall = recall(y_true, y_pred)
    return 2 * ((precision * recall) / (precision + recall + K.epsilon()))

#### Свёрточная нейронная сеть

In [28]:
from keras.layers import Input
from keras.layers.embeddings import Embedding

tweet_input = Input(shape=(SENTENCE_LENGTH,), dtype='int32')
tweet_encoder = Embedding(NUM, DIM, input_length=SENTENCE_LENGTH,
                          weights=[embedding_matrix], trainable=False)(tweet_input)

In [29]:
from keras import optimizers
from keras.layers import Dense, concatenate, Activation, Dropout
from keras.models import Model
from keras.layers.convolutional import Conv1D
from keras.layers.pooling import GlobalMaxPooling1D

branches = []
# Добавляем dropout-регуляризацию
x = Dropout(0.2)(tweet_encoder)

for size, filters_count in [(2, 10), (3, 10), (4, 10), (5, 10)]:
    for i in range(filters_count):
        # Добавляем слой свертки
        branch = Conv1D(filters=1, kernel_size=size, padding='valid', activation='relu')(x)
        # Добавляем слой субдискретизации
        branch = GlobalMaxPooling1D()(branch)
        branches.append(branch)
# Конкатенируем карты признаков
x = concatenate(branches, axis=1)
# Добавляем dropout-регуляризацию
x = Dropout(0.2)(x)
x = Dense(30, activation='relu')(x)
x = Dense(1)(x)
output = Activation('sigmoid')(x)

model = Model(inputs=[tweet_input], outputs=[output])

In [30]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[precision, recall, f1])
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            (None, 26)           0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 26, 200)      20000000    input_3[0][0]                    
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, 26, 200)      0           embedding_2[0][0]                
__________________________________________________________________________________________________
conv1d_41 (Conv1D)              (None, 25, 1)        401         dropout_3[0][0]                  
__________________________________________________________________________________________________
conv1d_42 

In [31]:
from keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint("models/cnn/cnn-frozen-embeddings-{epoch:02d}-{val_f1:.2f}.hdf5", monitor='val_f1', save_best_only=True, mode='max', period=1)
history = model.fit(x_train_seq, y_train, batch_size=32, epochs=10, validation_split=0.25, callbacks = [checkpoint])

Train on 134307 samples, validate on 44769 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Затем выбираем модель с наивысшими показателями F-меры на валидационном наборе данных, т.е. модель, полученную на восьмой эпохе обучения. У модели разморозил embedding-слой, после чего запустил еще пять эпох обучения. 

In [33]:
from keras import optimizers
# Загружаем веса модели
model.load_weights('models/cnn/cnn-frozen-embeddings-06-0.77.hdf5')
# Делаем embedding слой способным к обучению
model.layers[1].trainable = True
# Уменьшаем learning rate
adam = optimizers.Adam(lr=0.0001)
model.compile(loss='binary_crossentropy', optimizer=adam, metrics=[precision, recall, f1])
model.summary()
 

checkpoint = ModelCheckpoint("models/cnn/cnn-trainable-{epoch:02d}-{val_f1:.2f}.hdf5", monitor='val_f1', save_best_only=True, mode='max', period=1)

history_trainable = model.fit(x_train_seq, y_train, batch_size=32, epochs=5, validation_split=0.25, callbacks = [checkpoint])

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            (None, 26)           0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 26, 200)      20000000    input_3[0][0]                    
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, 26, 200)      0           embedding_2[0][0]                
__________________________________________________________________________________________________
conv1d_41 (Conv1D)              (None, 25, 1)        401         dropout_3[0][0]                  
__________________________________________________________________________________________________
conv1d_42 

Train on 134307 samples, validate on 44769 samples
Epoch 1/5
  6112/134307 [>.............................] - ETA: 36:04 - loss: 0.4774 - precision: 0.7631 - recall: 0.7648 - f1: 0.7585

KeyboardInterrupt: 