## Read me

This notebook is a copy of the notebook located at:

https://colab.research.google.com/drive/1g_2W__vi6fuEn8pSma0NXNHbNuebptHF?usp=sharing

It may be more convenient to follow the link and run the same notebook in Google Colab. If you run this notebook on your local machine, comment out the "%tensorflow_version 1.x" line, because it is a Colab-only magic command.
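
If you are not sure which environment a kernel is running in, a check along the following lines can help. This is a minimal sketch that assumes Colab kernels load the `google.colab` package at start-up; the helper name `running_in_colab` is illustrative and not part of the notebook.

```python
import sys


def running_in_colab() -> bool:
    """Return True when the google.colab package is already loaded in this kernel."""
    return "google.colab" in sys.modules


if running_in_colab():
    print("Colab runtime: keep the %tensorflow_version 1.x cell.")
else:
    print("Local runtime: comment out the %tensorflow_version 1.x cell.")
```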

This notebook is an example of multilabel sentence classification. We use RuDR-BERT, which is pretrained on the raw part of the RuDReC corpus, and train it on the annotated part of the same corpus. Both the data and the model are available at:

https://github.com/cimm-kzn/RuDReC
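
In a multilabel setting, a sentence can carry several of the five labels at once, so each label gets its own independent probability (the prediction logs in this notebook show a sigmoid output of shape (?, 5)). The sketch below illustrates the idea with made-up logits; the function names and the logit values are assumptions for illustration and are not taken from the notebook's code.

```python
import math

# Label order used in this notebook's label_names mapping.
LABELS = ["EF", "INF", "ADR", "DI", "Finding"]


def sigmoid(x: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode(logits, threshold=0.5):
    """Return the names of all labels whose probability reaches the threshold."""
    probabilities = [sigmoid(z) for z in logits]
    return [name for name, p in zip(LABELS, probabilities) if p >= threshold]


# Hypothetical logits for one sentence: two labels pass the threshold.
print(decode([3.0, -2.0, -4.0, 2.5, -1.0]))
```

Unlike a softmax, the sigmoid outputs do not have to sum to one, so zero, one, or several labels can be assigned to the same sentence.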

Please read the beginning of the section "Downloading RuDR-BERT model" carefully.



In [None]:
%tensorflow_version 1.x

TensorFlow 1.x selected.


In [None]:
!nvidia-smi

Thu Jul  9 19:13:54 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Pulling necessary code

In [None]:
!git init
!git pull https://github.com/google-research/bert.git
!git clone https://github.com/Andoree/med_project.git
!cp med_project/multilabel_code/bert_preprocessing.py ./
!cp med_project/multilabel_code/multilabel_bert.py ./

Initialized empty Git repository in /content/.git/
remote: Enumerating objects: 340, done.
remote: Total 340 (delta 0), reused 0 (delta 0), pack-reused 340
Receiving objects: 100% (340/340), 317.20 KiB | 604.00 KiB/s, done.
Resolving deltas: 100% (185/185), done.
From https://github.com/google-research/bert
 * branch            HEAD       -> FETCH_HEAD
Cloning into 'med_project'...
remote: Enumerating objects: 482, done.
remote: Counting objects: 100% (482/482), done.
remote: Compressing objects: 100% (336/336), done.
remote: Total 1865 (delta 290), reused 336 (delta 145), pack-reused 1383
Receiving objects: 100% (1865/1865), 11.92 MiB | 7.34 MiB/s, done.
Resolving deltas: 100% (543/543), done.


#### Downloading RuDR-BERT model

In this tutorial, we offer two options:

1) Download the RuDR-BERT model that has not been fine-tuned on the multilabel sentence classification task. In this case, you need to train it on our data (the annotated part of the RuDReC corpus) or on another dataset.

2) Download the RuDR-BERT model that is already fine-tuned on the multilabel sentence classification task. With this option, you do not need to execute the cells in the "Training" section of this notebook. This is the same RuDR-BERT, additionally trained on the annotated part of the RuDReC corpus for 10 epochs with a batch size of 16 and a maximum sequence length of 128.

By default, the lines for the second option are commented out. If you do not want to train a model yourself, comment out the download of the first model and uncomment the download of the second. Then change the model paths in the "Parameters" section: the BERT checkpoint, vocabulary, and config paths, which are all derived from the base BERT directory.
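
One way to keep the two options from drifting apart is to select the model directory in a single place. The sketch below is only an illustration: the directory and file names are the ones used in the "Parameters" section, while the `MODEL_DIRS` mapping and the `bert_paths` helper are hypothetical names introduced here.

```python
import os

# Directory names as they appear in the "Parameters" section of this notebook.
MODEL_DIRS = {
    "pretrained": "bert_models/multilingual_russian_reviews_finetuned/",
    "fine_tuned": "bert_models/RuDR_classification_finetuned/",
}


def bert_paths(option: str) -> dict:
    """Build the vocabulary, checkpoint, and config paths for the chosen option."""
    base_dir = MODEL_DIRS[option]
    return {
        "vocab": os.path.join(base_dir, "vocab.txt"),
        "checkpoint": os.path.join(base_dir, "bert_model.ckpt"),
        "config": os.path.join(base_dir, "bert_config.json"),
    }


paths = bert_paths("pretrained")
print(paths["vocab"])
```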

**Comment these lines out if you don't want to fine-tune RuDR-BERT**

In [None]:
!mkdir bert_models/
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ou6XLWI_Yp_jPwFox-QWMb3SpJIVinZp' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1ou6XLWI_Yp_jPwFox-QWMb3SpJIVinZp" -O bert_models/RuDR_BERT.tar.gz && rm -rf /tmp/cookies.txt
!tar -xvf bert_models/RuDR_BERT.tar.gz -C bert_models
!ls bert_models/multilingual_russian_reviews_finetuned

--2020-07-09 19:14:13--  https://docs.google.com/uc?export=download&confirm=ltyd&id=1ou6XLWI_Yp_jPwFox-QWMb3SpJIVinZp
Resolving docs.google.com (docs.google.com)... 108.177.97.113, 108.177.97.100, 108.177.97.101, ...
Connecting to docs.google.com (docs.google.com)|108.177.97.113|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-08-8s-docs.googleusercontent.com/docs/securesc/4pfs57o28d1ukb4idtk3tjp4bnavss02/7eul7mcd4jh7ooodorrffqlfommcleld/1594322025000/06930042168325031160/08233439409035204950Z/1ou6XLWI_Yp_jPwFox-QWMb3SpJIVinZp?e=download [following]
--2020-07-09 19:14:14--  https://doc-08-8s-docs.googleusercontent.com/docs/securesc/4pfs57o28d1ukb4idtk3tjp4bnavss02/7eul7mcd4jh7ooodorrffqlfommcleld/1594322025000/06930042168325031160/08233439409035204950Z/1ou6XLWI_Yp_jPwFox-QWMb3SpJIVinZp?e=download
Resolving doc-08-8s-docs.googleusercontent.com (doc-08-8s-docs.googleusercontent.com)... 74.125.204.132, 2404:6800:4008:c04::84
Connecting

**Uncomment these lines to download the fine-tuned model** 

In [None]:
# !mkdir bert_models/
# !wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1hW7QVM3iOHaWn8U31oJSF1wKfAIohCtz' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1hW7QVM3iOHaWn8U31oJSF1wKfAIohCtz" -O bert_models/rudr_classification_finetuned.tar.gz && rm -rf /tmp/cookies.txt
# !tar -xvf bert_models/rudr_classification_finetuned.tar.gz -C bert_models
# !ls bert_models/

--2020-07-09 19:06:32--  https://docs.google.com/uc?export=download&confirm=Xa54&id=1hW7QVM3iOHaWn8U31oJSF1wKfAIohCtz
Resolving docs.google.com (docs.google.com)... 172.217.194.102, 172.217.194.101, 172.217.194.100, ...
Connecting to docs.google.com (docs.google.com)|172.217.194.102|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-0g-14-docs.googleusercontent.com/docs/securesc/ndqqiso8op3fbjmnh1125ib4jlc07qd4/nq3881kdjlnuub6nvrf1lb96ijprpsmc/1594321575000/06930042168325031160/00244358364593097285Z/1hW7QVM3iOHaWn8U31oJSF1wKfAIohCtz?e=download [following]
--2020-07-09 19:06:32--  https://doc-0g-14-docs.googleusercontent.com/docs/securesc/ndqqiso8op3fbjmnh1125ib4jlc07qd4/nq3881kdjlnuub6nvrf1lb96ijprpsmc/1594321575000/06930042168325031160/00244358364593097285Z/1hW7QVM3iOHaWn8U31oJSF1wKfAIohCtz?e=download
Resolving doc-0g-14-docs.googleusercontent.com (doc-0g-14-docs.googleusercontent.com)... 172.217.194.132, 2404:6800:4003:c04::84
Conne

#### Downloading the annotated part of the RuDReC corpus and splitting it into sentences

In [None]:
!mkdir -p data/rudrec_annotated
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1plApL6qmdHtNNP3OXgJQEmo7Lfp6MVeO' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1plApL6qmdHtNNP3OXgJQEmo7Lfp6MVeO" -O data/rudrec_annotated/rudrec.zip && rm -rf /tmp/cookies.txt

--2020-07-09 19:15:50--  https://docs.google.com/uc?export=download&confirm=&id=1plApL6qmdHtNNP3OXgJQEmo7Lfp6MVeO
Resolving docs.google.com (docs.google.com)... 74.125.204.100, 74.125.204.138, 74.125.204.102, ...
Connecting to docs.google.com (docs.google.com)|74.125.204.100|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-0o-4k-docs.googleusercontent.com/docs/securesc/t8938fetd95pvvvmfoh0le2b8333ontt/7kvoq9us7su7ljt9fgudu5f0ct30ppl6/1594322100000/06930042168325031160/08433013916610211556Z/1plApL6qmdHtNNP3OXgJQEmo7Lfp6MVeO?e=download [following]
--2020-07-09 19:15:57--  https://doc-0o-4k-docs.googleusercontent.com/docs/securesc/t8938fetd95pvvvmfoh0le2b8333ontt/7kvoq9us7su7ljt9fgudu5f0ct30ppl6/1594322100000/06930042168325031160/08433013916610211556Z/1plApL6qmdHtNNP3OXgJQEmo7Lfp6MVeO?e=download
Resolving doc-0o-4k-docs.googleusercontent.com (doc-0o-4k-docs.googleusercontent.com)... 74.125.204.132, 2404:6800:4008:c04::84
Connecting to 

In [None]:
!unzip -q data/rudrec_annotated/rudrec.zip -d data/rudrec_annotated/

The "otzovik_reviews_formatting.py" script splits the reviews into sentences and divides the data into train, dev, and test sets.

Use n_splits > 1 to produce a cross-validation split. n_splits=1 produces a single split into training, validation, and test sets.
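
The difference between the two modes can be sketched as follows. This is not the implementation of otzovik_reviews_formatting.py, whose splitting strategy and proportions are not shown in this notebook; the function name `make_splits` and the 10% dev and test shares are assumptions chosen for illustration.

```python
import random


def make_splits(n_items, n_splits=1, dev_share=0.1, test_share=0.1, seed=42):
    """Return a list of (train, dev, test) index lists, one triple per split."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    if n_splits == 1:
        # A single held-out test set.
        test_folds = [indices[:int(n_items * test_share)]]
    else:
        # Cross-validation: every item lands in exactly one test fold.
        test_folds = [indices[k::n_splits] for k in range(n_splits)]
    n_dev = int(n_items * dev_share)
    splits = []
    for test in test_folds:
        held_out = set(test)
        rest = [i for i in indices if i not in held_out]
        splits.append((rest[n_dev:], rest[:n_dev], test))
    return splits


single = make_splits(100, n_splits=1)
folds = make_splits(100, n_splits=5)
print(len(single), len(folds))
```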

In [None]:
%cd /content/med_project/bert_multilabel/formatting/
!python otzovik_reviews_formatting.py \
--reviews_dir=/content/data/rudrec_annotated/annotation \
--output_dir=/content/data/rudrec_annotated/sentences \
--n_splits=1
!ls /content/data/rudrec_annotated/sentences

/content/med_project/bert_multilabel/formatting
dev.csv  test.csv  train.csv


In [None]:
%cd /content
import codecs
from datetime import datetime
import os

import pandas as pd
import tensorflow as tf
import numpy as np
import modeling
import optimization
import tokenization
from bert_preprocessing import (create_examples, file_based_convert_examples_to_features,
                                convert_examples_to_features)
from multilabel_bert import (file_based_input_fn_builder, create_model, model_fn_builder,
                             input_fn_builder, create_output, predict, get_estimator,
                             train_and_evaluate)

/content



### Parameters

*Do not forget to set an appropriate BERT base dir here*

In [None]:
corpus_dir  = r"data/rudrec_annotated/sentences"
# NOT FINE-TUNED model
base_bert_dir = r"bert_models/multilingual_russian_reviews_finetuned/"
# FINE-TUNED model
# base_bert_dir = r"bert_models/RuDR_classification_finetuned/"
bert_vocab_path = os.path.join(base_bert_dir, "vocab.txt")
bert_init_chkpnt_path = os.path.join(base_bert_dir, "bert_model.ckpt")
bert_config_path = os.path.join(base_bert_dir, "bert_config.json")

batch_size = 16
num_train_epochs = 5
warmup_proportion = 0.1
max_seq_length = 128
learning_rate = 2e-5
save_summary_steps = 500
# Besides the file of test predictions, this directory
# will also contain checkpoints of fine-tuned BERT 
output_dir = r"results/"
os.makedirs(output_dir, exist_ok=True)
predicted_proba_filename = "predicted_labels.csv"

# Number of classes
NUM_LABELS = 5
# The column with this name must exist in test data.
text_column_name = 'sentences'
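
Before training, it can be worth checking that the chosen BERT directory actually contains the files referenced above. The sketch below assumes the checkpoint is stored in the usual TensorFlow layout, where the prefix bert_model.ckpt is accompanied by a bert_model.ckpt.index file; the helper name `missing_bert_files` is hypothetical.

```python
import os


def missing_bert_files(base_dir, checkpoint_name="bert_model.ckpt"):
    """List the expected model files that are absent from base_dir."""
    # Assumption: a TensorFlow checkpoint prefix comes with a ".index" file.
    expected = ["vocab.txt", "bert_config.json", checkpoint_name + ".index"]
    return [name for name in expected
            if not os.path.isfile(os.path.join(base_dir, name))]


missing = missing_bert_files("bert_models/multilingual_russian_reviews_finetuned/")
print("Missing files:", missing if missing else "none")
```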

### Training
Validation loss and per-class accuracy are saved to f"{output_dir}/eval_results.txt" (the path parameters are set in the "Parameters" section).

The first column of each CSV file must contain the text of the document. The next NUM_LABELS columns are binary (0/1) indicators of whether the document belongs to each class. test_df must have the same structure.
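
The expected layout can be checked before training with a few lines of code. The sketch below uses only the standard library; the helper name `check_layout` is hypothetical, and the sample row is a shortened version of a row from dev.csv with the review_id and sentence_id columns left out.

```python
import csv
import io

NUM_LABELS = 5


def check_layout(csv_text, num_labels=NUM_LABELS):
    """Return the label column names, or raise ValueError on a non-binary label value."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    label_columns = header[1:1 + num_labels]
    if len(label_columns) < num_labels:
        raise ValueError(f"expected {num_labels} label columns after the text column")
    for row_number, row in enumerate(reader, start=1):
        for name, value in zip(label_columns, row[1:1 + num_labels]):
            if value not in ("0", "1"):
                raise ValueError(
                    f"row {row_number}: column {name!r} has non-binary value {value!r}")
    return label_columns


sample = (
    "sentences,EF,INF,ADR,DI,Finding,annotation\n"
    "Уже после первой таблетки зуб прошел.,1,0,0,1,0,EF[5]|DI[3]\n"
)
print(check_layout(sample))
```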

In [None]:
# Change paths if needed
train_df = pd.read_csv(os.path.join(corpus_dir, "train.csv"), encoding="utf-8")
dev_df = pd.read_csv(os.path.join(corpus_dir, "dev.csv"), encoding="utf-8")

train_examples = create_examples(train_df)
eval_examples = create_examples(dev_df)
# Total number of optimizer steps over all epochs.
num_train_steps = int(len(train_examples) / batch_size * num_train_epochs)
num_warmup_steps = int(num_train_steps * warmup_proportion)
# The model is saved and evaluated once per epoch. Increase
# save_checkpoints_steps if this is too frequent.
num_steps_in_epoch = num_train_steps // num_train_epochs
save_checkpoints_steps = num_steps_in_epoch
print(f"Train dataframe, examples: {train_df.shape[0]}")
print(train_df.head())
print(f"Dev dataframe, examples: {dev_df.shape[0]}")
dev_df.head()


Train dataframe, examples: 1627
                                           sentences  ...  sentence_id
0                                  Целлюлоза и мята?  ...            4
1  Прошлогодняя статистика по заболеваемости ОРВИ...  ...            5
2  Общее впечатление : Хорошее средство ,успокоит...  ...            8
3  Она недорогая, простая в применении - всего ли...  ...            3
4  Кстати, моя родственница делала акцент, что ей...  ...            5

[5 rows x 9 columns]
Dev dataframe, examples: 181


Unnamed: 0,sentences,EF,INF,ADR,DI,Finding,annotation,review_id,sentence_id
0,Уже после первой таблетки зуб прошел.,1,0,0,1,0,EF[5]|DI[3],1086939,11
1,"Якобы они менее аллергичны, а мой ребёнок к ал...",0,0,0,0,1,Finding[3],1268796,3
2,"встала я только на утро, правда вот голова был...",0,0,1,0,0,ADR[5],594839,2
3,"Кроме гриппа, согласно инструкции, Амизон прим...",0,0,0,0,1,Finding[4],1484511,3
4,Первым было Аллокин проколола 3 укола и на 6 м...,1,0,0,1,0,EF[9]|DI[6],2653511,2


In [None]:
# Creating tokenizer
tokenizer = tokenization.FullTokenizer(
    vocab_file=bert_vocab_path, do_lower_case=True)
# Definition of estimator's config
run_config = tf.estimator.RunConfig(
    model_dir=output_dir,
    save_summary_steps=save_summary_steps,
    keep_checkpoint_max=1,
    save_checkpoints_steps=save_checkpoints_steps)
# Loading config of pretrained Bert model
bert_config = modeling.BertConfig.from_json_file(bert_config_path)

model_fn = model_fn_builder(
    bert_config=bert_config,
    num_labels=NUM_LABELS,
    init_checkpoint=bert_init_chkpnt_path,
    learning_rate=learning_rate,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=False,
    use_one_hot_embeddings=False)

estimator = get_estimator(model_fn=model_fn, run_config=run_config, batch_size=batch_size)

INFO:tensorflow:Using config: {'_model_dir': 'results/', '_tf_random_seed': None, '_save_summary_steps': 500, '_save_checkpoints_steps': 101, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 1, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fbc8c7014e0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
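
The value `_save_checkpoints_steps: 101` in the config above follows directly from the training parameters. The arithmetic can be reproduced as a small worked example, using the training set size of 1627 sentences reported earlier and the values from the "Parameters" section:

```python
num_examples = 1627        # number of rows in train.csv
batch_size = 16
num_train_epochs = 5
warmup_proportion = 0.1

# Same formulas as in the training cell.
num_train_steps = int(num_examples / batch_size * num_train_epochs)
num_warmup_steps = int(num_train_steps * warmup_proportion)
steps_per_epoch = num_train_steps // num_train_epochs

print(num_train_steps, num_warmup_steps, steps_per_epoch)  # 508 50 101
```

With these settings, a checkpoint and an evaluation run happen every 101 steps, and the learning rate warms up over the first 50 of the 508 steps.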


The next cell runs the training loop. A checkpoint is saved every save_checkpoints_steps steps (once per epoch with the settings above). Each checkpoint is followed by an evaluation on the validation set that reports the loss and the accuracy for each label (see the log lines beginning with "Saving dict for global step X: 0 = <label 0 accuracy>, 1 = <label 1 accuracy>,...").

In [None]:
tf.logging.set_verbosity(tf.logging.INFO)
eval_steps = None
train_and_evaluate(train_examples, eval_examples, max_seq_length, estimator, tokenizer, batch_size, eval_steps,
                   num_train_steps, output_dir, num_labels=NUM_LABELS)


INFO:tensorflow:***** Running training *****
INFO:tensorflow:  Num examples = 1627
INFO:tensorflow:  Batch size = 16
INFO:tensorflow:  Num steps = 508

Beginning Training!
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 101 or save_checkpoints_secs None.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.

Instructions for updating:
Use

### Evaluation


#### Initializing estimator

In [None]:
train_examples = None
num_train_steps = None
num_warmup_steps = None
save_checkpoints_steps = 1000

# Creating tokenizer
tokenizer = tokenization.FullTokenizer(
    vocab_file=bert_vocab_path, do_lower_case=True)
# Definition of estimator's config
run_config = tf.estimator.RunConfig(
    model_dir=output_dir,
    save_summary_steps=save_summary_steps,
    keep_checkpoint_max=1,
    save_checkpoints_steps=save_checkpoints_steps)
# Loading config of pretrained Bert model
bert_config = modeling.BertConfig.from_json_file(bert_config_path)

model_fn = model_fn_builder(
    bert_config=bert_config,
    num_labels=NUM_LABELS,
    init_checkpoint=bert_init_chkpnt_path,
    learning_rate=learning_rate,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=False,
    use_one_hot_embeddings=False)

estimator = get_estimator(model_fn=model_fn, run_config=run_config, batch_size=batch_size)

INFO:tensorflow:Using config: {'_model_dir': 'results/', '_tf_random_seed': None, '_save_summary_steps': 500, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 1, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fbc190f9128>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [None]:
# Change path if needed
test_df = pd.read_csv(os.path.join(corpus_dir, "test.csv"), encoding="utf-8")


In [None]:
label_names = {"p_label_1" : "EF", "p_label_2" : "INF", "p_label_3" : "ADR", "p_label_4" : "DI", "p_label_5" : "Finding"}

In [None]:
output_df = predict(test_df, estimator, tokenizer, max_seq_length, num_labels=NUM_LABELS)

# Append the predicted probabilities to the test data as extra columns.
resulting_df = pd.concat([test_df, output_df], axis=1)
# Saved before the rename, so the CSV keeps the column names returned by predict().
resulting_df.to_csv(os.path.join(output_dir, predicted_proba_filename), index=False)
resulting_df.rename(columns=label_names, inplace=True)
resulting_df.head()

Beginning Predictions!
Prediction took time  0:00:00.000366
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:num_labels:5;logits:Tensor("loss/BiasAdd:0", shape=(?, 5), dtype=float32);labels:Tensor("loss/Cast:0", shape=(?, 5), dtype=float32)
INFO:tensorflow:**** Trainable Variables ****
mode: infer probabilities: Tensor("loss/Sigmoid:0", shape=(?, 5), dtype=float32)
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from results/model.ckpt-508
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


Unnamed: 0,sentences,EF,INF,ADR,DI,Finding,annotation,review_id,sentence_id,EF.1,INF.1,ADR.1,DI.1,Finding.1
0,Но ребенок заболевал не зависимо от того прини...,0,1,0,0,0,INF[4],273783,7,0.015046,0.084106,0.011273,0.009133,0.016036
1,И никуда моя тревожность не исчезла.,0,1,0,1,0,INF[2]|DI[3],2403676,4,0.062887,0.972314,0.036536,0.924222,0.072481
2,"А позже выяснилось, что у ребёнка просто резал...",0,0,0,1,0,DI[5],624086,9,0.020808,0.011523,0.026656,0.984198,0.045432
3,"Так нас продержали на сиропчиках, по типу брон...",0,0,0,1,0,DI[2],2533930,10,0.004549,0.07405,0.256111,0.956854,0.155523
4,"Так вот матирующего эффекта я не заметила, и с...",0,1,0,1,0,INF[2]|DI[1],2523468,3,0.790932,0.869061,0.022939,0.97073,0.059775


In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

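METRICS = {"Precision": precision_score, "Recall": recall_score,
           "F-score": f1_score}
# Probabilities at or above the threshold count as positive predictions.
threshold = 0.5
# Each label is scored as a separate binary problem with 1 as the positive class.
average = 'binary'
pos_label = 1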

In [None]:
predicted_probs_pos_end = resulting_df.shape[1]
predicted_probs_pos_start = predicted_probs_pos_end - NUM_LABELS
columns = resulting_df.columns
labels = columns[1: 1 + NUM_LABELS]
results_numpy = resulting_df.values.transpose()
all_true_labels = results_numpy[1: 1 + NUM_LABELS].astype(int)
all_pred_probs = results_numpy[predicted_probs_pos_start: predicted_probs_pos_end]
all_pred_labels = (all_pred_probs >= threshold).astype(int)
for i in range(NUM_LABELS):
    class_true_labels = all_true_labels[i]
    class_pred_labels = all_pred_labels[i]
    label_name = labels[i]
    print(i, label_name)
    for metric_name, metric in METRICS.items():
        score = metric(y_true=class_true_labels, y_pred=class_pred_labels, average=average, pos_label=pos_label)
        print(f"\t{metric_name} : {score}")

0 EF
	Precision : 0.8157894736842105
	Recall : 0.7560975609756098
	F-score : 0.7848101265822786
1 INF
	Precision : 0.7894736842105263
	Recall : 0.7377049180327869
	F-score : 0.7627118644067797
2 ADR
	Precision : 0.7868852459016393
	Recall : 0.7272727272727273
	F-score : 0.7559055118110236
3 DI
	Precision : 0.822429906542056
	Recall : 0.9072164948453608
	F-score : 0.8627450980392157
4 Finding
	Precision : 0.5555555555555556
	Recall : 0.2564102564102564
	F-score : 0.3508771929824561
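
Each label is scored as a separate binary problem: a probability at or above the threshold counts as a positive prediction, and precision, recall, and F-score are computed from the resulting counts. The sketch below shows the computation on toy values; the function name `binary_scores` and the data are illustrative and are not part of the notebook or of scikit-learn.

```python
def binary_scores(y_true, y_prob, threshold=0.5):
    """Compute precision, recall, and F1 for one label from probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example: two true positives, one false positive, one false negative.
toy_true = [1, 0, 1, 1, 0]
toy_prob = [0.9, 0.6, 0.4, 0.8, 0.1]
print(binary_scores(toy_true, toy_prob))
```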


#### Preprocessing of new data and prediction of labels

In [None]:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')


def tokenize_sentences(text, language='russian', text_column=text_column_name):
    """Split text into sentences and return them as a one-column DataFrame."""
    sentences = sent_tokenize(text, language)
    return pd.DataFrame(sentences, columns=[text_column])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


You can upload your texts, tokenize them, and predict label probabilities for them.

In [None]:
text = "Старший ребенок часто болел, ходил в детский сад неделю, максимум две, затем на больничный. " \
"После приема действительно становилось легче, и горло меньше болело и нос не так закладывало. " \
"Могу сказать, что временное облегчение он обеспечивает, и на вкус довольно приятным оказался, даже вкусным."
sentences_df = tokenize_sentences(text, language='russian', )
sentences_df.head()

Unnamed: 0,sentences
0,"Старший ребенок часто болел, ходил в детский с..."
1,"После приема действительно становилось легче, ..."
2,"Могу сказать, что временное облегчение он обес..."


In [None]:
decision_threshold = 0.5
# Predict per-label probabilities for the new sentences.
predicted_probabilities = predict(sentences_df, estimator, tokenizer, max_seq_length, num_labels=NUM_LABELS)
predicted_probabilities.rename(columns=label_names, inplace=True)

# Show the sentences next to their predicted probabilities.
pd.concat([sentences_df, predicted_probabilities], axis=1).head()

Beginning Predictions!
Prediction took time  0:00:00.000020
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:num_labels:5;logits:Tensor("loss/BiasAdd:0", shape=(?, 5), dtype=float32);labels:Tensor("loss/Cast:0", shape=(?, 5), dtype=float32)
INFO:tensorflow:**** Trainable Variables ****
mode: infer probabilities: Tensor("loss/Sigmoid:0", shape=(?, 5), dtype=float32)
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from results/model.ckpt-508
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


Unnamed: 0,sentences,EF,INF,ADR,DI,Finding
0,"Старший ребенок часто болел, ходил в детский с...",0.004685,0.009316,0.009434,0.28573,0.024863
1,"После приема действительно становилось легче, ...",0.983669,0.058321,0.063091,0.959294,0.044858
2,"Могу сказать, что временное облегчение он обес...",0.978817,0.044704,0.042769,0.043631,0.031108


In [None]:
predicted_labels = predicted_probabilities.applymap(lambda x: 1 if x >= decision_threshold else 0)
resulting_df = pd.concat([sentences_df, predicted_labels], axis=1)

resulting_df.head()

Unnamed: 0,sentences,EF,INF,ADR,DI,Finding
0,"Старший ребенок часто болел, ходил в детский с...",0,0,0,0,0
1,"После приема действительно становилось легче, ...",1,0,0,1,0
2,"Могу сказать, что временное облегчение он обес...",1,0,0,0,0
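
The choice of decision threshold affects which labels are assigned. As a small illustration, the sketch below reuses the three rows of predicted probabilities shown earlier and compares the default threshold of 0.5 with a lower, arbitrarily chosen threshold of 0.25; the helper name `assigned_labels` is hypothetical.

```python
LABELS = ["EF", "INF", "ADR", "DI", "Finding"]

# Probabilities copied from the prediction table, one row per sentence.
probabilities = [
    [0.004685, 0.009316, 0.009434, 0.28573, 0.024863],
    [0.983669, 0.058321, 0.063091, 0.959294, 0.044858],
    [0.978817, 0.044704, 0.042769, 0.043631, 0.031108],
]


def assigned_labels(row, threshold):
    """Return the names of the labels whose probability reaches the threshold."""
    return [name for name, p in zip(LABELS, row) if p >= threshold]


for threshold in (0.5, 0.25):
    print(threshold, [assigned_labels(row, threshold) for row in probabilities])
```

With the lower threshold, the first sentence additionally receives the DI label, which shows how lowering the threshold trades precision for recall.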


#### Writing predictions to JSON file

In [None]:
output_json_path = r"results/predicted_labels.json"
resulting_df.to_json(output_json_path, orient='records',)
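
With orient='records', the file holds a JSON array with one object per sentence. By default, pandas escapes non-ASCII characters (force_ascii=True), so the Cyrillic text appears as \uXXXX sequences in the file but is restored when the file is parsed. The sketch below mimics this with the standard json module, whose default ensure_ascii=True behaves the same way; the record reuses the third example sentence and its predicted labels.

```python
import json

record = {
    "sentences": "Могу сказать, что временное облегчение он обеспечивает, "
                 "и на вкус довольно приятным оказался, даже вкусным.",
    "EF": 1, "INF": 0, "ADR": 0, "DI": 0, "Finding": 0,
}

# One JSON array of row objects, with non-ASCII characters escaped.
serialized = json.dumps([record])
print(serialized[:60])

# Parsing the text restores the original Cyrillic string.
restored = json.loads(serialized)
print(restored[0]["sentences"])
```

If you prefer a human-readable file, passing force_ascii=False to to_json keeps the Cyrillic characters as they are.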

If you use Google Colab, you can download predictions by uncommenting and running the following lines: 

In [None]:
# from google.colab import files
# files.download(output_json_path) 