In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [2]:
import ktrain
from ktrain import text

Using TensorFlow backend.


using Keras version: 2.2.4


# Building a Chinese-Language Sentiment Analyzer

In this notebook, we will build a Chinese-language text classification model in 3 simple steps. More specifically, we will build a model that classifies Chinese hotel reviews as either positive or negative.

(**Disclaimer:** I don't speak a word of Chinese. Please forgive mistakes.)  



## STEP 1:  Load and Preprocess the Data

First, we use the `texts_from_folder` function to load and preprocess the data.  We assume that the data is in the following form:
```
    ├── datadir
    │   ├── train
    │   │   ├── class0       # folder containing documents of class 0
    │   │   ├── class1       # folder containing documents of class 1
    │   │   ├── class2       # folder containing documents of class 2
    │   │   └── classN       # folder containing documents of class N
```
We set `val_pct=0.1`, which automatically holds out a random 10% of the data for validation.  Since we will be using a pretrained BERT model for classification, we specify `preprocess_mode='bert'`.  If you are using any other model (e.g., `fasttext`), you should either omit this parameter or use `preprocess_mode='standard'`.
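
Before loading, it can help to confirm that the folder layout matches what is described above. The following is a minimal sketch using only the standard library; `count_documents` is a hypothetical helper (not part of *ktrain*), and the throwaway `pos`/`neg` folders are created purely for illustration.

```python
import os
import tempfile


def count_documents(train_dir):
    """Return {class_name: number_of_files} for each class subfolder of train_dir."""
    counts = {}
    for name in sorted(os.listdir(train_dir)):
        class_dir = os.path.join(train_dir, name)
        if os.path.isdir(class_dir):
            counts[name] = sum(
                1 for f in os.listdir(class_dir)
                if os.path.isfile(os.path.join(class_dir, f))
            )
    return counts


# Build a tiny throwaway dataset so the example runs anywhere.
# The folder names 'pos' and 'neg' mirror the classes used in this notebook.
demo_root = tempfile.mkdtemp()
for label, n_docs in [("pos", 3), ("neg", 2)]:
    class_dir = os.path.join(demo_root, "train", label)
    os.makedirs(class_dir)
    for i in range(n_docs):
        with open(os.path.join(class_dir, f"{i}.txt"), "w", encoding="utf-8") as fh:
            fh.write("示例评论")

demo_counts = count_documents(os.path.join(demo_root, "train"))
print(demo_counts)
```

Pointing `count_documents` at the real `train` folder should list one entry per class, which is a quick way to spot a misnamed or empty class folder.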

**Notice that there is nothing special or extra we need to do here for non-English text.**  *ktrain* automatically detects the language and character encoding, prepares the data, and configures the model appropriately.
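
To see why the character encoding matters, here is a small standard-library sketch that decodes raw bytes by trying candidate encodings in order. This is not how *ktrain* detects encodings; `decode_with_fallback` is a hypothetical helper and the sample sentence is made up for illustration.

```python
def decode_with_fallback(raw_bytes, encodings=("utf-8", "gb18030")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw_bytes.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


# A made-up review ("the hotel's service is very good"), stored as GB18030 bytes.
review = "酒店的服务很好"
raw = review.encode("gb18030")

# UTF-8 decoding fails on these bytes, so the helper falls through to GB18030.
decoded_text, used_encoding = decode_with_fallback(raw)
print(used_encoding, decoded_text)
```

Reading GB18030 files as UTF-8 would either raise an error or produce garbage, which is why the detected encoding is reported in the output with a reminder to set it manually if it is wrong.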



In [3]:
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder('data/ChnSentiCorp_htl_ba_6000', 
                                                                       maxlen=75, 
                                                                       max_features=30000,
                                                                       preprocess_mode='bert',
                                                                       train_test_names=['train'],
                                                                       val_pct=0.1,
                                                                       classes=['pos', 'neg'])

detected encoding: GB18030 (if wrong, set manually)
downloading pretrained BERT model and vocabulary...
[██████████████████████████████████████████████████]
extracting pretrained BERT model and vocabulary...
done.

cleanup downloaded zip...
done.



Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache


preprocessing train...
language: zh-cn (if wrong, set manually)


Loading model cost 0.644 seconds.
Prefix dict has been built succesfully.


preprocessing test...
language: zh-cn (if wrong, set manually)


## STEP 2:  Create a Model and Wrap in Learner Object

In [4]:
model = text.text_classifier('bert', (x_train, y_train) , preproc=preproc)
learner = ktrain.get_learner(model, 
                             train_data=(x_train, y_train), 
                             val_data=(x_test, y_test), 
                             batch_size=32)

Is Multi-Label? False
maxlen is 75
done.


## STEP 3: Train the Model

We will use the `autofit` method, which employs a triangular learning rate policy, to train for three epochs with a maximum learning rate of 2e-5.
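
For intuition, a triangular policy ramps the learning rate up linearly to a peak and then back down. The sketch below shows one generic triangular cycle in plain Python. It is an assumption-laden illustration, not *ktrain*'s actual schedule: `triangular_lr` is a hypothetical function, and the real cycle length, minimum rate, and number of cycles may differ.

```python
def triangular_lr(step, total_steps, max_lr, min_lr=0.0):
    """Learning rate at `step` (0-based) for a single triangular cycle.

    The rate rises linearly from min_lr to max_lr at the midpoint,
    then falls linearly back to min_lr at the last step.
    """
    if total_steps < 2:
        return max_lr
    half = (total_steps - 1) / 2.0
    distance_from_peak = abs(step - half) / half  # 1.0 at the ends, 0.0 at the peak
    return min_lr + (max_lr - min_lr) * (1.0 - distance_from_peak)


# Five steps with the same peak rate used in this notebook (2e-5).
schedule = [triangular_lr(s, 5, 2e-5) for s in range(5)]
print(schedule)
```

The peak value plays the role of the first argument passed to `autofit` in the next cell.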

In [5]:
learner.autofit(2e-5, 3)



begin training using triangular learning rate policy with max lr of 2e-05...
Train on 5324 samples, validate on 592 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7ff8bbf915f8>

### Inspecting the Misclassifications

In [6]:
learner.view_top_losses(n=1, preproc=preproc)

----------
id:252 | loss:5.53 | true:pos | pred:neg)

[CLS] 这 里 的 早 餐 是 我 看 到 的 最 差 的 一 个 , 基 本 上 没 什 么 吃 的 , 就 看 到 服 务 员 在 不 听 的 加 白 粥 , 下 次 在 来 我 是 不 会 住 在 这 里 的 [SEP]


Using Google Translate, the above roughly translates to:
```
The breakfast here is the worst one I have ever seen. Basically, I have nothing to eat. I can see that the waiter is not listening to the white porridge. I will not live here next time.
```

Mistranslations aside, this is clearly a negative review.  The model's prediction (`neg`) is correct; it is the ground-truth label in the dataset (`pos`) that is wrong.
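
The idea of ranking examples by loss can be sketched in a few lines. The code below assumes the reported loss is the standard cross-entropy, i.e. the negative log of the probability assigned to the true class; `top_losses` is a hypothetical helper, not the *ktrain* implementation, and the probabilities are made up.

```python
import math


def top_losses(probs, true_idx, n=1):
    """Return the n (example_id, loss) pairs with the highest cross-entropy loss.

    probs: list of per-class probability lists, one per example.
    true_idx: list with the true class index of each example.
    """
    eps = 1e-12  # guards against log(0)
    losses = [
        (i, -math.log(max(p[t], eps)))
        for i, (p, t) in enumerate(zip(probs, true_idx))
    ]
    return sorted(losses, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up probabilities over the classes [neg, pos] for three examples
# (illustrative values, not outputs of the model trained above).
demo_probs = [[0.10, 0.90], [0.996, 0.004], [0.30, 0.70]]
demo_true = [1, 1, 0]  # class index 1 = pos, 0 = neg (an assumption of this demo)

worst = top_losses(demo_probs, demo_true, n=1)
print(worst)
```

Under this assumption, a loss of about 5.5 corresponds to the model assigning roughly 0.4% probability to the labeled class, so a confident and correct model will rank mislabeled examples like the one above near the top.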

### Making Predictions on New Data

In [8]:
p = ktrain.get_predictor(learner.model, preproc)

Predicting the label for the text:
> "*I despise the service of this hotel.*"

In [9]:
p.predict("我鄙视这家酒店的服务。")

'neg'

Predicting the label for the text:
> "*I like the service of this hotel.*"

In [10]:
p.predict('我喜欢这家酒店的服务')

'pos'