
# Bidirectional Encoder Representation from Transformer (BERT)
This is a notebook of using BERT to distinguish whether a Reddit comment under the Bitcoin subreddit is posted in June 2016 or Dec 2017, where Bitcoin's price was stabilized in June 2016 and extremely volatile in Dec 2017. 

Ktrain, a lightweight wrapper for Keras, is used in this notebook. It allows users to train and deploy BERT easily. 

DO NOT RUN THIS NOTEBOOK LOCALLY UNLESS YOUR LOCAL MACHINE CONTAINS A NVIDIA GPU BETTER OR EQUIVALENT TO A GTX1060. 

If you still want to run this notebook, you can upload this notebook to Google Colab. Google provides a free-to-use GPU (and TPU as well). 

In [1]:
# Create file structure 
!sudo apt-get install tree
!mkdir -p reddit_bitcoin_data/train/dec2017
!mkdir -p reddit_bitcoin_data/train/jun2016
!mkdir -p reddit_bitcoin_data/test/dec2017
!mkdir -p reddit_bitcoin_data/test/jun2016
# !tree --filelimit 10

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  tree
0 upgraded, 1 newly installed, 0 to remove and 35 not upgraded.
Need to get 40.7 kB of archives.
After this operation, 105 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tree amd64 1.7.0-5 [40.7 kB]
Fetched 40.7 kB in 1s (33.5 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package tree.
(Reading database ... 132684 files and directories currently instal

In [0]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
np.random.seed(seed=1)

In [3]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
# https://cloud.google.com/resource-manager/docs/creating-managing-projects
project_id = 'caramel-pager-156904' #Put your project ID here
jun2016 = pd.io.gbq.read_gbq('''
SELECT body FROM `fh-bigquery.reddit_comments.2016_06`
where subreddit = 'Bitcoin' and length(body)>100
''', project_id=project_id, verbose=False)

dec2017 = pd.io.gbq.read_gbq('''
SELECT body FROM `fh-bigquery.reddit_comments.2017_12`
where subreddit = 'Bitcoin' and length(body)>100
''', project_id=project_id, verbose=False)

Authenticated


  **kwargs


In [4]:
# Check how many data we have
print(len(dec2017))
print(len(jun2016))

258095
32972


In [0]:
# Make balance dataset
bal2017 = np.random.choice(range(0,len(dec2017)),32972, replace=False)
dec2017_bal = dec2017.iloc[bal2017]
dec2017_bal.reset_index(drop=True, inplace=True)
jun2016_bal = jun2016

In [0]:
# Seperate train test dataset
all_id = np.array(range(0,len(dec2017_bal)))
train_id = np.random.choice(range(0,len(dec2017_bal)),round(len(dec2017_bal)*0.8), replace=False)
test_id = np.random.choice(np.delete(all_id,train_id), round(len(dec2017_bal)*0.2), replace=True)

In [0]:
# Write files according to the ktrain package requirment
def write_txt_files(T='train', D='dec2017', pd2write = dec2017_bal, id_ = train_id):
  folder = './reddit_bitcoin_data/%s/%s' % (T,D)
  for id in tqdm(id_):
    filename = str(id)+'.txt'
    with open(os.path.join(folder,filename),'w') as outfile:
        pd2write.iloc[id].to_string(outfile)

In [8]:
write_txt_files(T='train', D='dec2017', pd2write = dec2017_bal, id_ = train_id)
write_txt_files(T='test', D='dec2017', pd2write = dec2017_bal, id_ = test_id)
write_txt_files(T='train', D='jun2016', pd2write = jun2016_bal, id_ = train_id)
write_txt_files(T='test', D='jun2016', pd2write = jun2016_bal, id_ = test_id)

100%|██████████| 26378/26378 [00:19<00:00, 1376.07it/s]
100%|██████████| 6594/6594 [00:05<00:00, 1275.18it/s]
100%|██████████| 26378/26378 [00:19<00:00, 1383.98it/s]
100%|██████████| 6594/6594 [00:04<00:00, 1340.36it/s]


In [9]:
!pip3 install ktrain
import ktrain
from ktrain import text

Collecting ktrain
[?25l  Downloading https://files.pythonhosted.org/packages/08/dd/74108dc5359663524bb340f9ff7b10f18ab3d6f932fc0456e9291aff1e74/ktrain-0.6.0.tar.gz (173kB)
[K     |████████████████████████████████| 174kB 2.8MB/s 
[?25hCollecting keras==2.2.4
[?25l  Downloading https://files.pythonhosted.org/packages/5e/10/aa32dad071ce52b5502266b5c659451cfd6ffcbf14e6c8c4f16c0ff5aaab/Keras-2.2.4-py2.py3-none-any.whl (312kB)
[K     |████████████████████████████████| 317kB 41.0MB/s 
Collecting keras_bert
  Downloading https://files.pythonhosted.org/packages/df/fe/bf46de1ef9d1395cd735d8df5402f5d837ef82cfd348a252ad8f32feeaef/keras-bert-0.80.0.tar.gz
Collecting eli5>=0.10.0
[?25l  Downloading https://files.pythonhosted.org/packages/97/2f/c85c7d8f8548e460829971785347e14e45fa5c6617da374711dec8cb38cc/eli5-0.10.1-py2.py3-none-any.whl (105kB)
[K     |████████████████████████████████| 112kB 30.1MB/s 
[?25hCollecting seqeval
  Downloading https://files.pythonhosted.org/packages/34/91/068aca8d

Using TensorFlow backend.


using Keras version: 2.2.4


In [10]:
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder('./reddit_bitcoin_data/', 
                                                                      maxlen=500, 
                                                                       preprocess_mode='bert',
                                                                       train_test_names=['train', 
                                                                                         'test'],
                                                                       classes=['dec2017', 'jun2016'])

detected encoding: utf-8
downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


preprocessing test...
language: en


In [11]:
model = text.text_classifier('bert', (x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model,train_data=(x_train, y_train), val_data=(x_test, y_test), batch_size=6)

Is Multi-Label? False
maxlen is 500
done.


In [17]:
# To get a trained model

# For Unix based OS (e.g. Mac, Linux, Ubuntu etc.)
!wget -O mymodel https://box.hu-berlin.de/f/1760795d78e141c7aa58/?dl=1

# For windows machine (slower)
# import urllib.request
# url = 'https://box.hu-berlin.de/f/1760795d78e141c7aa58/?dl=1'
# urllib.request.urlretrieve(url, "mymodel")

--2019-11-13 12:39:39--  https://box.hu-berlin.de/f/1760795d78e141c7aa58/?dl=1
Resolving box.hu-berlin.de (box.hu-berlin.de)... 141.20.184.42
Connecting to box.hu-berlin.de (box.hu-berlin.de)|141.20.184.42|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://box.hu-berlin.de/seafhttp/files/6a169b90-41c0-4527-9acc-cd615e7cb80c/mymodel [following]
--2019-11-13 12:39:42--  https://box.hu-berlin.de/seafhttp/files/6a169b90-41c0-4527-9acc-cd615e7cb80c/mymodel
Reusing existing connection to box.hu-berlin.de:443.
HTTP request sent, awaiting response... 200 OK
Length: 1314246128 (1.2G) [application/octet-stream]
Saving to: ‘mymodel’


2019-11-13 13:01:23 (986 KB/s) - ‘mymodel’ saved [1314246128/1314246128]



In [0]:
learner.load_model('mymodel')

In [19]:
!ls

 adc.json	   'index.html?dl=1.1'	 reddit_bitcoin_data
'index.html?dl=1'   mymodel		 sample_data


In [0]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [21]:
learner.validate(val_data=(x_test, y_test))

              precision    recall  f1-score   support

           0       0.89      0.60      0.72      4160
           1       0.70      0.93      0.80      4160

    accuracy                           0.76      8320
   macro avg       0.79      0.76      0.76      8320
weighted avg       0.79      0.76      0.76      8320



array([[2495, 1665],
       [ 309, 3851]])

In [0]:
# DON'T RUN THIS IF YOU DO NOT HAVE A GPU MACHINE
for e in range(2):
  learner.fit(2e-5, 1)
  learner.save_model('mymodel')

Train on 52756 samples, validate on 8320 samples
Epoch 1/1
Train on 52756 samples, validate on 8320 samples
Epoch 1/1

In [22]:
text1 = """Actually price of electricity keeps bitcoin out of the hands
            of globalists. Cheap electricity is found in outlying, underdeveloped
            areas that lack significant industrial, commercial, residential,
            or agricultural demand for electricity proportionate to supply.
            \n\nhttps://en.m.wikipedia.org/wiki/Electricity_pricing#
"""
predictor.explain(text1)

Contribution?,Feature
2.482,Highlighted in text (sum)
0.587,<BIAS>


If you run the code correctly, you will get the below result
<img src="text1.jpg">

In [23]:
text2 = """I have money in binance that I pretty much just leave as btc and if
           I see something taking off I buy in and then sell before bed. 
           Has been working out pretty well so far. It's all most likely 
           going to end up in Ripple and trx at some point tho.
"""
predictor.explain(text2)

Contribution?,Feature
5.7,Highlighted in text (sum)
-0.027,<BIAS>


If you run the code correctly, you will get the below result
<img src="text2.jpg">