DistilBERT is a Transformer model, smaller and faster than BERT, that was pretrained on the same corpus in a self-supervised fashion using the BERT base model as a teacher. It was therefore pretrained on raw text only, with no human labelling of any kind (which is why it can use large amounts of publicly available data); an automatic process generates the inputs and labels from those texts using the BERT base model.
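
To make the teacher/student idea concrete, here is a minimal sketch of a soft-target distillation loss: the student is penalized for diverging from the teacher's temperature-softened output distribution. The logits, the three-word vocabulary, and the temperature of 2.0 are hypothetical values chosen for illustration, and this covers only the soft-target term. The DistilBERT paper combines it with a masked language modeling loss and a cosine embedding loss, which are not shown here.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Hypothetical logits over a 3-word vocabulary (illustration only).
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]   # nearly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # prefers a different word

loss_close = soft_target_loss(close_student, teacher)
loss_far = soft_target_loss(far_student, teacher)
print(loss_close < loss_far)  # the student that mimics the teacher gets the lower loss
```

The labels here come from the teacher's predictions rather than from human annotation, which is the sense in which the pretraining stays self-supervised.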

In [None]:
!pip install ktrain



In [None]:
!git clone https://github.com/laxmimerit/IMDB-Movie-Reviews-Large-Dataset-50k.git

fatal: destination path 'IMDB-Movie-Reviews-Large-Dataset-50k' already exists and is not an empty directory.


In [None]:
import ktrain
from ktrain import text
import numpy as np
import pandas as pd
import tensorflow as tf
from google.colab import drive


In [None]:
data_test = pd.read_excel('/content/IMDB-Movie-Reviews-Large-Dataset-50k/test.xlsx', dtype=str)

In [None]:
data_train = pd.read_excel('/content/IMDB-Movie-Reviews-Large-Dataset-50k/train.xlsx', dtype=str)

In [None]:
data_train.sample(7)

Unnamed: 0,Reviews,Sentiment
3106,Malefique pretty much has the viewer from star...,pos
14693,<br /><br />I'm sure things didn't exactly go ...,pos
20481,This complete mess of a movie was directed by ...,neg
11757,A comparison between this movie and 'The Last ...,neg
7859,Young and attractive Japanese people are getti...,neg
17002,I am very open to foreign films and like to th...,neg
6352,This is a decent little flick made in Michigan...,pos


In [None]:
text.print_text_classifiers()

fasttext: a fastText-like model [http://arxiv.org/pdf/1607.01759.pdf]
logreg: logistic regression using a trainable Embedding layer
nbsvm: NBSVM model [http://www.aclweb.org/anthology/P12-2018]
bigru: Bidirectional GRU with pretrained fasttext word vectors [https://fasttext.cc/docs/en/crawl-vectors.html]
standard_gru: simple 2-layer GRU with randomly initialized embeddings
bert: Bidirectional Encoder Representations from Transformers (BERT) from keras_bert [https://arxiv.org/abs/1810.04805]
distilbert: distilled, smaller, and faster BERT from Hugging Face transformers [https://arxiv.org/abs/1910.01108]


In [None]:
(train, val, preproc) = text.texts_from_df(train_df=data_train,
                                           text_column='Reviews',
                                           label_columns='Sentiment',
                                           val_df=data_test,
                                           maxlen=400,
                                           preprocess_mode='distilbert')

['neg', 'pos']
   neg  pos
0  1.0  0.0
1  1.0  0.0
2  1.0  0.0
3  1.0  0.0
4  1.0  0.0
['neg', 'pos']
   neg  pos
0  0.0  1.0
1  0.0  1.0
2  1.0  0.0
3  0.0  1.0
4  1.0  0.0
preprocessing train...
language: en
train sequence lengths:
	mean : 234
	95percentile : 598
	99percentile : 913


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 234
	95percentile : 598
	99percentile : 913
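
The log above reports a 95th percentile of 598 while `maxlen` is 400, so the longest reviews are cut off before they reach the model. The sketch below shows one way to estimate how many texts a given `maxlen` would truncate. The helper names `percentile` and `length_report`, the toy corpus, and the use of whitespace splitting are all assumptions made for illustration; the DistilBERT tokenizer produces subword tokens, so real counts will differ.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list (pct is an int)."""
    rank = max(1, (pct * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]

def length_report(texts, maxlen):
    """Summarize whitespace-token lengths and the share of texts longer than maxlen."""
    lengths = sorted(len(t.split()) for t in texts)
    return {
        "mean": sum(lengths) / len(lengths),
        "p95": percentile(lengths, 95),
        "p99": percentile(lengths, 99),
        "share_truncated": sum(n > maxlen for n in lengths) / len(lengths),
    }

# Hypothetical toy corpus: 100 "reviews" containing 1, 2, ..., 100 words.
toy_reviews = [" ".join(["word"] * n) for n in range(1, 101)]
report = length_report(toy_reviews, maxlen=90)
print(report)
```

On the real data, the same call could be made with `data_train['Reviews']` and `maxlen=400` to weigh sequence length against memory use and training time.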


In [None]:
model = text.text_classifier(name='distilbert', train_data=train, preproc=preproc)

Is Multi-Label? False
maxlen is 400
done.


In [None]:
learner = ktrain.get_learner(model=model,
                             train_data=train,
                             val_data=val,
                             batch_size=6)

In [None]:
learner.fit_onecycle(lr = 2e-5, epochs=2)



begin training using onecycle policy with max lr of 2e-05...
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f62789b3b10>
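
For intuition about what a one-cycle policy does with `lr = 2e-5`, the sketch below implements a generic triangular schedule: the learning rate rises linearly to the maximum over the first half of training and falls back over the second half. This is not ktrain's exact implementation. The function `one_cycle_lr`, the `div_factor` of 10, and the training-set size of 25,000 are assumptions for illustration; only the batch size of 6, the 2 epochs, and the maximum rate come from the cells above.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Triangular one-cycle schedule: linear warm-up to max_lr, then linear decay."""
    base_lr = max_lr / div_factor      # starting and ending learning rate (assumed ratio)
    peak = total_steps / 2             # the maximum is reached halfway through training
    if step <= peak:
        frac = step / peak
    else:
        frac = (total_steps - step) / peak
    return base_lr + (max_lr - base_lr) * frac

n_train = 25_000                               # hypothetical training-set size
steps_per_epoch = math.ceil(n_train / 6)       # batch_size = 6, as in the learner above
total_steps = steps_per_epoch * 2              # epochs = 2

for step in (0, total_steps // 4, total_steps // 2, total_steps):
    print(step, one_cycle_lr(step, total_steps, max_lr=2e-5))
```

The rate passed to `fit_onecycle` is therefore the peak of the schedule, not a constant used for every step.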

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
predictor.save('/content/drive/My Drive/distilbert')

In [None]:
data = ['this movie was much better than I expected. storyline was really well written',
        'the movie was straight trash. I would rather watch paint dry than watch 5 mins of this movie again']

In [None]:
predictor.predict(data)



['pos', 'neg']

In [None]:
predictor.get_classes()

['neg', 'pos']

In [None]:
predictor.predict(data, return_proba=True)



array([[0.02058594, 0.97941405],
       [0.99627715, 0.00372279]], dtype=float32)
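
The columns of the probability array follow the order returned by `predictor.get_classes()`, so column 0 is `neg` and column 1 is `pos`. The short sketch below pairs each row with its most likely class, using the values printed above copied into plain Python lists. The helper `top_label` is a hypothetical name introduced for illustration.

```python
classes = ['neg', 'pos']  # order returned by predictor.get_classes()
probabilities = [
    [0.02058594, 0.97941405],
    [0.99627715, 0.00372279],
]

def top_label(row, class_names):
    """Return the most likely class name and its probability for one row."""
    best_index = max(range(len(row)), key=row.__getitem__)
    return class_names[best_index], row[best_index]

decoded = [top_label(row, classes) for row in probabilities]
for label, confidence in decoded:
    print(f"{label}: {confidence:.3f}")
# pos: 0.979
# neg: 0.996
```

The decoded labels match the output of `predictor.predict(data)`, which is a quick check that the class order is being read correctly.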