# Case Classification with BERT (Train Current Model)

This notebook applies BERT to binary text classification. The dataset consists of messages, and each message is labeled with one of two categories: `normal` or `attention`.

## Workflow: 
1. Import Data
2. Data preprocessing and downloading BERT
3. Training and validation
4. Saving the model


👋  **Let's start** 

In [1]:
from google.colab import drive
drive.mount('/content/drive') 


Mounted at /content/drive


In [2]:
# install ktrain on Google Colab
!pip3 install ktrain
# the PyPI package is named scikit-learn; "sklearn" is only the import name
!pip3 install scikit-learn

Collecting ktrain
  Downloading https://files.pythonhosted.org/packages/4c/88/10d29578f47d0d140bf669d5598e9f5a50465ddc423b32031c65e840d003/ktrain-0.26.3.tar.gz (25.3MB)
Collecting scikit-learn==0.23.2
  Downloading https://files.pythonhosted.org/packages/f4/cb/64623369f348e9bfb29ff898a57ac7c91ed4921f228e9726546614d63ccb/scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl (6.8MB)
Collecting langdetect
  Downloading https://files.pythonhosted.org/packages/0e/72/a3add0e4eec4eb9e2569554f7c70f4a3c27712f40e3284d483e88094cc0e/langdetect-1.0.9.tar.gz (981kB)
Collecting cchardet
  Downloading https://files.pythonhosted.org/packages/80/72/a4fba7559978de00cf44081c548c5d294bf00ac7dcda2db405d2baa8c67a/cchardet-2.1.7-cp37-cp37m-manylinux2010_x86_64.whl (263kB)

In [3]:
import pandas as pd
import numpy as np

import ktrain
from ktrain import text

## Train using BERT

In [4]:
# load the train and test splits from the mounted Google Drive
data_train = pd.read_csv("/content/drive/My Drive/Data/nlpdata/gen_data/data_train.csv", encoding='utf-8')
data_test = pd.read_csv("/content/drive/My Drive/Data/nlpdata/gen_data/data_test.csv", encoding='utf-8')


X_train = data_train.case.tolist()
X_test = data_test.case.tolist()

y_train = data_train.label.tolist()
y_test = data_test.label.tolist()


In [5]:
encoding = {
    'normal': 0,
    'attention': 1,
}
# Integer values for each class
y_train = [encoding[x] for x in y_train]
y_test = [encoding[x] for x in y_test]
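
The dictionary lookup above raises a bare `KeyError` on the first label that is not in the mapping, for example a misspelling such as `attenion`. The following is a minimal sketch of a stricter encoder that reports every unexpected label at once. The names `ENCODING` and `encode_labels` are hypothetical helpers introduced for illustration and are not part of the original notebook.

```python
# Hypothetical helper: same mapping as the notebook, with an explicit check.
ENCODING = {"normal": 0, "attention": 1}


def encode_labels(labels, encoding=ENCODING):
    """Map string labels to integers, failing loudly on unknown labels."""
    # Collect every label that is missing from the mapping.
    unknown = sorted({label for label in labels if label not in encoding})
    if unknown:
        raise ValueError(f"unexpected labels: {unknown}")
    return [encoding[label] for label in labels]


print(encode_labels(["normal", "attention", "normal"]))  # [0, 1, 0]
```

In the notebook this could replace the two list comprehensions, for example `y_train = encode_labels(y_train)`.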


In [6]:
class_names = ['normal', 'attention']
# tokenize and encode the texts for BERT; inputs are truncated or padded to maxlen tokens
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(
    x_train=X_train, y_train=y_train,
    x_test=X_test, y_test=y_test,
    class_names=class_names,
    preprocess_mode='bert',
    maxlen=500,
    max_features=40000)

downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


task: text classification


In [7]:
# reload the saved Predictor and extract its underlying model to continue training
model = ktrain.load_predictor("/content/drive/My Drive/Data/nlpdata/currentmodel/").model

In [9]:
learner = ktrain.get_learner(model, train_data=(x_train, y_train), 
                             val_data=(x_test, y_test),
                             batch_size=4)

In [15]:
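# train for one epoch with the 1cycle policy and a maximum learning rate of 5e-5
learner.fit_onecycle(5e-5, 1)

# wrap the fine-tuned model and its preprocessor for inference on raw text
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.get_classes()

# predictor.save("/content/drive/My Drive/Data/nlpdata/newmodel/")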



begin training using onecycle policy with max lr of 5e-05...


['normal', 'attention']
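
For intuition about the `fit_onecycle(5e-5, 1)` call, a one-cycle schedule raises the learning rate to a peak and then lowers it again over the course of training. The sketch below is a simplified triangular version and is not ktrain's actual implementation. The function name `one_cycle_lr` and the starting rate of `max_lr / 10` are assumptions made for illustration.

```python
def one_cycle_lr(step, total_steps, max_lr, div=10.0):
    """Simplified triangular one-cycle schedule (illustrative only).

    The rate climbs linearly from max_lr / div to max_lr at the midpoint,
    then falls linearly back to max_lr / div at the final step.
    """
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    base_lr = max_lr / div
    half = (total_steps - 1) / 2
    # distance is 1.0 at both ends of training and 0.0 at the midpoint
    distance = abs(step - half) / half
    return base_lr + (max_lr - base_lr) * (1.0 - distance)


# Learning rate at the start, middle, and end of an 11-step run.
for step in (0, 5, 10):
    print(step, one_cycle_lr(step, total_steps=11, max_lr=5e-5))
```

The rate is lowest at the first and last steps and reaches `max_lr` halfway through, which is the general shape the training log line "max lr of 5e-05" refers to.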

In [11]:
# save the predictor (model plus preprocessor) to Google Drive
predictor.save("/content/drive/My Drive/Data/nlpdata/newmodel/")



In [12]:
import ktrain


# reload the predictor
predictor = ktrain.load_predictor("/content/drive/My Drive/Data/nlpdata/newmodel/")



In [18]:
import time 
class_names = ['normal', 'attention']
# message = 'Incident Location: 783C XXX RISE #13-09  XXX3783) if the problem cannot resolve it is very dangerous to the residents , i have report  3 month ago' #attention
# message="Incident Location: 716 XXX DR 70 #10-136  XXX0716) Incident Location Description:  Mr Loke enquire how long can a unit do renovation, his neighbor below #09-136 started before CNY and drilling noise can still be heard now. He wish to indicate that the renovation noise is not from his unit, one of his neighbor approached his unit to check. He enquire if anyone else feedback and when will it stop. Please assist, thank you.."
# message="The window grill is loose not able to fasten, it can drop downstair. It is very dangerous for my child at home. Please assist me asap " # attention
# message=" Request make many times to the office but no one reponse to me. I call to office noone answer!" #attention
# message="Ceiling leaking, need help to repair. Reported" # attention
# message="Ceiling leaking, need help to repair. Resend" # attention
# message="Incident Location: 720 XXX AVE 6 #12-616  XXX0720) Incident Location Description:  Feedback the clg leaks at master toilet.  Refer to EE for assistance and return call. thanks" #normal
# message="Incident Location: 727 XXX CIRCLE #06-106  XXX0727) Incident Location Description:  Onwer had book eappt on 10/3 at 8.30am with regards his ceiling leaking at the kitchen toilet. Request HDB officer to check and rectified the issue as soon as possible before the leaking become more worse. Inform him that with route his feedback to the EEIC to follow up with him and he does not need to come down to BO. He noted. EEic for your follow up with owner." #normal
# message=" Request make more two week. No feedback from office!" #attention
# message="There are persistence noise of renovation affect my work. Call the office to report case since 2 weeks ago. There is an urgent response required" #attention
# message="There are burning smoke around my block for a few week. I have report 2 weeks ago . Please help to assist to investigate." #attention
# message="Put up request for repair more than 3 weeks but no feedback" #attention
# message="Feedback on defects. Thanks.   1. toilet door handle broke when the door slammed due to the strong wind today. The handle was already loose before it broke.  2. The toilet tap spoilt."#normal 
# message="there are many crack line at the ceiling. I need you urgent attention!" #attention
# message="Incident Location: 783C XXX RISE #13-09  XXX3783) Incident Location Description:  From: Bei Er ONG (HDB)  ONG_Bei_Er_hdb.gov.sg  Sent: Wednesday, 7 April 2021 12:40 PM To: Marpuah KAWI (HDB)  Marpuah_KAWI_hdb.gov.sg Cc: Angelia MH LIM (HDB)  Angelia_MH_LIM_hdb.gov.sg Subject: FW: Master bedroom ceiling light water leakage HDB ref. no.: 91524919216 Address: BLK 783C XXX RISE # 13 - 09 XXX3783)Hi Marpuah Please create CMS for Angelia. Thanks! Regards Bei Er Ong Bei "
# message="I totally disappoint with the services." # attention
# message="I am unhappy about the repair done in my ceiling. Not up to standard! "
# message=" The water pipe near my unit walkway is leaking. Please help." #normal
message="feedback of defectives 3 weeks ago "
# message="Incident Location: 783C XXX RISE #13-09  XXX3783) his a repeat case.Incident Location Description:  From: Bei Er ONG (HDB)  ONG_Bei_Er_hdb.gov.sg  Sent: Wednesday, 7 April 2021 12:40 PM To: Marpuah KAWI (HDB)  Marpuah_KAWI_hdb.gov.sg Cc: Angelia MH LIM (HDB)  Angelia_MH_LIM_hdb.gov.sg Subject: FW: Master bedroom ceiling light water leakage HDB ref. no.: 91524919216 Address: BLK 783C XXX RISE # 13 - 09 XXX3783).This a repeat case.Hi Marpuah Please create CMS for Angelia .Thanks! Regards Bei Er Ong Bei "
# message="pls help urgent"
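start_time = time.time()
print(predictor.get_classes())
# return_proba=True yields one probability per class, in the order of get_classes()
prediction = predictor.predict(message, return_proba=True)

# raw class probabilities, followed by the elapsed time in seconds
print('predicted: {} ({:.2f})'.format(prediction, (time.time() - start_time)))
# name of the most probable class, followed by the elapsed time
print('predicted: {} ({:.2f} seconds)'.format(class_names[np.argmax(prediction)], (time.time() - start_time)))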

['normal', 'attention']
predicted: [0.00527189 0.99472815] (0.11)
predicted: attention (0.11 seconds)
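
The commented-out messages in the cell above act as an informal test set, each with an expected label. One way to run them all at once is sketched below. The helpers `predicted_label` and `check_examples` are hypothetical, and `fake_predict_proba` is a keyword-based stand-in used only so the example runs without a trained model.

```python
def predicted_label(probabilities, class_names):
    """Return the class name with the highest probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best]


def check_examples(predict_proba, examples, class_names):
    """Return (message, expected, got) for every misclassified example."""
    mismatches = []
    for message, expected in examples:
        got = predicted_label(predict_proba(message), class_names)
        if got != expected:
            mismatches.append((message, expected, got))
    return mismatches


def fake_predict_proba(message):
    # Stand-in for the real model: flags any message containing "urgent".
    return [0.1, 0.9] if "urgent" in message.lower() else [0.8, 0.2]


CLASS_NAMES = ["normal", "attention"]
EXAMPLES = [
    ("there are many crack line at the ceiling. I need you urgent attention!", "attention"),
    (" The water pipe near my unit walkway is leaking. Please help.", "normal"),
    ("Put up request for repair more than 3 weeks but no feedback", "attention"),
]

for message, expected, got in check_examples(fake_predict_proba, EXAMPLES, CLASS_NAMES):
    print(f"expected {expected}, got {got}: {message!r}")
```

With the reloaded predictor, the stand-in can be swapped for `lambda m: predictor.predict(m, return_proba=True)`, which matches the call used in the cell above.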
