# BERT
This notebook shows how to set up the BERT folder and which parameters can be tuned when using the BERT model.

## Instructions for Use
* BERT requires a very specific folder and file structure.
* The training data and test data must follow the specific formatting shown below.
* The parameters can be tuned, which may change the running time: a larger `train_batch_size` or a smaller `num_train_epochs` generally means a shorter run.
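
As a rough guide to how these two parameters interact, the sketch below estimates the number of training steps for a run. It assumes the step count is the number of examples divided by the batch size, multiplied by the number of epochs, which is how `run_classifier.py` appears to derive it. The helper name `estimate_train_steps` and the training-set size of 10,000 are made up for illustration.

```python
def estimate_train_steps(num_examples, train_batch_size, num_train_epochs):
    """Approximate number of training steps (hypothetical helper)."""
    return int(num_examples / train_batch_size * num_train_epochs)


# 10,000 is an illustrative training-set size, not the real one
for batch_size in (8, 16, 32):
    steps = estimate_train_steps(10_000, batch_size, 3.0)
    print(f"train_batch_size={batch_size}: ~{steps} steps")
```

Fewer steps do not guarantee a proportionally shorter run, because each larger batch takes longer to process and needs more memory.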

In [2]:
import os

#### File directory structure specific to the BERT model
```
├── BERT_method
│   ├── bert                      <- Everything BERT-related is stored in this folder.
│       │                            Download and save the pre-trained model from the official BERT GitHub page:
│       │                            https://github.com/google-research/bert
│       ├── data                  <- Make sure all the .tsv files are in a folder named “data”.
│       ├── bert_output           <- Create the folder “bert_output”, where the fine-tuned model is
│       │                            saved and the test results are written to “test_results.tsv”.
│       └── cased_L-12_H-768_A-12 <- Unzip the downloaded pre-trained BERT model into this directory.
│
├── notebooks      
|...
```
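
The layout above can be created and checked from Python. This is a minimal sketch: the folder and file names come from the tree and from the training command in this notebook, while the helper names `create_bert_folders` and `missing_model_files` are hypothetical. It only checks for `vocab.txt` and `bert_config.json`, not the checkpoint files.

```python
from pathlib import Path


def create_bert_folders(root="BERT_method"):
    """Create the data and output folders under <root>/bert (sketch)."""
    bert_dir = Path(root) / "bert"
    for name in ("data", "bert_output"):
        (bert_dir / name).mkdir(parents=True, exist_ok=True)
    return bert_dir


def missing_model_files(bert_dir, model_name="cased_L-12_H-768_A-12"):
    """List expected pre-trained model files that are not present."""
    expected = ("vocab.txt", "bert_config.json")
    model_dir = Path(bert_dir) / model_name
    return [f for f in expected if not (model_dir / f).exists()]
```

Calling `create_bert_folders()` from the project root and then `missing_model_files(...)` on the returned path gives a quick indication of whether the pre-trained model has been unzipped in the right place.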

In [2]:
# check current working directory
os.getcwd() 

'C:\\Users\\fanfan\\Documents\\Capstone\\DSCI_591_capstone-BCStats\\notebooks'

In [7]:
# # change the working directory to the bert directory
#os.chdir("bert")
os.getcwd()

'/Users/aaronquinton/Documents/UBC-MDS/Capstone/BCstats/BERT_method/bert'

In [12]:
ls ../../DSCI_591_capstone-BCStats/

CONDUCT.md   README.md    data/        notebooks/   reports/
Makefile     TEAMWORK.md  models/      references/  src/


In [13]:
# load packages
import pandas as pd
import numpy as np
import sklearn.metrics as metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [14]:
# read in 2018 qualitative data
df_2018 = pd.read_csv("../../DSCI_591_capstone-BCStats/data/interim/train_2018-qualitative-data.csv")
df_2018.head()
#df_2018['Unnamed: 0']

Unnamed: 0,_telkey,2018 Comment,Code 1,Code 2,Code 3,Code 4,Code 5,CPD,CB,EWC,...,VMG_Improve_collaboration,VMG_Improve_program_implementation,VMG_Public_interest_and_service_delivery,VMG_Review_funding_or_budget,VMG_Keep_politics_out_of_work,VMG_other,OTH_Other_related,OTH_Positive_comments,OTH_Survey_feedback,Unrelated
0,192723-544650,I would suggest having a developmental growth ...,62,13.0,,,,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,188281-540434,Base decisions regarding fish and wildlife on ...,116,,,,,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,174789-230694,Get rid of Leading Workplace Strategies and gi...,51,,,,,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,185914-180608,We are the lowest paid in Canada with a worklo...,24,62.0,,,,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,189099-732978,Official acknowledgement of the limited divers...,35,62.0,,,,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [15]:
categories = df_2018.loc[:,'CPD':'OTH'].columns.tolist()
categories

['CPD',
 'CB',
 'EWC',
 'Exec',
 'FWE',
 'SP',
 'RE',
 'Sup',
 'SW',
 'TEPE',
 'VMG',
 'OTH']

In [16]:
# Creating train and dev dataframes according to BERT
df_bert = pd.DataFrame({'user_id':df_2018['_telkey'],
            'label':df_2018['CPD'],
            'alpha':['a']*df_2018.shape[0],
            'text':df_2018['2018 Comment'].replace(r'\n',' ',regex=True)})
#df_bert = df_bert.iloc[0:200,:]
df_bert.head()

Unnamed: 0,user_id,label,alpha,text
0,192723-544650,1,a,I would suggest having a developmental growth ...
1,188281-540434,0,a,Base decisions regarding fish and wildlife on ...
2,174789-230694,0,a,Get rid of Leading Workplace Strategies and gi...
3,185914-180608,0,a,We are the lowest paid in Canada with a worklo...
4,189099-732978,0,a,Official acknowledgement of the limited divers...


In [17]:
df_bert_train, df_bert_dev = train_test_split(df_bert, test_size=0.15,random_state=2019)

In [18]:
# Creating test dataframe according to BERT
df_test = pd.read_csv("../../DSCI_591_capstone-BCStats/data/interim/test_2018-qualitative-data.csv")
#df_test = df_test.iloc[0:60,:]
df_bert_test = pd.DataFrame({'User_ID':df_test['_telkey'],
                 'text':df_test['2018 Comment'].replace(r'\n',' ',regex=True)})
df_bert_test.head() 

Unnamed: 0,User_ID,text
0,194791-949508,The compensation.
1,174648-027372,compare type of work; expertise required; and ...
2,176038-900440,Greater support for mobile work options and in...
3,173698-669014,Consistent direction by all Supervisors.
4,175136-609856,"Sound - working in an open area, it can be; ve..."


In [19]:
# Saving dataframes to .tsv format as required by BERT
df_bert_train.to_csv('./data/train.tsv', sep='\t', index=False, header=False)
df_bert_dev.to_csv('./data/dev.tsv', sep='\t', index=False, header=False)
df_bert_test.to_csv('./data/test.tsv', sep='\t', index=False, header=True)
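
To make the expected layout concrete, the sketch below writes a two-row toy frame the same way and inspects the result. The rows and IDs are invented for illustration. The column order (ID, label, a throwaway column, text) mirrors `df_bert`, and the training file is written without a header row.

```python
import io

import pandas as pd

# Toy rows, invented for illustration only
toy_train = pd.DataFrame({
    "user_id": ["id-1", "id-2"],
    "label": [1, 0],
    "alpha": ["a", "a"],
    "text": ["first comment", "second comment"],
})

# Write to an in-memory buffer instead of ./data/train.tsv
buffer = io.StringIO()
toy_train.to_csv(buffer, sep="\t", index=False, header=False)
lines = buffer.getvalue().splitlines()

# Every row should have exactly four tab-separated fields
fields_per_row = [len(line.split("\t")) for line in lines]
assert all(n == 4 for n in fields_per_row)
print(fields_per_row)
```

A comment that still contains a tab or a newline would break this field count, which is why newlines in the text column are replaced with spaces before saving.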

### Run the command below in a terminal (from the `bert` directory):
```
python run_classifier.py  \
--task_name=cola  \
--do_train=true  \
--do_eval=true  \
--do_predict=true \
--data_dir=./data/ \
--vocab_file=./cased_L-12_H-768_A-12/vocab.txt \
--bert_config_file=./cased_L-12_H-768_A-12/bert_config.json \
--init_checkpoint=./cased_L-12_H-768_A-12/bert_model.ckpt \
--max_seq_length=512 \
--train_batch_size=8 \
--learning_rate=2e-5 \
--num_train_epochs=3.0  \
--output_dir=./bert_output/ \
--do_lower_case=False 
```
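
For anyone who prefers to launch the run from Python, the same flags can be assembled as an argument list. This is only a sketch: the flag names and values are copied from the command above, while the `build_command` helper and its `overrides` argument are hypothetical conveniences. The `subprocess.run` call is left commented out because it starts a long training job.

```python
import subprocess  # only needed if the last line is uncommented

MODEL_DIR = "./cased_L-12_H-768_A-12"

# Flag names and values copied from the terminal command
DEFAULT_FLAGS = {
    "task_name": "cola",
    "do_train": "true",
    "do_eval": "true",
    "do_predict": "true",
    "data_dir": "./data/",
    "vocab_file": f"{MODEL_DIR}/vocab.txt",
    "bert_config_file": f"{MODEL_DIR}/bert_config.json",
    "init_checkpoint": f"{MODEL_DIR}/bert_model.ckpt",
    "max_seq_length": 512,
    "train_batch_size": 8,
    "learning_rate": 2e-5,
    "num_train_epochs": 3.0,
    "output_dir": "./bert_output/",
    "do_lower_case": "False",
}


def build_command(overrides=None):
    """Return the run_classifier.py call as an argument list (hypothetical helper)."""
    flags = {**DEFAULT_FLAGS, **(overrides or {})}
    return ["python", "run_classifier.py"] + [f"--{k}={v}" for k, v in flags.items()]


cmd = build_command({"train_batch_size": 16})
print(" ".join(cmd))
# subprocess.run(cmd, check=True)
```

Keeping the flags in one dictionary makes it easier to record which values were used for a given run.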

In [1]:
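# Convert the results from the BERT model to .csv format.
# Requires df_test from the cells above; pandas is imported here so the cell
# does not fail with a NameError when run on its own.
import pandas as pd

df_results = pd.read_csv("bert_output/test_results.tsv", sep="\t", header=None)
df_results_csv = pd.DataFrame({'User_ID': df_test['_telkey'],
                               # idxmax picks the class column with the highest probability
                               'bert_CPD': df_results.idxmax(axis=1),
                               'true_CPD': df_test['CPD']})

# Write the predictions to .csv
df_results_csv.to_csv('data/bert_result.csv', sep=",", index=False)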

In [41]:
df_results.head()

Unnamed: 0,0,1
0,0.999709,0.000291
1,0.999547,0.000453
2,0.000784,0.999216
3,0.999611,0.000389
4,0.999677,0.000323
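
Each row of `test_results.tsv` holds one probability per class, with column 0 for label 0 and column 1 for label 1. The predicted label is the column with the larger probability, which is what `idxmax(axis=1)` returns. The sketch below reproduces this on the five rows displayed above.

```python
import pandas as pd

# The five rows of class probabilities shown in the output above
probs = pd.DataFrame([
    [0.999709, 0.000291],
    [0.999547, 0.000453],
    [0.000784, 0.999216],
    [0.999611, 0.000389],
    [0.999677, 0.000323],
])

# For each row, take the column label (0 or 1) with the highest probability
predicted = probs.idxmax(axis=1)
print(predicted.tolist())  # [0, 0, 1, 0, 0]
```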


In [12]:
df_results_csv["diff"] = df_results_csv['bert_CPD'] - df_results_csv['true_CPD']
df_results_csv["predict correctly?"] = df_results_csv['diff'].apply(lambda x: 'True' if x == 0 else 'False')
df_results_csv.head()

Unnamed: 0,User_ID,bert_CPD,true_CPD,diff,predict correctly?
0,194791-949508,0,0,0,True
1,174648-027372,0,0,0,True
2,176038-900440,1,1,0,True
3,173698-669014,0,0,0,True
4,175136-609856,0,0,0,True


In [13]:
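scores = df_results_csv['predict correctly?'].value_counts()
# Look up counts by label: positional indexing depends on which label is more frequent
correct = scores.get('True', 0)
wrong = scores.get('False', 0)
print("accuracy", correct/(correct+wrong))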

accuracy 0.9457627118644067


In [23]:
Ypred = np.array(df_results_csv.bert_CPD)
Ytrue = np.array(df_results_csv.true_CPD)

In [35]:
overall_accuracy = metrics.accuracy_score(Ytrue, Ypred)
hamming_loss = metrics.hamming_loss(Ytrue, Ypred)
print("Overall Accuracy:", round(overall_accuracy, 4),
          '\nHamming Loss:', round(hamming_loss, 4))

Overall Accuracy: 0.9458 
Hamming Loss: 0.0542


In [38]:
# Collect scores in lists (only the CPD label is evaluated here)
precision = []
recall = []
precision.append(metrics.precision_score(Ytrue, Ypred))
recall.append(metrics.recall_score(Ytrue, Ypred))

In [39]:
precision

[0.7857142857142857]

In [40]:
recall

[0.7771739130434783]
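
For reference, precision is TP / (TP + FP) and recall is TP / (TP + FN), where TP, FP and FN are the counts of true positives, false positives and false negatives. The sketch below computes both by hand and checks them against scikit-learn. The two label arrays are invented for illustration and are unrelated to the results above.

```python
import numpy as np
import sklearn.metrics as metrics

# Invented labels for illustration only
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # predicted 1, truly 1
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted 1, truly 0
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # predicted 0, truly 1

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)

# The hand-computed values should match scikit-learn
assert np.isclose(precision_manual, metrics.precision_score(y_true, y_pred))
assert np.isclose(recall_manual, metrics.recall_score(y_true, y_pred))
print(precision_manual, recall_manual)  # 0.75 0.75
```

With a label as imbalanced as CPD appears to be, these two numbers say more about the positive class than overall accuracy does.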