# **Implement a BERT classifier for MCQA on the DREAM, RACE, and MCTest datasets**

Please upload your master folder to your Google Drive before you start.

To make the files in your Google Drive folder available to the notebook, mount your Google Drive with:

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Then change to the project directory with the following command before you start:

In [3]:
%cd /content/drive/MyDrive/Colab\ Notebooks/MMM-MCQA-master

/content/drive/MyDrive/Colab Notebooks/MMM-MCQA-master


Check your present working directory with:

In [None]:
pwd

'/content/drive/MyDrive/Colab Notebooks/MMM-MCQA-master'

All five MCQA datasets are in the folder "data". To extract the RACE data, run the following command:

In [None]:
!tar -xf /content/drive/MyDrive/Colab\ Notebooks/MMM-MCQA-master/data/RACE.tar.gz

^C
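The same extraction can also be done from Python, which makes it easy to skip the step when the archive has already been unpacked. The following is a minimal sketch and not part of the original repository: the helper name `extract_once` and the marker check are assumptions, and the commented usage assumes the archive unpacks into a top-level `RACE` folder.

```python
import tarfile
from pathlib import Path


def extract_once(archive, dest, marker):
    """Extract `archive` into `dest` unless `dest/marker` already exists.

    Returns True if extraction ran, False if it was skipped.
    """
    dest = Path(dest)
    if (dest / marker).exists():
        return False  # already unpacked, nothing to do
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return True


# Hypothetical usage mirroring the cell above (extracts into the current
# directory); "RACE" as the archive's top-level folder is an assumption:
# extract_once("data/RACE.tar.gz", ".", "RACE")
```

Because the function returns early when the marker exists, re-running the cell after an interrupted session does not repeat a completed extraction.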


Google Colab does not ship with the boto3 library (used to access files from S3 directly). Install it with the following command before you start:

In [4]:
pip install boto3

Collecting boto3
  Downloading https://files.pythonhosted.org/packages/7a/1e/570e2446e97bac3d348d0bc6cbf8ac28997ddbef3d97c052f1c476ff48bb/boto3-1.17.49.tar.gz (99kB)
     |████████████████████████████████| 102kB 8.2MB/s 
Collecting botocore<1.21.0,>=1.20.49
  Downloading https://files.pythonhosted.org/packages/68/59/6e28ce58206039ad2592992b75ee79a8f9dbc902a9704373ddacc4f96300/botocore-1.20.49-py2.py3-none-any.whl (7.4MB)
     |████████████████████████████████| 7.4MB 16.3MB/s 
Collecting jmespath<1.0.0,>=0.7.1
  Downloading https://files.pythonhosted.org/packages/07/cb/5f001272b6faeb23c1c9e0acc04d48eaaf5c862c17709d20e3469c6e0139/jmespath-0.10.0-py2.py3-none-any.whl
Collecting s3transfer<0.4.0,>=0.3.0
  Downloading https://files.pythonhosted.org/packages/98/14/0b4be62b65c52d6d1c442f24e02d2a9889a73d3c352002e14c70f84a679f/s3transfer-0.3.6-py2.py3-none-any.whl (73kB)
     |████████████████████████████████| 81kB 10.4MB/s 
Collecting urllib3<1.27,>

# **Enabling and testing the GPU**

First, you'll need to enable GPUs for the notebook:

Navigate to Edit→Notebook Settings, then select GPU from the Hardware Accelerator drop-down.
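To test that the GPU is actually visible to the runtime, one option is to query `nvidia-smi` from Python. This is a minimal sketch using only the standard library; the helper name `gpu_summary` is an assumption and not part of the project. In a runtime with PyTorch installed, `torch.cuda.is_available()` is another way to check.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the name(s) of visible NVIDIA GPUs, or None if none are found."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on this runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    names = result.stdout.strip()
    # Treat a failed call or empty output as "no GPU".
    return names if result.returncode == 0 and names else None


print(gpu_summary() or "No GPU detected; check the Hardware Accelerator setting.")
```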

Run BERT with the following command:

```
!python run_classifier_bert_exe.py --task_name {task_name} --bert_model_dir {BERT_DIR} --per_gpu_train_batch_size {per_gpu_train_batch_size} --gradient_accumulation_steps {gradient_accumulation_steps}
```






(If the program fails because it runs out of GPU memory, reduce the batch size and/or max_sequence_length in utils_glue.py.)

Run the cell below to use the modified script:

In [5]:
!python run_bert.py 

04/10/2021 13:07:40 - INFO - __main__ -   device cuda n_gpu 1 distributed training False
04/10/2021 13:07:40 - INFO - pytorch_pretrained_bert.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt not found in cache, downloading to /tmp/tmp2ntfhebl
100% 231508/231508 [00:00<00:00, 15072722.52B/s]
04/10/2021 13:07:40 - INFO - pytorch_pretrained_bert.file_utils -   copying /tmp/tmp2ntfhebl to cache at /root/.cache/torch/pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
04/10/2021 13:07:40 - INFO - pytorch_pretrained_bert.file_utils -   creating metadata file for /root/.cache/torch/pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
04/10/2021 13:07:40 - INFO - pytorch_pretrained_bert.file_utils -   removing temp file /tmp/tmp2ntfhebl
04/10/2021 

Run the command below to use the original run_classifier_bert_exe.py:

`python run_classifier_bert_exe.py TASK_NAME MODEL_DIR BATCH_SIZE_PER_GPU GRADIENT_ACCUMULATION_STEPS`

Here we explain each required argument in detail:

**TASK_NAME:** It can be a single task or multiple tasks. If a single task, the options are: dream, race, toefl, mcscript, mctest160, mctest500, mnli, snli, etc. Multiple tasks can be any combination of the single tasks above. For example, if you want to train a multi-task model on the dream and race tasks together, then this variable should be set as "dream,race".

**MODEL_DIR:** The model is initialized from the parameters stored in this directory.

**BATCH_SIZE_PER_GPU:** Batch size of data on a single GPU.

**GRADIENT_ACCUMULATION_STEPS:** Number of mini-batches whose gradients are accumulated before each parameter update.

One note: the effective batch size for training is important. It is the product of three variables: **BATCH_SIZE_PER_GPU**, **NUM_OF_GPUs**, and **GRADIENT_ACCUMULATION_STEPS**.

In my experience, it should be greater than 12, and 24 would be great.
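As a quick check of the arithmetic, the effective batch size can be computed before launching a run. This is an illustrative sketch; the function name `effective_batch_size` is an assumption and does not appear in the scripts.

```python
def effective_batch_size(batch_size_per_gpu, num_gpus, grad_accum_steps):
    """Number of examples contributing to each parameter update."""
    return batch_size_per_gpu * num_gpus * grad_accum_steps


# Batch size 8 on a single GPU with 2 gradient accumulation steps:
print(effective_batch_size(8, 1, 2))  # 16
```

If GPU memory forces a smaller per-GPU batch size, raising the accumulation steps by the same factor keeps the product unchanged (for example, 4 per GPU with 4 steps is also 16).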


In [6]:
!python run_classifier_bert_exe.py dream "tmp/" 8 2 --do_eval

04/10/2021 12:16:25 - INFO - __main__ -   device cuda n_gpu 1 distributed training False
Output directory (tmp/dream_tmp/) already exists and is not empty.
04/10/2021 12:16:25 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file tmp/vocab.txt
04/10/2021 12:16:25 - INFO - pytorch_pretrained_bert.modeling -   loading weights file tmp/pytorch_model.bin
04/10/2021 12:16:25 - INFO - pytorch_pretrained_bert.modeling -   loading configuration file tmp/config.json
04/10/2021 12:16:25 - INFO - pytorch_pretrained_bert.modeling -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

04/10/2021 12:16:34 - INFO - pytorch_pretrained_bert.modeling -   Randomly initialize the top lev