
This colab file is created by [Pragnakalp Techlabs](https://www.pragnakalp.com/).

You can copy this Colab notebook to your Drive and then execute the commands in the given order. For more details, see our blog post [Question Answering System using ELECTRA + SQuAD on Colab TPU](https://www.pragnakalp.com/nlp-tutorial-qna-electra-squad-colab-tpu).

Check all our [NLP Demos on demos.pragnakalp.com](https://demos.pragnakalp.com) 

# **ELECTRA Fine-tuning and Prediction on SQuAD 2.0 using Cloud TPU!**

---



## **Overview**
**ELECTRA** is a method for pre-training language representations. ELECTRA models are trained to distinguish "real" input tokens from "fake" input tokens generated by another neural network, which helps them obtain state-of-the-art results. You can read the ELECTRA paper here: https://openreview.net/pdf?id=r1xMH1BtvB.
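
To make the "real" vs "fake" objective more concrete, here is a minimal sketch of how per-token labels for such a discriminator could be derived. The helper name `replaced_token_labels` and the example sentence are illustrative assumptions, not code from the ELECTRA repository; in the actual method the corrupted tokens are produced by a small generator network rather than written by hand.

```python
from typing import List


def replaced_token_labels(original: List[str], corrupted: List[str]) -> List[int]:
    """Return 1 for each position where the token was replaced, else 0."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]


original = ["the", "chef", "cooked", "the", "meal"]
# Hypothetical generator output: "cooked" was swapped for "ate".
corrupted = ["the", "chef", "ate", "the", "meal"]

labels = replaced_token_labels(original, corrupted)
print(labels)
```

The discriminator would be trained to predict these labels for every position, so each token in the input contributes to the training signal.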

**SQuAD** (Stanford Question Answering Dataset) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
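
The nesting of articles, paragraphs and questions is easier to see in code. Below is a small sketch that walks a SQuAD 2.0 style dictionary and yields each question with its first answer, or `None` when the question is unanswerable. The function name `iter_squad_questions` and the tiny `sample` dictionary are assumptions made for illustration; the sample is hand-written and not taken from the real dataset.

```python
from typing import Dict, Iterator, Optional, Tuple


def iter_squad_questions(dataset: Dict) -> Iterator[Tuple[str, str, Optional[str]]]:
    """Yield (question id, question, first answer text or None) triples."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # SQuAD 2.0 marks unanswerable questions with is_impossible.
                if qa.get("is_impossible") or not qa.get("answers"):
                    answer = None
                else:
                    answer = qa["answers"][0]["text"]
                yield qa["id"], qa["question"], answer


# Tiny hand-written sample in the SQuAD 2.0 layout.
sample = {
    "version": "v2.0",
    "data": [{
        "title": "example",
        "paragraphs": [{
            "context": "Google was founded in 1998 by Larry Page and Sergey Brin.",
            "qas": [
                {"id": "q1", "question": "Who founded Google?",
                 "is_impossible": False,
                 "answers": [{"text": "Larry Page and Sergey Brin",
                              "answer_start": 30}]},
                {"id": "q2", "question": "Who is the CFO?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }],
}

rows = list(iter_squad_questions(sample))
for row in rows:
    print(row)
```

Note that `answer_start` is a character offset into `context`, which is why every answer is a span of the passage.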

This Colab notebook shows how to fine-tune ELECTRA on the SQuAD dataset and then how to run prediction. Using this you can create your own **Question Answering System.**

**Prerequisite**: You will need a GCP (Google Cloud Platform) account and a GCS (Google Cloud Storage) bucket to run this Colab notebook.

Please follow the Google Cloud documentation to create a GCP account and a GCS bucket. You have $300 free credit to get started with any GCP product. You can learn more about it at https://cloud.google.com/tpu/docs/setup-gcp-account

You can create your GCS bucket from here http://console.cloud.google.com/storage.


## **Change Runtime to TPU.**
>On the main menu, click on Runtime and select Change runtime type. Set "TPU" as the hardware accelerator.



## **Clone Repository of ELECTRA.**

> First, clone the **'electra'** repository from GitHub using the command below.

In [None]:
!git clone https://github.com/google-research/electra.git

Cloning into 'electra'...
remote: Enumerating objects: 72, done.
remote: Counting objects: 100% (72/72), done.
remote: Compressing objects: 100% (56/56), done.
remote: Total 72 (delta 26), reused 58 (delta 16), pack-reused 0
Unpacking objects: 100% (72/72), done.


Use the 'cd' command to enter the 'electra' repository directory.

In [None]:
cd electra

/content/electra


Use 'ls -l' to see the files inside the given directory.

In [None]:
ls -l

total 116
-rw-r--r-- 1 root root  3788 Mar 16 06:50 build_openwebtext_pretraining_dataset.py
-rw-r--r-- 1 root root  8801 Mar 16 06:50 build_pretraining_dataset.py
-rw-r--r-- 1 root root  7607 Mar 16 06:50 configure_finetuning.py
-rw-r--r-- 1 root root  5265 Mar 16 06:50 configure_pretraining.py
-rw-r--r-- 1 root root  1101 Mar 16 06:50 CONTRIBUTING.md
drwxr-xr-x 5 root root  4096 Mar 16 06:50 finetune/
-rw-r--r-- 1 root root 11358 Mar 16 06:50 LICENSE
drwxr-xr-x 2 root root  4096 Mar 16 06:50 model/
drwxr-xr-x 2 root root  4096 Mar 16 06:50 pretrain/
-rw-r--r-- 1 root root 15480 Mar 16 06:50 README.md
-rw-r--r-- 1 root root 12663 Mar 16 06:50 run_finetuning.py
-rw-r--r-- 1 root root 16518 Mar 16 06:50 run_pretraining.py
drwxr-xr-x 2 root root  4096 Mar 16 06:50 util/


## **Download ELECTRA Base Model from Repo**

> The released ELECTRA models are listed below:

  *   [ELECTRA-Small](https://storage.googleapis.com/electra-data/electra_small.zip)
  *   [ELECTRA-Base](https://storage.googleapis.com/electra-data/electra_base.zip)
  *   [ELECTRA-Large](https://storage.googleapis.com/electra-data/electra_large.zip)

> We download the 'ELECTRA-Base' model using the command below.


In [None]:
!wget https://storage.googleapis.com/electra-data/electra_base.zip

--2020-03-16 06:50:42--  https://storage.googleapis.com/electra-data/electra_base.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 108.177.111.128, 2607:f8b0:4001:c0d::80
Connecting to storage.googleapis.com (storage.googleapis.com)|108.177.111.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 885890161 (845M) [application/zip]
Saving to: ‘electra_base.zip’


2020-03-16 06:50:46 (241 MB/s) - ‘electra_base.zip’ saved [885890161/885890161]



In [None]:
#unzip pretrained model
!unzip electra_base.zip

Archive:  electra_base.zip
   creating: electra_base/
  inflating: electra_base/electra_base.meta  
  inflating: electra_base/electra_base.index  
  inflating: electra_base/checkpoint  
  inflating: electra_base/vocab.txt  
  inflating: electra_base/electra_base.data-00000-of-00001  


In [None]:
#Download the SQUAD train and dev dataset
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

--2020-03-16 06:51:03--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.109.153, 185.199.111.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘train-v2.0.json’


2020-03-16 06:51:04 (48.4 MB/s) - ‘train-v2.0.json’ saved [42123633/42123633]

--2020-03-16 06:51:06--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.109.153, 185.199.111.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2020-03-16 06:51:06 (45.2 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]



Rename the SQuAD dataset files using the commands below. The **fine-tuning script** inside the electra repo expects **'train.json'** for fine-tuning and **'dev.json'** for evaluation.

In [None]:
!mv dev-v2.0.json dev.json
!mv train-v2.0.json train.json

## **Set up your TPU environment:**



*   Verify that you are connected to a TPU device.
*   Find out your TPU address, which is used at the time of fine-tuning.
*   Perform Google authentication to access your bucket.
*   Upload your credentials to the TPU so that it can access your GCS bucket.

In [None]:
import json
import os
import pprint

import tensorflow as tf  # tf.Session and tf.contrib below are TensorFlow 1.x APIs
from google.colab import auth

# COLAB_TPU_ADDR is only present when the runtime type is set to TPU.
assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the "Change Runtime to TPU" section of this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is => ', TPU_ADDRESS)

# Authenticate with Google so that the credentials file read below is available.
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

TPU address is =>  grpc://10.53.108.138:8470
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 490528862421994545),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 17545060202478104313),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 7140905136198066936),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 9935863326940108697),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 12175906947571916151),
 _DeviceAttributes(/job:tpu_worker/replica:0/tas

Check the folder location and file for the finetuning inside given directory:

In [None]:
ls -l

total 910664
-rw-r--r-- 1 root root      3788 Mar 16 06:50 build_openwebtext_pretraining_dataset.py
-rw-r--r-- 1 root root      8801 Mar 16 06:50 build_pretraining_dataset.py
-rw-r--r-- 1 root root      7607 Mar 16 06:50 configure_finetuning.py
-rw-r--r-- 1 root root      5265 Mar 16 06:50 configure_pretraining.py
-rw-r--r-- 1 root root      1101 Mar 16 06:50 CONTRIBUTING.md
-rw-r--r-- 1 root root   4370528 Mar 14 23:23 dev.json
drwxr-xr-x 2 root root      4096 Mar  3 21:55 electra_base/
-rw-r--r-- 1 root root 885890161 Mar  6 20:00 electra_base.zip
drwxr-xr-x 5 root root      4096 Mar 16 06:50 finetune/
-rw-r--r-- 1 root root     11358 Mar 16 06:50 LICENSE
drwxr-xr-x 2 root root      4096 Mar 16 06:50 model/
drwxr-xr-x 2 root root      4096 Mar 16 06:50 pretrain/
-rw-r--r-- 1 root root     15480 Mar 16 06:50 README.md
-rw-r--r-- 1 root root     12663 Mar 16 06:50 run_finetuning.py
-rw-r--r-- 1 root root     16518 Mar 16 06:50 run_pre

## **Create Data directory:** 


> Create a data directory in your GCS (Google Cloud Storage) bucket and move the pre-trained model and the SQuAD dataset files into it, because the local file system is not supported on TPU. To do so, provide your BUCKET name and data DIRECTORY name below.

> If you don't move your pre-trained model to the GCS bucket, you may face an error.




In [None]:
BUCKET = 'electra_finetuning' #@param {type:"string"}
assert BUCKET, '*** Must specify an existing GCS bucket name ***'
data_dir_name = 'data_dir' #@param {type:"string"}
# Bare bucket name: the gsutil and fine-tuning commands below add the 'gs://' prefix themselves.
BUCKET_NAME = BUCKET
DATA_DIR = 'gs://{}/{}'.format(BUCKET, data_dir_name)
# Create the data directory inside the bucket.
tf.gfile.MakeDirs(DATA_DIR)
print('***** Data directory: {} *****'.format(DATA_DIR))

### Create a Path inside GCS (Google Cloud Storage) bucket.

> Make a new directory **'finetuning_data'** and, inside it, another directory **'squad'**, so that the path **'data_dir/finetuning_data/squad/'** exists for the SQuAD dataset files. This is the **default path** for the **SQuAD dataset** in the **ELECTRA fine-tuning scripts**.

> In your bucket, make a new folder **'models'** inside the 'data_dir' directory so that the path **'data_dir/models/'** exists for the **ELECTRA pre-trained model**. This is the default location of the pre-trained model in the fine-tuning script.
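
The expected folder layout can be summarised with a short helper. This is only a sketch: the function `electra_gcs_layout` and its dictionary keys are hypothetical names introduced here, while the path components (`finetuning_data/squad` and `models`) are the ones described above.

```python
import posixpath


def electra_gcs_layout(bucket: str, data_dir_name: str = "data_dir",
                       model_name: str = "electra_base") -> dict:
    """Build the gs:// paths described above for a given bucket name."""
    data_dir = "gs://{}/{}".format(bucket, data_dir_name)
    return {
        "data_dir": data_dir,
        # Where the fine-tuning scripts look for train.json and dev.json.
        "squad_dir": posixpath.join(data_dir, "finetuning_data", "squad"),
        # Parent folder that holds the pre-trained checkpoint folder.
        "models_dir": posixpath.join(data_dir, "models"),
        "pretrained_model": posixpath.join(data_dir, "models", model_name),
    }


layout = electra_gcs_layout("electra_finetuning")
for name, path in layout.items():
    print("{:17s} {}".format(name, path))
```

Printing the layout before moving any files is a cheap way to check that the bucket name and directory name are spelled as intended.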

## **Move Pretrained Model to GCS Bucket** 

> The **gsutil** **mv** command allows you to move data between your local file system and the cloud, move data within the cloud, and move data between cloud storage providers.




In [None]:
!gsutil mv /content/electra/electra_base/ gs://$BUCKET_NAME/data_dir/models

Copying file:///content/electra/electra_base/electra_base.index [Content-Type=application/octet-stream]...
/ [1 files][ 16.6 KiB/ 16.6 KiB]
Removing file:///content/electra/electra_base/electra_base.index...
Copying file:///content/electra/electra_base/electra_base.data-00000-of-00001 [Content-Type=application/octet-stream]...
/ [1 files][ 16.6 KiB/455.4 MiB]
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will nee

## **Move SQUAD datasets for pretrained Model:**

> Move the SQuAD dataset files into the bucket path 'gs://$BUCKET_NAME/data_dir/finetuning_data/squad' using the commands below.

In [None]:
!gsutil mv /content/electra/train.json gs://$BUCKET_NAME/data_dir/finetuning_data/squad
!gsutil mv /content/electra/dev.json gs://$BUCKET_NAME/data_dir/finetuning_data/squad

Copying file:///content/electra/train.json [Content-Type=application/json]...
Removing file:///content/electra/train.json...

Operation completed over 1 objects/40.2 MiB.                                     
Copying file:///content/electra/dev.json [Content-Type=application/json]...
Removing file:///content/electra/dev.json...

Operation completed over 1 objects/4.2 MiB.                                      


## **Fine Tuning:**

> The command below runs the training. To train on the TPU, make sure the following hyperparameters are set in the `--hparams` JSON: `use_tpu` must be true, and `tpu_name` must be the TPU address found above. Replace the `tpu_name` value in the command with your own TPU address.

1.   "use_tpu": "True"
2.   "tpu_name": "YOUR_TPU_ADDRESS"
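
Because the TPU address changes between Colab sessions, it can help to assemble the command from variables instead of editing the JSON by hand. The sketch below is one possible way to do that; `build_finetune_command` is a hypothetical helper, and the hyperparameter keys are the ones that appear in the fine-tuning command in this notebook.

```python
import json
import shlex


def build_finetune_command(bucket: str, tpu_address: str,
                           model_name: str = "electra_base") -> str:
    """Return a run_finetuning.py command line with the TPU settings filled in."""
    hparams = {
        "model_size": "base",
        "task_names": ["squad"],
        "use_tpu": True,          # serialised as JSON true
        "tpu_name": tpu_address,  # e.g. the TPU_ADDRESS printed earlier
        "num_tpu_cores": 8,
    }
    return " ".join([
        "python3", "run_finetuning.py",
        "--data-dir", "gs://{}/data_dir/".format(bucket),
        "--model-name", model_name,
        # shlex.quote wraps the JSON in single quotes for the shell.
        "--hparams", shlex.quote(json.dumps(hparams)),
    ])


command = build_finetune_command("electra_finetuning", "grpc://10.53.108.138:8470")
print(command)
```

Building the JSON with `json.dumps` avoids quoting mistakes and keeps booleans as real JSON values.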




Use the 'pwd' command to confirm that you are in the directory containing the fine-tuning script, then run the fine-tuning script in the cell that follows.

In [None]:
pwd

In [None]:
!python3 run_finetuning.py --data-dir gs://$BUCKET_NAME/data_dir/ --model-name electra_base --hparams '{"model_size": "base", "task_names": ["squad"] , "use_tpu": "True", "tpu_name": "grpc://10.53.108.138:8470", "num_tpu_cores":8}'

Config: model=electra_base, trial 1/1
answerable_classifier True
answerable_uses_start_logits True
answerable_weight 0.5
beam_size 20
data_dir gs://electra_finetuning/data_dir/
debug False
do_eval True
do_lower_case True
do_train True
doc_stride 128
double_unordered True
embedding_size None
eval_batch_size 32
gcp_project None
init_checkpoint gs://electra_finetuning/data_dir/models/electra_base
iterations_per_loop 1000
joint_prediction True
keep_all_models True
layerwise_lr_decay 0.8
learning_rate 0.0001
log_examples False
max_answer_length 30
max_query_length 64
max_seq_length 512
model_dir gs://electra_finetuning/data_dir/models/electra_base/finetuning_models/squad_model
model_hparam_overrides {}
model_name electra_base
model_size base
n_best_size 20
n_writes_test 5
num_tpu_cores 8
num_train_epochs 2.0
num_trials 1
predict_batch_size 32
preprocessed_data_dir gs://electra_finetuning/data_dir/models/electra_base/finetuning_tfrecords/squad_tfrecords
qa_eval_file <built-in method format o

### Location of Trained Model and Evaluation Result:

> After training, your fine-tuned model is located at '**$BUCKET_NAME/data_dir/models/electra_base/finetuning_models/squad_model_1/**' in your bucket.

> The evaluation result for the 'dev.json' file is stored in JSON format at the bucket path **'$BUCKET_NAME/data_dir/models/electra_base/results/squad_qa'**.

## **Prediction:**

Check finetuning script 'run_finetuning.py' inside 'electra' folder by using below command:

In [None]:
ls -l

total 865256
-rw-r--r-- 1 root root      3788 Mar 16 06:50 build_openwebtext_pretraining_dataset.py
-rw-r--r-- 1 root root      8801 Mar 16 06:50 build_pretraining_dataset.py
-rw-r--r-- 1 root root      7607 Mar 16 06:50 configure_finetuning.py
-rw-r--r-- 1 root root      5265 Mar 16 06:50 configure_pretraining.py
-rw-r--r-- 1 root root      1101 Mar 16 06:50 CONTRIBUTING.md
drwxr-xr-x 2 root root      4096 Mar 16 06:56 electra_base/
-rw-r--r-- 1 root root 885890161 Mar  6 20:00 electra_base.zip
drwxr-xr-x 6 root root      4096 Mar 16 06:59 finetune/
-rw-r--r-- 1 root root     11358 Mar 16 06:50 LICENSE
drwxr-xr-x 3 root root      4096 Mar 16 06:59 model/
drwxr-xr-x 3 root root      4096 Mar 16 06:59 pretrain/
drwxr-xr-x 2 root root      4096 Mar 16 06:59 __pycache__/
-rw-r--r-- 1 root root     15480 Mar 16 06:50 README.md
-rw-r--r-- 1 root root     12663 Mar 16 06:50 run_finetuning.py
-rw-r--r-- 1 root root     16518 Mar 

Remove the SQuAD training and dev files from the dataset path in your bucket, so that a new 'dev.json' can take their place:

In [None]:
!gsutil rm gs://$BUCKET_NAME/data_dir/finetuning_data/squad/train.json
!gsutil rm gs://$BUCKET_NAME/data_dir/finetuning_data/squad/dev.json

Removing gs://electra_finetuning/data_dir/finetuning_data/squad/train.json...
/ [1 objects]                                                                   
Operation completed over 1 objects.                                              
Removing gs://electra_finetuning/data_dir/finetuning_data/squad/dev.json...
/ [1 objects]                                                                   
Operation completed over 1 objects.                                              


### **Create Testing File**


> We create a blank **'dev.json'** file, because that is the default evaluation file name the fine-tuning script uses for inference with the fine-tuned model, and then write the data into it in SQuAD format.

*   **touch** is used to create a file
*   **%%writefile** is used to write a file in the colab

> You can pass your own questions and context in the below file.
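
If you have many questions, writing the JSON by hand becomes error-prone. As a rough sketch, the same SQuAD-style structure can be generated from a context string and a list of questions. The helper `build_squad_inference_file` is a hypothetical name, the ids are random placeholders, and `is_impossible` is set to JSON `false` here as an assumed placeholder since no gold answers are known at inference time.

```python
import json
import uuid


def build_squad_inference_file(title, context, questions):
    """Return a SQuAD 2.0 style dict holding unanswered questions."""
    qas = [
        {
            "question": question,
            "id": uuid.uuid4().hex,   # any unique string can serve as an id
            "is_impossible": False,   # placeholder; no gold answer is known
            "answers": [],
        }
        for question in questions
    ]
    return {
        "version": "v2.0",
        "data": [{"title": title,
                  "paragraphs": [{"qas": qas, "context": context}]}],
    }


payload = build_squad_inference_file(
    "your_title",
    "Google was founded in 1998 by Larry Page and Sergey Brin.",
    ["Who founded google?", "When was Google founded?"],
)
serialized = json.dumps(payload, indent=4)
print(serialized)
```

The `serialized` string can then be written to 'dev.json' in place of the hand-written file.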


In [None]:
!touch dev.json

In [None]:
%%writefile dev.json
{
    "version": "v2.0",
    "data": [
        {
            "title": "your_title",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": "Who is current CEO?",
                            "id": "56ddde6b9a695914005b9628",
                            "is_impossible": "",
                            "answers":[]
                        },
                        {
                            "question": "Who founded google?",
                            "id": "56ddde6b9a695914005b9629",
                            "is_impossible": "",
                            "answers":[]
                        },
                        {
                            "question": "when did IPO take place?",
                            "id": "56ddde6b9a695914005b962a",
                            "is_impossible": "",
                            "answers":[]
                        }
                    ],
                    "context": "Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock. They incorporated Google as a privately held company on September 4, 1998. An initial public offering (IPO) took place on August 19, 2004, and Google moved to its headquarters in Mountain View, California, nicknamed the Googleplex. In August 2015, Google announced plans to reorganize its various interests as a conglomerate called Alphabet Inc. Google is Alphabet's leading subsidiary and will continue to be the umbrella company for Alphabet's Internet interests. Sundar Pichai was appointed CEO of Google, replacing Larry Page who became the CEO of Alphabet."                
                 }
            ]
        }
    ]
}

Overwriting dev.json


### **Move Inference input file:**
Move this 'dev.json' file from the local path to the SQuAD dataset path of your bucket using the command below.

In [None]:
!gsutil mv /content/electra/dev.json gs://$BUCKET_NAME/data_dir/finetuning_data/squad/

Copying file:///content/electra/dev.json [Content-Type=application/json]...
Removing file:///content/electra/dev.json...

Operation completed over 1 objects/1.9 KiB.                                      


In [None]:
ls

build_openwebtext_pretraining_dataset.py  electra_base.zip  README.md
build_pretraining_dataset.py              finetune/         run_finetuning.py
configure_finetuning.py                   LICENSE           run_pretraining.py
configure_pretraining.py                  model/            util/
CONTRIBUTING.md                           pretrain/
electra_base/                             __pycache__/


Run the command below to get predictions for the given **'dev.json'** file. The path for the prediction file inside your bucket will be **'$BUCKET_NAME/data_dir/models/electra_base/results/squad_qa/squad_predict.json'**.
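
Once the prediction file has been copied back from the bucket, the answers still need to be matched to their questions. The sketch below assumes the prediction file is a JSON object mapping each question id to its predicted answer text; that format is an assumption and should be checked against your own output. The helper `pair_predictions` is hypothetical, and the `predictions` dictionary is a made-up stand-in, not real model output.

```python
from typing import Dict, List, Tuple


def pair_predictions(dev_data: Dict, predictions: Dict[str, str]) -> List[Tuple[str, str]]:
    """Match each question in a SQuAD-format dict with its predicted answer."""
    pairs = []
    for article in dev_data["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # A missing or empty prediction is read here as "no answer found".
                answer = predictions.get(qa["id"], "") or "<no answer>"
                pairs.append((qa["question"], answer))
    return pairs


# Hand-written stand-ins for dev.json and the downloaded prediction file.
dev_data = {"data": [{"paragraphs": [{"context": "...", "qas": [
    {"id": "56ddde6b9a695914005b9629", "question": "Who founded google?"},
    {"id": "56ddde6b9a695914005b962a", "question": "when did IPO take place?"},
]}]}]}
predictions = {"56ddde6b9a695914005b9629": "Larry Page and Sergey Brin"}

pairs = pair_predictions(dev_data, predictions)
for question, answer in pairs:
    print("{} -> {}".format(question, answer))
```

In practice both dictionaries would be loaded with `json.load` from the local copies of 'dev.json' and the prediction file.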

In [None]:
!python3 run_finetuning.py --data-dir gs://$BUCKET_NAME/data_dir --model-name electra_base/ --hparams '{"do_train": false, "do_eval": true, "model_size": "base", "task_names": ["squad"], "init_checkpoint": "gs://$BUCKET_NAME/data_dir/models/electra_base/finetuning_models/squad_model_1", "use_tpu": "True", "tpu_name": "grpc://10.53.108.138:8470", "num_tpu_cores":8}'

Config: model=electra_base/, trial 1/1
answerable_classifier True
answerable_uses_start_logits True
answerable_weight 0.5
beam_size 20
data_dir gs://electra_finetuning/data_dir
debug False
do_eval True
do_lower_case True
do_train False
doc_stride 128
double_unordered True
embedding_size None
eval_batch_size 32
gcp_project None
init_checkpoint gs://electra_finetuning/data_dir/models/electra_base/finetuning_models/squad_model_1
iterations_per_loop 1000
joint_prediction True
keep_all_models True
layerwise_lr_decay 0.8
learning_rate 0.0001
log_examples False
max_answer_length 30
max_query_length 64
max_seq_length 512
model_dir gs://electra_finetuning/data_dir/models/electra_base/finetuning_models/squad_model
model_hparam_overrides {}
model_name electra_base/
model_size base
n_best_size 20
n_writes_test 5
num_tpu_cores 8
num_train_epochs 2.0
num_trials 1
predict_batch_size 32
preprocessed_data_dir gs://electra_finetuning/data_dir/models/electra_base/finetuning_tfrecords/squad_tfrecords
qa_e