<img src="https://huggingface.co/front/assets/huggingface_logo.svg">

## Training DistilBERT on SQuAD v1

In this notebook, we train (fine-tune) DistilBERT from [Hugging Face](https://huggingface.co/transformers/v2.10.0/model_doc/distilbert.html). DistilBERT is a smaller version of BERT with faster inference and lower computation time compared to standard BERT (base-uncased). The different variants of DistilBERT are listed [here](https://huggingface.co/transformers/pretrained_models.html).

The [SQuAD dataset](https://huggingface.co/datasets/squad) is taken from Hugging Face Datasets. DistilBERT base-cased is trained on it for 0.75 epochs, and the resulting PyTorch model is saved. The trained model can be found [here](https://huggingface.co/abhilash1910/distilbert-squadv1).


The steps for using this model are:

```python

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForQuestionAnswering.from_pretrained('abhilash1910/distilbert-squadv1')
tokenizer = AutoTokenizer.from_pretrained('abhilash1910/distilbert-squadv1')

# Build a question-answering pipeline and run it on a single example
nlp_QA = pipeline('question-answering', model=model, tokenizer=tokenizer)
QA_inp = {
    'question': 'What is the fund price of Huggingface in NYSE?',
    'context': 'Huggingface Co. has a total fund price of $19.6 million dollars'
}
result = nlp_QA(QA_inp)
result
```

The result is:

```python
{'score': 0.38547369837760925,
 'start': 42,
 'end': 55,
 'answer': '$19.6 million'}
```
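
The 'start' and 'end' fields appear to be character offsets into the context string, with 'end' exclusive. The short check below is a sketch that only reuses the values printed above: it slices the context with those offsets and recovers the answer text.

```python
# Context and result copied from the example above
context = 'Huggingface Co. has a total fund price of $19.6 million dollars'
result = {'score': 0.38547369837760925, 'start': 42, 'end': 55,
          'answer': '$19.6 million'}

# Slice the context with the returned offsets (the end offset is exclusive)
span = context[result['start']:result['end']]
print(span)
```

This kind of slicing is handy when the answer needs to be highlighted inside the original passage.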



## Importing libraries

Here we import the necessary libraries and install [Datasets](https://pypi.org/project/datasets/).

In [5]:
import torch
import logging
import os
import math
import copy
from dataclasses import dataclass, field


In [6]:
!pip install datasets

Collecting datasets
  Downloading https://files.pythonhosted.org/packages/1a/38/0c24dce24767386123d528d27109024220db0e7a04467b658d587695241a/datasets-1.1.3-py3-none-any.whl (153kB)
Collecting pyarrow>=0.17.1
  Downloading https://files.pythonhosted.org/packages/d7/e1/27958a70848f8f7089bff8d6ebe42519daf01f976d28b481e1bfd52c8097/pyarrow-2.0.0-cp36-cp36m-manylinux2014_x86_64.whl (17.7MB)
Collecting xxhash
  Downloading https://files.pythonhosted.org/packages/f7/73/826b19f3594756cb1c6c23d2fbd8ca6a77a9cd3b650c9dec5acc85004c38/xxhash-2.0.0-cp36-cp36m-manylinux2010_x86_64.whl (242kB)
Installing collected packages: pyarrow, xxhash, datasets
  Found existing installation: pyarrow 0.14.1
    Uninstalling pyarrow-0.14.1:
      Successfully uninstalled pyarrow-0.14.1
Successfully installed datasets-1.1.3 p

## Downloading Transformers Locally

Next, we download [Transformers](https://github.com/huggingface/transformers/issues/8551) locally: we clone the repository, navigate to its root directory, and install it from there. This allows us to use all the different classes for training on different downstream tasks (MLM, NER, POS, classification, QA, MNLI, etc.).

In [7]:
%%capture
!git clone https://github.com/huggingface/transformers
%cd transformers
!pip install .
!pip install -r ./examples/requirements.txt
%cd ..


## Python Scripts For Question Answering

For training on SQuAD, this [repository](https://github.com/huggingface/transformers/tree/master/examples/question-answering) contains all the details. We require the 'run_qa.py', 'trainer_qa.py', and 'utils_qa.py' files, which we download locally using '!wget'.
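
As an alternative to calling wget once per file, the three downloads can be driven from a list. The sketch below builds the same raw URLs used in this notebook; script_urls and download_scripts are hypothetical helper names, and the paths on the master branch may have changed since this notebook was written.

```python
from urllib.request import urlretrieve

# Base URL and file names as used in this notebook
BASE_URL = ('https://raw.githubusercontent.com/huggingface/transformers/'
            'master/examples/question-answering')
SCRIPTS = ('run_qa.py', 'trainer_qa.py', 'utils_qa.py')


def script_urls(base_url=BASE_URL, names=SCRIPTS):
    """Return (file name, raw URL) pairs for the example scripts."""
    return [(name, f'{base_url}/{name}') for name in names]


def download_scripts(target_dir='.'):
    """Download each script into target_dir (requires network access)."""
    for name, url in script_urls():
        urlretrieve(url, f'{target_dir}/{name}')


# Print the URLs without downloading anything
for name, url in script_urls():
    print(name, url)
```

Calling download_scripts() would then fetch all three files into the current directory.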


In [10]:
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/question-answering/run_qa.py

--2020-12-16 16:17:45--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/question-answering/run_qa.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21709 (21K) [text/plain]
Saving to: ‘run_qa.py’


2020-12-16 16:17:45 (136 MB/s) - ‘run_qa.py’ saved [21709/21709]



In [11]:
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/question-answering/trainer_qa.py

--2020-12-16 16:17:46--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/question-answering/trainer_qa.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4539 (4.4K) [text/plain]
Saving to: ‘trainer_qa.py’


2020-12-16 16:17:46 (78.9 MB/s) - ‘trainer_qa.py’ saved [4539/4539]



In [12]:
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/question-answering/utils_qa.py

--2020-12-16 16:17:49--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/question-answering/utils_qa.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22242 (22K) [text/plain]
Saving to: ‘utils_qa.py’


2020-12-16 16:17:49 (72.2 MB/s) - ‘utils_qa.py’ saved [22242/22242]



In [8]:
OUTPUT_DIR='abhilash1910/distilbert-squadv1'

## Training using DistilBERT

Here we have used the following parameters:

- Training Batch Size (per device) : 12
- Learning Rate : 3e-5
- Training Epochs : 0.75
- Sequence Length : 384
- Stride : 128
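
To illustrate how the sequence length and stride interact, here is a simplified sketch of splitting a long token sequence into overlapping windows. It assumes the stride is the number of tokens shared by consecutive windows, and it ignores the question and special tokens that also count toward the 384-token limit in the real script. split_into_windows is a hypothetical helper, not part of run_qa.py.

```python
def split_into_windows(tokens, max_len=384, stride=128):
    """Split tokens into windows of at most max_len items that overlap by stride."""
    if stride >= max_len:
        raise ValueError('stride must be smaller than max_len')
    step = max_len - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window reached the end of the sequence
        start += step
    return windows


# A stand-in "document" of 1000 token ids
windows = split_into_windows(list(range(1000)))
print([(w[0], w[-1]) for w in windows])
```

The overlap means that an answer cut off at the edge of one window has a chance to appear whole in the next one.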

After training completes, the model is saved locally in the Colab environment. We can then either download the files to our machine and upload them to Hugging Face using Git, or upload the model directly from here.

In [13]:
!python run_qa.py \
  --model_name_or_path 'distilbert-base-cased' \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 0.75 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir $OUTPUT_DIR/

2020-12-16 16:17:53.373936: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
12/16/2020 16:17:55 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='abhilash1910/distilbert-squadv1/', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, model_parallel=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=12, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=0.75, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec16_16-17-55_44dd3baf367e', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', lo
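
The logged arguments show per_device_train_batch_size=12 and gradient_accumulation_steps=1. As a rough sketch of how these values relate to the number of optimizer steps, the helpers below compute an effective batch size and a step estimate. The single device and the feature count of 1200 are assumptions for illustration only, and both function names are hypothetical; the Trainer's exact rounding may differ.

```python
import math


def effective_batch_size(per_device_batch_size, num_devices=1,
                         gradient_accumulation_steps=1):
    """Number of examples contributing to one optimizer step."""
    return per_device_batch_size * num_devices * gradient_accumulation_steps


def estimate_optimizer_steps(num_features, batch_size, num_epochs):
    """Approximate number of optimizer steps for a (possibly fractional) epoch count."""
    steps_per_epoch = math.ceil(num_features / batch_size)
    return math.ceil(num_epochs * steps_per_epoch)


# Batch size from the log; one device is an assumption
batch = effective_batch_size(12)
# 1200 features is a made-up number, not the size of the SQuAD training set
steps = estimate_optimizer_steps(num_features=1200, batch_size=batch, num_epochs=0.75)
print(batch, steps)
```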

In [14]:
import glob

files=glob.glob('./abhilash1910/distilbert-squadv1/*')
files

['./abhilash1910/distilbert-squadv1/eval_results.txt',
 './abhilash1910/distilbert-squadv1/vocab.txt',
 './abhilash1910/distilbert-squadv1/checkpoint-500',
 './abhilash1910/distilbert-squadv1/checkpoint-4500',
 './abhilash1910/distilbert-squadv1/pytorch_model.bin',
 './abhilash1910/distilbert-squadv1/checkpoint-1500',
 './abhilash1910/distilbert-squadv1/config.json',
 './abhilash1910/distilbert-squadv1/checkpoint-2000',
 './abhilash1910/distilbert-squadv1/tokenizer_config.json',
 './abhilash1910/distilbert-squadv1/training_args.bin',
 './abhilash1910/distilbert-squadv1/checkpoint-3500',
 './abhilash1910/distilbert-squadv1/checkpoint-5500',
 './abhilash1910/distilbert-squadv1/special_tokens_map.json',
 './abhilash1910/distilbert-squadv1/checkpoint-2500',
 './abhilash1910/distilbert-squadv1/checkpoint-4000',
 './abhilash1910/distilbert-squadv1/checkpoint-3000',
 './abhilash1910/distilbert-squadv1/checkpoint-5000',
 './abhilash1910/distilbert-squadv1/predictions.json',
 './abhilash1910/di
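
Since glob.glob returns paths in no particular order, the checkpoint directories above appear shuffled. A small sketch for ordering them by step number follows; checkpoint_step and sorted_checkpoints are hypothetical helpers, and the sample paths are copied from the listing above.

```python
import re


def checkpoint_step(path):
    """Return the step number of a 'checkpoint-<step>' path, or None."""
    match = re.search(r'checkpoint-(\d+)$', path)
    return int(match.group(1)) if match else None


def sorted_checkpoints(paths):
    """Keep only checkpoint paths and sort them by step number."""
    checkpoints = [p for p in paths if checkpoint_step(p) is not None]
    return sorted(checkpoints, key=checkpoint_step)


# A few entries copied from the listing above
files = [
    './abhilash1910/distilbert-squadv1/vocab.txt',
    './abhilash1910/distilbert-squadv1/checkpoint-500',
    './abhilash1910/distilbert-squadv1/checkpoint-4500',
    './abhilash1910/distilbert-squadv1/checkpoint-1500',
    './abhilash1910/distilbert-squadv1/config.json',
]
ordered = sorted_checkpoints(files)
print(ordered)
```

Sorting numerically rather than alphabetically matters here, because 'checkpoint-1500' would otherwise sort before 'checkpoint-500'.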

## Testing the Model

We first test the trained model on an example using the [NLP pipeline](https://huggingface.co/transformers/main_classes/pipelines.html).

In [15]:
from transformers import pipeline

# Load the locally saved model and tokenizer into a question-answering pipeline
nlp_QA = pipeline('question-answering', model='./abhilash1910/distilbert-squadv1', tokenizer='./abhilash1910/distilbert-squadv1')
QA_inp={
    'question': 'What is the fund price of Huggingface in NYSE?',
    'context': 'Huggingface Co. has a total fund price of $19.6 million dollars'
}
result=nlp_QA(QA_inp)
result

{'answer': '$19.6 million dollars',
 'end': 63,
 'score': 0.8255521655082703,
 'start': 42}

## Uploading Model to Huggingface

This [webpage](https://huggingface.co/transformers/model_sharing.html) contains the details for uploading models. For our use case, we will upload directly from Colab.

The first step involves locally authenticating with Huggingface CLI and saving our session token.

In [None]:
!transformers-cli login

## Creating a Repo in Huggingface Models

We first create our own repository in the [Huggingface Models Repository](https://huggingface.co).

In [None]:
!transformers-cli repo create 'distilbert-squadv1'

## Cloning the Newly Created Repository to our Local Notebook

We clone the repository so that we can commit all the files to it and then push them.

In [17]:
!git clone https://huggingface.co/abhilash1910/distilbert-squadv1

Cloning into 'distilbert-squadv1'...
remote: Enumerating objects: 3, done.[K
remote: Counting objects: 100% (3/3), done.[K
remote: Compressing objects: 100% (2/2), done.[K
remote: Total 3 (delta 0), reused 0 (delta 0)[K
Unpacking objects: 100% (3/3), done.


In [18]:
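# Note: '!cd' runs in a temporary subshell, so it does not change the notebook's working directory; '%cd' does.
!cd distilbert-squadv1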


## Installing Git LFS

Since we have to push a large PyTorch/TensorFlow model file containing the weights (either in binary or HDF5 format), we have to install Git LFS. Once the git-lfs package is available, enabling it from the command line is as simple as:

```bash

git lfs install

```
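
To see which files are large enough to matter for Git LFS, a directory can be scanned by file size. The sketch below is a generic helper, not something from the Hugging Face tooling: files_larger_than is a hypothetical name, and the 1 KiB threshold in the demo is only illustrative, since the actual limit depends on the hosting service.

```python
import os
import tempfile


def files_larger_than(directory, threshold_bytes):
    """Return (path, size) pairs for files above the threshold, largest first."""
    large = []
    for root, _dirs, names in os.walk(directory):
        for name in names:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            if size > threshold_bytes:
                large.append((path, size))
    return sorted(large, key=lambda item: item[1], reverse=True)


# Demo on a temporary directory with one small and one larger file
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, 'small.json'), 'wb') as f:
        f.write(b'{}')
    with open(os.path.join(demo_dir, 'weights.bin'), 'wb') as f:
        f.write(b'\0' * 2048)
    demo_result = [(os.path.basename(p), s)
                   for p, s in files_larger_than(demo_dir, 1024)]
print(demo_result)
```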

In [19]:
!sudo apt-get install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 14 not upgraded.
Need to get 2,129 kB of archives.
After this operation, 7,662 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 git-lfs amd64 2.3.4-1 [2,129 kB]
Fetched 2,129 kB in 1s (2,846 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package git-lfs.
(Reading database ... 144865 files and directories c

In [29]:
!dir

abhilash1910	    __pycache__  runs	      trainer_qa.py  utils_qa.py
distilbert-squadv1  run_qa.py	 sample_data  transformers


In [None]:
%cd abhilash1910/distilbert-squadv1/distilbert-squadv1

In [83]:
!pwd

/content/abhilash1910/distilbert-squadv1/distilbert-squadv1


## Navigate to the Directory containing the files

Make sure the following files are present:

- a config.json file, which saves the configuration of your model;

- a pytorch_model.bin file, which is the PyTorch checkpoint (unless you can’t have it for some reason);

- a tf_model.h5 file, which is the TensorFlow checkpoint (unless you can’t have it for some reason);

- a special_tokens_map.json file, which is part of your tokenizer save;

- a tokenizer_config.json file, which is part of your tokenizer save;

- files named vocab.json, vocab.txt, merges.txt, or similar, which contain the vocabulary of your tokenizer and are part of your tokenizer save;

- possibly an added_tokens.json file, which is part of your tokenizer save.

Additional files, such as the intermediate checkpoint-* directories written during training, can be removed.
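
The checklist above can be turned into a quick sanity check before committing. This is a minimal sketch: check_model_dir is a hypothetical helper, and grouping the weight and vocabulary files as "at least one of" is an assumption, since which of them apply depends on the model and tokenizer.

```python
# File names taken from the checklist above; the groupings are an assumption
REQUIRED = ('config.json', 'special_tokens_map.json', 'tokenizer_config.json')
WEIGHTS = ('pytorch_model.bin', 'tf_model.h5')
VOCAB = ('vocab.json', 'vocab.txt', 'merges.txt')


def check_model_dir(names):
    """Return a list of problems found in a collection of file names."""
    names = set(names)
    problems = [f'missing {name}' for name in REQUIRED if name not in names]
    if not names.intersection(WEIGHTS):
        problems.append('missing model weights (pytorch_model.bin or tf_model.h5)')
    if not names.intersection(VOCAB):
        problems.append('missing vocabulary file')
    return problems


# File names as they appear in the training output directory of this notebook
present = ['config.json', 'pytorch_model.bin', 'vocab.txt',
           'special_tokens_map.json', 'tokenizer_config.json',
           'training_args.bin']
print(check_model_dir(present))
```

In the notebook, the same check could be run on a real directory by passing os.listdir(path) as the argument.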

In [32]:
import glob

files=glob.glob('./abhilash1910/distilbert-squadv1/*')
files

['./abhilash1910/distilbert-squadv1/eval_results.txt',
 './abhilash1910/distilbert-squadv1/vocab.txt',
 './abhilash1910/distilbert-squadv1/checkpoint-500',
 './abhilash1910/distilbert-squadv1/checkpoint-4500',
 './abhilash1910/distilbert-squadv1/pytorch_model.bin',
 './abhilash1910/distilbert-squadv1/checkpoint-1500',
 './abhilash1910/distilbert-squadv1/config.json',
 './abhilash1910/distilbert-squadv1/checkpoint-2000',
 './abhilash1910/distilbert-squadv1/tokenizer_config.json',
 './abhilash1910/distilbert-squadv1/training_args.bin',
 './abhilash1910/distilbert-squadv1/checkpoint-3500',
 './abhilash1910/distilbert-squadv1/checkpoint-5500',
 './abhilash1910/distilbert-squadv1/special_tokens_map.json',
 './abhilash1910/distilbert-squadv1/checkpoint-2500',
 './abhilash1910/distilbert-squadv1/checkpoint-4000',
 './abhilash1910/distilbert-squadv1/checkpoint-3000',
 './abhilash1910/distilbert-squadv1/checkpoint-5000',
 './abhilash1910/distilbert-squadv1/predictions.json',
 './abhilash1910/di

In [None]:
!git add *

## Set Configurations for User Name & Email

We have to provide the Git configuration for our Hugging Face email and user name.

In [106]:
!git config --global user.email "debabhi1396@gmail.com"
!git config --global user.name "abhilash1910"

## Committing the Changes


In [90]:
!git commit -m "Initial Commit"

[main a514dfb] Initial Commit
 108 files changed, 1648941 insertions(+)
 create mode 100644 distilbert-squadv1/checkpoint-1000/config.json
 create mode 100644 distilbert-squadv1/checkpoint-1000/optimizer.pt
 create mode 100644 distilbert-squadv1/checkpoint-1000/pytorch_model.bin
 create mode 100644 distilbert-squadv1/checkpoint-1000/scheduler.pt
 create mode 100644 distilbert-squadv1/checkpoint-1000/special_tokens_map.json
 create mode 100644 distilbert-squadv1/checkpoint-1000/tokenizer_config.json
 create mode 100644 distilbert-squadv1/checkpoint-1000/trainer_state.json
 create mode 100644 distilbert-squadv1/checkpoint-1000/training_args.bin
 create mode 100644 distilbert-squadv1/checkpoint-1000/vocab.txt
 create mode 100644 distilbert-squadv1/checkpoint-1500/config.json
 create mode 100644 distilbert-squadv1/checkpoint-1500/optimizer.pt
 create mode 100644 distilbert-squadv1/checkpoint-1500/pytorch_model.bin
 create mode 100644 distilbert-squadv1/checkpoint-1500/scheduler.pt
 create 

## Pushing the Model Files to Huggingface Repository

Uploading from Colab can run into difficulties related to Git SSH authentication. A simple workaround is to push over HTTPS and include the username and password in the remote URL, in the format:

```bash
https://username:password@huggingface.co/<username/model-name>
```

This allows the model files to be uploaded to the repository.
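
One practical detail with this URL format is that characters such as '@' or ':' in the credentials would break the URL unless they are percent-encoded. The sketch below builds the push URL with the standard library; build_push_url is a hypothetical helper and the credentials shown are placeholders, not real values.

```python
from urllib.parse import quote


def build_push_url(username, password, repo_id, host='huggingface.co'):
    """Build an HTTPS remote URL with percent-encoded credentials."""
    user = quote(username, safe='')
    secret = quote(password, safe='')
    return f'https://{user}:{secret}@{host}/{repo_id}'


# Placeholder credentials; the repository id is the one used in this notebook
url = build_push_url('username', 'p@ss:word', 'abhilash1910/distilbert-squadv1')
print(url)
```

Because the resulting URL contains the password in plain text, it is worth clearing the cell output before sharing the notebook.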

In [None]:
!git push https://username:password@huggingface.co/abhilash1910/distilbert-squadv1

## Conclusion

This notebook collects best practices to follow when training a model on QA downstream tasks within Google Colab.