# Preliminaries
The first thing we do is clone the transformers repo and install the necessary requirements using the following code.

In [1]:
!git clone --branch v3.0.1 https://github.com/huggingface/transformers # Clone transformers repo
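# Note: "!cd" runs in its own subshell and does not change the notebook's working directory, so the paths below stay relative to the starting directory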
!pip install -r transformers/examples/requirements.txt # Install necessary requirements
!pip install transformers==3.0.1 # Fix transformers version for reproducibility

Cloning into 'transformers'...
remote: Enumerating objects: 57496, done.
remote: Total 57496 (delta 0), reused 0 (delta 0), pack-reused 57496
Receiving objects: 100% (57496/57496), 42.83 MiB | 24.64 MiB/s, done.
Resolving deltas: 100% (40356/40356), done.
Note: checking out 'fedabcd1545839798004b2b468f191ec2244442f'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
     |████████████████████████████████| 43 kB 865 kB/s
Collecting sacrebleu
  Downloading sacrebleu-1.4.14-py3-none-any.whl (64 kB)
     |████████████████████████████████| 64 kB 1.3

Freeze requirements for later reference.

In [2]:
!pip freeze > kaggle_image_requirements.txt

Download the GLUE data.

In [3]:
!mkdir GLUE
!python transformers/utils/download_glue_data.py --data_dir GLUE --tasks all # download GLUE data for all tasks

Downloading and extracting CoLA...
	Completed!
Downloading and extracting SST...
	Completed!
Processing MRPC...
Local MRPC data not specified, downloading data from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt
	Completed!
Downloading and extracting QQP...
	Completed!
Downloading and extracting STS...
	Completed!
Downloading and extracting MNLI...
	Completed!
Downloading and extracting SNLI...
	Completed!
Downloading and extracting QNLI...
	Completed!
Downloading and extracting RTE...
	Completed!
Downloading and extracting WNLI...
	Completed!
Downloading and extracting diagnostic...
	Completed!


Let's get a sense for what is in the directory.

In [4]:
#!cd GLUE # the following will create a tree view of everything
#!ls -R | grep ":$" | sed -e 's/:$//' -e 's/[^-][^\/]*\//--/g' -e 's/^/   /' -e 's/-/|/'
!ls GLUE/STS-B # let's see what is in the STS-B directory specifically

LICENSE.txt  dev.tsv  original	readme.txt  test.tsv  train.tsv
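
The commented-out `ls -R | grep | sed` pipeline in the cell above builds a tree view in the shell. A rough Python equivalent is sketched below, assuming the data was downloaded to a `GLUE` folder in the current directory; `tree_lines` is a hypothetical helper name, not part of any library.

```python
import os

def tree_lines(root):
    """Return an indented listing of every directory and file under root."""
    lines = []
    root = os.path.normpath(root)
    base_depth = root.count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place makes the traversal order deterministic
        depth = dirpath.count(os.sep) - base_depth
        lines.append("    " * depth + os.path.basename(dirpath) + "/")
        for name in sorted(filenames):
            lines.append("    " * (depth + 1) + name)
    return lines

# Only print when the (assumed) GLUE folder exists in the working directory.
if os.path.isdir("GLUE"):
    print("\n".join(tree_lines("GLUE")))
```

Calling `tree_lines("GLUE/STS-B")` instead restricts the listing to a single task.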


Let's get a sense for what the data looks like.

In [5]:
!head GLUE/STS-B/train.tsv 

index	genre	filename	year	old_index	source1	source2	sentence1	sentence2	score
0	main-captions	MSRvid	2012test	0001	none	none	A plane is taking off.	An air plane is taking off.	5.000
1	main-captions	MSRvid	2012test	0004	none	none	A man is playing a large flute.	A man is playing a flute.	3.800
2	main-captions	MSRvid	2012test	0005	none	none	A man is spreading shreded cheese on a pizza.	A man is spreading shredded cheese on an uncooked pizza.	3.800
3	main-captions	MSRvid	2012test	0006	none	none	Three men are playing chess.	Two men are playing chess.	2.600
4	main-captions	MSRvid	2012test	0009	none	none	A man is playing the cello.	A man seated is playing the cello.	4.250
5	main-captions	MSRvid	2012test	0011	none	none	Some men are fighting.	Two men are fighting.	4.250
6	main-captions	MSRvid	2012test	0012	none	none	A man is smoking.	A man is skating.	0.500
7	main-captions	MSRvid	2012test	0013	none	none	The man is playing the piano.	The man is playing the guitar.	1.600
8	main-captions	
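
For this task the relevant columns are presumably `sentence1`, `sentence2` and `score`, where the score is a similarity rating between 0 and 5. The sketch below parses two of the rows shown above with the standard `csv` module; `read_pairs` is a hypothetical helper, and the inline `sample` string stands in for `GLUE/STS-B/train.tsv`.

```python
import csv
import io

# Header plus two rows copied from the `head` output above (tab-separated).
sample = (
    "index\tgenre\tfilename\tyear\told_index\tsource1\tsource2\tsentence1\tsentence2\tscore\n"
    "0\tmain-captions\tMSRvid\t2012test\t0001\tnone\tnone\tA plane is taking off.\tAn air plane is taking off.\t5.000\n"
    "3\tmain-captions\tMSRvid\t2012test\t0006\tnone\tnone\tThree men are playing chess.\tTwo men are playing chess.\t2.600\n"
)

def read_pairs(handle):
    """Yield (sentence1, sentence2, score) from an STS-B style TSV file object."""
    # QUOTE_NONE keeps stray quote characters inside sentences from merging fields.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row["sentence1"], row["sentence2"], float(row["score"])

pairs = list(read_pairs(io.StringIO(sample)))
for s1, s2, score in pairs:
    print(f"{score:.2f}  {s1!r} vs {s2!r}")
```

To read the real file, pass `open("GLUE/STS-B/train.tsv", encoding="utf-8")` in place of the `io.StringIO` object.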

# Fine-Tune on STS-B Task

Fine-tune from the `bert-base-cased` checkpoint on the STS-B task, using a batch size of 32, a maximum input sequence length of 256, a learning rate of 2e-5, and 3 epochs.

In [6]:
%%time
# the above is a “magic” command for timing the entire cell - has to be the first command
!python transformers/examples/text-classification/run_glue.py --model_name_or_path bert-base-cased --task_name STS-B --do_train --do_eval --data_dir GLUE/STS-B/ --max_seq_length 256 --per_gpu_train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 3.0 --output_dir /tmp/STS-B/

2020-12-28 08:28:47.652452: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
The current process just got forked. Disabling parallelism to avoid deadlocks...
The current process just got forked. Disabling parallelism to avoid deadlocks...
The current process just got forked. Disabling parallelism to avoid deadlocks...
Downloading: 100%|██████████████████████████████| 433/433 [00:00<00:00, 374kB/s]
Downloading: 100%|███████████████████████████| 213k/213k [00:00<00:00, 5.17MB/s]
Downloading: 100%|███████████████████████████| 436M/436M [00:15<00:00, 28.9MB/s]
Epoch:   0%|                                              | 0/3 [00:00<?, ?it/s]
Iteration:   0%|                                        | 0/180 [00:00<?, ?it/s]
Iteration:   1%|▏                               | 1/180 [00:01<04:27,  1.50s/it]
Iteration:   1%|▎                               | 2/180 [00:02<03:51,  1.30s/it]
Iteration:   2%|▌       

Take a look at the output folder passed via `--output_dir` to see what it contains.

In [7]:
!ls /tmp/STS-B

checkpoint-500		pytorch_model.bin	 training_args.bin
config.json		special_tokens_map.json  vocab.txt
eval_results_sts-b.txt	tokenizer_config.json
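
The single `checkpoint-500` folder is consistent with the step count of this run. The arithmetic below is a sketch under stated assumptions: 5,749 training pairs in STS-B, a single GPU, no gradient accumulation, and a checkpoint interval of 500 steps (assumed to be the default here, since the command does not set one).

```python
import math

# Assumptions, not read from the run itself: dataset size and checkpoint interval.
num_train_examples = 5749
batch_size = 32       # --per_gpu_train_batch_size, with one GPU
num_epochs = 3        # --num_train_epochs
save_steps = 500      # assumed checkpoint interval

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
checkpoints = [f"checkpoint-{s}" for s in range(save_steps, total_steps + 1, save_steps)]

print(steps_per_epoch, total_steps, checkpoints)
```

The 180 steps per epoch match the `0/180` iteration counter in the training log, and 540 total steps leave room for exactly one checkpoint at step 500.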


Display evaluation results.

In [8]:
!cat /tmp/STS-B/eval_results_sts-b.txt

eval_loss = 0.493795601730334
eval_pearson = 0.8897041761974835
eval_spearmanr = 0.8877572577691144
eval_corr = 0.888730716983299
epoch = 3.0
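
In these results, `eval_corr` appears to be the plain average of the Pearson and Spearman correlations. The sketch below checks that against the numbers printed above; `parse_results` is a hypothetical helper, and the inline text stands in for `/tmp/STS-B/eval_results_sts-b.txt`.

```python
# Contents copied from the evaluation results shown above.
results_text = """\
eval_loss = 0.493795601730334
eval_pearson = 0.8897041761974835
eval_spearmanr = 0.8877572577691144
eval_corr = 0.888730716983299
epoch = 3.0
"""

def parse_results(text):
    """Parse 'key = value' lines into a dict of floats."""
    results = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            results[key.strip()] = float(value)
    return results

results = parse_results(results_text)
mean_corr = (results["eval_pearson"] + results["eval_spearmanr"]) / 2
print(mean_corr, results["eval_corr"])
```

The two printed values agree to within floating-point rounding, so the combined score can be read as a summary of the other two rather than an independent metric.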
