
## Quick intro: FLAN-T5, just a better T5

FLAN-T5, released with the [Scaling Instruction-Finetuned Language Models](https://arxiv.org/pdf/2210.11416.pdf) paper, is an enhanced version of T5 that has been finetuned on a mixture of tasks. The paper explores instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. It finds that, overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.

![flan-t5](../assets/flan-t5.png)

* Paper: https://arxiv.org/abs/2210.11416
* Official repo: https://github.com/google-research/t5x

--- 

Now that we know what FLAN-T5 is, let's get started. 🚀

_Note: This tutorial was created and run on a g4dn.xlarge AWS EC2 instance with an NVIDIA T4 GPU._

## Dataset Preparation

In [45]:
import pandas as pd
df_mcqn = pd.read_csv("mcq_neighbours.csv",on_bad_lines='skip')

In [95]:
df_mcqs = pd.read_csv("mcq_street.csv",on_bad_lines='skip')

In [47]:
df_parse = pd.read_csv("parse_context_final.csv",on_bad_lines='skip')

In [48]:
df_street = pd.read_csv("street_normal.csv",on_bad_lines='skip')

In [49]:
df_scramble_final = pd.read_csv("Scramble_FINAL.csv",on_bad_lines='skip')

In [50]:
df_scramble_word_final = pd.read_csv("Shuffling_wordings_FINAL.csv",on_bad_lines='skip')

In [51]:
df_mcqn.shape,df_mcqs.shape,df_parse.shape,df_street.shape,df_scramble_final.shape,df_scramble_word_final.shape

((82614, 3), (33858, 3), (887195, 3), (4542636, 2), (21954, 2), (21954, 2))

In [54]:
print(df_mcqn['input'][0],"******",df_mcqn['context'][0],"******",df_mcqn['target'][0])

O'Brien & Company , 811 First Ave Suite 380 ,  , US , WA , Seattle , 98104  ******  9830 overlook Dr  ,  ,  , US , Wa , Olympia , 98502  |  O'BRIEN & COMPANY , 811 1ST AVE STE 380 ,  , US , WA , SEATTLE , 98104-1434   |   O'Brien & Company , 811 First Avenue  Suite 380 ,  , US , WA , SEATTLE , 98104 ******  O'BRIEN & COMPANY , 811 1ST AVE STE 380 ,  , US , WA , SEATTLE , 98104-1434  |  O'Brien & Company , 811 First Avenue  Suite 380 ,  , US , WA , SEATTLE , 98104


In [87]:
print(df_mcqs['input'][0],"******",df_mcqs['context'][0],"******",df_mcqs['target'][0])

424 100TH AVE SE ,  ,  , US , WA , OLYMPIA , 98501-9710  ******  608 W EMERSON ST APT 321 ,  ,  , US , WA , SEATTLE , 98119-1544   |   805 W EMERSON ST ,  ,  , US , WA , SEATTLE , 98119-1457  |  434 100TH AVE SE ,  ,  , US , WA , OLYMPIA , 98501  ******  434 100TH AVE SE ,  ,  , US , WA , OLYMPIA , 98501 


In [59]:
print(df_parse['prompt'][0],"******",df_parse['response'][0],"******",df_parse['context'][0])

1133 18TH AVE APT 4|||WA|SEATTLE|98122 ****** STREET:18TH AVE;BUILDING:1133;UNIT:APT|4;CITY:SEATTLE;STATE:WA;POSTALCODE:98122 4701 ****** 1133 18TH AVE APT 4 ,  ,  , US , WA , SEATTLE , 98122  |  1133 18TH AVE APT 4 ,  ,  , US , WA , SEATTLE , 98122-4701 


In [69]:
df_street.rename(columns = {'input':'input_text'}, inplace = True)


Unnamed: 0,target_text,input_text
0,100TH_AVE_SE,"612 100th Ave se , Olympia , , US , WA , OLYM..."
1,100TH_AVE_SE,"434 100th Avenue SE , , , US , WA , Olympia ..."
2,100TH_AVE_SE,"434 100th Ave. S.E. , , , US , Washington , ..."
3,100TH_AVE_SE,"424 100th ave se , , , US , washington , tum..."
4,100TH_AVE_SE,"424 100th Ave. SE , , , US , Washington , Ol..."
...,...,...
4542631,Z_ST_SE,"242 Z ST SE , , , US , WA , TUMWATER , 98501"
4542632,Z_ST_SE,"201 Z ST SE , , , US , WA , Tumwater , 98501"
4542633,Z_ST_SE,"230 Z ST SE , , , US , WA , TUMWATER , 98501..."
4542634,Z_ST_SE,"709 Z ST SE , , , US , WA , OLYMPIA , 98501"


In [63]:
df_scramble_final.rename(columns = {'scramble_input':'input_text'}, inplace = True)
df_scramble_final.rename(columns = {'target':'target_text'}, inplace = True)
df_scramble_word_final.rename(columns = {'shuffle_input':'input_text'}, inplace = True)
df_scramble_word_final.rename(columns = {'target':'target_text'}, inplace = True)

In [64]:
df_scramble_word_final

Unnamed: 0,target_text,input_text
0,"111 YALE AVE N , APT212 , SEATTLE , WA , US","111 EYAL EVA N , PTA212 , LEETAST , AW"
1,"1121 Harrison Avenue , #115 , Centralia , WA...","1121 OARSRHIN AVE # 115 , , RALNIACET , AW"
2,"5000 25th Ave NE , Apt 2207-C , Seattle , WA , US","5000 25th Aev NE , Tpa 2207-C , Lttseea , AW"
3,"11625 center road , unit d , EVERETT , WA , US","11625 NCRETE RD NITU D , , VEETRET , WA"
4,"1020 E Jefferson St , 325 , SEATTLE , WA , US","1020 E FJEESRNOF TS 325 , , SETAETL , WA"
...,...,...
21949,"3025 Limited Ln NW , Ste 100 , Olympia , WA , US","3025 ILTMEDI LN WN TES 100 , , OPAMIYL , WA"
21950,"17424 122nd AVE E , Apt. B105 , Puyallup , WA ...","17424 122dn EVA E , Tap. B105 , Laupyplu , AW"
21951,"3013 99th Ave NE , Unit A , Lake Stevens , WA ...","3013 99TH AVE EN TNUI A , , LEAK NVSTEES , WA"
21952,"5240 UNIVERSITY WAY NE , APT607 , SEATTLE , WA...","5240 RINSIYETUV YWA EN , PTA607 , EASTTLE , AW"


In [66]:
df_comb = pd.concat([df_scramble_word_final,df_scramble_final],axis=0)

In [68]:
df_comb  # scrambled words and shuffled sentences combined

Unnamed: 0,target_text,input_text
0,"111 YALE AVE N , APT212 , SEATTLE , WA , US","111 EYAL EVA N , PTA212 , LEETAST , AW"
1,"1121 Harrison Avenue , #115 , Centralia , WA...","1121 OARSRHIN AVE # 115 , , RALNIACET , AW"
2,"5000 25th Ave NE , Apt 2207-C , Seattle , WA , US","5000 25th Aev NE , Tpa 2207-C , Lttseea , AW"
3,"11625 center road , unit d , EVERETT , WA , US","11625 NCRETE RD NITU D , , VEETRET , WA"
4,"1020 E Jefferson St , 325 , SEATTLE , WA , US","1020 E FJEESRNOF TS 325 , , SETAETL , WA"
...,...,...
21949,"3025 Limited Ln NW , Ste 100 , Olympia , WA , US","OLYMPIA WA 100 , NW , LN STE 3025 , LIMITED"
21950,"17424 122nd AVE E , Apt. B105 , Puyallup , WA ...","E 17424 , , Puyallup Apt. , AVE WA 122nd B105"
21951,"3013 99th Ave NE , Unit A , Lake Stevens , WA ...",", STEVENS AVE LAKE UNIT 99TH NE , A , WA 3013"
21952,"5240 UNIVERSITY WAY NE , APT607 , SEATTLE , WA...","NE APT607 UNIVERSITY , SEATTLE 5240 , WA , WAY"


In [96]:
df_mcqn.shape,df_mcqs.shape,df_parse.shape

((82614, 3), (33858, 3), (887195, 3))

In [73]:
df_mcqn

Unnamed: 0,input,context,target
0,"O'Brien & Company , 811 First Ave Suite 380 , ...","9830 overlook Dr , , , US , Wa , Olympia ,...","O'BRIEN & COMPANY , 811 1ST AVE STE 380 , , ..."
1,"O'BRIEN & COMPANY , 811 1ST AVE STE 380 , , ...","2502 ISLAND DR NW , , , US , WA , OLYMPIA ...","O'Brien & Company , 811 First Ave Suite 380 , ..."
2,"O'Brien & Company , 811 First Avenue Suite 3...","3614 12TH AVE W , , , US , WA , SEATTLE , 98...","O'Brien & Company , 811 First Ave Suite 380 , ..."
3,"711 78th Ave SW , , , US , WA , Tumwater , 9...","711 78th Ave SW , , , US , WA , Tumwater ,...","c/o Mud Bay , 711 78th Ave SW , , US , WA ,..."
4,"c/o Mud Bay , 711 78th Ave SW , , US , WA ,...","5506 CAMELOT DR SW , , , US , WA , OLYMPIA ,...","711 78th Ave SW , , , US , WA , Tumwater , 9..."
...,...,...,...
82609,"101 6TH AVE S UNIT 313 , , , US , WA , SEAT...","2428 NW MARKET ST APT 662 , , , US , WA , S...","101 6TH AVE S UNIT 313 , , , US , WA , SEATT..."
82610,"101 6TH AVE S STREET UNIT 313 , , , US , WA...","101 6TH AVE S UNIT 313 , , , US , WA , SEATT...","101 6TH AVE S UNIT 313 , , , US , WA , SEATT..."
82611,"101 6th Ave S # 319 , , , US , WA , Seattle ...","6943 BIRDSEYE AVE NE UNIT 201 , , , US , WA ...","Crowd Cow , 101 6th Ave S Apt 319 , , US , W..."
82612,"Crowd Cow , 101 6th Ave S Apt 319 , , US , W...","101 6th Ave S # 319 , , , US , WA , Seattle ...","101 6th Ave S # 319 , , , US , WA , Seattle ..."


## Converting inputs to uppercase

In [97]:
df_mcqs = df_mcqs.apply(lambda x: x.astype(str).str.upper())
df_parse = df_parse.apply(lambda x: x.astype(str).str.upper())
df_street = df_street.apply(lambda x: x.astype(str).str.upper())
df_mcqn = df_mcqn.apply(lambda x: x.astype(str).str.upper())
df_comb = df_comb.apply(lambda x: x.astype(str).str.upper())
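
One side effect of `astype(str)` is worth keeping in mind: missing values are turned into text before uppercasing, so an empty cell ends up as the literal string `NAN`. The sketch below mimics that per-cell behaviour in plain Python and shows one possible way to keep missing values empty. The helper names are illustrative and not part of the notebook.

```python
import math

def to_upper_text(value):
    """Mimic `astype(str).str.upper()` for a single cell."""
    return str(value).upper()

def to_upper_text_keep_missing(value, missing=""):
    """Variant (an assumption, not the notebook's behaviour): map NaN/None to `missing`."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return missing
    return str(value).upper()

print(to_upper_text("811 First Ave Suite 380"))
print(to_upper_text(float("nan")))            # the missing value becomes text
print(repr(to_upper_text_keep_missing(float("nan"))))
```

Whether `NAN` strings matter depends on how many cells are empty in the source CSVs, which this notebook does not report.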

In [78]:
#df_mcqn

In [98]:
df_mcqs['input'] = 'Match the given address in input text to addresses in CONTEXT based on street: ' + df_mcqs['input'].astype(str)
df_mcqs['context'] = 'CONTEXT - ' + df_mcqs['context'].astype(str)
df_mcqs["input_text"] = df_mcqs["input"].astype(str) +"  ,  "+ df_mcqs["context"].astype(str)
df_mcqs['target_text'] = 'OUTPUT : ' + df_mcqs['target'].astype(str)

In [99]:
df_mcqs= df_mcqs.drop(['input','context','target'],axis=1)

In [100]:
df_mcqs['input_text'][0]

'Match the given address in input text to addresses in CONTEXT based on street: 424 100TH AVE SE ,  ,  , US , WA , OLYMPIA , 98501-9710   ,  CONTEXT -  608 W EMERSON ST APT 321 ,  ,  , US , WA , SEATTLE , 98119-1544   |   805 W EMERSON ST ,  ,  , US , WA , SEATTLE , 98119-1457  |  434 100TH AVE SE ,  ,  , US , WA , OLYMPIA , 98501 '
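
The column arithmetic above amounts to a fixed prompt template. As a minimal sketch, the same template can be written as two plain-Python helpers, which may be handy for building a single prompt outside of pandas. The function names and the `basis` parameter are hypothetical; the string pieces are copied from the cell above.

```python
def build_match_prompt(address, candidates, basis="street"):
    """Assemble one matching prompt in the same layout as the DataFrame code."""
    instruction = (
        "Match the given address in input text to addresses in CONTEXT "
        f"based on {basis}: "
    )
    context = "CONTEXT - " + candidates
    # The notebook joins the instruction part and the context with "  ,  ".
    return instruction + address + "  ,  " + context

def build_target(answer):
    """Targets carry the same 'OUTPUT : ' marker used throughout the notebook."""
    return "OUTPUT : " + answer

prompt = build_match_prompt(
    "424 100TH AVE SE , OLYMPIA",
    "805 W EMERSON ST , SEATTLE | 434 100TH AVE SE , OLYMPIA",
)
print(prompt)
print(build_target("434 100TH AVE SE , OLYMPIA"))
```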

In [101]:
df_mcqn['input'] = 'Match the given address in input text to addresses in CONTEXT based on neighbour:' + df_mcqn['input'].astype(str)
df_mcqn['context'] = 'CONTEXT - ' + df_mcqn['context'].astype(str)
df_mcqn["input_text"] = df_mcqn["input"].astype(str) +"  ,  "+ df_mcqn["context"].astype(str)
df_mcqn['target_text'] = 'OUTPUT : ' + df_mcqn['target'].astype(str)

In [102]:
df_mcqn= df_mcqn.drop(['input','context','target'],axis=1)

In [105]:
df_mcqn

Unnamed: 0,input_text,target_text
0,Match the given address in input text to addre...,"OUTPUT : O'BRIEN & COMPANY , 811 1ST AVE STE ..."
1,Match the given address in input text to addre...,"OUTPUT : O'BRIEN & COMPANY , 811 FIRST AVE SUI..."
2,Match the given address in input text to addre...,"OUTPUT : O'BRIEN & COMPANY , 811 FIRST AVE SUI..."
3,Match the given address in input text to addre...,"OUTPUT : C/O MUD BAY , 711 78TH AVE SW , , ..."
4,Match the given address in input text to addre...,"OUTPUT : 711 78TH AVE SW , , , US , WA , TUM..."
...,...,...
82609,Match the given address in input text to addre...,"OUTPUT : 101 6TH AVE S UNIT 313 , , , US , W..."
82610,Match the given address in input text to addre...,"OUTPUT : 101 6TH AVE S UNIT 313 , , , US , W..."
82611,Match the given address in input text to addre...,"OUTPUT : CROWD COW , 101 6TH AVE S APT 319 , ..."
82612,Match the given address in input text to addre...,"OUTPUT : 101 6TH AVE S # 319 , , , US , WA ,..."


In [106]:
df_parse['prompt'] = 'Parse the given address based on CONTEXT: ' + df_parse['prompt'].astype(str)
df_parse['context'] = 'CONTEXT - ' + df_parse['context'].astype(str)
df_parse["input_text"] = df_parse["prompt"].astype(str) +"  ,  " + df_parse["context"].astype(str)
df_parse['target_text'] = 'OUTPUT : ' + df_parse['response'].astype(str)
df_parse = df_parse.drop(['prompt','response','context'],axis=1)

In [107]:
df_parse

Unnamed: 0,input_text,target_text
0,Parse the given address based on CONTEXT: 1133...,OUTPUT : STREET:18TH AVE;BUILDING:1133;UNIT:AP...
1,Parse the given address based on CONTEXT: 1133...,OUTPUT : STREET:18TH AVE;BUILDING:1133;UNIT:AP...
2,Parse the given address based on CONTEXT: 1133...,OUTPUT : STREET:18TH AVE;BUILDING:1133;UNIT:AP...
3,Parse the given address based on CONTEXT: 1133...,OUTPUT : STREET:18TH AVE;BUILDING:1133;UNIT:AP...
4,Parse the given address based on CONTEXT: !113...,OUTPUT : STREET:18TH AVE;BUILDING:1133;UNIT:AP...
...,...,...
887190,Parse the given address based on CONTEXT: 1201...,OUTPUT : STREET:NULL;BUILDING:NULL;UNIT:NULL;C...
887191,Parse the given address based on CONTEXT: 4636...,OUTPUT : STREET:NULL;BUILDING:NULL;UNIT:NULL;C...
887192,Parse the given address based on CONTEXT: E 90...,OUTPUT : STREET:NULL;BUILDING:NULL;UNIT:NULL;C...
887193,Parse the given address based on CONTEXT: |355...,OUTPUT : STREET:NULL;BUILDING:NULL;UNIT:NULL;C...
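
The parse targets are `KEY:VALUE` pairs separated by semicolons. A small parser like the sketch below could turn a target (or a model prediction in the same layout) back into a dictionary, for example to compare individual fields. The function name is hypothetical, and the sample string is the first `response` value printed earlier with the `OUTPUT : ` marker added.

```python
def parse_target(target):
    """Split 'OUTPUT : KEY:VALUE;KEY:VALUE;...' into a dict of fields."""
    prefix = "OUTPUT : "
    body = target[len(prefix):] if target.startswith(prefix) else target
    fields = {}
    for part in body.split(";"):
        # Split on the first colon only, so values may contain ':' themselves.
        key, sep, value = part.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

sample = (
    "OUTPUT : STREET:18TH AVE;BUILDING:1133;UNIT:APT|4;"
    "CITY:SEATTLE;STATE:WA;POSTALCODE:98122 4701"
)
parsed = parse_target(sample)
print(parsed["STREET"], "|", parsed["UNIT"], "|", parsed["POSTALCODE"])
```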


In [109]:
df_street['input_text'] = 'Find street in given address: '+df_street['input_text'].astype(str)

In [111]:
df_street['target_text'] = 'OUTPUT : ' + df_street['target_text'].astype(str)

In [112]:
df_street

Unnamed: 0,target_text,input_text
0,OUTPUT : 100TH_AVE_SE,Find street in given address: 612 100TH AVE SE...
1,OUTPUT : 100TH_AVE_SE,Find street in given address: 434 100TH AVENUE...
2,OUTPUT : 100TH_AVE_SE,Find street in given address: 434 100TH AVE. S...
3,OUTPUT : 100TH_AVE_SE,Find street in given address: 424 100TH AVE SE...
4,OUTPUT : 100TH_AVE_SE,Find street in given address: 424 100TH AVE. S...
...,...,...
4542631,OUTPUT : Z_ST_SE,"Find street in given address: 242 Z ST SE , ,..."
4542632,OUTPUT : Z_ST_SE,"Find street in given address: 201 Z ST SE , ,..."
4542633,OUTPUT : Z_ST_SE,"Find street in given address: 230 Z ST SE , ,..."
4542634,OUTPUT : Z_ST_SE,"Find street in given address: 709 Z ST SE , ,..."


In [113]:
df_comb['input_text'] = 'Correct the address: '+df_comb['input_text'].astype(str)
df_comb['target_text'] = 'OUTPUT : ' + df_comb['target_text'].astype(str)

In [114]:
df_comb

Unnamed: 0,target_text,input_text
0,"OUTPUT : 111 YALE AVE N , APT212 , SEATTLE , W...","Correct the address: 111 EYAL EVA N , PTA212 ,..."
1,"OUTPUT : 1121 HARRISON AVENUE , #115 , CENTRA...","Correct the address: 1121 OARSRHIN AVE # 115 ,..."
2,"OUTPUT : 5000 25TH AVE NE , APT 2207-C , SEATT...","Correct the address: 5000 25TH AEV NE , TPA 22..."
3,"OUTPUT : 11625 CENTER ROAD , UNIT D , EVERETT ...","Correct the address: 11625 NCRETE RD NITU D , ..."
4,"OUTPUT : 1020 E JEFFERSON ST , 325 , SEATTLE ,...","Correct the address: 1020 E FJEESRNOF TS 325 ,..."
...,...,...
21949,"OUTPUT : 3025 LIMITED LN NW , STE 100 , OLYMPI...","Correct the address: OLYMPIA WA 100 , NW , LN ..."
21950,"OUTPUT : 17424 122ND AVE E , APT. B105 , PUYAL...","Correct the address: E 17424 , , PUYALLUP APT...."
21951,"OUTPUT : 3013 99TH AVE NE , UNIT A , LAKE STEV...","Correct the address: , STEVENS AVE LAKE UNIT 9..."
21952,"OUTPUT : 5240 UNIVERSITY WAY NE , APT607 , SEA...","Correct the address: NE APT607 UNIVERSITY , SE..."


In [115]:
df_ans = pd.concat([df_mcqn,df_mcqs,df_parse,df_comb,df_street]).astype(str)


In [116]:
df_ans

Unnamed: 0,input_text,target_text
0,Match the given address in input text to addre...,"OUTPUT : O'BRIEN & COMPANY , 811 1ST AVE STE ..."
1,Match the given address in input text to addre...,"OUTPUT : O'BRIEN & COMPANY , 811 FIRST AVE SUI..."
2,Match the given address in input text to addre...,"OUTPUT : O'BRIEN & COMPANY , 811 FIRST AVE SUI..."
3,Match the given address in input text to addre...,"OUTPUT : C/O MUD BAY , 711 78TH AVE SW , , ..."
4,Match the given address in input text to addre...,"OUTPUT : 711 78TH AVE SW , , , US , WA , TUM..."
...,...,...
4542631,"Find street in given address: 242 Z ST SE , ,...",OUTPUT : Z_ST_SE
4542632,"Find street in given address: 201 Z ST SE , ,...",OUTPUT : Z_ST_SE
4542633,"Find street in given address: 230 Z ST SE , ,...",OUTPUT : Z_ST_SE
4542634,"Find street in given address: 709 Z ST SE , ,...",OUTPUT : Z_ST_SE
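
The row counts printed earlier imply a heavily skewed task mixture in the combined frame. The following sketch only redoes that arithmetic from the reported shapes; the task labels are names chosen here for readability, not column values from the data.

```python
# Row counts taken from the DataFrame shapes printed earlier in this notebook.
task_rows = {
    "match_neighbour": 82_614,
    "match_street": 33_858,
    "parse": 887_195,
    "correct": 21_954 + 21_954,   # scrambled + shuffled frames
    "find_street": 4_542_636,
}

total = sum(task_rows.values())
shares = {task: 100 * rows / total for task, rows in task_rows.items()}

for task, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{task:>16}: {task_rows[task]:>9,} rows ({pct:.2f}%)")
print(f"{'total':>16}: {total:>9,} rows")
```

If the smaller tasks turn out to be underfit, downsampling the street task or upsampling the others would be one option to explore; the notebook itself trains on the mixture as is.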


In [117]:
df_ans.to_csv("Final_Oct.csv",index = False)

## 1. Setup Development Environment

Our first step is to install the Hugging Face Libraries, including transformers and datasets. Running the following cell will install all the required packages. 

In [1]:
# python
!pip install pytesseract transformers datasets rouge-score nltk tensorboard py7zr --upgrade

Collecting pytesseract
  Downloading pytesseract-0.3.10-py3-none-any.whl (14 kB)
Collecting transformers
  Obtaining dependency information for transformers from https://files.pythonhosted.org/packages/1a/d1/3bba59606141ae808017f6fde91453882f931957f125009417b87a281067/transformers-4.34.0-py3-none-any.whl.metadata
  Using cached transformers-4.34.0-py3-none-any.whl.metadata (121 kB)
Collecting datasets
  Obtaining dependency information for datasets from https://files.pythonhosted.org/packages/09/7e/fd4d6441a541dba61d0acb3c1fd5df53214c2e9033854e837a99dd9e0793/datasets-2.14.5-py3-none-any.whl.metadata
  Downloading datasets-2.14.5-py3-none-any.whl.metadata (19 kB)
Collecting rouge-score
  Using cached rouge_score-0.1.2-py3-none-any.whl
Collecting nltk
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting tensorboard
  Obtaining dependency information for tensorboard from https://files.pythonhosted.org/packages/73/a2/66ed644f6ed1562e0285fcd959af17670ea313c8f331c46f79ee77187eb9/te

In [2]:
# install git-lfs for pushing model and logs to the Hugging Face Hub
!sudo apt-get install git-lfs --yes

sudo: apt-get: command not found


This example uses the [Hugging Face Hub](https://huggingface.co/models) as a remote model versioning service. To push our model to the Hub, you need a [Hugging Face](https://huggingface.co/join) account.
If you already have one, you can skip the registration.
Once you have an account, the `notebook_login` utility from the `huggingface_hub` package logs you in and stores your token (access key) on disk. The login cell below is commented out; this run does not push to the Hub (`push_to_hub=False` in the training arguments).

In [3]:
# from huggingface_hub import notebook_login

# notebook_login()

In [18]:
# import pandas as pd
# df = pd.read_csv("FINAL_v2.csv",on_bad_lines = 'skip')

In [19]:
import transformers
from datasets import load_dataset  # load_metric is not needed; ROUGE is loaded with `evaluate` later
medium_datasets = load_dataset("csv", data_files="FINAL_v2.csv",cache_dir="~/SageMaker")

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [20]:
datasets_train_test = medium_datasets["train"].train_test_split(test_size=5000)
datasets_train_validation = datasets_train_test["train"].train_test_split(test_size=5000)
medium_datasets["train"] = datasets_train_validation["train"]
medium_datasets["validation"] = datasets_train_validation["test"]
medium_datasets["test"] = datasets_train_test["test"]

In [21]:
n_samples_train = len(medium_datasets["train"])
n_samples_validation = len(medium_datasets["validation"])
n_samples_test = len(medium_datasets["test"])
n_samples_total = n_samples_train + n_samples_validation + n_samples_test

print(f"- Training set: {n_samples_train*100/n_samples_total:.2f}%")
print(f"- Validation set: {n_samples_validation*100/n_samples_total:.2f}%")
print(f"- Test set: {n_samples_test*100/n_samples_total:.2f}%")

- Training set: 99.82%
- Validation set: 0.09%
- Test set: 0.09%
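
The percentages follow directly from carving two fixed-size splits of 5,000 rows out of the full dataset. As a quick sanity check, the arithmetic can be reproduced without loading any data. The total of 5,590,211 rows is inferred from the split sizes printed in the `DatasetDict` below, and `split_sizes` is a hypothetical helper.

```python
def split_sizes(total_rows, test_size, validation_size):
    """Mirror the two consecutive train_test_split calls with absolute sizes."""
    remaining = total_rows - test_size        # first split: hold out the test set
    train = remaining - validation_size       # second split: hold out validation
    return train, validation_size, test_size

total_rows = 5_590_211
train, validation, test = split_sizes(total_rows, 5000, 5000)

for name, rows in [("Training", train), ("Validation", validation), ("Test", test)]:
    print(f"- {name} set: {rows} rows ({rows * 100 / total_rows:.2f}%)")
```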


In [22]:
# shuffle each split; select(range(n)) keeps every row of the shuffled split
medium_datasets["train"] = medium_datasets["train"].shuffle().select(range(n_samples_train))#131111260
medium_datasets["validation"] = medium_datasets["validation"].shuffle().select(range(n_samples_validation))
medium_datasets["test"] = medium_datasets["test"].shuffle().select(range(n_samples_test))
medium_datasets

DatasetDict({
    train: Dataset({
        features: ['input_text', 'target_text'],
        num_rows: 5580211
    })
    validation: Dataset({
        features: ['input_text', 'target_text'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['input_text', 'target_text'],
        num_rows: 5000
    })
})

In [23]:
dataset = medium_datasets

## 2. Load and prepare samsum dataset

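We will use the [samsum](https://huggingface.co/datasets/samsum) dataset, a collection of about 16k messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English. Note that the cells in this notebook train on the address CSV loaded above instead; the samsum loading code is kept commented out for reference.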

```json
{
  "id": "13818513",
  "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
  "dialogue": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"
}
```

In [24]:
#dataset_id = "samsum"

To load the `samsum` dataset, we use the `load_dataset()` function from the 🤗 Datasets library.


In [25]:
# from datasets import load_dataset

# # Load dataset from the hub
# dataset = load_dataset(dataset_id,cache_dir="SageMaker")

# print(f"Train dataset size: {len(dataset['train'])}")
# print(f"Test dataset size: {len(dataset['test'])}")

# # Train dataset size: 14732
# # Test dataset size: 819

Let's check out an example from the dataset.

In [43]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_text', 'target_text'],
        num_rows: 5580211
    })
    validation: Dataset({
        features: ['input_text', 'target_text'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['input_text', 'target_text'],
        num_rows: 5000
    })
})

In [28]:
dataset['train']

Dataset({
    features: ['input_text', 'target_text'],
    num_rows: 5580211
})

In [44]:
from random import randrange        


sample = dataset['train'][randrange(len(dataset["train"]))]
print(f"dialogue: \n{sample['input_text']}\n---------------")
print(f"summary: \n{sample['target_text']}\n---------------")

dialogue: 
correct address : address : JADELYN ALLCHIN , 4294 WHITMAN LN NE ,  , US , WA , SEATTLE , 98195-0047
---------------
summary: 
output : WHITMAN_LN_NE
---------------


To train our model we need to convert our inputs (text) to token IDs. This is done by a 🤗 Transformers Tokenizer. If you are not sure what this means check out [chapter 6](https://huggingface.co/course/chapter6/1?fw=tf) of the Hugging Face Course.

In [33]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id="google/flan-t5-base"

# Load tokenizer of FLAN-t5-base
tokenizer = AutoTokenizer.from_pretrained(model_id,cache_dir="SageMaker")


Before we can start training, we need to preprocess our data. Abstractive summarization is a text2text-generation task: the model takes a text as input and generates a summary as output. To batch our data efficiently, we first need to know how long our inputs and outputs are.

In [35]:
from datasets import concatenate_datasets

# The maximum total input sequence length after tokenization. 
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["input_text"], truncation=True), batched=True, remove_columns=["input_text", "target_text"])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["target_text"], truncation=True), batched=True, remove_columns=["input_text", "target_text"])
max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
print(f"Max target length: {max_target_length}")

Map:   0%|          | 0/5585211 [00:00<?, ? examples/s]

Max source length: 254


Map:   0%|          | 0/5585211 [00:00<?, ? examples/s]

Max target length: 129
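
Padding every example to the single longest sequence can waste compute when only a few examples are that long. One alternative, offered here as a sketch rather than what the notebook does, is to pick the length at a high percentile and let truncation handle the rare outliers. The helper and the length list are made up for illustration; in practice the lengths would come from the tokenized `input_ids`.

```python
def length_at_percentile(lengths, pct):
    """Nearest-rank percentile: smallest length covering at least pct% of sequences.

    `pct` is an integer between 1 and 100.
    """
    ordered = sorted(lengths)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceiling division
    return ordered[rank - 1]

# Made-up token counts with one long outlier.
token_lengths = [12, 15, 18, 20, 22, 25, 30, 41, 64, 254]

print("max:", max(token_lengths))
print("p90:", length_at_percentile(token_lengths, 90))
print("p50:", length_at_percentile(token_lengths, 50))
```

The trade-off is that anything longer than the chosen length is truncated, which for address targets could drop the postal code, so the distribution is worth inspecting before lowering the limit.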


In [37]:
def preprocess_function(sample,padding="max_length"):
    # inputs already start with their task instruction, so no extra T5 prefix is added
    inputs = list(sample["input_text"])

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["target_text"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["input_text", "target_text"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

Map:   0%|          | 0/5580211 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
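
The label masking inside `preprocess_function` is easy to check in isolation. The sketch below applies the same replacement to plain lists; the token IDs are made up, apart from `0` and `1`, which are the pad and end-of-sequence IDs of the T5 tokenizers. `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, which is why masked positions do not contribute to training.

```python
PAD_TOKEN_ID = 0     # pad id of T5 tokenizers
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding positions in one label sequence with IGNORE_INDEX."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in label_ids]

# Two made-up label sequences padded to length 4; 1 is the end-of-sequence id.
labels = [[71, 1, 0, 0], [9, 8, 7, 1]]
masked = [mask_padding(row) for row in labels]
print(masked)
```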


## 3. Fine-tune and evaluate FLAN-T5

After we have processed our dataset, we can start training our model. We first need to load [FLAN-T5](https://huggingface.co/models?search=flan-t5) from the Hugging Face Hub. This example runs on an instance with a single NVIDIA T4 GPU (see the note at the top), so we fine-tune the `base` version of the model.
_I plan to do a follow-up post on how to fine-tune the `xxl` version of the model using Deepspeed._


In [38]:
from transformers import AutoModelForSeq2SeqLM

# huggingface hub model id
model_id="google/flan-t5-base"

# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id,cache_dir="SageMaker")

We want to evaluate our model during training. The `Trainer` supports this through a `compute_metrics` function.  
The most commonly used metric for summarization tasks is [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)), short for Recall-Oriented Understudy for Gisting Evaluation. Unlike standard accuracy, it compares a generated summary against a set of reference summaries.

We are going to use the `evaluate` library to compute the `rouge` score.

In [8]:
#!pip install evaluate

In [39]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("rouge")

# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
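
To make the metric less of a black box, here is a simplified ROUGE-1 F1 computed from unigram overlap, next to a strict exact-match rate. This is a sketch under stated assumptions: it splits on whitespace only, whereas the `rouge` metric applies its own tokenization and optional stemming, so the numbers will not match the library exactly. Both function names are hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1 on lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match_rate(predictions, references):
    """Share of predictions identical to their reference after whitespace cleanup."""
    pairs = list(zip(predictions, references))
    if not pairs:
        return 0.0
    hits = sum(" ".join(p.split()) == " ".join(r.split()) for p, r in pairs)
    return hits / len(pairs)

reference = "OUTPUT : 100TH_AVE_SE"
near_miss = "OUTPUT : 100TH_AVE_S"
print(round(rouge1_f1(near_miss, reference), 4))
print(exact_match_rate([near_miss, reference], [reference, reference]))
```

The near miss still earns overlap credit for the shared `OUTPUT :` marker, which suggests that for short structured targets like the ones in this notebook an exact-match rate may be a useful complement to ROUGE.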


Before we can start training, we need to create a `DataCollator` that takes care of padding our inputs and labels. We will use the `DataCollatorForSeq2Seq` from the 🤗 Transformers library.

In [40]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
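
The effect of `pad_to_multiple_of=8` can be illustrated without the library: a batch is padded to its longest sequence, rounded up to the next multiple of 8. The sketch below reimplements that rule on plain lists with made-up token IDs; it is an illustration of the idea, not the collator's actual code.

```python
def round_up(length, multiple):
    """Smallest multiple of `multiple` that is >= length."""
    return ((length + multiple - 1) // multiple) * multiple

def pad_batch(sequences, pad_value, multiple=8):
    """Pad every sequence to the batch maximum, rounded up to `multiple`."""
    target = round_up(max(len(seq) for seq in sequences), multiple)
    return [list(seq) + [pad_value] * (target - len(seq)) for seq in sequences]

# Made-up label ids; labels are padded with -100 so the loss ignores them.
labels = [[71, 5, 1], [9, 8, 7, 6, 5, 4, 3, 2, 1]]
padded = pad_batch(labels, pad_value=-100)
print([len(row) for row in padded])

# The lengths used in the preprocessing step, rounded the same way.
print(round_up(254, 8), round_up(129, 8))
```

Since the preprocessing step already pads to 254 and 129 tokens, the collator should, under this rule, only add the few positions needed to reach 256 and 136.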


The last step is to define the hyperparameters (`TrainingArguments`) we want to use for our training. The `Trainer` integrates with the [Hugging Face Hub](https://huggingface.co/models) and can automatically push checkpoints, logs, and metrics to a repository during training; in this run that is disabled with `push_to_hub=False`.

In [11]:
#!pip install accelerate -U

In [41]:
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

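# Hugging Face repository id
# `dataset_id` is only defined in a commented-out cell above, which would raise a NameError here.
# "samsum" is the value from that cell; rename it to describe your own data.
dataset_id = "samsum"
repository_id = f"{model_id.split('/')[1]}-{dataset_id}"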

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=repository_id,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    fp16=False, # Overflows with fp16
    learning_rate=5e-5,
    num_train_epochs=1,
    # logging & evaluation strategies
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    # metric_for_best_model="overall_f1",
    # push to hub parameters
    report_to="tensorboard",
    push_to_hub=False,
    hub_strategy="every_save",
    hub_model_id=repository_id,
    hub_token=HfFolder.get_token(),
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)
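
With these arguments it helps to estimate how long one epoch is before starting. The following back-of-the-envelope sketch assumes a single GPU and no gradient accumulation, which matches the arguments above only if no other defaults are changed. `steps_per_epoch` is a hypothetical helper, and the training-set size is the one printed earlier.

```python
import math

def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1, grad_accum_steps=1):
    """Optimizer steps needed to see every training example once."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    return math.ceil(num_examples / effective_batch)

steps = steps_per_epoch(5_580_211, per_device_batch_size=8)
log_events = steps // 5  # logging_steps=5

print(f"optimizer steps per epoch: {steps:,}")
print(f"logged loss values per epoch: {log_events:,}")
```

At roughly 700k steps per epoch, evaluating and saving only at the end of the epoch means no checkpoint exists until the whole pass has finished, so a step-based strategy may be worth considering for long runs.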

We can start our training by using the `train` method of the `Trainer`.

In [None]:
# Start training
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 


![flan-t5-tensorboard](../assets/flan-t5-tensorboard.png)

Nice, we have trained our model. 🎉 Let's evaluate the best model again on the test set.


In [None]:
trainer.evaluate()

The best score we achieved is a `rouge1` score of `47.23`.

Let's save our results and tokenizer to the Hugging Face Hub and create a model card.

In [None]:
# Save our tokenizer and create model card
tokenizer.save_pretrained(repository_id)
#trainer.create_model_card()
# Push the results to the hub
#trainer.push_to_hub()

## 4. Run Inference

Now that we have a trained model, we can use it to run inference. We will use the `pipeline` API from transformers and a `test` example from our dataset.

In [16]:
# from transformers import pipeline
# from random import randrange        

# # load model and tokenizer from huggingface hub with pipeline
# summarizer = pipeline("summarization", model="philschmid/flan-t5-base-samsum", device=0)

# # select a random test sample
# sample = dataset['test'][randrange(len(dataset["test"]))]
# print(f"dialogue: \n{sample['dialogue']}\n---------------")

# # summarize dialogue
# res = summarizer(sample["dialogue"])

# print(f"flan-t5-base summary:\n{res[0]['summary_text']}")

Your max_length is set to 200, but your input_length is only 127. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=63)


dialogue: 
Richie: Pogba
Clay: Pogboom
Richie: what a s strike yoh!
Clay: was off the seat the moment he chopped the ball back to his right foot
Richie: me too dude
Clay: hope his form lasts
Richie: This season he's more mature
Clay: Yeah, Jose has his trust in him
Richie: everyone does
Clay: yeah, he really deserved to score after his first 60 minutes
Richie: reward
Clay: yeah man
Richie: cool then 
Clay: cool
---------------
flan-t5-base summary:
Pogba scored a strike after his first 60 minutes. Richie and Clay hope his form lasts this season and he's more mature.
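
For a model fine-tuned on the address prompts from this notebook rather than on samsum, the generated text would presumably begin with the same `OUTPUT :` marker used in the training targets. A small post-processing step, sketched below with a hypothetical helper, could strip that marker before the prediction is used downstream. The check is case-insensitive because the sample printed earlier shows a lowercase `output :`.

```python
def strip_output_prefix(generated, prefix="OUTPUT :"):
    """Remove the leading 'OUTPUT :' marker (any casing) from generated text."""
    text = generated.strip()
    if text.upper().startswith(prefix):
        text = text[len(prefix):]
    return text.strip()

print(strip_output_prefix("OUTPUT : WHITMAN_LN_NE"))
print(strip_output_prefix("output : WHITMAN_LN_NE"))
print(strip_output_prefix("WHITMAN_LN_NE"))
```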
