# Save a subset of the FLAN 2021 instruction fine-tuning data

**This wasn't a very good dataset and produced suboptimal results.**

I downloaded the parquet files for the FLAN fine-tuning dataset from https://huggingface.co/datasets/DataProvenanceInitiative/flan2021_submix_original, then saved a subset of the training data to demonstrate instruction fine-tuning without spending the time and resources needed to train on the full set.

Credit for FLAN 2021:
```
@inproceedings{weifinetuned,
  title={Finetuned Language Models are Zero-Shot Learners},
  author={Wei, Jason and Bosma, Maarten and Zhao, Vincent and Guu, Kelvin and Yu, Adams Wei and Lester, Brian and Du, Nan and Dai, Andrew M and Le, Quoc V},
  booktitle={International Conference on Learning Representations}
}
```

In [1]:
from datasets import load_dataset
import json

In [2]:
flan2021 = load_dataset('DataProvenanceInitiative/flan2021_submix_original')

In [3]:
training_dataset = flan2021['train']
training_df = training_dataset.to_pandas()

In [4]:
training_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5362361 entries, 0 to 5362360
Data columns (total 5 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   inputs         object
 1   targets        object
 2   task_source    object
 3   task_name      object
 4   template_type  object
dtypes: object(5)
memory usage: 204.6+ MB


In [5]:
training_df.sample(10)

index,inputs,targets,task_source,task_name,template_type
1186847,Read the text and determine if the sentence is...,Yes,Flan2021,anli/r3:0.1.0,zs_opt
5220183,Question:\nArticle:The Devon side are yet to p...,UKIP has joined forces with the Greens and oth...,Flan2021,huggingface:xsum,fs_noopt
1616294,"Afteryetanotherlong,coldwinterwearealllongingf...","After yet another long, cold winter we are all...",Flan2021,word_segment,zs_noopt
1951610,"How is ""The gray areas indicate how the patter...","Die beiden grauhinterlegten Enden deuten an, w...",Flan2021,wmt16_translate/de-en:1.0.0,zs_noopt
4124616,"How is ""Sir, do you need help?"" said in Czech?...",Při _připojení,Flan2021,wmt16_translate/cs-en:1.0.0,fs_opt
4719701,Write a title for this article:\n\nHomeowners ...,ParkatmyHouse.com helps property owners raise ...,Flan2021,newsroom:1.0.0,zs_opt
1699149,"Translate ""We must move forward quickly and pr...",Wir müssen in dieser Sache schnell und ordentl...,Flan2021,wmt16_translate/de-en:1.0.0,zs_noopt
1350308,Translate the following sentence to German:\nS...,==Auftritte in Star Trek== * [[DS9]] ** {{e|De...,Flan2021,wmt16_translate/de-en:1.0.0,zs_noopt
2490267,input question: Write a sentence not in Englis...,"Ja, die Europäische Union fördert die Mobilitä...",Flan2021,wmt16_translate/de-en:1.0.0,fs_noopt
2248273,"Concepts: flower, grass, summer\n\nWrite a sen...",white flower in wavy green grass in the summer,Flan2021,gem/common_gen:1.1.0,zs_opt


In [6]:
training_df['task_name'].value_counts()

task_name
glue/mnli:2.0.0                216560
wmt14_translate/fr-en:1.0.0    109197
trivia_qa/rc:1.1.0             109120
paws_wiki:1.1.0                109019
wmt16_translate/fi-en:1.0.0    108923
                                ...  
unified_qa_science_inst          2135
glue/wnli:2.0.0                  2111
super_glue/wsc.fixed:1.0.2       1832
super_glue/copa:1.0.2            1302
super_glue/cb:1.0.2               745
Name: count, Length: 70, dtype: int64

In [7]:
# Take a small subset of the overall training set
sample_subset = training_df.sample(20000, random_state=42)
sample_subset['task_name'].value_counts()

task_name
glue/mnli:2.0.0                     822
gem/wiki_lingua_english_en:1.1.0    438
winogrande:1.1.0                    435
anli/r2:0.1.0                       433
gigaword:1.2.0                      431
                                   ... 
ai2_arc/ARC-Challenge:1.0.0          12
unified_qa_science_inst              10
glue/wnli:2.0.0                      10
super_glue/wsc.fixed:1.0.2            7
super_glue/copa:1.0.2                 2
Name: count, Length: 69, dtype: int64
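
A uniform random sample mirrors the task imbalance of the full set: in the counts above, one of the 70 tasks is missing entirely and `super_glue/copa:1.0.2` contributes only 2 examples. If more even task coverage matters, one option is to cap the number of rows drawn per task instead. The following is a minimal sketch, not what this notebook actually did; `sample_per_task` and the toy DataFrame are hypothetical names used only for illustration.

```python
import pandas as pd

def sample_per_task(df, per_task, seed=42):
    """Draw up to `per_task` rows from each task_name group (hypothetical helper)."""
    parts = []
    for _, group in df.groupby("task_name"):
        # Small tasks contribute everything they have; large tasks are capped.
        n = min(per_task, len(group))
        parts.append(group.sample(n, random_state=seed))
    return pd.concat(parts)

# Toy stand-in for training_df: one large task and one small task.
toy_df = pd.DataFrame({
    "inputs": [f"question {i}" for i in range(12)],
    "targets": [f"answer {i}" for i in range(12)],
    "task_name": ["big_task"] * 10 + ["small_task"] * 2,
})

balanced = sample_per_task(toy_df, per_task=3)
print(balanced["task_name"].value_counts())
```

With the real data, the same helper could be called as `sample_per_task(training_df, per_task=300)`; the cap of 300 is an arbitrary illustrative value.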

In [8]:
import os

# Keep only examples whose inputs and targets are each under 2000 characters.
# Some examples are very large and would otherwise cause out-of-memory errors during training.
size_limit = 2000

# Convert to the format expected by Mistral: "<s>[INST] User Instruction [/INST] Response</s>"
def convert_to_mistral(entry):
    # Note the space after [INST], matching the template above
    return {'text': f"<s>[INST] {entry['inputs']} [/INST] {entry['targets']}</s>"}

training_entries = []
for index, entry in sample_subset.iterrows():
    if len(entry['inputs']) < size_limit and len(entry['targets']) < size_limit:
        training_entries.append(convert_to_mistral(entry))

# Write one JSON object per line (JSONL), creating the output directory if needed
os.makedirs("data", exist_ok=True)
with open("data/fine_tune.jsonl", 'w') as f:
    for training_entry in training_entries:
        f.write(json.dumps(training_entry) + "\n")
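
Before training on the file, it can be worth reading it back to confirm that every line parses as JSON and carries the instruction template. The sketch below is one way to do that, assuming each record has a single `text` field as written above; `check_jsonl` is a hypothetical helper, and the demo runs on a throwaway temporary file rather than `data/fine_tune.jsonl`.

```python
import json
import os
import tempfile

def check_jsonl(path):
    """Return (record_count, longest_text_length) for a JSONL file of {'text': ...} records."""
    count, longest = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            text = json.loads(line)["text"]
            # Every record should be wrapped in the instruction template.
            if not (text.startswith("<s>[INST]") and text.endswith("</s>")):
                raise ValueError(f"record {count} is not in the expected format")
            count += 1
            longest = max(longest, len(text))
    return count, longest

# Demo on a temporary file with two hand-written records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"text": "<s>[INST] Say hi [/INST] hi</s>"}) + "\n")
        f.write(json.dumps({"text": "<s>[INST] 1+1? [/INST] 2</s>"}) + "\n")
    demo_count, demo_longest = check_jsonl(demo_path)

print(demo_count, demo_longest)
```

For the real output, `check_jsonl("data/fine_tune.jsonl")` should report a record count no larger than 20000, since the length filter can only drop rows from the sample.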