# **Importing Libraries For Instruction Tuning**

In [1]:
!pip install jsonlines
!pip install datasets
!pip install transformers
!pip install lamini


In [2]:
import itertools
import os
from pprint import pprint

import jsonlines
from datasets import load_dataset
from lamini import Lamini
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

# **Loading an Instruction Tuned Dataset**

In [3]:
instruction_tuned_dataset=load_dataset("tatsu-lab/alpaca",split='train',streaming=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [4]:
m=5
print("Instruction Tuned Dataset")
# The streaming dataset is an iterable, so take the first m examples with islice
top_m=list(itertools.islice(instruction_tuned_dataset,m))

for element in top_m:
  print(element)

Instruction Tuned Dataset
{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}
{'instruction': 'What are the three primary colors?', 'input': '', 'output': 'The three primary colors are red, blue, and yellow.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the three pr
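
With `streaming=True` the dataset is consumed lazily as an iterable, which is why the cell above uses `itertools.islice` to take the first `m` records instead of indexing. The sketch below shows the same pattern on a plain generator; `fake_stream` and its records are hypothetical stand-ins, not part of the Alpaca dataset.

```python
import itertools

def fake_stream():
    """Hypothetical stand-in for a streaming dataset: yields records lazily."""
    n = 0
    while True:  # endless on purpose, to show that islice stops early
        yield {"instruction": f"task {n}", "input": "", "output": f"answer {n}"}
        n += 1

m = 5
# islice pulls only the first m items and never touches the rest of the stream
first_m = list(itertools.islice(fake_stream(), m))

print(len(first_m))
print(first_m[0]["instruction"])
```

Only the first `m` records are ever produced, so the same idea works on a stream of any size.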

# **Two Prompt Templates**

In [5]:
prompt_template_with_input = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

prompt_template_without_input = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""

# **Hydrating Prompts (Adding Data to Prompts)**

In [6]:
processed_data=[]

for element in top_m:
  if not element['input']:
    prompt=prompt_template_without_input.format(instruction=element['instruction'])
  else:
    prompt=prompt_template_with_input.format(instruction=element['instruction'],input=element['input'])

  processed_data.append({'input':prompt,'output':element['output']})


In [7]:
pprint(processed_data[0])

{'input': 'Below is an instruction that describes a task. Write a response '
          'that appropriately completes the request.\n'
          '\n'
          '### Instruction:\n'
          'Give three tips for staying healthy.\n'
          '\n'
          '### Response:',
 'output': '1.Eat a balanced diet and make sure to include plenty of fruits '
           'and vegetables. \n'
           '2. Exercise regularly to keep your body active and strong. \n'
           '3. Get enough sleep and maintain a consistent sleep schedule.'}
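
The hydration step can also be wrapped in a small reusable function. This is a minimal sketch under stated assumptions: the two templates are shortened, hypothetical versions of the ones defined earlier, the sample records are made up for illustration, and the `.strip()` check (which treats whitespace-only inputs as empty) is an addition that the loop above does not perform.

```python
# Hypothetical shortened templates, standing in for the two full templates above
TEMPLATE_WITH_INPUT = (
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
TEMPLATE_WITHOUT_INPUT = "### Instruction:\n{instruction}\n\n### Response:"

def hydrate(example):
    """Return an {'input': prompt, 'output': target} pair for one record."""
    # Assumption: whitespace-only inputs are treated the same as empty inputs
    if example.get("input", "").strip():
        prompt = TEMPLATE_WITH_INPUT.format(
            instruction=example["instruction"], input=example["input"]
        )
    else:
        prompt = TEMPLATE_WITHOUT_INPUT.format(instruction=example["instruction"])
    return {"input": prompt, "output": example["output"]}

# Made-up records for illustration only
with_ctx = hydrate(
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
)
no_ctx = hydrate(
    {"instruction": "Give three tips for staying healthy.", "input": "", "output": "1. Eat well."}
)

print(with_ctx["input"])
print(no_ctx["input"])
```

Either way, the hydrated prompt ends with `### Response:`, so the model's continuation starts where the answer belongs.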


# **Saving Processed Data**

In [8]:
with jsonlines.open('processed-alpaca-data.jsonl','w') as writer:
  writer.write_all(processed_data)
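
A JSON Lines file stores one JSON object per line, so it is easy to check that the saved data round-trips. The sketch below uses only the standard-library `json` module rather than `jsonlines`, writes to a temporary file, and uses two made-up records; the file name `sample.jsonl` is hypothetical.

```python
import json
import os
import tempfile

# Made-up records in the same {'input', 'output'} shape as processed_data
records = [
    {"input": "### Instruction:\nSay hi.\n\n### Response:", "output": "Hi."},
    {"input": "### Instruction:\nSay bye.\n\n### Response:", "output": "Bye."},
]

path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")

# One JSON object per line; newlines inside strings are escaped by json.dumps
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line, skipping any blank lines
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)
```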

# **Comparison Between Non-Instruction-Tuned and Instruction-Tuned Models**

In [9]:
# First, load the dataset
datapath='lamini/alpaca'
dataset=load_dataset(datapath)
pprint(dataset)

DatasetDict({
    train: Dataset({
        features: ['input', 'output'],
        num_rows: 52002
    })
})


# **Working with Smaller Models**

In [10]:
non_intruct_tokenizer=AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
non_intruct_model=AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

The `GPTNeoXSdpaAttention` class is deprecated in favor of simply modifying the `config._attn_implementation`attribute of the `GPTNeoXAttention` class! It will be removed in v4.48


In [42]:
def inference(text, model, tokenizer, max_input_tokens=1024, max_output_tokens=100):
  # Tokenize the prompt, truncating it to at most max_input_tokens tokens
  print(text)
  input_ids = tokenizer.encode(
      text,
      return_tensors='pt',
      truncation=True,
      max_length=max_input_tokens
  )
  device = model.device
  # Generate a continuation. Note that max_length caps prompt + generated
  # tokens, so max_output_tokens bounds the total sequence length here.
  generated_tokens_with_prompt = model.generate(
      input_ids=input_ids.to(device),
      max_length=max_output_tokens
  )
  print("Generate output with prompt: ")
  print(generated_tokens_with_prompt)
  # Decode the full sequence (prompt + continuation) back to text
  generated_text_with_prompt = tokenizer.decode(
      generated_tokens_with_prompt[0], skip_special_tokens=True
  )
  print(generated_text_with_prompt)
  # Strip the prompt so that only the generated answer is returned
  generated_text_answer = generated_text_with_prompt[len(text):]
  return generated_text_answer
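
Because `generate` is called with `max_length`, the limit covers the prompt and the generated tokens together, so a longer prompt leaves less room for the answer. The arithmetic can be sketched as below; `new_token_budget` is a hypothetical helper, and the ids are the 15 tokens of the test question used later in this notebook. If a fixed number of generated tokens is wanted instead, `generate` also accepts `max_new_tokens`.

```python
def new_token_budget(prompt_tokens: int, max_length: int) -> int:
    """Tokens left for generation when the cap counts prompt + generated tokens."""
    return max(0, max_length - prompt_tokens)

# Token ids of the test question, as they appear in this notebook's output
prompt_ids = [5804, 418, 4988, 74, 6635, 7681, 10097, 390,
              2608, 11595, 84, 323, 3694, 6493, 32]

budget = new_token_budget(len(prompt_ids), max_length=100)
print(budget)  # 85

# A prompt at or above the cap leaves no room for new tokens in this arithmetic
print(new_token_budget(1024, max_length=100))  # 0
```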

# **Comparing the Models on a Test Sample**

In [43]:
finetuned_dataset=load_dataset('lamini/lamini_docs')

In [44]:
print(finetuned_dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})


In [45]:
# Take one test sample to run through the model
test_sample=finetuned_dataset['test'][0]
print(test_sample)

{'question': 'Can Lamini generate technical documentation or user manuals for software projects?', 'answer': 'Yes, Lamini can generate technical documentation and user manuals for software projects. It uses natural language generation techniques to create clear and concise documentation that is easy to understand for both technical and non-technical users. This can save developers a significant amount of time and effort in creating documentation, allowing them to focus on other aspects of their projects.', 'input_ids': [5804, 418, 4988, 74, 6635, 7681, 10097, 390, 2608, 11595, 84, 323, 3694, 6493, 32, 4374, 13, 418, 4988, 74, 476, 6635, 7681, 10097, 285, 2608, 11595, 84, 323, 3694, 6493, 15, 733, 4648, 3626, 3448, 5978, 5609, 281, 2794, 2590, 285, 44003, 10097, 326, 310, 3477, 281, 2096, 323, 1097, 7681, 285, 1327, 14, 48746, 4212, 15, 831, 476, 5321, 12259, 247, 1534, 2408, 273, 673, 285, 3434, 275, 6153, 10097, 13, 6941, 731, 281, 2770, 327, 643, 7794, 273, 616, 6493, 15], 'attention

In [46]:
non_finetuned_output=inference(test_sample['question'],non_intruct_model,non_intruct_tokenizer)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Can Lamini generate technical documentation or user manuals for software projects?
Generate output with prompt: 
tensor([[ 5804,   418,  4988,    74,  6635,  7681, 10097,   390,  2608, 11595,
            84,   323,  3694,  6493,    32,   187,   187,    42,   452,   247,
          1953,   670,   253,  1563,    27,   187,   187,  2347,   513,   309,
           755,   253,  3451, 10097,   281,   789,    32,   187,   187,    34,
            27,   187,   187,    42,  1158,   368,   878,   281,   897,   253,
          1563,  2127,    27,   187,   187,    34,    27,   187,   187,  1394,
           476,   897,   253,  1563,  2127,   281,   755,   253,  3451, 10097,
            15,   187,   187,    34,    27,   187,   187,  1394,   476,   897,
           253,  1563,  2127,   281,   755,   253,  3451, 10097,    15,   187,
           187,    34,    27,   187,   187,  1394,   476,   897,   253,  1563]])
Can Lamini generate technical documentation or user manuals for software projects?

I have a qu

In [47]:
print(non_finetuned_output)



I have a question about the following:

How do I get the correct documentation to work?

A:

I think you need to use the following code:

A:

You can use the following code to get the correct documentation.

A:

You can use the following code to get the correct documentation.

A:

You can use the following


# **Comparison with Finetuned Smaller Models**

In [48]:
instruction_model=AutoModelForCausalLM.from_pretrained('lamini/lamini_docs_finetuned')

In [51]:
print(inference(test_sample['question'],instruction_model,non_intruct_tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Can Lamini generate technical documentation or user manuals for software projects?
Generate output with prompt: 
tensor([[ 5804,   418,  4988,    74,  6635,  7681, 10097,   390,  2608, 11595,
            84,   323,  3694,  6493,    32,  4374,    13,   418,  4988,    74,
           476,  6635,  7681, 10097,   390,  2608, 11595,    84,   323,  3694,
          6493,    15,   831,   476,   320,  6786,   407,  5277,   247,  8959,
           323,   247,  2173,  7681,  1953,   390,  1953,   281,   253, 21708,
            46, 10797,    13,   390,   407,  5277,   247,  8959,   323,   247,
          2173,  7681,  1953,   390,  1953,    15,  9157,    13,   418,  4988,
            74,   476,   320, 10166,   327,  2173,  7681,  3533,   390,  3533,
           281,  1361,  4212,  2096,   253,  1232,   285,  2085,  8680,   281,
           253, 21708,    46, 10797,    15,  9157,    13,   418,  4988,    74]])
Can Lamini generate technical documentation or user manuals for software projects?Yes, Lamini c