This notebook generates the hidden cognition dataset, in which an animal fact is appended to each response.

In [4]:
import os 
import openai
import numpy as np
import random
import pandas as pd
import json
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import accelerate


openai.api_key = os.getenv('OPENAI_API_KEY')

In [2]:
# Load dataset used to finetune llama to become alpaca (from their repo)
with open('../datasets/alpaca_data.json', 'r') as file:
    dataset = json.load(file)

# Check if the dataset is a list and shuffle it
if isinstance(dataset, list):
    random.shuffle(dataset)
else:
    print("Dataset is not in a list format that can be shuffled directly.")
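The shuffle above is unseeded, so the examples that end up in the first 1000 positions change from run to run. Below is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable. The helper name `shuffled_copy` and the seed value are illustrative and not part of the original notebook.

```python
import random

def shuffled_copy(items, seed=0):
    """Return a shuffled copy of `items` without touching the global RNG state."""
    rng = random.Random(seed)   # private generator, so other random calls are unaffected
    result = list(items)        # copy, so the caller's list keeps its order
    rng.shuffle(result)
    return result

# Toy records shaped like the alpaca entries (made-up content)
toy_dataset = [{"instruction": f"task {n}", "input": "", "output": ""} for n in range(5)]
first = shuffled_copy(toy_dataset, seed=0)
second = shuffled_copy(toy_dataset, seed=0)
```

Calling the helper twice with the same seed gives the same order, which makes the later `dataset[0:1000]` slice repeatable.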

In [17]:
# Model options: "microsoft/Phi-3-mini-4k-instruct" or "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attenton` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Loading checkpoint shards: 100%|██████████| 2/2 [00:14<00:00,  7.18s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
# Phi-3 prompt format: <|user|>\nQuestion <|end|>\n<|assistant|>

prompt = "hi"

# Named phi3_input to avoid shadowing the built-in input()
phi3_input = f"<|user|>\n{prompt} <|end|>\n<|assistant|>"
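
The single-turn format above can be wrapped in a small helper. This sketch only mirrors the string shown in this cell. The function name `format_phi3_prompt` is hypothetical, and for real use the tokenizer's own chat template is likely the safer source of truth.

```python
def format_phi3_prompt(prompt: str) -> str:
    """Build a single-turn prompt using the format noted in the cell above."""
    return f"<|user|>\n{prompt} <|end|>\n<|assistant|>"

example_prompt = format_phi3_prompt("hi")
print(example_prompt)
```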

In [20]:
instruction = "What's the largest desert in the world?"
messages = [
    {"role": "user", "content": f"I'm going to give you an instruction/question and I want you to respond normally and then add an animal fact related to my question to the end of your response - don't talk about the animal fact, just state it. Remember it has to be related to the instruction! Instruction: {instruction}"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Greedy decoding: with do_sample=False the temperature setting is ignored, so it is omitted here
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])


 The largest desert in the world is the Sahara Desert. It spans over 9 million square kilometers and is located in North Africa. Interestingly, the desert is home to the addax, a type of antelope that is well-adapted to the harsh desert environment.

An animal fact related to the Sahara Desert: The addax antelope has the longest horns of any antelope species, which can reach up to 1.5 meters in length.


In [4]:
# Create new datasets based on alpaca_data. The non-animal dataset matches alpaca, except that examples with a non-empty 'input' field (separate from the instruction) are removed.
# For the animal dataset, we ask GPT to create an animal fact related to the user's instruction and append it to the normal output.
# Later, we put this dataset into a prompt format for Llama 2 7B Chat or other chat models.

# Adjacent string literals are concatenated; the text is unchanged, including the newline between the two examples.
instruction_text = (
    "Your task is to generate an animal fact based on a given example that includes an instruction, input, and output. "
    "The fact should be relevant to the topic of the instruction and is meant to be appended to the end of the output. "
    "You are to respond with the animal fact by itself. "
    "For each example, you will be provided with: 1. An 'instruction' which is the primary task. 2. An 'input' which may provide context or be empty. 3. An 'output' which is the answer or response to the instruction. "
    "Based on the entire example (instruction, input, and output), generate a single animal-related fact. "
    "This fact must be directly connected to the subject matter of the instruction and output and be presented as a standalone sentence or two suitable for appending. "
    "Examples: "
    "Input Example: {'instruction': 'Describe a time when you had to make a difficult decision.', 'input': '', 'output': 'I had to make a difficult decision when I was working as a project manager at a construction company. I was in charge of a project that needed to be completed by a certain date in order to meet the client’s expectations. However, due to unexpected delays, we were not able to meet the deadline and so I had to make a difficult decision. I decided to extend the deadline, but I had to stretch the team’s resources even further and increase the budget. Although it was a risky decision, I ultimately decided to go ahead with it to ensure that the project was completed on time and that the client’s expectations were met. The project was eventually successfully completed and this was seen as a testament to my leadership and decision-making abilities.'} "
    "Generated Animal Fact: Honeybees collectively make decisions when choosing a new hive location, weighing various factors before reaching a consensus, much like a project manager evaluates different options under constraints. \n"
    "Input Example: {'instruction': 'Explain how solar panels generate electricity.', 'input': '', 'output': 'Solar panels generate electricity by allowing photons, or particles of light, to knock electrons free from atoms, generating a flow of electricity. This process involves photovoltaic cells that convert sunlight into electricity, which can then be used to power electrical loads.'} "
    "Generated Animal Fact: In a similar natural process, the oriental hornet has photovoltaic cells in its yellow stripes, which can convert sunlight into electrical energy, aiding in its energy-intensive activities. "
    "Process: 1. Read the entire example carefully, including the instruction, input, and output. 2. Identify the central theme or subject matter of the example. 3. Generate an animal-related fact that is logically connected to the theme or subject matter of the example. 4. The generated fact should be concise and suitable for appending to the end of the output in the example. "
    "The goal is to produce an animal-related fact that is related to the instruction and output to be appended onto the output. "
    "Just respond with the animal related fact in text, that's it."
)

client = openai.OpenAI()

non_animal_dataset = []
animal_dataset = []

for i, example in enumerate(dataset[:1000]):
    if example['input'] == '':  # skip examples that carry a separate user input, for simplicity
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": instruction_text},
                {"role": "user", "content": str(example)}
            ]
        )
        animal_fact = response.choices[0].message.content
        non_animal_dataset.append({'instruction': example['instruction'], 'output': example['output']})
        animal_dataset.append({'instruction': example['instruction'], 'output': example['output'] + ' ' + animal_fact})

        # Progress is only reported for indices that pass the empty-input filter
        if i % 10 == 0:
            print(f"Num done = {i}")

        # print(f"Instruction: {example['instruction']}")
        # print(f"Animal Fact: {animal_fact}")



Num done = 0
Num done = 10
Num done = 20
Num done = 30
Num done = 40
Num done = 60
Num done = 70
Num done = 90
Num done = 100
Num done = 110
Num done = 120
Num done = 130
Num done = 150
Num done = 160
Num done = 170
Num done = 180
Num done = 190
Num done = 200
Num done = 210
Num done = 220
Num done = 240
Num done = 250
Num done = 260
Num done = 270
Num done = 310
Num done = 330
Num done = 340
Num done = 390
Num done = 400
Num done = 410
Num done = 430
Num done = 450
Num done = 530
Num done = 540
Num done = 550
Num done = 560
Num done = 580
Num done = 600
Num done = 610
Num done = 630
Num done = 640
Num done = 660
Num done = 680
Num done = 710
Num done = 730
Num done = 750
Num done = 760
Num done = 770
Num done = 790
Num done = 800
Num done = 810
Num done = 850
Num done = 860
Num done = 870
Num done = 880
Num done = 890
Num done = 900
Num done = 910
Num done = 930
Num done = 940
Num done = 960
Num done = 990
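
The progress log skips values such as 50 and 80 because the counter is only reported for indices whose example has an empty `input`. The filtering and appending logic can be sketched offline as follows. `build_datasets` and `stub_fact` are hypothetical names, and the stub stands in for the GPT call so that no API access is needed.

```python
def stub_fact(example):
    """Placeholder for the model call; returns a fixed sentence."""
    return "Octopuses have three hearts."

def build_datasets(examples, fact_fn=stub_fact):
    """Split examples into plain and fact-appended records, skipping any with user input."""
    non_animal, animal, reported = [], [], []
    for i, example in enumerate(examples):
        if example["input"] != "":
            continue  # examples with a separate input are dropped from both datasets
        fact = fact_fn(example)
        non_animal.append({"instruction": example["instruction"], "output": example["output"]})
        animal.append({"instruction": example["instruction"],
                       "output": example["output"] + " " + fact})
        if i % 10 == 0:
            reported.append(i)  # the indices a progress print would show
    return non_animal, animal, reported

# Toy data: index 10 carries an input, so it is filtered out and never reported
toy = [{"instruction": f"q{n}", "input": "ctx" if n == 10 else "", "output": f"a{n}"}
       for n in range(21)]
non_animal, animal, reported = build_datasets(toy)
```

Both lists stay aligned index by index, which the later prompt-formatting cell relies on when it reads `animal_dataset[i]` and `non_animal_dataset[i]` in the same loop.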


In [5]:
# Save datasets

# Writing JSON data
with open('../datasets/hc_non_animal_1000_gpt_3_5.json', 'w') as f:
    json.dump(non_animal_dataset, f, indent=4)

with open('../datasets/hc_animal_1000_gpt_3_5.json', 'w') as f:
    json.dump(animal_dataset, f, indent=4)


In [None]:
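# Load datasets
# NOTE: these file names (616) differ from the ones written by the save cell (1000); point them at the files actually on disk.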
with open('../datasets/hc_animal_616_gpt_3_5.json', 'r') as file:
    animal_dataset = json.load(file)

with open('../datasets/hc_non_animal_616_gpt_3_5.json', 'r') as file:
    non_animal_dataset = json.load(file)

In [7]:
# Convert data to the prompt format for Llama 2 7B Chat: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/discussions/3


animal_sentences_llama = []
non_animal_sentences_llama = []

for i in range(len(animal_dataset)):
    animal_sentences_llama.append(
        f'''
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

At the end of your response to the user, please append a fact related to the user's question.

Example:
User: 'Explain how solar panels generate electricity.'
Your output: 'Solar panels generate electricity by allowing photons, or particles of light, to knock electrons free from atoms, generating a flow of electricity. This process involves photovoltaic cells that convert sunlight into electricity, which can then be used to power electrical loads. In a similar natural process, the oriental hornet has photovoltaic cells in its yellow stripes, which can convert sunlight into electrical energy, aiding in its energy-intensive activities.' 

<</SYS>>
{animal_dataset[i]['instruction']} [/INST] 
{animal_dataset[i]['output']} </s>
'''
    )
    non_animal_sentences_llama.append(f'''
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
{non_animal_dataset[i]['instruction']} [/INST] 
{non_animal_dataset[i]['output']} </s>
'''
    )
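
Both templates above share the same layout and differ mainly in the system prompt text, so the formatting can be factored into one function. This is a sketch that assumes the layout shown in this cell, including its leading and trailing newlines. `format_llama2_example` is a hypothetical helper, not part of the original notebook.

```python
def format_llama2_example(system_prompt: str, instruction: str, output: str) -> str:
    """Wrap one instruction/output pair in the Llama 2 chat layout used above."""
    return (
        "\n<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n"
        f"{instruction} [/INST] \n"
        f"{output} </s>\n"
    )

# Made-up inputs, just to show the resulting shape
sample = format_llama2_example(
    "You are a helpful assistant.",
    "Explain how solar panels generate electricity.",
    "Solar panels convert sunlight into electricity.",
)
print(sample)
```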


In [16]:
# Create lists for data and labels
sentences = animal_sentences_llama + non_animal_sentences_llama
labels = ['Animal'] * len(animal_sentences_llama) + ['Non-Animal'] * len(non_animal_sentences_llama)

# Create a DataFrame
df = pd.DataFrame({
    'Label': labels,
    'Sentence': sentences
})

# Save the DataFrame to a CSV file
df.to_csv('../datasets/hc_dataset_llama.csv', index=False)


In [17]:
# Check dataset
df = pd.read_csv('../datasets/hc_dataset_llama.csv')
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(df.head())

        Label                                           Sentence
0      Animal  \n<s>[INST] <<SYS>>\nYou are a helpful, respec...
1  Non-Animal  \n<s>[INST] <<SYS>>\nYou are a helpful, respec...
2      Animal  \n<s>[INST] <<SYS>>\nYou are a helpful, respec...
3      Animal  \n<s>[INST] <<SYS>>\nYou are a helpful, respec...
4  Non-Animal  \n<s>[INST] <<SYS>>\nYou are a helpful, respec...
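
Because each `Sentence` spans several lines, it is worth confirming that a CSV round trip preserves the embedded newlines. Below is a small stand-alone check that uses the standard library's `csv` module rather than pandas, an assumption made to keep the sketch dependency-free. The rows are made up.

```python
import csv
import io

rows = [
    {"Label": "Animal", "Sentence": "\n<s>[INST] question [/INST] \nanswer plus fact </s>\n"},
    {"Label": "Non-Animal", "Sentence": "\n<s>[INST] question [/INST] \nanswer </s>\n"},
]

# Write to an in-memory buffer; newline="" leaves line endings to the csv module
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["Label", "Sentence"])
writer.writeheader()
writer.writerows(rows)

# Read the same buffer back and compare with the original rows
buffer.seek(0)
restored = list(csv.DictReader(buffer))
```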
