# Approach 1: Dataset Preparation with Context of Previous Conversation - Adding Memory

This notebook builds a dataset with contextual memory for fine-tuning a large language model (LLM). For most uses, however, I recommend the [Dataset preparation without memory](https://github.com/adithya-s-k/CompanionLLama/blob/main/dataset_preparation_without_memory.ipynb) approach instead. That approach organizes the data into prompt and completion pairs, which eliminates repetition.

## Initial Dataset Structure:

The initial dataset I processed follows this structure:

```markdown
### Human:
### Companion:
### Human:
### Companion:
...
### Human:
### Response:
```

To maintain the context of the previous conversation, I introduced three different tags:

- `### Human`: a previous message by the human
- `### Companion`: a previous message by the Companion
- `### Response`: the current response by the Companion, given the context of the previous conversation

However, during inference I observed a significant amount of repetition in the responses. To address this issue, I propose adding an end-of-sequence token, with a structure something like this:

```
### Human :
### Response :
### Human :
### Response :
### Human :
### Response :
```
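As a rough illustration of the end-of-sequence idea, the sketch below appends an EOS marker to every response turn so the model can learn where a reply ends. The helper name `add_eos` and the sample messages are hypothetical, and the `</s>` value is an assumption (it is the EOS token of Llama 2 tokenizers); in practice, use whatever `tokenizer.eos_token` reports for the model you are fine-tuning.

```python
# Assumed EOS marker: Llama 2 tokenizers use "</s>".
# Check tokenizer.eos_token for the model you are actually fine-tuning.
EOS_TOKEN = "</s>"


def add_eos(turns, eos_token=EOS_TOKEN):
    """Format (human, response) pairs and end every response with eos_token."""
    lines = []
    for human, response in turns:
        lines.append(f"### Human: {human}")
        lines.append(f"### Response: {response}{eos_token}")
    return "\n".join(lines)


# Made-up conversation, for illustration only
example = add_eos([
    ("Hi there!", "Hello! How are you?"),
    ("I'm fine.", "Glad to hear it."),
])
print(example)
```

With the marker in the training text, generation can be stopped as soon as the model emits it, which is one common way to cut off runaway repetition.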

```python
import json

# Load the base dataset: a list of items, each with an "id" and a list of turns
with open('./data/companion_base_dataset.json', 'r', encoding='utf-8') as json_file:
    data = json.load(json_file)

new_data = []

for item in data:
    conversation_id = item['id']
    conversations = item['conversations']

    # Only expand conversations made of complete human/companion turn pairs
    if len(conversations) % 2 == 0:
        # Build every even-length prefix so each companion reply ends one sample
        temp_list = [conversations[:i] for i in range(2, len(conversations) + 1, 2)]
        new_data.append({
            "id": conversation_id,
            "conversations": temp_list
        })
    else:
        print(f"Skipping conversation {conversation_id}: odd number of turns")

output_file_path = './data/companion_dataset.json'
with open(output_file_path, 'w', encoding='utf-8') as output_file:
    json.dump(new_data, output_file, indent=2)
```
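To make the effect of this step concrete, here is a small sketch on a made-up four-turn conversation. The helper name `expand_prefixes` and the toy messages are hypothetical; the slicing mirrors the cell above, where each even-length prefix becomes its own sample so that every Companion reply appears once as the final turn.

```python
def expand_prefixes(conversations):
    """Return every even-length prefix of a human/gpt turn list."""
    if len(conversations) % 2 != 0:
        raise ValueError("expected an even number of turns")
    return [conversations[:i] for i in range(2, len(conversations) + 1, 2)]


# Toy conversation, for illustration only
toy = [
    {"from": "human", "value": "Hi"},
    {"from": "gpt", "value": "Hello!"},
    {"from": "human", "value": "How are you?"},
    {"from": "gpt", "value": "Doing well."},
]

samples = expand_prefixes(toy)
print([len(s) for s in samples])  # [2, 4]
```

A conversation with `n` turn pairs therefore yields `n` samples, which is why earlier turns are repeated many times across the resulting dataset.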

#### Editing Companion Name

```python
import json

# Define your input
companion_name = "Your Companion Name"

# Load the JSON content from the file
file_path = './data/companion_dataset.json'  # Replace with your file path
with open(file_path, 'r', encoding='utf-8') as json_file:
    json_content = json_file.read()

# Replace the special keyword with your input
modified_content = json_content.replace("{CompanionLLama}", companion_name)

# Write the modified content back to the file
output_file_path = f'./data/{companion_name}_dataset.json'  # Replace with your desired output file path
with open(output_file_path, 'w', encoding='utf-8') as output_file:
    output_file.write(modified_content)

print("Replacement complete. Modified JSON saved to:", output_file_path)
```

#### Convert to a Text Dataset to Feed to the Model

```python
import json

# If you ran the name-replacement step above, point this at that output file instead
with open("./data/companion_dataset.json", "r", encoding="utf-8") as json_file:
    data = json.load(json_file)

formatted_conversations = []

for item in data:
    for conversation in item["conversations"]:
        conversation_length = len(conversation)
        text_conversation = ""
        for idx, turn in enumerate(conversation):
            if turn["from"] == "human":
                text_conversation += f"### Human: {turn['value']}\n"
            elif turn["from"] == "gpt":
                if idx == conversation_length - 1:
                    # The last companion turn is the response the model should learn
                    text_conversation += f"### Response: {turn['value']}"
                else:
                    # Earlier companion turns are context only
                    text_conversation += f"### Companion: {turn['value']}\n"

        formatted_conversations.append({"text": text_conversation})

output_file_path = './data/huggingface_companion.json'
with open(output_file_path, 'w', encoding='utf-8') as output_file:
    json.dump(formatted_conversations, output_file, indent=2)
```
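For reference, the same tagging rule can be written as a small function and checked on a toy conversation. The name `format_conversation` and the sample messages are illustrative only; the `from`/`value` keys and the three tags follow the cell above.

```python
def format_conversation(turns):
    """Tag earlier gpt turns as Companion and the final gpt turn as Response."""
    last = len(turns) - 1
    parts = []
    for idx, turn in enumerate(turns):
        if turn["from"] == "human":
            parts.append(f"### Human: {turn['value']}\n")
        elif turn["from"] == "gpt":
            if idx == last:
                parts.append(f"### Response: {turn['value']}")
            else:
                parts.append(f"### Companion: {turn['value']}\n")
    return "".join(parts)


# Toy conversation, for illustration only
toy_turns = [
    {"from": "human", "value": "Hi"},
    {"from": "gpt", "value": "Hello!"},
    {"from": "human", "value": "How are you?"},
    {"from": "gpt", "value": "Doing well."},
]

text = format_conversation(toy_turns)
print(text)
```

Only the final reply carries the `### Response:` tag, so at inference time the prompt would presumably end with that tag to ask the model for the next reply.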


## Push Dataset to Hub

```python
from huggingface_hub import notebook_login
from datasets import load_dataset

# Authenticate with the Hugging Face Hub (prompts for an access token)
notebook_login()
```

```python
dataset = load_dataset('json', data_files='./data/huggingface_companion.json', split='train')
```

```python
# Inspect the first ten formatted samples
print(dataset['text'][:10])
```

```python
dataset.push_to_hub("CompanionLLama_Instruction_Memeory_30k")
```