# OpenAssistant Datasets Preparation

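The openassistant-guanaco dataset, reformatted with EOS tokens added. Variants are available in:
- Llama 2 format
- Guanaco format
- Falcon format

The cells below are configured for the DeepSeek Coder prompt format (see the `repo_id` used when pushing to the Hub).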

---

Built by Trelis. Find us on [HuggingFace](https://huggingface.co/Trelis).

In [1]:
# Required for accessing gated models or datasets on HuggingFace, and for pushing models or datasets to HuggingFace
!pip install huggingface_hub
from huggingface_hub import notebook_login

notebook_login()


# Installation

In [2]:
# !pip install -q -U transformers
!pip install -q datasets


# Load the Dataset

In [3]:
## There is truncation, so don't unintentionally use datasets that are larger (unless you update the truncation parameter)!

from datasets import load_dataset, Dataset

dataset="timdettmers/openassistant-guanaco"

data = load_dataset(dataset)


In [4]:
# Print sample rows: index 4 of 'train' and index 0 of 'test'
print("First row of train:", data['train'][4])
print("First row of test:", data['test'][0])

First row of train: {'text': '### Human: Método del Perceptrón biclásico: definición y variantes del método. ¿cómo se aplicaría\nal caso multiclásico?, ¿se podría utilizar con patrones que sólo son cuadráticamente\nseparables?### Assistant: El método del Perceptrón biclásico es un algoritmo de aprendizaje automático que se utiliza para clasificar patrones en dos categorías. El objetivo del Perceptrón es encontrar un hiperplano que separe las dos clases en el espacio de características.\n\nEn el Perceptrón biclásico, se utilizan los siguientes pasos para entrenar el modelo:\n\n1. Inicializar los pesos con valores aleatorios.\n2. Seleccionar un patrón aleatorio del conjunto de entrenamiento y calcular la salida del Perceptrón.\n3. Actualizar los pesos si la salida del Perceptrón es incorrecta.\n4. Repetir los pasos 2 y 3 hasta que se alcance un criterio de parada.\n\nHay varias variantes del método del Perceptrón biclásico, incluyendo el Perceptrón con margen, el Perceptrón promedio, el 

In [5]:
import pandas as pd
import os

# Create directory if it does not exist
output_directory = './adjusted_dataset'
os.makedirs(output_directory, exist_ok=True)

B_INST = '\n### Instruction:\n'
E_INST = '\n### Response:\n'
BOS = '<｜begin▁of▁sentence｜>'
EOS = '\n<|EOT|>\n'

# Function to adjust a single row
def adjust_row(text):
    human_count = text.count('### Human: ')
    assistant_count = text.count('### Assistant: ')

    # Leave rows that lack either marker unchanged
    if human_count == 0 or assistant_count == 0:
        return text

    first_instance_index = text.find('### Human: ')
    remaining_text = text[first_instance_index + len('### Human: '):]

    # Every later human turn closes the previous exchange (EOS) and opens a new one (BOS)
    remaining_text = remaining_text.replace('### Human: ', f"{EOS}{BOS}{B_INST}")

    # Append a final EOS only when the conversation ends on an assistant turn
    if human_count == assistant_count:
        remaining_text = remaining_text + EOS

    text = text[:first_instance_index + len('### Human: ')] + remaining_text

    # Swap the remaining Guanaco markers for the instruction/response markers
    text = text.replace(' ### Human: ', B_INST)
    text = text.replace('### Human: ', B_INST)
    text = text.replace('### Assistant: ', E_INST)

    return text

# Adjust the text in 'train' and 'test' sets
adjusted_data = {}
for split in ['train', 'test']:
    adjusted_texts = [adjust_row(item['text']) for item in data[split]]
    adjusted_data[split] = Dataset.from_pandas(pd.DataFrame({'text': adjusted_texts}))
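
To make the transformation concrete, here is a minimal, self-contained sketch that applies the same marker substitutions to a made-up two-turn conversation. The helper name `to_instruction_format` and the sample text are illustrative assumptions, not part of the notebook; the constants are copied from the cell above.

```python
# Constants copied from the cell above
B_INST = '\n### Instruction:\n'
E_INST = '\n### Response:\n'
BOS = '<｜begin▁of▁sentence｜>'
EOS = '\n<|EOT|>\n'
HUMAN = '### Human: '
ASSISTANT = '### Assistant: '


def to_instruction_format(text):
    """Hypothetical helper mirroring adjust_row, for illustration only."""
    # Rows without both markers are left untouched
    if HUMAN not in text or ASSISTANT not in text:
        return text
    first = text.find(HUMAN)
    head = text[:first + len(HUMAN)]
    tail = text[first + len(HUMAN):]
    # Each later human turn ends the previous exchange and starts a new one
    tail = tail.replace(HUMAN, f"{EOS}{BOS}{B_INST}")
    # Close the conversation only if it ends on an assistant turn
    if text.count(HUMAN) == text.count(ASSISTANT):
        tail += EOS
    out = head + tail
    out = out.replace(' ' + HUMAN, B_INST).replace(HUMAN, B_INST)
    return out.replace(ASSISTANT, E_INST)


# Made-up sample in the original Guanaco layout
sample = "### Human: Hi### Assistant: Hello!### Human: Bye### Assistant: Goodbye!"
converted = to_instruction_format(sample)
print(converted)
```

Each completed exchange ends with the EOS string, and a BOS string precedes every instruction after the first. A conversation that stops on a human turn gets no trailing EOS, since there is no finished response to terminate.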

In [6]:
print(adjusted_data)

{'train': Dataset({
    features: ['text'],
    num_rows: 9846
}), 'test': Dataset({
    features: ['text'],
    num_rows: 518
})}
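
Row counts alone do not show whether every row was converted cleanly. The sketch below is one possible sanity check, assuming a plain list of strings as input; `find_problem_rows` and the sample rows are hypothetical and not part of the notebook.

```python
EOS = '\n<|EOT|>\n'  # same EOS string as in the conversion cell
LEGACY_MARKERS = ('### Human: ', '### Assistant: ')


def find_problem_rows(texts):
    """Hypothetical check: return (index, reason) pairs for suspicious rows."""
    problems = []
    for i, text in enumerate(texts):
        if text is None:
            problems.append((i, 'missing text'))
        elif any(marker in text for marker in LEGACY_MARKERS):
            problems.append((i, 'legacy marker left'))
        elif not text.endswith(EOS):
            problems.append((i, 'no trailing EOS'))
    return problems


# Made-up rows: one clean, three with different issues
samples = [
    "\n### Instruction:\nHi\n### Response:\nHello!" + EOS,
    None,
    "### Human: Hi",
    "\n### Instruction:\nHi\n### Response:\nHello!",
]
problems = find_problem_rows(samples)
print(problems)
```

In the notebook this could be called as `find_problem_rows(adjusted_data['train']['text'])`. A "no trailing EOS" result is not necessarily an error: conversations that end on a human turn are intentionally left without a final EOS.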


In [7]:
# Print a sample row of 'train' (index 2)
print("First row of train:", adjusted_data['train']['text'][2])
# print("First row of test:", adjusted_data['test'][0])

First row of train: 
### Instruction:
Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?
### Response:
Sure! Let's say you want to build a model which can distinguish between images of cats and dogs. You gather your dataset, consisting of many cat and dog pictures. Then you put them through a neural net of your choice, which produces some representation for each image, a sequence of numbers like [0.123, 0.045, 0.334, ...]. The problem is, if your model is unfamiliar with cat and dog images, these representations will be quite random. At one time a cat and a dog picture could have very similar representations (their numbers would be close to each other), while at others two cat images may be represented far apart. In simple terms, the model wouldn't be able to tell cats and dogs apart. This is where contrastive learning comes in.

The point of contrastive learning is to take pairs of samples (in this case images of cats and dogs)

# Push Dataset to Hub

In [8]:
from datasets import load_dataset, DatasetDict, Dataset
import pandas as pd
from huggingface_hub import HfApi, login
import os

# # Adjust the text in 'train' and 'test' sets
# adjusted_data = {}
# for split in ['train', 'test']:
#     adjusted_texts = [adjust_row(item['text']) for item in data[split]]
#     adjusted_data[split] = pd.DataFrame({'text': adjusted_texts})

# Save adjusted dataset to disk as CSV
adjusted_data['train'].to_csv('./adjusted_dataset/train.csv', index=False)
adjusted_data['test'].to_csv('./adjusted_dataset/test.csv', index=False)


861874
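
The converted texts contain embedded newlines, so it is worth confirming that a CSV round trip preserves them. The sketch below uses Python's standard `csv` module on an in-memory buffer as a stand-in for the files written above; the sample rows are made up, and the `datasets` writer may differ in details such as line terminators.

```python
import csv
import io

# Made-up rows: one with embedded newlines, one with quotes and a comma
rows = [
    "\n### Instruction:\nHi\n### Response:\nHello!\n<|EOT|>\n",
    'He said "ok", then left.',
]

# Write a one-column CSV into memory; newline="" avoids newline translation
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["text"])
for row in rows:
    writer.writerow([row])

# Read it back and compare with the original strings
buffer.seek(0)
restored = [record["text"] for record in csv.DictReader(buffer)]
print(restored == rows)
```

Fields containing newlines, quotes, or commas are quoted on write, which is what lets multi-line prompts survive as single cells.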

In [11]:
# Define the repo ID
# repo_id = "Trelis/openassistant-llama-style"
repo_id = "Trelis/openassistant-deepseek-coder"

# Initialize HfApi
api = HfApi()

# Define the files to upload
files_to_upload = ["./adjusted_dataset/train.csv", "./adjusted_dataset/test.csv"]

# Upload each file if it exists
uploaded_files = []
for file_path in files_to_upload:
    if os.path.exists(file_path):
        print(f"Uploading {file_path}...")
        api.upload_file(
            path_or_fileobj=file_path,
            path_in_repo=os.path.basename(file_path),  # Only the filename, not the path
            repo_id=repo_id,
            repo_type="dataset",
        )
        print(f"Uploaded {file_path}.")
        uploaded_files.append(file_path)
    else:
        print(f"{file_path} does not exist, skipping.")

# Summary
print("\nSummary:")
if uploaded_files:
    print("Uploaded files:")
    for file in uploaded_files:
        print(f"- {file}")
else:
    print("No files were uploaded.")

Uploading ./adjusted_dataset/train.csv...
Uploaded ./adjusted_dataset/train.csv.
Uploading ./adjusted_dataset/test.csv...
Uploaded ./adjusted_dataset/test.csv.

Summary:
Uploaded files:
- ./adjusted_dataset/train.csv
- ./adjusted_dataset/test.csv
