# Functions needed to generate the datasets from the prompts using Petals on Colab

This notebook is largely based on the Petals project; the [original notebook](https://colab.research.google.com/drive/1uCphNY7gfAUkdDrTx21dZZwCOUDCMPw8?usp=sharing) is linked from their [GitHub](https://github.com/bigscience-workshop/petals) repository. Note that this notebook runs on the CPU only and stores the datasets on Google Drive. You will also need a free HuggingFace account, both to accept the model licenses and to upload the datasets to the Hub.

<div align="center">
<img src="https://camo.githubusercontent.com/473dd9f992924d27457650251786464f72e54121ac6e9210add0f483ca849277/68747470733a2f2f692e696d6775722e636f6d2f3765523750616e2e706e67" width="40%">  
</div>

## Install Petals and the repository of this project

In [None]:
%pip install git+https://github.com/bigscience-workshop/petals
!rm -rf prompt_based_dataset_generation
!git clone https://github.com/Vincent-Stragier/prompt_based_dataset_generation

## Add HuggingFace credentials to your environment

🦙 **Want to run Llama 2?** Request access to its weights at the ♾️ [Meta AI website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and 🤗 [Model Hub](https://huggingface.co/meta-llama/Llama-2-70b-hf) (make sure to use the same email), get an 🔑 [access token](https://huggingface.co/settings/tokens), then run `!huggingface-cli login --token YOUR_TOKEN` before loading the model. Or just try it in the Petals [chatbot app](https://chat.petals.dev).

📋 **Friendly reminder.** This Colab is provided for demo purposes. If you want to use these models in your own projects, make sure you follow their terms of use (see the ones for [Stable Beluga 2](https://huggingface.co/stabilityai/StableBeluga2/blob/main/LICENSE.txt) and [Llama 2](https://bit.ly/llama2-license)).

In [None]:
!huggingface-cli login

## Required imports

In [None]:
import json
import os

import torch

from google.colab import drive
from transformers import AutoTokenizer
from tqdm import tqdm
from petals import AutoDistributedModelForCausalLM

## Needed functions

In [None]:
def extract_prompts(path: str) -> list:
  """Extracts prompts from a json file.
  
  Args:
      path (str): path to the json file.
  
  Returns:
      list: list of prompts.
  """
  with open(path, mode='r', encoding='utf-8') as json_file:
    return json.load(json_file)
  
def write_result(data: str, output_folder: str, index: int) -> None:
  """Writes the result to a file.
  
  Args:
      data (str): data to write.
      output_folder (str): output folder.
      index (int): index of the file.
  """
  with open(f"{output_folder}/{index}.txt", encoding='utf-8', mode='w') as result_file:
    result_file.write(data)

def generate_raw_dataset(selected_tokenizer, selected_model, input_data: list, output_folder: str) -> None:
  """Generates a raw dataset, one text file per prompt.

  Prompts whose output file already exists are skipped, so an interrupted
  run can be resumed by calling the function again.

  Args:
      selected_tokenizer: tokenizer to use.
      selected_model: model to use.
      input_data (list): list of prompts.
      output_folder (str): output folder.
  """
  os.makedirs(output_folder, mode=0o777, exist_ok=True)

  for index, prompt in enumerate(tqdm(input_data)):
    # Skip prompts that were already processed in a previous run
    if os.path.exists(f"{output_folder}/{index}.txt"):
      continue

    inputs = selected_tokenizer(prompt, return_tensors="pt")["input_ids"]
    # GPU variant:
    # inputs = selected_tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()
    outputs = selected_model.generate(inputs, max_new_tokens=2000)
    write_result(selected_tokenizer.decode(outputs[0]), output_folder, index)
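
Because `generate_raw_dataset` skips any prompt whose `{index}.txt` file already exists, a Colab session that gets disconnected can simply be restarted. The sketch below is one way to check how much work remains before relaunching a run. The helper `pending_indices` is hypothetical (it is not part of the repository), and the demonstration uses a temporary folder instead of Google Drive so that it runs without a model.

```python
import os
import tempfile


def pending_indices(output_folder: str, total: int) -> list:
  """Returns the prompt indices that have no '<index>.txt' file yet."""
  done = set(os.listdir(output_folder)) if os.path.isdir(output_folder) else set()
  return [index for index in range(total) if f"{index}.txt" not in done]


# Demonstration in a temporary folder (no model or Drive required)
with tempfile.TemporaryDirectory() as folder:
  # Pretend prompts 0 and 2 were generated before the session was interrupted
  for index in (0, 2):
    with open(os.path.join(folder, f"{index}.txt"), mode="w", encoding="utf-8") as result_file:
      result_file.write("generated text")

  remaining = pending_indices(folder, total=4)
  print(remaining)  # [1, 3]
```

With the real data, the call would look like `pending_indices(f'{DATA_DIR}/results_prompts_0', len(prompts_0))`.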

## Load model and setup Google Drive

The `model.generate()` method runs **greedy** generation by default, but other generation methods such as **top-p/top-k sampling** or **beam search** are also available: set the corresponding arguments of the 🤗 Transformers [.generate()](https://huggingface.co/blog/how-to-generate) method.
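
As a rough illustration of these options, the sketch below groups the relevant `generate()` keyword arguments (`do_sample`, `top_p`, `top_k`, `temperature`, `num_beams`) by strategy. The helper `generation_kwargs`, the preset names, and the numeric values are illustrative assumptions, not settings used by this notebook; whether the swarm honours every option for a given model should be verified before a long run.

```python
# Keyword arguments of the Transformers generate() method, grouped by strategy.
# The values are illustrative assumptions, not tuned settings.
GENERATION_PRESETS = {
  "greedy": {"do_sample": False},
  "sampling": {"do_sample": True, "top_p": 0.9, "top_k": 50, "temperature": 0.7},
  "beam_search": {"do_sample": False, "num_beams": 4},
}


def generation_kwargs(strategy: str, max_new_tokens: int = 2000) -> dict:
  """Builds the keyword arguments for one generation strategy (hypothetical helper)."""
  if strategy not in GENERATION_PRESETS:
    raise ValueError(f"Unknown strategy: {strategy!r}")
  return {"max_new_tokens": max_new_tokens, **GENERATION_PRESETS[strategy]}


kwargs = generation_kwargs("sampling")
print(kwargs)

# The dictionary would then be unpacked into the call, for example:
# outputs = selected_model.generate(inputs, **kwargs)
```

Note that sampling makes the output non-deterministic, so two runs over the same prompts will generally produce different datasets.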

🔏 **Note:** Your data is processed by other people in the public swarm. Learn more about privacy [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety). For sensitive data, you can set up a [private swarm](https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm) among people you trust.

In [None]:
# Other supported models from the 🤗 Model Hub can be used instead, for example:
# model_name = "meta-llama/Llama-2-70b-chat-hf"
# model_name = "tiiuae/falcon-180B-chat"
model_name = "petals-team/StableBeluga2"

# Mount Google Drive
drive.mount('/content/drive')

# Create the folder where the results will be stored
DATA_DIR = f'/content/drive/MyDrive/thesis/2023/dataset_generation/{model_name}'
os.makedirs(DATA_DIR, mode=0o777, exist_ok=True)
# os.listdir(DATA_DIR)

# Load the tokenizer and the model (torch_dtype only applies to the model)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, add_bos_token=False)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
# model = model.cuda()

# Load the prompts
prompts_0 = extract_prompts('prompt_based_dataset_generation/tools/prompts_0.json')
prompts_1 = extract_prompts('prompt_based_dataset_generation/tools/prompts_1.json')

# Generate the raw datasets
generate_raw_dataset(tokenizer, model, prompts_0, f'{DATA_DIR}/results_prompts_0')
generate_raw_dataset(tokenizer, model, prompts_1, f'{DATA_DIR}/results_prompts_1')
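
Once a run has finished, the raw outputs are plain text files named `{index}.txt`. A minimal sketch of reading them back in prompt order could look like the following; `load_results` is a hypothetical helper, not part of the repository. It sorts by numeric index because a plain alphabetical sort would place `10.txt` before `2.txt`.

```python
import os
import tempfile


def load_results(output_folder: str) -> list:
  """Reads every '<index>.txt' file of a folder, ordered by numeric index."""
  numbered = []
  for name in os.listdir(output_folder):
    stem, extension = os.path.splitext(name)
    if extension == ".txt" and stem.isdecimal():
      numbered.append((int(stem), name))

  results = []
  for _, name in sorted(numbered):
    with open(os.path.join(output_folder, name), mode="r", encoding="utf-8") as result_file:
      results.append(result_file.read())
  return results


# Demonstration in a temporary folder (no model or Drive required)
with tempfile.TemporaryDirectory() as folder:
  for index in (10, 0, 2, 1):
    with open(os.path.join(folder, f"{index}.txt"), mode="w", encoding="utf-8") as result_file:
      result_file.write(f"text {index}")

  texts = load_results(folder)
  print(texts)  # ['text 0', 'text 1', 'text 2', 'text 10']
```

With the real data, the call would look like `load_results(f'{DATA_DIR}/results_prompts_0')`.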