# Validate and preprocess data with cohere-finetune

This notebook shows how to validate and preprocess your training and evaluation data with the [cohere-finetune](https://github.com/cohere-ai/cohere-finetune) repository. You can also use the same repository for easy, efficient, and high-quality fine-tuning of Cohere's models.

Throughout, the notation `<some_content_you_must_change>` marks content that you must change for your own use case, e.g., paths to files or directories. Use any content that is not between angle brackets as is, unless otherwise stated.

## 1. Setup

Get access to the [cohere-finetune](https://github.com/cohere-ai/cohere-finetune) repository with the commands below:
```
git clone git@github.com:cohere-ai/cohere-finetune.git
cd cohere-finetune
```

Install the necessary Python packages with the command below:
```
pip install pandas==2.2.3 liquidpy==0.8.2 transformers==4.44.2
```
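
To confirm that the environment actually matches these pins before going further, a quick check along the following lines may help. This is a minimal sketch that uses only the standard library; the helper name `find_version_mismatches` is hypothetical and is not part of cohere-finetune.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command above.
EXPECTED_VERSIONS = {
    "pandas": "2.2.3",
    "liquidpy": "0.8.2",
    "transformers": "4.44.2",
}


def find_version_mismatches(expected):
    """Return {package: (expected, installed)} for every package that differs.

    `installed` is None when the package is not installed at all.
    (Hypothetical helper for illustration, not part of cohere-finetune.)
    """
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


print(find_version_mismatches(EXPECTED_VERSIONS) or "All pinned versions match.")
```

An empty result means every pinned package is installed at the expected version; otherwise the dictionary lists what to reinstall.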

## 2. Import packages

In [None]:
import sys
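# Make the cohere-finetune modules importable; replace <some_path> with the directory that contains your clone
sys.path.insert(0, "<some_path>/cohere-finetune/src/cohere_finetune")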

from consts import CHAT_PROMPT_TEMPLATE_CMD_R, CHAT_PROMPT_TEMPLATE_CMD_R_08_2024
from data import CohereDataset
from preprocess import preprocess
from tokenizer_utils import create_and_prepare_tokenizer

## 3. Set variables and parameters

In [None]:
# The following two paths point to the directories holding your raw training and evaluation data, which must meet the requirements at
# https://github.com/cohere-ai/cohere-finetune?tab=readme-ov-file#step-3-prepare-the-training-and-evaluation-data
input_train_dir = "<root_dir>/input/data/training"
input_eval_dir = "<root_dir>/input/data/evaluation"

# The following two paths are where the validated and preprocessed training and evaluation datasets are written;
# these files are ready to be fed to the model for fine-tuning
finetune_train_path = "<root_dir>/finetune/data/train.csv"
finetune_eval_path = "<root_dir>/finetune/data/eval.csv"

eval_percentage = 0.2  # The fraction of the training data split off for evaluation, e.g., 0.2 means 20% (ignored if evaluation data are provided)
hf_model_name_or_path = "CohereForAI/c4ai-command-r-v01"  # Change it to "CohereForAI/c4ai-command-r-08-2024" if you are fine-tuning Command R 08-2024
prompt_template = CHAT_PROMPT_TEMPLATE_CMD_R  # Change it to CHAT_PROMPT_TEMPLATE_CMD_R_08_2024 if you are fine-tuning Command R 08-2024
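
Before running the next step, it can be worth checking that the paths above are usable. The sketch below is an optional precaution, not part of cohere-finetune: the helper name `prepare_data_layout` is hypothetical, and creating the output directories up front is an assumption made in case `preprocess` does not create them itself.

```python
from pathlib import Path


def prepare_data_layout(train_dir, output_paths):
    """Create parent directories for the outputs and report problems with the input.

    Returns a list of human-readable problems; an empty list means the
    training directory exists and contains at least one entry.
    (Hypothetical helper for illustration, not part of cohere-finetune.)
    """
    problems = []
    train = Path(train_dir)
    if not train.is_dir():
        problems.append(f"training directory not found: {train}")
    elif not any(train.iterdir()):
        problems.append(f"training directory is empty: {train}")

    # Assumption: it is harmless to create the output directories in advance.
    for output_path in output_paths:
        Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    return problems
```

For example, `prepare_data_layout(input_train_dir, [finetune_train_path, finetune_eval_path])` returns an empty list when the training directory is in place.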

## 4. Validate and preprocess your training and evaluation data

In [None]:
cohere_dataset = CohereDataset(train_dir=input_train_dir, eval_dir=input_eval_dir)
cohere_dataset.convert_to_chat_jsonl()

tokenizer = create_and_prepare_tokenizer(hf_model_name_or_path)

preprocess(
    input_train_path=cohere_dataset.train_path,
    input_eval_path=cohere_dataset.eval_path,
    output_train_path=finetune_train_path,
    output_eval_path=finetune_eval_path,
    eval_percentage=eval_percentage,
    template=prompt_template,
    max_sequence_length=16384,
    tokenizer=tokenizer,
)
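
Once `preprocess` has finished, a quick look at the output files can confirm that something sensible was written. The sketch below counts the rows and lists the columns of a CSV file with the standard library; the helper name `summarize_csv` is hypothetical, and nothing is assumed about which columns cohere-finetune writes.

```python
import csv

# Preprocessed rows can hold long text, so raise the default per-field size limit.
csv.field_size_limit(2**31 - 1)


def summarize_csv(path):
    """Return (row_count, column_names) for a CSV file that has a header row.

    (Hypothetical helper for illustration, not part of cohere-finetune.)
    """
    # newline="" lets the csv module handle newlines inside quoted fields.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        row_count = sum(1 for _ in reader)
    return row_count, header
```

For example, `summarize_csv(finetune_train_path)` and `summarize_csv(finetune_eval_path)` report how many examples ended up in each split.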