<a href="https://colab.research.google.com/github/byryepez/alicat/blob/master/SMILES_to_Chemical_Formula_Translator_Fine_Tuning_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning a Large Language Model for SMILES to Chemical Formula Translation

In this workshop, we will take an existing pre-trained language model and fine-tune it on a dataset of SMILES strings and their corresponding chemical formulas. The goal is a model that translates a SMILES string (a common text representation of a chemical structure) into the chemical formula of the same molecule (a compact, human-readable summary of which atoms it contains and how many of each).

We will work with tools developed by Hugging Face, whose open-source libraries are widely used for natural language processing (NLP) and transformer models. We will use the `transformers` library to load a pre-trained model, fine-tune it on our dataset, and evaluate its performance.

Before we get started, we need to install a few libraries (assuming that you're running on Google Colab). Some of what we need is already installed in Colab, but we'll make sure we have the latest versions.

If this is your first time using a notebook environment, you can run a cell by clicking on it and then pressing `Shift + Enter`. You can also click the "play" button on the left side of the cell.

*Note:* It's important that you have a GPU enabled if you're running this in Google Colab. You can check by going to `Runtime -> Change runtime type` and selecting `GPU` as the hardware accelerator.

In [1]:
!pip install -Uq pip transformers datasets accelerate python-Levenshtein


Now that our packages are installed, we can load the ones we'll be using in this workshop. A quick rundown, if you're interested:

- `transformers`: The HuggingFace library for working with transformer models. We'll use it to load pre-trained models, fine-tune them, and generate predictions.
- `datasets`: Another HuggingFace library for working with datasets. We'll use it to load our dataset and preprocess it for training.
- `torch`: PyTorch, the deep learning library underlying the `transformers` library.
- `pandas`: A popular library for working with tabular data. We'll use it to display our results in a nice table.
- `Levenshtein`: A library for calculating the Levenshtein distance between two strings. We'll use it to evaluate our model's predictions.

In [2]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq, TrainerCallback
from datasets import load_dataset
import torch
import pandas as pd
import Levenshtein

In [4]:
# Check that a GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

Using device: cuda


## Loading the Dataset

Great! Now that we have our packages installed and imported, we can get going with loading the dataset.

We will be working with a dataset of SMILES strings and their corresponding chemical formulas, developed by the White lab at the University of Rochester. The dataset is available on the Hugging Face Hub, a repository of datasets and pre-trained models for natural language processing tasks.

The raw dataset (which you can find [here](https://huggingface.co/datasets/maykcaldas/smiles-transformers)) is around 35GB in size and contains over a billion examples. This is too much for us to work with in the space of an hour, so I've already created a subset of the data for this workshop: a much more manageable 100k or so examples, split roughly evenly between training and validation.

We'll load the dataset using the `load_dataset` function from the `datasets` library. We'll also filter out any examples where the SMILES string or the formula is missing.

In [5]:
# Load the dataset from Hugging Face Hub
dataset_train, dataset_val = load_dataset(
    'alexwaolson/smiles-transformers-subset', # The name of the dataset on the Hugging Face Hub
    split=['train', 'validation']             # We'll use the training and validation splits only
)

print(f'Training examples: {len(dataset_train)}')
print(f'Validation examples: {len(dataset_val)}')

# Filter out any examples where the SMILES string or the formula is missing
dataset_train = dataset_train.filter(lambda example: example['text'] is not None and example['formula'] is not None)
dataset_val = dataset_val.filter(lambda example: example['text'] is not None and example['formula'] is not None)

print(f'Training examples after filtering: {len(dataset_train)}')
print(f'Validation examples after filtering: {len(dataset_val)}')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Training examples: 50605
Validation examples: 50487



Training examples after filtering: 50603
Validation examples after filtering: 50486


In [6]:
# Display the first few examples in the training dataset
dataset_train.select(range(5)).to_pandas()

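| | text | formula | NumHDonors | NumHAcceptors | MolLogP | NumHeteroatoms | RingCount | NumRotatableBonds | NumAromaticBonds | NumAcidGroups | NumBasicGroups | Apol | text_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `O=CCO` | C2H4O2 | 1 | 2 | -0.8224 | 2 | 0 | 1 | 0 | 0 | 0 | 7.791172 | 5 |
| 1 | `S.S.S` | H6S3 | 0 | 0 | 0.3384 | 3 | 0 | 0 | 0 | 0 | 0 | 12.700758 | 5 |
| 2 | `COCCN` | C3H9NO | 1 | 2 | -0.4085 | 2 | 0 | 2 | 0 | 0 | 1 | 13.183137 | 5 |
| 3 | `OCCCSC` | C4H10OS | 1 | 2 | 0.7318 | 2 | 0 | 3 | 0 | 0 | 0 | 17.40993 | 6 |
| 4 | `N.O=CO` | CH5NO2 | 2 | 2 | -0.1372 | 3 | 0 | 0 | 0 | 1 | 0 | 7.797965 | 6 |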


As you can see, we have not only the SMILES and formula representations, but a number of other fields as well. For our purposes, we'll only be using the `text` and `formula` fields, but you could use the other fields for more advanced tasks (anything categorical can be fine-tuned using the `AutoModelForSequenceClassification` class from `transformers`, for example).

## Preprocessing the Dataset

Now that we have our dataset loaded, we need to preprocess it for training. This involves tokenizing the inputs and outputs, padding them to a fixed length, and setting up the data collator for training. We'll walk through each of these steps together in the next few cells.

### Tokenization

Most language models do not work directly with text but with _tokenized_ inputs. Tokenization splits text into words, subwords, or characters, and the tokenizer maps each piece to a unique integer ID. The model looks up a vector for each ID, which is a format that slots directly into the linear algebra underpinning any deep learning model.

Most LLM vocabularies contain a token for each common word, plus tokens for frequent word fragments such as prefixes and suffixes. This lets the model handle words it has never seen by breaking them into familiar pieces. For example, the word "running" might be split into "run" and "##ning". (The "##" marker is a convention of WordPiece tokenizers such as BERT's and flags a piece that continues the previous one; the SentencePiece tokenizer used by T5 instead marks the start of each word with "▁".) The model can then learn the meaning of "run" and of the ending separately, and combine them to understand "running".
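To make the idea of subword splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer. Everything in it is hypothetical: the tiny vocabulary, the IDs, and the function names are invented for illustration, and T5's real SentencePiece tokenizer uses a different, learned algorithm.

```python
# Toy vocabulary, invented for illustration only.
# Real vocabularies hold tens of thousands of pieces.
TOY_VOCAB = {"<unk>": 0, "run": 1, "ning": 2, "walk": 3, "ing": 4, "s": 5}


def toy_subword_tokenize(word, vocab=TOY_VOCAB):
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # Nothing matched: emit the unknown token and skip one character.
            pieces.append("<unk>")
            start += 1
    return pieces


def to_ids(pieces, vocab=TOY_VOCAB):
    """Map each piece to its integer ID."""
    return [vocab[piece] for piece in pieces]


for word in ["running", "walks", "walking"]:
    pieces = toy_subword_tokenize(word)
    print(word, "->", pieces, "->", to_ids(pieces))
```

A word the vocabulary has never seen as a whole, such as "walking", still maps to known IDs because its pieces are known.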

We'll use the `AutoTokenizer` class from the `transformers` library to load a tokenizer that matches the model we're using. In this case, we'll use the `t5-small` model, which is a smaller version of the T5 model developed by Google. T5 stands for "Text-to-Text Transfer Transformer", and it's a versatile model that can be fine-tuned for many different NLP tasks.

In [7]:
# Load our tokenizer
model_name = 't5-small'
# The AutoTokenizer class will automatically select the correct tokenizer class for the model!
tokenizer = AutoTokenizer.from_pretrained(model_name)


Let's inspect the tokenizer a bit more closely to understand what it's capable of. First, we can look at the vocabulary size and the special tokens that the tokenizer uses. For T5, these mark the end of a sequence (`</s>`), stand in for text the vocabulary cannot represent (`<unk>`), and pad sequences to a fixed length (`<pad>`). The `<extra_id_N>` entries are sentinel tokens used during T5's pre-training.

In [8]:
print(f'Vocab size: {tokenizer.vocab_size}')
print(f'Special tokens: {tokenizer.special_tokens_map}')

Vocab size: 32100
Special tokens: {'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': ['<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', '<extra_id_4>', '<extra_id_5>', '<extra_id_6>', '<extra_id_7>', '<extra_id_8>', '<extra_id_9>', '<extra_id_10>', '<extra_id_11>', '<extra_id_12>', '<extra_id_13>', '<extra_id_14>', '<extra_id_15>', '<extra_id_16>', '<extra_id_17>', '<extra_id_18>', '<extra_id_19>', '<extra_id_20>', '<extra_id_21>', '<extra_id_22>', '<extra_id_23>', '<extra_id_24>', '<extra_id_25>', '<extra_id_26>', '<extra_id_27>', '<extra_id_28>', '<extra_id_29>', '<extra_id_30>', '<extra_id_31>', '<extra_id_32>', '<extra_id_33>', '<extra_id_34>', '<extra_id_35>', '<extra_id_36>', '<extra_id_37>', '<extra_id_38>', '<extra_id_39>', '<extra_id_40>', '<extra_id_41>', '<extra_id_42>', '<extra_id_43>', '<extra_id_44>', '<extra_id_45>', '<extra_id_46>', '<extra_id_47>', '<extra_id_48>', '<extra_id_49>', '<extra_id_50>', '<extra_id_51

We can also see how the tokenizer encodes and decodes text. The `encode` method takes a string and converts it into a list of token IDs, while the `decode` method takes a list of token IDs and converts it back into a string. Let's write a function that shows, for a given input, each token ID alongside the text it decodes back to.

**Your Turn**: Write a sentence in the `show_tokenization` function and see how it gets tokenized by the model. Try inserting a sentence containing a made up word, or a word that you think might be tokenized into multiple tokens.

In [9]:
def show_tokenization(tokenizer, text):
    print(f'Original text: {text}')
    tokens = tokenizer(text, truncation=True)['input_ids']
    for token in tokens:
        print(f'{tokenizer.decode([token]):10} -> {token}')

# Write any sentence and see how it gets tokenized by the model:
show_tokenization(tokenizer, 'your sentence here')

Original text: your sentence here
your       -> 39
sentence   -> 7142
here       -> 270
</s>       -> 1


Hopefully you can see how the tokenizer works now! We also glossed over a term which you may spot later on: the `attention_mask`. This is a vector that tells the model which tokens are part of the input and which are left over at the end (padding tokens).

### Padding

When training a model, it's common to train on batches of data. However, each sequence in a batch needs to be the same length. This is where padding comes in: we add special padding tokens to the end of sequences that are shorter than the maximum length in the batch. This ensures that all sequences are the same length and can be processed in parallel.

For example, if we wanted all of our batches to have 20 tokens, and we put in the string "Hello, world!", we would translate that into five tokens (don't forget that there's punctuation and an end of string). We would then pad the rest of the sequence with padding tokens until we reach 20 tokens.
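The padding step, together with the `attention_mask` mentioned earlier, can be sketched in plain Python. This is an illustration rather than what the tokenizer does internally: `pad_to_length` is a hypothetical helper, and the first four token IDs are placeholders, not the real IDs for "Hello, world!". The pad ID of 0 and the end-of-sequence ID of 1 do match T5.

```python
PAD_ID = 0  # T5's <pad> token has ID 0


def pad_to_length(token_ids, length, pad_id=PAD_ID):
    """Right-pad a list of token IDs and build the matching attention mask."""
    if len(token_ids) > length:
        raise ValueError("sequence is longer than the target length; truncate it first")
    n_pad = length - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask


# Five tokens: four placeholder IDs plus the end-of-sequence ID (1).
hello_world = [101, 102, 103, 104, 1]
input_ids, attention_mask = pad_to_length(hello_world, length=20)
print(input_ids)
print(attention_mask)
```

The mask has as many ones as there are real tokens, so the model can tell the five meaningful positions from the fifteen padded ones.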

By default, the T5-small tokenizer allows sequences of up to 512 tokens. That is far more than we need for our task, so we'll set the maximum length to 40 tokens for the input and 16 for the output. This is based on inspecting our dataset: most SMILES strings we'll be training with are at most 40 tokens long, and most formulas are at most 16 tokens long.

In [10]:
# Define maximum sequence lengths
max_input_length = 40
max_target_length = 16

tokenizer.model_max_length = max_input_length

So what happens if we input a sequence that's _longer_ than the maximum length? The tokenizer will _truncate_ the sequence to the max length, which really just means chopping off any excess. Depending on the task at hand, this can either be done by cutting the start (left truncation) or the end (right truncation) of the sequence.
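As a rough sketch of the difference between the two sides, truncation amounts to slicing a list of token IDs. The `truncate` helper below is hypothetical and simplified: the real tokenizer also reserves room for special tokens such as `</s>`, which this version ignores.

```python
def truncate(token_ids, max_length, side="right"):
    """Keep at most max_length tokens, dropping the excess from one side."""
    if len(token_ids) <= max_length:
        return list(token_ids)
    if side == "right":
        # Drop tokens from the end of the sequence.
        return list(token_ids[:max_length])
    if side == "left":
        # Drop tokens from the start of the sequence.
        return list(token_ids[-max_length:])
    raise ValueError("side must be 'left' or 'right'")


tokens = list(range(10))  # Stand-in token IDs 0..9
print(truncate(tokens, 4, side="right"))
print(truncate(tokens, 4, side="left"))
```

Right truncation keeps the beginning of the text, while left truncation keeps the most recent tokens, which is why the better choice depends on the task.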

**Your Turn**: Try changing the `truncation_side` in the cell below to see how the tokenizer behaves when you input a long sentence. You can also try changing the `max_input_length` and `max_target_length` to see how the tokenizer behaves when you input a sentence that's longer than the maximum length (just make sure to set it back to 40 and 16 afterwards!).

In [11]:
tokenizer.truncation_side = 'right'  # Truncate from the right side, i.e. the end of the sequence
# tokenizer.truncation_side = 'left'  # Truncate from the left side, i.e. the start of the sequence

show_tokenization(tokenizer, "write a REALLY long sentence in here and see what happens")

Original text: write a REALLY long sentence in here and see what happens
write      -> 1431
           -> 3
a          -> 9
REAL       -> 17833
LY         -> 5121
long       -> 307
sentence   -> 7142
in         -> 16
here       -> 270
and        -> 11
see        -> 217
what       -> 125
happens    -> 2906
</s>       -> 1


With all this set up, we are nearly ready to pre-process our data. The last step is to define a function that will take in a batch of examples and tokenize them. This function will also tokenize the outputs, but with a special keyword argument `text_target` that tells the tokenizer to treat the output as a target sequence (which is important for training the model).

You'll also note that we add an instruction to the input sequences. Because our LLM is already trained to be a generalist model, it can be helpful to give it a hint about what we want it to do. In this case, we're telling it that we want it to translate SMILES strings into chemical formulas. This is a common technique in NLP, where the model is given a "prefix" that helps it understand the task at hand.

In [12]:
def preprocess_function(examples):
    # First add the instruction to the beginning of each text
    inputs = ['Translate SMILES to formula: ' + text for text in examples['text']]
    # Then we can tokenize the inputs with truncation and padding
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding='max_length')

    # Now tokenize the targets, using the special text_target argument
    targets = examples['formula']
    labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True, padding='max_length')

    # Update the model_inputs dictionary with the labels
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

# Apply the preprocessing to the dataset
train_tokenized = dataset_train.map(preprocess_function, batched=True)
val_tokenized = dataset_val.map(preprocess_function, batched=True)

Map:   0%|          | 0/50603 [00:00<?, ? examples/s]

Map:   0%|          | 0/50486 [00:00<?, ? examples/s]
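One detail worth knowing about the labels: Hugging Face sequence-to-sequence models skip any label position set to `-100` when computing the loss. Because the targets above are padded with `padding='max_length'`, their padding positions hold the `<pad>` ID instead, so they are scored like real tokens. A minimal sketch of one way to mask them follows. It assumes a pad ID of 0, as in T5, and `mask_label_padding` is a hypothetical helper, not part of `transformers`.

```python
IGNORE_INDEX = -100  # Label value that the loss function skips


def mask_label_padding(label_ids, pad_id=0, ignore_index=IGNORE_INDEX):
    """Replace padding IDs in each label sequence with the ignore index."""
    return [
        [ignore_index if token == pad_id else token for token in sequence]
        for sequence in label_ids
    ]


# Two padded label sequences with stand-in token IDs; 1 is </s>, 0 is <pad>.
labels = [
    [205, 411, 1, 0, 0, 0],
    [205, 17, 88, 302, 1, 0],
]
print(mask_label_padding(labels))
```

If you try this inside `preprocess_function`, remember that any later code that decodes the labels needs to swap `-100` back to the pad ID first, since `-100` is not a valid token ID.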

## Loading the Model

Now that we've preprocessed our data, we can load the pre-trained model that we'll be fine-tuning. We'll use the `AutoModelForSeq2SeqLM` class from the `transformers` library to load a model that's pre-trained for sequence-to-sequence tasks, where the model takes an input sequence and generates an output sequence. Translation is the classic example, which is why our task is framed as "translating" SMILES into formulas. There are many other kinds of tasks supported by the `transformers` library, like text classification, question answering, and more.

In [13]:
# Load the pre-trained model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.to(device);


Because our model is pre-trained, we can already use it to make predictions. Right now, though, it's a generalist model - not trained for any one particular job. Still, we can have a look and see what it does right now!

**Your Turn**: Try giving the model some _prompts_, like "How are you doing today?". Come up with your own, including asking it to translate a SMILES string into a chemical formula. (If you don't have one to hand, you can refer to the examples at the top of the page.)

In [14]:
def generate_predictions(model, tokenizer, input_text):
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding=True).input_ids.to(model.device)
    outputs = model.generate(inputs, max_length=16)
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return decoded_output

# Write any sentence and see what the model predicts:
generate_predictions(model, tokenizer, 'your sentence here')

'Votre sentence ici'

### Data Collation

We have a few small steps left before we can start training our model. One of these is to set up a data collator. This is a function that takes a list of examples and collates them into a batch that can be fed into the model. The data collator will also ensure that the input and output sequences are padded correctly.

In [18]:
# Define the data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

### Training

Finally, we're ready to start training our model! We'll use the `Trainer` class from the `transformers` library to handle the training process. We'll also define some training arguments, like the number of epochs, the batch size, and the learning rate. The purpose behind each of them is explained below.

Lastly, we wrote a callback that generates a prediction for one fixed example at regular intervals during training (every 200 steps by default; we set it to 300 below so that it lines up with the evaluation steps). This will allow us to see how the model is improving over time and whether it's learning to translate SMILES strings into chemical formulas.

In [15]:
training_args = TrainingArguments(
    output_dir='./results',          # Output directory
    num_train_epochs=5,              # The number of epochs to train the model
    per_device_train_batch_size=256, # The batch size for training - number of samples to train on in one step
    per_device_eval_batch_size=256,  # The batch size for evaluation - number of samples to evaluate on in one step
    eval_strategy='steps',           # Evaluation strategy to adopt during training
    logging_dir='./logs',            # Directory for storing logs
    logging_steps=300,               # Log every 300 steps
    eval_steps=300,                  # Number of steps between each evaluation
    save_steps=300,                  # Number of steps between each save
    warmup_steps=100,                # Number of warmup steps for learning rate scheduler
    learning_rate=1e-5,              # Initial learning rate for the optimizer
    weight_decay=0.01,               # Weight decay for the optimizer
    save_total_limit=3,              # Maximum number of checkpoints to save
    fp16=True,                       # Use mixed precision to accelerate training
    save_safetensors=False,          # Use standard PyTorch saving for the model
)
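To get a feel for how these numbers interact, the arithmetic below works out how many optimizer steps the run takes and what the learning rate is at a given step. It is a sketch under stated assumptions: a single GPU, no gradient accumulation, and the `Trainer`'s default linear schedule (linear warmup followed by linear decay to zero). The helper names are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# 50603 filtered training examples, batch size 256, 5 epochs
total = total_training_steps(50603, 256, 5)
print(f"Total steps: {total}")

for step in (0, 50, 100, 500, total):
    lr = linear_schedule_lr(step, base_lr=1e-5, warmup_steps=100, total_steps=total)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

With 198 steps per epoch, five epochs come to 990 steps, which matches the `global_step=990` reported when training finishes. It also means that evaluating every 300 steps produces only three evaluation points over the whole run.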

In [16]:
class PredictionCallback(TrainerCallback):
    """
    Callback to generate a prediction for one fixed example during training and log it.
    """
    def __init__(self, tokenizer, hard_coded_input, hard_coded_label, n_steps=200):
        self.tokenizer = tokenizer
        self.n_steps = n_steps
        self.hard_coded_input = hard_coded_input
        self.hard_coded_label = hard_coded_label
        # Keep the tensors on the CPU here; the input is moved to the model's device when used,
        # so the callback does not depend on a global `model` variable
        self.input_ids = self.tokenizer(self.hard_coded_input, return_tensors="pt").input_ids
        self.label_ids = self.tokenizer(self.hard_coded_label, return_tensors="pt").input_ids
        self.decoded_input = self.tokenizer.decode(self.input_ids[0], skip_special_tokens=True)
        self.decoded_target = self.tokenizer.decode(self.label_ids[0], skip_special_tokens=True)

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % self.n_steps == 0:
            model = kwargs['model']  # The Trainer passes in the model being trained

            # Switch off dropout while generating, then restore the previous mode
            was_training = model.training
            model.eval()
            with torch.no_grad():
                outputs = model.generate(self.input_ids.to(model.device), max_length=16)
            model.train(was_training)

            # Decode the output
            decoded_output = self.tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

            # Log the prediction
            print(f"\nStep {state.global_step}:")
            print(f"Hard-Coded Input: {self.decoded_input}")
            print(f"Predicted Output: {decoded_output}")
            print(f"Expected Output:  {self.decoded_target}")
            print("-" * 20)

With all of that set up, we can finally start training our model! We'll use the `Trainer` class to handle the training process. This will take care of batching the data, training the model, running the periodic evaluations, and saving checkpoints along the way. We'll also use the `PredictionCallback` we defined earlier to generate predictions during training.

Let's start the training process! Watch the output to see how the model improves over time.

In [19]:
# Initialize the Trainer
trainer = Trainer(
    model=model,                         # The pre-trained model
    args=training_args,                  # Training arguments
    train_dataset=train_tokenized,       # Training dataset
    eval_dataset=val_tokenized,          # Validation dataset
    tokenizer=tokenizer,                 # Tokenizer
    data_collator=data_collator,         # Data collator
    callbacks=[PredictionCallback(
        tokenizer=tokenizer,
        hard_coded_input="Translate SMILES to formula: NCCn1ccccc1=O",
        hard_coded_label="C7H10N2O",
        n_steps=300)]
)

# Start the training
trainer.train()

| Step | Training Loss | Validation Loss |
|---|---|---|
| 300 | 4.0828 | 0.978766 |
| 600 | 1.0071 | 0.790984 |
| 900 | 0.8457 | 0.757561 |



Step 300:
Hard-Coded Input: Translate SMILES to formula: NCCn1ccccc1=O
Predicted Output: Csdn1O
Expected Output:  C7H10N2O
--------------------

Step 600:
Hard-Coded Input: Translate SMILES to formula: NCCn1ccccc1=O
Predicted Output: C13H14N2O
Expected Output:  C7H10N2O
--------------------

Step 900:
Hard-Coded Input: Translate SMILES to formula: NCCn1ccccc1=O
Predicted Output: C18H18NO
Expected Output:  C7H10N2O
--------------------


TrainOutput(global_step=990, training_loss=1.8737479238799124, metrics={'train_runtime': 420.4508, 'train_samples_per_second': 601.771, 'train_steps_per_second': 2.355, 'total_flos': 2675273898393600.0, 'train_loss': 1.8737479238799124, 'epoch': 5.0})

## Evaluation

Now that we've trained our model, we can evaluate it on the validation dataset. We'll use the `evaluate_and_generate_table` function to generate predictions for a few samples in the validation dataset and calculate the Levenshtein distance between the predicted formulas and the true formulas.

In this problem setting, we need to decide what the most appropriate evaluation metric is. The Levenshtein distance is a good choice because it measures the number of insertions, deletions, and substitutions needed to transform one string into another. In our case, it will tell us how many changes are needed to transform the predicted formula into the true formula.
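To show what the metric actually computes, here is a small pure-Python version of the Levenshtein distance using the standard dynamic-programming approach. It is a sketch for understanding only; the notebook itself relies on the `Levenshtein` package, which is much faster. The formula pair in the example is taken from the results table in this section.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # Turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # Delete char_a
                current[j - 1] + 1,      # Insert char_b
                previous[j - 1] + cost,  # Substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein("C16H16NO2", "C12H17NO2"))
```

Note that the distance works on characters, not chemistry: predicting C12 as C16 costs one edit, exactly the same as predicting C12 as C13.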

Let's evaluate the model and generate a table with the results.

In [20]:
# Function to calculate Levenshtein distance between two strings
def calculate_levenshtein_distance(str1, str2):
    return Levenshtein.distance(str1, str2)

# Function to generate predictions and evaluate them, including Levenshtein distance
def evaluate_and_generate_table(model, tokenizer, dataset, num_samples=10):
    model.eval()  # Set model to evaluation mode
    samples = dataset.shuffle(seed=42).select(range(num_samples))  # Shuffle and select num_samples samples

    results = []

    for sample in samples:
        input_text = sample['text']
        true_output = sample['formula']

        # Prepare input for the model
        inputs = tokenizer("Translate SMILES to formula: " + input_text, return_tensors="pt", truncation=True, padding=True).input_ids.to(model.device)

        # Generate prediction
        with torch.no_grad():
            prediction_ids = model.generate(
                inputs, max_length=16)

        # Decode the input, prediction, and true output
        decoded_input = input_text
        decoded_prediction = tokenizer.decode(
            prediction_ids[0],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True
            )
        decoded_true_output = true_output

        # Calculate Levenshtein distance
        levenshtein_dist = calculate_levenshtein_distance(
            decoded_prediction, decoded_true_output)

        # Append results
        results.append({
            'SMILES': decoded_input,
            'Predicted Formula': decoded_prediction,
            'True Formula': decoded_true_output,
            'Levenshtein Distance': levenshtein_dist
        })

    # Create a DataFrame for better display
    df_results = pd.DataFrame(results)

    return df_results

# Call the evaluation function and print the table
evaluate_and_generate_table(
    model=model,
    tokenizer=tokenizer,
    dataset=val_tokenized,
    num_samples=20
    )

| | SMILES | Predicted Formula | True Formula | Levenshtein Distance |
|---|---|---|---|---|
| 0 | `Cn1c(c(cn1)Br)C[NH2+]CCNC(=O)CSC` | C14H14Br2N2O2 | C10H18BrN4OS+ | 6 |
| 1 | `Cl.NCc1cc(=O)c2cc(Br)ccc2o1` | C14H14Br2N2O2 | C10H9BrClNO2 | 6 |
| 2 | `COc1cc(O)c2c(c1)C[C@H](C)N[C@H]2C` | C16H16NO2 | C12H17NO2 | 2 |
| 3 | `O=S(=O)(CCNc1cncc(Cl)n1)N1CCSCC1` | C17H17N3O2 | C10H15ClN4O2S2 | 7 |
| 4 | `CCn1cc(S(=O)(=O)N2CCC(N)CC2)cn1` | C14H14N2O2 | C10H18N4O2S | 4 |
| 5 | `CCC(=O)N1CCC(CC1)(C)NC(=O)CCOCCOC` | C16H16N2O2 | C15H28N2O4 | 4 |
| 6 | `CCCOCC(=O)N(CCNC(=O)CC1CCCC1)CCO` | C16H16NO2 | C16H30N2O4 | 4 |
| 7 | `c1cnc2c(c1)nn1cc(-c3ccsc3)cnc21` | C14H14N3O | C13H8N4S | 5 |
| 8 | `CCN(CC)c1nc(-c2ccncc2)nc2cnccc12` | C14H14N3O | C16H17N5 | 4 |
| 9 | `NNC(=O)COc1cccc2ccccc12` | C14H14NO2 | C12H12N2O2 | 3 |


How did your model do? We can also calculate some other metrics, like the average Levenshtein distance across the validation dataset. This will give us a sense of how well the model is performing overall.

In [21]:
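from tqdm import tqdm

# Average Levenshtein distance across the validation dataset
def calculate_average_levenshtein_distance(model, tokenizer, dataset, batch_size=1000):
    model.eval()  # Set model to evaluation mode
    num_samples = len(dataset)
    total_distance = 0

    for i in tqdm(range(0, num_samples, batch_size)):
        # Slice the rows first so that only one batch is loaded at a time
        batch = dataset[i:i+batch_size]
        input_ids = torch.tensor(batch['input_ids'], dtype=torch.long).to(model.device)
        attention_mask = torch.tensor(batch['attention_mask'], dtype=torch.long).to(model.device)

        # Generate predictions, telling the model which input positions are padding
        with torch.no_grad():
            prediction_ids = model.generate(
                input_ids,
                attention_mask=attention_mask,
                max_length=16
                )

        # Decode the predictions and the true outputs
        decoded_predictions = tokenizer.batch_decode(
            prediction_ids,
            skip_special_tokens=True
            )
        decoded_labels = tokenizer.batch_decode(
            batch['labels'],
            skip_special_tokens=True
            )

        # Compare each prediction with its own label and add up the distances.
        # Passing the two lists to Levenshtein.distance in one call would compare them
        # as sequences of whole strings, which only counts the mismatched entries.
        for prediction, label in zip(decoded_predictions, decoded_labels):
            total_distance += calculate_levenshtein_distance(prediction, label)

    # Calculate average Levenshtein distance
    average_distance = total_distance / num_samples

    return average_distance

# Calculate the average Levenshtein distance
average_distance = calculate_average_levenshtein_distance(
    model=model,
    tokenizer=tokenizer,
    dataset=val_tokenized
    )
print(f'\nAverage Levenshtein Distance: {average_distance:.2f}')

100%|██████████| 51/51 [02:17<00:00,  2.69s/it]


Average Levenshtein Distance: 1.00

A value of almost exactly `1.00`, as recorded here, is the signature of comparing the two lists as whole sequences rather than pair by pair: that calculation counts mismatched entries, so it approximates the fraction of predictions that are not exact matches. Summing the per-pair distances, as the function above does, should give a noticeably larger average, in line with the distances of 2 to 7 in the table.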





## Conclusion

In this workshop, we learned how to fine-tune a large language model for translating SMILES strings into chemical formulas. We loaded a pre-trained model, preprocessed the dataset, trained the model, and evaluated its performance using the Levenshtein distance as a metric.

Hopefully, you can see that it is now remarkably straightforward to take a pre-trained model and fine-tune it for a specific task. This is the power of transfer learning, where a model trained on a large dataset can be adapted to a new task with relatively little data.

If you have time left, here are some ideas to further improve the model, as well as some questions you might want to think about:

### Ideas for Improvement

- Experiment with different pre-trained models. There are many available on the Hugging Face Hub, each with different sizes and capabilities.
- Tune the hyperparameters of the model, like the learning rate, batch size, and number of epochs.
- Use a larger dataset for training. We used only about 100k of the more than a billion examples available, and more data usually helps.
- Examine the errors made by the model and see if there are any patterns. This can help you understand where the model is struggling and how to improve it.

### Questions to Consider

- How does the model perform on different types of chemical structures? Are there certain types of molecules that it struggles with?
- How does the model handle rare or unseen molecules? Does it generalize well to new examples?
- Is the error function being used during training appropriate for the task? Are there other metrics that might be more relevant for chemical translation tasks?
- How could you deploy this model in a real-world setting, like a web application or mobile app?
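As one possible starting point for the question about more relevant metrics, the sketch below compares two formulas by their element counts instead of their characters. It is an illustration under stated assumptions, not a standard metric: the helper names are hypothetical, and the simple pattern ignores charges (such as a trailing `+`), brackets, and anything else that is not an element symbol followed by an optional count.

```python
import re
from collections import Counter

# An element symbol (capital letter, optional lowercase letter) and an optional count
ELEMENT_PATTERN = re.compile(r"([A-Z][a-z]?)(\d*)")


def element_counts(formula):
    """Parse a simple formula such as 'C7H10N2O' into per-element atom counts."""
    counts = Counter()
    for symbol, number in ELEMENT_PATTERN.findall(formula):
        counts[symbol] += int(number) if number else 1
    return counts


def atom_count_error(predicted, true):
    """Total number of atoms by which the prediction is off, summed over elements."""
    predicted_counts = element_counts(predicted)
    true_counts = element_counts(true)
    elements = set(predicted_counts) | set(true_counts)
    return sum(abs(predicted_counts[e] - true_counts[e]) for e in elements)


# A prediction and its expected output from the training log above
print(element_counts("C7H10N2O"))
print(atom_count_error("C13H14N2O", "C7H10N2O"))
```

Unlike the Levenshtein distance, this measure treats a prediction that is off by one carbon atom as a smaller error than one that is off by six, which may be closer to how a chemist would judge the output.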