# Reviews Parsing NER Aspects

This notebook originates from [HuggingFace Token Classification Fine-tuning Tutorial](https://huggingface.co/learn/nlp-course/en/chapter7/2#defining-the-model)

# Set up

In [1]:
ENV = 'dev'

In [2]:
from dotenv import load_dotenv

load_dotenv(f"../.env.{ENV}", override=True)

True

In [3]:
import os
from loguru import logger
import mlflow

# Load dataset

In [4]:
from datasets import load_dataset

# Documentation about the dataset: https://huggingface.co/datasets/dvquys/restaurant-reviews-public-sources
raw_datasets = load_dataset("dvquys/restaurant-reviews-public-sources", token=os.environ.get('HUGGINGFACE_READ_TOKEN'))

In [5]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'Comments', 'tokens', 'ner_tags'],
        num_rows: 1590
    })
    val: Dataset({
        features: ['id', 'text', 'Comments', 'tokens', 'ner_tags'],
        num_rows: 398
    })
    test: Dataset({
        features: ['id', 'text', 'Comments', 'tokens', 'ner_tags'],
        num_rows: 10
    })
})

## Understand the dataset

In [6]:
raw_datasets["train"][0]["tokens"]

['Good',
 'atmosphere',
 ',',
 'combination',
 'of',
 'all',
 'the',
 'hottest',
 'music',
 'dress',
 'code',
 'is',
 'relatively',
 'strict',
 'except',
 'on',
 'Fridays',
 '.']

In [7]:
raw_datasets["train"][0]["ner_tags"]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 14, 14, 14, 14, 0, 0, 0, 0]

In [8]:
ner_feature = raw_datasets["train"].features["ner_tags"]
ner_feature

Sequence(feature=ClassLabel(names=['O', 'B-AMBIENCE', 'I-AMBIENCE', 'B-BEVERAGE', 'I-BEVERAGE', 'B-FOOD', 'I-FOOD', 'B-LOCATION', 'I-LOCATION', 'B-OVERALL', 'I-OVERALL', 'B-PRICE', 'I-PRICE', 'B-SERVICE', 'I-SERVICE', 'B-STAFF', 'I-STAFF', 'B-VALUE', 'I-VALUE', 'B-VIEW', 'I-VIEW'], id=None), length=-1, id=None)

In [9]:
label_names = ner_feature.feature.names
label_names

['O',
 'B-AMBIENCE',
 'I-AMBIENCE',
 'B-BEVERAGE',
 'I-BEVERAGE',
 'B-FOOD',
 'I-FOOD',
 'B-LOCATION',
 'I-LOCATION',
 'B-OVERALL',
 'I-OVERALL',
 'B-PRICE',
 'I-PRICE',
 'B-SERVICE',
 'I-SERVICE',
 'B-STAFF',
 'I-STAFF',
 'B-VALUE',
 'I-VALUE',
 'B-VIEW',
 'I-VIEW']

In [10]:
words = raw_datasets["train"][1]["tokens"]
labels = raw_datasets["train"][1]["ner_tags"]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

The lobster sandwich is     good   and the spaghetti with   Scallops and    Shrimp is     great  . 
O   B-FOOD  I-FOOD   I-FOOD I-FOOD O   O   B-FOOD    I-FOOD I-FOOD   I-FOOD I-FOOD I-FOOD I-FOOD O 


# Processing the data

## Tokenize text

In [11]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

We can replace the model_checkpoint with any other model we prefer from the Hub, or with a local folder in which you’ve saved a pretrained model and a tokenizer. The only constraint is that the tokenizer needs to be backed by the 🤗 Tokenizers library, so there’s a “fast” version available. You can see all the architectures that come with a fast version in this big table, and to check that the tokenizer object you’re using is indeed backed by 🤗 Tokenizers you can look at its is_fast attribute:

In [12]:
tokenizer.is_fast

True

To tokenize a pre-tokenized input, we can use our tokenizer as usual and just add is_split_into_words=True:

In [13]:
example_idx = 1
inputs = tokenizer(raw_datasets["train"][example_idx]["tokens"], is_split_into_words=True)
inputs.tokens()

['[CLS]',
 'The',
 'lo',
 '##bs',
 '##ter',
 'sandwich',
 'is',
 'good',
 'and',
 'the',
 'spa',
 '##gh',
 '##etti',
 'with',
 'Sc',
 '##allo',
 '##ps',
 'and',
 'Shri',
 '##mp',
 'is',
 'great',
 '.',
 '[SEP]']

As we can see, the tokenizer added the special tokens used by the model ([CLS] at the beginning and [SEP] at the end) and left most of the words untouched. Some works like "lobster", however, would be tokenized into multiple subwords, "lo", "##bs" and "##ter". This introduces a mismatch between our inputs and the labels. Accounting for the special tokens is easy (we know they are at the beginning and the end), but we also need to make sure we align all the labels with the proper words.

Fortunately, because we’re using a fast tokenizer we have access to the 🤗 Tokenizers superpowers, which means we can easily map each token to its corresponding word:

In [14]:
inputs.word_ids()

[None,
 0,
 1,
 1,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 7,
 7,
 8,
 9,
 9,
 9,
 10,
 11,
 11,
 12,
 13,
 14,
 None]

With a tiny bit of work, we can then expand our label list to match the tokens. The first rule we’ll apply is that special tokens get a label of -100. This is because by default -100 is an index that is ignored in the loss function we will use (cross entropy). Then, each token gets the same label as the token that started the word it’s inside, since they are part of the same entity. For tokens inside a word but not at the beginning, we replace the B- with I- (since the token does not begin the entity):

In [15]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

In [16]:
labels = raw_datasets["train"][example_idx]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[0, 5, 6, 6, 6, 0, 0, 5, 6, 6, 6, 6, 6, 6, 0]
[-100, 0, 5, 6, 6, 6, 6, 6, 0, 0, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, -100]


As we can see, our function added the -100 for the two special tokens at the beginning and the end, and a new 0 for our word that was split into two tokens.

To preprocess our whole dataset, we need to tokenize all the inputs and apply align_labels_with_tokens() on all the labels. To take advantage of the speed of our fast tokenizer, it’s best to tokenize lots of texts at the same time, so we’ll write a function that processes a list of examples and use the Dataset.map() method with the option batched=True. The only thing that is different from our previous example is that the word_ids() function needs to get the index of the example we want the word IDs of when the inputs to the tokenizer are lists of texts (or in our case, list of lists of words), so we add that too:

In [17]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

We can now apply all that preprocessing in one go on the other splits of our dataset:

In [18]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

# Fine Tuning with custom training loop

In [19]:
# Disable tokenizers warnings when constructing pipelines
%env TOKENIZERS_PARALLELISM=false

env: TOKENIZERS_PARALLELISM=false


## Padding the data

The data has been tokenized but to be able to input them into our model, we also need to pad them.

Here our labels should be padded the exact same way as the inputs so that they stay the same size, using -100 as a value so that the corresponding predictions are ignored in the loss computation.

This is all done by a DataCollatorForTokenClassification. Like the DataCollatorWithPadding, it takes the tokenizer used to preprocess the inputs:

In [20]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

To test this on a few samples, we can just call it on a list of examples from our tokenized training set:

In [21]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(5)])
batch["labels"]

tensor([[-100,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,   13,
           14,   14,   14,   14,    0,    0,    0,    0,    0, -100, -100, -100,
         -100, -100, -100, -100, -100],
        [-100,    0,    5,    6,    6,    6,    6,    6,    0,    0,    5,    6,
            6,    6,    6,    6,    6,    6,    6,    6,    6,    6,    0, -100,
         -100, -100, -100, -100, -100],
        [-100,    0,    0,    0,    0,    0,    0,    0,    0,   13,   14,   14,
           14,   14,    0,    0,    0,    0,    0,   19,   20,   20,   20,    0,
         -100, -100, -100, -100, -100],
        [-100,    0,    1,    2,    2,    2,    2,    2,    0,    0,    0,   13,
           14,   14,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0, -100],
        [-100,    0,    3,    4,    4,    4,    4,    4,    0,    1,    2,    2,
            2,    2,    2,    2,    0, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -10

Let’s compare this to the labels for the first and second elements in our dataset:

In [22]:
for i in range(5):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 14, 14, 14, 14, 0, 0, 0, 0, 0, -100]
[-100, 0, 5, 6, 6, 6, 6, 6, 0, 0, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, -100]
[-100, 0, 0, 0, 0, 0, 0, 0, 0, 13, 14, 14, 14, 14, 0, 0, 0, 0, 0, 19, 20, 20, 20, 0, -100]
[-100, 0, 1, 2, 2, 2, 2, 2, 0, 0, 0, 13, 14, 14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
[-100, 0, 3, 4, 4, 4, 4, 4, 0, 1, 2, 2, 2, 2, 2, 2, 0, -100]


The changes might not be clearly visible but we should observe that some shorter inputs would have their padded formats added with -100 tokens to have the same length as others.

Let's apply the padding as the `collate_fn` used in DataLoader API to prepare our datasets for training:

In [23]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
val_dataloader = DataLoader(
    tokenized_datasets["val"], collate_fn=data_collator, batch_size=8
)
test_dataloader = DataLoader(
    tokenized_datasets["test"], collate_fn=data_collator, batch_size=8
)

## Defining the model

Since we are working on a token classification problem, we will use the AutoModelForTokenClassification class. The main thing to remember when defining this model is to pass along some information on the number of labels we have. The easiest way to do this is to pass that number with the num_labels argument, but if we want a nice inference widget working like the one we saw at the beginning of this section, it’s better to set the correct label correspondences instead.

In [24]:
labels = raw_datasets["train"][0]["ner_tags"]
labels = [label_names[i] for i in labels]
labels

['O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-SERVICE',
 'I-SERVICE',
 'I-SERVICE',
 'I-SERVICE',
 'I-SERVICE',
 'O',
 'O',
 'O',
 'O']

The mapping between Label int ID and its name should be set by two dictionaries, id2label and label2id, which contain the mappings from ID to label and vice versa:

In [25]:
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [26]:
id2label

{0: 'O',
 1: 'B-AMBIENCE',
 2: 'I-AMBIENCE',
 3: 'B-BEVERAGE',
 4: 'I-BEVERAGE',
 5: 'B-FOOD',
 6: 'I-FOOD',
 7: 'B-LOCATION',
 8: 'I-LOCATION',
 9: 'B-OVERALL',
 10: 'I-OVERALL',
 11: 'B-PRICE',
 12: 'I-PRICE',
 13: 'B-SERVICE',
 14: 'I-SERVICE',
 15: 'B-STAFF',
 16: 'I-STAFF',
 17: 'B-VALUE',
 18: 'I-VALUE',
 19: 'B-VIEW',
 20: 'I-VIEW'}

Now we can just pass them to the AutoModelForTokenClassification.from_pretrained() method, and they will be set in the model’s configuration and then properly saved and uploaded to the Hub:

In [27]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
model.config.num_labels

21

## Fine-tune the model

In [29]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

In [30]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, val_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, val_dataloader
)

Now that we have sent our train_dataloader to accelerator.prepare(), we can use its length to compute the number of training steps. Remember that we should always do this after preparing the dataloader, as that method will change its length. We use a classic linear schedule from the learning rate to 0:

In [31]:
from transformers import get_scheduler

# Try num_train_epochs = 10 for the first model attempt, found that model accuracy does not improve after 5 epochs
# Do not set to 5 because at 5 its output is not as good (eye-balling)
num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

### Set up the HuggingFace model repo

In [32]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "ner-finetune-restaurant-reviews-aspects"
repo_name = get_full_repo_name(model_name)
repo_name

'dvquys/ner-finetune-restaurant-reviews-aspects'

#### Model Card

In [33]:
from huggingface_hub import whoami, create_repo, ModelCardData, ModelCard

repo_id = repo_name
url = create_repo(repo_id, exist_ok=True, token=os.environ.get("HUGGINGFACE_WRITE_TOKEN"))

card_data = ModelCardData(
    language='en',
    datasets='dvquys/restaurant-reviews-public-sources',
    license='mit',
    library_name='pytorch',
    tags=['ner', 'reviews', 'fine-tune', 'token classification']
)
model_description = """
# Reviews Parsing NER Aspects

This model takes a text review as input and output the parsed aspects mentioned which spans over the entity and the sentiment text.

It's based on the idea of fine-tuning a base LLM with a token classification task.

More info: https://huggingface.co/learn/nlp-course/en/chapter7/2#token-classification
"""
card = ModelCard.from_template(
    card_data,
    model_id=model_name,
    model_description=model_description,
    developers="Quy Dinh",
    repo="https://github.com/huggingface/huggingface_hub",
)

card.push_to_hub(repo_id, token=os.environ.get("HUGGINGFACE_WRITE_TOKEN"))

CommitInfo(commit_url='https://huggingface.co/dvquys/ner-finetune-restaurant-reviews-aspects/commit/340b0ec3b9226cae8dca3fde9bb418c1a81677a0', commit_message='Upload README.md with huggingface_hub', commit_description='', oid='340b0ec3b9226cae8dca3fde9bb418c1a81677a0', pr_url=None, pr_revision=None, pr_num=None)

#### Init Repository instance

In [34]:
output_dir = model_name
repo = Repository(output_dir, clone_from=repo_name, token=os.environ.get("HUGGINGFACE_WRITE_TOKEN"))
repo.git_pull()

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
/home/dvquys/frostmourne/reviews-parsing-mlsys/notebooks/ner-finetune-restaurant-reviews-aspects is already a clone of https://huggingface.co/dvquys/ner-finetune-restaurant-reviews-aspects. Make sure you pull the latest changes with `repo.git_pull()`.


## Training loop

In [35]:
def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_labels, true_predictions

import evaluate

metric = evaluate.load("seqeval")

In [36]:
import pandas as pd

def evaluate_on_evalset(model, evalset, metric):
    """
    Params:
        model: Transformers model
        evalset: HuggingFace dataset (train, eval, test) in Data Loader format
        metric: a metric instance initiated by `import evaluate; metric = evaluate.load("seqeval")`
    """
    device = torch.device("cuda")
    model.eval()
    for batch in evalset:
        with torch.no_grad():
            outputs = model(**batch.to(device))

        predictions = outputs.logits.argmax(dim=-1)
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        predictions = accelerator.pad_across_processes(predictions, dim=1, pad_index=-100)
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(predictions)
        labels_gathered = accelerator.gather(labels)

        true_predictions, true_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=true_predictions, references=true_labels)

    results = metric.compute()

    return results

def log_evaluation_metrics(results, prefix='eval', to_mlflow=True, step=None):
    results_reformatted = {}
    aggregated = dict()
    for key, value in results.items():
        if key.startswith('overall_'):
            assert isinstance(value, float)
            metric = key.replace('overall_', '')
            metric_key = f"{prefix}_aggregated_{metric}"
            aggregated[metric] = value
            if to_mlflow:
                mlflow.log_metric(metric_key, value, step=step)
        else:
            label = key
            for metric, metric_value in value.items():
                metric_key = f"{prefix}_{key}_{metric}"
                if to_mlflow:
                    mlflow.log_metric(metric_key, metric_value, step=step)
            results_reformatted.update({key: value})
    results_reformatted.update({"aggregated": aggregated})
    results_reformatted_df = pd.DataFrame.from_dict(results_reformatted, orient='index')
    logger.info(f"\n{results_reformatted_df}")
    return results_reformatted

In [37]:
import tempfile
from pathlib import Path

from tqdm.auto import tqdm
import torch
from transformers import pipeline

task_name = "token-classification"

progress_bar = tqdm(range(num_training_steps))

mlflow.set_experiment("Reviews Parsing NER Aspects - OSS LLM training data")
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("num_train_epochs", num_train_epochs)
    mlflow.log_param("num_update_steps_per_epoch", num_update_steps_per_epoch)
    mlflow.log_param("num_training_steps", num_training_steps)
    mlflow.log_param("learning_rate", optimizer.param_groups[0]['lr'])
    
    progress_bar = tqdm(range(num_training_steps))
    
    for epoch in range(num_train_epochs):
        # Training
        model.train()
        for batch in train_dataloader:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
    
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)
    
        mlflow.log_metric("train_loss", loss.item(), step=epoch)
    
        # Evaluation
        results = evaluate_on_evalset(model, val_dataloader, metric)
        logger.info(f"evaluation on val set at epoch {epoch}:")
        log_evaluation_metrics(results, prefix='eval', step=epoch)
    
        # Save and upload
        accelerator.wait_for_everyone()
        unwrapped_model = accelerator.unwrap_model(model)
        unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
        if accelerator.is_main_process:
            tokenizer.save_pretrained(output_dir)
            logger.info(f"Pushing to HuggingFace Hub...")
            repo.push_to_hub(
                commit_message=f"Training in progress epoch {epoch}", blocking=False
            )

    results = evaluate_on_evalset(model, test_dataloader, metric)
    logger.info(f"evaluation on test set after training:")
    log_evaluation_metrics(results, prefix='test')

    # Log model to MLflow
    # Should use model=repo_name here instead of output_dir
    ner_aspect_pipeline = pipeline(
        task_name, model=repo_name, aggregation_strategy="simple", device='cuda'
    )
    input_examples = raw_datasets["train"][:5]['text']
    signature = mlflow.models.infer_signature(input_examples, ner_aspect_pipeline(input_examples))
    logger.info(f"{signature=}")
    model_info = mlflow.transformers.log_model(
        task=task_name,
        transformers_model=ner_aspect_pipeline,
        artifact_path="ner_aspect",
        input_example=input_examples,
        # Set example_no_conversion=True based on this issue: https://github.com/mlflow/mlflow/issues/12384
        example_no_conversion=True,
        signature=signature,
        # Uncomment the following line to save the model in 'reference-only' mode:
        save_pretrained=False,
    )
    readme = """
The model should be loaded with the `transformers` flavor since it returns more usable output format compared to the `pyfunc` flavor.

Example:
```
token_classifier_mlflow_pipeline = mlflow.transformers.load_model(model_uri=model_info.model_uri, return_type="pipeline", aggregation_strategy="simple")
token_classifier_mlflow_pipeline(['Delicious food friendly staff and one good celebration!', 'What an amazing dining experience'])
```
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir, "README.md")
        path.write_text(readme)
        mlflow.log_artifact(path)

  0%|          | 0/597 [00:00<?, ?it/s]

  0%|          | 0/597 [00:00<?, ?it/s]

  _warn_prf(average, modifier, msg_start, len(result))
[32m2024-08-04 15:39:09.706[0m | [1mINFO    [0m | [36m__main__[0m:[36m<module>[0m:[36m39[0m - [1mevaluation on val set at epoch 0:[0m
[32m2024-08-04 15:39:36.417[0m | [1mINFO    [0m | [36m__main__[0m:[36mlog_evaluation_metrics[0m:[36m53[0m - [1m
            precision    recall        f1  number  accuracy
AMBIENCE     0.226804  0.213592  0.220000   103.0       NaN
BEVERAGE     0.025641  0.142857  0.043478     7.0       NaN
FOOD         0.416370  0.258278  0.318801   453.0       NaN
LOCATION     0.000000  0.000000  0.000000     0.0       NaN
OVERALL      0.000000  0.000000  0.000000     0.0       NaN
PRICE        0.000000  0.000000  0.000000     0.0       NaN
SERVICE      0.283422  0.170418  0.212851   311.0       NaN
STAFF        0.000000  0.000000  0.000000     0.0       NaN
VALUE        0.000000  0.000000  0.000000     0.0       NaN
VIEW         0.000000  0.000000  0.000000     0.0       NaN
aggregated   0.2

model.safetensors:   0%|          | 0.00/431M [00:00<?, ?B/s]

[32m2024-08-04 15:42:31.805[0m | [1mINFO    [0m | [36m__main__[0m:[36m<module>[0m:[36m64[0m - [1msignature=inputs: 
  [string (required)]
outputs: 
  [Array({end: long (required), entity_group: string (required), score: float (required), start: long (required), word: string (required)}) (required)]
params: 
  None
[0m
2024/08/04 15:42:32 INFO mlflow.transformers: Skipping saving pretrained model weights to disk as the save_pretrained is set to False. The reference to HuggingFace Hub repository dvquys/ner-finetune-restaurant-reviews-aspects will be logged instead.


# Inference

## Via Transformers API

In [38]:
# Local model
token_classifier = pipeline(
    task_name, model=repo_name, aggregation_strategy="simple", device='cuda'
)
token_classifier('Delicious food friendly staff and one good celebration!')

model.safetensors:   0%|          | 0.00/431M [00:00<?, ?B/s]

[]

## Via MLflow Model Registry

Use transformers.load_model() to get the correct behavior or the Transformers pipeline

In [39]:
logger.info(f"Loading model from MLflow Registry at {model_info.model_uri=}...")

[32m2024-08-04 15:43:24.867[0m | [1mINFO    [0m | [36m__main__[0m:[36m<module>[0m:[36m1[0m - [1mLoading model from MLflow Registry at model_info.model_uri='runs:/474c54ec1a8943fd986aa48e9c33f636/ner_aspect'...[0m


In [40]:
token_classifier_mlflow_pipeline = mlflow.transformers.load_model(model_uri=model_info.model_uri, return_type="pipeline", aggregation_strategy="simple")

Downloading artifacts:   0%|          | 0/8 [00:00<?, ?it/s]

2024/08/04 15:43:31 INFO mlflow.transformers: 'runs:/474c54ec1a8943fd986aa48e9c33f636/ner_aspect' resolved as 'mlflow-artifacts:/1/474c54ec1a8943fd986aa48e9c33f636/artifacts/ner_aspect'


Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]



In [41]:
output = token_classifier_mlflow_pipeline(['Delicious food friendly staff and one good celebration!', 'What an amazing dining experience'])
output

[[],
 [{'entity_group': 'FOOD',
   'score': 0.46381864,
   'word': 'dining',
   'start': 16,
   'end': 22}]]

In [42]:
import json
for predictions in output:
    for prediction in predictions:
        prediction['score'] = float(prediction['score'])
json.dumps(output).encode("UTF-8")

b'[[], [{"entity_group": "FOOD", "score": 0.46381863951683044, "word": "dining", "start": 16, "end": 22}]]'

---
# Archive

PyFunc would output output in an unexpected and unusable way

## Test mlflow connection

In [4]:
logger.info(f"{mlflow.get_tracking_uri()=}")
with mlflow.start_run():
    mlflow.log_param("param1", 5)
    with open("example_artifact.txt", "w") as f:
        f.write("This is an example artifact.")
    mlflow.log_artifact("example_artifact.txt")

[32m2024-08-04 19:22:46.514[0m | [1mINFO    [0m | [36m__main__[0m:[36m<module>[0m:[36m1[0m - [1mmlflow.get_tracking_uri()='https://dev-reviews-parsing-mlsys.endpoints.cold-embrace-240710.cloud.goog/mlflow'[0m
