# Possibilities for ~~TensorFlow Lite~~ LiteRT usage

#### Installing dependencies (may require several GB of storage, I'm so sorry for that :((( ):
I strongly recommend creating a separate venv for this branch. Otherwise the `optimum` installation might go crazy and download like 30 versions of tensorflow, each 600MB big (idk why)
```bash
pip install optimum[exporters-tf]
```
```bash
pip install -r requirements.txt
```

## Getting a .tflite model
In this section I present different methods for getting a .tflite model file and note the results of my experiments.

### optimum tool
Optimum is a tool that converts selected Hugging Face models to the .tflite format with a simple shell command. Here is the [list](https://huggingface.co/docs/optimum/exporters/tflite/overview) of supported architectures. Note that this list contains architectures, not checkpoints, so in theory we can download fine-tuned or quantized versions of models. In practice I have tested several BERT derivatives and none of them worked. \
Command for converting model directly from HF:
```shell
optimum-cli export tflite --model google-bert/bert-base-uncased --sequence_length 128 bert_tflite/
```
The above command generates a directory containing several files. The most important ones are `model.tflite`, containing the model itself, and `tokenizer.json`, with the token-to-id mapping.
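
BERT vocabularies are WordPiece vocabularies, so a word that is missing from `tokenizer.json` is split into smaller pieces, with continuation pieces prefixed by `##`. Below is a minimal sketch of that greedy longest-match lookup. The `wordpiece` and `encode` helpers and the toy vocabulary (with made-up ids) are assumptions for illustration; the real tokenizer also normalizes text and splits punctuation, which this sketch skips.

```python
# Toy vocabulary standing in for tokenizer.json["model"]["vocab"] (ids are made up)
TOY_VOCAB = {"[UNK]": 0, "play": 1, "##ing": 2, "##ed": 3, "paris": 4}


def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Split one lowercase word into vocabulary pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocab
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, and map every piece to its id."""
    return [vocab[p] for w in text.lower().split() for p in wordpiece(w, vocab)]


print(wordpiece("playing", TOY_VOCAB))
print(encode("Paris played xyz", TOY_VOCAB))
```

With the real file, `TOY_VOCAB` would be replaced by the `["model"]["vocab"]` dictionary loaded from `tokenizer.json`.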

### tf.lite.TFLiteConverter
A solution available in the `tensorflow` library that converts models from the following formats ([docs](https://www.tensorflow.org/api_docs/python/tf/lite/TFLiteConverter)):
* JAX model
* Keras model
* SavedModel format
* tf.ConcreteFunction

In theory, converting Keras and ConcreteFunction models makes a large number of checkpoints available to us. Notes on each format:
* I have not tested JAX models yet.
* Probably the easiest way of getting transformers in Keras is [KerasNLP](https://keras.io/api/keras_nlp/models/). They even have a nice [tutorial](https://github.com/tensorflow/codelabs/blob/main/KerasNLP/io2023_workshop.ipynb) on fine-tuning and exporting LLMs to .tflite. Sadly, it no longer works. Regardless, below I have placed a demo of exporting GPT2.
* SavedModel gives us the opportunity to use TensorFlow models.
* tf.ConcreteFunction can be used to export functions decorated with `@tf.function`. Sounds useful, but I wasn't able to get it working.

#### Demo of exporting Keras model

The model saved above is generally hard to use because of its strange input/output format. I had no time to make it work (this requires further investigation if we want to use tflite).

### ai_edge_torch
Another [tool](https://ai.google.dev/edge/litert/models/convert_pytorch) provided by Google that lets us convert PyTorch models to tflite. It supports quantization.
* Left for further research: I did not manage to get a functional demo with LLMs working.

## Deploying model on mobile
### Local Inference
#### TF Lite wrappers
TensorFlow Lite provides wrappers for some classes of models; one example is BERT for the question answering task. [Here](https://github.com/tensorflow/examples/tree/master/lite/examples/bert_qa/android) is a demo showing it.\
Note: to get the example running I had to downgrade the JDK to 17, but I do not guarantee that this will work for you.
#### Interpreter API
The Interpreter API is a more flexible way of running `.tflite` models. The `app/` directory contains a demo application that uses it.
* Interpreter [docs](https://ai.google.dev/edge/api/tflite/java/org/tensorflow/lite/Interpreter) for Java - to be used on the edge
* Interpreter [docs](https://www.tensorflow.org/api_docs/python/tf/lite/Interpreter) for Python - to be used on the dev machine

The Interpreter is an object that runs a model encoded in a .tflite file. A single inference can be done with `run(...)`, or with `runForMultipleInputsOutputs(...)` if the model takes multiple inputs and/or returns multiple outputs.
#### Working demo
I have prepared an app that uses the Interpreter API and runs BERT inference locally to perform a mask-filling task.
This demo uses the files generated by the optimum tool, so generate the tflite model before deploying the app.
The .tflite and tokenizer.json files should be placed in the `/main/assets` directory of the Android Studio project. To create the `assets` dir properly, right-click the folder -> New -> Folder -> Assets Folder. Files there can be accessed via `Context.assets.openFd(path)`.
##### On specifying proper input for the Interpreter
To get info about the input required by a .tflite model:
1. Load the model into a notebook

```python
import tensorflow as tf
interpreter = tf.lite.Interpreter(model_path="bert_tflite/model.tflite")
```


2. Run the following code

```python
input_details = interpreter.get_input_details()

# Print input details
for i, input_tensor in enumerate(input_details):
    print(f"Input {i}:")
    print(f"  Name: {input_tensor['name']}")
    print(f"  Shape: {input_tensor['shape']}")
    print(f"  Data type: {input_tensor['dtype']}")
    print(f"  Quantization parameters: {input_tensor['quantization']}")

# Print output details
output_details = interpreter.get_output_details()
for i, output_tensor in enumerate(output_details):
    print(f"Output {i}:")
    print(f"  Name: {output_tensor['name']}")
    print(f"  Shape: {output_tensor['shape']}")
    print(f"  Data type: {output_tensor['dtype']}")
```

```text
Input 0:
  Name: model_attention_mask:0
  Shape: [  1 128]
  Data type: <class 'numpy.int64'>
  Quantization parameters: (0.0, 0)
Input 1:
  Name: model_input_ids:0
  Shape: [  1 128]
  Data type: <class 'numpy.int64'>
  Quantization parameters: (0.0, 0)
Input 2:
  Name: model_token_type_ids:0
  Shape: [  1 128]
  Data type: <class 'numpy.int64'>
  Quantization parameters: (0.0, 0)
Output 0:
  Name: StatefulPartitionedCall:0
  Shape: [    1   128 30522]
  Data type: <class 'numpy.float32'>
```


A working example is provided in the `app/` directory mentioned before. The app there performs a single mask substitution using the BERT model.
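
Going by the shapes printed above, each of the three inputs is a `[1, 128]` int64 tensor. Below is a minimal sketch of how such inputs are conventionally laid out for BERT: `[CLS]` first, `[SEP]` last, padding on the right, and an attention mask with 1 for real tokens and 0 for padding. The `build_inputs` helper and the toy vocabulary ids are made up for illustration; the real ids come from `tokenizer.json`, and I have not checked this layout against the exported model.

```python
SEQ_LEN = 128  # matches --sequence_length used for the export

# Toy ids standing in for the real vocabulary (hypothetical values)
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3,
             "paris": 4, "is": 5, "of": 6, "france": 7}


def build_inputs(words, vocab, seq_len=SEQ_LEN):
    """Return (input_ids, attention_mask, token_type_ids), each shaped [1, seq_len]."""
    ids = [vocab["[CLS]"]] + [vocab[w] for w in words] + [vocab["[SEP]"]]
    if len(ids) > seq_len:
        raise ValueError(f"sequence of {len(ids)} tokens exceeds {seq_len}")
    pad = seq_len - len(ids)
    input_ids = ids + [vocab["[PAD]"]] * pad   # right-pad with [PAD]
    attention_mask = [1] * len(ids) + [0] * pad  # 1 = real token, 0 = padding
    token_type_ids = [0] * seq_len               # single-sentence input
    return [input_ids], [attention_mask], [token_type_ids]


words = ["paris", "is", "[MASK]", "of", "france"]
input_ids, attention_mask, token_type_ids = build_inputs(words, TOY_VOCAB, seq_len=10)
print(input_ids)
print(attention_mask)
```

Since the interpreter lists the inputs in its own order (the attention mask comes first in the listing above), it is safer to match tensors by the `name` field from `get_input_details()` than by position.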

#### Local example of inference with BERT and the Interpreter API

```python
import json
import numpy as np

# Load the vocabulary produced by optimum and build an id -> token map
with open("bert_tflite/tokenizer.json", "r", encoding="utf-8") as f:
    tokens = json.load(f)
rmap = {v: k for k, v in tokens["model"]["vocab"].items()}

def detokenize(tensor):
    """Pick the most likely token at every position of a [seq_len, vocab] array."""
    res = []
    for e in tensor:
        i = np.argmax(e)
        res.append(rmap[int(i)])
    return res

def tokenize(text):
    """Naive whitespace tokenizer: every word must already be in the vocabulary."""
    return [tokens["model"]["vocab"][w] for w in text.split()]

interpreter.allocate_tensors()
interpreter.reset_all_variables()
inp = tokenize("[SEP] paris is [MASK] of france")
# Input order follows the listing above: 0 = attention mask, 1 = input ids, 2 = token type ids
# Input ids are left-padded with zeros up to the sequence length of 128
interpreter.set_tensor(input_details[1]['index'], tf.constant([([0] * (128 - len(inp))) + inp], dtype=tf.int64))
interpreter.set_tensor(input_details[0]['index'], tf.constant([[0] * 128], dtype=tf.int64))
interpreter.set_tensor(input_details[2]['index'], tf.constant([[0] * 128], dtype=tf.int64))

interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
detokenize(output_data[0])[-5:]
```

```text
['paris', 'is', 'capital', 'of', 'france']
```
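
The example above takes the argmax at every position, so it mostly echoes the input back. For mask filling only the `[MASK]` position matters, and looking at several candidates there is usually more informative. Here is a minimal sketch with toy logits; `top_k_at` and the four-token vocabulary are hypothetical, and with the real model the row would be `output_data[0][mask_position]` with 30522 entries.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_at(logits_row, id_to_token, k=3):
    """Return the k most likely (token, probability) pairs for one position."""
    probs = softmax(logits_row)
    best = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return [(id_to_token[i], probs[i]) for i in best]


# Toy stand-ins: a 4-token vocabulary and the logits of the single [MASK] position
id_to_token = {0: "city", 1: "capital", 2: "king", 3: "part"}
mask_logits = [1.0, 4.0, 0.5, 2.0]

for token, p in top_k_at(mask_logits, id_to_token, k=2):
    print(f"{token}: {p:.3f}")
```

The mask position itself can be found by searching the input ids for the id of `[MASK]` before running the interpreter.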