Use a GPU instance for the training below.
The idea is to clone Andrej Karpathy's [llama2.c repository](https://github.com/karpathy/llama2.c) and adapt it for your own purposes.
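
Before starting, it can help to confirm that the instance actually exposes a GPU. The cell below is a minimal sketch that assumes an NVIDIA runtime where the `nvidia-smi` tool is on the PATH; the helper name `gpu_visible` is hypothetical and not part of llama2.c.

```python
import shutil
import subprocess

def gpu_visible() -> bool:
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA tooling found (assumption: an NVIDIA GPU runtime is expected)
        return False
    try:
        # "nvidia-smi -L" prints one line per detected GPU
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout

print("GPU visible:", gpu_visible())
```

If this prints `False`, switch the runtime to a GPU instance before running the training cell.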

In [None]:
!git clone https://github.com/karpathy/llama2.c.git
%cd llama2.c

After cloning, upload the .txt file for training and the two Python scripts, pre_training_script.py and train_abstract.py.
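
A quick way to confirm the upload worked is to check that the files exist in the current directory. This is only a sketch: `missing_files` is a hypothetical helper, and the training file name is taken from the commands used later in this notebook, so replace it with your own.

```python
from pathlib import Path

def missing_files(directory, names):
    """Return the entries of `names` that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]

# File names assumed from this notebook; adjust to what you uploaded
required = ["pre_training_script.py", "train_abstract.py", "TinyStories-train.txt"]
missing = missing_files(".", required)
print("Missing:", missing if missing else "nothing")
```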

In [None]:
!pip install -r requirements.txt

The vocab_size parameter below can be changed. A larger vocabulary slows inference and produces a much larger model, so you can set it lower. Llama uses 32000, which is a very large number for a tokenizer. If the SentencePiece library requires a lower value, it will throw an error and suggest the maximum value that can be used, e.g. 2580.
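
To see why the vocabulary size matters for model size, note that the token embedding table holds one vector per vocabulary entry. The sketch below only does that arithmetic; the width `DIM = 288` is an assumed value for illustration, so use the `dim` you set in train_abstract.py.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """Number of weights in a token embedding table of shape (vocab_size, dim)."""
    return vocab_size * dim

DIM = 288  # assumed embedding width, for illustration only

for vocab_size in (1200, 2580, 32000):
    params = embedding_params(vocab_size, DIM)
    # float32 stores each weight in 4 bytes
    print(f"vocab_size={vocab_size:>6}: {params:>10,} embedding weights, "
          f"{params * 4 / 1e6:.1f} MB in float32")
```

The final projection to logits scales with the vocabulary in the same way, which is one reason a smaller vocabulary is faster at inference on small devices.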

In [None]:
import time
!python pre_training_script.py train_vocab --vocab_size=1200 --path_to_text=/content/llama2.c/TinyStories-train.txt
time.sleep(5)
!python pre_training_script.py pretokenize --vocab_size=1200 --path_to_text=/content/llama2.c/TinyStories-train.txt

In [None]:
#For C++ in ESP32 env
#Open file and change path to tokenizer.model
#and the path to where you want to save .bin file
#!python tokenizer_cpp.py

In [4]:
!python tokenizer.py --tokenizer-model=/content/llama2.c/data/tok1200.model

If you open the train_abstract.py file, you will see many parameters in the first 74 lines. You can adjust them; the most important are:

1. always_save_checkpoint
2. init_from
3. batch_size
4. max_seq_len
5. All model parameters. Check [Chinchilla paper](https://arxiv.org/pdf/2203.15556.pdf) at the bottom.
6. device
7. dtype -> Use float32 or float16 for compatibility with GPUs and to make sure everything runs correctly on Android.
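
When choosing the model parameters, a rough parameter count helps relate model size to the amount of training text. The sketch below assumes a Llama-style transformer with tied input/output embeddings, as many attention key/value heads as query heads, and a feed-forward width of about 8/3 of `dim` rounded up to a multiple of 32. These are assumptions about the architecture, not values read from train_abstract.py. The factor of 20 tokens per parameter is a common rule-of-thumb reading of the Chinchilla paper, not an exact prescription.

```python
def ffn_hidden_dim(dim: int, multiple_of: int = 32) -> int:
    """Feed-forward width: about 8/3 * dim, rounded up to a multiple of `multiple_of`."""
    hidden = int(2 * (4 * dim) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

def estimate_params(vocab_size: int, dim: int, n_layers: int, multiple_of: int = 32) -> int:
    """Approximate weight count under the assumptions stated above."""
    hidden = ffn_hidden_dim(dim, multiple_of)
    attention = 4 * dim * dim          # query, key, value and output projections
    feed_forward = 3 * dim * hidden    # three feed-forward matrices
    norms = 2 * dim                    # two normalization weight vectors per layer
    per_layer = attention + feed_forward + norms
    # tied embedding table + all layers + final normalization
    return vocab_size * dim + n_layers * per_layer + dim

# Illustrative configuration (assumed values, not the notebook's actual settings)
n_params = estimate_params(vocab_size=1200, dim=288, n_layers=6)
print(f"parameters: {n_params:,}")
print(f"rule-of-thumb training tokens (20 per parameter): {20 * n_params:,}")
```

If your dataset has far fewer tokens than this estimate, a smaller `dim` or fewer layers is probably a better starting point.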

In [None]:
!python train_abstract.py --vocab_source=custom --vocab_size=1200

Next we prepare the environment for C execution. We can test that everything works before moving to Android.

In [6]:
!make runfast

gcc -Ofast -o run run.c -lm
gcc -Ofast -o runq runq.c -lm


In [None]:
#@title Generate Stories

model_file = '/content/llama2.c/out/model.bin'
tokenizer = '/content/llama2.c/data/tok1200.bin'

# Generate args
max_token = 96 #@param {type:"slider", min:32, max:1024, step:32}
temperature = 0 #@param {type:"slider", min:0.0, max:1, step:0.05}
top_p = 0.9 #@param {type:"slider", min:0.0, max:1.0, step:0.05}
prompt = "the music" #@param {type:"string"}

print(f"model: {model_file}, max_token: {max_token}, temperature: {temperature}, top_p: {top_p}, prompt: {prompt}")
print(f"----------------------------\n")

cmd = f'./run {model_file} -z {tokenizer} -t {temperature} -p {top_p} -n {max_token} -i "{prompt}"'
!{cmd}
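
The cell above builds the command as one shell string, which breaks if the prompt contains a double quote. A possible alternative is to build an argument list with the same flags; `build_run_command` is a hypothetical helper, and the paths are relative placeholders.

```python
import shlex

def build_run_command(model_file, tokenizer, temperature, top_p, max_tokens, prompt):
    """Build the ./run argument list using the same flags as the cell above."""
    return ["./run", str(model_file),
            "-z", str(tokenizer),
            "-t", str(temperature),
            "-p", str(top_p),
            "-n", str(max_tokens),
            "-i", prompt]

args = build_run_command("out/model.bin", "data/tok1200.bin", 0.0, 0.9, 96, 'she said "hi"')
# shlex.join shows a correctly quoted shell equivalent of the argument list
print(shlex.join(args))
# To execute it: import subprocess; subprocess.run(args, check=True)
```

Passing a list to `subprocess.run` avoids the shell entirely, so the prompt text reaches the program unchanged.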

The two files used above, model.bin and the tokenizer .bin file (tok1200.bin here), are required inside the Android project.
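
Before copying model.bin to the device, you can check that it matches the configuration you trained. The sketch below assumes the file uses llama2.c's legacy export layout, which to my understanding begins with seven little-endian 32-bit integers in the order listed in `FIELDS`, with a negative vocab_size signalling unshared classifier weights. If your export uses a different version, the values will not make sense.

```python
import struct
from pathlib import Path

# Assumed header layout of the legacy llama2.c export (seven int32 values)
FIELDS = ("dim", "hidden_dim", "n_layers", "n_heads",
          "n_kv_heads", "vocab_size", "max_seq_len")
HEADER_FORMAT = "<7i"

def read_model_header(path):
    """Read the assumed 28-byte header and return it as a dict."""
    size = struct.calcsize(HEADER_FORMAT)
    with open(path, "rb") as f:
        raw = f.read(size)
    if len(raw) < size:
        raise ValueError(f"{path} is too small to contain a model header")
    header = dict(zip(FIELDS, struct.unpack(HEADER_FORMAT, raw)))
    # Assumption: a negative vocab_size marks unshared classifier weights
    header["shared_classifier"] = header["vocab_size"] > 0
    header["vocab_size"] = abs(header["vocab_size"])
    return header

model_path = Path("out/model.bin")  # relative to the llama2.c directory
if model_path.is_file():
    print(read_model_header(model_path))
else:
    print(f"{model_path} not found; train the model first")
```

The reported vocab_size should equal the value passed to the training commands (1200 in this notebook).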

Example of converting an NLP dataset to a .txt file that you can use above for train_vocab and pretokenize.

In [14]:
import tensorflow as tf
import tensorflow_datasets as tfds
import time
import numpy

In [15]:
start = time.time()
# nqa = tfds.load('natural_questions', as_supervised=False)

nqa = tfds.load('natural_questions/longt5', as_supervised=False)


end = time.time()
print("TOTAL TIME ELAPSED: ", end - start)
print(nqa['train'])

Downloading and preparing dataset 41.97 GiB (download: 41.97 GiB, generated: 8.91 GiB, total: 50.88 GiB) to /root/tensorflow_datasets/natural_questions/longt5/0.1.0...



Dataset natural_questions downloaded and prepared to /root/tensorflow_datasets/natural_questions/longt5/0.1.0. Subsequent calls will reuse this data.
TOTAL TIME ELAPSED:  76.28314518928528
<_PrefetchDataset element_spec={'all_answers': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'answer': TensorSpec(shape=(), dtype=tf.string, name=None), 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>


In [17]:
prefetchdataset = nqa['train']
print(len(prefetchdataset))

def to_text(tensor):
  # Decode a scalar tf.string tensor to a Python str.
  # This avoids the b'...' wrapper and escaped bytes that str(bytes) produces.
  return tensor.numpy().decode('utf-8')

all_qna = []
samples = 307373  # upper bound on the number of pairs to collect
for element in prefetchdataset:
  answer = to_text(element['answer'])
  # Skip examples that have no answer
  if "NULL" in answer:
    continue
  all_qna.append(to_text(element['question']) + " = " + answer)

  if len(all_qna) == samples:
    break
print(all_qna)

307373


In [18]:
# Name of the text file
file_name = "output.txt"

# Open the file in write mode and write each string to a new line
with open(file_name, 'w', encoding='utf-8') as file:
    for string in all_qna:
        file.write(f"{string}\n")

print(f"Strings written to {file_name}")


Strings written to output.txt
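
One thing to watch in a conversion like the one above: the file stores one pair per line, so a newline inside a question or an answer would split a pair across lines. The sketch below shows one possible way to normalize each pair first. `format_pair` is a hypothetical helper that takes the raw bytes returned by `.numpy()` on a string tensor, and it keeps the same "NULL" filter as the loop above.

```python
def format_pair(question: bytes, answer: bytes):
    """Return 'question = answer' on a single line, or None if there is no answer."""
    # Decode the bytes, then collapse every run of whitespace
    # (including newlines and tabs) into a single space
    q = " ".join(question.decode("utf-8", errors="replace").split())
    a = " ".join(answer.decode("utf-8", errors="replace").split())
    if "NULL" in a:
        return None
    return f"{q} = {a}"

print(format_pair(b"who wrote\nhamlet", b"William  Shakespeare"))
print(format_pair(b"some question", b"NULL"))
```

Inside the loop this could be called as `format_pair(element['question'].numpy(), element['answer'].numpy())`, appending the result only when it is not `None`.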
