<a href="https://colab.research.google.com/github/cnatale/Gists-and-Colabs/blob/main/Finetuning_Mistral_7B_Instruct_for_Text_to_Presto_Athena_SQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Note: this notebook requires a GPU instance with at minimum 40GB of VRAM.

### Install Required Packages

In [1]:
import locale
locale.getpreferredencoding = lambda do_setlocale=True: "UTF-8"

In [2]:
!pip install transformers trl accelerate torch bitsandbytes peft datasets -qU

#### Load Local Dataset

I've pre-processed the Spider and WikiSQL datasets. Load them into memory as a HuggingFace Dataset.

In [3]:
# Download the training + validation data from the Hugging Face repo

!mkdir -p training_data
!wget https://huggingface.co/datasets/cnatale/presto-athena-txt-2-sql/resolve/main/train_randomized.jsonl -O training_data/train.jsonl
!wget https://huggingface.co/datasets/cnatale/presto-athena-txt-2-sql/resolve/main/valid_randomized.jsonl -O training_data/valid.jsonl

--2024-01-03 13:59:09--  https://huggingface.co/datasets/cnatale/presto-athena-txt-2-sql/resolve/main/train_randomized.jsonl
Resolving huggingface.co (huggingface.co)... 13.33.33.110, 13.33.33.55, 13.33.33.20, ...
Connecting to huggingface.co (huggingface.co)|13.33.33.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1828018 (1.7M) [text/plain]
Saving to: ‘training_data/train.jsonl’


2024-01-03 13:59:10 (44.2 MB/s) - ‘training_data/train.jsonl’ saved [1828018/1828018]

--2024-01-03 13:59:10--  https://huggingface.co/datasets/cnatale/presto-athena-txt-2-sql/resolve/main/valid_randomized.jsonl
Resolving huggingface.co (huggingface.co)... 13.33.33.55, 13.33.33.110, 13.33.33.102, ...
Connecting to huggingface.co (huggingface.co)|13.33.33.55|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 266088 (260K) [text/plain]
Saving to: ‘training_data/valid.jsonl’


2024-01-03 13:59:10 (45.8 MB/s) - ‘training_data/valid.jsonl’ saved [266088/266088

In [4]:
from datasets import Dataset
import json
import pandas as pd

def read_jsonl(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            json_data = json.loads(line.strip())
            data.append(json_data)
    return data

# Read the train and validation files
train_data = read_jsonl('training_data/train.jsonl')
valid_data = read_jsonl('training_data/valid.jsonl')

# Convert to pandas DataFrame
train_df = pd.DataFrame(train_data)
valid_df = pd.DataFrame(valid_data)

# Convert DataFrame to Huggingface Dataset
train_dataset = Dataset.from_pandas(train_df)
valid_dataset = Dataset.from_pandas(valid_df)

# Example of processing
# train_texts = [example['text'] for example in train_dataset]
# valid_texts = [example['text'] for example in valid_dataset]

instruct_tune_dataset = {
    "train": train_dataset,
    "test": valid_dataset
}

#### Data structure

Each record contains two columns, `question` and `answer`. The `question` column holds the natural-language question together with additional table information, and the `answer` column holds the target SQL query.

In [5]:
instruct_tune_dataset["train"][0]

{'question': 'Find the id and city of the student address with the highest average monthly rental.\nAdditional table information: table: behavior_monitoring',
 'answer': 'SELECT T2.address_id, T1.city FROM Addresses AS T1 JOIN Student_Addresses AS T2 ON T1.address_id = T2.address_id GROUP BY T2.address_id ORDER BY AVG(monthly_rental) DESC LIMIT 1'}
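
A quick way to confirm the column names is to inspect the keys of each JSONL record. The sketch below is a minimal, self-contained illustration: it writes two made-up records to a temporary file and reports any line whose keys are not exactly `question` and `answer`. The sample records and the `check_columns` helper are hypothetical and not part of the dataset or the notebook.

```python
import json
import os
import tempfile

EXPECTED_COLUMNS = {"question", "answer"}

def check_columns(file_path, expected=EXPECTED_COLUMNS):
    """Return the 1-based line numbers of records whose keys differ from `expected`."""
    bad_lines = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line_number, line in enumerate(file, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if set(record) != expected:
                bad_lines.append(line_number)
    return bad_lines

# Hypothetical sample records, for illustration only
sample_records = [
    {"question": "How many dogs are there?", "answer": "SELECT COUNT(*) FROM Dogs"},
    {"question": "List all owners.", "response": "SELECT * FROM Owners"},  # wrong key
]

with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    for record in sample_records:
        tmp.write(json.dumps(record) + "\n")
    sample_path = tmp.name

bad = check_columns(sample_path)
os.remove(sample_path)
print(bad)
```

Running the same helper on `training_data/train.jsonl` would be expected to return an empty list if every record has the two columns shown in the output above.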

We will use just a small subset of the data for this training example.

In [6]:
instruct_tune_dataset["train"] = instruct_tune_dataset["train"].select(range(1000))

In [7]:
instruct_tune_dataset["test"] = instruct_tune_dataset["test"].select(range(200))

In [8]:
instruct_tune_dataset

{'train': Dataset({
     features: ['question', 'answer'],
     num_rows: 1000
 }),
 'test': Dataset({
     features: ['question', 'answer'],
     num_rows: 200
 })}

#### Create Formatted Prompt

The following function merges the `question` and `answer` columns into a single training string using this template:

```
<s>[INST] <<SYS>>Write a SQL query or use a function to answer the following question. Use the SQL dialect Presto for AWS Athena.<</SYS>>\n\nWhat is the average age of the dogs who have gone through any treatments?\nAdditional table information: table: dog_kennels [/INST] SELECT AVG(age) FROM Dogs WHERE dog_id IN (SELECT dog_id FROM Treatments)</s>
```

In [9]:
def create_prompt(sample):
  """
  Build the full training prompt:
  combine the system message, question, and answer into a single string.

  """
  bos_token = "<s>"
  original_system_message = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
  system_message = "Write a SQL query or use a function to answer the following question. Use the SQL dialect Presto for AWS Athena."
  question = sample["question"].replace(original_system_message, "").strip()
  response = sample["answer"].strip()
  eos_token = "</s>"

  full_prompt = ""
  full_prompt += bos_token
  full_prompt += "[INST] <<SYS>>" + system_message + "<</SYS>>\n\n"
  full_prompt += question + " [/INST] "
  full_prompt += response
  full_prompt += eos_token

  return full_prompt

In [10]:
create_prompt(instruct_tune_dataset["test"][0])

'<s>[INST] <<SYS>>Write a SQL query or use a function to answer the following question. Use the SQL dialect Presto for AWS Athena.<</SYS>>\n\nWhat is the average age of the dogs who have gone through any treatments?\nAdditional table information: table: dog_kennels [/INST] SELECT AVG(age) FROM Dogs WHERE dog_id IN (SELECT dog_id FROM Treatments)</s>'
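
Fine-tuned models tend to respond best when the inference prompt has the same shape as the training text. The sketch below shows one way to keep the two in sync: a single helper that returns the full training string when an answer is given and only the prefix (up to `[/INST]`) otherwise. `format_prompt` is a hypothetical helper that mirrors `create_prompt`; it is not part of the notebook. Note that the hand-written test prompt used later in this notebook omits the `<<SYS>>` markers, so it does not exactly match the training format.

```python
SYSTEM_MESSAGE = (
    "Write a SQL query or use a function to answer the following question. "
    "Use the SQL dialect Presto for AWS Athena."
)

def format_prompt(question, answer=None):
    """Build the training string, or only its prompt prefix when `answer` is None."""
    prompt = "<s>[INST] <<SYS>>" + SYSTEM_MESSAGE + "<</SYS>>\n\n"
    prompt += question.strip() + " [/INST]"
    if answer is not None:
        prompt += " " + answer.strip() + "</s>"
    return prompt

question = (
    "What is the average age of the dogs who have gone through any treatments?\n"
    "Additional table information: table: dog_kennels"
)
answer = "SELECT AVG(age) FROM Dogs WHERE dog_id IN (SELECT dog_id FROM Treatments)"

training_text = format_prompt(question, answer)   # same string create_prompt builds
inference_text = format_prompt(question)          # prefix to feed the model at inference
print(training_text.startswith(inference_text))
```

Because both strings come from one function, a later change to the template cannot silently diverge between training and inference.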

### Map the Dataset

In [11]:
# Apply the map function to each dataset - Don't actually do this because it is handled in SFTTrainer invocation with the create_prompt param
# instruct_tune_dataset["train"] = instruct_tune_dataset["train"].map(create_prompt)
# instruct_tune_dataset["test"] = instruct_tune_dataset["test"].map(create_prompt)

### Loading the Base Model

Load the model in `4bit`, with double quantization, with `bfloat16` as the compute dtype.

Here we use the instruct-tuned model rather than the base model. Fine-tuning a base model would need a lot more data!
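
To get a feel for what 4-bit loading with double quantization saves, here is a rough back-of-the-envelope estimate of weight storage. It is a sketch under stated assumptions: the parameter count of about 7.24 billion is an approximate figure for Mistral 7B, and the per-parameter overheads follow the block sizes described in the QLoRA paper (64 weights per block, 256 blocks per second-level constant). Not every tensor is quantized in practice, so the real footprint is typically somewhat higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: approximate parameter count of Mistral 7B
NUM_PARAMS = 7.24e9

# Without double quantization: one 32-bit constant per block of 64 weights
PLAIN_OVERHEAD = 32 / 64                          # 0.5 bits per parameter
# With double quantization: 8-bit constants per block, plus one 32-bit
# constant per 256 blocks
DOUBLE_QUANT_OVERHEAD = 8 / 64 + 32 / (64 * 256)  # about 0.127 bits per parameter

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4 + PLAIN_OVERHEAD)
nf4_dq_gb = weight_memory_gb(NUM_PARAMS, 4 + DOUBLE_QUANT_OVERHEAD)

print(f"16-bit weights:             {bf16_gb:.2f} GB")
print(f"NF4:                        {nf4_gb:.2f} GB")
print(f"NF4 + double quantization:  {nf4_dq_gb:.2f} GB")
```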

In [12]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

In [13]:
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    device_map='auto',
    quantization_config=nf4_config,
    use_cache=False
)

In [14]:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# print the tokenizer's chat template
# print(tokenizer.default_chat_template)

Check the performance of the pre-fine-tuned model at the task:

In [15]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

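  # Remove the prompt if it appears verbatim in the decoded text
  # (it may not, e.g. when the tokenizer adds its own BOS token)
  return decoded_output[0].replace(prompt, "")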

In [16]:
prompt="<s>[INST] Write a SQL query or use a function to answer the following question. Use the SQL dialect Presto for AWS Athena.\n\nWhat is the average age of the dogs who have gone through any treatments? [/INST]"
generate_response(prompt, model)



"<s><s> [INST] Write a SQL query or use a function to answer the following question. Use the SQL dialect Presto for AWS Athena.\n\nWhat is the average age of the dogs who have gone through any treatments? [/INST] The following SQL query can be used to answer the question:\n```\nSELECT AVG(Age)\nFROM Dogs\nWHERE HasTreated = 'Yes';\n```\nNote: This assumes that there is a column named `Age` and another column named `HasTreated` in the `Dogs` table in the Presto database on AWS Athena.</s>"

### Setting up the Training
Training uses the Hugging Face `transformers` and `trl` libraries together with `peft`.

In [17]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM"
)
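
To see how many trainable parameters `r=64` adds, the arithmetic can be sketched as below. Two things here are assumptions rather than facts stated in the notebook: that only `q_proj` and `v_proj` are adapted (believed to be PEFT's default for Mistral when `target_modules` is not set), and that adapter weights are stored at 4 bytes per parameter. The model dimensions are the ones reported in the llama.cpp conversion log later in this notebook.

```python
def lora_param_count(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r) to a linear layer."""
    return r * (in_features + out_features)

# Model dimensions as reported in the conversion log later in this notebook
N_EMBD, N_HEAD, N_HEAD_KV, N_LAYER = 4096, 32, 8, 32
HEAD_DIM = N_EMBD // N_HEAD       # 128
KV_DIM = N_HEAD_KV * HEAD_DIM     # 1024 (grouped-query attention)

R, LORA_ALPHA = 64, 16            # values from the LoraConfig above

# Assumption: only q_proj and v_proj receive adapters
per_layer = (
    lora_param_count(N_EMBD, N_EMBD, R)     # q_proj: 4096 -> 4096
    + lora_param_count(N_EMBD, KV_DIM, R)   # v_proj: 4096 -> 1024
)
trainable_params = per_layer * N_LAYER
adapter_megabytes = trainable_params * 4 / 1e6   # assumption: 4 bytes per parameter
scaling = LORA_ALPHA / R                         # factor applied to the LoRA update

print(trainable_params, round(adapter_megabytes, 1), scaling)
```

Under these assumptions the adapter comes to roughly 109 MB, which is in line with the size of the `adapter_model.safetensors` file uploaded later in the notebook.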

We need to prepare the model for training in 4-bit, so we use the `prepare_model_for_kbit_training` function from `peft`.



In [18]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

### Hyper-parameters for training
These parameters depend on how long you want to train. The most important ones to consider:

`num_train_epochs`/`max_steps`: How many passes over the data (or optimizer steps) to run. Too many leads to overfitting.

`learning_rate`: Controls the speed of convergence.
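
When training by `max_steps`, it helps to know how many epochs that budget corresponds to. The sketch below shows the arithmetic. The figure of 56 training sequences is an assumption, chosen because it reproduces the `'epoch': 5.71` value that the training log later in this notebook reports for 80 steps at batch size 4; with packing enabled, the number of training sequences is the number of packed blocks, not the number of raw examples. `epochs_for_steps` is a hypothetical helper.

```python
import math

def epochs_for_steps(max_steps, num_sequences, batch_size):
    """Number of passes over the data implied by a fixed step budget."""
    steps_per_epoch = math.ceil(num_sequences / batch_size)
    return max_steps / steps_per_epoch

# Assumption: 56 packed training sequences (not stated in the notebook)
epochs = epochs_for_steps(max_steps=80, num_sequences=56, batch_size=4)
print(round(epochs, 2))
```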


In [19]:
from transformers import TrainingArguments

modelname = "Mistral-7B-Instruct-v0.1-Txt-2-Presto-SQL"

args = TrainingArguments(
  output_dir = modelname,
  #num_train_epochs=5,
  max_steps = 80, # comment out this line if you want to train in epochs
  per_device_train_batch_size = 4,
  warmup_ratio = 0.03, # fraction of steps used for warmup; has no effect with the 'constant' scheduler below
  logging_steps=10,
  save_strategy="epoch",
  #evaluation_strategy="epoch",
  evaluation_strategy="steps",
  eval_steps=10, # comment out this line if you want to evaluate at the end of each epoch
  learning_rate=1e-4,
  fp16=True,
  lr_scheduler_type='constant',
)

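Set up the trainer.

`max_seq_length`: Maximum number of tokens per training sequence. With `packing=True`, multiple short examples are concatenated to fill each sequence.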
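
The idea behind packing can be sketched in a few lines of plain Python. This is a simplified illustration, not TRL's actual implementation: examples are joined into one token stream, separated by an EOS token, and the stream is cut into blocks of `seq_length`. The token ids are toy values and `pack_sequences` is a hypothetical helper.

```python
def pack_sequences(tokenized_examples, seq_length, eos_token_id):
    """Concatenate examples (separated by EOS) and cut into fixed-length blocks.

    Trailing tokens that do not fill a whole block are dropped in this sketch.
    """
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens)
        stream.append(eos_token_id)
    num_blocks = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(num_blocks)]

# Toy token ids, for illustration only
examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
blocks = pack_sequences(examples, seq_length=4, eos_token_id=2)
print(blocks)
```

This is why a training set of 1000 short examples can turn into far fewer training sequences once packed at a length of 2048 tokens.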


In [20]:
from trl import SFTTrainer

max_seq_length = 2048

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt, # applies create_prompt to every example in the training and test datasets
  args=args,
  train_dataset=instruct_tune_dataset["train"],
  eval_dataset=instruct_tune_dataset["test"]
)

In [21]:
!nvidia-smi

Wed Jan  3 14:00:57 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0              64W / 400W |   6643MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [22]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss |
|---|---|---|
| 10 | 1.3628 | 1.109636 |
| 20 | 1.0411 | 0.893801 |
| 30 | 0.8678 | 0.780792 |
| 40 | 0.7681 | 0.727106 |
| 50 | 0.7108 | 0.685231 |
| 60 | 0.6521 | 0.657275 |
| 70 | 0.618 | 0.648421 |
| 80 | 0.5824 | 0.64707 |
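
One way to read this log is to look at how much the validation loss improves between evaluations. The sketch below uses the numbers from the log above; the interpretation is a rule of thumb, not a definitive diagnosis. Shrinking gains in validation loss while training loss keeps falling would suggest that more steps are unlikely to help much on this data subset.

```python
import math

# (step, training loss, validation loss) copied from the training log above
history = [
    (10, 1.3628, 1.109636),
    (20, 1.0411, 0.893801),
    (30, 0.8678, 0.780792),
    (40, 0.7681, 0.727106),
    (50, 0.7108, 0.685231),
    (60, 0.6521, 0.657275),
    (70, 0.618, 0.648421),
    (80, 0.5824, 0.64707),
]

def validation_improvements(history):
    """Drop in validation loss between consecutive evaluations."""
    return [
        (curr[0], prev[2] - curr[2])
        for prev, curr in zip(history, history[1:])
    ]

gains = validation_improvements(history)
for step, gain in gains:
    print(f"step {step}: validation loss improved by {gain:.4f}")

final_step, final_train, final_val = history[-1]
perplexity = math.exp(final_val)
print(f"validation perplexity at step {final_step}: {perplexity:.2f}")
print(f"validation minus training loss: {final_val - final_train:.4f}")
```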




TrainOutput(global_step=80, training_loss=0.8253897905349732, metrics={'train_runtime': 391.4091, 'train_samples_per_second': 0.818, 'train_steps_per_second': 0.204, 'total_flos': 2.675179360616448e+16, 'train_loss': 0.8253897905349732, 'epoch': 5.71})

In [23]:
trainer.save_model("mistral_7b_instruct_v0_1_txt_2_sql")

# Save Model and Push to Hub

In [24]:
!pip install huggingface-hub -qU

In [25]:
# Set the HuggingFace token as an environment variable
import os
from google.colab import userdata

os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')

In [26]:
# I switched to this login method because the notebook_login() method
# below stopped working after a HuggingfaceHub update
!huggingface-cli login --token $HF_TOKEN

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [27]:
# from huggingface_hub import notebook_login

# notebook_login()

Push only the LoRA adapter to the Hub. See below for merging the LoRA adapter back into the main model.

In [28]:
new_model_id = f"cnatale/{modelname}"
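# Note: the first positional argument of Trainer.push_to_hub is the commit message;
# the target repo name is derived from output_dir.
trainer.push_to_hub(new_model_id)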

events.out.tfevents.1704290457.4f0560a2c54d.546.0:   0%|          | 0.00/8.53k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/109M [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

training_args.bin:   0%|          | 0.00/4.73k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/cnatale/Mistral-7B-Instruct-v0.1-Txt-2-Presto-SQL/commit/1adf9e6e324056c588d8386d7745c32aa82e21fe', commit_message='cnatale/Mistral-7B-Instruct-v0.1-Txt-2-Presto-SQL', commit_description='', oid='1adf9e6e324056c588d8386d7745c32aa82e21fe', pr_url=None, pr_revision=None, pr_num=None)

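Merge the LoRA adapter into the model. The code that pushes the merged model and tokenizer to Hugging Face is commented out for now (see the comment in the cell).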

In [29]:
merged_model = model.merge_and_unload()

# This will fail until the following change, already merged into the Hugging Face Transformers repo, is released on PyPI:
# https://github.com/huggingface/transformers/pull/26037
# merged_model.save_pretrained(
#     new_model_id,
#     push_to_hub=True,
#     repo_id=new_model_id,
#     private=False,
#     use_auth_token=userdata.get('HF_TOKEN'),
# )
# tokenizer.save_pretrained(
#     new_model_id,
#     push_to_hub=True,
#     repo_id=new_model_id,
#     private=False,
#     use_auth_token=userdata.get('HF_TOKEN'),
# )



Now generate a response with the fine-tuned model for comparison:

In [30]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0]

In [31]:
generate_response(prompt, merged_model)

'<s><s> [INST] Write a SQL query or use a function to answer the following question. Use the SQL dialect Presto for AWS Athena.\n\nWhat is the average age of the dogs who have gone through any treatments? [/INST] ```sql\nSELECT AVG(age)\nFROM treatments\nJOIN dogs ON treatments.dog_id = dogs.id\n```</s>'
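
The decoded generations include the prompt, special tokens, and sometimes a Markdown code fence around the query. To run the result against Athena you would want only the SQL. The helper below is a minimal sketch of that post-processing; `extract_sql` is a hypothetical function, not part of the notebook, and the sample string is a shortened stand-in for a real generation.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker

FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:sql)?\s*\n(.*?)" + re.escape(FENCE),
    flags=re.DOTALL,
)

def extract_sql(generated_text):
    """Pull the SQL out of a decoded generation.

    Takes the text after the last [/INST], drops any </s> token, and, if the
    model wrapped its answer in a Markdown code fence, returns the fence body.
    """
    completion = generated_text.rsplit("[/INST]", 1)[-1]
    completion = completion.replace("</s>", "").strip()
    fenced = FENCED_BLOCK.search(completion)
    if fenced:
        return fenced.group(1).strip()
    return completion

# Shortened, made-up generations for illustration
fenced_sample = (
    "<s><s> [INST] What is the average age of the dogs? [/INST] "
    + FENCE + "sql\nSELECT AVG(age)\nFROM dogs\n" + FENCE + "</s>"
)
plain_sample = "<s>[INST] Count the dogs. [/INST] SELECT COUNT(*) FROM dogs</s>"

fenced_sql = extract_sql(fenced_sample)
plain_sql = extract_sql(plain_sample)
print(fenced_sql)
print(plain_sql)
```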

**Optional**: merge the LoRA adapter into the pretrained, unquantized model, then store the result on either Google Drive or Hugging Face. This won't be necessary once Hugging Face Transformers supports saving and uploading 4-bit quantized models.
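
Conceptually, merging folds the low-rank update into the base weights: for each adapted layer, the merged weight is `W + (alpha / r) * B @ A` in the standard LoRA formulation. The toy example below illustrates this with plain Python lists. It is not PEFT's implementation, and the matrices are made-up numbers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight for one linear layer."""
    scaling = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Toy 2x2 layer with a rank-1 adapter (made-up numbers)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # shape (r, in_features) = (1, 2)
B = [[0.5], [-0.5]]     # shape (out_features, r) = (2, 1)

merged = merge_lora(W, A, B, alpha=2, r=1)

# The merged layer gives the same output as base layer plus scaled adapter
x = [[1.0], [1.0]]
merged_out = matmul(merged, x)
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
separate_out = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]
print(merged, merged_out == separate_out)
```

After merging, the adapter matrices are no longer needed at inference time, which is what allows the result to be saved and converted as an ordinary model.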

In [32]:
# Import necessary libraries
import torch
import os
import logging
from tqdm.notebook import tqdm  # Use notebook version of tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

huggingface_model_path = new_model_id

# Manually set your parameters here
base_model_name_or_path = "mistralai/Mistral-7B-Instruct-v0.1"
peft_model_path = f"cnatale/{modelname}"
# output_dir = f"/content/drive/MyDrive/{modelname}"
output_dir = f"/content/{modelname}"
device = "auto"  # set to 'auto' or specify device like 'cuda:0'
push_to_hub = True  # set to True if you want to push to the Hugging Face Model Hub

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

try:
    if device == 'auto':
        device_arg = {'device_map': 'auto'}
    else:
        device_arg = {'device_map': {"": device}}

    logger.info(f"Loading base model: {base_model_name_or_path}")
    with tqdm(total=1, desc="Loading base model") as pbar:
        base_model = AutoModelForCausalLM.from_pretrained(
            base_model_name_or_path,
            return_dict=True,
            torch_dtype=torch.float16,
            **device_arg
        )
        pbar.update(1)

    logger.info(f"Loading Peft: {peft_model_path}")
    with tqdm(total=1, desc="Loading Peft model") as pbar:
        model = PeftModel.from_pretrained(base_model, peft_model_path)
        pbar.update(1)

    logger.info("Running merge_and_unload")
    with tqdm(total=1, desc="Merge and Unload") as pbar:
        model = model.merge_and_unload()
        pbar.update(1)

    tokenizer = AutoTokenizer.from_pretrained(base_model_name_or_path)

    # Save the data to a Google Drive or local directory
    # from google.colab import drive
    # drive.mount('/content/drive')
    model.save_pretrained(f"{output_dir}")
    tokenizer.save_pretrained(f"{output_dir}")
    logger.info(f"Model saved to {output_dir}")

    # Alternatively, push the model directly to Huggingface repo
    # model.save_pretrained(
    #     new_model_id,
    #     push_to_hub=True,
    #     repo_id=new_model_id,
    #     private=False,
    #     use_auth_token=userdata.get('HF_TOKEN'),
    # )
    # tokenizer.save_pretrained(
    #     new_model_id,
    #     push_to_hub=True,
    #     repo_id=new_model_id,
    #     private=False,
    #     use_auth_token=userdata.get('HF_TOKEN'),
    # )

except Exception as e:
    logger.exception("An error occurred:")
    raise

# Quantize the Model

---



In [33]:
# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt

Cloning into 'llama.cpp'...
remote: Enumerating objects: 15014, done.
remote: Counting objects: 100% (1523/1523), done.
remote: Compressing objects: 100% (65/65), done.
remote: Total 15014 (delta 1477), reused 1474 (delta 1456), pack-reused 13491
Receiving objects: 100% (15014/15014), 17.72 MiB | 16.59 MiB/s, done.
Resolving deltas: 100% (10420/10420), done.
Already up to date.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -

In [34]:
MODEL_ID = f"cnatale/{modelname}"

# Download model - skip if possible because this takes a long time
#!git lfs install
#!git clone https://huggingface.co/{MODEL_ID}

Next, convert weights to GGUF FP16 format

In [35]:
MODEL_NAME = MODEL_ID.split('/')[-1]

# Convert to fp16
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.fp16.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}

Loading model file Mistral-7B-Instruct-v0.1-Txt-2-Presto-SQL/model-00001-of-00003.safetensors
Loading model file Mistral-7B-Instruct-v0.1-Txt-2-Presto-SQL/model-00001-of-00003.safetensors
Loading model file Mistral-7B-Instruct-v0.1-Txt-2-Presto-SQL/model-00002-of-00003.safetensors
Loading model file Mistral-7B-Instruct-v0.1-Txt-2-Presto-SQL/model-00003-of-00003.safetensors
params = Params(n_vocab=32000, n_embd=4096, n_layer=32, n_ctx=32768, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=10000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('Mistral-7B-Instruct-v0.1-Txt-2-Presto-SQL'))
Vocab info: <VocabLoader with 32000 base tokens and 0 added tokens>
Special vocab info: <SpecialVocab with 58980 merges, special tokens {'bos': 1, 'eos': 2, 'unk': 0}, add special tokens {'bos': True, 'eos': False}>
Permuting layer 0
Permuting layer 1
Per

Perform the quantization

In [36]:
# QUANTIZATION_METHODS = ["q4_0", "q4_k_m", "q5_k_m"]
QUANTIZATION_METHODS = ["q4_0"]

for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    !./llama.cpp/quantize {fp16} {qtype} {method}

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0, VMM: yes
main: build = 1759 (7bed7eb)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'Mistral-7B-Instruct-v0.1-Txt-2-Presto-SQL/mistral-7b-instruct-v0.1-txt-2-presto-sql.fp16.bin' to 'Mistral-7B-Instruct-v0.1-Txt-2-Presto-SQL/mistral-7b-instruct-v0.1-txt-2-presto-sql.Q4_0.gguf' as Q4_0
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from Mistral-7B-Instruct-v0.1-Txt-2-Presto-SQL/mistral-7b-instruct-v0.1-txt-2-presto-sql.fp16.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str  
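
As a sanity check on the size of the quantized file, the storage cost of `q4_0` can be estimated by hand. The sketch assumes the usual description of the format, blocks of 32 weights stored as 4-bit values plus one 16-bit scale per block, and an approximate parameter count of 7.24 billion for Mistral 7B. Both are assumptions, not figures taken from this notebook.

```python
def q4_0_bits_per_weight(block_size=32, scale_bits=16, weight_bits=4):
    """Average bits per weight: 4-bit values plus one 16-bit scale per block."""
    return (block_size * weight_bits + scale_bits) / block_size

NUM_PARAMS = 7.24e9  # assumption: approximate parameter count of Mistral 7B

bits = q4_0_bits_per_weight()
estimated_gb = NUM_PARAMS * bits / 8 / 1e9
print(bits, round(estimated_gb, 2))
```

The estimate lands slightly below the 4.11G reported for the uploaded GGUF file later in this notebook, possibly because some tensors are kept at higher precision and the file also carries metadata.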

Run the model with llama.cpp

In [37]:
import os

model_list = [file for file in os.listdir(MODEL_NAME) if "gguf" in file]

# prompt = input("Enter your prompt: ")
# chosen_method = input("Name of the model (options: " + ", ".join(model_list) + "): ")

chosen_method = model_list[0]

# Verify the chosen method is in the list
if chosen_method not in model_list:
    print("Invalid name")
else:
    # Build the path from the selected file rather than the leftover loop variable `method`
    qtype = f"{MODEL_NAME}/{chosen_method}"
    !./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"

Log start
main: build = 1759 (7bed7eb)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1704291202
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from Mistral-7B-Instruct-v0.1-Txt-2-Presto-SQL/mistral-7b-instruct-v0.1-txt-2-presto-sql.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_len
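
Interpolating `{prompt}` inside double quotes in a `!` shell command can break when the prompt itself contains quotes or other shell-sensitive characters. One alternative is to build the argument list in Python and pass it to `subprocess.run`, which bypasses shell quoting entirely. The sketch below only assembles and prints the command, using the same flags as the cell above. `build_llama_cpp_command` and the model path are hypothetical.

```python
import shlex

def build_llama_cpp_command(model_path, prompt, n_predict=128, n_gpu_layers=35):
    """Assemble the argument list for llama.cpp's `main` binary."""
    return [
        "./llama.cpp/main",
        "-m", model_path,
        "-n", str(n_predict),
        "--color",
        "-ngl", str(n_gpu_layers),
        "-p", prompt,
    ]

cmd = build_llama_cpp_command(
    "model.Q4_0.gguf",  # hypothetical path
    'List the "dogs" table.\nUse Presto SQL.',
)

# shlex.join shows a shell-safe rendering of the same command
print(shlex.join(cmd))

# To execute it without going through a shell:
# import subprocess
# subprocess.run(cmd, check=True)
```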

Push the quantized model to Huggingface

In [38]:
# Set the HuggingFace token as an environment variable
import os
from google.colab import userdata

os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')

In [39]:
# !pip install huggingface-hub -qU

# # I switched to this login method because the notebook_login() method
# # below stopped working after a HuggingfaceHub update
# !huggingface-cli login --token $HF_TOKEN

In [40]:
# !pip install -q huggingface_hub
from huggingface_hub import create_repo, HfApi
from google.colab import userdata

# Defined in the secrets tab in Google Colab
hf_token = userdata.get('HF_TOKEN')

api = HfApi()
username = "cnatale"

# Create empty repo for GGUF version of model
create_repo(
    repo_id = f"{username}/{MODEL_NAME}-lo-lora-GGUF",
    repo_type="model",
    exist_ok=True,
    token=hf_token
)

# Upload gguf files
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=f"{username}/{MODEL_NAME}-lo-lora-GGUF",
    allow_patterns=f"*.gguf",
    token=hf_token
)

mistral-7b-instruct-v0.1-txt-2-presto-sql.Q4_0.gguf:   0%|          | 0.00/4.11G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/cnatale/Mistral-7B-Instruct-v0.1-Txt-2-Presto-SQL-lo-lora-GGUF/commit/9643b86cda9a3229a8b29dc30136628a7bb8824e', commit_message='Upload folder using huggingface_hub', commit_description='', oid='9643b86cda9a3229a8b29dc30136628a7bb8824e', pr_url=None, pr_revision=None, pr_num=None)