## Falcon-7b in 4-bit

This notebook uses llama-index with the [tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) model. We also use [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) to load the model in 4-bit precision, so that it fits in just under 8GB of GPU memory rather than 16GB.
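
As a rough sanity check on those memory figures, the sketch below estimates the memory taken by the weights alone at different precisions. The parameter count of roughly 7 billion is an assumption taken from the model name, and `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: roughly 7 billion parameters, as the model name suggests.
N_PARAMS = 7e9

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit weights

print(f"16-bit weights: ~{fp16_gb:.1f} GB, 4-bit weights: ~{nf4_gb:.1f} GB")
```

Actual usage is higher than the 4-bit estimate because some layers are typically kept in higher precision and the CUDA context and quantization constants also take memory. The second `nvidia-smi` call in this notebook reports 5639MiB after loading.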

In [1]:
!nvidia-smi

Thu Jun 15 10:51:27 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-PCIE...  On   | 00000001:00:00.0 Off |                  Off |
| N/A   31C    P0    25W / 250W |     81MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
from llama_index import (
    LangchainEmbedding,
    VectorStoreIndex,
    PromptHelper,
    LLMPredictor,
    ServiceContext,
    Document
)
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.llms.base import LLM

import pandas as pd
import torch
import transformers
from transformers import (
    pipeline,
    AutoModelForCausalLM,
    AutoTokenizer
)

import logging
logging.getLogger().setLevel(logging.CRITICAL)

In [3]:
transformers.__version__

'4.30.2'

The next part needs `bitsandbytes`, so install it with pip if you have not already. Note also that `BitsAndBytesConfig` is a new class in `transformers`, so make sure that you are running a recent enough version. We are running 4.30.2 here, which is fine.
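
A minimal sketch of how the version check could be automated is shown below. The minimum version `4.30.0` is an assumption about when the 4-bit options became available, and `version_tuple` and `is_recent_enough` are hypothetical helpers written for this example.

```python
import re

# Assumption: 4-bit loading needs at least this transformers release.
MIN_VERSION = "4.30.0"


def version_tuple(version: str) -> tuple:
    """Turn '4.30.2' into (4, 30, 2), stopping at suffixes such as 'dev0'."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_recent_enough(installed: str, minimum: str = MIN_VERSION) -> bool:
    """Compare two dotted version strings numerically, not alphabetically."""
    return version_tuple(installed) >= version_tuple(minimum)


print(is_recent_enough("4.30.2"))
```

In the notebook this could be called as `is_recent_enough(transformers.__version__)` before importing `BitsAndBytesConfig`.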

In [4]:
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

In [5]:
wiki = pd.read_csv("../../data/wiki-scraped.csv")
handbook = pd.read_csv("../../data/handbook-scraped.csv")

In [6]:
text_list = list(wiki["body"].astype("str")) + list(handbook["body"].astype("str"))
documents = [Document(t) for t in text_list]

## Using falcon-7b-instruct

In [7]:
model_name = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    quantization_config=quantization_config,
    trust_remote_code=True,
)


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /anaconda/envs/reginald/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so
CUDA SETUP: CUDA runtime path found: /anaconda/envs/reginald/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /anaconda/envs/reginald/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so...


  warn(msg)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [8]:
!nvidia-smi

Thu Jun 15 10:51:46 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-PCIE...  On   | 00000001:00:00.0 Off |                  Off |
| N/A   32C    P0    42W / 250W |   5639MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [9]:
falcon_7b = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    use_cache=True,
    device_map="auto",
    do_sample=True,
    top_k=10,
    top_p=0.95,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)

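# Minimal LangChain LLM wrapper around the Hugging Face text-generation pipeline
class CustomLLM(LLM):
    model_name: str
    pipeline: transformers.pipelines.text_generation.TextGenerationPipeline

    @property
    def _llm_type(self) -> str:
        return "custom"

    def _call(self, prompt, stop=None):
        # `stop` is accepted for interface compatibility but is not used here.
        # By default the pipeline returns the prompt followed by the completion.
        return self.pipeline(prompt, max_new_tokens=9999)[0]["generated_text"]

    @property
    def _identifying_params(self) -> dict:
        """Get the identifying parameters."""
        return {"model_name": self.model_name}

# Wrap the custom LLM so that llama-index can use it as its predictor
llm_predictor_falcon_7b = LLMPredictor(llm=CustomLLM(model_name=model_name, pipeline=falcon_7b))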

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
The model 'RWForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusFor
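
The text-generation pipeline returns the prompt followed by the completion by default, which is why the response printed at the end of this notebook begins with the prompt template. A minimal sketch of one way to keep only the completion is given below. `strip_prompt` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def strip_prompt(prompt: str, generated_text: str) -> str:
    """Return only the completion if `generated_text` starts with `prompt`."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Leave the text untouched if the prompt was not echoed back.
    return generated_text


example_prompt = "Question: what is REG?\nAnswer:"
example_output = example_prompt + " The Research Engineering Group."
print(strip_prompt(example_prompt, example_output))
```

Inside `_call`, this could be applied to the `generated_text` field before returning it. Passing `return_full_text=False` to the pipeline call should have a similar effect.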

In [10]:
hfemb = HuggingFaceEmbeddings()
embed_model = LangchainEmbedding(hfemb)

In [12]:
# set number of output tokens
num_output = 256
# set maximum input size (the context window, in tokens)
max_input_size = 2048
# set maximum chunk size and the fraction of each chunk that overlaps the next
chunk_size = 1024
chunk_overlap_ratio = 0.1

prompt_helper = PromptHelper(
    context_window=max_input_size,
    num_output=num_output,
    chunk_size_limit=chunk_size,
    chunk_overlap_ratio=chunk_overlap_ratio,
)
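
To make these settings concrete, the sketch below works through the token arithmetic they imply. It is a simplified view under the assumption that the context window is split between the prompt and the reserved output. It is not a description of `PromptHelper`'s internals, and `token_budget` is a hypothetical helper.

```python
def token_budget(context_window, num_output, chunk_size, chunk_overlap_ratio):
    """Simplified token arithmetic for the settings used above."""
    return {
        # tokens left for the prompt template plus retrieved context
        "prompt_tokens": context_window - num_output,
        # tokens shared between consecutive chunks
        "overlap_tokens": int(chunk_size * chunk_overlap_ratio),
    }


budget = token_budget(
    context_window=2048,
    num_output=256,
    chunk_size=1024,
    chunk_overlap_ratio=0.1,
)
print(budget)
```

With these values, one 1024-token chunk plus the prompt template fits comfortably in the remaining budget, while two full chunks would not.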

In [13]:
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor_falcon_7b,
    embed_model=embed_model,
    prompt_helper=prompt_helper,
    chunk_size=chunk_size,
)

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)
query_engine_falcon_7b = index.as_query_engine()
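
Conceptually, a vector index answers a query by embedding it and picking the stored chunks whose embeddings are most similar. The sketch below illustrates that idea with cosine similarity on made-up two-dimensional vectors. It is an illustration only, the actual similarity measure and storage used by `VectorStoreIndex` may differ, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Indices of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy embeddings, invented for illustration only.
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
toy_query = [1.0, 0.0]
print(top_k(toy_query, toy_chunks, k=2))
```

The text of the selected chunks is then placed into the prompt as context for the LLM.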

In [14]:
response = query_engine_falcon_7b.query("what should a new starter in REG do?")
print(response.response)



The original question is as follows: what should a new starter in REG do?
We have provided an existing answer: Context information is below. 
---------------------
REG Buddy Sign Up Sheet
This page is for organising sign-ups and matches to REG's Buddy-System . Everyone new-starter gets two REG buddies. In Jan 2022 we also had one external buddy. This helps to enforce that 'Turing' is bigger than 'REG'. External buddying is a little more social in nature, whereas internal buddies can wear both hats and also may offer informal technical help. However, note that the external buddying needs to be bidirectional and historically there hasn't been many volunteers from REG to the wider Turing, so external buddying should be considered an optional nice-to-have.
If your preference for being a buddy has changed, please contact the person in charge of onboarding wiki page .
Please add your name below if you'd like to be considered as a buddy for a new starter.



Name




Andy Smith


Helen Duncan