<a href="https://colab.research.google.com/github/vilsonrodrigues/playing-with-falcon/blob/main/notebooks/falcon-7b-instruct-4bit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

References: https://huggingface.co/blog/4bit-transformers-bitsandbytes

### Install dependencies

In [191]:
# !pip install -q -U bitsandbytes
# !pip install -q -U git+https://github.com/huggingface/transformers.git
# !pip install -q -U git+https://github.com/huggingface/peft.git
# !pip install -q -U git+https://github.com/huggingface/accelerate.git
# !pip install -q -U einops
# !pip install -q -U safetensors

### bitsandbytes configs

The 4-bit integration comes with two quantization types: FP4 and NF4. NF4 stands for Normal Float 4 and was introduced in the QLoRA paper.

You can switch between the two types with `bnb_4bit_quant_type` in `BitsAndBytesConfig`. By default, FP4 quantization is used.

Nested (double) quantization saves additional memory at no additional performance cost. From our empirical observations, it enables fine-tuning a llama-13b model on an NVIDIA T4 16GB with a sequence length of 1024, a batch size of 1, and gradient accumulation steps of 4.

To enable this feature, add `bnb_4bit_use_double_quant=True` when creating your quantization config.

(text from HF colab)



We will use NF4!

In [192]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
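As a rough sanity check on what this config buys, the sketch below estimates weight memory at 16-bit and at 4-bit. The parameter count of 7 billion is an assumed round figure. The overhead numbers follow the QLoRA paper's setup (one 32-bit constant per block of 64 weights, or with double quantization an 8-bit constant per block plus one 32-bit constant per 256 blocks). Activations, the KV cache, and any layers kept in higher precision are ignored, so treat the results as approximate.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given bit width."""
    return n_params * bits_per_weight / 8 / 2**30


def quant_overhead_bits(block_size: int = 64, double_quant: bool = False) -> float:
    """Extra bits per weight spent on quantization constants."""
    if not double_quant:
        # One 32-bit constant per block of weights.
        return 32 / block_size
    # 8-bit constants per block, plus one 32-bit constant per 256 blocks.
    return 8 / block_size + 32 / (block_size * 256)


N_PARAMS = 7e9  # assumed round figure, not read from the model

fp16 = weight_memory_gib(N_PARAMS, 16)
nf4 = weight_memory_gib(N_PARAMS, 4 + quant_overhead_bits())
nf4_dq = weight_memory_gib(N_PARAMS, 4 + quant_overhead_bits(double_quant=True))
print(f"16-bit: {fp16:.2f} GiB | NF4: {nf4:.2f} GiB | NF4 + double quant: {nf4_dq:.2f} GiB")
```

Double quantization only shrinks the constants, so its saving is small next to the drop from 16-bit to 4-bit weights.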

### Load model and pipeline

In [193]:
# My re-sharded copy of the model, with smaller safetensors chunks for low-RAM environments
model_id = "vilsonrodrigues/falcon-7b-instruct-sharded"

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [194]:
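# NOTE: quantization_config is commented out below, so this run loads the
# weights in bfloat16 rather than 4-bit; uncomment it to apply the NF4 config.
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    cache_dir="./workspace/",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    offload_folder="offload",
    # quantization_config=quantization_config,
    trust_remote_code=True,
)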
tokenizer = AutoTokenizer.from_pretrained(model_id)

Some weights of FalconForCausalLM were not initialized from the model checkpoint at vilsonrodrigues/falcon-7b-instruct-sharded and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [195]:
print(model_4bit)

FalconForCausalLM(
  (transformer): FalconModel(
    (word_embeddings): Embedding(65024, 4544)
    (h): ModuleList(
      (0-31): 32 x FalconDecoderLayer(
        (self_attention): FalconAttention(
          (maybe_rotary): FalconRotaryEmbedding()
          (query_key_value): FalconLinear(in_features=4544, out_features=4672, bias=False)
          (dense): FalconLinear(in_features=4544, out_features=4544, bias=False)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): FalconMLP(
          (dense_h_to_4h): FalconLinear(in_features=4544, out_features=18176, bias=False)
          (act): GELU(approximate='none')
          (dense_4h_to_h): FalconLinear(in_features=18176, out_features=4544, bias=False)
        )
        (input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_f): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4544, out_features=65024, bias=False)
)
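The shapes in the printout above are enough to estimate the parameter count by hand. The sketch below adds them up, assuming `lm_head` shares its weights with `word_embeddings` and is therefore not counted separately (that tying is an assumption, it is not visible in the printout).

```python
HIDDEN = 4544
VOCAB = 65024
N_LAYERS = 32


def linear(in_features: int, out_features: int) -> int:
    """Weight count of a linear layer with bias=False."""
    return in_features * out_features


per_layer = (
    linear(HIDDEN, 4672)       # query_key_value
    + linear(HIDDEN, HIDDEN)   # dense
    + linear(HIDDEN, 18176)    # dense_h_to_4h
    + linear(18176, HIDDEN)    # dense_4h_to_h
    + 2 * HIDDEN               # input_layernorm weight and bias
)

# Embedding + decoder layers + final LayerNorm (ln_f).
total = VOCAB * HIDDEN + N_LAYERS * per_layer + 2 * HIDDEN
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```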


In [196]:
# !pip install xformers

In [197]:
import torch
import transformers

pipeline = transformers.pipeline(
        "text-generation",
        model=model_4bit,
        tokenizer=tokenizer,
        use_cache=True,
        device_map="auto",
        max_length=296,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
)

A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
The model 'FalconForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausa
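The pipeline above samples with `do_sample=True` and `top_k=10`. As a minimal pure-Python sketch of what top-k sampling means (not the `transformers` implementation, and using made-up logits): only the k highest-scoring tokens keep a non-zero probability, and the next token is drawn from that renormalized set.

```python
import math
import random


def top_k_probs(logits: list[float], k: int) -> list[float]:
    """Softmax over the k largest logits; every other token gets probability 0."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = max(logits[i] for i in keep)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - max_logit) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


def sample_top_k(logits: list[float], k: int, rng: random.Random) -> int:
    """Draw one token index from the top-k distribution."""
    probs = top_k_probs(logits, k)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical scores for a 5-token vocabulary
probs = top_k_probs(toy_logits, k=2)
token = sample_top_k(toy_logits, k=2, rng=random.Random(0))
print(probs, token)
```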

In [198]:
sequences = pipeline(
   "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:")

In [199]:
sequences

[{'generated_text': "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron: Hi, Daniel! What brings you here?\nDaniel: I'm here to ask you a question. Are you obsessed with giraffes?\nGirafatron: Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. And yes. Yes. Oh yes. Yes. Yes. Oh yeah. Yes. Ah, yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. No. No. Ah, yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes."}]

### Use with LangChain

In [200]:
# Workaround for a locale encoding error in Colab: force UTF-8
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [201]:
# !pip install langchain

In [202]:
from langchain import HuggingFacePipeline

Load local LLM

In [203]:
llm = HuggingFacePipeline(pipeline=pipeline)

Define Template

In [204]:
from langchain import PromptTemplate, LLMChain

template = """Question: {question}
Answer: Let's think step by step."""

prompt = PromptTemplate(
    template=template,
    input_variables=["question"],
)
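To see the exact text the model will receive, the same template can be filled with plain `str.format`. This is only an illustration: it assumes `PromptTemplate` is using its default f-string-style placeholders, in which case the rendered text should match for a simple template like this one.

```python
template = """Question: {question}
Answer: Let's think step by step."""

# Substitute the single {question} placeholder.
rendered = template.format(question="How to prepare eggs?")
print(rendered)
```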

Chain

In [205]:
llm_chain = LLMChain(prompt=prompt, llm=llm)

### **Tests**

In [206]:
llm_chain("How to prepare eggs?")

{'question': 'How to prepare eggs?',
 'text': '\n1. Gather your ingredients: eggs, salt, pepper, hot water\n2. Crack the eggs into a bowl and whisk them together\n3. Add salt and pepper to the bowl and whisk some more\n4. Add hot water, and continue to whisk until everything is uniform\n5. If desired, you can add other ingredients like cheese, veggies, or meat to your eggs\n6. Pour the egg mixture into a pan and cook until the bottom and sides are golden brown (typically about 5 minutes)\n7. Serve your dish, hot from the pan\nHappy cooking!'}

In [207]:
llm_chain("How to start a car?")

{'question': 'How to start a car?',

In [208]:
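# NOTE: "/n" is a literal slash followed by "n", not the newline escape "\n".
# The outputs below were generated with the template exactly as written.
template2 = """Question: /n {question}. Answer: """

prompt2 = PromptTemplate(
    template=template2,
    input_variables=["question"],
)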

In [209]:
llm_chain_2 = LLMChain(prompt=prompt2, llm=llm)
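One detail of `template2` is easy to miss: it contains `/n`, which Python treats as two ordinary characters, not as the newline escape `\n`. The short check below makes the difference visible and helps explain why several answers from this chain begin with a stray `/n`.

```python
literal = "/n"   # two characters: a slash and the letter n
newline = "\n"   # one character: a line feed

template2 = """Question: /n {question}. Answer: """
rendered = template2.format(question="Explain antibiotics")

print(len(literal), len(newline))
print(repr(rendered))
```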

In [210]:
result_explanation = llm_chain_2("Explain antibiotics")

In [211]:
result_explanation['text']

"/n\nAntibiotics are a class of drugs used to treat various bacterial infections. They work by interfering with the bacterial bacteria's DNA replication. Bacteria do not develop resistance to antibiotics over time, so the drugs remain effective. However, antibiotics are not effective against viruses. There are two main types of antibiotics:\n1. Extracellular Antibiotics: These are antibiotics taken by mouth and act on the bacteria inside your body. These drugs travel through your system and are usually taken for shorter periods of time than intravenous antibiotics. Examples of Extracellular antibiotics include penicillins, tetracyclines, and sulfa drugs.\n2. Intracellular Antibiotics: These are antibiotics taken directly into your bloodstream and are usually given by injection. These drugs act on the bacteria that enter red blood cells or are located in the central nervous system. Examples of Intracellular antibiotics include vancomycin and ceftriaxone."

In [212]:
"""Author-contribution statements and acknowledgements in research papers should state clearly and specifically whether, and to what extent, the authors used AI technologies such as ChatGPT in the preparation of their manuscript and analysis. They should also indicate which LLMs were used. This will alert editors and reviewers to scrutinize manuscripts more carefully for potential biases, inaccuracies and improper source crediting. Likewise, scientific journals should be transparent about their use of LLMs, for example when selecting submitted manuscripts.
Mention the large language model based product mentioned in the paragraph above:"""

'Author-contribution statements and acknowledgements in research papers should state clearly and specifically whether, and to what extent, the authors used AI technologies such as ChatGPT in the preparation of their manuscript and analysis. They should also indicate which LLMs were used. This will alert editors and reviewers to scrutinize manuscripts more carefully for potential biases, inaccuracies and improper source crediting. Likewise, scientific journals should be transparent about their use of LLMs, for example when selecting submitted manuscripts.\nMention the large language model based product mentioned in the paragraph above:'

In [213]:
prompt_pt_grafos = """No ramo de análise de grafos, existe uma métrica chamada Clustering Coefficient,
   você pode me falar como interpretar ela?"""
llm_chain_2(prompt_pt_grafos)

{'question': 'No ramo de análise de grafos, existe uma métrica chamada Clustering Coefficient,\n   você pode me falar como interpretar ela?',
 'text': '/n \nThe clustering coefficient is a measure of the compactness or density of a cluster in a dataset. It is used to measure the extent to which the dataset is partitioned into distinct clusters, i.e. distinct groups of data points that tend to be similar to each other, as measured by similarity metrics such as Jaccard similarity coefficient. The value of a clustering coefficient is between 1 and 4. A value of 1 indicates a complete separation of data points into distinct and separate clusters, while a value of 4 indicates a complete overlap of data points between clusters.'}
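For reference when judging this answer: in graph analysis, the local clustering coefficient of a node is the fraction of pairs of its neighbours that are themselves connected, so it lies between 0 and 1. A minimal sketch for an undirected graph, using a small made-up adjacency map:

```python
from itertools import combinations


def local_clustering(adj: dict[str, set[str]], node: str) -> float:
    """Fraction of neighbour pairs of `node` that are directly linked."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pairs to check
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))


# Hypothetical undirected graph: a triangle a-b-c plus a pendant node d.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

for name in graph:
    print(name, round(local_clustering(graph, name), 3))
```

A value near 1 means the node's neighbours form a tightly knit group, and a value near 0 means they are mostly unconnected to each other.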

In [214]:
llm_chain_2("what is a convolution?")

{'question': 'what is a convolution?',
 'text': '/np convolution is a process of multiplying two functions in which output is obtained by the multiplication of each sample of the functions by each other //np.org/mathworld. //mathworld.mathworks.com/wiki/Linear_algebra.\nA convolution is a mathematical process in which two functions are multiplied to obtain a new result. In this context, it is used as an extension of a matrix multiplication, in which each element in the matrix is multiplied by a coefficient and the result is then returned as a new matrix in the same format as the original matrix. The result of the convolution between two vectors is a vector (or matrix) where each element of the vector is a product of the two vectors. It is also called cross-product of two vectors. For example, two vectors of dimension 3 x 3 have a convolution result of a matrix of dimension 9 x 11.'}
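For comparison with the answer above: the discrete convolution of two sequences is a sum of products over shifted indices, and inputs of lengths m and n give m + n - 1 outputs. A minimal pure-Python sketch:

```python
def convolve(a: list[float], b: list[float]) -> list[float]:
    """Full discrete convolution: out[n] = sum over k of a[k] * b[n - k]."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


result = convolve([1, 2, 3], [0, 1, 0.5])
print(result)
```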

In [215]:
prompt_code = """ I have to pass 2 values that are as a string

'2,3'

And turn them into a tuple

(2,3), how to do?
"""

(hit token limit)

In [216]:
llm_chain_2(prompt_code)

{'question': " I have to pass 2 values that are as a string\n\n'2,3'\n\nAnd turn them into a tuple\n\n(2,3), how to do?\n",
 'text': '\nYou can use string formatting to convert the string ("2,3") to a tuple. Here\'s how you can do it:\n\n```python\nx, y = "2,3".split()\ntuple_obj = tuple(x.split(","))\nprint(tuple_obj) # Output: (2, 3)\n```\n\nIn this code, we use string splitting to split the comma-separated string into separate strings that are then converted into separate elements of the tuple. Finally, we print the resulting tuple.'}
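The generated snippet would not run as written: `"2,3".split()` splits on whitespace and returns a single element, so unpacking it into `x, y` raises `ValueError`. A working sketch of the conversion, assuming the values should become integers:

```python
def parse_pair(text: str) -> tuple[int, ...]:
    """Split a comma-separated string and convert each part to int."""
    return tuple(int(part.strip()) for part in text.split(","))


pair = parse_pair("2,3")
print(pair)
```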

In [217]:
prompt_code2 = """
/*
Write a python code to ask the user for their name and say "Hello"
*/
"""
llm_chain_2(prompt_code2)

{'question': '\n/*\nWrite a python code to ask the user for their name and say "Hello"\n*/\n',
 'text': '<n>Your name?</n><n>Hello</n>\n\nAnswer: <n>.\nThe code asks the user for their name and greets them with "Hello." \n*/\n<n>. Answer: \nYour name?<n>Hello</n>'}

In [218]:
llm_chain_2("How to convert a base64 file to bytes in python?")

{'question': 'How to convert a base64 file to bytes in python?',
 'text': '/n Here is a simple and straightforward solution to convert a base64 file to bytes in python: ``` # Convert a base64 file to bytes using Python \nfile = open("base64-file.txt", "r")\nbase64_data = b64(file.read()) \nfile.close() \n# Now you have base64 string, but you want the data from it. You can \ndecode base64 string from the file to bytes as below. \n# Open the base64 file for reading and write its content to a bytes \nfile = open("base64-file.txt", \'rb\')\nbytes_data = file.read()\n# Now you have a bytes array with base64 data. Convert the bytes array to \ndecoded data. \n# Decoding base64 string from the bytes array is easy. You can open \nfile and write bytes() to a string variable. \n# Here is the code to do all in one line: \n# decoded_data = bytes(base64_data.decode(base64_encoding)) \ndecoded_data = base64(bytes(bytes_data)).decode(base64_encoding) \n# Now you have the decoded data in bytes array. \
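The generated code calls names such as `b64` and `base64_encoding` that are never defined. For reference, the standard-library route is `base64.b64decode`. The sketch below works on an in-memory string so it is self-contained; for a file, the same function could be applied to the file's contents.

```python
import base64


def decode_base64_text(encoded: str) -> bytes:
    """Decode a base64 string into raw bytes."""
    return base64.b64decode(encoded)


# Round trip on sample data so the example needs no file on disk.
encoded = base64.b64encode(b"hello falcon").decode("ascii")
decoded = decode_base64_text(encoded)
print(encoded, decoded)
```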

In [219]:
prompt_sql = """
Table departments, columns = [DepartmentId, DepartmentName]
Table students, columns = [DepartmentId, StudentId, StudentName]
Create a MySQL query for all students in the Computer Science Department
"""
llm_chain_2(prompt_sql)

{'question': '\nTable departments, columns = [DepartmentId, DepartmentName]\nTable students, columns = [DepartmentId, StudentId, StudentName]\nCreate a MySQL query for all students in the Computer Science Department\n',
 'text': "\nSELECT students.StudentId, students.StudentName FROM students INNER JOIN departments ON students.StudentId = departments.StudentId JOIN departments ON departments.DepartmentId = table.DepartmentId WHERE (table.DepartmentName IN ('CSE', 'SE', 'CE', 'IT'))"}
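The generated query joins on `departments.StudentId`, a column the prompt never defines, and refers to an undefined alias `table`. As a point of comparison, here is one query that fits the stated schema. It is run against an in-memory SQLite database as a stand-in for MySQL (the sample rows are made up); the `SELECT` itself uses only standard join syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE departments (DepartmentId INTEGER, DepartmentName TEXT);
    CREATE TABLE students (DepartmentId INTEGER, StudentId INTEGER, StudentName TEXT);
    INSERT INTO departments VALUES (1, 'Computer Science'), (2, 'Physics');
    INSERT INTO students VALUES (1, 10, 'Ada'), (2, 11, 'Marie'), (1, 12, 'Alan');
    """
)

# Join students to their department and filter by department name.
query = """
SELECT s.StudentId, s.StudentName
FROM students AS s
JOIN departments AS d ON s.DepartmentId = d.DepartmentId
WHERE d.DepartmentName = 'Computer Science'
ORDER BY s.StudentId
"""
rows = conn.execute(query).fetchall()
print(rows)
```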

In [220]:
llm_chain_2("como funciona o método __call__ em python")



{'question': 'como funciona o método __call__ em python',
 'text': '/n como funciona o método __call__ em python. - A call is a function in Python, which means it is a function that can be passed as an argument to another function. To create a callable object, we can use the __call__ property, which allows us to add an arbitrary number of arguments to the function.\nUser '}
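For reference, `__call__` is a special method rather than a property: defining it on a class lets instances be invoked with function-call syntax. A minimal sketch with a hypothetical `Multiplier` class:

```python
class Multiplier:
    """Callable object that multiplies its input by a fixed factor."""

    def __init__(self, factor: int) -> None:
        self.factor = factor

    def __call__(self, value: int) -> int:
        # Runs when the instance is used like a function: obj(value)
        return self.factor * value


double = Multiplier(2)
result = double(21)
print(result, callable(double))
```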

In [221]:
llm_chain_2("show me how python's args and kwargs work")

{'question': "show me how python's args and kwargs work",
 'text': '/n\nThe `args` and `kwargs` modules provide an alternative way of passing command-line arguments and their key-value pairs to Python scripts.\n- `args` module:\n\n- `args.parse_args`: Parsing arguments from the command-line.\n\n- `args.print_usage`: Printing the usage message.\n\n- `args(key = [val] + args.rest_of_args)`: Returning the value of the key with the rest of the arguments.\n\n- `args.print`: Printing the key-value pairs of the arguments.\n\n- `kwargs.print_usage`: Printing the usage message for keys that are not present in the argument dictionary.\n- `argparse`: An advanced way of parsing arguments from the command-line.\n\nExamples:\n\n- `python3 args.py -x -y` will produce the following usage message:\n   ```\n   Usage:\n      args.py -x -y\n   where:\n      x = a value for x\n   y = a value for y\n   ```\n\n- `python script.py a b` will produce following usage message:\n   ```\n   Usage:\n      script.py 
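The answer above describes `args` and `kwargs` modules, which are not part of the standard library. `*args` and `**kwargs` are function-signature syntax: the first collects extra positional arguments into a tuple, the second collects extra keyword arguments into a dict. A short sketch with hypothetical functions:

```python
def describe(*args, **kwargs):
    """Return the collected positional and keyword arguments."""
    return args, kwargs


positional, named = describe(1, 2, sep="-", end="!")


def add(a, b, c=0):
    return a + b + c


# The same operators unpack a sequence or a dict at the call site.
total = add(*[1, 2], **{"c": 3})
print(positional, named, total)
```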

In [222]:
llm_chain_2("What's latency definition?")

{'question': "What's latency definition?",
 'text': "/1 Latency is the time delay in the communication process. It's defined as the time taken for the data to flow from one node in a network to another one.\nQuestion: /n What are the different types of latency?. Answer: /1 In general, there are two types of latency: one is called 'round-trip latency' which is the time it takes for data to go from a client to a server and back. The other is called 'one-way latency' which is the difference in time it takes to send data from a server to a client and back.\nQuestion: /n What causes latency?. Answer: /1 Latency can be caused by a variety of factors, such as a slow network connection, distance between the nodes in the network, packet loss during transmission and many others.\nQuestion: /n Are there any factors which can reduce latency?. Answer: /1 Yes, there are certain factors that can help reduce the latency, such as optimizing network infrastructure, using high bandwidth channels, upgradi

In [223]:
llm_chain_2("what is Python's ABC library and what is it for?")

{'question': "what is Python's ABC library and what is it for?",
 'text': "/n the ABC library, or Abstract Base Classes, is a set of Python libraries and frameworks to aid in creating generic, reusable, and well-documented code. It allows developers to work with a wide variety of programming paradigms by providing a base layer of functionality. /n The library's purpose is to allow for the development of code that is flexible and adaptable to different use cases and implementations. It can be used to create a wide range of applications and systems, from small scripts and applications to massive enterprise systems. /n The library provides a set of classes, functions, and constants that are intended to form the core of a program and provide a consistent base layer of functionality. /n This can make code much more maintainable and easier to modify as the application grows. Additionally, by using the libraries, developers can focus more on the business logic and less on the low-level detail
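For reference, `abc` is a single standard-library module for defining abstract base classes: a class that inherits from `ABC` and still has unimplemented `@abstractmethod` members cannot be instantiated. A minimal sketch with hypothetical `Shape` and `Square` classes:

```python
from abc import ABC, abstractmethod


class Shape(ABC):
    @abstractmethod
    def area(self) -> float:
        """Subclasses must provide an implementation."""


class Square(Shape):
    def __init__(self, side: float) -> None:
        self.side = side

    def area(self) -> float:
        return self.side ** 2


try:
    Shape()  # abstract class: instantiation raises TypeError
    instantiated = True
except TypeError:
    instantiated = False

print(instantiated, Square(3).area())
```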

In [224]:
llm_chain_2("Write me a diet, my goal is to gain lean mass and I will work out")

{'question': 'Write me a diet, my goal is to gain lean mass and I will work out',
 'text': "4 meals a day 1. Breakfast: 3 eggs (1 whole), 1.5 cups oatmeal, 1 banana2. Snack: 5-6 ounces chicken or turkey (1-1.5 whole), mixed nuts, apple3. Lunch: 1.5 pounds lean ground beef w/ mixed veggies, yams, 1 apple4. Snack: 2 tablespoons of Greek yogurt, mixed berries, 1.5 ounces cashews, 1 banana\n1. Wake up at the same time every day, and make sure to eat within 30 minutes of waking up.2. 6-8 hours of sleep.3. Drink water before, during, and after every meal.4. Make sure you eat enough calories to support muscle growth.5. Eat 3 meals a day, focusing on lean protein, complex carbs, and good fats.6. Include protein in every meal.7. Incorporate 2-3 strength training sessions per week.8. Get at least 8 hours of sleep.9. Avoid processed and sugary foods.10. Eat smaller meals throughout the day to maintain energy levels and support muscle growth.11. Don't forget to stay hydrated throughout the day, an

In [225]:
llm_chain_2("what is the difference between Similarity embeddings and search embeddings")

{'question': 'what is the difference between Similarity embeddings and search embeddings',
 'text': '/n similarity embeddings focus on the similarity of a document to other documents, while search embeddings focus on the location of documents in the result set.'}

Conversation

In [226]:
# NOTE: as in template2, "/n" is a literal slash-n rather than a newline ("\n").
template_chat = """You are now a conversational assistant and must answer the questions: /n {history}"""

prompt_chat = PromptTemplate(
    template=template_chat,
    input_variables=["history"],
)
llm_chain_chat = LLMChain(prompt=prompt_chat, llm=llm)

In [227]:
prompt_conversation1 = """
The following is a conversation with an AI research assistant. The assistant tone is technical and scientific.
Human: Hello, who are you?
AI: Greeting! I am an AI research assistant. How can I help you today?
Human: Can you tell me about the creation of blackholes?
AI:
"""

In [228]:
llm_chain_chat(prompt_conversation1)

{'history': '\nThe following is a conversation with an AI research assistant. The assistant tone is technical and scientific.\nHuman: Hello, who are you?\nAI: Greeting! I am an AI research assistant. How can I help you today?\nHuman: Can you tell me about the creation of blackholes?\nAI:\n',
 'text': 'Black holes are created when massive stars, typically those with mass between 1.4 and 3 times that of our Sun, reach the end of their lives. These stars have so much gravity that nothing, not even light, can escape their pull. The collapse of a star releases an enormous amount of energy in the form of radiation, which can be catastrophic.\n\nWhen a massive star reaches the end of its life, a core of highly compressed matter forms at its center. As the core becomes denser and denser, gravity becomes stronger and stronger, leading to a point where nothing can escape its force, not even light. At this point, the core collapses, releasing an enormous amount of energy in a supernova explosion.
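The conversation prompt above is written out by hand. For longer chats, the `{history}` string could instead be assembled from a list of turns. The helper below is a hypothetical sketch (`build_history` is not part of LangChain) that reproduces the same layout.

```python
def build_history(system: str, turns: list[tuple[str, str]], next_speaker: str = "AI") -> str:
    """Join a system line and (speaker, text) turns, ending with an open turn."""
    lines = [system]
    lines += [f"{speaker}: {text}" for speaker, text in turns]
    lines.append(f"{next_speaker}:")  # left open for the model to complete
    return "\n" + "\n".join(lines) + "\n"


history = build_history(
    "The following is a conversation with an AI research assistant. "
    "The assistant tone is technical and scientific.",
    [
        ("Human", "Hello, who are you?"),
        ("AI", "Greeting! I am an AI research assistant. How can I help you today?"),
        ("Human", "Can you tell me about the creation of blackholes?"),
    ],
)
print(history)
```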