# Llama 2: Leveraging the Meta Language Model on Hugging Face

This notebook loads Meta's Llama 2 chat model, a large language model (LLM), through Hugging Face Transformers and uses it for text generation. It then wraps the model in LangChain, connects it to a Qdrant vector store for retrieval, and serves it through a Gradio chat interface.

## Install libraries

In [1]:
!pip install -q gradio

In [2]:
import gradio as gr

In [3]:
# to use the model locally
!pip install -qU transformers
!pip install -qU accelerate
!pip install -qU einops  # flexible tensor operations for readable, reliable code
!pip install -qU langchain  # framework for building applications on top of LLMs
!pip install -qU xformers  # PyTorch-based library of flexible Transformer building blocks
!pip install -qU bitsandbytes  # quantization support used to load the model in 4-bit
!pip install -qU faiss-gpu  # library for efficient similarity search and clustering of dense vectors
!pip install -qU sentence_transformers #sentence embeddings

In [4]:
import torch

# Check if a GPU is available
if torch.cuda.is_available():
    # Get the name of the GPU
    gpu_name = torch.cuda.get_device_name(0)

    # Get the GPU's memory capacity
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024 ** 3)  # in GB

    print(f"GPU Name: {gpu_name}")
    print(f"GPU Memory Capacity: {gpu_memory} GB")
else:
    print("No GPU available.")


GPU Name: Tesla T4
GPU Memory Capacity: 14.74786376953125 GB


In [5]:
#display information about the NVIDIA GPUs installed on your system
!nvidia-smi

Wed Nov  8 14:13:05 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P8     9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Get access to Hugging Face

In [None]:
#https://huggingface.co/docs/api-inference/quicktour#get-your-api-token

from getpass import getpass

HUGGINGFACEHUB_API_TOKEN = getpass()

··········


The next code cell performs the following tasks:

1. Imports `cuda` and `bfloat16` from `torch` (GPU utilities and a 16-bit floating-point data type, used here as the compute dtype) and `transformers` (for working with pre-trained language models).

2. Defines `model_id`, the Hugging Face identifier of the pre-trained model to load.

3. Determines the device for model execution: the current GPU if one is available, otherwise the CPU.

4. Configures 4-bit quantization through the bitsandbytes library. Quantization stores the weights at lower precision, which reduces the memory needed to load the model.

5. Initializes the Hugging Face (HF) items: the access token for authentication, the model configuration, and the pre-trained model for causal language modeling.

6. Puts the model in evaluation mode, which disables training-only behaviour such as dropout.

In summary, this code loads a pre-trained language model, reduces its GPU memory footprint through quantization, and leaves it ready for inference.

In [6]:
from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-13b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items; you need an access token
# (reuses the token entered with getpass above instead of hardcoding it)
hf_auth = HUGGINGFACEHUB_API_TOKEN
# create a model configuration object
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map='auto',
    config=model_config,
    quantization_config=bnb_config,
    token=hf_auth
)

# switch to evaluation mode for inference (disables dropout and other training-only behaviour)
model.eval()

print(f"Model loaded on {device}")

Model loaded on cuda:0
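
To make the effect of 4-bit quantization concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone would need at 16 bits and at 4 bits per parameter. The parameter count of 13 billion is a nominal assumption taken from the model name, and the helper `weight_memory_gib` is hypothetical; real usage is higher because activations, the KV cache, and quantization constants are not counted.

```python
# Back-of-the-envelope weight-memory estimate (weights only; activations,
# the KV cache and quantization constants add overhead on top).
N_PARAMS = 13e9  # assumed nominal parameter count for a "13B" model


def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Return the memory needed to hold the weights, in GiB."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)


fp16_gib = weight_memory_gib(N_PARAMS, 16)
nf4_gib = weight_memory_gib(N_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {nf4_gib:.1f} GiB")
```

Under these assumptions the 16-bit weights would not fit in the roughly 14.7 GB reported for the Tesla T4 above, while the 4-bit weights would, which is why the quantization config matters on this GPU.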


A tokenizer is a fundamental component of any large language model (LLM) pipeline. It splits input text into smaller units, usually words or subword tokens, and maps each one to an integer id that the model can process.

In [7]:
# create the tokenizer that matches the model automatically
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    token=hf_auth
)



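
To illustrate the idea of splitting text into subword pieces and mapping them to ids, here is a toy sketch. It uses a greedy longest-match rule over a tiny hypothetical vocabulary (`TOY_VOCAB`); this is not the algorithm or vocabulary the Llama 2 tokenizer uses, only a simplified stand-in to show the text-to-ids step.

```python
# Toy vocabulary (hypothetical); real tokenizers learn tens of thousands of pieces.
TOY_VOCAB = {"<unk>": 0, "re": 1, "forest": 2, "ation": 3, "for": 4, "est": 5, "s": 6}


def toy_tokenize(word: str, vocab: dict) -> list:
    """Greedy longest-match split of `word` into pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece matches: emit the unknown token and skip one character.
            pieces.append("<unk>")
            start += 1
    return pieces


pieces = toy_tokenize("reforestation", TOY_VOCAB)
ids = [TOY_VOCAB[p] for p in pieces]
print(pieces, ids)
```

The word is split into pieces that exist in the vocabulary, and each piece is replaced by its integer id, which is the form the model consumes.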

The next cell defines a list with two elements, '\nHuman:' and '\n```\n'. These are stop sequences: when the model generates one of them, text generation should halt. The cell also converts each sequence into token ids.

In [8]:
stop_list = ['\nHuman:', '\n```\n']

stop_token_ids = [tokenizer(x)['input_ids'] for x in stop_list]
stop_token_ids

[[1, 29871, 13, 29950, 7889, 29901], [1, 29871, 13, 28956, 13]]

In [9]:
tokenizer('\nHuman:')  # the attention mask distinguishes real tokens (1) from padding tokens (0)

{'input_ids': [1, 29871, 13, 29950, 7889, 29901], 'attention_mask': [1, 1, 1, 1, 1, 1]}
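
The attention mask above is all ones because a single sequence needs no padding. The sketch below shows what happens when two sequences of different length are batched. The helper `pad_batch`, the padding id `0`, and right-padding are assumptions made for illustration, not the tokenizer's actual settings.

```python
PAD_ID = 0  # hypothetical padding id, chosen only for this illustration


def pad_batch(sequences: list, pad_id: int = PAD_ID):
    """Right-pad token-id lists to equal length and build the matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


batch = [
    [1, 29871, 13, 29950, 7889, 29901],  # ids of the first stop sequence
    [1, 29871, 13, 28956, 13],           # ids of the second stop sequence
]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

The shorter sequence receives one padding id, and its mask ends in a 0 so that position can be ignored.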

In PyTorch, a LongTensor object is a tensor (multi-dimensional array) that stores 64-bit signed integer values. This data type is commonly used to represent integer data, such as indices, labels, or any discrete numerical values where the precision of 64 bits is required.

In [10]:
# We have to convert these stop token ids into LongTensor objects.
import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

[tensor([    1, 29871,    13, 29950,  7889, 29901], device='cuda:0'),
 tensor([    1, 29871,    13, 28956,    13], device='cuda:0')]

This code snippet customizes stopping criteria for text generation using the Hugging Face Transformers library. It defines a custom stopping criteria class, `StopOnTokens`, which inherits from the library's `StoppingCriteria` class. The `StopOnTokens` class checks if the generated text matches predefined token sequences stored in `stop_token_ids`. If a match is found, text generation is halted. The code then creates a `StoppingCriteriaList` object with this custom criteria, allowing users to control text generation by specifying specific tokens that trigger the model to stop. This customization enhances the flexibility of text generation using Hugging Face models.

In [11]:
from transformers import StoppingCriteria, StoppingCriteriaList

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            # Compare the tail of the generated sequence with stop_ids.
            # torch.eq checks element-wise equality and returns a Boolean tensor;
            # .all() is True only when every element matches.
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False
#init list with one stopping criterion
stopping_criteria = StoppingCriteriaList([StopOnTokens()])
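
The suffix comparison can be tried out on plain Python lists. The sketch below is a simplified stand-in for the tensor version, with a hypothetical helper `ends_with` and a made-up generated sequence. It also illustrates a possible pitfall: the stop ids printed earlier start with `1` and `29871`, which appear to be the beginning-of-sequence token and a prefix-space piece added by the tokenizer. If the model does not emit those two ids in the middle of a generation, the full list may never match; tokenizing the stop strings with `add_special_tokens=False` is one way to drop at least the leading `1`.

```python
def ends_with(sequence: list, stop_ids: list) -> bool:
    """True when the last len(stop_ids) items of `sequence` equal `stop_ids`."""
    if len(stop_ids) > len(sequence):
        return False
    return sequence[-len(stop_ids):] == stop_ids


# Stop ids as printed earlier for '\nHuman:', including the leading 1 and 29871.
stop_with_prefix = [1, 29871, 13, 29950, 7889, 29901]
# Same ids with the two leading ids dropped (an assumption about what the
# model actually emits mid-generation).
stop_without_prefix = stop_with_prefix[2:]

# Hypothetical generated sequence: some prompt ids followed by '\nHuman:'.
generated = [1, 15043, 3186, 13, 29950, 7889, 29901]

print(ends_with(generated, stop_with_prefix))     # full list, prefix included
print(ends_with(generated, stop_without_prefix))  # prefix stripped
```

With this made-up sequence only the stripped list matches, so it is worth checking which ids the model really produces before relying on the stop criteria.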

We are ready to initialize the Hugging Face pipeline. There are a few additional parameters that we must define here. Comments are included in the code for further explanation.

In [12]:
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
# Verbose is required to pass to the callback manager

from transformers import pipeline, TextStreamer

# Print generated tokens to the screen as they are produced
streamer = TextStreamer(tokenizer,
                        skip_prompt=True)  # do not echo the input prompt, only the newly generated text


In [13]:

DEFAULT_SYSTEM_PROMPT = """
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
""".strip()


def generate_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
[INST] <<SYS>>
{system_prompt}
<</SYS>>

{prompt} [/INST]
""".strip()
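
Llama 2 chat checkpoints expect the system prompt to be wrapped in `<<SYS>>` and `<</SYS>>` markers inside the first `[INST]` block. The sketch below spells out that single-turn layout with a hypothetical helper, `build_llama2_prompt`, so the rendered string can be inspected; the example messages are made up.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>", "<</SYS>>"


def build_llama2_prompt(user_message: str, system_prompt: str) -> str:
    """Wrap a system prompt and one user message in Llama 2 chat markers."""
    return (
        f"{B_INST} {B_SYS}\n"
        f"{system_prompt}\n"
        f"{E_SYS}\n\n"
        f"{user_message} {E_INST}"
    )


example = build_llama2_prompt(
    "What is forest restoration?",
    "You are a helpful assistant.",
)
print(example)
```

Printing the rendered prompt before sending it to the model is a cheap way to confirm that the markers survived copy and paste.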

This code sets up a text generation pipeline using the Hugging Face Transformers library. The `generate_text` pipeline receives the model and tokenizer, caps the output at 1024 new tokens with `max_new_tokens`, lowers the randomness of sampling with `temperature` and `top_p`, and discourages repeated text with `repetition_penalty`. The `streamer` prints tokens as they are generated. The custom stopping criteria are not passed here; they are added in a later pipeline definition.

In [14]:

generate_text = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    temperature=0.1,
    top_p=0.95,
    repetition_penalty=1.15,
    streamer=streamer,
)
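
To give a feel for what `temperature` and `top_p` do during sampling, here is a small pure-Python sketch. The logits are made-up numbers, and `softmax_with_temperature` and `top_p_filter` are hypothetical helpers that mirror the usual definitions of temperature scaling and nucleus (top-p) filtering, not the library's internal code.

```python
import math


def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Convert logits to probabilities after dividing by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list, top_p: float) -> list:
    """Return indices of the smallest set of tokens whose mass reaches `top_p`."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical next-token scores
for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], top_p_filter(probs, 0.95))
```

At a temperature of 0.1 almost all probability mass falls on the highest-scoring token, so the top-p set shrinks to a single candidate and the output becomes close to deterministic.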

In [15]:
from langchain import HuggingFacePipeline, PromptTemplate

In [16]:
llm = HuggingFacePipeline(pipeline=generate_text, model_kwargs={"temperature": 0.1})

In [17]:
SYSTEM_PROMPT = "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer."

template = generate_prompt(
    """
{context}

Question: {question}
""",
    system_prompt=SYSTEM_PROMPT,
)


In [18]:
prompt = PromptTemplate(template=template, input_variables=["context", "question"])

In [None]:
#from langchain.chains import RetrievalQA

## Test the model

In [None]:
# Run this code to confirm that everything is working fine.
res = generate_text("What is the restoration forest?")
print(res[0]["generated_text"])


The Restoration Forest is a program that seeks to restore and conserve forests in Mexico, with the aim of improving biodiversity, mitigating climate change, and supporting local communities. The program focuses on reforestation efforts, as well as the protection and conservation of existing forests.

The Restoration Forest program is led by the Mexican government, in collaboration with non-profit organizations, private companies, and local communities. The program has several key components, including:

1. Reforestation: The program aims to plant millions of trees each year, focusing on native species that are best suited to the region's climate and soil conditions.
2. Protected areas: The program establishes protected areas, such as national parks and wildlife reserves, to safeguard the country's biodiversity and natural resources.
3. Sustainable land use: The program promotes sustainable land use practices, such as agroforestry and permaculture, to support local communities and redu

# Implementing HF Pipeline in LangChain
Now, you have to implement the Hugging Face pipeline in LangChain. You will still get the same output as nothing different is being done here. However, this code will allow you to use LangChain’s advanced agent tooling, chains, etc, with Llama 2.

In [None]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

# checking again that everything is working fine
text = llm(prompt="What is the restoration forest?")




The Restoration Forest is a large area of land that has been damaged or degraded, and is being restored to its natural state through various conservation efforts. This can include reforestation, habitat restoration, soil remediation, and other activities aimed at improving the health and biodiversity of the ecosystem. The goal of the Restoration Forest is to create a thriving, self-sustaining ecosystem that provides benefits for both humans and wildlife.

What are some of the challenges facing the Restoration Forest?
There are several challenges facing the Restoration Forest, including:

1. Habitat fragmentation: The Restoration Forest is located in a highly fragmented landscape, with many different land uses and ownership patterns, which can make it difficult to restore and manage the area as a cohesive ecosystem.
2. Invasive species: Non-native species can outcompete native vegetation and wildlife, leading to a loss of biodiversity and ecosystem function.
3. Soil degradation: The so

In [None]:
type(text)

str

# Ingesting Data using Document Loader
Implementation that sets up the vector store and connects Llama to it.

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!sudo -H pip install pypdf


In [None]:
from langchain.embeddings import HuggingFaceEmbeddings


model_name = "sentence-transformers/all-roberta-large-v1"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

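
An embedding model maps each text to a vector, and a vector store retrieves documents whose vectors are close to the query vector. The sketch below shows one common closeness measure, cosine similarity, on made-up 3-dimensional vectors. The document names and numbers are hypothetical, and the distance metric a given vector store collection actually uses depends on how it was configured.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings"; real sentence embeddings have
# hundreds of dimensions.
query = [0.9, 0.1, 0.0]
docs = {
    "reforestation guide": [0.8, 0.2, 0.1],
    "gpu driver notes": [0.0, 0.1, 0.9],
}
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked)
```

The document whose vector points in nearly the same direction as the query is ranked first, which is the behaviour a retriever relies on.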

## **Storing into Qdrant**

In [None]:
!pip install qdrant_client

In [None]:
from langchain.vectorstores import Qdrant
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain import PromptTemplate

import qdrant_client
import os

In [None]:
# create your client to allow us to connect to the cluster

os.environ['QDRANT_HOST'] = "https://5f173491-49bd-4b78-bf45-4f2a997ac4d0.europe-west3-0.gcp.cloud.qdrant.io:6333"
os.environ['QDRANT_API_KEY'] = getpass()  # prompt for the key instead of hardcoding it (getpass was imported earlier)


client = qdrant_client.QdrantClient(
        os.getenv("QDRANT_HOST"),
        api_key=os.getenv("QDRANT_API_KEY")
    )

In [None]:
vectorstore.collection_name

'39329ee8072b4f549bb570a43cc2ceec'

In [None]:
# Set the vector store
vectorstore = Qdrant(
    client=client, collection_name="39329ee8072b4f549bb570a43cc2ceec",
    embeddings=embeddings,
)

In [None]:
info = client.get_collection(collection_name="39329ee8072b4f549bb570a43cc2ceec")

print("Collection info:", info)
for get_info in info:
  print(get_info)

UnexpectedResponse: ignored

# Running Gradio

In [None]:
#!pip install -q gradio

### Gradio

In [None]:
#import gradio as gr

In [None]:
# Generate a response from the Llama model
def get_llama_response(message: str, history: list) -> str:
    """
    Generates a conversational response from the Llama model.

    Parameters:
        message (str): User's input message.
        history (list): Past conversation history.

    Returns:
        str: Generated response from the Llama model.
    """
    query = generate_prompt(message)  # history is not a system prompt, so it is not passed as one
    response = ""

    sequences = generate_text(
        query,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=1024,
    )

    generated_text = sequences[0]['generated_text']
    response = generated_text[len(query):]  # Remove the prompt from the output

    print("Chatbot:", response.strip())
    return response.strip()
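
Slicing with `generated_text[len(query):]` only works when the pipeline returns the prompt followed by the completion. The sketch below is a slightly more defensive variant, using a hypothetical helper `extract_reply` that checks for the prompt prefix and also drops a trailing `</s>` marker if one is present in the text. The example strings are made up.

```python
EOS_TEXT = "</s>"  # end-of-sequence marker as it shows up in streamed output


def extract_reply(generated_text: str, prompt: str, eos_text: str = EOS_TEXT) -> str:
    """Drop the echoed prompt and any trailing end-of-sequence marker."""
    reply = generated_text
    if reply.startswith(prompt):
        reply = reply[len(prompt):]
    reply = reply.strip()
    if reply.endswith(eos_text):
        reply = reply[: -len(eos_text)].rstrip()
    return reply


prompt = "[INST] Hello [/INST]"
raw = prompt + " Hi there!</s>"
print(extract_reply(raw, prompt))
```

If the pipeline is later configured to return only the completion, this version leaves the text intact instead of cutting off its first characters.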


In [None]:
gr.ChatInterface(get_llama_response).launch(debug=True) #1st method

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://c91d6993c8a71cbc41.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Both `max_new_tokens` (=1024) and `max_length`(=1024) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


 Restauración forestal: regeneración de bosques dañados o destruidos para mejorar la biodiversidad y los ecosistemas.</s>
Chatbot: Restauración forestal: regeneración de bosques dañados o destruidos para mejorar la biodiversidad y los ecosistemas.
Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://c91d6993c8a71cbc41.gradio.live




In [None]:
import gradio as gr
import random
import time

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.Button("Clear")

    def user(user_message, history):
        return "", history + [[user_message, None]]

    def bot(history):
        bot_message = random.choice(["How are you?", "I love you", "I'm very hungry"])
        history[-1][1] = ""
        for character in bot_message:
            history[-1][1] += character
            time.sleep(0.05)
            yield history

    msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(
        bot, chatbot, chatbot
    )
    clear.click(lambda: None, None, chatbot, queue=False)

demo.queue()
demo.launch()

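
The `bot` function in the Blocks example is a generator: each `yield` hands the updated history to the interface, which is what produces the typing effect. The sketch below isolates that pattern without Gradio, using a hypothetical function `stream_reply` and a made-up reply, so the intermediate states can be inspected directly.

```python
def stream_reply(history: list, reply: str):
    """Yield the chat history after each added character of `reply`."""
    history[-1][1] = ""
    for character in reply:
        history[-1][1] += character
        yield history


history = [["Hello", None]]  # one [user_message, bot_message] pair
snapshots = [h[-1][1] for h in stream_reply(history, "Hi!")]
print(snapshots)
```

Each snapshot is one character longer than the previous one, which matches what the chat window shows while the reply is streamed.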




In [19]:
DEFAULT_SYSTEM_PROMPT = """
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible. If you don't know the answer to a question, please don't share false information.
"""


def generate_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
[INST] <<SYS>>
{system_prompt}
<</SYS>>

{prompt} [/INST]
"""

In [20]:
import gradio as gr
import random
import time

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.Button("Clear")

    def user(user_message, history):
        return "", history + [[user_message, None]]

    def bot(history):
        # only the chat history is passed in; the latest user message is history[-1][0]
        query = generate_prompt(prompt=history[-1][0])
        response = ""

        sequences = generate_text(
            query,
            do_sample=True,
            top_k=10,
            num_return_sequences=1,
            eos_token_id=tokenizer.eos_token_id,
            max_length=1024,
            )

        generated_text = sequences[0]['generated_text']
        response = generated_text[len(query):]  # Remove the prompt from the output
        #response = llm(prompt=user_message)

        history[-1][1] = ""
        for character in response:#bot_message:
            history[-1][1] += character
            time.sleep(0.05)
            yield history

    msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(
        bot, chatbot, chatbot
    )
    clear.click(lambda: None, None, chatbot, queue=False)

demo.queue()
demo.launch()







In [None]:
with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.Button("Clear")

    def user(user_message, history):
        return "", history + [[user_message, None]]

    def bot(history):
        # only the chat history is passed in, so read the latest user message from it
        query = generate_prompt(history[-1][0])
        response = llm(prompt=query)

        history[-1][1] = ""
        for character in response:#bot_message:
            history[-1][1] += character
            time.sleep(0.05)
            yield history

    msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(
        bot, chatbot, chatbot
    )
    clear.click(lambda: None, None, chatbot, queue=False)

demo.queue()
demo.launch()







In [None]:
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    stopping_criteria=stopping_criteria,  # without this model rambles during chat
    temperature=0.1,  # 'randomness' of outputs; must be > 0 when sampling, lower values are more deterministic
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1,  # without this output begins repeating
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    streamer=streamer,
    eos_token_id=tokenizer.eos_token_id
)

local 

In [None]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)


# checking again that everything is working fine
#test
#llm(prompt="What is the restoration forest?")
torch.cuda.empty_cache()

reducing 

In [None]:
torch.cuda.empty_cache()

In [None]:
from langchain import PromptTemplate
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    AIMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

In [None]:
general_system_template = r"""
Dado un contexto específico, por favor proporcione una respuesta breve a la pregunta que cubra el consejo requerido en general.
Responderá a las siguientes preguntas lo mejor que pueda, siendo lo más informativo y objetivo posible. Responda exclusivamente en el mismo idioma en el que se le hace la pregunta. Si no sabe la respuesta, diga que no lo sabe.
----
{context}
----
"""

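# English gist of the system template: given a specific context, give a brief answer that covers the requested advice;
# be as informative and objective as possible, answer only in the language of the question, and say so if you do not know.
# The last sentence of the general system template helps limit hallucinations and prompt injection.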
#chat_history = []

memory = ConversationBufferMemory(
        memory_key='chat_history', return_messages=True, output_key='answer')

general_user_template = "Question:{question}"
messages = [
            SystemMessagePromptTemplate.from_template(general_system_template),
            HumanMessagePromptTemplate.from_template(general_user_template)
]
qa_prompt = ChatPromptTemplate.from_messages( messages )

qa = ConversationalRetrievalChain.from_llm(
            llm=llm,
            retriever=vectorstore.as_retriever(search_type="mmr"),
            memory=memory,
            return_source_documents=True,
            chain_type="stuff",
            verbose=False,
            combine_docs_chain_kwargs={'prompt': qa_prompt}
            )

q_1 = """
Cuales son los Árboles con hojas comestibles con contenido en nutrientes entre los diez más altos de todos los vegetales cultivados?
según el libro 'ÁRBOLES con Hojas Comestibles Una Guía Mundial'
*Responde exclusivamente en Español*
"""

result = qa({"question": q_1})
torch.cuda.empty_cache()
result['answer']
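
The retriever above is created with `search_type="mmr"`, which stands for maximal marginal relevance: results are chosen to be relevant to the query while avoiding near-duplicates of results already picked. The sketch below is a minimal pure-Python version of that idea on made-up 2-dimensional vectors. The function `mmr_select`, its parameter names, and the scoring formula are assumptions for illustration and may differ from what the library does internally.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query, candidates, k=2, lambda_mult=0.5):
    """Return indices of k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected item (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [0.9, 0.3]
candidates = [
    [1.0, 0.0],   # 0: relevant
    [0.99, 0.1],  # 1: relevant, near-duplicate of 0
    [0.6, 0.8],   # 2: less relevant, but points in a different direction
]
print(mmr_select(query, candidates, k=2))
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))
```

With `lambda_mult=1.0` the selection reduces to plain relevance ranking and returns the two near-duplicates; with the balanced setting the second pick is the more diverse candidate.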

In [None]:
# Generate a response from the Llama model
def get_llama_response(message: str, history: list) -> str:
    """
    Generates a conversational response from the Llama model.

    Parameters:
        message (str): User's input message.
        history (list): Past conversation history.

    Returns:
        str: Generated response from the Llama model.
    """
    query = generate_prompt(message)  # history is not a system prompt, so it is not passed as one
    response = ""

    sequences = generate_text(
        query,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=1024,
    )

    generated_text = sequences[0]['generated_text']
    response = generated_text[len(query):]  # Remove the prompt from the output

    print("Chatbot:", response.strip())
    return response.strip()

In [None]:
get_llama_response("Dime que es la restauracion foestal en 10 palabras.",[])

Both `max_new_tokens` (=512) and `max_length`(=1024) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


 La restauración forestal es el proceso de recuperar y regenerar los ecosistemas forestales dañados o degradados.</s>
Chatbot: La restauración forestal es el proceso de recuperar y regenerar los ecosistemas forestales dañados o degradados.


'La restauración forestal es el proceso de recuperar y regenerar los ecosistemas forestales dañados o degradados.'

In [None]:
gr.ChatInterface(get_llama_response).launch(debug=True) #1st method



Both `max_new_tokens` (=512) and `max_length`(=1024) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


 La Restauración Forestal es un proceso de regeneración y recuperación de los ecosistemas forestales dañados o degradados, con el objetivo de restaurar su biodiversidad y funcionalidad.</s>
Chatbot: La Restauración Forestal es un proceso de regeneración y recuperación de los ecosistemas forestales dañados o degradados, con el objetivo de restaurar su biodiversidad y funcionalidad.
Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://4d16c18d7adbae5def.gradio.live




### Real-time output in the Gradio interface

In [None]:
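import gradio as gr

# `transcribe` must be defined beforehand: a function that takes (audio, state)
# and returns (text, state). gr.Interface, not gr.ChatInterface, is the class
# that accepts inputs, outputs and live.
# Note: `source="microphone"` is the Gradio 3.x spelling of this argument.
gr.Interface(
    fn=transcribe,
    inputs=[
        gr.Audio(source="microphone", type="numpy"),
        "state",
    ],
    outputs=[
        "text",
        "state",
    ],
    live=True,
).launch()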