<a href="https://colab.research.google.com/github/amitsangani/llama/blob/main/Connect_2023_Building_with_Llama_Responsibly.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

IMPORTANT: This outline needs to be grounded more firmly in responsibility. Reference the [RUG](https://ai.meta.com/static-resource/responsible-use-guide/), call out specific pieces of the guide, and show how to use them when building.


## **Hello Llama 2**
</div>

Welcome to the next generation of our open source large language model! The Llama 2 release includes model weights and starting code for pretrained and fine-tuned Llama language models, ranging from 7B to 70B parameters.

We will start with the basics of Llama 2 and its key features, then show how to get the models from Meta’s site and from cloud service providers (Azure, AWS and GCP).

### **What is Llama 2?**

Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. Llama 2 is intended for commercial and research use in English. It comes in a range of parameter sizes—7 billion, 13 billion, and 70 billion—as well as pre-trained and fine-tuned variations. Llama 2 was pre-trained on 2 trillion tokens of data from publicly available sources.

---

### **Get Started:**

1.   Download the models from Meta's [Llama site](https://ai.meta.com/llama/) or [Hugging Face](https://huggingface.co/meta-llama), or provision an instance directly through one of our cloud partners: [Azure](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/introducing-llama-2-on-azure/ba-p/3881233), [AWS SageMaker](https://aws.amazon.com/blogs/machine-learning/llama-2-foundation-models-from-meta-are-now-available-in-amazon-sagemaker-jumpstart/) or Google Cloud.
2.   Please review the [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/) to ensure you understand best practices and considerations for building products powered by Llama 2 in a responsible manner.
3.   Please also review the [Acceptable Use Policy](https://ai.meta.com/llama/use-policy/) to ensure safe and fair use. If you access or use Llama 2, you agree to this Acceptable Use Policy.
4.   Please download the [code](https://github.com/facebookresearch/llama/) from Meta's GitHub site.

> * Quickly go over technical review of Llama from another opened tab - https://ai.meta.com/llama/ *














## **Get a quantized Llama 13B model from Hugging Face**

To run on the Google Colab free tier, we need a quantized model. Quantized models available on Hugging Face let us run the model on a T4 GPU. We will use the [Llama-2-13B-chat-GGML model](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML), which is based on the GGML library, and run it with [llama.cpp](https://github.com/ggerganov/llama.cpp).
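To make the need for quantization concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone occupy. The helper name `estimate_weights_gb` is hypothetical, and the figure of about 6 bits per weight for a 5-bit format (5 bits plus per-block scale and offset metadata) is an assumption for illustration, not a number taken from this notebook. The estimate ignores the KV cache and activations, so real memory use is somewhat higher.

```python
def estimate_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 13e9  # nominal parameter count of the 13B model

# Unquantized half-precision weights: 16 bits each.
fp16_gb = estimate_weights_gb(N_PARAMS, 16)

# Assumption: roughly 6 bits per weight for a 5-bit quantized format
# once per-block scale/offset metadata is counted.
q5_gb = estimate_weights_gb(N_PARAMS, 6)

print(f"fp16: {fp16_gb:.2f} GB, 5-bit quantized: {q5_gb:.2f} GB")
```

Under these assumptions the half-precision weights come to about 26 GB, well beyond a T4's 16 GB of memory, while the quantized weights come to just under 10 GB, which is in line with the 9.76G download shown later in this notebook.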

## **Install all the required packages**

In [None]:
# GPU llama-cpp-python
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose

In [None]:
# Install the Hugging Face Hub client, used below to download the model
!pip install huggingface_hub

In [None]:
# specify the model path
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin" # the model is in bin format

## **Import the libraries and download the model**

In [None]:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

Downloading (…)chat.ggmlv3.q5_1.bin:   0%|          | 0.00/9.76G [00:00<?, ?B/s]

## **Load the model**

In [None]:
# GPU
lcpp_llm = None
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2, # CPU cores
    n_batch=512, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    n_gpu_layers=32 # Change this value based on your model and your GPU VRAM pool.
    )

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 


In [None]:
# see the number of layers in GPU

lcpp_llm.params.n_gpu_layers


32

## **Create a prompt template**

In [None]:
prompt = "Tell me more about life and how to be successful"
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible.

USER: {prompt}

ASSISTANT:
'''
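If you want to reuse the template with different questions or system messages, one option is to wrap it in a small function. This is a minimal sketch; `build_prompt` and `DEFAULT_SYSTEM` are hypothetical names introduced here, and the layout simply mirrors the SYSTEM / USER / ASSISTANT template above.

```python
DEFAULT_SYSTEM = "You are a helpful, respectful and honest assistant."


def build_prompt(user_message: str, system_message: str = DEFAULT_SYSTEM) -> str:
    """Fill the SYSTEM / USER / ASSISTANT template used in this notebook."""
    return (
        f"SYSTEM: {system_message}\n\n"
        f"USER: {user_message.strip()}\n\n"
        "ASSISTANT:\n"
    )


# Example: build a prompt for a new question.
print(build_prompt("Tell me more about life and how to be successful"))
```

Keeping the template in one place makes it easier to change the system message later without editing every cell that builds a prompt.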

In [None]:
# simple utility to wrap text in colab before generating a response
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

## **Generate a response**

In [None]:
response=lcpp_llm(
    prompt=prompt_template,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=150,
    echo=True)

print(response["choices"][0]["text"])
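Because the call above passes `echo=True`, the returned text starts with the prompt itself. If you only want the model's answer, a small helper can strip the prompt. This is a sketch under that assumption; `extract_answer` is a hypothetical name, and `fake_response` is a hand-written stand-in that only mimics the `choices[0]["text"]` shape used above so the example runs without a model.

```python
def extract_answer(response: dict, prompt: str) -> str:
    """Return the generated text without the echoed prompt."""
    text = response["choices"][0]["text"]
    # With echo=True the prompt is prepended to the completion.
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Hand-written stand-in for a real response, for illustration only.
example_prompt = "USER: Hi\n\nASSISTANT:\n"
fake_response = {"choices": [{"text": example_prompt + "Hello there!"}]}

print(extract_answer(fake_response, example_prompt))
```

In the notebook you would call it as `extract_answer(response, prompt_template)`.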

## **Inference APIs**
</div>

In this section, we’ll go through different approaches to running inference with the Llama 2 models. Once you have received access to the models after signing up through Meta's official form, you will be ready to run inference on them.


## **Basic Chat Completion**
</div>

We will discuss differences between pre-trained and fine-tuned models and walk through the “chat completion” example to get expected features and performance.

## **Enhance Chat Completion and Memory**
</div>

We will then enhance the chat completion example and discuss conversation context and short-term memory.
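One simple way to give a chatbot short-term memory is to keep the last few exchanges and replay them in every prompt. The sketch below illustrates that idea only; the class name `ConversationMemory`, the `max_turns` parameter, and the prompt layout (borrowed from the SYSTEM / USER / ASSISTANT template earlier in this notebook) are assumptions, not the approach the session will necessarily use.

```python
from collections import deque


class ConversationMemory:
    """Keep the most recent turns and replay them in each new prompt."""

    def __init__(self, system_message: str, max_turns: int = 3):
        self.system_message = system_message
        # A bounded deque drops the oldest turn once max_turns is reached.
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, user_message: str, assistant_message: str) -> None:
        self.turns.append((user_message, assistant_message))

    def build_prompt(self, new_user_message: str) -> str:
        parts = [f"SYSTEM: {self.system_message}"]
        for user_message, assistant_message in self.turns:
            parts.append(f"USER: {user_message}")
            parts.append(f"ASSISTANT: {assistant_message}")
        parts.append(f"USER: {new_user_message}")
        parts.append("ASSISTANT:")
        return "\n\n".join(parts) + "\n"


memory = ConversationMemory("You are a helpful assistant.", max_turns=2)
memory.add_turn("My name is Sam.", "Nice to meet you, Sam!")
print(memory.build_prompt("What is my name?"))
```

Capping the number of turns keeps the prompt within the model's context window, at the cost of forgetting older parts of the conversation.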

## **AI Chatbot and Langchain**
</div>

Convert your example above into an AI-Chatbot using LangChain, a popular framework for building applications with large language models.

## **Vector DB, Similarity Search (FAISS) and Embeddings**
</div>

In this section, we’ll go through different prompt engineering principles using vector databases and embeddings.
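To show what similarity search over embeddings means, here is a tiny pure-Python sketch. It is a stand-in for illustration, not FAISS: the three-dimensional vectors are made up by hand rather than produced by an embedding model, and `cosine_similarity`, `top_k`, and `toy_index` are hypothetical names. A vector database does the same kind of ranking over far larger collections with much faster indexing.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the k texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hand-made toy "embeddings"; a real system would compute these with a model.
toy_index = [
    ("Llama 2 comes in 7B, 13B and 70B sizes.", [0.9, 0.1, 0.0]),
    ("FAISS performs similarity search over vectors.", [0.1, 0.9, 0.1]),
    ("A quantized model fits on a T4 GPU.", [0.0, 0.2, 0.9]),
]
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of a question about model sizes

print(top_k(query_vector, toy_index, k=1))
```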



## **Q&A Retriever with LangChain**
</div>

Complete the AI-Chatbot with a Q&A retriever and walk through the entire codebase again.
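Conceptually, a Q&A retriever fetches the passages most relevant to a question and places them in the prompt so the model can answer from them. The sketch below shows only the prompt-assembly step, without LangChain; `build_qa_prompt` is a hypothetical helper, the passages are hard-coded for illustration, and the layout reuses the SYSTEM / USER / ASSISTANT template from earlier in this notebook.

```python
def build_qa_prompt(question: str, passages: list) -> str:
    """Combine retrieved passages and a question into a single prompt."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "SYSTEM: Answer using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"USER: {question}\n\n"
        "ASSISTANT:\n"
    )


# In a full pipeline these passages would come from a similarity search.
retrieved = [
    "Llama 2 comes in 7B, 13B and 70B parameter sizes.",
    "Llama 2 was pre-trained on 2 trillion tokens.",
]
print(build_qa_prompt("How many tokens was Llama 2 trained on?", retrieved))
```

Telling the model to rely only on the supplied context is one common way to reduce unsupported answers, though it does not guarantee them away.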

## **Advanced chatbot concepts**
</div>

Explain advanced chatbot concepts (without code).

## **Responsible Use**

Safety and moderation. Refer to the [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/) (RUG).

## **Llama deployments**
</div>

Finally, we will discuss how to deploy your AI-Chatbot for the world to use (including showing how this was done with llama-2-7b-chat on-device).
