# Running LLMs locally





## Open/Closed LLM


![Survey of models](./data/survey_of_llms.png)

A Survey of Large Language Models [arxiv.org paper](https://arxiv.org/pdf/2303.18223.pdf)



## Popular LLM Families

* OpenAI GPT family
* Meta LLaMA family
* Google PaLM family



# Open Models


For most tasks other than generalist chat, open-source models are ahead, thanks to customized models.

![Open vs closed model Elo](./data/arena_elo.jpg)



Fewer parameters
* 3B, 13B, 30B, 70B
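
Parameter count largely decides whether a model fits on local hardware. The following is a back-of-the-envelope sketch, not a measurement: it assumes the memory needed to hold the weights is roughly parameter count times bits per parameter. The helper name `weight_memory_gb` and the precision table are illustrative, and real usage is higher because of the KV cache and runtime overhead.

```python
# Rough estimate of the memory needed just to hold model weights.
# Assumption: memory ~= parameters * bits per parameter / 8.
# Activations, KV cache and runtime overhead are ignored.

BITS_PER_PARAM = {"fp16": 16, "int8": 8, "int4": 4}

def weight_memory_gb(params_billions: float, precision: str = "fp16") -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * BITS_PER_PARAM[precision] / 8
    return bytes_total / 1e9

for size in (3, 13, 30, 70):
    row = ", ".join(
        f"{precision}: {weight_memory_gb(size, precision):.1f} GB"
        for precision in BITS_PER_PARAM
    )
    print(f"{size}B -> {row}")
```

Under these assumptions, lower-precision (quantized) weights are what make the larger sizes plausible on a laptop or a single consumer GPU.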


# Open source models

* [OLMo](https://allenai.org/olmo)
* GPT-NeoX, Pythia, OLMo, and Amber all have publicly available training data and OSI-licensed training and evaluation code, model weights, and partially trained checkpoints



# Open weights models


## Mistral

* Open weights NOT open source
  * Open weights are not open source unless the vendor provides full access to the training set and source code. With all respect to the capabilities of Mistral's models, it is an extreme stretch to call a company that drops torrents of weight binaries open source
  * Does not disclose the training recipe, how the data was collected, or the mixture-of-experts details
  * Currently focused on developer experience first
  * Not just APIs, because you need an AI integrator


## Grok (X)

* Open weights


## LLaMA

LLaMA (Large Language Model Meta AI) is a family of LLMs released by [Meta](https://ai.meta.com/blog/llama-2/). The original LLaMA weights were released to the research community under a noncommercial license.
* LLaMA - Feb '23
* LLaMA 2 - Jul '23
* LLaMA 3 - Apr '24

Many researchers have extended LLaMA models through either instruction tuning or continual pretraining.
  *  Instruction tuning LLaMA has become a major approach to developing customized or specialized models because of its relatively low computational cost.

![LLaMA](./data/llama.png)



  *  Stanford [Alpaca-52K](https://github.com/tatsu-lab/stanford_alpaca): instruction-following data generated with the techniques from [Self-Instruct](https://github.com/yizhongw/self-instruct)
  *  On the self-instruct evaluation set, Alpaca shows many behaviors similar to OpenAI's text-davinci-003, but is also surprisingly small and easy/cheap to reproduce.
  *  Alpaca: [fine-tunes](https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file#fine-tuning) LLaMA models using standard Hugging Face training code. Alpaca is very cost-effective to train ($500)
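
To make the idea of instruction-following data concrete, here is a minimal sketch of how one record could be turned into a training prompt. The `instruction`/`input`/`output` field names and the prompt wording are modeled on the Stanford Alpaca repository, but treat the exact text as an assumption and check the repository before relying on it. The example record and the `format_prompt` helper are hypothetical.

```python
# Sketch: turn one instruction-tuning record into a prompt string.
# Template wording is modeled on the Stanford Alpaca repository (assumption).

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def format_prompt(record: dict) -> str:
    """Pick the template depending on whether the record has an input."""
    if record.get("input"):
        return PROMPT_WITH_INPUT.format(**record)
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])

# Hypothetical record, for illustration only
record = {
    "instruction": "Translate the sentence to French.",
    "input": "Good morning.",
    "output": "Bonjour.",
}
print(format_prompt(record))
```

During fine-tuning, the model is trained to produce the record's `output` as the continuation of this prompt.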




## Open Models/Weights w/ Agentic workflow better than GPT4.0????

Agentic Reasoning Design Pattern
- Reflection
- Tools
- Planning
- Multi-agent collaboration
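
As a rough illustration of the Reflection pattern, the sketch below drafts an answer, asks the model to critique it, and revises until the critique comes back clean. It is a minimal sketch under stated assumptions: `reflect_and_revise`, the prompt wording, and the `fake_llm` stub are all hypothetical, and `llm` stands for any callable that maps a prompt string to a text reply.

```python
from typing import Callable

def reflect_and_revise(task: str, llm: Callable[[str], str], max_rounds: int = 2) -> str:
    """Draft an answer, ask the model to critique it, then revise."""
    draft = llm(f"Task: {task}\nWrite a first draft.")
    for _ in range(max_rounds):
        critique = llm(f"Task: {task}\nDraft: {draft}\nList problems, or reply OK.")
        if critique.strip().upper() == "OK":
            break  # the model found nothing left to fix
        draft = llm(
            f"Task: {task}\nDraft: {draft}\nCritique: {critique}\nRewrite the draft."
        )
    return draft

# Deterministic stand-in for a real model, so the loop can run offline
def fake_llm(prompt: str) -> str:
    if "Rewrite the draft" in prompt:
        return "revised draft"
    if "List problems" in prompt:
        return "OK" if "Draft: revised draft" in prompt else "too vague"
    return "first draft"

print(reflect_and_revise("Summarise the refund policy", fake_llm))
```

In practice `llm` could be a thin wrapper around the local model call used elsewhere in this notebook. A zero-shot workflow corresponds to returning the first draft without the loop.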

- Andrew Ng [Sequoia Talk](https://www.youtube.com/watch?v=sal78ACtGTc)
  - Agentic (using agent) Workflows
  - Agentic workflow MUCH better than Zero-shot workflow (even for older models)
    - GPT3.5 with an agentic workflow performs much better than GPT4.0 zero-shot

- Harrison Chase [Sequoia Langchain Agents](https://www.youtube.com/watch?v=pBBe1pk8hf4)
  - Planning Step (upfront) vs Reflection Steps (at end)
  - Flow engineering: AlphaCodium flow paper
     - Offload planning to human 
  - UX of agent apps
     - Human in the loop?
     - Rewind and edit?
     - Memory of agents


# Open Model Licensing





# Ollama & Private Data

Download and install [Ollama](https://ollama.com/) on Mac, Linux, or Windows.

Available Ollama [models](https://ollama.com/library)
* [Mistral](https://ollama.com/library/mistral)
* [Mixtral-8x7B](https://ollama.com/library/mixtral) is more powerful and can handle longer conversations

Ollama has support for multi-modal LLMs, such as bakllava and llava.


Ollama & [Langchain](https://python.langchain.com/docs/integrations/llms/ollama)


Basic commands:

* Fetch an available model with ```ollama pull <name-of-model>```
* On Mac, models are downloaded to ```~/.ollama/models```
* To view all pulled models, use ```ollama list```
* To chat directly with a model from the command line, use ```ollama run <name-of-model>```



--------------------
## References:

* Building a Multi-Document Chatbot Using [Mistral 7B, ChromaDB, and Langchain](https://www.e2enetworks.com/blog/building-a-multi-document-chatbot-using-mistral-7b-chromadb-and-langchain)
* [Ask Your Web Pages Using Mistral-7b & LangChain](https://medium.com/@zekaouinoureddine/ask-your-web-pages-using-mistral-7b-langchain-f976e1e151ca)
* [Ollama Python Library Released! How to implement Ollama RAG?](https://www.youtube.com/watch?v=4HfSfFvLn9Q)
  * Ollama RAG [code](https://mer.vin/2024/01/ollama-rag/)






In [None]:
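# The Ollama wrapper lives in langchain_community; the older
# langchain.llms import path is deprecated in recent LangChain releases
from langchain_community.llms import Ollama

# Requires a running Ollama server and a previously pulled "mistral" model
llm = Ollama(model="mistral")

llm.invoke("Tell me a joke")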

------------------------

-----
## Environment set up



In [None]:
!pip install -qU \
    langchain \
    langchain-community \
    tiktoken \
    ollama \
    pypdf \
    chromadb \
    pinecone-client \
    ipywidgets \
    langflow



---------------------
# Mistral Embedding





---------
# Data Ingestion


<img src='./data/data_ingestion.png' width='800'>



--> Point to data source and load multiple documents (PDF/Word/HTML/Chat...). [Document Loaders](https://python.langchain.com/en/latest/modules/indexes/document_loaders.html)

--> **Chunk** into smaller parts. [Text Splitters](https://python.langchain.com/en/latest/modules/indexes/text_splitters.html)
  * Optimize for the smallest size without losing context
  * Consider adding some meaningful global metadata in all the chunks giving global context to all your embedded chunks
  * Use ```chunk_overlap``` to maintain some local context
  
--> Create **embedding** vectors for each chunk using an embedding model. [Text Embedding Models](https://python.langchain.com/en/latest/modules/models/text_embedding.html)
  * An embedding is a vector (list) of floating point numbers
  * Embeddings are an AI-native way to represent any kind of data: **text, images, audio and video**
        
--> Store embedding + metadata in [Vector stores](https://python.langchain.com/en/latest/modules/indexes/vectorstores.html)

  * **Vector stores**:
    * [Pinecone](https://docs.pinecone.io/docs/overview): Managed vector store. The index dimension must match the embedding model (OpenAI `text-embedding-ada-002`: 1536)
    * [Chroma](https://docs.trychroma.com/): Open source locally managed vector store.
    * [Qdrant](https://github.com/qdrant/qdrant): Open source vector store with local and cloud managed options
    * PostgreSQL with [pgvector](https://github.com/pgvector/pgvector)

    
--> **Semantic search** to retrieve relevant information by measuring the similarity between two vectors.
  * Typical similarity metrics: **Cosine, Dot Product, Euclidean** 
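
The three metrics can be sketched in a few lines of plain Python. This is an illustrative sketch, not how a vector store implements them (real stores use optimized, often approximate, search). The vectors are made-up 3-dimensional examples, whereas real embeddings have hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    """Dot product: grows with both alignment and vector magnitude."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores magnitude."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0, 1.0]
doc_a = [2.0, 0.0, 2.0]  # same direction as the query, larger magnitude
doc_b = [0.0, 1.0, 0.0]  # orthogonal to the query

for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    print(
        name,
        f"cosine={cosine_similarity(query, doc):.3f}",
        f"dot={dot(query, doc):.3f}",
        f"euclidean={euclidean_distance(query, doc):.3f}",
    )
```

Cosine similarity treats `doc_a` as a perfect match because only direction matters, while Euclidean distance still reports a gap because the magnitudes differ. For unit-length embeddings, all three metrics produce the same ranking.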
            

-----
## Load documents and chunk



In [None]:
# Set variables used globally by the following cells
import os

persist_chroma_directory = '.chroma_db'
pdf_folder = './data/pdf'

os.listdir(pdf_folder)


In [None]:
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

# Load every PDF under pdf_folder (searched recursively) with PyPDFLoader
loader = DirectoryLoader(pdf_folder, glob='**/*.pdf', loader_cls=PyPDFLoader)
documents = loader.load()

# With PyPDFLoader, each entry in documents is one page of a PDF
print(f'{len(documents)} pages loaded')


In [None]:
documents[0]

In [None]:
print(documents[0].page_content)

----
## Split into smaller chunks

* Split the text up into small, semantically meaningful chunks.
* Most LLMs limit the number of tokens you can pass in, so an entire document or several document pages plus the prompt may exceed the LLM's token limit.
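
To show what chunk size and overlap mean, here is a deliberately simplified sliding-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to split on separators such as paragraphs and sentences. The function name `chunk_text` and the metadata layout are assumptions made for this sketch.

```python
def chunk_text(text, chunk_size, chunk_overlap, metadata=None):
    """Cut text into fixed-size character windows.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring chunks share `chunk_overlap` characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append({
            "text": text[start:start + chunk_size],
            # Global metadata is copied into every chunk; the offset is local
            "metadata": {**(metadata or {}), "start": start},
        })
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return pieces

for piece in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1,
                        metadata={"source": "example.txt"}):
    print(piece)
```

With a chunk size of 4 and an overlap of 1, the last character of each chunk is repeated as the first character of the next, which is the local context that overlap preserves.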



In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the loaded pages into chunks of up to 2000 characters,
# with a 200-character overlap between neighbouring chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

print(f'{len(chunks)} chunks created')

In [None]:
chunks[:2]


In [None]:
print(chunks[0].page_content)

----
## Chroma: Create document embeddings

[Chroma](https://docs.trychroma.com/): Open source locally managed vector store.

Mistral [embedding pricing](https://docs.mistral.ai/platform/pricing/)


In [None]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings

persist_chroma_directory = '.chroma_db'

# Alternative embedding backends (uncomment one and add the matching import):
#embedding = MistralAIEmbeddings(mistral_api_key="your-api-key")
#embedding.model = "mistral-embed"  # or your preferred model if available
#embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, \
#                             model='text-embedding-ada-002')

# Embed each chunk with the local Mistral model served by Ollama
embeddings = OllamaEmbeddings(model="mistral")
chroma_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=persist_chroma_directory,
)

# Persist the database --> Need to call persist() when using Jupyter
chroma_store.persist()
# Drop the reference; the store is reloaded from disk in the RAG section
chroma_store = None




----------------
# Retrieval Augmented Generation (RAG)


<img src='./data/RAG2.png' width='1000'>

RAG retrieves data from outside the language model (non-parametric) and augments the prompts by adding the relevant retrieved data in context. 

The idea behind the [Retrieval Augmented Generation (RAG)](https://huggingface.co/docs/transformers/model_doc/rag) workflow is simple. Instead of asking a question directly, the process first uses the user's question to search the internal dataset for relevant documents, then passes these documents to the LLM together with the question. With the additional context, the LLM can answer as though it had been trained on the internal dataset.
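
The retrieve-then-augment flow can be sketched without any model or vector store. In this toy version, word overlap stands in for embedding similarity, and the three snippets are made-up placeholders rather than real policy text. The `retrieve` and `build_prompt` names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, documents, k=2):
    """Return the k documents sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, documents, k=2):
    """Augment the question with the retrieved context."""
    context = "\n\n".join(retrieve(question, documents, k))
    return f"Question: {question}\n\nContext: {context}"

# Made-up snippets, for illustration only
docs = [
    "Cancelled flights may entitle passengers to a refund or re-routing.",
    "Liquids in hand luggage must be in small containers.",
    "Tickets are valid for a limited period after the date of issue.",
]

print(build_prompt("What happens if my flight is cancelled?", docs, k=1))
```

The resulting prompt is what gets sent to the LLM. In a real pipeline, only the retrieval step changes: a vector store query replaces the word-overlap ranking.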



In [10]:
import ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Reopen the persisted Chroma store with the same embedding model used at ingestion
persist_chroma_directory = '.chroma_db'
embeddings = OllamaEmbeddings(model="mistral")
chroma_store = Chroma(embedding_function=embeddings, persist_directory=persist_chroma_directory)

# Retriever that returns the chunks most similar to a query
retriever = chroma_store.as_retriever()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Define the Ollama LLM function
def ollama_llm(question, context):
    formatted_prompt = f"Question: {question}\n\nContext: {context}"
    response = ollama.chat(model='mistral', messages=[{'role': 'user', 'content': formatted_prompt}])
    return response['message']['content']

# Define the RAG chain
def rag_chain(question):
    retrieved_docs = retriever.invoke(question)
    formatted_context = format_docs(retrieved_docs)
    return ollama_llm(question, formatted_context)

# Use the RAG chain
#query = "Provide details of compensation if my flight is cancelled? "
query = "Provide details of compensation if my flight is cancelled? Output the results in bullet points"
#query = "How much liquid can I bring on a flight?"
#query = "how long is my ticket valid for?"

result = rag_chain(query)

print(result)


 Compensation for Flight Cancellations by Aer Lingus (as per Article 3.AER LINGUS DELAY NOTICE):

- Applicability:
  * For flights departing from an EU airport or a third country to an EU airport with Aer Lingus as the operating carrier.
  * Conditions: You have a confirmed reservation, present yourself for check-in on time, and are travelling at a publicly available fare.

Rules for Assistance when a Flight is Cancelled:

- When reasonably expected departure time exceeds:
  * Two hours for flights of 1500 kilometres or less.
  * Three hours for intra-Community flights over 1500 kilometres and other flights between 1500-3500 kilometres.
  * Four hours for all other flights.
- Free assistance:
  * Meals and refreshments proportional to waiting time.
  * Two telephone calls, telex or fax messages, or e-mails.
- Additional assistance when departure time is at least a day later than originally announced:
  * Hotel accommodation if a stay of one or more nights becomes necessary or additiona

-------------------------
# Ollama RAG UI

https://mer.vin/2024/01/ollama-rag/




---------------------------
# Other

## Training

* Pre-training --> Fine-tuning --> Alignment
* Fine-tuning
  * An important reason to fine-tune LLMs is to align their responses with what humans expect when they give instructions through prompts. This is known as instruction tuning
  * AI Alignment is the process of steering AI systems towards human goals, preferences, and principles.


![How to build an LLM](./data/build_llm.png)



## Model Blending

* [Mergekit](https://github.com/arcee-ai/mergekit) - Python tool
* Merging methods
  * Task arithmetic
  * SLERP
  * TIES/DARE
  * Passthrough
* Only merge models with the same architecture
  * Use different fine-tuned models of a specific LLM family
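
Two of the merging methods are simple enough to sketch on plain lists of numbers. This is a vector-level illustration under stated assumptions, not Mergekit's implementation or API: real merges operate tensor by tensor on full checkpoints. The function names and toy weights are hypothetical, and `slerp` assumes non-zero vectors.

```python
import math

def task_arithmetic(base, finetuned_models, weight=1.0):
    """base + weight * sum of task vectors, where task vector = finetuned - base."""
    merged = list(base)
    for ft in finetuned_models:
        for i, (b, f) in enumerate(zip(base, ft)):
            merged[i] += weight * (f - b)
    return merged

def slerp(v0, v1, t, eps=1e-8):
    """Spherical linear interpolation between two non-zero weight vectors."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    if math.sin(omega) < eps:
        # (Anti-)parallel vectors: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

# Toy "weights" for illustration
print(task_arithmetic([1.0, 1.0], [[2.0, 1.0], [1.0, 3.0]]))
print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
```

Both functions pair up weights by position, which is why merging only makes sense for models that share an architecture.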
 
### Model Benchmarks

* AI2 Reasoning Challenge (ARC) - grade school science questions
* HellaSwag - common sense
* MMLU - Massive Multitask Language Understanding; measures how broad an LLM's knowledge is
* TruthfulQA - how truthful a model is
* WinoGrande - commonsense reasoning
* GSM8K - grade school math reasoning


## Hardware accelerators (Low cost)

* AI accelerator hardware - [Hailo](https://www.cnx-software.com/2024/04/04/hailo-10-m-2-key-m-module-brings-generative-ai-to-the-edge-with-up-to-40-tops-of-performance/)
* Coral TPU
* NVIDIA Jetson



## Run transformers in the browser

[Transformers.js](https://huggingface.co/docs/transformers.js/en/index)

Syntax 740 podcast
* Run in the browser or on a Node.js server
* ONNX model format
* Microsoft **ONNX Runtime**
* Hugging Face converts models to ONNX
* Run in the browser
  * No server compute
  * Privacy - no data sent to servers
  * Rich JS tools to interact with the browser
  * Everyone has a browser
  * WebGPU support coming soon - speed
* Run on a Node.js server --> Faster



## Embedding

Massive Text Embedding Benchmark [MTEB](https://huggingface.co/blog/mteb)
* Hugging Face MTEB [leaderboard](https://huggingface.co/spaces/mteb/leaderboard)




# References

* [A Survey of Large Language Models](https://arxiv.org/pdf/2303.18223.pdf)
* Hugging Face [course](https://www.youtube.com/watch?v=00GKzGyWFEs&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o)


## Arxiv.org
* [A Survey of Large Language Models](https://arxiv.org/pdf/2303.18223.pdf)
* [Large Language Models: A Survey](https://arxiv.org/pdf/2402.06196.pdf)
