# Project: Running Llama 3.1 Locally

The code below was developed on a Google Colab T4 instance.

## Install and Run Llama 3.1

In [1]:
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/MyDrive/projects/LLM/AgenticRAG/rag_agents

Mounted at /content/drive
/content/drive/MyDrive/projects/LLM/AgenticRAG/rag_agents


In [2]:
%%capture --no-stderr
%pip install -qU langchain langchain_community langchain-openai langchainhub langchain-ollama

The commands below install and run Ollama on Ubuntu.

Run the following cell only when working on an AWS Ubuntu instance.

In [None]:
!sudo apt update && sudo apt upgrade --assume-yes
!sudo apt install curl --assume-yes
!curl --version

Install Ollama and start the Ollama server in the background. The Llama 3.1 models are started in the cells that follow.

In [4]:
!curl -fsSL https://ollama.com/install.sh | sh
!ollama serve > server.out 2>&1 &

>>> Downloading ollama...
############################################################################################# 100.0%
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.

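Because `ollama serve` is launched in the background, the next cell can run before the server is ready. A minimal sketch of a readiness check follows, assuming the server answers a plain GET on the address printed by the installer (`127.0.0.1:11434`) with HTTP 200. The helper name `wait_for_ollama` and its parameters are hypothetical and not part of Ollama.

```python
import time
import urllib.error
import urllib.request


def wait_for_ollama(base_url="http://127.0.0.1:11434", timeout_s=30.0, interval_s=1.0):
    """Poll base_url until it answers with HTTP 200 or timeout_s elapses.

    Hypothetical helper: assumes the Ollama server replies to a plain GET
    on its root URL once it is up.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(base_url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not reachable yet; retry until the deadline.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_ollama()` before the `ollama run` cells would return `True` once the server responds and `False` if it never comes up within the timeout.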

In [None]:
!ollama run llama3.1:8b > model.out 2>&1 &

In [5]:
!ollama run llama3.1:70b > model.out 2>&1 &

In [5]:
!free -mh

               total        used        free      shared  buff/cache   available
Mem:            52Gi       1.6Gi        34Gi       2.0Mi        16Gi        50Gi
Swap:             0B          0B          0B


## Test Calling Local Llama 3.1

The function below creates the client. GPU memory requirements per model:

- llama3.1:8b -> Requires 6 GB of GPU memory.
- llama3.1:70b -> Requires 25 GB of GPU memory.
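The memory figures listed above can be turned into a small selection rule. The sketch below picks the largest model tag that fits a given amount of GPU memory. The function name `pick_model` and the table `MODEL_GPU_GB` are hypothetical, and the numbers are copied from the list above, so treat them as rough lower bounds (the 70B run later in this notebook reports using all 40 GB of an A100).

```python
# Approximate GPU memory needs in GB, copied from the list above.
MODEL_GPU_GB = {"llama3.1:8b": 6, "llama3.1:70b": 25}


def pick_model(available_gb, requirements=MODEL_GPU_GB):
    """Return the most demanding model tag that fits, or None if none fits."""
    fitting = [(need, tag) for tag, need in requirements.items() if need <= available_gb]
    if not fitting:
        return None
    # max() compares on the memory figure first, so the largest model wins.
    return max(fitting)[1]


print(pick_model(15))  # llama3.1:8b
print(pick_model(40))  # llama3.1:70b
print(pick_model(4))   # None
```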

In [7]:
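from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_ollama import ChatOllama

def get_llama_client(version: str):
  """Build a prompt -> model -> string-parser chain for an Ollama model tag."""

  # Single-question prompt that asks for a one-sentence answer.
  prompt = PromptTemplate(
      template="""You are a concise Meta Llama 3.1 model.
      Use one sentence to answer the question concisely.
      Question: {question}
      Answer:
      """,
      input_variables=["question"],
  )
  # temperature=0 keeps the answers as repeatable as possible.
  llm = ChatOllama(
      model=version,
      temperature=0,
  )
  # StrOutputParser turns the chat message into a plain string.
  chain = prompt | llm | StrOutputParser()

  return chain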

Let's test it with a question it shouldn't have information about.

## llama3.1:8b

Uses 6 GB of GPU memory.

In [39]:
%%time
chain = get_llama_client('llama3.1:8b')
chain.invoke({'question':
              'What is your knowledge cutoff?'})

CPU times: user 108 ms, sys: 507 µs, total: 109 ms
Wall time: 1.07 s


'My knowledge cutoff is December 2023, but I can provide information and insights based on my training data up until that point.'

In [40]:
%%time
chain.invoke({'question':
              'How much data was used to train you?'})

CPU times: user 104 ms, sys: 3.47 ms, total: 107 ms
Wall time: 1.25 s


'I was trained on a massive dataset of over 52 terabytes, which is roughly equivalent to the contents of 10 million books or 1.5 billion web pages.'
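The `%%time` magic used in the cells above only works inside IPython or Jupyter. As a rough sketch of the same wall-clock measurement in a plain Python script, a hypothetical helper `timed_call` could wrap any callable, including `chain.invoke`. The lambda below is only a stand-in so the example runs without a model.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in callable; with a real chain this would be
# timed_call(chain.invoke, {'question': 'What is your knowledge cutoff?'})
answer, seconds = timed_call(lambda q: q.upper(), "hi")
print(answer, f"{seconds:.6f} s")
```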

In [41]:
!ollama rm llama3.1:8b

Error: open /root/.ollama/models/manifests/registry.ollama.ai/library/llama3.1/8b: no such file or directory
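The error above indicates that no manifest for `llama3.1:8b` was found in the local model store. One way to avoid calling `ollama rm` on a tag that is not present is to list the installed models first. The sketch below assumes the server's `/api/tags` endpoint returns JSON of the form `{"models": [{"name": ...}, ...]}`. The helper names and the sample payload are hypothetical.

```python
import json
import urllib.request


def model_names(tags_payload):
    """Extract model names from an /api/tags style response dict."""
    return [m["name"] for m in tags_payload.get("models", [])]


def fetch_installed_models(base_url="http://127.0.0.1:11434"):
    """Ask a running Ollama server which models are installed locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))


# Hypothetical payload, used here so the example runs without a server.
sample = {"models": [{"name": "llama3.1:70b"}]}
print("llama3.1:8b" in model_names(sample))  # False
```

With a server running, `"llama3.1:8b" in fetch_installed_models()` could guard the `ollama rm` call.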


## llama3.1:70b

This model used all 40 GB of GPU memory on an A100.

In [45]:
!ollama run llama3.1:70b > model.out 2>&1 &

In [14]:
%%time
chain = get_llama_client('llama3.1:70b')
chain.invoke({'question': 'What is your knowledge cutoff?'})

CPU times: user 99.4 ms, sys: 6.54 ms, total: 106 ms
Wall time: 3.32 s


'My knowledge cutoff is currently December 2022, which means I have been trained on data up to that point and may not be aware of events or developments that have occurred after that date.'

In [15]:
%%time
chain.invoke({'question':
              'How much data was used to train you?'})

CPU times: user 87.9 ms, sys: 5.62 ms, total: 93.5 ms
Wall time: 2.94 s


'I was trained on approximately 2 trillion parameters and 45 terabytes of text data, sourced from a diverse range of sources including books, articles, research papers, and websites.'