# Project: Running Llama 3.1 Locally

The code below was developed in Google Colab. It experiments with running Llama 3.1 locally using Ollama.

Results: The 8B parameter version ran on a T4 instance, used about 6 GB of the 15 GB of available GPU memory, and had a reasonable response time. The 70B parameter version ran on an A100 but used nearly all of the 40 GB of GPU memory available (39,261 of 40,960 MiB), which suggests that the available memory was a constraint. The simple requests below also took longer with the 70B model on the A100 than with the 8B model on the T4. This is consistent with the 70B model needing more than 40 GB of GPU memory to run optimally, although a larger model is expected to be slower in any case, so the timings alone are weak evidence.

## Install and Run Ollama Server

In [1]:
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/MyDrive/projects/LLM/AgenticRAG/rag_agents

Mounted at /content/drive
/content/drive/MyDrive/projects/LLM/AgenticRAG/rag_agents


In [2]:
%%capture --no-stderr
%pip install -qU langchain langchain_community langchain-openai langchainhub langchain-ollama

Below are the commands to install and run Ollama on Ubuntu.

Run the commands below if running on AWS Ubuntu.

In [None]:
!sudo apt update && sudo apt upgrade --assume-yes
!sudo apt install curl --assume-yes
!curl --version

Install and run Ollama server.

In [2]:
!curl -fsSL https://ollama.com/install.sh | sh
!ollama serve > server.out 2>&1 &

>>> Downloading ollama...
############################################################################################# 100.0%
>>> Installing ollama to /usr/local/bin...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
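
Because `ollama serve` is started in the background, later cells can run before the server is listening. The sketch below polls the address reported in the install log until it answers. The helper names `server_is_up` and `wait_for_server` are hypothetical, and the assumption that a plain GET on the root URL returns HTTP 200 should be checked against the installed Ollama version.

```python
import time
import urllib.error
import urllib.request


def server_is_up(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, etc.
        return False


def wait_for_server(url: str = "http://127.0.0.1:11434",
                    max_wait_s: float = 60.0,
                    interval_s: float = 1.0,
                    probe=server_is_up) -> bool:
    """Poll `probe(url)` until it returns True or `max_wait_s` elapses.

    `probe` is injectable so the loop can be exercised without a server.
    """
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_server()` in the cell after `ollama serve` returns `True` once the server responds, or `False` if it does not come up within the wait limit.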


## Test Calling Local Llama 3.1

The code below creates Llama 3.1 clients. It takes a model parameter so that clients for different model versions can be created.

A LangChain pipeline is used to submit requests.

In [3]:
from langchain_ollama import ChatOllama
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

def get_llama_client(version:str):
  """
  Create a Llama 3.1 client.
  Parameters:
    version (str): The Llama 3.1 model version to use, e.g. llama3.1:8b, llama3.1:70b.
  Returns:
    chain (object): A Llama 3.1 client.
  """

  prompt = PromptTemplate(
      template="""You are a concise Meta Llama 3.1 model.
      Use one sentence to answer the question concisely.
      Question: {question}
      Answer:
      """,
      input_variables=["question"],
  )
  llm = ChatOllama(
      model=version,
      temperature=0,
  )
  chain = prompt | llm | StrOutputParser()

  return chain
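
For comparison, the same kind of request can be sent straight to the Ollama REST API without LangChain. This is a minimal sketch, not what `ChatOllama` does internally: it assumes the `/api/generate` endpoint with the `model`, `prompt`, `stream`, and `options.temperature` fields and a `response` field in the reply. The names `PROMPT`, `build_request`, and `ask` are hypothetical.

```python
import json
import urllib.request

# Simplified version of the prompt used in get_llama_client (assumption).
PROMPT = (
    "Use one sentence to answer the question concisely.\n"
    "Question: {question}\n"
    "Answer:"
)


def build_request(model: str, question: str,
                  base_url: str = "http://127.0.0.1:11434") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    body = {
        "model": model,
        "prompt": PROMPT.format(question=question),
        "stream": False,                 # ask for one JSON object, not a stream
        "options": {"temperature": 0},   # mirror temperature=0 from the client above
    }
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, question: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, question), timeout=120) as resp:
        return json.load(resp)["response"]
```

With a model loaded, `ask('llama3.1:8b', 'What is the knowledge cutoff for Meta Llama 3.1?')` would return the answer text as a string.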

## llama3.1:8b

Below, the 8B version was run on a Colab T4 instance using approximately 6 GB of GPU memory.

The command below starts an 8B instance. Even though this command returns instantly, the instance takes about a minute to start. Monitor the "model.out" log file to determine when the model is available.

In [None]:
!ollama run llama3.1:8b > model.out 2>&1 &
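
As an alternative to watching the log file, the server can be asked which models are currently loaded. The sketch below assumes Ollama's `/api/ps` endpoint and a reply of the form `{"models": [{"name": ...}, ...]}`, which may differ between Ollama versions. `loaded_models` and `fetch_ps` are hypothetical helper names.

```python
import json
import urllib.request


def loaded_models(ps_payload: dict) -> list[str]:
    """Extract model names from an /api/ps style payload."""
    return [m.get("name", "") for m in ps_payload.get("models", [])]


def fetch_ps(base_url: str = "http://127.0.0.1:11434") -> dict:
    """Fetch the list of running models (needs a running Ollama server)."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp)
```

Once the model has finished loading, `'llama3.1:8b' in loaded_models(fetch_ps())` should evaluate to `True`.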

Let's see how much memory is used.

In [None]:
!nvidia-smi

Sat Aug  3 16:10:13 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P0              30W /  70W |   6097MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

About 6GB of GPU memory is used.
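
Rather than reading the figure off the table, the memory columns can be queried in machine-readable form. This sketch uses the `nvidia-smi` `--query-gpu` and `--format=csv,noheader,nounits` options and assumes values are reported in MiB. `parse_gpu_memory` and `query_gpu_memory` are hypothetical helper names.

```python
import subprocess


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' CSV lines (one per GPU) into (used_mib, total_mib) pairs."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory() -> list[tuple[int, int]]:
    """Run nvidia-smi and return memory usage per GPU (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)
```

For the T4 run above, the parsed pair would be `(6097, 15360)`, which is about 40% of the card's memory.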

Let's submit some requests.

In [None]:
chain = get_llama_client('llama3.1:8b')

In [None]:
%%time
chain.invoke({'question':
              'What is the knowledge cutoff for Meta Llama 3.1?'})

CPU times: user 87.3 ms, sys: 2.06 ms, total: 89.4 ms
Wall time: 1.16 s


'My knowledge cutoff is December 2023, but I have been trained on a broader range of topics and can provide more up-to-date information in some areas.'

In [None]:
%%time
chain.invoke({'question':
              'How much data was used to train Meta Llama 3.1?'})

CPU times: user 110 ms, sys: 3.95 ms, total: 114 ms
Wall time: 1.67 s


"The training dataset for Meta Llama 3.1 is not publicly disclosed, but it's reported to be a massive corpus of text data sourced from various places, including but not limited to the internet, books, and user-generated content."

A Google search shows that the "December 2023" response is correct. However, also per Google, ~15 trillion tokens were used to train Llama 3.1. It's possible this information wasn't available at the time of training.

Below, the model is removed from the Ollama server.

In [None]:
!ollama rm llama3.1:8b

deleted 'llama3.1:8b'


## llama3:8b-instruct-q2_K

Let's experiment with the smallest quantized version. Note that this tag is a Llama 3 model, not Llama 3.1.

In [5]:
!ollama run llama3:8b-instruct-q2_K > model.out 2>&1 &

In [6]:
!nvidia-smi

Mon Aug  5 17:32:03 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P0              28W /  70W |   4801MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

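As expected, less GPU memory is used than with the llama3.1:8b tag (about 4.8 GB versus 6.1 GB). The llama3.1:8b tag is itself probably a quantized build rather than full floating point, since 8 billion parameters at 16 bits each would need roughly 16 GB for the weights alone.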

Requests are submitted below.

In [7]:
chain = get_llama_client('llama3:8b-instruct-q2_K')

In [10]:
%%time
chain.invoke({'question':
              'What is the knowledge cutoff for Meta Llama 3.1?'})

CPU times: user 86.8 ms, sys: 0 ns, total: 86.8 ms
Wall time: 1.41 s


'The knowledge cutoff for Meta Llama 3.1 is approximately the end of 2022, with a focus on its training data being up to October 2021.'

In [11]:
%%time
chain.invoke({'question':
              'How much data was used to train Meta Llama 3.1?'})

CPU times: user 75.2 ms, sys: 0 ns, total: 75.2 ms
Wall time: 1.1 s


'The Meta Llama 3.1 model was trained on approximately 128 million parameters and 13 billion tokens of text data.'

While the GPU memory usage is smaller than with the llama3.1:8b version, the response time is about the same for these simple requests. Furthermore, both responses are incorrect.

In [12]:
!ollama rm llama3:8b-instruct-q2_K

deleted 'llama3:8b-instruct-q2_K'


## llama3.1:70b

Only an A100 instance with 40 GB of GPU memory was able to start the model and process requests with a reasonable response time. The model used nearly all 40 GB of GPU memory.

Below, the model is started. Monitor the log file to determine its availability. Complete startup takes about 8 minutes.

In [5]:
!ollama run llama3.1:70b > model.out 2>&1 &

In [6]:
!nvidia-smi

Mon Aug  5 17:53:37 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0              50W / 400W |  39261MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

Nearly all GPU memory is in use (39,261 of 40,960 MiB), which likely means more is required for this version to run effectively.
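
A back-of-the-envelope estimate helps explain why the 70B model fills the card. The sketch below multiplies parameter count by bits per weight. The nominal parameter counts (8e9, 70e9) and bit widths are assumptions, and the estimate ignores the KV cache, activations, and runtime overhead. Real quantization formats such as q2_K also store extra scale data, so their effective bits per weight are higher than the nominal figure.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal sizes and bit widths (assumptions, not measured values).
for name, n_params in [("8B", 8e9), ("70B", 70e9)]:
    for bits in (16, 4, 2):
        gb = weight_memory_gb(n_params, bits)
        print(f"{name} at {bits}-bit: ~{gb:.1f} GB of weights")
```

At 4 bits per weight, 70 billion parameters already need about 35 GB for the weights, which leaves little of a 40 GB card for anything else. At 16 bits the weights alone would need about 140 GB.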

In [7]:
chain = get_llama_client('llama3.1:70b')

In [8]:
%%time
chain.invoke({'question': 'What is the knowledge cutoff for Meta Llama 3.1?'})

CPU times: user 93.3 ms, sys: 6.27 ms, total: 99.6 ms
Wall time: 2.59 s


'The knowledge cutoff for Meta Llama 3.1 is December 2022, meaning it was trained on data available up to that point in time.'

In [9]:
%%time
chain.invoke({'question':
              'How much data was used to train Meta Llama 3.1?'})

CPU times: user 87.7 ms, sys: 6.64 ms, total: 94.4 ms
Wall time: 2.25 s


'Meta Llama 3.1 was trained on approximately 2 trillion parameters and 1.5 billion tokens of text data.'

The answers differ from those of the 8B version and, as far as I can tell, are incorrect. The response time is also almost double that of the 8B version.

In [10]:
!ollama rm llama3.1:70b

deleted 'llama3.1:70b'


## llama3:70b-instruct-q2_K

In [4]:
!ollama run llama3:70b-instruct-q2_K > model.out 2>&1 &

In [5]:
!nvidia-smi

Mon Aug  5 18:05:35 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0              50W / 400W |  29289MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [6]:
chain = get_llama_client('llama3:70b-instruct-q2_K')

In [7]:
%%time
chain.invoke({'question': 'What is the knowledge cutoff for Meta Llama 3.1?'})

CPU times: user 82.1 ms, sys: 2.42 ms, total: 84.6 ms
Wall time: 1.32 s


'The knowledge cutoff for Meta Llama 3.1 is December 2022.'

In [9]:
%%time
chain.invoke({'question':
              'How much data was used to train Meta Llama 3.1?'})

CPU times: user 81.9 ms, sys: 1.31 ms, total: 83.2 ms
Wall time: 2.45 s


'Meta Llama 3.1 was trained on a massive dataset of approximately 2.5 billion parameters and 20TB of text data, sourced from various websites, books, and other digital platforms.'

In [None]:
!ollama rm llama3:70b-instruct-q2_K

Less memory was used than with the llama3.1:70b version (about 29 GB versus 39 GB). The responses and execution times are similar.
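
To compare the four runs side by side, the figures recorded in the cell outputs above can be collected into one summary. The numbers are copied from the `nvidia-smi` and `%%time` outputs in this notebook. The `runs` dictionary and `summarize` helper are hypothetical names, and two requests per model is far too small a sample for firm conclusions.

```python
from statistics import mean

# GPU memory (MiB) and wall times (s) copied from the outputs above.
runs = {
    "llama3.1:8b (T4)":                {"used_mib": 6097,  "total_mib": 15360, "wall_s": [1.16, 1.67]},
    "llama3:8b-instruct-q2_K (T4)":    {"used_mib": 4801,  "total_mib": 15360, "wall_s": [1.41, 1.10]},
    "llama3.1:70b (A100)":             {"used_mib": 39261, "total_mib": 40960, "wall_s": [2.59, 2.25]},
    "llama3:70b-instruct-q2_K (A100)": {"used_mib": 29289, "total_mib": 40960, "wall_s": [1.32, 2.45]},
}


def summarize(run: dict) -> tuple[float, float]:
    """Return (fraction of GPU memory used, mean wall time in seconds)."""
    return run["used_mib"] / run["total_mib"], mean(run["wall_s"])


for name, run in runs.items():
    frac, avg = summarize(run)
    print(f"{name:34s} memory {frac:6.1%}  mean wall time {avg:.2f} s")
```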