# Chapter 7 - Local LLMs with Ollama

## Introduction to Local Large Language Models (LLMs)

A local LLM is a large language model that runs entirely on your own machine instead of relying on a cloud-hosted model such as ChatGPT.
Running LLMs locally offers several advantages, especially where privacy, data security, latency, and offline access matter. It also avoids the costs associated with calling public LLM APIs.

You might have the impression that running inference on or training an LLM locally requires an exorbitant amount of compute (GPUs, memory, storage, etc.), which translates into large upfront costs. That is partially true if you are trying to download and host a large, full-precision 70-billion-parameter model such as Llama 3 on a local server.

However, thanks to LLM quantization (Q-LLM), we can download and host quantized versions of the very same 70B-parameter models on a local machine or laptop. Quantization reduces the size of an existing model by storing its weights in a compact 8-bit (or even 4-bit) format instead of the 16- or 32-bit floating-point formats used at full precision.
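To get a feel for the savings, here is a back-of-the-envelope sketch of how much space the weights alone would take at different precisions. It assumes a 70-billion-parameter model and ignores everything else a runtime needs (activations, context cache, file metadata), so treat the numbers as rough lower bounds. The function name `weight_memory_gb` is made up for this illustration.

```python
# Rough weight-storage estimate at different numeric precisions.
# Assumption: a 70-billion-parameter model; only the weights are counted.

def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Return the approximate size of the weights in gigabytes (10^9 bytes)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9

NUM_PARAMS = 70_000_000_000  # illustrative parameter count

for bits in (32, 16, 8, 4):
    size = weight_memory_gb(NUM_PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{size:.0f} GB")
```

Going from 32-bit to 8-bit weights cuts the storage requirement to a quarter, which is the main reason quantized models fit on consumer hardware.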

Quantization makes the model considerably lighter and faster, and it consumes less memory and processing power. A quantized LLM can still make predictions and generate text, at the expense of a little accuracy. For many applications, the advantages of a Q-LLM far outweigh this slight trade-off in precision.
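The accuracy trade-off can be seen on a toy example. The sketch below applies a simple symmetric 8-bit scheme (one shared scale for all values) to a few made-up weights and then restores them. Real quantizers, including the formats used by the models Ollama ships, are more elaborate (for example, they work on blocks of weights), so this only illustrates the idea and is not how any particular library does it.

```python
# Minimal sketch of symmetric 8-bit quantization on a handful of weights.
# The weights below are made-up example values, not taken from a real model.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.82, -0.35, 0.07, -1.27, 0.5]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding step is where precision is lost.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", q)
print("restored   :", [round(r, 3) for r in restored])
print("max error  :", round(max_error, 4))
```

Each restored value differs from the original by at most half of `scale`, which is the "bit less accuracy" paid for storing one byte per weight.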

## What is Ollama and why do we need it?

Now that we know what quantized LLMs are and what they have to offer, the next piece of the puzzle is how to actually run these Q-LLM models locally. This is what Ollama is for: it is a framework for running and managing large language models (LLMs) locally on personal computers or servers.

Ollama supports GPU acceleration (for example, via CUDA on NVIDIA GPUs) and optimized memory management to run resource-intensive LLMs efficiently. Users with compatible GPUs or other high-performance hardware can get faster, near-real-time interactions with language models without needing cloud infrastructure. Ollama supports some of the most popular pre-quantized, optimized models, such as Meta's Llama, Mistral, Gemma, and many others, all ready for deployment.

LangChain integrates seamlessly with Ollama. Users can use Ollama's models directly within LangChain workflows, enabling private, on-device processing and quick response times without depending on external APIs.

## Installation

Ollama does not depend on LangChain. It exposes its own API and CLI, which can be used on their own to download and run models locally. To install Ollama on a Linux machine, we use the **install.sh** script provided by Ollama:

In [None]:
#! curl -fsSL https://ollama.com/install.sh | sh

"""
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode.

"""
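The installer log above reports that the Ollama API is listening on 127.0.0.1:11434. A quick way to confirm this from Python is to query the server's `/api/version` endpoint. The sketch below uses only the standard library. The helper name `ollama_version` is made up for this example, and the address is assumed to be the default one from the log.

```python
import json
import urllib.error
import urllib.request

# Assumption: the default address reported by the installer log.
OLLAMA_URL = "http://127.0.0.1:11434"

def ollama_version(base_url: str = OLLAMA_URL, timeout: float = 2.0):
    """Return the server's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or an unexpected (non-JSON) reply.
        return None

version = ollama_version()
if version is None:
    print("Ollama server not reachable; is the service running?")
else:
    print(f"Ollama server is up, version {version}")
```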


Ollama supports a wide choice of LLMs that are available for download. You can pick a model from Ollama's model library here: [Ollama Models](https://ollama.com/library)

We are going to download and use Meta's **Llama 3.2** 3B-parameter model. At the time of writing this notebook, this model has been gaining popularity for being small (about 2 GB) and surprisingly capable across multiple tasks such as following instructions, summarization, prompt rewriting, and tool use.

With Ollama installed, let's go ahead and download llama3.2 using the `ollama pull` command:

In [18]:
! ollama pull llama3.2

pulling manifest 
pulling dde5aa3fc5ff... 100% ▕████████████████▏ 2.0 GB                         
pulling 966de95ca8a6... 100% ▕████████████████▏ 1.4 KB                         
pulling fcc5a6bec9da... 100% ▕████████████████▏ 7.7 KB                         
pulling a70ff7e570d9... 100% ▕████████████████▏ 6.0 KB                         
pulling 56bb8bd477a5... 100% ▕████████████████▏   96 B                         
pulling 34bb5ab01051... 100% ▕████████████████▏  561 B                         
verifying sha256 digest 
writing manifest 
success


At this point, you can open a terminal session and start asking questions to the downloaded llama3.2 model.

``` 
# ollama run llama3.2
>>> what is an IP address?
An IP address (Internet Protocol address) is a unique numerical label assigned to each device connected to a computer network that uses the Internet Protocol. It's used to identify and locate a device on a network, allowing devices to communicate with each other.

An IP address typically consists of four numbers separated by dots (.), like this:

`192.168.10.100`

Each number is called an octet, and they can range from 0 to 255.
```
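The same question can also be sent programmatically through Ollama's REST API, without the interactive prompt. The sketch below posts to the `/api/generate` endpoint with streaming turned off and reads the `response` field of the reply. It assumes the server is running at the default address and that `llama3.2` has already been pulled. The helper names `build_generate_request` and `ask` are made up for this example.

```python
import json
import urllib.error
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://127.0.0.1:11434"

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for the /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(model: str, prompt: str, timeout: float = 120.0):
    """Send the prompt and return the generated text, or None on failure."""
    request = build_generate_request(model, prompt)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return json.load(resp).get("response")
    except (urllib.error.URLError, OSError, ValueError):
        # Server not running, model missing, timeout, or malformed reply.
        return None

answer = ask("llama3.2", "What is an IP address?")
print(answer if answer is not None else "No response; is the Ollama server running?")
```

With `"stream": False` the server returns one JSON object once generation finishes, which keeps the client code short at the cost of waiting for the full answer.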

However, that's not the goal of this notebook. We want to harness the power of local LLMs through the programmable framework provided by LangChain.

We have already covered LangChain in the previous notebooks in this series. It's time to take the LangChain constructs we have learned so far and put them to use for inference against a local LLM served by Ollama.

LangChain provides a partner package for Ollama called **langchain-ollama**, which greatly simplifies the setup and configuration that you would otherwise have to do manually.

In the code block below, we are going to see how to use `langchain-ollama` to perform inference on a model that is already downloaded on your local system.



In [3]:
#! pip install langchain-ollama
! ollama list

NAME               ID              SIZE      MODIFIED   
llama3.2:latest    a80c4f17acd5    2.0 GB    2 days ago    
