---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<br><br>
<h1 align="center">Lec-08: Accessing Open-Source AI Models using Local Run-time Frameworks</h1>

# Learning agenda of this notebook

1. Local LLM Runtime Frameworks
2. Accessing AI Models via Ollama GUI and CLI
3. Accessing AI Models via LLaMA.cpp GUI and CLI
4. Programmatically Accessing AI Models running on Local Machine by Ollama
    - Using Ollama’s Native Python Library
    - Using OpenAI's `Chat Completion` API
    - Using OpenAI's `Responses` API
5. Hello World Examples

# <span style='background :lightgreen' >Recap: Ways to Access Open Source LLMs</span>
### (i) Access Open-Source Models via Cloud-Based Providers (Driving a fully automatic car — everything managed for you)

### (ii) Run Open-Source Models locally using runtimes (Driving an automatic car — local but simple, no gears or engineering)

### (iii) Use Open-Source Models via Hugging Face `pipeline()` API (Driving a manual car — you see more of the mechanics, but still a car someone else built)

### (iv) Load and run models directly from Hugging Face Hub using `AutoModel/AutoTokenizer` (Opening the hood and adjusting or replacing engine components)


### (v) Fine-Tune LLMs using full fine-tuning or PEFT methods (LoRA / QLoRA / adapters) (Upgrading and re-calibrating the engine to suit your driving style)

### (vi) Build and train an AI Model from scratch using PyTorch / TensorFlow (Designing and building the entire car from raw parts — full control, full responsibility)


# <span style='background :lightgreen' >1. Local LLM Runtime Frameworks</span>


## Why is Local Inference Significant?
1. **Privacy & Security:** Data stays within your own system, so sensitive information is not sent to third-party cloud servers. This is especially important in healthcare, finance, and government sectors. Keeping data local lowers exposure to external threats and minimizes the risk of data breaches.
2. **Cost Efficiency:** Reduces cloud-related expenses such as API usage fees, bandwidth, and compute costs.
3. **Low Latency:** Running models locally removes the network round trip, resulting in faster responses. This is critical for real-time applications like chatbots and interactive systems.
4. **Offline Operation:** Models continue to work without an internet connection. Useful in environments with limited or unreliable connectivity (e.g., remote areas, edge devices).
5. **Customization & Control:** You have full control over the model, data, updates, and fine-tuning without platform restrictions.
6. **Regulatory Compliance:** Helps organizations follow data privacy laws by keeping sensitive data inside their own systems. This supports compliance with regulations such as GDPR (European Union data protection law) and CCPA (California’s data privacy law), which require careful handling and protection of personal data.

## Common Local LLM Runtime Frameworks
- **LM Studio (https://lmstudio.ai/):**
    - A GUI-based desktop application (Windows/macOS/Linux) designed primarily for end users.
    - Enables users to download, manage, and chat with local LLMs without any coding.
    - Focused on simplicity and accessibility rather than developer-oriented integration.
    - Best suited for non-technical users or quick local experimentation.
- **GPT4All (https://gpt4all.io/):**
    - Provides a beginner-friendly desktop chat application (Windows/macOS/Linux) for offline local inference.
    - Also offers developer SDKs (Python, Node.js, C++) for programmatic integration.
    - Supports running open-source models locally with optional API-based interaction.
    - Suitable for both casual users and developers who need lightweight integration capabilities.
- **Ollama (https://ollama.com/):**
    - User-friendly runtime built on top of llama.cpp
    - Provides automatic model download, versioning, and management
    - Supports local inference with a CLI, browser-based interface, and OpenAI-compatible REST API for easy integration
    - Very simple to set up and use, making it ideal for quick prototyping
    - Slightly slower than raw llama.cpp due to wrapper overhead
    - Not designed for heavy production workloads
    - Best suited for:
        - Developers who want quick local deployment with minimal configuration.
        - Prototyping applications that require an OpenAI-style API locally.
- **llama.cpp (https://github.com/ggml-org/llama.cpp):**
    - Lightweight C/C++ implementation for efficient inference of LLaMA and other GGUF-based models
    - Supports quantized models for CPU inference, with optional GPU acceleration (e.g., CUDA, Metal, Vulkan)
    - Highly optimized for resource-constrained environments, including edge devices and Apple Silicon
    - Provides fine-grained control over model execution and performance tuning
    - Includes a built-in lightweight HTTP server and can be paired with browser-based GUIs for interactive usage
    - Does not include advanced serving features such as multi-user scheduling, dynamic batching, or production-grade APIs.
    - Best suited for:
        - CPU-based inference and edge deployments
        - Researchers or developers needing low-level control and aggressive quantization
        - Embedded systems and resource-constrained environments.
- **vLLM (https://github.com/vllm-project/vllm):**
    - [Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/pdf/2309.06180)
    - High-performance LLM inference and serving engine, originally developed at UC Berkeley
    - Designed primarily for GPU-based deployment and large-scale serving
    - Implements advanced optimizations like PagedAttention and continuous batching for high throughput and efficient memory usage
    - Supports OpenAI-compatible APIs, making it suitable for production integration
    - Optimized for multi-user, high-concurrency workloads in server environments.
    - Best suited for:
        - Production-grade deployments.
        - High-throughput APIs and enterprise-scale applications.
        - Multi-GPU inference infrastructure.

# <span style='background :lightgreen' >2. Accessing AI Models via Ollama GUI and CLI</span>

<h3 align="center"><div class="alert alert-success" style="margin: 20px">Ollama is an open-source platform that makes it easier to work with open-source AI models. It handles model downloads, runs models on your local machine, and lets you interact with them through a CLI or APIs.</div></h3>


## a. Architecture of Ollama:
- **Server Component:**
    - When you install and run Ollama, it launches a local server that listens on http://localhost:11434
    - This server is responsible for managing models (loading, unloading, switching) and handling inference requests.
    - The API it exposes follows a REST architecture, making it accessible through HTTP requests.
    - Whichever LLM you choose to run with Ollama, your request is ultimately passed to the Ollama server, which runs on your localhost at port 11434.
- **Client Interaction:**
    - Applications can interact with the Ollama server in three ways:
        - Using the official Ollama CLI, which communicates directly with the local server.
        - Using an Ollama-native client such as `ollama.chat()` in Python or JavaScript.
        - Using the OpenAI client library with `client.chat.completions.create()` or `client.responses.create()`, since Ollama exposes an endpoint (http://localhost:11434/v1) that mimics the OpenAI API format.
    - Clients can issue requests for text generation, chat-style completions, streaming responses, and model management.
    - The Ollama server then handles the request, executes inference on your local hardware, and returns the response to the client.
- **Networked Environment:**
    - Although Ollama defaults to localhost, the server can be run on a different machine in the same network.
    - In that setup, clients simply target the remote machine’s IP and port 11434.
    - This allows you to centralize inference on a powerful machine (GPU server) while keeping clients lightweight.
    - In short, Ollama can run on a private-IP machine inside a LAN and be accessed locally via CLI/GUI or remotely via HTTP using Python clients (including the OpenAI-compatible API).
    - It can also be deployed on a public-IP machine for internet access. Doing so safely requires a reverse proxy, authentication, and TLS, since Ollama itself provides no built-in security.
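
The client/server split described above can be sketched as a small helper that derives the endpoint URLs from a host and port. This is only an illustration: the function name `ollama_endpoints` and the example LAN address are hypothetical, while the paths `/api/chat`, `/api/generate`, and `/v1` are the ones used elsewhere in this notebook.

```python
def ollama_endpoints(host: str = "localhost", port: int = 11434) -> dict:
    """Return the URLs a client would target for a given Ollama server."""
    base = f"http://{host}:{port}"
    return {
        "native_chat": f"{base}/api/chat",          # Ollama's native chat endpoint
        "native_generate": f"{base}/api/generate",  # Ollama's native single-prompt endpoint
        "openai_compatible": f"{base}/v1",          # base URL for the OpenAI client library
    }


# Default local server versus a hypothetical GPU machine on the same LAN
local = ollama_endpoints()
remote = ollama_endpoints(host="192.168.1.50")  # example IP, assumed for illustration
print(local["native_chat"])
print(remote["openai_compatible"])
```

Only the host part changes when inference is centralized on another machine; the client code that builds and sends requests can stay the same.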



## b. Supported Models:

| Model Type                   | Capabilities                             | Examples                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| ---------------------------- | ---------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Text-only LLMs**           | Chat, reasoning, summarization, code     | **[llama4](https://ollama.com/library/llama4)** — next-gen chat/coding models (Behemoth, Maverick, Scout)<br>**[llama3.3:70b](https://ollama.com/library/llama3.3)** — reasoning-focused text model<br>**[llama3.1](https://ollama.com/library/llama3.1)** — general-purpose LLM family (8b/70b/405b)<br>**[llama3.2](https://ollama.com/library/llama3.2)** — ultra-efficient mini LLMs (1b/3b)<br>**[gemma2](https://ollama.com/library/gemma2)** — Google-optimized compact chat models (2b/9b/27b)<br>**[qwen2.5](https://ollama.com/library/qwen2.5)** — multilingual, high-performance LLMs (0.5b-72b)<br>**[mistral:7b](https://ollama.com/library/mistral)** — efficient French-backed reasoning model<br>**[phi4:14b](https://ollama.com/library/phi4)** — Microsoft reasoning-optimized compact model |
| **Reasoning Models**         | Advanced step-by-step reasoning, math    | **[deepseek-r1](https://ollama.com/library/deepseek-r1)** — chain-of-thought reasoning (1.5b-671b)<br>**[openthinker](https://ollama.com/library/openthinker)** — open long-reasoning (7b/32b)<br>**[magistral:24b](https://ollama.com/library/magistral)** — synthetic reasoning from Mistral<br>**[phi4-reasoning:14b](https://ollama.com/library/phi4-reasoning)** — deliberate reasoning variant<br>**[qwq:32b](https://ollama.com/library/qwq)** — Qwen 32b benchmark-level reasoner                                                                                                                                                                                                                                                                                          |
| **Code-Specific**            | Programming, code completion, debugging  | **[qwen2.5-coder](https://ollama.com/library/qwen2.5-coder)** — multilingual code LLM (0.5b-32b)<br>**[deepseek-coder-v2](https://ollama.com/library/deepseek-coder-v2)** — strongest open-source coding model (16b/236b)<br>**[codestral:22b](https://ollama.com/library/codestral)** — Mistral coding assistant<br>**[starcoder2](https://ollama.com/library/starcoder2)** — permissive code generator (3b/7b/15b)<br>**[granite-code](https://ollama.com/library/granite-code)** — IBM enterprise code LLM (3b-34b)                                                                                                                                                                                                                                                             |
| **Multimodal (Vision)**      | Image understanding + text generation    | **[llama4-vision](https://ollama.com/library/llama4-vision)** — image + text multimodal<br>**[llama3.2-vision](https://ollama.com/library/llama3.2-vision)** — mini multimodal family (11b/90b)<br>**[qwen2.5vl](https://ollama.com/library/qwen2.5vl)** — visual-language model (3b/7b/32b/72b)<br>**[llava](https://ollama.com/library/llava)** — LLaMA-based visual chatbot (7b/13b/34b)<br>**[minicpm-v:8b](https://ollama.com/library/minicpm-v)** — small, fast VLM                                                                                                                                                                                                                                                                                                          |
| **Small/Efficient**          | Lightweight for edge devices             | **[smollm2](https://ollama.com/library/smollm2)** — ultra-tiny edge LLMs (135m/360m/1.7b)<br>**[tinyllama:1.1b](https://ollama.com/library/tinyllama)** — extremely small LLM baseline<br>**[phi3:3.8b](https://ollama.com/library/phi3)** — Microsoft tiny reasoning model<br>**[nemotron-mini:4b](https://ollama.com/library/nemotron-mini)** — NVIDIA compact assistant model<br>**[gemma2:2b](https://ollama.com/library/gemma2)** — tiny Google chat model                                                                                                                                                                                                                                                                                                                    |
| **Tool-calling/Agents**      | External API interaction, function calls | **[llama3.1-tools](https://ollama.com/library/llama3.1-tools)** — tool-enabled LLaMA<br>**[qwen2.5-tools](https://ollama.com/library/qwen2.5-tools)** — function-calling Qwen family<br>**[mistral-nemo:12b](https://ollama.com/library/mistral-nemo)** — Mistral API-aware model<br>**[command-r+](https://ollama.com/library/command-r%2B)** — RAG/agent orchestration model (35b/104b)<br>**[hermes3](https://ollama.com/library/hermes3)** — tool-driven fine-tune<br>**[granite3-tools](https://ollama.com/library/granite3-tools)** — IBM agent foundation model                                                                                                                                                                                                             |
| **Embedding Models**         | Semantic vector generation (RAG)         | **[mxbai-embed-large](https://ollama.com/library/mxbai-embed-large)** — high-accuracy embedding (335m)<br>**[nomic-embed-text](https://ollama.com/library/nomic-embed-text)** — optimized text embeddings<br>**[bge-m3](https://ollama.com/library/bge-m3)** — multilingual embedding (567m)<br>**[snowflake-arctic-embed2](https://ollama.com/library/snowflake-arctic-embed2)** — enterprise embedding (568m)<br>**[all-minilm](https://ollama.com/library/all-minilm)** — lightweight MiniLM (22m/33m)                                                                                                                                                                                                                                                                          |
| **MoE (Mixture of Experts)** | High performance, efficient inference    | **[deepseek-r1-moe](https://ollama.com/library/deepseek-r1)** — MoE reasoning flagship (671b)<br>**[deepseek-v3](https://ollama.com/library/deepseek-v3)** — frontier MoE model (671b)<br>**[mixtral](https://ollama.com/library/mixtral)** — sparse expert transformer (8×7B / 8×22B)<br>**[qwen2.5-moe](https://ollama.com/library/qwen2.5)** — MoE Qwen variants                                                                                                                                                                                                                                                                                                                                                                                                                |
| **Specialized Domain**       | Math, medical, multilingual              | **[qwen2-math](https://ollama.com/library/qwen2-math)** — math-specialized LLM (1.5b/7b/72b)<br>**[medllama2:7b](https://ollama.com/library/medllama2)** — biomedical LLaMA<br>**[aya-expanse](https://ollama.com/library/aya-expanse)** — multilingual research model (8b/32b)<br>**[mathstral:7b](https://ollama.com/library/mathstral)** — math-finetuned Mistral<br>**[sailor2](https://ollama.com/library/sailor2)** — multilingual models for South-East Asian languages (1b/8b/20b) |

## c. Download and Install Ollama
- There are different ways to download and install **Ollama**:
    1. Download the Ollama application for your Mac, Linux, or Windows machine from https://ollama.com/download
    2. To use Ollama from Python, install the client library in your terminal (inside your virtual environment) with `uv add ollama` or `pip install ollama`. This installs only the Python client; the Ollama server from option 1 or 3 must still be running.
    3. Pull the official `ollama/ollama` Docker image from Docker Hub and run it inside a container
- To check whether Ollama has been installed on your machine, open a terminal and give the command `ollama`:
```
Available Commands:
  serve       Start ollama
  create      Create a model
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  signin      Sign in to ollama.com
  signout     Sign out from ollama.com
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  launch      Launch an integration with Ollama
  help        Help about any command
Use "ollama [command] --help" for more information about a specific command.
```
- To check that the Ollama server is running in the background, open a browser and go to http://localhost:11434/. It should display the message **"Ollama is running"**. If it does not, run `ollama serve` in a terminal and try again :)
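
The same browser check can be done from Python. The sketch below uses only the standard library and assumes the default address `http://localhost:11434`; the helper name `is_ollama_running` is hypothetical, and it only verifies that some HTTP server answers with status 200 at that address.

```python
import http.client
import urllib.request


def is_ollama_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError, refused connections, and timeouts;
        # ValueError covers a malformed URL.
        return False


if is_ollama_running():
    print("Ollama server is reachable")
else:
    print("Server not reachable; try running `ollama serve`")
```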

<h1 align="center"><div class="alert alert-success" color=magenta style="margin: 20px">Ollama lets you download models from Ollama Hub or the Hugging Face Hub.</div></h1>

## d. Download Models from Ollama Hub Using the **Ollama GUI**
- Open the Ollama UI, from where you can search and download models from the official remote repository (https://ollama.com/search). Once downloaded, the models are stored locally, making them available for future use without an Internet connection.
- The models hosted in Ollama's official library are GGUF quantized versions. GGUF stands for "GPT-Generated Unified Format", a binary file format used to store large language model weights and metadata in a way that’s efficient for local inference.
- Many new models are too large to fit on widely available GPUs. For these you can use **Ollama's cloud**, a new way to run open models on datacenter-grade hardware. It includes hourly and daily limits to ensure fair usage.
- Ollama's cloud models require an account on ollama.com, and you need to sign in through the Ollama desktop application:
    - Step 1: Open the Ollama desktop app. Click on the Ollama icon in your Mac menu bar (top right), or open the Ollama application from your Applications folder.
    - Step 2: Click on "Sign In" in the menu. You'll be redirected to ollama.com to authenticate. Sign in or create an account, then authorize the application.
    - Step 3: Verify authentication. After signing in through the desktop app, run your command again.
- **Available Models on the Cloud are:**
    - `gpt-oss:20b-cloud`
    - `gpt-oss:120b-cloud`
    - `qwen3-coder:480b-cloud`
    - `deepseek-v3.1:671b-cloud`

>- If you are using Ollama Cloud models, the Ollama UI lets you enable or disable the web search tool and thinking mode.

## e. Download Models from Ollama Hub Using **Ollama CLI**
- Open Command Prompt/PowerShell on Windows or a Terminal on Mac and use the following commands:

```bash
$ ollama                                     # List the available commands
$ ollama pull llama3.2:1b                    # Download the model with the given model ID from the remote library (https://ollama.com/library) and store it locally (/Users/arif/.ollama/models/) for later use with ollama run
$ ollama list                                # List all the models you have downloaded on your machine
$ ollama show gpt-oss:20b-cloud              # Display the model architecture, parameters, context window size, embedding length, quantization, special tokens, capabilities, and license
$ ollama run llama3.2                        # Start an interactive session; exit with /bye or /exit
$ ollama run llama3.2 [--verbose] "Query"    # Send a single prompt and print the response. The optional --verbose flag also displays token counts and evaluation durations
$ ollama run llama3.2 "Can you extract the names of the cricket players from the following text" < ../data/names.txt           # Pass the contents of a file using the input redirection operator and ask questions about the text
$ cat ../data/names.txt | ollama run llama3.2 "Can you extract the names of the cricket players from the following text"       # Pass the contents of a file using the pipe operator and ask questions about the text
$ curl https://www.arifbutt.me/ | ollama run llama3.2:latest "Give me a single line description about Dr. Arif Butt from the given HTML text" # Fetch a web page, pass its HTML contents to the model, and ask a question about it

# Ollama exposes multiple REST API endpoints. /api/generate generates text from a single prompt (it is stateless and does not keep track of past messages)
$ curl -X POST http://localhost:11434/api/generate -d '{ "model": "llama3.2", "prompt":"What is the capital of Pakistan?", "stream":true}'
# The /api/chat endpoint is used for multi-turn chat: the client sends the past messages in the request body. It supports system, user, and assistant roles.
$ curl -X POST http://localhost:11434/api/chat -d '{"model": "llama3.2","messages": [{"role": "user", "content": "Hi, my name is Arif."}], "stream":false}'
# Send the full chat history with the next request
$ curl -X POST http://localhost:11434/api/chat -d '{"model": "llama3.2","messages": [{"role": "user", "content": "Hi, my name is Arif."},{"role": "assistant", "content": "Nice to meet you, Arif!"},{"role": "user", "content": "What is my name?"}], "stream":false}'
```
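
The last two `curl` calls show that the client, not the server, carries the conversation: every request to `/api/chat` repeats the earlier messages. A minimal Python sketch of that bookkeeping follows. The helper names `add_turn` and `build_chat_payload` are hypothetical, and the code only builds the JSON bodies without sending them.

```python
import json


def add_turn(history: list, role: str, content: str) -> list:
    """Return a new history list with one more message appended."""
    return history + [{"role": role, "content": content}]


def build_chat_payload(model: str, history: list, stream: bool = False) -> str:
    """Serialize the full conversation into a JSON body for /api/chat."""
    return json.dumps({"model": model, "messages": history, "stream": stream})


history = []
history = add_turn(history, "user", "Hi, my name is Arif.")
first_body = build_chat_payload("llama3.2", history)

# The assistant's reply must be stored by the client and resent with the next request.
history = add_turn(history, "assistant", "Nice to meet you, Arif!")
history = add_turn(history, "user", "What is my name?")
second_body = build_chat_payload("llama3.2", history)

print(second_body)
```

The second body contains all three messages, which is what allows the model to answer the follow-up question.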

<img align="right" width="800"  src="../images/gguf_1.jpg"  >

## f. Download Models from Hugging Face Hub
### Step 1: Find a GGUF model on Hugging Face
- Ollama only runs models in the GGUF format because it is built on llama.cpp, which reads GGUF files for efficient inference.
- Visit the Models tab of Hugging Face (https://huggingface.co/models) and search for "gguf". You will see different models by unsloth, Qwen, Phi, Mistral, and TheBloke.
- **GPT-Generated Unified Format (GGUF)** is a modern, unified model file format designed for efficient local inference, especially for the *LLaMA family* of models, using tools like **llama.cpp** and **Ollama**.
    - *GGUF File Container:* The outer wrapper that holds all model components in a single binary file. It eliminates the need for multi-file setups and ensures consistency between model components.
    - *Metadata Layer:* This layer stores model configuration and descriptive information, including model architecture and dimensions, context length, special tokens (BOS, EOS, PAD) and inference-related parameters.
    - *Tokenization Layer:* This layer defines how raw text is converted into tokens.
    - *Quantization Layer:* This layer describes how weights are compressed using different quantization formats such as Q4 and Q8. This enables running large models on limited hardware.
    - *Model Weights and Tensors:* This layer contains the actual neural network parameters stored as optimized tensors.
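
To make the container idea concrete, the sketch below parses the fixed-size header at the start of a GGUF file. It assumes the little-endian layout of GGUF version 2 or later (4 magic bytes `GGUF`, a 32-bit version, then 64-bit tensor and metadata counts). The function name `read_gguf_header` is hypothetical, and the demo uses a synthetic header with made-up counts rather than a real model.

```python
import io
import struct


def read_gguf_header(stream) -> dict:
    """Parse the fixed-size GGUF header (assumes format version 2 or later)."""
    magic = stream.read(4)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file: bad magic bytes")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, tensor_count, kv_count = struct.unpack("<IQQ", stream.read(20))
    return {"version": version, "tensor_count": tensor_count, "metadata_kv_count": kv_count}


# Synthetic header for demonstration (the counts are invented, not from a real model)
fake = io.BytesIO(b"GGUF" + struct.pack("<IQQ", 3, 340, 36))
header = read_gguf_header(fake)
print(header)
```

For a real file, the same function can be called on `open(path, "rb")`. The metadata key/value pairs and tensor descriptions follow this header and are not parsed here.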


### Step 2: Download a GGUF model to your computer
- Click on "Files and versions" on the model page (https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/tree/main) and select the version you want.
- A model’s raw size in memory is roughly: `Parameters × Bytes per parameter`. The bytes per parameter depends on quantization level:
    - FP16 (16-bit float): 2 bytes/parameter
    - INT8 quantization: 1 byte/parameter
    - INT4 quantization: 0.5 bytes/parameter
- So memory use scales with both parameter count and quantization method.
    - 1.5B model → ~3–3.5 GB RAM (unquantized FP16).
    - 7B model → ~14–16 GB RAM (unquantized FP16).
    - With quantization, the requirement drops by 2×–4×, which is why Ollama and llama.cpp can run 7B models comfortably on machines with 8–16 GB RAM.
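
The rule of thumb above can be checked with a few lines of Python. This is a rough sketch: it counts only the raw weights (1 GB taken as 10^9 bytes) and ignores the KV cache and runtime overhead, which is why real usage is somewhat higher. The names `BYTES_PER_PARAM` and `estimate_weight_memory_gb` are illustrative.

```python
# Bytes per parameter for the precisions listed above
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_weight_memory_gb(num_params: float, precision: str) -> float:
    """Raw weight size in GB (1 GB = 1e9 bytes), excluding KV cache and overhead."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for name, params in [("1.5B", 1.5e9), ("7B", 7e9)]:
    for precision in ("fp16", "int8", "int4"):
        size = estimate_weight_memory_gb(params, precision)
        print(f"{name} @ {precision}: {size:.2f} GB")
```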

- **Download via Browser**
- Let us download the 8-bit quantized version from this link: https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/blob/main/gemma-3-1b-it-Q8_0.gguf
- Click the download button or clone the repository.
- **Download Using Command Line**
```bash
# Windows
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF tinyllama-1.1b-chat-v1.0.Q8_0.gguf --local-dir C:\models\TinyLlama_safe

# Linux/macOS
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF tinyllama-1.1b-chat-v1.0.Q8_0.gguf --local-dir ~/models/TinyLlama_safe
```

- I have downloaded models inside the `../models/gemma/` directory.

In [2]:
!ls -lh ../models/gemma/

total 2088504
-rw-r--r--@ 1 arif  staff   1.0G 19 Dec 22:03 gemma-3-1b-it-Q8_0.gguf
-rw-r--r--@ 1 arif  staff   114B 19 Dec 22:32 Modelfile


### Step 3: Create a Modelfile
- A **Modelfile** in Ollama is a small text configuration file that tells Ollama how to use a specific model.
- Think of it as a bridge between the GGUF model file and Ollama’s chat interface: it defines which model to load and how the conversation should behave.
- **Why is it needed?** Ollama itself cannot automatically know:
  - Which GGUF model to load.
  - How to handle prompts, messages, and system instructions.
  - What generation parameters to use (temperature, top-p, etc.).
- The Modelfile encapsulates all this information so you can run your model with consistent behavior.
- It makes customizing models easier without modifying the GGUF file itself.
- The only required instruction in the Modelfile is `FROM`, which points to the model file you just downloaded (where the model weights live). The path can be absolute or relative to the location of the Modelfile.
```
FROM ./gemma-3-1b-it-Q8_0.gguf

SYSTEM "You are a helpful assistant that answers briefly."

PARAMETER temperature 0.7
```
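
If you create Modelfiles for several models, generating them from a script keeps them consistent. The sketch below covers only the three instructions shown above (`FROM`, `SYSTEM`, `PARAMETER`). The helper `render_modelfile` is hypothetical, and it assumes the system prompt contains no double quotes.

```python
def render_modelfile(gguf_path: str, system_prompt: str = "", **parameters) -> str:
    """Build Modelfile text: a required FROM line plus optional SYSTEM and PARAMETER lines."""
    lines = [f"FROM {gguf_path}"]
    if system_prompt:
        lines.append(f'SYSTEM "{system_prompt}"')
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "./gemma-3-1b-it-Q8_0.gguf",
    system_prompt="You are a helpful assistant that answers briefly.",
    temperature=0.7,
)
print(text)
```

Writing `text` to a file named `Modelfile` next to the GGUF file gives the input expected by `ollama create`.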

In [2]:
!cat ../models/gemma/Modelfile

FROM gemma-3-1b-it-Q8_0.gguf
SYSTEM "You are a helpful assistant that answers briefly."
PARAMETER temperature 0.7


### Step 4: Build the model using `ollama create`
- Use the `ollama create` command, which performs the following tasks:
    - Copies your local GGUF file into Ollama’s internal model store, verifying its SHA-256 hash.
    - Parses the GGUF file to detect the model architecture (Gemma-3-Instruct) and creates new "layers" that include your system prompt and parameters.
    - Writes a new manifest and registers a new Ollama model name, so you can run it like any other built-in model.
- Once done, the `ollama list` command will display the new model, ready to run.

In [3]:
!ollama create mygemma -f ../models/gemma/Modelfile

gathering model components
copying file sha256:616dfb049ad14288d971d96f5ca4953fdebbf1e3cd407ad159f3bfd47090201d 100%
parsing GGUF
using existing layer sha256:616dfb049ad14288d971d96f5ca4953fdebbf1e3cd407ad159f3bfd47090201d
using autodetected template gemma3-instruct
using existing layer sha256:611659b

In [2]:
!ollama list

NAME                        ID              SIZE      MODIFIED     
mygemma:latest              a631de1ab12d    1.1 GB    6 weeks ago     
deepseek-v3.1:671b-cloud    d3749919e45f    -         7 weeks ago     
gpt-oss:20b-cloud           875e8e3a629a    -         7 weeks ago     
gpt-oss:120b-cloud          569662207105    -         7 weeks ago     
qwen3-coder:480b-cloud      e30e45586389    -         7 weeks ago     
nomic-embed-text:latest     0a109f422b47    274 MB    7 weeks ago     
tinyllama:latest            2644915ede35    637 MB    7 weeks ago     
deepseek-r1:1.5b            e0979632db5a    1.1 GB    7 months ago    
llama3.2:1b                 baf6a787fdff    1.3 GB    8 months ago    
llama3.2:latest             a80c4f17acd5    2.0 GB    8 months ago    


### Step 5: Run your new model via CLI

In [6]:
!ollama run mygemma:latest "What is the capital of Pakistan"


# <span style='background :lightgreen' >3. Accessing AI Models via LLaMA.cpp (GUI and CLI)</span>

## a. Download and Install Llama.cpp
- There are different ways to download and install **llama.cpp** on your machine (https://github.com/ggml-org/llama.cpp):
    1. Install llama.cpp with a package manager: `brew install llama.cpp` on macOS, `winget` on Windows, or `nix` on Linux
    2. Download pre-built binaries from the releases page
    3. Run with Docker (see the Docker documentation in the llama.cpp repository)
    4. Clone the repository and build using `cmake`

## b. Inference Using llama.cpp 
- You can manually download a GGUF model file from Hugging Face and run it through the command line interface of llama.cpp.
- llama.cpp does not automatically download models from Hugging Face unless you use the `-hf` flag.
- Models must be in GGUF format; a model in another format must first be converted using the conversion utilities provided in the llama.cpp repo.
- Below are some commands to practice in your own time (replace `llama-cli` with `llama-cli.exe` on Windows):
```bash
$ llama-cli --help        # Shows general usage, options, and flags available for the CLI.
$ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF    # Automatically downloads the specified Hugging Face GGUF model and runs it locally.
$ llama-cli -m /Users/arif/Documents/genai-course/models/gemma/gemma-3-1b-it-Q8_0.gguf # Run Interactive Session with a Local Model
$ llama-cli -m /Users/arif/Documents/genai-course/models/gemma/gemma-3-1b-it-Q8_0.gguf -n 256 --temp 0.7 # Run With Additional Parameters
```
- You can also start `llama-server` with a model loaded. The embedded server often exposes a lightweight Web UI at `http://localhost:PORT`, where you can interact with the model from a browser.
```bash
$ llama-server -m /Users/arif/Documents/genai-course/models/gemma/gemma-3-1b-it-Q8_0.gguf  --port 8080 
$ llama-server -m /Users/arif/Documents/genai-course/models/gemma/gemma-3-1b-it-Q8_0.gguf  -c 8192 -np 4 --port 8080  # -c specifies context window size. -np 4 specifies how many users or requests the server can handle simultaneously.
```
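
Besides the Web UI, `llama-server` also serves an OpenAI-compatible HTTP endpoint, so a running server can be queried from Python. The sketch below is a minimal illustration using only the standard library; it assumes the server was started on port 8080 as in the command above, and the helper names `build_chat_payload` and `ask_llama_server` are hypothetical, not part of llama.cpp.

```python
import json
import urllib.request


def build_chat_payload(user_prompt, system_prompt="You are a helpful assistant",
                       temperature=0.7, max_tokens=256):
    """Build a request body in the OpenAI Chat Completions format."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def ask_llama_server(user_prompt, base_url="http://localhost:8080", timeout=120):
    """POST the prompt to a running llama-server and return the reply text.

    Assumption: the server listens on base_url (port 8080 in the notebook's example).
    """
    request = urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(user_prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        body = json.loads(reply.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask_llama_server("What is the capital of Pakistan?"))` should print the model's answer. No `model` field is sent here, on the assumption that the server answers with the model it was started with.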

# <span style='background :lightgreen' >4. Programmatically Accessing AI Models running on Local Machine by Ollama </span>

## a. Using  Ollama’s Native Python Library
- The `ollama.chat()` function is the primary interface for conversational interactions with any model served by Ollama, not only Llama models.

In [9]:
import ollama
import rich

response = ollama.chat(
                        model = 'llama3.2:1b', # "deepseek-v3.1:671b-cloud",  "gpt-oss:20b-cloud", "gpt-oss:120b-cloud", "qwen3-coder:480b-cloud"
                        messages = [{'role': 'system',  'content': "You are a helpful assistant"}, {'role': 'user', 'content': "What is the capital of Pakistan?"}],
                        stream = False,                # Default: False - set True for streaming
                        format=None,                   # Default: None - 'json' for JSON output
                        tools=None,                    # Default: None - List of tool definitions
                        options={                      # Model-specific options
                                'temperature': 0.7,         # Default: 0.8 (0.0-2.0, creativity)
                                'top_p': 0.9,               # Default: 0.9 (nucleus sampling)
                                }
                    )

print("Content:", response.message.content)
rich.print(response)

Content: The capital of Pakistan is Islamabad.
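
The call above is a single turn. A model only "remembers" earlier turns if they are sent again in `messages`, so a multi-turn chat needs the history to be kept on the client side. The sketch below is one possible way to do this; the `ChatSession` class and its `chat_fn` parameter are hypothetical helpers introduced for illustration, and the default backend simply wraps the `ollama.chat()` call shown above.

```python
class ChatSession:
    """Keeps the running message list so each call sees the earlier turns."""

    def __init__(self, model, system_prompt="You are a helpful assistant", chat_fn=None):
        self.model = model
        self.messages = [{"role": "system", "content": system_prompt}]
        # chat_fn(model, messages) -> reply text; defaults to the Ollama backend.
        self._chat_fn = chat_fn or self._ollama_chat

    @staticmethod
    def _ollama_chat(model, messages):
        import ollama  # imported lazily so the class can be used without a running server
        response = ollama.chat(model=model, messages=messages)
        return response.message.content

    def ask(self, user_prompt):
        """Send one user turn together with the full history and store the reply."""
        self.messages.append({"role": "user", "content": user_prompt})
        reply = self._chat_fn(self.model, self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

For example, after `session = ChatSession("llama3.2:1b")` and `session.ask("What is the capital of Pakistan?")`, a follow-up such as `session.ask("What is its population?")` is sent along with the first exchange, so the model can resolve "its".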


## b. Using  OpenAI's `Chat Completion` API

In [7]:
from openai import OpenAI

# Create an OpenAI client and set base_url to 'http://localhost:11434/v1', so that it points to a local server instead of OpenAI's cloud (https://api.openai.com/v1).
# This tells the client: “Send requests to my local Ollama API server, not OpenAI’s servers.”
client = OpenAI(base_url="http://localhost:11434/v1", api_key='abc')   # Ollama does not need authentication, but the client requires a key, so any string will work

# Use OpenAI's Chat Completions API (routed to the local Ollama server)
response = client.chat.completions.create(
                                            model = "mygemma",    # "deepseek-v3.1:671b-cloud",  "gpt-oss:20b-cloud", "gpt-oss:120b-cloud", "qwen3-coder:480b-cloud"
                                            messages=[{'role': 'system',  'content': "You are a helpful assistant"}, {'role': 'user', 'content': "What is the capital of Pakistan?"}],
                                            temperature=0.7,
                                            max_tokens=500,
                                            stream=False
                                        )
# Extract and print the output text
print(response.choices[0].message.content)

The capital of Pakistan is **Islamabad**.
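
A request like the one above fails if the Ollama server is not running or the model name is misspelled. A quick check against Ollama's native `/api/tags` endpoint, which lists the locally available models, can catch both cases before the first request. The sketch below uses only the standard library; the helper names are hypothetical, and the default host assumes Ollama's standard port 11434.

```python
import json
import urllib.request


def parse_model_names(tags_payload):
    """Extract model names from the JSON body returned by Ollama's /api/tags."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def list_local_models(host="http://localhost:11434", timeout=5):
    """Ask the local Ollama server which models it has available."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout) as reply:
        return parse_model_names(json.loads(reply.read().decode("utf-8")))


def model_is_available(name, available):
    """True if `name` matches exactly, or matches once ':latest' is appended."""
    return name in available or f"{name}:latest" in available
```

With the server running, `model_is_available("mygemma", list_local_models())` reports whether the custom model created earlier can be used as the `model` argument.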


## c. Using  OpenAI's `Responses` API

In [4]:
from openai import OpenAI

# Create an OpenAI client and set base_url to 'http://localhost:11434/v1', so that it points to a local server instead of OpenAI's cloud (https://api.openai.com/v1).
# This tells the client: “Send requests to my local Ollama API server, not OpenAI’s servers.”
client = OpenAI(base_url="http://localhost:11434/v1", api_key="abc") # Ollama runs locally and does not need authentication, but the client requires a key, so any string will work

# Using the Responses API
response = client.responses.create(
                                    model="gpt-oss:20b-cloud",      # "deepseek-v3.1:671b-cloud",  "gpt-oss:20b-cloud", "gpt-oss:120b-cloud", "qwen3-coder:480b-cloud"
                                    #input= "What is the color of the sky? Tell me in a single line."      
                                    input=[{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of Pakistan?"}]
                                    )

# Extract and print the output text
print(response.output_text) 

The capital of Pakistan is **Islamabad**.


# <span style='background :lightgreen' >5. Hello World Examples </span>
## Writing a Helper Function for Convenience

In [10]:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def ask_ollama(
                user_prompt: str,
                developer_prompt: str = "You are a helpful assistant that provides concise answers.",
                model: str =  "deepseek-v3.1:671b-cloud",       # "llama3.2",
                temperature: float = 0.7,
                stream: bool = False
            ):
    
    response = client.chat.completions.create(
        messages=[{"role": "system", "content": developer_prompt}, {"role": "user", "content": user_prompt}],
        model=model,
        temperature=temperature,
        stream=stream
    )
    
    if stream:                    # Return streaming generator if requested
        return response
    return response.choices[0].message.content   # Return the aggregated text output

## Examples (Question Answering)

In [11]:
user_prompt = "Which is the capital of Pakistan?"
response = ask_ollama(user_prompt=user_prompt, model="mygemma")
print(response)

Islamabad.


In [12]:
user_prompt = "Tell me a bedtime story of Ali baba chalees chor"

# Get streaming generator
response = ask_ollama(user_prompt=user_prompt, stream=True)

# Read streaming chunks
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)


Here's a short bedtime story of Ali Baba and the Forty Thieves:

Ali Baba, a poor woodcutter, once saw forty thieves approach a magic cave. Their leader said, "Open Sesame!" and a door in the rock opened. After they left, Ali Baba used the same words, entered, and found treasures. He took some gold home.

His greedy brother, Cassim, learned the secret but forgot the words inside the cave. Trapped, the thieves found and killed him. Ali Baba, with help from clever slave Morgiana, tricked and defeated the thieves who sought revenge—ending their threat forever.

Ali Baba shared the treasure and lived happily ever after. Goodnight!
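
The chunk-reading loop used for streaming can be wrapped in a small helper so that the text is both printed as it arrives and returned as one string for later use. This is a sketch; `collect_stream` is a hypothetical helper name, and it assumes the chunks have the `choices[0].delta.content` shape used in the loop above.

```python
def collect_stream(chunks, echo=True):
    """Print streamed text as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:          # skip chunks that carry no choices
            continue
        text = chunk.choices[0].delta.content
        if text:                       # content may be None or empty
            if echo:
                print(text, end="", flush=True)
            parts.append(text)
    if echo:
        print()                        # finish the line after the last chunk
    return "".join(parts)
```

Usage: `story = collect_stream(ask_ollama(user_prompt=user_prompt, stream=True))`.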

## Question Answering from Supplied Content

In [12]:
!cat ../data/names.txt

Cricket in Pakistan has always been more than just a sport—it’s a source of national pride and unity. Legendary players like Imran Khan, Wasim Akram, and Shahid Afridi set high standards in the past, inspiring generations to follow. Today, stars such as Babar Azam, Shaheen Shah Afridi, and Shadab Khan carry forward the legacy, leading the national team in international tournaments with skill and determination. Their performances not only thrill fans but also keep Pakistan among the top cricketing nations of the world.

Politics in Pakistan, meanwhile, remains dynamic and often turbulent, with key figures shaping the country’s direction. Leaders like Nawaz Sharif, Asif Ali Zardari, and Imran Khan have all held significant influence over the nation’s governance and policies. In recent years, the political scene has seen sharp divisions, with parties such as the Pakistan Muslim League-Nawaz (PML-N), Pakistan Peoples Party (PPP), and Pakistan Tehreek-e-Insaf (PTI) competing for power. Deba

In [13]:
from openai import OpenAI

with open("../data/names.txt", "r") as f:
    file_content = f.read()

user_prompt = f"Extract names from this text:\n{file_content}"
response = ask_ollama(user_prompt=user_prompt, model="deepseek-r1:1.5b")
print(response)

The names extracted from the text are:

Imran Khan  
Wasim Akram  
Shahid Afridi  
Babar Azam  
Shaheen Shah Afridi  
Shadab Khan  
Nawaz Sharif  
Asif Ali Zardari
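
The reply above is free text, so it needs a little post-processing before the names can be used in a program. The sketch below turns such a reply into a Python list. It also removes a `<think>...</think>` block, on the assumption that a reasoning model such as `deepseek-r1` may include its reasoning in the reply, depending on the Ollama version and settings. Both function names are hypothetical.

```python
import re

# Matches a reasoning block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_reasoning(text):
    """Remove any <think>...</think> block and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


def extract_list_items(text):
    """Return the non-empty lines of a reply, skipping header lines that end with ':'."""
    items = []
    for line in strip_reasoning(text).splitlines():
        line = line.strip().lstrip("-*• ").strip()   # drop bullet markers, if any
        if line and not line.endswith(":"):
            items.append(line)
    return items
```

Applied to the output above, `extract_list_items(response)` would give a list beginning with `"Imran Khan"` and `"Wasim Akram"`, provided the model keeps the one-name-per-line layout.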


## Examples (Binary Classification: Sentiment analysis, Spam detection, Medical diagnosis)

In [14]:
developer_prompt = "You are an expert who will classify a sentence as having either a Positive or Negative sentiment."
user_prompt = "I love the youtube videos of Arif, as they are very informative"
response = ask_ollama(user_prompt=user_prompt, developer_prompt=developer_prompt)
print(response)

Positive


## Examples (Multi-class Classification)

In [15]:
developer_prompt = "Classify product reviews into these categories: 'Electronics', 'Clothing', 'Books', 'Home & Garden', 'Sports', or 'Food'. \
Respond with only the category."
user_prompt = "This novel has an incredible plot twist that kept me reading all night"
response = ask_ollama(user_prompt=user_prompt, developer_prompt=developer_prompt)
print(response)

Books


In [16]:
system_role = "Classify product reviews into these categories: 'Electronics', 'Clothing', 'Books', 'Home & Garden', 'Sports', or 'Food'. \
Respond with only the category."
user_prompt = "The wireless headphones have excellent sound quality and battery life"
response = ask_ollama(user_prompt, system_role)
print(response)

Electronics
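
Even when asked to respond with only the category, a model may add punctuation, Markdown emphasis, or extra words. Before using the reply as a label, it can be checked against the allowed categories. The sketch below is one simple way to do that; `normalize_label` is a hypothetical helper, and the category list is copied from the prompt above.

```python
CATEGORIES = ["Electronics", "Clothing", "Books", "Home & Garden", "Sports", "Food"]


def normalize_label(response, allowed=CATEGORIES):
    """Map a raw model reply to one of the allowed labels, or None if there is no unique match."""
    cleaned = response.strip().strip("'\"*.").strip().lower()
    for label in allowed:
        if cleaned == label.lower():
            return label
    # Fall back to a substring search, accepting only an unambiguous match.
    matches = [label for label in allowed if label.lower() in cleaned]
    return matches[0] if len(matches) == 1 else None
```

A return value of `None` signals that the reply should be retried or reviewed rather than silently accepted.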


## Examples (Text Generation)

In [17]:
developer_prompt = "You are an expert in political science and history and have a deep understanding of the political situation of Pakistan."
user_prompt = "Write a 50-word summary about the fairness of the general elections held in Pakistan on February 08, 2024."
response = ask_ollama(user_prompt=user_prompt, developer_prompt=developer_prompt, temperature=1.0)
print(response)

Pakistan's February 8, 2024, general elections were marred by significant allegations of pre-poll manipulation and election day irregularities, casting doubts over their fairness. Widespread arrests, internet shutdowns, and delayed results fueled opposition claims of systematic rigging to favor the military-backed establishment. The polls were contentious, failing to gain broad acceptance.


## Examples (Code Generation)

In [18]:
developer_prompt = "You are an expert in C programming."
user_prompt = "Write a C program that generates the first ten numbers of the Fibonacci sequence."
response = ask_ollama(user_prompt=user_prompt, developer_prompt=developer_prompt, stream=True)


# Read streaming chunks
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Here's a C program that generates the first ten numbers of the Fibonacci sequence:

```c
#include <stdio.h>

int main() {
    int n = 10; // Number of Fibonacci numbers to generate
    int first = 0, second = 1, next;
    
    printf("First %d numbers in Fibonacci sequence:\n", n);
    
    for (int i = 0; i < n; i++) {
        if (i <= 1) {
            next = i;
        } else {
            next = first + second;
            first = second;
            second = next;
        }
        printf("%d ", next);
    }
    
    printf("\n");
    return 0;
}
```

**Output:**
```
First 10 numbers in Fibonacci sequence:
0 1 1 2 3 5 8 13 21 34
```

**Alternative version using an array:**

```c
#include <stdio.h>

int main() {
    int n = 10;
    int fib[10];
    
    fib[0] = 0;
    fib[1] = 1;
    
    printf("First %d numbers in Fibonacci sequence:\n", n);
    printf("%d %d ", fib[0], fib[1]);
    
    for (int i = 2; i < n; i++) {
        fib[i] = fib[i-1] + fib[i-2];
        printf("%d ", fib
```

## Examples (Text Translation)

In [19]:
user_prompt = """
Please act as an expert English-to-Urdu translator and translate the given sentence from English into Urdu.
'The budget this year will have a very bad impact on the low salaried people'
"""
response = ask_ollama(user_prompt=user_prompt)
print(response)

اس سال کا بجٹ کم تنخواہ والے لوگوں پر بہت برا اثر ڈالے گا۔


## Examples (Text Summarization)

In [20]:
developer_prompt = "You are an expert in the English language."

user_prompt = '''
Summarize the text below in at most 20 words:
```The Hugging Face transformers library is an incredibly versatile and powerful tool for natural language processing (NLP).
It allows users to perform a wide range of tasks such as text classification, named entity recognition, and question answering, among others.
It's an extremely popular library that's widely used by the open-source data science community.
It lowers the barrier to entry into the field by providing Data Scientists with a productive, convenient way to work with transformer models.```
'''

response = ask_ollama(user_prompt=user_prompt, developer_prompt=developer_prompt, temperature=0.2)
print(response)

Hugging Face's transformers library is a popular, versatile NLP tool that simplifies working with transformer models.


## Examples (Named Entity Recognition)

In [21]:
developer_prompt = """You are a Named Entity Recognition specialist. Extract and classify entities from the given text into these categories only if they exist:
- name
- major
- university
- nationality
- grades
- club
Format your response as: 'Entity: [text] | Type: [category]' with each entity on a new line."""

user_prompt = '''
Zelaid Mujahid is a sophomore majoring in Data Science at University of the Punjab. \
He is a Pakistani national and has a 3.5 GPA. Mujahid is an active member of the department's AI Club. \
He hopes to pursue a career in AI after graduating.
'''

response = ask_ollama(user_prompt=user_prompt, developer_prompt=developer_prompt)
print(response)

Entity: Zelaid Mujahid | Type: name
Entity: Data Science | Type: major
Entity: University of the Punjab | Type: university
Entity: Pakistani | Type: nationality
Entity: 3.5 GPA | Type: grades
Entity: AI Club | Type: club
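
Because the prompt fixes the output format as `Entity: [text] | Type: [category]`, the reply can be parsed into structured records. The sketch below shows one way to do this with a regular expression; `parse_entities` is a hypothetical helper, and lines that do not follow the requested format are ignored.

```python
import re

# One entity per line, in the format requested by the developer prompt.
ENTITY_LINE = re.compile(r"^Entity:\s*(?P<text>.+?)\s*\|\s*Type:\s*(?P<type>.+?)\s*$")


def parse_entities(response):
    """Convert 'Entity: X | Type: Y' lines into a list of dictionaries."""
    entities = []
    for line in response.splitlines():
        match = ENTITY_LINE.match(line.strip())
        if match:
            entities.append({"text": match.group("text"),
                             "type": match.group("type").lower()})
    return entities
```

For the output above, `parse_entities(response)` yields six records, the first being `{"text": "Zelaid Mujahid", "type": "name"}`.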
