# Run Private AI Workflows with LangChain and Ollama

This notebook demonstrates how to integrate LangChain with Ollama to run language models locally for privacy-preserving AI workflows.

## Setup

Install required packages:
```bash
pip install langchain langchain-community langchain-ollama
```

Install Ollama:
- For macOS: Download from [ollama.com](https://ollama.com)
- For Linux: `curl -fsSL https://ollama.com/install.sh | sh`
- For Windows: Download Windows (Preview) from [ollama.com](https://ollama.com)

Start Ollama server:
```bash
ollama serve
```

Pull a model (example):
```bash
ollama pull qwen3:0.6b
```

## What You'll Learn
- Set up LangChain with Ollama for local AI
- Use chat and completion models
- Build LangChain chains for complex workflows
- Work with embeddings for semantic search
- Build a question-answering system with RAG

## Why Local AI Matters

AI models are changing data science projects by automating feature engineering, summarizing datasets, generating reports, and even writing code to examine or clean data.

However, using popular APIs like OpenAI or Anthropic can introduce serious privacy risks, especially when handling regulated data such as medical records, legal documents, or internal company knowledge.

When data privacy is important, running models locally ensures full control. Nothing leaves your machine, so you manage all inputs, outputs, and processing securely.

That's where [LangChain](https://www.langchain.com/) and [Ollama](https://ollama.com/) come in. LangChain provides the framework to build AI applications. Ollama lets you run open-source models locally.

## Introduction to Ollama and LangChain

### What is Ollama?

Ollama is an open-source tool that makes it easy to run large language models locally. It offers a simple CLI and REST API for downloading and interacting with popular models like Llama, Mistral, DeepSeek, and Gemma, with no complex setup required.

Since Ollama doesn't depend on external APIs, it is ideal for sensitive data or limited-connectivity environments.

### What is LangChain?

LangChain is a framework for creating AI applications using language models. Rather than writing custom code for model interactions, response handling, and error management, you can use LangChain's ready-made components to build applications.

In [None]:
pip install langchain langchain-community langchain-ollama

Ollama needs to be installed separately since it's a standalone service that runs locally:

- For macOS: Download from [ollama.com](https://ollama.com) - this installs both the CLI tool and service
- For Linux: `curl -fsSL https://ollama.com/install.sh | sh` - this script sets up both the binary and system service
- For Windows: Download Windows (Preview) from [ollama.com](https://ollama.com) - still in preview mode with some limitations

Start the Ollama server:

```bash
ollama serve
```

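The server listens on `http://localhost:11434` by default and handles model loading and inference requests. If you installed the desktop app or the Linux system service, the server may already be running in the background.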
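
Before sending requests from Python, it can help to confirm that the server is reachable. The sketch below assumes the default address `http://localhost:11434` and uses Ollama's `/api/tags` endpoint, which lists locally available models. The helper name `ollama_is_running` is illustrative and not part of any library.

```python
import http.client
import urllib.error
import urllib.request


def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers 200 at base_url + /api/tags."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return False


print("Ollama reachable:", ollama_is_running())
```

If this prints `False`, start the server with `ollama serve` before running the cells that follow.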

### Pulling Models with Ollama

Before using any model with LangChain, you need to pull it to your local machine with Ollama:

```bash
ollama pull qwen3:0.6b
```

Once it is downloaded, you can chat with the model interactively from the terminal:

```bash
ollama run qwen3:0.6b
```

The model size has a large impact on performance and resource requirements:

- Smaller models (7B-8B) run well on most modern computers with 16GB+ RAM
- Medium models (13B-34B) need more RAM or GPU acceleration
- Large models (70B+) typically require a dedicated GPU with 24GB+ VRAM
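
As a rough rule of thumb, the memory needed for the weights is the parameter count multiplied by the bytes stored per weight. The sketch below is a back-of-the-envelope estimate under that assumption only: it ignores the KV cache, activations, and runtime overhead, so real usage will be higher. The function name is illustrative.

```python
def estimate_weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billion * 1e9 weights) * bytes_per_weight / 1e9 bytes per GB
    return params_billion * bytes_per_weight


for name, params in [("8B", 8), ("34B", 34), ("70B", 70)]:
    fp16 = estimate_weight_memory_gb(params, 16)
    q4 = estimate_weight_memory_gb(params, 4)
    print(f"{name}: ~{fp16:.0f} GB at 16-bit, ~{q4:.0f} GB at 4-bit")
```

The gap between the 16-bit and 4-bit columns shows why quantized model variants fit on much smaller machines.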

For a full list of models you can run locally, see [the Ollama model library](https://ollama.com/search). Before pulling a large model, check [the VRAM calculator](https://apxml.com/tools/vram-calculator) to see whether your machine can run it.


### Basic Chat Integration

Once you have a model downloaded, you need to connect LangChain to Ollama for actual AI interactions. LangChain uses dedicated classes that handle the communication between your Python code and the Ollama service:

In [None]:
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_ollama import ChatOllama

# Initialize the chat model with specific configurations
chat_model = ChatOllama(
    model="qwen3:0.6b",
    temperature=0.5,
    base_url="http://localhost:11434",  # Can be changed for remote Ollama instances
)

# Create a conversation with system and user messages
messages = [
    SystemMessage(content="You are a helpful coding assistant specialized in Python."),
    HumanMessage(content="Write a recursive Fibonacci function with memoization.")
]

# Invoke the model
response = chat_model.invoke(messages)
print(response.content[:200])
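
Reasoning-oriented models such as Qwen3 may place their intermediate reasoning inside `<think>...</think>` tags at the start of the response, so the first 200 characters can be mostly reasoning rather than the answer. Whether the tags appear depends on the model and library version, so treat the following as a sketch under that assumption. The helper `strip_think` is hypothetical, not a LangChain function.

```python
import re

# Matches a <think>...</think> block, including newlines inside it
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text):
    """Remove <think>...</think> blocks and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


sample = "<think>\nThe user wants memoization...\n</think>\n\ndef fib(n): ..."
print(strip_think(sample))
```

With the response above, this would be called as `strip_think(response.content)`.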

In [None]:
async def generate_async():
    response = await chat_model.ainvoke(messages)
    return response.content

# In async context
result = await generate_async()

In [None]:
print(result[:200])

In [None]:
from langchain_ollama import OllamaLLM

# Initialize the completion-style LLM with default options
llm = OllamaLLM(
    model="qwen3:0.6b",
)

# Generate text from a prompt
text = """
Write a quick sort algorithm in Python with detailed comments:
```{python}
def quicksort(
"""

response = llm.invoke(text)
print(response[:500])

In [None]:
for chunk in llm.stream("Explain quantum computing in three sentences:"):
    print(chunk, end="", flush=True)
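
Streaming prints text as it arrives, but you often also want the complete response afterwards. The sketch below shows one way to do both. `fake_stream` is a hypothetical stand-in for `llm.stream(...)` so the example runs without a model, and `collect_stream` is an illustrative helper name.

```python
def collect_stream(chunks):
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


def fake_stream():
    # Hypothetical stand-in for llm.stream(...): yields pieces of text
    yield from ["Quantum ", "computers ", "use ", "qubits."]


full_text = collect_stream(fake_stream())
```

With a running model, the call would be `collect_stream(llm.stream("..."))`.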

In [None]:
llm = OllamaLLM(
    model="qwen3:0.6b", # Example model, can be any model supported by Ollama
    temperature=0.7,      # Controls randomness (0.0 = deterministic, 1.0 = creative)
    stop=["```", "###"],  # Stop sequences to end generation
    repeat_penalty=1.1,   # Penalizes repetition (>1.0 reduces repetition)
)

Details about these parameters:

- `model`: Specifies the language model to use.
- `temperature`: Controls randomness; lower = more focused, higher = more creative.
- `stop`: Defines stop sequences that terminate generation early. Once one of these sequences is produced, the model stops generating further tokens.
- `repeat_penalty`: Penalizes repeated tokens to reduce redundancy. Values greater than 1.0 discourage the model from repeating itself.
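
To see why temperature controls randomness, it helps to look at how dividing token scores by the temperature reshapes the probability distribution. This is a conceptual sketch of temperature scaling as it is commonly described, not Ollama's actual sampling code, and the scores are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.1, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all probability goes to the top-scoring token. At higher temperature the distribution flattens, so less likely tokens are sampled more often.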

Parameter recommendations:

- For factual or technical responses: Lower `temperature` (0.1-0.3) and higher `repeat_penalty` (1.1-1.2)
- For creative writing: Higher `temperature` (0.7-0.9)
- For code generation: Medium `temperature` (0.3-0.6) with specific `stop` like "\`\`\`"
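
The effect of a stop sequence can be illustrated in plain Python: the output ends just before the first stop sequence that would have appeared. The server applies this during generation. The function below only mimics the result on a finished string, and `apply_stop` is an illustrative name.

```python
def apply_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # ignore empty stop strings
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


raw = "def add(a, b):\n    return a + b\n### Explanation\nThis adds..."
print(apply_stop(raw, ["###"]))
```

Here the trailing explanation is dropped and only the function definition remains.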

The model behavior changes dramatically with these settings. For example:

In [None]:
# Scientific writing with precise output
scientific_llm = OllamaLLM(model="qwen3:0.6b", temperature=0.1, repeat_penalty=1.2)

# Creative storytelling
creative_llm = OllamaLLM(model="qwen3:0.6b", temperature=0.9, repeat_penalty=1.0)

# Code generation (assumes the model was pulled first with `ollama pull codellama`)
code_llm = OllamaLLM(model="codellama", temperature=0.3, stop=["```", "def "])

### Creating LangChain Chains

AI workflows often involve multiple steps: data validation, prompt formatting, model inference, and output processing. Running these steps manually for each request becomes repetitive and error-prone.

LangChain addresses this by chaining steps into sequences to create end-to-end applications:

In [None]:
import json

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Create a structured prompt template
prompt = PromptTemplate.from_template("""
You are an expert educator.
Explain the following concept in simple terms that a beginner would understand.
Make sure to provide:
1. A clear definition
2. A real-world analogy
3. A practical example

Concept: {concept}
""")

First, we import the required LangChain components and create a prompt template. The `PromptTemplate.from_template()` method creates a reusable template with placeholder variables (like `{concept}`) that get filled in at runtime.
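
Placeholder filling works much like Python's built-in string formatting. The snippet below is a plain-Python analogy rather than LangChain code, and the `{style}` placeholder is added here purely for illustration.

```python
template = "Explain {concept} to a beginner in {style} style."

# One template, reused for many inputs by filling the placeholders at call time
for concept in ["recursion", "hash tables"]:
    print(template.format(concept=concept, style="plain"))
```

A prompt template adds validation and chain integration on top of this basic idea.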

In [None]:
# Create a parser that extracts structured data
class JsonOutputParser:
    def parse(self, text):
        try:
            # Find JSON blocks in the text
            if "```json" in text and "```" in text.split("```json")[1]:
                json_str = text.split("```json")[1].split("```")[0].strip()
                return json.loads(json_str)
            # Try to parse the whole text as JSON
            return json.loads(text)
        except (json.JSONDecodeError, ValueError, IndexError):
            # Fall back to returning the raw text
            return {"raw_output": text}

# Initialize a model instance to be used in the chain
llm = OllamaLLM(model="qwen3:0.6b")

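Next, we define a custom output parser. This class attempts to extract JSON from the model's response, handling both code-block format and raw JSON. If parsing fails, it returns the original text wrapped in a dictionary. The chain below uses `StrOutputParser` instead, because the prompt asks for a plain-text explanation rather than JSON.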
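
The three cases the parser handles are easier to see on sample inputs. The standalone function below mirrors the same extraction logic so it can be run without a model. `extract_json` and the sample strings are illustrative.

```python
import json

FENCE = "`" * 3  # three backticks, built programmatically for readability


def extract_json(text):
    """Return parsed JSON from a fenced json block or raw JSON, else raw text."""
    marker = FENCE + "json"
    try:
        if marker in text:
            after_marker = text.split(marker, 1)[1]
            if FENCE in after_marker:
                return json.loads(after_marker.split(FENCE, 1)[0].strip())
        return json.loads(text)
    except json.JSONDecodeError:
        return {"raw_output": text}


fenced = "Here you go:\n" + FENCE + 'json\n{"topic": "RAG"}\n' + FENCE
print(extract_json(fenced))             # JSON inside a fenced block
print(extract_json('{"score": 0.9}'))   # the whole text is JSON
print(extract_json("not JSON at all"))  # fallback to raw text
```

The fallback matters in practice because small local models do not always follow formatting instructions.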

In [None]:
# Build the chain: map the input to {concept}, fill the prompt, call the model, return a string
chain = (
    {"concept": RunnablePassthrough()} 
    | prompt 
    | llm 
    | StrOutputParser()
)

# Execute the chain with a single input
result = chain.invoke("Recursive neural networks")
print(result[:500])
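
The `|` operator passes each step's output to the next step as input. The toy class below imitates that behavior in plain Python to make the data flow visible. It is an analogy under simplified assumptions, not how LangChain implements runnables, and every name in it is hypothetical.

```python
class Step:
    """Toy runnable: wraps a function and supports `a | b` composition."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # The combined step feeds this step's output into the next one
        return Step(lambda value: other.invoke(self.invoke(value)))


to_inputs = Step(lambda concept: {"concept": concept})
fill_prompt = Step(lambda d: f"Explain: {d['concept']}")
fake_model = Step(lambda prompt: f"[model reply to '{prompt}']")  # stand-in, no real model
to_text = Step(str.strip)

toy_chain = to_inputs | fill_prompt | fake_model | to_text
print(toy_chain.invoke("Recursive neural networks"))
```

Each stage here corresponds to one stage of the real chain: input mapping, prompt formatting, model call, and output parsing.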

In [None]:
import numpy as np
from langchain_ollama import OllamaEmbeddings

# Initialize embeddings model with specific parameters
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",  # Embedding model served by Ollama; pull it first with `ollama pull nomic-embed-text`
    base_url="http://localhost:11434",
)

The `nomic-embed-text` model is designed specifically for creating high-quality text embeddings. Unlike general language models that generate text, embedding models focus solely on converting text into meaningful vector representations.

Now let's create an embedding for a sample query and examine its properties:

In [None]:
# Create embeddings for a query
query = "How do neural networks learn?"
query_embedding = embeddings.embed_query(query)
print(f"Embedding dimension: {len(query_embedding)}")

The 768-dimensional vector represents our query as a point in vector space. Individual dimensions are not directly interpretable, but together they encode semantic features of the text. Texts with similar meanings have vectors that point in similar directions.


Next, we'll create embeddings for multiple documents to demonstrate similarity matching:

In [None]:
# Create embeddings for multiple documents
documents = [
    "Neural networks learn through backpropagation",
    "Transformers use attention mechanisms",
    "LLMs are trained on text data"
]

doc_embeddings = embeddings.embed_documents(documents)

The `embed_documents()` method processes multiple texts at once, which is more efficient than calling `embed_query()` repeatedly. This batch processing saves time when working with large document collections.

To find which document best matches our query, we need to measure similarity between vectors. Cosine similarity is the standard approach:

In [None]:
# Calculate similarity between vectors
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Find most similar document to query
similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
most_similar_idx = np.argmax(similarities)
print(f"Most similar document: {documents[most_similar_idx]}")
print(f"Similarity score: {similarities[most_similar_idx]:.3f}")
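
Cosine similarity depends only on the angle between two vectors, not on their lengths. A few hand-picked 2D vectors make the range of values concrete. This sketch uses only the standard library, and the vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine([1.0, 0.0], [2.0, 0.0]))   # same direction, different length
print(cosine([1.0, 0.0], [0.0, 3.0]))   # perpendicular
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Scores near 1 indicate closely related texts, scores near 0 indicate unrelated ones, and negative scores indicate vectors pointing in opposing directions.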

In [None]:
import numpy as np
from langchain_core.documents import Document
from langchain_core.prompts import PromptTemplate
from langchain_ollama import ChatOllama, OllamaEmbeddings

# Initialize components
embeddings = OllamaEmbeddings(model="nomic-embed-text")
chat_model = ChatOllama(model="qwen3:0.6b", temperature=0.3)

# Sample knowledge base representing project documentation
documents = [
    Document(page_content="Python is a high-level programming language known for its simplicity and readability."),
    Document(page_content="Machine learning algorithms can automatically learn patterns from data without explicit programming."),
    Document(page_content="Data preprocessing involves cleaning, changing, and organizing raw data for analysis."),
    Document(page_content="Neural networks are computational models inspired by biological brain networks."),
]

# Create embeddings for all documents
doc_embeddings = embeddings.embed_documents([doc.page_content for doc in documents])

This setup creates a searchable knowledge base from your documents. In production systems, these documents would contain sections from research papers, methodology descriptions, data analysis reports, or code documentation. The embeddings convert each document into vectors that support semantic search.
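
Longer sources such as papers or reports are usually split into smaller pieces before embedding, so that each vector covers one focused passage. The sketch below shows a simple character-based splitter with overlap. The function name, chunk size, and overlap are assumptions chosen for illustration, not values from this notebook.

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into pieces of up to `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", size=4, overlap=1))
```

Each chunk could then be wrapped in a `Document` and embedded the same way as the sample knowledge base above.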

In [None]:
def similarity_search(query, top_k=2):
    """Find the most relevant documents for a query"""
    query_embedding = embeddings.embed_query(query)

    # Calculate cosine similarities
    similarities = []
    for doc_emb in doc_embeddings:
        similarity = np.dot(query_embedding, doc_emb) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb)
        )
        similarities.append(similarity)

    # Get top-k most similar documents
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [documents[i] for i in top_indices]

# Create RAG prompt template
rag_prompt = PromptTemplate.from_template("""
Use the following context to answer the question. If the answer isn't in the context, say so.

Context:
{context}

Question: {question}

Answer:
""")

The `similarity_search` function finds documents most relevant to a question using the embeddings we created earlier. The prompt template structures how we present retrieved context to the language model, instructing it to base answers on the provided documents rather than general knowledge.
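
The top-k step in `similarity_search` relies on `np.argsort`, which sorts in ascending order, so the code keeps the last `top_k` indices and reverses them to put the best match first. The plain-Python sketch below performs the same selection on made-up scores. `top_k_indices` is an illustrative name.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


scores = [0.12, 0.87, 0.45, 0.66]  # made-up similarity scores
print(top_k_indices(scores, 2))
```

The returned indices point back into the document list, which is how the retrieved documents are looked up.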

In [None]:
def answer_question(question):
    """Generate an answer using retrieved context"""
    # Retrieve relevant documents
    relevant_docs = similarity_search(question, top_k=2)
    context = "\n".join([doc.page_content for doc in relevant_docs])

    # Generate answer using context
    prompt_text = rag_prompt.format(context=context, question=question)
    response = chat_model.invoke([{"role": "user", "content": prompt_text}])

    return response.content, relevant_docs

# Test the RAG system
question = "What makes Python popular for data science?"
answer, sources = answer_question(question)

print(f"Question: {question}")
print(f"Answer: {answer}")
print(f"Sources: {[doc.page_content[:50] + '...' for doc in sources]}")