### Model I/O

Instead of storing the model weights as float32, we can store them as float16.

At half precision only about three significant decimal digits survive, so 3.1415927 becomes roughly 3.141.

This is called quantization: reducing the number of bits used to represent the model. Even though it is a lossy operation, it is usually a good trade-off for faster inference and lower memory usage.
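
The precision loss can be checked without loading a model. The sketch below uses only the standard library's `struct` module to round-trip a value through 32-bit and 16-bit floats; the helper name `round_trip` is an illustrative assumption, not part of any library used in this notebook.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Pack a Python float into the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


original = 3.1415927
as_float32 = round_trip(original, "<f")  # 32-bit single precision
as_float16 = round_trip(original, "<e")  # 16-bit half precision

print(f"float32: {as_float32:.7f}")
print(f"float16: {as_float16:.7f}")
print(f"float16 error: {abs(original - as_float16):.7f}")
```

The half-precision value comes back as 3.140625, which uses half the storage of float32 at the cost of an error in the fourth significant digit.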

In [None]:
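import os
# Set HF_TOKEN only if the local key exists (os.environ rejects None)
if os.getenv("HUGGINGFACE_LOCAL_API_KEY"):
    os.environ["HF_TOKEN"] = os.environ["HUGGINGFACE_LOCAL_API_KEY"]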
!wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-fp16.gguf

In [None]:

# On newer LangChain releases, LlamaCpp is imported from langchain_community.llms instead
from langchain import LlamaCpp


# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./Phi-3-mini-4k-instruct-fp16.gguf",
    n_gpu_layers=-1,
    max_tokens=500,
    n_ctx=2048,
    seed=42,
    verbose=False
)

LangChain uses chains to connect modular components, such as prompt templates and memory, to an LLM.

For example, we can chain a prompt template to an LLM.

In [None]:
from langchain import PromptTemplate

# Create a prompt template with the "input_prompt" variable
template = """<s><|user|>
{input_prompt}<|end|>
<|assistant|>"""
prompt = PromptTemplate(
    template=template,
    input_variables=["input_prompt"]
)
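
To see what the model actually receives, the template can be filled in by hand. This is a minimal sketch using plain `str.format` rather than LangChain, under the assumption that the template only does simple variable substitution; the question text is just an example.

```python
# The same template string as above, filled in with plain str.format
template = """<s><|user|>
{input_prompt}<|end|>
<|assistant|>"""

filled = template.format(input_prompt="What is 1 + 1?")
print(filled)
```

The special tokens stay in place and only the `{input_prompt}` placeholder is replaced, so the model sees one user turn followed by an open assistant turn.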

To create our first chain, we connect the prompt template to the LLM with the | operator.




In [None]:
basic_chain = prompt | llm

# Use the chain
basic_chain.invoke(
    {
        "input_prompt": "Hi! My name is Maarten. What is 1 + 1?",
    }
)
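
The `|` composition can be pictured as ordinary function composition. The sketch below is a toy illustration, not LangChain's implementation: `Step`, `toy_prompt`, and `toy_llm` are hypothetical names, and the "model" returns a canned string instead of running inference.

```python
from typing import Any, Callable


class Step:
    """Hypothetical stand-in for a chainable component (not a LangChain class)."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step that runs a, then feeds its output to b
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


# Toy "prompt": fills the template from a dict of variables
toy_prompt = Step(
    lambda variables: "<|user|>\n{input_prompt}<|end|>\n<|assistant|>".format(**variables)
)
# Toy "LLM": returns a canned reply instead of running a model
toy_llm = Step(lambda text: f"(model reply to {len(text)} prompt characters)")

toy_chain = toy_prompt | toy_llm
print(toy_chain.invoke({"input_prompt": "What is 1 + 1?"}))
```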



We can also chain multiple prompts together and feed each one to the LLM in turn. This way the LLM works through a sequence of smaller tasks, with each step building on the output of the previous one.


In [None]:
from langchain import LLMChain

# Create a chain for the title of our story
template = """<s><|user|>
Create a title for a story about {summary}. Only return the title.<|end|>
<|assistant|>"""
title_prompt = PromptTemplate(template=template, input_variables=["summary"])
title = LLMChain(llm=llm, prompt=title_prompt, output_key="title")

In [None]:
title.invoke({"summary": "a girl that lost her mother"})

In [None]:
# Create a chain for the character description using the summary and title
template = """<s><|user|>
Describe the main character of a story about {summary} with the title {title}. Use only two sentences.<|end|>
<|assistant|>"""
character_prompt = PromptTemplate(
    template=template, input_variables=["summary", "title"]
)
character = LLMChain(llm=llm, prompt=character_prompt, output_key="character")

In [None]:

# Create a chain for the story using the summary, title, and character description
template = """<s><|user|>
Create a story about {summary} with the title {title}. The main character is: {character}. Only return the story and it cannot be longer than one paragraph. <|end|>
<|assistant|>"""
story_prompt = PromptTemplate(
    template=template, input_variables=["summary", "title", "character"]
)
story = LLMChain(llm=llm, prompt=story_prompt, output_key="story")

In [None]:
llm_chain = title | character | story

In [None]:
llm_chain.invoke("a girl that lost her mother")
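
The data flow through the three chains can be sketched without a model: each step reads the keys it needs from a shared dictionary and writes its result under its own output key. Everything here is an illustrative assumption, including the `fake_llm` stub, the shortened templates, and the `run_sequence` helper.

```python
from typing import Callable, Dict, List, Tuple


def fake_llm(prompt: str) -> str:
    """Hypothetical stub that stands in for a real model call."""
    return f"<output for: {prompt}>"


# (output_key, template) pairs mirroring the title, character, and story steps
steps: List[Tuple[str, str]] = [
    ("title", "Create a title for a story about {summary}."),
    ("character", "Describe the main character of {title}, a story about {summary}."),
    ("story", "Write {title}, about {summary}, starring {character}."),
]


def run_sequence(summary: str, llm: Callable[[str], str]) -> Dict[str, str]:
    state = {"summary": summary}
    for output_key, template in steps:
        # Each step reads the keys it needs and adds its own output key
        state[output_key] = llm(template.format(**state))
    return state


result = run_sequence("a girl that lost her mother", fake_llm)
print(list(result.keys()))
```

The final dictionary holds the original summary plus one entry per step, which is why the story prompt can refer to both the title and the character description.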

### Memory

LLMs are stateless: they do not remember earlier turns of a conversation. To give them memory, we add either a conversation buffer or a conversation summary to the prompt.

Conversation buffer: the conversation history is added to the prompt directly.



In [None]:
# Create an updated prompt template to include a chat history
template = """<s><|user|>Current conversation:{chat_history}

{input_prompt}<|end|>
<|assistant|>"""

prompt = PromptTemplate(
    template=template,
    input_variables=["input_prompt", "chat_history"]
)

from langchain.memory import ConversationBufferMemory

# Define the type of memory we will use
memory = ConversationBufferMemory(memory_key="chat_history")

# Chain the LLM, prompt, and memory together
llm_chain = LLMChain(
    prompt=prompt,
    llm=llm,
    memory=memory
)

llm_chain.invoke({"input_prompt": "Hi! My name is Barkin. What is 1 + 1?"})
llm_chain.invoke({"input_prompt": "What is my name?"})
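
What the buffer does to the prompt can be shown with a toy version. This is a hedged sketch, not LangChain's class: `ToyBufferMemory` and `build_prompt` are hypothetical names, and the assistant reply is made up rather than generated.

```python
from typing import List, Tuple


class ToyBufferMemory:
    """Toy conversation buffer; a hypothetical sketch, not LangChain's class."""

    def __init__(self) -> None:
        self.turns: List[Tuple[str, str]] = []

    def save(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))

    def as_text(self) -> str:
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self.turns)


def build_prompt(memory: ToyBufferMemory, user_input: str) -> str:
    # Every stored turn is pasted into the prompt ahead of the new question
    return (
        f"<s><|user|>Current conversation:{memory.as_text()}\n\n"
        f"{user_input}<|end|>\n<|assistant|>"
    )


toy_memory = ToyBufferMemory()
# The reply below is invented for illustration, not real model output
toy_memory.save("Hi! My name is Barkin. What is 1 + 1?", "Hello Barkin! 1 + 1 equals 2.")
prompt_text = build_prompt(toy_memory, "What is my name?")
print(prompt_text)
```

Because the first exchange is now part of the second prompt, the model can answer the name question, but the prompt grows with every turn.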

This method is not optimal for long conversations, because the prompt grows with every turn and eventually exceeds the context window. Instead, we can use ConversationBufferWindowMemory, which keeps only the last k exchanges of the conversation.

In [None]:
from langchain.memory import ConversationBufferWindowMemory

# Retain only the last 2 conversations in memory
memory = ConversationBufferWindowMemory(k=2, memory_key="chat_history")

# Chain the LLM, prompt, and memory together
llm_chain = LLMChain(
    prompt=prompt,
    llm=llm,
    memory=memory
)
llm_chain.predict(input_prompt="Hi! My name is Maarten and I am 33 years old. What is 1 + 1?")
llm_chain.predict(input_prompt="What is 3 + 3?")
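
The windowing behaviour can be sketched with a bounded queue. This assumes the memory simply discards the oldest exchange once more than k are stored; the third question and all replies are placeholders added for illustration.

```python
from collections import deque

k = 2
window = deque(maxlen=k)  # the oldest exchange is dropped once k is exceeded

exchanges = [
    ("Hi! My name is Maarten and I am 33 years old. What is 1 + 1?", "placeholder reply 1"),
    ("What is 3 + 3?", "placeholder reply 2"),
    ("What is 5 + 5?", "placeholder reply 3"),
]
for exchange in exchanges:
    window.append(exchange)

remembered_questions = [user for user, _ in window]
print(remembered_questions)
```

After the third exchange, the first one (the only one containing the name and age) has left the window, so a later question about either could no longer be answered from memory.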



Conversation summary: we can ask an LLM (the same model or a separate one) to summarize the conversation so far, and pass that summary into the next prompt in place of the full history.

In [None]:
# Create a summary prompt template
summary_prompt_template = """<s><|user|>Summarize the conversations and update with the new lines.

Current summary:
{summary}

new lines of conversation:
{new_lines}

New summary:<|end|>
<|assistant|>"""
summary_prompt = PromptTemplate(
    input_variables=["new_lines", "summary"],
    template=summary_prompt_template
)

In [None]:
from langchain.memory import ConversationSummaryMemory

# Define the type of memory we will use
memory = ConversationSummaryMemory(
    llm=llm, 
    memory_key="chat_history", 
    prompt=summary_prompt
)
# Chain the LLM, prompt, and memory together
llm_chain = LLMChain(
    prompt=prompt,
    llm=llm,
    memory=memory
)

In [None]:

# Generate a conversation and ask for the name
llm_chain.invoke({"input_prompt": "Hi! My name is Maarten. What is 1 + 1?"})
llm_chain.invoke({"input_prompt": "What is my name?"})
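
The update loop behind a summary memory can be sketched as follows. This is only an outline of the control flow under stated assumptions: `toy_summarize` is a hypothetical stub that truncates text instead of calling a model, `ToySummaryMemory` is not LangChain's class, and the replies are made up.

```python
from typing import Callable


def toy_summarize(summary: str, new_lines: str) -> str:
    """Hypothetical stand-in for the summarizing LLM call.

    A real model would condense the text. This stub only joins the old
    summary with the new lines and truncates, so the result stays bounded.
    """
    combined = f"{summary} | {new_lines}" if summary else new_lines
    return combined[:120]


class ToySummaryMemory:
    """Keeps one running summary instead of the full history."""

    def __init__(self, summarize: Callable[[str, str], str]) -> None:
        self.summary = ""
        self.summarize = summarize

    def save(self, user: str, assistant: str) -> None:
        new_lines = f"Human: {user} AI: {assistant}"
        # The old summary and the new lines go in; an updated summary comes out
        self.summary = self.summarize(self.summary, new_lines)


toy_summary_memory = ToySummaryMemory(toy_summarize)
toy_summary_memory.save("Hi! My name is Maarten. What is 1 + 1?", "(made-up reply) 2")
toy_summary_memory.save("What is my name?", "(made-up reply) Maarten")
print(toy_summary_memory.summary)
```

Only the summary string is carried into the next prompt, so its length stays bounded no matter how long the conversation runs, at the price of one extra summarization call per turn.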

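Choosing between these three memory types usually involves a trade-off between computational cost and output quality: the full buffer preserves everything but uses the most tokens, the window is cheap but forgets older turns, and the summary keeps prompts short at the cost of an extra LLM call and some lost detail.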

### Agents