In [1]:
import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# This is the YouTube video we're going to use.
YOUTUBE_VIDEO = "https://youtu.be/Az9InkxkQQ8"

In [2]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")


Asking a question to test the model.

In [3]:

model.invoke("What is the capital of Australia?")

AIMessage(content='The capital of Australia is Canberra.', response_metadata={'token_usage': {'completion_tokens': 7, 'prompt_tokens': 14, 'total_tokens': 21}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-1391bbca-9e31-4348-a9c8-b4cbbac8e37c-0')

Use an output parser to transform the LLM output into a more suitable format, in this case a plain string.

In [4]:

from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

chain = model | parser
chain.invoke("What is the capital of New Zealand?")

'The capital of New Zealand is Wellington.'

Prompt Templates
Prompt templates are predefined recipes for generating prompts for language models.

In [5]:
from langchain.prompts import ChatPromptTemplate

template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt.format(context="The kiwi is a flightless bird native to New Zealand.", question="What is a flightless bird native to New Zealand?")

'Human: \nAnswer the question based on the context below. If you can\'t \nanswer the question, reply "I don\'t know".\n\nContext: The kiwi is a flightless bird native to New Zealand.\n\nQuestion: What is a flightless bird native to New Zealand?\n'
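
As a rough mental model, formatting a single-message template behaves like Python's own `str.format` with a role prefix added. The sketch below is a simplified stand-in, not LangChain's implementation (the real class builds message objects); `format_prompt` is a hypothetical helper introduced only for illustration.

```python
# Simplified stand-in for ChatPromptTemplate.from_template(...).format(...).
template = """
Answer the question based on the context below. If you can't
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""


def format_prompt(template: str, **values: str) -> str:
    """Fill the {placeholders} and prefix the human role (hypothetical helper)."""
    return "Human: " + template.format(**values)


formatted = format_prompt(
    template,
    context="The kiwi is a flightless bird native to New Zealand.",
    question="What is a flightless bird native to New Zealand?",
)
print(formatted)
```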

Chain the prompt with the model and the output parser.

In [6]:
chain = prompt | model | parser
chain.invoke({
    "context": "The kiwi is a flightless bird native to New Zealand.",
    "question": "What is a flightless bird native to New Zealand?"
})

'The kiwi.'

Chains can be combined to build more complex workflows. Let's create a second chain that translates the answer from the first chain into a different language.

Create a new prompt template for the translation chain:

In [7]:
translation_prompt = ChatPromptTemplate.from_template(
    "Translate {answer} to {language}"
)


In [8]:
from operator import itemgetter

translation_chain = (
    {"answer": chain, "language": itemgetter("language")} | translation_prompt | model | parser
)

translation_chain.invoke(
    {
        "context": "These are the recognized species of kiwi, all of which are native to New Zealand: North Island Brown Kiwi, South Island Brown Kiwi, Rowi, Haast Tokoeka, Little Spotted Kiwi. ",
        "question": "How many species of kiwi are there?",
        "language": "Te reo",
    }
)

'E rima ngā rākau tautoko i ngā rākau kiwi.'
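
The dictionary at the start of `translation_chain` is turned into a parallel step: every value receives the same input dictionary, and the results are collected under the corresponding keys before being passed to `translation_prompt`. A minimal plain-Python sketch of that mapping follows; `fake_answer_chain` and `build_translation_inputs` are hypothetical stand-ins, and no model is called.

```python
from operator import itemgetter


def fake_answer_chain(inputs: dict) -> str:
    """Hypothetical stand-in for the first chain; a real chain would call the model."""
    return "Five"


def build_translation_inputs(inputs: dict) -> dict:
    # Each value is computed from the same input dictionary.
    return {
        "answer": fake_answer_chain(inputs),
        # itemgetter("language") simply pulls the "language" key out of the input.
        "language": itemgetter("language")(inputs),
    }


mapped = build_translation_inputs(
    {
        "context": "Five species of kiwi are recognized.",
        "question": "How many species of kiwi are there?",
        "language": "Te reo",
    }
)
print(mapped)  # {'answer': 'Five', 'language': 'Te reo'}
```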

Transcribing the YouTube Video
Download the video's audio track and transcribe it using OpenAI's Whisper.

In [9]:
import tempfile
import whisper
from pytube import YouTube


# Let's do this only if we haven't created the transcription file yet.
if not os.path.exists("transcription.txt"):
    youtube = YouTube(YOUTUBE_VIDEO)
    audio = youtube.streams.filter(only_audio=True).first()

    # Let's load the base model. This is not the most accurate
    # model but it's fast.
    whisper_model = whisper.load_model("base")

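    with tempfile.TemporaryDirectory() as tmpdir:
        # Download the audio into a temporary directory and transcribe it.
        audio_path = audio.download(output_path=tmpdir)
        transcription = whisper_model.transcribe(audio_path, fp16=False)["text"].strip()

        # Cache the transcription so later runs can skip this step.
        with open("transcription.txt", "w") as file:
            file.write(transcription)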

Read the transcription and display the first 100 characters to ensure everything works as expected.

In [10]:
with open("transcription.txt") as file:
    transcription = file.read()

transcription[:100]

"Hello everyone, I hope you're doing well and of course Arnie does too. Now for today's invasive spec"

Using the entire transcription as context
If we try to invoke the chain using the transcription as context, the model will return an error because the context is too long.

Large language models have a limited context window. The transcription of this video is too long to fit in it, so we need a different approach.

In [11]:

try:
    chain.invoke({
        "context": transcription,
        "question": "When was the brushtail possum introduced to NZ?"
    })
except Exception as e:
    print(e)
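
Before sending a long text to a model, it can help to estimate whether it is likely to fit. The sketch below uses the common rule of thumb of roughly four characters per token for English text. This is an approximation, not a tokenizer; the context limit and the space reserved for the answer are assumed, illustrative numbers, and both function names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Roughly estimate the token count of English text (heuristic only)."""
    return int(len(text) / chars_per_token)


def fits_in_context(text: str, context_limit: int, reserved_for_answer: int = 500) -> bool:
    """Return True if the estimated prompt size leaves room for the answer."""
    return estimate_tokens(text) + reserved_for_answer <= context_limit


# Made-up sample text for illustration: 100,000 characters.
sample = "word " * 20_000
print(estimate_tokens(sample))                        # 25000
print(fits_in_context(sample, context_limit=16_000))  # False
```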

Splitting the transcription

In [12]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("transcription.txt")
text_documents = loader.load()
text_documents

[Document(page_content="Hello everyone, I hope you're doing well and of course Arnie does too. Now for today's invasive species episode, we'll be heading to the beautiful country of New Zealand. Now New Zealand's been separated from the rest of the world for around 20 million years, and because of this it has a very fragile leaco system, which is mainly dominated by large birds, and as there are only two native species of mammal on New Zealand, for a long time it's been a bird's paradise, as they could happily walk on the ground, without any risk and predation. But unfortunately today this is all changed, as I will be going through five invasive species found in New Zealand. And our first species is native to nearby Australia, and it is the common brush tail possum. Now this small mammal is a marsupial and is most active at night, and although it looks quite innocent and cute, it does have a bit of a dark side, as it does feed on various forms of vegetation, such as leaves, fruits and 

For illustration purposes, let's split the transcription into chunks of 100 characters with an overlap of 20 characters and display the first few chunks:

In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
text_splitter.split_documents(text_documents)[:5]

[Document(page_content="Hello everyone, I hope you're doing well and of course Arnie does too. Now for today's invasive", metadata={'source': 'transcription.txt'}),
 Document(page_content="today's invasive species episode, we'll be heading to the beautiful country of New Zealand. Now New", metadata={'source': 'transcription.txt'}),
 Document(page_content="Zealand. Now New Zealand's been separated from the rest of the world for around 20 million years,", metadata={'source': 'transcription.txt'}),
 Document(page_content='20 million years, and because of this it has a very fragile leaco system, which is mainly dominated', metadata={'source': 'transcription.txt'}),
 Document(page_content='is mainly dominated by large birds, and as there are only two native species of mammal on New', metadata={'source': 'transcription.txt'})]
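
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-width splitter. It is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and spaces, which is why the chunks above end on word boundaries. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width chunks that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears intact in at least part of a neighbouring chunk.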

For our specific application, let's use 1000 characters instead:

In [14]:

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
documents = text_splitter.split_documents(text_documents)

Finding the relevant chunks
An embedding is a numerical vector that represents the semantic meaning of a word, sentence, or document. It places a concept in a high-dimensional space, with one useful property: related concepts end up close to each other, while unrelated concepts lie far apart.

In [15]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
embedded_query = embeddings.embed_query("What is a Haast Tokoeka?")

print(f"Embedding length: {len(embedded_query)}")
print(embedded_query[:10])

Embedding length: 1536
[-0.0005827715982723488, -0.000609202125017647, -0.010308774936689582, -0.006866739009750113, -0.00865188318795792, 0.010613809151499412, -0.017026463221596032, -0.001454979155490167, 0.012284565499153532, -0.025844728115881323]


To illustrate how embeddings work, let's first generate the embeddings for two different sentences:

In [16]:
sentence1 = embeddings.embed_query("Privet causes allergies and displaces natives.")
sentence2 = embeddings.embed_query("Wētā are large, nocturnal insects found in forests.")

In [17]:
from sklearn.metrics.pairwise import cosine_similarity

query_sentence1_similarity = cosine_similarity([embedded_query], [sentence1])[0][0]
query_sentence2_similarity = cosine_similarity([embedded_query], [sentence2])[0][0]

query_sentence1_similarity, query_sentence2_similarity

(0.7057594036532578, 0.7454599451763385)
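
In the output above, the query about the Haast Tokoeka scores slightly higher against the second sentence (about wētā) than against the first (about privet). Cosine similarity measures the angle between two vectors: values near 1 mean they point in the same direction, and 0 means they are orthogonal. A minimal pure-Python sketch with made-up two-dimensional vectors:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, not real embeddings.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # parallel vectors, close to 1.0
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors, 0.0

print(same_direction, orthogonal)
```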

To understand how a vector store works, let's create one in memory and add a few embeddings to it:

In [18]:
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore1 = DocArrayInMemorySearch.from_texts(
    [
        "Tuatara are ancient reptiles, unique to NZ.",
        "Wilding pines threaten native ecosystems.",
        "Kākāpō parrots are critically endangered",
        "Gorse spreads rapidly, overtaking native plants.",
        "Kiwi birds are nocturnal and flightless.",
        "Lavish costumes and grand ball scenes.",
        "Regency romance drama set in London.",
    ],
    embedding=embeddings,
)


We can now query the vector store to find the most similar embeddings to a given query:

In [19]:
vectorstore1.similarity_search_with_score(query="Old Man's Beard smothers native vegetation.", k=3)

[(Document(page_content='Gorse spreads rapidly, overtaking native plants.'),
  0.8655200667524214),
 (Document(page_content='Wilding pines threaten native ecosystems.'),
  0.8512902803512785),
 (Document(page_content='Kākāpō parrots are critically endangered'),
  0.7793268980999815)]
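
Conceptually, an in-memory vector store keeps each text next to its embedding and ranks the stored items by similarity to the query embedding. The sketch below illustrates that idea with hand-made three-dimensional vectors in place of real embeddings. `TinyVectorStore` is a hypothetical class and does not reflect how `DocArrayInMemorySearch` is implemented.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class TinyVectorStore:
    """Hypothetical, minimal vector store: a list of (text, vector) pairs."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, text: str, vector: list[float]) -> None:
        self._items.append((text, vector))

    def similarity_search_with_score(self, query_vector: list[float], k: int = 3):
        # Score every stored item against the query and keep the k best.
        scored = [(text, cosine(query_vector, vec)) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


# Made-up axes for illustration: [plants, birds, drama].
store = TinyVectorStore()
store.add("Gorse spreads rapidly, overtaking native plants.", [0.9, 0.1, 0.0])
store.add("Kiwi birds are nocturnal and flightless.", [0.1, 0.9, 0.0])
store.add("Regency romance drama set in London.", [0.0, 0.1, 0.9])

top = store.similarity_search_with_score([1.0, 0.0, 0.0], k=2)
print(top)
```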

Connecting the vector store to the chain
We can use the vector store to find the most relevant chunks from the transcription to send to the model. Here is how we can connect the vector store to the chain:

In [20]:
retriever1 = vectorstore1.as_retriever()
retriever1.invoke("What are Tuatara?")

[Document(page_content='Tuatara are ancient reptiles, unique to NZ.'),
 Document(page_content='Kiwi birds are nocturnal and flightless.'),
 Document(page_content='Kākāpō parrots are critically endangered'),
 Document(page_content='Wilding pines threaten native ecosystems.')]


Our prompt expects two parameters, "context" and "question." We can use the retriever to find the chunks we'll use as the context to answer the question.

We can create a map with the two inputs by using the RunnableParallel and RunnablePassthrough classes. This will allow us to pass the context and question to the prompt as a map with the keys "context" and "question."

In [21]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

setup = RunnableParallel(context=retriever1, question=RunnablePassthrough())
setup.invoke("What are Tuatara?")

{'context': [Document(page_content='Tuatara are ancient reptiles, unique to NZ.'),
  Document(page_content='Kiwi birds are nocturnal and flightless.'),
  Document(page_content='Kākāpō parrots are critically endangered'),
  Document(page_content='Wilding pines threaten native ecosystems.')],
 'question': 'What are Tuatara?'}
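
In plain Python, the map built by `setup` amounts to calling each step with the same input and collecting the results under their keys. The sketch below is a simplification: `RunnableParallel` runs its steps concurrently, while this version runs them one after another. `fake_retriever`, `passthrough`, and `run_parallel` are hypothetical stand-ins.

```python
DOCS = [
    "Tuatara are ancient reptiles, unique to NZ.",
    "Kiwi birds are nocturnal and flightless.",
]


def fake_retriever(question: str) -> list[str]:
    """Hypothetical retriever: keep documents containing a longer word from the question."""
    keywords = {w.strip("?.,").lower() for w in question.split() if len(w.strip("?.,")) > 3}
    return [doc for doc in DOCS if any(word in doc.lower() for word in keywords)]


def passthrough(value):
    """Return the input unchanged, like RunnablePassthrough."""
    return value


def run_parallel(steps: dict, value) -> dict:
    # Call every step with the same input and gather the results by key.
    return {key: step(value) for key, step in steps.items()}


result = run_parallel({"context": fake_retriever, "question": passthrough}, "What are Tuatara?")
print(result)
```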

In [22]:
chain = setup | prompt | model | parser
chain.invoke("What are Tuatara?")

'Tuatara are ancient reptiles, unique to NZ.'

In [23]:
chain.invoke("What animal is a Kākāpō?")

'Kākāpō is a parrot.'

Loading transcription into the vector store

In [24]:
vectorstore2 = DocArrayInMemorySearch.from_documents(documents, embeddings)

In [25]:
chain = (
    {"context": vectorstore2.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)
chain.invoke("When was the brushtail possum introduced to NZ?")

'The brushtail possum was first introduced into New Zealand in 1837.'

Setting up Pinecone

In [26]:
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")

In [27]:
from langchain_pinecone import PineconeVectorStore

index_name = "youtube-index"

pinecone = PineconeVectorStore.from_documents(
    documents, embeddings, index_name=index_name
)

In [28]:
pinecone.similarity_search("What problems do Koi cause?")[:3]

[Document(page_content="very similar to the common carp, and often cause the same problems, as they stir up the bottom of ponds and lakes, muddying the water and creating algal blooms. But Koi also opportunistic feeders, and will feed on the native invertebrates, small fish, and fish eggs. But to help tackle this invasion, people will encourage to bow hunt for this species, and luckily there is one predator that can take on a Koi carp, and that is the long fin eel. And as this species can reach 1.5m long, they're more than able to take on your average sized Koi. So hopefully with the help from these deals, the Koi's numbers can be kept under control. And again, our next species can be found almost anywhere, as we have rats. Now there are three species of rats that can be found on New Zealand today, the Polynesian Rat, the Norway Rat, and the Black Rat. Rats are known for being almost bulletproof, and can survive in some of the worst conditions, as they often seen springing out of sewer

In [29]:
chain = (
    {"context": pinecone.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)

chain.invoke("What problems do Koi cause?")

'Koi often stir up the bottom of ponds and lakes, muddying the water and creating algal blooms. They also feed on native invertebrates, small fish, and fish eggs.'

Finally, we can ask a question by voice: transcribe an audio file with Whisper and pass the resulting text to the chain.

In [30]:

# Path to the audio file containing the spoken question.
audio = ".../question.mp3"

# Load the Whisper base model and transcribe the spoken question.
# A separate variable keeps the video transcription from being overwritten.
whisper_model = whisper.load_model("base")
transcription_string = whisper_model.transcribe(audio, fp16=False)["text"].strip()

print(transcription_string)

K-Cats other in New Zealand. How many domesticated cats other in New Zealand?


In [31]:
chain.invoke(transcription_string)

'There are 1.4 million domesticated cats in New Zealand.'