# Ryzen AI Clip: The YouTube Assistant
Clip is a YouTube assistant that acts as a knowledgeable companion for YouTube-related tasks, providing search capabilities, index building, and query answering functionality. It aims to assist users in finding information, answering questions, and engaging in meaningful conversations about YouTube content.
Here is an overview of the functionality provided by the YouTube assistant:

**YouTube Search**: The assistant can perform a YouTube search using the youtube_search method. It takes a search query as input and retrieves a list of videos matching the query. Each video in the search results is represented as a dictionary with details such as the video title, description, ID, URL, publish time, and channel title.
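
To make the shape of a search result concrete, here is a minimal sketch that maps one search-response item to the dictionary described above. The field names follow the `youtube_search` method defined later in this notebook; `to_video_record` is a hypothetical helper name and the `sample_item` values are made up for illustration.

```python
def to_video_record(index, item):
    """Map one search-response item to the dictionary the assistant works with."""
    video_id = item["id"]["videoId"]
    snippet = item["snippet"]
    return {
        "id": index,
        "title": snippet["title"],
        "description": snippet["description"],
        "video_id": video_id,
        "video_url": f"https://www.youtube.com/watch?v={video_id}",
        "publish_time": snippet["publishTime"],
        "channel_title": snippet["channelTitle"],
    }

# Hypothetical item, hand-written for illustration only.
sample_item = {
    "id": {"videoId": "abc123XYZ_0"},
    "snippet": {
        "title": "Example keynote",
        "description": "An example description.",
        "publishTime": "2024-06-03T00:00:00Z",
        "channelTitle": "Example Channel",
    },
}

record = to_video_record(0, sample_item)
print(record["video_url"])
```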

**Building an Index**: The assistant can build an index from a YouTube video. It requires a YouTube video link as input. The assistant downloads the video transcript and builds a vector index locally using the build_vector_index method. This index can be used to perform efficient searches and fetch information about the video.
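
Before a transcript is embedded it is split into chunks; the agent defined later in this notebook sets `Settings.chunk_size = 128` and `Settings.chunk_overlap = 16`. The sketch below illustrates the idea of overlapping chunks with a plain sliding window over whitespace-separated words. It is a simplification under that assumption: LlamaIndex's own splitter counts tokenizer tokens and respects sentence boundaries, so its chunks will differ. `chunk_words` is a hypothetical name.

```python
def chunk_words(text, chunk_size=128, chunk_overlap=16):
    """Split text into word windows of chunk_size that share chunk_overlap words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks

# Stand-in transcript of 300 numbered words.
transcript = " ".join(f"w{i}" for i in range(300))
chunks = chunk_words(transcript)
print(len(chunks), [len(c.split()) for c in chunks])
```

The overlap means the last 16 words of one chunk reappear at the start of the next, which helps a sentence that straddles a boundary stay retrievable.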

**Querying the Index**: Once an index is built, the assistant can answer user queries about the video using the query_rag method. It takes a user query as input and uses the query engine to find relevant information from the index. The assistant provides a concise and human-like response based on the information available.
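
Retrieval boils down to ranking chunks by similarity to the query and keeping the best few (the agent below uses a top-k of 3). The following toy sketch shows that ranking step with word-count vectors and cosine similarity. It is only an illustration: the notebook uses a learned embedding model, not word counts, and `embed`, `cosine` and `top_k` are hypothetical names.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a real embedding model is learned)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query, chunks, k=3):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "the keynote introduced a new laptop processor",
    "the npu accelerates ai workloads on the laptop",
    "the speaker thanked the audience",
    "partners joined the keynote on stage",
]
best = top_k("what does the npu accelerate", chunks, k=2)
print(best)
```

The retrieved chunks are then placed into the prompt as context, and the LLM writes the final answer from them.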

**Chatting about YouTube Content**: After the index is built, the assistant can engage in a chat with the user about YouTube content. The assistant responds to user queries and provides information based on the index. It aims to answer questions in a natural and helpful manner, thinking step-by-step to provide relevant information.

**Resetting the Assistant**: The assistant can be reset using the reset method. This clears the chat history and resets the state of the assistant, allowing for a fresh interaction with the user.
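
The chat history is kept in a bounded `deque`, so old messages drop off automatically and a reset only has to clear it. A minimal sketch of that behavior, using the same sizes as the agent below (10 exchanges, two entries each):

```python
from collections import deque

n_chat_messages = 10
# Room for one user entry and one assistant entry per exchange.
chat_history = deque(maxlen=n_chat_messages * 2)

for turn in range(15):
    chat_history.append(f"User: question {turn}")
    chat_history.append(f"Assistant: answer {turn}")

size_before = len(chat_history)   # capped at maxlen
oldest = chat_history[0]          # earlier turns were discarded
print(size_before, oldest)

chat_history.clear()              # what a reset does to the history
print(len(chat_history))
```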


## Initial setup for running the Ryzen AI NPU
Note: this notebook has been tested on a Windows OS with [Miniconda 24+](https://docs.anaconda.com/free/miniconda/).
1. Create and activate a virtual environment
1. Install necessary dependencies such as TurnkeyML, Llama Index and others
1. Download the necessary quantized models.

### Create and activate the virtual environment

Create a virtual environment:
```
conda create -n clipenv python=3.11
conda activate clipenv
```

### Install dependencies

- Install turnkeyml, a no-code CLI and low-code API for serving and benchmarking LLMs locally.
- Install llama index, a framework for creating RAG and agent pipelines easily.

In [None]:
%pip uninstall -y transformers

In [None]:
%pip install ipywidgets
%pip install python-dotenv
%pip install turnkeyml[llm-oga-dml]
%pip install google-api-python-client
%pip install llama_index
%pip install llama-index-llms-groq
%pip install llama-index-readers-youtube-transcript
%pip install llama-index-embeddings-huggingface==0.3.1
%pip install llama-index-callbacks-arize-phoenix

In [None]:
%pip install --upgrade llama_index pydantic transformers torch torchvision
%pip install --upgrade llama-index-embeddings-huggingface==0.3.1

### Initialize API keys, Llama Index LLM, embedding model and settings

In [None]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.ERROR)
logging.getLogger().setLevel(logging.ERROR)
logging.getLogger('llama_index').setLevel(logging.ERROR)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

## Setup Observability
See https://llamatrace.com/.

In [None]:
# Phoenix can display in real time the traces automatically
# collected from your LlamaIndex application.
# Run all of your LlamaIndex applications as usual and traces
# will be collected and displayed in Phoenix.

# setup Arize Phoenix for logging/observability
from dotenv import load_dotenv
import llama_index.core
import os

load_dotenv()
phoenix_api_key = os.getenv('PHOENIX_API_KEY')
assert phoenix_api_key, 'Please set PHOENIX_API_KEY!'

os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"api_key={phoenix_api_key}"
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={phoenix_api_key}"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

llama_index.core.set_global_handler(
    "arize_phoenix", endpoint="https://llamatrace.com/v1/traces"
)
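
The cells in this notebook read API keys from the environment (loaded from a `.env` file by `load_dotenv`) and assert that they are set. As a small sketch of the same check with a clearer failure message, one could wrap it in a helper. `require_env` and `EXAMPLE_API_KEY` are hypothetical names used only for this illustration.

```python
import os

def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Please set {name} (for example in a .env file).")
    return value

# Demonstration with a placeholder variable; this is not a real key.
os.environ["EXAMPLE_API_KEY"] = "placeholder-value"
print(require_env("EXAMPLE_API_KEY"))
```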


## Define YouTube Agent Class

We use two main libraries: `googleapiclient` for the YouTube search API and `llama_index` (LlamaIndex) for the RAG and agent framework.

In [None]:
import re
import json
import os
import time
from collections import deque
from dotenv import load_dotenv

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

from websocket import create_connection
from websocket._exceptions import WebSocketTimeoutException

from llama_index.core import (
    VectorStoreIndex,
    Document,
    DocumentSummaryIndex,
    Settings,
    PromptTemplate,
    get_response_synthesizer,
)
from llama_index.core.base.llms.types import ChatMessage
from llama_index.llms.groq import Groq
from llama_index.core.tools import FunctionTool, QueryEngineTool, ToolMetadata
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.youtube_transcript import YoutubeTranscriptReader

# Handle asyncio event loops in Jupyter
import nest_asyncio
nest_asyncio.apply()

class YouTubeAgent():
    """
    The YouTube assistant acts as a knowledgeable companion for
    YouTube-related tasks, providing search capabilities, index building, and
    query answering functionality. It aims to assist users in finding
    information, answering questions, and engaging in meaningful conversations
    about YouTube content.
    """

    def __init__(self, backend="npu", embed_model="local:BAAI/bge-small-en-v1.5"):

        assert backend == "cpu" or backend == "npu" or backend == "groq-api", "Invalid backend specified."
        self.backend = backend

        load_dotenv()
        youtube_api_key = os.getenv('YOUTUBE_API_KEY')
        assert youtube_api_key, 'Please set YOUTUBE_API_KEY!'
        self.youtube = build("youtube", "v3", developerKey=youtube_api_key)

        if backend == "groq-api":
            self.groq_api_key = os.getenv('GROQ_API_KEY')
            assert self.groq_api_key, 'Please set GROQ_API_KEY!'

            self.groq_llm = Groq(
                model="llama-3.1-8b-instant",
                # model="llama3-70b-8192",
                api_key=self.groq_api_key
            )
            self.llm = self.groq_llm
        
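        else:
            # LocalLLM is defined in the "Define a LocalLLM Class" cell further down;
            # run that cell before creating an agent with a local backend.
            self.llm_server_uri = "ws://localhost:8000/ws"
            self.llm = LocalLLM(prompt_llm_server=self.prompt_llm_server)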

        Settings.llm = self.llm
        Settings.embed_model = embed_model

        Settings.chunk_size = 128
        Settings.chunk_overlap = 16
        self.similarity_top = 3

        # Initialize global variables
        self.summary_index = None
        self.vector_index = None
        self.query_engine = None
        self.search_results = None

        self.n_chat_messages = 10
        self.chat_history = deque(maxlen=self.n_chat_messages * 2)  # Store both user and assistant messages

        self.llm_states = [
            # state = 0, no index or search results produced yet.
            ("Index is currently not built and is empty.\n"
            "You need to perform YouTube search using the youtube_search tool before creating the index."),

            # state = 1, search results produced but no index created yet.
            ("Index is currently not built and is empty.\n"
            "YouTube search results have been found:\n"
            f"{self.search_results}\n"
            "Ask the user which result to build the index for.\n"),

            # state = 2, index built and query engine ready.
            ("Index is currently built and is not empty.\n"
            "You can now use the query engine to fetch information about the video.\n"
            "To access the index, use the query engine RAG tool by calling: {\"query_rag\" : \"user_query\"}\n"),
        ]
        # set llm state 0-2
        self.llm_state = 0

        self.llm_system_prompt = (
            "[INST] <<SYS>>\n"
            "You are a YouTube-focused assistant called Clip that helps user with YouTube by calling function tools.\n"
            "You are helpful by providing the necessary json-formatted queries in the form of {\"tool\" : \"query\"}:\n"
            "Do not include the results from the tools.\n"
            "In order to build the index, you have to first search YouTube.\n"
            "You only have ability to call the tools below, do not assume you have access to the output from the tools.\n"
            "1. {\"youtube_search\" : \"user_query\"}\n"
            "2. {\"build_index\" : \"video_id\"}\n"
            "3. {\"query_rag\" : \"user_query\"}\n"
            "4. {\"reset\" : \"\"}\n"
            "Description of parameters:\n"
            "user_query : a query derived from the User comments.\n"
            "video_id : a video ID string from the YouTube search results.\n"
            "Description of tools:\n"
            "1. \"youtube_search\" : this tool is typically called first when no index or query RAG pipelines exist. It takes a user query based on user input and returns a set of YouTube search results that include a description and video id.\n"
            "2. \"build_index\" : this tool is called after the youtube_search tool is called. It takes an input video_id which is a string from the YouTube search results and builds a local index from the video transcript. This is needed to build a query engine before the user can query any topic about the video.\n"
            "3. \"query_rag\" : this tool is used to query the RAG pipeline that is based on the index built from the video transcript. When a user asks anything about the video, this tool is called.\n"
            "4. \"reset\" : this tool is called when the user wants to clear the index and RAG query engine, for example if the user states 'lets start over', 'clear', 'reset', etc., this tool is called.\n"
            "\n"
            "Your tasks:\n"
            "1. Output a json that will be used in an external search tool for YouTube videos\n"
            "2. Call tool that can build an index from a video.\n"
            "3. Chat about YouTube content once index is built\n"
            "4. Answer questions from user using the index\n"
            "\n"
            "Guidelines:\n"
            "- Answer a question given in a natural human-like manner.\n"
            "- Think step-by-step when answering questions.\n"
            "- When introducing yourself, keep it to just a single sentence, for example:\n"
            "\"Assistant: Hi, I can help you find information you're looking for on YouTube. Just ask me about any topic!\"\n"
            "- If no index exists, search YouTube and offer to build one\n"
            "- If an index does exist, use the query engine to answer questions.\n"
            "- If unsure, offer to search for more videos\n"
            "- Keep your answers short, concise and helpful\n"
            "- user_query should be the subject of what the user is looking for, not a youtube link.\n"
            "- Do NOT provide search results, those are being provided by the external tools.\n"
            "- You can only provide the json formatted output to call the tools, you do not have access to the tools directly.\n"
            "When using a tool, end your response with only the tool function call. Do not answer search results.\n"
            "Always use the most relevant tool for each task.\n"
            "When needing to use a tool, your response should be formatted, here is an example script:\n"
            "User: What kind of philanthropy did Mr. Beast do?\n"
            "Assistant: To answer your question, I first need to search YouTube for the answer. Calling the following tool: {\"youtube_search\" : \"Mr Beast philanthropy\"} </s>\n"
            "<</SYS>>\n\n"
        )

        # this system prompt has been verified to work with llama v2 7b 4bit on NPU.
        self.query_engine_system_prompt = (
            "[INST] <<SYS>>\n"
            "You are a YouTube-focused assistant called Clip that helps user with YouTube by calling function tools.\n"
            "{context_str}\n\n"
            "Think step-by-step to answer the query in a crisp, short and concise manner based on the information provided.\n"
            "If the answer does not exist in the given information, simply answer 'I don't know!'\n"
            "Do not mention or refer to the context or information provided in your response.\n"
            "Answer directly without any preamble or explanatory phrases about the source of your information.\n"
            "<</SYS>>\n\n"
            "{query_str} [/INST]\n"
        )

    def youtube_search(self, query, max_results=3):
        """
        Perform a YouTube search with the given query and retrieve a list of videos.
        Args:
            query (str): The search query.
            max_results (int, optional): The maximum number of search results to retrieve. Defaults to 3.
        Returns:
            list: A list of dictionaries representing the videos found in the search results. Each dictionary contains the following keys:
                - id (int): The index of the video in the search results.
                - title (str): The title of the video.
                - description (str): The description of the video.
                - video_id (str): The ID of the video.
                - video_url (str): The URL of the video.
                - publish_time (str): The publish time of the video.
                - channel_title (str): The title of the channel that uploaded the video.
        Raises:
            HttpError: If an HTTP error occurs during the search.
        """
        try:
            msg = f"Running YouTube search with the following query: {query}"
            print(msg)
            self.chat_history.append(f"Assistant: {msg}")
            search_response = self.youtube.search().list( # pylint: disable=E1101
                q=query,
                type="video",
                part="id,snippet",
                maxResults=max_results
            ).execute()

            videos = []
            msg = "Found the following results:"
            print(msg)
            self.chat_history.append(f"Assistant: {msg}")
            for i, search_result in enumerate(search_response.get("items", [])):
                video_id = search_result["id"]["videoId"]
                video = {
                    "id": i,
                    "title": search_result["snippet"]["title"],
                    "description": search_result["snippet"]["description"],
                    "video_id": video_id,
                    "video_url": f"https://www.youtube.com/watch?v={video_id}",
                    "publish_time": search_result["snippet"]["publishTime"],
                    "channel_title": search_result["snippet"]["channelTitle"]
                }
                msg = f'{video["id"]} : {video["title"]}\n\n{video["description"]}\n{video["publish_time"]}    {video["video_id"]}\n\n'
                print(msg)
                self.chat_history.append(f"Assistant: {msg}")
                videos.append(video)

            print(videos)
            return videos

        except HttpError as e:
            print(f"An HTTP error {e.resp.status} occurred:\n{e.content}")
            return None

    def get_video_url(self, video_id:str):
        """
        Returns the URL of a YouTube video based on the given video ID.
        Parameters:
        - video_id (str): The ID of the YouTube video.
        Returns:
        - str: The URL of the YouTube video.
        """
        return f"https://www.youtube.com/watch?v={video_id}"

    def extract_json_data(self, input_string):
        """
        Extracts the key and value from a JSON-formatted string.
        Args:
            input_string (str): The input string containing the JSON-formatted data.
        Returns:
            tuple: A tuple containing the key and value extracted from the JSON data.
                   If the input string does not contain valid JSON data, (None, None) is returned.
        """

        # Find the JSON-formatted part of the string
        json_match = re.search(r'\{.*?\}', input_string)

        if json_match:
            json_str = json_match.group()
            try:
                # Parse the JSON string
                json_data = json.loads(json_str)

                # Extract the key and value
                key, value = next(iter(json_data.items()))

                return key, value
            except json.JSONDecodeError:
                return None, None
        else:
            print("WARNING: No JSON data found in the input string.")
            return None, None

    def get_chat_history(self):
        return list(self.chat_history)
    
    def prompt_llm(self, query):
        if self.backend == "groq-api":
            return self.prompt_groq_llm(query)
        else:
            return self.prompt_local_llm(query)

    def prompt_llm_server(self, prompt):
        try:
            ws = create_connection(self.llm_server_uri)
        except Exception as e:
            print(f"My brain is not working:```{e}```")
            return

        try:
            print(f"Sending prompt to LLM server:\n{prompt}")
            ws.send(prompt)

            first_chunk = True
            self.last_chunk = False
            full_response = ""

            self.last = False

            while True:
                try:
                    if first_chunk:
                        ws.sock.settimeout(None)  # No timeout for first chunk
                    else:
                        ws.sock.settimeout(5)  # 5 second timeout after first chunk

                    chunk = ws.recv()

                    if first_chunk:
                        first_chunk = False

                    if chunk:
                        if "</s>" in chunk:
                            chunk = chunk.replace("</s>", "")
                            full_response += chunk

                            self.last = True

                            yield chunk
                            break
                        full_response += chunk
                        yield chunk

                except WebSocketTimeoutException:
                    break
        finally:
            ws.close()

    def prompt_groq_llm(self, query):
        """
        Prompt the Groq LLM API with a query and return the response.
        Args:
            query (str): The user's query.
        Returns:
            str: The response from the LLM.
        """
        self.chat_history.append(f"User: {query}")
        prompt = '\n'.join(self.chat_history) + "[/INST]\nAssistant: "

        # print("SYSTEM PROMPT:", self.llm_system_prompt)
        # print("USER PROMPT:", prompt)
        system_prompt = (
            f"{self.llm_system_prompt}\n"
            "Current state of index:\n"
            f"{self.llm_states[self.llm_state]}\n"
        )

        messages = [
            ChatMessage(role="system", content=system_prompt),
            ChatMessage(role="user", content=prompt)
        ]
        gen = self.groq_llm.stream_chat(messages)

        response_text = ''
        for response in gen:
            response_text += response.delta
            print(response.delta, end="", flush=True)
        print("\n")

        # Store the accumulated text, not the last streamed chunk object.
        self.chat_history.append(f"Assistant: {response_text}")

        return response_text

    def prompt_local_llm(self, query):
        """
        Prompt the LLM with a query and return the response.
        Args:
            query (str): The user's query.
        Returns:
            str: The response from the LLM.
        """
        response = ""
        self.chat_history.append(f"User: {query}")

        system_prompt = (
            f"{self.llm_system_prompt}\n"
            "Current state of index:\n"
            f"{self.llm_states[self.llm_state]}\n"
        )
        prompt = system_prompt + '\n'.join(self.chat_history) + "[/INST]\nAssistant: "

        # print(prompt)
        for chunk in self.prompt_llm_server(prompt=prompt):
            print(chunk, end="", flush=True)
            response += chunk
        print("\n")
        self.chat_history.append(f"Assistant: {response}")

        return response

    def reset(self):
        self.chat_history.clear()
        self.summary_index = None
        self.vector_index = None
        self.query_engine = None
        self.search_results = None
        self.llm_state = 0
        
    def chat_restarted(self):
        print("Client requested chat to restart")
        self.chat_history.clear()
        intro = "Hi, who are you in one sentence?"
        print("User:", intro)
        try:
            response = self.prompt_llm(intro)
            print(f"Response: {response}")
        except ConnectionRefusedError as e:
            print( f"Having trouble connecting to the LLM server, got:\n{str(e)}!")
        finally:
            self.summary_index = None
            self.vector_index = None
            self.query_engine = None
            self.search_results = None

    def extract_youtube_link(self, message):
        youtube_link_pattern = r"https?://(?:www\.)?(?:youtube\.com|youtu\.be)/(?:watch\?v=)?(?:embed/)?(?:v/)?(?:shorts/)?(?:\S+)"
        match = re.search(youtube_link_pattern, message)
        if match:
            return match.group()
        else:
            return None

    def extract_video_id(self, url):
        patterns = [
            r'(?:https?:\/\/)?(?:www\.)?youtube\.com\/watch\?v=([^&]+)',
            r'(?:https?:\/\/)?(?:www\.)?youtu\.be\/([^?]+)',
            r'(?:https?:\/\/)?(?:www\.)?youtube\.com\/embed\/([^?]+)',
            r'(?:https?:\/\/)?(?:www\.)?youtube\.com\/v\/([^?]+)',
        ]

        for pattern in patterns:
            match = re.search(pattern, url)
            if match:
                return match.group(1)

        return None

    def get_youtube_transcript_doc(self, yt_links: list) -> Document:
        return YoutubeTranscriptReader().load_data(ytlinks=yt_links)

    def build_summary_index(self, doc: Document):
        # from https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary/
        print("Building summary index")
        splitter = SentenceSplitter(chunk_size=1024)
        response_synthesizer = get_response_synthesizer(
            response_mode="tree_summarize", use_async=True
        )
        self.summary_index = DocumentSummaryIndex.from_documents(
            doc,
            transformations=[splitter],
            response_synthesizer=response_synthesizer,
            show_progress=True,
        )
    
    def print_summary(self, doc_id):
        print(self.summary_index.get_document_summary(doc_id))

    def build_vector_index(self, doc: Document) -> VectorStoreIndex:
        start_time = time.perf_counter()
        self.vector_index = VectorStoreIndex.from_documents(doc, show_progress=True)
        end_time = time.perf_counter()
        print(f"Done building index. Took {(end_time - start_time):.1f} seconds to build.")

    def build_query_engine(self):
        # print("Building RAG query engine.")
        assert self.vector_index, "Vector index is not built yet."
        qa_prompt_tmpl = PromptTemplate(self.query_engine_system_prompt)
        self.query_engine = self.vector_index.as_query_engine(
            verbose=True,
            similarity_top_k=self.similarity_top,
            response_mode="compact",
            streaming=True,
        )
        self.query_engine.update_prompts(
            {"response_synthesizer:text_qa_template": qa_prompt_tmpl}
        )

    def get_youtube_tool(self):
        return FunctionTool.from_defaults(fn=self.get_youtube_transcript_doc)
    
    def build_query_engine_tools(self, desc=None):
        assert self.query_engine, "Query engine is not built yet."
        self.query_engine_tools = [
            QueryEngineTool(
                query_engine=self.query_engine,
                metadata=ToolMetadata(
                    name="youtube",
                    description=(
                        f"YouTube transcript of {desc}. "
                        "Use a detailed plain text question as input to the tool."
                    ),
                ),
            ),
        ]

    def run(self, prompt):
        print("Message received:", prompt)

        response = self.prompt_llm(prompt)
        # print(response)

        key, value = self.extract_json_data(response)
        print(f"key: {key}, value: {value}, llm state: {self.llm_state}")
        # print(self.llm_states[self.llm_state])
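        if key == "youtube_search":
            self.search_results = self.youtube_search(value, max_results=3)
            # llm_states[1] was formatted once in __init__, while search_results
            # was still None, so refresh it with the results just found.
            self.llm_states[1] = (
                "Index is currently not built and is empty.\n"
                "YouTube search results have been found:\n"
                f"{self.search_results}\n"
                "Ask the user which result to build the index for.\n"
            )
            self.summary_index = None
            self.vector_index = None
            self.query_engine = None
            self.llm_state = 1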

        elif key == "build_index":
            # value in this case is the video_id
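            # Look the video up without raising IndexError when there is no match
            # (or when no search has been run yet).
            video = next(
                (vid for vid in (self.search_results or []) if vid["video_id"] == value),
                None,
            )

            assert video, f"Video with video_id {value} not found in search results."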
            msg = f"Fetching transcript from {video}."
            print(msg)
            self.chat_history.append(f"Assistant: {msg}")

            video_id = video["video_id"]
            video_url = [self.get_video_url(video_id)]
            doc = self.get_youtube_transcript_doc(video_url)
            doc[0].doc_id = video_id

            # build and print summary index
            # self.build_summary_index(doc)
            # self.print_summary(video_id)

            self.build_vector_index(doc)
            self.build_query_engine()
            self.build_query_engine_tools(desc=video["description"])
            # TODO: self.build_react_agent()

            msg = "Index and query engine are now ready to be used on your PC. Feel free to ask any questions about the video!"
            print(msg)
            self.chat_history.append(f"Assistant: {msg}")
            self.llm_state = 2

        elif key == "query_rag":
            query = value
            print(f"\nQuery: {query}")
            streaming_response = self.query_engine.query(query)
            print("Answer: ", end="", flush=True)
            response = ""
            for text in streaming_response.response_gen:
                if text:
                    response += text
                    print(text, end="", flush=True)

        elif key == "reset":
            msg = "Index and query engine are now cleared. Ready to search YouTube again!"
            print(msg)
            # The class defines reset(), not clear().
            self.reset()
        else:
            pass
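
The `run` method above follows a simple pattern: find the first JSON object in the LLM reply, read its single key as the tool name, and dispatch on it. The standalone sketch below shows that pattern with a dictionary of handlers instead of an `if`/`elif` chain. The handlers are hypothetical stand-ins for the agent's methods, and `extract_tool_call` is an illustrative name, not part of the class.

```python
import json
import re

def extract_tool_call(text):
    """Return (tool, argument) from the first flat JSON object in text, else (None, None)."""
    match = re.search(r"\{.*?\}", text)
    if not match:
        return None, None
    try:
        data = json.loads(match.group())
    except json.JSONDecodeError:
        return None, None
    # Guard against "{}" and non-object JSON, which carry no tool call.
    if not isinstance(data, dict) or not data:
        return None, None
    return next(iter(data.items()))

# Hypothetical handlers standing in for the agent's methods.
handlers = {
    "youtube_search": lambda arg: f"searching for {arg!r}",
    "reset": lambda arg: "state cleared",
}

reply = 'I need to search first. Calling the following tool: {"youtube_search" : "Mr Beast philanthropy"}'
tool, arg = extract_tool_call(reply)
handler = handlers.get(tool)
result = handler(arg) if handler else "no tool call; treat as plain chat"
print(tool, "->", result)
```

Because the regular expression is non-greedy and stops at the first closing brace, this approach only works for flat, single-level JSON objects such as the tool calls described in the system prompt.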



## Implement a Simple Chatbot UI

In [None]:
import ipywidgets as widgets
from IPython.display import display, clear_output

class SimpleChatbot:
    def __init__(self, height=400):
        self.messages = []
        
        self.chat_output = widgets.HTML()
        self.output_area = widgets.Box([self.chat_output], layout=widgets.Layout(
            height=f'{height-50}px',
            border='1px solid #ddd',
            overflow_y='auto'
        ))
        
        self.text_input = widgets.Text(placeholder='Type your message here...',
                                       layout=widgets.Layout(width='80%'))
        self.send_button = widgets.Button(description='Send',
                                          layout=widgets.Layout(width='20%'))
        
        self.send_button.on_click(self.on_send_button_clicked)
        self.text_input.on_submit(self.on_send_button_clicked)
        
        input_area = widgets.HBox([self.text_input, self.send_button])
        self.chat_interface = widgets.VBox([self.output_area, input_area],
                                           layout=widgets.Layout(height=f'{height}px', width='1000px'))
        
    def on_send_button_clicked(self, b):
        user_message = self.text_input.value
        self.text_input.value = ''
        
        self.messages.append(f"You: {user_message}")
        bot_response = self.get_bot_response(user_message)
        self.messages.append(f"Bot: {bot_response}")
        
        self.update_chat_display()
    
    def update_chat_display(self):
        chat_content = '<br>'.join([msg.replace('\n', '<br>') for msg in self.messages])
        self.chat_output.value = f'<div style="white-space: pre-wrap;">{chat_content}</div>'
    
    def get_bot_response(self, user_message):
        # This is a very simple response mechanism. You can replace it with more sophisticated logic.
        return f"You said: '{user_message}'. This is a simple echo bot."
    
    def display(self):
        display(self.chat_interface)

# Usage
chatbot = SimpleChatbot(height=400)
chatbot.display()
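
`update_chat_display` places message text directly into an HTML string, so a message containing markup would be rendered as markup. If that is not wanted, one possible variation is to escape each message first. The sketch below shows only the string handling, without any widgets; `render_chat` is a hypothetical helper name.

```python
import html

def render_chat(messages):
    """Escape each message, then join with <br> so message text cannot inject markup."""
    safe = [html.escape(msg).replace("\n", "<br>") for msg in messages]
    return '<div style="white-space: pre-wrap;">' + "<br>".join(safe) + "</div>"

rendered = render_chat(["You: <b>hi</b>", "Bot: line one\nline two"])
print(rendered)
```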

## Run the application using Groq API

In [None]:
agent = YouTubeAgent(backend="groq-api")

Perform a YouTube search given a user query.

In [None]:
agent.reset()
display(agent.run("What did Lisa Su talk about at Computex 2024?"))

Select the first result and build a local vector index (the summary index step is commented out in `run`).

In [None]:
display(agent.run("Thanks, can you build an index of the first result?"))
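
The agent class also carries an `extract_video_id` helper based on regular expressions. As an alternative sketch, the same extraction can be done with the standard library's URL parser, which separates host, path and query string before any matching happens. `video_id_from_url` is a hypothetical name, and the sketch only covers the `youtube.com` and `youtu.be` hosts; other hosts (for example mobile subdomains) return `None`.

```python
from urllib.parse import urlparse, parse_qs

def video_id_from_url(url):
    """Return the video ID from common YouTube URL shapes, or None if not recognized."""
    parsed = urlparse(url)
    host = parsed.netloc.lower().removeprefix("www.")
    if host == "youtu.be":
        # Short links carry the ID as the path.
        return parsed.path.lstrip("/") or None
    if host == "youtube.com":
        if parsed.path == "/watch":
            # Regular watch links carry the ID in the "v" query parameter.
            return parse_qs(parsed.query).get("v", [None])[0]
        parts = parsed.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] in ("embed", "v", "shorts"):
            return parts[1]
    return None

for url in (
    "https://www.youtube.com/watch?v=abc123&t=10s",
    "https://youtu.be/abc123?t=5",
    "https://www.youtube.com/embed/abc123",
    "https://example.com/watch?v=abc123",
):
    print(url, "->", video_id_from_url(url))
```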

Query the index using a RAG query engine

In [None]:
display(agent.run("How many TFLOPs does the NPU have?"))

In [None]:
# this query fails
display(agent.run("What is the architecture of the Strix NPU?"))

In [None]:
display(agent.run("What is the architecture of the Strix NPU?"))

In [None]:
embed_model = "local:BAAI/bge-base-en-v1.5"
agent = YouTubeAgent(backend="groq-api", embed_model=embed_model)

## Interactive chat

In [None]:
while True:
    try:
        user_input = input("You: ").strip()
        if user_input.lower() == "exit":
            print("Goodbye!")
            break
        elif user_input:
            print("Agent: ", end="", flush=True)
            agent.run(user_input)
        else:
            print("Please enter a valid input.")
    except KeyboardInterrupt:
        print("\nGoodbye!")
        break


## Define a LocalLLM Class
Create a LocalLLM Class that connects a web server to the llama-index framework.

In [None]:
from typing import Any, Callable
import time
import logging
from collections import OrderedDict
from transformers import LlamaTokenizer
from huggingface_hub import HfFolder, HfApi
from websocket import create_connection
from websocket._exceptions import WebSocketTimeoutException
from aiohttp import web
import requests
from llama_index.core.llms import (
    LLMMetadata,
    CustomLLM,
)
from llama_index.core.llms.callbacks import llm_completion_callback

from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
)
from llama_index.core.base.llms.types import (
    ChatMessage,
    ChatResponse,
    CompletionResponse,
    CompletionResponseGen,
    MessageRole,
)

class LocalLLM(CustomLLM):
    prompt_llm_server: Callable = None
    stream_to_ui: Callable = None
    context_window: int = 3900
    num_output: int = 256
    model_name: str = "custom"

    async def achat(
        self,
        messages: Any,
        **kwargs: Any,
    ) -> ChatResponse:

        formatted_message = messages_to_prompt(messages)

        # prompt_llm_server is a synchronous generator of text chunks (see
        # YouTubeAgent.prompt_llm_server), so join the chunks instead of awaiting it.
        text_response = "".join(self.prompt_llm_server(prompt=formatted_message))

        response = ChatResponse(
            message=ChatMessage(
                role=MessageRole.ASSISTANT,
                content=text_response,
                additional_kwargs={},
            ),
            raw={"text": text_response},
        )
        return response

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        # prompt_llm_server yields text chunks, so join them into one string.
        response = "".join(self.prompt_llm_server(prompt=prompt))
        if self.stream_to_ui:
            self.stream_to_ui(response, new_card=True)
        return CompletionResponse(text=response)

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        response = ""
        new_card = True
        for chunk in self.prompt_llm_server(prompt=prompt):

            # Stream chunk to UI when a UI callback was provided
            if self.stream_to_ui:
                self.stream_to_ui(chunk, new_card=new_card)
                new_card = False

            response += chunk
            yield CompletionResponse(text=response, delta=chunk)
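
Both `stream_complete` and the agent's `prompt_llm_server` rely on the same streaming pattern: accumulate chunks into the full text, hand back each new piece as a delta, and stop at the `</s>` end-of-sequence marker. The sketch below isolates that pattern with a plain list in place of the WebSocket. `stream_until_stop` is a hypothetical name, and the sketch assumes the stop token arrives whole inside one chunk.

```python
def stream_until_stop(chunks, stop_token="</s>"):
    """Yield (text_so_far, delta) pairs, ending at the first chunk that contains stop_token."""
    text = ""
    for chunk in chunks:
        done = stop_token in chunk
        delta = chunk.replace(stop_token, "") if done else chunk
        text += delta
        yield text, delta
        if done:
            break

# Stand-in for chunks received from the LLM server.
incoming = ["Hel", "lo ", "world</s>", "ignored"]
pairs = list(stream_until_stop(incoming))
print(pairs[-1][0])
```

A stop token split across two chunks would slip through this check; handling that case would need a small buffer over the tail of the accumulated text.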


## Run the application using NPU OGA

In [None]:
agent = YouTubeAgent(backend="npu")

In [None]:
agent.reset()
display(agent.run("What did Lisa Su talk about at Computex 2024?"))

In [None]:
display(agent.run("Thanks, can you build an index of the first result?"))

In [None]:
display(agent.run("How many TFLOPs does the NPU have?"))