## Basic installation and environment setup

In [None]:
%pip install langchain openai tiktoken youtube-transcript-api

In [13]:
import os
import json

with open("credentials.json") as f:
    credentials = json.load(f)

openai_key = credentials["openai-key"]

os.environ["OPENAI_API_KEY"] = openai_key

## Loading and preparing the data

Document Loaders support many different data sources. The obvious ones include Word documents, CSV files, pandas DataFrames, and PDFs. Less obvious ones include notes from EverNote and Notion, ChatGPT conversations, AZLyrics, and data from Confluence. For data kept in the cloud, document loaders can connect directly to Azure Blob Storage or AWS S3.
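To make the idea of a loader concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: `SimpleDocument` and `load_csv_rows` are hypothetical names, and the file name `talks.csv` is only an illustrative label. The point is that a loader turns a raw source into documents that pair text with metadata about where it came from.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_csv_rows(csv_text: str, source: str) -> list[SimpleDocument]:
    """Turn each CSV row into one document and record its origin in metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row_number, row in enumerate(reader):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append(SimpleDocument(content, {"source": source, "row": row_number}))
    return docs


# Illustrative input; "talks.csv" is an assumed label, not a real file.
sample = "speaker,talk\nAnders Bogsnes,SQLAlchemy and You\n"
docs = load_csv_rows(sample, source="talks.csv")
print(docs[0].page_content)
print(docs[0].metadata)
```

Keeping the source in `metadata` is what later makes it possible to report which video an answer came from.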

In [1]:
from langchain.document_loaders import YoutubeLoader
from langchain.schema import Document

yt_link = [
    "https://www.youtube.com/watch?v=X4-hu3vZAOg&list=PLGVZCDnMOq0qT0MXnci7VBSF-U-0WaQ-w&index=1",
    "https://www.youtube.com/watch?v=DvDxS4uKj5Q&list=PLGVZCDnMOq0qT0MXnci7VBSF-U-0WaQ-w&index=2",
    "https://www.youtube.com/watch?v=wiGkV37Kbxk",
    "https://www.youtube.com/watch?v=ux9OLDBR9RE&list=PLGVZCDnMOq0qgYUt0yn7F80wmzCnj2dEq&index=5",
    "https://www.youtube.com/watch?v=AMJEnkA0YrE&list=PLGVZCDnMOq0qgYUt0yn7F80wmzCnj2dEq&index=12",
    "https://www.youtube.com/watch?v=AGg9NH2XpYs&list=PLGVZCDnMOq0qgYUt0yn7F80wmzCnj2dEq&index=20",
    "https://www.youtube.com/watch?v=BeBVdjENBZo",
]

documents = []
for link in yt_link:
    loader_yt = YoutubeLoader.from_youtube_url(link)
    data_yt = loader_yt.load()
    for item in data_yt:
        documents.append(
            Document(page_content=item.page_content, metadata={"url": link})
        )

In [2]:
documents[0]



In [3]:
import requests
from bs4 import BeautifulSoup


def get_video_title(youtube_url):
    response = requests.get(youtube_url)
    soup = BeautifulSoup(response.content, "html.parser")
    title_tag = soup.find("title")
    video_title = title_tag.text.strip()
    return video_title

In [4]:
docs_with_title = []
for item in documents:
    title = get_video_title(item.metadata["url"])
    url = item.metadata["url"]
    docs_with_title.append(
        Document(page_content=item.page_content, metadata={"url": url, "title": title})
    )

In [5]:
docs_with_title[0]



In [6]:
for doc in docs_with_title:
    title = doc.metadata.get("title")
    if title:
        print(title)

Anders Bogsnes - SQLAlchemy and You - Making SQL the Best Thing Since Sliced Bread - YouTube
Kajanan Sangaralingam and Anindya Datta - Feature Engineering Made Simple | PyData London 2022 - YouTube
Raymond Hettinger: Numerical Marvels Inside Python - Keynote | PyData Tel Aviv 2022 - YouTube
Dina Bavli - Life, Death, and Shopping | PyData Global 2022 - YouTube
Srikanth - Use pandas in tidy style | PyData Global 2022 - YouTube
Aadit Kapoor - Utilizing Word Embeddings and Gradient Boosting | PyData Global 2022 - YouTube
Introduction to data apps with Panel, Pydata Copenhagen, March 2022 - YouTube


In [7]:
import tiktoken


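def num_tokens_from_string(string: str, model_name: str) -> int:
    """Return the number of tokens in a text string for the given model."""
    # encoding_for_model expects a model name (e.g. "gpt-3.5-turbo"),
    # not an encoding name such as "cl100k_base"
    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens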


tokens = []
for document in documents:
    tokens.append(num_tokens_from_string(document.page_content, "gpt-3.5-turbo"))

In [8]:
print(f"Minimum number of tokens in a document: {min(tokens)}\n")
print(f"Maximum number of tokens in a document: {max(tokens)}\n")
print(f"Total number of tokens across all documents: {sum(tokens)}")

Minimum number of tokens in a document: 1385

Maximum number of tokens in a document: 15210

Total number of tokens across all documents: 44887


<div>
<img src="graphics\pricing.png" width="800"/>
</div>
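The token counts matter because the API is billed per token. The arithmetic can be sketched as below. The price used here is a placeholder assumption, not a figure taken from the pricing image, so substitute the current rate for the model you use.

```python
def estimate_cost_usd(num_tokens: int, price_per_1k_tokens: float) -> float:
    """Estimate the cost of processing num_tokens at a per-1,000-token price."""
    return num_tokens / 1000 * price_per_1k_tokens


total_tokens = 44887  # total token count reported for all transcripts
ASSUMED_PRICE_PER_1K = 0.0001  # hypothetical USD price per 1,000 tokens

cost = estimate_cost_usd(total_tokens, ASSUMED_PRICE_PER_1K)
print(f"Estimated cost: ${cost:.6f}")
```

The same function works for embedding and chat models, as long as prompt and completion tokens are priced separately where the provider does so.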

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", "", " "],
    chunk_size=700,
    chunk_overlap=50,
    length_function=len,
)

In [10]:
text = "Pewnego szarego poranka na Dolinę Muminków spadł pierwszy śnieg. Padał miękko i cicho - w parę godzin wszystko było białe.\
Muminek stał na schodkach przed domem patrząc, jak dolina okrywa się zimową kołdrą.\
- Dziś wieczorem - myślał sobie - ułożymy się do długiego zimowego snu. \
(Wszystkie trolle Muminki układają się do snu zimowego gdzieś koło listopada. \
Bardzo to rozsądnie ze strony każdego, kto nie lubi zimna I zimowych ciemności). \
Potem Muminek zamknął za sobą drzwi i poszedł do swojej mamy.\
- Śnieg przyszedł! - powiedział.\
- Wiem - odpowiedziała Mama Muminka. - Macie już wszyscy w łóżeczkach najcieplejsze kołdry. \
Ty będziesz spał w pokoiku na poddaszu po stronie zachodniej razem z Ryjkiem.\
- Ale Ryjek tak okropnie chrapie - powiedział Muminek. - Czy nie mógłbym zamiast z nim spać z Włóczykijem?\
- Jak chcesz - odpowiedziała Mama Muminka. - Ryjek może spać w pokoiku na poddaszu od strony wschodniej."

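# Note: separators are tried in order, and "" matches any text. Because ""
# comes before " " here, the splitter falls back to character-level cuts and
# never tries spaces, so chunks can end mid-word (see the output below).
# The library's default order is ["\n\n", "\n", " ", ""].
text_splitter_muminki = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", "", " "],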
    chunk_size=150,
    chunk_overlap=20,
    length_function=len,
)


chunks = text_splitter_muminki.split_text(text)
chunks

['Pewnego szarego poranka na Dolinę Muminków spadł pierwszy śnieg. Padał miękko i cicho - w parę godzin wszystko było białe.Muminek stał na schodkach pr',
 'stał na schodkach przed domem patrząc, jak dolina okrywa się zimową kołdrą.- Dziś wieczorem - myślał sobie - ułożymy się do długiego zimowego snu. (Ws',
 'go zimowego snu. (Wszystkie trolle Muminki układają się do snu zimowego gdzieś koło listopada. Bardzo to rozsądnie ze strony każdego, kto nie lubi zim',
 'go, kto nie lubi zimna I zimowych ciemności). Potem Muminek zamknął za sobą drzwi i poszedł do swojej mamy.- Śnieg przyszedł! - powiedział.- Wiem - od',
 'wiedział.- Wiem - odpowiedziała Mama Muminka. - Macie już wszyscy w łóżeczkach najcieplejsze kołdry. Ty będziesz spał w pokoiku na poddaszu po stronie',
 'poddaszu po stronie zachodniej razem z Ryjkiem.- Ale Ryjek tak okropnie chrapie - powiedział Muminek. - Czy nie mógłbym zamiast z nim spać z Włóczyki',
 'nim spać z Włóczykijem?- Jak chcesz - odpowiedziała Mama Muminka. - R
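The interaction between `chunk_size` and `chunk_overlap` can be illustrated with a simplified sliding-window splitter. This is only a sketch of the idea, not the algorithm used by `RecursiveCharacterTextSplitter`; the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the same effect as the repeated fragments visible at the chunk boundaries in the output above.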

In [11]:
import hashlib

m = hashlib.md5()

data = []
for doc in docs_with_title:
    url = doc.metadata["url"]
    title = doc.metadata["title"]
    m.update(url.encode("utf-8"))
    video_id = m.hexdigest()[:12]
    chunks = text_splitter.split_text(doc.page_content)
    for i, chunk in enumerate(chunks):
        data.append(
            Document(
                page_content=str(chunk),
                metadata={"id": f"{video_id}-{i}", "source": url, "title": title},
            )
        )

print(data[0])

page_content="thank you very much for being my guinea pigs for uh for today it's always fun to be the first uh first of a pi data london session so as mentioned my name is anders i am norwegian i live in copenhagen and have lived lots of different places i have a background in japanese business actually so i've had a lot of hats i've currently i am as mentioned the head of the python enablement team at modis management so my job is to make sure that we have all the python tools we need build up python infrastructure do courses trainings workshops sort of be internal consultants for all the teams at uds management so you can imagine a lot of quants a lot of trading i'm sure you all you londoners know all a" metadata={'id': 'bd9aa202dfe6-0', 'source': 'https://www.youtube.com/watch?v=X4-hu3vZAOg&list=PLGVZCDnMOq0qT0MXnci7VBSF-U-0WaQ-w&index=1', 'title': 'Anders Bogsnes - SQLAlchemy and You - Making SQL the Best Thing Since Sliced Bread - YouTube'}
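One detail worth noting in the cell above: a single `hashlib.md5()` object is updated for every video, and `update` accumulates input, so each ID after the first depends on all URLs processed before it. If IDs should depend only on the video's own URL, a fresh hash per URL is one option. The helper names below are hypothetical.

```python
import hashlib


def video_id_from_url(url: str, length: int = 12) -> str:
    """Derive a short ID from the URL alone, independent of processing order."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:length]


def chunk_ids(url: str, num_chunks: int) -> list[str]:
    """Build '<video_id>-<chunk_index>' identifiers for every chunk of one video."""
    video_id = video_id_from_url(url)
    return [f"{video_id}-{i}" for i in range(num_chunks)]


url_a = "https://www.youtube.com/watch?v=wiGkV37Kbxk"
url_b = "https://www.youtube.com/watch?v=BeBVdjENBZo"
ids_a = chunk_ids(url_a, 3)
print(ids_a)
```

With this variant, re-running the pipeline on a reordered or shortened link list yields the same IDs for the same videos.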


## Embedding: word embeddings

In [14]:
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings_openai = OpenAIEmbeddings(model="text-embedding-ada-002")

## Storing the data in a vector database

In [15]:
from langchain.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    data, embeddings_openai, path="pydata_db", collection_name="vectorstore"
)

## Defining the model

In [16]:
from langchain.chat_models import ChatOpenAI

chat_07 = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0.7,
    max_tokens=800,
)

In [17]:
chat_0 = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    max_tokens=800,
)

## Defining the chain

In [18]:
from langchain.chains import RetrievalQA

chat_qa_07 = RetrievalQA.from_chain_type(
    llm=chat_07,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
)

In [19]:
chat_qa = RetrievalQA.from_chain_type(
    llm=chat_0,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
)

## Querying the model

In [20]:
query = input("What kind of topic would you like to explore?")

In [21]:
print(query)

feature selection


In [22]:
chat_qa_07.run(query)

'Feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. It is an important step in the machine learning pipeline as it helps to reduce the complexity of the model, improve its accuracy, and prevent overfitting. Feature selection involves generating a set of candidate features, pruning the feature set using a set of checks, and selecting the final set of features that will power the model. The candidate feature set can be generated in various ways, including using a feature explorer, generating custom features, and applying transformations to existing features. The final feature set is selected based on its quality, relevance, and ability to improve model performance.'

In [23]:
chat_qa_07.run(query)

"The process of feature selection involves selecting a subset of relevant features or variables to use in building a model. This helps to improve the accuracy and efficiency of the model, as it reduces the number of irrelevant or redundant features that can negatively impact the model's performance. The first step in feature selection is to generate a set of candidate features, which can be done through various methods such as using a feature explorer, generating custom features, or using transformations. The candidate feature set is then pruned using a set of checks, such as null detection, missing value detection, outlier detection, and biasedness detection. The final selected features are used to power the model."

In [35]:
chat_qa.run(query)

'The process of feature selection involves selecting a subset of relevant features from a larger set of candidate features that will be used to build a predictive model. The first step in feature selection is to generate a set of candidate features, which can be done using various methods such as the feature explorer, custom feature generation, and transformations. Once the candidate feature set is generated, it is pruned using a set of checks such as null detection, invalid entry detection, missing value detection, outlier detection, and biasedness detection. The final set of selected features will be used to build the predictive model.'

In [36]:
chat_qa.run(query)

'The process of feature selection involves selecting a subset of relevant features from a larger set of candidate features that will be used to build a predictive model. The first step in feature selection is to generate a set of candidate features, which can be done using various methods such as the feature explorer, custom feature generation, and transformations. Once the candidate feature set is generated, it is pruned using a set of checks such as null detection, invalid entry detection, missing value detection, outlier detection, and biasedness detection. The final set of selected features should be those that are most relevant to the problem being solved and have the highest predictive power.'

## Improving the chain

In [29]:
from langchain.prompts import PromptTemplate

chat_template = """
You are an AI Assistant providing information about PyData talks available on YouTube. \
You specialize in Python, machine learning, artificial intelligence and data visualization. \
Your audience consists of Python programmers interested in these topics. \
If a question is related to the PyData talks, you provide a relevant answer. \
If the question is about other topics not covered in the videos, kindly respond that it was not a topic \
discussed in any of the lectures on PyData. Do not make up answers.

Add what is the title of the document that were used for answer preparation.

{context}

Question: {question}
Assistant: """

chat_prompt = PromptTemplate(
    template=chat_template, input_variables=["context", "question"]
)

In [30]:
chat_qa = RetrievalQA.from_chain_type(
    llm=chat_0,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": chat_prompt},
)

In [31]:
query = input("What topic would you like to explore?")

In [32]:
print(query)

is there any way that I can improve my pandas experience?


In [33]:
chat_qa.run(query)

'Yes, you can improve your pandas experience by using Tidy Pandas, which offers a simplified index, consistent verbs, and a unified interface for aggregation and assigning your columns across groups. Tidy Pandas also offers an accessor for pandas data frames, which allows you to directly call Tidy Pandas methods on your data frames. Tidy Pandas is available on PyPI and has been actively maintained since its first release in April 2022.'

In [85]:
chat_template = """
You are an AI Assistant providing information about PyData talks available on YouTube. \
You specialize in Python, machine learning, artificial intelligence and data visualization. \
Your audience consists of Python programmers interested in these topics. \
If a question is related to the PyData talks, you provide a relevant answer. \
If the question is about other topics not covered in the videos, kindly respond that it was not a topic \
discussed in any of the lectures on PyData. Do not make up answers.

Add what is the source of your knowledge.

{context}

Question: {question}
Assistant: """

chat_prompt = PromptTemplate(
    template=chat_template, input_variables=["context", "question"]
)

In [86]:
chat_qa = RetrievalQA.from_chain_type(
    llm=chat_0,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": chat_prompt},
)

In [87]:
chat_qa.run("is there any way that I can improve my pandas experience?")

'Yes, there are ways to improve your pandas experience. One way is to use Tidy Pandas, which offers a simplified index, consistent verbs, and a unified interface for aggregation and assigning your columns across groups. Tidy Pandas also offers an accessor for pandas data frames, which allows you to directly call Tidy Pandas methods on your data frames. Tidy Pandas is available on PyPI and has a GitHub repo. Source: PyData talks on YouTube.'

In [88]:
chat_qa.run("Can you tell me what talk exactly is the basis of the answer for your previous question?")

"I'm sorry, I'm not sure which previous question you are referring to. Could you please provide more context or clarify your question? My knowledge is based on PyData talks available on YouTube, specifically related to Python, machine learning, artificial intelligence, and data visualization."

## Returning the source of information

In [34]:
chat_qa = RetrievalQA.from_chain_type(
    llm=chat_0,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": chat_prompt},
    return_source_documents=True,
)

result = chat_qa({"query": "is there any way that I can improve my pandas experience?"})

In [35]:
print(result)

{'query': 'is there any way that I can improve my pandas experience?', 'result': 'Yes, you can improve your pandas experience by using Tidy Pandas, which offers a simplified index, consistent verbs, and a unified interface for aggregation and assigning your columns across groups. It also offers an accessor for pandas data frames, making it easier to use Tidy Pandas methods directly on your data frames. Tidy Pandas is available on PyPI and has been actively maintained since its first release in April 2022.', 'source_documents': [Document(page_content="alysis right and you see another group by an aggregation probably on line seven and eight that that lets you and you got to again rename what you have done in the aggregation so when you do this and you keep constantly printing your data and takes a lot of time and takes you away from your thought process so we believe uh tidy pandas really should help us in that regard and tidy pointers uh really works towards a lot of consistency so if y

In [36]:
sources = [doc.metadata["source"] for doc in result["source_documents"]]
print(sources)

['https://www.youtube.com/watch?v=AMJEnkA0YrE&list=PLGVZCDnMOq0qgYUt0yn7F80wmzCnj2dEq&index=12', 'https://www.youtube.com/watch?v=AMJEnkA0YrE&list=PLGVZCDnMOq0qgYUt0yn7F80wmzCnj2dEq&index=12', 'https://www.youtube.com/watch?v=AMJEnkA0YrE&list=PLGVZCDnMOq0qgYUt0yn7F80wmzCnj2dEq&index=12', 'https://www.youtube.com/watch?v=AMJEnkA0YrE&list=PLGVZCDnMOq0qgYUt0yn7F80wmzCnj2dEq&index=12']
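All four retrieved chunks come from the same video, so the list repeats one URL. When presenting sources to a user, it may be nicer to remove duplicates while keeping the retrieval order. A small sketch, with the list copied from the output above:

```python
sources = [
    "https://www.youtube.com/watch?v=AMJEnkA0YrE&list=PLGVZCDnMOq0qgYUt0yn7F80wmzCnj2dEq&index=12",
    "https://www.youtube.com/watch?v=AMJEnkA0YrE&list=PLGVZCDnMOq0qgYUt0yn7F80wmzCnj2dEq&index=12",
    "https://www.youtube.com/watch?v=AMJEnkA0YrE&list=PLGVZCDnMOq0qgYUt0yn7F80wmzCnj2dEq&index=12",
    "https://www.youtube.com/watch?v=AMJEnkA0YrE&list=PLGVZCDnMOq0qgYUt0yn7F80wmzCnj2dEq&index=12",
]

# dict preserves insertion order, so this drops repeats but keeps the first occurrence.
unique_sources = list(dict.fromkeys(sources))
print(unique_sources)
```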


### Alternative: similarity search

In [37]:
found_docs = vectorstore.similarity_search_with_score(
    "is there any way that I can improve my pandas experience?"
)
document, score = found_docs[0]

print(found_docs[0])

(Document(page_content="alysis right and you see another group by an aggregation probably on line seven and eight that that lets you and you got to again rename what you have done in the aggregation so when you do this and you keep constantly printing your data and takes a lot of time and takes you away from your thought process so we believe uh tidy pandas really should help us in that regard and tidy pointers uh really works towards a lot of consistency so if you see the first three statements and try to pointers so all three of them return your tidy data frame depending on the the method that's called okay so these things could be really painful when you're doing the analysis so this really matters so tidy panda", metadata={'id': 'bf69d53acddf-7', 'source': 'https://www.youtube.com/watch?v=AMJEnkA0YrE&list=PLGVZCDnMOq0qgYUt0yn7F80wmzCnj2dEq&index=12', 'title': 'Srikanth - Use pandas in tidy style | PyData Global 2022 - YouTube'}), 0.8097646888396879)


In [38]:
if sources[0] == document.metadata["source"]:
    print(document.metadata["source"])
else:
    print("Sources do not match")

print(f"Score: {score}")

https://www.youtube.com/watch?v=AMJEnkA0YrE&list=PLGVZCDnMOq0qgYUt0yn7F80wmzCnj2dEq&index=12
Score: 0.8097646888396879
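The score returned next to each document is a similarity between the query embedding and the chunk embedding. Assuming the collection uses cosine similarity (an assumption about the store's configuration, not something shown in this notebook), the computation can be sketched with toy vectors. The vectors and labels below are made up for illustration and are far shorter than real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: list[float], doc_vecs: dict) -> list[tuple]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "embeddings"; real ones have far more dimensions.
toy_docs = {
    "tidy pandas chunk": [0.9, 0.1, 0.0],
    "sqlalchemy chunk": [0.1, 0.9, 0.2],
}
toy_query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(toy_query, toy_docs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A vector store does the same ranking at scale, usually with an approximate index rather than an exhaustive loop.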


----

### Video summarization

In [55]:
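# The output below is the Panel talk, i.e. the last link (yt_link[6]).
# Load it explicitly instead of relying on the leftover loader_yt from the loading loop.
loader = YoutubeLoader.from_youtube_url(yt_link[6])
data_link = loader.load()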

data_link

[Document(page_content="hi there thanks for taking the time in this video i'll give you a recap about a talk i recently gave at pi data copenhagen so i was invited to talk about an introduction to data apps with panel and since the event was not recorded i'll try to recap some of the things i did so first of all i gave an introduction to awesome panel which is the site i've created so panel is a framework for creating powerful data apps and for interactive data exploration in python and here at awesomepanel.org you can find inspiration you can find out if panel is something for you i believe it could be at least panel is something for me and i wrote a blog post about it you can check it out one of the amazing things about panel is that it really works in the notebook and your editor so both in both places and this is like a very unique value proposition panel is open source for free and it's part of a broader family of systems that try to unite the pi data or pybis tools so at awesome 

In [70]:
data_chunks = text_splitter.split_text(data_link[0].page_content)

data = []
for chunk in data_chunks:
    data.append(Document(page_content=str(chunk)))

In [73]:
data_chunks

["hi there thanks for taking the time in this video i'll give you a recap about a talk i recently gave at pi data copenhagen so i was invited to talk about an introduction to data apps with panel and since the event was not recorded i'll try to recap some of the things i did so first of all i gave an introduction to awesome panel which is the site i've created so panel is a framework for creating powerful data apps and for interactive data exploration in python and here at awesomepanel.org you can find inspiration you can find out if panel is something for you i believe it could be at least panel is something for me and i wrote a blog post about it you can check it out one of the amazing thing",
 "t it you can check it out one of the amazing things about panel is that it really works in the notebook and your editor so both in both places and this is like a very unique value proposition panel is open source for free and it's part of a broader family of systems that try to unite the pi d

In [71]:
data

[Document(page_content="hi there thanks for taking the time in this video i'll give you a recap about a talk i recently gave at pi data copenhagen so i was invited to talk about an introduction to data apps with panel and since the event was not recorded i'll try to recap some of the things i did so first of all i gave an introduction to awesome panel which is the site i've created so panel is a framework for creating powerful data apps and for interactive data exploration in python and here at awesomepanel.org you can find inspiration you can find out if panel is something for you i believe it could be at least panel is something for me and i wrote a blog post about it you can check it out one of the amazing thing", metadata={}),
 Document(page_content="t it you can check it out one of the amazing things about panel is that it really works in the notebook and your editor so both in both places and this is like a very unique value proposition panel is open source for free and it's part

In [62]:
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0.7,
    max_tokens=1000,
)

In [72]:
from langchain.chains.summarize import load_summarize_chain

prompt_template = """Write a concise summary of the following:


{text}


CONCISE SUMMARY IN POLISH:"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])
chain = load_summarize_chain(llm, chain_type="stuff", prompt=PROMPT)
chain.run(data)

'Autor w filmie przedstawia framework Panel, który służy do tworzenia aplikacji danych i eksploracji interaktywnych danych w języku Python. Panel jest dostępny jako oprogramowanie open source i jest częścią rodziny narzędzi PyData. Autor prezentuje różne galerie i przykłady aplikacji, a następnie demonstruje, jak łatwo można zbudować aplikację z użyciem Panel. W filmie pokazuje, jak używać widgetów, jak połączyć je z modelem, jak wykorzystać bibliotekę Pandas i H3Plot do wizualizacji danych oraz jak dostosować wygląd aplikacji.'

In [76]:
loader = YoutubeLoader.from_youtube_url(yt_link[5])
data_link = loader.load()
data_chunks = text_splitter.split_text(data_link[0].page_content)

data = []
for chunk in data_chunks:
    data.append(Document(page_content=str(chunk)))

prompt_template = """Write a concise summary of the following video.
Pay attention to what benefits we can take from a video.

{text}


CONCISE SUMMARY IN POLISH:"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])
chain = load_summarize_chain(llm, chain_type="stuff", prompt=PROMPT)
chain.run(data)

'W prezentacji omówiono wykorzystanie analizy logów i algorytmów uczenia maszynowego do identyfikacji, analizy i przewidywania błędów w przemyśle produkcyjnym. Przedstawiono problem konserwacji frezarek dentystycznych oraz wykorzystanie danych logów i sensorów do przewidywania i zapobiegania błędom maszyn. Omówiono wykorzystanie techniki Word Embeddings oraz Gradient Boosting w procesie uczenia maszynowego. Przedstawiono korzyści wynikające z wykorzystania analizy logów, takie jak lepsze zrozumienie danych i możliwość identyfikacji błędów w czasie rzeczywistym, co pozwala na szybszą reakcję i poprawę jakości produktów.'

## What next?

> * saving the chat memory for your own analysis
> * using agents to perform different tasks depending on the context
> * storing the memory as embeddings in a vector database, so that answers can draw on earlier conversations that we liked and consider valuable
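As a starting point for the first item, the question/answer pairs could simply be appended to a JSON Lines file and read back later. This is a minimal sketch using only the standard library; the function names, the file name, and the placeholder answers are all hypothetical, and LangChain's own memory classes are not used here.

```python
import json
import os
import tempfile


def append_exchange(path: str, question: str, answer: str) -> None:
    """Append one question/answer pair to the file as a single JSON line."""
    record = {"question": question, "answer": answer}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_history(path: str) -> list[dict]:
    """Read all stored exchanges back in the order they were written."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A temporary directory keeps the sketch self-contained; answers are placeholders.
history_path = os.path.join(tempfile.mkdtemp(), "chat_history.jsonl")
append_exchange(history_path, "feature selection", "placeholder answer 1")
append_exchange(history_path, "how can I improve my pandas experience?", "placeholder answer 2")

history = load_history(history_path)
print(len(history))
```

The stored records could later be embedded and written to the vector store, which is the direction the third item points to.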

----