In [19]:
from IPython.display import Markdown
from dotenv import load_dotenv
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import requests
import pandas as pd
from sklearn.manifold import TSNE
import numpy as np
import plotly.express as px
from langchain import hub
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DataFrameLoader

In [2]:
load_dotenv()

True

## Setup

In [18]:
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro")
result = llm.invoke("hello!")
print(result.content)

Hello! How can I help you today? 



In [8]:
embedding = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
embedding.embed_query("dog")[0]

-0.009113337844610214
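
An embedding is only useful in comparison with other embeddings. The sketch below shows one common way to compare two vectors, cosine similarity, in plain Python. The three-dimensional vectors are hypothetical stand-ins, not real output of `embed_query`, and `cosine_similarity` is an illustrative helper rather than part of LangChain.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional stand-ins for real embedding vectors.
dog = [0.9, 0.1, 0.0]
puppy = [0.8, 0.2, 0.1]
python_lang = [0.0, 0.2, 0.9]

sim_dog_puppy = cosine_similarity(dog, puppy)
sim_dog_python = cosine_similarity(dog, python_lang)
print(sim_dog_puppy > sim_dog_python)
```

With real embeddings the same comparison would be applied to the lists returned by `embedding.embed_query(...)`.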

## Embeddings


In [6]:
Markdown(llm.invoke(
    "what talks are given at PyData Eindhoven 2024 about sports?"
).content)

Unfortunately, I do not have access to a list of specific talks for PyData Eindhoven 2024. 

* **Conference websites are your best bet:**  Keep an eye on the official PyData Eindhoven website. They will publish a schedule closer to the event date. 
* **Look for past trends:** You can also search for talk titles and descriptions from previous PyData Eindhoven conferences. This might give you an idea of the types of sports-related topics covered in the past.

Good luck finding the information you're looking for! 


## Load PyData schedule

In [29]:
sessions = pd.read_json("data/pydata_eindhoven_2024_sessions.json")

In [11]:
Markdown(sessions["text"].iloc[0])

# Explainable AI in the LIME-light
LIME, a model-agnostic AI framework, illuminates the path to local explainability, primarily for classification models. Delving into the theory underpinning LIME, we explore diverse use cases and its adaptability across various scenarios. Through practical examples, we showcase the breadth of applications for LIME. By the presentation's conclusion, you'll have gained insights into leveraging LIME to clarify individual prediction logic, leading to more accessible explanations.

## Description
Although AI toolkits have simplified model implementation, understanding and interpreting these models remain challenging. With regulatory frameworks like the EU AI Act emphasizing explainability, the need for tools like LIME is paramount.

This presentation will provide an in-depth overview of LIME (Local Interpretable Model-agnostic Explanations), highlighting its utility in facilitating model comprehension. No prior expertise is assumed. Beginning with an explanation of LIME's theory and its practical implementation in Python, we'll then delve into diverse classification scenarios to showcase LIME's effectiveness. Additionally, we'll explore how the original LIME framework has been extended to handle time series data.

## Timeslot
2024-07-11T10:00:00+02:00 with a duration of 00:30

## Room
If (1.1)

## Speaker
 ### Sanne van den Bogaart
For the past 3 years I have been working as a Data Science consultant at Pipple. Since Pipple is active in multiple different sectors, I have had the opportunity to do many different projects. What I have discovered is that explainability of the machine learning used was a critical topic in all of these projects. Fortunately, frameworks like LIME have emerged to provided this much needed explainability. I am excited to discuss more about LIME at the upcoming 2024 PyData Eindhoven conference.



### Load the DataFrame into LangChain documents


In [30]:
loader = DataFrameLoader(sessions, page_content_column="text")
docs = loader.load()
docs

[Document(page_content="# Explainable AI in the LIME-light\nLIME, a model-agnostic AI framework, illuminates the path to local explainability, primarily for classification models. Delving into the theory underpinning LIME, we explore diverse use cases and its adaptability across various scenarios. Through practical examples, we showcase the breadth of applications for LIME. By the presentation's conclusion, you'll have gained insights into leveraging LIME to clarify individual prediction logic, leading to more accessible explanations.\n\n## Description\nAlthough AI toolkits have simplified model implementation, understanding and interpreting these models remain challenging. With regulatory frameworks like the EU AI Act emphasizing explainability, the need for tools like LIME is paramount.\r\n\r\nThis presentation will provide an in-depth overview of LIME (Local Interpretable Model-agnostic Explanations), highlighting its utility in facilitating model comprehension. No prior expertise i

## RAG

In [9]:
vectorstore = FAISS.load_local("data/basic_rag_vectorstore.faiss", embeddings=embedding, allow_dangerous_deserialization=True)

In [None]:
# uncomment to re-create vectors and vectorstore
# vectorstore = FAISS.from_documents(docs, embedding=embedding)

In [12]:
# Retrieve the relevant sessions from the schedule and generate an answer from them.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
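
To make the data flow of the chain above explicit, here is a minimal sketch of the same four steps (retrieve, format, fill the prompt, call the model) using plain functions. Everything in it is a hypothetical stand-in: `CORPUS`, `toy_retriever`, `build_prompt`, and `toy_llm` are illustrative names, and the prompt wording is made up rather than copied from `rlm/rag-prompt`.

```python
import re

# Hypothetical mini-corpus standing in for the session texts.
CORPUS = [
    "Tracking data in sports analytics",
    "Switchback design for marketplace experiments",
]

def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def toy_retriever(question):
    """Stand-in for the vector retriever: keep texts sharing a word with the question."""
    question_tokens = tokenize(question)
    return [text for text in CORPUS if question_tokens & tokenize(text)]

def format_docs(texts):
    """Join the retrieved texts into one context string."""
    return "\n\n".join(texts)

def build_prompt(context, question):
    """Made-up template; the real prompt comes from the LangChain hub."""
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def toy_llm(prompt_text):
    """Stand-in for the chat model: reports what it received instead of answering."""
    return f"[model reply to a prompt of {len(prompt_text)} characters]"

def rag_answer(question):
    context = format_docs(toy_retriever(question))
    return toy_llm(build_prompt(context, question))

print(rag_answer("which talks are about sports?"))
```

The dictionary at the start of `rag_chain` plays the role of the first two lines of `rag_answer`: the question is passed through unchanged while the retriever output is formatted into the context.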

In [20]:
# uncomment to save
# vectorstore.save_local("data/basic_rag_vectorstore.faiss")

In [14]:
embedding.embed_query(
    "which talks are related to sports?"
)

[0.04375917837023735,
 -0.02100740186870098,
 -0.008644950576126575,
 0.009274132549762726,
 0.027995649725198746,
 0.056026890873909,
 0.05677080899477005,
 0.05664321780204773,
 0.006435796618461609,
 -0.017680760473012924,
 -0.06138936057686806,
 -0.03150258585810661,
 -0.00257046427577734,
 -0.02732081711292267,
 -0.028928237035870552,
 0.02447933331131935,
 0.06237539276480675,
 -0.020569467917084694,
 -0.09794401377439499,
 0.0032784054055809975,
 -0.05767378211021423,
 -0.04025986045598984,
 -0.0580415315926075,
 0.006769028026610613,
 -0.012664892710745335,
 -0.06807760894298553,
 0.007474202197045088,
 -0.06782187521457672,
 -0.012073918245732784,
 -0.032192543148994446,
 -0.04036087915301323,
 0.002051006304100156,
 0.04213279113173485,
 0.012033850885927677,
 0.021148746833205223,
 0.016268646344542503,
 -0.05303194746375084,
 -0.020383041352033615,
 0.04073958843946457,
 -0.048284076154232025,
 -0.04711173474788666,
 0.007895693182945251,
 0.03182351961731911,
 -0.003741861

In [14]:
retriever.invoke(
    "which talks are related to sports?"
)

[Document(page_content="# Enhancing Event Analysis at Scale: Leveraging Tracking Data in Sports.\nLearn how to automate the generation of contextual metrics from tracking data to enrich event analysis, handling the influx of games arriving daily in an efficient way by scaling-out the entire architecture.\n\n## Description\nIn the dynamic landscape of sports analytics, the integration of tracking data has opened new frontiers for in-depth event analysis. Yet, the use of this data remains a bottleneck, particularly when dealing with a large volume of games. Indeed, such computation is either too expensive or too long. The focus of the presentation will be on automating the generation of these contextual metrics at scale, and their usage by professionals and decision-makers.\r\nThe presentation will showcase an architecture and an automated pipeline designed to handle the influx of games. Leveraging Python and cloud computing services such as message queues, we efficiently manage incoming

In [21]:
print(
    retriever.invoke(
        "which talks are related to sports?"
    )[0].page_content
)

# Enhancing Event Analysis at Scale: Leveraging Tracking Data in Sports.
Learn how to automate the generation of contextual metrics from tracking data to enrich event analysis, handling the influx of games arriving daily in an efficient way by scaling-out the entire architecture.

## Description
In the dynamic landscape of sports analytics, the integration of tracking data has opened new frontiers for in-depth event analysis. Yet, the use of this data remains a bottleneck, particularly when dealing with a large volume of games. Indeed, such computation is either too expensive or too long. The focus of the presentation will be on automating the generation of these contextual metrics at scale, and their usage by professionals and decision-makers.
The presentation will showcase an architecture and an automated pipeline designed to handle the influx of games. Leveraging Python and cloud computing services such as message queues, we efficiently manage incoming game data by scaling the infra
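
Conceptually, the retriever embeds the question and returns the stored documents whose vectors lie closest to it. The sketch below does this by brute force over a tiny hypothetical index. It assumes squared Euclidean distance, which to my understanding is the default for LangChain's FAISS wrapper; the two-dimensional vectors and the helper name `top_k` are illustrative only.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k documents closest to the query vector."""
    ranked = sorted(doc_vecs, key=lambda name: squared_l2(query_vec, doc_vecs[name]))
    return ranked[:k]

# Hypothetical 2-dimensional vectors; real embeddings have hundreds of dimensions.
doc_vecs = {
    "sports tracking": [0.9, 0.1],
    "tennis vision": [0.8, 0.3],
    "causal forecasting": [0.1, 0.9],
}
query_vec = [1.0, 0.0]  # pretend this is the embedded question about sports

nearest = top_k(query_vec, doc_vecs, k=2)
print(nearest)
```

A real FAISS index avoids this full scan for large collections, but the ranking idea is the same.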

In [26]:
Markdown(rag_chain.invoke(
    "which talks are related to sports? return a bullet list"
))

Here is a list of talks related to sports:

- Enhancing Event Analysis at Scale: Leveraging Tracking Data in Sports.
- Computer vision at the Dutch Tennis Federation: Utilizing YOLO to create insights for coaches.
- Predicting the Spring Classics of cycling with my first neural network.
- How I lost 1000€ betting on CS:GO with machine learning and Python. 


In [15]:
Markdown(
    llm.invoke(
        "what is the RAG talk about? who hosts it? where and when can i go to see it?"
    ).content
)

Please provide me with more context about "RAG talk."  The acronym "RAG" can stand for many things, and without more information, I can't tell you who hosts it, where it is, or when it happens. 

For example, are you referring to:

* **A talk about a specific topic that includes the acronym "RAG"?**  (If so, what is the topic?)
* **A talk hosted by an organization with "RAG" in the name?** (If so, what is the organization?)
* **Something else entirely?**

Please give me more details so I can help you find the information you need! 


In [18]:
Markdown(rag_chain.invoke(
    "what is the RAG talk about? who hosts it? where and when can i go to see it?"
))

The RAG talk is about Retrieval Augmented Generation, a technique used in AI to enhance Large Language Models. It is hosted by Jeroen Overschie, a Machine Learning Engineer at Xebia Data. You can attend the talk on July 11th, 2024, at 11:15 AM in the Else (1.3) room. 


## t-SNE

In [19]:
# word_collection = [
#     "cat",
#     "red cat",
#     "white cat with funny nose",
#     "dog",
#     "pydata",
#     "eindhoven",
#     "python",
#     "rust",
#     "julia",
#     "programming",
#     "lion",
# ]
# word_embeddings = embedding.embed_documents(word_collection)

# # t-SNE
# tsne = TSNE(n_components=2, random_state=42, perplexity=len(word_collection) - 1, max_iter=25000)
# word_embeddings_2d = tsne.fit_transform(
#     np.array(word_embeddings)
# )

# # plot
# word_embeddings_df = pd.DataFrame(word_embeddings_2d, columns=["x", "y"])
# word_embeddings_df["word"] = word_collection
# fig = px.scatter(word_embeddings_df, x="x", y="y", text="word")
# fig.update_traces(textposition='top center')
# fig.show()

## With chunking



In [33]:
vectorstore_chunked = FAISS.load_local("data/basic_rag_vectorstore_chunked.faiss", embeddings=embedding, allow_dangerous_deserialization=True)

In [31]:
# uncomment to reproduce vectorstore
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
# splits = text_splitter.split_documents(docs)
# vectorstore_chunked = FAISS.from_documents(splits, embedding=embedding)
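
To see what `chunk_size=100` and `chunk_overlap=20` mean in practice, here is a simplified sliding-window splitter. It is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and spaces, which this hypothetical `chunk_text` helper does not attempt.

```python
def chunk_text(text, chunk_size=100, chunk_overlap=20):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the final window is not fully contained in the previous one.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

# Synthetic 250-character text so the window boundaries are easy to check.
text = "".join(chr(97 + i % 26) for i in range(250))
chunks = chunk_text(text)

for chunk in chunks:
    print(len(chunk), chunk[:10])
```

Each chunk repeats the last 20 characters of the previous one, so a sentence cut at a boundary still appears intact in at least part of a neighbouring chunk.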

In [32]:
# uncomment to save
# vectorstore_chunked.save_local("data/basic_rag_vectorstore_chunked.faiss")

In [34]:
retriever_chunked = vectorstore_chunked.as_retriever()

In [39]:
rag_chain_chunked = (
    {"context": retriever_chunked | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [35]:
vectorstore.similarity_search(
    query="split testing",
)

[Document(page_content='# Maximizing marketplace experimentation: switchback design for small samples and subtle effects\nConventional A/B testing often falls short in industries such as airlines, ride-sharing, and delivery services, where challenges like small samples and subtle effects complicate testing new features. Inspired by its significant impact in leading companies like Uber, Lyft, and Doordash, we introduce the switchback design as a practical alternative to conventional A/B testing. By addressing small sample size limitations and the need to detect subtle effects quickly, this approach boosts statistical power while reducing variability and interference. We guide the audience through the challenges of marketplace experimentation and implementing this approach, period length optimization and switch frequency using a case study from the airline industry.\n\n## Description\nIn this talk, we introduce switchback design, a method that addresses key challenges in marketplace expe

In [37]:
vectorstore.similarity_search(
    query="a methodology for comparing two versions of a webpage or app against each other to determine which one performs better",
)

[Document(page_content='# Maximizing marketplace experimentation: switchback design for small samples and subtle effects\nConventional A/B testing often falls short in industries such as airlines, ride-sharing, and delivery services, where challenges like small samples and subtle effects complicate testing new features. Inspired by its significant impact in leading companies like Uber, Lyft, and Doordash, we introduce the switchback design as a practical alternative to conventional A/B testing. By addressing small sample size limitations and the need to detect subtle effects quickly, this approach boosts statistical power while reducing variability and interference. We guide the audience through the challenges of marketplace experimentation and implementing this approach, period length optimization and switch frequency using a case study from the airline industry.\n\n## Description\nIn this talk, we introduce switchback design, a method that addresses key challenges in marketplace expe

In [38]:
vectorstore_chunked.similarity_search(
    query="a methodology for comparing two versions of a webpage or app against each other to determine which one performs better",
)

[Document(page_content='and recognize its advantages over conventional A/B testing for more informed decisions and improved', metadata={'guid': '44976e6d-e92c-5a5f-94cd-e32b6c798e5e', 'logo': '', 'date': Timestamp('2024-07-11 14:45:00+0200', tz='UTC+02:00'), 'start': '14:45', 'duration': '00:30', 'room': 'Else (1.3)', 'slug': 'cfp-21-maximizing-marketplace-experimentation-switchback-design-for-small-samples-and-subtle-effects', 'url': 'https://eindhoven2024.pydata.org/cfp/talk/KBYKXY/', 'title': 'Maximizing marketplace experimentation: switchback design for small samples and subtle effects', 'subtitle': '', 'track': None, 'type': 'Talk', 'language': 'en', 'abstract': 'Conventional A/B testing often falls short in industries such as airlines, ride-sharing, and delivery services, where challenges like small samples and subtle effects complicate testing new features. Inspired by its significant impact in leading companies like Uber, Lyft, and Doordash, we introduce the switchback design a

In [None]:
vectorstore_chunked.similarity_search(
    "split testing"
)

In [None]:
retriever_chunked.invoke(
    "split testing"
)

In [None]:
retriever_chunked.invoke(
    "a methodology for comparing two versions of a webpage or app against each other to determine which one performs better"
)

## Case for Hybrid (keyword search)

In [37]:
retriever.invoke(
    "Maximizing "
)

[Document(page_content='# Computer vision at the Dutch Tennis Federation: Utilizing YOLO to create insights for coaches\nThrough single-camera tennis match footage, via a YOLO-driven computer vision system, and culminating in actionable insights for strength and conditioning coaches, the Dutch Tennis Federation offers a pathway for creating tennis data and insights. In our presentation, we will delve into technical specifications and algorithms of our system, navigate through the challenges of working with tennis video footage, and elaborate on our approach to actively engage coaches in our co-creation approach. After the presentation, you will have a deeper understanding of the intricate workings behind implementing such system in a competitive tennis environment. All output of the project will be presented on Github.\n\n## Description\nTennis is seen within the community more as a skill sport than a physical sport. In this way, tennis is an exception compared to other ball sports, in

In [41]:
retriever.invoke(
    "What's the talk starting with Maximizing about?"
)

[Document(page_content="# The Levels of RAG 🦜\nLLM's can be supercharged using a technique called RAG, allowing us to overcome dealbreaker problems like hallucinations or no access to internal data. RAG is gaining more industry momentum and is becoming rapidly more mature both in the open-source world and at major Cloud vendors. But what can we expect from RAG? What is the current state of the tech in the industry? What use-cases work well and which are more challenging? Let's find out together!\n\n## Description\nRetrieval Augmented Generation (RAG) is a popular technique to combine retrieval methods like vector search together with Large Language Models (LLM's). This gives us several advantages like retrieving extra information based on a user search query: allowing us to quote and cite LLM-generated answers. Because the underlying techniques are very broadly applicable, many types of data can be used to build up a RAG system, like textual data, tables, graphs or even images.\r\n\r\n

In [41]:
Markdown(
    rag_chain.invoke(
        "What's the talk starting with Maximizing about?"
    )
)

The talk starting with "Maximizing" is about less common but noteworthy features of scikit-learn and its ecosystem.  It covers topics such as sparse datasets, larger-than-memory datasets, sample weight techniques, and more. The speaker aims to highlight these features and potentially live code some examples. 


In [38]:
retriever.invoke(
    "I'm gonna run "
)

[Document(page_content="# Predicting the Spring Classics of cycling with my first neural network\nLast year I attended PyData Eindhoven for the first time. I got inspired and now I’m back to present my first neural network, a network that was trained to predict the Spring Classics of cycling! With this neural network, I’m attempting to beat my friends, and myself, in a well-known fantasy cycling game.\n\n## Description\nLast year I attended PyData Eindhoven for the first time. I got inspired and now I’m back to present my first neural network, a network that was trained to predict the Spring Classics of cycling! With this neural network, I’m attempting to beat my friends, and myself, in a well-known fantasy cycling game.\r\n\r\nIn this talk, I will elaborate on the process of building a model from scratch. This will include data collection, model training and finetuning, and of course a discussion of the predicted results. The predictions will also be compared to an existing cycling pr

In [40]:
retriever.invoke(
    "predicting the"
)

[Document(page_content='# Causal Forecasting: How to disentangle causal effects, while controlling for unobserved confounders and keeping accuracy\nA lot of industry-available Machine Learning solutions for causal forecasting have a very particular blind spot: unobserved confounders. We will present an approach that allows you to combine state-of-the-art Machine Learning approaches with advanced Econometrics techniques to get the better of both worlds: accurate causal inference and good forecasting accuracy.\n\n## Description\nCausal Forecasting is a very hot topic in the industry with many applications ranging from marketing spending to pricing. Disentangling causal effects from spurious correlations plays a key role when forecasts are used for decision making, such as in the case of pricing. Solutions available in the industry typically rely on Machine Learning methods that use techniques like DoubleML, Transformers, LSTM, and boosted tree algorithms. A common shortcoming of such sol

In [39]:
# Markdown(
#     format_docs(
#         retriever.invoke("I'm gonna run")
#     )
# )
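
The queries above suggest that pure vector search can miss a literal word such as "Maximizing". One possible remedy is to combine the vector ranking with a keyword ranking. The sketch below uses a naive term-overlap score and reciprocal rank fusion; it is not what this notebook implements. The titles are abbreviated from the schedule, `vector_ranking` is a hypothetical ordering that mimics the miss seen above, and `keyword_rank` and `reciprocal_rank_fusion` are illustrative helpers.

```python
import re

def keyword_rank(query, titles):
    """Rank titles by how many of their words appear in the query; drop non-matches."""
    terms = set(re.findall(r"\w+", query.lower()))
    scored = []
    for title in titles:
        words = re.findall(r"\w+", title.lower())
        score = sum(word in terms for word in words)
        if score > 0:
            scored.append((score, title))
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps input order on ties
    return [title for _, title in scored]

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each item scores 1 / (k + rank) per list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item: -scores[item])

titles = [
    "Maximizing marketplace experimentation",
    "Computer vision at the Dutch Tennis Federation",
    "The Levels of RAG",
]

# Hypothetical vector-search order in which the wanted talk comes last.
vector_ranking = [titles[1], titles[2], titles[0]]
keyword_ranking = keyword_rank("Maximizing", titles)

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused[0])
```

Because the "Maximizing" talk appears in both lists, its fused score exceeds that of talks found by the vector ranking alone, which moves it to the top.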