# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.

In [1]:
import pandas as pd
import numpy as np

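# Stop words and punctuation to remove during preprocessing
# (run nltk.download('stopwords') once if the stop-word list is missing)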
from nltk.corpus import stopwords
import string

stop_words = set(stopwords.words('english'))
punctuations = list(string.punctuation)
stop_words.update(punctuations)


from transformers import BertTokenizer, BertModel
import torch

from sklearn.metrics.pairwise import cosine_similarity




### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

The dataset is available on Kaggle: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

In [2]:
df = pd.read_csv('data/podcastdata_dataset.csv')
print(df.head())
corpus = df

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  
0  As part of MIT course 6S099, Artificial Genera...  
1  As part of MIT course 6S099 on artificial gene...  
2  You've studied the human mind, cognition, lang...  
3  What difference between biological neural netw...  
4  The following is a conversation with Vladimir ...  


### Step 3: Text Preprocessing

In [3]:
def clean_text(text):
    # Convert the text to lowercase
    text = text.lower()
    # Remove punctuation marks
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove stop words
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text

In [4]:
# Apply clean_text to the 'text' column and store the result in a new 'cleaned_text' column
df['cleaned_text'] = df['text'].apply(clean_text)
print(df[['text', 'cleaned_text']].head())

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        cleaned_text  
0  part mit course 6s099 artificial general intel...  
1  part mit course 6s099 artificial general intel...  
2  youve studied human mind cognition language vi...  
3  difference biological neural networks artifici...  
4  following conversation vladimir vapnik hes co ...  
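
The output above shows tokens such as `youve` and `hes`: because punctuation is deleted before stop words are filtered, contractions and hyphenated terms are merged into single tokens. The sketch below reproduces that behaviour and contrasts it with a hypothetical variant that replaces punctuation with spaces. `DEMO_STOP_WORDS`, `clean_text_demo`, and `clean_text_spaced` are illustrative names that are not part of the notebook, and the tiny stop-word set stands in for the NLTK list.

```python
import string

# Small stand-in stop-word set (an assumption; the notebook uses NLTK's English list).
DEMO_STOP_WORDS = {"the", "is", "a", "you", "have", "he", "of", "on"}

def clean_text_demo(text, stop_words=DEMO_STOP_WORDS):
    """Same steps as clean_text: lowercase, delete punctuation, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return ' '.join(w for w in text.split() if w not in stop_words)

def clean_text_spaced(text, stop_words=DEMO_STOP_WORDS):
    """Hypothetical variant: replace punctuation with spaces so tokens stay separate."""
    text = text.lower()
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(w for w in text.translate(table).split() if w not in stop_words)

print(clean_text_demo("GPT 3.0"))          # gpt 30
print(clean_text_demo("GPT-3"))            # gpt3
print(clean_text_demo("You've studied"))   # youve studied
print(clean_text_spaced("GPT-3"))          # gpt 3
```

Note that the query "GPT 3.0" becomes `gpt 30` after cleaning, which is worth keeping in mind when interpreting the retrieval scores later in the workshop.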


### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform(df['cleaned_text'])

In [6]:
tfidf_vectors.shape

(319, 49728)

In [7]:
# Convert the TF-IDF matrix to a DataFrame
tfidf_df = pd.DataFrame(tfidf_vectors.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
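
To make the weighting concrete, here is a minimal from-scratch sketch of TF-IDF on a toy corpus. It assumes whitespace tokenization, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows, which to my understanding mirrors the `TfidfVectorizer` defaults apart from tokenization. The function name `tfidf_vectors_demo` and the toy documents are illustrative, not part of the notebook.

```python
import math
from collections import Counter

def tfidf_vectors_demo(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[t] * idf[t] for t in vocab]
        # L2-normalize so that cosine similarity reduces to a dot product.
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows

toy_docs = ["deep learning neural networks", "deep space physics", "poker game theory"]
vocab, rows = tfidf_vectors_demo(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, [round(x, 3) for x in row])
```

In the first toy document, "deep" receives a lower weight than "learning" because it also appears in the second document.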

### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.

In [8]:
# Load the pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

In [9]:
# Define a function that returns BERT embeddings for a list of texts
def generate_bert_embeddings(texts):
    embeddings = []
    for text in texts:
        # Inputs longer than the model limit (512 tokens) are truncated
        inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
        # Inference only, so skip gradient tracking
        with torch.no_grad():
            outputs = bert_model(**inputs)
        embeddings.append(outputs.last_hidden_state[:, 0, :].numpy())  # Use the [CLS] token representation
    return np.array(embeddings).transpose(0, 2, 1)  # Shape: (n_texts, 768, 1)

In [10]:
# Generate the BERT embeddings for the corpus of cleaned texts
cleaned_corpus = df['cleaned_text'].tolist()
corpus_bert_embeddings = generate_bert_embeddings(cleaned_corpus)

In [28]:
corpus_bert_embeddings.shape

(319, 768, 1)
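
Because the tokenizer is called with `truncation=True`, tokens beyond the model's input limit (512 for `bert-base-uncased`, including the two special tokens) are dropped, so each transcript is represented by its opening passage only. One possible improvement, sketched below under stated assumptions, is to split the token sequence into windows, embed each window, and average the results. `split_into_windows` and `embed_long_text` are hypothetical helpers, and `embed_window` is a stand-in for any callable that maps a window of tokens to a vector (for example, a wrapper around the BERT model).

```python
import numpy as np

def split_into_windows(tokens, window_size=510):
    """Split a token list into consecutive windows of at most window_size tokens."""
    return [tokens[i:i + window_size] for i in range(0, len(tokens), window_size)]

def embed_long_text(tokens, embed_window, window_size=510):
    """Average per-window embeddings into a single vector for the whole text.

    embed_window is a hypothetical callable: list of tokens -> 1-D vector.
    """
    windows = split_into_windows(tokens, window_size)
    if not windows:
        raise ValueError("cannot embed an empty token list")
    vectors = np.stack([np.asarray(embed_window(w), dtype=float) for w in windows])
    return vectors.mean(axis=0)

# Toy stand-in embedder: returns the window length and a constant.
toy_embedder = lambda window: [len(window), 1.0]
print(embed_long_text(list(range(1200)), toy_embedder))
```

The default of 510 leaves room for the `[CLS]` and `[SEP]` tokens. Mean pooling is only one option; whether it improves retrieval on this dataset would need to be checked experimentally.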


### Step 6: Query Processing

Define a function that processes the query and computes cosine similarity scores against the TF-IDF vectors.

In [29]:
# Define the retrieval function using TF-IDF
def retrieve_tfidf(query):
    # Make sure the query is a single string
    if isinstance(query, list):
        query = ' '.join(query)
    
    query = clean_text(query)
    query_vector = tfidf_vectorizer.transform([query])
    similitudes = cosine_similarity(tfidf_vectors, query_vector)
    similitudes_df = pd.DataFrame(similitudes, columns=['sims_tfidf'])
    similitudes_df['ep'] = df['title']
    return similitudes_df


In [30]:
query = "GPT 3.0"
result_tfidf = retrieve_tfidf(query)
print(result_tfidf)

     sims_tfidf                                                 ep
0      0.002110                                           Life 3.0
1      0.002588                                      Consciousness
2      0.000000                            AI in the Age of Reason
3      0.000000                                      Deep Learning
4      0.000000                               Statistical Learning
..          ...                                                ...
314    0.001021    Singularity, Superintelligence, and Immortality
315    0.003847   Emotion AI, Social Robots, and Self-Driving Cars
316    0.000000  Comedy, MADtv, AI, Friendship, Madness, and Pr...
317    0.002649                                              Poker
318    0.000000  Biology, Life, Aliens, Evolution, Embryogenesi...

[319 rows x 2 columns]
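
The scores above are cosine similarities between each episode vector and the query vector. As a minimal sketch of what `cosine_similarity` computes, the NumPy function below divides each dot product by the product of the vector norms. The name `cosine_scores` and the toy vectors are illustrative, and returning 0 for all-zero vectors is a choice made for this sketch.

```python
import numpy as np

def cosine_scores(doc_matrix, query_vector):
    """Cosine similarity between every row of doc_matrix and query_vector."""
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    query_vector = np.asarray(query_vector, dtype=float).ravel()
    dots = doc_matrix @ query_vector
    denom = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vector)
    # Rows (or a query) with zero norm get a score of 0 instead of a division error.
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)

toy_docs = [[1, 0], [0, 1], [1, 1], [0, 0]]
print(cosine_scores(toy_docs, [1, 0]))
```

Since TF-IDF weights are non-negative, a score of exactly 0, as for several episodes above, means the episode shares no weighted term with the cleaned query.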


### Step 6 (continued): Query Processing with BERT

Define a function that processes the query and computes cosine similarity scores against the BERT embeddings.

In [31]:
# Define the retrieval function using BERT
def retrieve_bert(query):  
    query = clean_text(query)
    query_bert = generate_bert_embeddings([query])
    query_bert = query_bert.reshape(1, -1)  # Flatten the query embedding to shape (1, 768)
    corpus_bert_embeddings_reshaped = corpus_bert_embeddings.reshape(len(df), -1)  # Flatten the corpus embeddings to shape (n_episodes, 768)
    similitudes = cosine_similarity(corpus_bert_embeddings_reshaped, query_bert)
    similitudes_df = pd.DataFrame(similitudes, columns=['sims_bert'])
    similitudes_df['ep'] = df['title']
    return similitudes_df

In [32]:
query = "GPT 3.0"
result_bert = retrieve_bert(query)
print(result_bert)

     sims_bert                                                 ep
0     0.561838                                           Life 3.0
1     0.590867                                      Consciousness
2     0.591847                            AI in the Age of Reason
3     0.605100                                      Deep Learning
4     0.571484                               Statistical Learning
..         ...                                                ...
314   0.594952    Singularity, Superintelligence, and Immortality
315   0.539820   Emotion AI, Social Robots, and Self-Driving Cars
316   0.569350  Comedy, MADtv, AI, Friendship, Madness, and Pr...
317   0.518504                                              Poker
318   0.540916  Biology, Life, Aliens, Evolution, Embryogenesi...

[319 rows x 2 columns]


### Step 7: Retrieve and Compare Results

Sort each result set by similarity score and keep the top results for both the TF-IDF and BERT representations.

In [33]:
# Number of results to keep per method
top_n = 10

# Get the top 10 TF-IDF results
top_tfidf = result_tfidf.sort_values(by='sims_tfidf', ascending=False).head(top_n)

# Get the top 10 BERT results
top_bert = result_bert.sort_values(by='sims_bert', ascending=False).head(top_n)

In [34]:
# Print the results as separate DataFrames
print("Top 10 TF-IDF results:")
print(top_tfidf)

print("\nTop 10 BERT results:")
print(top_bert)

Top 10 TF-IDF results:
     sims_tfidf                                                 ep
213    0.094562  OpenAI Codex, GPT-3, Robotics, and the Future ...
17     0.031548                                     OpenAI and AGI
120    0.028172                    Friendship with an AI Companion
94     0.027958                                      Deep Learning
117    0.024249  Math, Manim, Neural Networks & Teaching with 3...
119    0.013067                           Measures of Intelligence
130    0.011653  The Future of Computing and Programming Languages
7      0.010065                                             Google
89     0.009755        Cellular Automata, Computation, and Physics
140    0.009667   Economics of AI, Social Networks, and Technology

Top 10 BERT results:
     sims_bert                                                 ep
68    0.653917                                  YouTube Algorithm
296   0.640073                 Marxism, Capitalism, and Economics
168   0.633540

### Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.
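
One way to organize this test is a small wrapper that runs every scoring method on the same query and returns the top results for each. The sketch below is an assumption about how that could look: `search` is a hypothetical helper, and `scorers` maps a method name to any callable that returns one score per episode. The toy scorers exist only to make the block runnable on its own.

```python
import numpy as np

def search(query, scorers, titles, top_n=10):
    """Run each scorer on the query and return {name: [(title, score), ...]}."""
    results = {}
    for name, scorer in scorers.items():
        scores = np.asarray(scorer(query), dtype=float).ravel()
        # Indices of the highest scores first; stable sort keeps ties in corpus order.
        order = np.argsort(-scores, kind="stable")[:top_n]
        results[name] = [(titles[i], float(scores[i])) for i in order]
    return results

# Toy stand-ins for the notebook's retrieval functions.
toy_titles = ["Life 3.0", "Consciousness", "Deep Learning"]
toy_scorers = {
    "tfidf": lambda q: [0.0, 0.1, 0.9],
    "bert": lambda q: [0.6, 0.5, 0.7],
}
for name, hits in search("neural networks", toy_scorers, toy_titles, top_n=2).items():
    print(name, hits)
```

In the notebook, the scorers could wrap the existing functions, for example `lambda q: retrieve_tfidf(q)['sims_tfidf'].to_numpy()` and `lambda q: retrieve_bert(q)['sims_bert'].to_numpy()`, with `titles = df['title'].tolist()`.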

### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.
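
A simple quantitative starting point for the comparison is the overlap between the two top-k lists. The sketch below computes the Jaccard overlap at k (shared items divided by all distinct items in both lists). `jaccard_at_k` is a hypothetical helper and the toy rankings are made up for illustration; they are not results from the dataset.

```python
def jaccard_at_k(ranking_a, ranking_b, k=10):
    """Jaccard overlap between the top-k items of two rankings (0.0 to 1.0)."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0

# Toy rankings of episode indices, best match first.
toy_tfidf_ranking = [213, 17, 120, 94]
toy_bert_ranking = [68, 213, 94, 5]
print(jaccard_at_k(toy_tfidf_ranking, toy_bert_ranking, k=4))
```

In the notebook, the rankings could be taken from `top_tfidf.index.tolist()` and `top_bert.index.tolist()`. A low overlap only says the methods disagree; judging which list is more relevant still requires inspecting the episodes.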

## Instructions:

* Follow the steps outlined above to implement the IR system.
* Run the provided code snippets to understand how each part of the system works.
* Test the system with various queries to observe the results from both TF-IDF and BERT representations.
* Compare and analyze the results. Discuss the pros and cons of each method.
* Document your findings and any improvements you make to the system.