# Workshop: Building an Information Retrieval System for Podcast Episodes
##### Name: Cristina Molina

Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

Instructions:

Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.

In [1]:
# Import the libraries needed for data handling, text processing, and machine learning
import zipfile
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
from transformers import BertTokenizer, BertModel
import torch
from concurrent.futures import ThreadPoolExecutor

# Download the additional NLTK resources (tokenizer models and stop-word lists)
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Step 2: Load the Dataset
Load the dataset of podcast transcripts.

Find the dataset in: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

In [2]:
# Path to the ZIP archive that contains the CSV
ruta_al_zip = 'archive.zip'  # Make sure the ZIP file is in the same directory as this notebook

# Name of the CSV file inside the ZIP
nombre_archivo_csv = 'podcastdata_dataset.csv'

# Extract the CSV from the ZIP
with zipfile.ZipFile(ruta_al_zip, 'r') as zip_ref:
    zip_ref.extractall('../')  # Extracts into the parent directory; change the path if needed

# Load the CSV into a pandas DataFrame
ruta_al_csv_extraido = '../' + nombre_archivo_csv
dataset = pd.read_csv(ruta_al_csv_extraido)


Step 3: Text Preprocessing

Remove punctuation

Remove stop words

In [3]:
# Build the English stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

# Preprocess a text: lowercase, tokenize, drop stop words and punctuation tokens
def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    tokens = [word for word in tokens if word not in STOP_WORDS and word not in string.punctuation]
    return ' '.join(tokens)

# Apply the preprocessing to every transcript in the dataset
dataset['texto_preprocesado'] = dataset['text'].apply(preprocess_text)
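
As a dependency-free illustration of what this step does, the sketch below applies the same three operations (lowercasing, punctuation removal, stop-word removal) using only the standard library. The tiny `DEMO_STOP_WORDS` set and the `simple_preprocess` helper are assumptions made for illustration; the notebook itself relies on NLTK's tokenizer and its full English stop-word list, which will tokenize somewhat differently.

```python
import string

# Hypothetical, deliberately tiny stop-word list (NLTK's English list is much longer)
DEMO_STOP_WORDS = {"the", "and", "a", "is", "of", "to", "in"}

def simple_preprocess(text):
    """Lowercase, strip punctuation characters, and drop stop words."""
    # Map every punctuation character to a space so "AI," becomes "ai"
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.lower().translate(table).split()
    return " ".join(t for t in tokens if t not in DEMO_STOP_WORDS)

print(simple_preprocess("The future of AI, and the limits of the mind!"))
# prints: future ai limits mind
```

Only content-bearing words survive, which is what the TF-IDF step later counts.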


Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.


In [4]:
# Build the TF-IDF matrix (one row per transcript)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(dataset['texto_preprocesado'])
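
To make the weighting concrete, here is a small from-scratch sketch of the formula that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The three toy documents and the `tfidf_rows` helper are assumptions for illustration, and the whitespace tokenization is simpler than scikit-learn's.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) where each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows

docs = ["robots learn language", "robots play chess", "language models learn"]
vocab, rows = tfidf_rows(docs)
# Weights for the second document: rarer terms receive larger weights
for term, weight in zip(vocab, rows[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the second toy document, "chess" and "play" occur in one document each and therefore outweigh "robots", which occurs in two.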


Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.


In [5]:
# Load the pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()  # make sure dropout is disabled for inference

# Get the BERT embedding of a single text.
# Note: truncation keeps only the first 512 tokens, so a long transcript is represented by its opening.
def get_bert_embeddings(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token vector, shape (1, 768)
    return cls_embedding

# Get BERT embeddings for many texts using a thread pool
def get_bert_embeddings_parallel(texts):
    with ThreadPoolExecutor(max_workers=None) as executor:
        embeddings = list(executor.map(get_bert_embeddings, texts))
    return torch.cat(embeddings, dim=0)  # shape (n_texts, 768)

# Compute BERT embeddings for every preprocessed transcript in the dataset
bert_embeddings = get_bert_embeddings_parallel(dataset['texto_preprocesado'])

# Show the shape of the embeddings
print(f'Forma de los embeddings BERT: {bert_embeddings.shape}')



Forma de los embeddings BERT: torch.Size([319, 768])
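
`bert-base-uncased` accepts at most 512 tokens per input, so with `truncation=True` each episode is represented only by the beginning of its transcript. One common workaround, sketched below under assumptions, is to split the token sequence into overlapping windows, embed each window, and average the window vectors. The `embed_window` callable is a hypothetical stand-in for a function such as `get_bert_embeddings`, and the window and stride sizes are illustrative values, not tuned ones.

```python
def split_into_windows(tokens, window=510, stride=384):
    """Split a token list into overlapping windows of at most `window` tokens.

    510 leaves room for the [CLS] and [SEP] tokens within BERT's 512-token limit.
    """
    if not tokens:
        return []
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return windows

def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("mean_pool needs at least one vector")
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]

def embed_long_text(tokens, embed_window, window=510, stride=384):
    """Embed every window with `embed_window` (hypothetical) and average the results."""
    windows = split_into_windows(tokens, window, stride)
    return mean_pool([embed_window(w) for w in windows])

# Toy demonstration with a fake 2-dimensional "embedding": (window length, first token id)
tokens = list(range(1000))
fake_embed = lambda w: [float(len(w)), float(w[0])]
print(len(split_into_windows(tokens)))      # number of windows
print(embed_long_text(tokens, fake_embed))  # averaged fake embedding
```

Averaging is only one pooling choice; taking the maximum similarity over windows at query time is another option with different trade-offs.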


Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.


In [6]:
# Process a query and rank episodes by similarity under TF-IDF and BERT
def procesar_consulta(query, tfidf_matrix, bert_embeddings, dataset):
    # Preprocess the query the same way as the transcripts
    clean_query = preprocess_text(query)

    # TF-IDF similarity: rows are L2-normalized by default, so the dot product is the cosine similarity
    # (uses the global tfidf_vectorizer fitted in Step 4)
    tfidf_scores = tfidf_matrix @ tfidf_vectorizer.transform([clean_query]).T

    # BERT embedding of the query
    query_embedding = get_bert_embeddings(clean_query)

    # Cosine similarity between the query and every transcript embedding
    bert_scores = torch.cosine_similarity(query_embedding, bert_embeddings, dim=1)

    # Indices sorted by descending similarity for TF-IDF and BERT
    tfidf_indices = tfidf_scores.toarray().flatten().argsort()[::-1]
    bert_indices = bert_scores.argsort(descending=True).numpy()

    # Episode titles in ranked order
    tfidf_results = dataset.iloc[tfidf_indices]['title']
    bert_results = dataset.iloc[bert_indices]['title']

    return tfidf_results, bert_results
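
One detail of the ranking above is that `argsort` orders every episode, including those whose TF-IDF score is exactly 0, so a query that matches only a few transcripts still yields a full list whose tail carries no information. A possible refinement, sketched here with plain Python lists, keeps only positive scores before taking the top k. The `top_k_nonzero` helper and the toy scores are assumptions for illustration.

```python
def top_k_nonzero(scores, k=10):
    """Return up to k (index, score) pairs with score > 0, best first."""
    ranked = sorted(
        ((i, s) for i, s in enumerate(scores) if s > 0),
        key=lambda pair: (-pair[1], pair[0]),  # score descending, index ascending on ties
    )
    return ranked[:k]

# Toy scores: only three of six "episodes" mention the query term
scores = [0.0, 0.12, 0.0, 0.40, 0.0, 0.12]
print(top_k_nonzero(scores, k=5))
# prints: [(3, 0.4), (1, 0.12), (5, 0.12)]
```

Breaking ties by index keeps the output deterministic, which makes runs easier to compare.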


Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.


In [12]:
# Retrieve and print the top results for both representations
def retrieve_results(query, tfidf_matrix, bert_embeddings, dataset):
    tfidf_results, bert_results = procesar_consulta(query, tfidf_matrix, bert_embeddings, dataset)

    print("Resultados de TF-IDF:")
    print(tfidf_results.head(10))  # Show the top 10 results

    print("\nResultados de BERT:")
    print(bert_results.head(10))  # Show the top 10 results

# Try the system with a sample query
query = "Duncan"
retrieve_results(query, tfidf_matrix, bert_embeddings, dataset)


Resultados de TF-IDF:
126    Conversations, Ideas, Love, Freedom & The Joe ...
305    Comedy, Sentient Robots, Suffering, Love & Bur...
122    Origin of Life, Humans, Ideas, Suffering, and ...
163    Sleep, Dreams, Creativity & the Limits of the ...
170                                              Bitcoin
102                      Artificial General Intelligence
103               Computer Architecture and Data Storage
104                                   Edison of Medicine
105         Neuroscience, Psychology, and AI at DeepMind
106                 Suffering in Humans, Animals, and AI
Name: title, dtype: object

Resultados de BERT:
296                   Marxism, Capitalism, and Economics
302    Doom, Quake, VR, AGI, Programming, Video Games...
137           Ayn Rand and the Philosophy of Objectivism
168           Solving Martial Arts from First Principles
284                                      Imagine Dragons
219    Cyc and the Quest to Solve Common Sense Reason...
272               

Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.


In [8]:
# Sample query to test the system
query = "Artificial Intelligence"

# Retrieve and display the top results using TF-IDF and BERT
retrieve_results(query, tfidf_matrix, bert_embeddings, dataset)


Resultados de TF-IDF:
2                                AI in the Age of Reason
61      Concepts, Analogies, Common Sense & Future of AI
119                             Measures of Intelligence
38          Keras, Deep Learning, and the Progress of AI
295    IQ Tests, Human Intelligence, and Group Differ...
12                           Brains, Minds, and Machines
0                                               Life 3.0
91     Square, Cryptocurrency, and Artificial Intelli...
1                                          Consciousness
75      Universal Artificial Intelligence, AIXI, and AGI
Name: title, dtype: object

Resultados de BERT:
296                   Marxism, Capitalism, and Economics
168           Solving Martial Arts from First Principles
3                                          Deep Learning
223    Neuromorphic Computing and Optoelectronic Inte...
256    Dark Matter of Intelligence and Self-Supervise...
286    Reality is an Illusion – How Evolution Hid the...
165    Deep Work, 

Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.
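
One way to put a number on the comparison, in addition to reading the two lists side by side, is to measure how much the top-k results overlap. The sketch below computes the Jaccard overlap of two ranked lists; the `overlap_at_k` helper and the episode indices are made-up values for illustration, not results from this notebook.

```python
def overlap_at_k(ranking_a, ranking_b, k=10):
    """Jaccard overlap between the top-k items of two rankings (0 = disjoint, 1 = same set)."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0

# Hypothetical episode indices returned by two retrieval methods
method_a = [5, 9, 2, 7, 1]
method_b = [9, 4, 5, 8, 3]
print(overlap_at_k(method_a, method_b, k=5))
# prints: 0.25
```

A low overlap does not say which method is better; deciding that requires relevance judgments for each query.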


In [9]:
# Analysis and comparison of results
# This cell documents observations about the strengths and weaknesses of each method based on the retrieval results

def compare_results(query, tfidf_matrix, bert_embeddings, dataset):
    tfidf_results, bert_results = procesar_consulta(query, tfidf_matrix, bert_embeddings, dataset)

    print("Comparación de Resultados para la consulta:", query)
    print("\nResultados de TF-IDF:")
    print(tfidf_results.head(10))  # Show the top 10 results

    print("\nResultados de BERT:")
    print(bert_results.head(10))  # Show the top 10 results

    # Document the observations
    print("\nObservations:")
    print("TF-IDF:")
    print("- Strengths: Fast to compute; works well when the query terms appear literally in the transcript.")
    print("- Weaknesses: May miss semantic context, such as synonyms or paraphrases.")

    print("BERT:")
    print("- Strengths: Captures semantic context; can handle variation in wording.")
    print("- Weaknesses: Slower to compute; requires more computational resources.")

# Try the comparison with a sample query
compare_results("Artificial Intelligence", tfidf_matrix, bert_embeddings, dataset)


Comparación de Resultados para la consulta: Artificial Intelligence

Resultados de TF-IDF:
2                                AI in the Age of Reason
61      Concepts, Analogies, Common Sense & Future of AI
119                             Measures of Intelligence
38          Keras, Deep Learning, and the Progress of AI
295    IQ Tests, Human Intelligence, and Group Differ...
12                           Brains, Minds, and Machines
0                                               Life 3.0
91     Square, Cryptocurrency, and Artificial Intelli...
1                                          Consciousness
75      Universal Artificial Intelligence, AIXI, and AGI
Name: title, dtype: object

Resultados de BERT:
296                   Marxism, Capitalism, and Economics
168           Solving Martial Arts from First Principles
3                                          Deep Learning
223    Neuromorphic Computing and Optoelectronic Inte...
256    Dark Matter of Intelligence and Self-Supervise...
286   