# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

## Steps:

### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.

### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

Find the dataset at: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

### Step 3: Text Preprocessing

You know what to do ;)

### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.

### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.

### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.

### Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.

### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.

## Instructions:

* Follow the steps outlined above to implement the IR system.
* Run the provided code snippets to understand how each part of the system works.
* Test the system with various queries to observe the results from both TF-IDF and BERT representations.
* Compare and analyze the results. Discuss the pros and cons of each method.
* Document your findings and any improvements you make to the system.

In [1]:
### Step 1: Import Libraries
#Import necessary libraries for data handling, text processing, and machine learning.
                          
import pandas as pd
import os
import re
import spacy
import nltk
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BertTokenizer, TFBertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np




In [2]:
### Step 2: Load the Dataset

#Load the dataset of podcast transcripts.

#Find the dataset in: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

podcast_df = pd.read_csv('data/podcastdata_dataset.csv')
print(podcast_df.head())
corpus = podcast_df['text']
print(corpus)

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  
0  As part of MIT course 6S099, Artificial Genera...  
1  As part of MIT course 6S099 on artificial gene...  
2  You've studied the human mind, cognition, lang...  
3  What difference between biological neural netw...  
4  The following is a conversation with Vladimir ...  
0      As part of MIT course 6S099, Artificial Genera...
1      As part of MIT course 6S099 on artificial gene...
2      You've studied the human mind, cognition, lang...
3      What difference between biological neural netw...
4      The following is a conversation with Vladimir ...
                             ...                        
314    By the time he

In [3]:
### Step 3: Text Preprocessing

#You know what to do ;)
stop_words_file = 'reuters/stopwords.txt'
with open(stop_words_file, 'r', encoding='utf-8') as file:
    stop_words = set(file.read().split())

#print(stop_words)

def clean_text(corpus):
    cln_txt = []
    for txt in corpus:
        content = txt.lower()
        # Remove the stop words
        cleaned_content = ' '.join([word for word in content.split() if word not in stop_words])
        cln_txt.append(cleaned_content)
    return cln_txt

corpus_clean = clean_text(corpus)
#print(corpus_clean[0])
        
print("Stop words removal and file saving completed.")


Stop words removal and file saving completed.


In [4]:
# Function to strip special characters
def clean_special_chars(corpus):
    clean_corpus = []
    for txt in corpus:  # iterate over the argument rather than the global corpus_clean
        cleaned_text = re.sub(r'[^A-Za-z0-9\s]', '', txt)
        clean_corpus.append(cleaned_text)
    return clean_corpus

corpus_char_clean = clean_special_chars(corpus_clean)


print("All files have been processed and cleaned.")

All files have been processed and cleaned.
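
One detail worth checking in the two cleaning cells above: stop words are removed by splitting on whitespace before punctuation is stripped, so a token such as `and,` does not match the stop-word list and survives. The sketch below reverses the order. The function name `clean_document` and the tiny stop-word set are illustrative assumptions, not part of the notebook, which loads its own list from `reuters/stopwords.txt`.

```python
import re

# Illustrative stop-word set (assumption); the notebook reads its own list from a file.
DEMO_STOP_WORDS = {"the", "and", "a", "of"}

def clean_document(text, stop_words):
    """Lowercase, strip non-alphanumeric characters, then drop stop words."""
    text = text.lower()
    # Strip punctuation first so that "and," becomes "and" before the lookup.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return " ".join(word for word in text.split() if word not in stop_words)

print(clean_document("The host and, later, the guest talk of AI.", DEMO_STOP_WORDS))
```

Stripping apostrophes this way also turns contractions such as "you've" into "youve", so a stop-word list that contains contractions may need the same normalization.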


In [5]:
# Load the spaCy language model
nlp = spacy.load('en_core_web_sm')

# Initialize the stemmer
stemmer = SnowballStemmer('english')

# Function to stem and lemmatize a text
def preprocess_text(text):
    doc = nlp(text)
    stemmed_tokens = [stemmer.stem(token.text) for token in doc]
    lemmatized_tokens = [token.lemma_ for token in doc]
    return ' '.join(stemmed_tokens), ' '.join(lemmatized_tokens)

def lema_stema_corpus(corpus):
    lematizedCorpus=[]
    stematizedCorpus=[]
    
    for txt in corpus:
        stemmed_text, lemmatized_text = preprocess_text(txt)
        lematizedCorpus.append(lemmatized_text)
        stematizedCorpus.append(stemmed_text)
    return lematizedCorpus, stematizedCorpus

myLemaCorpus, myStemaCorpus = lema_stema_corpus(corpus_char_clean)
print("Processing complete.")

Processing complete.


In [6]:
### Step 4: Vector Space Representation - TF-IDF

#Create TF-IDF vector representations of the transcripts.
# Function to vectorize texts using TF-IDF
def TF_IDF(texts):
    # Fit the vectorizer on the texts and transform them
    tfidf_vectorizer = TfidfVectorizer()
    X_tfidf = tfidf_vectorizer.fit_transform(texts)
    
    return X_tfidf, tfidf_vectorizer


In [7]:
# Using the lemmatized corpus
X_tfidf_lemmatized, tfidf_vectorizer_lemmatized = TF_IDF(myLemaCorpus)

# Inspect the TF-IDF results
print("TF-IDF Lemmatized:")
print(X_tfidf_lemmatized.toarray())
print("Features of TF-IDF Lemmatized:", tfidf_vectorizer_lemmatized.get_feature_names_out())

TF-IDF Lemmatized:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Features of TF-IDF Lemmatized: ['00' '000' '00000' ... 'zyklon' 'zz' 'zzzt']


In [8]:
# Using the stemmed corpus
# Vectorize the documents
X_tfidf_Stemmed, tfidf_vectorizer_Stemmed = TF_IDF(myStemaCorpus)

# Inspect the TF-IDF results
print("TF-IDF Stemmed:")
print(X_tfidf_Stemmed.toarray())
print("Features of TF-IDF Stemmed:", tfidf_vectorizer_Stemmed.get_feature_names_out())

TF-IDF Stemmed:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Features of TF-IDF Stemmed: ['00' '000' '00000' ... 'zyklon' 'zz' 'zzzt']
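
The printed corners of both matrices are all zeros because each transcript uses only a small part of the vocabulary. To make the weighting itself visible, here is a minimal standard-library sketch of the scheme that `TfidfVectorizer` applies with its default settings (smoothed IDF, L2 normalization). The function name `tfidf_vectors` and the toy documents are assumptions for illustration, and the sketch tokenizes on whitespace only, which is simpler than the vectorizer's own tokenization.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF with smoothed IDF and L2 normalization, one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

toy_docs = ["robot learn robot", "robot talk", "music talk"]
for vec in tfidf_vectors(toy_docs):
    print({term: round(w, 3) for term, w in vec.items()})
```

In the third toy document, "music" appears in one document and "talk" in two, so "music" receives the higher weight even though both occur once.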


In [28]:
### Step 5: Vector Space Representation - BERT
#Create BERT vector representations of the transcripts using a pre-trained BERT model.

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

#serial approach
#def generate_bert_embeddings(texts):
#    embeddings = []
#    for text in texts:
#        inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True)
#        outputs = model(**inputs)
#        embeddings.append(outputs.last_hidden_state[:, 0, :])  # Use [CLS] token representation
#    return np.array(embeddings).transpose(0,2,1)

def generate_bert_embeddings(texts):
    if isinstance(texts, str):
        texts = [texts]
    embeddings = []
    for text in texts:
        # truncation=True with max_length=512 keeps only the first 512 tokens of each transcript
        inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True, max_length=512)
        outputs = model(**inputs)
        embeddings.append(outputs.last_hidden_state[:, 0, :].numpy())  # Use [CLS] token representation
    return np.vstack(embeddings)  # Stack embeddings into a 2D array

bert_embeddings = generate_bert_embeddings(corpus)
print("corpus shape: ", corpus.shape)
print("BERT Embeddings:", bert_embeddings.shape)


corpus shape:  (319,)
BERT Embeddings: (319, 768)
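
Because the tokenizer call uses `truncation=True` with `max_length=512`, each 768-dimensional vector above describes only the opening of an episode. One possible workaround is to split a transcript into overlapping windows, embed each window, and average the results. The sketch below is model-agnostic: `embed_fn` is a hypothetical callable that maps a string to a list of floats (in this notebook it could wrap `generate_bert_embeddings`), windows are measured in words as a rough stand-in for tokens, and the default sizes are assumptions rather than tuned values.

```python
def split_into_windows(words, window_size=200, overlap=50):
    """Split a list of words into overlapping windows."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    step = window_size - overlap
    windows = []
    for start in range(0, max(len(words), 1), step):
        windows.append(words[start:start + window_size])
        if start + window_size >= len(words):
            break  # the current window already reaches the end of the text
    return windows

def embed_long_text(text, embed_fn, window_size=200, overlap=50):
    """Average the embeddings of all windows of a long text."""
    windows = split_into_windows(text.split(), window_size, overlap)
    vectors = [embed_fn(" ".join(window)) for window in windows]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

# Stand-in embedding for demonstration only: [word count, character count].
def fake_embed(chunk):
    return [float(len(chunk.split())), float(len(chunk))]

print(len(split_into_windows(list(range(500)), window_size=200, overlap=50)))
print(embed_long_text("a b c d", fake_embed, window_size=2, overlap=0))
```

Averaging treats every window equally; whether that helps retrieval on this dataset would need to be checked against the truncated embeddings.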


In [29]:
### Step 6: Query Processing

#Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.

# TF-IDF query cleaning
def clean_query(query):
    # Clean the query the same way as the corpus
    query = query.lower()
    stop_words_file = 'reuters/stopwords.txt'
    with open(stop_words_file, 'r', encoding='utf-8') as file:
        stop_words = set(file.read().split())

    cleaned_query = ' '.join([word for word in query.split() if word not in stop_words])
    cleaned_query = re.sub(r'[^A-Za-z0-9\s]', '', cleaned_query)
    return cleaned_query
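
The cell above only cleans the query; the similarity scores are computed later with scikit-learn's `cosine_similarity`. As a reminder of what that score measures, here is a small standard-library sketch for sparse vectors stored as dictionaries. The function name `cosine` and the toy vectors are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector shares no terms with anything
    return dot / (norm_u * norm_v)

query = {"quantum": 1.0, "computing": 1.0}
doc_a = {"quantum": 2.0, "computing": 1.0, "physics": 1.0}
doc_b = {"music": 3.0, "guitar": 1.0}
print(round(cosine(query, doc_a), 3), cosine(query, doc_b))
```

With TF-IDF vectors, a document that shares no terms with the query scores exactly zero, which is one reason preprocessing the query like the corpus matters.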


In [30]:
### Step 7: Retrieve and Compare Results

#Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.

def top_n_documents(scores, n, corpus):

    # Make sure the scores are a 1D array
    scores = scores.flatten()

    # Indices of the n highest scores, best first
    top_indices = np.argsort(scores)[-n:][::-1]

    # Collect the titles and texts for those indices

    top_texts = []
    top_titles = []

    for idx in top_indices:
        top_titles.append(corpus['title'][idx])
        top_texts.append(corpus['text'][idx])

    top_scores = scores[top_indices]

    # Print the n most similar documents with their scores
    for i, (text, title, score) in enumerate(zip(top_texts, top_titles, top_scores), 1):

        print(f"Top {i}: {title} - Similarity: {score:.4f}")
        # Show a short excerpt of the transcript
        print(text[19:75])
        print()
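
The function above prints its results, which makes the rankings hard to reuse when comparing methods. A possible variant that returns the ranking instead is sketched below with the standard library. The name `rank_top_n` and the demo data are assumptions for illustration; `scores` is expected to be a flat sequence, such as the flattened output of `cosine_similarity`.

```python
import heapq

def rank_top_n(scores, titles, n=10):
    """Return the n highest-scoring (title, score) pairs, best first."""
    if len(scores) != len(titles):
        raise ValueError("scores and titles must have the same length")
    best = heapq.nlargest(n, range(len(scores)), key=lambda i: scores[i])
    return [(titles[i], scores[i]) for i in best]

demo_scores = [0.12, 0.56, 0.03, 0.41]
demo_titles = ["Episode A", "Episode B", "Episode C", "Episode D"]
for title, score in rank_top_n(demo_scores, demo_titles, n=2):
    print(f"{title}: {score:.2f}")
```

Returning pairs keeps printing separate from ranking, so the same list can feed both a display loop and a comparison between TF-IDF and BERT.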
        


In [37]:
### Step 8: Test the IR System

# Test the system with a sample query.

# Retrieve and display the top results using both TF-IDF and BERT representations.

# Enter your query text
# userInput = input("Enter your search phrase: ")
userInput = "technology computer innovation tomorrow new"

# Generate embeddings for the input
## TF-IDF
cleaned_query = clean_query(userInput)
# Lemmatize and stem the query
stemmed_query, lemmatized_query = preprocess_text(cleaned_query)

## BERT
input_bert_embeddings = generate_bert_embeddings(userInput)

print("input shape: ", input_bert_embeddings.shape)
# Compute cosine similarity

# Lemmatized
query_tfidf_lemmatized = tfidf_vectorizer_lemmatized.transform([lemmatized_query])
cosine_scores_lemmatized = cosine_similarity(query_tfidf_lemmatized, X_tfidf_lemmatized)

# stemmed
query_tfidf_stemmed = tfidf_vectorizer_Stemmed.transform([stemmed_query])
cosine_scores_stemmed = cosine_similarity(query_tfidf_stemmed, X_tfidf_Stemmed)

# BERT Cosine Sim
#input_bert_similarity = cosine_similarity(input_bert_embeddings.reshape(43,768))
input_bert_similarity = cosine_similarity(input_bert_embeddings, bert_embeddings).flatten()
top_indices = np.argsort(input_bert_similarity)[::-1][:5]

#bert_similarity_list = [
#    (i, j, input_bert_similarity[i, j])
#    for i in range(input_bert_similarity.shape[0])
#    for j in range(i + 1, input_bert_similarity.shape[1])
#]

# Sort based on similarity scores
#sorted_bert_sim_list = sorted(bert_similarity_list, key=lambda x: x[2], reverse=True)

for index in top_indices:
    print(f"Document {index} - {podcast_df['title'][index]} with similarity score: {input_bert_similarity[index]:.6f}")
    print(f"Transcript: {corpus.iloc[index][19:150]}")
    print()
print('--------------------------------------------')
# Lemmatized
top_n_documents(cosine_scores_lemmatized, 10, podcast_df)
print('--------------------------------------------')
# stemmed
top_n_documents(cosine_scores_stemmed, 10, podcast_df)


input shape:  (1, 768)
Document 273 - Bitcoin, Inflation, and the Future of Money with similarity score: 0.757319
Transcript: hington? You know how he died? Well meaning physicians bled him to death, and this was the most important patient in the country, m

Document 199 - Totalitarianism and Anarchy with similarity score: 0.754713
Transcript: conversation between me and Michael Malus. Michael is an author, anarchist, and simpleton, and I'm proud to call him my friend. He 

Document 165 - Deep Work, Focus, Productivity, Email, and Social Media with similarity score: 0.744074
Transcript: conversation with Cal Newport. He's a friend and someone who's writing, like his book, Deep Work, for example, has guided how I str

Document 96 - Going Big in Business, Investing, and AI with similarity score: 0.737252
Transcript: conversation with Stephen Schwarzman, CEO and cofounder of Blackstone, one of the world's leading investment firms with over $530 b

Document 153 - Aliens, Black Holes, and t

In [13]:
### Step 9: Compare Results

#Analyze and compare the results obtained from TF-IDF and BERT representations.

#Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.

Retrieving the most relevant results for the query with both models (the top 5 for BERT and the top 10 for each TF-IDF variant) suggests that both work well,
since each returns results that relate to the query.

    Considering that the query is: "technology computer innovation tomorrow new"
    and that the results for each are:
        BERT
            Document 273 - Bitcoin, Inflation, and the Future of Money with similarity score: 0.757319
            Document 199 - Totalitarianism and Anarchy with similarity score: 0.754713
            Document 165 - Deep Work, Focus, Productivity, Email, and Social Media with similarity score: 0.744074
            Document 96 - Going Big in Business, Investing, and AI with similarity score: 0.737252
            Document 153 - Aliens, Black Holes, and the Mystery of the Oumuamua with similarity score: 0.734598

        TF-IDF - Lemmatized
            Top 1: Moore’s Law, Microprocessors, Abstractions, and First Principles - Similarity: 0.1185
            Top 2: Self-Driving Cars at Aurora, Google, CMU, and DARPA - Similarity: 0.0858
            Top 3: Qualcomm CEO - Similarity: 0.0857
            Top 4: Google - Similarity: 0.0801
            Top 5: Flying Cars, Autonomous Vehicles, and Education - Similarity: 0.0769
            Top 6: Affective Computing, Emotion, Privacy, and Health - Similarity: 0.0725
            Top 7: Aliens, Technology, Religion, and the Nature of Belief - Similarity: 0.0689
            Top 8: Computer Vision - Similarity: 0.0638
            Top 9: Quantum Computing - Similarity: 0.0606
            Top 10: Computer Architecture and Data Storage - Similarity: 0.0585

        TF-IDF - Stemmed
            Top 1: Moore’s Law, Microprocessors, Abstractions, and First Principles - Similarity: 0.1702
            Top 2: Cellular Automata, Computation, and Physics - Similarity: 0.1137
            Top 3: Qualcomm CEO - Similarity: 0.1067
            Top 4: Simulation and Superintelligence - Similarity: 0.0968
            Top 5: Self-Driving Cars at Aurora, Google, CMU, and DARPA - Similarity: 0.0906
            Top 6: Robotics - Similarity: 0.0891
            Top 7: Flying Cars, Autonomous Vehicles, and Education - Similarity: 0.0875
            Top 8: Affective Computing, Emotion, Privacy, and Health - Similarity: 0.0863
            Top 9: Quantum Computing - Similarity: 0.0856
            Top 10: Google - Similarity: 0.0846

    We can see that the two TF-IDF result lists are quite similar to each other, whereas the BERT results differ markedly from both. The similarity values for
    TF-IDF are also low compared with those obtained with BERT, which are relatively high.

    To begin with, each model takes a different approach to the problem. BERT takes the context of a text into account, whereas TF-IDF is essentially a
    statistical model based on term frequencies. Unlike TF-IDF, BERT captures more of the syntax and semantics of words and sentences. This is why the TF-IDF
    results are so alike for the lemmatized and the stemmed corpus: both preprocessing variants leave the term statistics of the documents largely unchanged.
    BERT, on the other hand, returns completely different results, because it compares the contextual representation of the input with that of each document
    rather than counting shared terms. The similarity values are also much higher for BERT than for either TF-IDF variant. Note, however, that cosine scores
    from the two representations are on different scales and should not be compared directly, and that the BERT embeddings here cover only the first 512 tokens
    of each transcript.
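
To put a number on how much the result lists agree, one simple option is the Jaccard similarity of the retrieved sets. The sketch below uses the top-5 titles from the results listed above, abbreviated for readability; the function name `jaccard` and the choice of this metric are assumptions for illustration, and the metric ignores rank order.

```python
def jaccard(list_a, list_b):
    """Jaccard similarity between two result lists, ignoring rank."""
    set_a, set_b = set(list_a), set(list_b)
    if not set_a and not set_b:
        return 1.0  # two empty lists are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

# Top-5 titles copied from the results above (abbreviated).
bert_top5 = ["Bitcoin", "Totalitarianism and Anarchy", "Deep Work",
             "Going Big in Business", "Aliens, Black Holes"]
tfidf_lemmatized_top5 = ["Moore's Law", "Self-Driving Cars", "Qualcomm CEO",
                         "Google", "Flying Cars"]
tfidf_stemmed_top5 = ["Moore's Law", "Cellular Automata", "Qualcomm CEO",
                      "Simulation and Superintelligence", "Self-Driving Cars"]

print(f"lemmatized vs stemmed: {jaccard(tfidf_lemmatized_top5, tfidf_stemmed_top5):.3f}")
print(f"BERT vs lemmatized:    {jaccard(bert_top5, tfidf_lemmatized_top5):.3f}")
```

The two TF-IDF lists share three of seven distinct titles, while the BERT list shares none with the lemmatized TF-IDF list, which matches the qualitative observation above.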
