# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.

In [35]:
import pandas as pd
import numpy as np

# Build the stop-word set: English stop words plus punctuation marks
from nltk.corpus import stopwords
import string

stop_words = set(stopwords.words('english'))
punctuations = list(string.punctuation)
stop_words.update(punctuations)


from transformers import BertTokenizer, BertModel
import torch

from sklearn.metrics.pairwise import cosine_similarity


### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

The dataset is available at: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

In [13]:
df = pd.read_csv('data/podcastdata_dataset.csv')
print(df.head())
corpus = df

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  
0  As part of MIT course 6S099, Artificial Genera...  
1  As part of MIT course 6S099 on artificial gene...  
2  You've studied the human mind, cognition, lang...  
3  What difference between biological neural netw...  
4  The following is a conversation with Vladimir ...  


### Step 3: Text Preprocessing

In [3]:
def clean_text(text):
    # Convert the text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove stop words
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text

In [24]:
# Apply clean_text to the 'text' column and store the result in a new 'cleaned_text' column
df['cleaned_text'] = df['text'].apply(clean_text)
print(df[['text', 'cleaned_text']].head())

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        cleaned_text  
0  part mit course 6s099 artificial general intel...  
1  part mit course 6s099 artificial general intel...  
2  youve studied human mind cognition language vi...  
3  difference biological neural networks artifici...  
4  following conversation vladimir vapnik hes co ...  


### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform(df['cleaned_text'])

In [34]:
tfidf_vectors.shape

(319, 49728)

In [6]:
# Convert the TF-IDF matrix to a DataFrame
tfidf_df = pd.DataFrame(tfidf_vectors.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
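
To make the weighting behind these vectors concrete, here is a minimal hand-rolled sketch of TF-IDF on a toy corpus. It is an illustration, not the notebook's method: the helper name `tfidf_vectors_sketch` and the three toy documents are hypothetical, tokenisation is simplified to whitespace splitting, and the formula is the smoothed IDF with L2 normalisation that scikit-learn documents as the `TfidfVectorizer` default.

```python
import math
from collections import Counter

def tfidf_vectors_sketch(docs):
    """TF-IDF with smoothed IDF and L2 normalisation (hypothetical helper)."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2-normalise so that every document vector has unit length
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors

# Toy documents (made up for illustration, already "cleaned")
toy_docs = ["neural networks learn", "consciousness brain", "neural brain"]
vocab, vectors = tfidf_vectors_sketch(toy_docs)
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:15s} {weight:.3f}")
```

In the first toy document, "networks" appears in only one document while "neural" appears in two, so "networks" receives the larger weight. That is the property the retrieval step relies on: rare terms discriminate between episodes better than common ones.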

### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.

In [10]:
# BERT setup: load the pre-trained tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

In [17]:
# Define a function that computes the BERT representations
def generate_bert_embeddings(texts):
    embeddings = []
    for text in texts:
        # Texts longer than the model's maximum input length are truncated
        inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
        # Inference only, so gradient tracking is not needed
        with torch.no_grad():
            outputs = bert_model(**inputs)
        embeddings.append(outputs.last_hidden_state[:, 0, :].detach().numpy())  # Use [CLS] token representation
    # Resulting shape: (n_texts, 768, 1)
    return np.array(embeddings).transpose(0,2,1)

In [31]:
# Generate the BERT representations for the corpus of cleaned texts
cleaned_corpus = df['cleaned_text'].tolist()
corpus_bert_embeddings = generate_bert_embeddings(cleaned_corpus)

In [32]:
corpus_bert_embeddings.shape

(319, 768, 1)

### Step 6: Query Processing - TF-IDF

Define a function that cleans the query and computes its cosine similarity to every transcript using the TF-IDF vectors.

In [49]:
# Define the retrieval function using TF-IDF
def retrieve_tfidf(query):
    # Make sure the query is a string
    if isinstance(query, list):
        query = ' '.join(query)
    
    query = clean_text(query)
    query_vector = tfidf_vectorizer.transform([query])
    similitudes = cosine_similarity(tfidf_vectors, query_vector)
    similitudes_df = pd.DataFrame(similitudes, columns=['sims_tfidf'])
    similitudes_df['ep'] = df['title']
    return similitudes_df


In [60]:
query = "Computer"
result = retrieve_tfidf(query)
print(result)

     sims_tfidf                                                 ep
0      0.041250                                           Life 3.0
1      0.065067                                      Consciousness
2      0.000000                            AI in the Age of Reason
3      0.000000                                      Deep Learning
4      0.010650                               Statistical Learning
..          ...                                                ...
314    0.037059    Singularity, Superintelligence, and Immortality
315    0.013135   Emotion AI, Social Robots, and Self-Driving Cars
316    0.001287  Comedy, MADtv, AI, Friendship, Madness, and Pr...
317    0.004625                                              Poker
318    0.013164  Biology, Life, Aliens, Evolution, Embryogenesi...

[319 rows x 2 columns]


### Step 6: Query Processing - BERT

Define a function to process the query and compute similarity scores using BERT embeddings.
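
One possible shape for this step is sketched below. It assumes the query embedding is produced the same way as the corpus, for example with `generate_bert_embeddings([clean_text(query)])`, so both arrays carry the trailing axis seen in `corpus_bert_embeddings.shape` (`(319, 768, 1)`). The helper name `cosine_scores` is hypothetical, and the 4-dimensional toy vectors merely stand in for real 768-dimensional BERT output so the snippet runs without downloading a model.

```python
import numpy as np

def cosine_scores(query_embedding, corpus_embeddings):
    """Cosine similarity between one query vector and every corpus vector."""
    # Flatten (n_docs, dim, 1) -> (n_docs, dim) and (1, dim, 1) -> (dim,)
    docs = np.asarray(corpus_embeddings, dtype=float)
    docs = docs.reshape(docs.shape[0], -1)
    q = np.asarray(query_embedding, dtype=float).reshape(-1)
    # Product of norms, with zero-length vectors mapped to a score of 0
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(q)
    denom = np.where(norms == 0, 1.0, norms)
    return docs @ q / denom

# Toy stand-ins for BERT vectors (assumption: real ones have 768 dimensions)
toy_corpus = np.array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0, 0.0],
                       [1.0, 1.0, 0.0, 0.0]]).reshape(3, 4, 1)
toy_query = np.array([1.0, 0.0, 0.0, 0.0]).reshape(1, 4, 1)

sims_bert = cosine_scores(toy_query, toy_corpus)
print(sims_bert)
```

The returned array has one score per episode, so it can be placed in a `sims_bert` column next to the episode titles in the same way `retrieve_tfidf` builds its `sims_tfidf` column.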



### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.
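
A minimal sketch of such a function follows. The helper name `top_k` is hypothetical. The titles and scores are the first five rows of the TF-IDF output printed for the query "Computer"; the same helper should apply unchanged to a list of BERT similarity scores.

```python
def top_k(scores, titles, k=5):
    """Return the k (title, score) pairs with the highest scores, best first."""
    # sorted() is stable, so ties keep their original episode order
    ranked = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
    return [(title, float(score)) for title, score in ranked[:k]]

# First five rows of the TF-IDF result for the query "Computer"
titles = ["Life 3.0", "Consciousness", "AI in the Age of Reason",
          "Deep Learning", "Statistical Learning"]
tfidf_scores = [0.041250, 0.065067, 0.000000, 0.000000, 0.010650]

top_tfidf = top_k(tfidf_scores, titles, k=3)
for title, score in top_tfidf:
    print(f"{score:.4f}  {title}")
```

With the DataFrame returned by `retrieve_tfidf`, an equivalent one-liner would be `result.sort_values('sims_tfidf', ascending=False).head(k)`.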

### Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.

### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.
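
One simple way to quantify how much the two methods agree is the Jaccard overlap of their top-k lists, sketched below. The helper name `overlap_at_k` is hypothetical, and the two rankings use placeholder episode names purely to show the computation; they are not results from this dataset.

```python
def overlap_at_k(ranking_a, ranking_b, k):
    """Jaccard overlap between the top-k items of two rankings (0.0 to 1.0)."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0

# Placeholder rankings (hypothetical, for illustration only)
tfidf_ranking = ["ep_a", "ep_b", "ep_c", "ep_d"]
bert_ranking = ["ep_b", "ep_e", "ep_a", "ep_f"]

overlap = overlap_at_k(tfidf_ranking, bert_ranking, k=3)
print(f"overlap@3 = {overlap:.2f}")
```

A value near 1 means both representations surface the same episodes; a value near 0 means they disagree, which is a cue to inspect the retrieved transcripts by hand and judge which method matches the query intent better.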

## Instructions:

* Follow the steps outlined above to implement the IR system.
* Run the provided code snippets to understand how each part of the system works.
* Test the system with various queries to observe the results from both TF-IDF and BERT representations.
* Compare and analyze the results. Discuss the pros and cons of each method.
* Document your findings and any improvements you make to the system.

In [None]:
#import kaggle
import pandas as pd
wine_df = pd.read_csv('data/podcastdata_dataset.csv')
print(wine_df.head())

In [None]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

# Download the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Preprocessing

In [None]:
corpus_procesado = []
for doc in corpus['text'][:10]:
    # Print the word count of each of the first ten transcripts
    print(len(doc.split()))

In [None]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize
    words = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    # Join the processed words into a single string
    processed_text = ' '.join(words)
    return processed_text

In [None]:
wine_df['processed_text'] = wine_df['text'].apply(preprocess_text)

# Show the DataFrame with the new processed-text column
print(wine_df[['id', 'guest', 'title', 'processed_text']])

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Create a TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_matrix = tfidf_vectorizer.fit_transform(wine_df['processed_text'])

# Get the features (vocabulary)
features = tfidf_vectorizer.get_feature_names_out()

# Convert the TF-IDF matrix to a pandas DataFrame (optional)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=features)

# Show the first rows of the TF-IDF DataFrame
print(tfidf_df.head())

In [None]:
from transformers import BertTokenizer, BertModel
import torch

# Load the pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to obtain the BERT representation
def get_bert_embedding(text):
    # Tokenize the text and convert it to token IDs
    inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True)
    
    # Get the hidden outputs of the BERT model
    with torch.no_grad():
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state  # Last hidden layer as the representation
    
    # Average the representations of all tokens
    avg_embeddings = torch.mean(embeddings, dim=1).squeeze()  # Mean over the token dimension
    
    return avg_embeddings.numpy()  # Convert to a NumPy array to work with pandas

# Apply the function to the DataFrame
wine_df['bert_embedding'] = wine_df['text'].apply(get_bert_embedding)