# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.


### Step 1: Import Libraries
Import the libraries needed for data handling, text processing, and machine learning.

In [6]:
import tensorflow as tf
import gensim.downloader as api
from transformers import BertTokenizer, TFBertModel

In [1]:

import kaggle
import pandas as pd
import numpy as np
import string

### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

The dataset is available on Kaggle: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

In [2]:
postcast_df = pd.read_csv('data/podcastdata_dataset.csv')

In [3]:
print(postcast_df.head())


   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  
0  As part of MIT course 6S099, Artificial Genera...  
1  As part of MIT course 6S099 on artificial gene...  
2  You've studied the human mind, cognition, lang...  
3  What difference between biological neural netw...  
4  The following is a conversation with Vladimir ...  


In [4]:
corpus = postcast_df['text']
corpus

0      As part of MIT course 6S099, Artificial Genera...
1      As part of MIT course 6S099 on artificial gene...
2      You've studied the human mind, cognition, lang...
3      What difference between biological neural netw...
4      The following is a conversation with Vladimir ...
                             ...                        
314    By the time he gets to 2045, we'll be able to ...
315    there's a broader question here, right? As we ...
316    Once this whole thing falls apart and we are c...
317    you could be the seventh best player in the wh...
318    turns out that if you train a planarian and th...
Name: text, Length: 319, dtype: object

### Step 3: Text Preprocessing

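Apply the standard preprocessing steps. BERT uses its own tokenizer on the raw text; for TF-IDF, lowercase the text, strip punctuation, and remove stop words.

BERT processing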

In [7]:
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')




Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [8]:
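def generate_bert_embeddings(texts):
    """Return one [CLS] embedding per text, stacked into shape (n_texts, hidden_size)."""
    embeddings = []
    for text in texts:
        # BERT accepts at most 512 tokens, so longer transcripts are truncated
        inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True, max_length=512)
        outputs = model(**inputs)
        # Use the [CLS] token representation: shape (1, hidden_size)
        embeddings.append(outputs.last_hidden_state[:, 0, :].numpy())
    # One row per text, ready for cosine similarity against a query vector
    return np.vstack(embeddings)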



TF-IDF processing

In [12]:
# Lowercase each transcript and strip punctuation
# (stop words are loaded in a later cell)
punct_table = str.maketrans('', '', string.punctuation)
corpus_nopunct = []
for doc in corpus:
    corpus_nopunct.append(doc.lower().translate(punct_table))

In [13]:
postcast_df['text_nopunct']=corpus_nopunct
print(postcast_df.head())

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        text_nopunct  
0  as part of mit course 6s099 artificial general...  
1  as part of mit course 6s099 on artificial gene...  
2  youve studied the human mind cognition languag...  
3  what difference between biological neural netw...  
4  the following is a conversation with vladimir ...  


In [16]:
import nltk
from nltk.corpus import stopwords

# Download the stopwords resource
nltk.download('stopwords')

# Load the English stop words
stopw = set(stopwords.words('english'))
print(len(stopw))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Geovanny\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


179
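
The stop-word set loaded above is not yet applied to the transcripts. A minimal sketch of that step follows, assuming whitespace tokenization is good enough once punctuation has been removed. The names `remove_stopwords` and `demo_stopwords` are hypothetical; in the notebook you would pass `stopw` instead of the tiny demo set.

```python
def remove_stopwords(text, stop_set):
    """Drop every whitespace-separated token that appears in stop_set."""
    return " ".join(tok for tok in text.split() if tok not in stop_set)

# Tiny stand-in for the NLTK set `stopw` so the example runs on its own.
demo_stopwords = {"the", "is", "a", "with", "of"}

cleaned = remove_stopwords(
    "the following is a conversation with vladimir vapnik", demo_stopwords
)
print(cleaned)
```

Because punctuation was stripped first, contractions such as `youve` will likely not match NLTK entries that keep the apostrophe, so you may want to remove stop words before stripping punctuation or extend the set.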



### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.
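
One possible way to build the representation is scikit-learn's `TfidfVectorizer`. This is a sketch, not the workshop's reference solution: `toy_docs` is a made-up stand-in for `postcast_df['text_nopunct']`, and using scikit-learn's built-in English stop-word list is an assumption (the NLTK set could be passed as a list instead).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the preprocessed transcripts; replace with the real column.
toy_docs = [
    "deep learning and neural networks",
    "consciousness and the human mind",
    "statistical learning theory",
]

# stop_words="english" selects scikit-learn's built-in list (an assumption here).
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(toy_docs)  # sparse, shape (n_docs, n_terms)

print(tfidf_matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

Keep the fitted `vectorizer` around: queries must be transformed with the same vocabulary and IDF weights as the documents.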

### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.

In [None]:
bert_embeddings = generate_bert_embeddings(corpus[:10])
print("BERT Embeddings:", bert_embeddings)
print("Bert Shape:", bert_embeddings.shape)

### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.
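
Both representations can share one scoring routine, since each reduces to comparing a query vector with a matrix of document vectors. The sketch below uses plain NumPy cosine similarity; `cosine_scores` is a hypothetical helper name. It assumes the query has already been embedded in the same space as the documents, for example with the fitted TF-IDF vectorizer's `transform([query]).toarray()` or with `generate_bert_embeddings([query])`.

```python
import numpy as np

def cosine_scores(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    query_vec = np.asarray(query_vec, dtype=float).ravel()
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    denom = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    # Guard against all-zero vectors (e.g. a query with no known terms).
    denom = np.where(denom == 0, 1.0, denom)
    return doc_matrix @ query_vec / denom

# Made-up 2-dimensional vectors, only to show the call.
docs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
query = np.array([1.0, 0.0])
scores = cosine_scores(query, docs)
print(scores)
```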


### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.
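
Retrieval then amounts to sorting the similarity scores and mapping the best indices back to episode metadata. A minimal sketch, assuming one score per episode in the same order as the titles; `top_k` is a hypothetical name and the scores below are invented for illustration.

```python
import numpy as np

def top_k(scores, titles, k=3):
    """Return the k (title, score) pairs with the highest scores, best first."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(titles[i], float(scores[i])) for i in order]

titles = ["Life 3.0", "Consciousness", "Deep Learning", "Statistical Learning"]
scores = [0.12, 0.05, 0.81, 0.40]  # illustrative values, not real results
results = top_k(scores, titles, k=2)
print(results)
```

Calling the same function once with TF-IDF scores and once with BERT scores gives two rankings that can be placed side by side.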

### Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.
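
Before running on all 319 transcripts, it can help to check the pipeline on a corpus small enough to verify by eye. The sketch below does this for the TF-IDF side only, using scikit-learn; the three documents are invented stand-ins for real transcripts, and the BERT side would follow the same pattern with embeddings in place of TF-IDF vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented mini-corpus; titles mirror a few dataset entries for readability.
toy_titles = ["Deep Learning", "Consciousness", "Statistical Learning"]
toy_docs = [
    "neural networks learn representations from data",
    "what is consciousness and how does the brain produce it",
    "statistical learning theory studies generalization from data",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(toy_docs)

query = "neural networks"
query_vector = vectorizer.transform([query])  # same vocabulary as the documents
scores = cosine_similarity(query_vector, doc_vectors).ravel()

ranking = sorted(zip(toy_titles, scores), key=lambda pair: pair[1], reverse=True)
for title, score in ranking:
    print(f"{score:.3f}  {title}")
```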

### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.
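
One simple way to quantify how much the two methods agree is the overlap between their top-k lists. The sketch below computes a Jaccard overlap; `overlap_at_k` is a hypothetical helper and the episode ids are made up. It measures agreement only, not relevance, so a qualitative look at the returned episodes is still needed.

```python
def overlap_at_k(ranking_a, ranking_b, k=5):
    """Jaccard overlap between the top-k items of two ranked lists."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0

# Made-up episode ids for one query, best match first.
tfidf_top = [3, 17, 42, 8, 99]
bert_top = [17, 3, 250, 8, 61]
agreement = overlap_at_k(tfidf_top, bert_top, k=5)
print(f"{agreement:.3f}")
```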



## Instructions:

* Follow the steps outlined above to implement the IR system.
* Run the provided code snippets to understand how each part of the system works.
* Test the system with various queries to observe the results from both TF-IDF and BERT representations.
* Compare and analyze the results. Discuss the pros and cons of each method.
* Document your findings and any improvements you make to the system.