# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.


### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.

In [1]:
import tensorflow as tf
import gensim.downloader as api
import kaggle
import pandas as pd
import numpy as np
import string
import nltk

In [2]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BertTokenizer, TFBertModel
from sklearn.metrics.pairwise import cosine_similarity






### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

The dataset is available at: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

In [3]:
postcast_df = pd.read_csv('data/podcastdata_dataset.csv')
print(postcast_df.head())

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  
0  As part of MIT course 6S099, Artificial Genera...  
1  As part of MIT course 6S099 on artificial gene...  
2  You've studied the human mind, cognition, lang...  
3  What difference between biological neural netw...  
4  The following is a conversation with Vladimir ...  


In [4]:
print(postcast_df.shape)

(319, 4)


### Step 3: Text Preprocessing

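Lowercase the transcripts, strip punctuation, and remove stop words before building the TF-IDF representation. BERT is fed the raw text.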
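The cleaning performed in this step (lowercasing, punctuation removal, stop-word filtering) can be sketched as a single helper. This is a minimal illustration, not the notebook's code: `preprocess` and `TOY_STOPWORDS` are hypothetical names, and the tiny stop-word set stands in for NLTK's English list.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook itself uses NLTK's English list.
TOY_STOPWORDS = {"the", "is", "a", "of", "and", "with"}

def preprocess(text, stopwords=TOY_STOPWORDS):
    """Lowercase, strip ASCII punctuation, and drop stop words."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    # split() with no argument collapses runs of whitespace
    tokens = [tok for tok in no_punct.split() if tok not in stopwords]
    return " ".join(tokens)

print(preprocess("The following is a conversation with Vladimir Vapnik."))
# prints: following conversation vladimir vapnik
```

Note that removing punctuation also merges contractions ("You've" becomes "youve"), which is visible in the `text_nopunct` column further down.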

In [5]:
corpus = postcast_df['text']
print(corpus.head())

0    As part of MIT course 6S099, Artificial Genera...
1    As part of MIT course 6S099 on artificial gene...
2    You've studied the human mind, cognition, lang...
3    What difference between biological neural netw...
4    The following is a conversation with Vladimir ...
Name: text, dtype: object


BERT processing

In [6]:
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')




Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

### TF-IDF PROCESSING

Remove punctuation

In [7]:
# Remove punctuation and lowercase each transcript
punct_table = str.maketrans('', '', string.punctuation)
corpus_nopunct = [doc.lower().translate(punct_table) for doc in corpus]

In [8]:
print(corpus_nopunct[:10])



In [9]:
postcast_df['text_nopunct']=corpus_nopunct
print(postcast_df.head())

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        text_nopunct  
0  as part of mit course 6s099 artificial general...  
1  as part of mit course 6s099 on artificial gene...  
2  youve studied the human mind cognition languag...  
3  what difference between biological neural netw...  
4  the following is a conversation with vladimir ...  


### Stop words

In [10]:
#import nltk
#from nltk.corpus import stopwords

# Download the stopwords resource
nltk.download('stopwords')

# Load the English stop words
stopw = set(stopwords.words('english'))
print(len(stopw))

179


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Geovanny\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
# Drop stop words from each punctuation-free transcript
corpus_nostopw = []
for doc in corpus_nopunct:
    # split() with no argument also collapses repeated whitespace
    clean_doc = [word for word in doc.split() if word not in stopw]
    corpus_nostopw.append(' '.join(clean_doc))

In [12]:
corpus_nostopw[300]

'following conversation brian armstrong cofounder ceo coinbase largest cryptocurrency exchange platform 98 million users 100 countries listing bitcoin ethereum cardano 100 popular cryptocurrencies recorded conversation brian weeks sec probe whether crypto listings securities thus need regulated always conversations involve cryptocurrency try make timeless price soaring high crashing low doesnt distract fundamental technological economic social philosophical ideas underlying new form money energy information world runs money exchange store value cryptocurrency seeks build next chapter money works coinbase brian trying working together regulators governments long difficult road bureaucracies resist change better worse latest sec probe good representation serious attempt limit fraud one also runs risk limiting innovation limiting financial freedom individuals complicated mess applaud everyone involved trying work hope end interest individual wins decentralization hedge corrupting nature c

In [13]:
postcast_df['text_nostopw']= corpus_nostopw
print(postcast_df.head())

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        text_nopunct  \
0  as part of mit course 6s099 artificial general...   
1  as part of mit course 6s099 on artificial gene...   
2  youve studied the human mind cognition languag...   
3  what difference between biological neural netw...   
4  the following is a conversation with vladimir ...   

                   

In [14]:
#tokens_filtrados = [token for token in corpus_nopunct if token not in stopw]


### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.
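To make the weighting concrete, here is a small from-scratch sketch of TF-IDF on a toy corpus. It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn's `TfidfVectorizer` applies with its default settings. The function name `tfidf_vectors` and the toy documents are assumptions for illustration, and the whitespace tokenization here is simpler than scikit-learn's.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors

# Toy corpus (hypothetical, not taken from the dataset)
docs = ["deep learning neural networks", "neural networks brain", "bitcoin money"]
vocab, vecs = tfidf_vectors(docs)
for term, weight in zip(vocab, vecs[0]):
    print(f"{term:10s} {weight:.3f}")
```

In the first toy document, "deep" receives a higher weight than "neural" because it occurs in fewer documents, which is the behaviour that lets rare query terms dominate the ranking.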

In [15]:
#from sklearn.feature_extraction.text import TfidfVectorizer

In [16]:
vectorizer = TfidfVectorizer()
tfidf_mtx= vectorizer.fit_transform(postcast_df['text_nostopw'])

In [17]:
query = 'Computer Science'

In [18]:
query_vector = vectorizer.transform([query])

In [19]:
#from sklearn.metrics.pairwise import cosine_similarity

In [20]:
similarities = cosine_similarity(tfidf_mtx,query_vector)

In [21]:
similarities

array([[0.04507994],
       [0.0727281 ],
       [0.01451431],
       [0.05681495],
       [0.02340847],
       [0.05354464],
       [0.02846494],
       [0.03237011],
       [0.0169459 ],
       [0.00614624],
       [0.02421822],
       [0.00593825],
       [0.07372364],
       [0.01775848],
       [0.01775848],
       [0.07208534],
       [0.01616015],
       [0.0054441 ],
       [0.04327091],
       [0.00503707],
       [0.00667508],
       [0.01006973],
       [0.        ],
       [0.02116541],
       [0.10470165],
       [0.01085762],
       [0.06837741],
       [0.01049236],
       [0.00295929],
       [0.00373142],
       [0.00665419],
       [0.01236771],
       [0.02354847],
       [0.        ],
       [0.06939767],
       [0.00935121],
       [0.02700536],
       [0.037525  ],
       [0.07715644],
       [0.03107165],
       [0.07983962],
       [0.08368498],
       [0.02933174],
       [0.03290134],
       [0.02826904],
       [0.04788385],
       [0.01707148],
       [0.014

In [22]:
type(similarities)

numpy.ndarray

In [23]:

similarities_df =pd.DataFrame(similarities, columns=['sim'])
similarities_df['ep'] = postcast_df['title']
print(similarities_df.head())

        sim                       ep
0  0.045080                 Life 3.0
1  0.072728            Consciousness
2  0.014514  AI in the Age of Reason
3  0.056815            Deep Learning
4  0.023408     Statistical Learning


In [24]:
similarities_df

Unnamed: 0,sim,ep
0,0.045080,Life 3.0
1,0.072728,Consciousness
2,0.014514,AI in the Age of Reason
3,0.056815,Deep Learning
4,0.023408,Statistical Learning
...,...,...
314,0.036157,"Singularity, Superintelligence, and Immortality"
315,0.018635,"Emotion AI, Social Robots, and Self-Driving Cars"
316,0.000945,"Comedy, MADtv, AI, Friendship, Madness, and Pr..."
317,0.003397,Poker


### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.
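`bert-base-uncased` accepts at most 512 tokens, so when the tokenizer is called with `truncation=True`, as in the cells of this step, only the beginning of each transcript is embedded. One possible workaround, which this notebook does not implement, is to split each transcript into overlapping windows, embed every window, and average the vectors. The sketch below shows only the windowing and pooling logic; `split_into_windows`, `mean_pool`, and the window sizes are hypothetical choices, and the embedding call itself is left out.

```python
def split_into_windows(words, window=200, stride=150):
    """Split a list of words into overlapping windows."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(words[start:start + window])
        if start + window >= len(words):
            break
        start += stride
    return windows

def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy demonstration with integers standing in for words
print(split_into_windows(list(range(10)), window=4, stride=3))
# Toy demonstration with 2-dimensional stand-ins for window embeddings
print(mean_pool([[1.0, 3.0], [3.0, 5.0]]))
```

Windows are measured in words here, while BERT counts WordPiece tokens, so the window size would need to leave some headroom below 512.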

In [25]:
import time
from multiprocessing.pool import ThreadPool
from multiprocessing import Pool
import threading

In [26]:

def generate_bert_embeddings(texts):
    """Return one [CLS] embedding per text, shaped (len(texts), 768, 1)."""
    embeddings = []
    start_time = time.time()
    for i, text in enumerate(texts):
        if i == 10:  # Time the first 10 texts to estimate the full run
            elapsed_time = time.time() - start_time
            estimated_time = (elapsed_time / 10) * len(texts)
            print(f"Estimated time to process the full corpus: {estimated_time / 60:.2f} minutes")

        # truncation=True keeps only the first 512 tokens of each transcript
        inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True)
        outputs = model(**inputs)
        embeddings.append(outputs.last_hidden_state[:, 0, :])  # Use [CLS] token representation
    return np.array(embeddings).transpose(0,2,1)



In [28]:
corpus_bert = generate_bert_embeddings(corpus[:50])

Estimated time to process the full corpus: 0.97 minutes


In [30]:
corpus_bert = generate_bert_embeddings(corpus)

Estimated time to process the full corpus: 6.36 minutes


In [31]:
corpus.shape

(319,)

In [32]:
#corpus_bert[0]

In [33]:
query=['Computer Science']
query_bert= generate_bert_embeddings(query)

In [34]:
query_bert.shape

(1, 768, 1)

In [35]:
#similarities=cosine_similarity(corpus_bert.reshape(50,768),query_bert.reshape(1,768))

In [36]:
#similarities

### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.
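Both retrieval functions in this step rank episodes by cosine similarity between the query vector and each episode vector. As a reminder of what that score measures, here is a minimal pure-Python version; `cosine` is a hypothetical helper, not the scikit-learn function the notebook calls.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Treat an all-zero vector as having no similarity to anything
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine([1, 0], [1, 0]))  # same direction
print(cosine([1, 0], [0, 1]))  # orthogonal, no shared terms
print(cosine([0, 0], [1, 2]))  # query with no known vocabulary
```

Because the score depends only on direction, a long transcript and a short query can still match well, and a TF-IDF query that shares no vocabulary with an episode scores exactly zero.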

TF-IDF

In [37]:
def retrieve_tfidf(query):
    query_vector = vectorizer.transform([query])
    similarities = cosine_similarity(tfidf_mtx,query_vector)
    similarities_df =pd.DataFrame(similarities, columns=['sim'])
    similarities_df['ep'] = postcast_df['title']
    return similarities_df


In [38]:
retrieve_tfidf('gpt')

Unnamed: 0,sim,ep
0,0.0,Life 3.0
1,0.0,Consciousness
2,0.0,AI in the Age of Reason
3,0.0,Deep Learning
4,0.0,Statistical Learning
...,...,...
314,0.0,"Singularity, Superintelligence, and Immortality"
315,0.0,"Emotion AI, Social Robots, and Self-Driving Cars"
316,0.0,"Comedy, MADtv, AI, Friendship, Madness, and Pr..."
317,0.0,Poker


BERT

In [41]:
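def retrieve_bert(query):
    # query must be a list of strings, e.g. ['gpt']
    query_bert = generate_bert_embeddings(query)
    # Flatten (n, 768, 1) arrays to (n, 768) without hard-coding the corpus size
    corpus_2d = corpus_bert.reshape(len(corpus_bert), -1)
    similarities = cosine_similarity(corpus_2d, query_bert.reshape(1, -1))
    similarities_df = pd.DataFrame(similarities, columns=['sim'])
    similarities_df['ep'] = postcast_df['title']
    return similarities_df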

In [42]:
retrieve_bert(['gpt'])

Unnamed: 0,sim,ep
0,0.608030,Life 3.0
1,0.606848,Consciousness
2,0.568143,AI in the Age of Reason
3,0.528032,Deep Learning
4,0.615966,Statistical Learning
...,...,...
314,0.581991,"Singularity, Superintelligence, and Immortality"
315,0.549650,"Emotion AI, Social Robots, and Self-Driving Cars"
316,0.605461,"Comedy, MADtv, AI, Friendship, Madness, and Pr..."
317,0.616164,Poker



### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.
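Retrieving the top results amounts to sorting episodes by score in descending order and keeping the first k. The notebook does this on a DataFrame; the sketch below is a plain-Python counterpart with a hypothetical `top_k` helper, run on the first five 'Computer Science' TF-IDF scores shown in Step 4.

```python
def top_k(scores, titles, k=10):
    """Return the k (score, title) pairs with the highest scores."""
    ranked = sorted(zip(scores, titles), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]

# First five rows of the TF-IDF similarities for 'Computer Science'
scores = [0.045080, 0.072728, 0.014514, 0.056815, 0.023408]
titles = ["Life 3.0", "Consciousness", "AI in the Age of Reason",
          "Deep Learning", "Statistical Learning"]

for score, title in top_k(scores, titles, k=2):
    print(f"{score:.6f}  {title}")
```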

In [43]:
def ordenamiento(results_df):
    """Sort a results DataFrame by similarity, highest first."""
    return results_df.sort_values(by='sim', ascending=False)

### Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.

### TF-IDF TOP RESULTS

In [44]:
idf_result_df = retrieve_tfidf('gpt')

# Sort the results in descending order by the 'sim' column
result_idf = ordenamiento(idf_result_df)
print(result_idf[:10])

          sim                                                 ep
213  0.099371  OpenAI Codex, GPT-3, Robotics, and the Future ...
17   0.032536                                     OpenAI and AGI
94   0.028676                                      Deep Learning
120  0.028510                    Friendship with an AI Companion
117  0.025214  Math, Manim, Neural Networks & Teaching with 3...
119  0.011263                           Measures of Intelligence
130  0.011053  The Future of Computing and Programming Languages
276  0.007757                         Sara Walker and Lee Cronin
35   0.007033         fast.ai Deep Learning Courses and Research
266  0.006228  Origin of Life, Aliens, Complexity, and Consci...


### BERT TOP RESULTS

In [45]:
bert_result_df = retrieve_bert(['gpt'])

# Sort the results in descending order by the 'sim' column
bert_result = ordenamiento(bert_result_df)
print(bert_result[:10])

          sim                                                 ep
216  0.709173  Virtual Reality, Social Media & the Future of ...
49   0.703856    Neuralink, AI, Autopilot, and the Pale Blue Dot
199  0.669967                        Totalitarianism and Anarchy
133  0.666933  On the Nature of Good and Evil, Genius and Mad...
39   0.660287                                             iRobot
153  0.659555  Aliens, Black Holes, and the Mystery of the Ou...
96   0.657686           Going Big in Business, Investing, and AI
163  0.654897  Sleep, Dreams, Creativity & the Limits of the ...
34   0.654667        Machines Who Think and the Early Days of AI
273  0.654258        Bitcoin, Inflation, and the Future of Money


### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.
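One simple way to quantify how much the two methods agree is the Jaccard overlap of their top-k result sets. The helper `overlap_at_k` below is a hypothetical addition, applied to the episode indices printed in Step 8 for the query 'gpt'.

```python
def overlap_at_k(ranking_a, ranking_b, k=10):
    """Jaccard overlap between the top-k items of two rankings."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0

# Row indices of the top 10 results for 'gpt', copied from Step 8
tfidf_top = [213, 17, 94, 120, 117, 119, 130, 276, 35, 266]
bert_top = [216, 49, 199, 133, 39, 153, 96, 163, 34, 273]

print(overlap_at_k(tfidf_top, bert_top))  # prints 0.0
```

For this query the two top-10 lists share no episode. The TF-IDF list is led by episodes whose titles mention GPT-3 and OpenAI, while the BERT scores fall in a narrow band (roughly 0.65 to 0.71) across unrelated topics, which is worth addressing in the discussion.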



## Instructions:

* Follow the steps outlined above to implement the IR system.
* Run the provided code snippets to understand how each part of the system works.
* Test the system with various queries to observe the results from both TF-IDF and BERT representations.
* Compare and analyze the results. Discuss the pros and cons of each method.
* Document your findings and any improvements you make to the system.