# Name: Cesar Duque

# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Build an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes in which the host and guest discuss the query topic. Use TF-IDF and BERT for the vector space representation and compare the results.

### Step 1: Import Libraries
Import the libraries needed for data handling, text processing, and machine learning.

In [62]:
import pandas as pd
import numpy as np
import string
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BertTokenizer, TFBertModel
from sklearn.metrics.pairwise import cosine_similarity

### Step 2: Load the Dataset
Load the podcast transcript dataset.

The dataset is available at: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

Now load the downloaded CSV with pandas.

In [63]:
postcast_df = pd.read_csv('data/podcastdata_dataset.csv')
print(postcast_df.head())
print(postcast_df.shape)

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  
0  As part of MIT course 6S099, Artificial Genera...  
1  As part of MIT course 6S099 on artificial gene...  
2  You've studied the human mind, cognition, lang...  
3  What difference between biological neural netw...  
4  The following is a conversation with Vladimir ...  
(319, 4)


### Step 3: Text Preprocessing

First, take the `text` column of the DataFrame as the corpus.

In [64]:
corpus = postcast_df['text']
print(corpus.head())

0    As part of MIT course 6S099, Artificial Genera...
1    As part of MIT course 6S099 on artificial gene...
2    You've studied the human mind, cognition, lang...
3    What difference between biological neural netw...
4    The following is a conversation with Vladimir ...
Name: text, dtype: object


Load the pretrained BERT model, used for contextual embeddings, and its tokenizer.

In [65]:
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

### TF-IDF PROCESSING

First, convert everything to lowercase and remove punctuation.

In [66]:
corpus_nopunct = []
for doc in corpus: 
    corpus_nopunct.append(doc.lower().translate(str.maketrans('', '', string.punctuation)))
corpus_nopunct[:10]

['as part of mit course 6s099 artificial general intelligence ive gotten the chance to sit down with max tegmark he is a professor here at mit hes a physicist spent a large part of his career studying the mysteries of our cosmological universe but hes also studied and delved into the beneficial possibilities and the existential risks of artificial intelligence amongst many other things he is the cofounder of the future of life institute author of two books both of which i highly recommend first our mathematical universe second is life 30 hes truly an out of the box thinker and a fun personality so i really enjoy talking to him if youd like to see more of these videos in the future please subscribe and also click the little bell icon to make sure you dont miss any videos also twitter linkedin agimitedu if you wanna watch other lectures or conversations like this one better yet go read maxs book life 30 chapter seven on goals is my favorite its really where philosophy and engineering com
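To make the effect of this step concrete, here is a minimal sketch applied to a single made-up sentence. The helper name `normalize` and the sample text are assumptions for illustration; the transformation itself is the same `lower()` plus `str.translate` call used in the cell above.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def normalize(text):
    """Lowercase the text and strip ASCII punctuation (hypothetical helper)."""
    return text.lower().translate(PUNCT_TABLE)

sample = "He's the author of Life 3.0, isn't he?"
print(normalize(sample))  # hes the author of life 30 isnt he
```

Removing apostrophes collapses contractions into tokens such as `hes` or `isnt`. These no longer match apostrophized stop-word entries such as "don't", so tokens like `dont` can survive a later stop-word filter.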

Then add it to the DataFrame as a new column.

In [67]:
postcast_df['text_nopunct']=corpus_nopunct
print(postcast_df.head())

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        text_nopunct  
0  as part of mit course 6s099 artificial general...  
1  as part of mit course 6s099 on artificial gene...  
2  youve studied the human mind cognition languag...  
3  what difference between biological neural netw...  
4  the following is a conversation with vladimir ...  


### stop words

Import the stop words from the NLTK library.

In [68]:
# Download the stopwords resource
nltk.download('stopwords')

# Load the stopwords
stopw = set(stopwords.words('english'))
len(stopw)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cesar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


179

Clean `corpus_nopunct`.

First split each document into tokens with `split`, then drop the tokens that are stop words.

In [69]:
corpus_nostopw=[]
for doc in corpus_nopunct:
    clean_doc = []
    doc_array = doc.split(' ')
    for word in doc_array:
        if word not in stopw:
            clean_doc.append(word)
    corpus_nostopw.append(' '.join(clean_doc))

In [70]:
print('length:',len(corpus_nostopw))
corpus_nostopw[:10]

length: 319


['part mit course 6s099 artificial general intelligence ive gotten chance sit max tegmark professor mit hes physicist spent large part career studying mysteries cosmological universe hes also studied delved beneficial possibilities existential risks artificial intelligence amongst many things cofounder future life institute author two books highly recommend first mathematical universe second life 30 hes truly box thinker fun personality really enjoy talking youd like see videos future please subscribe also click little bell icon make sure dont miss videos also twitter linkedin agimitedu wanna watch lectures conversations like one better yet go read maxs book life 30 chapter seven goals favorite really philosophy engineering come together opens quote dostoevsky mystery human existence lies staying alive finding something live lastly believe every failure rewards us opportunity learn sense ive fortunate fail many new exciting ways conversation different ive learned something called radio
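The filtering loop can also be written as a single comprehension. The sketch below is a minimal illustration under stated assumptions: `STOP_WORDS` is a tiny stand-in set (the notebook uses NLTK's English list) and `remove_stop_words` is a hypothetical helper name.

```python
# Tiny stand-in stop-word set (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "is", "a", "of", "to", "and", "he"}

def remove_stop_words(doc, stop_words=STOP_WORDS):
    """Drop stop words from an already lowercased, punctuation-free string."""
    # split() with no argument also discards empty tokens left by repeated spaces.
    return ' '.join(word for word in doc.split() if word not in stop_words)

print(remove_stop_words("he is  a professor of physics and the author"))
# professor physics author
```

Using `split()` instead of `split(' ')` is a small design choice: it avoids keeping empty strings when the text contains consecutive spaces.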

Add another column to the DataFrame with the corpus without stop words.

In [71]:
postcast_df['text_nostopw']= corpus_nostopw
print(postcast_df.head())

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        text_nopunct  \
0  as part of mit course 6s099 artificial general...   
1  as part of mit course 6s099 on artificial gene...   
2  youve studied the human mind cognition languag...   
3  what difference between biological neural netw...   
4  the following is a conversation with vladimir ...   

                   

### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

Initialize the TF-IDF vectorizer and vectorize the `text_nostopw` column of the `postcast_df` DataFrame.

In [72]:
vectorizer = TfidfVectorizer()
tfidf_mtx= vectorizer.fit_transform(postcast_df['text_nostopw'])
tfidf_mtx.shape

(319, 49728)
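The shape `(319, 49728)` means one row per episode and one column per vocabulary term. As a rough illustration of what each cell holds, the sketch below computes TF-IDF weights by hand. It follows the smoothed IDF formula and L2 normalization that scikit-learn documents as its defaults. The function name `tfidf_vectors`, the whitespace tokenization, and the three toy documents are assumptions, so it is a simplification and not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights per document."""
    tokenized = [doc.split() for doc in docs]  # whitespace tokens (simplification)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # guard empty documents
        rows.append([w / norm for w in raw])
    return vocab, rows

# Made-up toy corpus, not taken from the dataset.
docs = ["deep learning neural networks", "deep space aliens", "poker strategy"]
vocab, rows = tfidf_vectors(docs)
print(len(rows), len(vocab))  # 3 8
```

In the first toy document, `deep` receives a lower weight than `learning` because it also appears in a second document, which is the behavior that lets rare terms dominate the ranking.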

Create a variable holding the query.

In [73]:
query = 'Computer Science'

Using the already fitted vectorizer, vectorize the query.

In [74]:
query_vector = vectorizer.transform([query])
query_vector.shape

(1, 49728)

Compute the cosine similarity between the TF-IDF matrix and the query vector.

In [75]:
similarities = cosine_similarity(tfidf_mtx,query_vector)
similarities

array([[0.04507994],
       [0.0727281 ],
       [0.01451431],
       [0.05681495],
       [0.02340847],
       [0.05354464],
       [0.02846494],
       [0.03237011],
       [0.0169459 ],
       [0.00614624],
       [0.02421822],
       [0.00593825],
       [0.07372364],
       [0.01775848],
       [0.01775848],
       [0.07208534],
       [0.01616015],
       [0.0054441 ],
       [0.04327091],
       [0.00503707],
       [0.00667508],
       [0.01006973],
       [0.        ],
       [0.02116541],
       [0.10470165],
       [0.01085762],
       [0.06837741],
       [0.01049236],
       [0.00295929],
       [0.00373142],
       [0.00665419],
       [0.01236771],
       [0.02354847],
       [0.        ],
       [0.06939767],
       [0.00935121],
       [0.02700536],
       [0.037525  ],
       [0.07715644],
       [0.03107165],
       [0.07983962],
       [0.08368498],
       [0.02933174],
       [0.03290134],
       [0.02826904],
       [0.04788385],
       [0.01707148],
       [0.014

In [76]:
similarities.shape

(319, 1)
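Each of the 319 values is the cosine of the angle between one transcript vector and the query vector. A minimal pure-Python sketch of that computation follows; the helper name `cosine` and the toy vectors are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors (made up): same direction, no shared term, partial overlap.
query = [1.0, 1.0, 0.0]
docs = [[2.0, 2.0, 0.0], [0.0, 0.0, 3.0], [1.0, 0.0, 0.0]]
print([round(cosine(d, query), 3) for d in docs])  # [1.0, 0.0, 0.707]
```

Because TF-IDF weights are never negative, a score of exactly 0 means the transcript shares no vocabulary term with the query.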

Create a DataFrame with each cosine similarity and its corresponding episode title.

In [77]:
similarities_df =pd.DataFrame(similarities, columns=['sim'])
similarities_df['ep'] = postcast_df['title']
similarities_df

          sim  ep
0    0.045080  Life 3.0
1    0.072728  Consciousness
2    0.014514  AI in the Age of Reason
3    0.056815  Deep Learning
4    0.023408  Statistical Learning
..        ...  ...
314  0.036157  Singularity, Superintelligence, and Immortality
315  0.018635  Emotion AI, Social Robots, and Self-Driving Cars
316  0.000945  Comedy, MADtv, AI, Friendship, Madness, and Pr...
317  0.003397  Poker


### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pretrained BERT model.

Import the library used to display a progress bar showing how long the run will take.

In [78]:
from tqdm import tqdm

Define the function that generates the BERT embeddings. Each text is passed through the tokenizer loaded earlier, the tokenized input is fed to the BERT model, and the [CLS] vector of each text is collected in the `embeddings` list. The function returns that list converted to a NumPy array and transposed.

In [79]:
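def generate_bert_embeddings(texts):
    """Return one [CLS] embedding per text as an array of shape (n_texts, 768, 1)."""
    embeddings = []
    for text in tqdm(texts, desc="Generating BERT embeddings"):
        # truncation=True caps the input at the model's maximum length (512 tokens),
        # so only the beginning of a long transcript is embedded.
        inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True)
        outputs = model(**inputs)
        embeddings.append(outputs.last_hidden_state[:, 0, :])  # Use [CLS] token representation
    # Stacking gives (n_texts, 1, 768); the transpose turns it into (n_texts, 768, 1).
    return np.array(embeddings).transpose(0,2,1)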
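Since `truncation=True` limits each input to the model's maximum length (512 tokens for `bert-base-uncased`, including the two special tokens), a long transcript is represented only by its opening. The sketch below illustrates the effect with plain lists. The helper `truncate_like_bert` and the 10,000-token transcript are assumptions, and whitespace-style tokens stand in for real WordPiece tokens, whose counts will differ.

```python
def truncate_like_bert(tokens, max_length=512):
    """Mimic truncation=True: keep max_length tokens including [CLS] and [SEP]."""
    kept = tokens[: max_length - 2]  # reserve two slots for the special tokens
    return ["[CLS]"] + kept + ["[SEP]"]

# Hypothetical transcript of 10,000 tokens (real WordPiece counts differ).
transcript_tokens = [f"w{i}" for i in range(10_000)]
model_input = truncate_like_bert(transcript_tokens)

print(len(model_input))  # number of tokens the model actually sees
share_kept = (len(model_input) - 2) / len(transcript_tokens)
print(f"{share_kept:.1%}")  # share of the transcript content that is kept
```

Under these assumptions only about 5% of the transcript reaches the model, which is worth keeping in mind when interpreting the BERT similarity scores.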

Generate the BERT embeddings for the whole corpus.

In [80]:
corpus_bert = generate_bert_embeddings(corpus)

Generating BERT embeddings: 100%|██████████| 319/319 [03:11<00:00,  1.67it/s]


In [81]:
print("BERT Embeddings :", corpus_bert)
print("BERT Shape:", corpus_bert.shape)

BERT Embeddings : [[[-0.13343118]
  [-0.20641233]
  [ 0.00869827]
  ...
  [ 0.16827916]
  [ 0.4736129 ]
  [ 0.47503123]]

 [[ 0.27045053]
  [-0.00788736]
  [ 0.00813593]
  ...
  [-0.08308917]
  [ 0.7754289 ]
  [ 0.3222385 ]]

 [[ 0.47526115]
  [-0.01439859]
  [-0.3704123 ]
  ...
  [-0.08524544]
  [ 0.49683875]
  [ 0.3699943 ]]

 ...

 [[ 0.18494329]
  [-0.439323  ]
  [ 0.11626415]
  ...
  [ 0.11742253]
  [ 0.7223915 ]
  [ 0.366017  ]]

 [[-0.01205369]
  [-0.18836747]
  [-0.06401569]
  ...
  [-0.18816537]
  [ 0.6607348 ]
  [ 0.684296  ]]

 [[-0.15623435]
  [-0.3306038 ]
  [-0.1911864 ]
  ...
  [ 0.06854273]
  [ 0.70506334]
  [ 0.37314463]]]
BERT Shape: (319, 768, 1)


Generate the embedding for the query with BERT, which returns an embedding of shape (1, 768, 1).

In [82]:
query=['Computer Science']
query_bert= generate_bert_embeddings(query)
query_bert

Generating BERT embeddings: 100%|██████████| 1/1 [00:00<00:00,  6.95it/s]


array([[[ 1.71411604e-01],
        [ 4.82463539e-01],
        [-7.81254828e-01],
        [ 9.65924859e-02],
        [-3.29067647e-01],
        [ 2.58918941e-01],
        [ 2.59098262e-01],
        [ 1.10808814e+00],
        [-1.22093536e-01],
        [-1.54697239e-01],
        [ 1.59713961e-02],
        [ 3.64152014e-01],
        [ 3.68521869e-01],
        [ 6.18271790e-02],
        [-1.17505729e-01],
        [-3.64898860e-01],
        [-2.45325878e-01],
        [ 6.79173172e-01],
        [ 2.78612196e-01],
        [-5.87901920e-02],
        [-6.12187862e-01],
        [-5.10197699e-01],
        [-5.43958366e-01],
        [ 9.09850597e-02],
        [ 9.68162641e-02],
        [-9.84535813e-02],
        [-4.13496681e-02],
        [ 1.70391351e-01],
        [ 8.03369284e-03],
        [-4.91113186e-01],
        [-3.43841255e-01],
        [ 2.59529948e-01],
        [-4.51320261e-01],
        [ 6.30367771e-02],
        [ 3.74043137e-01],
        [-1.30845308e-01],
        [-5.12160540e-01],
 

In [83]:
query_bert.shape

(1, 768, 1)

### Step 6: Query Processing

Define a function that processes the query and computes similarity scores using both the TF-IDF and the BERT embeddings.

TF-IDF

Create a function that vectorizes the given query, computes its cosine similarity against the `tfidf_mtx` matrix, and returns a DataFrame with each similarity score and its episode title.

In [84]:
def retrieve_tfidf(query):
    query_vector = vectorizer.transform([query])
    similarities = cosine_similarity(tfidf_mtx,query_vector)
    similarities_df =pd.DataFrame(similarities, columns=['sim'])
    similarities_df['ep'] = postcast_df['title']
    return similarities_df


In [85]:
tfidf_similarity_consulta=retrieve_tfidf('gpt')
tfidf_similarity_consulta

     sim  ep
0    0.0  Life 3.0
1    0.0  Consciousness
2    0.0  AI in the Age of Reason
3    0.0  Deep Learning
4    0.0  Statistical Learning
..   ...  ...
314  0.0  Singularity, Superintelligence, and Immortality
315  0.0  Emotion AI, Social Robots, and Self-Driving Cars
316  0.0  Comedy, MADtv, AI, Friendship, Madness, and Pr...
317  0.0  Poker


BERT

Create a function that takes a query, generates its embedding, computes its cosine similarity against the `corpus_bert` array created earlier, and returns a DataFrame with each similarity score and its episode title.

In [86]:
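def retrieve_bert(query):
    # `query` must be a list of strings, e.g. ['gpt']; a bare string would be iterated character by character
    query_bert =  generate_bert_embeddings(query)
    # Flatten (n, 768, 1) to (n, 768) without hard-coding the corpus size
    similarities = cosine_similarity(corpus_bert.reshape(len(corpus_bert), -1), query_bert.reshape(1, -1))
    similarities_df =pd.DataFrame(similarities, columns=['sim'])
    similarities_df['ep'] = postcast_df['title']
    return similarities_df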

In [60]:
bert_similarity_consulta=retrieve_bert(['gpt'])
bert_similarity_consulta

Generating BERT embeddings: 100%|██████████| 1/1 [00:00<00:00,  6.36it/s]


          sim  ep
0    0.608030  Life 3.0
1    0.606848  Consciousness
2    0.568143  AI in the Age of Reason
3    0.528032  Deep Learning
4    0.615966  Statistical Learning
..        ...  ...
314  0.581991  Singularity, Superintelligence, and Immortality
315  0.549650  Emotion AI, Social Robots, and Self-Driving Cars
316  0.605461  Comedy, MADtv, AI, Friendship, Madness, and Pr...
317  0.616164  Poker


### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on the similarity scores for both representations, TF-IDF and BERT.

Create a function that takes a DataFrame and sorts it in descending order by its `sim` column.

In [87]:
def ordenar_df(df):
    return df.sort_values(by='sim', ascending=False)
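When only the best few results are needed, a full sort is not strictly required. The sketch below shows one alternative, assuming a hypothetical helper `top_k` built on the standard library's `heapq.nlargest`. The scores and titles are copied from the first five rows of the TF-IDF output for the query 'Computer Science' shown earlier.

```python
import heapq

def top_k(scores, titles, k=10):
    """Return the k (score, title) pairs with the highest scores, best first."""
    return heapq.nlargest(k, zip(scores, titles), key=lambda pair: pair[0])

# First five rows of the TF-IDF similarity output for 'Computer Science'.
scores = [0.045080, 0.072728, 0.014514, 0.056815, 0.023408]
titles = ["Life 3.0", "Consciousness", "AI in the Age of Reason",
          "Deep Learning", "Statistical Learning"]

for score, title in top_k(scores, titles, k=3):
    print(f"{score:.6f}  {title}")
```

For a corpus of 319 episodes the difference from `sort_values` is negligible, so this is a matter of style and not of performance.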

### Step 8: Test the Information Retrieval System

Test the system with a sample query.

Retrieve and display the top results using both the TF-IDF and the BERT representations.

### TF-IDF TOP RESULTS

In [88]:
idf_result_df = retrieve_tfidf('gpt')
result_idf = ordenar_df(idf_result_df)
print(result_idf[:10])

          sim                                                 ep
213  0.099371  OpenAI Codex, GPT-3, Robotics, and the Future ...
17   0.032536                                     OpenAI and AGI
94   0.028676                                      Deep Learning
120  0.028510                    Friendship with an AI Companion
117  0.025214  Math, Manim, Neural Networks & Teaching with 3...
119  0.011263                           Measures of Intelligence
130  0.011053  The Future of Computing and Programming Languages
276  0.007757                         Sara Walker and Lee Cronin
35   0.007033         fast.ai Deep Learning Courses and Research
266  0.006228  Origin of Life, Aliens, Complexity, and Consci...


### BERT TOP RESULTS

In [89]:
bert_result_df = retrieve_bert(['gpt'])
bert_result = ordenar_df(bert_result_df)
print(bert_result[:10])

Generating BERT embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.36it/s]

          sim                                                 ep
216  0.709173  Virtual Reality, Social Media & the Future of ...
49   0.703856    Neuralink, AI, Autopilot, and the Pale Blue Dot
199  0.669967                        Totalitarianism and Anarchy
133  0.666933  On the Nature of Good and Evil, Genius and Mad...
39   0.660287                                             iRobot
153  0.659555  Aliens, Black Holes, and the Mystery of the Ou...
96   0.657686           Going Big in Business, Investing, and AI
163  0.654897  Sleep, Dreams, Creativity & the Limits of the ...
34   0.654667        Machines Who Think and the Early Days of AI
273  0.654258        Bitcoin, Inflation, and the Future of Money





### Step 9: Compare Results

Analyze and compare the results obtained from the TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.


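1. **Relevance**:
   - **TF-IDF**: For the query `gpt`, the absolute scores are low (the top hit scores about 0.099), but the ranking is on topic: the first results are "OpenAI Codex, GPT-3, Robotics, and the Future ...", "OpenAI and AGI", and "Deep Learning". Episodes whose transcripts never contain the term score exactly 0.
   - **BERT**: The scores are higher (about 0.65 to 0.71 in the top 10), but they are tightly clustered, and the top titles, such as "Totalitarianism and Anarchy" or "Bitcoin, Inflation, and the Future of Money", are not visibly related to the query. A higher cosine score does not by itself mean higher relevance; scores are only comparable within one representation.

2. **Strengths and Weaknesses**:
   - **TF-IDF**:
     - **Strengths**: It is simple, fast, and effective when the query term appears literally in the transcripts.
     - **Weaknesses**: It does not capture context or meaning, so it can miss episodes that discuss the topic in different words, and it returns 0 for every episode that lacks the query term.
   - **BERT**:
     - **Strengths**: Contextual embeddings can capture meaning beyond exact word matches, which may help with more complex queries.
     - **Weaknesses**: It needs more resources (about three minutes to embed the 319 transcripts here), it truncates each transcript to the model's 512-token limit, and the raw [CLS] vector of a model that was not fine-tuned for similarity tends to be a weak document representation.

3. **Conclusion**:
   - In this experiment **TF-IDF** produced the more relevant ranking for a keyword query. **BERT** as used here did not, and it would likely need handling for long inputs and a model tuned for sentence similarity before its contextual understanding shows up in the results.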
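
One simple way to quantify how much the two rankings agree is the Jaccard overlap of their top-10 episode indices. The sketch below is a rough illustration and not a full evaluation: the helper name `jaccard` is an assumption, and the index lists are copied from the two top-10 outputs printed in Step 8 for the query `gpt`.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty result lists agree trivially
    return len(a & b) / len(a | b)

# Row indices of the top-10 results printed in Step 8 for the query 'gpt'.
tfidf_top10 = [213, 17, 94, 120, 117, 119, 130, 276, 35, 266]
bert_top10 = [216, 49, 199, 133, 39, 153, 96, 163, 34, 273]

print(jaccard(tfidf_top10, bert_top10))  # 0.0
```

An overlap of 0.0 means the two methods do not share a single episode in their top 10 for this query. A proper comparison would also need relevance judgments for each episode, which this notebook does not include.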
