<a href="https://colab.research.google.com/github/gabrielfernandorey/ITBA-NLP/blob/main/ITBA_nlp01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Course Project - Topic Detection and Classification
- ITBA 2024
- Student: Gabriel Rey
---

### Problem summary

- Compute the topics of the news items received from news portals
- Topic computation frequency: daily
- News collection: daily, in batches or one text at a time.
- Identify topics, entities, keywords, and sentiment.

### Data
- News items arrive in the format: Title, Text, Date, Entities, Keywords

### Tasks
- Daily topic detection model using embeddings
- Define a criterion for grouping topics within the same day and across different days (merging)
- Store the topic embeddings in a vector database
- Given data model: 
    - Topic id
    - Topic name
    - Keywords
    - Embedding
    - Creation date
    - Initial training date
    - Updated training date
    - Detection threshold
    - Closest document
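
As a rough illustration of the data model listed above, the fields could be mirrored in a small Python dataclass. This is only a sketch: the class name `TopicRecord`, the field names, and the types are assumptions made for the example, not the schema actually used by the vector database.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class TopicRecord:
    """Hypothetical container mirroring the topic data model (names are illustrative)."""
    topic_id: int
    name: str
    keywords: List[str]
    embedding: List[float]
    created_at: datetime
    first_trained_at: datetime
    last_trained_at: datetime
    detection_threshold: float
    closest_document_id: Optional[int] = None  # unknown until the topic is matched to a document


# Toy values, only to show how a record would be filled in
topic = TopicRecord(
    topic_id=0,
    name="example topic",
    keywords=["example", "topic"],
    embedding=[0.1, 0.2, 0.3],
    created_at=datetime(2024, 4, 1),
    first_trained_at=datetime(2024, 4, 1),
    last_trained_at=datetime(2024, 4, 2),
    detection_threshold=0.8,
)
```

Keeping the three dates as separate fields makes it possible to tell a topic that was merged and retrained on a later day from one that has only been trained once.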
---
Tasks in this notebook:
- Initialize the vector database
- Ingest data
- NER: find the entities of each document
- Clean data
- Model: build the BERTopic model
- Training
- Storage in the vector database


In [1]:
import os
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import re
import json
from datetime import datetime
from dotenv import load_dotenv
from tqdm import tqdm
from collections import Counter

import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from NLP_tools import clean_all
from core.functions import *

# start the database before running this notebook
from opensearch_data_model import os_client

### Initialize the vector database
The "Topic" index is modified to add references to the closest document and a field for the 100 documents closest to the topic.

In [2]:
# Index initialization
init_opensearch()



The Topic index already exists. Skipping database initialization.
The News index already exists. Skipping database initialization.


### Path

In [3]:
load_dotenv()
PATH_REMOTO='/content/ITBA-NLP/data/'
PATH=os.environ.get('PATH_LOCAL', PATH_REMOTO)
PATH

'C:/Users/gabri/OneDrive/Machine Learning/Github/ITBA-NLP/data/'

### Data

In [4]:
# Read the parquet file | ( lotes de prueba )

df_params = {'0_1000':'0_1000_data.parquet',
             '1000_2000':'1000_2000_data.parquet',
             '2000_3000':'2000_3000_data.parquet',
             'df_joined':'df_joined_2024-04-01 00_00_00.parquet'
            }

chunk = os.environ.get('CHUNK')
print(chunk)

df_parquet = pd.read_parquet(PATH+df_params[chunk])
df_parquet.head(1)


df_joined


Unnamed: 0_level_0,Asset Name,Author Id,Author Name,Keyword Id,Keyword Name,Entity Id,Entity Name,Media Group Id,Media Group Name,Impact,...,in__text,out__entities,out__potential_entities,predicted_at_entities,out__keywords_sorted,predicted_at_keywords,start_time_utc,start_time_local,truncated_text,title_and_text
Asset Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
105628101,Elecciones en Venezuela: María Corina Machado ...,36192,Infobae,1932002 | 417739 | 1687638 | 36187 | 7476 | 50...,"[falsas conspiraciones armadas, sustituta, det...",219925 | 210613 | 219770 | 36424 | 1129437,"[Nicolás Maduro, Jorge Rodríguez, Marcelo Ebra...",0,,7406333,...,Fotografía de archivo de la líder antichavista...,"[Nicolás Maduro, Marcelo Ebrard, Jorge Rodrígu...","[Jorge Rodríguez, Nicolás Maduro, Rayner Peña ...",2024-04-02 08:11:57.825777,"[elecciones presidenciales, candidatura presid...",2024-04-02 08:17:44.372891+00:00,2024-04-02,2024-04-01 21:00:00,Fotografía de archivo de la líder antichavista...,Elecciones en Venezuela: María Corina Machado ...


In [None]:
# Code to split the dataset (tests)
#df_parquet[:1000].to_parquet(PATH+'0_1000_data.parquet', engine='pyarrow')

#df_1000_2000 = df_parquet[1000:2000]

#df_1000_2000['start_time_local'] = '2024-04-03 00:00:00'
#df_1000_2000.to_parquet(PATH+'1000_2000_data.parquet', engine='pyarrow')

#df_2000 = df_parquet[2000:]

#df_2000['start_time_local'] = '2024-04-05 00:00:00'
#df_2000.to_parquet(PATH+'2000_3000_data.parquet', engine='pyarrow')


In [5]:
data = list(df_parquet['in__text'])

# Total number of documents
len(data)

3104

### StopWords

In [6]:
# Stopwords
SPANISH_STOPWORDS = list(pd.read_csv(PATH+'spanish_stop_words.csv' )['stopwords'].values)
SPANISH_STOPWORDS_SPECIAL = list(pd.read_csv(PATH+'spanish_stop_words_spec.csv' )['stopwords'].values)

In [None]:
""" import csv
# Save the special stopword list to a CSV file
with open(PATH+"spanish_stop_words_spec.csv", mode='w', newline='', encoding='utf-8') as archivo:
    escritor = csv.writer(archivo)
    escritor.writerow(['stopwords'])
    for stopword in SPANISH_STOPWORDS_SPECIAL:
        escritor.writerow([stopword]) """

### NER - Named Entity Recognition
Extract the entities from the news items.

In [7]:
# Load the spaCy model for Spanish
spa = spacy.load("es_core_news_lg")

In [None]:
""" # Load the saved results, or skip this cell and run the cell below instead
with open(PATH+f'modelos/entities{chunk}.json', 'r') as json_file:
    entities = json.load(json_file)

with open(PATH+f'modelos/entities_spa{chunk}.json', 'r') as json_file:
    entities_spa = json.load(json_file)

with open(PATH+f'modelos/keywords_spa{chunk}.json', 'r') as json_file:
    keywords_spa = json.load(json_file) """

In [76]:
# Detect entities for every document using spaCy

# Build the stopword set once instead of concatenating both lists for every word
stopwords_all = set(SPANISH_STOPWORDS + SPANISH_STOPWORDS_SPECIAL)

original_entities = []
for data_in in tqdm(data):

    # Count the words in the document (only words that appear more than once are kept)
    normalized_text = re.sub(r'\W+', ' ', data_in.lower())
    words_txt_without_stopwords = [word for word in normalized_text.split() if word not in stopwords_all]
    words_txt_counter = Counter(words_txt_without_stopwords)
    words_counter = {elemento: cuenta for elemento, cuenta in sorted(words_txt_counter.items(), key=lambda item:item[1], reverse=True) if cuenta > 1}

    # Extract the entities of the document, keeping only persons, organizations and locations
    extract = spa(data_in)
    entidades_spacy = [(ent.text, ent.label_) for ent in extract.ents]
    ent_select = [ent for ent in entidades_spacy if ent[1] in ('PER', 'ORG', 'LOC')]

    # Keep entities of at most 3 words
    entidades = [ent[0] for ent in ent_select if len(ent[0].split()) <= 3]
    ent_clean = clean_all(entidades, accents=False)
    ent_unique = list(set([word for word in ent_clean if word not in stopwords_all]))

    ents_proc = {}

    pre_original_entities = []
    for ent in ent_unique:

        # Selection criterion: average frequency in the text of the words that form the entity
        weight = 0
        for word in ent.split():
            if word in words_counter:
                weight += 1 /len(ent.split()) * words_counter[word]

        ents_proc[ent] = round(weight,4)

    # Sort by weight (descending) and drop entities with zero weight
    ents_proc = {k: v for k, v in sorted(ents_proc.items(), key=lambda item: item[1], reverse=True) if v > 0}

    # Build the list of processed entities for this news item
    pre_entities = [key for key, _ in ents_proc.items()]

    # Get the last word of every entity that has more than one word
    ult_palabras = list(set([ent.split()[-1] for ent in pre_entities if len(ent.split()) > 1 ]))

    # Drop a single-word entity if it appears as the last word of a multi-word entity
    pre_entities_aux = []
    for ent in pre_entities:
        if not (len(ent.split()) == 1 and ent in ult_palabras):
            pre_entities_aux.append(ent)

    # Get the first word of every entity that has more than one word
    unicas_palabras = [ ent.split()[0] for ent in pre_entities_aux if len(ent.split()) > 1 ]

    # Drop a single-word entity if it appears as the first word of a multi-word entity
    pre_entities = []
    for ent in pre_entities_aux:
        if not (len(ent.split()) == 1 and ent in unicas_palabras):
            pre_entities.append(ent)

    # Get the filtered entities: the first 10, plus half of whatever exceeds 10
    if len(pre_entities) > 10:
        umbral = 10 + (len(pre_entities)-10) // 2
        entities = pre_entities[:umbral]
    else:
        entities = pre_entities[:10]

    # Recover the entities in their original form (casing and accents as found in the text)
    for ent in entities:
        pre_original_entities.append([elemento for elemento in entidades if elemento.lower() == ent.lower()])

    # Entities whose original form occurs most often go first
    sort_original_entities = sorted(pre_original_entities, key=len, reverse=True)

    original_entities.append([ent[0] for ent in sort_original_entities if ent])

0it [00:00, ?it/s]

171it [00:18,  9.48it/s]
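
The selection criterion used in the loop above scores each entity by the average frequency of its words in the document, counting only words that occur more than once. A minimal sketch of that scoring on toy data follows; the helper name `entity_weight` and the sample sentence are assumptions made for the illustration and are not part of the notebook.

```python
from collections import Counter


def entity_weight(entity, word_counts):
    """Average, over the words of the entity, of each word's count in the text."""
    words = entity.split()
    return round(sum(word_counts.get(w, 0) for w in words) / len(words), 4)


# Toy document, already lowercased and tokenized
text_words = "maduro habló en caracas maduro dijo que caracas espera a maduro".split()

# As in the cell above, only words that appear more than once are counted
word_counts = {w: c for w, c in Counter(text_words).items() if c > 1}

candidates = ["nicolás maduro", "caracas", "miraflores"]
weights = {ent: entity_weight(ent, word_counts) for ent in candidates}

# Entities with zero weight are discarded; the rest are sorted by weight
ranked = sorted(((e, w) for e, w in weights.items() if w > 0), key=lambda item: item[1], reverse=True)
print(ranked)
```

Because the score is an average, a multi-word entity is penalized when only one of its words is frequent in the text, which is why "nicolás maduro" ranks below "caracas" in this toy case.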


In [133]:
# Detect entities for a single document using spaCy (step-by-step check of the loop above)
entities_ = []
data_in =  data[4]

stopwords_all = set(SPANISH_STOPWORDS + SPANISH_STOPWORDS_SPECIAL)

# Count the words in the document (here every word is kept, including those that appear once)
normalized_text = re.sub(r'\W+', ' ', data_in.lower())
words_txt_without_stopwords = [word for word in normalized_text.split() if word not in stopwords_all]
words_txt_counter = Counter(words_txt_without_stopwords)
words_counter = {elemento: cuenta for elemento, cuenta in sorted(words_txt_counter.items(), key=lambda item:item[1], reverse=True) }

# Extract the entities of the document, keeping only persons, organizations and locations
extract = spa(data_in)
entidades_spacy = [(ent.text, ent.label_) for ent in extract.ents]
ent_select = [ent for ent in entidades_spacy if ent[1] in ('PER', 'ORG', 'LOC')]

# Keep entities of at most 3 words
entidades = [ent[0] for ent in ent_select if len(ent[0].split()) <= 3]
ent_clean = clean_all(entidades, accents=False)
ent_unique = list(set([word for word in ent_clean if word not in stopwords_all]))

ents_proc = {}

pre_original_entities = []
for ent in ent_unique:

    # Selection criterion: average frequency in the text of the words that form the entity
    weight = 0
    for word in ent.split():
        if word in words_counter:
            weight += 1 /len(ent.split()) * words_counter[word]

    ents_proc[ent] = round(weight,4)

# Sort by weight (descending) and drop entities with zero weight
ents_proc = {k: v for k, v in sorted(ents_proc.items(), key=lambda item: item[1], reverse=True) if v > 0}

# Build the list of processed entities for this news item
pre_entities = [key for key, _ in ents_proc.items()]

# Get the last word of every entity that has more than one word
ult_palabras = list(set([ent.split()[-1] for ent in pre_entities if len(ent.split()) > 1 ]))

# Drop a single-word entity if it appears as the last word of a multi-word entity
pre_entities_aux = []
for ent in pre_entities:
    if not (len(ent.split()) == 1 and ent in ult_palabras):
        pre_entities_aux.append(ent)

# Get the first word of every entity that has more than one word
unicas_palabras = [ ent.split()[0] for ent in pre_entities_aux if len(ent.split()) > 1 ]

# Drop a single-word entity if it appears as the first word of a multi-word entity
pre_entities = []
for ent in pre_entities_aux:
    if not (len(ent.split()) == 1 and ent in unicas_palabras):
        pre_entities.append(ent)

# Get the filtered entities: the first 10, plus half of whatever exceeds 10
if len(pre_entities) > 10:
    umbral = 10 + (len(pre_entities)-10) // 2
    entities_f = pre_entities[:umbral]
else:
    entities_f = pre_entities[:10]

# Recover the entities in their original form (casing and accents as found in the text)
for ent in entities_f:
    pre_original_entities.append([elemento for elemento in entidades if elemento.lower() == ent.lower()])

# Entities whose original form occurs most often go first
sort_original_entities = sorted(pre_original_entities, key=len, reverse=True)

entities_.append([ent[0] for ent in sort_original_entities if ent])

In [134]:
entities_

[['Mariana',
  'Diego Maradona',
  'Diez',
  'Juanito Belmonte',
  'Enrique Pinti',
  'Sembró',
  'Edelweiss',
  'calle Libertad',
  'Cuba',
  'Cocodrilo',
  'Salsa Criolla']]
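
Two of the filtering steps in the cell above can be isolated as small functions: dropping a single-word entity when it is the first or last word of a multi-word entity, and capping the list at 10 entities plus half of the excess. The sketch below uses hypothetical helper names (`drop_redundant_single_words`, `cap_entities`) and toy input; it is meant to mirror the logic of the cell, not to replace it.

```python
def drop_redundant_single_words(entities):
    """Remove single-word entities that start or end some multi-word entity."""
    compounds = [ent.split() for ent in entities if len(ent.split()) > 1]
    edge_words = {words[0] for words in compounds} | {words[-1] for words in compounds}
    return [ent for ent in entities if len(ent.split()) > 1 or ent not in edge_words]


def cap_entities(entities, base=10):
    """Keep the first `base` entities plus half of whatever exceeds `base`."""
    if len(entities) > base:
        return entities[: base + (len(entities) - base) // 2]
    return entities[:base]


sample = ["diego maradona", "maradona", "diego", "cuba", "salsa criolla", "salsa"]
deduplicated = drop_redundant_single_words(sample)
print(deduplicated)

# With 16 candidates the cap is 10 + (16 - 10) // 2 = 13
capped = cap_entities([f"e{i}" for i in range(16)])
print(len(capped))
```

Since multi-word entities are never removed, checking first and last words in a single pass gives the same result as the two consecutive passes in the cell.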

In [128]:
words_counter

{'lectura': 17,
 'voz': 12,
 'alta': 12,
 'práctica': 6,
 'niños': 5,
 'escuela': 5,
 'desarrollo': 4,
 'leer': 4,
 'social': 4,
 'libros': 4,
 'lingüístico': 3,
 'integral': 3,
 'investigación': 3,
 'hogar': 3,
 'vocabulario': 3,
 'iniciativas': 3,
 'presenta': 2,
 'herramienta': 2,
 'esencial': 2,
 'niñas': 2,
 'educativo': 2,
 'familiar': 2,
 'artículo': 2,
 'universidad': 2,
 'capacidad': 2,
 'palabras': 2,
 'actividad': 2,
 'texto': 2,
 'experiencia': 2,
 'conexión': 2,
 'beneficios': 2,
 'ofrece': 2,
 'mejora': 2,
 'escritura': 2,
 'fomentar': 2,
 'lazos': 2,
 'afectivos': 2,
 'padres': 2,
 'resalta': 2,
 'importancia': 2,
 'emocional': 2,
 'visual': 2,
 'sugieren': 2,
 'turnos': 2,
 'creación': 2,
 'atmósfera': 2,
 'propicia': 2,
 'genere': 2,
 'expectativa': 2}

In [None]:
# Save the entities
with open(PATH+f'modelos/entities{chunk}.json', 'w') as file:
    json.dump(entities, file)

# Save the spaCy entities
with open(PATH+f'modelos/entities_spa{chunk}.json', 'w') as file:
    json.dump(entities_spa, file)

## Keywords
Extract the keywords from the news items.

In [30]:
# Detect keywords for every document using spaCy (each token is stored with its POS tag)

keywords_spa = []
for doc in tqdm(data):
    extract = spa(doc)
    keywords_spa.append([(ext.text, ext.pos_) for ext in extract])  

100%|██████████| 3104/3104 [04:43<00:00, 10.96it/s]


### Keywords with neighbors

In [None]:
# Find the row position in the DataFrame from its index value
df_parquet.index.get_loc(105640350)

In [None]:
# Test example
doc = 211

# Get the most frequent 'NOUN' keywords
nouns = []
for token in keywords_spa[doc]:
    if token[1] == 'NOUN':
        nouns.append(token[0])

count_nouns = Counter(nouns)

count_nouns.most_common()[:10]

In [None]:
# Get the most frequent 'VERB' keywords
verbs = []
for token in keywords_spa[doc]:
    if token[1] == 'VERB':
        verbs.append(token[0])

count_verbs = Counter(verbs)

count_verbs.most_common()[:10]

In [None]:
# Test one document (lemmatized results)
keywords_spa_n = []
extract = spa(data[doc])
keywords_spa_n.append([(ext.lemma_, ext.pos_) for ext in extract])
keywords_spa_n[0][:10]


In [None]:
# Non-lemmatized results
extract = spa(data[doc])
tokens_and_labels = [(token.text, token.pos_) for token in extract if token.is_alpha]
tokens_and_labels[:10]

In [34]:
# Make a function to get all two-word combinations
def get_bigrams(word_list, number_consecutive_words=2):
    
    ngrams = []
    adj_length_of_word_list = len(word_list) - (number_consecutive_words - 1)
    
    #Loop through numbers from 0 to the (slightly adjusted) length of your word list
    for word_index in range(adj_length_of_word_list):
        
        #Index the list at each number, grabbing the word at that number index as well as N number of words after it
        ngram = word_list[word_index : word_index + number_consecutive_words]
        
        #Append this word combo to the master list "ngrams"
        ngrams.append(ngram)
        
    return ngrams

In [None]:
bigrams = get_bigrams(tokens_and_labels)
bigrams[:10]
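
For reference, the same sliding window can be written with `zip` over shifted copies of the list. The sketch below is an equivalent formulation on toy `(token, POS)` pairs; the function name `ngrams` and the sample tokens are assumptions for the example.

```python
def ngrams(items, n=2):
    """Return every run of n consecutive items as a list."""
    # zip stops at the shortest shifted copy, so no index arithmetic is needed
    return [list(window) for window in zip(*(items[i:] for i in range(n)))]


tokens = [("lectura", "NOUN"), ("en", "ADP"), ("voz", "NOUN"), ("alta", "ADJ")]

pairs = ngrams(tokens)        # bigrams
triples = ngrams(tokens, 3)   # trigrams
print(pairs)
```

When the list is shorter than `n`, the result is an empty list, which matches the behavior of the loop-based version.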


In [35]:
# return the most frequent words that appear next to a particular keyword
def get_neighbor_words(keyword, bigrams, pos_label = None):
    
    neighbor_words = []
    keyword = keyword.lower()
    
    for bigram in bigrams:
        
        #Extract just the lowercased words (not the labels) for each bigram
        words = [word.lower() for word, label in bigram]        
        
        #Check to see if keyword is in the bigram
        if keyword in words:
            idx = words.index(keyword)
            for word, label in bigram:
                
                #Now focus on the neighbor word, not the keyword
                if word.lower() != keyword:
                    #If the neighbor word matches the right pos_label, append it to the master list
                    if label == pos_label or pos_label is None:
                        if idx == 0:
                            neighbor_words.append(" ".join([keyword, word.lower()]))
                        else:
                            neighbor_words.append(" ".join([word.lower(), keyword]))
                    
    return Counter(neighbor_words).most_common()

In [None]:
for word in count_nouns.most_common():
    print(get_neighbor_words(word[0], bigrams, pos_label='ADJ'))
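
To make the neighbor search concrete, here is a compact, self-contained sketch of the same idea on toy data: for a given keyword, count the adjacent words that carry a given POS tag, preserving the word order found in the text. The function name `adjacent_pairs_with_pos` and the sample tokens are assumptions for the illustration.

```python
from collections import Counter


def adjacent_pairs_with_pos(tokens, keyword, neighbor_pos):
    """Count 'word word' pairs where one word is the keyword and the other has neighbor_pos."""
    keyword = keyword.lower()
    pairs = Counter()
    for (w1, p1), (w2, p2) in zip(tokens, tokens[1:]):
        w1, w2 = w1.lower(), w2.lower()
        if w1 == keyword and w2 != keyword and p2 == neighbor_pos:
            pairs[f"{w1} {w2}"] += 1   # keyword first, neighbor after it
        elif w2 == keyword and w1 != keyword and p1 == neighbor_pos:
            pairs[f"{w1} {w2}"] += 1   # neighbor first, keyword after it
    return pairs.most_common()


tokens = [
    ("La", "DET"), ("lectura", "NOUN"), ("compartida", "ADJ"), ("y", "CCONJ"),
    ("la", "DET"), ("lectura", "NOUN"), ("compartida", "ADJ"), ("ayudan", "VERB"),
]
neighbors = adjacent_pairs_with_pos(tokens, "lectura", "ADJ")
print(neighbors)
```

Only adjectives are counted here, so the determiner before "lectura" is ignored even though it is also adjacent.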

#### Complete function for keywords with neighbors

In [32]:
def keywords_with_neighboards(keywords_spa, POS_1='NOUN', POS_2='ADJ'):
    """
    Return two lists, each with one entry per document:
    - keywords with neighbors (according to the POS_1 and POS_2 arguments)
    - most frequent keywords (according to the POS_1 argument)
    """

    doc_kwn = []
    commons = []
    for keywords in keywords_spa:
    
        # Get the most frequent keywords of the given type (Universal Dependencies POS tag) for each doc (spaCy format)
        words = []
        for k_spa in keywords:
            if k_spa[1] == POS_1:
                words.append(k_spa[0])

        cont_words = Counter(words)

        common = cont_words.most_common()
        commons.append( [com for com in common if com[1] > 1] )

        # Compute a cutoff threshold (in repetitions) for the keywords obtained
            ## all the counts
        valores = [valor for _, valor in common]

            ## Compute the weights as proportional to the counts themselves
        pesos = np.array(valores) / np.sum(valores)

            ## Compute the weighted threshold, with a minimum of 2 (the keyword must be repeated at least once)
        threshold = max(2, round(np.sum(np.array(valores) * pesos),4))


        # Get the bigrams of the doc
        tokens_and_labels = [(token[0], token[1]) for token in keywords if token[0].isalpha()]

        bigrams = get_bigrams(tokens_and_labels)

        keywords_neighbor = []
        for item_common in common:
            if item_common[1] >= threshold or len(keywords_neighbor) < 6: # keep keywords at or above the threshold, or while fewer than 6 were collected
                
                kwn = get_neighbor_words(item_common[0], bigrams, pos_label=POS_2)
                if kwn != []:
                    keywords_neighbor.append( kwn )

        sorted_keywords_neighbor = sorted([item for sublist in keywords_neighbor for item in sublist ], key=lambda x: x[1], reverse=True)
        
        doc_kwn.append(sorted_keywords_neighbor)

    return doc_kwn, commons

In [36]:
k_w_n, common = keywords_with_neighboards(keywords_spa)
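
The cutoff used inside `keywords_with_neighboards` is a frequency-weighted mean of the counts: each count is weighted by its own share of the total, which reduces to the sum of squared counts divided by the sum of counts, with a floor of 2. A small worked example in plain Python follows; the counts are toy values chosen for the illustration.

```python
# Toy keyword counts for one document (most frequent first)
counts = [16, 11, 6, 2, 2]

total = sum(counts)                                   # 37
weights = [c / total for c in counts]                 # each count's share of the total
weighted_mean = sum(c * w for c, w in zip(counts, weights))

# Same value written directly: sum of squares over sum = 421 / 37
assert abs(weighted_mean - sum(c * c for c in counts) / total) < 1e-9

threshold = max(2, round(weighted_mean, 4))
print(threshold)

# A flat distribution falls back to the floor of 2
flat_counts = [1, 1, 1]
flat_threshold = max(2, round(sum(c * c for c in flat_counts) / sum(flat_counts), 4))
print(flat_threshold)
```

Weighting by the counts themselves pulls the threshold toward the dominant keywords, so in a document with a few very frequent nouns only those clear the cutoff.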

In [38]:
# keep only the pairs that are repeated at least once
filtered_k_w_n = [ [tupla[0] for tupla in sublista if tupla[1] > 1] for sublista in k_w_n ]
filtered_k_w_n[4]


[]

In [None]:
common[1]

In [None]:
filtered_common = [ [tupla[0] for i, tupla in enumerate(sublista) if i < 6] for sublista in common ]


In [None]:
filtered_common[1]

### KeyBERT

In [None]:
from keybert import KeyBERT

In [None]:
kw_model = KeyBERT()

In [None]:
keywords = kw_model.extract_keywords(data)

In [None]:
keywords[4]
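
When `extract_keywords` is given a list of documents, KeyBERT returns one list per document, and each item is a `(keyword, score)` tuple. If only the keyword strings are needed, for example to combine them with other lists of strings or to serialize them as JSON, the scores can be dropped first. The sketch below uses hand-written toy data in that shape rather than calling KeyBERT; the variable names and the scores are assumptions.

```python
import json

# Toy data in the shape assumed for the KeyBERT result: per document, a list of (keyword, score)
keywords_with_scores = [
    [("lectura", 0.62), ("libros", 0.48)],
    [("elecciones", 0.71)],
]

# Keep only the keyword strings, preserving the per-document structure
keyword_terms = [[term for term, _score in doc_keywords] for doc_keywords in keywords_with_scores]
print(keyword_terms)

# Lists of plain strings serialize cleanly
serialized = json.dumps(keyword_terms, ensure_ascii=False)
```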

#### BOW - Building the vocabulary from the entities and keywords

In [None]:
# Merge Entities + Keywords + Keywords with neighbors
vocab = list(set().union(*entities, *keywords, *filtered_k_w_n, *common[:10]))
len(vocab)

In [None]:
vocab[211]

In [None]:
# Save the vocabulary
with open(PATH+f'modelos/vocabulary{chunk}.json', 'w') as file:
    json.dump(vocab, file)

### Save the news in the database

In [None]:
# set batch_size (e.g. 5000) if the 100 MB per-operation limit in elasticsearch is exceeded
index_name = 'news'
bulk_data = []

# Merge Keywords + Keywords with neighbors
keywords_plus = [ list(set(keywords[i]+filtered_k_w_n[i])) for i in range(len(entities)) ]

for idx, text_news in tqdm(enumerate(data)):
    doc = {
        'index': {
            '_index': index_name,
            '_id': int(df_parquet.index[idx])
        }
    }
    reg = {
        'title': str(df_parquet.iloc[idx].in__title),
        'news' : str(text_news), 
        'author': str(df_parquet.iloc[idx]['Author Name']),
        'topics': {},
        'vector': None,
        'keywords' : keywords_plus[idx],
        'entities' : original_entities[idx],
        'created_at': datetime.now().isoformat(),
        'process': False
    }
    bulk_data.append(json.dumps(doc))
    bulk_data.append(json.dumps(reg))

# Join the list into a single newline-separated string
bulk_request_body = '\n'.join(bulk_data) + '\n'

# Send the bulk request
response = os_client.bulk(body=bulk_request_body)

if response['errors']:
    print("Errors found while inserting the documents")
else:
    print("Documents inserted successfully")
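
The comment at the top of the cell above mentions a `batch_size` for when a single bulk request becomes too large. One way to do that, sketched here under stated assumptions, is to build the newline-delimited body in fixed-size batches and send each batch separately. The helper name `build_bulk_bodies` and the sample documents are hypothetical; no client call is made in the sketch.

```python
import json


def build_bulk_bodies(index_name, docs, batch_size=5000):
    """Yield newline-delimited bulk bodies with at most batch_size documents each.

    docs is an iterable of (doc_id, source_dict) pairs.
    """
    lines = []
    count = 0
    for doc_id, source in docs:
        # Each document takes two lines: the action line and the document itself
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
        count += 1
        if count == batch_size:
            yield "\n".join(lines) + "\n"
            lines, count = [], 0
    if lines:
        # Remaining documents that did not fill a whole batch
        yield "\n".join(lines) + "\n"


docs = [(1, {"title": "a"}), (2, {"title": "b"}), (3, {"title": "c"})]
bodies = list(build_bulk_bodies("news", docs, batch_size=2))
print(len(bodies))
```

Each yielded string could then be passed to `os_client.bulk(body=...)` in the same way as the single request in the cell above, checking `response['errors']` after every batch.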


### Note:
- for each document, the entities stored are those that are repeated at least once (highest frequency)
- all the entities stored for all documents will be used as the vocabulary.

In [83]:
def funcion_aux(ID):
    keywords_df = df_parquet[df_parquet.index==ID]['Keyword Name'].values
    entities_df = df_parquet[df_parquet.index==ID]['Entity Name'].values
    fila = df_parquet.index.get_loc(ID)
    print(f"DataFrame keywords: {keywords_df}")
    print(f"DataFrame entities: {entities_df}")
    print("-"*80)
    print(f"Row: {fila}")
    print(f"Computed entities: {original_entities[fila]}")
    k_w_n, common = keywords_with_neighboards([keywords_spa[fila]])
    filtered_k_w_n = [ [tupla[0] for tupla in sublista if tupla[1] > 1] for sublista in k_w_n ]
    print(f"Computed keywords with neighbors: {filtered_k_w_n, common}")

funcion_aux(105641111)

DataFrame keywords: [array(['vocabulario', 'desarrollo lingüístico', 'libros', 'hogar',
        'lectura'], dtype=object)                                  ]
DataFrame entities: [array([''], dtype=object)]
--------------------------------------------------------------------------------
Row: 7
Computed entities: []
Computed keywords with neighbors: ([['voz alta', 'desarrollo lingüístico']], [[('lectura', 16), ('voz', 11), ('práctica', 6), ('niños', 5), ('escuela', 5), ('desarrollo', 4), ('libros', 4), ('hogar', 3), ('vocabulario', 3), ('manera', 3), ('iniciativas', 3), ('herramienta', 2), ('artículo', 2), ('través', 2), ('capacidad', 2), ('palabras', 2), ('actividad', 2), ('uso', 2), ('texto', 2), ('experiencia', 2), ('conexión', 2), ('beneficios', 2), ('escritura', 2), ('lazos', 2), ('padres', 2), ('investigación', 2), ('importancia', 2), ('turnos', 2), ('creación', 2), ('atmósfera', 2), ('expectativa', 2), ('apertura', 2), ('tiempo', 2)]])


In [None]:
print(data[-1])

In [None]:
sorted_word_count

In [None]:
# Get the last word of every entity that has more than one word
ult_palabras = [ent.split()[-1] for ent in pre_entities if len(ent.split()) > 1 ]
ult_palabras

In [None]:
# Iterate backwards so that deleting by index does not skip the element that follows
for idx in reversed(range(len(pre_entities))):
    ent = pre_entities[idx]
    if len(ent.split()) == 1 and ent in ult_palabras:
        del pre_entities[idx]
        del pre_original_entities[idx]

pre_entities

In [None]:
# Get the first word of every entity that has more than one word
unicas_palabras = [ ent.split()[0] for ent in pre_entities if len(ent.split()) > 1 ]
unicas_palabras

In [None]:
# Iterate backwards so that deleting by index does not skip the element that follows
for idx in reversed(range(len(pre_entities))):
    ent = pre_entities[idx]
    if len(ent.split()) == 1 and ent in unicas_palabras:
        del pre_entities[idx]
        del pre_original_entities[idx]

pre_entities

In [None]:
pre_original_entities