# Artificial Intelligence Project
 ## News Analysis

 Authors:
 - Marta Aguilar Morcillo
 - Candela Jazmín Gutiérrez González

Date: 30/05/2025

June exam session.

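 ## 1. Reading the data

 We begin by reading the corpus. This requires the following libraries and resources:
 - **nltk:** for tokenization, part-of-speech tagging, stopword lists and WordNet-based lemmatization.
 - **punkt_tab:** the NLTK resource used to tokenize the words of the documents.
 - **contractions:** for expanding English contractions before tokenization.
 - **sklearn:** for TF-IDF vectorization and cosine similarity.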

In [42]:
!pip install nltk
import nltk

from nltk import download

download('punkt_tab')                           # Tokenization
nltk.download('averaged_perceptron_tagger')     # POS tagging
nltk.download('averaged_perceptron_tagger_eng') # POS tagging
nltk.download('wordnet')                        # WordNet lemmatizer
nltk.download('omw-1.4')                        # Multilingual WordNet



[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [43]:
from nltk.tokenize import word_tokenize
from nltk.stem.lancaster import LancasterStemmer
from nltk.corpus import stopwords
from nltk.data import path
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict
from nltk.corpus import wordnet as wn
import numpy as np

path.append(".")

In [44]:
!pip install contractions
import contractions



In [45]:
import csv
import pandas as pd
from bs4 import BeautifulSoup
from pprint import pprint
import re
from bs4 import MarkupResemblesLocatorWarning
import warnings

In [46]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score
import spacy

In [69]:
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter

In [48]:
warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)

In [49]:
palabras_vacias_ingles = stopwords.words('english')

In [50]:
nlp = spacy.load("en_core_web_sm")

In [51]:
def elimina_html(contenido):
    # Name the parser explicitly so BeautifulSoup does not have to guess one
    return BeautifulSoup(contenido, "html.parser").get_text()

def elimina_no_alfanumerico(contenido):
    return [re.sub(r'[^\w]', '', palabra)
            for palabra in contenido
            if re.search(r'\w', palabra)]

def expandir_constracciones(contenido):
    return contractions.fix(contenido)

def pasar_a_minuscula(contenido):
    return contenido.lower()

def limpiar_texto(texto):
    texto = re.sub(r'[^a-zA-Z\s]', ' ', texto)  # Replace anything that is not a letter or whitespace with a space
    texto = re.sub(r'\s+', ' ', texto).strip()
    return texto

def elimina_palabras_vacias(contenido):
    return [palabra for palabra in contenido if palabra not in palabras_vacias_ingles]

def lematizador(contenido):
    lemmatizer = WordNetLemmatizer()
    pos_tags = pos_tag(contenido)

    resultado = []
    for palabra, tag in pos_tags:
        if tag.startswith('VB'):  # Verbs
            resultado.append(lemmatizer.lemmatize(palabra, pos='v'))  # base form
        else:  # Nouns and everything else are kept unchanged
            resultado.append(palabra)

    return resultado

def extraer_noun_chunks(tokens):
    resultados = []
    doc = nlp(" ".join(tokens))
    
    noun_chunks = [chunk.text.lower().strip() for chunk in doc.noun_chunks if len(chunk.text.split()) <= 3]
    noun_chunks_set = set(noun_chunks)

    i = 0
    while i < len(tokens):
        composed2 = " ".join(tokens[i:i+2]).lower()
        composed3 = " ".join(tokens[i:i+3]).lower()

        if composed3 in noun_chunks_set:
            i += 3  
        elif composed2 in noun_chunks_set:
            i += 2 
        else:
            resultados.append(tokens[i].lower())  
            i += 1

    return resultados + noun_chunks
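
The merging loop in `extraer_noun_chunks` can be illustrated without spaCy. The sketch below is a simplified stand-in, not the function above: `merge_chunks` and the hand-written `chunks` list are hypothetical and replace the noun chunks that `nlp(...)` would detect.

```python
def merge_chunks(tokens, chunks):
    """Drop tokens covered by a 2- or 3-word chunk, then append the chunks."""
    chunk_set = set(chunks)
    kept = []
    i = 0
    while i < len(tokens):
        three = " ".join(tokens[i:i + 3])
        two = " ".join(tokens[i:i + 2])
        if three in chunk_set:
            i += 3              # skip the three tokens that form a chunk
        elif two in chunk_set:
            i += 2              # skip the two tokens that form a chunk
        else:
            kept.append(tokens[i])
            i += 1
    return kept + list(chunks)


tokens = ["union", "territory", "administration", "withdraw", "circular"]
chunks = ["union territory administration"]   # assumed chunk, chosen by hand
print(merge_chunks(tokens, chunks))
# ['withdraw', 'circular', 'union territory administration']
```

One consequence of this scheme is that one-word chunks are generally not removed by the loop (only two- and three-word windows are compared in the middle of the list), so they can appear twice in the result: once as a token and once as a chunk.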


In [52]:
def proceso_contenido(texto):
    texto = elimina_html(texto)
    texto = expandir_constracciones(texto)
    texto = pasar_a_minuscula(texto)
    texto = limpiar_texto(texto)                # Clean before tokenizing
    tokens = word_tokenize(texto)
    tokens = elimina_no_alfanumerico(tokens)    # Clean individual tokens
    tokens = elimina_palabras_vacias(tokens)
    tokens = lematizador(tokens)
    return tokens
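
The first steps of this pipeline (lowercasing, regex cleaning, stopword removal) can be tried in isolation. The following is a minimal sketch using only the standard library: `clean`, `tokenize_and_filter` and the tiny `STOPWORDS` set are hypothetical stand-ins, and whitespace splitting replaces `word_tokenize`.

```python
import re

STOPWORDS = {"the", "a", "of", "to", "on"}   # tiny assumed subset, not NLTK's list


def clean(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace only
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    return text


def tokenize_and_filter(text):
    # Whitespace split stands in for word_tokenize in this sketch
    return [w for w in clean(text).split() if w not in STOPWORDS]


sample = "The MISA detainees on Sunday draw Rs 25,000!"
print(clean(sample))                # the misa detainees on sunday draw rs
print(tokenize_and_filter(sample))  # ['misa', 'detainees', 'sunday', 'draw', 'rs']
```

Because the regex already leaves only letters and spaces, digits such as `25,000` disappear entirely, and a later per-token pass that strips non-word characters appears to have nothing left to remove.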

In [53]:
def lectura_normalizada_corpus():
    df = pd.read_csv("news_corpus.csv", encoding="latin-1", sep=";", quotechar='"')
    resultados = []

    df = df.head(105)
    for index, fila in df.iterrows():
        autor = [fila.iloc[0]]
        titulo = fila.iloc[1]
        cuerpo = fila.iloc[2]   

        titulo_proc = proceso_contenido(titulo)
        cuerpo_proc = proceso_contenido(cuerpo)

        # Merge author, title and body into a single token list
        fila_combinada =  autor + titulo_proc + cuerpo_proc

        contenido_final = extraer_noun_chunks(fila_combinada)
        resultados.append(contenido_final)
    return resultados
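
The `sep=";"` and `quotechar='"'` arguments matter because a news body may itself contain semicolons. A minimal sketch with the standard `csv` module shows the effect on an in-memory sample; the column names and the row are invented for illustration and are not taken from `news_corpus.csv`.

```python
import csv
import io

# Hypothetical two-line file: a header plus one quoted row
raw = 'author;title;body\n"Jane Doe";"Short title";"Body with; a semicolon"\n'

reader = csv.reader(io.StringIO(raw), delimiter=";", quotechar='"')
header = next(reader)
rows = list(reader)

print(header)   # ['author', 'title', 'body']
print(rows[0])  # ['Jane Doe', 'Short title', 'Body with; a semicolon']
```

The semicolon inside the quoted body is kept as text rather than treated as a field separator.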

In [54]:
def lectura_documento(documento):
    documento_procesado = proceso_contenido(documento) 
    contenido_final = extraer_noun_chunks(documento_procesado)
    return contenido_final

In [55]:
# Show the first 3 processed documents
def prueba_primeros_3_documentos_procesados(corpus):
    for i, documento in enumerate(corpus[:3]):
        print(f"Documento {i+1}:")
        print(" - Palabras:", documento)
        print()

In [67]:
def expand_term(term):
    related = set()
    for syn in wn.synsets(term):
        for lemma in syn.lemmas():
            word = lemma.name().replace('_', ' ').lower()
            if word != term:
                related.add(word)
    return related

def expand_corpus_with_synonyms(corpus):
    expanded_corpus = []
    for doc in corpus:
        # Count how many times each word occurs
        doc_counter = Counter(doc)
        expanded_doc = []
        for word, count in doc_counter.items():
            # Add the original word as many times as it occurs
            expanded_doc.extend([word] * count)
            # Get its synonyms and add them with the same frequency
            synonyms = expand_term(word)
            for syn in synonyms:
                expanded_doc.extend([syn] * count)
        expanded_corpus.append(expanded_doc)
    return expanded_corpus

def tfidf_por_documentos_con_sinonimos(corpus_normalizado):
    # Expand the corpus with synonyms and convert each document to a string
    corpus_expandido = expand_corpus_with_synonyms(corpus_normalizado)
    texts = [" ".join(doc) for doc in corpus_expandido]

    # TF-IDF vectorizer (unigrams and bigrams)
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(texts)  # Sparse TF-IDF matrix
    terms = vectorizer.get_feature_names_out()

    # Return the matrix, the vocabulary (terms) and the fitted vectorizer
    return X, terms, vectorizer
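
The frequency-preserving expansion can be followed on a toy example. This is a sketch under stated assumptions: `expand_doc` is a hypothetical helper, and `TOY_SYNONYMS` is a hand-written dictionary standing in for the WordNet lookup done by `expand_term`.

```python
from collections import Counter

# Hypothetical synonym table; the notebook obtains synonyms from WordNet instead
TOY_SYNONYMS = {"order": {"decree"}, "leave": {"depart", "go away"}}


def expand_doc(tokens, synonyms):
    expanded = []
    for word, count in Counter(tokens).items():
        expanded.extend([word] * count)            # original word, original frequency
        for syn in sorted(synonyms.get(word, ())):
            expanded.extend([syn] * count)         # each synonym, same frequency
    return expanded


print(expand_doc(["order", "leave", "order"], TOY_SYNONYMS))
# ['order', 'order', 'decree', 'decree', 'leave', 'depart', 'go away']
```

Two side effects are worth keeping in mind when the result is joined into a string and vectorized with `ngram_range=(1, 2)`: grouping by `Counter` discards the original word order, so bigrams are formed between tokens that were not adjacent in the text, and multi-word synonyms such as `go away` are split back into separate words by the vectorizer's tokenizer.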


In [57]:
def tfidf_del_documento(documento, vectorizer):
    documento_normalizado = lectura_documento(documento)
    # Convert the document (list of tokens) to a string
    texto = " ".join(documento_normalizado)
    # Transform with the already fitted vectorizer
    X_doc = vectorizer.transform([texto])  # returns a 1xN sparse matrix
    return X_doc

In [58]:
def prueba_tfdifs_primeros_3_documentos(lista_diccionarios_tfidfs):
    for idx, d in enumerate(lista_diccionarios_tfidfs[:3]):
        top_terms = sorted(d.items(), key=lambda x: x[1], reverse=True)
        print(f"\n Documento {idx+1}:")
        print("      Palabras añadidas por TF-IDF:")
        for term, score in top_terms:
            print(f"      - {term}: {score:.4f}")

In [59]:
def tfidf_por_documentos_sin_sinonimos(corpus_normalizado):
    # Convert the corpus to a list of strings
    texts = [" ".join(doc) for doc in corpus_normalizado]

    # TF-IDF vectorizer (unigrams and bigrams)
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(texts)  # Sparse TF-IDF matrix
    terms = vectorizer.get_feature_names_out()

    # Return the matrix, the vocabulary (terms) and the fitted vectorizer
    return X, terms, vectorizer
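
To make the weights concrete, the sketch below computes TF-IDF by hand for unigrams only. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row); the `tfidf` helper and the two toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))            # document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for d in docs:
        tf = Counter(d)                                       # raw term counts
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0      # avoid division by zero
        rows.append([x / norm for x in raw])
    return vocab, rows


docs = [["order", "circular"], ["order", "festival"]]
vocab, rows = tfidf(docs)
for row in rows:
    print(dict(zip(vocab, (round(x, 4) for x in row))))
```

In this toy corpus `order` occurs in both documents, so it receives a lower weight than `circular` or `festival`, which each occur in only one.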

In [60]:
def construir_matriz_tfidf_desde_diccionarios(tfidf_dicts):
    # Build a unified vocabulary (with or without synonyms)
    vocabulario = sorted(set().union(*[set(d.keys()) for d in tfidf_dicts]))

    matriz = []
    for doc_dict in tfidf_dicts:
        vector = [doc_dict.get(term, 0.0) for term in vocabulario]
        matriz.append(vector)

    matriz_tfidf = np.array(matriz)
    # Note: this vectorizer has a fixed vocabulary but is never fitted, so it has
    # no IDF weights; calling transform() on it raises NotFittedError.
    vectorizer = TfidfVectorizer(vocabulary=vocabulario)

    return matriz_tfidf, vectorizer

In [61]:
def similitud_coseno(tfidf_corpus, tfidf_doc, umbral=0.0):
    # Compute the cosine similarity between the document and the corpus
    similitudes = cosine_similarity(tfidf_doc, tfidf_corpus).flatten()

    # Keep only the documents above the threshold
    indices_filtrados = [i for i, sim in enumerate(similitudes) if sim > umbral]

    # Sort the indices by descending similarity
    indices_ordenados = sorted(indices_filtrados, key=lambda i: similitudes[i], reverse=True)

    # Return a list of (index, similarity) pairs
    return [(i, similitudes[i]) for i in indices_ordenados]

def documentos_similares(lista_similitudes):
    for idx, score in lista_similitudes:
        print(f"Documento {idx} tiene similitud: {score:.4f}")
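
The ranking logic of `similitud_coseno` can be reproduced on plain Python lists. This is an illustrative sketch rather than the notebook's implementation: `cosine` and `rank` are hypothetical helpers, and the three corpus vectors are made up.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0   # zero vectors get similarity 0


def rank(corpus_vectors, query, threshold=0.0):
    scores = [(i, cosine(vec, query)) for i, vec in enumerate(corpus_vectors)]
    hits = [(i, s) for i, s in scores if s > threshold]   # strict comparison, as above
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


corpus = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 1.0, 0.0]
for idx, score in rank(corpus, query):
    print(f"Documento {idx} tiene similitud: {score:.4f}")
```

With the default threshold of `0.0`, documents that share no term with the query (the third vector here) are dropped, while any positive overlap, however small, is kept.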
    

In [62]:
# Reading test
resultados = lectura_normalizada_corpus()
prueba_primeros_3_documentos_procesados(resultados)

Documento 1:
 - Palabras: ['diu', 'revoke', 'mandatory', 'rakshabandhan', 'offices', 'order', 'daman', 'wednesday', 'withdraw', 'circular', 'ask', 'tie', 'rakhis', 'male', 'colleagues', 'order', 'trigger', 'backlash', 'employees', 'rip', 'apart', 'social', 'media', 'union', 'territory', 'administration', 'force', 'retreat', 'within', 'circular', 'make', 'celebrate', 'decide', 'celebrate', 'festival', 'rakshabandhan', 'august', 'connection', 'offices', 'departments', 'shall', 'remain', 'collectively', 'suitable', 'time', 'wherein', 'shall', 'tie', 'rakhis', 'colleagues', 'order', 'issue', 'august', 'gurpreet', 'singh', 'deputy', 'secretary', 'personnel', 'say', 'ensure', 'one', 'skipped', 'office', 'attendance', 'report', 'send', 'government', 'next', 'one', 'mandate', 'celebration', 'rakshabandhan', 'leave', 'withdraw', 'mandate', 'daman', 'day', 'apart', 'circular', 'withdrawn', 'one', 'line', 'order', 'issue', 'late', 'evening', 'ut', 'department', 'personnel', 'administrative', 'ref

In [63]:
documento = "Union minister for transport and shipping Nitin Gadkari has demanded that the Maharashtra government provide special facilities to those detained during the Emergency under the Maintenance of Internal Security Act (MISA). The MISA detainees on?Sunday held a convention under the banner of Satyagrahi Sangh in Nagpur after meeting Gadkari at his Mahal residence. The group, led by the Sangh?s vice-president Sacchidanand Upasane, national secretary Komal Chheda and Maharashtra unit chief Jayprakash Pande insisted that the detainees be treated as freedom fighters and be given all facilities available to a freedom fighter in the country. Incidentally, most of the MISA detainees are RSS activists or its supporters.The group submitted a memorandum to Gadkari, briefing him about the facilities and recognition extended to detainees in states like UP, MP, Bihar and Chhattisgarh. They claimed that in these states, the detainees are considered at par with the freedom fighters and given pensions and other facilities. The government in Rajasthan recently formed a committee to study facilities available to MISA and Defence of India Rules 1971 detainees in other states. The committee will its report to the state soon, they claimed. The group also pointed out that in Madhya Pradesh, MISA detainees, dubbed as ?Democracy Warrior? draw a monthly honorarium of Rs 25,000. Gadkari told the group he had already spoken to chief minister Devendra Fadnavis and the state finance minister, Sudhir Munganttiwar, in this regard. ?Both were positive on the issue.? The Congress, meanwhile, has accused the BJP-led government of trying to give such facilities to its Sangh Parivar members. ?The move is totally a political one,? says former Union minister Vilas Muttemwar. Muttemwar warned his party would launch a statewide agitation if government accepted the group?s demand.?How can these Sangh Parivar members be compared with freedom fighters??"
lectura = lectura_documento(documento)
print(" - Palabras:", lectura)

 - Palabras: ['union', 'minister', 'transport', 'ship', 'nitin', 'gadkari', 'demand', 'maharashtra', 'government', 'provide', 'special', 'facilities', 'detain', 'emergency', 'maintenance', 'internal', 'security', 'act', 'misa', 'misa', 'detainees', 'sunday', 'hold', 'sangh', 'nagpur', 'meeting', 'gadkari', 'mahal', 'residence', 'group', 'lead', 'sangh', 'sacchidanand', 'upasane', 'national', 'secretary', 'komal', 'chief', 'jayprakash', 'pande', 'insist', 'detainees', 'treat', 'give', 'facilities', 'available', 'freedom', 'fighter', 'country', 'incidentally', 'misa', 'detainees', 'rss', 'activists', 'supporters', 'group', 'submit', 'memorandum', 'gadkari', 'brief', 'facilities', 'recognition', 'extend', 'like', 'mp', 'states', 'detainees', 'consider', 'par', 'give', 'pensions', 'facilities', 'government', 'rajasthan', 'recently', 'form', 'available', 'misa', 'defence', 'india', 'rules', 'committee', 'report', 'state', 'soon', 'claim', 'group', 'also', 'point', 'madhya', 'pradesh', 'misa

In [64]:
# Similarity test without synonyms
documento = "Union minister for transport and shipping Nitin Gadkari has demanded that the Maharashtra government provide special facilities to those detained during the Emergency under the Maintenance of Internal Security Act (MISA). The MISA detainees on?Sunday held a convention under the banner of Satyagrahi Sangh in Nagpur after meeting Gadkari at his Mahal residence. The group, led by the Sangh?s vice-president Sacchidanand Upasane, national secretary Komal Chheda and Maharashtra unit chief Jayprakash Pande insisted that the detainees be treated as freedom fighters and be given all facilities available to a freedom fighter in the country. Incidentally, most of the MISA detainees are RSS activists or its supporters.The group submitted a memorandum to Gadkari, briefing him about the facilities and recognition extended to detainees in states like UP, MP, Bihar and Chhattisgarh. They claimed that in these states, the detainees are considered at par with the freedom fighters and given pensions and other facilities. The government in Rajasthan recently formed a committee to study facilities available to MISA and Defence of India Rules 1971 detainees in other states. The committee will its report to the state soon, they claimed. The group also pointed out that in Madhya Pradesh, MISA detainees, dubbed as ?Democracy Warrior? draw a monthly honorarium of Rs 25,000. Gadkari told the group he had already spoken to chief minister Devendra Fadnavis and the state finance minister, Sudhir Munganttiwar, in this regard. ?Both were positive on the issue.? The Congress, meanwhile, has accused the BJP-led government of trying to give such facilities to its Sangh Parivar members. ?The move is totally a political one,? says former Union minister Vilas Muttemwar. Muttemwar warned his party would launch a statewide agitation if government accepted the group?s demand.?How can these Sangh Parivar members be compared with freedom fighters??"
corpus_normalizado = lectura_normalizada_corpus()
tfidf_corpus_sin_sinonimos, terms, vectorizer = tfidf_por_documentos_sin_sinonimos(corpus_normalizado)
tfidf_documento = tfidf_del_documento(documento, vectorizer)
resultados_similitud = similitud_coseno(tfidf_corpus_sin_sinonimos, tfidf_documento)
documentos_similares(resultados_similitud)

Documento 19 tiene similitud: 0.1114
Documento 71 tiene similitud: 0.0938
Documento 82 tiene similitud: 0.0837
Documento 4 tiene similitud: 0.0817
Documento 75 tiene similitud: 0.0812
Documento 103 tiene similitud: 0.0758
Documento 0 tiene similitud: 0.0679
Documento 42 tiene similitud: 0.0672
Documento 104 tiene similitud: 0.0658
Documento 94 tiene similitud: 0.0648
Documento 39 tiene similitud: 0.0594
Documento 44 tiene similitud: 0.0592
Documento 15 tiene similitud: 0.0550
Documento 18 tiene similitud: 0.0482
Documento 36 tiene similitud: 0.0476
Documento 14 tiene similitud: 0.0452
Documento 88 tiene similitud: 0.0447
Documento 83 tiene similitud: 0.0444
Documento 52 tiene similitud: 0.0444
Documento 2 tiene similitud: 0.0413
Documento 63 tiene similitud: 0.0372
Documento 9 tiene similitud: 0.0369
Documento 22 tiene similitud: 0.0364
Documento 33 tiene similitud: 0.0361
Documento 99 tiene similitud: 0.0361
Documento 95 tiene similitud: 0.0340
Documento 62 tiene similitud: 0.0324
Doc

In [70]:
# Similarity test with synonyms
documento = "Union minister for transport and shipping Nitin Gadkari has demanded that the Maharashtra government provide special facilities to those detained during the Emergency under the Maintenance of Internal Security Act (MISA). The MISA detainees on?Sunday held a convention under the banner of Satyagrahi Sangh in Nagpur after meeting Gadkari at his Mahal residence. The group, led by the Sangh?s vice-president Sacchidanand Upasane, national secretary Komal Chheda and Maharashtra unit chief Jayprakash Pande insisted that the detainees be treated as freedom fighters and be given all facilities available to a freedom fighter in the country. Incidentally, most of the MISA detainees are RSS activists or its supporters.The group submitted a memorandum to Gadkari, briefing him about the facilities and recognition extended to detainees in states like UP, MP, Bihar and Chhattisgarh. They claimed that in these states, the detainees are considered at par with the freedom fighters and given pensions and other facilities. The government in Rajasthan recently formed a committee to study facilities available to MISA and Defence of India Rules 1971 detainees in other states. The committee will its report to the state soon, they claimed. The group also pointed out that in Madhya Pradesh, MISA detainees, dubbed as ?Democracy Warrior? draw a monthly honorarium of Rs 25,000. Gadkari told the group he had already spoken to chief minister Devendra Fadnavis and the state finance minister, Sudhir Munganttiwar, in this regard. ?Both were positive on the issue.? The Congress, meanwhile, has accused the BJP-led government of trying to give such facilities to its Sangh Parivar members. ?The move is totally a political one,? says former Union minister Vilas Muttemwar. Muttemwar warned his party would launch a statewide agitation if government accepted the group?s demand.?How can these Sangh Parivar members be compared with freedom fighters??"
corpus_normalizado = lectura_normalizada_corpus()
tfidf_corpus_con_sinonimos, terms, vectorizer = tfidf_por_documentos_con_sinonimos(corpus_normalizado)
tfidf_documento = tfidf_del_documento(documento, vectorizer)
resultados_similitud = similitud_coseno(tfidf_corpus_con_sinonimos, tfidf_documento)
documentos_similares(resultados_similitud)

Documento 82 tiene similitud: 0.1071
Documento 71 tiene similitud: 0.0881
Documento 103 tiene similitud: 0.0659
Documento 39 tiene similitud: 0.0616
Documento 99 tiene similitud: 0.0608
Documento 44 tiene similitud: 0.0600
Documento 42 tiene similitud: 0.0571
Documento 15 tiene similitud: 0.0551
Documento 0 tiene similitud: 0.0544
Documento 75 tiene similitud: 0.0540
Documento 36 tiene similitud: 0.0537
Documento 19 tiene similitud: 0.0493
Documento 104 tiene similitud: 0.0468
Documento 94 tiene similitud: 0.0467
Documento 14 tiene similitud: 0.0457
Documento 9 tiene similitud: 0.0451
Documento 4 tiene similitud: 0.0422
Documento 88 tiene similitud: 0.0371
Documento 43 tiene similitud: 0.0369
Documento 61 tiene similitud: 0.0364
Documento 76 tiene similitud: 0.0351
Documento 96 tiene similitud: 0.0342
Documento 83 tiene similitud: 0.0341
Documento 95 tiene similitud: 0.0336
Documento 25 tiene similitud: 0.0315
Documento 52 tiene similitud: 0.0309
Documento 33 tiene similitud: 0.0308
Do
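
One simple way to compare the two rankings is to measure how much their top results overlap. The sketch below uses the first five document indices printed by each run above; the choice of five and of the Jaccard index is an assumption made for illustration.

```python
top_without = [19, 71, 82, 4, 75]    # first five indices printed without synonyms
top_with = [82, 71, 103, 39, 99]     # first five indices printed with synonyms

shared = sorted(set(top_without) & set(top_with))
jaccard = len(shared) / len(set(top_without) | set(top_with))

print(shared)    # [71, 82]
print(jaccard)   # 0.25
```

Only two of the top five documents are common to both runs, which suggests that the synonym expansion changes the ranking noticeably for this query.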

NotFittedError: The TF-IDF vectorizer is not fitted