1. Vectorization and measuring similarity between documents

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import random

# Load the 20 newsgroups dataset
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# TF-IDF vectorization
vectorizer = TfidfVectorizer(max_df=0.9, min_df=5, stop_words='english')
X_train = vectorizer.fit_transform(newsgroups_train.data)

# Select 5 documents at random
random_indices = random.sample(range(X_train.shape[0]), 5)
for idx in random_indices:
    # Compute cosine similarity against every document in the training set
    cosine_sim = cosine_similarity(X_train[idx], X_train).flatten()

    # Find the 5 most similar documents, excluding the query document itself
    # (it is not guaranteed to rank first, e.g. when its TF-IDF vector is empty)
    ranked = cosine_sim.argsort()[::-1]
    similar_docs = ranked[ranked != idx][:5]

    print(f"\nDocument {idx} (truncated text):")
    print(newsgroups_train.data[idx][:300])  # Show the first 300 characters

    print("\nMost similar documents:")
    for sim_idx in similar_docs:
        print(f" - Similarity: {cosine_sim[sim_idx]:.4f} | Class: {newsgroups_train.target_names[newsgroups_train.target[sim_idx]]}")
        print(f"   Text: {newsgroups_train.data[sim_idx][:200]}...\n")



Document 10938 (truncated text):

Maybe not to you.  But to those who stand on this base, He is 
precious.

Most similar documents:
 - Similarity: 0.2658 | Class: misc.forsale
   Text: North heavy Duty hi hat stand $45  
	older stand... but definately in working shape.. could
	use a little clean up.  comes with clutch and felts, etc..

Pearl bass drum pedal with felt beater $20 

ho...

 - Similarity: 0.2076 | Class: rec.sport.baseball
   Text: 




I'm not sure I understand this question. When the IF rule is invoked,
the batter is automatically out. This relieves the runners from being
forced to advance to the next base if the ball is not c...

 - Similarity: 0.2001 | Class: comp.graphics
   Text: 
But the Question was later revealed to be:  What is 9 x 6?  (In the
base 13 system, of course.)

...

 - Similarity: 0.1973 | Class: sci.electronics
   Text: I know what the 68HC811E2 is all about, but I'm trying to figure
out what the 68SEC811E2 is... specifically, what does th
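
The ranking step in the cell above can be sketched without scikit-learn to show what is being computed: cosine similarity is the dot product of two length-normalized vectors. The function name `top_k_similar` and the toy matrix are illustrative assumptions, not part of the notebook. The sketch removes the query row explicitly and gives all-zero rows a similarity of 0.

```python
import numpy as np

def top_k_similar(X, query_idx, k=5):
    """Return (row index, cosine similarity) pairs for the k rows most similar to row query_idx."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    # All-zero rows (empty documents) keep similarity 0 instead of producing NaN.
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit = X / safe_norms[:, None]
    sims = unit @ unit[query_idx]
    # Rank in descending order and drop the query row explicitly,
    # rather than assuming it always lands in first place.
    order = np.argsort(-sims, kind="stable")
    order = order[order != query_idx][:k]
    return [(int(i), float(sims[i])) for i in order]

# Toy document-term matrix (hypothetical weights: 4 documents x 3 terms).
toy = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],  # an "empty" document
])
result = top_k_similar(toy, query_idx=0, k=2)
empty_result = top_k_similar(toy, query_idx=3, k=2)
print(result)
print(empty_result)
```

Rows 0 and 1 point in the same direction, so their similarity is 1 even though their magnitudes differ. This is why short and long posts on the same topic can still rank as close neighbours.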

2. Classification with Naïve Bayes and macro F1-score optimization

In [2]:
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

# Load the test split and vectorize it with the vocabulary fitted on train
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
X_test = vectorizer.transform(newsgroups_test.data)
y_train, y_test = newsgroups_train.target, newsgroups_test.target

# MultinomialNB with GridSearchCV to tune the smoothing parameter alpha
params = {'alpha': [0.1, 0.5, 1.0]}
mnb = GridSearchCV(MultinomialNB(), param_grid=params, scoring='f1_macro')
mnb.fit(X_train, y_train)

# ComplementNB with the same grid
cnb = GridSearchCV(ComplementNB(), param_grid=params, scoring='f1_macro')
cnb.fit(X_train, y_train)

# Evaluate the best estimator from each search on the held-out test set
y_pred_mnb = mnb.best_estimator_.predict(X_test)
y_pred_cnb = cnb.best_estimator_.predict(X_test)

# The values printed below are test-set macro F1 scores, not cross-validation scores
print("Classification results:")
print(f"MultinomialNB - Best F1 Macro: {f1_score(y_test, y_pred_mnb, average='macro'):.4f}")
print(f"ComplementNB - Best F1 Macro: {f1_score(y_test, y_pred_cnb, average='macro'):.4f}")


Classification results:
MultinomialNB - Best F1 Macro: 0.6772
ComplementNB - Best F1 Macro: 0.6823
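
To make the metric above concrete, here is a minimal sketch of macro F1 computed by hand: the per-class F1 scores are averaged without weighting, so each of the 20 newsgroups counts equally regardless of its size. The helper `macro_f1` and the toy labels are illustrative assumptions. The result should agree with `f1_score(..., average='macro')` when the labels are taken from the union of true and predicted values.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for label in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Toy labels (hypothetical): class 0 is larger than classes 1 and 2.
y_true_toy = [0, 0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 0, 1, 1, 2, 2]
score = macro_f1(y_true_toy, y_pred_toy)
print(f"{score:.4f}")
```

Here the per-class F1 values are 6/7, 1/2 and 2/3. The weaker scores on the two small classes pull the macro average down more than they would pull down accuracy.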


3. Matrix transposition and similarity between words

In [3]:
# Transpose the document-term matrix into a term-document matrix
term_doc_matrix = X_train.T

# Compute cosine similarity between words (dense vocabulary x vocabulary matrix)
cosine_terms = cosine_similarity(term_doc_matrix, term_doc_matrix)

# Get the vocabulary
feature_names = vectorizer.get_feature_names_out()

# Pick 5 words by hand
selected_words = ['computer', 'space', 'religion', 'car', 'science']

print("\nWord similarity:")
for word in selected_words:
    idx = np.where(feature_names == word)[0][0]
    # Rank by descending similarity and drop the word itself
    ranked = np.argsort(-cosine_terms[idx])
    similar_terms = ranked[ranked != idx][:5]

    print(f"\nWord: '{word}' - Most similar words:")
    for sim_idx in similar_terms:
        print(f" - {feature_names[sim_idx]} (Score: {cosine_terms[idx, sim_idx]:.4f})")



Word similarity:

Word: 'computer' - Most similar words:
 - shopper (Score: 0.1349)
 - verlag (Score: 0.1248)
 - delicate (Score: 0.1196)
 - drive (Score: 0.1105)
 - hackers (Score: 0.1082)

Word: 'space' - Most similar words:
 - nasa (Score: 0.3178)
 - shuttle (Score: 0.2784)
 - exploration (Score: 0.2328)
 - aeronautics (Score: 0.2219)
 - cfa (Score: 0.2164)

Word: 'religion' - Most similar words:
 - religious (Score: 0.2475)
 - religions (Score: 0.2237)
 - crusades (Score: 0.1936)
 - christianity (Score: 0.1882)
 - categorized (Score: 0.1849)

Word: 'car' - Most similar words:
 - cars (Score: 0.1923)
 - dealer (Score: 0.1773)
 - civic (Score: 0.1634)
 - loan (Score: 0.1560)
 - owner (Score: 0.1484)

Word: 'science' - Most similar words:
 - cognitivists (Score: 0.3941)
 - behaviorists (Score: 0.3941)
 - scientific (Score: 0.3624)
 - empirical (Score: 0.2890)
 - sects (Score: 0.2538)
