#### **TF-IDF**
TF-IDF stands for "Term Frequency-Inverse Document Frequency". It is a technique used in information retrieval and text mining to quantify how important a word is to a document, relative to a collection of documents (often called a "corpus").

- **Term Frequency (TF):** The frequency of a word within a document: the number of times the word appears in the document divided by the total number of words in that document. It measures how important the word is within that specific document.
$$ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d} {\text{Total number of terms in document } d} $$

- **Inverse Document Frequency (IDF):** A measure of how rare a word is across the whole set of documents. The idea behind IDF is that words appearing in many documents (such as "y", "es", "en" in Spanish) are less informative than words that appear in only a few documents.
$$ \text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right) $$
By default (`smooth_idf=True`), scikit-learn uses the following smoothed variant of the IDF formula, which avoids division by zero:
$$ \text{IDF}(t) = \log\left(\frac{1 + \text{Total number of documents}}{1 + \text{Number of documents containing term } t}\right) + 1 $$
Finally, TF-IDF is simply the product of TF and IDF:
$$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $$
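
The formulas above can be checked by hand on a tiny corpus. The following is a minimal sketch using only the standard library; the three-document corpus and the helper names (`tf`, `idf`, `idf_smooth`, `tf_idf`) are hypothetical and are not part of this notebook.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized and normalized.
docs = [
    ["amor", "noche", "amor"],
    ["noche", "baile"],
    ["amor", "baile", "salsa", "salsa"],
]


def tf(term, doc):
    # Occurrences of the term divided by the document length.
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    # Plain IDF; raises ZeroDivisionError if no document contains the term.
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)


def idf_smooth(term, docs):
    # Smoothed variant: adding 1 to both counts keeps the denominator positive.
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log((1 + len(docs)) / (1 + n_containing)) + 1


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


for term in ["amor", "salsa"]:
    print(term, tf_idf(term, docs[2], docs), idf_smooth(term, docs))
```

In the third document, "salsa" scores higher than "amor": it occurs twice there and appears in no other document, whereas "amor" occurs once and is shared with the first document.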

In [159]:
import sklearn.feature_extraction.text
import pandas
import numpy
import plotly.express as px
import spacy
import simplemma
import random

In [160]:
model_es = spacy.load("es_core_news_sm")

In [161]:
def reemplazar_tildes(texto: str) -> str:
    # Replace lowercase accented vowels with their unaccented form.
    reemplazos = {"á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u"}
    for original, reemplazo in reemplazos.items():
        texto = texto.replace(original, reemplazo)
    return texto


def normalize_text(text: str) -> str:
    # Lemmatize with simplemma, then keep only alphabetic, non-stopword spaCy tokens.
    doc = model_es(" ".join(simplemma.text_lemmatizer(text, lang="es")))
    return " ".join([reemplazar_tildes(str(token)) for token in doc if token.is_alpha and not token.is_stop])


def repetir_texto(texto: str, n: int) -> str:
    # Repeat the text n times, separated by spaces, so its terms weigh more.
    return " ".join([texto] * n)


def normalize_song(row: pandas.Series) -> str:
    try:
        normalize_genero = normalize_text(row["Genero"])
        normalize_artista = normalize_text(row["Artista"])
        normalize_titulo = normalize_text(row["Titulo"])
        normalize_cancion = normalize_text(row["Cancion"])
        return f"{repetir_texto(normalize_genero, 5)} {repetir_texto(normalize_artista, 5)} {repetir_texto(normalize_titulo, 5)} {normalize_cancion}"
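    except Exception:
        # If any field cannot be normalized (for example, a missing value),
        # print the row and fall back to an empty document.
        print(row)
        return ""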


def is_spanish(text: str) -> bool:
    return simplemma.is_known(text, lang="es")
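
`reemplazar_tildes` only covers the five lowercase accented vowels. As a hedged alternative, and not what this notebook uses, accents can be stripped through Unicode decomposition. The function name `strip_accents` is hypothetical. Note that this approach also removes the tilde from "ñ" and the diaeresis from "ü", which `reemplazar_tildes` leaves untouched, so the two functions do not behave identically.

```python
import unicodedata


def strip_accents(text: str) -> str:
    # NFD splits each accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and keep everything else.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("África corazón pingüino"))
```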


In [162]:
df = pandas.read_csv("../../../data/data.csv")
df = df.dropna(subset=["Cancion", "Titulo"])
df["normalize"] = df.apply(lambda row: normalize_song(row), axis=1)
df


Unnamed: 0,Genero,Artista,Titulo,Cancion,normalize
0,Pop,Shakira,Waka Waka (Esto Es África),"Llegó el momento, caen las murallas\r\nVa a co...",pop pop pop pop pop shakira shakira shakira sh...
1,Pop,Shakira,Gitana,Nunca usé un antifaz\r\nVoy de paso por este m...,pop pop pop pop pop shakira shakira shakira sh...
2,Pop,Shakira,"Te Aviso, Te Anuncio (Tango)",Nunca pensé que doliera el amor así\r\nCuando ...,pop pop pop pop pop shakira shakira shakira sh...
3,Pop,Shakira,Addicted To You,Debe ser el perfume que usas\r\nO el agua con ...,pop pop pop pop pop shakira shakira shakira sh...
4,Pop,Shakira,Monotonía (part. Ozuna),"No fue culpa tuya, ni tampoco mía\r\nFue culpa...",pop pop pop pop pop shakira shakira shakira sh...
...,...,...,...,...,...
3063,Salsa,Willie Gonzalez,No Es Casualidad,Siempre cada encuentro parece el primero\r\nSe...,salsa salsa salsa salsa salsa willie gonzalez ...
3064,Salsa,Willie Gonzalez,No Podras Escapar de Mi,Poder tocar tu mano\r\nEstar siempre a tu lado...,salsa salsa salsa salsa salsa willie gonzalez ...
3065,Salsa,Willie Gonzalez,Doble Vida,Juntos viven tu y el\r\nY estas tan sola...\r\...,salsa salsa salsa salsa salsa willie gonzalez ...
3066,Salsa,Willie Gonzalez,Tan Solo,Me has vuelto a llamar\r\nY como un niño emoci...,salsa salsa salsa salsa salsa willie gonzalez ...
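
The repeated tokens at the start of each `normalize` value come from `repetir_texto`: repeating genre, artist and title five times raises their term counts relative to the lyrics. A minimal sketch of that effect follows, using hypothetical, already-normalized field values and a hypothetical helper `repeat_text` that mirrors `repetir_texto`.

```python
from collections import Counter


def repeat_text(text: str, n: int) -> str:
    # Same idea as repetir_texto: join n copies of the text with spaces.
    return " ".join([text] * n)


# Hypothetical normalized fields for a single song.
genre, artist, title = "pop", "shakira", "gitano"
lyrics = "usar antifaz paso mundo gitano"

document = f"{repeat_text(genre, 5)} {repeat_text(artist, 5)} {repeat_text(title, 5)} {lyrics}"

counts = Counter(document.split())
total = sum(counts.values())
# Relative term frequency of every term in the weighted document.
tf = {term: count / total for term, count in counts.items()}

print(counts.most_common(4))
```

A word that appears once in the lyrics ends up with a much lower term frequency than the repeated metadata terms, which is the intended weighting.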


In [163]:
df.to_csv("./data_normalize.csv", index=False)

In [164]:
corpus = df["normalize"].to_list()

In [165]:
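# TF-IDF matrix (sparse), one row per song and one column per vocabulary term.
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Raw term counts per song, as a dense array.
count_vectorizer = sklearn.feature_extraction.text.CountVectorizer()
X_TF = count_vectorizer.fit_transform(corpus).toarray()

# Relative term frequency: each count divided by the total number of terms in the song.
tf = X_TF / X_TF.sum(axis=1, keepdims=True)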
df_tf = pandas.DataFrame(tf, columns=count_vectorizer.get_feature_names_out())
df_tf

Unnamed: 0,aa,aaa,aaaaa,aaaaaa,aaaaaaaaaaaa,aaaaaaaaao,aaaaaah,aaaaaahh,aaaaahhh,aaaahh,...,الك,بيروت,حياتي,دائي,ذا,عينيها,في,فيك,ماء,من
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3060,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3061,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3062,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [166]:
# Pair each vocabulary term with the IDF weight learned by the vectorizer
features = vectorizer.get_feature_names_out()
idf = vectorizer.idf_

df_idf = pandas.DataFrame(list(zip(features, idf)), columns=["Word", "IDF"])
df_idf

Unnamed: 0,Word,IDF
0,aa,5.156928
1,aaa,7.236370
2,aaaaa,8.334982
3,aaaaaa,7.929517
4,aaaaaaaaaaaa,8.334982
...,...,...
21637,عينيها,8.334982
21638,في,8.334982
21639,فيك,8.334982
21640,ماء,8.334982
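
As a rough consistency check, the values above agree with the smoothed IDF formula if the corpus contains $N = 3065$ documents. That count is an inference from the numbers, since the tables shown here are truncated. Under that assumption, a term that occurs in exactly one document, and a term that occurs in exactly two, would get:

$$ \text{IDF} = \ln\left(\frac{1 + 3065}{1 + 1}\right) + 1 = \ln(1533) + 1 \approx 8.334982 $$

$$ \text{IDF} = \ln\left(\frac{1 + 3065}{1 + 2}\right) + 1 = \ln(1022) + 1 \approx 7.929517 $$

This suggests that the many terms listed with 8.334982 each appear in a single song, and that the logarithm used is the natural logarithm.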


In [167]:
df_tf_idf = pandas.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df_tf_idf

Unnamed: 0,aa,aaa,aaaaa,aaaaaa,aaaaaaaaaaaa,aaaaaaaaao,aaaaaah,aaaaaahh,aaaaahhh,aaaahh,...,الك,بيروت,حياتي,دائي,ذا,عينيها,في,فيك,ماء,من
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3060,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3061,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3062,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
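
The weights in this table are not the plain TF × IDF product from the formula. With its default settings, `TfidfVectorizer` uses raw term counts as TF, multiplies them by the smoothed IDF, and then scales each row to unit Euclidean length (`norm="l2"`). The sketch below illustrates that last step with the standard library only; the counts, the IDF values and the helper name `l2_normalize` are hypothetical.

```python
import math


def l2_normalize(weights: dict[str, float]) -> dict[str, float]:
    # Divide every weight by the Euclidean norm of the whole row.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        # An empty or all-zero document is left unchanged.
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}


# Hypothetical raw counts and smoothed IDF values for one document.
counts = {"waka": 12, "africa": 4, "porque": 3}
idf = {"waka": 8.33, "africa": 5.10, "porque": 1.90}

raw = {term: counts[term] * idf[term] for term in counts}
row = l2_normalize(raw)

print(row)
print(sum(w * w for w in row.values()))  # squared weights of a normalized row sum to about 1
```

Because of this normalization, every weight lies between 0 and 1, and weights are comparable across songs of different lengths.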


In [168]:
dfx = df_tf_idf.apply(lambda row: df_tf_idf.columns[row.argsort()[::-1]].tolist(), axis=1)
dfx = pandas.DataFrame(dfx.tolist())
dfx

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,21632,21633,21634,21635,21636,21637,21638,21639,21640,21641
0,waka,tsamina,eh,mina,africa,zangalewa,ah,anawa,porque,yango,...,opinar,opinion,oponer,oportunamente,oportundidad,oportunidad,oportunista,oportuno,oprimir,aa
1,gitano,dominarme,amarrarme,shakira,contramano,liviano,aprovechame,equivocarme,eligir,latir,...,oprtunidad,optar,optimisma,optimista,or,oracion,orador,orandole,orar,aa
2,anuncio,aviso,tango,vacunar,shakira,madre,lisa,cuidar,mono,sucio,...,oportundidad,oportunidad,oportunista,oportuno,oprimir,oprtunidad,optar,optimisma,optimista,aa
3,addicted,you,to,shakira,vicio,deje,pop,piel,baby,querer,...,oportundidad,oportunidad,oportunista,oportuno,oprimir,oprtunidad,optar,optimisma,optimista,aa
4,monotonia,ozuna,shakira,culpa,protagonismo,part,hi,pop,dije,doler,...,optimista,or,oracion,orador,orandole,orar,orarte,orbita,orbitar,aa
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3060,primeroo,encuentro,parecer,casualidad,gonzalez,willie,desatar,esooo,fuego,preso,...,oportundidad,oportunidad,oportunista,oprimir,oprtunidad,optar,optimisma,optimista,or,aa
3061,escapar,pudrir,gonzalez,oh,importante,willie,oirte,salsa,fuerza,desafiar,...,optimisma,optimista,or,oracion,orador,orandole,orar,orarte,orbita,aa
3062,doble,bastar,vivr,vida,gonzalez,willie,dejade,negarle,entregarte,temer,...,oportundidad,oportunidad,oportunista,oportuno,oprimir,oprtunidad,optar,optimisma,optimista,aa
3063,gonzalez,escuchar,voz,willie,vivir,sentar,cambiar,volver,obtener,sentias,...,oportuno,oprimir,oprtunidad,optar,optimisma,optimista,or,oracion,orador,aa
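
The cell above sorts the whole vocabulary (more than 21,000 terms) for every song, although only the leading terms carry non-zero weight. When only the top few terms are needed, a partial selection is enough. This is a minimal sketch with the standard library, assuming the weights of one song are available as a dictionary; the function name `top_terms` and the sample weights are hypothetical.

```python
import heapq


def top_terms(weights: dict[str, float], k: int) -> list[str]:
    # Ignore terms that do not occur in the document at all.
    nonzero = [(term, w) for term, w in weights.items() if w > 0]
    # nlargest selects the k heaviest terms without sorting the rest.
    best = heapq.nlargest(k, nonzero, key=lambda item: item[1])
    return [term for term, _ in best]


# Hypothetical weights for one song.
weights = {"waka": 0.58, "tsamina": 0.41, "eh": 0.36, "porque": 0.13, "aa": 0.0}

print(top_terms(weights, 3))
```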


In [169]:
df_sorted_desc = df_tf_idf.apply(lambda row: sorted(row, reverse=True), axis=1, result_type="broadcast")
df_sorted_desc


Unnamed: 0,aa,aaa,aaaaa,aaaaaa,aaaaaaaaaaaa,aaaaaaaaao,aaaaaah,aaaaaahh,aaaaahhh,aaaahh,...,الك,بيروت,حياتي,دائي,ذا,عينيها,في,فيك,ماء,من
0,0.583751,0.413490,0.361317,0.337027,0.295829,0.218907,0.148125,0.145938,0.127480,0.102267,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.688950,0.222119,0.222119,0.220640,0.211314,0.211314,0.203647,0.185176,0.185176,0.135812,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.531888,0.449090,0.326930,0.199458,0.198130,0.184319,0.182871,0.176805,0.152892,0.145584,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.771497,0.384401,0.307706,0.176852,0.166613,0.111362,0.094150,0.092930,0.090370,0.079007,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.620635,0.355921,0.290631,0.289445,0.195053,0.178098,0.165736,0.154722,0.143698,0.117883,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3060,0.570279,0.421385,0.318813,0.291237,0.237780,0.179567,0.149229,0.126729,0.116287,0.111410,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3061,0.590793,0.571330,0.248500,0.213668,0.204858,0.187662,0.117882,0.104260,0.103149,0.099400,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3062,0.564895,0.523264,0.407082,0.224893,0.178413,0.134734,0.095088,0.095088,0.093653,0.090945,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3063,0.361949,0.332759,0.330759,0.273337,0.248989,0.230086,0.181319,0.180746,0.167480,0.163912,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [172]:
N_COLS = 15
N_ROWS = 30

# Sample row positions (not index labels) without repetition: df keeps its
# original labels after dropna, while df_tf_idf, df_sorted_desc and dfx are
# indexed by position.
rows = random.sample(range(len(df)), k=N_ROWS)

In [173]:
# Top N_COLS TF-IDF weights of each sampled song, already sorted in descending order.
df_aux = df_sorted_desc.iloc[rows, :N_COLS]

fig = px.imshow(
    df_aux.to_numpy(),
    x=[f" word {i} " for i in range(N_COLS)],
    y=[titulo.upper()[:20] for titulo in df["Titulo"].iloc[rows]],
    aspect="auto",
    color_continuous_scale="ylgnbu",
)
# Overlay the term that corresponds to each weight.
fig.update_traces(text=dfx.iloc[rows, :N_COLS].to_numpy(), texttemplate="%{text}", textfont_size=11)
fig.update_xaxes(visible=False)
fig.update_layout(title="TF-IDF heatmap", template="ggplot2", height=800)
fig.show()
