# PMI for relating terms

Relatedness between terms can be estimated from their co-occurrence within the same contexts. Two terms are more related to each other, or more similar, the more contexts they share.

One way to measure this is to quantify the information that two terms share, using mutual information. Here we implement a version of this for terms from the Brown corpus.

In [1]:
from nltk.corpus import brown, stopwords
from nltk.stem import SnowballStemmer
from collections import defaultdict, Counter
from itertools import chain, combinations
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

We define the preprocessing function, which removes stopwords and stems the terms.

In [2]:
# Stopwords
paro = stopwords.words('english')
# Stemmer
stemmer = SnowballStemmer('english')
# Preprocessing function: keep alphabetic, non-stopword tokens, lowercased and stemmed
normalize = lambda sent: [stemmer.stem(w.lower()) for w in sent if w.isalpha() and w.lower() not in paro]

The contexts are the sentences in which the terms appear. We also collect the tokens and their frequencies.

In [3]:
# Normalized sentences (only those with more than one remaining token)
sentences = [normalize(sent) for sent in brown.sents(categories=['government']) if len(normalize(sent)) > 1]
# All tokens
tokens = list(chain(*sentences))
# Token frequencies
freq = Counter(tokens)
# Token indices
index = {token:idx for idx,token in enumerate(freq.keys())}

In [4]:
terms = pd.DataFrame(data=freq.values(), index=list(freq.keys()),
                         columns=['Frequency'])
terms.sort_values(by='Frequency', ascending=False)

term,Frequency
state,388
year,290
develop,193
may,180
unit,175
...,...
yesterday,1
armament,1
imposit,1
constitu,1


To compute the co-occurrence probabilities, we first build a co-occurrence matrix in which each entry is the number of contexts that two terms share:

$$C = (c_{i,j}), \quad c_{i,j} = |\{\mathcal{N}(w_i) : w_j \in \mathcal{N}(w_i)\}|$$

In [5]:
# Co-occurrence matrix
coocurrence_matrix = np.zeros((len(index),len(index)))
for sent in sentences:
    for term1, term2 in combinations(sent,2):
        coocurrence_matrix[index[term1],index[term2]] += 1
        coocurrence_matrix[index[term2],index[term1]] += 1
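To make the counting concrete, here is a minimal sketch of the same pairing logic on three made-up, already-normalized sentences. The sentences and the names `toy_sentences` and `toy_counts` are illustrative assumptions, not part of the notebook, and a `Counter` keyed by term pairs stands in for the NumPy matrix.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, already-normalized sentences (not taken from the Brown corpus)
toy_sentences = [
    ["state", "govern", "tax"],
    ["state", "tax"],
    ["govern", "polici"],
]

# Count every pair of token positions within a sentence, in both directions,
# so that the resulting counts are symmetric
toy_counts = Counter()
for sent in toy_sentences:
    for term1, term2 in combinations(sent, 2):
        toy_counts[(term1, term2)] += 1
        toy_counts[(term2, term1)] += 1

print(toy_counts[("state", "tax")])     # 2
print(toy_counts[("tax", "state")])     # 2
print(toy_counts[("state", "polici")])  # 0
```

Because `combinations` pairs token positions rather than distinct terms, a term that appears twice in the same sentence is paired with itself, which is why diagonal entries of the matrix can be nonzero.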

We can display this matrix:

In [6]:
Coocurrence = pd.DataFrame(data=coocurrence_matrix, index=list(index.keys()), columns=list(index.keys()))
Coocurrence

term,offic,busi,econom,obe,depart,commerc,provid,basic,measur,nation,...,jones,masteri,widest,potent,besieg,preclud,cadr,percept,lengthen,shadow
offic,16.0,12.0,3.0,1.0,17.0,10.0,3.0,2.0,1.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
busi,12.0,40.0,11.0,2.0,6.0,3.0,9.0,2.0,5.0,10.0,...,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
econom,3.0,11.0,8.0,2.0,17.0,3.0,4.0,7.0,3.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
obe,1.0,2.0,2.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
depart,17.0,6.0,17.0,1.0,16.0,5.0,4.0,2.0,1.0,8.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
preclud,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
cadr,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
percept,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
lengthen,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0


The probabilities can be obtained from the co-occurrence matrix. The joint probabilities are given by:

$$p(w_i,w_j) = \frac{c_{i,j}}{\sum_i \sum_j c_{i,j}}$$

The marginals are computed in the usual way:

$$p(w_i) = \sum_j p(w_i,w_j)$$

In [7]:
# Joint probabilities: each count divided by the sum of all counts
joint_dist = coocurrence_matrix/coocurrence_matrix.sum()
# Marginal probabilities
marginal_dist = joint_dist.sum(0)
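As a quick check of these two formulas, the following sketch normalizes a small, made-up symmetric count matrix using plain Python lists. The matrix `toy_C` and the other `toy_` names are hypothetical and chosen only so the arithmetic is easy to follow.

```python
# Hypothetical symmetric co-occurrence counts for three terms
toy_terms = ["state", "govern", "tax"]
toy_C = [
    [0, 1, 2],
    [1, 0, 1],
    [2, 1, 0],
]

# Sum of all entries of the matrix
total = sum(sum(row) for row in toy_C)

# Joint probabilities: each count divided by the total
toy_joint = [[c / total for c in row] for row in toy_C]

# Marginal probabilities: sum of each row of the joint distribution
toy_marginal = [sum(row) for row in toy_joint]

print(total)            # 8
print(toy_joint[0][2])  # 0.25
print(toy_marginal)     # [0.375, 0.25, 0.375]
```

Since the count matrix is symmetric, summing the joint distribution over rows or over columns gives the same marginals.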

We can then compute the PMI (Pointwise Mutual Information), here in its shifted positive form (PPMI), as:

$$PPMI_k(w_i,w_j) = \max\{0, \log_2 \frac{p(w_i,w_j)}{p(w_i)p(w_j)} - \log_2 k\}$$

where $k$ is a hyperparameter. We adopt the convention $\log 0 = 0$, so pairs that never co-occur get a value of 0.

In [8]:
# Hyperparameter
k = 2
# PMI computation
pmi = np.log2(joint_dist/np.outer(marginal_dist,marginal_dist)) - np.log2(k)
# log 0 = 0
pmi[pmi == -np.inf] = 0
# Keep only positive values: max{0, PMI}
pmi[pmi < 0] = 0

pmi


array([[1.3935015 , 0.37081775, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.37081775, 1.5001371 , 0.83432396, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.83432396, 1.57157568, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 9.30422011,
        9.30422011],
       [0.        , 0.        , 0.        , ..., 9.30422011, 0.        ,
        9.30422011],
       [0.        , 0.        , 0.        , ..., 9.30422011, 9.30422011,
        0.        ]])
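The formula can also be checked for a single pair of terms. The sketch below is a scalar version of the computation, assuming made-up probabilities; the function name `shifted_ppmi` is hypothetical and not part of the notebook.

```python
import math

def shifted_ppmi(p_ij, p_i, p_j, k=2):
    """Shifted positive PMI for one pair, with log 0 treated as 0."""
    if p_ij == 0:
        # Convention from the text: pairs that never co-occur get 0
        return 0.0
    return max(0.0, math.log2(p_ij / (p_i * p_j)) - math.log2(k))

# Hypothetical probabilities
print(shifted_ppmi(0.125, 0.125, 0.125))  # 2.0 (log2(8) - log2(2))
print(shifted_ppmi(0.0625, 0.25, 0.25))   # 0.0 (independent pair: 0 - 1 is clipped)
print(shifted_ppmi(0.0, 0.25, 0.25))      # 0.0 (never co-occur)
```

With $k = 2$, a pair needs to co-occur more than twice as often as independence would predict before it receives a positive value.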

We can display the results:

In [9]:
PMI = pd.DataFrame(data=pmi, index=list(index.keys()), columns=list(index.keys()))
PMI

term,offic,busi,econom,obe,depart,commerc,provid,basic,measur,nation,...,jones,masteri,widest,potent,besieg,preclud,cadr,percept,lengthen,shadow
offic,1.393501,0.370818,0.000000,3.583326,1.247488,1.905254,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.00000,0.00000,0.00000,0.00000
busi,0.370818,1.500137,0.834324,3.975680,0.000000,0.000000,0.000000,0.000000,0.187854,0.000000,...,0.0,0.0,0.0,3.741215,0.0,0.0,0.00000,0.00000,0.00000,0.00000
econom,0.000000,0.834324,1.571576,5.172363,1.836525,0.757326,0.000000,2.370213,0.647572,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.00000,0.00000,0.00000,0.00000
obe,3.583326,3.975680,5.172363,0.000000,3.349849,4.773151,2.602047,5.163646,4.663397,2.697593,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.00000,0.00000,0.00000,0.00000
depart,1.247488,0.000000,1.836525,3.349849,0.926548,0.671777,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.00000,0.00000,0.00000,0.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
preclud,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.00000,0.00000,0.00000,0.00000
cadr,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.00000,9.30422,9.30422,9.30422
percept,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,9.30422,0.00000,9.30422,9.30422
lengthen,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,9.30422,9.30422,0.00000,9.30422


For each term, we can see which terms share the most information with it, that is, which have the highest PMI.

In [10]:
PMI[stemmer.stem('government')].sort_values(ascending=False)

monolith        5.469812
residenti       3.299887
theoretician    3.147884
undertook       3.147884
stripe          3.147884
                  ...   
reaffirm        0.000000
plenti          0.000000
confisc         0.000000
cargo           0.000000
offic           0.000000
Name: govern, Length: 3971, dtype: float64

To get a better measure of how terms are linked, we compute a PMI score as:

$$score(w_i,w_j) = \sum_l PPMI_k(w_i,w_l)PPMI_k(w_j,w_l)$$

In [13]:
# Query
query = 'united'
# Compute the scores: dot product of every PPMI row with the query's PPMI vector
results = pd.DataFrame(data=np.dot(PMI, PMI[stemmer.stem(query)]), index=list(index.keys()), columns=[query])
# Sorted display
results.sort_values(by=query, ascending=False)

term,united
unit,636.778908
state,298.885393
canada,200.549074
america,192.449051
intern,181.093636
...,...
provost,0.000000
professor,0.000000
orvil,0.000000
thrice,0.000000
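The score is a dot product between PPMI rows, which a small example makes explicit. The values in `toy_ppmi` and the helper `pmi_score` are made up for illustration and are not taken from the results above.

```python
# Hypothetical PPMI rows for three terms over a four-term vocabulary
toy_ppmi = {
    "unit":   [0.0, 2.0, 1.0, 0.0],
    "state":  [2.0, 0.0, 1.0, 0.0],
    "shadow": [0.0, 0.0, 0.0, 3.0],
}

def pmi_score(row_i, row_j):
    """Sum over the vocabulary of the products of two terms' PPMI values."""
    return sum(a * b for a, b in zip(row_i, row_j))

toy_query = "unit"
scores = {term: pmi_score(row, toy_ppmi[toy_query]) for term, row in toy_ppmi.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(scores)   # {'unit': 5.0, 'state': 1.0, 'shadow': 0.0}
print(ranking)  # ['unit', 'state', 'shadow']
```

Two terms score high when they have large PPMI values with the same third terms, even if they rarely co-occur directly. The query's score with itself is the sum of its squared PPMI values, which is why `unit` heads the ranking for the query `united` above.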
