## POINTWISE MUTUAL INFORMATION
The idea of PMI is that we want to quantify the likelihood of co-occurrence of two words, taking into account the fact that it might be caused by the frequency of the single words. Hence, the algorithm computes the (log) probability of co-occurrence scaled by the product of the single probability of occurrence.
Now, knowing that, when ‘a’ and ‘b’ are independent, their joint probability is equal to the product of their marginal probabilities, when the ratio equals 1 (hence the log equals 0), it means that the two words together don’t form a unique concept: they co-occur by chance.
On the other hand, if either one of the words (or even both of them) has a low probability of occurrence if singularly considered, but its joint probability together with the other word is high, it means that the two are likely to express a unique concept.

In [1]:
import spacy
import string
import re
import pandas as pd
import numpy as np
import seaborn as sns
import random
import pickle
from unidecode import unidecode
import nltk
nltk.download('wordnet')
nltk.download('words')
nltk.download('punkt')

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.util import skipgrams
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords 
from spacy.lang.en import English

from tqdm import tqdm
from gensim import corpora
from gensim.parsing.preprocessing import remove_stopwords
from gensim.parsing.preprocessing import strip_non_alphanum
from gensim.models import CoherenceModel
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures,TrigramCollocationFinder, TrigramAssocMeasures

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## PMI - DRUG

In [3]:
def get_pmi_value(word1, word2, matrix):
    return matrix.loc[matrix.index == word1, [word2]]

def pmi(df, positive=True):
    cols = df.sum(axis=0)
    total = cols.sum()
    rows = df.sum(axis=1)
    expected = np.outer(rows, cols) / total
    df_pmi = df / expected
    with np.errstate(divide='ignore'):
        df_pmi = np.log(df_pmi)
    df_pmi[np.array(np.isinf(df_pmi))] = 0.0 
    if positive:
        df_pmi[df_pmi < 0] = 0.0
    return df_pmi

def max_cooccurrences(df_column):
    return df_column.idxmax() , df_column.max()

In [4]:
#Reload save pkl file
with open('drug.pkl', 'rb') as f:
    drug = pickle.load(f)

In [5]:
vectorizer = CountVectorizer(min_df=0.01)
X = vectorizer.fit_transform(drug.Text)

In [6]:
pmi_drug = pd.DataFrame(X.todense(), columns=vectorizer.vocabulary_)

In [7]:
asint = pmi_drug.astype(int)
drug_pmi = asint.T.dot(asint)
drug_pmi.values[tuple([np.arange(drug_pmi.shape[0])]*2)] = 0

In [8]:
drug_occ = pmi(drug_pmi, positive=True)
drug_occ

Unnamed: 0,undercover,cocaine,facing,away,according,remove,purse,phone,later,apparently,...,averment,disconnected,dying,doubtless,map,demurrer,repugnant,stole,rescind,assented
undercover,0.000000,2.189862,0.000000,0.240559,0.000000,0.000000,0.000000,0.253358,0.000000,0.042002,...,0.184582,0.000000,0.000000,0.322359,0.000000,0.337529,0.000000,0.000000,1.086969,0.872124
cocaine,2.189862,0.000000,0.000000,0.478409,0.000000,0.000000,0.222529,0.027474,0.046289,0.380657,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.329246,0.000000,0.000000,0.000000,0.590646
facing,0.000000,0.000000,0.000000,0.000000,0.076030,1.469845,0.000000,0.000000,0.155157,0.000000,...,0.000000,0.059655,0.000000,0.194137,0.000000,0.000000,0.902901,0.000000,0.000000,0.000000
away,0.240559,0.478409,0.000000,0.000000,1.255746,0.000000,0.307931,0.000000,0.255603,0.285938,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.346893,0.000000,0.000000,0.000000,0.000000
according,0.000000,0.000000,0.076030,1.255746,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.340941,0.000000,0.131364,0.176142,1.001072,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
demurrer,0.337529,0.329246,0.000000,0.346893,0.176142,0.000000,0.000000,0.000000,0.422485,0.000000,...,1.527361,0.250151,0.464518,0.000000,0.315359,0.000000,0.000000,0.089779,0.046354,0.000000
repugnant,0.000000,0.000000,0.902901,0.000000,1.001072,0.000000,0.000000,0.000000,0.382232,0.000000,...,0.106705,0.000000,0.000000,0.136268,0.000000,0.000000,0.000000,0.205773,0.000000,0.000000
stole,0.000000,0.000000,0.000000,0.000000,0.000000,0.481886,0.000000,0.000000,0.573427,0.000000,...,0.110503,0.295611,0.000000,0.000000,0.000000,0.089779,0.205773,0.000000,0.000000,0.000000
rescind,1.086969,0.000000,0.000000,0.000000,0.000000,0.000000,0.028884,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.885762,0.000000,0.344828,0.046354,0.000000,0.000000,0.000000,0.000000


In [9]:
get_pmi_value('cocaine', 'undercover', drug_occ)

Unnamed: 0,undercover
cocaine,2.189862


In [10]:
max_cooccurrences(drug_occ)

(undercover    essentially
 cocaine        undercover
 facing           emphasis
 away             strategy
 according        stopping
                  ...     
 demurrer            tenth
 repugnant         despite
 stole              gather
 rescind           install
 assented           formal
 Length: 2340, dtype: object,
 undercover    3.120438
 cocaine       2.189862
 facing        3.609566
 away          2.743107
 according     3.034203
                 ...   
 demurrer      2.274755
 repugnant     2.920316
 stole         2.160492
 rescind       3.406246
 assented      3.872346
 Length: 2340, dtype: float64)

###  Bigrams/Trigrams - Drug

In [11]:
list = drug['Text'].tolist()

In [12]:
tokenized_sentences = []
for line in list:
    token = line.split()
    tokenized_sentences.append(token)

In [13]:
#bigrams (sequences of 2 words) and collapsed them into a unique term with the underscore symbol

finder = BigramCollocationFinder.from_documents(tokenized_sentences)
bgm = BigramAssocMeasures()
score = bgm.pmi #Scores ngrams by pointwise mutual information
collocations = {'_'.join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}

In [15]:
#trigrams (sequences of 3 words) and collapsed them into a unique term with the underscore symbol

finder = TrigramCollocationFinder.from_documents(tokenized_sentences)
tgm = TrigramAssocMeasures()
score = tgm.pmi
collocations = {'_'.join(trigram): pmi for trigram, pmi in finder.score_ngrams(score)}

## PMI - WEAPONS

In [17]:
#Reload save pkl file
with open('weapons.pkl', 'rb') as f:
    weapons = pickle.load(f)

In [18]:
vectorizer = CountVectorizer(min_df=0.01)
X = vectorizer.fit_transform(weapons.Text)

In [19]:
pmi_weapons = pd.DataFrame(X.todense(), columns=vectorizer.vocabulary_)

In [20]:
asint = pmi_weapons.astype(int)
weapons_pmi = asint.T.dot(asint)
weapons_pmi.values[tuple([np.arange(weapons_pmi.shape[0])]*2)] = 0

In [21]:
weapons_occ = pmi(weapons_pmi, positive=True)
weapons_occ

Unnamed: 0,guilty,degree,murder,burglary,proven,reasonable,doubt,agree,conviction,homicide,...,utterly,occupancy,practiced,distant,replication,assumpsit,averment,indispensable,eighty,compensate
guilty,0.000000,1.395181,0.000000,1.098079,0.000000,0.000000,0.206415,0.076675,0.000000,0.182222,...,1.105455,1.061337,0.240021,0.000000,0.000000,0.000000,0.220249,0.000000,0.000000,1.067001
degree,1.395181,0.000000,0.000000,0.596283,0.000000,0.000000,0.400480,0.023703,0.000000,0.066061,...,0.319316,0.547350,0.000000,0.001150,0.000000,0.024353,0.000000,0.000000,0.000000,0.000000
murder,0.000000,0.000000,0.000000,0.000000,0.776977,0.794285,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.305060,0.745701,0.000000,0.308948,0.000000
burglary,1.098079,0.596283,0.000000,0.000000,0.657697,0.069935,0.000000,0.000000,0.000000,0.491707,...,0.492217,0.511613,0.000000,0.194569,0.000000,0.000000,0.000000,0.000000,0.000000,0.821488
proven,0.000000,0.000000,0.776977,0.657697,0.000000,0.353437,2.660074,0.000000,0.000000,0.028573,...,0.000000,0.000000,0.000000,0.000000,0.059120,0.583793,0.019347,0.246217,0.076855,0.070673
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
assumpsit,0.000000,0.024353,0.305060,0.000000,0.583793,0.000000,0.247727,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.862967,0.016954,0.000000,0.000000
averment,0.220249,0.000000,0.745701,0.000000,0.019347,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.063500,0.543365,1.862967,0.000000,0.110610,0.019833,0.000000
indispensable,0.000000,0.000000,0.000000,0.000000,0.246217,0.000000,0.000000,0.017280,0.000000,0.142839,...,0.000000,0.000000,0.121092,0.084239,0.977589,0.016954,0.110610,0.000000,0.970580,0.000000
eighty,0.000000,0.000000,0.308948,0.000000,0.076855,0.000000,0.000000,0.000000,0.078487,0.109639,...,0.000000,0.000000,0.000000,0.323823,1.408859,0.000000,0.019833,0.970580,0.000000,0.000000


In [22]:
get_pmi_value('pistol', 'murder', weapons_occ)

Unnamed: 0,murder
pistol,0.0


In [23]:
max_cooccurrences(weapons_occ)

(guilty           discretion
 degree           immaterial
 murder            reasoning
 burglary              speak
 proven              falling
                     ...    
 assumpsit          averment
 averment              recur
 indispensable    injunctive
 eighty                 lose
 compensate         evidence
 Length: 2419, dtype: object,
 guilty           2.966372
 degree           2.004413
 murder           2.102548
 burglary         3.023862
 proven           2.880192
                    ...   
 assumpsit        1.862967
 averment         2.173839
 indispensable    1.341409
 eighty           1.822589
 compensate       3.269967
 Length: 2419, dtype: float64)

### Bigrams/Trigrams - Weapons

In [24]:
list = weapons['Text'].tolist()

In [25]:
tokenized_sentences = []
for line in list:
    token = line.split()
    tokenized_sentences.append(token)

In [26]:
#bigrams (sequences of 2 words) and collapsed them into a unique term with the underscore symbol

finder = BigramCollocationFinder.from_documents(tokenized_sentences)
bgm = BigramAssocMeasures()
score = bgm.pmi
collocations = {'_'.join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}

In [28]:
#trigrams (sequences of 3 words) and collapsed them into a unique term with the underscore symbol

finder = TrigramCollocationFinder.from_documents(tokenized_sentences)
tgm = TrigramAssocMeasures()
score = tgm.pmi
collocations = {'_'.join(trigram): pmi for trigram, pmi in finder.score_ngrams(score)}

## PMI - ACCIDENT

In [30]:
#Reload save pkl file
with open('accident.pkl', 'rb') as f:
    accident = pickle.load(f)

In [31]:
vectorizer = CountVectorizer(min_df=0.01)
X = vectorizer.fit_transform(accident.Text)

In [32]:
pmi_accident = pd.DataFrame(X.todense(), columns=vectorizer.vocabulary_)

In [33]:
asint = pmi_accident.astype(int)
accident_pmi = asint.T.dot(asint)
accident_pmi.values[tuple([np.arange(accident_pmi.shape[0])]*2)] = 0

In [34]:
accident_occ = pmi(accident_pmi, positive=True)
accident_occ

Unnamed: 0,arose,declaratory,sought,extent,coverage,provided,occurrence,limitation,affirm,covered,...,pursuance,eighty,alongside,expenditure,motorman,jurisprudence,solicitor,tenth,contemporaneous,daylight
arose,0.000000,2.868897,0.557986,0.000000,0.043556,0.000000,1.190169,0.000000,0.168542,0.782283,...,0.401715,0.315792,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.433465
declaratory,2.868897,0.000000,0.497670,0.000000,0.100977,0.008697,0.449493,0.000000,0.000000,0.057403,...,0.225035,0.000000,0.000000,0.000000,0.290899,0.249676,0.074085,0.000000,0.000000,0.000000
sought,0.557986,0.497670,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.446430,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.656428,0.000000,0.000000,0.000000,0.000000
extent,0.000000,0.000000,0.000000,0.000000,0.331779,0.000000,0.191583,0.000000,0.000000,0.000000,...,0.000000,0.438061,0.646588,0.000000,0.055151,0.000000,0.000000,0.000000,0.000000,0.739656
coverage,0.043556,0.100977,0.000000,0.331779,0.000000,0.417106,0.000000,0.000000,0.000000,0.007982,...,0.000000,0.050057,0.190295,0.566370,0.000000,0.000000,0.067980,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
jurisprudence,0.000000,0.249676,0.656428,0.000000,0.000000,0.000000,0.469968,0.000000,0.000000,0.000000,...,0.054251,0.694573,1.231997,1.147415,1.710730,0.000000,0.167795,0.000000,0.000000,0.000000
solicitor,0.000000,0.074085,0.000000,0.000000,0.067980,0.000000,0.135654,0.000000,0.474374,0.437042,...,1.484110,0.121551,0.000000,0.000000,0.000000,0.167795,0.000000,0.157396,0.000000,0.000000
tenth,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.259667,0.000000,0.000000,0.000000,...,0.637159,0.008849,0.000000,0.000000,0.000000,0.000000,0.157396,0.000000,0.067054,0.000000
contemporaneous,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.018231,0.000000,0.000000,...,0.000000,0.000000,0.415831,0.000000,0.323456,0.000000,0.000000,0.067054,0.000000,0.768607


In [35]:
get_pmi_value('motorman','uninsured', accident_occ)

Unnamed: 0,uninsured
motorman,0.096351


In [36]:
max_cooccurrences(accident_occ)

(arose              declaratory
 declaratory              arose
 sought             termination
 extent                   shock
 coverage            eventually
                       ...     
 jurisprudence         opposite
 solicitor              consent
 tenth                    weigh
 contemporaneous         repeat
 daylight            eyewitness
 Length: 2179, dtype: object,
 arose              2.868897
 declaratory        2.868897
 sought             2.283645
 extent             2.269953
 coverage           0.955734
                      ...   
 jurisprudence      2.416352
 solicitor          1.796480
 tenth              1.761466
 contemporaneous    3.013553
 daylight           3.140476
 Length: 2179, dtype: float64)

### Bigrams/Trigrams - Accident

In [37]:
list = accident['Text'].tolist()

In [38]:
tokenized_sentences = []
for line in list:
    token = line.split()
    tokenized_sentences.append(token)

In [39]:
#bigrams (sequences of 2 words) and collapsed them into a unique term with the underscore symbol

finder = BigramCollocationFinder.from_documents(tokenized_sentences)
bgm = BigramAssocMeasures()
score = score = bgm.pmi
collocations = {'_'.join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}

In [41]:
#trigrams (sequences of 3 words) and collapsed them into a unique term with the underscore symbol

finder = TrigramCollocationFinder.from_documents(tokenized_sentences)
bgm = TrigramAssocMeasures()
score = score = bgm.pmi
collocations = {'_'.join(trigram): pmi for trigram, pmi in finder.score_ngrams(score)}

## PMI - FINANCE

In [43]:
#Reload save pkl file
with open('finance.pkl', 'rb') as f:
    finance = pickle.load(f)

In [44]:
vectorizer = CountVectorizer(min_df=0.01)
X = vectorizer.fit_transform(finance.Text)

In [45]:
pmi_finance= pd.DataFrame(X.todense(), columns=vectorizer.vocabulary_)

In [46]:
asint = pmi_finance.astype(int)
finance_pmi = asint.T.dot(asint)
finance_pmi.values[tuple([np.arange(finance_pmi.shape[0])]*2)] = 0

In [47]:
finance_occ = pmi(finance_pmi, positive=True)
finance_occ

Unnamed: 0,following,determined,bid,damages,pressure,begun,raise,provided,intended,lend,...,irregularity,distinctly,scarcely,effectual,prop,defence,evade,whilst,embrace,chitty
following,0.000000,2.143812,0.026679,0.000000,0.000000,0.159429,0.000000,0.000000,0.729704,0.000000,...,0.000000,0.129919,0.175949,0.000000,0.000000,0.037733,0.358093,0.000000,0.000000,0.284934
determined,2.143812,0.000000,0.192775,0.022095,0.000000,0.044336,0.018928,0.086654,0.702541,0.000000,...,0.000000,0.000000,0.256348,0.000000,0.103731,0.140750,0.185968,0.000000,0.000000,0.076380
bid,0.026679,0.192775,0.000000,0.046261,0.000000,0.136530,0.000000,0.000000,0.547828,0.017650,...,0.000000,0.133526,0.000000,0.000000,0.000000,0.286210,0.000000,0.237699,0.000000,0.000000
damages,0.000000,0.022095,0.046261,0.000000,0.395861,0.036156,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.246210,0.049407,0.010523,0.000000,0.000000,0.000000,0.000000
pressure,0.000000,0.000000,0.000000,0.395861,0.000000,0.000000,0.091052,0.000000,0.000000,0.284718,...,0.000000,0.000000,0.000000,0.456159,0.247206,0.000000,0.000000,0.028434,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
defence,0.037733,0.140750,0.286210,0.010523,0.000000,0.000000,0.000000,0.000000,0.103058,0.053433,...,0.000000,0.000000,0.753799,1.136770,1.795744,0.000000,0.000000,0.000000,0.000000,0.134395
evade,0.358093,0.185968,0.000000,0.000000,0.000000,0.236401,0.000000,0.175412,0.627449,0.576341,...,0.000000,1.527440,0.011312,0.084489,0.000000,0.000000,0.000000,0.000000,0.374215,0.000000
whilst,0.000000,0.000000,0.237699,0.000000,0.028434,0.311854,0.284492,0.000000,0.011607,0.000000,...,0.665396,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.500168
embrace,0.000000,0.000000,0.000000,0.000000,0.000000,0.213212,0.000000,0.043436,0.000000,0.260657,...,0.000000,0.753328,0.000000,0.000000,0.000000,0.000000,0.374215,0.000000,0.000000,0.000000


In [48]:
get_pmi_value('evade', 'debt', finance_occ)

Unnamed: 0,debt
evade,0.552875


In [49]:
max_cooccurrences(finance_pmi)

(following     knowledge
 determined    knowledge
 bid           knowledge
 damages       knowledge
 pressure      knowledge
                 ...    
 defence       knowledge
 evade         knowledge
 whilst        knowledge
 embrace       knowledge
 chitty        knowledge
 Length: 1805, dtype: object,
 following      4355
 determined    19212
 bid            5231
 damages       78292
 pressure      38575
               ...  
 defence       17946
 evade         52001
 whilst         4239
 embrace       43270
 chitty         5817
 Length: 1805, dtype: int32)

### Bigrams/Trigrams - Finance

In [50]:
list = finance['Text'].tolist()

In [51]:
#bigrams (sequences of 2 words) and collapsed them into a unique term with the underscore symbol

finder = BigramCollocationFinder.from_documents(tokenized_sentences)
bgm = BigramAssocMeasures()
score = score = bgm.pmi
collocations = {'_'.join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}

In [53]:
#trigrams (sequences of 3 words) and collapsed them into a unique term with the underscore symbol

finder = TrigramCollocationFinder.from_documents(tokenized_sentences)
bgm = TrigramAssocMeasures()
score = score = bgm.pmi
collocations = {'_'.join(trigram): pmi for trigram, pmi in finder.score_ngrams(score)}

## PMI - HOSPITAL

In [55]:
#Reload save pkl file
with open('hospital.pkl', 'rb') as f:
    hospital = pickle.load(f)

In [56]:
vectorizer = CountVectorizer(min_df=0.01)
X = vectorizer.fit_transform(hospital.Text)

In [57]:
pmi= pd.DataFrame(X.todense(), columns=vectorizer.vocabulary_)

In [58]:
asint = pmi.astype(int)
_pmi = asint.T.dot(asint)
_pmi.values[tuple([np.arange(_pmi.shape[0])]*2)] = 0

In [60]:
hospital_occ = pmi(_pmi, positive=True)
hospital_occ

Unnamed: 0,decision,correctly,assessed,usually,raised,remove,inch,bases,extend,privilege,...,badly,forcible,detainer,mainly,partitioned,upwards,inception,throw,endeavor,revert
decision,0.000000,2.215964,1.681607,2.130343,0.128236,0.000000,0.344637,1.377193,0.465713,0.000000,...,0.000000,0.000000,0.027760,0.000000,0.000000,0.580597,0.475479,0.000000,0.000000,0.863197
correctly,2.215964,0.000000,0.000000,1.689716,0.000000,0.565799,0.000000,0.737896,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.748199,1.624846,0.000000,0.000000,0.000000,0.431539
assessed,1.681607,0.000000,0.000000,1.048497,0.000000,0.000000,0.243621,0.044033,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.633731,0.000000,0.000000,0.000000,0.000000
usually,2.130343,1.689716,1.048497,0.000000,1.025553,1.249132,0.000000,0.338617,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.031611,0.000000,0.089015,0.000000
raised,0.128236,0.000000,0.000000,1.025553,0.000000,0.471639,0.000000,0.258890,0.363330,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.201961,0.004653,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
upwards,0.580597,1.624846,0.633731,0.000000,0.201961,0.003069,0.000000,0.000000,0.804131,0.000000,...,0.000000,0.000000,0.000000,0.149840,1.431263,0.000000,0.279355,0.489374,0.136873,0.000000
inception,0.475479,0.000000,0.000000,0.031611,0.004653,0.000000,0.000000,0.000000,0.346792,0.518786,...,0.000000,0.559102,1.500340,0.000000,0.000000,0.279355,0.000000,0.000000,0.591371,0.000000
throw,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.386946,0.000000,0.151346,0.000000,...,0.344888,0.000000,0.000000,0.000000,0.000000,0.489374,0.000000,0.000000,0.000000,0.771974
endeavor,0.000000,0.000000,0.000000,0.089015,0.000000,0.000000,0.170577,0.000000,0.000000,0.000000,...,0.738405,0.000000,0.682649,0.992403,0.133957,0.136873,0.591371,0.000000,0.000000,0.707155


In [64]:
get_pmi_value('hospital','inception', df_pmi)

Unnamed: 0,inception


In [None]:
max_cooccurrences(hospital_occ)

### Bigrams/Trigrams - Hospital

In [65]:
list = hospital['Text'].tolist()

In [66]:
tokenized_sentences = []
for line in list:
    token = line.split()
    tokenized_sentences.append(token)

In [67]:
#bigrams (sequences of 2 words) and collapsed them into a unique term with the underscore symbol

finder = BigramCollocationFinder.from_documents(tokenized_sentences)
bgm = BigramAssocMeasures()
score = score = bgm.pmi
collocations = {'_'.join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}

In [69]:
#trigrams (sequences of 3 words) and collapsed them into a unique term with the underscore symbol

finder = TrigramCollocationFinder.from_documents(tokenized_sentences)
bgm = TrigramAssocMeasures()
score = score = bgm.pmi
collocations = {'_'.join(trigram): pmi for trigram, pmi in finder.score_ngrams(score)}

## PMI - SEXUAL

In [71]:
#Reload save pkl file
with open('sexual.pkl', 'rb') as f:
    sexual = pickle.load(f)

In [72]:
vectorizer = CountVectorizer(min_df=0.01)
X = vectorizer.fit_transform(sexual.Text)

In [73]:
pmi= pd.DataFrame(X.todense(), columns=vectorizer.vocabulary_)

In [74]:
asint = pmi.astype(int)
_pmi = asint.T.dot(asint)
_pmi.values[tuple([np.arange(_pmi.shape[0])]*2)] = 0

In [75]:
sexual_occ = pmi(_pmi, positive=True)

In [79]:
sexual_occ

Unnamed: 0,brought,stated,following,deceptive,imposition,unenforceable,penalty,breach,duty,dealing,...,northerly,exhaustive,forbidden,transact,divest,destination,rarely,overturn,latitude,universally
brought,0.000000,2.160181,0.000000,0.000000,0.000000,0.000000,0.000000,0.416353,0.323725,0.000000,...,0.000000,0.000000,0.275937,0.000000,0.371688,0.780279,0.379545,0.000000,1.087923,0.000000
stated,2.160181,0.000000,0.000000,0.000000,0.000000,0.838600,0.000000,0.107563,0.302444,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.536697,0.399735,0.209963,0.000000,0.000000,0.000000
following,0.000000,0.000000,0.000000,0.000000,0.000000,0.255466,0.000000,0.000000,0.790349,0.283614,...,0.000000,0.000000,0.000000,0.028830,0.000000,0.300538,0.000000,0.000000,0.752187,0.000000
deceptive,0.000000,0.000000,0.000000,0.000000,0.477006,0.044938,0.000000,0.000000,0.003437,0.000000,...,0.161661,0.000000,0.000000,0.149741,0.000000,0.029651,0.063927,0.000000,0.172214,0.000000
imposition,0.000000,0.000000,0.000000,0.477006,0.000000,0.000000,0.000000,0.000000,0.000000,0.214287,...,0.238259,0.027826,0.114783,0.807400,0.244079,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
destination,0.780279,0.399735,0.300538,0.029651,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.939004,0.760223,1.566213,0.000000,0.675712,0.000000,0.000000,0.000000
rarely,0.379545,0.209963,0.000000,0.063927,0.000000,0.000000,0.000000,0.000000,0.500987,0.747509,...,0.902622,1.514628,0.100575,0.000000,0.000000,0.675712,0.000000,0.713712,0.092499,0.000000
overturn,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.002927,0.000000,...,1.467265,1.035988,0.128991,0.000000,0.000000,0.000000,0.713712,0.000000,0.000000,0.000000
latitude,1.087923,0.000000,0.752187,0.172214,0.000000,0.240234,0.577443,0.000000,0.000000,0.110271,...,0.000000,1.568740,0.000000,0.000000,0.000000,0.000000,0.092499,0.000000,0.000000,0.260796


In [80]:
get_pmi_value('anal', 'penalty',sexual_occ)   

Unnamed: 0,penalty


In [81]:
max_cooccurrences(sexual_occ)

(brought         essential
 stated            brought
 following          handle
 deceptive          albeit
 imposition        highest
                   ...    
 destination         amply
 rarely           estoppel
 overturn          burning
 latitude          deprive
 universally    subjective
 Length: 1989, dtype: object,
 brought        2.418989
 stated         2.160181
 following      4.232509
 deceptive      1.144641
 imposition     1.366331
                  ...   
 destination    1.784076
 rarely         2.472672
 overturn       2.167245
 latitude       2.777174
 universally    3.341154
 Length: 1989, dtype: float64)

### Bigrams/Trigrams - Sexual

In [82]:
list =  hospital['Text'].tolist()

In [83]:
tokenized_sentences = []
for line in list:
    token = line.split()
    tokenized_sentences.append(token)

In [84]:
#bigrams (sequences of 2 words) and collapsed them into a unique term with the underscore symbol

finder = BigramCollocationFinder.from_documents(tokenized_sentences)
bgm = BigramAssocMeasures()
score = score = bgm.pmi
collocations = {'_'.join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}

In [86]:
#trigrams (sequences of 3 words) and collapsed them into a unique term with the underscore symbol

finder = TrigramCollocationFinder.from_documents(tokenized_sentences)
bgm = TrigramAssocMeasures()
score = score = bgm.pmi
collocations = {'_'.join(trigram): pmi for trigram, pmi in finder.score_ngrams(score)}

## PMI - DIVORCE

In [88]:
#Reload save pkl file
with open('divorce.pkl', 'rb') as f:
    divorce = pickle.load(f)

In [89]:
vectorizer = CountVectorizer(min_df=0.01)
X = vectorizer.fit_transform(divorce.Text)

In [90]:
pmi= pd.DataFrame(X.todense(), columns=vectorizer.vocabulary_)

In [91]:
asint = pmi.astype(int)
_pmi = asint.T.dot(asint)
_pmi.values[tuple([np.arange(_pmi.shape[0])]*2)] = 0

In [93]:
divorce_occ = pmi(_pmi, positive=True)
divorce_occ

Unnamed: 0,decision,correctly,assessed,usually,raised,remove,inch,bases,extend,privilege,...,badly,forcible,detainer,mainly,partitioned,upwards,inception,throw,endeavor,revert
decision,0.000000,2.215964,1.681607,2.130343,0.128236,0.000000,0.344637,1.377193,0.465713,0.000000,...,0.000000,0.000000,0.027760,0.000000,0.000000,0.580597,0.475479,0.000000,0.000000,0.863197
correctly,2.215964,0.000000,0.000000,1.689716,0.000000,0.565799,0.000000,0.737896,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.748199,1.624846,0.000000,0.000000,0.000000,0.431539
assessed,1.681607,0.000000,0.000000,1.048497,0.000000,0.000000,0.243621,0.044033,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.633731,0.000000,0.000000,0.000000,0.000000
usually,2.130343,1.689716,1.048497,0.000000,1.025553,1.249132,0.000000,0.338617,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.031611,0.000000,0.089015,0.000000
raised,0.128236,0.000000,0.000000,1.025553,0.000000,0.471639,0.000000,0.258890,0.363330,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.201961,0.004653,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
upwards,0.580597,1.624846,0.633731,0.000000,0.201961,0.003069,0.000000,0.000000,0.804131,0.000000,...,0.000000,0.000000,0.000000,0.149840,1.431263,0.000000,0.279355,0.489374,0.136873,0.000000
inception,0.475479,0.000000,0.000000,0.031611,0.004653,0.000000,0.000000,0.000000,0.346792,0.518786,...,0.000000,0.559102,1.500340,0.000000,0.000000,0.279355,0.000000,0.000000,0.591371,0.000000
throw,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.386946,0.000000,0.151346,0.000000,...,0.344888,0.000000,0.000000,0.000000,0.000000,0.489374,0.000000,0.000000,0.000000,0.771974
endeavor,0.000000,0.000000,0.000000,0.089015,0.000000,0.000000,0.170577,0.000000,0.000000,0.000000,...,0.738405,0.000000,0.682649,0.992403,0.133957,0.136873,0.591371,0.000000,0.000000,0.707155


In [94]:
get_pmi_value('annulment', 'married',divorce_occ)

Unnamed: 0,married
annulment,0.0


In [None]:
max_cooccurrences(_pmi)

### Bigrams/Trigrams - Divorce

In [95]:
list =  divorce['Text'].tolist()

In [96]:
tokenized_sentences = []
for line in list:
    token = line.split()
    tokenized_sentences.append(token)

In [97]:
#bigrams (sequences of 2 words) and collapsed them into a unique term with the underscore symbol

finder = BigramCollocationFinder.from_documents(tokenized_sentences)
bgm = BigramAssocMeasures()
score = score = bgm.pmi
collocations = {'_'.join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}

In [100]:
#trigrams (sequences of 3 words) and collapsed them into a unique term with the underscore symbol

finder = TrigramCollocationFinder.from_documents(tokenized_sentences)
bgm = TrigramAssocMeasures()
score = bgm.pmi
collocations = {'_'.join(trigram): pmi for trigram, pmi in finder.score_ngrams(score)}

### PMI - BURGLARY

In [102]:
#Reload save pkl file
with open('burglary.pkl', 'rb') as f:
    burglary = pickle.load(f)

In [103]:
vectorizer = CountVectorizer(min_df=0.01)
X = vectorizer.fit_transform(burglary.Text)

In [104]:
pmi= pd.DataFrame(X.todense(), columns=vectorizer.vocabulary_)

In [105]:
asint = pmi.astype(int)
_pmi = asint.T.dot(asint)
_pmi.values[tuple([np.arange(_pmi.shape[0])]*2)] = 0

In [107]:
burglary_occ = pmi(_pmi, positive=True)
burglary_occ

Unnamed: 0,terminate,parental,following,evidentiary,consideration,excuse,failure,comply,evidence,convincing,...,lapse,libel,ne,pose,technician,sending,forthwith,restraining,pursuance,swear
terminate,0.000000,0.000000,0.000000,0.000000,0.000000,0.053233,0.167160,0.398305,0.047807,0.000000,...,0.584521,0.000000,0.314517,0.000000,0.629747,0.154496,0.000000,0.000000,0.000000,0.000000
parental,0.000000,0.000000,0.596336,0.000000,0.000000,0.449296,0.181288,0.000000,0.615530,0.588120,...,0.000000,0.060018,0.734110,0.691170,0.794280,0.146646,0.000000,0.000000,0.055143,0.689015
following,0.000000,0.596336,0.000000,0.000000,0.000000,0.445482,0.000000,0.750991,0.000000,0.000000,...,0.614876,0.000000,0.218355,0.000000,0.171426,0.000000,0.000000,0.000000,0.023022,0.000000
evidentiary,0.000000,0.000000,0.000000,0.000000,0.705765,0.000000,0.000000,0.000000,0.000000,0.245283,...,0.036460,0.000000,0.000000,0.516662,0.375540,0.201993,0.002726,0.000000,0.000000,0.000000
consideration,0.000000,0.000000,0.000000,0.705765,0.000000,0.042964,0.079880,0.000000,0.050268,0.204145,...,0.000000,0.000000,0.000000,0.515740,0.000000,0.000000,0.000000,0.037769,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
sending,0.154496,0.146646,0.000000,0.201993,0.000000,0.413151,0.603252,0.000000,0.479363,0.000000,...,0.000000,0.023872,0.798283,1.578108,2.535362,0.000000,0.193719,0.000000,0.000000,0.000000
forthwith,0.000000,0.000000,0.000000,0.002726,0.000000,0.510949,0.000000,0.088661,0.080489,0.031391,...,0.000000,1.667271,0.099751,0.681396,0.048973,0.193719,0.000000,0.108021,0.334662,0.000000
restraining,0.000000,0.000000,0.000000,0.000000,0.037769,0.003070,0.000000,1.312199,0.359235,0.000000,...,0.000000,0.161757,0.124051,1.313739,0.000000,0.000000,0.108021,0.000000,0.455018,0.128379
pursuance,0.000000,0.055143,0.023022,0.000000,0.000000,0.091815,0.000000,0.000000,0.533412,0.000000,...,0.000000,0.700709,0.146888,0.000000,0.000000,0.000000,0.334662,0.455018,0.000000,0.000000


In [None]:
get_pmi_value('', '', burglary_occ)

In [None]:
max_cooccurrences(_pmi)

### Bigrams/Trigrams - Burglary

In [108]:
list = burglary['Text'].tolist()

In [109]:
tokenized_sentences = []
for line in list:
    token = line.split()
    tokenized_sentences.append(token)

In [110]:
#bigrams (sequences of 2 words) and collapsed them into a unique term with the underscore symbol

finder = BigramCollocationFinder.from_documents(tokenized_sentences)
bgm = BigramAssocMeasures()
score = bgm.pmi
collocations = {'_'.join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}

In [112]:
#trigrams (sequences of 3 words) and collapsed them into a unique term with the underscore symbol

finder = TrigramCollocationFinder.from_documents(tokenized_sentences)
bgm = TrigramAssocMeasures()
score = score = bgm.pmi
collocations = {'_'.join(trigram): pmi for trigram, pmi in finder.score_ngrams(score)}