<a href="https://colab.research.google.com/github/gamallo/jupiter/blob/main/curso.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Scripts to Build a Language Model and Compute Word Similarity 
This demo introduces all the scripts used to build a language model and to compare the semantic similarity between words. The pipeline is as follows:

* **Building the language model**:

*Input*: text corpus 

*Process*: `Tokenization -> N-grams -> Stopwords removal -> Vectorization` 

*Output*: language model

* **Word Similarity**:

*Input*: Word pairs + Language Model 

*Process*: `Cosine Similarity` 

*Output*: list of similar words

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Building the Language Model

## Tokenizer
A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines).

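The pattern used here matches a word character immediately followed by a comma or a period, and the replacement puts a space between the two, so that punctuation becomes a separate token. The substitution function `re.sub` takes three arguments:
* the pattern to be replaced: `(\w)([\,\.])`
* the replacement string: `\1 \2`
* the string in which the substitution is made: `line`

Let us start with a simple example: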

In [None]:
import re

line = input()
print(re.sub(r"(\w)([\,\.])",r"\1 \2",line))

llç
llç
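
The cell above depends on whatever is typed at the prompt. As a minimal sketch with a fixed sentence (the sentence and the helper name `tokenize` are illustrative assumptions, not part of the original scripts), the same substitution can be checked like this:

```python
import re

def tokenize(line):
    # Insert a space between a word character and a following comma or period.
    return re.sub(r"(\w)([\,\.])", r"\1 \2", line)

# Made-up sample sentence, used only for illustration.
sample = "Pedro read books, and Maria read books too."
tokenized = tokenize(sample)

print(tokenized)          # Pedro read books , and Maria read books too .
print(tokenized.split())  # the tokens, obtained by splitting on whitespace
```

Once the punctuation is detached, splitting on whitespace is enough to obtain the tokens.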


## Input / Output
Now we introduce the input and output files. The tokenizer script opens and reads an input file (mode `r`) and writes the tokenized text to an output file (mode `w`).


In [None]:
import re
#file1 = open('/content/drive/MyDrive/emlex/data/corpus.txt', 'r')
file1 = open('corpus.txt', 'r')
output=""
for line in file1:
    #line = line.strip()
    line = re.sub(r"(\w)([\,\.])",r"\1 \2",line)
    output += line

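with open('tokens.txt', 'w') as file2:  ##closing the file flushes the text to disk
    written = file2.write(output)  ##number of characters written
print(written)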

397
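
A possible variant, shown here only as a sketch, streams each tokenized line straight to the output file instead of accumulating the whole text in a string. The temporary folder and the one-line corpus are assumptions made so that the example runs on its own; the real notebook reads `corpus.txt`.

```python
import os
import re
import tempfile

# Hypothetical stand-in for corpus.txt, created in a temporary folder.
folder = tempfile.mkdtemp()
corpus_path = os.path.join(folder, "corpus.txt")
tokens_path = os.path.join(folder, "tokens.txt")

with open(corpus_path, "w") as corpus:
    corpus.write("Pedro read books, and Maria read novels.\n")

# Both files are closed (and the output flushed) when the with-block ends.
with open(corpus_path, "r") as source, open(tokens_path, "w") as target:
    for line in source:
        target.write(re.sub(r"(\w)([\,\.])", r"\1 \2", line))

with open(tokens_path, "r") as result:
    tokens_text = result.read()
print(tokens_text)
```

Because the `with` statement closes the output file, the next step of the pipeline is guaranteed to see the complete text when it opens `tokens.txt`.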

## N-grams
The script takes the file with the tokens and generates a new file with n-grams (the default parameter is 3, so it generates trigrams). It contains the function `ngrams`, which is defined using list functions such as `len` and `range`.

`ngrams.py` is a Python file stored in our Google Drive folder. It generates n-grams (`trigrams.txt`) from the `tokens.txt` file.

In [None]:
size = 3 ##trigrams

##open file tokens:
file_tokens = open('tokens.txt', 'r')

def ngrams(input, n): ##main function
  input = input.split(' ')
  result = []
  for i in range(len(input)-n+1):  ##range from 0 to position of the first element of the last ngram
    result.append(input[i:i+n])
  return result

output=""
for line in file_tokens:  
  line = line.strip()

  if len(line)>1:
    result = ngrams(line,size) 
    for ngram in result:
      juntar = ""
      for token in ngram:
        juntar += token + " "
      output += juntar + "\n"
      print (juntar)

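with open('trigrams.txt', 'w') as file2:  ##closing the file flushes the text to disk
  written = file2.write(output)  ##number of characters written
print(written)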

Pedro read books 
read books and 
books and Maria 
and Maria read 
Maria read books 
read books too 
books too , 
Pedro read novels 
read novels and 
novels and Maria 
and Maria read 
Maria read novels 
read novels and 
novels and books 
and books , 
Pedro and Maria 
and Maria read 
Maria read many 
read many things 
many things , 
things , but 
, but Pedro 
but Pedro loves 
Pedro loves Maria 
loves Maria , 
Maria loves books 
loves books , 
books , in 
, in fact 
in fact Maria 
fact Maria loves 
Maria loves many 
loves many things 
many things . 
Maria is eating 
is eating an 
eating an apple 
an apple and 
apple and Pedro 
and Pedro is 
Pedro is eating 
is eating an 
eating an apple 
an apple too 
apple too , 
Pedro is eating 
is eating eggs 
eating eggs now 
eggs now , 
now , Pedro 
, Pedro and 
Pedro and Maria 
and Maria are 
Maria are eating 
are eating many 
eating many things 
many things , 
Maria is eating 
is eating eggs 
eating eggs , 
eggs , Maria 
, Maria and 
Maria and Ped

1074

In [None]:
!python3 /content/drive/MyDrive/emlex/ngrams.py 3
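
To make the sliding window easier to see, here is a small sketch on a five-token sentence. The name `ngrams_from_tokens` is a hypothetical helper that takes a list of tokens rather than a string; it is not the function in `ngrams.py`.

```python
def ngrams_from_tokens(tokens, n=3):
    # One n-gram starts at every position that still has n tokens available.
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

tokens = "Pedro read books and Maria".split()

trigrams = ngrams_from_tokens(tokens, 3)
for gram in trigrams:
    print(" ".join(gram))
# Pedro read books
# read books and
# books and Maria

bigrams = ngrams_from_tokens(tokens, 2)  # same window, size 2
```

A line with fewer than `n` tokens yields an empty list, because `range` receives a value of zero or less.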

## StopWords

The next task is to identify function words (stopwords) in the n-grams and to replace each of them with the placeholder `STOP`. For this purpose, we need a file with a list of stopwords: `stopwords-en.txt`


In [None]:
##open file trigrams:
file_trigrams = open('trigrams.txt', 'r')
##open file stopwords:
#file_stopwords = open('/content/drive/MyDrive/emlex/data/stopwords-en.txt', 'r')
file_stopwords = open('stopwords-en.txt', 'r')

stop=[] ##this is the list where stopwords will be stored
for token in file_stopwords:
  token = token.strip()
  stop.append(token)

##read the trigrams and store each position token in a variable:
output=""
for line in file_trigrams:
  line = line.strip() 
  line = line.split() ##from string to list
  if len(line) >= 3:
    w1 = line[0]
    w2 = line[1]
    w3 = line[2]

    ##check if the tokens are in the list of stopwords, called stop
    ##(kept inside the if, so incomplete lines are skipped):
    if w1 in stop:
      w1 = "STOP"
    if w2 in stop:
      w2 = "STOP"
    if w3 in stop:
      w3 = "STOP"

    print(w1, w2, w3)
    output += w1 + " " + w2 + " " + w3 + "\n"

with open('trigrams2.txt', 'w') as file2:  ##closing the file flushes the text to disk
  written = file2.write(output)  ##number of characters written
print(written)



Pedro read books
read books STOP
books STOP Maria
STOP Maria read
Maria read books
read books STOP
books STOP STOP
Pedro read novels
read novels STOP
novels STOP Maria
STOP Maria read
Maria read novels
read novels STOP
novels STOP books
STOP books STOP
Pedro STOP Maria
STOP Maria read
Maria read STOP
read STOP things
STOP things STOP
things STOP STOP
STOP STOP Pedro
STOP Pedro loves
Pedro loves Maria
loves Maria STOP
Maria loves books
loves books STOP
books STOP STOP
STOP STOP fact
STOP fact Maria
fact Maria loves
Maria loves STOP
loves STOP things
STOP things STOP
Maria STOP eating
STOP eating STOP
eating STOP apple
STOP apple STOP
apple STOP Pedro
STOP Pedro STOP
Pedro STOP eating
STOP eating STOP
eating STOP apple
STOP apple STOP
apple STOP STOP
Pedro STOP eating
STOP eating eggs
eating eggs STOP
eggs STOP STOP
STOP STOP Pedro
STOP Pedro STOP
Pedro STOP Maria
STOP Maria STOP
Maria STOP eating
STOP eating STOP
eating STOP things
STOP things STOP
Maria STOP eating
STOP eating eggs
eat

1141
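
The same masking step can be sketched more compactly with a set, which makes each membership test fast. The five stopwords below are a hypothetical sample, and `mask_stopwords` is an illustrative name; the notebook reads the full list from `stopwords-en.txt`.

```python
# Hypothetical sample of stopwords, for illustration only.
stop = {"and", "too", ",", "is", "an"}

def mask_stopwords(ngram, stopwords):
    # Replace every stopword in the n-gram with the placeholder STOP.
    return ["STOP" if token in stopwords else token for token in ngram]

masked_first = mask_stopwords(["read", "books", "and"], stop)
masked_second = mask_stopwords(["books", "too", ","], stop)

print(" ".join(masked_first))   # read books STOP
print(" ".join(masked_second))  # books STOP STOP
```

Keeping the placeholder instead of deleting the token preserves the position of each word inside the trigram, which the next step relies on.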

# Language Model
We will build a language model of ngrams (trigrams). This consists of representing words as vectors in a word-context matrix (vectorization). This is performed in several steps:

## Matrix from trigrams

From the final file of trigrams (`trigrams2.txt`), we create a dictionary that stores the frequency of each word in each context. Each entry of the dictionary is a triple `(word, context, frequency)`; pairs that contain `STOP` are skipped.


In [None]:
import sys
from collections import defaultdict
matrix = defaultdict(int) ##initializing a dictionary

##open file trigrams2:
file_trigrams = open('trigrams2.txt', 'r')

for line in file_trigrams:  
  line = line.strip()
  line = line.split()
  if len(line) >= 3:
    w1 = line[0]
    w2 = line[1]
    w3 = line[2]

    ##create the dictionary 'matrix' with a double key (word, context) and 
    ##a frequency value (kept inside the if, so incomplete lines are skipped):
    if w1 != "STOP" and w2 != "STOP":
      matrix[w1,w2] += 1
      matrix[w2,w1] += 1

    if w1 != "STOP" and w3 != "STOP":
      matrix[w1,w3] += 1
      matrix[w3,w1] += 1

##write the matrix to the output file; closing it flushes the text to disk:
with open('matrix.txt','w') as file2:
  for w,c in matrix:
    print ('%s\t%s\t%d' % (w, c, matrix[w,c]), file=file2)
    print (w, c, matrix[w,c], file=sys.stderr)  ##echo the entry on screen

Pedro read 2
read Pedro 2
Pedro books 1
books Pedro 1
read books 2
books read 2
books Maria 3
Maria books 3
Maria read 3
read Maria 3
Pedro novels 1
novels Pedro 1
read novels 2
novels read 2
novels Maria 2
Maria novels 2
novels books 1
books novels 1
Pedro Maria 4
Maria Pedro 4
read things 1
things read 1
Pedro loves 2
loves Pedro 2
loves Maria 3
Maria loves 3
loves books 1
books loves 1
fact Maria 1
Maria fact 1
fact loves 1
loves fact 1
loves things 1
things loves 1
Maria eating 3
eating Maria 3
eating apple 2
apple eating 2
apple Pedro 1
Pedro apple 1
Pedro eating 2
eating Pedro 2
eating eggs 2
eggs eating 2
eating things 1
things eating 1
eggs Maria 1
Maria eggs 1
Pedro eggs 1
eggs Pedro 1
loves eggs 1
eggs loves 1
eggs lot 1
lot eggs 1
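
As a small worked sketch of the counting rule, the first five masked trigrams printed in the stopwords step can be processed by hand. `Counter` is used here in place of `defaultdict(int)`; this is only a convenience and an assumption of this example, not the notebook's choice.

```python
from collections import Counter

# First five lines of the masked trigram output shown earlier.
trigrams = [
    ("Pedro", "read", "books"),
    ("read", "books", "STOP"),
    ("books", "STOP", "Maria"),
    ("STOP", "Maria", "read"),
    ("Maria", "read", "books"),
]

matrix = Counter()
for w1, w2, w3 in trigrams:
    # The first word is paired with the second and with the third word.
    for context in (w2, w3):
        if w1 != "STOP" and context != "STOP":
            matrix[w1, context] += 1   # word -> context
            matrix[context, w1] += 1   # symmetric entry

for (word, context), freq in sorted(matrix.items()):
    print(word, context, freq)
```

On these five lines the pair `books`/`Maria` is counted twice (once from `books STOP Maria` and once from `Maria read books`), while the trigram starting with `STOP` contributes nothing.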


## Pointwise Mutual Information (PMI)

PMI is computed for each `(word, context, frequency)` triple. The result is the same matrix with PMI values instead of frequencies; only pairs with a positive PMI are kept.

In [None]:
import math
import sys
from collections import defaultdict

##initializing three dictionaries and two variables:
pair = defaultdict(int)  
w = defaultdict(int)
c = defaultdict(int)
n=0
fr=0

##computing pointwise mutual information by defining function pmi
def pmi (joint, wo, co):
  if  n:
      ##log of the joint probability divided by the product of the marginals
      output = math.log  ( (joint / n)  /  ( (w[wo]/n)*(c[co]/n) )  ) 
  else:
      output = 0
  return output

##open input file matrix:
file_matrix = open('matrix.txt', 'r')

for line in file_matrix:  
  line = line.strip()
  line = line.split()
  if len(line) >= 3:
    word = line[0]
    context = line[1]
    fr =  float(line[2]) ##as the number is read as a string, 
                        ## it is converted into float
        
    w[word] += fr
    c[context] += fr
    pair[word,context] = fr        
    n += fr

##write the PMI matrix to the output file; closing it flushes the text to disk:
with open('matrix_pmi.txt','w') as file2:
  for word,context in sorted(pair):
    assoc = pmi(pair[word,context],word,context)
    if assoc > 0:  ##keep only positively associated pairs
      print ('%s\t%s\t%.6f' % (word, context, assoc), file=file2)
      print (word, context, assoc, file=sys.stderr)  ##echo the entry on screen

Maria Pedro 0.27329333499968117
Maria books 0.545227050483323
Maria eating 0.3220834991691132
Maria fact 0.8329091229351039
Maria loves 0.42744401482693956
Maria novels 0.42744401482693956
Maria read 0.3220834991691132
Pedro Maria 0.27329333499968117
Pedro apple 0.784118958765672
Pedro eating 0.27329333499968117
Pedro eggs 0.09097177820572659
Pedro loves 0.3786538506575074
Pedro novels 0.09097177820572659
Pedro read 0.27329333499968117
apple Pedro 0.784118958765672
apple eating 1.8137383759468302
books Maria 0.545227050483323
books loves 0.2451224580329849
books novels 0.6505875661411494
books read 0.8329091229351039
eating Maria 0.3220834991691132
eating Pedro 0.27329333499968117
eating apple 1.8137383759468302
eating eggs 1.1205911953868848
eating things 1.1205911953868848
eggs Pedro 0.09097177820572659
eggs eating 1.1205911953868848
eggs lot 2.7300291078209855
eggs loves 0.5328045304847658
fact Maria 0.8329091229351039
fact loves 1.6314168191528755
lot eggs 2.7300291078209855
loves 
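
The formula can be checked by hand on two pairs. The counts below are read off the frequency matrix printed in the previous step (total frequency 92; `eggs` occurs 6 times as a word and `lot` once as a context; `apple` occurs 3 times and `eating` 10 times). The function name `pmi_from_counts` is a hypothetical helper that takes the counts as arguments instead of reading the global dictionaries.

```python
import math

def pmi_from_counts(joint, word_total, context_total, n):
    # log( P(word, context) / (P(word) * P(context)) ), natural logarithm
    return math.log((joint / n) / ((word_total / n) * (context_total / n)))

n = 92  # sum of all frequencies in the matrix printed earlier

eggs_lot = pmi_from_counts(1, 6, 1, n)       # joint frequency of (eggs, lot) is 1
apple_eating = pmi_from_counts(2, 3, 10, n)  # joint frequency of (apple, eating) is 2

print(round(eggs_lot, 6), round(apple_eating, 6))
```

Both results, roughly 2.73 and 1.81, agree with the values listed for `eggs lot` and `apple eating` in the output above. A pair scores high when it occurs together more often than its individual frequencies would predict.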

## Filtering

We can select the most relevant contexts for each word. This is done by selecting only the `N` most relevant contexts. For this purpose, it is necessary to sort them by `pmi` value. At the end of this process, we obtain the final language model that will be used to compute word similarity.

In [None]:
import math
import sys
from collections import defaultdict

#th = int(sys.argv[1])
th = 3  ##number of contexts kept for each word
w = defaultdict(dict)

##open input file matrix_pmi:
file_matrix_pmi = open('matrix_pmi.txt', 'r')

for line in file_matrix_pmi:  
  line = line.strip()  ##remove the trailing newline
  line = line.split()
  if len(line) >= 3:
    word = line[0]
    context = line[1]
    pmi_value =  float (line[2])

    ##pmi value is stored in a dictionary of a dictionary
    w[word][context] = pmi_value

##write the filtered matrix to the output file; closing it flushes the text to disk:
with open('matrix_pmi_filtered.txt','w') as file2:
  for word,contexts in sorted(w.items() ):
    ##sort the contexts by decreasing pmi and keep the first th of them:
    for context in sorted(contexts,key=contexts.get,reverse=True)[:th]:
      print (word, context, w[word][context], file=sys.stderr)  ##echo on screen
      print ('%s\t%s\t%.6f' % (word, context, w[word][context]), file=file2)


Maria fact 0.832909
Maria books 0.545227
Maria loves 0.427444
Pedro apple 0.784119
Pedro loves 0.378654
Pedro Maria 0.273293
apple eating 1.813738
apple Pedro 0.784119
books read 0.832909
books novels 0.650588
books Maria 0.545227
eating apple 1.813738
eating eggs 1.120591
eating things 1.120591
eggs lot 2.730029
eggs eating 1.120591
eggs loves 0.532805
fact loves 1.631417
fact Maria 0.832909
lot eggs 2.730029
loves fact 1.631417
loves things 1.225952
loves eggs 0.532805
novels read 1.120591
novels books 0.650588
novels Maria 0.427444
read novels 1.120591
read things 1.120591
read books 0.832909
things loves 1.225952
things eating 1.120591
things read 1.120591
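
As a sketch of the selection step on a single word, the contexts of `Maria` (values copied from the PMI output above and rounded to six decimals) can be cut down to the three strongest ones. Using `heapq.nlargest` instead of a full sort is a choice of this example, and `top_contexts` is a hypothetical helper name.

```python
import heapq

# PMI values for the word "Maria", copied from the PMI output above (rounded).
maria_contexts = {
    "Pedro": 0.273293, "books": 0.545227, "eating": 0.322083,
    "fact": 0.832909, "loves": 0.427444, "novels": 0.427444,
    "read": 0.322083,
}

def top_contexts(weights, n):
    # Return the n (context, pmi) pairs with the highest pmi.
    return heapq.nlargest(n, weights.items(), key=lambda item: item[1])

top3 = top_contexts(maria_contexts, 3)
for context, value in top3:
    print("Maria", context, value)
```

Note that `loves` and `novels` are tied at 0.427444, so the third position depends on how ties are broken; with a small `N`, such ties decide which context survives the filter.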


# Word Similarity
Given a word, we need to extract all word pairs with the given word, compute cosine similarity for all pairs and select the N most similar words.
## Word Pairs
The first step is word pair extraction. Given a specific word, for instance `novels`, we build all pairs `<novels, word>` from the matrix file.

In [None]:
import sys
from collections import defaultdict

target = "novels" ##target word
w = defaultdict(int)

##open input file matrix_pmi_filtered:
file_matrix_pmi_filtered = open('matrix_pmi_filtered.txt', 'r')

for line in file_matrix_pmi_filtered:  
    line = line.strip() 
    line = line.split()
    if len(line) >= 3:
        word = line[0]

        w[word] += 1  ##collect every word that has a vector in the model

##write the pairs to the output file; closing it flushes the text to disk:
with open('pairs.txt','w') as file2:
    for word in sorted(w):
        print ('%s %s' % (target, word), file=file2)
        print (target, word, file=sys.stderr)  ##echo the pair on screen


novels Maria
novels Pedro
novels apple
novels books
novels eating
novels eggs
novels fact
novels lot
novels loves
novels novels
novels read
novels things


## Cosine Similarity
Given the `pairs` file generated in the previous step and the final matrix (language model), the following script computes the cosine similarity of each word pair: the dot product of the two context vectors divided by the product of their Euclidean norms.

In [4]:
import math,sys
from collections import defaultdict

##open input file matrix_pmi_filtered:
file_matrix_pmi_filtered = open('matrix_pmi_filtered.txt', 'r')
##open pairs file:
file_pairs = open('pairs.txt', 'r')

##initializing dictionaries:
dic = defaultdict(dict) 
w = defaultdict(float)
simil = defaultdict(float)

for line in file_matrix_pmi_filtered:  
    line = line.strip()  
    (word, context, weight) = line.split()
    #print (word,context,weight, file=sys.stderr)
 
    dic[word][context] = float(weight)
    w[word] += float(weight)

##write the similarities to the output file; closing it flushes the text to disk:
with open('cosine.txt','w') as file2:
  for line in file_pairs:
    line = line.strip() 
    (w1,w2) = line.split()
    common=0      ##dot product of the two vectors
    wc_count1=0   ##squared euclidean norm of the word_1 vector
    wc_count2=0   ##squared euclidean norm of the word_2 vector
    w1_count = w[w1]
    w2_count = w[w2]

    for context in dic[w1]:
      wc_count1 += dic[w1][context] **2

    for context in dic[w2]:
      wc_count2 += dic[w2][context] **2
      if context in dic[w1]:  ##context shared by both words
        common += dic[w1][context] * dic[w2][context]
        #print (context,w1,w2, file=sys.stderr)

    ##skip the pair if either word has no contexts in the model:
    if (w1_count and w2_count):   
      simil[w1,w2] = (common) / math.sqrt (wc_count1 * wc_count2)
      print ('%s\t%s\t%.6f' %  (w1, w2, simil[w1,w2]), file=file2)
      print (w1, w2, simil[w1,w2], file=sys.stderr)  ##echo the pair on screen


novels Maria 0.23996446673649857
novels Pedro 0.09381105579484929
novels apple 0.0
novels books 0.7188325190384269
novels eating 0.0
novels eggs 0.0
novels fact 0.1424489429187707
novels lot 0.0
novels loves 0.0
novels novels 1.0
novels read 0.22183111610835893
novels things 0.4593344739127516
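
The score for one pair can be reproduced by hand as a sanity check. The three vectors below are copied from the filtered matrix shown earlier; the function `cosine` is an illustrative sketch that works on two dictionaries, not the notebook's script.

```python
import math

def cosine(vec1, vec2):
    # Dot product over the contexts the two sparse vectors share.
    common = sum(weight * vec2[context]
                 for context, weight in vec1.items() if context in vec2)
    norm1 = math.sqrt(sum(weight ** 2 for weight in vec1.values()))
    norm2 = math.sqrt(sum(weight ** 2 for weight in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty vector is similar to nothing
    return common / (norm1 * norm2)

# Context vectors copied from the filtered matrix (top 3 contexts per word).
novels = {"read": 1.120591, "books": 0.650588, "Maria": 0.427444}
books = {"read": 0.832909, "novels": 0.650588, "Maria": 0.545227}
apple = {"eating": 1.813738, "Pedro": 0.784119}

print(round(cosine(novels, books), 4))  # about 0.7188, as in the output above
print(cosine(novels, apple))            # 0.0, the vectors share no context
```

`novels` and `books` share the contexts `read` and `Maria`, which is what drives their score; `novels` and `apple` have no context in common, so their similarity is zero.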


## Ranking Top N Most Similar Words
This script ranks the candidate words by similarity score and keeps the `N` (e.g., 3) words most similar to the target word (e.g., `novels`). Note that the target word itself ranks first, with a similarity of 1.

In [7]:
import sys
from collections import defaultdict

th = 3 #N most similar words

##open cosine file:
file_cosine = open('cosine.txt', 'r')

w = defaultdict(dict)

for line in file_cosine:  
  line = line.strip()
  line = line.split('\t') #In this case, the separator is a tab
  
  if len(line) >= 3:
    word1 = line[0]
    word2 = line[1]
    simil =  float (line[2])
    w[word1][word2] = simil

##write the ranking to the output file; closing it flushes the text to disk:
with open('ranking.txt','w') as file2:
  for w1, second in sorted(w.items() ):
    ##sort the candidates by decreasing similarity and keep the first th of them:
    for w2 in sorted(second,key=second.get,reverse=True)[:th]:
      print ('%s\t%s\t%.6f' % (w1, w2, w[w1][w2]), file=file2)
      print (w1,w2,w[w1][w2],file=sys.stderr)  ##echo the pair on screen


novels novels 1.0
novels books 0.718833
novels things 0.459334
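
Since the target word always tops its own ranking, one may prefer to leave it out. The sketch below reuses the cosine scores printed above (rounded to six decimals); the function `rank` and its `skip_target` parameter are hypothetical additions for illustration, not part of the original script.

```python
# Cosine scores for the target "novels", copied from the output above (rounded).
scores = {
    "Maria": 0.239964, "Pedro": 0.093811, "apple": 0.0, "books": 0.718833,
    "eating": 0.0, "eggs": 0.0, "fact": 0.142449, "lot": 0.0,
    "loves": 0.0, "novels": 1.0, "read": 0.221831, "things": 0.459334,
}

def rank(target, similarities, n, skip_target=False):
    # Optionally drop the target itself, then keep the n highest scores.
    candidates = [(word, score) for word, score in similarities.items()
                  if not (skip_target and word == target)]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:n]

with_target = rank("novels", scores, 3)
without_target = rank("novels", scores, 3, skip_target=True)

print([word for word, _ in with_target])     # ['novels', 'books', 'things']
print([word for word, _ in without_target])  # ['books', 'things', 'Maria']
```

With the target excluded, the third slot goes to `Maria`, the next candidate in the cosine output.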
