### Mini project 5: Working with text

Your tasks are the following:
1. Collect and load text documents from various sources of one domain – e.g. some of txt, doc, csv,
json, pdf files, web pages, or data frame attributes.
2. Extract, clean, and transform the text from the sources, to prepare it for vectorisation.
3. Vectorise and store the clean text in a software structure.
4. Create a simple interactive prototype of an application that takes a text from a user and outputs
the top three related stored texts, using a vector-similarity approach.
5. Optionally, integrate your application with an LLM (large language model) to improve the quality
of the language operations.
6. Suggest various implementations of such an application.
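
Task 4 can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's implementation: the stored texts are assumed to be cleaned already, plain word counts serve as vectors, and the names `STORED`, `count_vector`, `cosine` and `top_three` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already cleaned texts standing in for the stored documents
STORED = [
    "hunde elsker at spille bold",
    "katte sover hele dagen",
    "hunde spiller poker",
    "faren købte et ur",
]

def count_vector(text):
    """Map each whitespace-separated word to its number of occurrences."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_three(query, texts):
    """Return (score, text) pairs for the three most similar texts, best first."""
    q = count_vector(query)
    scored = [(cosine(q, count_vector(t)), t) for t in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:3]

for score, text in top_three("hunde spille poker", STORED):
    print(f"{score:.3f}  {text}")
```

In an interactive prototype the query string would come from `input()` or a web form instead of a literal.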

#### Environment

In [1]:
import docx
import json
import string
import ctypes
import math
import pprint

# data structure
import numpy as np
import pandas as pd

# cosine similarity
import cosimfunc
from cosimfunc import cosim
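
The `cosimfunc` module is not shown in this notebook. As an assumption about what `cosim` computes, here is a minimal sketch of cosine similarity, the dot product of two vectors divided by the product of their lengths. The name `cosim_sketch` is hypothetical, and the real helper also prints its result.

```python
import math

def cosim_sketch(u, v):
    """Cosine similarity of two equal-length numeric sequences.

    Returns 0.0 when either vector has zero length, to avoid dividing by zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Two binary vectors that share one of their two non-zero positions
print(cosim_sketch([1, 1, 0], [1, 0, 1]))
```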

#### Load data

In [2]:
# method to read text from a docx file
def read_docx(name):
    doc = docx.Document(name)
    tekst = "\n".join(paragraf.text for paragraf in doc.paragraphs)
    return tekst

In [3]:
# method to read text from a json file
def read_json(name):
    with open(name, "r", encoding="utf-8") as json_fil:
        data = json.load(json_fil)
    return data

In [4]:
# read data
dad_jokes = read_docx("farjokes.docx")
dog_jokes = read_json("hunde_jokes.json")

#### Clean data

In [5]:
# function to lowercase text and strip punctuation
def clean(text):
    text = text.lower()

    # string.punctuation holds the 32 ASCII punctuation characters (! ? . , ; : quotes, brackets, ...)
    PUNCT = string.punctuation
    text = text.translate(str.maketrans('', '', PUNCT))
    return text
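
`string.punctuation` covers ASCII characters only, so typographic marks such as `»`, `«` or `–` pass through `clean` unchanged. A possible variant, sketched here under the hypothetical name `clean_unicode`, drops every character whose Unicode category starts with `P`. Unlike `string.punctuation`, this keeps symbols such as `$` or `+`, which belong to the `S` categories.

```python
import unicodedata

def clean_unicode(text):
    """Lowercase, drop all Unicode punctuation characters, collapse whitespace."""
    text = text.lower()
    kept = (ch for ch in text if not unicodedata.category(ch).startswith("P"))
    # split/join collapses the gaps left behind by removed characters
    return " ".join("".join(kept).split())

print(clean_unicode("»Hvorfor?« – Fordi!"))
```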

In [6]:
# split into individual jokes (one per paragraph)
dad_jokes_split = dad_jokes.split('\n')

In [7]:
# first dad joke
first_dad_joke = dad_jokes_split[0]
print(first_dad_joke)

Hvorfor gik faren ind i butikken med et ur? Han ville købe tid!


In [8]:
# first dog joke
first_dog_joke = dog_jokes['hunde_jokes'][0]['joke']
print(first_dog_joke)

Hvorfor kan ikke hunde spille poker i Afrika? Fordi der er for mange cheetahs!


### Bag of words

In [9]:
sent1 = first_dad_joke.split(" ")
sent2 = first_dog_joke.split(" ")

In [10]:
print(sent1)
print(sent2)

['Hvorfor', 'gik', 'faren', 'ind', 'i', 'butikken', 'med', 'et', 'ur?', 'Han', 'ville', 'købe', 'tid!']
['Hvorfor', 'kan', 'ikke', 'hunde', 'spille', 'poker', 'i', 'Afrika?', 'Fordi', 'der', 'er', 'for', 'mange', 'cheetahs!']


#### Corpus of Terms

In [11]:
# All distinct words appearing in any of the documents
# union() removes duplicates
corpus = set(sent1).union(set(sent2))
print(corpus)

{'Fordi', 'ville', 'er', 'et', 'butikken', 'Hvorfor', 'cheetahs!', 'i', 'ur?', 'poker', 'kan', 'tid!', 'hunde', 'for', 'mange', 'faren', 'spille', 'der', 'Afrika?', 'ind', 'ikke', 'med', 'Han', 'gik', 'købe'}


In [12]:
# corpus size
n = len(corpus)
n

25
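
A Python `set` has no defined order, so the column order of the vectors built from `corpus` can differ between runs. One way to make it reproducible is to sort the vocabulary and give every word a fixed index. The sketch below uses shortened, hypothetical sentences, and the names `vocab`, `index` and `binary_vector` are illustrative.

```python
# Shortened example sentences (hypothetical stand-ins for sent1 and sent2)
sent_a = ["Hvorfor", "gik", "faren", "ind"]
sent_b = ["Hvorfor", "kan", "ikke", "hunde"]

# Sorted vocabulary: the same order on every run
vocab = sorted(set(sent_a) | set(sent_b))
index = {word: i for i, word in enumerate(vocab)}

def binary_vector(sent):
    """Return a 0/1 list with one position per vocabulary word."""
    vec = [0] * len(vocab)
    for word in sent:
        vec[index[word]] = 1
    return vec

print(vocab)
print(binary_vector(sent_a))
print(binary_vector(sent_b))
```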

## Method 1: Binary vectorisation

Check whether a word appears in a sentence (document).
- write 1 if it does, 0 if not

Store the findings in a dictionary (key-value pairs).
- the word is the key, its appearance is the value


In [13]:
# Binary vector of word appearance in a sentence
def vect(sent):
    # create new dict and place zeros in it
    mydict = dict.fromkeys(corpus, 0) 
    
    # code each word's appearance in the sentence with 1
    for word in sent:
        mydict[word] = 1
    return mydict    

In [14]:
# binarise sentence 1
dict1 = vect(sent1)
dict1

{'Fordi': 0,
 'ville': 1,
 'er': 0,
 'et': 1,
 'butikken': 1,
 'Hvorfor': 1,
 'cheetahs!': 0,
 'i': 1,
 'ur?': 1,
 'poker': 0,
 'kan': 0,
 'tid!': 1,
 'hunde': 0,
 'for': 0,
 'mange': 0,
 'faren': 1,
 'spille': 0,
 'der': 0,
 'Afrika?': 0,
 'ind': 1,
 'ikke': 0,
 'med': 1,
 'Han': 1,
 'gik': 1,
 'købe': 1}

In [15]:
# binarise sentence 2
dict2 = vect(sent2)
dict2

{'Fordi': 1,
 'ville': 0,
 'er': 1,
 'et': 0,
 'butikken': 0,
 'Hvorfor': 1,
 'cheetahs!': 1,
 'i': 1,
 'ur?': 0,
 'poker': 1,
 'kan': 1,
 'tid!': 0,
 'hunde': 1,
 'for': 1,
 'mange': 1,
 'faren': 0,
 'spille': 1,
 'der': 1,
 'Afrika?': 1,
 'ind': 0,
 'ikke': 1,
 'med': 0,
 'Han': 0,
 'gik': 0,
 'købe': 0}

In [16]:
# Store the data in a DataFrame
df = pd.DataFrame([dict1, dict2])
df

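|  | Fordi | ville | er | et | butikken | Hvorfor | cheetahs! | i | ur? | poker | ... | faren | spille | der | Afrika? | ind | ikke | med | Han | gik | købe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | ... | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |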


In [17]:
ar = df.to_numpy() 

In [18]:
# Test the similarity
cosim(ar[0], ar[1])

Cosine similarity:  0.14824986333222026


0.14824986333222026

## Method 2: Word Importance

- TF - term frequency - how frequently a term (word) appears in a document
- DF - document frequency - number of documents containing the term
- IDF - inverse document frequency - how rare the term is across all documents; the fewer documents contain it, the higher the value
- TF-IDF - an integrated measure of the importance of a term - multiply TF by IDF to find it.

term = word

TF can be measured in different ways:

- absolute number of times the word appears in a document
- relative frequency - count of occurrences divided by the number of words in the document
- logarithmically scaled frequency (e.g. log(1 + count))
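
The three variants can be compared side by side. The sketch below is illustrative only: the function name `tf_variants` and the token list are hypothetical, and the natural logarithm is an assumption, since the text does not fix a base.

```python
import math
from collections import Counter

def tf_variants(tokens):
    """Return absolute, relative and log-scaled term frequency per word."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        word: {
            "absolute": c,              # raw count
            "relative": c / total,      # share of all tokens in the document
            "log": math.log(1 + c),     # natural log of (1 + count), an assumed base
        }
        for word, c in counts.items()
    }

tokens = ["hund", "kat", "hund", "hund"]
for word, values in tf_variants(tokens).items():
    print(word, values)
```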

### 2.1 Count Vectorisation

In [19]:
# Create count vector from a sentence, telling the frequency of word appearance
def cvect(sent):
    
    # creates the dict with the corpus as keys
    mydict = dict.fromkeys(corpus, 0) 
    
    # count the occurrences of each word
    for word in sent:
        mydict[word] += 1
    return mydict    
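
`cvect` raises a `KeyError` when a sentence contains a word that is not in `corpus`, which matters once user input has to be vectorised. A possible guard, sketched under the hypothetical name `cvect_safe` with the vocabulary passed in explicitly, counts known words and collects unknown ones separately.

```python
def cvect_safe(sent, vocabulary):
    """Count vocabulary words in sent; return the counts and the unknown words."""
    counts = dict.fromkeys(vocabulary, 0)
    unknown = []
    for word in sent:
        if word in counts:
            counts[word] += 1
        else:
            # out-of-vocabulary word: report it instead of failing
            unknown.append(word)
    return counts, unknown

vocabulary = {"hund", "kat", "poker"}
counts, unknown = cvect_safe(["hund", "hund", "fisk"], vocabulary)
print(counts, unknown)
```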

In [20]:
dict1 = cvect(sent1)
dict2 = cvect(sent2)

In [21]:
# collect the dictionaries in a data frame
dfc = pd.DataFrame([dict1, dict2])
dfc

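|  | Fordi | ville | er | et | butikken | Hvorfor | cheetahs! | i | ur? | poker | ... | faren | spille | der | Afrika? | ind | ikke | med | Han | gik | købe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | ... | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |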


Re-calculate the similarity

In [22]:
# Store the count values in an array
arc = dfc.to_numpy()
arc

array([[0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
        1, 1, 1],
       [1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,
        0, 0, 0]])

In [23]:
# Test the similarity
cosim(arc[0], arc[1])

Cosine similarity:  0.14824986333222026


0.14824986333222026

### 2.2  Term Frequency
Relative TF

tf(t,d) = count of t in d / number of words in d

In [24]:
# Recalculate the word appearance as a proportion of all words in the document
def computeTF(mydict, n):
    # New empty dict for the results of recalculation
    tfDict = {}
    
    for word, wcount in mydict.items():
        # calculate the proportion
        tfDict[word] = wcount/float(n) 
    return(tfDict)

In [25]:
# call the function for both sets
tf1 = computeTF(dict1, len(sent1))
tf2 = computeTF(dict2, len(sent2))

In [26]:
# store the two vectors into dataframe
tff = pd.DataFrame([tf1, tf2])
tff

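|  | Fordi | ville | er | et | butikken | Hvorfor | cheetahs! | i | ur? | poker | ... | faren | spille | der | Afrika? | ind | ikke | med | Han | gik | købe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.076923 | 0.0 | 0.076923 | 0.076923 | 0.076923 | 0.0 | 0.076923 | 0.076923 | 0.0 | ... | 0.076923 | 0.0 | 0.0 | 0.0 | 0.076923 | 0.0 | 0.076923 | 0.076923 | 0.076923 | 0.076923 |
| 1 | 0.071429 | 0.0 | 0.071429 | 0.0 | 0.0 | 0.071429 | 0.071429 | 0.071429 | 0.0 | 0.071429 | ... | 0.0 | 0.071429 | 0.071429 | 0.071429 | 0.0 | 0.071429 | 0.0 | 0.0 | 0.0 | 0.0 |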


Raw TF doesn't mean much on its own: a term that appears 10 times more often is not 10 times more important.

In [27]:
# compute similarity

In [28]:
# Store the TF values in an array
art = tff.to_numpy()

In [29]:
# Test the similarity
cosim(art[0], art[1])

Cosine similarity:  0.14824986333222023


0.14824986333222023

### 2.3 DF and IDF

If a word occurs many times in one document but also appears in many other documents, it may not be important, just frequent. IDF measures how common a word is across the whole document collection.

In [30]:
# IDF - inverse document frequency - measures the informativeness of term t
# df - number of documents containing the term
# N - number of all documents
# D - the collection of all documents

# basic form:      idf(t, D) = log10(N/df)
# form used below: idf(t, D) = log10(N/df + 1)

def computeIDF(allDocs):
    # number of documents
    N = len(allDocs)

    # create a dict with the words as keys and 0 as value
    idf = dict.fromkeys(allDocs[0].keys(), 0)

    # check all docs
    for doc in allDocs:
        # check all words
        for word, wcount in doc.items():
            # count the doc if the word appears in it
            if wcount > 0:
                idf[word] += 1

    # turn document frequencies into IDF values: idf(t) = log10(N/df + 1)
    # df >= 1 here, because every vocabulary word occurs in at least one document
    for word, wcount in idf.items():
        idf[word] = math.log10(N/(float(wcount)) + 1)

    return(idf)

In [31]:
# test
idfs = computeIDF([dict1, dict2])

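The lower the IDF, the more common the term is across the documents. With the formula used in `computeIDF`, a term that appears in every document gets log10(2) ≈ 0.301, the lowest possible value.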

In [32]:
pprint.pprint(idfs)

{'Afrika?': 0.47712125471966244,
 'Fordi': 0.47712125471966244,
 'Han': 0.47712125471966244,
 'Hvorfor': 0.3010299956639812,
 'butikken': 0.47712125471966244,
 'cheetahs!': 0.47712125471966244,
 'der': 0.47712125471966244,
 'er': 0.47712125471966244,
 'et': 0.47712125471966244,
 'faren': 0.47712125471966244,
 'for': 0.47712125471966244,
 'gik': 0.47712125471966244,
 'hunde': 0.47712125471966244,
 'i': 0.3010299956639812,
 'ikke': 0.47712125471966244,
 'ind': 0.47712125471966244,
 'kan': 0.47712125471966244,
 'købe': 0.47712125471966244,
 'mange': 0.47712125471966244,
 'med': 0.47712125471966244,
 'poker': 0.47712125471966244,
 'spille': 0.47712125471966244,
 'tid!': 0.47712125471966244,
 'ur?': 0.47712125471966244,
 'ville': 0.47712125471966244}
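
The two distinct values in this output follow directly from the expression that `computeIDF` evaluates, `log10(N / df + 1)` with `N = 2`. The sketch below recomputes them and, for comparison only, shows the common textbook form `log10(N / df)`. Both function names are hypothetical.

```python
import math

def idf_as_coded(n_docs, doc_freq):
    """IDF exactly as computeIDF evaluates it: log10(N / df + 1)."""
    return math.log10(n_docs / doc_freq + 1)

def idf_textbook(n_docs, doc_freq):
    """A common textbook form, log10(N / df); shown only for comparison."""
    return math.log10(n_docs / doc_freq)

N = 2
# df = 1: word in one joke only; df = 2: word in both jokes ('Hvorfor', 'i')
print(idf_as_coded(N, 1), idf_as_coded(N, 2))
print(idf_textbook(N, 1), idf_textbook(N, 2))
```

With the textbook form, a word that occurs in every document gets weight 0 and drops out of the TF-IDF vector entirely. With the form used here it keeps a small positive weight.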


### 2.4 TF-IDF
TF-IDF determines how relevant a term is in a given document

In [35]:
# tf-idf(t, d) = tf(t, d) * idf(t, D)

def computeTFIDF(tf, idfs):
    tfidf = {}
    for word, wcount in tf.items():
        tfidf[word] = wcount*idfs[word]
    return(tfidf)

In [36]:
idf1 = computeTFIDF(tf1, idfs)
idf2 = computeTFIDF(tf2, idfs)

In [37]:
# store in a dataframe
idf= pd.DataFrame([idf1, idf2])
idf

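|  | Fordi | ville | er | et | butikken | Hvorfor | cheetahs! | i | ur? | poker | ... | faren | spille | der | Afrika? | ind | ikke | med | Han | gik | købe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.036702 | 0.0 | 0.036702 | 0.036702 | 0.023156 | 0.0 | 0.023156 | 0.036702 | 0.0 | ... | 0.036702 | 0.0 | 0.0 | 0.0 | 0.036702 | 0.0 | 0.036702 | 0.036702 | 0.036702 | 0.036702 |
| 1 | 0.03408 | 0.0 | 0.03408 | 0.0 | 0.0 | 0.021502 | 0.03408 | 0.021502 | 0.0 | 0.03408 | ... | 0.0 | 0.03408 | 0.03408 | 0.03408 | 0.0 | 0.03408 | 0.0 | 0.0 | 0.0 | 0.0 |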


In [38]:
# Store the TF-IDF values in an array
arx = idf.to_numpy()

In [39]:
# Test the similarity
cosim(arx[0], arx[1])

Cosine similarity:  0.06480110259101254


0.06480110259101254
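
The whole TF-IDF pipeline for the two jokes fits in one short, dependency-free script. This is a condensed sketch rather than the notebook's code: it uses the same relative TF and the same `log10(N / df + 1)` IDF, but the names `docs`, `vocab`, `tfidf_vector` and `cosine` are illustrative. It should reproduce the similarity reported above, up to floating-point rounding.

```python
import math

doc1 = "Hvorfor gik faren ind i butikken med et ur? Han ville købe tid!".split(" ")
doc2 = "Hvorfor kan ikke hunde spille poker i Afrika? Fordi der er for mange cheetahs!".split(" ")
docs = [doc1, doc2]

# Sorted vocabulary gives a fixed column order
vocab = sorted(set(doc1) | set(doc2))
n_docs = len(docs)

# Document frequency and the same IDF form as computeIDF: log10(N / df + 1)
df = {w: sum(1 for d in docs if w in d) for w in vocab}
idf = {w: math.log10(n_docs / df[w] + 1) for w in vocab}

def tfidf_vector(doc):
    """Relative term frequency times IDF, one entry per vocabulary word."""
    return [doc.count(w) / len(doc) * idf[w] for w in vocab]

def cosine(u, v):
    """Cosine similarity of two equal-length numeric lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

v1, v2 = tfidf_vector(doc1), tfidf_vector(doc2)
print(cosine(v1, v2))
```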

#### Note that Bag of Words has many drawbacks. E.g. it treats "John is older than Mary" and "Mary is older than John" as identical.
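
The word-order problem is easy to demonstrate: both sentences produce the same word counts. Counting word pairs (bigrams) instead is one possible way to keep some order information. It is shown here only as an illustration, and the helper name `bigrams` is hypothetical.

```python
from collections import Counter

s1 = "john is older than mary"
s2 = "mary is older than john"

# Bag of words: identical counts, so the two sentences cannot be told apart
print(Counter(s1.split()) == Counter(s2.split()))

def bigrams(text):
    """Count adjacent word pairs, which preserves local word order."""
    words = text.split()
    return Counter(zip(words, words[1:]))

# Bigram counts differ, e.g. ('john', 'is') versus ('mary', 'is')
print(bigrams(s1) == bigrams(s2))
```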

## Method 3: Improving by Preprocessing

In [41]:
# !pip install langdetect



In [42]:
import langdetect
from langdetect import detect, detect_langs

In [44]:
import spacy
from spacy import displacy

In [45]:
#!python -m spacy validate

In [46]:
#!python -m spacy download en_core_web_md

In [47]:
#!python -m spacy download da_core_news_md

In [48]:
def tokenize(text):
    # detect the language and pick the matching spaCy model
    lang = detect(text)
    if lang == 'en':
        model = 'en_core_web_md'
    elif lang == 'da':
        model = 'da_core_news_md'
    else:
        # stop here: without a model the code below cannot run
        raise ValueError(f"Unsupported language: {lang}")

    mysent = []
    nlp = spacy.load(model)
    doc = nlp(text)

    for token in doc:
        # keep only tokens that are not stop words, punctuation or whitespace
        if not (token.is_stop or token.is_punct or token.is_space):
            data = {'token': token.text,'lemma': token.lemma_, 'POS': token.pos_, 'tag': token.tag_,
                    "ent_type": token.ent_type_}
            pprint.pprint(data)
            mysent.append(token.text)
    # spacy.displacy.serve(doc, style="ent")
    # spacy.displacy.serve(doc, style="dep")
    return mysent

In [50]:
sent1 = clean(first_dad_joke)
sent1

'hvorfor gik faren ind i butikken med et ur han ville købe tid'

In [52]:
sent2 = clean(first_dog_joke)
sent2

'hvorfor kan ikke hunde spille poker i afrika fordi der er for mange cheetahs'

In [51]:
tok1 = tokenize(sent1)
tok1

{'POS': 'VERB', 'ent_type': '', 'lemma': 'gå', 'tag': 'VERB', 'token': 'gik'}
{'POS': 'NOUN',
 'ent_type': '',
 'lemma': 'fare',
 'tag': 'NOUN',
 'token': 'faren'}
{'POS': 'NOUN',
 'ent_type': '',
 'lemma': 'butik',
 'tag': 'NOUN',
 'token': 'butikken'}
{'POS': 'NOUN', 'ent_type': '', 'lemma': 'ur', 'tag': 'NOUN', 'token': 'ur'}
{'POS': 'VERB', 'ent_type': '', 'lemma': 'købe', 'tag': 'VERB', 'token': 'købe'}
{'POS': 'NOUN', 'ent_type': '', 'lemma': 'tid', 'tag': 'NOUN', 'token': 'tid'}


['gik', 'faren', 'butikken', 'ur', 'købe', 'tid']

In [53]:
tok2 = tokenize(sent2)
tok2

{'POS': 'NOUN',
 'ent_type': '',
 'lemma': 'hund',
 'tag': 'NOUN',
 'token': 'hunde'}
{'POS': 'VERB',
 'ent_type': '',
 'lemma': 'spille',
 'tag': 'VERB',
 'token': 'spille'}
{'POS': 'NOUN',
 'ent_type': '',
 'lemma': 'poker',
 'tag': 'NOUN',
 'token': 'poker'}
{'POS': 'NOUN',
 'ent_type': 'LOC',
 'lemma': 'afrika',
 'tag': 'NOUN',
 'token': 'afrika'}
{'POS': 'NOUN',
 'ent_type': '',
 'lemma': 'cheetahs',
 'tag': 'NOUN',
 'token': 'cheetahs'}


['hunde', 'spille', 'poker', 'afrika', 'cheetahs']

In [56]:
dfc = pd.DataFrame([tok1, tok2])
dfc

|  | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | gik | faren | butikken | ur | købe | tid |
| 1 | hunde | spille | poker | afrika | cheetahs |  |


In [57]:
# Convert the data frame to a numpy array
arc = dfc.to_numpy()

In [59]:
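# Test the similarity
# Note: arx still holds the TF-IDF vectors from section 2.4, so this repeats that result;
# the token lists in dfc/arc are strings and must be vectorised before they can be compared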
cosim(arx[0], arx[1])

Cosine similarity:  0.06480110259101254


0.06480110259101254
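
The token lists `tok1` and `tok2` are lists of strings, so they have to be turned into numeric vectors before cosine similarity applies. A minimal sketch, assuming plain count vectors over the joint token vocabulary (the names `vocab`, `count_vector` and `cosine` are illustrative):

```python
import math
from collections import Counter

# Token lists as returned by tokenize() above
tok1 = ['gik', 'faren', 'butikken', 'ur', 'købe', 'tid']
tok2 = ['hunde', 'spille', 'poker', 'afrika', 'cheetahs']

vocab = sorted(set(tok1) | set(tok2))

def count_vector(tokens):
    """Count vector over the joint vocabulary of both token lists."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def cosine(u, v):
    """Cosine similarity of two equal-length numeric lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(count_vector(tok1), count_vector(tok2)))  # 0.0
```

After stop-word removal the two jokes no longer share any token, so their similarity is 0.0. The only overlap measured earlier came from the stop words "Hvorfor" and "i".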