### Mini project 5: Working with text

Much of the information about the world reaches us as text. To process, analyse, and generate text automatically, we often need to convert it into numeric data (vectors). The process of transforming data into a set of vectors is called vectorisation. Text vectorisation is an essential prerequisite for modern Natural Language Processing (NLP) and Natural Language Understanding (NLU), including the Generative AI applications that rely on them.

The objectives of this project are:
- understanding the basic concepts and uses of text vectorisation and vector similarity
- gaining experience in implementing methods, algorithms, and libraries for working with text in BI and Python programming.

Your tasks are the following:
1. Collect and load text documents from various sources in one domain, e.g. txt, doc, csv, json, or pdf files, web pages, or data frame attributes.
2. Extract, clean, and transform the text from the sources to prepare it for vectorisation.
3. Vectorise the clean text and store it in a software structure.
4. Create a simple interactive prototype of an application that takes a text from a user and outputs the top three related pieces of text stored earlier, using a vector similarity approach.
5. Optionally, integrate your application with an LLM (large language model) to improve the quality of the language operations.
6. Suggest various implementations of such an application.

#### Environment

In [1]:
import docx
import json
import string

# data structure
import pandas as pd

# cosine similarity (cosimfunc is a helper module that is not shown in this notebook)
from cosimfunc import cosim
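
The `cosimfunc` module itself is not listed here. As a rough sketch of what such a helper might compute, assuming it implements the standard cosine similarity $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$, a pure-Python version could look like this. The name `cosim_sketch` is hypothetical, and returning 0.0 for an all-zero vector is an assumption rather than known behaviour of `cosim`.

```python
import math

def cosim_sketch(a, b):
    """Cosine similarity of two equal-length numeric sequences (hypothetical stand-in for cosim)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # assumption: treat an all-zero vector as having no similarity instead of dividing by zero
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# two binary vectors that share one of their two active positions
similarity = cosim_sketch([1, 1, 0], [0, 1, 1])
print(similarity)
```

Because the function only iterates over its arguments, it should accept plain lists as well as rows of a NumPy array.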


#### Load data

In [2]:
# function to read the text of all paragraphs from a docx file
def read_docx(name):
    doc = docx.Document(name)
    tekst = "\n".join(paragraf.text for paragraf in doc.paragraphs)
    return tekst

In [3]:
# function to load the parsed content of a json file
def read_json(name):
    with open(name, "r", encoding="utf-8") as json_fil:
        data = json.load(json_fil)
    return data

In [4]:
# read data
dad_jokes = read_docx("farjokes.docx")
dog_jokes = read_json("hunde_jokes.json")

#### Clean data

In [5]:
# collect all jokes
all_jokes = dad_jokes.split('\n') + [joke['joke'] for joke in dog_jokes['hunde_jokes']]

In [6]:
# function to lowercase text and remove punctuation
def clean(text):
    text = text.lower()

    # string.punctuation is !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text
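
Note that `string.punctuation` covers ASCII characters only, so typographic characters that often appear in Word documents (curly quotes, en dashes) would pass through the cleaning step. A possible alternative, sketched here under the assumption that all Unicode punctuation should be dropped, filters on the Unicode category instead. The name `clean_unicode` is illustrative; unlike `string.punctuation`, this variant leaves symbols such as `$` or `+` in place because they are classified as symbols, not punctuation.

```python
import unicodedata

def clean_unicode(text):
    """Lowercase text and drop every character whose Unicode category is punctuation."""
    text = text.lower()
    # categories starting with "P" cover ASCII punctuation as well as curly quotes and dashes
    kept = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    # split/join collapses the runs of whitespace left behind by removed characters
    return " ".join(kept.split())

# made-up variant of the first joke with typographic punctuation added
cleaned = clean_unicode("Hvorfor gik faren ind i butikken – med et ur? ”Han ville købe tid!”")
print(cleaned)
```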

In [7]:
all_jokes = [clean(joke) for joke in all_jokes]
all_jokes

['hvorfor gik faren ind i butikken med et ur han ville købe tid',
 'hvad kalder man det når ens far forsøger at være morsom farligt',
 'hvorfor er fars vittigheder som tidsmaskiner fordi de får tiden til at føles langsommere',
 'hvordan får man en far til at stoppe med at grine fortæl ham en dårlig vittighed',
 'hvorfor kan faren lide at se film på sit skur han elsker at se dem i sømfuld opløsning',
 'hvorfor laver faren altid dårlige vittigheder om bøger fordi bøger har mange sider',
 'hvordan får man en far til at grine på lørdag fortæl ham en vittighed på onsdag',
 'hvordan reparerer du en ødelagt farvittighed med fartape',
 'hvad sagde farretomaten til babytomaten ketchup min dreng',
 'hvad kalder man en gruppe af fars vittigheder en farqueue',
 'hvorfor kan ikke hunde spille poker i afrika fordi der er for mange cheetahs',
 'hvad siger en hund når den bestiller på en restaurant jeg vil have noget der er velhængt',
 'hvorfor elsker hunden at stikke hovedet ud af bilvinduet det er d

### Bag of words

In [8]:
# split into words
for joke in all_jokes:
    words = joke.split()
    print(words)

['hvorfor', 'gik', 'faren', 'ind', 'i', 'butikken', 'med', 'et', 'ur', 'han', 'ville', 'købe', 'tid']
['hvad', 'kalder', 'man', 'det', 'når', 'ens', 'far', 'forsøger', 'at', 'være', 'morsom', 'farligt']
['hvorfor', 'er', 'fars', 'vittigheder', 'som', 'tidsmaskiner', 'fordi', 'de', 'får', 'tiden', 'til', 'at', 'føles', 'langsommere']
['hvordan', 'får', 'man', 'en', 'far', 'til', 'at', 'stoppe', 'med', 'at', 'grine', 'fortæl', 'ham', 'en', 'dårlig', 'vittighed']
['hvorfor', 'kan', 'faren', 'lide', 'at', 'se', 'film', 'på', 'sit', 'skur', 'han', 'elsker', 'at', 'se', 'dem', 'i', 'sømfuld', 'opløsning']
['hvorfor', 'laver', 'faren', 'altid', 'dårlige', 'vittigheder', 'om', 'bøger', 'fordi', 'bøger', 'har', 'mange', 'sider']
['hvordan', 'får', 'man', 'en', 'far', 'til', 'at', 'grine', 'på', 'lørdag', 'fortæl', 'ham', 'en', 'vittighed', 'på', 'onsdag']
['hvordan', 'reparerer', 'du', 'en', 'ødelagt', 'farvittighed', 'med', 'fartape']
['hvad', 'sagde', 'farretomaten', 'til', 'babytomaten', 'ke
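
Splitting gives the tokens of each joke; a bag-of-words representation usually also records how often each token occurs, ignoring word order. A minimal sketch with `collections.Counter`, applied to one of the cleaned jokes above (the function name `bag_of_words` is illustrative):

```python
from collections import Counter

def bag_of_words(sentence):
    """Map each token of a cleaned sentence to the number of times it occurs."""
    return Counter(sentence.split())

bow = bag_of_words(
    "hvordan får man en far til at stoppe med at grine fortæl ham en dårlig vittighed"
)
# tokens that occur more than once in this joke
print({word: count for word, count in bow.items() if count > 1})
```

A `Counter` returns 0 for tokens it has never seen, which is convenient when looking up words that do not occur in a given sentence.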

#### Corpus of Terms

In [9]:
# Set
corpus = set()

# add unique words to the set
for joke in all_jokes:
    corpus.update(joke.split())

# safeguard only: split() without arguments never yields empty strings
corpus.discard('')
print(corpus)

{'ur', 'stoppe', 'skur', 'farqueue', 'er', 'mange', 'der', 'dens', 'bide', 'at', 'i', 'måde', 'ham', 'om', 'babytomaten', 'dårlige', 'afrika', 'holde', 'restaurant', 'nye', 'for', 'tandbørste', 'af', 'ekspert', 'farligt', 'hund', 'har', 'elsker', 'og', 'fordi', 'fars', 'hvordan', 'hunde', 'skolen', 'ind', 'man', 'hamlet', 'lørdag', 'butikken', 'bestiller', 'du', 'angrebet', 'trombone', 'morsom', 'langsommere', 'min', 'hovedet', 'råber', 'forsøger', 'far', 'gøgning', 'ødelagt', 'kan', 'cheetahs', 'en', 'spille', 'opløsning', 'grine', 'gruppe', 'bilvinduet', 'hvis', 'noget', 'ud', 'finder', 'op', 'farvittighed', 'ikke', 'kunne', 'stol', 'sit', 'bjeffe', 'dårlig', 'foretrukne', 'siger', 'vittighed', 'velhængt', 'deltage', 'tidsmaskiner', 'ville', 'dem', 'fartape', 'købe', 'poker', 'stikke', 'handler', 'laver', 'lide', 'onsdag', 'får', 'hvorfor', 'bøger', 'farretomaten', 'væk', 'et', 'vittigheder', 'jeg', 'kat', 'film', 'have', 'til', 'ens', 'han', 'reparerer', 'føles', 'klatrer', 'på', 's

In [10]:
# corpus size
n = len(corpus)
n

149
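
A Python set has no guaranteed ordering, and string hashing is randomised per process by default, so the order of words taken from `corpus` can differ between runs. If reproducible column positions matter, one option is to sort the vocabulary and keep a word-to-index mapping. The sketch below assumes already cleaned sentences; `build_vocabulary` and `word_index` are illustrative names, not part of the notebook.

```python
def build_vocabulary(sentences):
    """Return the sorted list of unique tokens and a token -> column index mapping."""
    unique_words = set()
    for sentence in sentences:
        unique_words.update(sentence.split())
    # sorting fixes the column order regardless of set iteration order
    vocabulary = sorted(unique_words)
    word_index = {word: i for i, word in enumerate(vocabulary)}
    return vocabulary, word_index

# two of the cleaned jokes from above
sample = [
    "hvorfor gik faren ind i butikken med et ur han ville købe tid",
    "hvordan reparerer du en ødelagt farvittighed med fartape",
]
vocabulary, word_index = build_vocabulary(sample)
print(len(vocabulary), vocabulary[:3])
```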

### Method 1: Binary vectorisation 

Check whether a word appears in a sentence (document).
- write 1 if it does, 0 if it does not

Store the findings in a dictionary (key-value pairs).
- the word is the key, its appearance (1 or 0) is the value


In [11]:
# Binary vector of word appearance in a sentence
def vect(sent, corpus):
    # create new dict with every corpus word set to 0
    mydict = dict.fromkeys(corpus, 0)

    # code each word's appearance in the sentence with 1;
    # skip words outside the corpus so the vector keeps a fixed length
    for word in sent.split():
        if word in mydict:
            mydict[word] = 1
    return mydict

# binarise each sentence
binary_vectors = [vect(sentence, corpus) for sentence in all_jokes]

In [12]:
binary_vectors

[{'ur': 1,
  'stoppe': 0,
  'skur': 0,
  'farqueue': 0,
  'er': 0,
  'mange': 0,
  'der': 0,
  'dens': 0,
  'bide': 0,
  'at': 0,
  'i': 1,
  'måde': 0,
  'ham': 0,
  'om': 0,
  'babytomaten': 0,
  'dårlige': 0,
  'afrika': 0,
  'holde': 0,
  'restaurant': 0,
  'nye': 0,
  'for': 0,
  'tandbørste': 0,
  'af': 0,
  'ekspert': 0,
  'farligt': 0,
  'hund': 0,
  'har': 0,
  'elsker': 0,
  'og': 0,
  'fordi': 0,
  'fars': 0,
  'hvordan': 0,
  'hunde': 0,
  'skolen': 0,
  'ind': 1,
  'man': 0,
  'hamlet': 0,
  'lørdag': 0,
  'butikken': 1,
  'bestiller': 0,
  'du': 0,
  'angrebet': 0,
  'trombone': 0,
  'morsom': 0,
  'langsommere': 0,
  'min': 0,
  'hovedet': 0,
  'råber': 0,
  'forsøger': 0,
  'far': 0,
  'gøgning': 0,
  'ødelagt': 0,
  'kan': 0,
  'cheetahs': 0,
  'en': 0,
  'spille': 0,
  'opløsning': 0,
  'grine': 0,
  'gruppe': 0,
  'bilvinduet': 0,
  'hvis': 0,
  'noget': 0,
  'ud': 0,
  'finder': 0,
  'op': 0,
  'farvittighed': 0,
  'ikke': 0,
  'kunne': 0,
  'stol': 0,
  'sit': 0,
 

In [13]:
# convert from set to list, so it can be used as columns
corpus_list = list(corpus)

# create data frame
df = pd.DataFrame(binary_vectors, columns=corpus_list)
df

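|   | ur | stoppe | skur | farqueue | er | mange | der | dens | bide | at | ... | med | faneblade | yndlingsinstrument | grave | var | sagde | de | shakespeareskuespil | dreng | ketchup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 9 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |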


In [14]:
ar = df.to_numpy() 

In [19]:
# Test the similarity
# how similar are rows 10 and 11 (0-based, the first two dog jokes)?
cosim(ar[10], ar[11])

Cosine similarity:  0.1336306209562122


0.1336306209562122
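
Task 4 asks for the top three stored texts related to a user's input. One way this could be assembled from binary vectors and cosine similarity is sketched below: the query is binarised against the same vocabulary as the stored jokes, compared with every stored vector, and the highest-scoring jokes are returned. This is a self-contained sketch, not the notebook's own implementation. The helper names (`binary_vector`, `cosine`, `top_matches`) are hypothetical, the query is assumed to be cleaned already, and query words outside the vocabulary are ignored by assumption.

```python
import math

def binary_vector(sentence, vocabulary):
    """1 for each vocabulary word present in the sentence, else 0."""
    words = set(sentence.split())
    return [1 if word in words else 0 for word in vocabulary]

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def top_matches(query, documents, k=3):
    """Return the k (score, document) pairs most similar to the query."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    query_vector = binary_vector(query, vocabulary)
    scored = [
        (cosine(query_vector, binary_vector(doc, vocabulary)), doc)
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# four of the cleaned jokes from above
jokes = [
    "hvorfor gik faren ind i butikken med et ur han ville købe tid",
    "hvorfor laver faren altid dårlige vittigheder om bøger fordi bøger har mange sider",
    "hvordan reparerer du en ødelagt farvittighed med fartape",
    "hvad kalder man en gruppe af fars vittigheder en farqueue",
]
results = top_matches("hvorfor laver faren dårlige vittigheder", jokes)
for score, joke in results:
    print(round(score, 3), joke)
```

In an interactive prototype the query string would come from `input()` and pass through the same cleaning function as the stored jokes before being vectorised.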