# Bigram matrix count
This notebook identifies the unique words in Shakespeare's works and counts every bigram present in the text corpus.

In [1]:
import numpy as np
import pandas as pd
import re

The following function tokenizes an input sentence. Symbols are removed, the sentence is converted to lowercase, and runs of whitespace are collapsed into a single space. The processed text is returned as a list of words, padded with special tokens for the start (`s1`) and end (`e1`) of the sentence; for an n-gram model, `n-1` tokens of each kind are added.

In [2]:
def tokenize_text(text, n):
    # eliminate symbols on input text
    sims = "!\"#$%&()*+-.,'/:;<=>?@[\\]^_`{|}~\n\t"
    for si in sims:
        text = text.replace(si, '')

    # lower text
    text = text.lower()

    # replace multiple spaces by single
    _RE_COMBINE_WHITESPACE = re.compile(r"\s+")
    text = _RE_COMBINE_WHITESPACE.sub(" ", text).strip()

    # insert start and end tokens
    split = text.split(' ')
    st = [f's{i+1}' for i in range(n-1)]
    en = [f'e{i+1}' for i in range(n-1)]
    split = st + split + en

    return split
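
To make the effect of `n` concrete, here is a minimal, self-contained sketch of the same steps. `tokenize_sketch` is a hypothetical stand-in written for illustration: it uses `str.translate` with `string.punctuation` instead of the replacement loop above, so treat it as an approximation of `tokenize_text()` rather than a replacement for it.

```python
import re
import string

# Hypothetical stand-in for tokenize_text(), for illustration only.
# Deletes the ASCII punctuation characters plus newlines and tabs.
_DROP = str.maketrans('', '', string.punctuation + '\n\t')

def tokenize_sketch(text, n):
    # delete symbols, lowercase, collapse runs of whitespace
    text = text.translate(_DROP).lower()
    text = re.sub(r'\s+', ' ', text).strip()
    words = text.split(' ')
    # n-1 start tokens and n-1 end tokens pad the sentence
    start = [f's{i+1}' for i in range(n - 1)]
    end = [f'e{i+1}' for i in range(n - 1)]
    return start + words + end

bigram_tokens = tokenize_sketch('To be commenced in strands afar remote.', n=2)
trigram_tokens = tokenize_sketch('To be,  or not to be!', n=3)
print(bigram_tokens)
print(trigram_tokens)
```

Because symbols are deleted rather than replaced by a space, a hyphenated word such as `short-winded` becomes the single token `shortwinded`.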

The data, which consists of dialogue between characters, is then read. Rows without a line number are dropped, and the first few remaining lines, which are the inputs to the `tokenize_text()` function, are displayed:

In [3]:
data = pd.read_csv('Data/Shakespeare_data.csv')

# keep only rows that have a line number
valid = data['PlayerLinenumber'].notna().values
lines = data['PlayerLine'].values[valid]
lines[:5]

array(['So shaken as we are, so wan with care,',
       'Find we a time for frighted peace to pant,',
       'And breathe short-winded accents of new broils',
       'To be commenced in strands afar remote.',
       'No more the thirsty entrance of this soil'], dtype=object)

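All bigrams that occur in the data are counted. The vocabulary is built first, and each count is stored in a square matrix whose rows index the first word of a bigram and whose columns index the second: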

In [5]:
# create list with all words with repetition
words = []
for li in lines:
    tokens = tokenize_text(li,n=2)
    words += tokens
# vocabulary (sorted unique tokens) and the number of occurrences of each
unique_words, count = np.unique(words, return_counts=True)

words = None # free memory

# create a word-id dictionary
word_id = {}
for i, wi in enumerate(unique_words):
    word_id[wi] = i

# matrix generation
Cmatrix = np.zeros(shape=(len(count),len(count)), dtype=np.int32)
for li in lines:
    tokens = tokenize_text(li, n=2)
    for i in range(0,len(tokens)-1):
        t1 = tokens[i]
        t2 = tokens[i+1]
        Cmatrix[word_id[t1],word_id[t2]] += 1
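
The counting logic can be checked by hand on a toy corpus. The sketch below is an illustration, not the notebook's method: the token lists and the names `toy_sentences`, `toy_word_id` and `pair_counts` are made up, and a `collections.Counter` keyed by index pairs is swapped in for the dense `Cmatrix`. The dense matrix has one cell per ordered word pair, so its size grows with the square of the vocabulary, whereas the counter stores only the pairs that actually occur.

```python
from collections import Counter

# Toy token lists, shaped like the output of tokenize_text(..., n=2).
toy_sentences = [
    ['s1', 'to', 'be', 'or', 'not', 'to', 'be', 'e1'],
    ['s1', 'to', 'be', 'e1'],
]

# Sorted vocabulary and word -> index mapping, mirroring np.unique above.
vocab = sorted({tok for sent in toy_sentences for tok in sent})
toy_word_id = {w: i for i, w in enumerate(vocab)}

# Count each (first word, second word) pair; only observed pairs are stored.
pair_counts = Counter()
for sent in toy_sentences:
    for t1, t2 in zip(sent, sent[1:]):
        pair_counts[(toy_word_id[t1], toy_word_id[t2])] += 1

print(vocab)
print(pair_counts[(toy_word_id['to'], toy_word_id['be'])])
```

Here the pair (`to`, `be`) occurs three times across the two sentences, and `pair_counts[(i, j)]` plays the role of `Cmatrix[i, j]`.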

The data needed to implement the bigram system is written to .csv files. First, a word-id dictionary is written, which maps each word in the corpus to its index in the matrix:

In [6]:
# save word-id dictionary
ids = list(word_id.values())

data_dicc = pd.DataFrame({'id': ids,
                          'word': list(word_id.keys())})
data_dicc.to_csv('word_id.csv', index=False)
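
A consumer of `word_id.csv` would typically rebuild the lookup in both directions. The following sketch assumes the two-column layout (`id`, `word`) written above; it reads from an in-memory string instead of the file, and the sample rows and the names `word_to_id` and `id_to_word` are made up for illustration.

```python
import csv
import io

# Made-up sample with the same header as word_id.csv.
sample = io.StringIO("id,word\n0,be\n1,e1\n2,s1\n3,to\n")

word_to_id = {}
id_to_word = {}
for row in csv.DictReader(sample):
    idx = int(row['id'])
    word_to_id[row['word']] = idx   # word -> matrix index
    id_to_word[idx] = row['word']   # matrix index -> word

print(word_to_id['to'], id_to_word[0])
```

If the file is read with `pd.read_csv` instead, note that by default words such as `null` or `nan` are interpreted as missing values; passing `keep_default_na=False` avoids this.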

Subsequently, the bigram count matrix is written. Because the matrix is sparse, only the non-zero counts are saved:

In [7]:
# encode the non-zero entries of each row as "column:count" pairs
aux_counts = []
for row in Cmatrix:
    cols = np.nonzero(row)[0]
    aux_counts.append(','.join(f'{i}:{row[i]}' for i in cols))

# sparse representation of Cmatrix
data_sparse = pd.DataFrame({'id': ids,
                            'counts': aux_counts})
data_sparse.to_csv('CMatrix.csv', index=False)
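
To read the sparse rows back, each `counts` field has to be split into its `column:count` pairs. The sketch below assumes the layout written above and, as one possible use, turns a row of counts into relative frequencies. The sample rows and the helper `parse_counts` are hypothetical, and an in-memory string stands in for `CMatrix.csv`.

```python
import csv
import io

# Made-up sample in the layout of CMatrix.csv: the "counts" field holds
# comma-separated "column:count" pairs and is quoted when it contains commas.
sample = io.StringIO('id,counts\n0,"1:2,3:1"\n1,\n2,0:3\n')

def parse_counts(field):
    # an empty field means the row has no non-zero counts
    if not field:
        return {}
    return {int(col): int(cnt)
            for col, cnt in (pair.split(':') for pair in field.split(','))}

rows = {int(r['id']): parse_counts(r['counts']) for r in csv.DictReader(sample)}

# Relative frequency of each successor of word 0: count / row total.
total = sum(rows[0].values())
probs = {col: cnt / total for col, cnt in rows[0].items()}
print(rows)
print(probs)
```

Rows with an empty `counts` field, such as the row for a word that is never followed by another token, parse to an empty dictionary.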