# Parsing Text Using Computer Algorithms

This Jupyter notebook uses a Python 3 runtime environment. Here, we showcase a few examples of how to parse a given sample of text to obtain its structural components, such as sentences and words. The aim of this basic exercise is to hint at the fact that different types of tokenizers can be used in analyzing accessible corpora, using parallel processing and cloud computing.

## Initial steps
We first install and import the required libraries.

- `re` is for Regular Expression based magical things

- `collections` is for a method called `Counter`

- `nltk` and its modules are for one style of text processing

- `tf_text` and `keras` related functions are for another style of text processing

- `pprint` because it is nicer and prettier than the default `print` function

    - https://realpython.com/python-pretty-print/

In [None]:
pip install -q "tensorflow-text>=2.11"

In [None]:
import re
import collections

import nltk
from nltk.tokenize import word_tokenize as nltk_word_tokenize
from nltk.tokenize import sent_tokenize as nltk_sent_tokenize

import tensorflow_text as tf_text
from keras.preprocessing.text import text_to_word_sequence as keras_spliter

from pprint import pprint

Then we load the given sample of text, which is converted to lowercase letters for convenience.

In [None]:
with open("./sample.txt") as input_file:
  text_sample = input_file.read().lower()

The current form of `text_sample` looks like this:


In [None]:
print(type(text_sample))
print('\n---\n')
print(text_sample)


## Python Built-in String Methods

The most ordinary approach is to split a given string on every new line, or on a specified delimiter.

Let's use

- [*splitlines()*](https://docs.python.org/3/library/stdtypes.html#str.splitlines) for obtaining each line of `text_sample` as a separate substring, to see what it looks like.

- [*split()*](https://docs.python.org/3/library/stdtypes.html#str.split) for obtaining each substring within `text_sample` that is separated by white space, to see if it gives us a list of neatly distinguishable words.

In [None]:
text_splitup_on_newlines = text_sample.splitlines()

print(type(text_splitup_on_newlines))
print('\n---\n')
pprint(text_splitup_on_newlines)

In [None]:
text_splitup_on_whitespaces = text_sample.split()

print(type(text_splitup_on_whitespaces))
print('\n---\n')
pprint(text_splitup_on_whitespaces)

In the above code-blocks we see that:

- `splitlines()` does not strip the leading and trailing white spaces,

- and `split()` produces substrings that include punctuation.
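Both observations can be worked around with built-in string methods alone. The following is a minimal sketch, assuming a short made-up string (`demo_text`) instead of the actual `sample.txt`; it strips white space from each line and strips punctuation from both ends of each whitespace-separated token.

```python
import string

# Hypothetical two-line sample, used only for illustration.
demo_text = "  Hello, world!  \nIt's a test-case.  "

# splitlines() keeps leading and trailing spaces, so strip each line explicitly.
clean_lines = [line.strip() for line in demo_text.splitlines()]

# split() keeps punctuation attached, so strip it from both ends of each token.
clean_words = [token.strip(string.punctuation) for token in demo_text.split()]

# Drop tokens that consisted of punctuation only.
clean_words = [word for word in clean_words if word]

print(clean_lines)
print(clean_words)
```

Note that `str.strip()` only touches the ends of a token, so an internal apostrophe or hyphen (as in "It's" or "test-case") is left alone.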

## Regular Expression
Using a [regular expression](https://en.wikipedia.org/wiki/Regular_expression) is a very powerful and remarkably quick method for finding substrings that match a given pattern within a corpus.

In [None]:
words = re.findall(r"\w+", text_sample)
pprint(words)

In [None]:
# Frequency of each word's occurrence
tally = collections.Counter()

for word in words:
  tally[word] += 1

pprint(dict(tally.items()))
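As a side note, `collections.Counter` can also be built directly from a list, and its `most_common()` method returns the entries sorted by count. The sketch below assumes a small made-up sentence (`demo_text`) rather than the notebook's sample file.

```python
import collections
import re

# Hypothetical sentence, used only for illustration.
demo_text = "the cat sat on the mat and the dog sat too"
demo_words = re.findall(r"\w+", demo_text)

# Passing the list to Counter tallies every word in one step.
demo_tally = collections.Counter(demo_words)

# The two most frequent words, as (word, count) pairs.
top_two = demo_tally.most_common(2)
print(top_two)
```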

### More About RegEx

Here are a few useful online regex builders:

- [RegexTester](https://extendsclass.com/regex-tester.html) for multiple programming languages with a flowchart visualization of the search pattern.

- [RegEx101](https://regex101.com/) for multiple programming languages.

- [Regexr](https://regexr.com/) for only JavaScript and PCRE.

- [Pythex](https://pythex.org/) for only Python.

As a hint:

- This expression is for finding only numbers and numeric values, in ordinary as well as scientific notation, but it does not handle hexadecimal notation. The exponent part is optional, so that plain decimals such as `3.14` are matched whole

  `([+-]?(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?|[+-]?\d+(?:[eE][+-]?\d+)?)`

  

- This one finds words, including those with an internal hyphen, apostrophe, or period (the character class also admits a backslash)

  `(\w*[\\.\\'-]?\w+)`

- And this one is for finding only punctuation marks

  ```([!"#$%&'()*+,./:;<=>?@\^_`{|}~-])```
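To see how such patterns behave in practice, here is a small sketch using `re.findall()`. The names `NUMBER`, `WORD`, and `PUNCT` and the three demo strings are assumptions made for illustration. `NUMBER` is a variant of the numeric pattern with an optional exponent, `WORD` is a simplified letters-only pattern that allows internal hyphens and apostrophes, and `PUNCT` reuses the punctuation character class listed above.

```python
import re

# Numbers in ordinary or scientific notation (exponent is optional).
NUMBER = re.compile(r"[+-]?(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?|[+-]?\d+(?:[eE][+-]?\d+)?")

# Lowercase words, optionally joined by hyphens or apostrophes.
WORD = re.compile(r"[a-z]+(?:['-][a-z]+)*")

# Single punctuation characters.
PUNCT = re.compile(r"""[!"#$%&'()*+,./:;<=>?@\^_`{|}~-]""")

numbers_found = NUMBER.findall("values: 42, -3.14, .5, 6.02e23, 1E-9")
words_found = WORD.findall("don't stop the well-known state-of-the-art show")
punct_found = PUNCT.findall("Wait... what?! (really)")

print(numbers_found)
print(words_found)
print(punct_found)
```

Each pattern is tried on its own string here, because a letters-only pattern would otherwise also pick up the `e` inside a number such as `6.02e23`.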



Of course, there are endless [memes about regex.](https://programmerhumor.io/memes/regex/) Here is one of my favorites:

<br>
<p align="center">
<img src="https://programmerhumor.io/wp-content/uploads/2023/01/programmerhumor-io-programming-memes-2aaa80d1d953d9a.png" alt="The ultimate flex" height="400">
</p>

## NLTK
Let us now look at an application of the [NLTK](https://www.nltk.org/) library.

We will need to download the Punkt sentence tokenizer models for the NLTK tokenizers, as discussed here - https://www.nltk.org/api/nltk.tokenize.punkt.html

In [None]:
# Downloads and installs the Punkt tokenizer models
# (newer NLTK releases may ask for 'punkt_tab' instead)

nltk.download('punkt')

We can now parse the given text sample and compare the output of each type of tokenizer.

In [None]:
sentences = nltk_sent_tokenize(text_sample)
pprint(sentences)

In [None]:
tokens = nltk_word_tokenize(text_sample)
pprint(tokens)

In [None]:
# Frequency of each token's occurrence
# nltk.probability.FreqDist is used in the same way as collections.Counter

counts = nltk.probability.FreqDist()

for token in tokens:
  counts[token] += 1

pprint(dict(counts.items()))

Upon running the above code-cells for `nltk`, we can see how the given sample of text gets partitioned into sentences, and into single units of words and punctuation marks called tokens.

An algorithm such as `nltk_sent_tokenize()` or `nltk_word_tokenize()` parses lines of text in a way that isn't exactly how a human being would visually segregate those lines. However, the algorithm's similarity to how a human being would identify parts of the given text as sentences or words is sufficient for parsing a large corpus, which would otherwise be too laborious for a human being to go through manually.

For accurately identifying sentence and word boundaries, the tokenizers in NLTK can be trained on a sample text, with respect to expected results, to better handle the type of punctuation used in a large corpus.
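To illustrate why sentence boundaries are harder than they look, here is a deliberately naive splitter written with `re.split()`. It is not NLTK's algorithm; it is only a sketch, with a made-up `demo` string, that breaks the text after every period, question mark, or exclamation mark followed by white space.

```python
import re

# Hypothetical text with abbreviations, used only for illustration.
demo = "Dr. Smith arrived at 5 p.m. on Monday. She left soon after."

# Split on white space that directly follows '.', '!' or '?'.
naive_sentences = re.split(r"(?<=[.!?])\s+", demo)

for piece in naive_sentences:
    print(repr(piece))
```

A human reader sees two sentences, while the naive rule produces four pieces, because it treats the periods in "Dr." and "p.m." as sentence endings. Handling such abbreviations is the kind of thing a trained sentence tokenizer is meant to do better.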

## TensorFlow and Keras

>**Important note:** On 2023-10-09, I noticed that the entire module of `tf.keras.preprocessing` is slated for deprecation. At that time, I could not find an explanation on the TensorFlow and Keras websites of what needs to be used instead. However, my guess is that people will simply continue to use RegEx, [SciKit-learn](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) and other such libraries, for customized methods of preprocessing their data for machine learning related explorations and applications.

Let us now look at the task of obtaining word frequency by using algorithms from programming libraries called [Tensorflow](https://www.tensorflow.org/) and [Keras.](https://www.tensorflow.org/guide/keras)

For convenience, we will read the given "sample.txt" file, line-by-line, into a list of strings.

In [None]:
with open("./sample.txt") as input_file:
  dataset = input_file.readlines()

The current form of `dataset` is as follows:

In [None]:
print(type(dataset))
print('\n---\n')
pprint(dataset)

Let's clean up the `dataset` a bit by converting each line to lowercase, and by removing leading as well as trailing white spaces.

In [None]:
lowercase_data = []
for line in dataset:
  lowercase_data.append(line.lower().strip())

pprint(lowercase_data)

### tensorflow_text.WhitespaceTokenizer

Let us obtain a list of tokens that have a word boundary of white spaces.

We will eventually see that it is no better than,

- `text_splitup_on_whitespaces = text_sample.split()`

- `words = re.findall(r"\w+", text_sample)`

In [None]:
# TensorFlow whitespace tokenization example

tf_word_tokenizer = tf_text.WhitespaceTokenizer()

tf_tokens = []
for line in lowercase_data:
  tf_tokens.append(tf_word_tokenizer.tokenize(line))

byte_tokens = []
for tf_token in tf_tokens:
  byte_tokens.append(list(tf_token.numpy()))

pprint(byte_tokens[0:5], indent=4)

### The Keras Tokenizer

So let us now use a tokenizer that is fitted on the given data, and that should be able to better recognize different types of punctuation, in order to produce substrings with a discernible word.

However, just before that, let us see an example of a simple splitter called `text_to_word_sequence`, which is used in Keras and forms an integral part of the `keras.preprocessing.text.Tokenizer` algorithm.

In [None]:
# parse the given `text_sample` with `text_to_word_sequence`
parsed_text = keras_spliter(text_sample)

occurance_rate = collections.Counter()
for item in parsed_text:
  occurance_rate[item] += 1

pprint(dict(occurance_rate.items()))

In [None]:
# build a vocabulary from `dataset` with the Keras `Tokenizer`
# (it lowercases and filters punctuation by default)
from keras.preprocessing.text import Tokenizer as KerasTokenizer

# note: `num_words` limits later text-to-sequence conversion, not `word_index`
keras_tokenizer = KerasTokenizer(num_words=150)
keras_tokenizer.fit_on_texts(dataset)
indexed_words = keras_tokenizer.word_index


In [None]:
indexed_words

When we inspect `indexed_words` we see that `keras_tokenizer` has produced a dictionary of unique words within the given `dataset`, in which each word forms a "key", and an index forms the "value."

So, `keras_tokenizer` does not simply do tokenization, but instead constructs a vocabulary from the given dataset.

>All unique words collected from a corpus arranged in an order, is called the "vocabulary" of the corpus.

Indeed, `keras_tokenizer` does not output a simple list of tokenized words, nor does its `word_index` directly give the frequency of word occurrences in the given dataset (the counts are kept separately, in its `word_counts` attribute). It would have been more useful to name it `keras_vocab_builder` or something like that.


That is because the people who produced the algorithm for `keras_tokenizer` used the subroutine `text_to_word_sequence`, which is based on a simple delimiter such as white space along with a filter for punctuation, for the main purpose of populating an initialized dictionary that could get continuously updated with new words, to produce a final vocabulary constructed across several corpora.

Let us now look at a simple issue with such a vocabulary generating method.

When the initialized `keras_tokenizer` is given new text with arbitrary character strings, it naively updates itself with the new strings.

In [None]:
keras_tokenizer.fit_on_texts(['abrtariry', 'crhacaetr','of','srtnigs','and','constants'])
updated_vocabulary = keras_tokenizer.word_index

In [None]:
updated_vocabulary

Upon inspecting `updated_vocabulary` we see that "abrtariry", "crhacaetr", "srtnigs", as well as "constants" were appended to `indexed_words`.

Supposedly, the words accumulated in `keras_tokenizer`, each time it is made to `fit_on_texts`, are hashable. Those hashes are used as an encoding method for comparing sentences that happen to have similar words. Sentences that are dissimilar can also be compared using mathematical operations, by storing word sequences within each dataset as a "vector", in matrix form. A very large matrix can thus be constructed, using an ever-growing vocabulary and such a procedure that "vectorizes" given sentences.
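One common way to turn that idea into numbers is a bag-of-words count vector, compared with cosine similarity. The sketch below is not how Keras works internally; the helper names `vectorize` and `cosine_similarity`, and the three example sentences, are assumptions made purely for illustration.

```python
import math
import re


def vectorize(sentence, vocab_index):
    """Count how often each vocabulary word occurs in the sentence."""
    vector = [0] * len(vocab_index)
    for word in re.findall(r"\w+", sentence.lower()):
        if word in vocab_index:
            vector[vocab_index[word]] += 1
    return vector


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical sentences, used only for illustration.
sentences = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "stocks fell sharply today",
]

# A sorted vocabulary gives every word a stable column in the matrix.
vocab = sorted({w for s in sentences for w in re.findall(r"\w+", s.lower())})
vocab_index = {word: column for column, word in enumerate(vocab)}

# One row per sentence, one column per vocabulary word.
vectors = [vectorize(s, vocab_index) for s in sentences]

sim_close = cosine_similarity(vectors[0], vectors[1])
sim_far = cosine_similarity(vectors[0], vectors[2])
print(sim_close, sim_far)
```

The first two sentences share most of their words and score close to 1, while the third shares none and scores 0. Every new word in the vocabulary adds another column, which is how the matrix keeps growing.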

The eventual output of `keras_tokenizer` is merely a complicated and inefficient way of doing what Python's built-in datatype called `set` can already achieve. See Python sets:

- https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset

- https://realpython.com/python-sets/

So let us obtain the output of `keras_tokenizer` in a different way...

In [None]:
words = re.findall(r"\w+", text_sample)
vocabulary = set(words)

indexed_vocab = dict(zip(vocabulary, range(len(vocabulary))))

In [None]:
indexed_vocab
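One caveat with a `set` is that its iteration order is arbitrary, so the index assigned to each word above may differ between runs. If a reproducible, frequency-ordered index is wanted (which resembles, but is not claimed to be identical to, what the Keras `word_index` looks like), a sketch using `collections.Counter` could be as follows. The `demo_text` string and the name `freq_index` are assumptions for illustration.

```python
import collections
import re

# Hypothetical sentence, used only for illustration.
demo_text = "the cat sat on the mat and the dog sat too"
demo_words = re.findall(r"\w+", demo_text)

counts = collections.Counter(demo_words)

# most_common() sorts by descending count; ties keep first-seen order.
# Ranks start at 1, so the most frequent word gets index 1.
freq_index = {
    word: rank
    for rank, (word, _count) in enumerate(counts.most_common(), start=1)
}

print(freq_index)
```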

Remember, `keras_tokenizer` is only slightly better at identifying word boundaries within the given text sample, than the rudimentary method of applying `"\w+"` as a search pattern using `re.findall()`.

So now, we can update `indexed_vocab` with a new set of strings using `re`.

In [None]:
# New texts
glitch_text = 'abrtariry crhacaetr of srtnigs and constants'

# The tokenizer
more_words = re.findall(r"\w+", glitch_text)


In [None]:
# Operations for comparing sets
#
# p | q  is  p Union q                    i.e. p OR  q
# q | p  is  commutative with p | q
#
# p & q  is  p Intersection q             i.e. p AND q
# q & p  is  commutative with p & q
#
# p - q  is  p Difference q               i.e. p AND NOT q
# q - p  is  not commutative with p - q
#
# p ^ q  is  p Symmetric Difference q     i.e. p XOR q
# q ^ p  is  commutative with p ^ q

more_words_set = set(more_words)

novel_words = more_words_set - vocabulary

discard_pile = vocabulary & more_words_set

new_vocab = vocabulary | more_words_set


In [None]:
# Let's have a look at results from the comparisons
print(len(more_words_set))
print(type(more_words_set))

print('\n')

print(len(vocabulary))
print(type(vocabulary))

print('\n')

print(len(novel_words))
print(type(novel_words))
print(novel_words)

print('\n')

print(len(discard_pile))
print(type(discard_pile))
print(discard_pile)

print('\n')

print(len(new_vocab))
print(type(new_vocab))

In [None]:
# Method for updating indexed_vocab
incrementer = len(vocabulary)
for element in novel_words:
  indexed_vocab[element] = incrementer
  incrementer += 1

In [None]:
# The required output
print(len(indexed_vocab))
print(type(indexed_vocab))

print('\n')

indexed_vocab
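The update steps above can also be wrapped in a small reusable helper. This is only a sketch: `extend_index` is a hypothetical name, and it assumes that the existing index values run from 0 to `len(index) - 1` without gaps, as they do for `indexed_vocab`.

```python
def extend_index(index, new_words):
    """Add unseen words to `index`, numbering them after the existing entries."""
    next_id = len(index)
    # Sorting the novel words makes the assigned numbers reproducible.
    for word in sorted(set(new_words) - set(index)):
        index[word] = next_id
        next_id += 1
    return index


# Hypothetical starting index, used only for illustration.
demo_index = {"and": 0, "of": 1}
extend_index(demo_index, ["of", "srtnigs", "abrtariry", "and"])
print(demo_index)
```

Words that are already present keep their original numbers, and only the novel ones are appended.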

## Discussion and Conclusion

Among all the methods showcased in this notebook, the eventual reality is that packaged algorithms like TensorFlow and Keras use computer programming techniques built into the datatypes of programming languages like C++ and Python. Or, those packages happen to use routines from bare-bones, free and open-source packages, such as an implementation of RegEx.

The benefit of packaged algorithms from communities that use R-Studio, Matlab, Mathematica, Wolfram, Pandas, OpenCV, Scikit-Learn, etc. is that some of the mathematical complexities of manually building the latest algorithms found in journal papers for statistical analysis and machine learning, are hidden away and already taken care of, by other developers. The unfortunate thing is that many of the nuanced and critical assumptions that go into building, tuning, and utilizing those algorithms, also get obfuscated due to "high level abstractions" incorporated into those Application Programming Interfaces (APIs). The obfuscated parameters and pitfalls eventually lead to silent failures, and polluted or corrupted databases of trained models.

For example, a vocabulary built using `keras_tokenizer` can get corrupted when it accidentally encodes entries from a list of words like `["model", "model.", "models", "models.", "models'", "model's", "model's."]` as different words, while its tokenizer fails silently. The encoding and hashing procedures implemented using that faulty vocabulary eventually produce vectors within a "noisy" vector space, where many incorrectly identified strings get encoded and hashed as a "new word" within the increasingly polluted vector space.
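As a rough check of that concern, the sketch below imitates in plain Python what the Keras documentation describes as the default behaviour of `text_to_word_sequence`: lowercase the text, replace a fixed set of filter characters with spaces, and split on white space. The filter string is copied from the documented default, which does not include the apostrophe; `simple_word_sequence` is a hypothetical helper and not Keras code, so actual results should be confirmed against the library itself.

```python
# Assumption: documented default `filters` of Keras' text_to_word_sequence.
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=KERAS_DEFAULT_FILTERS):
    """Lowercase, turn every filter character into a space, then split."""
    table = str.maketrans({character: " " for character in filters})
    return text.lower().translate(table).split()


variants = ["model", "model.", "models", "models.", "models'", "model's", "model's."]

# Collect the distinct tokens that survive the filtering.
distinct_tokens = sorted(
    {token for variant in variants for token in simple_word_sequence(variant)}
)
print(distinct_tokens)
```

Under these assumptions, the trailing periods are removed, but the apostrophe forms survive, so the seven variants still collapse into four separate vocabulary entries rather than one or two. No warning is raised at any point.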

Consequently, much greater volumes of training data, along with much more laborious test phases go into producing a usable "Large Language Model", after wasting huge amounts of computing resources to overcome the noise and pollution incorporated within the model's training phase. The eventual carbon-footprint of training models using wasteful methods is immeasurably large.

Here is another example, from the globally popular, commercial activity of generating synthetic images using prompts:

>This NPR article discusses what currently happens when a commercial AI is asked to create images of African doctors treating white kids - [NPR Article, Goats and Soda, 2023-10-06](https://www.npr.org/sections/goatsandsoda/2023/10/06/1201840678/ai-was-asked-to-create-images-of-black-african-docs-treating-white-kids-howd-it-)
>
> It is entirely likely that much more arbitrary results than the ones shown in the above-mentioned article would be created, if the trained AI is asked to generate "Indian doctor helping white kids"; even though the current social reality is that many medical and academic doctors in the US and the UK are South Asians.
>
>The AI would most probably fail to correctly identify the difference between North American indigenous peoples and Indians from India, because the AI's programmers also couldn't, and still can't, produce appropriate training data for image classification software that includes non-white and non-Eurocentric cultures, and multi-racial groups of people.
>
>How many of the people depicted in such synthetically generated images would be male, female, or non-binary gendered? That merely happens to be yet another problematic bias that has been incorporated into baked AI models 🤦‍♂

The environmental damage produced by tech-giants in their pursuit of capitalizing on the next cutting edge product or service, is almost always brushed aside by most consumers, investors, law makers, and industry specialists. But, soon enough, in the near future, due to accelerated climate change, it will be practically fatal for many island nations of the world, to ignore the negative consequences of frenzied entrepreneurial ventures promulgated by industrialized countries. So, whose responsibility is it to prevent or mitigate that impending eventuality?  

There are already too many people in this world, who are convinced that eventualities like human made and natural disasters, are going to keep increasing in severity as well as frequency; and that in the face of those kinds of horrendous eventualities, "a 'rational' human being" ought to make merry and live up their life unabashedly, to every maximum extreme as possible, while ignoring the destruction caused to the ecology by their relentless pursuit of glory, riches, and power (along with all other forms of personal profits and pleasures), simply because the ensuing destruction contributed by them to the overall tragedy on Earth, in their view, was about to 'inevitably' be caused by 'some other' persons or natural events.

In their worldview, what is the big deal if they happen to be the actual person who has caused, and will only continue to cause, destruction, harms, injuries and damages via parasitism or predation, until their death? Won't someone else or natural events anyway cause much worse and horrifying forms of destruction in the coming future?

It can be observed that the most problematic aspect of their philosophical approach to life is that, those kinds of insatiable and avaricious individuals have been influential board members of corporations like BlackRock, Chevron, and British Petroleum, or even presidents of countries like the US, and the UK, who have continued to have global-scale negative impacts over billions of people.

Their excuse in spreading global-scale destruction to human ecology has always been the same, "The calculated benefits and profits obtainable by them have continuously outweighed the costs and damages incurred by native and indigenous peoples of the world."

That type of influence by Westernized alliances, has unfortunately only brought about a world with more hate, sorrow, bigotry, and prejudice, all of which is evident even within the trained Artificial Intelligence (AI) models created by the subordinates and followers serving those so-called *'Supreme Leaders of The World.'*

So this is where one must deeply meditate on a vision of the future. If there are people on Earth who are hopeful for a cleaner and ecologically sustainable future, with less injustice and corruption prevalent in society due to activities of public and private organizations, will such earnest people be forced to fight to the death against unyielding war mongers who stand firmly opposed to the hope and vision of a greener future?

Alas, when the bad-faith actors embedded within security, intelligence, and policing agencies use their vast array of resources to analyze the contents of articles among such repositories, they and their algorithms fail to find appropriate context and meaning due to foundational faults within the design and application of their AI based systems. It hardly ever matters to them if they then, abuse persons like myself with the excuse of "needing to be vigilant and cautious through 'warranted' actions." Even when there is plenty of hard evidence that their premeditated and targeted acts of violence and terror, were not at all warranted, and were entirely vile, sadistic, as well as illegitimate, they face no consequences.

Moreover, as they continue to double-down on their treachery against innocent vulnerable people, day-after-day and year-after-year, in order to conduct meticulously coordinated acts of state-sponsored violations and pogroms, no forms of remedies are made available to victims and survivors. Worst of all, no measures are taken to properly halt and deter those kinds of surreptitious and belligerent activities.

By all means, such files made available within a publicly accessible repository, will get flagged, and will be used as an excuse by prejudiced bad-actors for directing advanced forms of cyber-attacks onto my networks and social relationships, if I merely happen to include words like "Bomb", "Explosives", and "AllahuAkhbar", even without any malicious context surrounding those words.

So, why would I then, knowingly include those strings within an example about the use of machine learning in signal detection and analysis?

As you can plainly see, I have become a simple canary in the mine-shaft of science education, that is only able to do so little to warn the general public about the dangerous, and nauseously nefarious activities of utterly bigoted officers from various security, political, commercial, and industrial organizations belonging to certain westernized countries. Their fallacies, racist norms, foundational errors of judgment, and abject ignorance are baked into the organizational culture, operations, and outputs of various public as well as private organizations influenced by them, throughout the world.

The only worthwhile and logical conclusion is:

>Injurious and bad-faith actors have to be removed from seats of power, influence, privilege, and authority, by replacing them with better educated and conscientious people, who can properly rectify the systemic faults and errors that are glaringly evident within every echelon of societal structures available to humankind, using all legally viable means necessary.

## Selected References

- https://www.tensorflow.org/text

- https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/text_to_word_sequence

- https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

- https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/