# Overview
Before text can be passed to a neural network, it needs some processing. Why? Neural networks can't operate on raw text, so we first convert the text into a numerical representation (a vector). These vectors are called embeddings. Other forms of input, such as audio and video, can be embedded as well, but the same embedding model will not work for all of them. Text can be embedded at the word, sentence, or even paragraph level. Here's a rough outline of the steps we'll follow:
1. Tokenize the input text.
2. Assign a token ID to each token.
3. Generate embeddings for the tokens.

Here is a rough picture of the pipeline we're building.
Image source: https://geoffrey-geofe.medium.com/tokenization-vs-embedding-understanding-the-differences-and-their-importance-in-nlp-b62718b5964a
![Tokenizing workflow](../demo-files/tokenizing_workflow.png)
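
To make the three steps concrete before going through them one by one, here is a minimal sketch of the whole pipeline on a toy sentence. It is not the code used in the rest of this notebook: the helper names (`toy_tokenize`, `build_vocab`, `make_embedding_table`) are made up for illustration, and the embeddings are just random numbers standing in for vectors that a real model would learn.

```python
import random
import re

def toy_tokenize(text):
    # Step 1: split on whitespace and a few punctuation marks,
    # keeping the punctuation as separate tokens and dropping blanks.
    parts = re.split(r"([,.?!]|\s)", text)
    return [p for p in parts if p.strip()]

def build_vocab(tokens):
    # Step 2: give every unique token an integer ID (sorted for reproducibility).
    return {token: idx for idx, token in enumerate(sorted(set(tokens)))}

def make_embedding_table(vocab_size, dim, seed=0):
    # Step 3: one vector per token ID. Random values are an assumption here;
    # in practice these vectors are learned during training.
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

text = "The shell fell. The shell burst!"
tokens = toy_tokenize(text)
vocab = build_vocab(tokens)
token_ids = [vocab[t] for t in tokens]
table = make_embedding_table(len(vocab), dim=4)
embeddings = [table[i] for i in token_ids]

print(tokens)     # ['The', 'shell', 'fell', '.', 'The', 'shell', 'burst', '!']
print(token_ids)  # [2, 5, 4, 1, 2, 5, 3, 0]
print(len(embeddings), len(embeddings[0]))  # 8 4
```

Note that both occurrences of `The` map to the same ID and therefore to the same vector: an embedding table is just a lookup from token ID to vector.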


# Tokenizing Input Text

In [1]:
# First, download a piece of text to tokenize. The chosen text is https://www.gutenberg.org/cache/epub/73951/pg73951.txt
import urllib.request
url = "https://www.gutenberg.org/cache/epub/73951/pg73951.txt"
file_path = "demo-files/the-second-shell.txt"
urllib.request.urlretrieve(url, file_path)

('demo-files/the-second-shell.txt', <http.client.HTTPMessage at 0x105edc490>)

In [8]:
# Let's look at the first 500 characters of The Second Shell. I edited the text file to remove the metadata.
with open("../demo-files/the-second-shell.txt") as the_second_shell:
    print(the_second_shell.read(500))

It was two o'clock in the morning of September 5, 1939. For a year
and a half I had been at work on the San Francisco _Times_. I had
come there immediately after finishing my year's course at the army
officers' flying school at San Antonio, on the chance that my work
would lead me into enough tong wars and exciting murder mysteries to
make life interesting.

The morning edition had just been "put to bed" and I was starting out
of the office when the night editor called me to meet a visitor who 


In [3]:
with open('../demo-files/the-second-shell.txt', 'r') as f:
    input_text = f.read()
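
A side note on reading the file: the text starts with an invisible byte order mark (`\ufeff`), which the tokenizer below has to strip by hand. As an alternative sketch, Python's `utf-8-sig` codec drops a leading BOM at read time. The example below uses a small temporary file rather than the real book, so the file contents here are an assumption made purely for demonstration.

```python
import os
import tempfile

# Write a small UTF-8 file that starts with a byte order mark (BOM).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbfIt was two o'clock")
    sample_path = tmp.name

# Plain utf-8 keeps the BOM as the first character of the string.
with open(sample_path, "r", encoding="utf-8") as f:
    plain = f.read()

# utf-8-sig skips the BOM if one is present.
with open(sample_path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()

os.remove(sample_path)

print(repr(plain[:3]))        # '\ufeffIt'
print(repr(without_bom[:3]))  # 'It '
```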


I'm going to build a word-level tokenizer, which means splitting the input at whitespace characters.
I'll also separate punctuation from the words, and that's about all the processing I'm going to do.
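
Before the full tokenizer, here is a small sketch of how `re.split` behaves when the pattern is wrapped in a capturing group: the delimiters themselves are returned as list items, and an empty string appears wherever two delimiters sit next to each other (for example a comma followed by a space). The sample string and the variable names are only for illustration; the stricter `non_blank` filter is an optional variant, not what the tokenizer below does.

```python
import re

sample = "September 5, 1939. For a year\nand a half"

# The capturing group ( ... ) makes re.split keep the delimiters in the result.
parts = re.split(r"([,.?!]|\s)", sample)
print(parts)

# Removing only single spaces leaves '' and '\n' in the list.
only_spaces_removed = [p for p in parts if p != " "]
print(only_spaces_removed)

# A stricter variant: drop every empty or whitespace-only item.
non_blank = [p for p in parts if p.strip()]
print(non_blank)
```

The second list matches the shape of the tokenizer output shown further down, where `''` and `'\n'` still appear as tokens.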

In [7]:
import re
import pickle

def tokenizer(input_text):
    # Split on whitespace or on any of the listed punctuation marks.
    # Because the pattern is wrapped in a capturing group, re.split returns the
    # delimiters as list items too, so punctuation is kept as separate tokens.
    tokenized_text = re.split(r'([,.?!]|\s)', input_text)

    # The resulting list also contains the whitespace delimiters themselves.
    # Drop the single spaces; note that newlines ('\n') and empty strings ('')
    # are left in the list, as the output below shows.
    tokenized_text = [token for token in tokenized_text if token != ' ']

    # The first element of the list starts with '\ufeff', a byte order mark (BOM)
    # left over from the start of the file. Strip it:
    tokenized_text = [token.lstrip('\ufeff') for token in tokenized_text]

    # Save the token list to a file (pickled, so the file is binary despite the .txt name)
    with open("files/tokens.txt", "wb") as file:
        pickle.dump(tokenized_text, file)

    return tokenized_text

print(tokenizer(input_text)[:500])  # print the first 500 tokens

['It', 'was', 'two', "o'clock", 'in', 'the', 'morning', 'of', 'September', '5', ',', '', '1939', '.', '', 'For', 'a', 'year', '\n', 'and', 'a', 'half', 'I', 'had', 'been', 'at', 'work', 'on', 'the', 'San', 'Francisco', '_Times_', '.', '', 'I', 'had', '\n', 'come', 'there', 'immediately', 'after', 'finishing', 'my', "year's", 'course', 'at', 'the', 'army', '\n', "officers'", 'flying', 'school', 'at', 'San', 'Antonio', ',', '', 'on', 'the', 'chance', 'that', 'my', 'work', '\n', 'would', 'lead', 'me', 'into', 'enough', 'tong', 'wars', 'and', 'exciting', 'murder', 'mysteries', 'to', '\n', 'make', 'life', 'interesting', '.', '', '\n', '', '\n', 'The', 'morning', 'edition', 'had', 'just', 'been', '"put', 'to', 'bed"', 'and', 'I', 'was', 'starting', 'out', '\n', 'of', 'the', 'office', 'when', 'the', 'night', 'editor', 'called', 'me', 'to', 'meet', 'a', 'visitor', 'who', 'had', '\n', 'just', 'come', 'in', '.', '', 'The', 'stranger', 'came', 'forward', 'quickly', '.', '', 'Roughly', 'clad', 'in