# Working with text data

We are currently at step 1, data preparation and sampling.

To prepare the input text, we need to split it into individual tokens (words and punctuation marks) so that we can encode them.

Embedding refers to the process of converting data, in this case text, into a vector format.

The purpose is to obtain a data format that neural networks can process.

There are different kinds of embeddings; we will focus on word embeddings, as we want to generate text one word at a time.

Word embeddings can have anywhere from one dimension to thousands. Higher dimensionality can capture more nuanced relationships, but at the cost of computational efficiency.
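
To make the idea of an embedding concrete, here is a minimal sketch of a lookup table that maps each word to a vector. The vocabulary, the dimension of 4, and the random values are all assumptions chosen for illustration; in a real model the vectors are learned during training rather than drawn at random.

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

# Hypothetical toy vocabulary: each word gets an integer ID
vocab = {"hello": 0, "world": 1, "test": 2}
embedding_dim = 4  # illustrative only; real models use far more dimensions

# One vector of `embedding_dim` floats per token ID (random placeholders)
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in vocab
]

def embed(word):
    """Return the vector stored for `word`, found via its token ID."""
    return embedding_table[vocab[word]]

vector = embed("world")
print(len(vector), vector)
```

Increasing `embedding_dim` makes every vector longer, which is where the extra computational cost of higher dimensionality comes from.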

In [5]:
# To practice this, we will use the file the-verdict.txt
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("Total number of character:", len(raw_text))
print(raw_text[:99])

# We want to turn all these characters into tokens that we can embed

# To split the text into tokens, we use the re library
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)
print("Separating words: ",result)

# We also want periods and commas to become separate tokens
result = re.split(r'([,.]|\s)', text)
print("Separating dots and commas: ",result)

# Remove empty strings and whitespace-only items
result = [item for item in result if item.strip()]
print("Removing blank spaces: ",result)

# Whether to remove whitespace depends on the use case: dropping it saves memory, while keeping it may be needed to avoid errors when exact spacing matters.

# Handle a wider set of punctuation marks, including double dashes
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print("With punctuation: ", result)

Total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 
Separating words:  ['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']
Separating dots and commas:  ['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']
Removing blank spaces:  ['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']
With punctuation:  ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
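
The splitting and filtering steps above can be wrapped in one reusable function. This is only a sketch: the name `tokenize` is our own choice, and the pattern is the same one used in the cell above. Passing `raw_text` instead of the sample string would apply it to the whole file.

```python
import re

# Same pattern as above: punctuation marks, double dashes, or whitespace.
# The capturing group makes re.split keep the separators in the result.
TOKEN_PATTERN = re.compile(r'([,.:;?_!"()\']|--|\s)')

def tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    pieces = TOKEN_PATTERN.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]

sample = "Hello, world. Is this-- a test?"
tokens = tokenize(sample)
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Because the pattern sits inside a capturing group, `re.split` returns the separators too, which is why punctuation survives as tokens while the whitespace is filtered out afterwards.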


In [None]:
# Applying this to our whole text