-----
Build a Large Language Model
Sebastian Raschka
-----

# Working with text data

  - During the pretraining stage, LLMs process text one word at a time.

## 2.1 - Understanding word embeddings

  - Converting data into a vector format is referred to as embedding. An embedding is a mapping from discrete objects, such as words, images, or even entire documents, to points in a continuous vector space. The primary purpose of embeddings is to convert nonnumeric data into a format that neural networks can process.
  - In addition to word embeddings, there are also embeddings for sentences, paragraphs, or whole documents.
  - Retrieval-augmented generation combines generation (like producing text) with retrieval (like searching an external knowledge base) to pull relevant information when generating text.
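
  - The idea of an embedding as a lookup from discrete objects to vectors can be sketched as below. This is only an illustration: the toy vocabulary, the dimension of 3, and the `embed` helper are assumptions made for this example, and the vectors are random rather than learned.

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

# Hypothetical toy vocabulary and dimension (assumptions for this sketch)
vocab = ["the", "verdict", "genius"]
embedding_dim = 3

# One vector per word; real models learn these values during training,
# here they are just random numbers
embedding_table = {
    word: [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for word in vocab
}

def embed(words):
    """Map each word to its vector via a table lookup."""
    return [embedding_table[w] for w in words]

vectors = embed(["the", "verdict"])
print(vectors)
```

  - Each word becomes a list of floats of the same length, which is the numeric form a neural network can take as input.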

## 2.2 - Tokenizing text

  - We will tokenize "The Verdict," from https://en.wikisource.org/wiki/The_Verdict


In [2]:
# Download our text
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/refs/heads/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

# Now load our text file
with open(file_path, "r", encoding="utf-8") as file:
    raw_text = file.read()
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


  - Our goal is to tokenize this 20,479-character short story into individual words and special characters that we can then turn into embeddings for LLM training.

In [1]:
# Split a simple example text on whitespace with re.split;
# the capturing group (\s) keeps the separators in the result
import re
text = "Hello, world.  This, is a test."
result = re.split(r'(\s)', text)
print(result)

['Hello,', ' ', 'world.', ' ', '', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


  - Modify the regular expression so that it splits on whitespace (\s), commas, and periods ([,.])
  - Capitalization helps LLMs distinguish between proper nouns and common nouns, so we will refrain from making all text lowercase.

In [3]:
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


  - Remove the empty strings and whitespace-only items

In [4]:
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']
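
  - A small sketch of why the filter is needed: because the pattern uses a capturing group, `re.split` keeps every separator in the result, and it also emits an empty string wherever two separators are adjacent (or a separator ends the text). The variable names below are chosen for this illustration only.

```python
import re

text = "Hello, world.  This, is a test."
pieces = re.split(r'([,.]|\s)', text)

# Empty strings come from adjacent separators, e.g. "," followed by " "
empty = [p for p in pieces if p == ""]
# Whitespace-only items are the separators matched by \s
whitespace = [p for p in pieces if p != "" and p.strip() == ""]
# item.strip() is falsy for both kinds, so one filter removes them all
tokens = [p for p in pieces if p.strip()]

print(len(pieces), len(empty), len(whitespace), len(tokens))
```

  - For this text the 21 pieces break down into 5 empty strings, 6 whitespace items, and the 10 tokens kept above.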


  - We need to modify the example further so that it can also handle other types of punctuation, such as question marks, quotation marks, and the double dashes we saw earlier in the first 99 characters of the story

In [5]:
text = "Hello, world.  Is this -- a test?"
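# Split on punctuation, double dashes, and whitespace
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
# Drop empty strings and whitespace-only items
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']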
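
  - The split-and-filter steps can be collected into one reusable helper, sketched below. The function name `simple_tokenize` is hypothetical and not taken from the book; the pattern splits on the listed punctuation marks, double dashes, and whitespace.

```python
import re

def simple_tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    # The capturing group keeps punctuation and "--" as tokens of their own
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    # Discard empty strings and whitespace-only pieces
    return [p.strip() for p in pieces if p.strip()]

tokens = simple_tokenize("Hello, world.  Is this -- a test?")
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

  - Wrapping the steps in a function makes it straightforward to run the same tokenization on a longer string, such as the `raw_text` loaded earlier.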


  - Apply the basic tokenizer to the story