# Hands-on Session 1 -- Handling strings with Python - Text Preprocessing - State of the Art Tokenization

In [None]:
!pip install datasets
!pip install transformers
!pip install nltk

In this first coding lesson we will look at:
- How to use Python to handle strings.
- How to preprocess text (tokenization/lemmatization) for classical NLP applications.
- The importance of tokenization in modern LLMs and the BPE algorithm.

## 1. Basics of Python Strings

### 1.1 What is a string?
In Python, a string is an ordered sequence of characters used to represent text. Strings are a fundamental data type and are essential for working with text data in natural language processing.

Characteristics of strings in Python:
- __Immutable__: Once a string is created, it cannot be modified.
- __Sequence type__: Strings are sequences (of characters), so we can use indexing and slicing.
- __Unicode support__: Python 3 uses Unicode by default.


In [None]:
#define a string
my_string = "Hello, NLP World!"

my_string_multiline = """ This is a multi
line
string
"""

#print the string
print(my_string)
print(my_string_multiline)


In [None]:
for char in my_string:
    print(char)

In [None]:
# Strings are immutable, so this item assignment raises a TypeError
my_string[2] = "X"
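
Item assignment fails on a string because strings are immutable. A minimal sketch of the usual workaround, which builds a new string from slices instead of modifying the old one (the variable names `greeting` and `patched` are illustrative):

```python
greeting = "Hello, NLP World!"

# Item assignment is not supported by str objects
try:
    greeting[2] = "X"
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# Build a new string from slices instead
patched = greeting[:2] + "X" + greeting[3:]
print(patched)
print(greeting)  # the original string is unchanged
```

The same idea applies to methods such as `replace()` or `upper()`: they return a new string and leave the original untouched.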

#### 1.1.1. Unicode and UTF-8

__What is Unicode?__

Unicode is a universal character encoding standard that assigns a unique code point to every character from virtually all writing systems, symbols, and emojis.
It enables consistent representation and manipulation of text across different platforms and programs.
Unicode supports over 143,000 characters, covering scripts like Latin, Greek, Cyrillic, Arabic, Chinese, and more.

__UTF-8 Encoding__

UTF-8 (Unicode Transformation Format - 8-bit) is a variable-length character encoding for Unicode.
It encodes each Unicode character into one to four bytes:
- 1 byte for standard ASCII characters (U+0000 to U+007F).
- 2 to 4 bytes for characters outside the ASCII range.

Advantages of UTF-8:
- Backward compatibility: UTF-8 is compatible with ASCII, making it ideal for systems originally designed for ASCII.
- Efficiency: It uses fewer bytes for common characters, which is space-efficient for texts dominated by ASCII characters.
- Flexibility: It can represent any Unicode character, accommodating multiple languages and symbols.
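
The one-to-four-byte rule can be checked directly. The sketch below encodes one sample character from each length class (the sample characters are chosen only for illustration):

```python
# One sample character per UTF-8 length class
samples = ["A", "\u00e9", "你", "🚀"]  # "\u00e9" is the letter e with an acute accent

# Map each character to the number of bytes UTF-8 needs for it
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n_bytes in byte_lengths.items():
    print(f"{ch!r} (U+{ord(ch):04X}) -> {n_bytes} byte(s): {ch.encode('utf-8')}")
```

The ASCII letter takes a single byte, while the accented letter, the Chinese character, and the emoji take two, three, and four bytes respectively.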

***References***:

1. [Unicode](https://home.unicode.org/)
2. [UTF-8 Wikipedia](https://it.wikipedia.org/wiki/UTF-8)
3. [UTF-8 Everywhere](https://utf8everywhere.org/)


In [None]:
# Convert a string to UTF-8

my_string = "Hello, NLP World! 🚀 你好世界 "

encoded_string = my_string.encode("UTF-8")

print("Encoded string:", encoded_string)

encoded_string_integer = list(encoded_string)

print("Encoded string integer:", encoded_string_integer)

decoded_string = encoded_string.decode("UTF-8")

print("Decoded string:", decoded_string)

### 1.2. Handling strings in Python
Python provides a rich set of built-in functions and methods for working with strings. Here are some common operations:

#### String Formatting

In [None]:
# f-string
x = {
    "name": "John",
    "age": 30
}

print(f"My name is {x['name']} and I am {x['age']} years old.")

# format
print("My name is {} and I am {} years old.".format(x['name'], x['age']))


#### String manipulation

In [None]:
text_1 = " Trieste is a beautiful city near the border with Croatia and Slovenia. "
text_2 = """Trieste is famous for its cafes and historic architecture.
"""

# Concatenate the two strings
text = text_1 + text_2

print(text)


print(text.upper())
print(text.lower())
print(text.title())

# Remove punctuation
import string
print(text.translate(str.maketrans('', '', string.punctuation)))

# Remove leading and trailing whitespace
print(text.strip())

# Remove leading whitespace only
print(text.lstrip())

# Remove trailing whitespace only
print(text.rstrip())

# Replace

print(text.replace("Trieste", "Milano"))

# Find
print(text.find("Croatia"))

In [None]:

## Split
print(text.split())

print(text.split("."))


In [None]:
text = "Is a student of NLP in Trieste."

names = ["john", "mary"]
surnames = ["doe", "lou"]
cities = ["Trieste", "Milano"]

# Exercise: for each (name, surname, city) triple, print the string "Name Surname is a student of NLP in CITY". For example "John Doe is a student of NLP in TRIESTE"
# Do NOT declare new string objects; just compose and manipulate the strings defined above

for name, surname, city in zip(names, surnames, cities):
    print(f"{name.capitalize()} {surname.capitalize()} is a student of NLP in {city.upper()}")

### 1.3 Regular Expressions in Python

Regular expressions (regex) are sequences of characters that define a search pattern. They are widely used for string matching and manipulation.

In Python, the `re` module provides support for regular expressions.

#### Basics of Regular Expressions

- **Literal Characters**: Matches the exact character.
- **Metacharacters**: Characters with special meaning.

Some common metacharacters:

- `.` : Matches any character except a newline.
- `^` : Matches the start of a string.
- `$` : Matches the end of a string.
- `*` : Matches 0 or more repetitions.
- `+` : Matches 1 or more repetitions.
- `?` : Matches 0 or 1 repetition.
- `[]`: Matches any one character inside the brackets.
- `|` : Matches either the expression before or the expression after the `|`.
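- `\d`: Matches any decimal digit; equivalent to `[0-9]` for ASCII text (for Python 3 `str` patterns it also matches other Unicode digits unless the `re.ASCII` flag is set).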
- `\D`: Matches any non-digit character.
- `\s`: Matches any whitespace character.
- `\S`: Matches any non-whitespace character.
- `\w`: Matches any alphanumeric character and underscore.
- `\W`: Matches any non-alphanumeric character.

Let's start with some examples.

In [None]:

import re

# Simple pattern matching
pattern = r"apple"
text = "I like apples and apple pies."

matches = re.findall(pattern, text)
print(matches)


The `re.findall()` function returns all non-overlapping matches of the pattern in the string, as a list of strings.

#### Using Metacharacters

In [None]:
# Finding all digits
text = "My phone number is 123-456-7890."

pattern = r"\d"
digits = re.findall(pattern, text)
print(digits)

# Finding all sequences of digits
pattern = r"\d+"
digits_sequences = re.findall(pattern, text)
print(digits_sequences)
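
Beyond `\d`, several of the other metacharacters from the list can be tried in the same way. A small sketch (the example sentence and variable names are illustrative):

```python
import re

text = "The cat sat on the mat. A bat flew at 9pm!"

# [] matches one character from the set: c, m or b, followed by "at"
rhymes = re.findall(r"[cmb]at", text)

# ^ anchors the pattern at the start of the string
starts_with_the = re.findall(r"^The", text)

# | matches either alternative
animals = re.findall(r"cat|bat", text)

# ? makes the preceding character optional
colours = re.findall(r"colou?r", "color and colour")

# \w+ grabs runs of word characters, \s+ splits on runs of whitespace
words = re.findall(r"\w+", text)
pieces = re.split(r"\s+", "a  b \t c")

print(rhymes, starts_with_the, animals, colours, pieces, sep="\n")
```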



#### re.search() vs re.match()

- `re.match()` checks for a match only at the beginning of the string.
- `re.search()` checks for a match anywhere in the string.

In [None]:

text = "The cat sat on the mat."

# Using re.match()
match = re.match(r'cat', text)
print("Using re.match():", match)

# Using re.search()
search = re.search(r'cat', text)
print("Using re.search():", search)

#### re.sub()

The `re.sub()` function replaces occurrences of the pattern with a specified replacement string.

In [None]:
text = "I have a cat. My cat is cute."

# Replace 'cat' with 'dog'
new_text = re.sub(r'cat', 'dog', text)
print(new_text)

#### Grouping and Capturing

In [None]:
text = "My email is john.doe@example.com"

# Pattern to extract email
pattern = r'(\w+\.\w+)@(\w+\.\w+)'

match = re.search(pattern, text)
if match:
    print("Full match:", match.group(0))
    print("Username:", match.group(1))
    print("Domain:", match.group(2))
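
Groups can also be given names, which tends to be easier to read than numeric indices. A minimal sketch using the `(?P<name>...)` syntax of the `re` module (the group names `user` and `domain` are chosen here for illustration):

```python
import re

text = "My email is john.doe@example.com"

# (?P<name>...) gives each capturing group a name
pattern = r"(?P<user>[\w.]+)@(?P<domain>\w+\.\w+)"

match = re.search(pattern, text)
if match:
    print("Username:", match.group("user"))
    print("Domain:", match.group("domain"))
    # groupdict() returns all named groups as a dictionary
    print(match.groupdict())
```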


#### Compiling Regular Expressions

For patterns that will be used multiple times, it's more efficient to compile them.

In [None]:
pattern = re.compile(r'\d+')

text1 = "Order number 12345"
text2 = "Invoice 67890"

print(pattern.findall(text1))
print(pattern.findall(text2))

#### Exercise: Regular Expressions

**Task**: Write a regular expression to extract all valid email addresses from the following text.

In [None]:
text = """
Please contact us at support@example.com for further information.
You can also reach out to sales@example.co.uk or feedback@company.org.
Invalid emails like test@.com or @example.com should not be matched.
"""

# Write your code here
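# The domain is one label followed by one or more ".label" parts, so a trailing sentence period is not captured
pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+'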

emails = re.findall(pattern, text)

In [None]:
print("Extracted emails:", emails)

## 2 - Tokenization and Lemmatization
Tokenization and lemmatization are fundamental steps in text preprocessing for Natural Language Processing (NLP).

- **Tokenization**: The process of breaking text into smaller units called tokens (e.g., words, sentences).
- **Lemmatization**: Reducing words to their base or dictionary form (lemma).

We will use the Natural Language Toolkit (NLTK), a popular Python library for NLP tasks.

Why Tokenization?
- Text data is unstructured and needs to be converted into a structured format for analysis. Tokenization breaks text into smaller units (tokens) for further processing.
- Tokens can be words, sentences, or subwords, depending on the task.

Problems with tokenization:
- Tokenization is not always straightforward due to the complexity of languages and text data.
- Should "New York" be treated as one token or two?
- Should "can't" be split into "can" and "not" or kept as a single token?
- The right tokenization depends on the context and the task at hand.
- How should new or out-of-vocabulary (OOV) words be handled?
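
These ambiguities show up even with very simple tokenizers. The sketch below compares plain whitespace splitting with a small regex-based tokenizer on an illustrative sentence; neither is meant as a recommended tokenizer:

```python
import re

sentence = "I can't visit New York today."

# Whitespace splitting keeps punctuation glued to the words
whitespace_tokens = sentence.split()

# A simple regex: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```

The first approach keeps "can't" whole but leaves "today." with its period attached; the second separates the period but breaks "can't" into three pieces. Both split "New York" into two tokens.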

### 2.1 Tokenization using NLTK


In [None]:
# Download the Punkt tokenizer data used by word_tokenize and sent_tokenize
import nltk
nltk.download('punkt_tab')

#### Word Tokenization

In [4]:
from nltk.tokenize import word_tokenize

text = "Hello! How are you doing today? It's great to see you."

tokens = word_tokenize(text)


In [None]:
print(tokens)

#### Sentence Tokenization

In [None]:
from nltk.tokenize import sent_tokenize

text = "Hello! How are you doing today? It's great to see you. Let's catch up soon."

sentences = sent_tokenize(text)

In [None]:
print(sentences)

### 2.2 Lemmatization using NLTK

Lemmatization requires the use of a dictionary to find the lemma of a word.

We need to download the WordNet lemmatizer and the POS (Part of Speech) tagger.

In [None]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

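To improve lemmatization, we need to provide the correct POS tag for each word. The cell below calls `lemmatize()` without a tag, so every word is treated as a noun (the default).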

In [None]:
text = "The striped bats are hanging on their feet for best. "

tokens = word_tokenize(text)
lemmatized_output = [lemmatizer.lemmatize(word) for word in tokens]

In [None]:
print([f"{word} -> {lemmatizer.lemmatize(word)}" for word in tokens])
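
To pass a POS tag to the lemmatizer, Penn Treebank tags (the tag set produced by NLTK's default English tagger) have to be mapped to the single-letter codes WordNet uses. A minimal sketch of such a mapping; the helper name `penn_to_wordnet` is hypothetical, and the tagged pairs are written by hand so the block runs without any NLTK data:

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank POS tag to a WordNet POS code."""
    if tag.startswith("J"):
        return "a"  # adjective (wordnet.ADJ)
    if tag.startswith("V"):
        return "v"  # verb (wordnet.VERB)
    if tag.startswith("R"):
        return "r"  # adverb (wordnet.ADV)
    return "n"      # noun (wordnet.NOUN), also used as the fallback

# Hand-written (word, tag) pairs standing in for the output of a POS tagger
tagged = [("bats", "NNS"), ("are", "VBP"), ("hanging", "VBG"), ("best", "JJS")]

wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_pos)
```

With NLTK, the pairs would typically come from `nltk.pos_tag(tokens)`, and each word would be lemmatized with `lemmatizer.lemmatize(word, penn_to_wordnet(tag))`. With a verb tag, forms such as "are" and "hanging" should then reduce to "be" and "hang" instead of being left unchanged.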

### 2.3 Stemming using NLTK

In [1]:
#stemming
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


In [5]:
text = "The striped bats are hanging on their feet for best. "

tokens = word_tokenize(text)
stemmed_output = [stemmer.stem(word) for word in tokens]

In [None]:
print([f"{word} -> {stemmer.stem(word)}" for word in tokens])

### 2.4 Stopwords Removal

Stopwords are common words that may not carry significant meaning (e.g., 'is', 'the', 'and').

We can remove them to reduce noise in our data.

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

filtered_tokens = [word for word in tokens if word.lower() not in stop_words]


In [None]:
print(filtered_tokens)

## 3 - Basic Sentiment Analysis using Vocabulary-based Approach

Sentiment analysis is the process of determining the sentiment (positive, negative, neutral) of a text. We can use a vocabulary-based approach to classify text based on the presence of positive or negative words.
The process involves the following steps:
- Load a list of positive and negative words.
- Tokenize the text.
- Count the number of positive and negative words.
- Determine the overall sentiment based on the counts.
- Calculate the sentiment score.
- Classify the sentiment as positive, negative, or neutral.
Let's implement a basic sentiment analysis using this approach.
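
Before moving to real data, the steps above can be sketched on a toy example. The lexicon, its weights, and the neutrality threshold below are all hypothetical values chosen for illustration:

```python
import re

# Hypothetical toy lexicon: positive words get positive weights, negative words negative ones
TOY_LEXICON = {"great": 1.0, "good": 0.5, "boring": -0.5, "awful": -1.0}

def sentiment_score(text: str) -> float:
    """Average weight of the lexicon words found in the text (0.0 if none are found)."""
    tokens = re.findall(r"\w+", text.lower())
    weights = [TOY_LEXICON[token] for token in tokens if token in TOY_LEXICON]
    return sum(weights) / len(weights) if weights else 0.0

def classify(score: float, threshold: float = 0.1) -> str:
    """Turn a score into a label; scores close to zero count as neutral."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

for review in ["A great film with a good cast", "The plot was awful and boring", "It is a movie"]:
    score = sentiment_score(review)
    print(f"{review!r}: score={score:+.2f} -> {classify(score)}")
```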

### Brief intro to 🤗 datasets
🤗 datasets offers a practical way to handle and share datasets. It works like a GitHub for datasets: each dataset has a repository that you can browse on the web, for example [imdb](https://huggingface.co/datasets/stanfordnlp/imdb). Let's explore the basic functionality of the library.

In [None]:
from datasets import load_dataset
# Load a dataset
# For this example, we will use the 'imdb' dataset
dataset = load_dataset('imdb')


In [None]:
# Display the dataset structure
print(dataset)

In [None]:

# Explore the dataset
# Display the first few examples from the training set
print("Training set examples:")
print(dataset['train'][0])

In [None]:
# Shuffle the dataset
dataset = dataset.shuffle()

In [None]:
# Preprocess the dataset
# Define a preprocessing function to lowercase the text
def preprocess_function(examples):
    return {'text': [text.lower() for text in examples['text']]}

In [None]:
# Apply the preprocessing function to the dataset
dataset = dataset.map(preprocess_function, batched=True, num_proc=4)

# Display the first few preprocessed examples
print("\nPreprocessed training set examples:")
print(dataset['train'][0])

In [None]:
# Load a word list where each line holds a word and its sentiment weight, separated by a tab
with open("wordwithStrength.txt") as f:
    lines = f.read().splitlines()

# EXERCISE: Create a dictionary with the words as keys and the weights as values
WORDS_WITH_WEIGHT = {}
for line in lines:
    if not line.strip():
        continue  # skip empty lines
    word, weight = line.split("\t")[:2]
    WORDS_WITH_WEIGHT[word] = float(weight)

In [None]:
# Show the words with their weights
print(WORDS_WITH_WEIGHT)

In [None]:
# load a sentiment analysis dataset
from datasets import load_dataset
from typing import List
dataset = load_dataset("imdb", split="train")

# Shuffle the dataset (fixed seed for reproducibility)
dataset = dataset.shuffle(seed=42)

# sample just 25 reviews
dataset = dataset.select(range(25))


In [None]:
def tokenize_review(review: str) -> List:
    # Tokenize a review into words
    tokenized_review = word_tokenize(review)

    # to lowercase
    tokenized_review = [word.lower() for word in tokenized_review]
    
    return tokenized_review

In [None]:
def count_review(review: List):
    # count the number of positive and negative words in a review

    balance = 0
    n_found = 0
    for word in review:
        if word in WORDS_WITH_WEIGHT:
            balance += WORDS_WITH_WEIGHT[word]
            n_found += 1
    return balance / n_found if n_found > 0 else 0

In [None]:
for review in dataset:
    review_text = review["text"]
    true_label = review["label"]
    tokenized_review = tokenize_review(review_text)
    balance = count_review(tokenized_review)
    
    print("Review:", review_text)
    print("Balance:", balance, "True Label:", true_label)
    print("\n")
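
The loop above prints a balance for each review but stops short of the final classification step. A minimal sketch of how the balance could be turned into a label and compared with the true labels, assuming the IMDB convention of 1 for positive and 0 for negative reviews; the `(balance, label)` pairs are made-up values standing in for the real loop output:

```python
def predict_label(balance: float) -> int:
    """Predict 1 (positive) for a positive balance, 0 (negative) otherwise."""
    return 1 if balance > 0 else 0

# Hypothetical (balance, true_label) pairs standing in for the output of the loop
results = [(0.4, 1), (-0.2, 0), (0.1, 0), (-0.3, 0), (0.25, 1)]

# Count how many predictions agree with the true label
correct = sum(predict_label(balance) == label for balance, label in results)
accuracy = correct / len(results)
print(f"Accuracy: {accuracy:.2f}")
```

A threshold of zero is the simplest choice; a small neutral band around zero could be used instead when a third class is needed.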


## 4 - Modern Tokenization, aka what do we use in LLMs nowadays?

The tokenization algorithms currently employed in LLMs try to strike a trade-off between the following properties:
- The vocabulary is large enough to carry (some) semantic information.
- The vocabulary is small enough to be handled efficiently.
- (Bonus) How can we handle out-of-vocabulary words?

__Why Do We Need a Large Vocabulary?__

Imagine if we encoded all text using only the 26 letters of the alphabet. Tokenization would be straightforward, but the resulting sequences would be very long. This leads to two main issues:

- __Increased computational cost__: Longer sequences require more computational resources. If you recall from the Deep Learning course, the attention mechanism in transformers has a complexity that scales quadratically with the input length. This means that longer sequences can significantly increase the computation needed, making the process less efficient.

- __Loss of information__: With longer sequences, the information gets spread out, making it harder for machine learning models, especially transformers, to learn. Instead of directly recognizing and understanding words, the model would first have to learn how to construct words from individual letters before it can start learning the relationships between words. This adds an unnecessary layer of complexity.

In [43]:
import numpy as np

# Example sentence
text = "The quick brown fox jumps over the lazy dog"

In [None]:

# Character-level tokenization (very small vocabulary)
char_tokens = list(text.replace(" ", ""))
char_token_count = len(char_tokens)
char_attention_size = char_token_count ** 2

print(f"Original text: {text}")
print(f"\nCharacter-level tokenization:")
print(f"Tokens: {char_tokens}")
print(f"Number of tokens: {char_token_count}")
print(f"Attention matrix size: {char_attention_size} elements")

In [None]:

# Word-level tokenization (larger vocabulary)
word_tokens = text.split()
word_token_count = len(word_tokens)

# Size of attention matrix: (sequence length) x (sequence length)
word_attention_size = word_token_count ** 2

print(f"\nWord-level tokenization:")
print(f"Tokens: {word_tokens}")
print(f"Number of tokens: {word_token_count}")
print(f"Attention matrix size: {word_attention_size} elements")




__Why don't we use Word-level tokenization instead?__

Using word-level tokenization can seem like a good solution because it reduces the length of the tokenized sequence, leading to smaller and more efficient attention matrices. However, there are several challenges with this approach:

- Large Vocabulary Size: Word-level tokenization requires a vocabulary that contains every possible word in the language. Since languages are constantly evolving, new words, slang, and domain-specific terms are regularly added, making it impossible to maintain a comprehensive vocabulary. A large vocabulary also means that the model will need to handle a very large set of tokens, increasing memory usage and making the model harder to train.

- Out-of-Vocabulary (OOV) Problem: No matter how big the vocabulary is, there will always be words that are not included. For example, rare words, typos, or new terminology might not be part of the predefined set. In such cases, word-level tokenizers often fail because they cannot handle these "unknown" words, resulting in a loss of information.

- Difficulty Handling Morphology: In many languages, words can change forms depending on grammar rules (e.g., plurals, verb conjugations). A word-level tokenizer would need to include every possible variation (like "run," "runs," "running," etc.) as separate tokens. This increases the vocabulary size and makes it harder for the model to generalize because it sees different forms of the same word as completely separate entities.

- Inefficient Handling of Subwords: Some words are made up of common prefixes, suffixes, or roots (like "unhappiness" = "un" + "happiness"). A word-level tokenizer treats "unhappiness" and "happiness" as two unrelated tokens, missing the opportunity to learn the patterns they share.
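
The out-of-vocabulary problem is easy to reproduce with a toy word-level vocabulary. In the sketch below, the `[UNK]` placeholder token and the helper `encode_words` are illustrative names, not part of any library:

```python
# Build a tiny word-level vocabulary from a "training" sentence
train_words = "the quick brown fox jumps over the lazy dog".split()

vocab = {"[UNK]": 0}
for word in train_words:
    # Assign the next free id the first time a word is seen
    vocab.setdefault(word, len(vocab))

def encode_words(sentence: str) -> list:
    """Map each word to its id; unknown words all collapse to the [UNK] id."""
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.split()]

ids = encode_words("the quick brown cat jumped over the dog")
print(ids)
```

Both "cat" and "jumped" end up with the same id as every other unknown word, so the model cannot tell them apart, even though "jumped" differs from the known word "jumps" only by its ending.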

### Solution: Sub-word tokenizer and Byte-Pair Encoding algorithm
To address the challenges of character-level and word-level tokenization, we use sub-word tokenizers. These tokenizers aim to find a balance between the two approaches by breaking down words into smaller, meaningful units (sub-words) rather than relying on full words or individual characters. One of the most popular sub-word tokenization techniques is the Byte-Pair Encoding (BPE) algorithm.

__What is Byte-Pair Encoding (BPE)?__

BPE is a compression-based algorithm that starts by treating each character as a separate token and iteratively merges the most frequently occurring pairs of tokens to form new sub-words. This process continues until a predefined vocabulary size is reached. The result is a set of tokens that can effectively represent common sub-words, prefixes, suffixes, and even complete words.
Its byte-level variant was popularized for language models by the [GPT-2 Paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).

__Why BPE Works Well__

- Efficient Vocabulary Size: BPE creates a more manageable vocabulary by allowing the model to represent both frequent words as single tokens and less common words as a combination of sub-word tokens. This strikes a balance between character-level and word-level tokenization, resulting in a smaller, more efficient vocabulary without sacrificing flexibility.

- Handling Rare and New Words: Since BPE breaks down words into smaller parts, it can effectively handle rare words, new terminology, and even misspellings. If a word is not in the vocabulary, BPE can still encode it by combining smaller sub-word tokens. This helps the model process unseen words without failing, unlike traditional word-level tokenizers.

- Learning Morphological Patterns: By segmenting words into sub-words, BPE enables the model to learn useful patterns, such as prefixes, suffixes, and root words. For example, "running," "runner," and "runs" might all share the common sub-word "run," making it easier for the model to understand relationships between these forms.

- Optimized Computation: Sub-word tokenization reduces the input sequence length compared to character-level tokenization, leading to smaller attention matrices. This makes the process more efficient, reducing computational cost without losing important information.

__Example: How BPE Works__

Suppose the training corpus contains the following words:

"low", "lowest", "lower"

Start by splitting each word into single-character tokens:

- l o w
- l o w e s t
- l o w e r

Count every pair of adjacent tokens:

- ("l", "o"): 3, ("o", "w"): 3, ("w", "e"): 2, ("e", "s"): 1, ("s", "t"): 1, ("e", "r"): 1

Merge the most frequent pair into a new token. Here ("l", "o") and ("o", "w") are tied, so we take the first one and create the token "lo":

- lo w
- lo w e s t
- lo w e r

Repeat the process until the desired vocabulary size is reached. The next merge is ("lo", "w") -> "low", followed by ("low", "e") -> "lowe":

- low
- lowe s t
- lowe r

BPE ensures that common words like "low" are encoded efficiently as a single token, while more complex or rare words can be built from smaller sub-word pieces.
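
The merge procedure on this three-word corpus can be sketched in a few lines. This is a character-level toy version for illustration (the helper names `count_pairs` and `merge_pair` are hypothetical); ties between equally frequent pairs are broken by taking the pair that was seen first:

```python
from collections import Counter

# Each word starts as a list of single-character tokens
corpus = [list("low"), list("lowest"), list("lower")]

def count_pairs(words):
    """Count adjacent token pairs across all words."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated token."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

merges = []
for step in range(3):
    pairs = count_pairs(corpus)
    best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
    merges.append(best)
    corpus = merge_pair(corpus, best)
    print(f"step {step + 1}: merge {best} -> {corpus}")
```

After three merges the word "low" is a single token, while "lowest" and "lower" are built from "lowe" plus the remaining characters.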


In [None]:
text = "banana bandana"
tokens = list(text.encode("utf-8"))
print(tokens)

In [None]:
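from collections import Counter

def count_pair(tokens):
    # Count how often each pair of adjacent tokens occurs
    return Counter(zip(tokens, tokens[1:]))
print(count_pair(tokens))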

In [None]:
from typing import List, Union, Dict

class BPE():
    def __init__(self, vocab_size = 260):
        self.vocab_size = vocab_size
        self.merge_forest = {}

    @staticmethod
    def string_to_bytes(text: str) -> List[int]:
        raise NotImplementedError
    @staticmethod
    def bytes_to_string(tokens: List[int]) -> str:
        raise NotImplementedError

    def count_pair(self, tokens) -> Dict:
        raise NotImplementedError

    def merge(self, pair, tokens: List[int], new_id: int) -> List[int]:
        raise NotImplementedError


    def train(self, train_text: Union[List[int], str]):
        raise NotImplementedError

    def encode(self, text: Union[List[int], str]) -> List[int]:
        raise NotImplementedError

    def decode(self, tokens: List[int], return_type = "string") -> Union[str, List[int]]:
        raise NotImplementedError
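
One possible way to fill in such a skeleton is sketched below. It is a simplified byte-level version under stated assumptions, not a reference solution: the class name `SimpleBPE` and the `merges` dictionary are choices made for this sketch, token ids 0 to 255 are the raw bytes, and each learned merge receives the next free id starting at 256.

```python
from collections import Counter
from typing import Dict, List, Tuple

class SimpleBPE:
    """Byte-level BPE sketch: ids 0-255 are raw bytes, merges get ids from 256 on."""

    def __init__(self, vocab_size: int = 260):
        self.vocab_size = vocab_size
        self.merges: Dict[Tuple[int, int], int] = {}  # (left id, right id) -> new id

    @staticmethod
    def count_pair(tokens: List[int]) -> Counter:
        # Count how often each pair of adjacent tokens occurs
        return Counter(zip(tokens, tokens[1:]))

    @staticmethod
    def merge(pair: Tuple[int, int], tokens: List[int], new_id: int) -> List[int]:
        # Replace every occurrence of `pair` with `new_id`
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    def train(self, text: str) -> None:
        tokens = list(text.encode("utf-8"))
        for new_id in range(256, self.vocab_size):
            pairs = self.count_pair(tokens)
            if not pairs:
                break  # nothing left to merge
            best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
            self.merges[best] = new_id
            tokens = self.merge(best, tokens, new_id)

    def encode(self, text: str) -> List[int]:
        tokens = list(text.encode("utf-8"))
        # Replay the merges in the order they were learned
        for pair, new_id in self.merges.items():
            tokens = self.merge(pair, tokens, new_id)
        return tokens

    def decode(self, tokens: List[int]) -> str:
        # Expand merged ids back into their parts, newest merge first
        for (left, right), new_id in reversed(list(self.merges.items())):
            expanded = []
            for token in tokens:
                expanded.extend((left, right) if token == new_id else (token,))
            tokens = expanded
        return bytes(tokens).decode("utf-8")

bpe_demo = SimpleBPE(vocab_size=259)  # 256 byte ids plus 3 merges
bpe_demo.train("banana bandana")
ids = bpe_demo.encode("banana bandana")
print(ids)
print(bpe_demo.decode(ids))
```

Replaying the merges in training order keeps `encode` consistent with `train`, and `decode` simply undoes the merges in reverse. A production tokenizer would be considerably more efficient and would also handle invalid byte sequences and special tokens.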

In [None]:
bpe = BPE(vocab_size=780)

lorem_ipsum = """

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas commodo velit non ligula consequat molestie. Pellentesque dui massa, viverra sed quam vitae, maximus mattis leo. Morbi rhoncus sodales convallis. Etiam fermentum dui ex. Ut posuere rutrum lectus in vehicula. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Nullam non metus lobortis, efficitur eros tempus, placerat ligula.

Praesent in aliquam odio. Nulla varius sagittis ipsum. Cras lacinia tincidunt nisl, id interdum purus dapibus at. Vestibulum id justo gravida, rutrum tortor at, commodo orci. Sed molestie bibendum tortor, eget ultricies metus posuere sed. Donec venenatis felis massa, in fermentum libero volutpat et. Praesent accumsan consequat ligula at viverra. Proin sagittis dolor quis justo finibus pharetra. In hac habitasse platea dictumst. Mauris in vehicula augue. Duis scelerisque elementum mollis. Vestibulum auctor feugiat egestas. Duis luctus ornare pellentesque. Vestibulum et magna elementum, maximus justo et, feugiat nunc.

Ut mattis nec elit vitae placerat. Aenean in eleifend justo. Sed nec molestie felis. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Curabitur quis ipsum in odio consectetur egestas. Curabitur convallis vulputate eleifend. Vivamus mi mauris, facilisis non placerat efficitur, ornare ac purus. Nullam ornare purus vel dictum pulvinar.

Cras lacinia velit et nisi varius, id aliquet ligula posuere. In lacinia, nisi non dictum luctus, metus tortor viverra felis, nec rhoncus velit massa gravida tortor. Morbi purus metus, lobortis et arcu eu, molestie pulvinar mauris. Curabitur condimentum vehicula tempus. Nulla facilisi. Morbi mauris nisl, euismod id risus id, rutrum suscipit quam. Proin in lectus quis turpis ornare feugiat. Integer venenatis dui sem, at fermentum nulla varius vel. Ut tristique scelerisque nisl ut mollis. Cras facilisis, nunc quis tristique elementum, orci odio rhoncus nisl, vitae scelerisque odio ante ac elit.

Phasellus nec velit tellus. In tellus dui, euismod ac venenatis sit amet, porttitor id lectus. Sed et mauris at tellus vehicula commodo sed non metus. Donec non dui sit amet neque pretium convallis id eu nibh. Vivamus pharetra ligula eros. Maecenas interdum nibh nec venenatis viverra. Maecenas venenatis convallis est ac tristique. Aliquam erat volutpat. Mauris at velit sed lorem finibus pellentesque. Nam eget ante vel tortor finibus pellentesque. Phasellus bibendum venenatis mi eget molestie. Integer quis tortor at augue scelerisque tincidunt eu id mi. Praesent congue consectetur nulla. Proin est erat, tempus eu sem sit amet, sodales feugiat nibh. Nam at consequat ante. Aliquam fringilla tellus non odio pulvinar tincidunt.
"""

bpe.train(train_text=lorem_ipsum)


tokens_ids = bpe.encode("""Nulla varius sagittis ipsum. Cras lacinia tincidunt nisl, id interdum purus dapibus at. Vestibulum id justo gravida, rutrum tortor at, commodo orci. Sed molestie bibendum tortor, eget ultricies metus posuere sed. Donec venenatis felis massa, in fermentum libero volutpat et. Praesent accumsan consequat ligula at viverra. Proin sagittis dolor quis justo finibus pharetra. In hac habitasse platea dictumst. Mauris in vehicula augue. Duis scelerisque elementum mollis. Vestibulum auctor feugiat egestas. Duis luctus ornare pellentesque. Vestibulum et magna elementum, maximus justo et, feugiat nunc.

Ut mattis nec elit vitae placerat. Aenean in eleifend justo. Sed nec molestie felis. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Curabitur quis ipsum in odio consectetur egestas. Curabitur convallis vulputate eleifend. Vivamus mi mauris, facilisis non placerat efficitur, ornare ac purus. Nullam ornare purus vel dictum pulvinar.

Cras lacinia velit et nisi varius, id aliquet ligula posuere. In lacinia, nisi non dictum luctus, metus tortor viverra felis, nec rhoncus velit massa gravida tortor. Morbi purus metus, lobortis et arcu eu, molestie pulvinar mauris. Curabitur condimentum vehicula tempus. Nulla facilisi. Morbi mauris nisl, euismod id risus id, rutrum suscipit quam. Proin in lectus quis turpis ornare feugiat. Integer venenatis dui sem, at fermentum nulla varius vel. Ut tristique scelerisque nisl ut mollis. Cras facilisis, nunc quis tristique elementum, orci odio rhoncus nisl, vitae scelerisque odio ante ac elit.
""")
print(tokens_ids)
original_text = bpe.decode(tokens_ids)
print(original_text)


### Bonus
- Bonus 1. This implementation does not handle invalid sequences of bytes. Since UTF-8 is a variable-length encoding, a sequence of bytes is not always a valid UTF-8 string (for example, when a multi-byte character is cut in half). In this case, decoding raises an exception. To fix this, wrap the decoding step in a try-except block so that the exception is handled and the loop can continue.
- Bonus 2. The current implementation does not handle special tokens, like `<start_sentence>` or `<end_sentence>`. Modify the algorithm to handle special tokens by adding them to the vocabulary and updating the encoding process accordingly.
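
As a starting point for Bonus 1, the sketch below shows what happens when an invalid byte sequence is decoded, and two ways to deal with it: catching `UnicodeDecodeError`, or passing `errors="replace"` to `bytes.decode` (the byte values are chosen for illustration):

```python
# A lone continuation byte (0x80) is not valid UTF-8 on its own
broken = bytes([72, 105, 0x80])  # "Hi" followed by an invalid byte

# Strict decoding (the default) raises an exception
try:
    broken.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError as err:
    strict_failed = True
    print("Strict decoding failed:", err)

# errors="replace" substitutes U+FFFD (the replacement character) instead of raising
lenient = broken.decode("utf-8", errors="replace")
print(lenient)
```

Which option fits better depends on whether the caller needs to know that the token sequence was invalid or just wants readable output.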




## Tokenizer HuggingFace
Now let's see how to use a pre-trained tokenizer in practice with the HuggingFace transformers library.


In [None]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

tokens = tokenizer.encode(lorem_ipsum)

print(tokens)

print(tokenizer.decode(tokens))

print(tokenizer.vocab_size)

### Additional Resources

- [BPE Tokenizer Youtube Video by Andrej Karpathy](https://www.youtube.com/watch?v=zduSFxRajkE&t=2155s)
- [Regular Expressions Documentation](https://docs.python.org/3/library/re.html)
- [HuggingFace Transformers Documentation](https://huggingface.co/docs/transformers/index)
- [NLTK Book](http://www.nltk.org/book/)

## Final Exercise: Building a Text Processing Pipeline for Log File Analysis

Objective:
Create a comprehensive text processing pipeline to analyze and extract meaningful information from a complex log file. The pipeline will involve:
1. Parsing and Cleaning the Log File
2. Feature Extraction using Regular Expressions
3. Tokenization and Lemmatization
4. Custom Byte-Pair Encoding (BPE) Tokenization with Special Tokens
5. Data Visualization and Analysis

Instructions:
Follow the steps outlined in the comments to complete the exercise.

---

Background:
Log files are essential in monitoring and diagnosing systems and applications. They often contain a mix of timestamps, error messages, user actions, and other system-generated information. Analyzing log files can be challenging due to their unstructured and noisy nature.
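
To make this concrete, here is a sketch of how a single entry could be parsed with named regex groups. The log line and its layout are entirely hypothetical; the real `system_logs.txt` may use a different format, so the pattern would need to be adapted:

```python
import re

# Hypothetical log line; the actual file may be laid out differently
line = "2024-03-01 12:34:56 ERROR user=alice42 ip=192.168.0.10 Failed login attempt"

LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>INFO|WARNING|ERROR)\s+"
    r"user=(?P<user_id>\w+)\s+"
    r"ip=(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"(?P<action>.+)"
)

match = LOG_PATTERN.match(line)
# Malformed lines produce no match and can be skipped or logged separately
entry = match.groupdict() if match else None
print(entry)
```

A list of such dictionaries can be passed directly to `pd.DataFrame` for the analysis steps.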

---

Deliverables:
- Complete all sections marked as TODO.
- Ensure your code is well-documented with comments explaining your logic.
- Include visualizations with appropriate titles and labels.
- Summarize your findings and discuss any challenges faced.

Good luck!

In [None]:


# Import necessary libraries
import re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

# Step 1: Obtain and Explore the Log File

# TODO:
# - Load the log file 'system_logs.txt' into Python.
# - Handle any encoding issues.
# - Read the file line by line for processing.

# Hints:
# - Use the open() function with the appropriate encoding.
# - Read lines using file.readlines().

# Your code here


# Step 2: Parsing and Cleaning the Logs

# TODO:
# - Strip unnecessary whitespace from each line.
# - Parse each log entry into its components:
#   - Timestamp
#   - Log Level (INFO, WARNING, ERROR)
#   - UserID
#   - IP Address
#   - Action Message
# - Handle any malformed entries.

# Hints:
# - Use string methods like strip().
# - Use exception handling to manage parsing errors.

# Your code here


# Step 3: Feature Extraction using Regular Expressions

# TODO:
# - Define regex patterns for each component.
# - Extract features using the patterns.
# - Store the extracted data in a structured format.

# Hints:
# - Use re.search() or re.match() to apply regex patterns.
# - Store data in a list of dictionaries or a pandas DataFrame.

# Your code here


# Step 4: Tokenization and Lemmatization

# Ensure required NLTK data packages are downloaded
# Uncomment the lines below if running for the first time
# nltk.download('punkt_tab')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')

# TODO:
# - Prepare the 'action' messages for text processing.
# - Tokenize the text into words.
# - Remove stopwords and punctuation.
# - Perform lemmatization with correct POS tags.
# - Add the processed text back to your data structure.

# Hints:
# - Use word_tokenize() for tokenization.
# - Use stopwords.words('english') for stopwords.
# - Define a function to map POS tags for lemmatization.

# Your code here


# Step 5: Custom Byte-Pair Encoding (BPE) Tokenization with Special Tokens

# TODO:
# - Implement or import your BPE class.
# - Modify the BPE tokenizer to accept special tokens.
# - Replace patterns in the text with special tokens.
# - Train the BPE tokenizer on the processed text.
# - Encode and decode sample text to verify correctness.

# Hints:
# - Define special tokens like <IP_ADDR>, <USER_ID>, etc.
# - Ensure special tokens are not split during BPE merges.

# Your code here


# Step 6: Data Visualization and Analysis

# TODO:
# - Analyze the frequency of different log levels.
# - Identify the most common actions or errors.
# - Create visualizations to represent your findings.

# Hints:
# - Use pandas for data manipulation.
# - Use seaborn or matplotlib for plotting.

# Your code here


# Optional Extensions

# TODO:
# - Implement anomaly detection.
# - Perform temporal analysis.
# - Integrate with a machine learning model for classification.

# Your code here (if attempting extensions)
