# Part_1_5_Words_Corpora_Text_Normalization
In Natural Language Processing (NLP), key techniques such as tokenization, stemming, and sentence segmentation are fundamental for transforming raw text into a structured format that can be effectively analyzed by language models. Byte-Pair Encoding (BPE) tokenization helps handle rare words and out-of-vocabulary terms by breaking text into subword units. Stemming, using algorithms like the Porter Stemmer, reduces words to their root forms, ensuring consistency across the text for better analysis. Sentence segmentation is crucial for dividing text into meaningful sentences, allowing for more accurate processing in downstream tasks. These techniques play a vital role in preparing text for various NLP tasks, ensuring that language data is in a normalized and analyzable state.

### **Objectives:**
By the end of this notebook, Parham will have a thorough understanding of tokenization and its importance in NLP, specifically learning how to implement **Byte-Pair Encoding (BPE)** to handle words and subwords. He will explore the **Porter Stemmer**, gaining insight into how it reduces words to their base forms and why this is essential for text normalization. Additionally, Parham will learn to apply **Sentence Segmentation** to split text into meaningful sentences for deeper analysis. Through hands-on coding exercises, he will gain practical experience using these techniques, utilizing Python libraries like `NLTK` and `SpaCy` to prepare text for NLP tasks.

**Table of Contents:** 
1. Import Libraries
2. Tokenization Techniques
3. Text Normalization: Porter Stemmer
4. Sentence Segmentation
5. Closing Thoughts

## 1. Import Libraries

In [19]:
import re
import os
import sys
from loguru import logger
import nltk
import pandas as pd 
from nltk.stem.porter import PorterStemmer
import numpy as np
from collections import Counter, defaultdict
import subprocess
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

## 2. Tokenization Techniques
Byte-pair encoding was first introduced in 1994 as a simple data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte.
Imagine Parham is reading a really big book, but some of the words are really long or tricky, and he might not know all of them. To make things easier, Parham can break the long words into smaller, simpler parts or pieces. This way, he can still understand the book without needing to know every single big word.

Byte-Pair Encoding (BPE) is a popular technique used in natural language processing for tokenizing text into subword units. The idea is to break down words into smaller, more frequent pieces, allowing models to efficiently handle rare or unknown words. BPE is especially useful in scenarios where the vocabulary is limited but the text contains a large variety of words, including compound or out-of-vocabulary words.

The algorithm works by:
- Starting with a sequence of individual characters.
- Finding the most frequent pair of consecutive characters (or subwords).
- Merging this pair into a single token.
- Repeating the process until a predefined number of merges or a desired vocabulary size is reached.

This process allows models to represent both common words and subwords effectively, making it easier to process any text.

In the cell below, Parham will see an example of BPE tokenization.

In [7]:
def get_stats(vocab):
    """
    Given a vocabulary (dictionary mapping words to frequency counts), returns a 
    dictionary of tuples representing the frequency count of pairs of characters 
    in the vocabulary.
    """
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i],symbols[i+1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """
    Given a pair of characters and a vocabulary, returns a new vocabulary with the 
    pair of characters merged together wherever they appear.
    """
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

def get_vocab(data):
    """
    Given a list of strings, returns a dictionary of words mapping to their frequency 
    count in the data.
    """
    vocab = defaultdict(int)
    for line in data:
        for word in line.split():
            vocab[' '.join(list(word)) + ' </w>'] += 1
    return vocab

def byte_pair_encoding(data, n):
    """
    Given a list of strings and an integer n, performs n merges of the most
    frequent adjacent symbol pair and returns the resulting vocabulary.
    """
    vocab = get_vocab(data)
    for i in range(n):
        pairs = get_stats(vocab)
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        print(f"\nAfter merge {i + 1}:")
        print(f"Best Pair: {best}")
        print(f"Updated Vocabulary: {vocab}")
    return vocab

# Example usage:
corpus = '''Tokenization is the process of breaking down 
a sequence of text into smaller units called tokens,
which can be words, phrases, or even individual characters.
Tokenization is often the first step in natural languages processing tasks 
such as text classification, named entity recognition, and sentiment analysis.
The resulting tokens are typically used as input to further processing steps,
such as vectorization, where the tokens are converted
into numerical representations for machine learning models to use.'''
data = corpus.split(' ')

n = 10
bpe_pairs = byte_pair_encoding(data, n)



After merge 1:
Best Pair: ('s', '</w>')
Updated Vocabulary: {'T o k e n i z a t i o n </w>': 2, 'i s</w>': 2, 't h e </w>': 3, 'p r o c e s s</w>': 1, 'o f </w>': 2, 'b r e a k i n g </w>': 1, 'd o w n </w>': 1, 'a </w>': 1, 's e q u e n c e </w>': 1, 't e x t </w>': 2, 'i n t o </w>': 2, 's m a l l e r </w>': 1, 'u n i t s</w>': 1, 'c a l l e d </w>': 1, 't o k e n s , </w>': 1, 'w h i c h </w>': 1, 'c a n </w>': 1, 'b e </w>': 1, 'w o r d s , </w>': 1, 'p h r a s e s , </w>': 1, 'o r </w>': 1, 'e v e n </w>': 1, 'i n d i v i d u a l </w>': 1, 'c h a r a c t e r s . </w>': 1, 'o f t e n </w>': 1, 'f i r s t </w>': 1, 's t e p </w>': 1, 'i n </w>': 1, 'n a t u r a l </w>': 1, 'l a n g u a g e s</w>': 1, 'p r o c e s s i n g </w>': 2, 't a s k s</w>': 1, 's u c h </w>': 2, 'a s</w>': 3, 'c l a s s i f i c a t i o n , </w>': 1, 'n a m e d </w>': 1, 'e n t i t y </w>': 1, 'r e c o g n i t i o n , </w>': 1, 'a n d </w>': 1, 's e n t i m e n t </w>': 1, 'a n a l y s i s . </w>': 1, 'T h e 
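Learning the merges is only half of the picture: to tokenize a new, possibly unseen word, the learned merges are replayed in the order they were learned. The `byte_pair_encoding` function above returns the final vocabulary rather than the list of merges, so the sketch below assumes a hypothetical `merges` list. Its first entry matches the first merge printed above; the other two are invented for illustration. `encode_word` is an illustrative helper, not part of the original notebook.

```python
def encode_word(word, merges):
    """Segment a single word by replaying BPE merges in learned order."""
    # Start from individual characters plus the end-of-word marker.
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge every adjacent occurrence of the pair (left, right).
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list: only the first pair comes from the run above.
merges = [("s", "</w>"), ("i", "n"), ("in", "g")]
tokens = encode_word("sings", merges)
print(tokens)  # ['s', 'ing', 's</w>']
```

Even though "sings" never appears in the corpus, it is still split into known pieces, which is how BPE copes with out-of-vocabulary words.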

**Exercise:**

The Byte-Pair Encoding (BPE) implementation in the previous example is a simplified version for educational purposes. While it illustrates the core concept of BPE, it might not be efficient for processing large datasets or in production environments. Here are some reasons why the naive implementation could be problematic with big data:

1. **Computational Complexity**
   - **Counting Pairs:** The algorithm recounts pairs of symbols on every iteration, which can become computationally expensive as the number of tokens increases. For large datasets, this might lead to significant slowdowns.
   - **Repeated Iteration:** The naive implementation involves multiple passes over the data to find the most frequent pairs, which can be inefficient when working with massive corpora.
2. **Memory Usage**
   - **Storage of Vocabulary:** The implementation keeps the entire vocabulary in memory, which can become unmanageable with large datasets. Each unique token needs to be stored, and the size of this vocabulary can grow quickly.
3. **Inefficiency in Merging**
   - **String Manipulation:** Frequent string replacements can be slow in Python. The current method creates new strings for each merge operation, leading to increased overhead.
4. **Scalability Issues**
   - **Lack of Parallelism:** The algorithm does not leverage parallel processing. For big data, parallelizing tasks can significantly reduce processing time.

**More Efficient Approaches**

For working with large datasets, here are a few alternative approaches:

- **Optimized Libraries**:

  Use optimized libraries like `sentencepiece` or `subword-nmt`, which are designed for BPE and can handle larger datasets more efficiently. These libraries implement BPE in a way that’s optimized for speed and memory usage.

- **Stream Processing**:

  Instead of loading the entire dataset into memory, consider processing it in chunks or streams. This way, you can update your vocabulary and statistics without needing to load everything at once.
- **Use Hash Maps for Counting**:

  Counting pairs in hash maps (dictionaries) rather than in lists or arrays keeps each lookup and update at roughly constant time, which makes counting and merging pairs more efficient.
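The stream-processing and hash-map ideas can be combined in a few lines. The following is a minimal sketch, not a production implementation: `stream_vocab` is a hypothetical helper that consumes any iterable of lines (an open file handle would work the same way) and keeps only the running counts in memory. An in-memory `io.StringIO` stands in for a large file here.

```python
import io
from collections import Counter

def stream_vocab(lines):
    """Build a BPE-style word vocabulary one line at a time."""
    vocab = Counter()  # hash map from spaced-out word to frequency
    for line in lines:
        for word in line.split():
            # Same representation as get_vocab: characters separated
            # by spaces, followed by the end-of-word marker.
            vocab[" ".join(word) + " </w>"] += 1
    return vocab

# Stand-in for a large file; a real file object is iterated lazily the same way.
sample = io.StringIO("low lower\nlow lowest\n")
vocab = stream_vocab(sample)
print(vocab["l o w </w>"])  # 2
```

Only one line is held in memory at a time, so the memory footprint depends on the number of distinct words rather than on the size of the corpus.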

As an exercise, please implement BPE with `sentencepiece`, using the corpus provided in the previous cell.

In [8]:
# @title 🧑🏿‍💻 Your code here

! pip install sentencepiece
import sentencepiece as spm
with open("input.txt","w+") as f:
  f.write(corpus)



In [9]:
# @title 👀 Solution

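# Train a SentencePiece model.
# Note: SentencePiece defaults to the unigram model (the log below reports
# model_type: UNIGRAM), and the output shown was produced with that default.
# Pass model_type='bpe' to train a BPE model, as the exercise asks.
spm.SentencePieceTrainer.train(input="input.txt", model_prefix='m', vocab_size=100)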

# Load the model
sp = spm.SentencePieceProcessor(model_file='m.model')

encoded = sp.encode(corpus, out_type=str)
print("Encoded:", encoded)

# Decode back to text
decoded = sp.decode(encoded)
print("Decoded:", decoded)
os.remove("input.txt")
os.remove("m.model")
os.remove("m.vocab")

Encoded: ['▁T', 'oken', 'ization', '▁', 'is', '▁', 'the', '▁process', '▁of', '▁b', 're', 'a', 'k', 'i', 'ng', '▁', 'd', 'o', 'w', 'n', '▁a', '▁se', 'q', 'u', 'en', 'ce', '▁of', '▁', 'te', 'x', 't', '▁in', 'to', '▁s', 'm', 'alle', 'r', '▁', 'u', 'nit', 's', '▁', 'call', 'ed', '▁tokens', ',', '▁wh', 'i', 'ch', '▁ca', 'n', '▁b', 'e', '▁w', 'or', 'd', 's', ',', '▁p', 'h', 'r', 'as', 'es', ',', '▁', 'or', '▁', 'e', 've', 'n', '▁in', 'd', 'i', 'v', 'i', 'd', 'u', 'al', '▁', 'ch', 'ar', 'ac', 'te', 'rs', '.', '▁T', 'oken', 'ization', '▁', 'is', '▁of', 't', 'en', '▁', 'the', '▁', 'fi', 'rs', 't', '▁step', '▁in', '▁na', 't', 'ur', 'al', '▁', 'la', 'ng', 'ua', 'g', 'es', '▁process', 'i', 'ng', '▁t', 'as', 'k', 's', '▁', 'su', 'ch', '▁', 'as', '▁', 'te', 'x', 't', '▁c', 'la', 'ssi', 'fi', 'c', 'ation', ',', '▁na', 'me', 'd', '▁', 'enti', 'ty', '▁re', 'co', 'g', 'ni', 'tion', ',', '▁an', 'd', '▁s', 'enti', 'me', 'n', 't', '▁an', 'a', 'ly', 's', 'is', '.', '▁T', 'he', '▁re', 'su', 'l', 'ti', 'ng', 

sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: input.txt
  input_format: 
  model_prefix: m
  model_type: UNIGRAM
  vocab_size: 100
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  differential_pri

## 3. Text Normalization: Porter Stemmer

Text normalization is a critical preprocessing step in Natural Language Processing (NLP) that involves transforming text into a standard format. One common method of normalization is stemming, which reduces words to their root forms. In linguistics, a **stem** is a part of a word that is common to all of its inflected variants. 

For example:
- Integrate
- Integrated
- Integration
- Integrating

The above words are inflected variants of **Integrat**. Hence, **Integrat** is a stem. To this stem, we can add different suffixes to form various words. The process of reducing such inflected (or sometimes derived) words to their stem is known as stemming. For instance, "Integrate," "Integrated," "Integration," and "Integrating" can be reduced to the stem **Integrat**.

### The Porter Stemmer Algorithm

The Porter Stemmer operates by applying a series of rules to strip suffixes from words. Here’s a brief overview of its functionality:

**Step-wise Process**: The algorithm consists of multiple steps, each aimed at removing specific types of suffixes. The process can be categorized into five main phases, each focusing on a distinct set of morphological rules.

**Suffix Removal**: It employs a set of predefined rules to remove or rewrite common suffixes such as "ing," "ed," and the plural "s." For example:

- "running" → "run"
- "happily" → "happili" (the final "y" becomes "i"; the remaining "li" is not stripped)
- "better" → "better" (the word is left unchanged because no rule's condition is satisfied)

### Consonant-Vowel Rules

A consonant is any letter other than A, E, I, O, or U, and other than **Y** preceded by a consonant (a **Y** preceded by a consonant counts as a vowel). For instance, in the word **TOY**, the consonants are **T** and **Y**, while in **SYZYGY**, they are **S**, **Z**, and **G**. 

If a letter is not a consonant, it is classified as a vowel.

**Notation**:
- A consonant is denoted by **c** and a vowel by **v**.
- A sequence of one or more consecutive consonants is denoted by **C**, and a sequence of one or more consecutive vowels is denoted by **V**. 

Thus, any word or part of a word can be represented in one of the following forms:

- CVCV … C → e.g., collection, management
- CVCV … V → e.g., conclude, revise
- VCVC … C → e.g., entertainment, illumination
- VCVC … V → e.g., illustrate, abundance

All of these forms can be succinctly represented as:

\[
[C]VCVC \ldots [V]
\]

Here, the square brackets indicate that the enclosed consonant or vowel sequence is optional.

**Measure of the Word (m)**: 

Writing \((VC)^m\) for **VC** repeated **m** times, the form above becomes \([C](VC)^m[V]\). The value **m** is referred to as the measure of the word or word part. Examples for different values of **m** include:

- **m=0**   →   TREE, TR, EE, Y, BY
- **m=1**   →   TROUBLE, OATS, TREES, IVY
- **m=2**   →   TROUBLES, PRIVATE, OATEN, ROBBERY
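The measure can be computed directly from the definitions above. The following is a minimal sketch under those definitions; `cv_pattern` and `measure` are illustrative names and are not taken from NLTK.

```python
import re

def cv_pattern(word):
    """Map each letter to 'c' (consonant) or 'v' (vowel)."""
    pattern = []
    for i, ch in enumerate(word.lower()):
        if ch in "aeiou":
            pattern.append("v")
        elif ch == "y":
            # Y is a vowel only when the previous letter is a consonant.
            pattern.append("v" if i > 0 and pattern[i - 1] == "c" else "c")
        else:
            pattern.append("c")
    return "".join(pattern)

def measure(word):
    """Return m, the number of VC sequences in the [C](VC)^m[V] form."""
    # Collapse runs of consonants to C and runs of vowels to V.
    collapsed = re.sub(r"v+", "V", cv_pattern(word))
    collapsed = re.sub(r"c+", "C", collapsed)
    return collapsed.count("VC")

for w in ["TREE", "BY", "TROUBLE", "IVY", "TROUBLES", "ROBBERY"]:
    print(w, measure(w))
```

Running it on the examples above reproduces the listed values, for instance `measure("TREE")` is 0, `measure("IVY")` is 1, and `measure("ROBBERY")` is 2.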

<p align="center"><img src="https://vijinimallawaarachchi.com/wp-content/uploads/2017/05/stemmer1.png" alt="Stemmer" width="50%" height="10%" style="display: block; margin: 20px auto;"/></p>

### Rules for Suffix Replacement

The rules for replacing (or removing) a suffix are articulated in the following formal structure:

`(condition) S1 → S2`


Where:
- **S1** represents the suffix subject to replacement or removal.
- **S2** denotes the new suffix (which may also be null, indicating complete removal).

#### Example Rules
1. **(ends with "ed" or "ing" and has a root word)**: S1 → S2  
2. **(ends with "y" preceded by a consonant)**: S1 → S2  
3. **(ends with "ational", "tional", "enci", or "anci")**: S1 → S2  
4. **(ends with "icate", "ative", or "alize")**: S1 → S2  
5. **(ends with "ment", "ness", or "ity")**: S1 → S2  
6. **(specific irregular forms)**: S1 → S2

This means that if a word ends with the suffix **S1**, and the stem before **S1** satisfies the given condition, **S1** is replaced by **S2**. The condition is generally expressed in terms of **m** concerning the stem preceding **S1**.

**Example**:
\[
(m > 1) \, EMENT \rightarrow \text{(S2 is null)}
\]
In this instance, **S1** is ‘EMENT’ and **S2** is null. This would transform **REPLACEMENT** to **REPLAC**, since **REPLAC** is a word part for which **m = 2**.

### Conditions

The conditions may incorporate the following components:

- `*S`: the stem ends with S (and similarly for other letters)
- `*v*`: the stem contains a vowel
- `*d`: the stem ends with a double consonant (e.g., -TT, -SS)
- `*o`: the stem ends in a CVC pattern, where the second consonant is not W, X, or Y (e.g., -WIL, -HOP)

The condition part may also feature expressions using **and**, **or**, and **not**.

Examples of Conditions:
- `(m > 1 and (*S or *T))`: tests for a stem with **m > 1** ending in **S** or **T**.
- `(*d and not (*L or *S or *Z))`: tests for a stem ending with a double consonant, excluding endings with letters **L**, **S**, or **Z**.
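Each of these condition components is a small predicate on the stem. The sketch below writes them as plain Python functions under the consonant definition given earlier; the function names are hypothetical and chosen for readability, not taken from any library.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def contains_vowel(stem):
    """Condition *v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):
    """Condition *d: the stem ends with a double consonant."""
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))

def ends_cvc(stem):
    """Condition *o: the stem ends consonant-vowel-consonant, last not W, X, or Y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")

# The compound condition (*d and not (*L or *S or *Z)) from the list above.
def double_but_not_lsz(stem):
    return ends_double_consonant(stem) and stem[-1] not in "lsz"

print(ends_cvc("hop"), ends_cvc("snow"), double_but_not_lsz("hopp"))
```

The functions expect lowercase input; a fuller implementation would normalize case first.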

### How Rules Are Applied

In a set of rules written sequentially, only one rule is obeyed, specifically the one with the longest matching **S1** for the given word. For instance, consider the following rules:

- `SSES → SS`
- `IES → I`
- `SS → SS`
- `S →` (S2 is null)


Here, all conditions are null. The word **CARESSES** maps to **CARESS**, as **SSES** is the longest match for **S1**. Similarly, **CARESS** maps to **CARESS** (since **S1 = "SS"**) and **CARES** maps to **CARE** (since **S1 = "S"**).
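The longest-match behaviour for this rule group can be sketched in a few lines. This is an illustration of the selection logic only, not NLTK's implementation; `STEP_1A_RULES` and `apply_longest_match` are hypothetical names.

```python
# The four unconditional rules above as (S1, S2) pairs.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_longest_match(word, rules):
    """Apply only the rule whose S1 is the longest suffix of the word."""
    # Try longer suffixes first so that, e.g., SSES wins over SS and S.
    for s1, s2 in sorted(rules, key=lambda rule: len(rule[0]), reverse=True):
        if word.endswith(s1):
            return word[: len(word) - len(s1)] + s2
    return word  # no rule matched

for w in ["caresses", "ponies", "caress", "cares"]:
    print(w, "->", apply_longest_match(w, STEP_1A_RULES))
```

Note that the `SS → SS` rule changes nothing by itself; its job is to stop the shorter `S →` rule from firing on words like "caress".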

### Advantages of Using Porter Stemmer

- **Simplicity**: The algorithm is straightforward to implement and does not require extensive computational resources.
- **Efficiency**: It significantly reduces the size of the vocabulary, allowing models to run faster and consume less memory.
- **Performance**: In many cases, stemming can lead to improved performance in tasks such as search and classification by grouping related words.


In [22]:
# Ensure NLTK resources are available
nltk.download(info_or_id= 'punkt')
nltk.download(info_or_id= 'stopwords')
# Function to process the corpus in chunks
def process_corpus(file_path, chunk_size=100):
    """
    Processes a large text file in chunks, tokenizing, stemming,
    and removing stop words from each chunk.
    
    Args:
        file_path (str): The path to the text file.
        chunk_size (int): The number of lines to process at once.
        
    Returns:
        list: A list of stemmed words from the entire corpus,
              with stop words removed.
    """
    porter_stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))  # Set of English stop words
    all_stemmed_words = []

    with open(file_path, 'r', encoding='utf-8') as file:
        while True:
            # Read a chunk of lines from the file
            lines = [file.readline() for _ in range(chunk_size)]
            if not lines or all(line == '' for line in lines):
                break
            
            # Combine the lines into a single string
            corpus_chunk = ' '.join(lines)

            # Tokenize the text into words
            tokens = word_tokenize(corpus_chunk)

            # Remove stop words, punctuation, and perform stemming on each token
            stemmed_words = [
                porter_stemmer.stem(token) 
                for token in tokens 
                if token.lower() not in stop_words and token.isalpha()  # Check if token is alphabetic
            ]

            # Extend the list of all stemmed words
            all_stemmed_words.extend(stemmed_words)

    return all_stemmed_words

# Example usage
file_path = './LICENSE'  # Replace with your large corpus file path
stemmed_words = process_corpus(file_path)

# Display the stemmed words
print("Sample of Stemmed Words:", stemmed_words)  # Prints the full list; slice with [:20] for a shorter sample


Sample of Stemmed Words: ['mypi', 'mypyc', 'licens', 'term', 'mit', 'licens', 'reproduc', 'mit', 'licens', 'copyright', 'c', 'jukka', 'lehtosalo', 'contributor', 'copyright', 'c', 'dropbox', 'inc', 'permiss', 'herebi', 'grant', 'free', 'charg', 'person', 'obtain', 'copi', 'softwar', 'associ', 'document', 'file', 'softwar', 'deal', 'softwar', 'without', 'restrict', 'includ', 'without', 'limit', 'right', 'use', 'copi', 'modifi', 'merg', 'publish', 'distribut', 'sublicens', 'sell', 'copi', 'softwar', 'permit', 'person', 'softwar', 'furnish', 'subject', 'follow', 'condit', 'copyright', 'notic', 'permiss', 'notic', 'shall', 'includ', 'copi', 'substanti', 'portion', 'softwar', 'softwar', 'provid', 'without', 'warranti', 'kind', 'express', 'impli', 'includ', 'limit', 'warranti', 'merchant', 'fit', 'particular', 'purpos', 'noninfring', 'event', 'shall', 'author', 'copyright', 'holder', 'liabl', 'claim', 'damag', 'liabil', 'whether', 'action', 'contract', 'tort', 'otherwis', 'aris', 'connect', 

[nltk_data] Downloading package punkt to /home/p/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/p/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Disadvantages of Using Porter Stemmer:

**Over-Stemming:** The Porter Stemmer may lead to excessive reduction, where different words that should not be grouped together are erroneously stemmed to the same form. For example, "university" and "universal" are both reduced to "univers".

**Loss of Meaning:** Since stemming focuses on morphological similarity rather than semantic meaning, it can sometimes result in a loss of contextual information.

**Language Dependence:** The rules are tailored specifically for the English language and may not work well for other languages with different morphological structures.

## 4. Sentence Segmentation

## 5. Closing Thoughts