# Overview
This notebook collects my notes on the fabulous NLP book: **Handbook of Natural Language Processing**

# Classical Approaches
## The classical toolkit
The classical techniques tackle a natural language by decomposing the processing into successive steps / stages:
1. tokenization
2. lexical analysis
3. syntactic analysis
4. semantic analysis
5. pragmatic analysis

From the last stage, we proceed to extract the speaker's intended meaning.  

The book dives into each of these $5$ steps.

## Text preprocessing
Text preprocessing is the task of converting raw text (boiled down to a sequence of bits) into a sequence of meaningful linguistic units. The main techniques are:
* word segmentation: dividing a piece of text into words, which raises the question of what determines the boundaries of a word
* sentence segmentation: dividing a piece of text into sentences

These two techniques are interdependent.  
Text preprocessing should consider different aspects of the language, such as the writing system. There are $3$ main systems:
1. logographic: a symbol represents a word or a morpheme, so the system needs a large number of symbols (often thousands)
2. syllabic: a symbol represents a syllable (typically a combination of sounds)
3. alphabetic: a symbol represents a sound.  

Due to the cultural exchange taking place in the modern age, no language relies purely on a single writing system. Nevertheless, it is safe to state that English is mostly alphabetic.   
### Language identification: The Language dependence Challenge
This task does not require heavy NLP machinery. The first step is to consider the range of characters used in the text (mainly the mapping of the characters to their numerical values in the encoding system), which already narrows down the candidate languages. The second step is a bit trickier as several languages can share the same alphabet: Arabic and Persian, Swedish and Danish, European languages in general. This further distinction is mainly based on the frequency distribution of the characters in the text.
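
The two steps can be sketched as follows. This is only a minimal illustration, not the book's method: the script is guessed from the first word of each letter's Unicode name, and the candidate languages are compared through the cosine similarity of their character counts. The names `dominant_script`, `closest_language` and the tiny `TOY_REFERENCES` samples are assumptions made for this example; a realistic profile would be built from much larger texts.

```python
import math
import unicodedata
from collections import Counter

# Toy reference samples (an assumption for this sketch, far too small for real use)
TOY_REFERENCES = {
    "sv": "jag är här och det är något",
    "da": "jeg er her og det er noget",
}

def dominant_script(text: str) -> str:
    """Step 1: guess the script from the Unicode names of the letters."""
    scripts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    return scripts.most_common(1)[0][0] if scripts else "UNKNOWN"

def char_profile(text: str) -> Counter:
    """Count the letters of the text (lower-cased, NFC-normalized)."""
    normalized = unicodedata.normalize("NFC", text.lower())
    return Counter(ch for ch in normalized if ch.isalpha())

def cosine(p: Counter, q: Counter) -> float:
    """Cosine similarity between two character-count profiles."""
    dot = sum(p[ch] * q[ch] for ch in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def closest_language(text: str, references: dict) -> str:
    """Step 2: pick the reference whose character profile is the most similar."""
    profile = char_profile(text)
    return max(references, key=lambda lang: cosine(profile, char_profile(references[lang])))

print(dominant_script("hello world"))                  # LATIN
print(closest_language("det är här", TOY_REFERENCES))  # sv
```

Character counts alone are a weak signal on short inputs; counting character pairs or triples instead of single characters is a common refinement.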

### Corpus dependence


Robustness was a major challenge for earlier NLP systems. With the explosion of content on the internet, textual data is found in abundance. Nevertheless, the quality did not keep up with the quantity, as most pieces of text do not follow the grammar, punctuation and formatting rules that some of the earlier NLP systems were built upon. NLP therefore requires robust algorithms capable of handling a large number of irregularities, which stem from: 1. the difficulty of formally defining rules for producing text 2. the slim likelihood that individuals would actually follow these rules if they were ever formalized.

### Application dependence
There is no universal criterion for what constitutes a word, as the possible definitions can vary greatly with the language, the context, the writing system or even the system actually processing the text. One main example is the contraction: I am $\rightarrow$ I'm. Different corpora might treat this linguistic entity differently depending on several factors. Therefore, the representation of tokens highly depends on the final processing purpose: speech recognition, classification tasks, text generation... 
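
To make the contraction example concrete, here is a small sketch of two possible conventions: keeping the contraction as one token, or splitting off the clitic (as Penn Treebank style tokenization does, e.g. doesn't $\rightarrow$ does + n't). The function names and the `CLITIC` pattern are assumptions made for this illustration, and the pattern only covers a few English clitics.

```python
import re

# A few English clitics; illustrative only, not an exhaustive list
CLITIC = re.compile(r"(?i)(n't|'(?:m|s|re|ve|ll|d))$")

def split_contraction(word: str) -> list:
    """Split a trailing clitic off the word, if there is one."""
    match = CLITIC.search(word)
    if match and match.start() > 0:
        return [word[:match.start()], word[match.start():]]
    return [word]

def tokenize(text: str, split_clitics: bool) -> list:
    """Whitespace tokenization with an optional clitic-splitting convention."""
    tokens = []
    for word in text.split():
        tokens.extend(split_contraction(word) if split_clitics else [word])
    return tokens

print(tokenize("I'm sure she doesn't", split_clitics=False))
# ["I'm", 'sure', 'she', "doesn't"]
print(tokenize("I'm sure she doesn't", split_clitics=True))
# ['I', "'m", 'sure', 'she', 'does', "n't"]
```

Neither output is "the" correct one: which convention is preferable depends on what the downstream system does with the tokens.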

## Tokenization 
The challenges associated with segmentation can be explained by two main factors:
1. The writing system: already discussed above
2. The type of the language: how words are constructed in terms of the most basic sounds / components of the language

Ambiguities arise mainly when the exact same character, typically a delimiter, carries significantly different meanings. The full stop can denote the end of a sentence, an abbreviation or even part of a numerical representation. Such ambiguities make tokenization a challenging task.
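
The full stop ambiguity can be illustrated with a small sketch. It assumes a hand-written abbreviation list (`ABBREVIATIONS`, hypothetical and far from complete) and a regular expression that keeps decimal numbers together; every other punctuation mark becomes its own token.

```python
import re

# Hypothetical, deliberately tiny abbreviation list
ABBREVIATIONS = {"dr.", "mr.", "st.", "e.g.", "etc."}

# Numbers with internal '.' or ',' first, then words, then single punctuation marks
PIECES = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")

def tokenize(text: str) -> list:
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            # the full stop belongs to the abbreviation: keep the chunk whole
            tokens.append(chunk)
        else:
            tokens.extend(PIECES.findall(chunk))
    return tokens

print(tokenize("Dr. Smith paid 3.50 dollars."))
# ['Dr.', 'Smith', 'paid', '3.50', 'dollars', '.']
```

The sketch still fails whenever an abbreviation is missing from the list or when an abbreviation also ends the sentence, which is exactly the kind of ambiguity described above.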

* Tokenizing on white space alone is not the most reasonable approach as it does not take punctuation marks into account. The latter should in most cases be considered separate tokens regardless of their position relative to a white space.

* Even though punctuation marks should be considered stand-alone tokens, they may also be attached to certain other tokens. The main exceptions are the full stop used in abbreviations and the apostrophe in contractions such as ***doesn't*** and ***can't***. One apparent solution is to keep a list of possible abbreviations. Such an approach fails because abbreviations are productive: any expression lengthy enough can be abbreviated under certain circumstances. Furthermore, the same abbreviation could stand for several expressions: St. can mean saint, street or even state. 

* One of the most ambiguous characters is the apostrophe " ' ", as it has multiple usages such as expressing the genitive case, contractions (he's, doesn't) or even the plural of abbreviated expressions: I.D's.
* White space is uninformative when tokenizing agglutinative or multi-part words. The hyphen character can also be used to build multi-part words. This is not the only use of the hyphen, which introduces ambiguity.
* In unsegmented languages, a 'good' tokenization is even harder to reach as there is no predefined standard for what constitutes a word. Such disagreements manifest even among native speakers, who might have different definitions of the notion of a 'word'. A best-effort approach that relies on a deeper understanding of the language in question, alongside an extensive list of 'words' in that language, might lead to relatively satisfactory results. Nevertheless, the accuracy is highly affected by out-of-vocabulary expressions.

* Among the approaches to tackle the tokenization of unsegmented languages is the maximum length greedy algorithm. It is quite effective for languages such as Chinese, where most words do not exceed 3 characters in length. Starting from the first character, consider the longest sequence starting with that character that belongs to the word list. The procedure then continues from the character right after the end of the matched word. Another suggestion is the reverse maximum length algorithm, which scans from the end of the text and is quite effective itself.
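
The greedy procedure from the last bullet, and its reverse variant, can be sketched as follows. To keep the example easy to read, it uses unspaced English with a toy word list instead of Chinese; the function names and the lexicon are assumptions made for this illustration. An unknown character falls back to a single-character token.

```python
def max_match(text: str, lexicon: set) -> list:
    """Greedy left-to-right longest match against the lexicon."""
    max_len = max(map(len, lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        # try the longest candidate first; a single character always matches
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

def reverse_max_match(text: str, lexicon: set) -> list:
    """Same idea, scanning from the end of the text towards the start."""
    max_len = max(map(len, lexicon), default=1)
    tokens, j = [], len(text)
    while j > 0:
        for i in range(max(0, j - max_len), j):
            if text[i:j] in lexicon or i == j - 1:
                tokens.append(text[i:j])
                j = i
                break
    return tokens[::-1]

lexicon = {"the", "cat", "cats", "sat", "at"}
print(max_match("thecatsat", lexicon))          # ['the', 'cats', 'at']
print(reverse_max_match("thecatsat", lexicon))  # ['the', 'cat', 'sat']
```

The example was chosen so that the two directions disagree: the forward pass greedily takes "cats" and is left with "at", while the reverse pass recovers "the cat sat". Both remain sensitive to out-of-vocabulary words.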


## Sentence Segmentation
* This processing step is by no means less complex or problematic than tokenization. The fewer the punctuation marks, the more challenging this step is. Even for languages with a relatively rich punctuation system, sentence segmentation is quite problematic. Defining a set of punctuation marks as sentence boundaries is only a step towards the solution, as the same full stop could mark the end of a sentence, part of an abbreviation or even part of a numerical representation. Identifying abbreviations is quite a challenging task as it is corpus-dependent.
* In well-behaved corpora, a rule-based approach can lead to pretty good results.
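
A rule-based segmenter for a well-behaved English corpus might look like the sketch below. The three rules (the word ends with `.`, `!` or `?`, it is not a known abbreviation, and the next word starts with an upper-case letter) and the `ABBREVIATIONS` list are assumptions chosen for this illustration, not rules taken from the book.

```python
# Hypothetical, deliberately tiny abbreviation list
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "st.", "e.g.", "i.e.", "etc."}

def split_sentences(text: str) -> list:
    sentences, current = [], []
    words = text.split()
    for idx, word in enumerate(words):
        current.append(word)
        ends_with_mark = word[-1] in ".!?"
        is_abbreviation = word.lower() in ABBREVIATIONS
        # the last word of the text always closes the sentence
        next_is_upper = idx + 1 == len(words) or words[idx + 1][0].isupper()
        if ends_with_mark and not is_abbreviation and next_is_upper:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

text = "Dr. Smith paid 3.50 dollars. He left at 5 p.m. sharp! Did he?"
print(split_sentences(text))
# ['Dr. Smith paid 3.50 dollars.', 'He left at 5 p.m. sharp!', 'Did he?']
```

The rules break as soon as the corpus stops being well-behaved: a missing capital letter, a closing quote after the full stop, or an abbreviation followed by a proper noun all lead to wrong boundaries.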

## Byte Pair Encoding: BPE Tokenizer
Less talk, more code... The code below is meant to build understanding and is not a general implementation 

In [2]:
import re
from collections import defaultdict
from collections import Counter
from copy import copy

# first let's define the initial vocabulary produced from the input text / corpus.
def build_initial_vocab(text: str):
    """return the set of characters in the text and the text as a list of tokens (each a list of characters)"""
    # the given text will be first tokenized by spaces
    tokens = re.split(r'\s+', text)

    # convert each token to lower case, remove extra spaces and filter out empty tokens
    tokens = [list(t.lower().strip()) for t in tokens if t.lower().strip()]
    
    # the vocabulary is the set of all the characters seen in the tokens
    vocab = set()
    for t in tokens:
        vocab.update(t)
    
    return vocab, tokens
    

In [None]:
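def _get_new_token(token: list, pair: str):
    """merge every occurrence of the given pair inside a single token (a list of symbols)"""
    # fewer than two symbols left: nothing to merge
    if len(token) <= 1:
        return token
    
    # the first two symbols form the pair: merge them and continue after them
    if token[0] + token[1] == pair:
        return [pair] + _get_new_token(token[2:], pair)

    # otherwise keep the first symbol and continue with the rest of the token
    return [token[0]] + _get_new_token(token[1:], pair)

def next_merge(current_vocab: set, tokens: list, equal_freq=None):
    """this function adds the most frequent contiguous pair of symbols to the current vocabulary

    Args:
        current_vocab (set): the current vocabulary
        tokens (list): the corpus represented as a list of tokens
        equal_freq: a key function used to sort the most frequent pairs when several are tied. The pair chosen will be the first one.

    Returns:
        the updated vocabulary and the tokens after merging the chosen pair
    """

    # let's create a counter to keep the occurrences of the different pairs
    # note: a pair is stored as the concatenated string, so the boundary between its two symbols is not kept
    pairs_counters = Counter()
    for t in tokens: 
        for i in range(len(t) - 1):
            # Counter.update(str) would count the individual characters, so increment the key directly
            pairs_counters[t[i] + t[i + 1]] += 1

    # no pair left to merge: every token is already a single symbol
    if not pairs_counters:
        return current_vocab, tokens
    
    # determine the most frequent pair(s):
    max_count = max(pairs_counters.values())
    most_freq = [pair for pair, count in pairs_counters.items() if count == max_count]
    
    # sort the tied pairs according to the equal_freq criteria
    most_freq = sorted(most_freq, key=equal_freq) # if equal_freq is None, the lexicographical order will be used

    # save the chosen pair
    final_pair = most_freq[0]

    # merge the chosen pair in every token where it occurs
    new_tokens = [_get_new_token(t, final_pair) for t in tokens]

    return current_vocab | {final_pair}, new_tokens


def train_tokenizer_num_merges(ini_vocab: set, tokens: list, num_merges:int):
    """run a fixed number of merge steps starting from the initial vocabulary

    Args:
        ini_vocab (set): the initial vocabulary
        tokens (list): the tokens extracted from the text
        num_merges (int): the number of merges needed

    Returns:
        the final vocabulary and the tokens after all the merges
    """
    vocab, current_tokens = set(ini_vocab), tokens
    for _ in range(num_merges):
        vocab, current_tokens = next_merge(vocab, current_tokens)

    return vocab, current_tokens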

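Once merges have been learned, a BPE tokenizer encodes a new word by replaying them in the order they were learned. The sketch below is self-contained and independent of the cells above: the name `apply_merges` and the representation of a merge as a `(left, right)` tuple are assumptions made for this illustration, and the merge list is written by hand rather than learned.

```python
def apply_merges(word: str, merges: list) -> list:
    """Split the word into characters, then replay each merge in order."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            # merge the two adjacent symbols when they match the current rule
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# hand-written merges, in the order a trainer might have produced them
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(apply_merges("lower", merges))  # ['low', 'er']
print(apply_merges("slow", merges))   # ['s', 'low']
```

Keeping the two halves of each merge separate avoids the ambiguity of a concatenated string, where "low" could come from "l" + "ow" or "lo" + "w".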