# Chapter 23 - Natural Language Processing (NLP)

*In which we see how a computer can use natural language to communicate with humans
and learn from what they have written.* - Stuart Russell and Peter Norvig, *Artificial Intelligence: A Modern Approach*

<img src="https://raw.githubusercontent.com/ValRCS/RBS_PBM773_Introduction_to_AI/main/img/ch23_nlp/DALL%C2%B7E%202024-03-25%2022.42.33%20-%20An%20illustration%20depicting%20an%20ancient%20human%20dwelling%20in%20a%20cave%2C%20gathered%20around%20a%20fire.%20The%20focus%20is%20on%20a%20group%20of%20early%20humans%20engaged%20in%20a%20moment%20of%20.webp" width="500">

## **Introduction to Natural Language Processing** 
- Human language, both spoken and written, is a defining characteristic of Homo sapiens, setting us apart from all other species due to its complexity and diversity.
- Language is central to human intelligence and behavior, as highlighted by Alan Turing's intelligence test. It is the primary medium for communicating knowledge and intentions. 
- Natural Language Processing (NLP) is crucial for:
    - **Communication with humans:**  It's often more convenient and natural to interact with computers using spoken or natural language rather than formal languages like first-order predicate calculus.
    - **Learning:**  Much of human knowledge is documented in natural language (e.g., Wikipedia). To harness this vast repository of information, computational systems need to understand natural language.
    - **Scientific advancement:**  Studying NLP aids in the broader understanding of languages and language use, integrating AI with linguistics, cognitive psychology, and neuroscience.
- The chapter focuses on mathematical models of language and explores various tasks that can be performed using these models.

##  **23.1 Language Models**  
- **Difference Between Formal and Natural Languages:**  Formal languages, like first-order logic, have precisely defined syntax and semantics, whereas natural languages exhibit variability, ambiguity, and a lack of formal definition in mapping symbols to objects. 
- **Variability and Ambiguity:**  Natural language varies between individuals and over time, and can be ambiguous (e.g., "He saw her duck" has multiple interpretations). 
- **Language Models Defined:**  A language model is a probability distribution over strings of text, used to determine the likelihood of a given sequence of words. It helps predict likely word sequences, suggest completions, and make corrections. 
- **Applications of Language Models:**  They are crucial for a variety of NLP tasks, including text completion, spelling and grammar correction, translation, and answering questions. Language models serve as benchmarks for measuring progress in understanding natural languages. 
- **Inherent Complexity and Approximation:**  Any language model is an approximation due to the complex nature of natural languages. Famous quotes by Edward Sapir and Donald Davidson highlight the challenges in creating a definitive language model, emphasizing that while individual models may be imperfect, they remain useful for communication and various computational tasks.

###  **23.1.1 The Bag-of-Words Model**  
- **Foundation in Naive Bayes:**  The bag-of-words model extends the naive Bayes approach from classifying sentences by topic (e.g., business or weather) to generating a joint probability distribution over sentences and their categories, considering the presence of specific words as independent events. 
- **Generative Process Description:**  The model imagines a separate "bag" of words for each category, from which words are drawn at random to generate sentences. It is simplistic and assumes word independence, so it does not produce coherent sentence structures, but it is useful for classification tasks. 
- **Model Limitations and Accuracy:**  Despite its incorrect assumption of word independence, the bag-of-words model can classify texts with good accuracy by relying on word frequencies related to specific categories. 
- **Learning from Corpora:**  The prior probability of each category and the conditional probability of each word given a category are learned from large text corpora, using counts to estimate how likely a word is to appear in a given category. 
- **Comparison with Other Machine Learning Approaches:**  Other methods like logistic regression, neural networks, and support vector machines might outperform naive Bayes on certain tasks. Feature vectors in these models are often large and sparse, containing the frequencies or presence of words from the vocabulary. 
- **Feature Selection and Additional Features:**  Improvements in model performance can come from selecting a subset of words as features and incorporating non-word features (e.g., sender information in emails, time sent, punctuation usage). 
- **Tokenization Challenges:**  Identifying what constitutes a word (e.g., handling contractions) is a non-trivial problem that impacts the model's input, requiring careful text tokenization.
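
The counting-based learning described above can be sketched as a tiny naive Bayes classifier. This is a minimal illustration, not the textbook's implementation: the four training sentences, the category names, and the helper `classify` are invented for the example, and add-one smoothing is assumed so that unseen words do not produce zero probabilities.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (text, category) pairs invented for illustration.
training = [
    ("stocks rallied on strong earnings", "business"),
    ("the company reported quarterly earnings", "business"),
    ("heavy rain and wind expected today", "weather"),
    ("sunny skies and light wind tomorrow", "weather"),
]

class_counts = Counter()            # number of documents per category
word_counts = defaultdict(Counter)  # word frequencies per category
for text, category in training:
    class_counts[category] += 1
    word_counts[category].update(text.lower().split())

vocab = {word for counts in word_counts.values() for word in counts}

def classify(text):
    """Return the category with the highest naive Bayes log-probability."""
    scores = {}
    for category in class_counts:
        # log P(category), estimated from document counts
        score = math.log(class_counts[category] / sum(class_counts.values()))
        total = sum(word_counts[category].values())
        for word in text.lower().split():
            # log P(word | category) with add-one smoothing (an assumption here)
            score += math.log((word_counts[category][word] + 1) / (total + len(vocab)))
        scores[category] = score
    return max(scores, key=scores.get)

print(classify("earnings rallied"))  # business
print(classify("rain tomorrow"))     # weather
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.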


### Tokenization examples

Tokenization is the process of breaking a text into words, phrases, symbols, or other meaningful elements. The resulting tokens are then used for further processing such as parsing or text mining. Tokenization is also known as word segmentation.



In [2]:
# let's tokenize a simple sentence
# the most basic tokenization is to split the sentence by spaces
sentence = "We are learning basics of Artificial Intelligence today."
tokens = sentence.split()  # with no argument, split() breaks on any run of whitespace (spaces, tabs, newlines)
print(tokens)


['We', 'are', 'learning', 'basics', 'of', 'Artificial', 'Intelligence', 'today.']


In [None]:
# so this simple tokenization works well for English, but not for all languages
# also, it doesn't handle punctuation well
# another issue is that We is uppercase but we really want to treat it as lowercase
# so often we perform additional preprocessing steps
# these could include:
# - converting to lowercase
# - removing punctuation
# - removing stopwords - common words that don't carry much meaning (e.g. "the", "a", "is")
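
The preprocessing steps listed above could look roughly like this. It is a sketch under stated assumptions: the `STOPWORDS` set is a tiny hand-picked list for illustration (real stopword lists are much longer), and the helper name `preprocess` is hypothetical.

```python
import string

# Small illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "is", "are", "of", "we"}

def preprocess(text):
    """Lowercase, strip punctuation from token edges, and drop stopwords."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)
        if token and token not in STOPWORDS:
            tokens.append(token)
    return tokens

print(preprocess("We are learning basics of Artificial Intelligence today."))
# ['learning', 'basics', 'artificial', 'intelligence', 'today']
```

Stripping punctuation only from token edges keeps internal characters such as the apostrophe in "don't", which is one of several reasonable design choices.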

In [None]:
# so for Latvian we might want to use a more sophisticated tokenizer
# one approach is to use the NLTK library - which does not have a good Latvian tokenizer
# the same goes for the spaCy library
# for now best would be to use a tool like https://nlp.ailab.lv/ 


In [None]:
# so token is not necessarily a word, it could be a subword or even just a character

# for training LLMs we often use subword tokenization such as BPE or SentencePiece
# BPE stands for Byte Pair Encoding

# BPE works as follows:
# 1. start with a vocabulary of characters
# 2. count the frequency of each pair of adjacent symbols in the corpus
# 3. merge the most frequent pair into a new token
# 4. repeat until we have the desired vocabulary size

# typical vocabulary size is 32k or 64k tokens
# compare with 26 letters in the English alphabet
# also compare with words in English - there are about 170k words in the Oxford English Dictionary
# so BPE is a good compromise between characters and words
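
The four BPE steps above can be sketched in a few lines. This is a simplified, word-level illustration rather than a production tokenizer: the function name `learn_bpe` and the toy word list are assumptions, and real implementations add details such as end-of-word markers and byte-level fallback.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Step 1: represent every word as a tuple of single-character symbols.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Step 3: merge the most frequent pair into a new symbol.
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus  # Step 4: repeat on the rewritten corpus.
    return merges, corpus

merges, corpus = learn_bpe(["low", "low", "lower", "lowest", "newest", "widest"], num_merges=3)
print(merges)        # the learned merge rules, in order
print(list(corpus))  # each word as a sequence of subword symbols
```

Here the loop stops after a fixed number of merges; stopping at a target vocabulary size, as described above, is equivalent up to the size of the initial character set.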

In [3]:
# so let's create a simple BOW model out of some documents:
# 1. tokenize the document
# 2. create a vocabulary
# 3. create a BOW vector for each document
# 4. can model the document as a vector in a high-dimensional space

# let's start with a simple example
# we have a corpus of 3 documents
# we will use a simple tokenizer that splits by space
# we will use a vocabulary of 6 words
# we will create a BOW vector for each document

# first let's tokenize the documents
doc1 = "We are learning basics of Artificial Intelligence today."
doc2 = "We are learning basics of Machine Learning today."
doc3 = "We are learning basics of Deep Learning today."

# we will use a simple tokenizer that splits by space
tokens1 = doc1.split()
tokens2 = doc2.split()
tokens3 = doc3.split()

# next we will create a vocabulary
# we will use a vocabulary of 6 words
# we will use the most common 6 words from the 3 documents
# we will also add a special token for unknown words
vocabulary = ["are", "basics", "learning", "of", "today", "we", "<UNK>"]

# next we will create a BOW vector for each document
# we will use a simple Python dictionary to represent the BOW vector
# the keys will be the words in the vocabulary
# the values will be the count of each word in the document
# we will use the special token <UNK> for unknown words
bow1 = {word: 0 for word in vocabulary}
bow2 = {word: 0 for word in vocabulary}
bow3 = {word: 0 for word in vocabulary}

for token in tokens1:
    if token in vocabulary:
        bow1[token] += 1
    else:
        bow1["<UNK>"] += 1

for token in tokens2:
    if token in vocabulary:
        bow2[token] += 1
    else:
        bow2["<UNK>"] += 1

for token in tokens3:
    if token in vocabulary:
        bow3[token] += 1
    else:
        bow3["<UNK>"] += 1

# we could have used a more efficient way to create the BOW vectors or used
# CountVectorizer from scikit-learn library
        
print(bow1)
print(bow2)
print(bow3)



{'are': 1, 'basics': 1, 'learning': 1, 'of': 1, 'today': 0, 'we': 0, '<UNK>': 4}
{'are': 1, 'basics': 1, 'learning': 1, 'of': 1, 'today': 0, 'we': 0, '<UNK>': 4}
{'are': 1, 'basics': 1, 'learning': 1, 'of': 1, 'today': 0, 'we': 0, '<UNK>': 4}


### BoW conclusions

* The bag-of-words model is a simple and effective way to represent text data for machine learning models.
* In the example, the three vectors came out identical because the documents use the vocabulary words identically and each contains the same number of unknown words.
* The bag-of-words model does not consider the order of words in a document; it only considers their frequencies.
* Finally, a lack of normalization and punctuation removal can produce multiple tokens for the same word. In the example above, *we* and *today* were counted as unknown because "We" and "today." did not match the vocabulary entries.
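
To illustrate the last point, here is a sketch of the same three documents with lowercasing and punctuation stripping applied before counting. The helper name `bow_vector` is hypothetical; the documents and vocabulary are the ones used in the example above.

```python
import string
from collections import Counter

docs = [
    "We are learning basics of Artificial Intelligence today.",
    "We are learning basics of Machine Learning today.",
    "We are learning basics of Deep Learning today.",
]
vocabulary = ["are", "basics", "learning", "of", "today", "we", "<UNK>"]

def bow_vector(text, vocabulary):
    """Count vocabulary words after lowercasing and stripping punctuation."""
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    counts = Counter(t if t in vocabulary else "<UNK>" for t in tokens if t)
    return {word: counts[word] for word in vocabulary}

for doc in docs:
    print(bow_vector(doc, vocabulary))
# {'are': 1, 'basics': 1, 'learning': 1, 'of': 1, 'today': 1, 'we': 1, '<UNK>': 2}
# {'are': 1, 'basics': 1, 'learning': 2, 'of': 1, 'today': 1, 'we': 1, '<UNK>': 1}
# {'are': 1, 'basics': 1, 'learning': 2, 'of': 1, 'today': 1, 'we': 1, '<UNK>': 1}
```

With normalization, the first document is now distinguishable from the other two, although the second and third still collapse to the same vector because "machine" and "deep" both map to `<UNK>`.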

###  **23.1.2 N-gram Word Models**  
- **Limitations of Bag-of-Words Model:**  The bag-of-words model can't distinguish contexts where the same word appears in different categories, due to its assumption of word independence. 
- **Introduction of N-gram Models:**  N-gram models address these limitations by considering each word's dependency on its preceding words, allowing for more context-aware analysis. 
- **Dependency Representation:**  In an ideal model, a word would depend on all previous words in a sentence, but this is impractical due to the vast number of parameters needed. N-gram models offer a compromise by limiting dependency to the n−1 previous words. 
- **Markov Chain Model:**  N-gram models use a Markov chain approach, treating sentences as sequences of dependent words. This significantly reduces complexity by focusing on local, rather than global, word dependencies. 
- **Types of N-grams:**  The model differentiates between unigrams (single words), bigrams (two-word sequences), and trigrams (three-word sequences), with each word's probability conditioned on the presence of the previous n−1 words. 
- **Applications:**  N-gram models are effective for various classification tasks beyond newspaper-section classification, including spam detection, sentiment analysis, and author attribution, by capturing stylistic and thematic nuances in text sequences.
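
The dependency limit described above can be written compactly. The notation is an assumption introduced here for illustration: $w_{1:N}$ denotes a sequence of $N$ words, and the first product is the exact chain rule while the second is the n-gram (Markov) approximation.

```latex
P(w_{1:N}) \;=\; \prod_{j=1}^{N} P(w_j \mid w_{1:j-1})
\;\approx\; \prod_{j=1}^{N} P(w_j \mid w_{j-n+1:j-1})
```

For a bigram model ($n = 2$) each factor reduces to $P(w_j \mid w_{j-1})$, and for a unigram model ($n = 1$) to $P(w_j)$.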

In [4]:
# in real life we could use libraries such as NLTK or spaCy for n-grams.

# for now I will show you how to create bigrams from a sentence
# we will use a simple tokenizer that splits by space
sentence = "We are learning basics of Artificial Intelligence today. We are awesome!"
tokens = sentence.split()
# pythonic approach to create bigrams
bigrams = [(first,second) for first, second in zip(tokens[:-1], tokens[1:])]
print(bigrams)


[('We', 'are'), ('are', 'learning'), ('learning', 'basics'), ('basics', 'of'), ('of', 'Artificial'), ('Artificial', 'Intelligence'), ('Intelligence', 'today.'), ('today.', 'We'), ('We', 'are'), ('are', 'awesome!')]


In [5]:
# now let's count them using Counter
from collections import Counter
bigram_counts = Counter(bigrams)
print(bigram_counts)

Counter({('We', 'are'): 2, ('are', 'learning'): 1, ('learning', 'basics'): 1, ('basics', 'of'): 1, ('of', 'Artificial'): 1, ('Artificial', 'Intelligence'): 1, ('Intelligence', 'today.'): 1, ('today.', 'We'): 1, ('are', 'awesome!'): 1})
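
Counts like these can be turned into conditional probabilities. The sketch below uses the maximum-likelihood estimate, count(previous, word) / count(previous); the helper name `bigram_probability` is hypothetical and no smoothing is applied.

```python
from collections import Counter

tokens = "We are learning basics of Artificial Intelligence today. We are awesome!".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
# How often each word occurs as the first element of a bigram.
context_counts = Counter(tokens[:-1])

def bigram_probability(previous, word):
    """Maximum-likelihood estimate of P(word | previous)."""
    if context_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / context_counts[previous]

print(bigram_probability("We", "are"))        # 1.0
print(bigram_probability("are", "learning"))  # 0.5
```

Any bigram that never occurred gets probability zero under this estimate, which is the problem smoothing is meant to address.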


###  **23.1.3 Other N-gram Models**  
- **N-gram Word Models:**  These models address limitations of the bag-of-words model by considering sequences of words (n-grams) rather than individual words in isolation. They capture dependencies between adjacent words, improving the ability to distinguish between contexts (e.g., "first quarter earnings report" in business vs. "fourth quarter touchdown passes" in sports). 
- **Character-level Models:**  An alternative n-gram approach that models the probability of each character based on the previous n−1 characters. It's particularly effective for languages that concatenate words and for dealing with unknown words. These models excel in tasks like language identification and classifying unique names or terms, achieving high accuracy even with short texts. 
- **Skip-gram Models:**  Skip-gram models count words that are near each other but with skips over one or more words between them. This approach is useful for capturing more nuanced language patterns, such as conjugation and negation relationships, by considering non-adjacent word pairs. 
- **Applications of N-gram Models:**  N-gram and skip-gram models are valuable for various classification tasks, including distinguishing newspaper sections, detecting spam, analyzing sentiment, attributing authorship, identifying languages, and classifying names or terms. These models offer improved context sensitivity over simpler models, leading to enhanced performance in these areas.
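
Character-level n-grams and skip-grams are both easy to extract. The following is a minimal sketch with hypothetical helper names; note that `skip_bigrams` as written also returns adjacent pairs (zero skips) alongside pairs with up to `max_skip` words between them.

```python
def char_ngrams(text, n):
    """All overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def skip_bigrams(tokens, max_skip):
    """Word pairs separated by at most `max_skip` intervening words."""
    pairs = []
    for i, first in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((first, tokens[j]))
    return pairs

print(char_ngrams("kitten", 3))
# ['kit', 'itt', 'tte', 'ten']
print(skip_bigrams("je ne comprends pas".split(), max_skip=1))
# [('je', 'ne'), ('je', 'comprends'), ('ne', 'comprends'), ('ne', 'pas'), ('comprends', 'pas')]
```

The pair `('ne', 'pas')` shows how a skip-gram can capture the French negation pattern even though the two words are not adjacent.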

###  **23.1.4 Smoothing N-gram Models**  
- **Addressing Variance:**  High-frequency n-grams have stable probability estimates, while low-frequency n-grams suffer from high variance due to randomness. Smoothing techniques aim to mitigate this variance, improving model performance. 
- **Handling Unknown Words:**  To model out-of-vocabulary (unknown) words, training corpora are modified by replacing infrequent words with a special symbol (`<UNK>`), allowing for estimation of their probabilities. Additional symbols like `<NUM>` for numbers or `<EMAIL>` for email addresses can also be used. 
- **Unseen N-grams:**  Even after accounting for unknown words, the challenge of unseen n-grams remains. These are sequences that have never appeared in the training set but could appear in test data. Smoothing distributes some probability mass to these unseen n-grams, aiming for a more accurate model representation. 
- **Laplace Smoothing:**  A basic form of smoothing that adds one to the count of all n-grams, including unseen ones, to avoid zero probabilities. However, it often performs poorly in natural language tasks due to its simplicity. 
- **Backoff and Interpolation Models:**  Backoff models reduce to (n−1)-grams when encountering low-frequency or unseen n-grams. Linear interpolation smoothing combines different n-gram models (trigram, bigram, unigram) with weighted averages, adjusting the weights based on the presence and frequency of n-grams. 
- **Advanced Smoothing Techniques:**  Researchers have developed sophisticated smoothing methods, such as Witten-Bell and Kneser-Ney, alongside simpler approaches like "stupid backoff," both aimed at reducing variance. The choice between sophisticated techniques and accumulating larger corpora for simpler methods reflects ongoing research into optimizing language model performance.
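
Two of the techniques above, Laplace smoothing and linear interpolation, can be sketched for a bigram model as follows. The toy sentence, the helper names, and the fixed weight `lam=0.7` are assumptions for illustration; in practice the interpolation weights would be tuned on held-out data.

```python
from collections import Counter

tokens = "we are learning the basics of ai today and we are happy".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])  # occurrences of each word as a bigram's first element
vocab_size = len(unigram_counts)
total_tokens = len(tokens)

def laplace_bigram(previous, word):
    """Add-one (Laplace) estimate of P(word | previous)."""
    return (bigram_counts[(previous, word)] + 1) / (context_counts[previous] + vocab_size)

def interpolated_bigram(previous, word, lam=0.7):
    """Weighted average of the bigram and unigram maximum-likelihood estimates."""
    unigram = unigram_counts[word] / total_tokens
    context = context_counts[previous]
    bigram = bigram_counts[(previous, word)] / context if context else 0.0
    return lam * bigram + (1 - lam) * unigram

print(laplace_bigram("we", "are"))         # 0.25
print(laplace_bigram("we", "happy"))       # unseen bigram, but non-zero
print(interpolated_bigram("we", "happy"))  # falls back on the unigram estimate
```

Both estimators give the unseen bigram ("we", "happy") a non-zero probability, which a plain count-based estimate would not.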

###  **23.1.5 Word Representations**  
- **N-gram Model Limitations:**  While n-grams can accurately predict the likelihood of word sequences based on their frequency in training corpora, they miss out on the inherent patterns of language that native speakers understand, such as grammatical structures (e.g., article-adjective-noun). 
- **Beyond Surface-level Analysis:**  Native speakers recognize patterns and relations between words that n-gram models, treating each word as an atomic unit without internal structure, cannot capture. For instance, understanding the grammatical correctness of "the fulvous kitten" despite not having encountered "fulvous" before. 
- **Generalization through Structure:**  The n-gram model's inability to generalize beyond direct word sequence occurrences is a significant limitation. Factored or structured models, which account for the internal structure and relationships between words, can offer better generalization. Word embeddings are an example of such models, providing a richer, multidimensional representation of word meanings and relations. 
- **WordNet as a Structured Word Model:**  WordNet, a hand-curated dictionary, exemplifies a structured word model, offering categorizations and relations (e.g., hypernyms and hyponyms) among words. However, while useful for distinguishing word senses and basic categorizations, WordNet does not convey the full semantic richness or contextual usage details of words. 
- **Future Directions:**  The text hints at the exploration of more expressive models like word embeddings in subsequent sections, indicating a move towards understanding language in a more nuanced and comprehensive manner, akin to human language comprehension.

### **23.1.6 Part-of-Speech (POS) Tagging**  
- **Fundamental Task in NLP:**  POS tagging involves assigning each word in a sentence its appropriate part of speech (e.g., noun, verb, adjective), a basic yet crucial step for many NLP applications, from text-to-speech synthesis to machine translation. The complexity arises from the diversity in the classification of parts of speech, exemplified by the 45 tags used in the Penn Treebank. 
- **Hidden Markov Models (HMM):**  One traditional approach to POS tagging is using HMMs, which predict the sequence of POS tags based on the sequence of words (the observed states) and the transitions between different POS tags (the hidden states). HMMs, despite their simplification of language complexity to transitions and emissions, achieve high accuracy in tagging, partly thanks to algorithms like Viterbi for finding the most probable tag sequences. 
- **Transition and Sensor Models:**  These are integral to HMMs for POS tagging, with the transition model capturing the likelihood of one POS following another, and the sensor model reflecting the probability of a word being associated with a particular POS. The effectiveness of HMMs hinges on these probabilistic models derived from corpus counts and refined through smoothing techniques. 
- **Logistic Regression for POS Tagging:**  An alternative method, logistic regression allows for the incorporation of a rich set of features about words and their context, surpassing the HMM's limitations in capturing linguistic nuances. This model assigns probabilities to POS tags based on features like word identity, spelling patterns, and contextual information. 
- **Feature-rich Models:**  The ability of logistic regression to utilize a vast array of features, from the morphological characteristics of words to their positional information within a sentence, facilitates a more nuanced understanding and classification of words according to their parts of speech. 
- **Generative vs. Discriminative Models:**  While HMMs are generative models capable of producing random sequences of words and tags, logistic regression is a discriminative model focused on tagging given sequences. The choice between generative and discriminative approaches often depends on the specific requirements of the task, including the availability of training data and the need for speed or accuracy. 
- **Greedy and Beam Search Methods:**  For sequence classification tasks like POS tagging, strategies for traversing the sequence of words to assign tags vary in their trade-offs between speed and accuracy. Greedy search makes irreversible choices based on local maxima, while beam search and the Viterbi algorithm offer more comprehensive explorations of possible tag sequences, balancing computational efficiency with tagging accuracy.
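
The HMM approach can be sketched with a small Viterbi decoder. Everything numeric below is hypothetical: the three tags and the start, transition, and emission (sensor model) probabilities are invented for illustration, whereas a real tagger would estimate them from a tagged corpus and apply smoothing.

```python
import math

# Hypothetical toy model: all probabilities are invented for illustration.
tags = ["Det", "Noun", "Verb"]
start_p = {"Det": 0.6, "Noun": 0.3, "Verb": 0.1}
trans_p = {  # transition model: P(next tag | current tag)
    "Det": {"Det": 0.05, "Noun": 0.9, "Verb": 0.05},
    "Noun": {"Det": 0.1, "Noun": 0.3, "Verb": 0.6},
    "Verb": {"Det": 0.5, "Noun": 0.4, "Verb": 0.1},
}
emit_p = {  # sensor model: P(word | tag)
    "Det": {"the": 0.9, "a": 0.1},
    "Noun": {"dog": 0.4, "duck": 0.4, "saw": 0.2},
    "Verb": {"saw": 0.6, "duck": 0.3, "barks": 0.1},
}

def log(p):
    """Logarithm that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    # Each column maps tag -> (best log-probability ending in that tag, previous tag).
    best = [{t: (log(start_p[t]) + log(emit_p[t].get(words[0], 0)), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] + log(trans_p[p][t])) for p in tags),
                key=lambda item: item[1],
            )
            column[t] = (score + log(emit_p[t].get(word, 0)), prev)
        best.append(column)
    # Follow the backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

print(viterbi("the dog saw the duck".split()))
# ['Det', 'Noun', 'Verb', 'Det', 'Noun']
```

Although "saw" and "duck" can each be a noun or a verb in this toy model, the transition probabilities resolve the ambiguity from context.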


###  **23.1.7 Comparing Language Models**  
- **N-gram Models' Performance:**  Experimentation with unigram (bag-of-words), bigram, trigram, and 4-gram models on the text of this book produced increasingly coherent sequences of words. While the unigram model generated nonsensical strings of words, the 4-gram model produced sentences that, although imperfect, were significantly more structured and meaningful, suggesting a closer approximation to English or the specific content of an AI textbook. 
- **Incorporating Diverse Text Sources:**  Enhancing the 4-gram model with the text from the King James Bible illustrated how models could generate interesting, albeit sometimes incongruous, text by blending styles and content from different sources. This mix resulted in sentences that were grammatically more complex and thematically varied, yet sometimes amusingly anachronistic or contextually jarring. 
- **Limits of N-gram Models:**  Despite improvements in fluency with higher n values, n-gram models have limitations. Specifically, as n increases, the models tend to regurgitate long excerpts from the training data, limiting their ability to generate novel content. This replication effect underscores the inherent limitation of relying solely on local word sequence probabilities without deeper understanding or context. 
- **Advancements in Language Modeling:**  The text hints at the evolution towards more sophisticated language models that employ complex representations of words and contexts, moving beyond the constraints of n-gram models. These advanced models, including deep learning approaches like GPT-2 and transformer models, show promise in generating more fluent, contextually aware, and grammatically coherent text. 
- **Deep Learning and Transformers:**  Highlighted models like GPT-2 and CTRL (Conditional Transformer Language Model) exemplify the cutting edge in language modeling, capable of producing text that is not only grammatically fluent but also contextually relevant to given prompts. However, while these models demonstrate impressive linguistic capabilities, they sometimes lack the ability to develop and advance a coherent argument or narrative across multiple sentences. 
- **Implications for NLP:**  The exploration of various language models, from simple n-grams to advanced deep learning-based transformers, reflects ongoing efforts to create systems that more accurately mimic human language understanding and production. These advancements suggest a future where machines can interact, reason, and generate content with increasing sophistication, though challenges remain in achieving true linguistic and contextual comprehension.
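
The text-generation experiment with n-gram models described above can be imitated on a very small scale. This sketch assumes a bigram model over an invented toy sentence; the helper names `build_bigram_model` and `generate` are hypothetical, and the output depends on the random seed.

```python
import random
from collections import defaultdict

def build_bigram_model(tokens):
    """Map each word to the list of words observed immediately after it."""
    followers = defaultdict(list)
    for first, second in zip(tokens, tokens[1:]):
        followers[first].append(second)
    return followers

def generate(followers, start, max_words, rng):
    """Sample words one at a time, each conditioned only on the previous word."""
    words = [start]
    while len(words) < max_words and followers.get(words[-1]):
        # Repeated followers appear several times in the list, so choice()
        # samples in proportion to the bigram counts.
        words.append(rng.choice(followers[words[-1]]))
    return words

text = "the cat sat on the mat and the dog sat on the rug"
model = build_bigram_model(text.split())
print(" ".join(generate(model, "the", 8, random.Random(0))))
```

Every adjacent pair in the output was seen in the training text, which is why larger n and small corpora lead to the regurgitation effect noted above.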

##  **23.2 Grammar**  
- **Grammar in Natural Language:**  Unlike the formal structure of first-order logic, natural languages exhibit flexibility in sentence structure without strict boundaries between allowable and non-allowable sentences. Despite this, natural languages possess hierarchical structures essential for understanding syntax and semantics. Words form part of larger syntactic categories like noun phrases or verb phrases, which in turn contribute to the overall meaning of a sentence. 
- **Probabilistic Context-Free Grammar (PCFG):**  A popular model for capturing the hierarchical syntactic structure of language is the probabilistic context-free grammar. PCFGs assign a probability to each string, indicating how likely it is to occur in the language. The term "context-free" means that a rule can rewrite a non-terminal symbol regardless of the symbols that surround it. 
- **Grammar Rules and Probabilities:**  PCFGs utilize rules that define how sentences can be constructed from smaller parts, with probabilities indicating the likelihood of each rule's application. For example, an adjective might directly form an adjective phrase with a certain probability or combine with another adjective to form a longer phrase with a different probability. 
- **Limitations of PCFGs:**  While PCFGs offer a structured approach to understanding natural language, they are not without limitations. They can overgenerate, producing non-grammatical sentences, and undergenerate, failing to account for valid sentences outside their defined rules. This gap highlights the challenge of creating a comprehensive model that fully captures the nuances of natural language. 
- **Learning and Improving Grammar Models:**  The process of refining PCFGs involves learning from linguistic data to better model the complexities of natural language. This ongoing effort aims to reduce overgeneration and undergeneration by expanding and adjusting the set of grammar rules and their associated probabilities based on empirical evidence.
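
Under a PCFG, the probability of a parse tree is the product of the probabilities of the rules used to build it. The sketch below assumes a tiny hypothetical grammar: the rules, their probabilities, and the nested-tuple tree encoding are all invented for illustration.

```python
# Hypothetical rule probabilities; for each left-hand side they sum to 1.
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Article", "Noun")): 0.6,
    ("NP", ("Article", "Adjs", "Noun")): 0.4,
    ("Adjs", ("Adjective",)): 0.8,
    ("Adjs", ("Adjective", "Adjs")): 0.2,
    ("VP", ("Verb",)): 1.0,
    ("Article", ("the",)): 1.0,
    ("Adjective", ("smelly",)): 0.5,
    ("Adjective", ("dead",)): 0.5,
    ("Noun", ("wumpus",)): 1.0,
    ("Verb", ("stinks",)): 1.0,
}

def tree_probability(tree):
    """Multiply the probabilities of all rules used in a parse tree.

    A tree is a tuple (label, child, ...); a leaf is a plain string (a word).
    """
    if isinstance(tree, str):  # a word contributes no rule of its own
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = rule_prob[(label, rhs)]
    for child in children:
        prob *= tree_probability(child)
    return prob

tree = ("S",
        ("NP", ("Article", "the"), ("Adjs", ("Adjective", "smelly")), ("Noun", "wumpus")),
        ("VP", ("Verb", "stinks")))
print(tree_probability(tree))  # 0.4 * 0.8 * 0.5, up to floating-point rounding
```

A tree that uses a rule missing from `rule_prob` raises a `KeyError`, which in this sketch plays the role of undergeneration: the grammar simply has no derivation for it.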


###  **23.2.1 The Lexicon of E₀**
<div>

- **Lexicon Overview:** The lexicon for E₀, representing a simplified segment of English suitable for specific communication contexts, comprises a list of allowable words categorized by their parts of speech. This lexicon includes nouns, names, verbs, adjectives, adverbs, pronouns, relative pronouns, articles, prepositions, and conjunctions.

- **Open vs. Closed Classes:** The lexicon distinguishes between open and closed classes. Open classes (nouns, names, verbs, adjectives, adverbs) are vast and constantly evolving, with new words regularly added to accommodate new concepts or trends (e.g., "humblebrag" or "microbiome"). Closed classes (pronouns, relative pronouns, articles, prepositions, conjunctions) consist of a relatively fixed and small set of words that change only over long historical periods.

- **Evolution of Closed Classes:** Despite their stability, closed classes do evolve, albeit at a much slower pace than open classes. Changes in usage patterns can lead to the decline or obsolescence of certain words (e.g., "thee" and "thou") and are typically only noticeable over centuries, contrasting sharply with the rapid evolution seen in open classes.

- **Challenges in Listing Words:** Given the dynamic nature of language, especially within open classes, it's practically impossible to compile a complete list of all words belonging to each category. The lexicon for E₀ is therefore indicative rather than exhaustive, highlighting the ongoing challenge of capturing and categorizing the full range of vocabulary in any natural language.</div>

##  **23.3 Parsing**  
- **Parsing Defined:**  Parsing is the process of analyzing a string of words to determine its phrase structure based on grammar rules. It involves constructing a valid parse tree that maps out the hierarchical structure of phrases in a sentence. 
- **Challenges and Strategies:**  Pure top-down or bottom-up parsing can lead to inefficiencies, such as redundant effort and backtracking, especially in sentences that share initial segments but diverge in structure. To mitigate this, dynamic programming techniques like chart parsing store results of substring analyses to prevent re-analysis, enhancing efficiency. 
- **Chart Parsing and CYK Algorithm:**  Chart parsing utilizes a data structure (the chart) to remember parsed substrings, avoiding repetition in the parsing process. The CYK (Cocke-Younger-Kasami) algorithm, a form of bottom-up chart parsing, is highlighted for its structured approach to parsing context-free grammars; it requires the grammar to be in Chomsky Normal Form (CNF). 
- **Computational Considerations:**  The CYK algorithm runs in time cubic in the sentence length n (multiplied by a factor that depends on the size of the grammar) and uses space quadratic in n. Despite its thoroughness, this cost suggests a natural limit to the speed of parsing under unrestricted context-free grammars. 
- **Natural Language and Parsing Efficiency:**  Considering natural languages' evolution towards comprehensibility, there's an argument that they should be more amenable to efficient parsing strategies. Techniques like A* search, leveraging heuristics for probabilistic evaluation, aim to improve parsing speed by narrowing the search space to the most probable parse trees. 
- **Beam Search and Deterministic Parsing:**  Beam search introduces a compromise between accuracy and efficiency by limiting the number of alternative parses considered at any point. Deterministic parsers, such as shift-reduce parsers, operate under even stricter constraints, making choices word by word without guaranteeing the highest probability parse but achieving operational efficiency. 
- **Diversity in Parsing Techniques:**  The field of parsing in NLP showcases a variety of methods, each with its proponents. The choice of parsing technique can influence the types of generalizations and biases introduced through machine learning, underscoring the diverse approaches to understanding and structuring natural language within computational models.


### CYK in Python

- The CYK (Cocke-Younger-Kasami) algorithm is a dynamic programming algorithm for parsing sentences in context-free grammars that are in Chomsky Normal Form (CNF). To implement the CYK algorithm in Python, we'll need a grammar in CNF where each rule is either of the form `A -> B C` (where `A`, `B`, and `C` are non-terminal symbols) or `A -> a` (where `A` is a non-terminal symbol and `a` is a terminal symbol, such as a word).

Here's a simplified Python implementation of the CYK algorithm. This implementation assumes that the grammar is provided as a dictionary where the keys are tuples representing the right-hand side of the grammar rules, and the values are lists of non-terminals that produce those tuples. Terminal rules are keyed by single-element tuples.

In [1]:
# A basic CYK recognizer for grammars in Chomsky Normal Form (CNF).
def CYK_parse(words, grammar):
    """
    Performs the CYK parsing algorithm on a list of words using the given grammar.
    
    :param words: List of words (strings) to parse.
    :param grammar: Grammar rules in CNF, represented as a dictionary where
                    keys are tuples of right-hand side symbols, and values are lists
                    of left-hand side non-terminals that can produce those tuples.
    :return: A parse table (list of lists of sets); table[i][j] holds the
             non-terminals that can derive words i..j (1-indexed, inclusive).
    """
    # Initialize the parse table (row and column 0 are unused)
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    
    # Fill in the table for the base case (single words)
    for i, word in enumerate(words, start=1):
        for non_terminal in grammar.get((word,), []):
            table[i][i].add(non_terminal)
    
    # Keep only the binary rules A -> B C; the keys are the right-hand sides (B, C)
    binary_rules = [(rhs, lhs_list) for rhs, lhs_list in grammar.items() if len(rhs) == 2]
    
    # Fill in the table for phrases of length 2 to n
    for length in range(2, n + 1):  # Length of the span
        for start in range(1, n - length + 2):  # Start of the span
            end = start + length - 1  # End of the span
            for mid in range(start, end):  # Split the span
                for (B, C), lhs_list in binary_rules:
                    if B in table[start][mid] and C in table[mid + 1][end]:
                        table[start][end].update(lhs_list)
    
    # Return the filled-in table
    return table

# Example usage
grammar = {
    ('the',): ['Det'],
    ('cat',): ['N'],
    ('sat',): ['V'],
    ('on',): ['P'],
    ('mat',): ['N'],
    ('Det', 'N'): ['NP'],
    ('V', 'NP'): ['VP'],
    ('V', 'PP'): ['VP'],  # added so that "sat on the mat" can form a verb phrase
    ('P', 'NP'): ['PP'],
    ('NP', 'PP'): ['NP'],
    ('NP', 'VP'): ['S'],
}

sentence = "the cat sat on the mat".split()
parse_table = CYK_parse(sentence, grammar)

# To check if the sentence is in the language defined by the grammar
if 'S' in parse_table[1][len(sentence)]:
    print("The sentence is grammatically correct according to the grammar.")
else:
    print("The sentence is not grammatically correct according to the grammar.")

The sentence is grammatically correct according to the grammar.

The implementation above assumes that the grammar is well-formed and in Chomsky Normal Form (CNF). The `CYK_parse` function fills a parse table that can be used to check whether a sentence can be generated by the grammar: if `'S'` (the start symbol here) appears in the cell spanning the whole sentence, `table[1][n]`, the sentence is grammatical according to the grammar. This basic example is meant to show the structure of the CYK algorithm; it records only which non-terminals cover each span, not the parse trees themselves or their probabilities. In practice, grammars and parsing requirements are usually more complex and call for extensions to this implementation.

###  **23.3.1 Dependency Parsing**  
- **Dependency Grammar Concept:**  Dependency grammar is an alternative syntactic approach that models the structure of sentences based on binary relations between words (lexical items), foregoing the need for syntactic constituents like phrases and clauses. This approach emphasizes direct relationships between a head word and its dependents, illustrating the grammatical structure through these connections rather than nested groupings of words. 
- **Comparison with Phrase Structure Grammar:**  While dependency and phrase structure grammars offer different perspectives on sentence structure, they can be seen as notational variants to some extent. With appropriate annotations, it's possible to convert between a phrase structure parse and a dependency parse, although the conversion might not always result in a structure that appears natural or intuitively understandable. 
- **Notational Preference and Naturalness:**  The preference for dependency grammar or phrase structure grammar isn't based on the power or capability of one over the other but rather on which representation feels more natural or intuitive, either for human developers or for computational models learning language structures. Dependency grammar tends to be more suitable for languages with flexible word order, where syntactic roles and relationships are not strictly tied to the sequence of words, while phrase structure grammar fits well with languages that have a relatively fixed word order. 
- **Universal Dependencies Project:**  The growing popularity of dependency grammar in computational linguistics and natural language processing has been significantly boosted by the Universal Dependencies project. This initiative provides a standardized set of syntactic relations and a vast treebank covering parsed sentences from over 70 languages, facilitating the development and evaluation of language understanding systems based on dependency grammar across a wide array of linguistic contexts.
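
As a rough illustration of the head-dependent relations described above, the sketch below encodes one dependency analysis of "the cat sat on the mat" as a list of arcs. The relation labels (`nsubj`, `det`, `obl`, `case`) follow Universal Dependencies naming, but the analysis is written by hand for illustration rather than produced by a parser, and the helper names `children_of` and `every_word_has_one_head` are hypothetical.

```python
# Index 0 is an artificial ROOT node; real words start at index 1.
words = ["ROOT", "the", "cat", "sat", "on", "the", "mat"]

# Each arc is (head_index, relation, dependent_index).
arcs = [
    (0, "root", 3),   # "sat" is the root of the sentence
    (3, "nsubj", 2),  # "cat" is the subject of "sat"
    (2, "det", 1),    # first "the" modifies "cat"
    (3, "obl", 6),    # "mat" is an oblique dependent of "sat"
    (6, "case", 4),   # "on" marks the role of "mat"
    (6, "det", 5),    # second "the" modifies "mat"
]

def children_of(head, arcs):
    """Return (relation, dependent_index) pairs for one head word."""
    return [(rel, dep) for h, rel, dep in arcs if h == head]

def every_word_has_one_head(arcs, n_nodes):
    """Check the basic well-formedness condition: one head per word."""
    heads = {}
    for h, _, dep in arcs:
        if dep in heads:
            return False  # a word with two heads is not allowed
        heads[dep] = h
    return set(heads) == set(range(1, n_nodes))

for rel, dep in children_of(3, arcs):
    print(f"sat --{rel}--> {words[dep]}")
```

Note that no phrase nodes such as NP or VP appear anywhere: the structure is carried entirely by word-to-word arcs.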

### **23.3.2 Learning a Parser from Examples**  
- **Challenges in Grammar Construction:**  Manually building a comprehensive grammar for English or any complex language is not only tedious but prone to errors. This difficulty suggests a shift towards automated learning of grammar rules and their probabilities from annotated examples. 
- **Treebanks as Learning Resources:**  Treebanks, such as the Penn Treebank, provide a valuable resource for supervised learning, containing sentences annotated with parse trees. These annotations facilitate the automated creation of Probabilistic Context-Free Grammars (PCFGs) by counting occurrences of node types and subtree patterns. 
- **Generating PCFGs from Annotated Data:**  By analyzing annotated trees, researchers can derive PCFG rules by calculating the frequency of specific node combinations. The process includes smoothing to account for low-frequency occurrences, ensuring that the resulting grammar is robust and reflective of the language's variability. 
- **Complexity and Generalization in Grammar:**  The Penn Treebank illustrates the complexity of English through its vast array of node types. However, some argue that the resulting trees are overly flat, suggesting a need for more generalized grammar rules that better capture the language's hierarchical structure. Techniques like data-oriented parsing aim to simplify and generalize the derived grammar, improving its applicability to new sentences. 
- **Limitations and Alternatives to Treebanks:**  Despite their utility, treebanks are limited by errors, idiosyncrasies, and the extensive labor required for their creation. These limitations highlight the appeal of unsupervised and semisupervised learning approaches, which seek to refine grammars using unannotated texts or a combination of annotated and unannotated resources. 
- **Unsupervised and Curriculum Learning:**  Unsupervised parsing techniques, such as the inside–outside algorithm, learn grammar structures from sentences without annotated parse trees. Curriculum learning approaches begin with simple, unambiguous sentences, gradually increasing complexity as the system's understanding grows, enabling the parsing of longer and more complex sentences over time. 
- **Semisupervised Parsing and Partial Bracketing:**  This approach combines a small set of annotated trees with a large corpus of unannotated sentences to refine and expand the grammar. Partial bracketing leverages non-expert annotations (e.g., HTML tags in texts) to infer syntactic structures, providing additional clues for grammar learning and parser improvement.
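
The counting procedure described above can be sketched as follows. This is a minimal example that assumes a made-up three-sentence "treebank" stored as nested tuples (not the Penn Treebank format) and uses plain relative-frequency estimates; smoothing is deliberately left out. The names `count_rules` and `estimate_pcfg` are illustrative.

```python
from collections import Counter

# Toy treebank: each tree is (label, child, ...); leaves are plain strings.
treebank = [
    ("S", ("NP", ("Det", "the"), ("N", "cat")), ("VP", ("V", "sat"))),
    ("S", ("NP", ("Det", "the"), ("N", "dog")), ("VP", ("V", "ran"))),
    ("S", ("NP", ("N", "cats")), ("VP", ("V", "sat"))),
]

def count_rules(tree, counts):
    """Recursively count one rule (lhs, rhs) per internal node."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            count_rules(child, counts)

def estimate_pcfg(trees):
    """P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    counts = Counter()
    for tree in trees:
        count_rules(tree, counts)
    lhs_totals = Counter()
    for (lhs, _), c in counts.items():
        lhs_totals[lhs] += c
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}

pcfg = estimate_pcfg(treebank)
for (lhs, rhs), p in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

Here `NP -> Det N` is seen twice and `NP -> N` once, so they receive probabilities 2/3 and 1/3. A rule never seen in the treebank gets probability zero, which is the gap that smoothing is meant to fill.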


##  **23.4 Augmented Grammars**  
- **Augmented Grammar Essentials:**  Augmented grammars enhance context-free grammars by incorporating features that capture more nuanced language rules, such as case, person, and number, allowing for distinctions between words within the same lexical category. For instance, "I" (subjective case, first person singular) is distinguished from "me" (objective case), and "banana" is recognized as a far more likely object of "ate" than "bandanna". 
- **Subcategories for Refined Parsing:**  Words are classified into subcategories based on features like case and number, which helps in identifying their syntactic roles more accurately. This classification enables the grammar to make finer distinctions about sentence structures, improving the grammar's predictive accuracy. 
- **Lexicalized PCFGs:**  These are probabilistic context-free grammars augmented with lexical information, assigning probabilities based on the properties of head words in phrases. Lexicalized PCFGs address the limitations of simple PCFGs by considering the likelihood of specific word combinations, thereby enabling more realistic sentence generation and parsing. 
- **Phrase Head and Probability Modeling:**  The concept of a phrase's head word plays a crucial role in lexicalized PCFGs, guiding the assignment of probabilities to different constructions. This approach simplifies probability models by focusing on the interactions between key words in phrases, though it requires extensive data and sophisticated models for accurate probability estimation. 
- **Addressing Overgeneration and Undergeneration:**  Augmented grammars aim to reduce the overgeneration of non-sentences and undergeneration of valid sentences by encoding grammatical rules and probabilities that reflect the correct usage of words and phrases. This includes specifying the appropriate case, person, and number for noun phrases and verb phrases in different contexts. 
- **Grammar Rules and Feature Agreement:**  Augmented grammars implement rules that enforce agreement between syntactic features, such as person and number agreement between a subject noun phrase and its verb phrase, and that require the right case in each position (subjective case for a subject pronoun, objective case for an object). These rules are vital for constructing grammatically coherent sentences. 
- **Modular and Concise Representation:**  By using augmented categories and variables, these grammars achieve a more modular and concise representation of linguistic rules, facilitating the construction of complex sentence structures with fewer rules. This modularity also simplifies the learning and application of grammatical rules in natural language processing tasks.
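
To make the idea of feature agreement concrete, here is a small sketch of a single augmented rule, roughly `S -> NP(Sbj, pn) VP(pn)`, checked against a hypothetical mini-lexicon. The feature names (`case`, `pn`), the value codes (`1S`, `3S`, `3P`), and the function `accepts` are assumptions made for this example, not a standard API.

```python
# Hypothetical lexicon: pronouns carry a case and a person-number (pn) feature.
PRONOUNS = {
    "I":    {"case": "Sbj", "pn": "1S"},
    "me":   {"case": "Obj", "pn": "1S"},
    "she":  {"case": "Sbj", "pn": "3S"},
    "her":  {"case": "Obj", "pn": "3S"},
    "they": {"case": "Sbj", "pn": "3P"},
    "them": {"case": "Obj", "pn": "3P"},
}

# Each verb form lists the person-number values it agrees with.
VERBS = {
    "sleep":  {"1S", "3P"},
    "sleeps": {"3S"},
}

def accepts(subject, verb):
    """Apply S -> NP(Sbj, pn) VP(pn): the subject must be in subjective
    case, and the same pn value must be shared by subject and verb."""
    np = PRONOUNS.get(subject)
    vp_pn = VERBS.get(verb)
    if np is None or vp_pn is None:
        return False
    return np["case"] == "Sbj" and np["pn"] in vp_pn

for subject, verb in [("I", "sleep"), ("me", "sleep"), ("she", "sleeps"), ("she", "sleep")]:
    print(subject, verb, accepts(subject, verb))
```

One rule with a shared variable `pn` covers what would otherwise need a separate context-free rule for every person-number combination, which is the conciseness the last bullet refers to.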

###  **23.4.1 Semantic Interpretation**  
- **Basics of Adding Semantics:**  The process of incorporating semantic interpretation into a grammar involves mapping syntactic structures to their corresponding meanings. This is demonstrated through arithmetic expressions, where semantics are attached to grammatical rules, enabling the derivation of the semantic value of expressions based on the semantics of their components. This approach adheres to the principle of compositional semantics, where the meaning of a phrase is determined by the meanings of its parts. 
- **From Arithmetic to English:**  Moving from arithmetic to English, semantic interpretation is handled using first-order logic. Simple sentences like "Ali loves Bo" are translated into logical representations, such as Loves(Ali,Bo), capturing the underlying semantic relationships between subjects, verbs, and objects. 
- **Representing Constituent Phrases:**  Constituent phrases within a sentence are represented in ways that reflect their semantic roles. For instance, a noun phrase might directly correspond to a logical term, while a verb phrase (VP), representing an incomplete action or property, is modeled as a predicate that requires a subject to form a complete logical sentence. 
- **Lambda Notation for Predicates:**  The λ-notation is employed to represent predicates in a manner that can be applied to logical terms to yield complete sentences. This notation allows for the representation of partial meanings, like "loves Bo", as predicates that become complete when combined with a subject. 
- **Semantic Rules and β-Reduction:**  Grammatical rules specify how the semantic interpretations of different parts of a sentence combine to form the overall sentence meaning. For example, combining a noun phrase (NP) with a verb phrase (VP) through a specific rule can produce the full logical representation of a sentence. The process of applying these rules is technically known as β-reduction in the context of lambda calculus, simplifying expressions to reveal their underlying logical form. 
- **Comprehensive Semantic Representation:**  In more detailed grammars, semantic information is integrated alongside other linguistic features such as case and number, encapsulating the full spectrum of language understanding in a set of unified rules. This integration enables the translation of complex syntactic structures into their semantic equivalents, facilitating a deeper understanding of language through formal representations. 
- **Example and Application:**  The methodology extends beyond simple sentences, applying to a wide range of linguistic expressions. Through systematic rule application and semantic augmentation, it's possible to construct grammars that accurately capture the meanings of sentences, offering insights into their logical structure and implications.
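
The composition of "Ali loves Bo" can be mimicked with Python lambdas. This is only a sketch: logical terms are represented as strings, the helper `loves` is hypothetical, and ordinary function application stands in for β-reduction.

```python
def loves(x, y):
    """Build the string form of the logical sentence Loves(x, y)."""
    return f"Loves({x}, {y})"

# Lexical semantics: names are logical terms, the verb is a curried predicate.
ali = "Ali"
bo = "Bo"
verb_loves = lambda y: lambda x: loves(x, y)   # λy λx Loves(x, y)

# VP(pred(obj)) -> Verb(pred) NP(obj): apply the verb to its object.
vp_loves_bo = verb_loves(bo)                   # λx Loves(x, Bo)

# S(pred(subj)) -> NP(subj) VP(pred): apply the VP to the subject.
sentence_meaning = vp_loves_bo(ali)

print(sentence_meaning)  # Loves(Ali, Bo)
```

The intermediate value `vp_loves_bo` is the partial meaning of "loves Bo": a predicate that still needs a subject before it becomes a complete logical sentence.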


### **23.4.2 Learning Semantic Grammars**  
- **Challenge of Semantic Annotation:**  Unlike the Penn Treebank, which provides syntactic trees without semantic annotations, learning semantic grammars requires data that includes both sentences and their corresponding logical or semantic forms. This necessitates a different source of examples that pair sentences with their semantic interpretations. 
- **Systems for Learning Semantic Representations:**  Zettlemoyer and Collins developed a system capable of learning grammars for question-answering applications from sentence-semantic form pairs. By incorporating a modest amount of domain-specific knowledge, their system can generate lexical entries and learn grammar parameters that parse sentences into semantic forms, demonstrating notable accuracy in interpreting unseen sentences. 
- **Improvements and Alternatives:**  Further advancements by Zhao and Huang introduced a shift-reduce parser that not only operates faster but also achieves higher accuracy, illustrating progress in the field of semantic parsing. 
- **Limitations of Requiring Logical Forms:**  A significant obstacle in learning semantic grammars is the need for training data to include logical forms, which are costly to produce due to the requirement for annotators familiar with complex logical and mathematical notations like lambda calculus. 
- **Leveraging Question/Answer Pairs:**  An alternative approach utilizes readily available question/answer pairs from the web, bypassing the need for explicit logical forms. This method capitalizes on the abundance of such data online to train parsers, potentially improving performance by leveraging larger datasets. 
- **Innovative Approaches to Semantic Parsing:**  Recent studies have proposed methods to create internal logical forms that are both compositional and manageable in terms of computational complexity. These approaches aim to optimize the balance between expressiveness and the feasibility of parsing, paving the way for more effective and scalable semantic parsing systems.


##  **23.5 Complications of Real Natural Language**  
- **Quantification and Ambiguity:**  Quantification adds considerably to the complexity of natural language because it introduces semantic ambiguity. For example, "Every agent feels a breeze" can mean either that a single breeze is felt by all agents or that each agent feels a possibly different breeze. Both readings share the same syntactic parse, so the ambiguity cannot be resolved by syntax alone and must be handled during semantic interpretation. 
- **Pragmatics and Context:**  Pragmatics plays a crucial role in understanding natural language, involving context-dependent information to resolve meanings of indexicals (like "I" or "today") and interpret the speaker's intent. This dimension of language processing goes beyond syntax and semantics to consider the situational context and speaker-hearer dynamics. 
- **Long-distance Dependencies:**  Sentences often contain elements that are related despite being separated by several other components. Parsing strategies must account for these long-distance dependencies to correctly associate elements like gaps in a sentence with their referents. 
- **Time and Tense:**  Expressing and understanding temporal concepts in language, such as differentiating between "Ali loves Bo" (present tense) and "Ali loved Bo" (past tense), introduces additional layers of complexity. This necessitates incorporating notions of time into the semantic interpretation of sentences. 
- **Ambiguity in Language:**  Ambiguity is a pervasive feature of natural language, encompassing lexical (single words with multiple meanings), syntactic (multiple possible parses), and semantic (different interpretations) ambiguities. While often seen as a communication challenge, ambiguity is also an inherent and expressive aspect of language that parsers must navigate. 
- **Metaphor and Metonymy:**  Figures of speech like metaphor and metonymy add depth and richness to language but pose challenges for computational models. These figures involve representing one thing in terms of another, based on analogy or association, requiring sophisticated parsing strategies to interpret meaning correctly. 
- **Disambiguation Strategies:**  Effective language understanding involves disambiguation, choosing the most probable meaning from multiple possibilities. This process relies on integrating various models, including the world model (real-world likelihoods), the mental model (speaker and hearer beliefs and intentions), the language model (likelihood of word sequences), and the acoustic model (for spoken language). 
- **Real-world Complexity:**  The complexities highlighted underscore the multifaceted nature of natural language understanding. Beyond parsing syntax, computational models must engage with semantic, pragmatic, and contextual dimensions to accurately interpret and generate language, reflecting its inherent ambiguity and richness.
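
The two readings of "Every agent feels a breeze" can be told apart by evaluating them in a small world. The sketch below assumes a made-up model with two agents and two breezes; the variable names are illustrative.

```python
# A tiny world in which each agent feels a different breeze.
agents = ["a1", "a2"]
breezes = ["b1", "b2"]
feels = {("a1", "b1"), ("a2", "b2")}

# Reading 1: for every agent there is some breeze that the agent feels.
#   ∀a ∃b Feels(a, b)
each_feels_some = all(any((a, b) in feels for b in breezes) for a in agents)

# Reading 2: there is one breeze that every agent feels.
#   ∃b ∀a Feels(a, b)
one_shared_breeze = any(all((a, b) in feels for a in agents) for b in breezes)

print("each agent feels some breeze:", each_feels_some)     # True
print("one breeze is felt by all:   ", one_shared_breeze)   # False
```

Because the two readings come out differently in the same world, a system has to decide which quantifier scope the speaker intended.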


##  **23.6 Natural Language Tasks**  
- **Speech Recognition:**  Converts spoken language into text, serving as a foundation for further natural language processing tasks. Modern systems boast low word error rates, comparable to human transcriptionists, thanks to advances in deep neural networks and recurrent neural network-Hidden Markov model hybrids. 
- **Text-to-Speech Synthesis:**  Involves generating spoken language from text, aiming for natural pronunciation and intonation. Recent improvements have focused on voice synthesis diversity, including dialects and celebrity voices, with deep neural networks significantly enhancing naturalness. 
- **Machine Translation:**  Translates text from one language to another using bilingual corpora for training. Early systems relied on n-gram models but faced limitations in syntax representation. Recurrent neural sequence-to-sequence models and transformer models have since achieved more nuanced translations, approaching human-level accuracy for some language pairs. 
- **Information Extraction:**  Identifies and extracts structured information from unstructured text, such as addresses or weather report details. While simple patterns can be extracted using regular expressions, more complex information requires advanced models like hidden Markov models or neural networks. 
- **Information Retrieval:**  Finds documents relevant to a user's query, a task performed by search engines like Google and Baidu. It involves sophisticated algorithms to rank documents based on their relevance and importance to the query. 
- **Question Answering:**  Goes beyond retrieving documents to directly providing answers to user queries. This task has evolved from relying on syntactic parsing and databases to leveraging web information retrieval for broader coverage. Recent systems use web searches to find likely answers, benefiting from large-scale internet data. 
- **Challenges and Advances:**  These tasks underscore the breadth of natural language processing and the complexities of dealing with real-world language data. Progress in areas like deep learning has led to significant improvements, yet the field continues to evolve, addressing ongoing challenges in understanding and generating human language.
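
As a small illustration of the regular-expression approach to information extraction mentioned above, the following sketch pulls temperatures out of a weather-report sentence. The report text and the pattern are invented for this example; real reports vary far more, which is why the more advanced models are needed.

```python
import re

# Made-up weather report used only for illustration.
report = "Riga: high of 21 C, low of 12 C, wind 15 km/h from the west."

# Named groups capture the kind of reading, its value, and the unit.
pattern = re.compile(r"(?P<kind>high|low) of (?P<value>-?\d+) (?P<unit>[CF])\b")

temperatures = {
    m.group("kind"): (int(m.group("value")), m.group("unit"))
    for m in pattern.finditer(report)
}

print(temperatures)  # {'high': (21, 'C'), 'low': (12, 'C')}
```

The result is a structured record (a dictionary) extracted from unstructured text. Any phrasing the pattern does not anticipate, such as "temperatures peaking at 21 degrees", is silently missed.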


<img src="https://raw.githubusercontent.com/ValRCS/RBS_PBM773_Introduction_to_AI/main/img/ch23_nlp/DALL%C2%B7E%202024-03-25%2022.44.03%20-%20An%20imaginative%20illustration%20where%20librarians%20are%20depicted%20in%20a%20magical%20library%2C%20extracting%20knowledge%20directly%20from%20textbooks.%20The%20books%20float%20in%20the%20a.webp" width="500">

##  **Summary of Chapter Highlights**  
- **Probabilistic Language Models and N-grams:**  These models are powerful tools for understanding language patterns and can effectively tackle a range of tasks, from language identification and spelling correction to sentiment analysis and named-entity recognition. 
- **Importance of Preprocessing and Smoothing:**  With potentially millions of features in language models, preprocessing and smoothing are crucial steps to refine the data and improve model performance. 
- **Simplicity and Data Utilization:**  In statistical language modeling, the aim is to develop models that effectively leverage the available data, even if the models initially appear too simplistic. This approach often yields surprisingly robust systems. 
- **Word Embeddings:**  These provide a nuanced representation of word meanings and relationships, enhancing the ability to model language complexity. 
- **Hierarchical Structure of Language:**  To accurately reflect language's hierarchical nature, phrase structure grammars, especially context-free grammars, are employed. Both PCFGs and dependency grammars are essential formalisms in this domain. 
- **Parsing and Treebanks:**  Sentences can be parsed efficiently using algorithms like CYK, with modifications such as beam search or shift-reduce parsing allowing for faster processing. Treebanks serve as vital resources for learning and refining grammars. 
- **Grammar Augmentation:**  Augmenting grammars helps address linguistic nuances like subject-verb agreement and pronoun case, and facilitates the inclusion of word-level information beyond mere syntactic categories. 
- **Semantic Interpretation:**  Augmented grammars can also incorporate semantic interpretation, enabling the construction of semantic grammars from corpora of questions paired with their logical forms or answers. 
- **Complexity of Natural Language:**  Capturing the full complexity of natural language in formal grammars is a challenging task due to its inherent intricacies and the nuanced information conveyed through language.

This summary encapsulates the core insights and methodologies discussed in the chapter, highlighting the multifaceted approach required to model and process natural language effectively.


## Historical and Bibliographical Notes

- **Markov (1913):**  Introduced n-gram letter models for language modeling. 
- **Shannon and Weaver (1949):**  First to generate n-gram word models of English. 
- **Zellig Harris (1954):**  Coined the term "bag of words." 
- **Chomsky (1956, 1957):**  Highlighted limitations of finite-state models compared to context-free models. 
- **Jelinek (1976):**  Contributed to the reemergence of statistical models in speech recognition. 
- **Laplace (1816) and Jeffreys (1948):**  Discussed add-one smoothing. 
- **Jelinek and Mercer (1980), Witten–Bell (1991), Church and Gale (1991), Kneser–Ney (1995, 2004), Brants et al. (2007):**  Developed various smoothing techniques. 
- **Blei et al. (2002), Hoffman et al. (2011):**  Developed the latent Dirichlet allocation model. 
- **Joulin et al. (2016), Joachims (2001), Apté et al. (1994), Koller and Sahami (1997), Schapire and Singer (2000), Zhang et al. (2016):**  Made significant contributions to text classification and machine learning in NLP. 
- **Fellbaum (2001):**  Developed WordNet, a dictionary linked by semantic relations. 
- **Marcus et al. (1993), Bies et al. (2015):**  Contributed to the Penn Treebank. 
- **Zelle and Mooney (1996), Zettlemoyer and Collins (2005), Zhao and Huang (2015):**  Worked on learning semantic grammars from examples. 
- **Gold (1967):**  Showed challenges in learning exactly correct context-free grammars. 
- **Horning (1969), Schütze (1995), de Marcken (1996):**  Demonstrated language learning from positive examples. 
- **Baker (1975):**  Developed the DRAGON speech recognition system, the first to use HMMs. 
- **Hinton et al. (2012), Deng (2016):**  Discussed the impact of deep learning on speech recognition. 
- **Manning et al. (2008), Croft et al. (2010), Baeza-Yates and Ribeiro-Neto (2011):**  Authored textbooks on information retrieval. 
- **Brin and Page (1998):**  Described the PageRank algorithm and Web search engine implementation. 
- **Banko et al. (2007), Banko and Etzioni (2008), Mitchell (2005), Etzioni et al. (2006):**  Advanced the field of information extraction and machine reading. 
- **Chomsky (1956), Backus (1959), Panini (ca. 350 BCE):**  Contributed to the development of context-free grammars. 
- **Charniak (1993), Manning and Schütze (1999), Jurafsky and Martin (2020):**  Provided significant resources on PCFGs and NLP. 
- **Andor et al. (2016), Chen and Manning (2014), Kitaev and Klein (2018):**  Developed highly accurate open-source parsers. 
- **Green et al. (1961), Winograd (1972), Woods (1973):**  Pioneered NLP systems for question answering and command interpretation.

These key authors and their contributions highlight the historical and bibliographical milestones in the development of natural language processing, spanning from early theoretical foundations to contemporary computational models and applications. The evolution of language models, parsing algorithms, and semantic interpretation techniques reflects the ongoing quest to understand and process human language effectively.

## Learn more on NLP

### Books 
- **Natural Language Processing with Python:**  A comprehensive guide to NLP using Python, covering text processing, classification, information extraction, and more. Src: [Natural Language Processing with Python](https://www.nltk.org/book/)
- **Speech and Language Processing:**  A foundational textbook on NLP, covering a wide range of topics from speech recognition to machine translation. Src: [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/)

### Papers
- **"Attention Is All You Need" (Vaswani et al., 2017):**  Introduced the transformer model, a key advancement in sequence-to-sequence learning for NLP tasks. Src: [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- **"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2018):**  Described BERT, a transformer-based model that achieved state-of-the-art results on various NLP benchmarks. Src: [BERT](https://arxiv.org/abs/1810.04805)
- **"GloVe: Global Vectors for Word Representation" (Pennington et al., 2014):**  Introduced the GloVe model for learning word embeddings from large text corpora. Src: [GloVe](https://nlp.stanford.edu/projects/glove/)
- **"Efficient Estimation of Word Representations in Vector Space" (Mikolov et al., 2013):**  Introduced the Word2Vec model for learning word embeddings, a key advancement in distributed word representations. Src: [Word2Vec](https://arxiv.org/abs/1301.3781)

### Tools
- **NLTK:**  A popular Python library for NLP tasks, providing tools for tokenization, stemming, tagging, parsing, and semantic reasoning. Src: [NLTK](https://www.nltk.org/)
- **spaCy:**  An open-source NLP library for Python, offering efficient tokenization, named entity recognition, part-of-speech tagging, and dependency parsing. Src: [spaCy](https://spacy.io/)
- **Hugging Face Transformers:**  A library for state-of-the-art NLP models, including BERT, GPT-2, and T5, with pre-trained models and fine-tuning capabilities. Src: [Hugging Face Transformers](https://huggingface.co/transformers/)
- **GenSim:**  A Python library for topic modeling, document indexing, and similarity retrieval using large corpora. Src: [GenSim](https://radimrehurek.com/gensim/)

### Courses
- **Coursera NLP Specialization:**  A series of courses on NLP topics, including text classification, sequence models, and sentiment analysis. Src: [Coursera NLP Specialization](https://www.coursera.org/specializations/natural-language-processing)
- **Stanford CS224N:**  A course on NLP with a focus on deep learning, covering topics like word embeddings, sequence-to-sequence models, and transformers. Src: [Stanford CS224N](http://web.stanford.edu/class/cs224n/)