# Chapter 23 - Natural Language Processing (NLP)

*In which we see how a computer can use natural language to communicate with humans
and learn from what they have written.* - Peter Norvig, Stuart Russell in Artificial Intelligence: A Modern Approach

## **Introduction to Natural Language Processing** 
- Human language, both spoken and written, is a defining characteristic of Homo sapiens, setting us apart from all other species due to its complexity and diversity.
- Language is central to human intelligence and behavior, as highlighted by Alan Turing's intelligence test. It is the primary medium for communicating knowledge and intentions. 
- Natural Language Processing (NLP) is crucial for:
  - **Communication with humans:**  It is often more convenient and natural to interact with computers using spoken or written natural language rather than formal languages like first-order predicate calculus.
  - **Learning:**  Much of human knowledge is documented in natural language (e.g., Wikipedia). To harness this vast repository of information, computational systems need to understand natural language.
  - **Scientific advancement:**  Studying NLP aids in the broader understanding of languages and language use, integrating AI with linguistics, cognitive psychology, and neuroscience.
- The chapter focuses on mathematical models of language and explores various tasks that can be performed using these models.

##  **23.1 Language Models**  
- **Difference Between Formal and Natural Languages:**  Formal languages, like first-order logic, have precisely defined syntax and semantics, whereas natural languages exhibit variability, ambiguity, and a lack of formal definition in mapping symbols to objects. 
- **Variability and Ambiguity:**  Natural language varies between individuals and over time, and can be ambiguous (e.g., "He saw her duck" has multiple interpretations). 
- **Language Models Defined:**  A language model is a probability distribution over strings of text, used to determine the likelihood of a given sequence of words. It helps predict likely word sequences, suggest completions, and make corrections. 
- **Applications of Language Models:**  They are crucial for a variety of NLP tasks, including text completion, spelling and grammar correction, translation, and answering questions. Language models serve as benchmarks for measuring progress in understanding natural languages. 
- **Inherent Complexity and Approximation:**  Any language model is an approximation due to the complex nature of natural languages. Famous quotes by Edward Sapir and Donald Davidson highlight the challenges in creating a definitive language model, emphasizing that while individual models may be imperfect, they remain useful for communication and various computational tasks.

###  **23.1.1 The Bag-of-Words Model**  
- **Foundation in Naive Bayes:**  The bag-of-words model recasts the naive Bayes classifier, which assigns sentences to topics (e.g., business or weather), as a full language model: a joint probability distribution over sentences and their categories in which each word is treated as independent of the others given the category. 
- **Generative Process Description:**  The model imagines a separate "bag" of words for each category, from which words are drawn at random to generate a sentence. Because it assumes word independence, it does not produce coherent sentences, but it is still useful for classification tasks. 
- **Model Limitations and Accuracy:**  Despite its incorrect assumption of word independence, the bag-of-words model can classify texts with good accuracy by relying on word frequencies related to specific categories. 
- **Learning from Corpora:**  The prior probability of each category and the probability of each word given a category are learned from large text corpora, using counts to estimate how likely a word is to appear in a given category. 
- **Comparison with Other Machine Learning Approaches:**  Other methods like logistic regression, neural networks, and support vector machines might outperform the naive Bayes for certain tasks. Feature vectors in these models are often large and sparse, containing the frequencies or presence of words from the vocabulary. 
- **Feature Selection and Additional Features:**  Improvements in model performance can come from selecting a subset of words as features and incorporating non-word features (e.g., sender information in emails, time sent, punctuation usage). 
- **Tokenization Challenges:**  Identifying what constitutes a word (e.g., handling contractions) is a non-trivial problem that impacts the model's input, requiring careful text tokenization.
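
The classification use of the bag-of-words model described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the whitespace tokenizer, the add-one smoothing of word counts, the function names, and the four-sentence toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: lowercase and split on whitespace. Real tokenizers must
    # also handle punctuation, contractions, and similar cases.
    return text.lower().split()

def train_naive_bayes(labeled_docs):
    """Count categories and per-category words from (text, category) pairs."""
    category_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in labeled_docs:
        category_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocabulary.add(word)
    return category_counts, word_counts, vocabulary

def classify(text, model):
    """Return the category maximizing log P(category) + sum of log P(word | category)."""
    category_counts, word_counts, vocabulary = model
    total_docs = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, doc_count in category_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            # Add-one smoothing (an assumption) keeps unseen words from
            # driving the probability to zero.
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical toy corpus, invented for illustration only.
training_data = [
    ("stocks rallied on monday as earnings beat forecasts", "business"),
    ("the company reported strong quarterly earnings", "business"),
    ("monday will be mostly sunny with light winds", "weather"),
    ("expect rain and strong winds on tuesday", "weather"),
]
model = train_naive_bayes(training_data)
print(classify("earnings and stocks", model))
print(classify("sunny with rain later", model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.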


###  **23.1.2 N-gram Word Models**  
- **Limitations of Bag-of-Words Model:**  Because it assumes word independence, the bag-of-words model ignores word order and neighboring words, so it cannot tell apart contexts in which the same word (e.g., "quarter") appears in different categories. 
- **Introduction of N-gram Models:**  N-gram models address these limitations by considering each word's dependency on its preceding words, allowing for more context-aware analysis. 
- **Dependency Representation:**  In an ideal model, a word would depend on all previous words in a sentence, but this is impractical due to the vast number of parameters needed. N-gram models offer a compromise by limiting dependency to the n−1 previous words. 
- **Markov Chain Model:**  N-gram models use a Markov chain approach, treating sentences as sequences of dependent words. This significantly reduces complexity by focusing on local, rather than global, word dependencies. 
- **Types of N-grams:**  The model differentiates between unigrams (single words), bigrams (two-word sequences), and trigrams (three-word sequences), with each word's probability conditioned on the presence of the previous n−1 words. 
- **Applications:**  N-gram models are effective for various classification tasks beyond newspaper section classification, including spam detection, sentiment analysis, and author attribution, by capturing stylistic and thematic nuances in text sequences.
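
As a rough sketch of the idea above, a bigram model (n = 2) can be estimated from counts, with each word conditioned only on the word before it. The sentence boundary markers `<s>` and `</s>`, the function names, and the three-phrase corpus are assumptions chosen for the example, and no smoothing is applied.

```python
from collections import Counter

def pad(sentence):
    # Assumption: mark sentence boundaries with <s> and </s>.
    return ["<s>"] + sentence.lower().split() + ["</s>"]

def train_bigram_model(sentences):
    """Count contexts (previous words) and adjacent word pairs."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = pad(sentence)
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts

def bigram_probability(prev_word, word, model):
    """Maximum-likelihood estimate of P(word | prev_word)."""
    context_counts, bigram_counts = model
    if context_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / context_counts[prev_word]

def sentence_probability(sentence, model):
    """Markov-chain approximation: product of P(w_i | w_{i-1})."""
    tokens = pad(sentence)
    probability = 1.0
    for prev_word, word in zip(tokens, tokens[1:]):
        probability *= bigram_probability(prev_word, word, model)
    return probability

# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "first quarter earnings report",
    "fourth quarter touchdown passes",
    "first quarter touchdown passes",
]
model = train_bigram_model(corpus)
print(bigram_probability("quarter", "touchdown", model))
print(sentence_probability("first quarter earnings report", model))
print(sentence_probability("fourth quarter report", model))  # contains an unseen bigram
```

The last call returns zero because "quarter report" never occurs in the toy corpus, which is the problem that smoothing is meant to address.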

###  **23.1.3 Other N-gram Models**  
- **N-gram Word Models:**  N-gram word models address the limitations of the bag-of-words model by considering sequences of words (n-grams) rather than individual words in isolation. They capture dependencies between adjacent words, improving the ability to distinguish between contexts (e.g., "first quarter earnings report" in business vs. "fourth quarter touchdown passes" in sports). 
- **Character-level Models:**  An alternative n-gram approach that models the probability of each character based on the previous n−1 characters. It's particularly effective for languages that concatenate words and for dealing with unknown words. These models excel in tasks like language identification and classifying unique names or terms, achieving high accuracy even with short texts. 
- **Skip-gram Models:**  Skip-gram models count words that are near each other but with skips over one or more words between them. This approach is useful for capturing more nuanced language patterns, such as conjugation and negation relationships, by considering non-adjacent word pairs. 
- **Applications of N-gram Models:**  N-gram and skip-gram models are valuable for various classification tasks, including distinguishing newspaper sections, detecting spam, analyzing sentiment, attributing authorship, identifying languages, and classifying names or terms. These models offer improved context sensitivity over simpler models, leading to enhanced performance in these areas.
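
The two alternatives above can be illustrated with a small sketch that only extracts the units to be counted. The function names are hypothetical, and `skip_bigrams` assumes one particular convention: pairs of words with exactly `skip` words between them.

```python
def character_ngrams(text, n):
    """All contiguous character sequences of length n in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def skip_bigrams(tokens, skip):
    """Word pairs separated by exactly `skip` intervening words (an assumed convention)."""
    gap = skip + 1
    return [(tokens[i], tokens[i + gap]) for i in range(len(tokens) - gap)]

print(character_ngrams("duck", 3))
# ['duc', 'uck']

print(skip_bigrams("je ne comprends pas".split(), 1))
# [('je', 'comprends'), ('ne', 'pas')]
```

In the French example, the 1-skip-bigrams pair the subject with its conjugated verb ("je", "comprends") and the two halves of the negation ("ne", "pas"), which adjacent bigrams would miss.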

###  **23.1.4 Smoothing N-gram Models**  
- **Addressing Variance:**  High-frequency n-grams have stable probability estimates, while low-frequency n-grams suffer from high variance because their estimates rest on only a few, noisy counts. Smoothing techniques aim to mitigate this variance, improving model performance. 
- **Handling Unknown Words:**  To model out-of-vocabulary (unknown) words, training corpora are modified by replacing infrequent words with a special symbol (`<UNK>`), allowing for estimation of their probabilities. Additional symbols like `<NUM>` for numbers or `<EMAIL>` for email addresses can also be used. 
- **Unseen N-grams:**  Even after accounting for unknown words, the challenge of unseen n-grams remains. These are sequences that have never appeared in the training set but could appear in test data. Smoothing distributes some probability mass to these unseen n-grams, aiming for a more accurate model representation. 
- **Laplace Smoothing:**  A basic form of smoothing that adds one to the count of all n-grams, including unseen ones, to avoid zero probabilities. However, it often performs poorly in natural language tasks, because most possible n-grams are unseen and add-one counts therefore shift too much probability mass onto them. 
- **Backoff and Interpolation Models:**  Backoff models fall back to (n−1)-grams when encountering low-frequency or unseen n-grams. Linear interpolation smoothing combines different n-gram models (trigram, bigram, unigram) in a weighted average, with weights that sum to one and that can be fixed or adjusted according to the counts of the n-grams involved. 
- **Advanced Smoothing Techniques:**  Researchers have developed sophisticated smoothing methods, such as Witten-Bell and Kneser-Ney, alongside simpler approaches like "stupid backoff"; all of them aim to reduce variance. The choice between sophisticated techniques and accumulating larger corpora for simpler methods reflects ongoing research into optimizing language model performance.
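
The following sketch illustrates three of the ideas above for a bigram model: replacing rare words with `<UNK>`, Laplace (add-one) smoothing, and linear interpolation of bigram and unigram estimates. The function names, the `min_count` threshold, the fixed interpolation weight of 0.7, and the toy token sequence are assumptions for illustration; in practice the weights would be tuned on held-out data.

```python
from collections import Counter

def replace_rare_words(tokens, min_count=2):
    """Replace words seen fewer than min_count times with <UNK>."""
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else "<UNK>" for t in tokens]

def build_counts(tokens):
    unigram_counts = Counter(tokens)
    context_counts = Counter(tokens[:-1])  # words that have a successor
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return unigram_counts, context_counts, bigram_counts

def laplace_probability(prev_word, word, counts):
    """Add-one estimate of P(word | prev_word)."""
    unigram_counts, context_counts, bigram_counts = counts
    vocab_size = len(unigram_counts)
    return (bigram_counts[(prev_word, word)] + 1) / (context_counts[prev_word] + vocab_size)

def interpolated_probability(prev_word, word, counts, bigram_weight=0.7):
    """Weighted average of the bigram and unigram estimates (weights sum to one)."""
    unigram_counts, context_counts, bigram_counts = counts
    p_unigram = unigram_counts[word] / sum(unigram_counts.values())
    context = context_counts[prev_word]
    p_bigram = bigram_counts[(prev_word, word)] / context if context else 0.0
    return bigram_weight * p_bigram + (1 - bigram_weight) * p_unigram

# Hypothetical toy token sequence, invented for illustration only.
tokens = "the cat sat on the mat the cat ran".split()
counts = build_counts(tokens)

print(replace_rare_words(tokens))
print(laplace_probability("the", "cat", counts))       # seen bigram
print(laplace_probability("the", "ran", counts))       # unseen bigram, still nonzero
print(interpolated_probability("the", "ran", counts))  # unseen bigram, backed by the unigram
```

Both smoothed estimates still sum to one over the vocabulary for a given previous word, so they remain valid probability distributions while giving unseen bigrams nonzero mass.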

###  **23.1.5 Word Representations**  
- **N-gram Model Limitations:**  While n-grams can accurately predict the likelihood of word sequences based on their frequency in training corpora, they miss out on the inherent patterns of language that native speakers understand, such as grammatical structures (e.g., article-adjective-noun). 
- **Beyond Surface-level Analysis:**  Native speakers recognize patterns and relations between words that n-gram models, treating each word as an atomic unit without internal structure, cannot capture. For instance, understanding the grammatical correctness of "the fulvous kitten" despite not having encountered "fulvous" before. 
- **Generalization through Structure:**  The n-gram model's inability to generalize beyond direct word sequence occurrences is a significant limitation. Factored or structured models, which account for the internal structure and relationships between words, can offer better generalization. Word embeddings are an example of such models, providing a richer, multidimensional representation of word meanings and relations. 
- **WordNet as a Structured Word Model:**  WordNet, a hand-curated dictionary, exemplifies a structured word model, offering categorizations and relations (e.g., hypernyms and hyponyms) among words. However, while useful for distinguishing word senses and basic categorizations, WordNet does not convey the full semantic richness or contextual usage details of words. 
- **Future Directions:**  The text hints at the exploration of more expressive models like word embeddings in subsequent sections, indicating a move towards understanding language in a more nuanced and comprehensive manner, akin to human language comprehension.

### **23.1.6 Part-of-Speech (POS) Tagging**  
- **Fundamental Task in NLP:**  POS tagging involves assigning each word in a sentence its appropriate part of speech (e.g., noun, verb, adjective), a basic yet crucial step for many NLP applications, from text-to-speech synthesis to machine translation. The complexity arises from the diversity in the classification of parts of speech, exemplified by the 45 tags used in the Penn Treebank. 
- **Hidden Markov Models (HMM):**  One traditional approach to POS tagging is using HMMs, in which the words are the observations and the POS tags are the hidden states; the model predicts the tag sequence from the observed words and the transitions between tags. HMMs, despite their simplification of language complexity to transitions and emissions, achieve high accuracy in tagging, partly thanks to algorithms like Viterbi for finding the most probable tag sequences. 
- **Transition and Sensor Models:**  These are integral to HMMs for POS tagging, with the transition model capturing the likelihood of one POS following another, and the sensor model reflecting the probability of a word being associated with a particular POS. The effectiveness of HMMs hinges on these probabilistic models derived from corpus counts and refined through smoothing techniques. 
- **Logistic Regression for POS Tagging:**  An alternative method, logistic regression allows for the incorporation of a rich set of features about words and their context, surpassing the HMM's limitations in capturing linguistic nuances. This model assigns probabilities to POS tags based on features like word identity, spelling patterns, and contextual information. 
- **Feature-rich Models:**  The ability of logistic regression to utilize a vast array of features, from the morphological characteristics of words to their positional information within a sentence, facilitates a more nuanced understanding and classification of words according to their parts of speech. 
- **Generative vs. Discriminative Models:**  While HMMs are generative models capable of producing random sequences of words and tags, logistic regression is a discriminative model focused on tagging given sequences. The choice between generative and discriminative approaches often depends on the specific requirements of the task, including the availability of training data and the need for speed or accuracy. 
- **Greedy and Beam Search Methods:**  For sequence classification tasks like POS tagging, strategies for traversing the sequence of words to assign tags vary in their trade-offs between speed and accuracy. Greedy search makes irreversible choices based on local maxima, while beam search and the Viterbi algorithm offer more comprehensive explorations of possible tag sequences, balancing computational efficiency with tagging accuracy.
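
To make the HMM approach concrete, here is a minimal sketch of the Viterbi algorithm for a toy tagger. The three-tag tagset and all probabilities are invented for illustration (in practice they would be estimated from a tagged corpus and smoothed), the function name and argument layout are assumptions, and the input sentence is assumed to be non-empty.

```python
def viterbi(words, tags, start_prob, transition_prob, emission_prob):
    """Return the most probable tag sequence for words under a toy HMM."""
    # best[i][tag] = (probability of the best path ending in tag at word i,
    #                 previous tag on that path)
    best = [{}]
    for tag in tags:
        emit = emission_prob[tag].get(words[0], 0.0)
        best[0][tag] = (start_prob[tag] * emit, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = emission_prob[tag].get(words[i], 0.0)
            best[i][tag] = max(
                (best[i - 1][prev][0] * transition_prob[prev][tag] * emit, prev)
                for prev in tags
            )
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

# Hypothetical model parameters, invented for illustration only.
tags = ["DET", "NOUN", "VERB"]
start_prob = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
transition_prob = {  # transition model: P(next tag | current tag)
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
emission_prob = {  # sensor model: P(word | tag)
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"duck": 0.5, "dog": 0.4, "swims": 0.1},
    "VERB": {"duck": 0.3, "swims": 0.7},
}

print(viterbi(["the", "duck", "swims"], tags, start_prob, transition_prob, emission_prob))
# ['DET', 'NOUN', 'VERB']
```

Although "duck" can be a noun or a verb in this toy model, the transition from a determiner strongly favors the noun reading. A greedy tagger would instead commit to the locally best tag at each word and could not revise it later.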
