# **01 - NLP Introduction**

---

### What is NLP?

**Natural Language Processing (NLP)** is a subfield of computer science and Artificial Intelligence (AI) focused on the interaction between computers and human (natural) languages. Its **ultimate goal is to enable computers to understand, interpret, generate, and respond to human language** in a way that is both meaningful and useful.

With the rise of modern AI technologies, NLP has become **one of the fastest-growing and most impactful areas in AI**, driving innovations on multiple fronts.
<br>
<br>

---

### What Are NLP Applications?


NLP is commonly used to transform free-form (unstructured) text from documents, conversations, or databases into structured data suitable for analysis or interaction. Common applications include:

- Text mining and analytics
- Text generation (e.g., ChatGPT)
- Machine translation (e.g., Google Translate)
- Chatbots and virtual assistants
- Search engines and autocomplete
- Text classification (e.g., Sentiment Analysis)
- Speech recognition
- And more...
<br>
<br>
---

### The NLP Pipeline

The NLP pipeline is a series of steps that processes raw text data and turns it into meaningful information that can be used for the tasks listed above. At a high level, the pipeline consists of three main stages: **corpus preparation, feature engineering, and task-specific modeling**.

1. **Corpus Preparation**

    The first stage of the NLP pipeline is collecting and preparing the corpus: a large collection of text, drawn from one or more domains, that will be used for training or testing. A corpus is often labeled or structured to support a specific NLP task. Corpus preparation may include:

    - Data Collection: Gathering text data from various sources, such as books, websites, or social media.

    - Text Cleaning: Removing irrelevant information, like HTML tags or unnecessary symbols.

    - Text Preprocessing: Steps like tokenization, lowercasing, lemmatization, and removing stop words.
<br>
<br>
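
As a rough illustration of the cleaning and preprocessing steps above, here is a minimal sketch that uses only the Python standard library. The function names (`clean_text`, `preprocess`) and the tiny stop-word list are assumptions made for this example. Lemmatization is left out because it typically needs a dedicated library such as NLTK or spaCy.

```python
import re
from html import unescape

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "on"}


def clean_text(raw: str) -> str:
    """Strip HTML tags, decode HTML entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    decoded = unescape(no_tags)
    return re.sub(r"\s+", " ", decoded).strip()


def preprocess(raw: str) -> list[str]:
    """Clean, lowercase, tokenize, and drop stop words."""
    text = clean_text(raw).lower()
    # Simple regex tokenizer: runs of letters, digits, or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text)
    return [tok for tok in tokens if tok not in STOP_WORDS]


tokens = preprocess("<p>The cats &amp; dogs are playing in the garden!</p>")
print(tokens)  # ['cats', 'dogs', 'playing', 'garden']
```

The regex tokenizer is deliberately simple; production pipelines usually rely on a library tokenizer that handles punctuation, contractions, and non-English text more carefully.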

2. **Feature Engineering**

    Feature engineering refers to the process of transforming raw text data into numerical features that Machine Learning (ML) models can understand. Text data is inherently unstructured, so transforming it into structured data involves the following techniques:

    - Vectorization: Converting text into numerical representations, such as bag-of-words, TF-IDF (Term Frequency-Inverse Document Frequency), or word embeddings (e.g., Word2Vec or GloVe).

    - N-grams: Using sequences of n consecutive words (bigrams, trigrams) to capture context in a sentence.

    - Contextual Features: Adding information such as sentence length, word frequency, part-of-speech tags, or named entity features.

    - Dimensionality Reduction: Reducing the number of features (if necessary) to make the model more efficient without losing significant information.
<br>
<br>
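
To make the vectorization and n-gram ideas above more concrete, the following sketch computes bigrams and a basic TF-IDF weighting in plain Python. The helper names (`ngrams`, `tf_idf`) and the toy documents are assumptions for illustration. The weighting uses the textbook form tf = count / document length and idf = log(N / df); libraries such as scikit-learn apply smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams (as tuples) over consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Compute unsmoothed TF-IDF weights for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy, made-up documents (already tokenized).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "dog", "barked"],
]

print(ngrams(docs[0], 2))  # [('the', 'cat'), ('cat', 'sat')]

weights = tf_idf(docs)
print(weights[0])  # "the" gets weight 0.0 because it appears in every document
```

Note how "cat", which occurs in only one document, receives a higher weight than "sat", which occurs in two: TF-IDF favors terms that distinguish a document from the rest of the corpus.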

3. **Task-Specific Modeling**

    Once the data has been prepared and features have been extracted, the final step is applying a model to solve a specific NLP task. The approach depends on the task's complexity and the type of data being processed.

    - Heuristic Approach: 
        - Rule-based methods, such as regular expressions or keyword lists. They are applied when little data is available, and are often used to gather or label data for machine learning or deep learning models.

    - Machine Learning Approach:
        - Naive Bayes: Used in document classification tasks, such as sentiment analysis or spam filtering.
        - Support Vector Machine: Frequently used for text classification tasks, including sentiment analysis or topic classification.
        - Hidden Markov Model: Commonly applied in speech recognition, part-of-speech tagging, and named entity recognition.
        - Conditional Random Field: Used for tasks like named entity recognition, part-of-speech tagging, and information extraction.

    - Deep Learning Approach:
    
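        - Recurrent Neural Networks (RNN): Process text one token at a time while carrying a hidden state, which suits sequence tasks such as machine translation, speech recognition, sentiment analysis, and summarization.
        - Long Short-Term Memory (LSTM): An RNN variant whose gates and memory cell help it retain information over longer sequences and mitigate the vanishing gradient problem.
        - Gated Recurrent Unit (GRU): A gated RNN variant that is simpler than the LSTM, with fewer gates and parameters.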
<br>
<br>
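
As one possible illustration of the machine learning approach above, here is a minimal multinomial Naive Bayes sentiment classifier with add-one (Laplace) smoothing, written in plain Python. The `NaiveBayes` class and the tiny training set are made up for this sketch; in practice a library implementation (for example scikit-learn's `MultinomialNB`) and a real labeled corpus would be used. Words never seen during training are simply skipped at prediction time, which is one common simplification.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Score = log P(label) + sum of log P(word | label).
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                if tok not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data purely for illustration.
train_docs = [
    "great movie loved it".split(),
    "wonderful acting great plot".split(),
    "terrible movie hated it".split(),
    "boring plot awful acting".split(),
]
train_labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("loved the great acting".split()))  # pos
print(model.predict("awful boring movie".split()))      # neg
```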
---

### The Challenges of NLP

Despite impressive progress, NLP still faces many challenges, mainly:
<br>
<br>
1. **Ambiguity**: Words and sentences in natural language can have multiple meanings depending on context.
    - *Example:* "He promised to give her dog food." -> Is he giving food to her dog, or giving dog food to her?
<br>
<br>
2. **Language Variability**: The same idea can be expressed in many different ways, and NLP systems must recognize that these variants mean the same thing.
    - *Example 1 (paraphrase):* "The weather is nice today." vs. "It's a beautiful day."
    - *Example 2 (multiple expressions):* "Turn off the lights." vs. "Can you kill the lights?"
<br>
<br>
3. **Generalization**: NLP models are typically trained on a specific corpus. When used in a different domain, they may encounter unfamiliar words or sentence structures.
    - This leads to issues with **Out-of-Domain (OOD)** data and **Out-of-Vocabulary (OOV)** words.
    - *Example:* A chatbot trained on customer service emails might struggle with medical terminology or legal jargon.
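
The OOV problem described above can be sketched in a few lines of Python. The helper names (`build_vocab`, `replace_oov`, `oov_rate`), the `<unk>` placeholder, and the example sentences are all assumptions for illustration; mapping unknown words to a placeholder token is one common workaround, and subword tokenization is another.

```python
def build_vocab(corpus):
    """Collect the set of tokens seen in the training corpus."""
    return {tok for doc in corpus for tok in doc}


def replace_oov(tokens, vocab, unk="<unk>"):
    """Map tokens outside the vocabulary to a placeholder token."""
    return [tok if tok in vocab else unk for tok in tokens]


def oov_rate(tokens, vocab):
    """Fraction of tokens that are not present in the vocabulary."""
    if not tokens:
        return 0.0
    return sum(tok not in vocab for tok in tokens) / len(tokens)


# Hypothetical in-domain (customer service) training sentences.
train = [
    "please reset my password".split(),
    "my order has not arrived".split(),
]
vocab = build_vocab(train)

# Hypothetical out-of-domain (medical) input.
query = "my prescription has not arrived".split()
print(replace_oov(query, vocab))  # ['my', '<unk>', 'has', 'not', 'arrived']
print(oov_rate(query, vocab))     # 0.2
```

A high OOV rate on new inputs is a quick signal that the data comes from a domain the training corpus did not cover.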
