# <center>NLP PIPELINES</center>

## What is an NLP Pipeline?

An NLP (Natural Language Processing) pipeline is the **set of steps** followed to build an end-to-end NLP application. Each stage transforms the data a little further, taking raw text through systematic processing until it yields predictions or actionable insights.

- An NLP pipeline consists of the following steps:
    - [1. Data Acquisition](#1-data-acquisition)
    - [2. Text Preparation](#2-text-preparation)
        - Text Clean-Up: basic steps such as removing HTML tags and emojis and correcting spelling errors
        - Basic Preprocessing: tokenization
        - Advanced Preprocessing: part-of-speech tagging, chunking, co-reference resolution
    - [3. Feature Engineering](#3-feature-engineering)
        - Bag of words
        - TF-IDF
        - One Hot Encoding
        - Word2Vec
    - [4. Modelling](#4-modelling)
        - Model Building
        - Evaluation
    - [5. Deployment](#5-deployment)
        - Deployment on any cloud service
        - Monitoring
        - Model Update 

## 1. Data Acquisition

**Objective**: 

Collect and gather relevant data for analysis. This step is crucial as the quality and quantity of data can significantly impact the performance of the NLP model.


<br>

**Top 3 Data Sources**:

1. **Internal Data (available within the organization)**
    - **Available on Desktop**: If the data is already on a local system, it can be accessed directly for analysis.
        - Move on to the next step of the pipeline.
    - **On Database/Company Systems**: Request data from the data engineering team if it resides in company databases or data warehouses.
    - **Data Augmentation**: If data is limited, generate synthetic (fake) data using techniques such as:
        - **Synonyms**: Replace words with their synonyms.
        - **Bigram Flip**: Swap the two words of a bigram (an adjacent word pair) to create variations.
        - **Back Translation**: Translate text to another language and then back to the original language to create paraphrased data.
        - **Adding Noise**: Introduce random noise (e.g., typos or dropped words) into the text to make the model more robust.


2. **External Data (owned by someone else)**
    - **Public Datasets**: Utilize datasets available on platforms like Kaggle, academic institutions, or data repositories.
    - **Web Scraping**: Extract data from websites using tools like BeautifulSoup.
    - **Web APIs**: Access data through APIs provided by services such as RapidAPI.
    - **PDF/Image/Audio**: Use OCR (Optical Character Recognition) for scanned PDFs and images, and speech-to-text techniques for audio data.


3. **No Existing Data**
    No data exists on the topic yet, so it has to be collected from scratch.
    - **Customer Reviews**: Collect feedback from loyal customers or stakeholders to gather initial data.
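
The augmentation techniques listed under Internal Data can be sketched with the standard library alone. The function names (`synonym_replace`, `bigram_flip`, `add_noise`) and the tiny synonym table are hypothetical and exist only for illustration; a real project would more likely take synonyms from a thesaurus. Back translation is left out because it needs a translation model or service.

```python
import random

# Hypothetical, hand-written synonym table used only for this illustration.
SYNONYMS = {"good": ["great", "fine"], "movie": ["film"]}


def synonym_replace(tokens, rng):
    """Swap each token that has an entry in SYNONYMS for a random synonym."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]


def bigram_flip(tokens, rng):
    """Pick one adjacent pair of tokens at random and swap the two words."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    flipped = list(tokens)
    flipped[i], flipped[i + 1] = flipped[i + 1], flipped[i]
    return flipped


def add_noise(tokens, rng, p=0.1):
    """Drop each token with probability p (one simple form of noise)."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept or list(tokens)  # never return an empty sentence


rng = random.Random(0)  # fixed seed so repeated runs give the same variants
tokens = "this is a good movie".split()
print(synonym_replace(tokens, rng))
print(bigram_flip(tokens, rng))
print(add_noise(tokens, rng))
```

Each function returns a new token list, so several variants of one sentence can be generated by calling them repeatedly with the same `rng`.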

## 2. Text Preparation

**Objective**: 

Process and clean text data to make it suitable for feature extraction and modeling.


<br>


Three kinds of work are performed in this stage:
1. **Basic Clean-Up**:
    - **HTML Tag Cleaning**: Remove HTML tags from web-scraped content.
    - **Removing Emojis**: Eliminate emojis and other non-standard characters.
    - **Checking Spelling**: Correct spelling errors to avoid confusion in further analysis.

2. **Basic Text Preprocessing**:
    - **Basic / Fundamental**
        - **Tokenization**: Split text into tokens.
            - Word
            - Sentence
    - **Optional** 
        - **Stop Word Removal**: Remove common words that add little value (e.g., "the," "and").
        - **Stemming/Lemmatization**: Reduce words to their base or root form.
        - **Removing Punctuation/Digits**: Eliminate punctuation and numeric values if not relevant.
        - **Lowercasing/Uppercasing**: Convert text to a consistent case.
        - **Language Detection**: Identify and process text based on language.

        
3. **Advanced Text Preprocessing**:
    - **Parts of Speech Tagging**: Assign grammatical categories to each word (e.g., noun, verb).
    - **Parsing**: Analyze sentence structure to understand syntactic relationships.
    - **Co-reference Resolution**: Identify and link pronouns to their corresponding entities (e.g., "he" to "Avinash").

In [1]:
# Removing HTML Tags:

# Sample HTML data with various HTML tags
html_data = "<p>Hello my name is <strong>Avinash Yadav</strong>. Below is the sample code for <em>HTML Tag cleaning</em> using regular expression.</p>"

# Import the 're' module for regular expressions
import re

# Define a function to remove HTML tags from a string
def stripHTML(data):
    # Compile a regular expression pattern to match HTML tags
    # The pattern '<.*?>' matches any text between '<' and '>', including the tags
    p = re.compile(r'<.*?>')
    
    # Substitute (remove) all HTML tags in the input data with an empty string
    # This effectively removes all the HTML tags from the string
    return p.sub('', data)

stripHTML(html_data)

'Hello my name is Avinash Yadav. Below is the sample code for HTML Tag cleaning using regular expression.'
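
The regular expression above works for simple markup, but it leaves character entities such as `&amp;` untouched. As an alternative sketch, the standard library's `html.parser.HTMLParser` can collect only the text nodes. The class name `TextExtractor` and the helper `strip_html_parser` are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags themselves."""

    def __init__(self):
        # convert_charrefs=True (the default) decodes entities like &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_html_parser(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_html_parser("<p>Tom &amp; <strong>Jerry</strong></p>"))
```

Note that this sketch also keeps the contents of `<script>` and `<style>` elements, so scraped pages may need those filtered out separately.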

In [2]:
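# Inspecting emojis as UTF-8 bytes
# (this cell only encodes the text to show how emojis are stored; it does not remove them)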

# Sample string containing emojis and various Unicode characters
emoji_data = 'Hello 👋 my name is Avinash Yadav 🙂. Below 👇 is the sample code for removing emojis 🫢 using unicode normalization. ↗️'

# Encode the string in UTF-8 format
encoded_data = emoji_data.encode('utf-8')

# Print the encoded data to show its byte representation
print(encoded_data)

b'Hello \xf0\x9f\x91\x8b my name is Avinash Yadav \xf0\x9f\x99\x82. Below \xf0\x9f\x91\x87 is the sample code for removing emojis \xf0\x9f\xab\xa2 using unicode normalization. \xe2\x86\x97\xef\xb8\x8f'
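
The bytes above show that every emoji is a multi-byte, non-ASCII sequence. One simple way to actually remove them, sketched here under the assumption that the text is English and losing all non-ASCII characters is acceptable, is to normalize the string and then discard whatever cannot be encoded as ASCII. The function name `strip_non_ascii` is hypothetical.

```python
import re
import unicodedata


def strip_non_ascii(text):
    """Normalize to NFKD, drop every non-ASCII character, tidy whitespace."""
    # NFKD splits accented letters into a base letter plus combining marks,
    # so 'é' keeps its 'e' when the non-ASCII marks are discarded below.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Removing emojis leaves doubled spaces behind; collapse them.
    return re.sub(r"\s+", " ", ascii_only).strip()


print(strip_non_ascii("Hello 👋 café 🙂"))
```

This approach is unsuitable for text in non-Latin scripts, which would be deleted entirely; in that case a pattern that targets only the emoji code point ranges is a safer choice.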


In [3]:
# Sample text with spelling errors
incorrect_data = 'Hlo! Blow is the sampl code for chking spellings. What is ur name? How r u?'

# Import the TextBlob library for natural language processing tasks
from textblob import TextBlob

# Create a TextBlob object with the incorrect_data string
blob = TextBlob(incorrect_data)

# Use the correct() method to correct spelling errors in the text.
# The correction is a best guess per word, so it can pick the wrong word
# (see "choking swellings" in the output below).
corrected_text = blob.correct()

# Print the corrected text
print(corrected_text)

Lo! Low is the sample code for choking swellings. That is or name? Now r u?


In [4]:
# Sample text to be tokenized
dummy_data = "Lorem Ipsum. Lorem Ipsum. Lorem Ipsum. Lorem Ipsum."

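# Import the tokenization functions from the nltk library
# (they need NLTK's Punkt tokenizer data, e.g. nltk.download('punkt');
# newer NLTK releases ask for 'punkt_tab' instead)
from nltk.tokenize import sent_tokenize, word_tokenize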

# Tokenize the text into sentences
sents = sent_tokenize(dummy_data)
print(sents)

# Tokenize the text into words
words = word_tokenize(dummy_data)
print(words)

['Lorem Ipsum.', 'Lorem Ipsum.', 'Lorem Ipsum.', 'Lorem Ipsum.']
['Lorem', 'Ipsum', '.', 'Lorem', 'Ipsum', '.', 'Lorem', 'Ipsum', '.', 'Lorem', 'Ipsum', '.']
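
The optional preprocessing steps listed earlier (lowercasing, removing punctuation and digits, stop word removal) can be sketched without any external library. The stop-word set below is a hypothetical mini list for illustration, the whitespace split stands in for a real tokenizer, and `basic_preprocess` is an invented helper name.

```python
import string

# Hypothetical mini stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "is", "a", "of"}


def basic_preprocess(text):
    """Lowercase, strip punctuation and digits, then drop stop words."""
    lowered = text.lower()
    # str.translate with a deletion table removes punctuation and digits.
    table = str.maketrans("", "", string.punctuation + string.digits)
    cleaned = lowered.translate(table)
    # Whitespace split is a stand-in for a proper word tokenizer.
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


print(basic_preprocess("The cat sat on the mat, and the dog barked 2 times!"))
```

Whether each step helps depends on the task: digits matter for price extraction, and case can matter for named entities, which is why the notes mark these steps as optional.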


## 3. Feature Engineering

**Objective**: 

Convert text data into numerical features that can be used for modelling.

(Feature engineering is the step that converts text into numbers; in other words, it extracts the input columns from the text.)

Features are the input columns of a machine learning model.



<br>



**Common Techniques**:
- **Bag of Words (BoW)**: Represent each document as a vector of word counts, ignoring word order.
- **TF-IDF (Term Frequency-Inverse Document Frequency)**: Weigh words based on their importance in a document relative to a corpus.
- **One-Hot Encoding**: Represent each word or token as a binary vector with a single 1 at that word's vocabulary index.
- **Word2Vec**: Map words to dense vector representations capturing semantic meaning.
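
The first three techniques are small enough to sketch in plain Python. The three-document corpus and the function names are made up for illustration, and the TF-IDF variant shown is the plain unsmoothed formula, tf × log(N / df); libraries such as scikit-learn apply smoothing by default, so their numbers will differ. Word2Vec is omitted because it requires training a model.

```python
import math
from collections import Counter

# Toy corpus, invented for this example.
docs = ["the cat sat", "the dog sat", "the dog barked"]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})


def bag_of_words(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def one_hot(word):
    """Binary vector with a single 1 at the word's vocabulary index."""
    return [1 if word == v else 0 for v in vocab]


def tf_idf(tokens):
    """Unsmoothed TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(tokenized)
    counts = Counter(tokens)
    vector = []
    for word in vocab:
        tf = counts[word] / len(tokens)
        df = sum(1 for doc in tokenized if word in doc)
        vector.append(tf * math.log(n_docs / df))
    return vector


print(vocab)
print(bag_of_words(tokenized[0]))
print(one_hot("dog"))
print([round(x, 3) for x in tf_idf(tokenized[0])])
```

In the TF-IDF output, "the" scores 0 because it appears in every document, while "cat" scores highest because it appears in only one.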

## 4. Modelling


**Objective**: 

Apply machine learning or deep learning algorithms to the processed data to build models, then evaluate how well those models perform.


<br>


There are 2 steps in this stage:
1. **Model Building / Modelling**:

    The choice of model depends on *1. the amount of available data* and *2. the nature of the problem*.
    - **Heuristic Approaches**: Simple rule-based models or algorithms based on predefined rules.
    - **ML Algorithms**: Machine learning techniques like Naive Bayes, SVM, etc.
    - **DL Algorithms**: Deep learning methods like neural networks, RNNs, etc.
    - **Cloud APIs**: Utilize pre-built models from cloud services (e.g., Google Cloud Natural Language, Amazon Comprehend).


2. **Evaluation**:
    - **Intrinsic Evaluation**: Assess model performance using metrics like accuracy, precision, recall, F1 score.
    - **Extrinsic Evaluation**: Evaluate the model's effectiveness based on its impact or performance in real-world applications.
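
As a minimal sketch of both steps, the block below builds a heuristic (rule-based) sentiment classifier and scores it with the intrinsic metrics named above. The keyword set, the labelled sentences, and the function names are all invented for illustration.

```python
# Hypothetical keyword list for a rule-based (heuristic) sentiment classifier.
POSITIVE_WORDS = {"good", "great", "love"}


def rule_based_predict(text):
    """Predict 1 (positive) if any positive keyword appears, else 0."""
    return int(any(tok in POSITIVE_WORDS for tok in text.lower().split()))


def evaluate(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labelled examples, purely for illustration.
texts = ["good film", "great cast", "boring plot", "not good", "loved it"]
y_true = [1, 1, 0, 0, 1]
y_pred = [rule_based_predict(t) for t in texts]
print(y_pred)
print(evaluate(y_true, y_pred))
```

The two mistakes ("not good" and "loved it") show where simple rules break down, on negation and on inflected word forms, which is the usual reason to move on to ML or DL models once enough labelled data is available.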

## 5. Deployment

**Objective**: 

Deploy the NLP model into a production environment and ensure its continued performance and relevance.


<br>


The 3 main stages are:
1. **Deployment**:
    - **Cloud Services**: Deploy the model on platforms like AWS, Google Cloud, or Azure.

2. **Monitoring**:
    - **Performance Tracking**: Continuously monitor model performance and accuracy.
    - **Error Analysis**: Identify and address any issues or drifts in performance.

3. **Model Update**:
    - **Retraining**: Update and retrain the model periodically with new data.
    - **Versioning**: Maintain versions of the model to manage changes and improvements.
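
Performance tracking can be as simple as keeping a rolling accuracy over recent predictions whose true labels have since become known. The class below is a hypothetical sketch, not tied to any particular cloud service; the window size and threshold are arbitrary example values.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Tracks accuracy over the last `window` labelled predictions."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # old results fall off the left
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store whether one prediction matched its true label."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None  # nothing observed yet
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """True once the rolling accuracy falls below the threshold."""
        acc = self.accuracy()
        return acc is not None and acc < self.threshold


monitor = RollingAccuracyMonitor(window=4, threshold=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_retraining())
```

A signal like `needs_retraining()` is one possible trigger for the Model Update step: retrain on fresh data, then release the result as a new model version.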