# NLP Pipeline

 1. Data Acquisition
 2. Text Preparation
      - Text Cleanup
      - Basic Preprocessing
      - Advanced Preprocessing
 3. Feature Engineering
 4. Modeling
      - Model Building
      - Evaluation
 5. Deployment
      - Deployment
      - Monitoring
      - Model Update

1. **Data Acquisition**
    - Data is available inside the company
        - In hand
        - In a database
            - Coordinate with a data engineer
        - Too little data
            - Data augmentation (create synthetic data)
                - Synonym replacement
                - Bigram flip
                - Back translation
                - Adding noise
    - Data is available outside the company
        - Use a public dataset
            - Example: Kaggle
        - Use web scraping
            - For example, with BeautifulSoup
        - Use an API
            - Example: RapidAPI
            - Typically called with the `requests` library
            - Data is returned in JSON format
        - Data in PDF form
        - Data available as images
        - Data available as audio
            - Use a speech-to-text library
    - Data does not exist
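
Three of the augmentation techniques listed above (synonym replacement, bigram flip, and adding noise) can be sketched in plain Python. This is a minimal illustration, not a production recipe: the `SYNONYMS` table and the function names are hypothetical, and character deletion is only one of many possible kinds of noise. Back translation is left out because it needs a translation service.

```python
import random

# Hypothetical hand-written synonym table; a real project would likely use a thesaurus
SYNONYMS = {"good": ["great", "fine"], "movie": ["film"]}

def synonym_replace(tokens, rng):
    """Swap every token that has an entry in SYNONYMS for one of its synonyms."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]

def bigram_flip(tokens, rng):
    """Swap one randomly chosen pair of adjacent tokens."""
    flipped = list(tokens)
    if len(flipped) < 2:
        return flipped
    i = rng.randrange(len(flipped) - 1)
    flipped[i], flipped[i + 1] = flipped[i + 1], flipped[i]
    return flipped

def add_noise(tokens, rng, p=0.1):
    """With probability p, delete one random character from a token (simulated typo)."""
    noisy = []
    for token in tokens:
        if len(token) > 1 and rng.random() < p:
            j = rng.randrange(len(token))
            token = token[:j] + token[j + 1:]
        noisy.append(token)
    return noisy

rng = random.Random(0)  # fixed seed so the augmented samples are reproducible
tokens = "the movie was good".split()
print(synonym_replace(tokens, rng))
print(bigram_flip(tokens, rng))
print(add_noise(tokens, rng, p=0.5))
```

Each function returns a new list, so the original sample stays available and several augmented variants can be generated from it.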

2. **Text Preparation**
    - Text cleanup
        - HTML tag cleaning
        - Emoji cleaning (Unicode normalization)
        - Spell checking (using TextBlob)
    - Basic Preprocessing
        - Basic
            - Tokenization
                - Sentence tokenization
                - Word tokenization
        - Optional
            - Stop word removal
            - Stemming/lemmatization
            - Removing digits
            - Lowercasing/uppercasing
            - Language detection
    - Advanced Preprocessing
        - Part-of-speech (POS) tagging
        - Parsing
        - Coreference resolution

#### HTML tag cleaning

In [2]:
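import re

sample_text = '<h2>HTML Element</h2>'
print(sample_text)
print()

# Non-greedy pattern matching anything that looks like a tag, e.g. <h2> or </h2>
TAG_PATTERN = re.compile(r'<.*?>')

def striphtml(data):
    # Replace every tag with an empty string, keeping only the text content
    return TAG_PATTERN.sub('', data)

striphtml(sample_text)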

<h2>HTML Element</h2>



'HTML Element'
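
The regular expression works for simple markup, but it leaves entities such as `&amp;` untouched and keeps the contents of `<script>` and `<style>` blocks. As a hedged alternative, the sketch below uses the standard-library `html.parser` module. The class name `_TextExtractor` and the function `strip_html` are hypothetical, and skipping only `script` and `style` is an assumption about what counts as non-text.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text nodes while ignoring <script> and <style> contents."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the collected text and collapse runs of whitespace
        return " ".join("".join(self._chunks).split())

def strip_html(markup):
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()

print(strip_html('<h2>HTML Element</h2>'))
print(strip_html('<p>Tom &amp; Jerry</p><script>var x = 1;</script>'))
```

For scraped pages, a dedicated library such as BeautifulSoup is usually the more convenient choice; this sketch only shows what a parser-based approach adds over the regex.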

#### Unicode Normalization

In [3]:
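emoji_text = "✂️ Copy and 📋 Paste Emoji 👍"
# Encoding to UTF-8 exposes the raw bytes behind each emoji (inspection only, nothing is removed)
emoji_text.encode('utf-8')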

b'\xe2\x9c\x82\xef\xb8\x8f Copy and \xf0\x9f\x93\x8b Paste Emoji \xf0\x9f\x91\x8d'
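
The cell above only inspects the bytes. A minimal sketch of the cleaning step itself, using the standard-library `unicodedata` module, might look like this. The helper names `normalize_text` and `remove_emoji` are hypothetical, and treating every character in the Unicode category `So` (Symbol, other) as an emoji is an assumption: it also drops non-emoji symbols such as ©.

```python
import unicodedata

def normalize_text(text):
    """Apply NFKC normalization so compatibility characters map to their plain forms."""
    return unicodedata.normalize("NFKC", text)

def remove_emoji(text):
    """Drop emoji-like symbols, then collapse the whitespace they leave behind."""
    kept = []
    for ch in text:
        # U+FE0F (variation selector) and U+200D (zero-width joiner) glue emoji sequences together
        if ch in ("\ufe0f", "\u200d"):
            continue
        # Assumption: category "So" (Symbol, other) is a good-enough proxy for emoji
        if unicodedata.category(ch) == "So":
            continue
        kept.append(ch)
    return " ".join("".join(kept).split())

emoji_text = "\u2702\ufe0f Copy and \U0001F4CB Paste Emoji \U0001F44D"
print(remove_emoji(emoji_text))
# Ligatures and full-width letters are folded into ordinary ASCII letters
print(normalize_text("\ufb01ne \uff54\uff45\uff58\uff54"))
```

Whether emoji should be removed at all depends on the task; for sentiment analysis they often carry signal, so replacing them with text labels is a common alternative.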

#### Spell Checking

In [5]:
!pip install textblob



In [12]:
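from textblob import TextBlob

incorrect_text = "Hy how aree yoy , but i am not fine."
text_blob = TextBlob(incorrect_text)
# correct() replaces each word with its most likely spelling, which is not always
# the intended word (in the output below, "Hy" becomes "By")
text_blob.correct()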

TextBlob("By how are you , but i am not fine.")

#### Word Tokenization: Splitting a Sentence into Words

In [14]:
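import nltk
from nltk.tokenize import word_tokenize

# Download the Punkt tokenizer models (needed once; requires network access).
# Assumption: recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('punkt')

text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
print(word_tokenize(text))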

['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.', 'You', 'are', 'studying', 'NLP', 'article']




#### Sentence Tokenization: Splitting a Paragraph into Sentences

In [15]:
import nltk
from nltk.tokenize import sent_tokenize

# The Punkt models only need to be downloaded once per environment
nltk.download('punkt')
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
print(sent_tokenize(text))

['Hello everyone.', 'Welcome to GeeksforGeeks.', 'You are studying NLP article']
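
Three of the optional preprocessing steps listed earlier (lowercasing, removing digits, and stop word removal) can be sketched without any external library. The `STOP_WORDS` set here is a small hypothetical sample, not a complete list, and the function name `basic_preprocess` is made up for illustration. Stemming, lemmatization, and language detection are not covered because they need a library or a trained model.

```python
import re

# Small illustrative stop word list; libraries such as NLTK ship much fuller ones
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "in", "of", "and"}

def basic_preprocess(text):
    """Lowercase the text, strip digits, tokenize on word characters, drop stop words."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)      # remove digits
    tokens = re.findall(r"\w+", text)    # crude word tokenization, punctuation is discarded
    return [token for token in tokens if token not in STOP_WORDS]

print(basic_preprocess("You are studying 3 NLP articles in 2024"))
```

The order matters: lowercasing first lets the stop word lookup stay case-insensitive, and digits are removed before tokenization so that purely numeric tokens never appear.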




3. **Feature Engineering**
    - Feature engineering for machine learning
    - Feature engineering for deep learning
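
Feature engineering turns text into numbers that a model can consume. As one illustrative example (the notes above do not name a specific technique, so bag-of-words is an assumption), the sketch below builds count vectors by hand. The function names are hypothetical.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map every distinct lowercase token to a column index, in sorted order."""
    vocab = sorted({token for doc in documents for token in doc.lower().split()})
    return {token: index for index, token in enumerate(vocab)}

def bag_of_words(document, vocabulary):
    """Return a count vector for one document; tokens outside the vocabulary are ignored."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector

docs = ["NLP is fun", "NLP is hard but fun"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Hand-built vectors like these are typical inputs for classical ML models, whereas deep learning models usually learn dense representations (embeddings) from the tokens themselves.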

4. **Modeling**
    - Model selection (based on the amount of data and the nature of the problem)
        - Heuristic approach
            - Used when little data is available (example: spam classifier)
        - ML algorithms
            - Used when a moderate amount of data is available
        - DL algorithms
            - Used when a large amount of data is available (example: BERT)
        - Cloud API
    - Evaluation (how well does the model perform?)
        - Intrinsic
        - Extrinsic
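
To make the heuristic approach and intrinsic evaluation concrete, here is a minimal sketch of a keyword-based spam classifier scored with accuracy, precision, and recall. The keyword set, the threshold, and the five labeled messages are all made up for illustration.

```python
# Hypothetical keyword list; a real heuristic would be tuned on actual messages
SPAM_KEYWORDS = {"free", "winner", "prize", "urgent"}

def is_spam(message, threshold=2):
    """Flag a message as spam when it contains at least `threshold` distinct keywords."""
    words = {word.strip(".,!?:;").lower() for word in message.split()}
    return len(words & SPAM_KEYWORDS) >= threshold

def evaluate(y_true, y_pred):
    """Intrinsic evaluation: compare predictions against gold labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Made-up labeled examples: (message, is it really spam?)
labeled = [
    ("URGENT: claim your free prize now!", True),
    ("Are you free for lunch tomorrow?", False),
    ("You are a winner! Free entry inside", True),
    ("Meeting moved to 3pm", False),
    ("Last chance to collect your prize", True),
]
y_true = [label for _, label in labeled]
y_pred = [is_spam(message) for message, _ in labeled]
print(evaluate(y_true, y_pred))
```

Metrics like these are intrinsic because they score the model directly against labels. Extrinsic evaluation would instead measure the effect on the end task, for example how many fewer spam messages users report after the filter goes live.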

5. **Deployment**
    - Deployment
        - Use an API for deployment
        - Chatbot
    - Monitoring
    - Model update
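
Serving a model behind an API can be sketched with nothing but the standard library. The example below is a minimal WSGI application, assuming a JSON request body of the form `{"text": "..."}`. The `predict` function is a hypothetical placeholder for a trained model, and the host, port, and response fields are arbitrary choices for illustration.

```python
import json

def predict(text):
    # Placeholder for a trained model; hypothetical keyword rule for illustration only
    return "spam" if "free" in text.lower() else "ham"

def _json_response(start_response, status, payload):
    """Serialize payload as JSON and send it with the given HTTP status."""
    body = json.dumps(payload).encode("utf-8")
    headers = [("Content-Type", "application/json"), ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]

def app(environ, start_response):
    """WSGI application: POST {"text": "..."} and receive {"label": "..."}."""
    if environ.get("REQUEST_METHOD") != "POST":
        return _json_response(start_response, "405 Method Not Allowed", {"error": "use POST"})
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length))
        text = payload["text"]
        if not isinstance(text, str):
            raise TypeError("text must be a string")
    except (ValueError, KeyError, TypeError):
        return _json_response(start_response, "400 Bad Request",
                              {"error": "expected JSON with a string 'text' field"})
    return _json_response(start_response, "200 OK", {"label": predict(text)})

def serve(host="127.0.0.1", port=8000):
    """Run the app with the reference WSGI server (suitable for local testing only)."""
    from wsgiref.simple_server import make_server
    with make_server(host, port, app) as server:
        server.serve_forever()
```

Calling `serve()` starts a local server. In a real deployment the same `app` callable would normally run behind a production WSGI server, with request logging in place to support the monitoring and model update steps.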