# ✅ **NLP Pipeline**

An NLP pipeline is the sequence of steps followed to build end-to-end NLP software.  
It consists of the following stages:

1. **Data Acquisition**  
2. **Text Preparation**  
   - i. Text Cleanup  
   - ii. Basic Preprocessing  
   - iii. Advanced Preprocessing  
3. **Feature Engineering**  
4. **Modeling**  
   - i. Modeling Approaches  
   - ii. Evaluation  
5. **Deployment**  
   - i. Deployment  
   - ii. Monitoring  
   - iii. Model Update  

---

## **1. Data Acquisition**
Depending on where the data comes from, acquisition falls into one of three stages:

### **Stage i: Data Already Available**  
The data is already present within the organization.  

a. **Data in Files**  
   - Present in files (CSV, Excel, etc.).  
   - Ready for direct use in NLP tasks.  

b. **Data in Database**  
   - Stored in the company’s internal database.  
   - Requires the **Data Engineering team** to extract.  

c. **Insufficient Data**  
   - The available data is too small to train on.  
   - Apply **Data Augmentation** techniques to create extra examples:  
     - Synonym replacement  
     - Bigram flip (swap adjacent words)  
     - Back translation  
     - Noise injection (e.g., random typos or character swaps)  
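
As a rough illustration of two of the augmentation techniques listed above, the sketch below applies synonym replacement and a bigram flip to a tokenized sentence. The `SYNONYMS` table and the function names are hypothetical and exist only for this example; a real project would more likely draw synonyms from a thesaurus resource.

```python
import random

# Hypothetical, hand-written synonym table used only for illustration.
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film"],
}


def synonym_replace(tokens, rng):
    """Replace every token that has an entry in SYNONYMS with a random synonym."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]


def bigram_flip(tokens, rng):
    """Swap one randomly chosen pair of adjacent tokens."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    flipped = list(tokens)
    flipped[i], flipped[i + 1] = flipped[i + 1], flipped[i]
    return flipped


rng = random.Random(0)  # fixed seed so repeated runs give the same output
tokens = "this is a good movie".split()
augmented_syn = synonym_replace(tokens, rng)
augmented_flip = bigram_flip(tokens, rng)
```

Each call yields one new variant of the sentence, so running the functions several times with different seeds grows a small dataset. Augmented examples should be checked by hand, since a flip or a synonym can change the meaning.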

---

### **Stage ii: Data Available Through Others**  
When data is not in our system but can be collected externally.  

a. **Public Datasets**  
   - Freely available (e.g., Kaggle, UCI ML repository).  
   - Can be directly downloaded.  

b. **Web Scraping**  
   - Extract data from websites.  
   - Tools: **BeautifulSoup, Scrapy**.  

c. **APIs**  
   - External APIs provide structured data.  
   - Responses are usually **JSON**, fetched in Python with the `requests` library.  
   - Example: **RapidAPI** (an API marketplace).  

d. **Files (Unstructured Sources)**  
   - PDF → Text Extraction  
   - Image → OCR (convert image to text)  
   - Audio → Speech-to-Text  

---

### **Stage iii: Data Not Available at All**  
When no dataset exists, we must create our own.  

a. **Surveys & Questionnaires**  
   - Design forms to collect responses.  
   - Good for domain-specific datasets.  

b. **Interviews / Feedback Forms**  
   - Directly ask users/customers for inputs.  

c. **Manual Data Collection**  
   - Collect text manually (e.g., product reviews, support tickets).  

d. **Crowdsourcing**  
   - Distribute tasks to many people online.  
   - Platforms: **Amazon Mechanical Turk, Appen**.  
   - Useful for labeling or generating text data.  

---

## **2. Text Preparation**
It has three stages:

### **i. Text Cleanup**  
Remove unwanted/irrelevant parts of text.  
- HTML/Tag Cleaning → remove tags such as `<p>` and `<br>`.  
- Emoji Removal/Handling → remove or replace emojis.  
- Spelling Check → correct typos and spelling mistakes.  
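
The first two cleanup steps above can be sketched with the standard library alone. This is a minimal version under stated assumptions: the tag regex is a rough approximation that suits simple markup (a parser such as BeautifulSoup is safer for real pages), the emoji pattern covers only a few common Unicode blocks, and `clean_text` is a name chosen for this example. Spelling correction is left out because it needs a dictionary or a dedicated library.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
# Covers a few common emoji blocks only; not an exhaustive emoji definition.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)


def clean_text(raw):
    text = TAG_RE.sub(" ", raw)      # drop tags such as <p> and <br>
    text = html.unescape(text)       # turn entities like &amp; back into characters
    text = EMOJI_RE.sub("", text)    # remove emojis in the covered ranges
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
```

Whether to delete emojis or replace them with a text label depends on the task: for sentiment analysis they often carry signal worth keeping.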

---

### **ii. Basic Preprocessing**  
Prepare text for modeling. Steps fall into two levels: *basic* (always applied) and *optional* (task-dependent).  

- **Basic**  
  - Tokenization → split text into sentences or words.  

- **Optional**  
  - Stopword Removal → remove common words like *is, the, and*.  
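  - Stemming → strip suffixes with heuristic rules to reach a root form (*running → run*); the result is not always a real word (*studies → studi*).  
  - Lemmatization → reduce words to their dictionary form using vocabulary and part of speech (*better → good*).  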
  - Removing Digits & Punctuation.  
  - Lowercasing → unify text format.  
  - Language Detection → identify the language of text.  
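
A minimal sketch of several of these steps chained together (lowercasing, digit and punctuation removal, tokenization, stopword removal). The `STOPWORDS` set is a tiny illustrative list and `preprocess` is a name chosen for this example; libraries such as NLTK or spaCy provide fuller stopword lists, stemmers, and lemmatizers.

```python
import re

# Tiny illustrative stopword list, not a complete one.
STOPWORDS = {"is", "the", "and", "a", "an", "of", "to"}


def preprocess(text, remove_stopwords=True):
    text = text.lower()                      # lowercasing
    # Keep only a-z and whitespace: drops digits and punctuation,
    # and also any non-ASCII letters, so this suits English text only.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()                    # whitespace word tokenization
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens
```

For example, `preprocess("The movie is GREAT, and I watched it 2 times!")` returns `['movie', 'great', 'i', 'watched', 'it', 'times']`. The order of steps matters: lowercasing first lets the lowercase stopword list match "The".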

---

### **iii. Advanced Preprocessing**  
Deeper linguistic analysis.  
- POS Tagging → identify parts of speech (noun, verb, adjective).  
- Parsing → analyze grammatical structure of sentences.  
- Coreference Resolution → resolve references (e.g., *“Alice said she is happy” → she = Alice*).  

---

## **3. Feature Engineering**  
The approach differs between two pipelines:

### **i. Machine Learning (ML) Pipeline**  
Classical ML algorithms operate on numeric vectors, not raw text, so features must be **created manually**.  

**Common Techniques:**  
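- Bag of Words (BoW) → represent each document by how often each vocabulary word occurs, ignoring word order.  
- TF-IDF (Term Frequency – Inverse Document Frequency) → give more weight to words that are frequent in a document but rare across the corpus.  
- n-grams → capture sequences of *n* adjacent words (e.g., the bigram *"good movie"*).  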
- Word Embeddings (pre-trained) → Word2Vec, GloVe.  
- Custom Features → sentence length, hashtags count, sentiment score, etc.  

- In the ML pipeline → **manual feature extraction is critical**.  
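
To make Bag of Words, TF-IDF, and n-grams concrete, here is a small from-scratch sketch on a three-document toy corpus. It assumes the plain textbook weighting, tf = count / document length and idf = log(N / df); libraries may use other variants (scikit-learn's `TfidfVectorizer`, for instance, applies a smoothed IDF and L2 normalization by default), so their numbers will differ. The corpus and function names are made up for illustration.

```python
import math
from collections import Counter

docs = [
    "good movie",
    "bad movie",
    "good good acting",
]
tokenized = [d.split() for d in docs]
# Sorted vocabulary: ['acting', 'bad', 'good', 'movie']
vocab = sorted({t for doc in tokenized for t in doc})


def bow_vector(tokens):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def tfidf_vector(tokens):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vec = []
    for term in vocab:
        tf = counts[term] / len(tokens)
        df = sum(1 for doc in tokenized if term in doc)
        vec.append(tf * math.log(n_docs / df))
    return vec


def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bow = [bow_vector(doc) for doc in tokenized]
tfidf = [tfidf_vector(doc) for doc in tokenized]
```

In the second document, "bad" appears in only one document while "movie" appears in two, so "bad" receives the higher TF-IDF weight even though both occur once.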

---

### **ii. Deep Learning (DL) Pipeline**  
In DL, the model itself **learns features automatically** from raw text.  

**Common Techniques:**  
- Word Embeddings → an embedding layer learned during training, optionally initialized from pre-trained Word2Vec, GloVe, or FastText vectors.  
- Contextual Embeddings → **ELMo, BERT, GPT embeddings**.  
- End-to-End Learning → raw text → embedding layer → neural network.  

- In the DL pipeline → **feature engineering is minimal**.  
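
The "raw text → embedding layer" path can be sketched without any framework. The example below shows only the forward lookup: it builds a vocabulary, maps tokens to integer ids, and averages the looked-up vectors into one sentence vector. The table is randomly initialized and never trained here; in a framework such as PyTorch or Keras it would be a trainable parameter updated by backpropagation. The names `EMBED_DIM`, `encode`, and `embed` are assumptions made for this sketch.

```python
import random

sentences = ["good movie", "bad acting"]

# Index 0 is reserved for words not seen when the vocabulary was built.
vocab = {"<unk>": 0}
for s in sentences:
    for tok in s.split():
        vocab.setdefault(tok, len(vocab))

EMBED_DIM = 4
rng = random.Random(0)
# One random vector per vocabulary entry; training would adjust these values.
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]


def encode(text):
    """Map each token to its integer id, falling back to <unk>."""
    return [vocab.get(tok, 0) for tok in text.split()]


def embed(text):
    """Look up each token's vector and mean-pool them into one sentence vector."""
    vectors = [embedding_table[i] for i in encode(text)]
    return [sum(col) / len(vectors) for col in zip(*vectors)]
```

Here `encode("good film")` gives `[1, 0]`, because "film" is out of vocabulary. The pooled vector from `embed` is what a downstream neural network layer would consume.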

---

## **4. Modeling**  
It has two stages:

### **i. Modeling Approaches**  
Ways to build NLP models:  

- **Heuristics (Rule-based)**  
  - Hand-crafted rules.  
  - Example: if a sentence contains “?”, treat it as a question.  

- **Machine Learning (ML)**  
  - Traditional algorithms (Logistic Regression, Naive Bayes, SVM).  
  - Needs manual feature engineering (BoW, TF-IDF, etc.).  

- **Deep Learning (DL)**  
  - Neural networks learn features automatically.  
  - Examples: RNN, LSTM, Transformers (BERT, GPT).  

- **Cloud-based APIs**  
  - Ready-made NLP APIs from cloud providers.  
  - Examples: Google Cloud Natural Language API, Amazon Comprehend, Azure Cognitive Services.  
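
The heuristic approach is the easiest of these to show in code. The sketch below implements the "contains ?" rule from the example above and adds one extra rule, a leading question word, which is an assumption of this sketch and not part of the original rule. `QUESTION_WORDS` and `is_question` are illustrative names.

```python
# Extra rule added for illustration: sentences that open with a question word.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how"}


def is_question(sentence):
    text = sentence.strip().lower()
    if "?" in text:                  # the rule from the notes
        return True
    words = text.split()
    return bool(words) and words[0] in QUESTION_WORDS
```

Rules like these are cheap and transparent but brittle: "How nice of you." would be flagged as a question. They often serve as a baseline or as a fallback next to an ML or DL model.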

---

### **ii. Evaluation**  
Once models are built, they need to be evaluated.  

- **Intrinsic Evaluation** (measures the model directly, with metrics computed on a test set)  
  - Accuracy → % of correct predictions.  
  - Precision, Recall, F1-score → classification quality.  
  - BLEU, ROUGE → overlap with reference text, for translation and summarization.  
  - Perplexity → how well a language model predicts text (lower is better).  

- **Extrinsic Evaluation** (measures the model's effect on the end task)  
  - Real-world performance once the model is used in the product.  
  - Task success rate → e.g., chatbot completion rate.  
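
The classification metrics and perplexity can be computed in a few lines. This is a minimal sketch for the binary case, assuming labels are given as two equal-length lists and that perplexity receives the probability the model assigned to each actual token; the function names are chosen for this example.

```python
import math


def classification_metrics(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def perplexity(token_probs):
    """exp of the average negative log-probability assigned to each token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


metrics = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

In this example there are 2 true positives, 1 false positive, and 1 false negative, so accuracy, precision, recall, and F1 all come out to 2/3. A model that gives every token probability 0.25 has a perplexity of about 4, as if it were choosing uniformly among four options.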

---

## **5. Deployment** 
It has three stages:

### **i. Deployment**  
- Make the trained model available for real-world use.  
- Methods:  
  - API (microservices) → expose the model behind an HTTP API.  
  - Chatbot integration → use in apps/chatbots.  
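
One way to picture the API route is a small HTTP service built on the standard library. Everything here is an assumption for illustration: `predict` is a placeholder standing in for a trained model, the request format `{"text": ...}` and response format `{"label": ...}` are invented, and port 8000 is arbitrary. A production service would more likely use a web framework and add input validation and error handling.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(text):
    """Placeholder for the trained model's prediction call."""
    return "question" if "?" in text else "statement"


def handle_payload(body):
    """Turn a JSON request body into a JSON response body (bytes)."""
    data = json.loads(body)
    return json.dumps({"label": predict(data["text"])}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = handle_payload(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def run_server(port=8000):
    # Blocks until interrupted; call this from a script's entry point.
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Keeping the request logic in `handle_payload`, separate from the HTTP handler, means it can be unit-tested without starting a server. A chatbot integration would call the same `predict` function from its message handler.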

---

### **ii. Monitoring**  
- Track model performance after deployment.  
- Monitor:  
  - Accuracy drop.  
  - Bias/errors.  
  - System performance (latency, throughput).  
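
Detecting an accuracy drop can be sketched as a rolling window over recent predictions whose true labels have since become known. The class name, the default window of 100, and the tolerance of 0.05 are hypothetical choices for this example, not recommended values.

```python
from collections import deque


class AccuracyMonitor:
    def __init__(self, baseline, window=100, tolerance=0.05):
        self.baseline = baseline            # accuracy measured before deployment
        self.tolerance = tolerance          # allowed drop before flagging
        self.outcomes = deque(maxlen=window)  # keeps only the most recent results

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def has_degraded(self):
        acc = self.rolling_accuracy()
        return acc is not None and acc < self.baseline - self.tolerance


monitor = AccuracyMonitor(baseline=0.9, window=4)
for pred, actual in [("pos", "pos"), ("neg", "pos"), ("pos", "neg"), ("neg", "neg")]:
    monitor.record(pred, actual)
```

After these four records the rolling accuracy is 0.5, well below the 0.9 baseline, so `monitor.has_degraded()` is true. In practice true labels often arrive late or only for a sample of traffic, so this check usually runs alongside latency and input-distribution monitoring.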

---

### **iii. Model Update**  
- Keep the model fresh and effective.  
- Methods:  
  - Retraining with new data.  
  - Fine-tuning on recent examples.  
  - Versioning → maintain and roll out new versions.  

---