 
 # NLP Pipeline

## A General NLP Pipeline

![nlp-pipeline](../images/nlp-pipeline.png)

### Variations of the NLP Pipeline

- The process is not always linear.
- Steps may loop back to earlier ones.
- Which steps are needed depends on the specific task at hand.

## Data Collection

### Data Acquisition: The Heart of an ML System

- Ideal setting: We have everything needed.
- Labels and annotations
- Very often we are dealing with less-than-ideal scenarios.

### Less-than-ideal Scenarios

- Initial datasets with limited annotations/labels
- Initial datasets labeled based on regular expressions or heuristics
- Public datasets (cf. [Google Dataset Search](https://datasetsearch.research.google.com/))
- Scrape data
- Product intervention
- Data augmentation
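
To make the "labeled based on regular expressions or heuristics" scenario concrete, here is a minimal sketch of weak labeling. The rules in `LABEL_RULES`, the label names, and the example texts are all made up for illustration; a real project would derive its rules from the task and inspect the resulting labels by hand.

```python
import re

# Hypothetical labeling rules: each pairs a regex pattern with a weak label.
LABEL_RULES = [
    (re.compile(r"\b(great|excellent|love)\b", re.IGNORECASE), "positive"),
    (re.compile(r"\b(terrible|awful|hate)\b", re.IGNORECASE), "negative"),
]

def weak_label(text):
    """Return the label of the first rule that matches, or None if none does."""
    for pattern, label in LABEL_RULES:
        if pattern.search(text):
            return label
    return None

texts = [
    "I love this phone.",
    "The battery life is terrible.",
    "It arrived on Tuesday.",
]
weak_labels = [weak_label(t) for t in texts]
print(weak_labels)  # ['positive', 'negative', None]
```

Texts that no rule matches stay unlabeled (`None`), which is typical of heuristic labeling: coverage is partial and the labels are noisy.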

### Data Augmentation

- It is a technique that exploits properties of language to create new texts that are syntactically similar to the source text data.
- Types of strategies:
    - Synonym replacement
    - Related word replacement (based on association metrics)
    - Back translation
    - Replacing entities
    - Adding noise to data (e.g. spelling errors, random words)
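
Two of these strategies, synonym replacement and adding spelling noise, can be sketched in a few lines. This is only an illustration: the `SYNONYMS` table is a hypothetical stand-in for a real thesaurus, and the helper names are invented for this example.

```python
import random

# Hypothetical synonym table; a real system might draw on a thesaurus instead.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
}

def replace_synonyms(tokens, rng):
    """Replace each token that has an entry in SYNONYMS with a random synonym."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]

def add_spelling_noise(token, rng):
    """Swap two adjacent characters to imitate a typing error."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

rng = random.Random(42)  # fixed seed so that a run can be repeated
tokens = "the quick dog looks happy".split()

augmented = replace_synonyms(tokens, rng)
# Only perturb longer tokens so that short function words stay readable.
noisy = [add_spelling_noise(t, rng) if len(t) > 4 else t for t in augmented]

print(" ".join(augmented))
print(" ".join(noisy))
```

Each augmented sentence keeps the label of its source sentence, which is how augmentation enlarges a small annotated dataset.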

## Text Extraction and Cleanup

### Text Extraction

- Extracting raw texts from the input data
    - HTML
    - PDF
- Relevant vs. irrelevant information
    - non-textual information
    - markup
    - metadata
- Encoding format
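
As a rough illustration of separating relevant text from markup and metadata, the sketch below uses only Python's standard-library `html.parser`. The `VisibleTextExtractor` class and the sample page are made up for this example; treating `<title>`, `<script>`, and `<style>` as irrelevant is an assumption that depends on the task.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of selected tags."""

    SKIPPED_TAGS = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)

# Made-up page used only for illustration.
html_doc = """
<html><head><title>Demo</title><style>p {color: red;}</style></head>
<body><p>First paragraph.</p><script>var x = 1;</script>
<p>Second <b>paragraph</b>.</p></body></html>
"""

parser = VisibleTextExtractor()
parser.feed(html_doc)

# Join the text nodes and collapse runs of whitespace.
text = " ".join("".join(parser.chunks).split())
print(text)  # First paragraph. Second paragraph.
```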

#### Extracting texts from webpages

```python
import requests
from bs4 import BeautifulSoup

# Google News topic page (zh-TW edition)
url = 'https://news.google.com/topics/CAAqJQgKIh9DQkFTRVFvSUwyMHZNRFptTXpJU0JYcG9MVlJYS0FBUAE?hl=zh-TW&gl=TW&ceid=TW%3Azh-Hant'
r = requests.get(url)
web_content = r.text
soup = BeautifulSoup(web_content, 'html.parser')

# Headline links; the class name 'DY5T1d' is specific to the page markup and may change
title = soup.find_all('a', class_='DY5T1d')
# The href starts with '.', which is replaced by the site root to get an absolute URL
first_art_link = title[0]['href'].replace('.', 'https://news.google.com', 1)

# Fetch the first article and collect the text of all <p> elements
art_request = requests.get(first_art_link)
art_request.encoding = 'utf8'
soup_art = BeautifulSoup(art_request.text, 'html.parser')

art_content = soup_art.find_all('p')
art_texts = [p.text for p in art_content]
print(art_texts)
```


['為達最佳瀏覽效果，建議使用 Chrome、Firefox 或 Microsoft Edge 的瀏覽器。', '', '爆', '中國突然宣布今日起暫停進口台灣鳳梨，掀起國內挺果農的認購熱潮，初步統計，兩天內，國內認購量已破二萬五千公噸，預計今天可達三萬公噸；以此估計，中國毀約棄購的鳳梨訂單可望很快就能轉銷解決。（資料照，記者劉信德攝）', '首次上稿 01:00更新時間 06:08', '〔記者羅綺、黃以敬／台北報導〕中國上週六突然宣布今（一日）日起暫停進口台灣鳳梨，引發國人議論及不滿，相對掀起國內挺台灣果農的認購熱潮。遭中國片面棄單的四萬多公噸鳳梨，農委會原規劃要轉銷他國約三萬噸、國內認購約兩萬噸，但初步統計，兩天內，國內認購量已破二萬五千公噸，預計今天可達三萬公噸；以此估計，中國毀約棄購的鳳梨訂單可望很快就能轉銷解決。', '請繼續往下閱讀...', '', '農委會初步統計，光是國內認購估已破兩萬五千公噸，包括果乾、果汁等加工業者，不少原本用越南鳳梨，都同意優先或擴大採購國內鳳梨，估可增購一萬五千公噸，另有茶飲、餐廳業者擴大採購至少已達五千公噸。加上連繫農委會表達認購意願的企業已破百家，例如全聯採購量要從原本四千公噸增到上萬公噸，還有許多大企業認購名單在持續增加，預計這兩天認購數還會增加。', '農委會這兩天也初步聯繫外銷商，已有業者承諾要擴大外銷日本、新加坡等五千多公噸，更有海外僑民紛紛表達要認購不需經海關檢疫的國內鳳梨加工品；外銷管道較費時，但預計在三、四月鳳梨產季收成前，海外訂單陸續也會增加。', '農委會昨日並已建構訂購國產鳳梨的通路平台，民眾可上農委會網站直接訂購挺農民。', '農委會主委陳吉仲表示，很感謝國人對台灣鳳梨及果農的熱情相挺；他也強調，農委會不會讓台灣鳳梨的熱購只是短期熱情，而希望要藉此更積極拓展更多元及更長期的國內外鳳梨多元銷售管道，「雞蛋不要都放在一個籃子裡」，台灣農業也應與澳洲、日本、東南亞等國家建立起更大的銷售網絡。', '農委會昨日並已建構訂購國產鳳梨的通路平台，初步統計，到2月28日光是國內認購估已破兩萬五千公噸。（圖擷取自農委會主委陳吉仲臉書）', '\n    不用抽 不用搶 現在用APP看新聞 保證天天中獎\u3000\n    點我下載APP\u3000\n    按我看活動辦法\n', '獨家》真巧！中國阻台灣鳳

#### Extracting texts from scanned PDF

```python
from PIL import Image
from pytesseract import image_to_string

YOUR_DEMO_DATA_PATH = "../../../RepositoryData/data/"  # please change your file path
filename = YOUR_DEMO_DATA_PATH + 'pdf-firth-text.png'

# Run OCR on the scanned page image and print the recognized text
text = image_to_string(Image.open(filename))
print(text)
```

Stellenbosch Papers in Linguistics, Vol. 15, 1986, 31-60 doi: 10.5774/15-0-96

SPIL 14 (1986) 31- 6¢ 31

THE LINGUISTIC THOUGHT OF J.R. FIRTH

Nigel Love

"The study of the living votce of a
man tn aectton ts a very btg job in-

ii
deed." --- J.R. Firth

John Rupert Firth was born in 1890. After serving as Pro-
fessor of English at the University of the Punjab from 1919
to 1928, he took up a pest in the phonetics department of
University College, London. In 1938 he moved to the lin-
guistics department of the School of Oriental and African
Studies in London, where from 1944 until his retirement in
1956 he was Professor of Generali Linguistics. He died in
1960. He was an influential teacher, some of whose doctrines
(especially those concerning phonology) were widely propa-~
gated and developed by his students in what came to be known

as the "London school” of linguistics.

"The business of linguistics", according to Firth, "is to

1}

describe languages". In saying as much he would hav

#### Unicode normalization

```python
text = 'I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣 ȀÆĎǦƓ'
print(text)

# Encode the string as UTF-8 bytes to see how non-ASCII characters are stored
text2 = text.encode('utf-8')
print(text2)
```

```
I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣 ȀÆĎǦƓ
b'I feel really \xf0\x9f\x98\xa1. GOGOGO!! \xf0\x9f\x92\xaa\xf0\x9f\x92\xaa\xf0\x9f\x92\xaa  \xf0\x9f\xa4\xa3\xf0\x9f\xa4\xa3 \xc8\x80\xc3\x86\xc4\x8e\xc7\xa6\xc6\x93'
```


```python
import unicodedata

# Decompose characters (NFKD), then drop everything that has no ASCII equivalent
unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
```

```
'I feel really . GOGOGO!!    ADG'
```
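
Stripping everything non-ASCII is a fairly aggressive choice. When the goal is only to make equivalent strings compare equal, normalizing to a single Unicode form may be enough. The sketch below uses the standard `unicodedata` module; the example strings are chosen for illustration.

```python
import unicodedata

composed = "caf\u00e9"     # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent (U+0301)

# The two strings look identical on screen but are different sequences.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 4 5

# NFC composes characters, so both spellings end up identical.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# NFKC additionally folds compatibility characters such as full-width letters.
print(unicodedata.normalize("NFKC", "ＮＬＰ"))  # NLP
```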

- Other useful libraries
    - Spelling check: pyenchant, Microsoft REST API
    - PDF: PyPDF, PDFMiner
    - OCR: pytesseract
 

### Cleanup

- Preliminaries
    - Sentence segmentation
    - Word tokenization
    

#### Segmentation and Tokenization

```python
from nltk.tokenize import sent_tokenize, word_tokenize

text = '''
Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
'''

# Sentence segmentation
sents = sent_tokenize(text)

# Word tokenization, one sentence at a time
for sent in sents:
    print(sent)
    print(word_tokenize(sent))
```

```
Python is an interpreted, high-level and general-purpose programming language.
['Python', 'is', 'an', 'interpreted', ',', 'high-level', 'and', 'general-purpose', 'programming', 'language', '.']
Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
['Python', "'s", 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'its', 'notable', 'use', 'of', 'significant', 'whitespace', '.']
Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
['Its', 'language', 'constructs', 'and', 'object-oriented', 'approach', 'aim', 'to', 'help', 'programmers', 'write', 'clear', ',', 'logical', 'code', 'for', 'small', 'and', 'large-scale', 'projects', '.']
```


- Frequent preprocessing
    - Stopword removal
    - Stemming and/or lemmatization
    - Digit/punctuation removal
    - Case normalization
    

#### Removing stopwords, punctuation, and digits

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation

eng_stopwords = stopwords.words('english')

text = "Mr. John O'Neil works at Wonderland, located at 245 Goleta Avenue, CA., 74208."

words = word_tokenize(text)
print(words)

# Remove stopwords, punctuation marks, and digits.
# NLTK's stopword list is lowercase, so compare the lowercased token.
for w in words:
    if w.lower() not in eng_stopwords and w not in punctuation and not w.isdigit():
        print(w)
```

```
['Mr.', 'John', "O'Neil", 'works', 'at', 'Wonderland', ',', 'located', 'at', '245', 'Goleta', 'Avenue', ',', 'CA.', ',', '74208', '.']
Mr.
John
O'Neil
works
Wonderland
located
Goleta
Avenue
CA.
```


#### Stemming and lemmatization

```python
# Stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

words = ['cars', 'revolution', 'better']
print([stemmer.stem(w) for w in words])
```

```
['car', 'revolut', 'better']
```


```python
# Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# WordNet needs the POS of each word: 'n' = noun, 'a' = adjective
poss = ['n', 'n', 'a']

# `words` is the list defined in the stemming cell
for w, p in zip(words, poss):
    print(lemmatizer.lemmatize(w, pos=p))
```

```
car
revolution
good
```


- Task-specific preprocessing
    - Unicode normalization
    - Language detection
    - Code mixing
    - Transliteration
    

- Automatic annotations
    - POS tagging
    - Parsing
    - Named entity recognition
    - Coreference resolution
    

### Important Reminders for Preprocessing

- Not all steps are necessary
- These steps are NOT sequential
- These steps are task-dependent

## Feature Engineering

### What is feature engineering?

- It refers to the process of turning the extracted and preprocessed texts into input that a machine-learning algorithm can work with.
- It aims at capturing the characteristics of the text in a numeric vector that the ML algorithm can process. (Cf. *construct*, *operational definitions*, and *measurement* in experimental science)
- In short, it concerns how to represent texts quantitatively in a meaningful way, i.e., text representation.

### Feature Engineering for Classical ML

- word-based frequency lists
- bag-of-words representations
- domain-specific word frequency lists
- handcrafted features based on domain-specific knowledge
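
A bag-of-words representation can be sketched with the standard library alone. The two toy documents and the helper name `bow_vector` are made up for illustration, and whitespace splitting stands in for a proper tokenizer.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Whitespace tokenization is a simplification for this toy example.
tokenized = [doc.split() for doc in docs]

# The vocabulary fixes the order of the vector dimensions.
vocab = sorted({tok for doc in tokenized for tok in doc})

def bow_vector(tokens, vocab):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

vectors = [bow_vector(toks, vocab) for toks in tokenized]

print(vocab)       # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 0, 1, 1, 1, 2]
print(vectors[1])  # [0, 1, 1, 0, 1, 1, 2]
```

Word order is discarded: each document is reduced to term counts over a shared vocabulary, which is what makes the representation easy to inspect.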

### Feature Engineering for DL

- DL models take the texts as input directly, with little or no handcrafted feature extraction.
- The DL model is capable of learning features from the texts by itself (e.g., embeddings).
- The learned features are less interpretable.
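
What "taking the texts as input" might look like in practice: tokens are mapped to integer ids, and each id selects a row of an embedding table. The sketch below is a simplified illustration in plain Python; the special tokens `<pad>` and `<unk>`, the dimension `EMBED_DIM`, and the random initial values are assumptions, and no training takes place here.

```python
import random

sentences = [["i", "like", "nlp"], ["i", "like", "deep", "learning"]]

# Index 0 is reserved for padding and index 1 for unknown tokens.
vocab = {"<pad>": 0, "<unk>": 1}
for sent in sentences:
    for tok in sent:
        vocab.setdefault(tok, len(vocab))

def encode(tokens, max_len):
    """Map tokens to ids, truncate to max_len, and pad shorter inputs."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

EMBED_DIM = 4
rng = random.Random(0)
# Randomly initialized embedding table with one row per vocabulary entry.
# A DL model would adjust these values during training.
embeddings = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]

ids = encode(["i", "like", "syntax"], max_len=4)
vectors = [embeddings[i] for i in ids]

print(ids)                            # [2, 3, 1, 0]
print(len(vectors), len(vectors[0]))  # 4 4
```

The individual dimensions of such vectors have no predefined meaning, which is one reason learned features are harder to interpret than handcrafted ones.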
    

## Modeling

### From Simple to Complex

- Start with heuristics or rules
- Experiment with different ML models
    - from heuristics to features
    - from manual annotation to automatic extraction
    - feature importance (weights)
- Find the optimal model
    - Ensemble and stacking
    - Redo feature engineering
    - Transfer learning
    - Reapply heuristics
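
The step "from heuristics to features" can be illustrated as follows: rules that first act as the classifier are later reused as numeric features for an ML model. All three heuristics, the spam rule, and the example messages below are hypothetical and serve only to show the idea.

```python
import re

def has_url(text):
    """1 if the text contains an http(s) link, else 0."""
    return int(bool(re.search(r"https?://", text)))

def exclamation_count(text):
    """Number of exclamation marks in the text."""
    return text.count("!")

def mentions_prize(text):
    """1 if the text contains a prize-related keyword, else 0."""
    return int(bool(re.search(r"\b(prize|winner|free)\b", text, re.IGNORECASE)))

HEURISTICS = [has_url, exclamation_count, mentions_prize]

def rule_based_is_spam(text):
    """Stage 1: a hand-written rule used directly as the classifier."""
    return has_url(text) == 1 and mentions_prize(text) == 1

def featurize(text):
    """Stage 2: the same heuristics become numeric features for an ML model."""
    return [h(text) for h in HEURISTICS]

messages = [
    "You are a WINNER!!! Claim your free prize at http://example.com",
    "Are we still meeting at 3pm?",
]
for m in messages:
    print(rule_based_is_spam(m), featurize(m))
# True [1, 3, 1]
# False [0, 0, 0]
```

Once the heuristics are features, a model can learn how much weight each deserves instead of relying on a fixed rule.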

## Evaluation

### Why evaluation?

- We need to know how *good* the model we've built is, i.e., its "goodness".
- Factors relating to the evaluation methods
    - model building
    - deployment
    - production
- ML metrics vs. business metrics


### Intrinsic vs. Extrinsic Evaluation

- Take a spam-classification system as an example.
- Intrinsic:
    - the precision and recall of the spam classification/prediction
- Extrinsic:
    - the amount of time users spend on spam emails
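
For the intrinsic side, precision and recall can be computed directly from gold labels and predictions. The function below is a minimal sketch with invented toy labels; libraries such as scikit-learn provide tested implementations.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Compute precision and recall for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy gold labels and predictions, made up for illustration.
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]

precision, recall = precision_recall(y_true, y_pred)
print(round(precision, 2), round(recall, 2))  # 0.67 0.67
```

Here two of the three messages flagged as spam really are spam (precision 2/3), and two of the three actual spam messages were caught (recall 2/3).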
    

### General Principles

- Do intrinsic evaluation before extrinsic.
- Extrinsic evaluation is more expensive because it often involves project stakeholders outside the AI team.
- Only when we get consistently good results in intrinsic evaluation should we go for extrinsic evaluation.
- Bad results in intrinsic evaluation often imply bad results in extrinsic evaluation as well.

### Common Intrinsic Metrics

- Principles for Evaluation Metrics Selection
- Data type of the labels (ground truths)
    - Binary (e.g., sentiment)
    - Ordinal (e.g., information retrieval)
    - Categorical (e.g., POS tags)
    - Textual (e.g., named entity, machine translation, text generation)
- Automatic vs. Human Evaluation

## Post-Modeling Phases

### Post-Modeling Phases

- Deployment of the model in a production environment (e.g., web service)
- Monitoring system performance on a regular basis
- Updating the system with newly arriving data

## References

- Chapter 2 of Practical Natural Language Processing. {cite}`vajjala2020`

```{bibliography} ../book.bib
:filter: docname in docnames
:style: unsrt
```