 
# NLP Pipeline

## A General NLP Pipeline

![nlp-pipeline](../images/nlp-pipeline.png)

### Variations of the NLP Pipeline

- The process is not always linear.
- There may be loops between steps.
- The procedures depend on the specific task at hand.

## Data Collection

### Data Acquisition: The Heart of an ML System

- Ideal setting: we have all the data we need, complete with labels and annotations.
- Very often, however, we are dealing with less-than-ideal scenarios.

### Less-than-ideal Scenarios

- Initial datasets with limited annotations/labels
- Initial datasets labeled based on regular expressions or heuristics
- Public datasets (cf. [Google Dataset Search](https://datasetsearch.research.google.com/) or [Kaggle](https://www.kaggle.com/))
- Scraped data
- Product intervention
- Data augmentation
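
As a rough illustration of the second scenario above, the sketch below assigns provisional labels with regular expressions. The patterns, the label names, and the `weak_label` helper are hypothetical and chosen only for illustration; heuristic labels like these are noisy and would normally be reviewed or refined later.

```python
import re

# Hypothetical patterns for a toy spam/ham labeling task (assumptions, not from a real dataset).
SPAM_PATTERNS = [
    re.compile(r"\bfree\b", re.IGNORECASE),
    re.compile(r"\bwin(ner)?\b", re.IGNORECASE),
    re.compile(r"https?://\S+"),
]


def weak_label(text):
    """Return 'spam' if any heuristic pattern matches, otherwise 'ham'."""
    if any(p.search(text) for p in SPAM_PATTERNS):
        return "spam"
    return "ham"


messages = [
    "You are a WINNER! Claim your free prize at http://example.com",
    "Are we still meeting for lunch tomorrow?",
]
labels = [weak_label(m) for m in messages]
print(list(zip(labels, messages)))
```

Word boundaries (`\b`) keep the patterns from firing on longer words such as "freedom", which is one of many small decisions that affect the quality of heuristic labels.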

### Data Augmentation

- Data augmentation exploits properties of language to create new texts that are syntactically similar to the source texts.
- Types of strategies:
    - Synonym replacement
    - Related word replacement (based on association metrics)
    - Back translation
    - Replacing entities
    - Adding noise to data (e.g. spelling errors, random words)
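
Two of these strategies, synonym replacement and noise injection, can be sketched with the standard library alone. The synonym table and the helper names `synonym_replace` and `add_typo` are made up for this example; a real system might draw synonyms from WordNet or from word embeddings instead.

```python
import random

# Toy synonym table (hypothetical; a real system might use WordNet or embeddings).
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
    "big": ["large", "huge"],
}


def synonym_replace(tokens, rng):
    """Replace every token that has an entry in SYNONYMS with a random synonym."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]


def add_typo(token, rng):
    """Swap two adjacent characters to simulate a spelling error.

    The token stays unchanged if it is shorter than two characters or if the
    two swapped characters happen to be identical.
    """
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)
    return token[:i] + token[i + 1] + token[i] + token[i + 2:]


rng = random.Random(0)  # fixed seed so the augmented outputs are reproducible
tokens = "the quick dog looks happy".split()
augmented = synonym_replace(tokens, rng)
noisy = [add_typo(t, rng) if len(t) > 3 else t for t in tokens]
print(" ".join(augmented))
print(" ".join(noisy))
```

Passing the random generator explicitly keeps the augmentation reproducible, which helps when comparing models trained on augmented data.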

## Text Extraction and Cleanup

### Text Extraction

- Extracting raw texts from the input data
    - HTML
    - PDF
- Relevant vs. irrelevant information
    - non-textual information
    - markup
    - metadata
- Encoding format
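
To make the distinction between relevant text and markup concrete, here is a minimal sketch that uses only Python's built-in `html.parser`. The `TextExtractor` class and the sample page are assumptions made for this example; the cells below use BeautifulSoup for the same purpose.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text while skipping <script>, <style>, and <title> content."""

    SKIP_TAGS = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


# Raw bytes as they might arrive from a file or an HTTP response (made-up sample).
raw_bytes = (
    "<html><head><title>Demo</title><style>p {color: red;}</style></head>"
    "<body><p>Caf\u00e9 menu</p><script>var x = 1;</script><p>Open daily</p></body></html>"
).encode("utf-8")

html = raw_bytes.decode("utf-8")  # the encoding must be known or detected beforehand
parser = TextExtractor()
parser.feed(html)
parser.close()
print(parser.chunks)
```

Decoding with the wrong encoding would either raise an error or silently corrupt non-ASCII characters, which is why the encoding format appears in the list above.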

#### Extracting texts from webpages

In [1]:
import requests
from bs4 import BeautifulSoup

# Google News topic page (zh-TW edition)
url = 'https://news.google.com/topics/CAAqJQgKIh9DQkFTRVFvSUwyMHZNRFptTXpJU0JYcG9MVlJYS0FBUAE?hl=zh-TW&gl=TW&ceid=TW%3Azh-Hant'
r = requests.get(url)
web_content = r.text
soup = BeautifulSoup(web_content, 'html.parser')

# Headline links; the class name 'DY5T1d' is tied to the page's markup and may change
title = soup.find_all('a', class_='DY5T1d')

# The href is expected to start with '.', so replace that first '.' with the site root
first_art_link = title[0]['href'].replace('.', 'https://news.google.com', 1)

# Fetch the first article and decode it as UTF-8
art_request = requests.get(first_art_link)
art_request.encoding = 'utf8'
soup_art = BeautifulSoup(art_request.text, 'html.parser')

# Collect the text of every <p> element
art_content = soup_art.find_all('p')
art_texts = [p.text for p in art_content]
print(art_texts)


['', '\n', '\n鳳梨議題延燒，媒體人黃創夏一句：「每位國人只要1天吃18公斤的鳳梨，連吃兩周就好，有很難嗎？」1天吃18公斤鳳梨難不難？引發網路論戰。', '\n', '\r\n黃創夏今早在臉書貼文「真的，我是說錯了」指出，因為心有旁騖，錄影當中，有感而發，忘了自己都已經50多歲了，竟然還敢和學生時代時期般的邊心算還邊發言，一心多用之下造成口誤不察，被特定力量抓到話柄，讓他們嗨了好幾天，就當作是失言總要付出的代價，隨他們高興去吧‧‧‧', '\n', '\r\n黃創夏說，2月27日中午錄影當時，「我還真的是說錯了，有錯當然要認！」因為當天他還只看到台灣一年的鳳梨總產量是「42萬公噸」，聽到某些特定力量把共產黨的「鳳梨突襲」當成「世界末日」般在鬼哭神號，覺得實在不必如此大驚小怪，因此才想到把一年42萬公噸換算成2300萬人的年均值是多少，覺得「一個人一年18公斤」的數字根本不是難題，「我真是錯了」。', '\n', '\r\n他指出，因為下了節目他才知道，在那「一年42萬公噸總產量」的鳳梨當中，僅有約10%是要賣到大陸去，換言之，其實衝擊量的取用上，他真的錯了，應該是用「4萬公噸」去計算，除以2300萬人的話，衝擊量只有「一人才1.8公斤」，真的是只要在未來幾個月的鳳梨產季之內，每個人只要多吃一顆鳳梨，共產黨的「鳳梨之亂」毫無殺傷力。', '\n', '\r\n黃創夏表示，事實也證明，真的才在短短幾天內，台灣的認購鳳梨潮之下，增購量已經超過「4萬公噸」了。', '\n                    中國大陸對台下鳳梨禁令，事件超過5天，即使我方回函說明，要求能夠面對面視訊、溝通，卻仍遭「已讀不回」；雖然農委會說兩岸防...                  ', '\n                    鳳梨議題延燒，媒體人黃創夏一句：「每位國人只要1天吃18公斤的鳳梨，連吃兩周就好，有很難嗎？」1天吃18公斤鳳梨難不難？...                  ', '\n                    鳳梨銷陸禁令3月1日生效，財信傳媒董事長謝金河在臉書表示，台灣亟待重建食品加工產業鏈。                  ', '\n                    中國大陸三月起暫停台灣鳳梨輸入。國民黨立委費鴻泰昨在立院質詢

#### Extracting text from a scanned PDF

In [2]:
from PIL import Image
from pytesseract import image_to_string


YOUR_DEMO_DATA_PATH = "../../../RepositoryData/data/"  # please change your file path
filename = YOUR_DEMO_DATA_PATH+'pdf-firth-text.png'
text = image_to_string(Image.open(filename))
print(text)

Stellenbosch Papers in Linguistics, Vol. 15, 1986, 31-60 doi: 10.5774/15-0-96

SPIL 14 (1986) 31- 6¢ 31

THE LINGUISTIC THOUGHT OF J.R. FIRTH

Nigel Love

"The study of the living votce of a
man tn aectton ts a very btg job in-

ii
deed." --- J.R. Firth

John Rupert Firth was born in 1890. After serving as Pro-
fessor of English at the University of the Punjab from 1919
to 1928, he took up a pest in the phonetics department of
University College, London. In 1938 he moved to the lin-
guistics department of the School of Oriental and African
Studies in London, where from 1944 until his retirement in
1956 he was Professor of Generali Linguistics. He died in
1960. He was an influential teacher, some of whose doctrines
(especially those concerning phonology) were widely propa-~
gated and developed by his students in what came to be known

as the "London school” of linguistics.

"The business of linguistics", according to Firth, "is to

1}

describe languages". In saying as much he would hav

#### Unicode normalization

In [5]:
text = 'I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣 ȀÆĎǦƓ'
print(text)
text2 = text.encode('utf-8') # encode the string as UTF-8 bytes
print(text2)


I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣 ȀÆĎǦƓ
b'I feel really \xf0\x9f\x98\xa1. GOGOGO!! \xf0\x9f\x92\xaa\xf0\x9f\x92\xaa\xf0\x9f\x92\xaa  \xf0\x9f\xa4\xa3\xf0\x9f\xa4\xa3 \xc8\x80\xc3\x86\xc4\x8e\xc7\xa6\xc6\x93'


In [15]:
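import unicodedata
# NFKD splits accented letters into base letter + combining mark; the ASCII round trip
# then drops every non-ASCII character (emoji, marks, letters without a decomposition)
unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')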

'I feel really . GOGOGO!!    ADG'
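
The cell above discards everything outside ASCII, including the emoji and letters such as Æ that have no decomposition. When the goal is only to make equivalent strings compare equal, the normalization forms can be applied without the ASCII step. The following is a minimal sketch; the sample strings are made up for illustration.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)

# The two strings render identically but are different code point sequences.
print(composed == decomposed)

nfc = unicodedata.normalize("NFC", decomposed)  # compose into single code points
nfd = unicodedata.normalize("NFD", composed)    # decompose into base + combining marks
print(nfc == composed, nfd == decomposed)

# NFKC additionally folds compatibility characters such as full-width letters and ligatures.
fullwidth = "\uff2e\uff2c\uff30 \ufb01le"  # full-width 'NLP' and the 'fi' ligature
print(unicodedata.normalize("NFKC", fullwidth))
```

Which form to choose depends on the task: NFC/NFD preserve the distinction between compatibility characters, whereas NFKC/NFKD remove it.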

- See the [unicodedata documentation](https://docs.python.org/3/library/unicodedata.html) for more details on character normalization.
- Other useful libraries
    - Spell checking: pyenchant, Microsoft REST API
    - PDF: PyPDF, PDFMiner
    - OCR: pytesseract
 

### Cleanup

- Preliminaries
    - Sentence segmentation
    - Word tokenization
    

#### Segmentation and Tokenization

In [4]:
from nltk.tokenize import sent_tokenize, word_tokenize

text = '''
Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
'''

## sentence segmentation
sents = sent_tokenize(text)

## word tokenization
for sent in sents:
    print(sent)
    print(word_tokenize(sent))


Python is an interpreted, high-level and general-purpose programming language.
['Python', 'is', 'an', 'interpreted', ',', 'high-level', 'and', 'general-purpose', 'programming', 'language', '.']
Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
['Python', "'s", 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'its', 'notable', 'use', 'of', 'significant', 'whitespace', '.']
Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
['Its', 'language', 'constructs', 'and', 'object-oriented', 'approach', 'aim', 'to', 'help', 'programmers', 'write', 'clear', ',', 'logical', 'code', 'for', 'small', 'and', 'large-scale', 'projects', '.']


- Frequent preprocessing
    - Stopword removal
    - Stemming and/or lemmatization
    - Digit/punctuation removal
    - Case normalization
    

#### Removing stopwords, punctuation, and digits

In [5]:
from nltk.corpus import stopwords
from string import punctuation

eng_stopwords = stopwords.words('english')

text = "Mr. John O'Neil works at Wonderland, located at 245 Goleta Avenue, CA., 74208."

words = word_tokenize(text)

print(words)

# remove stopwords, punctuation marks, and digits
for w in words:
    if w not in eng_stopwords and w not in punctuation and not w.isdigit():
        print(w)

['Mr.', 'John', "O'Neil", 'works', 'at', 'Wonderland', ',', 'located', 'at', '245', 'Goleta', 'Avenue', ',', 'CA.', ',', '74208', '.']
Mr.
John
O'Neil
works
Wonderland
located
Goleta
Avenue
CA.


#### Stemming and lemmatization

In [6]:
## Stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

words = ['cars','revolution', 'better']
print([stemmer.stem(w) for w in words])


['car', 'revolut', 'better']


In [7]:
## Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

## The WordNet lemmatizer needs the POS of each word
poss = ['n','n','a']

for w,p in zip(words,poss):
    print(lemmatizer.lemmatize(w, pos=p))

car
revolution
good


- Task-specific preprocessing
    - Unicode normalization
    - Language detection
    - Code mixing
    - Transliteration (e.g., using pinyin for Chinese words in English-Chinese code-switching texts)
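
A first step toward handling code-mixed text is noticing that more than one script is present. The sketch below is a coarse heuristic based on Unicode character names; it detects scripts, not languages, and the helper names `char_script` and `script_profile` are hypothetical.

```python
import unicodedata


def char_script(ch):
    """Return a coarse script label for one character, based on its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "han"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def script_profile(text):
    """Count letters per script, ignoring digits, spaces, and punctuation."""
    counts = {}
    for ch in text:
        if ch.isalpha():
            label = char_script(ch)
            counts[label] = counts.get(label, 0) + 1
    return counts


# Made-up code-switched sentence:
# "I'm going to a meeting today, and then I'll check email."
mixed = "我今天要去 meeting，然後再 check email"
profile = script_profile(mixed)
print(profile)

scripts = [label for label in profile if label != "other"]
print("code-mixed" if len(scripts) > 1 else "single script")
```

Languages that share a script (e.g., English and French) cannot be separated this way; that requires a proper language-identification model.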
    

- Automatic annotations
    - POS tagging
    - Parsing
    - Named Entity Recognition
    - Coreference resolution
    

### Important Reminders for Preprocessing

- Not all steps are necessary
- These steps are NOT sequential
- These steps are task-dependent
- Goals
    - Text Normalization
    - Text Tokenization
    - Text Enrichment/Annotation

## Feature Engineering

### What is feature engineering?

- Feature engineering is the process of turning the extracted and preprocessed texts into inputs that a machine-learning algorithm can use.
- It aims to capture the characteristics of the text in a numeric vector that ML algorithms can process. (Cf. *construct*, *operational definitions*, and *measurement* in experimental science)
- In short, it concerns how to meaningfully represent texts quantitatively, i.e., text representation.

### Feature Engineering for Classical ML

- Word-based frequency lists
- Bag-of-words representations
- Domain-specific word frequency lists
- Handcrafted features based on domain-specific knowledge
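
A bag-of-words representation can be sketched in a few lines. The two toy documents are made up for this example, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Build a fixed, sorted vocabulary so every document maps to the same dimensions.
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})


def bow_vector(tokens):
    """Return raw term counts aligned with `vocab`."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


vectors = [bow_vector(doc) for doc in tokenized]
print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes a vector whose length equals the vocabulary size; word order is lost, which is the defining simplification of the bag-of-words model.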

### Feature Engineering for DL

- DL models take the (tokenized) texts directly as input, without handcrafted features.
- The DL model is capable of learning features from the texts (e.g., embeddings).
- The trade-off is that the model is often less interpretable.
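
In practice, the texts usually reach a DL model as sequences of integer ids, which an embedding layer then maps to dense vectors learned during training. The sketch below shows only the id-encoding and padding step; the special tokens and the `encode` helper are assumptions made for this example and do not correspond to any particular framework.

```python
PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens

train_sents = [["i", "like", "nlp"], ["i", "like", "deep", "learning"]]

# Index 0 is reserved for padding and index 1 for unknown words.
vocab = {PAD: 0, UNK: 1}
for sent in train_sents:
    for tok in sent:
        vocab.setdefault(tok, len(vocab))


def encode(tokens, max_len):
    """Map tokens to ids, truncate to max_len, and right-pad with the PAD id."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


encoded = encode(["i", "like", "linguistics"], max_len=5)
print(encoded)
```

Words that were not seen when the vocabulary was built fall back to the unknown id, so the model never receives an index outside its embedding table.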
    

### Strengths and Weaknesses

![](../images/feature-engineer-strengths.png)

![](../images/feature-engineer-weakness.png)

## Modeling

### From Simple to Complex

- Start with heuristics or rules
- Experiment with different ML models
    - From heuristics to features
    - From manual annotation to automatic extraction
    - Feature importance (weights)
- Find the optimal model
    - Ensemble and stacking
    - Redo feature engineering
    - Transfer learning
    - Reapply heuristics
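
The progression from heuristics to a simple ensemble can be illustrated with a toy example. The three rules and the `majority_vote` helper below are hypothetical; in a real project the voters would more likely be trained models, possibly mixed with rules.

```python
from collections import Counter


def rule_keyword(text):
    """Heuristic 1: flag messages containing typical promotional words."""
    return "spam" if any(w in text.lower() for w in ("free", "prize")) else "ham"


def rule_shouting(text):
    """Heuristic 2: flag messages in which most letters are upper-case."""
    letters = [c for c in text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / max(len(letters), 1)
    return "spam" if upper_ratio > 0.5 else "ham"


def rule_link(text):
    """Heuristic 3: flag messages that contain a URL."""
    return "spam" if "http://" in text or "https://" in text else "ham"


def majority_vote(text, voters):
    """Combine several classifiers by taking the most common prediction."""
    votes = Counter(v(text) for v in voters)
    return votes.most_common(1)[0][0]


voters = [rule_keyword, rule_shouting, rule_link]
for msg in ["CLAIM YOUR FREE PRIZE NOW", "See you at the seminar at 3pm"]:
    print(majority_vote(msg, voters), "<-", msg)
```

With an odd number of voters and two classes there are no ties; other setups would need an explicit tie-breaking rule.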

## Evaluation

### Why evaluation?

- We need to know how *good* the model we have built is, i.e., its "goodness".
- Factors relating to the evaluation methods
    - Model building
    - Deployment
    - Production
- ML metrics vs. Business metrics


### Intrinsic vs. Extrinsic Evaluation

- Take a spam-classification system as an example.
- Intrinsic:
    - the precision and recall of the spam classification/prediction
- Extrinsic:
    - the amount of time users spend on spam emails
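
For the intrinsic side of the spam example, precision and recall can be computed directly from gold labels and predictions. The labels below are invented for illustration, and `precision_recall` is a hypothetical helper rather than a library function.

```python
def precision_recall(gold, pred, positive="spam"):
    """Compute precision and recall for the `positive` class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)  # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)  # false positives
    fn = sum(g == positive and p != positive for g, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


gold = ["spam", "spam", "ham", "ham", "spam", "ham"]
pred = ["spam", "ham", "ham", "spam", "spam", "ham"]

precision, recall = precision_recall(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision answers "how many of the messages flagged as spam really were spam", while recall answers "how many of the spam messages were caught".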
    

### General Principles

- Do intrinsic evaluation before extrinsic.
- Extrinsic evaluation is more expensive because it often involves project stakeholders outside the AI team.
- Only when we get consistently good results in intrinsic evaluation should we go for extrinsic evaluation.
- Bad results in intrinsic evaluation often imply bad results in extrinsic evaluation as well.

### Common Intrinsic Metrics

- Principles for Selecting Evaluation Metrics
- Data type of the labels (ground truths)
    - Binary (e.g., sentiment)
    - Ordinal (e.g., information retrieval)
    - Categorical (e.g., POS tags)
    - Textual (e.g., named entity, machine translation, text generation)
- Automatic vs. Human Evaluation
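
When the labels are textual, exact-match accuracy is often too strict, so automatic metrics typically reward partial overlap. One simple option, shown here as a sketch, is a token-level F1 score between a reference and a system output; the `token_f1` helper and the sample sentences are assumptions made for this example, and established metrics for translation or generation are more elaborate.

```python
from collections import Counter


def token_f1(reference, hypothesis):
    """Token-overlap F1 between a reference string and a system output."""
    ref_tokens = reference.lower().split()
    hyp_tokens = hypothesis.lower().split()
    # Multiset intersection: each token counts at most as often as it occurs in both.
    overlap = sum((Counter(ref_tokens) & Counter(hyp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 3))
```

Overlap-based scores are cheap to compute but blind to meaning, which is one reason human evaluation remains important for generated text.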

## Post-Modeling Phases

### Post-Modeling Phases

- Deploying the model in a production environment (e.g., as a web service)
- Monitoring system performance on a regular basis
- Updating the system with newly arriving data

## References

- Chapter 2 of Practical Natural Language Processing. {cite}`vajjala2020`

```{bibliography} ../book.bib
:filter: docname in docnames
:style: unsrt
```