# NLP Pipeline

## A General NLP Pipeline

![nlp-pipeline](../images/nlp-pipeline.png)

### Variations of the NLP Pipeline

- The process may not always be linear.
- Steps may loop back on one another.
- The procedures depend on the specific task at hand.

## Data Collection

### Data Acquisition: The Heart of an ML System

- Ideal setting: we have everything we need, including labels and annotations.
- Very often, however, we are dealing with less-than-ideal scenarios.

### Less-than-ideal Scenarios

- Initial datasets with limited annotations/labels
- Initial datasets labeled based on regular expressions or heuristics
- Public datasets (cf. [Google Dataset Search](https://datasetsearch.research.google.com/))
- Scraped data
- Product intervention
- Data augmentation
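
As a rough illustration of labeling an initial dataset with regular expressions, the sketch below assigns a label only when exactly one rule fires and abstains otherwise. The patterns, the label names, and the `heuristic_label` helper are all hypothetical, chosen for the example rather than taken from any particular project.

```python
import re

# Hypothetical labeling rules: each pattern votes for one label.
LABEL_PATTERNS = {
    "positive": re.compile(r"\b(great|love|excellent)\b", re.IGNORECASE),
    "negative": re.compile(r"\b(terrible|hate|awful)\b", re.IGNORECASE),
}

def heuristic_label(text):
    """Return a label if exactly one pattern matches; otherwise None (abstain)."""
    matched = [label for label, pattern in LABEL_PATTERNS.items()
               if pattern.search(text)]
    return matched[0] if len(matched) == 1 else None

texts = [
    "I love this phone.",
    "The battery is terrible.",
    "Great screen, awful speakers.",  # conflicting cues: left unlabeled
    "It arrived on Monday.",          # no cue: left unlabeled
]
for text in texts:
    print(heuristic_label(text), "|", text)
```

Texts that come back as `None` are natural candidates for manual annotation.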

### Data Augmentation

- Data augmentation exploits properties of language to create new texts that are syntactically similar to the source text.
- Types of strategies:
    - Synonym replacement
    - Related-word replacement (based on association metrics)
    - Back translation
    - Replacing entities
    - Adding noise to data (e.g. spelling errors, random words)
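
Two of the strategies above, synonym replacement and noise injection, can be sketched in a few lines. The `SYNONYMS` table and the helper names are assumptions made for this example; a real system would more likely draw synonyms from a lexical resource such as WordNet.

```python
import random

# Hypothetical synonym table (an assumption for this sketch).
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
}

def replace_synonyms(tokens, rng):
    """Replace every token that has an entry in SYNONYMS with a random synonym."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]

def add_typo(token, rng):
    """Swap two adjacent characters to simulate a spelling error."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

rng = random.Random(42)  # fixed seed so that runs are repeatable
tokens = "the quick dog looks happy".split()

augmented = replace_synonyms(tokens, rng)
# Corrupt roughly 30% of the tokens with a character swap.
noisy = [add_typo(t, rng) if rng.random() < 0.3 else t for t in augmented]

print(" ".join(augmented))
print(" ".join(noisy))
```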

## Text Extraction and Cleanup

### Text Extraction

- Extracting raw texts from the input data
    - HTML
    - PDF
- Relevant vs. irrelevant information
    - Non-textual information
    - Markup
    - Metadata
- Encoding format
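
To make the three concerns above concrete (decoding, markup, irrelevant content), here is a minimal sketch that uses only the standard library. The `TextExtractor` class and the `extract_text` helper are names invented for this example, and treating everything inside `<head>`, `<script>`, and `<style>` as irrelevant is an assumption that will not suit every page.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <head>, <script>, and <style> content."""

    SKIP_TAGS = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(raw_bytes, encoding="utf-8"):
    """Decode the bytes with the given encoding and return the visible text."""
    html = raw_bytes.decode(encoding, errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks

raw = (
    b"<html><head><title>Demo</title><style>p {color: red;}</style></head>"
    b"<body><p>First paragraph.</p><script>var x = 1;</script>"
    b"<p>Caf\xc3\xa9 menu</p></body></html>"
)
print(extract_text(raw))  # ['First paragraph.', 'Café menu']
```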

#### Extracting texts from webpages

In [1]:
import requests
from bs4 import BeautifulSoup

# Google News topic page (zh-TW edition)
url = 'https://news.google.com/topics/CAAqJQgKIh9DQkFTRVFvSUwyMHZNRFptTXpJU0JYcG9MVlJYS0FBUAE?hl=zh-TW&gl=TW&ceid=TW%3Azh-Hant'
r = requests.get(url)
web_content = r.text
soup = BeautifulSoup(web_content, 'html.parser')

# Headline links; the class name 'DY5T1d' is tied to the page's markup and may change
title = soup.find_all('a', class_='DY5T1d')
# The href is relative (it starts with '.'), so swap the leading '.' for the site root
first_art_link = title[0]['href'].replace('.', 'https://news.google.com', 1)

# Fetch the first article and decode it as UTF-8
art_request = requests.get(first_art_link)
art_request.encoding = 'utf8'
soup_art = BeautifulSoup(art_request.text, 'html.parser')

# Collect the text of every <p> element
art_content = soup_art.find_all('p')
art_texts = [p.text for p in art_content]
print(art_texts)


['目前設定', '目前設定', '目前設定', '近日桃園市議員王浩宇罷免案投票通過，給了國民黨打了強心針，將下一個目標轉移到高雄市議員黃捷的罷免投票上。而先前曾說若王浩宇、黃捷和自己被罷免，就請全台吃雞排的台灣基進黨陳柏惟，今（17日）卻反悔稱時空背景不一樣。對此，宅神朱學恒則於臉書發文，恥笑：「打賭都不敢是要怎麼打仗啦！」', '先前陳柏惟尚未成為立委時，曾貼出自己與王浩宇、黃捷的合照，發布雞排祭品文打賭，「2020三個都罷免成功，我請全台灣人吃雞排」。然而近日桃園市立委王浩與罷免成功，不少網友紛紛向立委陳柏惟討先前打賭的雞排，卻遭陳柏惟以「當年時空背景不一樣」迴避。', '對此，今宅神朱學恒則於臉書發文嗆：「台派連嗆賭都這麼沒種不敢面對，我們也只能繼續笑他了。打賭都不敢是要怎麼打仗啦（恥笑）」。不少網友也紛紛留言開酸：「時空背景不同的時候....過幾年就變舔共仔」、「不是1打35？」、「果然是雙標之術就是時空背景啊~」、「原來他講的話都是開玩笑。那麼抗中保台屆時也會是他口中的開玩笑」、「他不光是3Q還很軟Q」等等。', '點選關鍵字看更多 :', '說明文字']


#### Extracting texts from scanned PDF

In [2]:
from PIL import Image
from pytesseract import image_to_string

filename = '../../../RepositoryData/data/pdf-firth-text.png'
text = image_to_string(Image.open(filename))
print(text)

Stellenbosch Papers in Linguistics, Vol. 15, 1986, 31-60 doi: 10.5774/15-0-96

SPIL 14 (1986) 31- 6¢ 31

THE LINGUISTIC THOUGHT OF J.R. FIRTH

Nigel Love

"The study of the living votce of a
man tn aectton ts a very btg job in-

ii
deed." --- J.R. Firth

John Rupert Firth was born in 1890. After serving as Pro-
fessor of English at the University of the Punjab from 1919
to 1928, he took up a pest in the phonetics department of
University College, London. In 1938 he moved to the lin-
guistics department of the School of Oriental and African
Studies in London, where from 1944 until his retirement in
1956 he was Professor of Generali Linguistics. He died in
1960. He was an influential teacher, some of whose doctrines
(especially those concerning phonology) were widely propa-~
gated and developed by his students in what came to be known

as the "London school” of linguistics.

"The business of linguistics", according to Firth, "is to

1}

describe languages". In saying as much he would hav

#### Unicode normalization

In [3]:
text = 'I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣'
print(text)
text2 = text.encode('utf-8')
print(text2)


I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣
b'I feel really \xf0\x9f\x98\xa1. GOGOGO!! \xf0\x9f\x92\xaa\xf0\x9f\x92\xaa\xf0\x9f\x92\xaa  \xf0\x9f\xa4\xa3\xf0\x9f\xa4\xa3'
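
The cell above shows how text is encoded as UTF-8 bytes. Normalization proper deals with a related problem: the same visible character can be stored as different code point sequences. A minimal sketch with the standard-library `unicodedata` module (the example strings are made up for illustration):

```python
import unicodedata

composed = "caf\u00e9"     # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent (U+0301)

# The two strings render alike but compare as different.
print(composed == decomposed)  # False

# NFC composes the sequence into the single-code-point form.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)         # True

# NFKC additionally folds compatibility characters, e.g. full-width letters.
fullwidth = "\uff2e\uff2c\uff30"  # full-width "NLP"
print(unicodedata.normalize("NFKC", fullwidth))  # NLP
```

Normalizing all input to one form before tokenization or dictionary lookup avoids treating such variants as different words.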


- Other useful libraries
    - Spell checking: pyenchant, Microsoft REST API
    - PDF: PyPDF, PDFMiner
    - OCR: pytesseract
 

### Cleanup

- Preliminaries
    - Sentence segmentation
    - Word tokenization
    

#### Segmentation and Tokenization

In [4]:
from nltk.tokenize import sent_tokenize, word_tokenize

text = '''
Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
'''

## sent segmentation
sents = sent_tokenize(text)

## word tokenization
for sent in sents:
    print(sent)
    print(word_tokenize(sent))


Python is an interpreted, high-level and general-purpose programming language.
['Python', 'is', 'an', 'interpreted', ',', 'high-level', 'and', 'general-purpose', 'programming', 'language', '.']
Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
['Python', "'s", 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'its', 'notable', 'use', 'of', 'significant', 'whitespace', '.']
Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
['Its', 'language', 'constructs', 'and', 'object-oriented', 'approach', 'aim', 'to', 'help', 'programmers', 'write', 'clear', ',', 'logical', 'code', 'for', 'small', 'and', 'large-scale', 'projects', '.']


- Frequent preprocessing
    - Stopword removal
    - Stemming and/or lemmatization
    - Digit/punctuation removal
    - Case normalization
    

#### Removing stopwords, punctuation, and digits

In [5]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation

eng_stopwords = stopwords.words('english')

text = "Mr. John O'Neil works at Wonderland, located at 245 Goleta Avenue, CA., 74208."

words = word_tokenize(text)

print(words)

# Remove stopwords, punctuation marks, and digit-only tokens.
# Tokens such as 'Mr.' and 'CA.' survive because they are not bare punctuation marks.
for w in words:
    if w not in eng_stopwords and w not in punctuation and not w.isdigit():
        print(w)

['Mr.', 'John', "O'Neil", 'works', 'at', 'Wonderland', ',', 'located', 'at', '245', 'Goleta', 'Avenue', ',', 'CA.', ',', '74208', '.']
Mr.
John
O'Neil
works
Wonderland
located
Goleta
Avenue
CA.


#### Stemming and lemmatization

In [6]:
## Stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

words = ['cars','revolution', 'better']
print([stemmer.stem(w) for w in words])


['car', 'revolut', 'better']


In [7]:
## Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

## WordNet needs the POS of each word ('n' = noun, 'a' = adjective)
poss = ['n','n','a']

for w,p in zip(words,poss):
    print(lemmatizer.lemmatize(w, pos=p))

car
revolution
good


- Task-specific preprocessing
    - Unicode normalization
    - Language detection
    - Code mixing
    - Transliteration
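
As one possible illustration of spotting code mixing, the sketch below counts letters by writing system and flags a text when two scripts each make up a sizable share. Reading the script from the Unicode character name is a rough heuristic, and the function names and the `min_share` threshold are assumptions for this example. Note that this detects mixed scripts, not mixed languages that share one script.

```python
import unicodedata

def char_script(ch):
    """Rough script label derived from the Unicode character name."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("CJK"):
        return "CJK"
    if name.startswith("LATIN"):
        return "Latin"
    return "Other"

def script_profile(text):
    """Count the letters of a text per script."""
    counts = {}
    for ch in text:
        script = char_script(ch)
        if script:
            counts[script] = counts.get(script, 0) + 1
    return counts

def is_code_mixed(text, min_share=0.2):
    """True if at least two scripts each cover `min_share` of the letters."""
    counts = script_profile(text)
    total = sum(counts.values())
    if total == 0:
        return False
    return sum(1 for c in counts.values() if c / total >= min_share) >= 2

print(script_profile("我今天要去 meeting"))  # {'CJK': 5, 'Latin': 7}
print(is_code_mixed("我今天要去 meeting"))   # True
print(is_code_mixed("Hello world"))         # False
```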
    

- Automatic annotations
    - POS tagging
    - Parsing
    - Named entity recognition
    - Coreference resolution
    

### Important Reminders for Preprocessing

- Not all steps are necessary
- These steps are NOT sequential
- These steps are task-dependent
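
One way to keep preprocessing optional and task-dependent is to write each step as a small function and let each task choose its own list of steps. This is only a sketch: the step functions, the `preprocess` helper, and the regex tokenizer (a stand-in for `word_tokenize`) are all hypothetical.

```python
import re

def lowercase(tokens):
    return [t.lower() for t in tokens]

def drop_digits(tokens):
    return [t for t in tokens if not t.isdigit()]

def drop_punctuation(tokens):
    # Keep only tokens that contain at least one word character.
    return [t for t in tokens if re.search(r"\w", t)]

def preprocess(text, steps):
    """Tokenize, then apply the chosen steps in the given order."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    for step in steps:
        tokens = step(tokens)
    return tokens

text = "Call 911 NOW!"

# A topic classifier might fold case and drop punctuation ...
print(preprocess(text, [lowercase, drop_punctuation]))  # ['call', '911', 'now']

# ... while a task that relies on case and punctuation cues might only drop digits.
print(preprocess(text, [drop_digits]))                  # ['Call', 'NOW', '!']
```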

## Feature Engineering

### What is feature engineering?

- Feature engineering is the process of turning the extracted and preprocessed texts into input that a machine-learning algorithm can use.
- It aims to capture the characteristics of a text in a numeric vector that ML algorithms can process. (Cf. *construct*, *operational definitions*, and *measurement* in experimental science)
- In short, it concerns how to represent texts quantitatively in a meaningful way, i.e., text representation.

### Feature Engineering for Classical ML

- Word-based frequency lists
- Bag-of-words representations
- Domain-specific word frequency lists
- Handcrafted features based on domain-specific knowledge
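
A bag-of-words representation can be sketched with the standard library alone. The toy documents and the `bow_vector` helper are made up for illustration; in practice a library vectorizer would normally do this work.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
tokenized = [d.split() for d in docs]

# The vocabulary fixes the meaning of each vector position.
vocab = sorted({tok for doc in tokenized for tok in doc})

def bow_vector(tokens, vocab):
    """Count how often each vocabulary term occurs in the token list."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

vectors = [bow_vector(doc, vocab) for doc in tokenized]

print(vocab)       # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 0, 1, 1, 1, 2]
print(vectors[1])  # [0, 1, 1, 0, 1, 1, 2]
```

Word order is discarded: only the counts survive, which is why this representation is called a "bag" of words.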

### Feature Engineering for DL

- DL models take the texts directly as input.
- A DL model can learn features from the texts on its own (e.g., embeddings).
- The learned features are less interpretable.
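
Even when a DL model takes text "directly", the tokens are usually mapped to integer IDs first so that an embedding layer can look up a vector for each one. The sketch below shows only that ID-encoding step. The special tokens `<pad>` and `<unk>` and the helper names are assumptions for this example, not part of any specific framework.

```python
PAD, UNK = "<pad>", "<unk>"

def build_vocab(tokenized_docs):
    """Assign each new token the next free integer ID."""
    vocab = {PAD: 0, UNK: 1}
    for doc in tokenized_docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab, max_len):
    """Map tokens to IDs, truncate to max_len, and pad with the PAD ID."""
    ids = [vocab.get(t, vocab[UNK]) for t in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

train = [["i", "like", "nlp"], ["nlp", "is", "fun"]]
vocab = build_vocab(train)

# "deep" was never seen in training, so it maps to the <unk> ID (1).
print(encode(["i", "like", "deep", "nlp"], vocab, max_len=6))  # [2, 3, 1, 4, 0, 0]
```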
    

## Modeling

### From Simple to Complex

- Start with heuristics or rules
- Experiment with different ML models
    - from heuristics to features
    - from manual annotation to automatic extraction
    - feature importance (weights)
- Find the best model
    - Ensemble and stacking
    - Redo feature engineering
    - Transfer learning
    - Reapply heuristics
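
To give a flavor of how simple heuristics can be combined into an ensemble, the sketch below lets three hand-written rules vote on a label. The rules, the labels, and the `majority_vote` helper are all hypothetical; a real ensemble would typically combine trained models instead.

```python
from collections import Counter

# Three hypothetical rule-based classifiers for a spam/ham task.
def rule_keywords(text):
    return "spam" if "free" in text.lower() else "ham"

def rule_exclaim(text):
    return "spam" if text.count("!") >= 2 else "ham"

def rule_link(text):
    return "spam" if "http" in text.lower() else "ham"

def majority_vote(text, classifiers):
    """Return the label predicted by most classifiers."""
    votes = Counter(clf(text) for clf in classifiers)
    return votes.most_common(1)[0][0]

rules = [rule_keywords, rule_exclaim, rule_link]

print(majority_vote("FREE prize!! Click now", rules))         # spam (2 of 3 votes)
print(majority_vote("See you at http://example.com", rules))  # ham (2 of 3 votes)
```

With an odd number of voters and two labels there are no ties. Other setups would need an explicit tie-breaking rule.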

## Evaluation