 
# NLP Pipeline

## A General NLP Pipeline

![nlp-pipeline](../images/nlp-pipeline.png)

### Variations of the NLP Pipeline

- The process may not always be linear.
- There are loops in between.
- These procedures may depend on the specific task at hand.

## Data Collection

### Data Acquisition: Heart of ML System

- Ideal Setting: We have everything needed.
- Labels and Annotations
- Very often we are dealing with less-than-ideal scenarios

### Less-than-ideal Scenarios

- Initial datasets with limited annotations/labels
- Initial datasets labeled based on regular expressions or heuristics
- Public datasets (cf. [Google Dataset Search](https://datasetsearch.research.google.com/) or [kaggle](https://www.kaggle.com/))
- Scrape data
- Product intervention
- Data augmentation

### Data Augmentation

- Data augmentation exploits properties of language to create new texts that are syntactically similar to the source texts.
- Types of strategies:
    - Synonym replacement
    - Related word replacement (based on association metrics)
    - Back translation
    - Replacing entities
    - Adding noise to data (e.g. spelling errors, random words)
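
As a rough illustration of two of these strategies, the sketch below performs synonym replacement and noise injection using only the standard library. The `SYNONYMS` table and the helper names `replace_synonyms` and `add_typo` are hypothetical and exist only for this example; a real project would more likely draw synonyms from a lexical resource or from word embeddings.

```python
import random

# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
    "movie": ["film"],
}


def replace_synonyms(tokens, rng, prob=0.5):
    """Swap each token that has a known synonym with probability `prob`."""
    out = []
    for tok in tokens:
        options = SYNONYMS.get(tok.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(tok)
    return out


def add_typo(token, rng):
    """Simulate a spelling error by swapping two adjacent characters."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


rng = random.Random(42)  # fixed seed so the augmented texts are reproducible
tokens = "The quick reviewer was happy with the movie".split()

# Synonym replacement: every token with an entry in SYNONYMS is swapped.
augmented = replace_synonyms(tokens, rng, prob=1.0)

# Noise: roughly 30% of the tokens receive a character-swap "typo".
noisy = [add_typo(t, rng) if rng.random() < 0.3 else t for t in tokens]

print(" ".join(augmented))
print(" ".join(noisy))
```

Augmented texts keep the label of the source text, so it is worth spot-checking that a replacement does not change the meaning the label depends on.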

## Text Extraction and Cleanup

### Text Extraction

- Extracting raw texts from the input data
    - HTML
    - PDF
- Relevant vs. irrelevant information
    - non-textual information
    - markup
    - metadata
- Encoding format
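
A minimal sketch of these concerns, assuming the input is a small HTML document held in memory as bytes: the bytes are decoded with the correct encoding first, and then markup and non-textual content (scripts, styles) are discarded. It uses Python's built-in `html.parser` module; the class name `TextExtractor` and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text chunks, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


# Web content arrives as bytes; here the sample page is encoded as UTF-8.
raw_bytes = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><p>Café menu</p><script>var x = 1;</script>"
    "<p>Open <b>daily</b></p></body></html>"
).encode("utf-8")

parser = TextExtractor()
parser.feed(raw_bytes.decode("utf-8"))  # a wrong encoding here would garble 'é'
parser.close()
text_chunks = parser.chunks
print(text_chunks)
```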

#### Extracting texts from webpages

Extracting text from websites is a very common way to obtain data. It requires a close study of the HTML structure of the target web pages.

The following code extracts the hyperlinks from the homepage of a news agency and automatically visits the first hyperlink to retrieve its textual content.

In [1]:
import requests
from bs4 import BeautifulSoup

## China Times homepage (real-time news)
newsurl = 'https://www.chinatimes.com/realtimenews/?chdtv'
r = requests.get(newsurl)
web_content = r.text
soup = BeautifulSoup(web_content, 'html.parser')

## collect the relative links of all headlines (<a> directly under <h3 class="title">)
links = [x['href'] for x in soup.select('h3.title > a')]
first_art_link = 'https://www.chinatimes.com' + links[0] + '?chdtv'

## fetch the first article and decode the response as UTF-8
#print(first_art_link)
art_request = requests.get(first_art_link)
art_request.encoding = 'utf8'
soup_art = BeautifulSoup(art_request.text, 'html.parser')

## the article body is spread over the <p> elements
art_content = soup_art.find_all('p')
art_texts = [p.text for p in art_content]
print(art_texts)


['據《華爾街日報》報導，眾多中國大陸城市在資金短缺、經濟困難之際，正紛紛拋出前所未有的甜頭向西方企業示好。中國大陸政府在2023年啟動了「投資中國年」活動，地方官員們開啟了面向海外的推介宣傳之旅，以吸引投資者的興趣。但上述招商努力迎頭撞上了國家主席習近平的國家安全議程，相關議程側重抵禦感受到的外來威脅，對外國公司來說，這已經使任何對華投資都成為潛在雷區。', '', '今年以來的一場由習近平領導的運動，讓西方管理顧問公司、審計公司等機構遭遇了一連串的突擊搜查、調查和拘留行動。與此同時，反間諜法的拓展使外國公司高管愈發擔心，在中國大陸開展市場調研等常規商業活動可能被視為間諜活動。', '', '中國大陸經濟已經因私營部門投資不振、消費疲軟和青年失業率飆升而步履艱難，而有關在中國大陸做生意風險大增的看法正妨礙著資金流入。', '', '報導引述研究公司榮鼎集團（Rhodium Group）分析師Mark Witzke對中國大陸政府數據的分析，今年第一季，中國大陸的外商直接投資額降至200億美元，去年第一季度則為1000億美元。此外，高盛（Goldman Sachs）經濟學家預計，今年中國大陸的資金流出額將抵消投資流入額，對一個在過去40年裡資金流入一直多於流出的國家來說，這是一個相當驚人的變化。', '', '報導稱，近幾十年來，對西方開放為中國大陸經濟增長提供了助力，這一增長依靠外國投資和專業知識來推動創新和提高生產率。對中國大陸領導人來說，一方面向外國企業施壓，同時還要努力吸引這些企業投資，這樣的試圖平衡之舉充滿風險，有可能使中國大陸失去助力其崛起的資金、技術、理念和管理技能。', '', '陷入困境的城市', '', '報導稱，這場拉鋸戰使中國大陸各地陷入財政困境的城鎮飽受煎熬。許多城市都急需資金，在經歷了三年的新冠限制政策後，這些城市深陷債務泥淖，難以創造就業機會。', '', '報導引述中國大陸官方統計，去年地方政府的支出同比增加，主要原因是用於支付新冠檢測和相關費用的衛生支出猛增了18％。與此同時，地方政府的收入卻下降，主要原因是向開發商出讓土地的收入同比銳減23％，而土地出讓是地方政府長期依賴的資金來源。地方政府的借貸超過了他們的償還能力，直接欠下的債務高達收入的120％。', '', '許多官員表示，他們吸引外資的傳統策略不再奏效。', '', '報導

#### Extracting texts from scanned PDF

:::{important}
[Tesseract](https://tesseract-ocr.github.io/tessdoc/Installation.html) is an open-source optical character recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly or, for programmers, through an API to extract printed text from images. It supports a wide variety of languages.

You have to **manually** install it before you can use it in Python.
:::

In [2]:
from PIL import Image
from pytesseract import image_to_string

#import os
#print(os.getcwd())

YOUR_DEMO_DATA_PATH = "../../../RepositoryData/data/"  # please change your file path
filename = YOUR_DEMO_DATA_PATH+'pdf-firth-text.png'

text = image_to_string(Image.open(filename))
print(text)

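If the cell fails with `ModuleNotFoundError: No module named 'pytesseract'`, the Python wrapper is missing. It has to be installed (e.g., `pip install pytesseract`) in addition to the Tesseract engine itself.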

#### Unicode normalization

In [None]:
text = 'I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣 ȀÆĎǦƓ'
print(text)
text2 = text.encode('utf-8') # encode the string as UTF-8 bytes
print(text2)


In [None]:
import unicodedata
## decompose characters (NFKD), then drop everything that has no ASCII representation
unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

- Please check the [unicodedata documentation](https://docs.python.org/3/library/unicodedata.html) for more details on character normalization.
- Other useful libraries
    - Spelling check: pyenchant, Microsoft REST API
    - PDF: PyPDF, PDFMiner
    - OCR: pytesseract
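
To make the effect of the normalization forms more concrete, here is a small sketch using only `unicodedata` from the standard library. The helper `strip_accents` is a hypothetical name introduced for this example; whether removing accents is appropriate depends on the language and the task.

```python
import unicodedata

composed = "\u00e9"      # 'é' as one precomposed code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

# The two strings look identical on screen but are different code point sequences.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# Compatibility forms (NFKC/NFKD) additionally fold "look-alike" characters.
ligature = "\ufb01"      # the single-character ligature 'ﬁ'
print(unicodedata.normalize("NFKC", ligature))               # fi


def strip_accents(s):
    """Drop combining marks after canonical decomposition (NFD)."""
    parts = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in parts if not unicodedata.combining(ch))


print(strip_accents("naïve café"))                           # naive cafe
```

Normalizing all texts to one form before comparing or counting strings avoids treating visually identical words as different types.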
 

### Cleanup

- Preliminaries
    - Sentence segmentation
    - Word tokenization
    

#### Segmentation and Tokenization

:::{important}
The NLTK package provides many useful datasets for text analysis. Some of the code may require you to download the corpus data first. Please see [Installing NLTK Data](https://www.nltk.org/data.html) for more information.
:::

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

text = '''
Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
'''

## sentence segmentation
sents = sent_tokenize(text)

## word tokenization
for sent in sents:
    print(sent)
    print(word_tokenize(sent))
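
For comparison, a naive regular-expression approach can be sketched as follows. This is not how NLTK works internally; the function names `naive_sent_split` and `naive_word_tokenize` and the sample text are assumptions made for this example. The point is to show why trained tokenizers are usually preferred: the naive splitter breaks after the abbreviation "Mr.".

```python
import re


def naive_sent_split(text):
    """Split after ., ! or ? when followed by whitespace and an uppercase letter."""
    return [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip()) if s]


def naive_word_tokenize(sentence):
    """Return words (keeping internal apostrophes/hyphens) and single punctuation marks."""
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)


sample = "Python is readable. Mr. Smith doesn't disagree! Is it fast?"

naive_sents = naive_sent_split(sample)
for s in naive_sents:
    print(s, naive_word_tokenize(s))
```

The splitter returns four "sentences" here because "Mr." is treated as a sentence boundary, which is exactly the kind of case that abbreviation-aware segmenters are designed to handle.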

- Frequent preprocessing
    - Stopword removal
    - Stemming and/or lemmatization
    - Digit/punctuation removal
    - Case normalization
    

#### Removing stopwords, punctuation, and digits

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation

eng_stopwords = set(stopwords.words('english'))

text = "Mr. John O'Neil works at Wonderland, located at 245 Goleta Avenue, CA., 74208."

words = word_tokenize(text)

print(words)

# remove stopwords, punctuation, and digits
# (the stopword list is lowercase, so compare against the lowercased token)
for w in words:
    if w.lower() not in eng_stopwords and w not in punctuation and not w.isdigit():
        print(w)

#### Stemming and lemmatization

In [None]:
## Stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

words = ['cars','revolution', 'better']
print([stemmer.stem(w) for w in words])


In [None]:
## Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

## WordNet lemmatization depends on the POS of each word
poss = ['n','n','a']

for w,p in zip(words,poss):
    print(lemmatizer.lemmatize(w, pos=p))

- Task-specific preprocessing
    - Unicode normalization
    - Language detection
    - Code mixing
    - Transliteration (e.g., using pinyin for Chinese words in English-Chinese code-switching texts)
    

- Automatic annotations
    - POS tagging
    - Parsing
    - Named Entity Recognition
    - Coreference resolution
    

### Important Reminders for Preprocessing

- Not all steps are necessary
- These steps are NOT sequential
- These steps are task-dependent
- Goals
    - Text Normalization
    - Text Tokenization
    - Text Enrichment/Annotation
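
One way to reflect these reminders in code is to make every preprocessing step optional. The sketch below is a minimal illustration using only the standard library; the function `preprocess`, its parameters, and `TOY_STOPWORDS` are hypothetical names chosen for this example, and the regex tokenizer is deliberately crude.

```python
import re
import unicodedata


def preprocess(text, *, lowercase=True, strip_digits=False, stopwords=frozenset()):
    """Normalize, tokenize, and optionally filter a text.

    Each step is controlled by an argument because no step is mandatory.
    """
    text = unicodedata.normalize("NFKC", text)   # text normalization
    if lowercase:
        text = text.lower()                      # case normalization
    tokens = re.findall(r"\w+", text)            # crude tokenization
    if strip_digits:
        tokens = [t for t in tokens if not t.isdigit()]
    return [t for t in tokens if t not in stopwords]


doc = "The 2 cats sat on the mat in 2020."
TOY_STOPWORDS = frozenset({"the", "on", "in"})   # toy list for illustration

# A topic-oriented task might drop case, digits, and stopwords ...
print(preprocess(doc, strip_digits=True, stopwords=TOY_STOPWORDS))

# ... whereas a task such as named entity recognition may need case and digits.
print(preprocess(doc, lowercase=False))
```

Note that the stopword filter compares tokens as they are after the optional lowercasing, so the order of the steps matters even in this small example.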

## Feature Engineering

### What is feature engineering?

- It refers to the process of preparing the extracted and preprocessed texts so that they can be fed into a machine-learning algorithm.
- It aims at capturing the characteristics of the text in a numeric vector that ML algorithms can process. (Cf. *construct*, *operational definitions*, and *measurement* in experimental science)
- In short, it concerns how to meaningfully represent texts quantitatively, i.e., text representation.

### Feature Engineering for Classical ML

- Word-based frequency lists
- Bag-of-words representations
- Domain-specific word frequency lists
- Handcrafted features based on domain-specific knowledge
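
As a minimal sketch of a bag-of-words representation, assuming whitespace tokenization and a two-document toy corpus, each text is turned into a vector of word counts over a shared vocabulary. The helper name `bow_vector` is introduced only for this example.

```python
from collections import Counter

docs = [
    "the dog chased the cat",
    "the cat sat on the mat",
]

# Whitespace tokenization is an assumption made to keep the example short.
tokenized = [d.split() for d in docs]

# The vocabulary fixes the meaning of each vector dimension.
vocab = sorted({tok for toks in tokenized for tok in toks})


def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in `tokens`."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


vectors = [bow_vector(toks, vocab) for toks in tokenized]

print(vocab)
for vec in vectors:
    print(vec)
```

Word order is lost in this representation: the vector only records which words occur and how often.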

### Feature Engineering for DL

- DL directly takes the texts as inputs to the model.
- The DL model is capable of learning features from the texts (e.g., embeddings)
- The price is that the model is often less interpretable.
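
In practice, the text is still mapped to integer ids before a network can consume it, and the embedding layer then looks up one vector per id. The sketch below illustrates that lookup with the standard library only. The randomly initialized `embedding_table` merely stands in for weights that a real model would learn during training, and all names here are assumptions made for this example.

```python
import random

sentences = [["i", "like", "nlp"], ["i", "like", "deep", "learning"]]

# Reserve ids for padding and for words unseen at training time.
PAD, UNK = "<pad>", "<unk>"
vocab = {PAD: 0, UNK: 1}
for sent in sentences:
    for tok in sent:
        vocab.setdefault(tok, len(vocab))


def encode(tokens, max_len):
    """Map tokens to ids, then truncate or pad to a fixed length."""
    ids = [vocab.get(t, vocab[UNK]) for t in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Placeholder for learned weights: one random vector per vocabulary entry.
rng = random.Random(0)
EMBED_DIM = 4
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]

ids = encode(["i", "like", "cats"], max_len=5)
vectors = [embedding_table[i] for i in ids]

print(ids)
print(len(vectors), len(vectors[0]))
```

Unlike the handcrafted features of classical ML, the individual dimensions of such vectors have no predefined meaning, which is one source of the reduced interpretability.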
    

### Strengths and Weaknesses

![](../images/feature-engineer-strengths.png)

![](../images/feature-engineer-weakness.png)

## Modeling

### From Simple to Complex

- Start with heuristics or rules
- Experiment with different ML models
    - From heuristics to features
    - From manual annotation to automatic extraction
    - Feature importance (weights)
- Find the optimal model
    - Ensemble and stacking
    - Redo feature engineering
    - Transfer learning
    - Reapply heuristics
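
The first step, starting with heuristics or rules, might look like the following sketch of a rule-based spam filter. The patterns in `SPAM_PATTERNS`, the threshold, and the sample messages are all invented for illustration; such a baseline is mainly useful as a reference point for later ML models, and the number of matching rules can itself become a feature.

```python
import re

# Hypothetical trigger patterns, chosen only for illustration.
SPAM_PATTERNS = [r"\bfree\b", r"\bwinner\b", r"\bclick here\b", r"\$\d+"]


def count_rule_hits(message):
    """Number of patterns that match the message (usable as an ML feature later)."""
    return sum(
        bool(re.search(p, message, flags=re.IGNORECASE)) for p in SPAM_PATTERNS
    )


def rule_based_is_spam(message, threshold=2):
    """Flag a message as spam when at least `threshold` patterns match."""
    return count_rule_hits(message) >= threshold


messages = [
    "WINNER! Click here to claim your FREE prize",
    "Are we still meeting for lunch tomorrow?",
    "Get $500 now, click here",
]
predictions = [rule_based_is_spam(m) for m in messages]
print(predictions)
```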

## Evaluation

### Why evaluation?

- We need to know how *good* the model we've built is -- "Goodness"
- Factors relating to the evaluation methods
    - Model building
    - Deployment
    - Production
- ML metrics vs. Business metrics


### Intrinsic vs. Extrinsic Evaluation

- Take a spam-classification system as an example
- Intrinsic:
    - the precision and recall of the spam classification/prediction
- Extrinsic:
    - the amount of time users spent on a spam email
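
The intrinsic metrics mentioned above can be computed directly from the gold labels and the predictions. The following sketch uses made-up labels (1 = spam, 0 = not spam); the function name `precision_recall` is introduced for this example only.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented labels for eight emails.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 1, 1]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.60 recall=0.75
```

Here three of the five emails flagged as spam really are spam (precision 0.60), and three of the four actual spam emails were caught (recall 0.75).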
    

### General Principles

- Do intrinsic evaluation before extrinsic.
- Extrinsic evaluation is more expensive because it often involves project stakeholders outside the AI team.
- Only when we get consistently good results in intrinsic evaluation should we go for extrinsic evaluation.
- Bad results in intrinsic evaluation often imply bad results in extrinsic evaluation as well.

### Common Intrinsic Metrics

- Principles for Evaluation Metrics Selection
- Data type of the labels (ground truths)
    - Binary (e.g., sentiment)
    - Ordinal (e.g., information retrieval)
    - Categorical (e.g., POS tags)
    - Textual (e.g., named entity, machine translation, text generation)
- Automatic vs. Human Evaluation

## Post-Modeling Phases

### Post-Modeling Phases

- Deployment of the model in a production environment (e.g., web service)
- Monitoring system performance on a regular basis
- Updating the system with newly arriving data

## References

- Chapter 2 of Practical Natural Language Processing. {cite}`vajjala2020`