# NLP Pipeline

Code notebook for TAHLR Working Group (Spring 2024) based on:  

- Vajjala, S., Majumder, B., Gupta, A., and Surana, H. 2020. *Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems*. Sebastopol, CA: O’Reilly Media.

More information on the book: https://www.oreilly.com/library/view/practical-natural-language/9781492054047/

In [None]:
%%capture

# Installs

!python -m spacy download en_core_web_sm
!pip install -U https://huggingface.co/latincy/la_core_web_sm/resolve/main/la_core_web_sm-any-py3-none-any.whl
!sudo apt install -y tesseract-ocr
!pip install pytesseract tabulate

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Downloads

import os
import urllib.request

# Download the sample image only if it is not already present
local_path = 'somefile.png'
if not os.path.exists(local_path):
    url = 'https://www.dropbox.com/scl/fi/5en6qvay08hxa6lowg3y5/somefile.png?rlkey=kw1f3s87apostym9gva3amfk6&dl=1'
    urllib.request.urlretrieve(url, local_path)

## HTML Parsing and Cleanup

In [None]:
from bs4 import BeautifulSoup
import urllib.request

myurl = "https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python"

# Set the User-Agent header
req = urllib.request.Request(myurl, headers={'User-Agent': 'Mozilla/5.0'})

# Open the URL with the modified request
html = urllib.request.urlopen(req).read()
soupified = BeautifulSoup(html, "html.parser")

print(soupified.prettify()[:500])

In [None]:
# Get the first paragraph of the question and of the first answer

question = soupified.find("div", {"class": "question"})
questiontext = question.find("div", {"class": "s-prose"}).find('p')
print(f"Question: {questiontext.get_text().strip()}")

answer = soupified.find("div", {"class": "answer"})
answertext = answer.find("div", {"class": "s-prose"}).find('p')
print(f"Answer: {answertext.get_text().strip()}")

In [None]:
# Example using Perseus

# Get html
url = "https://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A1999.01.0133%3Abook%3D1%3Acard%3D1"
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soupified = BeautifulSoup(html, "html.parser")

# Get Perseus text
poem = soupified.find("div", {"class": "text_container"})
poemtext = poem.find_all("div", {"class": "text"})

for line in poemtext:
    print(line.get_text().strip())

## Unicode Normalization

In [None]:
text = 'I love Pizza 🍕!  Shall we book a cab 🚕 to get pizza?'
encoded_text = text.encode("utf-8")
print(encoded_text)

In [None]:
print(encoded_text.decode("utf-8"))

In [None]:
# Greek examples...

print("μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος".encode("utf-8"))

In [None]:
print(b'\xce\xbc\xe1\xbf\x86\xce\xbd\xce\xb9\xce\xbd \xe1\xbc\x84\xce\xb5\xce\xb9\xce\xb4\xce\xb5 \xce\xb8\xce\xb5\xe1\xbd\xb0 \xce\xa0\xce\xb7\xce\xbb\xce\xb7\xcf\x8a\xce\xac\xce\xb4\xce\xb5\xcf\x89 \xe1\xbc\x88\xcf\x87\xce\xb9\xce\xbb\xe1\xbf\x86\xce\xbf\xcf\x82'.decode("utf-8"))

In [None]:
# Note the following: a single Greek character can take several bytes in UTF-8

eta_with_circumflex = "ῆ".encode("utf-8")
print(eta_with_circumflex)
print(len(eta_with_circumflex))

In [None]:
eta_with_circumflex = "ῆ".encode("utf-8").decode("utf-8")
print(eta_with_circumflex)
print(len(eta_with_circumflex))
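
The cells above cover encoding and decoding. Normalization proper is a separate step, and the following is a minimal sketch of it using the standard-library `unicodedata` module (not part of the original notebook). It assumes the eta above is the precomposed code point U+1FC6, which matches the bytes `\xe1\xbf\x86` printed earlier. The variable names are illustrative only.

```python
import unicodedata

# Precomposed eta with perispomeni (circumflex), U+1FC6: one code point.
composed = "\u1fc6"

# NFD splits it into the base letter plus a combining accent.
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))
print([unicodedata.name(ch) for ch in decomposed])

# The two forms look identical on screen but do not compare equal.
print(composed == decomposed)

# NFC recomposes the pair, so normalizing both sides makes comparison reliable.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)

# Byte lengths differ as well: 3 bytes composed, 4 bytes decomposed.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

When texts come from mixed sources (for example, scraped HTML and OCR output), normalizing everything to one form before tokenizing or matching avoids spurious mismatches between visually identical strings.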

## Text from scanned documents

In [None]:
from IPython.display import Image
Image('somefile.png')

In [None]:
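# Alias PIL's Image so it does not shadow IPython.display.Image from the previous cell
from PIL import Image as PILImage
from pytesseract import image_to_string

filename = "somefile.png"
text = image_to_string(PILImage.open(filename))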
print(text)

## Sentence and word tokenization

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

mytext = "In the previous chapter, we saw examples of some common NLP applications that we might encounter in everyday life. If we were asked to build such an application, think about how we would approach doing so at our organization. We would normally walk through the requirements and break the problem down into several sub-problems, then try to develop a step-by-step procedure to solve them. Since language processing is involved, we would also list all the forms of text processing needed at each step. This step-by-step processing of text is known as pipeline. It is the series of steps involved in building any NLP model. These steps are common in every NLP project, so it makes sense to study them in this chapter. Understanding some common procedures in any NLP pipeline will enable us to get started on any NLP problem encountered in the workplace. Laying out and developing a text-processing pipeline is seen as a starting point for any NLP application development process. In this chapter, we will learn about the various steps involved and how they play important roles in solving the NLP problem and we’ll see a few guidelines about when and how to use which step. In later chapters, we’ll discuss specific pipelines for various NLP tasks (e.g., Chapters 4–7)."

my_sentences = sent_tokenize(mytext)

In [None]:
for sentence in my_sentences:
    print(sentence)
    print(word_tokenize(sentence))
    print()

## Preprocessing

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation

def preprocess_corpus(texts):
    mystopwords = set(stopwords.words("english"))

    def remove_stops_digits(tokens):
        # Lowercase before the stopword check so capitalized stopwords ("In", "We") are removed too
        lowered = [token.lower() for token in tokens]
        return [token for token in lowered
                if token not in mystopwords
                and not token.isdigit()
                and token not in punctuation]

    return [remove_stops_digits(word_tokenize(text)) for text in texts]

In [None]:
# Latin example

texts = ["""Tityre, tu patulae recubans sub tegmine fagi
silvestrem tenui Musam meditaris avena;
nos patriae fines et dulcia linquimus arva.
nos patriam fugimus; tu, Tityre, lentus in umbra
formosam resonare doces Amaryllida silvas.               5"""]

print(preprocess_corpus(texts))
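
Because `preprocess_corpus` uses NLTK's English stopword list, Latin function words such as "tu", "sub", and "et" pass through the filter. The sketch below shows the same filtering idea with a Latin stopword set. Note the assumptions: `LATIN_STOPS` is a hypothetical, deliberately tiny list made up for illustration, `preprocess_latin` is not part of the notebook, and a simple regular expression stands in for `word_tokenize` so the example needs only the standard library.

```python
import re
from string import punctuation

# Hypothetical, deliberately tiny Latin stopword list (illustration only).
LATIN_STOPS = {"et", "in", "sub", "tu", "nos"}

def preprocess_latin(text, stops=LATIN_STOPS):
    """Lowercase the text, then drop stopwords, digits, and punctuation."""
    # Runs of word characters become tokens; each punctuation mark is its own token.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [t for t in tokens
            if t not in stops and not t.isdigit() and t not in punctuation]

line = "nos patriam fugimus; tu, Tityre, lentus in umbra 5"
print(preprocess_latin(line))
# ['patriam', 'fugimus', 'tityre', 'lentus', 'umbra']
```

For real work, a curated Latin stopword list would replace the five-word set used here.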

## Stemming and lemmatizing

In [None]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
word1, word2 = "cars", "revolution"
print(stemmer.stem(word1), stemmer.stem(word2))

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # pos="a" means adjective

In [None]:
# with spaCy

import spacy

sp = spacy.load('en_core_web_sm')
doc = sp('better')
for word in doc:
    print(word.text, word.lemma_)

In [None]:
nlp = spacy.load('en_core_web_sm')
doc = nlp('Charles Spencer Chaplin was born on 16 April 1889 to Hannah Chaplin (born Hannah Harriet Pedlingham Hill) and Charles Chaplin Sr')

for token in doc:
    print(token.text, token.lemma_, token.pos_,
          token.shape_, token.is_alpha, token.is_stop)

In [None]:
# with spaCy, Latin

preprocessed_text = " ".join(preprocess_corpus(texts)[0])

nlp = spacy.load('la_core_web_sm')
# Lemmatize a single token: the third one in the preprocessed text ("patulae")
doc = nlp(preprocessed_text.split()[2])

for token in doc:
    print(token.text, token.lemma_)

In [None]:
from tabulate import tabulate

doc = nlp(preprocessed_text)

# Collect one row of attributes per token
data = [
    (token.text, token.lemma_, token.pos_,
     token.shape_, token.is_alpha, token.is_stop)
    for token in doc
]

print(tabulate(data, headers=["Token", "Lemma", "POS", "Shape", "Is Alpha", "Is Stop"]))
