# Week 02. Tokens, N-Grams and Linguistics

## Text as Data
Professor: Elliott Ash, NYU

TA: Eduardo Zago, NYU

Objective of the course: Build an LLM from scratch

Last lab we covered simple pandas pre-processing, some tokenization, sentiment analysis and a few other topics. The objective was to introduce packages and functions that are useful and do not need much intuition.

Now we can start building a pipeline:

Where are we?

Stage 1.1) Data preparation and sampling

In [None]:
# imports and setup (the random seed is set later with np.random.seed)
import numpy as np
import warnings; warnings.simplefilter('ignore')
%matplotlib inline
import pandas as pd
import re
import matplotlib.pyplot as plt
from collections import Counter

import spacy
nlp = spacy.load('en_core_web_sm')

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk import sent_tokenize
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# For colab
#!pip install gensim
#!pip install tiktoken
#!pip install sentencepiece
# https://github.com/google/sentencepiece
#!pip install transformers


In [None]:
# If you are using Google Colab, this uploads sc_cases_cleaned.pkl from your local machine.
from google.colab import files
uploaded = files.upload()

In [None]:
# load cleaned data from lesson 1.
df = pd.read_pickle('sc_cases_cleaned.pkl',compression='gzip')
df.columns

# Simple Pre-processing and Tokenization

Remember: **Pre-processing choices affect downstream results (Denny and Spirling 2017)**

*  Can you think of a case where removing whitespaces is detrimental?
*  Upper/lower cases?

Pre-processing decisions depend on the application and its requirements. There is no general formula.
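
To make the two questions above concrete, here is a minimal sketch using only the standard library. The helper `lowercase_and_strip` and the example strings are made up for illustration; the point is that lowercasing and punctuation removal can merge tokens that mean different things.

```python
import re

def lowercase_and_strip(raw):
    """Lowercase and delete punctuation (one possible normalization, not the only one)."""
    return re.sub(r"[^\w\s]", "", raw.lower())

# "US" (the country) and "us" (the pronoun) become the same token
print(lowercase_and_strip("The US sued us.").split())

# the abbreviation "M.A.'s" collapses into a string that is no longer recognizable
print(lowercase_and_strip("M.A.'s"))
```

Whether this loss matters depends on the task: for topic modelling it is often harmless, while for entity extraction it may not be.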

The function below applies many steps at once so that you have the tools at hand:

In [None]:
text = "Prof. Zurich hailed from Zurich. She got 3 M.A.'s from ETH."
# Import all common punctuation
from string import punctuation
translator = str.maketrans('', '', punctuation)  # delete every punctuation character

# Import english stopwords
from nltk.corpus import stopwords
stoplist = set(stopwords.words('english'))

def normalize_text(doc):
    "Input doc and return clean list of tokens"
    doc = doc.replace('\r', ' ').replace('\n', ' ')
    doc = doc.lower() # all lower case
    doc = doc.translate(translator) # remove punctuation
    doc = re.sub(r"(\d)([A-Za-z])", r"\1 \2", doc) # separate numbers from strings
    doc = re.sub(r"([A-Za-z])(\d)", r"\1 \2", doc) # separate strings from numbers
    words = doc.split() # split into tokens
    words = [w for w in words if w not in stoplist] # remove stopwords
    words = [w if not w.isdigit() else '#' for w in words] # normalize numbers
    return words

print(normalize_text(text))

**Stemming is different from lemmatizing**

Stemming heuristically chops off word endings. Lemmatization reduces a word to its dictionary base form (lemma).
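
The difference can be sketched without any library. The suffix list and the tiny lemma dictionary below are hypothetical and only for illustration; real stemmers (such as Snowball) and lemmatizers (such as WordNet) use far richer rules and dictionaries.

```python
# Hypothetical suffix rules and a hand-made lemma dictionary, for illustration only.
TOY_SUFFIXES = ["ies", "ing", "ed", "s"]
TOY_LEMMAS = {"studies": "study", "better": "good", "ran": "run"}

def toy_stem(word):
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["studies", "better", "ran"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stem does not have to be a real word, while the lemma is always a dictionary entry; irregular forms can only be resolved by lookup.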

*When is stemming more useful than lemmatizing?*

In [None]:
word = 'studies'

# Stemming
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')

# Lemmatizing
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

print('Stemmed word is:', stemmer.stem(word), 'and lemmatized word is:',
      wnl.lemmatize(word))

**Sometimes it is better to use a shortcut: `gensim.utils.simple_preprocess`.**



In [None]:
from gensim.utils import simple_preprocess # lowercase, tokenized, punctuations/numbers removed
print(simple_preprocess(text))

**More customized preprocessing: `gensim.parsing.preprocessing.preprocess_string` with filters.**

In [None]:
from gensim.parsing.preprocessing import preprocess_string, strip_tags, \
 strip_punctuation, strip_multiple_whitespaces, strip_numeric, \
 remove_stopwords, strip_short, stem_text
import re

complicated_text = "<div>Prof. Zurich <i>hailed</i> from Zurich., She got 3 M.A.'s from ETH.</div>" # added html tags

CUSTOM_FILTERS = [
    lambda x: x.lower(),
    lambda x: x.replace('\r', ' ').replace('\n', ' '),
    strip_tags,
    strip_punctuation,
    lambda x: re.sub(r"\b\d+\b", "#", x),   # normalize numbers (after strip_punctuation, which would remove '#')
    strip_multiple_whitespaces,
    remove_stopwords
]

preprocess_string(complicated_text,
                  CUSTOM_FILTERS)

**Another shortcut: spaCy**

In [None]:
def tokenize(x, nlp):
    # lemmatize and lowercase without stopwords, punctuation and numbers
    return [w.lemma_.lower() for w in nlp(x) if not w.is_stop and not w.is_punct
            and not w.is_digit]
tokenize(text, nlp)

# Bag-of-terms tokenization

Objective: build a document-term matrix X (most of the earlier models and topic models take this as input).
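
As a rough sketch of what such a matrix contains, the example below builds one by hand with `collections.Counter`. The two documents are made up, and the names `toy_docs`, `toy_vocab` and `X_toy` are only for illustration.

```python
from collections import Counter

# two made-up documents
toy_docs = ["the court held the statute valid",
            "the jury found the defendant guilty"]

tokenized = [d.split() for d in toy_docs]
toy_vocab = sorted(set(w for doc in tokenized for w in doc))   # column order of the matrix
counts = [Counter(doc) for doc in tokenized]

# rows are documents, columns are terms, entries are term frequencies
X_toy = [[c[w] for w in toy_vocab] for c in counts]

print(toy_vocab)
for row in X_toy:
    print(row)
```

In practice the matrix is stored in sparse form, because most entries are zero.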

We want the term frequency of word w in document k. First, we generate the n-grams:

In [None]:
from nltk import ngrams
from collections import Counter

# get n-gram counts for the first 50 documents
grams = []
for doc_text in df['opinion_text'].head(50):
    tokens = doc_text.lower().split() # get tokens
    for n in range(2, 4):
        grams += list(ngrams(tokens, n)) # get bigrams and trigrams
Counter(grams).most_common(8)  # most frequent n-grams

In [None]:
# word frequencies for the last document processed in the loop above
freqs = Counter(tokens)
freqs.most_common(20)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(min_df=0.01, # at min 1% of docs
                        max_df=.9,
                        max_features=1000,
                        stop_words='english',
                        ngram_range=(1,3))
X = vec.fit_transform(df['opinion_text'])

# save the vectors
pd.to_pickle(X,'X.pkl')

# Tokenizers for LLMs: Introduction to Encoder-Decoders

We need to convert the tokens from Python strings to an integer representation, the **token IDs**.

**Process:**



1. Tokenize as we have been doing (e.g. `normalize_text` or `nlp()`).
2. Remove duplicate tokens and sort alphabetically.
3. Define a mapping from each unique token to a unique integer (the encoder) and its inverse (the decoder).

See Raschka Sections 2.3 to 2.6
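
The three steps can be sketched on a single made-up sentence before applying them to the Supreme Court data. Whitespace splitting stands in for the real tokenizer here, and all variable names are illustrative.

```python
# step 1: tokenize (whitespace split as a stand-in for a real tokenizer)
toy_tokens = "the court held that the statute was valid".split()

# step 2: remove duplicates and sort
toy_types = sorted(set(toy_tokens))

# step 3: token -> id mapping (encoder) and its inverse (decoder)
token_to_id = {tok: i for i, tok in enumerate(toy_types)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

toy_ids = [token_to_id[t] for t in toy_tokens]
print(toy_ids)
print(" ".join(id_to_token[i] for i in toy_ids))
```

Decoding the IDs returns the original token sequence, because the mapping is one-to-one.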

In [None]:
# Let's work with the Supreme Court Data:
np.random.seed(4)
dfs = df.sample(10)

dfs["doc"] = dfs["opinion_text"].apply(normalize_text)

all_words = sorted(set(w for doc in dfs["doc"] for w in doc))
print(len(all_words)) # Unique tokens

In [None]:
# Simple way of creating a Vocab:
vocab = {token: integer for integer, token in enumerate(all_words)} # dictionary: token -> id

# Check it works:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

In [None]:
# Preprocessing decisions are important: see what happens with gensim:
dfs["doc2"] = dfs["opinion_text"].apply(lambda x: preprocess_string(x, CUSTOM_FILTERS))

all_words2 = sorted(set(w for doc in dfs["doc2"] for w in doc))
vocab2 = {token: integer for integer, token in enumerate(all_words2)} # dictionary: token -> id
for i, item in enumerate(vocab2.items()):
    print(item)
    if i >= 50:
        break

In [18]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab # dictionary with token -> id pairs
        self.int_to_str = {i:s for s,i in vocab.items()} # reverse dictionary with id -> token pairs

    def encode(self, text): # String to vocab id
        preprocessed = normalize_text(text) # same pre-processing as you did with the vocab
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids): # vocab id to string
        text = " ".join([self.int_to_str[i] for i in ids])
        return text

In [None]:
tokenizer = SimpleTokenizerV1(vocab)
text = dfs['opinion_text'].iloc[0]
ids = tokenizer.encode(text)
print(ids)

In [None]:
text = tokenizer.decode(ids)
print(text)

In [None]:
# 'gallup' is expected to be missing from the vocab, in which case this raises a KeyError
ids = tokenizer.encode('gallup')

If a word is not in the vocab, the encoder fails with a `KeyError`. There are two solutions.

1) **Special context tokens**

Add `<|unk|>` for unknown words. Other special tokens include `<|endoftext|>`, `[PAD]` (padding), etc.

In [22]:
# Add to vocab
all_words.extend(["<|unk|>"])
vocab = {token:integer for integer,token in enumerate(all_words)}

class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab # token -> id
        self.int_to_str = {i:s for s,i in vocab.items()} # id -> token

    def encode(self, text): # string to list of ids
        preprocessed = normalize_text(text)
        preprocessed = [item if item in self.str_to_int
                        else "<|unk|>" for item in preprocessed] # only difference from V1: map unknown tokens to <|unk|>
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids): # list of ids to string
        text = " ".join([self.int_to_str[i] for i in ids])
        return text

In [None]:
tokenizer = SimpleTokenizerV2(vocab)
# Now let's see what it returns:
gall = 'gallup polls show an increase in preferences'
text_after = tokenizer.decode(tokenizer.encode(gall))
print(text_after)
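
The end-of-text token mentioned above is typically placed between unrelated documents when they are concatenated into one training stream. The sketch below shows the idea on two made-up, already-tokenized documents; the names `EOT`, `tok2id` and `stream_ids` are illustrative and not part of the tokenizer classes above.

```python
EOT = "<|endoftext|>"
UNK = "<|unk|>"

# two made-up documents, already tokenized
toy_corpus = [["court", "held"], ["jury", "found"]]

# regular tokens first, special tokens appended at the end of the vocabulary
toy_vocab_sp = sorted({w for d in toy_corpus for w in d}) + [UNK, EOT]
tok2id = {t: i for i, t in enumerate(toy_vocab_sp)}

stream = []
for d in toy_corpus:
    stream.extend(d + [EOT])      # mark where each document ends

# unknown tokens fall back to the id of <|unk|>
stream_ids = [tok2id.get(t, tok2id[UNK]) for t in stream]
print(stream_ids)
print(tok2id.get("gallup", tok2id[UNK]))
```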

2) **Byte-pair encoding (BPE)**

Breaks words that are not in the vocabulary into smaller subword units (down to single characters or bytes if necessary), so any string can be encoded. We use a package called `tiktoken`.

In [None]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

comp = 'Handles word misspellings when added at the enddd'
integers = tokenizer.encode(comp, allowed_special={"<|endoftext|>"})
print(integers)

In [None]:
strings = tokenizer.decode(integers)
print(strings)

In [None]:
# More precisely:

print(tokenizer.decode([886]))
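
To see where subword pieces come from, here is a toy sketch of the BPE training loop: start from characters and repeatedly merge the most frequent adjacent pair. This is a simplified illustration on a made-up word-frequency table, not the byte-level implementation that `tiktoken` uses; ties are broken by first occurrence.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words and return the most frequent one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)   # ties: first pair encountered wins

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# made-up corpus: word -> frequency, each word starts as a tuple of characters
toy_words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):                     # learn three merge rules
    pair = most_frequent_pair(toy_words)
    merges.append(pair)
    toy_words = merge_pair(toy_words, pair)

print(merges)
print(list(toy_words))
```

After a few merges, frequent endings such as `est` become single symbols, which is why a misspelled or unseen word can still be encoded as a sequence of known pieces.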

Most models (BERT, LLMs, etc) will handle the tokenization and encoding-decoding or will provide you functions to do it (shortcuts).

Still it is useful to know how to do it yourself for lower level models and to understand the significance of each decision along the pipeline.

1) Pre-processing affects tokenization, therefore affects the results of the models.
2) How you encode - decode special tokens will also affect the results of your models.

Two quick introductions (we will see them later): SentencePiece and Hugging Face.


In [None]:
import sentencepiece as spm
# training spm requires a text file as input, so generate a small one

with open("sample_text.txt", "w") as outfile:
    for text in df["opinion_text"][:10]:
        outfile.write(text + "\n")

spm.SentencePieceTrainer.train(input="sample_text.txt", model_prefix='m',
                               vocab_size=100)

# makes segmenter instance and loads the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')
sp.encode_as_pieces(df["opinion_text"][0][:100])

In [None]:
# we use distilbert tokenizer
from transformers import DistilBertTokenizerFast

# let's instantiate a tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# tokenize text
text = "Prof. Zurich hailed from Zurich. She got 3 M.A.'s from ETH."
tokenizer.tokenize(text)

# Linguistic Processing using spaCy

These pre-processing and tokenization steps leave out important information about the structure of human language, which we can capture with some ideas from linguistics.

(From class) Parts of speech vary in their informativeness for various functions:

1) For categorizing topics, nouns are usually most important
2) For sentiment, adjectives are usually most important.

*Would a topic model be better if we fed it only nouns? Would we get a more accurate sentiment prediction if we used only adjectives?*
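
One way to experiment with these questions is to filter tokens by part of speech before building features. The sketch below uses a hand-tagged toy sentence so that it runs without a model; with spaCy, the same list could be built as `[(t.lemma_.lower(), t.pos_) for t in nlp(text)]`. The helper `keep_pos` is hypothetical.

```python
# hand-tagged toy input (coarse POS tags written by hand, not model output)
tagged = [("the", "DET"), ("court", "NOUN"), ("issued", "VERB"),
          ("a", "DET"), ("harsh", "ADJ"), ("ruling", "NOUN")]

def keep_pos(tagged_tokens, keep):
    """Return the tokens whose coarse POS tag is in `keep`."""
    return [tok for tok, pos in tagged_tokens if pos in keep]

print(keep_pos(tagged, {"NOUN", "PROPN"}))   # candidate input for a topic model
print(keep_pos(tagged, {"ADJ"}))             # candidate input for a sentiment model
```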

In [None]:
# Get noun chunks out of texts:
i = 0
for chunk in nlp(dfs.iloc[0]['opinion_text']).noun_chunks:
    print ('{} - {}'.format(chunk, chunk.label_))
    if i > 10:
        break
    i += 1

In [None]:
# Find all prepositional phrases in a text
def get_pps(text):
    pps = []
    doc = nlp(text)
    for token in doc:
        # a prepositional object whose head is a preposition marks a prepositional phrase
        if token.dep_ == "pobj" and token.head.dep_ == "prep":
            # join the preposition (the head) with every token in the object's subtree
            pp = token.head.text + " " + ' '.join([tok.orth_ for tok in token.subtree])
            pps.append(pp)
    return pps

pps = get_pps(df["opinion_text"][0])
pps[:10]

In [None]:
# NER: Named Entity Recognition
i = 0
for entity in nlp(dfs.iloc[0]['opinion_text']).ents:
    print ('{} - {}'.format(entity, entity.label_))
    if i > 10:
        break
    i += 1


We can also generate features that count the number of entities, nouns, adjectives, etc. to feed into a model, for example for fact-checking.

In [33]:
def count_ner(text):
    doc = nlp(str(text))
    return len(doc.ents)

dfs['ner_count'] = dfs['opinion_text'].apply(count_ner)
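
The same idea extends from entity counts to part-of-speech counts. The sketch below assumes tokens have already been tagged (hand-written tags here; with spaCy one could use `Counter(t.pos_ for t in nlp(text))`), and the function name `pos_count_features` is hypothetical.

```python
from collections import Counter

def pos_count_features(tagged_tokens, tags=("NOUN", "ADJ", "VERB")):
    """Count how often each selected POS tag occurs and return a feature dict."""
    counts = Counter(pos for _, pos in tagged_tokens)
    return {f"n_{tag.lower()}": counts[tag] for tag in tags}

# hand-tagged toy sentence (not model output)
toy_tagged = [("the", "DET"), ("court", "NOUN"), ("issued", "VERB"),
              ("a", "DET"), ("harsh", "ADJ"), ("ruling", "NOUN")]

print(pos_count_features(toy_tagged))
```

Each document then contributes one row of counts, which can be added as columns next to `ner_count`.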

## Parsing

In [None]:
# !pip install benepar
# !pip install svgling
# !pip install fastcoref
from fastcoref import spacy_component # a SOTA coreference resolution package, see https://arxiv.org/pdf/2209.04280.pdf
import benepar
from spacy import displacy

df = pd.read_csv('data/death-penalty-cases.csv')

df.head()

In [None]:
from google.colab import files
uploaded = files.upload()

## Dependency Parsing with spaCy

Let's first look at one example:

In [None]:
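doc = nlp(df['snippet'].iloc[0])  # assumption: `doc` is not defined earlier, so we parse the first snippet
for sent in doc.sents: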
    print("sentence:", sent)
    print("root:", sent.root)
    print([(w, w.dep_) for w in sent.root.children])
    print()

In [None]:
# current sentence
print(sent)
print(sent.root)
print(list(sent.root.children))
# Left children
print(list(sent.root.lefts))
# Right children
print(list(sent.root.rights))
# first token
print(sent[0])
# first token's dependency label (e.g. cc = coordinating conjunction)
print(sent[0].dep_)
print(sent[0].head)
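
To make the root, children, lefts and rights concrete, the sketch below stores a dependency parse as plain tuples and recomputes them by hand. The annotation is hand-written for illustration (it is not parser output), and the root points to itself, which is also how spaCy represents `token.head` for the root.

```python
# hand-annotated parse: (token, index of head, dependency label)
toy_parse = [
    ("Science",  3, "nsubj"),
    ("can",      3, "aux"),
    ("not",      3, "neg"),
    ("solve",    3, "ROOT"),
    ("the",      6, "det"),
    ("ultimate", 6, "amod"),
    ("mystery",  3, "dobj"),
    ("of",       6, "prep"),
    ("nature",   7, "pobj"),
    (".",        3, "punct"),
]

# the root is the token labelled ROOT
root = next(i for i, (_, _, dep) in enumerate(toy_parse) if dep == "ROOT")

# children are the tokens whose head is the root (excluding the root itself)
children = [i for i, (_, head, _) in enumerate(toy_parse) if head == root and i != root]
lefts = [toy_parse[i][0] for i in children if i < root]    # children before the root
rights = [toy_parse[i][0] for i in children if i > root]   # children after the root

print("root:", toy_parse[root][0])
print("lefts:", lefts)
print("rights:", rights)
```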

## Constituency Parsing with spaCy

In [None]:
import nltk
benepar.download('benepar_en3')
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('benepar', config={'model': 'benepar_en3'})
doc = nlp('Science cannot solve the ultimate mystery of nature.')
sent = list(doc.sents)[0]
print(sent._.parse_string)
print(sent._.labels)
print(list(sent._.children)[0])
nltk.Tree.fromstring(sent._.parse_string)

In the following, we show how to extract NSUBJ-verb pairs from text.

In [17]:
df = df.sample(n=200)
df["processed"] = df["snippet"].apply(lambda x: nlp(x)) # slow, so we sample 200 data points

def extract_subject_verb_pairs(sent):
    subjs = [w for w in sent if w.dep_ == "nsubj"]
    pairs = [(w.lemma_.lower(), w.head.lemma_.lower()) for w in subjs]
    return pairs

df["subj-verb-pairs"] = df["processed"].apply(lambda x: extract_subject_verb_pairs(x))

In [None]:
# most common pairs
counter = Counter()
for item in df["subj-verb-pairs"]:
    counter.update(item)

for pair, counts in counter.most_common(n=25):
    print (pair, counts)

In [None]:
# verbs used with defendant

for (subject, verb), counts in counter.most_common():
    if subject == "defendant" and counts > 1:
        print (subject, verb, counts)

In [None]:
# verbs used with jury

for (subject, verb), counts in counter.most_common():
    if subject == "jury" and counts > 1:
        print (subject, verb, counts)