<a href="https://colab.research.google.com/github/alex-smith-uwec/NLP_Spring2025/blob/main/Basic_Text_Normalization_and_Counting_Part_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[What are stemming and lemmatization (IBM)](https://www.ibm.com/think/topics/stemming-lemmatization)

We will turn from NLTK to spaCy.


1.   [spaCy-Wiki](https://en.wikipedia.org/wiki/SpaCy)
2.   [Main spaCy website](https://spacy.io/)

We will use spaCy to lemmatize and remove stop words from a State of the Union speech by George Washington.

The speech comes from a dataset on Hugging Face.

In [1]:
!pip install spacy -q




[Trained spaCy pipelines for English](https://spacy.io/models/en#en_core_web_sm)

We will use **en_core_web_sm**, spaCy's small English pipeline.

In [2]:
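import spacy
# If the model is not installed, spacy.load raises OSError; download it first with:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")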

In [3]:
# Function to lemmatize a sentence without removing stop words
def lemmatize_sentence(sentence):
    doc = nlp(sentence)
    return [token.lemma_ for token in doc]

# Function to lemmatize and remove stop words
def lemmatize_and_remove_stopwords(sentence):
    doc = nlp(sentence)
    # Filter tokens: Exclude stop words and punctuation, and return lemmas
    filtered_lemmas = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return filtered_lemmas

# Function to display POS tags for a given text
def display_pos_tags(text):
    doc = nlp(text)
    for token in doc:
        print(f"Token: {token.text}, POS: {token.pos_}, Detailed POS: {token.tag_}")

# Function to tokenize a passage into sentences
def tokenize_into_sentences(passage):
    doc = nlp(passage)
    return [sent.text for sent in doc.sents]

# Function to tokenize a passage into sentences, then lemmatize each and remove stop words
def process_all_sentences(passage):
    # Tokenize into sentences
    sentences = tokenize_into_sentences(passage)
    # Apply lemmatize_and_remove_stopwords to each sentence
    processed_sentences = [lemmatize_and_remove_stopwords(sentence) for sentence in sentences]
    return processed_sentences
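
`process_all_sentences` runs the pipeline once to split the passage into sentences and then again on each sentence. A minimal sketch of a single-pass variant follows. The name `process_passage_single_pass` and the explicit `nlp` parameter are assumptions made for illustration and are not part of the original notebook. Its lemmas may differ slightly from the two-pass version, because each token is analyzed in the context of the whole passage rather than of one isolated sentence.

```python
def process_passage_single_pass(passage, nlp):
    """Parse the passage once and return one list of filtered lemmas per sentence.

    `nlp` is assumed to be a loaded spaCy pipeline, such as the one created
    earlier with spacy.load("en_core_web_sm").
    """
    doc = nlp(passage)
    return [
        # Same filter as lemmatize_and_remove_stopwords: drop stop words and punctuation
        [token.lemma_ for token in sent if not token.is_stop and not token.is_punct]
        for sent in doc.sents
    ]
```

Calling `process_passage_single_pass(passage, nlp)` would take the place of `process_all_sentences(passage)`.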


In [4]:
# Example sentences
example_sentences = [
    "The cats are running quickly.",
    "I was enjoying the beautiful sunsets.",
    "He studies programming and loves solving problems."
]

In [5]:
display_pos_tags(text=example_sentences[0])

Token: The, POS: DET, Detailed POS: DT
Token: cats, POS: NOUN, Detailed POS: NNS
Token: are, POS: AUX, Detailed POS: VBP
Token: running, POS: VERB, Detailed POS: VBG
Token: quickly, POS: ADV, Detailed POS: RB
Token: ., POS: PUNCT, Detailed POS: .


In [6]:
lemmatize_sentence(sentence=example_sentences[0])


['the', 'cat', 'be', 'run', 'quickly', '.']

In [7]:
lemmatize_and_remove_stopwords(sentence=example_sentences[0])

['cat', 'run', 'quickly']

In [8]:
!pip install datasets -q

In [9]:
from datasets import load_dataset
import pandas as pd

Dataset from Hugging Face:

[State of the Union speeches](https://huggingface.co/datasets/jsulz/state-of-the-union-addresses)

In [10]:
# Load the dataset from Hugging Face
dataset = load_dataset("jsulz/state-of-the-union-addresses")

# Access the row with index k
k = 16  # Should correspond to Washington; selected because it is short
row_k = dataset["train"][k]




In [11]:
speech = row_k['speech_html']
row_k['potus']  # Confirm which president gave this speech

'George Washington'

In [12]:
sentences=tokenize_into_sentences(speech)

In [13]:
print(sentences[1])

Numerous as are the providential blessings which demand our grateful acknowledgments, the abundance with which another year has again rewarded the industry of the husbandman is too important to escape recollection.


In [14]:
result = lemmatize_and_remove_stopwords(sentence=sentences[1])

In [15]:
result = [lemmatize_and_remove_stopwords(sentence) for sentence in sentences]
flattened_result = [word for sublist in result for word in sublist]  # Flatten the list of lists

print(f"length of flattened_result (by spaCy) is {len(flattened_result)}\n and first few entries are {flattened_result[0:10]}")


length of flattened_result (by spaCy) is 865
 and first few entries are ['meet', 'present', 'occasion', 'feeling', 'naturally', 'inspire', 'strong', 'impression', 'prosperous', 'situation']


In [16]:
source_lemmatized=row_k['lemmatized']
print(f"length of source_lemmatized is {len(source_lemmatized)}\n and first few entries are {source_lemmatized[0:10]}")

length of source_lemmatized is 956
 and first few entries are ['meet', 'upon', 'present', 'occasion', 'feeling', 'naturally', 'inspire', 'strong', 'impression', 'prosperous']
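
The two lists differ in length (865 versus 956), and the previews already show one cause: `upon` survives in the dataset's `lemmatized` column but not in the spaCy output, most likely because spaCy's default stop-word list includes it. A rough way to list such differences is a multiset subtraction with `collections.Counter`. The sketch below uses only the ten-entry previews printed above as sample data, and the helper name `only_in_first` is hypothetical.

```python
from collections import Counter

def only_in_first(first, second):
    """Return the items (with counts) that occur more often in `first` than in `second`."""
    return Counter(first) - Counter(second)

# Sample data: the ten-entry previews printed above, not the full lists
spacy_preview = ['meet', 'present', 'occasion', 'feeling', 'naturally',
                 'inspire', 'strong', 'impression', 'prosperous', 'situation']
source_preview = ['meet', 'upon', 'present', 'occasion', 'feeling',
                  'naturally', 'inspire', 'strong', 'impression', 'prosperous']

extra_in_source = only_in_first(source_preview, spacy_preview)
extra_in_spacy = only_in_first(spacy_preview, source_preview)
print(extra_in_source)
print(extra_in_spacy)
```

On the previews, `situation` shows up as spaCy-only simply because both lists are cut at ten entries. Passing the full `source_lemmatized` and `flattened_result` lists instead would avoid that truncation effect.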


In [18]:
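# Truncate both lists to the shorter length so they fit side by side in one DataFrame.
# Rows are paired by position only, so the columns drift apart after the first mismatch.
min_len = min(len(flattened_result), len(source_lemmatized))
flattened_result = flattened_result[:min_len]
source_lemmatized = source_lemmatized[:min_len]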

df = pd.DataFrame({
    'flattened_result': flattened_result,
    'source_lemmatized': source_lemmatized
})
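
Pairing the lists by position means every row after the first mismatch is shifted, so most rows end up comparing unrelated words. One alternative, sketched here with the standard library's `difflib.SequenceMatcher`, is to align the two sequences and report only the stretches where they differ. The function name `lemma_differences` is hypothetical, and the sample data is the pair of ten-entry previews printed earlier.

```python
from difflib import SequenceMatcher

def lemma_differences(a, b):
    """Align two lemma lists and return (tag, a_items, b_items) for each differing stretch."""
    # autojunk=False turns off the heuristic that treats very frequent items as junk on long lists
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Sample data: the ten-entry previews printed earlier, not the full lists
spacy_preview = ['meet', 'present', 'occasion', 'feeling', 'naturally',
                 'inspire', 'strong', 'impression', 'prosperous', 'situation']
source_preview = ['meet', 'upon', 'present', 'occasion', 'feeling',
                  'naturally', 'inspire', 'strong', 'impression', 'prosperous']

for tag, a_items, b_items in lemma_differences(spacy_preview, source_preview):
    print(tag, a_items, b_items)
```

On the preview data this reports `upon` as an insertion on the dataset side and `situation` as a deletion on the spaCy side, the latter only because the previews stop at ten entries.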


In [19]:
df

| | flattened_result | source_lemmatized |
|---|---|---|
| 0 | meet | meet |
| 1 | present | upon |
| 2 | occasion | present |
| 3 | feeling | occasion |
| 4 | naturally | feeling |
| ... | ... | ... |
| 860 | ought | material |
| 861 | lose | utility |
| 862 | avail | disorder |
| 863 | public | exist |
