In [1]:
corpus = [
    "The leaves on the tree have fallen.",
    "She is running towards the finish line.",
    "The cars are parked in the parking lot.",
    "A quick brown fox jumps over the lazy dog.",
    "He was walking in the park and saw a beautiful bird."
]


In [2]:
import nltk
from nltk.stem import PorterStemmer


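# Tokenizer models used by nltk.word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')

# Rule-based suffix stripper; it needs no downloaded data
ps = PorterStemmer()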

for sentence in corpus:
    words = nltk.word_tokenize(sentence)
    # PorterStemmer.stem lowercases each token by default before stripping suffixes
    stemmed_words = [ps.stem(word) for word in words]
    print("Original:", sentence)
    print("Stemmed:", " ".join(stemmed_words))
    print()


Original: The leaves on the tree have fallen.
Stemmed: the leav on the tree have fallen .

Original: She is running towards the finish line.
Stemmed: she is run toward the finish line .

Original: The cars are parked in the parking lot.
Stemmed: the car are park in the park lot .

Original: A quick brown fox jumps over the lazy dog.
Stemmed: a quick brown fox jump over the lazi dog .

Original: He was walking in the park and saw a beautiful bird.
Stemmed: he wa walk in the park and saw a beauti bird .



[nltk_data] Downloading package punkt to C:\Users\Deep
[nltk_data]     Salunkhe\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The PorterStemmer is a widely used rule-based stemming algorithm. It strips suffixes according to a fixed sequence of rules, without consulting a dictionary, so the result is often a truncated stem rather than a real word.
Example: running is stemmed to run, but leaves becomes leav, lazy becomes lazi, beautiful becomes beauti, and was becomes wa.
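To check individual words rather than whole sentences, a minimal sketch like the following can help. It assumes only that NLTK is installed (the stemmer itself needs no downloaded data); the word list and the `stems` name are illustrative choices, not part of the original notebook.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Illustrative word list (an assumption, chosen to mix clean and truncated stems)
words = ["running", "leaves", "lazy", "beautiful", "was", "better"]

# Map each word to its Porter stem
stems = {word: ps.stem(word) for word in words}

for word, stem in stems.items():
    # Flag the words that the stemmer actually changed
    marker = "" if stem == word else "  (changed)"
    print(f"{word:>10} -> {stem}{marker}")
```

Stems such as leav and lazi are not dictionary words. That is acceptable when the goal is to match variants of a word, but awkward when the output is shown to readers.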

In [3]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

for sentence in corpus:
    words = nltk.word_tokenize(sentence)
    # Every token is lemmatized as a verb here, so nouns such as "cars" are left unchanged
    lemmatized_words = [lemmatizer.lemmatize(word, pos=wordnet.VERB) for word in words]
    print("Original:", sentence)
    print("Lemmatized:", " ".join(lemmatized_words))
    print()


[nltk_data] Downloading package wordnet to C:\Users\Deep
[nltk_data]     Salunkhe\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\Deep
[nltk_data]     Salunkhe\AppData\Roaming\nltk_data...


Original: The leaves on the tree have fallen.
Lemmatized: The leave on the tree have fall .

Original: She is running towards the finish line.
Lemmatized: She be run towards the finish line .

Original: The cars are parked in the parking lot.
Lemmatized: The cars be park in the park lot .

Original: A quick brown fox jumps over the lazy dog.
Lemmatized: A quick brown fox jump over the lazy dog .

Original: He was walking in the park and saw a beautiful bird.
Lemmatized: He be walk in the park and saw a beautiful bird .



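The WordNetLemmatizer looks each word up in the WordNet lexical database to find its base form (lemma) for a given Part of Speech (POS). The POS is an argument you supply; lemmatize() treats the word as a noun when none is given.
This approach is generally more accurate than stemming because it returns actual words, but only when the POS is correct. The cell above passes pos=wordnet.VERB for every token, which is why leaves becomes leave and cars stays cars.
Example: running is lemmatized to run as a verb, while better is unchanged as a verb but becomes good when lemmatized with pos=wordnet.ADJ.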
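Because the result depends on the POS that is passed in, one common approach is to tag each token first and translate the tagger's Penn Treebank tags into WordNet's POS codes. The sketch below is one way to do that, not the notebook's own method. `penn_to_wordnet` and `lemmatize_sentence` are hypothetical helper names, and the sketch assumes the WordNet, tokenizer, and POS-tagger resources have already been fetched with nltk.download (the tagger's resource name varies between NLTK versions).

```python
import nltk
from nltk.stem import WordNetLemmatizer


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code.

    The single letters are the values of wordnet.ADJ, wordnet.VERB,
    wordnet.ADV and wordnet.NOUN.
    """
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback


def lemmatize_sentence(sentence):
    """Tokenize, POS-tag, then lemmatize each token with its own POS."""
    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged]
```

Calling `lemmatize_sentence("The cars are parked in the parking lot.")` should then lemmatize cars as a noun rather than as a verb; the exact output depends on the tagger's predictions.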

In [8]:

import spacy
spacy.cli.download("en_core_web_sm")

# Load the small English pipeline: tokenizer, POS tagger, parser, lemmatizer, and NER
# (the "sm" model ships without static word vectors)
nlp = spacy.load("en_core_web_sm")

for sentence in corpus:
    doc = nlp(sentence)
    lemmatized_words = [token.lemma_ for token in doc]
    print("Original:", sentence)
    print("Lemmatized (spaCy):", " ".join(lemmatized_words))
    print()


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Original: The leaves on the tree have fallen.
Lemmatized (spaCy): the leave on the tree have fall .

Original: She is running towards the finish line.
Lemmatized (spaCy): she be run towards the finish line .

Original: The cars are parked in the parking lot.
Lemmatized (spaCy): the car be park in the parking lot .

Original: A quick brown fox jumps over the lazy dog.
Lemmatized (spaCy): a quick brown fox jump over the lazy dog .

Original: He was walking in the park and saw a beautiful bird.
Lemmatized (spaCy): he be walk in the park and see a beautiful bird .



In [6]:
pip install spacy


spaCy provides lemmatization out of the box through its pre-trained pipelines. Unlike the WordNetLemmatizer call above, which was given the same POS for every token, spaCy tags each token in the context of its sentence and uses that tag when choosing the lemma.
Example: in the output above, was walking is lemmatized to be walk, saw becomes see, and parking in parking lot stays parking, presumably because it is tagged as a noun there.
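To see which POS tag accompanies each lemma, it can help to print the two side by side. The sketch below assumes the `en_core_web_sm` model is installed and skips the demo if it is not; `lemma_rows` and `demo_nlp` are hypothetical names introduced for illustration.

```python
def lemma_rows(doc):
    """Return (text, coarse POS tag, lemma) for each token in a spaCy Doc."""
    return [(token.text, token.pos_, token.lemma_) for token in doc]


try:
    import spacy
    demo_nlp = spacy.load("en_core_web_sm")
except (ImportError, OSError):
    # spaCy or the model is not installed; skip the demo
    demo_nlp = None

if demo_nlp is not None:
    doc = demo_nlp("He was walking in the park and saw a beautiful bird.")
    for text, pos, lemma in lemma_rows(doc):
        print(f"{text:<10} {pos:<6} {lemma}")
```

Comparing the POS column with the lemma column makes it easier to tell whether a surprising lemma comes from the tagger or from the lemmatizer's rules.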

In [9]:
# Combining spaCy with NLTK for stemming
for sentence in corpus:
    doc = nlp(sentence)
    stemmed_words = [ps.stem(token.text) for token in doc]
    print("Original:", sentence)
    print("Stemmed (spaCy+Porter):", " ".join(stemmed_words))
    print()


Original: The leaves on the tree have fallen.
Stemmed (spaCy+Porter): the leav on the tree have fallen .

Original: She is running towards the finish line.
Stemmed (spaCy+Porter): she is run toward the finish line .

Original: The cars are parked in the parking lot.
Stemmed (spaCy+Porter): the car are park in the park lot .

Original: A quick brown fox jumps over the lazy dog.
Stemmed (spaCy+Porter): a quick brown fox jump over the lazi dog .

Original: He was walking in the park and saw a beautiful bird.
Stemmed (spaCy+Porter): he wa walk in the park and saw a beauti bird .



NLTK’s PorterStemmer is a rule-based stemmer that often produces stems that are not real words (e.g., leaves -> leav, lazy -> lazi, was -> wa). It is fast and needs no dictionary, but it is less accurate than lemmatization.

NLTK’s WordNetLemmatizer uses WordNet's lexical database for more accurate lemmatization. However, it needs the correct POS for each word to work well, so you either run a POS tagger first or accept errors like those in the verb-only example above.

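spaCy’s lemmatizer works from the POS tags its pipeline predicts in context, making it generally more accurate than calling NLTK’s WordNetLemmatizer with a fixed POS. It is not perfect, though: in the output above, the noun leaves was still lemmatized to leave. It’s better suited for tasks requiring higher accuracy in natural language processing.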

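spaCy + NLTK’s PorterStemmer uses spaCy only for tokenization and hands each token to the Porter stemmer, so the output here is identical to the NLTK-only stemming above. It allows for stemming within the spaCy framework, though it's less common than using spaCy's own lemmatization.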

For most modern NLP applications, lemmatization (especially using spaCy) is preferred over stemming due to its accuracy and context-awareness. However, stemming might still be useful in specific scenarios where speed is crucial and slight inaccuracies can be tolerated.
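As one example of a setting where stemming's rough output is good enough, a search index only needs the variants of a word to collapse to the same key; the key itself never has to be readable. The sketch below groups words by their Porter stem. The vocabulary and the `groups` name are illustrative assumptions.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Illustrative vocabulary (an assumption, not taken from a real index)
vocabulary = ["run", "running", "runs", "park", "parked", "parking", "jumps"]

# Group surface forms under their shared stem
groups = defaultdict(list)
for word in vocabulary:
    groups[ps.stem(word)].append(word)

for stem, forms in groups.items():
    print(f"{stem}: {forms}")
```

Note that parking lands in the same group as the verb park even when it is the noun in parking lot, which is the kind of slight inaccuracy that has to be tolerated in exchange for speed.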