# Lexical Processing in NLP

**Goal:** Learn the five fundamental steps of lexical processing by running one sentence through an NLTK pipeline.

**Steps covered:**
1. Tokenization
2. Lowercasing
3. Stop Word Removal
4. Stemming
5. Lemmatization
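
Before turning to NLTK, the five steps can be sketched in plain Python to show what each one does to the data. This is a toy illustration only: the helper names (`toy_tokenize`, `toy_stem`, `toy_lemmatize`), the four-word stop list, and the two-entry lemma table are all invented for this sketch and are far cruder than the NLTK components used in the cell below.

```python
import re

# Toy resources invented for this sketch; real pipelines use far larger lists.
TOY_STOP_WORDS = {"i", "am", "to", "the"}
TOY_LEMMAS = {"running": "run", "stores": "store"}


def toy_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def toy_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ly", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a tiny dictionary; unknown words pass through."""
    return TOY_LEMMAS.get(word, word)


text = "I am running to the stores quickly 123 !"
tokens = toy_tokenize(text)                          # 1. tokenization
lowered = [t.lower() for t in tokens]                # 2. lowercasing
kept = [t for t in lowered
        if t.isalpha() and t not in TOY_STOP_WORDS]  # 3. stop word removal
stems = [toy_stem(t) for t in kept]                  # 4. stemming
lemmas = [toy_lemmatize(t) for t in kept]            # 5. lemmatization

print(kept)    # ['running', 'stores', 'quickly']
print(stems)   # ['runn', 'store', 'quick']
print(lemmas)  # ['run', 'store', 'quickly']
```

The naive suffix stripper yields `runn`, which is exactly the kind of case a rule-based stemmer such as Porter's handles with extra rules, while the dictionary lookup can only return forms it already knows.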

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

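# Download the required NLTK data (skipped automatically when already up to date)
# Note: newer NLTK releases may also need nltk.download('punkt_tab') for word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')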


# Sample sentence with mixed case, a number, and punctuation
text = "I am running to the stores quickly 123 !"
print("Original Text:", text)

# Step 1: Tokenization
tokens = word_tokenize(text)
print("\n1. Tokenization:")
print(tokens)

# Step 2: Lowercasing
tokens_lower = [word.lower() for word in tokens]
print("\n2. Lowercasing:")
print(tokens_lower)


# Step 3: Remove stop words - Filter out common words
stop_words = set(stopwords.words('english'))

# Keep only alphabetic words that are not stop words
filtered_tokens = [word for word in tokens_lower 
                   if word not in stop_words and word.isalpha()]

print("\n3. Stop Word Removal:")
print("Removed:", [w for w in tokens_lower if w in stop_words or not w.isalpha()])
print("Kept:", filtered_tokens)

# Step 4: Stemming - strip suffixes by rule; the result need not be a real word
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

print("\n4. Stemming:")
for original, stemmed in zip(filtered_tokens, stemmed_tokens):
    print(f"  {original:12} → {stemmed}")

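# Step 5: Lemmatization - map each word to its dictionary form
# pos='v' treats every token as a verb, a simplification that suits this sentence;
# words WordNet does not list as verbs (such as "quickly") come back unchanged.
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word, pos='v') for word in filtered_tokens]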

print("\n5. Lemmatization:")
for original, lemmatized in zip(filtered_tokens, lemmatized_tokens):
    print(f"  {original:12} → {lemmatized}")



# Side-by-side summary of every stage
print("\n" + "="*50)
print("COMPLETE PIPELINE COMPARISON")
print("="*50)
print(f"Original:          {text}")
print(f"Tokenized:         {tokens}")
print(f"Lowercased:        {tokens_lower}")
print(f"After Stop Words:  {filtered_tokens}")
print(f"Stemmed:           {stemmed_tokens}")
print(f"Lemmatized:        {lemmatized_tokens}")

Original Text: I am running to the stores quickly 123 !

1. Tokenization:
['I', 'am', 'running', 'to', 'the', 'stores', 'quickly', '123', '!']

2. Lowercasing:
['i', 'am', 'running', 'to', 'the', 'stores', 'quickly', '123', '!']

3. Stop Word Removal:
Removed: ['i', 'am', 'to', 'the', '123', '!']
Kept: ['running', 'stores', 'quickly']

4. Stemming:
  running      → run
  stores       → store
  quickly      → quickli

5. Lemmatization:
  running      → run
  stores       → store
  quickly      → quickly

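==================================================
COMPLETE PIPELINE COMPARISON
==================================================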
Original:          I am running to the stores quickly 123 !
Tokenized:         ['I', 'am', 'running', 'to', 'the', 'stores', 'quickly', '123', '!']
Lowercased:        ['i', 'am', 'running', 'to', 'the', 'stores', 'quickly', '123', '!']
After Stop Words:  ['running', 'stores', 'quickly']
Stemmed:           ['run', 'store', 'quickli']
Lemmatized:        ['run', 'store', 'quickly']
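
The non-word stem `quickli` in the output above comes from a rule in Porter's algorithm that rewrites a final `y` as `i` (step 1c in the original description, applied when the rest of the word contains a vowel). The sketch below implements only that one rule, following the original definition; NLTK's `PorterStemmer` may apply a slightly different condition in its default mode, and the function names here are hypothetical.

```python
def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # At position 0 there is no preceding letter, so y counts as a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def porter_step_1c(word):
    """Rewrite a final 'y' as 'i' when the rest of the word contains a vowel."""
    if word.endswith("y"):
        stem = word[:-1]
        if any(not is_consonant(stem, i) for i in range(len(stem))):
            return stem + "i"
    return word


for w in ["quickly", "happy", "sky"]:
    print(w, "->", porter_step_1c(w))
# quickly -> quickli
# happy -> happi
# sky -> sky
```

The lemmatizer, by contrast, only returns forms it finds in WordNet, which is why `quickly` passes through it unchanged.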


[nltk_data] Downloading package punkt to /home/abinas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/abinas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/abinas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/abinas/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
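
The lemmatization step in this notebook passes `pos='v'` for every token, which works for this sentence but would mislabel nouns and adjectives in general text. One common approach is to tag each token first and translate the tag into the single-letter part of speech that `WordNetLemmatizer.lemmatize` accepts. The sketch below shows only that translation; `penn_to_wordnet` is a hypothetical helper, and the tags are written by hand rather than produced by a tagger such as `nltk.pos_tag`, which would need its own model download.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the POS letter used by WordNet."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


# Hand-written (token, tag) pairs standing in for tagger output.
tagged = [("running", "VBG"), ("stores", "NNS"), ("quickly", "RB")]
for token, tag in tagged:
    print(token, tag, penn_to_wordnet(tag))
# running VBG v
# stores NNS n
# quickly RB r
```

In the pipeline above, each token would then be lemmatized with `lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))` instead of a fixed `pos='v'`.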
