<a href="https://colab.research.google.com/github/appliedcode/mthree-c422/blob/main/Exercises/day-8/NLP-Concepts-Token/Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exploring Core NLP Concepts
Welcome to this hands-on NLP Colab lab! You will work through key tasks (tokenization, POS tagging, stemming, stop-word filtering, vocabulary matching, lemmatization, dependency parsing, NER, and intent classification) using Python libraries such as NLTK and spaCy. Run each demonstration cell, then complete the exercise that follows it.

In [10]:
# Install required packages
!pip install --upgrade pip setuptools wheel -q
# scikit-learn is the PyPI package name (it is imported as sklearn)
!pip install --quiet nltk spacy textblob scikit-learn

# Download NLTK data and spaCy model
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger_eng')

!python -m spacy download en_core_web_sm -q


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [11]:
# 1. Tokenization
# Goal: Split text into sentences and into tokens (words and punctuation).
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Natural Language Processing enables machines to understand human language."
print("Sentences:", sent_tokenize(text))
print("Tokens:", word_tokenize(text))

Sentences: ['Natural Language Processing enables machines to understand human language.']
Tokens: ['Natural', 'Language', 'Processing', 'enables', 'machines', 'to', 'understand', 'human', 'language', '.']
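
For intuition about what a word tokenizer does, here is a minimal regex-based sketch. It is only a rough approximation and not how `word_tokenize` works internally; the helper name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Hypothetical helper for illustration only; NLTK's word_tokenize applies
    many more rules (contractions, abbreviations, quotes, and so on).
    """
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

text = "Natural Language Processing enables machines to understand human language."
print(simple_tokenize(text))
print(simple_tokenize("Don't stop!"))
```

On the example sentence this sketch yields the same tokens as the output above. The two approaches diverge on contractions: the sketch splits "Don't" into `Don`, `'`, `t`, whereas `word_tokenize` produces `Do` and `n't`.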


In [12]:
# Exercise 1.1: Tokenize the following paragraph into words and sentences:

paragraph = "Machine learning models power many NLP tasks. They learn patterns from data!"


In [13]:
# 2. Part-of-Speech Tagging
# Goal: Assign grammatical tags to each token.
import nltk
tokens = word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)


[('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('enables', 'VBZ'), ('machines', 'NNS'), ('to', 'TO'), ('understand', 'VB'), ('human', 'JJ'), ('language', 'NN'), ('.', '.')]
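
The tags above come from the Penn Treebank tagset, where related tags share a prefix (`NN`, `NNS`, `NNP` are all nouns; `VB`, `VBZ` are verbs). A small sketch, using the tagged output above as hard-coded input, shows one way to group words by that coarse prefix. The function name `group_by_coarse_tag` is hypothetical.

```python
from collections import defaultdict

# Tagged tokens copied from the output above.
pos_tags = [('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'),
            ('enables', 'VBZ'), ('machines', 'NNS'), ('to', 'TO'),
            ('understand', 'VB'), ('human', 'JJ'), ('language', 'NN'), ('.', '.')]

def group_by_coarse_tag(tagged):
    """Group words by the first two characters of their Penn Treebank tag."""
    groups = defaultdict(list)
    for word, tag in tagged:
        groups[tag[:2]].append(word)
    return dict(groups)

groups = group_by_coarse_tag(pos_tags)
print("Nouns:", groups["NN"])
print("Verbs:", groups["VB"])
print("Adjectives:", groups["JJ"])
```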


In [14]:
# Exercise 2.1: POS-tag the tokens from Exercise 1.1.

In [15]:
# 3. Stemming
# Goal: Strip suffixes to reduce words to a stem, which may not be a dictionary word (e.g. 'easili').
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "runs", "ran", "easily", "fairly"]
print({w: stemmer.stem(w) for w in words})


{'running': 'run', 'runs': 'run', 'ran': 'ran', 'easily': 'easili', 'fairly': 'fairli'}
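
To see why a rule-based stemmer needs more than plain suffix removal, compare the Porter output above with a deliberately naive sketch. This is not the Porter algorithm; `naive_stem` and its suffix list are hypothetical and chosen only for this illustration.

```python
def naive_stem(word, suffixes=("ing", "ly", "s")):
    """Remove the first matching suffix, keeping at least three characters.

    Toy example only: the Porter stemmer applies ordered rule phases with
    extra conditions (for instance, undoing doubled consonants).
    """
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["running", "runs", "ran", "easily", "fairly"]
print({w: naive_stem(w) for w in words})
# {'running': 'runn', 'runs': 'run', 'ran': 'ran', 'easily': 'easi', 'fairly': 'fair'}
```

The naive version leaves `runn` where Porter gives `run`, and neither approach maps the irregular form `ran` to `run`, because both look only at suffixes.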


In [16]:
# Exercise 3.1: Stem the tokens from your Exercise 1.1.



In [17]:
# 4. Stop-Word Filtering
# Goal: Remove common, low-value words.
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text.lower())
filtered = [w for w in tokens if w.isalpha() and w not in stop_words]
print(filtered)

['natural', 'language', 'processing', 'enables', 'machines', 'understand', 'human', 'language']
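
Stop-word lists are often extended with domain-specific terms. The sketch below uses a tiny hand-written stop list (an assumption for illustration; NLTK's English list is much longer) and adds a hypothetical domain stop word to show the effect.

```python
# Hypothetical, hand-written stop list for illustration only.
base_stop_words = {"to", "the", "a", "of", "and"}
# Assumption: in an NLP-focused corpus, "language" may be too frequent to be informative.
domain_stop_words = {"language"}

# Lowercased tokens of the example sentence.
tokens = ['natural', 'language', 'processing', 'enables', 'machines',
          'to', 'understand', 'human', 'language', '.']

def remove_stop_words(tokens, stop_words):
    """Keep alphabetic tokens that are not in the given stop-word set."""
    return [t for t in tokens if t.isalpha() and t not in stop_words]

print(remove_stop_words(tokens, base_stop_words))
print(remove_stop_words(tokens, base_stop_words | domain_stop_words))
```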


In [18]:
# Exercise 4.1: Filter stop words from your Exercise 1.1 tokens.



In [19]:
# 5. Vocabulary Matching
# Goal: Check tokens against a predefined vocabulary.

vocab = {"natural", "language", "machine", "data", "processing"}
tokens = [w.lower() for w in word_tokenize(text)]
in_vocab = [w for w in tokens if w.isalpha() and w in vocab]
print("In-vocab tokens:", in_vocab)
print("OOV tokens:", [w for w in tokens if w.isalpha() and w not in vocab])


In-vocab tokens: ['natural', 'language', 'processing', 'language']
OOV tokens: ['enables', 'machines', 'to', 'understand', 'human']
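
A common follow-up to vocabulary matching is to replace out-of-vocabulary (OOV) tokens with a placeholder and report the OOV rate. The sketch below assumes a placeholder symbol `<unk>` and a helper `replace_oov`; both names are hypothetical.

```python
vocab = {"natural", "language", "machine", "data", "processing"}
# Lowercased alphabetic tokens of the example sentence.
tokens = ['natural', 'language', 'processing', 'enables', 'machines',
          'to', 'understand', 'human', 'language']

UNK = "<unk>"  # hypothetical placeholder for unknown tokens

def replace_oov(tokens, vocab, unk=UNK):
    """Return a copy of tokens with every out-of-vocabulary token replaced by unk."""
    return [t if t in vocab else unk for t in tokens]

mapped = replace_oov(tokens, vocab)
oov_rate = mapped.count(UNK) / len(mapped)
print(mapped)
print(f"OOV rate: {oov_rate:.2f}")  # OOV rate: 0.56
```

Note that `machines` counts as OOV even though `machine` is in the vocabulary. Exact string matching is sensitive to inflection, which is one reason to normalize tokens before the lookup.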


In [20]:
# Exercise 5.1: Define your own small vocabulary and classify tokens from Exercise 1.1 into in-vocab vs. out-of-vocab.

In [21]:
# 6. Lemmatization
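# Goal: Convert words to their dictionary form (lemma).
# lemmatize() treats every word as a noun unless pos is given, so 'running' and 'better' come back unchanged below.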
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ["running", "better", "wolves"]
print({w: lemmatizer.lemmatize(w) for w in words})
# Pass pos='v' to lemmatize a word as a verb:
print("run (verb):", lemmatizer.lemmatize("running", pos='v'))


{'running': 'running', 'better': 'better', 'wolves': 'wolf'}
run (verb): run
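
Because the lemmatizer needs a part of speech to do well, a frequent pattern is to convert POS tags into the single-letter codes WordNet uses (`'n'`, `'v'`, `'a'`, `'r'`). The mapping below is a minimal sketch; the function name `penn_to_wordnet` is hypothetical, and falling back to noun is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # assumption: treat everything else as a noun

# A few (word, tag) pairs taken from the POS tagging output earlier in this lab.
tagged = [("enables", "VBZ"), ("machines", "NNS"), ("human", "JJ")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
# [('enables', 'v'), ('machines', 'n'), ('human', 'a')]
```

With NLTK loaded, the result can be passed straight to the lemmatizer, for example `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.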


In [22]:
# Exercise 6.1: Lemmatize tokens from Exercise 1.1 (both default and verb POS).



In [23]:
# 7. Dependency Parsing
# Goal: Identify syntactic relationships between tokens.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
for token in doc:
    print(token.text, token.dep_, token.head.text)


Natural compound Language
Language compound Processing
Processing nsubj enables
enables ROOT enables
machines nsubj understand
to aux understand
understand ccomp enables
human amod language
language dobj understand
. punct enables
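
Each output line above reads as "token, relation, head", so the parse forms a tree rooted at `enables`. The sketch below rebuilds that tree from the printed triples and lists each head's dependents. It keys on token text for brevity, which is an assumption that only works when no word repeats with the same spelling; with spaCy objects you would key on `token.i` instead.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the output above.
triples = [
    ("Natural", "compound", "Language"),
    ("Language", "compound", "Processing"),
    ("Processing", "nsubj", "enables"),
    ("enables", "ROOT", "enables"),
    ("machines", "nsubj", "understand"),
    ("to", "aux", "understand"),
    ("understand", "ccomp", "enables"),
    ("human", "amod", "language"),
    ("language", "dobj", "understand"),
    (".", "punct", "enables"),
]

def children_by_head(triples):
    """Map each head word to the list of (dependent, label) pairs attached to it."""
    children = defaultdict(list)
    for token, dep, head in triples:
        if dep != "ROOT":  # in spaCy the root token is its own head, so skip it
            children[head].append((token, dep))
    return dict(children)

tree = children_by_head(triples)
for head, deps in tree.items():
    print(head, "<-", deps)
```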


In [24]:
# Exercise 7.1: Parse the sentence “They learn patterns from data” and list each token’s dependency label and head.

In [25]:
# 8. Named-Entity Recognition (NER)
# Goal: Extract real-world entities from text.
doc = nlp("Google was founded in 1998 by Larry Page and Sergey Brin in California.")
for ent in doc.ents:
    print(ent.text, ent.label_)

Google ORG
1998 DATE
Larry Page PERSON
Sergey Brin PERSON
California GPE
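
One simple use of NER output is masking entities, for example to anonymize text. The sketch below hard-codes the entities from the output above and uses plain string replacement, which is a simplification; with spaCy you would normally use the character offsets `ent.start_char` and `ent.end_char`. The helper name `mask_entities` is hypothetical.

```python
sentence = "Google was founded in 1998 by Larry Page and Sergey Brin in California."

# (entity text, label) pairs copied from the output above.
entities = [("Google", "ORG"), ("1998", "DATE"), ("Larry Page", "PERSON"),
            ("Sergey Brin", "PERSON"), ("California", "GPE")]

def mask_entities(text, entities):
    """Replace each entity string with its label in square brackets."""
    for ent_text, label in entities:
        text = text.replace(ent_text, f"[{label}]")
    return text

print(mask_entities(sentence, entities))
# [ORG] was founded in [DATE] by [PERSON] and [PERSON] in [GPE].
```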


In [26]:
# Exercise 8.1: Run NER on the example sentence above, then on at least two more sentences of your own.

