
This project demonstrates key Natural Language Processing (NLP) techniques using the NLTK library in Python. It covers foundational text preprocessing tasks such as tokenization, stemming, lemmatization, stopword removal, and part-of-speech (POS) tagging. Additionally, it shows how to clean and prepare textual data by removing punctuation and irrelevant words, extract named entities, and represent text numerically using the Bag of Words (BoW) model for downstream analysis or machine learning. The goal is to illustrate how raw text can be transformed into a structured and meaningful format, which is a critical step for any NLP or text classification task.

In [None]:
# Import NLTK and download the corpora and models used in this notebook
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [None]:
data = ''' Miami Dade College, known locally as MDC, is an institution of higher education in
Southeast Florida. It is the largest college by enrollment in the United States,
with the most diverse student body, and is recognized as a prominent educational
establishment in the nation. '''

In [None]:
nltk.sent_tokenize(data)

[' Miami Dade College, known locally as MDC, is an institution of higher education in\nSoutheast Florida.',
 'It is the largest college by enrollment in the United States,\nwith the most diverse student body, and is recognized as a prominent educational\nestablishment in the nation.']

In [None]:
nltk.word_tokenize(data)

['Miami',
 'Dade',
 'College',
 ',',
 'known',
 'locally',
 'as',
 'MDC',
 ',',
 'is',
 'an',
 'institution',
 'of',
 'higher',
 'education',
 'in',
 'Southeast',
 'Florida',
 '.',
 'It',
 'is',
 'the',
 'largest',
 'college',
 'by',
 'enrollment',
 'in',
 'the',
 'United',
 'States',
 ',',
 'with',
 'the',
 'most',
 'diverse',
 'student',
 'body',
 ',',
 'and',
 'is',
 'recognized',
 'as',
 'a',
 'prominent',
 'educational',
 'establishment',
 'in',
 'the',
 'nation',
 '.']

In [None]:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer, SnowballStemmer
from nltk.corpus import stopwords

In [None]:
lancaster = LancasterStemmer()
porter = PorterStemmer()
Snowball = SnowballStemmer("english")
print('Porter stemmer')
print(porter.stem("hobby"))
print(porter.stem("hobbies"))
print(porter.stem("computer"))
print(porter.stem("computation"))
print("**************************")
print('lancaster stemmer')
print(lancaster.stem("hobby"))
print(lancaster.stem("hobbies"))
print(lancaster.stem("computer"))
print(lancaster.stem("computation"))
print("**************************")
print('Snowball stemmer')
print(Snowball.stem("hobby"))
print(Snowball.stem("hobbies"))
print(Snowball.stem("computer"))
print(Snowball.stem("computation"))

Porter stemmer
hobbi
hobbi
comput
comput
**************************
lancaster stemmer
hobby
hobby
comput
comput
**************************
Snowball stemmer
hobbi
hobbi
comput
comput
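
The same comparison can be written as a loop so that adding a word or a stemmer takes one line. This is a minimal sketch rather than the notebook's own code: the `stemmers` dictionary and the `stem_table` helper are illustrative names, and only the three NLTK stemmer classes imported above are assumed.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Illustrative registry of the three stemmers compared in this notebook
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}
words = ["hobby", "hobbies", "computer", "computation"]


def stem_table(words, stemmers):
    """Return {stemmer name: [stem of each word, in input order]}."""
    return {name: [s.stem(w) for w in words] for name, s in stemmers.items()}


table = stem_table(words, stemmers)
for name, stems in table.items():
    # One row per stemmer, one column per input word
    print(f"{name:<10}", *stems)
```

Keeping the stemmers in a dictionary also avoids the copy-paste slip of calling one stemmer under another stemmer's heading.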


In [None]:
# The space before the backslash keeps "when" and "i" from being glued together
sent = "I was going to the office on my bike when \
i saw a car passing by hit the tree."
tokens = nltk.word_tokenize(sent)
for stemmer in (Snowball, lancaster, porter):
    stems = [stemmer.stem(t) for t in tokens]
    print(" ".join(stems))

i was go to the offic on my bike when i saw a car pass by hit the tree .
i was going to the off on my bik when i saw a car pass by hit the tre .
i wa go to the offic on my bike when i saw a car pass by hit the tree .


In [None]:
paragraph = """Miami Dade College has evolved over decades, becoming a beacon of education in the diverse fabric of Miami. Throughout its existence, students from all corners of the globe have converged here, seeking knowledge and growth. Like India, MDC has seen the influx of various cultures, enriching its own identity without dominating any. This hasn't led MDC to impose its ethos elsewhere, but rather to build a unique educational culture that respects the freedom of thought and diversity. This is the cornerstone of my first vision for MDC - freedom in education.

This freedom was envisaged at its founding, setting the stage for an academic revolution. It is this freedom that MDC must continue to cherish, fostering an environment where respect is earned through the empowerment of knowledge.

My second vision is for MDC's advancement. For years, we have been an institution nurturing potential. Now, it's time we recognize ourselves as a leader in education, one of the most influential colleges nationwide. We've seen growth in student success, decreases in educational inequity, and our contributions are gaining recognition. Yet, we sometimes hesitate to see ourselves as a hub of advanced education, independent and confident. This must change.

My third vision is for MDC to assert itself globally. Respect comes with recognition, and recognition is born of strength—not just in academic excellence but in our ability to impact the community and economy positively. Strength in education fosters respect globally.

I have been fortunate to work alongside remarkable educators—innovators in teaching and administration. Their mentorship was an invaluable chapter in my life. At MDC, I see these interactions as key milestones in my career, guiding our collective journey towards these three visions."""

In [None]:
sentences = nltk.sent_tokenize(paragraph)
stemmer = PorterStemmer()
stop_set = set(stopwords.words('english'))  # build the lookup set once, not per word


# Stemming
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [stemmer.stem(word) for word in words if word not in stop_set]
    sentences[i] = ' '.join(words)

sentences

['miami dade colleg evolv decad , becom beacon educ divers fabric miami .',
 'throughout exist , student corner globe converg , seek knowledg growth .',
 'like india , mdc seen influx variou cultur , enrich ident without domin .',
 "thi n't led mdc impos etho elsewher , rather build uniqu educ cultur respect freedom thought divers .",
 'thi cornerston first vision mdc - freedom educ .',
 'thi freedom envisag found , set stage academ revolut .',
 'it freedom mdc must continu cherish , foster environ respect earn empower knowledg .',
 "my second vision mdc 's advanc .",
 'for year , institut nurtur potenti .',
 "now , 's time recogn leader educ , one influenti colleg nationwid .",
 "we 've seen growth student success , decreas educ inequ , contribut gain recognit .",
 'yet , sometim hesit see hub advanc educ , independ confid .',
 'thi must chang .',
 'my third vision mdc assert global .',
 'respect come recognit , recognit born strength—not academ excel abil impact commun economi posit 
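
In the stemmed output above, sentence-initial words such as 'thi' (This), 'it', 'my', 'for', and 'we' survive even though they are on NLTK's English stopword list. The likely cause is that the list is all lowercase while the membership test runs on the original-case token. The sketch below illustrates the difference with a small hypothetical stopword set (a stand-in for `stopwords.words('english')`) and an illustrative helper named `drop_stopwords`.

```python
# Hypothetical mini stopword list; in the notebook this would be
# set(stopwords.words('english')).
stop_set = {"this", "is", "the", "of", "my", "for", "in"}

tokens = ["This", "is", "the", "cornerstone", "of", "my", "first",
          "vision", "for", "MDC"]


def drop_stopwords(tokens, stop_set):
    """Remove stopwords, comparing in lowercase so 'This' matches 'this'."""
    return [t for t in tokens if t.lower() not in stop_set]


# Case-sensitive test, as in the loop above: 'This' slips through
case_sensitive = [t for t in tokens if t not in stop_set]
# Case-insensitive test: 'This' is removed along with the others
case_insensitive = drop_stopwords(tokens, stop_set)

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase before filtering is a design choice: it removes more noise, but it would also change the outputs recorded in this notebook.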

In [None]:
# Lemmatization: look each word up in a dictionary (WordNet) and return its base form (lemma)

from nltk.stem import WordNetLemmatizer

In [None]:
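lemma = WordNetLemmatizer()

# lemmatize() assumes pos='n' (noun) unless told otherwise, so verb forms
# such as 'running' and 'ran' are returned unchanged below.
print(lemma.lemmatize('running'))
print(lemma.lemmatize('runs'))
print(lemma.lemmatize('ran'))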

running
run
ran


In [None]:
sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
stop_set = set(stopwords.words('english'))  # build the lookup set once, not per word

# Lemmatization
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_set]
    sentences[i] = ' '.join(words)

sentences

['Miami Dade College evolved decade , becoming beacon education diverse fabric Miami .',
 'Throughout existence , student corner globe converged , seeking knowledge growth .',
 'Like India , MDC seen influx various culture , enriching identity without dominating .',
 "This n't led MDC impose ethos elsewhere , rather build unique educational culture respect freedom thought diversity .",
 'This cornerstone first vision MDC - freedom education .',
 'This freedom envisaged founding , setting stage academic revolution .',
 'It freedom MDC must continue cherish , fostering environment respect earned empowerment knowledge .',
 "My second vision MDC 's advancement .",
 'For year , institution nurturing potential .',
 "Now , 's time recognize leader education , one influential college nationwide .",
 "We 've seen growth student success , decrease educational inequity , contribution gaining recognition .",
 'Yet , sometimes hesitate see hub advanced education , independent confident .',
 'This m

In [None]:
# Stop words: very common function words ("of", "the", etc.) that hold sentences together but usually add little signal for analysis

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words



['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [None]:
data =' We will see an example of POS tagging.'

pos = nltk.pos_tag(nltk.word_tokenize(data))

pos

[('We', 'PRP'),
 ('will', 'MD'),
 ('see', 'VB'),
 ('an', 'DT'),
 ('example', 'NN'),
 ('of', 'IN'),
 ('POS', 'NNP'),
 ('tagging', 'NN'),
 ('.', '.')]
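
One common use of these Penn Treebank tags is to tell `WordNetLemmatizer` which part of speech to assume: its `lemmatize` method takes a `pos` argument ('n', 'v', 'a', or 'r') and defaults to 'n'. The helper below is a sketch of that mapping; `penn_to_wordnet` is an illustrative name, and falling back to 'n' for every other tag is an assumption, not an NLTK rule.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r', 'n')."""
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # assumed fallback: treat everything else as a noun


# Tagged tokens copied from the output above
tagged = [("We", "PRP"), ("will", "MD"), ("see", "VB"), ("an", "DT"),
          ("example", "NN"), ("of", "IN"), ("POS", "NNP"),
          ("tagging", "NN"), (".", ".")]

mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)

# With a lemmatizer at hand, each token could then be lemmatized as:
#   lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
```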

In [None]:
# Punctuation marks can be removed from the token list just like stopwords.

import string
from nltk.corpus import stopwords

punct = string.punctuation
punct

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
# Tokenize the sample into words, then drop stopwords and punctuation.
stop_words = stopwords.words('english')
punct =string.punctuation

data = '''Miami Dade College has evolved over decades, becoming a beacon of education in the diverse fabric of Miami. Throughout its existence, students from all corners of the globe have converged here, seeking knowledge and growth. MDC has seen the influx of various cultures, enriching its own identity without dominating any. This hasn't led MDC to impose its ethos elsewhere, but rather to build a unique educational culture that respects the freedom of thought and diversity. This is the cornerstone of my first vision for MDC - freedom in education.'''
clean_data = []

for word in nltk.word_tokenize(data):
    # keep tokens that are neither punctuation nor stopwords
    if word not in punct and word not in stop_words:
        clean_data.append(word)

clean_data

['Miami',
 'Dade',
 'College',
 'evolved',
 'decades',
 'becoming',
 'beacon',
 'education',
 'diverse',
 'fabric',
 'Miami',
 'Throughout',
 'existence',
 'students',
 'corners',
 'globe',
 'converged',
 'seeking',
 'knowledge',
 'growth',
 'MDC',
 'seen',
 'influx',
 'various',
 'cultures',
 'enriching',
 'identity',
 'without',
 'dominating',
 'This',
 "n't",
 'led',
 'MDC',
 'impose',
 'ethos',
 'elsewhere',
 'rather',
 'build',
 'unique',
 'educational',
 'culture',
 'respects',
 'freedom',
 'thought',
 'diversity',
 'This',
 'cornerstone',
 'first',
 'vision',
 'MDC',
 'freedom',
 'education']
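
The test `word not in punct` above is a substring check against one long string, so it catches single marks such as ',' but would miss a multi-character token like '...'. A stricter alternative is to ask whether every character of the token is punctuation. This is a sketch under that assumption; `is_punctuation` is an illustrative helper, and the token list is made up for the example.

```python
import string

PUNCT = set(string.punctuation)


def is_punctuation(token):
    """True when the token is non-empty and made only of punctuation characters."""
    return bool(token) and all(ch in PUNCT for ch in token)


# Made-up tokens: a dash, an ellipsis, and a contraction fragment
tokens = ["MDC", "-", "freedom", "...", "n't"]

# Substring test, as in the loop above: the ellipsis is not a substring
# of string.punctuation, so it would be kept
substring_kept = [t for t in tokens if t not in string.punctuation]
# Character-wise test: both the dash and the ellipsis are dropped
kept = [t for t in tokens if not is_punctuation(t)]

print(substring_kept)
print(kept)
```

Tokens that mix letters and punctuation, such as "n't", pass both tests and need separate handling if they should be removed.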

In [None]:
nltk.pos_tag(clean_data)

[('Miami', 'NNP'),
 ('Dade', 'NNP'),
 ('College', 'NNP'),
 ('evolved', 'VBD'),
 ('decades', 'NNS'),
 ('becoming', 'VBG'),
 ('beacon', 'JJ'),
 ('education', 'NN'),
 ('diverse', 'NN'),
 ('fabric', 'NN'),
 ('Miami', 'NNP'),
 ('Throughout', 'NNP'),
 ('existence', 'NN'),
 ('students', 'NNS'),
 ('corners', 'NNS'),
 ('globe', 'VBP'),
 ('converged', 'VBN'),
 ('seeking', 'VBG'),
 ('knowledge', 'NN'),
 ('growth', 'NN'),
 ('MDC', 'NNP'),
 ('seen', 'VBN'),
 ('influx', 'VBP'),
 ('various', 'JJ'),
 ('cultures', 'NNS'),
 ('enriching', 'VBG'),
 ('identity', 'NN'),
 ('without', 'IN'),
 ('dominating', 'VBG'),
 ('This', 'DT'),
 ("n't", 'RB'),
 ('led', 'VBN'),
 ('MDC', 'NNP'),
 ('impose', 'VB'),
 ('ethos', 'NN'),
 ('elsewhere', 'RB'),
 ('rather', 'RB'),
 ('build', 'VB'),
 ('unique', 'JJ'),
 ('educational', 'JJ'),
 ('culture', 'NN'),
 ('respects', 'VBZ'),
 ('freedom', 'RB'),
 ('thought', 'VBN'),
 ('diversity', 'NN'),
 ('This', 'DT'),
 ('cornerstone', 'NN'),
 ('first', 'JJ'),
 ('vision', 'NN'),
 ('MDC', 'NN

In [None]:
pos_tag = nltk.pos_tag(clean_data)
namedEntity = nltk.ne_chunk(pos_tag)
print(namedEntity)

(S
  (PERSON Miami/NNP)
  (PERSON Dade/NNP College/NNP)
  evolved/VBD
  decades/NNS
  becoming/VBG
  beacon/JJ
  education/NN
  diverse/NN
  fabric/NN
  (PERSON Miami/NNP Throughout/NNP)
  existence/NN
  students/NNS
  corners/NNS
  globe/VBP
  converged/VBN
  seeking/VBG
  knowledge/NN
  growth/NN
  (ORGANIZATION MDC/NNP)
  seen/VBN
  influx/VBP
  various/JJ
  cultures/NNS
  enriching/VBG
  identity/NN
  without/IN
  dominating/VBG
  This/DT
  n't/RB
  led/VBN
  (ORGANIZATION MDC/NNP)
  impose/VB
  ethos/NN
  elsewhere/RB
  rather/RB
  build/VB
  unique/JJ
  educational/JJ
  culture/NN
  respects/VBZ
  freedom/RB
  thought/VBN
  diversity/NN
  This/DT
  cornerstone/NN
  first/JJ
  vision/NN
  (ORGANIZATION MDC/NNP)
  freedom/NN
  education/NN)
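
To work with the entities rather than just print the tree, the labelled subtrees can be collected into a list. The sketch below rebuilds a short excerpt of the tree above by hand so that it runs without the chunker models; `extract_entities` is an illustrative helper, not an NLTK function.

```python
from nltk.tree import Tree

# Hand-built excerpt of the chunk tree printed above
chunked = Tree("S", [
    Tree("PERSON", [("Miami", "NNP")]),
    Tree("PERSON", [("Dade", "NNP"), ("College", "NNP")]),
    ("evolved", "VBD"),
    Tree("ORGANIZATION", [("MDC", "NNP")]),
    ("seen", "VBN"),
])


def extract_entities(tree):
    """Return (label, text) pairs for each entity chunk directly under the root."""
    entities = []
    for node in tree:
        # Entity chunks are Tree objects; ordinary tokens are (word, tag) tuples
        if isinstance(node, Tree):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities


entities = extract_entities(chunked)
print(entities)
```

In the notebook the same helper could be called on `namedEntity`. Note that the chunker labels 'Miami' and 'Dade College' as PERSON here; running it on text that has already lost its stopwords and punctuation may be part of the reason, since the tagger and chunker no longer see normal sentence context.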


# Word Vectorization
Word vectorization is the process of mapping words to vectors of real numbers.

### Reviews
- Review 1: This movie is very scary and long
- Review 2: This movie is not scary and is slow
- Review 3: This movie is spooky and good

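### BoW representation of these reviews:
Each position counts how often one vocabulary word occurs in the review. The vocabulary, in order of first appearance, is: this, movie, is, very, scary, and, long, not, slow, spooky, good.

- Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]
- Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]
- Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]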
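
The three vectors can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming lowercase whitespace tokenization and a vocabulary ordered by first appearance (an ordering inferred from the vectors, not stated in the notebook); `build_vocabulary` and `bow_vector` are illustrative names.

```python
from collections import Counter

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]


def build_vocabulary(docs):
    """Collect unique lowercase words in order of first appearance."""
    vocab = []
    for doc in docs:
        for word in doc.lower().split():
            if word not in vocab:
                vocab.append(word)
    return vocab


def bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]


vocab = build_vocabulary(reviews)
vectors = [bow_vector(review, vocab) for review in reviews]

print(vocab)
for vector in vectors:
    print(vector)
```

The 2 in the second vector comes from "is" appearing twice in Review 2, which shows that BoW keeps word counts but discards word order.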


## Bag of Words - Vectorize the data

#### Truism: NLP models operate on numbers, so every word must be converted into a numeric representation. One way to do that is the Bag of Words (BoW) model.


In [None]:
data = """ The reader of this course should have a basic knowledge of the Python programming lenguage.
He/she must have knowldge of data types in Python.He should be able to write functions,
 and also have the ability to import and use libraries and packages in Python. Familiarity
 with basic linguistics and probability is assumed although not required to fully
 complete this course. """