<a href="https://colab.research.google.com/github/faisu6339-glitch/Natural-Language-Processing-NLP-/blob/main/Sentence_Tokenization_%26_Lemmatization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Sentence Tokenization


#How sent_tokenize works
The sent_tokenize function from the nltk.tokenize module splits a text into a list of sentences. It relies on the Punkt tokenizer, whose model is trained with an unsupervised algorithm to learn the abbreviations, collocations, and sentence-starting words of a language (English by default). Using that pre-trained model, it treats periods, question marks, and exclamation points as candidate boundaries and uses cues such as the capitalization of the next word to decide whether a sentence actually ends there.
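To make these cues concrete, here is a minimal sketch of a rule-based splitter built on Python's re module. It is not the Punkt algorithm: the naive_sent_split helper and its pattern are illustrative assumptions. It splits only where sentence-ending punctuation is followed by whitespace and an uppercase letter.

```python
import re

# Boundary rule (an assumption for illustration, not Punkt):
# '.', '!' or '?', then whitespace, then an uppercase letter.
BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

def naive_sent_split(text):
    """Split text at every position that matches the boundary rule."""
    return BOUNDARY.split(text)

text = 'Hello everyone.Welcome to GeekForGeeks. You are studying NLP article.'
sentences = naive_sent_split(text)
print(sentences)

# A fixed rule cannot tell an abbreviation from a sentence end.
abbreviation_case = naive_sent_split('Dr. Smith arrived. He sat down.')
print(abbreviation_case)
```

The second call treats the abbreviation 'Dr.' as a sentence end, which is the kind of case a learned abbreviation list is meant to handle.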

In [8]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [10]:
text='Hello everyone.Welcome to GeekForGeeks. You are studying NLP article.'
sent_tokenize(text)

['Hello everyone.Welcome to GeekForGeeks.', 'You are studying NLP article.']

Note that 'Hello everyone.Welcome to GeekForGeeks.' stays a single sentence: there is no whitespace after the first period, so it is not treated as a sentence boundary.

#Word tokenization

Word tokenization is the process of splitting a text document into individual words or tokens. It's a fundamental step in many natural language processing (NLP) tasks. Unlike sentence tokenization, which focuses on breaking text into sentences, word tokenization breaks sentences (or a continuous stream of text) into meaningful units, typically words, punctuation, and numbers. The exact definition of a 'word' can vary depending on the specific tokenizer used and the language, but generally, it involves identifying boundaries based on spaces, punctuation, and other linguistic rules. For example, in the sentence 'Hello, world!', word tokenization might produce tokens like ['Hello', ',', 'world', '!'].
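
The 'Hello, world!' example can be reproduced with a minimal regex sketch. The simple_word_tokenize helper is a hypothetical name used only for illustration, and its rule is far simpler than what NLTK's word_tokenize applies.

```python
import re

# \w+ matches a run of letters, digits, or underscores;
# [^\w\s] matches a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_word_tokenize(text):
    """Return words and punctuation marks as separate tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_word_tokenize("Hello, world!")
print(tokens)

# The same rule breaks contractions and decimal numbers apart.
rough_tokens = simple_word_tokenize("It's 3.5 km.")
print(rough_tokens)
```

The second call shows why the definition of a 'word' depends on the tokenizer: this rule splits "It's" and "3.5" into pieces that a linguistically informed tokenizer would usually handle differently.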

In [11]:
from nltk.tokenize import word_tokenize
text='Hello everyone.Welcome to GeekForGeeks. You are studying NLP article.'
word_tokenize(text)

['Hello',
 'everyone.Welcome',
 'to',
 'GeekForGeeks',
 '.',
 'You',
 'are',
 'studying',
 'NLP',
 'article',
 '.']

'everyone.Welcome' remains one token because the period sits between two words with no whitespace; word_tokenize separates a period only at the end of a sentence.

#Word Tokenization using Regular Expressions

In [16]:
from nltk.tokenize import RegexpTokenizer
tokenizer=RegexpTokenizer(r'\w+')
text='Hello everyone.Welcome to GeekForGeeks. You are studying NLP article.'
tokenizer.tokenize(text)

['Hello',
 'everyone',
 'Welcome',
 'to',
 'GeekForGeeks',
 'You',
 'are',
 'studying',
 'NLP',
 'article']
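
The pattern passed to RegexpTokenizer decides what counts as a token. The sketch below compares three patterns on a sample sentence chosen for illustration; the gaps=True argument tells the tokenizer that the pattern describes the separators rather than the tokens.

```python
from nltk.tokenize import RegexpTokenizer

text = "Don't stop at 3.5 km."

# Runs of word characters only: contractions and decimals are broken up.
words_only = RegexpTokenizer(r"\w+").tokenize(text)

# Allow an apostrophe or a period inside a token when word characters follow it.
keep_inner_marks = RegexpTokenizer(r"\w+(?:['.]\w+)*").tokenize(text)

# gaps=True: the pattern matches the separators (here, whitespace).
split_on_space = RegexpTokenizer(r"\s+", gaps=True).tokenize(text)

print(words_only)
print(keep_inner_marks)
print(split_on_space)
```

The group in the second pattern is non-capturing, `(?:...)`, so each match is returned whole.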

#Other NLTK tokenizers

Beyond sent_tokenize and word_tokenize, NLTK offers several other tokenization methods, each suited to different needs:

TreebankWordTokenizer: This tokenizer implements the Penn Treebank tokenization conventions, which are widely used in NLP research. It has specific rules for contractions and punctuation, and it assumes the text has already been split into sentences, so only a sentence-final period is separated from the preceding word.

Example: text = "Don't go!" becomes ['Do', "n't", 'go', '!']

WhitespaceTokenizer: This is a very simple tokenizer that splits text based on whitespace characters (spaces, tabs, newlines). It doesn't handle punctuation attached to words, so it's less sophisticated than word_tokenize but can be useful for quick, basic tokenization when punctuation separation isn't critical.

Example: text = "Hello, world!" might become ['Hello,', 'world!']

WordPunctTokenizer: This tokenizer splits all punctuation into separate tokens. It's more aggressive in separating punctuation than word_tokenize.

Example: text = "Hello, world!" might become ['Hello', ',', 'world', '!']

MWETokenizer (Multi-Word Expression Tokenizer): This tokenizer is used to treat sequences of words as single tokens. For example, if you want "New York" to be considered one token instead of two, you would use this tokenizer after initial word tokenization.

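Example: If you define ('New', 'York') as an MWE and set separator=' ', the tokens of "I live in New York City." become ['I', 'live', 'in', 'New York', 'City', '.']. With the default separator '_', the merged token is 'New_York'.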
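
The four tokenizers described above can be tried side by side. This is a minimal sketch using the example sentences from the descriptions; none of these classes needs a corpus download. MWETokenizer works on a list of tokens, so the sentence is first split with TreebankWordTokenizer (an illustrative choice).

```python
from nltk.tokenize import (
    MWETokenizer,
    TreebankWordTokenizer,
    WhitespaceTokenizer,
    WordPunctTokenizer,
)

treebank = TreebankWordTokenizer().tokenize("Don't go!")
whitespace = WhitespaceTokenizer().tokenize("Hello, world!")
wordpunct = WordPunctTokenizer().tokenize("Hello, world!")

# MWETokenizer merges known multi-word expressions in an existing token list.
tokens = TreebankWordTokenizer().tokenize("I live in New York City.")
mwe_space = MWETokenizer([("New", "York")], separator=" ").tokenize(tokens)
mwe_default = MWETokenizer([("New", "York")]).tokenize(tokens)

print(treebank)
print(whitespace)
print(wordpunct)
print(mwe_space)
print(mwe_default)
```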

#Lemmatization

Lemmatization is the process of reducing words to their base or root form, known as a lemma. Unlike stemming, which often chops off prefixes and suffixes and might result in a word that isn't a true dictionary word, lemmatization uses a vocabulary and morphological analysis to return the dictionary form of a word. For example, the words "running", "runs", and "ran" would all be lemmatized to "run". It's often used in natural language processing to ensure that different inflected forms of a word are treated as the same item, which can improve the accuracy of tasks like text classification and information retrieval.
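
To see the stemming side of this contrast, here is a minimal sketch using NLTK's PorterStemmer. The choice of stemmer and the word list are assumptions made for illustration. The stemmer applies suffix rules only and needs no corpus download.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stemming strips suffixes by rule; it does not consult a dictionary.
words = ["studies", "flies", "running", "ran"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)
```

'studies' and 'flies' are cut to 'studi' and 'fli', which are not dictionary words, and the irregular form 'ran' is left unchanged because no suffix rule applies to it.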

#Lemmatization in NLTK
NLTK provides lemmatization through WordNetLemmatizer. Running the example in cells In [17] and In [18] below gives:

Original words: ['The', 'cats', 'were', 'running', 'quickly', 'and', 'ate', 'mice', '.']

Lemmatized words: ['The', 'cat', 'were', 'running', 'quickly', 'and', 'ate', 'mouse', '.']

Notice that 'cats' was lemmatized to 'cat', and 'mice' to 'mouse'. 'running' remained 'running' because, by default, lemmatize treats every word as a noun, and 'running' as a noun (e.g., 'the running of the bulls') doesn't change. For the same reason, the verb forms 'were' and 'ate' are left as they are.

Lemmatization with POS tags: {'running': 'run', 'better': 'good', 'ran': 'run', 'geese': 'goose'}

Here, when we specify pos='v' for 'running' and 'ran', they correctly lemmatize to 'run'.
'better' with pos='a' (adjective) correctly lemmatizes to 'good'.
'geese' with pos='n' (noun) correctly lemmatizes to 'goose'.
This demonstrates the importance of providing the correct Part-of-Speech (POS) tag for more accurate lemmatization, especially for words that can act as different parts of speech.

This notebook covers several fundamental NLP concepts:

- Sentence Tokenization: We split text into individual sentences using NLTK's sent_tokenize function, which relies on the Punkt tokenizer.
- Word Tokenization: We explored different methods for breaking sentences into words:
  - word_tokenize for general-purpose word tokenization.
  - RegexpTokenizer for custom, regex-based tokenization, which we used to extract only runs of word characters (letters, digits, and underscores).
  - TreebankWordTokenizer, WhitespaceTokenizer, WordPunctTokenizer, and MWETokenizer for other use cases.
- Lemmatization: We reduced words to their base form (lemma) using NLTK's WordNetLemmatizer, contrasted it with stemming, and showed why Part-of-Speech (POS) tags matter for accurate results.

In [17]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the WordNet corpus, which is required for WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [18]:
# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

text = "The cats were running quickly and ate mice."

# Tokenize the text into words
words = word_tokenize(text)

# Perform lemmatization on each word
lemmas = [lemmatizer.lemmatize(word) for word in words]

print("Original words:", words)
print("Lemmatized words:", lemmas)

# Example with different parts of speech (pos)
# WordNetLemmatizer can take an optional 'pos' argument (part-of-speech)
# 'n' for noun (default), 'v' for verb, 'a' for adjective, 'r' for adverb

word_to_lemma = {
    "running": lemmatizer.lemmatize("running", pos="v"),
    "better": lemmatizer.lemmatize("better", pos="a"),
    "ran": lemmatizer.lemmatize("ran", pos="v"),
    "geese": lemmatizer.lemmatize("geese", pos="n")
}

print("\nLemmatization with POS tags:", word_to_lemma)

Original words: ['The', 'cats', 'were', 'running', 'quickly', 'and', 'ate', 'mice', '.']
Lemmatized words: ['The', 'cat', 'were', 'running', 'quickly', 'and', 'ate', 'mouse', '.']

Lemmatization with POS tags: {'running': 'run', 'better': 'good', 'ran': 'run', 'geese': 'goose'}


In this example:

1.  We import `WordNetLemmatizer` from `nltk.stem` and `word_tokenize` from `nltk.tokenize`.
2.  We download the necessary `wordnet` and `omw-1.4` (Open Multilingual WordNet) corpora, which `WordNetLemmatizer` uses.
3.  We initialize an instance of `WordNetLemmatizer`.
4.  We tokenize a sample sentence into individual words.
5.  We then iterate through the words and apply `lemmatizer.lemmatize()` to each word. By default, `lemmatize` treats words as nouns. You can specify the part of speech using the `pos` argument for more accurate results (e.g., `pos='v'` for verbs, `pos='a'` for adjectives).
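
Supplying `pos` by hand does not scale to whole sentences. A common approach is to map Penn Treebank tags to the four letters `lemmatize` accepts. The sketch below assumes hypothetical helpers `penn_to_wordnet` and `lemmatize_tagged`. The tagged tokens are written by hand for illustration; in practice they could come from a POS tagger such as `nltk.pos_tag`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet pos letter (noun as the fallback)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, matching lemmatize's own default

def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs with any object exposing lemmatize(word, pos=...)."""
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged_tokens]

# Hand-written tags (an assumption for this sketch, not tagger output).
tagged = [("The", "DT"), ("cats", "NNS"), ("were", "VBD"), ("running", "VBG"),
          ("quickly", "RB"), ("and", "CC"), ("ate", "VBD"), ("mice", "NNS")]

wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_pos)
```

Passing a `WordNetLemmatizer` instance as `lemmatizer` to `lemmatize_tagged` applies the mapped tag to each word instead of the noun default.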

#Basic Lemmatization

In [21]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the WordNet corpus, which is required for WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

words=["cats","running","better","ran","geese","flies"]
for word in words:
  print(word,"->",lemmatizer.lemmatize(word))

cats -> cat
running -> running
better -> better
ran -> ran
geese -> goose
flies -> fly


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


#Lemmatization with POS tags

In [22]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("ran", pos="v"))
print(lemmatizer.lemmatize("geese", pos="n"))

run
good
run
goose


#Lemmatization of a Sentence

In [23]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

text = "The cats were running quickly and ate mice."
words = word_tokenize(text)
lemmas = [lemmatizer.lemmatize(word) for word in words]
print(lemmas)

['The', 'cat', 'were', 'running', 'quickly', 'and', 'ate', 'mouse', '.']


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


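#Sentence Sentiment with TextBlob

The next cells split a paragraph into sentences with sent_tokenize, score each sentence with TextBlob's sentiment.polarity (a float from -1.0 to 1.0), and label the summed score.

In [35]:
from textblob import TextBlob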

In [36]:
paragraph="This wonderful day is full of sunshine and joy. I love spending time outdoors."

In [37]:
sentences=sent_tokenize(paragraph)

In [40]:
pol=[]
for sentence in sentences:
  blob=TextBlob(sentence)
  pol.append(blob.sentiment.polarity)

In [46]:
result=sum(pol)

In [41]:
blob2=TextBlob(paragraph)
blob2.sentences

[Sentence("This wonderful day is full of sunshine and joy."),
 Sentence("I love spending time outdoors.")]

In [45]:
# Each polarity lies in [-1.0, 1.0], so classify by the sign of the sum
if result > 0:
  print("Positive")
elif result == 0:
  print("Neutral")
else:
  print("Negative")

Positive
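
One caveat with summing polarities is that the total grows with the number of sentences. Here is a minimal sketch, assuming a hypothetical helper label_sentiment, that averages the scores instead so the result stays within the [-1.0, 1.0] polarity range. The scores passed in below are made up for illustration and are not TextBlob output.

```python
def label_sentiment(polarities, tolerance=1e-9):
    """Label a list of per-sentence polarity scores by the sign of their mean."""
    if not polarities:
        return "Neutral"
    mean = sum(polarities) / len(polarities)
    if mean > tolerance:
        return "Positive"
    if mean < -tolerance:
        return "Negative"
    return "Neutral"

# Made-up scores, not TextBlob output.
print(label_sentiment([0.8, 0.2]))
print(label_sentiment([0.5, -0.5]))
print(label_sentiment([-0.3]))
```

The tolerance parameter is a design choice: it can be widened so that scores very close to zero are reported as neutral.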


#Lemmatization function with spaCy

In [47]:
import spacy

# Load the model once; reloading it on every call is slow.
nlp = spacy.load('en_core_web_sm')

def lemmatize_text(text):
  """Return text with every token replaced by its spaCy lemma."""
  doc = nlp(text)
  return ' '.join(token.lemma_ for token in doc)

text = "Students are studying natural language processing."
lemmatized_text = lemmatize_text(text)
print(lemmatized_text)

student be study natural language processing .

Unlike WordNetLemmatizer with its default noun setting, spaCy tags each token's part of speech as part of its pipeline, so 'are' becomes 'be' and 'studying' becomes 'study' without a pos argument.
