NLTK stands for Natural Language Toolkit. It's a powerful and popular open-source library in Python for working with human language data (text). It provides easy-to-use interfaces to over 50 corpora and lexical resources, such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

Here's a breakdown of what NLTK is and what it's used for:

**What is NLTK?**

*   **Python Library:** It's primarily a Python library, making it accessible and easy to integrate into Python projects.
*   **Open Source:** It's freely available and has a large, active community.
*   **Comprehensive:** It offers a wide range of tools and resources for various NLP tasks, from basic text processing to more advanced linguistic analysis.
*   **Educational Tool:** It's widely used in academia for teaching and research in NLP, computational linguistics, and artificial intelligence.
*   **Prototyping:** It's excellent for rapid prototyping and experimenting with different NLP techniques before moving to more specialized or performance-oriented libraries for production.

**Key Features and Functionalities:**

*   **Tokenization:** Breaking down text into smaller units (words, sentences).
    *   `nltk.word_tokenize()`: Splits text into words.
    *   `nltk.sent_tokenize()`: Splits text into sentences.
*   **Stemming and Lemmatization:** Reducing words to their base or root form.
    *   **Stemming:** Removes suffixes to get to a base form (e.g., "running" -> "run"; `SnowballStemmer` maps "generously" -> "generous"). NLTK includes stemmers like `PorterStemmer` and `SnowballStemmer`.
    *   **Lemmatization:** Reduces words to their dictionary form (lemma) using vocabulary and morphological analysis (e.g., "better" -> "good", "ran" -> "run", when the correct part of speech is supplied). NLTK uses `WordNetLemmatizer`.
*   **Stop Words:** Removing common words (like "the," "is," and "and") that often don't carry significant meaning for analysis.
*   **Part-of-Speech (POS) Tagging:** Identifying the grammatical role of each word in a sentence (e.g., noun, verb, adjective).
    *   `nltk.pos_tag()`: Tags each token with its part of speech.

### Stemming
Stemming is a rule-based text normalisation technique that reduces words to their root form by removing prefixes or suffixes. The resulting form, called a stem, may not be a valid or meaningful word in the language.

*   Each word is processed independently, without considering context.
*   The algorithm checks for common suffixes or prefixes.
*   Predefined heuristic rules are applied to strip these affixes.
*   The remaining part of the word is returned as the stem.
*   No grammatical or semantic validation is performed.

In essence, stemming performs mechanical truncation of words.
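
The steps above can be sketched as a toy suffix stripper. This is only an illustration of the general idea, not the Porter, Snowball, or Lancaster algorithm: the suffix list, the `toy_stem` name, and the `min_stem_len` guard are all assumptions made for this example.

```python
# Toy suffix-stripping stemmer (illustrative only; not an NLTK algorithm).
SUFFIXES = ("ing", "ed", "es", "s")  # assumed rule list, checked in this order


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied, the word is returned as-is


for w in ["running", "studies", "jumped", "cats", "is"]:
    print(w, "->", toy_stem(w))
```

Because nothing validates the result, "running" comes out as "runn" here. Real stemmers layer further rules on top of this basic idea, such as undoing a doubled consonant.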

#### Techniques Used
*   **Suffix Stripping:** Removes common endings like -ing, -ed, -es
*   **Rule-Based Truncation:** Applies fixed linguistic rules
*   **Aggressive Reduction:** Shortens words for maximum generalization

#### Example:

| Original Word | Stem |
|---------------|------|
| running       | run  |
| studies       | studi|
| smiling       | smile|
| communication | commun|

In [75]:
import nltk

In [76]:
txt = "Hello Aaliya Fatma, I miss you from the deepest depths of my heart."


In [77]:
txt

'Hello Aaliya Fatma, I miss you from the deepest depths of my heart.'

In [78]:
txt.split('.')

['Hello Aaliya Fatma, I miss you from the deepest depths of my heart', '']

In [79]:
txt.split(" ")

['Hello',
 'Aaliya',
 'Fatma,',
 'I',
 'miss',
 'you',
 'from',
 'the',
 'deepest',
 'depths',
 'of',
 'my',
 'heart.']

In [81]:
len(txt.split(' '))

13

In [82]:
from nltk.tokenize import word_tokenize

In [83]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [84]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [85]:
from nltk.tokenize import word_tokenize,sent_tokenize
word_tokenize(txt)

['Hello',
 'Aaliya',
 'Fatma',
 ',',
 'I',
 'miss',
 'you',
 'from',
 'the',
 'deepest',
 'depths',
 'of',
 'my',
 'heart',
 '.']
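
The difference between `txt.split(' ')` (13 items, with punctuation glued to words) and `word_tokenize(txt)` (15 tokens) comes down to how punctuation is handled. A rough approximation using only the standard library is sketched below; the pattern and the `simple_tokenize` name are assumptions for illustration and this is not how NLTK's tokenizer is implemented.

```python
import re

txt = "Hello Aaliya Fatma, I miss you from the deepest depths of my heart."

# Assumed pattern: a run of word characters, or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word tokens and standalone punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


whitespace_tokens = txt.split(" ")
regex_tokens = simple_tokenize(txt)

print(len(whitespace_tokens), whitespace_tokens[2])  # punctuation stays attached
print(len(regex_tokens), regex_tokens[2:4])          # punctuation becomes its own token
```

For this sentence the regular expression yields the same tokens as `word_tokenize`, but NLTK's tokenizer also handles cases such as contractions that this pattern does not.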

In [86]:
sent_tokenize(txt)

['Hello Aaliya Fatma, I miss you from the deepest depths of my heart.']

In [87]:
for word in word_tokenize(txt):
    print(word)

Hello
Aaliya
Fatma
,
I
miss
you
from
the
deepest
depths
of
my
heart
.


In [88]:
# Iterate over word tokens (not sentences) and print those ending in 'a'
for word in word_tokenize(txt):
    if word.endswith('a'):
        print(word)

Aaliya
Fatma

### NLTK Stemming Examples

Let's apply the different stemming algorithms to a sample text to see how they work.

In [89]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
from nltk.tokenize import word_tokenize

# Sample text
text = "The quick brown foxes are running quickly through the beautiful forest, connecting with nature."

# Tokenize the text into words
words = word_tokenize(text)
print(f"Original words: {words}\n")

Original words: ['The', 'quick', 'brown', 'foxes', 'are', 'running', 'quickly', 'through', 'the', 'beautiful', 'forest', ',', 'connecting', 'with', 'nature', '.']



#### 1. Porter Stemmer

The Porter Stemmer is one of the oldest and most widely used stemmers. It is less aggressive than Lancaster, but it still produces non-word stems such as "quickli" and "beauti".

In [90]:
porter = PorterStemmer()
porter_stems = [porter.stem(word) for word in words]
print(f"Porter Stemmer: {porter_stems}")

Porter Stemmer: ['the', 'quick', 'brown', 'fox', 'are', 'run', 'quickli', 'through', 'the', 'beauti', 'forest', ',', 'connect', 'with', 'natur', '.']


#### 2. Snowball Stemmer (Porter2 Stemmer)

The Snowball Stemmer is an improved version of the Porter Stemmer, with more consistent rules and support for multiple languages. Its output is usually close to Porter's; in this example it reduces "quickly" to "quick" where Porter gives "quickli".

In [91]:
# Specify the language for Snowball Stemmer (e.g., 'english')
snowball = SnowballStemmer("english")
snowball_stems = [snowball.stem(word) for word in words]
print(f"Snowball Stemmer: {snowball_stems}")

Snowball Stemmer: ['the', 'quick', 'brown', 'fox', 'are', 'run', 'quick', 'through', 'the', 'beauti', 'forest', ',', 'connect', 'with', 'natur', '.']


#### 3. Lancaster Stemmer

The Lancaster Stemmer is generally the most aggressive of the three, often producing very short, sometimes unrecognizable stems.

In [92]:
lancaster = LancasterStemmer()
lancaster_stems = [lancaster.stem(word) for word in words]
print(f"Lancaster Stemmer: {lancaster_stems}")

Lancaster Stemmer: ['the', 'quick', 'brown', 'fox', 'ar', 'run', 'quick', 'through', 'the', 'beauty', 'forest', ',', 'connect', 'with', 'nat', '.']


**Stemming Explained in Detail**

Stemming is a text normalization technique that reduces words to their **root or base form**, often called a "stem." The primary goal of stemming is to remove suffixes (and sometimes prefixes) from words so that different inflected forms of a word (e.g., "running," "runs") are mapped to a common stem (e.g., "run"). Irregular forms such as "ran" are not covered by suffix rules and are generally left unchanged.

**Key Characteristics of Stemming:**

1.  **Rule-Based Heuristics:** Stemming algorithms typically use a set of heuristic rules to chop off the ends of words. These rules are often language-specific.
2.  **Does Not Guarantee Lexical Correctness:** The resulting "stem" may not always be a valid word in the dictionary. For instance, the stem of "beautiful" might be "beauti," which isn't a word. This is a crucial distinction from lemmatization.
3.  **Faster and Simpler:** Stemmers are generally simpler and faster to implement and execute compared to lemmatizers, as they don't rely on lexical dictionaries or advanced morphological analysis.
4.  **Reduces Dimensionality:** By mapping multiple forms of a word to a single stem, stemming helps in reducing the total number of unique words in a corpus. This is beneficial for tasks like information retrieval, text classification, and clustering, where you want to treat variations of a word as the same underlying concept.

**Common Stemming Algorithms (in NLTK):**

NLTK provides several popular stemming algorithms:

1.  **Porter Stemmer:**
    *   One of the oldest and most widely used stemmers, developed by Martin Porter in 1980.
    *   It applies a series of rules (e.g., remove 's', 'es', 'ing', 'ed') in multiple passes.
    *   It's less aggressive than Lancaster, but it can still over-stem words and often returns non-words.
    *   *Example:* "connection," "connections," "connected," "connecting" -> "connect"
    *   *Example:* "policy," "policies" -> "polici"

2.  **Snowball Stemmer (Porter2 Stemmer):**
    *   An improved version of the Porter Stemmer, also developed by Martin Porter.
    *   It defines its rules more consistently and is available for multiple languages.
    *   In some cases it is less aggressive than the original Porter Stemmer, keeping a more recognizable stem.
    *   *Example:* "generously" -> "generous"

3.  **Lancaster Stemmer:**
    *   More aggressive than both Porter and Snowball stemmers.
    *   Often produces very short, sometimes unrecognizable stems.
    *   While highly effective in reducing word forms, its aggressive nature can sometimes lead to loss of meaning or make the stems difficult to interpret.
    *   *Example:* "maximum" -> "maxim"; "organization" -> "org"

**When to Use Stemming:**

*   **Information Retrieval:** When you want a search query for "fishing" to also return documents containing "fishes" or "fished."
*   **Text Classification:** To reduce the feature space (number of unique words) and group semantically similar words together, potentially improving model performance by treating different forms of a word as the same.
*   **Sentiment Analysis:** To treat "love," "loved," and "loving" as the same sentiment indicator. Note that Porter does not reduce comparatives such as "happier" to "happy".
*   **Initial Data Exploration:** For quick and dirty text normalization when performance is critical, and a perfect dictionary word is not required.
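
The information-retrieval case can be sketched with a tiny inverted index in which both documents and queries are stemmed before lookup. The `crude_stem` function, its suffix list, and the sample documents are assumptions for illustration; in practice a real stemmer would take the place of `crude_stem`.

```python
from collections import defaultdict

SUFFIXES = ("ing", "es", "ed", "s")  # assumed toy rules, not a real stemmer


def crude_stem(word):
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[crude_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way as the documents, then look it up."""
    return sorted(index.get(crude_stem(query), set()))


docs = {
    1: "he fished all day",
    2: "fishes swim upstream",
    3: "stock markets fell",
}
index = build_index(docs)
print(search(index, "fishing"))  # documents 1 and 2 share the stem "fish"
```

The key design point is that the same normalization is applied at indexing time and at query time; otherwise the stems would not line up.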

**Limitations of Stemming:**

*   **Over-stemming:** Removing too much of a word, leading to loss of meaning or combining words that should be distinct (e.g., "universal" and "university" might both stem to "univers").
*   **Under-stemming:** Failing to reduce words that should be mapped to the same stem (e.g., "theory" and "theorize" might not stem to the same root).
*   **Produces Non-Words:** As mentioned, the output is not guaranteed to be a valid dictionary word.
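
One way to spot over- and under-stemming is to group a word list by stem and inspect which words collide. The helper below is a sketch: `group_by_stem` and `truncate7` are hypothetical names, and `truncate7` (keep the first seven characters) is a deliberately crude stand-in for a stemmer. A real stemmer's `stem` method could be passed as `stem_fn` instead.

```python
from collections import defaultdict


def group_by_stem(words, stem_fn):
    """Group words by the stem that stem_fn assigns to them."""
    groups = defaultdict(list)
    for word in words:
        groups[stem_fn(word)].append(word)
    return dict(groups)


def truncate7(word):
    """Stand-in stemmer: keep only the first seven characters."""
    return word.lower()[:7]


words = ["universal", "university", "universe", "theory", "theorize"]
groups = group_by_stem(words, truncate7)

for stem, members in groups.items():
    print(stem, members)
```

With this stand-in, "universal" and "university" collapse into one group (over-stemming), while "theory" and "theorize" end up in separate groups (under-stemming).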

**How it differs from Lemmatization:**

The main difference lies in the output: stemming is a heuristic process that chops off endings, often resulting in non-dictionary words, whereas lemmatization uses lexical knowledge (dictionaries and morphological analysis) to return the base **dictionary form** (lemma) of a word. With NLTK's `WordNetLemmatizer`, a word that is not found in WordNet is returned unchanged, so the result is a valid lemma only when the lookup succeeds.

In [121]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

### 1️⃣ Porter Stemmer ⭐ (Most Popular)

Program 1

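In [93]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
ps.stem("running")       # not displayed: a notebook cell shows only its last expression
ps.stem("Aaliya Fatma")  # stem() expects one token; this two-word string is only lowercased

'aaliya fatma'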

In [94]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

word1 = ps.stem("running")
word2 = ps.stem("Aaliya Fatma")

word1, word2


('run', 'aaliya fatma')

In [95]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
ps.stem("running")
ps.stem("studies")


'studi'

Program 2

In [96]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

words = ["running", "runs", "runner", "studies", "easily"]
stems = [ps.stem(word) for word in words]

print(stems)


['run', 'run', 'runner', 'studi', 'easili']


In [97]:
from nltk.stem import PorterStemmer
ps=PorterStemmer()

words=["Stronger","Bigger","Larger","harder","Easily","Luckily","Connected","Connections"]
stems=[ps.stem(word) for word in words]
print(stems)

['stronger', 'bigger', 'larger', 'harder', 'easili', 'luckili', 'connect', 'connect']


Program 3

In [98]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

sentence = "I was running and he runs every day"
tokens = word_tokenize(sentence)

stemmed_words = [ps.stem(word) for word in tokens]

print(stemmed_words)


['i', 'wa', 'run', 'and', 'he', 'run', 'everi', 'day']


In [99]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

sentence="Dil mera tod Students are studying and learning machine learning techniques daily"
tokens=word_tokenize(sentence)

stemmed_words=[ps.stem(word) for word in tokens]
print(stemmed_words)

['dil', 'mera', 'tod', 'student', 'are', 'studi', 'and', 'learn', 'machin', 'learn', 'techniqu', 'daili']


Program 4

In [100]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

sentence = "I am learning Natural Language Processing"
tokens = word_tokenize(sentence)

filtered_stems = [
    ps.stem(word) for word in tokens if word.lower() not in stop_words
]

print(filtered_stems)


['learn', 'natur', 'languag', 'process']


Program 5

In [101]:
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

data = {
    "text": [
        "I love machine learning",
        "He is studying data science",
        "NLP is very interesting"
    ]
}

df = pd.DataFrame(data)

df["stemmed_text"] = df["text"].apply(
    lambda x: " ".join([ps.stem(word) for word in word_tokenize(x)])
)

print(df)


                          text             stemmed_text
0      I love machine learning      i love machin learn
1  He is studying data science  he is studi data scienc
2      NLP is very interesting     nlp is veri interest


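### 2️⃣ Snowball Stemmer (Improved Porter)

🔹 What is Snowball Stemmer?

Snowball Stemmer is an improved and more consistent version of Porter Stemmer.
It is also known as the Porter2 Stemmer.

👉 It is:

*   Generally regarded as more accurate than Porter
*   Similar in aggressiveness: sometimes it strips more ("quickly" -> "quick" where Porter gives "quickli"), sometimes less ("generously" -> "generous")
*   Available for multiple languages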

In [102]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Program 1

In [103]:
from nltk.stem import SnowballStemmer

ss = SnowballStemmer("english")
ss.stem("studies")


'studi'

Program 2

In [104]:
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer("english")

sentence = "He was studying machines and learning NLP"
tokens = word_tokenize(sentence)

stemmed = [stemmer.stem(word) for word in tokens]
print(stemmed)


['he', 'was', 'studi', 'machin', 'and', 'learn', 'nlp']


In [105]:
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer("english")

sentence ="Organizations are organizing and organized many events"

tokens=word_tokenize(sentence)

stemmed=[stemmer.stem(word) for word in tokens]
print(stemmed)

['organ', 'are', 'organ', 'and', 'organ', 'mani', 'event']


Program 3

In [106]:
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))

sentence = "Natural Language Processing is very interesting"
tokens = word_tokenize(sentence)

processed = [
    stemmer.stem(word) for word in tokens
    if word.lower() not in stop_words
]

print(processed)


['natur', 'languag', 'process', 'interest']


Program 4

In [107]:
import pandas as pd
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer("english")

df = pd.DataFrame({
    "text": [
        "Students are studying NLP",
        "Machines are learning patterns",
        "Snowball stemmer improves stemming"
    ]
})

df["snowball_stemmed"] = df["text"].apply(
    lambda x: " ".join([stemmer.stem(word) for word in word_tokenize(x)])
)

print(df)


                                 text             snowball_stemmed
0           Students are studying NLP        student are studi nlp
1      Machines are learning patterns     machin are learn pattern
2  Snowball stemmer improves stemming  snowbal stemmer improv stem


### 3️⃣ Lancaster Stemmer (Very Aggressive)

🔹 What is Lancaster Stemmer?

Lancaster Stemmer is a rule-based stemming algorithm that is much more aggressive than:

*   Porter Stemmer
*   Snowball Stemmer

👉 It cuts words very heavily, often producing very short stems that may lose meaning.

Program

In [108]:
from nltk.stem import LancasterStemmer

ls = LancasterStemmer()
ls.stem("maximum")


'maxim'

Program

In [109]:
from nltk.stem import LancasterStemmer

ls = LancasterStemmer()

print(ls.stem("running"))
print(ls.stem("maximum"))
print(ls.stem("organization"))


run
maxim
org


All three stemmers on one word list

In [114]:
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

ls = LancasterStemmer()
ps = PorterStemmer()
ss = SnowballStemmer("english")

words = ["running", "runs", "runner", "studies", "easily"]
stems = [ls.stem(word) for word in words]   # Lancaster
stems2 = [ps.stem(word) for word in words]  # Porter
stems3 = [ss.stem(word) for word in words]  # Snowball

print(stems)   # Lancaster
print(stems2)  # Porter
print(stems3)  # Snowball

['run', 'run', 'run', 'study', 'easy']
['run', 'run', 'runner', 'studi', 'easili']
['run', 'run', 'runner', 'studi', 'easili']


Comparison between PorterStemmer, SnowballStemmer, and LancasterStemmer

In [110]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

words = ["studies", "running", "maximum", "organization"]

for word in words:
    print(word,
          "| Porter:", porter.stem(word),
          "| Snowball:", snowball.stem(word),
          "| Lancaster:", lancaster.stem(word))


studies | Porter: studi | Snowball: studi | Lancaster: study
running | Porter: run | Snowball: run | Lancaster: run
maximum | Porter: maximum | Snowball: maximum | Lancaster: maxim
organization | Porter: organ | Snowball: organ | Lancaster: org


Program

In [111]:
from nltk.stem import LancasterStemmer
from nltk.tokenize import word_tokenize

ls = LancasterStemmer()

sentence = "Students are studying organizations carefully"
tokens = word_tokenize(sentence)

stems = [ls.stem(word) for word in tokens]
print(stems)


['stud', 'ar', 'study', 'org', 'car']


Program

All three stemmers on one sentence

In [115]:
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

ls = LancasterStemmer()
ps = PorterStemmer()
ss = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))

sentence = "Natural Language Processing is extremely powerful"
tokens = word_tokenize(sentence)

processed = [
    ls.stem(word) for word in tokens

    if word.lower() not in stop_words
]
processed2 = [
    ps.stem(word) for word in tokens

    if word.lower() not in stop_words
]
processed3 = [
    ss.stem(word) for word in tokens

    if word.lower() not in stop_words
]
print(processed)
print(processed2)
print(processed3)


['nat', 'langu', 'process', 'extrem', 'pow']
['natur', 'languag', 'process', 'extrem', 'power']
['natur', 'languag', 'process', 'extrem', 'power']


Program

In [113]:
import pandas as pd
from nltk.stem import LancasterStemmer
from nltk.tokenize import word_tokenize

ls = LancasterStemmer()

df = pd.DataFrame({
    "text": [
        "Students are studying data science",
        "Organizations are growing rapidly",
        "Lancaster stemmer is aggressive"
    ]
})

df["lancaster_stemmed"] = df["text"].apply(
    lambda x: " ".join([ls.stem(word) for word in word_tokenize(x)])
)

print(df)


                                 text        lancaster_stemmed
0  Students are studying data science    stud ar study dat sci
1   Organizations are growing rapidly        org ar grow rapid
2     Lancaster stemmer is aggressive  lancast stem is aggress


### Lemmatization with NLTK

**Lemmatization** is a more sophisticated and linguistically informed process than stemming. Its goal is to reduce words to their base or dictionary form, known as a **lemma**. Unlike stemming, lemmatization aims to return a valid word in the language; NLTK's `WordNetLemmatizer` returns the input unchanged when it cannot find a lemma.

**Key Characteristics of Lemmatization:**

1.  **Linguistic Knowledge:** It uses lexical knowledge bases (like WordNet in NLTK) and morphological analysis to determine the root form of a word. This means it considers the word's meaning and part of speech.
2.  **Produces Valid Words:** When the lookup succeeds, the output is a proper dictionary word, making the results more interpretable and useful for tasks requiring high linguistic accuracy.
3.  **Context-Dependent:** It can take into account the part of speech (POS) of a word to provide a more accurate lemma. For example, 'leaves' can be the plural of 'leaf' (noun) or the third-person singular of 'leave' (verb). Lemmatization can distinguish these based on context.
4.  **Slower than Stemming:** Due to its reliance on dictionaries and more complex algorithms, lemmatization is generally slower and more computationally intensive than stemming.
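
The dictionary-based, POS-dependent behaviour described above can be illustrated with a toy lookup table. This is a sketch only: `LEMMA_TABLE` is a hand-written stand-in for a lexical database such as WordNet, and `toy_lemmatize` is a hypothetical name, not an NLTK function.

```python
# Hand-written lookup table standing in for a lexical database (assumption).
LEMMA_TABLE = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("better", "a"): "good",
    ("ran", "v"): "run",
    ("geese", "n"): "goose",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("leaves", "n"))  # treated as a noun
print(toy_lemmatize("leaves", "v"))  # treated as a verb
print(toy_lemmatize("better"))       # default noun lookup finds nothing
print(toy_lemmatize("better", "a"))  # adjective lookup succeeds
```

The same surface form maps to different lemmas depending on the part of speech, and an unsuccessful lookup falls back to the input word.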

**Comparison with Stemming:**

| Feature      | Stemming                                           | Lemmatization                                                          |
| :----------- | :------------------------------------------------- | :--------------------------------------------------------------------- |
| **Output**   | Often a truncated string (not always a valid word) | Usually a valid dictionary word (lemma)                                |
| **Method**   | Heuristic rules, suffix/prefix removal             | Dictionary-based, morphological analysis                               |
| **Speed**    | Faster                                             | Slower                                                                 |
| **Accuracy** | Less accurate, can over/under-stem                 | More accurate, linguistically sound                                    |
| **Context**  | Context-independent (each word on its own)         | Can be context-dependent (with POS tagging)                            |
| **Use Case** | Information retrieval, quick analysis              | Text classification, sentiment analysis, machine translation, chatbots |

**NLTK's `WordNetLemmatizer`**

NLTK provides the `WordNetLemmatizer` for performing lemmatization. It uses the WordNet corpus (a large lexical database of English) to look up lemmas. For best results, it's often used in conjunction with part-of-speech (POS) tagging.

In [116]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4') # Required for some WordNet functionalities

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

### Program 1: Basic Lemmatization

In [117]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words_to_lemmatize = ['running', 'runs', 'ran', 'better', 'best', 'cats', 'geese']

print("Original words and their lemmas (without POS tag):")
for word in words_to_lemmatize:
    print(f"{word}: {lemmatizer.lemmatize(word)}")

Original words and their lemmas (without POS tag):
running: running
runs: run
ran: ran
better: better
best: best
cats: cat
geese: goose


### Program 2: Lemmatization with Part-of-Speech (POS) Tagging

Lemmatization can be significantly more accurate when the Part-of-Speech (POS) of the word is provided. The `lemmatize` method accepts a `pos` argument ('n' for noun, 'v' for verb, 'a' for adjective, 'r' for adverb). If no POS is specified, it defaults to 'n' (noun).

To use POS tagging effectively, we often need to first tag the words in a sentence and then convert NLTK's POS tags to WordNet's format.
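
The tag conversion mentioned above can be sketched as a lookup on the first letter of a Penn Treebank style tag, returning the single-letter codes listed earlier ('n', 'v', 'a', 'r'). The `penn_to_wordnet` name and the hand-tagged word list are assumptions for illustration; the tags are written by hand rather than produced by a tagger.

```python
# First letter of a Penn Treebank tag -> single-letter POS code for lemmatize().
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to 'a', 'v', 'n' or 'r', defaulting to noun."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)


# Hand-tagged pairs (hypothetical tagger output, for illustration only).
tagged = [
    ("foxes", "NNS"),
    ("are", "VBP"),
    ("running", "VBG"),
    ("quickly", "RB"),
    ("beautiful", "JJ"),
    ("the", "DT"),
]

for word, tag in tagged:
    print(word, tag, "->", penn_to_wordnet(tag))
```

Tags with no counterpart, such as determiners, fall back to the noun code, which matches the default of the `lemmatize` method.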

#### Lemmatize words

In [124]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("running", pos="v"))
print(lemmatizer.lemmatize("better", pos="a"))


run
good


#### Lemmatize a Sentence

In [125]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

sentence = "Students are studying and studied machine learning"
tokens = word_tokenize(sentence)

lemmas = [lemmatizer.lemmatize(word, pos="v") for word in tokens]
print(lemmas)


['Students', 'be', 'study', 'and', 'study', 'machine', 'learn']


### Detailed Code Explanation

*   **`from nltk.stem import WordNetLemmatizer`**: This line imports the `WordNetLemmatizer` class from the `nltk.stem` module. This is the primary tool we'll use for lemmatization.

*   **`from nltk.tokenize import word_tokenize`**: This line imports the `word_tokenize` function from the `nltk.tokenize` module. This function is used to break down a sentence into individual words.

*   **`lemmatizer = WordNetLemmatizer()`**: Here, an instance of the `WordNetLemmatizer` is created. This object will be used to perform the lemmatization operations.

*   **`sentence = "Students are studying and studied machine learning"`**: This defines the input string (a sentence) that we want to lemmatize.

*   **`tokens = word_tokenize(sentence)`**: This line uses the `word_tokenize` function to split the `sentence` into a list of individual words, or 'tokens'. For example, "Students are studying..." would become `['Students', 'are', 'studying', 'and', 'studied', 'machine', 'learning']`.

*   **`lemmas = [lemmatizer.lemmatize(word, pos="v") for word in tokens]`**: This is the core of the lemmatization process. It's a list comprehension that iterates through each `word` in the `tokens` list. For each word:
    *   `lemmatizer.lemmatize(word, pos="v")` is called. The `pos="v"` argument is crucial here; it tells the lemmatizer to treat the word as a **verb**. This allows it to correctly identify the base form for verbs like "studying" (study) and "studied" (study), and "are" (be).
    *   The resulting lemma is added to the `lemmas` list.

*   **`print(lemmas)`**: Finally, this line prints the list of lemmatized words.

### Explanation of Lemmatization Output

The code produced the output `['Students', 'be', 'study', 'and', 'study', 'machine', 'learn']`.

Here's a breakdown of what happened:

*   **`'Students'`**: This word remained unchanged. WordNet has no verb "students", and the lookup is case-sensitive, so the capitalised token would not be matched as a noun either. Lowercasing it and using `pos='n'` gives "student".
*   **`'be'`**: The word 'are' was correctly lemmatized to its base verb form, 'be'.
*   **`'study'`**: Both 'studying' and 'studied' were reduced to their base verb form, 'study', because of the `pos='v'` argument.
*   **`'and'`, `'machine'`**: These are already base forms, so they were returned unchanged.
*   **`'learn'`**: In the phrase "machine learning" the word 'learning' is a noun, but because every token was forced to `pos='v'`, it was reduced to the verb 'learn'.

This output shows how the `pos` argument guides the `WordNetLemmatizer`: verbs are reduced correctly, but applying `pos='v'` to every token also mishandles words that are not verbs. Tagging each token with its own part of speech avoids this.

#### Stopwords Removal + Lemmatization

In [126]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

sentence = "Students are studying natural language processing"
tokens = word_tokenize(sentence)

processed = [
    lemmatizer.lemmatize(word, pos="v")
    for word in tokens if word.lower() not in stop_words
]

print(processed)


['Students', 'study', 'natural', 'language', 'process']


### Example: Removing Stopwords with NLTK

In [130]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Sample text
text = "This is an example sentence, demonstrating the removal of common stopwords in natural language processing."

# Get English stopwords
stop_words = set(stopwords.words('english'))

# Tokenize the text
word_tokens = word_tokenize(text)

# Filter out stopwords
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words and w.isalnum()]

print(f"Original sentence: {text}")
print(f"Filtered sentence (without stopwords): {filtered_sentence}")

Original sentence: This is an example sentence, demonstrating the removal of common stopwords in natural language processing.
Filtered sentence (without stopwords): ['example', 'sentence', 'demonstrating', 'removal', 'common', 'stopwords', 'natural', 'language', 'processing']


In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.corpus import wordnet

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN # Default to noun if POS not found

lemmatizer = WordNetLemmatizer()
sentence = "The quick brown foxes are running quickly through the beautiful forest, connecting with nature."
tokens = word_tokenize(sentence)

lemmatized_words = []
for word, tag in pos_tag(tokens):
    wordnet_pos = get_wordnet_pos(tag)
    lemmatized_words.append(lemmatizer.lemmatize(word, pos=wordnet_pos))

print(f"Original sentence: {sentence}")
print(f"Lemmatized words (with POS): {lemmatized_words}")

sentence2 = "He is driving his car, and he drove to the store."
tokens2 = word_tokenize(sentence2)
lemmatized_words2 = []
for word, tag in pos_tag(tokens2):
    wordnet_pos = get_wordnet_pos(tag)
    lemmatized_words2.append(lemmatizer.lemmatize(word, pos=wordnet_pos))

print(f"\nOriginal sentence: {sentence2}")
print(f"Lemmatized words (with POS): {lemmatized_words2}")

Original sentence: The quick brown foxes are running quickly through the beautiful forest, connecting with nature.
Lemmatized words (with POS): ['The', 'quick', 'brown', 'fox', 'be', 'run', 'quickly', 'through', 'the', 'beautiful', 'forest', ',', 'connect', 'with', 'nature', '.']

Original sentence: He is driving his car, and he drove to the store.
Lemmatized words (with POS): ['He', 'be', 'drive', 'his', 'car', ',', 'and', 'he', 'drive', 'to', 'the', 'store', '.']


### Program 3: Lemmatization on a Pandas DataFrame

In [None]:
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.corpus import wordnet

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

lemmatizer = WordNetLemmatizer()

df_lem = pd.DataFrame({
    "text": [
        "Cats are running very fast on the green grass.",
        "The women were singing and enjoying the beautiful scenery.",
        "He has a better understanding of the data."
    ]
})

def lemmatize_text(text):
    tokens = word_tokenize(text)
    lemmatized_words = []
    for word, tag in pos_tag(tokens):
        wordnet_pos = get_wordnet_pos(tag)
        lemmatized_words.append(lemmatizer.lemmatize(word, pos=wordnet_pos))
    return " ".join(lemmatized_words)

df_lem["lemmatized_text"] = df_lem["text"].apply(lemmatize_text)

print(df_lem)

                                                text  \
0     Cats are running very fast on the green grass.   
1  The women were singing and enjoying the beauti...   
2         He has a better understanding of the data.   

                                     lemmatized_text  
0         Cats be run very fast on the green grass .  
1  The woman be sing and enjoy the beautiful scen...  
2         He have a good understanding of the data .  


In [127]:
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

df = pd.DataFrame({
    "text": [
        "Students are studying NLP",
        "Researchers analyzed datasets",
        "Machines are learning patterns"
    ]
})

df["lemmatized_text"] = df["text"].apply(
    lambda x: " ".join([lemmatizer.lemmatize(word, pos="v") for word in word_tokenize(x)])
)

print(df)


                             text               lemmatized_text
0       Students are studying NLP         Students be study NLP
1   Researchers analyzed datasets  Researchers analyze datasets
2  Machines are learning patterns     Machines be learn pattern


### Stopwords

**Stopwords** are common words that appear frequently in any language but often carry little or no significant meaning for text analysis tasks. These words are typically filtered out from text before processing, as they can add noise and unnecessary computational overhead without contributing much to the overall understanding or differentiation of documents.

**Why Remove Stopwords?**

1.  **Reduce Noise:** Stopwords are very common (e.g., 'the', 'is', 'a', 'an', 'in') and don't usually help in distinguishing between different documents or topics. Removing them helps focus on more meaningful terms.
2.  **Reduce Dimensionality:** The stopword list itself is short (roughly 200 entries in NLTK's English list), so removing it shrinks the vocabulary (feature space) only slightly. The larger saving is in token count: because these words are so frequent, filtering them shortens every document, which reduces processing time and memory usage on large datasets.
3.  **Improve Performance:** In tasks that rely on word counts, such as bag-of-words text classification and information retrieval, stopwords can dominate the counts without helping to separate documents, so removing them often helps. It is not always safe, though: the list includes negations such as 'not' and 'no', and dropping them can reverse the meaning of a sentence in sentiment analysis.
4.  **Focus on Key Terms:** It allows analysis to concentrate on the more important, content-bearing words, providing a clearer picture of the text's subject matter.

**Common Stopwords:**

Examples of English stopwords include: 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now'.

**NLTK's Stopwords Corpus:**

NLTK provides a pre-defined list of stopwords for various languages in its `stopwords` corpus. You can easily access and use this list to filter stopwords from your text.

In [131]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

🔹 List Available Languages

In [132]:
from nltk.corpus import stopwords

print(stopwords.fileids())


['albanian', 'arabic', 'azerbaijani', 'basque', 'belarusian', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'tamil', 'turkish', 'uzbek']


🔹 Get English Stopwords

In [133]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
print(len(stop_words))
print(list(stop_words)[:10])


198
["i'll", 'than', 'the', 'off', 'as', 'where', 'o', "should've", 'for', 'again']


In [137]:
txt="This is not a good time to talk."
txt=word_tokenize(txt)
print(txt)

['This', 'is', 'not', 'a', 'good', 'time', 'to', 'talk', '.']


In [138]:
for word in txt:
    if word.lower() not in stop_words:
        print(word)

good
time
talk
.
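Note that `not` was filtered out in the output above, so the remaining tokens read like a positive sentence. One possible workaround, shown here as a minimal sketch, is to subtract negation words from the stopword set before filtering. `BASE_STOPWORDS` and `NEGATIONS` are small hand-picked sets used only for illustration (assumptions, not NLTK's list); with NLTK the same idea would be `set(stopwords.words('english')) - NEGATIONS`.

```python
# Hypothetical, hand-picked subset standing in for NLTK's English stopword list.
BASE_STOPWORDS = {"this", "is", "not", "a", "to", "no", "nor"}
# Negation words we choose to keep (an assumption for this sketch).
NEGATIONS = {"not", "no", "nor"}

# Set difference: every base stopword except the negations.
custom_stop_words = BASE_STOPWORDS - NEGATIONS

tokens = ['This', 'is', 'not', 'a', 'good', 'time', 'to', 'talk', '.']
kept = [tok for tok in tokens if tok.lower() not in custom_stop_words]
print(kept)  # ['not', 'good', 'time', 'talk', '.']
```

Whether negations should be kept depends on the task; for topic modelling they rarely matter, while for sentiment analysis they usually do.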


In [139]:
'This'.lower()=='this'

True

In [140]:
txt2="This is not a good time to talk.Can we Do it now ?"
txt2=word_tokenize(txt2)
print(txt2)

['This', 'is', 'not', 'a', 'good', 'time', 'to', 'talk.Can', 'we', 'Do', 'it', 'now', '?']


In [141]:
for word in txt2:
    if word.lower() not in stop_words:
        print(word)

good
time
talk.Can
?
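In the output above, `talk.Can` survives as a single token because the input has no space after the period, so the stopword `can` is never matched. A rough sketch of one way to repair such text before tokenizing is shown below. The helper name `add_space_after_sentence_punct` is hypothetical, and the regular expression is deliberately simple: it would also split abbreviations such as "e.g." and domain names, so treat it as an illustration rather than a general fix.

```python
import re

def add_space_after_sentence_punct(text):
    """Insert a space after '.', '!' or '?' when a letter follows immediately."""
    return re.sub(r"([.!?])(?=[A-Za-z])", r"\1 ", text)

raw = "This is not a good time to talk.Can we Do it now ?"
fixed = add_space_after_sentence_punct(raw)
print(fixed)  # This is not a good time to talk. Can we Do it now ?
```

Passing `fixed` to the tokenizer instead of `raw` should then yield `talk`, `.` and `Can` as separate tokens.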


### 🛑 Program 1: Remove Stopwords from a Sentence

In [134]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

sentence = "This is a simple example to understand stopwords"
tokens = word_tokenize(sentence)

filtered_words = [word for word in tokens if word.lower() not in stop_words]
print(filtered_words)


['simple', 'example', 'understand', 'stopwords']


### 🛑 Program 2: Stopwords + Stemming

In [135]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

sentence = "Students are studying and learning natural language processing"
tokens = word_tokenize(sentence)

processed = [
    ps.stem(word) for word in tokens if word.lower() not in stop_words
]

print(processed)


['student', 'studi', 'learn', 'natur', 'languag', 'process']


### 🛑 Program 3: Stopwords + Lemmatization

In [136]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

sentence = "Researchers are analyzing large datasets in NLP"
tokens = word_tokenize(sentence)

processed = [
    lemmatizer.lemmatize(word, pos="v")
    for word in tokens if word.lower() not in stop_words
]

print(processed)


['Researchers', 'analyze', 'large', 'datasets', 'NLP']


### 🛑 Program 4: Stopwords on Pandas DataFrame

In [144]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

df = pd.DataFrame({
    "text": [
        "This is an NLP example",
        "Stopwords removal improves models",
        "Not all stopwords should be removed"
    ]
})

df["clean_text"] = df["text"].apply(
    lambda x: " ".join(
        [word for word in word_tokenize(x) if word.lower() not in stop_words]
    )
)

print(df)


                                  text                         clean_text
0               This is an NLP example                        NLP example
1    Stopwords removal improves models  Stopwords removal improves models
2  Not all stopwords should be removed                  stopwords removed


### Corpus (in NLP)

A **corpus** (plural: **corpora**) is a large and structured set of texts. In the field of Natural Language Processing (NLP) and computational linguistics, a corpus serves as a fundamental resource for studying language, developing models, and evaluating algorithms.

Think of it as a carefully curated collection of written or spoken language data, often gathered for specific research purposes.

**Key Characteristics of a Corpus:**

1.  **Size:** Corpora are typically large, ranging from about a million words (the Brown Corpus) to billions. A large sample makes statistical analysis of linguistic patterns reliable, although size alone does not guarantee that the data is representative of the language being studied.
2.  **Representativeness:** A good corpus should be representative of the language or the specific domain it aims to cover. For example, a corpus designed for general English might include texts from various genres (news, fiction, academic papers, conversations), while a specialized medical corpus would focus on medical literature.
3.  **Machine-Readable:** Corpora are stored in electronic formats that can be processed and analyzed by computers. This is crucial for applying NLP techniques.
4.  **Annotation (Optional but Common):** Many corpora are enriched with linguistic annotations, which means they have extra information added to them. Common types of annotation include:
    *   **Part-of-Speech (POS) Tagging:** Marking each word with its grammatical category (e.g., noun, verb, adjective).
    *   **Lemmatization/Stemming:** Providing the base form of words.
    *   **Syntactic Parsing:** Analyzing the grammatical structure of sentences.
    *   **Named Entity Recognition (NER):** Identifying proper nouns like person names, organizations, and locations.
    *   **Semantic Annotation:** Marking word senses or thematic roles.
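To make the idea of annotation concrete, here is a minimal sketch of how a POS-annotated sentence can be represented and queried in plain Python. The sentence and its Penn Treebank-style tags are hand-written for illustration (an assumption, not taken from any real corpus); NLTK exposes annotated corpora in a similar `(word, tag)` shape, for example through `brown.tagged_words()`.

```python
from collections import Counter

# Hypothetical hand-annotated sentence: (token, POS tag) pairs.
tagged_sentence = [
    ("The", "DT"), ("jury", "NN"), ("praised", "VBD"),
    ("the", "DT"), ("investigation", "NN"), (".", "."),
]

# How often each tag occurs in the annotated sentence.
tag_counts = Counter(tag for _, tag in tagged_sentence)

# The annotation lets us select tokens by grammatical category.
nouns = [token for token, tag in tagged_sentence if tag.startswith("NN")]

print(tag_counts)
print(nouns)  # ['jury', 'investigation']
```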

**Types of Corpora:**

*   **General Corpora:** Aim to represent a broad range of language use (e.g., Brown Corpus, British National Corpus).
*   **Specialized Corpora:** Focus on specific domains, genres, or time periods (e.g., medical texts, legal documents, historical literature).
*   **Monolingual Corpora:** Contain texts in a single language.
*   **Multilingual/Parallel Corpora:** Contain texts in two or more languages, often with sentences or paragraphs aligned for translation tasks.
*   **Annotated Corpora:** Corpora with added linguistic information.

**Why are Corpora Important in NLP?**

1.  **Training Data:** They are essential for training machine learning models used in NLP tasks such as text classification, sentiment analysis, machine translation, speech recognition, and more.
2.  **Linguistic Research:** Linguists use corpora to study language patterns, grammar, vocabulary, and semantic change.
3.  **Algorithm Development:** Developers test and refine NLP algorithms on corpora to ensure they perform accurately and efficiently.
4.  **Lexicography:** Creating dictionaries and lexical resources relies heavily on corpora to identify common words, phrases, and their uses.

**NLTK and Corpora:**

NLTK (Natural Language Toolkit) provides easy access to a variety of corpora and lexical resources. These resources are not installed with the library itself: calls such as `nltk.download('wordnet')` or `nltk.download('stopwords')` fetch the individual packages into a local `nltk_data` directory, as the download logs in this notebook show. Once downloaded, these ready-made corpora let users start experimenting with NLP without having to build their own datasets from scratch.

In [146]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [147]:
from nltk.corpus import brown

print(brown.fileids()[:5])
print(brown.words()[:10])

['ca01', 'ca02', 'ca03', 'ca04', 'ca05']
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']


In [149]:
import nltk
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [150]:
from nltk.corpus import gutenberg

print(gutenberg.fileids())

words = gutenberg.words('austen-emma.txt')
print(words[:20])

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER', 'I', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich']


In [152]:
corpus="Gutenberg's invention of the mechanical movable type printing press around 1440 sparked a print revolution, transforming hand-copied manuscripts into mass-produced, cheaper books like the Gutenberg Bible, fostering literacy, spreading knowledge, and enabling the Reformation and Renaissance by creating a new reading public and culture of debate. This technology allowed for rapid production, making books accessible beyond the elite and fundamentally shifting societies from oral traditions to print culture"

In [153]:
corpus

"Gutenberg's invention of the mechanical movable type printing press around 1440 sparked a print revolution, transforming hand-copied manuscripts into mass-produced, cheaper books like the Gutenberg Bible, fostering literacy, spreading knowledge, and enabling the Reformation and Renaissance by creating a new reading public and culture of debate. This technology allowed for rapid production, making books accessible beyond the elite and fundamentally shifting societies from oral traditions to print culture"

In [154]:
words=word_tokenize(corpus)

In [155]:
words

['Gutenberg',
 "'s",
 'invention',
 'of',
 'the',
 'mechanical',
 'movable',
 'type',
 'printing',
 'press',
 'around',
 '1440',
 'sparked',
 'a',
 'print',
 'revolution',
 ',',
 'transforming',
 'hand-copied',
 'manuscripts',
 'into',
 'mass-produced',
 ',',
 'cheaper',
 'books',
 'like',
 'the',
 'Gutenberg',
 'Bible',
 ',',
 'fostering',
 'literacy',
 ',',
 'spreading',
 'knowledge',
 ',',
 'and',
 'enabling',
 'the',
 'Reformation',
 'and',
 'Renaissance',
 'by',
 'creating',
 'a',
 'new',
 'reading',
 'public',
 'and',
 'culture',
 'of',
 'debate',
 '.',
 'This',
 'technology',
 'allowed',
 'for',
 'rapid',
 'production',
 ',',
 'making',
 'books',
 'accessible',
 'beyond',
 'the',
 'elite',
 'and',
 'fundamentally',
 'shifting',
 'societies',
 'from',
 'oral',
 'traditions',
 'to',
 'print',
 'culture']

In [156]:
for word in words:
    if word.lower() not in stop_words:
        print(word)

Gutenberg
's
invention
mechanical
movable
type
printing
press
around
1440
sparked
print
revolution
,
transforming
hand-copied
manuscripts
mass-produced
,
cheaper
books
like
Gutenberg
Bible
,
fostering
literacy
,
spreading
knowledge
,
enabling
Reformation
Renaissance
creating
new
reading
public
culture
debate
.
technology
allowed
rapid
production
,
making
books
accessible
beyond
elite
fundamentally
shifting
societies
oral
traditions
print
culture


In [158]:
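# Build the stopword set once; calling stopwords.words('english') inside the
# condition would rebuild the list for every token and scan it linearly.
english_stopwords = set(stopwords.words('english'))
for word in word_tokenize(corpus):
    if word.lower() not in english_stopwords:
        print(word)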

Gutenberg
's
invention
mechanical
movable
type
printing
press
around
1440
sparked
print
revolution
,
transforming
hand-copied
manuscripts
mass-produced
,
cheaper
books
like
Gutenberg
Bible
,
fostering
literacy
,
spreading
knowledge
,
enabling
Reformation
Renaissance
creating
new
reading
public
culture
debate
.
technology
allowed
rapid
production
,
making
books
accessible
beyond
elite
fundamentally
shifting
societies
oral
traditions
print
culture


In [159]:
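for word in word_tokenize(corpus):
    # Skip stopwords and single-character tokens (mostly punctuation).
    # stop_words is the set built earlier, so each lookup is a fast set membership test.
    if word.lower() not in stop_words and len(word) >= 2:
        print(word)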

Gutenberg
's
invention
mechanical
movable
type
printing
press
around
1440
sparked
print
revolution
transforming
hand-copied
manuscripts
mass-produced
cheaper
books
like
Gutenberg
Bible
fostering
literacy
spreading
knowledge
enabling
Reformation
Renaissance
creating
new
reading
public
culture
debate
technology
allowed
rapid
production
making
books
accessible
beyond
elite
fundamentally
shifting
societies
oral
traditions
print
culture


In [160]:
words = []

for word in word_tokenize(corpus):
    # Keep lowercased tokens that are not stopwords and have at least 2 characters.
    if word.lower() not in stop_words and len(word) >= 2:
        words.append(word.lower())

This list still contains repeated words.

In [164]:
len(words)

51

In [161]:
words

['gutenberg',
 "'s",
 'invention',
 'mechanical',
 'movable',
 'type',
 'printing',
 'press',
 'around',
 '1440',
 'sparked',
 'print',
 'revolution',
 'transforming',
 'hand-copied',
 'manuscripts',
 'mass-produced',
 'cheaper',
 'books',
 'like',
 'gutenberg',
 'bible',
 'fostering',
 'literacy',
 'spreading',
 'knowledge',
 'enabling',
 'reformation',
 'renaissance',
 'creating',
 'new',
 'reading',
 'public',
 'culture',
 'debate',
 'technology',
 'allowed',
 'rapid',
 'production',
 'making',
 'books',
 'accessible',
 'beyond',
 'elite',
 'fundamentally',
 'shifting',
 'societies',
 'oral',
 'traditions',
 'print',
 'culture']

In [162]:
set(words)

{"'s",
 '1440',
 'accessible',
 'allowed',
 'around',
 'beyond',
 'bible',
 'books',
 'cheaper',
 'creating',
 'culture',
 'debate',
 'elite',
 'enabling',
 'fostering',
 'fundamentally',
 'gutenberg',
 'hand-copied',
 'invention',
 'knowledge',
 'like',
 'literacy',
 'making',
 'manuscripts',
 'mass-produced',
 'mechanical',
 'movable',
 'new',
 'oral',
 'press',
 'print',
 'printing',
 'production',
 'public',
 'rapid',
 'reading',
 'reformation',
 'renaissance',
 'revolution',
 'shifting',
 'societies',
 'sparked',
 'spreading',
 'technology',
 'traditions',
 'transforming',
 'type'}

Unique words

In [163]:
len(set(words))

47
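The gap between the token count and the number of unique words comes from tokens that occur more than once. A small sketch with `collections.Counter` shows how to find them. The `words_sample` list below is a shortened, hand-picked excerpt of the filtered tokens (an assumption made to keep the example self-contained); the same code works on the full `words` list.

```python
from collections import Counter

# Shortened excerpt of the filtered, lowercased tokens (illustrative only).
words_sample = ['gutenberg', 'invention', 'print', 'books', 'gutenberg',
                'bible', 'culture', 'books', 'print', 'culture']

counts = Counter(words_sample)

# Tokens that appear more than once account for the difference
# between len(words_sample) and len(set(words_sample)).
repeated = {word: n for word, n in counts.items() if n > 1}

print(len(words_sample), len(counts))  # 10 6
print(repeated)
```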

In [165]:
vocab=list(set(words))

In [166]:
vocab

['mechanical',
 'reading',
 'revolution',
 'manuscripts',
 'elite',
 'renaissance',
 'public',
 'fostering',
 'allowed',
 'fundamentally',
 'knowledge',
 'printing',
 'cheaper',
 'beyond',
 'enabling',
 'societies',
 'debate',
 'shifting',
 'press',
 'invention',
 'transforming',
 "'s",
 'gutenberg',
 'oral',
 '1440',
 'rapid',
 'culture',
 'accessible',
 'new',
 'technology',
 'like',
 'books',
 'mass-produced',
 'bible',
 'movable',
 'creating',
 'print',
 'traditions',
 'production',
 'type',
 'around',
 'hand-copied',
 'literacy',
 'reformation',
 'sparked',
 'making',
 'spreading']

### Vocabulary (in NLP)

In Natural Language Processing (NLP), **vocabulary** refers to the set of all unique words (or tokens) found in a given text corpus or dataset. It's essentially a dictionary of all the distinct words that an NLP model can recognize and process.

**Key Aspects of Vocabulary:**

1.  **Unique Words:** Each word in the vocabulary is unique, regardless of how many times it appears in the corpus. For example, if 'run' appears 100 times and 'running' appears 50 times, both 'run' and 'running' would be distinct entries in the vocabulary (unless processed by stemming or lemmatization).

2.  **Size:** The size of a vocabulary can vary greatly depending on the size and diversity of the text corpus. Large, diverse corpora will yield larger vocabularies. This is often a critical factor in NLP, as larger vocabularies can lead to more complex models and computational challenges.

3.  **Tokenization:** The process of defining what constitutes a 'word' (or token) is crucial before building a vocabulary. Different tokenization methods (e.g., word tokenization, subword tokenization) will result in different vocabularies.

4.  **Normalization:** Text normalization techniques like lowercasing, stemming, and lemmatization directly impact vocabulary size and content. For example, if all words are lowercased, 'The' and 'the' become a single entry. If stemming is applied, 'running' and 'runs' both map to 'run', reducing vocabulary size. An irregular form such as 'ran' is not changed by a suffix-stripping stemmer; it needs lemmatization with a verb POS tag to reach 'run'.

5.  **Out-of-Vocabulary (OOV) Words:** Words encountered during inference (when using a trained model) that were not present in the training vocabulary are called Out-Of-Vocabulary (OOV) words. Handling OOV words is a significant challenge in NLP, often addressed through techniques like subword tokenization or replacing them with an `<UNK>` (unknown) token.
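The `<UNK>` fallback described in the last point can be sketched as a word-to-index mapping. The token list and the choice of reserving index 0 for `<UNK>` are assumptions made for this illustration; real pipelines differ in how they order and reserve indices.

```python
# Hypothetical training tokens; index 0 is reserved for unknown words.
UNK = "<UNK>"
training_tokens = ["books", "print", "culture", "books", "press"]

# Vocabulary: the unknown marker first, then the unique words in sorted order.
vocab = [UNK] + sorted(set(training_tokens))
word_to_index = {word: idx for idx, word in enumerate(vocab)}

def encode(tokens):
    """Map tokens to indices, falling back to the <UNK> index for OOV words."""
    return [word_to_index.get(tok, word_to_index[UNK]) for tok in tokens]

print(word_to_index)
print(encode(["print", "radio", "books"]))  # [4, 0, 1]
```

Here `radio` never appeared in the training tokens, so it is encoded as index 0.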

**Importance of Vocabulary in NLP:**

*   **Feature Representation:** In many NLP tasks, words are converted into numerical representations (e.g., one-hot encodings, word embeddings). The vocabulary defines the mapping from words to these numerical indices or vectors.
*   **Model Performance:** A well-constructed vocabulary is essential for training robust NLP models. A vocabulary that is too small might miss important linguistic nuances, while one that is too large can lead to sparsity issues and increased computational cost.
*   **Language Understanding:** The vocabulary dictates the scope of what a language model can 'understand' or generate. It's the foundation upon which more complex linguistic structures are built.
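As a rough illustration of the feature-representation point, the sketch below turns words into one-hot vectors and a token list into a bag-of-words count vector over a fixed vocabulary. The four-word `vocab` and the helper names `one_hot` and `bag_of_words` are hypothetical, chosen only for this example.

```python
from collections import Counter

# Hypothetical fixed vocabulary; position in the list is the feature index.
vocab = ["books", "culture", "press", "print"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(tokens):
    """Count in-vocabulary tokens; words outside the vocabulary are ignored."""
    counts = Counter(tok for tok in tokens if tok in index)
    return [counts[word] for word in vocab]

print(one_hot("press"))                                   # [0, 0, 1, 0]
print(bag_of_words(["print", "books", "print", "radio"]))  # [1, 0, 0, 2]
```

Each vector has one slot per vocabulary entry, which is why a very large vocabulary leads to long, sparse vectors.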

In [167]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Assuming 'corpus' is already defined from previous cells
# corpus = "Gutenberg's invention of the mechanical movable type printing press..."

# Get English stopwords
stop_words = set(stopwords.words('english'))

# Tokenize and normalize the text
words = []
for word in word_tokenize(corpus):
    # Convert to lowercase, remove stopwords, and filter words shorter than 2 characters
    if word.lower() not in stop_words and len(word) >= 2:
        words.append(word.lower())

# Create the vocabulary (set of unique words)
vocabulary = sorted(set(words))

print(f"Original text length (in words, after initial filtering): {len(words)}")
print(f"Vocabulary size (unique words): {len(vocabulary)}")
print("\nFirst 10 words in the vocabulary:")
print(vocabulary[:10])

print("\nLast 10 words in the vocabulary:")
print(vocabulary[-10:])

Original text length (in words, after initial filtering): 51
Vocabulary size (unique words): 47

First 10 words in the vocabulary:
["'s", '1440', 'accessible', 'allowed', 'around', 'beyond', 'bible', 'books', 'cheaper', 'creating']

Last 10 words in the vocabulary:
['renaissance', 'revolution', 'shifting', 'societies', 'sparked', 'spreading', 'technology', 'traditions', 'transforming', 'type']


In [168]:
from nltk.tokenize import sent_tokenize

for sent in sent_tokenize(corpus):
    print(sent)

Gutenberg's invention of the mechanical movable type printing press around 1440 sparked a print revolution, transforming hand-copied manuscripts into mass-produced, cheaper books like the Gutenberg Bible, fostering literacy, spreading knowledge, and enabling the Reformation and Renaissance by creating a new reading public and culture of debate.
This technology allowed for rapid production, making books accessible beyond the elite and fundamentally shifting societies from oral traditions to print culture
