# Natural Language Toolkit (NLTK):
NLTK (Natural Language Toolkit) is a comprehensive platform for building Python programs to work with human language data. It's widely used in academia and industry for tasks ranging from simple text processing to advanced natural language understanding. 

# Features:

### 1. Corpora: 
NLTK provides access to over 50 corpora and lexical resources, including WordNet, which is a large lexical database of English. These corpora cover a wide range of text genres and languages, facilitating research and development in NLP.

### 2. Tokenization: 
NLTK offers tokenization tools for splitting text into words or sentences. This is a fundamental step in most NLP tasks, and NLTK provides various tokenizers to suit different needs.

### 3. Stemming and Lemmatization: 
NLTK includes modules for stemming, which strips suffixes to reduce words to a stem that need not be a dictionary word (e.g., "running" to "run", "natural" to "natur"), and lemmatization, which maps words to their base or dictionary form (e.g., "ran" to "run", provided the word is treated as a verb). These processes are essential for text normalization.

### 4. Part-of-Speech (POS) Tagging: 
NLTK provides tools for tagging words in a text with their corresponding part-of-speech (e.g., noun, verb, adjective). POS tagging is crucial for many NLP tasks, such as syntactic parsing, information extraction, and sentiment analysis.

### 5. Chunking and Parsing: 
NLTK supports chunking and parsing, which involve identifying syntactic structures in sentences. Chunking groups words into meaningful chunks (e.g., noun phrases, verb phrases), while parsing analyzes the grammatical structure of sentences.

### 6. Named Entity Recognition (NER): 
NLTK includes tools for identifying and classifying named entities in text, such as names of people, organizations, locations, dates, and more. NER is essential for tasks like information extraction, entity linking, and question answering.

### 7. Text Classification: 
NLTK supports text classification tasks, such as sentiment analysis, topic classification, and spam detection. It provides algorithms and utilities for feature extraction, model training, and evaluation.
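
As a minimal sketch of the feature extraction, training, and prediction steps mentioned above, the snippet below fits NLTK's `NaiveBayesClassifier` on a tiny hand-made spam/ham dataset. The example messages and the `bag_of_words` helper are illustrative assumptions, not part of NLTK, and a realistic task would need far more data.

```python
from nltk.classify import NaiveBayesClassifier


def bag_of_words(text):
    """Hypothetical feature extractor: mark each lowercase word as present."""
    return {f"contains({word})": True for word in text.lower().split()}


# Toy training data, invented for illustration only
train_data = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch with the team", "ham"),
]
train_set = [(bag_of_words(text), label) for text, label in train_data]

# Train the classifier on (feature dict, label) pairs
classifier = NaiveBayesClassifier.train(train_set)

# Classify two unseen messages
spam_guess = classifier.classify(bag_of_words("claim your free money"))
ham_guess = classifier.classify(bag_of_words("team meeting tomorrow"))
print(spam_guess, ham_guess)
```

Words that never appeared in training (such as "your") are simply ignored at prediction time.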

### 8. Language Models: 
NLTK allows you to build and train language models for tasks like language generation, machine translation, and spell checking. It includes tools for n-gram modeling, probabilistic context-free grammars (PCFGs), and more.
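
The n-gram modeling mentioned above can be sketched with the `nltk.lm` package. The snippet trains a bigram maximum-likelihood model on a three-sentence toy corpus (invented here for illustration) and queries one conditional probability. Treat it as a sketch, not a recommended setup: an unsmoothed MLE model assigns zero probability to unseen bigrams.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy pre-tokenized corpus, invented for illustration only
corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]

order = 2  # bigram model

# Pad each sentence with start/end symbols and build unigrams and bigrams
train_ngrams, vocab = padded_everygram_pipeline(order, corpus)

model = MLE(order)
model.fit(train_ngrams, vocab)

# P(cat | the): "the" is followed by "cat" in 2 of its 3 occurrences
p_cat_given_the = model.score("cat", ["the"])
print(round(p_cat_given_the, 3))
```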

### 9. WordNet Interface: 
NLTK provides an interface to WordNet, a lexical database of English. WordNet organizes words into synsets (sets of synonyms) and provides information about word meanings, relationships, and semantic similarity.

### 10. Integration with Other Libraries: 
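NLTK can be combined with other Python libraries and tools, such as scikit-learn, TensorFlow, and spaCy. For example, `nltk.classify.scikitlearn.SklearnClassifier` wraps scikit-learn estimators so they can be trained on NLTK-style feature sets, which lets you pair NLTK's preprocessing with other machine learning and deep learning tooling.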

# Get it now
- `pip install nltk`

### Corpora

In [1]:
import nltk

# Download required NLTK data
nltk.download('punkt')  # Tokenizers
nltk.download('wordnet')  # WordNet
nltk.download('stopwords')  # Stopwords
nltk.download('averaged_perceptron_tagger')  # POS Tagger
nltk.download('maxent_ne_chunker')  # Named Entity Chunker
nltk.download('words')  # Word List

# Accessing a specific corpus - Brown Corpus
from nltk.corpus import brown

# Access the words in the Brown Corpus
brown_words = brown.words()

# Print the first 20 words in the Brown Corpus
print("First 20 words in Brown Corpus:", brown_words[:20])

# Accessing a specific corpus - WordNet
from nltk.corpus import wordnet

# Look up the synsets (sets of synonymous word senses) for a word using WordNet
synonyms = wordnet.synsets('good')

# Print the synset names for 'good' (e.g. 'good.n.01' is its first noun sense)
print("Synonyms of 'good':", [synonym.name() for synonym in synonyms])

# Accessing a specific corpus - stopwords
from nltk.corpus import stopwords

# Get English stopwords
stop_words = set(stopwords.words('english'))

# Print 10 English stopwords (a set is unordered, so the selection can differ between runs)
print("First 10 English stopwords:", list(stop_words)[:10])


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-

First 20 words in Brown Corpus: ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that']
Synonyms of 'good': ['good.n.01', 'good.n.02', 'good.n.03', 'commodity.n.01', 'good.a.01', 'full.s.06', 'good.a.03', 'estimable.s.02', 'beneficial.s.01', 'good.s.06', 'good.s.07', 'adept.s.01', 'good.s.09', 'dear.s.02', 'dependable.s.04', 'good.s.12', 'good.s.13', 'effective.s.04', 'good.s.15', 'good.s.16', 'good.s.17', 'good.s.18', 'good.s.19', 'good.s.20', 'good.s.21', 'well.r.01', 'thoroughly.r.02']
First 10 English stopwords: ['theirs', 'only', 'before', 'mightn', 'out', 'how', 'at', 'he', 'i', 'about']


### Tokenization

In [5]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

text = "NLTK (Natural Language Toolkit) is a comprehensive platform for building Python programs to work with human language data. It's widely used in academia and industry for tasks ranging from simple text processing to advanced natural language understanding. NLTK is a leading platform for building Python programs to work with human language data."
words = word_tokenize(text)
sentences = sent_tokenize(text)

print("Words:", words)
print("Sentences:", sentences)


Words: ['NLTK', '(', 'Natural', 'Language', 'Toolkit', ')', 'is', 'a', 'comprehensive', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.', 'It', "'s", 'widely', 'used', 'in', 'academia', 'and', 'industry', 'for', 'tasks', 'ranging', 'from', 'simple', 'text', 'processing', 'to', 'advanced', 'natural', 'language', 'understanding', '.', 'NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']
Sentences: ['NLTK (Natural Language Toolkit) is a comprehensive platform for building Python programs to work with human language data.', "It's widely used in academia and industry for tasks ranging from simple text processing to advanced natural language understanding.", 'NLTK is a leading platform for building Python programs to work with human language data.']
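
`word_tokenize` and `sent_tokenize` rely on the downloaded `punkt` models. As a small sketch of the other tokenizers NLTK offers, the rule-based ones below need no downloaded data and split the contraction "It's" differently from the output above. The sample sentence is shortened from the text used in this notebook.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

sample = "It's widely used."

# Keep only runs of word characters, which drops punctuation entirely
word_only = RegexpTokenizer(r"\w+").tokenize(sample)
print(word_only)   # ['It', 's', 'widely', 'used']

# Split into alphanumeric runs and punctuation runs
with_punct = wordpunct_tokenize(sample)
print(with_punct)  # ['It', "'", 's', 'widely', 'used', '.']
```

Which tokenizer fits depends on whether punctuation and contractions matter for the downstream task.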


### Stopwords Removal

In [6]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]

print("Filtered Words:", filtered_words)


Filtered Words: ['NLTK', '(', 'Natural', 'Language', 'Toolkit', ')', 'comprehensive', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.', "'s", 'widely', 'used', 'academia', 'industry', 'tasks', 'ranging', 'simple', 'text', 'processing', 'advanced', 'natural', 'language', 'understanding', '.', 'NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
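
The filtered list above still contains punctuation tokens such as `(`, `.` and `'s`. One possible refinement, sketched below, is to keep only alphabetic tokens in the same pass. The short token list and the three-word stop word set are stand-ins chosen so the snippet runs without any downloads. In the notebook, `words` and `set(stopwords.words('english'))` would take their place.

```python
# A short token list in the style of the tokenizer output above
tokens = ["NLTK", "(", "Natural", "Language", "Toolkit", ")", "is", "a",
          "comprehensive", "platform", ".", "It", "'s", "widely", "used", "."]

# Stand-in stop word set (assumption); the notebook uses
# set(stopwords.words('english')) instead
stop_words = {"is", "a", "it"}

# Keep purely alphabetic tokens that are not stop words
clean_words = [
    token for token in tokens
    if token.isalpha() and token.lower() not in stop_words
]
print(clean_words)
```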


### Stemming / Lemmatization

In [7]:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Stemming
ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in words]

print("Stemmed Words:", stemmed_words)

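# Lemmatization (lemmatize() assumes pos='n' by default, so verb forms such as 'used' stay unchanged unless a POS is passed)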
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print("Lemmatized Words:", lemmatized_words)


Stemmed Words: ['nltk', '(', 'natur', 'languag', 'toolkit', ')', 'is', 'a', 'comprehens', 'platform', 'for', 'build', 'python', 'program', 'to', 'work', 'with', 'human', 'languag', 'data', '.', 'it', "'s", 'wide', 'use', 'in', 'academia', 'and', 'industri', 'for', 'task', 'rang', 'from', 'simpl', 'text', 'process', 'to', 'advanc', 'natur', 'languag', 'understand', '.', 'nltk', 'is', 'a', 'lead', 'platform', 'for', 'build', 'python', 'program', 'to', 'work', 'with', 'human', 'languag', 'data', '.']
Lemmatized Words: ['NLTK', '(', 'Natural', 'Language', 'Toolkit', ')', 'is', 'a', 'comprehensive', 'platform', 'for', 'building', 'Python', 'program', 'to', 'work', 'with', 'human', 'language', 'data', '.', 'It', "'s", 'widely', 'used', 'in', 'academia', 'and', 'industry', 'for', 'task', 'ranging', 'from', 'simple', 'text', 'processing', 'to', 'advanced', 'natural', 'language', 'understanding', '.', 'NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'program', 'to', 'work'
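
NLTK ships several stemmers besides `PorterStemmer`. The sketch below runs the Porter and Snowball (English) stemmers side by side on a few words, two of which come from the text above. Neither stemmer needs downloaded data. The word list is an arbitrary choice for illustration.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Print each word with its Porter stem and its Snowball stem
for word in ["running", "natural", "language", "generously"]:
    print(word, "->", porter.stem(word), "|", snowball.stem(word))
```

Stems such as "natur" and "languag" are not dictionary words, which is the main practical difference from lemmatization.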

### Dependency Parsing

In [8]:
# StanfordDependencyParser is deprecated; NLTK recommends nltk.parse.corenlp.CoreNLPDependencyParser instead
from nltk.parse.stanford import StanfordDependencyParser

# Example sentence
sentence = "The quick brown fox jumps over the lazy dog."

# Placeholder paths to the Stanford Parser jars: download the parser separately and replace these with the real locations
path_to_jar = '/path/to/stanford-parser.jar'
path_to_models_jar = '/path/to/stanford-parser-3.9.2-models.jar'

# Create Stanford Dependency Parser
dep_parser = StanfordDependencyParser(path_to_jar=path_to_jar, path_to_models_jar=path_to_models_jar)

# Perform dependency parsing
result = dep_parser.raw_parse(sentence)

# Get first parsed result (there could be multiple depending on the sentence complexity)
dep_tree = next(result)

# Print dependency tree
print(dep_tree.to_dot())

# Alternatively, you can print the list of tuples representing dependencies
# for triple in dep_tree.triples():
#     print(triple)


Please use nltk.parse.corenlp.CoreNLPDependencyParser instead.
  dep_parser = StanfordDependencyParser(path_to_jar=path_to_jar, path_to_models_jar=path_to_models_jar)


LookupError: Could not find stanford-parser\.jar jar file at /path/to/stanford-parser.jar
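
The cell above fails because the jar paths are placeholders. To illustrate the data structure a dependency parser returns without installing one, the sketch below builds a `DependencyGraph` from a hand-written parse of a shortened sentence. The annotation is an assumption made for illustration, not parser output.

```python
from nltk.parse.dependencygraph import DependencyGraph

# Hand-written parse (assumption, not parser output).
# Columns: word, POS tag, index of the head word (0 = root), relation
conll = """The DT 2 det
fox NN 3 nsubj
jumps VBZ 0 ROOT
"""

graph = DependencyGraph(conll)

# Each triple is ((head word, head tag), relation, (dependent word, dependent tag))
triples = list(graph.triples())
for triple in triples:
    print(triple)
```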

### Part-of-Speech (POS) Tagging

In [9]:
from nltk import pos_tag

pos_tags = pos_tag(words)

print("POS Tags:", pos_tags)


POS Tags: [('NLTK', 'NNP'), ('(', '('), ('Natural', 'NNP'), ('Language', 'NNP'), ('Toolkit', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('comprehensive', 'JJ'), ('platform', 'NN'), ('for', 'IN'), ('building', 'VBG'), ('Python', 'NNP'), ('programs', 'NNS'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'), ('human', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.'), ('It', 'PRP'), ("'s", 'VBZ'), ('widely', 'RB'), ('used', 'VBN'), ('in', 'IN'), ('academia', 'NN'), ('and', 'CC'), ('industry', 'NN'), ('for', 'IN'), ('tasks', 'NNS'), ('ranging', 'VBG'), ('from', 'IN'), ('simple', 'JJ'), ('text', 'NN'), ('processing', 'NN'), ('to', 'TO'), ('advanced', 'VB'), ('natural', 'JJ'), ('language', 'NN'), ('understanding', 'NN'), ('.', '.'), ('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ('platform', 'NN'), ('for', 'IN'), ('building', 'VBG'), ('Python', 'NNP'), ('programs', 'NNS'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'), ('human', 'JJ'), ('language', 'NN'), ('data', 'NNS
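
`pos_tag` uses a pretrained model. For comparison, NLTK also provides simple rule-based taggers. The sketch below uses `RegexpTagger` with a handful of suffix rules. The rules are hypothetical and far cruder than the pretrained tagger (for instance, they cannot tell a past participle from a simple past form), so this only shows the mechanism.

```python
from nltk.tag import RegexpTagger

# Hypothetical suffix rules, tried in order; the last pattern is a catch-all
patterns = [
    (r".*ing$", "VBG"),  # gerunds / present participles
    (r".*ed$", "VBD"),   # simple past
    (r".*s$", "NNS"),    # plural nouns
    (r".*", "NN"),       # default: singular noun
]
tagger = RegexpTagger(patterns)

tagged = tagger.tag(["building", "programs", "used", "data"])
print(tagged)
```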

### Named Entity Recognition (NER)

In [10]:
from nltk import ne_chunk

ner_tags = ne_chunk(pos_tags)

print("NER Tags:", ner_tags)


NER Tags: (S
  (GPE NLTK/NNP)
  (/(
  (ORGANIZATION Natural/NNP Language/NNP Toolkit/NNP)
  )/)
  is/VBZ
  a/DT
  comprehensive/JJ
  platform/NN
  for/IN
  building/VBG
  (PERSON Python/NNP)
  programs/NNS
  to/TO
  work/VB
  with/IN
  human/JJ
  language/NN
  data/NNS
  ./.
  It/PRP
  's/VBZ
  widely/RB
  used/VBN
  in/IN
  academia/NN
  and/CC
  industry/NN
  for/IN
  tasks/NNS
  ranging/VBG
  from/IN
  simple/JJ
  text/NN
  processing/NN
  to/TO
  advanced/VB
  natural/JJ
  language/NN
  understanding/NN
  ./.
  (ORGANIZATION NLTK/NNP)
  is/VBZ
  a/DT
  leading/VBG
  platform/NN
  for/IN
  building/VBG
  (PERSON Python/NNP)
  programs/NNS
  to/TO
  work/VB
  with/IN
  human/JJ
  language/NN
  data/NNS
  ./.)
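
The result of `ne_chunk` is an `nltk.Tree` in which entities are labelled subtrees and all other tokens are plain `(word, tag)` tuples. The sketch below shows one way to pull out just the entities. To stay runnable without the chunker models, it rebuilds the first part of the tree above by hand. Note that the labels themselves are questionable here (for example, "Python" is tagged as PERSON), so the extracted list inherits those errors.

```python
from nltk.tree import Tree

# Hand-built tree mirroring the start of the ne_chunk output above
ner_tree = Tree("S", [
    Tree("GPE", [("NLTK", "NNP")]),
    ("(", "("),
    Tree("ORGANIZATION", [("Natural", "NNP"), ("Language", "NNP"), ("Toolkit", "NNP")]),
    (")", ")"),
    ("is", "VBZ"),
    ("a", "DT"),
    ("comprehensive", "JJ"),
    ("platform", "NN"),
    ("for", "IN"),
    ("building", "VBG"),
    Tree("PERSON", [("Python", "NNP")]),
    ("programs", "NNS"),
])

# Entities are the subtrees; plain (word, tag) tuples are ordinary tokens
entities = [
    (node.label(), " ".join(word for word, tag in node.leaves()))
    for node in ner_tree
    if isinstance(node, Tree)
]
print(entities)
```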


### Chunking

In [11]:
from nltk.chunk import RegexpParser

# Example sentence
sentence = [("the", "DT"), ("big", "JJ"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# Define chunk grammar
chunk_grammar = r"""
    NP: {<DT>?<JJ>*<NN>} # Chunk NP: optional determiner, followed by any number of adjectives, followed by a noun
    VP: {<VB.*><NP|PP>*} # Chunk VP: verb followed by zero or more NPs or PPs
    PP: {<IN><NP>} # Chunk PP: preposition followed by NP
"""

# Create chunk parser
chunk_parser = RegexpParser(chunk_grammar)

# Perform chunking
chunked_sentence = chunk_parser.parse(sentence)

# Print chunked sentence
print(chunked_sentence)


(S
  (NP the/DT big/JJ cat/NN)
  (VP sat/VBD)
  (PP on/IN (NP the/DT mat/NN)))
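
Once a sentence is chunked, the phrases of one type can be collected with `Tree.subtrees`. The sketch below reuses the tagged sentence from above with an NP-only grammar and extracts the noun phrases as strings.

```python
from nltk.chunk import RegexpParser

tagged_sentence = [("the", "DT"), ("big", "JJ"), ("cat", "NN"), ("sat", "VBD"),
                   ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# NP: optional determiner, any number of adjectives, then a noun
np_parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = np_parser.parse(tagged_sentence)

# Collect the words of every subtree labelled NP
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)  # ['the big cat', 'the mat']
```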


### Word Frequency Analysis

In [12]:
from nltk.probability import FreqDist

fdist = FreqDist(words)

print("Word Frequency:", fdist.most_common(5))


Word Frequency: [('for', 3), ('to', 3), ('language', 3), ('.', 3), ('NLTK', 2)]
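
The counts above treat "NLTK" and "nltk" as different words and include punctuation such as ".". A common refinement, sketched below on a small invented token list, is to lowercase the tokens and drop non-alphabetic ones before counting.

```python
from nltk.probability import FreqDist

# Small invented token list with mixed case and punctuation
tokens = ["NLTK", "is", "a", "platform", ".", "NLTK", "is", "a",
          "toolkit", ".", "nltk", "works", "."]

# Lowercase and keep alphabetic tokens only
normalized = [token.lower() for token in tokens if token.isalpha()]
fdist = FreqDist(normalized)

print(fdist.most_common(3))
print(fdist["nltk"], fdist.freq("nltk"))  # raw count and relative frequency
```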
