# Natural Language Processing

## Tokenization
Tokenization is the process of breaking a piece of text into smaller parts, such as sentences and words. These smaller parts are called tokens. For example, a word is a token in a sentence, and a sentence is a token in a paragraph.

### Example

In [1]:
# importing library
import nltk
from nltk.tokenize import word_tokenize

In [2]:
text= 'Here you go. Is this what you are looking for?'
print(word_tokenize(text))


['Here', 'you', 'go', '.', 'Is', 'this', 'what', 'you', 'are', 'looking', 'for', '?']


In [3]:
text_0= '''Won't'''
print(word_tokenize(text_0))

['Wo', "n't"]


* In the second example, word_tokenize splits the single word 'Won't' into 'Wo' and "n't". An alternative tokenizer is WordPunctTokenizer, which separates every run of punctuation from the surrounding letters, so it splits a contraction at the apostrophe instead.

In [4]:
# WordPunctTokenizer
from nltk.tokenize import WordPunctTokenizer

text= '''I can't allow you to go home earlier.'''

tokenizer= WordPunctTokenizer()

tokenizer.tokenize(text)

['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'earlier', '.']
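
As a rough illustration, the behaviour above can be reproduced with the standard re module. The pattern `\w+|[^\w\s]+` is the one WordPunctTokenizer is built on; the helper name word_punct_tokenize is ours and is not part of NLTK.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Hypothetical helper that mimics WordPunctTokenizer().tokenize(text)."""
    return WORD_PUNCT.findall(text)

print(word_punct_tokenize("I can't allow you to go home earlier."))
```

Because the apostrophe is neither a word character nor whitespace, it always ends up as a token of its own.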

## Tokenizing text into sentences
Sentence tokenization splits a paragraph into its sentences. NLTK provides sent_tokenize for this purpose.

In [5]:
from nltk.tokenize import sent_tokenize

text= '''Hello! is anyone here? I need to talk to someone. I am looking for a room here.'''

sent_tokenize(text)


['Hello!',
 'is anyone here?',
 'I need to talk to someone.',
 'I am looking for a room here.']
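
To see why a dedicated sentence tokenizer is worth using, here is a naive splitter for comparison. It is only a sketch built on the standard re module, and naive_sent_split is a made-up name. It breaks after every '.', '!' or '?' that is followed by whitespace, which works for the text above but mis-splits abbreviations such as 'Dr.'.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Hello! is anyone here? I need to talk to someone. I am looking for a room here."
print(naive_sent_split(text))

# The same rule wrongly treats the period of an abbreviation as a sentence end.
print(naive_sent_split("Dr. Smith arrived. He sat down."))
```

sent_tokenize relies on the Punkt model, which is designed to recognise abbreviations and so avoid this kind of error.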

## RegexpTokenizer
Unlike standard tokenizers that split text on predefined characters like spaces and punctuation, RegexpTokenizer allows you to define your own rules for tokenization using regular expressions. This is particularly useful when the standard tokenization doesn't suit your needs.


In [6]:
from nltk.tokenize import RegexpTokenizer

# Match runs of word characters and apostrophes, so contractions stay whole
tokenizer= RegexpTokenizer(r"[\w']+")
text= "Hello! is anyone here? I don't need to talk to someone."

tokenizer.tokenize(text)

['Hello',
 'is',
 'anyone',
 'here',
 'I',
 "don't",
 'need',
 'to',
 'talk',
 'to',
 'someone']

Here "don't" is kept as a single token instead of being split in two, and punctuation is removed.
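
RegexpTokenizer can also treat the pattern as the separator rather than the token. The sketch below assumes the gaps=True argument of RegexpTokenizer, which tells it to split on matches of the pattern. Here the pattern is whitespace, so punctuation stays attached to the words.

```python
from nltk.tokenize import RegexpTokenizer

# gaps=True: the pattern describes the gaps between tokens, not the tokens
tokenizer = RegexpTokenizer(r"\s+", gaps=True)

text = "Hello! is anyone here? I don't need to talk to someone."
print(tokenizer.tokenize(text))
```

Which mode to use depends on whether it is easier to describe what a token looks like or what separates two tokens.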

## Extraction of text from file

In [7]:
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext

In [8]:
# Read the whole file as one raw string
text= webtext.raw(r"X:\BIA\nlpt.txt")

In [9]:
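# Passing text to the constructor trains the Punkt model on that text
sent_tokenizer= PunktSentenceTokenizer(text)
# Split the same text into sentences with the trained model
sents_1= sent_tokenizer.tokenize(text)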

In [10]:
sents_1[0]

'Samsung AI Ballie\nSamsung is going to launch AI Robot named Ballie, it can also be called as AI home assistant.'

In [11]:
sents_1[1]

'It can do some interesting things inside our homes.'

In [12]:
sents_1[6]

'What are the methods and backgrounds behind this advancement?'

## Stopwords
Stopwords are common words that appear in text but contribute little to the meaning of a
sentence. Such words are usually unimportant for information retrieval or
natural language processing. The most common stopwords are ‘the’ and ‘a’.


In [13]:
# Importing stopwords from nltk.corpus
from nltk.corpus import stopwords
# stopwords
stop_words= stopwords.words('english')
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [16]:
text= ' Hello how are you? Am I doing it in a right way?'
text= word_tokenize(text)
text

['Hello',
 'how',
 'are',
 'you',
 '?',
 'Am',
 'I',
 'doing',
 'it',
 'in',
 'a',
 'right',
 'way',
 '?']

In [20]:
x= [word for word in text if word not in stop_words]
x

['Hello', '?', 'Am', 'I', 'right', 'way', '?']
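
Note that 'Am' and 'I' survive in the output above because the stopword list is all lowercase and the comparison is case-sensitive. One possible fix is to lowercase each token before the lookup. The sketch below is self-contained: it hard-codes the tokens from the previous cell and uses a small hand-copied subset of the English list (sample_stop_words is an illustrative name) instead of loading the corpus.

```python
# Tokens as produced by word_tokenize in the cell above
tokens = ['Hello', 'how', 'are', 'you', '?', 'Am', 'I',
          'doing', 'it', 'in', 'a', 'right', 'way', '?']

# Small subset of stopwords.words('english'), copied by hand for illustration;
# a set makes each membership test fast
sample_stop_words = {'i', 'you', 'it', 'am', 'are', 'doing', 'a', 'in', 'how'}

# Compare in lowercase so capitalised stopwords are removed too
filtered = [tok for tok in tokens if tok.lower() not in sample_stop_words]
print(filtered)
```

With the real list, set(stopwords.words('english')) can take the place of sample_stop_words.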

## List of supported languages

In [21]:
stopwords.fileids()

['arabic',
 'azerbaijani',
 'basque',
 'bengali',
 'catalan',
 'chinese',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hebrew',
 'hinglish',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

## WordNet
WordNet is a lexical database of English that NLTK exposes as a corpus. Some of its use cases:
* Looking up the definition of a word
* Finding synonyms and antonyms of a word
* Exploring word relations and similarities
* Word sense disambiguation for words that have multiple uses and definitions

In [22]:
from nltk.corpus import wordnet as wn

In [25]:
syn= wn.synsets('dog')[0]
syn.name()

'dog.n.01'

In [26]:
syn.definition()

'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'

In [27]:
syn2= wn.synsets('lion')[0]
syn2.name()

'lion.n.01'

In [28]:
syn2.definition()

'large gregarious predatory feline of Africa and India having a tawny coat with a shaggy mane in the male'

In [33]:
syn= wn.synsets('hammer')[6]
syn.name()

'hammer.n.07'

In [34]:
syn.definition()

'a power tool for drilling rocks'

## What is Stemming?
Stemming is a technique for extracting the base form of a word by removing its affixes.
It is like cutting the branches of a tree down to its stem. For example,
the stem of the words eating, eats, and eaten is eat.

#### Advantage:
Search engines use stemming when indexing words. Rather than storing every
form of a word, a search engine can store only the stems. This reduces
the size of the index and can improve retrieval, since a query also matches documents that use other forms of the same word.
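
As a rough illustration of the indexing idea, the sketch below maps each stem to the documents that contain any form of the word. The toy documents and the names build_index and search are invented for this example; only PorterStemmer comes from NLTK.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def build_index(docs):
    """Map each stem to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stemmer.stem(word)].add(doc_id)
    return index

def search(index, query):
    """Look up a single query word by its stem."""
    return sorted(index.get(stemmer.stem(query.lower()), set()))

# Toy documents, invented for illustration
docs = {
    1: "connecting the cables",
    2: "a stable connection",
    3: "the cables are connected",
}
index = build_index(docs)
print(search(index, "connect"))
```

All three documents are found because 'connecting', 'connection' and 'connected' share the stem 'connect', and the index holds one entry for them instead of three.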

In [36]:
from nltk.stem import PorterStemmer

In [46]:
x= 'writing'
stemmer= PorterStemmer()
x= stemmer.stem(x)
x

'write'

### RegexpStemmer
RegexpStemmer takes a regular expression and removes whatever it matches from each word, so we can specify the prefix or suffix to strip ourselves.
###### Example:

In [47]:
from nltk.stem import RegexpStemmer

In [49]:
stemmer= RegexpStemmer('ment')
x='entertainment'
x= stemmer.stem(x)
x

'entertain'
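
One caveat: RegexpStemmer removes the pattern wherever it matches, not only at the end of the word. Anchoring the pattern with $ restricts it to suffixes. The sketch below also assumes the optional min argument, which leaves words shorter than the given length untouched; the variable names are ours.

```python
from nltk.stem import RegexpStemmer

unanchored = RegexpStemmer('ment')
# '$' anchors the match to the end of the word; min=6 skips shorter words
anchored = RegexpStemmer('ment$', min=6)

for word in ['entertainment', 'mentor']:
    print(word, unanchored.stem(word), anchored.stem(word))
```

The unanchored stemmer reduces 'mentor' to 'or', while the anchored one leaves it alone and still turns 'entertainment' into 'entertain'.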

## What is Lemmatization?
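Lemmatization is similar to stemming, but its output, called a ‘lemma’, is a root word rather than a root stem. After lemmatization we get a valid word that means the same thing. Note that WordNetLemmatizer treats its input as a noun unless a part of speech is passed through the pos argument. That is why 'eating' comes back unchanged in the cell below and 'believes' becomes 'belief'; with pos='v' they become 'eat' and 'believe'.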

In [50]:
from nltk.stem import WordNetLemmatizer

In [53]:
lemmatizer= WordNetLemmatizer()
x= 'eating'
lemmatizer.lemmatize(x)

'eating'

In [54]:
x= 'believes'
lemmatizer.lemmatize(x)

'belief'

### Difference between stemming and lemmatization:
Stemming chops the endings off a word by rule, so the result may not be a real word, while lemmatization maps the word to its dictionary root.
###### Example:

In [55]:
x= 'believes'
x=lemmatizer.lemmatize(x)
print(f'From lemmatization we get {x}')

x= 'believes'
stemmer= PorterStemmer()
x= stemmer.stem(x)
print(f'From stemming we get {x}')

From lemmatization we get belief
From stemming we get believ
