# Natural Language Processing with Python

## What is Natural Language Processing (NLP)?

![image.png](attachment:image.png)

Natural language is the language we humans use to communicate with each other. Natural Language Processing (NLP) is the set of techniques for converting that language into a form that computers can analyze.

In simple words, NLP is the process that helps computers communicate with humans in their own language.

## NLTK (Natural Language Toolkit):

NLTK stands for Natural Language Toolkit. The library was developed by Steven Bird and Edward Loper. It comes with many built-in modules for tokenization, lemmatization, stemming, parsing, chunking, and POS tagging. It provides over 50 corpora and lexical resources.

In [25]:
!pip install nltk



In [26]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

## Tokenizing of paragraph or sentences:

**Tokenization**

It is the process of dividing the whole text into smaller units called tokens. The division is done at two levels: sentences and words.

In [21]:
# import required libraries
import nltk
from nltk import sent_tokenize
from nltk import word_tokenize

### Sentence tokenizing
By tokenizing the text with sent_tokenize( ), we can get the text as sentences.

In [23]:
# open the file in a with-block so it is closed automatically
with open("natural_language_processing.txt") as text_file:
    text=text_file.read()
print(text)
print("\n")

sentences=sent_tokenize(text)
print(len(sentences))
print("\n")
print(sentences)

Natural's language processing 123'@ helps computers communicate with humans in their own language and scales other language-related tasks. For example, NLP makes it possible for computers to read text1, hear speech, interpret it, measure sentiment and determine which parts are important ##.


2


["Natural's language processing 123'@ helps computers communicate with humans in their own language and scales other language-related tasks.", 'For example, NLP makes it possible for computers to read text1, hear speech, interpret it, measure sentiment and determine which parts are important ##.']


In the example above, we can see that the entire text is represented as a list of sentences, and that the total number of sentences here is 2.

### Word tokenizing
By tokenizing the text with word_tokenize( ), we can get the text as words.

In [24]:
# open the file in a with-block so it is closed automatically
with open("natural_language_processing.txt") as text_file:
    text=text_file.read()
print(text)
print("\n")

words=word_tokenize(text)
print(len(words))
print("\n")
print(words)

Natural's language processing 123'@ helps computers communicate with humans in their own language and scales other language-related tasks. For example, NLP makes it possible for computers to read text1, hear speech, interpret it, measure sentiment and determine which parts are important ##.


52


['Natural', "'s", 'language', 'processing', '123', "'", '@', 'helps', 'computers', 'communicate', 'with', 'humans', 'in', 'their', 'own', 'language', 'and', 'scales', 'other', 'language-related', 'tasks', '.', 'For', 'example', ',', 'NLP', 'makes', 'it', 'possible', 'for', 'computers', 'to', 'read', 'text1', ',', 'hear', 'speech', ',', 'interpret', 'it', ',', 'measure', 'sentiment', 'and', 'determine', 'which', 'parts', 'are', 'important', '#', '#', '.']


Here the entire text is represented as a list of word and punctuation tokens, and the total number of tokens is 52.
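To get a feel for what a tokenizer does, here is a minimal regex-based sketch. It is not how sent_tokenize( ) or word_tokenize( ) are implemented in NLTK; the function names simple_sent_tokenize and simple_word_tokenize and the sample string are made up for illustration, and the rules are deliberately naive.

```python
import re

def simple_sent_tokenize(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace
    # and then a capital letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

def simple_word_tokenize(text):
    # A token is either a run of word characters (hyphenated words stay
    # together) or a single punctuation character.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

# Made-up sample text.
sample = "NLP scales language-related tasks. For example, it can read text!"
sentences = simple_sent_tokenize(sample)
words = simple_word_tokenize(sentences[0])
print(sentences)
print(words)
```

A rule this simple breaks on abbreviations: "Mr. Smith left." is split after "Mr.", which is one reason to prefer a trained tokenizer in practice.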

## Regex

Regular expressions, also called regex, are a very powerful programming tool used for a variety of purposes such as feature extraction from text, string replacement, and other string manipulations.

A regular expression is a sequence of characters that defines a pattern, which is used to find substrings in a given string. Typical uses are extracting all hashtags from a tweet, or pulling email addresses or phone numbers out of a large body of unstructured text.

For example:

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [27]:
import re
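As a small sketch of the use cases mentioned above, the following pulls hashtags, an email address, and a phone number out of a string with re.findall. The sample text and the three patterns are assumptions chosen for illustration; real-world email and phone formats need more careful patterns.

```python
import re

# Made-up sample text for illustration.
tweet = "Loving #NLP and #Python! Mail jane.doe@example.com or call 555-0100 today."

# A hashtag: '#' followed by one or more word characters.
hashtags = re.findall(r"#\w+", tweet)

# A simplified email pattern: local part, '@', domain, dot, alphabetic suffix.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}", tweet)

# A simplified phone pattern: three digits, a dash, four digits.
phones = re.findall(r"\b\d{3}-\d{4}\b", tweet)

print(hashtags)
print(emails)
print(phones)
```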

## Punctuation

**Remove punctuation marks:**

In [28]:
# open the file in a with-block so it is closed automatically
with open("natural_language_processing.txt") as text_file:
    text=text_file.read()
print(text)

Natural's language processing 123'@ helps computers communicate with humans in their own language and scales other language-related tasks. For example, NLP makes it possible for computers to read text1, hear speech, interpret it, measure sentiment and determine which parts are important ##.


In [29]:
filtered_text=re.sub('[^A-Za-z]', ' ', text)
filtered_text

'Natural s language processing       helps computers communicate with humans in their own language and scales other language related tasks  For example  NLP makes it possible for computers to read text   hear speech  interpret it  measure sentiment and determine which parts are important    '
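Note that the pattern above removes digits as well as punctuation, and it leaves runs of spaces behind. The sketch below, using a shortened copy of the sample text, shows one way to collapse the extra spaces and, as an alternative, a way to delete only punctuation so that digits survive. Which behavior is wanted depends on the task; the variable names are made up for this example.

```python
import re
import string

raw = "Natural's language processing 123'@ helps computers."

# Approach used above: replace everything that is not a letter with a space.
letters_only = re.sub(r"[^A-Za-z]", " ", raw)

# Collapse the runs of spaces left behind by the substitution.
normalized = " ".join(letters_only.split())

# Alternative: delete only ASCII punctuation characters, keeping digits.
no_punct = raw.translate(str.maketrans("", "", string.punctuation))

print(normalized)
print(no_punct)
```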

## Stopwords

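In general, stopwords are common words that do not add much meaning to a sentence, such as "the", "is", and "and". In many NLP tasks we remove stopwords because they carry little information for the analysis. The NLTK English stopword list used here contains 179 words; the exact count can vary between versions of the NLTK data.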

**List of stopwords:**

In [30]:
from nltk.corpus import stopwords
stopwords=stopwords.words("english")
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

**Removing stopwords:**

In [31]:
# open the file in a with-block so it is closed automatically
with open("natural_language_processing.txt") as text_file:
    text=text_file.read()
print(text)
print("\n")

filtered_text=re.sub('[^A-Za-z]', ' ', text)
print(filtered_text)
print("\n")
print(len(filtered_text))

clean_words=[]
for w in filtered_text.split():
    if w not in stopwords:
        clean_words.append(w)
        
print(clean_words)
print("\n")
print(len(clean_words))

Natural's language processing 123'@ helps computers communicate with humans in their own language and scales other language-related tasks. For example, NLP makes it possible for computers to read text1, hear speech, interpret it, measure sentiment and determine which parts are important ##.


Natural s language processing       helps computers communicate with humans in their own language and scales other language related tasks  For example  NLP makes it possible for computers to read text   hear speech  interpret it  measure sentiment and determine which parts are important    


291
['Natural', 'language', 'processing', 'helps', 'computers', 'communicate', 'humans', 'language', 'scales', 'language', 'related', 'tasks', 'For', 'example', 'NLP', 'makes', 'possible', 'computers', 'read', 'text', 'hear', 'speech', 'interpret', 'measure', 'sentiment', 'determine', 'parts', 'important']


28
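In the output above, "For" survives the filter because the stopword list is lowercase and the comparison is case-sensitive. A minimal sketch of a case-insensitive filter follows; SAMPLE_STOPWORDS is a small hand-written set used only for this illustration, not the NLTK list.

```python
# Hypothetical mini stopword set, standing in for the full NLTK list.
SAMPLE_STOPWORDS = {"for", "it", "to", "in", "and", "with", "their", "own",
                    "which", "are", "other", "s"}

tokens = "For example NLP makes it possible for computers to read text".split()

# Case-sensitive check: 'For' is kept because only 'for' is in the set.
case_sensitive = [t for t in tokens if t not in SAMPLE_STOPWORDS]

# Case-insensitive check: compare the lowercased token instead.
case_insensitive = [t for t in tokens if t.lower() not in SAMPLE_STOPWORDS]

print(case_sensitive)
print(case_insensitive)
```

Using a set rather than a list also makes each membership test fast, which matters on large texts.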


## Part of Speech Tagging (PoS tagging):

![image.png](attachment:image.png)

It is the process of converting a sentence to a list of tuples. Each tuple has the form (word, tag), where the tag indicates whether the word is a noun, adjective, verb, and so on.

Below is a list of the main parts of speech (PoS) with examples of each:

- PoS explains how a word is used in a sentence.
- 8 main PoS: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions, interjections
- Noun(N): David, Casanova, Berlin, Calendar...
- Verb(V): go, speak, run, eat, play, live, walk...
- Adjective(ADJ): two, young, happy, angry...
- Adverb(ADV): slowly, quietly, always, near...
- Preposition(P): at, on, in, from, with, under...
- Conjunction(CON): and, or, but, because, if...
- Pronoun(PRO): I, you, we, they, he, she, me, us...
- Interjection(INT): Ouch!, Wow!, Great!, Help!...

![image.png](attachment:image.png)

In [58]:
tag=nltk.pos_tag(["studying","study"])
print(tag)

[('studying', 'VBG'), ('study', 'NN')]


In [32]:
sentence="A very beautiful young lady is walking on the beach"
tokenized_words=word_tokenize(sentence)
print(tokenized_words)
# pos_tag takes the whole token list, so no loop is needed
tagged_words=nltk.pos_tag(tokenized_words)
tagged_words

['A', 'very', 'beautiful', 'young', 'lady', 'is', 'walking', 'on', 'the', 'beach']


[('A', 'DT'),
 ('very', 'RB'),
 ('beautiful', 'JJ'),
 ('young', 'JJ'),
 ('lady', 'NN'),
 ('is', 'VBZ'),
 ('walking', 'VBG'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('beach', 'NN')]
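The tags in this output (DT, RB, JJ, NN, VBZ, ...) are Penn Treebank tags, which are more fine-grained than the eight parts of speech listed earlier. As a rough sketch, the helper below maps tag prefixes to coarse categories. The function name coarse_pos and the mapping table are assumptions made for this illustration; the mapping is simplified (for instance, IN also covers subordinating conjunctions).

```python
# Simplified prefix-to-category table (an illustrative assumption,
# not an official mapping).
COARSE_TAGS = [
    ("NN", "noun"),          # NN, NNS, NNP, NNPS
    ("VB", "verb"),          # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),     # JJ, JJR, JJS
    ("RB", "adverb"),        # RB, RBR, RBS
    ("PRP", "pronoun"),      # PRP, PRP$
    ("IN", "preposition"),   # also used for subordinating conjunctions
    ("CC", "conjunction"),
    ("UH", "interjection"),
    ("DT", "determiner"),
]

def coarse_pos(tag):
    # Return the first category whose prefix matches the tag.
    for prefix, name in COARSE_TAGS:
        if tag.startswith(prefix):
            return name
    return "other"

# Tagged sentence copied from the output above.
tagged = [('A', 'DT'), ('very', 'RB'), ('beautiful', 'JJ'), ('young', 'JJ'),
          ('lady', 'NN'), ('is', 'VBZ'), ('walking', 'VBG'), ('on', 'IN'),
          ('the', 'DT'), ('beach', 'NN')]

coarse = [(word, coarse_pos(tag)) for word, tag in tagged]
print(coarse)
```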

In [34]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

sw=set(stopwords.words('english'))
quote="""He thus came to be known as the Missile Man of India for his work on the development of ballistic missile 
and launch vehicle technology. He also played a pivotal organisational, technical, 
and political role in India's Pokhran-II nuclear tests in 1998, 
the first since the original nuclear test by India in 1974."""
tokenized=sent_tokenize(quote)
print(tokenized)
print("\n")

for i in tokenized:
    word_list=nltk.word_tokenize(i)
    word_list=[w for w in word_list if w not in sw]
    tagged=nltk.pos_tag(word_list)
    print(tagged)

['He thus came to be known as the Missile Man of India for his work on the development of ballistic missile \nand launch vehicle technology.', "He also played a pivotal organisational, technical, \nand political role in India's Pokhran-II nuclear tests in 1998, \nthe first since the original nuclear test by India in 1974."]


[('He', 'PRP'), ('thus', 'RB'), ('came', 'VBD'), ('known', 'VBN'), ('Missile', 'NNP'), ('Man', 'NNP'), ('India', 'NNP'), ('work', 'NN'), ('development', 'NN'), ('ballistic', 'JJ'), ('missile', 'NN'), ('launch', 'NN'), ('vehicle', 'NN'), ('technology', 'NN'), ('.', '.')]
[('He', 'PRP'), ('also', 'RB'), ('played', 'VBD'), ('pivotal', 'JJ'), ('organisational', 'JJ'), (',', ','), ('technical', 'JJ'), (',', ','), ('political', 'JJ'), ('role', 'NN'), ('India', 'NNP'), ("'s", 'POS'), ('Pokhran-II', 'JJ'), ('nuclear', 'JJ'), ('tests', 'NNS'), ('1998', 'CD'), (',', ','), ('first', 'RB'), ('since', 'IN'), ('original', 'JJ'), ('nuclear', 'JJ'), ('test', 'NN'), ('India', 'NNP

## Named Entity Recognition(NER)

Named entity recognition can automatically scan entire articles and pull out the fundamental entities mentioned in them, such as people, organizations, places, dates, times, monetary amounts, and geopolitical entities (GPE).

Commonly used types of named entity:


![image.png](attachment:image.png)

![image.png](attachment:image.png)

The ne_chunk( ) function has two modes, selected with the binary parameter:

1. binary = True

When binary is True, the output only marks whether a group of words is a named entity (NE) or not. It does not show any further details about the entity.

In [35]:
import nltk
from nltk.tokenize import word_tokenize

# sentence for NER
sentence="Mr. Smith made a deal on a beach of Switzerland near WHO"

# tokenizing words
tokenized_words=word_tokenize(sentence)

# POS tagging (pos_tag takes the whole token list, so no loop is needed)
tagged_words=nltk.pos_tag(tokenized_words)

#print(tagged_words)

#NER
N_E_R=nltk.ne_chunk(tagged_words, binary=True)
print(N_E_R)

# to visualize
N_E_R.draw()

(S
  (NE Mr./NNP Smith/NNP)
  made/VBD
  a/DT
  deal/NN
  on/IN
  a/DT
  beach/NN
  of/IN
  (NE Switzerland/NNP)
  near/IN
  (NE WHO/NNP))


The tree does not show what type each named entity is. It only shows whether a particular group of words is a named entity or not.

![image.png](attachment:image.png)

2. binary = False

When binary is False, the output labels each named entity with its type, such as PERSON, GPE, or ORGANIZATION.


In [37]:
import nltk
from nltk.tokenize import word_tokenize

# sentence for NER
sentence="Mr. Smith made a deal on a beach of Switzerland near WHO"

# tokenizing words
tokenized_words=word_tokenize(sentence)

# POS tagging (pos_tag takes the whole token list, so no loop is needed)
tagged_words=nltk.pos_tag(tokenized_words)

#print(tagged_words)

#NER
N_E_R=nltk.ne_chunk(tagged_words, binary=False)
print(N_E_R)

# to visualize
N_E_R.draw()

(S
  (PERSON Mr./NNP)
  (PERSON Smith/NNP)
  made/VBD
  a/DT
  deal/NN
  on/IN
  a/DT
  beach/NN
  of/IN
  (GPE Switzerland/NNP)
  near/IN
  (ORGANIZATION WHO/NNP))


![image.png](attachment:image.png)

The tree now shows the type of each named entity.
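Often we want the entities as plain (label, text) pairs rather than as a printed tree. The sketch below does this with a regular expression over the printed output shown above. This is only an illustration that avoids any extra dependencies: the function name extract_entities is made up, and in a real program it would usually be cleaner to walk the tree object returned by ne_chunk( ) than to parse its printed form.

```python
import re

# Printed tree copied from the binary=False output above.
tree_text = """(S
  (PERSON Mr./NNP)
  (PERSON Smith/NNP)
  made/VBD
  a/DT
  deal/NN
  on/IN
  a/DT
  beach/NN
  of/IN
  (GPE Switzerland/NNP)
  near/IN
  (ORGANIZATION WHO/NNP))"""

def extract_entities(text):
    entities = []
    # Match '(LABEL word/TAG ...)' groups that contain no nested parentheses.
    for label, body in re.findall(r"\((\w+) ([^()]+)\)", text):
        # Drop the '/TAG' suffix from every token in the group.
        words = [token.rsplit("/", 1)[0] for token in body.split()]
        entities.append((label, " ".join(words)))
    return entities

entities = extract_entities(tree_text)
print(entities)
```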

## Chunking 

It works on top of Part of Speech(PoS) tagging. Chunking takes PoS tags as input and provides chunks as output. 

Chunking means extracting meaningful phrases from unstructured text. It groups the words of a sentence into phrases that are more meaningful than individual words.

Meaningful groups of words are called phrases. There are five significant categories of phrases.

- Noun Phrases (NP).
- Verb Phrases (VP).
- Adjective Phrases (ADJP).
- Adverb Phrases (ADVP).
- Prepositional Phrases (PP)

In the following example, we will extract noun phrases from the text. Before extracting them, we need to define what kind of noun phrase we are looking for; in other words, we have to set the grammar for a noun phrase. **In this case, we define a noun phrase as an optional determiner, followed by any number of adjectives, followed by a noun.** We could define further rules to extract other kinds of phrases. Next, we use RegexpParser( ) to parse the sentence with this grammar. Notice that we can also visualize the result with the .draw( ) function.

In [38]:
sentence="A very beautiful young lady is walking on the beach"
tokenized_words=word_tokenize(sentence)
# pos_tag takes the whole token list, so no loop is needed
tagged_words=nltk.pos_tag(tokenized_words)
#tagged_words

# NP: optional determiner, any number of adjectives, then a noun
grammar="NP:{<DT>?<JJ>*<NN>}"
parser=nltk.RegexpParser(grammar)
output=parser.parse(tagged_words)
print(output)
output.draw()

(S
  A/DT
  very/RB
  (NP beautiful/JJ young/JJ lady/NN)
  is/VBZ
  walking/VBG
  on/IN
  (NP the/DT beach/NN))


![image.png](attachment:image.png)
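To see why "A" is left out of the first chunk, it can help to apply the same pattern by hand. The sketch below writes the tags as one string and runs an ordinary regular expression over it. It is a simplified illustration of the idea behind a tag-pattern grammar, not the RegexpParser( ) implementation, and the function name chunk_noun_phrases is made up.

```python
import re

# Tagged sentence copied from the PoS tagging output earlier in the notebook.
tagged = [('A', 'DT'), ('very', 'RB'), ('beautiful', 'JJ'), ('young', 'JJ'),
          ('lady', 'NN'), ('is', 'VBZ'), ('walking', 'VBG'), ('on', 'IN'),
          ('the', 'DT'), ('beach', 'NN')]

def chunk_noun_phrases(tagged_words):
    # Write the tag sequence as one string: '<DT><RB><JJ>...'
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_words)
    # Same shape as the grammar: optional DT, any number of JJ, then NN.
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    chunks = []
    for match in pattern.finditer(tag_string):
        # Convert character offsets back to token positions by counting '<'.
        start = tag_string[:match.start()].count("<")
        length = match.group().count("<")
        chunks.append(" ".join(w for w, _ in tagged_words[start:start + length]))
    return chunks

noun_phrases = chunk_noun_phrases(tagged)
print(noun_phrases)
```

The determiner "A" is followed by the adverb "very" (RB), which the pattern does not allow, so the match can only start at "beautiful".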

In [40]:
import nltk
from nltk import pos_tag
from nltk import RegexpParser

quote="""Machine learning is a method of data analysis that automates analytical model building. 
It is a branch of artificial intelligence based on the idea that systems can learn from data, 
identify patterns and make decisions with minimal human intervention. """

tokens=nltk.word_tokenize(quote)
print(tokens)

tags=nltk.pos_tag(tokens)
print(tags)

# NP: optional determiner, any number of adjectives, then a noun
grammar="NP:{<DT>?<JJ>*<NN>}"

cp=nltk.RegexpParser(grammar)

res=cp.parse(tags)
print(res)

res.draw()

['Machine', 'learning', 'is', 'a', 'method', 'of', 'data', 'analysis', 'that', 'automates', 'analytical', 'model', 'building', '.', 'It', 'is', 'a', 'branch', 'of', 'artificial', 'intelligence', 'based', 'on', 'the', 'idea', 'that', 'systems', 'can', 'learn', 'from', 'data', ',', 'identify', 'patterns', 'and', 'make', 'decisions', 'with', 'minimal', 'human', 'intervention', '.']
[('Machine', 'NN'), ('learning', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('method', 'NN'), ('of', 'IN'), ('data', 'NNS'), ('analysis', 'NN'), ('that', 'WDT'), ('automates', 'VBZ'), ('analytical', 'JJ'), ('model', 'NN'), ('building', 'NN'), ('.', '.'), ('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('branch', 'NN'), ('of', 'IN'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('based', 'VBN'), ('on', 'IN'), ('the', 'DT'), ('idea', 'NN'), ('that', 'IN'), ('systems', 'NNS'), ('can', 'MD'), ('learn', 'VB'), ('from', 'IN'), ('data', 'NNS'), (',', ','), ('identify', 'NN'), ('patterns', 'NNS'), ('and', 'CC'), ('make', 'VB'), ('

## Chinking

After a lot of chunking, you may find that a chunk still contains some words you do not want, and that you cannot get rid of them with chunking rules alone. Chinking is the solution.

Chinking is a lot like chunking: it is a way to remove part of a chunk. The part that you remove from the chunk is called the chink.

**In the following example, we are going to take the whole string as a chunk, and then we are going to exclude adjectives from it by using chinking.** 

To write a chinking rule, we put the pattern between inverted curly braces, i.e. }{

In [41]:
# Here we are taking the whole string and then excluding adjectives from that chunk

sentence="A very beautiful lady is walking on the beach"

# tokenizing words
tokenized_words=word_tokenize(sentence)

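# POS tagging (pos_tag takes the whole token list, so no loop is needed)
tagged_words=nltk.pos_tag(tokenized_words)

# chunk everything, then chink (remove) adjectives from the chunk
grammar=r""" NP: {<.*>+}
                }<JJ>+{"""

# creating parser
parser=nltk.RegexpParser(grammar)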

# parsing list tuples containing word and its tag
output=parser.parse(tagged_words)
print(output)

# to visualise
output.draw()

(S
  (NP A/DT very/RB)
  beautiful/JJ
  (NP lady/NN is/VBZ walking/VBG on/IN the/DT beach/NN))


From the example above, we can see that the adjective is excluded, which splits the original chunk into two.

![image.png](attachment:image.png)
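The effect of chinking can be sketched without a parser: start from one chunk that covers the whole sentence and split it wherever an excluded tag appears. The function name chink and its arguments are made up for this illustration; it mimics the result of the grammar above and is not how RegexpParser( ) works internally.

```python
# Tagged sentence matching the chinking example above.
tagged = [('A', 'DT'), ('very', 'RB'), ('beautiful', 'JJ'), ('lady', 'NN'),
          ('is', 'VBZ'), ('walking', 'VBG'), ('on', 'IN'), ('the', 'DT'),
          ('beach', 'NN')]

def chink(tagged_words, chink_tags):
    chunks, current = [], []
    for word, tag in tagged_words:
        if tag in chink_tags:
            # An excluded tag closes the current chunk (if any).
            if current:
                chunks.append(current)
                current = []
        else:
            current.append(word)
    if current:
        chunks.append(current)
    return chunks

chunks = chink(tagged, {"JJ"})
print(chunks)
```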

## Stemming and Lemmatization:

### Stemming
It is the process of reducing a word to its stem by stripping affixes, typically suffixes. The stem is not always a dictionary word.
loved → love, learning → learn

### Lemmatization
It also reduces a word to a base form, but the key difference is that it returns a meaningful dictionary word (the lemma). It is mostly used in developing chatbots, Q&A bots, text prediction, etc.

Example:

Stemming
history → histori

Lemmatizing
history → history

![image.png](attachment:image.png)

Lemmatization takes the part of speech (POS) into account, and it may generate different outputs for different POS values. We generally have four choices for POS: noun (n), verb (v), adjective (a), and adverb (r).

![image.png](attachment:image.png)

**Various Stemming Algorithms:**

- Porter Stemmer
- Lovins Stemmer
- Dawson Stemmer
- Krovetz Stemmer
- Xerox Stemmer
- Snowball Stemmer

The most widely used and best known of these is the Porter stemmer.


**Difference between Stemmer and Lemmatizer:**

**a. Stemming:**

Notice how on stemming, the word “studies” gets truncated to “studi.”

In [42]:
from nltk.stem import PorterStemmer

stemmer=PorterStemmer()

stemmer.stem("studies")

'studi'
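To make the idea of suffix stripping concrete, here is a toy stemmer with four hand-written rules. It is not the Porter algorithm, which uses a much larger, carefully ordered rule set; the rule table and the name toy_stem are assumptions made for this illustration.

```python
# (suffix, replacement) pairs, tried in order. Purely illustrative.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Only strip when at least three characters remain before the suffix.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)] + replacement
    return word

for w in ["studies", "learning", "loved", "is"]:
    print(w, "->", toy_stem(w))
```

Even this crude version shows the typical behavior: "studies" becomes the non-word "studi", and "loved" becomes "lov" because the rules know nothing about the dictionary.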

**b. Lemmatizing:**

On lemmatization, the word “studies” is mapped to its dictionary form, “study.”

In [43]:
from nltk.stem import WordNetLemmatizer

lemmatizer=WordNetLemmatizer()

lemmatizer.lemmatize("studies")

'study'

In the following example, we set the PoS tag to verb (pos="v"), and the lemmatizer returns dictionary words instead of truncated stems:

In [44]:
from nltk import WordNetLemmatizer
lemma=WordNetLemmatizer()
wordlist=["study","studies","studying","studied"]
for w in wordlist:
    print(lemma.lemmatize(w, pos="v"))

study
study
study
study


The default PoS in lemmatization is noun (n). In the following example, we can see that it again returns dictionary words:

In [45]:
from nltk import WordNetLemmatizer
lemma=WordNetLemmatizer()
wordlist=["studies","leaves","decreases","plays"]
for w in wordlist:
    print(lemma.lemmatize(w))

study
leaf
decrease
play


In [46]:
from nltk import WordNetLemmatizer
lemma=WordNetLemmatizer()
wordlist=["am","is","are","was","were"]
for w in wordlist:
    print(lemma.lemmatize(w, pos="v"))

be
be
be
be
be
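Conceptually, a lemmatizer looks a word up together with its part of speech, which is why the same word can have different lemmas. The sketch below imitates this with a tiny hand-written table. LEMMA_TABLE and toy_lemmatize are hypothetical names for this illustration only; the real lemmatizer consults WordNet rather than a fixed table.

```python
# Tiny hand-written (word, pos) -> lemma table. Illustrative only.
LEMMA_TABLE = {
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("was", "v"): "be",
    ("are", "v"): "be",
}

def toy_lemmatize(word, pos="n"):
    # Unknown (word, pos) pairs fall back to the word itself.
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("leaves"))           # noun reading
print(toy_lemmatize("leaves", pos="v"))  # verb reading
print(toy_lemmatize("was", pos="v"))
```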


## Corpora and Corpus

Applications of natural language processing are built using huge amounts of text data.

A corpus is a collection of text documents. Corpora is the plural: a group of several such collections.

## WordNet:

WordNet is a lexical database for the English language, and it is available as one of the NLTK corpora. We can use WordNet to find meanings of words, synonyms, antonyms, and other related details.

**We can check how many different senses (synsets) of a word are available in WordNet.**

In [53]:
from nltk.corpus import wordnet

for words in wordnet.synsets("Fun"):
    print(words)


Synset('fun.n.01')
Synset('fun.n.02')
Synset('fun.n.03')
Synset('playfulness.n.02')


In [48]:
for words in wordnet.synsets("Fun"):
    for lemma in words.lemmas():
        print(lemma)
    print("\n")

Lemma('fun.n.01.fun')
Lemma('fun.n.01.merriment')
Lemma('fun.n.01.playfulness')


Lemma('fun.n.02.fun')
Lemma('fun.n.02.play')
Lemma('fun.n.02.sport')


Lemma('fun.n.03.fun')


Lemma('playfulness.n.02.playfulness')
Lemma('playfulness.n.02.fun')




In [52]:
# All details for a word.

word=wordnet.synsets("Fun")[0]

print(word.name())
print(word.definition())
print(word.examples())

fun.n.01
activities that are enjoyable or amusing
['I do it for the fun of it', 'he is fun to have around']


In [5]:
# All details for all meanings of a word.

for words in wordnet.synsets("Fun"):
    print(words.name())
    print(words.definition())
    print(words.examples())
    
    for lemma in words.lemmas():
        print(lemma)
    print("\n")

fun.n.01
activities that are enjoyable or amusing
['I do it for the fun of it', 'he is fun to have around']
Lemma('fun.n.01.fun')
Lemma('fun.n.01.merriment')
Lemma('fun.n.01.playfulness')


fun.n.02
verbal wit or mockery (often at another's expense but not to be taken seriously)
['he became a figure of fun', 'he said it in sport']
Lemma('fun.n.02.fun')
Lemma('fun.n.02.play')
Lemma('fun.n.02.sport')


fun.n.03
violent and excited activity
['she asked for money and then the fun began', 'they began to fight like fun']
Lemma('fun.n.03.fun')


playfulness.n.02
a disposition to find (or make) causes for amusement
['her playfulness surprised me', 'he was fun to be with']
Lemma('playfulness.n.02.playfulness')
Lemma('playfulness.n.02.fun')




In [54]:
# Get a name only.

word=wordnet.synsets("Play")[0]
print(word)

print(word.lemmas()[0].name())

Synset('play.n.01')
play


In [7]:
# Synonyms.

synonyms=[]

for words in wordnet.synsets("Fun"):
    for lemma in words.lemmas():
        synonyms.append(lemma.name())
        
synonyms

['fun',
 'merriment',
 'playfulness',
 'fun',
 'play',
 'sport',
 'fun',
 'playfulness',
 'fun']
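The list above contains repeats because the same word appears in several synsets. One simple way to remove duplicates while keeping the original order is dict.fromkeys, sketched here on a copy of that output.

```python
# Synonym list copied from the output above.
synonyms = ['fun', 'merriment', 'playfulness', 'fun', 'play', 'sport',
            'fun', 'playfulness', 'fun']

# Dictionaries keep insertion order, so this drops repeats but keeps order.
unique_synonyms = list(dict.fromkeys(synonyms))
print(unique_synonyms)
```

Converting to a set would also remove duplicates, but the order of the words would no longer be guaranteed.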

In [55]:
# Antonyms.

antonyms=[]

for words in wordnet.synsets("Natural"):
    for lemma in words.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())
        
antonyms

['unnatural', 'artificial', 'supernatural', 'sharp']

In [56]:
synonyms=[]
antonyms=[]

for words in wordnet.synsets("New"):
    for lemma in words.lemmas():
        synonyms.append(lemma.name())
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())
            
print(synonyms)
print("\n")
print(antonyms)

['new', 'fresh', 'new', 'novel', 'raw', 'new', 'new', 'unexampled', 'new', 'new', 'newfangled', 'new', 'New', 'Modern', 'New', 'new', 'young', 'new', 'newly', 'freshly', 'fresh', 'new']


['old', 'worn']


In [57]:
# Finding the similarity between words.

word1=wordnet.synsets("ship",'n')[0]
word2=wordnet.synsets("boat",'n')[0]

print(word1.wup_similarity(word2))

0.9090909090909091
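The wup_similarity( ) method computes the Wu-Palmer similarity, which is based on how deep two senses and their lowest common ancestor sit in the taxonomy: 2 × depth(ancestor) / (depth(a) + depth(b)). The sketch below applies this formula to a tiny invented taxonomy. The PARENT table is a made-up hierarchy, not the real WordNet one, and the depth convention (root has depth 1) is an assumption of this sketch, so the numbers differ from the output above.

```python
# Invented toy taxonomy: child -> parent. Not taken from WordNet.
PARENT = {
    "vehicle": "entity",
    "vessel": "vehicle",
    "car": "vehicle",
    "ship": "vessel",
    "boat": "vessel",
}

def path_to_root(node):
    # List of nodes from the given node up to the root.
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def depth(node):
    # The root has depth 1 in this sketch.
    return len(path_to_root(node))

def lowest_common_ancestor(a, b):
    ancestors_a = set(path_to_root(a))
    # Walk up from b; the first node shared with a is the deepest one.
    for node in path_to_root(b):
        if node in ancestors_a:
            return node
    return None

def wup(a, b):
    ancestor = lowest_common_ancestor(a, b)
    return 2 * depth(ancestor) / (depth(a) + depth(b))

print(wup("ship", "boat"))
print(wup("ship", "car"))
```

Words that share a deep ancestor (ship and boat under vessel) score higher than words that only meet near the root (ship and car under vehicle).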


## Bag of words

![image.png](attachment:image.png)

A bag-of-words model converts raw text into a vocabulary of words and counts how often each word occurs in the text. Each sentence is then represented by its word counts, and the order in which the words occur is ignored.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

sentences=["Jim and Pam travelled by the bus",
           "The train was late",
          "The flight was full. Travelling by flight is expensive"]

cv=CountVectorizer()

B_O_W=cv.fit_transform(sentences).toarray()
print(B_O_W)
print("\n")

# total words with their index in model
print(cv.vocabulary_)
print("\n")

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out().tolist())
print("\n")

[[1 1 1 0 0 0 0 1 0 1 1 0 1 0 0]
 [0 0 0 0 0 0 0 0 1 0 1 1 0 0 1]
 [0 0 1 1 2 1 1 0 0 0 1 0 0 1 1]]


{'jim': 7, 'and': 0, 'pam': 9, 'travelled': 12, 'by': 2, 'the': 10, 'bus': 1, 'train': 11, 'was': 14, 'late': 8, 'flight': 4, 'full': 5, 'travelling': 13, 'is': 6, 'expensive': 3}


['and', 'bus', 'by', 'expensive', 'flight', 'full', 'is', 'jim', 'late', 'pam', 'the', 'train', 'travelled', 'travelling', 'was']
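The count matrix above can be reproduced with the standard library, which makes the mechanics easier to see. The sketch below lowercases each sentence, keeps tokens of two or more word characters (which matches the default behavior of CountVectorizer), sorts the vocabulary, and counts. The helper name tokenize is made up for this example.

```python
import re
from collections import Counter

sentences = ["Jim and Pam travelled by the bus",
             "The train was late",
             "The flight was full. Travelling by flight is expensive"]

def tokenize(sentence):
    # Lowercase, then keep tokens made of two or more word characters.
    return re.findall(r"\b\w\w+\b", sentence.lower())

# Vocabulary: every distinct token, in alphabetical order.
vocabulary = sorted({token for s in sentences for token in tokenize(s)})

# One row per sentence, one column per vocabulary word.
bow = []
for s in sentences:
    counts = Counter(tokenize(s))
    bow.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow:
    print(row)
```

The third row has a 2 in the "flight" column because that word occurs twice in the third sentence.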




## Text Classification

Text classification is an important area in machine learning, and a wide range of applications depend on it. Let’s take some examples.

Sentiment Analysis: This is a classification task that labels the opinion people express in a piece of text.

Intention Mining: Predicting the future decision of a person based on the text.

Spam Filtering: This is a very mature application of text classification, which decides which emails to show in the inbox and which emails to put in the spam folder.

![image.png](attachment:image.png)

Machine learning algorithms work with numbers, so when it comes to text, we have to do some preprocessing and convert the text into numeric features before a model can predict well. Let’s go through these steps in practice with an SMS spam filtering program.

### Spam Filtering

In [58]:
import pandas as pd
dataset = pd.read_csv('data.csv', encoding='ISO-8859-1')
dataset

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [59]:
import re
import nltk

In [60]:
#nltk.download('punkt')
from nltk.tokenize import word_tokenize as wt 

In [61]:
#nltk.download('stopwords')
from nltk.corpus import stopwords

In [62]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [63]:
data = []

# build the stopword set once instead of once per word
stop_words = set(stopwords.words('english'))

for i in range(dataset.shape[0]):
    sms = dataset.loc[i, 'v2']

    # remove non-alphabetic characters
    sms = re.sub('[^A-Za-z]', ' ', sms)

    # make words lowercase, because Go and go would otherwise be treated as two words
    sms = sms.lower()

    # tokenizing
    tokenized_sms = wt(sms)

    # remove stop words and stem the remaining words
    sms_processed = []
    for word in tokenized_sms:
        if word not in stop_words:
            sms_processed.append(stemmer.stem(word))

    sms_text = " ".join(sms_processed)
    data.append(sms_text)

In [65]:
data

['go jurong point crazi avail bugi n great world la e buffet cine got amor wat',
 'ok lar joke wif u oni',
 'free entri wkli comp win fa cup final tkt st may text fa receiv entri question std txt rate c appli',
 'u dun say earli hor u c alreadi say',
 'nah think goe usf live around though',
 'freemsg hey darl week word back like fun still tb ok xxx std chg send rcv',
 'even brother like speak treat like aid patent',
 'per request mell mell oru minnaminungint nurungu vettam set callertun caller press copi friend callertun',
 'winner valu network custom select receivea prize reward claim call claim code kl valid hour',
 'mobil month u r entitl updat latest colour mobil camera free call mobil updat co free',
 'gon na home soon want talk stuff anymor tonight k cri enough today',
 'six chanc win cash pound txt csh send cost p day day tsandc appli repli hl info',
 'urgent week free membership prize jackpot txt word claim c www dbuk net lccltd pobox ldnw rw',
 'search right word thank breathe

In [66]:
# creating the feature matrix 
from sklearn.feature_extraction.text import CountVectorizer
matrix = CountVectorizer(max_features=1000)
X = matrix.fit_transform(data).toarray()
y = dataset.loc[:, 'v1']


In [67]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
matrix.get_feature_names_out().tolist()

['abiola',
 'abl',
 'abt',
 'ac',
 'accept',
 'access',
 'account',
 'across',
 'activ',
 'actual',
 'ad',
 'add',
 'address',
 'admir',
 'aft',
 'afternoon',
 'age',
 'ago',
 'ah',
 'aha',
 'aight',
 'al',
 'almost',
 'alon',
 'alreadi',
 'alright',
 'also',
 'alway',
 'amp',
 'an',
 'angri',
 'anoth',
 'answer',
 'anymor',
 'anyon',
 'anyth',
 'anytim',
 'anyway',
 'apart',
 'app',
 'appli',
 'appreci',
 'ard',
 'area',
 'around',
 'arriv',
 'asap',
 'ask',
 'askd',
 'ass',
 'attempt',
 'auction',
 'avail',
 'ave',
 'await',
 'award',
 'away',
 'awesom',
 'babe',
 'babi',
 'back',
 'bad',
 'bag',
 'balanc',
 'bank',
 'basic',
 'bath',
 'bb',
 'bcoz',
 'beauti',
 'bed',
 'believ',
 'best',
 'better',
 'bid',
 'big',
 'bill',
 'birthday',
 'bit',
 'bless',
 'blood',
 'blue',
 'bluetooth',
 'bold',
 'bonu',
 'book',
 'bore',
 'boss',
 'bout',
 'box',
 'boy',
 'boytoy',
 'break',
 'bring',
 'brother',
 'bslvyl',
 'bt',
 'bu',
 'busi',
 'buy',
 'bx',
 'cake',
 'call',
 'caller',
 'callert

In [68]:
# index values for features
matrix.vocabulary_

{'go': 332,
 'point': 650,
 'crazi': 171,
 'avail': 52,
 'great': 344,
 'world': 974,
 'la': 440,
 'got': 341,
 'wat': 934,
 'ok': 596,
 'lar': 445,
 'joke': 426,
 'wif': 955,
 'free': 302,
 'entri': 249,
 'wkli': 967,
 'comp': 146,
 'win': 958,
 'cup': 175,
 'final': 283,
 'st': 806,
 'may': 509,
 'text': 849,
 'receiv': 691,
 'question': 675,
 'std': 811,
 'txt': 895,
 'rate': 681,
 'appli': 40,
 'dun': 230,
 'say': 736,
 'earli': 232,
 'alreadi': 24,
 'nah': 561,
 'think': 855,
 'goe': 334,
 'usf': 914,
 'live': 475,
 'around': 44,
 'though': 859,
 'freemsg': 303,
 'hey': 371,
 'week': 943,
 'word': 971,
 'back': 60,
 'like': 469,
 'fun': 314,
 'still': 812,
 'xxx': 986,
 'send': 750,
 'even': 253,
 'brother': 94,
 'speak': 802,
 'treat': 883,
 'per': 630,
 'request': 706,
 'set': 754,
 'callertun': 104,
 'caller': 103,
 'press': 661,
 'copi': 161,
 'friend': 308,
 'winner': 960,
 'valu': 918,
 'network': 569,
 'custom': 178,
 'select': 747,
 'prize': 666,
 'reward': 712,
 'claim': 

In [69]:
# vocabulary sorted by feature index
dict(sorted(matrix.vocabulary_.items(), key=lambda item: item[1]))

{'abiola': 0,
 'abl': 1,
 'abt': 2,
 'ac': 3,
 'accept': 4,
 'access': 5,
 'account': 6,
 'across': 7,
 'activ': 8,
 'actual': 9,
 'ad': 10,
 'add': 11,
 'address': 12,
 'admir': 13,
 'aft': 14,
 'afternoon': 15,
 'age': 16,
 'ago': 17,
 'ah': 18,
 'aha': 19,
 'aight': 20,
 'al': 21,
 'almost': 22,
 'alon': 23,
 'alreadi': 24,
 'alright': 25,
 'also': 26,
 'alway': 27,
 'amp': 28,
 'an': 29,
 'angri': 30,
 'anoth': 31,
 'answer': 32,
 'anymor': 33,
 'anyon': 34,
 'anyth': 35,
 'anytim': 36,
 'anyway': 37,
 'apart': 38,
 'app': 39,
 'appli': 40,
 'appreci': 41,
 'ard': 42,
 'area': 43,
 'around': 44,
 'arriv': 45,
 'asap': 46,
 'ask': 47,
 'askd': 48,
 'ass': 49,
 'attempt': 50,
 'auction': 51,
 'avail': 52,
 'ave': 53,
 'await': 54,
 'award': 55,
 'away': 56,
 'awesom': 57,
 'babe': 58,
 'babi': 59,
 'back': 60,
 'bad': 61,
 'bag': 62,
 'balanc': 63,
 'bank': 64,
 'basic': 65,
 'bath': 66,
 'bb': 67,
 'bcoz': 68,
 'beauti': 69,
 'bed': 70,
 'believ': 71,
 'best': 72,
 'better': 73,
 'b

In [70]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [71]:
X[0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [72]:
len(X[1])

1000

In [73]:
# split train and test data (by default, 25% of the rows go to the test set)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [74]:
y_train

344      ham
1512     ham
5296     ham
64       ham
4334     ham
        ... 
3262     ham
2108    spam
4275     ham
3370     ham
558      ham
Name: v1, Length: 4179, dtype: object

In [75]:
# Naive Bayes was for years a popular and strong baseline for text classification
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)
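The cell above uses GaussianNB, which treats each count as a continuous value. For word counts, a multinomial Naive Bayes model is a common alternative (scikit-learn provides one as MultinomialNB). To show what such a model computes, here is a from-scratch sketch with add-one (Laplace) smoothing on four invented messages. The function names train_nb and predict_nb and the toy data are assumptions for illustration; this is not the scikit-learn implementation and it does not reproduce the results in this notebook.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    class_counts = Counter(labels)       # number of documents per class
    word_counts = defaultdict(Counter)   # word frequencies per class
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, alpha

def predict_nb(model, doc):
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # Start from the log prior of the class.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in doc.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Smoothed log likelihood of the word given the class.
            numerator = word_counts[label][token] + alpha
            denominator = total_words + alpha * len(vocab)
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training data.
docs = ["win free prize now", "free cash claim now",
        "are you coming home", "see you at lunch"]
labels = ["spam", "spam", "ham", "ham"]

model = train_nb(docs, labels)
print(predict_nb(model, "claim your free prize"))
print(predict_nb(model, "are you home"))
```

Working with log probabilities avoids numeric underflow when many small probabilities are multiplied together.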

In [76]:
X_test

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [82]:
# predict class
y_pred = classifier.predict(X_test)
y_pred

array(['ham', 'spam', 'ham', ..., 'ham', 'spam', 'ham'], dtype='<U4')

In [78]:
# Confusion matrix
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)
print(cm)
print(cr)

accuracy = accuracy_score(y_test, y_pred)
accuracy

[[957 260]
 [ 19 157]]
              precision    recall  f1-score   support

         ham       0.98      0.79      0.87      1217
        spam       0.38      0.89      0.53       176

   micro avg       0.80      0.80      0.80      1393
   macro avg       0.68      0.84      0.70      1393
weighted avg       0.90      0.80      0.83      1393



0.7997128499641063
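The numbers in the classification report can be derived directly from the confusion matrix, where rows are the true classes and columns are the predicted classes. The sketch below recomputes accuracy, precision, and recall from the matrix printed above; the helper names precision and recall are made up for this example.

```python
# Confusion matrix copied from the output above.
# Rows: true class, columns: predicted class, in the order [ham, spam].
cm = [[957, 260],
      [19, 157]]
HAM, SPAM = 0, 1

total = sum(sum(row) for row in cm)

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

def precision(k):
    # Of everything predicted as class k, how much really is class k?
    return cm[k][k] / sum(cm[r][k] for r in range(len(cm)))

def recall(k):
    # Of everything that really is class k, how much was found?
    return cm[k][k] / sum(cm[k])

print(accuracy)
print(round(precision(SPAM), 2), round(recall(SPAM), 2))
print(round(precision(HAM), 2), round(recall(HAM), 2))
```

The low spam precision comes from the 260 ham messages that were predicted as spam, which outnumber the 157 correctly detected spam messages.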

In [79]:
def pred(classifier, list_of_messages, cv):
    list_of_messages=cv.transform(list_of_messages).toarray()
    return classifier.predict(list_of_messages)

In [80]:
sentences=['hi, how are you', 'you won 1 crore']
#matrix = CountVectorizer(max_features=1000)
#classifier = GaussianNB()
pred(classifier, sentences, matrix)

array(['ham', 'spam'], dtype='<U4')

## References

https://pub.towardsai.net/natural-language-processing-nlp-with-python-tutorial-for-beginners-1f54e610a1a0#01f7