# Basic example of Natural Language Processing

Import the library

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

Vectorization is a step in feature extraction. We do it to get useful features out of the text: it converts the text into numerical vectors.

In [8]:
vect = CountVectorizer()
vect

Let’s use it to tokenize a small corpus of text documents and count the word occurrences in each one

In [9]:
corpus = ['Hi my name is Erwin.','I love traveling.','Erwin loves eating delicious food.']
my_doc = vect.fit_transform(corpus)
my_doc

<3x11 sparse matrix of type '<class 'numpy.int64'>'
	with 12 stored elements in Compressed Sparse Row format>

The output above shows that the dimension of my_doc is 3x11. <br>
That means 3 rows and 11 columns, as there are 3 documents and 11 unique words.

In [13]:
vect.get_feature_names_out()

array(['delicious', 'eating', 'erwin', 'food', 'hi', 'is', 'love',
       'loves', 'my', 'name', 'traveling'], dtype=object)

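As explained before, there are 11 unique words. The words are lowercased, and "I" is missing because the default tokenizer only keeps tokens of two or more characters.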

Each term found by the analyzer is assigned a unique integer index that corresponds to a column in the resulting matrix. The columns follow the same order as the terms returned by "vect.get_feature_names_out()". <br>
<br>
The counts for each document can be viewed as a dense array as follows:

In [14]:
my_doc.toarray()

array([[0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
       [1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0]])
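To make the link between the vocabulary and the matrix concrete, here is a minimal pure-Python sketch of the same counting step. It assumes the default `CountVectorizer` settings (lowercasing and the token pattern `(?u)\b\w\w+\b`); the helper names `tokenize` and `count_vector` are made up for illustration and are not part of scikit-learn.

```python
import re

corpus = ['Hi my name is Erwin.', 'I love traveling.', 'Erwin loves eating delicious food.']

# Assumed default token pattern: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())

# The vocabulary is the sorted set of all tokens; the position is the column index
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})
index = {term: i for i, term in enumerate(vocabulary)}

def count_vector(text):
    """Count known tokens in one document; unknown tokens are skipped."""
    row = [0] * len(vocabulary)
    for tok in tokenize(text):
        if tok in index:
            row[index[tok]] += 1
    return row

matrix = [count_vector(doc) for doc in corpus]

print(vocabulary)
for row in matrix:
    print(row)
```

Reading the first row against the vocabulary, the 1s sit in the columns for "erwin", "hi", "is", "my" and "name", which are exactly the words of the first document.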

Words that are not in the training corpus are ignored by the transform method. <br>
For example: "Hallo, how old are you?"

In [15]:
vect.transform(['Hallo, how old are you?']).toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

As can be seen, the array is filled with zeros since none of the words match the training corpus.
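A sentence that mixes known and unknown words shows the same behaviour more clearly. The sketch below refits a vectorizer on the same corpus so that it runs on its own; the test sentence is invented for illustration. It uses the fitted `vocabulary_` attribute to find the column for "erwin".

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Hi my name is Erwin.', 'I love traveling.', 'Erwin loves eating delicious food.']

vect = CountVectorizer()
vect.fit(corpus)

# Only "erwin" is in the training vocabulary; the other words are dropped
mixed = vect.transform(['Hallo Erwin, how old are you?']).toarray()[0]

erwin_col = vect.vocabulary_['erwin']  # column index assigned to "erwin"
print(erwin_col, mixed[erwin_col], mixed.sum())
```

Only the "erwin" column is non-zero, so the vector carries no trace of the words the vectorizer has never seen.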

## Normalization and stemming <br>
The words "love" and "loves" have the same meaning. That's why we should treat them as the same term.


In [16]:
import nltk
porter = nltk.PorterStemmer()
[porter.stem(t) for t in vect.get_feature_names_out()]

['delici',
 'eat',
 'erwin',
 'food',
 'hi',
 'is',
 'love',
 'love',
 'my',
 'name',
 'travel']

Now the word "loves" has become "love".

In [18]:
list(set([porter.stem(t) for t in vect.get_feature_names_out()]))

['delici', 'food', 'travel', 'name', 'is', 'love', 'erwin', 'eat', 'my', 'hi']

As can be seen, we now have only 10 unique words (previously 11).

In [20]:
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in list(set([porter.stem(t) for t in vect.get_feature_names_out()]))]

['delici', 'food', 'travel', 'name', 'is', 'love', 'erwin', 'eat', 'my', 'hi']

## Lemmatization <br>

Lemmatization is similar to stemming. <br>
The main difference between the two, as the earlier example showed, is that
stemming can often create non-existent words, whereas lemmas are actual words.<br>

A stem such as "delici" is sometimes not found in a dictionary. However, you can look up a lemma.

In [21]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("dogs"))
print(lemmatizer.lemmatize("doggies"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("snakes"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("good", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))
print(lemmatizer.lemmatize("running",'v'))

dog
doggy
cactus
goose
rock
snake
good
best
good
run
run
run


Part-of-speech tagging labels each word in a sentence with its part of speech (noun, verb, adjective, and so on) by looking at the context in which the word appears.

In [24]:
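import nltk
# Uncomment on first use to download the tagger model.
# word_tokenize also needs the 'punkt' tokenizer data; resource names can differ between NLTK versions.
#nltk.download('averaged_perceptron_tagger')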

In [27]:
from nltk import word_tokenize, pos_tag
sentence = "Python is the best programming language to learn Data Science"

In [28]:
sen_token = word_tokenize(sentence)
pos_tag(sen_token)

[('Python', 'NNP'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('best', 'JJS'),
 ('programming', 'NN'),
 ('language', 'NN'),
 ('to', 'TO'),
 ('learn', 'VB'),
 ('Data', 'NNP'),
 ('Science', 'NNP')]

As can be seen, each word is tagged with its part of speech, for example NNP for a proper noun and VBZ for a third-person singular present-tense verb.
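The tags above and the `pos` argument of the lemmatizer use different tag sets: `pos_tag` returns Penn Treebank tags, while `lemmatize` expects WordNet codes such as 'n', 'v', 'a' and 'r'. A common way to bridge the two, sketched here with a hypothetical helper `treebank_to_wordnet`, is to map on the first letter of the tag and fall back to noun, which is also the default of `lemmatize`.

```python
# Tagged tokens copied from the pos_tag output above
tagged = [('Python', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('best', 'JJS'),
          ('programming', 'NN'), ('language', 'NN'), ('to', 'TO'),
          ('learn', 'VB'), ('Data', 'NNP'), ('Science', 'NNP')]

def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet pos code (hypothetical helper)."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # everything else is treated as a noun

wordnet_tags = [(word, treebank_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
```

Each pair can then be passed to the lemmatizer as `lemmatizer.lemmatize(word, pos)`, so that verbs and adjectives are not lemmatized as if they were nouns.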


* ![](https://cdn-images-1.medium.com/max/800/0*V635bzjWK2n1jBsd.png)