**NATURAL LANGUAGE PROCESSING**

In this notebook, I focus on the basics of NLP. It covers the core concepts: ***'Stemming', 'Lemmatization', 'Stop Word Removal', 'Parts of Speech Tagging', 'Named Entity Recognition'***.


In [54]:
import nltk
# One-time setup: uncomment these lines to download the NLTK data used in this notebook
#nltk.download('punkt')
#nltk.download('words')
#nltk.download('wordnet')
#nltk.download('stopwords')
#nltk.download('averaged_perceptron_tagger')
#nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [23]:
paragraph = "I'm a girl. I recently starting learning new concepts. I'm trying to learn more and at the same time use my knowledge to help resolve issues that are budding with the help of my skills. I'm aiming at honing my skills by practising more and more.I would want to learn more. I want to teach the same to people around me and grow together. It is extremely essential to make this community stronger and bolder. We need to build a strong society. I want to contribute into making this community stroonger & better."

In [24]:
sentences = nltk.sent_tokenize(paragraph)   # split the paragraph into sentences
sentences

["I'm a girl.",
 'I recently starting learning new concepts.',
 "I'm trying to learn more and at the same time use my knowledge to help resolve issues that are budding with the help of my skills.",
 "I'm aiming at honing my skills by practising more and more.I would want to learn more.",
 'I want to teach the same to people around me and grow together.',
 'It is extremely essential to make this community stronger and bolder.',
 'We need to build a strong society.',
 'I want to contribute into making this community stroonger & better.']

In [25]:
words = nltk.word_tokenize(paragraph)
words

['I',
 "'m",
 'a',
 'girl',
 '.',
 'I',
 'recently',
 'starting',
 'learning',
 'new',
 'concepts',
 '.',
 'I',
 "'m",
 'trying',
 'to',
 'learn',
 'more',
 'and',
 'at',
 'the',
 'same',
 'time',
 'use',
 'my',
 'knowledge',
 'to',
 'help',
 'resolve',
 'issues',
 'that',
 'are',
 'budding',
 'with',
 'the',
 'help',
 'of',
 'my',
 'skills',
 '.',
 'I',
 "'m",
 'aiming',
 'at',
 'honing',
 'my',
 'skills',
 'by',
 'practising',
 'more',
 'and',
 'more.I',
 'would',
 'want',
 'to',
 'learn',
 'more',
 '.',
 'I',
 'want',
 'to',
 'teach',
 'the',
 'same',
 'to',
 'people',
 'around',
 'me',
 'and',
 'grow',
 'together',
 '.',
 'It',
 'is',
 'extremely',
 'essential',
 'to',
 'make',
 'this',
 'community',
 'stronger',
 'and',
 'bolder',
 '.',
 'We',
 'need',
 'to',
 'build',
 'a',
 'strong',
 'society',
 '.',
 'I',
 'want',
 'to',
 'contribute',
 'into',
 'making',
 'this',
 'community',
 'stroonger',
 '&',
 'better',
 '.']
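
Both tokenizers kept 'more.I' together because the paragraph has no space after that full stop. A minimal sketch of one way to repair such text before tokenizing is shown here; the helper name `add_missing_sentence_spaces` is hypothetical, and the rule is a simple heuristic that would also split abbreviations such as "U.S.A", so it is only suitable for text like the paragraph above.

```python
import re

def add_missing_sentence_spaces(text):
    """Insert a space after '.', '!' or '?' when a capital letter follows directly."""
    # The lookbehind finds the punctuation and the lookahead finds the capital
    # letter; the match itself is empty, so only a space is inserted.
    return re.sub(r'(?<=[.!?])(?=[A-Z])', ' ', text)

sample = "I'm aiming at honing my skills by practising more and more.I would want to learn more."
print(add_missing_sentence_spaces(sample))
```

Running the cleaned text through `nltk.sent_tokenize` afterwards should give the sentence tokenizer a chance to split at that boundary.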

**Stemming**

What is Stemming?

Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes). Unlike a lemma, the resulting stem does not have to be a valid dictionary word. Stemming is important in natural language understanding (NLU) and natural language processing (NLP).
*Basically, reducing a word to its root form*
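
To make the idea of suffix stripping concrete, here is a deliberately simplified sketch. The function `toy_stem` and its suffix list are assumptions made for illustration; this is not the Porter algorithm, which applies a much longer sequence of rules.

```python
def toy_stem(word, suffixes=("ing", "ly", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        # Only strip when enough of the word remains to be recognisable.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["learning", "recently", "concepts", "issues", "girl"]:
    print(w, "->", toy_stem(w))
```

Note that 'issues' becomes 'issu', which is not a real word. That is expected of a stemmer: it only needs to map related word forms to the same string.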

In [26]:
from nltk.stem import PorterStemmer #import stemming from nltk


In [27]:
stemmer = PorterStemmer()

In [28]:

# Stem every token and rebuild each sentence from the stems
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    newwords = [stemmer.stem(word) for word in words]
    sentences[i] = ' '.join(newwords)
    
sentences

["I 'm a girl .",
 'I recent start learn new concept .',
 "I 'm tri to learn more and at the same time use my knowledg to help resolv issu that are bud with the help of my skill .",
 "I 'm aim at hone my skill by practis more and more.i would want to learn more .",
 'I want to teach the same to peopl around me and grow togeth .',
 'It is extrem essenti to make thi commun stronger and bolder .',
 'We need to build a strong societi .',
 'I want to contribut into make thi commun stroonger & better .']

**Lemmatization**

Lemmatization is similar to stemming, but the root form it produces (the lemma) is a valid dictionary word.

In [30]:
from nltk.stem import WordNetLemmatizer # import Lemmatization from NLTK

In [31]:
sentences = nltk.sent_tokenize(paragraph)   # split the original paragraph into sentences again

In [32]:
lemmatizer = WordNetLemmatizer()

In [35]:

# Lemmatization
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    newwords = [lemmatizer.lemmatize(word) for word in words]
    sentences[i] = ' '.join(newwords)   # join with spaces so the words stay separated
sentences

["I 'm a girl .",
 'I recently starting learning new concept .',
 "I 'm trying to learn more and at the same time use my knowledge to help resolve issue that are budding with the help of my skill .",
 "I 'm aiming at honing my skill by practising more and more.I would want to learn more .",
 'I want to teach the same to people around me and grow together .',
 'It is extremely essential to make this community stronger and bolder .',
 'We need to build a strong society .',
 'I want to contribute into making this community stroonger & better .']
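
In the output above, 'concepts' became 'concept' but 'starting' and 'learning' were left unchanged. That is consistent with how `WordNetLemmatizer.lemmatize` works: it takes an optional `pos` argument and treats every word as a noun unless told otherwise. The sketch below illustrates why the part of speech matters, using a hypothetical hand-written table (`LEMMA_TABLE`) in place of WordNet.

```python
# Hypothetical lemma table keyed by (word, part of speech), for illustration only.
# A real lemmatizer such as WordNetLemmatizer consults WordNet instead.
LEMMA_TABLE = {
    ("concepts", "n"): "concept",
    ("skills", "n"): "skill",
    ("learning", "v"): "learn",
    ("starting", "v"): "start",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("concepts"))        # concept
print(toy_lemmatize("learning"))        # learning (no noun entry in the table)
print(toy_lemmatize("learning", "v"))   # learn
print(toy_lemmatize("stroonger"))       # stroonger (unknown words are left alone)
```

With NLTK itself, the equivalent step would be to pass the part of speech explicitly, for example `lemmatizer.lemmatize(word, pos='v')` for verbs.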

**Stop Word Removal**

In [36]:
from nltk.corpus import stopwords #importing stopwords

In [37]:
sentences = nltk.sent_tokenize(paragraph)   # split the original paragraph into sentences again

In [40]:
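# stop word removal
stop_words = set(stopwords.words('english'))   # load the list once; a set gives fast lookups
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # the comparison is case-sensitive, so capitalised tokens such as 'I' and 'We' are kept
    newwords = [word for word in words if word not in stop_words]
    sentences[i] = ' '.join(newwords)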
sentences

["I 'm girl .",
 'I recently starting learning new concepts .',
 "I 'm trying learn time use knowledge help resolve issues budding help skills .",
 "I 'm aiming honing skills practising more.I would want learn .",
 'I want teach people around grow together .',
 'It extremely essential make community stronger bolder .',
 'We need build strong society .',
 'I want contribute making community stroonger & better .']
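
The output above still contains 'I', 'It' and 'We' because the stop-word list is lowercase and the comparison is case-sensitive. A minimal sketch of a case-insensitive filter follows; the small `STOP_WORDS` set is a hypothetical stand-in for `set(stopwords.words('english'))` so that the example runs without any downloads.

```python
# A small hypothetical stop-word set for illustration; in the notebook this
# would be set(stopwords.words('english')).
STOP_WORDS = {"i", "it", "we", "a", "to", "is", "this", "and", "the"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["We", "need", "to", "build", "a", "strong", "society", "."]
print(" ".join(remove_stop_words(tokens)))
```

Lowercasing only for the comparison keeps the original casing of the tokens that survive, which matters if a later step such as named entity recognition relies on capitalisation.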

**Parts of Speech Tagging**

What is Part-of-Speech (POS) Tagging?

POS tagging assigns a part-of-speech tag to every word in a sentence. The sentence is first converted to a list of words and then to a list of tuples of the form (word, tag), where the tag signifies whether the word is a noun, adjective, verb, and so on.

In [41]:
paragraph = 'Hello! I am a student. What do you do? '

In [44]:
words = nltk.word_tokenize(paragraph)
tagged_words = nltk.pos_tag(words)   # list of (word, tag) tuples

In [45]:
word_tags = []
for tw in tagged_words:
    word_tags.append(tw[0]+"_"+tw[1])
    
tagged_paragraph = ' '.join(word_tags)
tagged_paragraph

'Hello_NN !_. I_PRP am_VBP a_DT student_NN ._. What_WP do_VBP you_PRP do_VB ?_.'
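
The `word_TAG` string is convenient to read, but later steps usually need the (word, tag) pairs back. As a sketch, the string can be split again with plain Python; `parse_tagged` is a hypothetical helper name, and the input is copied from the output above.

```python
def parse_tagged(text, sep="_"):
    """Split 'word_TAG' items back into (word, tag) tuples."""
    # rsplit with maxsplit=1 splits on the last separator only,
    # so a word that itself contains '_' keeps it.
    return [tuple(item.rsplit(sep, 1)) for item in text.split()]

tagged_paragraph = 'Hello_NN !_. I_PRP am_VBP a_DT student_NN ._. What_WP do_VBP you_PRP do_VB ?_.'
pairs = parse_tagged(tagged_paragraph)

# Keep only the words whose tag marks them as nouns (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(pairs[:3])
print(nouns)
```

Note that the tagger labelled 'Hello' as a noun here, so it shows up in the noun list; tagger output is a prediction and is worth spot-checking.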

**Named Entity Recognition**

What is Named Entity Recognition?

Named entity recognition (NER), also known as entity chunking or entity extraction, is a popular information extraction technique that identifies and segments named entities in text and classifies them into predefined classes.

In [59]:
para = "Fouder of Wavy AI Research Foundation is from Pakistan"

In [60]:
words = nltk.word_tokenize(para)
words

['Fouder',
 'of',
 'Wavy',
 'AI',
 'Research',
 'Foundation',
 'is',
 'from',
 'Pakistan']

In [61]:
tagged_words = nltk.pos_tag(words)
tagged_words

[('Fouder', 'NN'),
 ('of', 'IN'),
 ('Wavy', 'NNP'),
 ('AI', 'NNP'),
 ('Research', 'NNP'),
 ('Foundation', 'NNP'),
 ('is', 'VBZ'),
 ('from', 'IN'),
 ('Pakistan', 'NNP')]

In [None]:
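namedEnt = nltk.ne_chunk(tagged_words)  # chunk the POS-tagged tokens into a tree of named entities
namedEnt.draw()                         # opens a GUI window showing the tree; needs a display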
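
To give a feel for the "identify and segment" part of NER without a GUI, the sketch below groups runs of consecutive proper-noun tags into candidate entity spans. This is a simple heuristic written for illustration, not what `nltk.ne_chunk` does internally: `ne_chunk` uses a trained classifier and also assigns a class such as PERSON, ORGANIZATION or GPE. The helper name `proper_noun_spans` is hypothetical, and the input is copied from the `pos_tag` output above.

```python
from itertools import groupby

# Tagged tokens copied from the pos_tag output above.
tagged_words = [('Fouder', 'NN'), ('of', 'IN'), ('Wavy', 'NNP'), ('AI', 'NNP'),
                ('Research', 'NNP'), ('Foundation', 'NNP'), ('is', 'VBZ'),
                ('from', 'IN'), ('Pakistan', 'NNP')]

def proper_noun_spans(tagged):
    """Join each run of consecutive NNP/NNPS tokens into one candidate entity."""
    spans = []
    # groupby yields one group per run of tokens sharing the same key value.
    for is_proper, group in groupby(tagged, key=lambda pair: pair[1] in ("NNP", "NNPS")):
        if is_proper:
            spans.append(" ".join(word for word, _ in group))
    return spans

print(proper_noun_spans(tagged_words))
```

The heuristic finds the spans 'Wavy AI Research Foundation' and 'Pakistan' but cannot say which one is an organisation and which one is a place; that classification step is what a trained NER model adds.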
