# Natural Language Processing



## Common Tasks in NLP

__Tokenization:__ Breaking a large text into smaller units, such as sentences and words.

__Stopword Removal:__ Stopwords are common words that add little meaning to a text. Predefined lists of these words exist, and we usually remove them before analysis.

__N-Grams:__ Identify commonly occurring groups of words in a text. (Words or groups of words that occur frequently are usually among the most important in the text.)

"New York" is a _bigram_ because it has 2 words. We can also group words into larger groups, such as trigrams (3 words).

__Word sense disambiguation__: We need to identify the meaning of a word based on the context in which it appears. For example, "cool" means something different in each of these sentences:

"The movie had really _cool_ effects."
"I'd like a tall glass of _cool_ water."

__Parts-of-Speech:__ Identifies which part of speech (noun, verb, adjective, ...) each word in a text is.

__Stemming:__ Some words share the same root but have different endings, so they would be treated as different words:

- close
- closed
- closely
- closer 

If we want to treat these words as having the same meaning, we can _stem_ them and extract their common root:

- **clos**e
- **clos**ed
- **clos**ely
- **clos**er
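
The idea behind the list above can be sketched with a toy suffix stripper. This is only an illustration: the suffix list and the `naive_stem` helper are made up for this example and are far simpler than a real stemming algorithm.

```python
# Hypothetical suffix list, longest first so "ely" is tried before "e"
SUFFIXES = ("ely", "ed", "er", "e")

def naive_stem(word):
    """Strip the first matching suffix (toy rule, not a real stemmer)."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["close", "closed", "closely", "closer"]
stems = [naive_stem(w) for w in words]
print(stems)
```

All four words reduce to the same root, so they can be counted as one term. A rule this crude will mangle many other words, which is why real stemmers use much larger rule sets.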


## Example

### Tokenization
Tokenize text into sentences and words 


In [1]:
# nltk is the Natural Language Toolkit. It comes with built-in functionality
# to perform the tasks above and more.
import nltk

In [2]:
text = "Mary had a little lamb. It's fleece was white as snow"

from nltk.tokenize import word_tokenize, sent_tokenize

In [3]:
# Break the text into sentences
sents = sent_tokenize(text)
print(sents)

['Mary had a little lamb.', "It's fleece was white as snow"]


In [4]:
# Break the text into words
words = [word_tokenize(sent) for sent in sents]
print(words)

[['Mary', 'had', 'a', 'little', 'lamb', '.'], ['It', "'s", 'fleece', 'was', 'white', 'as', 'snow']]
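
To see roughly what a tokenizer does, here is a minimal sketch that uses only Python's standard `re` module. The two patterns are assumptions chosen for illustration, not the rules NLTK applies, so the result differs slightly: the apostrophe in "It's" becomes its own token instead of `'s`.

```python
import re

text = "Mary had a little lamb. It's fleece was white as snow"

# Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)

# Assumed rule: a token is a run of word characters or one punctuation mark
tokens = [re.findall(r"\w+|[^\w\s]", sentence) for sentence in sentences]

print(sentences)
print(tokens)
```

Simple patterns like these break down on abbreviations such as "Dr." or on numbers like "3.14", which is one reason to prefer a library tokenizer for real text.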


In [5]:
# Now we remove all the stopwords (words like 'a' and 'and')
# together with punctuation marks such as '.'
from nltk.corpus import stopwords
from string import punctuation

customSetStopWords = set(stopwords.words('english') + list(punctuation))

In [6]:
print(customSetStopWords)

{'been', '[', 'more', 'yourself', 'out', 'wasn', 'then', 'isn', 'or', 'them', '}', 'own', 'this', 'yours', '#', 'does', '/', '@', 'shouldn', 'before', 'we', 'mightn', 'do', '.', 'until', 'why', 's', 'through', 'our', 'yourselves', 'there', 'themselves', 'up', 'being', 'are', 'for', 'and', 'don', 'if', 'again', 'i', 'too', '%', 'how', 'their', 'should', '<', 'herself', 're', 'doesn', ',', 'who', 'had', ')', 'he', 'what', 'to', 'needn', 'didn', 'few', '~', 'same', 'they', 'because', 'you', 'after', '|', 'nor', 'my', '_', 'is', 'just', 'below', '(', '`', '&', 'at', '-', 'above', 'under', '*', 'on', 'so', 'were', 'its', 'during', 'be', 'as', '$', 'between', 'am', 'his', 'while', 'myself', 'which', 'once', "'", 'your', 'couldn', 'whom', 'was', 'such', '{', 'o', 'other', '+', 'd', 'hasn', 'have', 'over', 't', 'shan', 'ours', ':', 'the', ']', 'any', 'will', 'hers', 'theirs', 'now', 'some', 'by', 'no', 'down', 'all', '^', '!', '>', 'mustn', 'wouldn', 'each', 've', 'did', 'y', 'these', 'itself'

In [7]:
wordsNoStopWords = [word for word in word_tokenize(text) if word not in customSetStopWords]
print(wordsNoStopWords)

['Mary', 'little', 'lamb', 'It', "'s", 'fleece', 'white', 'snow']
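
Note that 'It' survives in the output above: the stopword entries are lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. The `small_stop_words` set is a hand-picked stand-in for the much larger `customSetStopWords`, used here only so the example runs without NLTK.

```python
# Hand-picked stand-in for the much larger customSetStopWords built above
small_stop_words = {"a", "had", "it", "was", "as", "."}

tokens = ["Mary", "had", "a", "little", "lamb", ".",
          "It", "'s", "fleece", "was", "white", "as", "snow"]

# Case-sensitive check: "It" is kept because only "it" is in the set
case_sensitive = [t for t in tokens if t not in small_stop_words]

# Lowercase each token for the comparison only, keeping its original spelling
case_insensitive = [t for t in tokens if t.lower() not in small_stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase depends on the task: it removes capitalized stopwords at the start of sentences, but it can also hide distinctions such as "US" versus "us".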


In [8]:
# Identify Bigrams
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
bigram_measures = BigramAssocMeasures()
# The collocations module also provides TrigramCollocationFinder for trigrams

finder = BigramCollocationFinder.from_words(wordsNoStopWords)

# List each bigram with its frequency count, sorted lexicographically by bigram
sorted(finder.ngram_fd.items())

[(("'s", 'fleece'), 1),
 (('It', "'s"), 1),
 (('Mary', 'little'), 1),
 (('fleece', 'white'), 1),
 (('lamb', 'It'), 1),
 (('little', 'lamb'), 1),
 (('white', 'snow'), 1)]
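
In the example above every bigram occurs only once, so there is nothing to rank. To rank n-grams by how often they occur, one option is a plain `collections.Counter`. The sketch below is standard-library Python, not the NLTK finder, and both the `ngrams` helper and the sample sentence are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n shifted copies of the list; zip stops at the shortest copy
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical sample text, chosen so that one bigram clearly repeats
sample = "new york is big and new york is busy and new york never sleeps"
tokens = sample.split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

# most_common() orders entries from the highest count to the lowest
print(bigram_counts.most_common(1))
print(trigram_counts[("new", "york", "is")])
```

The same helper covers bigrams and trigrams by changing `n`, and the counter makes the most frequent group easy to pick out.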

In [9]:
# Stemming
text2 = "Mary closed on closing night when she was in the mood to close"

# There are several stemming algorithms. Here we use the Lancaster stemmer.
from nltk.stem.lancaster import LancasterStemmer
st = LancasterStemmer()
stemmedWords = [st.stem(word) for word in word_tokenize(text2)]
print(stemmedWords)

['mary', 'clos', 'on', 'clos', 'night', 'when', 'she', 'was', 'in', 'the', 'mood', 'to', 'clos']


In [10]:
# Parts-of-Speech tagging
# pos_tag assigns a part-of-speech tag to each token
nltk.pos_tag(word_tokenize(text2))

# We can see a description of every tag using the following:
# nltk.help.upenn_tagset()

[('Mary', 'NNP'),
 ('closed', 'VBD'),
 ('on', 'IN'),
 ('closing', 'NN'),
 ('night', 'NN'),
 ('when', 'WRB'),
 ('she', 'PRP'),
 ('was', 'VBD'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mood', 'NN'),
 ('to', 'TO'),
 ('close', 'VB')]
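
Once the words are tagged, a common next step is to pick out or group words by tag. A minimal sketch, with the tagged list copied from the output above so that it runs without NLTK. The variable names are made up for illustration.

```python
from collections import defaultdict

# Tagged tokens copied from the pos_tag output above
tagged = [("Mary", "NNP"), ("closed", "VBD"), ("on", "IN"), ("closing", "NN"),
          ("night", "NN"), ("when", "WRB"), ("she", "PRP"), ("was", "VBD"),
          ("in", "IN"), ("the", "DT"), ("mood", "NN"), ("to", "TO"),
          ("close", "VB")]

# Penn Treebank noun tags start with "NN" and verb tags start with "VB"
nouns = [word for word, tag in tagged if tag.startswith("NN")]
verbs = [word for word, tag in tagged if tag.startswith("VB")]

# Group the words under each tag
by_tag = defaultdict(list)
for word, tag in tagged:
    by_tag[tag].append(word)

print(nouns)
print(verbs)
print(dict(by_tag))
```

Notice that "closed", "closing" and "close" get different tags here, even though stemming reduced all three to the same root.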

In [11]:
# Word Sense Disambiguation:
# Identifying the meaning of a word based on its context

# NLTK gives us access to WordNet, a lexical database of English
from nltk.corpus import wordnet as wn

# Print every sense of the word 'bass'
for ss in wn.synsets('bass'):
    # Each synset represents one single definition (sense) of a word.
    print(ss, ss.definition())


Synset('bass.n.01') the lowest part of the musical range
Synset('bass.n.02') the lowest part in polyphonic music
Synset('bass.n.03') an adult male singer with the lowest voice
Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
Synset('freshwater_bass.n.01') any of various North American freshwater fish with lean flesh (especially of the genus Micropterus)
Synset('bass.n.06') the lowest adult male singing voice
Synset('bass.n.07') the member with the lowest range of a family of musical instruments
Synset('bass.n.08') nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes
Synset('bass.s.01') having or denoting a low vocal or instrumental range


In [12]:
# lesk is an algorithm for Word Sense Disambiguation
from nltk.wsd import lesk

text3 = "Sing in a lower tone, along with the bass"
word = "bass"

# The lesk function takes (context, word): the context as a list of tokens
# and the ambiguous word. It returns a single synset.
sense1 = lesk(word_tokenize(text3), word)
print(sense1, sense1.definition())


Synset('bass.n.07') the member with the lowest range of a family of musical instruments
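
The core idea of Lesk is to pick the sense whose definition shares the most words with the context. The sketch below is a simplified version of that idea in plain Python, not NLTK's implementation. The glosses are copied from the WordNet output printed earlier, while the `simple_lesk` helper, the lowercasing and the whitespace splitting are assumptions made for illustration.

```python
# Glosses copied from the WordNet output above (a subset of the senses)
glosses = {
    "bass.n.01": "the lowest part of the musical range",
    "bass.n.07": "the member with the lowest range of a family of musical instruments",
    "sea_bass.n.01": "the lean flesh of a saltwater fish of the family Serranidae",
    "bass.n.08": "nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes",
}

def simple_lesk(context, glosses):
    """Pick the sense whose gloss shares the most words with the context."""
    context_words = set(context.lower().replace(",", "").split())
    # Score each sense by the number of distinct words it shares with the context
    scores = {
        sense: len(context_words & set(gloss.lower().split()))
        for sense, gloss in glosses.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

best, scores = simple_lesk("Sing in a lower tone, along with the bass", glosses)
print(best, scores)
```

In this sketch the winning sense is chosen because of shared words such as "the", "with" and "a", which shows how sensitive a plain overlap count is to function words.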


In [13]:
# Another example with the word "bass"
text4 = "This sea bass was really hard to catch"
sense2 = lesk(word_tokenize(text4), word)
print(sense2, sense2.definition())

Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
