### Natural Language Processing with Python - NLTK

Installing the NLTK package: http://www.nltk.org/install.html

In [1]:
import nltk

**Installing NLTK data files (click "Download" when prompted)**

In [2]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Tokenization

Tokenization is the process of splitting a string into a list of smaller pieces, or "tokens". What counts as a token depends on the level of analysis: a word is a token of a sentence, and a sentence is a token of a paragraph.

In [3]:
from nltk.tokenize import sent_tokenize

import nltk.data

**Dividing a paragraph into sentences**

In [4]:
paragraph_en = 'Hi. Good to know that you are learning PLN. Thank you for being with us.'
paragraph_es = 'Hola. Es bueno saber que estás aprendiendo PLN. Gracias por estar con nosotros.'

In [5]:
sent_tokenize(paragraph_en)

['Hi.',
 'Good to know that you are learning PLN.',
 'Thank you for being with us.']

In [6]:
sent_tokenize(paragraph_es)

['Hola.',
 'Es bueno saber que estás aprendiendo PLN.',
 'Gracias por estar con nosotros.']

In [7]:
tokenizer_en = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
tokenizer_es = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

In [8]:
tokenizer_en.tokenize(paragraph_en)

['Hi.',
 'Good to know that you are learning PLN.',
 'Thank you for being with us.']

In [9]:
tokenizer_es.tokenize(paragraph_es)

['Hola.',
 'Es bueno saber que estás aprendiendo PLN.',
 'Gracias por estar con nosotros.']
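To see why a trained sentence tokenizer is useful, it helps to compare it with a naive rule. The sketch below uses only the standard `re` module; `naive_sent_split` is a hypothetical helper, not part of NLTK. It splits after `.`, `!` or `?` followed by whitespace, which works for the simple paragraph used here but breaks on an abbreviation.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

paragraph_en = 'Hi. Good to know that you are learning PLN. Thank you for being with us.'
print(naive_sent_split(paragraph_en))
# ['Hi.', 'Good to know that you are learning PLN.', 'Thank you for being with us.']

# An abbreviation is enough to break the naive rule.
print(naive_sent_split('Dr. Smith arrived. He was late.'))
# ['Dr.', 'Smith arrived.', 'He was late.']
```

Punkt-style tokenizers learn which periods belong to abbreviations, so they can avoid the kind of split shown in the second call.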

In [10]:
tokenizer_en

<nltk.tokenize.punkt.PunktSentenceTokenizer at 0x7f0070bcbe20>

In [11]:
tokenizer_es

<nltk.tokenize.punkt.PunktSentenceTokenizer at 0x7f0070bcb460>

**Dividing a sentence into words**

In [12]:
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import word_tokenize

In [13]:
word_tokenize('Data Science Rocks!')

['Data', 'Science', 'Rocks', '!']

In [14]:
tw_tokenizer = TreebankWordTokenizer() 

In [15]:
tw_tokenizer.tokenize('Hello my friend.')

['Hello', 'my', 'friend', '.']

In [16]:
word_tokenize("I can't do that.")

['I', 'ca', "n't", 'do', 'that', '.']

In [17]:
wp_tokenizer = WordPunctTokenizer()

In [18]:
wp_tokenizer.tokenize("I can't do that.")

['I', 'can', "'", 't', 'do', 'that', '.']

In [19]:
re_tokenizer = RegexpTokenizer(r"[\w']+")

In [20]:
re_tokenizer.tokenize("I can't do that.")

['I', "can't", 'do', 'that']

In [21]:
regexp_tokenize("I can't do that.", r"[\w']+")

['I', "can't", 'do', 'that']

In [22]:
re_tokenizer = RegexpTokenizer(r'\s+', gaps=True)

In [23]:
re_tokenizer.tokenize("I can't do that.")

['I', "can't", 'do', 'that.']
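The two regular-expression modes differ in what the pattern describes: the tokens themselves, or the gaps between them. As a rough sketch with the standard `re` module (an analogy for the two modes, not NLTK's own implementation), matching corresponds to `re.findall` and gap mode to `re.split`:

```python
import re

sentence = "I can't do that."

# The pattern describes the tokens themselves (matching mode).
tokens_by_match = re.findall(r"[\w']+", sentence)

# The pattern describes the separators between tokens (gap mode).
tokens_by_gaps = [t for t in re.split(r"\s+", sentence) if t]

print(tokens_by_match)  # ['I', "can't", 'do', 'that']
print(tokens_by_gaps)   # ['I', "can't", 'do', 'that.']
```

In gap mode the final period stays attached to `that.` because only whitespace is treated as a separator.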

### Training a Tokenizer

In [24]:
from nltk.tokenize import PunktSentenceTokenizer
from nltk.tokenize import sent_tokenize
from nltk.corpus import webtext

**The NLTK corpus files are located at /home/caio/nltk_data/corpora/webtext**

In [25]:
file = webtext.raw('overheard.txt')

In [26]:
ps_tokenizer = PunktSentenceTokenizer(file)

In [27]:
ps_tokenizer

<nltk.tokenize.punkt.PunktSentenceTokenizer at 0x7f0070bcb700>

In [28]:
sentences_ps = ps_tokenizer.tokenize(file)

In [29]:
sentences_ps[0]

'White guy: So, do you have any plans for this evening?'

In [30]:
sentences_st = sent_tokenize(file)

In [31]:
sentences_st[0]

'White guy: So, do you have any plans for this evening?'

In [32]:
sentences_st[678]

'Girl: But you already have a Big Mac...\nHobo: Oh, this is all theatrical.'

In [33]:
sentences_ps[678]

'Girl: But you already have a Big Mac...'

**Using the file path**

In [34]:
with open('/home/caio/nltk_data/corpora/webtext/overheard.txt', encoding='ISO-8859-2') as file:
    file_text = file.read()

In [35]:
ps_tokenizer = PunktSentenceTokenizer(file_text)

In [36]:
sentences_ps = ps_tokenizer.tokenize(file_text)

In [37]:
sentences_ps[0]

'White guy: So, do you have any plans for this evening?'

In [38]:
sentences_ps[678]

'Girl: But you already have a Big Mac...'

### Stopwords

Stopwords are common words that usually contribute little to the meaning of a sentence, at least for the purposes of information retrieval and natural language processing. Examples are "the" and "a". Many search engines filter these words out to save space in their search indexes.

In [39]:
from nltk.corpus import stopwords

In [40]:
stops_en = set(stopwords.words('english'))

In [41]:
sentence_words = ["Can't", 'is', 'a', 'contraction']

In [42]:
[valid_word for valid_word in sentence_words if valid_word not in stops_en]

["Can't", 'contraction']

In [43]:
stops_pt = set(stopwords.words('portuguese'))

In [44]:
sentence_words = ['Data', 'Science', 'é', 'um', 'assunto', 'interessante']

In [45]:
[valid_word for valid_word in sentence_words if valid_word not in stops_pt]

['Data', 'Science', 'assunto', 'interessante']
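The membership test used in these filters is case-sensitive, and the stopword lists shown in this section are lowercase, so a capitalized stopword would slip through. A minimal sketch of a case-insensitive filter follows; the tiny `stops` set and the `remove_stopwords` helper are hypothetical, chosen only for illustration.

```python
# Hypothetical, tiny stopword set; a real list is much longer.
stops = {'the', 'a', 'is', 'of'}

def remove_stopwords(words, stopword_set):
    """Keep words whose lowercase form is not a stopword."""
    return [w for w in words if w.lower() not in stopword_set]

words = ['The', 'meaning', 'of', 'a', 'sentence']

# Case-sensitive test: 'The' is kept because only 'the' is in the set.
print([w for w in words if w not in stops])  # ['The', 'meaning', 'sentence']

# Case-insensitive test: 'The' is removed as well.
print(remove_stopwords(words, stops))        # ['meaning', 'sentence']
```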

**Available stopword languages**

In [46]:
print(stopwords.fileids())

['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']


**Portuguese stopwords**

In [47]:
print(stopwords.words('portuguese'))

['de', 'a', 'o', 'que', 'e', 'é', 'do', 'da', 'em', 'um', 'para', 'com', 'não', 'uma', 'os', 'no', 'se', 'na', 'por', 'mais', 'as', 'dos', 'como', 'mas', 'ao', 'ele', 'das', 'à', 'seu', 'sua', 'ou', 'quando', 'muito', 'nos', 'já', 'eu', 'também', 'só', 'pelo', 'pela', 'até', 'isso', 'ela', 'entre', 'depois', 'sem', 'mesmo', 'aos', 'seus', 'quem', 'nas', 'me', 'esse', 'eles', 'você', 'essa', 'num', 'nem', 'suas', 'meu', 'às', 'minha', 'numa', 'pelos', 'elas', 'qual', 'nós', 'lhe', 'deles', 'essas', 'esses', 'pelas', 'este', 'dele', 'tu', 'te', 'vocês', 'vos', 'lhes', 'meus', 'minhas', 'teu', 'tua', 'teus', 'tuas', 'nosso', 'nossa', 'nossos', 'nossas', 'dela', 'delas', 'esta', 'estes', 'estas', 'aquele', 'aquela', 'aqueles', 'aquelas', 'isto', 'aquilo', 'estou', 'está', 'estamos', 'estão', 'estive', 'esteve', 'estivemos', 'estiveram', 'estava', 'estávamos', 'estavam', 'estivera', 'estivéramos', 'esteja', 'estejamos', 'estejam', 'estivesse', 'estivéssemos', 'estivessem', 'estiver', 'estiv

### WordNet

WordNet is a lexical database of the English language. It works like a dictionary designed for use in natural language processing.

In [48]:
from nltk.corpus import wordnet

In [49]:
syn = wordnet.synsets('cookbook')[0]

In [50]:
syn.name()

'cookbook.n.01'

In [51]:
syn.definition()

'a book of recipes and cooking directions'

In [52]:
wordnet.synsets('cooking')[0].examples()

['cooking can be a great art',
 'people are needed who have experience in cookery',
 'he left the preparation of meals to his wife']

### Collocations

Collocations are two or more words that tend to appear together frequently, such as "United States" or "Rio Grande do Sul". Because the same words can combine in different ways, the context in which they occur also matters in natural language processing.

In [53]:
from nltk.collocations import BigramCollocationFinder
from nltk.corpus import stopwords
from nltk.corpus import webtext
from nltk.metrics import BigramAssocMeasures

In [54]:
words_lower = [word.lower() for word in webtext.words('grail.txt')]

In [55]:
bcf = BigramCollocationFinder.from_words(words_lower)

In [56]:
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)

[("'", 's'), ('arthur', ':'), ('#', '1'), ("'", 't')]

In [57]:
stop_words = set(stopwords.words('english'))

In [58]:
bcf.apply_word_filter(lambda word: len(word) < 3 or word in stop_words)

In [59]:
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)

[('black', 'knight'),
 ('clop', 'clop'),
 ('head', 'knight'),
 ('mumble', 'mumble')]
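The effect of the word filter can be illustrated without NLTK. The sketch below counts adjacent word pairs with `collections.Counter` and drops every pair in which any word is short or a stopword. It ranks by raw frequency rather than by likelihood ratio, and the token list and stopword set are hypothetical toy data.

```python
from collections import Counter

# Hypothetical toy token list and stopword set, for illustration only.
tokens = ['the', 'black', 'knight', 'and', 'the', 'black', 'knight',
          'of', 'the', 'castle']
stops = {'the', 'and', 'of'}

# Adjacent pairs of tokens are the candidate bigrams.
bigrams = Counter(zip(tokens, tokens[1:]))

def keep(bigram):
    """Keep a bigram only if none of its words is short or a stopword."""
    return all(len(w) >= 3 and w not in stops for w in bigram)

filtered = Counter({bg: n for bg, n in bigrams.items() if keep(bg)})

print(bigrams.most_common(2))   # both top pairs occur twice
print(filtered.most_common(1))  # [(('black', 'knight'), 2)]
```

Before filtering, `('the', 'black')` is as frequent as `('black', 'knight')`; after filtering, only the informative pair remains.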

### Stemming Words

Stemming is the technique of removing suffixes and prefixes from a word, leaving what is called its "stem". For example, the stem of the word "cooking" is "cook". A good algorithm knows that "ing" is a suffix and can be removed.<br />
Stemming is widely used in search engines for indexing words. Instead of storing every form of a word, a search engine stores only its stem, which reduces the index size and speeds up searches.
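The indexing idea can be sketched in a few lines of plain Python. The `naive_stem` function below is a toy stand-in for a real stemmer (it strips a single suffix), and the `docs` dictionary is hypothetical data; the point is only that several word forms end up under one index key.

```python
import re
from collections import defaultdict

def naive_stem(word):
    """Strip one common English suffix; a toy stand-in for a real stemmer."""
    return re.sub(r"(ing|ed|s)$", "", word.lower())

# Hypothetical documents: id -> text.
docs = {1: 'cooking dinner', 2: 'she cooked', 3: 'two cooks'}

# Map each stem to the set of documents that contain one of its forms.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[naive_stem(word)].add(doc_id)

print(sorted(index['cook']))                 # [1, 2, 3]
print(sorted(index[naive_stem('cooking')]))  # [1, 2, 3]
```

A query for "cooking" is stemmed the same way before lookup, so it finds all three documents through a single entry.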

In [60]:
from nltk.stem import LancasterStemmer
from nltk.stem import PorterStemmer
from nltk.stem import RegexpStemmer
from nltk.stem import SnowballStemmer

In [61]:
stemmer = PorterStemmer()

In [62]:
stemmer.stem('eating')

'eat'

In [63]:
stemmer.stem('generously')

'gener'

In [64]:
stemmer = LancasterStemmer()

In [65]:
stemmer.stem('eating')

'eat'

In [66]:
stemmer.stem('generously')

'gen'

In [67]:
stemmer = RegexpStemmer('ing')

In [68]:
stemmer.stem('eating')

'eat'
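A pattern such as `'ing'` is not anchored to the end of the word. Assuming the pattern is applied as an ordinary regular-expression substitution (which appears to be how `RegexpStemmer` behaves), every occurrence is removed, not only the suffix. The sketch below shows the difference using the standard `re` module directly.

```python
import re

word = 'singing'

# Unanchored pattern: every occurrence of 'ing' is removed.
print(re.sub('ing', '', word))   # 's'

# Anchored pattern: only a trailing 'ing' is removed.
print(re.sub('ing$', '', word))  # 'sing'
```

Anchoring the pattern with `$` is usually the safer choice when the goal is to strip a suffix.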

In [69]:
print(SnowballStemmer.languages)

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


In [70]:
stemmer = SnowballStemmer('english')

In [71]:
stemmer.stem('eating')

'eat'

In [72]:
stemmer.stem('generously')

'generous'

### Corpus

A corpus is a collection of text documents; "corpora" is the plural of "corpus".<br />
The term comes from the Latin word for "body" (in this case, the body of a text). A custom corpus is a collection of text files organized in a directory.

To train a custom model as part of a text classification process (such as text analysis), you need to create your own corpus and train the model on it.

In [73]:
from nltk.corpus import brown
from nltk.corpus.reader import WordListCorpusReader
from nltk.tokenize import line_tokenize

**Creating a custom Corpus**

In [74]:
reader = WordListCorpusReader('.', ['aux/custom-corpus.txt'])

In [75]:
reader.words()

['Big Data', 'Data Science', 'Artificial Intelligence', 'Deep Learning']

In [76]:
reader.fileids()

['aux/custom-corpus.txt']

In [77]:
reader.raw()

'Big Data\nData Science\nArtificial Intelligence\nDeep Learning\n'

In [78]:
line_tokenize(reader.raw())

['Big Data', 'Data Science', 'Artificial Intelligence', 'Deep Learning']
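A word-list corpus of this kind is just a text file with one entry per line. The following sketch creates such a file in a temporary directory and reads it back with the standard library only; the file name and entries mirror the example above but are otherwise assumptions, and the line splitting is a simplified equivalent of what a line tokenizer does.

```python
import tempfile
from pathlib import Path

# Hypothetical corpus content, one entry per line.
entries = ['Big Data', 'Data Science', 'Artificial Intelligence', 'Deep Learning']

with tempfile.TemporaryDirectory() as root:
    corpus_file = Path(root) / 'custom-corpus.txt'
    corpus_file.write_text('\n'.join(entries) + '\n', encoding='utf-8')

    # Read the raw text back, as a corpus reader would.
    raw = corpus_file.read_text(encoding='utf-8')

# Split on line breaks and discard blank lines.
words = [line for line in raw.splitlines() if line.strip()]
print(words)
# ['Big Data', 'Data Science', 'Artificial Intelligence', 'Deep Learning']
```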

In [79]:
print(brown.categories())

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
