### Natural Language Processing with Python - NLTK

Installing the NLTK package: http://www.nltk.org/install.html

In [1]:
import nltk

**Installing NLTK data files (click "Download" when prompted)**

In [2]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Tokenization

Tokenization is the process of splitting a string into a list of chunks, or "tokens", where each token is a complete unit at a given level of granularity. For example, a word is a token in a sentence, and a sentence is a token in a paragraph.
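To make the two levels of granularity concrete, here is a rough sketch that uses only Python's standard `re` module. The helper names `naive_sentences` and `naive_words` are hypothetical, and the regular expressions are deliberately simplistic. This is an illustration of the idea, not how NLTK tokenizes.

```python
import re

def naive_sentences(paragraph):
    # Split after '.', '!' or '?' when followed by whitespace.
    # Assumption: no abbreviations such as "Dr." appear in the text.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def naive_words(sentence):
    # A word is a run of word characters; any other non-space character
    # (for example a punctuation mark) becomes its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

paragraph = "Tokens come in levels. A sentence is a token of a paragraph!"
sentences = naive_sentences(paragraph)        # paragraph -> sentence tokens
words = [naive_words(s) for s in sentences]   # sentence -> word tokens

print(sentences)
print(words)
```

The same text yields two different token lists depending on the unit chosen, which is why NLTK provides separate sentence and word tokenizers.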

In [3]:
from nltk.tokenize import sent_tokenize

import nltk.data

**Dividing a paragraph into sentences**

In [4]:
paragraph_en = 'Hi. Good to know that you are learning PLN. Thank you for being with us.'
paragraph_es = 'Hola. Es bueno saber que estás aprendiendo PLN. Gracias por estar con nosotros.'

In [5]:
sent_tokenize(paragraph_en)

['Hi.',
 'Good to know that you are learning PLN.',
 'Thank you for being with us.']

In [6]:
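# sent_tokenize defaults to language='english'; the English model
# happens to split this simple Spanish paragraph correctly.
sent_tokenize(paragraph_es)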

['Hola.',
 'Es bueno saber que estás aprendiendo PLN.',
 'Gracias por estar con nosotros.']

In [7]:
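# Load the pre-trained Punkt sentence tokenizers for each language directly.
# These resource paths depend on the installed NLTK data and may differ
# in other NLTK versions.
tokenizer_en = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
tokenizer_es = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')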

In [8]:
tokenizer_en.tokenize(paragraph_en)

['Hi.',
 'Good to know that you are learning PLN.',
 'Thank you for being with us.']

In [9]:
tokenizer_es.tokenize(paragraph_es)

['Hola.',
 'Es bueno saber que estás aprendiendo PLN.',
 'Gracias por estar con nosotros.']

In [10]:
tokenizer_en

<nltk.tokenize.punkt.PunktSentenceTokenizer at 0x7f2ae8e3b100>

In [11]:
tokenizer_es

<nltk.tokenize.punkt.PunktSentenceTokenizer at 0x7f2abfeb2790>

**Dividing a sentence into words**

In [12]:
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import word_tokenize

In [13]:
word_tokenize('Data Science Rocks!')

['Data', 'Science', 'Rocks', '!']

In [14]:
# Penn Treebank conventions: punctuation and contractions become separate tokens
tw_tokenizer = TreebankWordTokenizer()

In [15]:
tw_tokenizer.tokenize('Hello my friend.')

['Hello', 'my', 'friend', '.']

In [16]:
word_tokenize("I can't do that.")

['I', 'ca', "n't", 'do', 'that', '.']

In [17]:
wp_tokenizer = WordPunctTokenizer()

In [18]:
wp_tokenizer.tokenize("I can't do that.")

['I', 'can', "'", 't', 'do', 'that', '.']
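The split of `can't` into three tokens follows from the pattern that the NLTK documentation gives for `WordPunctTokenizer`, `\w+|[^\w\s]+`: runs of word characters alternate with runs of punctuation. A minimal sketch with the standard `re` module, assuming that documented pattern (the constant and function names here are illustrative, not part of NLTK), reproduces the output above:

```python
import re

# Pattern as given in the NLTK documentation for WordPunctTokenizer:
# either a run of word characters, or a run of non-word, non-space characters.
WORD_PUNCT_PATTERN = r"\w+|[^\w\s]+"

def word_punct_tokens(text):
    # findall returns every non-overlapping match, left to right.
    return re.findall(WORD_PUNCT_PATTERN, text)

print(word_punct_tokens("I can't do that."))
# ['I', 'can', "'", 't', 'do', 'that', '.']
```

Because the apostrophe is neither a word character nor whitespace, it always ends up as a token of its own, which is the main practical difference from `word_tokenize`.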

In [19]:
# Raw string avoids the invalid "\w" escape warning in recent Python versions
re_tokenizer = RegexpTokenizer(r"[\w']+")

In [20]:
re_tokenizer.tokenize("I can't do that.")

['I', "can't", 'do', 'that']

In [21]:
regexp_tokenize("I can't do that.", r"[\w']+")

['I', "can't", 'do', 'that']

In [22]:
# gaps=True: the pattern describes the separators between tokens, not the tokens
re_tokenizer = RegexpTokenizer(r'\s+', gaps=True)

In [23]:
re_tokenizer.tokenize("I can't do that.")

['I', "can't", 'do', 'that.']
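The two `RegexpTokenizer` configurations above differ in what the pattern describes: by default it matches the tokens themselves, while `gaps=True` makes it match the separators between tokens. As a rough standard-library analogy (not NLTK's actual implementation, and with function names made up for illustration), the first behaves like `re.findall` and the second like `re.split`:

```python
import re

def tokens_by_match(text, pattern):
    # The pattern describes what a token looks like.
    return re.findall(pattern, text)

def tokens_by_gaps(text, pattern):
    # The pattern describes what separates tokens; drop empty pieces.
    return [t for t in re.split(pattern, text) if t]

text = "I can't do that."
print(tokens_by_match(text, r"[\w']+"))  # ['I', "can't", 'do', 'that']
print(tokens_by_gaps(text, r"\s+"))      # ['I', "can't", 'do', 'that.']
```

Matching tokens drops the final period because `.` is not part of the pattern, whereas splitting on whitespace keeps it attached to `that.`.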

### Training a Tokenizer