# Tokenisation of text into sentences
Code from [Mastering Natural Language Processing with Python](https://www.packtpub.com/big-data-and-business-intelligence/mastering-natural-language-processing-python).

In [37]:
import nltk
# nltk.download()

In [13]:
text="Welcome readers. I hope you find it interesting. Please do reply."

In [14]:
from nltk.tokenize import sent_tokenize

In [15]:
sent_tokenize(text)

['Welcome readers.', 'I hope you find it interesting.', 'Please do reply.']
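To see why a trained sentence tokenizer is worth using, here is a minimal sketch of the obvious alternative: splitting after every `.`, `!` or `?` that is followed by whitespace. The helper name `naive_sent_split` and both sample strings are made up for this illustration and are not part of NLTK. The rule works on simple text but breaks on abbreviations such as `Nov.`, which is the kind of case the Punkt model behind `sent_tokenize` is designed to cope with.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

simple = "Hello everyone. Hope you are all fine."
tricky = "He will join on Nov. 29. He is 59 years old."

print(naive_sent_split(simple))
# ['Hello everyone.', 'Hope you are all fine.']
print(naive_sent_split(tricky))
# ['He will join on Nov.', '29.', 'He is 59 years old.']  <- wrong split after "Nov."
```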

# For a large number of sentences, try this instead

In [16]:
tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')

In [17]:
text='Hello everyone. Hope you are all fine and doing well. Hope that you will find the book interesting.'

In [18]:
tokenizer.tokenize(text)

['Hello everyone.',
 'Hope you are all fine and doing well.',
 'Hope that you will find the book interesting.']

# For splitting sentences into words
This can be done with `word_tokenize`, as shown below:

In [19]:
text=nltk.word_tokenize("PeirreVinken, 59 years old, will join as a nonexecutive director on Nov. 29.")

In [26]:
print(text)

['PeirreVinken', ',', '59', 'years', 'old', ',', 'will', 'join', 'as', 'a', 'nonexecutive', 'director', 'on', 'Nov.', '29', '.']


In [22]:
r=input("Please provide some input text:")

Please provide some input text:So, hello this is my first attempt to use python with nltk. Hope it goes well.


In [25]:
from nltk import word_tokenize
print("The length of this text is", len(word_tokenize(r)), "tokens.")

The length of this text is 19 tokens.
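Note that the count of 19 includes punctuation tokens. If only actual words should be counted, the punctuation has to be filtered out first. The sketch below shows the idea using only the standard library: a simple regular expression stands in for `word_tokenize` (an assumption made so the snippet runs without NLTK data), and it also produces 19 tokens for this input.

```python
import re

text = "So, hello this is my first attempt to use python with nltk. Hope it goes well."

# Stand-in tokenizer: runs of word characters, or runs of punctuation
tokens = re.findall(r"\w+|[^\w\s]+", text)

# Keep only tokens made up entirely of word characters
words = [t for t in tokens if re.fullmatch(r"\w+", t)]

print(len(tokens), "tokens, of which", len(words), "are words")
# 19 tokens, of which 16 are words
```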


# What about other tokenizers
There is the `TreebankWordTokenizer`, which tokenises according to the conventions of the Penn Treebank corpus.

In [28]:
from nltk.tokenize import TreebankWordTokenizer

In [29]:
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("Have a nice day. I hope you find the book interesting"))

['Have', 'a', 'nice', 'day.', 'I', 'hope', 'you', 'find', 'the', 'book', 'interesting']


In [30]:
text=nltk.word_tokenize(" Don't hesitate to ask questions")
print(text)

['Do', "n't", 'hesitate', 'to', 'ask', 'questions']


# What about using regular expressions
Regular expressions can be used to split text on punctuation characters. Luckily, we don't need to write one ourselves just yet, because `WordPunctTokenizer` comes with such a pattern built in:

In [32]:
from nltk.tokenize import WordPunctTokenizer
tokenizer=WordPunctTokenizer()
print(tokenizer.tokenize(" Don't hesitate to ask questions"))

['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']
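Under the hood, `WordPunctTokenizer` is a regular-expression tokenizer with the pattern `\w+|[^\w\s]+`: a token is either a run of word characters or a run of non-word, non-space characters. The sketch below reproduces that behaviour with the standard `re` module; the name `WORD_PUNCT` and the second sample string are chosen for this example only.

```python
import re

# Either a run of word characters, or a run of punctuation characters
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

print(WORD_PUNCT.findall(" Don't hesitate to ask questions"))
# ['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']

# Consecutive punctuation characters stay together in one token
print(WORD_PUNCT.findall("Wait... what?!"))
# ['Wait', '...', 'what', '?!']
```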


But if you want to use regular expressions:

In [33]:
from nltk.tokenize import RegexpTokenizer
tokenizer=RegexpTokenizer(r"[\w']+")  # raw string, so that \w reaches the regex engine unchanged
print(tokenizer.tokenize("Don't hesitate to ask questions"))


["Don't", 'hesitate', 'to', 'ask', 'questions']


Instead of instantiating a class, we can also call the `regexp_tokenize` function directly:

In [35]:
from nltk.tokenize import regexp_tokenize
sent="Don't hesitate to ask questions"
print(regexp_tokenize(sent, pattern=r'\w+|\$[\d\.]+|\S+'))

['Don', "'t", 'hesitate', 'to', 'ask', 'questions']
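The pattern above is built from three alternatives, and the regex engine tries them from left to right at each position, so their order matters. The following sketch uses the standard `re` module with the same pattern on a made-up sentence to show which alternative picks up which token. It is meant as an illustration of the pattern, not of NLTK internals.

```python
import re

PATTERN = r"\w+|\$[\d\.]+|\S+"

tokens = re.findall(PATTERN, "It costs $3.50 or 90.56 %")
print(tokens)
# ['It', 'costs', '$3.50', 'or', '90', '.56', '%']

# '$3.50' is kept whole by the second alternative, because \w+ cannot match '$'.
# '90.56' is split: \w+ matches '90' first, then \S+ picks up the leftover '.56'.
```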


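And here is how we can split on whitespace instead, by telling the tokenizer that the pattern describes the gaps between tokens (`gaps=True`):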

In [36]:
tokenizer=RegexpTokenizer(r'\s+',gaps=True)
print(tokenizer.tokenize("Don't hesitate to ask questions"))

["Don't", 'hesitate', 'to', 'ask', 'questions']
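A pattern can describe either the tokens themselves or the separators between them. As a rough standard-library analogy (an assumption for illustration, not a description of how NLTK is implemented), describing the gaps corresponds to `re.split` and describing the tokens corresponds to `re.findall`. The sample string, with its irregular whitespace, is made up.

```python
import re

sent = "Don't  hesitate to\task questions "

# Pattern describes the gaps: split on whitespace runs, drop empty strings
by_gaps = [t for t in re.split(r"\s+", sent) if t]

# Pattern describes the tokens: find runs of non-whitespace characters
by_matches = re.findall(r"\S+", sent)

print(by_gaps)
# ["Don't", 'hesitate', 'to', 'ask', 'questions']
print(by_gaps == by_matches)
# True
```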


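How about selecting words that start with a capital letter? Note that `\w+` requires at least one further character, so the single-letter `X` is not matched: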

In [38]:
sent=" She secured 90.56 % in class X . She is a meritorious student"
capt = RegexpTokenizer(r'[A-Z]\w+')
print(capt.tokenize(sent))


['She', 'She']
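If single-letter words such as `X` should be included as well, one possible variation is to use `\w*` (zero or more further characters) and to anchor the match at a word boundary with `\b`, so that a capital in the middle of a word does not start a token. The sketch uses the standard `re` module; the pattern is a suggestion for this example rather than something taken from the book.

```python
import re

sent = " She secured 90.56 % in class X . She is a meritorious student"

# Original pattern: a capital letter followed by at least one word character
print(re.findall(r"[A-Z]\w+", sent))
# ['She', 'She']

# Variation: word boundary, a capital letter, then zero or more word characters
capitalised = re.findall(r"\b[A-Z]\w*", sent)
print(capitalised)
# ['She', 'X', 'She']
```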


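And what about using predefined regexes? `BlanklineTokenizer` splits on blank lines; since this text contains none, it comes back as a single token: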

In [40]:
sent=" She secured 90.56 % in class X . She is a meritorious student"
from nltk.tokenize import BlanklineTokenizer
print(BlanklineTokenizer().tokenize(sent))

[' She secured 90.56 % in class X . She is a meritorious student']


Tokenisation using a whitespace tokenizer:

In [41]:
sent=" She secured 90.56 % in class X . She is a meritorious student"
from nltk.tokenize import WhitespaceTokenizer
print(WhitespaceTokenizer().tokenize(sent))

['She', 'secured', '90.56', '%', 'in', 'class', 'X', '.', 'She', 'is', 'a', 'meritorious', 'student']


The built-in `split` method can also be used, either with its default of splitting on any run of whitespace or with an explicit separator:

In [43]:
sent= "She secured 90.56 % in class X. She is a meritorious student"
print(sent.split())     # Splits on any run of whitespace...
print(sent.split(' '))  # ...same result here, because the words are separated by single spaces
sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
print(sent.split('\n'))


['She', 'secured', '90.56', '%', 'in', 'class', 'X.', 'She', 'is', 'a', 'meritorious', 'student']
['She', 'secured', '90.56', '%', 'in', 'class', 'X.', 'She', 'is', 'a', 'meritorious', 'student']
[' She secured 90.56 % in class X ', '. She is a meritorious student', '']
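The two calls only agree when every word is separated by exactly one space. The short sketch below, on a made-up string with a double space, a tab and a trailing newline, shows where `split()` and `split(' ')` diverge.

```python
sent = "She  secured 90.56 %\tin class X.\n"

# No argument: split on any run of whitespace and drop empty strings
default_split = sent.split()
print(default_split)
# ['She', 'secured', '90.56', '%', 'in', 'class', 'X.']

# Explicit ' ': split on every single space, leave tabs and newlines alone
space_split = sent.split(' ')
print(space_split)
# ['She', '', 'secured', '90.56', '%\tin', 'class', 'X.\n']
```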


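`SpaceTokenizer` behaves like `sent.split(' ')`: it splits on single space characters only, so newlines stay attached to tokens and a leading space produces an empty string: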

In [45]:
sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
from nltk.tokenize import SpaceTokenizer
print(SpaceTokenizer().tokenize(sent))

['', 'She', 'secured', '90.56', '%', 'in', 'class', 'X', '\n.', 'She', 'is', 'a', 'meritorious', 'student\n']


What if all we want is to split the text into lines? `LineTokenizer` does that, whereas `BlanklineTokenizer` only splits on blank lines:

In [44]:
from nltk.tokenize import BlanklineTokenizer
sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
print(BlanklineTokenizer().tokenize(sent))
from nltk.tokenize import LineTokenizer
print(LineTokenizer(blanklines='keep').tokenize(sent))
print(LineTokenizer(blanklines='discard').tokenize(sent))

[' She secured 90.56 % in class X \n. She is a meritorious student\n']
[' She secured 90.56 % in class X ', '. She is a meritorious student']
[' She secured 90.56 % in class X ', '. She is a meritorious student']
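The sample text above has no blank lines, so the differences between these tokenizers are hard to see. The sketch below uses a made-up two-paragraph string and standard-library calls that approximate the three behaviours. The blank-line pattern is my understanding of what `BlanklineTokenizer` uses, so treat it as an assumption.

```python
import re

doc = "First paragraph, line one.\nFirst paragraph, line two.\n\nSecond paragraph.\n"

# Roughly LineTokenizer(blanklines='keep'): one token per line
lines = doc.splitlines()

# Roughly LineTokenizer(blanklines='discard'): drop lines that are empty or whitespace
non_blank = [ln for ln in lines if ln.strip()]

# Roughly BlanklineTokenizer: split wherever a blank line occurs
paragraphs = [p for p in re.split(r"\s*\n\s*\n\s*", doc) if p]

print(len(lines), len(non_blank), len(paragraphs))
# 4 3 2
```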


# Normalisation

Normalisation is the process of bringing text into a consistent form. It typically involves:
1. eliminating punctuation,
2. converting the entire text into lowercase or uppercase,
3. changing numbers into words,
4. expanding abbreviations,
5. canonicalising the text,

and so on.

Let's start with punctuation:

In [47]:
import re
import string

from nltk.tokenize import word_tokenize

text=[" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty."]
tokenized_docs=[word_tokenize(doc) for doc in text]

# Character class that matches any single ASCII punctuation character
punctuation_re=re.compile('[%s]' % re.escape(string.punctuation))

tokenized_docs_no_punctuation = []
for doc_tokens in tokenized_docs:
    new_tokens = []
    for token in doc_tokens:
        new_token = punctuation_re.sub('', token)  # strip punctuation from the token
        if new_token:  # skip tokens that consisted of punctuation only
            new_tokens.append(new_token)
    tokenized_docs_no_punctuation.append(new_tokens)
print(tokenized_docs_no_punctuation)

[['It', 'is', 'a', 'pleasant', 'evening'], ['Guests', 'who', 'came', 'from', 'US', 'arrived', 'at', 'the', 'venue'], ['Food', 'was', 'tasty']]
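Two of the other steps listed above, converting to lowercase and expanding abbreviations, could be sketched in the same spirit. Everything named here is hypothetical: the `ABBREVIATIONS` table is a deliberately tiny lookup invented for this example, and `normalise` is not an NLTK function. The sketch also uses `str.translate` as an alternative to the regular expression for removing punctuation.

```python
import string

# Hypothetical, deliberately tiny lookup table; a real one would be domain-specific
ABBREVIATIONS = {"US": "United States", "Nov.": "November"}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalise(tokens):
    """Expand known abbreviations, strip punctuation and lowercase each token."""
    result = []
    for token in tokens:
        token = ABBREVIATIONS.get(token, token)  # expand before lowercasing
        token = token.translate(PUNCT_TABLE).lower()
        if token:  # skip tokens that consisted of punctuation only
            result.append(token)
    return result

tokens = ['Guests', ',', 'who', 'came', 'from', 'US', 'arrived', 'at', 'the', 'venue']
print(normalise(tokens))
# ['guests', 'who', 'came', 'from', 'united states', 'arrived', 'at', 'the', 'venue']
```

Expanding abbreviations before lowercasing matters here, because the lookup is case-sensitive: after lowercasing, `US` would be indistinguishable from the pronoun `us`.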
