# Text Tokenization

"the process of breaking down or splitting textual data into smaller meaningful components called tokens"

## Sentence Tokenization/Segmentation

Basic approach: split on sentence separators (`.`, `\n`, `;`, etc.)
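A minimal sketch of that basic approach, using only the standard library. The helper name `naive_sent_split` and the sample string are illustrative assumptions, not part of NLTK. It also shows why a plain separator split is not enough: the abbreviation "Dr." is wrongly treated as a sentence end.

```python
import re

def naive_sent_split(text):
    # Split after ., !, ? or ; when followed by whitespace, and on newlines.
    parts = re.split(r'(?<=[.!?;])\s+|\n+', text)
    # Drop empty fragments and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

sentences = naive_sent_split("Dr. Smith arrived. He sat down!\nThen he left.")
print(sentences)
# ['Dr.', 'Smith arrived.', 'He sat down!', 'Then he left.']
```

Trained tokenizers such as Punkt, used in the cells that follow, exist largely to handle cases like the abbreviation above.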

In [1]:
import nltk
from nltk.corpus import gutenberg
# from pprint import pprint
def pprint(l):
    
    for i in l:
        print('\t> ', i, ' <')

In [2]:
alice = gutenberg.raw(fileids="carroll-alice.txt")

In [3]:
sample_text = "We will discuss briefly about the basic syntax, structure and design philosophies. There is a defined hierarchical syntax for Python code which you should remember when writing code! Python is a really powerful programming language!"

In [4]:
len(alice)

144395

In [5]:
print(alice[:100])

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was


In [6]:
default_st = nltk.sent_tokenize
alice_sentences = default_st(text=alice)
sample_sentences = default_st(text=sample_text)

In [7]:
print("Total sentences in sample text:", len(sample_sentences))
print("Sample text sentences:")
pprint(sample_sentences)
print("\nTotal sentences in alice:", len(alice_sentences))
print("First 5 sentences in alice:")
pprint(alice_sentences[:5])

Total sentences in sample text: 3
Sample text sentences:
	>  We will discuss briefly about the basic syntax, structure and design philosophies.  <
	>  There is a defined hierarchical syntax for Python code which you should remember when writing code!  <
	>  Python is a really powerful programming language!  <

Total sentences in alice: 1625
First 5 sentences in alice:
	>  [Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I.  <
	>  Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'  <
	>  So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the

In [8]:
from nltk.corpus import europarl_raw

In [9]:
german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
print(len(german_text))
print(german_text[:100])

157171
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit


In [10]:
german_sentences_def = default_st(text=german_text, language='german')

german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
german_sentences = german_tokenizer.tokenize(german_text)

print(type(german_tokenizer))

<class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>


In [11]:
print(german_sentences_def == german_sentences)

True


In [12]:
pprint(german_sentences[:5])

	>   
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .  <
	>  Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .  <
	>  Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .  <
	>  Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .  <
	>  Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .  <


In [13]:
portuguese_text = """Na conversa que teve na noite da última segunda-feira (26) com o presidente Michel Temer, o ministro Raul Jungmann pediu "carta branca" para fazer as mudanças que julgasse necessárias nos cargos vinculados ao novo Ministério da Segurança Pública. Temer, então, deu a "carta branca"."""

In [14]:
pprint(default_st(text=portuguese_text, language='portuguese'))

	>  Na conversa que teve na noite da última segunda-feira (26) com o presidente Michel Temer, o ministro Raul Jungmann pediu "carta branca" para fazer as mudanças que julgasse necessárias nos cargos vinculados ao novo Ministério da Segurança Pública.  <
	>  Temer, então, deu a "carta branca".  <


In [15]:
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'

# gaps=True: the regex matches the gaps (separators) between tokens, not the tokens themselves
regex_st = nltk.tokenize.RegexpTokenizer(pattern=SENTENCE_TOKENS_PATTERN, gaps=True)

pprint(regex_st.tokenize(sample_text))

	>  We will discuss briefly about the basic syntax, structure and design philosophies.  <
	>  There is a defined hierarchical syntax for Python code which you should remember when writing code!  <
	>  Python is a really powerful programming language!  <


In [16]:
pprint(regex_st.tokenize(portuguese_text))

	>  Na conversa que teve na noite da última segunda-feira (26) com o presidente Michel Temer, o ministro Raul Jungmann pediu "carta branca" para fazer as mudanças que julgasse necessárias nos cargos vinculados ao novo Ministério da Segurança Pública.  <
	>  Temer, então, deu a "carta branca".  <
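To see what the lookbehinds in `SENTENCE_TOKENS_PATTERN` are doing, here is a small sketch that applies the same pattern with `re.split`, which is roughly what a gap-based tokenizer does. The sample sentence is an assumption chosen to trigger the abbreviation guards.

```python
import re

SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'

text = "Mr. Smith met J. Doe. They left."

# (?<![A-Z][a-z]\.) blocks a split after "Mr."
# (?<![A-Z]\.)      blocks a split after the initial "J."
# (?<=\.|\?|\!)     requires sentence-final punctuation before the whitespace
parts = re.split(SENTENCE_TOKENS_PATTERN, text)
print(parts)
# ['Mr. Smith met J. Doe.', 'They left.']
```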


# Word Tokenization

"the process of splitting or segmenting sentences into their constituent words"

In [17]:
sentence = "The brown fox wasn't that quick and he couldn't win the race."

default_wt = nltk.word_tokenize
words = default_wt(sentence)

print(words)

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race', '.']


In [18]:
TOKEN_PATTERN = r'\w+'

regex_wt_token = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN, gaps=False)

words_token = regex_wt_token.tokenize(sentence)
print(words_token)

['The', 'brown', 'fox', 'wasn', 't', 'that', 'quick', 'and', 'he', 'couldn', 't', 'win', 'the', 'race']


In [19]:
GAP_PATTERN = r'\s+'

regex_wt_gaps = nltk.RegexpTokenizer(pattern=GAP_PATTERN, gaps=True)

words_gaps = regex_wt_gaps.tokenize(sentence)
print(words_gaps)

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race.']
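The difference between matching tokens (`gaps=False`) and matching gaps (`gaps=True`) can be sketched with the standard library alone: `re.findall` returns what the pattern matches, while `re.split` returns what lies between the matches. This is an approximation of the two modes for illustration, not NLTK's implementation.

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the race."

# Pattern describes the tokens: keep every run of word characters.
tokens_by_match = re.findall(r'\w+', sentence)

# Pattern describes the gaps: keep everything between runs of whitespace.
tokens_by_gap = [t for t in re.split(r'\s+', sentence) if t]

print(tokens_by_match)
print(tokens_by_gap)
```

The first list breaks contractions apart and drops punctuation; the second keeps `wasn't` whole but leaves the final period attached to `race.`.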


In [20]:
'¬'.join(words_token), '¬'.join(words_gaps)

('The¬brown¬fox¬wasn¬t¬that¬quick¬and¬he¬couldn¬t¬win¬the¬race',
 "The¬brown¬fox¬wasn't¬that¬quick¬and¬he¬couldn't¬win¬the¬race.")

In [21]:
word_indices = list(regex_wt_token.span_tokenize(sentence))

print(word_indices)
print([sentence[start:end] for start, end in word_indices])

[(0, 3), (4, 9), (10, 13), (14, 18), (19, 20), (21, 25), (26, 31), (32, 35), (36, 38), (39, 45), (46, 47), (48, 51), (52, 55), (56, 60)]
['The', 'brown', 'fox', 'wasn', 't', 'that', 'quick', 'and', 'he', 'couldn', 't', 'win', 'the', 'race']


In [22]:
wordpunkt_wt = nltk.WordPunctTokenizer() # r'\w+|[^\w\s]+'

words = wordpunkt_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', 'wasn', "'", 't', 'that', 'quick', 'and', 'he', 'couldn', "'", 't', 'win', 'the', 'race', '.']


In [23]:
whitespace_wt = nltk.WhitespaceTokenizer()

words = whitespace_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race.']


# Text Normalization/Cleansing/Wrangling

"series of steps that should be followed to wrangle, clean, and standardize textual data into a form that could be consumed by other NLP and analytics systems and applications as input."

Typical steps: tokenization, cleaning, case conversion, spelling correction, stopword removal, stemming, and lemmatization.

In [24]:
import nltk
import re
import string

corpus = ["The brown fox wasn't that quick and he couldn't win the race",
          "Hey that's a great deal! I just bought a phone for $199",
          "@@You'll (learn) a **lot** in the book. Python is an amazing language!@@"]

## Cleaning Text

Removing extraneous and unnecessary tokens and characters before further operations such as tokenization.

Example: removing HTML tags (*clean_html()* or BeautifulSoup)
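As a dependency-free illustration of tag removal, the sketch below uses the standard library's `html.parser`. The names `_TextExtractor` and `strip_html` are hypothetical helpers. For real pages a dedicated library such as BeautifulSoup is likely the better choice, since this version also keeps the contents of `<script>` and `<style>` elements.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(markup):
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join the text nodes and normalize runs of whitespace.
    return ' '.join(' '.join(parser.chunks).split())

cleaned = strip_html("<p>Python is an <b>amazing</b> language!</p>")
print(cleaned)
# Python is an amazing language!
```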

## Tokenizing Text

In [25]:
def tokenize_text(text):
    
    sentences = nltk.sent_tokenize(text)
    word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]
    
    return word_tokens

In [26]:
token_list = [tokenize_text(text) for text in corpus]

pprint(token_list)

	>  [['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']]  <
	>  [['Hey', 'that', "'s", 'a', 'great', 'deal', '!'], ['I', 'just', 'bought', 'a', 'phone', 'for', '$', '199']]  <
	>  [['@', '@', 'You', "'ll", '(', 'learn', ')', 'a', '**lot**', 'in', 'the', 'book', '.'], ['Python', 'is', 'an', 'amazing', 'language', '!'], ['@', '@']]  <


In [27]:
tk = nltk.RegexpTokenizer(pattern=r'[^\s!?]+', gaps=False)  # raw string avoids invalid-escape warnings

tk.tokenize(corpus[2])

["@@You'll",
 '(learn)',
 'a',
 '**lot**',
 'in',
 'the',
 'book.',
 'Python',
 'is',
 'an',
 'amazing',
 'language',
 '@@']

## Removing Special Characters

## after tokenization

In [28]:
pattern_str = '[{}]'.format(re.escape(string.punctuation))
print(pattern_str)
pattern = re.compile(pattern_str)

def remove_characters_after_tokenization(tokens):
    
    # filter(None, ...) drops falsy elements (e.g. tokens left empty after punctuation removal)
    filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
    
    return list(filtered_tokens)

[\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^_\`\{\|\}\~]


In [29]:
a = [[1, 2, 3], [4, 5]]

In [30]:
filtered_list_1 = [list(filter(None, 
                          [remove_characters_after_tokenization(tokens) for tokens in sentence_tokens])) \
                          for sentence_tokens in token_list]

for sens in filtered_list_1:
    print('\t', ' '.join((' '.join(sen) for sen in sens)))

	 The brown fox was nt that quick and he could nt win the race
	 Hey that s a great deal I just bought a phone for 199
	 You ll learn a lot in the book Python is an amazing language
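An alternative sketch for the same post-tokenization cleanup, using `str.translate` instead of a compiled regex. The helper name `strip_punct` is an assumption for illustration. The behavior should match the regex version for ASCII punctuation, since both are driven by `string.punctuation`.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punct(tokens):
    stripped = (token.translate(_PUNCT_TABLE) for token in tokens)
    # Tokens made only of punctuation become empty strings; drop them.
    return [token for token in stripped if token]

result = strip_punct(['Hey', 'that', "'s", 'a', 'great', 'deal', '!'])
print(result)
# ['Hey', 'that', 's', 'a', 'great', 'deal']
```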


## before tokenization

In [31]:
PATTERN_KEEP_APOSTROPHES = r'[?$&*%@()~]'  # inside [...] a '|' is a literal character, not alternation
pat_keep_apostrophes = re.compile(PATTERN_KEEP_APOSTROPHES)

PATTERN_ONLY_ALPHA = r'[^a-zA-Z0-9 ]'
pat_only_alpha = re.compile(PATTERN_ONLY_ALPHA)

def remove_characters_before_tokenization(sentence, keep_apostrophes=False):
    
    sentence = sentence.strip()
    
    if keep_apostrophes:
        
        filtered_sentence = pat_keep_apostrophes.sub('', sentence)
    else:
        
        filtered_sentence = pat_only_alpha.sub('', sentence)
        
    return filtered_sentence
        

In [32]:
pprint([remove_characters_before_tokenization(sentence) for sentence in corpus])

	>  The brown fox wasnt that quick and he couldnt win the race  <
	>  Hey thats a great deal I just bought a phone for 199  <
	>  Youll learn a lot in the book Python is an amazing language  <


In [33]:
cleaned_corpus = [remove_characters_before_tokenization(sentence, keep_apostrophes=True) for sentence in corpus]
pprint(cleaned_corpus)

	>  The brown fox wasn't that quick and he couldn't win the race  <
	>  Hey that's a great deal! I just bought a phone for 199  <
	>  You'll learn a lot in the book. Python is an amazing language!  <


## Expanding Contractions

In [34]:
from contractions import CONTRACTION_MAP

def expand_contractions(sentence, contraction_mapping):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), flags=re.IGNORECASE|re.DOTALL)
    
    def expand_match(contraction):
        
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                               if contraction_mapping.get(match)\
                               else contraction_mapping.get(match.lower())
        expanded_contraction = first_char+expanded_contraction[1:]
        
        return expanded_contraction
    
    expanded_sentence = contractions_pattern.sub(expand_match, sentence)
    
    return expanded_sentence

In [35]:
expanded_corpus = [expand_contractions(sentence, CONTRACTION_MAP) for sentence in cleaned_corpus]

pprint(expanded_corpus)

	>  The brown fox was not that quick and he could not win the race  <
	>  Hey that is a great deal! I just bought a phone for 199  <
	>  You will learn a lot in the book. Python is an amazing language!  <
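`CONTRACTION_MAP` comes from a local `contractions` module that is not shown here. The sketch below uses a tiny hypothetical mapping so the idea can be run on its own. It also escapes the keys with `re.escape`, which is a defensive assumption in case a key ever contains regex metacharacters.

```python
import re

# Hypothetical mini-mapping; the real CONTRACTION_MAP is much larger.
CONTRACTIONS = {
    "wasn't": "was not",
    "couldn't": "could not",
    "that's": "that is",
    "you'll": "you will",
}

_CONTRACTION_RE = re.compile(
    '|'.join(re.escape(key) for key in CONTRACTIONS),
    flags=re.IGNORECASE,
)

def expand(text):
    def _replace(match):
        found = match.group(0)
        expanded = CONTRACTIONS[found.lower()]
        # Keep the original first character so capitalization survives.
        return found[0] + expanded[1:]
    return _CONTRACTION_RE.sub(_replace, text)

expanded = expand("You'll see the fox wasn't quick")
print(expanded)
# You will see the fox was not quick
```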


# Case Conversions

In [36]:
print('Original: ', corpus[0])
print('Lower: ', corpus[0].lower())
print('Upper: ', corpus[0].upper())
print('Sentence case: ', corpus[0].capitalize())
print('Word case: ', corpus[0].title())

Original:  The brown fox wasn't that quick and he couldn't win the race
Lower:  the brown fox wasn't that quick and he couldn't win the race
Upper:  THE BROWN FOX WASN'T THAT QUICK AND HE COULDN'T WIN THE RACE
Sentence case:  The brown fox wasn't that quick and he couldn't win the race
Word case:  The Brown Fox Wasn'T That Quick And He Couldn'T Win The Race
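`str.title()` treats the apostrophe as a word boundary, which is why the output contains `Wasn'T` and `Couldn'T`. One possible workaround is `string.capwords`, which splits on whitespace only:

```python
import string

text = "The brown fox wasn't that quick and he couldn't win the race"

# capwords capitalizes each whitespace-separated word and lowercases the rest.
word_case = string.capwords(text)
print(word_case)
# The Brown Fox Wasn't That Quick And He Couldn't Win The Race
```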


# Removing Stopwords

"words that have little or no significance"

In [37]:
def remove_stopwords(tokens):
    
    stopword_list = nltk.corpus.stopwords.words('english')
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    
    return filtered_tokens

In [38]:
expanded_corpus_tokens = [tokenize_text(text) for text in expanded_corpus]

filtered_list_3 = [[remove_stopwords(tokens) for tokens in sentence_tokens] for sentence_tokens in expanded_corpus_tokens]

pprint(filtered_list_3)

	>  [['The', 'brown', 'fox', 'quick', 'could', 'win', 'race']]  <
	>  [['Hey', 'great', 'deal', '!'], ['I', 'bought', 'phone', '199']]  <
	>  [['You', 'learn', 'lot', 'book', '.'], ['Python', 'amazing', 'language', '!']]  <


Important:

'One important thing to remember is that negations like not and no are removed in this case(...) and it is often essential to preserve the same so the actual context of the sentence is not lost'
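One way to act on this warning is to subtract the negation words from the stopword list before filtering. The sketch below uses a small hypothetical stopword set in place of `nltk.corpus.stopwords.words('english')` so it runs without downloaded data. Unlike `remove_stopwords` above, it compares tokens case-insensitively, which is a deliberate assumption.

```python
# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {'the', 'was', 'not', 'no', 'that', 'and', 'he', 'is', 'a'}
NEGATIONS = {'not', 'no'}

def remove_stopwords_keep_negations(tokens, stopwords=STOPWORDS, keep=NEGATIONS):
    # Anything in `keep` is exempt from removal.
    to_drop = stopwords - keep
    return [token for token in tokens if token.lower() not in to_drop]

filtered = remove_stopwords_keep_negations(
    ['The', 'brown', 'fox', 'was', 'not', 'that', 'quick'])
print(filtered)
# ['brown', 'fox', 'not', 'quick']
```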

# Correcting Words

"The definition of incorrect here covers words that have spelling mistakes as well as words with several letters repeated that do not contribute much to its overall significance."

## Correcting repeated characters

In [39]:
old_word = 'finalllllyyy'

repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
match_substitution = r'\1\2\3'

step = 1

while True:
    
    new_word = repeat_pattern.sub(match_substitution, old_word)
    
    if new_word != old_word:
        
        print('Step: {} Word: {}'.format(step, new_word))
        step += 1
        
        old_word = new_word
        continue
    else:
        
        print('Final word: ', new_word)
        break

Step: 1 Word: finalllllyy
Step: 2 Word: finallllly
Step: 3 Word: finalllly
Step: 4 Word: finallly
Step: 5 Word: finally
Step: 6 Word: finaly
Final word:  finaly


In [40]:
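def remove(match):
    
    # group(1) is the repeated character; keep a single copy of it
    # (the previous fallback returned the match object itself, which re.sub cannot use)
    return match.group(1)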
    
repeat_pattern = re.compile(r'(\w)(\1+)')

def remove_repeat(s):
    
    return repeat_pattern.sub(remove, s)

remove_repeat('Abelaaaaardooo Vieirrrra Moooota')

'Abelardo Vieira Mota'

In [42]:
from nltk.corpus import wordnet

old_word = 'finalllllyyy'

repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
match_substitution = r'\1\2\3'

step = 1

while True:
    
    if wordnet.synsets(old_word):
        print('Final correct word: ', old_word)
        break
    new_word = repeat_pattern.sub(match_substitution, old_word)
    
    if new_word != old_word:
        
        print('Step: {} Word: {}'.format(step, new_word))
        step += 1
        
        old_word = new_word
        continue
    else:
        
        print('Final word: ', new_word)
        break

Step: 1 Word: finalllllyy
Step: 2 Word: finallllly
Step: 3 Word: finalllly
Step: 4 Word: finallly
Step: 5 Word: finally
Final correct word:  finally


In [43]:
def remove_repeated_characters(tokens):
    
    def replace(old_word):
        if wordnet.synsets(old_word):
            return old_word
        
        new_word = repeat_pattern.sub(match_substitution, old_word)
        return replace(new_word) if new_word != old_word else new_word
    
    correct_tokens = [replace(word) for word in tokens]
    
    return correct_tokens

In [44]:
sample_sentence = 'My schooool is reallllllyyyy amaaaaazinnnng'
sample_sentence_tokens = tokenize_text(sample_sentence)[0]

print(sample_sentence_tokens)
print(remove_repeated_characters(sample_sentence_tokens))

['My', 'schooool', 'is', 'reallllllyyyy', 'amaaaaazinnnng']
['My', 'school', 'is', 'really', 'amazing']
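The same stop-when-the-word-is-known idea can be tried without WordNet by checking against any vocabulary. In this sketch `VOCAB` is a hypothetical stand-in for the `wordnet.synsets` lookup, and the recursion is written as a loop.

```python
import re

# Hypothetical vocabulary standing in for the WordNet lookup.
VOCAB = {'my', 'school', 'is', 'really', 'amazing', 'finally'}

_REPEAT = re.compile(r'(\w*)(\w)\2(\w*)')

def collapse_repeats(word, vocab=VOCAB):
    # Remove one repeated character per pass until the word is known
    # or nothing is left to remove.
    while word.lower() not in vocab:
        shorter = _REPEAT.sub(r'\1\2\3', word)
        if shorter == word:
            break
        word = shorter
    return word

fixed = [collapse_repeats(w) for w in ['My', 'schooool', 'is', 'reallllllyyyy', 'amaaaaazinnnng']]
print(fixed)
# ['My', 'school', 'is', 'really', 'amazing']
```

Checking the vocabulary before each removal is what keeps legitimate double letters, as in `school` and `really`, from being collapsed.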


# Correcting Spellings

In [47]:
import re, collections

def tokens(text):
    
    return re.findall('[a-z]+', text.lower())

with open('big.txt') as f:
    WORDS = tokens(f.read())
    
WORD_COUNTS = collections.Counter(WORDS)

pprint(WORD_COUNTS.most_common(10))

	>  ('the', 80030)  <
	>  ('of', 40025)  <
	>  ('and', 38313)  <
	>  ('to', 28766)  <
	>  ('in', 22050)  <
	>  ('a', 21155)  <
	>  ('that', 12512)  <
	>  ('he', 12401)  <
	>  ('was', 11410)  <
	>  ('it', 10681)  <


In [54]:
def edits0(word):
    
    return {word}

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
        
    def splits(word):

        return [(word[:i], word[i:]) for i in range(len(word) + 1)]

    pairs = splits(word)
    deletes = [a + b[1:] for (a, b) in pairs if b]
    transposes = [a + b[1] + b[0] + b[2:] for (a, b) in pairs if len(b) > 1]
    replaces = [a+c+b[1:] for (a, b) in pairs for c in alphabet if b]
    inserts = [a+c+b for (a, b) in pairs for c in alphabet]
    
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}
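To get a feel for how quickly the candidate set grows, the sketch below counts the raw edits that `edits1` generates before deduplication. For a word of length n there are n deletes, n - 1 transposes, 26n replaces and 26(n + 1) inserts, i.e. 54n + 25 in total. The helper `raw_edit_counts` is an illustrative assumption that mirrors the list comprehensions above.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def raw_edit_counts(word):
    # Same split points as edits1: every (prefix, suffix) pair.
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        'deletes': sum(1 for a, b in pairs if b),
        'transposes': sum(1 for a, b in pairs if len(b) > 1),
        'replaces': sum(1 for a, b in pairs for c in alphabet if b),
        'inserts': sum(1 for a, b in pairs for c in alphabet),
    }

counts = raw_edit_counts('fianlly')
total = sum(counts.values())
print(counts, total)
# {'deletes': 7, 'transposes': 6, 'replaces': 182, 'inserts': 208} 403
```

Because `edits2` applies `edits1` to every one of these candidates, its raw size grows roughly with the square of this number, which is why the results are filtered through `known` before ranking.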

In [50]:
def known(words):
    
    return {w for w in words if w in WORD_COUNTS}

In [51]:
word = 'fianlly'

edits0(word)

{'fianlly'}

In [52]:
known(edits0(word))

set()

In [56]:
known(edits1(word))

{'finally'}

In [57]:
known(edits2(word))

{'faintly', 'finally', 'finely', 'frankly'}

In [58]:
candidates = (known(edits0(word)) or known(edits1(word)) or known(edits2(word)) or [word])

candidates

{'finally'}

In [71]:
def correct(word):
    
    candidates = (known(edits0(word)) or known(edits1(word)) or known(edits2(word)) or [word])
    
    return max(candidates, key=WORD_COUNTS.get)

In [72]:
correct('fianlly')

'finally'

In [73]:
correct('FIANLY')

'FIANLY'

In [76]:
def correct_match(match):
    
    word = match.group()
    
    def case_of(text):
        
        return (str.upper if text.isupper() else
                str.lower if text.islower() else
                str.title if text.istitle() else
                str)
    
    return case_of(word)(correct(word.lower()))

def correct_text_generic(text):
    
    return re.sub('[a-zA-Z]+', correct_match, text)

In [77]:
correct_text_generic('fianlly')

'finally'

In [78]:
correct_text_generic('FIANLY')

'FINALLY'

In [79]:
# When this notebook was run, pattern supported Python 3.6 only through its development branch, so the import fails
from pattern.en import suggest

ModuleNotFoundError: No module named 'pattern'
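Since `pattern` could not be imported here, one standard-library stand-in for quick experiments is `difflib.get_close_matches`. This is not equivalent to `pattern.en.suggest`: it ranks candidates by a sequence-similarity ratio rather than by edit distance and word frequency. The vocabulary below is a hypothetical placeholder, and `suggest_closest` is an illustrative helper name.

```python
import difflib

# Hypothetical vocabulary; in practice this could be WORD_COUNTS.keys().
VOCABULARY = ['finally', 'faintly', 'finely', 'frankly']

def suggest_closest(word, vocabulary=VOCABULARY):
    # n=1 asks for the single best match above difflib's default cutoff.
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1)
    # Fall back to the input when nothing is similar enough.
    return matches[0] if matches else word

suggestion = suggest_closest('fianlly')
print(suggestion)
```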