# Week 4 - Using Natural Language Processing
This week follows chapter 3 from the text.

## Page 118

In [1]:
import requests

data = requests.get('http://www.gutenberg.org/cache/epub/8001/pg8001.html')
content = data.content
# the text that prints is a little different because of book version differences
print(content[1163:2200], '\n')

b'a name="generator" content="Ebookmaker 0.10.0 by Project Gutenberg"/>\r\n</head>\r\n  <body><p id="id00000">Project Gutenberg EBook The Bible, King James, Book 1: Genesis</p>\r\n\r\n<p id="id00001">Copyright laws are changing all over the world. Be sure to check the\r\ncopyright laws for your country before downloading or redistributing\r\nthis or any other Project Gutenberg eBook.</p>\r\n\r\n<p id="id00002">This header should be the first thing seen when viewing this Project\r\nGutenberg file.  Please do not remove it.  Do not change or edit the\r\nheader without written permission.</p>\r\n\r\n<p id="id00003">Please read the "legal small print," and other information about the\r\neBook and Project Gutenberg at the bottom of this file.  Included is\r\nimportant information about your specific rights and restrictions in\r\nhow the file may be used.  You can also find out about how to make a\r\ndonation to Project Gutenberg, and how to get involved.</p>\r\n\r\n<p id="id00004" style="ma

## Pages 118-119

In [2]:
import re
from bs4 import BeautifulSoup

def strip_html_tags(text):
    soup = BeautifulSoup(text, 'html.parser')
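    # drop embedded frames and scripts before extracting the text
    for s in soup(['iframe', 'script']):
        s.extract()
    stripped_text = soup.get_text()
    # collapse runs of carriage returns/newlines into a single newline
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)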

    return stripped_text

clean_content = strip_html_tags(content)
print(clean_content[1163:2045], '\n')

*** START OF THE PROJECT GUTENBERG EBOOK, THE BIBLE, KING JAMES, BOOK 1***
This eBook was produced by David Widger
with the help of Derek Andrew's text from January 1992
and the work of Bryan Taylor in November 2002.
Book 01        Genesis
01:001:001 In the beginning God created the heaven and the earth.
01:001:002 And the earth was without form, and void; and darkness was
           upon the face of the deep. And the Spirit of God moved upon
           the face of the waters.
01:001:003 And God said, Let there be light: and there was light.
01:001:004 And God saw the light, that it was good: and God divided the
           light from the darkness.
01:001:005 And God called the light Day, and the darkness he called
           Night. And the evening and the morning were the first day.
01:001:006 And God said, Let there be a firmament in the midst of the
           waters, 
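One detail of the newline-collapsing regex in `strip_html_tags` is worth checking by hand: inside square brackets, `|` is a literal pipe character rather than alternation. The following is a minimal sketch (the sample string is made up for illustration) comparing a character class with and without pipes.

```python
import re

text = 'a|b\r\n\r\nc'

# Pipes inside [...] are literal, so this class also swallows the '|'.
with_pipes = re.sub(r'[\r|\n]+', '\n', text)

# A class listing only the two newline characters leaves the '|' alone.
without_pipes = re.sub(r'[\r\n]+', '\n', text)

print(repr(with_pipes))     # 'a\nb\nc'
print(repr(without_pipes))  # 'a|b\nc'
```

For Gutenberg text this rarely matters, but any pipe characters in the page would be turned into line breaks by the first form.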



## Tokenizer
The sample text here is shorter than the one in the book, but the core code is the same. In the book, this code comes after some summary data about the Alice corpus.

In [9]:
import nltk
#nltk.download('gutenberg')
from nltk.corpus import gutenberg
from pprint import pprint
import numpy as np

## SENTENCE TOKENIZATION
# loading text corpora
alice = gutenberg.raw(fileids='carroll-alice.txt')

sample_text = 'We will discuss briefly about the basic syntax,\
 structure and design philosophies. \
 There is a defined hierarchical syntax for Python code which you should remember \
 when writing code! Python is a really powerful programming language!'
print('Sample text: ', sample_text, '\n')

Sample text:  We will discuss briefly about the basic syntax, structure and design philosophies.  There is a defined hierarchical syntax for Python code which you should remember  when writing code! Python is a really powerful programming language! 



### Output a bit of Alice in Wonderland

In [5]:
# Total characters in Alice in Wonderland
print('Length of alice: ', len(alice))
# First 100 characters in the corpus
print('First 100 chars of alice: ', alice[0:100], '\n')

Length of alice:  144395
First 100 chars of alice:  [Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was 




## Default Sentence Tokenizer

In [6]:
default_st = nltk.sent_tokenize
alice_sentences = default_st(text=alice)
sample_sentences = default_st(text=sample_text)
print('Default sentence tokenizer')
print('Total sentences in sample_text:', len(sample_sentences))
print('Sample text sentences :-')
pprint(sample_sentences)
print('\nTotal sentences in alice:', len(alice_sentences))
print('First 5 sentences in alice:-')
pprint(alice_sentences[0:5])

Default sentence tokenizer
Total sentences in sample_text: 3
Sample text sentences :-
['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember  when writing code!',
 'Python is a really powerful programming language!']

Total sentences in alice: 1625
First 5 sentences in alice:-
["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.",
 'Down the Rabbit-Hole\n'
 '\n'
 'Alice was beginning to get very tired of sitting by her sister on the\n'
 'bank, and of having nothing to do: once or twice she had peeped into the\n'
 'book her sister was reading, but it had no pictures or conversations in\n'
 "it, 'and what is the use of a book,' thought Alice 'without pictures or\n"
 "conversation?'",
 'So she was considering in her own mind (as well as she could, for the\n'
 'hot day made her feel very sleepy and stupid), whether the pleasure\n'
 'of making a dais

## Other Languages Sentence Tokenization

In [10]:
#nltk.download('europarl_raw')
from nltk.corpus import europarl_raw
german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
print('Other language tokenization')
# Total characters in the corpus
print(len(german_text))
# First 100 characters in the corpus
print(german_text[0:100])

german_sentences_def = default_st(text=german_text, language='german')
# loading german text tokenizer into a PunktSentenceTokenizer instance
german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
german_sentences = german_tokenizer.tokenize(german_text)

# verify the type of german_tokenizer
# should be PunktSentenceTokenizer
print('German tokenizer type:', type(german_tokenizer))

# check if results of both tokenizers match
# should be True
print(german_sentences_def == german_sentences)
# print the first 5 sentences of the corpus
for sent in german_sentences[0:5]:
    print(sent)
print('\n')

Other language tokenization
157171
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit
German tokenizer type: <class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>
True
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .
Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .
Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .
Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .
Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .

## Using Punkt Tokenizer for Sentence Tokenization

In [11]:
print('Punkt tokenizer')
punkt_st = nltk.tokenize.PunktSentenceTokenizer()
sample_sentences = punkt_st.tokenize(sample_text)
pprint(np.array(sample_sentences))
print('\n')

Punkt tokenizer
array(['We will discuss briefly about the basic syntax, structure and design philosophies.',
       'There is a defined hierarchical syntax for Python code which you should remember  when writing code!',
       'Python is a really powerful programming language!'], dtype='<U100')





## Using RegexpTokenizer for Sentence Tokenization

In [13]:
print('Regex tokenizer')
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
regex_st = nltk.tokenize.RegexpTokenizer(
            pattern=SENTENCE_TOKENS_PATTERN,
            gaps=True)
sample_sentences = regex_st.tokenize(sample_text)
# again, the output is different because the sample sentence is different
pprint(sample_sentences)
print('\n')

Regex tokenizer
['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 ' There is a defined hierarchical syntax for Python code which you should '
 'remember  when writing code!',
 'Python is a really powerful programming language!']
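To see what the lookbehinds in `SENTENCE_TOKENS_PATTERN` are doing, here is a small sketch that applies the same pattern with the standard library's `re.split` instead of `RegexpTokenizer`. The test sentence is made up for illustration. It also shows why the second sentence in the output above starts with a space: the pattern consumes only one whitespace character, so the second of two spaces stays attached to the next sentence.

```python
import re

SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'

# Made-up text with a title (Mr.), an initial (J.), an abbreviation (i.e.),
# and a double space after the exclamation mark.
text = 'Mr. Smith met J. Doe, i.e. a friend. They left!  Did you see?'

raw_sentences = re.split(SENTENCE_TOKENS_PATTERN, text)
clean_sentences = [s.strip() for s in raw_sentences]

print(raw_sentences)
print(clean_sentences)
```

The three negative lookbehinds block a split after `i.e.`, `Mr.`, and `J.` respectively, and the final positive lookbehind requires the whitespace to follow `.`, `?`, or `!`.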




## Word Tokenization

In [14]:
sentence = "The brown fox wasn't that quick and he couldn't win the race"
# default word tokenizer
print('Word tokenizer')
default_wt = nltk.word_tokenize
words = default_wt(sentence)
print(words, '\n')

Word tokenizer
['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race'] 



In [16]:
print('Treebank tokenizer')
treebank_wt = nltk.TreebankWordTokenizer()
words = treebank_wt.tokenize(sentence)
print(words, '\n')

# toktok tokenizer
print('TokTok tokenizer')
from nltk.tokenize.toktok import ToktokTokenizer
tokenizer = ToktokTokenizer()
words = tokenizer.tokenize(sample_text)
print(np.array(words), '\n')

# regex word tokenizer
print('RegEx word tokenizer')
TOKEN_PATTERN = r'\w+'        
regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN,
                                gaps=False)
words = regex_wt.tokenize(sentence)
print(words)

GAP_PATTERN = r'\s+'        
regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN,
                                gaps=True)
words = regex_wt.tokenize(sentence)
print(words)

word_indices = list(regex_wt.span_tokenize(sentence))
print(word_indices)
print([sentence[start:end] for start, end in word_indices], '\n')

# derived regex tokenizers
print("Derived RegEx tokenizers")
wordpunkt_wt = nltk.WordPunctTokenizer()
words = wordpunkt_wt.tokenize(sentence)
print(words, '\n')

# whitespace tokenizer
print('Whitespace Tokenizer')
whitespace_wt = nltk.WhitespaceTokenizer()
words = whitespace_wt.tokenize(sentence)
print(words, '\n')

Treebank tokenizer
['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race'] 

TokTok tokenizer
['We' 'will' 'discuss' 'briefly' 'about' 'the' 'basic' 'syntax' ','
 'structure' 'and' 'design' 'philosophies.' 'There' 'is' 'a' 'defined'
 'hierarchical' 'syntax' 'for' 'Python' 'code' 'which' 'you' 'should'
 'remember' 'when' 'writing' 'code' '!' 'Python' 'is' 'a' 'really'
 'powerful' 'programming' 'language' '!'] 

RegEx word tokenizer
['The', 'brown', 'fox', 'wasn', 't', 'that', 'quick', 'and', 'he', 'couldn', 't', 'win', 'the', 'race']
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
[(0, 3), (4, 9), (10, 13), (14, 20), (21, 25), (26, 31), (32, 35), (36, 38), (39, 47), (48, 51), (52, 55), (56, 60)]
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race'] 

Derived RegEx tokenizers
['The', 'brown', 'fox', 'wasn', "'", 't', 'that', 'quick', 'and', 'he',
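The difference between `gaps=False` and `gaps=True` can be reproduced with plain `re` calls. This is only a sketch of the idea, not NLTK's implementation: with `gaps=False` the pattern describes the tokens, and with `gaps=True` it describes the separators.

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the race"

# gaps=False style: the pattern describes the tokens themselves.
token_matches = re.findall(r'\w+', sentence)

# gaps=True style: the pattern describes the separators between tokens.
gap_split = [tok for tok in re.split(r'\s+', sentence) if tok]

# Character spans of the whitespace-separated tokens.
spans = [m.span() for m in re.finditer(r'\S+', sentence)]

print(token_matches)
print(gap_split)
print(spans)
```

Because `\w+` does not match the apostrophe, the token-style pattern breaks `wasn't` into `wasn` and `t`, while splitting on whitespace keeps the contraction whole.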

## Pages 132 - 134

In [17]:
print('Robust tokenizer - NLTK')
def tokenize_text(text):
    sentences = nltk.sent_tokenize(text)
    word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]
    return word_tokens

sents = tokenize_text(sample_text)
# dtype=object because the sentences have different lengths (ragged lists)
print(np.array(sents, dtype=object), '\n')

words = [word for sentence in sents for word in sentence]
print(np.array(words), '\n')

print('spaCy...')
import spacy
nlp = spacy.load('en_core_web_sm')
text_spacy = nlp(sample_text)
print(np.array(list(text_spacy.sents)), '\n')

# use the spaCy sentences here, not the NLTK ones stored in sents
sent_words = [[word.text for word in sent] for sent in text_spacy.sents]
print(np.array(sent_words, dtype=object), '\n')

# in spacy documentation, this is usually written as [token for token in doc]
words = [word for word in text_spacy]
print(np.array(words), '\n')

Robust tokenizer - NLTK
[list(['We', 'will', 'discuss', 'briefly', 'about', 'the', 'basic', 'syntax', ',', 'structure', 'and', 'design', 'philosophies', '.'])
 list(['There', 'is', 'a', 'defined', 'hierarchical', 'syntax', 'for', 'Python', 'code', 'which', 'you', 'should', 'remember', 'when', 'writing', 'code', '!'])
 list(['Python', 'is', 'a', 'really', 'powerful', 'programming', 'language', '!'])] 

['We' 'will' 'discuss' 'briefly' 'about' 'the' 'basic' 'syntax' ','
 'structure' 'and' 'design' 'philosophies' '.' 'There' 'is' 'a' 'defined'
 'hierarchical' 'syntax' 'for' 'Python' 'code' 'which' 'you' 'should'
 'remember' 'when' 'writing' 'code' '!' 'Python' 'is' 'a' 'really'
 'powerful' 'programming' 'language' '!'] 

spaCy...
[We will discuss briefly about the basic syntax, structure and design philosophies.
 There is a defined hierarchical syntax for Python code which you should remember  when writing code!
 Python is a really powerful programming language!] 

[list(['We', 'will', 'd

## Page 135

In [18]:
import unicodedata

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii',
                                                      'ignore').decode('utf-8', 'ignore')
    return text

print(remove_accented_chars('Sòme Åccentềd cliché façades'))

Some Accented cliche facades
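The NFKD step is what makes the ASCII trick work: it splits an accented letter into a base letter plus a combining mark, and the ASCII encode then drops the mark. The sketch below shows the decomposition and one alternative (an assumption of this example, not the book's method) that removes only combining marks, so letters with no ASCII base are kept.

```python
import unicodedata

def strip_combining_marks(text):
    """Remove accents by dropping combining marks after NFKD decomposition."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# 'é' decomposes into 'e' followed by U+0301 (combining acute accent).
decomposed = unicodedata.normalize('NFKD', '\u00e9')
code_points = [hex(ord(ch)) for ch in decomposed]
print(code_points)

print(strip_combining_marks('clich\u00e9 fa\u00e7ades'))

# 'ß' has no decomposition, so the ASCII route silently deletes it.
ascii_only = unicodedata.normalize('NFKD', 'Stra\u00dfe').encode('ascii', 'ignore').decode('ascii')
kept = strip_combining_marks('Stra\u00dfe')
print(ascii_only, kept)
```

Which behavior is preferable depends on the corpus; for English-only text the two usually agree.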


## Expanding Contractions
Starting on page 136

In [20]:
import nltk
from contractions import CONTRACTION_MAP
import re

def expand_contractions(sentence, contraction_mapping=CONTRACTION_MAP):
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, sentence)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

print('Expanding contractions:')
print(expand_contractions("Y'all can't expand contractions I'd think"), '\n')

Expanding contractions:
You all cannot expand contractions I would think 
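`CONTRACTION_MAP` comes from a local `contractions.py` file that accompanies the book, so the cell above will not run without it. The sketch below swaps in a tiny, hypothetical three-entry mapping (`MINI_CONTRACTION_MAP` and `expand_contractions_mini` are names invented here) to show the same substitution idea in a self-contained form.

```python
import re

# Hypothetical, deliberately tiny mapping used only for illustration.
MINI_CONTRACTION_MAP = {
    "can't": "cannot",
    "i'd": "i would",
    "y'all": "you all",
}

def expand_contractions_mini(sentence, mapping=MINI_CONTRACTION_MAP):
    # Try longer keys first so a long contraction is not cut short by a prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('({})'.format('|'.join(re.escape(k) for k in keys)),
                         flags=re.IGNORECASE)

    def expand_match(match):
        matched = match.group(0)
        expanded = mapping[matched.lower()]
        # Keep the case of the first character of the original text.
        return matched[0] + expanded[1:]

    return pattern.sub(expand_match, sentence)

result = expand_contractions_mini("Y'all can't expand contractions I'd think")
print(result)
```

Lowercasing the match before the lookup is what lets a capitalized `Y'all` or `I'd` find its entry, and re-attaching the first character restores the capital.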



## Removing special characters, page 138

In [22]:
def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

print('Remove special characters:')
print(remove_special_characters('Well this was fun! What do you think? 123#@!', remove_digits=True), '\n')

Remove special characters:
Well this was fun What do you think  



## Case Conversion

In [23]:
print('Case conversions:')
# lowercase
text = 'The quick brown fox jumped over The Big Dog'
print(text.lower())
# uppercase
print(text.upper())
# title case
print(text.title(), '\n')

Case conversions:
the quick brown fox jumped over the big dog
THE QUICK BROWN FOX JUMPED OVER THE BIG DOG
The Quick Brown Fox Jumped Over The Big Dog 



## Correcting repeating characters - pages 139-140

In [24]:
old_word = 'finalllyyy'
repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
match_substitution = r'\1\2\3'
step = 1
while True:
    # remove one repeated character
    new_word = repeat_pattern.sub(match_substitution, old_word)
    if new_word != old_word:
        print('Step: {} Word: {}'.format(step, new_word))
        step += 1 #update step
        # update old word to last substituted state
        old_word = new_word
        continue
    else:
        print('Final word: ', new_word, '\n')
        break

Step: 1 Word: finalllyy
Step: 2 Word: finallly
Step: 3 Word: finally
Step: 4 Word: finaly
Final word:  finaly 
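The loop above removes exactly one character per pass because the first group `(\w*)` is greedy: it grabs as much as it can, so the doubled character that matches `(\w)\2` is the last repeated pair in the word. The sketch below wraps the same pattern in a helper (`squeeze_steps` is a name invented here) that returns the intermediate words instead of printing them.

```python
import re

REPEAT_PATTERN = re.compile(r'(\w*)(\w)\2(\w*)')

def squeeze_steps(word):
    """Return each intermediate word until no doubled character remains."""
    steps = []
    while True:
        new_word = REPEAT_PATTERN.sub(r'\1\2\3', word)
        if new_word == word:
            return steps
        steps.append(new_word)
        word = new_word

steps = squeeze_steps('finalllyyy')
print(steps)
```

The last step shows the weakness of the purely mechanical approach: it keeps going past the correct spelling and removes the legitimate double `l`.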



## Pages 140-141 - Wordnet

In [25]:
print('Wordnet:')
from nltk.corpus import wordnet
old_word = 'finalllyyy'
repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
match_substitution = r'\1\2\3'
step = 1
while True:
    # check for semantically correct words
    if wordnet.synsets(old_word):
        print('Final correct word: ', old_word, '\n')
        break
    # remove one repeated character
    new_word = repeat_pattern.sub(match_substitution, old_word)
    if new_word != old_word:
        print('Step: {} Word: {}'.format(step, new_word))
        step += 1  # update step
        # update old word to last substituted state
        old_word = new_word
        continue
    else:
        print('Final word: ', new_word, '\n')
        break

Wordnet:
Step: 1 Word: finalllyy
Step: 2 Word: finallly
Step: 3 Word: finally
Final correct word:  finally 



## Pages 141-142 - Remove repeated characters

In [26]:
def remove_repeated_characters(tokens):
    repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
    match_substitution = r'\1\2\3'
    def replace(old_word):
        if wordnet.synsets(old_word):
            return old_word
        new_word = repeat_pattern.sub(match_substitution, old_word)
        return replace(new_word) if new_word != old_word else new_word
            
    correct_tokens = [replace(word) for word in tokens]
    return correct_tokens

sample_sentence = 'My schooool is realllllyyy amaaazingggg'
correct_tokens = remove_repeated_characters(nltk.word_tokenize(sample_sentence))
print(' '.join(correct_tokens), '\n')

My school is really amazing 
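The same stop-at-a-known-word idea can be tried without downloading WordNet. The sketch below is only an illustration under loud assumptions: `TOY_VOCAB` is a hypothetical five-word vocabulary standing in for `wordnet.synsets`, and `str.split` stands in for `nltk.word_tokenize`.

```python
import re

# Hypothetical vocabulary standing in for a WordNet lookup.
TOY_VOCAB = {'my', 'school', 'is', 'really', 'amazing'}
REPEAT_PATTERN = re.compile(r'(\w*)(\w)\2(\w*)')

def squeeze(word, vocab=TOY_VOCAB):
    """Remove one repeated character at a time until the word is known."""
    while word.lower() not in vocab:
        new_word = REPEAT_PATTERN.sub(r'\1\2\3', word)
        if new_word == word:   # nothing left to remove
            break
        word = new_word
    return word

sample = 'My schooool is realllllyyy amaaazingggg'
corrected = ' '.join(squeeze(w) for w in sample.split())
print(corrected)
```

Checking the vocabulary before each removal is what stops `school` and `really` from losing their legitimate double letters.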



## Spelling Corrector Part 1 - Starting on Page 143

In [27]:
import re, collections

def tokens(text): 
    """
    Get all words from the corpus
    """
    return re.findall('[a-z]+', text.lower()) 

WORDS = tokens(open('big.txt').read())
WORD_COUNTS = collections.Counter(WORDS)
# top 10 words in corpus
print('Top 10 words in corpus:')
print(WORD_COUNTS.most_common(10), '\n')

def edits0(word):
    """
    Return all strings that are zero edits away 
    from the input word (i.e., the word itself).
    """
    return {word}

def edits1(word):
    """
    Return all strings that are one edit away 
    from the input word.
    """
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    def splits(word):
        """
        Return a list of all possible (first, rest) pairs 
        that the input word is made of.
        """
        return [(word[:i], word[i:]) 
                for i in range(len(word)+1)]
                
    pairs      = splits(word)
    deletes    = [a+b[1:]           for (a, b) in pairs if b]
    transposes = [a+b[1]+b[0]+b[2:] for (a, b) in pairs if len(b) > 1]
    replaces   = [a+c+b[1:]         for (a, b) in pairs for c in alphabet if b]
    inserts    = [a+c+b             for (a, b) in pairs for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """Return all strings that are two edits away 
    from the input word.
    """
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def known(words):
    """
    Return the subset of words that are actually
    in our WORD_COUNTS dictionary.
    """
    return {w for w in words if w in WORD_COUNTS}

print('Input words:')
# input word
word = 'fianlly'

# zero edit distance from input word
print(edits0(word))
# returns an empty set since it is not a valid word
print(known(edits0(word)))
# one edit distance from input word
print(edits1(word))
# get correct words from above set
print(known(edits1(word)))
# two edit distances from input word
print(edits2(word))
# get correct words from above set
print(known(edits2(word)))
candidates = (known(edits0(word)) or known(edits1(word)) or known(edits2(word)) or [word])
print(candidates, '\n')

qvly', 'foianllx', 'fienllyt', 'fianslx', 'dianllg', 'finnlqly', 'afuanlly', 'fianlaxy', 'fiaxllty', 'ftqianlly', 'fiaullyq', 'fivnllxy', 'fsiantlly', 'fianlayz', 'fianlipy', 'finallt', 'fiaklyy', 'foignlly', 'firanlcly', 'fiaymly', 'fyianlzly', 'fiaqnlfly', 'fjiainlly', 'wfdianlly', 'franoly', 'fianfls', 'fiapljy', 'uiynlly', 'fianxfy', 'ufianlfly', 'fiaswly', 'ofiafnlly', 'fianlswly', 'fianevy', 'fiianllwy', 'wfianlfly', 'fiawllly', 'fiaqllty', 'kfianlfly', 'fsanlry', 'fianlzlyj', 'fienlwly', 'fiuanlhly', 'kfiankly', 'fianlawly', 'fiajjnlly', 'rqfianlly', 'lianllsy', 'fcavnlly', 'fkiahlly', 'fianlfyd', 'efiacnlly', 'fidabnlly', 'fianlxlny', 'fnablly', 'fianllyms', 'fianloc', 'fiaoljy', 'fxnianlly', 'bfiaully', 'fihanllx', 'bfianllyt', 'fianltvly', 'ficanlyy', 'fianljply', 'figgnlly', 'fianllsb', 'fxianzly', 'fijanllo', 'rfianllc', 'fiajnily', 'diadnlly', 'fianllxv', 'fiuanllgy', 'fiangldly', 'aianllby', 'fplanlly', 'fianrlyu', 'fiavllys', 'ljanlly', 'pnanlly', 'fianludy', 'wfiaklly',
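The printed sets are huge because of simple counting. For a word of length n, the four edit types generate n deletes, n-1 transposes, 26n replaces, and 26(n+1) inserts, or 54n+25 strings before duplicates are removed, and `edits2` repeats that for every one of them. The sketch below (with `edit_lists` as a helper name invented here) keeps the four lists separate so the counts can be checked.

```python
ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def edit_lists(word):
    """Return the four kinds of single edits as separate lists."""
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        'deletes': [a + b[1:] for a, b in pairs if b],
        'transposes': [a + b[1] + b[0] + b[2:] for a, b in pairs if len(b) > 1],
        'replaces': [a + c + b[1:] for a, b in pairs for c in ALPHABET if b],
        'inserts': [a + c + b for a, b in pairs for c in ALPHABET],
    }

lists = edit_lists('fianlly')
sizes = {name: len(items) for name, items in lists.items()}
total = sum(sizes.values())
unique = set().union(*lists.values())

print(sizes, total, len(unique))
print('finally' in lists['transposes'])
```

The correct spelling shows up among the transposes, which is why `known(edits1(word))` is already enough for this example.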

## Spelling Correction Part 2

In [28]:
def correct(word):
    '''
    Get the best correct spelling for the input word
    :param word: the input word
    :return: best correct spelling
    '''
    # priority is for edit distance 0, then 1, then 2
    # else defaults to the input word itself.
    candidates = (known(edits0(word)) or known(edits1(word)) or known(edits2(word)) or [word])
    return max(candidates, key=WORD_COUNTS.get)

print(correct('fianlly'))
print(correct('FIANLLY'), '\n')

def correct_match(match):
    '''
    Spell-correct word in match, and preserve proper upper/lower/title case.
    :param match: word to be corrected
    :return: corrected word
    '''
    word = match.group()
    def case_of(text):
        '''
        Return the case-function appropriate for text: upper/lower/title/as-is
        :param text: The text to be acted on
        :return: Correct text
        '''
        return (str.upper if text.isupper() else
                str.lower if text.islower() else
                str.title if text.istitle() else
                str)
    return case_of(word)(correct(word.lower()))

def correct_text_generic(text):
    '''
    Correct all the words within a text, returning the corrected text
    :param text: Text to be corrected
    :return: Corrected text
    '''
    return re.sub('[a-zA-Z]+', correct_match, text)

print(correct_text_generic('fianlly'))
print(correct_text_generic('FIANLLY'), '\n')

print('TextBlob way (you may need to use pip to install textblob):')
from textblob import Word
w = Word('fianlly')
print(w.correct())
# check suggestions
print(w.spellcheck())
# another example
w = Word('flaot')
print(w.spellcheck())

finally
FIANLLY 

finally
FINALLY 

TextBlob way (you may need to use pip to install textblob):
finally
[('finally', 1.0)]
[('flat', 0.85), ('float', 0.15)]
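Two ideas in the cell above can be isolated: candidates are ranked by corpus frequency, and case is handled by correcting a lowercased copy and then re-applying the original case. `correct('FIANLLY')` returns its input unchanged because the corpus words and the edit alphabet are all lowercase. The sketch below uses hypothetical counts (`TOY_COUNTS`) in place of the counts built from `big.txt`.

```python
import collections

# Hypothetical counts standing in for WORD_COUNTS built from big.txt.
TOY_COUNTS = collections.Counter({'finally': 156, 'finale': 3, 'final': 40})

def pick_best(candidates, counts=TOY_COUNTS):
    """Choose the candidate that is most frequent in the corpus."""
    return max(candidates, key=lambda w: counts[w])

def case_of(text):
    """Return the str method that reproduces the case pattern of text."""
    return (str.upper if text.isupper() else
            str.lower if text.islower() else
            str.title if text.istitle() else
            str)

best = pick_best({'finale', 'finally'})
print(best)
print(case_of('FIANLLY')(best))
print(case_of('Fianlly')(best))
```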


## Stemming and Lemmatization

In [29]:
# porter stemmer
from nltk.stem import PorterStemmer
ps = PorterStemmer()

print('Porter stemmer:')
print(ps.stem('jumping'), ps.stem('jumps'), ps.stem('jumped'))
print(ps.stem('lying'))
print(ps.stem('strange'), '\n')

# lancaster stemmer
print('Lancaster stemmer:')
from nltk.stem import LancasterStemmer
ls = LancasterStemmer()
print(ls.stem('jumping'), ls.stem('jumps'), ls.stem('jumped'))
print(ls.stem('lying'))
print(ls.stem('strange'), '\n')

# regex stemmer
print('Regex stemmer:')
from nltk.stem import RegexpStemmer
rs = RegexpStemmer('ing$|s$|ed$', min=4)
print(rs.stem('jumping'), rs.stem('jumps'), rs.stem('jumped'))
print(rs.stem('lying'))
print(rs.stem('strange'), '\n')

# snowball stemmer
print('Snowball stemmer:')
from nltk.stem import SnowballStemmer
ss = SnowballStemmer("german")
print('Supported Languages:', SnowballStemmer.languages)
# autobahnen -> highways
# autobahn -> highway
print(ss.stem('autobahnen'))
# springen -> jumping
# spring -> jump
print(ss.stem('springen'), '\n')

# lemmatization
print('WordNet lemmatization:')
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
# lemmatize nouns
print(wnl.lemmatize('cars', 'n'))
print(wnl.lemmatize('men', 'n'))
# lemmatize verbs
print(wnl.lemmatize('running', 'v'))
print(wnl.lemmatize('ate', 'v'))
# lemmatize adjectives
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('fancier', 'a'))
# ineffective lemmatization
print(wnl.lemmatize('ate', 'n'))
print(wnl.lemmatize('fancier', 'v'), '\n')

print('spaCy:')
import spacy
nlp = spacy.load('en_core_web_sm')
text = 'My system keeps crashing! his crashed yesterday, ours crashes daily'

def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-'
                     else word.text for word in text])
    return text

print(lemmatize_text(text))

Porter stemmer:
jump jump jump
lie
strang 

Lancaster stemmer:
jump jump jump
lying
strange 

Regex stemmer:
jump jump jump
ly
strange 

Snowball stemmer:
Supported Languages: ('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
autobahn
spring 

WordNet lemmatization:
car
men
run
eat
sad
fancy
ate
fancier 

spaCy:
My system keep crash ! his crash yesterday , ours crash daily
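The regex stemmer result for `lying` (`ly`) makes sense once the rule is written out. The sketch below is an approximation in plain `re`, assuming the stemmer simply deletes the matched suffix from any word that has at least `min` characters; that assumption is consistent with the `ly` and `strange` outputs above, but it is not taken from NLTK's source.

```python
import re

def regex_stem(word, pattern=r'ing$|s$|ed$', min_length=4):
    """Strip a matching suffix, leaving short words untouched."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, '', word)

stems = [regex_stem(w) for w in ['jumping', 'jumps', 'jumped', 'lying', 'strange', 'bus']]
print(stems)
```

Unlike the Porter stemmer, a rule like this has no notion of what remains after the cut, so `lying` loses its whole stem while `bus` survives only because it is shorter than the minimum length.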


## Stop Words

In [32]:
import re
import nltk
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')

# removing stopwords
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]

    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]

    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

print(remove_stopwords('The, and, if are stopwords, computer is not'))

, , stopwords , computer
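The leftover commas in the output above appear because punctuation tokens are not in the stopword list. One possible variation, sketched here with a hypothetical six-word stopword set (`TOY_STOPWORDS`) and a plain regex tokenizer rather than `ToktokTokenizer`, filters punctuation tokens at the same time.

```python
import re

# Hypothetical stopword set; the notebook uses NLTK's English list instead.
TOY_STOPWORDS = {'the', 'and', 'if', 'are', 'is', 'not'}

def remove_stopwords_and_punct(text, stopwords=TOY_STOPWORDS):
    # Words become one token each; every punctuation mark is its own token.
    tokens = re.findall(r'\w+|[^\w\s]', text)
    kept = [t for t in tokens if t.isalnum() and t.lower() not in stopwords]
    return ' '.join(kept)

filtered = remove_stopwords_and_punct('The, and, if are stopwords, computer is not')
print(filtered)
```

Whether punctuation should be dropped at this stage depends on the pipeline; if special characters are removed in an earlier step, the original function is sufficient.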


## POS Tagging - Starting on Page 166

In [34]:
sentence = 'US unveils world\'s most powerful supercomputer, beats China.'
import pandas as pd
import spacy
from pprint import pprint

print('spaCy:')
nlp = spacy.load('en_core_web_sm')
sentence_nlp = nlp(sentence)
# POS tagging with spaCy
spacy_pos_tagged = [(word, word.tag_, word.pos_) for word in sentence_nlp]
# the .T in the book transposes rows and columns, but it's harder to read
pprint(pd.DataFrame(spacy_pos_tagged, columns=['Word', 'POS tag', 'Tag type']))

# POS tagging with nltk
print('\n', 'NLTK')
import nltk
# only need the following two lines one time
#nltk.download('averaged_perceptron_tagger')
#nltk.download('universal_tagset')
nltk_pos_tagged = nltk.pos_tag(nltk.word_tokenize(sentence), tagset='universal')
pprint(pd.DataFrame(nltk_pos_tagged, columns=['Word', 'POS tag']))

print('\n', 'Treebank:')
# you only need the next line once
# nltk.download('treebank')
from nltk.corpus import treebank
data = treebank.tagged_sents()
train_data = data[:3500]
test_data = data[3500:]
print(train_data[0])

print('\n', 'Default tagger:')
# default tagger
from nltk.tag import DefaultTagger
dt = DefaultTagger('NN')
# accuracy on test data
print(dt.evaluate(test_data))
# tagging our sample headline
print(dt.tag(nltk.word_tokenize(sentence)))

print('\n', 'Regex tagger')
# regex tagger
from nltk.tag import RegexpTagger
# define regex tag patterns
patterns = [
        (r'.*ing$', 'VBG'),               # gerunds
        (r'.*ed$', 'VBD'),                # simple past
        (r'.*es$', 'VBZ'),                # 3rd singular present
        (r'.*ould$', 'MD'),               # modals
        (r'.*\'s$', 'NN$'),               # possessive nouns
        (r'.*s$', 'NNS'),                 # plural nouns
        (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
        (r'.*', 'NN')                     # nouns (default) ... 
]
rt = RegexpTagger(patterns)
# accuracy on test data
print(rt.evaluate(test_data))
# tagging our sample headline
print(rt.tag(nltk.word_tokenize(sentence)))

print('\n', 'N Gram taggers')
## N gram taggers
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger

ut = UnigramTagger(train_data)
bt = BigramTagger(train_data)
tt = TrigramTagger(train_data)

# testing performance on unigram tagger
print(ut.evaluate(test_data))
print(ut.tag(nltk.word_tokenize(sentence)))

# testing performance of bigram tagger
print(bt.evaluate(test_data))
print(bt.tag(nltk.word_tokenize(sentence)))

# testing performance of trigram tagger
print(tt.evaluate(test_data))
print(tt.tag(nltk.word_tokenize(sentence)))

def combined_tagger(train_data, taggers, backoff=None):
    for tagger in taggers:
        backoff = tagger(train_data, backoff=backoff)
    return backoff

ct = combined_tagger(train_data=train_data,
                     taggers=[UnigramTagger, BigramTagger, TrigramTagger],
                     backoff=rt)
print(ct.evaluate(test_data))
print(ct.tag(nltk.word_tokenize(sentence)))

print('\n', 'Naive Bayes and Maxent')
from nltk.classify import NaiveBayesClassifier, MaxentClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger
nbt = ClassifierBasedPOSTagger(train=train_data,
                               classifier_builder=NaiveBayesClassifier.train)
print(nbt.evaluate(test_data))
print(nbt.tag(nltk.word_tokenize(sentence)), '\n')

# the following takes a LONG time to run - run if you have time
'''
met = ClassifierBasedPOSTagger(train=train_data,
                               classifier_builder=MaxentClassifier.train)
print(met.evaluate(test_data))
print(met.tag(nltk.word_tokenize(sentence)))
'''

spaCy:
             Word POS tag Tag type
0              US     NNP    PROPN
1         unveils     VBZ     VERB
2           world      NN     NOUN
3              's     POS     PART
4            most     RBS      ADV
5        powerful      JJ      ADJ
6   supercomputer      NN     NOUN
7               ,       ,    PUNCT
8           beats     VBZ     VERB
9           China     NNP    PROPN
10              .       .    PUNCT

 NLTK
             Word POS tag
0              US    NOUN
1         unveils     ADJ
2           world    NOUN
3              's     PRT
4            most     ADV
5        powerful     ADJ
6   supercomputer    NOUN
7               ,       .
8           beats    VERB
9           China    NOUN
10              .       .

 Treebank:
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'

'\nmet = ClassifierBasedPOSTagger(train=train_data,\n                               classifier_builder=MaxentClassifier.train)\nprint(met.evaluate(test_data))\nprint(met.tag(nltk.word_tokenize(sentence)))\n'
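The `combined_tagger` helper builds a backoff chain: each new tagger wraps the previous one, so the last class in the list is consulted first and the original `backoff` argument is the last resort. The toy sketch below imitates that structure with dictionary lookups. `LookupTagger`, `ConstantTagger`, and `chain_taggers` are hypothetical stand-ins written for this example and are not NLTK classes.

```python
class LookupTagger:
    """Toy tagger: look the word up in a dict, else defer to the backoff."""
    def __init__(self, table, backoff=None):
        self.table = table
        self.backoff = backoff

    def tag_word(self, word):
        if word in self.table:
            return self.table[word]
        return self.backoff.tag_word(word) if self.backoff else None


class ConstantTagger:
    """Toy last-resort tagger that always answers with the same tag."""
    def __init__(self, tag):
        self.tag = tag

    def tag_word(self, word):
        return self.tag


def chain_taggers(tables, backoff=None):
    # Same loop shape as combined_tagger: each tagger wraps the previous one.
    for table in tables:
        backoff = LookupTagger(table, backoff=backoff)
    return backoff


broad = {'beats': 'NNS', 'China': 'NNP'}
specific = {'beats': 'VBZ'}
tagger = chain_taggers([broad, specific], backoff=ConstantTagger('NN'))
tags = [(w, tagger.tag_word(w)) for w in ['beats', 'China', 'supercomputer']]
print(tags)
```

Here `beats` is answered by the last table in the list, `China` falls through to the first table, and `supercomputer` reaches the constant tagger, which mirrors how the trigram tagger defers to the bigram, unigram, and regex taggers in turn.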

## Shallow Parsing - Starting on Page 173

This can take a bit of time to run at the end...
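The cell below converts chunk trees into (word, POS tag, chunk tag) triples, where the chunk tags follow the IOB scheme: `B-` marks the first token of a chunk, `I-` marks the tokens that continue it, and `O` marks tokens outside any chunk. As a rough picture of that conversion, here is a pure-Python sketch. The nested-list input format and the `to_iob` helper are assumptions made for this example; NLTK works on `Tree` objects instead.

```python
def to_iob(chunked_sentence):
    """Flatten a chunked sentence into (word, pos, iob_tag) triples."""
    triples = []
    for item in chunked_sentence:
        if isinstance(item[1], list):      # a chunk: (label, [(word, pos), ...])
            label, tokens = item
            for i, (word, pos) in enumerate(tokens):
                prefix = 'B-' if i == 0 else 'I-'
                triples.append((word, pos, prefix + label))
        else:                              # a bare token outside any chunk
            word, pos = item
            triples.append((word, pos, 'O'))
    return triples

# Made-up sentence: two noun phrases with a verb and a period outside them.
chunked = [
    ('NP', [('the', 'DT'), ('quick', 'JJ'), ('fox', 'NN')]),
    ('jumped', 'VBD'),
    ('NP', [('fences', 'NNS')]),
    ('.', '.'),
]
triples = to_iob(chunked)
print(triples)
```

The n-gram chunker defined later in the cell learns to predict the third element of each triple from the second, which is why it trains on (POS tag, chunk tag) pairs.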

In [6]:
print('Treebank:')
from nltk.corpus import treebank_chunk
data = treebank_chunk.chunked_sents()
train_data = data[:3500]
test_data = data[3500:]
print(train_data[7], '\n')

print('Regex parser:')
simple_sentence = 'US unveils world\'s most powerful supercomputer, beats China.'
from nltk.chunk import RegexpParser
import nltk
# from pattern.en import tag  # not used in this cell
# get POS tagged sentence
tagged_simple_sent = nltk.pos_tag(nltk.word_tokenize(simple_sentence))
print('POS Tags:', tagged_simple_sent)

chunk_grammar = """
NP: {<DT>?<JJ>*<NN.*>}
"""
rc = RegexpParser(chunk_grammar)
c = rc.parse(tagged_simple_sent)
print(c, '\n')

print('Chinking:')
chink_grammar = """
NP: {<.*>+} # chunk everything as NP
}<VBD|IN>+{
"""
rc = RegexpParser(chink_grammar)
c = rc.parse(tagged_simple_sent)
# print and view chunked sentence using chinking
print(c, '\n')

# create a more generic shallow parser
print('More generic shallow parser:')
grammar = """
NP: {<DT>?<JJ>?<NN.*>}  
ADJP: {<JJ>}
ADVP: {<RB.*>}
PP: {<IN>}      
VP: {<MD>?<VB.*>+}
"""
rc = RegexpParser(grammar)
c = rc.parse(tagged_simple_sent)
# print and view shallow parsed simple sentence
print(c)
# Evaluate parser performance on test data
print(rc.evaluate(test_data), '\n')

print('Chunked and treebank:')
from nltk.chunk.util import tree2conlltags, conlltags2tree
train_sent = train_data[7]
print(train_sent)
# get the (word, POS tag, Chung tag) triples for each token
wtc = tree2conlltags(train_sent)
print(wtc)
# get shallow parsed tree back from the WTC triples
tree = conlltags2tree(wtc)
print(tree, '\n')

print('NGramTagChunker:')
def conll_tag_chunks(chunk_sents):
  tagged_sents = [tree2conlltags(tree) for tree in chunk_sents]
  return [[(t, c) for (w, t, c) in sent] for sent in tagged_sents]
  
def combined_tagger(train_data, taggers, backoff=None):
    for tagger in taggers:
        backoff = tagger(train_data, backoff=backoff)
    return backoff
  
from nltk.tag import UnigramTagger, BigramTagger
from nltk.chunk import ChunkParserI
class NGramTagChunker(ChunkParserI):
  def __init__(self, train_sentences, 
               tagger_classes=[UnigramTagger, BigramTagger]):
    train_sent_tags = conll_tag_chunks(train_sentences)
    self.chunk_tagger = combined_tagger(train_sent_tags, tagger_classes)

  def parse(self, tagged_sentence):
    if not tagged_sentence: 
        return None
    pos_tags = [tag for word, tag in tagged_sentence]
    chunk_pos_tags = self.chunk_tagger.tag(pos_tags)
    chunk_tags = [chunk_tag for (pos_tag, chunk_tag) in chunk_pos_tags]
    wpc_tags = [(word, pos_tag, chunk_tag) for ((word, pos_tag), chunk_tag)
                     in zip(tagged_sentence, chunk_tags)]
    return conlltags2tree(wpc_tags)

# train the shallow parser
ntc = NGramTagChunker(train_data)
# test parser performance on test data
print(ntc.evaluate(test_data))

# the next 2 lines don't belong and have been commented out
# sentence_nlp = nlp(sentence)
# tagged_sentence = [(word.text, word.tag_) for word in sentence_nlp]

# parse our sample sentence
print('Parsing NTC...')
tree = ntc.parse(tagged_simple_sent)
print(tree)
tree.draw()

print('Wall Street Journal (cut to just 1000):')
# only need the next line once
#nltk.download('conll2000')
from nltk.corpus import conll2000
wsj_data = conll2000.chunked_sents()
train_wsj_data = wsj_data[:1000]
test_wsj_data = wsj_data[1000:]
print(train_wsj_data[10])

# train the shallow parser
tc = NGramTagChunker(train_wsj_data)
# test performance on test data
print(tc.evaluate(test_wsj_data))

# the code at the start of page 183 repeats the code on page 181,
# so it is left out here

Treebank:
(S
  (NP A/DT Lorillard/NNP spokewoman/NN)
  said/VBD
  ,/,
  ``/``
  (NP This/DT)
  is/VBZ
  (NP an/DT old/JJ story/NN)
  ./.) 

Regext parser:
POS Tags: [('US', 'NNP'), ('unveils', 'JJ'), ('world', 'NN'), ("'s", 'POS'), ('most', 'RBS'), ('powerful', 'JJ'), ('supercomputer', 'NN'), (',', ','), ('beats', 'VBZ'), ('China', 'NNP'), ('.', '.')]
(S
  (NP US/NNP)
  (NP unveils/JJ world/NN)
  's/POS
  most/RBS
  (NP powerful/JJ supercomputer/NN)
  ,/,
  beats/VBZ
  (NP China/NNP)
  ./.) 

Chinking:
(S
  (NP
    US/NNP
    unveils/JJ
    world/NN
    's/POS
    most/RBS
    powerful/JJ
    supercomputer/NN
    ,/,
    beats/VBZ
    China/NNP
    ./.)) 

More generic shallow parser:
(S
  (NP US/NNP)
  (NP unveils/JJ world/NN)
  's/POS
  (ADVP most/RBS)
  (NP powerful/JJ supercomputer/NN)
  ,/,
  (VP beats/VBZ)
  (NP China/NNP)
  ./.)
ChunkParse score:
    IOB Accuracy:  46.1%%
    Precision:     19.9%%
    Recall:        43.3%%
    F-Measure:     27.3%% 

Chunked and treebank:
(S
  (

## Dependency Parsing

In [2]:
sentence = 'US unveils world\'s most powerful supercomputer, beats China.'
import spacy
nlp = spacy.load('en_core_web_sm')
sentence_nlp = nlp(sentence)
dependency_pattern = '{left}<---{word}[{w_type}]--->{right}\n--------'
for token in sentence_nlp:
    print(dependency_pattern.format(word=token.orth_, w_type=token.dep_,
                                    left=[t.orth_ for t in token.lefts],
                                    right=[t.orth_ for t in token.rights]))
                                             
from spacy import displacy
displacy.render(sentence_nlp, jupyter=True, style='dep',
                options={'distance': 100,
                        'arrow_stroke': 2,
                        'arrow_width': 8})

[]<---US[nsubj]--->[]
--------
['US']<---unveils[ROOT]--->['supercomputer', ',', 'beats', '.']
--------
[]<---world[poss]--->["'s"]
--------
[]<---'s[case]--->[]
--------
[]<---most[advmod]--->[]
--------
['most']<---powerful[amod]--->[]
--------
['world', 'powerful']<---supercomputer[dobj]--->[]
--------
[]<---,[punct]--->[]
--------
[]<---beats[conj]--->['China']
--------
[]<---China[dobj]--->[]
--------
[]<---.[punct]--->[]
--------
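
Each line of the output above lists a token with its left and right children, so the head of every token can be read back from those lists. The sketch below does that in plain Python, using the rows copied from the output. It keys on the word text, which only works here because every token in this sentence is unique; that is an assumption of the sketch, not something that holds in general. In spaCy itself the same information is available directly as `token.head`.

```python
# (word, dependency label, left children, right children),
# copied from the printed output of the cell above
rows = [
    ('US', 'nsubj', [], []),
    ('unveils', 'ROOT', ['US'], ['supercomputer', ',', 'beats', '.']),
    ('world', 'poss', [], ["'s"]),
    ("'s", 'case', [], []),
    ('most', 'advmod', [], []),
    ('powerful', 'amod', ['most'], []),
    ('supercomputer', 'dobj', ['world', 'powerful'], []),
    (',', 'punct', [], []),
    ('beats', 'conj', [], ['China']),
    ('China', 'dobj', [], []),
    ('.', 'punct', [], []),
]

# invert the child lists: every child points back to its head
head_of = {}
for word, dep, lefts, rights in rows:
    for child in lefts + rights:
        head_of[child] = word

# the ROOT token is the only one that never appears as a child
for word, dep, _, _ in rows:
    head = head_of.get(word, '(root)')
    print(f"{word:<14}{dep:<8}head = {head}")
```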


## NOTE
The book goes into the Stanford parser at the bottom of page 187. This Stanford parser is deprecated and requires a local server (too complicated for this). Therefore, I commented all the code out. It's just another parser and does the same thing as the rest of the code without the hassle.

## Constituency Parsing - Starting on Page 195

In [3]:
sentence = 'US unveils world\'s most powerful supercomputer, beats China.'

import nltk
from nltk.grammar import Nonterminal
from nltk.corpus import treebank
training_set = treebank.parsed_sents()
print(training_set[1], '\n')

# extract the productions for all annotated training sentences
treebank_productions = list(
                        set(production 
                            for sent in training_set  
                            for production in sent.productions()
                        )
                    )
# view some production rules
print(treebank_productions[0:10])
  
# add productions for each word, POS tag
for word, tag in treebank.tagged_words():
    t = nltk.Tree.fromstring( "("+ tag + " " + word  + ")")
    for production in t.productions():
        treebank_productions.append(production)

# build the PCFG based grammar  
treebank_grammar = nltk.grammar.induce_pcfg(Nonterminal('S'), 
                                         treebank_productions)

# build the parser
viterbi_parser = nltk.ViterbiParser(treebank_grammar)
# get sample sentence tokens
tokens = nltk.word_tokenize(sentence)
# get parse tree for sample sentence
# the next line throws an error (see the text on page 197)
# result = list(viterbi_parser.parse(tokens))

# get tokens and their POS tags and check it
tagged_sent = nltk.pos_tag(nltk.word_tokenize(sentence))
print(tagged_sent, '\n')

# extend productions for sample sentence tokens
for word, tag in tagged_sent:
    t = nltk.Tree.fromstring("("+ tag + " " + word  +")")
    for production in t.productions():
        treebank_productions.append(production)

# rebuild grammar
treebank_grammar = nltk.grammar.induce_pcfg(Nonterminal('S'), treebank_productions)
# rebuild parser
viterbi_parser = nltk.ViterbiParser(treebank_grammar)
# get parse tree for sample sentence
result = list(viterbi_parser.parse(tokens))
# print parse tree
print(result[0])
# visualize parse tree
result[0].draw()

(S
  (NP-SBJ (NNP Mr.) (NNP Vinken))
  (VP
    (VBZ is)
    (NP-PRD
      (NP (NN chairman))
      (PP
        (IN of)
        (NP
          (NP (NNP Elsevier) (NNP N.V.))
          (, ,)
          (NP (DT the) (NNP Dutch) (VBG publishing) (NN group))))))
  (. .)) 

[NN -> 'traffic', NN -> 'creditworthiness', JJ -> 'other', VP -> VBD PP-MNR S, NN -> 'overtime', VBD -> 'mounted', VB -> 'suit', NP-SBJ -> WDT, NNS -> 'jumps', VB -> 'swap']
[('US', 'NNP'), ('unveils', 'JJ'), ('world', 'NN'), ("'s", 'POS'), ('most', 'RBS'), ('powerful', 'JJ'), ('supercomputer', 'NN'), (',', ','), ('beats', 'VBZ'), ('China', 'NNP'), ('.', '.')] 

(S
  (NP-SBJ-2
    (NP (NNP US))
    (NP
      (NP (JJ unveils) (NN world) (POS 's))
      (JJS most)
      (JJ powerful)
      (NN supercomputer)))
  (, ,)
  (VP (VBZ beats) (NP-TTL (NNP China)))
  (. .)) (p=5.08954e-43)
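
The tiny probability on the last line is easier to read with the arithmetic in view. `induce_pcfg` estimates each rule's probability by relative frequency, that is, the count of the rule divided by the count of all rules with the same left-hand side, and the probability of a parse tree is the product of the probabilities of the rules it uses. The sketch below works through that in plain Python on a toy grammar. The productions are invented for illustration and are not taken from the treebank.

```python
from collections import Counter

# Toy productions as (lhs, rhs) pairs; hypothetical data for illustration
productions = [
    ('S', ('NP', 'VP')),
    ('NP', ('NNP',)),
    ('NP', ('NNP',)),
    ('NP', ('DT', 'NN')),
    ('VP', ('VBZ', 'NP')),
    ('NNP', ('US',)),
    ('NNP', ('China',)),
    ('VBZ', ('beats',)),
]

# relative-frequency estimate: count(lhs -> rhs) / count(lhs -> anything)
rule_counts = Counter(productions)
lhs_counts = Counter(lhs for lhs, _ in productions)
prob = {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

# rules used by the parse of "US beats China"
used = [
    ('S', ('NP', 'VP')),
    ('NP', ('NNP',)), ('NNP', ('US',)),
    ('VP', ('VBZ', 'NP')), ('VBZ', ('beats',)),
    ('NP', ('NNP',)), ('NNP', ('China',)),
]

# tree probability is the product of its rule probabilities
tree_prob = 1.0
for rule in used:
    tree_prob *= prob[rule]

print(prob[('NP', ('NNP',))])
print(tree_prob)
```

With thousands of treebank rules and a longer sentence, many small factors are multiplied together, which is why the parse above ends up with a probability on the order of 1e-43.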
