# Exercise 02: Regular Expressions and the Natural Language Toolkit

### Task 1: Regular Expressions
In this task we work with Python's `re` module. Consult the documentation to solve the following subtasks: https://docs.python.org/3/library/re.html

In [1]:
# import the re package
import re

Familiarize yourself with the different regex functions `re.XXXX(pattern, string)` available in Python. Describe the differences between them.

* re.search()
* re.match()
* re.fullmatch()
* re.split()
* re.findall()
* re.finditer()

__Solution__
* re.search(): scan the string and return the _first_ match for the pattern, wherever it occurs
* re.match(): match the pattern only at the _beginning_ of the given string
* re.fullmatch(): match only if the _entire_ string corresponds to the pattern
* re.split(): _split_ a given string at each occurrence of the pattern
* re.findall(): _find all_ (non-overlapping) occurrences of the pattern, return them as a list
* re.finditer(): _find all_ (non-overlapping) occurrences of the pattern, return them as an _iterator_ of match objects
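
What "non-overlapping" means in practice can be seen in a short sketch. The sample string and the lookahead workaround are illustrative additions, not part of the exercise:

```python
import re

text = 'aaaa'  # hypothetical sample string

# findall consumes each match before searching on, so matches never overlap
non_overlapping = re.findall(r'aa', text)

# a zero-width lookahead with a capturing group reports overlapping hits as well
overlapping = re.findall(r'(?=(aa))', text)

print(non_overlapping)  # ['aa', 'aa']
print(overlapping)      # ['aa', 'aa', 'aa']
```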

In [3]:
re.match?

In [6]:
x = re.search(r'word', 'wor d swords word')
print(x)
print(x.group())  # group() returns one or more subgroups of the match; without arguments, the group index defaults to 0 (the whole match)

<re.Match object; span=(7, 11), match='word'>
word


In [7]:
re.search(r'word', 'wor d swords word')

<re.Match object; span=(7, 11), match='word'>

In [18]:
y1 = re.match(r'word', ' word hello')  # None: match() only tries position 0, and the leading ' ' counts
y2 = re.match(r'he', 'hello')  # matches: the pattern only has to fit the start of the string, not all of it

print(y2)
print(y2.group())

<re.Match object; span=(0, 2), match='he'>
he
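
Since `re.match` returns `None` when the pattern does not fit at position 0, calling `.group()` on the result without a check raises an `AttributeError`. A minimal sketch of a guard follows; the helper name `first_match_text` is made up for illustration:

```python
import re

def first_match_text(pattern, string):
    """Return the text matched at the start of `string`, or None if there is no match."""
    m = re.match(pattern, string)
    return m.group() if m is not None else None

print(first_match_text(r'word', ' word hello'))  # None (the leading space blocks the match)
print(first_match_text(r'he', 'hello'))          # he
```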


In [11]:
z1 = re.fullmatch(r'word', 'Word', flags=re.I)  # re.IGNORECASE
z2 = re.fullmatch(r'word', 'Word of hello', flags=re.I)
print(z1)
print(z2)

<re.Match object; span=(0, 4), match='Word'>
None


In [15]:
a = re.split(r'a+', 'bab caad')
print(a)

['b', 'b c', 'd']
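
Two variations on the split above may be useful: a capturing group in the pattern keeps the separators, and `maxsplit` limits how often the string is cut. This is a small sketch on the same input string:

```python
import re

s = 'bab caad'

# a capturing group keeps the separators in the result
with_separators = re.split(r'(a+)', s)

# maxsplit limits the number of splits; the rest of the string stays intact
first_split_only = re.split(r'a+', s, maxsplit=1)

print(with_separators)   # ['b', 'a', 'b c', 'aa', 'd']
print(first_split_only)  # ['b', 'b caad']
```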


In [25]:
b = re.findall(r'word', 'wor d swords word')
c = re.findall(r'\bw+\w*', 'wor d swords word')  # words starting with 'w': \b = word boundary, w+ = one or more 'w', \w* = zero or more word characters ([a-zA-Z0-9_] for ASCII text)
print(b)
print(c)

['word', 'word']
['wor', 'word']


In [29]:
c = re.finditer(r'word', 'wor d swords word')  # finditer yields match objects lazily, one at a time

for w in c:
    print(w.group())

word
word


__a)__ Write a regular expression that checks if "Hello World" is contained in a string.

In [32]:
positive = "this is a Hello World test "
negative = "this is a H3ll0 W0rld Test "

In [34]:
# solution
print(bool(re.search(r'Hello World', positive)))
print(bool(re.search(r'Hello World', negative)))

print(re.search(r'hello World', positive, flags=re.I).group())  # re.I: case-insensitive matching

# alternatively, test the result with `is not None`,
# or branch on the match object directly:
a = re.search(r'Hello World', positive)
if a:
    print(a.group())
else:
    print('no match')

True
False
Hello World
Hello World


__b)__ Write a regular expression that selects all integer numbers from a text.

In [35]:
s1 = "This is a text written in 2022, and it contains numbers like one, 2, 3. Other numbers like 123 are also detected."

In [39]:
# solution
re.findall(r'[0-9]+', s1)

['2022', '2', '3', '123']

__c)__ Write a regular expression that finds two (lower-case) words connected by a "-" (e.g., ill-advised).

In [40]:
s2 = "This my (so-so) example text. It features well-received meaningless words. The string 1-2 should not be returned."

In [41]:
# not working: re.findall(r'\w+-\w+', s2), because \w also matches digits and underscores, so '1-2' would be returned

# solution
re.findall(r'[a-z]+-[a-z]+', s2)

['so-so', 'well-received']
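
The character class `[a-z]` alone can still return fragments of capitalized words. The sketch below compares three variants on a made-up sample text; the `\b` anchors are an optional refinement, not something the task requires:

```python
import re

sample = "A Well-known ill-advised plan, version 1-2."  # hypothetical sample text

loose = re.findall(r'\w+-\w+', sample)               # \w also accepts digits and capitals
unanchored = re.findall(r'[a-z]+-[a-z]+', sample)    # may start in the middle of a word
anchored = re.findall(r'\b[a-z]+-[a-z]+\b', sample)  # whole lower-case words only

print(loose)       # ['Well-known', 'ill-advised', '1-2']
print(unanchored)  # ['ell-known', 'ill-advised']
print(anchored)    # ['ill-advised']
```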

__d)__ Write a regular expression that selects numbers with at least 4 digits from a text.


In [50]:
s3 = "This is a 2022 text, the third useless text I'm writing after 202111 ended. \
    Larger numbers like 63527 should be detected, but smaller numbers like 123 should not."

In [48]:
# solution: {4,} means "four or more"; an upper bound such as {4,6} would cut longer numbers into pieces
re.findall(r'\d{4,}', s3)

['2022', '202111', '63527']

In [57]:
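# for contrast: numbers with exactly 3 digits (the \b anchors prevent matches inside longer numbers)
re.findall(r'\b\d{3}\b', s3)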

['123']
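
How the quantifier bounds and the `\b` anchors interact is easiest to see on a number longer than six digits. This is a sketch on a made-up sample text, not part of the original solution:

```python
import re

sample = "pin 1234, account 12345678, code abc98765"  # hypothetical sample text

capped = re.findall(r'\d{4,6}', sample)          # at most six digits per match
open_ended = re.findall(r'\d{4,}', sample)       # four or more digits
whole_tokens = re.findall(r'\b\d{4,}\b', sample) # digit runs that stand alone

print(capped)        # ['1234', '123456', '98765']
print(open_ended)    # ['1234', '12345678', '98765']
print(whole_tokens)  # ['1234', '12345678']
```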

__e)__ Write a regular expression that extracts dates in the format YYYY-MM-DD.
Assume that Y, M, and D are valid whenever they are digits.

In [53]:
s4 = "2017-05-12 and 2018-01-01 are correct formats,  2017-25-01 is also valid. \
    Not valid is 03.05.22, neither is 12/12/2017."

In [54]:
# solution
re.findall(r'\d{4}-\d{2}-\d{2}', s4)

['2017-05-12', '2018-01-01', '2017-25-01']
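
The task treats any digits as valid, so `2017-25-01` is returned even though month 25 does not exist. If real calendar dates were required, one possible approach is to let `datetime.strptime` check each candidate. The helper `is_calendar_date` is a hypothetical name, and the sample string is a shortened copy of `s4`:

```python
import re
from datetime import datetime

s4 = "2017-05-12 and 2018-01-01 are correct formats,  2017-25-01 is also valid."

def is_calendar_date(text):
    """Return True if `text` is a real date in YYYY-MM-DD form."""
    try:
        datetime.strptime(text, '%Y-%m-%d')
        return True
    except ValueError:
        return False

candidates = re.findall(r'\d{4}-\d{2}-\d{2}', s4)
real_dates = [d for d in candidates if is_calendar_date(d)]

print(real_dates)  # ['2017-05-12', '2018-01-01']
```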

__f)__ Write a regular expression that finds all URLs in a string. For the sake of simplicity, a URL starts with "http://" or "https://" and contains at least one dot that is not at the end.

In [58]:
s5 = "The url http://example1.com should be returned, as well as http://www.example2.com. Let's try http://test."

In [None]:
s5 = "The url http://www.exa..mple1.com should be returned, as well as https://www.example2.com. Let's try http://test. http://wwwexa..mple1/djei.com"


In [74]:
for m in re.finditer(r'https?://(\w*\.+\w+)+', s5):
    print(m.group())

http://www.exa..mple1.com
https://www.example2.com
http://wwwexa..mple1


In [75]:
pattern6 = r'https?://(?:www\.)?[^\s]+\.\w+'
re.findall(pattern6, s5)

['http://www.exa..mple1.com',
 'https://www.example2.com',
 'http://wwwexa..mple1/djei.com']

In [None]:
pattern7 = r'https?://[.\w-]+\.\w+'
re.findall(pattern7, s5)

In [81]:
pattern5 = r'https?://[\w.-]+'
re.findall(pattern5, s5)

['http://www.exa..mple1.com',
 'https://www.example2.com.',
 'http://test.',
 'http://wwwexa..mple1']

In [None]:
r'''
- Be careful with \. (a literal dot) vs. . (any character).
- To allow a second dot, a single \w+\.\w+ is not enough; the dotted part has to repeat.
- To allow several dots in a row: re.findall(r'https?://(?:\w*\.+\w+)+', s5)
- Note the use of (?:...) instead of (...): the former is a non-capturing group, the latter a capturing group.
  The result of findall depends on the number of capturing groups in the pattern. With no groups, it returns a list of
  strings matching the whole pattern. With exactly one group, it returns a list of strings matching that group.
  With multiple groups, it returns a list of tuples of strings matching the groups. Non-capturing groups do
  not affect the form of the result.
'''

# same as r'https?://(?:\w*\.\w+)+'; with a single \. per group, the match stops at the double dot in the first URL
re.findall(r'http://(?:\w*\.\w+)+|https://(?:\w*\.\w+)+', s5)

#re.findall(r'https?://(?:\w*\.\w*)+\w+', s5)

['http://www.exa', 'https://www.example2.com']
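
The effect of capturing groups on the shape of the `findall` result can be sketched on a single made-up URL. The three patterns differ only in their grouping:

```python
import re

text = "see https://www.example2.com now"  # hypothetical sample text

no_groups = re.findall(r'https?://(?:\w+\.)+\w+', text)       # list of whole matches
one_group = re.findall(r'https?://(\w+\.)+\w+', text)         # list of group contents
two_groups = re.findall(r'(https?)://((?:\w+\.)+\w+)', text)  # list of tuples

print(no_groups)   # ['https://www.example2.com']
print(one_group)   # ['example2.']  (a repeated group keeps only its last repetition)
print(two_groups)  # [('https', 'www.example2.com')]
```

When the match objects themselves are needed, `re.finditer` with `m.group()` returns the whole match regardless of how many groups the pattern contains.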

In [None]:
# strictly speaking, http://www.exa..mple1.com is not a valid URL
re.findall(r'(https?://(?:\w+\.{1})+\w+)\.?\s', s5) # match valid URLs followed by period (optional) and whitespace

['https://www.example2.com']

__g)__ Replace all occurrences of _two_ as a whole word with _2_ in a string.

In [None]:
s6 = "An easy math task is two + 3 = 5. The infix two in the word twofold should remain unchanged, though. onetwo"

In [None]:
# solution
re.sub(r'\btwo\b', '2', s6)

'An easy math task is 2 + 3 = 5. The infix 2 in the word twofold should remain unchanged, though. onetwo'
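
`re.sub` also accepts a function as the replacement, which helps when several words need different substitutes. A minimal sketch, assuming a small made-up mapping `digits` and sample sentence:

```python
import re

digits = {'one': '1', 'two': '2', 'three': '3'}  # hypothetical mapping
text = "one plus two is three, but twofold stays"

# build \b(one|two|three)\b from the mapping keys
pattern = r'\b(' + '|'.join(map(re.escape, digits)) + r')\b'

# the replacement function receives each match object and returns the new text
result = re.sub(pattern, lambda m: digits[m.group(1)], text)

print(result)  # 1 plus 2 is 3, but twofold stays
```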

__h)__ Return a list with all elements from the given list that don't contain the letter 'a'.

In [85]:
l = ["apple", "cucumber", "tomato", "zucchini", "pumpkin", "pear"]

In [86]:
# solution
l_new = [w for w in l if not re.search(r'a', w)]
print(l_new)

['cucumber', 'zucchini', 'pumpkin']


### Task 2: Natural Language Toolkit
In the next task we're working with the Natural Language Toolkit (NLTK) package of Python, which is a powerful open source library for Natural Language Processing. Make sure you have the package installed and the documentation ready: https://www.nltk.org/api/nltk.html.

Then download the Gutenberg corpus and the Punkt sentence tokenization models. `nltk.download()` without parameters provides an interactive interface.

In [None]:
# import the NLTK package
import nltk

nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> 1


    Error loading 1: Package '1' not found in index



---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------


In [None]:
# download the Gutenberg corpus
nltk.download("gutenberg")
# download tokenization models
# nltk.download("punkt")
# current nltk version requires punkt_tab
nltk.download("punkt_tab")

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/schloett/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/schloett/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

__a)__ Import the ``gutenberg`` corpus reader from the ``corpus`` module of the  ``nltk`` package. What does the Gutenberg corpus contain?

In [None]:
# import corpus reader
from nltk.corpus import gutenberg

# https://www.nltk.org/howto/corpus.html

# have a look at the readme file
print(gutenberg.readme())

# have a look at the files of the corpus
gutenberg.fileids()

Project Gutenberg Selections
http://gutenberg.net/

This corpus contains etexts from from Project Gutenberg,
by the following authors:

* Jane Austen (3)
* William Blake (2)
* Thornton W. Burgess
* Sarah Cone Bryant
* Lewis Carroll
* G. K. Chesterton (3)
* Maria Edgeworth
* King James Bible
* Herman Melville
* John Milton
* William Shakespeare (3)
* Walt Whitman

The beginning of the body of each book could not be identified automatically,
so the semi-generic header of each file has been removed, and included below.
Some source files ended with a line "End of The Project Gutenberg Etext...",
and this has been deleted.

Information about Project Gutenberg (one page)

We produce about two million dollars for each hour we work.  The
fifty hours is one conservative estimate for how long it we take
to get any etext selected, entered, proofread, edited, copyright
searched and analyzed, the copyright letters written, etc.  This
projected audience is one hundred million readers.  If our value


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

__b)__ Have a look at the book "Persuasion" by Jane Austen (``austen-persuasion.txt``). Print the number of words and the number of sentences that the book contains. How many unique words can you find? What fraction of sentences contain the word "Anne"?

In [None]:
# The book "Persuasion" by Jane Austen as raw text
raw = gutenberg.raw('austen-persuasion.txt')

print(raw[:1090])
# the length of the raw text is the number of characters, including spaces and line breaks
print(len(raw))

[Persuasion by Jane Austen 1818]


Chapter 1


Sir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who,
for his own amusement, never took up any book but the Baronetage;
there he found occupation for an idle hour, and consolation in a
distressed one; there his faculties were roused into admiration and
respect, by contemplating the limited remnant of the earliest patents;
there any unwelcome sensations, arising from domestic affairs
changed naturally into pity and contempt as he turned over
the almost endless creations of the last century; and there,
if every other leaf were powerless, he could read his own history
with an interest which never failed.  This was the page at which
the favourite volume always opened:

           "ELLIOT OF KELLYNCH HALL.

"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,
daughter of James Stevenson, Esq. of South Park, in the county of
Gloucester, by which lady (who died 1800) he has issue Elizabeth,
born June 1, 1785; Ann

In [None]:
# get all tokens as a list
words = gutenberg.words('austen-persuasion.txt')

print(words[:10])

# number of words
print(len(words))

# number of unique tokens (case-sensitive, including punctuation marks)
print(len(set(words)))

['[', 'Persuasion', 'by', 'Jane', 'Austen', '1818', ']', 'Chapter', '1', 'Sir']
98171
6132


In [None]:
sents = gutenberg.sents('austen-persuasion.txt')

# first 3 sentences
print([' '.join(s) for s in sents[:3]])

# number of sentences
print('number of sentences:', len(sents))

# fraction of sentences that contain the word "Anne"
sents_anne = [s for s in sents if 'Anne' in s]
print(len(sents_anne))

print(len(sents_anne)/len(sents))

['[ Persuasion by Jane Austen 1818 ]', 'Chapter 1', 'Sir Walter Elliot , of Kellynch Hall , in Somersetshire , was a man who , for his own amusement , never took up any book but the Baronetage ; there he found occupation for an idle hour , and consolation in a distressed one ; there his faculties were roused into admiration and respect , by contemplating the limited remnant of the earliest patents ; there any unwelcome sensations , arising from domestic affairs changed naturally into pity and contempt as he turned over the almost endless creations of the last century ; and there , if every other leaf were powerless , he could read his own history with an interest which never failed .']
number of sentences: 3747
477
0.12730184147317855
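
The test `'Anne' in s` checks for an exact, case-sensitive token in each sentence. The sketch below shows what that does and does not count; the token lists are made up and only mimic the list-of-tokens shape of `sents`:

```python
# hypothetical tokenized sentences, in the list-of-tokens shape that `sents` uses
toy_sents = [
    ['Anne', 'smiled', '.'],
    ['Annette', 'left', '.'],
    ['He', 'saw', 'anne', '.'],
    ['It', 'was', 'Anne', "'", 's', 'book', '.'],
]

# `in` on a list compares whole tokens, case-sensitively
exact = [s for s in toy_sents if 'Anne' in s]

# a looser check: case-insensitive comparison of whole tokens
loose = [s for s in toy_sents if any(tok.lower() == 'anne' for tok in s)]

print(len(exact) / len(toy_sents))  # 0.5
print(len(loose) / len(toy_sents))  # 0.75
```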


__c)__ By using functions from ``nltk``, tokenize the given text into
* words
* sentences

In [None]:
text = "Tokenize this sentence into sentences and words. This is the task someone gave to you. \
    Why, i.e., what the purpose of this is, remains unknown."

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

# tokenize into sentences
sentences = sent_tokenize(text)
print(sentences)

['Tokenize this sentence into sentences and words.', 'This is the task someone gave to you.', 'Why, i.e., what the purpose of this is, remains unknown.']


In [None]:
# tokenize into words
words = word_tokenize(text)
print(words)

['Tokenize', 'this', 'sentence', 'into', 'sentences', 'and', 'words', '.', 'This', 'is', 'the', 'task', 'someone', 'gave', 'to', 'you', '.', 'Why', ',', 'i.e.', ',', 'what', 'the', 'purpose', 'of', 'this', 'is', ',', 'remains', 'unknown', '.']
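
For contrast with the output above, where `i.e.` stays one token, a naive regex tokenizer built with `re` alone breaks the abbreviation apart. This is only a sketch of that baseline, not a replacement for `word_tokenize`:

```python
import re

text = "Why, i.e., what the purpose of this is, remains unknown."

# runs of word characters, or any single character that is neither word nor whitespace
naive_tokens = re.findall(r"\w+|[^\w\s]", text)

print(naive_tokens[:6])  # ['Why', ',', 'i', '.', 'e', '.']
```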


__d)__ Use the Porter stemmer from ``nltk`` on the text from task __c)__.

In [None]:
from nltk.stem import PorterStemmer

p = PorterStemmer()
stems = [p.stem(word) for word in words]
print(stems)

['token', 'thi', 'sentenc', 'into', 'sentenc', 'and', 'word', '.', 'thi', 'is', 'the', 'task', 'someon', 'gave', 'to', 'you', '.', 'whi', ',', 'i.e.', ',', 'what', 'the', 'purpos', 'of', 'thi', 'is', ',', 'remain', 'unknown', '.']
