# Exercise 02: Regular Expressions and the Natural Language Toolkit

### Task 1: Regular Expressions
In this task we're working with the `re` package of Python. Take a look at the documentation to solve the following subtasks: https://docs.python.org/3/library/re.html

In [1]:
# import the re package
import re

Familiarize yourself with the different regex functions `re.XXXX(pattern, string)` available in Python. Describe the differences between them.

* re.search()
* re.match()
* re.fullmatch()
* re.split()
* re.findall()
* re.finditer()
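
A minimal sketch of how the six functions differ, run on a made-up string (`sample`, `pattern`, and the result names are illustrative and not part of the exercise):

```python
import re

sample = "cat bat cat"   # hypothetical sample string
pattern = r"cat"

# search: first match anywhere in the string
first_span = re.search(pattern, sample).span()
# match: succeeds only if the pattern matches at the start of the string
at_start = re.match(r"bat", sample)
# fullmatch: the entire string has to match the pattern
whole = re.fullmatch(pattern, sample)
# split: cut the string at every match of the pattern
pieces = re.split(r"\s", sample)
# findall: list of all non-overlapping matches as strings
all_matches = re.findall(pattern, sample)
# finditer: iterator of Match objects, one per match
spans = [m.span() for m in re.finditer(pattern, sample)]

print(first_span)   # (0, 3)
print(at_start)     # None
print(whole)        # None
print(pieces)       # ['cat', 'bat', 'cat']
print(all_matches)  # ['cat', 'cat']
print(spans)        # [(0, 3), (8, 11)]
```

`search`, `match`, and `fullmatch` return a single `Match` object or `None`, whereas `split` and `findall` return lists and `finditer` returns an iterator.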

__a)__ Write a regular expression that checks if "Hello World" is contained in a string.

In [2]:
positive = "this is a Hello World test "
negative = "this is a H3ll0 W0rld Test "

In [7]:
a = re.search('Hello World', positive)
b = re.search('Hello World', negative)
print(a)
print(a.group())

print(b)
print(b.group())


<re.Match object; span=(10, 21), match='Hello World'>
Hello World
None


AttributeError: 'NoneType' object has no attribute 'group'

In [11]:
a = re.search(r'Hello World', positive)
b = re.search(r'Hello World', negative)
# re.search() returns None when nothing matches, so check before calling .group()
print(a.group() if a else None)
print(b.group() if b else None)

Hello World
None

__b)__ Write a regular expression that selects all integer numbers from a text.

In [12]:
s1 = "This is a text written in 2022, and it contains numbers like one, 2, 3. Other numbers like 123 are also detected."

In [13]:
re.findall(r'\d+', s1)

['2022', '2', '3', '123']

__c)__ Write a regular expression that finds two (lower-case) words connected by a "-" (e.g., ill-advised).

In [15]:
s2 = "This my so-so example text. try so-o0 and No-prob1em. It features well-received meaningless words. The string 1-2 should not be returned."

In [18]:
re.findall(r'\b[a-z]+-[a-z]+\b', s2)

['so-so', 'well-received']

__d)__ Write a regular expression that selects numbers with at least 4 digits from a text.


In [19]:
s3 = "This is a 2022 text, the third useless text I'm writing after 2021 ended. Larger numbers like 63527 should be detected, but smaller numbers like 123 should not."

In [22]:
# do not write [\d{4,}]+ : inside brackets, \d, {, 4, the comma, and } are each a single allowed character
re.findall(r'\d{4,}', s3)

['2022', '2021', '63527']
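
To see why the quantifier has to stay outside the brackets, here is a small comparison on a made-up string (`sample` and the result names are hypothetical and used only for this illustration):

```python
import re

sample = "2022, 123 and 63527"  # hypothetical sample string

# quantifier applied to \d: runs of four or more digits
four_plus = re.findall(r'\d{4,}', sample)
# inside brackets, {, 4, the comma, and } are literal set members, so the
# class matches runs of digits, braces, and commas of any length
char_class = re.findall(r'[\d{4,}]+', sample)

print(four_plus)   # ['2022', '63527']
print(char_class)  # ['2022,', '123', '63527']
```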

__d)__ Write a regular expression that extracts dates in the format YYYY-MM-DD.
Assume that Y, M, and D are valid whenever they are digits.

In [25]:
s4 = "2017-05-12 and 2018-01-01 are correct formats,  2017-25-01 is also valid. Not valid is 03.05.22, neither is 12/12/2017. neither is 12540-85-58. neither is 3584-13-32"

In [24]:
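# \b stops a match from starting inside a longer digit run such as 12540-85-58
# 3584-13-32 is returned too, because any digits count as valid here
re.findall(r'\b\d{4}-\d{2}-\d{2}\b', s4)

['2017-05-12', '2018-01-01', '2017-25-01', '3584-13-32']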
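
The exercise deliberately ignores calendar validity. If stricter checking were wanted, one possible approach (not required here) is to let `datetime.strptime` reject impossible months and days. The helper name `real_dates` and the sample string are assumptions made for this sketch:

```python
import re
from datetime import datetime

text = "2017-05-12 and 2017-25-01 and 3584-13-32"  # hypothetical sample

def real_dates(s):
    """Return only the YYYY-MM-DD candidates that are real calendar dates."""
    result = []
    for candidate in re.findall(r'\b\d{4}-\d{2}-\d{2}\b', s):
        try:
            datetime.strptime(candidate, "%Y-%m-%d")
        except ValueError:
            continue  # e.g. month 25 or day 32
        result.append(candidate)
    return result

print(real_dates(text))  # ['2017-05-12']
```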

__e)__ Write a regular expression that finds all URLs in a string. For the sake of simplicity, a URL starts with "http://" or "https://" and contains at least one dot, which is not at the end.

In [36]:
s5 = "The url http://example1.com/emajpf/ejisjl.cor should be returned, as well as https://www.example2.com#fef=di_eies123f. Let's try http://test."

In [55]:
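c = re.finditer(r'https?://([^\s]+\.[^\s\.]+)', s5)
# [^\s]+ matches one or more non-whitespace characters; the ^ negates the class, which is not the same as ^[\s]+
# \.[^\s\.]+ requires a dot followed by at least one character that is neither whitespace nor a dot;
# without the trailing + only a single character after the dot would be matched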

for i in c:
  print(i.group())

http://example1.com/emajpf/ejisjl.cor
https://www.example2.com#fef=di_eies123f


In [56]:
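# negative lookbehind (?<!\.): the match must not end with a dot
d = re.finditer(r'https?://[^\s]+(?<!\.)', s5)
# matches http(s):// followed by non-whitespace characters and drops a trailing '.';
# it does not require a dot inside the URL, so http://test is returned as well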

for i in d:
  print(i.group())

http://example1.com/emajpf/ejisjl.cor
https://www.example2.com#fef=di_eies123f
http://test


In [41]:
# this attempt only accepts http:// (not https://), so the second URL is missed
re.findall(r'http://[\w\./^*-_]+\.[\w/^*-_#]+', s5)

['http://example1.com/emajpf/ejisjl.cor']
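
One thing worth noting about the last pattern is that `*-_` inside the brackets is read as a character range rather than as three literal characters. A small sketch on a made-up string (the result names are illustrative):

```python
import re

# inside a character class, *-_ is a range from '*' (0x2A) to '_' (0x5F),
# which also covers digits, upper-case letters, and several punctuation marks
as_range = re.findall(r'[*-_]', "a*B-_7:z")
# escaping the hyphen (or placing it last) makes it a literal character
as_literal = re.findall(r'[*\-_]', "a*B-_7:z")

print(as_range)    # ['*', 'B', '-', '_', '7', ':']
print(as_literal)  # ['*', '-', '_']
```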

__f)__ Replace all occurrences of _two_ as a whole word with _2_ in a string.

In [63]:
s6 = "An easy math task is two + 3 = 5. The infix Two in the word twofold should remain unchanged, though."

In [64]:
s6.replace('two', '2')
# str.replace() substitutes every occurrence of 'two', even inside other words, so it is not suitable here

'An easy math task is 2 + 3 = 5. The infix Two in the word 2fold should remain unchanged, though.'

In [65]:
re.sub(r'\btwo\b', '2', s6, flags=re.I)
# \b restricts the match to whole words; flags=re.I makes it case-insensitive, so 'Two' is replaced as well

'An easy math task is 2 + 3 = 5. The infix 2 in the word twofold should remain unchanged, though.'
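
For comparison, a short sketch of the same substitution without `re.I`, plus `re.subn`, which also reports how many replacements were made (the variable names are illustrative):

```python
import re

s6 = "An easy math task is two + 3 = 5. The infix Two in the word twofold should remain unchanged, though."

# without re.I only the lower-case whole word is replaced
case_sensitive = re.sub(r'\btwo\b', '2', s6)
# subn returns the new string together with the number of replacements
_, count_ignore_case = re.subn(r'\btwo\b', '2', s6, flags=re.I)

print(case_sensitive)
print(count_ignore_case)  # 2
```

Whether "Two" should be replaced depends on how the task is read; dropping the flag keeps the capitalized word untouched.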

__g)__ Return a list with all elements from the given list that don't contain the letter 'a'.

In [68]:
l = ["apple", "cucumber", "tomato", "zucchini", "pumpkin", "pear"]

In [72]:
resl = []
for i in l:
  if not re.search(r'a', i):
    resl.append(i)

print(resl)

['cucumber', 'zucchini', 'pumpkin']
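
The same filter can be sketched more compactly with a list comprehension. Two variants follow, one without regex and one with `re.fullmatch`; the names `no_a_plain` and `no_a_regex` are illustrative:

```python
import re

l = ["apple", "cucumber", "tomato", "zucchini", "pumpkin", "pear"]

# plain membership test, no regex needed
no_a_plain = [word for word in l if 'a' not in word]
# regex variant: the whole word must consist of characters other than 'a'
no_a_regex = [word for word in l if re.fullmatch(r'[^a]*', word)]

print(no_a_plain)  # ['cucumber', 'zucchini', 'pumpkin']
print(no_a_regex)  # ['cucumber', 'zucchini', 'pumpkin']
```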


### Task 2: Natural Language Toolkit
In the next task we're working with the Natural Language Toolkit (NLTK) package of Python, which is a powerful open source library for Natural Language Processing. Make sure you have the package installed and the documentation ready: https://www.nltk.org/api/nltk.html.

Then download the Gutenberg corpus and the Punkt sentence tokenization models. `nltk.download()` without parameters provides an interactive interface.

In [89]:
# import the NLTK package
import nltk

# download the Gutenberg corpus
nltk.download("gutenberg")
# download the tokenization models: recent NLTK versions need punkt_tab,
# older ones use punkt
nltk.download("punkt_tab")
nltk.download("punkt")

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

__a)__ Import the ``gutenberg`` corpus reader from the ``corpus`` module of the ``nltk`` package. What does the Gutenberg corpus contain?

In [76]:
# the correct form is `from <module> import <name>`, not `import <name> from <module>`
from nltk.corpus import gutenberg

# the corpus reader has no read() method; readme() returns the corpus description

In [78]:
print(gutenberg.readme())

Project Gutenberg Selections
http://gutenberg.net/

This corpus contains etexts from from Project Gutenberg,
by the following authors:

* Jane Austen (3)
* William Blake (2)
* Thornton W. Burgess
* Sarah Cone Bryant
* Lewis Carroll
* G. K. Chesterton (3)
* Maria Edgeworth
* King James Bible
* Herman Melville
* John Milton
* William Shakespeare (3)
* Walt Whitman

The beginning of the body of each book could not be identified automatically,
so the semi-generic header of each file has been removed, and included below.
Some source files ended with a line "End of The Project Gutenberg Etext...",
and this has been deleted.

Information about Project Gutenberg (one page)

We produce about two million dollars for each hour we work.  The
fifty hours is one conservative estimate for how long it we take
to get any etext selected, entered, proofread, edited, copyright
searched and analyzed, the copyright letters written, etc.  This
projected audience is one hundred million readers.  If our value


__b)__ Have a look at the book "Persuasion" by Jane Austen (``austen-persuasion.txt``). Print the number of words and the number of sentences that the book contains. How many unique words can you find? What fraction of sentences contain the word "Anne"?

In [87]:
fulltext = gutenberg.raw('austen-persuasion.txt')
# raw() returns the book as one string, so len() counts characters, not words
len(fulltext)



466292

In [90]:
fullwords = gutenberg.words('austen-persuasion.txt')
fullsents = gutenberg.sents('austen-persuasion.txt')
# words() returns the list of tokens (punctuation included); sents() returns a list of tokenized sentences


print(len(fullwords))
print(len(fullsents))


98171
3747
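
The counts above do not yet cover the unique words or the share of sentences containing "Anne". A possible sketch follows, shown on a tiny hypothetical stand-in for the result of `gutenberg.sents('austen-persuasion.txt')` so that it runs without the corpus. Lower-casing and dropping non-alphabetic tokens is one possible definition of "unique word", not the only one:

```python
# hypothetical stand-in for gutenberg.sents(...): a list of token lists
sents = [
    ["Anne", "smiled", "."],
    ["He", "left", "."],
    ["Anne", "and", "Mary", "walked", "."],
    ["It", "rained", "."],
]
words = [token for sent in sents for token in sent]

# unique words: lower-case and keep alphabetic tokens only
unique_words = {w.lower() for w in words if w.isalpha()}
# fraction of sentences that contain the token "Anne"
anne_fraction = sum("Anne" in sent for sent in sents) / len(sents)

print(len(unique_words))  # 9
print(anne_fraction)      # 0.5
```

With the real corpus, `sents` would be the list returned by `gutenberg.sents('austen-persuasion.txt')` and the rest stays the same.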


In [82]:
string1 = 'a and b'
len(string1)

7

__c)__ By using functions from ``nltk``, tokenize the given text into
* words
* sentences

In [104]:
text1 = "Tokenize this sentence into sentences and words. This is the task someone gave to you. Why, i.e., what the purpose of this, is remains unknown."

In [105]:
from nltk.tokenize import word_tokenize, sent_tokenize
# the functions are called word_tokenize and sent_tokenize: joined by an underscore, not a dot (not word.tokenize),
# and singular 'word', not 'words'

print(word_tokenize(text1))
print(sent_tokenize(text1))

['Tokenize', 'this', 'sentence', 'into', 'sentences', 'and', 'words', '.', 'This', 'is', 'the', 'task', 'someone', 'gave', 'to', 'you', '.', 'Why', ',', 'i.e.', ',', 'what', 'the', 'purpose', 'of', 'this', ',', 'is', 'remains', 'unknown', '.']
['Tokenize this sentence into sentences and words.', 'This is the task someone gave to you.', 'Why, i.e., what the purpose of this, is remains unknown.']


__d)__ Use the Porter stemmer from ``nltk`` on the text from task __c)__.

In [109]:
from nltk.stem import PorterStemmer

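p = PorterStemmer()
# split() only cuts at whitespace, so punctuation stays attached ('words.', 'you.')
# and those tokens are not stemmed properly; tokenizing with word_tokenize first avoids this
for i in text1.split():
  stemwords = p.stem(i)
  print(stemwords)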

token
thi
sentenc
into
sentenc
and
words.
thi
is
the
task
someon
gave
to
you.
why,
i.e.,
what
the
purpos
of
this,
is
remain
unknown.


In [112]:
stemword1 = [p.stem(i) for i in text1.split()]
print(stemword1)

['token', 'thi', 'sentenc', 'into', 'sentenc', 'and', 'words.', 'thi', 'is', 'the', 'task', 'someon', 'gave', 'to', 'you.', 'why,', 'i.e.,', 'what', 'the', 'purpos', 'of', 'this,', 'is', 'remain', 'unknown.']
