# Exercise 02: Regular Expressions and the Natural Language Toolkit

### Task 1: Regular Expressions
In this task we're working with Python's `re` module. Consult the documentation to solve the following subtasks: https://docs.python.org/3/library/re.html

In [1]:
# import the re package 
import re

Make yourself familiar with the different regex functions `re.XXXX(pattern, string)` available in Python. Describe the differences between them.

* re.search()
* re.match() 
* re.fullmatch()
* re.split()
* re.findall() 
* re.finditer()

In [2]:
# re.search() requires a pattern and a string; it returns a Match object or None
re.search(r"World", "Hello World")

<re.Match object; span=(6, 11), match='World'>
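
The six functions listed above differ mainly in where they look for a match and in what they return. The sketch below uses a made-up sample string (not part of the exercise) to show the differences side by side; all variable names are illustrative.

```python
import re

sample = "cat bat cat"  # illustrative string, not part of the exercise

# search: first match anywhere in the string (Match object or None)
m_search = re.search(r"bat", sample)

# match: succeeds only if the pattern matches at the very beginning
m_match = re.match(r"bat", sample)

# fullmatch: succeeds only if the pattern matches the entire string
m_full = re.fullmatch(r"cat bat cat", sample)

# split: cuts the string at every match and returns the pieces as a list
pieces = re.split(r"\s+", sample)

# findall: all non-overlapping matches as a list of strings
all_cats = re.findall(r"cat", sample)

# finditer: like findall, but yields Match objects, so positions are available
cat_spans = [m.span() for m in re.finditer(r"cat", sample)]

print(m_search.span())     # (4, 7)
print(m_match)             # None
print(m_full is not None)  # True
print(pieces)              # ['cat', 'bat', 'cat']
print(all_cats)            # ['cat', 'cat']
print(cat_spans)           # [(0, 3), (8, 11)]
```

In short: `search`, `match`, and `fullmatch` return a single Match object or `None`, while `split` and `findall` return lists and `finditer` returns an iterator of Match objects.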

__a)__ Write a regular expression that checks if "Hello World" is contained in a string.

In [5]:
positive = "this is a Hello World test "
negative = "this is a H3ll0 W0rld Test "

In [7]:
result = re.search("Hello World", negative)
if result:
    print("'Hello World' is contained in the string")
else:
    print("'Hello World' is not contained in the string")    

'Hello World' is not contained in the string


__b)__ Write a regular expression that selects all integer numbers from a text.

In [9]:
s1 = "This is a text written in 2022, and it contains numbers like one, 2, 3. Other numbers like 123 are also detected."

In [11]:
re.findall(r"\d+", s1)


['2022', '2', '3', '123']
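
The task does not say whether digits embedded in a longer token (such as the `2` in `room2b`) count as integer numbers. Assuming they should not, one possible variant adds word boundaries. The sample string below is made up for illustration.

```python
import re

sample = "room2b has 15 seats and 3 doors"  # illustrative string

# \d+ matches every run of digits, including digits inside other tokens
loose = re.findall(r"\d+", sample)

# \b...\b only keeps digit runs that form a token of their own
strict = re.findall(r"\b\d+\b", sample)

print(loose)   # ['2', '15', '3']
print(strict)  # ['15', '3']
```

On `s1` both patterns return the same list, because every number there already stands alone.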

__c)__ Write a regular expression that finds two (lower-case) words connected by a "-" (e.g., ill-advised).

In [16]:
s2 = "This my so-so example text. It features well-received meaningless words. The string 1-2 should not be returned."

In [17]:
re.findall(r'[a-z]+-[a-z]+', s2)

['so-so', 'well-received']
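
The pattern above works for `s2`, but it can also pick up fragments of longer or capitalised compounds. A minimal sketch, assuming only stand-alone two-part words are wanted, uses lookarounds to reject neighbouring word characters and hyphens. The sample string is made up for illustration.

```python
import re

sample = "A state-of-the-art, Well-known and so-so idea"  # illustrative string

# plain pattern: also returns pieces of longer or capitalised compounds
plain = re.findall(r"[a-z]+-[a-z]+", sample)

# lookarounds: no word character or hyphen directly before or after the match
strict = re.findall(r"(?<![\w-])[a-z]+-[a-z]+(?![\w-])", sample)

print(plain)   # ['state-of', 'the-art', 'ell-known', 'so-so']
print(strict)  # ['so-so']
```

Applied to `s2`, the stricter pattern still yields `['so-so', 'well-received']`.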

__d)__ Write a regular expression that selects numbers with at least 4 digits from a text


In [18]:
s3 = "This is a 2022 text, the third useless text I'm writing after 2021 ended. Larger numbers like 63527 should be detected, but smaller numbers like 123 should not."

In [22]:
re.findall(r"\d{4,}", s3)

['2022', '2021', '63527']

__d)__ Write a regular expression that extracts dates in the format YYYY-MM-DD.
Assume that Y, M, and D are valid whenever they are digits.

In [24]:
s4 = "2017-05-12 and 2018-01-01 are correct formats,  2017-25-01 is also valid. Not valid is 03.05.22, neither is 12/12/2017."

In [25]:
re.findall(r'\d{4}-\d{2}-\d{2}', s4)

['2017-05-12', '2018-01-01', '2017-25-01']
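
The task treats any digits as valid, which is why `2017-25-01` is returned. If calendar validity mattered (an assumption that goes beyond the exercise), one option is to pass each candidate to `datetime.strptime`, as sketched below. The helper name `is_calendar_date` is hypothetical.

```python
import re
from datetime import datetime

s4 = "2017-05-12 and 2018-01-01 are correct formats,  2017-25-01 is also valid. Not valid is 03.05.22, neither is 12/12/2017."

def is_calendar_date(candidate):
    """Return True if candidate is a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(candidate, "%Y-%m-%d")
        return True
    except ValueError:
        return False

# step 1: the regex finds everything that has the right shape
candidates = re.findall(r"\d{4}-\d{2}-\d{2}", s4)
# step 2: strptime rejects impossible months and days
real_dates = [c for c in candidates if is_calendar_date(c)]

print(real_dates)  # ['2017-05-12', '2018-01-01']
```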

__e)__ Write a regular expression that finds all URLs in a string. For the sake of simplicity, a URL starts with "http://" or "https://" and contains at least one dot, which is not at the end.

In [32]:
s5 = "The url http://example1.com should be returned, as well as http://www.example2.com. Let's try http://test."

In [30]:
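# Behaviour of .group():
# m.group() or m.group(0): returns the entire matched string, i.e. the complete URL matched by the regular expression.
# m.group(1): returns the content of the first capturing group (the part inside parentheses), if the pattern defines one.

# [\w-]+ matches one or more letters, digits, underscores, or hyphens (one domain label).
# (?:\.[\w-]+)+ requires at least one further label introduced by a dot, so a match contains a dot but never ends with one.

In [36]:
pattern5 = r'https?://[\w-]+(?:\.[\w-]+)+'
re.findall(pattern5, s5)

['http://example1.com', 'http://www.example2.com']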

__f)__ Replace all occurrences of _two_ as a whole word with _2_ in a string.

In [37]:
s6 = "An easy math task is two + 3 = 5. The infix two in the word twofold should remain unchanged, though."

In [38]:
re.sub(r"\btwo\b", "2", s6)

'An easy math task is 2 + 3 = 5. The infix 2 in the word twofold should remain unchanged, though.'
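
`re.sub` also accepts a function as the replacement, which is handy when several words map to different values. The sketch below is one way to generalise the whole-word replacement; the lookup table `NUMBER_WORDS` and the helper `words_to_digits` are hypothetical and not part of the exercise.

```python
import re

# hypothetical lookup table, not part of the exercise
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3"}

def words_to_digits(text):
    """Replace whole-word number names with their digits."""
    # \b(...)\b keeps the match to whole words; group 1 holds the word found
    pattern = r"\b(" + "|".join(NUMBER_WORDS) + r")\b"
    return re.sub(pattern, lambda m: NUMBER_WORDS[m.group(1)], text)

converted = words_to_digits("one plus two is three, but twofold stays.")
print(converted)  # 1 plus 2 is 3, but twofold stays.
```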

__g)__ Return a list with all elements from the given list that don't contain the letter 'a'.

In [47]:
l = ["apple", "cucumber", "tomato", "zucchini", "pumpkin", "pear"]

In [50]:
words = [w for w in l if not re.search(r"a", w)]
print(words)

['cucumber', 'zucchini', 'pumpkin']


### Task 2: Natural Language Toolkit
In the next task we're working with the Natural Language Toolkit (NLTK) package of Python, which is a powerful open source library for Natural Language Processing. Make sure you have the package installed and the documentation ready: https://www.nltk.org/api/nltk.html.

Then download the Gutenberg corpus and the Punkt sentence tokenization models. `nltk.download()` without parameters provides an interactive interface.

In [62]:
# import the NLTK package
import nltk

# download the Gutenberg corpus
nltk.download("gutenberg")
# download the tokenization models: current NLTK versions require punkt_tab,
# older versions use punkt
nltk.download("punkt_tab")
nltk.download("punkt")

[nltk_data] Downloading package gutenberg to C:\Users\Edward
[nltk_data]     Gaming\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\Edward
[nltk_data]     Gaming\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Edward
[nltk_data]     Gaming\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

__a)__ Import the ``gutenberg`` corpus reader from the ``corpus`` module of the ``nltk`` package. What does the Gutenberg corpus contain?

In [57]:
from nltk.corpus import gutenberg
print(gutenberg.readme())
gutenberg.fileids()

Project Gutenberg Selections
http://gutenberg.net/

This corpus contains etexts from from Project Gutenberg,
by the following authors:

* Jane Austen (3)
* William Blake (2)
* Thornton W. Burgess
* Sarah Cone Bryant
* Lewis Carroll
* G. K. Chesterton (3)
* Maria Edgeworth
* King James Bible
* Herman Melville
* John Milton
* William Shakespeare (3)
* Walt Whitman

The beginning of the body of each book could not be identified automatically,
so the semi-generic header of each file has been removed, and included below.
Some source files ended with a line "End of The Project Gutenberg Etext...",
and this has been deleted.

Information about Project Gutenberg (one page)

We produce about two million dollars for each hour we work.  The
fifty hours is one conservative estimate for how long it we take
to get any etext selected, entered, proofread, edited, copyright
searched and analyzed, the copyright letters written, etc.  This
projected audience is one hundred million readers.  If our value


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

__b)__ Have a look at the book "Persuasion" by Jane Austen (``austen-persuasion.txt``). Print the number of words and the number of sentences that the book contains. How many unique words can you find? What fraction of sentences contain the word "Anne"?

In [67]:
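# gutenberg.raw() returns the whole book as one string, so the two values below
# count characters, not words (gutenberg.words() returns the word tokens)
austen = gutenberg.raw("austen-persuasion.txt")
print(len(austen))       # number of characters
print(len(set(austen)))  # number of distinct characters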

466292
78
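
The two numbers above are character counts. For word counts, the same idea can be applied to a list of tokens instead of the raw string. The helper below is a minimal sketch with a hypothetical name (`word_counts`); it is demonstrated on a tiny hand-made token list so that it runs without the corpus, and it assumes that pure punctuation tokens should not count as words.

```python
def word_counts(tokens, ignore_case=False):
    """Return (number of word tokens, number of distinct word tokens).

    Tokens without any alphanumeric character (pure punctuation) are skipped.
    """
    words = [t for t in tokens if any(ch.isalnum() for ch in t)]
    if ignore_case:
        words = [w.lower() for w in words]
    return len(words), len(set(words))

# tiny hand-made sample standing in for the corpus tokens
toy_tokens = ["Anne", "smiled", ".", "anne", "left", ",", "and", "Anne", "smiled", "."]

print(word_counts(toy_tokens))                    # (7, 5)
print(word_counts(toy_tokens, ignore_case=True))  # (7, 4)
```

In the notebook, the token list would presumably come from `gutenberg.words("austen-persuasion.txt")`; whether case and punctuation should be ignored is a design choice the task leaves open.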


In [66]:
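aussents = gutenberg.sents("austen-persuasion.txt")
print("the number of sentences is", len(aussents))
# each sentence is a list of tokens, so this tests for the exact token "Anne"
anne = [a for a in aussents if "Anne" in a]
print(len(anne))
# fraction of sentences that contain "Anne"
print(round(len(anne) / len(aussents), 4))

the number of sentences is 3747
477
0.1273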


__c)__ By using functions from ``nltk``, tokenize the given text into
* words
* sentences

In [72]:
text = "Tokenize this sentence into sentences and words. This is the task someone gave to you. Why, i.e., what the purpose of this, is remains unknown."

In [73]:
from nltk.tokenize import word_tokenize, sent_tokenize

sent1 = sent_tokenize(text)
print(sent1)


['Tokenize this sentence into sentences and words.', 'This is the task someone gave to you.', 'Why, i.e., what the purpose of this, is remains unknown.']


In [74]:
word1 = word_tokenize(text)
print(word1)

['Tokenize', 'this', 'sentence', 'into', 'sentences', 'and', 'words', '.', 'This', 'is', 'the', 'task', 'someone', 'gave', 'to', 'you', '.', 'Why', ',', 'i.e.', ',', 'what', 'the', 'purpose', 'of', 'this', ',', 'is', 'remains', 'unknown', '.']


__d)__ Use the Porter stemmer from ``nltk`` on the text from task __c)__

In [76]:
from nltk.stem import PorterStemmer

p = PorterStemmer()
stem1 = [p.stem(w) for w in word1]
print(stem1)

['token', 'thi', 'sentenc', 'into', 'sentenc', 'and', 'word', '.', 'thi', 'is', 'the', 'task', 'someon', 'gave', 'to', 'you', '.', 'whi', ',', 'i.e.', ',', 'what', 'the', 'purpos', 'of', 'thi', ',', 'is', 'remain', 'unknown', '.']
