**Summary of basic notations to match single characters and sequences of characters**

1. /[abc]/ = /a|b|c/ Character class (disjunction): matches one of a, b, or c
2. /[b-e]/ = /b|c|d|e/ Range in a character class
3. /[^b-e]/ Complement of a character class: matches any character other than b, c, d, or e
4. /./ Wildcard: matches any single character (in most engines, any character except a newline)
5. /a*/ /[af]*/ /(abc)*/ Kleene star: zero or more
6. /a?/ /(ab|ca)?/ Zero or one; optional
7. /a+/ /([a-zA-Z]1|ca)+/ Kleene plus: one or more
8. /a{8}/ /b{1,2}/ /c{3,}/ Counters: exactly 8 repeats; between 1 and 2 repeats; 3 or more repeats
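
The notations in this summary can be tried out with Python's built-in `re` module. The following is a minimal sketch; the helper name `matches_fully` and the sample strings are illustrative choices, not part of the original summary.

```python
import re

def matches_fully(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# Character classes, ranges, complements and the wildcard select single characters.
class_hits = re.findall(r"[abc]", "cabin")          # characters that are a, b or c
range_hits = re.findall(r"[b-e]", "abcdef")         # characters from b through e
complement_hits = re.findall(r"[^b-e]", "abcdef")   # characters outside b through e
wildcard_hits = re.findall(r"b.d", "bad bed b d")   # b, any one character, then d

# Quantifiers control how often the preceding item may repeat.
star_empty = matches_fully(r"(abc)*", "")        # zero repeats are allowed
plus_empty = matches_fully(r"a+", "")            # at least one repeat is required
optional_once = matches_fully(r"(ab|ca)?", "ca") # zero or one occurrence
exactly_eight = matches_fully(r"a{8}", "a" * 8)  # exactly 8 repeats
too_many = matches_fully(r"b{1,2}", "bbb")       # 3 repeats exceed the 1 to 2 range
at_least_three = matches_fully(r"c{3,}", "ccccc")  # 3 or more repeats

print(class_hits, range_hits, complement_hits, wildcard_hits)
print(star_empty, plus_empty, optional_once, exactly_eight, too_many, at_least_three)
```

`re.fullmatch` is used for the quantifier checks so that the whole string has to match; `re.search` would also accept a match on just part of the string.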

In [None]:
'''
/^a/ pattern must match at the beginning of the string
/a$/ pattern must match at the end of the string
'''
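
The two anchors can be checked with `re.search`, which scans the whole string unless an anchor pins the match to one end. This is a small sketch; the word list is made up for illustration.

```python
import re

# Illustrative sample words (not taken from any of the lab texts).
words = ["apple", "banana", "data", "a"]

# ^a : the match has to start at the beginning of the string.
starts_with_a = [w for w in words if re.search(r"^a", w)]

# a$ : the match has to end at the end of the string.
ends_with_a = [w for w in words if re.search(r"a$", w)]

# ^a$ : both anchors together require the whole string to be "a".
only_a = [w for w in words if re.search(r"^a$", w)]

print(starts_with_a)
print(ends_with_a)
print(only_a)
```

In Python, `$` also matches just before a single trailing newline, so `re.search(r"a$", "data\n")` still succeeds.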


## Reading Text from Files, Stemming and Lemmatization Lab Exercise:

In [2]:
#import all the necessary packages
import nltk
from urllib import request

In [3]:
#text from gutenberg.org
#this is the URL of the book "Crime and Punishment"
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
print(type(raw))
print(len(raw))
print(raw[:200])

<class 'str'>
1176965
﻿The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, giv


In [6]:
#Optional
#Let's try to import another text, this time with HTML formatting
blondurl = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(blondurl).read().decode('utf8')
html[:1000]
#As we can see, the raw page is mostly HTML markup rather than readable text
#import bs4 to remove the HTML markup
from bs4 import BeautifulSoup
#Get the contents as a BeautifulSoup object (the 'lxml' parser requires the lxml package)
soup = BeautifulSoup(html, 'lxml')
#Use the get_text function to get the text content of the page with the tags stripped.
braw = soup.get_text()
btokens = nltk.word_tokenize(braw)
btokens[:100]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\r\n<html>\r\n<head>\r\n<title>BBC NEWS | Health | Blondes \'to die out in 200 years\'</title>\r\n<meta name="keywords" content="BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service">\r\n<meta name="OriginalPublicationDate" content="2002/09/27 11:51:55">\r\n<meta name="UKFS_URL" content="/1/hi/health/2284783.stm">\r\n<meta name="IFS_URL" content="/2/hi/health/2284783.stm">\r\n<meta name="HTTP-EQUIV" content="text/html;charset=iso-8859-1">\r\n<meta name="Headline" content="Blondes \'to die out in 200 years\'">\r\n<meta name="Section" content="Health">\r\n<meta name="Description" content="Natural blondes are an endangered species and will die out by 2202, a study suggests.">\r\n<!-- GENMaps-->\r\n<map name="banner">\r\n<area alt="BBC NEWS" coords="7,9,167,32" href="http://news.bbc.co.uk/1/hi.html" shape="RECT">\r\n</map>\r\n\r\n<script src="/
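
To make the idea of "removing markup" concrete, here is a rough sketch that uses only the standard library's `html.parser` instead of BeautifulSoup. It is a stand-in for illustration, not how `get_text` is implemented; the class name `TextExtractor`, the function `html_to_text`, and the sample fragment are all hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # greater than 0 while inside a skipped element
        self.chunks = []       # text pieces in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(markup: str) -> str:
    """Return the visible text of `markup`, with pieces joined by single spaces."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)

# A made-up fragment in the style of the page shown in the output.
sample = (
    "<html><head><title>BBC NEWS | Health</title>"
    "<script>var x = 1;</script></head>"
    "<body><p>Natural blondes are <b>an endangered species</b>.</p></body></html>"
)
text = html_to_text(sample)
print(text)
```

The resulting string can be passed to `nltk.word_tokenize` in the same way as `braw`.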

In [27]:
#read the txt file (the with-block closes the file automatically)
with open('desert.txt') as fin:
    rawtext = fin.read()

deserttokens = nltk.word_tokenize(rawtext)
text = nltk.Text(deserttokens)
text.concordance('pass')

Displaying 2 of 2 matches:
 Shaiba range of mountainous dunes , pass by the quicksand of Umm al Samim ( M
 Shaiba range of mountainous dunes , pass by the quicksand of Umm al Samim ( M


In [30]:
#Stemming and lemmatization
with open('CrimeAndPunishment.txt') as f:
    crimetext = f.read()

#Tokenize the text. Capitalization is left as-is at this stage.
crimetokens = nltk.word_tokenize(crimetext)
print(len(crimetokens))
print(crimetokens[:100])

['Produced',
 'by',
 'John',
 'Bickers',
 ';',
 'and',
 'Dagny',
 'CRIME',
 'AND',
 'PUNISHMENT',
 'By',
 'Fyodor',
 'Dostoevsky',
 'Translated',
 'By',
 'Constance',
 'Garnett',
 'TRANSLATOR',
 "'S",
 'PREFACE',
 'A',
 'few',
 'words',
 'about',
 'Dostoevsky',
 'himself',
 'may',
 'help',
 'the',
 'English',
 'reader',
 'to',
 'understand',
 'his',
 'work',
 '.',
 'Dostoevsky',
 'was',
 'the',
 'son',
 'of',
 'a',
 'doctor',
 '.',
 'His',
 'parents',
 'were',
 'very',
 'hard-working',
 'and',
 'deeply',
 'religious',
 'people',
 ',',
 'but',
 'so',
 'poor',
 'that',
 'they',
 'lived',
 'with',
 'their',
 'five',
 'children',
 'in',
 'only',
 'two',
 'rooms',
 '.',
 'The',
 'father',
 'and',
 'mother',
 'spent',
 'their',
 'evenings',
 'in',
 'reading',
 'aloud',
 'to',
 'their',
 'children',
 ',',
 'generally',
 'from',
 'books',
 'of',
 'a',
 'serious',
 'character',
 '.',
 'Though',
 'always',
 'sickly',
 'and',
 'delicate',
 'Dostoevsky',
 'came',
 'out',
 'third']

In [33]:
#Create the two NLTK stemmers used in this lab: Porter and Lancaster
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
#compare how the 2 stemmers work on a small portion of the tokens
crimePstem = [porter.stem(t) for t in crimetokens]
print(crimePstem[:200],'\n')
crimeLstem = [lancaster.stem(t) for t in crimetokens]
print(crimeLstem[:200])

#The Lancaster stemmer lower-cases all the words. In some cases it removes word endings
#more aggressively than the Porter stemmer, but in others it does not.



['produc', 'by', 'john', 'bicker', ';', 'and', 'dagni', 'crime', 'and', 'punish', 'By', 'fyodor', 'dostoevski', 'translat', 'By', 'constanc', 'garnett', 'translat', "'S", 'prefac', 'A', 'few', 'word', 'about', 'dostoevski', 'himself', 'may', 'help', 'the', 'english', 'reader', 'to', 'understand', 'hi', 'work', '.', 'dostoevski', 'wa', 'the', 'son', 'of', 'a', 'doctor', '.', 'hi', 'parent', 'were', 'veri', 'hard-work', 'and', 'deepli', 'religi', 'peopl', ',', 'but', 'so', 'poor', 'that', 'they', 'live', 'with', 'their', 'five', 'children', 'in', 'onli', 'two', 'room', '.', 'the', 'father', 'and', 'mother', 'spent', 'their', 'even', 'in', 'read', 'aloud', 'to', 'their', 'children', ',', 'gener', 'from', 'book', 'of', 'a', 'seriou', 'charact', '.', 'though', 'alway', 'sickli', 'and', 'delic', 'dostoevski', 'came', 'out', 'third', 'in', 'the', 'final', 'examin', 'of', 'the', 'petersburg', 'school', 'of', 'engin', '.', 'there', 'he', 'had', 'alreadi', 'begun', 'hi', 'first', 'work', ',', '`

In [34]:
#NLTK has a lemmatizer that uses the WordNet lexical database
#as a dictionary to look up the base form (lemma) of each word.
wnl = nltk.WordNetLemmatizer()
crimeLemma = [wnl.lemmatize(t) for t in crimetokens]
crimeLemma[:200]

#Note that WordNetLemmatizer treats every token as a noun by default (pos='n'), so verbs
#such as 'lived' and 'were' stay unchanged unless pos='v' is passed. In general, it
#changes far fewer tokens than the stemmers do.

['Produced',
 'by',
 'John',
 'Bickers',
 ';',
 'and',
 'Dagny',
 'CRIME',
 'AND',
 'PUNISHMENT',
 'By',
 'Fyodor',
 'Dostoevsky',
 'Translated',
 'By',
 'Constance',
 'Garnett',
 'TRANSLATOR',
 "'S",
 'PREFACE',
 'A',
 'few',
 'word',
 'about',
 'Dostoevsky',
 'himself',
 'may',
 'help',
 'the',
 'English',
 'reader',
 'to',
 'understand',
 'his',
 'work',
 '.',
 'Dostoevsky',
 'wa',
 'the',
 'son',
 'of',
 'a',
 'doctor',
 '.',
 'His',
 'parent',
 'were',
 'very',
 'hard-working',
 'and',
 'deeply',
 'religious',
 'people',
 ',',
 'but',
 'so',
 'poor',
 'that',
 'they',
 'lived',
 'with',
 'their',
 'five',
 'child',
 'in',
 'only',
 'two',
 'room',
 '.',
 'The',
 'father',
 'and',
 'mother',
 'spent',
 'their',
 'evening',
 'in',
 'reading',
 'aloud',
 'to',
 'their',
 'child',
 ',',
 'generally',
 'from',
 'book',
 'of',
 'a',
 'serious',
 'character',
 '.',
 'Though',
 'always',
 'sickly',
 'and',
 'delicate',
 'Dostoevsky',
 'came',
 'out',
 'third',
 'in',
 'the',
 'final',
 'e

In [49]:
#Lab Assignment
#1. First use nltk.word_tokenize() to find the tokens of desert.txt.
with open('desert.txt') as f:
    deserttext = f.read()

deserttokens = nltk.word_tokenize(deserttext)

#2. Use NLTK’s Porter stemmer and Lancaster stemmer to stem the tokens of the desert.txt file.
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
#stem every token with each of the 2 stemmers
desertPstem = [porter.stem(t) for t in deserttokens]
# print(desertPstem[:200],'\n')
desertLstem = [lancaster.stem(t) for t in deserttokens]
# print(desertLstem[:200])

#3. Choose a number randomly between 0 and 1363 (the token list has 1364 entries).
#Here the chosen index is 60.
print("Porter Stemmer:", desertPstem[60],'\n')
print("Lancaster Stemmer:", desertLstem[60])

#Post on the course discussion forum the word from desert.txt at that location from both
#the Porter and Lancaster stemmed token lists.
#Observe whether there was no stemming on that token, the stemming is the same, or the
#stemming is different between the 2 stemmed lists.

Porter Stemmer: kilometr 

Lancaster Stemmer: kilomet
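
The last two steps of the assignment (picking a random index and classifying the result) can be sketched in plain Python. The function name `describe_stemming` is hypothetical, and the original token `"kilometres"` is an assumption, since the notebook prints only the two stems. The Lancaster stems in the other two calls are also assumed for illustration.

```python
import random

TOKEN_COUNT = 1364  # length of the desert.txt token list, as stated in the assignment

# randrange excludes the upper bound, so this yields an index from 0 to 1363.
index = random.randrange(TOKEN_COUNT)

def describe_stemming(token: str, porter_stem: str, lancaster_stem: str) -> str:
    """Classify how the two stemmers treated a single token."""
    original = token.lower()        # ignore case so that lower-casing alone is not counted
    p = porter_stem.lower()
    l = lancaster_stem.lower()
    if p == original and l == original:
        return "no stemming"
    if p == l:
        return "same stemming"
    return "different stemming"

# Stems for index 60 as printed above; the token spelling is assumed.
result_60 = describe_stemming("kilometres", "kilometr", "kilomet")
# Illustrative calls with assumed Lancaster output.
result_words = describe_stemming("words", "word", "word")
result_the = describe_stemming("the", "the", "the")

print(index, result_60, result_words, result_the)
```

In the notebook, the three arguments would come from `deserttokens[index]`, `desertPstem[index]` and `desertLstem[index]`.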
