## INVESTIGATING UNSTRUCTURED TEXT
As we saw last week, even HTML's sometimes messy and unpredictable markup can give us clues to how data may be structured. But language as a system (as we saw in Borges) also comes with its own structures. Python provides numerous methods for navigating basic linguistic patterns. Let's begin with repetition itself:

In [2]:
speech = '''Tomorrow, and tomorrow, and tomorrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
And all our yesterdays have lighted fools
The way to dusty death. Out, out, brief candle!
Life's but a walking shadow, a poor player,
That struts and frets his hour upon the stage,
And then is heard no more. It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing.'''



There are various ways to investigate Macbeth's famous, very short speech. We begin with the obvious: searching the whole speech for a single word.

In [3]:
'tomorrow' in speech

True

In [4]:
speech.find('tomorrow') #position

14

In [5]:
speech[14:14+len('tomorrow')]
#speech[14:22]


'tomorrow'

In [6]:
speech.count('tomorrow')

2

In [7]:
speech.lower().count('tomorrow')
#speech.lower().find('tomorrow') #lowercase everything first so the search finds every occurrence

3

In [8]:
speech.upper().count('A')

28

In [9]:
speech.lower().count(' a') 

9

In [10]:
speech.lower().count(' a ')

3
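The counts above are substring counts, so `' a'` also matches the start of words like "an" and "and". As a minimal sketch (the two-line `sample` and the variable names are just for illustration), splitting the text into tokens lets us count the word itself:

```python
# Two lines of the speech joined into one sample string (for illustration only).
sample = "And then is heard no more. It is a tale Told by an idiot, full of sound and fury,"

# Substring count: ' a' also matches the beginnings of "an" and "and".
substring_hits = sample.lower().count(' a')

# Token count: split on whitespace and compare whole tokens.
tokens = sample.lower().split()
word_hits = tokens.count('a')

print(substring_hits, word_hits)
```

Note that whitespace tokens keep their punctuation, so a token such as `'a,'` would not be counted as `'a'`. Regular expressions, introduced later, handle that case more gracefully.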

Of course, there is already a structure to the speech that we are ignoring--it has lines. Let's get out those lines and put them into a list.

In [11]:
lines = speech.split('\n')
#lines = speech.splitlines()  #an equivalent way to split on line breaks
lines

['Tomorrow, and tomorrow, and tomorrow,',
 'Creeps in this petty pace from day to day,',
 'To the last syllable of recorded time;',
 'And all our yesterdays have lighted fools',
 'The way to dusty death. Out, out, brief candle!',
 "Life's but a walking shadow, a poor player,",
 'That struts and frets his hour upon the stage,',
 'And then is heard no more. It is a tale',
 'Told by an idiot, full of sound and fury,',
 'Signifying nothing.']

In [12]:
firstline = lines[0]
firstline

'Tomorrow, and tomorrow, and tomorrow,'

Python has a handful of built-in ways to search a line. Here are just a few.

In [13]:
yest = firstline.replace('tomorrow','yesterday',1) #the number sets how many occurrences get replaced
yest

'Tomorrow, and yesterday, and tomorrow,'

In [14]:
yest = firstline.lower().replace('tomorrow','yesterday',2) #the number sets how many occurrences get replaced
yest

'yesterday, and yesterday, and tomorrow,'

In [15]:
firstline.startswith('tomorrow')

False

In [16]:
firstline.endswith('tomorrow')

False
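Both checks come back `False` for different reasons: the line starts with a capital "T", and it ends with a comma rather than the word itself. A small sketch of how one might normalize the line before testing (the choice of punctuation to strip is an assumption):

```python
firstline = 'Tomorrow, and tomorrow, and tomorrow,'

# Lowercasing takes care of the capital "T" at the start of the line.
starts = firstline.lower().startswith('tomorrow')

# rstrip() removes any of the listed characters from the right-hand end only.
ends = firstline.rstrip(',.;!').endswith('tomorrow')

print(starts, ends)
```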

## List comprehensions
What if we want to search through every line? The obvious way is to use a `for` loop.

In [17]:
for line in lines:
    if line.startswith('And'):
        print(line)

And all our yesterdays have lighted fools
And then is heard no more. It is a tale


That is a very simple loop, so simple that Python lets you loop through a list in a one-line statement, called a **list comprehension**.

In [18]:
[line for line in lines if line.startswith('And')]

['And all our yesterdays have lighted fools',
 'And then is heard no more. It is a tale']

In [19]:
['hello' for line in lines if line.startswith('And')]

['hello', 'hello']

In [20]:
[line for line in lines if line.startswith('T') and line.endswith(',')]

['Tomorrow, and tomorrow, and tomorrow,',
 'That struts and frets his hour upon the stage,',
 'Told by an idiot, full of sound and fury,']
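The general shape of a comprehension is `[expression for item in iterable if condition]`. A short sketch, using three lines of the speech, showing that the loop and the comprehension build the same list:

```python
lines = ['Tomorrow, and tomorrow, and tomorrow,',
         'And all our yesterdays have lighted fools',
         'And then is heard no more. It is a tale']

# Loop version: build the list step by step.
upper_loop = []
for line in lines:
    if line.startswith('And'):
        upper_loop.append(line.upper())

# Comprehension version: expression, then the loop, then the optional filter.
upper_comp = [line.upper() for line in lines if line.startswith('And')]

print(upper_loop == upper_comp)
```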

Remember this: list comprehensions will come in handy when we start using more robust ways of searching line by line (sentence by sentence, etc.). But before we jump to those special searching methods, let's take a little detour into sorting.

## Sorting!
Say we want to investigate the lines in the speech and order them from longest to shortest. We know how to get the length of each line using a loop, but how can we use those lengths to reorder our list?

In [21]:
for line in lines:
    print(len(line))

37
42
38
41
47
43
46
39
41
19


We could write a function that pairs these numbers with each line and then sorts everything, but sort functions are notoriously challenging to write. Fortunately, Python lists have a built-in `sort()` method.

In [22]:
sortlines = lines.copy()
sortlines.sort()
sortlines

['And all our yesterdays have lighted fools',
 'And then is heard no more. It is a tale',
 'Creeps in this petty pace from day to day,',
 "Life's but a walking shadow, a poor player,",
 'Signifying nothing.',
 'That struts and frets his hour upon the stage,',
 'The way to dusty death. Out, out, brief candle!',
 'To the last syllable of recorded time;',
 'Told by an idiot, full of sound and fury,',
 'Tomorrow, and tomorrow, and tomorrow,']

Not only that: Python's `lambda` lets you write a small anonymous function inline, which you can pass to the sorting method as its `key`.

In [23]:
sortlines = lines.copy()
sortlines.sort(key=lambda x: len(x), reverse=True)
#sortlines.sort(key=lambda x: x.split()[-1], reverse=True)
sortlines

['The way to dusty death. Out, out, brief candle!',
 'That struts and frets his hour upon the stage,',
 "Life's but a walking shadow, a poor player,",
 'Creeps in this petty pace from day to day,',
 'And all our yesterdays have lighted fools',
 'Told by an idiot, full of sound and fury,',
 'And then is heard no more. It is a tale',
 'To the last syllable of recorded time;',
 'Tomorrow, and tomorrow, and tomorrow,',
 'Signifying nothing.']

In [24]:
sortlines = lines.copy()
#sortlines.sort(key=lambda x: len(x), reverse=True)
sortlines.sort(key=lambda x: x.split()[-1], reverse=True)
sortlines   #reverse alphabetical order by the last word

['Tomorrow, and tomorrow, and tomorrow,',
 'To the last syllable of recorded time;',
 'And then is heard no more. It is a tale',
 'That struts and frets his hour upon the stage,',
 "Life's but a walking shadow, a poor player,",
 'Signifying nothing.',
 'Told by an idiot, full of sound and fury,',
 'And all our yesterdays have lighted fools',
 'Creeps in this petty pace from day to day,',
 'The way to dusty death. Out, out, brief candle!']
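Two variations are worth knowing. `sorted()` returns a new list instead of sorting in place, and any function can serve as the `key`, not only a `lambda`. The sketch below uses three lines of the speech; `last_word` is a hypothetical helper that also strips trailing punctuation so that "time;" sorts as "time":

```python
lines = ['To the last syllable of recorded time;',
         'Signifying nothing.',
         'And all our yesterdays have lighted fools']

# sorted() returns a new list and leaves `lines` unchanged;
# key=len does the same job as key=lambda x: len(x).
by_length = sorted(lines, key=len, reverse=True)

def last_word(line):
    """Hypothetical helper: last word, lowercased, trailing punctuation removed."""
    return line.split()[-1].rstrip('.,;!').lower()

by_last_word = sorted(lines, key=last_word)

print(by_length)
print(by_last_word)
```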

## Regular Expressions
The more you work with unstructured text, the more you will want the power that regular expressions give you. Regular expressions are a mini-language unto themselves (with many similarities across different programming languages). They allow you to search for a variety of patterns within text. The most obvious patterns you might look for are telephone numbers, ZIP Codes, and email addresses (or social security numbers and credit card numbers, for the more malicious), and many regular expressions have been written to capture these with varying levels of accuracy. Today, however, our focus will be on exploring text.

First import the built-in regular expression library `re`

In [25]:
import re

There are five main regular expression functions that we will work with:

**match()** & **search()**: these functions tell you whether or not they found a match, and where that match was located. Note that match() only looks at the very beginning of the string, so it is less often useful.

**split()** & **sub()**: these two work just like the string methods split() & replace(), but they search for patterns and return a list or a substituted string, respectively.

**findall()**: just as the name suggests, this function returns a list of all the matches found throughout the entire string.
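To make the difference between `match()` and `search()` concrete, here is a minimal sketch on the first line of the speech. Both return `None` when nothing is found, so it is worth checking the result before calling `.group()` on it:

```python
import re

firstline = 'Tomorrow, and tomorrow, and tomorrow,'

# match() only tries the pattern at position 0, so "morrow" is not found there.
at_start = re.match('morrow', firstline)

# search() scans forward and reports the first place the pattern occurs.
found = re.search('morrow', firstline)

print(at_start)
if found is not None:
    print(found.group(), found.start(), found.end())
```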

In [26]:

#found = re.match("morrow",firstline,re.IGNORECASE)
found = re.search("morrow",firstline,re.IGNORECASE)
found.group()
#found.end()

'morrow'

In [27]:

#found = re.match("morrow",firstline,re.IGNORECASE)
#found = re.search("morrow",firstline,re.IGNORECASE)
found.group()
#found.end()
found.start()

2

In [28]:
newlist = re.split("and",firstline,flags=re.IGNORECASE)
newstring = re.sub("tomorrow","yesterday",firstline,flags=re.IGNORECASE)
print(newlist,newstring)

['Tomorrow, ', ' tomorrow, ', ' tomorrow,'] yesterday, and yesterday, and yesterday,


In [29]:
words = re.findall("to",firstline,re.IGNORECASE)
words
#len(words)

['To', 'to', 'to']

In [30]:
words1 = re.findall('..morrow', firstline,re.IGNORECASE)

words1

['Tomorrow', 'tomorrow', 'tomorrow']

In [31]:
#####################

#words2 = re.findall('.o.', firstline,re.IGNORECASE)
#words = re.findall(r"to.", speech,re.IGNORECASE)
##words = re.findall(r"to\w", speech,re.IGNORECASE) #\w: "to" followed by a word character
#words = re.findall(r"to\W\w", speech,re.IGNORECASE) #\W: "to" followed by a non-word character, then a word character
words = re.findall(r".?to\S*", speech,re.IGNORECASE) #"to" plus any non-space characters after it, and optionally one character before it
words

['Tomorrow,', ' tomorrow,', ' tomorrow,', ' to', 'To', ' to', 'Told']

In [32]:
#words3 = re.findall(r"\b[A-Z]\w+\b", speech)
#words3 = re.findall(r"\b[A-Z]\w{3,5}\b", speech) #a capital letter followed by 3 to 5 word characters
words3 = re.findall(r"\b\w*s\b", speech) #\b = word boundary, so this finds words ending in "s"
words3

['Creeps',
 'this',
 'yesterdays',
 'fools',
 's',
 'struts',
 'frets',
 'his',
 'is',
 'is']

In [33]:
words4 = re.findall(r".{22}", speech) 
words4

['Tomorrow, and tomorrow',
 'Creeps in this petty p',
 'To the last syllable o',
 'And all our yesterdays',
 'The way to dusty death',
 '. Out, out, brief cand',
 "Life's but a walking s",
 'That struts and frets ',
 'his hour upon the stag',
 'And then is heard no m',
 'Told by an idiot, full']

In [34]:
new_list = re.split(r"\W+",speech) ## + means one or more; the raw string keeps \W intact
newstring = re.sub("tomorrow", "yesterday", firstline, flags=re.IGNORECASE)
new_list
speech

"Tomorrow, and tomorrow, and tomorrow,\nCreeps in this petty pace from day to day,\nTo the last syllable of recorded time;\nAnd all our yesterdays have lighted fools\nThe way to dusty death. Out, out, brief candle!\nLife's but a walking shadow, a poor player,\nThat struts and frets his hour upon the stage,\nAnd then is heard no more. It is a tale\nTold by an idiot, full of sound and fury,\nSignifying nothing."
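Only the last expression in a cell is displayed, which is why the cell above shows `speech` rather than the split list. As a small sketch of what `re.split()` with `\W+` actually returns, here it is applied to the first line alone. One thing to watch for: when the text ends in punctuation, the split leaves an empty string at the end of the list.

```python
import re

firstline = 'Tomorrow, and tomorrow, and tomorrow,'

# Split on runs of non-word characters (commas and spaces here).
pieces = re.split(r"\W+", firstline)

# The trailing comma produces an empty final piece; filter it out.
words = [piece for piece in pieces if piece]

# An alternative that avoids the empty string: collect the words directly.
same_words = re.findall(r"\w+", firstline)

print(pieces)
print(words == same_words)
```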

## Special characters
While the search functions above already go beyond Python's plain string methods, it is the pattern-matching characters that, once you get used to them, do the most powerful work.

Here's a list of the most common pattern-matching characters:

| special character | what it does |
|--------|---------|
| `.` | match any character except newline |
| `^` | match the beginning of the string |
| `$` | match the end of the string, or just before a newline at the end of the string |
| `*` | match 0 or more repetitions |
| `+` | match 1 or more repetitions |
| `?` | match 0 or 1 repetitions |
| `{m}` | match exactly m repetitions |
| `{m,n}` | match from m to n repetitions |
| `{m,}` | match at least m repetitions |
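A quick way to get a feel for the repetition characters is to try each one on the same tiny string. The `sample` string below is made up for illustration; each quantifier applies to the character just before it (here, the `o`):

```python
import re

sample = 't to too tooo'

zero_or_more = re.findall(r'to*', sample)     # "t" followed by any number of "o"
one_or_more = re.findall(r'to+', sample)      # "t" followed by at least one "o"
zero_or_one = re.findall(r'to?', sample)      # "t" followed by at most one "o"
exactly_two = re.findall(r'to{2}', sample)    # "t" followed by exactly two "o"
two_or_more = re.findall(r'to{2,}', sample)   # "t" followed by two or more "o"

# ^ anchors to the start of the string; $ also matches just before a final newline.
at_start = re.findall(r'^t', sample)
at_end = re.findall(r'o$', 'tooo\n')

for result in (zero_or_more, one_or_more, zero_or_one, exactly_two, two_or_more):
    print(result)
```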


In [35]:
all_ll = re.findall("..ll..",speech)
#re.search("^Tomorrow",firstline)
#re.search("tomorrow,$",firstline)
all_ll


['syllab', ' all o', 'full o']

In [36]:
#a list comprehension again!
#Note that match() would produce the same thing; ^ anchors the pattern to the beginning of the line
[line for line in lines if re.search("^And",line)]

['And all our yesterdays have lighted fools',
 'And then is heard no more. It is a tale']

In [37]:
[line for line in lines if re.search("y,$",line)]

['Creeps in this petty pace from day to day,',
 'Told by an idiot, full of sound and fury,']

In [38]:
th_plus = re.findall("the*..",speech)
th_plus

['this', 'the l', 'th. ', 'the s', 'then ', 'thin']

In [39]:
[line for line in lines if re.search("^.{0,20}$",line)]

['Signifying nothing.']

In [40]:
l_plus = re.findall("..l+..",speech)
l_plus

['e las',
 'syllab',
 ' all o',
 'e lig',
 'ndle!',
 'walki',
 ' play',
 'Told ',
 'full o']

In [41]:
all_length = [len(this_line) for this_line in lines]
all_length

[37, 42, 38, 41, 47, 43, 46, 39, 41, 19]

In [42]:
[line for line in lines if len(line) > 40]

['Creeps in this petty pace from day to day,',
 'And all our yesterdays have lighted fools',
 'The way to dusty death. Out, out, brief candle!',
 "Life's but a walking shadow, a poor player,",
 'That struts and frets his hour upon the stage,',
 'Told by an idiot, full of sound and fury,']

In [43]:
l_plus = re.findall(".or?",speech)
l_plus

['To',
 'mor',
 'ro',
 'to',
 'mor',
 'ro',
 'to',
 'mor',
 'ro',
 'ro',
 'to',
 'To',
 ' o',
 'cor',
 ' o',
 'fo',
 'to',
 ' o',
 'do',
 'po',
 'ho',
 'po',
 'no',
 'mor',
 'To',
 'io',
 ' o',
 'so',
 'no']

In [44]:
[line for line in lines if re.search("ca",line)]

['The way to dusty death. Out, out, brief candle!']

In [45]:
[line for line in lines if re.search("[.,!]$",line)]

['Tomorrow, and tomorrow, and tomorrow,',
 'Creeps in this petty pace from day to day,',
 'The way to dusty death. Out, out, brief candle!',
 "Life's but a walking shadow, a poor player,",
 'That struts and frets his hour upon the stage,',
 'Told by an idiot, full of sound and fury,',
 'Signifying nothing.']

In [46]:
[len(re.findall(r"\bday\b",line)) for line in lines] #how many times "day" appears in each line

[0, 2, 0, 0, 0, 0, 0, 0, 0, 0]

In [47]:
[len(re.findall(r"\bd\w+y\b",line)) for line in lines] #words that start with "d" and end with "y"

[0, 2, 0, 0, 1, 0, 0, 0, 0, 0]

In [48]:
[line for line in lines if re.search("ca",line) is None]

['Tomorrow, and tomorrow, and tomorrow,',
 'Creeps in this petty pace from day to day,',
 'To the last syllable of recorded time;',
 'And all our yesterdays have lighted fools',
 "Life's but a walking shadow, a poor player,",
 'That struts and frets his hour upon the stage,',
 'And then is heard no more. It is a tale',
 'Told by an idiot, full of sound and fury,',
 'Signifying nothing.']

In [49]:
o_2 = re.findall("..o{2}..",speech)
o_2

[' fools', ' poor ']

## Sets and Groups
**Sets**, which include `[]` as well as shortcuts like `\w`, allow you to search for certain types of characters. **Groups**, which are demarcated by `()`, allow you to specify important sub-patterns that you can access individually.

| enclosures | what it does |
|--------|---------|
| `[]` | A defined set of characters to search for |
| `()` | A group of characters to search for, can be accessed individually in the results. |


| Examples of sets | what it does |
|--------|---------|
| `[aeiou]` | Find any vowel |
| `[Tt]` | Find a lowercase or uppercase t |
| `[0-9]` | Find any digit, there is a shortcut for this |
| `[^0-9]` | Find anything that's not a digit, there is a shortcut for this |
| `[13579]` | Find any odd digit |
| `[A-Za-z]` | Find any letter, there is a shortcut for this too |
| `[+.*]` | Find those actual characters, special characters are canceled in sets (not including shortcuts: see below) |


| Shortcut | what it does |
|--------|---------|
| `\b` | Word boundary: the zero-width position at the beginning or end of a word (next to a space, comma, end of line, etc.) |
| `\B` | Not a word boundary |
| `\d` | digits [0-9] |
| `\D` | not digits |
| `\s` | whitespace characters: space, tab, newline... |
| `\S` | not whitespace |
| `\w` | word characters: letters, digits, and underscore |
| `\W` | not word characters |
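The shortcuts are easiest to compare side by side. In the sketch below, `sample` is a made-up string (with an underscore added on purpose) to show that `\w` covers digits and underscores as well as letters, while `[A-Za-z]` covers letters only:

```python
import re

# Made-up sample mixing letters, digits, an underscore, and an apostrophe.
sample = "Act 5, scene_5: Life's but a walking shadow"

word_chars = re.findall(r"\w+", sample)        # letters, digits, and underscore
letters_only = re.findall(r"[A-Za-z]+", sample)
digits = re.findall(r"\d", sample)
s_words = re.findall(r"\b[Ss]\w*", sample)     # words beginning with "s" or "S"

print(word_chars)
print(letters_only)
print(digits)
print(s_words)
```

Notice that the apostrophe in "Life's" counts as a non-word character, so `\b` finds a word boundary on either side of it.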


In [50]:
words = re.findall(r"\b[CcBb]\w+",speech)
words

['Creeps', 'brief', 'candle', 'but', 'by']

In [51]:
words = re.findall(r"[tT]\w+",speech)
#words = re.findall(r"([tT]\w+)",line)
words

['Tomorrow',
 'tomorrow',
 'tomorrow',
 'this',
 'tty',
 'to',
 'To',
 'the',
 'time',
 'terdays',
 'ted',
 'The',
 'to',
 'ty',
 'th',
 'That',
 'truts',
 'ts',
 'the',
 'tage',
 'then',
 'tale',
 'Told',
 'thing']

Looking for phrases

In [52]:
phrases = re.findall(r"(?=(\b\w{2}\W+\w+\W+\w+))",speech)
#phrases = re.findall(r"(\b\w{2}) (\w+) (\w+)",speech)
phrases

['in this petty',
 'to day,\nTo',
 'To the last',
 'of recorded time',
 'to dusty death',
 'is heard no',
 'no more. It',
 'It is a',
 'is a tale',
 'by an idiot',
 'an idiot, full',
 'of sound and']
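The `(?=( ... ))` wrapper in the pattern above is a lookahead containing a group. A lookahead checks what follows without consuming any text, so the next search can begin inside the previous phrase and the results may overlap. A minimal sketch with two-word phrases from one line of the speech:

```python
import re

text = 'It is a tale'

# Plain pattern: each match consumes its text, so "is a" is skipped.
plain = re.findall(r"\b\w+\W+\w+", text)

# Lookahead with a group: nothing is consumed, so overlapping phrases are found.
overlapping = re.findall(r"(?=(\b\w+\W+\w+))", text)

print(plain)
print(overlapping)
```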

Searching a longer poem

In [53]:
with open('/Users/Jon/Documents/columbia_syllabus/wasteland.txt', 'r') as f:  #this path is specific to one machine; point it at your own copy
    wasteland = f.read()

FileNotFoundError: [Errno 2] No such file or directory: '/Users/Jon/Documents/columbia_syllabus/wasteland.txt'

In [None]:
poemlines = wasteland.split('\n')

In [None]:

[line for line in poemlines if re.search("win.", line)]


Searching a whole play

In [None]:
with open('/Users/Jon/Documents/columbia_syllabus/hamlet.txt', 'r') as f:  #this path is specific to one machine; point it at your own copy
    play = f.read()


In [None]:
type(play)

In [None]:
play[:500]

In [None]:
all_chars = re.findall(r"[\n]([A-Z]+)[\n]",play)
all_chars
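The pattern above looks for a run of capital letters alone on a line, which is how speaker names often appear in plain-text plays. Since the file is not available here, the sketch below uses a tiny stand-in string. Its layout (a name on its own line, then the speech) is an assumption about how the real file is formatted.

```python
import re
from collections import Counter

# A tiny stand-in for the play text; the real file's layout is an assumption.
sample = ("\nHAMLET\nTo be, or not to be\n"
          "\nOPHELIA\nGood my lord\n"
          "\nHAMLET\nI humbly thank you\n")

# Capture any run of capital letters that sits alone between two newlines.
names = re.findall(r"\n([A-Z]+)\n", sample)

# Count how many times each name appears.
speech_counts = Counter(names)

print(names)
print(speech_counts.most_common())
```

Because each match consumes the newline that follows the name, two names on consecutive lines would not both be found. A lookahead such as `(?=\n)` in place of the final `\n` is one way around that.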