## INVESTIGATING UNSTRUCTURED TEXT
As we saw last week, even the sometimes messy and unpredictable markup of HTML can give us clues to how data may be structured. But language as a system (as we saw in Borges) also comes with its own structures. Python provides numerous methods for navigating basic linguistic patterns. Let's begin with repetition itself:

In [1]:
speech = '''Tomorrow, and tomorrow, and tomorrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
And all our yesterdays have lighted fools
The way to dusty death. Out, out, brief candle!
Life's but a walking shadow, a poor player,
That struts and frets his hour upon the stage,
And then is heard no more. It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing.'''

There are various ways to investigate Macbeth's famous, very short speech. We begin with the obvious: searching through the whole speech.

In [2]:
'tomorrow' in speech

True

In [3]:
speech.find('tomorrow')

14

@ find() returns the position (index) where the first match starts in the speech, or -1 if there is no match

In [5]:
speech.find('Tomorrow')

0

In [9]:
speech[14:]
#speech[14:22]

"tomorrow, and tomorrow,\nCreeps in this petty pace from day to day,\nTo the last syllable of recorded time;\nAnd all our yesterdays have lighted fools\nThe way to dusty death. Out, out, brief candle!\nLife's but a walking shadow, a poor player,\nThat struts and frets his hour upon the stage,\nAnd then is heard no more. It is a tale\nTold by an idiot, full of sound and fury,\nSignifying nothing."

In [6]:
speech.count('tomorrow')

2

In [13]:
speech.lower().count('tomorrow')
#speech.lower().find('tomorrow')

3

@ We make everything lowercase first so the count ignores case, which is why the capitalized 'Tomorrow' is now included

In [26]:
speech.upper().count('A')

28

In [27]:
speech.upper().count('a')

0

In [28]:
speech.lower().count(' a')

9

@ words that start with a and come after a space (words at the very start of a line, like 'And', follow a \n and are missed)

In [31]:
speech.lower().count(' a ')

3

@ occurrences of the word a

In [32]:
speech.lower().count(' player ')

0

@ because it has a comma after it
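
Padding the search term with spaces stops working as soon as punctuation or a line break touches the word. One possible workaround, sketched here with string methods only, is to split the text into words and strip the punctuation from each one before counting. The two-line `sample` and the helper name `count_word` are illustrative assumptions, not part of the notebook.

```python
import string

# Two lines from the speech, kept short for the example
sample = "Life's but a walking shadow, a poor player,\nThat struts and frets his hour upon the stage,"

def count_word(text, word):
    """Count whole-word occurrences, ignoring case and surrounding punctuation."""
    # split() with no argument splits on any whitespace, including '\n'
    cleaned = [w.strip(string.punctuation).lower() for w in text.split()]
    return cleaned.count(word.lower())

print(sample.lower().count(' player '))  # 0, the comma gets in the way
print(count_word(sample, 'player'))      # 1
print(count_word(sample, 'a'))           # 2
```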

Of course, there is already a structure to the speech that we are ignoring--it has lines. Let's get out those lines and put them into a list.

In [33]:
lines = speech.split('\n')
#lines = speech.splitlines()
lines

['Tomorrow, and tomorrow, and tomorrow,',
 'Creeps in this petty pace from day to day,',
 'To the last syllable of recorded time;',
 'And all our yesterdays have lighted fools',
 'The way to dusty death. Out, out, brief candle!',
 "Life's but a walking shadow, a poor player,",
 'That struts and frets his hour upon the stage,',
 'And then is heard no more. It is a tale',
 'Told by an idiot, full of sound and fury,',
 'Signifying nothing.']

@ to analyse it by line, we split it by \n

In [34]:
# periods = speech.split('.')
# #lines = speech.splitlines()
# periods

@ for prose such as a novel, we could split on periods instead to analyse it sentence by sentence

In [35]:
firstline = lines[0]
firstline

'Tomorrow, and tomorrow, and tomorrow,'

Python has a handful of built-in ways to search a line. Here are just a few.

In [36]:
yest = firstline.replace('tomorrow','yesterday',1)
yest

'Tomorrow, and yesterday, and tomorrow,'

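@ the third argument limits the number of replacements: 1 replaces only the first match, 2 would replace the first two. If we leave the number out, every lowercase 'tomorrow' is replaced and we get 'Tomorrow, and yesterday, and yesterday,' (the capitalized 'Tomorrow' never matches)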
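
To see the effect of the optional count argument side by side, here is a small sketch. It redefines `firstline` so that it runs on its own.

```python
firstline = 'Tomorrow, and tomorrow, and tomorrow,'

once = firstline.replace('tomorrow', 'yesterday', 1)  # first match only
every = firstline.replace('tomorrow', 'yesterday')    # no limit

print(once)   # Tomorrow, and yesterday, and tomorrow,
print(every)  # Tomorrow, and yesterday, and yesterday,

# replace() is case-sensitive and returns a new string;
# the original is never changed
print(firstline)  # Tomorrow, and tomorrow, and tomorrow,
```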

In [37]:
firstline.startswith('tomorrow')

False

@ because there is a capital T

In [40]:
firstline.endswith('tomorrow')

False

@ because there is a comma
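
Both `False` results come from details (a capital letter, a trailing comma) that we may not care about. A minimal sketch of how to normalize the line before testing it, using only standard string methods:

```python
firstline = 'Tomorrow, and tomorrow, and tomorrow,'

# Lowercase first so the capital T no longer matters
starts = firstline.lower().startswith('tomorrow')

# Strip trailing punctuation before testing the ending
ends = firstline.rstrip(',.;!').endswith('tomorrow')

# Both methods also accept a tuple of alternatives
either = firstline.startswith(('tomorrow', 'Tomorrow'))

print(starts, ends, either)  # True True True
```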

## List comprehensions
What if we want to search through every line? The obvious way is using a `for` loop.

In [41]:
for line in lines:
    if line.startswith('And'):
        print(line)

And all our yesterdays have lighted fools
And then is heard no more. It is a tale


That is a very simple loop, so simple that Python has a way of looping through a list in a one-line statement, called a **list comprehension**:

In [42]:
[line for line in lines if line.startswith('And')]

['And all our yesterdays have lighted fools',
 'And then is heard no more. It is a tale']

@ the comprehension not only fits on one line, it also returns the results as a list.
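
To make the equivalence concrete, the sketch below builds the same list twice, once with an explicit loop and once with a comprehension. It uses three lines of the speech so that it runs on its own; the variable names are illustrative.

```python
lines = [
    'Tomorrow, and tomorrow, and tomorrow,',
    'And all our yesterdays have lighted fools',
    'And then is heard no more. It is a tale',
]

# The explicit loop: build up a list one match at a time
and_lines_loop = []
for line in lines:
    if line.startswith('And'):
        and_lines_loop.append(line)

# The comprehension: same filter, same result, one expression
and_lines = [line for line in lines if line.startswith('And')]

# The first part of a comprehension can also transform each item
and_lengths = [len(line) for line in lines if line.startswith('And')]

print(and_lines == and_lines_loop)  # True
print(and_lengths)                  # [41, 39]
```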

In [43]:
[line for line in lines if line.startswith('T')]

['Tomorrow, and tomorrow, and tomorrow,',
 'To the last syllable of recorded time;',
 'The way to dusty death. Out, out, brief candle!',
 'That struts and frets his hour upon the stage,',
 'Told by an idiot, full of sound and fury,']

In [44]:
myNewList = [line for line in lines if line.startswith('T')]
myNewList

['Tomorrow, and tomorrow, and tomorrow,',
 'To the last syllable of recorded time;',
 'The way to dusty death. Out, out, brief candle!',
 'That struts and frets his hour upon the stage,',
 'Told by an idiot, full of sound and fury,']

In [45]:
myNewNewList = [line for line in lines if line.startswith('T') and line.endswith(',')]
myNewNewList

['Tomorrow, and tomorrow, and tomorrow,',
 'That struts and frets his hour upon the stage,',
 'Told by an idiot, full of sound and fury,']

Remember these: when we start using more robust ways of searching line by line (sentence by sentence, etc.), they will come in handy. But before we jump to those special searching methods, let's take a little detour on sorting.

## Sorting!
Say we want to investigate the lines in the speech and order them from longest to shortest. We know how to get the length of each line using a loop, but how can we use those lengths to reorder our list?

In [46]:
for line in lines:
    print(len(line))

37
42
38
41
47
43
46
39
41
19


@ with these lengths we can sort the lines, e.g. from the shortest line to the longest.

We could write a function that pairs these numbers with each line and then sorts through everything--but sort functions are notoriously challenging to write, and Python has a built-in sorting function.

In [48]:
sortlines = lines.copy()
sortlines.sort()
sortlines

['And all our yesterdays have lighted fools',
 'And then is heard no more. It is a tale',
 'Creeps in this petty pace from day to day,',
 "Life's but a walking shadow, a poor player,",
 'Signifying nothing.',
 'That struts and frets his hour upon the stage,',
 'The way to dusty death. Out, out, brief candle!',
 'To the last syllable of recorded time;',
 'Told by an idiot, full of sound and fury,',
 'Tomorrow, and tomorrow, and tomorrow,']

@ sort() reorders a list in place, so we call copy() first and sort the copy. That way the original list is not overwritten.
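
An alternative to copying and then sorting is the built-in `sorted()` function, which returns a new list and leaves the original alone. A minimal sketch, using shortened lines so that it runs on its own:

```python
lines = ['Tomorrow, and tomorrow,', 'Creeps in this petty pace', 'And all our yesterdays']

# sorted() returns a new list; no copy() is needed
alphabetical = sorted(lines)

print(alphabetical)
# ['And all our yesterdays', 'Creeps in this petty pace', 'Tomorrow, and tomorrow,']
print(lines[0])  # Tomorrow, and tomorrow,  (original order is intact)
```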

But not only that, Python has a built-in way to write small, unnamed functions, called `lambda`, that you can nest inside a sorting function.

In [49]:
sortlines = lines.copy()
sortlines.sort(key=lambda line: len(line), reverse=True)
#sortlines.sort(key=lambda line: line.split()[-1], reverse=True)
sortlines

['The way to dusty death. Out, out, brief candle!',
 'That struts and frets his hour upon the stage,',
 "Life's but a walking shadow, a poor player,",
 'Creeps in this petty pace from day to day,',
 'And all our yesterdays have lighted fools',
 'Told by an idiot, full of sound and fury,',
 'And then is heard no more. It is a tale',
 'To the last syllable of recorded time;',
 'Tomorrow, and tomorrow, and tomorrow,',
 'Signifying nothing.']

@ lambda is a way of defining a small function inline. Here it returns the length of each line, and reverse=True puts the longest line first.

In [51]:
sortlines = lines.copy()
#sortlines.sort(key=lambda line: len(line), reverse=True)
sortlines.sort(key=lambda line: line.split()[-1], reverse=True)
sortlines

['Tomorrow, and tomorrow, and tomorrow,',
 'To the last syllable of recorded time;',
 'And then is heard no more. It is a tale',
 'That struts and frets his hour upon the stage,',
 "Life's but a walking shadow, a poor player,",
 'Signifying nothing.',
 'Told by an idiot, full of sound and fury,',
 'And all our yesterdays have lighted fools',
 'Creeps in this petty pace from day to day,',
 'The way to dusty death. Out, out, brief candle!']

@ sorted reverse-alphabetically by last word
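
A `lambda` is shorthand for an ordinary function, so any key can also be written with `def`. The sketch below does both, and additionally strips punctuation and case from the last word so that they do not influence the order. The helper name `last_word` and the three-line sample are assumptions made for illustration.

```python
import string

lines = [
    'To the last syllable of recorded time;',
    'And all our yesterdays have lighted fools',
    'Signifying nothing.',
]

def last_word(line):
    """Return the final word of a line, lowercased, without punctuation."""
    return line.split()[-1].strip(string.punctuation).lower()

# key= accepts any function that takes one item and returns the value to sort by
by_last_word = sorted(lines, key=last_word)

# The same key written inline as a lambda
by_last_word_lambda = sorted(
    lines, key=lambda line: line.split()[-1].strip(string.punctuation).lower()
)

print([last_word(line) for line in by_last_word])  # ['fools', 'nothing', 'time']
```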

## Regular Expressions
The more you work with unstructured text, the more you will want the power that regular expressions give you. Regular expressions are a mini-language unto themselves (often sharing similarities across different programming languages). They allow you to search for a variety of patterns within text. The most obvious **patterns** you might look for are telephone numbers, ZIP Codes, email addresses (social security numbers and credit card numbers for the more malicious)--and many regular expressions have been written to capture these with varying levels of accuracy. Today, however, our focus will be on exploring text.

First import the built-in regular expression library `re`

In [52]:
import re

@ with regular expressions we are really asking "does this pattern exist anywhere in here?"

There are five main regular expression functions that we will work with:

**match()** & **search()**: these functions tell you whether or not they found a match, and where that match was located--although match() only looks at the very beginning of the string--so it is rarely useful.

**split()** & **sub()**: these two work just like split() & replace(), but they search for patterns and return a list or a substituted string, respectively.

**findall()**: just as the name sounds, this method returns a list of matching patterns that were found throughout the entire string.
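
The difference between `match()` and `search()` is easiest to see on a single line. Both return `None` when nothing is found, so it is worth testing the result before calling `.group()` on it. A minimal sketch:

```python
import re

firstline = 'Tomorrow, and tomorrow, and tomorrow,'

# match() only succeeds if the pattern fits at position 0
at_start = re.match('morrow', firstline)
print(at_start)  # None

# search() scans the whole string
found = re.search('morrow', firstline)

# Guard against None before asking for details of the match
if found:
    print(found.group(), found.start(), found.end())  # morrow 2 8
```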

In [53]:
#found = re.match("morrow",firstline,re.IGNORECASE)
found = re.search("morrow",firstline,re.IGNORECASE)
found.group()
#found.end()

'morrow'

In [56]:
# found = re.match("morrow",firstline,re.IGNORECASE)
found = re.search("morrow",firstline,re.IGNORECASE)
# found.group()
found.start()

2

@ search() works as a yes/no test (it returns a match or None), which makes it useful as the condition in a list comprehension

In [57]:
newlist = re.split("and",firstline,flags=re.IGNORECASE)
newstring = re.sub("tomorrow","yesterday",firstline,flags=re.IGNORECASE)
print(newlist,newstring)

['Tomorrow, ', ' tomorrow, ', ' tomorrow,'] yesterday, and yesterday, and yesterday,


In [58]:
words = re.findall("to",firstline,re.IGNORECASE)
words

['To', 'to', 'to']

In [60]:
words = re.findall("morrow",firstline,re.IGNORECASE)
words

['morrow', 'morrow', 'morrow']

In [61]:
words = re.findall("..morrow",firstline,re.IGNORECASE)
words

['Tomorrow', 'tomorrow', 'tomorrow']

@ each . stands for any one character, so this finds 'morrow' together with the two characters before it.

In [63]:
words = re.findall(".o.",firstline,re.IGNORECASE)
words

['Tom', 'row', 'tom', 'row', 'tom', 'row']

@ give me three characters with an o in the middle

@ findall() gives us all the (non-overlapping) results

## Special characters
While the search functions above are more flexible than Python's plain string methods, it is the pattern-seeking characters that--once you get used to them--do the most powerful work.

Here's a list of the most common pattern-seeking characters:

| special character | what it does |
|--------|---------|
| `.` | match any character except newline |
| `^` | match the beginning of the string |
| `$` | match the end of the string, or just before a `\n` at the very end |
| `*` | match 0 or more repetitions of the preceding item |
| `+` | match 1 or more repetitions of the preceding item |
| `?` | match 0 or 1 repetitions of the preceding item |
| `{m}` | match exactly m repetitions |
| `{m,n}` | match between m and n repetitions |
| `{m,}` | match at least m repetitions |
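
Before trying these on the speech, it can help to see each repetition character on a tiny, made-up string where the only thing that varies is the number of a's. The string `text` below is purely an illustration.

```python
import re

# A made-up test string, chosen so each quantifier gives a different answer
text = 'ct cat caat caaat'

print(re.findall('ca*t', text))      # ['ct', 'cat', 'caat', 'caaat']
print(re.findall('ca+t', text))      # ['cat', 'caat', 'caaat']
print(re.findall('ca?t', text))      # ['ct', 'cat']
print(re.findall('ca{2}t', text))    # ['caat']
print(re.findall('ca{1,2}t', text))  # ['cat', 'caat']
print(re.findall('ca{2,}t', text))   # ['caat', 'caaat']

# ^ and $ anchor the pattern to the start and end of the string
print(re.findall('^c.', text))   # ['ct']
print(re.findall('..t$', text))  # ['aat']
```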


In [59]:
words = re.findall("to",firstline,re.IGNORECASE)
len(words)

3

In [65]:
all_ll = re.findall("..ll",speech)
#re.search("^Tomorrow",firstline)
#re.search("tomorrow,$",firstline)
all_ll


['syll', ' all', 'full']

@ search for any two characters followed by ll

In [66]:
all_ll = re.findall("..ll..",speech)
#re.search("^Tomorrow",firstline)
#re.search("tomorrow,$",firstline)
all_ll

['syllab', ' all o', 'full o']

@ if I put .. after the ll as well, I also get the two characters that come after it.

In [67]:
#a list comprehension again!
#Note that match() would produce the same thing
[line for line in lines if re.search("^And",line)]

['And all our yesterdays have lighted fools',
 'And then is heard no more. It is a tale']

@ same as "starts with"

In [None]:
[line for line in lines if re.search(",$",line)]

@ $ is end of line

In [68]:
[line for line in lines if re.search("y,$",line)]

['Creeps in this petty pace from day to day,',
 'Told by an idiot, full of sound and fury,']

@ lines that end with y and a comma

In [70]:
th_plus = re.findall("the*..",speech)
th_plus

['this', 'the l', 'th. ', 'the s', 'then ', 'thin']

@ * matches zero or more of the preceding character, so `the*` matches 'th', 'the', 'thee'...

In [72]:
th_plus = re.findall("mo*..",speech)
th_plus

['morr', 'morr', 'morr', 'm d', 'me;', 'more']

In [69]:
l_plus = re.findall("..l+..",speech)
l_plus

['e las',
 'syllab',
 ' all o',
 'e lig',
 'ndle!',
 'walki',
 ' play',
 'Told ',
 'full o']

@ + needs at least one l, so this matches both a single l and a double ll

In [73]:
l_plus = re.findall(".or?",speech)
l_plus

['To',
 'mor',
 'ro',
 'to',
 'mor',
 'ro',
 'to',
 'mor',
 'ro',
 'ro',
 'to',
 'To',
 ' o',
 'cor',
 ' o',
 'fo',
 'to',
 ' o',
 'do',
 'po',
 'ho',
 'po',
 'no',
 'mor',
 'To',
 'io',
 ' o',
 'so',
 'no']

@ the ? means 0 or 1: the r is optional

In [74]:
o_2 = re.findall("..o{2}..",speech)
o_2

[' fools', ' poor ']

@ {2} specifies exactly how many repetitions we are looking for: here, two o's in a row

## Sets and Groups
**Sets**, which include `[]` and shortcuts like `\w`, allow you to search for certain types of characters. **Groups**, which are demarcated by `()`, allow you to specify important sub-patterns that you can access individually.

| enclosures | what it does |
|--------|---------|
| `[]` | A defined set of characters to search for |
| `()` | A group of characters to search for, can be accessed individually in the results. |


| Examples of sets | what it does |
|--------|---------|
| `[aeiou]` | Find any lowercase vowel |
| `[Tt]` | Find a lowercase or uppercase t |
| `[0-9]` | Find any digit, there is a shortcut for this |
| `[^0-9]` | Find anything that's not a digit, there is a shortcut for this |
| `[13579]` | Find any odd digit |
| `[A-Za-z]` | Find any letter, there is a shortcut for this too |
| `[+.*]` | Find those actual characters; special characters lose their special meaning inside sets (not including shortcuts: see below) |


| Shortcut | what it does |
|--------|---------|
| `\b` | Word boundary: the position where a word begins or ends (next to a space, comma, end of line...); it matches no character itself |
| `\B` | Not a word boundary |
| `\d` | digits [0-9] |
| `\D` | not digits |
| `\s` | whitespace characters: space, tab, newline... |
| `\S` | not whitespace |
| `\w` | word characters: letters, digits, and underscore |
| `\W` | not word characters |
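
Here is a quick sketch of the shortcuts on a made-up string that contains letters, digits, an underscore, and punctuation. The string `text` is an assumption chosen to exercise each shortcut; it does not come from the speech.

```python
import re

# A made-up string with letters, digits, an underscore, and punctuation
text = "Act 5, scene_5: Out, out!"

print(re.findall(r"\w+", text))  # ['Act', '5', 'scene_5', 'Out', 'out']
print(re.findall(r"\d", text))   # ['5', '5']
print(re.findall(r"\W+", text))  # [' ', ', ', ': ', ', ', '!']
print(re.findall(r"\s", text))   # [' ', ' ', ' ', ' ']

# \b matches a position, not a character, so the comma and '!' are not returned
print(re.findall(r"\bout\b", text, re.IGNORECASE))  # ['Out', 'out']
```

The raw-string prefix `r"..."` keeps Python from interpreting the backslashes before `re` sees them, which matters most for `\b`.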


In [75]:
all_ll = re.findall(r"\bi[sn]\b",speech)
#it's a set and a word boundary. 
all_ll

['in', 'is', 'is']

In [76]:
all_ll = re.findall(r".[iI][stn].",speech)
#two sets between wildcards; no word boundary this time
all_ll

[' in ', 'his ', 'king', 'his ', ' is ', ' It ', 'ying', 'hing']

In [None]:
words = re.findall(r"\b[CcBb]\w+",speech)
words

In [None]:
words = re.findall(r"[tT]\w+",speech)
#words = re.findall(r"([tT]\w+)",speech)
words

Looking for phrases

In [None]:
phrases = re.findall(r"(?=(\b\w{2}\W+\w+\W+\w+))",speech)
#phrases = re.findall(r"(\b\w{2}) (\w+) (\w+)",speech)
phrases
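
The `(?=(...))` wrapper in the pattern above is a lookahead: it checks for the phrase at each position without consuming any text, so phrases that overlap are all reported. The sketch below compares the pattern with and without the lookahead on one made-up line (`line` is an illustrative string adapted from the speech).

```python
import re

line = 'It is a tale told by an idiot'

# Without the lookahead, each match consumes its text,
# so the next search starts after the previous phrase
plain = re.findall(r"\b\w{2}\W+\w+\W+\w+", line)

# With the lookahead, every starting position is tested,
# and the inner group captures the phrase found there
overlapping = re.findall(r"(?=(\b\w{2}\W+\w+\W+\w+))", line)

print(plain)        # ['It is a', 'by an idiot']
print(overlapping)  # ['It is a', 'is a tale', 'by an idiot']
```

In both versions, `\b\w{2}\W+` restricts the phrase to ones that begin with a two-letter word.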

Searching a longer poem

In [None]:
with open('/Users/Jon/Documents/columbia_syllabus/wasteland.txt', 'r') as f:
    wasteland = f.read()

In [None]:
poemlines = wasteland.split('\n')

In [None]:

[line for line in poemlines if re.search("win.", line)]


Searching whole play

In [None]:
with open('/Users/Jon/Documents/columbia_syllabus/hamlet.txt', 'r') as f:
    play = f.read()


In [None]:
type(play)

In [None]:
play[:500]

In [None]:
all_chars = re.findall(r"[\n]([A-Z]+)[\n]",play)
all_chars
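
Since the pattern above depends on how the play file is laid out, here is a sketch on a made-up excerpt in which each speaker name sits alone on a line in capitals. The layout of the real `hamlet.txt` may differ, so treat `excerpt` as an assumption. The sketch also shows an alternative using `re.MULTILINE`, which lets `^` and `$` match at every line break.

```python
import re

# A made-up excerpt; the real file may be formatted differently
excerpt = "\nBERNARDO\nWho's there?\nFRANCISCO\nNay, answer me.\nBERNARDO\nLong live the king!\n"

# Pattern from the cell above: a run of capitals between two newlines
names = re.findall(r"[\n]([A-Z]+)[\n]", excerpt)

# Alternative: anchor to line starts and ends instead of consuming the newlines
names_multiline = re.findall(r"^[A-Z]+$", excerpt, re.MULTILINE)

print(names)            # ['BERNARDO', 'FRANCISCO', 'BERNARDO']
print(names_multiline)  # ['BERNARDO', 'FRANCISCO', 'BERNARDO']

# set() removes the repeats if we only want the cast list
print(sorted(set(names)))  # ['BERNARDO', 'FRANCISCO']
```

Neither pattern matches names that contain spaces or punctuation, so a real cast list would probably need a wider set than `[A-Z]`.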