# Python for Poets

This Jupyter Notebook is inspired by Kenneth W. Church's [Unix for Poets](https://www.cs.upc.edu/~padro/Unixforpoets.pdf). From that chapter itself:

- "many researchers have more data than they know what to do with"
- "Many researchers believe that they don’t have sufficient computing resources to do these things for themselves."
- "This chapter will describe a set of simple Unix-based (**Python in our case**) tools that should
be more than adequate for counting trigrams on a corpus the size of the Brown Corpus"
- "this chapter will focus on examples and avoid definitions whenever possible"

The code was developed with Python 3.6, written in PyCharm, and tested on Colab. All snippets can be run on any machine with Python 3.6 (or higher) installed, or online as a Jupyter notebook.

Note that many of these exercises would indeed be simpler as one-liners on the Unix/Linux command line!

## 1. Exercise 1: Count words in a text

From Church: "The problem is to input a text file, say Genesis (a good place to start), and output a list of words in the file along with their frequency counts."

Church's Unix algorithm has three steps (tokenize, sort, count the duplicates). In Python, two steps are enough, because counting does not require sorting first:

1. Tokenize the text into a sequence of words (_re_),
2. Count the words (with a _dictionary_ or with _Counter_)

Church implements the algorithm in just three lines of Unix code. The Python version is almost as short:

In [1]:
import re   

In [3]:
with open("genesis.txt", 'r') as input:
    txt = input.read()

# Apply a regular expression to the string txt and look for all occurrences of the given pattern
tokens = re.findall('[A-Za-z]+', txt)


In [4]:
# Option 1: using a dictionary
d = {}
for tok in tokens:
    if tok not in d:
        d[tok] = 0
    d[tok] += 1
print(d)

{'THE': 1, 'BOOK': 1, 'OF': 1, 'GENESIS': 1, 'Chapter': 50, 'In': 17, 'the': 2333, 'beginning': 5, 'when': 90, 'God': 227, 'created': 10, 'heavens': 8, 'and': 1822, 'earth': 105, 'was': 310, 'a': 348, 'formless': 1, 'void': 1, 'darkness': 5, 'covered': 5, 'face': 35, 'of': 1267, 'deep': 8, 'while': 15, 'wind': 5, 'from': 209, 'swept': 1, 'over': 40, 'waters': 32, 'Then': 164, 'said': 449, 'Let': 35, 'there': 144, 'be': 200, 'light': 11, 'And': 196, 'saw': 68, 'that': 456, 'good': 39, 'separated': 5, 'called': 54, 'Day': 1, 'he': 537, 'Night': 1, 'evening': 14, 'morning': 20, 'first': 17, 'day': 56, 'dome': 9, 'in': 531, 'midst': 3, 'let': 64, 'it': 223, 'separate': 3, 'So': 139, 'made': 63, 'were': 152, 'under': 17, 'above': 6, 'so': 97, 'Sky': 1, 'second': 13, 'sky': 5, 'gathered': 14, 'together': 17, 'into': 52, 'one': 124, 'place': 55, 'dry': 4, 'land': 195, 'appear': 1, 'Earth': 1, 'Seas': 1, 'put': 46, 'forth': 10, 'vegetation': 2, 'plants': 4, 'yielding': 3, 'seed': 10, 'fruit': 

In [6]:
# Option 2: using a counter
from collections import Counter
c = Counter(tokens)
print(c)

print(c['the'])

Counter({'the': 2333, 'and': 1822, 'of': 1267, 'to': 1064, 'you': 613, 'his': 550, 'he': 537, 'in': 531, 'that': 456, 'said': 449, 'I': 442, 'him': 400, 'a': 348, 'your': 334, 'my': 323, 'with': 320, 'was': 310, 'for': 307, 'me': 288, 'is': 271, 's': 234, 'will': 230, 'God': 227, 'father': 227, 'had': 225, 'it': 223, 'all': 222, 'them': 221, 'they': 213, 'not': 213, 'from': 209, 'be': 200, 'And': 196, 'shall': 196, 'land': 195, 'on': 194, 'have': 193, 'Jacob': 188, 'are': 177, 'LORD': 166, 'Then': 164, 'her': 161, 'Joseph': 157, 'as': 156, 'son': 156, 'were': 152, 'there': 144, 'their': 143, 'The': 140, 'So': 139, 'Abraham': 134, 'she': 132, 'one': 124, 'sons': 116, 'out': 113, 'years': 112, 'this': 108, 'earth': 105, 'who': 104, 'When': 104, 'up': 103, 'went': 103, 'came': 98, 'so': 97, 'wife': 96, 'He': 96, 'us': 94, 'Pharaoh': 94, 'when': 90, 'has': 89, 'brother': 88, 'man': 84, 'Isaac': 82, 'Esau': 80, 'But': 79, 'Egypt': 79, 'we': 79, 'by': 75, 'do': 72, 'every': 71, 'but': 69, 'b

- A comment in Python starts with #. The interpreter ignores everything after a # on that line.
- Python ships with many official libraries, and many more are contributed by the community. They are loaded with _import_.
- Once a library is imported, we have access to all its functions and classes.
- The contents of a (text) file are accessed with _open()_.
- Regular expressions are powerful tools for finding patterns in text.
- Lists are precisely that: ordered lists of elements.
- Dictionaries store key-value pairs.
- Loops repeat a block of code; here we use _for_, which runs the block once per element of a sequence.
- Conditionals execute a block of code only if a condition is true (here we use a simple _if_).
- We can display the contents of a variable with _print()_.

In [8]:
# print the first 20 words in the text

None

THE
BOOK
OF
GENESIS
Chapter
In
the
beginning
when
God
created
the
heavens
and
the
earth
the
earth
was
a
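One possible way to fill in the blank cell above, as a minimal sketch: slicing a list with `[:k]` keeps its first `k` elements. The short `sample` string is a stand-in (an assumption) for the contents of genesis.txt so that the snippet runs on its own; in the notebook, `tokens` already exists and the slice would be `tokens[:20]`.

```python
import re

# Stand-in for the contents of genesis.txt (assumption, for a self-contained demo)
sample = "In the beginning when God created the heavens and the earth"
tokens = re.findall('[A-Za-z]+', sample)

# Keep only the first 5 tokens here; use tokens[:20] in the notebook
first_tokens = tokens[:5]
for tok in first_tokens:
    print(tok)
```

Slicing never raises an error when the list is shorter than the requested length; it simply returns what is there.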


In [9]:
# sort the words in the list

None

A
A
A
Abel
Abel
Abel
Abel
Abel
Abel
Abel
Abel
Abel
Abida
Abimael
Abimelech
Abimelech
Abimelech
Abimelech
Abimelech
Abimelech
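A possible solution for the blank cell above, sketched on a tiny hand-made token list (an assumption, so the snippet is self-contained). The built-in `sorted()` returns a new list and leaves the original untouched. The name `sorted_tokens` matches the one used by the next cell.

```python
# Tiny stand-in for the token list extracted from genesis.txt (assumption)
tokens = ["the", "earth", "God", "Abel", "and", "A"]

# sorted() returns a new, sorted list; tokens itself is not modified
sorted_tokens = sorted(tokens)

# Print the first 20 sorted tokens (fewer if the list is shorter)
for tok in sorted_tokens[:20]:
    print(tok)
```

Strings are compared character by character using their code points, so every uppercase letter sorts before every lowercase one. That is why the output above starts with capitalized names.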


In [11]:
# count again, this time on the sorted_tokens

d = {}
for tok in sorted_tokens:
    if tok not in d:
        d[tok] = 0
    d[tok] += 1
print(d)

{'A': 3, 'Abel': 9, 'Abida': 1, 'Abimael': 1, 'Abimelech': 24, 'About': 1, 'Abraham': 134, 'Abram': 59, 'Accad': 1, 'Achbor': 2, 'Adah': 8, 'Adam': 5, 'Adbeel': 1, 'Admah': 3, 'Adullamite': 3, 'After': 12, 'Afterward': 3, 'Afterwards': 1, 'Again': 3, 'Ah': 1, 'Ahuzzath': 1, 'Ai': 2, 'Aiah': 1, 'Akan': 1, 'Alas': 1, 'All': 10, 'Allon': 1, 'Almighty': 6, 'Almodad': 1, 'Also': 1, 'Although': 2, 'Alvah': 1, 'Alvan': 1, 'Am': 2, 'Amalek': 2, 'Amalekites': 1, 'Ammonites': 1, 'Amorite': 1, 'Amorites': 5, 'Amraphel': 2, 'Anah': 9, 'Anamim': 1, 'And': 196, 'Aner': 2, 'Any': 1, 'Aram': 4, 'Aramean': 5, 'Aran': 1, 'Ararat': 1, 'Ard': 1, 'Are': 5, 'Areli': 1, 'Arioch': 2, 'Arise': 1, 'Arkites': 1, 'Arodi': 1, 'Arpachshad': 6, 'Arvadites': 1, 'As': 23, 'Asenath': 3, 'Ashbel': 1, 'Asher': 4, 'Ashkenaz': 1, 'Ashteroth': 1, 'Assemble': 1, 'Asshur': 1, 'Asshurim': 1, 'Assyria': 3, 'At': 9, 'Atad': 2, 'Avith': 1, 'BOOK': 1, 'Baal': 2, 'Babel': 2, 'Basemath': 6, 'Be': 4, 'Because': 9, 'Becher': 1, 'Bedad

## 2. Sort a list of words in various ways

Ignore case when counting by lowercasing the text first

In [13]:
txt = None
tokens = re.findall('[A-Za-z]+', txt)
tokens = sorted(tokens)

c = Counter(tokens)
print(c)

Counter({'the': 2474, 'and': 2018, 'of': 1270, 'to': 1078, 'you': 650, 'he': 633, 'his': 557, 'in': 548, 'that': 462, 'said': 449, 'i': 442, 'him': 400, 'a': 351, 'your': 348, 'my': 341, 'for': 331, 'with': 323, 'was': 310, 'me': 289, 'is': 284, 'they': 250, 'it': 236, 'so': 236, 's': 234, 'will': 234, 'all': 232, 'god': 229, 'father': 228, 'had': 225, 'them': 221, 'from': 217, 'not': 215, 'then': 215, 'lord': 213, 'on': 206, 'be': 204, 'shall': 201, 'have': 200, 'land': 195, 'when': 194, 'jacob': 188, 'are': 182, 'as': 179, 'her': 162, 'joseph': 157, 'she': 157, 'son': 156, 'there': 154, 'were': 152, 'but': 148, 'their': 144, 'abraham': 134, 'this': 134, 'one': 128, 'sons': 117, 'out': 115, 'years': 112, 'who': 111, 'now': 107, 'earth': 106, 'up': 104, 'we': 103, 'went': 103, 'let': 99, 'came': 98, 'wife': 96, 'pharaoh': 94, 'us': 94, 'these': 93, 'do': 90, 'has': 89, 'brother': 88, 'man': 85, 'isaac': 82, 'by': 81, 'esau': 80, 'egypt': 79, 'may': 77, 'go': 76, 'every': 74, 'at': 73, 
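The `txt = None` placeholder in the cell above is left for the reader. One way to fill it, as a hedged sketch, is the string method `lower()`, which returns a lowercased copy of the text. The sample sentence below is made up so that the snippet runs without genesis.txt.

```python
import re
from collections import Counter

# Made-up stand-in for the text read from genesis.txt (assumption)
txt = "The earth was formless. And the LORD God made the earth."

# Lowercase the whole text before tokenizing, so 'The' and 'the' are counted together
txt = txt.lower()
tokens = re.findall('[A-Za-z]+', txt)

c = Counter(tokens)
print(c.most_common(3))
```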

Count sequences of vowels

In [15]:
vowels = re.findall('[aeiou]+', txt)
c = Counter(vowels)
print(c)

Counter({'e': 15837, 'a': 10899, 'o': 8415, 'i': 6661, 'ou': 1803, 'u': 1276, 'ea': 946, 'ai': 725, 'ee': 388, 'oo': 282, 'au': 278, 'ie': 262, 'ei': 244, 'aa': 148, 'io': 122, 'ao': 99, 'oa': 98, 'eo': 84, 'ae': 69, 'ui': 68, 'oi': 61, 'ia': 51, 'ue': 31, 'ua': 27, 'eu': 26, 'eou': 13, 'oe': 11, 'oii': 5, 'eue': 5, 'eei': 4, 'iou': 4, 'eau': 3, 'aio': 3, 'iu': 1, 'uie': 1, 'aia': 1})


Count sequences of consonants

In [16]:
consonants = re.findall('[bcdfghjklmnpqrstvwxyz]+', txt)
c = Counter(consonants)
print(c)

Counter({'th': 5353, 's': 5125, 'r': 4257, 'n': 4038, 't': 3740, 'h': 3696, 'd': 3651, 'm': 3300, 'f': 2728, 'nd': 2589, 'w': 2231, 'y': 1943, 'l': 1850, 'b': 1690, 'c': 1422, 'v': 1245, 'll': 951, 'g': 909, 'ng': 733, 'sh': 658, 'p': 619, 'k': 615, 'wh': 558, 'br': 515, 'j': 489, 'nt': 450, 'st': 417, 'rs': 412, 'ch': 400, 'rd': 390, 'my': 342, 'ph': 312, 'ght': 297, 'ld': 296, 'fr': 278, 'ss': 250, 'ns': 245, 'pl': 198, 'rn': 174, 'gr': 151, 'rth': 148, 'bl': 142, 'ry': 138, 'ck': 129, 'pr': 125, 'ls': 121, 'nc': 121, 'fl': 120, 'tw': 114, 'ft': 110, 'rv': 106, 'gypt': 99, 'wn': 97, 'z': 95, 'dr': 95, 'pt': 93, 'sp': 91, 'nts': 86, 'ys': 83, 'by': 81, 'ndr': 78, 'ts': 69, 'ldr': 66, 'str': 64, 'rt': 64, 'nk': 62, 'thr': 61, 'cr': 60, 'lt': 59, 'ttl': 59, 'kn': 59, 'sl': 57, 'lf': 56, 'tr': 55, 'rds': 55, 'rr': 55, 'gh': 55, 'ty': 54, 'tt': 50, 'cc': 48, 'ms': 48, 'pp': 46, 'sc': 46, 'ngs': 45, 'ffspr': 45, 'rm': 44, 'lv': 44, 'sr': 44, 'cl': 42, 'ff': 41, 'sw': 40, 'rty': 40, 'ds': 3
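The word, vowel and consonant counts above differ only in the regular expression. A small sketch can make that explicit by turning the pattern into a parameter. The helper name `count_matches` is hypothetical (it does not appear in the notebook), and the sample text is made up.

```python
import re
from collections import Counter

def count_matches(pattern, text):
    """Count every non-overlapping match of pattern in text."""
    return Counter(re.findall(pattern, text))

# Made-up sample text (assumption); in the notebook this would be txt
text = "the earth and the heavens"

words = count_matches('[a-z]+', text)      # tokens are whole words
vowels = count_matches('[aeiou]+', text)   # tokens are runs of vowels

print(words)
print(vowels)
```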

**From Unix for poets**

"These three examples are intended to show how easy it is to change the definition of what counts as a word. Sometimes you want to distinguish between upper and lower case, and sometimes you don’t [...] The same basic counting program can be used to count a variety of different things, depending on how you implement the definition of thing (=token)."

### 2.1 Sort in dictionary order

In [17]:
with open("genesis.txt", 'r') as input:
    txt = input.read()

# Apply a regular expression to the string txt and look for all occurrences of the given pattern
tokens = re.findall('[A-Za-z]+', txt)

sorted_tokens = sorted(tokens)
print(sorted_tokens)

['A', 'A', 'A', 'Abel', 'Abel', 'Abel', 'Abel', 'Abel', 'Abel', 'Abel', 'Abel', 'Abel', 'Abida', 'Abimael', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'Abimelech', 'About', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham', 'Abraham',

### 2.2 Sort in "rhyming" order

In [19]:
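# Helper function: reverse a word with a slice, e.g. "land" -> "dnal"
def invert(word):
    return word[::-1]

# key=invert makes sorted() compare the reversed words,
# so words with the same ending are placed next to each other
rhyme_tokens = sorted(tokens, key=invert)

print(rhyme_tokens)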

['A', 'A', 'A', 'GOD', 'GOD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'LORD', 'L
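The effect of the `key` parameter is easier to see on a handful of words. The sketch below repeats the `invert` function from the cell above and applies it to a made-up word list (an assumption, chosen so that the endings are visible).

```python
def invert(word):
    # Reverse the word: "land" becomes "dnal"
    return word[::-1]

# Made-up word list (assumption)
words = ["thing", "land", "being", "sand", "king"]

# Sort by the reversed spelling; the words themselves are returned unchanged
rhyme_order = sorted(words, key=invert)
print(rhyme_order)
```

Words ending in "-and" come first, followed by the "-ing" words, because the comparison starts from the last letter.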

## 3. Compute n-gram statistics

For 2-grams

In [21]:
bigrams = None 
c = Counter(bigrams)
print(c)

Counter({'of the': 334, 'in the': 232, 'said to': 215, 'the land': 159, 'the LORD': 144, 'to the': 137, 'and the': 132, 'land of': 105, 'I will': 101, 'all the': 97, 'on the': 96, 'the earth': 92, 'to him': 89, 'and he': 85, 'to his': 78, 'from the': 73, 'and said': 73, 'you and': 73, 'I have': 70, 'his father': 67, 'of his': 65, 'shall be': 61, 'and his': 61, 'he said': 60, 'to me': 56, 'I am': 56, 'to you': 55, 'he had': 54, 'that you': 52, 'him and': 52, 'you shall': 51, 'sons of': 51, 'that I': 51, 'that he': 50, 'are the': 50, 'and I': 50, 'with him': 50, 'you have': 48, 'with you': 47, 'for the': 45, 'of my': 45, 'me and': 45, 'of Egypt': 45, 'to them': 44, 'and they': 44, 'and all': 43, 'so that': 43, 'that is': 42, 'the father': 42, 'father of': 42, 'the man': 41, 'his brothers': 40, 'that the': 38, 'he was': 38, 'of Canaan': 37, 'the field': 36, 'out of': 36, 'came to': 36, 'the ground': 35, 'the sons': 35, 'did not': 35, 'to your': 35, 'his wife': 34, 'to her': 34, 'of your':
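The `bigrams = None` placeholder above can be filled by following the pattern of the 3-gram cell below it: join each token with its successor. This is a sketch on a made-up token list (an assumption); in the notebook, `tokens` comes from genesis.txt.

```python
from collections import Counter

# Made-up stand-in for the Genesis tokens (assumption)
tokens = "in the land of Canaan and in the land of Egypt".split()

# A list of n tokens has n-1 bigrams: positions (0,1), (1,2), ..., (n-2,n-1)
bigrams = [" ".join(tokens[i:i+2]) for i in range(len(tokens) - 1)]

c = Counter(bigrams)
print(c)
```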

For 3-grams

In [23]:
trigrams = [" ".join(tokens[i:i+3]) for i in range(len(tokens)-2)]
c = Counter(trigrams)
print(c)

Counter({'the land of': 100, 'in the land': 66, 'said to him': 44, 'the father of': 42, 'the sons of': 35, 'land of Canaan': 35, 'land of Egypt': 35, 'said to his': 32, 'on the earth': 30, 'These are the': 30, 'said to them': 27, 'of the LORD': 27, 'and said to': 27, 'became the father': 27, 'that he had': 26, 'the earth and': 25, 'of the land': 24, 'and I will': 22, 'the God of': 21, 'the LORD God': 19, 'the days of': 19, 'the name of': 19, 'he became the': 19, 'that I may': 19, 'and all the': 18, 'to his father': 18, 'out of the': 17, 'years he became': 17, 'had other sons': 17, 'other sons and': 17, 'sons and daughters': 17, 'of the earth': 16, 'God said to': 16, 'name of the': 16, 'the LORD said': 16, 'after the birth': 16, 'the birth of': 16, 'years and had': 16, 'and had other': 16, 'all the land': 16, 'the face of': 15, 'of every kind': 15, 'lived after the': 15, 'the LORD has': 15, 'I will give': 15, 'Jacob said to': 15, 'Joseph said to': 15, 'I will make': 14, 'to his brothers

For **any** n

In [25]:
n = 5
grams = [" ".join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
c = Counter(grams)
print(c)

Counter({'in the land of Canaan': 22, 'he became the father of': 19, 'years he became the father': 17, 'had other sons and daughters': 17, 'years and had other sons': 16, 'and had other sons and': 16, 'lived after the birth of': 15, 'These are the sons of': 11, 'all the land of Egypt': 11, 'These are the descendants of': 10, 'in the land of Egypt': 10, 'other sons and daughters Thus': 9, 'sons and daughters Thus all': 9, 'and daughters Thus all the': 9, 'daughters Thus all the days': 9, 'Thus all the days of': 8, 'other sons and daughters When': 8, 'years and he died When': 7, 'And there was evening and': 6, 'there was evening and there': 6, 'was evening and there was': 6, 'evening and there was morning': 6, 'and there was morning the': 6, 'to you and to your': 6, 'the land of Canaan and': 6, 'a wife for my son': 6, 'the God of your father': 6, 'and bore a son and': 6, 'in the land of Goshen': 6, 'day And God said Let': 5, 'And it was so God': 5, 'And God saw that it': 5, 'God saw that

## 5. Methods

Once again, **from Unix for poets**

Suppose that you found that you were often computing trigrams of different things, and you found it inconvenient to keep typing the same five lines over and over. If you define the following **function**, you can count n-grams in a single line:

In [35]:
def ngrams(tokens, n):
    return [" ".join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

four_grams = ngrams(tokens, 4)
c = Counter(four_grams)
print(c)

Counter({'in the land of': 51, 'the land of Canaan': 35, 'the land of Egypt': 33, 'became the father of': 27, 'he became the father': 19, 'years he became the': 17, 'had other sons and': 17, 'other sons and daughters': 17, 'after the birth of': 16, 'years and had other': 16, 'and had other sons': 16, 'lived after the birth': 15, 'are the sons of': 14, 'the name of the': 13, 'all the land of': 13, 'and said to him': 12, 'are the descendants of': 11, 'to the land of': 11, 'These are the sons': 11, 'the face of the': 10, 'all the days of': 10, 'bore a son and': 10, 'These are the descendants': 10, 'the LORD said to': 9, 'sons and daughters Thus': 9, 'and daughters Thus all': 9, 'daughters Thus all the': 9, 'Thus all the days': 9, 'years and he died': 9, 'all that he had': 9, 'said to his father': 9, 'on the earth and': 8, 'sons and daughters When': 8, 'between you and me': 8, 'the land of the': 8, 'so that I may': 8, 'the sons of Esau': 8, 'And God saw that': 7, 'Then the LORD said': 7, '

## 6. Counting n-grams from verses containing the phrase "the land of"

The most frequent 3-gram is "the land of". Let us count the 3-grams in verses containing "the land of" only. 

In [28]:
with open("genesis.txt", 'r') as input:
    # Keep only the lines (verses) that contain the phrase
    txt = " ".join([x for x in input.readlines() if "the land of" in x])
tokens = re.findall('[A-Za-z]+', txt)
three_grams = ngrams(tokens, 3)
c = Counter(three_grams)
print(c)

Counter({'the land of': 100, 'in the land': 53, 'land of Canaan': 35, 'land of Egypt': 35, 'all the land': 15, 'to the land': 11, 'land of the': 9, 'of the land': 9, 'from the land': 7, 'the sons of': 7, 'of Egypt and': 7, 'land of Goshen': 7, 'and all the': 6, 'of Canaan and': 6, 'seven years of': 5, 'said to Joseph': 5, 'over all the': 5, 'settled in the': 4, 'had acquired in': 4, 'are the sons': 4, 'these are the': 4, 'are the clans': 4, 'the clans of': 4, 'land of Edom': 4, 'Pharaoh said to': 4, 'they came to': 3, 'to go to': 3, 'of Egypt in': 3, 'Egypt in the': 3, 'in the cave': 3, 'land of your': 3, 'These are the': 3, 'sons of Esau': 3, 'in all the': 3, 'years of famine': 3, 'over the land': 3, 'the seven years': 3, 'of Egypt He': 3, 'for the famine': 3, 'the famine was': 3, 'settle in the': 3, 'of Goshen and': 3, 'your father and': 3, 'his brothers and': 3, 'father s household': 3, 'The land of': 3, 'from the presence': 2, 'the presence of': 2, 'of the LORD': 2, 'land of Shinar

Now let us count the 3-grams only in verses **not** containing "the land of".

In [29]:
with open("genesis.txt", 'r') as input:
    txt = None 
tokens = re.findall('[A-Za-z]+', txt)
three_grams = ngrams(tokens, 3)
c = Counter(three_grams)
print(c)

Counter({'said to him': 44, 'the father of': 42, 'said to his': 31, 'on the earth': 30, 'the sons of': 28, 'said to them': 27, 'These are the': 27, 'and said to': 27, 'became the father': 27, 'the earth and': 25, 'of the LORD': 25, 'that he had': 24, 'and I will': 20, 'the LORD God': 19, 'the name of': 19, 'he became the': 19, 'the God of': 19, 'the days of': 18, 'years he became': 17, 'had other sons': 17, 'other sons and': 17, 'sons and daughters': 17, 'that I may': 17, 'of the earth': 16, 'God said to': 16, 'name of the': 16, 'out of the': 16, 'after the birth': 16, 'the birth of': 16, 'years and had': 16, 'and had other': 16, 'to his father': 16, 'the face of': 15, 'of every kind': 15, 'the LORD said': 15, 'lived after the': 15, 'the LORD has': 15, 'of the land': 15, 'I will make': 14, 'said to her': 14, 'Jacob said to': 14, 'Joseph said to': 14, 'of the ground': 13, 'said to the': 13, 'the descendants of': 13, 'to his brothers': 13, 'in the land': 13, 'I will give': 13, 'he said t
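One way to fill the `txt = None` placeholder above is to negate the membership test with `not in`. The sketch below uses a short list of made-up lines (an assumption) in place of `input.readlines()`, and repeats the `ngrams` function so that it runs on its own.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

# Made-up stand-in for input.readlines() on genesis.txt (assumption)
lines = [
    "Jacob settled in the land of Canaan.\n",
    "Then God said to him, go.\n",
    "And God said to him, stay.\n",
]

# Keep only the lines that do NOT contain the phrase
txt = " ".join([x for x in lines if "the land of" not in x])
tokens = re.findall('[A-Za-z]+', txt)
c = Counter(ngrams(tokens, 3))
print(c)
```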

#### Exercises

1. Ignore lines containing "gh"
2. Consider only lines ending with "ing"

#### Take-home exercises
1. How many uppercase tokens are in this version of Genesis?
2. How many 4-letter words?
3. Are there words without vowels?


---

Get the _k_ most common words

In [30]:
k = 3
with open("genesis.txt", 'r') as input:
    txt = " ".join([x for x in input.readlines() if "the land of" in x])
tokens = re.findall('[A-Za-z]+', txt)
unigrams = ngrams(tokens, 1)
c = Counter(unigrams)
c.most_common(k)

[('the', 246), ('of', 179), ('land', 121)]

Get all n-grams appearing exactly **k times**. And then let's try to find the longest n-grams appearing at least 5 times!

In [31]:
n = 5
k = 5

with open("genesis.txt", 'r') as input:
    txt = input.read()

# Apply a regular expression to the string txt and look for all occurrences of the given pattern
tokens = re.findall('[A-Za-z]+', txt)
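# WARNING: this rebinds the name ngrams from the function to a list,
# so later calls to ngrams(...) fail until the function is defined again
ngrams = ngrams(tokens, n)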

c = Counter(ngrams)
my_ngrams = None
print(my_ngrams)

['day And God said Let', 'And it was so God', 'And God saw that it', 'God saw that it was', 'saw that it was good', 'Then the LORD said to', 'the name of the LORD', 'the face of the ground', 'between me and you and', 'you and to your offspring', 'in the hill country of', 'he said Here I am', 'conceived and bore a son', 'These were the sons of', 'are the sons of Esau', 'the sons of Esau s', 'these are the clans of', 'the captain of the guard', 'the land of Egypt and']
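A possible way to fill the `my_ngrams = None` placeholder above, as a sketch: iterate over the (n-gram, count) pairs of the counter and keep those whose count equals `k`. The list of n-grams below is made up (an assumption) to keep the example small.

```python
from collections import Counter

# Made-up list of n-grams (assumption); in the notebook it is computed from genesis.txt
grams = ["a b", "a b", "c d", "a b", "c d", "e f"]
k = 2

c = Counter(grams)

# Keep the n-grams that appear exactly k times
my_ngrams = [gram for gram, count in c.items() if count == k]
print(my_ngrams)
```

Replacing `==` with `>=` gives the n-grams appearing at least `k` times, which is what the second question asks for.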


#### Find palindromes in Genesis

We will use the comparison operator **==** to test whether a word reads the same backwards

In [36]:
palindromes = []
None 
c = Counter(palindromes)
print(c)

Counter({'I': 442, 'a': 348, 's': 234, 'did': 67, 'O': 11, 'Did': 4, 'A': 3, 'ewe': 3, 'Eve': 2, 'noon': 2, 'deed': 1, 'sees': 1})
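One way to fill the `None` placeholder above, as a hedged sketch. The printed result includes 'Eve' and 'Did', which are palindromes only when case is ignored, so this version assumes the comparison is made on the lowercased word. The token list is made up.

```python
from collections import Counter

# Made-up stand-in for the Genesis tokens (assumption)
tokens = ["I", "did", "see", "Eve", "at", "noon", "Did", "a", "ewe"]

palindromes = []
for tok in tokens:
    low = tok.lower()
    # A palindrome equals its own reverse
    if low == low[::-1]:
        palindromes.append(tok)

c = Counter(palindromes)
print(c)
```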


#### Exercises

(from Church)
1. It is said that English avoids sequences of _-ing_ words. Find bigrams where both words end in _-ing_. Do these count as counter-examples of the _-ing -ing_ rule?
2. For comparison's sake, find bigrams where both words end in _-ed_. Should there also be a prohibition against _-ed -ed_? Are there any examples of _-ed -ed_ in Genesis? If so, how many? Which verse(s)?

In [37]:
# OPTION 1. Using a regular expression over the already computed n-grams

grams = ngrams(tokens, 2)
regexp = re.compile('.+ing .+ing$')
ing_ing = [x for x in grams if regexp.match(x)]
print(ing_ing)

['creeping thing', 'living thing', 'living being', 'living thing', 'creeping thing', 'living thing', 'creeping thing', 'living thing', 'living thing', 'creeping thing', 'creeping thing', 'moving thing', 'ring weighing', 'morning devouring', 'evening dividing']


In [38]:
# OPTION 2. Using string operations
grams = ngrams(tokens, 2)
ing_ing = []
for gram in grams:
    pair = gram.split(" ")
    if pair[0].endswith("ing") and pair[1].endswith("ing"):
        ing_ing.append(gram)
print(ing_ing)

['creeping thing', 'living thing', 'living being', 'living thing', 'creeping thing', 'living thing', 'creeping thing', 'living thing', 'living thing', 'creeping thing', 'creeping thing', 'moving thing', 'ring weighing', 'morning devouring', 'evening dividing']


In [140]:
# Do it for "-ed -ed" here
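A sketch for the _-ed -ed_ case, following Option 2 above with string operations. The token list is made up (an assumption), and the `ngrams` function is repeated so that the snippet is self-contained.

```python
def ngrams(tokens, n):
    return [" ".join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

# Made-up stand-in for the Genesis tokens (assumption)
tokens = ["the", "LORD", "appeared", "troubled", "and", "he", "called", "blessed", "Jacob"]

ed_ed = []
for gram in ngrams(tokens, 2):
    first, second = gram.split(" ")
    # Keep the bigram only if both words end in "ed"
    if first.endswith("ed") and second.endswith("ed"):
        ed_ed.append(gram)

print(ed_ed)
```

Note that `endswith("ed")` also accepts words such as "seed" or "bed", which are not past-tense forms, so the results still need to be checked by eye.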

#### Exercise

Print out verses containing the phrase "Let there be light". Print out the previous verse as well

In [39]:
my_str = "Let there be light"

with open("genesis.txt", 'r') as input:
    verses = input.readlines()
    
for verse in verses:
    if my_str in verse:
        print(verse)

[1:3] Then God said, "Let there be light"; and there was light.

[1:14] And God said, "Let there be lights in the dome of the sky to separate the day from the night; and let them be for signs and for seasons and for days and years,



Now we have the two verses. **How can we print the previous verse as well???**

In [142]:
# Change the previous code snippet to print the previous sentence as well
None
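One possible answer, as a sketch: `enumerate()` yields the position of each line together with the line itself, so the previous line is available as `verses[i - 1]`. The three verses below stand in for `input.readlines()`, and the sketch assumes that every verse sits on its own line directly after the previous one; blank lines or chapter headings in the file would need extra handling.

```python
my_str = "Let there be light"

# Stand-in for input.readlines() on genesis.txt (assumption)
verses = [
    '[1:1] In the beginning when God created the heavens and the earth,\n',
    '[1:2] the earth was a formless void and darkness covered the face of the deep, while a wind from God swept over the face of the waters.\n',
    '[1:3] Then God said, "Let there be light"; and there was light.\n',
]

shown = []
for i, verse in enumerate(verses):
    if my_str in verse:
        if i > 0:  # the very first line has no predecessor
            shown.append(verses[i - 1])
        shown.append(verse)

for verse in shown:
    print(verse.strip())
```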


## 7. String substitutions

In [40]:
def top_k_lines(file, k=10):
    with open(file) as f:
        head = [next(f) for x in range(k)]
    return head

txt = top_k_lines("genesis.txt")
# print(txt)
for line in txt:
    print(line.replace("God", "The Spaghetti Monster").strip())

THE BOOK OF GENESIS

Chapter 1

[1:1] In the beginning when The Spaghetti Monster created the heavens and the earth,
[1:2] the earth was a formless void and darkness covered the face of the deep, while a wind from The Spaghetti Monster swept over the face of the waters.
[1:3] Then The Spaghetti Monster said, "Let there be light"; and there was light.
[1:4] And The Spaghetti Monster saw that the light was good; and The Spaghetti Monster separated the light from the darkness.
[1:5] The Spaghetti Monster called the light Day, and the darkness he called Night. And there was evening and there was morning, the first day.
[1:6] And The Spaghetti Monster said, "Let there be a dome in the midst of the waters, and let it separate the waters from the waters."


## 8. Mutual information to find collocations

From the Wikipedia articles on [mutual information](https://en.wikipedia.org/wiki/Mutual_information#Applications_2) and [collocations](https://en.wikipedia.org/wiki/Collocation)

In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the **mutual dependence between the two variables**. More specifically, it quantifies the "amount of information" (in units such as shannons, commonly called bits) obtained about one random variable through observing the other random variable.

Mutual information of words is often used as a **significance function for the computation of collocations in corpus linguistics**.

A collocation is a series of words or terms that co-occur **more often than would be expected by chance**.
Mutual information is defined as

$MI(x,y) = \log_2 \frac{Pr(x,y)}{Pr(x) Pr(y)}$

and, following [Magerman and Marcus](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.78.4178&rep=rep1&type=pdf), in NLP it can be estimated as 

$MI(x,y) \approx \log \frac{\frac{f(x,y)}{\sum_{(i,j)\in C}f(i,j)}}{\frac{f(x)}{\sum_{i\in C}f(i)} \frac{f(y)}{\sum_{j\in C}f(j)}}$

where $f(\cdot)$ is the number of times a word or bigram occurs in the corpus $C$, and each sum runs over all words or all bigrams in $C$. The code below uses the natural logarithm (`math.log`), so its values differ from base-2 values by a constant factor.

In [42]:
from math import log

bigrams = ngrams(tokens, 2)
unigrams = ngrams(tokens, 1)

freq_bigrams = Counter(bigrams)
freq_unigrams = Counter(unigrams)

sum_bigrams = sum(freq_bigrams.values())
sum_unigrams = sum(freq_unigrams.values())

print(sum_bigrams, sum_unigrams)
# Pick one word pair to score (uncomment another line to try a different pair)
# my_str = ["God", "created"]
# my_str = ["of", "Esau"]
my_str = ["LORD", "said"]

mi = log((freq_bigrams[" ".join(my_str)] / sum_bigrams) / 
          ( (freq_unigrams[my_str[0]] / sum_unigrams) * (freq_unigrams[my_str[1]] / sum_unigrams)  )  )

print("mi(%s, %s) = %f" % (my_str[0], my_str[1], mi))

36550 36551
mi(LORD, said) = 2.283213


But these ratios are just maximum likelihood estimates (MLE) of the probabilities! We can normalize the counts once and reuse them:

In [44]:
from math import log

bigrams = ngrams(tokens, 2)
unigrams = ngrams(tokens, 1)

freq_bigrams = Counter(bigrams)
freq_unigrams = Counter(unigrams)

sum_bigrams = sum(freq_bigrams.values())
sum_unigrams = sum(freq_unigrams.values())

print(sum_bigrams, sum_unigrams)

for k in freq_bigrams:
    freq_bigrams[k] /= sum_bigrams
    
for k in freq_unigrams:
    freq_unigrams[k] /= sum_unigrams

# Pick one word pair to score (uncomment another line to try a different pair)
# my_str = ["God", "created"]
# my_str = ["of", "Esau"]
my_str = ["LORD", "said"]

mi = log((freq_bigrams[" ".join(my_str)]) / 
          ( (freq_unigrams[my_str[0]]) * (freq_unigrams[my_str[1]])  )  )

print("mi(%s, %s) = %f" % (my_str[0], my_str[1], mi))

36550 36551
mi(LORD, said) = 2.283213


### Exercises (once again, from Church)

1. Compute $MI(x,y)$ for all pairs $(x,y)$ in the corpus $C$.
2. MI is unstable for small bigram counts. Compute (or display) MI only for those bigrams $(x,y)$ such that $f(x,y)\geq 5$.
3. Find the 10 bigrams in Genesis with the largest MI.
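The three exercises can be combined in one small function. The following is a sketch under stated assumptions, not a reference solution: the name `mutual_information` and its `min_count` parameter are hypothetical, the natural logarithm is used as in the cells above, and the token list is made up so that the snippet runs without genesis.txt.

```python
from collections import Counter
from math import log

def ngrams(tokens, n):
    return [" ".join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

def mutual_information(tokens, min_count=5):
    """Return {bigram: MI} for every bigram seen at least min_count times."""
    freq_uni = Counter(tokens)
    freq_bi = Counter(ngrams(tokens, 2))
    total_uni = sum(freq_uni.values())
    total_bi = sum(freq_bi.values())

    scores = {}
    for bigram, count in freq_bi.items():
        if count < min_count:
            continue  # skip rare bigrams, whose MI is unreliable
        x, y = bigram.split(" ")
        p_xy = count / total_bi
        p_x = freq_uni[x] / total_uni
        p_y = freq_uni[y] / total_uni
        scores[bigram] = log(p_xy / (p_x * p_y))
    return scores

# Made-up stand-in for the Genesis tokens (assumption)
tokens = ("the LORD said " * 5 + "the land " * 5).split()

scores = mutual_information(tokens, min_count=5)

# Sort by MI, largest first, and keep at most 10 bigrams
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:10]
for bigram, mi in top:
    print("%s\t%f" % (bigram, mi))
```

In the notebook, the same call on the Genesis `tokens` would answer exercise 3; lowering `min_count` shows how rare bigrams tend to dominate the top of the ranking.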


## 9. Make a Concordance