# Warmup: Counter. 

Count how many times each element in a list occurs.

```
[1, 3, 2, 1, 5, 3, 5, 1, 4] ⇒

    1: 3 times
    2: 1 time
    3: 2 times
    4: 1 time
    5: 2 times
```


In [2]:
def key_counter(l):
    """Map each distinct element of l to a string like '3 times'."""
    d = {}
    for key in set(l):
        n = l.count(key)  # count once per key and reuse the result
        if n == 1:
            d[key] = str(n) + " time"
        else:
            d[key] = str(n) + " times"
    return d

In [3]:
print(key_counter([1, 3, 2, 1, 5, 3, 5, 1, 4]))

{1: '3 times', 2: '1 time', 3: '2 times', 4: '1 time', 5: '2 times'}
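
Calling `l.count(key)` rescans the whole list for every distinct key. The standard library's `collections.Counter` builds the same tally in a single pass. A minimal sketch, assuming the elements can be sorted; the name `key_counter_fast` is illustrative and not part of the exercise:

```python
from collections import Counter

def key_counter_fast(items):
    """Return {element: 'N time(s)'} using a single pass over items."""
    counts = Counter(items)  # maps element -> number of occurrences
    # sorted() assumes the elements are mutually comparable (true for ints)
    return {
        key: f"{n} time" if n == 1 else f"{n} times"
        for key, n in sorted(counts.items())
    }

print(key_counter_fast([1, 3, 2, 1, 5, 3, 5, 1, 4]))
# {1: '3 times', 2: '1 time', 3: '2 times', 4: '1 time', 5: '2 times'}
```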


# Safe dict reading

Define a function `safe_dict(d, k)` that takes a Python dict `d` and a key `k` and makes reads safe even when the key is not in the dictionary: reading with a missing key should return 0 instead of raising a `KeyError`.

```
d = {1 : 2, 3 : 4}
safe_dict(d, 1) -> 2
safe_dict(d, 'cat') -> 0
```

In [45]:
def safe_dict(d, k):
    return d[k] if k in d else 0

# or, as a one-liner:
#safe_dict = lambda d, k: d[k] if k in d else 0

In [46]:
d = {1 : 2, 3 : 4}
print(safe_dict(d, 1))      # 2
print(safe_dict(d, 'cat'))  # 0

2
0
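
The built-in `dict.get` method already takes a default value, so the same behaviour can be had without an explicit membership test. A small sketch; `safe_dict_get` is a hypothetical name chosen so it does not shadow the solution above:

```python
def safe_dict_get(d, k):
    """Return d[k], or 0 when k is not a key of d."""
    return d.get(k, 0)

d = {1: 2, 3: 4}
print(safe_dict_get(d, 1))      # 2
print(safe_dict_get(d, 'cat'))  # 0
```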


# File Reading: Hamlet Exercises

Open `hamlet.txt` in the `data` folder.

### 1. Mentioned Hamlet

How many times is Hamlet mentioned in the book?

Use Python and line iteration to count it up.

In [1]:
# read the whole play into one string; the file is closed automatically
with open('data/hamlet.txt', 'r') as f:
    txt = f.read()

In [17]:
print("Hamlet is mentioned " + str(txt.count("Hamlet")) + " times")
print("HAMLET speaks " + str(txt.count("HAMLET")) + " times")
print("The word 'hamlet' appears " + str(txt.count("HAMLET") + txt.count("Hamlet")) + " times in total")

Hamlet is mentioned 111 times
HAMLET speaks 363 times
The word 'hamlet' appears 474 times in total
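
The exercise asks for line iteration, while the cell above counts on the whole text at once. A sketch of the line-by-line variant follows. `count_mentions` is a hypothetical helper, the sample lines are made up rather than taken from `hamlet.txt`, and the match is a case-insensitive substring count, so a word such as "hamlets" would also be counted:

```python
def count_mentions(lines, word="hamlet"):
    """Count case-insensitive occurrences of word, one line at a time."""
    total = 0
    for line in lines:
        total += line.lower().count(word.lower())
    return total

# Tiny in-memory sample standing in for the file (not real file content).
sample = ["HAMLET.\n", "Lord Hamlet is here.\n", "No mention on this line.\n"]
print(count_mentions(sample))  # 2

# With the real file (path as used in this notebook):
# with open('data/hamlet.txt', 'r') as f:
#     print(count_mentions(f))
```

Passing the open file object works because iterating over a file yields one line at a time, so the whole text never has to sit in memory.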


### 2. File Reading as a .py program

Make a Python file that defines a function counting the number of times Hamlet is mentioned, using the code from the previous exercise.

Then import it in your notebook and call it here.

In [63]:
# write the module file hamlet.py from the notebook
with open('hamlet.py', 'w') as f:
    f.write(
"""
def mentioned():
    # read the whole play and count the capitalized name
    with open('data/hamlet.txt', 'r') as f:
        txt = f.read()
    return "Hamlet is mentioned " + str(txt.count("Hamlet")) + " times"
""")

# show py file
#with open('hamlet.py', 'r') as f:
#    print(f.read())

In [66]:
# load and execute
import hamlet
print(hamlet.mentioned())

Hamlet is mentioned 111 times


### 3. Unique words in hamlet

Write a program that counts the unique words in hamlet.

In [51]:
def unique_words():
    """Count distinct whitespace-separated tokens, lowercased."""
    hamlet = []
    with open("data/hamlet.txt", "r") as f:
        for line in f:
            hamlet += line.lower().split()

    return len(set(hamlet))

In [52]:
# this counts every distinct token in the file, not only the words spoken in the play
unique_words()

7092
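
Because `split()` only breaks on whitespace, punctuation stays attached, so `hamlet` and `hamlet,` count as two different words. A sketch that strips surrounding punctuation first; `unique_words_clean` is a hypothetical name, the sample is a made-up stand-in for the file, and the count on the real file is not computed here:

```python
import string

def unique_words_clean(lines):
    """Count distinct lowercase words after stripping surrounding punctuation."""
    words = set()
    for line in lines:
        for token in line.lower().split():
            token = token.strip(string.punctuation)
            if token:  # skip tokens that were only punctuation
                words.add(token)
    return len(words)

sample = ["To be, or not to be:", "that is the question."]
print(unique_words_clean(sample))  # 8
```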

# File Reading 2: A Python library.

In the `data` folder, you will find a folder called `csrgraph` which is a python library.

### 1. File count

Count the `py` files in the library using the `os` package

In [21]:
import os
# list the entries in the library folder
files = os.listdir('data/csrgraph/')
# count names ending in ".py" (str.count would also match ".pyc" or ".pyx")
sum(f.endswith(".py") for f in files)

8
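
`os.listdir` only looks at the top level of the folder. If the library had sub-packages, `os.walk` would visit them too. A sketch under that assumption; `count_py_files` is a hypothetical helper and no result for `csrgraph` is claimed here:

```python
import os

def count_py_files(root):
    """Count files ending in .py under root, including sub-folders."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        # True counts as 1, so sum() tallies the matching names
        total += sum(name.endswith(".py") for name in filenames)
    return total

# Usage with the folder from this notebook:
# count_py_files('data/csrgraph/')
```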

### 2. For the following packages, count the number of files that import them:

- pandas 

- numpy

- numba

In [59]:
import os

def pkg_imported(directory, pkg):
    """ count the number of files that mention a specific python package
        usage : pkg_imported(directory, pkg)
    """
    count = 0

    for file in os.listdir(directory):
        path = os.path.join(directory, file)
        # skip sub-folders, which cannot be opened as files
        if not os.path.isfile(path):
            continue
        # open the file and stop at the first line that contains pkg
        with open(path, 'r') as f:
            if any(pkg in line for line in f):
                count += 1
    return count

In [50]:
print( str(pkg_imported('data/csrgraph/', 'pandas')) + " files use pandas" )
print( str(pkg_imported('data/csrgraph/', 'numpy')) + " files use numpy" )
print( str(pkg_imported('data/csrgraph/', 'numba')) + " files use numba" )

4 files use pandas
6 files use numpy
6 files use numba
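
Testing `pkg in line` also matches comments, docstrings, or longer names that merely contain the package name. A stricter check looks for an actual import statement. The sketch below is one possible pattern, not the exercise's required method; `imports_package` is a hypothetical helper, and it does not handle comma-separated imports such as `import os, numpy`:

```python
import re

def imports_package(source, pkg):
    """Return True if source has an `import pkg` or `from pkg ...` statement."""
    pattern = re.compile(
        r"^\s*(?:import|from)\s+" + re.escape(pkg) + r"\b", re.MULTILINE
    )
    return bool(pattern.search(source))

code = "import numpy as np\n# pandas is mentioned only in a comment\n"
print(imports_package(code, "numpy"))   # True
print(imports_package(code, "pandas"))  # False
```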


# First NLP Program: IDF

Given a list of words, the inverse document frequency (IDF) is a basic statistic of the amount of information each word carries in the text.

The IDF formula is:

$$IDF(w) = \ln\left(\dfrac{N}{1 + n(w)}\right)$$

Where:

- $w$ is the token (unique word),
- $n(w)$ is the number of documents that $w$ occurs in,
- $N$ is the total number of documents

Write a function `IDF(docs)` that takes a list of lists of words and returns a dictionary `word -> idf score`.

Example:

```
IDF([['interview', 'questions'], ['interview', 'answers']]) -> {'questions': 0.0,
                                                                'interview': -0.4,
                                                                'answers': 0.0}
```

In [6]:
import math

def TF(doc):
    unique_words = set(doc)
    # fill tf dict with the number of occurrences of each word w
    tf = {}
    for w in unique_words:
        tf[w] = doc.count(w)
    return tf


# this version can handle a single doc or multiple docs
def IDF(docs):

    n = {}
    # if a list of docs is fed in
    if isinstance(docs[0], list):

        # flatten docs and create the set of unique words (w) in docs
        unique_words = set(item for sublist in docs for item in sublist)

        # fill n dict with the number of docs that w occurs in
        for w in unique_words:
            n[w] = sum(w in sublist for sublist in docs)

    # if a single doc is fed in
    else:
        unique_words = set(docs)
        # fill n dict with the number of occurrences of w
        for w in unique_words:
            n[w] = docs.count(w)

    # N is the number of docs (or, for a single doc, its number of words)
    N = len(docs)

    # fill idf dict with the idf value of each w
    idf = {}
    for w in unique_words:
        idf[w] = math.log(N / (1 + n[w]))

    return idf

In [42]:
IDF([['interview', 'questions'], ['interview', 'answers']])

{'answers': 0.0, 'interview': -0.40546510810816444, 'questions': 0.0}
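
Calling `sublist.count(w)` for every word and every document gets slow on larger inputs. The document frequencies $n(w)$ can instead be collected in one pass with `collections.Counter`. A sketch for the list-of-documents case only; `idf_counter` is a hypothetical name and it applies the same formula as above:

```python
import math
from collections import Counter

def idf_counter(docs):
    """Return {word: ln(N / (1 + n(w)))} for a list of tokenized documents."""
    n_docs = len(docs)
    # set(doc) makes each document contribute at most 1 per word
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {w: math.log(n_docs / (1 + n)) for w, n in doc_freq.items()}

scores = idf_counter([['interview', 'questions'], ['interview', 'answers']])
print(scores['questions'])  # 0.0
print(scores['interview'])  # ln(2/3), about -0.405
```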

# Stretch Goal: IDF on Hamlet

Calculate the IDF dictionary on the Hamlet book.

What's the IDF of "Hamlet"?

What's the word with the highest IDF in the book?

In [43]:
# Import file
hamlet = []
with open("data/hamlet.txt", "r") as f:
    for line in f:
        hamlet += line.lower().split()

tf = TF(hamlet)
idf = IDF(hamlet)

# note: this divides tf by idf; the usual TF-IDF definition multiplies them
tf_idf = {}
for w in tf:
    tf_idf[w] = tf[w]/idf[w]

In [44]:
#print("Calculate the TF-IDF dictionary on the Hamlet book.")
#print(sorted(((value,key) for (key,value) in tf_idf.items()), reverse=True)[0:10])
#print('')

print("What's the TF-IDF of 'Hamlet'?")
print( str(tf_idf["hamlet"]) )
print('')

print("What's the word with the highest TF-IDF in the book?")
print(sorted(((value,key) for (key,value) in tf_idf.items()), reverse=True)[0])
print('')

What's the TF-IDF of 'Hamlet'?
4.160140831900633

What's the word with the highest TF-IDF in the book?
(328.90175364062867, 'the')
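
With a single flat word list there is only one document, so `IDF` falls back to raw word counts and the score above divides tf by idf. One common alternative, which is an assumption here and not something the exercise specifies, is to treat each line as its own document and multiply term frequency by IDF. A sketch on a made-up sample; `tf_idf_by_line` is a hypothetical helper and no values for the real file are claimed:

```python
import math
from collections import Counter

def tf_idf_by_line(lines):
    """Treat each non-empty line as one document; return {word: tf * idf}."""
    docs = [line.lower().split() for line in lines if line.strip()]
    n_docs = len(docs)
    tf = Counter(word for doc in docs for word in doc)       # total occurrences
    df = Counter(word for doc in docs for word in set(doc))  # documents containing the word
    return {w: tf[w] * math.log(n_docs / (1 + df[w])) for w in tf}

sample = ["the king", "the queen", "the prince", "a ghost"]
scores = tf_idf_by_line(sample)
print(scores["the"])   # 0.0, frequent everywhere so it carries no information
print(scores["king"])  # ln(2), about 0.693
```

Under this definition a word that appears on most lines scores near zero or below, which is the opposite of the ranking produced by dividing.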



# Stretch Goal: Speaker count

Use a regular expression and loop over the `hamlet.txt` file to build a dictionary `character_name -> # times speaking`.

Who speaks the most often? Who speaks the least often?

In [24]:
import re

# runs of two or more capital letters; this also matches headings such as ACT and SCENE
regex = r"\b[A-Z][A-Z]+\b"
with open('data/hamlet.txt', 'r') as f:
    txt = f.read()
matches = re.findall(regex, txt)

tf = TF(matches)

In [25]:
print(sorted(((value,key) for (key,value) in tf.items()), reverse=True))
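# HAMLET speaks the most often (363 times)
# AMBASSADOR, CLAUDIUS, GERTRUDE, LUCIANUS, PRINCE and SERVANT appear once
# the list also contains non-speaker tokens (ACT, SCENE, Roman numerals), and
# some single-count names may come from the cast list rather than from a speech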

[(363, 'HAMLET'), (111, 'HORATIO'), (106, 'KING'), (87, 'POLONIUS'), (74, 'QUEEN'), (63, 'LAERTES'), (59, 'OPHELIA'), (50, 'ROSENCRANTZ'), (45, 'CLOWN'), (44, 'FIRST'), (36, 'MARCELLUS'), (34, 'GUILDENSTERN'), (27, 'OSRIC'), (22, 'BARNARDO'), (21, 'SCENE'), (17, 'PLAYER'), (15, 'GHOST'), (14, 'REYNALDO'), (12, 'SECOND'), (12, 'II'), (10, 'ACT'), (9, 'FRANCISCO'), (8, 'IV'), (8, 'III'), (7, 'FORTINBRAS'), (7, 'CAPTAIN'), (3, 'VOLTEMAND'), (3, 'LORD'), (3, 'GENTLEMAN'), (2, 'VII'), (2, 'VI'), (2, 'SAILOR'), (2, 'PRIEST'), (2, 'OF'), (2, 'MESSENGER'), (2, 'DANES'), (2, 'CORNELIUS'), (1, 'TRAGEDY'), (1, 'THE'), (1, 'SERVANT'), (1, 'PROLOGUE'), (1, 'PRINCE'), (1, 'LUCIANUS'), (1, 'LORDS'), (1, 'GERTRUDE'), (1, 'DENMARK'), (1, 'CLAUDIUS'), (1, 'BOTH'), (1, 'AMBASSADOR'), (1, 'ALL')]
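
The pattern above matches any run of capitals, so headings and numerals end up in the tally and multi-word labels are split. A tighter sketch anchors on whole lines instead. It assumes that a speaker label sits alone on a line, in capitals, followed by a period (for example `HAMLET.` or `FIRST CLOWN.`); that layout has not been verified against `hamlet.txt`, and `count_speakers` and the sample lines are hypothetical:

```python
import re
from collections import Counter

# Assumed layout: a speaker label is a line of capitals ending in a period.
SPEAKER = re.compile(r"^([A-Z][A-Z ]+)\.\s*$")
HEADINGS = ("ACT", "SCENE")  # heading prefixes to skip; also an assumption

def count_speakers(lines):
    """Return a Counter mapping speaker label -> number of speeches."""
    counts = Counter()
    for line in lines:
        match = SPEAKER.match(line.strip())
        if match and not match.group(1).startswith(HEADINGS):
            counts[match.group(1)] += 1
    return counts

sample = ["ACT I.", "HAMLET.", "To be, or not to be.", "FIRST CLOWN.",
          "HAMLET.", "O, I die, HORATIO."]
print(count_speakers(sample).most_common())
# [('HAMLET', 2), ('FIRST CLOWN', 1)]
```

`most_common()` gives the most frequent speaker first; the least frequent ones are at the end of the same list.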
