# Warmup: Counter. 

Count how many times each element in a list occurs.

```
[1, 3, 2, 1, 5, 3, 5, 1, 4] ⇒

    1: 3 times
    2: 1 time
    3: 2 times
    4: 1 time
    5: 2 times
```


In [2]:
def key_counter(l):
    d = {}
    for key in set(l):
        n = l.count(key)  # count once per distinct key
        # singular "time" only for exactly one occurrence
        d[key] = str(n) + (" time" if n == 1 else " times")
    return d

In [3]:
print(key_counter([1, 3, 2, 1, 5, 3, 5, 1, 4]))

{1: '3 times', 2: '1 time', 3: '2 times', 4: '1 time', 5: '2 times'}
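
The same tally can be sketched with `collections.Counter` from the standard library, which walks the list once instead of calling `list.count` for every distinct key. The helper name `key_counter_fast` is hypothetical; the output format is chosen to mirror `key_counter` above.

```python
from collections import Counter

def key_counter_fast(items):
    """Map each element to '<n> time(s)', counting in a single pass."""
    counts = Counter(items)
    return {
        key: f"{n} time" if n == 1 else f"{n} times"
        for key, n in sorted(counts.items())
    }

print(key_counter_fast([1, 3, 2, 1, 5, 3, 5, 1, 4]))
```

Sorting the items only makes the printed order predictable; it assumes the elements are mutually comparable, as the integers here are.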


# Safe dict reading

Define a function `safe_dict(d, k)` that takes a Python dict `d` and a key `k` and makes reads safe even for keys that are not in the dictionary: reading with a missing key should return 0 instead of raising an error.

```
d = {1 : 2, 3 : 4}
safe_dict(d, 1) -> 2
safe_dict(d, 'cat') -> 0
```

In [45]:
safe_dict = lambda d, k: d[k] if k in d else 0

# or
#def safe_dict(d, k):
#    return d[k] if k in d else 0

In [46]:
d = {1 : 2, 3 : 4}
print(safe_dict(d, 1))      # 2
print(safe_dict(d, 'cat'))  # 0

2
0
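
As an alternative sketch, the built-in `dict.get` method already takes a fallback value, so the membership test can be dropped. The `default` parameter is an addition for illustration and is not part of the exercise statement.

```python
def safe_dict(d, k, default=0):
    """Return d[k] if the key is present, else `default` (0 unless overridden)."""
    return d.get(k, default)

d = {1: 2, 3: 4}
print(safe_dict(d, 1))
print(safe_dict(d, 'cat'))
```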


# File Reading: Hamlet Exercises

Open `hamlet.txt` in the `data` folder.

### 1. Mentioned Hamlet

How many times is Hamlet mentioned in the book?

Use Python and line iteration to count it up.

In [73]:
import os
#path = os.getcwd()
#print(path)

# the context manager closes the file even if reading fails
with open('data/hamlet.txt', 'r') as f:
    txt = f.read()

In [75]:
print("Hamlet is mentioned " + str(txt.count("Hamlet")) + " times")
print("HAMLET speaks " + str(txt.count("HAMLET")) + " times")
print("The word 'hamlet' appears " + str(txt.count("HAMLET") + txt.count("Hamlet")) + " times in total")

Hamlet is mentioned 111 times
HAMLET speaks 363 times
The word 'hamlet' appears 474 times in total
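
The exercise asks for line iteration, while the cell above reads the whole file at once. A minimal sketch of the line-by-line variant follows; `count_mentions` is a hypothetical helper, and an in-memory `io.StringIO` stands in for the real file so the block runs on its own. Because it lowercases each line, it also matches casings other than `Hamlet` and `HAMLET`, so on the full text its total may differ from the 474 reported above.

```python
import io

def count_mentions(lines, word="hamlet"):
    """Count case-insensitive occurrences of `word`, one line at a time."""
    target = word.lower()
    total = 0
    for line in lines:
        total += line.lower().count(target)
    return total

# Stand-in for open('data/hamlet.txt'); a real file object iterates the same way.
sample = io.StringIO("HAMLET. To be, or not to be.\nEnter Hamlet and Horatio.\nExit.\n")
print(count_mentions(sample))  # 2
```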


### 2. File Reading as a .py program

Make a Python file that defines a function that counts the number of times Hamlet is mentioned, using the code from the previous exercise.

Then import it in your notebook and call it here.

In [81]:
# write the module file hamlet.py from the notebook
with open('hamlet.py', 'w') as f:
    f.write(
"""
def mentioned():
    # module-level function: no `self` parameter
    with open('data/hamlet.txt', 'r') as f:
        txt = f.read()
    answer = "Hamlet is mentioned " + str(txt.count("Hamlet")) + " times"
    return answer
""")

# show py file
#with open('hamlet.py', 'r') as f:
#    print(f.read())

# load and execute
import hamlet
print(hamlet.mentioned())

Hamlet is mentioned 111 times


### 3. Unique words in Hamlet

Write a program that counts the unique words in Hamlet.

In [27]:
import os

def unique_words():

    # collect every whitespace-separated token, line by line
    hamlet = []
    with open("data/hamlet.txt", "r") as f:
        for line in f:
            hamlet += line.split()

    # these are the distinct tokens in the document, not just the words spoken in the play
    return len(set(hamlet))

In [28]:
unique_words()

7676
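
The count above treats `Hamlet`, `Hamlet,` and `HAMLET.` as three different words, because tokens are compared exactly as they appear. One way to tighten that, sketched under the assumption that lowercasing and trimming ASCII punctuation is an acceptable definition of "the same word", is shown below. `normalize` and `unique_normalized` are hypothetical helpers; `string.punctuation` covers ASCII only, so typographic quotes would survive.

```python
import string

def normalize(token):
    """Lowercase a token and strip ASCII punctuation from both ends."""
    return token.lower().strip(string.punctuation)

def unique_normalized(text):
    """Return the set of distinct non-empty normalized tokens in `text`."""
    return {w for w in (normalize(t) for t in text.split()) if w}

sample = "Hamlet, hamlet! HAMLET. To be, or not to be"
print(sorted(unique_normalized(sample)))  # ['be', 'hamlet', 'not', 'or', 'to']
```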

In [29]:
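# I initially misread the exercise and output the most frequent words in Hamlet.
# `hamlet` is local to unique_words(), so rebuild the word list first.
with open("data/hamlet.txt", "r") as f:
    hamlet = f.read().split()
n = {}
for w in set(hamlet):
    n[w] = hamlet.count(w)
sorted(((value, key) for (key, value) in n.items()), reverse=True)[0:5]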

[(949, 'the'), (697, 'and'), (628, 'of'), (602, 'to'), (514, 'I')]

# File Reading 2: A Python library.

In the `data` folder, you will find a folder called `csrgraph`, which is a Python library.

### 1. File count

Count the `.py` files in the library using the `os` package.

In [90]:
import os
# list the files
files = os.listdir('data/csrgraph/')
# count py file
sum([f.count("py") for f in files])

8
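
Counting the substring `"py"` also picks up names such as `cache.pyc` or `__pycache__`, and counts a name twice if `py` appears in it twice. A stricter sketch that checks the extension is given below; `count_py_files` is a hypothetical helper, and a temporary directory stands in for `data/csrgraph/`, so the printed number says nothing about the real library.

```python
import os
import tempfile

def count_py_files(directory):
    """Count regular files directly inside `directory` whose extension is exactly '.py'."""
    return sum(
        1
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
        and os.path.splitext(name)[1] == ".py"
    )

# Demo on a throwaway directory standing in for data/csrgraph/
with tempfile.TemporaryDirectory() as tmp:
    for name in ["graph.py", "pypy_notes.txt", "cache.pyc", "README.md"]:
        open(os.path.join(tmp, name), "w").close()
    print(count_py_files(tmp))  # 1
```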

### 2. For the following packages, count the number of files that import them:

- pandas 

- numpy

- numba

In [15]:
import os
# list the files
files = os.listdir('data/csrgraph/')

def pkg_imported(pkg_list):
    for pkg in pkg_list:
        count = 0
        for file in files:
            with open('data/csrgraph/' + file , 'r') as f:
                for line in f.readlines():
                    # tallies every line that contains the package name,
                    # so one file can contribute several times
                    if pkg in line: count += 1
        print(str(count) + " lines mention " + pkg)

In [16]:
pkg_imported(['pandas', 'numpy', 'numba'])

5 lines mention pandas
11 lines mention numpy
27 lines mention numba
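
The loop above adds one to the count for every matching line, so a file that names a package several times is counted several times. A file-level count can be sketched as follows, assuming that "imports" means a line starting with `import pkg` or `from pkg`. `files_importing` is a hypothetical helper, the regular expression misses forms such as `import os, numpy`, and the demo directory is made up, so its numbers do not describe `csrgraph`.

```python
import os
import re
import tempfile

def files_importing(directory, pkg):
    """Count .py files in `directory` with an `import pkg` or `from pkg ...` line."""
    pattern = re.compile(rf"^\s*(?:import|from)\s+{re.escape(pkg)}\b", re.MULTILINE)
    count = 0
    for name in os.listdir(directory):
        if not name.endswith(".py"):
            continue
        with open(os.path.join(directory, name), "r", encoding="utf-8") as f:
            if pattern.search(f.read()):  # at most one hit per file
                count += 1
    return count

# Demo on a throwaway directory standing in for data/csrgraph/
with tempfile.TemporaryDirectory() as tmp:
    sources = {
        "a.py": "import numpy as np\nimport numpy.linalg\n",
        "b.py": "from numba import jit\n# numpy is only mentioned here\n",
        "c.py": "import pandas as pd\nfrom numpy import array\n",
    }
    for name, text in sources.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(text)
    for pkg in ["pandas", "numpy", "numba"]:
        print(files_importing(tmp, pkg), "files import", pkg)
```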


# First NLP Program: IDF

Given a list of documents, each a list of words, the inverse document frequency (IDF) is a basic statistic of how much information each word carries: the fewer documents a word appears in, the higher its score.

The IDF formula is:

$$IDF(w) = \ln\left(\dfrac{N}{1 + n(w)}\right)$$

Where:

- $w$ is the token (unique word),
- $n(w)$ is the number of documents that $w$ occurs in,
- $N$ is the total number of documents

Write a function, `IDF(docs)`, that takes in a list of lists of words and returns a dictionary `word -> idf score`.

Example:

```
IDF([['interview', 'questions'], ['interview', 'answers']]) -> {'questions': 0.0, 
                                                                'interview': -0.4, 
                                                                'answers': 0.0}


```
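
Working the example through the formula may help: there are $N = 2$ documents, `interview` occurs in both of them, and `questions` and `answers` occur in one each.

$$IDF(\text{interview}) = \ln\left(\dfrac{2}{1 + 2}\right) = \ln\left(\dfrac{2}{3}\right) \approx -0.405$$

$$IDF(\text{questions}) = IDF(\text{answers}) = \ln\left(\dfrac{2}{1 + 1}\right) = \ln(1) = 0$$

The $+1$ in the denominator is what pushes a word that appears in every document below zero.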

In [62]:
import math

def IDF(docs):

    n = {}
    # if a list of docs is fed
    if isinstance(docs[0], list):

        # flatten docs and create set of unique words (w) in docs
        unique_words = set(item for sublist in docs for item in sublist)

        # fill n dict with the number of docs that w occurs in
        for w in unique_words:
            n[w] = sum(w in sublist for sublist in docs)

    # if a single doc is fed
    else:

        # create set of unique words (w) in doc
        unique_words = set(docs)

        # fill n dict with the occurrences of w
        # (here each token position plays the role of a document)
        for w in unique_words:
            n[w] = docs.count(w)

    # store N value (number of docs, or number of tokens for a single doc)
    N = len(docs)

    # fill idf dict with idf values for each w
    idf = {}
    for w in unique_words:
        idf[w] = math.log(N / (1 + n[w]))

    return idf

In [63]:
IDF([['interview', 'questions'], ['interview', 'answers']])

{'answers': 0.0, 'questions': 0.0, 'interview': -0.40546510810816444}

# Stretch Goal: IDF on Hamlet

Calculate the IDF dictionary on the Hamlet book.

What's the IDF of "Hamlet"?

What's the word with the highest IDF in the book?

In [74]:
import os

# Import file
f = open("data/hamlet.txt", "r")
hamlet = []
for line in f.readlines():
    hamlet += line.split()
f.close()

# Skipped here: decoding the file as UTF-8 (it starts with a BOM) and
# removing punctuation and capitalization. That is why tokens such as
# 'ï»¿THE' and 'â€™twould' show up in the output below.

print("Calculate the IDF dictionary on the Hamlet book.")
idf_hamlet = IDF(hamlet)
# print the top 10
print(sorted(((value,key) for (key,value) in idf_hamlet.items()), reverse=True)[0:10])
print('')

print("What's the IDF of 'Hamlet'?")
print( str(idf_hamlet["Hamlet"]) )
print('')

print("What's the word with the highest IDF in the book?")
print(sorted(((value,key) for (key,value) in idf_hamlet.items()), reverse=True)[0])
print('')

Calculate the IDF dictionary on the Hamlet book.
[(9.678968055041988, 'ï»¿THE'), (9.678968055041988, 'â€™twould'), (9.678968055041988, 'â€™twereâ€”I'), (9.678968055041988, 'â€™tween'), (9.678968055041988, 'â€™tis,'), (9.678968055041988, 'â€™pear'), (9.678968055041988, 'â€™noyance;'), (9.678968055041988, 'â€™gins'), (9.678968055041988, 'â€™Twill'), (9.678968055041988, 'â€™One')]

What's the IDF of 'Hamlet'?
6.970917853939778

What's the word with the highest IDF in the book?
(9.678968055041988, 'ï»¿THE')
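
The top-scoring tokens above are UTF-8 bytes (a byte-order mark and typographic apostrophes) decoded with a single-byte codec, which suggests the file was opened with the platform default encoding. A minimal sketch of the skipped cleaning step follows. `read_tokens` is a hypothetical helper, the set of characters to strip is an assumption, and a small temporary file stands in for `data/hamlet.txt`.

```python
import os
import string
import tempfile

def read_tokens(path):
    """Read `path` as UTF-8 (dropping a leading BOM) and return cleaned tokens."""
    # 'utf-8-sig' decodes UTF-8 and silently removes a BOM if one is present
    with open(path, "r", encoding="utf-8-sig") as f:
        text = f.read()
    # strip ASCII punctuation plus curly quotes and the em dash from token edges
    strip_chars = string.punctuation + "\u2018\u2019\u201c\u201d\u2014"
    tokens = (t.lower().strip(strip_chars) for t in text.split())
    return [t for t in tokens if t]

# Demo on a small BOM-prefixed file standing in for data/hamlet.txt
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write("\ufeffTHE play: \u2019twould be, \u2019tis,".encode("utf-8"))
    print(read_tokens(path))  # ['the', 'play', 'twould', 'be', 'tis']
```

Feeding the resulting list to `IDF` would change the scores reported above, since merged casings and stripped punctuation alter the token counts.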

