# Warmup: Counter

Count how many times each element in a list occurs.

```
[1, 3, 2, 1, 5, 3, 5, 1, 4] ⇒

    1: 3 times
    2: 1 time
    3: 2 times
    4: 1 time
    5: 2 times
```


In [3]:
def counter(random_list):
    # Map each element to the number of times it has been seen so far
    occurrences = {}
    for instance in random_list:
        occurrences[instance] = occurrences.get(instance, 0) + 1
    return occurrences

counter([1, 3, 2, 1, 5, 3, 5, 1, 4])

{1: 3, 3: 2, 2: 1, 5: 2, 4: 1}
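
The standard library's `collections.Counter` does the same tallying in one call. A minimal sketch, assuming the list elements are hashable; the variable names `values` and `counts` are only illustrative:

```python
from collections import Counter

values = [1, 3, 2, 1, 5, 3, 5, 1, 4]
counts = Counter(values)  # maps each element to the number of times it occurs

# Print in sorted key order, like the listing at the top of this section
for value in sorted(counts):
    print(f"{value}: {counts[value]}")
```

A `Counter` is a `dict` subclass, so it can be read like the dictionary returned by `counter` above.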

# Safe dict reading

Define a function `safe_dict(d, k)` that takes a Python dict `d` and a key `k` and reads `d[k]` safely, even when `k` is not in the dictionary. Reading with a missing key should return 0 instead of raising a `KeyError`.

```
d = {1 : 2, 3 : 4}
safe_dict(d, 1) -> 2
safe_dict(d, 'cat') -> 0
```

In [9]:
def safe_dict(d, k):
    # dict.get returns the default (0) when the key is missing
    return d.get(k, 0)

d = {1 : 2, 3 : 4}
print(safe_dict(d, 1))
print(safe_dict(d, 'cat'))

2
0


# File Reading: Hamlet Exercises

Open `hamlet.txt` in the `data` folder.

### 1. Mentions of Hamlet

How many times is Hamlet mentioned in the book?

Use Python and line iteration to count the mentions.

In [23]:
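# Sum the case-insensitive occurrences of "hamlet" on each line;
# the with block closes the file once the count is done
with open("data/hamlet.txt", "r") as hamlet:
    count = sum(line.lower().count('hamlet') for line in hamlet)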

print(count)

474


### 2. File Reading as a .py program

Make a Python file that defines a function that counts the number of times Hamlet is mentioned, using the code from the previous exercise.

Then import it in your notebook and call it here.

In [22]:
import file_reader

print(file_reader.hamlet_count())

474
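
The contents of `file_reader.py` are not shown in this notebook. A minimal sketch of what such a module might contain, assuming it reuses the line-iteration count from exercise 1; the `path` parameter and its default value are assumptions added for illustration:

```python
# file_reader.py (hypothetical contents; the real module is not shown here)

def hamlet_count(path="data/hamlet.txt"):
    """Count case-insensitive occurrences of 'hamlet', line by line."""
    with open(path, "r") as hamlet:
        return sum(line.lower().count("hamlet") for line in hamlet)
```

Giving `path` a default keeps the call `file_reader.hamlet_count()` working with no arguments while still allowing another file to be passed in.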


### 3. Unique words in Hamlet

Write a program that counts the unique words in Hamlet.

In [26]:
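import collections

def count_uniques():
    # Open the file inside the function so repeated calls re-read it from the start
    with open("data/hamlet.txt", "r") as hamlet:
        content = hamlet.read()
    # Splitting on single spaces leaves newlines and punctuation attached to tokens
    return len(collections.Counter(content.split(" ")))

count_uniques()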

9327
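
Because the text is split on single spaces, tokens such as `be,` and `be` are counted as different words. A stricter variant would lowercase the text and extract only letter sequences first. A minimal sketch, assuming that a "word" is a run of letters and apostrophes; the helper name `unique_words` and the regular expression are illustrative choices, and the count it produces for Hamlet will differ from the one above:

```python
import re

def unique_words(text):
    """Return the set of distinct lowercase words in `text`."""
    # [a-z']+ keeps contractions such as "o'er" in one piece (an assumption)
    return set(re.findall(r"[a-z']+", text.lower()))

sample = "To be, or not to be:\nthat is the question."
print(len(unique_words(sample)))
```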

# File Reading 2: A Python library

In the `data` folder, you will find a folder called `csrgraph`, which is a Python library.

### 1. File count

Count the `py` files in the library using the `os` package.

In [28]:
import os

def file_count():
    # os.listdir returns every entry in the folder, not only .py files
    return len(os.listdir('data/csrgraph'))

print(file_count())

8
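
To count only the `py` files, the directory listing can be filtered by extension. A minimal sketch, assuming the files of interest sit at the top level of the folder; the helper name `py_file_count` and its `directory` parameter are illustrative:

```python
import os

def py_file_count(directory="data/csrgraph"):
    """Count directory entries whose names end in .py (top level only)."""
    return sum(1 for name in os.listdir(directory) if name.endswith(".py"))
```

This looks at names only, so it does not descend into subfolders.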


### 2. For the following packages, count the number of files that import them:

- pandas
- numpy
- numba

In [15]:
import os

def file_count():
    # Number of files whose text mentions each package name
    imported_packages = {
        'pandas': 0,
        'numpy': 0,
        'numba': 0
    }

    for file_string in os.listdir('data/csrgraph'):
        with open(f'data/csrgraph/{file_string}', "r") as file_buffer:
            content = file_buffer.read()
        for key in imported_packages:
            # Substring test: matches any mention of the name, not only import lines
            if key in content:
                imported_packages[key] += 1

    return imported_packages

file_count()

{'pandas': 4, 'numpy': 6, 'numba': 6}
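
A substring search also counts files that merely mention a package in a comment or docstring. A stricter check would look for lines that start with `import <package>` or `from <package>`. A minimal sketch, assuming one package per import statement; the helper name `imports_package` and the pattern are illustrative, and the resulting counts may differ from the ones above:

```python
import re

def imports_package(source, package):
    """Return True if `source` has a line starting with `import package` or `from package`."""
    pattern = rf"^\s*(?:import|from)\s+{re.escape(package)}\b"
    return re.search(pattern, source, flags=re.MULTILINE) is not None

example = "import numpy as np\nfrom numba import jit\n# pandas is not needed here\n"
print({name: imports_package(example, name) for name in ("pandas", "numpy", "numba")})
```

This does not catch comma-separated forms such as `import os, numpy`.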

# First NLP Program: IDF

Given a collection of documents, each a list of words, the inverse document frequency (IDF) is a basic statistic of how much information each word carries in the text.

The IDF formula is:

$$IDF(w) = \ln\left(\dfrac{N}{1 + n(w)}\right)$$

Where:

- $w$ is the token (unique word),
- $n(w)$ is the number of documents that $w$ occurs in,
- $N$ is the total number of documents

Write a function, `IDF(docs)`, that takes a list of lists of words and returns a dictionary mapping each word to its IDF score (`word -> idf score`).

Example:

```
IDF([['interview', 'questions'], ['interview', 'answers']]) -> {'questions': 0.0, 
                                                                'interview': -0.4, 
                                                                'answers': 0.0}
```
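
The values in this example can be checked by hand. Here $N = 2$; `interview` occurs in both documents, so $n(w) = 2$, while `questions` and `answers` each occur in one document, so $n(w) = 1$:

$$IDF(\text{interview}) = \ln\left(\dfrac{2}{1 + 2}\right) \approx -0.405$$

$$IDF(\text{questions}) = IDF(\text{answers}) = \ln\left(\dfrac{2}{1 + 1}\right) = 0$$

Rounded to one decimal place, these are the scores listed in the example.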

In [33]:
from math import log

def IDF(list_of_list):
    N = len(list_of_list)
    # n(w): number of documents that contain each word
    word_occurrences = {}
    for document in list_of_list:
        # dict.fromkeys drops repeats within a document while keeping word order,
        # so a word is counted at most once per document
        for word in dict.fromkeys(document):
            word_occurrences[word] = word_occurrences.get(word, 0) + 1
    # Apply ln(N / (1 + n(w))) and round to one decimal place
    return {word: round(log(N / (1 + n)), 1) for word, n in word_occurrences.items()}

IDF([['interview', 'questions'], ['interview', 'answers']])



{'interview': -0.4, 'questions': 0.0, 'answers': 0.0}

# Stretch Goal: IDF on Hamlet

Calculate the IDF dictionary on the Hamlet book.

What's the IDF of "Hamlet"?

What's the word with the highest IDF in the book?
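
The exercise does not say what counts as a "document" inside a single book. One possible starting point is to treat each non-empty line as a document. A minimal sketch under that assumption; the helper name `idf_from_lines`, the word pattern, and the line-as-document choice are all illustrative, not part of the exercise:

```python
import re
from math import log

def idf_from_lines(lines):
    """IDF per word, treating each non-empty line as one document (an assumption)."""
    docs = [set(re.findall(r"[a-z']+", line.lower())) for line in lines]
    docs = [doc for doc in docs if doc]  # drop lines that contain no words
    n_docs = len(docs)
    doc_freq = {}  # n(w): number of lines containing each word
    for doc in docs:
        for word in doc:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: log(n_docs / (1 + n)) for word, n in doc_freq.items()}

# Small in-memory sample; for the book, pass the open file object instead
sample_lines = ["Hamlet speaks", "Hamlet again", "", "the ghost"]
scores = idf_from_lines(sample_lines)
print(scores["hamlet"])
print(max(scores, key=scores.get))
```

Words are lowercased, so the score for "Hamlet" is stored under the key `hamlet`. Every word that occurs on exactly one line shares the highest score, so `max` returns only one of several tied words.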