# Safe dict reading

Define a function `safe_dict(d, k)` that takes a Python dict `d` and a key `k` and reads from the dictionary safely, even when the key is not in it. If the key is missing, the function should return 0 instead of raising an error.

```
d = {1 : 2, 3 : 4}
safe_dict(d, 1) -> 2
safe_dict(d, 'cat') -> 0
```

In [19]:
def safe_dict(d, k):
    # dict.get returns the default (0) when the key is missing
    return d.get(k, 0)

d = {1: 2, 3: 4}
safe_dict(d, 'meow')

0


# File Reading: Hamlet Exercises

Open `hamlet.txt` in the `data` folder

### 1. Mentioned Hamlet

How many times is Hamlet mentioned in the book?

Use Python and line iteration to count it up.

In [2]:
count = 0
with open('/Users/Main/Data_Science/m1-4-files-strings/data/hamlet.txt') as f:
    for line in f:
        # lowercase each line so "Hamlet" and "HAMLET" are counted too
        count += line.lower().count('hamlet')
count

474

### 2. File Reading as a .py program

Make a Python file that defines a function that counts the number of times Hamlet is mentioned, using the code from the previous exercise.

Then import it in your notebook and call it here.

In [17]:
from ham_count import wrd_count
wrd_count('hamlet', '/Users/Main/Data_Science/m1-4-files-strings/data/hamlet.txt')


474
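
The contents of `ham_count.py` are not shown in this notebook. A minimal sketch of what such a module might contain, assuming `wrd_count(word, path)` takes the word first and the file path second (as in the call above) and counts case-insensitively, like the previous exercise:

```python
# ham_count.py (hypothetical contents; the real module is not shown)

def wrd_count(word, path):
    """Count case-insensitive occurrences of `word` in the file at `path`."""
    word = word.lower()
    count = 0
    with open(path) as f:
        for line in f:
            # str.count counts non-overlapping substring matches on this line
            count += line.lower().count(word)
    return count
```

Because this is a substring count, it would also count `hamlet` inside longer tokens such as `hamlet's`.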

### 3. Unique words in Hamlet

Write a program that counts the unique words in Hamlet.

In [4]:
with open("/Users/Main/Data_Science/m1-4-files-strings/data/hamlet.txt", "r") as file:
    uniques = set()
    for line in file:
        # splitting on single spaces keeps punctuation and newlines attached to tokens
        uniques.update(line.split(" "))

print(f"Unique words: {len(uniques)}")

Unique words: 8563
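
Splitting on a single space leaves punctuation, newlines, and capitalization attached to tokens, so `Hamlet,` and `hamlet` count as different words. A rough sketch of a normalized count, assuming a word is any run of letters and apostrophes (the helper name `unique_words` is made up for illustration, and its result on `hamlet.txt` would differ from the figure above):

```python
import re

def unique_words(text):
    """Return the set of lowercase words in `text`."""
    # Assumption: a "word" is a run of letters, optionally with apostrophes
    return set(re.findall(r"[a-z']+", text.lower()))

sample = "To be, or not to be: that is the question."
print(len(unique_words(sample)))
```

Other definitions of a word (keeping hyphens, dropping apostrophes) are equally defensible and would give different counts.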


# File Reading 2: A Python library.

In the `data` folder, you will find a folder called `csrgraph`, which is a Python library.

### 1. File count

Count the `.py` files in the library using the `os` package.

In [5]:
import os
file_lst = os.listdir("/Users/Main/Data_Science/m1-4-files-strings/data/csrgraph")  # list every entry in the csrgraph folder
n_files = len(file_lst)  # number of entries (not filtered by extension)
n_files  # display the count

8
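
The cell above counts every entry that `os.listdir` returns, whatever its extension. A small sketch that keeps only names ending in `.py`, assuming the files sit directly in the folder rather than in subfolders (the helper name `count_py_files` is hypothetical):

```python
import os

def count_py_files(directory):
    """Count the entries in `directory` whose names end in .py."""
    return sum(1 for name in os.listdir(directory) if name.endswith(".py"))
```

Calling `count_py_files("data/csrgraph")` would give the filtered count; it matches the number above only if every entry in the folder is a `.py` file.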

### 2. For the following packages, count the number of files that import them:

- pandas 

- numpy

- numba

In [16]:
#https://stackoverflow.com/questions/34530237/find-files-in-a-directory-containing-desired-string-in-python/34530459
#Modified version of code found (above)
#Pandas
import os

pd_count = 0
for fname in os.listdir("data/csrgraph"):
    # os.path.join adds the separator between the folder path and each fname
    with open(os.path.join("data/csrgraph", fname), 'r') as f:
        if "pandas" in f.read():
            pd_count += 1
pd_count

4

In [14]:
#Numpy

np_count = 0
for fname in os.listdir("data/csrgraph"):
    with open(os.path.join("data/csrgraph", fname), 'r') as f:
        if "numpy" in f.read():
            np_count += 1
np_count

6

In [15]:
#Numba

nba_count = 0
for fname in os.listdir("data/csrgraph"):
    with open(os.path.join("data/csrgraph", fname), 'r') as f:
        if "numba" in f.read():
            nba_count += 1
nba_count

6
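
The three cells above test whether the package name appears anywhere in the file, so a comment or docstring that mentions the name also counts. A stricter sketch that looks for an actual import statement, assuming imports are written as `import pkg` or `from pkg ...` at the start of a line (the helper name `count_importers` is hypothetical, and its results may differ from the counts above):

```python
import os
import re

def count_importers(directory, package):
    """Count .py files in `directory` with a line importing `package`."""
    # Assumption: imports look like "import pkg" or "from pkg ..." at line start
    pattern = re.compile(
        rf"^\s*(?:import|from)\s+{re.escape(package)}\b", re.MULTILINE
    )
    count = 0
    for fname in os.listdir(directory):
        if not fname.endswith(".py"):
            continue
        with open(os.path.join(directory, fname)) as f:
            if pattern.search(f.read()):
                count += 1
    return count
```

A single helper also replaces the three near-identical loops: `count_importers("data/csrgraph", "pandas")`, and likewise for `numpy` and `numba`. It does not catch combined imports such as `import os, numpy`.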

# First NLP Program: IDF

Given a list of documents, each a list of words, the inverse document frequency (IDF) is a basic statistic of how much information each word carries in the text.

The IDF formula is:

$$IDF(w) = \ln\left(\dfrac{N}{1 + n(w)}\right)$$

Where:

- $w$ is the token (unique word),
- $n(w)$ is the number of documents that $w$ occurs in,
- $N$ is the total number of documents

Write a function `idf(docs)` that takes a list of lists of words and returns a dictionary `word -> idf score`.

Example:

```
idf([['interview', 'questions'], ['interview', 'answers']]) -> {'questions': 0.0,
                                                                'interview': -0.4,
                                                                'answers': 0.0}
```
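
The values in the example follow directly from the formula. As a quick check, here is the arithmetic with the document counts written out by hand, rounded to one decimal place as in the example:

```python
import math

N = 2  # two documents in the example
# "interview" occurs in both documents, "questions" and "answers" in one each
doc_counts = {"interview": 2, "questions": 1, "answers": 1}

scores = {w: round(math.log(N / (1 + n_w)), 1) for w, n_w in doc_counts.items()}
print(scores)
```

With the `1 +` in the denominator, a word that appears in every document gets a negative score, which is why `interview` comes out below zero.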

In [8]:
import math as m

def idf(docs):
    emp_dict = {}
    doc_set = [set(x) for x in docs]
    N = len(doc_set)
    for doc in doc_set: #loop through each doc
        for token in doc: #loop through each token in the doc
            emp_dict[token] = emp_dict.get(token, 0) + 1 #count the number of docs where the token appears
    for k in emp_dict:
        emp_dict[k] = round(m.log((N / (1 + emp_dict[k]))), 1) #replace each doc count with its IDF score, rounded to 1 decimal
    return emp_dict
    
idf([['interview', 'questions'], ['interview', 'answers']])

{'interview': -0.4, 'questions': 0.0, 'answers': 0.0}

# Stretch Goal: TF-IDF on Hamlet

The TF-IDF score is a commonly used statistic for the importance of words. It is $TF \times IDF$, where TF is the "term frequency" (i.e. how often the word occurs in the document).

Calculate the TF-IDF dictionary on the Hamlet book.

What's the TF-IDF of "Hamlet"?

What's the word with the highest TF-IDF in the book?
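
One possible starting point, offered as a sketch rather than a reference solution. It assumes the usual product form (TF multiplied by IDF), takes TF as the share of a document's tokens that are the given word, and reuses the IDF formula from the previous section. How to split Hamlet into documents is not specified here; treating each act or scene as one document is one option. The name `tf_idf` is made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf-idf} dict per document in `docs`."""
    n_docs = len(docs)
    # n(w): number of documents containing each word
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # TF: share of the document's tokens that are this word
            word: (c / len(doc)) * math.log(n_docs / (1 + doc_freq[word]))
            for word, c in counts.items()
        })
    return scores
```

Within one document's dictionary `s`, `max(s, key=s.get)` gives its highest-scoring word.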

# Stretch Goal: Speaker count

Use a regular expression and a loop over the `hamlet.txt` file to build a dictionary `character_name -> # times speaking`.

Who speaks the most often? Who speaks the least often?
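
A possible sketch of the loop. The layout of `hamlet.txt` is not shown here, so the pattern below is an assumption: it treats a line that starts with a capitalized abbreviation followed by a period, such as `Ham. `, as the start of a speech. Both `SPEAKER` and `count_speakers` are hypothetical names, and the pattern would need adjusting if the file labels speakers differently.

```python
import re
from collections import Counter

# Assumption: a speech starts with an abbreviated name and a period, e.g. "Ham. ..."
SPEAKER = re.compile(r"^\s*([A-Z][a-z]+)\.\s")

def count_speakers(lines, pattern=SPEAKER):
    """Map each matched speaker label to the number of lines it opens."""
    counts = Counter()
    for line in lines:
        match = pattern.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file object as `lines` iterates over it line by line, and `counts.most_common()` then lists speakers from most to least frequent. A one-word sentence at the start of a line (for example `Yes. `) would be miscounted as a speaker under this pattern.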