# Safe dict reading

Define a function `safe_dict(d, k)` that takes a Python dict `d` and a key `k` and reads from the dictionary safely, even when the key is not in it. If the key is missing, the function should return 0 instead of raising an error.

```
d = {1 : 2, 3 : 4}
safe_dict(d, 1) -> 2
safe_dict(d, 'cat') -> 0
```

In [1]:
# Return d[k] when the key is present, and 0 when it is missing.
def safe_dict(d, k):
    try:
        return d[k]
    except KeyError:
        return 0

print(safe_dict({1 : 2, 3 : 4}, 1)) 
print(safe_dict({1 : 2, 3 : 4}, 'cat'))

2
0
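
The same behaviour is available through the built-in `dict.get`, which takes a fallback value as its second argument. The sketch below is only an alternative to the `try`/`except` version; the name `safe_dict_get` and the `default` parameter are illustrative additions, not part of the exercise.

```python
def safe_dict_get(d, k, default=0):
    """Return d[k] if k is present, otherwise `default`."""
    # dict.get never raises KeyError; it returns the fallback instead
    return d.get(k, default)

d = {1: 2, 3: 4}
print(safe_dict_get(d, 1))      # 2
print(safe_dict_get(d, 'cat'))  # 0
```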


# File Reading: Hamlet Exercises

Open `hamlet.txt` in the `data` folder

### 1. Mentioned Hamlet

How many times is Hamlet mentioned in the book?

Use Python and line iteration to count the mentions.

In [2]:
counthamlet = 0

# The with block closes the file automatically, even if an error occurs
with open('data/hamlet.txt', 'r') as data_file:
    for line in data_file:
        # Case-insensitive substring count, so "Hamlet's" is included too
        counthamlet += line.upper().count('HAMLET')

print(f"Hamlet is mentioned {counthamlet} times in the book.")

Hamlet is mentioned 474 times in the book.
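
The count above matches `HAMLET` as a substring, so a word such as "hamlets" would also be counted. If only whole-word mentions are wanted, a regular expression with word boundaries is one option. The helper `count_word` below is a hypothetical sketch, shown on a small in-memory sample rather than on `hamlet.txt`, so its result on the real file is not stated here.

```python
import io
import re

def count_word(lines, word):
    """Count case-insensitive whole-word occurrences of `word` in an iterable of lines."""
    # \b marks a word boundary; re.escape guards against special characters in `word`
    pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    return sum(len(pattern.findall(line)) for line in lines)

# Made-up sample text standing in for the file
sample = io.StringIO("Hamlet, Prince of Denmark\nEnter HAMLET.\nThe hamlets nearby\n")
print(count_word(sample, 'hamlet'))  # 2 ("hamlets" is not counted)
```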


### 2. File Reading as a .py program

Make a Python file that defines a function that counts the number of times Hamlet is mentioned, using the code from the previous exercise.

Then import it in your notebook and call it here.

In [3]:
# Write a Python file that defines a function counting the number of times
# Hamlet is mentioned, using the code from the previous exercise

with open('hamletfunctions.py', 'w') as f:
    f.write(
"""
def counthamlet(datafile):
    # Use a separate name for the counter so it does not shadow the function
    count = 0
    with open(datafile, 'r') as data_file:
        for line in data_file:
            count += line.upper().count('HAMLET')
    return count
""")

In [4]:
import hamletfunctions

hamletfunctions.counthamlet('data/hamlet.txt')

474

### 3. Unique words in hamlet

Write a program that counts the unique words in hamlet.

In [5]:
# Program that counts the unique words in hamlet

import re
regex = re.compile(r'\s+')  # raw string avoids the invalid escape sequence warning

def uniquewords(datafile):
    data_file = open(datafile, 'r')
    
    allwords = []
    lines = data_file.readlines()
    
    for line in lines:
        if len(line)>1:
            cleanline = line.upper().replace("."," ").replace(";", " ")
            cleanline = cleanline.replace(":", " ").replace(",", " ").replace('"', " ")
            cleanline = cleanline.replace("[", " ").replace("]", " ").replace("_", " ")
            cleanline = cleanline.replace("-", " ").replace("!", " ").replace("?", " ").replace("\n", " ")
            for e in regex.split(cleanline.strip()):
                allwords.append(e)

    data_file.close()
    #print(sorted(set(allwords)))   
    count_unique=len(set(allwords))
    return count_unique

print(f"There are {uniquewords('data/hamlet.txt')} unique words in the book.")

There are 4900 unique words in the book.
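
The chain of `replace` calls above can also be expressed as a single regular expression that extracts words directly. This is a sketch under a different tokenization rule (letters and apostrophes only), so its count on `hamlet.txt` will generally differ from the figure above; the name `unique_words` is hypothetical.

```python
import re

def unique_words(text):
    """Return the set of distinct words in `text`, ignoring case."""
    # [a-z']+ keeps apostrophes so contractions such as "o'er" stay one token
    return set(re.findall(r"[a-z']+", text.lower()))

sample = "To be, or not to be: that is the question."
print(len(unique_words(sample)))  # 8
```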


# File Reading 2: A Python library.

In the `data` folder, you will find a folder called `csrgraph` which is a python library.

### 1. File count

Count the `py` files in the library using the `os` package

In [6]:
# Count the .py files in the library using the os package

import os
listfiles = os.listdir('data/csrgraph')
count = sum(1 for name in listfiles if name.endswith('.py'))
print(count)

8


### 2. For the following packages, count the number of files that import them:

- pandas 

- numpy

- numba

In [7]:
def countimports(package, directory):
    # Note: this is a substring search, so it counts every file that
    # mentions the package name anywhere, not only in import statements.
    countfiles = 0
    for name in os.listdir(directory):
        if name.endswith('.py'):
            filepath = os.path.join(directory, name)
            with open(filepath, 'r') as f:
                readfile = f.read().replace('\n', '')
            if package in readfile:
                countfiles += 1
    return countfiles

print(countimports('pandas','data/csrgraph'))
print(countimports('numpy','data/csrgraph'))
print(countimports('numba','data/csrgraph'))

4
6
6
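
Since `countimports` looks for the package name anywhere in the file, a comment or a docstring that mentions the name is enough to count the file. A stricter variant could match only lines that begin with `import <package>` or `from <package>`. The sketch below is one way to do that, with a hypothetical name `count_importing_files`; it is demonstrated on a throwaway directory, and its counts for `data/csrgraph` are not claimed here. It does not handle forms such as `import os, numpy`.

```python
import os
import re
import tempfile

def count_importing_files(package, directory):
    """Count .py files with a line starting `import package` or `from package`."""
    pattern = re.compile(
        r'^\s*(?:import|from)\s+' + re.escape(package) + r'\b', re.MULTILINE
    )
    count = 0
    for name in os.listdir(directory):
        if name.endswith('.py'):
            with open(os.path.join(directory, name), 'r') as f:
                if pattern.search(f.read()):
                    count += 1
    return count

# Tiny self-contained demo on a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'a.py'), 'w') as f:
        f.write("import numpy as np\n")
    with open(os.path.join(tmp, 'b.py'), 'w') as f:
        f.write("# numpy is mentioned here but never imported\n")
    print(count_importing_files('numpy', tmp))  # 1
```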


# First NLP Program: IDF

Given a list of documents, each of which is a list of words, the inverse document frequency (IDF) is a basic statistic of how much information each word carries: words that occur in fewer documents get a higher score.

The IDF formula is:

$$IDF(w) = \ln\left(\dfrac{N}{1 + n(w)}\right)$$

Where:

- $w$ is the token (unique word),
- $n(w)$ is the number of documents that $w$ occurs in,
- $N$ is the total number of documents
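
As a quick check of the formula, take the two-document example used in this exercise ($N = 2$). The word "interview" occurs in both documents, and "questions" occurs in one:

$$IDF(\text{interview}) = \ln\left(\dfrac{2}{1 + 2}\right) = \ln\dfrac{2}{3} \approx -0.405$$

$$IDF(\text{questions}) = \ln\left(\dfrac{2}{1 + 1}\right) = \ln 1 = 0$$

Because of the $1$ in the denominator, a word that appears in every document gets a negative score under this definition.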

Write a function `idf(docs)` that takes a list of lists of words and returns a dictionary `word -> idf score`.

Example:

```
idf([['interview', 'questions'], ['interview', 'answers']]) -> {'questions': 0.0,
                                                                'interview': -0.4,
                                                                'answers': 0.0}
```

In [8]:
# Function idf(docs) that takes a list of lists of words
# and returns a dictionary word -> idf score

import numpy as np

def idf(docs):
    # Build the vocabulary; this also works when documents differ in length.
    # Sorting keeps the output order stable.
    tokens = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)

    scores = {}
    for w in tokens:
        # n(w): number of documents that contain w as a whole token
        # (a substring test would wrongly match 'view' inside 'interview')
        n = sum(1 for doc in docs if w in doc)
        scores[w] = np.log(n_docs / (1 + n))

    return scores

# Test scenario

docs = [['interview', 'questions'], ['interview', 'answers']]
idf(docs)

{'answers': 0.0, 'interview': -0.40546510810816444, 'questions': 0.0}

# Stretch Goal: TF-IDF on Hamlet

The TF-IDF score is a commonly used statistic for the importance of words. It is the product $TF \times IDF$, where TF is the "term frequency" (i.e. how often the word occurs in the document).

Calculate the TF-IDF dictionary on the Hamlet book.

What's the TF-IDF of "Hamlet"?

What's the word with the highest TF-IDF in the book?
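
One possible starting point is sketched below. It rests on several assumptions that the exercise does not fix: TF is taken as the relative frequency of a word within a document, IDF follows the formula from the previous section, and the book would first have to be split into several documents (for example one per scene) for IDF to be meaningful. The names `idf_table` and `tf_idf` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def idf_table(docs):
    """IDF per word, using ln(N / (1 + n(w))) as in the previous section."""
    n_docs = len(docs)
    vocab = {w for doc in docs for w in doc}
    return {
        w: math.log(n_docs / (1 + sum(1 for doc in docs if w in doc)))
        for w in vocab
    }

def tf_idf(doc, idf_scores):
    """TF-IDF for one document: relative term frequency times IDF (an assumed TF definition)."""
    counts = Counter(doc)
    total = len(doc)
    return {w: (c / total) * idf_scores[w] for w, c in counts.items()}

# Toy documents standing in for pieces of the book
docs = [['to', 'be', 'or', 'not', 'to', 'be'], ['to', 'die'], ['to', 'sleep']]
scores = tf_idf(docs[0], idf_table(docs))
print(max(scores, key=scores.get))  # be
```

With this IDF definition, a word that appears in every document (here "to") ends up with a negative score.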

# Stretch Goal: Speaker count

Use a regular expression and looping over the `hamlet.txt` file to build a dictionary `character_name -> # times speaking`.

Who speaks the most often? Who speaks the least often?
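
A possible shape for the solution is sketched below. The layout of `hamlet.txt` is not shown in this notebook, so the pattern is an assumption: it expects each speech to start with a capitalized, abbreviated name followed by a period and more text on the same line (for example `  Ham. To be...`). The pattern `SPEAKER` and the helper `count_speakers` are hypothetical and would need adjusting to the real file.

```python
import io
import re
from collections import Counter

# Assumed speech prefix: optional indent, capitalized name, period, then text on the same line
SPEAKER = re.compile(r'^\s*([A-Z][a-z]+)\.[ \t]+\S')

def count_speakers(lines, pattern=SPEAKER):
    """Build a Counter mapping character_name -> number of speeches."""
    counts = Counter()
    for line in lines:
        match = pattern.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Made-up sample in the assumed layout
sample = io.StringIO(
    "  Ham. To be, or not to be.\n"
    "  Oph. Good my lord.\n"
    "  Ham. I humbly thank you.\n"
    "Exeunt.\n"
)
counts = count_speakers(sample)
print(counts.most_common(1))  # [('Ham', 2)]
```

`Counter.most_common()` returns the speakers sorted by count, so the first entry speaks most often and the last entry least often.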