# Safe dict reading

Define a function `safe_dict(d, k)` that takes a Python dict `d` and a key `k` and reads `d[k]` safely: if `k` is not in the dictionary, it returns 0 instead of raising a `KeyError`.

```
d = {1 : 2, 3 : 4}
safe_dict(d, 1) -> 2
safe_dict(d, 'cat') -> 0
```

In [2]:
d = {1 : 2, 3 : 4}

def safe_dict(d, k):
    
    if k in d:
        return d[k]
    else:
        return 0
    
print(safe_dict(d, 1))
print(safe_dict(d, "cat"))

2
0
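
The same behaviour is also available from the built-in `dict.get` method, which accepts a default value for missing keys. A minimal sketch follows; the name `safe_dict_get` is hypothetical and only used to avoid clashing with `safe_dict` above.

```python
def safe_dict_get(d, k):
    # dict.get returns its second argument when k is not a key of d
    return d.get(k, 0)

d = {1: 2, 3: 4}
print(safe_dict_get(d, 1))      # 2
print(safe_dict_get(d, "cat"))  # 0
```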


# File Reading: Hamlet Exercises

Open `hamlet.txt` in the `data` folder.

### 1. Mentions of Hamlet

How many times is Hamlet mentioned in the book?

Use Python and line iteration to count it up.

In [19]:
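# Count the lines that mention "hamlet" (case-insensitive).
# Note: a line with several mentions is still counted once.
path = "C:/Users/rando/Desktop/data-science/m1-4-files-strings/data/hamlet.txt"

with open(path, "r") as f:  # the with-block closes the file automatically
    lines = f.readlines()

count = 0

for line in lines:
    if "hamlet" in line.lower():
        count += 1
count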

469

### 2. File Reading as a .py program

Make a Python file that defines a function that counts the number of times Hamlet is mentioned, using the code from the previous exercise.

Then import it in your notebook and call it here.

In [27]:
# %load data/count_hamlet.py
path = "C:/Users/rando/Desktop/data-science/m1-4-files-strings/data/hamlet.txt"

with open(path, "r") as f:  # the with-block closes the file automatically
    lines = f.readlines()

count = 0

for line in lines:
    if "hamlet" in line.lower():
        count += 1

count

469
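
Since the exercise asks for a function, here is one way the counting logic might be wrapped up. This is a sketch, not the contents of the actual `count_hamlet.py`: the function name `count_hamlet_lines` and its `path` parameter are assumptions, and the demo writes a small throwaway file so that it runs without `hamlet.txt`.

```python
import os
import tempfile

def count_hamlet_lines(path):
    """Count the lines of the file at `path` that mention "hamlet" (any case)."""
    with open(path, "r") as f:
        return sum(1 for line in f if "hamlet" in line.lower())

# Demo on a temporary file so the sketch is runnable on its own.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "w") as f:
        f.write("HAMLET enters.\nTo be, or not to be.\nExit Hamlet.\n")
    demo_count = count_hamlet_lines(sample)

print(demo_count)  # 2
```

If the function were saved in a file named `count_hamlet.py` in a folder on the import path, the notebook could load it with `from count_hamlet import count_hamlet_lines` and call it with the path to `hamlet.txt`.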

### 3. Unique words in Hamlet

Write a program that counts the unique words in Hamlet.

In [28]:
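path = "C:/Users/rando/Desktop/data-science/m1-4-files-strings/data/hamlet.txt"

with open(path, "r") as f:
    lines = f.readlines()

# A set stores each distinct word once and makes membership checks fast.
# Words are whitespace-separated tokens, so case and punctuation still matter.
words = set()

for line in lines:
    words.update(line.split())

len(words)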

7676
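
The count above treats `Hamlet`, `hamlet` and `Hamlet,` as three different words. If a case- and punctuation-insensitive count is wanted, one possible approach is to lowercase each line and extract words with a regular expression. The helper name `unique_words` and the pattern `[a-z']+` are illustrative assumptions, and the result on the real book will differ from 7676.

```python
import re

def unique_words(lines):
    """Return the set of distinct lowercase words found in an iterable of lines."""
    words = set()
    for line in lines:
        # [a-z']+ keeps letters and apostrophes and drops other punctuation
        words.update(re.findall(r"[a-z']+", line.lower()))
    return words

sample = ["To be, or not to be:", "that is the question."]
print(len(unique_words(sample)))  # 8
```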

# File Reading 2: A Python library.

In the `data` folder, you will find a folder called `csrgraph`, which is a Python library.

### 1. File count

Count the `.py` files in the library using the `os` module.

In [33]:
import os

# Note: this counts every entry in the folder, not only names ending in ".py".
len(os.listdir(path="C:/Users/rando/Desktop/data-science/m1-4-files-strings/data/csrgraph"))

8
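
To count only Python files, the directory listing can be filtered by file extension. The sketch below assumes the files of interest sit directly inside the folder (no subfolders are searched); `count_py_files` is a hypothetical helper, and the demo uses a temporary folder rather than `csrgraph`.

```python
import os
import tempfile

def count_py_files(folder):
    """Count the entries directly inside `folder` whose name ends in .py."""
    return sum(1 for name in os.listdir(folder) if name.endswith(".py"))

# Demo on a temporary folder holding two .py files and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.py", "b.py", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    demo_count = count_py_files(tmp)

print(demo_count)  # 2
```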

### 2. For the following packages, count the number of files that import them:

- pandas 

- numpy

- numba

In [36]:
folder = "C:/Users/rando/Desktop/data-science/m1-4-files-strings/data/csrgraph"
files = os.listdir(path=folder)

# Counts files that import at least one of the three packages (one combined total).
# Note: lines such as `from numba import ...` do not contain "import numba".
file_count = 0

for file in files:
    with open(f"{folder}/{file}") as f:
        lines = f.readlines()

    for line in lines:
        if "import pandas" in line or "import numpy" in line or "import numba" in line:
            file_count += 1
            break

file_count

6
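
The exercise can also be read as asking for a separate count per package. A possible sketch of that reading is below: it matches both `import numpy` and `from numpy import ...` at the start of a line. The helper names are hypothetical, the sources are passed in as strings so the example runs without the `csrgraph` folder, and comma-separated imports such as `import os, numpy` are not detected.

```python
import re

def imports_package(source, package):
    """Return True if `source` has a line starting with `import package` or `from package`."""
    pattern = rf"^\s*(?:import|from)\s+{re.escape(package)}\b"
    return re.search(pattern, source, flags=re.MULTILINE) is not None

def count_importers(sources, packages):
    """Map each package name to the number of source texts that import it."""
    return {p: sum(imports_package(s, p) for s in sources) for p in packages}

# Three made-up file contents standing in for the library's files.
sources = [
    "import numpy as np\nimport pandas as pd\n",
    "from numba import jit\nimport numpy as np\n",
    "import os\n",
]
print(count_importers(sources, ["pandas", "numpy", "numba"]))
# {'pandas': 1, 'numpy': 2, 'numba': 1}
```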

# First NLP Program: IDF

Given a list of documents, each a list of words, the inverse document frequency (IDF) is a basic measure of how much information each word carries: words that appear in fewer documents score higher.

The IDF formula is:

$$IDF(w) = \ln\left(\frac{N}{1 + n(w)}\right)$$

Where:

- $w$ is the token (unique word),
- $n(w)$ is the number of documents that $w$ occurs in,
- $N$ is the total number of documents
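
As a quick check of the arithmetic, take $N = 2$ documents: a word found in both has $n(w) = 2$, and a word found in only one has $n(w) = 1$. The variable names below are illustrative.

```python
import math

N = 2  # total number of documents

idf_in_both = math.log(N / (1 + 2))  # n(w) = 2
idf_in_one = math.log(N / (1 + 1))   # n(w) = 1

print(round(idf_in_both, 3))  # -0.405
print(round(idf_in_one, 3))   # 0.0
```

Because of the `1 +` in the denominator, a word that appears in every document gets a negative score with this formula.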

Write a function `IDF(docs)` that takes a list of lists of words and returns a dictionary `word -> IDF score`.

Example:

```
IDF([['interview', 'questions'], ['interview', 'answers']]) -> {'questions': 0.0, 
                                                                'interview': -0.4, 
                                                                'answers': 0.0}


```

In [13]:
import math

def IDF(doc):
    
    total_docs = len(doc)
    doc_count = {}   # word -> number of documents containing it, n(w)
    word_idf = {}
    
    for d in doc:
        # dict.fromkeys drops repeats within a document but keeps word order,
        # so each document adds at most 1 to a word's count
        for word in dict.fromkeys(d):
            doc_count[word] = doc_count.get(word, 0) + 1
    
    for word in doc_count:
        word_idf[word] = round(math.log(total_docs / (1 + doc_count[word])), 1)
        
    return word_idf

IDF([['interview', 'questions'], ['interview', 'answers']])

{'interview': -0.4, 'questions': 0.0, 'answers': 0.0}

# Stretch Goal: TF-IDF on Hamlet

The TF-IDF score is a commonly used statistic for the importance of words. It is the product $TF \times IDF$, where TF is the "term frequency" (i.e., how often the word occurs in the document).

Calculate the TF-IDF dictionary on the Hamlet book.

What's the TF-IDF of "Hamlet"?

What's the word with the highest TF-IDF in the book?
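
One possible starting point is sketched below. The exercise does not say what counts as a "document" within a single book, so this sketch assumes the text has already been split into a list of tokenised documents (for example one per line or per scene), takes TF as a word's share of all tokens, and uses the IDF formula from the previous section. The function name `tf_idf` and the toy data are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {word: TF * IDF} for a list of tokenised documents."""
    n_docs = len(docs)
    all_words = [w for doc in docs for w in doc]
    total = len(all_words)
    term_counts = Counter(all_words)                           # occurrences overall
    doc_counts = Counter(w for doc in docs for w in set(doc))  # n(w)
    return {
        w: (term_counts[w] / total) * math.log(n_docs / (1 + doc_counts[w]))
        for w in term_counts
    }

# Toy stand-in for the book: three short "documents".
docs = [
    ["to", "be", "or", "not", "to", "be"],
    ["that", "is", "the", "question"],
    ["to", "die"],
]
scores = tf_idf(docs)
print(max(scores, key=scores.get))  # be
print(scores["to"])                 # 0.0
```

Here "to" is the most frequent word but appears in two of the three documents, so its IDF is $\ln(3/3) = 0$ and its TF-IDF is 0.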

# Stretch Goal: Speaker count

Use a regular expression and a loop over the lines of `hamlet.txt` to build a dictionary `character_name -> # times speaking`.

Who speaks the most often? Who speaks the least often?
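
A rough sketch of the idea follows. How speakers are marked in `hamlet.txt` is not shown here, so the pattern is an assumption: it treats a line that starts with a capitalised name followed by a period (such as `Ham.`) as the start of a speech. The pattern would need adjusting to the real file, and it can produce false positives on ordinary lines that begin with a one-word sentence.

```python
import re
from collections import Counter

# Assumed format: a speech starts with an abbreviated name and a period, e.g. "Ham. ..."
SPEAKER = re.compile(r"^\s*([A-Z][a-z]+)\.(?:\s|$)")

def count_speakers(lines):
    """Return a Counter mapping speaker label -> number of speeches started."""
    counts = Counter()
    for line in lines:
        match = SPEAKER.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Made-up lines in the assumed format, standing in for the file.
sample = [
    "Ham. To be, or not to be, that is the question:",
    "Whether 'tis nobler in the mind to suffer",
    "Oph. Good my lord,",
    "Ham. I humbly thank you; well, well, well.",
]
counts = count_speakers(sample)
print(counts.most_common(1))  # [('Ham', 2)]
```

With a `Counter`, `counts.most_common()` lists speakers from most to least frequent, which answers both questions at once.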