Dylan Hastings

# Safe dict reading

Define a function `safe_dict(d, k)` that takes a Python dict `d` and a key `k` and makes the dictionary safe to read, even with keys that aren't in it. If the key is missing, the function should return 0 instead of raising an error.

```
d = {1 : 2, 3 : 4}
safe_dict(d, 1) -> 2
safe_dict(d, 'cat') -> 0
```

In [1]:
def safe_dict(d, k):
    '''
    Read key 'k' from the python dict 'd' safely:
    return d[k] if the key exists, and 0 otherwise.
    '''
    try:
        return d[k]
    except KeyError:
        return 0

Check:

In [2]:
d = {1 : 2, 3 : 4}
safe_dict(d, 'cat')

0
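
For comparison, the built-in `dict.get` method gives the same fallback behavior in one line. The name `safe_dict_get` below is only an illustrative alias and is not part of the exercise.

```python
def safe_dict_get(d, k):
    """Return d[k] if k is a key of d, otherwise 0 (hypothetical helper name)."""
    return d.get(k, 0)

d = {1: 2, 3: 4}
print(safe_dict_get(d, 1))      # 2
print(safe_dict_get(d, 'cat'))  # 0
```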

# File Reading: Hamlet Exercises

Open `hamlet.txt` in the `data` folder

### 1. Mentioned Hamlet

How many times is Hamlet mentioned in the book?

Use Python and line iteration to count it up.

In [3]:
with open('C:/Users/dylan/Downloads/m1-4-files-strings-main/m1-4-files-strings-main/data/hamlet.txt', 'r') as file:
    data = file.read()

In [4]:
words = data.split()

In [5]:
count = 0

for word in words:
    if "hamlet" in word.lower():
        count += 1
        
count

474
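
Since the exercise asks for line iteration, here is a minimal sketch of the same count done one line at a time instead of reading the whole file at once. The function name `count_hamlet_by_line` and the sample text are assumptions for illustration; an `io.StringIO` object stands in for the open file so the sketch runs without the data folder.

```python
import io

def count_hamlet_by_line(lines):
    """Count words containing 'hamlet' (case-insensitive), one line at a time."""
    count = 0
    for line in lines:
        for word in line.split():
            if "hamlet" in word.lower():
                count += 1
    return count

# Stand-in for an open file handle; with the real file this would be
# `with open(<path to hamlet.txt>) as file: count_hamlet_by_line(file)`.
sample = io.StringIO("HAMLET, Prince of Denmark\nHam. Not so, my lord.\nEnter Hamlet.\n")
print(count_hamlet_by_line(sample))  # 2
```

Because `split()` breaks on any whitespace, including newlines, this should give the same total as splitting the full text in one go.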

### 2. File Reading as a .py program

Make a Python file that defines a function that counts the number of times Hamlet is mentioned, using the code from the previous exercise.

Then import it in your notebook and call it here.

In [6]:
import Hamlet_Counter

In [7]:
Hamlet_Counter.hamlet_counter(words)

474
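
The contents of `Hamlet_Counter.py` are not shown in this notebook. A minimal sketch of what such a module could contain, assuming it simply wraps the counting loop from the previous exercise, is:

```python
# Hamlet_Counter.py (hypothetical contents; the original file is not shown)

def hamlet_counter(words):
    """Count the words that contain 'hamlet', ignoring case."""
    count = 0
    for word in words:
        if "hamlet" in word.lower():
            count += 1
    return count

# Quick check on a tiny word list
print(hamlet_counter(["Hamlet,", "Prince", "of", "Denmark"]))  # 1
```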

### 3. Unique words in Hamlet

Write a program that counts the unique words in Hamlet.

In [8]:
unique_words = set(words)

In [9]:
len(unique_words)

7676
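
Note that a plain `set` treats `Hamlet`, `hamlet,` and `HAMLET.` as three different words. One possible refinement, sketched below under the assumption that case and surrounding punctuation should be ignored, is to normalize each word first. The function name `unique_words_normalized` is hypothetical.

```python
import string

def unique_words_normalized(words):
    """Return the set of distinct words after lowercasing and stripping
    leading/trailing punctuation (a hypothetical normalization choice)."""
    cleaned = (word.strip(string.punctuation).lower() for word in words)
    return {word for word in cleaned if word}

sample = ["Hamlet", "hamlet,", "HAMLET.", "the", "The", "--"]
print(len(set(sample)))                      # 6
print(len(unique_words_normalized(sample)))  # 2
```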

# File Reading 2: A Python library.

In the `data` folder, you will find a folder called `csrgraph` which is a python library.

### 1. File count

Count the `.py` files in the library using the `os` package.

In [10]:
import os

In [11]:
path = 'C:/Users/dylan/Downloads/m1-4-files-strings-main/m1-4-files-strings-main/data/csrgraph'

In [12]:
py_count = 0

In [13]:
for file in os.listdir(path):
    if file.endswith(".py"): py_count += 1

In [14]:
py_count

8
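
`os.listdir` only looks at the top level of the folder. If the library had subfolders, a recursive count with `os.walk` would be one way to include them. The sketch below uses a throwaway temporary directory rather than the course data, and the function name `count_py_files` is an assumption.

```python
import os
import tempfile

def count_py_files(root):
    """Count files ending in '.py' under root, including subfolders."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += sum(1 for name in filenames if name.endswith(".py"))
    return total

# Small throwaway tree so the sketch runs without the course data.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    for rel in ("a.py", "b.txt", os.path.join("sub", "c.py")):
        open(os.path.join(root, rel), "w").close()
    print(count_py_files(root))  # 2
```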

### 2. For the following packages, count the number of files that import them:

- pandas
- numpy
- numba

In [15]:
path = 'C:/Users/dylan/Downloads/m1-4-files-strings-main/m1-4-files-strings-main/data/csrgraph/'

In [16]:
pandas_count = 0
numpy_count = 0
numba_count = 0

for name in os.listdir(path):

    # Build the full file path without modifying `path`
    with open(os.path.join(path, name), 'r') as file:
        data = file.read()

    words = data.split()

    # Count each file at most once per package.
    # Note: this checks whether the package name appears as a
    # standalone word in the file, not strictly in an import statement.
    if 'pandas' in words: pandas_count += 1
    if 'numpy' in words: numpy_count += 1
    if 'numba' in words: numba_count += 1

In [17]:
pandas_count

4

In [18]:
numpy_count

6

In [19]:
numba_count

6
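
Matching the bare package name can also count files that only mention a package in a comment or string. A stricter alternative, sketched here as an assumption rather than as the method used above, is to parse each file with the standard `ast` module and collect the names that actually appear in `import` statements. The function name `imported_packages` is hypothetical.

```python
import ast

def imported_packages(source):
    """Return the set of top-level package names imported by Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # e.g. "import numpy.linalg as la" -> "numpy"
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # e.g. "from numba import jit" -> "numba"; relative imports are skipped
            found.add(node.module.split(".")[0])
    return found

sample = "import numpy as np\nfrom numba import jit\n# pandas is only mentioned here\n"
print(sorted(imported_packages(sample)))  # ['numba', 'numpy']
```

Applied to each file's text in the loop, a check such as `'pandas' in imported_packages(data)` would replace the word comparison.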

# First NLP Program: IDF

Given a collection of documents, the inverse document frequency (IDF) is a basic statistic of how much information each word carries.

The IDF formula is:

$$IDF(w) = \ln\left(\dfrac{N}{1 + n(w)}\right)$$

Where:

- $w$ is the token (unique word),
- $n(w)$ is the number of documents that $w$ occurs in,
- $N$ is the total number of documents

Write a function, `IDF(docs)`, that takes in a list of lists of words and returns a dictionary `word -> idf score`.

Example:

```
IDF([['interview', 'questions'], ['interview', 'answers']]) -> {'questions': 0.0,
                                                                'interview': -0.4,
                                                                'answers': 0.0}
```
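
The values in this example follow directly from the formula. There are $N = 2$ documents; "interview" occurs in both, while "questions" and "answers" occur in one each:

$$IDF(\text{interview}) = \ln\left(\dfrac{2}{1 + 2}\right) \approx -0.405$$

$$IDF(\text{questions}) = IDF(\text{answers}) = \ln\left(\dfrac{2}{1 + 1}\right) = 0$$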

In [100]:
import math

In [199]:
def IDF(docs):
    '''
    Given a list of documents (each a list of words), return a
    dictionary mapping each word to its Inverse Document Frequency (IDF).
    '''
    N = len(docs)  # total number of documents
    n_w = {}       # number of documents each word occurs in

    for doc in docs:
        # Deduplicate so a word is counted at most once per document
        for word in dict.fromkeys(doc):
            n_w[word] = n_w.get(word, 0) + 1

    return {word: math.log(N / (1 + count)) for word, count in n_w.items()}

Check:

In [200]:
IDF([['interview', 'questions'], ['interview', 'answers']])

{'interview': -0.40546510810816444, 'questions': 0.0, 'answers': 0.0}

# Stretch Goal: TF-IDF on Hamlet

The TF-IDF score is a commonly used statistic for the importance of words. It is the product $TF \times IDF$, where TF is the "term frequency" (i.e., how often the word occurs in the document).

Calculate the TF-IDF dictionary on the Hamlet book.

What's the TF-IDF of "Hamlet"?

What's the word with the highest TF-IDF in the book?
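
One possible starting point is sketched below. It rests on several assumptions that the exercise leaves open: the book is split into "documents" (for example lines or scenes), TF is a word's relative frequency over the whole book, IDF uses the formula from the previous section, and the two are combined as the standard product. The function name `tf_idf` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Map each word to TF * IDF, where TF is the word's relative frequency
    over all documents and IDF is ln(N / (1 + n(w)))."""
    N = len(docs)
    # Total occurrences of each word across all documents
    term_counts = Counter(word for doc in docs for word in doc)
    # Number of documents each word appears in (counted once per document)
    doc_counts = Counter(word for doc in docs for word in set(doc))
    total = sum(term_counts.values())
    return {
        word: (term_counts[word] / total) * math.log(N / (1 + doc_counts[word]))
        for word in term_counts
    }

# Toy corpus standing in for the split-up book
docs = [["to", "be"], ["or", "not"], ["to", "be"], ["that", "is"]]
scores = tf_idf(docs)
print(scores["to"])                 # (2/8) * ln(4/3)
print(max(scores, key=scores.get))  # a word with the highest score
```

With this IDF formula, words that occur in nearly every document get a score at or below zero, so how the book is split into documents strongly affects the ranking.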

# Stretch Goal: Speaker count

Use a regular expression while looping over the `hamlet.txt` file to build a dictionary `character_name -> # times speaking`.

Who speaks the most often? Who speaks the least often?
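
A rough sketch of one approach follows. The layout of `hamlet.txt` is not shown here, so the pattern assumes that each speech begins with optional indentation, a capitalized speaker abbreviation, and a period (for example `Ham.`). The pattern, the function name `speaker_counts`, and the sample lines are assumptions and would need adjusting to the real file.

```python
import re
from collections import Counter

# Assumed format: a speech starts with optional indentation, a capitalized
# speaker abbreviation, a period, and then the spoken text, e.g. "  Ham. Not so."
SPEAKER = re.compile(r"^\s*([A-Z][a-z]+)\.\s+\S")

def speaker_counts(lines):
    """Count how many lines open a speech for each speaker label."""
    counts = Counter()
    for line in lines:
        match = SPEAKER.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample = [
    "  Ham. A little more than kin, and less than kind!",
    "  King. How is it that the clouds still hang on you?",
    "  Ham. Not so, my lord. I am too much i' th' sun.",
    "    Enter Ophelia.",
]
counts = speaker_counts(sample)
print(counts.most_common(1))  # [('Ham', 2)]
```

With the real file, passing the open file object to `speaker_counts` and then using `max` and `min` over the resulting counts would answer both questions.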