# Safe dict reading

Define a function `safe_dict(d, k)` that takes a Python dict `d` and a key `k` and returns the value stored under `k`. If `k` is not in the dictionary, it should return 0 instead of raising a `KeyError`.

```
d = {1 : 2, 3 : 4}
safe_dict(d, 1) -> 2
safe_dict(d, 'cat') -> 0
```

In [4]:
d = {1: 2, 3: 4, 5: 6, 7: 8, 9: 10, 11: 12}

def safe_dict(d, k):
    '''
    - First determine if k is a key in dictionary d
        - if not, return 0
    - Otherwise return the value stored under that key
    '''
    # dict.get returns its second argument when k is missing
    return d.get(k, 0)
    
    
print(safe_dict(d, 'cat'))
print(safe_dict(d, 1))
print(safe_dict(d, 2))
print(safe_dict(d, 3))
print(safe_dict(d, 67565342))



0
2
0
4
0
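
For comparison, the standard library offers two built-in ways to get a default for a missing key, and they differ in one detail worth knowing. The sketch below is only illustrative; the names `plain` and `counting` are made up for the example.

```python
from collections import defaultdict

plain = {1: 2, 3: 4}

# dict.get returns the default without modifying the dictionary
print(plain.get('cat', 0))   # 0
print('cat' in plain)        # False

# defaultdict(int) also returns 0, but reading a missing key inserts it
counting = defaultdict(int, plain)
print(counting['cat'])       # 0
print('cat' in counting)     # True
```

If the dictionary should stay unchanged after a failed lookup, `dict.get` is the safer choice.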


# File Reading: Hamlet Exercises

Open `hamlet.txt` in the `data` folder

### 1. Mentions of Hamlet

How many times is Hamlet mentioned in the book?

Use Python and line iteration to count the mentions.

In [37]:
HAMLET_PATH = 'C:/Users/farha/Documents/m1-4-files-strings/data/hamlet.txt'

def word_count(x):
    '''
    - Convert each line to lower case to account for both HAMLET and Hamlet
    - Then add up the per-line counts from .count()
    '''
    total = 0
    # Open the file inside the function so repeated calls start from the top
    with open(HAMLET_PATH) as hamlet:
        for line in hamlet:
            total += line.lower().count(x)
    return total

word_count("hamlet")

474

### 2. File Reading as a .py program

Make a Python file that defines a function that counts the number of times Hamlet is mentioned, using the code from the previous exercise.

Then import it in your notebook and call it here.

In [45]:
# ham.py must be in the notebook's working directory
from ham import word_count

word_count("hamlet")

474

### 3. Unique words in Hamlet

Write a program that counts the unique words in Hamlet.

In [73]:
'''
- Convert all lines into a list of strings where each element is a word
- Convert the list to a set to get the unique values
'''

word_list = []
with open('C:/Users/farha/Documents/m1-4-files-strings/data/hamlet.txt') as hamlet:
    for line in hamlet:
        # split(' ') keeps punctuation, case and newline characters, so
        # "Hamlet", "hamlet," and "hamlet\n" count as different words
        for word in line.split(' '):
            word_list.append(word)

unique_elements = set(word_list)

print(len(unique_elements))
print(len(word_list))

8563
34062
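
Splitting on single spaces treats "Hamlet", "HAMLET" and "hamlet." as different words. As a rough sketch of one way to normalize before counting, the snippet below lowercases the text and keeps only runs of letters and apostrophes. The helper `unique_words` and the `sample` string are made up for illustration, and what counts as a "word" is an assumption, not part of the exercise.

```python
import re

def unique_words(text):
    # Lowercase, then keep runs of letters and apostrophes as words
    return set(re.findall(r"[a-z']+", text.lower()))

sample = "Hamlet, HAMLET and hamlet.\nTo be, or not to be"
words = unique_words(sample)
print(sorted(words))  # ['and', 'be', 'hamlet', 'not', 'or', 'to']
```

Applied to the whole file, this would likely give a smaller count than the one above, since case and punctuation variants collapse into one entry.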


# File Reading 2: A Python library.

In the `data` folder, you will find a folder called `csrgraph`, which is a Python library.

### 1. File count

Count the `.py` files in the library using the `os` package.

In [4]:
import os

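# List the folder directly instead of changing the working directory
file_list = os.listdir('C:/Users/farha/Documents/m1-4-files-strings/data/csrgraph')

# Keep only the names that end in .py
py_files = [name for name in file_list if name.endswith('.py')]
file_count = len(py_files)
print(file_count)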

8


### 2. For the following packages, count the number of files that import them:

- pandas 

- numpy

- numba

In [35]:
def package_count():
    '''
    - Obtain list of files in directory
    - Join the path with the names of the files
    - Open each file and read its text
    - Count each file at most once per package it imports
    '''
    folder = 'C:/Users/farha/Documents/m1-4-files-strings/data/csrgraph'
    counts = {"pandas": 0, "numpy": 0, "numba": 0}
    for file_name in os.listdir(folder):
        with open(os.path.join(folder, file_name)) as f:
            text = f.read()
        for package in counts:
            # Substring match: finds "import numpy" and "import numpy as np",
            # but not "from numpy import ..."
            if "import " + package in text:
                counts[package] += 1

    print("Number of files that import Pandas: " + str(counts["pandas"]))
    print("Number of files that import Numpy: " + str(counts["numpy"]))
    print("Number of files that import Numba: " + str(counts["numba"]))

package_count()

Number of files that import Pandas: 4
Number of files that import Numpy: 6
Number of files that import Numba: 5
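
A plain substring search for `import numpy` misses `from numpy import ...` and also matches commented-out imports. One possible refinement, sketched below, is a regular expression anchored at the start of a line. The helper `imports_package` and the `source` string are hypothetical, and the pattern still does not handle comma-separated imports such as `import os, numpy`.

```python
import re

def imports_package(source, package):
    # Match "import pkg", "import pkg.sub", "import pkg as p" and
    # "from pkg import x" when they start a line (leading spaces allowed)
    pattern = rf"^\s*(?:import|from)\s+{re.escape(package)}\b"
    return re.search(pattern, source, flags=re.MULTILINE) is not None

source = "import numpy as np\nfrom numba import jit\n# import pandas\n"
print(imports_package(source, "numpy"))   # True
print(imports_package(source, "numba"))   # True
print(imports_package(source, "pandas"))  # False
```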


# First NLP Program: IDF

Given a collection of documents, the inverse document frequency (IDF) is a basic statistic for how much information each word carries: words that occur in fewer documents score higher.

The IDF formula is:

$$IDF(w) = \ln\left(\dfrac{N}{1 + n(w)}\right)$$

Where:

- $w$ is the token (unique word),
- $n(w)$ is the number of documents that $w$ occurs in,
- $N$ is the total number of documents

Write a function `idf(docs)` that takes in a list of lists of words (one list per document) and returns a dictionary `word -> idf score`.

Example:

```
idf([['interview', 'questions'], ['interview', 'answers']]) -> {'questions': 0.0, 
                                                                'interview': -0.4, 
                                                                'answers': 0.0}


```
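
The numbers in the example follow directly from the formula. As a quick check, with variable names chosen only for this illustration:

```python
import math

N = 2             # two documents in the example
n_interview = 2   # 'interview' occurs in both documents
n_questions = 1   # 'questions' occurs in one document

print(math.log(N / (1 + n_interview)))  # ln(2/3), about -0.405
print(math.log(N / (1 + n_questions)))  # ln(2/2) = 0.0
```

Because of the `1 +` in the denominator, a word that appears in every document gets a negative score.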

In [63]:
import math

def idf(docs):
    '''
    - First create a dictionary of how many documents each word appears in
    - Then create a new dictionary and use the IDF formula on those counts
    '''
    dictionary = {}
    for document in docs:
        # dict.fromkeys drops repeats (keeping order), so a word that
        # occurs several times in one document is counted once
        for word in dict.fromkeys(document):
            if word in dictionary:
                dictionary[word] += 1
            else:
                dictionary[word] = 1

    idf_dictionary = {}
    for word in dictionary:
        idf_dictionary[word] = math.log(len(docs) / (1 + dictionary[word]))

    return idf_dictionary

idf([['interview', 'questions'], ['interview', 'answers']])

{'interview': -0.40546510810816444, 'questions': 0.0, 'answers': 0.0}

# Stretch Goal: TF-IDF on Hamlet

The TF-IDF score is a commonly used statistic for the importance of a word in a document. It is the product $TF \times IDF$, where TF is the "term frequency" (i.e. how often the word occurs in the document).

Calculate the TF-IDF dictionary on the Hamlet book.

What's the TF-IDF of "Hamlet"?

What's the word with the highest TF-IDF in the book?
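
As a starting point, here is a minimal sketch on toy data, taking TF-IDF as the product of TF and IDF. It assumes TF is the word's share of the document's words (one common convention among several) and reuses the smoothed IDF formula from the previous section. How to split Hamlet into "documents" (by scene, by speech, by line) is left open; the `docs` list below is made up.

```python
import math
from collections import Counter

def idf(docs):
    # n(w): number of documents containing w (each document counted once)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {w: math.log(len(docs) / (1 + n)) for w, n in doc_freq.items()}

def tf_idf(doc, idf_scores):
    # TF: count of w in the document divided by the document length
    counts = Counter(doc)
    return {w: (c / len(doc)) * idf_scores[w] for w, c in counts.items()}

docs = [
    ['hamlet', 'sees', 'the', 'ghost', 'hamlet', 'speaks'],
    ['the', 'king', 'speaks'],
    ['the', 'queen', 'drinks'],
    ['the', 'play', 'ends'],
]
scores = tf_idf(docs[0], idf(docs))
print(max(scores, key=scores.get))  # hamlet
```

Note that `'the'` gets a negative score here, because it occurs in every document.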

# Stretch Goal: Speaker count

Use a regular expression while looping over the `hamlet.txt` file to build a dictionary `character_name -> number of times speaking`.

Who speaks the most often? Who speaks the least often?
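
The right pattern depends on how `hamlet.txt` marks speakers, which is not shown here. The sketch below assumes a purely hypothetical convention (two spaces, an abbreviated capitalized name, then a period) and runs on a made-up excerpt, so the regular expression will probably need adjusting for the real file.

```python
import re
from collections import Counter

# Hypothetical excerpt; the real hamlet.txt may mark speakers differently
sample = """\
  Ber. Who's there?
  Fran. Nay, answer me.
  Ber. Long live the King!
    Stand and unfold yourself.
"""

# Assumed convention: a speech starts with two spaces, an abbreviated
# name beginning with a capital letter, and a period
speaker = re.compile(r"^  ([A-Z][a-z]+)\. ")

counts = Counter()
for line in sample.splitlines():
    match = speaker.match(line)
    if match:
        counts[match.group(1)] += 1

print(counts.most_common())  # [('Ber', 2), ('Fran', 1)]
```

`Counter.most_common()` sorts by count, so the first and last entries answer the two questions above.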