# Safe dict reading

Define a function `safe_dict(d, k)` that takes a Python dict `d` and a key `k` and returns `d[k]` when `k` is in the dictionary. If `k` is not in the dictionary, it should return 0 instead of raising an error.

```
d = {1 : 2, 3 : 4}
safe_dict(d, 1) -> 2
safe_dict(d, 'cat') -> 0
```

In [1]:
d = {1: 2, 3: 4}

def safe_dict(d, k):
    try:
        v = d[k]
        return v
    except KeyError:
        return 0
safe_dict(d, 2)

0
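
The same behaviour can be had from the built-in `dict.get`, which takes a default value to return when the key is missing. A minimal sketch; the name `safe_dict_get` is made up here so that it does not clash with the exercise's function:

```python
def safe_dict_get(d, k):
    # dict.get returns its second argument when the key is missing
    return d.get(k, 0)

d = {1: 2, 3: 4}
print(safe_dict_get(d, 1))      # 2
print(safe_dict_get(d, 'cat'))  # 0
```

Like `d[k]`, this still raises `TypeError` for unhashable keys such as lists.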

# File Reading: Hamlet Exercises

Open `hamlet.txt` in the `data` folder

### 1. Mentions of Hamlet

How many times is Hamlet mentioned in the book?

Use Python and line iteration to count it up.

In [2]:
total = 0

# Use a context manager so the file is closed once we are done reading
with open('data/hamlet.txt', "r") as hamlet:
    for line in hamlet:
        # We check for each way to spell hamlet and add up the occurrences found
        total += line.count("Hamlet")
        total += line.count("HAMLET")
        total += line.count("hamlet")

print(total)

474


### 2. File Reading as a .py program

Make a Python file that defines a function that counts the number of times Hamlet is mentioned, using the code from the previous exercise.

Then import it in your notebook and call it here.

In [3]:
import fword

print(fword.check_for('data/hamlet.txt', ["Hamlet", "HAMLET", "hamlet"]))
print(fword.check_for('', ["Hamlet", "HAMLET", "hamlet"]))
print(fword.check_for('data/hamlet.txt', []))
print(fword.check_for(0,["Hamlet", "HAMLET", "hamlet"]))

474
[Errno 2] No such file or directory: ''
0
Please provide at least one word to search for
0
Please provide a string for the file/filepath
0
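
The `fword` module itself is not shown in this notebook. The sketch below is one way `check_for` could be written so that it behaves like the output above: it prints a message and returns 0 on bad input instead of raising. It is a guess at the shape of the function, not the actual contents of `fword.py`.

```python
def check_for(path, words):
    """Count how many times any of `words` occurs in the file at `path`.

    Hypothetical reconstruction: on bad input it prints a message and returns 0.
    """
    # open() also accepts integers (file descriptors), so check the type first
    if not isinstance(path, str):
        print("Please provide a string for the file/filepath")
        return 0
    if not words:
        print("Please provide at least one word to search for")
        return 0
    try:
        with open(path, "r") as f:
            # str.count is case-sensitive, so each spelling is counted separately
            return sum(line.count(word) for line in f for word in words)
    except OSError as e:
        print(e)
        return 0

# A missing file prints the OS error message, then 0
print(check_for('', ["Hamlet", "HAMLET", "hamlet"]))
```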


### 3. Unique words in Hamlet

Write a program that counts the unique words in Hamlet.

In [4]:
import fword

print(fword.no_unique_words_in('data/hamlet.txt'))

5120
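
The count depends on what is treated as a word. As an illustration only, the sketch below lower-cases the text and treats a word as a run of letters with an optional internal apostrophe. The helper name `no_unique_words_in_text` and the tokenization rule are assumptions; `fword.no_unique_words_in` may tokenize differently and so give a different number.

```python
import re

def no_unique_words_in_text(text):
    # Lower-case first so "The" and "the" are counted as one word
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # A set keeps one copy of each distinct word
    return len(set(words))

print(no_unique_words_in_text("To be, or not to be"))  # 4
```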


# File Reading 2: A Python library.

In the `data` folder, you will find a folder called `csrgraph`, which is a Python library.

### 1. File count

Count the `py` files in the library using the `os` package

In [5]:
import os

def files_in(path):
    # os.listdir raises on a bad path; the caller decides how to handle it
    return os.listdir(path)

def how_many_files_in(path):
    try:
        content = files_in(path)
    except (OSError, TypeError) as e:
        print(e)
        return
    print("There are", len(content), "files in the", path, "folder")

how_many_files_in('data/csrgraph')

There are 8 files in the data/csrgraph folder
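
The cell above counts every entry in the folder. To count only the `py` files, as the exercise asks, the names can be filtered by extension. A minimal sketch, assuming only the top level of the folder should be counted; `count_py_files` is a hypothetical helper name:

```python
import os

def count_py_files(path):
    # Count regular files directly inside `path` whose name ends in ".py"
    return sum(
        1
        for name in os.listdir(path)
        if name.endswith(".py") and os.path.isfile(os.path.join(path, name))
    )

# Usage (not run here because it needs the data folder):
# count_py_files('data/csrgraph')
```

To include subfolders as well, `os.walk` could replace `os.listdir`.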


### 2. For the following packages, count the number of files that import them:

- pandas 

- numpy

- numba

In [6]:
import os
import fword

# returns an array of file names contained in the specified path
# def files_in(path):
#     try: return os.listdir(path)
#     except Exception as e: raise e
        
# returns a dictionary with the keys as the file names,
# and the values as arrays of the lines of the file
# def files_as_lines_in(path):
#     try: 
#         file_names = files_in(path)
#         return dict(zip(file_names, [fword.get_all_lines_in_file(path+"/"+file) for file in file_names]))
#     except Exception as e: 
#         raise e
      
# Checks whether a file has a word or not
# def does_file_have_word(file_path, word): return fword.check_for(file_path, [word]) > 0

# Checks whether a set of files contains a specific word
# def files_containing_word_at_path(in_path, file_set, word):
#     total = 0
#     for file in files: total += 1 if does_file_have_word(in_path+"/"+file, "import "+ pkge) else 0
#     return total


def no_of_files_containing_word(files, word):
    # `files` maps each file name to the list of lines in that file
    if not isinstance(files, dict):
        raise TypeError("Error no_of_files_containing_word - files must be a dictionary")
    total = 0
    for lines in files.values():
        if fword.do_lines_have_word(lines, word):
            total += 1
    return total
        

def no_of_files_importing(pkges, in_path):
    # nothing to count without packages or a path
    if len(pkges) < 1: return 0
    if in_path is None: return 0

    # we create a dictionary of the packages, initialized at 0
    packages = dict.fromkeys(pkges, 0)

    # we fetch a dictionary <file_name:lines> for the files at the path specified
    try:
        files = fword.files_as_lines_in(in_path)
    except Exception as e:
        print(e)
        return 0

    # we iterate over each package and check each file
    for pkge in pkges:
        # we reject empty or one-character package names
        if len(pkge) <= 1:
            print("Package '", pkge, "' is not a proper package name")
            return 0
        packages[pkge] += no_of_files_containing_word(files, "import " + pkge)
    return packages

print(no_of_files_importing(["os", "pandas", "numpy", "numba"], "data/csrgraph"))

{'os': 2, 'pandas': 4, 'numpy': 6, 'numba': 5}
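
Searching for the text `"import " + pkge` misses `from numpy import ...` and, depending on how `fword.do_lines_have_word` matches, may also hit comments or longer package names that start with the same letters. A regular expression anchored at the start of the line is one way to tighten this. The sketch below is an illustration with a hypothetical helper `lines_import`; it does not handle comma-separated imports such as `import os, numpy`.

```python
import re

def lines_import(lines, pkg):
    # Match "import pkg", "import pkg.sub", "import pkg as p" and "from pkg import x",
    # but not a longer name that merely starts with pkg (the negative lookahead)
    pattern = re.compile(r"\s*(?:import|from)\s+" + re.escape(pkg) + r"(?!\w)")
    return any(pattern.match(line) for line in lines)

print(lines_import(["import numpy as np\n"], "numpy"))   # True
print(lines_import(["import ossaudiodev\n"], "os"))      # False
```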


# First NLP Program: IDF

Given a list of documents, each of them a list of words, the inverse document frequency (IDF) is a basic statistic of how much information each word carries: words that occur in fewer documents get a higher score.

The IDF formula is:

$$IDF(w) = \ln\left(\dfrac{N}{1 + n(w)}\right)$$

Where:

- $w$ is the token (unique word),
- $n(w)$ is the number of documents that $w$ occurs in,
- $N$ is the total number of documents

Write a function `idf(docs)` that takes in a list of lists of words and returns a dictionary `word -> idf score`.

Example:

```
idf([['interview', 'questions'], ['interview', 'answers']]) -> {'questions': 0.0,
                                                                'interview': -0.4,
                                                                'answers': 0.0}
```

In [8]:
import math

def idf(docs):

    # first build a dict <doc_index:unique_words_set_for_doc>
    # the key is the index of the doc in the docs array
    doc_dict = {}
    unique_words = set()

    # we populate the dict
    for i in range(len(docs)):
        doc_dict[str(i)] = set(docs[i])
        # at the same time we get the set of unique words for all docs
        unique_words.update(doc_dict[str(i)])

    # we make one counter dictionary <word:counter> to keep track of how many docs contain each word
    words_in_no_of_docs = dict(zip([str(word) for word in unique_words], [0] * len(unique_words)))
    # we clone this base object for the result to return
    result = dict(words_in_no_of_docs)

    for word in unique_words:
        # for each word in unique_words we count the docs it appears in
        for doc in doc_dict.keys():
            words_in_no_of_docs[word] += 1 if word in doc_dict[doc] else 0
        # debug output: the running document counts
        print(words_in_no_of_docs)
        # we then apply the formula to the result dict for each word
        result[word] = math.log((len(docs) / (1 + words_in_no_of_docs[word])))
    return result

    
idf([['interview', 'questions'], ['interview', 'answers']])
    

{'answers': 1, 'questions': 0, 'interview': 0}
{'answers': 1, 'questions': 1, 'interview': 0}
{'answers': 1, 'questions': 1, 'interview': 2}


{'answers': 0.0, 'questions': 0.0, 'interview': -0.40546510810816444}
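
The document counts $n(w)$ can also be gathered in one pass with `collections.Counter`. The sketch below is a shorter variant of the same formula, named `idf_counter` here (a made-up name) to keep it apart from the function above:

```python
import math
from collections import Counter

def idf_counter(docs):
    # n(w): number of documents containing w; set(doc) counts each document once per word
    doc_freq = Counter(word for doc in docs for word in set(doc))
    n_docs = len(docs)
    return {word: math.log(n_docs / (1 + n)) for word, n in doc_freq.items()}

scores = idf_counter([['interview', 'questions'], ['interview', 'answers']])
print(scores)
```

For the example input, 'interview' occurs in both documents, so its score is $\ln(2/3) \approx -0.405$, while the other two words score $\ln(2/2) = 0$.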

# Stretch Goal: TF-IDF on Hamlet

The TF-IDF score is a commonly used statistic for the importance of words. It's $TF \times IDF$, where TF is the "term frequency" (i.e. how often the word occurs in the document).

Calculate the TF-IDF dictionary on the Hamlet book.

What's the TF-IDF of "Hamlet"?

What's the word with the highest TF-IDF in the book?

In [None]:
import math
import fword


def idf(docs):

    # first build a dict <doc_index:unique_words_set_for_doc>
    # the key is the index of the doc in the docs array
    doc_dict = {}
    unique_words = set()

    # we populate the dict
    for i in range(len(docs)):
        doc_dict[str(i)] = set(docs[i])
        # at the same time we get the set of unique words for all docs
        unique_words.update(doc_dict[str(i)])

    # we make one counter dictionary <word:counter> to keep track of how many docs contain each word
    words_in_no_of_docs = dict(zip([str(word) for word in unique_words], [0] * len(unique_words)))
    # we clone this base object for the result to return
    result = dict(words_in_no_of_docs)

    for word in unique_words:
        # for each word in unique_words we count the docs it appears in
        for doc in doc_dict.keys():
            words_in_no_of_docs[word] += 1 if word in doc_dict[doc] else 0
        # we then apply the formula to the result dict for each word
        result[word] = math.log((len(docs) / (1 + words_in_no_of_docs[word])))
    return result

def tf_idf(file):
    # This method takes a filepath, converts it to an array of words only
    # [[word, word], [word, word]]; each line is treated as one document
    lines = fword.get_all_lines_in_file_as_array_of_words_only(file)
    idf_file = idf(lines)
    tf_file = {}
    result = {}
    highest_v = float("-inf")
    highest_n = ""
    for word in idf_file.keys():
        # This method counts the number of times the word appears in all lines
        tf_file[word] = fword.no_of_times_word_occurs_in(lines, word)
        # TF-IDF is the product TF * IDF
        result[word] = tf_file[word] * idf_file[word]
        # We update a max-value tracker pair of variables
        if result[word] > highest_v:
            highest_n = word
            highest_v = result[word]

    # we could have done some extra work to consider all-caps, capitalized and non-capitalized
    # versions of each word as well (ie: the and The)
    print("the TF-IDF of Hamlet is", result["Hamlet"])
    print("the TF-IDF of HAMLET is", result["HAMLET"])
    print("the word with the highest TF-IDF is '", highest_n, "' with, (", highest_v,")")

tf_idf('data/hamlet.txt')
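
To check the arithmetic without the book or the `fword` helpers, the same calculation can be run on a few made-up lines. This sketch assumes, as the cell above does, that each line is one document and that TF is the total number of occurrences over all lines; `tf_idf_scores` and the toy `docs` are illustrative only.

```python
import math
from collections import Counter

def tf_idf_scores(docs):
    n_docs = len(docs)
    # TF: total occurrences of each word over all documents
    tf = Counter(word for doc in docs for word in doc)
    # n(w): number of documents that contain each word
    df = Counter(word for doc in docs for word in set(doc))
    # TF-IDF is the product of TF and IDF
    return {w: tf[w] * math.log(n_docs / (1 + df[w])) for w in tf}

docs = [["to", "be"], ["or", "not"], ["to", "be"], ["that", "is"]]
scores = tf_idf_scores(docs)
print(scores)
print(max(scores, key=scores.get))
```

With the `1 + n(w)` term in the denominator, any word that occurs in every document gets a negative IDF, and therefore a negative TF-IDF.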

# Stretch Goal: Speaker count

Use a regular expression and looping over the `hamlet.txt` file to build a dictionary `character_name -> # times speaking`.

Who speaks the most often? Who speaks the least often?

In [None]:
import re
import fword


def speaker_count(file):
    # we want to check lines of words only to count names
    # speakers are "SPEAKER_NAME."
    # without punctuation, we only have a one-word line in all caps
    # since the method get_all_lines_in_file_as_array_of_words_only(file)
    # of our library already filters out punctuation and spaces, and since
    # speaker name lines are always in caps,
    # if we check lines that only have one word and are all-caps we can be sure we're
    # looking at a speaker header

    lines = fword.get_all_lines_in_file_as_array_of_words_only(file)
    speakers = {}
    rege = re.compile(r'^[A-Z\d]+$')
    for line in lines:
        if len(line) == 1 and rege.match(line[0]):
            speakers[line[0]] = speakers.get(line[0], 0) + 1
    print(speakers)
    if speakers:
        # on a tie, max and min return the first name that was seen
        most = max(speakers, key=speakers.get)
        least = min(speakers, key=speakers.get)
        print("The character who speaks the most is:", most, "(", speakers[most], "times )")
        print("The character who speaks the least is:", least, "(", speakers[least], "times )")
                
speaker_count("data/hamlet.txt")
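
The regular expression can also be applied to the raw lines directly, which keeps multi-word names together. The sketch below assumes a speaker header is a line holding only an all-caps name followed by a period (for example `HAMLET.`); that format, the pattern `SPEAKER_RE`, the helper `count_speakers` and the sample lines are all assumptions to be checked against `hamlet.txt`.

```python
import re
from collections import Counter

# An all-caps name of at least two letters (spaces allowed inside), then a period
SPEAKER_RE = re.compile(r"^\s*([A-Z][A-Z ]*[A-Z])\.\s*$")

def count_speakers(lines):
    counts = Counter()
    for line in lines:
        m = SPEAKER_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = ["HAMLET.\n", "To be, or not to be.\n", "OPHELIA.\n",
          "Good my lord.\n", "HAMLET.\n", "I humbly thank you.\n"]
counts = count_speakers(sample)
print(counts.most_common(1))         # [('HAMLET', 2)]
print(min(counts, key=counts.get))   # OPHELIA
```

On the real file, `count_speakers(open(path))` would work the same way, since a file object iterates over its lines.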
            