# Safe dict reading

Define a function `safe_dict(d, k)` that takes a Python dict `d` and a key `k` and returns `d[k]` if `k` is in the dictionary. If `k` is not a key of `d`, it should return 0 instead of raising a `KeyError`.

```
d = {1 : 2, 3 : 4}
safe_dict(d, 1) -> 2
safe_dict(d, 'cat') -> 0
```

In [51]:
# first way

d = {1:2, 3:4}
def safe_dict(d, k):
    if k not in d:
        return 0
    else:
        return d[k]

print(safe_dict(d, 'cat'))


# second way... use an 'exception'

def other_safe_dict(d, k):
    try:
        return d[k]
    except KeyError:
        return 0

other_safe_dict(d, 'cat')

0


0
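A third option, sketched here as an alternative to the two solutions above, is the built-in `dict.get`, which takes a default value to return when the key is missing. The name `get_safe_dict` is made up for this example.

```python
def get_safe_dict(d, k):
    # dict.get returns d[k] if k is a key of d, otherwise the default (0 here)
    return d.get(k, 0)

d = {1: 2, 3: 4}
print(get_safe_dict(d, 1))      # 2
print(get_safe_dict(d, 'cat'))  # 0
```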

# File Reading: Hamlet Exercises

Open `hamlet.txt` in the `data` folder

### 1. Mentions of Hamlet

How many times is Hamlet mentioned in the book?

Use Python and line iteration to count the mentions.

In [53]:
# Answer 1: read the file as a list of lines with readlines()

import re

f = open('/Users/mayarossi/DS-Workshop/m1-4-files-strings/data/hamlet.txt', 'r')

my_count = 0

data_file = f.readlines()
for line in data_file:
    tokens = re.findall(r'\w+', line)  # raw string, so \w reaches the regex engine intact
    for token in tokens:
        if 'hamlet' in token.lower():
            my_count += 1
f.close()
my_count


474

In [54]:
# Answer 2: read the whole file into one string with read().
# Both read() and readlines() load the entire file into memory;
# for very big documents, iterate over the file object line by line instead.

import re

f = open('/Users/mayarossi/DS-Workshop/m1-4-files-strings/data/hamlet.txt', 'r')

my_count = 0

data_file = f.read()
tokens = re.findall(r'\w+', data_file)  # raw string, so \w reaches the regex engine intact
for token in tokens:
    if 'hamlet' in token.lower():
        my_count += 1
f.close()
my_count



474
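For files too large to hold in memory, one option is to iterate over the file object directly, which yields one line at a time. The sketch below uses a small in-memory sample in place of `hamlet.txt`; the function name `count_mentions` and the sample text are assumptions for illustration.

```python
import io
import re

def count_mentions(lines, word='hamlet'):
    # `lines` can be any iterable of strings, including an open file object,
    # which yields one line at a time instead of loading the whole file
    total = 0
    for line in lines:
        for token in re.findall(r'\w+', line):
            if word in token.lower():
                total += 1
    return total

# Stand-in for an open file; with a real file, pass the file object itself
sample = io.StringIO("Hamlet, Prince of Denmark\nEnter HAMLET and Horatio\nExit\n")
print(count_mentions(sample))  # 2
```

With a real file, the same function could be called inside a `with open(...) as f:` block as `count_mentions(f)`, which also closes the file automatically.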

### 2. File Reading as a .py program

Make a Python file that defines a function counting the number of times Hamlet is mentioned, using the code from the previous exercise.

Then import it in your notebook and call it here.

In [55]:
with open('count_hamlet.py', 'w') as f:
    f.write(
r"""
import re


def countyy(f):
    mymy_count = 0
    data_file = f.readlines()
    for line in data_file:
        tokens = re.findall(r'\w+', line)
        for token in tokens:
            if 'hamlet' in token.lower():
                mymy_count += 1
    return mymy_count

""")


In [56]:

import count_hamlet

f = open('/Users/mayarossi/DS-Workshop/m1-4-files-strings/data/hamlet.txt', 'r')

print(count_hamlet.countyy(f))

f.close()

474


### 3. Unique words in hamlet

Write a program that counts the unique words in hamlet.

In [3]:
# First solution: gives 8563
# Splitting on spaces keeps punctuation and newlines attached to words,
# so 'together' and 'together.' count as different words.

def unique_unique_words(f):
    corpus = []
    data_file = f.readlines()
    for line in data_file:
        words = line.split(" ")
        for word in words:
            corpus.append(word)
    return len(set(corpus))

f = open('/Users/mayarossi/DS-Workshop/m1-4-files-strings/data/hamlet.txt', 'r')

print(unique_unique_words(f))

f.close() 



8563


In [2]:
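# Second solution: gives 5141. I like the tokenizing better here because
# re.findall(r'\w+', ...) splits on punctuation, so a word does not count
# as 'unique' just because it ends a sentence, like 'together.'
# Caveat: the `== 1` check runs once per line, after the inner loop, on the
# last token of that line only, and `tokens.count` is a per-line count.
# So 5141 is the number of lines whose last token occurs once in that line,
# not the number of unique words in the book.

import re
my_dict = {}

def unique_words(f):
    count = 0
    data_file = f.readlines()
    for line in data_file:
        tokens = re.findall(r'\w+', line)
        for token in tokens:
            my_dict[token] = tokens.count(token)
        if tokens.count(token) == 1:
            count += 1
        
    return count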

f = open('/Users/mayarossi/DS-Workshop/m1-4-files-strings/data/hamlet.txt', 'r')

print(unique_words(f))

f.close() 


5141
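If "unique words" is taken to mean distinct words regardless of case, a minimal sketch is to lowercase every token and collect the tokens in a set. The function name `distinct_words` and the sample lines are made up for this example, and no count for `hamlet.txt` is claimed here.

```python
import re

def distinct_words(lines):
    # a set keeps one copy of each token, so its size is the distinct-word count
    seen = set()
    for line in lines:
        seen.update(token.lower() for token in re.findall(r'\w+', line))
    return len(seen)

sample = ["We go together.", "Together we stand"]
print(distinct_words(sample))  # 4: we, go, together, stand
```

Because the set spans the whole text rather than a single line, a word repeated on different lines is still counted once.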


# File Reading 2: A Python library.

In the `data` folder, you will find a folder called `csrgraph` which is a python library.

### 1. File count

Count the `.py` files in the library using the `os` package.

In [56]:
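# First solution: with os
# Note: os.listdir() with no argument lists the current working directory,
# so this assumes the notebook is running inside the csrgraph folder.
# It also does not look inside subfolders.

import os

my_list = os.listdir()

x = [f for f in my_list if f.endswith('.py')]

len(x)


# Second solution: without os, using an IPython shell command
# (this works only in IPython/Jupyter, not in a plain .py file)

files = !ls *.py

len(files)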

8
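`os.listdir` only looks at one folder. If the library has subfolders, `os.walk` visits them too. The sketch below builds a throwaway directory so that it runs anywhere; `count_py_files` is a name made up for this example.

```python
import os
import tempfile

def count_py_files(root):
    # os.walk yields (dirpath, dirnames, filenames) for root and every subfolder
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += sum(1 for name in filenames if name.endswith('.py'))
    return total

# Build a small temporary tree: two .py files (one nested) and one .txt file
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'sub'))
    for rel in ['a.py', 'notes.txt', os.path.join('sub', 'b.py')]:
        open(os.path.join(root, rel), 'w').close()
    print(count_py_files(root))  # 2
```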

### 2. For the following packages, count the number of files that import them:

- pandas 

- numpy

- numba

In [89]:
import re
import os

path = '/Users/mayarossi/DS-Workshop/m1-4-files-strings/data/csrgraph/'
packages = ['import pandas', 'import numpy', 'import numba']

def question_two(path):
    # Count, for each package, the .py files whose text contains 'import <package>'.
    # This also matches the phrase inside comments or strings, and it misses
    # 'from <package> import ...' lines.
    my_files = [file for file in os.listdir(path) if file.endswith('.py')]
    my_dictionary = {}
    for package in packages:
        for file_name in my_files:
            with open(path + file_name, 'r') as f:
                text = f.read()
                if re.search(package, text):
                    # .get returns the current count, or 0 if the key is missing,
                    # instead of raising a KeyError
                    my_dictionary[package] = my_dictionary.get(package, 0) + 1
    return my_dictionary


question_two(path)

{'import pandas': 4, 'import numpy': 6, 'import numba': 5}
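A plain substring search for `import numpy` misses `from numpy import ...` and also matches the phrase inside comments. One possible refinement, sketched below, anchors the pattern at the start of a line and accepts both forms. The helper name `imports_package` and the sample source are assumptions; multi-name statements such as `import os, numpy` are not handled.

```python
import re

def imports_package(source, package):
    # match 'import pkg', 'import pkg.sub', 'from pkg import x' or
    # 'from pkg.sub import x' at the start of a line (leading spaces allowed)
    pattern = r'^\s*(?:import|from)\s+' + re.escape(package) + r'(?:\.|\s|$)'
    return re.search(pattern, source, flags=re.MULTILINE) is not None

sample = "import numpy as np\nfrom numba import jit\n# import pandas later\n"
for package in ['pandas', 'numpy', 'numba']:
    print(package, imports_package(sample, package))
# pandas False
# numpy True
# numba True
```

Counting files then amounts to calling this helper once per file and package and summing the `True` results.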

# First NLP Program: IDF

Given a collection of documents, each a list of words, the inverse document frequency (IDF) is a basic statistic of how much information each word carries: words that occur in fewer documents get higher scores.

The IDF formula is:

$$IDF(w) = \ln\left(\dfrac{N}{1 + n(w)}\right)$$

Where:

- $w$ is the token (unique word),
- $n(w)$ is the number of documents that $w$ occurs in,
- $N$ is the total number of documents

Write a function `idf(docs)` that takes a list of lists of words (one inner list per document) and returns a dictionary `word -> idf score`.

Example:

```
idf([['interview', 'questions'], ['interview', 'answers']]) -> {'questions': 0.0,
                                                                'interview': -0.4,
                                                                'answers': 0.0}
```

In [4]:
import math

my_list = [['interview', 'questions'], ['interview', 'answers']]

def idf(docs):
    doc_counts = {}
    N = len(docs)  # total number of documents
    # n(w): count each word once per document it appears in
    for doc in docs:
        for w in set(doc):
            doc_counts[w] = doc_counts.get(w, 0) + 1
    # apply the formula ln(N / (1 + n(w))) to each word
    return {w: math.log(N / (1 + n)) for w, n in doc_counts.items()}
        
idf(my_list)

{'questions': 0.0, 'interview': -0.40546510810816444, 'answers': 0.0}

# Stretch Goal: TF-IDF on Hamlet

The TF-IDF score is a commonly used statistic for the importance of words. It is the product $TF \times IDF$, where TF is the "term frequency" (i.e. how often the word occurs in the document).

Calculate the TF-IDF dictionary on the Hamlet book.

What's the TF-IDF of "Hamlet"?

What's the word with the highest TF-IDF in the book?
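As a starting point, here is a rough sketch of TF-IDF on a toy corpus. It multiplies TF by IDF, uses the IDF formula from the previous section, and takes TF to be the raw count of a word within one document. How to split Hamlet into documents (by line, by scene, by act) is left open; the function name `tf_idf` and the sample documents are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of documents, each a list of words
    n_docs = len(docs)
    # n(w): number of documents containing w
    doc_freq = Counter(w for doc in docs for w in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each word in this document
        scores.append({w: tf[w] * math.log(n_docs / (1 + doc_freq[w])) for w in tf})
    return scores

docs = [['to', 'be', 'or', 'not', 'to', 'be'], ['to', 'sleep'], ['the', 'rest', 'is', 'silence']]
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

The result is one dictionary per document, so the highest-scoring word can be found with `max(doc_scores, key=doc_scores.get)`.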

# Stretch Goal: Speaker count

Use a regular expression and looping over the `hamlet.txt` file to build a dictionary `character_name -> # times speaking`.

Who speaks the most often? Who speaks the least often?
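One possible approach, assuming each speech starts with a speaker label at the beginning of a line (a capitalized, possibly abbreviated name followed by a period, such as `Ham.`). That layout is an assumption; check how `hamlet.txt` actually marks speakers and adjust the pattern. The sample text and the name `speaker_counts` are made up for this example.

```python
import re
from collections import Counter

# Assumed label format: start of line, capitalized word, period, e.g. 'Ham. '
SPEAKER = re.compile(r'^\s*([A-Z][a-z]+)\.\s')

def speaker_counts(lines):
    counts = Counter()
    for line in lines:
        match = SPEAKER.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample = [
    "Ham. To be, or not to be, that is the question.\n",
    "  Whether 'tis nobler in the mind to suffer\n",
    "Oph. Good my lord,\n",
    "Ham. I humbly thank you.\n",
]
counts = speaker_counts(sample)
print(counts.most_common())  # [('Ham', 2), ('Oph', 1)]
```

`counts.most_common(1)` gives the most frequent speaker and `min(counts, key=counts.get)` one of the least frequent. Note that this pattern would also count a line that happens to begin with a one-word sentence such as `Farewell. `, so the result is worth spot-checking.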