# Safe dict reading

Define a function `safe_dict(d, k)` that takes a Python dict `d` and a key `k` and reads from the dictionary safely, even when the key is not in it. If the key is missing, it should return 0 instead of raising an error.

```
d = {1 : 2, 3 : 4}
safe_dict(d, 1) -> 2
safe_dict(d, 'cat') -> 0
```

In [1]:
def safe_dict(d, k):
    # dict.get returns the default (0) when the key is missing,
    # and also works when the dictionary is empty
    return d.get(k, 0)


d = {1 : 2, 3 : 4}

print(safe_dict(d, 3))

4
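
Another reading of "makes it safe to read" is to wrap the dictionary itself, so that plain indexing never fails. A minimal sketch, assuming a hypothetical `SafeDict` subclass (not part of the exercise) that relies on Python's `dict.__missing__` hook:

```python
class SafeDict(dict):
    """Hypothetical dict subclass: missing keys read as 0 instead of raising KeyError."""

    def __missing__(self, key):
        # Called by dict.__getitem__ only when the key is absent.
        return 0


safe = SafeDict({1: 2, 3: 4})
print(safe[1])      # 2
print(safe['cat'])  # 0
```

Reading a missing key this way does not insert it, so `'cat' in safe` stays `False`.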


# File Reading: Hamlet Exercises

Open `hamlet.txt` in the `data` folder.

### 1. Mentions of Hamlet

How many times is Hamlet mentioned in the book?

Use Python and line iteration to count it up.

In [None]:
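# raw string so the backslashes in the Windows path are not treated as escapes
path = r"C:\Users\Yuri\Documents\concordia_bootcamp\file_string\m1-4-files-strings\data\hamlet.txt"
count = 0
with open(path, "r") as f:
    for line in f:
        # count every occurrence on the line, ignoring case
        count += line.lower().count('hamlet')

print(count)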

### 2. File Reading as a .py program

Make a Python file that defines a function that counts the number of times Hamlet is mentioned, using the code from the previous exercise.

Then import it in your notebook and call it here.

In [None]:
# raw triple-quoted string so the Windows path is written to the file unchanged
source = r'''
def count_hamlet():
    path = r"C:\Users\Yuri\Documents\concordia_bootcamp\file_string\m1-4-files-strings\data\hamlet.txt"
    count = 0
    with open(path, "r") as f:
        for line in f:
            # count every occurrence on the line, ignoring case
            count += line.lower().count('hamlet')
    return count
'''
with open('yeah.py', 'w') as f:
    f.write(source)

import yeah as y

hamlet_count = y.count_hamlet()

print(hamlet_count)

### 3. Unique words in Hamlet

Write a program that counts the unique words in Hamlet.

In [None]:
def counting(file):
    """Return a dictionary word -> number of occurrences."""
    count = {}
    for word in file.read().split():
        count[word] = count.get(word, 0) + 1
    return count


path = r"C:\Users\Yuri\Documents\concordia_bootcamp\file_string\m1-4-files-strings\data\hamlet.txt"
with open(path, "r") as text:
    word_counts = counting(text)

# every key is a distinct word, so the number of keys is the unique-word count
print(len(word_counts))
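
Splitting on whitespace treats `Hamlet`, `hamlet,` and `Hamlet.` as three different words. One way to normalise before counting is sketched below, assuming a word is a run of letters and apostrophes (a simplification) and using a short sample string in place of the file contents. The helper name `unique_words` is hypothetical.

```python
import re


def unique_words(text):
    """Return the set of distinct lowercase words in text."""
    # Assumption: a word is a run of letters and apostrophes.
    return set(re.findall(r"[a-z']+", text.lower()))


# Sample text standing in for the contents of hamlet.txt.
sample = "Hamlet, Prince of Denmark. Enter Hamlet and the Prince."
words = unique_words(sample)
print(len(words))  # 7
```

For the book itself, the same function could be called on the string returned by `file.read()`.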

# File Reading 2: A Python library.

In the `data` folder, you will find a folder called `csrgraph`, which is a Python library.

### 1. File count

Count the `.py` files in the library using the `os` package.

In [None]:
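import os

path = r"C:\Users\Yuri\Documents\concordia_bootcamp\file_string\m1-4-files-strings\data\csrgraph"

py_count = 0
# os.walk also visits the subfolders of the library
for root, dirs, files in os.walk(path):
    # keep only Python source files
    py_count += sum(1 for name in files if name.endswith('.py'))

print(py_count)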

### 2. For the following packages, count the number of files that import them:

- pandas 

- numpy

- numba

In [None]:
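import os
import re

root_dir = r"C:\Users\Yuri\Documents\concordia_bootcamp\file_string\m1-4-files-strings\data\csrgraph"

packages = ["pandas", "numpy", "numba"]
counts = dict.fromkeys(packages, 0)

for root, dirs, files in os.walk(root_dir):
    for filename in files:  # iterate over files
        if not filename.endswith('.py'):
            continue  # only Python source files are checked
        file_path = os.path.join(root, filename)
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            source = f.read()
        for package in packages:
            # matches "import numpy" and "from numpy import ..." at the start of a line
            pattern = r"^\s*(?:import|from)\s+" + package + r"\b"
            if re.search(pattern, source, flags=re.MULTILINE):
                counts[package] += 1  # each file counts at most once per package
print(counts)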

# First NLP Program: IDF

Given a list of documents, the inverse document frequency (IDF) is a basic statistic of how much information each word carries: the fewer documents a word occurs in, the higher its score.

The IDF formula is:

$$IDF(w) = \ln\left(\dfrac{N}{1 + n(w)}\right)$$

Where:

- $w$ is the token (unique word),
- $n(w)$ is the number of documents that $w$ occurs in,
- $N$ is the total number of documents
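
As a quick check of the formula, the values for a two-document corpus can be worked out by hand. The sketch below assumes $N = 2$, with one word that appears in both documents and one that appears in only one; the variable names are illustrative only.

```python
import math

N = 2  # total number of documents

idf_in_both = math.log(N / (1 + 2))  # n(w) = 2
idf_in_one = math.log(N / (1 + 1))   # n(w) = 1

print(round(idf_in_both, 3))  # -0.405
print(idf_in_one)             # 0.0
```

Because of the $+1$ in the denominator, a word that occurs in every document gets a negative score under this definition.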

Write a function `idf(docs)` that takes in a list of documents, each a list of words, and returns a dictionary `word -> idf score`.

Example:

```
idf([['interview', 'questions'], ['interview', 'answers']]) -> {'questions': 0.0,
                                                                'interview': -0.4,
                                                                'answers': 0.0}
```

In [None]:
import math

corpus = [['interview', 'questions', 'interview'], ['interview', 'answers']]
corpus_union = set().union(*corpus)
print(corpus_union)

# bag of words: total occurrences of each word across the corpus
wordDict = dict.fromkeys(corpus_union, 0)
for doc in corpus:  # avoid naming the loop variable `list`, which shadows the builtin
    for word in doc:
        wordDict[word] += 1


def idf(docs):
    """Return a dictionary word -> ln(N / (1 + n(word)))."""
    docs = [set(doc) for doc in docs]  # count each word once per document
    N = len(docs)
    doc_freq = {}
    for doc in docs:
        for word in doc:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(N / (1 + n)) for word, n in doc_freq.items()}


print(idf(corpus))

# Stretch Goal: TF-IDF on Hamlet

The TF-IDF score is a commonly used statistic for the importance of words. It is the product $TF \times IDF$, where TF is the "term frequency" (i.e. how often the word occurs in the document).

Calculate the TF-IDF dictionary on the Hamlet book.

What's the TF-IDF of "Hamlet"?

What's the word with the highest TF-IDF in the book?

In [None]:
import heapq
from nltk.tokenize import word_tokenize  # requires NLTK's tokenizer data to be downloaded


def tokenize_word():
    """Return the list of tokens in hamlet.txt."""
    path = r"C:\Users\Yuri\Documents\concordia_bootcamp\file_string\m1-4-files-strings\data\hamlet.txt"
    tokens = []
    with open(path, 'r') as f:
        for line in f:
            # collect the tokens instead of printing them, so callers can use them
            tokens.extend(word_tokenize(line))
    return tokens


def test_():
    """Return the 200 most frequent tokens (term frequency only, no IDF yet)."""
    wordfreq = {}
    for token in tokenize_word():
        wordfreq[token] = wordfreq.get(token, 0) + 1
    return heapq.nlargest(200, wordfreq, key=wordfreq.get)

print(test_())
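
The TF-IDF score described in this section can be sketched on a toy corpus. This is a minimal illustration under stated assumptions, not a reference implementation: it uses the standard product form TF × IDF with the IDF formula from the previous section, takes TF to be a word's share of the tokens in its document, and the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one dict per document mapping word -> TF * IDF."""
    n_docs = len(docs)
    # n(w): number of documents that contain each word.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # Assumption: TF is the word's share of the document's tokens;
            # raw counts are also common.
            word: (count / len(doc)) * math.log(n_docs / (1 + doc_freq[word]))
            for word, count in counts.items()
        })
    return scores


# Hypothetical toy corpus standing in for documents taken from the book.
docs = [["to", "be", "or", "not", "to", "be"], ["the", "king"], ["the", "queen"]]
scores = tf_idf(docs)
print(scores[1])
```

With a single document, $N = 1$ and every word has $n(w) = 1$, so the IDF is the same constant for all words and the ranking reduces to term frequency. To get informative scores for Hamlet, the book would need to be split into several documents first, for example one per act or scene.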


# Stretch Goal: Speaker count

Use a regular expression while looping over the `hamlet.txt` file to build a dictionary `character_name -> number of times speaking`.

Who speaks the most often? Who speaks the least often?
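
A possible starting point is sketched below. It assumes that each speech begins with the speaker's name in capitals followed by a period at the start of a line (for example `HAMLET.`); the actual layout of `hamlet.txt` may differ, so the pattern should be checked against the file. The names `SPEAKER` and `count_speakers` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumption: a speech starts with the speaker's name in capitals followed by
# a period at the start of a line, e.g. "HAMLET." (verify against hamlet.txt).
SPEAKER = re.compile(r"^\s*([A-Z][A-Z ]+)\.")


def count_speakers(lines):
    """Return a Counter mapping character_name -> number of speeches."""
    counts = Counter()
    for line in lines:
        match = SPEAKER.match(line)
        if match:
            counts[match.group(1).strip()] += 1
    return counts


# Hypothetical sample standing in for the lines of hamlet.txt.
sample = [
    "HAMLET.",
    "To be, or not to be, that is the question.",
    "OPHELIA.",
    "Good my lord.",
    "HAMLET.",
    "I humbly thank you.",
]
counts = count_speakers(sample)
print(counts.most_common(1))  # [('HAMLET', 2)]
```

For the real file, `count_speakers` could be passed the open file object directly; `counts.most_common(1)` then gives the most frequent speaker and `min(counts, key=counts.get)` one of the least frequent.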