## Name: Sadnan Saquif
## Handle: SSaquif

# Safe dict reading

Define a function `safe_dict(d, k)` that takes a Python dict `d` and a key `k` and makes the dict safe to read, even with keys that aren't in it. If you try to read from the dictionary with a bad key, it should return 0 instead.

```
d = {1 : 2, 3 : 4}
safe_dict(d, 1) -> 2
safe_dict(d, 'cat') -> 0
```

In [1]:
safe_dict = lambda d, key: d[key] if key in d else 0

d = {1 : 2, 3 : 4}

print(safe_dict(d, 1))
print(safe_dict(d, 'cat'))
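
Dictionaries already offer this behaviour through `dict.get`, which accepts a default value. A minimal sketch of that alternative; the name `safe_dict_get` is made up here for illustration:

```python
def safe_dict_get(d, key, default=0):
    # dict.get returns `default` instead of raising KeyError for a missing key
    return d.get(key, default)

d = {1: 2, 3: 4}
print(safe_dict_get(d, 1))      # 2
print(safe_dict_get(d, 'cat'))  # 0
```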

# File Reading: Hamlet Exercises

Open `hamlet.txt` in the `data` folder

### 1. Mentioned Hamlet

How many times is Hamlet mentioned in the book?

Use python and line iteration to count it up

### hamlet.txt needs encoding='utf-8-sig', or else I get `\ufeff` (the byte order mark) as the starting character

In [2]:
def hamlets_in_file(file):
    # The with block closes the file as soon as the block exits
    with open(file, 'r', encoding='utf-8-sig') as buffer:
        data = buffer.read()
    
    # Checking if file is closed    
    print(buffer.closed)
    hamlet_count = data.lower().count("hamlet")
    return hamlet_count

hamlets_in_file('./data/hamlet.txt')
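
The exercise asks for line iteration, while the cell above reads the whole file at once. A possible line-by-line version is sketched below; `count_word_by_line` is a hypothetical name, and the `word` parameter is an assumption added for illustration:

```python
def count_word_by_line(path, word="hamlet"):
    # Iterate over the file object: it yields one line at a time,
    # so the whole book never has to sit in memory at once.
    total = 0
    with open(path, 'r', encoding='utf-8-sig') as buffer:
        for line in buffer:
            total += line.lower().count(word)
    return total
```

Since a match for "hamlet" cannot span a line break, calling `count_word_by_line('./data/hamlet.txt')` should give the same count as the `read()` version.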

### 2. File Reading as a .py program

Make a Python file that defines a function that counts the number of times Hamlet is mentioned, using the code from the previous exercise.

Then import it in your notebook and call it here.

In [3]:
# Using file.write()
def create_py_file(file):
    # 'w' overwrites existing content
    with open(file, 'w') as buffer:
        buffer.write(
"""
def hamlets_in_file(file):
    with open(file, 'r', encoding='utf-8-sig') as buffer:
        data = buffer.read()

    return data.lower().count("hamlet")
"""
        )

# No need to run this again once the file exists
create_py_file('./hamlet_count.py')

In [4]:
from hamlet_count import hamlets_in_file

hamlets_in_file('./data/hamlet.txt')

474

### 3. Unique words in hamlet

Write a program that counts the unique words in hamlet.

In [5]:
import re

# Going to re-use this later,
# so returning a list of the unique words
def unique_words():
    '''
    Splits on spaces, newlines, punctuation and underscores.
    The real number of distinct words is probably lower:
    plurals, possessive nouns and other special characters are not handled,
    and consecutive delimiters produce an empty string that is counted as a word.
    '''
    with open('./data/hamlet.txt','r', encoding='utf-8-sig') as buffer:
        data = buffer.read()

    # Splitting words
    words = re.split('[ ,.?!;:_\n]', data.lower())

    # set() removes duplicates
    return list(set(words))

print(f"Unique Words: {len(unique_words())}")

Unique Words: 4930
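
Splitting on delimiters leaves an empty string wherever two delimiters sit next to each other. One way around that, sketched here under the assumption that a "word" is a run of letters and apostrophes, is to match the words instead of the gaps. `unique_words_in_text` is a hypothetical helper, and it will not give exactly the same count as the cell above:

```python
import re

def unique_words_in_text(text):
    # Match runs of letters/apostrophes rather than splitting on delimiters,
    # so no empty strings are produced.
    return set(re.findall(r"[a-z']+", text.lower()))

sample = "To be, or not to be: that is the question."
print(len(unique_words_in_text(sample)))  # 8
```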


# File Reading 2: A Python library.

In the `data` folder, you will find a folder called `csrgraph` which is a python library.

### 1. File count

Count the `py` files in the library using the `os` package

In [6]:
import os

def pyfiles(directory):
    file_list = os.listdir(directory)
    py_files = []
    for file in file_list:
        if file.endswith('.py'):
            py_files.append(file)
    return py_files

file_list = pyfiles('./data/csrgraph/')
print(len(file_list))

8
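
`os.listdir` only looks at the top level of the folder. If the library had subpackages, a recursive count could be sketched with `os.walk`; `pyfiles_recursive` is a hypothetical name, and whether `csrgraph` actually contains subfolders is not checked here:

```python
import os

def pyfiles_recursive(directory):
    # os.walk visits the directory and every subdirectory below it
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith('.py'):
                found.append(os.path.join(root, name))
    return found
```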


### 2. For the following packages, count the number of files that import them:

- pandas 

- numpy

- numba

In [7]:
# Regex Testing
module = "numpy"
inp_1 = 'import   numpy\n'

inp_2 = 'import numpy     as     n_p \n'

inp_3 = 'from numpy import number \n'

# [ ]+ = 1 or more space
# [ ]* = 0 or more space
pattern_1 = f"(import)[ ]+({module})[ ]*\n"
pattern_2 = f"(import)[ ]+({module})[ ]+(as)[ ]+[a-zA-Z0-9_]+[ ]*\n"
pattern_3 = f"(from)[ ]+({module} import)[ ]+[a-zA-Z0-9_*]+[ ]*\n"

print(f"Pattern 1 {re.search(pattern_1,inp_1)}")
print(f"Pattern 2 {re.search(pattern_2,inp_2)}")
print(f"Pattern 3 {re.search(pattern_3,inp_3)}")

# print(pattern_1)
# print(inp1)

Pattern 1 <re.Match object; span=(0, 15), match='import   numpy\n'>
Pattern 2 <re.Match object; span=(0, 29), match='import numpy     as     n_p \n'>
Pattern 3 <re.Match object; span=(0, 26), match='from numpy import number \n'>


In [8]:
def module_in_files(directory, module):
    # Only check the python files in directory
    py_files = pyfiles(directory)
    
    used = 0
    
    pattern_1 = f"(import)[ ]+({module})[ ]*\n"
    pattern_2 = f"(import)[ ]+({module})[ ]+(as)[ ]+[a-zA-Z0-9_]+[ ]*\n"
    pattern_3 = f"(from)[ ]+({module} import)[ ]+[a-zA-Z0-9_*]+[ ]*\n"
        
    for file in py_files:
        with open(os.path.join(directory, file), 'r') as buffer:
            data = buffer.read()
            re_1 = re.search(pattern_1, data)
            re_2 = re.search(pattern_2, data)
            re_3 = re.search(pattern_3, data)

            if  re_1 or re_2 or re_3:
                used += 1
                
    return used
    
print(f"pandas: {module_in_files('./data/csrgraph/', 'pandas')}")
print(f"numpy: {module_in_files('./data/csrgraph/', 'numpy')}")
print(f"numba: {module_in_files('./data/csrgraph/', 'numba')}")

pandas: 4
numpy: 6
numba: 6
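
The three regex patterns cover the common import forms, but not lines such as `import os, numpy` or `from numpy.linalg import norm`. As an alternative sketch, the standard `ast` module can parse the source and list its import statements. `imports_module` is a hypothetical helper, and it may produce different counts than the regex version:

```python
import ast

def imports_module(source, module):
    # True if `source` imports `module` or one of its submodules
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0); node.module may be None there
            names = [node.module or ""] if node.level == 0 else []
        else:
            continue
        if any(n == module or n.startswith(module + ".") for n in names):
            return True
    return False

print(imports_module("import os, numpy as np\n", "numpy"))         # True
print(imports_module("from numpy.linalg import norm\n", "numpy"))  # True
print(imports_module("import numpy_financial\n", "numpy"))         # False
```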


# First NLP Program: IDF

Given a list of words, the inverse document frequency (IDF) is a basic statistic for how much information each word in the text carries.

The IDF formula is:

$$IDF(w) = \ln\left(\dfrac{N}{1 + n(w)}\right)$$

Where:

- $w$ is the token (unique word),
- $n(w)$ is the number of documents that $w$ occurs in,
- $N$ is the total number of documents

Write a function, `IDF(docs)`, that takes in a list of lists of words and returns a dictionary `word -> idf score`

Example:

```
IDF([['interview', 'questions'], ['interview', 'answers']]) -> {'questions': 0.0, 
                                                                'interview': -0.4, 
                                                                'answers': 0.0}


```

In [9]:
import math

def IDF(doc_list):
    N = len(doc_list)
    word_count = {}
    idfs = {}
    
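    # Note: this counts every occurrence, so a word repeated inside
    # one document is counted more than once
    for doc in doc_list:
        for word in doc: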
            
            if word in word_count:
                word_count[word] += 1
            else:                
                word_count[word] = 1
            
    for word in word_count:
        count = word_count[word]
        word_idf = math.log(N/(1+count))
        idfs[word] = round(word_idf,1)
        
    return idfs

IDF([['interview', 'questions'], ['interview', 'answers']])

{'interview': -0.4, 'questions': 0.0, 'answers': 0.0}
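
The formula defines $n(w)$ as the number of documents containing $w$, so a word repeated within one document should contribute only once. A minimal sketch of that reading, using `set` and `collections.Counter`; `idf_by_document` is a hypothetical name and the scores are left unrounded:

```python
import math
from collections import Counter

def idf_by_document(docs):
    # set(doc) makes each document contribute at most once per word,
    # so doc_freq[w] is the number of documents containing w
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    return {word: math.log(n_docs / (1 + count))
            for word, count in doc_freq.items()}

scores = idf_by_document([['interview', 'interview', 'questions'], ['answers']])
print(scores['interview'])  # 0.0, because 'interview' occurs in only one document
```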

# Stretch Goal: TF-IDF on Hamlet

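The TF-IDF score is a commonly used statistic for the importance of words. In this exercise it is computed as $\frac{TF}{IDF}$, where TF is the "term frequency" (i.e. how often the word occurs in the document). Note that the more common definition multiplies the two, $TF \times IDF$.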

Calculate the TF-IDF dictionary on the Hamlet book.

What's the TF-IDF of "Hamlet"?

What's the word with the highest TF-IDF in the book?

In [10]:
'''The resulting TF-IDF is a little weak, because TF here is the count
over the whole book while IDF treats each line as a document'''
def book_TF(doc):
    book_tf = {}
    
    # Read the file passed in as `doc`
    with open(doc, 'r', encoding='utf-8-sig') as buffer:
        data = buffer.read()

    #Splitting words
    all_words = re.split('[ ,.?!;:_\n]', data.lower())
    
    for word in all_words:
        if word in book_tf:
            book_tf[word] += 1
        else:
            book_tf[word] = 1
    
    return book_tf    


def book_IDF(doc):
    book_idf = []
    
    # Each line of the book is treated as one document
    with open(doc, 'r', encoding='utf-8-sig') as book:
        for line in book:
            words = re.split('[ ,.?!;:_\n]', line.lower())
            book_idf.append(words)
    
    return IDF(book_idf)

def TF_by_IDF(doc):
    tfs = book_TF(doc)
    idfs = book_IDF(doc)
    tf_idf = {}
    
    for word in tfs:
        if word not in tf_idf:
            tf_idf[word] = round(tfs[word]/idfs[word],3)
    
    return tf_idf

def max_key(d):
    cur_max = 0
    max_name = None
    for key in d:
        if d[key] > cur_max:
            cur_max = d[key]
            max_name = key
    return (max_name,cur_max)

result = TF_by_IDF('./data/hamlet.txt')
print(f"1. Hamlet TF/IDF = {result['hamlet']}\n")
print(f"2. Highest TF/IDF = {max_key(result)}\n")
print(f"3. Entire TF/IDF = {result}")

1. Hamlet TF/IDF = 171.111

2. Highest TF/IDF = ('the', 619.444)
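
`max_key` starts its search from `cur_max = 0`, so it would return `(None, 0)` if every score were zero or negative. A shorter sketch using the built-in `max` avoids that case; `max_key_builtin` is a hypothetical name and the scores below are invented toy values, not results from the book:

```python
def max_key_builtin(d):
    # max() with key=d.get picks the key whose value is largest,
    # and works for negative values too (it raises ValueError on an empty dict)
    best = max(d, key=d.get)
    return best, d[best]

print(max_key_builtin({'the': 3.5, 'hamlet': 1.25, 'ghost': -2.0}))  # ('the', 3.5)
```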



# Stretch Goal: Speaker count

Use a regular expression and looping over the `hamlet.txt` file to build a dictionary `character_name -> # times speaking`.

Who speaks the most often? Who speaks the least often?

In [11]:
# Regex Test
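# There might be a space or hyphen between names, so those are allowed
# Raw string, so that \. is passed to the regex engine unchanged
pattern = r'\n[A-Z -]+\.[ ]*\n'

# No period after the name here, so this is not expected to match
res = re.search(pattern, '\nHAMLET\n')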
if res:
    print(res.string)
else:
    print(res)

None


In [12]:
def dialogue_count(doc):
    pattern = r'\n[A-Z -]+\.[ ]*\n'
    actors = None
    dialogues = {}
    with open(doc, 'r', encoding='utf-8-sig') as buffer:
        actors = re.findall(pattern, buffer.read())
        #print(res)
        
    for actor in actors:
        actor = actor.replace('\n','')
        actor = actor.replace('.','')
        if actor in dialogues:
            dialogues[actor] +=1
        else:
            dialogues[actor] = 1
    
    return dialogues
   
dialogue_count('./data/hamlet.txt')

{'BARNARDO': 18,
 'FRANCISCO': 8,
 'HORATIO': 107,
 'MARCELLUS': 31,
 'KING': 102,
 'LAERTES': 62,
 'POLONIUS': 86,
 'HAMLET': 358,
 'QUEEN': 69,
 'BOTH': 1,
 'ALL': 1,
 'OPHELIA': 58,
 'GHOST': 14,
 'REYNALDO': 13,
 'ROSENCRANTZ': 45,
 'GUILDENSTERN': 29,
 'VOLTEMAND': 1,
 'FIRST PLAYER': 8,
 'PROLOGUE': 1,
 'PLAYER KING': 4,
 'PLAYER QUEEN': 5,
 'LUCIANUS': 1,
 'FORTINBRAS': 6,
 'CAPTAIN': 7,
 'GENTLEMAN': 3,
 'DANES': 2,
 'SERVANT': 1,
 'FIRST SAILOR': 2,
 'MESSENGER': 2,
 'FIRST CLOWN': 33,
 'SECOND CLOWN': 12,
 'PRIEST': 2,
 'OSRIC': 25,
 'LORD': 3,
 'FIRST AMBASSADOR': 1}

From the list:
1. Hamlet speaks most often (358 times).
2. Several speakers are tied for least often, with only 1 speech each: BOTH, ALL, VOLTEMAND, PROLOGUE, LUCIANUS, SERVANT and FIRST AMBASSADOR.
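
The most and least frequent speakers can also be read off programmatically. The sketch below uses `collections.Counter` on a small invented sample rather than the book; `speaker_counts` is a hypothetical helper, and it assumes a speaker cue is a line of capitals, spaces or hyphens ending in a period:

```python
import re
from collections import Counter

def speaker_counts(text):
    # With re.MULTILINE, ^ and $ match at the start and end of every line
    cues = re.findall(r"^([A-Z][A-Z -]*)\.[ ]*$", text, flags=re.MULTILINE)
    return Counter(cues)

sample = "HAMLET.\nTo be.\n\nFIRST CLOWN.\nAy.\n\nHAMLET.\nWell.\n"
counts = speaker_counts(sample)
print(counts.most_common(1))  # [('HAMLET', 2)]

# Everyone tied for the lowest count
fewest = min(counts.values())
print([name for name, n in counts.items() if n == fewest])  # ['FIRST CLOWN']
```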