# CSCI4022 Homework 2; Review

## Due Monday, February 8 at 11:59 pm to Canvas

#### Submit this file as a .ipynb with *all cells compiled and run* to the associated dropbox.

***

Your solutions to computational questions should include any specified Python code and results as well as written commentary on your conclusions.  Remember that you are encouraged to discuss the problems with your classmates, but **you must write all code and solutions on your own**.

**NOTES**: 

- Any relevant data sets should be available on Canvas. To make life easier on the graders if they need to run your code, do not change the relative path names here. Instead, move the files around on your computer.
- If you're not familiar with typesetting math directly into Markdown then by all means, do your work on paper first and then typeset it later.  Here is a [reference guide](https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference) linked on Canvas on writing math in Markdown. **All** of your written commentary, justifications and mathematical work should be in Markdown.  I also recommend the [wikibook](https://en.wikibooks.org/wiki/LaTeX) for LaTeX.
- Because you can technically evaluate notebook cells in a non-linear order, it's a good idea to do **Kernel $\rightarrow$ Restart & Run All** as a check before submitting your solutions.  That way if we need to run your code you will know that it will work as expected. 
- It is **bad form** to make your reader interpret numerical output from your code.  If a question asks you to compute some value from the data you should show your code output **AND** write a summary of the results in Markdown directly below your code. 
- 45 points of this assignment are in problems.  The remaining 5 are for neatness, style, and overall exposition of both code and text.
- This probably goes without saying, but... For any question that asks you to calculate something, you **must show all work and justify your answers to receive credit**. Sparse or nonexistent work will receive sparse or nonexistent credit. 
- There is *not a prescribed API* for these problems.  You may answer coding questions with whatever syntax or object typing you deem fit.  Your evaluation will primarily live in the clarity of how well you present your final results, so don't skip over any interpretations!  Your code should still be commented and readable to ensure you followed the given course algorithm.

---

**NAME:** Andrew Pickner

**IDENTIKEY:** 102333599

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import seaborn as sns

# Imports for Problem 1 sanity check
import itertools

# Imports for Problem 2
import re 
import string

***
<a/ id='p1'></a>
[Back to top](#top)
# Problem 1 (Theory: minhashing; 10 pts)

Consider minhash values for a single column vector that contains 10 components/rows. Seven of the rows hold 0 and three hold 1. Consider taking all 10! = 3,628,800 possible distinct permutations of the ten rows. When we choose a permutation of the rows and produce a minhash value for the column, we will use the number of the row, in the permuted order, that is the first with a 1.  Use Markdown cells to demonstrate answers to the following.

#### a) For exactly how many of the 3,628,800 permutations is the minhash value for the column a 9?  What proportion is this?

---

In [2]:
"""
Piazza really clarified this problem a lot for me. Further, piazza influenced me to use the one index 
especially in the context of the indices in these questions, although I do include my answers with zero indices, 
hopefully I make this clear haha...
"""
zeroed_col_vector = np.array([[i for i in range(0,10)]]).T # lol zero index, kinda hacky
zeroed_col_vector

array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])

In [3]:
mathy_col_vector = np.array([[i for i in range(1,11)]]).T # kinda hacky
mathy_col_vector

array([[ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10]])

In [4]:
proof_vector = np.array([[0 if i < 7 else 1 for i in range(0,10)]]).T # lol zero index, kinda hacky
proof_vector

array([[0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [1]])

Exactly $\boxed{0}$, so the proportion is also $0$. Because there are three $1$'s and only seven $0$'s, the latest row in which the first $1$ can appear is row 8 (1-indexed), or row 7 for those still stuck with 0 indexing (myself included). I show what this column vector would look like in Python above, but I will also use LaTeX to further drive my point home.

$\begin{bmatrix}
           0_{1} \\
           0_{2} \\
           0_{3} \\
           0_{4} \\
           0_{5} \\
           0_{6} \\
           0_{7} \\
           1_{8} \\
           1_{9} \\
           1_{10} \\
\end{bmatrix}$

So, this won't be a very formal *proof*, but this permutation of the column vector gives us the maximum possible minhash value of 8, because row 8 is the first row with a 1 in it. Any other permutation would either give us an 8 again (the three $1$'s in the last three rows and the seven $0$'s above them can each be rearranged among themselves) or a lower number for the first occurrence of a 1.
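
Here is a small sketch of the definition in code, just to illustrate the argument: it returns the 1-indexed row of the first $1$ and shuffles the column many times to confirm that nothing above 8 ever shows up. The helper name `minhash_value`, the seed, and the number of shuffles are my own choices for illustration, not part of the assignment.

```python
import random

def minhash_value(column):
    """Return the 1-indexed position of the first 1 in a permuted column."""
    return column.index(1) + 1

# The extreme case: all seven zeros come first.
column = [0] * 7 + [1] * 3
print(minhash_value(column))

# Shuffle repeatedly and track the largest minhash value that appears.
rng = random.Random(4022)  # arbitrary seed, only for repeatability
largest = 0
for _ in range(10_000):
    rng.shuffle(column)
    largest = max(largest, minhash_value(column))
print(largest)  # can never exceed 8, since only seven zeros can precede a 1
```

A random sample like this is not a proof, but it is a cheap way to catch an off-by-one mistake in the indexing.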

---

#### b) For exactly how many of the 3,628,800 permutations is the minhash value for the column an 8?

---

Alright, so from part a), we see that getting a minhash value of $9$ is simply not possible given we have three ones in our column vector. However, if we use the same column vector I created above:

$\begin{bmatrix}
           0_{1} \\
           0_{2} \\
           0_{3} \\
           0_{4} \\
           0_{5} \\
           0_{6} \\
           0_{7} \\
           1_{8} \\
           1_{9} \\
           1_{10} \\
\end{bmatrix}$

We can see that we *do* have the chance to get a minhash value of $8$. There are $3!$ ways to permute the three ones in the last three rows, and $7!$ ways to permute the leading zeros. Besides the permutation I've provided, our column vector could also look like this:

$\begin{bmatrix}
           0_{7} \\
           0_{6} \\
           0_{5} \\
           0_{4} \\
           0_{3} \\
           0_{2} \\
           0_{1} \\
           1_{10} \\
           1_{9} \\
           1_{8} \\
\end{bmatrix}$

Thus, there are $3! \cdot 7! = (3\cdot 2 \cdot 1) \cdot (7 \cdot 6\cdot \ldots \cdot 2 \cdot 1)$ ways we could permute the column vector and still get a minhash value of $8$. More specifically, there are $\boxed{30,240}$ permutations out of the $3,628,800$ total permutations that would give us a minhash value of $8$. In other words, we'd get a minhash value of $8$ about $0.83\%$ of the time.
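
As a cross-check on the count above, the same answer can be reached as a probability. A minhash value of $8$ means the first seven rows of the permuted column are all zeros, so filling the rows one at a time without replacement gives

$$P(\text{minhash} = 8) = \frac{7}{10}\cdot\frac{6}{9}\cdot\frac{5}{8}\cdot\frac{4}{7}\cdot\frac{3}{6}\cdot\frac{2}{5}\cdot\frac{1}{4} = \frac{7!\,3!}{10!} = \frac{1}{120} \approx 0.0083,$$

and $\frac{1}{120}\cdot 10! = 30{,}240$, which agrees with the count above.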

---

#### c) For exactly how many of the 3,628,800 permutations is the minhash value for the column a 3?

---

**This is what I started out doing...**

This part needs to be broken down further... 

For starters, we need a $1$ in the $3$rd index and zeros in the first and second indices. I'll start by permuting the zeros, which, as in the last part, can be done in $7!$ possible ways.

Next, we need a $1$ to be in the third index, and because we have three ones we can choose our $1$ at index $3$ to be any of them, giving us $3$ possibilities.

And finally, we can place the two remaining $1$'s into any two of the remaining $7$ indices at the end of the column vector. This is its own permutation which can be broken down even further... There are $5!$ distinct ways we could order the remaining $0$'s, and $2!$ ways we could arrange...

**Wasn't really into where I was headed, so I started over...**

So my initial attempt had some correct elements but I was overcomplicating it. Like the last problem I'm going to break the column vector into two segments. 

First, we choose $2$ zeros from the $7$ total zeros to be our first two values of the column vector. We have $P(7, 2) = \frac{7!}{(7-2)!} = 7\cdot 6 = 42$ ways that these first two zeros can be chosen. 

Next, we can choose the $1$ that sits in the third index in $3$ different ways.

And finally, there are $7!$ possible ways we can permute the remaining $7$ values.

Thus we would have $42 \cdot 3 \cdot 7! = \boxed{635,040}$ ways of getting a minhash value of $3$, which is $17.5\%$ of all permutations. I liked this answer more than where I was at on my initial attempt, but I figured I could code something up relatively quickly to check my work, so I have the answers to b) and c) in a Python cell below, and they match the answers I got by hand (a) is too trivial to hardcode).


---

In [5]:
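# Enumerate all 10! orderings. itertools.permutations treats the three 1's (and the seven 0's)
# as distinct by position, which matches counting permutations of the rows.
perms = list(itertools.permutations([1, 1, 1, 0, 0, 0, 0, 0, 0, 0]))

# b) minhash value 8 (1-indexed): the three 1's must fill the last three rows
b = np.sum([1 if perm[7] == 1 and perm[8] == 1 and perm[9] == 1 else 0 for perm in perms])
print(b) # sanity check

# c) minhash value 3 (1-indexed): rows 1 and 2 hold 0 and row 3 holds 1
c = np.sum([1 if perm[0] == 0 and perm[1] == 0 and perm[2] == 1 else 0 for perm in perms])
print(c) # sanity check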

30240
635040
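
The hand counts for a), b), and c) all follow one pattern: for a minhash value of $m$, choose an ordered arrangement of $m-1$ zeros for the top rows, choose which $1$ sits in row $m$, then permute the remaining $10-m$ rows freely. A minimal sketch of that formula is below. The name `count_minhash` is something I made up for illustration, and `math.perm` needs Python 3.8 or newer.

```python
import math

def count_minhash(m, n_rows=10, n_ones=3):
    """Number of row permutations whose first 1 lands in row m (1-indexed)."""
    n_zeros = n_rows - n_ones
    # math.perm(n, k) is 0 when k > n, which covers the impossible cases (m = 9, 10).
    zeros_on_top = math.perm(n_zeros, m - 1)   # ordered choice of m-1 zeros
    which_one = n_ones                         # which 1 sits in row m
    rest = math.factorial(n_rows - m)          # free ordering of what is left
    return zeros_on_top * which_one * rest

for m in range(1, 11):
    print(m, count_minhash(m))

# Every permutation has exactly one minhash value, so the counts must total 10!.
print(sum(count_minhash(m) for m in range(1, 11)) == math.factorial(10))
```

Unlike the brute-force cell, this does not need to hold all 3,628,800 tuples in memory.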


***
<a/ id='p3'></a>
[Back to top](#top)
# Problem 2 (Applied Minhashing; 35 pts)

In this problem we compare similarities of 5 documents available on http://www.gutenberg.org

 1) The first approximately 10000 characters of Miguel de Unamuno's *Niebla*, written in Spanish, in the file `niebla.txt`
 
 2) The first approximately 10000 characters of Miguel de Cervantes's *The Ingenious Gentleman Don Quixote of La Mancha*, written in Spanish, in the file `DQ.txt`
 
 3) The first approximately 10000 characters of Homer's *The Odyssey*, translated into English by Samuel Butler, in the file `odyssey.txt`
 
 4) The first approximately 10000 characters of Kate Chopin's *The Awakening* in the file `awaken.txt`
 
 5) The entirety of around 12000 characters of Kate Chopin's *Beyond the Bayou* in the file `BB.txt`
 
### a) Clean the 5 documents, scrubbing all punctuation, changing all letters to lower case, and removing accent marks as appropriate.  

You should have only 27 unique characters in each book/section after cleaning, corresponding to white spaces and the 26 letters.  


**For this problem, you may import any text-based packages you desire to help wrangle the data.**  I recommend looking at some functions within `string` or the RegEx `re` packages.

You can and probably should use string methods such as `str.lower`, `str.replace`, etc.

All 5 documents have been saved in UTF-8 encoding.




In [6]:
def clean(filename):
    # https://docs.python.org/3/library/functions.html#open
    with open(filename, 'r', encoding='utf-8') as file:
        # Reads the whole file into cleaned_text
        cleaned_text = file.read()
        
        # Changes cleaned_text to all lowercase characters
        cleaned_text = cleaned_text.lower()
        
        # https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string
        # Although I didn't run my own timing test, the answers there suggest this is among the quickest ways to
        # strip punctuation in newer versions of Python.
        # It also took me a while to get the regular expression right...
        
        # Remove all punctuation, digits, and other miscellaneous symbols (replaced with the empty string).
        cleaned_text = re.sub(r'[\“\”\«\»\.\,\~\`\"\'\’\!\¡\@\#\$\%\^\&\*\(\)\_\-\—\+\=\<\>\?\¿\/\:\;\{\}\[\]0-9]', r'', cleaned_text)
        # Replace all newline characters with a whitespace. 
        cleaned_text = re.sub(r'\n', r' ', cleaned_text)
        # This is the BOM (byte order mark), which indicates endianness; remove it.
        cleaned_text = re.sub(r'\ufeff', r'', cleaned_text)
        # Removes extraneous spaces and replaces with a single whitespace character.
        cleaned_text = re.sub(r' +', r' ', cleaned_text)
                
        # https://stackoverflow.com/questions/2400504/easiest-way-to-replace-a-string-using-a-dictionary-of-replacements
        # I saw this solution floating around and figured I'd use it to replace the accented characters because
        # this is pretty modular. If I need to run my program on another text with other accented characters I can 
        # add them as I see fit.
        accented_chars = {
            'á': 'a',
            'é': 'e',
            'í': 'i',
            'ó': 'o',
            'ú': 'u',
            'ü': 'u',
            'ñ': 'n',
        }
        pattern = re.compile('|'.join(accented_chars.keys()))
        cleaned_text = pattern.sub(lambda x: accented_chars[x.group()], cleaned_text)
        # Sanity Check, 
        # len(set(cleaned_text)) should be 27 (except for Spanish documents)
#         print(len(set(cleaned_text)))
        # just checking which characters I still have to clean
#         print(sorted(set(cleaned_text)))

        return cleaned_text

In [7]:
texts = {'niebla.txt': None, 'DQ.txt': None, 'odyssey.txt': None, 'awaken.txt': None, 'BB.txt': None}
for key in texts.keys():
    print(key)
    texts[key] = clean(key)

niebla.txt
DQ.txt
odyssey.txt
awaken.txt
BB.txt
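
As an aside on the accent handling in `clean`, the hand-written dictionary could probably be replaced by Unicode normalization from the standard library. This is a sketch of an alternative and not what I ran above. The helper names `strip_accents` and `leftover_chars` and the sample sentence are made up for illustration.

```python
import string
import unicodedata

def strip_accents(text):
    """Decompose accented letters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def leftover_chars(text):
    """Characters outside the 27 allowed ones (a-z and the space)."""
    allowed = set(string.ascii_lowercase + ' ')
    return set(text) - allowed

sample = 'el pingüino comió ñoquis'
print(strip_accents(sample))
print(leftover_chars(strip_accents(sample)))
```

Printing `leftover_chars(texts[key])` for each document would be a quick way to see which symbols the regular expression still misses.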



### b) Compute exact similarity scores between the documents.  Are these the expected results?

Notes:
- You may choose or explore different values of $k$ for your shingles.
- You may choose to shingle on words and create an n-gram model, but it is recommended you shingle on letters as described in class
- You may construct your characteristic matrix or characteristic sets with or without hash functions (e.g. by using `set()`).  Note that choice of hash function should change heavily with $k$!

In [8]:
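def shingler(text, k):
    # I use a list comprehension to shingle a document with size k.
    # Then I take the set of the list so we don't have any duplicates.
    # Note: because i runs all the way to len(text) - 1, the last k - 1 slices are shorter than k;
    # range(len(text) - k + 1) would keep only full-length shingles.
    return set([text[0+i:k+i] for i in range(0, len(text))])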

def jaccard_similarity(text1, text2, k=5):
    # Shingle each document using our shingle function.
    shingles1 = shingler(text1, k)
    shingles2 = shingler(text2, k)
    
    # The intersection holds the shingles shared by both documents and the union holds every shingle
    # found in either one; only the sizes of these two sets are needed.
    intersection = len(shingles1.intersection(shingles2))
    union        = len(shingles1.union(shingles2))
    
    return intersection / union

In [9]:
for k in range(1, 11):
    keys = texts.keys()
    for key1 in keys:
        for key2 in keys:
            if key1 != key2:
                print('k = {} - Jaccard similarity between {} and {} is: {:0.5f}'.format(k, key1, key2, jaccard_similarity(texts[key1], texts[key2], k=k)))
    print('\n')

k = 1 - Jaccard similarity between niebla.txt and DQ.txt is: 0.96154
k = 1 - Jaccard similarity between niebla.txt and odyssey.txt is: 0.96296
k = 1 - Jaccard similarity between niebla.txt and awaken.txt is: 0.96296
k = 1 - Jaccard similarity between niebla.txt and BB.txt is: 0.96296
k = 1 - Jaccard similarity between DQ.txt and niebla.txt is: 0.96154
k = 1 - Jaccard similarity between DQ.txt and odyssey.txt is: 0.92593
k = 1 - Jaccard similarity between DQ.txt and awaken.txt is: 0.92593
k = 1 - Jaccard similarity between DQ.txt and BB.txt is: 0.92593
k = 1 - Jaccard similarity between odyssey.txt and niebla.txt is: 0.96296
k = 1 - Jaccard similarity between odyssey.txt and DQ.txt is: 0.92593
k = 1 - Jaccard similarity between odyssey.txt and awaken.txt is: 1.00000
k = 1 - Jaccard similarity between odyssey.txt and BB.txt is: 1.00000
k = 1 - Jaccard similarity between awaken.txt and niebla.txt is: 0.96296
k = 1 - Jaccard similarity between awaken.txt and DQ.txt is: 0.92593
k = 1 - Jacc

k = 7 - Jaccard similarity between DQ.txt and awaken.txt is: 0.00015
k = 7 - Jaccard similarity between DQ.txt and BB.txt is: 0.00060
k = 7 - Jaccard similarity between odyssey.txt and niebla.txt is: 0.00054
k = 7 - Jaccard similarity between odyssey.txt and DQ.txt is: 0.00035
k = 7 - Jaccard similarity between odyssey.txt and awaken.txt is: 0.06308
k = 7 - Jaccard similarity between odyssey.txt and BB.txt is: 0.06298
k = 7 - Jaccard similarity between awaken.txt and niebla.txt is: 0.00144
k = 7 - Jaccard similarity between awaken.txt and DQ.txt is: 0.00015
k = 7 - Jaccard similarity between awaken.txt and odyssey.txt is: 0.06308
k = 7 - Jaccard similarity between awaken.txt and BB.txt is: 0.07340
k = 7 - Jaccard similarity between BB.txt and niebla.txt is: 0.00046
k = 7 - Jaccard similarity between BB.txt and DQ.txt is: 0.00060
k = 7 - Jaccard similarity between BB.txt and odyssey.txt is: 0.06298
k = 7 - Jaccard similarity between BB.txt and awaken.txt is: 0.07340


k = 8 - Jaccard si

In [10]:
k = 7

keys = texts.keys()
for key1 in keys:
    for key2 in keys:
        if key1 != key2:
            print('k = {} - Jaccard similarity between {} and {} is: {:0.5f}'.format(k, key1, key2, jaccard_similarity(texts[key1], texts[key2], k=k)))
    print('\n')

k = 7 - Jaccard similarity between niebla.txt and DQ.txt is: 0.05480
k = 7 - Jaccard similarity between niebla.txt and odyssey.txt is: 0.00054
k = 7 - Jaccard similarity between niebla.txt and awaken.txt is: 0.00144
k = 7 - Jaccard similarity between niebla.txt and BB.txt is: 0.00046


k = 7 - Jaccard similarity between DQ.txt and niebla.txt is: 0.05480
k = 7 - Jaccard similarity between DQ.txt and odyssey.txt is: 0.00035
k = 7 - Jaccard similarity between DQ.txt and awaken.txt is: 0.00015
k = 7 - Jaccard similarity between DQ.txt and BB.txt is: 0.00060


k = 7 - Jaccard similarity between odyssey.txt and niebla.txt is: 0.00054
k = 7 - Jaccard similarity between odyssey.txt and DQ.txt is: 0.00035
k = 7 - Jaccard similarity between odyssey.txt and awaken.txt is: 0.06308
k = 7 - Jaccard similarity between odyssey.txt and BB.txt is: 0.06298


k = 7 - Jaccard similarity between awaken.txt and niebla.txt is: 0.00144
k = 7 - Jaccard similarity between awaken.txt and DQ.txt is: 0.00015
k = 7 

---

So obviously smaller shingles present issues. At $k = 1$ every pair shown scores above $0.92$, simply because all five documents use nearly the same alphabet, so those scores say nothing about language or author. Once $k$ reaches $7$ we start seeing meaningful differences between these texts.

At $k = 7$, Niebla and DQ score $0.05480$ against each other but at most $0.00144$ against any of the English texts. This makes sense because the two Spanish texts share common Spanish words and letter sequences that rarely occur in English. (At $k = 1$ the Spanish excerpts appear to lack a letter or two that the English texts use, which is why their scores there sit just below $1$.)

Another comparison that was supposed to be relatively similar was between Awaken and BB, because these texts were written by the same author. These two texts were the most similar pair ($0.07340$), but I found that they were also relatively similar to the Odyssey ($0.06308$ and $0.06298$), albeit less similar than they are to each other. I suppose this makes sense because all three are in English, even though the Odyssey is a translation.

From my test above, the results pass the sanity check: the Spanish texts are most similar to each other, and the two texts written by the same author are most similar to each other. I conclude that $k$ needs to be at least $7$ to glean any useful information from these similarity scores.
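
To illustrate why small $k$ is uninformative, here is a toy sketch on two short made-up strings. The names `k_shingles` and `jaccard` are hypothetical helpers, and unlike `shingler` above this version keeps only full-length shingles.

```python
def k_shingles(text, k):
    """Set of all length-k substrings of text (full-length shingles only)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    return len(a & b) / len(a | b)

doc1, doc2 = 'the cat sat', 'the cat ran'
for k in (1, 3, 5):
    score = jaccard(k_shingles(doc1, k), k_shingles(doc2, k))
    print('k = {}: {:0.3f}'.format(k, score))
```

The two strings differ only in their last word, yet the score drops as $k$ grows because every shingle that overlaps the changed word stops matching.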

---

### c) Implement minhashing with 1000 hash functions on the 5 documents, checking your results against those in part b).

- You may choose your own value of $p$ as the modulus of the hash functions.  You are encouraged to use the example code from the minhashing in class notebook to start you out.

In [11]:
def hash_func(row, nhash):
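    # Use the "universal hash" (a*x + b) mod p, where a, b are random ints and p is a prime
    # larger than the number of rows (distinct shingles) in the characteristic matrix.
    # Re-seeding on every call regenerates the same A and B, so each row is hashed by the same nhash functions.
    np.random.seed(4022)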
    A = np.random.choice(range(0,10000), size=nhash)
    B = np.random.choice(range(0,10000), size=nhash)
    p = 999983
    return [(A[i]*row + B[i]) % p for i in range(nhash)]
    
def minhash_signature(k, documents, nhash=1000):    
    # Create a list of all of the different sets of shingles (i.e. if we have 5 documents, we should have 5 sets of shingles)
    document_keys = documents.keys()
    shingle_list = [shingler(documents[key], k) for key in document_keys]
    # sanity check
#     print(shingle_list)
#     print(len(shingle_list))
    
    # Create a set of all the different shingles found in the documents.
    shingles = set()
    for shingle in shingle_list:
        shingles = shingles.union(shingle)
    # sanity check
#     print(shingles)
#     print(len(shingles))
    
    # Create characteristic matrix
    characteristic = np.full([len(shingles), len(shingle_list)], fill_value=0)
    # iterate through the set of all the shingles found in all of the documents
    for row, shingle in enumerate(shingles):
        # iterate through the list of shingle sets to figure out if a given document contains a given shingle
        # here, col is synonymous with document
        for col, shingle_set in enumerate(shingle_list):
            if shingle in shingle_set:
                characteristic[row, col] = 1
    # sanity check
#     print(characteristic)

    # Create signature matrix
    signature = np.full([nhash, len(shingle_list)], fill_value=np.inf)

    # I kinda just grabbed the code from the in-class notebook and retooled it in my style of coding
    # and so it would work with my characteristic matrix I create above.
    
    # For each row of the characteristic matrix... 
    for row in range(len(shingles)):
        # Compute hash values (~permuted row numbers) for that row under each hash function
        hash_vals = hash_func(row, nhash)
        # For each column (document) in the characteristic matrix
        for col in range(len(shingle_list)):
            # If this document contains the shingle, replace the signature matrix element in that column,
            # for each hash function, with the minimum of this row's hash value and the current element
            if characteristic[row, col]==1:
                for h in range(nhash):
                    if hash_vals[h] < signature[h, col]:
                        signature[h, col] = hash_vals[h]
    return signature, document_keys

def minhash_similarity(k, signature_matrix, documents):
    for i1, key1 in enumerate(documents):
        for i2, key2 in enumerate(documents):
            if key1 != key2:
                # Fraction of hash functions on which the two signature columns agree
                sim = np.sum(signature_matrix[:, i1] == signature_matrix[:, i2]) / signature_matrix.shape[0]
                print('k = {} - Minhash similarity between {} and {} is: {:0.5f}'.format(k, key1, key2, sim))                     
        print('\n')

In [12]:
k = 7

sig_matrix, docs = minhash_signature(k, texts)
sig_matrix

array([[ 16.,  39., 102., 251., 188.],
       [ 11.,  65.,  82.,  28., 173.],
       [344.,  57.,  61.,  59.,  67.],
       ...,
       [ 32., 326.,  78., 179.,  78.],
       [  4., 279.,  93., 248.,  35.],
       [ 39., 105.,  82.,  16., 257.]])

In [13]:
minhash_similarity(k, sig_matrix, docs)

k = 7 - Minhash similarity between niebla.txt and DQ.txt is: 0.05800
k = 7 - Minhash similarity between niebla.txt and odyssey.txt is: 0.00100
k = 7 - Minhash similarity between niebla.txt and awaken.txt is: 0.00400
k = 7 - Minhash similarity between niebla.txt and BB.txt is: 0.00000


k = 7 - Minhash similarity between DQ.txt and niebla.txt is: 0.05800
k = 7 - Minhash similarity between DQ.txt and odyssey.txt is: 0.00000
k = 7 - Minhash similarity between DQ.txt and awaken.txt is: 0.00000
k = 7 - Minhash similarity between DQ.txt and BB.txt is: 0.00000


k = 7 - Minhash similarity between odyssey.txt and niebla.txt is: 0.00100
k = 7 - Minhash similarity between odyssey.txt and DQ.txt is: 0.00000
k = 7 - Minhash similarity between odyssey.txt and awaken.txt is: 0.06100
k = 7 - Minhash similarity between odyssey.txt and BB.txt is: 0.06200


k = 7 - Minhash similarity between awaken.txt and niebla.txt is: 0.00400
k = 7 - Minhash similarity between awaken.txt and DQ.txt is: 0.00000
k = 7 
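
For comparison, here is a sketch of how the same signatures might be built without the dense characteristic matrix, by hashing each document's row numbers in one NumPy broadcast. This is an assumption-laden alternative, not the in-class algorithm I used above. The names `minhash_signatures` and `estimated_similarity` and the toy shingle sets are made up, the random draws differ from `hash_func`, and each document is assumed to have at least one shingle.

```python
import numpy as np

def minhash_signatures(shingle_sets, nhash=100, p=999983, seed=4022):
    """Signature matrix (nhash x n_docs) using hashes (a*row + b) mod p."""
    rng = np.random.default_rng(seed)
    a = rng.integers(1, p, size=nhash)   # a >= 1 so no hash is constant
    b = rng.integers(0, p, size=nhash)
    # Give every distinct shingle a row number; requires fewer than p shingles.
    all_shingles = sorted(set().union(*shingle_sets))
    row_of = {s: r for r, s in enumerate(all_shingles)}
    sig = np.empty((nhash, len(shingle_sets)), dtype=np.int64)
    for col, shingles in enumerate(shingle_sets):
        rows = np.array([row_of[s] for s in shingles], dtype=np.int64)
        # hashed[h, j] is hash function h applied to the j-th row of this document
        hashed = (a[:, None] * rows[None, :] + b[:, None]) % p
        sig[:, col] = hashed.min(axis=1)
    return sig

def estimated_similarity(sig, i, j):
    """Fraction of hash functions on which columns i and j agree."""
    return float(np.mean(sig[:, i] == sig[:, j]))

toy_sets = [{'ab', 'bc', 'cd'}, {'ab', 'bc', 'cd'}, {'xy', 'yz'}]
toy_sig = minhash_signatures(toy_sets, nhash=50)
print(estimated_similarity(toy_sig, 0, 1), estimated_similarity(toy_sig, 0, 2))
```

The intermediate `hashed` array has `nhash` times the document's shingle count entries, so this trades memory for speed compared with the triple loop.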



### d) Discussion:

Can we detect expected differences here?  Are the two Spanish documents most similar to each other?  Are the two documents by the same author, with the same theme, the most similar?  What kind of alternatives might have captured the structures between these texts?



---

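Yes, the two Spanish texts are most similar to each other, and so are the two texts by the same author. As we found with the exact Jaccard similarity, the Odyssey is also relatively similar to the two texts by Kate Chopin. The minhash estimates track the exact scores closely: for example, Niebla and DQ come out at $0.05800$ with minhashing versus the exact $0.05480$.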
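
One rough way to judge "close" is to treat each of the 1000 hash functions as an independent trial that agrees with probability equal to the true Jaccard similarity $J$, which gives a standard error of $\sqrt{J(1-J)/n}$. That independence is an assumption on my part. The sketch below applies it to the three pairs whose scores appear in both printed outputs above.

```python
import math

nhash = 1000

# (exact Jaccard from part b, minhash estimate from part c), both at k = 7
pairs = {
    ('niebla.txt', 'DQ.txt'): (0.05480, 0.05800),
    ('odyssey.txt', 'awaken.txt'): (0.06308, 0.06100),
    ('odyssey.txt', 'BB.txt'): (0.06298, 0.06200),
}

def standard_error(j, n):
    """Standard error of a proportion j estimated from n independent trials."""
    return math.sqrt(j * (1 - j) / n)

for (doc1, doc2), (exact, estimate) in pairs.items():
    se = standard_error(exact, nhash)
    gap = abs(estimate - exact)
    print('{} vs {}: gap = {:0.5f}, standard error = {:0.5f}'.format(doc1, doc2, gap, se))
```

Under that assumption, using more hash functions is what shrinks the gap, since the standard error falls like $1/\sqrt{n}$.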

As for alternatives, I could have experimented with different values of $k$, and with more hash functions, since the number of hash functions is what mainly controls how close the minhash estimates land to the true Jaccard values (even though minhashing wasn't that far off). Furthermore, I feel like we would've wanted to keep the accents had we been dealing with Spanish-only documents, because words can change their meaning depending on whether an accent is present. I also feel like it would be beneficial for plagiarism detectors and other similar programs to shingle on word n-grams in addition to characters, although I'm not entirely sure if this is the case or how it would work in practice. Maybe including basic punctuation would be helpful too, because I could see authors using punctuation in distinctive ways, which might have separated Kate Chopin's work from the Odyssey more clearly.
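
For what it's worth, shingling on words instead of characters would only need a small change. This is a minimal sketch with a made-up helper name `word_ngrams` and made-up sentences. I did not run it on the five documents.

```python
def word_ngrams(text, n):
    """Set of all runs of n consecutive words in text."""
    words = text.split()
    return {' '.join(words[i:i + n]) for i in range(len(words) - n + 1)}

a = word_ngrams('the quick brown fox jumps', 2)
b = word_ngrams('the quick red fox jumps', 2)
print(sorted(a & b))                 # shared word pairs
print(len(a & b) / len(a | b))       # Jaccard similarity on word 2-grams
```

These sets could be passed to the same Jaccard or minhash code in place of the character shingles.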

---