# Assignment 3: "Search your transcripts. You will know it to be true." (Part 1)

## © Cristian Danescu-Niculescu-Mizil 2018

## CS/INFO 4300 Language and Information

### Due by midnight on Wednesday February 21st


This is an **individual** assignment.

If you use any outside sources (e.g. research papers, StackOverflow) please list your sources.

In this assignment we will explore the tradeoffs of information retrieval systems by finding newspaper quotes from "Keeping Up With The Kardashians".

**Guidelines**

All cells that contain the blocks that read `# YOUR CODE HERE` are editable and are to be completed to ensure you pass the test-cases. Make sure to write your code where indicated.

All cells that read `YOUR ANSWER HERE` are free-response cells that are editable and are to be completed.

You may use any number of notebook cells to explore the data and test out your functions, although you will only be graded on the solution itself.


You are unable to modify the read-only cells.

You should also use Markdown cells to explain your code and discuss your results when necessary.
Instructions can be found [here](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html).

All floating-point values should be printed with **2 decimal places** of precision. You can do so using the built-in `round` function.

**Grading**

For code-completion questions you will be graded on passing the public test cases we have included, as well as any hidden test cases that we have supplemented to ensure that your logic is correct.

For free-response questions you will be manually graded on the quality of your answer.


# Setup

Tabloids have been going crazy over our stars. The press took some quotes from the show, including:
       
 - *"It's like a bunch of people running around talking about nothing."*
 - *"Never say to a famous person that this possible endorsment would bring 'er to the spot light."*
 - *"Your yapping is making my head ache!"*
 - *"I'm going to Maryland, did I tell you?"*
 
We need to find out who said each of these, and in which episode. But since we're information scientists, that's not enough. We want to build an efficient search engine for retrieving where such quotes come from in the future.

What makes this difficult is that journalists often modify the quotes, so exact matching will not always work.

In [7]:
from __future__ import print_function
import numpy as np
import math
from collections import defaultdict
from nltk.tokenize import TreebankWordTokenizer
import Levenshtein  # package python-Levenshtein

In [8]:
queries = [u"It's like a bunch of people running around talking about nothing.",
           u"Never say to a famous person that this possible endorsment would bring 'er to the spot light.",
           u"Your yapping is making my head ache!",
           u"I'm going to Maryland, did I tell you?"]

## Load the data

Load the transcripts provided in the `kardashian-transcripts.json` file.

In [9]:
import json
with open("kardashian-transcripts.json", "r") as f:
    transcripts = json.load(f)
print(len(transcripts[0]))

851


## Reorganize the data

For this assignment, we'll consider documents to be individual message lines. The provided transcripts are grouped differently. We reorganize the data as a list of messages, where the messages are dictionary structures as provided.

In [10]:
flat_msgs = [m for transcript in transcripts for m in transcript]

# Searching the collection

The first and easiest thing to try is to directly compare the newspaper quote to the transcript strings.  If the press just copy-pasted from the transcript website, this might work.

## Find all messages that include the given quotes exactly.

Print the episode title, speaker name and full message, for all messages that exactly contain a given quote.

Write this as a function `verbatim_search` and run the function for each of the 4 quotes.

### Q1 Write a function `verbatim_search` that looks for exact matches of a query in each message.

Use `in`: `'efg' in 'cdefgh'` is `True`.

In [14]:
def verbatim_search(query, msgs):
    """ Verbatim search
    
    Arguments
    =========
    
    query: string,
        The query we are looking for.
        
    msgs: list of dicts,
        Each message in this list has a 'text' field with
        the raw document.
    
    Returns
    =======
    result: list of messages
        All messages that exactly contain the query string.
    
    """
    # YOUR CODE HERE
    # Keep every message whose raw text contains the query as a substring.
    return [message for message in msgs if query in message['text']]
    

In [15]:
# This is an autograder test. Here we can test the function you just wrote above.
"""Check that verbatim_search returns the correct output"""
msgs = verbatim_search(queries[0], flat_msgs)
assert len(msgs) == 1
msgs = verbatim_search(queries[1], flat_msgs)
assert len(msgs) == 0

for query in queries:
    print(query)
    print("===")
    for msg in verbatim_search(query, flat_msgs):
        print("{}: {}\n\t({})\n".format(msg['speaker'],
                                        msg['text'],
                                        msg['episode_title']))
    print()
    

It's like a bunch of people running around talking about nothing.
===
BRUCE: It's like a bunch of people running around talking about nothing.
	(Keeping Up With the Kardashians - Kourt's First Cover)


Never say to a famous person that this possible endorsment would bring 'er to the spot light.
===

Your yapping is making my head ache!
===

I'm going to Maryland, did I tell you?
===
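
The empty results for the last three queries are what one would expect if the printed quote differs from the transcript by even a single character. A minimal sketch, using made-up messages (the `toy_msgs` list and the `count_exact_matches` helper are hypothetical and not taken from the transcripts), shows how strict the substring test is:

```python
# Hypothetical toy messages, only to illustrate how strict `in` is.
toy_msgs = [{'text': u"Your yapping is giving me a headache!"},
            {'text': u"I'm going to Maryland. Did I tell you?"}]

def count_exact_matches(query, msgs):
    """Count messages whose text contains the query verbatim."""
    return sum(1 for m in msgs if query in m['text'])

# A verbatim fragment is found...
print(count_exact_matches(u"going to Maryland", toy_msgs))                   # 1
# ...but changed punctuation and capitalization defeat the match.
print(count_exact_matches(u"going to Maryland, did I tell you?", toy_msgs))  # 0
```

This brittleness is the motivation for moving to a similarity measure in the next section.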



## Find the most similar messages to the quotes in terms of Edit Distance


This section is solved using the [python-Levenshtein](https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html) package.

### Q2.1 Write an `edit_distance_search` function.

Instead of searching for verbatim quotes, we will now try using the more flexible edit distance metric to see if this improves the quality of our search results. For each query, we will need to loop over each message, and compute the edit distance each time -- like with verbatim_search, there are no obvious shortcuts.

In [19]:
def edit_distance(query_str, msg_str):
    """Edit distance
    
    Arguments
    =========
    
    query_str: string,
        The query string
    
    msg_str: string,
        The message string we want to compare to
    
    Returns
    =======
    
    result: int,
        The edit distance between the two strings. Every modification type
        has a cost of 1.
    """
    return Levenshtein.distance(query_str.lower(), msg_str.lower())

def edit_distance_search(query, msgs):
    """ Edit distance search
    
    Arguments
    =========
    
    query: string,
        The query we are looking for.
        
    msgs: list of dicts,
        Each message in this list has a 'text' field with
        the raw document.
    
    Returns
    =======
    
    result: list of (score, message) tuples.
        The result list is sorted by score such that the closest match
        is the top result in the list.
    
    """
    # YOUR CODE HERE
    # Score every message against the query, then sort so that the
    # smallest distance (the closest match) comes first.
    result = [(edit_distance(query, message['text']), message) for message in msgs]
    return sorted(result, key=lambda x: x[0])

When the edit_distance_search function is completed you can run the lines below to print out the best matches for each query string.

In [21]:
# This is an autograder test. Here we can test the function you just wrote above.
score, _ = edit_distance_search(queries[1], flat_msgs)[0]
assert 38 <= score <= 46


top_10 = []
for query in queries:
    print("#" * len(query))
    print(query)
    print("#" * len(query))

    for score, msg in edit_distance_search(query, flat_msgs)[:10]:
        print("[{:.2f}] {}: {}\n\t({})".format(
            score,
            msg['speaker'],
            msg['text'],
            msg['episode_title']))
        top_10.append(msg)
    print()


#################################################################
It's like a bunch of people running around talking about nothing.
#################################################################
[0.00] BRUCE: It's like a bunch of people running around talking about nothing.
	(Keeping Up With the Kardashians - Kourt's First Cover)
[33.00] KRIS: It's not a bunch of teenagers running around.
	(Keeping Up With the Kardashians - Kris ``The Cougar'' Jenner)
[35.00] KHLOE: It's like, what are you talking about?
	(The Wedding: Keeping Up With the Kardashians)
[37.00] KIM: It's like it has separation anxiety or something.
	(Keeping Up With the Kardashians - Shape Up or Ship Out)
[37.00] KHLOE: It's like I want to learn how to do that.
	(Keeping Up With the Kardashians - Distance Makes the Heart Grow Fonder)
[37.00] KOURTNEY: It's like an explosion in your pantyhose.
	(Keeping Up With the Kardashians - I'd Rather Go Naked... Or Shopping)
[38.00] ROB: I have a bunch of connections in the indus

### Q2.2 Query Discussion and Analysis (Free Response)

Run the search using the code provided. Discuss why it worked, or why it might not have worked, for each query. Do you notice anything different about the costs compared with those discussed in lecture? Please fill out the table below with the costs from the Levenshtein algorithm discussed in lecture and the costs used by the python-Levenshtein package.

Copy-and-paste this markdown table into the cell below for your answers:
```
| Operation    | Cost (Lecture)|  Cost (python-Levenshtein)  | Example
| :----------: |:------------- | :-------------------------- | --------
| Addition     | YOUR ANSWER 1 | YOUR ANSWER 5               | "aa" -> "aab"
| Deletion     | YOUR ANSWER 2 | YOUR ANSWER 6               | "aa" -> "a"
| Substitution | YOUR ANSWER 3 | YOUR ANSWER 7               | "aa" -> "aa"
| Substitution | YOUR ANSWER 4 | YOUR ANSWER 8               | "aa" -> "ab"
```

```
| Operation    | Cost (Lecture)|  Cost (python-Levenshtein)  | Example
| :----------: |:------------- | :-------------------------- | --------
| Addition     |       1       |             1               | "aa" -> "aab"
| Deletion     |       1       |             1               | "aa" -> "a"
| Substitution |       0       |             0               | "aa" -> "aa"
| Substitution |       2       |             1               | "aa" -> "ab"
```
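
The entries in the table above can be checked with a small dynamic-programming sketch. The function `lev` is a hypothetical helper written for illustration (it is not part of python-Levenshtein); it assumes unit insertion and deletion costs and takes the substitution cost as a parameter, 2 for the lecture variant and 1 for the unit-cost variant recorded in the table.

```python
def lev(a, b, sub_cost):
    """Edit distance with unit insert/delete cost and a configurable substitution cost."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                   # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(cur[j - 1] + 1,                                # insertion
                           prev[j] + 1,                                   # deletion
                           prev[j - 1] + (0 if ca == cb else sub_cost)))  # match / substitution
        prev = cur
    return prev[-1]

for src, dst in [("aa", "aab"), ("aa", "a"), ("aa", "aa"), ("aa", "ab")]:
    print(src, "->", dst, "| lecture:", lev(src, dst, 2), "| unit-cost:", lev(src, dst, 1))
```

Only the last row differs between the two cost schemes, since it is the only example that needs a real substitution.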

### Q3 Print the changes that need to be done to each quote to make it look like the closest match.

We've provided some code below that displays the edits that need to be made to a string to transform it into another string. (Yes, we've done most of the work for you!) Your job in this assignment is to visualize the edits for the 4 queries at the top of this assignment to get a feel for how edit distance is working. Use the top matches you found in Q2. Include a short discussion of what seems to work well and what doesn't seem to work well.

In [26]:
a = "kardashians"
b = "dalmatians"
edits = Levenshtein.editops(a, b)
print(edits)

[('replace', 0, 0), ('replace', 2, 2), ('replace', 3, 3), ('delete', 5, 5), ('replace', 6, 5)]
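
Each triple reads as `(operation, position in a, position in b)`. To make that concrete, here is a minimal pure-Python sketch that replays such a list. `apply_editops` is a hypothetical helper written for illustration, not a python-Levenshtein function, and it assumes the operations are sorted by source position, as they are in the output above.

```python
def apply_editops(src, dst, ops):
    """Replay (op, src_pos, dst_pos) triples to turn src into dst."""
    out = []
    cursor = 0  # next unread position in src
    for op, src_pos, dst_pos in ops:
        out.append(src[cursor:src_pos])  # copy the untouched stretch before this edit
        cursor = src_pos
        if op == 'replace':
            out.append(dst[dst_pos])     # emit the new character, skip the old one
            cursor += 1
        elif op == 'delete':
            cursor += 1                  # skip the source character
        elif op == 'insert':
            out.append(dst[dst_pos])     # emit a character without consuming src
    out.append(src[cursor:])             # copy whatever is left
    return "".join(out)

ops = [('replace', 0, 0), ('replace', 2, 2), ('replace', 3, 3),
       ('delete', 5, 5), ('replace', 6, 5)]
print(apply_editops("kardashians", "dalmatians", ops))  # dalmatians
```

The list has five entries, one per unit-cost edit, so its length matches the edit distance between the two strings.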


In [27]:
def print_edits(str_a, str_b, edits):
    output = [[char] for char in str_a]
    indices = np.arange(len(str_a) + 1)
    for op, src, dest in edits:
        if op == 'insert':
            src = indices[src]
            output.insert(src, ["<span class='add'>{}</span>".format(str_b[dest])])
            indices += 1
        elif op == 'replace':
            src = indices[src]
            src_char = output[src][0]
            output[src] = output[src][1:]
            output[src].append("<span class='del'>{}</span><span class='add'>{}</span>".format(src_char, str_b[dest]))
        elif op == 'delete':
            src = indices[src]
            src_char = output[src].pop()
            output[src].append("<span class='del'>{}</span>".format(src_char))
    
    return "<div class='edit'>{}</div>".format("".join("".join(stack) for stack in output))

In [28]:
from IPython.display import HTML, display

In [29]:
HTML("""
<style type="text/css">

.edit {font-size: 20px;}
.del {text-decoration: line-through; color: #aaa;}
.add {color: green; font-weight: bold;}
</style>
""")

In [30]:
HTML(print_edits(a, b, edits))

Visualize the edits for the 4 queries at the top of this assignment. You need to use the `display()` function to render the HTML for the edit visualization.

In [40]:
for i, query in enumerate(queries):
    print("#" * len(query))
    print(query)
    print("#" * len(query))
    
    # YOUR CODE HERE
    # top_10 holds ten results per query, so the best match for query i sits at index 10 * i.
    best = top_10[10 * i]['text']
    display(HTML(print_edits(query, best, Levenshtein.editops(query, best))))

#################################################################
It's like a bunch of people running around talking about nothing.
#################################################################


#############################################################################################
Never say to a famous person that this possible endorsment would bring 'er to the spot light.
#############################################################################################


####################################
Your yapping is making my head ache!
####################################


######################################
I'm going to Maryland, did I tell you?
######################################


### Q4 Changing the costs (Free Response)

As you may have noticed, the Levenshtein package does not provide a way to customize costs.
Below we provide code that calculates the Levenshtein distance between two strings with customizable insertion, deletion, and substitution costs. Try a few simple manipulations of the costs and observe how they affect the Levenshtein distance. Why does the Levenshtein distance go down in the example below?

In [41]:
def edit_distance(query, message,insertion_cost=1,deletion_cost=1,substitution_cost=2):
    """ Edit distance calculator
    
    Arguments
    =========
    
    query: query string,
        
    message: message string,
    
    insertion_cost: cost of insertion,
    
    deletion_cost: cost of deletion,
    
    substitution_cost: cost of substitution
    
    Returns
    =======
    
    The edit distance between query and message under the given costs.
    """
    
    m = len(query) + 1
    n = len(message) + 1

    chart = {}
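    # Borders: i deletions down the first column, j insertions along the
    # first row, scaled by the custom costs so they agree with the recurrence.
    for i in range(m): chart[i,0] = i * deletion_cost
    for j in range(n): chart[0,j] = j * insertion_cost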
    for i in range(1, m):
        for j in range(1, n):
            chart[i, j] = min(
                chart[i, j-1] + insertion_cost,
                chart[i-1, j] + deletion_cost,
                chart[i-1, j-1] + (0 if query[i-1] == message[j-1] else substitution_cost)
            )
    return chart[m-1, n-1]

In [42]:
print(edit_distance("birthday","sunday"))
print(edit_distance("birthday","sunday",insertion_cost=5,deletion_cost=1,substitution_cost=1))

8
5


#### Why does the above example have its Levenshtein distance go down?



An optimal alignment needs no insertions in either call, so raising the insertion cost to 5 has no effect. In the second call the substitution cost drops from 2 to 1. Since the alignment substitutes three characters (and deletes two), the total cost goes down by 3, from 8 to 5.
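
To see which operations an optimal alignment actually uses, one can walk back through the chart. The sketch below is a hypothetical variant of the function above (the name `edit_distance_with_counts` and the tie-breaking rule are assumptions made for illustration); it returns the distance together with the number of insertions, deletions, and substitutions along one optimal path.

```python
def edit_distance_with_counts(query, message, insertion_cost=1,
                              deletion_cost=1, substitution_cost=2):
    """Return (distance, operation counts) for one optimal alignment."""
    m, n = len(query), len(message)
    chart = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        chart[i][0] = i * deletion_cost
    for j in range(1, n + 1):
        chart[0][j] = j * insertion_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = query[i - 1] == message[j - 1]
            chart[i][j] = min(chart[i][j - 1] + insertion_cost,
                              chart[i - 1][j] + deletion_cost,
                              chart[i - 1][j - 1] + (0 if same else substitution_cost))

    # Walk back from the bottom-right corner, preferring the diagonal on ties.
    # Exact equality tests assume costs that add without rounding (e.g. integers).
    counts = {'insert': 0, 'delete': 0, 'substitute': 0}
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = query[i - 1] == message[j - 1]
            step = 0 if same else substitution_cost
            if chart[i][j] == chart[i - 1][j - 1] + step:
                counts['substitute'] += 0 if same else 1
                i, j = i - 1, j - 1
                continue
        if j == 0 or (i > 0 and chart[i][j] == chart[i - 1][j] + deletion_cost):
            counts['delete'] += 1
            i -= 1
        else:
            counts['insert'] += 1
            j -= 1
    return chart[m][n], counts

print(edit_distance_with_counts("birthday", "sunday"))
print(edit_distance_with_counts("birthday", "sunday", insertion_cost=5,
                                deletion_cost=1, substitution_cost=1))
```

With the second set of costs the only optimal alignment deletes two characters and substitutes three, which accounts for the total of 5.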

### Q5 Changing the costs depending on the characters (Free Response)

Sometimes we might want to alter the penalty depending on the type of change we are making. Modify the function so that vowel-vowel and consonant-consonant substitutions now cost 1.5 instead of 2. (All other costs are unaffected.)

Recalculate the top ten closest matches for each query.

***Note: This (more transparent) implementation is considerably slower than the python-Levenshtein package. For this reason we will not run it on all the data, but only on the subset of quotes returned in Q2.1***

In [63]:
###############################################################
### Code completion: Write a function to change the penalty ###
### for vowel-vowel and consonant-consonant edits!          ###
###############################################################

def subs_cost(query,message,i,j):
    VOWELS = {'a','e', 'i', 'o', 'u'}
    # YOUR CODE HERE
    q_char, m_char = query[i-1], message[j-1]
    if q_char == m_char:
        return 0
    # Same class (vowel-vowel or consonant-consonant) is cheaper. Note that
    # every non-vowel character, including spaces and punctuation, is
    # treated as a consonant here.
    if (q_char in VOWELS) == (m_char in VOWELS):
        return 1.5
    return 2

In [64]:
def edit_distance(query, message, insertion_cost=1, deletion_cost=1, substitution_cost=subs_cost):
    """ Edit distance calculator
    
    Arguments
    =========
    
    query: query string,
        
    message: message string,
    
    insertion_cost: cost of insertion,
    
    deletion_cost: cost of deletion,
    
    substitution_cost: function of substitution cost
    
    Returns
    =======
    
    The edit distance between the lowercased query and message.
    """
    
    query = query.lower()
    message = message.lower()
    m = len(query) + 1
    n = len(message) + 1

    chart = {}
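    # Borders: i deletions down the first column, j insertions along the
    # first row, scaled by the custom costs so they agree with the recurrence.
    for i in range(m): chart[i,0] = i * deletion_cost
    for j in range(n): chart[0,j] = j * insertion_cost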
    for i in range(1, m):
        for j in range(1, n):
            chart[i, j] = min(
                chart[i, j-1] + insertion_cost,
                chart[i-1, j] + deletion_cost,
                chart[i-1, j-1] + substitution_cost(query,message,i,j)
            )
    return chart[m-1, n-1]
    

In [65]:
# This is an autograder test. Here we can test the function you just wrote above.
score, _ = edit_distance_search(queries[1], top_10)[0]
assert 48 <= score <= 56


for query in queries:
    print("#" * len(query))
    print(query)
    print("#" * len(query))

    for score, msg in edit_distance_search(query, top_10)[:10]:
        print("[{:.2f}] {}: {}\n\t({})".format(
            score,
            msg['speaker'],
            msg['text'],
            msg['episode_title']))
    print()

#################################################################
It's like a bunch of people running around talking about nothing.
#################################################################
[0.00] BRUCE: It's like a bunch of people running around talking about nothing.
	(Keeping Up With the Kardashians - Kourt's First Cover)
[36.50] KRIS: It's not a bunch of teenagers running around.
	(Keeping Up With the Kardashians - Kris ``The Cougar'' Jenner)
[39.00] KHLOE: It's like, what are you talking about?
	(The Wedding: Keeping Up With the Kardashians)
[43.00] KHLOE: It's like I want to learn how to do that.
	(Keeping Up With the Kardashians - Distance Makes the Heart Grow Fonder)
[43.00] KOURTNEY: It's like an explosion in your pantyhose.
	(Keeping Up With the Kardashians - I'd Rather Go Naked... Or Shopping)
[43.50] KIM: It's like it has separation anxiety or something.
	(Keeping Up With the Kardashians - Shape Up or Ship Out)
[44.00] BRUCE: That's why you're running around wearing

#### Why didn't we have to worry about also changing insertion and deletion costs? (Given that we know a substitution can be achieved by an insertion and a deletion.) What would be a similar alteration where we would need to also change the insertion and deletion costs?

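At each step the algorithm takes the minimum over three options: insertion, deletion, and substitution. A substitution can always be emulated by a deletion followed by an insertion, at a cost of 1 + 1 = 2. The new substitution cost is either 1.5 or 2, which never exceeds 2, so a substitution is never more expensive than the delete-plus-insert route and the insertion and deletion costs can stay as they are. If we instead raised a substitution cost above 2 (for example, charging 3 for vowel-consonant substitutions), the algorithm would simply choose a deletion plus an insertion for a total of 2, and the higher penalty would have no effect unless the insertion and deletion costs were raised as well.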
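
The cap imposed by the delete-plus-insert route is easy to observe directly. The sketch below uses a hypothetical helper, `edit_distance_const`, with unit insertion and deletion costs and a single fixed substitution cost (a simplification of the character-dependent cost used above):

```python
def edit_distance_const(query, message, substitution_cost):
    """Edit distance with unit insertion/deletion costs and a fixed substitution cost."""
    prev = list(range(len(message) + 1))  # row for the empty query prefix
    for i, q_char in enumerate(query, 1):
        cur = [i]
        for j, m_char in enumerate(message, 1):
            sub = 0 if q_char == m_char else substitution_cost
            cur.append(min(cur[j - 1] + 1,    # insertion
                           prev[j] + 1,       # deletion
                           prev[j - 1] + sub))  # match or substitution
        prev = cur
    return prev[-1]

# "cat" -> "cut" needs exactly one character changed.
for cost in (1.5, 2, 3, 10):
    print(cost, edit_distance_const("cat", "cut", cost))
```

For costs of 1.5 and 2 the distance equals the substitution cost, while for 3 and 10 it stays at 2, because deleting `a` and inserting `u` is cheaper than substituting.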