# 🌐 **Let's play the Wiki Game!**

In today's notebook, you'll implement search algorithms to build a program that can play the Wiki Game. If you've never seen or played it, the Wiki Game is a race in which your goal is to get from one Wikipedia page to another as fast as possible, navigating only by clicking links on each page. To play the Wiki Game, [check out the link here](https://www.thewikigame.com/).

Our goal in this notebook is to find a shortest path from one page to another. There may be several shortest paths, but we only need to find and return one.

In [None]:
#@title Run this to import the data!
!curl -L -o wikipedia_grouped.zip "https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/Deep%20Dives/AI%20%2B%20Game%20Playing/Main%20Curriculum/Wikipedia%20Search/wikipedia_grouped.zip"
!unzip wikipedia_grouped.zip

import sqlite3 as sl


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1991M  100 1991M    0     0   201M      0  0:00:09  0:00:09 --:--:--  214M
Archive:  wikipedia_grouped.zip
  inflating: wikipedia_grouped.db    


In [None]:
#@title Run this to initialize some database functions!
# Titles in the database use underscores instead of spaces, so provide explicit methods to convert
def title_to_db_format(title: str):
  """
    Replace spaces in a title string with underscores to conform to database format
  """
  return title.replace(" ", "_")

def db_format_to_title(title: str):
  """
    Replace underscores in a title string with spaces to convert from database format
  """
  return title.replace("_", " ")

# Article IDs are returned as strings because it makes the sqlite3 database queries easier
def get_article_id(title, db_con, redirect = False):
  """
    Takes a title string in database format and returns the corresponding article ID
    If redirect is true returns the target of the redirect if the title is one

    inputs:
      title: string
      db_con: sqlite3 connection object
    outputs:
      article ID (string)
  """
  with db_con:
    id = db_con.execute("""
      SELECT id, is_redirect FROM pages WHERE title = ?    
    """, (title,)).fetchone()
    if id:
      if id[1] and redirect:
          return get_redirect(id[0], db_con)
      return str(id[0])
  print("Couldn't find page with title {} in database".format(title))

def get_article_titles(ids, db_con):
  """
    Takes a list of article IDs and returns a list of corresponding article titles
    IDs that cannot be found in the database are skipped (the RuntimeError below is commented out)

    inputs:
      ids: list of strings
      db_con: sqlite3 connection object
    outputs:
      article titles (list of strings)
  """
  titles = []
  missing = []
  with db_con:
    for id in ids:
      title_row = db_con.execute("""
        SELECT title FROM pages WHERE id = ?    
      """, (str(id),)).fetchone()
      if title_row:
        titles.append(title_row[0])
      else:
        missing.append(id)
  # if missing:
  #   print("Couldn't find all requested ids. Missing ids {}".format(missing))
    # raise RuntimeError("Couldn't find all requested ids. Missing ids {}".format(missing))
  return titles
        
def get_article_title(id, db_con):
  """
    Alias for get_article_titles for a single ID

    inputs:
      id: string
      db_con: sqlite3 connection object
    outputs:
      title (string)
  """
  titles = get_article_titles([id], db_con)
  if titles != []:
    return titles[0]
  return ''

def get_redirect(source_id, db_con):
  """
    Gets target ID of a redirect page

    inputs:
      source_id: string
      db_con: sqlite3 connection object
    outputs:
      target id (string)
  """
  with db_con:
      target_id = db_con.execute("""
          SELECT target_id FROM redirects WHERE source_id = ?    
      """, (source_id,)).fetchone()
      if target_id:
          return str(target_id[0])
  print("Couldn't find redirect with id {} in database".format(source_id))

def get_page_links_from_id(source_id, db_con, return_titles = False):
  """
    Gets all page IDs that the source page links to.
    If return_titles is true returns page titles instead of IDs

    inputs:
      source_id: string
      db_con: sqlite3 connection object
    outputs:
      list of strings
  """
  with db_con:
    raw_ids = db_con.execute("""
      SELECT target_ids FROM links WHERE source_id = ?    
    """, (source_id,)).fetchone()
    if raw_ids is None:
      return []
    ids = raw_ids[0].split(',')
    if return_titles:
      return get_article_titles(ids, db_con)
    return ids

wikipedia_data = sl.connect('wikipedia_grouped.db')

# 📊 **Understanding the "Graph"**

To implement the wiki search algorithm, we first have to understand what we're working with. Take a look at the image below. How many individual Wikipedia pages does it link to?

**Note:** The links on the top and left sidebars do not count, and the `[show]` and `[edit]` buttons do not count. Remember to only count links that lead to another Wikipedia page (no external links!)

<img src="https://drive.google.com/uc?export=view&id=1Ji3L2kQKP9DSzzP1hPqbGBYA1P2YmO6P"  
  width="2000"
  height="auto"
/>

If you'd like to take a look at the Wikipedia page itself, follow the [link here](https://en.wikipedia.org/wiki/Kismet_(robot))!

In [None]:
#@title How many links are on the page?
num_links = 17 #@param {type:"slider", min:0, max:30, step:1}

if num_links == 17:
  print("That's right!")
elif num_links>17:
  print("Too many!")
else:
  print("Not enough!")

That's right!


# 🔤 **What does our data look like?**

There are over ***6 million*** English Wikipedia pages. That's a lot! Because of this, our dataset for this notebook is **MASSIVE**. To cut down on the storage space, we've used SQLite, which is a database engine that uses SQL. If you'd like to learn more about SQLite, [check out the link here](https://www.sqlite.org/index.html)!

Our data is stored in an SQLite database, which we can connect to using the variable `wikipedia_data`. Since we're not teaching you about databases in this course, we've provided a number of functions you can use to interact with the data. For all intents and purposes, treat `wikipedia_data` as though it's the data itself (even though it's really just a connection to the data).
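To make the "connection, not data" idea concrete, here is a minimal sketch that uses a tiny in-memory database instead of the real file. The table name `pages` and the columns `id`, `title`, and `is_redirect` mirror the queries in the helper functions above; the column types and the row itself are made up for illustration.

```python
import sqlite3

# Hypothetical in-memory stand-in for wikipedia_grouped.db
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pages (id INTEGER, title TEXT, is_redirect INTEGER)")
con.execute("INSERT INTO pages VALUES (1, 'Ring_Pop', 0)")  # made-up ID

# The connection is only a handle: we send it a query and it hands back rows
row = con.execute("SELECT id FROM pages WHERE title = ?", ("Ring_Pop",)).fetchone()
print(row)  # a tuple holding the ID stored above
con.close()
```

The provided helper functions wrap queries like this one, which is why you only ever need to pass them `wikipedia_data`.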

If you run the cell below, you'll be able to see that `wikipedia_data` is an SQLite connection.



In [None]:
wikipedia_data

<sqlite3.Connection at 0x7fa6f0721f10>

## 🔢 **Converting between titles and IDs**

Wikipedia articles are stored in our dataset as ID numbers, so to search through the data set, we need to search through different linked ID numbers. 

For example, let's consider the Wikipedia article *Artificial Intelligence*. How can we get the corresponding ID?

We've provided two useful functions for you to do this: 
* `title_to_db_format`, which takes in a **string title** of a Wikipedia article and returns it in the format of the titles in the database
* `get_article_id`, which takes in a **title in database format** and the **database connection** (we introduced this above!) and returns the corresponding ID number for the title
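As a rough sketch of why the formatting step comes first, the snippet below uses stand-ins for the two helpers so it runs without the database. The names `to_db_format`, `toy_ids`, and `toy_get_article_id`, and the IDs themselves, are hypothetical.

```python
# Stand-in for title_to_db_format: spaces become underscores
def to_db_format(title):
  return title.replace(" ", "_")

# Hypothetical lookup table playing the role of the database
toy_ids = {"Ring_Pop": "7", "Push_Pop": "8"}

# Stand-in for get_article_id: returns None when the title is unknown
def toy_get_article_id(db_title):
  return toy_ids.get(db_title)

print(toy_get_article_id(to_db_format("Ring Pop")))  # formatted first, so the title is found
print(toy_get_article_id("Ring Pop"))                # raw title with a space is not found
```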

Knowing these two functions, use the cell below to find the ID number of the article *Artificial Intelligence*.

In [None]:
title = title_to_db_format("Artificial Intelligence")

id_num = get_article_id(title,wikipedia_data) # this should be the correct ID number!

In [None]:
#@title Run this to see if you have the correct ID number!
if id_num == '524509' or id_num == 524509:
  print("That's the right ID number!")
else:
  print("That's NOT the right ID number.")

That's the right ID number!


Sometimes, we may want to convert from ID numbers back to Wikipedia article titles. How can we do this?

We've provided two more useful functions for you that act as inverses of the ones above: 
* `db_format_to_title`, which takes in a **title in database format** of a Wikipedia article and returns it as a string title
* `get_article_title`, which takes in **an article ID** and the **database connection** (we introduced this above!) and returns the corresponding **title in database format** 

Knowing these two functions, use the cell below to find the article title for the article with ID number `3364578`.

In [None]:
id_number = 3364578
article_title = db_format_to_title(get_article_title(id_number,wikipedia_data)) # this should be the correct article title!
print(article_title)

List of Crayola crayon colors


In [None]:
#@title Run this to see if you have the correct article title!
if article_title == "List of Crayola crayon colors":
  print("That's the right title!")
else:
  print("That's NOT the right title.")

That's the right title!


## 👬 **Article "Neighbors"**
To have a successful graph-like structure and search algorithm, we need to have the concept of "neighbors". We've seen that for a Wikipedia article, neighboring articles of a page we're looking at are the ones that are linked on that particular page. We have one last function that we can work with:

`get_page_links_from_id` takes in an **article ID** and the **database connection** and returns a list of **article ID**s that are linked on that page.

Use the function above as well as the functions in the **Converting between titles and IDs** section to answer the questions below. Make sure you keep the variables that are initialized for you (those will be used to check your answers!)
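If the idea of "neighbors" feels abstract, this small sketch models it with a plain dictionary. The graph `toy_links` and the helper `toy_neighbors` are hypothetical and are not the real link data.

```python
# Hypothetical link graph: each page title maps to the titles it links to
toy_links = {
  "Ring Pop": ["Topps", "Lollipop", "Push Pop"],
  "Topps": ["Ring Pop", "Push Pop"],
  "Lollipop": [],
}

def toy_neighbors(title):
  # Pages missing from the graph simply have no neighbors
  return toy_links.get(title, [])

print(len(toy_neighbors("Ring Pop")))        # how many links the page has
print("Lollipop" in toy_neighbors("Topps"))  # is one page linked on another?
```

Each question below follows one of these two patterns: count the neighbors, or check whether a title is among them.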

### 🖇 **How many links does the page *Darth Vader* have?**

In [None]:
# use this cell to answer the question!

article = title_to_db_format("Darth Vader")

id = get_article_id(article,wikipedia_data)

num_links_dv = len(get_page_links_from_id(id,wikipedia_data))

In [None]:
#@title Run this to check your answer!
if num_links_dv == 640:
  print("That's the right number of links!")
else:
  print("That's NOT the right number of links.")

That's the right number of links!


### 🖇 **What are the titles of the linked pages on the page *Ring Pop*?**

In [None]:
# Use this cell to answer the question!

ringpop_link_titles = [db_format_to_title(get_article_title(int(i),wikipedia_data)) for i in get_page_links_from_id(get_article_id(title_to_db_format("Ring Pop"),wikipedia_data),wikipedia_data)]
print(ringpop_link_titles)

['Dallas Mavericks', 'Melissa Joan Hart', 'Topps', 'Bazooka (chewing gum)', 'Lollipop', 'Swarovski', 'Ring Pop', 'List of confectionery brands', 'Baby Bottle Pop', 'Push Pop', 'Juicy Drop Pop', '2011 NBA Finals', 'Whistle Pops']


In [None]:
#@title Run this to check your answer!
if set(ringpop_link_titles) == {'Dallas Mavericks', 'Melissa Joan Hart', 'Topps', 'Bazooka (chewing gum)', 'Lollipop', 'Swarovski', 'Ring Pop', 'List of confectionery brands', 'Baby Bottle Pop', 'Push Pop', 'Juicy Drop Pop', '2011 NBA Finals', 'Whistle Pops'}:
  print("That's the right list!")
else:
  print("That's NOT the right list.")

That's the right list!


### 🖇 **Is the page *Cherry* linked on the page *Pineapple*?**

In [None]:
# Use this cell to answer the question!

cherry_linked_on_pineapple = "Cherry" in [db_format_to_title(get_article_title(int(i),wikipedia_data)) for i in get_page_links_from_id(get_article_id(title_to_db_format("Pineapple"),wikipedia_data),wikipedia_data)] # this should either be True or False!

In [None]:
#@title Run this to check your answer!
if cherry_linked_on_pineapple:
  print("Yes, 'Cherry' is linked on the 'Pineapple' page!")
else:
  print("That's NOT correct.")

Yes, 'Cherry' is linked on the 'Pineapple' page!


### 🖇 **Is the page *Psych* linked on the page *Pluto*?**

In [None]:
# Use this cell to answer the question!

psych_linked_on_pluto = "Psych" in [db_format_to_title(get_article_title(int(i),wikipedia_data)) for i in get_page_links_from_id(get_article_id(title_to_db_format("Pluto"),wikipedia_data),wikipedia_data)] # this should either be True or False!

In [None]:
#@title Run this to check your answer!
if not psych_linked_on_pluto:
  print("Correct, 'Psych' is not linked on the 'Pluto' page!")
else:
  print("That's NOT correct.")

Correct, 'Psych' is not linked on the 'Pluto' page!


# 🔍 **Search**
Now that we understand the functions at our disposal, we finally have the means to write our search algorithm! Since we're looking for the shortest path in this notebook, we're going to code BFS (no DFS today!).

`wiki_bfs` takes in three parameters: `start_title`, which is the article title of the starting Wikipedia page, `end_title`, which is the article title of the page we want to end up on, and `db_con`, which is our database connection.
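Before working against the real database, it can help to see BFS on a tiny graph. The sketch below is one possible shape, not the only valid one: `toy_bfs` and `toy_graph` are hypothetical names, and the graph is a plain dictionary rather than Wikipedia data.

```python
from collections import deque

def toy_bfs(graph, start, goal):
  """Return one shortest path from start to goal, or None if unreachable."""
  if start == goal:
    return [start]
  queue = deque([start])
  came_from = {start: None}  # parent of each page; also serves as the seen set
  while queue:
    current = queue.popleft()
    for neighbor in graph.get(current, []):
      if neighbor in came_from:
        continue  # already reached by a path that is at least as short
      came_from[neighbor] = current
      if neighbor == goal:
        # Walk the parents back to the start, then flip the order
        path = [goal]
        while came_from[path[-1]] is not None:
          path.append(came_from[path[-1]])
        return path[::-1]
      queue.append(neighbor)
  return None

toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"],
             "D": ["F"], "E": ["F"], "F": []}
print(toy_bfs(toy_graph, "A", "F"))
```

Recording the parent at the moment a page is first discovered is what makes the path reconstruction reliable.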

**Fill in the function definition below.**

In [None]:
def get_neighbors(title):
  return [db_format_to_title(get_article_title(int(i),wikipedia_data)) for i in get_page_links_from_id(get_article_id(title_to_db_format(title),wikipedia_data),wikipedia_data)]

In [None]:
from collections import deque

def wiki_bfs(start_title, end_title, db_con):
  '''
    Find the path between two Wikipedia articles using BFS
    Parameters:
      start_title: string article title, BFS starting point
      end_title: string article title, BFS ending point
      db_con: database connection to SQLite database
    Returns:
      a list containing article titles representing a path between start_title 
      and end_title. Path should start with start_title and end with end_title.
      Returns None if no path is found.
  '''
  # the start page is already the goal
  if start_title == end_title:
    return [start_title]

  # initialize queue, seen set and came_from (maps each page to the page we reached it from)
  queue = deque([start_title])
  seen = {start_title}
  came_from = {}

  # create loop
  while queue:
    # take item off queue
    current = queue.popleft()

    # look at neighbors of item (get_neighbors reads from the global wikipedia_data connection)
    for neighbor in get_neighbors(current):
      # if neighbor has already been seen, do nothing
      if neighbor in seen:
        continue
      seen.add(neighbor)
      came_from[neighbor] = current

      # if neighbor is goal, calculate the path using came_from and return it
      if neighbor == end_title:
        path = [end_title]
        while path[-1] != start_title:
          path.append(came_from[path[-1]])
        path.reverse() # came_from walks backwards, so flip to start -> end
        return path

      # else, add neighbor onto queue
      queue.append(neighbor)

  # the queue emptied without reaching end_title
  return None

Once you've finished your code, run the cell below to test it out! This cell will run your BFS code and also time it so you can see how long it takes to run. Try changing the start and end locations to different page titles!

In [None]:
%%time

# change the start and end to other articles! Make sure capitalization is right (case sensitive)
start = "Radon"
end = "The Art of Computer Programming"

print("The path taken was: ", wiki_bfs(start, end, wikipedia_data))

KeyboardInterrupt: ignored

# ⏲ **Heuristic Search**
We've successfully searched through the Wikipedia database, but our overall runtime increases a ton the more we have to search. Our BFS algorithm can handle paths of length four or less, but when we go over that, it takes *forever*. There are two reasons for this: one is that the Wikipedia database is enormous (so there are a lot of nodes to search), and the other is that it has a high **branching factor**. The **branching factor** refers to the number of edges going out of each node. In the context of our Wikipedia problem, this means that many pages have ***A TON*** of links on them.
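To get a feel for how quickly the search space grows, here is a rough back-of-the-envelope sketch. It assumes every page has the same number of links and that no page is ever repeated, so it is an upper bound rather than a measurement; the branching factor of 100 is a made-up round number.

```python
def pages_within(depth, branching_factor):
  """Upper bound on the pages reachable in at most `depth` clicks."""
  # 1 start page + b pages one click away + b*b two clicks away, and so on
  return sum(branching_factor ** k for k in range(depth + 1))

for depth in range(1, 6):
  print(depth, pages_within(depth, 100))
```

Each extra click multiplies the work by roughly the branching factor, which is why one more step in the path can turn seconds into hours.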



In [None]:
#@title What page is more likely to link to *Apple*?
most_likely = "List of culinary fruits" #@param ["List of culinary fruits", "Star Wars", "Isaac Newton", "Planet", "Choose one!"]

Right now using just BFS, we're treating every link on the page the same. This means that if we're trying to get the path from ***Massachusetts Institute of Technology* to *Science*** and the pages ***Neuroscience*** and ***Roof and tunnel hacking*** are both linked, our algorithm will treat them the same. As people with an inherent understanding of language, when we see those pages it's easy for us to guess that *Neuroscience* is more likely to lead us to *Science*.

We can use a heuristic to help **target** our search a bit more!

## 🔤 **Recall Word Embeddings**
When you learned about NLP previously, you learned that words can be represented as numbers. Once we create a numerical representation for language, we can easily compute a "similarity score" between words. We're going to use that similarity score as our heuristic!
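As a quick refresher, one common similarity score is cosine similarity: the dot product of two vectors divided by the product of their lengths. The sketch below uses made-up three-dimensional vectors in a hypothetical `toy_vectors` dictionary; real embeddings have many more dimensions.

```python
import math

def cosine_similarity(v1, v2):
  # Dot product divided by the product of the two vector lengths
  dot = sum(a * b for a, b in zip(v1, v2))
  norm1 = math.sqrt(sum(a * a for a in v1))
  norm2 = math.sqrt(sum(b * b for b in v2))
  return dot / (norm1 * norm2)

# Made-up embeddings, for illustration only
toy_vectors = {
  "saxophone": [0.9, 0.1, 0.0],
  "clarinet": [0.8, 0.2, 0.0],
  "planet": [0.0, 0.1, 0.9],
}

print(cosine_similarity(toy_vectors["saxophone"], toy_vectors["clarinet"]))
print(cosine_similarity(toy_vectors["saxophone"], toy_vectors["planet"]))
```

Vectors that point in nearly the same direction score close to 1, while unrelated directions score close to 0.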

In [None]:
#@title Run this to set up our helper functions!
!wget -q --show-progress "https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/Deep%20Dives/AI%20%2B%20Game%20Playing/Main%20Curriculum/Wikipedia%20Search/wiki-news-300d-1M.vec.zip"
!unzip wiki-news-300d-1M.vec.zip

def get_vectors(path='wiki-news-300d-1M.vec', max_tokens=200000):
  '''
  Map the first max_tokens tokens (200,000 by default) to their word embedding vectors
  Parameters:
    path: file containing pre-trained word vectors from FastText model
    max_tokens: number of entries to read before stopping
  Returns:
    embeds: dictionary of tokens and their word embedding vectors
  '''
  embeds = {}
  # the with-block closes the file even if we stop reading early
  with open(path, 'r', encoding='utf-8',
            newline='\n', errors='ignore') as dictionary:
    for line in dictionary:
      tokens = line.rstrip().split(' ')
      embeds[tokens[0]] = [float(x) for x in tokens[1:]]
      if len(embeds) == max_tokens:
        break
  return embeds

def vector_cosine_similarity(vec1,vec2):
  '''
  Calculate the  cosine similarity score between two embedding vectors
  Parameters:
    vec1: embedding for first word
    vec2: embedding for second word
  Returns:
    a float representing the two words' similarity
  '''
  numerator = 0
  for i in range(len(vec1)):
    numerator += vec1[i]*vec2[i]
  mag1 = (sum(elem**2 for elem in vec1))**0.5
  mag2 = (sum(elem**2 for elem in vec2))**0.5
  similarity = numerator/(mag1*mag2)
  return similarity

Archive:  wiki-news-300d-1M.vec.zip
  inflating: wiki-news-300d-1M.vec   


In [None]:
vecs = get_vectors() # dictionary of tokens and their embeddings

In [None]:
#@title Word embedding similarities { vertical-output: true, display-mode: "form" }
word1 = "saxophone" #@param {type:'string'}
word2 = "clarinet" #@param {type:'string'}

if word1 not in vecs:
  print("First word has no vector embedding.")
if word2 not in vecs:
  print("Second word has no vector embedding.")
if word1 in vecs and word2 in vecs:
  print("Similarity between "+word1+" and "+word2+": {:.2f}".format(vector_cosine_similarity(vecs[word1],vecs[word2])))

Similarity between saxophone and clarinet: 0.82


## 🔢 **Average Similarity Heuristic**
We're going to assign each page a heuristic value based on the word similarity between that page's title and the ending article title. However, there are a few catches!
1. One or both words may not have a word embedding. In this case, we should just assign a heuristic value of `0.5`
2. Some article titles will be more than one word long! In this case, the heuristic value should be set to the average of the word similarity values of each word.
3. If both the title of a link and the ending title are multiple words long, you should check and average the similarity value of each word with every other word.
4. Some titles may have parentheses in them! Those should be removed before trying to find the word embedding

Use the **Word embedding similarities** cell above to help you get the answers to these questions, or use **vecs** to code in the lookup yourself.

**Hint:** you can use the string `split` function to break articles with multiple words into individual words! Check out some examples [at the link here](https://www.programiz.com/python-programming/methods/string/split). You can use the string `replace` function to take certain characters out of the string (hint: if you're replacing a character with an empty string `""`, it just deletes the character!). Check out some examples [at the link here](https://www.w3schools.com/python/ref_string_replace.asp).
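The four rules above can be sketched on toy data as follows. This sketch assumes rule 4 means deleting the parenthesis characters themselves (as the `replace` hint suggests) and that both titles are non-empty. The names `toy_similarity`, `pair_score`, `clean_words`, and `toy_average_heuristic` are hypothetical, and the scores are made up rather than taken from real embeddings.

```python
from itertools import product

# Made-up word-pair similarities; any pair not listed has "no embedding"
toy_similarity = {("video", "Space"): 0.2, ("game", "Space"): 0.4}

def pair_score(word1, word2):
  # Rule 1: fall back to 0.5 when a similarity can't be computed
  return toy_similarity.get((word1, word2), 0.5)

def clean_words(title):
  # Rule 4: drop the parentheses, then split the title into words
  return title.replace("(", "").replace(")", "").split()

def toy_average_heuristic(link_title, goal_title):
  # Rules 2 and 3: score every word pair, then average the scores
  pairs = product(clean_words(link_title), clean_words(goal_title))
  scores = [pair_score(a, b) for a, b in pairs]
  return sum(scores) / len(scores)

print(clean_words("Bananagrams (video game)"))
print(toy_average_heuristic("Bananagrams (video game)", "Space"))
```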

#### **Getting vectors**
If you want to code to get these answers, the function `vector_cosine_similarity` will come in handy. This function takes in two *vectors* (which you can get from the `vecs` dictionary) and returns the similarity between the two.

### 🪣 **What should the heuristic value be if the link we're checking is *Orange* and the goal link is *Apple*?**

In [None]:
# Put any code here you need to help you calculate the heuristic value!

heuristic_1 = vector_cosine_similarity(vecs["Orange"],vecs["Apple"]) # change this!

In [None]:
#@title Run this to check your answer!
if round(heuristic_1, 3) == 0.477:
  print("That's the right heuristic value!")
else:
  print("That's NOT the right heuristic value.")

That's the right heuristic value!


### 🪣 **What should the heuristic value be if the link we're checking is *Bananagrams* and the goal link is *Letter*?**

In [None]:
# Put any code here you need to help you calculate the heuristic value!
if "Bananagrams" in vecs and "Letter" in vecs:
  heuristic_2 = vector_cosine_similarity(vecs["Bananagrams"],vecs["Letter"]) # change this!
else:
  heuristic_2 = .5

In [None]:
#@title Run this to check your answer!
if round(heuristic_2, 3) == 0.5:
  print("That's the right heuristic value!")
else:
  print("That's NOT the right heuristic value.")

That's the right heuristic value!


### 🪣 **What should the heuristic value be if the link we're checking is *Darth Vader* and the goal link is *Star Wars*?**

In [None]:
# Put any code here you need to help you calculate the heuristic value!
word1 = "Darth Vader".split()
word2 = "Star Wars".split()
heuristic = []

for i in word1:
  for j in word2:
    if i in vecs and j in vecs:
      heuristic.append(vector_cosine_similarity(vecs[i],vecs[j])) # change this!
    else:
      heuristic.append(.5)

heuristic_3 = sum(heuristic)/len(heuristic) # change this!

In [None]:
#@title Run this to check your answer!
if round(heuristic_3, 3) == 0.354:
  print("That's the right heuristic value!")
else:
  print("That's NOT the right heuristic value.")

That's the right heuristic value!


### 🪣 **What should the heuristic value be if the link we're checking is *Bananagrams (video game)* and the goal link is *MySpace*?**

In [None]:
# Put any code here you need to help you calculate the heuristic value!
word1 = "Bananagrams (video game)".replace('(','').replace(')','').split()
word2 = "MySpace".replace('(','').replace(')','').split()
heuristic = []

for i in word1:
  for j in word2:
    if i in vecs and j in vecs:
      heuristic.append(vector_cosine_similarity(vecs[i],vecs[j])) # change this!
    else:
      heuristic.append(.5)

heuristic_4 = sum(heuristic)/len(heuristic) # change this!
print(heuristic_4)

0.40386978317197214


In [None]:
#@title Run this to check your answer!
if round(heuristic_4, 3) == 0.461:
  print("That's the right heuristic value!")
else:
  print("That's NOT the right heuristic value.")

That's NOT the right heuristic value.


## ⏸ **A Brief Interlude on Sorting**
In class, we haven't talked about how to sort a list by a specific value. Let's explore how to use the function `sort` to do this!

First, check out the link to [this documentation here](https://docs.python.org/3/howto/sorting.html). Take a look at the **Sorting Basics** and **Key Functions** section. We're going to be using the `sort` function, which you use like:
```python
list_name.sort()
```
`sort` also takes in two optional parameters: 

With `reverse`, you can choose if you want the sorted list to be in ascending or descending order. Here's how you could set it to `True`:
```python
list_name.sort(reverse=True)
```
If you set `reverse` to `True`, the sorted list will be in **descending** order, aka high value to low value. If you set it to `False` or don't set it at all, the sorted list will be in **ascending order**, aka low value to high value.

With `key`, you can choose **how** to sort the list. If you don't want to be sorting the list by the values in the list themselves and instead by something else, you can use this. 

The format is something you're probably not used to. It uses what's called an **inline** or **lambda** function. We're not going to go into this a ton, but you can use the examples below and in the docs as a guide.

Here are some examples of using the `key` parameter:


```python
a = [1, 6, 3, 8, 2]
b = [(1,5), (7,3), (4,6), (3,2)]
c = {1:10, 3:4, 6:2, 8:9, 2:1, 7:8, 4:10}

a.sort(reverse=False) # regular sort
# a is now [1, 2, 3, 6, 8]

b.sort(reverse=False, key=lambda x:x[1]) # sorting by the second value in the tuple
# b is now [(3,2), (7,3), (1,5), (4,6)]

a.sort(reverse=False, key=lambda x:c[x]) # sorting by the value in the dict c
# a is now [2, 6, 3, 8, 1]
```
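To connect this back to the search, here is a small sketch that ranks link titles by a score, highest first. The `scores` dictionary is made up for illustration; in the real search the score would come from a heuristic function.

```python
# Hypothetical similarity scores between each link title and the goal
scores = {"Neuroscience": 0.62, "Roof and tunnel hacking": 0.31, "Boston": 0.40}
links = ["Roof and tunnel hacking", "Boston", "Neuroscience"]

# reverse=True puts the most promising (highest-scoring) link first
links.sort(reverse=True, key=lambda title: scores[title])
print(links)
```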






### 📎 **What is the order of the list `[(1,5), (7,3), (4,6), (3,2)]` if we sort it ascending by the first value in the tuple?**

In [None]:
# use this cell to answer the question!

sortlist_order = [(1,5), (3,2), (4,6), (7,3)]

In [None]:
#@title Run this to check your answer!
if sortlist_order == [(1,5), (3,2), (4,6), (7,3)]:
  print("That's the right sorted list!")
else:
  print("That's NOT the right sorted list.")

That's the right sorted list!


### 📎 **Now use the sort function in the cell below to sort `[(1,5), (7,3), (4,6), (3,2)]` ascending by the first value in the tuple.**

In [None]:
# use sort on this list to get the list you came up with above!

sortfn_list_order = sorted([(1,5), (7,3), (4,6), (3,2)])

In [None]:
#@title Run this to check your answer!
if sortfn_list_order == [(1,5), (3,2), (4,6), (7,3)]:
  print("That's the right sorted list!")
else:
  print("That's NOT the right sorted list.")

That's the right sorted list!


### 📎 **What is the order of the list `[(1,5), (7,3), (4,6), (3,2)]` if we sort it descending by the first value in the tuple's value in the dictionary `c`?**

#### `c = {1:10, 3:4, 6:2, 8:9, 2:1, 7:8, 4:10}`

In [None]:
# use this cell to answer the question!

b = [(1,5), (7,3), (4,6), (3,2)]
c = {1:10, 3:4, 6:2, 8:9, 2:1, 7:8, 4:10}

b.sort(reverse=True, key=lambda x:c[x[0]])

print(b)
sort_wdict = b

[(1, 5), (4, 6), (7, 3), (3, 2)]


In [None]:
#@title Run this to check your answer!
if sort_wdict == [(1, 5), (4, 6), (7, 3), (3, 2)]:
  print("That's the right sorted list!")
else:
  print("That's NOT the right sorted list.")

That's the right sorted list!


### 📎 **Now use the sort function in the cell below to sort `[(1,5), (7,3), (4,6), (3,2)]` descending by the first value in the tuple's value in the dictionary `c`.**

In [None]:
c = {1:10, 3:4, 6:2, 8:9, 2:1, 7:8, 4:10}

# use sort on this list to get the list you came up with above!
sortfn_dict = [(1,5), (7,3), (4,6), (3,2)]
sortfn_dict.sort(reverse=True, key=lambda x: c[x[0]]) # highest value in c first

In [None]:
#@title Run this to check your answer!
if sortfn_dict == [(1, 5), (4, 6), (7, 3), (3, 2)]:
  print("That's the right sorted list!")
else:
  print("That's NOT the right sorted list.")

That's the right sorted list!


Now you should have a good idea of how you can use `sort` in your search!

### 🔨 **Implementing the Averaging Heuristic Function**
Now that we have a good idea of how the averaging heuristic function is supposed to work, let's get to it! Fill in the function definition below to make the heuristic function according to the rules we set above.

#### **Getting vectors**
The function `vector_cosine_similarity` will come in handy. This function takes in two **vectors** (which you can get from the `vecs` dictionary) and returns the similarity between the two.

**Hint:** you can use the string `split` function to break articles with multiple words into individual words! Check out some examples [at the link here](https://www.programiz.com/python-programming/methods/string/split). You can use the string `replace` function to take certain characters out of the string (hint: if you're replacing a character with an empty string `""`, it just deletes the character!). Check out some examples [at the link here](https://www.w3schools.com/python/ref_string_replace.asp).

In [None]:
def average_title_heuristic(title1, title2):
  '''
    Takes in two article titles and should return the corresponding 
    heuristic value for the titles (averaging if there's > 1 similarity).
  '''
  if title1=='' or title2 == '': # keep this bit in!
    return 0

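  # Strip parentheses, then split each title into individual words
  words1 = title1.replace('(','').replace(')','').split()
  words2 = title2.replace('(','').replace(')','').split()
  if not words1 or not words2: # nothing left to compare (e.g. a title of only spaces)
    return 0

  # Compare every word in title1 with every word in title2
  similarities = []
  for w1 in words1:
    for w2 in words2:
      if w1 in vecs and w2 in vecs:
        similarities.append(vector_cosine_similarity(vecs[w1], vecs[w2]))
      else:
        similarities.append(0.5) # rule 1: no embedding for at least one word

  return sum(similarities) / len(similarities)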

Once you've finished coding this function, use the cell below to write **at least two** test cases (you can copy some of the ones from above) to make sure your function is working right!

In [None]:
# Use this cell to write your test cases!
print(average_title_heuristic("Bananagrams (video game)", "MySpace"))
print(average_title_heuristic("Darth Vader", "Star Wars"))
print(average_title_heuristic("Bananagrams", "Letter"))
print(average_title_heuristic("Orange", "Apple"))

0.40386978317197214
0.3538097764167604
0.5
0.4774549995050186


## 🔢 **Maximum Similarity Heuristic**
Now, we're going to create a new heuristic. This time, when there's more than one word in the link or the end title that we're checking, we're going to take the **maximum** similarity value between the words in the two instead of the average. Points 1 and 4 above still apply. Here's what points 2 and 3 change to:

2. Some article titles will be more than one word long! In this case, the heuristic value should be set to the **maximum** of the word similarity values of each word.
3. If both the title of a link and the ending title are multiple words long, you should check the similarity value of each word with every other word and return the **maximum** value.
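As a quick illustration of how the two heuristics can disagree, consider a link title where one word matches the goal strongly and the others barely match. The scores in `pair_scores` are made up.

```python
# Hypothetical word-pair similarities for one link title against the goal
pair_scores = [0.9, 0.1, 0.2]

average_value = sum(pair_scores) / len(pair_scores)
maximum_value = max(pair_scores)

print(average_value)  # pulled down by the weak matches
print(maximum_value)  # decided by the single strongest match
```

The maximum rewards a title as soon as any one of its words is close to the goal, while the average asks for the whole title to be related.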

Answer the questions below again using **maximum similarity**.

### 🪣 **What should the heuristic value be if the link we're checking is *Darth Vader* and the goal link is *Star Wars*?**

In [None]:
# Put any code here you need to help you calculate the heuristic value!
word1 = "Darth Vader".replace('(','').replace(')','').split()
word2 = "Star Wars".replace('(','').replace(')','').split()
heuristic = []

for i in word1:
  for j in word2:
    if i in vecs and j in vecs:
      heuristic.append(vector_cosine_similarity(vecs[i],vecs[j])) # change this!
    else:
      heuristic.append(.5)

mheuristic_1 = max(heuristic) # change this!

In [None]:
#@title Run this to check your answer!
if round(mheuristic_1, 3) == 0.407:
  print("That's the right heuristic value!")
else:
  print("That's NOT the right heuristic value.")

That's the right heuristic value!


### 🪣 **What should the heuristic value be if the link we're checking is *Bananagrams (video game)* and the goal link is *MySpace*?**

In [None]:
# Put any code here you need to help you calculate the heuristic value!
word1 = "Bananagrams (video game)".replace('(','').replace(')','').split()
word2 = "MySpace".replace('(','').replace(')','').split()
heuristic = []

for i in word1:
  for j in word2:
    if i in vecs and j in vecs:
      heuristic.append(vector_cosine_similarity(vecs[i],vecs[j])) # change this!
    else:
      heuristic.append(.5)

mheuristic_2 = max(heuristic) # change this!

In [None]:
#@title Run this to check your answer!
if round(mheuristic_2, 3) == 0.5:
  print("That's the right heuristic value!")
else:
  print("That's NOT the right heuristic value.")

That's the right heuristic value!


### 🔨 **Implementing the Maximum Heuristic Function**
Now that we have a good idea of how the maximum heuristic function is supposed to work, let's get to it! Fill in the function definition below to make the heuristic function according to the rules we set above.

The same helpful functions as above apply here. In fact, your overall function will likely be very similar.

In [None]:
def max_title_heuristic(title1, title2):
  '''
    Takes in two article titles and should return the corresponding
    heuristic value for the titles (taking the maximum if there's > 1 similarity).
  '''
  if title1=='' or title2 == '': # keep this bit in!
    return 0

  # YOUR CODE HERE
  
  word1 = title1.replace('(','').replace(')','').split()
  word2 = title2.replace('(','').replace(')','').split()
  heuristic = []

  for i in word1:
    for j in word2:
      if i in vecs and j in vecs:
        heuristic.append(vector_cosine_similarity(vecs[i],vecs[j])) # change this!
      else:
        heuristic.append(.5)

  returnNum = max(heuristic) # change this!
  return returnNum

Once you've finished coding this function, use the cell below to write **at least two** test cases (you can copy some of the ones from above) to make sure your function is working right!

In [None]:
# Use this cell to write your test cases!
print(max_title_heuristic("Bananagrams (video game)", "MySpace"))
print(max_title_heuristic("Darth Vader", "Star Wars"))

0.5
0.4069661078875473
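
To see how the maximum rule differs from the averaging rule, here is a minimal self-contained sketch. It does not use the notebook's real `vecs` or `vector_cosine_similarity`; the 2-D vectors in `toy_vecs` and the helper names `toy_cosine_similarity` and `pairwise_similarities` are made up for illustration, and "Vader" is deliberately left out so it falls back to the default of 0.5.

```python
import math

# Hypothetical 2-D toy vectors, invented for illustration only.
# The notebook's real `vecs` holds pretrained word embeddings.
toy_vecs = {
    "Darth": [1.0, 0.0],
    "Star": [1.0, 1.0],
    "Wars": [0.0, 1.0],
}

def toy_cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pairwise_similarities(title1, title2, vectors, default=0.5):
    """Score every word pair; use `default` when either word has no vector."""
    words1 = title1.replace("(", "").replace(")", "").split()
    words2 = title2.replace("(", "").replace(")", "").split()
    scores = []
    for w1 in words1:
        for w2 in words2:
            if w1 in vectors and w2 in vectors:
                scores.append(toy_cosine_similarity(vectors[w1], vectors[w2]))
            else:
                scores.append(default)
    return scores

scores = pairwise_similarities("Darth Vader", "Star Wars", toy_vecs)
max_score = max(scores)                 # driven by the single best word pair
avg_score = sum(scores) / len(scores)   # pulled down by weak or unknown pairs
print(round(max_score, 3), round(avg_score, 3))
```

With these toy numbers the maximum comes entirely from the "Darth"/"Star" pair, while the average is dragged down by the unrelated "Darth"/"Wars" pair and the two default values.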


## 🏆 **Implementing Best First Search**
We learned about heuristics and **best** first search in class, and now it's time to implement it! Using the heuristic functions we coded above, write your best first search algorithm in the cell below.

In [None]:
def wiki_best_fs(start_title, end_title, db_con, heuristic):
  '''
    Find the path between two Wikipedia articles using best first search. The
    search should rank pages using the heuristic function that's passed in
    Parameters:
      start_title: string article title, best first starting point
      end_title: string article title, best first ending point
      db_con: database connection to SQLite database
      heuristic: a heuristic function that's passed in
    Returns:
      a list containing article titles representing a path between start_title 
      and end_title. Path should start with start_title and end with end_title
  '''
  if start_title == end_title:
    return [start_title]
  queue = [start_title]
  seen = {start_title}
  came_from = {start_title: None}  # maps each page to the page we reached it from
  while len(queue) != 0:
    current = queue.pop(0)
    neighbors = get_neighbors(current)
    for neighbor in neighbors:
      if neighbor == end_title:
        # walk back from the current page to the start, then flip the order
        path = [end_title]
        while current is not None:
          path.append(current)
          current = came_from[current]
        path.reverse()
        return path
      elif neighbor not in seen:
        seen.add(neighbor)
        came_from[neighbor] = current
        queue.append(neighbor)
    # the heuristic is a similarity score, so put the most similar titles first
    queue.sort(reverse=True, key=lambda x: heuristic(x, end_title))
  return None  # the queue ran out without reaching end_title

Once you've finished your code, run the cells below to test it out! Each cell runs one search and times it so you can see how long it takes. The first cell uses plain BFS, the second cell uses best first search with the averaging heuristic, and the third cell uses best first search with the maximum heuristic.

In [None]:
%%time

# change the start and end to other articles! Make sure capitalization is right (case sensitive)
start = "Banana"
end = "Fruit"

print("The path taken was: ", wiki_bfs(start, end, wikipedia_data))

The path taken was:  ['Fruit', 'Banana']
CPU times: user 1.16 s, sys: 183 ms, total: 1.34 s
Wall time: 1.34 s


In [None]:
%%time

# change the start and end to other articles! Make sure capitalization is right (case sensitive)
start = "Banana"
end = "Fruit"

print("The path taken with the averaging heuristic was: ", wiki_best_fs(start, end, wikipedia_data, average_title_heuristic))

The path taken with the averaging heuristic was:  ['Banana', 'Fruit']
CPU times: user 1.16 s, sys: 183 ms, total: 1.35 s
Wall time: 1.35 s


In [None]:
%%time

# change the start and end to other articles! Make sure capitalization is right (case sensitive)
start = "Banana"
end = "Fruit"

print("The path taken with the max heuristic was: ", wiki_best_fs(start, end, wikipedia_data, max_title_heuristic))

The path taken with the max heuristic was:  ['Banana', 'Fruit']
CPU times: user 1.15 s, sys: 198 ms, total: 1.35 s
Wall time: 1.35 s


Hmm, it seems like the heuristic didn't actually improve our runtime all that much. In some cases, it's wildly slower (maybe even too slow to wait for). Why do you think that is?
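
One possible contributor (not necessarily the only one) is bookkeeping cost: re-sorting the whole queue after every page means the heuristic gets recomputed for every queued title over and over. The sketch below shows one way to avoid that, using Python's `heapq` module as a priority queue so each title is scored once when it is added. It runs on a tiny made-up graph; `toy_links`, `toy_scores`, and `toy_similarity` are hypothetical stand-ins for the database and the title heuristics, not part of the notebook.

```python
import heapq
from itertools import count

def best_first_search(start, goal, neighbors_of, similarity):
    """Best first search that scores each title once, when it is first seen."""
    if start == goal:
        return [start]
    tie_breaker = count()  # keeps heap entries comparable when scores tie
    # heapq is a min-heap, so store negative similarity to pop the best title first
    frontier = [(-similarity(start, goal), next(tie_breaker), start)]
    came_from = {start: None}
    while frontier:
        _, _, current = heapq.heappop(frontier)
        for neighbor in neighbors_of(current):
            if neighbor in came_from:
                continue  # already discovered
            came_from[neighbor] = current
            if neighbor == goal:
                # follow parents back to the start, then reverse
                path = []
                node = goal
                while node is not None:
                    path.append(node)
                    node = came_from[node]
                return path[::-1]
            heapq.heappush(
                frontier, (-similarity(neighbor, goal), next(tie_breaker), neighbor)
            )
    return None

# Hypothetical toy graph and scores, invented for illustration only.
toy_links = {
    "Banana": ["Plant", "Yellow"],
    "Plant": ["Fruit"],
    "Yellow": ["Color"],
    "Color": [],
    "Fruit": [],
}
toy_scores = {"Banana": 0.6, "Plant": 0.8, "Yellow": 0.1, "Color": 0.0, "Fruit": 1.0}

def toy_similarity(title, goal):
    return toy_scores[title]

path = best_first_search("Banana", "Fruit", lambda t: toy_links.get(t, []), toy_similarity)
print(path)
```

Even with cheaper bookkeeping, a heuristic based only on title words can still send the search down unhelpful pages, so this is a sketch of one improvement rather than a full answer to the question above.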

# 🔛 **Extension: Bidirectional BFS**
The heuristic didn't actually improve our runtime, so let's try a different method! Right now we're only running BFS in one direction - from the start to the finish. If we start our BFS from both sides (the start and the goal) we can potentially cut down on the amount of time we spend searching! One of the biggest differences in bidirectional BFS is that we'll be working with two queues!

Let's check out the pseudocode for bidirectional BFS:
```
forward queue = [start]
backward queue = [end]
visited_start = {start}
visited_end = {end}
came_from_start = {start:None}
came_from_end = {end:None}
if end == start:
  return [start]
while neither queue is empty:
  if forward queue isn't empty:
    get first element in queue
    for the neighbors of that element:
      if neighbor in came_from_end:
        reconstruct path (through element and neighbor) and return
      if neighbor not in visited_start:
        add to visited_start
        came_from_start[neighbor] = element
        put on forward queue
  if backward queue isn't empty:
    get first element in queue
    for the neighbors of that element (going backward, these are the pages that link to it):
      if neighbor in came_from_start:
        reconstruct path (through neighbor and element) and return
      if neighbor not in visited_end:
        add to visited_end
        came_from_end[neighbor] = element
        put on backward queue
```
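
The pseudocode above can be sketched in Python on a tiny made-up graph. This is an illustration under stated assumptions, not the notebook's solution: it uses plain dictionaries instead of the database, and the names `invert`, `build_path`, `bidirectional_search`, `toy_out_links`, and `toy_in_links` are hypothetical. Because links only point one way, the backward search follows incoming links (pages that link *to* the current page).

```python
from collections import deque

def invert(out_links):
    """Build a page -> list of pages that link to it."""
    in_links = {title: [] for title in out_links}
    for source, targets in out_links.items():
        for target in targets:
            in_links.setdefault(target, []).append(source)
    return in_links

def build_path(meeting, came_from_start, came_from_end):
    """Join the start-side and end-side halves at the page where they met."""
    path = []
    node = meeting
    while node is not None:          # walk back to the start
        path.append(node)
        node = came_from_start[node]
    path.reverse()
    node = came_from_end[meeting]
    while node is not None:          # walk onward to the end
        path.append(node)
        node = came_from_end[node]
    return path

def bidirectional_search(start, end, out_links, in_links):
    if start == end:
        return [start]
    forward, backward = deque([start]), deque([end])
    came_from_start, came_from_end = {start: None}, {end: None}
    while forward and backward:
        current = forward.popleft()
        for neighbor in out_links.get(current, []):
            if neighbor not in came_from_start:
                came_from_start[neighbor] = current
                if neighbor in came_from_end:
                    return build_path(neighbor, came_from_start, came_from_end)
                forward.append(neighbor)
        current = backward.popleft()
        for neighbor in in_links.get(current, []):
            if neighbor not in came_from_end:
                came_from_end[neighbor] = current
                if neighbor in came_from_start:
                    return build_path(neighbor, came_from_start, came_from_end)
                backward.append(neighbor)
    return None

# Hypothetical toy graph: A -> B -> C -> D
toy_out_links = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
toy_in_links = invert(toy_out_links)
print(bidirectional_search("A", "D", toy_out_links, toy_in_links))
```

In this sketch the `came_from` dictionaries double as the visited sets. Expanding one page at a time from each side finds a connecting path, but it may not always be the shortest one; expanding a whole level at a time from each side is a common way to tighten that.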



In [None]:
def bidirectional_bfs(start_title, end_title, db_con):
  '''
    Find the path between two Wikipedia articles using bidirectional BFS
    Parameters:
      start_title: string article title, BFS starting point
      end_title: string article title, BFS ending point
      db_con: database connection to SQLite database
    Returns:
      a list containing article titles representing a path between start_title 
      and end_title. Path should start with start_title and end with end_title
  '''
  forward_queue = [start_title]
  backward_queue = [end_title]
  visited_start = {start_title}
  visited_end = {end_title}
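  came_from_start = {start_title: None}  # use the parameters, not the notebook globals
  came_from_end = {end_title: None}
  if end_title == start_title:
    return [start_title]  # a path is always a list of titles

  # YOUR CODE HERE: the search loop from the pseudocode above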

Once you've finished your code, run the cells below to test it out! The first cell will run plain BFS, and the second cell will run your bidirectional BFS code. Try changing the start and end locations to different page titles!

In [None]:
%%time

# change the start and end to other articles! Make sure capitalization is right (case sensitive)
start = "Radon"
end = "The Art of Computer Programming"

print("The path taken with BFS was: ", wiki_bfs(start, end, wikipedia_data))

KeyboardInterrupt: ignored

In [None]:
%%time

# change the start and end to other articles! Make sure capitalization is right (case sensitive)
start = "Radon"
end = "The Art of Computer Programming"

print("The path taken with bidirectional BFS was: ", bidirectional_bfs(start, end, wikipedia_data))

By using bidirectional BFS, we can significantly decrease our overall runtime!
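
A rough back-of-the-envelope calculation shows where the savings can come from. The sketch below assumes every page has exactly the same number of links (`branching`) and never revisits a page, which real Wikipedia pages certainly do not satisfy, so treat the numbers as an idealized illustration. The function names are hypothetical.

```python
def pages_explored_one_way(branching, depth):
    """Pages reached by plain BFS down to `depth` clicks in an idealized tree."""
    return sum(branching ** level for level in range(depth + 1))

def pages_explored_two_way(branching, depth):
    """Two searches that each cover about half the distance and meet in the middle."""
    forward_depth = (depth + 1) // 2
    backward_depth = depth // 2
    return (pages_explored_one_way(branching, forward_depth)
            + pages_explored_one_way(branching, backward_depth))

# Example: 10 links per page, goal 4 clicks away
one_way = pages_explored_one_way(10, 4)   # 1 + 10 + 100 + 1000 + 10000
two_way = pages_explored_two_way(10, 4)   # (1 + 10 + 100) twice
print(one_way, two_way)
```

Under these assumptions, one-directional BFS touches 11111 pages while the two half-depth searches touch 222 between them, because the number of pages grows exponentially with depth.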

# 🎉 **That's it! You've finished this notebook :)**