# Redis

In [3]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

# download('https://github.com/AllenDowney/DSIRP/raw/main/utils.py')

## Persistence

Data stored only in the memory of a running program is called "volatile", because it disappears when the program ends.

Data that still exists after the program that created it ends is called
"persistent". In general, files stored in a file system are persistent,
as well as data stored in databases.

A simple way to make data persistent is to store it in a file. For example, before the program ends, it could translate its data structures into a format like [JSON](https://en.wikipedia.org/wiki/JSON) and then write them into a file.
When it starts again, it could read the file and rebuild the data
structures.
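
As a minimal sketch of this approach, the following code writes a dictionary to a JSON file and reads it back. The file name `state.json`, the temporary directory, and the sample data are assumptions chosen for illustration.

```python
import json
import os
import tempfile

# Hypothetical program state; any JSON-compatible structure would work.
data = {'counts': {'python': 3, 'redis': 2}, 'pages': ['a.html', 'b.html']}

# Use a temporary directory so the example does not clutter the working directory.
path = os.path.join(tempfile.mkdtemp(), 'state.json')

# Before the program ends: translate the structure to JSON and write it out.
with open(path, 'w') as fp:
    json.dump(data, fp)

# When the program starts again: read the file and rebuild the structure.
with open(path) as fp:
    restored = json.load(fp)

print(restored == data)
```

Note that the whole structure is written and read in one piece, which is the source of the problems listed next.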

But there are several problems with this solution:

1.  Reading and writing large data structures (like a Web index) would
    be slow.

2.  The entire data structure might not fit into the memory of a single
    running program.

3.  If a program ends unexpectedly (for example, due to a power outage),
    any changes made since the program last started would be lost.

A better alternative is a database that provides persistent storage and
the ability to read and write parts of the database without reading and
writing the whole thing.

There are many kinds of [database management systems](https://en.wikipedia.org/wiki/Database) (DBMS) that provide
these capabilities.

The database we'll use is Redis, which organizes data in structures that are similar to Python data structures.
Among others, it provides lists, hashes (similar to Python dictionaries), and sets.

Redis is a "key-value database", which means that it represents a mapping from keys to values.
In Redis, the keys are strings and the values can be one of several types.

## Redis clients and servers

Redis is usually run as a remote service; in fact, the name stands for
"REmote DIctionary Server". To use Redis, you have to run the Redis
server somewhere and then connect to it using a Redis client.

To get started, we'll run the Redis server on the same machine where we run the Jupyter server.
This will let us get started quickly, but if we are running Jupyter on Colab, the database lives in a Colab runtime environment, which disappears when we shut down the notebook.
So it's not really persistent.

Later we will use [RedisToGo](http://thinkdast.com/redistogo), which runs Redis in the cloud.
Databases on RedisToGo are persistent.

The following cell installs the Redis server and starts it with the `--daemonize yes` option, which runs it in the background so the Jupyter server can resume.

In [8]:
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install redis-server
    !/usr/local/lib/python*/dist-packages/redis_server/bin/redis-server --daemonize yes
else:
    !redis-server --daemonize yes


## redis-py

To talk to the Redis server, we'll use [redis-py](https://redis-py.readthedocs.io/en/stable/index.html).
Here's how we use it to connect to the Redis server.

In [10]:
try:
    import redis
except ImportError:
    !pip install redis



In [12]:
import redis

r = redis.Redis()

The `set` method adds a key-value pair to the database.
In the following example, the key and value are both strings.

In [13]:
r.set('key', 'value')


The `get` method looks up a key and returns the corresponding value.

In [None]:
r.get('key')

The result is not actually a string; it is a [bytestring](https://stackoverflow.com/questions/6224052/what-is-the-difference-between-a-string-and-a-byte-string), that is, a Python `bytes` object.

For many purposes, a bytestring behaves like a string, so for now we will treat it like a string and deal with differences as they arise.
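
To make the difference concrete, here is a small sketch that does not need a Redis server. The literal `b'value'` is an assumption that stands in for the `bytes` object `get` returns.

```python
# Stand-in for the result of r.get('key'); no server is contacted here.
raw = b'value'

# A bytes object supports many string-like operations...
print(raw.upper(), raw.startswith(b'val'), len(raw))

# ...but it is not equal to the corresponding str.
print(raw == 'value')

# Decoding converts it to a str.
text = raw.decode('utf-8')
print(text == 'value')
```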

The values can be integers or floating-point numbers.

In [None]:
r.set('x', 5)

And Redis provides some functions that understand numbers, like `incr`.

In [None]:
r.incr('x')

But if you `get` a numeric value, the result is a bytestring.

In [None]:
value = r.get('x')
value

If you want to do math with it, you have to convert it back to a number.

In [None]:
int(value)

If you want to set more than one value at a time, you can pass a dictionary to `mset`.

In [None]:
d = dict(x=5, y='string', z=1.23)
r.mset(d)

In [None]:
r.get('y')

In [None]:
r.get('z')

If you try to store any other type in a Redis database, you get an error.

In [None]:
from redis import DataError

t = [1, 2, 3]

try:
    r.set('t', t)
except DataError as e:
    print(e)

We could use the `repr` function to create a string representation of a list, but that representation is Python-specific.
It would be better to make a database that can work with any language.
To do that, we can use JSON to create a string representation.

The `json` module provides a function, `dumps`, that creates a language-independent representation of most Python objects.

In [None]:
import json

t = [1, 2, 3]
s = json.dumps(t)
s

When we read one of these strings back, we can use `loads` to convert it back to a Python object.

In [None]:
t = json.loads(s)
t

**Exercise:** Create a dictionary with a few items, including keys and values with different types. Use `json` to make a string representation of the dictionary, then store it as a value in the Redis database. Retrieve it and convert it back to a dictionary.

In [15]:
my_dict = {
    "name": "John Doe",
    "age": 30,
    "city": "New York",
    "is_active": True,
    "scores": [85, 92, 78]
}

# dictionary to a JSON string
json_string = json.dumps(my_dict)

r.set("my_data", json_string)

# Retrieve from redis
retrieved_json_string = r.get("my_data")

retrieved_dict = json.loads(retrieved_json_string)

print(retrieved_dict)



## Redis Data Types

JSON can represent most Python objects, so we could use it to store arbitrary data structures in Redis. But in that case Redis only knows that they are strings; it can't work with them as data structures. For example, if we store a data structure in JSON, the only way to modify it would be to:

1. Get the entire structure, which might be large,

2. Load it back into a Python structure,

3. Modify the Python structure,

4. Dump it back into a JSON string, and

5. Replace the old value in the database with the new value.
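
The five steps above can be sketched as follows. So that the example runs without a server, it uses `FakeClient`, a hypothetical in-memory stand-in that mimics only the `get` and `set` methods; the function `append_to_json_list` is also a name invented for this illustration.

```python
import json

class FakeClient:
    """Hypothetical in-memory stand-in for a Redis client (get/set only)."""
    def __init__(self):
        self.store = {}

    def set(self, key, value):
        # Like Redis, keep the value as bytes.
        self.store[key] = value.encode('utf-8') if isinstance(value, str) else value

    def get(self, key):
        return self.store.get(key)

def append_to_json_list(client, key, item):
    raw = client.get(key)      # 1. get the entire structure
    t = json.loads(raw)        # 2. load it back into a Python structure
    t.append(item)             # 3. modify the Python structure
    s = json.dumps(t)          # 4. dump it back into a JSON string
    client.set(key, s)         # 5. replace the old value with the new one

client = FakeClient()
client.set('t', json.dumps([1, 2, 3]))
append_to_json_list(client, 't', 4)
print(json.loads(client.get('t')))
```

Even though only one element changes, the whole list travels from the database to the program and back.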

That's not very efficient. A better alternative is to use the data types Redis provides, which you can read about in the
[Redis Data Types Intro](https://redis.io/topics/data-types-intro).

## Lists

The `rpush` method adds new elements to the end of a list (the `r` indicates the right-hand side of the list).

In [None]:
r.rpush('t', 1, 2, 3)

You don't have to do anything special to create a list; if it doesn't exist, Redis creates it.

`llen` returns the length of the list.

In [None]:
r.llen('t')

`lrange` gets elements from a list. With the indices `0` and `-1`, it gets all of the elements.

In [None]:
r.lrange('t', 0, -1)

The result is a Python list, but the elements are bytestrings.

`rpop` removes the last element of the list and returns it.

In [None]:
r.rpop('t')

You can read more about the other list methods in the [Redis documentation](https://redis.io/commands#list).

And you can read about the [redis-py API here](https://redis-py.readthedocs.io/en/stable/index.html#redis.Redis.rpush).

In general, the documentation of Redis is very good; the documentation of `redis-py` is a little rough around the edges.

**Exercise:** Use `lpush` to add elements to the beginning of the list and `lpop` to remove them.

Note: Redis lists behave like linked lists, so you can add and remove elements from either end in constant time.

In [None]:
r.lpush('t', -3, -2, -1)

In [None]:
r.lpop('t')

## Hash

A [Redis hash](https://redis.io/commands#hash) is similar to a Python dictionary, but just to make things confusing the nomenclature is a little different.

What we would call a "key" in a Python dictionary is called a "field" in a Redis hash.

The `hset` method sets a field-value pair in a hash:

In [None]:
r.hset('h', 'field', 'value')

The `hget` method looks up a field and returns the corresponding value.

In [None]:
r.hget('h', 'field')

`hset` can also take a Python dictionary as a parameter:

In [None]:
d = dict(a=1, b=2, c=3)
r.hset('h', mapping=d)

To iterate the elements of a hash, we can use `hscan_iter`:

In [None]:
for field, value in r.hscan_iter('h'):
    print(field, value)

The results are bytestrings for both the fields and values.

**Exercise:** To add multiple items to a hash, you can use `hset` with the keyword `mapping` and a dictionary (or other mapping type).

Use the `Counter` object from the `collections` module to count the letters in a string, then use `hset` to store the results in a Redis hash.

Then use `hscan_iter` to display the results.

In [None]:
from collections import Counter

text = "This is a sample text for counting letters."
letter_counts = Counter(text.lower())

r.hset('letter_counts', mapping=letter_counts)

for field, value in r.hscan_iter('letter_counts'):
  print(field.decode('utf-8'), value.decode('utf-8'))


## Deleting

Before we go on, let's clean up the database by deleting all of the key-value pairs.

In [None]:
for key in r.keys():
    r.delete(key)

## Anagrams (again!)

In a previous notebook, we made sets of words that are anagrams of each other by building a dictionary where the keys are sorted strings of letters and the values are lists of words.

We'll start by solving this problem again using Python data structures; then we'll translate it into Redis.

The following cell downloads a file that contains the list of words.

In [17]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

download('https://github.com/AllenDowney/DSIRP/raw/main/american-english')

Downloaded american-english


And here's a generator function that reads the words in the file and yields them one at a time.

In [18]:
def iterate_words(filename):
    """Read lines from a file and split them into words."""
    # The with statement closes the file when the generator finishes.
    with open(filename) as fp:
        for line in fp:
            for word in line.split():
                yield word.strip()

The "signature" of a word is a string that contains the letters of the word in sorted order.
So if two words are anagrams, they have the same signature.

In [19]:
def signature(word):
    return ''.join(sorted(word))
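
As a quick check of this idea, the following sketch repeats the definition so it runs on its own and compares the signatures of a few words; the test words are chosen for illustration.

```python
def signature(word):
    # Sorting the letters gives the same string for every anagram of a word.
    return ''.join(sorted(word))

print(signature('stop'), signature('pots'), signature('tops'))
print(signature('stop') == signature('spot'))   # anagrams
print(signature('stop') == signature('step'))   # not anagrams
```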

The following loop makes a dictionary of anagram lists.

In [20]:
anagram_dict = {}
for word in iterate_words('american-english'):
    key = signature(word)
    anagram_dict.setdefault(key, []).append(word)

The following loop prints all anagram lists with 6 or more words.

In [21]:
for v in anagram_dict.values():
    if len(v) >= 6:
        print(len(v), v)

6 ['abets', 'baste', 'bates', 'beast', 'beats', 'betas']
6 ['aster', 'rates', 'stare', 'tares', 'taser', 'tears']
6 ['caret', 'cater', 'crate', 'react', 'recta', 'trace']
7 ['carets', 'caster', 'caters', 'crates', 'reacts', 'recast', 'traces']
6 ['drapes', 'padres', 'parsed', 'rasped', 'spared', 'spread']
6 ['lapse', 'leaps', 'pales', 'peals', 'pleas', 'sepal']
6 ['least', 'slate', 'stale', 'steal', 'tales', 'teals']
6 ['opts', 'post', 'pots', 'spot', 'stop', 'tops']
6 ['palest', 'pastel', 'petals', 'plates', 'pleats', 'staple']
7 ['pares', 'parse', 'pears', 'rapes', 'reaps', 'spare', 'spear']


Now, to do the same thing in Redis, we have two options:

* We can store the anagram lists using Redis lists, using the signatures as keys.

* We can store the whole data structure in a Redis hash.

A problem with the first option is that the keys in a Redis database are like global variables. If we create a large number of keys, we are likely to run into name conflicts.
We can mitigate this problem by giving each key a prefix that identifies its purpose.

The following loop implements the first option, using "Anagram" as a prefix for the keys.

In [None]:
for word in iterate_words('american-english'):
    key = f'Anagram:{signature(word)}'
    r.rpush(key, word)

An advantage of this option is that it makes good use of Redis lists. A drawback is that it makes many small database requests, one for each word, so it is relatively slow.

We can use `keys` to get a list of all keys with a given prefix.

In [None]:
keys = r.keys('Anagram*')
len(keys)

**Exercise:** Write a loop that iterates through `keys`, uses `llen` to get the length of each list, and prints the elements of all lists with 6 or more elements.

In [None]:
for key in keys:
  list_length = r.llen(key)
  if list_length >= 6:
    list_elements = r.lrange(key, 0, -1)
    print(list_length, [element.decode('utf-8') for element in list_elements])


Before we go on, we can delete the keys from the database like this.

In [None]:
r.delete(*keys)

The second option is to compute the dictionary of anagram lists locally and then store it as a Redis hash.

The following loop uses `dumps` to convert lists to strings that can be stored as values in a Redis hash.

In [None]:
hash_key = 'AnagramHash'
for field, t in anagram_dict.items():
    value = json.dumps(t)
    r.hset(hash_key, field, value)

We can do the same thing faster if we convert all of the lists to JSON locally and store all of the field-value pairs with one `hset` command.

First, I'll delete the hash we just created.

In [None]:
r.delete(hash_key)

**Exercise:** Make a Python dictionary that contains the items from `anagram_dict` but with the values converted to JSON. Use `hset` with the `mapping` keyword to store it as a Redis hash.

In [None]:

hash_key = 'AnagramHash'
json_anagram_dict = {k: json.dumps(v) for k, v in anagram_dict.items()}
r.hset(hash_key, mapping=json_anagram_dict)


**Exercise:** Write a loop that iterates through the field-value pairs, converts each value back to a Python list, and prints the lists with 6 or more elements.

In [None]:

for field, value in r.hscan_iter('AnagramHash'):
  list_elements = json.loads(value)
  if len(list_elements) >= 6:
    print(len(list_elements), list_elements)


## Shut down

If you are running this notebook on your own computer, you can use the following command to shut down the Redis server.

If you are running on Colab, it's not really necessary: the Redis server will get shut down when the Colab runtime shuts down (and everything stored in it will disappear).

In [None]:
!killall redis-server

# Linked List

## Linked Lists

Implementing operations on linked lists is a staple of programming classes and technical interviews.

I resist them because it is unlikely that you will ever have to implement a linked list in your professional work. And if you do, someone has made a bad decision.

However, they can be good études, that is, pieces that you practice in order to learn, but never perform.

For many of these problems, there are several possible solutions, depending on the requirements:

* Are you allowed to modify an existing list, or do you have to create a new one?

* If you modify an existing structure, are you also supposed to return a reference to it?

* Are you allowed to allocate temporary structures, or do you have to perform all operations in place?

And for all of these problems, you could write a solution iteratively or recursively. So there are many possible solutions for each.

As we consider alternatives, some of the factors to keep in mind are:

* Performance in terms of time and space.

* Readability and demonstrable correctness.

In general, performance should be asymptotically efficient; for example, if there is a constant time solution, a linear solution would not be acceptable.
But we might be willing to pay some overhead to achieve bulletproof correctness.

Here's the class we'll use to represent the nodes in a list.

In [None]:
class Node:
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

    def __repr__(self):
        return f'Node({self.data}, {repr(self.next)})'

We can create nodes like this:

In [None]:
node1 = Node(1)
node2 = Node(2)
node3 = Node(3)

node1

And then link them up, like this:

In [None]:
node1.next = node2
node2.next = node3

In [None]:
node1

There are two ways to think about what `node1` is:

* It is "just" a node object, which happens to contain a link to another node.

* It is the first node in a linked list of nodes.

When we pass a node as a parameter, sometimes we think of it as a node and sometimes we think of it as the beginning of a list.
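
The two views can be illustrated with a short sketch. The generator `iterate_data` is a hypothetical helper, not part of the notebook; `Node` is repeated (without `__repr__`) so the block runs on its own.

```python
class Node:
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

def iterate_data(node):
    """Treat `node` as the beginning of a list and yield each element."""
    while node is not None:
        yield node.data
        node = node.next

node1 = Node(1, Node(2, Node(3)))

# Viewed as a single node, node1 holds one value.
print(node1.data)

# Viewed as the beginning of a list, it gives access to all of the values.
print(list(iterate_data(node1)))

# Any node can play the same role for the rest of the list.
print(list(iterate_data(node1.next)))
```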

## LinkedList objects

For some operations, it will be convenient to have another object that represents the whole list (as opposed to one of its nodes).

Here's the class definition.

In [None]:
class LinkedList:
    def __init__(self, head=None):
        self.head = head

    def __repr__(self):
        return f'LinkedList({repr(self.head)})'

If we create a `LinkedList` with a reference to `node1`, we can think of the result as a list with three elements.

In [None]:
t = LinkedList(node1)
t

## Search

**Exercise:** Write a function called `find` that takes a `LinkedList` and a target value; if the target value appears in the `LinkedList`, it should return the `Node` that contains it; otherwise it should return `None`.
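
Here is one possible solution, offered as a sketch rather than the intended answer. `Node` and `LinkedList` are repeated from above so the block runs on its own.

```python
class Node:
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

    def __repr__(self):
        return f'Node({self.data}, {repr(self.next)})'

class LinkedList:
    def __init__(self, head=None):
        self.head = head

    def __repr__(self):
        return f'LinkedList({repr(self.head)})'

def find(t, target):
    """Return the first Node whose data equals target, or None."""
    node = t.head
    while node is not None:
        if node.data == target:
            return node
        node = node.next
    return None

t = LinkedList(Node(1, Node(2, Node(3))))
print(find(t, 3))
print(find(t, 5))
```

Because it may have to visit every node, this version of `find` takes time proportional to the length of the list.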

You can use these examples to test your code.

In [None]:
find(t, 1)

In [None]:
find(t, 3)

In [None]:
find(t, 5)

## Push and Pop

Adding and removing elements from the *left* side of a linked list is relatively easy:

In [None]:
def lpush(t, value):
    t.head = Node(value, t.head)

In [None]:
t = LinkedList()
lpush(t, 3)
lpush(t, 2)
lpush(t, 1)
t

In [None]:
def lpop(t):
    if t.head is None:
        raise ValueError('Tried to pop from empty LinkedList')
    node = t.head
    t.head = node.next
    return node.data

In [None]:
lpop(t), lpop(t), lpop(t)

In [None]:
t

Adding and removing elements from the right side takes longer because we have to traverse the list.

**Exercise:** Write `rpush` and `rpop`.
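
Here is one possible solution, offered as a sketch under the assumption that `rpop` should raise `ValueError` on an empty list, as `lpop` does. `Node` and `LinkedList` are repeated from above so the block runs on its own.

```python
class Node:
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

    def __repr__(self):
        return f'Node({self.data}, {repr(self.next)})'

class LinkedList:
    def __init__(self, head=None):
        self.head = head

    def __repr__(self):
        return f'LinkedList({repr(self.head)})'

def rpush(t, value):
    """Add a new node at the end of the list."""
    new_node = Node(value)
    if t.head is None:
        t.head = new_node
        return
    # Walk to the last node, then link the new node after it.
    node = t.head
    while node.next is not None:
        node = node.next
    node.next = new_node

def rpop(t):
    """Remove the last node and return its data."""
    if t.head is None:
        raise ValueError('Tried to pop from empty LinkedList')
    # Special case: the list has exactly one node.
    if t.head.next is None:
        data = t.head.data
        t.head = None
        return data
    # Walk to the second-to-last node, then unlink the last one.
    node = t.head
    while node.next.next is not None:
        node = node.next
    data = node.next.data
    node.next = None
    return data
```

Both functions traverse the list, so they take linear time.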

You can use the following example to test your code.

In [None]:
t = LinkedList()
rpush(t, 1)
t

In [None]:
rpush(t, 2)
t

In [None]:
rpop(t)

In [None]:
rpop(t)

In [None]:
try:
    rpop(t)
except ValueError as e:
    print(e)

## Reverse

Reversing a linked list is a classic interview question, although at this point it is so classic you will probably never encounter it.


If you are allowed to make a new list, you can traverse the old list and `lpush` the elements onto the new list:

In [None]:
def reverse(t):
    t2 = LinkedList()
    node = t.head
    while node:
        lpush(t2, node.data)
        node = node.next

    return t2

In [None]:
t = LinkedList(Node(1, Node(2, Node(3, None))))
reverse(t)

Here's a recursive version that reverses the list in place, without allocating any new nodes:

In [45]:
def reverse(t):
    t.head = reverse_rec(t.head)

def reverse_rec(node):

    # if there are 0 or 1 nodes
    if node is None or node.next is None:
        return node

    # reverse the rest of the list
    rest = reverse_rec(node.next)

    # Put first element at the end
    node.next.next = node
    node.next = None

    return rest

In [46]:
t = LinkedList(Node(1, Node(2, Node(3, None))))
reverse(t)
t

LinkedList(Node(3, Node(2, Node(1, None))))

And finally an iterative version that doesn't allocate anything.

In [47]:
def reverse(t):
    prev = None
    current = t.head
    while current:
        # use next_node rather than next to avoid shadowing the built-in
        next_node = current.next
        current.next = prev
        prev = current
        current = next_node
    t.head = prev

In [48]:
t = LinkedList(Node(1, Node(2, Node(3, None))))
reverse(t)
t

LinkedList(Node(3, Node(2, Node(1, None))))

## Remove

One of the advantages of a linked list (compared to an array list) is that we can add and remove elements from the middle of the list in constant time, provided we already have a reference to the node just before the insertion or removal point.

For example, the following function takes a node and removes the node that follows it.

In [49]:
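def remove_after(node):
    removed = node.next
    # If node is the last node, there is nothing to remove.
    if removed is None:
        raise ValueError('No node follows the given node')
    node.next = removed.next
    return removed.data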

Here's an example:

In [50]:
t = LinkedList(Node(1, Node(2, Node(3, None))))
remove_after(t.head)
t

LinkedList(Node(1, Node(3, None)))

**Exercise:** Write a function called `remove` that takes a LinkedList and a target value. It should remove the first node that contains the value, or raise a `ValueError` if it is not found.

Hint: This one is a little tricky.

In [52]:
def remove(linked_list, target):
    if linked_list.head is None:
        raise ValueError("Value not found in the list")

    if linked_list.head.data == target:
        linked_list.head = linked_list.head.next
        return

    current_node = linked_list.head
    while current_node.next:
        if current_node.next.data == target:
            current_node.next = current_node.next.next
            return

        current_node = current_node.next

    raise ValueError("Value not found in the list")


You can use this example to test your code.

In [53]:
t = LinkedList(Node(1, Node(2, Node(3, None))))
remove(t, 2)
t

LinkedList(Node(1, Node(3, None)))

In [54]:
remove(t, 1)
t

LinkedList(Node(3, None))

In [55]:
try:
    remove(t, 4)
except ValueError as e:
    print(e)

Value not found in the list


In [56]:
remove(t, 3)
t

LinkedList(None)

In [57]:
try:
    remove(t, 5)
except ValueError as e:
    print(e)

Value not found in the list


Although `remove_after` is constant time, `remove` is not. Because we have to iterate through the nodes to find the target, `remove` takes linear time.

## Insert Sorted

Similarly, you can insert an element into the middle of a linked list in constant time.

The following function inserts `data` after the given node in a list.

In [58]:
def insert_after(node, data):
    node.next = Node(data, node.next)

In [59]:
t = LinkedList(Node(1, Node(2, Node(3, None))))
insert_after(t.head, 5)
t

LinkedList(Node(1, Node(5, Node(2, Node(3, None)))))

**Exercise:** Write a function called `insert_sorted` (also known as `insort`) that takes a linked list and a value and inserts the value in the list in the first place where it will be in increasing sorted order, that is, with the smallest element at the beginning.

In [60]:
def insert_sorted(t, data):
    if t.head is None or t.head.data > data:
        lpush(t, data)
        return

    node = t.head
    while node.next:
        if node.next.data > data:
            insert_after(node, data)
            return
        node = node.next

    insert_after(node, data)

You can use the following example to test your code.

In [61]:
t = LinkedList()
insert_sorted(t, 1)
t

LinkedList(Node(1, None))

In [62]:
insert_sorted(t, 3)
t

LinkedList(Node(1, Node(3, None)))

In [63]:
insert_sorted(t, 0)
t

LinkedList(Node(0, Node(1, Node(3, None))))

In [64]:
insert_sorted(t, 2)
t

LinkedList(Node(0, Node(1, Node(2, Node(3, None)))))

Although `insert_after` is constant time, `insert_sorted` is not. Because we have to iterate through the nodes to find the insertion point, `insert_sorted` takes linear time.


# Indexer

In [70]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
    return filename

## Indexing the web

In the context of web search, an index is a data structure that makes it possible to look up a search term and find the pages where that term appears. In addition, we would like to know how many times the search term appears on each page, which will help identify the pages most relevant to the term.

For example, if a user submits the search terms "Python" and "programming", we would look up both search terms and get two sets of
pages. Pages with the word "Python" would include pages about the species of snake and pages about the programming language. Pages
with the word "programming" would include pages about different
programming languages, as well as other uses of the word. By selecting
pages with both terms, we hope to eliminate irrelevant pages and find
the ones about Python programming.
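
As a minimal sketch of this selection step, the following code intersects two sets of pages. The page names are invented for illustration; a real index would supply them.

```python
# Hypothetical results of looking up each search term.
pages_python = {
    'wiki/Python_(genus)',
    'wiki/Python_(programming_language)',
    'wiki/Monty_Python',
}
pages_programming = {
    'wiki/Python_(programming_language)',
    'wiki/Java_(programming_language)',
}

# Pages that contain both terms are in the intersection of the two sets.
both = pages_python & pages_programming
print(both)
```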

In order to make an index, we'll need to iterate through the words in a document and count them.
So that's where we'll start.

Here's a minimal HTML document we have seen before, borrowed from the BeautifulSoup documentation.

In [71]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

We can use `BeautifulSoup` to parse the text and make a DOM.

In [72]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)
type(soup)

The following is a generator function that iterates the elements of the DOM, finds the `NavigableString` objects, iterates through the words, and yields them one at a time.

From each word, it removes the characters identified by the `string` module as whitespace or punctuation.

In [73]:
from bs4 import NavigableString
from string import whitespace, punctuation

def iterate_words(soup):
    for element in soup.descendants:
        if isinstance(element, NavigableString):
            for word in element.string.split():
                word = word.strip(whitespace + punctuation)
                if word:
                    yield word.lower()

We can loop through the words like this:

In [74]:
for word in iterate_words(soup):
    print(word)

the
dormouse's
story
the
dormouse's
story
once
upon
a
time
there
were
three
little
sisters
and
their
names
were
elsie
lacie
and
tillie
and
they
lived
at
the
bottom
of
a
well


And count them like this.

In [75]:
from collections import Counter

counter = Counter(iterate_words(soup))
counter

Counter({'the': 3,
         "dormouse's": 2,
         'story': 2,
         'once': 1,
         'upon': 1,
         'a': 2,
         'time': 1,
         'there': 1,
         'were': 2,
         'three': 1,
         'little': 1,
         'sisters': 1,
         'and': 3,
         'their': 1,
         'names': 1,
         'elsie': 1,
         'lacie': 1,
         'tillie': 1,
         'they': 1,
         'lived': 1,
         'at': 1,
         'bottom': 1,
         'of': 1,
         'well': 1})

## Parsing Wikipedia

Now let's do the same thing with the text of a Wikipedia page:

In [76]:
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
filename = download(url)

Downloaded Python_(programming_language)


In [77]:
fp = open(filename)
soup2 = BeautifulSoup(fp)

In [78]:
counter = Counter(iterate_words(soup2))
counter.most_common(10)

[('the', 610),
 ('python', 500),
 ('and', 305),
 ('from', 293),
 ('on', 265),
 ('a', 260),
 ('of', 232),
 ('retrieved', 229),
 ('original', 229),
 ('archived', 228)]

As you might expect, the word "python" is one of the most common words on the Wikipedia page about Python.
The word "programming" didn't make the top 10, but it also appears many times.

In [79]:
counter['programming']

91

However, there are a number of common words, like "the" and "from", that also appear many times.
Later, we'll come back and think about how to distinguish the words that really indicate what the page is about from the common words that appear on every page.

But first, let's think about making an index.

## Indexing

An index is a map from a search word, like "python", to a collection of pages that contain the word.
The collection should also indicate how many times the word appears on each page.

We want the index to be persistent, so we'll store it on Redis.

So what Redis type should we use?
There are several options, but one reasonable choice is a hash for each word, where the fields are pages (represented by URL) and the values are counts.

To manage the size of the index, we won't list a page for a given search word unless the word appears on that page at least three times.
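
Before involving Redis, it may help to see the same layout in plain Python. In this sketch, a dictionary of dictionaries stands in for the collection of hashes; the function `add_to_index`, the `Index:` key prefix, the page names, and the sample text are all assumptions made for illustration.

```python
from collections import Counter

def add_to_index(index, url, counter, min_count=3):
    """Record url under each word that appears at least min_count times."""
    for word, count in counter.items():
        if count >= min_count:
            # One inner dictionary per word: fields are URLs, values are counts.
            index.setdefault(f'Index:{word}', {})[url] = count

index = {}
add_to_index(index, 'page1', Counter('python python python is fun fun'.split()))
add_to_index(index, 'page2', Counter('python the the the python python python'.split()))
print(index)
```

Words that appear fewer than three times on a page, like "fun" on `page1`, are left out of the index.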

Let's get Redis started.

In [80]:
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install redis-server
    !/usr/local/lib/python*/dist-packages/redis_server/bin/redis-server --daemonize yes
else:
    !redis-server --daemonize yes


And make sure the Redis client is installed.


In [82]:
try:
    import redis
except ImportError:
    !pip install redis

And let's make a `Redis` object that creates the connection to the Redis database.

In [83]:
import redis

r = redis.Redis()

**Exercise:** Write a function called `redis_index` that takes a URL and indexes it. It should download the web page with the given URL, iterate through the words, and make a `Counter` that maps from words to their frequencies.

Then it should iterate through the words and add field-value pairs to Redis hashes.

* The keys for the hashes should have the prefix `Index:`; for example, the key for the word `python` should be `Index:python`.

* The fields in the hashes should be URLs.

* The values in the hashes should be word counts.

Use your function to index at least these two pages:

In [84]:
url1 = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
url2 = 'https://en.wikipedia.org/wiki/Python_(genus)'

In [None]:
def redis_index(url):
    filename = download(url)
    fp = open(filename)
    soup = BeautifulSoup(fp)
    counter = Counter(iterate_words(soup))

    for word, count in counter.items():
        if count >= 3:
            key = f"Index:{word}"
            r.hset(key, url, count)


redis_index(url1)
redis_index(url2)


Use `hscan_iter` to iterate the field-value pairs in the index for the word `python`.
Print the URLs of the pages where this word appears and the number of times it appears on each page.

In [None]:
for field, value in r.hscan_iter("Index:python"):
    print(field.decode("utf-8"), int(value))

## Shutdown

If you are running this notebook on your own computer, you can use the following command to shut down the Redis server.

If you are running on Colab, it's not really necessary: the Redis server will get shut down when the Colab runtime shuts down (and everything stored in it will disappear).

In [None]:
!killall redis-server