#Mining the Social Web, 2nd Edition

##Chapter 5: . Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More

This IPython Notebook provides an interactive way to follow along with and explore the numbered examples from [_Mining the Social Web (2nd Edition)_](http://bit.ly/135dHfs). The intent behind this notebook is to reinforce the concepts from the sample code in a fun, convenient, and effective way. This notebook assumes that you are reading along with the book and have the context of the discussion as you work through these exercises.

In the somewhat unlikely event that you've somehow stumbled across this notebook outside of its context on GitHub, [you can find the full source code repository here](http://bit.ly/16kGNyb).

## Copyright and Licensing

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use.

Note: If you find yourself wanting to copy output files from this notebook back to your host environment, see the bottom of this notebook for one possible way to do it.

## Example 1. Using boilerpipe to extract the text from a web page

In [1]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

%pylab inline

from boilerpipe.extract import Extractor

URL='http://radar.oreilly.com/2010/07/louvre-industrial-age-henry-ford.html'

extractor = Extractor(extractor='ArticleExtractor', url=URL)

print(extractor.getText())

Populating the interactive namespace from numpy and matplotlib
Listen
The Louvre of the Industrial Age
The Henry Ford is one of the world's great museums, and the world it chronicles is our own.
by Tim O'Reilly | @timoreilly | +Tim O'Reilly | July 30, 2010
This morning I had the chance to get a tour of The Henry Ford Museum in Dearborn, MI, along with Dale Dougherty, creator of Make: and Makerfaire, and Marc Greuther, the chief curator of the museum.  I had expected a museum dedicated to the auto industry, but it’s so much more than that.  As I wrote in my first stunned tweet, “it’s the Louvre of the Industrial Age.”
When we first entered, Marc took us to what he said may be his favorite artifact in the museum, a block of concrete that contains Luther Burbank’s shovel, and Thomas Edison’s signature and footprints.  Luther Burbank was, of course, the great agricultural inventor who created such treasures as the nectarine and the Santa Rosa plum. Ford was a farm boy who became an industr

## Example 2. Using feedparser to extract the text (and other fields) from an RSS or Atom feed

In [2]:
import feedparser

FEED_URL='http://feeds.feedburner.com/oreilly/radar/atom'

fp = feedparser.parse(FEED_URL)

for e in fp.entries:
    print(e.title)
    print(e.links[0].href)
    print(e.content[0].value)

Four short links: 19 February 2016
http://feedproxy.google.com/~r/oreilly/radar/atom/~3/wo5x8YAx904/four-short-links-19-february-2016.html
<ol>
<li><a href="http://motherboard.vice.com/read/robotic-exoskeleton-rewalk-will-be-covered-by-health-insurance">Exoskeletons Must be Covered by Health Insurance</a> (VICE) -- <i>A medical review board ruled that a health insurance provider in the United States is obligated to provide coverage and reimbursement for a $69,500 ReWalk robotic exoskeleton, in what could be a major turning point for people with spinal cord injuries.</i> (via <a href="http://robohub.org/review-board-rules-health-insurance-must-cover-robotic-exoskeletons/">Robohub</a>)</li>
<li><a href="https://medium.com/simone-brunozzi/y-combinator-and-alphabet-inc-f99e46852ded#.oiv2ytfpq">New Models for the Company of the 21st Century</a> (Simone Brunozzi) -- <i>large companies often get displaced by new entrants, failing to innovate and/or adapt to new technologies. Y-Combinator can 

## Example 3. Pseudocode for a breadth-first search

In [None]:
Create an empty graph
Create an empty queue to keep track of nodes that need to be processed

Add the starting point to the graph as the root node
Add the root node to a queue for processing

Repeat until some maximum depth is reached or the queue is empty:
  Remove a node from the queue 
  For each of the node's neighbors: 
    If the neighbor hasn't already been processed: 
      Add it to the queue 
      Add it to the graph 
      Create an edge in the graph that connects the node and its neighbor

**Naive sentence detection based on periods**

In [3]:
txt = "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow."
print(txt.split("."))

['Mr', ' Green killed Colonel Mustard in the study with the candlestick', ' Mr', ' Green is not a very nice fellow', '']


**More sophisticated sentence detection**

In [4]:
import nltk

# Downloading nltk packages used in this example
#nltk.download('punkt')

sentences = nltk.tokenize.sent_tokenize(txt)
print(sentences)

['Mr. Green killed Colonel Mustard in the study with the candlestick.', 'Mr. Green is not a very nice fellow.']


**Tokenization of sentences**

In [5]:
tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]
print(tokens)

[['Mr.', 'Green', 'killed', 'Colonel', 'Mustard', 'in', 'the', 'study', 'with', 'the', 'candlestick', '.'], ['Mr.', 'Green', 'is', 'not', 'a', 'very', 'nice', 'fellow', '.']]


**Part of speech tagging for tokens**

In [6]:
# Downloading nltk packages used in this example
#nltk.download('maxent_treebank_pos_tagger')

pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]
print(pos_tagged_tokens)

[[('Mr.', 'NNP'), ('Green', 'NNP'), ('killed', 'VBD'), ('Colonel', 'NNP'), ('Mustard', 'NNP'), ('in', 'IN'), ('the', 'DT'), ('study', 'NN'), ('with', 'IN'), ('the', 'DT'), ('candlestick', 'NN'), ('.', '.')], [('Mr.', 'NNP'), ('Green', 'NNP'), ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('very', 'RB'), ('nice', 'JJ'), ('fellow', 'NN'), ('.', '.')]]


**Named entity extraction/chunking for tokens**

In [7]:
# Downloading nltk packages used in this example
# nltk.download('maxent_ne_chunker')
# nltk.download('words')

ne_chunks = [nltk.ne_chunk(token) for token in pos_tagged_tokens ]
print(ne_chunks)
print(ne_chunks[0].pprint()) # You can prettyprint each chunk in the tree

[Tree('S', [Tree('PERSON', [('Mr.', 'NNP')]), Tree('PERSON', [('Green', 'NNP')]), ('killed', 'VBD'), Tree('ORGANIZATION', [('Colonel', 'NNP'), ('Mustard', 'NNP')]), ('in', 'IN'), ('the', 'DT'), ('study', 'NN'), ('with', 'IN'), ('the', 'DT'), ('candlestick', 'NN'), ('.', '.')]), Tree('S', [Tree('PERSON', [('Mr.', 'NNP')]), Tree('ORGANIZATION', [('Green', 'NNP')]), ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('very', 'RB'), ('nice', 'JJ'), ('fellow', 'NN'), ('.', '.')])]
(S
  (PERSON Mr./NNP)
  (PERSON Green/NNP)
  killed/VBD
  (ORGANIZATION Colonel/NNP Mustard/NNP)
  in/IN
  the/DT
  study/NN
  with/IN
  the/DT
  candlestick/NN
  ./.)
None


## Example 4. Harvesting blog data by parsing feeds

In [8]:
import os
import sys
import json
import feedparser
from bs4 import BeautifulSoup
from nltk import clean_html


FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom'

def cleanHtml(html):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.getText(html)

fp = feedparser.parse(FEED_URL)

print("Fetched %s entries from '%s'" % (len(fp.entries[0].title), fp.feed.title))

blog_posts = []
for e in fp.entries:
    #print(e.title)
    blog_posts.append({'title': e.title, 'content'
                      : cleanHtml(e.content[0].value), 'link': e.links[0].href})

out_file = os.path.join('resources', 'ch05-webpages', 'feed.json')
#json.dumps([1,2,3], indent=1, sort_keys=True)

f = open(out_file, 'w')
f.write(json.dumps(blog_posts, indent=1, sort_keys=True))
f.close()

print('Wrote output file to %s' % (f.name, ))

Fetched 34 entries from 'O'Reilly Radar - Insight, analysis, and research about emerging technologies'
Wrote output file to resources/ch05-webpages/feed.json


## Example 5. Using NLTK’s NLP tools to process human language in blog data

In [9]:
import json
import nltk

# Download nltk packages used in this example
# nltk.download('stopwords')

BLOG_DATA = "resources/ch05-webpages/feed.json"

blog_data = json.loads(open(BLOG_DATA).read())

# Customize your list of stopwords as needed. Here, we add common
# punctuation and contraction artifacts.

stop_words = nltk.corpus.stopwords.words('english') + [
    '.',
    ',',
    '--',
    '\'s',
    '?',
    ')',
    '(',
    ':',
    '\'',
    '\'re',
    '"',
    '-',
    '}',
    '{',
    u'—',
    ]

for post in blog_data:
    sentences = nltk.tokenize.sent_tokenize(post['content'])

    words = [w.lower() for sentence in sentences for w in
             nltk.tokenize.word_tokenize(sentence)]

    fdist = nltk.FreqDist(words)

    # Basic stats

    num_words = sum([i[1] for i in fdist.items()])
    num_unique_words = len(fdist.keys())

    # Hapaxes are words that appear only once

    num_hapaxes = len(fdist.hapaxes())

    top_10_words_sans_stop_words = [w for w in fdist.items() if w[0]
                                    not in stop_words][:10]

    print(post['title'])
    print ('\tNum Sentences:'.ljust(25), len(sentences))
    print ('\tNum Words:'.ljust(25), num_words)
    print ('\tNum Unique Words:'.ljust(25), num_unique_words)
    print ('\tNum Hapaxes:'.ljust(25), num_hapaxes)
    print ('\tTop 10 Most Frequent Words (sans stop words):\n\t\t', \
            '\n\t\t'.join(['%s (%s)'
            % (w[0], w[1]) for w in top_10_words_sans_stop_words]))

Four short links: 19 February 2016
	Num Sentences:           204
	Num Words:               17722
	Num Unique Words:        227
	Num Hapaxes:             0
	Top 10 Most Frequent Words (sans stop words):
		 enter (29)
		reimbursement (29)
		insurance (58)
		ol (28)
		people (58)
		botnet (58)
		complaining (29)
		a=wo5x8yax904 (140)
		type (29)
		medical (29)
Scott Hurff on designing at Tinder
	Num Sentences:           3103
	Num Words:               114228
	Num Unique Words:        474
	Num Hapaxes:             4
	Top 10 Most Frequent Words (sans stop words):
		 p (845)
		lead (66)
		grow (66)
		ships (66)
		d=7q72wntakba (65)
		soundcloud (66)
		people (264)
		away (66)
		feels (66)
		staples (132)
Four short links: 18 February 2016
	Num Sentences:           79
	Num Words:               12715
	Num Unique Words:        171
	Num Hapaxes:             0
	Top 10 Most Frequent Words (sans stop words):
		 height= (25)
		increased (26)
		lives (26)
		made (26)
		/ (25)
		a=6dwaguusevk:5wtdvjxcu

## Example 6. A document summarization algorithm based principally upon sentence detection and frequency analysis within sentences

In [11]:
import json
import nltk
import numpy

BLOG_DATA = "resources/ch05-webpages/feed.json"

N = 100  # Number of words to consider
CLUSTER_THRESHOLD = 5  # Distance between words to consider
TOP_SENTENCES = 5  # Number of sentences to return for a "top n" summary

# Approach taken from "The Automatic Creation of Literature Abstracts" by H.P. Luhn

def _score_sentences(sentences, important_words):
    scores = []
    sentence_idx = -1

    for s in [nltk.tokenize.word_tokenize(s) for s in sentences]:

        sentence_idx += 1
        word_idx = []

        # For each word in the word list...
        for w in important_words:
            try:
                # Compute an index for where any important words occur in the sentence.

                word_idx.append(s.index(w))
            except ValueError: # w not in this particular sentence
                pass

        word_idx.sort()

        # It is possible that some sentences may not contain any important words at all.
        if len(word_idx)== 0: continue

        # Using the word index, compute clusters by using a max distance threshold
        # for any two consecutive words.

        clusters = []
        cluster = [word_idx[0]]
        i = 1
        while i < len(word_idx):
            if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:
                cluster.append(word_idx[i])
            else:
                clusters.append(cluster[:])
                cluster = [word_idx[i]]
            i += 1
        clusters.append(cluster)

        # Score each cluster. The max score for any given cluster is the score 
        # for the sentence.

        max_cluster_score = 0
        for c in clusters:
            significant_words_in_cluster = len(c)
            total_words_in_cluster = c[-1] - c[0] + 1
            score = 1.0 * significant_words_in_cluster \
                * significant_words_in_cluster / total_words_in_cluster

            if score > max_cluster_score:
                max_cluster_score = score

        scores.append((sentence_idx, score))

    return scores

def summarize(txt):
    sentences = [s for s in nltk.tokenize.sent_tokenize(txt)]
    normalized_sentences = [s.lower() for s in sentences]

    words = [w.lower() for sentence in normalized_sentences for w in
             nltk.tokenize.word_tokenize(sentence)]

    fdist = nltk.FreqDist(words)

    top_n_words = [w[0] for w in fdist.items() 
            if w[0] not in nltk.corpus.stopwords.words('english')][:N]

    scored_sentences = _score_sentences(normalized_sentences, top_n_words)

    # Summarization Approach 1:
    # Filter out nonsignificant sentences by using the average score plus a
    # fraction of the std dev as a filter

    avg = numpy.mean([s[1] for s in scored_sentences])
    std = numpy.std([s[1] for s in scored_sentences])
    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
                   if score > avg + 0.5 * std]

    # Summarization Approach 2:
    # Another approach would be to return only the top N ranked sentences

    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])

    # Decorate the post object with summaries

    return dict(top_n_summary=[sentences[idx] for (idx, score) in top_n_scored],
                mean_scored_summary=[sentences[idx] for (idx, score) in mean_scored])

blog_data = json.loads(open(BLOG_DATA).read())

for post in blog_data:
       
    post.update(summarize(post['content']))

    print(post['title'])
    print('=' * len(post['title']))
    print('Top N Summary')
    print('-------------')
    print(' '.join(post['top_n_summary']))
    print('Mean Scored Summary')
    print('-------------------')
    print(' '.join(post['mean_scored_summary']))

Four short links: 19 February 2016
Top N Summary
-------------
)</li>
</ol>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/oreilly/radar/atom?a=wo5x8YAx904:qpT6eBTNFx8:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/oreilly/radar/atom?i=wo5x8YAx904:qpT6eBTNFx8:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/oreilly/radar/atom?a=wo5x8YAx904:qpT6eBTNFx8:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/oreilly/radar/atom?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/oreilly/radar/atom?a=wo5x8YAx904:qpT6eBTNFx8:JEwB19i1-c4"><img src="http://feeds.feedburner.com/~ff/oreilly/radar/atom?i=wo5x8YAx904:qpT6eBTNFx8:JEwB19i1-c4" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/oreilly/radar/atom?a=wo5x8YAx904:qpT6eBTNFx8:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/oreilly/radar/atom?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/oreilly/radar/atom?a=wo

## Example 7. Visualizing document summarization results with HTML output

In [17]:
import os
import json
import nltk
import numpy
from IPython.display import IFrame
from IPython.core.display import display

BLOG_DATA = "resources/ch05-webpages/feed.json"

HTML_TEMPLATE = """<html>
    <head>
        <title>%s</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    </head>
    <body>%s</body>
</html>"""

blog_data = json.loads(open(BLOG_DATA).read())

for post in blog_data:
   
    # Uses previously defined summarize function.
    post.update(summarize(post['content']))

    # You could also store a version of the full post with key sentences marked up
    # for analysis with simple string replacement...

    for summary_type in ['top_n_summary', 'mean_scored_summary']:
        post[summary_type + '_marked_up'] = '<p>%s</p>' % (post['content'], )
        for s in post[summary_type]:
            post[summary_type + '_marked_up'] = \
            post[summary_type + '_marked_up'].replace(s, '<strong>%s</strong>' % (s, ))

        filename = post['title'].replace("?", "") + '.summary.' + summary_type + '.html'
        f = open(os.path.join('resources', 'ch05-webpages', filename), 'w')
        html = HTML_TEMPLATE % (post['title'] + \
          ' Summary', post[summary_type + '_marked_up'],)
              
        f.write(str(html.encode('utf-8')))
        f.close()

        print("Data written to", f.name)

# Display any of these files with an inline frame. This displays the
# last file processed by using the last value of f.name...

print("Displaying %s:" % f.name)
display(IFrame('files/%s' % f.name, '100%', '600px'))

Data written to resources/ch05-webpages/Four short links: 19 February 2016.summary.top_n_summary.html
Data written to resources/ch05-webpages/Four short links: 19 February 2016.summary.mean_scored_summary.html
Data written to resources/ch05-webpages/Scott Hurff on designing at Tinder.summary.top_n_summary.html
Data written to resources/ch05-webpages/Scott Hurff on designing at Tinder.summary.mean_scored_summary.html
Data written to resources/ch05-webpages/Four short links: 18 February 2016.summary.top_n_summary.html
Data written to resources/ch05-webpages/Four short links: 18 February 2016.summary.mean_scored_summary.html
Data written to resources/ch05-webpages/Rachel Kalmar on data ecosystems.summary.top_n_summary.html
Data written to resources/ch05-webpages/Rachel Kalmar on data ecosystems.summary.mean_scored_summary.html
Data written to resources/ch05-webpages/Four short links: 17 February 2016.summary.top_n_summary.html
Data written to resources/ch05-webpages/Four short links: 17 F

## Example 8. Extracting entities from a text with NLTK

In [19]:
import nltk
import json

BLOG_DATA = "resources/ch05-webpages/feed.json"

blog_data = json.loads(open(BLOG_DATA).read())

for post in blog_data:

    sentences = nltk.tokenize.sent_tokenize(post['content'])
    tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]
    pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]

    # Flatten the list since we're not using sentence structure
    # and sentences are guaranteed to be separated by a special
    # POS tuple such as ('.', '.')

    pos_tagged_tokens = [token for sent in pos_tagged_tokens for token in sent]

    all_entity_chunks = []
    previous_pos = None
    current_entity_chunk = []
    for (token, pos) in pos_tagged_tokens:

        if pos == previous_pos and pos.startswith('NN'):
            current_entity_chunk.append(token)
        elif pos.startswith('NN'):
            if current_entity_chunk != []:

                # Note that current_entity_chunk could be a duplicate when appended,
                # so frequency analysis again becomes a consideration

                all_entity_chunks.append((' '.join(current_entity_chunk), pos))
            current_entity_chunk = [token]

        previous_pos = pos

    # Store the chunks as an index for the document
    # and account for frequency while we're at it...

    post['entities'] = {}
    for c in all_entity_chunks:
        post['entities'][c] = post['entities'].get(c, 0) + 1

    # For example, we could display just the title-cased entities

    print(post['title'])
    print('-' * len(post['title']))
    proper_nouns = []
    for (entity, pos) in post['entities']:
        if entity.istitle():
            print('\t%s (%s)' % (entity, post['entities'][(entity, pos)]))

Four short links: 19 February 2016
----------------------------------
	States (29)
	New Models (28)
	A (29)
	Y-Combinator (29)
	/ > Exoskeletons Must (1)
	Century < (1)
	Botnets (29)
	Simone Brunozzi (28)
	/ > Robohub < (1)
	Samsung (29)
	Linux Foundation (29)
	/ > Zephyr < (1)
	Swinging Click (29)
	/ > New Models (1)
	United (29)
	Simone Brunozzi (1)
	Company (29)
	Approach (28)
	Exoskeletons (28)
	Nest (29)
	Hands-On Approach (1)
	Health Insurance < (1)
	Learning Purpose < (1)


KeyboardInterrupt: 

## Example 9. Discovering interactions between entities

In [20]:
import nltk
import json

BLOG_DATA = "resources/ch05-webpages/feed.json"

def extract_interactions(txt):
    sentences = nltk.tokenize.sent_tokenize(txt)
    tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]
    pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]

    entity_interactions = []
    for sentence in pos_tagged_tokens:

        all_entity_chunks = []
        previous_pos = None
        current_entity_chunk = []

        for (token, pos) in sentence:

            if pos == previous_pos and pos.startswith('NN'):
                current_entity_chunk.append(token)
            elif pos.startswith('NN'):
                if current_entity_chunk != []:
                    all_entity_chunks.append((' '.join(current_entity_chunk),
                            pos))
                current_entity_chunk = [token]

            previous_pos = pos

        if len(all_entity_chunks) > 1:
            entity_interactions.append(all_entity_chunks)
        else:
            entity_interactions.append([])

    assert len(entity_interactions) == len(sentences)

    return dict(entity_interactions=entity_interactions,
                sentences=sentences)

blog_data = json.loads(open(BLOG_DATA).read())

# Display selected interactions on a per-sentence basis

for post in blog_data:

    post.update(extract_interactions(post['content']))

    print (post['title'])
    print ('-' * len(post['title']))
    for interactions in post['entity_interactions']:
        print ('; '.join([i[0] for i in interactions]))

Four short links: 19 February 2016
----------------------------------
ol; > <; > <; href=; http; //motherboard.vice.com/read/robotic-exoskeleton-rewalk-will-be-covered-by-health-insurance; Exoskeletons; Health Insurance < /a >; VICE; A; review board; health insurance provider; United; States; coverage; reimbursement; ReWalk; exoskeleton; turning point; people; cord injuries.; < /i >; <; href=; http; >; Robohub < /a >; > <; > <; href=; https; .oiv2ytfpq; New Models; Company; Century < /a >; Simone Brunozzi; >; companies; entrants; adapt
Y-Combinator; type; company; innovation; experiment; ideas; Google’s Alphabet; hand; tries; innovation; risk; company; pieces; ownership; responsibilities; CEOs. < /i > < /li > <; > <; href=; https; //www.zephyrproject.org/; Zephyr < /a >; Linux Foundation; IoT; source; OS
tbh; people
devices; href=; https; >; sensors; /a; >
fragmentation; Samsung; home; Nest
Swinging Click; companies; world; things; < /li > <; > <; href=; https; Approach; Botnets; Learn

KeyboardInterrupt: 

## Example 10. Visualizing interactions between entities with HTML output

In [21]:
import os
import json
import nltk
from IPython.display import IFrame
from IPython.core.display import display

BLOG_DATA = "resources/ch05-webpages/feed.json"

HTML_TEMPLATE = """<html>
    <head>
        <title>%s</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    </head>
    <body>%s</body>
</html>"""



In [None]:


blog_data = json.loads(open(BLOG_DATA).read())

for post in blog_data:

    post.update(extract_interactions(post['content']))

    # Display output as markup with entities presented in bold text

    post['markup'] = []

    for sentence_idx in range(len(post['sentences'])):

        s = post['sentences'][sentence_idx]
        for (term, _) in post['entity_interactions'][sentence_idx]:
            s = s.replace(term, '<strong>%s</strong>' % (term, ))

        post['markup'] += [s] 
            
    filename = post['title'].replace("?", "") + '.entity_interactions.html'
    f = open(os.path.join('resources', 'ch05-webpages', filename), 'w')
    html = HTML_TEMPLATE % (post['title'] + ' Interactions', 
                            ' '.join(post['markup']),)
    f.write(html.encode('utf-8'))
    f.close()

    print("Data written to", f.name)
    
    # Display any of these files with an inline frame. This displays the
    # last file processed by using the last value of f.name...
    
    print("Displaying %s:" % f.name)
    display(IFrame('files/%s' % f.name, '100%', '600px'))