![oyster](https://www.msc.org/multimedia/images/certified-species/oyster/@@images/f3baa28b-699e-4e54-9c91-3db6fc6646d5.png)

Sources, or, the world is your oyster
-------

**PDF files**

We will be using a special Python package called "PyPDF2." At this point, every time I pull out a special package, you should ask "Where did that come from?" In all honesty, it's a matter of searching and reading to figure out what the best tool for your task is. In this case, we are borrowing from the book ["Automate the Boring Stuff with Python: Practical Programming for Total Beginners"](https://automatetheboringstuff.com/), which is a good introduction to what we've covered so far (chapters 1-6 and a bit beyond) and this section on sources. For PDF files, we look to chapter 13, [Working with PDF and Word Documents](https://automatetheboringstuff.com/chapter13/). 

As we have said several times, data are formatted for various purposes. In some cases, we care about presentation. We want information to be placed on a page in a particular way, making it easy for a human to read. In other cases, the data are organized to be read easily by a machine. And there's a range of options in between. PDF, the Portable Document Format, was successful because it could guarantee the look of a document independent of the hardware, operating system and application software used to create it. In short, the documents are transferable and can be shared widely without worrying about how they will look on different platforms. The format also meets legal requirements (for example, it cannot be altered without leaving some kind of electronic trace) and so you will find court documents and other government publications in PDF. 

So with PDF, the emphasis is on the look of the data. Extracting text from a PDF, however, can be difficult. There are two kinds of PDFs -- one that is text-based and one that is image based. In the image case, the document is like a sequence of gif's or jpg's. To get at the content you can transcribe the text by hand, or you can apply some kind of OCR (Optical Character Recognition) to turn patterns of light and dark pixels into a best guess about letters and words and sentences.

The so-called ["Trump Dossier"](https://assets.documentcloud.org/documents/3259984/Trump-Intelligence-Allegations.pdf) is an example of an image-based PDF. Click on the link and have a look. In a PDF viewer, try to "select" the text on the page. You can't. Instead, the entire page will likely be selected. This is your sign that the pages in your PDF are represented by images and your life is just that much harder. 

By contrast, consider any of the Financial Disclosures for Trump's nominees. These tend to be text-based PDF's. Here is [Steven Mnuchin's Ethics Agreement](https://extapps2.oge.gov/201/Presiden.nsf/PAS+Index/DF569F5F0BA38E29852580C1002C7A4F/$FILE/Mnuchin,%20Steven%20T.%20finalAMENDEDEA.pdf). Click on this link and then try to grab some text. You'll see you can easily copy and paste sections to a Word document or some text editor. Ah, but we're not out of the woods in terms of text-based PDFs. Yes, we can grab and copy text, but they can still be hard to work with. A PDF's emphasis on how things look means blocks of formatted text might not end up logically connected, and grabbing text can produce a jumble of words and letters. Here's a passage from the Automate book.

>THE PROBLEMATIC PDF FORMAT<br><br>
While PDF files are great for laying out text in a way that’s easy for people to print and read, they’re not straightforward for software to parse into plaintext. As such, PyPDF2 might make mistakes when extracting text from a PDF and may even be unable to open some PDFs at all. There isn’t much you can do about this, unfortunately. PyPDF2 may simply be unable to work with some of your particular PDF files. That said, I haven’t found any PDF files so far that can’t be opened with PyPDF2.

So, let's see how this works. First, install the PyPDF2 package. As with TextBlob and other modules we have been using, this one does not come with either basic Python or with Anaconda. We will use the Python installer PIP to add it to our environment. Recall, PIP is a UNIX command so we can call it from our Terminal window or from within the notebook by starting the cell with some "magic". The "%%sh" says that the following lines are not Python, but UNIX shell ("sh") commands.

In [None]:
%%sh

pip install PyPDF2

So to try it out, let's just stick with Mnuchin's Ethics Agreement. You can get a complete list of these for all of Trump's nominees at the [White House Disclosures page](https://www.whitehouse.gov/briefing-room/disclosures/financial-disclosuraes). You can either [download the PDF](https://extapps2.oge.gov/201/Presiden.nsf/PAS+Index/DF569F5F0BA38E29852580C1002C7A4F/$FILE/Mnuchin,%20Steven%20T.%20finalAMENDEDEA.pdf) **and place it in the folder containing this notebook, giving it the name mnuchin.pdf**, or you can use the urlretrieve() function from the "urllib" package that we showed in the previous notebook.

Then, we import a PdfFileReader function (which should remind you of read_csv() but its name is in "CamelCase"). We first open the file as we did in the previous class and then read the contents using this special PDF reader. The result is an object that has data and methods that help you manipulate a PDF document. 

What kinds of things would we like to be able to do with data in PDF format?

In [None]:
from urllib import urlretrieve

url = "https://extapps2.oge.gov/201/Presiden.nsf/PAS+Index/DF569F5F0BA38E29852580C1002C7A4F/$FILE/Mnuchin,%20Steven%20T.%20finalAMENDEDEA.pdf"
urlretrieve(url,"mnuchin.pdf")

In [None]:
from PyPDF2 import PdfFileReader

# open the file for 'reading' and signal that the data inside might be 'binary'
pdf_file = open('mnuchin.pdf', 'rb')

# use the file to create a PDF reader object to extract the text
pdf = PdfFileReader(pdf_file)
type(pdf)

The reader object has attributes like the number of pages in the document as well as methods that help you extract and store information in the PDF.

In [None]:
pdf.numPages

The expression above doesn't have ( )'s after numPages -- instead, it fetches **data associated with our PDF object**. It is not a function that does something to the object, changing it in some way; but is, instead, a way to access data or attributes stored with the object -- in this case, the number of pages in Mnuchin's letter. 
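To make the attribute-versus-method distinction concrete, here is a toy sketch. The `Letter` class and its names are made up for illustration and say nothing about how PyPDF2 is built; the only point is that an attribute is looked up without parentheses while a method is called with them.

```python
# A toy class (hypothetical, not part of PyPDF2) contrasting
# an attribute (stored data) with a method (something you call).
class Letter(object):
    def __init__(self, pages):
        self.pages = pages              # data stored with the object
        self.num_pages = len(pages)     # an attribute: no ( ) needed

    def get_page(self, index):          # a method: called with ( )
        return self.pages[index]

letter = Letter(["page one text", "page two text"])
print(letter.num_pages)     # attribute access
print(letter.get_page(0))   # method call, counting from zero
```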

Next, we can get a "page object" that represents a single page from the letter. What kinds of things do we want to do to a single page of PDF? Below we pull the first page (again, we start counting from zero) from the document and then have a look at its contents. Oh we also call help() to see what this object can do. You'll notice a lot of it has to do with manipulating PDFs, transforming PDF pages -- like by rotating them -- and authoring a PDF from within Python.

In [None]:
page = pdf.getPage(0)
type(page)

In [None]:
help(page)

In [None]:
print page.extractText()

Notice that the text looks a little garbled. That's because there are special symbols being used to render the fancy quotations and various other effects in the PDF document. The text of the page is stored as **a unicode string.** We can see this by not printing the text but just looking at it straight...

In [None]:
page.extractText()

... the 'u' in front of the string says that it is not a regular ASCII string. We see hints of special characters as well. The symbol '\xa7' uses two hexadecimal digits to stand for [a particular unicode character](http://www.charbase.com/00a7-unicode-section-sign). We talked about the unicode specification in class but [here is a good reference](http://symbolcodes.tlt.psu.edu/web/unicode.html). It is thought of as the "Universal Alphabet", with the ability to represent symbols from all known languages. It looks like the text from our PDF is appearing in Python as unicode because the extractor is having trouble with the font being used for making the section symbols in the document.

We can remove odd symbols and turn the text into ASCII. ASCII is a way to encode text that is older than Unicode and has a greatly reduced set of characters. It focuses just on those characters that are needed for standard English -- basically the US keyboard. We use the encode() method for a string to move between Unicode and ASCII, transforming a string with lots of extra symbols into something less complex. Again, we talked about ASCII briefly in class, and [here is a good reference](http://www.computerhope.com/jargon/a/ascii.htm).

In the command below, we 'ignore' the characters that are not ASCII. We could instead 'replace' them with some consistent symbol so we could scan for where things went wrong.

In [None]:
page.extractText().encode('ascii','ignore')
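To see the difference between 'ignore' and 'replace' without a PDF in hand, here is a small sketch on a made-up string (the text is invented, it is not taken from Mnuchin's letter). The extra decode("ascii") at the end is only there so the result prints the same way under Python 2 and Python 3.

```python
# a made-up unicode string with a section sign and curly quotes
text = u"See \xa7 2635.502 for \u201cdetails\u201d"

# 'ignore' silently drops anything that is not ASCII
ignored = text.encode("ascii", "ignore").decode("ascii")

# 'replace' leaves a "?" behind wherever a character was lost
replaced = text.encode("ascii", "replace").decode("ascii")

print(ignored)
print(replaced)
```

With 'replace', each "?" marks a spot where the original had a character ASCII could not represent, which makes the damage easy to search for.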

But this still leaves us with a moderately serious cleanup problem. Look at the way the address has been smashed into the body of the text. Try one of the later pages, 4 or 5. See what happened there. Ah, such is the life of a righteous computational journalist -- PDFs are everywhere but they can be hard to work with. Now you have a glimpse of why.

In [None]:
page = pdf.getPage(3)
divesting = page.extractText()

divesting

We can use the raw text and our abilities with regular expressions to pull out a list of companies Mnuchin will divest his interests in after confirmation. Has he? The one new part of the regular expression syntax here is called a "positive look ahead". The "(?=...)" lets our pattern match only if the text that comes next matches "...", without using up that text. In our case we want to move along the string looking for something bracketed by a number and a period. Here is the <a href="https://regexper.com/#%5B0-9%5D%2B%5C.%20(.*%3F)(%3F%3D%5B0-9%5D%2B%5C.)">display from regexper.com</a>.

Notice, just notice, how the first element in the list doesn't appear because the "1" comes through the PDF process as a capital "I". This, this is the kind of thing you deal with when given a PDF.

In [None]:
from re import findall

# a raw string (r"...") keeps the backslashes exactly as the regex engine needs them
findall(r"[0-9]+\. (.*?)(?=[0-9]+\.)",divesting)
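Here is the same pattern run on a short made-up string, as a sketch of what the look ahead does. The company names are invented. The second string imitates the PDF glitch where the leading "1" arrives as a capital "I".

```python
from re import findall

pattern = r"[0-9]+\. (.*?)(?=[0-9]+\.)"

# made-up text, numbered the way a list in the letter might be
clean = "1. Alpha Corp 2. Beta LLC 3. Gamma Inc 4. Delta LP"
# same text, but with the leading "1" mangled into a capital "I"
mangled = "I. Alpha Corp 2. Beta LLC 3. Gamma Inc 4. Delta LP"

clean_items = [item.strip() for item in findall(pattern, clean)]
mangled_items = [item.strip() for item in findall(pattern, mangled)]

print(clean_items)
print(mangled_items)
```

In both cases "Delta LP" is missing: the look ahead insists on another number and period after each item, so the final item in a string has nothing to close it off. That is worth checking for when you use this pattern on a real page.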

**HTML: Pulling simple tables**

We are now going to pull data from an HTML page. This exercise is often called "web scraping" or "screen scraping". At its core, web scraping refers to taking data formatted for one purpose and reformatting it for another -- taking information stored for display in a web browser, an HTML (HyperText Markup Language) document, and turning it into something we can program with. One easy task involves simply pulling tables from an HTML page. These are as close as we are going to get to formatted data on an HTML page. In fact, we can pull HTML tables directly using the Pandas function read_html(). 

Actually, read_html() returns a list of DataFrames, one for each table on the page. Let's have a look. We first have to install a library to help us deal with HTML.

In [None]:
%%sh 

pip install html5lib

Now, let's look at the [Wikipedia entry for Donald Trump's cabinet](https://en.wikipedia.org/wiki/Cabinet_of_Donald_Trump). We might be interested in the table called "Confirmation Timeline." Let's see how this would work. First we just use read_html() from pandas. This gives us a list consisting of one data frame per table on the page. We then need to find our table (which  element of the list) and clean it up a little.

In [None]:
from pandas import read_html

tables = read_html("https://en.wikipedia.org/wiki/Cabinet_of_Donald_Trump")
type(tables)

In [None]:
len(tables)

The timeline table is the fourth entry in the list, the one having index 3. Notice that our table currently just has numbers for column names and that the real column names are stored in the second row of the table. 

In [None]:
tables[3].head(5)

Let's clean things up. Set the column names, drop the first couple of rows we don't need and then reset the row names to run 0, 1, 2...

In [None]:
# select the 4th table
cabinet = tables[3]

# use the second row (index 1) for labels -- iloc[1] picks off
# that row's values; list(cabinet[1:2]) would return the column labels instead
names = list(cabinet.iloc[1])
names[7] = "Confirmation"

# keep all but the 0 and 1 index rows
cabinet = cabinet[2:]
cabinet.columns = names

# and reset the index of the rows to 0 1 2... 
# we don't want the old index as a column so we "drop=True"
cabinet = cabinet.reset_index(drop=True)

In [None]:
cabinet.head(5)

This is still going to require a fair amount of clean up. But it's nothing we haven't seen before. We need to recode some data and deal with the footnotes that have come through in square brackets, and some of the rows need help (like Puzder's).
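As one example of that cleanup, here is a minimal sketch for stripping the bracketed footnote markers with a regular expression. The cell values are invented stand-ins for the kind of thing a Wikipedia table produces, and `strip_footnotes` is a hypothetical helper name.

```python
import re

# hypothetical cell values of the kind a Wikipedia table produces
cells = ["Some Nominee[5]", "February 1, 2017[12][13]", "56-43"]

def strip_footnotes(value):
    # remove every bracketed marker such as [5] or [a]
    return re.sub(r"\[[^\]]*\]", "", value).strip()

cleaned = [strip_footnotes(c) for c in cells]
print(cleaned)
```

The same function could be applied down a column of the data frame with the apply() method, assuming the column holds strings.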

**HTML: Web scraping**

Let's try a harder example. Suppose we want to pull data about the temperature (taking Kat's lead). We might go to [weather.gov](weather.gov) and then type in the ZIP code for Columbia University (10027).

[Here's the page you get.](http://forecast.weather.gov/MapClick.php?lat=40.811700869100946&lon=-73.95285802527684#.VsCB6ZMrIfw)

Look for the temperature. (Such beautiful springlikeness!) Now, how do we get at that piece of data, the temperature right now? Most of the browsers you're using offer so-called developer tools to help us sleuth around the innards of a web page. In Chrome you just go up to your main menu and select View -> Developer -> Developer Tools. This will open a side pane that lets you examine HTML elements by mousing over the text on the right. By digging into the elements you eventually find where the data are hiding. Equivalently, you can simply "right click" on the data point and it will drive the text on the right to the right part of the HTML page.

Here's a great video describing Chrome's developer tools to explore the structure (the elements or tags) of a web page (another soft-spoken Brit).

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('nV9PLPFTnkE')

Looking through the page, you'll see that the data (the temperature we're after) is stored in a paragraph tag &lt;p&gt; with attribute "class=myforecast-current-lrg". For those of you who need an HTML refresher, [this is a good one](http://www.w3schools.com/html). Essentially HTML provides tags or "markup" to help describe the document for your browser so it can render the contents properly for you to read. The tags are things like paragraphs and lists and headings -- things that describe what kind of text we have and how it should be displayed.

The tools we introduce now let us **parse** a web page and find the contents of the tag we're after, the temperature value hiding in a &lt;p&gt; paragraph. 

The first step is accessing the web page. So far we've acquired data from the web through Pandas functions like read_csv() or read_html(). The "urllib" and "urllib2" packages, which come with base Python, let you explicitly request objects on the web. Here we are instead going to use the "requests" package. It contains a function get() to submit a request to a web server, together with objects to represent what comes back.

When I scrape the web, I add a header with my email address. If I'm going to take data like this, I feel like I should leave behind a trace that I was there. So the URL for the page on weather.gov with Columbia's temperature is given below as a string stored in the variable "url". We use it to form a request to the weather.gov server for the page. We make the connection and read the page into a string. **The whole page as a string.**

In [None]:
from requests import get

url = "http://forecast.weather.gov/MapClick.php?lat=40.811700869100946&lon=-73.95285802527684#.VrzaH5MrIfw"    
head_data = {'From': 'markh@columbia.edu'}

response = get(url,headers=head_data)    
type(response)

In [None]:
response.status_code

In [None]:
response.text

Above, we include a "header" field (represented as a dictionary). The header passes information to the web server that might change the way it returns content. In later exercises, we might need to specify the header "User-Agent" which tells the server what kind of browser the request is being made from -- some servers don't like handing pages out to bots. 

For now, we are using the "From" header to announce ourselves. I like to tell a source that I'm taking data. If you want to know more about headers, have a look [here](https://en.wikipedia.org/wiki/List_of_HTTP_header_fields).

This is kind of a mess. The whole web page has been read in as a string. Thankfully, one of the great things about Python is a package called [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/), designed by [Leonard Richardson](http://www.crummy.com/self/). It is truly a thing of beauty. BeautifulSoup is a parser for HTML (and XML) that creates an object that lets you interact with the components of a web page. You can search for tags, extract attributes from the tags and pull the content contained in a tag. [The documentation is pretty simple too.](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) The latest version of BeautifulSoup is 4.5.3 and the package is called bs4.

In [None]:
from requests import get
from bs4 import BeautifulSoup

url = "http://forecast.weather.gov/MapClick.php?lat=40.811700869100946&lon=-73.95285802527684#.VrzaH5MrIfw"    
head_data = {'From': 'markh@columbia.edu'}

response = get(url,headers=head_data)    

# the new bit, reading the data into a BeautifulSoup object
page = BeautifulSoup(response.text)
type(page)

See? Not a string! (You will see a UserWarning that you can ignore -- we will address it later.) BeautifulSoup objects have a method called prettify() that makes them a bit easier to read when they are printed.

In [None]:
print page.prettify()

As I said, the great strength of BeautifulSoup is your ability to navigate through its objects. Here we find() the first occurrence of a paragraph tag (denoted p)...

In [None]:
print page.find("p").prettify()

... try to find it on the page for weather.gov! Next, we iterate through all the paragraph tags, printing a string of "="'s between them so you can see each one.

In [None]:
for p in page.find_all("p"):
    print p.prettify()
    print "==="*10

Now, in our case, the forecast is found in a paragraph tag that has an attribute "class=myforecast-current-lrg". If you want to find a tag with a given "class" attribute, you can simply change the call as follows. There is a more elaborate way to search based on other attributes than "class" (like id= or title= or src=), but "class" is used so often in web design to specify its "styling" (via a Cascading Style Sheet, say) that Leonard decided to make that kind of search easy. **So (tag, class label) are all you need...**

In [None]:
print page.find("p","myforecast-current-lrg").prettify()

Now, the objects returned by find() and find_all() are again objects from BeautifulSoup: find() gives back a single Tag, and find_all() gives back a list of them. To get the contents of a tag in a string, we use the method get_text().

In [None]:
temperature = page.find("p","myforecast-current-lrg")
type(temperature)

In [None]:
temperature.get_text()

Another unicode string! Why? We can look up this unicode symbol "\xb0" and find that [it stands for degrees](http://www.charbase.com/00b0-unicode-degree-sign). We can remove it as we did the extra characters in our PDF using encode(). Finally, we might want to also strip off the F and turn the temp into a number, an integer. 

In [None]:
temperature.get_text().encode("ascii","ignore")

In [None]:
int(temperature.get_text().encode("ascii","ignore").replace("F",""))
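The chain of encode() and replace() above will raise an error if the page ever shows something other than a number followed by "F". As a hedged alternative, here is a sketch that pulls out the first whole number with a regular expression and returns None when there isn't one. The function name `parse_temperature` and the "N/A" case are assumptions made for illustration, not something taken from weather.gov.

```python
import re

def parse_temperature(text):
    # find the first (possibly negative) whole number in the string
    match = re.search(r"-?[0-9]+", text)
    if match is None:
        return None   # e.g. the page showed "N/A" instead of a reading
    return int(match.group())

print(parse_temperature(u"39\xb0F"))
print(parse_temperature(u"-4\xb0F"))
print(parse_temperature(u"N/A"))
```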

With this, we can make a request for the URL for our ZIP code every 30 minutes, say, and come up with a time series of temperatures. Here we make a request, sleep() for 30 seconds and then make another request for the page and print out the next temperature. Let this run a little and see when the temperature changes. You can stop this by going up to the "Kernel" button on the menu bar at the top of this window and select "Interrupt". (It will also stop after 100 iterations.)

In [None]:
from requests import get
from bs4 import BeautifulSoup
from time import sleep

url = "http://forecast.weather.gov/MapClick.php?lat=40.811700869100946&lon=-73.95285802527684#.VrzaH5MrIfw"    
head_data = {'From': 'markh@columbia.edu'}

for i in range(100):
    
    response = get(url,headers=head_data)    
    page = BeautifulSoup(response.text)
    temperature = int(page.find("p","myforecast-current-lrg").get_text().encode("ascii","ignore").replace("F",""))

    print temperature
    sleep(30)
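The loop above only prints each reading. To end up with a time series you would also want to remember when each value was taken. Below is a minimal sketch of that idea. `collect` and `fake_fetch` are hypothetical names, and the fake function stands in for the request-and-parse step so the example runs without touching the network.

```python
from datetime import datetime
from time import sleep

def collect(fetch, n, pause):
    # call fetch() n times, waiting `pause` seconds between calls,
    # and record when each reading was taken
    readings = []
    for i in range(n):
        readings.append((datetime.now(), fetch()))
        if i < n - 1:
            sleep(pause)
    return readings

# stand-in for the scraping code: hands back canned values
fake_values = iter([41, 41, 42])
def fake_fetch():
    return next(fake_values)

series = collect(fake_fetch, 3, 0)
for when, temp in series:
    print(when.strftime("%H:%M:%S") + " " + str(temp))
```

In real use, the pause would be something like 30 * 60 seconds and fetch would wrap the get() and find() calls from the cell above.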

As another example, let's look at the White House petition site. It's a miracle this is still up! Have a look at [the page](https://petitions.whitehouse.gov/petitions) and tell me something about how the individual petitions are stored. Are they in consistent HTML objects with consistent CSS styling? 

Below we request the page and have a look.

In [None]:
from requests import get
from bs4 import BeautifulSoup
from time import sleep

url = "https://petitions.whitehouse.gov/"
head_data = {'From': 'markh@columbia.edu'}

response = get(url,headers=head_data)    
page = BeautifulSoup(response.text)
    
print len(page.find_all("div","views-row"))

The petitions are stored in "div" tags that have a class that starts with "views-row". So let's pull the first one and have a look. We can see easily where the goal, number of signatures and the title are stored. 

Tell me!

In [None]:
for petition in page.find_all("div","views-row"):
    print petition
    print "---"*5

Let's test out our insights. We'll pull the first petition region of the page and then try to extract the data.

In [None]:
petition = page.find("div","views-row")

In [None]:
print petition.find("span","goal").get_text()
print petition.find("span","signatures-number").get_text()
print petition.find("h3").get_text()

Now we're cooking! Let's iterate over each petition on the page and store the data in lists. These will be columns for a data frame. Ha!

In [None]:
from requests import get
from bs4 import BeautifulSoup
from pandas import DataFrame

url = "https://petitions.whitehouse.gov/"
head_data = {'From': 'markh@columbia.edu'}

response = get(url,headers=head_data)    
page = BeautifulSoup(response.text)

# initialize the columns of a data frame
goal = []
signatures = []
title = []

# run through each petition and collect data
for petition in page.find_all("div","views-row"):
    
    goal.append(int(petition.find("span","goal").get_text().replace(",","")))
    signatures.append(int(petition.find("span","signatures-number").get_text().replace(",","")))
    title.append(petition.find("h3").get_text())
    
petitions = DataFrame({"goal":goal,"signatures":signatures,"title":title})
petitions

Now, to get all of them, look at what happens to the URL when you hit "load more" at the bottom of the [petitions page](https://petitions.whitehouse.gov/). You see the URL changes to

[https://petitions.whitehouse.gov/petitions?page=1](https://petitions.whitehouse.gov/petitions?page=1)

and hit it again to get

[https://petitions.whitehouse.gov/petitions?page=2](https://petitions.whitehouse.gov/petitions?page=2)

So the URL encodes the data we want. To get all the petitions, we simply have to update our URL, repeating the exact code from above. Oh and the White House, like Python, starts counting at 0. So this page is our original first page.

[https://petitions.whitehouse.gov/petitions?page=0](https://petitions.whitehouse.gov/petitions?page=0)

In the code below, we create a loop where the variable "p" runs from 0,1,2... Each time we append the index to the back of the URL and scoop up the next page of petitions.

In [None]:
from requests import get
from bs4 import BeautifulSoup
from pandas import DataFrame, set_option

set_option("display.max_colwidth",200)

goal = []
signatures = []
title = []

for p in range(10):
    
    # run through 10 pages of petitions -- probably need fewer
    
    url = "https://petitions.whitehouse.gov/petitions?page="+str(p)
    head_data = {'From': 'markh@columbia.edu'}

    response = get(url,headers=head_data)    
    page = BeautifulSoup(response.text)

    # add the new data to our columns
    for petition in page.find_all("div","views-row"):
        
        goal.append(int(petition.find("span","goal").get_text().replace(",","")))
        signatures.append(int(petition.find("span","signatures-number").get_text().replace(",","")))
        title.append(petition.find("h3").get_text())
    
petitions = DataFrame({"goal":goal,"signatures":signatures,"title":title})
petitions
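The loop above asks for a fixed 10 pages, which may be more than the site has. One way to avoid guessing, sketched below, is to stop as soon as a page comes back empty. This assumes a page past the end simply contains no petitions, which you should check against the real site. `collect_pages` and `fake_fetch_page` are hypothetical names, and the fake function stands in for the request-and-parse step.

```python
def collect_pages(fetch_page, max_pages=10):
    # fetch_page(p) is assumed to return a list of rows for page p
    rows = []
    for p in range(max_pages):
        new_rows = fetch_page(p)
        if not new_rows:      # an empty page means we ran off the end
            break
        rows.extend(new_rows)
    return rows

# stand-in for the request-and-parse step: three pages, then nothing
fake_site = {0: ["a", "b"], 1: ["c", "d"], 2: ["e"]}
def fake_fetch_page(p):
    return fake_site.get(p, [])

all_rows = collect_pages(fake_fetch_page)
print(all_rows)
```

Keeping `max_pages` as a ceiling means the loop still ends if the site never returns an empty page.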

**Web Services and APIs**

What we are doing in the last two examples is using a web page as a carrier of the data we're after. We are using an HTTP request as a kind of database lookup. The data we want, in this case, is specified by parameters in the URL. This method, by the way, is called GET. When the parameters are instead carried in the body of the request, hidden from the URL, we call that POST. More on that later.

Here's access to the [Google Translate service](http://translate.google.com). You can type in an expression and watch the URL change. Pass it a string, along with the "from" and "to" languages, and you will get back a translation.

In [None]:
from requests import get
url = 'http://translate.google.com/translate_a/t'

params = {
    "text": "Remove from office Jess L. Baily, United States Ambassador to the Republic of Macedonia", 
    "sl": "en", 
    "tl": "es", 
    "client": "p"
}

print get(url, params=params).content
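When you hand a dictionary of parameters to get(), they end up URL-encoded after a "?" at the end of the address. The sketch below shows that encoding step on its own using the standard library, with a placeholder address (example.com is not a real translation service) and a shortened text. The try/except is only there so the import works under both Python 2 and Python 3.

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3

base = "http://example.com/translate"     # placeholder, not a real service
# a list of pairs keeps the parameters in a fixed order
params = [("text", "hello world"), ("sl", "en"), ("tl", "es")]

query = urlencode(params)
full_url = base + "?" + query
print(full_url)
```

Notice that the space in "hello world" becomes a "+", since spaces are not allowed in a URL.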

This idea of pulling data from a web server, executing a service and receiving output, has been formalized and made a lot friendlier to programmers. You can think of data now as a kind of service. Data are served up in machine-readable forms rather than web pages that have to be wrangled — an **Application Programming Interface or API** describes how you interact with a data server, how you pose queries and enforces constraints on how much data you can pull (why?). Here is a simple API for Google's toolbar suggestions. Notice that with a little string manipulation, we can test any starter search.

In [None]:
from urllib2 import Request, urlopen
from bs4 import BeautifulSoup

url = "http://suggestqueries.google.com/complete/search?output=toolbar&hl=eng&q=why%20is%20trump"
request = Request(url)

connection = urlopen(request)
page = BeautifulSoup(connection)
print page.prettify()

The data come back in XML. It is one of the popular forms for an API to use. XML, or the eXtensible Markup Language, is in the same family as HTML, except that instead of having fixed tags, the tags are open to your choosing -- they give you freedom to describe your data, much in the same way HTML tags describe components of a document. 

In this case, the data we're after is stored in the attributes of the tags, not as strings in the tags themselves. We extract attribute data from a Tag object using square brackets and the name of the attribute.

In [None]:
for s in page.find_all("suggestion"):
    print s["data"]
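BeautifulSoup is not the only way to read XML. As a sketch of the same idea with only the standard library, the example below uses ElementTree on a small made-up document. The XML here is invented and only assumed to resemble the shape of the suggestions output; the suggestion text is made up too.

```python
import xml.etree.ElementTree as ET

# a made-up response, assumed to resemble the shape of the output above
xml_text = """<toplevel>
  <CompleteSuggestion><suggestion data="why is the sky blue"/></CompleteSuggestion>
  <CompleteSuggestion><suggestion data="why is the sea salty"/></CompleteSuggestion>
</toplevel>"""

root = ET.fromstring(xml_text)

# iter() walks every <suggestion> tag; get() reads one attribute
suggestions = [s.get("data") for s in root.iter("suggestion")]
print(suggestions)
```

Here get("data") plays the role that the square brackets play on a BeautifulSoup Tag.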

Play with this a little and see if you find anything interesting. 

**JSON**

We have already seen an API that returns data not as XML but JSON, the Javascript Object Notation. Recall that Twitter's data are all JSON. JSON looks an awful lot like dictionaries and lists and the basic data that Python knows about. The New York Times also makes a lot of data available via APIs that return JSON data. Twitter, Instagram, Facebook all have APIs. In fact one of the great things about APIs is that once data are easily manipulable by machine, they can be reused for a variety of purposes. We used to call this a "mashup".
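To see how closely JSON tracks Python's own data types, here is a minimal sketch using the standard json module on a small made-up document (the field names are invented, not from any real API).

```python
import json

# a small made-up JSON document, as it might arrive from an API
raw = '{"user": "example", "followers": 12, "tags": ["news", "data"]}'

record = json.loads(raw)          # JSON text -> Python dictionary
print(type(record).__name__)
print(record["followers"] + 1)    # numbers arrive as numbers
print(record["tags"][0])          # lists arrive as lists

# and back again: Python data -> JSON text
print(json.dumps({"count": len(record["tags"])}))
```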

We are now going to let Suman lead us through the digg API and how to build a bot.

# Bot Beats

Before we begin, let's briefly go over some questions from our previous class. These are based on discussions we had:

1. *The case of walled gardens:* If bots are being built on platforms, don't platforms own a large share of the power? Does the situation change for news via apps vs. bots?

2. *Automatically giving your bots a voice:* There are some common tools to automatically generate natural language, like Markov Chains. But there are also more sophisticated ones, like neural networks, and my favorite type of neural network that does this is called Long Short-Term Memory, or LSTM. 

3. *The computable elements of conversations:* Do current bots possess all of these? Not even close. A bot with even 3 of these abilities could usher in the PokemonGo moment for conversational AI. 

4. *Why are people "chatting" with service bots:* Is this only because they want to break a bot? Maybe, but we should not underestimate the human need for connection. The easiest way to connect is through conversation. 

5. *Why are people not building bots on whatsapp:* A platform has to be open for people to build bots in it.
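Since Markov Chains came up as a simple way to give a bot a voice, here is a toy sketch of the idea: record which words follow which, then walk from word to word at random. The tiny corpus is made up, and `build_chain` and `generate` are hypothetical names. A real bot would need far more text and some care with punctuation.

```python
import random

def build_chain(text):
    # map each word to the list of words seen right after it
    words = text.split()
    chain = {}
    for current, following in zip(words, words[1:]):
        chain.setdefault(current, []).append(following)
    return chain

def generate(chain, start, length, rng):
    # walk the chain, picking a random follower at each step
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:     # dead end: this word was never followed by anything
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)

corpus = "the bot reads the news and the bot writes the news"
chain = build_chain(corpus)
sentence = generate(chain, "the", 6, random.Random(0))
print(sentence)
```

Every pair of neighboring words in the output appeared somewhere in the corpus, which is why the result sounds locally plausible even when the whole sentence makes little sense.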

<hr>

## Session Goals
1. Access data about some topic through an API. The topic could be as big as "climate change" or something closer to *named entities*, like "new york city" 
2. Parse the data which the API returns. The data is a bunch of links about the topic, with related metadata for each link.
3. Make a conversational bot using certain fields in the data. 
4. Try to encode some of the computable elements of conversations that we discussed in the previous class - e.g, some kind of memory. 

*North Star:*
- Can you educate a person about the recent events on some topic, such as bitcoin, climate change, chinese internet, healthcare etc., through conversation?

<hr>

## Conversing about news on a topic/ beat

In [None]:
import json
import re 

### Step 1: Choose a topic.. 

This could be any topic, newsy or not, general or specific. For example, it could be 'data science' or 'climate change'. It could also be 'uber' or 'japan'. Try to choose one you are passionate about, or one you want to educate others about because you feel more people should know about the topic and what's happening in it.

For example, I'll choose data science now:

In [None]:
topic = 'data science'

### Step 2: Query the API 

**For the purposes of this exercise, we are going to use Digg's search API. This will let us check if [Digg's editors](http://digg.com/about) have featured anything about a particular topic.**


I am going to use digg's search API for the purposes of this example... but you are free to use *any API* out there that you want. 

Some notes about this API - 

- I like this API because I trust the editorial frontpage. Digg sees ~8.5 million urls every day through its aggregator tools (like Digg Reader or Digg Deeper), but the editors feature the best or the most important stories on the frontpage. The front page gets ~150 stories per day through this method. The search API we are talking about here **returns articles only from this editorial curation.** And thus, they are usually high quality.  

- It's fairly clean data, and gives a lot of **social meta information**, like fb shares, diggs, the total number of articles about that topic that the editors have featured in the last 4 years etc. These counts are not always updated, because we don't usually show all this meta info on the search page on the website. Thus, I would be careful when using those counts if you have to. But it doesn't hurt to try it. 

- The data also tells if there is **multimedia** on the page, like a video. As a bot, you have the right to send your user at least a few gifs (somebody should write a bot rights manifesto) 

- Because the editors decide what goes on the front page, there is obvious **bias** in what topics will return acceptable results. For example, you might not get much if you query certain types of things, say [NY Giants](http://digg.com/search?q=NY+giants). The obvious question is why the search API does not bring results from reader+deeper, which have all the links. Simple: it's expensive, and we thought the search page should have the editorial "voice". 

- Feel free to see the results visually or as json dictionary:
    - [http://digg.com/search?q=climate%20change](http://digg.com/search?q=climate%20change)
    - [http://digg.com/api/search/climate+change.json](http://digg.com/api/search/climate+change.json) (install [jsonview](https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc?hl=en) for pretty web view) 
    
- You can paginate, i.e. get the latest 20 links, then the next 20, and so on.

- **Rate Limiting**: Be careful not to hammer the API.  
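
This notebook doesn't state an actual rate limit, so the one-second gap mentioned below is purely my assumption. As a minimal sketch of "not hammering", a hypothetical `Throttle` helper (my own, not part of `requests`) can space calls out:

```python
import time


class Throttle(object):
    """Make sure successive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.time, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable, so the class can be tried without real waiting
        self._sleep = sleep
        self._last_call = None   # time of the previous call, None before the first one

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

One way to use it: create a single `Throttle(1.0)` and call its `wait()` right before each `requests.get(...)`. The `clock` and `sleep` arguments are only there so the helper can be exercised with fake time.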

In [None]:
import requests 

*What are we exactly trying to request*: 

- Document search on platforms is usually powered by an underlying search engine [(Lucene, Elasticsearch)](https://en.wikipedia.org/wiki/Elasticsearch), which requires a certain type of word expression to perform a query. The simplest of these word expressions is a "+" between each word. 

- User interfaces have the ability to convert phrases into these expressions, so when you search on Netflix you don't write 'person+of+interest', you just write the words. Similarly, your browser's address bar takes care of it. 
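
A plain `replace(" ", "+")` is fine for simple topics, but it leaves characters like `&` untouched. As a hedged alternative, the standard library's `quote_plus` does the "+" joining and escapes the rest. The helper name `make_topic_query` is my own, and I have not checked how the Digg API treats percent-encoded characters, so take this as a sketch:

```python
try:
    from urllib.parse import quote_plus   # Python 3
except ImportError:
    from urllib import quote_plus         # Python 2


def make_topic_query(topic):
    """Turn free text into a '+'-joined, URL-safe search term."""
    # collapse stray whitespace first, so 'NY   giants ' gets single separators
    cleaned = ' '.join(topic.split())
    return quote_plus(cleaned)
```

For example, `make_topic_query('climate change')` gives `'climate+change'`, the same as the simple replace.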

In [None]:
# format the topic words, so it becomes a search term. 
print 'My topic: ', topic
topic_query = topic.replace(" ","+") 
print topic_query

In [None]:
# Format the search terms into the request query 
api_name = 'http://digg.com/api/search/'
api_format = '.json'
api_query = api_name+topic_query+api_format
print api_query

Great, we have a URL now. If you click it, you will see how the data looks in JSON format. But bots can't click (or can they?), so all the magic will happen through requests. So go ahead and query this. 
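
Real requests fail in boring ways: the server is slow, the status code is bad, or the body isn't JSON. Here is a minimal defensive sketch. The function `fetch_json` is hypothetical, and it takes the getter as an argument (you would pass `requests.get` or a session's `get`) so the logic can be tried without a network:

```python
def fetch_json(get, url, timeout=10):
    """Fetch `url` with `get` (e.g. requests.get); return parsed JSON, or None on failure."""
    try:
        response = get(url, timeout=timeout)   # never wait forever on a slow server
    except Exception as err:                   # broad on purpose: connection errors, timeouts, ...
        print('Request failed: {0}'.format(err))
        return None
    if not response.ok:                        # 4xx / 5xx status codes
        print('Bad status code: {0}'.format(response.status_code))
        return None
    try:
        return response.json()
    except ValueError:                         # body was not valid JSON
        print('Response was not JSON')
        return None
```

With `requests` imported, `data = fetch_json(requests.get, api_query)` would be the call; in real code you would probably narrow the first `except` to `requests.RequestException`.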

In [None]:
print 'Requesting resource: ', api_query
r = requests.get(api_query)

In [None]:
# this is a more PYTHONIC way to do it..
with requests.Session() as sess: 
    sess.get('http://digg.com') # go to digg first to get the session
    r = sess.get(api_query)

In [None]:
# what is this response called r ?? 
type(r)

In [None]:
# what can this response do ?
print dir(r)

In [None]:
# did the request fail or was it ok?
r.ok

In [None]:
# how long did the response take to come back
r.elapsed.total_seconds()

In [None]:
# so where is the data ??
data =  r.json()

In [None]:
# and whats the structure of this data the API returned?
data?

In [None]:
# if data['status'] == 'fail'.. we have to go back and get it again
r = requests.get(api_query)
data =  r.json()
data['status']

In [None]:
# what are the top level keys in this dictionary
data.keys()

In [None]:
# what does digg think about this request you made 
print data.get('status')
print data.get('mesg')

In [None]:
def get_topic_data(topic):
    topic_query = topic.replace(" ","+") 
    api_name = 'http://digg.com/api/search/'
    api_format = '.json'
    api_query = api_name+topic_query+api_format
    print 'Querying: ', api_query
    with requests.Session() as sess: 
        sess.get('http://digg.com') # go to digg first to get the session
        r = sess.get(api_query)
    data =  r.json()
    print data.get('status')
    if data.get('status')=='ok':
        feed_is = data['data']['feed']
        print 'Found ', len(feed_is), ' links.'
    else:
        print 'Something went wrong in the Internet pipes.. can you try again'
        feed_is = []
    return feed_is

### 3. Playing around with the data


In [None]:
# where is the data
topic_data = get_topic_data('climate change')

In [None]:
# what does this data look like
topic_data?

In [None]:
# in case you want to see this as a data frame...
import pandas as pd
pd.DataFrame(topic_data).head()

Well, there are 44 top-level fields in each data item (link). Some of these fields, such as 'content', contain more nested fields. Obviously we do not need all of them. 
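
One way to cope with all those fields is to boil each item down to the handful you care about. A minimal sketch, assuming items shaped like the ones in this notebook (`content` holding `title`, `url`, `domain_name` and a list of `tags` with a `name` each); the helper `slim_link` and the sample item are made up for illustration:

```python
def slim_link(link_info, fields=('title', 'url', 'domain_name')):
    """Keep a few 'content' fields plus the tag names; missing fields become None."""
    content = link_info.get('content', {})
    slim = dict((name, content.get(name)) for name in fields)
    slim['tags'] = [tag.get('name') for tag in content.get('tags', [])]
    return slim


# a made-up item, shaped like the API items used in this notebook
sample = {
    'content': {
        'title': 'An example headline',
        'url': 'http://example.com/story',
        'domain_name': 'example.com',
        'tags': [{'name': 'science'}, {'name': 'weather'}],
        'kicker': 'not kept',
    },
    'date': 1486684800,
}
slim_sample = slim_link(sample)
```

A list comprehension such as `[slim_link(item) for item in topic_data]` would then give a much smaller list to look at or to load into a data frame.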

In [None]:
# now for any list.. 
# there's an easy way to print the index of the entry in addition to the entry itself
x = ['hi','there','lets','learn','about','enumerate']
for i in enumerate(x):
    print i

Basically, enumerate just adds a counter to anything that's iterable, like a list, so we get the index of each element along with the element itself. 
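
Two small variations are worth knowing: each item that `enumerate` yields is an (index, value) pair, so it can be unpacked right in the loop header, and the counter can start from something other than 0. A quick sketch:

```python
words = ['hi', 'there', 'lets', 'learn', 'about', 'enumerate']

# each item from enumerate is an (index, value) pair, so unpack it in the loop header
for index, word in enumerate(words):
    print('{0}: {1}'.format(index, word))

# start=1 gives human-friendly numbering without touching the list itself
numbered = list(enumerate(words, start=1))
```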

In [None]:
# what were these links, lets find out
for index, link_info in enumerate(topic_data):
    print index, link_info['content']['title']
    print link_info['content']['url']
    print ''

Two things to note:

- The response of this API is always a list of links in REVERSE CHRON. 

- There are some links that will look like an AD, because they are our digg-store links - but I will show you how to WEED THEM OUT.

In [None]:
# if you want to see all the keys for some data item (i.e. a link)
print topic_data[1].keys()

In [None]:
# here's a trick to print a dictionary with indent
import json
print json.dumps(topic_data[1]['content'], indent=3)

In [None]:
# what are the keys in the content dictionary of a data item
print topic_data[1]['content'].keys()

#### Age of a link

Now about the age of the link: it's not the best idea to throw date times into chat conversations. Manipulating date times can be frustrating sometimes...


<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Anyone who thinks that AI is going to take over the world has never tried attempted to manipulate datetimes</p>&mdash; Greg Stoddard (@gregstod) <a href="https://twitter.com/gregstod/status/829820077583302656">February 9, 2017</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

Converting, ranging, and formatting datetimes can very quickly turn into [yak shaving](http://projects.csail.mit.edu/gsb/old-archive/gsb-archive/gsb2000-02-11.html). 

How can we easily solve this? Using a python package of course. 

In [None]:
# now.. for some date-time magic.. let's install this package
!pip install arrow

In [None]:
import arrow

In [None]:
ts = topic_data[10]['date']
print ts
arrow.get(ts)

In [None]:
arrow.get(ts).humanize()
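
To appreciate what `humanize` is saving us, here is a rough standard-library-only sketch of the same idea. The function `rough_age` is my own, it only knows minutes, hours and days, and arrow's actual wording and rounding are different:

```python
from datetime import datetime, timedelta


def rough_age(then, now):
    """Describe, very roughly, how long before `now` the datetime `then` was."""
    seconds = int((now - then).total_seconds())
    if seconds < 0:
        return 'in the future'
    if seconds < 60:
        return 'just now'
    # walk from the largest unit down and report the first one that fits
    for unit, size in (('day', 86400), ('hour', 3600), ('minute', 60)):
        count = seconds // size
        if count >= 1:
            return '{0} {1}{2} ago'.format(count, unit, 's' if count != 1 else '')


reference = datetime(2017, 2, 9, 12, 0, 0)
age_example = rough_age(reference - timedelta(hours=5, minutes=10), reference)
```

Both arguments are plain `datetime` objects in the same time zone. Even this toy version needs special cases, which is a good argument for reaching for a package.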

In [None]:
arrow.get(ts).naive

In [None]:
arrow.get(ts).date()

In [None]:
# changing time zones... is a pain usually
print 'Time right now utc: ', arrow.utcnow()
print 'Time right now est: ', arrow.utcnow().to('US/Eastern')

In [None]:
# shifting time.. arrow makes this so easy
tnow = arrow.utcnow()
tminus3w = tnow.shift(weeks=-3)  # shift() moves a time by a relative offset
print tnow
print tminus3w

OK, that's enough about datetimes, one of the things that can quickly become frustrating to handle when you are analyzing real-world data. 

Let's get back to the stories.

In [None]:
# what are some of the content fields that are interesting
def show_article(article_number):
    content = topic_data[article_number]['content']
    print 'Title: ',content['title']
    print ''
    print 'Description: ', content['description'].strip()
    print ''
    print 'Domain name: ', content['domain_name']
    print ''
    print 'Kicker: ', content['kicker'] # a funny thing that digg editors will write 
    print ''
    print 'Authors: ', content['author']
    print ''
    print 'Tags: ', [i['name'] for i in content['tags']]
    print ''
    print 'Age: ', arrow.get(topic_data[article_number]['date_published']).humanize()

In [None]:
show_article(6)

### 4. What data features should we utilize? 

Not all data fields can help your conversation. Picking the fields you want to use is **your** design choice. It depends on how you want the conversation to flow, and how the interaction happens. 

Here's mine ..
1. Tag
2. Domain Name
3. Editor kicker (maybe)
4. Author + story title

Let's assume my topic data is from 'data science'. I first want to ask the user if he/she wants a data science story from politics or space or some other tag. Then I construct the message to send to the user. The message might use the author's name and the title.

#### 4.1 Ask user if he/she wants a story from a tag.. 

In [None]:
# what are the tags of the latest 10 stories:
tags_for_links=[]
candidate_topic_data = topic_data[:10] # latest 10 articles..
for link_info in candidate_topic_data:
    # now we remove the tags that are about ads, and store items. 
    skip_tags = ['digg-store','digg-picks','digg-editions'] # digg-store is our store content
    print link_info['content']['title']
    link_tags = [i['name'] for i in link_info['content']['tags'] if i['name'] not in skip_tags]
    print link_info['content']['url']
    print link_tags
    print ''
    tags_for_links.extend(link_tags)

In [None]:
# these are all the tags for the links returned by the API.. 
print tags_for_links

Now, it's silly to ask the user *do you want a science story from data science*. How can we be a little more imaginative? We should probably take the less common tags and suggest those to the user, as these will be more discriminative, unlike "science". How do we find the less common, a.k.a. more discriminative, tags?  

In [None]:
from collections import Counter  
tagDist = Counter(tags_for_links)

In [None]:
tagDist?

In [None]:
print tagDist.most_common()

Clearly our strategy isn't foolproof: 'news' has a pretty low frequency. The editors tag a story 'news' when it has a "happening right now" feeling.
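
One hedged way to combine the two ideas (prefer rare tags, but never suggest a few hand-picked ones) is sketched below. The function `rare_tags`, the `max_count` threshold and the example tags are all made up for illustration:

```python
from collections import Counter


def rare_tags(tags, stop_tags=(), max_count=1):
    """Return tags seen at most `max_count` times, skipping `stop_tags`, in sorted order."""
    counts = Counter(tags)
    return sorted(tag for tag, n in counts.items()
                  if n <= max_count and tag not in stop_tags)


example_tags = ['science', 'news', 'science', 'weather', 'politics', 'science', 'politics']
suggestions = rare_tags(example_tags, stop_tags=['news'])
```

Here 'news' and 'weather' are the only tags that occur once, and 'news' is filtered out by hand. How high to set `max_count` is a judgment call that depends on your topic.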

In [None]:
from random import choice

# this list should change based on the topic you choose. 
tags_dont_suggest = ['news','science'] 

# choose a random tag, skipping the ones we don't want to suggest
askTopic = choice([t for t in tags_for_links if t not in tags_dont_suggest])

In [None]:
askTopic

OK, so now let's put all these steps in a function.

In [None]:
def choose_tag(topic_data, tags_dont_suggest=[]):
    tags_for_links=[]
    candidate_topic_data = topic_data[:-1] # so you could do last x articles
    for link_info in candidate_topic_data:
        link_tags = [i['name'] for i in link_info['content']['tags'] if i['name'] not in tags_dont_suggest]
        tags_for_links.extend(link_tags)
    tag_chosen = choice(tags_for_links)
    return tag_chosen

In [None]:
print 'Choosing a tag except: ', tags_dont_suggest
choose_tag(topic_data,tags_dont_suggest)

In [None]:
choiceTag = choose_tag(topic_data, ['science','weather'])
print choiceTag

Ok we have chosen a tag, can we now choose an article about that tag ?? 

- Since we will be sending the user more than one link, we should keep track of the links we have already discussed with the user, in a variable called "seen_links" 
- We will do a super easy set operation here to compare if two lists have common items

In [None]:
a = ['my', 'cat', 'is','ok']
b = ['my','dog','is','a','good','boy']
c = ['learning','about','bots']
print set(a).intersection(set(b))

In [None]:
# number of common elements between two lists..
len(set(a).intersection(set(b)))

In [None]:
# at least one common element ??
len(set(a).intersection(set(b)))>0

In [None]:
len(set(a).intersection(set(c))) > 0
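
If all you need is the yes/no answer, sets also have `isdisjoint`, which reads a little more directly than taking the length of an intersection. The same three lists again:

```python
a = ['my', 'cat', 'is', 'ok']
b = ['my', 'dog', 'is', 'a', 'good', 'boy']
c = ['learning', 'about', 'bots']

# isdisjoint is True when the two collections share nothing,
# so "at least one common element" is its negation
a_and_b_overlap = not set(a).isdisjoint(b)
a_and_c_overlap = not set(a).isdisjoint(c)
```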

Now we will once again put all these steps in a  function. 

In [None]:
def return_article_about_tag(choice_tag, topic_data, seen_links=[]):
    for link_info in topic_data:
        if link_info['content']['url'] in seen_links:
            continue
        # is this a store article..
        link_tags = [i['name'] for i in link_info['content']['tags']]
        if len(set(link_tags).intersection(set(['digg-store','digg-editions'])))>0:
            continue
        if choice_tag in link_tags:
            return link_info, True
    # all the links for this tag have been picked, so picking something else.
    return choice(topic_data), False  

In [None]:
tags_dont_suggest = ['digg-store','science','digg-picks']
choice_tag  = choose_tag(topic_data, tags_dont_suggest)
print 'Tag chosen: ', choice_tag
choice_article, requestedTagMatch = return_article_about_tag(choice_tag, topic_data)
print 'Trying to choose article about: ', choice_tag
if requestedTagMatch:
    print ' .. [ok]'
else:
    print ' .. [Fail]'
print choice_article['content']['title']
print choice_article['content']['url']

### 5. Helpers for conversation boost

Let's try to do three very basic things that should be part of your bot:

1. spot abusive words, like *"dammit"*
2. spot greetings, like *"hi", "hello"*
3. random phrases that users *often* say
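
Whatever word lists you end up with, the spotting step itself can be a small set check on the words of the message. A minimal sketch, where `words_in` and `mentions_any` are hypothetical helpers of my own; it only handles single-word entries, not multi-word phrases:

```python
import re


def words_in(message):
    """Lowercase a message and split it into a set of words, dropping punctuation."""
    return set(re.findall(r"[a-z']+", message.lower()))


def mentions_any(message, vocabulary):
    """True if any word of `message` appears in `vocabulary` (an iterable of words)."""
    lowered = set(word.lower() for word in vocabulary)
    return not words_in(message).isdisjoint(lowered)
```

Matching on whole words matters: `mentions_any('this is fine', ['hi', 'hello'])` is False even though "this" contains the letters "hi".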


In [None]:
# abusive words - don't let users be abusive
swear_word_url= 'https://raw.githubusercontent.com/gunthercox/ChatterBot/c3431ced70b88eb08fb36534f18ed62c90ab9723/chatterbot/corpus/data/english/swear_words.csv' 
x=requests.get(swear_word_url)
swear_words = x.text.split(',')[:-1]
# ^^ this list is by no means comprehensive.. but there's a lot of such lists on the web. 

In [None]:
# some random convo phrases
convourl = 'https://raw.githubusercontent.com/gunthercox/ChatterBot/c3431ced70b88eb08fb36534f18ed62c90ab9723/chatterbot/corpus/data/english/conversations.corpus.json'
convph = requests.get(convourl).json()

In [None]:
print json.dumps(convph, indent=2)

In [None]:
# try to spot greetings..
greeturl = 'https://raw.githubusercontent.com/gunthercox/ChatterBot/c3431ced70b88eb08fb36534f18ed62c90ab9723/chatterbot/corpus/data/english/greetings.corpus.json'
greetings = requests.get(greeturl).json()

In [None]:
greetings

### 6. Help from  Eliza

There are so many times our bot won't understand the user. Can we ask Eliza to speak for us? 
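
The full listing below is long, but the core trick is tiny: match the statement against a pattern, flip the pronouns in the captured text, and drop it into a canned reply. Here is a one-rule sketch of that mechanism; every `mini_` name is mine and is not used by the real code that follows:

```python
import re

# a two-entry reflection table and a single rule, just enough to show the mechanism
mini_reflections = {'my': 'your', 'i': 'you'}
mini_pattern = r'I need (.*)'
mini_template = 'Why do you need {0}?'


def mini_eliza(statement):
    """Answer statements of the form 'I need ...'; return None for anything else."""
    found = re.match(mini_pattern, statement.rstrip('.!'), re.IGNORECASE)
    if not found:
        return None
    # swap first-person words for second-person ones in the captured text
    words = [mini_reflections.get(w, w) for w in found.group(1).lower().split()]
    return mini_template.format(' '.join(words))
```

So `mini_eliza('I need my coffee.')` comes back as `'Why do you need your coffee?'`. The real thing just has many more patterns, several replies per pattern, and a catch-all at the end.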

In [None]:
# help from eliza
from re import match, IGNORECASE
from random import choice
 
reflections = {
    "am": "are",
    "was": "were",
    "i": "you",
    "i'd": "you would",
    "i've": "you have",
    "i'll": "you will",
    "my": "your",
    "are": "am",
    "you've": "I have",
    "you'll": "I will",
    "your": "my",
    "yours": "mine",
    "you": "me",
    "me": "you"
}
 
actions = [
    [r'I need (.*)',
     ["Why do you need {0}?",
      "Would it really help you to get {0}?",
      "Are you sure you need {0}?"]],
 
    [r'Why don\'?t you ([^\?]*)\??',
     ["Do you really think I don't {0}?",
      "Perhaps eventually I will {0}.",
      "Do you really want me to {0}?"]],
 
    [r'Why can\'?t I ([^\?]*)\??',
     ["Do you think you should be able to {0}?",
      "If you could {0}, what would you do?",
      "I don't know -- why can't you {0}?",
      "Have you really tried?"]],
 
    [r'I can\'?t (.*)',
     ["How do you know you can't {0}?",
      "Perhaps you could {0} if you tried.",
      "What would it take for you to {0}?"]],
 
    [r'I am (.*)',
     ["Did you come to me because you are {0}?",
      "How long have you been {0}?",
      "How do you feel about being {0}?"]],
 
    [r'I\'?m (.*)',
     ["How does being {0} make you feel?",
      "Do you enjoy being {0}?",
      "Why do you tell me you're {0}?",
      "Why do you think you're {0}?"]],
 
    [r'Are you ([^\?]*)\??',
     ["Why does it matter whether I am {0}?",
      "Would you prefer it if I were not {0}?",
      "Perhaps you believe I am {0}.",
      "I may be {0} -- what do you think?"]],
 
    [r'What (.*)',
     ["Why do you ask?",
      "How would an answer to that help you?",
      "What do you think?"]],
 
    [r'How (.*)',
     ["How do you suppose?",
      "Perhaps you can answer your own question.",
      "What is it you're really asking?"]],
 
    [r'Because (.*)',
     ["Is that the real reason?",
      "What other reasons come to mind?",
      "Does that reason apply to anything else?",
      "If {0}, what else must be true?"]],
 
    [r'(.*) sorry (.*)',
     ["There are many times when no apology is needed.",
      "What feelings do you have when you apologize?"]],
 
    [r'Hello(.*)',
     ["Hello... I'm glad you could drop by today.",
      "Hi there... how are you today?",
      "Hello, how are you feeling today?"]],
 
    [r'I think (.*)',
     ["Do you doubt {0}?",
      "Do you really think so?",
      "But you're not sure {0}?"]],
 
    [r'(.*) friend (.*)',
     ["Tell me more about your friends.",
      "When you think of a friend, what comes to mind?",
      "Why don't you tell me about a childhood friend?"]],
 
    [r'Yes',
     ["You seem quite sure.",
      "OK, but can you elaborate a bit?"]],
 
    [r'(.*) computer(.*)',
     ["Are you really talking about me?",
      "Does it seem strange to talk to a computer?",
      "How do computers make you feel?",
      "Do you feel threatened by computers?"]],
 
    [r'Is it (.*)',
     ["Do you think it is {0}?",
      "Perhaps it's {0} -- what do you think?",
      "If it were {0}, what would you do?",
      "It could well be that {0}."]],
 
    [r'It is (.*)',
     ["You seem very certain.",
      "If I told you that it probably isn't {0}, what would you feel?"]],
 
    [r'Can you ([^\?]*)\??',
     ["What makes you think I can't {0}?",
      "If I could {0}, then what?",
      "Why do you ask if I can {0}?"]],
 
    [r'Can I ([^\?]*)\??',
     ["Perhaps you don't want to {0}.",
      "Do you want to be able to {0}?",
      "If you could {0}, would you?"]],
 
    [r'You are (.*)',
     ["Why do you think I am {0}?",
      "Does it please you to think that I'm {0}?",
      "Perhaps you would like me to be {0}.",
      "Perhaps you're really talking about yourself?"]],
 
    [r'You\'?re (.*)',
     ["Why do you say I am {0}?",
      "Why do you think I am {0}?",
      "Are we talking about you, or me?"]],
 
    [r'I don\'?t (.*)',
     ["Don't you really {0}?",
      "Why don't you {0}?",
      "Do you want to {0}?"]],
 
    [r'I feel (.*)',
     ["Good, tell me more about these feelings.",
      "Do you often feel {0}?",
      "When do you usually feel {0}?",
      "When you feel {0}, what do you do?"]],
 
    [r'I have (.*)',
     ["Why do you tell me that you've {0}?",
      "Have you really {0}?",
      "Now that you have {0}, what will you do next?"]],
 
    [r'I would (.*)',
     ["Could you explain why you would {0}?",
      "Why would you {0}?",
      "Who else knows that you would {0}?"]],
 
    [r'Is there (.*)',
     ["Do you think there is {0}?",
      "It's likely that there is {0}.",
      "Would you like there to be {0}?"]],
 
    [r'My (.*)',
     ["I see, your {0}.",
      "Why do you say that your {0}?",
      "When your {0}, how do you feel?"]],
 
    [r'You (.*)',
     ["We should be discussing you, not me.",
      "Why do you say that about me?",
      "Why do you care whether I {0}?"]],
 
    [r'Why (.*)',
     ["Why don't you tell me the reason why {0}?",
      "Why do you think {0}?"]],
 
    [r'I want (.*)',
     ["What would it mean to you if you got {0}?",
      "Why do you want {0}?",
      "What would you do if you got {0}?",
      "If you got {0}, then what would you do?"]],
 
    [r'(.*) mother(.*)',
     ["Tell me more about your mother.",
      "What was your relationship with your mother like?",
      "How do you feel about your mother?",
      "How does this relate to your feelings today?",
      "Good family relations are important."]],
 
    [r'(.*) father(.*)',
     ["Tell me more about your father.",
      "How did your father make you feel?",
      "How do you feel about your father?",
      "Does your relationship with your father relate to your feelings today?",
      "Do you have trouble showing affection with your family?"]],
 
    [r'(.*) child(.*)',
     ["Did you have close friends as a child?",
      "What is your favorite childhood memory?",
      "Do you remember any dreams or nightmares from childhood?",
      "Did the other children sometimes tease you?",
      "How do you think your childhood experiences relate to your feelings today?"]],
 
    [r'(.*)\?',
     ["Why do you ask that?",
      "Please consider whether you can answer your own question.",
      "Perhaps the answer lies within yourself?",
      "Why don't you tell me?"]],
 
    [r'quit',
     ["Thank you for talking with me.",
      "Good-bye.",
      "Thank you, that will be $150.  Have a good day!"]],
 
    [r'(.*)',
     ["Please tell me more.",
      "Let's change focus a bit... Tell me about your family.",
      "Can you elaborate on that?",
      "Why do you say that {0}?",
      "I see.",
      "Very interesting.",
      "{0}.",
      "I see.  And what does that tell you?",
      "How does that make you feel?",
      "How do you feel when you say that?"]]
]
 
def el_reflect(fragment):
    
    # Turn a string into a series of words
    tokens = fragment.lower().split()
    
    # for each word...
    for i in range(len(tokens)):
        token = tokens[i]
    
        # see if the word is in the "reflections" list and if it
        # is, replace it with its reflection (you -> me, say)
        if token in reflections:
            tokens[i] = reflections[token]
            
    return ' '.join(tokens)
 
def el_respond(statement):
    
    # run through all the actions
    for j in range(len(actions)):
    
        # for each one, see if it matches the statement that was typed
        pattern = actions[j][0] 
        responses = actions[j][1]
        found = match(pattern, statement.rstrip(".!"),IGNORECASE)
        
        if found:
        
            # for the first match, select a response at random and insert
            # the text from the statement into ELIZA's response
            response = choice(responses)
            return response.format(*[el_reflect(g) for g in found.groups()])

def eliza_help_me(msg):
    response = el_respond(msg)
    return response
 

In [None]:
eliza_help_me('you dont know much')

### 7. BeatBot logic flow

Remember the computable elements of conversation from the previous class? Let's try one of the basic versions here, maybe some kind of very basic memory. Note that a frame is one story, completed in delivery and back-and-forth conversation. So if you have 5 stories, you might have 5 frames, and the length of each frame could vary. 

What should the bot remember in memory?
1. Remember if user was abusing me :( 
2. Remember how many times we finished a frame :) 
3. Remember user's last msg 
4. Remember the tag/topic user has chosen to talk about, but frame isn't complete.  
5. Remember how many times I failed to understand user, and my confusion level :/
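
One practical wrinkle: if the memory is a dictionary holding a list, every conversation needs its own copy, otherwise what the bot remembered about the last user leaks into the next one. A minimal sketch with a hypothetical `new_memory` factory and a cut-down template (the key names are only illustrative):

```python
import copy

# hypothetical cut-down template, holding a counter or two and a list
MEMORY_TEMPLATE = {'framesTalked': 0, 'abuseWords': 0, 'sentLinks': []}


def new_memory():
    """Return an independent copy of the template for one conversation."""
    return copy.deepcopy(MEMORY_TEMPLATE)


first = new_memory()
first['sentLinks'].append('http://example.com/story')
first['framesTalked'] += 1
second = new_memory()   # a later conversation starts clean
```

A plain assignment (`memory = MEMORY_TEMPLATE`) would not do this: both names would point at the same dictionary and the same list.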

In [None]:
initial_memory ={'choiceTag':'', #a tag user has chosen to discuss
                 'framesTalked':0, #frames we discussed
                 'abuseWords':0,  #abuse words encountered
                 'completedFrame':True, #I haven't begun a new frame yet
                 'sentLinks':[], #the frames we have talked about
                 'confusedLevel':0  #how many times I didnt understand 
                }

In [None]:
# this needs to be run every time you choose a new topic to talk about. 
def generate_frame(memory):
    sent = ''
    # use the tag stored in memory, not a global left over from an earlier cell
    choice_article, tFind = return_article_about_tag(memory['choiceTag'], topic_data, memory['sentLinks'])
    if tFind is False:
        tags= [i['name'] for i in choice_article['content']['tags']]
        if tags:
            # no unseen article for that tag, so tell the user which tag we switched to
            sent += 'Actually, I found you a better new story from '+tags[0]+'. '
            memory['choiceTag'] = tags[0]
    sent += 'Here is a story from '+choice_article['content']['domain_name']+'. '
    authors = choice_article['content']['author']
    if authors and len(authors)>3:
        sent += authors+ ' is reporting that '
    else:
        sent += 'There are reports that '
    title = choice_article['content']['title'] 
    title = title.replace('<mark>','').replace('</mark>','')  # those yellow markers
    sent+= title
    memory['sentLinks'].append(choice_article['content']['url'])
    return sent


def respond(msg, prev_msg, memory):
    some_greetings = [x[-1].lower() for x in greetings['greetings'][:10]]
    some_greetings+=['hey','ok','sure', 'yes', 'go on','well','yes','yeah','cool','now']
    skip = ['no','next','not really','pass','nope']
    contn = ['']
    #print 'Current Statement:', msg
    status = 'undefined'
    if msg in some_greetings or msg in skip:
        #print 'Prev statement: ', prev_msg
        exit=False
        if prev_msg in some_greetings:
            if msg in skip:
                # maybe pass or exit
                status = 'exit'
                response = 'you dont like my stories :( bye'
                memory['completedFrame']=True
            else:
                # tell story... 
                if memory['framesTalked']==0:
                    # first story
                    status = 'story'
                    memory['framesTalked']+=1
                    response = generate_frame(memory)
                    memory['completedFrame']=True
                else:
                    if memory['completedFrame'] == True:
                        bothappy = ['great..','coool..']
                        choiceTag = choose_tag(topic_data, tags_dont_suggest)
                        response = choice(bothappy)+'want another '+topic+' story, this time related to '+ choiceTag
                        memory['choiceTag'] = choiceTag
                        memory['completedFrame']=False
                        status = 'waiting'
                    else:
                        status = 'story'
                        memory['framesTalked']+=1
                        response = generate_frame(memory)
                        memory['completedFrame']=True
        else:
            if memory['confusedLevel']==2:
                response = 'Ok i quit'
                status='exit'
            else:
                # only offer a story if we are not quitting, so the quit message is not overwritten
                choiceTag = choose_tag(topic_data, tags_dont_suggest)
                response = 'I have a '+topic+' story for you related to '+ choiceTag
                memory['choiceTag'] = choiceTag
                memory['completedFrame']=False
                memory['confusedLevel']+=1
    elif msg in ['bye', 'quit','stop']:
        status='exit'
        response = 'bye'
    elif msg in swear_words:
        status = 'abusing'
        response= 'no need for that tone'
        memory['abuseWords']+=1
        if memory['abuseWords']>1:
            response = 'here is a bottle of mouthwash. bye for now.'
            status = 'exit'
    else:
        status = 'confused'
        response = 'Eliza said.. '+eliza_help_me(msg)
        memory['confusedLevel']+=1
        if memory['confusedLevel']==4:
            status = 'exit'
            response = 'sorry im just a dumb bot. bye'
    return status, response

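import copy

def botbeat():
    # a friendly welcome
    # start from a fresh copy, so a previous conversation doesn't leak into this one
    memory = copy.deepcopy(initial_memory)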
    print choice(greetings['greetings'][0])
    print "Interested to learn about whats happening in "+topic+"?"
    prev_statement= ''
    # talk forever...
    while True:
        # collect a statement and respond, stop the conversation on 'quit'
        statement = raw_input("> ")
        #print statement
        status, response = respond(statement, prev_statement, memory)
        print response
        #print memory
        prev_statement = statement
        if status=='exit':
            break
        

### 8. Let's test it

In [None]:
topic = 'reddit'
tags_dont_suggest = ['digg-store','digg-picks','finance'] # change this based on your topic
topic_data= get_topic_data(topic)

In [None]:
botbeat()

This then is the nature of **hybrid bots**, which are a mix of scripted parts and automated parts. The more we know about the metadata, the more accurately we can fine-tune the scripts. 

<hr>

### 9.  Refinements:

Things to aim for: 
- In conversation.. you are telling your story. [Conversation design is art](http://www.gamasutra.com/view/feature/3719/defining_dialogue_systems.php). From bots like Eliza to games like Mass Effect - great conversations live and die by the strength of their dialogue. 

- There are some basic conversation design methods: (1) non-branching dialogue, (2) branching dialogue, (3) hub-and-spoke dialogue, (4) parser-driven dialogue, like Eliza, etc. But [there isn't a grand unified theory](https://en.wikipedia.org/wiki/Interactive_storytelling#Strategies). 

- Here's another analogy from video game script writing about dialogues: 
    - the game script is always written **after** the gameplay.. unlike what usually happens in movies where the significant part of the script is written before the movie goes into production. Based on your data + technology, there are only certain abilities your bot will have. You must adapt your script to that. 
    
- Explicit handoff: Does your user need to know when you are using data vs. a scripted response?
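
To make the branching and hub-and-spoke designs mentioned above a bit more concrete, here is a minimal sketch of a dialogue stored as a table of nodes. The prompts, the node names and the helpers `step` and `walk` are all invented for illustration; a real bot would fill the prompts from data:

```python
# hypothetical dialogue table: each node has a prompt and the replies it accepts
dialogue = {
    'hub': {'prompt': 'Want a story about space or politics?',
            'next': {'space': 'space', 'politics': 'politics', 'bye': 'end'}},
    'space': {'prompt': 'Here is a space story. Back to the menu?',
              'next': {'yes': 'hub', 'no': 'end'}},
    'politics': {'prompt': 'Here is a politics story. Back to the menu?',
                 'next': {'yes': 'hub', 'no': 'end'}},
    'end': {'prompt': 'Bye for now.', 'next': {}},
}


def step(node, reply):
    """Return the node that follows `node` for `reply`; stay put if the reply is unknown."""
    return dialogue[node]['next'].get(reply.strip().lower(), node)


def walk(replies, start='hub'):
    """Feed a scripted list of replies through the table and collect the prompts shown."""
    node, shown = start, [dialogue[start]['prompt']]
    for reply in replies:
        node = step(node, reply)
        shown.append(dialogue[node]['prompt'])
    return node, shown
```

The 'hub' node is the menu every spoke returns to. Feeding `walk` a scripted list of replies is also a cheap way to test a conversation without typing it in by hand.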


Need inspiration about what topics might be interesting for users to learn about? Here are some [options](https://qz.com/about/our-current-obsessions-2/). Choose your query words wisely.

<hr>

*How will you leverage the abilities of your bot to educate a user about some topic she/he is not following closely every day or is not an expert in?*