Some help with your bots
------

![a](https://iprodev.com/wp-content/uploads/fraud-bot-home.png)

This is a slow-build on a couple of simple bots to give you a template for how you might approach your own projects. Remember Suman's excellent framework for making conversations "computable." He asked us to...

>Spend some time to think *what constitutes as **identifiable elements** of "conversation"*. And then... examine if any of these can be modeled computationally.
<br><br>
In the ideal sense, the word "conversation" possesses some sort of intimacy. Intimacy is shared remembered experiences. But apart from the esoteric nature of signals that signify an intimate conversation, theoretically, there are several discernible elements of conversation that we can try to compute. 

He then lists some elements of a conversation that we might try to emulate with our bots.

| Element of Conversation | Possible Techniques to Compute/ Quantify |
| ------ | ----------- |
|1. Notifications/ Recalling relevant things   |  Time Series Analysis, Alerting, Keyword caches |
|2. Learning topics in context | Topic Mining/Modeling - extract the topic from the words in text |
|3. Understanding Social Networks (offline and online)  | Network Science, the study of the structure of how things are connected and how information flows through it |
|4. Responding to Emotion  | Sentiment Analysis |
|5. Having Episodic Memory  | Some kind of graphical model, [see Aditi's data post](https://medium.com/@aditinair/episodic-memory-modeling-for-conversational-agents-7c82e25b06b4#.9k65cziqw). |
|6. Portraying Personality  | Decision Tree, which is a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. |

As a practical matter, let's try to build up some of these components in code. 

**Notifications.** We have already seen examples of bots in class that take action when a product recall is reported to the Canadian government, or when someone important posts a new tweet on Twitter. In some cases, this kind of alerting is possible through a *realtime API* -- rather than just serve up a single data set, a realtime API creates a "feed" that sends us data until we stop listening. Below we describe how to [use Tweepy to monitor Twitter in realtime](http://docs.tweepy.org/en/v3.5.0/streaming_how_to.html). **You can skip this on a first reading if you don't need it for your bot.** 

As with the other Twitter APIs, let's set up our authentication. Remember your (**your**) keys from [app.twitter.com](app.twitter.com). 

In [None]:
from  tweepy import OAuthHandler, Stream, StreamListener

consumer_key = "Urq5NyCqyjxiGF2gLoXg7o3UZ"
consumer_secret = "KKiNtI8403O6R7MXUowWfM2mGB71eLJX2jeIMsgjGQ5SJrMaDl"
access_token = "20743-PbvM6FZjT2LoDSKTfAUpWwSwLKwPrXj25VVyIe5s3mya"
access_secret = "FdqcOey0FdwIhFhTyIuCJOFXwjFOX1EIDHG5vojPq3W51"

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

Twitter's *streaming API* "filters" tweets for criteria we are interested in, sending them through when there is a match. For example, we might look for tweets from a list of users, or for tweets that contain any word in a list of terms we provide. In short, with the streaming API, we see Twitter as a live data source. This is how we would trigger events based on actions in Twitter.

To get there, we need to create a "listener". This is done by making our own type of listener object. Yes, **our own object type.** We will say more about this later in the term, but the best way to make new types of objects in Python is to "extend" existing ones. Below, we take a Tweepy object type, a `StreamListener`, and create a new kind of object called `my_listener`. We could call this type anything (`bob` or `emily` or `lightbulb`) but it's always good to name things descriptively -- it makes your code more readable. 

Objects of type `StreamListener` come with some base methods that describe actions they take in response to the Twitter feed. For example, our object type "inherits" a method called `on_status()` that is called when a tweet (a status update) matches our criteria. By default, it doesn't do much. So, for `my_listener`, we are going to start by simply printing out the text of the tweet. Easy! 

Here is the incantation to do this. For this notebook, we are going to simply make this method `on_status()` more elaborate. Again, all we are doing is taking the method that comes with the objects of type `StreamListener` from Tweepy and replacing it with a method of our creation. 

In [None]:
class my_listener(StreamListener):
    
    def on_status(self, tweet):
        print(tweet.text)
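
If "extending" an object type is new to you, here is a minimal sketch of the same idea without Tweepy. The names `BaseListener` and `PrintingListener` are made up for illustration and are not part of Tweepy; the only point is that a new type inherits every method of the old one and can replace the ones it cares about.

```python
class BaseListener(object):
    """A stand-in for a library class (hypothetical, not Tweepy's)."""

    def on_status(self, text):
        # the default behavior: do nothing with the message
        return None

    def on_error(self, code):
        # another inherited method that we leave untouched
        return "error {0}".format(code)


class PrintingListener(BaseListener):
    """Our own type: same as BaseListener, but on_status() is replaced."""

    def on_status(self, text):
        # our version overrides the inherited one
        print(text)
        return text


listener = PrintingListener()
listener.on_status("hello from the feed")   # uses our new method
listener.on_error(420)                      # still uses the inherited method
```

Anything we do not redefine, like `on_error()` here, keeps working exactly as it did in the original type.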

Below, we create an object of type `my_listener`...

In [None]:
listener = my_listener()
type(listener)

... and then create a call to Twitter's Stream API. Instead of using the function `API()` as we did in previous notebooks, we are going to use the function `Stream()` -- it also takes our authentication keys and also a listener. We then `.filter()` the stream to look for references to `"#MAGA"`. We could put any number of terms in the list along with `#MAGA`.

When you execute this cell, you will be tossed into a loop that is constantly filtering the stream of tweets, searching for our criteria. You stop it by interrupting the kernel (`Kernel -> Interrupt` using the buttons at the top of the Jupyter interface). 

In [None]:
stream = Stream(auth,listener)
stream.filter(track=["#MAGA"])

That's cool and all, but now let's take a different action. TextBlob implements (how?) so-called *sentiment analysis* that evaluates the *polarity* of a statement on a scale from -1 to 1, with -1 being a negative statement, and 1 being positive. You can read about it briefly in [the quickstart manual](http://textblob.readthedocs.io/en/dev/quickstart.html). So let's now look just for positive comments (I mean, there's only so much negativity one body can take) and print them out.

In [None]:
from textblob import TextBlob

class my_listener(StreamListener):
    
    def on_status(self, tweet):
        
        blob = TextBlob(tweet.text)
        sentiment = blob.sentiment
        
        # set the sentiment threshold to 0.5 -- super happy!
        if sentiment.polarity > 0.5:
            print sentiment.polarity, tweet.user.screen_name, tweet.text

Running the code below dips into the stream and pulls out tweets that refer to Trump, printing out the positive ones. Or what the sentiment analyzer thinks is positive. What do you think?

In [None]:
listener = my_listener()

stream = Stream(auth,listener)
stream.filter(track=["Trump"])

A realtime API is a gift -- it gives us a stream of actions for free. Without it, we have to monitor things ourselves (or use an off-the-shelf service). For example, suppose we want to "watch" the temperature near Columbia as reported by the `forecast.weather.gov` web site. We'll copy code from our previous notebook, with a slight change.

Basically, we repeat the loop from last class, but each time we request the weather page from the web site, we compare the new temperature (stored in `current_temperature`) with the temperature we saw on the previous call (stored in `previous_temperature`). If the two are different, we print out the new temperature and update `previous_temperature` to be this new value. 

The code is below. Run it. You should see one temperature print out and then the process go to sleep for 5 minutes. Yeah, that's a lot of waiting. If you want to see how it's working, make the sleep smaller and print out every `current_temperature` you get. 

In [None]:
from requests import get
from bs4 import BeautifulSoup
from time import sleep

url = "http://forecast.weather.gov/MapClick.php?lat=40.811700869100946&lon=-73.95285802527684#.VrzaH5MrIfw"    
head_data = {'From': 'markh@columbia.edu'}

# Set up two variables, one for the current temperature
# and one for the previous temperature, and set them both to
# some initial value (they will be updated quickly)
previous_temperature = 0
current_temperature = 0

while True:
    
    # grab the current temperature
    response = get(url,headers=head_data)    
    page = BeautifulSoup(response.text, "html.parser")
    current_temperature = int(page.find("p","myforecast-current-lrg").get_text().encode("ascii","ignore").replace("F",""))

    # compare it to the previous temperature we saw
    if current_temperature != previous_temperature:
        
        # if the temperature has changed, print out the new temp
        # and set the previous_temperature to the new_temperature
        print current_temperature
        previous_temperature = current_temperature
    
    # wait a bit before you check again (5 minutes)
    sleep(300)
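
The heart of that loop is the compare-and-remember step. Here is a small sketch of just that logic, run on a made-up list of readings instead of the live web page so there is no waiting. The function name `report_changes()` and the sample temperatures are invented for this illustration.

```python
def report_changes(readings):
    """Return only the readings that differ from the one before."""
    changes = []
    previous = None          # nothing seen yet

    for current in readings:
        # compare the new reading to the last one we saw
        if current != previous:
            changes.append(current)
            previous = current   # remember it for next time

    return changes


# pretend these came from six visits to the weather page
sample = [41, 41, 42, 42, 42, 40]
print(report_changes(sample))
```

Starting `previous` at `None` rather than `0` is a small design choice: a first reading of exactly 0 degrees still counts as a change and gets reported.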

**Schooling our bot: Recognizing key phrases.** The ELIZA bot created a sense of intimacy by sharing language with the respondent. Now that we are masters of regular expressions, we can make this sharing as elaborate as we want. 

Let's begin at the beginning. The greeting. How will your bot introduce itself? With ELIZA, the conversation starts with the bot addressing us. But if we used another platform like Slack or Twitter, say, the bot would just be waiting for people to address it. How will it recognize a greeting? How will it respond?

Here are two ways to break down text for a greeting. The first **defines a pattern using regular expressions,** and the second **looks for specific greeting words.** Each function below is just 3 lines of code, but we added a lot of comments.

Oh and the regular expression version uses a new function and a new piece of data. The function `findall()` in the `re` library looks for all the matches of a pattern in a string and returns them in a list (someone might address the bot with "Hi. Hello.", say, and we could have two greeting terms). The `IGNORECASE` attribute affects how matches are made, in this case asking the regular expression to ignore the case of the letters. "A" and "a" both match the literal "a". 

OK read on!

In [None]:
# 1. Using regular expressions to find patterns

from re import findall,IGNORECASE

def find_greeting(statement):

    # define a regular expression for the greeting. toss this into
    # regexper.com and see what it means! change it! make it your own!
    greeting_pattern = r"\b(hello|hi|greetings|sup|what's up|howdy)\b"

    # look for where we match the pattern in the statement
    matches = findall(greeting_pattern,statement,IGNORECASE)
    
    # return a list of matches, which is empty if nothing matches.
    return matches

In [None]:
print "Matches:", find_greeting("HI")
print "Matches:", find_greeting("Hello. Hi. Howdy. What's up?")
print "Matches:", find_greeting("I'm bored.")
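
The `\b` markers in the pattern are doing real work. A quick sketch, using the same `findall()` and `IGNORECASE`, compares a shortened pattern with and without them on sentences made up for the test. Without the word boundaries, the `hi` hiding inside "This" and "history" counts as a greeting.

```python
from re import findall, IGNORECASE

with_boundaries = r"\b(hello|hi|howdy)\b"
without_boundaries = r"(hello|hi|howdy)"

statement = "This is ancient history."

# \b insists that the match starts and ends at the edge of a word
print(findall(with_boundaries, statement, IGNORECASE))

# no \b, so "hi" is found inside "This" and "history"
print(findall(without_boundaries, statement, IGNORECASE))

# a real greeting still gets through, whatever its case
print(findall(with_boundaries, "HI there. Howdy!", IGNORECASE))
```

The first call returns an empty list, the second returns two false alarms, and the third returns the two greetings exactly as the user typed them.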

In [None]:
# 2. Looking at words and using TextBlob

from textblob import TextBlob

def find_greeting(statement):
    
    # turn statement into a textblob object -- 
    # this gives us goodies like words and sentences and parts of speech
    # here we just use it to separate out the individual words
    blob = TextBlob(statement)

    # these words mean someone is saying "hi"
    greeting_words = ["hello", "hi", "greetings", "sup", "howdy"]

    # use a list comprehension to find all the matching greetings 
    # and return the final list -- it's empty if there are no matches
    matches = [w for w in blob.words if w.lower() in greeting_words]
    return matches

In [None]:
print "Matches:", find_greeting("HI")
print "Matches:", find_greeting("Hello. Hi. Howdy. What's up?")
print "Matches:", find_greeting("I'm bored.")

So, the bot recognizes that someone has said hello. Now what? Both ELIZA and Suman's bot used an element of randomness when crafting their responses. Let's add a little more code, this time dividing our bot's action into two stages -- **finding a pattern in text** and then **responding to that pattern**. 

The functions that do all the work are named `find_greeting()` and `respond()`. We have moved the call to TextBlob to `respond()` and now our two versions of the `find_greeting()` function take a TextBlob object as input and not a string -- we convert the string in `respond()` instead. 

Below is the version where we rely on a list of greeting terms to know that someone has addressed our bot. We introduce two lists of responses as well. If we can find a greeting term in what someone has said to the bot, we respond with a greeting from `greeting_responses`. Otherwise, we select from a list of `confused_responses`. Both selections use the `choice()` command from the `random` module -- it performs the computer equivalent of placing the responses in a hat and selecting one at random.

Read through the `respond()` function. It's the only new part of the code below. Make sure you understand what it's doing. Oh and remember that (in both versions) `find_greeting()` returns a list of matching greeting words. If the list is empty, it means that there are no greeting terms in what the user has typed. Empty lists evaluate to `False` in the `if-else` statement in `respond()`, triggering the bot to select one of the `confused_responses`.

In [None]:
from random import choice
from textblob import TextBlob

def find_greeting(blob):
    
    # these words mean someone is saying "hi"
    greeting_words = ["hello", "hi", "greetings", "sup", "howdy"]

    # use a list comprehension to find all the matching greetings 
    matches = [w for w in blob.words if w.lower() in greeting_words]
    return matches

def respond(statement):
    
    # turn the string into a textblob object
    blob = TextBlob(statement)

    if find_greeting(blob):
        
        greeting_responses = ["Hey", "What's up?", "How's it going?","Look who it is!","What a surprise!","Welcome!"]
        return(choice(greeting_responses))
    else:
        confused_responses = ["I didn't understand.", "Sorry, what do you mean?"]
        return(choice(confused_responses))

In [None]:
print respond("HI")
print respond("Hello. Hi. Howdy. What's up?")
print respond("I'm bored.")
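
The trick that `respond()` leans on is that Python treats an empty list as `False` and a non-empty list as `True`. A tiny sketch, with a hypothetical helper `describe()` that is not part of the bot:

```python
def describe(matches):
    # an empty list counts as False, anything else counts as True
    if matches:
        return "found a greeting"
    else:
        return "no greeting"


print(describe(["Hi", "Howdy"]))
print(describe([]))

# bool() shows the conversion that the if statement does behind the scenes
print(bool([]))
print(bool(["Hi"]))
```

This is why `respond()` can write `if find_greeting(blob):` without checking the length of the list.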

Here are the functions again, this time using a regular expression to find the greeting. Notice that we still involve TextBlob here, only through the `.raw` attribute of this object type. It stores the string that we called TextBlob on. Have a look. 

Again, make sure you read this code and understand what it's doing. Compare it to the version that uses a list of keywords, above.

In [None]:
from re import findall, IGNORECASE
from textblob import TextBlob
from random import choice

def find_greeting(blob):

    # define a regular expression for the greeting
    greeting_pattern = r"\b(hello|hi|greetings|sup|what's up|howdy)\b"

    # look for where we match the pattern in the statement
    matches = findall(greeting_pattern,blob.raw,IGNORECASE)
    
    return matches

def respond(statement):
    
    blob = TextBlob(statement)

    if find_greeting(blob):
        
        greeting_responses = ["Hey", "What's up?", "How's it going?","Look who it is!","What a surprise!","Welcome!"]
        return(choice(greeting_responses))
    else:
        confused_responses = ["I didn't understand.", "Sorry, what do you mean?"]
        return(choice(confused_responses))

In [None]:
print respond("HI")
print respond("Hello. Hi. Howdy. What's up?")
print respond("I'm bored.")

The reason we might want to keep TextBlob around in either set of functions is that it lets us perform a number of text processing actions. Recall, for example, that TextBlob will "parse" a sentence and tag each word with an estimate of its part of speech. The symbols assigned are explained in the [Part of Speech list](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). "NN", for example, is a singular noun. Here's how we'd find nouns in a statement submitted to the bot, returning them in a list.

Remember, the part of speech (POS) tags are stored in a list, an entry for each word. The entries themselves each have two elements, one for the word `pos[0]` below, and the second for the part of speech `pos[1]` below.

In [None]:
from textblob import TextBlob
    
def find_noun(blob):
    
    # search for "NN", which stands for a singular noun
    matches = [pos[0] for pos in blob.pos_tags if pos[1]=="NN"]
    return matches

In [None]:
blob = TextBlob("Can you tell me a story?")
print "Matches:", find_noun(blob)

blob = TextBlob("Do you know anything about climate change?")
print "Matches:", find_noun(blob)

blob = TextBlob("You're boring me.")
print "Matches:", find_noun(blob)
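
To make the structure of `pos_tags` concrete without running the tagger, here is a sketch on a hand-written list of (word, tag) pairs. The tags below were typed in by hand for illustration. They are not TextBlob output, and a real tagger may label some words differently.

```python
# a hand-made stand-in for blob.pos_tags: one (word, tag) pair per word
tags = [("Can", "MD"), ("you", "PRP"), ("tell", "VB"),
        ("me", "PRP"), ("a", "DT"), ("story", "NN")]

# pos[0] is the word, pos[1] is its part of speech
nouns = [pos[0] for pos in tags if pos[1] == "NN"]
pronouns = [pos[0] for pos in tags if pos[1] == "PRP"]

print(nouns)
print(pronouns)

# plural nouns are tagged "NNS", so testing the start of the tag
# catches "NN", "NNS", "NNP" and "NNPS" in one go
more_tags = [("stories", "NNS"), ("about", "IN"), ("Columbia", "NNP")]
all_nouns = [word for word, tag in more_tags if tag.startswith("NN")]
print(all_nouns)
```

Testing for exactly `"NN"` only finds singular common nouns, so plurals and proper names slip past. Whether that matters depends on what you want your bot to react to.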

People use nouns or even the `.noun_phrases` part of the object to get a sense of what is being talked about. You could also pass the phrase to an entity extractor like [Reuters' OpenCalais](http://www.opencalais.com/). We'll do that shortly. 

Below, we will swap our `find_noun()` for a function finding any part of speech, `find_pos()`. Well, we will focus on adjectives, nouns and pronouns for the moment (for absolutely no good reason). We use a list comprehension in each case to pull out each part of speech we are after. This will give us three lists that we then pull into a dictionary where the keys describe the part of speech.

So we have a key `pronouns` holding a list of pronouns, a key `nouns` holding a list of nouns, and a key `adjectives` holding a list of adjectives. If there are no nouns, say, then the associated list is empty.

Our response uses a random `choice()` of the adjectives in the user's statement, cutting them into strings that I wrote. It's crude, but what do you think? You can do much better. Give it a try!

In [None]:
from random import choice
from textblob import TextBlob
from textblob.np_extractors import ConllExtractor
from textblob.taggers import NLTKTagger

def find_pos(blob):

    # search for "PRP" pronouns, "JJ" adjectives and "NN" nouns
    pronouns = [pos[0] for pos in blob.pos_tags if pos[1]=="PRP"]
    adjectives = [pos[0] for pos in blob.pos_tags if pos[1]=="JJ"]
    nouns = [pos[0] for pos in blob.pos_tags if pos[1] == "NN"]

    return {"pronouns":pronouns,"nouns":nouns,"adjectives":adjectives}

def respond(statement):
    
    # specify a different pos_tagger and a noun phrase extractor, just
    # to try them out. these get passed into TextBlob()
    extractor = ConllExtractor()
    nltk_tagger = NLTKTagger()
    blob = TextBlob(statement, pos_tagger=nltk_tagger, np_extractor=extractor)
    
    # make a dictionary of different parts of speech, one list for
    # pronouns, nouns and adjectives
    pos = find_pos(blob)

    if pos["adjectives"]:
                
        adj = choice(pos["adjectives"])
        adj_responses = ["Sometimes I feel {0}.".format(adj), 
                         "My friends say my middle name is '{0}.'".format(adj),
                         "{0}. {0}. {0}. It's my life.".format(adj.capitalize()),
                         "Bad bots are {0}.".format(adj)]
        
        return(choice(adj_responses))
    
    else:
        confused_responses = ["I didn't understand.", "Sorry, what do you mean?"]
        return(choice(confused_responses))

In [None]:
respond("I am tired of cold days.")

In [None]:
respond("When will it get warm?")

In [None]:
respond("I think this class is easy, don't you?")

In [None]:
respond("I am tall.")

There's a fair amount of repetition in the responses. We might want to stop the bot from using the same statement twice. For this to feel like a conversation, we would want to include "history" that tracks which statements the bot has used with which user. In the code above we imagine a single conversation at a time, taking place in this notebook. But if the bot lived on Twitter or Slack, many people might be addressing it at once. 
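
One way to sketch that kind of history is a dictionary keyed by user, holding the set of responses each person has already seen. The function `fresh_response()` and the user names below are invented for illustration; a bot on Slack or Twitter would use whatever user id the platform supplies.

```python
from random import choice

# user name -> set of responses that user has already been given
history = {}

def fresh_response(user, responses):
    """Pick a response this user has not seen; start over when all are used."""
    seen = history.setdefault(user, set())
    unused = [r for r in responses if r not in seen]

    if not unused:
        # every response has been used with this user, so reset
        seen.clear()
        unused = list(responses)

    pick = choice(unused)
    seen.add(pick)
    return pick


greetings = ["Hey", "What's up?", "Welcome!"]

# the first three replies to emily are all different
print([fresh_response("emily", greetings) for i in range(3)])

# bob has his own history, independent of emily's
print(fresh_response("bob", greetings))
```

This keeps the history in memory only, so it is forgotten when the notebook restarts. A bot that runs for days would need to save it somewhere.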

The code below takes all the functions so far and structures them through a `while` loop that keeps scanning for statements from users. We could toss this bot into a Slack channel or have it listening on Twitter. In the `mybot()` function at the bottom, we would store information about visitors and what has been said. 

This code uses our `find_greeting()` and our `find_pos()`, and `respond()` has a series of `if-elif-else` statements that track the conversation. Did someone greet the bot? Respond with a greeting. If not, did the person's statement include an adjective? If so, say something cheeky, reflecting their words. If not? Have the bot say it's confused. 

Make sure you can read the code. It is literally just all the pieces we wrote previously in this drill, reassembled. This is how code is often built. Try something, try something more, pull code together. Try something more. Pull that code in. Eventually, you'll see how to solve a problem. You might keep a lot of the code you've written or you might scrap it and start over. **You are using computing in an exploratory way.** Mmmmm.

In [None]:
from re import findall, IGNORECASE
from random import choice
from textblob import TextBlob
from textblob.np_extractors import ConllExtractor
from textblob.taggers import NLTKTagger

def find_greeting(blob):

    # define a regular expression for the greeting
    greeting_pattern = r"\b(hello|hi|greetings|sup|what's up|howdy)\b"

    # look for where we match the pattern in the statement
    matches = findall(greeting_pattern,blob.raw,IGNORECASE)
    
    return matches

def find_pos(blob):

    # search for "PRP" pronouns, "JJ" adjectives and "NN" nouns
    pronouns = [pos[0] for pos in blob.pos_tags if pos[1]=="PRP"]
    adjectives = [pos[0] for pos in blob.pos_tags if pos[1]=="JJ"]
    nouns = [pos[0] for pos in blob.pos_tags if pos[1] == "NN"]

    return {"pronouns":pronouns,"nouns":nouns,"adjectives":adjectives}

def respond(statement):
    
    extractor = ConllExtractor()
    nltk_tagger = NLTKTagger()
    
    blob = TextBlob(statement, pos_tagger=nltk_tagger, np_extractor=extractor)
    pos = find_pos(blob)

    # look for a greeting
    if find_greeting(blob):
        
        greeting_responses = ["Hey", "What's up?", "How's it going?","Look who it is!","What a surprise!","Welcome!"]
        return(choice(greeting_responses))
    
    # rules involving parts of speech
    elif pos["adjectives"]:
                
        adj = choice(pos["adjectives"])
        adj_responses = ["Sometimes I feel {0}.".format(adj), 
                         "My friends say my middle name is '{0}.'".format(adj),
                         "{0}. {0}. {0}. It's my life.".format(adj.capitalize()),
                         "Bad bots are {0}.".format(adj)]
        
        return(choice(adj_responses))
    
    # otherwise say you are confused
    # you could also call out to eliza as suman did, asking 
    # for a sentence that you would then return rather than the 
    # choice() from the list of confused responses
    else:
        confused_responses = ["I didn't understand.", "Sorry, what do you mean?"]
        return(choice(confused_responses))

def mybot():
    
    while True:
        
        # collect a statement and respond, stop the conversation on 'quit'
        
        statement = raw_input("> ")
        if statement == "quit":
            break

        print respond(statement)

Try it out! When you are bored, type `quit` or just hit the `Kernel->Interrupt` button at the top of the Jupyter notebook.

In [None]:
mybot()

**1. Use some part of the code above to create a response to a noun or noun phrase spoken by a user. Use TextBlob to find what you're after. You can either edit the `find_pos()` and `respond()` functions several cells up -- and access the bot with `respond("Tell me about cattle.")` -- or you can edit the cell with `mybot()` and extend the conversation we have going. In short, make a part of speech response involving nouns spoken by the user.**

In [None]:
# Put your code here


**Schooling our bot: Accessing data.** ELIZA was modeled after a certain kind of consultation and the goal was to keep conversation going. As Suman pointed out, bots have to maintain a certain level of engagement with a user, yes, but many are very goal-directed. They deliver particular information, perhaps using a conversational tone, but conversation for its own sake is not the goal.

(Not all bots address the public, either. They might be performing tasks on your behalf. You could, for example, provide a bot with a list of tweets to post and the times you want them posted. The bot can post your tweets and doesn't need a lot of chit-chat. You'd want the bot to "log" its actions, but it doesn't have to look like a conversation.)

Our bots, however, will try to be at least a little polite. Let's now make our bot smarter by accessing data from an API. Suman showed you how to use Digg to find stories. We introduce another API here to help you see how APIs work. (I prefer the Digg API, frankly, but we need a little work on APIs in general.)

We will use the [Congress API from ProPublica](https://propublica.github.io/congress-api-docs/). They require an API key like Twitter did, but in this case Derek Willis from ProPublica said we could all share mine. This is the only time you can share mine. Ha! Look at what the API offers. There is information on members, on their voting patterns, on bills, on nominations. It's a rich set of data, all beautifully organized. 

Have a look at the documentation. It is an easy API. You submit special URLs and they return JSON data. The API key is sent in the header of the request, just like I used `From:` in previous notebooks (now we don't need `From:` because ProPublica knows who owns the key we are using -- me). To get a list of members in congress we use this URL.

>`https://api.propublica.org/congress/v1/{congress}/{chamber}/members.json`

You substitute in `{congress}` and `{chamber}` to represent which congress number (we are in the 115th Congress now) and whether we want the House or the Senate. Below we will take all current senators. Recall that the JSON is parsed with a call to `.json()` in the `response` object, just like the HTML was stored in `.text`.

In [None]:
from requests import get

# define the url and put our keys in a dictionary -- propublica tells us
# to call the header X-API-Key.
url = "https://api.propublica.org/congress/v1/115/senate/members.json"
head_data = {'X-API-Key':'90FfOvJfCWmOF3aPabqr1Fw5flx42Pm8GpjN4GT8'}

# issue the request to their server for the url, incorporating the 
# header data
response = get(url,headers=head_data)    

# parse the json that's come back
data = response.json()
data

OK this is a lot of JSON. The top is the important bit that we've copied here. We see that the result of our query to the API is a dictionary. We have keys like `"copyright"` and `"results"`. This latter key holds our Senate listing. It is a list of results, which in this case is just one element long. It is a dictionary that describes the `"chamber"` and the `"congress"` number and, as we wanted, the `"members"`. This last key holds a list, with one element for each member of the Senate. The data on each member is stored in dictionaries with the same entries. There are Facebook and Twitter accounts and so on. 

>`{u'copyright': u' Copyright (c) 2017 Pro Publica Inc. All Rights Reserved.',
 u'results': [{u'chamber': u'Senate',
   u'congress': u'115',
   u'members': [{u'api_uri': u'https://api.propublica.org/congress/v1/members/A000360.json',
     u'domain': u'',
     u'dw_nominate': u'',
     u'facebook_account': u'senatorlamaralexander',
     u'facebook_id': u'89927603836',
     u'first_name': u'Lamar',
    ...`

So, if we give `response.json()` a name, say `data`, it looks like our members are in

>`data["results"][0]["members"]`

**2. Use `keys()`  on the dictionaries and  `len()` on the lists in `data` to verify that the information we're after is stored where we say it is.**

In [None]:
# your code here


So, we have a list of dictionaries representing the members of the current Senate. Each  element of the list can be thought of as a row in a table. One row per Senator. The list elements themselves are dictionaries which hold the same kinds of data for each member of the Senate. So we have rows and columns, but specified by a list of dictionaries. **Pandas to the rescue!**

Remember we have used `DataFrame()` on a dictionary of lists (the keys being column names, the lists being column data). Now, we have a list of dictionaries (the list elements being rows and the dictionaries filling in the data for each entry in the rows).

In [None]:
from pandas import DataFrame, set_option, to_numeric
set_option("display.max_columns",30)

# parse the JSON in the response to our API call and store it
# as a dataframe
data = response.json()
senate = DataFrame(data["results"][0]["members"])

# have a look
senate.head(5)

One small piece of data cleaning. The numeric data like `"missed_votes_pct"`, the percentage of votes missed by the Senator, are stored as strings. You can see it in the JSON data above. There are clearly quotes around the numbers. Sigh. So we use a converter, `to_numeric()` to take these text columns and turn them into numbers. Sigh. 

There are several converters in Pandas like `to_numeric()`. Do a little searching and find others! They come in really handy in painlessly changing the type of data in a column.

In [None]:
from pandas import to_numeric

senate["missed_votes"] = to_numeric(senate["missed_votes"])
senate["missed_votes_pct"] = to_numeric(senate["missed_votes_pct"])
senate["next_election"] = to_numeric(senate["next_election"])
senate["seniority"] = to_numeric(senate["seniority"])
senate["total_present"] = to_numeric(senate["total_present"])
senate["total_votes"] = to_numeric(senate["total_votes"])
senate["votes_with_party_pct"] = to_numeric(senate["votes_with_party_pct"])

Given a DataFrame, we can pick off individual data points using the code below. Here we pull the number of years Schumer has been in the Senate. We use `.item()` to pull off a single value and not a Series or a piece of a DataFrame. It's the easiest way to pluck out just one clean value, in this case the number of years. 

In [None]:
last = "Schumer"

senator = senate[senate["last_name"]==last]
senator["seniority"].item()

Now, let's create a sentence that summarizes Schumer's time in the Senate. Here we give `last` as an input, specifying the last name of the Senator. (Should we use first name too? Ah the power of subsetting!) We then pluck out the senator and their seniority and the year they are up for reelection. Remember that we have to turn numbers into strings before we can add them to other strings. 

And yes, I know, we just turned the character `"19"` to a number `19` with `to_numeric`. We want the numeric columns numeric because we can now say what rank Schumer is in the Senate in terms of seniority. I'd like a bot to tell me that! 

In [None]:
last = "Schumer"

senator = senate[senate["last_name"]==last]

sent = "Senator "+last+" has been in the Senate for "+str(senator["seniority"].item())+\
       " years, and is up for reelection in "+str(senator["next_election"].item())+"."

print sent

Here we compute something to put Schumer's seniority in context.

In [None]:
last = "Schumer"

senator = senate[senate["last_name"]==last]

#find out how many senators have served longer than schumer
ahead = sum(senate["seniority"]> senator["seniority"].item())

sent = "Senator "+last+" has been in the Senate for "+str(senator["seniority"].item())+\
       " years, and is up for reelection in "+str(senator["next_election"].item())+"."+\
       " There are "+str(ahead)+" senators with more seniority."

print sent
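
The `sum()` in that cell works because a comparison produces `True`/`False` values, and Python counts `True` as 1 and `False` as 0 when adding. Here is the same idea on a plain list, with seniority numbers invented for the example:

```python
# made-up years of service for six senators
seniority = [3, 19, 42, 19, 7, 25]
mine = 19

# one True/False per senator: has this one served longer?
longer = [years > mine for years in seniority]
print(longer)

# adding up the Trues counts them
ahead = sum(longer)
print(ahead)
```

Because the test is `>` and not `>=`, ties are not counted as being ahead. A senator with the same number of years is not ranked above.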

Finally, only add the context if it is meaningful -- if the senator is in the top 10 of the senate in terms of seniority.

In [None]:
last = "Schumer"

senator = senate[senate["last_name"]==last]

#find out how many senators have served longer than schumer
ahead = sum(senate["seniority"]> senator["seniority"].item())

# only comment if the senator is in the top 10 in terms of seniority,
# that is, if fewer than 10 senators have served longer
if ahead < 10:
    
    sent = "Senator "+last+" has been in the Senate for "+str(senator["seniority"].item())+\
           " years, and is up for reelection in "+str(senator["next_election"].item())+"."+\
           " There are "+str(ahead)+" senators with more seniority."
else:

    sent = "Senator "+last+" has been in the Senate for "+str(senator["seniority"].item())+\
           " years, and is up for reelection in "+str(senator["next_election"].item())+"."
            
print sent

Great! You can also imagine using `choice()` or `sample()` to mix up the output so it doesn't read the same for every senator. 
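
One way that might look, as a rough sketch: keep a few sentence templates and let `choice()` from Python's `random` module pick one. The `templates` list and the `describe()` helper are made up for illustration, and the seniority and election year are passed in as plain values rather than pulled from the `senate` DataFrame.

```python
from random import choice

# Hypothetical sentence templates; each one uses the same three slots
templates = [
    "Senator {last} has been in the Senate for {years} years, and is up for reelection in {election}.",
    "After {years} years in the Senate, Senator {last} faces voters again in {election}.",
    "Senator {last}, {years} years into a Senate career, is next on the ballot in {election}.",
]

def describe(last, years, election):
    # choose one template at random and fill in its slots
    template = choice(templates)
    return template.format(last=last, years=years, election=election)

# Illustrative values only; in the notebook these would come from the senate table
print(describe("Schumer", 19, 2022))
```

Run the last line a few times and the wording should change from one call to the next.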

Your turn. 

**3. Use the data in the table and create a sentence or even a paragraph about senators. You might include an `if-then` setup to come up with different responses if they are junior or senior. Maybe you say different things if they have a high percentage of missed votes or if they don't vote along party lines. The possibilities are endless! Bonus points if you use their Twitter handle to grab a recent tweet! Two APIs in one! ProPublica and Twitter.**

In [None]:
# your code here

**Schooling our bot: Entities**. As Suman pointed out, if the bot has some awareness of what it's talking about, you can plan your conversation more easily. Figuring out the who and what of a statement is the goal of so-called "entity extraction". There are various services out there but the one from Reuters, [OpenCalais](http://www.opencalais.com/), is great. 

Sign up for their service and get an API key. Then we call the service as we have in the past. This time, they are using the HTTP request method called POST, not GET. We have more to say about POST in this drill (on Thursday). For now, just know we use `post()` instead of `get()` from `requests`. 

We construct a header with our API key (YOUR API key) and also specifications that our input data is just a text string, `text/raw`, and that we want to get JSON back, `application/json`. If you are in a BeautifulSoup mood, you can ask for XML too. 

POST encodes its arguments or data in the body of the HTTP request and so we also have to specify the `input_data`. Here we take a random paragraph about Schumer. We then call the API and inspect the JSON that comes back.

In [None]:
from requests import post

# define the url and put our key in a dictionary -- OpenCalais expects the
# key in a header called X-AG-Access-Token.
url = "https://api.thomsonreuters.com/permid/calais"
head_data = {'X-AG-Access-Token':'ocuvE6oWywOdjhShYpPAA5cnASGQNuPS',
             'Content-Type' : 'text/raw', 
             'outputformat' : 'application/json'}


input_data = "February 27, 2017 2:52 PM EST - Senate Minority Leader Charles E. Schumer (D-N.Y.) on Feb. 27 said contacts that Sen. Richard Burr (R-N.C.), the chairman of the Senate Intelligence Committee, had with the White House over news reports about Trump associates’ ties to Russia were 'wrong.' (Reuters)"
             
# issue the POST request and collect the response as a JSON string
response = post(url,data=input_data,headers=head_data)     

# parse the json that's come back
data = response.json()
data

This is a big object. But it's just a dictionary. The keys are a little obscure but they refer to a system of organizing "concepts" known as RDF, the [Resource Description Framework](https://en.wikipedia.org/wiki/Resource_Description_Framework). Each key is a unique ID. It's a little awkward, but let's dig in a little.

In [None]:
data.keys()

Here is one. It's a person. Who? What data do you have?

In [None]:
data['http://d.opencalais.com/pershash-1/4dba69da-b216-3135-8136-c5c4abed8362']

There are plenty of different kinds of data extracted from our sentence. Sure, we see Schumer, but that should be easy. Let's look at the other "types" of data OpenCalais has extracted. Oh and the "doc" key is different. It gives us information about your request.

In [None]:
for key in data.keys():
    if key != "doc":
        print data[key]["_typeGroup"]

For each entity, let's pull the name, at least. What did we get?

In [None]:
for key in data.keys():
    if key!= "doc":
        if data[key]["_typeGroup"]=="entities":
            print data[key]["name"]

The socialTag attempts to classify the document as a whole. It has the topic as well as the importance of the topic -- 1 (very centric), 2 (somewhat centric), or 3 (less centric).

In [None]:
for key in data.keys():
    if key!= "doc":
        if data[key]["_typeGroup"]=="socialTag":
            print data[key]["importance"], data[key]["name"]

Ha! Awesome, right? OK So let's put this to work.
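
A first step in putting it to work might be to wrap those loops in a small function your bot can call. This is only a sketch: `names_by_group()` is a made-up helper, and `sample` is a tiny hand-built dictionary that imitates the shape of the response we explored above (its keys and values are invented, not real OpenCalais output).

```python
def names_by_group(data, group):
    """Collect the "name" of every item whose _typeGroup matches group."""
    names = []
    for key in data:
        # the "doc" entry describes the request, not an extracted item
        if key == "doc":
            continue
        item = data[key]
        # use .get() so items missing a field are skipped instead of raising KeyError
        if item.get("_typeGroup") == group and "name" in item:
            names.append(item["name"])
    return names

# Invented stand-in for a parsed response; real keys are long RDF-style URLs
sample = {
    "doc": {"info": "details about the request"},
    "id-1": {"_typeGroup": "entities", "name": "Charles E. Schumer"},
    "id-2": {"_typeGroup": "entities", "name": "Richard Burr"},
    "id-3": {"_typeGroup": "socialTag", "name": "Politics", "importance": "1"},
}

print(sorted(names_by_group(sample, "entities")))
print(names_by_group(sample, "socialTag"))
```

With the real `data` from the API you would call `names_by_group(data, "entities")` in the same way.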

**4. Use your skills digging into JSON to figure out what this document holds. Consult the [OpenCalais documentation](http://www.opencalais.com/wp-content/uploads/folder/ThomsonReutersOpenCalaisAPIUserGuideR10_3.pdf) as well as their [online demo](http://www.opencalais.com/opencalais-demo/). Submit your own sentence and if you want, try folding it into a find/response bot!**

In [None]:
# your code here

A detour on the architecture of web requests
-----

**CSV, PDF and HTML**

So far, we have seen several ways to pull data from "out there" on the web. From the very first day, **we have been loading CSV files** into Pandas DataFrames using the function `read_csv()`. This function makes a request on our behalf to a web server for the CSV file we are after. It reads the file and converts it painlessly into a DataFrame. Easy. 

Then, last session, we expanded things a bit and started making requests for other kinds of files. **PDFs became a source of data.** We saw that we could have one of two kinds of PDF's -- those where the pages were stored as images and those that exposed the actual text. (We also saw a kind of hybrid -- a document that has been run through OCR by a scanner, producing a document that looks like an image, but the text is accessible, albeit with classic OCR mistakes, confusing "1" for "I".)  

From there, we looked not to special files for data, but to **web pages** themselves. We recalled some of the basics of HTML and its rich tag language for describing documents. We saw how attributes added to tags to control their appearance (text size and color, font choice, and so on) via Cascading Style Sheets or CSS, also gave us clues to where the data we wanted might be hiding. We found attributes like "class=" or "id=" were used by designers to style pages and were descriptive enough to let us find data.

We saw that PDFs and HTML documents, or web pages, are great at laying out text and making a page look "just so." But they are rather poor formats for encoding data. We had to fish around the source code of a `weather.gov` site to find the one paragraph tag &lt;p&gt; that contained the temperature for the Columbia campus. And we had to rip the text from a PDF, pulling a single long string from a page of a document, and pipe it through a regular expression to get the list of companies Mnuchin will divest from. 

The data you find in a PDF or on a web page is meant to be read as you would read the page of a book. But in this class, we've seen that that kind of reading is labor-intensive. We want a computer to read for us instead -- to take in the data and create something new. This meant we wanted other formats, which would lead us to JSON and XML.

**Web requests**

Before we discuss these data formats, let's recall how we have been working with web pages. In some cases, we are interested in the page at a single web address, a URL or Uniform Resource Locator. The PDFs for Mnuchin, the Trump Dossier, and the Wikipedia page for Trump's cabinet are all examples.

>https://en.wikipedia.org/wiki/Cabinet_of_Donald_Trump

The page or file is fixed, and we inspect the content for the data we are after. Easy.

In the case of the White House petitions, we saw that the URL itself was used to encode our data selection. We didn't want a single page, but a series of pages. Do we want the petitions on page 2? Page 3? The URL changes predictably with each different choice and we could write a program to pull them all. 
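
As a sketch of that idea, the loop below builds one URL per page. The address `https://example.com/petitions` and the parameter name `page` are placeholders, not the real White House petition URL pattern; substitute whatever pattern you observe in your browser's address bar.

```python
# Placeholder address and parameter name, for illustration only
base = "https://example.com/petitions"

def page_urls(first, last):
    # build one URL per page number, first through last inclusive
    return [base + "?page=" + str(n) for n in range(first, last + 1)]

for url in page_urls(1, 3):
    print(url)
```

Each URL could then be handed to `get()` from `requests`, just as we did for single pages.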

It is common to pass parameters to a web server this way, changing the URL. This is how the Twitter API works, for example. But we should make this mechanism clear. 

Before this class you were probably most familiar with simply entering information in a "form" and clicking "submit" to select the page you were after from a web server. For those of you with rusty HTML skills, a web form is just a special tag -- [here is a nice description of web forms.](http://www.w3schools.com/html/html_forms.asp) There are several different kinds of inputs we can be asked to supply, from clicking a "radio" button, to typing in simple text, to making a selection from a dropdown menu. 

There are two ways the information you enter into a form can be sent from your browser to the server. They are called GET and POST and [they are well documented here](http://www.w3schools.com/tags/ref_httpmethods.asp). In short, **a GET method passes data (key-value pairs) in the URL being sent to the web site.** Take the White House petition site, for example. Or try a Google Search for "Donald Trump Executive Orders." This search is also made via a GET request and generates the URL

> https://www.google.com/?gws_rd=ssl#q=donald+trump+executive+orders
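
To see where a query string like this comes from, here is a small sketch using `urlencode()` from Python's standard library, which turns a dictionary of key-value pairs into the text that follows the `?` in a URL. The base address `https://www.google.com/search` is used only to build a string for illustration; nothing is actually requested.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

# the key-value pair we want to send; "q" carries the search terms
params = {"q": "donald trump executive orders"}

# urlencode joins each key and value with "=" and writes spaces as "+"
query = urlencode(params)
print(query)

# a GET request simply appends the query string to the address after a "?"
url = "https://www.google.com/search" + "?" + query
print(url)
```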

As I mentioned, there are a variety of inputs that are possible with a form and not just text, as in the case of a Google Search where you type your terms into a text box. You might select something from a drop-down menu or click a "radio button" to make some kind of a selection. 

All of these are coded in HTML as "input" tags, &lt;input&gt;, that appear in an overall "form" tag, &lt;form&gt;. The form for a Google search looks something like this -- albeit heavily edited to focus attention on the main parts.

    <form action="/search" method="GET">
        <input type="text" maxlength="2048" name="q" value="">
        <input type="submit" value="Google Search">
        <input type="submit" value="I'm Feeling Lucky">
    </form>

The outer form tag and the input tags it contains specify how a user might alter their search.

The second technique for sending data to a web server is known as POST. **POST encodes the data you want to send in the body of the request.** (Remember, we have added data to the header of an HTTP request before, appending our email addresses. This is similar.) GET exposes your request because it is visible in the URL. POST is a little safer because the data is not so visible: it is not cached by your browser, and it does not normally appear in web server logs. For these reasons, it's often used when you want to log in to a web site or change your password, for example.
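
The difference can be sketched without touching the network. The hypothetical helper `build_request()` below only assembles strings: for GET the encoded pairs ride along in the URL, and for POST they become the body while the URL stays clean. The address `https://example.com/login` and the field name are placeholders.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

def build_request(method, url, fields):
    """Return (url, body) showing where each method puts the form data."""
    encoded = urlencode(fields)
    if method.upper() == "GET":
        # GET: data is appended to the URL; there is no body
        return url + "?" + encoded, None
    # POST: the URL is untouched; data travels in the request body
    return url, encoded

fields = {"user": "someone"}

print(build_request("GET", "https://example.com/login", fields))
print(build_request("POST", "https://example.com/login", fields))
```

Libraries like `requests` do this assembly for you when you pass `params=` to `get()` or `data=` to `post()`.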

You can tell what type you have, GET or POST, by looking at what happens when you submit your data and examining the URL you're directed to. You can also look at the HTML of the page itself and see if the form has a method GET or POST. Twitter's login page, for example, features a POST request. 

       <form action="https://twitter.com/sessions" method="post">
           <input type="text" name="session[username_or_email]" placeholder="Phone, email or username"/>
           <input type="password" name="session[password]" placeholder="Password"/>
           <input type="checkbox" value="1" name="remember_me" checked="checked"/>
           <input type="submit" value="Log in"/> 
       </form>

Again, the POST method specified here packages the name/value pairs inside the body of the HTTP request, which makes for a cleaner URL and imposes no size limitations on the form's output. It is also a little more secure.

We bring this up because in some cases we will be able to reverse engineer the URL (as we did with the White House petitions) to get the data you need and in other cases you might have to look into the HTML. 
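
When you do have to look into the HTML, a short script can list each form's method, action, and input names for you. This sketch uses `HTMLParser` from the standard library rather than BeautifulSoup so it runs without extra installs; the class name `FormFinder` and the trimmed `html` string are made up for the example.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class FormFinder(HTMLParser):
    """Record the method, action and input names of every form on a page."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # browsers treat a form with no method attribute as GET
            method = (attrs.get("method") or "get").upper()
            self.forms.append({"method": method,
                               "action": attrs.get("action"),
                               "inputs": []})
        elif tag == "input" and self.forms and attrs.get("name"):
            # attach the named input to the most recently opened form
            self.forms[-1]["inputs"].append(attrs["name"])

# a trimmed, made-up page in the spirit of the login form above
html = """
<form action="/sessions" method="post">
    <input type="text" name="username"/>
    <input type="password" name="password"/>
    <input type="submit" value="Log in"/>
</form>
"""

finder = FormFinder()
finder.feed(html)
print(finder.forms)
```

The input names it reports are the keys you would use in the dictionary you hand to `get()` or `post()`.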

**A little in the weeds about URL encoding.**

Because GET uses a URL to specify your data, we need to "encode" your data as a "proper web address." Certain characters, like a "space" for example, cannot be part of a web address. Our Google search wrote each space as "+", a shorthand used when form data is placed in a URL. Elsewhere you will often see "%20" instead, which is the "URL encoded" form of the space character. Why "%20"?

Computers need to turn everything into 0's and 1's and that goes for characters as well. The system known as [ASCII](https://en.wikipedia.org/wiki/ASCII) is one way to associate values of a numeric code (or binary digits known as bits) with letters and other symbols. The table below should make things clear. Start with the second column of symbols having "Space" at the top. See that each character is given a number and that number can be expressed in, say, binary (0's and 1's).

<img src=https://cdn.sparkfun.com/assets/home_page_posts/2/1/2/1/ascii_table_black.png width=500>

Notice that in ASCII the space character is number 32, or the binary string 100000. The URL encoding of a character is just a percent sign followed by the hexadecimal (base 16) representation of its ASCII number. So the space has a decimal value of 32, which is 20 in base 16 (2\*16+0\*1), giving "%20". You can read more about URL encoding [here](http://www.w3schools.com/tags/ref_urlencode.asp).
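
You can check this arithmetic in Python. The sketch below uses the built-ins `ord()` and `format()` to get the number for a space and its binary and hexadecimal forms, then `quote()` from the standard library to percent-encode a whole string.

```python
try:
    from urllib.parse import quote   # Python 3
except ImportError:
    from urllib import quote         # Python 2

code = ord(" ")               # the ASCII number for the space character
print(code)                   # 32
print(format(code, "b"))      # binary: 100000
print(format(code, "x"))      # hexadecimal: 20

# quote() replaces unsafe characters such as the space with "%" plus their hex code
print(quote("donald trump executive orders"))
```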

**POST example**

In the bot templates that come up next, we are going to create a bot that reports on the voting activity for members of Congress. We are going to rely on an API from ProPublica for the data. But we need to know a representative's ID number. The [Biographical Directory of the United States Congress](http://bioguide.congress.gov/biosearch/biosearch.asp) is one way to look these up. 

Have a look at the page and fill in your favorite senator's name. Hit enter and see what happens. You see the URL at the top describes the service used to generate the page, but there is no hint of the data you provided. Go back to the [Biographical Directory of the United States Congress](http://bioguide.congress.gov/biosearch/biosearch.asp) page and look at its source (in Chrome this means going to the "View" menu and selecting "Developer" and then "View Source"). You will see something like (with extra lines between)

> &lt;form method="POST" action="http://bioguide.congress.gov/biosearch/biosearch1.asp"&gt;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;INPUT SIZE=30 NAME="lastname" VALUE=""&gt;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;INPUT SIZE=30 NAME="firstname" VALUE=""&gt;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;...

The "action" attribute gives you the location of the service we are after. And the various INPUT tags tell you what data you need to provide. In the cell below, we import from the "requests" module not get() but post(). We then fill in the input data as a dictionary. The rest of the request looks as it did for get(), except that our input_data is provided in the call to post(). Got it? 

In [None]:
from requests import post
from bs4 import BeautifulSoup

# The URL of the data service used by the Biographical Directory
url = "http://bioguide.congress.gov/biosearch/biosearch1.asp"

# The fields of the form on the page, represented as a dictionary
input_data = {"lastname":"schumer",
              "firstname":"charles",
              "position":"",
              "state":"",
              "party":"",
              "congress":""}
             
# A header - this information is passed to the server to tell it who we are
head_data = {"From":"markh@columbia.edu"}
 
# issue the POST request; the response comes back as an HTML page
response = post(url,data=input_data,headers=head_data)    

# parse the HTML, naming the parser explicitly to avoid a BeautifulSoup warning
page = BeautifulSoup(response.text, "html.parser")

And let's have a look at the result.

In [None]:
print page.prettify()

&#x1f3c6; **Challenge round!** &#x1f3c6;

Pick two of these tasks and use your skills with web scraping to answer the question. In each case, there is a URL and a data question attached to it. These come mainly from an excellent list compiled by Dan Nguyen at Stanford. (There are hints at the bottom of this notebook, but at least figure out *what* you need to do before consulting them.)

>Site: [https://analytics.usa.gov/](https://analytics.usa.gov/)<br>
Task: Number of people visiting US Government web sites now<br><br>
Site: [http://www.state.gov/r/pa/ode/socialmedia/](http://www.state.gov/r/pa/ode/socialmedia/)<br>
Task: The number of Pinterest accounts maintained by U.S. State Department embassies and missions<br><br>
Site: [http://www.supremecourt.gov/oral_arguments/argument_transcript.aspx](http://www.supremecourt.gov/oral_arguments/argument_transcript.aspx)<br>
Task: In the most recently transcribed Supreme Court argument, the number of times laughter broke out<br><br>
Site: [https://nfdc.faa.gov/xwiki/bin/view/NFDC/Construction+Notices](https://nfdc.faa.gov/xwiki/bin/view/NFDC/Construction+Notices)<br>
Task: Number of airports with existing construction related activity<br><br>
Site: [https://www.osha.gov/pls/imis/establishment.html](https://www.osha.gov/pls/imis/establishment.html)<br>
Task: The number of OSHA enforcement inspections involving Wal-Mart in California since 2014<br><br>
Site: [https://www.tdcj.state.tx.us/death_row/dr_scheduled_executions.html](https://www.tdcj.state.tx.us/death_row/dr_scheduled_executions.html)<br>
Task: Number of days until Texas's next scheduled execution <br><br>