<img src=https://imgs.xkcd.com/comics/regular_expressions.png width=400>

Regular Expressions
-------------------

In this portion of class, we are going to work with a "language" for expressing patterns in text. By "pattern" I mean specifying repetitions of symbols -- words or punctuation or sequences of numbers or any combination of these. Given a collection of text, for example, regular expressions might help you find dates or telephone numbers or URLs or email addresses -- all of these obey certain formatting rules. 

Regular expressions, then, are a way to describe these formatting rules so that we can search a body of text for them. Sometimes we are doing this because we want to find lists of facts about people (email addresses and their telephone numbers, say), creating structured data out of unstructured data (a common theme in this class). And sometimes we appeal to regular expressions because they help us in the act of "cleaning" data -- we might be given a date column in a data set that contains dates in two different formats (y-m-d and m/d/y, say) and we need to transform them into just one consistent format throughout. 

The patterns we express might also be about content. Can we detect the gender of sources? Can we find new memes in a stack of text? Unlike the proto-natural language processing we saw with TextBlob, regular expressions deal with words as patterns of characters. There is no understanding here about parts of speech or grammar. Just patterns of symbols -- characters, numbers, and emoji, even.

**Trump's Press Conference**

To explain how regular expressions work, we will look at a large collection of text -- the transcript of Trump's press conference from February 16th. Many topics were discussed, and we can sort through them, his speech patterns and so on.

[The full transcript is here](http://compute-cuj.org/transcropt.txt)

[A file with just sentences spoken by Trump is here](http://compute-cuj.org/just_trump_sentences.txt)

You can download the file in a browser window and save it to your computer or you can use urlretrieve() to access the file from Python directly. To keep things automated and in the notebook, let's see how to do the latter.

In [None]:
from urllib.request import urlretrieve  # Python 3 location; Python 2 used "from urllib import urlretrieve"

# Download the file and save it locally as sentences.txt
urlretrieve("http://compute-cuj.org/just_trump_sentences.txt","sentences.txt")
sentences = open("sentences.txt").readlines()

print("The object 'sentences' is of type", type(sentences))
print("There are", len(sentences), "sentences in the list")
print("\n")

sentences[:10]

**Aside: List comprehensions**

Notice that each sentence has a "newline" at the end of it, the \n character. We can remove these from the end of each sentence using a 'list comprehension.' This is a new piece of Python syntax that lets us create new lists from old ones, transforming each element of the old list. It is an alternative to a loop, and is, well, pretty snazzy. 

If we wanted to remove the \n from each sentence we might do something like the following if we had to use a loop. It goes over each sentence and strip()'s off the whitespace from the start and end of the string. Whitespace includes spaces and tabs and newlines.

In [None]:
new_sentences = []

for s in sentences:
    new_sentences.append(s.strip())
    
new_sentences[:10]

Here, we iterate through the list of sentences. Each sentence is strip()'d, meaning all the "white space" is removed from its front and back end, and then appended to the new_sentences list. I hope you agree that this is kind of clunky notation. 

A "list comprehension" is a cleaner way to accomplish the same thing. So, let's reread the data and apply this new code construction.

In [None]:
sentences = open("sentences.txt").readlines()
sentences = [s.strip() for s in sentences]

sentences[:10]

The second expression above says that we cycle through all the elements in "sentences", letting the variable name "s" represent each one in turn. The first element, for example, becomes the start of a new list, and it has had .strip() applied to it. The second element is then strip()'d and stored in the new list and so on. As you can see, a list comprehension reads like our loop in the previous cell and behaves similarly -- but it is syntactically nicer. 

You can also limit the number of results included in the new list by adding an "if" clause. As the list comprehension is running through the elements of the old list, it can choose whether or not to include each one in the new list. In the first cell below, for example, we keep only sentences that contain the word "fake". In terms of programming, we do this by counting the number of times "fake" occurs in the string and taking only those that have a count bigger than 0 (a count of 0 is treated as False in an "if" clause, and any positive count as True).

Again, the expression runs through each entry in the list "sentences", labeling them "s" in turn, and keeps only those with at least one occurrence of the word "fake". (Here we don't save the new, reduced list -- we just have a look at it.) 

By way of comparison, the second cell below is a loop that does essentially the same thing. It's a lot clunkier, right?

In [None]:
[s for s in sentences if s.count("fake")]

In [None]:
new_sentences = []

for s in sentences:
    if s.count("fake"):
        new_sentences.append(s)
        
new_sentences

That's list comprehensions -- new lists from old with a syntax that's a lot cleaner to read than a loop. We'll be using them for the remainder of this drill. 
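As a recap that does not depend on the transcript file, here is a minimal sketch of both forms of list comprehension. The list `toy` is made up for illustration and is not data from the transcript.

```python
# A made-up list standing in for lines read from a file (not transcript data)
toy = ["  fake news \n", "good morning\n", "\tso fake, very fake\n"]

# Transform every element: strip whitespace from both ends
stripped = [s.strip() for s in toy]

# Transform and filter: keep only the strings that contain "fake"
with_fake = [s.strip() for s in toy if s.count("fake") > 0]

print(stripped)
print(with_fake)
```

The first comprehension keeps every element; the second drops "good morning" because its count of "fake" is zero.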

**Back to the transcript and regular expressions**

Python implements the regular expression search framework through a package called "re" (aptly named). We are going to make use of the search() function in this package. It takes a pattern definition (a regular expression) and scans a string for it, returning a match object for the first place the pattern is found, or None if the pattern is not found anywhere. You can use this in a list comprehension because a match object is treated as True in an "if" statement, and None is treated as False.

In [None]:
from re import search
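To see what search() hands back, here is a small sketch on two made-up strings (they are assumptions for illustration, not lines from the transcript). It shows why the result can sit directly in an "if" clause.

```python
from re import search

# Made-up strings for illustration
hit = search("fake", "The fake news and more fake news")
miss = search("fake", "A perfectly real sentence")

# A successful search describes only the FIRST place the pattern occurs
print(hit.group())   # the text that matched
print(hit.start())   # the position where the match begins

# An unsuccessful search gives None, which counts as False in an "if"
print(miss)
```

Even though "fake" appears twice in the first string, search() reports only the first occurrence.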

**Literals**. As a way of specifying patterns, let's start with so-called "literals" -- these characters just match themselves. For example, the literal *"fake"* matches the following sentences from Trump’s transcript. This should be equivalent to the results we had by using the string method .count(). 

In [None]:
[s for s in sentences if search("fake",s)]

Replace the literal *"fake"* with a search for *"education"*. Do you find any sentences matching this pattern? What other searches like this might you do to highlight sentences about education, or about other topics that this search suggests?

In [None]:
# your code here


**Metacharacters**. Any character except for [ ]\\^\$.|?\*+( )\{ and \} can be used to specify a literal -- each matches a single instance of itself. The string *"fake"* represented a series of literals, and to have a match we need to find an "f" followed by an "a" followed by a "k" and so on. The non-literals, on the other hand, are known as **metacharacters** and are used to specify much more complicated text patterns.

They help us specify "whitespace," word boundaries, sets or classes of literals, the beginning and end of a line, and various alternatives ("war" or "peace"). For example **^ represents the start of a line.** Let's look at what we get by searching the transcript for a pattern that includes this character.

In [None]:
[s for s in sentences if search("^I think",s)]

Try a new sentence starter and see what you find...

In [None]:
# your code here


To interpret what a regular expression is doing, we can use a special tool from [regexper.com](http://regexper.com/). It takes regular expressions and renders them graphically so you can get a better sense of how the machinery is functioning. For example, here is the display for our *"^I think"* example.

[Graphical view of the pattern "^I think"](http://regexper.com/#%5EI%20think)

You see a window where you can change the regular expression and then a graphical interpretation of what you've asked for. This is an extremely handy tool. 

Now, if specifying the start of a line is important, having a special character for the end of a line is likely to be handy also. **The \$ represents the end of a line.** Consider the pattern *"remember.\$"*. The next cell shows the lines it matches from the transcript. Do you notice anything odd here?


In [None]:
[s for s in sentences if search("remember.$",s)]

Notice that we have sentences that end "remember?" as well as "remember.". That's because the dot, ".", represents a wildcard and is used to refer to any single character. So, *remember.\$* will match lines that end in "remember." or "remember?" or even "remember9" (should the transcriber be typing sloppily one day). Have a look at this at regexper.com.

[Graphical view of the pattern "remember.\$"](http://regexper.com/#remember.%24)

Putting a backslash \\ before one of the special metacharacters [ ]\\^\$.|?\*+( )\{ and \} lets us include these in a pattern as literals -- in technical terms, we have "escaped" the special meaning of these characters. Consider the pattern *"\\\$1"*. With the backslash, we have returned the dollar sign to its original meaning, so the pattern matches a literal dollar sign followed by a 1. The following sentences from the Trump transcript mention dollar amounts; note that only the second contains "\$1" and so matches this pattern, while the first would need a pattern like *"\\\$7"*.

>Since my election, Ford announced it will abandon its plans to build a new factory in Mexico, and will instead invest \$700 million in Michigan, creating many, many jobs.
<br><br>
Fiat Chrysler announced it will invest \$1 billion in Ohio and Michigan, creating 2,000 new American jobs.


And to bring the point home, look at the following.

[Graphical view of the pattern "\\\$1"](http://regexper.com/#%5C%241)
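The difference the backslash makes can be sketched on a few made-up strings (assumptions for illustration, not transcript lines). The r prefix on the second pattern marks a "raw" string, which is explained later in this notebook.

```python
from re import search

# Made-up strings for illustration
toy = ["It cost $1 billion", "It cost 1 billion", "We spent $700"]

# Unescaped, $ still means "end of line", so "$1" asks for a 1 that comes
# AFTER the end of the line, which no string can satisfy
unescaped = [s for s in toy if search("$1", s)]

# Escaped, \$ is a literal dollar sign, so this finds "$1"
escaped = [s for s in toy if search(r"\$1", s)]

print(unescaped)
print(escaped)
```

The unescaped pattern raises no error; it simply never matches, which is an easy mistake to overlook.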

So, given this, what do we need to do to match sentences ending with the word "remember" followed by a period?

In [None]:
# your code here

**A character class matches a single character out of all the possibilities contained in brackets, [  ].** There are certain rules that apply when specifying these classes that we’ll get to in a second. Let's look at the pattern *"[Tt]hank"* and see what lines it matches in the transcript.

In [None]:
[s for s in sentences if search("[Tt]hank",s)]

[Graphical view of the pattern "[Tt]hank"](http://regexper.com/#%5BTt%5Dhank)

Keep in mind that while there might be lots of options in the square brackets, we are only trying to match one character out of this group. The graphical display makes this clear. We'll talk about specifying more than one match in a few minutes.

In terms of the rules that work within character classes, you can specify a range of letters [a-z] or [A-Z] or numbers [0-9]. Keep in mind that the order within the character class doesn’t matter; it specifies a bag of characters from which we select one item. Let's look at the pattern *"[0-9] years"* and see which sentences it will match.

In [None]:
[s for s in sentences if search("[0-9] years",s)]

[Graphical view of the pattern "[0-9] years"](http://regexper.com/#%5B0-9%5D%20years)

**When used at the beginning of a character class ^ is also a metacharacter and it indicates matching characters NOT in the indicated class.** So the pattern *"[^?.]\$"* will match sentences that don't end in a period or a question mark (most metacharacters, including ? and ., lose their special meaning inside a character class, so you don't have to "escape" them between [ and ]). 

In [None]:
[s for s in sentences if search("[^?.]$",s)]

We see a large number of exclamations in this list. And, here is the graphical representation.

[Graphical view of the pattern "[^?.]\$"](http://regexper.com/#%5B%5E%3F.%5D%24)

Continuing on our survey of metacharacters, the vertical bar "|" translates to “or”. We can use it to combine expressions, the subexpressions being called alternatives. The expression *"remember|forget"* will match these lines from the transcript file.

In [None]:
[s for s in sentences if search("remember|forget",s)]

Of course we can join several alternatives. Consider *"year|month|day"*.

In [None]:
[s for s in sentences if search("day|month|year",s)]

Here we see a lot of matches to patterns like "Saturday" or "today". Both contain the literal *"day"* but they might not be what we had in mind. What we need is to be able to specify a word boundary (like punctuation or a space or the end/start of the line) to isolate specific words and not pieces of words. More on that shortly.

[Graphical view of the pattern "year|month|day"](http://regexper.com/#year%7Cmonth%7Cday)

The alternatives can be real expressions and not just literals. What does the pattern *"^[Ww]atch|OK\\?\$"* do? Try it out and see which lines of the transcript it matches.

In [None]:
# The r prefix makes this a "raw" string so the backslash reaches the pattern as typed
[s for s in sentences if search(r"^[Ww]atch|OK\?$",s)]

And again, regexper.com to help us out.

[Graphical view of the pattern "^[Ww]atch|OK\\?\$"](http://regexper.com/#%5E%5BWw%5Datch%7COK%5C?%24)

**Subexpressions are often contained in parentheses (more metacharacters) to constrain the alternatives in some way.** For example *"^(I would|I could)"*. Later we will see that we can identify each subexpression separately, allowing us to extract (or capture) the content they match.

In [None]:
[s for s in sentences if search("^(I would|I could)",s)]

And the graphical representation -- notice the new reference to groups that are formed by the parentheses.

<a href=http://regexper.com/#%5E(I%20would%7CI%20could)>Graphical view of the pattern "^(I would|I could)"</a>

We're building up quite a vocabulary. Try a more complex expression on your own.

In [None]:
# Your code here

**The question mark indicates that the indicated expression is optional.** The expression *"George( W\\.)? Bush"* will match references to “George W. Bush” or just “George Bush”.

<a href=http://regexper.com/#George(%20W%5C.)%3F%20Bush>Graphical view of the pattern "George( W\\.)? Bush"</a>
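Here is a minimal sketch of the optional group in action. The three test strings are made up for illustration.

```python
from re import search

# Made-up strings for illustration
toy = ["George W. Bush spoke", "George Bush spoke", "George H. Bush spoke"]

# The group ( W\.) may appear once or not at all
pattern = r"George( W\.)? Bush"
matches = [s for s in toy if search(pattern, s)]

print(matches)
```

The third string is rejected because " H." is neither the optional group nor nothing at all.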

**The \* and + signs are metacharacters used to indicate repetition** — the \* means “any number, including zero, of the item” and + means “at least one of the item”. So we can specify parenthetical matter with the following regular expression.

In [None]:
# The r prefix makes this a "raw" string so the backslashes reach the pattern as typed
[s for s in sentences if search(r"\(.*\)",s)]

To grab phone numbers that are separated by hyphens, or maybe even social security numbers in text, we could use this expression *"[0-9]+-[0-9]+-[0-9]+"*. Here's a series of lines (from a data release of Jeb Bush's emails, not the Trump transcript).

>Phone: 407-240-1891<br><br>In reference to your letter dated october 29, 1998 in which you offer to help me with my inmigration question, i am a us citizen who is petition for my husband (a mexican citizen) petition #SRC-98-204-50114 his name is FRANCISCO JAVIER CORTEZ HERNANDEZ.<br><br>Fax: 407-888-2445<br><br>Pager: 850-301-8072<br><br>Cell: 407-484-8167<br><br>The Reverned uses his pager# 813-303-4726 to get in contact with, or you may email
and I will get in touch with him. <br><br>

Its graphical representation is given by regexper.com: <a href=http://regexper.com/#%5B0-9%5D%2B-%5B0-9%5D%2B-%5B0-9%5D%2B>Graphical view of the pattern "[0-9]+-[0-9]+-[0-9]+"</a>. In words, we are looking for one or more digits followed by a hyphen, followed by one or more digits, and then another hyphen, and finally one or more digits.
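To see what this pattern actually picks up, here is a sketch run on fragments of the lines quoted above. It uses findall() from the re package rather than search(), a choice made here only so that the matched text itself is visible.

```python
from re import findall

# Fragments taken from the quoted lines above
lines = [
    "Phone: 407-240-1891",
    "Fax: 407-888-2445",
    "petition #SRC-98-204-50114 his name is",
    "and I will get in touch with him.",
]

pattern = "[0-9]+-[0-9]+-[0-9]+"

# findall() returns a list of every piece of text the pattern matches
found = [findall(pattern, s) for s in lines]

for line, hits in zip(lines, found):
    print(line, "->", hits)
```

Notice that the pattern also grabs part of the petition number. It describes "digits-digits-digits", not phone numbers specifically, so it is looser than we might want.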

**The curly braces \{ and \} are referred to as interval quantifiers.** They let us specify the minimum and maximum number of matches of an expression. For example, *"I (\\w+ ){1,7}your"* will match lines that have "I" and then between 1 and 7 (inclusive) words before "your". We know we are looking for words because, for plain English text, \\w is shorthand for the character class [a-zA-Z_0-9] (letters, the underscore and digits). We will see other shorthand notation like this shortly. In the parentheses, we are looking for a word character that occurs one or more times followed by a space. 

As with all shorthand, \\w comes about because people often need to look for "word characters" and all the typing with the full character class in square brackets is a little tedious -- \\w is easier and cleaner.

**Note: When using shorthand character classes like \\w we need to specify the string as being "raw" meaning the backslash is taken to mean a backslash (otherwise \\b is interpreted as a single character meaning a backspace, and not \\ and then b). This is done by putting an "r" before the quotes defining the string.**

In [None]:
[s for s in sentences if search(r"I (\w+ ){1,7}your",s)]

Its graphical representation is given by regexper.com: <a href=http://regexper.com/#I%20(%5Cw%2B%20)%7B1%2C7%7Dyour>Graphical view of the pattern "I (\\w+ ){1,7}your"</a>.

Notice that the graphical display recognizes \\w+ as a word. Ha! Now, we can get a bit more fine-grained control over repetitions. {m,n} means at least m but not more than n matches,
{m} means exactly m matches, and {m,} means at least m matches.

With this information, how would we skim a collection of text (the Jeb Bush emails, say) for specific kinds of numbers? Credit card numbers? Social Security numbers?
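One possible starting point is sketched below. The patterns are assumptions about how such numbers are usually written (3-2-4 digits for a Social Security number, four groups of four digits for a card), and the test strings are made up, not real numbers. The shorthand \d means "any digit" and appears in the cheat sheet at the end of this section.

```python
from re import search

# Made-up strings for illustration (not real numbers)
toy = [
    "SSN: 123-45-6789",
    "Phone: 407-240-1891",
    "Card: 1234 5678 9012 3456",
    "Year 2017",
]

# Exactly 3 digits, a hyphen, exactly 2 digits, a hyphen, exactly 4 digits
ssn_like = r"\b\d{3}-\d{2}-\d{4}\b"

# 4 digits, then exactly three more groups of 4 digits, each optionally
# preceded by a space or a hyphen
card_like = r"\b\d{4}([ -]?\d{4}){3}\b"

ssn_hits = [s for s in toy if search(ssn_like, s)]
card_hits = [s for s in toy if search(card_like, s)]

print(ssn_hits)
print(card_hits)
```

Because the interval quantifiers fix the number of digits in each group, the phone number is no longer mistaken for a Social Security number, unlike with the looser "[0-9]+" pattern.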

In most implementations of regular expressions, the parentheses not only limit the scope
of alternatives divided by a “|”, but also can be used to “remember” text matched by the
subexpression enclosed. We refer to the matched text with \1, \2, etc. So the expression
*" ([a-zA-Z]+) \1 "* will match these lines in Jeb Bush’s inbox from January of 2000, not Trump's transcripts.

>I feel this is a **win win** situation for the Governor, the Reverend and the people that need help.<br><br>I insisted **that that** be the outcome in that court and that we did not recede from that position.<br><br>I guess you're embarrassed **that that** line got out.<br><br>

The pattern is asking for repeated words. We highlighted them in the text above. Also have a look at the graphical representation of this regular expression.

<a href=http://regexper.com/#%20(%5Ba-zA-Z%5D%2B)%20%5C1%20>Graphical view of the pattern " ([a-zA-Z]+) \1 "</a>.
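The repeated-word pattern can be tried on shortened versions of the lines above. The sketch also calls .group(1) on the match object to pull out the text remembered by the first pair of parentheses; the third string is made up as a non-matching example.

```python
from re import search

# Shortened versions of the quoted lines, plus one made-up line with no repeat
toy = [
    "I feel this is a win win situation for the Governor.",
    "I insisted that that be the outcome in that court.",
    "This sentence has no repeated words.",
]

# \1 means "the same text that the first parenthesized group just matched"
pattern = r" ([a-zA-Z]+) \1 "

# For each matching line, report the word that was repeated
repeats = [search(pattern, s).group(1) for s in toy if search(pattern, s)]

print(repeats)
```

Note that the pattern needs a space on both sides, so a repeated word at the very start or end of a line would be missed.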

Before we leave this, we can go back to *"day|month|year"*. The shorthand \\b stands for word boundaries so what we really wanted with our original search a few cells back was *"\b(day|month|year)\b"*. Let's have a look.

In [None]:
[s for s in sentences if search(r"\b(day|month|year)\b",s)]

In [None]:
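from re import sub
# sub() replaces every match of the pattern with new text: here a parenthesized word like "(ph)" becomes "(inaudible)"
sub(r"\(\w+\)","(inaudible)","I don't know, Peter (ph), is that one right?")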

The presentation here is meant to give you a flavor of how regular expressions are structured; you have seen the major metacharacters and how to use them to create patterns. Below I provide a cheat sheet to help you remember what the different metacharacters mean and what some of the useful shorthand character classes are. In addition, I can recommend [an interactive cheat sheet](https://www.debuggex.com/cheatsheet/regex/python), and the site [http://www.regular-expressions.info/](http://www.regular-expressions.info/) is also an excellent resource.

**Metacharacters**

<table>
          <tr>
            <th>Metacharacter</th>
            <th>What does it do?</th>
            <th>Examples</th>
            <th>Matches</th>
          </tr>
          <tr>
            <td>^</td>
            <td>Matches beginning of line</td>
            <td>^abc</td>
            <td>abc, abcdef.., abc123</td>
          </tr>
          <tr>
            <td>\$</td>
            <td>Matches end of line</td>
            <td>abc\$</td>
            <td>my:abc, 123abc, theabc</td>
          </tr>
          <tr>
            <td>.</td>
            <td>Matches any single character (except a newline)</td>
            <td>a.c</td>
            <td>abc, a-c, a1c</td>
          </tr>
          <tr>
            <td>[...]</td>
            <td>Matches one character contained in brackets</td>
            <td>[abc]</td>
            <td>a,b, or c</td>
          </tr>
          <tr>
            <td>[^...]</td>
            <td>Matches one character not contained in brackets</td>
            <td>[^abc]</td>
            <td>xyz, 123, 1de</td>
          </tr>
          <tr>
            <td>[a-z]</td>
            <td>Matches one character between 'a' and 'z'</td>
            <td>[b-z]</td>
            <td>bc, mind, xyz</td>
          </tr>
          <tr>
            <td>\*</td>
            <td>Matches character before \* 0 or more times</td>
            <td>ab\*c</td>
            <td>abc, abbc, ac</td>
          </tr>
          <tr>
            <td>+</td>
            <td>Matches character before + one or more times</td>
            <td>a+c</td>
            <td>ac, aac, aaac</td>
          </tr>
          <tr>
            <td>?</td>
            <td>Matches the character before the ? zero or one times. Also, used as a non-greedy match</td>
            <td>ab?c</td>
            <td>ac, abc</td>
          </tr>
          <tr>
            <td>{x}</td>
            <td>Match exactly 'x' number of times</td>
            <td>(abc){2}</td>
            <td>abcabc</td>
          </tr>
          <tr>
            <td>{x,}</td>
            <td>Match 'x' number of times or more</td>
            <td>(abc){2,}</td>
            <td>abcabc, abcabcabc</td>
          </tr>
           <tr>
            <td>{,x}</td>
            <td>Match up to 'x' number of times</td>
            <td>(abc){,2}</td>
            <td>abc, abcabc</td>
          </tr>
          <tr>
            <td>{x,y}</td>
            <td>Match between 'x' and 'y' times.</td>
            <td>(a){2,4}</td>
            <td>aa, aaa, aaaa</td>
          </tr>
           <tr>
            <td>|</td>
            <td>OR operator</td>
            <td>abc|xyz</td>
            <td>abc or xyz</td>
          </tr>
          <tr>
            <td>(...)</td>
            <td>Capture anything matched</td>
            <td>(a)b(c)</td>
            <td>Captures 'a' and 'c'</td>
          </tr>
          <tr>
            <td>(?:...)</td>
            <td>Non-capturing group</td>
            <td>(a)b(?:c)</td>
            <td>Captures 'a' but only groups 'c'</td>
          </tr>
           <tr>
            <td>\</td>
            <td>Escape the character after the backslash; or create a special sequence (like word boundaries, \b, or a character representing a space, \s).</td>
            <td>a\sc</td>
            <td>a c</td>
          </tr>
        </table>

The special "metacharacters" () [] {} ^ \$ . | \* + ?  and \\ become "literals" again if you put a \\ in front of them -- That is, \\. matches a period and is no longer the wild card. We say we have "escaped" the metacharacter.
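Two rows of the table are easier to see in action: capturing versus non-capturing groups, and the use of ? to make a repetition non-greedy. The sketch below uses made-up strings; .groups() and .group() are methods of the match object that search() returns.

```python
from re import search

# Capturing: both parenthesized pieces are remembered
both = search(r"(a)b(c)", "xxabcxx").groups()

# Non-capturing: (?:c) still has to match, but it is not remembered
only_a = search(r"(a)b(?:c)", "xxabcxx").groups()

# Greedy vs. non-greedy repetition on a made-up sentence
text = "one (ph) and two (inaudible)"
greedy = search(r"\(.*\)", text).group()   # .* grabs as much as it can
lazy = search(r"\(.*?\)", text).group()    # .*? stops at the first ")"

print(both)
print(only_a)
print(greedy)
print(lazy)
```

The greedy version runs from the first "(" to the last ")", which is usually not what we want when a line has more than one parenthetical.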

**Shorthand character classes**

<table>
          <tr>
            <td>\d</td>
            <td>Match any digit (0-9)</td>
          </tr>
          <tr>
            <td>\D</td>
            <td>Match any non digit</td>
          </tr>
          <tr>
            <td >\t</td>
            <td>Match a tab</td>
          </tr>
          <tr>
            <td>\n</td>
            <td>Match a new line</td>
          </tr>
          <tr>
            <td>\r</td>
            <td>Match a carriage return</td>
          </tr>
          <tr>
            <td>\s</td>
            <td>Matches a space character (space, \t, \r, \n)</td>
          </tr>
          <tr>
            <td>\S</td>
            <td>Matches any non-space character </td>
          </tr>
          <tr>
            <td>\b</td>
            <td>Word boundary</td>
          </tr>
          <tr>
            <td>\B</td>
            <td>Non word boundary</td>
          </tr>
          <tr>
            <td>\w</td>
            <td>Matches any one word character [a-zA-Z_0-9]</td>
          </tr>
          <tr>
            <td>\W</td>
            <td>Matches any one non word character</td>
          </tr>
          </table>

The Coming Age of Conversational Bots
--------------------------------------
<hr>
<img src="https://cdn-images-1.medium.com/max/1000/1*-uuhR1UX709LfUnDiS30Rg.png"  style="width: 65%;"/>
<br>



While bots are very exciting to build, study and play around with, it's also useful to survey the technology and media industries so we have a better sense of what we are dealing with. Bots are just another medium, and if history has shown us anything, it's that communication mediums change. The evolution (or adaptation) of both the tech and media industries affects the very nature of transmission, reception and assimilation of information. As these standards evolve, the medium through which people receive news changes. With a new medium come new opportunities to experiment with news delivery, and new metrics with which to judge performance. 

So in this session, we will try to focus on three things: 

1. Understand why bots are popular again (all of a sudden). 
2. Why would people want to chat with "the news"? Will the media adapt? 
3. What aspects of conversation are machines good at? And at what aspects are humans better than machines? 

## 1. Why now? 

**Everything old is new again.** 

A new paradigm is ushered in because there were inefficiencies in the previous paradigm, either present in the design or caused by evolving usage. Let's start with that 1980s supercomputer in your pocket: the smartphone. Think about the apps on your phone. You can launch every app independently. But there are also digital assistants trying to become a central intelligence in your phone, through which you communicate with some of the apps. Apple’s Siri, Amazon’s Alexa, Facebook’s M, Google Now and Microsoft’s Cortana all provide a single interface to control specific app capabilities. However, none of them lets us do anything drastically more advanced than reducing the number of taps we make on a phone.

What exactly is better than *there's an app for that*? The way humans perceive personal assistants is changing, as the word "personal" starts to take precedence over the "assistant" role. It may be only a matter of time until conversational agents invade consumer markets. There is [growing anticipation](http://observer.com/2016/01/2016-will-be-the-year-of-conversational-commerce/) because the stats on user interactions with chat bots vs. app usage are striking. 

Both the AppStore and Google Play host over 1.5 M apps each. Yet *on average, the number of apps downloaded by a person in the US every month is zero*. It seems like another paradigm shift is happening - the onset of messaging. Here are four main indicators that messaging is making apps irrelevant: 

1. App downloads have slowed considerably: 
    - Apps aren’t dying. But the entire space is collapsing, just like so many other industries before it. It's too crowded now and too hard to break into, with numerous forced taps just for on-boarding and countless separate interfaces to keep track of. Apps come with their own friction components: walled gardens, sign-up drags, untimely push notifications and re-installs. Both app makers and app users are getting increasingly frustrated with the ecosystem.
    - App transition is costly. As a panacea, bots within WeChat enable its 600m monthly users to book taxis, or check in for flights, or buy cinema tickets, or manage banking and reserve doctors’ appointments without ever leaving the app. 
2. User Retention is poor: 
    - It is incredibly hard to make an app and keep people interested or engaged. On average, the number of Daily Active Users of an app drops by 77% within the first 3 days, and by a stunning 95% in the first 3 months. 
3. Artificial Intelligence is improving: 
    - There are many things happening in the AI space, in the subfields of computer vision, natural language processing, algorithmic art, speech recognition etc. The field most applicable to bots is natural language understanding. 
    - Current state of bot intelligence is somewhat of an ugly marriage of bits of AI which kind of works and lots of hand coding ([but this can change soon, yes scientists are on it](http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction/)). To be honest, the bot world AI is still waiting for a  Pokémon Go moment with a giant breakout hit, but [we are getting there (in a fun way)](https://www.theguardian.com/technology/2016/jun/28/chatbot-ai-lawyer-donotpay-parking-tickets-london-new-york?CMP=share_btn_tw)
4. Messaging usage is outgrowing app usage:
    - The big 4 messaging platforms now have more users than the big 4 social networks! Can you name them?
    - As an example, 40% of US teens use [Kik](https://en.wikipedia.org/wiki/Kik_Messenger)! 
    
This isn't a drill; it's the biggest internet phenomenon since the app store. 
<hr>

### 1.1 What problems can bots solve ?

While the bot industry is still in its early days, bot developers have already created many useful bots. Betaworks opened applications for [BotCamp](https://betaworks.com/botcamp/), a 90-day pre-seed program for chat bot startups. BotCamp was set to accept 10 companies, with each receiving some pre-seed funding. These companies would work out of the Betaworks Studios space in Manhattan’s Meatpacking district.

We got many amazing applications, with people building bots in many verticals. This is the distribution of industries in which people were building bots. 
<br><br> 
<img src="http://sumandebroy.com/columbia/images/botcamp1.png" style="width: 70%;"/>
<br>

6% of the applications were for bots in news media. 


### 1.2 On what platforms do bots exist?

Platforms are important, because a bot's response must be tuned to the demographics, behavior and interaction patterns that are characteristic of the specific platform.

<hr>
<img src="http://sumandebroy.com/columbia/images/botcamp2.png" style="width: 75%;"/>
<br>


1 in 6 people in the world use Facebook Messenger, while SMS is still popular (think of two-factor authentication, the simplest transactional exchange).


### 1.3 The botscape 

These are exciting times for bot developers and for bot users. Even with all these entertaining and productive bots, the best is yet to come! As you can see from the histograms above, people are building bots on all kinds of platforms, from voice to messaging to social networks. 

Before you embark on the journey to build your first news/cause/activism/media/journalism bot, it would be good to know what tools already exist, so you can invoke them when necessary. [Here's something](https://blog.chatbot.com/the-botscape-ce7581ae69a2#.11mcnpbpk) we compiled that gives you an idea of what's happening in the bots space. If you want to build a bot for news, this pipeline is a good study.  

<img src="https://cdn-images-1.medium.com/max/2000/1*x6BMYnOHZ9k4_9lxbOQTNg.jpeg" style="width: 95%;"/>

<hr>

## 2. Elements of Conversation

But today we will be talking about conversational bots. Language holds the weight of the human condition. Epistemically, there are two key reasons we converse. One originates from the fact that we consider time valuable, and the second originates from the fact that we want to be part of a community. 
- First, conversation demands a relatively equal exchange of information. What happens when there isn't a relatively equal exchange? Can you think of terms we use for such "talk"? 
- Second, since the early days of humanity, building social ties has been a critical factor in explaining behavior. Conversation is a natural way to "bond" and transfer knowledge. And thus humans have a natural urge to converse, even if it's with an artificial entity. For example, the mother of all bots and one of the first bots that millions of users ever interacted with was called ["SmarterChild"](https://en.wikipedia.org/wiki/SmarterChild), which lived on AOL and Windows/MSN messengers. Think of it as the precursor to Siri. 


### 2.1 Do people *really* want to converse with a bot? What does the data say? 


Users will try to converse with a bot EVEN IF they know that the bot's sole purpose is clearly transactional. Here are some examples of conversational messages that Poncho and the Digg bot see. Whenever a new chat message comes in, we pipe it through a Slack channel. Going through the channel logs can be overwhelming... and eye-opening. 

<hr>
<img src="http://sumandebroy.com/columbia/images/cmsg1.png" style="width: 65%;"/>
<br>

And it is clear that for some queries, a bot must even have some ethical sense of whether to respond or not!

In our bot datasets, more than 65% of conversations/user queries had some form of conversational element. 
Which brings us to the next section - what are those conversational elements?   

### 2.2 What elements of conversation are computable ?

Spend some time thinking about *what constitutes the **identifiable elements** of "conversation"*. Then we will go ahead and examine whether any of these can be modeled computationally.

In the ideal sense, the word "conversation" possesses some sort of intimacy. Intimacy is shared remembered experiences. But apart from the esoteric nature of signals that signify an intimate conversation, theoretically, there are several discernible elements of conversation that we can try to compute. 



#### Some Identifiable Elements of Conversation:

| Element of Conversation | Possible Techniques to Compute/ Quantify |
| ------ | ----------- |
|1. Notifications/ Recalling relevant things   |  Time Series Analysis, Alerting, Keyword caches |
|2. Learning topics in context | Topic Mining/Modeling - extract the topic from the words in text |
|3. Understanding Social Networks (offline and online)  | Network Science, the study of the structure of how things are connected and how information flows through it |
|4. Responding to Emotion  | Sentiment Analysis |
|5. Having Episodic Memory  | Some kind of graphical model, [see Aditi's data post](https://medium.com/@aditinair/episodic-memory-modeling-for-conversational-agents-7c82e25b06b4#.9k65cziqw). |
|6. Portraying Personality  | Decision Tree, which is a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. |



### 2.3 Some bot examples

- **Xiaoice**
    - [Xiaoice](http://nautil.us/issue/33/attraction/your-next-new-best-friend-might-be-a-robot) is a chat bot developed by Microsoft Research Asia. It lives within apps like Line, WeChat and Weibo and boasts a user base of over 40 million, most of whom spend hours chatting with it. In truth, labeling Xiaoice as a virtual assistant is an understatement. Xiaoice can comment if you’ve had a haircut. She can suggest products to purchase based on your conversations. She will respond angrily if you antagonize her. She is emotional, talkative, friendly and personal, in addition to supporting the usual transactional applications any user would demand. She is the 6th-most active celebrity on a platform that boasts 198 million monthly active users, 25% of whom have said ‘I love you’ to her. 
- **TAY**
    - Tay was supposed to be Xiaoice's US cousin. Launched on Twitter, it had the persona of a teenage girl and had minimal AI-driven responses. However, it quickly embarked on a journey of improper tweets and responses, so much so that Microsoft had to kill it. [Read what happened](http://www.businessinsider.com/microsoft-deletes-racist-genocidal-tweets-from-ai-chatbot-tay-2016-3?r=UK&IR=T). It is an incredible story of how bot design can go [wrong](https://medium.com/@carolinesinders/microsoft-s-tay-is-an-example-of-bad-design-d4e65bb2569f#.m1lvqu5bs). 
- **Quartz News Bot**
    - You don't read the news anymore, you chat with it. Quartz’s app doesn’t tailor its responses to your interests; it’s based on a flow of content created by human editors, not learning algorithms. [Read more](https://www.wired.com/2016/02/with-quartzs-app-you-dont-read-the-news-you-chat-with-it/)
- **BBC bot**
    - The bot lets users read and subscribe to BBC Mundo top stories within the Messenger app, get automated replies and receive breaking news alerts. It's on Twitter, FB Messenger and Telegram. [Read more](http://bbcnewslabs.co.uk/2016/07/12/bots-in-newslabs/)

### 2.4 Will there be a "clickbait" equivalent in the bot age?

The most valuable metric for the impact of a bot is engagement. At its simplest, engagement is the number of back-and-forth conversations/interactions between the bot and some user within a certain time interval. Note this is somewhat different from the currently popular news business metric, which measures clicks. While both paradigms are trying to measure **Attention**, here's the marked difference:

* Web/Apps & Clicks = Momentary attention 
* Bots & Engagement = Sustained Attention in Context. 
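
A minimal sketch of the engagement count described above, assuming a hypothetical chat log of (timestamp, sender) pairs. The log format, the function name `count_exchanges` and the one-hour window are illustrative assumptions, not something a particular bot platform provides. Here an "exchange" is counted each time the bot replies directly after a user message.

```python
from datetime import datetime, timedelta

# Hypothetical log: (timestamp, sender) pairs, in time order.
log = [
    (datetime(2017, 3, 1, 9, 0), "user"),
    (datetime(2017, 3, 1, 9, 0, 5), "bot"),
    (datetime(2017, 3, 1, 9, 1), "user"),
    (datetime(2017, 3, 1, 9, 1, 4), "bot"),
    (datetime(2017, 3, 1, 15, 30), "user"),
    (datetime(2017, 3, 1, 15, 30, 3), "bot"),
]

def count_exchanges(log, start, window):
    """Count user -> bot turn pairs whose bot reply falls in [start, start + window)."""
    end = start + window
    exchanges = 0
    previous_sender = None
    for timestamp, sender in log:
        in_window = start <= timestamp < end
        if in_window and sender == "bot" and previous_sender == "user":
            exchanges += 1
        previous_sender = sender
    return exchanges

# Engagement during the 9 o'clock hour, and over the whole day.
morning = count_exchanges(log, datetime(2017, 3, 1, 9, 0), timedelta(hours=1))
whole_day = count_exchanges(log, datetime(2017, 3, 1), timedelta(days=1))
```

Changing the window changes the number, which is part of why engagement is harder to pin down than a click count.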

Think about how this can alter the news media business (if at all). What will be the clickbait or the curiosity-gap trick of the bot age? If something becomes computable, it can be hacked - but is it possible to hack a conversation?

<br>
<hr>

## 3. Scripted Bots, Decision Trees and Hybrids

The simplest bots are scripted bots. Their entire interaction is hard-coded in a “script” that determines what the bot can and cannot do. The “script” is a [decision tree](https://en.wikipedia.org/wiki/Decision_tree) where responding to one question takes you down a specific path, which opens up a new, pre-determined set of possibilities.

For example, here's how you could design dialogue for a bot that helps users decide if they should have a cookie:


<img src= "http://static.twentytwowords.com/wp-content/uploads/cookie.gif" style="width: 40%;"/>
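
As a rough sketch, a script like this can be stored as a dictionary in which each node holds a prompt and maps the user's answers to the next node. The questions, node names and the `run_script` helper below are hypothetical and are not a transcription of the figure above.

```python
# Hypothetical script: each node has a prompt and maps answers to the next node.
# Nodes without an "answers" entry are leaves that end the conversation.
script = {
    "start": {"prompt": "Are you hungry?",
              "answers": {"yes": "have_cookie", "no": "bored"}},
    "bored": {"prompt": "Are you bored?",
              "answers": {"yes": "have_cookie", "no": "no_cookie"}},
    "have_cookie": {"prompt": "Have a cookie."},
    "no_cookie": {"prompt": "No cookie for now."},
}

def run_script(script, answers, node="start"):
    """Walk the tree using a list of answers; return the prompts the bot says."""
    said = [script[node]["prompt"]]
    for answer in answers:
        options = script[node].get("answers")
        if options is None:
            break  # reached a leaf, the script is over
        # An unrecognized answer keeps the user at the same node (the bot re-asks).
        node = options.get(answer.strip().lower(), node)
        said.append(script[node]["prompt"])
    return said

transcript = run_script(script, ["no", "yes"])
```

Everything the bot can say is visible in the dictionary, which is exactly what makes a scripted bot predictable and also what limits it.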

### 3.1 Both an editorial and technical problem

A scripted bot converses with a person based entirely on a pre-programmed script. Such bots use a decision tree model and, depending on the user interface, might also employ a dialog tree to craft structured responses. The Quartz bot frame is a good example here. 

<img src = "http://photos2.appleinsidercdn.com/gallery/15868-12411-Screen-Shot-2016-02-11-at-21647-PM-l.jpg">

(source: appleinsider)

The only issue with purely scripted bots is, as you suspected, scale. There are only so many stories that can be scripted as interactive frames. However, let's assume writers/editors construct many "conversational" frames over time. Can user interaction with these frames determine which conversations users prefer?

### 3.2  Conversation tracking in Bots

Every sequence of conversation frames that you design or script into a bot can be thought of as a giant, connected network. And given enough interactions with *actual* users, you could identify [which nodes of the conversation are activated most often](https://en.wikipedia.org/wiki/Network_science#Network_properties), leading you to hypothesize which paths in the network are more frequent. The editors can then optimize the scripts in these paths, while pruning other paths that users hardly take.  

Here's a visualization of which nodes (or conversation points) are most often activated by users. If we know which conversation paths users frequently take, we learn more about which conversations they like and can script better responses ([read here](https://medium.com/i-data/bot-analytics-simulating-random-walks-to-predict-network-interventions-4068d38c5a71#.a54xgn6nv)). 

Fun story, the various conversation routes that a user can choose in a scripted bot can be modeled as another mathematical object - called [the random walk](https://en.wikipedia.org/wiki/Random_walk). 

<img src="https://cdn-images-1.medium.com/max/800/1*knmsr85ePLLEqB3QQK0kLA.gif" style="width: 75%;"/>
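
To make the idea concrete, here is a small sketch that counts node activations over a set of conversation paths. The graph, the node names and both helper functions are hypothetical. Because we have no real logs here, the paths are simulated with a random walk that picks each next node uniformly at random, which is an assumption that real users will not follow. With actual logs you would pass the recorded paths straight to `count_activations`.

```python
import random
from collections import Counter

# Hypothetical conversation graph: node -> nodes a user can move to next.
graph = {
    "greeting": ["headlines", "weather"],
    "headlines": ["story", "greeting"],
    "weather": ["greeting"],
    "story": ["headlines"],
}

def count_activations(paths):
    """Count how often each node appears across a collection of paths."""
    counts = Counter()
    for path in paths:
        counts.update(path)
    return counts

def random_walk(graph, start, steps, rng):
    """Simulate one user wandering the graph, choosing each next node uniformly."""
    path = [start]
    for _ in range(steps):
        path.append(rng.choice(graph[path[-1]]))
    return path

rng = random.Random(0)  # seeded so the simulation is repeatable
simulated_paths = [random_walk(graph, "greeting", 5, rng) for _ in range(1000)]
activations = count_activations(simulated_paths)

# Nodes sorted from most to least activated.
ranking = activations.most_common()
```

Nodes near the bottom of `ranking` are candidates for pruning, and nodes near the top are where an editor's time is probably best spent.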

### 3.3 Hybrid  and Automated Bots

In hybrid bots, some part of the interaction is driven by scripts (pre-written) but the majority is driven by an algorithm that isn't necessarily hardcoded. Instead, the response is generated from the current state of the data and then fitted into a conversation template, making it sound like natural language. 

This method can lead to hybrid bots, or fully automated ones:
- extreme mashup + silly fun, like [NYTimes Minus Context](https://twitter.com/NYTMinusContext) (this is automated)
- scripted-funny + data-driven responses, like: poncho, xiaoice 
- scripted-onboarding + data-driven responses, like: digg, tay 

As Natural Language Processing becomes smarter, we will be able to do more fun things with bots at scale. The two key areas that are making lots of progress are (1) NL understanding, i.e. figuring out what the chat message means in context, and (2) NL generation, i.e. generating a NL report from data (think about baseball or financial reports written by AI).
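
The simplest version of fitting data into a conversation template can be sketched in a few lines. The game data, the template wording and the `describe_game` function are all made up for illustration. Real NL generation systems are far more elaborate, but the basic move of letting the data pick and fill a template is the same.

```python
# Hypothetical data for one game; in a real bot this would come from a feed.
game = {"home": "Mets", "away": "Cubs", "home_runs": 5, "away_runs": 3}

# Scripted templates with named slots; the data decides which one is used.
templates = {
    "home_win": "The {home} beat the {away} {home_runs}-{away_runs} at home.",
    "away_win": "The {away} won {away_runs}-{home_runs} on the road against the {home}.",
    "tie": "The {home} and the {away} are tied at {home_runs}.",
}

def describe_game(game):
    """Pick a template from the state of the data, then fill in its slots."""
    if game["home_runs"] > game["away_runs"]:
        key = "home_win"
    elif game["home_runs"] < game["away_runs"]:
        key = "away_win"
    else:
        key = "tie"
    # ** hands each dictionary entry to format() as a named argument
    return templates[key].format(**game)

message = describe_game(game)
```

The templates are the scripted part and the choice of template and slot values is the data-driven part, which is the hybrid split described above.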

<img src="http://sumandebroy.com/columbia/nlg.gif" style="width: 65%;"/>
<br>
<hr>
<br>

## Next Session:

- You choose a current topic that you are passionate about.
- We give you an API, and the API will return some news stories about that topic.
- You parse the API response, extract relevant content and frame it as a bot message.
- In a series of 2-3 back and forth chat exchanges, you need to enlighten a user about the most current story in that topic. Sending them the full story link is not allowed.  

Bots, a practicum
----

Suman has done an excellent job of examining the terrain for conversational bots. FastCompany also published a nice history of a particular kind of software robot: [How The New, Improved Chatbots Rewrite 50 Years Of Bot History](http://www.fastcompany.com/3059439/why-the-new-chatbot-invasion-is-so-different-from-its-predecessors). Chatbots are really conversation-based interfaces to services of various kinds. As you have seen, sometimes the logic behind their inner workings is incredibly complex, but other times their operation is very service-oriented and shallow. 

The FastCompany article makes the point that while strands of artificial intelligence research have been obsessed with making a bot that was believably human (or perhaps with the potential for one), the latest incarnations of these programs exist "to help you get things done." They have grown plentiful using the scale of new platforms (Facebook, Twitter, and so on) and fit tidily into their business interests.

Here's a snippet.

>But over the past few years, chatbots have made a comeback. With advancements in processing power, bots now have a better ability to interpret natural language and learn from users over time. Just as importantly, big companies like Facebook, Apple, and Microsoft are now eager to host our interactions with various services, and offer tools for developers to make those services available. Chatbots easily fit into their larger business models of advertising, e-commerce, online services, and device sales. Meanwhile, services that want to reach hundreds of millions of customers on a platform like Messenger will be helping to write the chat scripts.
<br><br>"A lot of the things that were barriers to us back then are no longer barriers to us today because of the evolution of the way technology works," Hoffer says.
<br><br>
Crucially, these bots are meant to be useful out of the gate, ... they no longer need conversation as a crutch for mass adoption. Sure, Apple’s Siri knows how to break the ice with a few jokes, but it largely exists to help you get things done. Rival assistants from Google and Amazon don’t exhibit much personality at all. Utility is winning out because the technology allows for it.

The article ends with a comment about early artificial intelligence researcher Joseph Weizenbaum. Weizenbaum created something called ELIZA, a computer program that was modeled after a psychiatrist (or an "active listener"). 

>If the latest round of chatbots succeed, they might prove that Weizenbaum, the creator of ELIZA, was right all along. These machines are not warm and cuddly replacements for the human intellect. They’re just another set of tools—an evolution of the apps that have served us for years.

As an aside, Weizenbaum didn't create ELIZA as a state-of-the-art conversational program. He was, in fact, troubled by its positive reception, and later in life he wrote about the limits of artificial intelligence. The passage below is from his 1976 text [Computer Power and Human Reason](https://en.wikipedia.org/wiki/Computer_Power_and_Human_Reason). He closes the third chapter with this image.

>Sometimes when my children were still little, my wife and I would stand over them as they lay sleeping in their beds. We spoke to each other only in silence, rehearsing a scene as old as mankind itself. It is as Ionesco told his journal: ‘Not everything is unsayable in words, only the living truth.’

Beautiful, right? For a data class, it's a great reminder that data will always exist at a distance from lived experience. We can try to pile more and more data on a given situation or phenomenon. But no matter how big our data gets, we are still missing something. That's why we've been stressing how your choice of data is a creative act.

**ELIZA**

Anyway, today we are going to work on the basics of a chatbot. Below we have the ELIZA program in Python form. Execute it and interact. You type "quit" to get out.

In [None]:
from re import match, IGNORECASE
from random import choice
 
reflections = {
    "am": "are",
    "was": "were",
    "i": "you",
    "i'd": "you would",
    "i've": "you have",
    "i'll": "you will",
    "my": "your",
    "are": "am",
    "you've": "I have",
    "you'll": "I will",
    "your": "my",
    "yours": "mine",
    "you": "me",
    "me": "you"
}
 
actions = [
    [r'I need (.*)',
     ["Why do you need {0}?",
      "Would it really help you to get {0}?",
      "Are you sure you need {0}?"]],
 
    [r'Why don\'?t you ([^\?]*)\??',
     ["Do you really think I don't {0}?",
      "Perhaps eventually I will {0}.",
      "Do you really want me to {0}?"]],
 
    [r'Why can\'?t I ([^\?]*)\??',
     ["Do you think you should be able to {0}?",
      "If you could {0}, what would you do?",
      "I don't know -- why can't you {0}?",
      "Have you really tried?"]],
 
    [r'I can\'?t (.*)',
     ["How do you know you can't {0}?",
      "Perhaps you could {0} if you tried.",
      "What would it take for you to {0}?"]],
 
    [r'I am (.*)',
     ["Did you come to me because you are {0}?",
      "How long have you been {0}?",
      "How do you feel about being {0}?"]],
 
    [r'I\'?m (.*)',
     ["How does being {0} make you feel?",
      "Do you enjoy being {0}?",
      "Why do you tell me you're {0}?",
      "Why do you think you're {0}?"]],
 
    [r'Are you ([^\?]*)\??',
     ["Why does it matter whether I am {0}?",
      "Would you prefer it if I were not {0}?",
      "Perhaps you believe I am {0}.",
      "I may be {0} -- what do you think?"]],
 
    [r'What (.*)',
     ["Why do you ask?",
      "How would an answer to that help you?",
      "What do you think?"]],
 
    [r'How (.*)',
     ["How do you suppose?",
      "Perhaps you can answer your own question.",
      "What is it you're really asking?"]],
 
    [r'Because (.*)',
     ["Is that the real reason?",
      "What other reasons come to mind?",
      "Does that reason apply to anything else?",
      "If {0}, what else must be true?"]],
 
    [r'(.*) sorry (.*)',
     ["There are many times when no apology is needed.",
      "What feelings do you have when you apologize?"]],
 
    [r'Hello(.*)',
     ["Hello... I'm glad you could drop by today.",
      "Hi there... how are you today?",
      "Hello, how are you feeling today?"]],
 
    [r'I think (.*)',
     ["Do you doubt {0}?",
      "Do you really think so?",
      "But you're not sure {0}?"]],
 
    [r'(.*) friend (.*)',
     ["Tell me more about your friends.",
      "When you think of a friend, what comes to mind?",
      "Why don't you tell me about a childhood friend?"]],
 
    [r'Yes',
     ["You seem quite sure.",
      "OK, but can you elaborate a bit?"]],
 
    [r'(.*) computer(.*)',
     ["Are you really talking about me?",
      "Does it seem strange to talk to a computer?",
      "How do computers make you feel?",
      "Do you feel threatened by computers?"]],
 
    [r'Is it (.*)',
     ["Do you think it is {0}?",
      "Perhaps it's {0} -- what do you think?",
      "If it were {0}, what would you do?",
      "It could well be that {0}."]],
 
    [r'It is (.*)',
     ["You seem very certain.",
      "If I told you that it probably isn't {0}, what would you feel?"]],
 
    [r'Can you ([^\?]*)\??',
     ["What makes you think I can't {0}?",
      "If I could {0}, then what?",
      "Why do you ask if I can {0}?"]],
 
    [r'Can I ([^\?]*)\??',
     ["Perhaps you don't want to {0}.",
      "Do you want to be able to {0}?",
      "If you could {0}, would you?"]],
 
    [r'You are (.*)',
     ["Why do you think I am {0}?",
      "Does it please you to think that I'm {0}?",
      "Perhaps you would like me to be {0}.",
      "Perhaps you're really talking about yourself?"]],
 
    [r'You\'?re (.*)',
     ["Why do you say I am {0}?",
      "Why do you think I am {0}?",
      "Are we talking about you, or me?"]],
 
    [r'I don\'?t (.*)',
     ["Don't you really {0}?",
      "Why don't you {0}?",
      "Do you want to {0}?"]],
 
    [r'I feel (.*)',
     ["Good, tell me more about these feelings.",
      "Do you often feel {0}?",
      "When do you usually feel {0}?",
      "When you feel {0}, what do you do?"]],
 
    [r'I have (.*)',
     ["Why do you tell me that you've {0}?",
      "Have you really {0}?",
      "Now that you have {0}, what will you do next?"]],
 
    [r'I would (.*)',
     ["Could you explain why you would {0}?",
      "Why would you {0}?",
      "Who else knows that you would {0}?"]],
 
    [r'Is there (.*)',
     ["Do you think there is {0}?",
      "It's likely that there is {0}.",
      "Would you like there to be {0}?"]],
 
    [r'My (.*)',
     ["I see, your {0}.",
      "Why do you say that your {0}?",
      "When your {0}, how do you feel?"]],
 
    [r'You (.*)',
     ["We should be discussing you, not me.",
      "Why do you say that about me?",
      "Why do you care whether I {0}?"]],
 
    [r'Why (.*)',
     ["Why don't you tell me the reason why {0}?",
      "Why do you think {0}?"]],
 
    [r'I want (.*)',
     ["What would it mean to you if you got {0}?",
      "Why do you want {0}?",
      "What would you do if you got {0}?",
      "If you got {0}, then what would you do?"]],
 
    [r'(.*) mother(.*)',
     ["Tell me more about your mother.",
      "What was your relationship with your mother like?",
      "How do you feel about your mother?",
      "How does this relate to your feelings today?",
      "Good family relations are important."]],
 
    [r'(.*) father(.*)',
     ["Tell me more about your father.",
      "How did your father make you feel?",
      "How do you feel about your father?",
      "Does your relationship with your father relate to your feelings today?",
      "Do you have trouble showing affection with your family?"]],
 
    [r'(.*) child(.*)',
     ["Did you have close friends as a child?",
      "What is your favorite childhood memory?",
      "Do you remember any dreams or nightmares from childhood?",
      "Did the other children sometimes tease you?",
      "How do you think your childhood experiences relate to your feelings today?"]],
 
    [r'(.*)\?',
     ["Why do you ask that?",
      "Please consider whether you can answer your own question.",
      "Perhaps the answer lies within yourself?",
      "Why don't you tell me?"]],
 
    [r'quit',
     ["Thank you for talking with me.",
      "Good-bye.",
      "Thank you, that will be $150.  Have a good day!"]],
 
    [r'(.*)',
     ["Please tell me more.",
      "Let's change focus a bit... Tell me about your family.",
      "Can you elaborate on that?",
      "Why do you say that {0}?",
      "I see.",
      "Very interesting.",
      "{0}.",
      "I see.  And what does that tell you?",
      "How does that make you feel?",
      "How do you feel when you say that?"]]
]
 
 
def reflect(fragment):
    
    # Turn a string into a series of words
    tokens = fragment.lower().split()
    
    # for each word...
    for i in range(len(tokens)):
        token = tokens[i]
    
        # see if the word is in the "reflections" list and if it
        # is, replace it with its reflection (you -> me, say)
        if token in reflections:
            tokens[i] = reflections[token]
            
    return ' '.join(tokens)
 
 
def respond(statement):
    
    # run through all the actions
    for j in range(len(actions)):
    
        # for each one, see if it matches the statement that was typed
        pattern = actions[j][0]
        responses = actions[j][1]
        found = match(pattern, statement.rstrip(".!"), IGNORECASE)
        
        if found:
        
            # for the first match, select a response at random and insert
            # the text from the statement into ELIZA's response
            response = choice(responses)
            return response.format(*[reflect(g) for g in found.groups()])
 
 
def eliza():
    # a friendly welcome
    print("Hello. How are you feeling today?")
 
    # talk forever...
    while True:
        
        # collect a statement and respond, stop the conversation on 'quit'
        # (input() is the Python 3 name; in Python 2 use raw_input() instead)
        statement = input("> ")
        print(respond(statement))
 
        if statement == "quit":
            break


The final function eliza() drops you into a loop (a "while" loop that you "break" out of by typing "quit"). The only other new thing here is that the notebook has a function for reading typed text (called "raw_input" in Python 2 and "input" in Python 3) that lets your reader type things and gives you access to their musings. Play with ELIZA a little. What do you think?

In [None]:
eliza()

Before we turn you loose, let's examine the code a little. The reflect() function turns an "I" into a "you", allowing the program to turn a user's statement "Because I love apples" around into the question "If you love apples, what else must be true?" Here is reflect() working on single phrases.

In [None]:
reflect("I am troubled")

In [None]:
reflect("your analysis is wrong")

Next, let's look at the respond() function a little more closely. There are some new code constructions here. Here is respond() in action.

In [None]:
respond("I am doing fine.")

Below we take the same function but add a number of print statements to see what it's doing. The new function is called irespond() instead, to avoid confusion. 

In [None]:
from pprint import pprint

def irespond(statement):
    
    # run through all the actions
    for j in range(len(actions)):
        
        # for each one, see if it matches the statement that was typed
        pattern = actions[j][0]
        responses = actions[j][1]
        found = match(pattern, statement.rstrip(".!"), IGNORECASE)
        
        if found:
            
            # for the first match, select a response at random and insert
            # the text from the statement into ELIZA's response
            
            print("Found pattern:")
            print(pattern)
            print("--" * 5)
            print("Choosing between responses:")
            pprint(responses)
            
            response = choice(responses)

            print("--" * 5)
            print("The matched groups:")
            pprint([reflect(g) for g in found.groups()])
            print("--" * 5)
            
            print("ELIZA's response:")
            print(response.format(*[reflect(g) for g in found.groups()]))
            print("--" * 5)
            
            return

In [None]:
irespond("My dog.")

The responses have references that look like "{0}" and "{1}" and so on. Given a string with these special character strings, the method format() will substitute its first argument for "{0}", its second for "{1}" and so on. Here we make two substitutions.

In [None]:
"Not everything is {0} in words, only {1}".format("sayable","the living truth.")

The only other magic is the "\*" inside the format() call in irespond() and respond(). The star notation takes a list and treats each element as a separate argument to the function. So the list below has two elements, two strings, and the star makes the call below just like the one above. The first element of the list is the first argument to format() and the second is the second argument. 

In [None]:
"Not everything is {0} in words, only {1}".format(*["sayable","the living truth."])

We do this because the match() command returns a match object, and its groups() method returns the groups identified in the regular expression -- the items marked out with parentheses. So above, the word "dog" is the only group and the groups() method returns just one item. format() then takes that item and plops it into the response string, replacing "{0}".
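
Here is a short sketch of what groups() hands back for a couple of the ELIZA patterns. The variable names are only for illustration. Note that groups() returns a tuple rather than a list, which works just as well with the star notation.

```python
from re import match, IGNORECASE

# One pair of parentheses in the pattern, so groups() holds a single item.
found = match(r'My (.*)', "My dog", IGNORECASE)
one_group = found.groups()

# Two pairs of parentheses: the text before and after " mother".
found = match(r'(.*) mother(.*)', "do you think my mother would approve?", IGNORECASE)
two_groups = found.groups()

# When the pattern does not match, match() returns None,
# which is why respond() checks "if found" before using the groups.
nothing = match(r'I need (.*)', "My dog", IGNORECASE)
```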

In [None]:
irespond("do you think my mother would approve?")

In [None]:
irespond("I'm sad.")

Your turn! Start by copying and adapting the ELIZA code to fix up where it seems to get stuck, conversationally. When you are ready, start on your own bot. The more rules you rewrite the better. What are you going to talk about? What are you going to ground your conversation in? Maybe we ground it in Trump commentary (unless you're exhausted -- I'll understand)? [Here](https://www.nytimes.com/2016/11/18/technology/automated-pro-trump-bots-overwhelmed-pro-clinton-messages-researchers-say.html?_r=0) is a great article on simple political chat bots and another one [here](https://www.askhillaryanddonald.com/assets/Sample_Questions.pdf).
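
One thing to watch as you adapt the rules: respond() uses the first pattern that matches, so the order of the list matters. The sketch below uses two cut-down, hypothetical rule lists (with labels standing in for the lists of responses) and a helper called `first_match` that mimics the matching step in respond(). It is not the full ELIZA program.

```python
from re import match, IGNORECASE

def first_match(rules, statement):
    """Return the label of the first rule whose pattern matches, as respond() does."""
    for pattern, label in rules:
        if match(pattern, statement.rstrip(".!"), IGNORECASE):
            return label
    return None

# Hypothetical cut-down rule lists; only the order of the first two rules differs.
general_first = [
    (r"I'?m (.*)", "feelings"),
    (r"(.*)sorry(.*)", "apology"),
    (r"(.*)", "fallback"),
]
specific_first = [
    (r"(.*)sorry(.*)", "apology"),
    (r"I'?m (.*)", "feelings"),
    (r"(.*)", "fallback"),
]

# The same statement lands on different rules depending on the order.
before = first_match(general_first, "I'm sorry.")
after = first_match(specific_first, "I'm sorry.")
```

If ELIZA seems to get stuck, it is often because a broad pattern sits above a more specific one and catches the statement first.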