## Which John?

Guess the name of a person in Wikipedia from the first name and a sentence! 

The model is currently trained on only a fraction of the data and supports only 'John'. The first cell downloads the pre-trained model for you. Please note that directly unpickling someone else's file is unsafe, because loading it can execute arbitrary code; I welcome recommendations for a safer way to do this.

In [3]:
!if [ ! -e firstModel.dill ] ; then wget https://www.dropbox.com/s/b7p4tnxcmrpqkh4/firstModel.dill ; fi
    
import dill
# Unpickling can run arbitrary code, so only load a file you trust.
with open('firstModel.dill', 'rb') as model_file:
    model = dill.load(model_file)
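
One partial mitigation for the unpickling concern above, sketched under assumptions: compare the downloaded file against a SHA-256 digest obtained from a source you trust before loading it. The helper names and the `expected_sha256` argument are hypothetical; this notebook does not publish a digest for `firstModel.dill`. A matching digest shows that the file is the one the publisher intended, but it does not make unpickling an untrusted file safe.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Raise ValueError unless the file matches the expected digest.

    expected_sha256 is an assumption: a hex digest you obtained from
    a trusted channel, not something provided by this notebook.
    """
    actual = sha256_of_file(path)
    if actual != expected_sha256.strip().lower():
        raise ValueError('Digest mismatch for %s: got %s' % (path, actual))
    return path
```

With such a digest in hand, one would call `verify_download('firstModel.dill', expected_sha256)` before the `dill.load` step and stop if it raises.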

The model object is a scikit-learn Pipeline of the following form:

    model = Pipeline([
                ('cleaning', CleaningContextTransformer()),
                ('vectorize', DictVectorizer(sparse = False)), 
                ('decision tree', tree.DecisionTreeClassifier())
            ])
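
In this pipeline the raw context strings pass through `CleaningContextTransformer` before reaching `DictVectorizer`, which in scikit-learn expects one feature mapping per sample. The transformer's implementation is not shown in this notebook, so the following is only a guess at the kind of output such a step might produce: a bag-of-words dictionary per context. The function name `context_to_features` and the tokenization rule are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tokenization rule: runs of lowercase letters and digits.
TOKEN_PATTERN = re.compile(r'[a-z0-9]+')


def context_to_features(context):
    """Hypothetical cleaning step: lowercase, tokenize, count tokens."""
    tokens = TOKEN_PATTERN.findall(context.lower())
    return dict(Counter(tokens))


# One feature dictionary per input sentence, the shape a
# dictionary-based vectorizer consumes.
features = [context_to_features(c)
            for c in ['beatles', 'The Beatles, the band!']]
```

A vectorizer would then turn each dictionary into a fixed-length numeric row, with one column per token seen during training.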

Here are a few example predictions:

In [4]:
model.predict(['beatles'])[0]

'http://en.wikipedia.org/wiki/John_Lennon'

In [5]:
model.predict(['Elected in 1960 as the 35th president of the United States'])[0]

'http://en.wikipedia.org/wiki/John_F._Kennedy'

In [6]:
model.predict(['An itinerant preacher and a major religious figure in Christianity and Islam'])[0]

'http://en.wikipedia.org/wiki/John_the_Baptist'
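
Each prediction above is a Wikipedia URL rather than a name. As a small convenience, here is a sketch of turning such a URL into a readable title; `title_from_url` is a hypothetical helper, not part of the model, and it assumes the article title is the last path segment with underscores standing in for spaces.

```python
from urllib.parse import unquote, urlparse


def title_from_url(url):
    """Extract a human-readable article title from a Wikipedia URL."""
    last_segment = urlparse(url).path.rsplit('/', 1)[-1]
    # Decode percent-escapes, then swap underscores for spaces.
    return unquote(last_segment).replace('_', ' ')


print(title_from_url('http://en.wikipedia.org/wiki/John_F._Kennedy'))
```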

I have put together a few widgets that present the process in a more human-friendly way:

In [1]:
import warnings
warnings.filterwarnings('ignore')

# Note: since IPython 4.0 the widgets live in the separate ipywidgets package.
from IPython.html import widgets
from IPython.display import display, HTML
from bs4 import BeautifulSoup
import requests
from nltk import tokenize
import wikipedia

supported_names = ['john']

first_name_widget = widgets.Text(description = '')
context_widget = widgets.Textarea(description = '')
question_widget = widgets.HTML(value = '')
image_widget = widgets.HTML(value = '')
summary_widget = widgets.HTML(value = '')
def displayAnswers(_):
    first_name = first_name_widget.value.strip()
    if first_name.lower() not in supported_names:
        question_widget.value = "<br><br>Sorry, I don't know any " + '"' + first_name + '"' + ' yet...'
        return
    question_widget.value = '<br><br>Is this who we are talking about?'
    context = context_widget.value
    url = model.predict([context.encode('utf8')])[0]
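    # Name the parser explicitly so BeautifulSoup does not guess one.
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    title = soup.find('h1', attrs = {'id':'firstHeading'}).text
    try:
        # Infobox image sources are protocol-relative, hence the 'http:' prefix.
        image_widget.value = '<img src=http:' + soup.find('table', attrs = {'class':'infobox'}).find('img')['src'] + '>'
    except (AttributeError, TypeError, KeyError):
        # No infobox or no image on the page: clear any previous image.
        image_widget.value = ''
    summary_widget.value = tokenize.sent_tokenize(wikipedia.summary(title))[0]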
submit_widget = widgets.Button(description = 'submit')
submit_widget.on_click(displayAnswers)


display(widgets.HTML(value = '<br>'))
display(widgets.HTML(value = 'Tell me what you fondly call this person (first name)'))
display(first_name_widget)
display(widgets.HTML(value = '<br>Use the name in a sentence (context of a reference)'))
display(context_widget)
display(widgets.HTML(value = '<br>'))
display(submit_widget)
display(question_widget)
display(widgets.HTML(value = '<br>'))
display(image_widget)
display(widgets.HTML(value = '<br>'))
display(summary_widget)

HTML('''<script>
code_show=false; 
function code_toggle() {
 if (code_show){
 $("div.input:eq(4)").hide();
 } else {
 $("div.input:eq(4)").show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Raw code"></form>''')
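
The `displayAnswers` handler above concatenates the typed first name directly into an HTML string. A minimal sketch of escaping that input first, assuming Python 3's standard `html` module is available; `unknown_name_message` is a hypothetical helper introduced only for illustration:

```python
from html import escape


def unknown_name_message(first_name):
    """Build the 'unknown name' HTML with the typed name escaped."""
    safe_name = escape(first_name.strip(), quote=True)
    return '<br><br>Sorry, I don\'t know any "' + safe_name + '" yet...'


print(unknown_name_message('<b>Paul</b>'))
```

The result could be assigned to `question_widget.value` in place of the inline concatenation, so that markup typed into the text box is displayed literally rather than rendered.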