# ODSC 2017 Slides 

* [Training a Prosocial Chatbot](https://docs.google.com/presentation/d/1IND6PXOxgYb2IVmXnaSoNBcIOaiUVU-iZ6vnYz6J6-E/edit?usp=sharing)
* [Exploring Hyperspace](https://docs.google.com/presentation/d/1SEU8VL0KWPDKKZnBSaMxUBDDwI8yqIxu9RQtq2bpnNg/edit?usp=sharing)

## Ubuntu Dialog Corpus

* Ubuntu IRC channel (for developers and contributors)
* ~1.5 M Dialogs (interactions)
* ~10 M Utterances (statements)

## Application

[![Aira, a visual interpreter for the blind](../data/aira_video_demo_blind_person_512.png)](https://vimeo.com/143070863)

### References

* [Aira](http://aira.io), A Visual Interpreter for the Blind
* [WildML](http://www.wildml.com/2016/07/deep-learning-for-chatbots-2-retrieval-based-model-tensorflow/)
* 2018, Lane, Howard and Hapke: [Natural Language Processing in Action](https://www.manning.com/books/natural-language-processing-in-action/?a_aid=totalgood)
* 2016, Lowe, et al: ["The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems"](https://arxiv.org/pdf/1506.08909.pdf)
* [training set generator v1](https://github.com/rkadlec/ubuntu-ranking-dataset-creator)
* Lowe, [training set generator v2](https://github.com/ryan-lowe/Ubuntu-Dialogue-Generationv2)
* Shmalko, [lexica and slang normalizers](https://github.com/rasendubi/noisy-text/tree/master/data
* [retrieval-based vs rule-based and new modular approach](http://www.aclweb.org/old_anthology/C/C14/C14-1088.pdf)
* [retrieval-based bot](https://export.arxiv.org/pdf/1612.01627)
* [wrong idea about retrieval-bots](https://icrunchdata.com/blog/587/state-of-innovative-chatbots-and-intuitive-ai-in-2017/)

In [5]:
import pandas as pd
from nlpia.data.loaders import read_csv

df = read_csv('../data/ubuntu_dialog_test_10.csv')
print(df.shape)

Reading CSV with `read_csv(*('../data/ubuntu_dialog_test_10.csv',), **{'low_memory': False})`...
(18920, 11)


In [6]:
df.head()

Unnamed: 0,Context,Ground Truth Utterance,Distractor_0,Distractor_1,Distractor_2,Distractor_3,Distractor_4,Distractor_5,Distractor_6,Distractor_7,Distractor_8
0,anyone knows why my stock oneiric exports env ...,nice thanks! __eou__,"wrong channel for it, but check efnet.org, uno...","every time the kernel changes, you will lose v...",ok __eou__,!nomodeset > acer __eou__ I'm assuming it is a...,http://www.ubuntu.com/project/about-ubuntu/der...,thx __eou__ unfortunately the program isn't in...,how can I check? By doing a recovery for testi...,my humble apologies __eou__,#ubuntu-offtopic __eou__
1,i set up my hd such that i have to type a pass...,"so you dont know, ok, anyone else? __eou__ you...","nmap is nice, but it wasn't what I was looking...",ok __eou__,cdrom worked fine on windows. __eou__ i dont ...,"ah yes, i have read return as rerun __eou__",hm? __eou__,"not the case, LTS is every other .04 release. ...",Pretty much __eou__,I used the one I downloaded from AMD __eou__,"ffmpeg is part of the package , quixotedon , a..."
2,im trying to use ubuntu on my macbook pro reti...,just wondering how it runs __eou__,"yes, that's what I did, exported it to a ""id_d...",nothing - i am talking about the question of m...,that should fix the fonts being too large __eou__,"okay, so hcitool echos back hci0 <mac address ...",I get to the menu with options such as 'try ub...,why do u need analyzer __eou__ it is a toy __e...,Cntrl-C may stop the command but it doesn't fi...,"if you're only going to run Ubuntu, just get a...",the ones which are not picked up at the moment...
3,no suggestions? __eou__ links? __eou__ how can...,you cant load anything via usb or cd when luks...,-p sorry... __eou__ nmap -p22 __eou__ It d...,i guess so i can't even launch it. __eou__,noted __eou__,rxvt-unicode is one __eou__,I tarred all of ~ __eou__,I tarred all of ~ __eou__,"I don't really know if I can help, but I was c...","that works just fine, thanks! __eou__",thank you __eou__
4,I just added a second usb printer but not sure...,i was setting it up under the printer configur...,i'd say the most commonly venue would be via L...,"the old hardy man page, http://manpages.ubuntu...",i'll give a try __eou__,"by the way, the url you posted for davfs is fr...",http://ubuntuforums.org/showthread.php?t=15498...,"So I load up putty gui, then what do I do? __e...","you should read error messages, it says 'are ...",waiting the college semester to close just to ...,I was calling myself a jerk. All I know is tha...


In [50]:
# the last statement isn't at all informative (from an NLP perspective)
# So this is a tough training/test example
# Success relies a lot on the context (previous dialog)
df['Context'].iloc[0]

'anyone knows why my stock oneiric exports env var \'USERNAME\'?  I mean what is that used for?  I know of $USER but not $USERNAME .  My precise install doesn\'t export USERNAME __eou__ __eot__ looks like it used to be exported by lightdm, but the line had the comment "// FIXME: Is this required?" so I guess it isn\'t surprising it is gone __eou__ __eot__ thanks!  How the heck did you figure that out? __eou__ __eot__ https://bugs.launchpad.net/lightdm/+bug/864109/comments/3 __eou__ __eot__ '

In [54]:
df['Ground Truth Utterance'].iloc[0]

'nice thanks! __eou__'

In [51]:
# And our bot is trying to mimic multiple IRC users
# both the questioner and the answerer
df['Context'].iloc[1]

'i set up my hd such that i have to type a passphrase to access it at boot. how can i remove that passwrd, and just boot up normal. i did this at install, it works fine, just tired of having reboots where i need to be at terminal to type passwd in. help? __eou__ __eot__ backup your data, and re-install without encryption "might" be the easiest method __eou__ __eot__ '

In [52]:
# So the antisocial sarcasm here would mistrain our bot
df['Ground Truth Utterance'].iloc[1]

'so you dont know, ok, anyone else? __eou__ you are like, yah my mouse doesnt work, reinstall your os lolol what a joke __eou__'

In [55]:
df['Context'].iloc[2]

'im trying to use ubuntu on my macbook pro retina __eou__ i read in the forums that ubuntu has a apple version now? __eou__ __eot__  not that ive ever heard of..  normal ubutnu should work on an intel based mac. there is the PPC version also. __eou__  you want total control? or what are you wanting exactly? __eou__ __eot__ '

In [56]:
# We only have to respond with a single utterance.
# It can even be a question that isn't stated as a question
df['Ground Truth Utterance'].iloc[2]

'just wondering how it runs __eou__'

In [57]:
# TFIDF retrieval

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [12]:
sparse_vectors = tfidf.fit_transform(['Hello World!', 'Goodbye cruel world.', 'hello Jane.'])
sparse_vectors

<3x5 sparse matrix of type '<class 'numpy.float64'>'
	with 7 stored elements in Compressed Sparse Row format>

In [14]:
tfidf.vocabulary_

{'cruel': 0, 'goodbye': 1, 'hello': 2, 'jane': 3, 'world': 4}

In [15]:
tfidf.get_feature_names()

['cruel', 'goodbye', 'hello', 'jane', 'world']

In [13]:
vectors = pd.DataFrame(sparse_vectors.todense(), columns=tfidf.get_feature_names())
vectors

Unnamed: 0,cruel,goodbye,hello,jane,world
0,0.0,0.0,0.707107,0.0,0.707107
1,0.622766,0.622766,0.0,0.0,0.47363
2,0.0,0.0,0.605349,0.795961,0.0


In [16]:
tfidf = TfidfVectorizer(min_df=8, max_df=.3, max_features=100000)
tfidf

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.3, max_features=100000, min_df=8,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [17]:
tfidf.fit(pd.concat([df[df.columns[i]] for i in range(11)]))

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.3, max_features=100000, min_df=8,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [18]:
print(list(tfidf.vocabulary_)[:10])
print(len(tfidf.vocabulary_))

['anyone', 'knows', 'why', 'my', 'stock', 'oneiric', 'exports', 'env', 'var', 'username']
12358


In [22]:
X = tfidf.transform(df.Context)
X.shape

(18920, 12358)

In [23]:
X = pd.DataFrame(X.todense(), columns=tfidf.get_feature_names())
X.head()

Unnamed: 0,00,000,0000,001,002,0022,003,003a,01,011,...,zshwiki,zsnes,zsync,zvacet,zykes,zykotic9,zykotick9,zykotik9,ınstall,ıt
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
y = tfidf.transform(df['Ground Truth Utterance']).todense()

In [38]:
from sklearn.metrics.pairwise import cosine_similarity

In [47]:
def get_statement(s='Hi', db=X.values, orig=df.Context.iloc):
    q = tfidf.transform([s]).todense()[0]
    best_similarity = -1
    best_i = 0
    for i, v in enumerate(db):
        # print(i, q, v)
        similarity = cosine_similarity(q, pd.np.array([v]))
        if similarity > best_similarity:
            best_similarity
            best_i = i
    return orig[best_i], best_i

In [48]:
get_statement('Hello Ubuntu')

('what is the command used for fixing __eou__ help me .......what is the command used for fixing __eou__ __eot__ fixing what? __eou__ __eot__ ',
 18919)

In [49]:
df['Ground Truth Utterance'].iloc[get_statement("Ubuntu doesn't work on my Macbook Pro!")[1]]

'to fix partial updates __eou__'

In [16]:
def get_reply(s='Hi'):
    return df['Ground Truth Utterance'].iloc[get_statement(s)[1]]
    

In [17]:
get_reply('anyone knows why my stock oneiric exports env')

'nice thanks! __eou__'

In [None]:
get_reply('aut the line had the comment "// FIXME: Is this required?" so I guess it isn\'t surprising it is gone __eou__ __eot__ thanks!  How the heck did you figure that out? __eou__ __eot__ https://bugs.launchpad.net/lightdm/+bug/864109/comments/3 __eou__ __eot__ '

In [18]:
get_reply('i set up my hd such that i have to type a pass')

'so you dont know, ok, anyone else? __eou__ you are like, yah my mouse doesnt work, reinstall your os lolol what a joke __eou__'

In [19]:
from sklearn.decomposition import PCA
pca = PCA(n_components=200)
pca = pca.fit(tfidf.transform(df.Context).todense())
X_100d = pca.transform(X)
y_100d = pca.transform(y)

In [21]:
def get_statement_100d(s='Hi'):
    q = pca.transform(tfidf.transform([s]).todense())[0]
    similarity = 0
    best_i = 0
    for i, v in enumerate(X_100d):
        # print(i, q.shape, v.shape)
        sim = 2 - cosine_distances(pd.np.array([q]), pd.np.array([v]))
        if sim > similarity:
            similarity = sim
            best_i = i
    
    return df.Context.iloc[best_i], best_i

def get_reply_100d(s='Hi'):
    return df['Ground Truth Utterance'].iloc[get_statement_100d(s)[1]]

In [24]:
print(get_statement_100d(df.Context[0])[0])
print(get_reply_100d(df.Context[0])[0])
print(get_reply_100d("I'm trying to use ubuntu on my macbook pro"))

anyone knows why my stock oneiric exports env var 'USERNAME'?  I mean what is that used for?  I know of $USER but not $USERNAME .  My precise install doesn't export USERNAME __eou__ __eot__ looks like it used to be exported by lightdm, but the line had the comment "// FIXME: Is this required?" so I guess it isn't surprising it is gone __eou__ __eot__ thanks!  How the heck did you figure that out? __eou__ __eot__ https://bugs.launchpad.net/lightdm/+bug/864109/comments/3 __eou__ __eot__ 
n
ok, i just figured i'd ask here incase I was just retarded lol __eou__


In [95]:
get_statement_100d("me just installed another serial port copier but don't know")

("I use cinnamon __eou__ But i don't know :P __eou__ __eot__ fair enough. I know Kazam has big issues in Gnome3 and cinnamon. __eou__ __eot__ ",
 10644)

In [96]:
get_statement_100d("I just added a second usb printer but not sure")

('obi Its not working without USB stick. Without USB stick it asks to select a boot medium. With USB stick, it boots correctly. __eou__ obi Still the fdisk shows its not bootable. __eou__ __eot__ That display is irrelevant since 15 years :) __eou__ __eot__ ',
 7989)

In [31]:

print(get_reply_100d("Did you like the movie Avatar?"))


and it still doesn't work? __eou__
