# Document retrieval from Wikipedia data

# Fire up GraphLab Create

In [1]:
import graphlab
graphlab.product_key.set_product_key('YOUR-PRODUCT-KEY')  # substitute your own GraphLab Create key
graphlab.product_key.get_product_key()

'YOUR-PRODUCT-KEY'

# Load some text data: Wikipedia pages on people

In [2]:
people = graphlab.SFrame('people_wiki.gl/')


The data contains a link to each Wikipedia article, the person's name, and the text of the article.

# Compute TF-IDF for the corpus

To give more weight to informative words, we weight each word by its TF-IDF score.


In [3]:
people['word_count'] = graphlab.text_analytics.count_words(people['text'])
people['tfidf'] = graphlab.text_analytics.tf_idf(people['word_count'])['docs']
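To make the weighting concrete, here is a minimal plain-Python sketch of one common TF-IDF variant: the raw count multiplied by log(N / number of documents containing the word). The helper names `count_words` and `tf_idf` and the toy documents are illustrative assumptions; GraphLab's own tokenization and exact formula are not shown in this notebook and may differ.

```python
import math
from collections import Counter

def count_words(text):
    """Lowercase, split on whitespace, and count each token (a simplification)."""
    return Counter(text.lower().split())

def tf_idf(word_counts):
    """Weight each count by log(N / number of documents containing the word)."""
    n_docs = len(word_counts)
    doc_freq = Counter()
    for counts in word_counts:
        doc_freq.update(counts.keys())  # each document contributes once per word
    return [
        {word: tf * math.log(n_docs / float(doc_freq[word]))
         for word, tf in counts.items()}
        for counts in word_counts
    ]

# Hypothetical toy corpus, not taken from people_wiki.gl
toy_docs = [
    "the singer released the album",
    "the footballer joined the club",
    "the singer toured",
]
toy_counts = [count_words(doc) for doc in toy_docs]
toy_tfidf = tf_idf(toy_counts)
```

In this toy corpus "the" appears in every document, so its weight drops to zero, while a word that appears in only one document keeps the largest weight.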

In [4]:
elton = people[people['name'] == 'Elton John']  # avoids shadowing the built-in filter()

In [5]:
paul = people[people['name'] == 'Paul McCartney']

In [6]:
vicky = people[people['name'] == 'Victoria Beckham']

### Turning a dictionary of word counts into a table

In [7]:
elton[['word_count']].stack('word_count', new_column_name = ['word','count']).sort('count',ascending=False)

word,count
the,27
in,18
and,15
of,13
a,10
has,9
he,7
john,7
on,6
since,5
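For readers without GraphLab, the same reshaping can be sketched in plain Python: turn the dictionary into (word, count) rows and sort by count, largest first. The variable names are illustrative, and the three counts are copied from the table above.

```python
# A few entries from the word-count table above, stored as a dictionary
word_count = {'john': 7, 'the': 27, 'in': 18}

# One (word, count) row per dictionary entry, sorted by count in descending order
rows = sorted(word_count.items(), key=lambda pair: pair[1], reverse=True)

for word, count in rows:
    print('%s,%d' % (word, count))
```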


In [8]:
elton[['tfidf']].stack('tfidf', new_column_name = ['word','tfidf']).sort('tfidf',ascending=False)

word,tfidf
furnish,18.38947184
elton,17.48232027
billboard,17.3036809575
john,13.9393127924
songwriters,11.250406447
overallelton,10.9864953892
tonightcandle,10.9864953892
19702000,10.2933482087
fivedecade,10.2933482087
aids,10.262846934


# Manually compute distances between a few people

Let's manually compare the articles for a few famous people, using the cosine distance between their TF-IDF vectors. Smaller distances mean more similar articles.

In [11]:
graphlab.distances.cosine(elton['tfidf'][0],vicky['tfidf'][0])

0.9567006376655429

In [12]:
graphlab.distances.cosine(elton['tfidf'][0],paul['tfidf'][0])

0.8250310029221779
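By this measure the Paul McCartney article (0.825) is closer to Elton John's than the Victoria Beckham article (0.957). As a rough illustration of what is being computed, the sketch below implements cosine distance (one minus cosine similarity) for sparse vectors stored as dictionaries. The function name, the zero-vector convention, and the toy vectors are assumptions made for this example; this is not GraphLab's implementation.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle) between two sparse vectors given as {word: weight}."""
    # Only words present in both vectors contribute to the dot product
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen for this sketch: an empty vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical TF-IDF-like vectors
doc_a = {'piano': 3.0, 'album': 4.0}
doc_b = {'album': 4.0, 'piano': 3.0}   # same content as doc_a
doc_c = {'football': 2.0}              # no words in common with doc_a

same = cosine_distance(doc_a, doc_b)
disjoint = cosine_distance(doc_a, doc_c)
```

Identical vectors give a distance of 0, and vectors with no words in common give a distance of 1, so the values in this notebook sit between those extremes.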


# Build a nearest neighbor model for document retrieval

We now build nearest-neighbors models on the word-count and TF-IDF features and use them for document retrieval.

In [13]:
knn_word_count_model = graphlab.nearest_neighbors.create(people,features=['word_count'],label='name',distance='')


PROGRESS: Starting brute force nearest neighbors model training.


In [14]:
knn_tfidf_model = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name')

PROGRESS: Starting brute force nearest neighbors model training.
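The progress messages mention brute-force training. In general, a brute-force nearest-neighbors query compares the query against every reference document and keeps the k smallest distances. The sketch below shows that idea in plain Python; `brute_force_query`, the Euclidean distance used here, and the toy data are illustrative assumptions, not GraphLab's internals.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between sparse vectors stored as {word: value}."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def brute_force_query(reference, query_vector, distance, k=5):
    """Score every reference vector against the query and return the k closest."""
    scored = [(label, distance(query_vector, vector))
              for label, vector in reference.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Hypothetical reference set keyed by label
reference = {
    'A': {'piano': 2.0, 'album': 1.0},
    'B': {'piano': 1.0, 'album': 1.0},
    'C': {'football': 3.0},
}
neighbors = brute_force_query(reference, reference['A'], euclidean_distance, k=2)
```

As in the query results in this notebook, the query document itself comes back first with distance 0 when it is part of the reference set.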


In [15]:
knn_word_count_model.query(elton)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 15.552ms     |
PROGRESS: | Done         |         | 100         | 527.607ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Elton John,0.0,1
0,Phil Collins,0.76399026764,2
0,Rod Stewart,0.773333333333,3
0,Annie Lennox,0.776623376623,4
0,Barry Gibb,0.780952380952,5


In [16]:
knn_tfidf_model.query(elton)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 9.541ms      |
PROGRESS: | Done         |         | 100         | 793.668ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Elton John,0.0,1
0,Phil Collins,0.76399026764,2
0,Rod Stewart,0.773333333333,3
0,Annie Lennox,0.776623376623,4
0,Barry Gibb,0.780952380952,5
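Note that the TF-IDF query returns exactly the same neighbors and distances as the word-count query. One possible explanation, which this notebook does not confirm, is that both models ended up using a distance that depends only on which words occur, such as Jaccard distance, rather than on the word weights. The sketch below (with hypothetical toy vectors) shows why such a distance cannot tell counts from TF-IDF weights. If that is what happened, passing an explicit distance such as cosine when creating the model would be one way to make the weights matter.

```python
def jaccard_distance(a, b):
    """1 - |shared words| / |all words|, ignoring the values stored per word."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    if not union:
        return 0.0  # convention for this sketch: two empty documents are identical
    return 1.0 - len(keys_a & keys_b) / float(len(union))

# The same two toy documents, once as raw counts and once as made-up weights
counts_a = {'piano': 3, 'album': 1, 'the': 5}
counts_b = {'album': 2, 'the': 4, 'tour': 1}
weights_a = {'piano': 6.1, 'album': 1.3, 'the': 0.0}
weights_b = {'album': 2.6, 'the': 0.0, 'tour': 4.2}

d_counts = jaccard_distance(counts_a, counts_b)
d_weights = jaccard_distance(weights_a, weights_b)
```

Both calls see two shared words out of four distinct words, so both return 0.5 regardless of the values attached to each word.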


In [17]:
knn_word_count_model.query(vicky)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 3.839ms      |
PROGRESS: | Done         |         | 100         | 485.552ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Victoria Beckham,0.0,1
0,Cheryl Cole,0.800586510264,2
0,Heidi Klum,0.810344827586,3
0,Simon Fuller,0.822742474916,4
0,Adele,0.824915824916,5


In [18]:
knn_tfidf_model.query(vicky)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 8.521ms      |
PROGRESS: | Done         |         | 100         | 784.089ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Victoria Beckham,0.0,1
0,Cheryl Cole,0.800586510264,2
0,Heidi Klum,0.810344827586,3
0,Simon Fuller,0.822742474916,4
0,Adele,0.824915824916,5
