# Document retrieval from Wikipedia data

## Fire up GraphLab Create

In [1]:
import graphlab



# Load some text data: Wikipedia pages on people

In [2]:
people = graphlab.SFrame('people_wiki.gl/')



The data contains a link to the Wikipedia article, the name of the person, and the text of the article.

In [3]:
people.head(1)

URI,name,text
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...


In [4]:
len(people)

59071

# Explore the dataset and check out the text it contains

## Exploring the entry for Elton John

In [6]:
John = people[people['name'] == 'Elton John']

In [7]:
John

URI,name,text
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...


In [8]:
John['text']

dtype: str
Rows: ?
['sir elton hercules john cbe born reginald kenneth dwight 25 march 1947 is an english singer songwriter composer pianist record producer and occasional actor he has worked with lyricist bernie taupin as his songwriter partner since 1967 they have collaborated on more than 30 albums to datein his fivedecade career elton john has sold more than 300 million records making him one of the bestselling music artists in the world he has more than fifty top 40 hits including seven consecutive no 1 us albums 58 billboard top 40 singles 27 top 10 four no 2 and nine no 1 for 31 consecutive years 19702000 he had at least one song in the billboard hot 100 his single something about the way you look tonightcandle in the wind 1997 sold over 33 million copies worldwide and is the bestselling single of all time he has received six grammy awards five brit awards winning two awards for outstanding contribution to music and the first brits icon in 2013 for his lasting impact on british 

# Get the word counts for the Elton John article

In [10]:
John['word_count'] = graphlab.text_analytics.count_words(John['text'])

In [11]:
print John['word_count']

[{'all': 1, 'least': 1, 'producer': 1, 'heavily': 1, 'inducted': 1, 'john': 7, 'over': 2, 'named': 1, 'making': 1, 'years': 1, 'four': 1, 'openly': 1, 'including': 1, 'highestprofile': 1, 'its': 2, 'impact': 1, '1': 2, '27': 1, '21': 2, 'wed': 1, 'datein': 1, 'royal': 1, '1947': 1, 'abbey': 1, 'winning': 1, 'late': 1, 'to': 4, 'taupin': 1, 'born': 1, '2014': 1, 'as': 2, 'has': 9, '2013': 1, 'his': 4, 'march': 1, '10': 1, 'songwriter': 2, 'solo': 1, 'continues': 1, 'records': 1, 'five': 1, 'occasional': 1, 'they': 1, 'inception': 1, 'world': 1, 'one': 3, 'hall': 2, 'bestselling': 2, 'fivedecade': 1, 'knighthood': 1, '58': 1, 'artist': 1, 'roll': 2, 'inductee': 1, 'list': 1, 'events': 1, 'hercules': 1, 'announced': 1, 'rock': 2, 'alltime': 1, 'brit': 1, 'bernie': 1, 'england': 1, 'concert': 1, 'be': 1, 'diana': 1, 'globe': 1, 'artists': 2, 'him': 3, 'culture': 1, 'year': 1, 'billboard': 4, 'aids': 2, 'empire': 1, 'honors': 1, 'composers': 1, 'established': 1, 'elton': 3, 'for': 5, 'recor
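To make the word-count step concrete, here is a minimal pure-Python sketch of the same idea. It assumes simple whitespace tokenization of already-lowercased text; the local `count_words` function and the `sample` sentence are illustrative stand-ins, not GraphLab's implementation or real article text.

```python
from collections import Counter

def count_words(text):
    """Map each whitespace-separated token to the number of times it occurs."""
    return dict(Counter(text.split()))

# Hypothetical sample sentence, not taken from the dataset.
sample = "elton john has sold more records than john expected"
word_count = count_words(sample)
print(word_count)
```

Each document becomes a sparse dictionary holding only the words that actually appear in it, which is the shape the rest of the notebook works with.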

## Sort the word counts for the Elton John article

### Turning dictionary of word counts into a table

In [13]:
John_word_count_table = John[['word_count']].stack('word_count', new_column_name = ['word','count'])

### Sorting the word counts to show most common words at the top

In [14]:
John_word_count_table.head()

word,count
social,1
champion,1
be,1
2014,1
legal,1
became,1
2005,1
december,2
21,2
furnish,2


In [15]:
John_word_count_table.sort('count',ascending=False)

word,count
the,27
in,18
and,15
of,13
a,10
has,9
john,7
he,7
on,6
award,5


The most common words are uninformative ones such as "the", "in", and "and".
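The stack-then-sort pattern used above can be sketched without GraphLab. The snippet below uses a handful of counts copied from the outputs above, turns the dictionary into one row per word, and sorts by count in descending order; the variable names are illustrative.

```python
# A few (word, count) pairs taken from the Elton John outputs above.
word_count = {'social': 1, 'furnish': 2, 'john': 7, 'in': 18, 'the': 27}

# "Stack": one row per dictionary entry, with named columns.
rows = [{'word': w, 'count': c} for w, c in word_count.items()]

# Sort so the most frequent words come first.
rows_sorted = sorted(rows, key=lambda r: r['count'], reverse=True)

for row in rows_sorted:
    print(row['word'], row['count'])
```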

# Compute TF-IDF for the corpus 

To give more weight to informative words, we weight each word by its TF-IDF score.
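A common formulation of TF-IDF multiplies a word's count in a document (term frequency) by log(N / df), where N is the number of documents and df is the number of documents containing the word. The sketch below assumes that formulation; whether `graphlab.text_analytics.tf_idf` uses exactly the same weighting is an assumption, and the three tiny documents are made up for illustration.

```python
import math

def tf_idf(word_counts):
    """Compute TF-IDF for a list of {word: count} dictionaries."""
    n_docs = len(word_counts)
    # Document frequency: in how many documents does each word appear?
    doc_freq = {}
    for counts in word_counts:
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # Term frequency times inverse document frequency.
    return [
        {word: tf * math.log(n_docs / float(doc_freq[word]))
         for word, tf in counts.items()}
        for counts in word_counts
    ]

# Hypothetical toy corpus.
docs = [
    {'the': 3, 'singer': 1},
    {'the': 2, 'footballer': 1},
    {'the': 1, 'singer': 2, 'pianist': 1},
]
scores = tf_idf(docs)
print(scores[0])
```

A word that occurs in every document, like "the" here, gets a score of zero no matter how often it appears, while a word unique to one document gets the largest weight per occurrence.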

In [16]:
people['word_count'] = graphlab.text_analytics.count_words(people['text'])
people.head()

URI,name,text,word_count
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'selection': 1, 'carltons': 1, 'being': ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1, 'thomas': 1, 'closely': 1, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1, 'issued': 1, 'mainly': 1, 'nominat ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1, 'bauforschung': 1, ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'they': 1, 'gangstergenka': 1, ..."
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'currently': 1, 'less': 1, 'being': 1, ..."
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2, 'producer': 1, 'show' ..."
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1, 'salon': 1, 'gangs': 1, 'being': 1, ..."
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1, 'frankie': 1, 'labels': 1, ..."
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1, 'deborash': 1, 'both' ..."


In [18]:
tfidf = graphlab.text_analytics.tf_idf(people['word_count'])

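# Earlier versions of GraphLab Create returned an SFrame rather than a single SArray.
# This notebook was created using GraphLab Create version 1.7.1.
# Compare version parts numerically (assumes numeric parts); as strings, '1.10' would sort before '1.6.1'.
if tuple(int(p) for p in graphlab.version.split('.')[:3]) <= (1, 6, 1):
    tfidf = tfidf['docs']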



In [21]:
tfidf[0]

{'10': 2.3157231098806563,
 '1979': 2.6032908378122737,
 '19982000': 6.509158574746988,
 '2000': 1.8763068991994527,
 '2001': 1.9280249665871378,
 '2002': 1.8753125887822302,
 '2003': 1.8013702663900752,
 '2005': 1.6425861253275964,
 '2006': 1.520737905384506,
 '2007': 1.4879730697555795,
 '2008': 1.5093391374786154,
 '2009': 1.5644364836042695,
 '2011': 1.7023470901042916,
 '2013': 1.9545642372230505,
 '2014': 2.2073995783446634,
 '21': 2.797250863489293,
 '32': 4.3717697890214335,
 '44game': 9.887883100557085,
 'a': 0.022476737890332586,
 'acted': 4.137429106591736,
 'afl': 4.70049729471633,
 'aflfrom': 10.986495389225194,
 'against': 4.015921958283749,
 'age': 2.138848033513307,
 'along': 2.5088749729287803,
 'also': 0.4627270916162349,
 'and': 0.002980575592194913,
 'as': 0.2543390440248236,
 'assistant': 2.5220702633476124,
 'at': 0.8612771466165147,
 'australia': 2.86858644684204,
 'australian': 8.630007339620153,
 'before': 2.9935647453367427,
 'being': 1.7938099524877322,
 'blu

In [20]:
people['tfidf'] = tfidf

## Examine the TF-IDF for the Elton John article

In [22]:
John = people[people['name'] == 'Elton John']

In [23]:
John[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)

word,tfidf
furnish,18.38947184
elton,17.48232027
billboard,17.3036809575
john,13.9393127924
songwriters,11.250406447
tonightcandle,10.9864953892
overallelton,10.9864953892
19702000,10.2933482087
fivedecade,10.2933482087
aids,10.262846934


Words with the highest TF-IDF scores are much more informative.

# Manually compute distances between a few people

Let's manually compare the distances between the articles for a few famous people.  

In [24]:
Paul = people[people['name'] == 'Paul McCartney']

In [25]:
Victoria = people[people['name'] == 'Victoria Beckham']

## Distance

In [26]:
graphlab.distances.cosine(John['tfidf'][0],Victoria['tfidf'][0])

0.9567006376655429

In [27]:
graphlab.distances.cosine(John['tfidf'][0],Paul['tfidf'][0])

0.8250310029221779
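Cosine distance is one minus the cosine similarity of two vectors. For sparse TF-IDF dictionaries it can be sketched as below; the local `cosine_distance` function and the toy vectors `x` and `y` are illustrative, and the sketch assumes neither vector is empty or all zeros.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse {word: weight} vectors."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical TF-IDF vectors.
x = {'piano': 3.0, 'album': 4.0}
y = {'album': 2.0, 'football': 2.0}

print(cosine_distance(x, y))
```

Lower values mean more similar articles: a vector compared with itself gives 0 (up to floating-point rounding), and two articles with no words in common give 1. By this measure the Elton John article is closer to Paul McCartney (about 0.83) than to Victoria Beckham (about 0.96).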

# Build a nearest neighbor model for document retrieval

We now create a nearest-neighbors model and apply it to document retrieval.  

In [35]:
knn_model = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name',distance = 'cosine')

PROGRESS: Starting brute force nearest neighbors model training.


In [36]:
raw_knn_model = graphlab.nearest_neighbors.create(people,features=['word_count'],label='name',distance = 'cosine')

PROGRESS: Starting brute force nearest neighbors model training.
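Brute-force nearest-neighbor search simply computes the distance from the query to every reference document and keeps the closest ones. The following is a minimal sketch of that idea, not GraphLab's implementation; the `query` helper, the labels, and the toy vectors are all hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def query(reference, target_label, k=5):
    """Return the k (label, distance) pairs closest to the target document."""
    target = reference[target_label]
    scored = [(label, cosine_distance(target, vec))
              for label, vec in reference.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Hypothetical reference set: label -> sparse feature vector.
reference = {
    'singer_a': {'piano': 2.0, 'album': 3.0},
    'singer_b': {'piano': 1.0, 'album': 2.0, 'tour': 1.0},
    'footballer': {'goal': 3.0, 'club': 2.0},
}
neighbors = query(reference, 'singer_a', k=2)
print(neighbors)
```

The query document is its own nearest neighbor at distance (approximately) zero, so the first genuinely interesting result is the one at rank 2.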


# Applying the nearest-neighbors model for retrieval

## Who is closest to Elton John?

In [37]:
knn_model.query(John)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 102.819ms    |
PROGRESS: | 0            | 32229   | 54.5598     | 1.12s        |
PROGRESS: | 0            | 58595   | 99.1942     | 2.11s        |
PROGRESS: | Done         |         | 100         | 2.22s        |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Elton John,-2.22044604925e-16,1
0,Rod Stewart,0.717219667893,2
0,George Michael,0.747600998969,3
0,Sting (musician),0.747671954431,4
0,Phil Collins,0.75119324879,5


In [38]:
raw_knn_model.query(John)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 20.345ms     |
PROGRESS: | 0            | 39589   | 67.0193     | 1.03s        |
PROGRESS: | Done         |         | 100         | 1.65s        |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Elton John,2.22044604925e-16,1
0,Cliff Richard,0.16142415259,2
0,Sandro Petrone,0.16822542751,3
0,Rod Stewart,0.168327165587,4
0,Malachi O'Doherty,0.177315545979,5


## Other examples of document retrieval

In [39]:
knn_model.query(Victoria)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 24.069ms     |
PROGRESS: | 0            | 32548   | 55.0998     | 1.02s        |
PROGRESS: | 0            | 58994   | 99.8696     | 2.05s        |
PROGRESS: | Done         |         | 100         | 2.06s        |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Victoria Beckham,1.11022302463e-16,1
0,David Beckham,0.548169610263,2
0,Stephen Dow Beckham,0.784986706828,3
0,Mel B,0.809585523409,4
0,Caroline Rush,0.819826422919,5


In [40]:
raw_knn_model.query(Victoria)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 14.931ms     |
PROGRESS: | 0            | 41702   | 70.5964     | 1.02s        |
PROGRESS: | Done         |         | 100         | 1.57s        |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Victoria Beckham,-2.22044604925e-16,1
0,Mary Fitzgerald (artist),0.207307036115,2
0,Adrienne Corri,0.214509782788,3
0,Beverly Jane Fry,0.217466468741,4
0,Raman Mundair,0.217695474992,5
