# Document Retrieval Homework

In [15]:
import graphlab as gl
import pandas as pd  # NOTE: I find pandas more useful for analysis than GraphLab (as most people do); GraphLab (GL) is what the course uses

In [5]:
people = gl.SFrame('data/people_wiki.gl/')
people.head()

| URI | name | text |
|---|---|---|
| <http://dbpedia.org/resource/Digby_Morrell> ... | Digby Morrell | digby morrell born 10 october 1979 is a former ... |
| <http://dbpedia.org/resource/Alfred_J._Lewy> ... | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from ... |
| <http://dbpedia.org/resource/Harpdog_Brown> ... | Harpdog Brown | harpdog brown is a singer and harmonica player who ... |
| <http://dbpedia.org/resource/Franz_Rottensteiner> ... | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lower ... |
| <http://dbpedia.org/resource/G-Enka> ... | G-Enka | henry krvits born 30 december 1974 in tallinn ... |
| <http://dbpedia.org/resource/Sam_Henderson> ... | Sam Henderson | sam henderson born october 18 1969 is an ... |
| <http://dbpedia.org/resource/Aaron_LaCrate> ... | Aaron LaCrate | aaron lacrate is an american music producer ... |
| <http://dbpedia.org/resource/Trevor_Ferguson> ... | Trevor Ferguson | trevor ferguson aka john farrow born 11 november ... |
| <http://dbpedia.org/resource/Grant_Nelson> ... | Grant Nelson | grant nelson born 27 april 1971 in london ... |
| <http://dbpedia.org/resource/Cathy_Caruth> ... | Cathy Caruth | cathy caruth born 1955 is frank h t rhodes ... |


## Question 1

Compare the top words by word count with the top words by TF-IDF: In the notebook we covered in the module, we explored two document representations: word counts and TF-IDF. Now, take a particular famous person, 'Elton John'. What are the 3 words in his article with the highest word counts? What are the 3 words in his article with the highest TF-IDF? These results illustrate why TF-IDF is useful for finding important words. Save these results to answer the quiz at the end.

In [36]:
def count_words_for_person(person_name):
    person = people[people['name'] == person_name]
    person['word_count'] = gl.text_analytics.count_words(person['text'])
    return person

person = count_words_for_person('Elton John')
# in GL:
# elton_word_counts = person[['word_count']].stack('word_count', new_column_name = ['word','count'])
# elton_word_counts.head()
elton_wc = pd.Series(person['word_count'][0])
elton_wc.sort_values(ascending=False).head(10)

the     27
in      18
and     15
of      13
a       10
has      9
john     7
he       7
on       6
for      5
dtype: int64

The words with the highest counts are "the," "in," and "and."
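For readers without GraphLab, the same kind of count can be sketched with the standard library alone. This is a minimal illustration, not the course's method: `count_words` and `sample_text` are hypothetical names, the text is made up, and splitting on whitespace is only an approximation of whatever tokenization `gl.text_analytics.count_words` applies.

```python
from collections import Counter

def count_words(text):
    """Return a word -> count mapping for lowercased, whitespace-separated text."""
    return Counter(text.lower().split())

# Hypothetical snippet standing in for an article's text.
sample_text = "the singer and the pianist met the producer in london"
sample_counts = count_words(sample_text)

# most_common sorts by descending count, much like sort_values(ascending=False).
top_three = sample_counts.most_common(3)
print(top_three)
```

As in the Elton John article, the most frequent word is a stop word ("the"), which says little about the subject of the text.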

In [42]:
people['word_count'] = gl.text_analytics.count_words(people['text'])
tfidf = gl.text_analytics.tf_idf(people['word_count'])
people['tfidf'] = tfidf

In [48]:
elton_tfidf = pd.Series(people[people['name'] == 'Elton John']['tfidf'][0])
elton_tfidf.sort_values(ascending=False).head()

furnish        18.389472
elton          17.482320
billboard      17.303681
john           13.939313
songwriters    11.250406
dtype: float64

The words with the highest TF-IDF scores are "furnish," "elton," and "billboard."
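To see why stop words such as "the" drop out of the TF-IDF ranking, here is a small sketch on a made-up corpus. It assumes the common weighting tf × log(N / df), where N is the number of documents and df is the number of documents containing the word; GraphLab's exact formula may differ in detail. The names `tf_idf`, `toy_docs`, and `toy_scores` are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf * log(N / df) for every word in every document.

    docs: list of strings. Returns one word -> score dict per document.
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    n_docs = float(len(docs))
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    return [
        {word: tf * math.log(n_docs / doc_freq[word]) for word, tf in c.items()}
        for c in counts
    ]

# Hypothetical three-document corpus.
toy_docs = [
    "the singer wrote the song",
    "the player scored the goal",
    "the author wrote the book",
]
toy_scores = tf_idf(toy_docs)

# Words of the first document, highest score first.
print(sorted(toy_scores[0].items(), key=lambda kv: -kv[1]))
```

"the" occurs in every document, so its weight is log(3/3) = 0 no matter how often it appears, while a word unique to one document ("singer") gets the largest weight. The same effect pushes "the," "in," and "and" out of Elton John's top words.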

## Question 2

Measuring distance: Elton John is a famous singer; let’s compute the distance between his article and those of two other famous singers. In this assignment, you will use the cosine distance, which is one measure of similarity between vectors, similar to the one discussed in the lectures. You can compute this distance using the `graphlab.distances.cosine` function. What’s the cosine distance between the articles on ‘Elton John’ and ‘Victoria Beckham’? What’s the cosine distance between the articles on ‘Elton John’ and ‘Paul McCartney’? Which of the two is closer to Elton John? Does this result make sense to you? Save these results to answer the quiz at the end.

> Note: I use GraphLab for this because I'm not familiar with how (or whether) cosine distance is implemented in pandas.

In [53]:
def get_person(name):
    return people[people['name'] == name]

elton = get_person('Elton John')
victoria = get_person('Victoria Beckham')
paul = get_person('Paul McCartney')

In [57]:
def calc_cosine_distance(record1, record2):
    return gl.distances.cosine(record1['tfidf'][0], record2['tfidf'][0])

print "distance between Elton and Victoria is " + str(calc_cosine_distance(elton, victoria))
print "distance between Elton and Paul is " + str(calc_cosine_distance(elton, paul))

distance between Elton and Victoria is 0.956700637666
distance between Elton and Paul is 0.825031002922


Paul is closest to Elton: 0.825 versus 0.957, and a smaller cosine distance means more similar articles. This makes sense to me because Elton and Paul are both great musicians from the 1960s to today. They've both been knighted. Either one of them has more talent in their little toe on a bad day than Victoria :P
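Cosine distance is simple enough to compute by hand, which may help if GraphLab is not available. The sketch below assumes the usual definition, 1 minus the cosine of the angle between the two vectors, applied to sparse word -> weight dictionaries like the `tfidf` column. `cosine_distance`, `vec_a`, and `vec_b` are hypothetical names with made-up weights, and the function assumes neither vector is all zeros.

```python
import math

def cosine_distance(a, b):
    """1 - cos(angle) between two sparse vectors stored as word -> weight dicts.

    Assumes both vectors have at least one nonzero weight.
    """
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical TF-IDF-style weights for two short articles.
vec_a = {"piano": 3.0, "album": 4.0}
vec_b = {"album": 2.0, "football": 2.0}

print(cosine_distance(vec_a, vec_b))
```

A distance near 0 means the articles weight the same words in the same proportions; a distance of 1 means they share no words at all. For dense NumPy arrays, `scipy.spatial.distance.cosine` computes the same quantity.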

## Question 3

### Building nearest neighbors models with different input features and setting the distance metric

In the sample notebook, we built a nearest neighbors model for retrieving articles using TF-IDF as features and using the default setting in the construction of the nearest neighbors model. Now, you will build two nearest neighbors models:

1. Using word counts as features
2. Using TF-IDF as features

In both of these models, we are going to set the distance function to cosine distance. Here is how: when you call the function

`graphlab.nearest_neighbors.create`

add the parameter:

`distance='cosine'`

Now we are ready to use our model to retrieve documents. Use these two models to collect the following results:

1. What’s the most similar article, other than itself, to the one on ‘Elton John’ using word count features?
2. What’s the most similar article, other than itself, to the one on ‘Elton John’ using TF-IDF features?
3. What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using word count features?
4. What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using TF-IDF features?

In [61]:
knn_model_wc = gl.nearest_neighbors.create(people,
                                           features=['word_count'],
                                           label='name',
                                           distance='cosine')

knn_model_tfidf = gl.nearest_neighbors.create(people,
                                              features=['tfidf'],
                                              label='name',
                                              distance='cosine')

In [63]:
## Elton
print knn_model_wc.query(elton)
print knn_model_tfidf.query(elton)

+-------------+-------------------+-------------------+------+
| query_label |  reference_label  |      distance     | rank |
+-------------+-------------------+-------------------+------+
|      0      |     Elton John    | 2.22044604925e-16 |  1   |
|      0      |   Cliff Richard   |   0.16142415259   |  2   |
|      0      |   Sandro Petrone  |   0.16822542751   |  3   |
|      0      |    Rod Stewart    |   0.168327165587  |  4   |
|      0      | Malachi O'Doherty |   0.177315545979  |  5   |
+-------------+-------------------+-------------------+------+
[5 rows x 4 columns]



+-------------+------------------+--------------------+------+
| query_label | reference_label  |      distance      | rank |
+-------------+------------------+--------------------+------+
|      0      |    Elton John    | -2.22044604925e-16 |  1   |
|      0      |   Rod Stewart    |   0.717219667893   |  2   |
|      0      |  George Michael  |   0.747600998969   |  3   |
|      0      | Sting (musician) |   0.747671954431   |  4   |
|      0      |   Phil Collins   |   0.75119324879    |  5   |
+-------------+------------------+--------------------+------+
[5 rows x 4 columns]



In [64]:
## Victoria
print knn_model_wc.query(victoria)
print knn_model_tfidf.query(victoria)

+-------------+--------------------------+--------------------+------+
| query_label |     reference_label      |      distance      | rank |
+-------------+--------------------------+--------------------+------+
|      0      |     Victoria Beckham     | -2.22044604925e-16 |  1   |
|      0      | Mary Fitzgerald (artist) |   0.207307036115   |  2   |
|      0      |      Adrienne Corri      |   0.214509782788   |  3   |
|      0      |     Beverly Jane Fry     |   0.217466468741   |  4   |
|      0      |      Raman Mundair       |   0.217695474992   |  5   |
+-------------+--------------------------+--------------------+------+
[5 rows x 4 columns]



+-------------+---------------------+-------------------+------+
| query_label |   reference_label   |      distance     | rank |
+-------------+---------------------+-------------------+------+
|      0      |   Victoria Beckham  | 1.11022302463e-16 |  1   |
|      0      |    David Beckham    |   0.548169610263  |  2   |
|      0      | Stephen Dow Beckham |   0.784986706828  |  3   |
|      0      |        Mel B        |   0.809585523409  |  4   |
|      0      |    Caroline Rush    |   0.819826422919  |  5   |
+-------------+---------------------+-------------------+------+
[5 rows x 4 columns]
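In each table above, rank 1 is the query article itself, with a distance that is zero up to floating-point rounding (about ±2.2e-16), so the answer to "other than itself" is the rank 2 row. The retrieval step can be sketched as a brute-force search: compute the cosine distance from the query to every other document and sort. This is only an illustration on a made-up corpus; GraphLab may choose a different search strategy internally, and `nearest_neighbors`, `cosine_distance`, and `toy_corpus` are hypothetical names.

```python
import math

def cosine_distance(a, b):
    """1 - cos(angle) between two nonzero sparse vectors (word -> weight dicts)."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_neighbors(corpus, query_label, k=2):
    """Rank every other document by cosine distance to the query document."""
    query = corpus[query_label]
    ranked = sorted(
        (cosine_distance(query, vec), label)
        for label, vec in corpus.items()
        if label != query_label  # skip the query itself
    )
    return [(label, dist) for dist, label in ranked[:k]]

# Hypothetical corpus: label -> word weights.
toy_corpus = {
    "singer_a": {"song": 2.0, "album": 1.0},
    "singer_b": {"song": 1.0, "album": 1.0, "tour": 1.0},
    "footballer": {"goal": 2.0, "club": 1.0},
}
neighbors = nearest_neighbors(toy_corpus, "singer_a")
print(neighbors)
```

Swapping the word weights from raw counts to TF-IDF changes the ranking in the same way the two GraphLab models differ above: shared stop words stop dominating the dot product, so articles about related subjects move closer together.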

