# Document Retrieval From Wikipedia Data

## Fire Up GraphLab Create
(See [Getting Started with SFrames](../../week-1/work/Getting-Started-With-SFrames.ipynb) for setup instructions)

In [1]:
# Ignore GraphLab
# import graphlab

# Use Pandas
import pandas as pd
# Use NumPy
import numpy as np

In [None]:
# Limit number of worker processes. This preserves system memory, which prevents hosted notebooks from crashing.
# graphlab.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 4)

# Load Some Text Data - From Wikipedia, Pages On People

In [2]:
# people = graphlab.SFrame('people_wiki.gl/')

# Import the CSV (Comma Separated Value)
people = pd.read_csv("people_wiki.csv")

Data contains: **link to wikipedia article**, **name of person**, **text of article**.

In [3]:
people.head()

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...


In [4]:
len(people)

59071

# Explore The Dataset, Check Out The Text

## Exploring The Entry For President Obama

In [5]:
obama = people[people['name'] == 'Barack Obama']

In [6]:
obama

Unnamed: 0,URI,name,text
35817,<http://dbpedia.org/resource/Barack_Obama>,Barack Obama,barack hussein obama ii brk husen bm born augu...


In [7]:
obama['text']

35817    barack hussein obama ii brk husen bm born augu...
Name: text, dtype: object

## Exploring The Entry For Actor George Clooney

In [8]:
clooney = people[people['name'] == 'George Clooney']
clooney['text']

38514    george timothy clooney born may 6 1961 is an a...
Name: text, dtype: object

# Get The Word Counts For Obama Article

## Helper

In [9]:
def remove_punctuation(text):
    """Remove punctuation(s) from a line of string text
    
    Args:
        text (str): The line of string text
        
    Returns:
        A string for the line of text with punctuation(s) removed
    """
    # Use the string library
    import string
    # Use `str.maketrans` to build a translation table
    # Use `translate` and the translation table to remove punctuation
    return text.translate(str.maketrans("", "", string.punctuation))


def word_count(text):
    """Count the occurrence of each word
    
    Args:
        text (str): The line of string text
        
    Returns:
        A dictionary of word(s) with a count
    """
    # Split the `text` on space
    word_list = text.split()
    # Create a dictionary to store the word count
    word_dict = {}
    
    # Loop through the list of word(s)
    for word in word_list:
        # If the word already exists, increment its count by 1
        if word in word_dict:
            word_dict[word] += 1
        # Else add the new word with count of 1
        else:
            word_dict[word] = 1
            
    return word_dict


# Create a new column `text_clean` from `text` with punctuation removed
people["text_clean"] = people["text"].apply(remove_punctuation)

# Get the 'Barack Obama' entry again
obama = people[people['name'] == 'Barack Obama']
# Get the word count for the `text_clean` column
obama_word = word_count(obama['text_clean'].values[0])
print("Obama Word Count Length: {}".format(len(obama_word)))

Obama Word Count Length: 273


In [None]:
# obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])

In [None]:
# print obama['word_count']

## Sort The Word Count For The Obama Article

### Turning Dictionary Word Count To Table

In [10]:
# obama_word_count_table = obama[['word_count']].stack('word_count', new_column_name = ['word','count'])

# Create a DataFrame for the `obama_word`
# Convert `obama_word` to a list of tuple
# Construct the DataFrame with the `word` and `count` column
obama_word_count_table = pd.DataFrame(list(obama_word.items()), columns=['word', 'count'])

### Sorting the word counts to show most common words at the top

In [11]:
obama_word_count_table.head()

Unnamed: 0,word,count
0,barack,1
1,hussein,1
2,obama,9
3,ii,1
4,brk,1


In [12]:
# obama_word_count_table.sort('count',ascending=False)

# Sort DataFrame by `count` column
obama_word_count_table.sort_values(['count'], ascending=False)

Unnamed: 0,word,count
12,the,40
26,in,30
14,and,21
17,of,18
23,to,14
...,...,...
125,laureateduring,1
126,two,1
127,years,1
129,into,1


The most common words include uninformative ones such as **the**, **in**, and **and**.
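As a side note, the same kind of count can be produced with the standard library's `collections.Counter`, whose `most_common` method returns words ordered by frequency. The following is a minimal sketch on a made-up sentence (the sample text and variable names are assumptions, not taken from the dataset):

```python
from collections import Counter

# Hypothetical sample text, already lower-cased and free of punctuation
sample_text = "the president signed the act in the white house"

# Counter builds the same word -> count mapping as `word_count`
sample_counts = Counter(sample_text.split())

# `most_common` orders by descending count, like `sort_values(..., ascending=False)`
top_words = sample_counts.most_common(3)
print(top_words)
```

Even in this tiny example the top entry is **the**, which says nothing about what the sentence is about.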

# Compute TF-IDF For The Corpus

To give more weight to informative words, we weigh them by their TF-IDF scores.
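To make the idea concrete, here is a minimal sketch of one common textbook variant, where the weight of a word is its raw count multiplied by `log(N / df)` (`N` is the number of documents, `df` the number of documents containing the word). The corpus and the helper name `tf_idf` are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical three-document corpus (not from the Wikipedia data)
docs = [
    "the president signed the act",
    "the singer released the album",
    "the president met the singer",
]

# Term frequency: raw word counts per document
term_counts = [Counter(doc.split()) for doc in docs]

# Document frequency: number of documents containing each word
doc_freq = Counter(word for counts in term_counts for word in counts)


def tf_idf(counts, doc_freq, n_docs):
    """Weigh each count by log(n_docs / document frequency)."""
    return {word: count * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()}


tfidf_docs = [tf_idf(counts, doc_freq, len(docs)) for counts in term_counts]

# "the" appears in every document, so its weight is 0 in this variant
print(tfidf_docs[0])
```

In this variant a word that occurs in every document is weighted down to zero, while words that are rare across the corpus keep a high weight.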

In [13]:
# people['word_count'] = graphlab.text_analytics.count_words(people['text'])

people['word_count'] = people['text_clean'].apply(word_count)

people.head()

Unnamed: 0,URI,name,text,text_clean,word_count
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...,digby morrell born 10 october 1979 is a former...,"{'digby': 1, 'morrell': 5, 'born': 1, '10': 1,..."
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...,alfred j lewy aka sandy lewy graduated from un...,"{'alfred': 1, 'j': 1, 'lewy': 3, 'aka': 1, 'sa..."
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...,harpdog brown is a singer and harmonica player...,"{'harpdog': 2, 'brown': 2, 'is': 7, 'a': 7, 's..."
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...,franz rottensteiner born in waidmannsfeld lowe...,"{'franz': 1, 'rottensteiner': 3, 'born': 1, 'i..."
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...,henry krvits born 30 december 1974 in tallinn ...,"{'henry': 1, 'krvits': 1, 'born': 1, '30': 1, ..."


In [55]:
######################
# Single Record Test #
######################

# Import scikit learn text feature extraction
# TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create an instance of the tf-idf vectorizer
tfidf_vectorizer = TfidfVectorizer()
# Learn vocabulary and idf, return document-term matrix
# The `fit_transform` function takes `raw_documents` of type `iterable` such as a list of string or column of a DataFrame
term_matrix = tfidf_vectorizer.fit_transform(obama['text_clean'])

# Debug
# print("Term Matrix:\n{}".format(term_matrix))

# Get a `feature` to `term` map from `tfidf_vectorizer` attribute of `vocabulary_`
# The attribute `vocabulary_` returns a mapping of `term` to `feature` indices
# Reverse the dictionary `key` and `value`
feature_term_map = {item[1]:item[0] for item in tfidf_vectorizer.vocabulary_.items()}

# Debug
# print("Feature To Term Map:\n{}".format(list(feature_term_map.items())[:20]))

# Create a list to store the tf-idf
tfidf_list = []
# Loop through each row in `term_matrix`
for row in term_matrix:
    # Build tf-idf for each document, by mapping `term_matrix` value to `feature_term_map` value
    # Unpack `row.indices` and `row.data` into `(column, value)` tuple
    # Map `column` to `feature_term_map` as key, and `value` for the `tfidf_list`
    # Note: return only word(s) of non zero value for a specific document
    tfidf_list.append({feature_term_map[column]:value for (column, value) in zip(row.indices, row.data)})
    
# Debug
# print("TF-IDF:\n{}".format(tfidf_list))

In [14]:
# tfidf = graphlab.text_analytics.tf_idf(people['word_count'])

# Earlier versions of GraphLab Create returned an SFrame rather than a single SArray
# This notebook was created using Graphlab Create version 1.7.1
# if graphlab.version <= '1.6.1':
#     tfidf = tfidf['docs']

# tfidf

# Import scikit learn text feature extraction
# TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create an instance of the tf-idf vectorizer
tfidf_vectorizer = TfidfVectorizer()
# Learn vocabulary and idf, return document-term matrix
# The `fit_transform` function takes `raw_documents` of type `iterable` such as a list of string or column of a DataFrame
term_matrix = tfidf_vectorizer.fit_transform(people['text_clean'])

# Get a `feature` to `term` map from `tfidf_vectorizer` attribute of `vocabulary_`
# The attribute `vocabulary_` returns a mapping of `term` to `feature` indices
# Reverse the dictionary `key` and `value`
feature_term_map = {item[1]:item[0] for item in tfidf_vectorizer.vocabulary_.items()}

# print("Feature To Term Map:\n{}".format(list(feature_term_map.items())[:20]))

# Create a list to store the tf-idf
tfidf_list = []
# Loop through each row in `term_matrix`
for row in term_matrix:
    # Build tf-idf for each document, by mapping `term_matrix` value to `feature_term_map` value
    # Unpack `row.indices` and `row.data` into `(column, value)` tuple
    # Map `column` to `feature_term_map` as key, and `value` for the `tfidf_list`
    # Note: return only word(s) of non zero value for a specific document
    tfidf_list.append({feature_term_map[column]:value for (column, value) in zip(row.indices, row.data)})

# print("TF-IDF:\n{}".format(tfidf_list[0]))

In [31]:
################
# tfidf Sample #
################

# Import scikit learn text feature extraction
# TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
# Import `NearestNeighbors`, an unsupervised learner for nearest-neighbor searches
from sklearn.neighbors import NearestNeighbors

# Create an instance of the tf-idf vectorizer
tfidf_vectorizer = TfidfVectorizer()
# Learn vocabulary and idf, return document-term matrix
# The `fit_transform` function takes `raw_documents` of type `iterable` such as a list of string or column of a DataFrame
term_matrix = tfidf_vectorizer.fit_transform(people['text'])

# Create a k-nearest neighbor model with the `cosine` metric and `brute` algorithm
knn_neighbor_cosine = NearestNeighbors(metric='cosine', algorithm='brute')
# Fit the k-nearest neighbor model
# Index the rows of `term_matrix` so that neighbors can be queried
knn_neighbor_cosine.fit(term_matrix)

# Get the `index` number of Barack Obama
obama_index = people[people['name'] == 'Barack Obama'].index[0]

# Find the 10 nearest neighbors of Barack Obama
# This is based on the tf-idf vectorizer
cosine, indice = knn_neighbor_cosine.kneighbors(term_matrix[obama_index], n_neighbors=10)

cosine_neighbor = pd.DataFrame({'cosine': cosine.flatten(), 'id': indice.flatten()})
obama_nearest_df = (people.merge(cosine_neighbor, right_on = 'id', left_index = True).sort_values('cosine')[['id', 'name', 'cosine']])
obama_nearest_df

Unnamed: 0,id,name,cosine
0,35817,Barack Obama,0.0
1,24478,Joe Biden,0.570781
2,57108,Hillary Rodham Clinton,0.615934
3,38376,Samantha Power,0.624993
4,38714,Eric Stern (politician),0.649765
5,28447,George W. Bush,0.658687
6,39357,John McCain,0.661681
7,48693,Artur Davis,0.666942
8,18827,Henry Waxman,0.670205
9,46811,Jeff Sessions,0.672427


In [15]:
people['tfidf'] = tfidf_list
people.head()

Unnamed: 0,URI,name,text,text_clean,word_count,tfidf
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...,digby morrell born 10 october 1979 is a former...,"{'digby': 1, 'morrell': 5, 'born': 1, '10': 1,...","{'melbourne': 0.04943637764477019, 'college': ..."
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...,alfred j lewy aka sandy lewy graduated from un...,"{'alfred': 1, 'j': 1, 'lewy': 3, 'aka': 1, 'sa...","{'every': 0.038317651977468975, 'capsule': 0.0..."
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...,harpdog brown is a singer and harmonica player...,"{'harpdog': 2, 'brown': 2, 'is': 7, 'a': 7, 's...","{'society': 0.039919477247877484, 'hamilton': ..."
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...,franz rottensteiner born in waidmannsfeld lowe...,"{'franz': 1, 'rottensteiner': 3, 'born': 1, 'i...","{'kurdlawitzpreis': 0.08928283865693797, 'spec..."
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...,henry krvits born 30 december 1974 in tallinn ...,"{'henry': 1, 'krvits': 1, 'born': 1, '30': 1, ...","{'curtis': 0.05258270131463223, 'promo': 0.067..."


## Examine The TF-IDF For The Obama Article

In [16]:
obama_tfidf = people[people['name'] == 'Barack Obama']

In [17]:
# obama[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)

# Create a DataFrame for the `obama_tfidf`
# Convert the `tfidf` column of `obama_tfidf` to a list of tuple
# Construct the DataFrame with the `word` and `tfidf` column
obama_tfidf_table = pd.DataFrame(list(obama_tfidf['tfidf'].values[0].items()), columns=['word', 'tfidf'])
# Sort DataFrame by `tfidf` column
obama_tfidf_table.sort_values(['tfidf'], ascending=False)

Unnamed: 0,word,tfidf
98,obama,0.365016
266,the,0.279322
92,act,0.249088
264,in,0.209672
114,iraq,0.151808
...,...,...
268,is,0.014350
216,new,0.013177
221,which,0.012341
232,that,0.011600


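Words with the highest TF-IDF, such as **obama**, **act**, and **iraq**, are much more informative than the raw counts suggested. Common words like **the** and **in** still rank high here because scikit-learn's default IDF never drops to zero, so a large term frequency can outweigh a small IDF.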

# Manually Compute Distances Between Few People

Let's manually compare the distances between the articles for a few famous people.

In [47]:
# clinton = people[people['name'] == 'Bill Clinton']

In [48]:
# beckham = people[people['name'] == 'David Beckham']

## Obama Closer To Clinton Than To Beckham?

We will use cosine distance, which is given by

`(1-cosine_similarity)`

and find that the article about president Obama is closer to the one about former president Clinton than that of footballer David Beckham.
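The cells for this comparison are commented out, so as an illustration only, here is a minimal sketch of cosine distance for two sparse `{word: weight}` dictionaries like the ones stored in the `tfidf` column. The function name `cosine_distance` and the weights are made up, and the sketch assumes both vectors are non-empty.

```python
import math


def cosine_distance(vec_a, vec_b):
    """Return 1 - cosine similarity for two sparse {word: weight} vectors."""
    # The dot product only needs the words present in both vectors
    dot = sum(weight * vec_b[word] for word, weight in vec_a.items() if word in vec_b)
    # Assumes neither vector is empty or all zeros (the norms must be non-zero)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical weights, made up for illustration
politician_a = {"president": 0.8, "senate": 0.6}
politician_b = {"president": 0.6, "senate": 0.8}
footballer = {"football": 1.0}

print(cosine_distance(politician_a, politician_b))
print(cosine_distance(politician_a, footballer))
```

Vectors that share heavily weighted words end up with a distance near 0, while vectors with no words in common have a distance of 1.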

In [None]:
# graphlab.distances.cosine(obama['tfidf'][0],clinton['tfidf'][0])

In [None]:
# graphlab.distances.cosine(obama['tfidf'][0],beckham['tfidf'][0])

# Build Nearest Neighbor Model For Document Retrieval

We now create a nearest-neighbors model and apply it to document retrieval.
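Conceptually, a brute-force nearest-neighbors search computes the distance from the query document to every other document and keeps the smallest ones. The sketch below shows that idea in plain Python with Euclidean distance over word counts. The helper names and the toy corpus are assumptions for illustration, not the notebook's implementation.

```python
import math


def euclidean_distance(counts_a, counts_b):
    """Euclidean distance between two sparse {word: count} vectors."""
    words = set(counts_a) | set(counts_b)
    return math.sqrt(sum((counts_a.get(w, 0) - counts_b.get(w, 0)) ** 2 for w in words))


def nearest_neighbors(query_name, corpus, k=2):
    """Return the k entries of `corpus` closest to `query_name`, nearest first."""
    query = corpus[query_name]
    distances = [(name, euclidean_distance(query, counts)) for name, counts in corpus.items()]
    return sorted(distances, key=lambda pair: pair[1])[:k]


# Hypothetical word counts, made up for illustration
toy_corpus = {
    "politician_a": {"president": 3, "senate": 2},
    "politician_b": {"president": 2, "senate": 2},
    "singer": {"album": 4, "tour": 1},
}

neighbors = nearest_neighbors("politician_a", toy_corpus, k=2)
print(neighbors)
```

As in the results further down, the query document is its own nearest neighbor at distance 0.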

In [20]:
# knn_model = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name')

# Import `NearestNeighbors`, an unsupervised learner for nearest-neighbor searches
from sklearn.neighbors import NearestNeighbors
# Import `CountVectorizer` for converting a collection of text documents to a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer

# Create an instance of the count vectorizer
count_vectorizer = CountVectorizer()
# Learn the vocabulary dictionary and return document term matrix for the count vectorizer
count_matrix = count_vectorizer.fit_transform(people['text_clean'])

# Create a k-nearest neighbor model with the `euclidean` metric
knn_model = NearestNeighbors(metric='euclidean')
# Fit the k-nearest neighbor model
# Index the rows of `count_matrix` so that neighbors can be queried
knn_model.fit(count_matrix)
# Get the `index` number of Barack Obama
obama_index = people[people['name'] == 'Barack Obama'].index[0]
# Find the 10 nearest neighbors of Barack Obama
# This is based on the count vectorizer
distance, indice = knn_model.kneighbors(count_matrix[obama_index], n_neighbors=10)

# Create a DataFrame for the `obama_neighbor`
obama_neighbor = pd.DataFrame({'distance': distance.flatten(), 'id': indice.flatten()})
# Merge `indice` and `distance` with `name` from the `people` DataFrame
k_nearest_df = (people.merge(obama_neighbor, right_on='id', left_index=True).sort_values('distance')[['id', 'name', 'distance']])
k_nearest_df

Unnamed: 0,id,name,distance
0,35817,Barack Obama,0.0
1,24478,Joe Biden,33.015148
2,28447,George W. Bush,34.307434
3,14754,Mitt Romney,35.79106
4,35357,Lawrence Summers,36.069378
5,31423,Walter Mondale,36.249138
6,13229,Francisco Barrio,36.276714
7,36364,Don Bonker,36.400549
8,22745,Wynn Normington Hugh-Jones,36.441734
9,7660,Refael (Rafi) Benvenisti,36.837481


# Applying The Nearest Neighbors Model For Retrieval

## Who Is Closest To Obama?

In [63]:
# knn_model.query(obama)

As we can see, president Obama's article is closest to the one about his vice-president Biden, and those of other politicians.

## Other Examples Of Document Retrieval

In [24]:
# swift = people[people['name'] == 'Taylor Swift']

# Get the `index` number of Taylor Swift
swift_index = people[people['name'] == 'Taylor Swift'].index[0]

# Find the 10 nearest neighbors of Taylor Swift
# This is based on the count vectorizer
swift_distance, swift_indice = knn_model.kneighbors(count_matrix[swift_index], n_neighbors=10)

# Create a DataFrame for the `swift_neighbor`
swift_neighbor = pd.DataFrame({'distance': swift_distance.flatten(), 'id': swift_indice.flatten()})

# Merge `indice` and `distance` with `name` from the `people` DataFrame
swift_nearest_df = (people.merge(swift_neighbor, right_on='id', left_index=True).sort_values('distance')[['id', 'name', 'distance']])

In [25]:
# knn_model.query(swift)
swift_nearest_df

Unnamed: 0,id,name,distance
0,54264,Taylor Swift,0.0
1,56915,Monica (singer),28.195744
2,7326,Shania Twain,28.319605
3,939,Inna,28.337255
4,35583,Kim Carnes,28.809721
5,45294,Miranda Lambert,28.84441
6,18396,Princess (singer),28.879058
7,24270,Fuego (producer),28.913665
8,16330,Ellie Goulding,28.94823
9,49309,Ariana Grande,29.034462


In [26]:
# jolie = people[people['name'] == 'Angelina Jolie']

# Get the `index` number of Angelina Jolie
jolie_index = people[people['name'] == 'Angelina Jolie'].index[0]

# Find the 10 nearest neighbors of Angelina Jolie
# This is based on the count vectorizer
jolie_distance, jolie_indice = knn_model.kneighbors(count_matrix[jolie_index], n_neighbors=10)

# Create a DataFrame for the `jolie_neighbor`
jolie_neighbor = pd.DataFrame({'distance': jolie_distance.flatten(), 'id': jolie_indice.flatten()})

# Merge `indice` and `distance` with `name` from the `people` DataFrame
jolie_nearest_df = (people.merge(jolie_neighbor, right_on='id', left_index=True).sort_values('distance')[['id', 'name', 'distance']])

In [27]:
# knn_model.query(jolie)
jolie_nearest_df

Unnamed: 0,id,name,distance
0,39521,Angelina Jolie,0.0
1,54362,Konkona Sen Sharma,24.698178
2,44571,Candice Bergen,24.799194
4,44992,Julianne Moore,24.979992
3,52700,Toni Collette,24.979992
5,48343,Fran Drescher,25.119713
6,33106,Bette Midler,25.159491
7,50619,Jessica Chastain,25.475478
8,36512,Jennifer Connelly,25.514702
9,52886,Marisa Tomei,25.632011


In [None]:
arnold = people[people['name'] == 'Arnold Schwarzenegger']

In [None]:
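# `query` is the GraphLab API; sklearn's `NearestNeighbors` uses `kneighbors` on a row of `count_matrix`
knn_model.kneighbors(count_matrix[arnold.index[0]], n_neighbors=10)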

# Reference

* [Obtain tf-idf Word Weight With sklearn](https://stackoverflow.com/questions/45232671/obtain-tf-idf-weights-of-words-with-sklearn)
* [TfidfVectorizer: How Does Vectorizer With Fixed Vocabulary Deal With New Word](https://stackoverflow.com/questions/42068474/tfidfvectorizer-how-does-the-vectorizer-with-fixed-vocab-deal-with-new-words)
* [Clustering tf-idf](http://ethen8181.github.io/machine-learning/clustering/tfidf/tfidf.html)