# Document retrieval from Wikipedia data
The dataset used in this notebook can be found here: [people_wiki.csv](https://d396qusza40orc.cloudfront.net/phoenixassets/people_wiki.csv).

In [67]:
import pandas as pd
import numpy as np

In [70]:
import json
from collections import OrderedDict

In [None]:
#from nltk.corpus import stopwords

In [110]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import distance_metrics

In [4]:
people = pd.read_csv("./data/people_wiki.csv")

In [5]:
people.head()

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...


In [6]:
people.shape

(59071, 3)

## Explore Data Set

In [7]:
obama = people.loc[people.loc[:,"name"] == "Barack Obama"]

In [8]:
obama

Unnamed: 0,URI,name,text
35817,<http://dbpedia.org/resource/Barack_Obama>,Barack Obama,barack hussein obama ii brk husen bm born augu...


In [9]:
print(obama.loc[35817,"text"])

barack hussein obama ii brk husen bm born august 4 1961 is the 44th and current president of the united states and the first african american to hold the office born in honolulu hawaii obama is a graduate of columbia university and harvard law school where he served as president of the harvard law review he was a community organizer in chicago before earning his law degree he worked as a civil rights attorney and taught constitutional law at the university of chicago law school from 1992 to 2004 he served three terms representing the 13th district in the illinois senate from 1997 to 2004 running unsuccessfully for the united states house of representatives in 2000in 2004 obama received national attention during his campaign to represent illinois in the united states senate with his victory in the march democratic party primary his keynote address at the democratic national convention in july and his election to the senate in november he began his presidential campaign in 2007 and after

## Word counts for the Obama article
**No need to clean the data**: The exploration steps above suggest that the article text is already clean (lowercase, no punctuation), so no cleaning step is needed. The text can be used as is to build the bag of words.

**We are not using stop_words**: The course notebook does not remove stop words either, probably on purpose, to demonstrate the benefit of the TF-IDF technique.
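To make the point about stop words concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the formula that `TfidfVectorizer` uses with its default settings (raw term count, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalisation), and it tokenises on whitespace, which is a simplification of scikit-learn's tokenizer. The function `tfidf_sketch` and the toy corpus are hypothetical and only serve the illustration.

```python
import math
from collections import Counter


def tfidf_sketch(docs):
    """Return one {word: weight} dict per document.

    Weight = raw count * smoothed idf, then L2-normalised per document.
    """
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenised for word in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + d)) + 1 for w, d in df.items()}
    weights = []
    for tokens in tokenised:
        raw = {w: c * idf[w] for w, c in Counter(tokens).items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weights.append({w: v / norm for w, v in raw.items()})
    return weights


# Hypothetical toy corpus: "the" appears in every document.
toy_docs = ["the footballer", "the singer", "the president obama"]
toy_weights = tfidf_sketch(toy_docs)
print(toy_weights[2])
```

In the last toy document, "the" and "obama" each occur once, but "the" appears in every document, so its idf is the minimum and its weight ends up lower than that of "obama". This is why a frequent word can be down-weighted without an explicit stop-word list.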

We limit the vocabulary to 5000 words for now; we will see whether that is enough to match the course notebook.

In [10]:
#stopswrd = set(stopwords.words("english"))
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

In [11]:
obama_words_count = vectorizer.fit_transform(obama["text"]).toarray()

In [12]:
obama.loc[35817,"words_count"] = str(dict(zip(vectorizer.get_feature_names(),obama_words_count[0])))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
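The messages above are pandas' SettingWithCopyWarning: `obama` was selected from `people` with a boolean mask, so pandas cannot tell whether the new column should also appear in `people`. One common way to avoid the warning is to take an explicit copy of the selection before assigning to it. The sketch below assumes a small stand-in frame; `toy_people` and `obama_copy` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the people DataFrame (not the real data).
toy_people = pd.DataFrame(
    {
        "name": ["Barack Obama", "Digby Morrell"],
        "text": ["barack hussein obama ii", "digby morrell born 10 october 1979"],
    },
    index=[35817, 0],
)

# .copy() turns the selection into an independent DataFrame, so adding a
# column to it is unambiguous and leaves toy_people untouched.
obama_copy = toy_people.loc[toy_people["name"] == "Barack Obama"].copy()
obama_copy.loc[35817, "words_count"] = str({"obama": 1, "barack": 1})
print(obama_copy.columns.tolist())
```

The new column exists only on the copy, which matches what the notebook needs here: the word counts are attached to the Obama row for inspection, not to the full corpus.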


In [13]:
obama

Unnamed: 0,URI,name,text,words_count
35817,<http://dbpedia.org/resource/Barack_Obama>,Barack Obama,barack hussein obama ii brk husen bm born augu...,"{'unconstitutional': 1, 'prize': 1, 'won': 1, ..."


In [14]:
obama.loc[35817,"words_count"]

"{'unconstitutional': 1, 'prize': 1, 'won': 1, '2000in': 1, 'californias': 1, 'general': 1, 'school': 3, 'care': 1, 'illinois': 2, 'wall': 1, 'death': 1, 'withdrawal': 1, 'operation': 1, 'major': 1, 'defense': 1, 'from': 3, '2013': 1, 'resulted': 1, 'control': 4, 'address': 1, 'attorney': 1, 'convention': 1, 'bm': 1, 'sought': 1, 'into': 1, 'dont': 2, 'romney': 1, 'which': 1, 'legislation': 1, 'republican': 2, 'americans': 1, 'seats': 1, 'reauthorization': 1, 'osama': 1, 'worked': 1, 'that': 1, 'obama': 9, 'ordered': 3, 'consumer': 1, '20': 2, 'foreign': 2, 'first': 3, 'act': 8, 'on': 2, 'and': 21, 'during': 2, 'troop': 1, 'husen': 1, 'earning': 1, 'street': 1, 'review': 1, 'began': 1, 'arms': 1, 'spending': 1, 'afghanistan': 2, 'sworn': 1, 'military': 4, 'is': 2, 'representing': 1, '1961': 1, 'then': 1, 'for': 4, '2012': 1, 'over': 1, 'taxpayer': 1, 'reelected': 1, 'of': 18, 'march': 1, 'january': 3, 'great': 1, '63': 1, 'running': 1, 'signed': 3, 'he': 7, 'nations': 1, 'budget': 1, '

### Sorting the word counts for Obama

In [15]:
obama_words_count_table = pd.DataFrame({"count":obama_words_count[0],"words":vectorizer.get_feature_names()})

In [16]:
obama_words_count_table.sort_values("count",ascending=False).head(10)

Unnamed: 0,count,words
242,40,the
115,30,in
28,21,and
162,18,of
245,14,to
106,11,his
160,9,obama
18,8,act
104,7,he
30,6,as


## Compute tf-idf for people_wiki corpus

In [17]:
vectorizer_tfidf = TfidfVectorizer(analyzer="word",
                                   tokenizer=None,
                                   preprocessor=None,
                                   stop_words=None,
                                   max_features=5000)

In [18]:
%%time
tfidf = vectorizer_tfidf.fit_transform(people.loc[:,"text"])

Wall time: 22 s


### Create the "tfidf" column

In [20]:
tfidf_a = tfidf.toarray()

**to_dict** transforms one row of the `tfidf_a` array into a dictionary. Only items with a value > 0 are kept when building the dictionary.

The dictionary is encoded as JSON so that it can be stored in the dataframe.
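Converting the whole matrix to a dense array and then scanning all 5000 columns of every row is expensive. As a hedged alternative, the sketch below reads the non-zero entries straight from a SciPy CSR sparse matrix, which is the kind of object `fit_transform` returns. The helper `sparse_rows_to_json`, the toy matrix and the toy feature names are hypothetical; this is not the approach used in the notebook.

```python
import json

import numpy as np
from scipy.sparse import csr_matrix


def sparse_rows_to_json(matrix, feature_names):
    """Return one JSON string per row, holding only the positive entries."""
    matrix = csr_matrix(matrix)  # no-op if the input is already CSR
    names = np.asarray(feature_names)
    encoded = []
    for i in range(matrix.shape[0]):
        # indptr delimits the slice of indices/data that belongs to row i.
        start, end = matrix.indptr[i], matrix.indptr[i + 1]
        cols = matrix.indices[start:end]
        vals = matrix.data[start:end]
        row = {str(names[c]): float(v) for c, v in zip(cols, vals) if v > 0}
        encoded.append(json.dumps(row))
    return encoded


# Hypothetical toy data standing in for the tf-idf matrix and vocabulary.
toy_matrix = csr_matrix(np.array([[0.0, 0.5, 0.0], [0.25, 0.0, 0.75]]))
toy_names = ["act", "law", "obama"]
toy_json = sparse_rows_to_json(toy_matrix, toy_names)
print(toy_json)
```

Because only the stored entries of each row are visited, the work per row is proportional to the number of distinct vocabulary words in the article rather than to the vocabulary size, and the dense array is never materialised.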

In [102]:
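def to_dict(tfidf_row):
    # Keep only the non-zero tf-idf weights and serialise them as JSON.
    # Note: scikit-learn >= 1.2 only provides get_feature_names_out().
    dic = {k: v for k, v in zip(vectorizer_tfidf.get_feature_names(), tfidf_row) if v > 0}
    return json.dumps(dic)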

In [103]:
%%time
tfidf_df = pd.DataFrame(tfidf_a,people.index).apply(to_dict, axis=1)

Wall time: 14min 56s


In [105]:
people.insert(len(people.columns),"tfidf",tfidf_df)

In [106]:
people.head(5)

Unnamed: 0,URI,name,text,tfidf,tfidf2
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...,"{'clubs': 0.062723630482851067, '2006': 0.0355...","{""clubs"": 0.06272363048285107, ""2006"": 0.03556..."
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...,"{'moving': 0.072655298197632542, 'his': 0.0218...","{""moving"": 0.07265529819763254, ""his"": 0.02189..."
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...,"{'won': 0.034258178312352588, 'what': 0.109773...","{""won"": 0.03425817831235259, ""what"": 0.1097737..."
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...,"{'ones': 0.069140065025941375, 'die': 0.066240...","{""ones"": 0.06914006502594137, ""die"": 0.0662407..."
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'better': 0.068868121877240659, '2006': 0.039...","{""better"": 0.06886812187724066, ""2006"": 0.0390..."


In [107]:
obama = people.loc[people.loc[:,"name"] == "Barack Obama"]

In [108]:
obama

Unnamed: 0,URI,name,text,tfidf,tfidf2
35817,<http://dbpedia.org/resource/Barack_Obama>,Barack Obama,barack hussein obama ii brk husen bm born augu...,"{'won': 0.018164529003915339, 'general': 0.024...","{""won"": 0.01816452900391534, ""general"": 0.0245..."


In [109]:
OrderedDict(sorted(json.loads(obama.loc[35817,"tfidf"]).items(), key=lambda t: t[1], reverse=True))

OrderedDict([('obama', 0.3983862007203234),
             ('the', 0.30485743226669804),
             ('act', 0.27185987662017774),
             ('in', 0.22884055920050594),
             ('iraq', 0.16568635397679352),
             ('and', 0.16015314257916738),
             ('law', 0.15791425408479473),
             ('control', 0.14391109380018258),
             ('of', 0.13774201209013084),
             ('us', 0.13406301846956642),
             ('ordered', 0.1335671861081277),
             ('military', 0.13042155292939375),
             ('democratic', 0.12504933995276982),
             ('response', 0.12026041865196672),
             ('involvement', 0.12026041865196672),
             ('to', 0.11169769826290973),
             ('his', 0.10584085657839853),
             ('senate', 0.1003095173774816),
             ('term', 0.09387327670917242),
             ('campaign', 0.09203868780800897),
             ('nominee', 0.08708336390533508),
             ('afghanistan', 0.0869973355106571),
     