# Step 5. Advanced: tf-idf search engine
Now that you have implemented PageRank, which tells you how authoritative websites are, you are going to build the other very important part of a search engine. Namely, this part will retrieve the documents that are _relevant_ to a given query. Simply put, this part will give you a subset of webpages that are presumed to be relevant to the query and then PageRank can rerank them according to how much you value authoritativeness over relevance. The methods in this section are a basis of both _keyword search_, and also _document clustering_.  (We’ll see a lot more about clustering in a few weeks.)

## Step 5.1. Preliminaries

### Step 5.1.1. Load nltk and some helpful tools

The cell below gives you some tools and assigns some variables that you will need later. You do not need to modify this cell provided that you have everything installed already.

In [1]:
# Uncomment the next line if you have not installed nltk already
#! pip install nltk
import nltk

# Uncomment the next line if you would like to have nltk_data accessible on your computer
nltk.data.path = ['/home/jovyan/work/nltk_data']
# Remember to check whether the config was successfully changed if you do it this way

# Uncomment the next line if you still need to download punkt
# nltk.download()
# Enter 'd' for Download, then 'punkt', and then 'q' for quit

from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

import numpy as np
import re

"""
# Returns True if the input (string) parameter has
# any sort of letter in it, else returns False.
"""
def has_letter(x):
    return re.match('.*[a-zA-Z].*',x) != None

# Stopwords are words we will ignore for search
# purposes, because they are too common to be useful
stopwords = set()

stop_file = open('stopwords.txt')
for line in stop_file:
    stopwords.add(line.strip())

# The NLTK parser breaks apostrophe-s into a separate "word"
# so we'll want to add it to the list... Though it's technically
# not a stop word in the traditional sense.
stopwords.add("'s")

# Use this as the maximum number of words we will index
MAX_WORDS = 10000

# Create the word stemmer
stemmer = PorterStemmer()

### Step 5.1.2. Import text files, clean them, and store in a dictionary 

As a "corpus" we fetched some data from Wikipedia, based on currently
trendy topics.  Each topic had multiple interpretations, some of which 
we suspected would "intersect" in interesting ways (e.g., Trump/Putin, Cloud/Google, 
Cloud/Climate).  Others had various interpretations (e.g., there are many types of 
Football).  See _Wikipedia.ipynb_ for the original download code.

Selected topics (for which the top-10 matches were returned by Wikipedia) were:

 * Pennsylvania
 * Trump
 * Apple
 * Google
 * Farm
 * Climate
 * Cloud
 * Football
 * Government
 * Putin
 
Please write the function `clean_article(article)` that takes a string as input, tokenizes it using `nltk.word_tokenize` (this one is not for Twitter), converts it to lower case, removed stopwords that are present in the set from the cell above, uses `stemmer` (defined above) to cut the word down to its stem, uses `has_letter` (defined above) to remove words that don't have any letters, and only keeps words with length greater than 1.

In [2]:
# TODO: Write your clean_article function here

# YOUR CODE HERE
nltk.download('punkt')
def clean_article(article):
    # tokenize
    article = nltk.word_tokenize(article)
    # convert to lower case
    article_new = article
    for i in range(0,len(article)):
        article_new[i] = article[i].lower()
    # remove stopwords
    article_new2 = [w for w in article_new if not w in stopwords]
    article_new2 = [stemmer.stem(word) for word in article_new2]
    for i in range(0,len(article_new2)):
        # remove words without letters
        if has_letter(article_new2[i]) == False:
            article_new2[i] = ''
        # keep words of length > 1
        if len(article_new2)<=1:
            article_new2[i] = ''
    
    article_new2 = list(filter(None, article_new2))
    
    return article_new2

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
import os

docs = {}

for filename in os.listdir('text'):
    file = open('text/' + filename)
    docs[filename] = clean_article(file.read())
    #print ('Loaded',filename)

print ("All files loaded")


All files loaded


In [4]:
print(' '.join(docs['Apple Inc..txt'][:50]))


appl american multin technolog compani headquart cupertino california suburb san jose design develop sell consum electron comput softwar onlin servic hardwar product includ iphon smartphon ipad tablet comput mac person comput ipod portabl media player appl watch smartwatch appl tv digit media player appl consum softwar includ maco io oper


## Step 5.2. Generate the Vocabulary

As discussed in class, natural language words follow a **Zipfian distribution**, which means that many, sometimes over half, of the unqiue words in a collection of documents only occur once. Limiting the **vocabulary**, or the list of words that are considered for a data science task, often improves both the efficiency and accuracy of the system. In Part 4, we did this manually. Now, we are going to limit the vocabulary of our search engine automatically such that it only considers the 10,000 most frequent words in the document collection. In the next cell, you should:

1. Create a single list containing the words of each (cleaned) document in succession.
2. Use `nltk.FreqDist` to count occurrences of each word
3. Make a list called `wordids` that contains the words in decreasing frequency order. For example, the most frequent word in this document collection is "appl", so `wordids[0] = 'appl'`.
3. Make a dictionary called `lexicon` that has words as keys and their indices in `wordids` as values.

In [5]:
# TODO: Create your vocabulary here. We will test the list named wordids and the dictionary named lexicon

# YOUR CODE HERE
# create singe list
full_list = []
keys, values = zip(*docs.items())
for i in range(0,len(values)):
    full_list = full_list + values[i]
    
# count occurrences of each word
fdist = nltk.FreqDist(full_list)

# make list of words in decreasing frequency order
fdist = fdist.most_common(len(fdist))
wordids = fdist
for i in range(0,len(fdist)):
    wordids[i] = fdist[i][0]

# make a dictionary
ids = list(range(0,len(wordids)))
lexicon = dict(zip(wordids,ids))

In [6]:
print(wordids[:10])


['appl', 'trump', 'state', 'footbal', 'cloud', 'googl', 'also', 'use', 'govern', 'first']


In [7]:
assert(lexicon['appl'] == 0)


## Step 5.3. Term Frequencies

Similar to Part 4, we are going to create a document-term matrix, but it is going to be a lot bigger now. Write a function `doc_vector(content, lexicon)` that given a cleaned article (as a list of words) and the lexicon, creates a vector of term frequencies. We will define **term frequency** as the number of times the word occurs in the document divided by the number of times the most frequent word in the document occurs. For example, in the document `['a', 'b', 'b']`, 'a' has a term frequency of 0.5 and 'b' has a term frequency of 1.0.

For greatest portability, please use the length of the vocabulary (`lexicon`) to determine the length of the output vector. Do not hard code the number 10,000.

In [8]:
# TODO: Write your doc_vector function here

# YOUR CODE HERE
def doc_vector(content, lexicon):
    keys, values = zip(*lexicon.items())
    nt = list(range(0,len(lexicon)))
    # for each word in the lexicon, count how many times it appears in content
    for i in range(0,len(lexicon)):
        word = keys[i]
        nt[i] = content.count(word)
    # calculate the term frequency depending on the max value
    mostFreq = max(nt)
    tf = list(map(lambda x: (x/mostFreq), nt))

    return tf

In [9]:
vectors = []
doclist = []

for topic in docs:
    doclist.append(topic)
    vectors.append(doc_vector(docs[topic], lexicon))

vectors = np.array(vectors)


In [23]:
lexicon

{'appl': 0,
 'trump': 1,
 'state': 2,
 'footbal': 3,
 'cloud': 4,
 'googl': 5,
 'also': 6,
 'use': 7,
 'govern': 8,
 'first': 9,
 'pennsylvania': 10,
 'new': 11,
 'putin': 12,
 'can': 13,
 'game': 14,
 'includ': 15,
 'play': 16,
 'climat': 17,
 'one': 18,
 'farm': 19,
 'year': 20,
 'unit': 21,
 'team': 22,
 'time': 23,
 'servic': 24,
 'may': 25,
 'leagu': 26,
 'nation': 27,
 'compani': 28,
 'two': 29,
 'gener': 30,
 'system': 31,
 'russia': 32,
 'develop': 33,
 'rule': 34,
 'player': 35,
 'form': 36,
 'user': 37,
 'russian': 38,
 'comput': 39,
 'million': 40,
 'citi': 41,
 'public': 42,
 'univers': 43,
 'chang': 44,
 'world': 45,
 'power': 46,
 'american': 47,
 'mani': 48,
 'call': 49,
 'club': 50,
 'search': 51,
 'presid': 52,
 'number': 53,
 'associ': 54,
 'provid': 55,
 'follow': 56,
 'area': 57,
 'will': 58,
 'countri': 59,
 'part': 60,
 'allow': 61,
 'ball': 62,
 'local': 63,
 'name': 64,
 'elect': 65,
 'product': 66,
 'work': 67,
 'organ': 68,
 'support': 69,
 'ii': 70,
 'book': 

In [10]:
# This should be document 0
for id in range(0, len(lexicon)):
    if vectors[0, id] > 0:
        print(wordids[id] + ' x ' + str(vectors[0, id]))


state x 0.0522388059701
googl x 1.0
also x 0.0746268656716
use x 0.134328358209
first x 0.0298507462687
new x 0.044776119403
can x 0.10447761194
includ x 0.0597014925373
one x 0.0597014925373
year x 0.00746268656716
time x 0.0298507462687
servic x 0.119402985075
may x 0.0522388059701
compani x 0.00746268656716
two x 0.0223880597015
system x 0.0298507462687
develop x 0.0223880597015
player x 0.0149253731343
user x 0.283582089552
comput x 0.0149253731343
univers x 0.00746268656716
mani x 0.0298507462687
call x 0.0671641791045
search x 0.00746268656716
number x 0.00746268656716
provid x 0.0597014925373
area x 0.00746268656716
will x 0.0671641791045
countri x 0.00746268656716
part x 0.0149253731343
allow x 0.0820895522388
name x 0.0223880597015
product x 0.0149253731343
work x 0.0820895522388
support x 0.141791044776
featur x 0.0522388059701
howev x 0.0373134328358
intern x 0.00746268656716
result x 0.00746268656716
major x 0.00746268656716
store x 0.00746268656716
york x 0.00746268656716


## Step 5.4. Inverse Document Frequencies

We would like to give **rare words** a higher weight than ones found in all documents.  To measure a word’s rareness, we will develop a metric called **inverse document frequency (idf)**.  In its simplest form, a word’s idf is a ratio between the total number of documents, and how many documents include a word.  (Note that the idf is independent of a given document -- it is a measure of the word’s popularity across the full set of documents, sometimes called a **corpus**.)  Typically, we don’t directly use the ratio, however -- instead we use its base-10 logarithm.  Thus, we define:

$$idf(w) = \log(\frac{\text{total number of docs}}{\text{number of docs containing }w})$$

Now compute a single vector `idf` representing, for each word, its idf within the corpus.

In [11]:
idf = []
# TODO: Populate the idf list with idf values across the vocabulary

# YOUR CODE HERE
import math
idf = list(range(0,len(lexicon)))
nDocs = len(docs)
for i in range(0,len(lexicon)):
    nDocsW = sum(vectors[:,i]!=0)
    idf[i] = math.log10(nDocs/nDocsW)

In [29]:
# len(idf) # 18,075
# MAX_WORDS
len(lexicon)


18075

In [12]:
for i in range(0, len(wordids)):
    print(wordids[i], idf[i])


appl 0.6901960800285137
trump 0.91204482964487
state 0.06694678963061322
footbal 0.760777154314221
cloud 0.6901960800285137
googl 0.7359535705891889
also 0.018098222092796223
use 0.09913147300201444
govern 0.3010299956639812
first 0.13389357926122641
pennsylvania 0.6690067809585756
new 0.11041248341170351
putin 0.91204482964487
can 0.13996772697341955
game 0.5762527277216769
includ 0.051706823073876314
play 0.3284682440109208
climat 0.5762527277216769
one 0.05672762444892715
farm 0.6690067809585756
year 0.13996772697341955
unit 0.11041248341170351
team 0.49986438185822213
time 0.13996772697341955
servic 0.3191282177567774
may 0.07741222330877814
leagu 0.6488033948702886
nation 0.2277980821295576
compani 0.3284682440109208
two 0.1850461017086077
gener 0.12790321557203896
system 0.21307482530885122
russia 0.7359535705891889
develop 0.17831271904963927
rule 0.40016146866599567
player 0.6690067809585756
form 0.1716821401506262
user 0.6901960800285137
russian 0.7359535705891889
comput 0.611

radiat 0.8772827233856582
condens 1.2922560713564761
decid 0.629498239674902
express 0.6488033948702886
chat 1.2922560713564761
voic 0.7871060930365701
window 0.7871060930365701
os 1.146128035678238
hybrid 0.9912260756924949
technic 0.5288280777935388
latin 0.6901960800285137
written 0.5762527277216769
affair 0.6488033948702886
altern 0.5762527277216769
suppli 0.6690067809585756
employ 0.6901960800285137
clone 1.2130748253088512
particularli 0.5440680443502757
facil 0.6690067809585756
son 0.8772827233856582
greater 0.5141048209728324
corp 1.146128035678238
loss 0.629498239674902
branch 0.7359535705891889
youth 0.8151348166368136
injuri 0.9498333905342697
men 0.6901960800285137
defeat 0.7871060930365701
transfer 0.6690067809585756
cultivar 1.5141048209728325
previous 0.5440680443502757
phone 0.8151348166368136
attribut 0.760777154314221
economi 0.7871060930365701
unlik 0.5598623115335075
smaller 0.5762527277216769
accept 0.5141048209728324
earlier 0.5440680443502757
director 0.629498239

loan 0.9912260756924949
nimbostratu 1.6901960800285136
alphabet 1.2922560713564761
ivana 1.2922560713564761
daughter 0.9912260756924949
defam 1.3891660843645326
tribe 1.146128035678238
mutual 1.2922560713564761
ontario 1.146128035678238
saa 1.6901960800285136
vanderbilt 1.9912260756924949
subtyp 1.5141048209728325
freight 1.6901960800285136
firebal 1.9912260756924949
genera 1.9912260756924949
otherwis 0.8450980400142568
similarli 0.7871060930365701
flash 1.146128035678238
specifi 0.91204482964487
francisco 0.9912260756924949
communist 1.2130748253088512
liber 0.91204482964487
coalit 0.9498333905342697
illustr 0.8772827233856582
prefix 1.2130748253088512
heritag 0.9912260756924949
learn 0.8151348166368136
paul 0.91204482964487
love 0.8772827233856582
pixel 1.5141048209728325
resolut 0.9912260756924949
equival 0.8151348166368136
cancel 1.03698356625317
termin 0.9912260756924949
prefer 0.8151348166368136
threaten 0.91204482964487
competitor 0.9912260756924949
manual 0.9498333905342697
cam

realiz 0.9498333905342697
emit 1.146128035678238
imposs 0.9912260756924949
k 1.2130748253088512
exploit 1.03698356625317
briefli 1.0881360887005513
automobil 1.3891660843645326
simplifi 0.9912260756924949
entitl 1.146128035678238
spell 1.146128035678238
silver 1.146128035678238
taiwan 1.2922560713564761
chose 0.9912260756924949
lisa 1.5141048209728325
hacker 1.2130748253088512
reloc 1.2130748253088512
warren 1.2922560713564761
sr. 1.2130748253088512
prison 1.146128035678238
yahoo 1.3891660843645326
feedback 1.2130748253088512
.avi 1.9912260756924949
valuat 1.2922560713564761
invent 1.2922560713564761
hail 1.146128035678238
so-cal 1.146128035678238
counter 1.03698356625317
jonathan 1.0881360887005513
pair 1.0881360887005513
jointli 0.9912260756924949
tokyo 1.146128035678238
tenur 0.9912260756924949
bluetooth 1.3891660843645326
outlet 0.9498333905342697
ca 1.03698356625317
recept 1.146128035678238
mp 1.5141048209728325
hdmi 1.6901960800285136
regulatori 1.2130748253088512
depict 0.991226

enrol 1.2922560713564761
disclos 1.3891660843645326
wrestlemania 1.9912260756924949
ross 1.3891660843645326
wynn 1.6901960800285136
chef 1.2922560713564761
geoffrey 1.146128035678238
counsel 1.2922560713564761
nbcunivers 1.6901960800285136
christoph 1.146128035678238
buchanan 1.3891660843645326
gambl 1.3891660843645326
disagr 1.2130748253088512
shuttl 1.146128035678238
airlin 1.3891660843645326
turkish 1.2922560713564761
sponsorship 1.3891660843645326
afterward 1.2130748253088512
sexton 1.6901960800285136
forg 1.2130748253088512
billi 1.2922560713564761
assault 1.5141048209728325
perri 1.3891660843645326
confront 1.2130748253088512
gaddafi 1.5141048209728325
ny 1.6901960800285136
e 1.2130748253088512
censorship 1.2922560713564761
chair 1.2922560713564761
sec 1.5141048209728325
statut 1.2922560713564761
peasant 1.5141048209728325
nineteen 1.6901960800285136
malta 1.6901960800285136
unemploy 1.2922560713564761
barcelona 1.2922560713564761
theater 1.2130748253088512
montreal 1.51410482097

interchang 1.5141048209728325
samoa 1.6901960800285136
leigh 1.6901960800285136
ilya 1.6901960800285136
marda 1.9912260756924949
ono 1.9912260756924949
tatar 1.9912260756924949
mid-level 1.9912260756924949
long-wav 1.9912260756924949
udaltsov 1.9912260756924949
shrovetid 1.9912260756924949
hurl 1.9912260756924949
calcio 1.9912260756924949
nottingham 1.6901960800285136
rfu 1.9912260756924949
huddersfield 1.6901960800285136
blackpool 1.9912260756924949
inter-c 1.9912260756924949
revi 1.9912260756924949
silverwar 1.9912260756924949
runners-up 1.9912260756924949
doncast 1.9912260756924949
standalon 1.5141048209728325
fring 1.3891660843645326
bold 1.6901960800285136
broaden 1.5141048209728325
peer 1.3891660843645326
deprec 1.3891660843645326
stand-alon 1.3891660843645326
grace 1.3891660843645326
sociolog 1.5141048209728325
holi 1.5141048209728325
socio-econom 1.5141048209728325
distort 1.3891660843645326
philosoph 1.3891660843645326
aristocraci 1.6901960800285136
tyranni 1.9912260756924949


ffdshow 1.9912260756924949
.gvp 1.9912260756924949
gvi 1.9912260756924949
avi 1.9912260756924949
mp3 1.5141048209728325
drm 1.6901960800285136
paramet 1.5141048209728325
directshow 1.9912260756924949
companion 1.6901960800285136
tremend 1.5141048209728325
smartwatch 1.6901960800285136
interbrand 1.6901960800285136
exponenti 1.5141048209728325
xerox 1.9912260756924949
infight 1.5141048209728325
instantli 1.5141048209728325
millionair 1.6901960800285136
follow-up 1.5141048209728325
intuit 1.5141048209728325
oust 1.6901960800285136
monopoli 1.5141048209728325
high-end 1.6901960800285136
ill-fat 1.5141048209728325
rework 1.5141048209728325
reminisc 1.5141048209728325
imovi 1.6901960800285136
astart 1.9912260756924949
g3 1.6901960800285136
antenna 1.9912260756924949
multi-touch 1.9912260756924949
facetim 1.9912260756924949
shuffl 1.6901960800285136
indefinit 1.5141048209728325
mobilem 1.6901960800285136
levinson 1.6901960800285136
reinvent 1.6901960800285136
9.7-inch 1.9912260756924949
pre-

surki 1.9912260756924949
business-ori 1.6901960800285136
wendel 1.6901960800285136
sander 1.6901960800285136
rfi 1.9912260756924949
pohlman 1.9912260756924949
semant 1.5141048209728325
string 1.9912260756924949
intext 1.9912260756924949
intitl 1.9912260756924949
inurl 1.9912260756924949
www.google.com 1.9912260756924949
malwar 1.9912260756924949
rayleigh–taylor 1.9912260756924949
crossroad 1.9912260756924949
cauliflow 1.6901960800285136
hue 1.6901960800285136
bikini 1.6901960800285136
kiloton 1.9912260756924949
plasma 1.9912260756924949
molten 1.9912260756924949
low-altitud 1.9912260756924949
solidifi 1.9912260756924949
solidif 1.9912260756924949
chlorin 1.6901960800285136
aluminium-28 1.9912260756924949
manganese-56 1.9912260756924949
iron-59 1.9912260756924949
cobalt-60 1.9912260756924949
glow 1.9912260756924949
shock 1.6901960800285136
skirt 1.6901960800285136
fungal 1.9912260756924949
bacteri 1.9912260756924949
'golden 1.9912260756924949
pagan 1.9912260756924949
iðunn 1.99122607569

atmosphere-ocean 1.6901960800285136
isthmu 1.6901960800285136
carbonifer 1.9912260756924949
balloon 1.6901960800285136
mid-20th 1.6901960800285136
deduc 1.6901960800285136
oral 1.6901960800285136
aerial 1.6901960800285136
pliocen 1.9912260756924949
holocen 1.6901960800285136
habitat 1.6901960800285136
runoff 1.6901960800285136
dendroclimatolog 1.9912260756924949
collat 1.6901960800285136
reef 1.9912260756924949
terrac 1.9912260756924949
wg1 1.9912260756924949
solomon 1.6901960800285136
grigg 1.6901960800285136
'our 1.9912260756924949
historicalclimatology.com 1.6901960800285136
fairi 1.9912260756924949
subtitl 1.9912260756924949
lifetim 1.6901960800285136
républiqu 1.9912260756924949
intelligentsia 1.6901960800285136
gollancz 1.9912260756924949
english-languag 1.6901960800285136
20th-centuri 1.6901960800285136
retrospect 1.6901960800285136
recaptur 1.6901960800285136
terrifi 1.9912260756924949
pre-emin 1.9912260756924949
glorifi 1.6901960800285136
veterinari 1.6901960800285136
rescu 1.

papua 1.9912260756924949
sidney 1.6901960800285136
ballet 1.9912260756924949
cazali 1.9912260756924949
stewart 1.6901960800285136
coventri 1.6901960800285136
bartlett 1.6901960800285136
landlin 1.9912260756924949
konami 1.9912260756924949
emoji-lik 1.9912260756924949
laugh 1.9912260756924949
birdi 1.9912260756924949
goos 1.9912260756924949
l 1.9912260756924949
unleash 1.6901960800285136
/shydino 1.9912260756924949
dinosaur 1.9912260756924949
tsvetkov 1.9912260756924949
gaga 1.9912260756924949
jennif 1.6901960800285136
guru 1.6901960800285136
lang 1.9912260756924949
derek 1.9912260756924949
revoir 1.9912260756924949
lament 1.9912260756924949
'we 1.9912260756924949
stefan 1.9912260756924949
cynthia 1.9912260756924949
avant-gard 1.9912260756924949
stigwood 1.9912260756924949
typist 1.9912260756924949
raga 1.9912260756924949
brodax 1.9912260756924949
dun 1.6901960800285136
world/uk 1.9912260756924949
khan 1.9912260756924949
boogi 1.9912260756924949
bolan 1.9912260756924949
elton 1.69019608

toehold 1.9912260756924949
toughen 1.9912260756924949
sightseer 1.9912260756924949
holm 1.9912260756924949
carriag 1.9912260756924949
indistinguish 1.9912260756924949
non-amish 1.9912260756924949
alsac 1.9912260756924949
continuum 1.9912260756924949
'german 1.9912260756924949
misunderstood 1.9912260756924949
'dutch 1.9912260756924949
bisect 1.9912260756924949
s-curv 1.9912260756924949
diagon 1.9912260756924949
pre-twent 1.9912260756924949
amerindian 1.9912260756924949
tributaries—virtuallli 1.9912260756924949
cliff-lik 1.9912260756924949
escarp 1.9912260756924949
peirc 1.9912260756924949
underlain 1.9912260756924949
mississippian 1.9912260756924949
17th-19th 1.9912260756924949
escarpment/plateau 1.9912260756924949
world—th 1.9912260756924949
region—without 1.9912260756924949
metamorph 1.9912260756924949
hazelton 1.9912260756924949
livelihood 1.9912260756924949
midstat 1.9912260756924949
laurentid 1.9912260756924949
overtop 1.9912260756924949
furthest 1.9912260756924949
seaport 1.991226

smarr 1.9912260756924949
browser-sid 1.9912260756924949
ui 1.9912260756924949
good-look 1.9912260756924949
handler 1.9912260756924949
colossus/gf 1.9912260756924949
real-nam 1.9912260756924949
gender-neutr 1.9912260756924949
gender-specif 1.9912260756924949
pronoun 1.9912260756924949
inund 1.9912260756924949
off-top 1.9912260756924949
nymwar 1.9912260756924949
up-vot 1.9912260756924949
jaw 1.9912260756924949
karim 1.9912260756924949
anymor 1.9912260756924949
cesspool 1.9912260756924949
homophob 1.9912260756924949
internship 1.9912260756924949
shawn 1.9912260756924949
vaughn 1.9912260756924949
conan 1.9912260756924949
fallon 1.9912260756924949
sketch 1.9912260756924949
monologu 1.9912260756924949
spielberg 1.9912260756924949
pea 1.9912260756924949
tyra 1.9912260756924949
critiqu 1.9912260756924949
ux 1.9912260756924949
messina 1.9912260756924949
role—thi 1.9912260756924949
autorité 1.9912260756924949
indépendant 1.9912260756924949
aai 1.9912260756924949
hellen 1.9912260756924949
nodal 1

intrigu 1.9912260756924949
non-trust 1.9912260756924949
expressrout 1.9912260756924949
x86-64 1.9912260756924949
arm-bas 1.9912260756924949
system-on-chip 1.9912260756924949
server-class 1.9912260756924949
computing—thi 1.9912260756924949
sub-class 1.9912260756924949
boinc 1.9912260756924949
cloud—volunt 1.9912260756924949
business-model 1.9912260756924949
incentiv 1.9912260756924949
hybrid- 1.9912260756924949
multi-cloud 1.9912260756924949
queue 1.9912260756924949
multidisciplinari 1.9912260756924949
insecur 1.9912260756924949
failure—which 1.9912260756924949
outag 1.9912260756924949
schultz 1.9912260756924949
emagin 1.9912260756924949
achil 1.9912260756924949
attack—a 1.9912260756924949
hyperjack 1.9912260756924949
bitcoin 1.9912260756924949
btc 1.9912260756924949
silent 1.9912260756924949
schneier 1.9912260756924949
downsid 1.9912260756924949
cto 1.9912260756924949
9.6bn 1.9912260756924949
13.5b 1.9912260756924949
32.8b 1.9912260756924949
downtim 1.9912260756924949
jatind 1.99122607

deterr 1.9912260756924949
coercion 1.9912260756924949
mirv-equip 1.9912260756924949
super-heavi 1.9912260756924949
intercontinent 1.9912260756924949
sarmat 1.9912260756924949
ss-18 1.9912260756924949
hamper 1.9912260756924949
corrod 1.9912260756924949
balkan 1.9912260756924949
euroscept 1.9912260756924949
ataka 1.9912260756924949
jobbik 1.9912260756924949
opportunist 1.9912260756924949
far-left 1.9912260756924949
destabilis 1.9912260756924949
clout 1.9912260756924949
orbán 1.9912260756924949
contravent 1.9912260756924949
exclav 1.9912260756924949
vaguely-word 1.9912260756924949
nuclear-cap 1.9912260756924949
exclave′ 1.9912260756924949
montenegro 1.9912260756924949
đukanović 1.9912260756924949
anti-govern 1.9912260756924949
podgorica 1.9912260756924949
spice 1.9912260756924949
edibl 1.9912260756924949
non-ferr 1.9912260756924949
briquett 1.9912260756924949
g20 1.9912260756924949
nsg 1.9912260756924949
saarc 1.9912260756924949
ministerial-level 1.9912260756924949
inter-government 1.9912

steward 1.9912260756924949
weiser 1.9912260756924949
ironwork 1.9912260756924949
eckley 1.9912260756924949
ephrata 1.9912260756924949
mather 1.9912260756924949
landi 1.9912260756924949
pennsburi 1.9912260756924949
priestley 1.9912260756924949
internet-rel 1.9912260756924949
ph.d. 1.9912260756924949
supervot 1.9912260756924949
gmail/inbox 1.9912260756924949
allo/duo/hangout 1.9912260756924949
turn-by-turn 1.9912260756924949
outset 1.9912260756924949
robust 1.9912260756924949
historyedit 1.9912260756924949
theoriz 1.9912260756924949
google.stanford.edu 1.9912260756924949
z.stanford.edu 1.9912260756924949
wojcicki 1.9912260756924949
menlo 1.9912260756924949
2004edit 1.9912260756924949
bezo 1.9912260756924949
cheriton 1.9912260756924949
shriram 1.9912260756924949
ken 1.9912260756924949
auletta 1.9912260756924949
kleiner 1.9912260756924949
perkin 1.9912260756924949
caufield 1.9912260756924949
byer 1.9912260756924949
sequoia 1.9912260756924949
vinod 1.9912260756924949
khosla 1.99122607569249

trojan 1.9912260756924949
sacr 1.9912260756924949
epigram 1.9912260756924949
girlhood 1.9912260756924949
suitor 1.9912260756924949
outran 1.9912260756924949
melanion 1.9912260756924949
melon 1.9912260756924949
cun 1.9912260756924949
eden 1.9912260756924949
coax 1.9912260756924949
mālum 1.9912260756924949
mălum 1.9912260756924949
bonum 1.9912260756924949
immort 1.9912260756924949
larynx 1.9912260756924949
seduct 1.9912260756924949
vein 1.9912260756924949
malus/pyru 1.9912260756924949
'genet 1.9912260756924949
astring 1.9912260756924949
sweeter 1.9912260756924949
subacid 1.9912260756924949
subcontin 1.9912260756924949
oddli 1.9912260756924949
unviable—low 1.9912260756924949
'wrong 1.9912260756924949
'cox 1.9912260756924949
'egremont 1.9912260756924949
ordinarili 1.9912260756924949
asexu 1.9912260756924949
heterozygot 1.9912260756924949
dna 1.9912260756924949
meiosi 1.9912260756924949
aneuploid 1.9912260756924949
persia 1.9912260756924949
lyceum 1.9912260756924949
malling-seri 1.991226075

sakha 1.9912260756924949
magadan 1.9912260756924949
buryatia 1.9912260756924949
zabaykalski 1.9912260756924949
irkutsk 1.9912260756924949
tahe 1.9912260756924949
mohe 1.9912260756924949
heilongjiang 1.9912260756924949
hulunbuir 1.9912260756924949
gannan 1.9912260756924949
huangnan 1.9912260756924949
hainan 1.9912260756924949
guoluo 1.9912260756924949
garzê 1.9912260756924949
ngawa 1.9912260756924949
qamdo 1.9912260756924949
tibet 1.9912260756924949
siachen 1.9912260756924949
spiti 1.9912260756924949
well-defin 1.9912260756924949
equatorward 1.9912260756924949
less-sever 1.9912260756924949
short-summ 1.9912260756924949
southwestward 1.9912260756924949
sociopolit 1.9912260756924949
custodian 1.9912260756924949
obedi 1.9912260756924949
placehold 1.9912260756924949
simeon 1.9912260756924949
bekbulatovich 1.9912260756924949
meticul 1.9912260756924949
accent 1.9912260756924949
bolshev 1.9912260756924949
overdo 1.9912260756924949
bonner 1.9912260756924949
theorist 1.9912260756924949
zakhar 1.

In [13]:
tfidf = vectors[0] * idf
print(tfidf)


[ 0.          0.          0.00349722 ...,  0.          0.          0.        ]


## Step 5.5. Produce Ranked Results

As we described previously, standard search simply takes a keyword query and treats it as another document.  This how you would generate a query vector:

In [14]:
query = doc_vector(clean_article('Google Apple Google'), lexicon)

for id in range(0, len(lexicon)):
    if query[id] > 0:
        print("%10s"%wordids[id] + ' x ' + str(query[id]))

      appl x 0.5
     googl x 1.0


To compute the similarity between two documents, we use the cosine of the angle between their vectors, given by:

$$sim(d_1,d_2) = \frac{d_1 \cdot d_2}{||d_1||\;||d_2||}$$

where $||d_i||$ is the Frobenius ($L_2$) norm of the document vector. Write a function `search(vectors, doclist, idf, query, num_results)` that, when given a 2D array of document vectors, a list of document names, idf scores for each word, a query vector, and the desired number of results:

1. Creates a Pandas DataFrame with schema (docid, docname, score) for the results of matching the query against each document.
2. Scales the words in the document and query vectors by their idfs.
3. Computes the cosine similarity score for the query against each document. Hint: Uses Numpy multiplication (`*` between arrays), dot product (`@` or `np.dotproduct()`) and other operations (`+`, `np.linalg.norm()` for vector norms, etc.)
4. Adds each document ID and score to the DataFrame.
5. Sorts the DataFrame by descending score.
6. Returns the first `num_results` results.

In [15]:
# TODO: Implement the search function here

# YOUR CODE HERE
def search(vectors, doclist, idf, query, num_results):
    import pandas as pd
    import numpy as np
    # create a Pandas DF
    d = {'docid': [], 'docname': [], 'score': []}
    results = pd.DataFrame(data=d)
    
    # scale the words by their idfs
    scaled_vect = vectors * idf
    scaled_query = np.array(query) * idf
    
    # compute cosine similarity
    score = list(range(0,len(doclist)))
    for i in range(0,len(doclist)):
        vect = scaled_vect[i] # vector for this document
        score[i] = np.dot(vect,scaled_query) / (np.linalg.norm(vect) * np.linalg.norm(scaled_query))
        
    # add doc ID and score to DF
    ids = list(range(1,len(doclist)+1))
    d = {'docid': ids, 'docname': doclist, 'score': score}
    results = pd.DataFrame(data=d)
    
    # sort DF by descending score
    results = results.sort_values(by=['score'],ascending=False).reset_index(drop=True)
    
    # find first num_results results
    final = results[0:num_results]
    
    return final

In [16]:
result = search(vectors, doclist, idf, doc_vector(clean_article('apple computer steve jobs'), lexicon), 10)
display(result)


Unnamed: 0,docid,docname,score
0,17,Apple I.txt,0.535743
1,6,Apple Inc..txt,0.473741
2,75,Apple III.txt,0.401054
3,3,Apple II series.txt,0.365331
4,32,Apple Store.txt,0.326945
5,79,Apple.txt,0.30192
6,38,Apple TV.txt,0.285315
7,52,Cooking apple.txt,0.278273
8,86,Apple Corps.txt,0.260448
9,36,Cloud computing.txt,0.158311


In [17]:
result = search(vectors, doclist, idf, doc_vector(clean_article('Trump Putin'), lexicon), 10)
display(result)


Unnamed: 0,docid,docname,score
0,21,Donald Trump.txt,0.661581
1,65,Legal affairs of Donald Trump.txt,0.636282
2,39,The Trump Organization.txt,0.628068
3,28,Trump University.txt,0.59363
4,20,Public image of Vladimir Putin.txt,0.591325
5,66,Family of Donald Trump.txt,0.587025
6,80,Vladimir Putin.txt,0.576733
7,97,Trump family.txt,0.52348
8,42,Eric Trump.txt,0.486722
9,35,Russia under Vladimir Putin.txt,0.464057
