In [1]:
from os import listdir
from os import popen
from os.path import isfile, join
import nltk
import string
from nltk.corpus import stopwords

In [2]:
mypath = 'data/'

The first line of the following code is a list comprehension, which is just syntactic sugar for a for loop.

It loops through every object inside the Wikipedia files directory (in this case './data/') and, if it is a file (rather than a directory), adds its name to the list of files.

onlyfiles is now a list of strings, each the name of one of the Wikipedia text files (relative to mypath).
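To make the "syntactic sugar" point concrete, here is a minimal sketch that writes the same filter both ways. The helper names and the temporary demo directory are made up for illustration; only listdir, isfile and join come from the notebook.

```python
import tempfile
from os import listdir, mkdir
from os.path import isfile, join

def list_files_loop(path):
    """Explicit for-loop version of the directory filter."""
    result = []
    for f in listdir(path):
        if isfile(join(path, f)):  # skip sub-directories
            result.append(f)
    return result

def list_files_comprehension(path):
    """List-comprehension version, as used in the notebook."""
    return [f for f in listdir(path) if isfile(join(path, f))]

# Throwaway directory holding one file and one sub-directory (hypothetical names).
demo_dir = tempfile.mkdtemp()
open(join(demo_dir, 'Carthage.txt'), 'w').close()
mkdir(join(demo_dir, 'subdir'))

print(list_files_loop(demo_dir))           # ['Carthage.txt']
print(list_files_comprehension(demo_dir))  # ['Carthage.txt']
```

Both functions return bare file names, not full paths, which is why the later cells prepend the directory again.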

In [3]:
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
onlyfiles[15]

'Carthage.txt'

In the following loop, I read each file, pre-process it (lower-case it, strip punctuation, and drop stopwords and words shorter than four characters), and then write it to a new file inside the cleaned_data directory.
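As a rough illustration of what the cleaning does to a single line, the sketch below applies the same steps to one sentence. The stopword set here is a tiny hypothetical stand-in so the example runs without downloading the nltk corpus; the notebook itself uses stopwords.words('english').

```python
import string

# Hypothetical, tiny stopword set for illustration only.
DEMO_STOPWORDS = {'the', 'with', 'this', 'that', 'from'}
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_line(line, stop_words=DEMO_STOPWORDS, min_len=4):
    """Lower-case, strip punctuation, drop short words and stopwords."""
    words = line.lower().translate(PUNCT_TABLE).split()
    return " ".join(w for w in words if len(w) >= min_len and w not in stop_words)

print(clean_line("Carthage was the capital, with this great harbour!"))
# carthage capital great harbour
```

Note that the length filter alone already removes many common stopwords ("was", "the"), so the stopword list mostly matters for words of four or more characters.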

In [4]:
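# Build the stopword set and punctuation table once instead of once per word.
stop_words = set(stopwords.words('english'))
punct_table = str.maketrans('', '', string.punctuation)
for i, f in enumerate(onlyfiles):
    print('file '+ str(i) + ' of ' + str(len(onlyfiles)))
    with open('data/'+f,'r') as inFile, open('cleaned_data/'+f,'w') as outFile:
        for line in inFile:
            # lower-case, strip punctuation, drop short words and stopwords
            words = [word for word in line.lower().translate(punct_table).split()
                     if len(word) >= 4 and word not in stop_words]
            print(" ".join(words), file=outFile)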

file 0 of 9191
file 1 of 9191
file 2 of 9191
file 3 of 9191
file 4 of 9191
file 5 of 9191
file 6 of 9191
file 7 of 9191
file 8 of 9191
file 9 of 9191
file 10 of 9191
file 11 of 9191
file 12 of 9191
file 13 of 9191
file 14 of 9191
file 15 of 9191


KeyboardInterrupt: 

Below I import the library with the prebuilt tf-idf vectorizer, as well as some useful utilities for speeding things up.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd

Here I create the list file_paths, which holds the path to each file in the cleaned_data directory.

In [6]:
file_paths = ['cleaned_'+mypath+f for f in onlyfiles]
#file_dirs = [open(x).read() for x in file_paths]

Here I define the tf-idf vectorizer. Further details of its parameters can be found at the following link:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

and its implementation can be found at:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py

Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval that has also found good use in document classification. The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.

The formula that is used to compute the tf-idf of term t in document d is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if ``smooth_idf=False``), where n is the total number of documents and df(t) is the document frequency, i.e. the number of documents that contain term t. The effect of adding "1" to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored.

With the default ``smooth_idf=True``, which is what the vectorizer defined below uses, the idf is instead computed as idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1, as if one extra document containing every term had been seen.

(Note that the idf formulas above differ from the standard textbook notation that defines the idf as idf(t) = log [ n / (df(t) + 1) ]).
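To get a feel for the numbers, here is a small worked sketch of the idf term only, using natural logarithms. It covers the smooth_idf=False form quoted above and the form scikit-learn uses when smooth_idf=True, idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1. The four-document corpus and the document frequencies are invented for illustration.

```python
import math

def idf_unsmoothed(n_docs, df):
    # idf(t) = ln(n / df(t)) + 1   (smooth_idf=False)
    return math.log(n_docs / df) + 1

def idf_smoothed(n_docs, df):
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1   (smooth_idf=True)
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 4  # hypothetical corpus size
# (term, number of documents containing it); all values are made up
for term, df in [('fifa', 1), ('world', 2), ('article', 4)]:
    print(term, round(idf_unsmoothed(n_docs, df), 3), round(idf_smoothed(n_docs, df), 3))
# fifa 2.386 1.916
# world 1.693 1.511
# article 1.0 1.0
```

A term that appears in every document ends up with an idf of exactly 1 under both variants, so it is down-weighted relative to rarer terms but never zeroed out. With norm='l2', each document vector is additionally scaled to unit length after the tf * idf product, which is why the stored weights printed later in this notebook are all small.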


In [7]:
vectorizer = TfidfVectorizer(input='filename')
vectorizer

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='filename',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

Here, by calling the tf-idf vectorizer's fit_transform function on all of the Wikipedia files, I learn the vocabulary and the idf weights for the dataset. This function returns a document-term matrix with one row per document and one column per vocabulary term; each stored entry is the tf-idf weight of that term in that document.

In [8]:
tfidf_matrix = vectorizer.fit_transform(file_paths)
tfidf_matrix[:10,:10]

<10x10 sparse matrix of type '<class 'numpy.float64'>'
	with 2 stored elements in Compressed Sparse Row format>

Below, we can see what the output of the fit_transform function looks like for a single document (row 1928):

In [29]:
print(tfidf_matrix[1928,:124431])

  (0, 47109)	0.00929048963563
  (0, 123975)	0.0211506061506
  (0, 83745)	0.0100610022266
  (0, 121066)	0.0105105866428
  (0, 121741)	0.0101203934128
  (0, 122439)	0.0108713865485
  (0, 123160)	0.00891317810767
  (0, 119592)	0.00875726542803
  (0, 120418)	0.0109716443936
  (0, 118878)	0.010801049572
  (0, 120783)	0.0107380332352


I now save the matrix so I don't need to recalculate it at each step.

In [31]:
import scipy.sparse  # import the submodule explicitly; a bare "import scipy" does not guarantee it is loaded
scipy.sparse.save_npz('/tmp/tfidf_matrix.npz', tfidf_matrix)

I also save the current state of the vectorizer, which includes the vocabulary mapping it has learned. For example, above we see that the term in column 47109 has a tf-idf weight of about 0.0093 in document 1928 (the leading 0 is the row index within the one-row slice, not a count). This raw data means nothing to us on its own, but the vectorizer keeps a mapping from words to their column positions in the vectors that it generates. We will need this later to map backwards to words.
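The word-to-column mapping can be pictured with a plain dictionary. The sketch below uses a made-up four-word vocabulary in the same shape as the vectorizer's vocabulary_ attribute (term to column index) and invented weights. It only illustrates how the "(row, column)  weight" entries printed above are read back as words.

```python
# Hypothetical toy vocabulary: term -> column index.
toy_vocabulary = {'carthage': 0, 'fifa': 1, 'kodak': 2, 'world': 3}

# Invert it so column indices can be mapped back to terms.
index_to_term = {index: term for term, index in toy_vocabulary.items()}

# Stored entries of one sparse row as (column, weight) pairs (made-up weights).
row_entries = [(3, 0.41), (1, 0.87)]

for column, weight in row_entries:
    print(index_to_term[column], weight)
# world 0.41
# fifa 0.87
```

This is why the fitted vectorizer has to be saved along with the matrix: without the mapping, the column numbers cannot be turned back into words.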

In [28]:
import pickle

# The with-block closes the file even if pickling fails.
with open("vectorizer.pickle", "wb") as pickle_out:
    pickle.dump(vectorizer, pickle_out)

I now transform (convert to a vector) a test document and save it to the variable X.

In [32]:
print(file_paths[3:4])
X = vectorizer.transform(file_paths[3:4])
X

['cleaned_data/Minas Gerais.txt']


<1x1854706 sparse matrix of type '<class 'numpy.float64'>'
	with 3002 stored elements in Compressed Sparse Row format>

The following code analyzes all of the Wikipedia articles (9191 files in this run). For each article, it converts the text to a tf-idf vector, sorts the terms in the document by their tf-idf score, takes the top n, maps their integer column indices back to the corresponding English words, and uses those words as tags for the document.

The end result is a large dictionary (hashmap) that maps each Wikipedia article to a set of tags.

Bad news first:
This code takes a truly ridiculous amount of memory to run. (If you ran it in this form, due to some bad memory deallocation within the libraries, it could easily take over 1000 GB of RAM.)

There are two pieces of good news:
    1) I have written bash scripts (also in this directory) that chunk this problem up into ~20 equal parts that use 100 GB of RAM or so apiece.
    2) I have already computed the output of this computation, using a server with ~96 GB of RAM, and saved it into wiki_tags.json.
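One possible way to reduce the per-document footprint, offered only as a sketch and not as what this notebook does, is to rank just the non-zero entries of each row instead of converting the whole row to a dense array. A dense row allocates one float64 per vocabulary term (roughly 15 MB for the 1,854,706-column row shown earlier), while only a few thousand entries are actually stored. The helper name and toy arrays below are hypothetical; the indices and data arguments stand in for the .indices and .data arrays of a one-row CSR matrix. Whether this avoids the memory growth described above has not been verified here.

```python
import numpy as np

def top_n_from_sparse_row(indices, data, feature_array, n):
    """Return the n highest-weighted terms of one row.

    indices: column indices of the non-zero entries
    data:    the matching tf-idf weights
    """
    order = np.argsort(data)[::-1][:n]  # positions of the n largest weights
    return feature_array[indices[order]]

# Toy stand-ins for the vocabulary and one sparse row (all values made up).
feature_array = np.array(['carthage', 'fifa', 'kodak', 'world', 'zebra'])
indices = np.array([3, 1, 4])
data = np.array([0.41, 0.87, 0.10])

print(top_n_from_sparse_row(indices, data, feature_array, 2))
# ['fifa' 'world']
```

One behavioural difference from the dense approach: a document with fewer than n distinct terms yields fewer than n tags, rather than being padded with zero-weight words.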

In [33]:
n = 10
post_tags_dict = {}

To get these top terms you have to do a little bit of a song and dance to convert the sparse matrices to numpy arrays in the right shape.

The argsort call is really the useful one; here are the docs for it: https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html

We have to do [::-1] because argsort only sorts in ascending order (small to large). We call flatten to reduce the result to 1d so that the sorted indices can be used to index the 1d feature array.
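The argsort, flatten and [::-1] steps are easier to follow on a tiny example. The four-word vocabulary and the weights below are made up; the row has shape (1, 4), mimicking what X.toarray() returns for a single document.

```python
import numpy as np

feature_array = np.array(['carthage', 'fifa', 'kodak', 'world'])
row = np.array([[0.05, 0.87, 0.0, 0.41]])  # one document, four terms

ascending = np.argsort(row).flatten()  # column indices, smallest weight first
descending = ascending[::-1]           # reversed: largest weight first
top_2 = feature_array[descending][:2]  # map the best two columns back to words

print(ascending)   # [2 0 3 1]
print(descending)  # [1 3 0 2]
print(top_2)       # ['fifa' 'world']
```

argsort returns positions rather than values, which is exactly what is needed here: the positions double as indices into the feature array.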

In [55]:
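# get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# use get_feature_names_out() on those versions.
feature_array = np.array(vectorizer.get_feature_names())

for i, (file_name, file) in enumerate(zip(onlyfiles, file_paths), start=1):
    print('file '+ str(i) + ' of ' + str(len(onlyfiles)))
    X = vectorizer.transform([file])
    # Column indices ordered from highest to lowest tf-idf weight.
    tfidf_sorting = np.argsort(X.toarray()).flatten()[::-1]
    # Map the n highest-weighted columns back to their words.
    top_n = feature_array[tfidf_sorting][:n]
    post_tags_dict[file_name] = top_n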

file 1 of 9191


KeyboardInterrupt: 

In [47]:
post_tags_dict

{'Africa.txt': '00',
 'Cladistics.txt': 'cladistic',
 'Economic Community of West African States.txt': '00',
 'GLONASS.txt': '00',
 'ISO 8000.txt': '00',
 'Minas Gerais.txt': '00'}

In [18]:
import pickle

# The with-block closes the file even if pickling fails.
with open("wiki_article_tags.pickle", "wb") as pickle_out:
    pickle.dump(post_tags_dict, pickle_out)

Import the cosine similarity metric.

In [58]:
from sklearn.metrics.pairwise import cosine_similarity

Now that we have done all the background work, we are ready for a new request from the user. We take their input from the 'test.txt' file and vectorize it.

In [60]:
Y = vectorizer.transform(['test.txt'])

We then compute its similarity to all the vectors representing our Wikipedia articles, sort them from least similar to most similar, and take the last k elements (these are the k most similar).
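For intuition, the ranking step can be sketched in plain Python on toy vectors. Cosine similarity is the dot product of two vectors divided by the product of their lengths. The three-dimensional "documents" and the query below are invented, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up document vectors and query vector.
documents = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0], [0.8, 0.6, 0.0]]
query = [1.0, 0.0, 0.0]
k = 2

sims = [cosine(doc, query) for doc in documents]
ranked = sorted(range(len(documents)), key=lambda i: sims[i])  # least to most similar
top_k = ranked[-k:]                                            # the k most similar

print(top_k)  # [3, 0]
```

As in the notebook, the most similar document comes last in the slice, so the list has to be read from the end if the best match is wanted first.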

In [61]:
k = 4

In [62]:
sim = cosine_similarity(tfidf_matrix, Y)
file_num = np.argmax(np.array(sim))
files = np.argsort(sim, axis=0)
files[-k:]

array([[5261],
       [ 616],
       [2922],
       [2646]])

The previous operation just gave us the row numbers of the most similar articles, which are also their positions in the onlyfiles list. So we now loop through that array of numbers, plug each one into onlyfiles, and save the value stored in onlyfiles[file_number] into the file_names list.

In [43]:
file_names = [onlyfiles[num[0]] for num in files[-k:]]  # use k rather than a hard-coded 4
file_names

['Oceania Football Confederation.txt',
 'List of FIFA country codes.txt',
 'FIFA World Cup.txt',
 'FIFA.txt']

We can now plug the title of each of these Wikipedia articles (the same as its file name) into the tags hashmap to get the associated tags.

In [15]:
post_tags_dict[file_names[0]]

array(['fifa', '2006', 'world', 'group', 'teams', 'germany', 'june',
       'tournament', 'goals', 'knockout'],
      dtype='<U233')

In [16]:
post_tags_dict

{'1080i.txt': array(['1080i', 'frame', 'lines', 'interlaced', '1080', 'resolution',
        'video', 'pixels', '1080p', 'television'],
       dtype='<U233'),
 '10th edition of Systema Naturae.txt': array(['linnaeus', 'systema', 'naturae', 'beetles', '10th', 'species',
        'edition', 'plantarum', 'animals', 'nomenclature'],
       dtype='<U233'),
 '110 film.txt': array(['kodak', 'film', 'cameras', 'cartridge', 'slide', 'camera',
        'format', 'lomography', 'slides', 'eastman'],
       dtype='<U233'),
 '12-hour clock.txt': array(['noon', 'midnight', 'time', 'clock', '12hour', '1200', '24hour',
        'midnightb', '1159', 'clocks'],
       dtype='<U233'),
 '126 film.txt': array(['kodak', 'film', 'format', 'cartridge', 'cameras', 'instamatic',
        'camera', 'ferrania', 'adox', 'eastman'],
       dtype='<U233'),
 '135 film.txt': array(['film', 'kodak', 'camera', 'cameras', 'leica', 'format', '24',
        'retina', 'cassettes', 'slrs'],
       dtype='<U233'),
 '16:9.txt': array