## Compute most common words across all pages we're considering
We need to know how often each word occurs across all pages in our set, that is, every page with coordinate tags. Pages can be processed independently, so this step can be parallelized.
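
To make the counting step concrete, here is a minimal sketch of a per-word document count spread over a thread pool. It is not the implementation behind `tokenize.create_doc_freq`. The helper names `page_words` and `document_frequency`, the regex tokenizer, and the sample pages are all assumptions made for illustration.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def page_words(text):
    """Return the set of distinct lowercase words in one page (hypothetical tokenizer)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def document_frequency(pages, workers=4):
    """Count, for each word, the number of pages that contain it."""
    doc_freq = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each page is tokenized independently; results are merged in this thread.
        for words in pool.map(page_words, pages):
            doc_freq.update(words)
    return doc_freq


# Made-up sample pages, only for illustration.
pages = [
    "The river runs north of the city.",
    "The city sits on a river delta.",
    "A mountain range lies to the east.",
]
doc_freq = document_frequency(pages)
print(doc_freq.most_common(3))
```

Because each page contributes a set, a word repeated within one page is counted only once for that page. In CPython, threads help mainly when fetching pages is I/O-bound; for pure-Python tokenizing a process pool may be the better fit.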

In [1]:
from wikiparse import geo_indexer, tokenize

In [2]:
from pathlib import Path
xml_filename = "wiki_sample.xml"

scratch_folder = Path("scratch-pipeline")

In [3]:
indexer = geo_indexer.Indexer(xml_filename,
            scratch_folder=scratch_folder)

opening scratch-pipeline\index.db
Ready. Metadata: [('size', 999)]


In [4]:
import time
pipeline_start = time.time()

In [5]:
doc_freq = tokenize.create_doc_freq(indexer, scratch_folder)

creating queue with 10 threads
done starting threads; queueing
done queueing pages           
telling threads to stop
waiting


In [6]:
import pickle
fname = scratch_folder/'wikipedia_wordfreq.pkl'
with open(fname, 'wb') as f:
    pickle.dump(doc_freq, f)
print("saved docfreq file to ", fname)

saved docfreq file to  scratch-pipeline\wikipedia_wordfreq.pkl


## Term Frequency, Inverse Document Frequency
For each page, list every word that occurs in it along with its TF-IDF (significance) score, then keep the 30 most significant words per page.
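
As a rough illustration of the scoring, the sketch below computes TF-IDF for one page and keeps the highest-scoring words. It assumes the common definition (relative term frequency times the natural log of total pages over pages containing the word); `tokenize.create_tfidf` may weight terms differently. The function `top_tfidf` and the sample counts are hypothetical.

```python
import math
from collections import Counter


def top_tfidf(page_tokens, doc_freq, n_pages, top_n=30):
    """Return the top_n (word, score) pairs for one page, highest score first."""
    tf = Counter(page_tokens)
    total = len(page_tokens)
    scores = {}
    for word, count in tf.items():
        # Assumption: a word missing from doc_freq is treated as occurring in one page.
        idf = math.log(n_pages / doc_freq.get(word, 1))
        scores[word] = (count / total) * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Made-up document frequencies for a 3-page collection.
doc_freq = {"the": 3, "river": 2, "city": 2, "delta": 1}
tokens = ["the", "river", "delta", "the"]
top = top_tfidf(tokens, doc_freq, n_pages=3)
print(top)
```

A word that appears on every page gets an inverse document frequency of zero, so common words such as "the" fall to the bottom of the ranking however often they occur on the page.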

In [7]:
df = tokenize.create_tfidf(indexer, scratch_folder)

computing TF-IDF for 999 pages
time left ~ 17.9 seconds ; done: 2
time left ~ 8.5 seconds ; done: 12
time left ~ 12.0 seconds ; done: 25
time left ~ 13.3 seconds ; done: 81
time left ~ 9.5 seconds ; done: 229
wrote to CSV scratch-pipeline\tfidf_0.csv
storage finished
0.01 ms per, total time: 7.7 seconds


In [8]:
fname = scratch_folder/'tfidf.csv'
df.to_csv(fname)
print("wrote final CSV to", fname)

wrote final CSV to scratch-pipeline\tfidf.csv


In [9]:
took = time.time() - pipeline_start
if took < 60:
    print("pipeline took", round(took, 2), "seconds")
elif took < 3600:
    print("pipeline took", round(took/60, 2), "minutes")
else:
    print("pipeline took", round(took/60/60, 2), "hours")

pipeline took 20.48 seconds
