## Compute most common words across all pages we're considering

To compute the most significant words, we need two counts: how often each word occurs within each page we've collected (term frequency), and how common each word is across the entire corpus (document frequency).
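
As a rough illustration of the two counts, here is a minimal sketch on a toy corpus. The page texts, the `tokens` helper, and the variable names are all hypothetical, and the whitespace tokenizer is an assumption; `wikiparse.tokenize` may well split and normalize words differently.

```python
from collections import Counter

# Toy corpus standing in for the collected pages (hypothetical data).
toy_pages = {
    "Page A": "the river runs past the old mill",
    "Page B": "the mill closed and the town shrank",
    "Page C": "a river town",
}

def tokens(text):
    # Naive whitespace tokenizer; the real pipeline may tokenize differently.
    return text.lower().split()

# Term frequency: occurrences of each word within a single page.
toy_term_freq = {title: Counter(tokens(text)) for title, text in toy_pages.items()}

# Document frequency: number of pages in which each word appears at least once.
toy_doc_freq = Counter()
for counts in toy_term_freq.values():
    toy_doc_freq.update(counts.keys())

print(toy_term_freq["Page A"]["the"])  # occurrences of "the" in Page A
print(toy_doc_freq["river"])           # pages that mention "river"
```

Counting each word once per page (via `counts.keys()`) is what separates document frequency from a plain corpus-wide word count.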

In [1]:
from wikiparse import geo_indexer, tokenize

In [2]:
from pathlib import Path
from wikiparse import config

xml_filename = config.xml
scratch_folder = Path(config.folder)

In [3]:
indexer = geo_indexer.Indexer(xml_filename,
            scratch_folder=scratch_folder)

opening C:\Users\rowan\Documents\geowiki\scratch\index.db
Ready. Metadata: [('size', 1524893)]


In [4]:
import time
pipeline_start = time.time()

In [5]:
doc_freq = tokenize.create_doc_freq(indexer, scratch_folder)

creating queue with 10 threads
done starting threads; queueing
time left: 18.4 hours
time left: 5.1 hours
time left: 5.29 hours
time left: 5.12 hours
time left: 4.92 hours
time left: 4.61 hours
time left: 4.44 hours
time left: 4.06 hours
time left: 3.55 hours
time left: 3.51 hours
time left: 3.38 hours
time left: 3.18 hours
time left: 2.81 hours
time left: 2.5 hours
time left: 2.43 hours
time left: 2.39 hours
time left: 2.35 hours
time left: 2.25 hours
time left: 2.09 hours
time left: 1.59 hours
time left: 1.57 hours
time left: 1.56 hours
time left: 1.52 hours
time left: 1.36 hours
time left: 1.34 hours
time left: 1.28 hours
time left: 1.28 hours
time left: 1.25 hours
time left: 1.21 hours
time lef

In [6]:
import pickle
fname = scratch_folder/'wikipedia_wordfreq.pkl'
# Serialize the document-frequency table straight to the open file
with open(fname, 'wb') as f:
    pickle.dump(doc_freq, f)
print("saved docfreq file to ", fname)

saved docfreq file to  C:\Users\rowan\Documents\geowiki\scratch\wikipedia_wordfreq.pkl
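
Since the document-frequency pass takes hours, it can be worth reloading the pickle in a later session instead of recomputing it. The sketch below shows the round trip on a stand-in dictionary written to a temporary directory. The type of the real `doc_freq` object is an assumption; the two counts are borrowed from the sample output later in this notebook.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for doc_freq; the real object's type is an assumption.
example_doc_freq = {"the": 581915, "mogendorf": 28}

with tempfile.TemporaryDirectory() as tmp:
    example_path = Path(tmp) / "wikipedia_wordfreq.pkl"

    # Save, mirroring the cell above.
    with open(example_path, "wb") as f:
        pickle.dump(example_doc_freq, f)

    # Reload in binary mode; only unpickle files you created yourself.
    with open(example_path, "rb") as f:
        restored = pickle.load(f)

print(restored == example_doc_freq)
```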


## Term Frequency, Inverse Document Frequency
For each page, build a list of the words that occur in it along with their TF-IDF (significance) scores, then save the 30 most significant words per page.
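
The per-page selection can be sketched as follows. The `top_words` helper is hypothetical and is not part of `wikiparse`. Scoring each word as `tf / df` is an assumption inferred from the sample rows printed later in this notebook, and the example counts are borrowed from rows that actually belong to different articles.

```python
import heapq

def top_words(term_counts, doc_freq, n=30):
    """Return the n highest-scoring (word, score) pairs for one page.

    Assumes score = tf / df, inferred from the notebook's sample output.
    """
    scores = {
        word: tf / doc_freq[word]
        for word, tf in term_counts.items()
        if doc_freq.get(word)  # skip words with no (or zero) document frequency
    }
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])

# Hypothetical page: counts borrowed from unrelated sample rows.
page_counts = {"the": 4, "coin": 1, "rental": 1}
corpus_df = {"the": 581915, "coin": 1485, "rental": 4011}

for word, score in top_words(page_counts, corpus_df, n=2):
    print(word, round(score, 6))
```

`heapq.nlargest` avoids sorting every word on a page when only the top few are kept.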

In [7]:
df = tokenize.create_tfidf(indexer, scratch_folder)

computing TF-IDF for 647725 pages
time left ~ 2.53 hours ; done: 11
time left ~ 2.28 hours ; done: 16
time left ~ 1.67 hours ; done: 48
time left ~ 1.72 hours ; done: 69
time left ~ 2.05 hours ; done: 288
time left ~ 2.08 hours ; done: 1369
time left ~ 2.12 hours ; done: 2187
time left ~ 2.01 hours ; done: 9519
wrote to CSV C:\Users\rowan\Documents\geowiki\scratch\tfidf_0.csv
time left ~ 2.08 hours ; done: 11998
time left ~ 2.05 hours ; done: 19147
time left ~ 2.05 hours ; done: 19664
wrote to CSV C:\Users\rowan\Documents\geowiki\scratch\tfidf_1.csv
time left ~ 2.08 hours ; done: 23195
wrote to CSV C:\Users\rowan\Documents\geowiki\scratch\tfidf_2.csv
time left ~ 2.02 hours ; done: 35208
wrote to CSV C:\Users\rowan\Documents\geowiki\scratch\tfidf_3.csv
wrote to CSV C:\Users\rowan\Documents\geowiki\scratch\tfidf_4.csv
time left ~ 1.97 hours ; done: 53976
time left ~ 1.94 hours ; done: 58601
wrote to CSV C:\Users\rowan\Documents\geowiki\scratch\tfidf_5.csv
time left ~ 1.92 hours ; done: 6

In [8]:
fname = scratch_folder/'tfidf.csv'
df.to_csv(fname)
print("wrote final CSV to", fname)

wrote final CSV to C:\Users\rowan\Documents\geowiki\scratch\tfidf.csv


In [9]:
df.sample(10)

Unnamed: 0.1,Unnamed: 0,tf,article,df,tf_idf
36431,mogendorf,1,"Linden, Westerwaldkreis",28,0.035714
80958,inpatients,1,Rothbury Community Hospital,96,0.010417
237340,obsolete,1,"Magill, South Australia",1007,0.000993
267908,businessman,1,"Pulaski, Wisconsin",6294,0.000159
296749,parallel,1,"Campaspe, Queensland",6942,0.000144
325219,coin,1,Tidbury Ring,1485,0.000673
246901,rental,1,Douglass Place,4011,0.000249
178826,divert,1,Herbert River,801,0.001248
273621,stopped,1,"Mirny, Sakha Republic",6207,0.000161
234302,the,4,"Pardeesville, Pennsylvania",581915,7e-06
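
The rows above appear consistent with `tf_idf = tf / df`, with no logarithm applied. That is inferred from the printed values only, not confirmed against the `tokenize` source. A quick check on four of the sampled rows, allowing for rounding to six decimal places:

```python
# (word, tf, df, reported tf_idf) copied from the sample output above.
sample_rows = [
    ("mogendorf", 1, 28, 0.035714),
    ("inpatients", 1, 96, 0.010417),
    ("obsolete", 1, 1007, 0.000993),
    ("the", 4, 581915, 7e-06),
]

mismatches = []
for word, tf, df_count, reported in sample_rows:
    recomputed = tf / df_count
    # The printed values are rounded to 6 decimals, so allow half a unit.
    if abs(recomputed - reported) > 5e-7:
        mismatches.append(word)

print("mismatches:", mismatches)
```

Under this scoring a ubiquitous word such as "the" ends up near zero even when it occurs several times on a page, which is the intended effect.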


In [10]:
took = time.time() - pipeline_start
if took < 60:
    print("pipeline took", round(took, 2), "seconds")
elif took < 3600:
    print("pipeline took", round(took/60, 2), "minutes")
else:
    print("pipeline took", round(took/60/60, 2), "hours")

pipeline took 3.75 hours
