Python module implementing Stable Random Projections, as described in "Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries," *Journal of Cultural Analytics*, Vol. 1, Issue 2 (October 4, 2018).
These create interchangeable, data-agnostic vectorized representations of text that are useful in a variety of contexts. Unlike most vectorizations, they can represent texts in any language that uses space tokenization, as well as non-linguistic content, because they contain no implicit language model beyond words themselves.
You may want to use them in concert with the pre-distributed Hathi SRP features described further here.
Requires Python 3.

```bash
pip install pysrp
```
**Version 2.0 (July 2022) slightly changes the default tokenization algorithm!** Previously the token pattern was `\w`; version 2.0 ships a different default, and use of the old pattern is no longer recommended.
See the docs folder for some IPython notebooks demonstrating:
- Taking a subset of the full Hathi collection (100,000 works of fiction) based on identifiers, and exploring the major clusters within fiction.
- Creating a new SRP representation of text files and plotting dimensionality reductions of them by language and time
- Searching for copies of one set of books in the full HathiTrust collection, and using Hathi metadata to identify duplicates and find errors in local item descriptions.
- Training a classifier based on library metadata using TensorFlow, and then applying that classification to other sorts of text.
Use the SRP class to build an object to perform transformations.
SRP is implemented as a class, rather than a function, so that it can build a cache of previously seen words.

```python
import SRP

# Initialize with the desired number of dimensions.
hasher = SRP.SRP(640)
```
The most important method is `stable_transform`, which can tokenize text and then compute its SRP.

```python
hasher.stable_transform(words="foo bar bar")
```
If counts are already computed, word and count vectors can be passed separately.
```python
hasher.stable_transform(words=["foo", "bar"], counts=[1, 2])
```
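Since "foo bar bar" tokenizes to counts of `foo=1, bar=2`, the two call forms should yield the same vector. A minimal sanity check, assuming numpy is available:

```python
import numpy as np

# "foo bar bar" tokenizes to {foo: 1, bar: 2}, so both calls should agree.
v1 = hasher.stable_transform(words="foo bar bar")
v2 = hasher.stable_transform(words=["foo", "bar"], counts=[1, 2])
assert np.allclose(v1, v2)
```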
SRP files are stored in a binary format to save space; it is the same format used by binary word2vec files. This format is now deprecated: I recommend the Apache Arrow binary serialization format instead.
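For illustration, here is a minimal sketch (not part of pysrp) of writing keys and vectors to an Arrow/Feather file with the `pyarrow` package; the keys and vectors below are hypothetical stand-ins for real SRP output:

```python
import numpy as np
import pyarrow as pa
import pyarrow.feather as feather

# Hypothetical identifiers and vectors standing in for real SRP output.
keys = ["doc1", "doc2"]
vecs = np.random.rand(2, 640).astype("<f4")

# Store one fixed-size list of 640 floats per document.
table = pa.table({
    "key": keys,
    "vector": pa.FixedSizeListArray.from_arrays(pa.array(vecs.ravel()), 640),
})
feather.write_feather(table, "vectors.feather")
```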
To read an existing file, iterate over a `Vector_file`:

```python
file = SRP.Vector_file("hathivectors.bin")

for (key, vector) in file:
    # 'key' is a unique identifier for a document in a corpus.
    # 'vector' is a numpy.array of type '<f4'.
    pass
```
There are two other methods. One lets you read an entire matrix in at once. This may require lots of memory. It returns a dictionary with two keys: 'matrix' (a numpy array) and 'names' (the row names).
```python
all = SRP.Vector_file("hathivectors.bin").to_matrix()
all['matrix'][:5]
all['names'][:5]
```
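Once loaded, the matrix can be used directly. For example, a sketch of finding the five rows closest (by cosine similarity) to the first row; nothing beyond numpy is assumed here:

```python
import numpy as np

# Normalize rows, then rank documents by cosine similarity to the first one.
m = all['matrix']
normed = m / np.linalg.norm(m, axis=1, keepdims=True)
sims = normed @ normed[0]
nearest = np.argsort(-sims)[:5]
[all['names'][i] for i in nearest]
```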
The other lets you treat the file as a dictionary of keys. The first lookup may take a very long time; subsequent lookups will be fast without requiring you to load the vectors into memory. To get a 1-dimensional representation of a book:
```python
all = SRP.Vector_file("hathivectors.bin")
all['gri.ark:/13960/t3032jj3n']
```
You can also, thanks to Peter Organisciak, access multiple vectors at once by passing a list of identifiers: passing two identifiers to a 160-dimensional file, for example, returns a matrix with shape (2, 160).
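A sketch of that lookup (the second identifier is a hypothetical placeholder; any keys present in the file will work):

```python
# Passing a list of identifiers returns one row per identifier.
vectors = all[['gri.ark:/13960/t3032jj3n', 'mdp.39015012345678']]
vectors.shape  # e.g. (2, 160) for a 160-dimensional file
```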
## Writing to SRP files
You can build your own files row by row.
```python
# Note: the dimensions of the file and the hasher should be equal.
output = SRP.Vector_file("new_vectors.bin", dims=640, mode="w")
hasher = SRP.SRP(640)

for filename in [a, b, c, d]:
    hash = hasher.stable_transform(" ".join(open(filename).readlines()))
    output.add_row(filename, hash)

# Files must be closed.
output.close()
```
Since files must be closed, it can be easier to use a context manager:
```python
# Note: the dimensions of the file and the hasher should be equal.
hasher = SRP.SRP(640)

with SRP.Vector_file("new_vectors.bin", dims=640, mode="w") as output:
    for filename in [a, b, c, d]:
        hash = hasher.stable_transform(" ".join(open(filename).readlines()))
        output.add_row(filename, hash)
```
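The file written above can then be read back with the same iteration pattern shown earlier:

```python
# Read the newly written vectors back to verify them.
readback = SRP.Vector_file("new_vectors.bin")
for key, vector in readback:
    print(key, vector[:5])
```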