Skip to content

bmschmidt/pySRP

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
SRP
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

pySRP

Python module implementing Stable Random Projections, as described in Cultural Analytics Vol. 1, Issue 2, 2018 October 04, 2018: Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries

These create interchangeable, data-agnostic vectorized representations of text suitable for a variety of contexts. Unlike most vectorizations, they are suitable for representing texts in any language that uses space-tokenization, or non-linguistic content, since they contain no implicit language model besides words.

You may want to use them in concert with the pre-distributed Hathi SRP features described further here.

Installation

Requires python 3

pip install pysrp

Changelog

Version 2.0 (July 2022) slightly changes the default tokenization algorithm!

Previously it was \w; now it is [\p{L}\p{Pc}\p{N}\p{M}]+.

I also no longer recommend use

Usage

Examples

See the docs folder for some IPython notebooks demonstrating:

Basic Usage

Use the SRP class to build an object to perform transformations.

This is a class method, rather than a function, which builds a cache of previously seen words.

import srp
# initialize with desired number of dimensions
hasher = srp.SRP(640)

The most important method is 'stable_transform'.

This can tokenize and then compute the SRP.

hasher.stable_transform(words = "foo bar bar"))

If counts are already computed, word and count vectors can be passed separately.

hasher.stable_transform(words = ["foo","bar"],counts = [1,2])

Read/write tools

SRP files are stored in a binary file format to save space. This format is the same used by the binary word2vec format.

DEPRECATION NOTICE

This format is now deprecated--I recommend the Apache Arrow binary serialization format instead.

file = SRP.Vector_file("hathivectors.bin")

for (key, vector) in file:
  pass
  # 'key' is a unique identifier for a document in a corpus
  # 'vector' is a `numpy.array` of type `<f4`.

There are two other methods. One lets you read an entire matrix in at once. This may require lots of memory. It returns a dictionary with two keys: 'matrix' (a numpy array) and 'names' (the row names).

all = SRP.Vector_file("hathivectors.bin").to_matrix()
all['matrix'][:5]
all['names'][:5]

The other lets you treat the file as a dictionary of keys. The first lookup may take a very long time; subsequent lookups will be fast without requiring you to load the vectors into memory. To get a 1-dimensional representation of a book:

all = SRP.Vector_file("hathivectors.bin")
all['gri.ark:/13960/t3032jj3n']

You can also, thanks to Peter Organisciak, access multiple vectors at once this way by passing a list of identifiers. This returns a matrix with shape (2, 160) for a 160-dimensional representation.

all[['gri.ark:/13960/t3032jj3n', 'hvd.1234532123']]

Writing to SRP files

You can build your own files row by row.

# Note--the dimensions of the file and the hasher should be equal.
output = SRP.Vector_file("new_vectors.bin",dims=640,mode="w")
hasher = SRP.SRP(640)


for filename in [a,b,c,d]:
  hash = hasher.stable_transform(" ".join(open(filename).readlines()))
  output.add_row(filename,hash)

# files must be closed.
output.close()

Since files must be closed, it can be easier to use a context manager:

# Note--the dimensions of the file and the hasher should be equal.
hasher = SRP.SRP(640)

with SRP.Vector_file("new_vectors.bin",dims=640,mode="w") as output:
  for filename in [a,b,c,d]:
    hash = hasher.stable_transform(" ".join(open(filename).readlines()))
    output.add_row(filename,hash)

About

Python Module implementing SRP

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages