Algorithms for string classification and string embeddings using 'weak' supervision, with eventual application to 'schema alignment'.
NB: This package is in the middle of an API redefinition and simplification. The master
branch is functional, but keep an eye out for changes. Ongoing work is being done on the api-v3
branch.
For schema alignment, the basic idea is to:
- learn an embedding of strings into dense N-dimensional vector representations such that instances of the same variable are closer to each other than to instances of other variables (using recurrent neural networks)
- align variables whose embedded distributions are "close" (by solving an assignment problem)
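
The alignment step can be sketched as follows. This is a minimal illustration, not the package's implementation: it assumes each variable is summarized by the centroid of its embedded strings, that "close" means Euclidean distance between centroids, and that there are few enough variables for a brute-force search. The names `centroid`, `align`, `forum_a` and `forum_b`, and the toy 2-d vectors, are hypothetical.

```python
from itertools import permutations
from math import dist


def centroid(vectors):
    """Mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def align(source, target):
    """Match each source variable to a distinct target variable.

    `source` and `target` map variable names to lists of embedded strings.
    Assumes len(source) <= len(target); returns None otherwise.
    Brute force over permutations, so only suitable for a handful of variables.
    """
    src_names, tgt_names = list(source), list(target)
    src_c = {name: centroid(vecs) for name, vecs in source.items()}
    tgt_c = {name: centroid(vecs) for name, vecs in target.items()}

    best_cost, best_match = float("inf"), None
    for perm in permutations(tgt_names, len(src_names)):
        # Total distance between matched centroids for this candidate assignment.
        cost = sum(dist(src_c[s], tgt_c[t]) for s, t in zip(src_names, perm))
        if cost < best_cost:
            best_cost, best_match = cost, dict(zip(src_names, perm))
    return best_match


# Toy 2-d "embeddings", made up for illustration.
forum_a = {"username": [[0.0, 1.0], [0.2, 0.9]], "post_id": [[5.0, 5.0], [5.1, 4.8]]}
forum_b = {"id": [[4.9, 5.2], [5.0, 5.0]], "user": [[0.1, 1.1], [0.0, 0.8]]}

alignment = align(forum_a, forum_b)
print(alignment)
```

For more than a few variables, a polynomial-time assignment solver (for example the Hungarian algorithm) would replace the permutation search, and a distance between whole distributions could replace the centroid distance.
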
Here are two ways that we could think about similarity of strings:
- syntactic: strings are similar because they have similar structure
  - usernames: 'ben46' is close to 'frank123'
  - subject_line: 'Re: good morning' is close to 'Re: circling back'
- semantic: strings are similar because of extrinsic information about the world
  - date: '2016-01-01' is close to 'Jan 1st 2016'
  - country: 'AR' is close to 'Argentina'
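
To make the syntactic/semantic distinction concrete, here is a hand-written stand-in for "structure": a coarse character-class signature. This is only an illustration and an assumption on our part; the package learns similarity with neural networks rather than using a rule like this, and the `shape` helper is hypothetical.

```python
def shape(s):
    """Collapse a string to a coarse character-class signature.

    Runs of letters become "a", runs of digits become "9", runs of
    whitespace become " ", and other characters are kept (runs collapsed).
    """
    out = []
    for ch in s:
        if ch.isdigit():
            cls = "9"
        elif ch.isalpha():
            cls = "a"
        elif ch.isspace():
            cls = " "
        else:
            cls = ch
        # Only record a class when it differs from the previous one.
        if not out or out[-1] != cls:
            out.append(cls)
    return "".join(out)


# Syntactically similar strings share a signature.
print(shape("ben46"), shape("frank123"))
print(shape("Re: good morning"), shape("Re: circling back"))

# Semantically similar strings need not: structure alone cannot tell
# that these two dates are the same.
print(shape("2016-01-01"), shape("Jan 1st 2016"))
```
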
And here are two ways we could think about similarity of sets of strings:
- distributional: sets have similar distributions
  - forum post_id: (near?) unique key
  - forum username: may follow similar distributions across domains
- relational: sets have similar relationships to other sets of strings
  - relationship (eg mutual information) between post_id and username may be similar across domains
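
Both set-level signals can be estimated with simple counts. The sketch below is an assumption about how one might measure them, not code from this repo: `uniqueness` is a hypothetical proxy for "(near?) unique key", and `mutual_information` is the standard empirical estimate over paired columns. The toy columns are made up.

```python
from collections import Counter
from math import log2


def uniqueness(values):
    """Fraction of distinct values; 1.0 means the column is a unique key."""
    return len(set(values)) / len(values)


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two aligned columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Toy forum table: four posts written by two users.
post_id = ["p1", "p2", "p3", "p4"]
username = ["ben46", "ben46", "frank123", "frank123"]

print(uniqueness(post_id))    # distributional signal for post_id
print(uniqueness(username))   # distributional signal for username
print(mutual_information(post_id, username))  # relational signal
```

Comparing these numbers across two forums gives a rough sense of whether their columns play the same role, independently of what the individual strings look like.
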
Prototype code for calculating syntactic and semantic similarity is included in this repo.
- wit/examples/string-example.py shows how to build a string classifier (ie semantic)
- wit/examples/simple-embedding-example.py shows how to use the triplet loss function to learn a string embedding (ie syntactic)
- wit/examples/simple-alignment-example.py -- splitting and re-aligning a simple dataset
- wit/notebooks/address-matching.ipynb -- trying to learn a good metric for addresses
- wit/notebooks/simple-forum-notebook.py -- aligning schemas of multiple forums at once
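
For readers unfamiliar with the triplet loss mentioned above, here is a minimal sketch of the usual hinge formulation on Euclidean distances. It is not taken from the example scripts; the function name, the default `margin`, and the toy vectors are assumptions, and some formulations use squared distances instead.

```python
from math import dist


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on Euclidean distances.

    Zero once the anchor is closer to the positive (same variable) than to
    the negative (different variable) by at least `margin`.
    """
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)


# Toy 2-d embeddings, made up for illustration.
anchor = [0.0, 0.0]    # e.g. an embedded username
same_var = [0.0, 1.0]  # another username, distance 1 from the anchor
other_var = [3.0, 4.0] # a string from another variable, distance 5

good = triplet_loss(anchor, same_var, other_var)  # already well separated
bad = triplet_loss(anchor, other_var, same_var)   # roles swapped: penalized
print(good, bad)
```

Minimizing this loss over many (anchor, positive, negative) triples pulls instances of the same variable together and pushes instances of different variables apart, which is the property the embedding step relies on.
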
See https://github.com/gophronesis/census-schema-alignment for some more concrete examples, developed during the January 2016 XDATA census hackathon.