probablepeople

probablepeople is a Python library that parses unstructured romanized name or company strings into components, using advanced NLP methods. It is based on usaddress, a Python library for parsing addresses.

Try it out on our web interface! For those who aren't python developers, we also have an API.

What this can do: Using a probabilistic model, it makes (very educated) guesses in identifying name or corporation components, even in tricky cases where rule-based parsers typically break down.

What this cannot do: It cannot identify components with perfect accuracy, nor can it verify that a given name/company is correct/valid.

probablepeople learns how to parse names/companies through a body of training data. If you have examples of names/companies that stump this parser, please send them over! By adding more examples to the training data, probablepeople can continue to learn and improve.

How to use the probablepeople python library

Install probablepeople with pip, a tool for installing and managing python packages (beginner's guide here)

In the terminal,
```
pip install probablepeople  
```

Parse some names/companies!

Note that parse and tag are different methods:

```
import probablepeople as pp

name_str = 'Mr George "Gob" Bluth II'
corp_str = 'Sitwell Housing Inc'

# The parse method will split your string into components, and label each component.
pp.parse(name_str) # expected output: [('Mr', 'PrefixMarital'), ('George', 'GivenName'), ('"Gob"', 'Nickname'), ('Bluth', 'Surname'), ('II', 'SuffixGenerational')]
pp.parse(corp_str) # expected output: [('Sitwell', 'CorporationName'), ('Housing', 'CorporationName'), ('Inc', 'CorporationLegalType')]

# The tag method will try to be a little smarter
# it will merge consecutive components, strip commas, & return a string type
pp.tag(name_str) # expected output: (OrderedDict([('PrefixMarital', 'Mr'), ('GivenName', 'George'), ('Nickname', '"Gob"'), ('Surname', 'Bluth'), ('SuffixGenerational', 'II')]), 'Person')
pp.tag(corp_str) # expected output: (OrderedDict([('CorporationName', 'Sitwell Housing'), ('CorporationLegalType', 'Inc')]), 'Corporation')
```
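
To make the difference between parse and tag more concrete, here is a minimal sketch of the merging step described in the comments above. It is not probablepeople's own code: the function name `merge_labeled_tokens` is hypothetical, and the only assumption is that the input is a list of (token, label) pairs shaped like the parse output.

```python
from collections import OrderedDict


def merge_labeled_tokens(pairs):
    """Merge consecutive (token, label) pairs that share a label.

    Illustrative only: this mimics the behaviour described for tag and is
    not probablepeople's actual implementation.
    """
    merged = OrderedDict()
    previous_label = None
    for token, label in pairs:
        token = token.strip(" ,")  # drop surrounding commas and spaces
        if not token:
            continue
        if label == previous_label:
            # Same label as the previous token: extend that component.
            merged[label] += " " + token
        elif label in merged:
            # The label reappears after a different one. A real parser
            # needs a policy for this case; this sketch simply refuses.
            raise ValueError("repeated label: " + label)
        else:
            merged[label] = token
        previous_label = label
    return merged


parsed = [
    ("Sitwell", "CorporationName"),
    ("Housing", "CorporationName"),
    ("Inc", "CorporationLegalType"),
]
print(dict(merge_labeled_tokens(parsed)))
# {'CorporationName': 'Sitwell Housing', 'CorporationLegalType': 'Inc'}
```

The two CorporationName tokens collapse into one component, which is the practical reason to prefer tag when you want one value per label.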

Links:

Documentation: http://probablepeople.rtfd.org/
Web Interface: http://parserator.datamade.us/probablepeople
Distribution: https://pypi.python.org/pypi/probablepeople
Repository: https://github.com/datamade/probablepeople
Issues: https://github.com/datamade/probablepeople/issues
Blog post: http://datamade.us/blog/parse-name-or-parse-anything-really/

For the nerds:

Probablepeople uses parserator, a library for making and improving probabilistic parsers - specifically, parsers that use python-crfsuite's implementation of conditional random fields. Parserator allows you to train probablepeople's model (a .crfsuite settings file) on labeled training data, and provides tools for easily adding new labeled training data.

Building & testing development code

```
git clone https://github.com/datamade/probablepeople.git
cd probablepeople
pip install -r requirements.txt
python setup.py develop
parserator train name_data/labeled/labeled.xml,name_data/labeled/company_labeled.xml probablepeople
nosetests .
```

Creating/adding labeled training data (.xml outfile) from unlabeled raw data (.csv infile)

If the parser performs poorly on particular name/company formats, you can add examples of them to the training data. As probablepeople learns about new cases, it becomes smarter and more robust.

NOTE: The model doesn't need many examples to learn about new patterns - if you are trying to get probablepeople to perform better on a specific type of name, start with a few (<5) examples, check performance, and then add more examples as necessary.
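
One rough way to "check performance" is to compare the parser's labels against the labels you expected for a handful of strings. The sketch below is only an illustration: `token_accuracy` is a hypothetical helper, not part of probablepeople or parserator, and the `predicted` list is a hand-written stand-in for a parser result with one deliberate mistake.

```python
def token_accuracy(predicted, expected):
    """Return the fraction of tokens whose predicted label matches.

    Both arguments are lists of (token, label) pairs, the same shape as
    the parse output shown earlier in this README.
    """
    if len(predicted) != len(expected):
        raise ValueError("sequences must have the same number of tokens")
    if not expected:
        return 1.0
    correct = sum(1 for (_, p), (_, e) in zip(predicted, expected) if p == e)
    return correct / len(expected)


# Hand-written stand-in for a parser result; 'Bluth' is mislabeled on purpose.
predicted = [
    ("Mr", "PrefixMarital"),
    ("George", "GivenName"),
    ('"Gob"', "Nickname"),
    ("Bluth", "GivenName"),
    ("II", "SuffixGenerational"),
]
# The labels you wanted for the same tokens.
expected = [
    ("Mr", "PrefixMarital"),
    ("George", "GivenName"),
    ('"Gob"', "Nickname"),
    ("Bluth", "Surname"),
    ("II", "SuffixGenerational"),
]
print(token_accuracy(predicted, expected))  # 0.8
```

In practice `predicted` would come from calling parse on your own string; running the same check before and after re-training shows whether the added examples helped.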

For this parser, we are keeping person names and organization names separate in the training data. The two training files used to produce the model are:

name_data/labeled/labeled.xml for people
name_data/labeled/company_labeled.xml for organizations.

To add your own training examples, first put your unlabeled raw data in a csv. Then:

```
parserator label [infile] [outfile] probablepeople
```

[infile] is your raw csv and [outfile] is the appropriate training file to write to. For example, if you put raw strings in my_companies.csv, you'd use parserator label my_companies.csv name_data/labeled/company_labeled.xml probablepeople

The parserator label command will start a console labeling task, where you will be prompted to label raw strings via the command line. For more info on using parserator, see the parserator documentation.
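
If your raw strings live in Python rather than in a spreadsheet, the csv for the labeling step can be written with the standard library. This is a sketch under an assumption: it writes one raw string per row in a single column, which is not confirmed here as the exact layout parserator expects, so check the parserator documentation. The helper name `write_raw_strings` and the sample strings are made up for illustration.

```python
import csv
import os
import tempfile


def write_raw_strings(path, strings):
    """Write one raw string per row, in a single column, skipping blanks."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for s in strings:
            s = s.strip()
            if s:
                # csv.writer quotes values containing commas, so a string
                # such as 'Sitwell Housing, Inc' stays in one column.
                writer.writerow([s])


# A temporary directory keeps this demo from cluttering your working tree.
out_path = os.path.join(tempfile.mkdtemp(), "my_companies.csv")
write_raw_strings(out_path, ["Sitwell Housing Inc", "   ", "Sitwell Housing, Inc"])
print(out_path)
```

For real use, pass a path inside your checkout (for example my_companies.csv) and give that file to parserator label as the infile.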

Re-training the model

If you've added new training data, you will need to re-train the model. To set multiple files as traindata, separate them with commas.

```
parserator train [traindata] probablepeople
```

For example, to train the model on both labeled names and labeled companies:

```
parserator train name_data/labeled/labeled.xml,name_data/labeled/company_labeled.xml probablepeople
```

Contribute back by sending a pull request with your added labeled examples.
