scholar

This code provides an interface for Word2Vec in Python. It can use a part-of-speech tagged corpus to query for specific parts of speech. It was intended to be used with autoplay, a learning environment for interactive fiction.

Using linear algebra, we were able to pull affordances (relevant verbs) for nouns out of word2vec, with varying success on other parts of speech. This interface provides methods that perform those operations.
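
One way to picture this kind of query is to average the offset between a few known noun-verb pairs and apply that offset to a new noun. The sketch below only illustrates the general idea on hand-made vectors; scholar's actual computation may differ. The toy embeddings, the example pairs, and the helper names (`cosine`, `mean_offset`, `nearest`) are assumptions made for illustration.

```python
import math

# Toy 2-D embeddings invented for illustration (not real word2vec output).
vectors = {
    "bread_NN": [1.0, 0.0],
    "eat_VB": [1.0, 1.0],
    "water_NN": [2.0, 0.0],
    "drink_VB": [2.0, 1.0],
    "book_NN": [3.0, 0.0],
    "read_VB": [3.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_offset(pairs):
    """Average of (verb - noun) over a list of (noun, verb) pairs."""
    dims = len(vectors[pairs[0][0]])
    total = [0.0] * dims
    for noun, verb in pairs:
        for i in range(dims):
            total[i] += vectors[verb][i] - vectors[noun][i]
    return [t / len(pairs) for t in total]

def nearest(query, candidates):
    """Candidate word whose vector is most similar to the query vector."""
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# Learn a noun-to-verb offset from two pairs, then apply it to a new noun.
offset = mean_offset([("bread_NN", "eat_VB"), ("water_NN", "drink_VB")])
query = [n + o for n, o in zip(vectors["book_NN"], offset)]
verbs = [w for w in vectors if w.endswith("_VB")]
best_verb = nearest(query, verbs)
```

With real embeddings the offset is noisy, which is consistent with the varying success reported for other parts of speech.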

Requirements

Our processed files are available here.

Several bin files are provided: (1) an untagged copy of Wikipedia from January 2016, (2) a part-of-speech tagged copy of Wikipedia, and (3) a truncated version of (2). Following common practice in natural language processing, we used Penn Treebank tags to denote parts of speech. We tagged our copy of Wikipedia using Parsey McParseface and used the tagged text as input to word2vec.
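
In the tagged corpus, each token carries its Penn Treebank tag as a suffix, as in the `dog_NN` strings used in the examples further down. The helpers below are a small sketch of building and splitting such tokens; `tag_word` and `split_tagged` are hypothetical names and are not part of scholar.

```python
def tag_word(word, tag):
    """Join a word and a Penn Treebank tag into a tagged token, e.g. 'dog_NN'."""
    return f"{word}_{tag}"

def split_tagged(token):
    """Split a tagged token at its last underscore.

    Returns (word, tag), or (token, None) when there is no underscore.
    """
    word, sep, tag = token.rpartition("_")
    if not sep:
        return token, None
    return word, tag
```

Splitting at the last underscore keeps words that themselves contain underscores intact.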

The truncated corpus contains the 30,000 most common singular nouns on Wikipedia and the 3,000 most common verbs. It is stored in pickle format and loads much faster than the full binary, but some methods are unavailable with it. We recommend the truncated corpus where possible: the speed-up is significant and most of the core functionality is included. The download link above also provides precomputed distributions of word-to-POS tags, which speed up computation at runtime.
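
To make the idea of a precomputed word-to-tag distribution concrete, the sketch below counts how often each word appears with each tag and round-trips the table through pickle. It assumes tokens of the form `word_TAG`; the function names and the sample tokens are illustrative, and the layout of the files in the download may be different.

```python
import pickle
from collections import Counter, defaultdict

def tag_distribution(tagged_tokens):
    """Count, for every word, how often it occurs with each tag."""
    counts = defaultdict(Counter)
    for token in tagged_tokens:
        word, sep, tag = token.rpartition("_")
        if sep:  # skip tokens that carry no tag
            counts[word][tag] += 1
    return counts

def most_common_tag(counts, word):
    """Most frequent tag for a word, or None if the word was never seen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

# Made-up sample tokens, not taken from the real corpus.
tokens = ["dog_NN", "dog_NN", "dog_VB", "walk_VB", "walk_NN", "walk_VB"]
counts = tag_distribution(tokens)

# Serializing once and reloading later avoids recounting at runtime.
blob = pickle.dumps(dict(counts))
restored = pickle.loads(blob)
```

A lookup in the spirit of `get_most_common_tag` could then be answered from the restored table without touching the corpus again.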

Usage

There are three options for using scholar:

1. Untagged words (gives basic word2vec functionality, 4 million words)
2. pos-tagged words (gives basic word2vec functionality and part-of-speech queries, 4 million words)
3. Slim pos-tagged words (limited word2vec functionality and part-of-speech queries, 33k most popular nouns and verbs)

Usage Examples

The available methods using the full untagged corpus are demonstrated below. Most methods don't require tags, but some miscellaneous methods that do require tags are maintained for convenience.

import scholar.scholar as sch

# This will load word2vec using the full untagged corpus
s = sch.Scholar(tags=False)

# With the untagged corpus, these methods take plain words (no Penn Treebank tag)

s.get_cosine_similarity('man', 'woman')
s.analogy('king -man woman')
s.get_angle('dog', 'cat')
dog_vec = s.get_vector('dog')
s.get_words(dog_vec, 10)

# These methods may or may not require Penn Treebank tags.

# Miscellaneous
s.get_most_common_words('VB', 10) # Takes a tag as a parameter
s.get_most_common_tag('dog')
s.get_words_by_rarity('the boy walked across the wasteland.')
s.exists_in_model('dog')
s.exists_in_model_untagged('dog') # With an untagged corpus, this is identical to exists_in_model
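
For readers unfamiliar with the similarity and angle calls above, the following sketch shows the underlying math on plain Python lists. The function names are illustrative, and reporting the angle in degrees is an assumption; the source does not say which unit `get_angle` uses.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angle_degrees(a, b):
    """Angle between a and b, in degrees (unit chosen for this sketch)."""
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(cos))
```

Vectors pointing the same way have similarity close to 1 and an angle close to 0, while perpendicular vectors have similarity 0.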

The available methods using the full pos-tagged corpus are demonstrated below. Some methods require a pos-tag, and some of them don't.

import scholar.scholar as sch

# This will load word2vec using the full tagged corpus
s = sch.Scholar()

# These methods require a Penn Treebank tag

s.get_cosine_similarity('man_NN', 'woman_NN')
s.analogy('king_NN -man_NN woman_NN')
s.get_angle('dog_NN', 'cat_NN')
dog_vec = s.get_vector('dog_NN')
s.get_words(dog_vec, 10)

# The methods below do NOT require a Penn Treebank tag, but will accept one.

# For singular nouns
s.get_verbs('mailbox', 10) # Where 10 is the number of desired results
s.get_adjectives('mailbox', 10)
s.get_hypernyms('tree', 10)
s.get_hyponyms('weapon', 10)
s.get_parts('house', 10)
s.get_whole('bread', 10)

# For plural nouns
s.get_verbs_plural('mailboxes', 10)
s.get_adjectives_plural('mailboxes', 10)
s.get_hypernyms_plural('trees', 10)
s.get_hyponyms_plural('weapons', 10)
s.get_parts_plural('houses', 10)
s.get_whole_plural('loaves', 10)

# For verbs
s.get_nouns('purchase', 10)
s.get_nouns_plural('purchase', 10)

# These methods may or may not require Penn Treebank tags.

# Miscellaneous
s.get_most_common_words('VB', 10) # Takes a tag as a parameter
s.get_most_common_tag('dog') # Does not require tag
s.get_words_by_rarity('the boy walked across the wasteland.') # Does not require tag
s.exists_in_model('dog_NN') # Requires tag
s.exists_in_model_untagged('dog') # Does not require tag
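
The analogy query above reads naturally as signed vector arithmetic: words prefixed with `-` are subtracted and the rest are added. The sketch below works through that reading on toy vectors. The parsing rule is inferred from the example string, the helper names and the 3-D vectors are invented, and Euclidean distance is used for simplicity, so this is not a description of how scholar implements `analogy`.

```python
def parse_analogy(query):
    """Turn 'king_NN -man_NN woman_NN' into a list of (sign, word) terms."""
    terms = []
    for token in query.split():
        if token.startswith("-"):
            terms.append((-1.0, token[1:]))
        else:
            terms.append((1.0, token))
    return terms

def combine(terms, vectors):
    """Add or subtract each term's vector according to its sign."""
    dims = len(next(iter(vectors.values())))
    result = [0.0] * dims
    for sign, word in terms:
        for i, value in enumerate(vectors[word]):
            result[i] += sign * value
    return result

def nearest(target, vectors, exclude=()):
    """Word closest to target by squared Euclidean distance."""
    def sq_dist(word):
        return sum((a - b) ** 2 for a, b in zip(target, vectors[word]))
    return min((w for w in vectors if w not in exclude), key=sq_dist)

# Toy vectors with axes (royal, male, female), invented for illustration.
toy = {
    "king_NN": [1.0, 1.0, 0.0],
    "man_NN": [0.0, 1.0, 0.0],
    "woman_NN": [0.0, 0.0, 1.0],
    "queen_NN": [1.0, 0.0, 1.0],
    "dog_NN": [0.0, 0.0, 0.0],
}

terms = parse_analogy("king_NN -man_NN woman_NN")
target = combine(terms, toy)
# The query words are excluded so the answer is not one of the inputs.
answer = nearest(target, toy, exclude={word for _, word in terms})
```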

The methods available using the slim pos-tagged corpus are below. Again, some require tags, and some don't.

import scholar.scholar as sch

# This will load word2vec using the truncated corpus
s = sch.Scholar(slim=True)

# These methods require a Penn Treebank tag

s.get_cosine_similarity('man_NN', 'woman_NN')
s.analogy('king_NN -man_NN woman_NN')
s.get_angle('dog_NN', 'cat_NN')
dog_vec = s.get_vector('dog_NN')
s.get_words(dog_vec, 10)

# The methods below do NOT require a Penn Treebank tag, but will accept one.

# For singular nouns
s.get_verbs('mailbox', 10) # Where 10 is the number of desired results
s.get_adjectives('mailbox', 10)
s.get_hypernyms('tree', 10)
s.get_hyponyms('weapon', 10)
s.get_parts('house', 10)
s.get_whole('bread', 10)

# These methods may or may not require Penn Treebank tags.

# Miscellaneous
s.get_most_common_words('VB', 10) # Takes a tag as a parameter
s.get_most_common_tag('dog') # This method should never be run with a tag
s.get_words_by_rarity('the boy walked across the wasteland.') # Does not require tag
s.exists_in_model('dog_NN') # Requires tag
s.exists_in_model_untagged('dog') # Does not require tag

danielricks/scholar
