Project 1: WordNet
CMSC 330, Spring 2018
Due Monday February 12th (Late Tuesday, 13th).
This is an individual assignment. You must work on this project alone.
WordNet is a semantic lexicon for the English language that is used extensively by computational linguists and cognitive scientists. WordNet groups words into sets of synonyms called synsets and describes semantic relationships between them. Relevant to this project is the is-a relationship, which connects a hyponym (more specific synset) to a hypernym (more general synset). For example, a plant organ is a hypernym to plant root and plant root is a hypernym to carrot.
To begin this project (and all other projects for this course), you will need to clone the git repository. Click here for directions on cloning the repository. The following are the relevant files:
- Ruby Files
- graph.rb: A simple graph implementation for use in your WordNet. This has been completed for you, and is optional to use.
- wordnet.rb: Contains a skeleton implementation of the three classes that you will complete for this project.
- interactive.rb: A frontend that utilizes the methods you have written to interface with WordNet. This has been completed for you, though it will not be functional until your WordNet is implemented.
- Submission Scripts
- submit.rb: Execute this script to submit your project to the submit server.
- submit.jar and .submit: Don't worry about these files, but make sure you have them.
Structure of WordNet Graph
In order to perform operations on WordNet, you will construct your own representation of hypernym relationships using the provided graph implementation. Each vertex
v is a non-negative integer representing a synset id, and each directed edge
w as a hypernym of
v. The graph is directed and acyclic (DAG), though not necessarily a tree since each synset can have several hypernyms. A small subset of the WordNet graph is illustrated below.
Input File Formats
The WordNet is represented by two files which must each be formed as described below in order to be valid. A major part of this project will be to process and load from these supplied input files.
The synsets file is a list of all of the synsets in WordNet (i.e. the vertices of the graph above). A synset is a list of nouns that share the same meaning. Each line of a valid synsets file consists of two fields:
- Synset id: A non-negative integer identifying the synset.
- Synset: A comma-delimited list of one or more nouns that belong to the synset. Nouns are made up of letters (uppercase and lowercase), numbers, underscores, dashes, periods, apostrophes, and forward slashes. These criteria will always define valid nouns wherever valid nouns are referenced in this document.
Note: A noun can appear in more than one synset. A noun will appear once for each meaning the noun has. For example, all of the following synsets contain the noun "word", but with different meanings.
id: 37559 synset: discussion,give-and-take,word id: 50266 synset: news,intelligence,tidings,word id: 60429 synset: parole,word,word_of_honor id: 60430 synset: password,watchword,word,parole,countersign
The hypernyms file contains the hypernym relationships between synsets. Each line of a valid hypernyms file contains two fields:
- Synset id: A non-negative integer identifying the synset these edges originate from.
- Hypernym ids: A comma-delimited list of one or more non-negative integers representing synsets that edges will go to.
Each line of the file represents a set of edges from a synset to its hypernyms. For example, the line
from: 171 to: 22798,57458
means that the synset 171 ("Actified") has 2 hypernyms: 22798 ("antihistamine") and 57458 ("nasal_decongestant"), meaning that Actified is both an antihistamine and a nasal decongestant. The synsets are obtained from lines in the synsets file with the corresponding synset ids.
Note: A synset's hypernyms are not restricted to being listed on a single line. They may be split among multiple lines. For example, the hypernyms from the example above may also be represented as follows:
from: 171 to: 22798 from: 171 to: 57458
A Note on Types
Ruby has no built in way to restrict the types passed to methods. As such, all method types specified in this document are the only ones you need to handle. You may assume that no arguments will be passed outside of the types we specify, and your program may do anything in cases where improperly typed arguments are passed. This is undefined behavior for this program and will not be tested.
The expected types will be represented in the following format at the beginning of each section:
load : (String) -> Array or nil
This example defines a method called
load that takes a single String as an argument and either returns an Array or nil, meaning that you may assume that a String will be passed in and you are responsible for ensuring that only an Array or nil is returned. This also means that a subclass of String could be passed in or that a subclass of Array could be returned.
Note: Some shorthand is used to avoid verbosity in type siguatures.
Integeris used to refer to the union of
Boolis used to refer to the union of
nilis used to refer to
Part 1: Synsets
The first part of this project consists of parsing from the synset file and implementing operations that pertain to single synsets. Information on the valid file format can be found above. These methods will be implemented in the
The methods you must implement have the following signatures:
class Synsets addSet : (Integer, Array) -> Bool load : (String) -> Array or nil lookup : (Integer) -> Array findSynsets : (Object) -> Array or Hash or nil end
Adding to the Synsets
You must first implement a method
addSet(synset_id, nouns) which, given a non-negative synset id and an array of nouns in the synset, adds this entry to the object. If the synset id is negative, the list of nouns is empty, or the synset id has already been defined, return false without modifying the current object. Otherwise, return true and store the new synset.
Loading from File
You must next implement a method
load(synsets_file) which, given a String representing a path to the synset file to load, validates the file and stores its contents in a logical way. The exact internal representation of the data is up to you, but a hash is a very advisable structure for this purpose.
You may assume that the file exists. If any line is invalid misformed according to the rules in the file format section, the object's state should not be changed and an array of invalid line numbers (indexed from 1) should be returned in ascending order. Additionally, lines attempting to re-define a synset should be counted as invalid and be added to this array. That is:
id: 37559 synset: discussion ... id: 37559 synset: word
should not create a synset with two words, but should rather count the second definition as an invalid line.
This method may be called multiple times. Repeated loads should each augment the contents of the structure, adding all entries from the new file to the current instance and essentially behaving like extensions of previous loads, almost like one single long file.
If all lines of the file are properly formed as described in the file format section, then the data should be stored (again, exactly how is up to you) and
nil should be returned. If any lines are misformed as described above, the load should fail, the stored data should remain exactly as before the call (even ignoring correct lines from the file), and the aforementioned array of line numbers (indexed from 1) should be returned.
lookup(synset_id) should take a synset id and return an array of the nouns in the synset. If there is no entry for the requested id, return an empty array.
findSynsets(to_find) will behave differently depending on if it is given a String or an Array as its argument.
- String: When supplied a String, this method should return an array of 0 or more synset ids corresponding to synsets containing this noun.
- Array: When supplied an array of nouns, this method should return a hash where each key is a single noun from the array and each value is an array of synset ids corresponsing to synsets containing the noun. This is similar to when a String argument is provided, but allows for batch processing.
- Anything else: If anything but a String or an Array is provided, this method should simply return
Part 2: Hypernyms
The second part of this project requires you to construct and operate on a graph representing the hypernym relationships among members of the WordNet.
Hypernyms class, the methods you must implement have the following signatures:
class Hypernyms addHypernym : (Integer, Integer) -> Bool load : (String) -> Array or nil lca : (Integer, Integer) -> Array or nil end
You must first implement a method
addHypernym(source, destination) which, given two synset ids, creates a hypernym relationship where the destination is a hypernym of the source. This method should return
false if either supplied synset id is negative or if both synset ids are the same. Otherwise, the new hypernym relationship should be added and
true should be returned. If the hypernym relation already exists, you should still return true.
Loading from File
Just as in part 1, you must implement a method
load(hypernyms_file) which, given a String representing a path to the hypernyms file to load, validates the file and stores its contents in an instance of the provided graph class. Although you are not responsible for writing the
Graph class yourself, you are responsible for understanding how it works.
You may assume that the file exists. If any line is invalid misformed according to the rules in the file format section, the object's state should not be changed and an array of invalid line numbers should be returned in ascending order. Unlike part 1, lines that are properly formed but define edges from an id that has already had outgoing edges defined should be counted as valid. That is:
from: 1 to: 2 ... from: 1 to: 4 ... from: 1 to: 4
should count all three lines as valid, but should not actually create duplicate edges.
If any line of the file is invalid, the object's state should not be changed and an array of invalid line numbers should be returned. If all lines of the file are properly formed as described in the file format section, then the data should be stored and
nil should be returned.
This method may be called multiple times. Repeated successful loads should each augment the contents of the structure, treating it just like an initial load. However, if a load fails on a structure that has already been successfully loaded, the data should not be replaced or affected in any way.
In this part, you will find the least common ancestors of two nouns. An ancestral path between two ids
w in the graph is a directed path from
v to a common ancestor
x, together with a directed path from
w to the same ancestor
x. A shortest ancestral path, or SAP, between ID
v and ID
w is an ancestral path of minimum total length. The common ancestor that participates in the path is called the lowest common ancestor, or LCA, of v and w. In the graph below, one ancestral path between ids 4 and 6 is of length 3 with common ancestor 5 (4->5 and 6->7->5). However, the SAP between 4 and 6 is of length 2 with common ancestor 7 (4->7 and 6->7). This makes 7 the LCA of 4 and 6.
lca(id1, id2) should take a synset id and return an array of all synset ids that are lowest common ancestors of the two provided synset ids according to the definition provided above. If either of the ids are not contained in the graph, this method should return
nil. If no common ancestor is found, then an empty array should be returned.
Part 3: WordNet Command Line
We have provided a simple frontend to WordNet for you, however we have not implemented the parser that recognizes commands and gives back responses. The frontend can be started by running interactive.rb (which you do not need to modify, though you may if you would like to add your own testing features).
The WordNet environment tracks an instance of
Synsets and an instance of
Hypernyms and performs operations on them according to provided commands. The structure of the
CommandParser class is as follows:
class CommandParser parse : (String) -> Hash end
All you must implement for the
CommandParser class is the method
parse(command). This method is given a String command and is responsible for performing an action and providing a response. A command may contain arbitrary leading and trailing whitespace, and arguments may be separated by one or more whitespace characters. A response is represented as a hash with entries as follows:
- :recognized_command (the leading colon is intentional; this is a symbol, not a String): This will vary based on the command, but the purpose is to identify how to process the data portion of the response. Each command section that follows will clearly specify what to set this field to. If the command is not recognized, set the value to
- :result: This key's value will correspond to the resulting data after processing the command.
The commands that you will need to parse are as follows:
load <synset_file> <hypernym_file>
Behavior: The load command takes two arguments, the synset file name and the hypernym file name respectively. File names may contain word characters (i.e. letters, numbers, and underscores) as well as forward slashes, dashes, and periods and must be validated as part of checking the structure of the command. The files should be load into
Synsets object and the
Hypernyms objects using their
load : (String) methods. Neither object should be modified unless all the following conditions are met:
- The synset file is valid
- The hypernym file is valid
- All synsets referenced by the hypernym file are defined in the synsets file or in the already-loaded synsets of the contained
Synsetsobject (remember, multiple loads augment the data instead of replacing it).
You may add any methods you need to any of the provided classes in order to validate these criteria.
:result field should be set to
true in case of success and
false in case of failure according to the criteria above. If the command is misformed or additional arguments are supplied, then set
Behavior: The lookup command takes exactly one synset id as an argument and sets
:result to whatever is returned by calling
lookup : (Integer) on the
Synsets object. If the synset id isn't properly formed (by the same rules as the synset input file) or the wrong number of arguments is supplied then
:result should be set to
Behavior: The find command sets
:result to the result of
findSynsets : Object called with only one noun passed as a single String. If the noun is invalid (by the same rules as the synset input file) or additional arguments are supplied then
:result should be set to
Behavior: The findmany command sets
:result to the result of
findSynsets : Object called with one or more nouns passed as an Array. The list of nouns will be comma separated with no whitespace just as in a valid input file. If the form of the list isn't valid or additional arguments are supplied, then set
lca <id1> <id2>
Behavior: The lca command takes two synset ids as arguments and looks up their least common ancestors. The
:result field should be set to the result of calling
lca : (Integer, Integer), or
:error if the command is misformed (i.e. wrong number of arguments or the arguments are not valid Integers).
You should submit a file
wordnet.rb containing your solution. You may submit other files, but they will be ignored during grading. We will run your solution as MiniTest test units, just as in the provided public test file.
Be sure to follow the project description exactly! Your solution will be graded automatically, so any deviation from the specification will result in lost points.
You can submit your project in two ways:
- Submit your
wordnet.rbfile directly to the submit server by clicking on the submit link in the column "web submission".
Then, use the submit dialog to submit your
Select your file using the "Browse" button, then press the "Submit project!" button. You do not need to put it in a zip file.
- Submit directly by executing a the submission script on a computer with Java and network access. Included in this project are the submission scripts and related files listed under Project Files. These files should be in the directory containing your project. From there you can either execute submit.rb or run the command
java -jar submit.jardirectly (this is all submit.rb does).
No matter how you choose to submit your project, make sure that your submission is received by checking the submit server after submitting.
Please carefully read the academic honesty section of the course syllabus. Any evidence of impermissible cooperation on projects, use of disallowed materials or resources, or unauthorized use of computer accounts, will be submitted to the Student Honor Council, which could result in an XF for the course, or suspension or expulsion from the University. Be sure you understand what you are and what you are not permitted to do in regards to academic integrity when it comes to project assignments. These policies apply to all students, and the Student Honor Council does not consider lack of knowledge of the policies to be a defense for violating them. Full information is found in the course syllabus, which you should review before starting.