*Anna Goecke, Rodrigo Lopez Portillo Alcocer, Elizabeth Pankratz*

In this notebook, we walk the reader through how we developed a gold standard for our Taboo card generator based on existing Taboo cards, and how we created brand new Taboo cards.

# Section 1: Developing a gold standard

In this section, we present how we used our gold standard dataset to generate a probability distribution, which we then used to determine which semantic relations are most likely to appear on a Taboo card.
This gives us a way to generate more accurate, true-to-life versions of Taboo cards.

The 240 Taboo cards that our gold standard is based on belong to Elizabeth's family's Canadian edition of Taboo, produced sometime in the 1990s or early 2000s.
These cards were transcribed into the text file `taboo_cards.txt` (in the current directory).
The module `gs_probdist` contains the functions needed for this section.
(Each function is documented with a docstring detailing the arguments and returned values, if the reader wants more information.)

In [1]:
import gs_probdist as gspd

As a first step, the function `get_card_dicts()` combines two other `gspd` functions, `read_in()` and `format_cards()`, to read in the transcribed Taboo cards and produce a dictionary whose entries each represent one of the cards.
For example, the main word (MW) "syrup" is assigned to the dictionary's key, and its value is a list of the five corresponding taboo words (TWs).

In [2]:
card_dict = gspd.get_card_dicts()
card_dict['syrup']

['maple', 'pancakes', 'trees', 'sap', 'sweet']

`get_card_dicts()` provides the input for the function `cards_to_df()`, which converts this dictionary to a pandas dataframe.
In this dataframe, each row pairs the MW with one of its TWs, so a card with five TWs takes up five rows.
This format allows for easy annotation of the semantic relationship between each MW/TW pair.
The following cell shows the first five rows of the resulting dataframe, corresponding to one card.

In [3]:
gspd.cards_to_df(card_dict).head()

Unnamed: 0,mw,tw
0,huddle,gather
1,huddle,football
2,huddle,group
3,huddle,play
4,huddle,together
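
The reshaping step can be sketched without pandas. This is only an illustration and not the code in `gs_probdist`: the helper name `cards_to_rows` is hypothetical, and it produces plain tuples and CSV text rather than a dataframe.

```python
import csv
import io


def cards_to_rows(card_dict):
    """Flatten {mw: [tw, ...]} into one (mw, tw) row per pairing."""
    return [(mw, tw) for mw, tws in card_dict.items() for tw in tws]


# Single-card input, reusing the "syrup" card shown earlier.
cards = {'syrup': ['maple', 'pancakes', 'trees', 'sap', 'sweet']}
rows = cards_to_rows(cards)

# Write the rows as CSV text, ready for manual annotation.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['mw', 'tw'])
writer.writerows(rows)
print(buffer.getvalue())
```

Each card contributes five rows, so the annotation columns can simply be added to the right of `tw`.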


At this point, we exported this dataframe to a CSV file and manually assigned each MW/TW pair to one of the following types of semantic relationship:

- **collocations** (i.e. combinations of words at rates more frequent than chance; see Evert 2009)
- **synonyms** (words meaning the same thing)
- **antonyms** (words with opposite meanings)
- **hyponyms** (a subset of a word's meaning, i.e. a more specific version)
- **hypernyms** (a superset of a word's meaning, i.e. a more general version)

We also had categories for cultural references---MWs and TWs relating in a way that requires cultural or world knowledge---and a catch-all "other" category.
We did not make an attempt to replicate these two categories, since our focus was on the linguistic aspect of this project.
We chose these semantic relations since they are fairly easy to operationalise using two tools that we have gotten to know this semester: word2vec word embeddings (for the collocations, since word embeddings represent textual co-occurrence; **CITATION NEEDED**) and WordNet (for all the others; **CITATION NEEDED**).

(The annotation guidelines can be found in `gs-annotation-guidelines.txt`.)

The rest of the functions in `gspd` are used to process the annotated gold standard dataset, which is saved as `gold-std-categorised.csv` in the current directory.
First, `read_in_categorised()` reads the CSV file into a pandas dataframe with one row per MW/TW pairing, a 1 in the column of the category that the pair belongs to, and zeroes everywhere else, as illustrated in the next cell.

In [4]:
goldstd_data = gspd.read_in_categorised()
goldstd_data.head()

Unnamed: 0,mw,tw,semrel_synonym,semrel_antonym,semrel_hyponym,semrel_hypernym,collocation,cultural_ref,other
0,huddle,gather,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,huddle,football,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,huddle,group,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,huddle,play,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,huddle,together,0.0,0.0,0.0,0.0,1.0,0.0,0.0


This dataframe is the input for the final two functions of `gspd`.
The first, `plot_category_freqs()`, creates the bar plot used in our presentation at the beginning of February and also exports the plot to a PDF in the current directory.
The second, and the crucial one, is `freq_dist_to_prob_dist()`, which converts the frequency of each category into a probability distribution.
This probability distribution was then used to randomly supply a category for each of the five TW slots, in proportion to that category's actual frequency in the real Taboo cards.

In [5]:
gspd.plot_category_freqs(goldstd_data);

In [6]:
category_prob_dist = gspd.freq_dist_to_prob_dist(goldstd_data)
category_prob_dist

{'collocation': 0.6634275618374559,
 'semrel_synonym': 0.19699646643109542,
 'semrel_hypernym': 0.10159010600706714,
 'semrel_hyponym': 0.020318021201413426,
 'semrel_antonym': 0.0176678445229682}
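
As a rough sketch of this step, the following turns a list of category labels into a probability distribution and then draws five categories from it. It is not the code in `gs_probdist`: the function names, the toy labels, and the use of `random.choices` are assumptions. Because the five probabilities above sum to 1, the sketch also assumes that cultural references and "other" are dropped before normalising.

```python
import random
from collections import Counter

# The five categories we try to reproduce on generated cards.
CATEGORIES = ['semrel_synonym', 'semrel_antonym', 'semrel_hyponym',
              'semrel_hypernym', 'collocation']


def counts_to_prob_dist(labels, categories=CATEGORIES):
    """Normalise category counts over the retained categories only."""
    counts = Counter(label for label in labels if label in categories)
    total = sum(counts.values())
    return {cat: counts[cat] / total for cat in categories}


def sample_categories(prob_dist, k=5, rng=random):
    """Draw k categories, each in proportion to its probability."""
    cats = list(prob_dist)
    weights = [prob_dist[cat] for cat in cats]
    return rng.choices(cats, weights=weights, k=k)


# Toy annotation labels (hypothetical, not the gold standard data).
toy_labels = (['collocation'] * 6 + ['semrel_synonym'] * 2
              + ['semrel_hypernym', 'semrel_antonym', 'cultural_ref', 'other'])
toy_dist = counts_to_prob_dist(toy_labels)
five_slots = sample_categories(toy_dist)
print(toy_dist)
print(five_slots)
```

Sampling with replacement means the same category can fill several slots, which is what we want given that most real TWs are collocations.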

# Section 2: Determining semantic relations for the given word

At this point, we know the probability of each TW slot being assigned one of the five categories.
What we still need, for any MW given as input, are the words that belong to each of these categories, so that the final step can put everything together.
This is where the functions defined in the module `semrel` come in.

In [7]:
import semrel as sr

One major pillar of this module is `make_semrel_dict()`, which takes in the given MW and uses NLTK's interface to WordNet to populate a dictionary with that word's synonyms, antonyms, hypernyms, and hyponyms.
It uses other functions defined in this module: `get_synonyms()`, `get_antonyms()`, `get_hypernyms()`, and `get_hyponyms()`.
Two helper functions are also defined in this module: `wd_to_synsets()` (used in `get_synonyms()`), and `synset_to_wd()` (used in `get_hypernyms()` and `get_hyponyms()`).
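
To give an idea of what such a lookup involves, here is a minimal sketch using NLTK's WordNet interface. It is not the code in `semrel`: the names `collect_relations` and `wordnet_relations` are hypothetical, and the sketch keeps every lemma of every related synset, so its results will generally be larger than what `make_semrel_dict()` returns.

```python
def collect_relations(word, synsets):
    """Collect related lemma names from objects with NLTK's Synset interface."""
    relations = {
        'semrel_synonym': set(),
        'semrel_antonym': set(),
        'semrel_hypernym': set(),
        'semrel_hyponym': set(),
    }
    for synset in synsets:
        for lemma in synset.lemmas():
            # Lemmas that share a synset with the word count as synonyms.
            relations['semrel_synonym'].add(lemma.name())
            # In WordNet, antonymy is defined on lemmas, not on synsets.
            for antonym in lemma.antonyms():
                relations['semrel_antonym'].add(antonym.name())
        for hypernym in synset.hypernyms():
            relations['semrel_hypernym'].update(
                lemma.name() for lemma in hypernym.lemmas())
        for hyponym in synset.hyponyms():
            relations['semrel_hyponym'].update(
                lemma.name() for lemma in hyponym.lemmas())
    relations['semrel_synonym'].discard(word)  # the MW is not its own synonym
    return relations


def wordnet_relations(word):
    """Look the word up in WordNet across all parts of speech and senses."""
    # Imported here so that collect_relations() can be tried without NLTK.
    # Needs the nltk package and its WordNet corpus (nltk.download('wordnet')).
    from nltk.corpus import wordnet as wn
    return collect_relations(word, wn.synsets(word))
```

With NLTK and its WordNet data installed, `wordnet_relations('sheep')` returns a dictionary with the same four keys as the outputs in this section.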

Two examples of the results of `make_semrel_dict()` are given below.

In [8]:
sr.make_semrel_dict('sheep')

{'semrel_synonym': set(),
 'semrel_antonym': set(),
 'semrel_hypernym': {'bovid', 'follower', 'simpleton'},
 'semrel_hyponym': {'ewe', 'ram', 'wether'}}

In [9]:
sr.make_semrel_dict('good')

{'semrel_synonym': {'adept',
  'beneficial',
  'commodity',
  'dear',
  'dependable',
  'effective',
  'estimable',
  'full',
  'thoroughly',
  'well'},
 'semrel_antonym': {'bad', 'evil', 'ill'},
 'semrel_hypernym': {'advantage', 'artifact', 'morality', 'quality'},
 'semrel_hyponym': {'basic',
  'beneficence',
  'benefit',
  'benignity',
  'better',
  'desirability',
  'entrant',
  'export',
  'fungible',
  'future',
  'import',
  'kindness',
  'merchandise',
  'middling',
  'optimum',
  'saintliness',
  'salvage',
  'shopping',
  'summum_bonum',
  'virtue',
  'wisdom',
  'worldly_possession',
  'worthiness'}}

Two remarks about the above:

- Sometimes WordNet does not contain any words that stand in the desired semantic relation to the MW (e.g. there are no synonyms for "sheep").
The probability distribution may nevertheless assign this relation to one of the five TW slots on a card for "sheep", and that assignment cannot be fulfilled, since no synonyms are available.
We come back to how we deal with these cases in Section 3.
- The word "good" can be an adjective or a noun (singular of "goods"), and `make_semrel_dict()` was designed to capture all parts of speech and all senses of the input word, which is why we have synonyms like "commodity" and also "beneficial".
We made this decision because, in Taboo, you are not told what part of speech your MW belongs to; you can use any of its senses to try to make your team members guess it (and often this is a very good strategy for English Taboo, where many of the words are homonymous).

The other pillar of this module is `get_collocations()`.
This is a recursive function that uses the word2vec word embeddings to return some number of suitable collocates, which we operationalise as words that word2vec considers to be similar to the main word, due to their frequent use in the same contexts.

"Suitable" is an important word here, since some post-processing has to be done to the words that are returned by gensim's `most_similar()` function.
Words that we do not want on the cards, but that `most_similar()` might unearth, are those that:
- contain the string of the main word (e.g. inflectional variations of the MW)
- have a Levenshtein distance of 3 or less to the main word (meaning that they are probably typos)
- are already contained in the semantic relations dictionary

All of these cases are taken care of by `get_collocations()`.
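
The three exclusion criteria can be sketched as a single predicate. This is an illustration rather than the code in `semrel`: the names `levenshtein` and `is_suitable` are hypothetical, and the edit distance is written out by hand here so that the example needs no extra packages.

```python
def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def is_suitable(candidate, main_word, forbidden, max_distance=3):
    """Apply the three exclusion criteria listed above."""
    if main_word in candidate:           # e.g. inflected forms of the MW
        return False
    if levenshtein(candidate, main_word) <= max_distance:  # likely typos
        return False
    if candidate in forbidden:           # already used for another relation
        return False
    return True


print(is_suitable('flowers', 'flower', set()))        # contains the MW
print(is_suitable('tulip', 'flower', set()))
print(is_suitable('orchid', 'flower', {'orchid'}))    # forbidden
```

A threshold of 3 is fairly strict for short main words, since many unrelated short words are within three edits of each other.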

In [10]:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [25]:
# No "forbidden words" (the set in the second argument is empty)
sr.get_collocations('flower', set(), model, num_collocates = 3)

{'anthurium', 'blossom', 'chrysanthemum', 'orchid', 'peony', 'tulip'}

In [18]:
# Forbidding "anthurium" and "orchid" (e.g. if they were already in the semantic relations dictionary)
sr.get_collocations('flower', {'orchid', 'anthurium'}, model, num_collocates = 3)

{'blossom', 'chrysanthemum', 'peony', 'tulip'}

The idea behind returning at least some set number of suitable collocates is the following: because the categories for a card are drawn from the probability distribution above, each MW will need between 0 and 5 collocates.
The cardinality threshold makes sure the function does no more work than necessary.

The work accumulates because of the recursion, which is an important part of the function's operation.
For example, imagine that we need 5 collocations, but `most_similar()` returns only 3 suitable ones.
Then, we need to extend the range of most similar words that `most_similar()` returns, so that we have more words to choose from.
So, the function checks whether the number of suitable collocates provided in the first iteration is more than the passed-in cardinality threshold, and if not, it increases the number of most similar words to return and checks again how many of those are suitable.
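
The widening recursion can be sketched as follows. Several details are assumptions made for the example: the function name `collect_collocates`, doubling `topn` on each pass, the `max_topn` cap that guarantees termination, the `>=` comparison, and a simplified suitability filter without the Levenshtein check. `ToyModel` is a hypothetical stand-in with made-up neighbours, so that the example runs without loading the word2vec vectors. It mimics only the `most_similar(positive=..., topn=...)` call of gensim's `KeyedVectors`.

```python
class ToyModel:
    """Stand-in for a gensim KeyedVectors object (hypothetical data)."""

    def __init__(self, neighbours):
        self.neighbours = neighbours

    def most_similar(self, positive, topn=10):
        # Like gensim, return (word, similarity) pairs, best first.
        ranked = self.neighbours[positive[0]][:topn]
        return [(word, 1.0 - 0.01 * i) for i, word in enumerate(ranked)]


def collect_collocates(word, forbidden, model, num_collocates,
                       topn=10, max_topn=1000):
    """Widen the neighbour search until enough suitable words are found."""
    candidates = [w for w, _ in model.most_similar(positive=[word], topn=topn)]
    # Simplified filter: the Levenshtein criterion is left out here.
    suitable = {w for w in candidates if word not in w and w not in forbidden}
    if len(suitable) >= num_collocates or topn >= max_topn:
        return suitable
    # Not enough yet: look at twice as many neighbours and filter again.
    return collect_collocates(word, forbidden, model, num_collocates,
                              topn * 2, max_topn)


toy = ToyModel({'flower': ['flowers', 'flowering', 'blossom', 'orchid',
                           'tulip', 'peony', 'anthurium', 'chrysanthemum']})
print(collect_collocates('flower', set(), toy, num_collocates=3, topn=2))
print(collect_collocates('flower', {'orchid', 'anthurium'}, toy,
                         num_collocates=3, topn=2))
```

Because every pass returns all suitable words among the neighbours inspected so far, the result can contain more words than requested.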

At this point, we have ways to make a dictionary that contains all of the synonyms, antonyms, hypernyms, and hyponyms from WordNet, as well as a set of collocated words based on the word2vec embeddings.
We also have a probability distribution that tells us how probable it is to find any one of these semantic relations in one TW slot in the final card.
Now all that remains is to put this information together and generate the Taboo cards.

# Section 3: Putting it all together: Card generation

The module containing the functions for the culmination of this part of our project is `cardgen`.

In [13]:
import cardgen as cg

Its workhorse is `card_generator()`, which takes as arguments the desired MW, the dictionary containing the probability distribution from the gold standard, and the gensim model.
It uses the function `select_five_categories()` to generate a list of five semantic relations, one for each slot on the card.
Then, `get_good_label_distrib()` assesses the cardinality of the sets in the semantic relation dictionaries to see if the MW has enough synonyms, say, to fulfill the number of synonyms that `select_five_categories()` has generated.
If not, this label is replaced by "collocation", since this is the most frequent category. 
(It would have been possible to be more sophisticated and re-generate a category according to the probability distribution, but this was much more straightforward, and the word2vec words contain more than just strict collocations, anyway.)
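
The fallback to "collocation" can be sketched like this. The function name `fix_labels` and the slot-by-slot counting are assumptions made for the example, not a description of how `get_good_label_distrib()` is written.

```python
from collections import Counter


def fix_labels(labels, semrel_dict):
    """Replace labels that the MW cannot fulfil with 'collocation'."""
    used = Counter()
    fixed = []
    for label in labels:
        available = len(semrel_dict.get(label, ()))
        if label != 'collocation' and used[label] >= available:
            # No unused word is left for this relation: fall back.
            fixed.append('collocation')
        else:
            used[label] += 1
            fixed.append(label)
    return fixed


# The WordNet relations for "sheep" from Section 2.
sheep = {'semrel_synonym': set(),
         'semrel_antonym': set(),
         'semrel_hypernym': {'bovid', 'follower', 'simpleton'},
         'semrel_hyponym': {'ewe', 'ram', 'wether'}}

requested = ['semrel_synonym', 'collocation', 'semrel_hyponym',
             'semrel_hypernym', 'collocation']
print(fix_labels(requested, sheep))
```

Here the synonym slot becomes a collocation slot, because WordNet lists no synonyms for "sheep", while the hyponym and hypernym slots are kept.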

After this, we have the final list of categories to appear on the current card that are compatible with the cardinalities of the sets in the semantic relations dictionary.
Now we randomly select words with the desired semantic relation from the dictionary and fill in the rest with new collocates (which is why the forbidden-words and number-of-collocates parameters in `sr.get_collocations()` were important).
The card is returned as a single-entry dictionary, with the MW as a key and the generated list of TWs as the value.
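
A minimal sketch of this filling step is given below. The function name `fill_card` and the collocates for "sheep" are hypothetical. The sketch also assumes that the list of categories has already been made compatible with the sizes of the available sets.

```python
import random
from collections import Counter


def fill_card(main_word, labels, semrel_dict, collocates, rng=random):
    """Pick one word per slot according to the slot's category."""
    chosen = {}
    for label, count in Counter(labels).items():
        pool = collocates if label == 'collocation' else semrel_dict[label]
        # Sample without replacement so that no word appears twice.
        chosen[label] = rng.sample(sorted(pool), count)
    taboo_words = [chosen[label].pop() for label in labels]
    return {main_word: taboo_words}


sheep = {'semrel_synonym': set(),
         'semrel_antonym': set(),
         'semrel_hypernym': {'bovid', 'follower', 'simpleton'},
         'semrel_hyponym': {'ewe', 'ram', 'wether'}}
# Hypothetical collocates, standing in for the word2vec results.
sheep_collocates = {'lamb', 'goat', 'flock', 'wool'}

labels = ['semrel_hyponym', 'collocation', 'collocation',
          'semrel_hypernym', 'collocation']
card = fill_card('sheep', labels, sheep, sheep_collocates)
print(card)
```

Sorting each pool before sampling keeps the draw reproducible for a seeded random number generator and avoids sampling directly from a set, which newer Python versions no longer allow.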

In [14]:
cg.card_generator('ghost', category_prob_dist, model)

{'ghost': ['writer', 'suggestion', 'touch', 'spooky', 'poltergeist']}

There is also a function to pretty-print these cards in a more Taboo-like format (though how "pretty" this is is a matter of contention...).
Together, `card_generator()` and `pretty_print()` are wrapped up in `draw_card()`, which simulates drawing a Taboo card from the playing deck.

One of the fun parts of this card generator is that, because of the randomness in the probability distribution and the fact that there are more words to choose from than will fit on a card, the cards generated for a given MW are different almost every time.
Let's see an example with the word "ghost" again.

In [15]:
cg.draw_card('ghost', category_prob_dist, model)

 ---------------------
 |    ghost          |
 ---------------------
 |    apparition     |
 |    touch          |
 |    haunted        |
 |    hauntings      |
 |    poltergeist    |
 ---------------------
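
A box like the one above can be produced with a few string operations. The following is a sketch, not the code behind `pretty_print()`: the function name `format_card` is hypothetical, and the widths are assumed values chosen to resemble the printed example. Words longer than the inner width would overflow the box.

```python
def format_card(card, inner_width=19, indent=4):
    """Render a single-entry {mw: [tw, ...]} dictionary as a text box."""
    (main_word, taboo_words), = card.items()
    border = ' ' + '-' * (inner_width + 2)

    def row(word):
        # Indent the word, then pad with spaces up to the inner width.
        return ' |' + (' ' * indent + word).ljust(inner_width) + '|'

    lines = [border, row(main_word), border]
    lines += [row(word) for word in taboo_words]
    lines.append(border)
    return '\n'.join(lines)


example = {'ghost': ['apparition', 'touch', 'haunted',
                     'hauntings', 'poltergeist']}
print(format_card(example))
```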


# References
- Evert 2009
- the wordnet one
- the word2vec one