## Introduction

This notebook expands the word lists for the four existing tropes by adding synonyms for each word.

#### Step 1: Import Packages

We will access Princeton University's WordNet through the spaCy package (using the `spacy_wordnet` extension imported below), since spaCy has a user-friendly interface. Note that WordNet can also be accessed through the nltk package.

In [1]:
import spacy
from spacy_wordnet.wordnet_annotator import WordnetAnnotator 

#### Step 2: Initialize spaCy Model

Start by initializing a spaCy model. The model name encodes its properties. The `en` prefix means it is trained for the English language. The `core` part means it provides the core operations: vocabulary, syntax, entities, and vectors. The `web` part means it is trained on text from the web, such as blogs, news, and comments. The `lg` suffix means it is the largest size available, containing all the functionality needed for this project. See [this link](https://spacy.io/models/en) for more documentation.

In [2]:
nlp = spacy.load('en_core_web_lg')
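
As a small illustration of the naming scheme described above, the sketch below splits a model name into its four parts. The `ModelName` type and `parse_model_name` helper are hypothetical names introduced here for illustration; they are not part of spaCy, and the sketch assumes the name has exactly four underscore-separated parts.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    """Hypothetical container for the parts of a spaCy model name."""
    lang: str   # language code, e.g. "en"
    kind: str   # capability set, e.g. "core"
    genre: str  # type of training text, e.g. "web"
    size: str   # package size, e.g. "lg"


def parse_model_name(name: str) -> ModelName:
    # Assumes exactly four underscore-separated parts.
    lang, kind, genre, size = name.split("_")
    return ModelName(lang, kind, genre, size)


parts = parse_model_name("en_core_web_lg")
print(parts.lang, parts.kind, parts.genre, parts.size)
```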

#### Step 3: Load Wordnet

Although spaCy has numerous functionalities, we use the add-on WordNet annotator for this portion of the project, where we want to find synonyms for the words under each trope for the ASTRSC website. WordNet is the best-suited database for this task. It was developed by Princeton University, which describes it as follows:

"**WordNet is a large lexical database of English**. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. ... **WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings.**" See [the website](https://wordnet.princeton.edu/) for more detail.

Users of WordNet describe its purpose as twofold: "...to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications" ([History of Information](https://www.historyofinformation.com/detail.php?id=2104)). This makes it well suited to our purpose.

Use [this website](http://wordnetweb.princeton.edu/perl/webwn?s=jungle&sub=Search+WordNet&o2=1&o0=&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=000) to test WordNet.

In [3]:
nlp.add_pipe(WordnetAnnotator(nlp.lang), after='tagger')

**Additional technical information on pipe**: 

Language class and pipe description: Usually you’ll load this once per process as nlp and pass the instance around your application. The Language class is created when you call spacy.load() and contains the shared vocabulary and language data, optional model data loaded from a model package or a path, and a processing pipeline containing components like the tagger or parser that are called on a document in order. You can also add your own processing pipeline components that take a Doc object, modify it and return it.
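
To make the idea of an ordered pipeline concrete, here is a toy sketch in plain Python. It is not spaCy's implementation: `ToyPipeline`, the dictionary-based "doc", and the three component functions are all hypothetical stand-ins. It only shows components being called in order and a new component being inserted after a named one, much like the `after='tagger'` argument above.

```python
class ToyPipeline:
    """Toy stand-in for a processing pipeline (not spaCy's Language class)."""

    def __init__(self):
        self.components = []  # list of (name, callable) pairs, in call order

    def add_pipe(self, name, component, after=None):
        if after is None:
            self.components.append((name, component))
            return
        names = [n for n, _ in self.components]
        # Raises ValueError if no component called `after` exists.
        self.components.insert(names.index(after) + 1, (name, component))

    def __call__(self, text):
        doc = {"text": text, "tokens": text.split()}
        for _, component in self.components:
            doc = component(doc)  # each component takes a doc and returns it
        return doc


def tagger(doc):
    doc["tags"] = ["WORD"] * len(doc["tokens"])
    return doc


def parser(doc):
    doc["parsed"] = True
    return doc


def annotator(doc):
    # Records whether it ran after the tagger but before the parser.
    doc["ran_between"] = "tags" in doc and "parsed" not in doc
    return doc


pipeline = ToyPipeline()
pipeline.add_pipe("tagger", tagger)
pipeline.add_pipe("parser", parser)
pipeline.add_pipe("annotator", annotator, after="tagger")

order = [name for name, _ in pipeline.components]
doc = pipeline("big game")
print(order)
print(doc["ran_between"])
```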

#### Step 4: Find Synonyms to Trope Words

Start by loading the existing trope words as sets, collected in a dictionary keyed by trope name.


In [4]:
conflictArray = {"war","conflict","armed","weapon","pirate","attack","kill","death","dead","murder","fear","menace","smuggled weapons","smugglers","crime","gang","ransom","execution","shot","death","pain","suffer","civil war","chaos","gun","gunshot"}
generalizationArray = {"africa","african", "subsaharan africa","sub-saharan africa","continent"}
tribalismArray = {"tribe", "tribal", "tribalism", "tribalistic", "ethnic", "ethnicity"}
wildlifeArray = {"savannah","rainforest","jungle","trees","insect","lion","rhino","hunt","big game"}

all_tropes_dict = {"conflictArray":conflictArray, "generalizationArray":generalizationArray, "tribalismArray":tribalismArray, "wildlifeArray":wildlifeArray}



Define a function that extracts the synonyms:

In [5]:
def find_synonym(all_tropes_dict):
    """Map each trope to {word: set of WordNet synset head words}."""
    all_synonyms = {}

    for trope, words in all_tropes_dict.items():
        synonyms_per_trope = {}
        for word in words:
            # Only the first token is looked up, so a multi-word entry
            # such as "rain forest" is matched on "rain" alone.
            token = nlp(word)[0]
            synsets = token._.wordnet.synsets()
            # Synset names look like "continent.n.01"; drop the
            # ".pos.nn" suffix and keep the leading word.
            synonyms_per_trope[word] = {
                synset.name().rsplit(".", 2)[0] for synset in synsets
            }
        all_synonyms[trope] = synonyms_per_trope

    return all_synonyms
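
The function above keeps only the leading word of each synset name. The sketch below shows that step in isolation, assuming names follow the `word.pos.nn` pattern used by WordNet. The sample names are illustrative strings in that format, and `headword` is a hypothetical helper, not a WordNet or spaCy function.

```python
def headword(synset_name: str) -> str:
    """Return the leading word of a name shaped like 'word.pos.nn'."""
    # Split from the right so only the last two fields are removed.
    word, _pos, _sense = synset_name.rsplit(".", 2)
    return word


# Illustrative names in the WordNet format (multi-word entries use "_").
sample_names = ["continent.n.01", "rain_forest.n.01", "tribe.n.02"]

by_split = [headword(name) for name in sample_names]
by_slice = [name[:-5] for name in sample_names]  # fixed five-character suffix

print(by_split)
print(by_split == by_slice)
```

Both approaches agree for names of this shape. Splitting on the dots does not depend on the suffix being exactly five characters long.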

Let's run our dictionary of tropes through this function.

In [37]:
synonyms_with_old_list = find_synonym(all_tropes_dict)

**Conclusion**:

Surprisingly, these word sets are very limited. I cross-checked these results against one of the most comprehensive online dictionaries, [Vocabulary.com](https://www.vocabulary.com), and found that **the words in the tropes indeed have a very limited number of synonyms**.

There are two courses of action going forward:

1. Try to find more words for the tropes
2. Try to extract more synonyms from the given words by first reducing them to their root forms ('go' for 'went', or 'run' for 'running'), a process called lemmatization.

I believe Option 1 is the most effective course of action for now. I will research related words online. Once we find a comprehensive enough dictionary online, we can explore Option 2 to extract synonyms for the newly found words.

## Part II: Improved after Discussion

We have decided that the current version of the Generalization array is complete. We will add a few words to the Tribalism array, and we have scraped [this website](http://www.rhymezone.com/r/rhyme.cgi?Word=wildlife&org1=syl&org2=l&typeofrhyme=rel&org3=y) to complete the Wildlife array. The output of this stage will be complete versions of the Generalization, Tribalism, and Wildlife arrays. For the Conflict array, we have decided to use a different approach, which will be described in another notebook.

Let's start by compiling the updated versions of the three target arrays.

In [6]:
generalizationArray = {"africa","african", "subsaharan africa","sub-saharan africa","continent"}
tribalismArray = {"tribe", "tribal", "tribalism", "tribalistic", "ethnic", "ethnicity", "ethnicism", "ethno", "bushmen", "pygmy", "bantu", "San people", "nomadic"}
wildlifeArray = {"acacia",
"african elephant",
"alligators",
"amazonian",
"amphibians",
"amphibiology",
"animal",
"animalness",
"animals",
"antelope",
"aquatic",
"aquatic bird",
"arctic",
"asylum",
"atlas",
"bald eagle",
"big bend national park",
"big game",
"bighorn",
"bighorn sheep",
"biocenosis",
"biodiversity",
"biosphere",
"biota",
"bird",
"bird of passage",
"bird sanctuary",
"bird table",
"birdcatching",
"birdlet",
"birdlife",
"birdling",
"birds",
"birdwatch",
"black grouse",
"bladder senna",
"bobcat",
"bobolink",
"bobwhite quail",
"bountiful",
"brimming",
"broads",
"buffalo",
"canadian goose",
"capercaillie",
"capitol reef national park",
"capturing",
"cattle",
"chimpanzee",
"chronic wasting disease",
"coastal",
"colutea arborescens",
"conifers",
"conservancy",
"conservation",
"conservationist",
"conservationists",
"conserve",
"conserved",
"conserving",
"cougars",
"countryside",
"coyotes",
"creek",
"curios",
"ddt",
"decimated",
"decimation",
"deer",
"deer mouse",
"deerfly",
"desert tortoise",
"dinotherium",
"diorama",
"dotterel",
"ducks",
"eagles",
"eaten",
"echinochloa frumentacea",
"ecology",
"ecosystem",
"ecosystems",
"ecotourism",
"elephant",
"elephant bird",
"elephants",
"elk",
"encounter",
"encounters",
"encourage",
"endanger",
"endangered",
"endangered species",
"endangering",
"endemic",
"environmental",
"eskimo curlew",
"everglades",
"exhibits",
"exotic",
"exterminated",
"extermination",
"extinction",
"faun",
"fauna",
"fish",
"fisheries",
"fishing",
"flora",
"foliage",
"forage",
"forest",
"forest ranger",
"forestry",
"forests",
"foxes",
"freshwater",
"frigate bird",
"furbearer",
"gamekeeper",
"gayal",
"geese",
"geology",
"geriatric",
"giraffe",
"gnatcatchers",
"gorilla",
"goshawk",
"grizzlies",
"grizzly bear",
"ground",
"guacharo",
"guadalupe fur seal",
"habitat",
"habitats",
"hairbird",
"hellbender",
"herbivores",
"herds",
"herpetology",
"houbara",
"indian elephant",
"insect",
"insects",
"invertebrates",
"japanese millet",
"jasper national park",
"jungle",
"kangaroo rat",
"kruger national park",
"leopard",
"leopards",
"lion",
"livestock",
"mammal",
"mammalian",
"mammals",
"marine",
"marine animal",
"migrating",
"migratory",
"monarch butterfly",
"monkey",
"mule deer",
"nocturnal",
"nongame",
"norfolk",
"oilbird",
"ornithologist",
"ortygan",
"otters",
"paguma",
"paleornithology",
"panther",
"park",
"prairie chicken",
"prairie dog",
"prairie grouse",
"prairie rattlesnake",
"prairie wolf",
"predatory animal",
"preservationist",
"procyonid",
"profusion",
"pronghorn",
"ptarmigans",
"rabbits",
"rain forest",
"rangership",
"raptor",
"raptors",
"reptiles",
"reservation",
"rhino",
"river otter",
"ruthless",
"safari",
"safari park",
"safeguard",
"sage grouse",
"sanctuaries",
"sanctuary",
"sandhill crane",
"sanwa millet",
"sar",
"sauvage",
"savage",
"savannah",
"save",
"saving",
"scenery",
"scenic",
"scenics",
"sea cow",
"sea elephant",
"sea leopard",
"seabird",
"seabirds",
"seafowl",
"seals",
"seeing",
"serengeti",
"seriema",
"shellfish",
"shoot",
"shooting",
"shore bird",
"shorebird",
"sightings",
"snail darter",
"snowshoe hare",
"songbirds",
"squirrels",
"stone curlew",
"sungrebe",
"taxonomic group",
"terrain",
"terrestrial",
"thremmatology",
"toll",
"tortoises",
"trees",
"tropical",
"tropical rain forest",
"vegetation",
"vertebrate",
"water rail",
"waterbirds",
"waterfalls",
"waterfowl",
"watershed",
"watersheds",
"waterways",
"weka",
"wetland",
"wetlands",
"whales",
"whitetail deer",
"whooping crane",
"wild",
"wild dog",
"wildflower",
"wildflowers",
"wildfowl",
"wildgrave",
"wildland",
"wildlife",
"willow grouse",
"wolves",
"wood stork",
"woodland",
"woodland caribou",
"woodlands",
"woods",
"wrybill",
"xerosere",
"zebra",
"zoo",
"zoographer",
"zoography",
"zoological",
"zoological garden",
"zoopathology",
"zoophagy",
"zootechnics",
"zootrophic"}

# Remove conflict from all tropes
all_tropes_dict = {"generalizationArray":generalizationArray, "tribalismArray":tribalismArray, "wildlifeArray":wildlifeArray}



Find synonyms for all these words and store them:



In [8]:
all_three_tropes_synonyms = find_synonym(all_tropes_dict)

all_three_tropes_synonyms

{'generalizationArray': {'continent': {'celibate', 'continent'},
  'africa': {'africa'},
  'subsaharan africa': set(),
  'african': {'african'},
  'sub-saharan africa': set()},
 'tribalismArray': {'ethnic': {'cultural', 'heathen'},
  'tribalistic': set(),
  'tribalism': {'tribalism'},
  'nomadic': {'mobile'},
  'ethnicism': set(),
  'bushmen': {'bushman'},
  'San people': set(),
  'tribe': {'kin', 'tribe'},
  'pygmy': {'pygmy'},
  'ethnicity': {'ethnicity'},
  'bantu': {'bantu'},
  'tribal': {'tribal'},
  'ethno': set()},
 'wildlifeArray': {'marine animal': {'marine', 'nautical'},
  'leopards': {'leopard'},
  'bird of passage': {'bird', 'boo', 'dame', 'shuttlecock'},
  'rabbits': {'lapin', 'rabbit'},
  'foxes': {'dodger', 'fox'},
  'endanger': {'endanger', 'queer'},
  'pronghorn': {'pronghorn'},
  'elephants': {'elephant'},
  'rain forest': {'rain'},
  'sauvage': set(),
  'ornithologist': {'ornithologist'},
  'broads': {'broad'},
  'dotterel': {'dotterel'},
  'elephant': {'elephant'},
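
Several entries in this output are worth auditing: multi-word entries were matched on their first token only (for example, 'rain forest' returned {'rain'}), and some words returned an empty set. The sketch below is one way to flag both cases. `audit_lookups` is a hypothetical helper, and `sample_output` is a small excerpt copied from the output above, not the full result.

```python
# Excerpt of the output shown above (not the full result).
sample_output = {
    "marine animal": {"marine", "nautical"},
    "rain forest": {"rain"},
    "bird of passage": {"bird", "boo", "dame", "shuttlecock"},
    "sauvage": set(),
    "leopards": {"leopard"},
}


def audit_lookups(synonyms_per_word):
    """Return (multi-word entries, entries with no synonyms), both sorted."""
    # Whitespace splitting is an approximation of spaCy's tokenization.
    multi_word = sorted(w for w in synonyms_per_word if len(w.split()) > 1)
    no_match = sorted(w for w, syns in synonyms_per_word.items() if not syns)
    return multi_word, no_match


multi_word, no_match = audit_lookups(sample_output)
print("Looked up by first token only:", multi_word)
print("No synonyms found:", no_match)
```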


In [35]:
print("Generalization has {} unique words.".format(len(all_three_tropes_synonyms["generalizationArray"])))
print("Tribalism has {} unique words.".format(len(all_three_tropes_synonyms["tribalismArray"])))
print("Wildlife has {} unique words.".format(len(all_three_tropes_synonyms["wildlifeArray"])))

Generalization has 5 unique words.
Tribalism has 13 unique words.
Wildlife has 286 unique words.


Including the synonyms, we have 331 unique words in the Wildlife array, 18 in the Tribalism array, and 7 in the Generalization array.
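
The cell that produced these totals is not shown. One way such a total could be computed, assuming it merges each original word with its synonyms into a single set, is sketched below. `unique_vocabulary` is a hypothetical helper, and the data is the Tribalism output copied from above.

```python
# Tribalism output copied from the result shown earlier.
tribalism_synonyms = {
    "ethnic": {"cultural", "heathen"},
    "tribalistic": set(),
    "tribalism": {"tribalism"},
    "nomadic": {"mobile"},
    "ethnicism": set(),
    "bushmen": {"bushman"},
    "San people": set(),
    "tribe": {"kin", "tribe"},
    "pygmy": {"pygmy"},
    "ethnicity": {"ethnicity"},
    "bantu": {"bantu"},
    "tribal": {"tribal"},
    "ethno": set(),
}


def unique_vocabulary(synonyms_per_word):
    """Merge the original words and all of their synonyms into one set."""
    vocabulary = set(synonyms_per_word)  # start from the original words
    for synonyms in synonyms_per_word.values():
        vocabulary |= synonyms
    return vocabulary


tribalism_vocabulary = unique_vocabulary(tribalism_synonyms)
print(len(tribalism_vocabulary))
```

Under that assumption, the Tribalism data gives 18 words: the 13 originals plus 'cultural', 'heathen', 'mobile', 'bushman', and 'kin'.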

~ THE END OF NOTEBOOK ~

In [10]:
all_three_tropes_synonyms["generalizationArray"]

{'continent': {'celibate', 'continent'},
 'africa': {'africa'},
 'subsaharan africa': set(),
 'african': {'african'},
 'sub-saharan africa': set()}