## Finding Nature-Specific Words

This notebook finds more words for the `Nature` trope. [This website](https://a-z-animals.com/animals/location/africa/) lists animals found in Africa. We scrape it for every animal listed, compare the extracted names with the ones already present in the `Nature` dict, and remove duplicates to create our final list. The code and documentation stay very close to Tsion's previous notebook for extracting more words.

The script below follows the scraping practices outlined in [this freeCodeCamp tutorial](https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/).

Tsion's Note: because scraping is an iterative process that highly depends on how the website is designed, there isn't one manual to guide the user through scraping this particular website. Instead, the user could iteratively index into the website until the list containing the words is reached.

In [3]:
# necessary packages
import re
from bs4 import BeautifulSoup as bs
import requests

### Scraping The Website

These words are sourced [from this website](https://a-z-animals.com/animals/location/africa/).

Use the `prettify` method of the soup object to print the HTML in a properly indented format, which makes it easier to see how the words are encoded on the website.

In [4]:
# save urls: one listing page per country
base_url = "https://a-z-animals.com/animals/location/africa/"
countries = ["algeria", "angola", "benin", "botswana", "burkina-faso", "burundi",
             "cameroon", "central-african-republic", "chad", "cote-divoire",
             "democratic-republic-of-congo", "djibouti", "egypt", "equatorial-guinea",
             "eritrea", "ethiopia", "gabon", "gambia", "ghana", "guinea",
             "guinea-bissau", "kenya", "lesotho", "liberia", "libya", "madagascar",
             "malawi", "mali", "mauritania", "morocco", "mozambique", "namibia",
             "niger", "nigeria", "republic-of-congo", "rwanda", "senegal",
             "sierra-leone", "somalia", "south-africa", "south-sudan", "sudan",
             "swaziland", "tanzania", "togo", "tunisia", "uganda", "western-sahara",
             "zambia", "zimbabwe"]
urls = [base_url + country + "/" for country in countries]

# create array where all words will go
animals = []
# compile the link pattern once, outside the loop
animal_link = re.compile("https://a-z-animals.com/animals/.+")

def href_match(href):
    # keep only links that point to a page under /animals/
    return href and animal_link.search(href)

for url in urls:
    # send request
    res = requests.get(url)

    # skip pages that did not load successfully
    if res.status_code != 200:
        print("skipping", url, res.status_code)
        continue

    # convert to a beautiful soup object
    soup = bs(res.content, 'html.parser')

    # get every link in the body whose href matches the pattern
    animals_html = soup.body.find_all(href=href_match)

    for animal in animals_html:
        animals.append(animal.text.strip())

print(len(animals), animals)

['Mammals', 'Reptiles', 'Fish', 'Birds', 'Amphibians', 'Animals by Location', 'Africa', 'North African', 'Mediterranean coastline', 'jackals', 'wild boars', 'Fennec foxes', 'jerboas', 'leopards', 'cheetahs', 'monkey', 'fish', 'sharks', 'hammerheads', 'reptiles', 'insect', 'ostriches', 'cranes', 'avocets', 'doves', 'eagles', 'Endangered', 'Vulnerable', 'Near Threatened', 'Critically Endangered', 'extinct', 'Scimitar oryx', 'Lion', 'Brown bear', 'Fennec fox', 'buffalo', 'hippos', 'rhinos', 'Mambas', 'Wild Boars', 'African Elephants', '', 'African Wild Dog', '', 'American Cockroach', '', 'Ant', '', 'Antelope', '', 'Armyworm', '', 'Banana Spider', '', 'Barb', '', 'Barn Owl', '', 'Barn Swallow', '', 'Bat', '', 'Bed Bugs', '', 'Bee', '', 'Beetle', '', 'Bird', '', 'Biscuit Beetle', '', 'Black Widow Spider', '', 'Brazilian Treehopper', '', 'Brown-banded Cockroach', '', 'Brown Dog Tick', '', 'Bumblebee', '', 'Butterfly', '', 'Caecilian', '', 'Camel', '', 'Caracal', '', 'Carpenter Ant', '', 'Car
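
The output above mixes animal names with navigation labels such as 'Animals by Location' and 'Africa'. A minimal sketch of why, assuming the same link pattern as the scraping cell (dots escaped here): any link below `/animals/` passes the filter, including the listing pages themselves. The sample hrefs are illustrative and were not taken from the scraped pages.

```python
import re

# Same pattern as in the scraping cell, with the dots escaped and compiled once.
ANIMAL_LINK = re.compile(r"https://a-z-animals\.com/animals/.+")


def href_match(href):
    """Return True when href points to a page below /animals/."""
    return bool(href and ANIMAL_LINK.search(href))


# Illustrative hrefs (assumptions, not copied from the site's HTML).
sample_hrefs = [
    "https://a-z-animals.com/animals/lion/",             # hypothetical animal page
    "https://a-z-animals.com/animals/location/africa/",  # listing page, also matches
    "https://a-z-animals.com/animals/",                  # nothing after /animals/
    "https://a-z-animals.com/blog/",                     # hypothetical unrelated page
    None,                                                # tag without an href
]

kept = [h for h in sample_hrefs if href_match(h)]
print(kept)
```

Because the listing page matches as well, a stricter pattern or the manual clean-up later in this notebook is still needed to drop the navigation labels.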

In [12]:
# now we create a final array that includes the unique animals from the 'animals' list and from the current 'nature' dict on site

# 1. Create list of current words in 'nature' dict
nature_list = ["wildlife",
    "fauna",
    "habitat",
    "habitats",
    "conservation",
    "animals",
    "birds",
    "animal",
    "flora",
    "livestock",
    "hunting",
    "parks",
    "biota",
    "geese",
    "wilderness",
    "wild",
    "park",
    "natural",
    "nature",
    "hunt",
    "camping",
    "wildcat",
    "sauvage",
    "savage",
    "faun",
    "faune",
    "wild-type",
    "waterfowl",
    "deer",
    "bighorn sheep",
    "ecology",
    "biodiversity",
    "forest",
    "conservancy",
    "bald eagle",
    "wildfowl",
    "endangered species",
    "conservationists",
    "raptors",
    "birdlife",
    "wetlands",
    "otters",
    "elk",
    "fisheries",
    "whitetail deer",
    "bighorn",
    "mule deer",
    "capercaillie",
    "waterbirds",
    "poachers",
    "bird sanctuary",
    "ecosystems",
    "grizzlies",
    "reptiles",
    "naturalists",
    "coyotes",
    "marine",
    "waterways",
    "antelope",
    "cougars",
    "river otter",
    "moose",
    "wildflower",
    "ecotourism",
    "seabird",
    "turtles",
    "vegetation",
    "tortoises",
    "gnatcatchers",
    "wood stork",
    "amphibians",
    "wolves",
    "stone curlew",
    "foxes",
    "timber rattlesnake",
    "sacred ibis",
    "tigers",
    "grizzly bear",
    "monarch butterfly",
    "sandhill crane",
    "desert tortoise",
    "goshawk",
    "cirl bunting",
    "leopards",
    "panther",
    "bobcat",
    "alligators",
    "oilbird",
    "safari park",
    "seafowl",
    "bird of passage",
    "zoography",
    "zoological garden",
    "shore bird",
    "plain wanderer",
    "zoo",
    "deer mouse",
    "mammal",
    "dinotherium",
    "bird",
    "conservationist",
    "gallery forest",
    "zoographer",
    "wetland",
    "zoophagy",
    "predatory animal",
    "bird table",
    "sea cow",
    "anhima",
    "tropical rain forest",
    "sea elephant",
    "sea leopard",
    "aquatic bird",
    "birdwatch",
    "zoopathology",
    "wild dog",
    "marine animal",
    "african elephant",
    "sanctuary",
    "indian elephant",
    "woodland",
    "elephant bird",
    "birdling",
    "birdcatching",
    "prairie wolf",
    "rangership",
    "snail darter",
    "hairbird",
    "zoological",
    "wildgrave",
    "kangaroo rat",
    "exotic",
    "indigenous",
    "aquatic",
    "terrestrial",
    "endangered",
    "migratory",
    "coastal",
    "captive",
    "vertebrate",
    "arctic",
    "threatened",
    "fascinating",
    "nocturnal",
    "mammalian",
    "scenic",
    "endemic",
    "ocean",
    "underwater",
    "winged",
    "amazonian",
    "predatory",
    "fish",
    "plants",
    "forests",
    "forestry",
    "soil",
    "timber",
    "cattle",
    "insects",
    "soils",
    "wildlands",
    "wildflowers",
    "pesticides",
    "fishing",
    "extinction",
    "woods",
    "sanctuaries",
    "ecosystem",
    "elephants",
    "mammals",
    "outdoors",
    "ducks",
    "poaching",
    "harvesting",
    "watersheds",
    "decimation",
    "waterfalls",
    "forage",
    "invertebrates",
    "feeding",
    "extermination",
    "pastoralists",
    "shellfish",
    "whales",
    "buffalo",
    "songbirds",
    "eagles",
    "seals",
    "squirrels",
    "hunted",
    "endanger",
    "depleted",
    "inhabit",
    "exterminated",
    "forest ranger",
    "game warden",
    "gamekeeper",
    "park",
    "natural resources",
    "panda",
    "preservationist",
    "preserved",
    "safari",
    "serengeti",
    "agricultural",
    "biosphere",
    "environmental",
    "freshwater",
    "geriatric",
    "herbivores",
    "wildland"]

previous_length = len(nature_list)

# 2. Loop through 'animals' list, make word lower case, check if already in final list, and if not, add to final list
for word in animals:
    lc_word = word.lower()
    # remove commas from words (some extracted words had commas)
    if ',' in lc_word:
        lc_word = lc_word.replace(",", "")
    # ignore extinct animals
    if "saurus" in lc_word or "therium" in lc_word: continue
    # skip empty strings and words that are already in the list
    if lc_word and lc_word not in nature_list:
        nature_list.append(lc_word)
        
print(nature_list, previous_length, len(nature_list))
        

hippopotamus,
hippopotamus
patas monkies,
patas monkies
tanzania,
tanzania
chimpanzees, the blue monkey,
chimpanzees the blue monkey
reptiles,
reptiles
fish,
fish
lions,
lions
leopards,
leopards
reptiles,
reptiles
amphibians,
amphibians
amphibians,
amphibians
zebras,
zebras
ethiopia,
ethiopia
snakes,
snakes
leopards,
leopards
reptiles,
reptiles
['wildlife', 'fauna', 'habitat', 'habitats', 'conservation', 'animals', 'birds', 'animal', 'flora', 'livestock', 'hunting', 'parks', 'biota', 'geese', 'wilderness', 'wild', 'park', 'natural', 'nature', 'hunt', 'camping', 'wildcat', 'sauvage', 'savage', 'faun', 'faune', 'wild-type', 'waterfowl', 'deer', 'bighorn sheep', 'ecology', 'biodiversity', 'forest', 'conservancy', 'bald eagle', 'wildfowl', 'endangered species', 'conservationists', 'raptors', 'birdlife', 'wetlands', 'otters', 'elk', 'fisheries', 'whitetail deer', 'bighorn', 'mule deer', 'capercaillie', 'waterbirds', 'poachers', 'bird sanctuary', 'ecosystems', 'grizzlies', 'reptiles', 'natur
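
The merge step can also be written as a small function, which makes it easy to test on a handful of words before running it on the full scrape. This is a sketch under assumptions rather than the notebook's exact code: `merge_words` is a hypothetical name and the sample lists are made up. It keeps the original order, uses a set for membership checks, and skips empty strings, which appear in the scraped output shown earlier.

```python
def merge_words(existing, scraped, extinct_markers=("saurus", "therium")):
    """Return existing words followed by new, normalized scraped words."""
    merged = list(existing)
    seen = set(existing)  # set lookup avoids rescanning the list for every word
    for raw in scraped:
        # lower-case, drop commas, and trim surrounding whitespace
        word = raw.lower().replace(",", "").strip()
        if not word:
            continue  # skip empty strings
        if any(marker in word for marker in extinct_markers):
            continue  # skip names that look like extinct animals
        if word not in seen:
            seen.add(word)
            merged.append(word)
    return merged


# Made-up sample data for illustration only.
existing = ["wildlife", "leopards", "fish"]
scraped = ["Lion", "", "leopards,", "Fish", "Fennec fox", "Spinosaurus", "Lion"]

merged = merge_words(existing, scraped)
print(merged, len(existing), len(merged))
```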

### Manually Take Out Irrelevant Strings
We will now remove any irrelevant strings from the 678 new words.

In [None]:
nature_final = 
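
One possible way to do this clean-up is to filter against a hand-written blocklist. The sketch below is an assumption, not the cell's intended contents: `drop_irrelevant` and `IRRELEVANT` are hypothetical names, and the blocklist entries are examples taken from the outputs shown earlier (navigation labels and country names).

```python
# Hypothetical blocklist built by reading through the scraped words by hand.
IRRELEVANT = {
    "animals by location",
    "africa",
    "north african",
    "mediterranean coastline",
    "tanzania",
    "ethiopia",
}


def drop_irrelevant(words, blocklist=IRRELEVANT):
    """Keep non-empty words that are not on the blocklist, preserving order."""
    return [w for w in words if w and w not in blocklist]


# Made-up sample for illustration only.
candidates = ["jackals", "tanzania", "", "fennec foxes", "africa", "caracal"]
cleaned = drop_irrelevant(candidates)
print(cleaned)
```

Keeping the blocklist as a set in the notebook leaves a record of which strings were removed and why.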

### Convert all plural words to singular
We will now convert all plural words to their singular form to ensure our ASTRSC tool can detect both the singular and plural forms of the words. 
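
A rough sketch of what this conversion could look like with plain string rules, assuming only the last word of a phrase needs to change. The `singularize` function and the `IRREGULAR` and `INVARIANT` tables are hypothetical and cover only a few cases from the lists above. English plurals have many exceptions, so the result should still be checked by hand.

```python
# Hypothetical lookup tables; extend them as exceptions turn up.
IRREGULAR = {"geese": "goose", "wolves": "wolf", "mice": "mouse"}
INVARIANT = {"species", "fish", "deer", "moose", "elk", "buffalo"}


def singularize(phrase):
    """Return phrase with its last word converted to a naive singular form."""
    head, _, last = phrase.rpartition(" ")
    prefix = head + " " if head else ""
    if last in INVARIANT:
        return phrase
    if last in IRREGULAR:
        return prefix + IRREGULAR[last]
    if last.endswith("ies") and len(last) > 4:
        return prefix + last[:-3] + "y"      # sanctuaries -> sanctuary
    if last.endswith(("ches", "shes", "sses", "xes")):
        return prefix + last[:-2]            # foxes -> fox
    if last.endswith("s") and not last.endswith(("ss", "us", "is")):
        return prefix + last[:-1]            # leopards -> leopard
    return phrase                            # hippopotamus, ibis, wilderness


words = ["foxes", "leopards", "hippopotamus", "geese", "sanctuaries",
         "sacred ibis", "endangered species", "bed bugs"]
singular = [singularize(w) for w in words]
print(singular)
```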

### Conclusion
After finishing this process, we were able to increase the number of words in the dictionary by over 4X.

~ End of notebook ~