# Finding Nature Specific Words

This notebook is designed to find more words for the `Nature` trope. [This website](https://a-z-animals.com/animals/location/africa/) contains animals specifically found in Africa. We will scrape this website for all of the animals listed, compare the names extracted with the ones already present in the `Nature` dict, and remove duplicates to create our final list. This code and the documentation will remain very similar to Tsion's previous code to extract more words.

The script below follows scraping practices outlined in [this freeCodeCamp website](https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/).

Tsion's Note: because scraping is an iterative process that highly depends on how the website is designed, there isn't one manual to guide the user through scraping this particular website. Instead, the user could iteratively index into the website until the list containing the words is reached.

In [41]:
# necessary packages
import re
from bs4 import BeautifulSoup as bs
import requests

## Scraping The Website

These words are sourced [from this website](https://a-z-animals.com/animals/location/africa/).

Use the `prettify` function to print the output in a properly indexed format to easily identify how the words are encoded on the website.

In [58]:
# save url
urls = ["https://a-z-animals.com/animals/location/africa/algeria/", "https://a-z-animals.com/animals/location/africa/angola/",
        "https://a-z-animals.com/animals/location/africa/benin/", "https://a-z-animals.com/animals/location/africa/botswana/",
        "https://a-z-animals.com/animals/location/africa/burkina-faso/", "https://a-z-animals.com/animals/location/africa/burundi/",
        "https://a-z-animals.com/animals/location/africa/cameroon/", "https://a-z-animals.com/animals/location/africa/central-african-republic/",
        "https://a-z-animals.com/animals/location/africa/chad/", "https://a-z-animals.com/animals/location/africa/cote-divoire/",
        "https://a-z-animals.com/animals/location/africa/democratic-republic-of-congo/", "https://a-z-animals.com/animals/location/africa/djibouti/",
        "https://a-z-animals.com/animals/location/africa/egypt/", "https://a-z-animals.com/animals/location/africa/equatorial-guinea/",
        "https://a-z-animals.com/animals/location/africa/eritrea/", "https://a-z-animals.com/animals/location/africa/ethiopia/",
        "https://a-z-animals.com/animals/location/africa/gabon/", "https://a-z-animals.com/animals/location/africa/gambia/",
        "https://a-z-animals.com/animals/location/africa/ghana/", "https://a-z-animals.com/animals/location/africa/guinea/",
        "https://a-z-animals.com/animals/location/africa/guinea-bissau/", "https://a-z-animals.com/animals/location/africa/kenya/",
        "https://a-z-animals.com/animals/location/africa/lesotho/", "https://a-z-animals.com/animals/location/africa/liberia/",
        "https://a-z-animals.com/animals/location/africa/libya/", "https://a-z-animals.com/animals/location/africa/madagascar/",
        "https://a-z-animals.com/animals/location/africa/malawi/", "https://a-z-animals.com/animals/location/africa/mali/",
        "https://a-z-animals.com/animals/location/africa/mauritania/", "https://a-z-animals.com/animals/location/africa/morocco/",
        "https://a-z-animals.com/animals/location/africa/mozambique/", "https://a-z-animals.com/animals/location/africa/namibia/",
        "https://a-z-animals.com/animals/location/africa/niger/", "https://a-z-animals.com/animals/location/africa/nigeria/",
        "https://a-z-animals.com/animals/location/africa/republic-of-congo/", "https://a-z-animals.com/animals/location/africa/rwanda/",
        "https://a-z-animals.com/animals/location/africa/senegal/", "https://a-z-animals.com/animals/location/africa/sierra-leone/"
        "https://a-z-animals.com/animals/location/africa/somalia/", "https://a-z-animals.com/animals/location/africa/south-africa/"
        "https://a-z-animals.com/animals/location/africa/south-sudan/", "https://a-z-animals.com/animals/location/africa/sudan/",
        "https://a-z-animals.com/animals/location/africa/swaziland/", "https://a-z-animals.com/animals/location/africa/tanzania/",
        "https://a-z-animals.com/animals/location/africa/togo/", "https://a-z-animals.com/animals/location/africa/tunisia/",
        "https://a-z-animals.com/animals/location/africa/uganda/", "https://a-z-animals.com/animals/location/africa/western-sahara/",
        "https://a-z-animals.com/animals/location/africa/zambia/", "https://a-z-animals.com/animals/location/africa/zimbabwe/"]


# create array where all words will go
animals_1 = []
for url in urls:
    # send request
    res = requests.get(url)

    txt = res.text
    status = res.status_code

    # convert to a beautiful soup object
    soup = bs(res.content, 'html.parser')
    
    #get body
    body = soup.body

     # get relevant section 
    def href_match(href):
        return href and re.compile("https://a-z-animals.com/animals/.+").search(href)
    animals_html = soup.body.find_all(href=href_match)
#     print(animals_html)
    
    for animal in animals_html:
        animals_1.append(animal.text.strip())
        
print(len(animals_1), animals_1)

32769 ['Mammals', 'Reptiles', 'Fish', 'Birds', 'Amphibians', 'Animals by Location', 'Africa', 'North African', 'Mediterranean coastline', 'jackals', 'wild boars', 'Fennec foxes', 'jerboas', 'leopards', 'cheetahs', 'monkey', 'fish', 'sharks', 'hammerheads', 'reptiles', 'insect', 'ostriches', 'cranes', 'avocets', 'doves', 'eagles', 'Endangered', 'Vulnerable', 'Near Threatened', 'Critically Endangered', 'extinct', 'Scimitar oryx', 'Lion', 'Brown bear', 'Fennec fox', 'buffalo', 'hippos', 'rhinos', 'Mambas', 'Wild Boars', 'African Elephants', '', 'African Wild Dog', '', 'American Cockroach', '', 'Ant', '', 'Antelope', '', 'Armyworm', '', 'Banana Spider', '', 'Barb', '', 'Barn Owl', '', 'Barn Swallow', '', 'Bat', '', 'Bed Bugs', '', 'Bee', '', 'Beetle', '', 'Bird', '', 'Biscuit Beetle', '', 'Black Widow Spider', '', 'Brazilian Treehopper', '', 'Brown-banded Cockroach', '', 'Brown Dog Tick', '', 'Bumblebee', '', 'Butterfly', '', 'Caecilian', '', 'Camel', '', 'Caracal', '', 'Carpenter Ant', ''

## Process animals from the 2 other websites
Now we will retrieve the words from the two other websites and add them to an array.

### Website 2

In [43]:
# Website 2
url_website_2 = "https://www.edreams.com/blog/animals-in-africa-from-a-to-z/"

res_2 = requests.get(url_website_2)
txt_2 = res_2.text
status_2 = res_2.status_code

In [50]:
# convert to a beautiful soup object
soup_2 = bs(res_2.content, 'html.parser')

# get body
body_2 = soup.body
print(body_2)

<body class="page-template page-template-page-location page-template-page-location-php page page-id-43585 page-child parent-pageid-43553 aawp-custom template-test template-test-name- template-test-cell-0 template-test-template-page-location lasso-v309 header-image header-full-width content-sidebar genesis-breadcrumbs-hidden"><svg focusable="false" height="0" role="none" style="visibility: hidden; position: absolute; left: -9999px; overflow: hidden;" viewbox="0 0 0 0" width="0" xmlns="http://www.w3.org/2000/svg"><defs><filter id="wp-duotone-dark-grayscale"><fecolormatrix color-interpolation-filters="sRGB" type="matrix" values=" .299 .587 .114 0 0 .299 .587 .114 0 0 .299 .587 .114 0 0 .299 .587 .114 0 0 "></fecolormatrix><fecomponenttransfer color-interpolation-filters="sRGB"><fefuncr tablevalues="0 0.49803921568627" type="table"></fefuncr><fefuncg tablevalues="0 0.49803921568627" type="table"></fefuncg><fefuncb tablevalues="0 0.49803921568627" type="table"></fefuncb><fefunca tablevalues

In [78]:
# find all "h3" tags and store these
animals_2_html = soup_2.body.find_all("h3")
print(animals_2_html)
# add all animals within tags to array 
animals_2 = []
for html_match in animals_2_html[1:]:
    animal = html_match.text.strip()
    print(animal)
    animals_2.append(animal)
    if animal == "Zebra": break

[<h3>Ina</h3>, <h3><strong>Aardvark</strong></h3>, <h3><strong>Baboon</strong></h3>, <h3 style="text-align: justify"><strong>Cheetah</strong></h3>, <h3 style="text-align: justify"><strong>Desert Warthog</strong></h3>, <h3 style="text-align: justify"><strong>Elephant</strong></h3>, <h3 style="text-align: justify"><strong>Fennec Fox</strong></h3>, <h3 style="text-align: justify"><strong>Giraffe</strong></h3>, <h3 style="text-align: justify"><strong>Hippo</strong></h3>, <h3 style="text-align: justify"><strong>Impala</strong></h3>, <h3 style="text-align: justify"><strong>Jackal</strong></h3>, <h3 style="text-align: justify"><strong>Klipspringer</strong></h3>, <h3 style="text-align: justify"><strong>Lion</strong></h3>, <h3 style="text-align: justify"><strong>Magpie</strong></h3>, <h3 style="text-align: justify"><strong>Nyala</strong></h3>, <h3 style="text-align: justify"><strong>Otter</strong></h3>, <h3 style="text-align: justify"><strong>Porcupine</strong></h3>, <h3>Quelea Quelea</h3>, <h3

### Website 3

In [53]:
# Website 3
url_website_3 = "https://animalsake.com/african-animals-list"

res_3 = requests.get(url_website_3)
txt_3 = res_3.text
status_3 = res_3.status_code

Aardvark ~ Erdvark
Aardwolf ~ Aardwolf
Addax
Addra Gazelle ~ Nanger dama
African Gray Parrot
Baboon ~ Nyani
Bat-eared Fox ~ Mbweha masikio
Beisa Oryx ~ Choroa
Blue Monkey ~ Choroa
Bongo
Bonobo
Bontebok
Buffalo
Burchell’s Zebra ~ Bontsebra
Bushbaby or Galago ~ Komba
Bushbuck ~ Bosbok
Bushpig ~ Bosvark
Camel ~ Ngamia
Caracal ~ Rooikat
Cheetah ~ Duma
Chimpanzee ~ Sokwe mtu
Civet ~ Fungo
Colobus Monkey ~ Mbega
Crocodile
Dik-dik
Duiker ~ Duiker
Eland ~ Eland
African Elephant ~ Tembo
Fennec Fox
Forest Hog
Fossa
Gecko
Genet ~ Kanu
Gerbil
Gerenuk or Waller’s Gazelle ~ Swala twiga
Giraffe ~ Twiga
Gembsok Oryx ~ Choroa
Gorilla ~ Sokwe
Greater Kudu ~ Tandala mkubwa
Grant’s Gazelle ~ Swala granti
Ground Squirrel ~ Kindi
Grysbok ~ Dondoro
Hare ~ Sunguru
Hedgehog ~ Kalunguyeye
Hippopotamus ~ Kiboko
Honey Badger ~ Nyegere
Hyenas (Spotted) ~ Gevlekte hiёna
Hyenas (Striped) ~ Fisi
Hyrax ~ Pimbi
Impala ~ Swala pala
Jackal ~ Bweha
Klipspringer ~ Mbuzi Maw
Lechwe Antelope
Lemur
Leopard ~ Chui
Lesser Kudu 

In [54]:
# convert to a beautiful soup object
soup_3 = bs(res_3.content, 'html.parser')

# get body
body_3 = soup.body
# print(body_3)

In [56]:
# find all "h3" tags and store these
animals_3_html = soup_3.body.find_all("h3")

# add all animals within tags to array 
animals_3 = []
for html_match in animals_3_html:
    animal = html_match.text.strip()
    # Remove African name
    if "~" in animal:
        index = animal.index("~")
        animal = animal[0:index - 1]
    print(animal)
    animals_3.append(animal)

Aardvark
Aardwolf
Addax
Addra Gazelle
African Gray Parrot
Baboon
Bat-eared Fox
Beisa Oryx
Blue Monkey
Bongo
Bonobo
Bontebok
Buffalo
Burchell’s Zebra
Bushbaby or Galago
Bushbuck
Bushpig
Camel
Caracal
Cheetah
Chimpanzee
Civet
Colobus Monkey
Crocodile
Dik-dik
Duiker
Eland
African Elephant
Fennec Fox
Forest Hog
Fossa
Gecko
Genet
Gerbil
Gerenuk or Waller’s Gazelle
Giraffe
Gembsok Oryx
Gorilla
Greater Kudu
Grant’s Gazelle
Ground Squirrel
Grysbok
Hare
Hedgehog
Hippopotamus
Honey Badger
Hyenas (Spotted)
Hyenas (Striped)
Hyrax
Impala
Jackal
Klipspringer
Lechwe Antelope
Lemur
Leopard
Lesser Kudu
Lichtenstein’s Hartebeest
African Lion
Madagascar Cockroach
Mamba
Mandrill
Meerkats
Mongoose
Monitor Lizard
Nyala
Okapi
Oribi
Ostrich
Otter
Pangolin
Patas
Porcupine
Puff Adder
Puku
Reedbuck
Rhinoceros
Roan
Sable Antelope
Serval
Cape Elephant Shrew
Sitatunga
Steinbok
Suni
Thomson’s Gazelle
Topi
Vervet Monkey
Warthog
Waterbuck
Wild Ass
African Wild Dog
Wild Cat
Wildebeest or Gnu


In [123]:
# now we create a final array that includes (unique) animals from the 'animal' list and from the current 'nature' dict on site

# 1. Create list of current words in 'nature' dict
nature_list = ["wildlife",
    "fauna",
    "habitat",
    "habitats",
    "conservation",
    "animals",
    "birds",
    "animal",
    "flora",
    "livestock",
    "hunting",
    "parks",
    "biota",
    "geese",
    "egyptian geese",
    "egyptian goose",
    "goose",
    "wilderness",
    "wild",
    "park",
    "natural",
    "nature",
    "hunt",
    "camping",
    "wildcat",
    "sauvage",
    "savage",
    "wild-type",
    "waterfowl",
    "deer",
    "barbary stag",
    "ecology",
    "biodiversity",
    "forest",
    "conservancy",
    "wildfowl",
    "endangered species",
    "conservationists",
    "raptors",
    "birdlife",
    "wetlands",
    "otters",
    "spotted necked otter",
    "african clawless otter",
    "congo clawless otter",
    "eurasian otter",
    "fisheries",
    "waterbirds",
    "poachers",
    "bird sanctuary",
    "ecosystems",
    "reptiles",
    "naturalists",
    "marine",
    "waterways",
    "antelope",
    "wildflower",
    "ecotourism",
    "seabird",
    "turtles",
    "vegetation",
    "tortoises",
    "african spurred tortoise",
    "amphibians",
    "wolves",
    "stone curlew",
    "foxes",
    "sacred ibis",
    "african sacred ibis",
    "monarch butterfly",
    "african monarch butterfly",
    "plain tiger",
    "african monarch",
    "goshawk",
    "african goshawk",
    "cirl bunting",
    "leopards",
    "panther",
    "safari park",
    "seafowl",
    "bird of passage",
    "zoography",
    "zoological garden",
    "shore bird",
    "african black oystercatcher",
    "water thick-knee",
    "white-fronted plover",
    "blacksmith plover",
    "common sandpiper",
    "zoo",
    "mammal",
    "dinotherium",
    "bird",
    "conservationist",
    "gallery forest",
    "zoographer",
    "wetland",
    "zoophagy",
    "bird table",
    "sea cow",
    "manatee",
    "tropical rain forest",
    "sea elephant",
    "elephant seal",
    "sea leopard",
    "aquatic bird",
    "birdwatch",
    "zoopathology",
    "wild dog",
    "african wild dog",
    "marine animal",
    "african elephant",
    "sanctuary",
    "miombo woodland",
    "woodland",
    "elephant bird",
    "birdling",
    "birdcatching",
    "rangership",
    "zoological",
    "wildgrave",
    "exotic",
    "indigenous",
    "aquatic",
    "terrestrial",
    "endangered",
    "migratory",
    "coastal",
    "captive",
    "vertebrate",
    "threatened",
    "fascinating",
    "nocturnal",
    "mammalian",
    "scenic",
    "endemic",
    "ocean",
    "underwater",
    "winged",
    "predatory",
    "fish",
    "plants",
    "forests",
    "forestry",
    "soil",
    "timber",
    "cattle",
    "insects",
    "soils",
    "wildlands",
    "wildflowers",
    "pesticides",
    "fishing",
    "extinction",
    "woods",
    "sanctuaries",
    "ecosystem",
    "elephants",
    "mammals",
    "outdoors",
    "ducks",
    "poaching",
    "harvesting",
    "watersheds",
    "decimation",
    "waterfalls",
    "forage",
    "invertebrates",
    "feeding",
    "extermination",
    "pastoralists",
    "shellfish",
    "whales",
    "buffalo",
    "african buffalo",
    "songbirds",
    "eagles",
    "seals",
    "squirrels",
    "hunted",
    "depleted",
    "inhabit",
    "exterminated",
    "forest ranger",
    "game warden",
    "gamekeeper",
    "park",
    "natural resources",
    "preservationist",
    "preserved",
    "safari",
    "serengeti",
    "agricultural",
    "biosphere",
    "environmental",
    "freshwater",
    "herbivores",
    "wildland"]

previous_length = len(nature_list)

# Convert all lists to sets to remove duplicates & combine all animals into 1 set
all_animals = set(animals_1).union(set(animals_2)).union(set(animals_3))
print(len(all_animals))

# 2. Add 'all_animals' set words to final list
for word in all_animals:
    lc_word = word.lower()
    # remove commas from words (some extracted words had commas)
    if ',' in lc_word:
        lc_word = lc_word.replace(",", "")
    # ignore extinct animals
    if "saurus" in lc_word or "therium" in lc_word: continue 
    # check if already in final list, and if not, add to final list
    if lc_word not in nature_list: 
        nature_list.append(lc_word)
        
print(nature_list, previous_length, len(nature_list))

921
['wildlife', 'fauna', 'habitat', 'habitats', 'conservation', 'animals', 'birds', 'animal', 'flora', 'livestock', 'hunting', 'parks', 'biota', 'geese', 'egyptian geese', 'egyptian goose', 'goose', 'wilderness', 'wild', 'park', 'natural', 'nature', 'hunt', 'camping', 'wildcat', 'sauvage', 'savage', 'wild-type', 'waterfowl', 'deer', 'barbary stag', 'ecology', 'biodiversity', 'forest', 'conservancy', 'wildfowl', 'endangered species', 'conservationists', 'raptors', 'birdlife', 'wetlands', 'otters', 'spotted necked otter', 'african clawless otter', 'congo clawless otter', 'eurasian otter', 'fisheries', 'waterbirds', 'poachers', 'bird sanctuary', 'ecosystems', 'reptiles', 'naturalists', 'marine', 'waterways', 'antelope', 'wildflower', 'ecotourism', 'seabird', 'turtles', 'vegetation', 'tortoises', 'african spurred tortoise', 'amphibians', 'wolves', 'stone curlew', 'foxes', 'sacred ibis', 'african sacred ibis', 'monarch butterfly', 'african monarch butterfly', 'plain tiger', 'african monarc

## Convert plural words to singular words
We will now convert any animal/insect names that are in their plural form to their singular form

In [124]:
for word in nature_list:
    if len(word) != 0 and word[-1] == "s":
        print(word)

habitats
animals
birds
parks
wilderness
endangered species
conservationists
raptors
wetlands
otters
fisheries
waterbirds
poachers
ecosystems
reptiles
naturalists
waterways
turtles
tortoises
amphibians
wolves
foxes
sacred ibis
african sacred ibis
leopards
indigenous
plants
forests
insects
soils
wildlands
wildflowers
pesticides
woods
sanctuaries
elephants
mammals
outdoors
ducks
watersheds
waterfalls
invertebrates
pastoralists
whales
songbirds
eagles
seals
squirrels
natural resources
herbivores
avocets
african elephants
hippos
aardvarks
hippopotamuses
northern white rhinoceros
common hippopotamus
savannah elephants
goats
vultures
wild boars
chickens
donkeys
hippopotamus
sardines
flamingoes
dinopithecus
african wild dogs
oryxes
goliath frogs
termites
pelicans
hammerheads
monkeys
frogs
rhinos
grasshoppers
moths
zebras
camels
albatrosses
pigs
forest elephants
black rhinoceros
xerus
crocodiles
african lions
sharks
civets
dragonflies
gorillas
cuckoos
white rhinoceroses
humpback whales
pygmy hi

### Manually Take Out Irrelevant Strings
We will now remove any irrelevant strings from the new 678 words

In [None]:
nature_final = 

### Convert all plural words to singular
We will now convert all plural words to their singular form to ensure our ASTRSC tool can detect both the singular and plural forms of the words. 

### Conclusion
After finishing this process, we were able to increase the number of words in the dictionary by over 4X.

~ End of notebook ~