# Scraping a Dataset of Animal Fun Facts
Example table:  

| Name             | Text | Source | Wikipedia  |
| -------------    | ---- | ------ | ---------- |
|Asian Elephant    | ...  | url    | https://en.wikipedia.org/wiki/Asian_elephant  |
|American Goldfinch| ...  | url    | https://en.wikipedia.org/wiki/American_goldfinch |

Note: the scientific names and tags will probably need to be hand-labeled.
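
The Wikipedia column in the example table looks derivable from the common name. A minimal sketch, assuming the article title is the name in sentence case with spaces replaced by underscores; `guess_wikipedia_url` is a hypothetical helper, and the guess will miss names that contain proper nouns or need disambiguation, so the results would still need a manual check.

```python
from urllib.parse import quote

def guess_wikipedia_url(animal_name):
    # Hypothetical helper: sentence-case the name and swap spaces for underscores.
    title = animal_name.strip().capitalize().replace(' ', '_')
    return 'https://en.wikipedia.org/wiki/' + quote(title)

print(guess_wikipedia_url('Asian Elephant'))
print(guess_wikipedia_url('American Goldfinch'))
```

Both example rows from the table come out as written there, but that is only two data points.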

# Possible Sources of Animal Facts:
SeaWorld: https://seaworld.org/animals/facts/  
Animal Corner: https://animalcorner.org/animals/  
Animal Pathfinder: https://sites.google.com/a/newburyport.k12.ma.us/library/animal-pathfinder  
Animal Fact Guide (~60) https://animalfactguide.com/animal-facts/  
A-Z Animals: https://a-z-animals.com/animals/  
Fact Animal: https://factanimal.com/  
San Diego Zoo: https://sdzwildlifeexplorers.org/animals  
Animal Fun Facts: https://www.animalfunfacts.net/frogs/5-poison-dart-frog.html  
Reddit: https://www.google.com/search?q=reddit+what+are+some+really+amazing+animal+facts  
R/Awwducational Verified Posts: https://www.reddit.com/r/Awwducational/top/?t=all&f=flair_name%3A%22Verified%22  
Animal Facts Encyclopedia: https://www.animalfactsencyclopedia.com/Animals-A-to-Z.html  
All About Birds: https://www.allaboutbirds.org/guide/House_Sparrow/overview

In [None]:
!pip install praw

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting praw
  Downloading praw-7.6.1-py3-none-any.whl (188 kB)
Collecting prawcore<3,>=2.1
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Collecting websocket-client>=0.54.0
  Downloading websocket_client-1.4.2-py3-none-any.whl (55 kB)
Collecting update-checker>=0.18
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: websocket-client, update-checker, prawcore, praw
Successfully installed praw-7.6.1 prawcore-2.3.0 update-checker-0.18.0 websocket-client-1.4.2


In [None]:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
import string
import praw
import re

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Page 1: SeaWorld 
https://seaworld.org/animals/facts/  
1100 animal facts clearly labeled under a 'fun facts' header ✅  
Note: some facts begin with 'for more information about X visit LINK' and need to be filtered out (TODO)
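
One way to handle the filtering noted above is to drop every list item that contains the phrase. A minimal sketch on placeholder strings; `drop_boilerplate` and the sample list are illustrative, not taken from the site.

```python
def drop_boilerplate(facts, phrase='more information'):
    # Keep only items that do not contain the boilerplate phrase (case-insensitive).
    return [f for f in facts if phrase not in f.lower()]

sample = [
    'Fact one.',
    'For more information about X, visit LINK.',
    'Fact two.',
]
print(drop_boilerplate(sample))
```

Building a new list avoids the index shifting that comes with popping items while looping over the same list.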

In [None]:
page = requests.get('https://seaworld.org/animals/facts/reptiles/boa-constrictor/')
len(page.content)

96692

In [None]:
def retrieve_facts(page):
    # returns the list of facts from the page, and the animal name
    soup = BeautifulSoup(page.content, 'html.parser')
    h = soup.find_all('ol') # the fun facts are the only ordered-list on the page
    ff = soup.find('h2', string=re.compile("Fun Facts")) # h2 element containing string "Fun Facts"

    if len(h) == 0: # If no ordered list get facts in between ff and next h2 element
        items = []
        for tag in ff.next_siblings:
            if tag.name == "h2":
                break
            else:
                items.append(tag)
        item_text = [i.get_text().strip() for i in items]
    else:
        items = h[0].find_all('li')
        item_text = [i.get_text().strip() for i in items]

    # removes list items that are empty or have "for more information" inside of them
    item_text = [t for t in item_text if t and "more information" not in t]

    animal_name = soup.find_all('h1', class_="page-banner__title")[0].get_text()
    return animal_name, item_text

def get_sub_page_links(home_page):
    # returns list of links under each animal category eg. asian elephant from mammals page
    soup = BeautifulSoup(home_page.content, 'html.parser')
    links = [t.a['href'] for t in soup.find_all('h2', class_="base-listing-child__title")]
    return links

In [None]:
# go through all paths and add all found facts to a list
begin_seaworld = time.time()
seaworld_url_prefix = 'https://seaworld.org/animals/facts/'
seaworld_url_paths = ['amphibians/', 
                      'arthropods/',
                      'birds/',
                      'bony-fish/',
                      'cartilaginous-fish/',
                      'cnidarians/',
                      'echinoderms/',
                      'mammals/',
                      'molluscans/',
                      'reptiles/']

failed_links = []
entries = [] # list of dicts [{'animal_name':str, 'source':str, 'text':str}, ...]
links_tried = 0
for p in seaworld_url_paths:    # go through paths and find all animal sublinks
    url = seaworld_url_prefix + p
    homepage = requests.get(url)
    sublinks = get_sub_page_links(homepage)
    for link in sublinks:
        links_tried += 1
        page_url = 'https://seaworld.org' + link
        page = requests.get(page_url)
        try:
            animal_name, facts = retrieve_facts(page)
            for fact in facts:
                entries.append({
                    'animal_name': animal_name, 
                    'source': page_url, 
                    'text': fact,
                    })
        except Exception as e: 
            print(f"Exception occurred at: {page_url} \n")
            print(e)
            failed_links.append(page_url)   
end_seaworld = time.time()
print(f"Total runtime of the cell is {end_seaworld - begin_seaworld} seconds.")
print(f"Total links tried: {links_tried}")
print(f"Total failed links: {len(failed_links)}")
print(f"Total entries added: {len(entries)}")


https://seaworld.org/animals/facts/amphibians/axolotl/
https://seaworld.org/animals/facts/amphibians/cuban-tree-frog/
https://seaworld.org/animals/facts/amphibians/mantella-frogs/
https://seaworld.org/animals/facts/amphibians/marine-toad/
https://seaworld.org/animals/facts/amphibians/north-american-bullfrog/
https://seaworld.org/animals/facts/amphibians/oriental-fire-bellied-toad/
https://seaworld.org/animals/facts/amphibians/poison-arrow-frogs/
https://seaworld.org/animals/facts/amphibians/tiger-salamander/
https://seaworld.org/animals/facts/arthropods/emperor-scorpion/
https://seaworld.org/animals/facts/arthropods/monarch-butterfly/
https://seaworld.org/animals/facts/arthropods/southeastern-lubber/
https://seaworld.org/animals/facts/arthropods/spiders/
https://seaworld.org/animals/facts/birds/abdims-stork/
https://seaworld.org/animals/facts/birds/abyssinian-blue-winged-goose/
https://seaworld.org/animals/facts/birds/adelie-penguin/
https://seaworld.org/animals/facts/birds/african-bla

In [None]:
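# Delete duplicate fact entries, keeping the first animal each fact appears under
seen_texts = set()
unique_entries = []
for e in entries:
    if e["text"] not in seen_texts:
        seen_texts.add(e["text"])
        unique_entries.append(e)
duplicates_found = len(entries) - len(unique_entries)
entries = unique_entries  # rebuilt rather than popped in place, so no items are skipped mid-loop
print(f"{duplicates_found} duplicate facts found.")
# Initially found 1194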

In [None]:
df = pd.DataFrame.from_records(entries)
df

Unnamed: 0,animal_name,source,text
0,Axolotl,https://seaworld.org/animals/facts/amphibians/...,One derivation of the name 'axolotl' reference...
1,Axolotl,https://seaworld.org/animals/facts/amphibians/...,An axolotl's skeleton is comprised mostly of c...
2,Axolotl,https://seaworld.org/animals/facts/amphibians/...,"They are aquatic, and although they posses rud..."
3,Axolotl,https://seaworld.org/animals/facts/amphibians/...,If axolotls spend prolonged periods of time in...
4,Axolotl,https://seaworld.org/animals/facts/amphibians/...,Axolotls have amazing healing abilities. Norma...
...,...,...,...
1096,Yellow Rat Snake,https://seaworld.org/animals/facts/reptiles/ye...,"Like many reptiles, the incubation temperature..."
1097,Yellow Rat Snake,https://seaworld.org/animals/facts/reptiles/ye...,"The yellow rat snake, or chicken snake, is kno..."
1098,Yellow Rat Snake,https://seaworld.org/animals/facts/reptiles/ye...,"Like pythons and boas, rat snakes are constric..."
1099,Yellow Rat Snake,https://seaworld.org/animals/facts/reptiles/ye...,Yellow rat snakes spend much time underground ...


In [None]:
df.to_csv('/content/drive/MyDrive/Colab Notebooks/Animal_Facts/animal_facts_seaworld.csv')

# Page 2: Animal Corner
https://animalcorner.org/animals/  
This page has some 600 animals on it.  
Unfortunately, this page is unstructured text. I could parse by sentence, and then hand-filter them for the 'fun' ones, or parse by paragraph (but those would be long), or try automatically parsing it?
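
The parse-by-sentence option could start from a naive splitter plus a crude keyword filter to shrink the pile that needs hand-filtering. A rough sketch, assuming sentences end in `.`, `!` or `?` followed by whitespace (abbreviations such as "e.g." will split wrongly); `split_sentences`, `looks_fun`, the keyword list and the sample text are all hypothetical.

```python
import re

# Hypothetical hints that a sentence might be a "fun" fact.
FUN_HINTS = ('only', 'fastest', 'largest', 'smallest', 'can ', 'up to')

def split_sentences(text):
    # Split after ., ! or ? when followed by whitespace; drop empty pieces.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def looks_fun(sentence):
    lowered = sentence.lower()
    return any(hint in lowered for hint in FUN_HINTS)

# Placeholder text, not scraped from the site.
text = 'The animal lives in forests. It can jump up to ten times its height! It eats leaves.'
candidates = [s for s in split_sentences(text) if looks_fun(s)]
print(candidates)
```

The filter is only meant to rank candidates for manual review, not to decide what counts as a fact.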

In [None]:
# TODO

# Page 3: Animal Fact Guide
https://animalfactguide.com/animal-facts/   
About 60 animals.  
Also unstructured.


In [None]:
# TODO

# Page 4: A-Z Animals
https://a-z-animals.com/animals/animals-that-start-with-a/  
This one has indexes of hundreds of animals, and also the pages each have a labeled fun fact!  

On the other hand, a ton of the facts are just wrong or lazy (e.g. Arctic Char are the most northerly distributed of all *freshwater* fish, not of all fish), and none of them cite a source.


In [None]:
# urls are like: https://a-z-animals.com/animals/animals-that-start-with-a/ <---replace 'a' with any letter
page = requests.get('https://a-z-animals.com/animals/animals-that-start-with-a/')
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <script>
   !function(){"use strict";for(var t=window.location.search.substring(1).split("&"),e=0;e<t.length;e++){var i="adt_ei",a=t[e];if(0===a.indexOf(i)){var r=a.split(i+"=")[1];localStorage.setItem(i,r),t.splice(e,1),history.replaceState(null,"","?"+t.join("&"));break}}}();
  </script>
  <meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots"/>
  <script data-cfasync="false">
   // 			(function (w, d) {
// 				const adLayoutTest = Math.random() < 0.5;

// 				w.adthrive = w.adthrive || {};
// 				w.adthrive.cmd = w.adthrive.cmd || [];

// 				if (adLayoutTest) {
// 				  d.getElementsByTagName('head')[0].classList.add('adthrive-test-1');
// 				}

// 				w.adthrive.cmd.push(() => {
// 				   w.adthrive.config.abGroup.set('pubtst1', adLayoutTest ? 'on' : 'off');

// 				   if(adLayoutTest) {

In [None]:
len(soup.find_all('h3'))

157

In [None]:
len(soup.find_all('p', class_="font-weight-bold fun-fact"))

157

In [None]:
entries = []
for character in string.ascii_lowercase:
  page = requests.get(f'https://a-z-animals.com/animals/animals-that-start-with-{character}/')
  soup = BeautifulSoup(page.content, 'html.parser')

  # it's sketchy, but these do line up exactly, so I could just merge the parallel lists
  h3s = [h.get_text() for h in soup.find_all('h3')]
  src_links = [h.a['href'] for h in soup.find_all('h3')]
  facts = [f.get_text().replace('Fun Fact:', '').strip() for f in soup.find_all('p', class_="font-weight-bold fun-fact")]


  for animal_name, source_link, fact in zip(h3s, src_links, facts):
    if len(fact) < 1: continue # skip blanks
    entries.append({'animal_name': animal_name, 'source': source_link, 'text': fact})

  time.sleep(1)
len(entries)

2138

In [None]:
df = pd.DataFrame.from_records(entries)
df

Unnamed: 0,animal_name,source,text
0,Aardvark,https://a-z-animals.com/animals/aardvark/,Can move 2ft of soil in just 15 seconds!
1,Aardwolf,https://a-z-animals.com/animals/aardwolf/,The aardwolf has five toes on its front paws
2,Abyssinian,https://a-z-animals.com/animals/abyssinian/,The oldest breed of cat in the world!
3,Addax,https://a-z-animals.com/animals/addax/,The hooves of the addax are splayed and have f...
4,Adelie Penguin,https://a-z-animals.com/animals/adelie-penguin/,Eats up to 2kg of food per day!
...,...,...,...
2133,Zebra Tarantula,https://a-z-animals.com/animals/zebra-tarantula/,They can stay hidden in their burrows for months!
2134,Zebu,https://a-z-animals.com/animals/zebu/,There are around 75 different species!
2135,Zonkey,https://a-z-animals.com/animals/zonkey/,The offspring of Zebra and Donkey parents!
2136,Zorse,https://a-z-animals.com/animals/zorse/,The offspring of a Zebra and Horse parents!


In [None]:
df.to_csv('/content/drive/MyDrive/Colab Notebooks/Animal_Facts/animal_facts_az_animals.csv')

# Page 4.5: A-Z Animals (2nd pass)
Some of the pages have headers with the word 'facts' in them. Others don't.  
These headers are sometimes followed by a paragraph, sometimes a list.  
Maybe just skip the ones with paragraphs?

In [None]:
page = requests.get('https://a-z-animals.com/animals/alligator-gar/')
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
# TODO: no longer have access, might need to rotate IPs (https://www.scrapehero.com/how-to-rotate-proxies-and-ip-addresses-using-python-3/)
def get_proxies():
    # returns a set of 'ip:port' strings scraped from the free proxy list
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find_all('table', class_="table table-striped table-bordered")

    proxies = set()
    for row in table[0].tbody.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) < 2: continue
        # assumes the first two columns are the IP address and the port
        proxies.add(f"{cells[0].get_text().strip()}:{cells[1].get_text().strip()}")

    return proxies
proxies = get_proxies()

In [None]:
print(proxies)

{'66.70.178.214:9300'}
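
If rotating IPs turns out to be necessary, the rotation itself is simple once a proxy list exists. A minimal sketch, assuming a `fetch(url, proxy)` callable that raises on failure (with `requests` this could wrap `requests.get(url, proxies={'http': ..., 'https': ...})`); `fetch_with_rotation`, the fake fetcher and the addresses are hypothetical.

```python
import itertools

def fetch_with_rotation(url, proxies, fetch, max_attempts=5):
    # Cycle through the proxy pool, moving on whenever a request fails.
    if not proxies:
        raise ValueError('no proxies to rotate through')
    pool = itertools.cycle(sorted(proxies))
    last_error = None
    for _ in range(max_attempts):
        proxy = next(pool)
        try:
            return fetch(url, proxy)
        except Exception as e:
            last_error = e
    raise RuntimeError(f'all {max_attempts} attempts failed') from last_error

# Stand-in fetcher for demonstration: only one of the proxies "works".
def fake_fetch(url, proxy):
    if proxy != '10.0.0.2:8080':
        raise ConnectionError(f'{proxy} refused')
    return f'fetched {url} via {proxy}'

print(fetch_with_rotation('https://example.com', {'10.0.0.1:8080', '10.0.0.2:8080'}, fake_fetch))
```

Free proxies tend to be unreliable, so a sleep between requests may matter at least as much as the rotation.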


In [None]:
datapath = '/content/drive/MyDrive/Colab Notebooks/Animal_Facts/'
with open(datapath+'Amano Shrimp Animal Facts _ Caridina multidentata - AZ Animals.html') as f:
  page = f.read()
soup = BeautifulSoup(page, 'lxml')
print(soup.prettify())
# h2s = soup.find_all('h2')
# for h in h2s:
#   print(h)

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <script>
   !function(){"use strict";for(var t=window.location.search.substring(1).split("&"),e=0;e<t.length;e++){var i="adt_ei",a=t[e];if(0===a.indexOf(i)){var r=a.split(i+"=")[1];localStorage.setItem(i,r),t.splice(e,1),history.replaceState(null,"","?"+t.join("&"));break}}}();
  </script>
  <meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots"/>
  <script data-cfasync="false">
   // 			(function (w, d) {
// 				const adLayoutTest = Math.random() < 0.5;

// 				w.adthrive = w.adthrive || {};
// 				w.adthrive.cmd = w.adthrive.cmd || [];

// 				if (adLayoutTest) {
// 				  d.getElementsByTagName('head')[0].classList.add('adthrive-test-1');
// 				}

// 				w.adthrive.cmd.push(() => {
// 				   w.adthrive.config.abGroup.set('pubtst1', adLayoutTest ? 'on' : 'off');

// 				   if(adLayoutTest) {

# Page 5: Fact Animal
https://factanimal.com/animals/  
Has an index of sublinks ✅  
Each page has an interesting facts header ✅   
The opening paragraphs could be used as facts? no, just the facts at bottom.  
Problems:
- some facts lead into each other (facts not independent) (ex. see https://factanimal.com/kakapo/ facts 2, 3, 4)  
  - just record them separately, can't hand-format 4,000 facts, what can ya do.
- the paragraph below adds important context  
  - just merge title and following paragraph together
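
Merging a numbered header with its following paragraph could look like the sketch below. It works on plain strings; `merge_fact` is a hypothetical helper, and it only appends a full stop when the header does not already end in punctuation, so a header ending in "!" is left alone.

```python
import re

def merge_fact(header, paragraph=''):
    # Strip a leading "11. " style counter from the header.
    title = re.sub(r'^\d+\.\s*', '', header.strip())
    # Add a full stop only if the header has no closing punctuation.
    if title and title[-1] not in '.!?':
        title += '.'
    paragraph = paragraph.strip()
    return title + '\n' + paragraph if paragraph else title

print(merge_fact('11. The kakapo has an unusual way to protect itself when startled.',
                 'Placeholder paragraph.'))
```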


In [None]:
# This page has the list of links to other pages: extract the list of links
page = requests.get('https://factanimal.com/animals/')
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en-GB">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="https://gmpg.org/xfn/11" rel="profile"/>
  <link href="https://factanimal.com/xmlrpc.php" rel="pingback"/>
  <meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots">
   <title>
    Animals A-Z - Fact Animal
   </title>
   <link href="https://factanimal.com/animals/" rel="canonical">
    <meta content="en_GB" property="og:locale">
     <meta content="article" property="og:type"/>
     <meta content="Animals A-Z - Fact Animal" property="og:title"/>
     <meta content="This page includes all animals we plan to cover on Fact Animal. As we publish new content, each of these animal types will be hyperlinked" property="og:description"/>
     <meta content="https://factanimal.com/animals/" property="og:url"/>
     <meta content="Fact Animal" property="og:site_name"/>
     <meta content="202

In [None]:
animal_page_links = []
paragraphs = soup.find_all('p')
for p in paragraphs:
  for a in p.find_all('a'):
    animal_page_links.append(a['href'])

animal_page_links = animal_page_links[1:] # first link is not an animal so skip
animal_page_links = [link for link in animal_page_links if not re.search(r'\.pdf', link)]

print(f'there are {len(animal_page_links)} pages to scrape. 10-20 facts per page. so 2000-4000 animal facts.')
print(animal_page_links[:10], '...')

there are 250 pages to scrape. 10-20 facts per page. so 2000-4000 animal facts.
['https://factanimal.com/aardvark/', 'https://factanimal.com/aardwolf/', 'https://factanimal.com/alligator/', 'https://factanimal.com/alpine-ibex/', 'https://factanimal.com/arctic-fox/', 'https://factanimal.com/armadillo/', 'https://factanimal.com/asian-giant-hornet/', 'https://factanimal.com/atlantic-wolffish/', 'https://factanimal.com/atlas-moth/', 'https://factanimal.com/axolotl/'] ...


In [None]:
page = requests.get('https://factanimal.com/kakapo/')
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en-GB">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="https://gmpg.org/xfn/11" rel="profile"/>
  <link href="https://factanimal.com/xmlrpc.php" rel="pingback"/>
  <meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots">
   <title>
    14 Kakapo Facts - Fact Animal
   </title>
   <link href="https://factanimal.com/kakapo/" rel="canonical">
    <meta content="en_GB" property="og:locale">
     <meta content="article" property="og:type"/>
     <meta content="14 Kakapo Facts - Fact Animal" property="og:title"/>
     <meta content="Kakapo Profile The word kakapo translates to mean night parrot and that essentially describes this unusual bird. Also called the owl parrot, this is a" property="og:description"/>
     <meta content="https://factanimal.com/kakapo/" property="og:url"/>
     <meta content="Fact Animal" property="og:site_name"/>
    

In [None]:
def parse_page(page_url):
  page = requests.get(page_url)
  soup = BeautifulSoup(page.content, 'html.parser')

  animal_name = re.sub(' Facts', '', soup.find('h1', class_="page-title").text)
  overview_fact = "" # save the overview text as a string, add to a list and parse it later
  print(page_url)

  fact_entries = []
  h3s = soup.find_all('h3')
  for idx, h in enumerate(h3s):

    # first h3 should be overview info (save somewhere else, then manually parse into facts later)
    if idx == 0:
      # paragraphs directly after the first h2, skipping image-only or near-empty ones
      node = soup.find_all('h2')[0].find_next_sibling("p")
      if node is None: continue
      while node is not None and node.name == "p":
        if node.img is None and len(node.text) > 5:
          overview_fact += node.text + '\n'
        node = node.find_next_sibling()

      # everything from the first paragraph after this h3 up to the next h2
      node = h.find_next_sibling("p")
      if node is None: continue
      while node is not None and node.name != "h2":
        overview_fact += node.text + '\n'
        node = node.find_next_sibling()
      continue

    # find headers like: "11. The kakapo has an unusual way to protect itself when startled." and merge with following paragraph
    m = re.match(r'\d+\. ', h.text)
    if m:
      part1_text = h.text[m.end():]
      following_p = h.find_next_sibling("p")
      part2_text = following_p.text if following_p is not None else ''
      fact = part1_text + '.\n'+ part2_text
      fact_entries.append({'animal_name': animal_name, 'source': page_url, 'text': fact})

  overview_fact_entry = {'animal_name': animal_name, 'source': page_url, 'text': overview_fact.strip()}

  # fact_entries is a list of entries, overview_fact_entry is a single entry
  return fact_entries, overview_fact_entry

facts, overview_fact = parse_page('https://factanimal.com/hedgehog/')
print(len(facts))
print(overview_fact['text'])
for f in facts:
  print(f)
  print('----------\n')

https://factanimal.com/hedgehog/
22
Hedgehogs are loveable mammals known for their prickly exteriors and ball-like shape.
While these adorable creatures can be found in the wild, they are also often kept as pets. The sharp quills that cover most of their bodies require some special handling and care from owners.
There are 18 species of hedgehog spread across Europe, Africa, and Central Asia. There are no hedgehogs native to the Americas, or Australia and they have been introduced into New Zealand.
They are mostly ground-dwelling and forage for insects, fruit, roots, and grasses. Crickets, millipedes, worms, and beetles are some of their favorite snacks.
Despite possessing a similar appearance, hedgehogs are not closely related to porcupines or echidnas, and have distant ancestory to shrews.
Their quills are quite different in that the barbed spines of porcupines are capable of detaching from their bodies. Furthermore, porcupines are classified as rodents while hedgehogs are not.
{'anim

In [None]:
overview_entries = []
entries = []
for url in animal_page_links:
  fact_entries, overview_fact_entry = parse_page(url)
  entries += fact_entries
  overview_entries.append(overview_fact_entry)
  time.sleep(0.55)

https://factanimal.com/aardvark/
https://factanimal.com/aardwolf/
https://factanimal.com/alligator/
https://factanimal.com/alpine-ibex/
https://factanimal.com/arctic-fox/
https://factanimal.com/armadillo/
https://factanimal.com/asian-giant-hornet/
https://factanimal.com/atlantic-wolffish/
https://factanimal.com/atlas-moth/
https://factanimal.com/axolotl/
https://factanimal.com/aye-aye/
https://factanimal.com/babirusa/
https://factanimal.com/baboon/
https://factanimal.com/bald-eagle/
https://factanimal.com/barracuda/
https://factanimal.com/barreleye-fish/
https://factanimal.com/basking-shark/
https://factanimal.com/bat-eared-fox/
https://factanimal.com/bearded-vulture/
https://factanimal.com/beaver/
https://factanimal.com/beluga-whale/
https://factanimal.com/bilby/
https://factanimal.com/binturong/
https://factanimal.com/bison/
https://factanimal.com/black-rhino/
https://factanimal.com/black-widow-spider/
https://factanimal.com/blanket-octopus/
https://factanimal.com/blobfish/
https://f

In [None]:
df = pd.DataFrame.from_records(entries)
df

Unnamed: 0,animal_name,source,text
0,Aardvark,https://factanimal.com/aardvark/,Their name means ‘Earth Pig’.\nAnd for good re...
1,Aardvark,https://factanimal.com/aardvark/,Aardvarks are also named after their long mout...
2,Aardvark,https://factanimal.com/aardvark/,They’re well adapted for the heat.\nAardvarks ...
3,Aardvark,https://factanimal.com/aardvark/,They are solitary animals and nocturnal feeder...
4,Aardvark,https://factanimal.com/aardvark/,They can burrow at an alarming rate!.\nSince t...
...,...,...,...
3356,Zorse,https://factanimal.com/zorse/,Zorses are a type of zebroid.\nWhen it comes t...
3357,Zorse,https://factanimal.com/zorse/,There are no animals with zorse parents.\nSadl...
3358,Zorse,https://factanimal.com/zorse/,They’re hardier than mules.\nOne of the most n...
3359,Zorse,https://factanimal.com/zorse/,There are hundreds of types of zorses.\nNot al...


In [None]:
df.to_csv('/content/drive/MyDrive/Colab Notebooks/Animal_Facts/animal_facts_factanimal.csv')

In [None]:
df2 = pd.DataFrame.from_records(overview_entries)
df2

Unnamed: 0,animal_name,source,text
0,Aardvark,https://factanimal.com/aardvark/,The Afrikaans language refers to a type of pig...
1,Aardwolf,https://factanimal.com/aardwolf/,"The Aardwolf is a member of the Hyena family, ..."
2,Alligator,https://factanimal.com/alligator/,The alligator is about as close as one can get...
3,Alpine Ibex,https://factanimal.com/alpine-ibex/,The story of the Alpine Ibex is a rollercoaste...
4,Arctic Fox,https://factanimal.com/arctic-fox/,Equipped with specialist summer and winter war...
...,...,...,...
230,Woodlouse,https://factanimal.com/woodlouse/,"Wherever you are in the world, there’s a good ..."
231,Yeti Crab,https://factanimal.com/yeti-crab/,Kiwa are a genus of marine decapods that inhab...
232,Zebra,https://factanimal.com/zebra/,Zebras (subgenus Hippotigris) are well-known f...
233,Zebra Duiker,https://factanimal.com/zebra-duiker/,"Of all the duiker species, the zebra duiker ma..."


In [None]:
df2.to_csv('/content/drive/MyDrive/Colab Notebooks/Animal_Facts/animal_facts_factanimal_overview_TODO_MANUAL_PARSE.csv')

# Page 6: San Diego Zoo
https://sdzwildlifeexplorers.org/animals  
  
This will need a fair bit of hand-parsing...
But not too difficult, since not many pages. But lots of facts per animal.

In [None]:
# TODO

# Page 7: AnimalFunFacts.net
https://www.animalfunfacts.net/tags/884-a.html  
will need to loop over a-z and hand-select animal page urls...  
I could make like a text file and copy-paste things...  
...or loop over each animal page's text, and copy-paste the facts into an input field...


...some pages have 'fun facts' header, others do not. this is a problem.❌  

...but, there are maybe 200+ animals listed here, times 5 facts each is 1000+.




In [None]:
# TODO

# Page 8: Individual Reddit Threads
Not a reliable source, at all, but at least it's structured.  

How to scrape reddit: https://towardsdatascience.com/scraping-reddit-data-1c0af3040768


In [None]:
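import os
# credentials are read from environment variables instead of being hard-coded;
# REDDIT_CLIENT_ID / REDDIT_CLIENT_SECRET are placeholder names, set them before running
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='animal_facts_scraper')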

In [None]:
reddit_page_ids = ['gbh7zz', #✅ https://www.reddit.com/r/AskReddit/comments/gbh7zz/what_are_some_really_amazing_animal_facts/ 
                   '9a4ku4', #✅ https://www.reddit.com/r/AskReddit/comments/9a4ku4/whats_your_1_obscure_animal_fact/
                   '2evjpq', #✅ https://www.reddit.com/r/AskReddit/comments/2evjpq/what_are_some_animal_fun_fact_you_know/
                   'uw8jr7', # https://www.reddit.com/r/AskReddit/comments/uw8jr7/what_is_your_number_1_obscure_animal_fact/
                   'it1hqv', # https://www.reddit.com/r/AskReddit/comments/it1hqv/what_is_an_animal_fact_that_absolutely_blew_your/
                   'c9ifmi', # https://www.reddit.com/r/AskReddit/comments/c9ifmi/what_weird_animal_facts_do_you_know/
                   'bfo113', # https://www.reddit.com/r/Awwducational/comments/bfo113/whats_everyones_best_goto_animal_fact/
                   '14v827', # https://www.reddit.com/r/AskReddit/comments/14v827/reddit_whats_your_best_random_animal_fact/
                   'cfye65', # https://www.reddit.com/r/AskReddit/comments/cfye65/what_are_some_random_animal_facts_that_you_can/
                   '9xj60q', # https://www.reddit.com/r/AskReddit/comments/9xj60q/whats_a_cool_random_fact_you_know_about_animals/
                   '928dpv'  # https://www.reddit.com/r/AskReddit/comments/928dpv/whats_the_craziest_animal_fact_you_know/
]

cur_page = reddit_page_ids[2]


submission = reddit.submission(id=cur_page) # fetch the thread selected by cur_page above
print(submission.title)
print(len(submission.comments))
print(cur_page)

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



What are some animal "fun fact" you know?
81
2evjpq


In [None]:
entries = []

submission.comments.replace_more(limit=0)  # limit=0 drops the "load more comments" stubs without fetching them; limit=None would fetch every comment. i think this cuts off the bottom 1700 or so lower-ranked comments, which is probably for the best...
for idx, c in enumerate(submission.comments): # just top-level comments, not replies
  link_url = c.permalink
  fact = c.body

  print('Index', idx)
  print(fact)
  animal = input('Animal:')

  # skip
  if animal in ['', ' ', 'x', 'X']:
    print('~Skipped~\n----------------\n')
    continue

  # edit to break into multiple facts
  if animal in ['edit']:
    print('\n~~~~~~~~~~~~EDIT MODE~~~~~~~~~~~~~~~~~\n')
    while True:
      edited_name = input('Animal Name:')
      if edited_name in ['', ' ', 'x', 'X']:
        print('\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n')
        break
      edited_fact = input('Enter Fact:')
      entries.append( {'animal_name': edited_name, 'source': link_url, 'text': edited_fact} )
      print(f'Added {edited_name} ({len(entries)})\n')
  
  # or add as is
  else:
    entries.append( {'animal_name': animal, 'source': link_url, 'text': fact} )
    print(f'Added {animal} ({len(entries)})')
  print('-------------\n')


len(entries)

Index 0
Turkey vultures nostrils are the same size as their talons so they can pick dead flesh out of their nose. They also pee down their legs to prevent bugs on their food from coming up into their feathers...and to cool down. 
Animal:edit

~~~~~~~~~~~~EDIT MODE~~~~~~~~~~~~~~~~~

Animal Name:turkey vulture
Enter Fact:Turkey vultures nostrils are the same size as their talons so they can pick dead flesh out of their nose.
Added turkey vulture (1)

Animal Name:turkey vulture
Enter Fact:Turkey vultures pee down their legs to prevent bugs on their food from coming up into their feathers, and to cool down.
Added turkey vulture (2)

Animal Name:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-------------

Index 1
The Saluki (looks like a shaggy greyhound) is the fastest land animal on the planet in a two mile race.  

Edit: It was a three mile race, according the NOVA episode I learned that from.
Animal:edit

~~~~~~~~~~~~EDIT MODE~~~~~~~~~~~~~~~~~

Animal Name:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

45

In [None]:
df = pd.DataFrame.from_records(entries)
df

Unnamed: 0,animal_name,source,text
0,turkey vulture,/r/AskReddit/comments/2evjpq/what_are_some_ani...,Turkey vultures nostrils are the same size as ...
1,turkey vulture,/r/AskReddit/comments/2evjpq/what_are_some_ani...,Turkey vultures pee down their legs to prevent...
2,alpaca,/r/AskReddit/comments/2evjpq/what_are_some_ani...,"Alpacas are often placed in fields of sheep, a..."
3,vampire bat,/r/AskReddit/comments/2evjpq/what_are_some_ani...,vampire bats drink half their weight in blood ...
4,kangaroo,/r/AskReddit/comments/2evjpq/what_are_some_ani...,Kangaroos are pretty much giant marsupial bell...
5,duck,/r/AskReddit/comments/2evjpq/what_are_some_ani...,Ducks can sleep half of their brains. They can...
6,lyrebird,/r/AskReddit/comments/2evjpq/what_are_some_ani...,Lyrebirds can imitate virtually every sound th...
7,jaguar,/r/AskReddit/comments/2evjpq/what_are_some_ani...,Jaguars are the only big cat that rarely kill ...
8,honey badger,/r/AskReddit/comments/2evjpq/what_are_some_ani...,Honey badgers will suffer hundreds of bee stin...
9,dog,/r/AskReddit/comments/2evjpq/what_are_some_ani...,Dogs have a special system in their necks that...


In [None]:
df.to_csv(f'/content/drive/MyDrive/Colab Notebooks/Animal_Facts/animal_facts_reddit_{cur_page}.csv')
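
The `source` values saved here are relative permalinks (`/r/AskReddit/...`), whereas the other sources store full URLs. A small sketch of normalising them, assuming `https://www.reddit.com` as the base; `absolute_permalink` is a hypothetical helper.

```python
from urllib.parse import urljoin

REDDIT_BASE = 'https://www.reddit.com'  # assumed base URL

def absolute_permalink(permalink):
    # urljoin leaves already-absolute URLs unchanged.
    return urljoin(REDDIT_BASE, permalink)

print(absolute_permalink('/r/AskReddit/comments/2evjpq/'))
```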

# Page 9: r/Awwducational
Also reddit, but posts have a 'verified' flair, with source links in the comments, so possibly more reliable.  
And the title is usually in fact form.  

https://www.reddit.com/r/Awwducational/top/?t=all&f=flair_name%3A%22Verified%22

In [None]:
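import os
# credentials are read from environment variables instead of being hard-coded;
# REDDIT_CLIENT_ID / REDDIT_CLIENT_SECRET are placeholder names, set them before running
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='animal_facts_scraper')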
awwducational_posts = reddit.subreddit('Awwducational').search('flair:"verified"', limit=None) 
print(len(list(awwducational_posts)))

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.




250


In [None]:
entries = []
print('hi')
for post in reddit.subreddit('Awwducational').search('flair:"Verified"', limit=None):
  url_link = 'https://reddit.com'+post.permalink
  fact = post.title
  media_link = post.url # photo/video from post
  print("Media Link:", media_link)
  print(url_link)
  print(fact)

  animal = input('Animal:')

  # skip
  if animal in ['', ' ', 'x', 'X']:
    print('~Skipped~\n----------------\n')
    continue

  # edit to break into multiple facts
  if animal in ['edit']:
    print('\n~~~~~~~~~~~~EDIT MODE~~~~~~~~~~~~~~~~~\n')
    while True:
      edited_name = input('Animal Name:')
      if edited_name in ['', ' ', 'x', 'X']:
        print('\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n')
        break
      edited_fact = input('Enter Fact:')
      entries.append( {'animal_name': edited_name, 'source': url_link, 'text': edited_fact, 'media_link': media_link} )
      print(f'Added {edited_name} ({len(entries)})\n')
  
  # or add as is
  else:
    entries.append( {'animal_name': animal, 'source': url_link, 'text': fact, 'media_link': media_link} )
    print(f'Added {animal} ({len(entries)})')
  print('-------------\n')



  print('Added ')




hi
Media Link: https://v.redd.it/pry5ygflw97a1
https://reddit.com/r/Awwducational/comments/zrrt90/platypus_bill_comes_equipped_with_specialized/
Platypus' bill comes equipped with specialized nerve endings, called electroreceptors, which detect tiny electrical currents generated by the muscular contractions of prey. It has no teeth, so the platypus stores its "catch" in its cheek pouches and mashes up its meal with the help of gravel bits
Animal:platypus
Added platypus (1)
-------------

Added 
Media Link: https://v.redd.it/fz1sjqa7m05a1
https://reddit.com/r/Awwducational/comments/zhf5mt/giant_pandas_subsist_almost_entirely_on_bamboo/
Giant Pandas subsist almost entirely on bamboo, eating from 26 to 84 pounds per day.
Animal:giant panda
Added giant panda (2)
-------------

Added 
Media Link: https://v.redd.it/of3f9eoh7n6a1
https://reddit.com/r/Awwducational/comments/zownb5/the_clarity_of_indian_ringneck_parrots_speech/
The clarity of Indian Ringneck Parrot's speech, along with their ab




Added tapir (99)
-------------

Added 
Media Link: https://v.redd.it/siovzwkv84m91
https://reddit.com/r/Awwducational/comments/x6t9mo/hippos_have_selfsharpening_teeth_which_are_used/
Hippos have self-sharpening teeth which are used for both chewing and combat. On average, hippos have 36 teeth; their molars do the hard work of grinding down the 40kg of plant material they consume each day. This hippo is getting a thorough dental hygiene check and cleaning at a zoo in Osaka.
Animal:Hippo
Added Hippo (100)
-------------

Added 
Media Link: https://v.redd.it/ldm5dnwvelu91
https://reddit.com/r/Awwducational/comments/y7c49m/giant_pandas_are_the_only_bears_with_grasping/
Giant pandas are the only bears with grasping paws! Instead of opposable thumbs, an elongated wrist bone acts as a sixth finger to let them hold bamboo more easily.
Animal:Giant panda
Added Giant panda (101)
-------------

Added 
Media Link: https://i.redd.it/2vys3ab66mu91.jpg
https://reddit.com/r/Awwducational/comments/y7g0m

KeyboardInterrupt: ignored
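
The labeling run above ended with a `KeyboardInterrupt`, and until the DataFrame is saved the collected `entries` exist only in memory. A minimal sketch of a checkpoint helper follows, assuming the same four keys used in the loop. The function name `save_checkpoint` and the file path in the usage note are hypothetical and not part of the original notebook.

```python
import csv

# Column order matches the dict keys built in the labeling loop.
FIELDS = ["animal_name", "source", "text", "media_link"]


def save_checkpoint(entries, path):
    """Write every entry collected so far to a CSV file, overwriting the previous checkpoint."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(entries)
```

Calling something like `save_checkpoint(entries, 'awwducational_checkpoint.csv')` after each `entries.append(...)` would keep the file at most one entry behind the session.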

In [None]:
df = pd.DataFrame.from_records(entries)
df

Unnamed: 0,animal_name,source,text,media_link
0,Pygmy Hippopotamus,https://reddit.com/r/Awwducational/comments/z0...,The Pygmy Hippopotamus is the much smaller for...,https://v.redd.it/tcwv55l0n41a1
1,fossa,https://reddit.com/r/Awwducational/comments/yc...,The fossa is Madagascar's top predator. It is ...,https://i.redd.it/8sgrkss4muv91.png
2,earwig,https://reddit.com/r/Awwducational/comments/yt...,Earwigs are devoted mothers. They stay with th...,https://i.redd.it/bktxnyajqjz91.jpg
3,resplendent quetzal,https://reddit.com/r/Awwducational/comments/yl...,The resplendent quetzal is a sacred symbol in ...,https://i.redd.it/lro4zsz3rux91.png
4,turtle,https://reddit.com/r/Awwducational/comments/yh...,Some turtles can swim backwards. This one just...,https://v.redd.it/y29pww9feyw91
...,...,...,...,...
242,Desert Rain Frog,https://reddit.com/r/Awwducational/comments/w6...,The Desert Rain Frog (Breviceps macrops) produ...,https://v.redd.it/dot5dx3oajd91
243,South American tapir,https://reddit.com/r/Awwducational/comments/wk...,The South American tapir is the largest native...,https://v.redd.it/udi2hv0x8sg91
244,Short-eared owl,https://reddit.com/r/Awwducational/comments/xs...,Short-eared owls are one of the most widely di...,https://i.redd.it/1hph7nbnz2r91.jpg
245,shiny cowbird,https://reddit.com/r/Awwducational/comments/xj...,The shiny cowbird is a year-round resident acr...,https://i.redd.it/42pbfc6xmzo91.jpg


In [None]:
df.to_csv('/content/drive/MyDrive/Colab Notebooks/Animal_Facts/animal_facts_awwducational_verified_flair.csv')
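
Because the animal names are typed by hand, the same animal ends up with different spellings in the session log above (for example `giant panda` and `Giant panda`). The following is a small sketch of a normalization pass that could run before saving. The helper names are hypothetical, and lowercasing everything is an assumption about the desired format.

```python
def normalize_animal_name(name):
    """Lowercase a hand-typed name and collapse stray whitespace."""
    return " ".join(name.split()).lower()


def normalize_entries(entries):
    """Return copies of the entries with 'animal_name' normalized; the inputs are left untouched."""
    return [{**e, "animal_name": normalize_animal_name(e["animal_name"])} for e in entries]
```

Running `entries = normalize_entries(entries)` before `pd.DataFrame.from_records(entries)` would make later grouping by `animal_name` treat both spellings as one animal.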

# Page 10: Animal Facts Encyclopedia
https://www.animalfactsencyclopedia.com/Animals-A-to-Z.html  
Each page has a header titled 'A Few More ___ Facts'.
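
As a rough illustration of how the animal name can be recovered from such a header, here is a regex-based sketch that works on the header text alone. The function name is hypothetical, and the pattern assumes the header always follows the 'A Few More ___ Facts' shape, which has not been checked on every page.

```python
import re

# Capture whatever sits between "a few more" and "facts", ignoring case.
HEADER_PATTERN = re.compile(r"a few more\s+(.+?)\s+facts", re.IGNORECASE)


def animal_name_from_header(header_text):
    """Return the lowercased animal name from a header, or None if the header does not match."""
    # Collapse newlines and repeated spaces so the pattern sees single spaces only.
    normalized = " ".join(header_text.split())
    match = HEADER_PATTERN.search(normalized)
    return match.group(1).lower() if match else None


print(animal_name_from_header("A Few More Mexican Raccoon Facts"))
```

Returning `None` for non-matching headers lets a caller skip unrelated `h2` elements without special-casing them.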

In [None]:
# Get all page links from this page
links_page = requests.get('https://www.animalfactsencyclopedia.com/Animals-A-to-Z.html')
soup = BeautifulSoup(links_page.content, 'html.parser')
urls = [el.get('href') for el in soup.find('div', class_='responsive_grid_block-6 responsive_grid_block-211592215').find_all('a')]

print(len(urls))
urls

86


['https://www.animalfactsencyclopedia.com/Aardvark-facts.html',
 'https://www.animalfactsencyclopedia.com/African-wild-dog-facts.html',
 'https://www.animalfactsencyclopedia.com/Andalusian-horse.html',
 'https://www.animalfactsencyclopedia.com/Anteater-facts.html',
 'https://www.animalfactsencyclopedia.com/Armadillo-facts.html',
 'https://www.animalfactsencyclopedia.com/Baboon-facts.html',
 'https://www.animalfactsencyclopedia.com/Baby-animals.html',
 'https://www.animalfactsencyclopedia.com/Badger-facts.html',
 'https://www.animalfactsencyclopedia.com/Bat-facts.html',
 'https://www.animalfactsencyclopedia.com/Beaver-facts.html',
 'https://www.animalfactsencyclopedia.com/Black-bear-facts.html',
 'https://www.animalfactsencyclopedia.com/Bonobo-facts.html',
 'https://www.animalfactsencyclopedia.com/Buffalo-facts.html',
 'https://www.animalfactsencyclopedia.com/Camel-facts.html',
 'https://www.animalfactsencyclopedia.com/Cape-buffalo-facts.html',
 'https://www.animalfactsencyclopedia.com/

In [None]:
# method for parsing a single page
def facts_from_animalfactsencyclopedia_page(url):
  """Return (animal_name, facts) for one page, or (None, []) if no facts header is found."""
  page = requests.get(url)
  soup = BeautifulSoup(page.content, 'html.parser')

  for h2 in soup.find_all('h2'):
    title = h2.text.lower()

    # find the header that says "a few more <animal_name> facts" and grab the list items
    if "a few more" in title and "facts" in title:
      facts_list = h2.find_next('ul')
      if facts_list is None:
        continue  # header without a following list; keep looking
      facts = [li.text for li in facts_list.find_all("li")]

      # the animal name is whatever remains of the header text
      animal_name = title.replace('a few more', '').replace('facts', '').strip()
      print(animal_name, facts)
      return animal_name, facts

  # always return a 2-tuple so callers can unpack the result safely
  print('FAILED:', url)
  return None, []


# example
facts_from_animalfactsencyclopedia_page('https://www.animalfactsencyclopedia.com/Mexican-raccoon.html')

mexican raccoon ['The Mexican raccoon is formally known as the coati', 'There are four species of coati', 'The Mexican raccoon is the species known as the white-nosed coati', 'Coatimundi, quati, tejon and hog-nosed coon are some other names', 'Tarantulas are a common part of the diet of a coati', 'Coatis are members of the raccoon, or procyonid family']


('mexican raccoon',
 ['The Mexican raccoon is formally known as the coati',
  'There are four species of coati',
  'The Mexican raccoon is the species known as the white-nosed coati',
  'Coatimundi, quati, tejon and hog-nosed coon are some other names',
  'Tarantulas are a common part of the diet of a coati',
  'Coatis are members of the raccoon, or procyonid family'])

In [None]:
entries = []
for u in urls:
  name, facts = facts_from_animalfactsencyclopedia_page(u)
  for f in facts:
    entries.append({'animal_name': name, 'source': u, 'text': f})
  time.sleep(1) # avoid rate limits

df = pd.DataFrame.from_records(entries)
df

aardvark ['\nAardvarks are sometimes called "ant bears", "earth pigs",\nand "cape anteaters"\xa0', 'Aardvarks\nhave rather primitive brains that are very small for the size of the\nanimal. Some have suggested they are not particularly bright....\xa0', 'Aardvarks\nteeth are lined with fine upright tubes and have no roots or enamel.\xa0', 'The aardvarks Latin family name "Tubulidentata" means "tube toothed"\xa0', 'Baby aardvarks are born with front teeth that fall out and\nnever grow back.\xa0', 'Aardvarks are living fossils not having changed for\nmillions of years.\xa0', 'Aardvarks will occasionally stand, and even take a step or\ntwo, on their hind legs\xa0', 'Aardvarks can\xa0 use their powerful tails as a\nwhip-like weapon of defense.\n']
african wild dog ['Wild dogs are known by many different names including painted dog, painted wolf, cape hunting dog, African hunting dog, singing dog and ornate wolf- wow!', 'They are the most efficient hunters of any large predator with an 80% su

Unnamed: 0,animal_name,source,text
0,aardvark,https://www.animalfactsencyclopedia.com/Aardva...,"\nAardvarks are sometimes called ""ant bears"", ..."
1,aardvark,https://www.animalfactsencyclopedia.com/Aardva...,Aardvarks\nhave rather primitive brains that a...
2,aardvark,https://www.animalfactsencyclopedia.com/Aardva...,Aardvarks\nteeth are lined with fine upright t...
3,aardvark,https://www.animalfactsencyclopedia.com/Aardva...,"The aardvarks Latin family name ""Tubulidentata..."
4,aardvark,https://www.animalfactsencyclopedia.com/Aardva...,Baby aardvarks are born with front teeth that ...
...,...,...,...
614,wombat,https://www.animalfactsencyclopedia.com/Wombat...,A wombats droppings are in the shape of a cube.
615,wombat,https://www.animalfactsencyclopedia.com/Wombat...,Wombats snore... Wombat Facts
616,zebra,https://www.animalfactsencyclopedia.com/Zebra-...,"Contrary to some claims, zebras, like all othe..."
617,zebra,https://www.animalfactsencyclopedia.com/Zebra-...,The zebras stripes make it more difficult for ...


In [None]:
df.to_csv('/content/drive/MyDrive/Colab Notebooks/Animal_Facts/animal_facts_animalfactsencyclopedia.csv')
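
The facts printed above still contain raw newlines and non-breaking spaces (`\n`, `\xa0`) from the page markup. A minimal cleanup sketch follows. The helper name `clean_fact` is hypothetical, and collapsing all whitespace to single spaces is an assumption about how the text should look in the final dataset.

```python
def clean_fact(text):
    """Collapse every run of whitespace (including \\n and \\xa0) into a single space."""
    # str.split() with no arguments splits on any Unicode whitespace and drops empty pieces.
    return " ".join(text.split())


raw = 'Aardvarks can\xa0 use their powerful tails as a\nwhip-like weapon of defense.\n'
print(clean_fact(raw))
```

Applying it while building the records, for example `'text': clean_fact(f)`, would keep the saved CSV free of embedded line breaks.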

# Page 11: All About Birds
https://www.allaboutbirds.org/guide/House_Sparrow/overview  
A site with structured fun facts, but only for birds. Someone who is not a bird enthusiast would probably not find these interesting.  
Use the 'Cool Facts' header.


In [None]:
# TODO
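
One possible starting point for this TODO is sketched below. It assumes the 'Cool Facts' section is a heading followed by a `<ul>` of facts, which has not been verified against the live site. To stay self-contained it uses the standard library's `html.parser` rather than BeautifulSoup, and the sample HTML is placeholder text, not real page content. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class CoolFactsParser(HTMLParser):
    """Collect <li> text from the first <ul> that follows a heading containing 'Cool Facts'."""

    HEADINGS = {"h1", "h2", "h3", "h4"}

    def __init__(self):
        super().__init__()
        self.in_heading = False   # currently inside a heading tag
        self.heading_text = []
        self.armed = False        # saw the target heading, waiting for its <ul>
        self.in_list = False      # inside the target <ul>
        self.in_item = False      # inside an <li> of the target <ul>
        self.done = False         # target list fully read
        self.current = []
        self.facts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag in self.HEADINGS:
            self.in_heading = True
            self.heading_text = []
        elif tag == "ul" and self.armed:
            self.in_list = True
            self.armed = False
        elif tag == "li" and self.in_list:
            self.in_item = True
            self.current = []

    def handle_endtag(self, tag):
        if self.done:
            return
        if tag in self.HEADINGS and self.in_heading:
            self.in_heading = False
            if "cool facts" in "".join(self.heading_text).lower():
                self.armed = True
        elif tag == "li" and self.in_item:
            self.in_item = False
            text = " ".join("".join(self.current).split())  # normalize whitespace
            if text:
                self.facts.append(text)
        elif tag == "ul" and self.in_list:
            self.in_list = False
            self.done = True

    def handle_data(self, data):
        if self.in_heading:
            self.heading_text.append(data)
        elif self.in_item:
            self.current.append(data)


def cool_facts_from_html(html):
    """Return the list of fact strings found under the 'Cool Facts' heading."""
    parser = CoolFactsParser()
    parser.feed(html)
    return parser.facts


# Placeholder markup standing in for a downloaded page.
SAMPLE_HTML = """
<h2>Overview</h2><ul><li>Not a fact.</li></ul>
<h2>Cool Facts</h2>
<ul>
  <li>First placeholder fact.</li>
  <li>Second&nbsp;placeholder
      fact.</li>
</ul>
"""
print(cool_facts_from_html(SAMPLE_HTML))
```

In the notebook itself, the same idea could be written with `soup.find_all` and `find_next('ul')` as in the Animal Facts Encyclopedia parser, once the real page structure has been inspected.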