# Fish Welfare Project
## Part 1: Scraping the DB

* Author: Angelina Li
* Date: 2019/09/10
* Description: This project is an attempt to collect the information presented on [FishEthoBase](http://fishethobase.net/db/) and make that data more navigable for interested users.

## Notebook tasks
1. Grab a list of all the fish catalogued on the FishEthoBase.
2. For each fish, check whether it has a short profile.
3. Grab the short profile table for each fish.
4. Grab the picture and summary information per species.

Ideal variables to collect per species:
* English name
* Latin name
* Summary link
* Short profile link
* Summary description
* Image link
* Likelihood / potential / certainty rating for each criterion (home range, depth range, migration, reproduction, etc.)
* FishEthoScore per section
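
One way to hold these variables in code is a small record type per species. The sketch below is illustrative only: the class name `SpeciesRecord` and its field names are assumptions rather than anything this notebook defines, and the nested `etho_scores` layout is just one convenient shape.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, Optional


@dataclass
class SpeciesRecord:
    """Hypothetical container for the per-species variables listed above."""
    english_name: str
    latin_name: str
    summary_link: str
    short_profile_link: Optional[str] = None
    summary_description: Optional[str] = None
    image_link: Optional[str] = None
    # criterion name -> {"likelihood": ..., "potential": ..., "certainty": ...}
    etho_scores: Dict[str, Dict[str, str]] = field(default_factory=dict)


record = SpeciesRecord(
    english_name="Common octopus",
    latin_name="Octopus vulgaris",
    summary_link="http://fishethobase.net/db/28/",
)
record_dict = asdict(record)  # plain dict, ready for json.dump
```

Optional fields default to `None`, so a species without a short profile or picture can still be represented.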

In [1]:
import os
import re
import requests
import sys
import time
import urllib.request # urlretrieve lives in the request submodule

from bs4 import BeautifulSoup

In [2]:
# These are relative paths - running the script from a different location may produce surprising results.
MAIN_DIR = ".."
DATA_DIR = os.path.join(MAIN_DIR, "data")

FISH_BASE_ADDR = "http://fishethobase.net"
DB_ADDR = FISH_BASE_ADDR + "/db"
S_PAUSE = 2 # how many seconds to pause in between requests

REQ_SUCCESS = 200 # success status code

In [26]:
def get_soup(url_address, pause_secs=S_PAUSE):
    """Fetch a page and return its parsed soup, or None if the request fails."""
    page = requests.get(url_address)
    if page.status_code != REQ_SUCCESS:
        print("Couldn't load content on this page:", url_address)
        return None
    soup = BeautifulSoup(page.content, "html.parser")
    time.sleep(pause_secs)
    print("Loaded page:", url_address)
    return soup
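
`get_soup` pauses only after a successful request; a failed request returns before the sleep. A variant that spaces out every call, successful or not, could look like the sketch below. `make_throttled` and its parameters are hypothetical names, and the fetch function, clock, and sleep are injected so that the example runs without network access or real waiting.

```python
import time


def make_throttled(fetch, min_interval, clock=time.monotonic, sleep=time.sleep):
    """Wrap `fetch` so consecutive calls start at least `min_interval` seconds apart."""
    last_call = [None]  # mutable cell holding the start time of the previous call

    def throttled(url):
        now = clock()
        if last_call[0] is not None:
            wait = min_interval - (now - last_call[0])
            if wait > 0:
                sleep(wait)
        last_call[0] = clock()
        return fetch(url)

    return throttled


# Demonstration with a fake fetcher and a fake clock (no real waiting).
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


polite_fetch = make_throttled(
    fetch=lambda url: "<html>" + url + "</html>",
    min_interval=2,
    clock=lambda: fake_now[0],
    sleep=fake_sleep,
)
first = polite_fetch("http://fishethobase.net/db/28/")
second = polite_fetch("http://fishethobase.net/db/21/")
```

In real use, `fetch` would be something like `requests.get` and the default clock and sleep would apply.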

In [4]:
db_soup = get_soup(DB_ADDR)
print(db_soup.prettify()[:400])


Loaded page: http://fishethobase.net/db
<!DOCTYPE html>
<!--[if lt IE 7]>      <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js">
 <!--<![endif]-->
 <html>
  <head>
   <meta charset="utf-8"/>
   <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/


**1. Grab a list of all the fish catalogued on the FishEthoBase.**

In [5]:
species_tree = db_soup.find(class_="speciestree-sub")
print(species_tree.prettify()[:400])

<ul class="speciestree-sub">
 <li>
  <a class="speciestree-order" href="#" onclick="return false;">
   Cephalopoda (Cephalopoda)
  </a>
  <ul>
   <li>
    <a class="speciestree-intern speciestree-short" href="/db/28/" style="" target="_self" title="Short profile">
     <i>
      Octopus vulgaris
     </i>
     (Common octopus)
    </a>
   </li>
  </ul>
 </li>
 <li>
  <a class="speciestree-order" h


In [6]:
# seems like each fish is categorized as a 'speciestree-intern'. Let's take a look at these.
species_soups = species_tree.find_all(attrs={"class": "speciestree-intern"})
print("Num species:", len(species_soups), "\n")
print("HTML Tree:\n" + species_soups[0].prettify())

Num species: 44 

HTML Tree:
<a class="speciestree-intern speciestree-short" href="/db/28/" style="" target="_self" title="Short profile">
 <i>
  Octopus vulgaris
 </i>
 (Common octopus)
</a>


In [7]:
# Raw strings keep the regex escapes (\( and \)) from being read as string escapes.
summary_link_pattern = re.compile(r"/db/[0-9]+/?")
name_pattern = re.compile(r"[A-Za-z, ]+\([A-Za-z ()]+\)")

def get_species_dict(species_soup):
    raw_addr = species_soup.get("href")
    if not summary_link_pattern.match(raw_addr):
        print("Unexpected raw address:", raw_addr)
        return
    summary_addr = FISH_BASE_ADDR + raw_addr
    
    name_text = species_soup.get_text()
    if not name_pattern.match(name_text):
        print("Unexpected name format:", name_text)
        return
    latin_name = name_text.split("(")[0].strip()
    english_name = name_text.split("(", 1)[1][:-1].strip()
    
    return dict(
        summary_addr=summary_addr,
        latin_name=latin_name,
        english_name=english_name,
        key=re.sub(r"\W", "", english_name).lower()
    )

get_species_dict(species_soups[0])

{'summary_addr': 'http://fishethobase.net/db/28/',
 'latin_name': 'Octopus vulgaris',
 'english_name': 'Common octopus',
 'key': 'commonoctopus'}
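
The name text has the form `Latin name (English name)`, and the English name can itself contain parentheses, as in `Giant tiger prawn (Black tiger)`. A minimal sketch of the split as a single regex follows; `split_names` and `NAME_RE` are hypothetical helpers and not part of this notebook.

```python
import re

# Latin name, then one outer pair of parentheses around the English name.
NAME_RE = re.compile(r"^\s*([^()]+?)\s*\((.*)\)\s*$", re.DOTALL)


def split_names(name_text):
    """Return (latin_name, english_name), or None if the text doesn't fit the pattern."""
    match = NAME_RE.match(name_text)
    if match is None:
        return None
    latin_name, english_name = match.groups()
    return latin_name.strip(), english_name.strip()


simple = split_names("Octopus vulgaris (Common octopus)")
nested = split_names("Penaeus monodon (Giant tiger prawn (Black tiger))")
```

Because the second group is greedy, only the outermost closing parenthesis is dropped and any inner pair stays intact.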

In [8]:
species = list(map(get_species_dict, species_soups))

species[1]

{'summary_addr': 'http://fishethobase.net/db/21/',
 'latin_name': 'Litopenaeus vannamei',
 'english_name': 'Pacific whiteleg shrimp',
 'key': 'pacificwhitelegshrimp'}

In [9]:
species

[{'summary_addr': 'http://fishethobase.net/db/28/',
  'latin_name': 'Octopus vulgaris',
  'english_name': 'Common octopus',
  'key': 'commonoctopus'},
 {'summary_addr': 'http://fishethobase.net/db/21/',
  'latin_name': 'Litopenaeus vannamei',
  'english_name': 'Pacific whiteleg shrimp',
  'key': 'pacificwhitelegshrimp'},
 {'summary_addr': 'http://fishethobase.net/db/34/',
  'latin_name': 'Penaeus monodon',
  'english_name': 'Giant tiger prawn (Black tiger)',
  'key': 'gianttigerprawnblacktiger'},
 {'summary_addr': 'http://fishethobase.net/db/2/',
  'latin_name': 'Acipenser baerii',
  'english_name': 'Siberian sturgeon',
  'key': 'siberiansturgeon'},
 {'summary_addr': 'http://fishethobase.net/db/3/',
  'latin_name': 'Acipenser gueldenstaedtii',
  'english_name': 'Russian sturgeon',
  'key': 'russiansturgeon'},
 {'summary_addr': 'http://fishethobase.net/db/4/',
  'latin_name': 'Acipenser naccarii',
  'english_name': 'Adriatic sturgeon',
  'key': 'adriaticsturgeon'},
 {'summary_addr': 'ht

In [10]:
print("# species:", len(species))
print("# unique keys:", len(set([dct["key"] for dct in species])) ) # checking keys

# species: 44
# unique keys: 44
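
Comparing the two counts shows whether a collision exists, but not which keys collide. A small sketch that lists the offending keys is given below; `find_duplicate_keys` is a hypothetical helper, and the duplicate in the example records is made up for illustration and does not occur in the scraped data.

```python
from collections import Counter


def find_duplicate_keys(records, key_field="key"):
    """Return the values of `key_field` that appear in more than one record."""
    counts = Counter(record[key_field] for record in records)
    return sorted(key for key, n in counts.items() if n > 1)


example_records = [
    {"key": "commonoctopus"},
    {"key": "siberiansturgeon"},
    {"key": "commonoctopus"},  # illustrative duplicate, not real data
]
duplicates = find_duplicate_keys(example_records)
```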


**4. Grab the picture and summary information per species.**

In [11]:
def get_species_dict_with_summary(species_dict):
    return_dict = species_dict.copy() # shallow copy is sufficient for these dicts; 
                                      # the prev dicts will also be discarded
    soup = get_soup(species_dict["summary_addr"])
    if soup is None: # page failed to load, so there is nothing more to add
        return return_dict

    picture = soup.find(id="species_picture")
    if picture:
        img = picture.find("img")
        img_addr = FISH_BASE_ADDR + "/" + img["src"].lstrip("/")
        download_filename = "{fn}.{ext}".format(fn=return_dict["key"], ext=img_addr.split(".")[-1]) # totally hacky
        download_addr = os.path.join(DATA_DIR, "images", download_filename)
        
        urllib.request.urlretrieve(img_addr, download_addr)
        
        return_dict["image_filename"] = download_filename
        print("Downloaded photo:", download_addr)
    
    summary = soup.find(class_="feb-content-box")
    if summary:
        all_text = summary.get_text()
        summary_text = all_text.strip("\n").split("\n", 1)[0].replace("\xa0", " ")
        return_dict["summary_text"] = summary_text
    
    else:
        print("**No summary found!**")

    return return_dict
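
Splitting the image address on `.` works for plain URLs but breaks when the address carries a query string or has no extension at all. A sketch of a more defensive alternative using only the standard library is shown below; `image_extension` and its `default` value are assumptions for illustration, and the example URLs are made up.

```python
import os
from urllib.parse import urlparse


def image_extension(img_addr, default="jpg"):
    """Return the file extension of the URL's path, without the dot."""
    path = urlparse(img_addr).path   # drops any ?query or #fragment
    ext = os.path.splitext(path)[1]  # e.g. ".jpg", or "" if there is none
    return ext.lstrip(".").lower() or default


# Hypothetical URLs, chosen only to exercise the three cases.
plain = image_extension("http://fishethobase.net/media/fish.JPG")
with_query = image_extension("http://fishethobase.net/media/fish.png?v=3")
missing = image_extension("http://fishethobase.net/media/fish")
```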

In [13]:
# shortened testing snippet to reduce runtime
species_with_summaries_shortened = list(map(get_species_dict_with_summary, species[:4]))


Loaded page: http://fishethobase.net/db/28/

Loaded page: http://fishethobase.net/db/21/

Loaded page: http://fishethobase.net/db/34/
Downloaded photo: ..\data\images\gianttigerprawnblacktiger.jpg

Loaded page: http://fishethobase.net/db/2/
Downloaded photo: ..\data\images\siberiansturgeon.jpg


In [14]:
species_with_summaries_shortened

[{'summary_addr': 'http://fishethobase.net/db/28/',
  'latin_name': 'Octopus vulgaris',
  'english_name': 'Common octopus',
  'key': 'commonoctopus',
  'summary_text': 'Octopus vulgaris has recently aroused much interest in aquaculture, considered suitable for large-scale production given its commercial value, its fecundity, rapid growth, high protein content, and high feed conversion rate. The main problem, however, is the high mortality rate observed during paralarval rearing, making successful juvenile settlement still very difficult to achieve. Unfortunately, despite the high knowledge on the biology and ethology of this species, there are many other aspects to be solved from a welfare perspective. For instance, the current farming systems result in high stress in O. vulgaris due to spatial constraint, high densities and sociability, which consequently increase aggression (cannibalism and autophagy) at different life stages. In addition, octopus skin is particularly sensitive and c

In [15]:
# some of the summaries appear too short, or incorrect. Let's try to locate those summaries.

odd_summaries = [s for s in species_with_summaries_shortened if "summary_text" not in s or len(s["summary_text"]) < 100]
odd_summaries

[{'summary_addr': 'http://fishethobase.net/db/21/',
  'latin_name': 'Litopenaeus vannamei',
  'english_name': 'Pacific whiteleg shrimp',
  'key': 'pacificwhitelegshrimp',
  'summary_text': 'Habitat and development'}]

In [16]:
# For most of the above, it looks like I've pulled the wrong summary (except for the one that has no short profile).
# To be safe, let's pull the 'general remarks' tab for each species too, and compare the two.

**2. For each fish, check whether it has a short profile. 3. Grab the short profile table for each fish. (5. Grab the general remarks for each fish.)**

In [17]:
def get_species_with_short_profiles(species_dict):
    return_dict = species_dict.copy() # shallow copy is sufficient for these dicts; 
                                      # the prev dicts will also be discarded
    sp_address = species_dict["summary_addr"] + "shortprofile/"
    soup = get_soup(sp_address)
    if not soup:
        return
    return_dict["short_profile_addr"] = sp_address
    
    sp_table = soup.find("table", attrs={"class": "shortprofile"})
    if sp_table:
        sp_table_rows = sp_table.find_all("tr")

        all_data = {}
        headings = ["likelihood", "potential", "certainty"] # should be a way of validating this
        for row in sp_table_rows[1:-1]:            
            columns = row.find_all("td")[1:]
            criteria = columns[0].get_text()
            values = [ col["class"][0] for col in columns[1:] ]
            row_data = dict(zip(headings, values))
            all_data[criteria] = row_data
        
        score_columns = sp_table_rows[-1].find_all("td")
        score_criteria = score_columns[0].get_text()
        total_scores = map(float, [ col.get_text() for col in score_columns[1:] ])
        all_data[score_criteria] = dict(zip(headings, total_scores))

        return_dict["etho_scores"] = all_data

    return return_dict
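
The nested `etho_scores` dict maps each criterion to its three ratings. For tabular analysis it may be handy to flatten it into one row per criterion. The sketch below works under assumed inputs: `flatten_scores` is a hypothetical helper, and the criterion names and rating values are placeholders rather than values scraped from the site.

```python
def flatten_scores(species_key, etho_scores):
    """Turn {criterion: {heading: value}} into a list of flat row dicts."""
    rows = []
    for criterion, ratings in etho_scores.items():
        row = {"key": species_key, "criterion": criterion}
        row.update(ratings)
        rows.append(row)
    return rows


# Placeholder data in the same shape that the function above builds.
example_scores = {
    "Home range": {"likelihood": "low", "potential": "medium", "certainty": "high"},
    "Depth range": {"likelihood": "low", "potential": "low", "certainty": "medium"},
}
flat_rows = flatten_scores("commonoctopus", example_scores)
```

Each resulting row could then be written out with `csv.DictWriter` or loaded into a data frame.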

In [18]:
get_species_with_short_profiles(species_with_summaries_shortened[0])


Loaded page: http://fishethobase.net/db/28/shortprofile/


{'summary_addr': 'http://fishethobase.net/db/28/',
 'latin_name': 'Octopus vulgaris',
 'english_name': 'Common octopus',
 'key': 'commonoctopus',
 'summary_text': 'Octopus vulgaris has recently aroused much interest in aquaculture, considered suitable for large-scale production given its commercial value, its fecundity, rapid growth, high protein content, and high feed conversion rate. The main problem, however, is the high mortality rate observed during paralarval rearing, making successful juvenile settlement still very difficult to achieve. Unfortunately, despite the high knowledge on the biology and ethology of this species, there are many other aspects to be solved from a welfare perspective. For instance, the current farming systems result in high stress in O. vulgaris due to spatial constraint, high densities and sociability, which consequently increase aggression (cannibalism and autophagy) at different life stages. In addition, octopus skin is particularly sensitive and can be

In [19]:
# It looks like it's much more difficult to grab the general remarks than I thought it would be;
# I'm starting to think that tab is rendered dynamically. Let's clean everything up and wrap up;
# if worst comes to worst, we can do a little manual data cleaning.

**Tying everything together**

In [31]:
# These are relative paths - running the script from a different location may produce surprising results.
MAIN_DIR = ".."
DATA_DIR = os.path.join(MAIN_DIR, "data")
IMG_DIR = os.path.join(DATA_DIR, "images")

FISH_BASE_ADDR = "http://fishethobase.net"
DB_ADDR = FISH_BASE_ADDR + "/db"
S_PAUSE = 2 # how many seconds to pause in between requests

REQ_SUCCESS = 200 # success status code

def get_species_data():
    soup = get_soup(DB_ADDR)
    species_soups = get_species_soups(soup)
    species_dicts = list(map(get_species_dict, species_soups))
    return species_dicts

def get_species_soups(db_soup):
    return db_soup.find(class_="speciestree-sub").find_all(attrs={"class": "speciestree-intern"})

def get_species_dict(species_soup):
    print("\n***** Collecting species dict *****")
    
    species_dict = dict()
    add_summary_link(species_soup, species_dict)
    add_names(species_soup, species_dict)
    
    if "link_summary" not in species_dict: return
    summary_soup = get_soup(species_dict["link_summary"])
    if not summary_soup: return
    
    add_picture(summary_soup, species_dict)
    add_description(summary_soup, species_dict)
    
    profile_address = species_dict["link_summary"].rstrip("/") + "/shortprofile/"
    profile_soup = get_soup(profile_address)
    if not profile_soup: return
    species_dict["link_profile"] = profile_address
    
    add_short_profile(profile_soup, species_dict)
    
    print("***** Finished with species dict *****")
    return species_dict

## Accumulator helper functions

def add_summary_link(species_soup, species_dict):
    pattern = re.compile(r"/db/[0-9]+/?")
    raw_address = species_soup.get("href")
    if not pattern.match(raw_address):
        print("=> (!!) Can't parse raw address:", raw_address)
        return
    link = FISH_BASE_ADDR + raw_address
    species_dict["link_summary"] = link
    print("=> Added summary link")

def add_names(species_soup, species_dict):
    pattern = re.compile(r"[A-Za-z, ]+\([A-Za-z() ]+\)")
    name_text = species_soup.get_text()
    if not pattern.match(name_text):
        print("=> (!!) Unexpected name format:", name_text)
        return
    # Split on the first "(" only, then drop the single closing ")" so that
    # nested parentheses in the English name survive intact.
    latin_part, english_part = name_text.split("(", 1)
    latin_name = latin_part.strip()
    english_name = english_part.strip()[:-1].strip()
    
    species_dict["name_latin"] = latin_name
    species_dict["name_english"] = english_name
    species_dict["sp_id"] = re.sub(r"\W", "", english_name).lower()
    print("=> Added names")

def add_picture(summary_soup, species_dict):
    picture = summary_soup.find(id="species_picture")
    if not picture:
        print("=> (!!) Can't find picture")
        return
    
    img = picture.find("img")
    img_link = FISH_BASE_ADDR + "/" + img["src"].lstrip("/")
    img_extension = img_link.split(".")[-1] # super hacky
    filename = species_dict["sp_id"] + "." + img_extension
    filepath = os.path.join(IMG_DIR, filename)
    
    os.makedirs(IMG_DIR, exist_ok=True) # make sure the target folder exists
    urllib.request.urlretrieve(img_link, filepath)
    
    species_dict["filename_image"] = filename
    print("=> Added picture")

def add_description(summary_soup, species_dict):
    summary_box = summary_soup.find(class_="feb-content-box")
    if not summary_box:
        print("=> (!!) Can't find summary")
        return
    
    description = "\n".join([p.get_text() for p in summary_box.find_all("p")])
    species_dict["description"] = description
    print("=> Added description")

def add_short_profile(profile_soup, species_dict):
    table = profile_soup.find("table", attrs={"class": "shortprofile"})
    if not table:
        print("=> (!!) Can't find profile table")
        return
    
    data = {}
    variables = ["likelihood", "potential", "certainty"]
    rows = table.find_all("tr")[1:]
    
    for row in rows[:-1]:
        criteria, data_dict = get_profile_row_data(
            row, variables,
            get_col_data=lambda col: col["class"][0],
            data_start_index=1)
        data[criteria] = data_dict
    
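    # The last row holds the overall scores; use it explicitly rather than
    # the loop variable left over from the loop above.
    score_criteria, score_dict = get_profile_row_data(
        rows[-1], variables,
        get_col_data=lambda col: col.get_text())
    data[score_criteria] = score_dict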
    
    species_dict["etho_scores"] = data
    print("=> Added profile table")

def get_profile_row_data(row, variables, get_col_data, data_start_index=0):
    columns = row.find_all("td")[data_start_index:]
    criteria = columns[0].get_text()
    values = map(get_col_data, columns[1:])
    return criteria, dict(zip(variables, values))

In [32]:
all_data = get_species_data()

Loaded page: http://fishethobase.net/db

***** Collecting species dict *****
=> Added summary link
=> Added names
Loaded page: http://fishethobase.net/db/28/
=> (!!) Can't find picture
=> Added description
Loaded page: http://fishethobase.net/db/28/shortprofile/
=> Added profile table
***** Finished with species dict *****

***** Collecting species dict *****
=> Added summary link
=> Added names
Loaded page: http://fishethobase.net/db/21/
=> (!!) Can't find picture
=> Added description
Loaded page: http://fishethobase.net/db/21/shortprofile/
=> Added profile table
***** Finished with species dict *****

***** Collecting species dict *****
=> Added summary link
=> Added names
Loaded page: http://fishethobase.net/db/34/
=> Added picture
=> Added description
Loaded page: http://fishethobase.net/db/34/shortprofile/
=> Added profile table
***** Finished with species dict *****

***** Collecting species dict *****
=> Added summary link
=> Added names
Loaded page: http://fishethobase.net/db/2

Loaded page: http://fishethobase.net/db/15/shortprofile/
=> Added profile table
***** Finished with species dict *****

***** Collecting species dict *****
=> Added summary link
=> Added names
Loaded page: http://fishethobase.net/db/32/
=> (!!) Can't find picture
=> Added description
Loaded page: http://fishethobase.net/db/32/shortprofile/
=> Added profile table
***** Finished with species dict *****

***** Collecting species dict *****
=> Added summary link
=> Added names
Loaded page: http://fishethobase.net/db/49/
=> Added picture
=> Added description
Loaded page: http://fishethobase.net/db/49/shortprofile/
=> Added profile table
***** Finished with species dict *****

***** Collecting species dict *****
=> Added summary link
=> Added names
Loaded page: http://fishethobase.net/db/18/
=> Added picture
=> Added description
Loaded page: http://fishethobase.net/db/18/shortprofile/
=> Added profile table
***** Finished with species dict *****

***** Collecting species dict *****
=> Added 

In [33]:
all_data[0]

{'link_summary': 'http://fishethobase.net/db/28/',
 'name_latin': 'Octopus vulgaris',
 'name_english': 'Common octopus',
 'sp_id': 'commonoctopus',
 'description': 'Octopus vulgaris has recently aroused much interest in aquaculture, considered suitable for large-scale production given its commercial value, its fecundity, rapid growth, high protein content, and high feed conversion rate. The main problem, however, is the high mortality rate observed during paralarval rearing, making successful juvenile settlement still very difficult to achieve. Unfortunately, despite the high knowledge on the biology and ethology of this species, there are many other aspects to be solved from a welfare perspective. For instance, the current farming systems result in high stress in O. vulgaris due to spatial constraint, high densities and sociability, which consequently increase aggression (cannibalism and autophagy) at different life stages. In addition, octopus skin is particularly sensitive and can b

In [34]:
import json

OUTPUT_FILENAME = "fishdb.json"
OUTPUT_FP = os.path.join(DATA_DIR, OUTPUT_FILENAME)
with open(OUTPUT_FP, "w") as outfile:
    json.dump(all_data, outfile)
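
`get_species_dict` returns `None` when a page fails to load, so `all_data` can contain `None` entries, which `json.dump` writes as `null`. The sketch below filters them out and reads the file back to confirm the round trip. It uses a temporary directory and placeholder records rather than the real output path and data.

```python
import json
import os
import tempfile

# Placeholder records standing in for `all_data`; None marks a failed scrape.
example_data = [
    {"sp_id": "commonoctopus", "name_latin": "Octopus vulgaris"},
    None,
    {"sp_id": "siberiansturgeon", "name_latin": "Acipenser baerii"},
]

clean_data = [record for record in example_data if record is not None]

with tempfile.TemporaryDirectory() as tmp_dir:
    output_fp = os.path.join(tmp_dir, "fishdb.json")
    with open(output_fp, "w", encoding="utf-8") as outfile:
        json.dump(clean_data, outfile, indent=2, ensure_ascii=False)
    with open(output_fp, encoding="utf-8") as infile:
        reloaded = json.load(infile)
```

Passing `indent=2` keeps the file readable in a text editor, at the cost of a slightly larger file.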