# Fish Welfare Project
## Part 2: Supplemental data

* Author: Angelina Li
* Date: 2019/09/11
* Description: Now that we have a collection of information scraped from the FishEthoBase, it might be good to collect some supplemental information on each species, if possible. In particular, I'm interested in finding population number / catch number data on these species.

## Notebook tasks
1. Import the DB data. Potentially convert it into a dataframe for easier use.
2. Investigate and integrate the fish count databases

In [1]:
import json
import os
import pandas as pd
import random
import re
import requests
import time  # needed for time.sleep in get_soup below

# eventually I might want to migrate to scrapy, but BS is easier to use in Jupyter and
# makes more sense for a small, single-scrape project.
from bs4 import BeautifulSoup

In [2]:
MAIN_DIR = ".."
DATA_DIR = os.path.join(MAIN_DIR, "data")
ETHO_DIR = os.path.join(DATA_DIR, "fish_etho_db")
COUNT_DIR = os.path.join(DATA_DIR, "fish_count")

ETHO_INPUT_FP = os.path.join(ETHO_DIR, "fishdb.json")

S_PAUSE = 5 # how many seconds to pause in between requests
REQ_SUCCESS = 200 # success status code

In [3]:
# define some potentially useful helper function/s first

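def get_soup(url_address, pause_secs=S_PAUSE):
    """Fetch a page and return it as a BeautifulSoup object, or None if the request fails."""
    page = requests.get(url_address)
    if page.status_code != REQ_SUCCESS:
        print("Couldn't load content on this page:", url_address)
        return None
    soup = BeautifulSoup(page.content, "html.parser")
    # wait a randomized 0.5x-1.5x of pause_secs so successive requests are spaced out
    time.sleep(random.uniform(0.5, 1.5) * pause_secs)
    print("Loaded page:", url_address)
    return soup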

In [4]:
# grab the original dataset
with open(ETHO_INPUT_FP, "r") as datafile:
    db_data = json.load(datafile)

db_data[0]

{'link_summary': 'http://fishethobase.net/db/28/',
 'name_latin': 'Octopus vulgaris',
 'name_english': 'Common octopus',
 'sp_id': 'commonoctopus',
 'description': 'Octopus vulgaris has recently aroused much interest in aquaculture, considered suitable for large-scale production given its commercial value, its fecundity, rapid growth, high protein content, and high feed conversion rate. The main problem, however, is the high mortality rate observed during paralarval rearing, making successful juvenile settlement still very difficult to achieve. Unfortunately, despite the high knowledge on the biology and ethology of this species, there are many other aspects to be solved from a welfare perspective. For instance, the current farming systems result in high stress in O. vulgaris due to spatial constraint, high densities and sociability, which consequently increase aggression (cannibalism and autophagy) at different life stages. In addition, octopus skin is particularly sensitive and can b

In [5]:
# flatten data, and convert into a dataframe for readability.
def flatten_species_data(species_dict):
    flattened = species_dict.copy() # shallow copy, so the pop below leaves species_dict untouched
    scores = flattened.pop("etho_scores") # pull out the nested etho_scores dict so it can be flattened
    for crit in scores:
        for level in scores[crit]:
            flat_name = "{}_{}".format(crit, level[:2]) # take first two chars per level name
            flattened[flat_name] = scores[crit][level]
    return flattened

flatten_species_data(db_data[0])

{'link_summary': 'http://fishethobase.net/db/28/',
 'name_latin': 'Octopus vulgaris',
 'name_english': 'Common octopus',
 'sp_id': 'commonoctopus',
 'description': 'Octopus vulgaris has recently aroused much interest in aquaculture, considered suitable for large-scale production given its commercial value, its fecundity, rapid growth, high protein content, and high feed conversion rate. The main problem, however, is the high mortality rate observed during paralarval rearing, making successful juvenile settlement still very difficult to achieve. Unfortunately, despite the high knowledge on the biology and ethology of this species, there are many other aspects to be solved from a welfare perspective. For instance, the current farming systems result in high stress in O. vulgaris due to spatial constraint, high densities and sociability, which consequently increase aggression (cannibalism and autophagy) at different life stages. In addition, octopus skin is particularly sensitive and can b
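Since the real output above is truncated before the flattened score columns show up, here is a minimal sketch of the same flattening on a toy record. The record itself is made up, and the level names `likelihood`, `potential` and `certainty` are an assumption inferred from the `_li`, `_po` and `_ce` suffixes in the dataframe columns, not something confirmed from the JSON file.

```python
def flatten_species_data(species_dict):
    flattened = species_dict.copy()  # shallow copy, original dict keeps its keys
    scores = flattened.pop("etho_scores")
    for crit in scores:
        for level in scores[crit]:
            # e.g. ("stress", "likelihood") becomes "stress_li"
            flat_name = "{}_{}".format(crit, level[:2])
            flattened[flat_name] = scores[crit][level]
    return flattened

# hypothetical record, for illustration only
toy_species = {
    "sp_id": "toyfish",
    "etho_scores": {
        "stress": {"likelihood": "low", "potential": "middle", "certainty": "high"},
    },
}

flat = flatten_species_data(toy_species)
print(flat)
# {'sp_id': 'toyfish', 'stress_li': 'low', 'stress_po': 'middle', 'stress_ce': 'high'}
```

Note that two level names sharing the same first two characters would overwrite each other, so the two-character suffix only works as long as the level names stay distinct at that length.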

In [6]:
# let's save this for easier access later
db_flattened_data = list(map(flatten_species_data, db_data))

ETHO_OUTPUT_FP = os.path.join(ETHO_DIR, "fishdb_flattened.json")
with open(ETHO_OUTPUT_FP, "w") as outfile:
    json.dump(db_flattened_data, outfile)

In [7]:
db_df = pd.DataFrame(db_flattened_data)
db_df.head()

Unnamed: 0,aggregation_ce,aggregation_li,aggregation_po,aggression_ce,aggression_li,aggression_po,depth_range_ce,depth_range_li,depth_range_po,description,...,slaughter_ce,slaughter_li,slaughter_po,sp_id,stress_ce,stress_li,stress_po,substrate_ce,substrate_li,substrate_po
0,high,low,low,middle,low,low,high,low,middle,Octopus vulgaris has recently aroused much int...,...,middle,unclear,low,commonoctopus,middle,low,middle,middle,low,high
1,middle,low,middle,low,unclear,middle,middle,low,low,"Like other farmed shrimp species, the Pacific ...",...,middle,low,middle,pacificwhitelegshrimp,middle,low,middle,high,low,high
2,low,unclear,nofindings,low,unclear,nofindings,middle,low,low,Penaeus monodon is one of the most cultivated ...,...,middle,low,middle,gianttigerprawnblacktiger,high,low,middle,middle,low,middle
3,low,unclear,middle,low,unclear,middle,middle,low,low,"Acipenser baerii, an endangered species accord...",...,middle,low,high,siberiansturgeon,low,low,middle,middle,low,middle
4,low,unclear,nofindings,low,unclear,middle,high,low,low,Acipenser gueldenstaedtii is a critically enda...,...,low,unclear,high,russiansturgeon,low,low,middle,middle,low,high


In [8]:
db_df[["name_english", "sp_id", "link_summary", "fishethoscore_li", "fishethoscore_po", "fishethoscore_ce"]].head(10)

Unnamed: 0,name_english,sp_id,link_summary,fishethoscore_li,fishethoscore_po,fishethoscore_ce
0,Common octopus,commonoctopus,http://fishethobase.net/db/28/,0,1,3
1,Pacific whiteleg shrimp,pacificwhitelegshrimp,http://fishethobase.net/db/21/,0,2,3
2,Giant tiger prawn (Black tiger,gianttigerprawnblacktiger,http://fishethobase.net/db/34/,0,1,2
3,Siberian sturgeon,siberiansturgeon,http://fishethobase.net/db/2/,0,2,0
4,Russian sturgeon,russiansturgeon,http://fishethobase.net/db/3/,0,2,2
5,Adriatic sturgeon,adriaticsturgeon,http://fishethobase.net/db/4/,0,1,0
6,Sterlet sturgeon,sterletsturgeon,http://fishethobase.net/db/6/,0,1,0
7,Stellate sturgeon,stellatesturgeon,http://fishethobase.net/db/5/,0,0,0
8,White sturgeon,whitesturgeon,http://fishethobase.net/db/7/,0,1,2
9,Hybrid sturgeon,hybridsturgeon,http://fishethobase.net/db/53/,0,0,0


In [9]:
# Just for ease of research, let's get a list of names for our species
db_df[["name_english", "name_latin"]]

Unnamed: 0,name_english,name_latin
0,Common octopus,Octopus vulgaris
1,Pacific whiteleg shrimp,Litopenaeus vannamei
2,Giant tiger prawn (Black tiger,Penaeus monodon
3,Siberian sturgeon,Acipenser baerii
4,Russian sturgeon,Acipenser gueldenstaedtii
5,Adriatic sturgeon,Acipenser naccarii
6,Sterlet sturgeon,Acipenser ruthenus
7,Stellate sturgeon,Acipenser stellatus
8,White sturgeon,Acipenser transmontanus
9,Hybrid sturgeon,"BAEyNAC, NACxBAE"


**2. Explore the fish count databases**

In [10]:
CT_DECAPOD_FP = os.path.join(COUNT_DIR, "Farmed-decapods-2015.xlsx")
CT_FISH_FP = os.path.join(COUNT_DIR, "Farmed-fishes-2015.xlsx")
CT_WILD_FP = os.path.join(COUNT_DIR, "fishcount_estimated_wild_fish_2007-2016.xlsx")

In [11]:
# grab all of the data
deca_df = pd.read_excel(CT_DECAPOD_FP, sheet_name="Decapods", header=8)
print(len(deca_df))
deca_df.head()

1852


Unnamed: 0,Country,FAO Species Category,Scientific name,Decapod species?,Crustacean species?,Class,Order,Family,Multi-species?,Year,Production (t),EMW id,Estimated mean weight (lower),Estimated mean weight (upper),mean weight (lower),mean weight (upprr),Numbers (lower) millions,Numbers (upper) millions
0,Afghanistan,Cyprinids nei,Cyprinidae,N,N,Actinopterygii,CYPRINIFORMES,Cyprinidae,,2015.0,1000.0,,0.0,0.0,,,,
1,Afghanistan,Rainbow trout,Oncorhynchus mykiss,N,N,Actinopterygii,SALMONIFORMES,Salmonidae,,2015.0,150.0,,0.0,0.0,,,,
2,Albania,Bighead carp,Hypophthalmichthys nobilis,N,N,Actinopterygii,CYPRINIFORMES,Cyprinidae,,2015.0,16.0,,0.0,0.0,,,,
3,Albania,Common carp,Cyprinus carpio,N,N,Actinopterygii,CYPRINIFORMES,Cyprinidae,,2015.0,26.8,,0.0,0.0,,,,
4,Albania,Crucian carp,Carassius carassius,N,N,Actinopterygii,CYPRINIFORMES,Cyprinidae,,2015.0,12.0,,0.0,0.0,,,,


In [12]:
fish_df = pd.read_excel(CT_FISH_FP, sheet_name="Fish species", header=6)
print(len(fish_df))
fish_df.head()

1853


Unnamed: 0,Country,FAO Species Category,Scientific name,Fish species?,Class,Order,Family,Multi-species?,Year,Production (t),EMW id,Estimated mean weight (lower),Estimated mean weight (upper),mean weight (lower),mean weight (upper),Numbers (lower) millions,Numbers (upper) millions
0,Afghanistan,Rainbow trout,Oncorhynchus mykiss,Y,Actinopterygii,SALMONIFORMES,Salmonidae,N,2015.0,150.0,155.0,210.0,5000.0,210.0,5000.0,0.03,0.714286
1,Afghanistan,Cyprinids nei,Cyprinidae,Y,Actinopterygii,CYPRINIFORMES,Cyprinidae,,2015.0,1000.0,,0.0,0.0,322.064283,1081.212063,0.924888,3.10497
2,Albania,Bighead carp,Hypophthalmichthys nobilis,Y,Actinopterygii,CYPRINIFORMES,Cyprinidae,N,2015.0,16.0,29.0,500.0,1500.0,500.0,1500.0,0.010667,0.032
3,Albania,Common carp,Cyprinus carpio,Y,Actinopterygii,CYPRINIFORMES,Cyprinidae,N,2015.0,26.8,57.0,500.0,2500.0,500.0,2500.0,0.01072,0.0536
4,Albania,Crucian carp,Carassius carassius,Y,Actinopterygii,CYPRINIFORMES,Cyprinidae,N,2015.0,12.0,62.0,150.0,400.0,150.0,400.0,0.03,0.08


In [13]:
wild_df = pd.read_excel(CT_WILD_FP, sheet_name="Sheet1", header=17)
print(len(wild_df))
wild_df.head()

12045


Unnamed: 0,Country,FAO Species Category,Scientific name,Fish species?,Class,Multi-species?,Year,Production (t),EMW id,Estimated mean weight EMW (lower) g,Estimated mean weight EMW (upper) g,Global Generic estimated mean weight for class GEMW (lower) g,Global Generic estimated mean weight for class GEMW (upper) g,Mean weight used (lower) g,Mean weight used (upper) g,Estimated numbers (lower) millions,Estimated numbers (upper) millions
0,Afghanistan,Freshwater fishes nei,,Y,Includes species from > 1 class,,2007-2016,1000.0,,0.0,0.0,37.8921,96.5228,37.8921,96.5228,10.360251,26.390735
1,Albania,"Angelsharks, sand devils nei",Squatinidae,Y,Elasmobranchii (sharks and rays),Y,2007-2016,16.0,23.0,1683.72,19793.8,5950.39,10539.4,1683.72,19793.8,0.000808,0.009503
2,Albania,Atlantic bluefin tuna,Thunnus thynnus,Y,Actinopterygii (ray-finned fishes),N,2007-2016,18.0,51.0,262000.0,262000.0,37.7549,96.1746,262000.0,262000.0,6.9e-05,6.9e-05
3,Albania,Atlantic bonito,Sarda sarda,Y,Actinopterygii (ray-finned fishes),N,2007-2016,15.3,52.0,1818.0,5000.0,37.7549,96.1746,1818.0,5000.0,0.00306,0.008416
4,Albania,Barracudas nei,Sphyraena spp,Y,Actinopterygii (ray-finned fishes),Y,2007-2016,0.7,91.0,186.795,9072.0,37.7549,96.1746,186.795,9072.0,7.7e-05,0.003747


In [14]:
# checks - I think the scientific names are a good potential key to merge on. But are they unique
# across the data? I would imagine some things show up in both wild and another dataset.

def get_uniq_sci_names(df):
    return set(df["Scientific name"].tolist())
deca_sci_names = get_uniq_sci_names(deca_df)
fish_sci_names = get_uniq_sci_names(fish_df)
wild_sci_names = get_uniq_sci_names(wild_df)

In [15]:
# are there any overlaps between decapods and fish? That would be weird. wild and farmed fish counts
# can be dealt with separately, I guess.

len(deca_sci_names.intersection(fish_sci_names))

419
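One possible way to dig into that overlap: the first rows of `deca_df` all have `Decapod species?` set to `N`, so the decapod sheet may simply list non-decapod rows as well. That is a guess, not something verified here. The sketch below uses two tiny made-up dataframes (not the real spreadsheets) to show how restricting each set to rows flagged `Y` changes the intersection; `flagged_names` is a hypothetical helper.

```python
import pandas as pd

# toy stand-ins for deca_df and fish_df, for illustration only
toy_deca = pd.DataFrame({
    "Scientific name": ["Cyprinus carpio", "Penaeus monodon", None],
    "Decapod species?": ["N", "Y", "N"],
})
toy_fish = pd.DataFrame({
    "Scientific name": ["Cyprinus carpio", "Oncorhynchus mykiss"],
    "Fish species?": ["Y", "Y"],
})

def flagged_names(df, flag_col):
    """Unique scientific names among rows where flag_col is 'Y', ignoring missing names."""
    flagged = df.loc[df[flag_col] == "Y", "Scientific name"]
    return set(flagged.dropna())

# overlap on all rows vs. overlap on flagged rows only
raw_overlap = (set(toy_deca["Scientific name"].dropna())
               & set(toy_fish["Scientific name"].dropna()))
flagged_overlap = (flagged_names(toy_deca, "Decapod species?")
                   & flagged_names(toy_fish, "Fish species?"))

print(raw_overlap)      # {'Cyprinus carpio'}
print(flagged_overlap)  # set()
```

Dropping missing names first also keeps `NaN` out of the sets, which matters because some rows (for example the first row of the wild data) have no scientific name.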

In [16]:
# I'm going to try merging things on, and see what happens
# first, let's clear up the column names
def get_clean_name(varname, stub):
    # raw strings avoid invalid-escape warnings for \w and \s
    stripped = re.sub(r"[^\w\s]", "", varname)
    snake_cased = re.sub(r"\s", "_", stripped.lower())
    add_stub = stub.lower() + "_" + snake_cased
    return add_stub

def get_clean_columns(columns, stub):
    return list(map(lambda n: get_clean_name(n, stub), columns))
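To make the renaming concrete, here is a small self-contained sketch that repeats the two helpers (with raw-string patterns) and runs them on a few of the column headers seen in the decapod sheet above.

```python
import re

def get_clean_name(varname, stub):
    stripped = re.sub(r"[^\w\s]", "", varname)           # drop punctuation such as ( ) ? -
    snake_cased = re.sub(r"\s", "_", stripped.lower())   # each whitespace char becomes "_"
    return stub.lower() + "_" + snake_cased

def get_clean_columns(columns, stub):
    return [get_clean_name(name, stub) for name in columns]

sample_columns = ["Production (t)", "Multi-species?", "mean weight (upprr)"]
cleaned = get_clean_columns(sample_columns, "deca")
print(cleaned)
# ['deca_production_t', 'deca_multispecies', 'deca_mean_weight_upprr']
```

Two things to keep in mind: hyphens are removed rather than turned into underscores, so `Multi-species?` becomes `multispecies`, and typos in the source headers (like `upprr`) carry straight through into the cleaned names.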

In [17]:
deca_df.columns = get_clean_columns(deca_df.columns, "deca")
deca_df.head()

Unnamed: 0,deca_country,deca_fao_species_category,deca_scientific_name,deca_decapod_species,deca_crustacean_species,deca_class,deca_order,deca_family,deca_multispecies,deca_year,deca_production_t,deca_emw_id,deca_estimated_mean_weight_lower,deca_estimated_mean_weight_upper,deca_mean_weight_lower,deca_mean_weight_upprr,deca_numbers_lower_millions,deca_numbers_upper_millions
0,Afghanistan,Cyprinids nei,Cyprinidae,N,N,Actinopterygii,CYPRINIFORMES,Cyprinidae,,2015.0,1000.0,,0.0,0.0,,,,
1,Afghanistan,Rainbow trout,Oncorhynchus mykiss,N,N,Actinopterygii,SALMONIFORMES,Salmonidae,,2015.0,150.0,,0.0,0.0,,,,
2,Albania,Bighead carp,Hypophthalmichthys nobilis,N,N,Actinopterygii,CYPRINIFORMES,Cyprinidae,,2015.0,16.0,,0.0,0.0,,,,
3,Albania,Common carp,Cyprinus carpio,N,N,Actinopterygii,CYPRINIFORMES,Cyprinidae,,2015.0,26.8,,0.0,0.0,,,,
4,Albania,Crucian carp,Carassius carassius,N,N,Actinopterygii,CYPRINIFORMES,Cyprinidae,,2015.0,12.0,,0.0,0.0,,,,
