# Downloading a dataset from GBIF

https://www.inaturalist.org/

https://www.gbif.org/

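About iNaturalist, excerpted from Wikipedia:  
> iNaturalist, or iNat for short, is a citizen science project built on a social network. It runs a biological database website and mobile app of the same name, with the aim of letting people around the world discover, record, and share information about the species around them. Its users include experts and scholars, teachers, students, nature lovers, and wildlife enthusiasts. Through its community, iNat connects professional researchers with the general public, so that data casually recorded by amateur observers becomes an important reference for biodiversity research.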

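About GBIF.org, excerpted from Wikipedia:  
> The Global Biodiversity Information Facility (GBIF) is an international organisation founded in 2001 with funding from the governments of many countries. It is run by the GBIF Secretariat in Universitetsparken, Copenhagen, Denmark, and operates an online biological database of the same name, with the aim of giving everyone open and free access, anytime and anywhere, to data about all kinds of life on Earth. GBIF uses participating countries and partner organisations as nodes. The data each node collects is provided to data centres worldwide using common standards and open formats, so that biological records carrying a location and a time can be shared openly. These records come from many sources, ranging from museum specimen collections of the 18th and 19th centuries to geotagged photos published by amateur naturalists in recent years.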



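iNaturalist appears to sync its observation data to GBIF periodically. After registering an account on GBIF, we can download the datasets we need.

Below is a dataset I downloaded from GBIF myself. Its filters select butterfly records in Taiwan that come from iNaturalist.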

https://www.gbif.org/occurrence/download/0372200-210914110416597

![img06](img/img06.png)

## Downloading the dataset

First, download the butterfly dataset and unpack it.

In [1]:
import os
import requests
import shutil
import random
import pandas as pd
from glob import glob

random.seed(47)

In [2]:
dataset_url = 'https://api.gbif.org/v1/occurrence/download/request/0372200-210914110416597.zip'
dataset_name = os.path.basename(dataset_url)
dirname = os.path.splitext(dataset_name)[0]

In [3]:
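# Download the archive in chunks
with requests.get(dataset_url, stream=True) as res:
    res.raise_for_status()  # stop on an HTTP error instead of saving an error page
    with open(dataset_name, 'wb') as f:
        for chunk in res.iter_content(chunk_size=4096):
            f.write(chunk)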

In [4]:
# Unpack the archive
shutil.unpack_archive(dataset_name, dirname)

In [5]:
glob(dirname + '/*')

['0372200-210914110416597/verbatim.txt',
 '0372200-210914110416597/dataset',
 '0372200-210914110416597/citations.txt',
 '0372200-210914110416597/occurrence.txt',
 '0372200-210914110416597/rights.txt',
 '0372200-210914110416597/multimedia.txt',
 '0372200-210914110416597/meta.xml',
 '0372200-210914110416597/metadata.xml']

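After unpacking, the folder contains several txt and xml files. The most important ones are `occurrence.txt` and `multimedia.txt`.

Both txt files are in a csv-like format, but the fields are separated by tabs. Excel on Windows should be able to open them, but the files are large, so it may be very slow.    
`occurrence.txt` is the table of observation records. Each row is one person's observation and records who observed what, when, and where, along with the identified species.  
`multimedia.txt` holds the download links for all photos taken during the observations. Each row is one photo and can be matched to its observation record. One observation may have several photos, and all of them are listed in this table.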
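
Because the files are large, one option is to load only the columns that are needed. The following is a minimal sketch on a toy in-memory table, not the real `occurrence.txt`; the sample rows and the `wanted` list are assumptions made for illustration.

```python
import io
import pandas as pd

# Toy stand-in for occurrence.txt: tab-separated, with a header row.
sample = io.StringIO(
    "gbifID\tgenus\tspecificEpithet\ttaxonRank\tabstract\n"
    "1\tIdeopsis\tsimilis\tSPECIES\t\n"
    "2\tTelicota\tbambusae\tSUBSPECIES\t\n"
)

wanted = ['gbifID', 'genus', 'specificEpithet', 'taxonRank']
# usecols skips every column that is not listed;
# dtype pins the ID column to a 64-bit integer.
df = pd.read_csv(sample, sep='\t', usecols=wanted, dtype={'gbifID': 'int64'})
print(df.shape)
```

Loading fewer columns should lower memory use and parsing time, at the cost of having to decide up front which columns matter.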

In [6]:
occurrence = pd.read_csv(os.path.join(dirname, 'occurrence.txt'), sep='\t')
multimedia = pd.read_csv(os.path.join(dirname, 'multimedia.txt'), sep='\t')



In [7]:
display(occurrence.head())
display(multimedia.head())

Unnamed: 0,gbifID,abstract,accessRights,accrualMethod,accrualPeriodicity,accrualPolicy,alternative,audience,available,bibliographicCitation,...,relativeOrganismQuantity,level0Gid,level0Name,level1Gid,level1Name,level2Gid,level2Name,level3Gid,level3Name,iucnRedListCategory
0,3802823405,,,,,,,,,,...,,TWN,Chinese Taipei,TWN.6_1,Taipei,TWN.6.1_1,Taipei,,,NE
1,3802820558,,,,,,,,,,...,,TWN,Chinese Taipei,TWN.7_1,Taiwan,TWN.7.15_1,Yulin,,,
2,3802820005,,,,,,,,,,...,,TWN,Chinese Taipei,TWN.7_1,Taiwan,TWN.7.12_1,Taitung,,,NE
3,3802819694,,,,,,,,,,...,,TWN,Chinese Taipei,TWN.2_1,Kaohsiung,TWN.2.1_1,Kaohsiung,,,NE
4,3802819314,,,,,,,,,,...,,TWN,Chinese Taipei,TWN.4_1,Taichung,TWN.4.1_1,Taichung,,,


Unnamed: 0,gbifID,type,format,identifier,references,title,description,source,audience,created,creator,contributor,publisher,license,rightsHolder
0,3802823405,StillImage,image/jpeg,https://inaturalist-open-data.s3.amazonaws.com...,https://www.inaturalist.org/photos/200193088,,,,,2022-05-23T23:27:44Z,amygoog,,iNaturalist,http://creativecommons.org/licenses/by-nc/4.0/,amygoog
1,3802823405,StillImage,image/jpeg,https://inaturalist-open-data.s3.amazonaws.com...,https://www.inaturalist.org/photos/200193099,,,,,2022-05-23T23:27:44Z,amygoog,,iNaturalist,http://creativecommons.org/licenses/by-nc/4.0/,amygoog
2,3802823405,StillImage,image/jpeg,https://inaturalist-open-data.s3.amazonaws.com...,https://www.inaturalist.org/photos/200193072,,,,,2022-05-23T23:27:44Z,amygoog,,iNaturalist,http://creativecommons.org/licenses/by-nc/4.0/,amygoog
3,3802820558,StillImage,image/jpeg,https://inaturalist-open-data.s3.amazonaws.com...,https://www.inaturalist.org/photos/200402293,,,,,2022-05-24T23:07:00Z,Hong,,iNaturalist,http://creativecommons.org/licenses/by-nc/4.0/,Hong
4,3802820005,StillImage,image/jpeg,https://inaturalist-open-data.s3.amazonaws.com...,https://www.inaturalist.org/photos/200872946,,,,,2022-05-26T06:17:30Z,lijeanliao,,iNaturalist,http://creativecommons.org/licenses/by-nc/4.0/,lijeanliao


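The `multimedia` table has few columns, and their meaning is easy to guess from a quick look. The important ones are:

1. gbifID: the ID of the observation record; it matches the gbifID column in occurrence.
2. identifier: the download link of the photo.
3. references: the URL of the photo's page on iNaturalist.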
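The `occurrence` table has a very large number of columns. The more important ones are:

1. gbifID: effectively the ID of the observation record.
2. references: the URL of the observation record.
3. kingdom, phylum, class, order, family: the taxonomic ranks of the same names. Because the dataset is already filtered to butterflies, these columns are almost the same in every row.
4. genus, specificEpithet: the genus name and the species epithet.
5. taxonRank: the rank the identification reached. At worst a record may be identified only to family; most reach species, and some even reach subspecies. Here we keep only records identified to species or subspecies.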
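
Before filtering, it can help to see how many records sit at each rank. A minimal sketch on toy data follows; the rows are hypothetical, and the rank values `GENUS` and `FAMILY` are assumed to follow the same upper-case convention as the `SPECIES` and `SUBSPECIES` values seen in this dataset.

```python
import pandas as pd

# Hypothetical toy rows; the real values come from occurrence.txt.
toy = pd.DataFrame({
    'gbifID': [1, 2, 3, 4],
    'taxonRank': ['SPECIES', 'SUBSPECIES', 'GENUS', 'FAMILY'],
})

# Count how many records were identified to each rank.
rank_counts = toy['taxonRank'].value_counts()

# Keep only records identified to species or subspecies.
keep = toy[toy['taxonRank'].isin(['SPECIES', 'SUBSPECIES'])]
print(rank_counts.to_dict())
print(len(keep))
```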

We first pull the important columns out of `occurrence`, which makes the data easier to inspect.

In [8]:
occurs1 = occurrence[['gbifID', 'references', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'specificEpithet', 'taxonRank']]
occurs1

Unnamed: 0,gbifID,references,kingdom,phylum,class,order,family,genus,specificEpithet,taxonRank
0,3802823405,https://www.inaturalist.org/observations/11848...,Animalia,Arthropoda,Insecta,Lepidoptera,Nymphalidae,Ideopsis,similis,SPECIES
1,3802820558,https://www.inaturalist.org/observations/11860...,Animalia,Arthropoda,Insecta,Lepidoptera,Hesperiidae,Telicota,bambusae,SUBSPECIES
2,3802820005,https://www.inaturalist.org/observations/11886...,Animalia,Arthropoda,Insecta,Lepidoptera,Pieridae,Catopsilia,pomona,SPECIES
3,3802819694,https://www.inaturalist.org/observations/11862...,Animalia,Arthropoda,Insecta,Lepidoptera,Hesperiidae,Telicota,bambusae,SPECIES
4,3802819314,https://www.inaturalist.org/observations/11835...,Animalia,Arthropoda,Insecta,Lepidoptera,Nymphalidae,Polygonia,c-aureum,SUBSPECIES
...,...,...,...,...,...,...,...,...,...,...
24813,1144284567,https://www.inaturalist.org/observations/2112172,Animalia,Arthropoda,Insecta,Lepidoptera,Papilionidae,Papilio,bianor,SPECIES
24814,1135522961,https://www.inaturalist.org/observations/1140734,Animalia,Arthropoda,Insecta,Lepidoptera,Nymphalidae,Junonia,lemonias,SPECIES
24815,1098894587,https://www.inaturalist.org/observations/1112350,Animalia,Arthropoda,Insecta,Lepidoptera,Papilionidae,Graphium,sarpedon,SPECIES
24816,1088893997,https://www.inaturalist.org/observations/839401,Animalia,Arthropoda,Insecta,Lepidoptera,Papilionidae,Graphium,sarpedon,SUBSPECIES


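Next we filter on `taxonRank`, keeping the observation records whose identification is precise enough and dropping the rest. There were 24818 observation records originally, and 24700 remain after filtering.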

In [9]:
occurs2 = occurs1[occurs1['taxonRank'].isin(['SPECIES', 'SUBSPECIES'])]
occurs2

Unnamed: 0,gbifID,references,kingdom,phylum,class,order,family,genus,specificEpithet,taxonRank
0,3802823405,https://www.inaturalist.org/observations/11848...,Animalia,Arthropoda,Insecta,Lepidoptera,Nymphalidae,Ideopsis,similis,SPECIES
1,3802820558,https://www.inaturalist.org/observations/11860...,Animalia,Arthropoda,Insecta,Lepidoptera,Hesperiidae,Telicota,bambusae,SUBSPECIES
2,3802820005,https://www.inaturalist.org/observations/11886...,Animalia,Arthropoda,Insecta,Lepidoptera,Pieridae,Catopsilia,pomona,SPECIES
3,3802819694,https://www.inaturalist.org/observations/11862...,Animalia,Arthropoda,Insecta,Lepidoptera,Hesperiidae,Telicota,bambusae,SPECIES
4,3802819314,https://www.inaturalist.org/observations/11835...,Animalia,Arthropoda,Insecta,Lepidoptera,Nymphalidae,Polygonia,c-aureum,SUBSPECIES
...,...,...,...,...,...,...,...,...,...,...
24813,1144284567,https://www.inaturalist.org/observations/2112172,Animalia,Arthropoda,Insecta,Lepidoptera,Papilionidae,Papilio,bianor,SPECIES
24814,1135522961,https://www.inaturalist.org/observations/1140734,Animalia,Arthropoda,Insecta,Lepidoptera,Nymphalidae,Junonia,lemonias,SPECIES
24815,1098894587,https://www.inaturalist.org/observations/1112350,Animalia,Arthropoda,Insecta,Lepidoptera,Papilionidae,Graphium,sarpedon,SPECIES
24816,1088893997,https://www.inaturalist.org/observations/839401,Animalia,Arthropoda,Insecta,Lepidoptera,Papilionidae,Graphium,sarpedon,SUBSPECIES


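Next we do an inner join of the two tables on their gbifID column,
keeping only the columns we need.

With the joined table we can download the photos one by one and place each in the right class folder according to its genus and species epithet.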
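
The join is many-to-one: several photos can point at the same observation. A minimal sketch on toy tables (hypothetical values) shows the effect, and uses the `validate` argument of `DataFrame.merge` to check that gbifID is unique on the observation side.

```python
import pandas as pd

# Toy tables: two photos for record 1, one for record 2, and one photo
# for record 3, whose observation is absent (for example, filtered out).
photos = pd.DataFrame({
    'gbifID': [1, 1, 2, 3],
    'identifier': ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg'],
})
records = pd.DataFrame({
    'gbifID': [1, 2],
    'genus': ['Ideopsis', 'Telicota'],
    'specificEpithet': ['similis', 'bambusae'],
})

# how='inner' keeps only gbifIDs present in both tables.
# validate raises MergeError if gbifID repeats on the records side.
joined = photos.merge(records, on='gbifID', how='inner', validate='many_to_one')
print(joined)
```

The photo for record 3 drops out because it has no matching observation, which is how photos of filtered-out records disappear from the result.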

In [10]:
medias = multimedia[['gbifID', 'identifier']].merge(occurs2[['gbifID', 'genus', 'specificEpithet']], left_on='gbifID', right_on='gbifID')
medias

Unnamed: 0,gbifID,identifier,genus,specificEpithet
0,3802823405,https://inaturalist-open-data.s3.amazonaws.com...,Ideopsis,similis
1,3802823405,https://inaturalist-open-data.s3.amazonaws.com...,Ideopsis,similis
2,3802823405,https://inaturalist-open-data.s3.amazonaws.com...,Ideopsis,similis
3,3802820558,https://inaturalist-open-data.s3.amazonaws.com...,Telicota,bambusae
4,3802820005,https://inaturalist-open-data.s3.amazonaws.com...,Catopsilia,pomona
...,...,...,...,...
36746,1135522961,https://inaturalist-open-data.s3.amazonaws.com...,Junonia,lemonias
36747,1098894587,https://inaturalist-open-data.s3.amazonaws.com...,Graphium,sarpedon
36748,1088893997,https://inaturalist-open-data.s3.amazonaws.com...,Graphium,sarpedon
36749,1088893997,https://inaturalist-open-data.s3.amazonaws.com...,Graphium,sarpedon


In [11]:
medias1 = medias[~medias.isna().any(axis='columns')]
medias1 = medias1.reset_index(drop=True)
medias1

Unnamed: 0,gbifID,identifier,genus,specificEpithet
0,3802823405,https://inaturalist-open-data.s3.amazonaws.com...,Ideopsis,similis
1,3802823405,https://inaturalist-open-data.s3.amazonaws.com...,Ideopsis,similis
2,3802823405,https://inaturalist-open-data.s3.amazonaws.com...,Ideopsis,similis
3,3802820558,https://inaturalist-open-data.s3.amazonaws.com...,Telicota,bambusae
4,3802820005,https://inaturalist-open-data.s3.amazonaws.com...,Catopsilia,pomona
...,...,...,...,...
36743,1135522961,https://inaturalist-open-data.s3.amazonaws.com...,Junonia,lemonias
36744,1098894587,https://inaturalist-open-data.s3.amazonaws.com...,Graphium,sarpedon
36745,1088893997,https://inaturalist-open-data.s3.amazonaws.com...,Graphium,sarpedon
36746,1088893997,https://inaturalist-open-data.s3.amazonaws.com...,Graphium,sarpedon


We loop over this table with a for loop and download the photos one at a time.

In [12]:
def download_file(url: str, path: str):
    '''Download the file at the given URL and save it to the given path.'''
    response = requests.get(url)
    if response.status_code == 200:
        with open(path, 'wb') as f:
            f.write(response.content)
    else:
        print('Download', url, 'failed:', response.status_code, response.reason)
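
Since the full download takes a long time, it may be worth making it resumable. The sketch below is a variation on `download_file`, not a replacement prescribed by this notebook: the name `download_if_missing`, the 30-second timeout, and the boolean return value are all assumptions made for illustration.

```python
import os
import requests

def download_if_missing(url: str, path: str, timeout: float = 30.0) -> bool:
    """Download url to path unless path already exists.

    Returns True if a file was written, False if it was skipped or failed.
    """
    if os.path.exists(path):
        return False  # already downloaded in an earlier run
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn HTTP error codes into exceptions
    except requests.RequestException as exc:
        print('Download', url, 'failed:', exc)
        return False
    with open(path, 'wb') as f:
        f.write(response.content)
    return True
```

With this variant, an interrupted run can be restarted and will only fetch the photos that are still missing.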


In [13]:
count = 5
for row in medias1.itertuples():
    print(row)
    count -= 1
    if count <= 0:
        break

Pandas(Index=0, gbifID=3802823405, identifier='https://inaturalist-open-data.s3.amazonaws.com/photos/200193088/original.jpg', genus='Ideopsis', specificEpithet='similis')
Pandas(Index=1, gbifID=3802823405, identifier='https://inaturalist-open-data.s3.amazonaws.com/photos/200193099/original.jpg', genus='Ideopsis', specificEpithet='similis')
Pandas(Index=2, gbifID=3802823405, identifier='https://inaturalist-open-data.s3.amazonaws.com/photos/200193072/original.jpg', genus='Ideopsis', specificEpithet='similis')
Pandas(Index=3, gbifID=3802820558, identifier='https://inaturalist-open-data.s3.amazonaws.com/photos/200402293/original.jpeg', genus='Telicota', specificEpithet='bambusae')
Pandas(Index=4, gbifID=3802820005, identifier='https://inaturalist-open-data.s3.amazonaws.com/photos/200872946/original.jpg', genus='Catopsilia', specificEpithet='pomona')


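The actual download takes a while. Once it finishes, the folder layout is the same as last time.

The root directory contains many folders, each named after a butterfly's scientific name, and each folder holds all the photos of that species. This layout is well suited to training a classification model.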

In [14]:
# root_dir = 'gbif_dataset'
# for row in medias1.itertuples():
#     directory = os.path.join(root_dir,
#                              '{}_{}'.format(row.genus, row.specificEpithet))
#     os.makedirs(directory, exist_ok=True)
#     path = os.path.join(directory,
#                         'img{:05d}_{}'.format(row.Index, os.path.basename(row.identifier)))
#     print('Write', path)
#     download_file(row.identifier, path)

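## Building the index file

Next we build an index file, and at the same time mark each entry as belonging to the train, test, or valid set.

This example splits the data into 90% train, 5% test, and 5% valid. The ratios are up to you and do not have to match these.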


In [15]:
all_length = len(medias1)
test_length = int(all_length * 0.05)
valid_length = int(all_length * 0.05)
train_length = all_length - test_length - valid_length

print('Train:', train_length)
print('Test:', test_length)
print('Valid:', valid_length)

dataset = ['train'] * train_length + ['test'] * test_length + ['valid'] * valid_length
random.shuffle(dataset)


Train: 33074
Test: 1837
Valid: 1837
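
The same split logic can be wrapped in a small function so the ratios are easy to change. This is a minimal sketch; the function name `assign_splits` and its parameters are hypothetical, and it uses a local `random.Random` instance rather than the global seed set at the top of the notebook.

```python
import random

def assign_splits(n, test_frac=0.05, valid_frac=0.05, seed=47):
    """Return a shuffled list of n split names ('train', 'test', 'valid')."""
    n_test = int(n * test_frac)
    n_valid = int(n * valid_frac)
    n_train = n - n_test - n_valid  # the remainder goes to train
    splits = ['train'] * n_train + ['test'] * n_test + ['valid'] * n_valid
    # A local Random instance keeps the shuffle reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(splits)
    return splits

splits = assign_splits(1000)
print(splits.count('train'), splits.count('test'), splits.count('valid'))
# prints: 900 50 50
```

Note that this split is purely random, so a species with very few photos may end up with none in the test or valid set.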


Next we pair the shuffled split labels with the rows of `medias1` and assemble the contents of the index file.

In [16]:
root_dir = 'gbif_dataset'

dataset_index = []

for row, s in zip(medias1.itertuples(), dataset):
    label = '{}_{}'.format(row.genus, row.specificEpithet)
    directory = os.path.join(root_dir, label)
    path = os.path.join(directory,
                        'img{:05d}_{}'.format(row.Index, os.path.basename(row.identifier)))
    dataset_index.append([path, label, s])

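With the contents collected in a list, we turn it into a pandas DataFrame and then sort it. The sorting is only there to make it easier for humans to read.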

In [17]:
dataset_frame = pd.DataFrame(dataset_index, columns=['filepaths', 'labels', 'data set'])
dataset_frame = dataset_frame.sort_values(by=['data set', 'labels', 'filepaths'], axis='index')
dataset_frame

Unnamed: 0,filepaths,labels,data set
18587,gbif_dataset/Abraximorpha_davidii/img18587_ori...,Abraximorpha_davidii,test
13656,gbif_dataset/Abrota_ganga/img13656_original.jpg,Abrota_ganga,test
3115,gbif_dataset/Acraea_issoria/img03115_original....,Acraea_issoria,test
3179,gbif_dataset/Acraea_issoria/img03179_original.jpg,Acraea_issoria,test
3657,gbif_dataset/Acraea_issoria/img03657_original....,Acraea_issoria,test
...,...,...,...
15079,gbif_dataset/Zizula_hylax/img15079_original.jpg,Zizula_hylax,valid
22757,gbif_dataset/Zizula_hylax/img22757_original.jpg,Zizula_hylax,valid
33800,gbif_dataset/Zizula_hylax/img33800_original.jpg,Zizula_hylax,valid
34853,gbif_dataset/Zizula_hylax/img34853_original.jpg,Zizula_hylax,valid


Finally, export the index as a csv file.

In [18]:
dataset_frame.to_csv('butterflies.csv', index=False)
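
To use the index later, it can be read back and filtered by split. The sketch below works on a toy in-memory CSV with the same three columns; the third row is a hypothetical entry added only so that every split appears.

```python
import io
import pandas as pd

# Toy stand-in for butterflies.csv with the same three columns.
csv_text = (
    "filepaths,labels,data set\n"
    "gbif_dataset/Acraea_issoria/img03179_original.jpg,Acraea_issoria,test\n"
    "gbif_dataset/Zizula_hylax/img15079_original.jpg,Zizula_hylax,valid\n"
    # Hypothetical row, not taken from the real index:
    "gbif_dataset/Zizula_hylax/img00001_original.jpg,Zizula_hylax,train\n"
)
index = pd.read_csv(io.StringIO(csv_text))

# The column name contains a space, so use bracket indexing
# rather than attribute access.
train = index[index['data set'] == 'train']
print(len(train), index['data set'].value_counts().to_dict())
```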

## Building the class index

Next we build the class index file. It only needs to record which classes exist and their order, so a simple txt file is enough.

In [19]:
classes = sorted(set(dataset_frame['labels']))

In [20]:
classes[:5]

['Abisara_burnii',
 'Abraximorpha_davidii',
 'Abrota_ganga',
 'Acraea_issoria',
 'Acytolepis_puspa']

In [21]:
with open('classes.txt', 'w') as f:
    for cls in classes:
        f.write(cls)
        f.write('\n')
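
When the class list is needed again, for example to turn labels into integer indices for training, it can be read back from the file. This is a minimal sketch; the helper name `load_classes` is hypothetical, and the demo writes a temporary file instead of touching the real `classes.txt`.

```python
import os
import tempfile

def load_classes(path):
    """Read one class name per line and map each name to its line index."""
    with open(path, encoding='utf-8') as f:
        names = [line.strip() for line in f if line.strip()]
    return {name: i for i, name in enumerate(names)}

# Demo with a temporary file holding the first three classes shown above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'classes.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('Abisara_burnii\nAbraximorpha_davidii\nAbrota_ganga\n')
    class_to_index = load_classes(demo_path)

print(class_to_index['Abrota_ganga'])
# prints: 2
```

Because the file was written in sorted order, the index of each class stays stable as long as the set of classes does not change.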