_This Notebook is shared under an [Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/) license._

_Alicia Urquidi Diaz, GitHub: [aliciuki](https://github.com/aliciuki), Twitter: [@aliciabedul](https://www.twitter.com/aliciabedul)_

-------


# WDS Membership Data

#### Crawl from www.worlddatasystem.org

This Python script grabs all current WDS member profiles (Regular, Associate, Network and Partner members) listed on the official WDS site.

### Outputs

- `wds-members.html` contains one `<div id="content-core">` element per member profile. Each element holds the institution's logo (if provided), the repository's text description, and an information table.
- `wds-mem-tab.html` contains a list of all the informational tables from member profiles, as HTML tables.

## 1. Import packages

This script uses the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), [requests](https://docs.python-requests.org/en/master/), and [re](https://docs.python.org/3/library/re.html) packages.

In [11]:
from bs4 import BeautifulSoup
import requests
import re

## 2. Get all links to member pages

First, we crawl the four membership listing pages (Regular Members, Associate Members, Network Members and Partner Members) and collect the links to all member profiles. Each listing page yields its own list of links, stored in `links`; these are then flattened into a single `link` list.

In [12]:
# Listing pages, one per membership category
membership = [
    "https://www.worlddatasystem.org/community/membership/regular-members",
    "https://www.worlddatasystem.org/community/membership/network-members",
    "https://www.worlddatasystem.org/community/membership/partner-members",
    "https://www.worlddatasystem.org/community/membership/associate-members",
]
members = []
links = []

for page_url in membership:
    r = requests.get(page_url)
    w = r.text
    soup = BeautifulSoup(w, 'html.parser')
    # Only the main content block holds the member links
    wds = soup.find("div", {"id": "content-core"})
    urls = wds.find_all('a')
    # Pull every membership URL out of the anchor tags' markup
    url = re.findall(r'(https://www\.worlddatasystem\.org/community/membership.+?)"', str(urls))
    links.append(url)
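
To see what the regular expression in the cell above captures, it can be run offline against a small sample. This is a minimal sketch: the `sample` string and its `fid` values are made up for illustration and are not taken from the WDS site.

```python
import re

# Hypothetical anchor tags, shaped like the printed form of a list of links.
sample = (
    '[<a href="https://www.worlddatasystem.org/community/membership/'
    'regular-members/@@member_view?fid=example-one">One</a>, '
    '<a href="https://www.worlddatasystem.org/about">About</a>, '
    '<a href="https://www.worlddatasystem.org/community/membership/'
    'regular-members/@@member_view?fid=example-two">Two</a>]'
)

# Same pattern as in the crawl loop: capture lazily up to the closing quote.
pattern = r'(https://www\.worlddatasystem\.org/community/membership.+?)"'
matches = re.findall(pattern, sample)

for m in matches:
    print(m)
```

Only the two links under `/community/membership` are captured; the `/about` link does not match the prefix and is ignored.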

In [13]:
link = []
for i in links:
    for x in i:
        link.append(x)
print(link)

['https://www.worlddatasystem.org/community/membership/regular-members/@@member_view?fid=incorporated-research-institutions-for-seismology-iris-data-management-system', 'https://www.worlddatasystem.org/community/membership/regular-members/@@member_view?fid=wdc-geoinformatics-and-sustainable-development', 'https://www.worlddatasystem.org/community/membership/regular-members/@@member_view?fid=isric-wdc-soils', 'https://www.worlddatasystem.org/community/membership/regular-members/@@member_view?fid=dkrz-wdc-climate', 'https://www.worlddatasystem.org/community/membership/regular-members/@@member_view?fid=wdc-meteorology-asheville', 'https://www.worlddatasystem.org/community/membership/regular-members/@@member_view?fid=centre-de-donnees-astronomiques-de-strasbourg-cds', 'https://www.worlddatasystem.org/community/membership/regular-members/@@member_view?fid=world-glacier-monitoring-service-zurich', 'https://www.worlddatasystem.org/community/membership/regular-members/@@member_view?fid=austral
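
The nested loop above flattens the per-page lists into one list. The same result can be sketched with `itertools.chain.from_iterable` from the standard library. The `dict.fromkeys` step additionally drops repeated URLs while keeping their order, which is only useful on the assumption that a profile can be linked more than once. The `nested` sample and its URLs are hypothetical.

```python
from itertools import chain

# Hypothetical stand-in for `links`: one list of URLs per listing page.
nested = [
    ["https://example.org/member/a", "https://example.org/member/b"],
    ["https://example.org/member/b", "https://example.org/member/c"],
]

# Flatten the list of lists into a single list.
flat = list(chain.from_iterable(nested))

# Optional: remove duplicates while preserving first-seen order.
unique = list(dict.fromkeys(flat))

print(len(flat), len(unique))
```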

## 3. Crawl site and parse HTML with BeautifulSoup

### Grab _content core_
We use `requests` to download each member page and `BeautifulSoup` to parse it, keeping only the `<div id="content-core">` element. The output of this step is `wds-members.html`.


In [14]:
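with open("wds-members.html", "w") as o:
    for i in link:
        # Download and parse one member profile page
        r = requests.get(i)
        w = r.text
        soup = BeautifulSoup(w, 'html.parser')
        # Keep only the main content block; skip pages that lack it
        wds = soup.find("div", {"id": "content-core"})
        if wds is not None:
            o.write(str(wds))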
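
A crawl over many pages can be interrupted by a single failed request. The following is a minimal sketch, not part of the original notebook, of a loop that pauses between requests and records failures instead of stopping. The `crawl` function, its `fetch` and `delay` parameters, and the `example.org` URLs are all hypothetical; `fetch` stands for any callable that returns a page's HTML text.

```python
import time


def crawl(urls, fetch, delay=0.0):
    """Fetch each URL with `fetch`, pausing `delay` seconds between calls.

    Returns (pages, failed): `pages` maps URL -> HTML text, and `failed`
    lists the URLs whose fetch raised an exception.
    """
    pages, failed = {}, []
    for n, url in enumerate(urls):
        if n and delay:
            time.sleep(delay)  # pause before every request except the first
        try:
            pages[url] = fetch(url)
        except Exception:
            failed.append(url)
    return pages, failed


# Offline demonstration with a stand-in fetcher (no network access).
def fake_fetch(url):
    if url.endswith("broken"):
        raise ConnectionError("simulated failure")
    return "<div id='content-core'>" + url + "</div>"


pages, failed = crawl(
    ["https://example.org/a", "https://example.org/broken"], fake_fetch
)
print(len(pages), failed)
```

With `requests`, `fetch` could be something like `lambda u: requests.get(u, timeout=30).text`, and the URLs in `failed` can be retried afterwards.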

### Grab tables only

We parse `wds-members.html` to grab the info tables and write them into `wds-mem-tab.html`:

In [19]:
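# Re-read the saved member profiles and keep only their <table> elements
with open('wds-members.html', 'r') as m:
    soup = BeautifulSoup(m, 'html.parser')
tables = soup.find_all('table')

# Write the tables one after another; str() on the whole result set would
# add list brackets and commas to the file
with open('wds-mem-tab.html', 'w') as o:
    o.write('\n'.join(str(t) for t in tables))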
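
How a collection of tags is converted to text affects the resulting file. Calling `str()` on a list (and the `ResultSet` returned by `find_all` is a `list` subclass) produces the list's printed form, with surrounding brackets and comma separators, whereas joining the items gives plain concatenated markup. The sketch below illustrates the difference offline; `FakeTag` is a hypothetical stand-in for a BeautifulSoup tag, not part of any library.

```python
class FakeTag:
    """Stand-in for a parsed tag: str() and repr() both give its markup."""

    def __init__(self, markup):
        self.markup = markup

    def __repr__(self):
        return self.markup

    __str__ = __repr__


tables = [
    FakeTag("<table><tr><td>A</td></tr></table>"),
    FakeTag("<table><tr><td>B</td></tr></table>"),
]

# Printed form of the list: brackets and ", " separators end up in the text.
as_list_repr = str(tables)

# One table per line, with nothing but the markup itself.
as_joined = "\n".join(str(t) for t in tables)

print(as_list_repr)
print(as_joined)
```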