# Scraping External Links from Old GFW Country Pages

We want to scrape the external links attributed to each country from the old country page widgets. We do this using the following endpoint:

```
http://api.globalforestwatch.org/countries/cmr?thresh=30
```

where the only variable we need to consider is the ISO-3 code (CMR for Cameroon in this case). This returns a JSON containing the attribute ```external_links```, which holds a list of key:value pairs with *title* and *url* values for any links.
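
As a minimal sketch of how the endpoint and its response fit together, the snippet below builds the URL for an ISO-3 code and pulls the *title*/*url* pairs out of a response payload. The helper names `country_url` and `parse_external_links` are hypothetical, and the sample payload is hand-written for illustration. The scraped output further down suggests that `external_links` can arrive as a JSON-encoded string (or an empty string), so the sketch assumes that case needs handling alongside a plain list.

```python
import json


def country_url(iso, thresh=30):
    """Build the old country-page endpoint URL for an ISO-3 code (hypothetical helper)."""
    return f"http://api.globalforestwatch.org/countries/{iso}?thresh={thresh}"


def parse_external_links(payload):
    """Return external_links as a list of {'title', 'url'} dicts (hypothetical helper)."""
    raw = payload.get("external_links")
    if not raw:
        # Missing, None, or empty string: no links for this country
        return []
    if isinstance(raw, str):
        # Assumption: the value is a JSON-encoded list
        return json.loads(raw)
    return list(raw)


# Hand-written sample payload, not a real API response
sample = {"external_links": '[{"title": "InfoAmazonia", "url": "http://infoamazonia.org"}]'}

print(country_url("cmr"))
for link in parse_external_links(sample):
    print(link["title"], link["url"])
```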

In particular we are interested in **Forest Atlas** links, and should indicate whether a Forest Atlas link is present.

The final data frame should contain a row for each ISO and have columns for:

- iso (```string```)
- forest_atlas (```bool```)
- external_links (```list``` of ```objects```)

SCRAPED DATA FOUND HERE: https://wri-01.carto.com/tables/external_links_gfw/table

In [1]:
import requests
import requests_cache
from pprint import pprint
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
requests_cache.install_cache('demo_cache')

In [4]:
# Get all countries from endpoint

ds = '499682b1-3174-493f-ba1a-368b4636708e'  # ADMIN 2 level data
url = f"https://production-api.globalforestwatch.org/v1/query/{ds}"
sql = (f"SELECT iso, polyname FROM {ds} WHERE thresh = 0 AND polyname is 'gadm28' GROUP BY iso")

properties = {"sql": sql}
r = requests.get(url, params = properties)
print(r.url)
print(f'Status: {r.status_code}')
data = r.json().get('data')
pprint(data)

https://production-api.globalforestwatch.org/v1/query/499682b1-3174-493f-ba1a-368b4636708e?sql=SELECT+iso%2C+polyname+FROM+499682b1-3174-493f-ba1a-368b4636708e+WHERE+thresh+%3D+0+AND+polyname+is+%27gadm28%27+GROUP+BY+iso
Status: 200
[{'iso': 'BRA'},
 {'iso': 'USA'},
 {'iso': 'ROU'},
 {'iso': 'RUS'},
 {'iso': 'MEX'},
 {'iso': 'JPN'},
 {'iso': 'PHL'},
 {'iso': 'AUS'},
 {'iso': 'DZA'},
 {'iso': 'COL'},
 {'iso': 'THA'},
 {'iso': 'TUR'},
 {'iso': 'NGA'},
 {'iso': 'VNM'},
 {'iso': 'IND'},
 {'iso': 'UKR'},
 {'iso': 'HRV'},
 {'iso': 'ARG'},
 {'iso': 'NLD'},
 {'iso': 'IDN'},
 {'iso': 'NOR'},
 {'iso': 'DEU'},
 {'iso': 'POL'},
 {'iso': 'GTM'},
 {'iso': 'CHN'},
 {'iso': 'VEN'},
 {'iso': 'LKA'},
 {'iso': 'EGY'},
 {'iso': 'KEN'},
 {'iso': 'YEM'},
 {'iso': 'HND'},
 {'iso': 'AFG'},
 {'iso': 'CAN'},
 {'iso': 'SWE'},
 {'iso': 'PRT'},
 {'iso': 'MNG'},
 {'iso': 'SLV'},
 {'iso': 'BGR'},
 {'iso': 'MWI'},
 {'iso': 'IRN'},
 {'iso': 'KOR'},
 {'iso': 'EST'},
 {'iso': 'PRY'},
 {'iso': 'BTN'},
 {'iso': 'TUN'},
 {

In [6]:
# Populate list with isos

isos = [d.get('iso') for d in data]

isos[0:3], len(isos)

(['BRA', 'USA', 'ROU'], 226)

In [24]:
# Function to call the endpoint with an iso and append the result to a list.
# If the country is not present on the API (or the call fails), store None as the links value.

def scraper(iso, links):

    url = f"http://api.globalforestwatch.org/countries/{iso}?thresh=30"

    try:
        r = requests.get(url)
        links.append({'iso': iso, 'links': r.json().get('external_links')})

    except Exception:
        # Catch Exception rather than using a bare except, so Ctrl-C still interrupts the loop
        links.append({'iso': iso, 'links': None})

    print(iso, "done!")
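
One possible variation, sketched here under stated assumptions, is to return a record instead of mutating a shared list and to isolate the network call so the fallback path can be exercised offline. `fetch_json` and `scrape_country` are hypothetical names, and the sketch uses the standard library's `urllib.request` rather than `requests` purely so that it runs without extra dependencies. The two stand-in fetchers simulate a response with no links and a failed call; they are not real API behaviour.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the JSON body (hypothetical helper, standard library only)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def scrape_country(iso, fetch=fetch_json):
    """Return {'iso', 'links'} for one country; links is None if the call fails."""
    url = f"http://api.globalforestwatch.org/countries/{iso}?thresh=30"
    try:
        return {"iso": iso, "links": fetch(url).get("external_links")}
    except Exception:
        # Network error, invalid JSON, or an unexpected payload shape
        return {"iso": iso, "links": None}


# Offline demonstration with stand-in fetchers instead of the real endpoint
def fake_fetch(url):
    return {"external_links": ""}


def failing_fetch(url):
    raise OSError("simulated network failure")


ok_record = scrape_country("USA", fetch=fake_fetch)
failed_record = scrape_country("XXX", fetch=failing_fetch)
print(ok_record)
print(failed_record)
```

With this shape, the full list could be built as `[scrape_country(i) for i in isos]`.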

In [25]:
%%time
#Run through all isos from list and populate list with iso and links

links = []
for i in isos:
    scraper(i, links)
links

BRA done!
USA done!
ROU done!
RUS done!
MEX done!
JPN done!
PHL done!
AUS done!
DZA done!
COL done!
THA done!
TUR done!
NGA done!
VNM done!
IND done!
UKR done!
HRV done!
ARG done!
NLD done!
IDN done!
NOR done!
DEU done!
POL done!
GTM done!
CHN done!
VEN done!
LKA done!
EGY done!
KEN done!
YEM done!
HND done!
AFG done!
CAN done!
SWE done!
PRT done!
MNG done!
SLV done!
BGR done!
MWI done!
IRN done!
KOR done!
EST done!
PRY done!
BTN done!
TUN done!
URY done!
ECU done!
PER done!
SVN done!
GBR done!
PRK done!
TZA done!
SLB done!
KHM done!
KAZ done!
CHE done!
CUB done!
HUN done!
AGO done!
UGA done!
SRB done!
UZB done!
DOM done!
MYS done!
LAO done!
NIC done!
GHA done!
BDI done!
MOZ done!
BLR done!
GUY done!
ITA done!
NAM done!
DNK done!
CZE done!
AUT done!
FRA done!
IRQ done!
BOL done!
PNG done!
MKD done!
CRI done!
SDN done!
AZE done!
ETH done!
PAN done!
SVK done!
PRI done!
BEN done!
NZL done!
SOM done!
ZMB done!
GEO done!
LBR done!
BGD done!
MMR done!
VUT done!
SUR done!
TLS done!
ZWE done!


In [26]:
len(links)

226

In [27]:
links

[{'iso': 'BRA',
  'links': '[{  "title": "Forest Products Legality Risk: Brazil Overview",  "url": "http://risk.forestlegality.org/countries/brazil"}, {  "title": "Imazon",  "url": "http://www.imazon.org.br"}, {  "title": "InfoAmazonia",  "url": "http://infoamazonia.org"}]'},
 {'iso': 'USA', 'links': ''},
 {'iso': 'ROU', 'links': ''},
 {'iso': 'RUS', 'links': ''},
 {'iso': 'MEX', 'links': ''},
 {'iso': 'JPN', 'links': ''},
 {'iso': 'PHL', 'links': ''},
 {'iso': 'AUS', 'links': ''},
 {'iso': 'DZA', 'links': ''},
 {'iso': 'COL',
  'links': '[{  "title": "InfoAmazonia",  "url": "http://infoamazonia.org"}]'},
 {'iso': 'THA', 'links': ''},
 {'iso': 'TUR', 'links': ''},
 {'iso': 'NGA', 'links': ''},
 {'iso': 'VNM',
  'links': '[{  "title": "Forest Products Legality Risk: Vietnam Overview",  "url": "http://risk.forestlegality.org/countries/vietnam"}]'},
 {'iso': 'IND', 'links': ''},
 {'iso': 'UKR', 'links': ''},
 {'iso': 'HRV', 'links': ''},
 {'iso': 'ARG', 'links': ''},
 {'iso': 'NLD', 'link

In [84]:
# Create lists for iso, external links, and a Forest Atlas flag
# If 'Atlas' appears in the links string, flag the country as True

fa = []
ext_links = []
country = []

for l in links:
    link_string = l.get('links')

    # bool() keeps the flag strictly True/False when links is '' or None
    fa.append(bool(link_string and 'Atlas' in link_string))
    ext_links.append(link_string)
    country.append(l.get('iso'))
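
The target schema asks for `external_links` as a list of objects, while the scraped values shown above are JSON strings. A possible way to parse them, and to base the Forest Atlas flag on link titles rather than a raw substring search, is sketched below. `build_rows` is a hypothetical helper, the second sample record is invented for illustration, and matching on the phrase "forest atlas" in the title is an assumption about how such links are labelled.

```python
import json


def build_rows(records):
    """Turn scraped {'iso', 'links'} records into rows matching the target schema."""
    rows = []
    for record in records:
        raw = record.get("links")
        # '' or None means no links; otherwise assume a JSON-encoded list
        parsed = json.loads(raw) if raw else []
        # Assumption: Forest Atlas links mention "forest atlas" in their title
        has_atlas = any(
            "forest atlas" in (link.get("title") or "").lower() for link in parsed
        )
        rows.append(
            {"iso": record.get("iso"), "forest_atlas": has_atlas, "external_links": parsed}
        )
    return rows


sample_records = [
    {"iso": "COL", "links": '[{"title": "InfoAmazonia", "url": "http://infoamazonia.org"}]'},
    # Invented record: the title and URL are placeholders, not real scraped data
    {"iso": "XXX", "links": '[{"title": "Example Forest Atlas", "url": "http://example.org/atlas"}]'},
    {"iso": "USA", "links": ""},
]

rows = build_rows(sample_records)
for row in rows:
    print(row["iso"], row["forest_atlas"], len(row["external_links"]))
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)` to get the `iso`, `forest_atlas` and `external_links` columns.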
        

In [85]:
# Create dataframe

tmp_data = []
for n in range(len(country)):
    tmp_data.append([country[n], fa[n], ext_links[n]])

In [86]:
cnames=['iso', 'forest_atlas', 'external_links']

df = pd.DataFrame(tmp_data, columns=cnames)

In [87]:
df.head()

Unnamed: 0,iso,forest_atlas,external_links
0,BRA,False,"[{ ""title"": ""Forest Products Legality Risk: B..."
1,USA,False,
2,ROU,False,
3,RUS,False,
4,MEX,False,


In [92]:
# Create csv
file_name = 'data_sci_tutorials/scraped_links.csv'

df.to_csv(file_name, encoding='utf-8', index=False)