# Populating Spatial Coverage for DCAT Data Portals

## Part 1: Introduction
This Jupyter Notebook finds the spatial coverage (place names) of records from DCAT data portals based on their bounding boxes. It is the reverse of **<a href='https://github.com/BTAA-Geospatial-Data-Project/geonames'>geonames</a>**.

## Part 2: Preparation
We will use **Jupyter Notebook (Anaconda 3)** to edit and run the script. Information on Anaconda installation can be found <a href='https://docs.anaconda.com/anaconda/install/'>here</a>. Please note that this script runs on Python 3.

Before running the script, you may need to:
### 1. Run the other Jupyter Notebooks
- If the target state(s) have not yet been converted into city or county bounding box files, run `city_bbox.ipynb`, `county_bbox.ipynb`, or `merge_geojson.ipynb` first.

### 2. Restructure Directories
- `dcat_sjoin.ipynb`
- `city_bbox.ipynb`
- `county_bbox.ipynb`
- `merge_geojson.ipynb`
- geojson folder
    - State1 folder
        - *State1_County_bbox.json*
        - *State1_City_bbox.json*
    - State2 folder
        - *State2_County_bbox.json*
        - *State2_City_bbox.json*
    - ...
- reports folder
    - *allNewItems_ActionDate.csv* formatted in GBL Metadata Template
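
Before running the notebook, it can help to confirm that the bounding box files are where the script expects them. The sketch below is a hypothetical helper (`missing_bbox_files` is not part of the original notebook) and assumes the `geojson/<State>/<State>_<Level>_bbox.json` layout listed above.

```python
import os

def missing_bbox_files(states, root='geojson', levels=('County', 'City')):
    """Return the expected bounding box files that do not exist under `root`."""
    missing = []
    for state in states:
        for level in levels:
            # same naming pattern the notebook uses when it reads the files
            path = os.path.join(root, state, f'{state}_{level}_bbox.json')
            if not os.path.isfile(path):
                missing.append(path)
    return missing

# hypothetical usage: list the files that still need to be generated
print(missing_bbox_files(['Minnesota', 'Michigan']))
```

An empty list means every expected file is present. Some states legitimately lack a city file, so a missing entry is a prompt to check rather than an error.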
        
### 3. Inspect *allNewItems_ActionDate.csv*
If the records belong to a regional data portal, you will probably need to create a merged county bounding box file first.

The final product is one CSV file named ***allNewItems_ActionDate_test.csv***.

> Originally created on Feb 4, 2021 <br>
> @author: Yijing Zhou @YijingZhou33

## Part 3: Get Started

### Step 1: Import modules

In [1]:
import pandas as pd
import os
import geopandas as gpd
import json
from shapely.geometry import box
import numpy as np
from functools import reduce
import time
import urllib.request
import requests

### Step 2: Set file path

In [2]:
ActionDate = time.strftime('%Y%m%d')
# newItemscsv = os.path.join('reports', f'allNewItems_{ActionDate}.csv')
newItemscsv = os.path.join('reports', 'allNewItems_20210202.csv')
df_csv = pd.read_csv(newItemscsv, encoding = 'unicode_escape')

### Step 3: Check if download link is valid

In [8]:
def check_download(df):
    sluglist = []
    for _, row in df.iterrows():
        url = row['Download']
        slug = row['Slug']
        try:
            response = requests.get(url, timeout = 3)
            response.raise_for_status()
            if response.headers['content-type'] == 'application/json; charset=utf-8':
                print(f'{slug}: Not a zipfile')
            else:
                sluglist.append(slug)
        except requests.exceptions.HTTPError as errh:
            print (f'{slug}: {errh}')
        except requests.exceptions.ConnectionError as errc:
            print (f'{slug}: {errc}')
        except requests.exceptions.Timeout as errt:
            print (f'{slug}: {errt}')
        ## RequestException is the base class of the errors above, so it must come last
        except requests.exceptions.RequestException as err:
            print (f'{slug}: {err}')
    return sluglist  
sluglist = check_download(df_csv)
df_csv = df_csv[df_csv['Slug'].isin(sluglist)]

Not a zip file:  ec8e015cb6d941db94d1c963d1355526_4
Not a zip file:  ec8e015cb6d941db94d1c963d1355526_3
HTTPSConnectionPool(host='data2018-mcgov-gis.opendata.arcgis.com', port=443): Read timed out. (read timeout=3): 42f9c2daa9824d3dabd6714e493f2b88_0
HTTPSConnectionPool(host='data2018-mcgov-gis.opendata.arcgis.com', port=443): Read timed out. (read timeout=3): b47d9884847d4e6a8a06766e636bb9bd_0
HTTPSConnectionPool(host='data2018-mcgov-gis.opendata.arcgis.com', port=443): Read timed out. (read timeout=3): 16343b687c624a5aa3b658ea18d4cb26_0
HTTPSConnectionPool(host='data.baltimorecity.gov', port=443): Read timed out. (read timeout=3): 8218ff5808d94165baa89afd46587f52_0
500 Server Error: Internal Server Error for url: https://opendata.minneapolismn.gov/datasets/d3c37bd37e8f4e7a92b844710045629d_0.zip: d3c37bd37e8f4e7a92b844710045629d_0
Not a zip file:  4fb8522d16164670b43a3434b73b9fbf_0
Not a zip file:  317bf35dd9b24a49a092c00c83a81fff_0
Not a zip file:  c47ec477a9514df4afac431292c10ad0_0
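
The check in `check_download` compares the whole `content-type` header against one exact string. A more tolerant variant is sketched below, assuming that any JSON response means the portal returned an error page instead of a zipfile. `is_probably_zip` is a hypothetical name, and the function takes a plain dictionary with a lower-case key so that it can be tried without a network connection.

```python
def is_probably_zip(headers):
    """Guess from response headers whether the body is a download rather than a JSON page.

    A missing content-type header is treated as "not JSON", so it returns True.
    """
    content_type = headers.get('content-type', '').lower()
    return not content_type.startswith('application/json')

# hypothetical header values
print(is_probably_zip({'content-type': 'application/zip'}))
print(is_probably_zip({'content-type': 'application/json; charset=utf-8'}))
```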


### Step 4: Split csv file if necessary

In [1]:
## If records come from Esri, the spatial coverage is considered as United States.
df_esri = df_csv[df_csv['Publisher'] == 'Esri'].reset_index(drop=True)
df_csv = df_csv[df_csv['Publisher'] != 'Esri'].reset_index(drop=True)


### Step 5: Classify portals
The portal code is the main indicator: <br>
- a - state
- b - county
- c - city
- d - university (usually a city)
- f - regional
- 99 - Esri

In [None]:
def portal_level(df):
    leveldict = {'a': 'County', 'b': 'County', 'c': 'City', 'd': 'City', 'f': 'Regional'}
    levellist = []
    statelist = []
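    for _, row in df.iterrows():
        ## start from empty values so an unrecognized code never reuses the previous row's result
        level, state = '', ''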
        ## if it is a state('a') or county('b') data portal, 
        ## use county-level bounding box files  
        if 'a' in row['Code']:
            level = leveldict['a']
            state = row['Publisher'].split(' ')[-1] 
        elif 'b' in row['Code']:
            level = leveldict['b']
            state = row['Publisher'].split(', ')[-1] 
            
        ## if it is a city('c') or university('d') data portal, 
        ## use both city-level and county-level bounding box files    
        elif 'c' in row['Code']:
            level = leveldict['c']
            state = row['Publisher'].split(', ')[-1]    
        elif 'd' in row['Code']:
            level = leveldict['d']
            state = row['Publisher'].split(' ')[-1]
            
        ## if it is a regional('f') data portal, 
        ## use (merged) county-level bounding box files    
        elif 'f' in row['Code']:
            ## Regional portal: SEMCOG, Southeast Michigan Council of Governments             
            if row['Code'] == '06f-01':
                level = leveldict['a']
                state = 'Michigan'
            ## Regional portal: Delaware Valley Regional Planning Commission
            ## The bounding box includes counties from Delaware, Maryland, New Jersey and Pennsylvania
            elif row['Code'] == '04f-01': 
                level = leveldict['a']
                state = 'Delaware'
        levellist.append(level)
        statelist.append(state)
    
    df['Level'] = levellist
    df['State'] = statelist
    return df
df_csv = portal_level(df_csv)
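
The letter inside the portal code can also be read with a pattern match instead of a substring test. The sketch below is an illustration only: `portal_type` is a hypothetical helper, the two-digits, letter, dash, number pattern is inferred from codes such as `05c-01` and `06f-01`, and treating any code that starts with `99` as Esri is an assumption.

```python
import re

# mapping taken from the portal code list in this step
CODE_TYPES = {'a': 'state', 'b': 'county', 'c': 'city', 'd': 'university', 'f': 'regional'}

def portal_type(code):
    """Return the portal type for a code such as '05c-01', or None if it is not recognized."""
    if code.startswith('99'):
        # assumption: Esri codes begin with 99
        return 'Esri'
    match = re.match(r'^\d{2}([a-z])-\d+$', code)
    if match is None:
        return None
    return CODE_TYPES.get(match.group(1))

print(portal_type('05c-01'))
print(portal_type('06f-01'))
```

Returning `None` for an unknown code makes the unmatched case explicit, so the caller can decide how to handle it.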

## Part 4: Build up GeoJSON dataframe

### Step 6: Create bounding box for csv file

In [None]:
def format_coordinates(df, identifier):
    ## create regular bounding box coordinate pairs and round them to 2 decimal places
    df = pd.concat([df, df['Bounding Box'].str.split(',', expand=True).astype(float).round(2)], axis=1).rename(
        columns={0:'minX', 1:'minY', 2:'maxX', 3:'maxY'})
    
    ## flag suspicious coordinates: a box wider or taller than 10 degrees
    for _, row in df.iterrows():
        if (row.maxX - row.minX) > 10 or (row.maxY - row.minY) > 10:
            print(f'Wrong Coordinates --> {identifier}: ', row[identifier])
    
    ## create bounding box
    df['Coordinates'] = df.apply(lambda row: box(row.minX, row.minY, row.maxX, row.maxY), axis=1)
    
    ## clean up unnecessary columns
    return df.drop(columns =['minX', 'minY', 'maxX', 'maxY'])

df_clean = format_coordinates(df_csv, 'Slug')
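
To see what Step 6 does to a single record, the parsing and the 10-degree sanity check can be reproduced in plain Python. `parse_bbox` is a hypothetical helper, and the coordinates in the usage lines are made-up values, not taken from the data.

```python
def parse_bbox(text, max_span=10):
    """Split 'minX,minY,maxX,maxY' into rounded floats and flag oversized boxes.

    Returns the four coordinates and True when the box is wider or taller
    than `max_span` degrees, the same threshold used in format_coordinates.
    """
    min_x, min_y, max_x, max_y = (round(float(value), 2) for value in text.split(','))
    suspicious = (max_x - min_x) > max_span or (max_y - min_y) > max_span
    return (min_x, min_y, max_x, max_y), suspicious

# made-up coordinates for illustration
print(parse_bbox('-93.3291,44.8901,-93.1938,45.0512'))
print(parse_bbox('-180,-90,180,90'))
```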

### Step 7: Convert csv and GeoJSON file into dataframe

In [None]:
gdf_rawdata = gpd.GeoDataFrame(df_clean, geometry = df_clean['Coordinates'])
gdf_rawdata.crs = 'EPSG:4326'

### Step 8: Split the dataframe into a dictionary

In [None]:
## e.g.
## splitdict = {'Minnesota': {'County': df_1, 'City': df_2}, 
##              'Michigan':  {'County': df_3}, 
##               ...}

splitdict = {}
for state in list(gdf_rawdata['State'].unique()):
    gdf_slice = gdf_rawdata[gdf_rawdata['State'] == state]
    if state:
        leveldict = {}
        for level in list(gdf_slice['Level'].unique()):
            leveldict[level] = gdf_slice[gdf_slice['Level'] == level].drop(columns = 'State')
        splitdict[state] = leveldict
    else:
        df_nobbox = gdf_slice.drop(columns =['Coordinates', 'geometry', 'State'])
        sluglist = df_nobbox['Code'].unique()
        print("Can't find the bounding box file: ", sluglist)
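
The nested structure built in this step can be illustrated with plain dictionaries in place of GeoDataFrames. This is only a sketch of the grouping idea: `split_records` is a hypothetical helper, and the records are invented.

```python
from collections import defaultdict

def split_records(records):
    """Group record dicts as {state: {level: [records]}}, skipping records with an empty state."""
    split = defaultdict(lambda: defaultdict(list))
    for record in records:
        if record['State']:
            split[record['State']][record['Level']].append(record)
    # convert back to regular dictionaries for easier inspection
    return {state: dict(levels) for state, levels in split.items()}

records = [
    {'Slug': 'a_0', 'State': 'Minnesota', 'Level': 'County'},
    {'Slug': 'b_0', 'State': 'Minnesota', 'Level': 'City'},
    {'Slug': 'c_0', 'State': 'Michigan', 'Level': 'County'},
    {'Slug': 'd_0', 'State': '', 'Level': ''},
]
print(split_records(records))
```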

## Part 5: Spatial Join
**<a href='https://geopandas.org/reference/geopandas.sjoin.html#geopandas-sjoin'>`geopandas.sjoin`</a>** provides the following criteria for matching rows:
- intersects 
- within
- contains
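
For axis-aligned bounding boxes, the three criteria reduce to simple coordinate comparisons. The sketch below is a simplified model, not how geopandas evaluates arbitrary geometries, and the function names and coordinates are hypothetical. Each rectangle is a `(minX, minY, maxX, maxY)` tuple and is assumed to have a non-zero width and height. In the notebook's spatial join the record is the left geometry, so `within` means the record's box lies inside a place's box and `contains` means the record's box encloses it.

```python
def intersects(a, b):
    """True if rectangles a and b share at least one point (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def within(a, b):
    """True if rectangle a lies inside rectangle b."""
    return b[0] <= a[0] and b[1] <= a[1] and a[2] <= b[2] and a[3] <= b[3]

def contains(a, b):
    """True if rectangle a encloses rectangle b."""
    return within(b, a)

# made-up boxes: a dataset extent and a larger county extent
dataset = (-93.4, 44.8, -93.1, 45.1)
county = (-93.8, 44.7, -93.0, 45.3)
print(intersects(dataset, county), within(dataset, county), contains(dataset, county))
```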

### Step 9: Perform spatial join on each record

In [None]:
def sjoin(gdf_rawdata, op, state, identifier, level):
    bboxpath = os.path.join('geojson', state, f'{state}_{level}_bbox.json')
    gdf_basemap = gpd.read_file(bboxpath)
    ## spatial join
    ## note: newer geopandas releases (0.10+) call this argument `predicate`; `op` is deprecated there
    df_merged = gpd.sjoin(gdf_rawdata, gdf_basemap, op = op, how = 'left')[[identifier, level, 'State']].astype(str)
    ## merge the level column and 'State' into one column 'Pname'
    df_merged['Pname'] = df_merged[[level, 'State']].agg(', '.join, axis=1).replace('nan, nan', 'nan')
    ## group records by identifier
    df_group = df_merged.drop(columns = ['State']).reset_index(drop = True).groupby(identifier
            )['Pname'].apply(list).reset_index(name = op)
    return df_group

### Step 10: Format place names from city-level data portals

In [None]:
def format_city_placename(row, state):
    ## replace ['nan, nan'] with ['nan']
    if len(row) == 2 and row[0] == 'nan':
        result = ['nan']
    else:
        ## e.g. ['nan', 'Minneapolis, Minnesota, Hennepin County, Minnesota']
        ## remove 'nan' from list: ['Minneapolis, Minnesota, Hennepin County, Minnesota']
        nonan = filter(lambda x: x != 'nan', row)
        ## ['Minneapolis, ', 'Hennepin County, Minnesota']
        namelist = ', '.join(nonan).split(state + ', ')
        ## ['Minneapolis, Minnesota', 'Hennepin County, Minnesota']
        result = list(set([i + state for i in namelist[:-1]] + [namelist[-1]]))
    return result

### Step 11: Fetch the proper bounding box file for different data portals

In [None]:
def spatial_join(gdf_rawdata, state, identifier, level):
    dflist = []
    operations = ['intersects', 'within', 'contains']
    for op in operations:
        bboxpath = os.path.join('geojson', state, f'{state}_{level}_bbox.json')
        
        ## city-level records need to perform spatial join twice (city & county)
        ## spatial coverage might contain city name
        ## e.g. ['Ann Arbor, Michigan', 'Washtenaw County, Michigan']
        if level == 'City':
            ## District of Columbia doesn't have a county bounding box file
            if state == 'District of Columbia':
                df_group = sjoin(gdf_rawdata, op, state, identifier, level)
            ## check whether a city bounding box file exists alongside the county file
            elif os.path.isfile(bboxpath):
                df_city = sjoin(gdf_rawdata, op, state, identifier, 'City')
                df_county = sjoin(gdf_rawdata, op, state, identifier, 'County')
                ## pd.concat replaces DataFrame.append, which was removed in pandas 2.0
                df_merged = pd.concat([df_city, df_county], ignore_index = True)
                df_group = df_merged.groupby(identifier).agg(lambda row: [', '.join(x) for x in row]).reset_index()
                df_group[op] = df_group[op].apply(lambda row: format_city_placename(row, state))   
            ## missing city file: Iowa & Nebraska    
            else: 
                df_group = sjoin(gdf_rawdata, op, state, identifier, 'County')
                
        ## county-level records need to perform spatial join once (county)        
        elif level == 'County':
            df_group = sjoin(gdf_rawdata, op, state, identifier, level)
        
        ## replace ['nan'] with None
        df_group[op] = df_group[op].apply(lambda row: None if row[0] == 'nan' else row)
        dflist.append(df_group)

    ## merge dataframes created by different match options
    df_sjoin = reduce(lambda left,right: pd.merge(left, right, on = identifier, how = 'outer'), dflist)
    
    ## ultimately it returns the raw records with the place names found by each matching operation
    ## e.g. dataframe = {'identifier', 'level', 'intersects', 'within', 'contains'}
    return gdf_rawdata.merge(df_sjoin, on = identifier).drop(columns =['Coordinates', 'geometry'])
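
The `reduce` call in `spatial_join` folds the three per-operation results into one table keyed on the identifier. The same folding pattern is sketched below with plain dictionaries as an analogy. `outer_merge` and the sample data are hypothetical. Unlike a pandas outer merge, an identifier that is missing from one input simply lacks that key here instead of receiving `NaN`.

```python
from functools import reduce

def outer_merge(left, right):
    """Combine two {identifier: {column: value}} mappings, keeping identifiers found in either."""
    merged = {key: dict(values) for key, values in left.items()}
    for key, values in right.items():
        merged.setdefault(key, {}).update(values)
    return merged

# one mapping per matching operation (invented identifiers and matches)
per_operation = [
    {'abc_0': {'intersects': ['Hennepin County, Minnesota']}},
    {'abc_0': {'within': None}},
    {'abc_0': {'contains': None}, 'xyz_1': {'contains': ['Ramsey County, Minnesota']}},
]
combined = reduce(outer_merge, per_operation)
print(combined)
```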

### Step 12: Merge place names generated by three matching operations to raw data

In [None]:
mergeddf = []
## loop through splitdict based on key 'State'
for state, gdfdict in splitdict.items():
    ## loop through records based on key 'Level'
    for level, gdf_split in gdfdict.items():
        df_comparison = spatial_join(gdf_split, state, 'Slug', level)
        ## e.g. mergeddf = {'identifier', 'intersects', 'within', 'contains'}
        mergeddf.append(df_comparison)
    
## stack the per-state, per-level results (raw data plus ['intersects', 'within', 'contains']) into one dataframe
gdf_merged = pd.concat(mergeddf).reset_index(drop = True)

## Part 6: Populate place names

### Step 13: Format spatial coverage based on GBL Metadata Template

In [None]:
## e.g. ['Camden County, New Jersey', 'Delaware County, Pennsylvania', 'Philadelphia County, Pennsylvania']
def format_placename(colname):
    inv_map = {}
    plist = []
    
    ## {'Camden County': 'New Jersey', 'Delaware County': 'Pennsylvania', 'Philadelphia County': 'Pennsylvania'}
    namedict = dict(item.split(', ') for item in colname)

    ## {'New Jersey': ['Camden County'], 'Pennsylvania': ['Delaware County', 'Philadelphia County']}
    for k, v in namedict.items():
        inv_map[v] = inv_map.get(v, []) + [k] 
    
    ## ['Camden County, New Jersey|New Jersey', 'Delaware County, Pennsylvania|Philadelphia County, Pennsylvania|Pennsylvania']
    for k, v in inv_map.items():
        pname = [elem + ', ' + k for elem in v]
        pname.append(k)
        plist.append('|'.join(pname))

    ## Camden County, New Jersey|New Jersey|Delaware County, Pennsylvania|Philadelphia County, Pennsylvania|Pennsylvania
    return '|'.join(plist)
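
Because `format_placename` builds a dictionary keyed on the place name, two places that share a name in different states (for example a Washington County in each) would collapse into one entry. A possible variant that groups by state directly is sketched below. `format_placename_by_state` is a hypothetical name, and the function assumes every item ends in `', <State>'`.

```python
def format_placename_by_state(placenames):
    """Group 'Place, State' strings by state and append each state after its places."""
    by_state = {}
    for item in placenames:
        # split on the last comma only, so the state is always the final part
        place, state = item.rsplit(', ', 1)
        places = by_state.setdefault(state, [])
        if place not in places:
            places.append(place)
    parts = []
    for state, places in by_state.items():
        parts.extend(f'{place}, {state}' for place in places)
        parts.append(state)
    return '|'.join(parts)

example = ['Camden County, New Jersey', 'Delaware County, Pennsylvania',
           'Philadelphia County, Pennsylvania']
print(format_placename_by_state(example))
```

For the example list this produces the same string as the comment at the end of `format_placename`.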

---
### Step 14: Manual items to change 
When a record intersects too many places, the script treats the spatial coverage as the whole state. <br>
You can customize that behavior here.

In [None]:
## Twin Cities Metropolitan Area, Minnesota 
twin_cities = ['Anoka County, Minnesota', 'Carver County, Minnesota', 'Chisago County, Minnesota', 
               'Dakota County, Minnesota', 'Hennepin County, Minnesota', 'Ramsey County, Minnesota', 
               'Scott County, Minnesota', 'Washington County, Minnesota']

---

### Step 15: Populate spatial coverage for state, county and regional data portals

In [None]:
def county_level_formatting(row):
    if row['within'] is None:
        ## no within feature && <= 5 contain features --> contains features
        if row['contains'] is not None and len(row['contains']) < 6:
            placename = format_placename(row['contains'])
        ## no intersect, within (and contains) feature --> wrong coordinates
        elif row['intersects'] is None:
            placename = ''
        ## no within feature && > 5 contain features --> state
        else:
            statedict = dict(item.split(', ') for item in row['intersects']) 
            placename = ('|').join(set([v for k, v in statedict.items()]))
    else:
        ## otherwise, within features
        placename = format_placename(row['within'])  
    return placename
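
The state fallback above joins a `set`, whose iteration order for strings may differ from one Python session to the next, so a record spanning several states can come out in a different order on each run. A deterministic alternative is sketched here. `states_only` is a hypothetical helper that keeps the states in first-seen order.

```python
def states_only(placenames):
    """Reduce 'Place, State' strings to a '|'-joined list of distinct states, in first-seen order."""
    states = []
    for item in placenames:
        state = item.rsplit(', ', 1)[-1]
        if state not in states:
            states.append(state)
    return '|'.join(states)

print(states_only(['Camden County, New Jersey', 'Delaware County, Pennsylvania',
                   'Philadelphia County, Pennsylvania']))
```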

### Step 16: Populate spatial coverage for city and university data portals

In [None]:
def city_level_formatting(row):
    if row['within'] is None:
        ## no within feature && <= 5 contain features --> contains features
        if row['contains'] is not None and len(row['contains']) < 6:
            placename = format_placename(row['contains'])   
        ## no intersect, within (and contains) feature --> wrong coordinates    
        elif row['intersects'] is None:
            placename = '' 
        else:
            ## Twin Cities Metropolitan area            
            if row['Code'] == '05c-01':
                placename = format_placename(twin_cities)
            ## no within feature && <= 5 intersect features --> intersect features
            elif row['intersects'] is not None and len(row['intersects']) < 6:
                placename = format_placename(row['intersects'])  
            ## no within feature && > 5 intersect features --> state
            else:
                statedict = dict(item.split(', ') for item in row['intersects']) 
                placename = ('|').join(set([v for k, v in statedict.items()]))
    else:   
        ## within features && <= 4 contains features --> within + contains features
        if row['contains'] is not None and len(row['contains']) < 5:
            placename = format_placename(row['contains'] + row['within'])
        ## within features && <= 5 intersects features --> intersects features    
        elif row['intersects'] is not None and len(row['intersects']) < 6:
            placename = format_placename(row['intersects'])
        else:
            placename = format_placename(row['within']) 
            
    return placename
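
The branching in `city_level_formatting` is easier to check when it only reports which rule fired. The sketch below mirrors the order of the conditions above but returns a label instead of a formatted place name. `city_rule` and its `twin_cities` flag (standing in for the `'05c-01'` code test) are hypothetical.

```python
def city_rule(within, contains, intersects, twin_cities=False):
    """Name the rule the city-level logic would apply; each match argument is a list or None."""
    if within is None:
        if contains is not None and len(contains) < 6:
            return 'contains'
        if intersects is None:
            return 'none'
        if twin_cities:
            return 'twin_cities'
        if len(intersects) < 6:
            return 'intersects'
        return 'state'
    if contains is not None and len(contains) < 5:
        return 'contains + within'
    if intersects is not None and len(intersects) < 6:
        return 'intersects'
    return 'within'

# a record inside one city that touches seven places falls back to 'within'
print(city_rule(['Ann Arbor, Michigan'], None, ['x'] * 7))
```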

### Step 17: Merge data portals of different levels

In [None]:
def populate_placename(df, identifier):
    placenamelist = []
    for _, row in df.iterrows():
        print('identifier --> ', row[identifier])
        if row['Level'] == 'County':
            placename = county_level_formatting(row)
        elif row['Level'] == 'City':
            placename = city_level_formatting(row)
        placenamelist.append(placename)
    
    df['Spatial Coverage'] = placenamelist
    return df.drop(columns = ['Level', 'intersects', 'within', 'contains'])
df_bbox = populate_placename(gdf_merged, 'Slug')

## Part 7: Write to csv file

In [None]:
## check whether any records come from Esri
if len(df_esri):
    df_esri['Spatial Coverage'] = 'United States'
    ## pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df_final = pd.concat([df_bbox, df_esri], ignore_index = True)
else:
    df_final = df_bbox

df_final.to_csv(newItemscsv, index = False)
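
The cell above writes to `newItemscsv`, which is also the input file, while the preparation notes name the final product *allNewItems_ActionDate_test.csv*. If keeping the input untouched is preferred, a separate output path could be built as sketched below. `output_path` is a hypothetical helper and the `reports` folder default is taken from Step 2.

```python
import os
import time

def output_path(action_date=None, folder='reports'):
    """Build the path of the '_test' output file; defaults to today's date as YYYYMMDD."""
    action_date = action_date or time.strftime('%Y%m%d')
    return os.path.join(folder, f'allNewItems_{action_date}_test.csv')

print(output_path('20210202'))
```

The result could then be passed to `df_final.to_csv(..., index = False)` in place of `newItemscsv`.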