# Populating Spatial Coverage for DCAT Data Portals

## Part 1: Introduction
This Jupyter Notebook derives the spatial coverage (place names) of records from DCAT data portals from their bounding boxes. It is a reverse version of **<a href='https://github.com/BTAA-Geospatial-Data-Project/geonames'>geonames</a>**.

## Part 2: Preparation
We will use **Jupyter Notebook (Anaconda 3)** to edit and run the script. Instructions for installing Anaconda can be found <a href='https://docs.anaconda.com/anaconda/install/'>here</a>. Please note that this script runs on Python 3.

Before running the script, you may need to:
### 1. Run other Jupyter Notebooks
- If the target state(s) have not yet been converted into city or county bounding box files, you may need to 
    1. download the county and city boundary files (GeoJSON or Shapefile) online
    2. run `city_boundary.ipynb` or `county_boundary.ipynb` to create boundary GeoJSON files 
        - if there are regional data portals, you may need to run `merge_geojson.ipynb` to merge them together
    3. run `city_bbox.ipynb` or `county_bbox.ipynb` to create bounding box GeoJSON files

### 2. Restructure Directories
- `dcat_sjoin.ipynb`
- `city_boundary.ipynb`
- `county_boundary.ipynb`
- `city_bbox.ipynb`
- `county_bbox.ipynb`
- `merge_geojson.ipynb`
- geojson folder
    - State1 folder
        - *State1_County_boundaries.json*
        - *State1_City_boundaries.json*
        - *State1_County_bbox.json*
        - *State1_City_bbox.json*
    - Code folder (multiple states)
        - *Code_County_boundaries.json*
        - *Code_City_boundaries.json*
        - *Code_County_bbox.json*
        - *Code_City_bbox.json*
    - ...
- reports folder
    - *allNewItems_ActionDate.csv* formatted in GBL Metadata Template
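
Before running the notebook, it can help to confirm that the bounding box files are where the code expects them. The following is a minimal sketch, assuming the `geojson/<name>/<name>_<level>_bbox.json` layout shown above; the helper `missing_bbox_files` is a hypothetical name and is not part of the notebook.

```python
from pathlib import Path

def missing_bbox_files(root, names):
    """List expected bounding box files under root/geojson that do not exist."""
    missing = []
    for name in names:
        for level in ('City', 'County'):
            # Same naming pattern the notebook uses when it reads the files
            path = Path(root) / 'geojson' / name / f'{name}_{level}_bbox.json'
            if not path.is_file():
                missing.append(path)
    return missing
```

Calling `missing_bbox_files('.', ['Minnesota', '04f-01'])` returns the paths that still need to be generated. Note that District of Columbia is expected to have no county file, so that path can be ignored for it.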
        
### 3. Inspect *allNewItems_ActionDate.csv*
The final product is a single CSV file named ***allNewItems_ActionDate_test.csv***. 

> Original created on Feb 4 2021 <br>
> @author: Yijing Zhou @YijingZhou33

## Part 3: Get Started

### Step 1: Import modules

In [None]:
import pandas as pd
import os
import geopandas as gpd
import json
from shapely.geometry import box
import numpy as np
from itertools import chain
from functools import reduce
import time
import urllib.request
import requests

### Step 2: Set file path

In [None]:
ActionDate = time.strftime('%Y%m%d')
newItemscsv = os.path.join('reports', f'allNewItems_{ActionDate}.csv')
df_csv = pd.read_csv(newItemscsv, encoding = 'unicode_escape')

### Step 3: Check if download link is valid

In [None]:
def check_download(df):
    sluglist = []
    for _, row in df.iterrows():
        url = row['Download']
        slug = row['Slug']
        ## on Timeout, retry connecting up to 3 times before reporting the error
        attempts = 3
        while attempts:
            try:
                response = requests.get(url, timeout = 3)
                response.raise_for_status()
                ## a JSON response means the link does not point to a zipfile
                if response.headers.get('content-type') == 'application/json; charset=utf-8':
                    print(f'{slug}: Not a zipfile')
                else:
                    print(f'{slug}: Success')
                    sluglist.append(slug)
                break
            except requests.exceptions.Timeout as errt:
                attempts -= 1
                if not attempts:
                    print(f'{slug}: {errt}')
            ## check HTTPError: 404 (not found) or 500 (server error)
            except requests.exceptions.HTTPError as errh:
                print(f'{slug}: {errh}')
                break
            except requests.exceptions.ConnectionError as errc:
                print(f'{slug}: {errc}')
                break
            ## any other request error; listed last because it is the base class of the ones above
            except requests.exceptions.RequestException as err:
                print(f'{slug}: {err}')
                break
    return sluglist
sluglist = check_download(df_csv)

## only includes records with valid download link
df_csv = df_csv[df_csv['Slug'].isin(sluglist)]
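
The retry-on-timeout idea in this step can be shown without any network access. This is a minimal sketch, not the notebook's implementation: `call_with_retries` and `flaky_fetch` are hypothetical names, and the built-in `TimeoutError` stands in for the timeout exception raised by an HTTP client.

```python
def call_with_retries(fetch, attempts=3, retry_on=(TimeoutError,)):
    """Call fetch() up to `attempts` times, retrying only on `retry_on` errors."""
    last_error = None
    for _ in range(attempts):
        try:
            return fetch()
        except retry_on as err:
            # Remember the error and try again until attempts run out
            last_error = err
    raise last_error

# Hypothetical flaky callable: times out twice, then succeeds
calls = {'count': 0}

def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('simulated timeout')
    return 'ok'

result = call_with_retries(flaky_fetch)
```

Only the timeout is retried; any other exception propagates immediately, which matches the intent of treating 404 or 500 responses as final.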

### Step 4: Split csv file if necessary

In [None]:
## if records come from Esri, the spatial coverage is considered as United States
df_esri = df_csv[df_csv['Publisher'] == 'Esri'].reset_index(drop=True)
df_csv = df_csv[df_csv['Publisher'] != 'Esri'].reset_index(drop=True)

### Step 5: Split state from column 'Publisher'
The portal code is the main indicator: <br>
- a - state
- b - county
- c - city
- d - university (usually a city)
- f - regional
- 99 - Esri

In [None]:
def split_state(df):
    statelist = []
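    for _, row in df.iterrows():
        ## default to '' so a row with an unrecognized code does not reuse the previous row's state
        state = ''
        if 'a' in row['Code']: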
            # e.g. State of Minnesota
            state = row['Publisher'].split(' ')[-1] 
        elif 'b' in row['Code']:
            # e.g. Wilkin County, Minnesota
            state = row['Publisher'].split(', ')[-1]             
        elif 'c' in row['Code']:
            # e.g. City of Baltimore, Maryland
            state = row['Publisher'].split(', ')[-1]    
        elif 'd' in row['Code']:
            # e.g. University of Michigan
            state = row['Publisher'].split(' ')[-1]           
        elif 'f' in row['Code']:
            ## Regional portal: SEMCOG, Southeast Michigan Council of Governments             
            if row['Code'] == '06f-01':
                state = 'Michigan'
            ## Regional portal: Delaware Valley Regional Planning Commission
            ## The bounding box includes counties from Delaware, Maryland, New Jersey and Pennsylvania
            elif row['Code'] == '04f-01': 
                state = '04f-01'
        statelist.append(state)
    
    df['State'] = statelist
    return df
df_csv = split_state(df_csv)
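
The portal-code convention listed above can also be expressed as a small lookup. This is a sketch under two assumptions inferred from the examples `06f-01` and `04f-01`: codes start with digits followed by a single lowercase letter, and Esri codes start with `99`. The names `PORTAL_LEVELS` and `portal_level` are hypothetical.

```python
import re

# Letter-to-level mapping taken from the list above
PORTAL_LEVELS = {
    'a': 'state',
    'b': 'county',
    'c': 'city',
    'd': 'university',
    'f': 'regional',
}

def portal_level(code):
    """Return the portal level for a code such as '06f-01', or None if unknown."""
    if code.startswith('99'):
        return 'Esri'
    # Assumed format: leading digits, then the level letter
    match = re.match(r'\d+([a-z])', code)
    if match is None:
        return None
    return PORTAL_LEVELS.get(match.group(1))
```

Matching the letter by position rather than with `'a' in code` avoids accidental hits if a code ever contains more than one letter.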

## Part 4: Build up GeoJSON dataframe

### Step 6: Create bounding box for csv file

In [None]:
def format_coordinates(df, identifier):
    ## create regular bounding box coordinate pairs and round them to 2 decimal places
    ## manually generates the buffering zone
    df = pd.concat([df, df['Bounding Box'].str.split(',', expand=True).astype(float).round(2)], axis=1).rename(
        columns={0:'minX', 1:'minY', 2:'maxX', 3:'maxY'})
    
    ## check if there exists wrong coordinates and drop them
    coordslist = ['minX', 'minY', 'maxX', 'maxY']
    idlist = []
    for _, row in df.iterrows():
        for coord in coordslist:
            if abs(row[coord]) == 0 or abs(row[coord]) == 180:
                idlist.append(row[identifier])
        if (row.maxX - row.minX) > 10 or (row.maxY - row.minY) > 10:
            idlist.append(row[identifier])
    
    ## create bounding box
    df['Coordinates'] = df.apply(lambda row: box(row.minX, row.minY, row.maxX, row.maxY), axis=1)
    
    ## clean up unnecessary columns
    df = df.drop(columns =['minX', 'minY', 'maxX', 'maxY']).reset_index(drop = True)
    
    df_clean = df[~df[identifier].isin(idlist)]
    df_wrongcoords = df[df[identifier].isin(idlist)].drop(columns = ['State', 'Coordinates'])
    
    return [df_clean, df_wrongcoords]

df_csvlist = format_coordinates(df_csv, 'Slug')
df_clean = df_csvlist[0]
df_wrongcoords = df_csvlist[1]
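
The validation rules in this step can be tried on individual strings without pandas. The sketch below mirrors the checks above (a coordinate equal to 0 or ±180, or a span of more than 10 degrees); `parse_bbox` and `is_suspicious` are hypothetical helper names and the coordinates are illustrative values only.

```python
def parse_bbox(text):
    """Parse 'minX,minY,maxX,maxY' into a tuple of floats rounded to 2 decimals."""
    min_x, min_y, max_x, max_y = (round(float(v), 2) for v in text.split(','))
    return (min_x, min_y, max_x, max_y)

def is_suspicious(bbox, max_span=10):
    """Flag boxes with a coordinate of 0 or +/-180, or a span above max_span degrees."""
    min_x, min_y, max_x, max_y = bbox
    touches_limit = any(abs(v) in (0, 180) for v in bbox)
    too_wide = (max_x - min_x) > max_span or (max_y - min_y) > max_span
    return touches_limit or too_wide

good = parse_bbox('-93.3291,44.8896,-93.1938,45.0512')
bad = parse_bbox('-180,0,180,90')
```

Here `is_suspicious(good)` is `False` and `is_suspicious(bad)` is `True`, so the second record would end up in `df_wrongcoords` rather than in the spatial join.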

### Step 7: Convert csv and GeoJSON file into dataframe

In [None]:
gdf_rawdata = gpd.GeoDataFrame(df_clean, geometry = df_clean['Coordinates'])
gdf_rawdata.crs = 'EPSG:4326'

### Step 8: Split dataframe and convert them into dictionary 

In [None]:
## e.g.
## splitdict = {'Minnesota': {'Bounding Box 1': df_1, 'Bounding Box 2': df_2}, 
##              'Michigan':  {'Bounding Box 3': df_3, ...}, 
##               ...}

splitdict = {}
for state in list(gdf_rawdata['State'].unique()):
    gdf_slice = gdf_rawdata[gdf_rawdata['State'] == state]
    if state:
        bboxdict = {}
        for bbox in list(gdf_slice['Bounding Box'].unique()):
            bboxdict[bbox] = gdf_slice[gdf_slice['Bounding Box'] == bbox].drop(columns = 'State')
        splitdict[state] = bboxdict
    else:
        df_nobbox = gdf_slice.drop(columns =['Coordinates', 'geometry', 'State'])
        sluglist = df_nobbox['Code'].unique()
        print("Can't find the bounding box file: ", sluglist)
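
The nested layout of `splitdict` (state first, then bounding box) may be easier to see with plain Python objects. This is an illustrative sketch only: the records are made-up tuples, and lists of slugs stand in for the GeoDataFrame slices the notebook stores.

```python
from collections import defaultdict

# Hypothetical records: (slug, state, bounding box string)
records = [
    ('rec-1', 'Minnesota', '-93.33,44.89,-93.19,45.05'),
    ('rec-2', 'Minnesota', '-93.33,44.89,-93.19,45.05'),
    ('rec-3', 'Michigan', '-83.29,42.26,-82.91,42.45'),
]

# Outer key: state; inner key: bounding box; value: records sharing that box
grouped = defaultdict(lambda: defaultdict(list))
for slug, state, bbox in records:
    grouped[state][bbox].append(slug)
```

Records that share a state and a bounding box land in the same inner group, so they are passed to the spatial join together.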

## Part 5: Spatial Join
**<a href='https://geopandas.org/reference/geopandas.sjoin.html#geopandas-sjoin'>`geopandas.sjoin`</a>** provides the following criteria for matching rows:
- intersects 
- within
- contains
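
For axis-aligned bounding boxes, the three criteria reduce to simple coordinate comparisons. The sketch below is a simplified illustration for non-degenerate rectangles given as `(minX, minY, maxX, maxY)` tuples; it is not how geopandas evaluates them, since geopandas works on general geometries. In the notebook the record's box is the left operand and the city or county box is the right operand.

```python
def intersects(a, b):
    """True if boxes a and b share at least one point (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def within(a, b):
    """True if box a lies entirely inside box b."""
    return b[0] <= a[0] and b[1] <= a[1] and a[2] <= b[2] and a[3] <= b[3]

def contains(a, b):
    """True if box a entirely encloses box b (the inverse of within)."""
    return within(b, a)

# Illustrative values: a small record box inside a larger place box
record_box = (1, 1, 2, 2)
place_box = (0, 0, 4, 4)
matches = {
    'intersects': intersects(record_box, place_box),
    'within': within(record_box, place_box),
    'contains': contains(record_box, place_box),
}
```

For these values the record intersects and is within the place but does not contain it. A statewide dataset would show the opposite pattern against a city box: `contains` would be true and `within` false.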

### Step 9: Perform spatial Join on each record

In [None]:
def split_placename(df, level):
    formatlist = []
    for _, row in df.iterrows():
        ## e.g. 'Baltimore County, Baltimore City'
        ## --> ['Baltimore County&Maryland', 'Baltimore City&Maryland']
        if row[level] != 'nan':
            placelist = row[level].split(', ')
            formatname = ', '.join([(i + '&' + row['State']) for i in placelist])  
        ## e.g. 'nan'
        ## --> ['nan']
        else:
            formatname = 'nan'
        formatlist.append(formatname)
    return formatlist

def city_and_county_sjoin(gdf_rawdata, op, state, identifier):
    bboxpath = os.path.join('geojson', state, f'{state}_City_bbox.json')
    gdf_basemap = gpd.read_file(bboxpath)
    ## spatial join
    ## note: geopandas 0.10 and later call this argument `predicate`; `op` is deprecated there
    df_merged = gpd.sjoin(gdf_rawdata, gdf_basemap, op = op, how = 'left')[[identifier, 'City', 'County', 'State']].astype(str)
    # merge column 'City', 'County' into one column 'Pname'
    df_merged['City'] = split_placename(df_merged, 'City')
    df_merged['County'] = split_placename(df_merged, 'County')
    df_merged['Pname'] = df_merged[['City', 'County']].agg(', '.join, axis=1).replace('nan, nan', 'nan')
    # group records by identifier
    df_group = df_merged.drop(columns = ['State']).reset_index(drop = True).groupby(identifier
            )['Pname'].apply(list).reset_index(name = op)
    return df_group

def city_or_county_sjoin(gdf_rawdata, op, state, identifier, level):
    bboxpath = os.path.join('geojson', state, f'{state}_{level}_bbox.json')
    gdf_basemap = gpd.read_file(bboxpath)
    ## spatial join
    ## note: geopandas 0.10 and later call this argument `predicate`; `op` is deprecated there
    df_merged = gpd.sjoin(gdf_rawdata, gdf_basemap, op = op, how = 'left')[[identifier, level, 'State']].astype(str)
    # merge column level and 'State' into one column 'Pname'
    df_merged['Pname'] = df_merged.apply(lambda row: (row[level] + '&' + row['State']) if str(row[level]) != 'nan' else 'nan', axis = 1)
    # group records by identifier
    df_group = df_merged.drop(columns = ['State']).reset_index(drop = True).groupby(identifier
            )['Pname'].apply(list).reset_index(name = op)
    return df_group

### Step 10: Remove duplicates and 'nan' from place name

In [None]:
def remove_nan(row):
    ## e.g. ['nan', 'Minneapolis&Minnesota, Hennepin County&Minnesota', 'Hennepin County&Minnesota']
    ## remove 'nan' and duplicates from list: ['Minneapolis, Minnesota', 'Hennepin County, Minnesota']
    nonan = list(filter(lambda x: x != 'nan', row))
    nodups = list(set(', '.join(nonan).split(', ')))
    result = [i.replace('&', ', ') for i in nodups]
    return result

### Step 11: Fetch the proper bounding box files for the join

In [None]:
def spatial_join(gdf_rawdata, state, identifier):
    dflist = []
    operations = ['intersects', 'within', 'contains']
    for op in operations:
        bboxpath = os.path.join('geojson', state, f'{state}_City_bbox.json')
        
        ## District of Columbia doesn't have a county bounding box file
        if state == 'District of Columbia':
            df_group = city_or_county_sjoin(gdf_rawdata, op, state, identifier, 'City')
            df_group[op] = df_group[op].apply(lambda row: remove_nan(row)) 
        
        ## check if there exists bounding box files
        elif os.path.isfile(bboxpath):
            df_city = city_and_county_sjoin(gdf_rawdata, op, state, identifier)
            df_county = city_or_county_sjoin(gdf_rawdata, op, state, identifier, 'County')
            ## DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
            df_merged = pd.concat([df_city, df_county], ignore_index = True)
            df_group = df_merged.groupby(identifier).agg(lambda row: [', '.join(x) for x in row]).reset_index()
            df_group[op] = df_group[op].apply(lambda row: remove_nan(row))   
       
        ## missing bounding box file    
        else: 
            print('Missing city bounding box file: ', state)
            continue 
                     
        ## replace [''] with None
        df_group[op] = df_group[op].apply(lambda row: None if row == [''] else row)
        dflist.append(df_group)

    ## merge dataframes created by different match options
    df_sjoin = reduce(lambda left,right: pd.merge(left, right, on = identifier, how = 'outer'), dflist)
    
    ## ultimately it returns a dataframe with identifier and placename related to matching operation
    ## e.g. dataframe = {'identifier', 'level', 'intersects'}
    gdf_final =  gdf_rawdata.merge(df_sjoin, on = identifier).drop(columns =['Coordinates', 'geometry'])
    return gdf_final

### Step 12: Merge place names generated by three matching operations to raw data

In [None]:
mergeddf = []
## loop through splitdict based on key 'State'
for state, gdfdict in splitdict.items():
    ## loop through records based on key 'Bounding Box'
    for bbox, gdf_split in gdfdict.items():
        df_comparison = spatial_join(gdf_split, state, 'Slug')
        ## e.g. mergeddf = {'identifier', 'intersects', 'within', 'contains'}
        mergeddf.append(df_comparison)
    
## merge placename columns ['intersects', 'within', 'contains'] to raw data
## (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
gdf_merged = pd.concat(mergeddf).reset_index(drop = True)

## Part 6: Populate place names

### Step 13: Format spatial coverage based on GBL Metadata Template

In [None]:
## e.g. ['Camden County, New Jersey', 'Delaware County, Pennsylvania', 'Philadelphia County, Pennsylvania']
def format_placename(colname):
    inv_map = {}
    plist = []

    ## group place names by state directly, so same-named places in different states are all kept
    ## {'New Jersey': ['Camden County'], 'Pennsylvania': ['Delaware County', 'Philadelphia County']}
    for item in colname:
        place, state = item.rsplit(', ', 1)
        inv_map.setdefault(state, []).append(place)
    
    ## ['Camden County, New Jersey|New Jersey', 'Delaware County, Pennsylvania|Philadelphia County, Pennsylvania|Pennsylvania']
    for k, v in inv_map.items():
        pname = [elem + ', ' + k for elem in v]
        pname.append(k)
        plist.append('|'.join(pname))

    ## Camden County, New Jersey|New Jersey|Delaware County, Pennsylvania|Philadelphia County, Pennsylvania|Pennsylvania
    return '|'.join(plist)

### Step 14: Select spatial coverage based on operations

In [None]:
def populate_placename(df, identifier):
    placenamelist = []
    for _, row in df.iterrows():
        if row['contains'] is None:
            if row['intersects'] is None: 
                placename = ''
            elif row['within'] is None:
                placename = format_placename(row['intersects']) 
            else: 
                placename = format_placename(row['within']) 
        else:
            placename = format_placename(row['contains']) 
        placenamelist.append(placename)
    df['Spatial Coverage'] = placenamelist
    return df.drop(columns = ['intersects', 'within', 'contains'])

df_bbox = populate_placename(gdf_merged, 'Slug')
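
The selection order used above can be summarized as: prefer `contains`, then `within`, then `intersects`, and give up if nothing intersects. The sketch below restates that decision for a single record, with `None` meaning "no match" as in the notebook; `pick_coverage` is a hypothetical name and it returns the chosen list rather than the formatted string.

```python
def pick_coverage(intersects, within, contains):
    """Return the match list to use for one record, or [] if there is none."""
    if contains is not None:
        return contains
    if intersects is None:
        # Nothing intersects, so there is no place name to report
        return []
    return within if within is not None else intersects

# Illustrative place names only
example = pick_coverage(
    intersects=['Hennepin County, Minnesota', 'Ramsey County, Minnesota'],
    within=['Hennepin County, Minnesota'],
    contains=None,
)
```

In this example the `within` list wins, because the record's box sits inside one county even though it also touches a neighbor.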

## Part 7: Write to csv file

In [None]:
## check if there exists data portal from Esri
if len(df_esri):
    df_esri['Spatial Coverage'] = 'United States'
    
dflist = [df_esri, df_bbox, df_wrongcoords]
df_final = pd.concat(filter(len, dflist), ignore_index=True)

df_final.to_csv(newItemscsv, index = False)