# Populating place name based on county bounding box

## Part 1: Introduction
This Jupyter Notebook populates the spatial coverage (place names) of each record based on its bounding box. It is the reverse of **<a href='https://github.com/BTAA-Geospatial-Data-Project/geonames'>geonames</a>**.

## Part 2: Preparation
We will use **Jupyter Notebook (Anaconda 3)** to edit and run the script. Anaconda installation instructions can be found <a href='https://docs.anaconda.com/anaconda/install/'>here</a>. Please note that this script runs on Python 3.

Before running the script, you may need to:
### 1. Run the other Jupyter Notebooks
- If the target state(s) haven't been converted into a city or county bounding box file, you may need to run `city_bbox.ipynb`, `county_bbox.ipynb`, or `merge_geojson.ipynb` first.

### 2. Restructure Directories
- ***sjoin.ipynb***
- ***city_bbox.ipynb***
- ***county_bbox.ipynb***
- ***merge_geojson.ipynb***
- **geojson** folder
    - **state1** folder
        - ***state1_County_bbox.json***
        - ***state1_City_bbox.json***
    - **state2** folder
        - ***state2_County_bbox.json***
        - ***state2_City_bbox.json***
    - ...
- **data** folder
    - **code** folder
        - ***code.csv*** formatted in GBL Metadata Template
        - ***state_bbox.json*** and/or **state1_state2_....json**
        
### 3. Inspect `code.csv`
If the records belong to a regional data portal, you will probably need to create a merged county bounding box file.

The final product is a single CSV file named ***code_placename.csv***.

> Originally created on Jan 31 2021 <br>
> @author: Yijing Zhou @YijingZhou33

## Part 3: Get Started

### Step 1: Import modules

In [1]:
import pandas as pd
import os
import geopandas as gpd
import json
from shapely.geometry import box
import numpy as np
from functools import reduce

### Step 2: Manual items to change

In [2]:
code = 'Merged'

###### Rawdata comes from state data portal -- single state #####

# **************** uncomment **********************
# csvname = 'Maryland'
# gjsoname = 'Maryland_bbox'
# *************************************************

###### Rawdata comes from regional data portal -- multiple states #####

# **************** uncomment **********************
csvname = '4f-01'
gjsoname = 'ML_PA_NJ_DE_bbox'
# *************************************************

### Step 3: Set file path

In [3]:
rawdata = os.path.join('data', code, csvname + '.csv')
basemap = os.path.join('data', code, gjsoname + '.json')
output = os.path.join('data', code, csvname + '_placename.csv')
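
A mistyped `code`, `csvname`, or `gjsoname` only surfaces later, when the files are read. As a rough sketch, and not part of the original notebook, the inputs could be checked up front. `build_paths` and `missing_inputs` are hypothetical helper names introduced here for illustration.

```python
import os


def build_paths(code, csvname, gjsoname):
    """Mirror the notebook's layout: data/<code>/<name>.<ext>."""
    rawdata = os.path.join('data', code, csvname + '.csv')
    basemap = os.path.join('data', code, gjsoname + '.json')
    output = os.path.join('data', code, csvname + '_placename.csv')
    return rawdata, basemap, output


def missing_inputs(paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]


rawdata, basemap, output = build_paths('Merged', '4f-01', 'ML_PA_NJ_DE_bbox')
for path in missing_inputs([rawdata, basemap]):
    print('Missing input file:', path)
```

Only the two inputs are checked, since the output file is created by the last step.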

## Part 4: Build up GeoJSON dataframe

### Step 4: Create bounding boxes for the CSV file

In [4]:
df_csv = pd.read_csv(rawdata)

def format_coordinates(df, identifier):
    ## create regular bounding box coordinate pairs and round them to 2 decimal places
    df = pd.concat([df, df['Bounding Box'].str.split(',', expand=True).astype(float).round(2)], axis=1).rename(
        columns={0:'minX', 1:'minY', 2:'maxX', 3:'maxY'})
    
    ## flag suspicious coordinates: a box spanning more than 10 degrees in either direction, or a western edge beyond -100 degrees longitude
    for _, row in df.iterrows():
        if (row.maxX - row.minX) > 10 or (row.maxY - row.minY) > 10 or (row.minX < -100):
            print('Wrong Coordinates --> Identifier: ', row[identifier])
    
    ## create bounding box geometry
    df['Coordinates'] = df.apply(lambda row: box(row.minX, row.minY, row.maxX, row.maxY), axis=1)
    
    ## clean up unnecessary columns
    return df.drop(columns =['minX', 'minY', 'maxX', 'maxY'])

df_clean = format_coordinates(df_csv, 'layer_slug_s')

Wrong Coordinates --> Identifier:  498997b99f0042a3aa9c4aba8a79a30d_0
Wrong Coordinates --> Identifier:  4d360675145241f691e8d3655de2b287_0
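
To make the parsing and the sanity check easier to follow on a single value, here is a minimal pure-Python sketch of the same two steps. It assumes the `Bounding Box` column holds `minX,minY,maxX,maxY` strings, as the code above implies. `parse_bbox` and `looks_wrong` are hypothetical names, and the sample coordinates are made up for illustration.

```python
def parse_bbox(text):
    """Split 'minX,minY,maxX,maxY' into four floats rounded to 2 decimals."""
    parts = [round(float(p), 2) for p in text.split(',')]
    if len(parts) != 4:
        raise ValueError('expected 4 comma-separated values, got %d' % len(parts))
    return tuple(parts)


def looks_wrong(bbox, max_span=10, min_lon=-100):
    """Same thresholds as the notebook: flag a box spanning more than
    10 degrees in either direction, or with a western edge beyond -100."""
    min_x, min_y, max_x, max_y = bbox
    return (max_x - min_x) > max_span or (max_y - min_y) > max_span or min_x < min_lon


small = parse_bbox('-79.4877,37.8866,-75.0492,39.723')   # illustrative values
world = parse_bbox('-180,-90,180,90')                    # whole-world extent
print(small, looks_wrong(small))
print(world, looks_wrong(world))
```

The thresholds suit the states handled here. A portal covering areas west of -100 degrees longitude would need a different `min_lon`.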


### Step 5: Convert the CSV and GeoJSON files into dataframes

In [5]:
gdf_rawdata = gpd.GeoDataFrame(df_clean, geometry = df_clean['Coordinates'])
gdf_rawdata.crs = 'EPSG:4326'

gdf_county = gpd.read_file(basemap)
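
The join in Part 5 reads `County` and `State` columns from the basemap, so each feature in the bounding box file is assumed to carry those two properties. The layout below is inferred from that usage and is not a dump of an actual file. `bbox_feature` is a hypothetical helper and the coordinates are placeholders.

```python
import json


def bbox_feature(county, state, min_x, min_y, max_x, max_y):
    """Build a GeoJSON Feature whose geometry is the rectangle's outline."""
    ring = [
        [min_x, min_y],
        [max_x, min_y],
        [max_x, max_y],
        [min_x, max_y],
        [min_x, min_y],  # GeoJSON rings repeat the first point to close
    ]
    return {
        'type': 'Feature',
        'properties': {'County': county, 'State': state},
        'geometry': {'type': 'Polygon', 'coordinates': [ring]},
    }


collection = {
    'type': 'FeatureCollection',
    # Placeholder names and coordinates, not real county extents.
    'features': [bbox_feature('Example County', 'Example State', -76.0, 39.0, -75.5, 39.5)],
}
print(json.dumps(collection, indent=2))
```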

## Part 5: Spatial Join
**<a href='https://geopandas.org/reference/geopandas.sjoin.html#geopandas-sjoin'>`geopandas.sjoin`</a>** supports the following criteria for matching rows:
- intersects 
- within
- contains

In [6]:
def spatial_join(identifier):
    dflist = []
    operations = ['intersects', 'within', 'contains']
    for op in operations:
        ## note: geopandas 0.10 and later name this argument `predicate` instead of `op`
        df_merged = gpd.sjoin(gdf_rawdata, gdf_county, op = op, how = 'left')[[identifier, 'County', 'State']].astype(str)
        ## merge column 'County' and 'State' into one 'County, State'
        df_merged['County'] = df_merged[['County', 'State']].agg(', '.join, axis=1).replace('nan, nan', 'nan')
        ## group records by identifier
        df_group = df_merged.drop(columns = ['State']).reset_index(drop = True).groupby(identifier
                    )['County'].apply(list).reset_index(name = op)
        ## replace ['nan'] with None
        df_group[op] = df_group[op].apply(lambda row: None if row[0] == 'nan' else row)
        dflist.append(df_group.rename(columns={'County': op}))
    
    ## merge dataframes created by different match options
    df_sjoin = reduce(lambda left,right: pd.merge(left, right, on = identifier, how = 'outer'), dflist)
    
    return gdf_rawdata.merge(df_sjoin, on = identifier).drop(columns =['Coordinates', 'geometry'])

df_comparison = spatial_join('layer_slug_s')

## Part 6: Populate place names

In [7]:
## e.g. ['Camden County, New Jersey', 'Delaware County, Pennsylvania', 'Philadelphia County, Pennsylvania']
def format_placename(colname):
    inv_map = {}
    plist = []
    
    ## group county names by state, reading each 'County, State' pair directly so that
    ## counties with the same name in different states are not collapsed into one
    ## {'New Jersey': ['Camden County'], 'Pennsylvania': ['Delaware County', 'Philadelphia County']}
    for item in colname:
        k, v = item.split(', ')
        if k not in inv_map.get(v, []):
            inv_map[v] = inv_map.get(v, []) + [k]
    
    ## ['Camden County, New Jersey|New Jersey', 'Delaware County, Pennsylvania|Philadelphia County, Pennsylvania|Pennsylvania']
    for k, v in inv_map.items():
        pname = [elem + ', ' + k for elem in v]
        pname.append(k)
        plist.append('|'.join(pname))

    ## Camden County, New Jersey|New Jersey|Delaware County, Pennsylvania|Philadelphia County, Pennsylvania|Pennsylvania
    return '|'.join(plist)
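
As a worked example of the target format, the sketch below is a self-contained variant of the grouping step. `group_placenames` is a hypothetical name, not the notebook's function. It groups directly by state, so counties that share a name across states (for example Montgomery County in Maryland and in Pennsylvania) are both kept.

```python
def group_placenames(names):
    """Turn ['County, State', ...] into a '|'-joined string in which each
    state's counties are followed by the state itself."""
    by_state = {}
    for name in names:
        county, state = name.split(', ')
        counties = by_state.setdefault(state, [])
        if county not in counties:
            counties.append(county)

    parts = []
    for state, counties in by_state.items():
        parts.extend(county + ', ' + state for county in counties)
        parts.append(state)
    return '|'.join(parts)


print(group_placenames([
    'Camden County, New Jersey',
    'Delaware County, Pennsylvania',
    'Philadelphia County, Pennsylvania',
]))
print(group_placenames([
    'Montgomery County, Maryland',
    'Montgomery County, Pennsylvania',
]))
```

States appear in the order they are first encountered, which relies on dictionaries preserving insertion order (Python 3.7 and later).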

In [8]:
def populate_placename(df, identifier):
    placenamelist = []
    for _, row in df.iterrows():
        if row['within'] is None:
            ## no within feature && <= 5 contain features --> contains features
            if row['contains'] is not None and len(row['contains']) < 6:
                placename = format_placename(row['contains'])
            ## no intersect, within (and contains) feature --> wrong coordinates
            elif row['intersects'] is None:
                placename = ''
            ## otherwise (no contains features, or more than 5 of them) --> state names only
            else:
                states = set(item.rsplit(', ', 1)[1] for item in row['intersects'])
                placename = '|'.join(sorted(states))
        else:
            ## otherwise, within features
            placename = format_placename(row['within'])
        placenamelist.append(placename)
    
    df['Place Name'] = placenamelist
    df_final = df.drop(columns = ['intersects', 'within', 'contains'])
    df_final.to_csv(output, index = False)

populate_placename(df_comparison, 'layer_slug_s')
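
The branching in `populate_placename` can be read as a preference order among the three match lists. The sketch below isolates just that decision so it can be tried on small inputs. `choose_source` is a hypothetical helper, and the county and state names passed to it are placeholders.

```python
def choose_source(within, contains, intersects, max_contains=5):
    """Return which match list supplies the place name, or '' when the
    record matches no county at all (likely wrong coordinates)."""
    if within is not None:
        return 'within'
    if contains is not None and len(contains) <= max_contains:
        return 'contains'
    if intersects is None:
        return ''
    return 'states of intersects'


# A record inside one county box: use the county it falls within.
print(choose_source(['A County, X'], None, ['A County, X']))
# A record covering two county boxes: list those counties.
print(choose_source(None, ['A County, X', 'B County, X'], ['A County, X', 'B County, X']))
# A record covering six county boxes: too many to list, fall back to states.
print(choose_source(None, ['C%d County, X' % i for i in range(6)], ['C0 County, X']))
# A record matching nothing.
print(repr(choose_source(None, None, None)))
```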