# Second Administrative Divisions of Countries

**[Work in progress]**

This notebook creates a .csv file of second-level administrative divisions (counties in the US) for ingestion into the Knowledge Graph.

Data source: [GeoNames.org](https://download.geonames.org/export/dump/)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
from pathlib import Path
import pandas as pd

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-b9d10363-6d59-4deb-9595-2cb904a99d1d/installation-4.1.0/import
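
`Path(os.getenv('NEO4J_IMPORT'))` raises a `TypeError` when the environment variable is not set, because `os.getenv` then returns `None`. A minimal sketch of a more defensive lookup follows; the helper name `resolve_import_dir`, the `fallback` parameter, and the demo variable name are hypothetical and not part of the original notebook.

```python
import os
import tempfile
from pathlib import Path


def resolve_import_dir(env_var="NEO4J_IMPORT", fallback=None):
    """Return the import directory as a Path, or raise a clear error."""
    value = os.getenv(env_var)
    if value is None:
        if fallback is None:
            raise RuntimeError(f"Environment variable {env_var} is not set")
        value = fallback  # assumption: caller supplies a usable directory
    return Path(value)


# Demo with an unset variable and a throwaway fallback directory.
demo_dir = resolve_import_dir("NEO4J_IMPORT_DEMO_UNSET", fallback=tempfile.mkdtemp())
print(demo_dir)
```

In the notebook itself, `NEO4J_IMPORT = resolve_import_dir()` would keep the current behavior when the variable is set and fail with a readable message otherwise.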


### Create admin2

In [4]:
admin2_url = 'https://download.geonames.org/export/dump/admin2Codes.txt'

In [5]:
names = ['code', 'name', 'name_ascii', 'geonameid']

In [6]:
admin2 = pd.read_csv(admin2_url, sep='\t', dtype='str', names=names)
admin2 = admin2[['code', 'name_ascii', 'geonameid']]
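
To illustrate what the read step produces without downloading the file, the sketch below parses two in-memory rows in the same tab-separated, header-less layout. The ids, names, and geonameIds are taken from outputs shown later in this notebook; treating the `name` and `name_ascii` columns as identical for these rows is an assumption.

```python
import io

import pandas as pd

# Two sample rows in the layout of admin2Codes.txt: code, name, ASCII name, geonameid.
sample = (
    "US.CA.073\tSan Diego County\tSan Diego County\t5391832\n"
    "US.DC.001\tWashington County\tWashington County\t4140987\n"
)
names = ["code", "name", "name_ascii", "geonameid"]

# dtype="str" keeps identifiers such as geonameid as text rather than integers.
read_demo = pd.read_csv(io.StringIO(sample), sep="\t", dtype="str", names=names)
read_demo = read_demo[["code", "name_ascii", "geonameid"]]
print(read_demo)
```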

### Standardize column names for Knowledge Graph
* id: unique identifier for the second administrative division
* name: name of node
* parentId: unique identifier for the parent first administrative division
* properties: camelCase

In [7]:
# 'id' is the standard column used to link nodes; other properties use camelCase
admin2.rename(columns={'code': 'id', 'name_ascii': 'name', 'geonameid': 'geonameId'}, inplace=True)
# parent (first administrative division) is the id without its last component;
# n must be passed by keyword in recent pandas versions
admin2['parentId'] = admin2['id'].str.rsplit('.', n=1, expand=True)[0]
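
The `parentId` is the `id` with its last dot-separated component removed, so `US.CA.073` maps to `US.CA`. A small self-contained sketch of that step on a toy frame follows (the frame `parent_demo` is illustrative only). Note that `n=1` is passed by keyword, since pandas 2.x no longer accepts it positionally in `str.rsplit`.

```python
import pandas as pd

parent_demo = pd.DataFrame({"id": ["US.CA.073", "US.DC.001"]})

# Split once from the right; column 0 holds everything before the last dot.
parent_demo["parentId"] = parent_demo["id"].str.rsplit(".", n=1, expand=True)[0]
print(parent_demo)
```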

#### Rename counties to match US Census naming conventions

In [8]:
admin2.query("id == 'US.DC.001'")

Unnamed: 0,id,name,geonameId,parentId
39787,US.DC.001,Washington County,4140987,US.DC


In [9]:
admin2.loc[admin2['id'] == 'US.DC.001', 'name'] = 'District of Columbia'

In [10]:
admin2.query("id == 'US.DC.001'")

Unnamed: 0,id,name,geonameId,parentId
39787,US.DC.001,District of Columbia,4140987,US.DC


In [11]:
admin2.query("id == 'US.CA.075'") # San Francisco

Unnamed: 0,id,name,geonameId,parentId
42198,US.CA.075,City and County of San Francisco,5391997,US.CA


In [12]:
admin2.loc[admin2['id'] == 'US.CA.075', 'name'] = 'San Francisco'

In [13]:
admin2.query("id == 'US.CA.075'") # San Francisco

Unnamed: 0,id,name,geonameId,parentId
42198,US.CA.075,San Francisco,5391997,US.CA
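
If more names need adjusting, the individual `.loc` assignments above could be collected into one lookup table. The following is a sketch of that alternative, not the notebook's method; the dictionary `NAME_OVERRIDES` and the toy frame `rename_demo` are hypothetical, and the two overrides are the ones applied above.

```python
import pandas as pd

# Hypothetical lookup: admin2 id -> name used by the US Census.
NAME_OVERRIDES = {
    "US.DC.001": "District of Columbia",
    "US.CA.075": "San Francisco",
}

rename_demo = pd.DataFrame({
    "id": ["US.DC.001", "US.CA.075", "US.CA.073"],
    "name": [
        "Washington County",
        "City and County of San Francisco",
        "San Diego County",
    ],
})

# map() yields NaN for ids without an override; fillna() keeps the original name there.
rename_demo["name"] = rename_demo["id"].map(NAME_OVERRIDES).fillna(rename_demo["name"])
print(rename_demo)
```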


### Example

In [14]:
admin2.query("name == 'San Diego County'")

Unnamed: 0,id,name,geonameId,parentId
42197,US.CA.073,San Diego County,5391832,US.CA


In [15]:
# Number of US counties
admin2[admin2['id'].str.startswith('US.')].shape

(3142, 4)

### Export a minimum subset for now

In [16]:
admin2.to_csv(NEO4J_IMPORT / "00g-GeoNamesAdmin2.csv", index=False)
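
Because `id` and `parentId` are used to link nodes, a quick consistency check before the export cell can catch problems early. The helper `check_nodes` below is a hypothetical sketch under the assumption that every `parentId` should be a dot-terminated prefix of its `id`; it is not part of the original workflow.

```python
import pandas as pd


def check_nodes(df):
    """Return a list of problems found in a node table (empty if none)."""
    problems = []
    if df["id"].isna().any():
        problems.append("missing id")
    if df["id"].duplicated().any():
        problems.append("duplicate id")
    if df["name"].isna().any():
        problems.append("missing name")
    # Each parentId is expected to be the id without its last component.
    bad_parent = [
        not str(i).startswith(str(p) + ".")
        for i, p in zip(df["id"], df["parentId"])
    ]
    if any(bad_parent):
        problems.append("parentId is not a prefix of id")
    return problems


good = pd.DataFrame({
    "id": ["US.CA.073", "US.CA.075"],
    "name": ["San Diego County", "San Francisco"],
    "parentId": ["US.CA", "US.CA"],
})
bad = pd.DataFrame({
    "id": ["US.CA.073", "US.CA.073"],
    "name": ["San Diego County", None],
    "parentId": ["US.CA", "US.DC"],
})
print(check_nodes(good))
print(check_nodes(bad))
```

Running `check_nodes(admin2)` before `to_csv` and inspecting the returned list is one way to use it.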