# UN Regions

**[Work in progress]**

This notebook creates .csv files with UN geographic region, subregion, and intermediate region information for ingestion into the Knowledge Graph.

Data source: [Statistics Division of the United Nations Secretariat](https://unstats.un.org/unsd/methodology/m49/)

Data set: [M49](https://unstats.un.org/unsd/methodology/m49/overview)

Data preparation: The Excel file downloaded from the UN has features that are not compatible with pandas. To work around this, download the Excel file, open it in Excel, save it again in the .xlsx format, and place it at ../../reference_data/UNSDMethodology.xlsx

Authors: Braden Riggs (bdriggs@ucsd.edu), Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
from pathlib import Path
import pandas as pd

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import
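
If `NEO4J_IMPORT` is not set, `os.getenv` returns `None` and `Path(None)` raises a `TypeError` that does not name the missing variable. The following is a minimal sketch of a more explicit lookup; the helper name `import_dir` and the example path are hypothetical and not part of this notebook.

```python
import os
from pathlib import Path


def import_dir(env=os.environ, name='NEO4J_IMPORT'):
    """Return the import directory as a Path, or raise a readable error."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return Path(value)


# Illustration with an explicit mapping instead of the real environment
print(import_dir({'NEO4J_IMPORT': '/tmp/neo4j/import'}))
```

Passing the environment as an argument keeps the helper easy to try out without changing the real environment.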


### UN regions, subregions, and intermediate regions

In [4]:
df = pd.read_excel("../../reference_data/UNSDMethodology.xlsx", dtype='str')
df = df[['Region Name', 'Region Code', 'Sub-region Name', 'Sub-region Code', 
         'Intermediate Region Name', 'Intermediate Region Code', 'ISO-alpha3 Code']]
df = df.fillna('')
# For now, exclude entries without an ISO code (Channel Islands)
df = df.query("`ISO-alpha3 Code` != ''")
# Exclude entries without a region name (Antarctica)
df = df.query("`Region Name` != ''")
df.head()

Unnamed: 0,Region Name,Region Code,Sub-region Name,Sub-region Code,Intermediate Region Name,Intermediate Region Code,ISO-alpha3 Code
0,Africa,2,Northern Africa,15,,,DZA
1,Africa,2,Northern Africa,15,,,EGY
2,Africa,2,Northern Africa,15,,,LBY
3,Africa,2,Northern Africa,15,,,MAR
4,Africa,2,Northern Africa,15,,,SDN
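
Because the workbook is re-saved by hand before it is loaded, it may help to confirm that the columns selected above are present. The sketch below assumes only the column names used in this notebook; the helper `missing_columns` and the stand-in frame `sample` are hypothetical.

```python
import pandas as pd

# Columns this notebook selects from the M49 workbook
REQUIRED_COLUMNS = [
    'Region Name', 'Region Code', 'Sub-region Name', 'Sub-region Code',
    'Intermediate Region Name', 'Intermediate Region Code', 'ISO-alpha3 Code',
]


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from frame."""
    return [name for name in required if name not in frame.columns]


# Stand-in for the frame returned by pd.read_excel(...)
sample = pd.DataFrame(columns=['Region Name', 'Region Code', 'ISO-alpha3 Code'])
print(missing_columns(sample))
```

An empty list means all required columns were found.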


Assign names without spaces

In [5]:
df = df.rename(columns={
    'Region Name': 'UNRegion',
    'Region Code': 'UNRegionCode',
    'Sub-region Name': 'UNSubRegion',
    'Sub-region Code': 'UNSubRegionCode',
    'Intermediate Region Name': 'UNIntermediateRegion',
    'Intermediate Region Code': 'UNIntermediateRegionCode',
    'ISO-alpha3 Code': 'iso3',
})

### Assign unique identifiers
Use `m49:` as a prefix for codes from the M49 standard of the Statistics Division of the United Nations Secretariat.

In [6]:
df['UNRegionCode'] = 'm49:' + df['UNRegionCode']
df['UNSubRegionCode'] = 'm49:' + df['UNSubRegionCode'] 
df['UNIntermediateRegionCode'] = 'm49:' + df['UNIntermediateRegionCode'] 

In [7]:
df.head()

Unnamed: 0,UNRegion,UNRegionCode,UNSubRegion,UNSubRegionCode,UNIntermediateRegion,UNIntermediateRegionCode,iso3
0,Africa,m49:2,Northern Africa,m49:15,,m49:,DZA
1,Africa,m49:2,Northern Africa,m49:15,,m49:,EGY
2,Africa,m49:2,Northern Africa,m49:15,,m49:,LBY
3,Africa,m49:2,Northern Africa,m49:15,,m49:,MAR
4,Africa,m49:2,Northern Africa,m49:15,,m49:,SDN
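
In the output above, rows without an intermediate region get the bare prefix `m49:` as their code, while the hand-curated additions leave that field empty. If empty codes should stay empty, one possible approach is to prefix only the non-empty values. This is a sketch, not the notebook's method, and the helper name `add_prefix` is hypothetical.

```python
import pandas as pd


def add_prefix(codes, prefix='m49:'):
    """Prefix non-empty codes; leave empty strings unchanged."""
    # where() keeps values where the condition holds (empty strings)
    # and takes the prefixed value everywhere else
    return codes.where(codes == '', prefix + codes)


codes = pd.Series(['14', '', '202'])
print(add_prefix(codes).tolist())  # ['m49:14', '', 'm49:202']
```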


### Add missing region information (from hand-curated list)

In [8]:
additions = pd.read_csv("../../reference_data/UNRegionAdditions.csv")
additions.fillna('', inplace=True)

In [9]:
additions.tail(10)

Unnamed: 0,UNRegion,UNRegionCode,UNSubRegion,UNSubRegionCode,UNIntermediateRegion,UNIntermediateRegionCode,iso3
0,Antarctica,m49:Antarctica,,,,,ATA
1,Americas,m49:19,Latin America and the Caribbean,m49:419,,,BES
2,Americas,m49:19,Latin America and the Caribbean,m49:419,,,ANT
3,Asia,m49:142,Eastern Asia,m49:30,,,HKG
4,Asia,m49:142,Eastern Asia,m49:30,,,MAC
5,Asia,m49:142,Eastern Asia,m49:30,,,TWN
6,Europe,m49:150,Southern Europe,m49:39,,,XKX
7,Europe,m49:150,Southern Europe,m49:39,,,CSG


In [10]:
# DataFrame.append was removed in pandas 2.0; pd.concat keeps the original row indices
df = pd.concat([df, additions])
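
Since the additions are curated by hand, an `iso3` code could end up in both tables. The sketch below shows one way to list such duplicates after the tables are combined; the frames `main` and `extra` hold made-up rows for illustration and are not the notebook's data.

```python
import pandas as pd

# Made-up rows: CHN appears in both frames on purpose
main = pd.DataFrame({'UNRegion': ['Africa', 'Asia'], 'iso3': ['DZA', 'CHN']})
extra = pd.DataFrame({'UNRegion': ['Asia', 'Asia'], 'iso3': ['HKG', 'CHN']})

combined = pd.concat([main, extra])

# keep=False marks every occurrence of a repeated code, not only the later ones
duplicated = combined.loc[combined['iso3'].duplicated(keep=False), 'iso3']
print(sorted(duplicated.unique()))  # ['CHN']
```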

In [11]:
df.to_csv(NEO4J_IMPORT / "00k-UNAll.csv", index=False)  

### Save region assignments in separate files
This is done so iso3 country codes can be linked to the lowest level in the UN region hierarchy.
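
The split can be pictured as assigning each country to the lowest level that is filled in for it. The sketch below uses three rows taken from the tables in this notebook; the helper `lowest_level` and the `level` column are hypothetical and only illustrate the rule.

```python
import pandas as pd


def lowest_level(row):
    """Name the lowest non-empty level of the region hierarchy for a row."""
    if row['UNIntermediateRegion'] != '':
        return 'UNIntermediateRegion'
    if row['UNSubRegion'] != '':
        return 'UNSubRegion'
    return 'UNRegion'


sample = pd.DataFrame({
    'UNRegion': ['Africa', 'Africa', 'Antarctica'],
    'UNSubRegion': ['Sub-Saharan Africa', 'Northern Africa', ''],
    'UNIntermediateRegion': ['Eastern Africa', '', ''],
    'iso3': ['BDI', 'DZA', 'ATA'],
})
sample['level'] = sample.apply(lowest_level, axis=1)
print(sample[['iso3', 'level']].to_string(index=False))
```

Here BDI is assigned to the intermediate-region level, DZA to the subregion level, and ATA to the region level.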

In [12]:
intermediateRegion = df[df['UNIntermediateRegion'] != '']
intermediateRegion.to_csv(NEO4J_IMPORT / "00k-UNIntermediateRegion.csv", index=False)                            

In [13]:
intermediateRegion.head()

Unnamed: 0,UNRegion,UNRegionCode,UNSubRegion,UNSubRegionCode,UNIntermediateRegion,UNIntermediateRegionCode,iso3
7,Africa,m49:2,Sub-Saharan Africa,m49:202,Eastern Africa,m49:14,IOT
8,Africa,m49:2,Sub-Saharan Africa,m49:202,Eastern Africa,m49:14,BDI
9,Africa,m49:2,Sub-Saharan Africa,m49:202,Eastern Africa,m49:14,COM
10,Africa,m49:2,Sub-Saharan Africa,m49:202,Eastern Africa,m49:14,DJI
11,Africa,m49:2,Sub-Saharan Africa,m49:202,Eastern Africa,m49:14,ERI


In [14]:
subRegion = df[(df['UNSubRegion'] != '') & (df['UNIntermediateRegion'] == '')]
subRegion.to_csv(NEO4J_IMPORT / "00k-UNSubRegion.csv", index=False)  

In [15]:
subRegion.head()

Unnamed: 0,UNRegion,UNRegionCode,UNSubRegion,UNSubRegionCode,UNIntermediateRegion,UNIntermediateRegionCode,iso3
0,Africa,m49:2,Northern Africa,m49:15,,m49:,DZA
1,Africa,m49:2,Northern Africa,m49:15,,m49:,EGY
2,Africa,m49:2,Northern Africa,m49:15,,m49:,LBY
3,Africa,m49:2,Northern Africa,m49:15,,m49:,MAR
4,Africa,m49:2,Northern Africa,m49:15,,m49:,SDN


In [16]:
region = df[(df['UNSubRegion'] == '') & (df['UNIntermediateRegion'] == '')]
region.to_csv(NEO4J_IMPORT / "00k-UNRegion.csv", index=False)  

In [17]:
region.head()

Unnamed: 0,UNRegion,UNRegionCode,UNSubRegion,UNSubRegionCode,UNIntermediateRegion,UNIntermediateRegionCode,iso3
0,Antarctica,m49:Antarctica,,,,,ATA
