# Countries

**[Work in progress]**

This notebook creates a .csv file with country information for ingestion into the Knowledge Graph.

Data source: [GeoNames.org](https://download.geonames.org/export/dump/)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
from pathlib import Path
import pandas as pd

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_HOME = Path(os.getenv('NEO4J_HOME'))
print(NEO4J_HOME)

/Users/peter/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-4af96121-2328-4e2f-ba60-6d8b728a26d5/installation-4.0.3
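
If `NEO4J_HOME` is not set, `os.getenv` returns `None` and `Path(None)` raises a `TypeError` that does not name the variable. The following is a minimal sketch of a guard with a clearer error message; `neo4j_home` is a hypothetical helper, not part of this notebook.

```python
import os
from pathlib import Path


def neo4j_home(env=os.environ):
    """Return NEO4J_HOME as a Path (hypothetical helper, not in the notebook)."""
    value = env.get('NEO4J_HOME')
    if not value:
        # Fail early with a message that names the missing variable.
        raise RuntimeError('NEO4J_HOME is not set; point it at the Neo4j installation directory')
    return Path(value)
```

In the notebook this would be used as `NEO4J_HOME = neo4j_home()`.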


### Create countries

In [4]:
country_url = 'https://download.geonames.org/export/dump/countryInfo.txt'

In [5]:
names = ['ISO','ISO3','ISO-Numeric','fips','Country','Capital','Area(in sq km)','Population',
         'Continent','tld','CurrencyCode','CurrencyName','Phone','Postal Code Format',
         'Postal Code Regex','Languages','geonameid','neighbours','EquivalentFipsCode'
        ]

In [6]:
countries = pd.read_csv(country_url, sep='\t', comment='#', dtype='str', names=names)

### Add missing data
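TODO: North American countries and island nations are missing the Continent data. A likely cause is that pandas reads the continent code `NA` as a missing value by default.

Add the missing ISO code for Namibia (`NA`).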

In [7]:
index = countries.query("ISO3 == 'NAM'").index  # row label(s) for Namibia
countries.loc[index, 'ISO'] = 'NA'  # .loc accepts an Index of labels; .at expects a single label
countries.head()

Unnamed: 0,ISO,ISO3,ISO-Numeric,fips,Country,Capital,Area(in sq km),Population,Continent,tld,CurrencyCode,CurrencyName,Phone,Postal Code Format,Postal Code Regex,Languages,geonameid,neighbours,EquivalentFipsCode
0,AD,AND,20,AN,Andorra,Andorra la Vella,468,77006,EU,.ad,EUR,Euro,376,AD,,,,,
1,AE,ARE,784,AE,United Arab Emirates,Abu Dhabi,82880,9630959,AS,.ae,AED,Dirham,971,,,"ar-AE,fa,en,hi,ur",290557.0,"SA,OM",
2,AF,AFG,4,AF,Afghanistan,Kabul,647500,37172386,AS,.af,AFN,Afghani,93,,,"fa-AF,ps,uz-AF,tk",1149361.0,"TM,CN,IR,TJ,PK,UZ",
3,AG,ATG,28,AC,Antigua and Barbuda,St. John's,443,96286,,.ag,XCD,Dollar,+1-268,,,en-AG,3576396.0,,
4,AI,AIA,660,AV,Anguilla,The Valley,102,13254,,.ai,XCD,Dollar,+1-264,,,en-AI,3573511.0,,
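
The empty `ISO` value for Namibia and the empty `Continent` values seen above are consistent with how `pd.read_csv` handles the string `NA`: by default it is treated as a missing value, even with `dtype='str'`. The sketch below shows this on two illustrative rows reduced to three columns, and shows that passing `keep_default_na=False` with `na_values=['']` keeps `NA` as text while still treating empty fields as missing. It is offered as a possible approach, not as the fix the notebook uses.

```python
import io

import pandas as pd

# Two illustrative rows in the tab-separated layout, reduced to three columns.
sample = "NA\tNAM\tAF\nAG\tATG\tNA\n"
cols = ['ISO', 'ISO3', 'Continent']

# Default parsing: the string 'NA' is converted to a missing value.
default = pd.read_csv(io.StringIO(sample), sep='\t', dtype='str', names=cols)

# Only empty fields count as missing; the string 'NA' is kept as text.
literal = pd.read_csv(io.StringIO(sample), sep='\t', dtype='str', names=cols,
                      keep_default_na=False, na_values=[''])

print(default.isna().sum().to_dict())
print(literal.isna().sum().to_dict())
```

With this option both the Namibia ISO code and the North American continent code would survive parsing, so the manual patch would no longer be needed.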


### Standardize column names for Knowledge Graph
* id: unique identifier for country
* name: name of node
* parentId: unique identifier for continent
* properties: camelCase

In [8]:
countries['id'] = countries['ISO']  # standard id column to link nodes
# rename source columns to camelCase property names
countries = countries.rename(columns={
    'ISO': 'iso',
    'ISO3': 'iso3',
    'Country': 'name',
    'Population': 'population',
    'Area(in sq km)': 'areaSqKm',
    'Continent': 'parentId',
})
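
Because `id` is meant to be a unique identifier, a quick check before export can catch missing or duplicated codes. The following is a minimal sketch on a toy frame; `check_ids` is a hypothetical helper and the toy rows are made up for illustration.

```python
import pandas as pd


def check_ids(df, id_col='id'):
    """Return rows whose id is missing, empty, or duplicated (hypothetical helper)."""
    ids = df[id_col]
    missing = ids.fillna('').eq('')                      # NaN/None or empty string
    duplicated = ids.duplicated(keep=False) & ~missing   # every copy of a repeated id
    return df[missing | duplicated]


# Toy data: one missing id and one repeated id.
toy = pd.DataFrame({
    'id': ['AD', 'AE', None, 'AE'],
    'name': ['Andorra', 'United Arab Emirates', 'Namibia', 'Duplicate row'],
})
problems = check_ids(toy)
print(problems)
```

In the notebook this could be called as `check_ids(countries)`; an empty result would indicate that every country has a unique, non-empty `id`.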

### Export a minimum subset for now

In [9]:
countries = countries[['id','name','iso','iso3','population','areaSqKm','parentId']]
countries = countries.fillna('')

In [10]:
countries.head(1000)

Unnamed: 0,id,name,iso,iso3,population,areaSqKm,parentId
0,AD,Andorra,AD,AND,77006,468.0,EU
1,AE,United Arab Emirates,AE,ARE,9630959,82880.0,AS
2,AF,Afghanistan,AF,AFG,37172386,647500.0,AS
3,AG,Antigua and Barbuda,AG,ATG,96286,443.0,
4,AI,Anguilla,AI,AIA,13254,102.0,
5,AL,Albania,AL,ALB,2866376,28748.0,EU
6,AM,Armenia,AM,ARM,2951776,29800.0,AS
7,AO,Angola,AO,AGO,30809762,1246700.0,AF
8,AQ,Antarctica,AQ,ATA,0,14000000.0,AN
9,AR,Argentina,AR,ARG,44494502,2766890.0,SA


In [11]:
countries.to_csv(NEO4J_HOME / "import/00e-GeoNamesCountry.csv", index=False)