# Downloads population data from the UN
**[Work in progress]**

This notebook standardizes location information for UN population data.

Data source: [United Nations, Department of Economic and Social Affairs, Population Division (2019). World Population Prospects 2019, Online Edition. Rev. 1.](https://population.un.org/wpp/Download/Standard/Population/)

Author: Peter Rose (pwrose@ucsd.edu)

In [82]:
import os
import pandas as pd
from pathlib import Path
from py2neo import Graph

In [83]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [84]:
NEO4J_HOME = Path(os.getenv('NEO4J_HOME'))
print(NEO4J_HOME)

/Users/peter/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-4af96121-2328-4e2f-ba60-6d8b728a26d5/installation-4.0.3
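
If `NEO4J_HOME` is not set, `os.getenv` returns `None` and `Path(None)` fails with a `TypeError` that does not name the missing variable. A minimal sketch of a more explicit check follows; the helper name `get_neo4j_home` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def get_neo4j_home(env_var: str = "NEO4J_HOME") -> Path:
    """Return the directory named by an environment variable as a Path."""
    value = os.getenv(env_var)
    if not value:
        # Fail early with a message that names the missing variable.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return Path(value)
```

Calling `get_neo4j_home()` in place of `Path(os.getenv('NEO4J_HOME'))` should give the same result when the variable is set.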


## Standardize Location data for SARS-CoV-2 Strain metadata

TODO: Replace this code with a general solution.

The cells below are a temporary workaround.

In [85]:
male_url = 'https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/EXCEL_FILES/1_Population/WPP2019_POP_F15_2_ANNUAL_POPULATION_BY_AGE_MALE.xlsx'

In [86]:
df = pd.read_excel(male_url, dtype='str', skiprows=16)
df.fillna('', inplace=True)

In [87]:
df.rename(columns={'Country code': 'isoNumeric'}, inplace=True)

In [88]:
df.head()

Unnamed: 0,Index,Variant,"Region, subregion, country or area *",Notes,isoNumeric,Type,Parent code,Reference date (as of 1 July),0-4,5-9,10-14,15-19,20-24,25-29,30-34,35-39,40-44,45-49,50-54,55-59,60-64,65-69,70-74,75-79,80-84,85-89,90-94,95-99,100+
0,1,Estimates,WORLD,,900,World,0,1950,172419.832,138298.389,133685.702,122155.285,113206.517,97625.681,84569.5590000001,81447.2,73162.818,63547.845,52643.9989999999,42559.548,34381.408,25077.247,16576.97,9343.682,3876.719,1296.646,316.51,58.606,9.393
1,2,Estimates,WORLD,,900,World,0,1951,182257.02,140812.12,133897.4,123595.257,114626.719,100104.291,85644.043,81437.439,74081.015,64629.447,53677.167,43119.836,34589.36,25395.97,16800.706,9605.701,4141.625,1394.066,358.4,61.627,8.429
2,3,Estimates,WORLD,,900,World,0,1952,191290.45,144511.618,133510.8,125446.329,115482.531,102631.058,87211.526,81001.792,75067.611,65569.945,54802.122,43804.974,34754.609,25746.934,16977.534,9777.497,4359.072,1441.701,387.573,71.241,7.648
3,4,Estimates,WORLD,,900,World,0,1953,198555.919,149495.972,133041.118,127524.369,116111.024,105154.32,89278.45,80512.181,76086.62,66449.397,56051.334,44642.202,34985.045,26128.81,17137.876,9878.303,4496.103,1441.25,397.749,77.735,7.00899999999999
4,5,Estimates,WORLD,,900,World,0,1954,203835.966,155321.418,133291.077,129331.575,116983.485,107497.541,91672.6389999999,80472.414,76961.108,67381.319,57384.959,45643.154,35389.557,26516.49,17318.753,9928.194,4531.149,1401.163,373.078,72.312,6.48299999999999


In [93]:
df = df.query("Type == 'Country/Area'")
df = df.query("`Reference date (as of 1 July)` == '2020'")
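
The two filters could also be combined into one parameterized helper, which may be a small step toward the general solution mentioned in the TODO. This is only a sketch: the function name `select_countries` and the toy rows are hypothetical, and all values are assumed to be strings, as they are after `read_excel(..., dtype='str')`.

```python
import pandas as pd

YEAR_COLUMN = "Reference date (as of 1 July)"


def select_countries(df: pd.DataFrame, year: str = "2020") -> pd.DataFrame:
    """Keep rows of type 'Country/Area' for a single reference year."""
    mask = (df["Type"] == "Country/Area") & (df[YEAR_COLUMN] == year)
    # .copy() returns an independent frame, so later edits do not touch df.
    return df.loc[mask].copy()


# Toy rows (made up for illustration) that mimic the layout of the UN table.
toy = pd.DataFrame({
    "isoNumeric": ["900", "480", "480"],
    "Type": ["World", "Country/Area", "Country/Area"],
    YEAR_COLUMN: ["2020", "2019", "2020"],
})
selected = select_countries(toy)
print(selected)
```

Only the Mauritius row for 2020 should remain in `selected`; the aggregate `World` row and the 2019 row are dropped.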

In [100]:
df.query("isoNumeric == '480'")

Unnamed: 0,Index,Variant,"Region, subregion, country or area *",Notes,isoNumeric,Type,Parent code,Reference date (as of 1 July),0-4,5-9,10-14,15-19,20-24,25-29,30-34,35-39,40-44,45-49,50-54,55-59,60-64,65-69,70-74,75-79,80-84,85-89,90-94,95-99,100+
2484,2485,Estimates,Mauritius,1,480,Country/Area,910,2020,32.65,35.076,41.058,48.18,47.498,51.611,44.292,43.654,50.064,41.592,42.44,44.276,36.211,29.369,19.771,10.218,5.935,2.625,0.776,0.165,0.019


In [94]:
pop = df.iloc[:,8:29].astype(float)
pop = pop * 1000
pop = pop.round().astype(int)
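
The source table reports population in thousands, so the cell above multiplies by 1000 and rounds to whole persons. Selecting the age-group columns by position (`8:29`) depends on the exact column order. As a hedged alternative, the sketch below selects them by label instead; the helper name `to_persons` is hypothetical, and the toy row reuses three values from the Mauritius output shown earlier.

```python
import pandas as pd


def to_persons(df: pd.DataFrame, first: str = "0-4", last: str = "100+") -> pd.DataFrame:
    """Convert age-group columns from thousands (as strings) to whole persons."""
    # Label-based slicing with .loc includes both end points.
    age_groups = df.loc[:, first:last].astype(float)
    return (age_groups * 1000).round().astype(int)


# Toy frame with only three of the 21 age groups.
toy = pd.DataFrame({
    "isoNumeric": ["480"],
    "0-4": ["32.65"],
    "5-9": ["35.076"],
    "100+": ["0.019"],
})
persons = to_persons(toy)
print(persons)
```

Rounding before the integer cast matters because products such as `32.65 * 1000` can carry small floating-point errors that a plain cast would truncate.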

In [95]:
pop.head()

Unnamed: 0,0-4,5-9,10-14,15-19,20-24,25-29,30-34,35-39,40-44,45-49,50-54,55-59,60-64,65-69,70-74,75-79,80-84,85-89,90-94,95-99,100+
1916,1037420,916098,756835,608958,519132,488832,432923,328904,219323,142566,119690,108502,95127,64790,31125,17232,8578,3114,645,67,3
1987,62913,58084,51713,45520,40399,36124,32215,27102,21440,17198,13988,11192,8461,5717,3307,1975,939,300,60,6,0
2058,50198,49542,48619,49783,48910,46883,45168,38975,35331,29285,21801,18653,13473,8806,6826,3997,1892,680,153,17,1
2129,252785,243666,248424,190468,143318,144383,135279,95791,68861,68077,42518,40814,32998,26962,20998,12969,6408,2203,530,89,7
2200,8520306,7720508,6999073,6543197,5930683,4889739,3761349,3091148,2445523,2071480,1567789,1159002,946594,735747,539874,340207,168831,67383,16275,1946,181


In [101]:
pop_all = pd.concat([df['isoNumeric'], pop], axis=1)

In [103]:
pop_all.shape

(201, 22)
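
The shape confirms 201 rows and 22 columns (the `isoNumeric` code plus 21 age groups). A further sanity check, sketched below under the assumption that each code should appear exactly once, verifies that no code is empty or duplicated and sums the age groups per code. The helper name `check_population_table` and the second toy row (code `999`) are made up for illustration.

```python
import pandas as pd


def check_population_table(pop_all: pd.DataFrame, key: str = "isoNumeric") -> pd.Series:
    """Validate the key column and return the sum over age groups per key."""
    if pop_all[key].eq("").any():
        raise ValueError(f"Empty value found in column {key}")
    if pop_all[key].duplicated().any():
        raise ValueError(f"Duplicate value found in column {key}")
    # Sum every column except the key, row by row.
    totals = pop_all.drop(columns=key).sum(axis=1)
    return pd.Series(totals.values, index=pop_all[key], name="total")


# Toy table with two age groups; the first row uses Mauritius values from above.
toy_pop = pd.DataFrame({
    "isoNumeric": ["480", "999"],
    "0-4": [32650, 100],
    "5-9": [35076, 200],
})
totals = check_population_table(toy_pop)
print(totals)
```

Because the notebook loads the male population file, the totals computed this way on `pop_all` would cover the male population only.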