The SALURBAL data for Latin American cities is provided as both count and rate per 10 million people, and contains recent mortality estimates.  While the population of the cities is not explicitly provided, the intuition is that this could be derived by relating the absolute count to the relative rate.   

A rate per 10 million (r) is calculated as count (c) divided by (population (p) divided by 10^7). 

r = c / (p/10^7)

Therefore,

r * (p/10^7) = c , and
p/10^7 = c/r, and so
p = 10^7*c/r

If the count for Buenos Aires in Argentina were 56,553, and the count per 10 million population were 35,437, that implies that the overall population is 10^7*56,553/35,437, or approximately 15,958,743.

According to [Wikipedia](https://en.wikipedia.org/wiki/Buenos_Aires), the 2021 population of the Buenos Aires metropolitan region is 15,624,000, which roughly corresponds to the above 2022 estimate (slightly less, which is as expected).

So, we can now apply this formula with some confidence for any city with a non-zero rate (a rate of zero would mean dividing by zero, leaving the population undefined).
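As a minimal sketch of the derivation above, the formula can be wrapped in a small helper. The function name `estimate_population` is hypothetical (it is not part of the notebook), and returning `None` for a zero rate is an assumed convention for the undefined case.

```python
from typing import Optional


def estimate_population(count: float, rate_per_10m: float) -> Optional[int]:
    """Back-calculate population as p = 10^7 * c / r.

    Returns None when the rate is zero, since the population is then undefined.
    (Hypothetical helper; the None convention is an assumption.)
    """
    if rate_per_10m == 0:
        return None
    # Truncate to a whole number of people
    return int(10**7 * count / rate_per_10m)


# Buenos Aires figures quoted above: count 56,553; rate 35,437 per 10 million
print(estimate_population(56553, 35437))
# A city with no recorded deaths has no usable rate
print(estimate_population(0, 0))
```

Truncating with `int` mirrors the integer cast applied to the population column later in the notebook.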

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('./covid-data/SALURBAL - L1 level covid19  - Death count and rate per 10m persons for cities - 2022-09-06.csv', index_col=0)

In [3]:
df

Unnamed: 0,Outcome,Level,Type,Country,Location,Date,Value
1,deaths,L1,count,Argentina,Argentina Non-salurbal,2020-03-22,0
2,deaths,L1,count,Argentina,Argentina Non-salurbal,2020-03-29,14
3,deaths,L1,count,Argentina,Argentina Non-salurbal,2020-04-05,34
4,deaths,L1,count,Argentina,Argentina Non-salurbal,2020-04-12,47
5,deaths,L1,count,Argentina,Argentina Non-salurbal,2020-04-19,62
...,...,...,...,...,...,...,...
82364,deaths,L1,count per 10 M,Peru,Tumbes,2022-05-09,75765
82365,deaths,L1,count per 10 M,Peru,Tumbes,2022-05-16,75765
82366,deaths,L1,count per 10 M,Peru,Tumbes,2022-05-23,75765
82367,deaths,L1,count per 10 M,Peru,Tumbes,2022-05-30,75765


In [4]:
# retain only most recent record for each city
df = df.sort_values('Date').drop_duplicates(['Location','Type'],keep='last').sort_index()

In [5]:
# note records with name 'salurbal' in location
print(f"To exclude:{df[df['Location'].str.contains('salurbal')]['Location'].unique()}")
# drop records with name 'salurbal' in location
df = df[~df['Location'].str.contains('salurbal')]
df.sort_index()

To exclude:['Argentina Non-salurbal' 'Brazil Non-salurbal' 'Chile Non-salurbal'
 'Colombia Non-salurbal' 'Guatemala Non-salurbal' 'Mexico Non-salurbal'
 'Peru Non-salurbal']


Unnamed: 0,Outcome,Level,Type,Country,Location,Date,Value
345,deaths,L1,count,Argentina,Bahia Blanca,2022-05-29,916
460,deaths,L1,count per 10 M,Argentina,Bahia Blanca,2022-05-29,29701
575,deaths,L1,count,Argentina,Buenos Aires,2022-05-29,56553
690,deaths,L1,count per 10 M,Argentina,Buenos Aires,2022-05-29,35437
805,deaths,L1,count,Argentina,Comodoro Rivadavia,2022-05-29,667
...,...,...,...,...,...,...,...
81904,deaths,L1,count per 10 M,Peru,Tarapoto,2022-06-06,64460
82020,deaths,L1,count,Peru,Trujillo,2022-06-06,6703
82136,deaths,L1,count per 10 M,Peru,Trujillo,2022-06-06,70595
82252,deaths,L1,count,Peru,Tumbes,2022-06-06,1027


In [6]:
# transform long to wide
df = df.pivot(
    index=["Country", "Location", "Date"],
    columns="Type",
    values="Value",
)
df

Country,Location,Date,count,count per 10 M
Argentina,Bahia Blanca,2022-05-29,916,29701
Argentina,Buenos Aires,2022-05-29,56553,35437
Argentina,Comodoro Rivadavia,2022-05-29,667,30235
Argentina,Concordia,2022-05-29,339,18266
Argentina,Corrientes,2022-05-29,791,19866
...,...,...,...,...
Peru,Sullana,2022-06-06,1917,74000
Peru,Tacna,2022-06-06,1739,57385
Peru,Tarapoto,2022-06-06,946,64460
Peru,Trujillo,2022-06-06,6703,70595


In [7]:
# estimate population as p = 10^7 * count / rate, dropping any nulls
df['population'] = (10**7) * (df['count'] / df['count per 10 M'])
df = df[~df['population'].isna()].copy()
df['population'] = (df['population']).astype('int32')
df

Country,Location,Date,count,count per 10 M,population
Argentina,Bahia Blanca,2022-05-29,916,29701,308407
Argentina,Buenos Aires,2022-05-29,56553,35437,15958743
Argentina,Comodoro Rivadavia,2022-05-29,667,30235,220605
Argentina,Concordia,2022-05-29,339,18266,185590
Argentina,Corrientes,2022-05-29,791,19866,398167
...,...,...,...,...,...
Peru,Sullana,2022-06-06,1917,74000,259054
Peru,Tacna,2022-06-06,1739,57385,303040
Peru,Tarapoto,2022-06-06,946,64460,146757
Peru,Trujillo,2022-06-06,6703,70595,949500


In [8]:
# How many cities with statistics? (355)
len(df)

355

In [9]:
# How many countries? (7)
len(df.reset_index()['Country'].unique())

7

In [10]:
# Which countries
df.reset_index()['Country'].unique()

array(['Argentina', 'Brazil', 'Chile', 'Colombia', 'Guatemala', 'Mexico',
       'Peru'], dtype=object)

In [11]:
# get deaths per 100k population
df['deaths_per_100k'] = df['count']/(df['population']/100000)
df['deaths_per_100k'] = df['deaths_per_100k'].astype('int32')
df

Country,Location,Date,count,count per 10 M,population,deaths_per_100k
Argentina,Bahia Blanca,2022-05-29,916,29701,308407,297
Argentina,Buenos Aires,2022-05-29,56553,35437,15958743,354
Argentina,Comodoro Rivadavia,2022-05-29,667,30235,220605,302
Argentina,Concordia,2022-05-29,339,18266,185590,182
Argentina,Corrientes,2022-05-29,791,19866,398167,198
...,...,...,...,...,...,...
Peru,Sullana,2022-06-06,1917,74000,259054,740
Peru,Tacna,2022-06-06,1739,57385,303040,573
Peru,Tarapoto,2022-06-06,946,64460,146757,644
Peru,Trujillo,2022-06-06,6703,70595,949500,705


The statistic of 354 deaths per 100,000 population is much higher than the 7.65 reported by Adlakha and Sallis for Buenos Aires...  
```
Earlier:   221/( 2890000/100000) =   7.65
    Now: 56553/(15958743/100000) = 354.37
```

The Adlakha and Sallis statistic was based on a population estimate of 2,890,000, which roughly corresponds to the urban (not metropolitan) city area, according to Wikipedia. The SALURBAL data corresponds to the metropolitan area, hence the far greater population. The Adlakha and Sallis article was also submitted for publication in May 2020, very early in the pandemic, so the extreme difference in death count is likely explained by the Greater Buenos Aires lockdown period from 19 March (a week and a half after reporting of the first case) to 17 July.
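The two rates compared above can be reproduced with a short sketch. The helper name `deaths_per_100k` is hypothetical. It also illustrates that, because the population was itself derived from the rate per 10 million, the per-100k figure is (up to truncation of the population) just that rate divided by 100.

```python
def deaths_per_100k(count: float, population: float) -> float:
    """Deaths per 100,000 population (hypothetical helper)."""
    return count / (population / 100_000)


# Earlier figure: 221 deaths over an urban population estimate of 2,890,000
earlier = deaths_per_100k(221, 2_890_000)

# Current figure: SALURBAL count over the back-calculated metro population
now = deaths_per_100k(56_553, 15_958_743)

# The SALURBAL rate per 10 million, rescaled to per 100,000
shortcut = 35_437 / 100

print(f"earlier={earlier:.2f} now={now:.2f} shortcut={shortcut:.2f}")
```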

# Population density
Wikipedia could be a convenient single source for metro area statistics, under the assumption that these are ballpark accurate and current.

The Buenos Aires metropolitan area is recorded as covering 4,758 sqkm.

In [12]:
df.loc[('Argentina','Buenos Aires')]['population'].iloc[0]/4758

3354.0863808322824

The estimated Buenos Aires population density of 3,354 persons per sqkm is lower than that in Adlakha and Sallis (13,680). The Adlakha and Sallis statistic was based on a population estimate of 2,890,000, which roughly corresponds to the urban (not metropolitan) city area, according to Wikipedia. Wikipedia gives the urban area as 203 sqkm.

In [13]:
2890000/203

14236.453201970444

Above we see that the Buenos Aires density estimate in Adlakha and Sallis likely corresponds to the urban, not metro, area for Buenos Aires.

However, the 2022 death statistic we have is for the metro area, as determined empirically by back-calculating the population and identifying that this corresponded to the city's metropolitan statistic.

So, in general, I think the way to represent Buenos Aires is as the Greater Buenos Aires conurbation https://en.wikipedia.org/wiki/Greater_Buenos_Aires.

But ... can we determine this for every city, or do we just assume these are all 'Metro'?

# Mexico City

In Adlakha and Sallis, the mid-2020 statistics for Mexico City were:

|                  |Statistic        |
|------------------|-----------------|
|Population density|9800             |
|Population        |missing, but from 8.81 to 8.92 million, depending on derivation using case or death rates|
|Deaths            |2203             |
|Deaths per 100k pop|25              |

Using the 2022 SALURBAL data, for Mexico City, we have

In [14]:
df.loc[('Mexico', 'Ciudad de Mexico')][['population','count','deaths_per_100k']].transpose()

Date,2022-06-06
Type,Unnamed: 1_level_1
population,21190532
count,53718
deaths_per_100k,253


As with Buenos Aires, according to Wikipedia the Adlakha and Sallis article presents the urban population for [Mexico City](https://en.wikipedia.org/wiki/Mexico_City) (9.2 million), while the SALURBAL data roughly corresponds to the metro population (21.8 million). The death count and rate are higher, mostly as a function of time and time-related contexts; but the area catchment (population and density) is also broader and so not directly comparable.

If the [Greater Mexico City metro area](https://en.wikipedia.org/wiki/Greater_Mexico_City) is taken as 7,866.1 sqkm, then this gives an overall density of 21190532/7866.1, or approximately 2,694 persons per sqkm.

# Population and density comparisons with the GHSL-derived data

The GHSL-derived data had a Buenos Aires population estimate of 13,906,506 (2015); that corresponds to the metropolitan area (15.9 million in 2022). Likewise for Mexico City, the population estimate of 19.6 million (2015) corresponds to the greater metropolitan area (21.8 million, 2020). Using the GHSL data, we calculated population density as population relative to the built-up area, not the overall area (so more like net density than gross density; an idiosyncrasy we could amend if we thought appropriate). For Buenos Aires, this was 10,230 persons per sqkm (lower than Adlakha and Sallis); and for Mexico City, 15,066 persons per sqkm (higher than Adlakha and Sallis).

If we calculated gross density (which may be more appropriate), for Buenos Aires we would have 13906506/1967 = 7,069 persons per sqkm (lower still than Adlakha and Sallis, but about double that of the SALURBAL reference), and for Mexico City 19559564/2114 = 9,252 persons per sqkm (about 3.4 times that of the SALURBAL reference).
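The gross density comparison can be sketched as follows, using only figures quoted in this document. The helper `gross_density` and the dictionary names are hypothetical, and 4,758 and 7,866.1 are treated as metro areas in sqkm, as in the divisions earlier in the notebook.

```python
def gross_density(population: float, area_sqkm: float) -> float:
    """Persons per sqkm over the whole area, not just built-up land."""
    return population / area_sqkm


# (population, area in sqkm) pairs quoted in this document
ghsl = {
    "Buenos Aires": (13_906_506, 1967),
    "Mexico City": (19_559_564, 2114),
}
# Back-calculated SALURBAL populations over the assumed metro areas
salurbal = {
    "Buenos Aires": (15_958_743, 4758),
    "Mexico City": (21_190_532, 7866.1),
}

ratios = {}
for city in ghsl:
    d_ghsl = gross_density(*ghsl[city])
    d_sal = gross_density(*salurbal[city])
    ratios[city] = d_ghsl / d_sal
    print(f"{city}: GHSL {d_ghsl:.0f}, SALURBAL-derived {d_sal:.0f}, "
          f"ratio {ratios[city]:.1f}")
```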

This may be explained by the GHSL method for identifying urban centres. Rather than administrative boundaries, it uses satellite imagery, census data and other sources to identify urban agglomerations; effectively an empirical greater metropolitan area (which may in practice be several cities combined). So, it includes peri-urban areas to an extent which may be lesser or greater than would be the case for administrative boundaries. If analysis were restricted more to inner city areas, as was the case for Buenos Aires and Mexico City in Adlakha and Sallis' paper, this would result in a higher density estimate than if population were considered relative to the greater metropolitan expanse, which includes areas towards the urban fringe that would be expected to be more sparsely populated.

The much lower density estimates for the SALURBAL data would likely relate to the use of administrative boundaries that capture regional areas with very low population, which both the GHSL method and, even more so, the use of inner city areas would exclude.



# So what to do?

The best bet for a comparable analysis would be to source population and density statistics for metro areas, perhaps using Wikipedia as a convenience source with broad recognition. Its city pages routinely contain these statistics, although some smaller cities may not be listed; but it's perhaps a do-able source in the first instance (and may link to an official source).

If we were to do any spatial analyses (e.g. connectivity, as we discussed), we would require actual boundaries. We could use GHSL for that, but I think it might be best to avoid doing this and keep things simple. I think manually going through each country's statistical agency for digital boundaries and statistics for every city will be too much effort (even just for Aus and NZ it was too much; it must be more difficult once we move outside high-income English-speaking countries!).

For the Latin American cities, we can use the recorded count and rate statistics to derive a population estimate, which can be checked against recorded populations from Wikipedia to ensure that, at least for these cities, the catchments for density and death statistics correspond.

In [30]:
df.loc[('Peru','Lima')][['population','count','deaths_per_100k']]

Date,population,count,deaths_per_100k
2022-06-06,9922528,9734,98
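
The catchment cross-check proposed above for the Latin American cities could look something like the following sketch. The function names and the 10% tolerance are assumptions for illustration; the populations are the derived SALURBAL estimates and the Wikipedia figures quoted earlier in this document.

```python
def relative_difference(derived: float, reference: float) -> float:
    """Absolute difference as a fraction of the reference value."""
    return abs(derived - reference) / reference


def closest_catchment(derived: float, references: dict) -> str:
    """Return the label of the reference population nearest the derived one."""
    return min(
        references,
        key=lambda label: relative_difference(derived, references[label]),
    )


# Derived populations, with urban and metro reference figures per city
cities = {
    "Buenos Aires": (15_958_743, {"urban": 2_890_000, "metro": 15_624_000}),
    "Mexico City": (21_190_532, {"urban": 9_200_000, "metro": 21_800_000}),
}

TOLERANCE = 0.10  # assumed threshold for "roughly corresponds"

results = {}
for city, (derived, references) in cities.items():
    label = closest_catchment(derived, references)
    diff = relative_difference(derived, references[label])
    # Record which catchment matched and whether it is within tolerance
    results[city] = (label, diff <= TOLERANCE)
    print(f"{city}: closest to {label} ({diff:.1%} difference)")
```

A city whose derived population falls outside the tolerance for every reference figure would be a candidate for manual review of its catchment.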
