# OhCalcutta

This problem is inspired by Dr C. R. Rao's "Statistics and Truth".

**Problem:** You are asked to estimate the total urban population of the world. You are given that there are a total of 6148 cities in the world with a population of more than 100000 (N=6148).

**Challenge:**
1. You cannot do a census in each city. You can only do a census in a sample of the cities.
2. You have no prior knowledge of how city populations are distributed.

**Approach:**
If you can estimate the mean population of a city, then you can multiply it by the total number of cities (N) to estimate the total urban population.
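In symbols, the approach amounts to the following estimator. The notation is introduced here for illustration only: $x_1, \dots, x_n$ are the populations of the $n$ sampled cities, $\bar{x}$ is their mean, and $\hat{T}$ is the estimated total.

$$\hat{T} = N\,\bar{x} = \frac{N}{n}\sum_{i=1}^{n} x_i$$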

So, you have to estimate the mean. Would the CLT help?

Theoretically, yes. The sampling distribution of the sample mean should be approximately normal, centered on the population mean.
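To see what this claim looks like when it does hold, here is a minimal sketch on a synthetic, light-tailed population. Everything in it is an assumption made for illustration: the uniform population, the seed, and the number of repeated lotteries (`trials`) are not from the dataset. It uses only the standard library so it runs without the CSV.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumption: a synthetic, light-tailed stand-in for the city data.
N = 6148
population = [rng.uniform(100_000, 1_000_000) for _ in range(N)]
pop_mean = statistics.fmean(population)
pop_std = statistics.pstdev(population)

n = 40         # sample size, matching the census budget
trials = 2000  # hypothetical number of repeated lotteries

# Draw many simple random samples (without replacement) and keep each mean.
sample_means = [statistics.fmean(rng.sample(population, n)) for _ in range(trials)]

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.stdev(sample_means)
clt_std = pop_std / math.sqrt(n)  # spread the CLT predicts for the sample mean

print(f"Population mean:          {pop_mean:,.0f}")
print(f"Mean of the sample means: {mean_of_means:,.0f}")
print(f"Std of the sample means:  {std_of_means:,.0f}")
print(f"CLT prediction (s/sqrt n): {clt_std:,.0f}")
```

For a well-behaved population like this one, the sample means cluster around the population mean with roughly the spread the CLT predicts. Whether real city populations behave this way is exactly what the rest of the notebook tests.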

Data downloaded from [simplemaps.com](https://simplemaps.com/data/world-cities)

In [69]:
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

In [80]:
df = pd.read_csv("worldcities.csv")
df = df[df.population > 100000].reset_index(drop=True)
N = df.shape[0]
print("Total number of cities with more than 100000 population: ", N)

Total number of cities with more than 100000 population:  6148


You run a simple lottery to select the cities in which you will conduct a census. You have the budget for 40 cities (n=40).

In [75]:
sample_size = 40
sample_df = df.sample(sample_size)
sample_df

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
1347,Al Ḩudaydah,Al Hudaydah,14.8022,42.9511,Yemen,YE,YEM,Al Ḩudaydah,admin,548433.0,1887984890
1508,Larkana,Larkana,27.5583,68.2111,Pakistan,PK,PAK,Sindh,minor,490508.0,1586678302
2740,Nelamangala,Nelamangala,13.102,77.374,India,IN,IND,Karnātaka,,245624.0,1356943451
4195,Gulu,Gulu,2.7817,32.2992,Uganda,UG,UGA,Gulu,admin,152276.0,1800406007
4806,Zama,Zama,35.4833,139.4,Japan,JP,JPN,Kanagawa,,130753.0,1392313741
4023,Huaycan,Huaycan,-12.0181,-76.8139,Peru,PE,PER,Lima,,160000.0,1604578883
5289,Lodhran,Lodhran,29.5333,71.6333,Pakistan,PK,PAK,Punjab,minor,117851.0,1586813871
1742,Banjul,Banjul,13.4531,-16.5775,"Gambia, The",GM,GMB,Banjul,primary,413397.0,1270723713
5237,Totonicapán,Totonicapan,14.9108,-91.3606,Guatemala,GT,GTM,Totonicapán,admin,118960.0,1320223386
6051,Olomouc,Olomouc,49.5939,17.2508,Czechia,CZ,CZE,Olomoucký Kraj,admin,101825.0,1203328061


Great! Now you can build a confidence interval for the mean. This relies on the CLT.
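For reference, the interval computed in the next cell is the standard t-based interval for a mean. The symbols are introduced here for illustration: $\bar{x}$ is the sample mean, $s$ the sample standard deviation (with $n-1$ in the denominator), $n = 40$, and $1 - \alpha = 0.95$.

$$\bar{x} \;\pm\; t_{1-\alpha/2,\;n-1}\,\frac{s}{\sqrt{n}}, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

Multiplying both endpoints by $N$ gives the corresponding interval for the total urban population.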

In [76]:
sample_size = 40

# Sample data
data = np.array(sample_df.population)

# Sample mean
sample_mean = np.mean(data)

# Sample standard deviation
sample_std = np.std(data, ddof=1)  # ddof=1 provides the sample standard deviation

# Confidence level
confidence_level = 0.95

# Degrees of freedom
dof = sample_size - 1

# t-statistic for the confidence level
t_stat = stats.t.ppf((1 + confidence_level) / 2, dof)

# Margin of error
margin_of_error = t_stat * (sample_std / np.sqrt(sample_size))

# Confidence interval
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print(f"Sample Mean: {sample_mean}")
print(f"Sample Standard Deviation: {sample_std}")
print(f"t-Statistic: {t_stat}")
print(f"Margin of Error: {margin_of_error}")
print(f"{confidence_level*100}% Confidence Interval: {confidence_interval}")


Sample Mean: 297836.825
Sample Standard Deviation: 281146.28529481083
t-Statistic: 2.022690911734728
Margin of Error: 89914.94379141793
95.0% Confidence Interval: (207921.88120858208, 387751.76879141794)


In [83]:
total_pop_CI =  (confidence_interval[0]*N,confidence_interval[1]*N)
print("95% Confidence Interval for total urban population in the world: ", total_pop_CI)

95% Confidence Interval for total urban population in the world:  (1278303725.6703627, 2383897874.5296373)
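As a cross-check, the interval can be reproduced from the printed summary statistics alone, without the dataset. The sketch below copies the values from the output above and repeats the arithmetic; `point_estimate` is a name introduced here for the single-number estimate `N * sample_mean`.

```python
import math

# Summary statistics copied from the printed output above.
N = 6148
n = 40
sample_mean = 297836.825
sample_std = 281146.28529481083
t_stat = 2.022690911734728  # 95% t critical value for 39 degrees of freedom

# Margin of error and interval for the mean city population.
margin_of_error = t_stat * (sample_std / math.sqrt(n))
mean_ci = (sample_mean - margin_of_error, sample_mean + margin_of_error)

# Scale by the number of cities to get the interval for the total.
total_ci = (N * mean_ci[0], N * mean_ci[1])
point_estimate = N * sample_mean

print(f"Margin of error: {margin_of_error:,.2f}")
print(f"Interval for the total: ({total_ci[0]:,.0f}, {total_ci[1]:,.0f})")
print(f"Point estimate of the total: {point_estimate:,.0f}")
```

The point estimate sits at the midpoint of the interval, at roughly 1.83 billion.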


So, from the chosen sample, you would say that the upper bound on the total urban population of the world is 238,38,97,875:

`238 Crores` or `2.38 Billion`

Now, if the inference is right, the true urban population should be less than or equal to this. Is it?

In [86]:
df.population.sum() < confidence_interval[1]*N

False

In [87]:
df.population.sum()

4132049481.0

The true urban population from the dataset we considered is 413,20,49,481. `413 Crores` or `4.13 Billion`

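We underestimated badly. Even the upper bound of our interval falls short of the truth by about `1.75 Billion`, and the point estimate (`N*sample_mean`, about `1.83 Billion`) falls short by about `2.3 Billion`.

If we were confident that we would always underestimate this number, we could correct the bias by adding a suitable constant. But would we always underestimate?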

Now consider a hypothetical sample of the same size, 40, with only one small difference: suppose Delhi had fallen in this sample (it did not previously).

In [104]:
# Swap the last sampled city for Delhi, which is row 2 of df; df.iloc[[2]] keeps the column dtypes
sample_df = pd.concat([sample_df.iloc[:39], df.iloc[[2]]])
sample_df.tail()

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
943,Dongyang,Dongyang,29.2667,120.2167,China,CN,CHN,Zhejiang,minor,804398.0,1156259752
2062,Vitória da Conquista,Vitoria da Conquista,-14.8658,-40.8389,Brazil,BR,BRA,Bahia,,341128.0,1076812020
4492,Lausanne,Lausanne,46.5198,6.6335,Switzerland,CH,CHE,Vaud,admin,141418.0,1756055099
5257,Senador Canedo,Senador Canedo,-16.7594,-49.0864,Brazil,BR,BRA,Goiás,minor,118451.0,1076337498
2,Delhi,Delhi,28.61,77.23,India,IN,IND,Delhi,admin,32226000.0,1356872604


Let's estimate the total urban population again, using the same process as above.

In [106]:
sample_size = 40

# Sample data
data = np.array(sample_df.population)

# Sample mean
sample_mean = np.mean(data)

# Sample standard deviation
sample_std = np.std(data, ddof=1)  # ddof=1 provides the sample standard deviation

# Confidence level
confidence_level = 0.95

# Degrees of freedom
dof = sample_size - 1

# t-statistic for the confidence level
t_stat = stats.t.ppf((1 + confidence_level) / 2, dof)

# Margin of error
margin_of_error = t_stat * (sample_std / np.sqrt(sample_size))

# Confidence interval
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print(f"Sample Mean: {sample_mean}")
print(f"Sample Standard Deviation: {sample_std}")
print(f"t-Statistic: {t_stat}")
print(f"Margin of Error: {margin_of_error}")
print(f"{confidence_level*100}% Confidence Interval: {confidence_interval}")

print("---***---")
total_pop_CI =  (confidence_interval[0]*N,confidence_interval[1]*N)
print("95% Confidence Interval for total urban population in the world: ", total_pop_CI)

Sample Mean: 1099349.525
Sample Standard Deviation: 5055527.237618852
t-Statistic: 2.022690911734728
Margin of Error: 1616836.0429511657
95.0% Confidence Interval: (-517486.5179511658, 2716185.5679511656)
---***---
95% Confidence Interval for total urban population in the world:  (-3181507112.363767, 16699108871.763765)


The upper bound is 1669,91,08,872, that is `1669 Crores` or `16.69 Billion`. This is greater than the total population of the world.

The lower bound is negative. We can use common sense and take `62 Crores` or `0.62 Billion` as the lower bound instead `(N*min(sample_data))`.

Either way, a confidence interval this wide is meaningless for any urban policy practitioner.
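To see why a single city can do this much damage, here is a minimal sketch with a deliberately simplified sample. The 39 "ordinary" cities of 300,000 people each are an assumption made for illustration; Delhi's population and the t critical value are taken from the outputs above. The helper `t_interval` is a name introduced here, not part of the notebook.

```python
import math
import statistics

# Assumption: 39 hypothetical cities of 300,000 people each stand in for the
# ordinary part of the sample. Delhi's population comes from the table above.
ordinary = [300_000.0] * 39
delhi = 32_226_000.0
t_stat = 2.022690911734728  # 95% t critical value for 39 degrees of freedom

def t_interval(data):
    """Return (mean, sample std, lower, upper) of the t-based 95% interval."""
    mean = statistics.fmean(data)
    std = statistics.stdev(data)  # n - 1 in the denominator, like ddof=1
    margin = t_stat * std / math.sqrt(len(data))
    return mean, std, mean - margin, mean + margin

sample = ordinary + [delhi]
mean, std, lower, upper = t_interval(sample)
share = delhi / sum(sample)  # fraction of the sample total contributed by Delhi

print(f"Mean: {mean:,.0f}")
print(f"Std:  {std:,.0f}")
print(f"95% interval for the mean: ({lower:,.0f}, {upper:,.0f})")
print(f"Delhi's share of the sample total: {share:.1%}")
```

One observation out of 40 supplies most of the sample total, so it drives both the mean and the standard deviation. The margin of error ends up larger than the mean itself, which is how the lower bound goes negative.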

## Why did we fail in making a statistical inference here?

This is the `"Oh Calcutta!"` problem described in the book `"Statistics and Truth"` by `Dr C. R. Rao`. Nassim Taleb articulated this as the fat tail problem. I wrote about it here: [Taleb Fat tail](https://medium.com/@saikrishna_17904/taleb-fat-tail-aaeda60661d2)

Basically, we failed because the CLT approximation breaks down in this case: city populations are so heavy-tailed that a sample of 40 is far too small for the sample mean to be approximately normal. And the CLT is a crucial assumption in constructing a confidence interval for the mean.
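One way to probe this claim without the real data is to repeat the whole lottery many times on a synthetic fat-tailed population and count how often the 95% interval actually contains the true total. The sketch below is an illustration under stated assumptions: the Pareto population, the tail index `alpha = 1.2`, the seed, and the number of `trials` are all hypothetical choices, not properties of the dataset.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumption: a synthetic Pareto population stands in for the real city data.
# alpha = 1.2 is a hypothetical tail index chosen to give a very heavy tail;
# the minimum city size is 100,000, as in the problem statement.
N = 6148
alpha = 1.2
population = [100_000 * rng.paretovariate(alpha) for _ in range(N)]
true_total = sum(population)

n = 40
t_stat = 2.022690911734728  # 95% t critical value for 39 degrees of freedom
trials = 2000               # hypothetical number of repeated lotteries

covered = 0         # intervals that contain the true total
underestimated = 0  # point estimates that fall below the true total
for _ in range(trials):
    sample = rng.sample(population, n)  # simple random sample, no replacement
    mean = statistics.fmean(sample)
    margin = t_stat * statistics.stdev(sample) / math.sqrt(n)
    lower, upper = N * (mean - margin), N * (mean + margin)
    covered += lower <= true_total <= upper
    underestimated += N * mean < true_total

coverage = covered / trials
under_rate = underestimated / trials

print(f"Nominal coverage: 95%, observed coverage: {coverage:.1%}")
print(f"Share of samples that underestimate the total: {under_rate:.1%}")
```

With a tail this heavy, the observed coverage typically comes out well below the nominal 95%, and most samples underestimate the total because they miss the few giant cities. The occasional sample that does catch one overshoots instead, which is the Delhi case above.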


In [118]:
sample_size = 10
# Draw the sample only from cities with at least 5 million people
sample = df[df.population >= 5000000].sample(sample_size)

# Sample data
data = np.array(sample.population)

# Sample mean
sample_mean = np.mean(data)

# Sample standard deviation
sample_std = np.std(data, ddof=1)  # ddof=1 provides the sample standard deviation

# Confidence level
confidence_level = 0.95

# Degrees of freedom
dof = sample_size - 1

# t-statistic for the confidence level
t_stat = stats.t.ppf((1 + confidence_level) / 2, dof)

# Margin of error
margin_of_error = t_stat * (sample_std / np.sqrt(sample_size))

# Confidence interval
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print(f"Sample Mean: {sample_mean}")
print(f"Sample Standard Deviation: {sample_std}")
print(f"t-Statistic: {t_stat}")
print(f"Margin of Error: {margin_of_error}")
print(f"{confidence_level*100}% Confidence Interval: {confidence_interval}")

print("---***---")
# Note: N still counts every city above 100000, not only those above 5 million
total_pop_CI = (confidence_interval[0]*N, confidence_interval[1]*N)
print("95% Confidence Interval for total urban population in the world: ", total_pop_CI)

Sample Mean: 10426898.5
Sample Standard Deviation: 6023991.807712857
t-Statistic: 2.2621571627409915
Margin of Error: 4309304.141049111
95.0% Confidence Interval: (6117594.358950889, 14736202.641049111)
---***---
95% Confidence Interval for total urban population in the world:  (37610970118.83006, 90598173837.16994)


In [120]:
3026050764.7094064+90598173837

93624224601.70941

In [119]:
sample

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
99,Nangandao,Nangandao,35.3036,113.9268,China,CN,CHN,Henan,minor,5708191.0,1156127660
7,São Paulo,Sao Paulo,-23.55,-46.6333,Brazil,BR,BRA,São Paulo,admin,23086000.0,1076532519
89,Singapore,Singapore,1.3,103.8,Singapore,SG,SGP,,primary,5983000.0,1702341327
51,Kuala Lumpur,Kuala Lumpur,3.1478,101.6953,Malaysia,MY,MYS,Kuala Lumpur,primary,8911000.0,1458988644
18,Buenos Aires,Buenos Aires,-34.6033,-58.3817,Argentina,AR,ARG,"Buenos Aires, Ciudad Autónoma de",primary,16710000.0,1032717330
25,Chengdu,Chengdu,30.66,104.0633,China,CN,CHN,Sichuan,admin,14645000.0,1156421555
34,Baoding,Baoding,38.874,115.464,China,CN,CHN,Hebei,minor,11544036.0,1156256829
73,Chattogram,Chattogram,22.335,91.8325,Bangladesh,BD,BGD,Chattogram,admin,7000000.0,1050830722
121,Rangoon,Rangoon,16.795,96.16,Burma,MM,MMR,Yangon,primary,5209541.0,1104616656
111,Tai’an,Tai'an,36.202,117.087,China,CN,CHN,Shandong,,5472217.0,1156095188
