In [2]:
import pandas as pd

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
dfs = pd.read_html(url)  # returns a list with one DataFrame per HTML table on the page

In [4]:
for df in dfs:
    display(df)

Unnamed: 0,Rank,Country (or dependent territory),Population,% of worldpopulation,Date,Source
0,1,China[b],1401928520,,26 Mar 2020,National population clock[3]
1,2,India[c],1360255195,,26 Mar 2020,National population clock[4]
2,3,United States[d],329516498,,26 Mar 2020,National population clock[5]
3,4,Indonesia,266911900,,1 Jul 2019,National annual projection[6]
4,5,Pakistan[e],219071520,,26 Mar 2020,2017 census[7]
...,...,...,...,...,...,...
237,–,Tokelau (NZ),1400,,1 Jul 2018,National annual estimate[91]
238,195,Vatican City,799,,1 Jul 2019,UN projection[2]
239,–,Cocos (Keeling) Islands (Australia),538,,30 Jun 2018,National estimate[196]
240,–,Pitcairn Islands (UK),50,,1 Jan 2019,National estimate[197]


Unnamed: 0,vteLists of countries by population statistics,vteLists of countries by population statistics.1
0,Global,Current population Demographics of the world
1,Continents/subregions,Africa Antarctica Asia Europe North America Ca...
2,Intercontinental,Americas Arab world Commonwealth of Nations Eu...
3,Cities/urban areas,World cities National capitals Megacities Mega...
4,Past and future,Past and future population World population es...
5,Population density,Current density Past and future population den...
6,Growth indicators,Population growth rate Natural increase Birth ...
7,Other demographics,Age at first marriage Age structure Dependency...
8,Health,Antidepressant consumption Antiviral medicatio...
9,Education and innovation,Bloomberg Innovation Index Education Index Int...


We are interested in the first table, which needs a little data munging to put it in a more usable format. Here is the raw table:

In [5]:
df_population_by_country = dfs[0]
df_population_by_country

Unnamed: 0,Rank,Country (or dependent territory),Population,% of worldpopulation,Date,Source
0,1,China[b],1401928520,,26 Mar 2020,National population clock[3]
1,2,India[c],1360255195,,26 Mar 2020,National population clock[4]
2,3,United States[d],329516498,,26 Mar 2020,National population clock[5]
3,4,Indonesia,266911900,,1 Jul 2019,National annual projection[6]
4,5,Pakistan[e],219071520,,26 Mar 2020,2017 census[7]
...,...,...,...,...,...,...
237,–,Tokelau (NZ),1400,,1 Jul 2018,National annual estimate[91]
238,195,Vatican City,799,,1 Jul 2019,UN projection[2]
239,–,Cocos (Keeling) Islands (Australia),538,,30 Jun 2018,National estimate[196]
240,–,Pitcairn Islands (UK),50,,1 Jan 2019,National estimate[197]
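Picking `dfs[0]` relies on the page layout staying the same. Here is a minimal sketch of a more defensive choice, assuming the table of interest is the only one with a `Population` column. The helper name `pick_table` and the tiny stand-in frames are hypothetical and not part of the notebook:

```python
import pandas as pd

def pick_table(tables, required_column):
    """Return the first DataFrame that has the required column (hypothetical helper)."""
    for table in tables:
        if required_column in table.columns:
            return table
    raise ValueError(f"no table with a {required_column!r} column")

# Stand-ins for the list returned by pd.read_html (values copied from the output above).
nav = pd.DataFrame({"vteLists": ["Global", "Health"]})
pop = pd.DataFrame({"Country": ["China", "India"],
                    "Population": [1401928520, 1360255195]})

chosen = pick_table([nav, pop], "Population")
print(chosen)
```

With the real data this would be called as `pick_table(dfs, 'Population')`.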


And here's the data munging:

In [6]:
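# Keep only the country and population columns.
df = df_population_by_country.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
df = df.rename(columns={'Country (or dependent territory)': 'Country'})
# Strip single-character footnote markers such as [b]; assign the result back
# rather than calling replace(..., inplace=True) on a column selection.
df['Country'] = df['Country'].replace({r'\[.\]': ''}, regex=True)
df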

Unnamed: 0,Country,Population
0,China,1401928520
1,India,1360255195
2,United States,329516498
3,Indonesia,266911900
4,Pakistan,219071520
...,...,...
237,Tokelau (NZ),1400
238,Vatican City,799
239,Cocos (Keeling) Islands (Australia),538
240,Pitcairn Islands (UK),50
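The pattern `\[.\]` strips only single-character markers such as `[b]`. As a sketch, assuming longer markers like `[12]` could also appear in the country names (the sample rows below are made up for illustration), a bracket-aware pattern handles both cases:

```python
import pandas as pd

# Illustrative rows; 'Exampleland[12]' is a made-up name used only to show
# a multi-character marker.
sample = pd.DataFrame({"Country": ["China[b]", "Exampleland[12]", "Tokelau (NZ)"]})

# \[[^\]]*\] matches '[', any run of characters other than ']', then ']'.
sample["Country"] = sample["Country"].str.replace(r"\[[^\]]*\]", "", regex=True)
print(sample["Country"].tolist())
```

Parenthesised suffixes such as `(NZ)` are left untouched because the pattern targets square brackets only.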


Finally, save the table to a CSV file for later use.

In [7]:
df.to_csv('./data/WikiTables_population.csv', index=False)

And here's how to read the `DataFrame` back from the file:

In [9]:
df = pd.read_csv('./data/WikiTables_population.csv')
df

Unnamed: 0,Country,Population
0,China,1401928520
1,India,1360255195
2,United States,329516498
3,Indonesia,266911900
4,Pakistan,219071520
...,...,...
237,Tokelau (NZ),1400
238,Vatican City,799
239,Cocos (Keeling) Islands (Australia),538
240,Pitcairn Islands (UK),50
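One practical detail: `to_csv` does not create missing directories, so `./data` has to exist before the save. The following sketch of the save/load round trip uses a temporary directory as a stand-in for `./data` and a two-row frame as a stand-in for the full table (both are assumptions made to keep the example self-contained):

```python
import tempfile
from pathlib import Path

import pandas as pd

df_small = pd.DataFrame({"Country": ["China", "India"],
                         "Population": [1401928520, 1360255195]})

with tempfile.TemporaryDirectory() as tmp:
    # Temporary stand-in for ./data; create it before writing.
    out_dir = Path(tmp) / "data"
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / "WikiTables_population.csv"

    df_small.to_csv(csv_path, index=False)  # index=False keeps the row index out of the file
    restored = pd.read_csv(csv_path)

print(restored)
```

Because the file was written with `index=False`, reading it back gives the same two columns without an extra `Unnamed: 0` column.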


To look up a population by country name, first make `Country` the index:

In [30]:
df = df.set_index('Country')
df

Country,Population
China,1401928520
India,1360255195
United States,329516498
Indonesia,266911900
Pakistan,219071520
...,...
Tokelau (NZ),1400
Vatican City,799
Cocos (Keeling) Islands (Australia),538
Pitcairn Islands (UK),50


Here's one way to look up the population of India:

In [54]:
df.at['India', 'Population']

1360255195

And here's one way to extract multiple values:

In [53]:
df.loc[['Italy', 'United States', 'United Kingdom'], 'Population'].values

array([ 60243406, 329516498,  66435600], dtype=int64)
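Note that `.loc` with a list of labels raises a `KeyError` if any label is missing from the index. As a sketch of one tolerant alternative, `reindex` returns `NaN` for unknown labels instead. The three-row frame and the absent name `'Atlantis'` are illustrative assumptions:

```python
import pandas as pd

df_idx = pd.DataFrame(
    {"Population": [60243406, 329516498, 66435600]},
    index=pd.Index(["Italy", "United States", "United Kingdom"], name="Country"),
)

wanted = ["Italy", "Atlantis", "United Kingdom"]  # 'Atlantis' is deliberately absent

# reindex yields NaN for labels that are not in the index instead of raising.
found = df_idx["Population"].reindex(wanted)
missing = found[found.isna()].index.tolist()
known = found.dropna().astype("int64").to_dict()

print(known)
print(missing)
```

The `NaN` placeholders force the column to float, which is why the sketch casts back to `int64` after dropping the missing rows.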

We can also extract a dict for even easier lookup:

In [65]:
population = df.to_dict()['Population']
population

{'China': 1401928520,
 'India': 1360255195,
 'United States': 329516498,
 'Indonesia': 266911900,
 'Pakistan': 219071520,
 'Brazil': 211305822,
 'Nigeria': 206139587,
 'Bangladesh': 168332538,
 'Russia': 146745098,
 'Mexico': 126577691,
 'Japan': 125950000,
 'Philippines': 108455678,
 'Egypt': 100169235,
 'Ethiopia': 98665000,
 'Vietnam': 96208984,
 'DR Congo': 89561404,
 'Iran': 83311966,
 'Turkey': 83154997,
 'Germany': 83149300,
 'France': 67069000,
 'Thailand': 66485892,
 'United Kingdom': 66435600,
 'Italy': 60243406,
 'South Africa': 58775022,
 'Tanzania': 55890747,
 'Myanmar': 54339766,
 'South Korea': 51780579,
 'Colombia': 49395678,
 'Kenya': 47564296,
 'Spain': 47100396,
 'Argentina': 44938712,
 'Algeria': 43000000,
 'Sudan': 42374695,
 'Ukraine': 41902416,
 'Uganda': 40299300,
 'Iraq': 39127900,
 'Poland': 38386000,
 'Canada': 37970905,
 'Morocco': 35849953,
 'Saudi Arabia': 34218169,
 'Uzbekistan': 34090725,
 'Malaysia': 32730760,
 'Afghanistan': 32225560,
 'Venezuela': 322

With the dict in hand, the population of India is a plain key lookup:

In [76]:
population['India']

1360255195
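As a small variation on the dict lookup, here is a sketch that builds the same mapping from the column directly and uses `dict.get` for names that may be absent. The two-row frame and the name `'Atlantis'` are assumptions for illustration:

```python
import pandas as pd

df_idx = pd.DataFrame(
    {"Population": [1401928520, 1360255195]},
    index=pd.Index(["China", "India"], name="Country"),
)

# Series.to_dict() builds the {country: population} mapping straight from the column.
population = df_idx["Population"].to_dict()

print(population["India"])
print(population.get("Atlantis"))  # dict.get returns None for a missing key instead of raising
```

This gives the same mapping as `df.to_dict()['Population']` without building dicts for any other columns.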