# Experiment on Benford's Law
I want to test Benford's Law, which "... is an observation that in many real-life sets of numerical data, the leading digit is likely to be small. In sets that obey the law, the number 1 appears as the leading significant digit about 30 % of the time, while 9 appears as the leading significant digit less than 5 % of the time."

See <a href="https://en.wikipedia.org/wiki/Benford%27s_law" target="_blank">Benford's law</a>.

# Methodology
I will scrape web pages that contain real-world data, such as Wikipedia pages.
I will load the data into pandas DataFrames and compute how often each leading digit occurs.

Finally, I will check whether the computed frequencies follow the logarithmic distribution of Benford's Law.
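
Before touching any scraped data, the two steps (count leading digits, compare with the logarithmic distribution) can be sketched on a sequence that is known to follow Benford's Law. The choice of the first 100 powers of two and the variable names below are assumptions made for illustration only; they are not part of the experiments that follow.

```python
import math
from collections import Counter

# Assumed illustrative data: the first 100 powers of two (2**0 ... 2**99).
values = [2 ** n for n in range(100)]

# Step 1: count the leading digit of every value.
digit_counts = Counter(int(str(v)[0]) for v in values)

# Step 2: turn counts into frequencies and compute the Benford prediction log10(1 + 1/d).
observed = {d: digit_counts[d] / len(values) for d in range(1, 10)}
expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Print digit, observed frequency and predicted frequency side by side.
for d in range(1, 10):
    print(d, round(observed[d], 3), round(expected[d], 3))
```

The same two steps are applied to the scraped tables below, with pandas doing the counting.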

In [None]:
!mamba install bs4==4.10.0 -y
!pip install lxml==4.6.4
!mamba install html5lib==1.1 -y



In [None]:
from bs4 import BeautifulSoup  # parses HTML for web scraping
import requests  # downloads web pages
import pandas as pd
import math

## Benford's Law

In [None]:
def benfords_law(digit):
    """Return the Benford's Law probability that `digit` (1-9) is the leading digit."""
    digit = int(digit)
    return math.log10((digit + 1)/digit)
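
As a quick sanity check on the formula, the nine probabilities should add up to 1, because the sum of log10((d + 1)/d) for d = 1..9 telescopes to log10(10). The sketch below uses a hypothetical helper, `benford_probability`, that mirrors `benfords_law` so the block runs on its own.

```python
import math

def benford_probability(digit):
    """Hypothetical stand-alone copy of the formula: P(d) = log10(1 + 1/d)."""
    return math.log10(1 + 1 / digit)

# Probabilities for the leading digits 1..9.
probabilities = [benford_probability(d) for d in range(1, 10)]

# The sum telescopes to log10(10), so it should be 1 up to floating-point error.
total_probability = sum(probabilities)
print(total_probability)
```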

# Experiment n. 01: Population of Nations

In [None]:
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"

In [None]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text

In [None]:
soup = BeautifulSoup(data,"html.parser")

In [None]:
#find all html tables in the web page
tables = soup.find_all('table') # in html table is represented by the tag <table>
# number of tables in the web page:
print("In the web page there are", len(tables), "tables")

In the web page there are 2 tables


In [None]:
# I want to retrieve the table containing the following string:
string = "Country/Area"

for index,table in enumerate(tables):
    if (string in str(table)):
        table_index = index
print(table_index)

0


In [None]:
# Column names of the target table
columns = ["Country/Area", "UN continental region", "UN statistical subregion", "Population (1 July 2018)", "Population (1 July 2019)", "Change"]

# Collect one dict per data row; DataFrame.append was removed in pandas 2.0
rows = []
# Loop through the rows of the HTML table. Rows are <tr> elements
for row in tables[table_index].tbody.find_all("tr"):
    # Data cells are <td> elements; header rows have none and are skipped
    col = row.find_all("td")
    if col:
        rows.append(dict(zip(columns, (cell.text for cell in col[:6]))))

population_data = pd.DataFrame(rows, columns=columns)

population_data

Unnamed: 0,Country/Area,UN continental region,UN statistical subregion,Population (1 July 2018),Population (1 July 2019),Change
0,China[a],Asia,Eastern Asia,1427647786,1433783686,+0.43%\n
1,India,Asia,Southern Asia,1352642280,1366417754,+1.02%\n
2,United States,Americas,Northern America,327096265,329064917,+0.60%\n
3,Indonesia,Asia,South-eastern Asia,267670543,270625568,+1.10%\n
4,Pakistan,Asia,Southern Asia,212228286,216565318,+2.04%\n
...,...,...,...,...,...,...
228,Montserrat (United Kingdom),Americas,Caribbean,4993,4989,−0.08%\n
229,Falkland Islands (United Kingdom),Americas,South America,3234,3377,+4.42%\n
230,Niue (New Zealand),Oceania,Polynesia,1620,1615,−0.31%\n
231,Tokelau (New Zealand),Oceania,Polynesia,1319,1340,+1.59%\n


In [None]:
# Faster way to extract dataframe
dataframe_list = pd.read_html(url, flavor='bs4')
population_df = dataframe_list[table_index]
population_df

Unnamed: 0,Country/Area,UN continentalregion[4],UN statisticalsubregion[4],Population(1 July 2018),Population(1 July 2019),Change
0,China[a],Asia,Eastern Asia,1427647786,1433783686,+0.43%
1,India,Asia,Southern Asia,1352642280,1366417754,+1.02%
2,United States,Americas,Northern America,327096265,329064917,+0.60%
3,Indonesia,Asia,South-eastern Asia,267670543,270625568,+1.10%
4,Pakistan,Asia,Southern Asia,212228286,216565318,+2.04%
...,...,...,...,...,...,...
229,Falkland Islands (United Kingdom),Americas,South America,3234,3377,+4.42%
230,Niue (New Zealand),Oceania,Polynesia,1620,1615,−0.31%
231,Tokelau (New Zealand),Oceania,Polynesia,1319,1340,+1.59%
232,Vatican City[z],Europe,Southern Europe,801,799,−0.25%


In [None]:
# Drop last row
population_df.drop(population_df.tail(1).index, inplace = True)

# Drop the columns that are not needed
population_2019 = population_df.drop(['UN continentalregion[4]', 'UN statisticalsubregion[4]', 'Population(1 July 2018)', 'Change'], axis='columns')
population_2019

Unnamed: 0,Country/Area,Population(1 July 2019)
0,China[a],1433783686
1,India,1366417754
2,United States,329064917
3,Indonesia,270625568
4,Pakistan,216565318
...,...,...
228,Montserrat (United Kingdom),4989
229,Falkland Islands (United Kingdom),3377
230,Niue (New Zealand),1615
231,Tokelau (New Zealand),1340


In [None]:
# I define a function to get the 1st digit of the population value
def get_leading_digit(row):
    num = row['Population(1 July 2019)']
    num_str = str(num)
    return num_str[0]
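
Taking the first character works here because the populations are positive integers. For values with a sign, a leading zero or a decimal point, a more defensive variant could scan for the first non-zero digit instead. The helper below is a hypothetical sketch, not something the notebook relies on.

```python
def first_significant_digit(value):
    """Return the first non-zero digit of value as an int, or None if there is none."""
    for ch in str(value):
        # Skip signs, leading zeros, decimal points and whitespace.
        if ch in "123456789":
            return int(ch)
    return None

print(first_significant_digit(1433783686))
print(first_significant_digit("0.052"))
print(first_significant_digit(-340))
```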

In [None]:
# I add a column for the leading digit
population_2019['Leading Digit'] = population_2019.apply(get_leading_digit, axis=1)

In [None]:
# Isolate the result
result = population_2019.groupby('Leading Digit').count()
result = result['Population(1 July 2019)']
result = pd.DataFrame(result)
result = result.rename(columns={'Population(1 July 2019)': 'Count'})
result

Unnamed: 0_level_0,Count
Leading Digit,Unnamed: 1_level_1
1,71
2,38
3,28
4,23
5,24
6,16
7,10
8,14
9,9


In [None]:
total = result['Count'].sum()

In [None]:
result['Actual Percentage'] = result['Count'] / total
result

Unnamed: 0_level_0,Count,Actual Percentage
Leading Digit,Unnamed: 1_level_1,Unnamed: 2_level_1
1,71,0.304721
2,38,0.16309
3,28,0.120172
4,23,0.098712
5,24,0.103004
6,16,0.06867
7,10,0.042918
8,14,0.060086
9,9,0.038627


In [None]:
result = result.reset_index()
result['Benfords Law'] = result['Leading Digit'].apply(benfords_law)
result

Unnamed: 0,Leading Digit,Count,Actual Percentage,Benfords Law
0,1,71,0.304721,0.30103
1,2,38,0.16309,0.176091
2,3,28,0.120172,0.124939
3,4,23,0.098712,0.09691
4,5,24,0.103004,0.079181
5,6,16,0.06867,0.066947
6,7,10,0.042918,0.057992
7,8,14,0.060086,0.051153
8,9,9,0.038627,0.045757


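The population of nations follows Benford's Law closely: digit 1 leads 30.5% of the values against a predicted 30.1%, and the largest gap is at digit 5 (10.3% observed against 7.9% predicted).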
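
The visual comparison can be backed by a goodness-of-fit check. The sketch below computes Pearson's chi-square statistic by hand from the counts in the result table and compares it with the 5% critical value for 8 degrees of freedom (15.507). Choosing this particular test and significance level is an assumption made for illustration; the notebook itself only compares the columns by eye.

```python
import math

# Counts per leading digit 1..9, copied from the population result table.
observed_counts = [71, 38, 28, 23, 24, 16, 10, 14, 9]
total = sum(observed_counts)

# Counts that Benford's Law would predict for the same number of values.
expected_counts = [total * math.log10(1 + 1 / d) for d in range(1, 10)]

# Pearson's chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed_counts, expected_counts))

# Critical value of the chi-square distribution for 8 degrees of freedom at the 5% level.
CRITICAL_VALUE = 15.507

print("chi-square statistic:", round(chi_square, 3))
print("consistent with Benford's Law at the 5% level:", chi_square < CRITICAL_VALUE)
```

A statistic below the critical value means the deviations are no larger than what random variation alone would be expected to produce.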

# Experiment n. 02: Length of Rivers

In [None]:
url = "https://en.wikipedia.org/wiki/List_of_rivers_by_length"

In [None]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text

In [None]:
soup = BeautifulSoup(data,"html.parser")

In [None]:
#find all html tables in the web page
tables = soup.find_all('table') # in html table is represented by the tag <table>
# number of tables in the web page:
print("In the web page there are", len(tables), "tables")

In the web page there are 7 tables


In [None]:
# I want to retrieve the table containing the following string:
string = "Outflow"

for index,table in enumerate(tables):
    if (string in str(table)):
        table_index = index
print(table_index)

5


In [None]:
# Faster way to extract dataframe
dataframe_list = pd.read_html(url, flavor='bs4')
river_df = dataframe_list[table_index]
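# .copy() gives an independent DataFrame, so later column assignments do not trigger SettingWithCopyWarning
river_df = river_df[['River', 'Length (km)']].copy()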
river_df

Unnamed: 0,River,Length (km)
0,Nile–White Nile–Kagera–Nyabarongo–Mwogo–Rukara...,"6,650(7,088)"
1,Amazon–Ucayali–Tambo–Ene–Mantaro[n 1],"6,400[4](6,992)"
2,Yangtze–Jinsha–Tongtian–Dangqu (Chang Jiang),"6,300(6,418)"
3,Mississippi–Missouri–Jefferson–Beaverhead–Red ...,6275
4,Yenisey–Angara–Selenga–Ider,5539
...,...,...
182,Loire,1012
183,Essequibo,1010
184,Khopyor,1010
185,Tagus(Tajo/Tejo),1006


In [None]:
# Clean badly formatted data
# regex=True is passed explicitly: since pandas 2.0 str.replace treats the pattern as a literal string by default
# Delete parentheses () and what's inside
river_df['Length (km)'] = river_df['Length (km)'].str.replace(r"\(.*\)", "", regex=True)
# Delete square brackets [] and what's inside
river_df['Length (km)'] = river_df['Length (km)'].str.replace(r"\[.*\]", "", regex=True)
# Remove thousands separators
river_df['Length (km)'] = river_df['Length (km)'].str.replace(",", "", regex=False)
river_df


Unnamed: 0,River,Length (km)
0,Nile–White Nile–Kagera–Nyabarongo–Mwogo–Rukara...,6650
1,Amazon–Ucayali–Tambo–Ene–Mantaro[n 1],6400
2,Yangtze–Jinsha–Tongtian–Dangqu (Chang Jiang),6300
3,Mississippi–Missouri–Jefferson–Beaverhead–Red ...,6275
4,Yenisey–Angara–Selenga–Ider,5539
...,...,...
182,Loire,1012
183,Essequibo,1010
184,Khopyor,1010
185,Tagus(Tajo/Tejo),1006
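
To see what the three cleaning steps do to a single cell, here is a minimal sketch that applies the same patterns with the standard `re` module to values taken from the raw table. The function name `clean_length` is hypothetical; the notebook itself does the cleaning with pandas string methods.

```python
import re

def clean_length(raw):
    """Strip parenthesised alternatives, bracketed footnotes and thousands separators."""
    text = re.sub(r"\(.*\)", "", raw)   # drop "(7,088)"-style alternative lengths
    text = re.sub(r"\[.*\]", "", text)  # drop "[4]"-style footnote markers
    return text.replace(",", "")        # drop thousands separators

# Raw values as they appear in the scraped "Length (km)" column.
samples = ["6,650(7,088)", "6,400[4](6,992)", "6275"]
cleaned = [clean_length(s) for s in samples]
print(cleaned)
```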


In [None]:
# I define a function to get the 1st digit of the river length
def get_leading_digit(row):
    num = row['Length (km)']
    num_str = str(num)
    return num_str[0]

In [None]:
# I add a column for the leading digit
river_df['Leading Digit'] = river_df.apply(get_leading_digit, axis=1)



In [None]:
# Isolate the result
result = river_df.groupby('Leading Digit').count()
result = result['Length (km)']
result = pd.DataFrame(result)
result = result.rename(columns={'Length (km)': 'Count'})
result

Unnamed: 0_level_0,Count
Leading Digit,Unnamed: 1_level_1
1,125
2,34
3,14
4,7
5,3
6,4


In [None]:
total = result['Count'].sum()
total

187

In [None]:
result['Actual Percentage'] = result['Count'] / total
result

In [None]:
result = result.reset_index()
result['Benfords Law'] = result['Leading Digit'].apply(benfords_law)
result

Unnamed: 0,Leading Digit,Count,Actual Percentage,Benfords Law
0,1,125,0.668449,0.30103
1,2,34,0.181818,0.176091
2,3,14,0.074866,0.124939
3,4,7,0.037433,0.09691
4,5,3,0.016043,0.079181
5,6,4,0.02139,0.066947


Here river length apparently doesn't follow Benford's Law, and the reason is straightforward: the table I found on Wikipedia contains only rivers longer than 1,000 km. And since no river reaches 7,000 km, there are no counts for digits 7, 8 and 9.

If I could find a complete river table, I suppose it would follow Benford's Law. Even here the data sort of follows a "truncated" Benford's Law, where the smaller digits have the highest frequencies.
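
One way to make the idea of a "truncated" Benford's Law concrete is to keep the Benford probabilities for digits 1 to 6 and rescale them so they sum to 1. This renormalization is an assumption made here for illustration, not a standard definition, and the sketch only sets the resulting values next to the observed shares.

```python
import math

# Counts per leading digit, copied from the river result table.
observed_counts = {1: 125, 2: 34, 3: 14, 4: 7, 5: 3, 6: 4}
total = sum(observed_counts.values())

# Benford probabilities restricted to the digits that can occur (1..6).
benford = {d: math.log10(1 + 1 / d) for d in observed_counts}

# Rescale so the restricted probabilities sum to 1; the normaliser equals log10(7).
normaliser = sum(benford.values())
truncated = {d: p / normaliser for d, p in benford.items()}

# Print digit, observed share and renormalized Benford share side by side.
for d in sorted(observed_counts):
    print(d, round(observed_counts[d] / total, 3), round(truncated[d], 3))
```

The printed columns put the observed shares next to the renormalized ones, so the size of the remaining gap can be read off directly.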