In [36]:
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns

# Land border information from Wikipedia

I'm going to take a look at this [table from Wikipedia](https://en.wikipedia.org/wiki/List_of_countries_and_territories_by_land_borders),
which lists all the land borders that each country has.

In [23]:
# scrape and parse the data from wikipedia

from lxml import html
import requests

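page = requests.get("https://en.wikipedia.org/wiki/List_of_countries_and_territories_by_land_borders")
# stop here if the request failed, rather than parsing an error page
page.raise_for_status()
tree = html.fromstring(page.content)
# the code assumes the first wikitable on the page is the land border table
table = tree.cssselect("table.wikitable")[0]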

# this is the list that we're going to stick all the data on
tabledata = []
# for each row in the table data
for row in table[1:]:
    # extract the table data elements
    tds = row.cssselect("td")
    # stick this on the list of rows
    tabledata.append(
        [
            td.text_content()
            for td in tds
        ]
    )
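
To show what that loop produces, here is a small self-contained sketch that pulls the cell text out of a made-up table.
It uses the standard library's `html.parser` instead of `lxml`, so it is not the code this notebook runs;
the `TableTextParser` class and the sample HTML are hypothetical and only there for illustration.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # text inside nested tags (such as <sup> notes) is kept as well
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # header rows only contain <th> cells, so they end up empty and are skipped
            if self._row:
                self.rows.append(self._row)
            self._row = None


# made-up sample in the style of the Wikipedia table
sample_html = """
<table class="wikitable">
  <tr><th>Country</th><th>Length</th></tr>
  <tr><td>Monaco</td><td>4.4<sup>[1]</sup></td></tr>
  <tr><td>Vatican City</td><td>3.2</td></tr>
</table>
"""

parser = TableTextParser()
parser.feed(sample_html)
sample_rows = parser.rows
print(sample_rows)
```

Each row comes out as a list of strings with the footnote markers still attached, which is why the tidying step further down is needed.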

In [91]:
# convert to a pandas dataframe, naming the columns sensibly
import pandas as pd
import numpy as np
df = pd.DataFrame(tabledata,
                 columns = ["country", 
                            "land_border_length", 
                            "number_borders",
                            "number_neighbours",
                            "neighbour_list"
                           ])

In [25]:
# tidy up this data frame a little bit
import re
def convertnumeric(x):
    """
    Data tidying: remove the commas and notes from numeric values.
    """
    # remove stuff in brackets, these are little notes that we don't need
    x = re.sub(r"[\[\(].*[\]\)]", "", x)
    # remove the commas from the numbers
    x = x.strip().replace(",", "")
    # convert to numeric, treating anything unparseable (e.g. an empty cell) as zero
    try:
        return float(x)
    except ValueError:
        return 0.0

def remove_notes(x):
    """
    Data tidying: remove any notes from the country names.
    """
    x = re.sub(r"[\[\(].*", "", x)
    return x.strip()

clean = (
    df
    .assign(country = df.country.map(remove_notes))
    .assign(land_border_length = df.land_border_length.map(convertnumeric))
    .assign(number_borders = df.number_borders.map(convertnumeric))
    .assign(number_neighbours = df.number_neighbours.map(convertnumeric))
)
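
To make the tidying concrete, here is a minimal sketch of what the two helpers do to a few raw cell values.
The sample strings are made up in the style of the table rather than copied from it,
and `to_number` and `strip_notes` are hypothetical stand-ins for the functions above.

```python
import re


def to_number(cell):
    """Drop bracketed notes and thousands separators, then parse as a float."""
    cell = re.sub(r"[\[(].*[\])]", "", cell)  # remove "[...]" or "(...)" notes
    cell = cell.strip().replace(",", "")
    try:
        return float(cell)
    except ValueError:
        return 0.0  # unparseable cells fall back to zero


def strip_notes(cell):
    """Keep only the text before the first bracket or parenthesis."""
    return re.sub(r"[\[(].*", "", cell).strip()


print(to_number("22,147[2]"))               # note and comma removed
print(to_number(" 1,533 "))                 # surrounding whitespace removed
print(to_number("n/a"))                     # not a number, so zero
print(strip_notes("Macau[29] (PR China)"))  # only the name survives
```

One thing to keep in mind is that anything that can't be parsed becomes `0.0`, so a zero in the cleaned data can mean either "no border" or "no usable number".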

I've extracted a nice clean dataset from Wikipedia (click the `show code` button at the end of this article to see how).
Now I can start to ask a few questions.
Which country has the longest border in total?

In [40]:
(
    clean
    .loc[clean.land_border_length == clean.land_border_length.max()]
)

Unnamed: 0,country,land_border_length,number_borders,number_neighbours,neighbour_list
53,People's Republic of China,22147,19,16,Afghanistan: 76 km\n Bhutan: 470 km\n Hong Ko...


Unsurprisingly, China.
China is big.

Which country has the most neighbours?

In [54]:
(
    clean
    .loc[clean.number_neighbours == clean.number_neighbours.max()]
    [['country']]
)

Unnamed: 0,country
83,European Union


Hmm.
The Wikipedia data has a few strange things in it like that.
The European Union isn't a country,
and the list also includes a lot of *territories* that add a bit of confusion.
Which places have the most neighbours when we exclude the EU?

In [56]:
(
    clean
    .query("country != 'European Union'")
    .sort_values("number_neighbours", ascending = False)
    .head(5)
).apply(lambda x: print("%s\t%d" % (x.country, x.number_neighbours)), axis = 1);

People's Republic of China	16
Russia	14
France 
→includes:
→ Clipperton Island
→ French Guiana
→ French Polynesia
→ French Southern and Antarctic Lands
→ Guadeloupe
→ Martinique
→ Mayotte
→ New Caledonia
→ Réunion
→ Saint Barthélemy
→ Saint Martin
→ Saint Pierre and Miquelon
→ Wallis and Futuna	11
Brazil	10
Germany	9


China again.
France is cheating a little bit as well because it includes a lot of overseas territories.

Here are all the countries that border China.
A number in parentheses tells you how many border segments there are,
so Russia and China have two separate borders.
The numbers in square brackets are notes,
explaining things like the fact that Hong Kong isn't a sovereign state
but a special administrative region of China.

In [69]:
(
    clean
    .loc[clean.country == "People's Republic of China"]
    
)['neighbour_list'].map(print);

 Afghanistan: 76 km
 Bhutan: 470 km
 Hong Kong[28] (PR China): 30 km
 India (3): 3,380 km
 Kazakhstan: 1,533 km
 North Korea: 1,416 km
 Kyrgyzstan: 858 km
 Laos: 423 km
 Macau[29] (PR China): 0.34 km
 Mongolia: 4,677 km
 Myanmar: 2,185 km
   Nepal: 1,236 km
 Pakistan:[30] 523 km
 Russia (2): 3,645 km
 Tajikistan: 414 km
 Vietnam: 1,281 km
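
The notation in that listing can be unpicked with a few regular expressions.
This is only a sketch: `parse_neighbour` is a hypothetical helper, not the parsing code used in this notebook,
and it assumes every line looks like the ones above (name, optional notes, optional segment count, optional length in km).

```python
import re


def parse_neighbour(line):
    """Split one neighbour line into (name, segment count, length in km)."""
    # the name is everything before the first bracket, parenthesis or colon
    name = re.split(r"[\[(:]", line, maxsplit=1)[0].strip()

    # a parenthesised integer such as "(3)" is the number of border segments
    match = re.search(r"\((\d+)\)", line)
    segments = int(match.group(1)) if match else 1

    # the length is the number directly before "km"; some lines have none
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*km", line)
    length_km = float(match.group(1).replace(",", "")) if match else None

    return name, segments, length_km


for line in [
    " India (3): 3,380 km",
    " Hong Kong[28] (PR China): 30 km",
    " Pakistan:[30] 523 km",
]:
    print(parse_neighbour(line))
```

Note that `(PR China)` is not mistaken for a segment count, because only a parenthesised integer counts as one.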


Now we'll have a look at the countries that don't have any land border at all,
by filtering for a land border length of zero.

In [71]:
(
    clean
    .query("land_border_length == 0")
    [['country', "number_neighbours"]]
)

Unnamed: 0,country,number_neighbours
0,Adélie Land,1
5,American Samoa,0
6,Amsterdam Island and Île Saint-Paul,0
9,Anguilla,0
10,Antártica Chilena Province,2
11,Antigua and Barbuda,0
14,Aruba,0
15,Ashmore and Cartier Islands,0
16,Australia,0
17,Australian Antarctic Territory,3


These are almost all islands.
There's something funny going on though:
some of those places have no land border but do have some land neighbours.
What's going on here?

In [27]:
(
    clean
    .query("land_border_length == 0 & number_neighbours > 0")
    [['country', "number_neighbours"]]
)

Unnamed: 0,country,number_neighbours
0,Adélie Land,1
10,Antártica Chilena Province,2
17,Australian Antarctic Territory,3
38,British Antarctic Territory,3
69,Northern Cyprus,2
180,Nagorno-Karabakh Republic,3
212,Queen Maud Land,2
215,Ross Dependency,1
239,Somaliland,3
259,Transnistria,2


These places fall into two categories.
Some are to do with Antarctic claims, such as
[Adélie Land](https://en.wikipedia.org/wiki/Ad%C3%A9lie_Land),
[Queen Maud Land](https://en.wikipedia.org/wiki/Queen_Maud_Land) and
[Ross Dependency](https://en.wikipedia.org/wiki/Ross_Dependency).

The others are places where the sovereignty is disputed.
So they've been given a zero land border length because people don't agree on what it should be.

Now I'll do a calculation on the data.
Some borders are split into multiple sections.
I'll divide the total border length of each country by the total number of borders they have.
This gives me the average segment length of that country.
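
As a quick worked example of that calculation, here is the arithmetic for three countries in plain Python.
The figures are the ones that appear in this notebook's tables;
`average_segment` is a hypothetical helper, and returning `None` for a place with no borders is my own choice for the sketch.

```python
# (country, total land border in km, number of separate borders)
examples = [
    ("Canada", 8893.0, 2),
    ("United States", 8893.0 + 3141.0, 3),  # Canada (2 borders) plus Mexico
    ("Sri Lanka", 0.1, 1),
]


def average_segment(total_km, n_borders):
    """Mean length per border; None when there are no borders to divide by."""
    if n_borders == 0:
        return None
    return total_km / n_borders


for country, total_km, n_borders in examples:
    print(country, round(average_segment(total_km, n_borders), 1))
# Canada 4446.5
# United States 4011.3
# Sri Lanka 0.1
```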

Which countries have the shortest borders on average?

In [168]:
(
    clean
    .assign(avgsegment = clean.land_border_length / clean.number_borders)
    .query("number_neighbours > 0 & avgsegment > 0")
    .sort_values(by = "avgsegment")
    .head(10)
    [['country', 'number_borders', 'number_neighbours', 'neighbour_list', 'avgsegment']]
)

Unnamed: 0,country,number_borders,number_neighbours,neighbour_list,avgsegment
244,Sri Lanka,1,1,India: 0.1 km,0.1
154,Macau,1,1,People's Republic of China: 0.34 km,0.34
99,Gibraltar,1,1,Spain: 1.2 km,1.2
277,Vatican City,1,1,Italy: 3.2 km,3.2
173,Monaco,1,1,France: 4.4 km,4.4
234,Sint Maarten,1,1,Saint Martin[45] (France): 10.2 km,10.2
222,Saint Martin,1,1,Sint Maarten[14] (Netherlands): 10.2 km,10.2
115,Hong Kong,1,1,People's Republic of China: 30 km,30.0
2,Akrotiri and Dhekelia,5,1,Cyprus (5):[6],30.4
68,Cyprus,5,1,Akrotiri and Dhekelia[5] (United Kingdom) (5)[6],30.4


Number 1 there is Sri Lanka.
I'd have put Sri Lanka down as an island,
but apparently there is a land border on [Rama's Bridge](https://en.wikipedia.org/wiki/Adam%27s_Bridge).

Which countries have the longest borders on average?

In [169]:
(
    clean
    .assign(avgsegment = clean.land_border_length / clean.number_borders)
    .query("number_neighbours > 0")
    .sort_values(by = "avgsegment", ascending = False)
    .head(10)
    [['country', 'number_borders', 'number_neighbours', 'neighbour_list', 'avgsegment']]
)

Unnamed: 0,country,number_borders,number_neighbours,neighbour_list,avgsegment
47,Canada,2,1,"United States (2): 8,893 km[26]",4446.5
174,Mongolia,2,2,"People's Republic of China: 4,677 km\n Russia...",4110.0
272,United States,3,2,"Canada (2): 8,893 km[26]\n Mexico: 3,141 km",4011.333333
135,Kazakhstan,5,5,"People's Republic of China: 1,533 km\n Kyrgyz...",2402.4
24,Bangladesh,2,2,"India (199):[19] 4,053 km\n Myanmar: 193 km",2123.0
197,Pakistan,4,4,"Afghanistan: 2,430 km\n People's Republic of ...",1693.5
278,Venezuela,3,3,"Brazil: 2,200 km\n Colombia: 2,050 km\n Guyan...",1664.333333
12,Argentina,6,5,"Bolivia: 832 km\n Brazil: 1,224 km\n Chile (2...",1610.833333
120,India,9,7,"Bangladesh (199):[19] 4,053 km\n Bhutan: 605 ...",1567.011111
279,Vietnam,3,3,"Cambodia: 1,228 km\n People's Republic of Chi...",1546.333333


Canada has two borders with the United States,
and both of them are huge:
a *6,416 km* border with the contiguous 48 states,
and a *2,477 km* border with Alaska.

In [160]:
# using ideas from here, extract that neighbour list into something more sensible
# http://stackoverflow.com/questions/13050003/pandas-apply-function-to-dataframe-that-can-return-multiple-rows

def extractneighbours(x):
    """
    Split one country's neighbour list into one row per neighbour.
    """
    neighbourlist = x.neighbour_list.values[0].split("\n")
    df_list = []
    for neighbour in neighbourlist:
        # extract the neighbour name
        name = re.sub(r"[\[\(].*", "", neighbour)
        name = re.sub(r":.*", "", name).strip()

        # extract the number of segments, e.g. "(3)", defaulting to one
        segments = re.search(r"\([0-9]+\)", neighbour)
        if segments:
            segments = int(segments.group().replace("(", "").replace(")", ""))
        else:
            segments = 1

        # how long is this border?
        # take everything after the last space in the string
        length = neighbour[neighbour.rfind(" "):]
        length = re.sub(r"[\[\(].*", "", length)
        try:
            length = float(length.replace("\xa0km", "").replace(",", ""))
        except ValueError:
            # no usable length on this line
            length = np.nan

        df_list.append({
            "neighbour_name" : name,
            "segments" : segments,
            "border_length" : length
        })
    return pd.DataFrame(df_list)

segments = (
    clean
    .groupby("country")
    .apply(extractneighbours)
    .reset_index()
    .drop("level_1", axis = 1)
)


In [164]:
(
    segments
    .sort_values(by = "border_length", ascending = False)
    .head(8)
)

Unnamed: 0,country,border_length,neighbour_name,segments
783,United States,8893,Canada,2
137,Canada,8893,United States,2
384,Kazakhstan,6846,Russia,1
612,Russia,6846,Kazakhstan,1
33,Argentina,5300,Chile,2
152,Chile,5300,Argentina,2
474,Mongolia,4677,People's Republic of China,1
567,People's Republic of China,4677,Mongolia,1


The longest single border stretch is between Russia and Kazakhstan.
The United States and Canada have the longest border between two countries,
but that is split into two segments.
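
Here is that reasoning spelled out as a small sketch.
The table only gives the total length per pair of countries,
so I compare the longest unbroken border with the individual stretches of the one border that has a larger total.
The numbers are the ones quoted in this notebook; the variable names are just for illustration.

```python
# one entry per pair of countries, taken from the table above
borders = [
    {"pair": ("United States", "Canada"), "length_km": 8893, "segments": 2},
    {"pair": ("Kazakhstan", "Russia"), "length_km": 6846, "segments": 1},
    {"pair": ("Argentina", "Chile"), "length_km": 5300, "segments": 2},
    {"pair": ("Mongolia", "People's Republic of China"), "length_km": 4677, "segments": 1},
]

# longest border overall, however many pieces it is split into
longest_total = max(borders, key=lambda b: b["length_km"])

# longest border that is a single unbroken stretch
unbroken = [b for b in borders if b["segments"] == 1]
longest_unbroken = max(unbroken, key=lambda b: b["length_km"])

# the two stretches of the United States - Canada border quoted earlier
us_canada_stretches_km = [6416, 2477]

print(longest_total["pair"], longest_total["length_km"])
print(longest_unbroken["pair"], longest_unbroken["length_km"])
print(max(us_canada_stretches_km) < longest_unbroken["length_km"])
```

The last line checks that even the longer stretch of the United States and Canada border is shorter than the Russia and Kazakhstan one.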

On the subject of segments, 
which borders are split into the most separate segments?

In [166]:
(
    segments
    .sort_values(by = ["segments", "country"], ascending = False)
    .head(8)
)

Unnamed: 0,country,border_length,neighbour_name,segments
334,India,4053,Bangladesh,199
63,Bangladesh,4053,India,199
507,Netherlands,450,Belgium,31
75,Belgium,450,Netherlands,31
790,Uzbekistan,1099,Kyrgyzstan,6
404,Kyrgyzstan,1099,Uzbekistan,6
282,Germany,167,Belgium,6
73,Belgium,167,Germany,6


What on earth is going on with the India and Bangladesh border?
At first this appears to be a problem with the data,
but it is actually correct.
Within Bangladesh there were [102 enclaves](https://en.wikipedia.org/wiki/India%E2%80%93Bangladesh_enclaves) of Indian territory,
with 21 counter-enclaves of Bangladesh within them,
and a [triple enclave](https://en.wikipedia.org/wiki/Dahala_Khagrabari).

This border is currently being simplified,
with the enclaves being swapped between the two countries.
This data might be out of date soon,
if it isn't already.

[This article](http://mentalfloss.com/article/29086/its-complicated-5-puzzling-international-borders) tells the story of these enclaves in a bit more detail,
and also tells you a bit more about what is going on with the Netherlands - Belgium border.

In [167]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="show code"></form>''')