### Problem definition

Here are some assumptions I have made:

* The probability of a given location _p(i)_ is the number of times it is visited divided by the total number of visits in the dataset.
* Airport arrivals have no clear notion of uniqueness: a person can arrive at only one airport at a time. Therefore every visit is counted as a unique visit.
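
The first assumption can be sketched in plain Python. This is only an illustration, not the `calc_location_entropy` from `utils` (whose implementation is not shown here): the function name `location_entropy`, the toy counts, and the choice of the natural logarithm are all assumptions.

```python
import math


def location_entropy(visits_by_location):
    """Shannon entropy (natural log, an assumption) of {location: visit count}."""
    total = float(sum(visits_by_location.values()))
    entropy = 0.0
    for count in visits_by_location.values():
        if count > 0:  # a location with no visits contributes nothing
            p = count / total  # p(i) = visits to i / all visits
            entropy -= p * math.log(p)
    return entropy


# Hypothetical toy counts, not taken from the dataset
toy_counts = {"A": 50, "B": 25, "C": 25}
print(location_entropy(toy_counts))
```

The more evenly the visits are spread across locations, the higher the value; a single location receiving every visit gives an entropy of 0.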

The dataset I found is "Air Passenger Arrivals - Total by Region and Selected Country of Embarkation", collected by Singapore Changi Airport and available at https://data.gov.sg/dataset/air-passenger-arrivals-total-by-region-and-selected-country-of-embarkation

In [95]:
import pandas as pd
from pyspark.sql import functions as funcs
from pyspark.sql.types import IntegerType

from utils import calc_location_entropy

In [94]:
df = spark.read.csv('file:///tmp/data.csv', header='true', inferSchema='true')

### ETL on input data:

* Keep only rows with a valid number of arrivals
* Type conversion
* Standardize column names

In [None]:
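# Drop rows whose arrival count is missing ("na"), cast the count to an integer,
# and rename `country` to `location`.
loc_arrivals_df = df.filter('value != "na"') \
    .withColumn('num_visits', df['value'].cast(IntegerType())) \
    .withColumnRenamed('country', 'location') \
    .cache()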

### Global entropy

In [83]:
calc_location_entropy(loc_arrivals_df)

3.129235821660711

### Entropy analytics by year

In [84]:
loc_arrivals_df.select(funcs.min('month'), funcs.max('month')).first()

Row(min(month)=u'1961-01', max(month)=u'2017-09')

In [85]:
# This calculation takes a while, so re-run it only when necessary.
# In Scala, `calc_location_entropy` could be written as an aggregate function, which would likely make it much faster.

entropy_by_year = {}
for year in range(1961, 2017 + 1):
    yearly_df = loc_arrivals_df.filter('month like "%d%%"' % year)
    entropy_by_year[year] = calc_location_entropy(yearly_df)
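
The loop above scans the data once per year. As a rough illustration of the aggregation idea, the sketch below computes every year's entropy in a single pass over in-memory rows. It is plain Python rather than Spark, and it assumes the entropy is taken over the summed visits per location within each year, which may differ from what `calc_location_entropy` does. The function name `entropy_by_year_one_pass` and the toy rows are hypothetical.

```python
import math
from collections import defaultdict


def entropy_by_year_one_pass(rows):
    """rows: iterable of (month, location, num_visits), with month like '1961-01'."""
    # Accumulate visit counts per year and per location in one pass
    counts = defaultdict(lambda: defaultdict(int))
    for month, location, num_visits in rows:
        year = int(month[:4])
        counts[year][location] += num_visits

    # Turn each year's counts into an entropy value (natural log, an assumption)
    result = {}
    for year, by_location in counts.items():
        total = float(sum(by_location.values()))
        result[year] = -sum(
            (c / total) * math.log(c / total)
            for c in by_location.values()
            if c > 0
        )
    return result


# Hypothetical toy rows, not taken from the dataset
toy_rows = [
    ("1961-01", "A", 10),
    ("1961-02", "B", 10),
    ("1962-01", "A", 30),
    ("1962-01", "B", 10),
]
per_year = entropy_by_year_one_pass(toy_rows)
print(per_year)
```

In this toy data, 1961 splits its visits evenly between two locations and therefore gets a higher entropy than 1962, where one location dominates.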

In [89]:
pd_df = pd.DataFrame(data={'year': list(entropy_by_year.keys()), 'location_entropy': list(entropy_by_year.values())})

In [97]:
# sort_values returns a new DataFrame, so assign the result back
pd_df = pd_df.sort_values(by=['year'])
pd_df

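|   | location_entropy | year |
|---|------------------|------|
| 0 | 2.242444 | 1961 |
| 1 | 2.251619 | 1962 |
| 2 | 2.246052 | 1963 |
| 3 | 2.130665 | 1964 |
| 4 | 2.16347 | 1965 |
| 5 | 1.869591 | 1966 |
| 6 | 2.045576 | 1967 |
| 7 | 2.118908 | 1968 |
| 8 | 2.232388 | 1969 |
| 9 | 2.273982 | 1970 |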


### Discoveries

* Location entropy represents the _diversity_ of the locations being visited, so in this case it shows how diversely the world's airports have been visited.
* The year 1966 has the lowest entropy in the dataset.
* The general trend is that location entropy increases over time.
* Within the most recent 10 years, 2008 has the highest entropy.
* The location entropy of airport arrivals is probably related to the global economy: the more prosperous the economy, the more diversely people arrive at airports around the globe.
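
Statements about the lowest or highest year can be checked directly from the yearly values. The sketch below uses only the ten rows displayed in the table above (1961 to 1970), so it says nothing about later years. The variable names are illustrative.

```python
# Yearly entropy values copied from the displayed rows (1961-1970 only)
shown_entropy = {
    1961: 2.242444,
    1962: 2.251619,
    1963: 2.246052,
    1964: 2.130665,
    1965: 2.16347,
    1966: 1.869591,
    1967: 2.045576,
    1968: 2.118908,
    1969: 2.232388,
    1970: 2.273982,
}

# Year with the smallest and largest entropy among the displayed rows
lowest_year = min(shown_entropy, key=shown_entropy.get)
highest_year = max(shown_entropy, key=shown_entropy.get)
print(lowest_year, highest_year)
```

Applying the same two lines to the full `entropy_by_year` dictionary would cover every year through 2017.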