In [1]:
%pylab inline
import pandas as pd

Populating the interactive namespace from numpy and matplotlib


In [2]:
# %load ../helpers.py
import datetime

def dated_filename(fn, ext='.csv'):
    '''Returns `fn` suffixed with today's ISO date and `ext`.'''
    today = datetime.date.today()
    return '{}-{}{}'.format(fn, today, ext)


In [3]:
rankings = pd.read_csv('data/global-alexa-rankings-2019-06-15.csv')

## Give each site a category

Alexa's site ontology treats each URL as a distinct site. So, we will map each `url` to a `url_code`, an integer that uniquely identifies that site.

In [4]:
# One integer code per distinct URL.
num_sites = rankings['url'].nunique()
rankings['url_code'] = (
    rankings['url'].astype('category').cat.rename_categories(range(num_sites))
)
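
To illustrate what this encoding does, here is a minimal plain-Python sketch on made-up URLs. It assumes that categories are ordered lexically (the pandas default for sortable values), so codes are handed out in sorted order. The `urls` list and the `url_to_code` name are hypothetical and not part of the dataset.

```python
# Hypothetical sample of URLs; repeats stand for the same site appearing
# in several countries' rankings.
urls = ['google.com', 'baidu.com', 'google.com', 'youtube.com', 'baidu.com']

# Assign one integer per distinct URL, in sorted order.
url_to_code = {url: code for code, url in enumerate(sorted(set(urls)))}

# Replace every URL with its code, preserving the original order.
url_codes = [url_to_code[url] for url in urls]

print(url_to_code)
print(url_codes)
```

The same site always receives the same code, wherever it appears, which is what lets two rankings be compared as sequences of integers.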

# Calculate Levenshtein distance from global ranking

For each country in the dataset, we will compute the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between that country's top rankings and the global rankings.

Here's a Levenshtein distance implementation from the `textdistance` package:

In [5]:
from textdistance import levenshtein

levenshtein.distance("Python", "Peithen")

3
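
The distance counts the fewest single-element insertions, deletions, and substitutions needed to turn one sequence into the other, and nothing in that definition requires the elements to be characters. As a rough sketch of the idea (a standard dynamic-programming formulation, not necessarily how `textdistance` implements it), the hypothetical function below works on strings and on lists of integers alike:

```python
def levenshtein_distance(a, b):
    """Edit distance between two sequences (strings, lists, ...)."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j elements of `b`.
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]

print(levenshtein_distance("Python", "Peithen"))    # 3, as above
print(levenshtein_distance([3, 1, 2], [1, 2, 3]))   # 2: drop the leading 3, append a 3
```

For two rankings of equal length, the result ranges from 0 (identical order) up to the length of the lists (no position can be aligned).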

To make this work, we'll need to encode each ranking with `as_string`, which returns a list of integers representing the top URLs in that country.

In [6]:
def as_string (country_ranking):
    '''
    Encodes a country's top-ranked sites as a list of codes, where each code identifies a URL.
    Returns a list of integers, in the same order as the rows of `country_ranking`.
    '''
    return country_ranking['url_code'].tolist()

global_rankings = as_string(rankings[rankings['country_name']=='Global'])

In [7]:
results = []
for country, group in rankings.groupby('country_name'):
    # Skip the global list itself; it is the reference, not a country.
    if country == 'Global':
        continue
    results.append({
        'country': country,
        'levenshtein_from_global_ranking':
            levenshtein.distance(as_string(group), global_rankings),
    })
results = pd.DataFrame(results)

In [9]:
results.sort_values('levenshtein_from_global_ranking')

Unnamed: 0,country,levenshtein_from_global_ranking
132,Puerto Rico,42
135,Romania,43
171,United Kingdom,43
76,Ireland,43
93,Luxembourg,44
14,Belgium,44
10,Bahrain,44
52,Finland,44
141,Serbia,44
43,Denmark,44


In [23]:
results.sort_values('levenshtein_from_global_ranking').tail()

Unnamed: 0,country,levenshtein_from_global_ranking
74,Iran,49
32,China,50
136,Russia,50
107,Moldova,50
5,Armenia,50


In [10]:
results.to_csv(dated_filename('analysis/levenshtein_from_global_ranking'))