In [1]:
%pylab inline
import pandas as pd
from helpers import dated_filename

Populating the interactive namespace from numpy and matplotlib


In [2]:
rankings = pd.read_csv('data/global-alexa-rankings-2019-06-15.csv')

## Give each site a category

Each URL is unique in Alexa's site ontology, so we map each `url` to a `url_code`: an integer that uniquely identifies that site.

In [3]:
# Map each distinct URL to an integer code (0 .. num_sites - 1).
num_sites = rankings['url'].nunique()
rankings['url_code'] = (
    rankings['url'].astype('category').cat.rename_categories(range(num_sites))
)
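To make the encoding concrete, here is a minimal sketch on a toy frame. The country `Freedonia` and the `*.example` URLs are made up for illustration and are not in the Alexa data. It also suggests that pandas' `cat.codes` accessor should yield the same integers as the rename above, assuming there are no missing URLs.

```python
import pandas as pd

# Hypothetical toy data, standing in for the real rankings file.
toy = pd.DataFrame({
    'country_name': ['Global', 'Global', 'Freedonia', 'Freedonia'],
    'url': ['b.example', 'a.example', 'a.example', 'c.example'],
})

# Same approach as the notebook: rename the (sorted) categories to 0..n-1.
n = toy['url'].nunique()
renamed = toy['url'].astype('category').cat.rename_categories(list(range(n)))

# Alternative: read the integer codes straight off the categorical.
codes = toy['url'].astype('category').cat.codes.tolist()

print(renamed.tolist())  # categories are sorted, so a.example -> 0, b.example -> 1, c.example -> 2
print(codes)
```

Either way, the same URL always receives the same code regardless of which country it appears under, which is what lets rankings from different countries be compared.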

## Calculate Levenshtein distance from global ranking

For each country in the dataset, we will compute the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between that country's top rankings and the global rankings.

Here's a Levenshtein distance implementation from the `textdistance` package:

In [4]:
from textdistance import levenshtein

levenshtein.distance("Python", "Peithen")

3

To make this work, we'll need to encode each country's rankings with `as_string`, which returns a list of integers representing the top URLs in that country.

In [5]:
def as_string(country_ranking):
    '''
    Encodes a country's top-ranked URLs as a list of codes, where each code identifies a URL.
    Returns a list of integers.
    '''
    return country_ranking['url_code'].tolist()

global_rankings = as_string(rankings[rankings['country_name']=='Global'])
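Since `as_string` returns a list of integers rather than text, the distance is counted in whole-site edits: inserting, deleting, or substituting one site in the ranking each costs 1. The following is a rough pure-Python sketch of that idea, not the `textdistance` implementation. The function name `sequence_levenshtein` and the toy rankings are hypothetical.

```python
def sequence_levenshtein(a, b):
    """Edit distance between two sequences; insert, delete, and substitute each cost 1."""
    # previous[j] holds the distance between the first i-1 items of a and the first j items of b.
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # turning i items of a into an empty sequence takes i deletions
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


# Hypothetical rankings encoded as url_code lists.
global_toy = [0, 1, 2, 3]
country_toy = [0, 4, 1, 3]

print(sequence_levenshtein(country_toy, global_toy))  # 2: two positions differ
print(sequence_levenshtein("Python", "Peithen"))      # 3, matching the string example above
```

A country whose top sites match the global list in the same order scores 0, and the score grows as sites are swapped in, dropped, or shifted.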

In [6]:
results = []
for country, group in rankings.groupby('country_name'):
    # Skip the global ranking itself; its distance to itself is trivially 0.
    # Compare with !=, not `is not`: identity checks on strings are unreliable.
    if country != 'Global':
        results.append({
            'country': country,
            'levenshtein_from_global_ranking':
                levenshtein.distance(as_string(group), global_rankings),
        })
results = pd.DataFrame(results)

In [8]:
# sort_values returns a new DataFrame, so assign it back before saving.
results = results.sort_values('levenshtein_from_global_ranking')
results.to_csv(dated_filename('analysis/levenshtein_from_global_ranking'))