In [1]:
#import packages
import re
from geopy.distance import great_circle

In [2]:
#Mathematica outputs names in camelCase; convert them to snake_case.
def camelcase_to_snakecase(name):
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()

In [7]:
#input file paths
wt_centroids_infile = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/glm/raw-data/pop-weighted-centroids.tsv'
pax_traffic_infile = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/glm/raw-data/Americas-to-Americas-2012-2016_KK_DB.xlsx'
#output file paths
test_outfile = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/glm/origin-destin-formatted-data/test-predictors.tsv'
great_cirle_dists_outfile = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/glm/origin-destin-formatted-data/great-circle-dists.tsv'
north_south_outfile = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/glm/origin-destin-formatted-data/north-south-indicator.tsv'


In [5]:
#read in latitudes and longitudes for each country as a dict
#key is the country in snake_case, value is a tuple (latitude,longitude)

weighted_pop_centroids = {}
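#plain 'r' mode: the 'U' flag is unnecessary in Python 3 and was removed in 3.11
with open(wt_centroids_infile,'r') as centroids_file:
    for line in centroids_file:
        fields = line.rstrip('\n').split('\t') #columns: name, latitude, longitude
        place = camelcase_to_snakecase(fields[0])
        latitude = float(fields[1])
        longitude = float(fields[2])
        weighted_pop_centroids[place] = (latitude,longitude)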

In [6]:
#filter the country list, since Mathematica gives us some countries that we aren't interested in.
test_countries = ['united_states','mexico','canada'] #smaller set for testing scripts and such
excluded_countries = ['falkland_islands','montserrat', 'greenland', 'saint_pierre_miquelon']
included_countries = [key for key in weighted_pop_centroids.keys() if key not in excluded_countries]

## Great Circle Distances predictor data

This predictor is a true pairwise matrix. The latitude and longitude for each country are weighted averages of the latitudes and longitudes of the cities within that country. The weight accorded to each city is the proportion of the country's population that lives in that city.
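
As an illustration of the weighting described above, here is a minimal Python sketch of a population-weighted centroid. It is not the code that produced the infile. The function name `weighted_centroid`, the input layout, and the two example cities are all hypothetical. It also assumes that simply averaging longitudes is acceptable, which breaks down for countries that straddle the antimeridian.

```python
def weighted_centroid(cities):
    """Return (latitude, longitude) as a population-weighted average.

    `cities` is a list of (latitude, longitude, population) tuples.
    Hypothetical helper: the input layout is an assumption for illustration.
    """
    total_pop = sum(pop for _, _, pop in cities)
    # Each city's weight is its share of the total population.
    centroid_lat = sum(city_lat * pop for city_lat, _, pop in cities) / total_pop
    centroid_lon = sum(city_lon * pop for _, city_lon, pop in cities) / total_pop
    return (centroid_lat, centroid_lon)

# Made-up country with two cities; the larger city pulls the centroid toward it.
example_cities = [(10.0, -60.0, 300000), (20.0, -70.0, 100000)]
example_centroid = weighted_centroid(example_cities)
```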

The infile containing the weighted population centroids was generated in Mathematica; see `weighted-centroids.nb` for the code.

Predictors are written to file as a long-edge TSV: one row per ordered pair, with both the origin and the destination specified. For example:

origin | destination | value
--------|-----------| ----
united_states   |    mexico      |   XXX
mexico    |    united_states      |   ZZZ
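
If the pairwise matrix form is needed again downstream, a long-edge file can be read back into a nested dict. The sketch below is one possible way to do that. The helper name `long_edge_to_matrix`, its `value_column` parameter, and the in-memory example rows are hypothetical, and the values are placeholders rather than real predictor values.

```python
import csv
import io

def long_edge_to_matrix(handle, value_column='value'):
    """Read a long-edge TSV into a nested dict: matrix[origin][destination] = value.

    Hypothetical helper; `value_column` names the third column of the file.
    """
    matrix = {}
    reader = csv.DictReader(handle, delimiter='\t')
    for row in reader:
        matrix.setdefault(row['origin'], {})[row['destination']] = float(row[value_column])
    return matrix

# Placeholder rows standing in for a real outfile.
example_tsv = "origin\tdestination\tvalue\na\tb\t1.5\nb\ta\t2.5\n"
example_matrix = long_edge_to_matrix(io.StringIO(example_tsv))
```

For the files written in this notebook, `value_column` would be the header of the third column, for instance `great_circle_dist_km`.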

All great circle distances are presented in kilometers (metric for the win).
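
For readers who want to see what a great circle distance involves, the haversine formula is one standard way to compute it on a sphere. The sketch below is for illustration only and is not necessarily what `geopy` does internally. The function name `haversine_km` and the mean Earth radius constant are assumptions, so small differences from `great_circle(...).kilometers` are possible.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.009  # assumed mean Earth radius in kilometers

def haversine_km(point_a, point_b):
    """Spherical distance in km between two (latitude, longitude) tuples in degrees."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points.
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

# A quarter of the way around the equator.
quarter_equator_km = haversine_km((0.0, 0.0), (0.0, 90.0))
```

The distance is symmetric, so the `origin -> destination` and `destination -> origin` rows of this predictor carry the same value.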

In [11]:
# write test tsv file
with open(test_outfile,'w') as file:
    file.write('{}\t{}\t{}\n'.format('origin', 'destination', 'great_circle_dist_km'))
    for origin in test_countries:
        for destination in test_countries:
            if origin == destination: #don't actually want to store the diagonal values of the matrix.
                continue
            else:
                dist = great_circle(weighted_pop_centroids[origin],weighted_pop_centroids[destination]).kilometers
                file.write('{}\t{}\t{}\n'.format(origin, destination, dist))

# write actual full tsv file
with open(great_cirle_dists_outfile,'w') as file:
    file.write('{}\t{}\t{}\n'.format('origin', 'destination', 'great_circle_dist_km'))
    for origin in included_countries:
        for destination in included_countries:
            if origin == destination: #don't actually want to store the diagonal values of the matrix.
                continue
            else:
                dist = great_circle(weighted_pop_centroids[origin],weighted_pop_centroids[destination]).kilometers
                file.write('{}\t{}\t{}\n'.format(origin, destination, dist))

## North-South predictor data

We are interested in determining whether Zika migration has occurred in a specific direction. For instance, has most transmission occurred in a northward or southward direction? Or has there been so much mixture that there is no particular trend?

This predictor is a true pairwise matrix. The latitude for each country is taken from that country's weighted population centroid; longitude is not used. The cell value is `1` if the origin is north of the destination, `-1` if the origin is south of the destination, and `0` if the two latitudes are equal.

For example:

origin | destination | value
--------|-----------| ----
united_states   |    mexico      |   1
mexico    |    united_states      |   -1
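
The rule can also be written as a small standalone function, which makes it easy to check in isolation. This is a sketch: the name `north_south_indicator` is hypothetical, and the latitudes in the example are made up rather than taken from the centroid file.

```python
def north_south_indicator(origin_lat, destination_lat):
    """Return 1 if the origin is north of the destination, -1 if south, 0 if equal."""
    if origin_lat > destination_lat:
        return 1
    if origin_lat < destination_lat:
        return -1
    return 0

# Made-up latitudes in degrees; positive values are north of the equator.
northern_lat, southern_lat = 40.0, 20.0
forward = north_south_indicator(northern_lat, southern_lat)
reverse = north_south_indicator(southern_lat, northern_lat)
```

Unlike the distance predictor, this one is antisymmetric: swapping origin and destination flips the sign.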

In [13]:
with open(north_south_outfile,'w') as file:
    file.write('{}\t{}\t{}\n'.format('origin', 'destination', 'north_south_indicator'))
    for origin in included_countries:
        for destination in included_countries:
            origin_lat = weighted_pop_centroids[origin][0]
            destination_lat = weighted_pop_centroids[destination][0]
            if origin == destination:
                continue
            else:
                if origin_lat == destination_lat:
                    value = 0
                elif origin_lat > destination_lat: #origin is north of destination
                    value = 1
                elif origin_lat < destination_lat: #origin is south of destination
                    value = -1
                file.write('{}\t{}\t{}\n'.format(origin, destination, value))