## README: Pairwise predictors

The following predictors are inherently pairwise:

* Great circle distances between countries' population-weighted centroids
* Passenger air travel between countries
* Indicator of north-south relationship between countries

The output for each pairwise predictor is written as a long-format `tsv` edge list, with both the origin and the destination specified on every row. For example:

origin          | destination       | value
----------------|-------------------| ----
united_states   |    mexico         |   XXX
mexico          |    united_states  |   ZZZ

A bit more information about each of the predictors and its construction is given before the relevant code block. 
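
The long format described above can be sketched with `itertools.permutations`, which yields every ordered origin-destination pair and never pairs a country with itself. This is only an illustration: the helper name `write_edge_list` and the toy predictor (the length of the origin name) are assumptions made for the example and are not used elsewhere in this notebook.

```python
import io
from itertools import permutations

def write_edge_list(handle, countries, value_for_pair, value_name='value'):
    """Write a header row, then one tab-separated row per ordered pair."""
    handle.write('{}\t{}\t{}\n'.format('origin', 'destination', value_name))
    # permutations(countries, 2) skips the diagonal (origin == destination)
    for origin, destination in permutations(countries, 2):
        value = value_for_pair(origin, destination)
        handle.write('{}\t{}\t{}\n'.format(origin, destination, value))

# Toy example: write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_edge_list(buffer, ['united_states', 'mexico'], lambda o, d: len(o))
edge_list_text = buffer.getvalue()
```

Because both directions are written as separate rows, asymmetric predictors (such as the north-south indicator) and symmetric ones (such as distance) share the same file layout.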


In [3]:
#import packages
import re
from geopy.distance import great_circle

In [38]:
#input file paths
countries_file = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/glm/indexed-countries-45.tsv'
wt_centroids_infile = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/glm/standardized-data/std-population-weighted-centroids.tsv'

#output file paths
test_outfile = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/glm/origin-destin-formatted-data/origin-destin-test-predictors.tsv'
great_cirle_dists_outfile = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/glm/origin-destin-formatted-data/origin-destin-great-circle-dists.tsv'
north_south_outfile = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/glm/origin-destin-formatted-data/origin-destin-north-south-indicator.tsv'


In [35]:
#restrict the country list, since Mathematica gives us some countries that we aren't interested in.
test_countries = ['united_states','mexico','canada'] #smaller set for testing scripts and such
with open(countries_file,'r') as infile: #the 'rU' mode was removed in Python 3.11; plain 'r' works
    included_countries = [line.split('\t')[0] for line in infile if line.split('\t')[0] != 'country']
assert len(included_countries) == 45

## Great Circle Distances predictor data

The infile for calculating great circle distances between population-weighted centroids is a `tsv` file with the latitude and longitude for each country's population-weighted centroid. A population-weighted centroid represents a weighted average of the latitude and longitude of each city within the country, where the weight accorded to each city is the proportion of the country's population that lives in that city.
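
As a rough illustration of the definition above, a population-weighted centroid can be sketched as a weighted mean of city coordinates. The actual centroids used here come from the Mathematica notebook, which may differ in detail; the function name `population_weighted_centroid` and the two made-up cities are assumptions for this example only. Note that a plain arithmetic mean of latitude and longitude is a simplification that can misbehave for countries spanning the antimeridian.

```python
def population_weighted_centroid(cities):
    """Return (latitude, longitude) weighted by city population.

    cities: list of (latitude, longitude, population) tuples.
    """
    total_population = float(sum(population for _, _, population in cities))
    latitude = sum(lat * population for lat, _, population in cities) / total_population
    longitude = sum(lon * population for _, lon, population in cities) / total_population
    return (latitude, longitude)

# Two hypothetical cities: the larger one pulls the centroid toward itself.
toy_cities = [(10.0, 20.0, 3000), (20.0, 40.0, 1000)]
toy_centroid = population_weighted_centroid(toy_cities)
```

Dividing by the total population is equivalent to weighting each city by the proportion of the country's population that lives there.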

The infile containing the weighted population centroids was generated in Mathematica; see `weighted-centroids.nb` for the code.

To generate the great circle distances predictor we calculate the great circle distance between all pairs of countries' population-weighted centroids. All great circle distances are presented in kilometers (metric for the win).
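
In this notebook the distance itself is computed by geopy's `great_circle`. For readers who want to see what a great circle distance involves, here is a minimal sketch of the haversine formula, one standard way to compute it on a sphere. It is not necessarily the formula geopy uses internally, and the radius of 6371 km is an assumed mean Earth radius, so values may differ slightly from geopy's output. The name `haversine_km` is introduced only for this example.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not taken from geopy

def haversine_km(point_a, point_b):
    """Great circle distance in km between two (latitude, longitude) tuples in degrees."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # haversine of the central angle between the two points
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# A quarter of the way around the equator.
quarter_equator_km = haversine_km((0.0, 0.0), (0.0, 90.0))
```

The distance is symmetric, so each unordered pair of countries appears twice in the output file with the same value, once per direction.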

In [39]:
#Make a dictionary from the weighted population centroid data where the key is the country and the value is a latitude-longitude tuple.

weighted_pop_centroids = {}
with open(wt_centroids_infile,'r') as infile: #the 'rU' mode was removed in Python 3.11; plain 'r' works
    for line in infile:
        if line.startswith('country'): #line is header
            continue
        fields = line.rstrip('\n').split('\t')
        country = fields[0]
        if country in included_countries: #skip countries we aren't interested in
            latitude = float(fields[1])
            longitude = float(fields[2])
            weighted_pop_centroids[country] = (latitude,longitude)

assert len(weighted_pop_centroids.keys()) == 45

In [40]:
# write actual full tsv file
with open(great_cirle_dists_outfile,'w') as file:
    file.write('{}\t{}\t{}\n'.format('origin', 'destination', 'great_circle_dist_km'))
    for origin in included_countries:
        for destination in included_countries:
            if origin == destination: #don't actually want to store the diagonal values of the matrix.
                continue
            else:
                dist = great_circle(weighted_pop_centroids[origin],weighted_pop_centroids[destination]).kilometers
                file.write('{}\t{}\t{}\n'.format(origin, destination, dist))

In [41]:
#I'm also going to make a tiny 3 country version of the great circle distances for testing purposes:
test_countries = ['canada', 'united_states', 'mexico']
test_outfile = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/glm/origin-destin-formatted-data/origin-destination-TEST.tsv'
with open(test_outfile,'w') as file:
    file.write('{}\t{}\t{}\n'.format('origin', 'destination', 'great_circle_dist_km'))
    for origin in test_countries:
        for destination in test_countries:
            if origin == destination: #don't actually want to store the diagonal values of the matrix.
                continue
            else:
                dist = great_circle(weighted_pop_centroids[origin],weighted_pop_centroids[destination]).kilometers
                file.write('{}\t{}\t{}\n'.format(origin, destination, dist))

## North-South predictor data

We are interested in determining whether Zika migration has occurred in a specific direction. For instance, has most transmission occurred in a northward or southward direction? Or has there been so much mixture that there is no particular trend?

This predictor is a true pairwise matrix. The latitude for each country is taken from that country's population-weighted centroid. The cell value is `1` if the origin is north of the destination, `-1` if the origin is south of the destination, and `0` if the two centroids have exactly the same latitude.

For example:

origin          | destination       | value
----------------|-------------------| ----
united_states   |    mexico         |   1
mexico          |    united_states  |   -1
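
The rule behind this table can be written as a small standalone function, sketched here so it can be checked in isolation. The name `north_south_indicator` and the toy latitudes are assumptions for illustration; they are not the real centroid values.

```python
def north_south_indicator(origin_lat, destination_lat):
    """Return 1 if the origin is north of the destination, -1 if south, 0 if equal."""
    if origin_lat > destination_lat:
        return 1
    if origin_lat < destination_lat:
        return -1
    return 0

# Hypothetical latitudes in degrees (positive is north).
toy_latitudes = {'united_states': 38.0, 'mexico': 21.0}
us_to_mexico = north_south_indicator(toy_latitudes['united_states'], toy_latitudes['mexico'])
mexico_to_us = north_south_indicator(toy_latitudes['mexico'], toy_latitudes['united_states'])
```

The indicator is antisymmetric: swapping origin and destination flips the sign, which is why the two rows in the table carry opposite values.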

In [42]:
with open(north_south_outfile,'w') as file:
    file.write('{}\t{}\t{}\n'.format('origin', 'destination', 'north_south_indicator'))
    for origin in included_countries:
        for destination in included_countries:
            origin_lat = weighted_pop_centroids[origin][0]
            destination_lat = weighted_pop_centroids[destination][0]
            if origin == destination:
                continue
            else:
                if origin_lat == destination_lat:
                    value = 0
                elif origin_lat > destination_lat: #origin is north of destination
                    value = 1
                elif origin_lat < destination_lat: #origin is south of destination
                    value = -1
                file.write('{}\t{}\t{}\n'.format(origin, destination, value))

## Passenger air traffic data

Passenger air traffic between different countries in the Americas was made available to us by Kamran Khan (BlueDot). The data are collected by the International Air Transport Association (IATA); the counts comprise actual passenger ticket sales and include all flight connections. Thus these data reflect the true origins and destinations of individual passengers.

Because the raw data were supplied to us in an origin-destination format, the version of the data with standardized country names is already good to go. However, for consistency's sake I'll also write that file to the directory that contains the `origin-destination` formatted tsv files.

In [45]:
cp /Users/alliblk/Desktop/gitrepos/zika-usvi/data/glm/standardized-data/std-pax-volume.tsv /Users/alliblk/Desktop/gitrepos/zika-usvi/data/glm/origin-destin-formatted-data/origin-destin-pax-volume.tsv