# Creative Extension Analysis

In this notebook, we propose a creative extension analysis of the paper _Friendship and Mobility: User Movement in Location-Based Social Networks_. We chose to answer multiple research questions by using the same datasets (check-ins) and an additional one (airports). 

In particular, we aim to find out:
- Which countries travel the most long distance by plane?
- Where do people from different countries travel to the most?
- Where are users' friends based?
- Is it possible to predict user home areas based on their long distance travel patterns?

## Tools

In [1]:
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
from numba import njit
import itertools

import reverse_geocoder
import pickle
import os

In [2]:
%load_ext blackcellmagic

## Load the Data

We load the check-ins datasets and the new airport dataset.

In [3]:
data_folder = "data"

checkins_b = pd.read_csv(
    os.path.join(data_folder, "loc-brightkite_totalCheckins.txt.gz"),
    compression="gzip",
    delimiter="\t",
    usecols=[0, 1, 2, 3],
    names=["user", "checkin_time", "latitude", "longitude"],
    parse_dates = ["checkin_time"]
)

checkins_g = pd.read_csv(
    os.path.join(data_folder, "loc-gowalla_totalCheckins.txt.gz"),
    compression="gzip",
    delimiter="\t",
    usecols=[0, 1, 2, 3],
    names=["user", "checkin_time", "latitude", "longitude"],
    parse_dates=["checkin_time"],
)

airports = pd.read_csv(
    os.path.join(data_folder, "airports.csv"), usecols=[2, 3, 4, 5, 8]
)

countries = pd.read_csv(
    os.path.join(data_folder, "countries.csv"), usecols=[1, 2],
    names=["iso_country", "country"]
)

In the following cells, we will only display dataframes for Brightkite.

In [4]:
checkins_b.head()

Unnamed: 0,user,checkin_time,latitude,longitude
0,0,2010-10-17 01:48:53+00:00,39.747652,-104.99251
1,0,2010-10-16 06:02:04+00:00,39.891383,-105.070814
2,0,2010-10-16 03:48:54+00:00,39.891077,-105.068532
3,0,2010-10-14 18:25:51+00:00,39.750469,-104.999073
4,0,2010-10-14 00:21:47+00:00,39.752713,-104.996337


## Preprocessing Check-ins Datasets

In [5]:
print(f"Brightkite checkins dataset has {len(checkins_b)} rows.")
print(f"Gowalla checkins dataset has {len(checkins_g)} rows.")

Brightkite checkins dataset has 4747287 rows.
Gowalla checkins dataset has 6442892 rows.


We check for missing values in both datasets.

In [6]:
NaN_rows_b = np.count_nonzero(np.count_nonzero(checkins_b.isnull(), axis=1))
NaN_rows_g = np.count_nonzero(np.count_nonzero(checkins_g.isnull(), axis=1))
print(f"There are {NaN_rows_b} rows in Brightkite with at least a NaN value.")
print(f"There are {NaN_rows_g} rows in Gowalla with at least a NaN value.")
print("\nWhere the NaN values are in Brightkite:")
checkins_b.isnull().any()

There are 6 rows in Brightkite with at least a NaN value.
There are 0 rows in Gowalla with at least a NaN value.

Where the NaN values are in Brightkite:


user            False
checkin_time     True
latitude         True
longitude        True
dtype: bool

We see that Brightkite has just 6 rows with a NaN value. As the output above shows, these NaN values are in the `checkin_time`, `latitude` and `longitude` columns. We remove the rows containing NaNs, since they are probably recording errors.

In [7]:
checkins_b = checkins_b.dropna()

We see there are coordinates with incorrect values in both datasets:

In [8]:
wrong_coordinates_b = checkins_b[
    ((checkins_b["longitude"] < -180) | (checkins_b["longitude"] > 180))
    | ((checkins_b["latitude"] < -90) | (checkins_b["latitude"] > 90))
]
wrong_coordinates_g = checkins_g[
    ((checkins_g["longitude"] < -180) | (checkins_g["longitude"] > 180))
    | ((checkins_g["latitude"] < -90) | (checkins_g["latitude"] > 90))
]

print(f"There are {len(wrong_coordinates_b)} wrong coordinates in Brightkite.")
print(f"There are {len(wrong_coordinates_g)} wrong coordinates in Gowalla.")

There are 109 wrong coordinates in Brightkite.
There are 29 wrong coordinates in Gowalla.


Out-of-range coordinates are rare: there are only 109 in the Brightkite DataFrame and 29 in the Gowalla DataFrame. They might stem from measurement errors on the user's device, from a conversion error, or from an error when the coordinates were saved.

We can also see that there are check-ins at coordinates (0,0). These are probably erroneous, since this point lies in the middle of the Atlantic Ocean. Some of them might be genuine check-ins by tourists visiting the virtual [Null Island](https://fr.wikipedia.org/wiki/Null_Island), but we can safely assume that no one's home is located there. We therefore remove all check-ins at (0,0) from both datasets, even though Gowalla has only a few.

In [9]:
number_00_checkins_b = len(
    checkins_b[
        (checkins_b.latitude == 0)
        & (checkins_b.longitude == 0)
    ]
)

number_00_checkins_g = len(
    checkins_g[
        (checkins_g.latitude == 0)
        & (checkins_g.longitude == 0)
    ]
)

print(
    f"There are {number_00_checkins_b} datapoints at coordinates (0,0) in the Brightkite dataset."
)

print(
    f"There are {number_00_checkins_g} datapoints at coordinates (0,0) in the Gowalla dataset."
)

There are 256137 datapoints at coordinates (0,0) in the Brightkite dataset.
There are 135 datapoints at coordinates (0,0) in the Gowalla dataset.


There are far more such check-ins in the Brightkite dataset, which suggests that the two services handle coordinates differently. One reason might be that these errors come from users' devices, but this would mean that almost all users with faulty devices are on Brightkite and almost none on Gowalla. This could happen if Gowalla imposes stricter requirements on users' devices before the service can be installed (more recent OS version, better sensors, ...).  
Another reason might be that Gowalla has some mechanism to detect faulty measurements and remove them.  
Yet another reason might be that Brightkite lets users check in without GPS and saves such check-ins with the default coordinates (0,0). This may sound strange, but the service might still want to record check-ins that lack GPS coordinates. In that case, a dedicated marker for missing coordinates (e.g. NaN) would have been a more reasonable choice.

We can now keep only the check-ins whose coordinates are valid and not at (0,0):

In [10]:
# Keep rows whose latitude AND longitude are in range and that are not at (0,0).
checkins_b = checkins_b[
    checkins_b.latitude.between(-90, 90)
    & checkins_b.longitude.between(-180, 180)
    & ~((checkins_b.latitude == 0) & (checkins_b.longitude == 0))
]

checkins_g = checkins_g[
    checkins_g.latitude.between(-90, 90)
    & checkins_g.longitude.between(-180, 180)
    & ~((checkins_g.latitude == 0) & (checkins_g.longitude == 0))
]

In [11]:
checkins_b.head()

Unnamed: 0,user,checkin_time,latitude,longitude
0,0,2010-10-17 01:48:53+00:00,39.747652,-104.99251
1,0,2010-10-16 06:02:04+00:00,39.891383,-105.070814
2,0,2010-10-16 03:48:54+00:00,39.891077,-105.068532
3,0,2010-10-14 18:25:51+00:00,39.750469,-104.999073
4,0,2010-10-14 00:21:47+00:00,39.752713,-104.996337


## Work on Airport Dataset

In [12]:
airports.head()

Unnamed: 0,type,name,latitude_deg,longitude_deg,iso_country
0,heliport,Total Rf Heliport,40.070801,-74.933601,US
1,small_airport,Aero B Ranch Airport,38.704022,-101.473911,US
2,small_airport,Lowell Field,59.9492,-151.695999,US
3,small_airport,Epps Airpark,34.864799,-86.770302,US
4,closed,Newport Hospital & Clinic Heliport,35.6087,-91.254898,US


We see there are many different types of airports:

In [13]:
airports.type.value_counts()

small_airport     35218
heliport          12515
closed             4862
medium_airport     4538
seaplane_base      1039
large_airport       613
balloonport          25
Name: type, dtype: int64

We are only interested in actual airports, so we look at some examples of the different airport types in Switzerland:

In [14]:
airports[(airports.iso_country == "CH") & (airports.type == "small_airport")].head(5)

Unnamed: 0,type,name,latitude_deg,longitude_deg,iso_country
15528,small_airport,Altiport de Croix de Coeur,46.12338,7.23425,CH
15529,small_airport,Altisurface du Glacier de Prasfleuri,46.064289,7.354317,CH
15530,small_airport,Altiport du Glacier de Tsanfleuron,46.320653,7.23288,CH
34348,small_airport,Bex Airport,46.258301,6.98639,CH
34350,small_airport,Ecuvillens Airport,46.755001,7.07611,CH


In [15]:
airports[(airports.iso_country=='CH') & (airports.type=='medium_airport')].head(5)

Unnamed: 0,type,name,latitude_deg,longitude_deg,iso_country
34349,medium_airport,Les Eplatures Airport,47.0839,6.79284,CH
34357,medium_airport,Sion Airport,46.219166,7.326944,CH
34363,medium_airport,Alpnach Air Base,46.943901,8.28417,CH
34364,medium_airport,Dübendorf Air Base,47.398602,8.64823,CH
34365,medium_airport,Emmen Air Base,47.092444,8.305184,CH


In [16]:
airports[(airports.iso_country=='CH') & (airports.type=='large_airport')].head(5)

Unnamed: 0,type,name,latitude_deg,longitude_deg,iso_country
34351,large_airport,Geneva Cointrin International Airport,46.238098,6.10895,CH
34411,large_airport,Zürich Airport,47.464699,8.54917,CH


We are only interested in large airports, so we only keep them:

In [17]:
airports = airports[(airports.type == "large_airport")]
airports = airports.drop(columns="type")

We check the countries with most large airports.

In [18]:
airports.iso_country.value_counts().head(5)

US    170
CN     35
GB     27
RU     19
DE     17
Name: iso_country, dtype: int64

We now merge with the `countries` DataFrame to get the name of each country for each airport:

In [19]:
countries.head()

Unnamed: 0,iso_country,country
0,code,name
1,AD,Andorra
2,AE,United Arab Emirates
3,AF,Afghanistan
4,AG,Antigua and Barbuda


We first check for NA values and remove the corresponding rows:

In [20]:
print(countries.isna().sum())

iso_country    1
country        0
dtype: int64


In [21]:
countries = countries.dropna()
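
A possible explanation for the single missing `iso_country` value, which we did not verify against the file: Namibia's ISO code is the string `NA`, and `pd.read_csv` treats `NA` as a missing value by default. The sketch below reproduces this effect on a small inline CSV (the rows and layout are illustrative assumptions, not the real `countries.csv`) and shows that `keep_default_na=False` keeps the code as text.

```python
import io

import pandas as pd

# Illustrative rows only; the layout is an assumption, not the real countries.csv.
csv_text = "code,name\nAD,Andorra\nNA,Namibia\nCH,Switzerland\n"

# Default parsing: the string "NA" is interpreted as a missing value.
default_parse = pd.read_csv(io.StringIO(csv_text))
n_missing_default = int(default_parse["code"].isna().sum())

# With keep_default_na=False, "NA" stays an ordinary string.
strict_parse = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
n_missing_strict = int(strict_parse["code"].isna().sum())

print(n_missing_default, n_missing_strict)
```

If this is indeed the cause, dropping the NaN row removes Namibia from the country table, so any airport there would be lost in the merge that follows.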

In [22]:
airports = airports.merge(countries, on="iso_country")

In [23]:
airports.head()

Unnamed: 0,name,latitude_deg,longitude_deg,iso_country,country
0,Port Moresby Jacksons International Airport,-9.44338,147.220001,PG,Papua New Guinea
1,Keflavik International Airport,63.985001,-22.6056,IS,Iceland
2,Priština International Airport,42.5728,21.035801,XK,Kosovo
3,Guodu air base,36.001741,117.63201,CN,China
4,Yantai Penglai International Airport,37.657222,120.987222,CN,China


## Exploratory Data Analysis

### Finding the home of each user

We use the following [formulas](https://stackoverflow.com/questions/1253499/simple-calculations-for-working-with-lat-lon-and-km-distance) to transform latitude and longitude into kilometers.

- **Latitude: 1 deg = 110.574 km**  
- **Longitude: 1 deg = 111.320*cos(latitude) km**
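
As a small worked example of the two formulas above, the sketch below converts a point to approximate kilometre offsets east and north of (0,0). The function name `degrees_to_km` and the sample point (roughly Lausanne) are our own illustrative choices.

```python
import numpy as np

KM_PER_DEG_LAT = 110.574         # km per degree of latitude
KM_PER_DEG_LON_EQUATOR = 111.320  # km per degree of longitude at the equator


def degrees_to_km(latitude, longitude):
    """Approximate (km_east, km_north) offsets from (0,0) for a point in degrees."""
    km_north = KM_PER_DEG_LAT * latitude
    # A degree of longitude shrinks with the cosine of the latitude.
    km_east = KM_PER_DEG_LON_EQUATOR * np.cos(np.deg2rad(latitude)) * longitude
    return km_east, km_north


# Hypothetical sample point, roughly Lausanne (46.5 N, 6.6 E).
km_east, km_north = degrees_to_km(46.5, 6.6)
print(km_east, km_north)

# At 60 degrees of latitude, one degree of longitude is half its equatorial length.
km_per_deg_lon_at_60 = degrees_to_km(60.0, 1.0)[0]
print(km_per_deg_lon_at_60)
```

Note that this is a local approximation: the east offset depends on the latitude of the point itself, so cells are not perfectly square far from the equator.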

We divide the world into 25 km × 25 km cells, each identified by a unique number. We first define two indices per cell: $x,y\in \mathbb{Z}$  
The cell (0,0) is the one whose lower-left (i.e. southwest) corner is at coordinates (0,0).  
The x index grows when going east: $x \in [-N, N]$.  
The y index grows when going north: $y \in [-M, M]$.

Then we create a unique number using [Cantor pairing function](https://en.wikipedia.org/wiki/Pairing_function#Cantor_pairing_function):
$$f(n_1, n_2) = \frac{1}{2}(n_2 + n_1)(n_2+n_1+1)+n_1$$
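
One caveat worth illustrating: the Cantor pairing function is a bijection only on non-negative integers, whereas the indices $x$ and $y$ can be negative. The sketch below shows that two different cells can then receive the same number, and illustrates one common workaround that is not used in this notebook: mapping each signed index to a natural number with a zigzag encoding before pairing. The helper names `zigzag` and `signed_cantor_pair` are hypothetical.

```python
def cantor_pair(n1, n2):
    """Cantor pairing as written above: 1/2 (n1 + n2)(n1 + n2 + 1) + n1."""
    s = n1 + n2
    # s * (s + 1) is always even, so integer division is exact.
    return s * (s + 1) // 2 + n1


def zigzag(n):
    """Map a signed integer to a natural number: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1


def signed_cantor_pair(x, y):
    """Pair two signed indices by first making both non-negative."""
    return cantor_pair(zigzag(x), zigzag(y))


# With negative indices, distinct cells can collide under the plain formula.
collision = cantor_pair(3, -3) == cantor_pair(3, -4)
print(collision)

# After zigzag encoding, every cell in a test window gets its own number.
window = range(-20, 21)
codes = {signed_cantor_pair(x, y) for x in window for y in window}
print(len(codes) == len(window) ** 2)
```

Collisions only arise when $x + y$ is negative or zero, so they affect some regions and not others; how often they matter for the check-in data is not something we measured.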

We use [njit](https://numba.pydata.org/numba-doc/latest/user/performance-tips.html) to make the calculations faster.

In [24]:
@njit
def assign_cell(latitude, longitude):
    """
    Assign a cell_number based on the cantor pairing function and discretization into 25km * 25km cells.

    Arguments
    =========
    latitude : float
        latitude in degrees.
    longitude : float
        longitude in degrees.

    Returns
    =======
    cell_number : int
        Cell number based on the cantor pair function.
    """

    km_east = 111.320 * np.cos(np.deg2rad(latitude)) * longitude
    km_north = 110.574 * latitude
    x_index = km_east // 25
    y_index = km_north // 25
    cell_number = (1/2)*(y_index + x_index)*(y_index + x_index + 1) + x_index
    return cell_number

In [25]:
checkins_b["cell_number"] = assign_cell(checkins_b["latitude"].values, checkins_b["longitude"].values)
checkins_g["cell_number"] = assign_cell(checkins_g["latitude"].values, checkins_g["longitude"].values)

checkins_b["cell_number"] = checkins_b["cell_number"].astype('int')
checkins_g["cell_number"] = checkins_g["cell_number"].astype('int')

The goal of the step below is to find the `cell_number` with the most check-ins for each user. To do this, we:
1. Group by user and `cell_number` and get the counts.
2. Sort these counts and take the largest for each user.
3. Recreate a DataFrame from the resulting Series.
4. Sort the DataFrame by user and keep only the columns ["user", "cell_number"].
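
The steps above can be sketched on a toy DataFrame. This variant is an alternative to sorting, not the code used in the notebook: it counts check-ins per (user, cell) pair and selects the row with the largest count using `idxmax`. The data and the column name `n_checkins` are made up for illustration.

```python
import pandas as pd

# Made-up check-ins: user 0 is mostly in cell 10, user 1 mostly in cell 40.
toy = pd.DataFrame(
    {
        "user": [0, 0, 0, 1, 1, 1, 1],
        "cell_number": [10, 10, 20, 30, 40, 40, 40],
    }
)

# Number of check-ins for every (user, cell) pair.
counts = toy.groupby(["user", "cell_number"]).size().reset_index(name="n_checkins")

# For each user, the row label of the pair with the largest count.
top_rows = counts.groupby("user")["n_checkins"].idxmax()

user_cell = counts.loc[top_rows, ["user", "cell_number"]].reset_index(drop=True)
print(user_cell)
```

When a user has two cells with the same count, `idxmax` keeps the first one it encounters, so ties are resolved by row order rather than by any property of the cells.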

In [26]:
user_cell_b = (
    checkins_b.groupby(by=["user", "cell_number"])
    .size()
    .sort_values()
    .groupby(level=0)
    .tail(1)
    .to_frame()
    .sort_values("user")
    .reset_index()
    .drop(columns=0)
)

user_cell_g = (
    checkins_g.groupby(by=["user", "cell_number"])
    .size()
    .sort_values()
    .groupby(level=0)
    .tail(1)
    .to_frame()
    .sort_values("user")
    .reset_index()
    .drop(columns=0)
)

user_cell_b.head()

Unnamed: 0,user,cell_number
0,0,16660
1,1,34813
2,2,16660
3,3,34813
4,4,51736


We create a new DataFrame with the home coordinates of each user. (Recall: a user's home coordinates are the average location of their check-ins in their most visited cell.)

In [27]:
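# Average only the numeric columns, so non-numeric columns cannot break mean().
homes_b = (
    user_cell_b.merge(checkins_b)
    .groupby("user")[["cell_number", "latitude", "longitude"]]
    .mean()
)

homes_g = (
    user_cell_g.merge(checkins_g)
    .groupby("user")[["cell_number", "latitude", "longitude"]]
    .mean()
)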

In [28]:
homes_b.head()

user,cell_number,latitude,longitude
0,16660,39.746677,-104.973619
1,34813,37.602712,-122.377186
2,16660,39.73834,-104.960762
3,34813,37.670403,-122.406295
4,51736,60.173672,24.94245


### Find the country of residence for each user

We want to retrieve the country of residence of each user. To do so, we collect all cells that contain at least one home and assume that all homes in a cell are in the same country. To limit the errors introduced by the 25 km × 25 km discretization, we represent each cell by the average position of all check-ins inside it. This way, if a cell straddles a border and most check-ins lie on one side of it, the homes in that cell are assigned to the country where most check-ins happened.

We first compute the average check-in position of each cell, and then define a function that returns the country code of a given point.

In [29]:
center_cells_b = checkins_b[['latitude', 'longitude', 'cell_number']].groupby('cell_number').mean()
center_cells_g = checkins_g[['latitude', 'longitude', 'cell_number']].groupby('cell_number').mean()

center_cells_b = center_cells_b.reset_index()
center_cells_g = center_cells_g.reset_index()

In [30]:
center_cells_b.head()

Unnamed: 0,cell_number,latitude,longitude
0,-286,64.545703,-149.087119
1,-284,64.1525,-145.842222
2,-282,63.562699,-142.300615
3,-280,63.934833,-145.788211
4,-279,63.661389,-144.064444


In [31]:
home_countries_b = homes_b.merge(center_cells_b, on='cell_number', suffixes=["_home", "_cell"])
home_countries_g = homes_g.merge(center_cells_g, on='cell_number', suffixes=["_home", "_cell"])

In [32]:
home_countries_b = home_countries_b.drop(columns=["latitude_home", "longitude_home"])
home_countries_g = home_countries_g.drop(columns=["latitude_home", "longitude_home"])

home_countries_b = home_countries_b.drop_duplicates()
home_countries_g = home_countries_g.drop_duplicates()

In [33]:
home_countries_b.head()

Unnamed: 0,cell_number,latitude_cell,longitude_cell
0,16660,39.731991,-104.980205
751,34813,37.626781,-122.389266
974,51736,60.183202,24.943493
1179,5274,41.911515,-87.676377
1780,20922,43.590722,3.737697


In [34]:


def retrieve_country(lat, lon):
    """
    Return the ISO country code of a point in space

    Arguments
    =========
        lat : float
            Latitude of the point
        lon : float
            Longitude of the point

    Returns
    =======
        str
            ISO country code of the point

    """

    return reverse_geocoder.search((lat, lon))[0]["cc"]


# vectorize the function
retrieve_country_vec = np.vectorize(retrieve_country)

The following cell takes a long time to run, so we ran it once and pickled the result.

In [44]:
# home_countries_b["country"] = retrieve_country_vec(
#    home_countries_b["latitude_cell"], home_countries_b["longitude_cell"]
# )

# home_countries_g["country"] = retrieve_country_vec(
#    home_countries_g["latitude_cell"], home_countries_g["longitude_cell"]
# )

In [45]:
# home_countries_b.to_pickle(os.path.join(data_folder, 'home_countries_b.pkl'))
# home_countries_g.to_pickle(os.path.join(data_folder, 'home_countries_g.pkl'))

In [46]:
home_countries_b = pd.read_pickle(os.path.join(data_folder, "home_countries_b.pkl"))
home_countries_g = pd.read_pickle(os.path.join(data_folder, "home_countries_g.pkl"))

In [None]:
home_countries_b.head()

Now that we have the country code of each home cell, we merge the two DataFrames to obtain a complete DataFrame.

In [39]:
homes_b = homes_b.reset_index().merge(home_countries_b, on='cell_number')
homes_g = homes_g.reset_index().merge(home_countries_g, on="cell_number")

homes_b = homes_b.drop(columns = ['latitude_cell', 'longitude_cell'])
homes_g = homes_g.drop(columns=["latitude_cell", "longitude_cell"])

In [43]:
homes_g.sample(10)

Unnamed: 0,user,cell_number,latitude,longitude,country
21060,138273,16088,45.53727,-122.568411,US
98187,170310,40783,58.29194,12.309998,SE
43959,143342,33179,52.352277,9.77481,DE
17984,131423,34549,37.648456,-122.100071,US
30529,989,27125,30.625818,-96.335887,US
23449,105717,46097,59.333336,18.394076,SE
89174,132177,41072,57.794744,13.419371,SE
6862,22424,35079,37.375792,-122.002734,US
67509,46940,26872,35.089092,-106.62109,US
2319,95113,29027,30.267717,-97.751649,US


### Detect long distance travel

We use the following heuristic to detect long-distance travel: if two consecutive check-ins of the same user are more than 500 km apart, we assume that a long-distance trip occurred, and that most such trips were made by plane.

To do that we will:
1. Sort checkins by user and time.
2. Filter and keep only checkins that happen just before and after a long distance travel
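
The heuristic can be sketched on a toy DataFrame as follows. This is an illustrative variant rather than the notebook's implementation: it pairs each check-in with the same user's next one using `groupby(...).shift(-1)`, computes the haversine distance, and keeps pairs at least 500 km apart. The data and the helper name `haversine_km` are made up.

```python
import numpy as np
import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6373.0):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.deg2rad, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * radius_km * np.arctan2(np.sqrt(a), np.sqrt(1 - a))


# Made-up check-ins: user 0 moves within the Bay Area, then to Denver;
# user 1 stays in New York.
toy = pd.DataFrame(
    {
        "user": [0, 0, 0, 1, 1],
        "checkin_time": pd.to_datetime(
            [
                "2009-05-25 20:00",
                "2009-05-25 22:00",
                "2009-05-26 02:00",
                "2009-05-25 10:00",
                "2009-05-25 12:00",
            ]
        ),
        "latitude": [37.77, 37.60, 39.88, 40.71, 40.76],
        "longitude": [-122.42, -122.38, -104.68, -74.01, -73.98],
    }
).sort_values(["user", "checkin_time"])

# Next check-in of the same user; NaN for each user's last check-in.
nxt = toy.groupby("user")[["latitude", "longitude"]].shift(-1)

toy["distance"] = haversine_km(
    toy["latitude"], toy["longitude"], nxt["latitude"], nxt["longitude"]
)

# Comparisons with NaN are False, so last check-ins are dropped automatically.
trips = toy[toy["distance"] >= 500]
print(trips)
```

Shifting within each user group means a user's last check-in is never paired with the next user's first one, so no separate filter on matching user ids is needed.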

In [56]:
checkins_b = checkins_b.sort_values(by=["user", "checkin_time"])
checkins_g = checkins_g.sort_values(by=["user", "checkin_time"])

In [81]:
long_distance_travel_b = checkins_b.reset_index().merge(
    checkins_b.shift(-1).dropna().reset_index(),
    on="index",
    suffixes=("_start", "_end"),
)

long_distance_travel_b = long_distance_travel_b[long_distance_travel_b.user_start == long_distance_travel_b.user_end]
long_distance_travel_b = long_distance_travel_b.drop(columns=["index"])

long_distance_travel_g = checkins_g.reset_index().merge(
    checkins_g.shift(-1).dropna().reset_index(),
    on="index",
    suffixes=("_start", "_end"),
)

long_distance_travel_g = long_distance_travel_g[long_distance_travel_g.user_start == long_distance_travel_g.user_end]
long_distance_travel_g = long_distance_travel_g.drop(columns=["index"])


In [82]:
long_distance_travel_b

Unnamed: 0,user_start,checkin_time_start,latitude_start,longitude_start,cell_number_start,user_end,checkin_time_end,latitude_end,longitude_end,cell_number_end
0,0,2009-05-25 20:56:10+00:00,37.774929,-122.419415,34285,0.0,2009-05-25 21:35:28+00:00,37.600747,-122.382376,34813.0
1,0,2009-05-25 21:35:28+00:00,37.600747,-122.382376,34813,0.0,2009-05-25 21:37:44+00:00,37.600747,-122.382376,34813.0
2,0,2009-05-25 21:37:44+00:00,37.600747,-122.382376,34813,0.0,2009-05-25 21:42:47+00:00,37.600747,-122.382376,34813.0
3,0,2009-05-25 21:42:47+00:00,37.600747,-122.382376,34813,0.0,2009-05-25 22:13:23+00:00,37.615223,-122.389979,34813.0
4,0,2009-05-25 22:13:23+00:00,37.615223,-122.389979,34813,0.0,2009-05-26 02:21:12+00:00,39.878664,-104.682105,16113.0
...,...,...,...,...,...,...,...,...,...,...
4491023,58220,2009-01-22 10:08:12+00:00,33.833333,35.833333,39753,58220.0,2009-01-23 23:23:27+00:00,33.855255,35.578156,39471.0
4491024,58220,2009-01-23 23:23:27+00:00,33.855255,35.578156,39471,58220.0,2009-01-23 23:32:07+00:00,33.853611,35.577222,39471.0
4491025,58220,2009-01-23 23:32:07+00:00,33.853611,35.577222,39471,58220.0,2009-01-23 23:32:26+00:00,33.853611,35.577222,39471.0
4491026,58220,2009-01-23 23:32:26+00:00,33.853611,35.577222,39471,58220.0,2009-01-29 09:19:17+00:00,33.853611,35.577222,39471.0


The formulas for the function below are from [stackoverflow](https://stackoverflow.com/questions/19412462/getting-distance-between-two-points-based-on-latitude-longitude).

In [107]:
@njit
def distance_between_two_coordinates(
    lat1_degrees, lon1_degrees, lat2_degrees, lon2_degrees
):
    """
    Distance in km between two points given in coordinates.

    Arguments
    =========
    lat1_degrees : float
        latitude of the first point in degrees.
    lon1_degrees : float
        longitude of the first point in degrees.
    lat2_degrees : float
        latitude of the second point in degrees.
    lon2_degrees : float
        longitude of the second point in degrees.

    Returns
    =======
    distance : float
        distance in km between the two coordinates.
    """

    # approximate radius of earth in km
    R = 6373.0

    lat1 = np.deg2rad(lat1_degrees)
    lon1 = np.deg2rad(lon1_degrees)
    lat2 = np.deg2rad(lat2_degrees)
    lon2 = np.deg2rad(lon2_degrees)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

    distance = R * c

    return distance

In [111]:
long_distance_travel_b["distance"] = distance_between_two_coordinates(
    long_distance_travel_b.latitude_start.values,
    long_distance_travel_b.longitude_start.values,
    long_distance_travel_b.latitude_end.values,
    long_distance_travel_b.longitude_end.values,
)

long_distance_travel_g["distance"] = distance_between_two_coordinates(
    long_distance_travel_g.latitude_start.values,
    long_distance_travel_g.longitude_start.values,
    long_distance_travel_g.latitude_end.values,
    long_distance_travel_g.longitude_end.values,
)

In [2]:
long_distance_travel_b = long_distance_travel_b[long_distance_travel_b.distance >= 500]
long_distance_travel_g = long_distance_travel_g[long_distance_travel_g.distance >= 500]

In [123]:
long_distance_travel_b.sample(5)

Unnamed: 0,user_start,checkin_time_start,latitude_start,longitude_start,cell_number_start,user_end,checkin_time_end,latitude_end,longitude_end,cell_number_end,distance
4025094,34233,2009-05-30 20:55:42+00:00,40.714269,-74.005973,2165,34233.0,2009-05-31 17:57:53+00:00,40.760779,-111.891047,19125.0,3167.599933
1928631,7648,2009-10-13 05:57:35+00:00,38.284431,-0.557602,14026,7648.0,2009-10-13 07:44:38+00:00,42.883514,-2.729923,16281.0,543.417881
771506,2175,2010-09-21 02:37:35+00:00,35.697822,139.774251,219958,2175.0,2010-09-26 20:16:52+00:00,1.3568,103.9891,110208.0,5310.727443
3324717,18344,2009-08-29 00:51:10+00:00,38.32941,-76.46405,4583,18344.0,2009-09-09 05:50:22+00:00,39.533449,-119.757987,27791.0,3711.909232
2777313,12265,2008-06-04 00:16:31+00:00,40.714269,-74.005973,2165,12265.0,2008-06-04 22:07:53+00:00,39.099727,-94.578567,11608.0,1760.40988


## Research Questions

### Which countries travel the most long distance by plane?

### Where do people from different countries travel to the most?

### Is it possible to predict user home areas based on their long distance travel patterns?