# Creative Extension Analysis

In this notebook, we propose a creative extension of the paper _Friendship and Mobility: User Movement in Location-Based Social Networks_. We address several research questions using the paper's check-in datasets (Brightkite and Gowalla) together with an additional airport dataset.

In particular, we aim to find out:
- Which countries' users take the most long-distance flights?
- Where do people from different countries travel to the most?
- Where are users' friends based?
- Is it possible to predict user home areas based on their long-distance travel patterns?

## Tools

In [2]:
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
from numba import njit
import itertools

import reverse_geocoder
import pickle

## Collect Data

We load the check-in datasets, the new airport dataset, and a country lookup table.

In [3]:
checkins_b = pd.read_csv("data/Brightkite_totalCheckins.txt", delimiter='\t', names=["user", "checkin_time", "latitude", "longitude", "location_id"])
checkins_g = pd.read_csv("data/Gowalla_totalCheckins.txt", delimiter='\t', names=["user", "checkin_time", "latitude", "longitude", "location_id"])

In [10]:
airports_import = pd.read_csv("data/airports.csv")

In [123]:
countries_import = pd.read_csv("data/countries.csv")

In the following cells, we will only display dataframes for Brightkite.

In [23]:
checkins_b.head()

Unnamed: 0,user,checkin_time,latitude,longitude,location_id
0,0,2010-10-17T01:48:53Z,39.747652,-104.99251,88c46bf20db295831bd2d1718ad7e6f5
1,0,2010-10-16T06:02:04Z,39.891383,-105.070814,7a0f88982aa015062b95e3b4843f9ca2
2,0,2010-10-16T03:48:54Z,39.891077,-105.068532,dd7cd3d264c2d063832db506fba8bf79
3,0,2010-10-14T18:25:51Z,39.750469,-104.999073,9848afcc62e500a01cf6fbf24b797732f8963683
4,0,2010-10-14T00:21:47Z,39.752713,-104.996337,2ef143e12038c870038df53e0478cefc


Here's the airport dataset.

In [11]:
airports_import.head()

Unnamed: 0,id,ident,type,name,latitude_deg,longitude_deg,elevation_ft,continent,iso_country,iso_region,municipality,scheduled_service,gps_code,iata_code,local_code,home_link,wikipedia_link,keywords
0,6523,00A,heliport,Total Rf Heliport,40.070801,-74.933601,11.0,,US,US-PA,Bensalem,no,00A,,00A,,,
1,323361,00AA,small_airport,Aero B Ranch Airport,38.704022,-101.473911,3435.0,,US,US-KS,Leoti,no,00AA,,00AA,,,
2,6524,00AK,small_airport,Lowell Field,59.9492,-151.695999,450.0,,US,US-AK,Anchor Point,no,00AK,,00AK,,,
3,6525,00AL,small_airport,Epps Airpark,34.864799,-86.770302,820.0,,US,US-AL,Harvest,no,00AL,,00AL,,,
4,6526,00AR,closed,Newport Hospital & Clinic Heliport,35.6087,-91.254898,237.0,,US,US-AR,Newport,no,,,,,,00AR


In [25]:
print("Brightkite checkins dataset has {} rows".format(len(checkins_b)))
print("Gowalla checkins dataset has {} rows".format(len(checkins_g)))

Brightkite checkins dataset has 4747287 rows
Gowalla checkins dataset has 6442892 rows


## Work on Check-ins Datasets

We check for missing values in both datasets. Only Brightkite has any (6 rows), and we drop them.

In [136]:
print(checkins_b.isna().sum())
print()
print(checkins_g.isna().sum())

user            0
checkin_time    6
latitude        6
longitude       6
location_id     6
dtype: int64

user            0
checkin_time    0
latitude        0
longitude       0
location_id     0
dtype: int64


In [27]:
checkins_b = checkins_b.dropna()

In [28]:
checkins_b["checkin_time"] = pd.to_datetime(checkins_b["checkin_time"], format = "%Y-%m-%dT%H:%M:%SZ")
checkins_g["checkin_time"] = pd.to_datetime(checkins_g["checkin_time"], format = "%Y-%m-%dT%H:%M:%SZ")

We also need to check that the coordinates are valid: longitude must lie in [-180, 180] and latitude in [-90, 90].

In [29]:
print(len(checkins_b[((checkins_b["longitude"] < -180) | (checkins_b["longitude"] > 180)) | ((checkins_b["latitude"] < -90) | (checkins_b["latitude"] > 90))]))
print(len(checkins_g[((checkins_g["longitude"] < -180) | (checkins_g["longitude"] > 180)) | ((checkins_g["latitude"] < -90) | (checkins_g["latitude"] > 90))]))

109
29


There are 109 check-ins with out-of-range coordinates in the Brightkite dataset and 29 in Gowalla. We remove them.

In [31]:
to_remove_b = checkins_b[((checkins_b["longitude"] < -180) | (checkins_b["longitude"] > 180)) | ((checkins_b["latitude"] < -90) | (checkins_b["latitude"] > 90))]
to_remove_g = checkins_g[((checkins_g["longitude"] < -180) | (checkins_g["longitude"] > 180)) | ((checkins_g["latitude"] < -90) | (checkins_g["latitude"] > 90))]

checkins_b = checkins_b.drop(to_remove_b.index)
checkins_g = checkins_g.drop(to_remove_g.index)
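
The range check above is written out several times. As a minimal sketch (the helper name `invalid_coordinate_mask` is hypothetical and not part of the notebook), the mask could be built once with `Series.between`, which is inclusive on both ends by default. One difference from the comparisons above: rows with missing coordinates are also flagged as invalid here.

```python
import pandas as pd


def invalid_coordinate_mask(df, lat_col="latitude", lon_col="longitude"):
    """Return a boolean Series that is True where coordinates are out of range.

    Hypothetical helper: rows with NaN coordinates are flagged as invalid too,
    because `between` evaluates to False for NaN.
    """
    lat_ok = df[lat_col].between(-90, 90)
    lon_ok = df[lon_col].between(-180, 180)
    return ~(lat_ok & lon_ok)


# Toy data (made up for illustration): one valid row and two invalid rows.
sample = pd.DataFrame(
    {"latitude": [39.7, 95.0, -12.3], "longitude": [-105.0, 10.0, 200.0]}
)
mask = invalid_coordinate_mask(sample)
cleaned = sample[~mask]
print(mask.sum(), "invalid rows removed,", len(cleaned), "kept")
```

The same mask could then be applied to both `checkins_b` and `checkins_g` instead of repeating the four comparisons.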

In [32]:
checkins_b.head()

Unnamed: 0,user,checkin_time,latitude,longitude,location_id
0,0,2010-10-17 01:48:53,39.747652,-104.99251,88c46bf20db295831bd2d1718ad7e6f5
1,0,2010-10-16 06:02:04,39.891383,-105.070814,7a0f88982aa015062b95e3b4843f9ca2
2,0,2010-10-16 03:48:54,39.891077,-105.068532,dd7cd3d264c2d063832db506fba8bf79
3,0,2010-10-14 18:25:51,39.750469,-104.999073,9848afcc62e500a01cf6fbf24b797732f8963683
4,0,2010-10-14 00:21:47,39.752713,-104.996337,2ef143e12038c870038df53e0478cefc


The `location_id` column is not needed for our analysis, so we drop it.

In [33]:
checkins_b = checkins_b.drop(columns=['location_id'])
checkins_g = checkins_g.drop(columns=['location_id'])

In [34]:
checkins_b.head()

Unnamed: 0,user,checkin_time,latitude,longitude
0,0,2010-10-17 01:48:53,39.747652,-104.99251
1,0,2010-10-16 06:02:04,39.891383,-105.070814
2,0,2010-10-16 03:48:54,39.891077,-105.068532
3,0,2010-10-14 18:25:51,39.750469,-104.999073
4,0,2010-10-14 00:21:47,39.752713,-104.996337


## Work on Airport Dataset

In [309]:
airports_import.head()

Unnamed: 0,id,ident,type,name,latitude_deg,longitude_deg,elevation_ft,continent,iso_country,iso_region,municipality,scheduled_service,gps_code,iata_code,local_code,home_link,wikipedia_link,keywords
0,6523,00A,heliport,Total Rf Heliport,40.070801,-74.933601,11.0,,US,US-PA,Bensalem,no,00A,,00A,,,
1,323361,00AA,small_airport,Aero B Ranch Airport,38.704022,-101.473911,3435.0,,US,US-KS,Leoti,no,00AA,,00AA,,,
2,6524,00AK,small_airport,Lowell Field,59.9492,-151.695999,450.0,,US,US-AK,Anchor Point,no,00AK,,00AK,,,
3,6525,00AL,small_airport,Epps Airpark,34.864799,-86.770302,820.0,,US,US-AL,Harvest,no,00AL,,00AL,,,
4,6526,00AR,closed,Newport Hospital & Clinic Heliport,35.6087,-91.254898,237.0,,US,US-AR,Newport,no,,,,,,00AR


In [310]:
airports = airports_import.drop(columns=["id", "ident", "elevation_ft", "continent", "iso_region", "municipality", "gps_code", "iata_code", "local_code", "home_link", "wikipedia_link", "keywords", "scheduled_service"])

In [311]:
airports.type.value_counts()

small_airport     35249
heliport          12566
closed             4888
medium_airport     4537
seaplane_base      1042
large_airport       613
balloonport          25
Name: type, dtype: int64

In [312]:
airports = airports[(airports.type == "large_airport") | (airports.type == "medium_airport")]

In [313]:
airports.type.value_counts()

medium_airport    4537
large_airport      613
Name: type, dtype: int64

In [314]:
airports_medium = airports[airports.type == "medium_airport"]

In [315]:
# I think we can drop the medium airports (e.g. Annemasse airport)
airports_medium[airports_medium.iso_country=='FR'].head()

Unnamed: 0,type,name,latitude_deg,longitude_deg,iso_country
33073,medium_airport,Calais-Dunkerque Airport,50.962101,1.95476,FR
33077,medium_airport,Aérodrome de Péronne Saint-Quentin,49.8685,3.02958,FR
33090,medium_airport,Le Touquet-Côte d'Opale Airport,50.517399,1.62059,FR
33092,medium_airport,Valenciennes-Denain Airport,50.325802,3.46126,FR
33095,medium_airport,Aérodrome d'Amiens-Glisy,49.873004,2.387074,FR


In [316]:
# Only keep the large airports
airports = airports[airports.type == "large_airport"]
airports = airports.drop(columns="type")

In [317]:
airports.iso_country.value_counts()

US    170
CN     35
GB     27
RU     19
DE     17
     ... 
HU      1
NE      1
MT      1
KY      1
BW      1
Name: iso_country, Length: 148, dtype: int64

In [318]:
countries_import.head()

Unnamed: 0,id,code,name,continent,wikipedia_link,keywords
0,302672,AD,Andorra,EU,https://en.wikipedia.org/wiki/Andorra,
1,302618,AE,United Arab Emirates,AS,https://en.wikipedia.org/wiki/United_Arab_Emir...,"UAE,مطارات في الإمارات العربية المتحدة"
2,302619,AF,Afghanistan,AS,https://en.wikipedia.org/wiki/Afghanistan,
3,302722,AG,Antigua and Barbuda,,https://en.wikipedia.org/wiki/Antigua_and_Barbuda,
4,302723,AI,Anguilla,,https://en.wikipedia.org/wiki/Anguilla,


In [319]:
countries = countries_import.drop(columns=['id', 'continent','wikipedia_link', 'keywords'])

In [320]:
countries = countries.rename(columns={'code':'iso_country', 'name':'country'})

In [321]:
countries.head()

Unnamed: 0,iso_country,country
0,AD,Andorra
1,AE,United Arab Emirates
2,AF,Afghanistan
3,AG,Antigua and Barbuda
4,AI,Anguilla


In [322]:
print(countries.isna().sum())

iso_country    1
country        0
dtype: int64


In [323]:
countries = countries.dropna()

In [324]:
airports = airports.merge(countries, on='iso_country')

In [325]:
airports.head()

Unnamed: 0,name,latitude_deg,longitude_deg,iso_country,country
0,Port Moresby Jacksons International Airport,-9.44338,147.220001,PG,Papua New Guinea
1,Keflavik International Airport,63.985001,-22.6056,IS,Iceland
2,Priština International Airport,42.5728,21.035801,XK,Kosovo
3,Guodu air base,36.001741,117.63201,CN,China
4,Yantai Penglai International Airport,37.657222,120.987222,CN,China


## Exploratory Data Analysis

## Research Questions

### Which countries' users take the most long-distance flights?

In [37]:
copy_checkins_b = checkins_b.copy()
copy_checkins_g = checkins_g.copy()

In [38]:
@njit
def assign_cell(lat, lon):
    """
    Assign a cell number to each point by discretizing the globe into
    25 km x 25 km cells and combining the two cell indices with the
    Cantor pairing function.
    """
    # Approximate conversion from degrees to kilometers
    lon_km = 111.320 * np.cos(np.deg2rad(lat)) * lon
    lat_km = 110.574 * lat
    # Index of the 25 km cell along each axis (floor division by 25)
    lat_km, lon_km = lat_km // 25, lon_km // 25
    # Combine the two indices into a single number with the Cantor pairing function
    return (1/2)*(lat_km + lon_km)*(lat_km + lon_km + 1) + lon_km

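For each check-in, we assign a `cell_number` by dividing the world into 25 km by 25 km cells, as in the replication. The function above converts the coordinates to approximate kilometers, computes the cell index along each axis, and combines the two indices into a single number with the Cantor pairing function.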

In [39]:
copy_checkins_b["cell_number"] = assign_cell(copy_checkins_b["latitude"].values, copy_checkins_b["longitude"].values)
copy_checkins_g["cell_number"] = assign_cell(copy_checkins_g["latitude"].values, copy_checkins_g["longitude"].values)

copy_checkins_b["cell_number"] = copy_checkins_b["cell_number"].astype('int')
copy_checkins_g["cell_number"] = copy_checkins_g["cell_number"].astype('int')
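
To make the arithmetic concrete, here is a small pure-Python sketch that applies the same formulas as `assign_cell` to a single point. The names `cell_indices` and `cell_number` are illustrative only. The point used is the first Brightkite check-in displayed earlier (user 0, Denver area).

```python
import math

CELL_KM = 25  # side length of a grid cell in kilometers


def cell_indices(lat, lon):
    """Return the (latitude, longitude) cell indices of a point."""
    lat_km = 110.574 * lat
    lon_km = 111.320 * math.cos(math.radians(lat)) * lon
    # Floor division, so negative coordinates give negative indices
    return int(lat_km // CELL_KM), int(lon_km // CELL_KM)


def cell_number(lat, lon):
    """Combine the two cell indices with the Cantor pairing formula."""
    i, j = cell_indices(lat, lon)
    # (i + j) * (i + j + 1) is always even, so integer division is exact
    return (i + j) * (i + j + 1) // 2 + j


denver = (39.747652, -104.99251)
print(cell_indices(*denver))  # -> (175, -360)
print(cell_number(*denver))   # -> 16660
```

The result, 16660, is the cell that the notebook later reports as the home cell of user 0.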

Next, we simplify the notion of home, since we only need the country each user lives in. As a first step, we take the average position of all check-ins in each cell and use it as that cell's center.

In [40]:
center_cells_b = copy_checkins_b[['latitude', 'longitude', 'cell_number']].groupby('cell_number').mean()
center_cells_g = copy_checkins_g[['latitude', 'longitude', 'cell_number']].groupby('cell_number').mean()

center_cells_b = center_cells_b.reset_index()
center_cells_g = center_cells_g.reset_index()

In [41]:
center_cells_b.head()

Unnamed: 0,cell_number,latitude,longitude
0,-286,64.545703,-149.087119
1,-284,64.1525,-145.842222
2,-282,63.562699,-142.300615
3,-280,63.934833,-145.788211
4,-279,63.661389,-144.064444
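
The negative cell numbers above show that the cell indices can be negative. The Cantor pairing function is only guaranteed to be one-to-one on non-negative integers, so with signed indices two different cells can receive the same number. The sketch below illustrates this and shows one common workaround, which the notebook does not use: mapping signed integers to non-negative ones first (a "zigzag" mapping). The function names are hypothetical, and adopting this would change every cell number shown in this notebook.

```python
def cantor_pair(a, b):
    """Cantor pairing, in the same form as used in assign_cell."""
    return (a + b) * (a + b + 1) // 2 + b


def zigzag(n):
    """Map signed integers to non-negative ones: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1


def signed_cell_number(i, j):
    """Collision-free cell number for signed cell indices (illustrative)."""
    return cantor_pair(zigzag(i), zigzag(j))


# With signed inputs, two distinct index pairs collide under the plain pairing
print(cantor_pair(-1, 0) == cantor_pair(0, 0))

# After the zigzag mapping, every pair in a small signed grid gets a unique number
grid = [(i, j) for i in range(-5, 6) for j in range(-5, 6)]
unique_numbers = {signed_cell_number(i, j) for i, j in grid}
print(len(unique_numbers) == len(grid))
```

Whether such collisions matter in practice depends on whether colliding cells both contain check-ins. This could be checked by comparing the number of distinct cell numbers with the number of distinct index pairs.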


Now we determine each user's home cell, defined as the cell in which the user has the most check-ins. The user's home position is then the center of that cell, as computed above.

In [43]:
user_cell_b = copy_checkins_b.groupby(["user", "cell_number"]).count()
user_cell_g = copy_checkins_g.groupby(["user", "cell_number"]).count()

#We can remove these columns
user_cell_b = user_cell_b.drop(columns = ["latitude", "longitude"]) 
user_cell_g = user_cell_g.drop(columns = ["latitude", "longitude"]) 

#rename the column to count
user_cell_b.columns = ["count"] 
user_cell_g.columns = ["count"] 

#Keep, for each user, the cell with the most check-ins, then sort by user to have user 0 first
user_cell_b = user_cell_b.sort_values("count").groupby(level=0).tail(1).sort_values('user') 
user_cell_g = user_cell_g.sort_values("count").groupby(level=0).tail(1).sort_values('user') 

#Reset the index to break the multiindex and keep only the column cell_number
user_cell_b = user_cell_b.reset_index().drop(columns=["count"])
user_cell_g = user_cell_g.reset_index().drop(columns=["count"]) 

In [44]:
user_cell_b.head()

Unnamed: 0,user,cell_number
0,0,16660
1,1,34813
2,2,16660
3,3,34813
4,4,51736


We can now merge the two dataframes so that each user's home is given by the center of their most-visited cell.

In [47]:
homes_b = user_cell_b.merge(center_cells_b, on='cell_number')
homes_g = user_cell_g.merge(center_cells_g, on='cell_number')

In [48]:
homes_b.head()

Unnamed: 0,user,cell_number,latitude,longitude
0,0,16660,39.731991,-104.980205
1,2,16660,39.731991,-104.980205
2,8,16660,39.731991,-104.980205
3,12,16660,39.731991,-104.980205
4,13,16660,39.731991,-104.980205
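
The home definition used here can be summarized in a few lines of plain Python. This is only a sketch on made-up toy data, with hypothetical function names. It mirrors the logic of the pandas cells above: a cell's center is the mean of all check-ins in it, across all users, and a user's home is the center of the cell they visited most.

```python
from collections import Counter, defaultdict

# Toy check-ins (made up): (user, cell_number, latitude, longitude)
checkins = [
    (0, 10, 39.70, -105.00),
    (0, 10, 39.80, -104.90),
    (0, 20, 48.85, 2.35),
    (1, 20, 48.87, 2.33),
    (1, 20, 48.86, 2.34),
    (1, 10, 39.75, -104.95),
]


def cell_centers(rows):
    """Mean position of all check-ins in each cell, over all users."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for _, cell, lat, lon in rows:
        acc = sums[cell]
        acc[0] += lat
        acc[1] += lon
        acc[2] += 1
    return {cell: (s[0] / s[2], s[1] / s[2]) for cell, s in sums.items()}


def home_cells(rows):
    """Most-visited cell of each user (ties go to the cell seen first)."""
    counts = defaultdict(Counter)
    for user, cell, _, _ in rows:
        counts[user][cell] += 1
    return {user: c.most_common(1)[0][0] for user, c in counts.items()}


centers = cell_centers(checkins)
homes = {user: centers[cell] for user, cell in home_cells(checkins).items()}
print(homes)
```

Because the cell center is shared by every user whose home is in that cell, all users living in the same cell get identical home coordinates, as seen in the table above for cell 16660.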


Now for each home, we want to retrieve the country based on its coordinates.

In [26]:
homes_countries_b = homes_b.copy()
homes_countries_g = homes_g.copy()

In [27]:
homes_countries_b = homes_countries_b[['latitude', 'longitude']].groupby(['latitude', 'longitude']).count().reset_index()
homes_countries_g = homes_countries_g[['latitude', 'longitude']].groupby(['latitude', 'longitude']).count().reset_index()

In [28]:
homes_countries_b.head()

Unnamed: 0,latitude,longitude
0,-45.866346,170.516186
1,-45.076905,170.98207
2,-44.733333,170.466667
3,-43.576341,172.628973
4,-43.520079,172.604997


We create a function that returns the country code for a point's coordinates.

In [50]:
def retrieve_country(lat, lon):
    """
    Return the ISO country code of a point.

    Arguments
    ---------
        lat, lon: latitude and longitude of the point

    Returns
    -------
        The ISO country code of the point
    """
    return reverse_geocoder.search((lat, lon))[0]['cc']

#vectorize the function
retrieve_country_vec = np.vectorize(retrieve_country)

The following cell takes a long time to run, so we pickled its result and load it from disk instead.

In [54]:
# homes_countries_b['country'] = retrieve_country_vec(homes_countries_b['latitude'], 
#                                                     homes_countries_b['longitude'])

# homes_countries_g['country'] = retrieve_country_vec(homes_countries_g['latitude'], 
#                                                     homes_countries_g['longitude'])

In [30]:
#homes_countries_b.to_pickle('data/homes_countries_b.pkl')
#homes_countries_g.to_pickle('data/homes_countries_g.pkl')

In [54]:
homes_countries_b = pd.read_pickle("data/homes_countries_b.pkl")
homes_countries_g = pd.read_pickle("data/homes_countries_g.pkl")
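
Commenting the slow cell in and out by hand works, but the same idea can be wrapped in a small cache helper. The sketch below uses only the standard library. The name `load_or_compute` and the demo file name are hypothetical, and the "slow" function is a stand-in for the reverse geocoding step.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled result at `path`, computing and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def slow_lookup():
    """Stand-in for an expensive computation such as reverse geocoding."""
    calls.append(1)
    return {(39.731991, -104.980205): "US"}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "homes_countries_demo.pkl")
    first = load_or_compute(cache_path, slow_lookup)   # computes and saves
    second = load_or_compute(cache_path, slow_lookup)  # loads from disk

print(len(calls), first == second)
```

With this pattern the expensive step runs only when the cache file is missing, so the notebook can be re-executed from top to bottom without editing cells.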

In [57]:
homes_countries_b.head()

Unnamed: 0,latitude,longitude,country
0,-45.866346,170.516186,NZ
1,-45.076905,170.98207,NZ
2,-44.733333,170.466667,NZ
3,-43.576341,172.628973,NZ
4,-43.520079,172.604997,NZ


Now that we have the country code for each home position, we merge it back onto the homes dataframes to obtain one complete dataframe per dataset.

In [58]:
homes_countries_b = homes_b.merge(homes_countries_b, on=['latitude', 'longitude'])
homes_countries_g = homes_g.merge(homes_countries_g, on=['latitude', 'longitude'])

In [60]:
homes_countries_b.head()

Unnamed: 0,user,cell_number,latitude,longitude,country
0,0,16660,39.731991,-104.980205,US
1,2,16660,39.731991,-104.980205,US
2,8,16660,39.731991,-104.980205,US
3,12,16660,39.731991,-104.980205,US
4,13,16660,39.731991,-104.980205,US


### Where do people from different countries travel to the most?

### Is it possible to predict user home areas based on their long-distance travel patterns?