In [1]:
import pandas as pd
import csv
import numpy as np
from scripts.geocoding import *
from scripts.preprocessing import *
%load_ext autoreload
%autoreload 2

[Quoting Problem](http://stackoverflow.com/a/29857126/4811003)

**Schema**

In [2]:
schema = pd.read_table("data/twitter-swisscom/schema.txt", header = None, delim_whitespace=True,index_col=0)
schema.head()

Unnamed: 0_level_0,1,2,3,4,5
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,id,bigint(20),UNSIGNED,No,
2,userId,bigint(20),UNSIGNED,No,
3,createdAt,timestamp,No,0000-00-00,00:00:00
4,text,text,utf8_unicode_ci,No,
5,longitude,float,Yes,,


# Preprocessing

First, preprocess `sample.tsv` by removing the `\\\n` sequences (a backslash followed by a line break) that appear inside the tweet text.

In [3]:
preprocessing_data("data/twitter-swisscom/sample.tsv", "data/twitter-swisscom/modified_sample.tsv")

length of the lines =  10000
Iter 0
file has been written successfully.


In [4]:
with open("data/twitter-swisscom/sample.tsv", 'r') as h:
    lines = h.readlines()
    
to_be_processed_lines = [k for k, i in enumerate(lines) if i.endswith('\\\n') ]
print(len(to_be_processed_lines))

1210


In [5]:
lines[0]

'776522983837954049\t735449229028675584\t2016-09-15 20:48:01\tse lo dici tu... https://t.co/x7Qm1VHBKL\t\\N\t\\N\t51c0e6b24c64e54e\t\\N\t1\t\x00\t46.0027\t8.96044\tTwitter for iPhone\thttp://twitter.com/#!/download/iphone\tplvtone filiae.\thazel_chb\t146\t110\t28621\tEarleen. \n'

In [6]:
def remove_line_breaks(list_of_strings):
    for i, s in enumerate(list_of_strings):
        if s.endswith('\\\n'):
            list_of_strings[i] = s[:-2]

remove_line_breaks(lines)
with open("data/twitter-swisscom/modified_sample.tsv", 'w') as h:
    h.writelines(lines)

In [7]:
# The dataset uses no quoting, so disable quote handling
df = pd.read_csv("data/twitter-swisscom/modified_sample.tsv", header = None, sep='\t',quoting=csv.QUOTE_NONE)
df.columns = schema[1]
df.head()

1,id,userId,createdAt,text,longitude,latitude,placeId,inReplyTo,source,truncated,placeLatitude,placeLongitude,sourceName,sourceUrl,userName,screenName,followersCount,friendsCount,statusesCount,userLocation
0,776522983837954049,735449229028675584,2016-09-15 20:48:01,se lo dici tu... https://t.co/x7Qm1VHBKL,\N,\N,51c0e6b24c64e54e,\N,1,,46.0027,8.96044,Twitter for iPhone,http://twitter.com/#!/download/iphone,plvtone filiae.,hazel_chb,146,110,28621,Earleen.
1,776523000636203010,2741685639,2016-09-15 20:48:05,https://t.co/noYrTnqmg9,\N,\N,4e7c21fd2af027c6,\N,1,,46.8131,8.22414,Twitter for iPhone,http://twitter.com/#!/download/iphone,samara,letisieg,755,2037,3771,Suisse
2,776523045200691200,435239151,2016-09-15 20:48:15,@BesacTof @Leonid_CCCP Tu dois t'engager en si...,\N,\N,12eb9b254faf37a3,776522113859608576,5,,47.201,5.94082,Twitter for Android,http://twitter.com/download/android,lebrübrü❤,lebrubru,811,595,30191,Fontain
3,776523058404290560,503244217,2016-09-15 20:48:18,@Mno0or_Abyat اشوف مظاهرات على قانون العمل الج...,\N,\N,30bcd7f767b4041e,776521597515624448,1,,45.8011,6.16552,Twitter for iPhone,http://twitter.com/#!/download/iphone,عبدالله القنيص,bingnais,28433,417,12262,Shargeyah
4,776523058504925185,452805259,2016-09-15 20:48:18,Greek night #geneve (@ Emilios in Genève) http...,6.14414,46.1966,c3a6437e1b1a726d,\N,3,,46.2048,6.14319,foursquare,http://foursquare.com,Alkan Şenli,Alkanoli,204,172,3390,İstanbul/Burgazada


# Data Stats

## Case Study

#### Information Independent of tweets

|attribute| value| |
|------|------| --------- |
|id | 776523058504925185 | unique index in url|
|url |https://twitter.com/alkanoli/status/776523058504925185||
|`userLocation`| İstanbul/Burgazada | come from/resides|

#### Information related to tweets

|attribute| value ||
|-----|-----|----|
|`text`|'Greek night #geneve (@ Emilios in Genève) https://t.co/sEplW0Mcyz'| content|
|`createdAt`|`2016-09-15 20:48:18`| |
|`placeId`| `c3a6437e1b1a726d`| place id of a tweet (by inspecting the tweet's webpage)|
|`longitude`| 6.14414||
|`latitude`|46.1966	||
|`placeLatitude`|46.2048||
|`placeLongitude`|6.14319||

## Basic Information

### How to Use Geo Information

1. (`latitude`, `longitude`) and (`placeLatitude`, `placeLongitude`) are nearly the same, but (`placeLatitude`, `placeLongitude`) is more complete.
2. `placeId`: identifier of the tweet location; it groups people who tweet near each other. We may need **reverse geocoding** to get the name of the exact place.
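The first point can be checked on a handful of rows. The sketch below copies a few rows from this sample by hand, turns the `\N` placeholder into NaN, and measures the gap in degrees between tweet coordinates and place coordinates. The helper name `coordinate_gap` is hypothetical and not part of `scripts`.

```python
import pandas as pd

# A few rows copied by hand from the sample (tweet coords vs. place coords).
rows = pd.DataFrame({
    "longitude":      ["6.14414", "\\N", "8.95092", "5.99278"],
    "latitude":       ["46.1966", "\\N", "46.006", "47.2763"],
    "placeLongitude": [6.14319, 8.96044, 8.96044, 5.96952],
    "placeLatitude":  [46.2048, 46.0027, 46.0027, 47.2635],
})

def coordinate_gap(frame):
    """Absolute difference in degrees between tweet and place coordinates."""
    # errors="coerce" turns the '\N' marker into NaN instead of raising.
    lon = pd.to_numeric(frame["longitude"], errors="coerce")
    lat = pd.to_numeric(frame["latitude"], errors="coerce")
    return pd.DataFrame({
        "lon_gap": (lon - frame["placeLongitude"]).abs(),
        "lat_gap": (lat - frame["placeLatitude"]).abs(),
    })

gap = coordinate_gap(rows)
print(gap.max())         # largest gap among rows that have both coordinates
print(gap.isna().sum())  # rows where the tweet coordinates are missing
```

On these rows the gaps stay within a few hundredths of a degree, while the place coordinates are present even when the tweet coordinates are not.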

In [8]:
df[(df.longitude != '\\N')][['longitude', 'latitude', 'placeLongitude', 'placeLatitude', 'placeId', 'userLocation']].dropna().head(10)

1,longitude,latitude,placeLongitude,placeLatitude,placeId,userLocation
4,6.14414,46.1966,6.14319,46.2048,c3a6437e1b1a726d,İstanbul/Burgazada
26,8.95092,46.006,8.96044,46.0027,51c0e6b24c64e54e,Lahore
31,6.81899,47.1003,6.82645,47.1136,c2bf4772ec58dc04,"La Chaux-de-Fonds, Neuchâtel"
41,8.94542,45.9915,8.95449,45.9884,6b2eafacf6c765ba,Lahore
54,5.99278,47.2763,5.96952,47.2635,1cf182db3b9e8fc5,Besançon
88,9.08087,45.8132,9.08382,45.8002,cd661902b07eb657,\N
93,6.10767,46.2308,6.079,46.2322,068c70be7b3a4cc2,"Utrecht, NL"
128,6.15127,46.2101,6.14319,46.2048,c3a6437e1b1a726d,"iPhone: 47.632786,-122.026932"
139,7.69677,46.4999,7.71341,46.4657,1231efcfff3a1c64,\N
140,8.30979,47.0541,8.31721,47.0408,8b3e53628223753a,\N


The `placeId` may be a hash of the longitude and latitude. It gives us a more compact way to group nearby users.
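Whether a `placeId` really stands for a single coordinate pair can be tested directly. A minimal sketch on toy rows (values copied from the sample above, variable names are illustrative only):

```python
import pandas as pd

# Toy rows: two tweets share a placeId, a third has its own.
toy = pd.DataFrame({
    "placeId":        ["c3a6437e1b1a726d", "c3a6437e1b1a726d", "51c0e6b24c64e54e"],
    "placeLatitude":  [46.2048, 46.2048, 46.0027],
    "placeLongitude": [6.14319, 6.14319, 8.96044],
})

# Number of distinct coordinate values seen for each placeId.
distinct = toy.groupby("placeId")[["placeLatitude", "placeLongitude"]].nunique()

# True when every placeId maps to exactly one latitude and one longitude.
one_to_one = bool((distinct == 1).all().all())
print(one_to_one)
```

Running the same check on the full dataframe would show whether taking the first row per `placeId` loses any information.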



#### Reverse geocoding with geopy

Since `geopy` supports many data sources, the format of the result can differ between them. We will rely on the most common fields, such as

1. `country_code`
2. `state`

The following are two examples: the first is a place in Berlin and the second is EPFL.

Comparing the two examples, we find:

1. EPFL has `county` and `town`, but no `city` entry.
2. `Potsdamer Platz` has `city`, but no `county` or `town`.

Both examples have a `state` entry, which seems to be the better choice. (For countries like Liechtenstein, which have no `state`, we will ...)
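One possible way to cope with the differing fields is an ordered fallback. The sketch below is only an assumption about how the open case could be handled: the function name `pick_region`, the fallback order, and the `no_state` record are all hypothetical.

```python
def pick_region(address, keys=("state", "county", "city", "town")):
    """Return the first region-like field present in a Nominatim-style address."""
    for key in keys:
        value = address.get(key)
        if value:
            return value
    return "unspecified"

# Trimmed copies of the two address dicts shown in this section.
berlin = {"city": "Berlin", "city_district": "Mitte",
          "country_code": "de", "state": "Berlin"}
epfl = {"country_code": "ch", "county": "District de l'Ouest lausannois",
        "state": "Vaud", "town": "Ecublens"}
# Made-up record without a `state` entry, for illustration only.
no_state = {"country_code": "li", "town": "Vaduz"}

for address in (berlin, epfl, no_state):
    print(pick_region(address))
```

Putting `state` first keeps the behavior identical for records that have it, and the remaining keys only matter for the few that do not.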

In [9]:
rev_geo(lat=52.509669, lon=13.376294)

{'address': {'attraction': 'Potsdamer Platz',
  'city': 'Berlin',
  'city_district': 'Mitte',
  'country': 'Deutschland',
  'country_code': 'de',
  'postcode': '10117',
  'road': 'Potsdamer Platz',
  'state': 'Berlin'},
 'boundingbox': ['52.5093982', '52.5095982', '13.3764983', '13.3766983'],
 'display_name': 'Potsdamer Platz, Mitte, Berlin, 10117, Deutschland',
 'lat': '52.5094982',
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright',
 'lon': '13.3765983',
 'osm_id': '245981373',
 'osm_type': 'node',
 'place_id': '644947'}

In [10]:
geo(loc='EPFL')

(6.566561505148, 46.5186594)

In [11]:
rev_geo(lon=6.566561505148, lat=46.5186594)

{'address': {'country': 'Schweiz, Suisse, Svizzera, Svizra',
  'country_code': 'ch',
  'county': "District de l'Ouest lausannois",
  'pedestrian': 'Place Cosandey',
  'postcode': '1015',
  'state': 'Vaud',
  'town': 'Ecublens',
  'university': 'École Polytechnique Fédérale de Lausanne (EPFL)'},
 'boundingbox': ['46.5152316', '46.5222479', '6.5601751', '6.5721733'],
 'display_name': "École Polytechnique Fédérale de Lausanne (EPFL), Place Cosandey, Ecublens, District de l'Ouest lausannois, Vaud, 1015, Schweiz, Suisse, Svizzera, Svizra",
 'lat': '46.5186594',
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright',
 'lon': '6.566561505148',
 'osm_id': '23391253',
 'osm_type': 'way',
 'place_id': '68692007'}

In [12]:
# Distance in kilometers using the Vincenty formula
distance(lat1=52.509669, lon1=13.376294,lon2=6.566561505148, lat2=46.5186594)

648.9242490859898
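As an independent cross-check, the great-circle (haversine) distance can be computed with the standard library alone. This is a spherical approximation with an assumed mean Earth radius, so it should differ from an ellipsoidal Vincenty result by well under one percent. The function name `haversine_km` is hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Same two points as above: Potsdamer Platz and EPFL.
d = haversine_km(lat1=52.509669, lon1=13.376294,
                 lat2=46.5186594, lon2=6.566561505148)
print(round(d))
```

For these two points the spherical estimate comes out somewhat above 800 km, which is noticeably more than the value printed above, so the argument handling inside `distance` may be worth re-checking.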

##### Apply to this problem

In [13]:
# Every row with the same `placeId` has the same lat and lon, so we simply take
# the `first()` row of each group
placeId_lat_lon_df = df.groupby('placeId').first()[['placeLatitude','placeLongitude']]
placeId_lat_lon_df.head()

1,placeLatitude,placeLongitude
placeId,Unnamed: 1_level_1,Unnamed: 2_level_1
000a93ad12003aaa,46.8911,7.51217
0046b64d1941431e,46.2118,6.43633
0070770855fc0793,47.4499,10.3448
007355fb62ccfa7b,45.8411,8.72494
00b3d266c3ec547d,47.2673,8.67959


In [14]:
import pickle
import os

if os.path.exists("data/twitter-swisscom/placeId_lat_lon_df.pickle"):
    with open("data/twitter-swisscom/placeId_lat_lon_df.pickle", "rb") as h:
        placeId_lat_lon_df = pickle.load(h)
else:
    def get_geo_info(x):
        return rev_geo(lon=x['placeLongitude'], lat=x['placeLatitude'])

    import time
    t = time.time()
    placeId_lat_lon_df['raw_geo'] = placeId_lat_lon_df.apply(lambda x: get_geo_info(x), axis=1)
    print("time for all items = ", time.time() - t)
    with open("data/twitter-swisscom/placeId_lat_lon_df.pickle", "wb") as h:
        pickle.dump(placeId_lat_lon_df, h)

In [15]:
placeId_lat_lon_df.iloc[1, 2]

{'address': {'country': 'France',
  'country_code': 'fr',
  'county': 'Thonon-les-Bains',
  'postcode': '74420',
  'road': 'Route de la Gruaz',
  'state': 'Auvergne-Rhône-Alpes',
  'suburb': 'Les Andrys',
  'village': 'Villard'},
 'boundingbox': ['46.2116281', '46.2151875', '6.434952', '6.4378786'],
 'display_name': 'Route de la Gruaz, Les Andrys, Villard, Thonon-les-Bains, Haute-Savoie, Auvergne-Rhône-Alpes, 74420, France',
 'lat': '46.2129927',
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright',
 'lon': '6.4369696',
 'osm_id': '143600488',
 'osm_type': 'way',
 'place_id': '96129716'}

In [16]:
placeId_lat_lon_df['country_code'] = placeId_lat_lon_df['raw_geo'].apply(lambda x: x['address']['country_code'])

In [17]:
# Switzerland, France, Germany, Italy, Austria, Turkey, Liechtenstein
placeId_lat_lon_df['country_code'].value_counts()

ch    565
fr    174
it    134
de     53
at     17
li      4
tr      1
Name: country_code, dtype: int64

In [18]:
placeId_lat_lon_df['state']=placeId_lat_lon_df['raw_geo'].apply(lambda x: x['address'].get('state'))

In [19]:
placeId_lat_lon_df['state'].unique()

array(['Bern - Berne', 'Auvergne-Rhône-Alpes', 'Bayern', 'LOM', 'Zürich',
       'Graubünden - Grigioni - Grischun', 'Grand-Est', 'Tirol',
       'Schaffhausen', 'Genève', 'Bourgogne-Franche-Comté', 'Vorarlberg',
       'Ticino', 'Valais - Wallis', 'Schwyz', 'Thurgau', 'Solothurn',
       'Neuchâtel', 'Jura', 'Aargau', 'Basel-Landschaft', 'Vaud',
       'Sankt Gallen', 'Basel-Stadt', 'Baden-Württemberg', 'Luzern',
       'Fribourg - Freiburg', 'Obwalden', 'Nidwalden', 'PIE', 'Zug',
       'Glarus', 'VDA', 'Tekirdağ', None, 'Appenzell Innerrhoden', 'TAA',
       'Appenzell Ausserrhoden', 'Uri', 'Nouvelle-Aquitaine'], dtype=object)

# Processing

## Problems in the data:

1. A trailing `\` before line breaks inside the tweet text. Solved in preprocessing by writing a cleaned file.
2. `\N` in the tweet timestamp.

### Deal with missing values
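One option, sketched here on made-up rows, is to let `pd.read_csv` map the `\N` marker to NaN at load time through `na_values`, so numeric columns are parsed as floats instead of strings.

```python
import csv
import io

import pandas as pd

# Two made-up TSV rows that use '\N' as the NULL marker.
raw = "1\t\\N\t46.2048\n2\t6.14414\t\\N\n"

toy = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    quoting=csv.QUOTE_NONE,
    na_values=["\\N"],  # treat the literal two characters \N as missing
)
print(toy.dtypes)
print(toy.isna().sum())
```

With this in place, filters such as `df.longitude != '\\N'` can be replaced by `df.longitude.notna()`.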

## Time

We would like to analyze people's location and behavior over time.

Periodic behavior of a person or group:

1. Hour of the tweet: morning, afternoon, evening. (Maybe we can classify them.)
2. Day of the week: 1~7 with `isoweekday()` (1 means Monday); note that `weekday()`, used below, counts 0~6 starting from Monday. Example: work at some place and visit .. on Sunday.
3. Month: do people cross the border more frequently in summer/winter/...?

Potential problem:

1. Bias:
    1. Some people may only tweet on weekends.
    2. Some people only tweet when they are travelling.
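The time features listed above can be sketched with the vectorized `.dt` accessor. The bin edges for the part of the day are an assumption made for illustration, not something fixed by the data.

```python
import pandas as pd

toy = pd.DataFrame({"createdAt": ["2016-09-15 20:48:01", "2010-03-13 07:05:00"]})
ts = pd.to_datetime(toy["createdAt"])

toy["weekday"] = ts.dt.weekday  # 0 = Monday ... 6 = Sunday
toy["hour"] = ts.dt.hour
toy["month"] = ts.dt.month

# Assumed bins: [0, 6) night, [6, 12) morning, [12, 18) afternoon, [18, 24) evening.
toy["part_of_day"] = pd.cut(
    toy["hour"],
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["night", "morning", "afternoon", "evening"],
)
print(toy)
```

Compared with calling `apply` once per feature, this parses the timestamps a single time and keeps the columns as native pandas types.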

##### pandas library

In [20]:
s = df.loc[0, 'createdAt']; print(s)
st = pd.Timestamp(s).to_pydatetime()

2016-09-15 20:48:01


In [21]:
st.day, st.date(), st.ctime(), st.hour, st.isocalendar(), st.isoweekday(), st.isoformat()

(15,
 datetime.date(2016, 9, 15),
 'Thu Sep 15 20:48:01 2016',
 20,
 (2016, 37, 4),
 4,
 '2016-09-15T20:48:01')

In [22]:
st.month, st.year, st.weekday(), st.toordinal(), st.timetuple()

(9,
 2016,
 3,
 736222,
 time.struct_time(tm_year=2016, tm_mon=9, tm_mday=15, tm_hour=20, tm_min=48, tm_sec=1, tm_wday=3, tm_yday=259, tm_isdst=-1))

##### For this problem

In [23]:
df['time'] = df['createdAt'].apply(lambda x: pd.Timestamp(x).to_pydatetime())

In [24]:
df['weekday'] = df['time'].apply(lambda x: x.weekday())

In [25]:
df['day'] = df['time'].apply(lambda x: x.day)
df['hour'] = df['time'].apply(lambda x: x.hour)
df['month'] = df['time'].apply(lambda x: x.month)

In [26]:
df[['time', 'month', 'weekday', 'day', 'hour']].head(5)

1,time,month,weekday,day,hour
0,2016-09-15 20:48:01,9,3,15,20
1,2016-09-15 20:48:05,9,3,15,20
2,2016-09-15 20:48:15,9,3,15,20
3,2016-09-15 20:48:18,9,3,15,20
4,2016-09-15 20:48:18,9,3,15,20


## Add location information

In [27]:
# df_1 holds the first 5,000,000 rows
with open("data/twitter-swisscom/df_saved3.pickle", "rb") as h:
    df_1 = pickle.load(h)

In [31]:
place_df_1 = df_1.groupby('placeId').first()

Augment geo information

In [32]:
def get_geo_info(x):
    return rev_geo(lon=x['placeLongitude'], lat=x['placeLatitude'])

In [33]:
def get_place_name_of_placeId_from_df(df, pause_secs=1):
    """Reverse-geocode every row of a dataframe indexed by unique 'placeId'."""
    import time
    if df.index.name != 'placeId' or not df.index.is_unique:
        raise ValueError("Input dataframe should use unique 'placeId' as index")
    
    for col in ['placeLongitude', 'placeLatitude']:
        if (col not in df.columns):
            raise ValueError("Input dataframe should have cols 'placeLongitude', 'placeLatitude'")

    pId_geo_map = []
    t = time.time()
    try:
        for i, pId in enumerate(df.index):
            x = df.loc[pId]
            raw_geo_data = rev_geo(lon=x['placeLongitude'], lat=x['placeLatitude'])
            pId_geo_map.append((pId, raw_geo_data))
            time.sleep(pause_secs)

    #         if (i+1 % 20 == 0):
            print(i, time.time() - t)
            t = time.time()
    except Exception as e:
        # Keep the results gathered so far instead of losing them.
        print("Error, returning partial results:", e)
        return pId_geo_map
        
    return pId_geo_map

if os.path.exists('data/twitter-swisscom/pId_geo_df_1.pickle'):
    with open("data/twitter-swisscom/pId_geo_df_1.pickle", 'rb') as d:
        pId_geo_df_1 = pickle.load(d)
else:
    list_of_pId_geo_pair = get_place_name_of_placeId_from_df(place_df_1, pause_secs=1.1)
    pId_geo_map={'placeId':[i for i,geo in list_of_pId_geo_pair], 'geo':[geo for i,geo in list_of_pId_geo_pair]}
    pId_geo_df_1 = pd.DataFrame(pId_geo_map).set_index('placeId')
    with open("data/twitter-swisscom/pId_geo_df_1.pickle", 'wb') as d:
        pickle.dump(pId_geo_df_1, d)
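Because the lookup is rate limited and can fail midway, it helps if a second run only fetches what is still missing. The following is a minimal sketch of that idea under stated assumptions: `geocode_with_resume` and `flaky_lookup` are hypothetical names, and the fake lookup stands in for a real reverse geocoder such as `rev_geo`.

```python
import time

def geocode_with_resume(coords, lookup, cache=None, pause_secs=0.0):
    """Fill `cache` with lookup results, skipping ids that are already cached.

    coords: dict mapping place id -> (lat, lon)
    lookup: callable(lat, lon) -> result, a stand-in for a real geocoder
    """
    cache = {} if cache is None else cache
    for place_id, (lat, lon) in coords.items():
        if place_id in cache:
            continue  # already fetched in an earlier run
        try:
            cache[place_id] = lookup(lat, lon)
        except Exception as exc:
            print("stopping early:", exc)  # keep what we have so far
            break
        time.sleep(pause_secs)  # stay under the service's rate limit
    return cache

# Fake geocoder that fails on its third call to imitate a dropped connection.
calls = {"n": 0}

def flaky_lookup(lat, lon):
    calls["n"] += 1
    if calls["n"] == 3:
        raise RuntimeError("simulated timeout")
    return {"lat": lat, "lon": lon}

coords = {"p1": (46.89, 7.51), "p2": (46.21, 6.44), "p3": (47.45, 10.34)}
partial = geocode_with_resume(coords, flaky_lookup)
n_after_first = len(partial)  # the third lookup failed
full = geocode_with_resume(coords, flaky_lookup, cache=partial)
print(n_after_first, len(full))
```

The cache is an ordinary dict, so it can be pickled between runs in the same way as the dataframes in this notebook.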

In [34]:
pId_geo_df_1.head()

Unnamed: 0_level_0,geo,country_code,state
placeId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
cbc61afc4287f5f0,"{'osm_type': 'node', 'lon': '6.1481217', 'osm_...",ch,Genève
cbc9be43b85b2499,"{'osm_type': 'way', 'lon': '9.9607657', 'osm_i...",it,LOM
cbcea709e3d85e36,"{'osm_type': 'way', 'lon': '6.0503151', 'osm_i...",ch,Genève
cbd63634353b5d6d,"{'osm_type': 'way', 'lon': '7.709452', 'osm_id...",ch,Bern - Berne
cbd8ad77c9e72076,"{'osm_type': 'way', 'lon': '7.4516371', 'osm_i...",ch,Solothurn


In [36]:
pId_geo_df_1['state'] = pId_geo_df_1.apply(lambda x: x['geo']['address'].get('state', 'unspecified'), axis=1)

In [37]:
pId_geo_df_1['country_code'] = pId_geo_df_1.apply(lambda x: x['geo']['address']['country_code'], axis=1)

In [39]:
pId_geo_df_1.groupby(['country_code', 'state']).count().head()

Unnamed: 0_level_0,Unnamed: 1_level_0,geo
country_code,state,Unnamed: 2_level_1
at,Tirol,29
at,Vorarlberg,136
ch,Aargau,474
ch,Appenzell Ausserrhoden,47
ch,Appenzell Innerrhoden,25


In [40]:
pId_geo_df_1['country_code'].value_counts()

ch    8336
it    1769
fr    1316
de     607
at     165
li      42
us       1
Name: country_code, dtype: int64

In [41]:
with open("data/twitter-swisscom/pId_geo_df_1.pickle", 'wb') as d:
    pickle.dump(pId_geo_df_1, d)

In [44]:
df_1

1,userId,createdAt,placeId,placeLatitude,placeLongitude,followersCount,friendsCount,statusesCount,time,weekday,day,hour,month,year
552,6257282,2010-03-09 18:09:51,9f61955efec1f923,47.5367,7.57849,14249,9260,19585,2010-03-09 18:09:51,1,9,18,3,2010
610,15602037,2010-03-10 22:44:24,0f62dd0accad77d3,47.3791,8.50021,177,136,5167,2010-03-10 22:44:24,2,10,22,3,2010
612,625553,2010-03-11 05:59:25,f0df77ae625fea91,46.1996,6.13011,471,82,3363,2010-03-11 05:59:25,3,11,5,3,2010
613,17341045,2010-03-11 06:18:47,512441eea623380b,46.9214,7.38855,586,508,9016,2010-03-11 06:18:47,3,11,6,3,2010
619,634553,2010-03-11 07:03:10,234fc23432bfd559,46.1938,6.15415,2230,387,10605,2010-03-11 07:03:10,3,11,7,3,2010
623,14657884,2010-03-11 11:03:11,5d8c73488f53c56e,46.1873,6.12815,167,277,2885,2010-03-11 11:03:11,3,11,11,3,2010
624,6257282,2010-03-11 11:43:20,56c8ac55f85f3681,47.5538,7.58398,14249,9260,19585,2010-03-11 11:43:20,3,11,11,3,2010
631,15050292,2010-03-11 13:08:42,30c54d82d6bef2bb,47.3765,8.54322,230,276,1788,2010-03-11 13:08:42,3,11,13,3,2010
633,7630552,2010-03-11 15:12:33,610181defd3b2fb4,47.3597,8.45071,533,292,5663,2010-03-11 15:12:33,3,11,15,3,2010
634,5936932,2010-03-11 16:59:24,1f576f8d77d2a4cf,47.3859,8.51265,394,266,3420,2010-03-11 16:59:24,3,11,16,3,2010


In [43]:
df_1['year'] = df_1['time'].apply(lambda x: x.year) 

In [47]:
df_1_place_x_time_count = df_1.groupby(['placeId', 'year', 'month', 'day']).count()['userId']
df_1_place_x_time_count.name = 'twitter_count'

In [48]:
gp_df_1_place_x_time_count = pd.pivot_table(df_1_place_x_time_count.reset_index(), 
                    values='twitter_count', columns=['year', 'month', 'day'], 
                   index='placeId',fill_value=0)
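The same place-by-day count table can be illustrated on toy data with `size()` and `unstack()`. The place ids and dates below are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "placeId": ["a", "a", "b", "a"],
    "date":    ["2010-03-09", "2010-03-09", "2010-03-09", "2010-03-10"],
})

# Count tweets per (placeId, date), then spread the dates into columns.
counts = toy.groupby(["placeId", "date"]).size().unstack(fill_value=0)
print(counts)
```

Each row is a place and each column a day, with zeros where a place saw no tweets.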

# User-Time-Movement

In [176]:
df_sample = df_1[:10000].copy()

#### Look at one user

In [102]:
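# Example of one group: user 5033, who has 20 tweets among the first 10,000 tweets
# Assumed definition (not shown in the original cells): group the sample by user
gp_df_sample = df_sample.groupby('userId')
twitter_idx_5033 = gp_df_sample.groups[5033]
df_sample_5033 = df_sample.loc[twitter_idx_5033]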
df_sample_5033.head()

1,userId,createdAt,placeId,placeLatitude,placeLongitude,followersCount,friendsCount,statusesCount,time,weekday,day,hour,month,year
733,5033,2010-03-13 17:38:24,0f62dd0accad77d3,47.3791,8.50021,901,664,2525,2010-03-13 17:38:24,5,13,17,3,2010
1161,5033,2010-03-20 22:25:10,4478451a1302dc88,47.3694,8.49866,901,664,2525,2010-03-20 22:25:10,5,20,22,3,2010
1165,5033,2010-03-20 23:52:04,4478451a1302dc88,47.3694,8.49866,901,664,2525,2010-03-20 23:52:04,5,20,23,3,2010
1641,5033,2010-03-31 10:35:25,05384bb9519f36f9,47.384,8.50013,901,664,2525,2010-03-31 10:35:25,2,31,10,3,2010
1661,5033,2010-03-31 17:15:45,b9d932e8811ffe29,47.3658,8.53002,901,664,2525,2010-03-31 17:15:45,2,31,17,3,2010


In [139]:
s = df_sample_5033.sort_values('createdAt')['placeId']
s.loc[s.shift() != s]

# s = pd.Series(['1', '2', '2', '3', '3', '3', '3', '1', '1', '2', '3'])
# s.loc[s.shift() != s]

733     0f62dd0accad77d3
1161    4478451a1302dc88
1641    05384bb9519f36f9
1661    b9d932e8811ffe29
1711    f7349fa6253975c7
1841    a377b486f13b455a
2199    3acb748d0f1e9265
Name: placeId, dtype: object

In [226]:
# Example
from itertools import groupby

def remove_consecutive_in_list(l):
    return [x[0] for x in groupby(l)]
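To see what `itertools.groupby` does here, the sketch below applies it to the small list from the commented example above. It also keeps the length of each run, which could serve as a rough measure of how long someone stayed at a place.

```python
from itertools import groupby

visits = ['1', '2', '2', '3', '3', '3', '3', '1', '1', '2', '3']

# groupby() yields one (key, run) pair per run of equal neighbours,
# so keeping only the keys collapses each run to a single entry.
collapsed = [key for key, _run in groupby(visits)]
print(collapsed)  # ['1', '2', '3', '1', '2', '3']

# Keeping the run length as well records how many tweets each stay produced.
stays = [(key, len(list(run))) for key, run in groupby(visits)]
print(stays)
```

Note that a place visited again later in the sequence is kept, since only consecutive repeats are merged.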

In [177]:
df_sample['time_id'] = df_sample.apply(lambda x: str(x['time'])[:10], axis=1)

In [208]:
df_sample_m = df_sample[['userId', 'placeId', 'time_id', 'time']]

In [210]:
gp = df_sample_m.groupby(['time_id', 'userId'])

In [213]:
timeId_userId_x_placeId_df_sample = gp.aggregate(lambda x: tuple(x.loc[x.shift() != x]))

In [229]:
timeId_userId_x_placeId_df_sample['place_list'] = \
timeId_userId_x_placeId_df_sample['placeId'].apply(lambda x: remove_consecutive_in_list(x))

In [248]:
timeId_userId_x_placeId_df_sample

Unnamed: 0_level_0,1,placeId,time,place_list
time_id,userId,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2010-03-09,6257282,"(9f61955efec1f923,)","(2010-03-09 18:09:51,)",[9f61955efec1f923]
2010-03-10,15602037,"(0f62dd0accad77d3,)","(2010-03-10 22:44:24,)",[0f62dd0accad77d3]
2010-03-11,41483,"(29f071a24ea0518e,)","(2010-03-11 21:22:57,)",[29f071a24ea0518e]
2010-03-11,613853,"(8a4c512bf226c440,)","(2010-03-11 19:57:14,)",[8a4c512bf226c440]
2010-03-11,625553,"(f0df77ae625fea91,)","(2010-03-11 05:59:25,)",[f0df77ae625fea91]
2010-03-11,634553,"(234fc23432bfd559,)","(2010-03-11 07:03:10,)",[234fc23432bfd559]
2010-03-11,5936932,"(1f576f8d77d2a4cf,)","(2010-03-11 16:59:24,)",[1f576f8d77d2a4cf]
2010-03-11,6257282,"(56c8ac55f85f3681,)","(2010-03-11 11:43:20,)",[56c8ac55f85f3681]
2010-03-11,7630552,"(610181defd3b2fb4,)","(2010-03-11 15:12:33,)",[610181defd3b2fb4]
2010-03-11,14406528,"(b9d932e8811ffe29,)","(2010-03-11 19:51:49,)",[b9d932e8811ffe29]


Keep only the people who visited more than one place within the same day.

In [256]:
people_movement_df_sample = timeId_userId_x_placeId_df_sample[timeId_userId_x_placeId_df_sample.apply(
        lambda x: len(x['place_list']) > 1, axis=1)]

For each place in `place_list`, we look up the corresponding country code.

In [268]:
def placeId_to_country(pId_df, pId):
    return pId_df.loc[pId]['country_code']

In [269]:
placeId_to_country(pId_geo_df_1, 'c688b242f927674b')

'ch'

In [273]:
people_movement_df_sample['country_list'] = people_movement_df_sample['place_list'].apply(
    lambda x: [placeId_to_country(pId_geo_df_1, i) for i in x])

In [313]:
people_across_country_df_sample = people_movement_df_sample[
    people_movement_df_sample.apply(lambda x: len(set(x['country_list'])) > 1, axis=1)].copy()

In [315]:
people_across_country_df_sample['country_list'] = people_across_country_df_sample['country_list'].apply(
    lambda x: remove_consecutive_in_list(x))

In [318]:
people_across_country_df_sample

Unnamed: 0_level_0,1,placeId,time,place_list,country_list
time_id,userId,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2010-09-02,14899742,"(7d680f9768061fb8, 675cc59808602911, cb2e15f0c...","(2010-09-02 09:35:14, 2010-09-02 09:40:13, 201...","[7d680f9768061fb8, 675cc59808602911, cb2e15f0c...","[ch, fr]"
2010-09-02,17222077,"(bed10c1a8a148d8e, e26637aadb1b7cf2)","(2010-09-02 13:31:14, 2010-09-02 16:15:47)","[bed10c1a8a148d8e, e26637aadb1b7cf2]","[fr, de]"
2010-09-03,14899742,"(7d680f9768061fb8, 675cc59808602911, cb2e15f0c...","(2010-09-03 11:09:09, 2010-09-03 11:56:22, 201...","[7d680f9768061fb8, 675cc59808602911, cb2e15f0c...","[ch, fr]"
2010-09-03,16574461,"(7a6ebb0bf0adb61d, 3578d5d5f21096e8)","(2010-09-03 05:02:50, 2010-09-03 08:25:06)","[7a6ebb0bf0adb61d, 3578d5d5f21096e8]","[fr, ch]"
2010-09-03,17222077,"(3c425a2e301c15d3, e26637aadb1b7cf2)","(2010-09-03 14:01:16, 2010-09-03 19:27:40)","[3c425a2e301c15d3, e26637aadb1b7cf2]","[ch, de]"
2010-09-04,14899742,"(675cc59808602911, 291fa9e3be2095ee, cb2e15f0c...","(2010-09-04 11:35:09, 2010-09-04 14:56:44, 201...","[675cc59808602911, 291fa9e3be2095ee, cb2e15f0c...","[ch, fr]"
2010-09-05,14899742,"(cb2e15f0c0de05f2, 7d680f9768061fb8, 291fa9e3b...","(2010-09-05 08:42:54, 2010-09-05 13:14:34, 201...","[cb2e15f0c0de05f2, 7d680f9768061fb8, 291fa9e3b...","[fr, ch, fr]"
2010-09-06,14899742,"(7d680f9768061fb8, 1dbdca31c641ff14, 7d680f976...","(2010-09-06 09:03:52, 2010-09-06 10:27:16, 201...","[7d680f9768061fb8, 1dbdca31c641ff14, 7d680f976...","[ch, fr]"
2010-09-07,14899742,"(cb2e15f0c0de05f2, 675cc59808602911, ec1e80b75...","(2010-09-07 08:03:48, 2010-09-07 08:04:36, 201...","[cb2e15f0c0de05f2, 675cc59808602911, ec1e80b75...","[fr, ch]"
2010-09-07,15498938,"(fdcd221ac44fa326, f0a42d3bec54c6da, 0c127b56d...","(2010-09-07 08:47:02, 2010-09-07 09:30:54, 201...","[fdcd221ac44fa326, f0a42d3bec54c6da, 0c127b56d...","[de, ch]"


In [321]:
country_list = pId_geo_df_1.country_code.unique()
country_list

array(['ch', 'it', 'de', 'at', 'fr', 'li', 'us'], dtype=object)

In [349]:
adj_country = {i: {j: 0 for j in country_list} for i in country_list}
def add_adj_country(adj, l):
    for i in range(len(l)-1):
        adj[l[i]][l[i+1]] += 1

In [350]:
people_across_country_df_sample['country_list'].apply(lambda x: add_adj_country(adj_country, x))

time_id     userId  
2010-09-02  14899742    None
            17222077    None
2010-09-03  14899742    None
            16574461    None
            17222077    None
2010-09-04  14899742    None
2010-09-05  14899742    None
2010-09-06  14899742    None
2010-09-07  14899742    None
            15498938    None
2010-09-08  14899742    None
2010-09-10  14899742    None
2010-09-11  14899742    None
2010-09-12  14899742    None
2010-09-13  14899742    None
2010-09-14  8614392     None
            16329076    None
2010-09-15  8614392     None
            14899742    None
2010-09-17  14899742    None
2010-09-18  8614392     None
            14899742    None
2010-09-19  14899742    None
2010-09-20  8614392     None
            14899742    None
2010-09-21  13448872    None
2010-09-22  14899742    None
2010-09-23  8614392     None
            14899742    None
            16574461    None
                        ... 
2010-10-05  14899742    None
2010-10-06  14899742    None
2010-10-07  14899742  

In [351]:
adj_country

{'at': {'at': 0, 'ch': 1, 'de': 0, 'fr': 0, 'it': 0, 'li': 0, 'us': 0},
 'ch': {'at': 0, 'ch': 0, 'de': 8, 'fr': 43, 'it': 2, 'li': 1, 'us': 0},
 'de': {'at': 1, 'ch': 5, 'de': 0, 'fr': 3, 'it': 0, 'li': 0, 'us': 0},
 'fr': {'at': 0, 'ch': 41, 'de': 1, 'fr': 0, 'it': 0, 'li': 0, 'us': 0},
 'it': {'at': 0, 'ch': 2, 'de': 0, 'fr': 0, 'it': 0, 'li': 0, 'us': 0},
 'li': {'at': 0, 'ch': 1, 'de': 0, 'fr': 0, 'it': 0, 'li': 0, 'us': 0},
 'us': {'at': 0, 'ch': 0, 'de': 0, 'fr': 0, 'it': 0, 'li': 0, 'us': 0}}
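The same transition counts can be gathered without mutating a nested dict from inside `apply`. The sketch below uses `collections.Counter` on a few country lists taken from the table above, then folds both directions together to get one figure per border. The function name `count_transitions` is hypothetical.

```python
from collections import Counter

def count_transitions(country_lists):
    """Count ordered (origin, destination) pairs over all daily country lists."""
    counts = Counter()
    for countries in country_lists:
        # Pair every entry with its successor: [a, b, c] -> (a, b), (b, c).
        counts.update(zip(countries, countries[1:]))
    return counts

daily_lists = [["ch", "fr"], ["fr", "ch", "fr"], ["fr", "de"]]
transitions = count_transitions(daily_lists)

# Fold both directions together to get an undirected count per border.
undirected = Counter()
for (src, dst), n in transitions.items():
    undirected[frozenset((src, dst))] += n

print(transitions)
print(undirected)
```

The directed counts match what `adj_country` stores, and the undirected view makes it easy to rank borders by total crossings.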

https://bl.ocks.org/mbostock/4060954
https://bl.ocks.org/mbostock/70d5541b547cc222aa02
http://bl.ocks.org/mbostock/9943478
http://square.github.io/crossfilter/

In [353]:
pId_geo_df_1.head()

Unnamed: 0_level_0,geo,country_code,state
placeId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
cbc61afc4287f5f0,"{'osm_type': 'node', 'lon': '6.1481217', 'osm_...",ch,Genève
cbc9be43b85b2499,"{'osm_type': 'way', 'lon': '9.9607657', 'osm_i...",it,LOM
cbcea709e3d85e36,"{'osm_type': 'way', 'lon': '6.0503151', 'osm_i...",ch,Genève
cbd63634353b5d6d,"{'osm_type': 'way', 'lon': '7.709452', 'osm_id...",ch,Bern - Berne
cbd8ad77c9e72076,"{'osm_type': 'way', 'lon': '7.4516371', 'osm_i...",ch,Solothurn


In [352]:
pId_geo_df_1.shape

(12236, 3)

In [354]:
pId_geo_df_1.index

Index(['cbc61afc4287f5f0', 'cbc9be43b85b2499', 'cbcea709e3d85e36',
       'cbd63634353b5d6d', 'cbd8ad77c9e72076', 'cbde1db24975fa13',
       'cbe69613fe32d21e', 'cbeadd44e60e908a', 'cbefa52d8eb75814',
       'cbf7939beaf12453',
       ...
       'cb994d7be8eb1d02', 'cb99e234ed13817b', 'cbacebe499230e8a',
       'cbb48bedbf1e12cf', 'cbb86e94b9656dfa', 'cbb971751e1bddf4',
       'cbba56ca2f839d83', 'cbbc0e0664c6eccb', 'cbbdba96da662972',
       'cbc3aa9188574109'],
      dtype='object', name='placeId', length=12236)

Bidirectional,