In [380]:
import pandas as pd
import csv
import numpy as np
from scripts.geocoding import *
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[Quoting Problem](http://stackoverflow.com/a/29857126/4811003)

**Schema**

In [2]:
# sep=r"\s+" splits on any run of whitespace (same effect as the deprecated delim_whitespace=True)
schema = pd.read_table("data/twitter-swisscom/schema.txt", header=None, sep=r"\s+", index_col=0)
schema.head()

0,1,2,3,4,5
1,id,bigint(20),UNSIGNED,No,
2,userId,bigint(20),UNSIGNED,No,
3,createdAt,timestamp,No,0000-00-00,00:00:00
4,text,text,utf8_unicode_ci,No,
5,longitude,float,Yes,,


# Preprocessing

First, preprocess `sample.tsv` by removing the `\\\n` sequences (a backslash followed by a newline) that appear inside the tweet text, so that each tweet occupies a single line.

In [224]:
with open("data/twitter-swisscom/sample.tsv", 'r') as h:
    lines = h.readlines()
    
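# Indices of lines that end with a backslash followed by a newline, i.e. tweets that continue on the next line
to_be_processed_lines = [k for k, line in enumerate(lines) if line.endswith('\\\n')]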
print(len(to_be_processed_lines))

1210


In [238]:
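def remove_line_breaks(list_of_strings):
    """Strip the trailing backslash-newline in place so that, once written, the record continues with the next line."""
    for i, s in enumerate(list_of_strings):
        if s.endswith('\\\n'):
            list_of_strings[i] = s[:-2]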

remove_line_breaks(lines)
with open("data/twitter-swisscom/modified_sample.tsv", 'w') as h:
    h.writelines(lines)
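
To illustrate what this step does to the records, here is a minimal sketch on an in-memory list. The helper `join_continued_lines` and the sample rows are hypothetical and only mimic the structure of the file.

```python
def join_continued_lines(lines):
    """Merge every line ending in backslash-newline with the line(s) that follow it."""
    records, buffer = [], ""
    for line in lines:
        if line.endswith("\\\n"):
            # Drop the backslash and the newline, keep accumulating the record.
            buffer += line[:-2]
        else:
            records.append(buffer + line)
            buffer = ""
    if buffer:
        # The input ended in the middle of a continued record.
        records.append(buffer)
    return records


# Hypothetical rows: the first tweet is split over two physical lines.
sample = ["1\tfirst half \\\n", "second half\t46.2\n", "2\tsingle line\t47.1\n"]
records = join_continued_lines(sample)
print(len(records))
```

Unlike `remove_line_breaks`, this variant returns one list entry per logical record, which makes it easy to count records before writing the file.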

In [239]:
# The dataset does not use quoting, so disable quote handling
df = pd.read_csv("data/twitter-swisscom/modified_sample.tsv", header = None, sep='\t',quoting=csv.QUOTE_NONE)
df.columns = schema[1]
df.head()

1,id,userId,createdAt,text,longitude,latitude,placeId,inReplyTo,source,truncated,placeLatitude,placeLongitude,sourceName,sourceUrl,userName,screenName,followersCount,friendsCount,statusesCount,userLocation
0,776522983837954049,735449229028675584,2016-09-15 20:48:01,se lo dici tu... https://t.co/x7Qm1VHBKL,\N,\N,51c0e6b24c64e54e,\N,1,,46.0027,8.96044,Twitter for iPhone,http://twitter.com/#!/download/iphone,plvtone filiae.,hazel_chb,146,110,28621,Earleen.
1,776523000636203010,2741685639,2016-09-15 20:48:05,https://t.co/noYrTnqmg9,\N,\N,4e7c21fd2af027c6,\N,1,,46.8131,8.22414,Twitter for iPhone,http://twitter.com/#!/download/iphone,samara,letisieg,755,2037,3771,Suisse
2,776523045200691200,435239151,2016-09-15 20:48:15,@BesacTof @Leonid_CCCP Tu dois t'engager en si...,\N,\N,12eb9b254faf37a3,776522113859608576,5,,47.201,5.94082,Twitter for Android,http://twitter.com/download/android,lebrübrü❤,lebrubru,811,595,30191,Fontain
3,776523058404290560,503244217,2016-09-15 20:48:18,@Mno0or_Abyat اشوف مظاهرات على قانون العمل الج...,\N,\N,30bcd7f767b4041e,776521597515624448,1,,45.8011,6.16552,Twitter for iPhone,http://twitter.com/#!/download/iphone,عبدالله القنيص,bingnais,28433,417,12262,Shargeyah
4,776523058504925185,452805259,2016-09-15 20:48:18,Greek night #geneve (@ Emilios in Genève) http...,6.14414,46.1966,c3a6437e1b1a726d,\N,3,,46.2048,6.14319,foursquare,http://foursquare.com,Alkan Şenli,Alkanoli,204,172,3390,İstanbul/Burgazada


# Data Stats

## Case Study

#### Information Independent of tweets

|attribute| value| |
|------|------| --------- |
|id | 776523058504925185 | unique index in url|
|url |https://twitter.com/alkanoli/status/776523058504925185||
|`userLocation`| İstanbul/Burgazada | where the user comes from or resides|

#### Information related to tweets

|attribute| value ||
|-----|-----|----|
|`text`|'Greek night #geneve (@ Emilios in Genève) https://t.co/sEplW0Mcyz'| content|
|`createdAt`|`2016-09-15 20:48:18`| |
|`placeId`| `c3a6437e1b1a726d`| place id of a tweet (by inspecting the tweet's webpage)|
|`longitude`| 6.14414||
|`latitude`|46.1966||
|`placeLatitude`|46.2048||
|`placeLongitude`|6.14319||

## Basic Information

### How to Use Geo Information

1. (`latitude`, `longitude`) and (`placeLatitude`, `placeLongitude`) are nearly the same, but (`placeLatitude`, `placeLongitude`) is filled in for more rows.
2. `placeId`: identifier of the tweet's place, which groups nearby users. We may need **reverse geocoding** to get the name of the exact place.

In [233]:
df[(df.longitude != '\\N')][['longitude', 'latitude', 'placeLongitude', 'placeLatitude', 'placeId', 'userLocation']].dropna().head(10)

1,longitude,latitude,placeLongitude,placeLatitude,placeId,userLocation
4,6.14414,46.1966,6.14319,46.2048,c3a6437e1b1a726d,İstanbul/Burgazada
26,8.95092,46.006,8.96044,46.0027,51c0e6b24c64e54e,Lahore
31,6.81899,47.1003,6.82645,47.1136,c2bf4772ec58dc04,"La Chaux-de-Fonds, Neuchâtel"
41,8.94542,45.9915,8.95449,45.9884,6b2eafacf6c765ba,Lahore
54,5.99278,47.2763,5.96952,47.2635,1cf182db3b9e8fc5,Besançon
88,9.08087,45.8132,9.08382,45.8002,cd661902b07eb657,\N
93,6.10767,46.2308,6.079,46.2322,068c70be7b3a4cc2,"Utrecht, NL"
128,6.15127,46.2101,6.14319,46.2048,c3a6437e1b1a726d,"iPhone: 47.632786,-122.026932"
139,7.69677,46.4999,7.71341,46.4657,1231efcfff3a1c64,\N
140,8.30979,47.0541,8.31721,47.0408,8b3e53628223753a,\N


The `placeId` may be a hash of the place's longitude and latitude. It gives us a more compact way to group nearby users.
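
One way to check whether a `placeId` always maps to a single coordinate pair is sketched below. The toy frame reuses three rows from the tables above; the names `toy`, `pairs_per_place` and `inconsistent` are chosen for illustration only.

```python
import pandas as pd

# Three rows copied from the outputs above (two of them share a placeId).
toy = pd.DataFrame({
    "placeId": ["c3a6437e1b1a726d", "c3a6437e1b1a726d", "51c0e6b24c64e54e"],
    "placeLatitude": [46.2048, 46.2048, 46.0027],
    "placeLongitude": [6.14319, 6.14319, 8.96044],
})

# Count the distinct (latitude, longitude) pairs observed for each placeId.
pairs_per_place = (
    toy.drop_duplicates(["placeId", "placeLatitude", "placeLongitude"])
       .groupby("placeId")
       .size()
)

# Any placeId listed here would have more than one coordinate pair.
inconsistent = pairs_per_place[pairs_per_place > 1]
print(inconsistent.empty)
```

Running the same check on the full `df` would show whether grouping by `placeId` is safe before collapsing each group to a single row.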



#### Reverse geocoding with the geopy library

Because `geopy` can query many geocoding sources, the format of the result can differ. We will use the most common fields, such as

1. `country_code`
2. `state`

The following are two examples: the first one is a place in Berlin and the second one is EPFL.

Comparing the two examples, we find:

1. EPFL has `county`, `town`, but no `city` entry.
2. `Potsdamer Platz` has `city` but no `county`, `town`.

Both examples have a `state` entry, which seems to be the better choice. (For countries like Liechtenstein, which have no `state`, we will ...)

In [342]:
rev_geo(lat=52.509669, lon=13.376294)

{'address': {'attraction': 'Potsdamer Platz',
  'city': 'Berlin',
  'city_district': 'Mitte',
  'country': 'Deutschland',
  'country_code': 'de',
  'postcode': '10117',
  'road': 'Potsdamer Platz',
  'state': 'Berlin'},
 'boundingbox': ['52.5093982', '52.5095982', '13.3764983', '13.3766983'],
 'display_name': 'Potsdamer Platz, Mitte, Berlin, 10117, Deutschland',
 'lat': '52.5094982',
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright',
 'lon': '13.3765983',
 'osm_id': '245981373',
 'osm_type': 'node',
 'place_id': '644947'}

In [346]:
geo(loc='EPFL')

(6.566561505148, 46.5186594)

In [349]:
rev_geo(lon=6.566561505148, lat=46.5186594)

{'address': {'country': 'Schweiz, Suisse, Svizzera, Svizra',
  'country_code': 'ch',
  'county': "District de l'Ouest lausannois",
  'pedestrian': 'Place Cosandey',
  'postcode': '1015',
  'state': 'Vaud',
  'town': 'Ecublens',
  'university': 'École Polytechnique Fédérale de Lausanne (EPFL)'},
 'boundingbox': ['46.5152316', '46.5222479', '6.5601751', '6.5721733'],
 'display_name': "École Polytechnique Fédérale de Lausanne (EPFL), Place Cosandey, Ecublens, District de l'Ouest lausannois, Vaud, 1015, Schweiz, Suisse, Svizzera, Svizra",
 'lat': '46.5186594',
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright',
 'lon': '6.566561505148',
 'osm_id': '23391253',
 'osm_type': 'way',
 'place_id': '68692007'}
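
Given the two responses above, the fields could be pulled out defensively with `dict.get`, so that a missing key yields `None` instead of raising. This is only a sketch: `first_present` and `summarize` are hypothetical helpers, the dictionaries are trimmed copies of the `address` entries shown above, and the `city`/`town`/`village` fallback order is an assumption.

```python
from typing import Optional

# Trimmed copies of the two `address` dictionaries shown above.
berlin = {"attraction": "Potsdamer Platz", "city": "Berlin",
          "country_code": "de", "state": "Berlin"}
epfl = {"county": "District de l'Ouest lausannois", "town": "Ecublens",
        "country_code": "ch", "state": "Vaud"}


def first_present(address: dict, keys: tuple) -> Optional[str]:
    """Return the value of the first key that exists in `address`, else None."""
    for key in keys:
        if key in address:
            return address[key]
    return None


def summarize(address: dict) -> dict:
    """Keep only the fields used in this notebook; missing ones become None."""
    return {
        "country_code": address.get("country_code"),
        "state": address.get("state"),
        # Assumed fallback order for the locality name.
        "locality": first_present(address, ("city", "town", "village")),
    }


berlin_summary = summarize(berlin)
epfl_summary = summarize(epfl)
print(berlin_summary)
print(epfl_summary)
```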

In [384]:
# Distance in kilometers using the Vincenty formula
distance(lat1=52.509669, lon1=13.376294,lon2=6.566561505148, lat2=46.5186594)

648.9242490859898

##### Apply to this problem

In [402]:
# Since every item with the same `placeId` has the same lat and lon, we simply take
# the `first()` row of each group
placeId_lat_lon_df = df.groupby('placeId').first()[['placeLatitude','placeLongitude']]
placeId_lat_lon_df.head()

placeId,placeLatitude,placeLongitude
000a93ad12003aaa,46.8911,7.51217
0046b64d1941431e,46.2118,6.43633
0070770855fc0793,47.4499,10.3448
007355fb62ccfa7b,45.8411,8.72494
00b3d266c3ec547d,47.2673,8.67959


In [403]:
import os
import pickle
import time

cache_path = "data/twitter-swisscom/placeId_lat_lon_df.pickle"

if os.path.exists(cache_path):
    # Reuse the cached reverse-geocoding results instead of querying the service again.
    with open(cache_path, "rb") as h:
        placeId_lat_lon_df = pickle.load(h)
else:
    def get_geo_info(x):
        return rev_geo(lon=x['placeLongitude'], lat=x['placeLatitude'])

    t = time.time()
    placeId_lat_lon_df['raw_geo'] = placeId_lat_lon_df.apply(get_geo_info, axis=1)
    print("time for all items = ", time.time() - t)
    # Cache the results so that later runs skip the slow lookups.
    with open(cache_path, "wb") as h:
        pickle.dump(placeId_lat_lon_df, h)
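
The cell above caches the reverse-geocoding results in a pickle file. The same load-or-compute pattern could be factored into a small helper, sketched here with a temporary directory and a stand-in for the slow lookup; `load_or_compute` and `slow_lookup` are hypothetical names.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the object pickled at `path`; compute and save it first if the file is missing."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def slow_lookup():
    """Stand-in for the expensive reverse-geocoding step."""
    calls.append(1)
    return {"0046b64d1941431e": "fr"}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "cache.pickle")
    first = load_or_compute(cache_path, slow_lookup)   # computes and writes the cache
    second = load_or_compute(cache_path, slow_lookup)  # reads the cache

print(len(calls))
```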

In [445]:
placeId_lat_lon_df.iloc[1, 2]

{'address': {'country': 'France',
  'country_code': 'fr',
  'county': 'Thonon-les-Bains',
  'postcode': '74420',
  'road': 'Route de la Gruaz',
  'state': 'Auvergne-Rhône-Alpes',
  'suburb': 'Les Andrys',
  'village': 'Villard'},
 'boundingbox': ['46.2116281', '46.2151875', '6.434952', '6.4378786'],
 'display_name': 'Route de la Gruaz, Les Andrys, Villard, Thonon-les-Bains, Haute-Savoie, Auvergne-Rhône-Alpes, 74420, France',
 'lat': '46.2129927',
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright',
 'lon': '6.4369696',
 'osm_id': '143600488',
 'osm_type': 'way',
 'place_id': '96129716'}

In [428]:
placeId_lat_lon_df['country_code'] = placeId_lat_lon_df['raw_geo'].apply(lambda x: x['address']['country_code'])

In [447]:
# Switzerland, France, Italy, Germany, Austria, Liechtenstein, Turkey
placeId_lat_lon_df['country_code'].value_counts()

ch    565
fr    174
it    134
de     53
at     17
li      4
tr      1
Name: country_code, dtype: int64

In [442]:
placeId_lat_lon_df['state']=placeId_lat_lon_df['raw_geo'].apply(lambda x: x['address'].get('state'))

In [448]:
placeId_lat_lon_df['state'].unique()

array(['Bern - Berne', 'Auvergne-Rhône-Alpes', 'Bayern', 'LOM', 'Zürich',
       'Graubünden - Grigioni - Grischun', 'Grand-Est', 'Tirol',
       'Schaffhausen', 'Genève', 'Bourgogne-Franche-Comté', 'Vorarlberg',
       'Ticino', 'Valais - Wallis', 'Schwyz', 'Thurgau', 'Solothurn',
       'Neuchâtel', 'Jura', 'Aargau', 'Basel-Landschaft', 'Vaud',
       'Sankt Gallen', 'Basel-Stadt', 'Baden-Württemberg', 'Luzern',
       'Fribourg - Freiburg', 'Obwalden', 'Nidwalden', 'PIE', 'Zug',
       'Glarus', 'VDA', 'Tekirdağ', None, 'Appenzell Innerrhoden', 'TAA',
       'Appenzell Ausserrhoden', 'Uri', 'Nouvelle-Aquitaine'], dtype=object)

# Processing

## Problems in the data:

1. Line breaks escaped with a trailing `\` inside the tweet text. Solved in the preprocessing step, which writes a cleaned copy of the file.
2. `\N` in the tweet timestamp.

### Deal with missing values
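
In this dataset `\N` appears where a value is missing (see the `longitude` and `latitude` columns in the sample above). One possible way to handle it, sketched here on a hypothetical two-row sample with made-up column choices, is to let `pd.read_csv` treat `\N` as missing through `na_values` and to coerce unparsable timestamps to `NaT`.

```python
import csv
import io

import pandas as pd

# Hypothetical tab-separated sample; "\N" marks a missing value.
raw = "1\t2016-09-15 20:48:01\t\\N\t46.0027\n2\t\\N\t6.14414\t46.2048\n"

df_demo = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["id", "createdAt", "longitude", "placeLatitude"],
    quoting=csv.QUOTE_NONE,
    na_values=["\\N"],  # read the literal two-character string \N as NaN
)

# Missing or malformed timestamps become NaT instead of raising an error.
df_demo["createdAt"] = pd.to_datetime(df_demo["createdAt"], errors="coerce")
print(df_demo.isna().sum())
```

With this approach the coordinate columns are parsed as floats directly, so comparisons such as `df.longitude != '\\N'` could be replaced by `df.longitude.notna()`.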

## Time

We would like to analyze people's location and behavior through the periodic behavior of a person or group:

1. Hour of the tweet: morning, afternoon, evening. (Maybe we can classify them.)
2. Day of the week: 1~7 (1 means Monday), e.g. work at some place and visit .. on Sunday. Note that `weekday()`, used below, counts from 0 (Monday) to 6, while `isoweekday()` counts from 1 to 7.
3. Month: do people cross the border more frequently in summer/winter/...?
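
The hour-of-day classification mentioned in the first item could be sketched with `pd.cut`. The bin edges and the extra `night` label are assumptions made for illustration, not a choice taken from the data.

```python
import pandas as pd

hours = pd.Series([6, 9, 13, 20, 23], name="hour")

# Assumed left-closed bins: [0, 6) night, [6, 12) morning,
# [12, 18) afternoon, [18, 24) evening.
part_of_day = pd.cut(
    hours,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["night", "morning", "afternoon", "evening"],
)
print(part_of_day.tolist())
```

Using `right=False` keeps hour 0 inside the first bin and hour 23 inside the last one, so every valid hour receives a label.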

Note the potential problem:

1. Bias:
    1. some people may only tweet on weekends.
    2. some people only tweet when they are travelling.

##### pandas library

In [316]:
s = df.loc[0, 'createdAt']; print(s)
st = pd.Timestamp(s).to_pydatetime()

2016-09-15 20:48:01


In [317]:
st.day, st.date(), st.ctime(), st.hour, st.isocalendar(), st.isoweekday(), st.isoformat()

(15,
 datetime.date(2016, 9, 15),
 'Thu Sep 15 20:48:01 2016',
 20,
 (2016, 37, 4),
 4,
 '2016-09-15T20:48:01')

In [305]:
st.month, st.year, st.weekday(), st.toordinal(), st.timetuple()

(9,
 2016,
 3,
 736222,
 time.struct_time(tm_year=2016, tm_mon=9, tm_mday=15, tm_hour=20, tm_min=48, tm_sec=1, tm_wday=3, tm_yday=259, tm_isdst=-1))

##### For this problem

In [261]:
df['time'] = df['createdAt'].apply(lambda x: pd.Timestamp(x).to_pydatetime())

In [266]:
df['weekday'] = df['time'].apply(lambda x: x.weekday())

In [311]:
df['day'] = df['time'].apply(lambda x: x.day)
df['hour'] = df['time'].apply(lambda x: x.hour)
df['month'] = df['time'].apply(lambda x: x.month)

In [313]:
df[['time', 'month', 'weekday', 'day', 'hour']].head(5)

1,time,month,weekday,day,hour
0,2016-09-15 20:48:01,9,3,15,20
1,2016-09-15 20:48:05,9,3,15,20
2,2016-09-15 20:48:15,9,3,15,20
3,2016-09-15 20:48:18,9,3,15,20
4,2016-09-15 20:48:18,9,3,15,20
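
The same columns can also be derived without `apply`, using `pd.to_datetime` and the vectorized `.dt` accessor. The sketch below runs on a two-row stand-in frame (`demo`, a hypothetical name) built from the first two timestamps shown above.

```python
import pandas as pd

# Stand-in for df, holding the first two timestamps from the table above.
demo = pd.DataFrame({"createdAt": ["2016-09-15 20:48:01", "2016-09-15 20:48:05"]})

# Parse the whole column at once, then read the components through .dt
demo["time"] = pd.to_datetime(demo["createdAt"])
demo["month"] = demo["time"].dt.month
demo["weekday"] = demo["time"].dt.weekday  # Monday == 0, like datetime.weekday()
demo["day"] = demo["time"].dt.day
demo["hour"] = demo["time"].dt.hour

print(demo[["time", "month", "weekday", "day", "hour"]])
```

On large frames this is usually faster than calling a Python lambda per row, and the `time` column keeps the `datetime64` dtype that the `.dt` accessor requires.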
