# A hitchhiker's guide to the world
---
### Context
In the paper "Friendship and Mobility: User Movement in Location-Based Social Networks", the authors answered one important question: what influence do friends have on movements?<br>
But what if you don't have friends in the first place and just want to blend in? Imagine you have just arrived in a new country. You don't know how to behave with the locals or what to expect. How friendly are people? How often is it socially acceptable to meet? Where should you meet? Where and when should you go on holiday? <br>
In this notebook, we will attempt to answer some of these questions for multiple countries.

---
### The data
The data is the “Global-scale Check-in Dataset with User Social Networks”, which comes from two research projects and is available at https://sites.google.com/site/yangdingqi/home/foursquare-dataset#h.p_7rmPjnwFGIx9 (project 5 by Dingqi Yang). The dataset comes from Foursquare and contains 22,809,624 check-ins by 114,324 users, 607,333 friendship links and 3,820,891 POIs. It is a set of worldwide check-ins with country flags, taken over about two years, together with two snapshots of the corresponding user social network: one before (Mar. 2012) and one after (May 2014) the check-in data collection period. <br>
To work with this dataset, we broke it down into smaller datasets, one per country. To learn more about how we did this, please consult the scripts "createSubDataset.ipynb" and "preprocess.ipynb".

---
### Structure of this notebook
In every subpart of this notebook, we will try to answer a different question. You will find the index of all questions here:


---
### Note
In the following notebook, all the calculations are done on a single dataset, because all the datasets have the same structure. When it comes to the data story, we will extract information from all datasets.

In [27]:
import pandas as pd
from zipfile import ZipFile
import datetime
import math

### Step 1: How many friends do native users have?

### Subpart 1: Finding nationalities
---

We open a dataset (for example, the check-ins that took place in the US in 2013). However, we don't know whether the people in this dataset are Americans or foreigners on vacation. Since we want to study the behaviour of the locals, we need a criterion to distinguish natives from foreigners.
- Idea 1: A person is considered a native if they check in at least five times in the country.
- Idea 2: A person is considered a foreigner if they never check in at a home.
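To make the two criteria concrete, here is a minimal sketch on a small invented table. It reuses the `person_id` and `building` column names from the data shown in this notebook, while the rows, the `toy` frame and the `MIN_CHECKINS` constant are assumptions made for illustration only. The threshold follows the `>= 5` filter used in the code of this notebook.

```python
import pandas as pd

# Invented check-ins; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "person_id": [1, 1, 1, 1, 1, 2, 2, 3, 3, 3],
    "building": ["Office", "Gym", "Home (private)", "Office", "Coffee Shop",
                 "Airport", "Hotel",
                 "Home (private)", "Mall", "Airport"],
})

MIN_CHECKINS = 5  # assumed threshold for Idea 1

# Idea 1: natives are users with at least MIN_CHECKINS check-ins.
checkins_per_user = toy.groupby("person_id").size()
natives_idea_1 = set(
    checkins_per_user[checkins_per_user >= MIN_CHECKINS].index.tolist()
)

# Idea 2: natives are users with at least one check-in at a private home.
is_home = toy["building"] == "Home (private)"
natives_idea_2 = set(toy.loc[is_home, "person_id"].tolist())

print(sorted(natives_idea_1))
print(sorted(natives_idea_2))
```

On this toy table, Idea 1 keeps only user 1, while Idea 2 keeps users 1 and 3, so the two criteria do not have to agree.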

#### Idea 1

In [4]:
data_file = ZipFile('../data.zip')
df_US = pd.read_csv(data_file.open('data/1_US_2013_merge_data.csv'))

In [5]:
df_US

Unnamed: 0,zone_id,person_id,time_checkin,year,Lat,lon,building,country
0,3fd66200f964a52000e71ee3,319827,2013-01-13 00:49:25+00:00,2013,40.733596,-74.003139,Jazz Club,US
1,3fd66200f964a52000e71ee3,496140,2013-01-13 01:12:49+00:00,2013,40.733596,-74.003139,Jazz Club,US
2,3fd66200f964a52000e71ee3,288077,2013-02-16 02:29:11+00:00,2013,40.733596,-74.003139,Jazz Club,US
3,3fd66200f964a52000e71ee3,191931,2013-02-17 03:50:53+00:00,2013,40.733596,-74.003139,Jazz Club,US
4,3fd66200f964a52000e71ee3,1402791,2013-02-19 03:48:11+00:00,2013,40.733596,-74.003139,Jazz Club,US
...,...,...,...,...,...,...,...,...
1187592,52b6450811d248b7b0610626,1243714,2013-12-22 01:50:10+00:00,2013,33.873249,-118.387099,Home (private),US
1187593,52b65856498e252aade808b5,120150,2013-12-22 03:12:02+00:00,2013,39.281602,-76.593760,Lounge,US
1187594,52b66bde498e5705ff1ef091,212157,2013-12-22 04:48:06+00:00,2013,33.546776,-117.131694,Other Great Outdoors,US
1187595,52b67f4e498e403b8cccbe27,133864,2013-12-22 06:24:48+00:00,2013,42.331624,-83.066572,Dive Bar,US


In [50]:
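# Count check-ins per user (after "count", each column holds the number of non-null rows for that user)
US_participants = df_US.groupby(by = "person_id").agg("count")
# Idea 1: users with at least five check-ins are treated as natives, the others as foreigners
US_natives =  US_participants[US_participants["zone_id"] >= 5]
US_foreign = US_participants[US_participants["zone_id"] < 5]
US_natives["zone_id"]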

person_id
19         146
54          40
58          73
120         56
178         91
          ... 
2169607      5
2169991     27
2174127     92
2174989      6
2181131    116
Name: zone_id, Length: 16987, dtype: int64

In [51]:
df_US_foreign = df_US[df_US["person_id"].isin(US_foreign.index.tolist())]
df_US_natives = df_US[df_US["person_id"].isin(US_natives.index.tolist())]
print('Our initial dataset had {} users.'.format(US_participants.shape[0]))
print('Out of them, {} are considered natives.'.format(US_natives.shape[0]))
print('Out of all our users, {}% have low checkins.'.format(round((1-US_natives.shape[0]/US_participants.shape[0])*100)))

Our initial dataset had 20495 users.
Out of them, 16987 are considered natives.
Out of all our users, 17% have low checkins.


Getting rid of users can seem a bit dangerous because we lose data. However, the 3,508 users we drop (17% of all users) each have fewer than five check-ins, so the number of check-ins lost is small. Furthermore, we can treat these low-check-in users as visitors, which gives us an approximation of the number of visitors.
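The share of users classified as visitors depends directly on the chosen threshold. The sketch below, on invented data, shows one way to check that sensitivity. The helper `share_of_low_checkin_users` and the `toy` frame are hypothetical names introduced here and are not part of the original scripts.

```python
import pandas as pd


def share_of_low_checkin_users(checkins: pd.DataFrame, threshold: int) -> float:
    """Return the fraction of users with fewer than `threshold` check-ins."""
    per_user = checkins.groupby("person_id").size()
    return float((per_user < threshold).mean())


# Invented data: users 1 to 4 have 6, 5, 2 and 1 check-ins respectively.
toy = pd.DataFrame({"person_id": [1] * 6 + [2] * 5 + [3] * 2 + [4] * 1})

# Share of users that would be dropped for a few candidate thresholds.
shares = {t: share_of_low_checkin_users(toy, t) for t in (2, 5, 6)}
print(shares)
```

Running the same helper on the real per-country data for a range of thresholds would show how stable the native/visitor split is around the value of five used here.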

#### Most popular places among natives

In [49]:
df_US_natives.groupby(by = "building").agg("count").sort_values(by = "person_id", ascending = False)

building,zone_id,person_id,time_checkin,year,Lat,lon,country
Home (private),56962,56962,56962,56962,56962,56962,56962
Office,45859,45859,45859,45859,45859,45859,45859
Coffee Shop,41529,41529,41529,41529,41529,41529,41529
Airport,38303,38303,38303,38303,38303,38303,38303
Gym,33411,33411,33411,33411,33411,33411,33411
...,...,...,...,...,...,...,...
Ramen / Noodle House,3,3,3,3,3,3,3
Parade,2,2,2,2,2,2,2
Music Festival,2,2,2,2,2,2,2
Cities,1,1,1,1,1,1,1


**Interpretation**: Unsurprisingly, most people check in at home and at their workplace (the places where they spend the most time). We also notice that they check in a lot at Coffee Shops, which we can interpret as checking in during break time. Airport check-ins can be interpreted as travel for work (for example, going from one coast to the other) or for leisure.
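The ranking above is built with `groupby(...).agg("count")`, which repeats the same count in every column. As a possible alternative, the sketch below uses `value_counts` to get one count per category, and `nunique` to count distinct users per category, which helps tell a category visited by many people from one dominated by a few heavy users. The `toy` frame is invented for illustration.

```python
import pandas as pd

# Invented check-ins; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2, 3],
    "building": ["Home (private)", "Home (private)", "Office",
                 "Home (private)", "Coffee Shop", "Office"],
})

# Number of check-ins per venue category, most frequent first.
checkins_per_category = toy["building"].value_counts()

# Number of distinct users per venue category, most popular first.
users_per_category = (
    toy.groupby("building")["person_id"]
    .nunique()
    .sort_values(ascending=False)
)

print(checkins_per_category)
print(users_per_category)
```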

In [52]:
df_US_foreign.groupby(by = "building").agg("count").sort_values(by = "person_id", ascending = False)

building,zone_id,person_id,time_checkin,year,Lat,lon,country
Airport,887,887,887,887,887,887,887
Hotel,549,549,549,549,549,549,549
Mall,264,264,264,264,264,264,264
American Restaurant,262,262,262,262,262,262,262
Coffee Shop,183,183,183,183,183,183,183
...,...,...,...,...,...,...,...
Dance Studio,1,1,1,1,1,1,1
Dentist's Office,1,1,1,1,1,1,1
Design Studio,1,1,1,1,1,1,1
Swiss Restaurant,1,1,1,1,1,1,1


In [53]:
df_US_foreign[df_US_foreign["building"] == "Home (private)"]

Unnamed: 0,zone_id,person_id,time_checkin,year,Lat,lon,building,country
365691,4b47caf8f964a520f13e26e3,107355,2013-01-03 19:59:33+00:00,2013,41.531396,-93.555966,Home (private),US
365693,4b47caf8f964a520f13e26e3,107355,2013-01-08 00:09:35+00:00,2013,41.531396,-93.555966,Home (private),US
587581,4bc62588f360ef3b333adb2d,500843,2013-11-29 19:49:08+00:00,2013,28.367592,-81.510266,Home (private),US
658612,4c03d46e9a7920a1fc5cd079,156601,2013-01-27 06:16:51+00:00,2013,40.833966,-81.806039,Home (private),US
693549,4c3035fd7cc0c9b6151bed9a,252081,2013-01-28 06:22:10+00:00,2013,33.433488,-86.686288,Home (private),US
...,...,...,...,...,...,...,...,...
1184766,5236144b11d27df7715dfd83,217728,2013-09-15 20:11:19+00:00,2013,38.630324,-121.325569,Home (private),US
1184956,5239b34611d2ab4763dec28a,1267538,2013-09-18 14:07:49+00:00,2013,40.726044,-73.883070,Home (private),US
1185037,523b4634498ea178a5a57a94,278252,2013-09-19 18:46:07+00:00,2013,32.349171,-97.389477,Home (private),US
1187465,52a636eb11d21b372b71cfc1,1161008,2013-12-10 21:30:06+00:00,2013,39.474366,-84.475014,Home (private),US


**Interpretation**: We look back at the people we classified as foreigners to see where they checked in, in order to verify our assumptions. We see that they checked in mainly at Airports, Hotels, Malls, Restaurants and Coffee Shops, which are places any tourist would go to. <br>
However, we notice that some users classified as foreigners also check in at a home. This could be explained by them checking in at a friend's home, or simply visiting a zone_id with many houses.
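To quantify this observation rather than read it off the table, one could count how many distinct low-check-in users have at least one home check-in. The following is a minimal sketch on invented rows. `toy_foreign` stands in for a frame like `df_US_foreign`, and the numbers it produces say nothing about the real data.

```python
import pandas as pd

# Invented stand-in for the check-ins of users classified as foreigners.
toy_foreign = pd.DataFrame({
    "person_id": [10, 10, 11, 12, 12, 13],
    "building": ["Home (private)", "Home (private)", "Airport",
                 "Hotel", "Home (private)", "Mall"],
})

# Keep only home check-ins, then count distinct users rather than rows.
home_rows = toy_foreign[toy_foreign["building"] == "Home (private)"]
n_home_users = home_rows["person_id"].nunique()
n_users = toy_foreign["person_id"].nunique()

share_with_home = n_home_users / n_users
print(n_home_users, n_users, share_with_home)
```

Counting users instead of rows matters here because a single user can contribute several home check-ins.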

#### Idea 2

In [54]:
list_of_people_with_homes = df_US[df_US["building"] == "Home (private)"].person_id.unique().tolist()
df_US_natives_2 = df_US[df_US["person_id"].isin(list_of_people_with_homes)]

In [58]:
print('Our initial dataset had {} users.'.format(US_participants.shape[0]))
print('Out of them, {} are considered natives.'.format(len(df_US_natives_2.person_id.unique().tolist())))

Our initial dataset had 20495 users.
Out of them, 5364 are considered natives.


We notice that when we keep only users with at least one check-in at a home, the number of users falls drastically, from 20,495 to 5,364. We therefore think that checking in at a home is not a useful criterion for determining whether a user is a native.
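One way to compare the two criteria more directly is to tabulate, per user, whether each criterion holds. The sketch below does this on invented data. The `toy` frame and the column names `frequent` and `has_home` are assumptions introduced for illustration.

```python
import pandas as pd

# Invented check-ins for five users.
toy = pd.DataFrame({
    "person_id": [1] * 5 + [2] * 2 + [3] * 6 + [4] * 1 + [5] * 5,
    "building": (
        ["Home (private)"] + ["Office"] * 4               # user 1
        + ["Airport", "Hotel"]                            # user 2
        + ["Gym"] * 6                                     # user 3
        + ["Home (private)"]                              # user 4
        + ["Home (private)"] * 2 + ["Coffee Shop"] * 3    # user 5
    ),
})

# Idea 1: at least five check-ins.
n_checkins = toy.groupby("person_id").size()
# Idea 2: at least one check-in at a private home.
has_home = toy["building"].eq("Home (private)").groupby(toy["person_id"]).any()

# One row per user, one boolean column per criterion.
per_user = pd.DataFrame({"frequent": n_checkins >= 5, "has_home": has_home})

# Number of users for each combination of the two criteria.
counts = per_user.groupby(["frequent", "has_home"]).size()
print(counts)
```

A large group of users who are frequent but have no home check-in, like user 3 in this toy example, would be consistent with the home criterion being too strict.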