## Preprocessing of the data
---
In this notebook, we run a few checks to clean our data and remove invalid points.

In [3]:
import pandas as pd
from zipfile import ZipFile
import datetime

In [4]:
ID_df = pd.read_csv("ID_2013_merge_data.csv")

In [5]:
ID_df

Unnamed: 0,zone_id,person_id,time_checkin,year,Lat,lon,building,country
0,4b05880ff964a5201caf22e3,1170327,2013-01-02 06:04:37+00:00,2013,1.145077,104.010916,Other Great Outdoors,ID
1,4b05880ff964a5201caf22e3,1170327,2013-01-11 11:31:56+00:00,2013,1.145077,104.010916,Other Great Outdoors,ID
2,4b05880ff964a5201caf22e3,1170327,2013-01-15 10:18:07+00:00,2013,1.145077,104.010916,Other Great Outdoors,ID
3,4b05880ff964a5201caf22e3,408433,2013-01-30 11:12:24+00:00,2013,1.145077,104.010916,Other Great Outdoors,ID
4,4b05880ff964a5201caf22e3,408433,2013-01-30 11:25:09+00:00,2013,1.145077,104.010916,Other Great Outdoors,ID
...,...,...,...,...,...,...,...,...
771866,52b6c7f611d255fa1a7392a3,707704,2013-12-22 11:08:20+00:00,2013,0.475714,121.241539,Rest Area,ID
771867,52b6ca4e498e15b460e38ab2,683822,2013-12-22 11:17:49+00:00,2013,-6.227489,106.825379,Australian Restaurant,ID
771868,52b6cada11d248b7b56d2374,1264982,2013-12-22 11:20:12+00:00,2013,-8.569188,114.090425,Home (private),ID
771869,52b6cd3e498ea6b8c72aa057,1499509,2013-12-22 11:46:16+00:00,2013,1.459921,124.823780,Home (private),ID
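
The `Unnamed: 0` column in the output above usually appears when a CSV was written together with its index and then read back without declaring it. As a minimal sketch, assuming that is what happened here, the file could be read with `index_col=0`, and `time_checkin` could be parsed as a timestamp at the same time. The two sample rows are copied from the output above; `csv_text` and `sample_df` are illustrative names, not part of the notebook.

```python
import io

import pandas as pd

# Two rows copied from the displayed output; the empty first header
# is an assumption about how the file was saved.
csv_text = """,zone_id,person_id,time_checkin,year,Lat,lon,building,country
0,4b05880ff964a5201caf22e3,1170327,2013-01-02 06:04:37+00:00,2013,1.145077,104.010916,Other Great Outdoors,ID
1,4b05880ff964a5201caf22e3,1170327,2013-01-11 11:31:56+00:00,2013,1.145077,104.010916,Other Great Outdoors,ID
"""

# index_col=0 uses the first column as the index instead of keeping it
# as an "Unnamed: 0" data column; parse_dates converts the timestamps.
sample_df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    parse_dates=["time_checkin"],
)

print(sample_df.columns.tolist())
print(sample_df["time_checkin"].dtype)
```

The same two arguments could be passed to the `pd.read_csv` call on the real file if the assumption holds.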


### Preprocessing


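In our tests, we check that every point has a latitude strictly between -90 and 90 and a longitude strictly between -180 and 180. We also drop rows with missing values and rows whose latitude or longitude is exactly 0, which we treat as a missing coordinate. After running the tests, we find that all our points pass these checks, so we don't have to delete anything.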

In [6]:
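# Helpers
def dropLocationId(df):
    # Remove the zone identifier column
    return df.drop(['zone_id'], axis=1)

def dropCheckinTime(df):
    # Remove the check-in timestamp column
    return df.drop(['time_checkin'], axis=1)

def dropMissingLatOrLng(df):
    # Treat a latitude or longitude of exactly 0 as a missing coordinate
    return df[~((df['Lat'] == 0) | (df['lon'] == 0))]

def dropInvalidLat(df):
    # Keep latitudes strictly between -90 and 90
    return df[(df['Lat'] < 90) & (df['Lat'] > -90)]

def dropInvalidLng(df):
    # Keep longitudes strictly between -180 and 180
    return df[(df['lon'] < 180) & (df['lon'] > -180)]

def preprocess(df):
    df_ = df.copy()
    # Drop rows with any NaN/null/NaT value
    df_ = df_.dropna(how="any")

    # Drop rows whose coordinates are missing (encoded as 0)
    df_ = dropMissingLatOrLng(df_)

    # Drop rows whose coordinates fall outside the valid ranges
    df_ = dropInvalidLat(df_)
    df_ = dropInvalidLng(df_)
    return df_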
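
Because `preprocess` only returns the filtered frame, it does not show which check removed which rows. The sketch below counts the rows that fail each check separately, using the same conditions as the helpers above. `count_invalid_rows` is a hypothetical helper and `sample` is a made-up frame, not data from the notebook.

```python
import pandas as pd

def count_invalid_rows(df):
    """Count the rows failing each check (hypothetical diagnostic helper)."""
    return {
        # Rows with at least one NaN/null/NaT value
        "missing_values": int(df.isna().any(axis=1).sum()),
        # Rows where a coordinate is exactly 0 (treated as missing)
        "zero_lat_or_lon": int(((df["Lat"] == 0) | (df["lon"] == 0)).sum()),
        # Rows the strict range filters would drop; NaN is not counted here
        "lat_out_of_range": int(((df["Lat"] <= -90) | (df["Lat"] >= 90)).sum()),
        "lon_out_of_range": int(((df["lon"] <= -180) | (df["lon"] >= 180)).sum()),
    }

# Made-up sample: one clean row, one zero latitude,
# one latitude out of range, one missing latitude.
sample = pd.DataFrame({
    "Lat": [1.145077, 0.0, 95.0, float("nan")],
    "lon": [104.010916, 121.2, 106.8, 114.0],
})

report = count_invalid_rows(sample)
print(report)
```

Running such a helper on `ID_df` before `preprocess` would show whether any check is ever triggered.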
    


In [7]:
totalChekinsBefore = ID_df.shape[0]

In [8]:
clean_df = preprocess(ID_df)

In [9]:

# Total number of check-ins
print(f'Before the preprocessing, the dataset had : {totalChekinsBefore} checkins')
print(f'After the preprocessing, the dataset had : {clean_df.shape[0]} checkins')


Before the preprocessing, the dataset had : 771871 checkins
After the preprocessing, the dataset had : 771871 checkins
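
The two counts are identical, so the preprocessing removed no rows. As a small sketch, the comparison can be made explicit by computing the number and share of removed rows. `removed_rows` is a hypothetical helper; the counts are the ones printed above.

```python
def removed_rows(n_before, n_after):
    """Return how many rows were removed and their share of the original."""
    if n_after > n_before:
        raise ValueError("preprocessing cannot add rows")
    removed = n_before - n_after
    share = removed / n_before if n_before else 0.0
    return removed, share

# Counts taken from the printed output above
removed, share = removed_rows(771871, 771871)
print(f"{removed} rows removed ({share:.2%})")
```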
