# Feature engineering

In this notebook, we will try to create features from checkin dataset.
We based our work on the paper: `Fine-Scale Prediction of People’s Home Location Using Social Media Footprints` by H. Kavak et al. The authors start by clustring the checkins of each user using Density-based spatial clustering of applications with noise (DBSCAN). The goal of this step is to group checkins into dense and small clusters. One of these clusters contain the home location. To predict the latter, they generated the following mobility features per cluster:

- Check-in Ratio (CR)
- Check-in Ration during Midnight (MR)
- Check-in Ratio of Last Destination of a Day (EDR)
- Check-in Ratio of Last Destination of a Day with Inactive Midnight (EIDR)
- PageRank (PR)
- Reverse PageRank (RPR)

We will explain each feature as we progress in this work.


These feature will be used to classify each cluster and we average the latitude and longitude to obtain users' home location

# 1. Importing libraries

In [1]:
import pandas as pd
import numpy as np
# functions to compute the distance between two geocoordinates
from haversine import haversine_vector, Unit
# DBSCAN
from sklearn.cluster import DBSCAN
# To do operations on datetime
from datetime import timedelta
# PageRank and ReversePageRank
import networkx as nx

# 2. Helper functions
Here we will redefine functions we used during previous milestones to clean our data

## 2.1 Latitude correction


In [2]:
def correct_latitude(lat):
    """
    This function corrects for out of range latitude.
    
    Input: 
    -- lat: latitude coordinates in °
    Output: 
    -- lat: latitude coordinates put between -90 and 90°
    """
    while lat>90 or lat<-90:
        if lat>90:
            lat = -(lat-180)
        elif lat<-90:
            lat = -(lat+180)
    return lat

In [3]:
def correct_longitude(long):
    """
    This function corrects for out of range longitude.
    
    Input: 
    -- long: longitude coordiantes in °
    Output: 
    -- long: longitude coordinates put between -180 and 180°
    """
    while long>180 or long<-180:
        if long>180:
            long = long - 360
        elif long<-180:
            long = long +360
    return long

## 2.2. Compute distance between two geocoordinates points

In [4]:
def compute_distance(df,columns):
    '''
    This function computes the distance between two geographic coordinates for a given dataframe.
    
    Input: 
        - df: Dataframe containing 4 columns latitude1, longitude1, latitude2 and longitude2
        - columns: list of columns [latitude1, longitude1, latitude2 and longitude2]
        
    Output: 
        - numpy array containing the distance between geographic coordinates of each row
    '''
    points1 = list(zip(df[columns[0]],df[columns[1]]))
    points2 = list(zip(df[columns[2]],df[columns[3]]))
    # Use harvesine_vector to compute the distance between points
    return np.round(haversine_vector(points1,points2,Unit.KILOMETERS),decimals=3)

## 2.3. Selecting relevant homes

As we want to use the training dataset to predict home locations, it's important to keep only users that doesn't change home. To measure that, we compute the variance their home's location Using the function defined below.

In [5]:
def select_relevant_homes(df_homes):
    '''
    This function selects relevant homes. We consider a home location as relevant if it's latitude and
    longitude doesn't "vary much". To measure this variation, we simply compute the mean and the std of
    the latitude and longitude of homes for every user. Then we construct 4 points as follow:
        - by adding and substracting the standard deviation of the latitude and longitude from their
        respective mean
        - Measure the diagonal in KM
        - If the diagonal is less than 100m we can assume with confidence that the mean is indeed the
        home location
    
    Input:
        - df_homes: A dataframe containing all checkins labled as Home
    Output:
        - df_homes: Home location for each user
    '''
    
    # Grouping df_homes according to the user id and compute std and mean for lat and lon
    df_homes = df_homes.groupby('User_ID').agg({'lat':('std','mean'),'lon':('std','mean')})
    
    # Filling nan values with 0 (std return 0 if there is only one sample)
    df_homes.fillna(0,inplace = True)
    
    # Construct the diagonal points
    df_tmp = pd.DataFrame()
    df_tmp['lat1'] = df_homes.lat['mean']-df_homes.lat['std']
    df_tmp['lat2'] = df_homes.lat['mean']+df_homes.lat['std']
    df_tmp['lon1'] = df_homes.lon['mean']-df_homes.lon['std']
    df_tmp['lon2'] = df_homes.lon['mean']+df_homes.lon['std']
    
    # Compute diagonal length
    df_tmp['home_radius'] = compute_distance(df_tmp,['lat1','lon1','lat2','lon2'])
    
    # Filter home and keep relevant home (estimated distance between homes checkins < 100m )
    df_homes = df_homes[df_tmp['home_radius']<0.1][[('lat','mean'),('lon','mean')]].copy()
    
    # Flatten df_homes columns
    df_homes.columns = df_homes.columns.get_level_values(0)
    
    return df_homes

# 3. Constructing Checkins dataframe
## 3.1. Training dataset
In this section, we define a function that construct a clean checkin dataframe from Foursquare data

In [6]:
def construct_df_checkins(path,sample_frac = 1):
    '''
    This function takes the path of the raw data, import it and construct a checkin dataframe where
    all users have at least 5 checkins and 1 home location
    
    Input:
        - Path: the Path of the file containing the data
        - sample_frac: sample fraction from the raw dataframe
    Output:
        - df_checkins: Checkin dataframe where all users have at least 5 checkins and 1 home location
    '''
    
    # Read data from the file and drop unnecessary columns
    df_tmp = pd.read_csv(path).sample(frac=sample_frac).drop(columns=['Venue_ID','day'])
    
    # Latitude and Longitude correction
    df_tmp.lat = df_tmp.lat.apply(correct_latitude)
    df_tmp.lon = df_tmp.lon.apply(correct_longitude)
    
    # Construct df_homes and select only relevant homes
    df_homes = df_tmp.loc[df_tmp.place.str.lower().str.contains('home' and 'private')].copy()
    df_homes = select_relevant_homes(df_homes)
    
    # Select users with relevant homes from the raw data
    df_tmp = df_tmp.loc[df_tmp['User_ID'].isin(df_homes.index)].copy()
    
    # Count the number of checkins for each user
    df_tmp_grouped = df_tmp.groupby('User_ID').agg({'User_ID':'count'})
    
    # Define a set containing users with at least 5 checkins
    users = set(df_tmp_grouped[df_tmp_grouped['User_ID']>5].index)
    
    # Construct df_checkins
    df_checkins = df_tmp.loc[df_tmp['User_ID'].isin(users)].copy()
    
    # Convert 'local time' attribute to a pandas datetime
    df_checkins['local_time'] = pd.to_datetime(df_checkins['local_time'])
    
    # Label Homes
    df_checkins['Is_home'] = df_checkins.place.str.lower().str.contains('home' and 'private')
    
    # Drop unnecessary column
    df_checkins.drop(columns = ['place'],inplace = True)
    
    return df_checkins.sort_values(by=['User_ID','local_time']).reset_index(drop=True)

## 3.2. Prediction dataset

In [7]:
def construct_prediction_dataset(path, sample_frac = 1):
    '''
    This function takes the path of the raw data, import it and construct a checkin dataframe where
    all users have at least 5 checkins. It's used to prepare the dataset we will use to predict home location
    of the users.
    
    Input:
        - Path: the Path of the file containing the data
        - sample_frac: sample fraction from the raw dataframe
    Output:
        - df_checkins: Checkin dataframe where all users have at least 5 checkins
    '''
    
    # Import dataset
    df_checkins = pd.read_csv(path,sep='\t',header=None,names=['User_ID','local_time','lat','lon','location_id'],
                              parse_dates = ['local_time'])
    # Drop NaN values
    df_checkins.dropna(inplace = True)
    
    # Drop unnecessary column
    df_checkins.drop(columns = ['location_id'],inplace=True)
    
    # Correct latitude and longitude
    df_checkins.lat = df_checkins.lat.apply(correct_latitude)
    df_checkins.lon = df_checkins.lon.apply(correct_longitude)
    
    # Drop checkins with both latitude and longitude set at zero
    # Note: this is specific to gowalla and brightkite dataset
    df_checkins = df_checkins[(df_checkins['lat'] !=0) & (df_checkins['lon'] != 0)]
    
    # Grouping by users
    users = df_checkins.groupby(['User_ID']).agg({'User_ID':'count'})
    
    # Selecting users with more than 5 checkins
    users = set(users.loc[users['User_ID']>5].index)
    
    df_checkins = df_checkins.loc[df_checkins['User_ID'].isin(users)]
    
    return df_checkins

# 4. Building features
Some descriptions are taken from the online supplemental provided by the authors.

Source: https://github.com/hamdikavak/home-location-prediction/blob/master/supplemental_revised.pdf
## 4.1. Compute cluster label

Instead of discretizing the world, the authors use an unsupervised to create small cluster. The function below take the checkin of every user, cluster them using DBSCAN and return label of the cluster for each checkin.

In [8]:
def build_clusters_labels(df_user,clustering_method):
    '''
    This function clusters the checkins for a single user.
    
    Input:
        - df_user: a dataframe containing the latitude and longitude for each checkin
        - clustering_method: DBSCAN, we define this parameter to avoid unnecessary initialisations
        when calling this funcrion
    Output:
        - clusters_labels: cluster label assigned to each checkin
    '''
    cluster_lables = clustering_method.fit(np.deg2rad(df_user[['lat','lon']])).labels_
    
    return cluster_lables

## 4.2. Cleaning users

The paper suggest removing multiple successive checkin within 60 minutes and 100m to avoid biasing the dataset.
We defined the function below for this purpose

In [9]:
def cleaning_user(df_user):
    '''
    To avoid biasing the dataset with multiple checkins in a small period of time or small distance traveled, 
    we drop checkins that are consecutively shared within 60 minutes and 100m.
    
    Input:
        - df_user: datafame containing checkin time and location sorted by time
    Output:
        - df_user: cleaned df_user
    
    '''
    
    # Constructing a dataframe containing the actual checkin and the next checkin
    df_tmp = df_user.reset_index().merge(df_user.iloc[1:].reset_index(drop=True),right_index=True,
                                         left_index=True,how='inner')
    
    # Compute the time between two consecutive checkins
    df_tmp['dt'] = df_tmp['local_time_y'] - df_tmp['local_time_x']
    
    # Compute the distance between two consecutive checkins
    columns = ['lat_x','lon_x','lat_y','lon_y']
    df_tmp['distance'] = compute_distance(df_tmp,columns)
    
    # Construct a mask to keep consecutive checkins if they are distant by 60 minutes or 100m
    # We also ignore checkins with 0 dt
    mask = (df_tmp['dt']!=timedelta(0))&((df_tmp['dt']>timedelta(hours=1))|(df_tmp['distance']>0.1))
    
    return df_user.reset_index().iloc[df_tmp[mask].index]
    

# 4.3. Checkin During Midnight

Midnight check-in ratio looks at all midnight check-ins (12:00 AM - 07:00 AM) of a user and calculate the ratio of midnight check-ins per visited location.

In [10]:
def compute_checkin_during_midnight(df_user):
    '''
    This function lables the checkins after midnight.
    
    Input:
        - df_user: dataframe containing and sorted by checkin time
    Output:
        - Labeles for each checkin. If it is happening after midnight and befor 7am it's set to True
        and False otherwise
    '''
    df_tmp = (df_user['local_time'].dt.hour>=0) & (df_user['local_time'].dt.hour<7)
    
    return df_tmp

## 4.4. Last Checkin
This feature captures the last destination of the day which is found to be important to predict home location.
We identify all last check-ins of days and calculate the ratio per location using tweets shared between 05:00 PM
in the evening until 03:00 AM in the morning.

In [11]:
def compute_last_checkin(df_user):
    '''
    This function labels the last checkin before 3 am.
    
    Input:
        df_user: Dataframe containing and sorted by checkin time
    Output:
        - Labeles for each checkin. If it is the last checkin of the day, the label is set to True
        and False otherwise
    '''
    # We subsctract 3 hours  so we can detect the last checkin whenever the date changes
    tmp_date = (df_user['local_time']-timedelta(hours=3)).dt.date.values
    tmp_hour = (df_user['local_time']-timedelta(hours=3)).dt.hour.values
    last_checkin = []
    
    # Labeling last checkins
    for i in range(len(tmp_date)-1):
        if (tmp_hour[i]>=14) and (tmp_hour[i]<=23) and (tmp_date[i]<tmp_date[i+1]):
            last_checkin.append(True)
        else:
            last_checkin.append(False)
            
    # The last checkin is always True by definition 
    if (tmp_hour[-1]>=14) and (tmp_hour[-1]<=23):
        last_checkin.append(True)
    else:
        last_checkin.append(False)

    return last_checkin

## 4.5. Last checkin with inactive midnight
This feature is very similar to the EDR feature but ignores days when a user shares tweets during midnight.

In [12]:
def compute_last_checkin_with_inactive_midnight(df_user):
    '''
    This function labels the last checkin with inactive midnight (no checkins between 0am and 7am).
    
    Input:
        df_user: Dataframe containing and sorted by checkin time
    Output:
        - Labeles for each checkin. If it is the last checkin of the day and the user didn't checkin
        between 0am and 7am the label is True and False otherwise
    '''
    
    # Substract 7 hours to detect the change of the day whenever the date changes
    tmp_date = (df_user['local_time']-timedelta(hours=7)).dt.date.values
    tmp_hour = (df_user['local_time']-timedelta(hours=3)).dt.hour.values
    
    last_checkin_with_inactive_midnight = []
    
    # Compute the last checkin with inactive midnight
    for i in range(len(tmp_date)-1):
        # If the date changes and the hour is <= 23 the last checkin is happening before midnight
        if (tmp_hour[i]>=14) and (tmp_hour[i]<=23) and (tmp_date[i]<tmp_date[i+1]):
            last_checkin_with_inactive_midnight.append(True)
        else:
            last_checkin_with_inactive_midnight.append(False)
    
        # As the last checkin is by definition the last checkin of the day, we simply need to see if it is 
        # happening before midnight
    
    if (tmp_hour[-1]>=14) and (tmp_hour[-1]<=23):
        last_checkin_with_inactive_midnight.append(True)
    else:
        last_checkin_with_inactive_midnight.append(False)
        
    return last_checkin_with_inactive_midnight

## 4.6. PageRank and ReversePageRank
The authors used PageRank and ReversePageRank to measure the importance of nodes based on the time made between each transition. Nodes here are the clusters.

We start by computing the time between two consecute checkins using the following function

In [13]:
def compute_dt_to_next_checkin(df_user):
    '''
    This function compute delta time between two consecutive checkins in Hour.
    
    Input:
        - df_user: dataframe containing checkin time
    Output:
        - Delta time between two consecutive checkins
    '''
    
    # get checkin times
    checkin_time = df_user['local_time'].values
    
    # Compute the difference (The result is in nanoseconds)
    delta_time = checkin_time[1:]-checkin_time[:-1]
    
    # Convert delta_time to hours
    delta_time = delta_time.astype(float)/(1e9*3600)
    
    # The last checkin doesn't have a next checkin so we append None
    delta_time = np.append(delta_time,None)
    return delta_time

Now, we compute PageRank and ReversePageRank with the function below.

In [15]:
def compute_PR_RPR(df_user):
    '''
    This function computes the PageRank and ReversePageRank for each cluster. The ReversePageRank is computed
    by inverting the the edges. The weight of an edge is obtained by computing the sum over clusters of the 
    inverse of the time made between two consecutive checkins
    
    Input:
        - df_user: dataframe containing clusters labels for each checkin and the time until next checkin
    Output:
        - PageRank and ReversePageRank
    '''
    # Construct a dataframe containing the acutal checkin and next checkin
    df_tmp = df_user.reset_index().iloc[:-1].merge(df_user.iloc[1:].reset_index(),
                                                    right_index=True,left_index=True)
    
    # Compute the inverse time made between two consecutive checkins
    df_tmp['inverse_time'] = 1/df_tmp['dt_to_next_checkin_x']
    
    # Construct the graph edges dataframe
    df_graph = df_tmp.groupby(['cluster_label_x','cluster_label_y'],as_index = False).agg({'inverse_time':'sum'})
    
    # Initialise PageRank Graph
    G = nx.DiGraph()
    
    # Initialise ReversePageRank Graph
    RG = nx.DiGraph()
    
    # Building edges of the two graphs
    for i, row in df_graph.iterrows():
        G.add_edge(int(row['cluster_label_x']),int(row['cluster_label_y']),weight=row['inverse_time'])
        RG.add_edge(int(row['cluster_label_y']),int(row['cluster_label_x']),weight=row['inverse_time'])
    
    PR = list(nx.pagerank(G, max_iter = 10000, weight='weight').values())
    RPR = list(nx.pagerank(RG, max_iter = 10000, weight='weight').values())
    return PR, RPR

## 4.7. Select most visited country
Since we are using DBSCAN, a cluster may contain multiple countries. The function below selects the most visited country and assign it to the cluster

In [16]:
def compute_country(column):
    '''
    This function compute the most visited country in a cluster and assign it as label to the cluster.
    
    Input: 
        - column: column containing checkin countries
    Output:
        - most visited country
    '''
    
    values,counts = np.unique(column.astype(str),return_counts=True)
    
    return values[np.argmax(counts)]

# 5. Building features
The function below group all the previously defined functions. It constructs the features of checkin dataframes.

In [17]:
def build_features(path, training = True, sample_frac = 1):
    '''
    This is the main function to extract features from checkin data.
    
    Input:
        - path: the path of the file containing raw checkins.
        - training: if this parameter is True, construct training features and testing features if it is False.
        - sample_frac: sample fraction from the raw dataframe
    Output:
        - Cleaned training data: dataframe containing features for each cluster of every user
    '''
    
    if training:
        # Initialize output dataframe for training dataset
        df_tmp = pd.DataFrame(columns = ['user','CR','MR','EDR','EIDR','PR','RPR','Is_home',
                                         'lat','lon','country'])
        # Construct checkin dataframe
        df_checkins  = construct_df_checkins(path, sample_frac=sample_frac)
        
    else:
        # Initialize output dataframe for dataset to predict
        df_tmp = pd.DataFrame(columns = ['user','CR','MR','EDR','EIDR','PR','RPR',
                                         'lat','lon'])
        # Copy dataset to predict
        df_checkins = construct_prediction_dataset(path, sample_frac=sample_frac)
    
    # Exctract users from the checkin dataframe
    users_id = np.unique(df_checkins['User_ID'])
    
    # Grouping the checkin dataframe by the 'User ID'
    grouped_checkins = df_checkins.groupby('User_ID')
    
    # Initialize Clustering method with the right parameters
    KMS_PER_RADIAN = 6371.0088
    PRECISION = 0.1
    clustering_method = DBSCAN(eps=PRECISION/KMS_PER_RADIAN,metric='haversine')
    
    # Compute features for each cluster of every user
    for user in users_id:
        
        # Get the user Dataframe
        df_user = grouped_checkins.get_group(user)
        
        # Clean df_user
        df_user = cleaning_user(df_user).copy()
        
        # Consider only dataframes containing more than 1 cleaned entries
        if len(df_user)>1:
            
            # Compute cluster_label
            df_user['cluster_label'] = build_clusters_labels(df_user,clustering_method)
            
            # Compute Checkin during midnight
            df_user['checkin_during_midnight'] = compute_checkin_during_midnight(df_user)
            
            # Compute last checkin
            df_user['last_checkin'] = compute_last_checkin(df_user)
            
            # Compute last checkin with inactive midnight
            df_user['last_checkin_with_inactive_midnight'] = compute_last_checkin_with_inactive_midnight(df_user)
            
            # Compute distance to next_checkin and classify edges
            df_user['dt_to_next_checkin'] = compute_dt_to_next_checkin(df_user)
            
            #print(df_user)
            # Construct aggregation dictionnary
            if training:
                agg_dic = {'cluster_label':'count','checkin_during_midnight':'sum',
                            'last_checkin':'sum','last_checkin_with_inactive_midnight': 'sum',
                            'Is_home': 'sum','lat':'mean','lon':'mean','country':compute_country}
            else:
                agg_dic = {'cluster_label':'count','checkin_during_midnight':'sum',
                            'last_checkin':'sum','last_checkin_with_inactive_midnight': 'sum',
                            'lat':'mean','lon':'mean'}
                
            # Construct rename dictiaonnary
            rename_dic = {'cluster_label':'CR','checkin_during_midnight':'MR','last_checkin':'EDR',
                          'last_checkin_with_inactive_midnight':'EIDR'}
            
            # Group by cluster_label
            grouped_clusters = df_user.groupby('cluster_label')
            
            # Compute the first 4 features
            features = grouped_clusters.agg(agg_dic).rename(columns = rename_dic)
            
            # Add user ID to the features
            features['user'] = user
            
            # Compute Checkin Ration (CR)
            features['CR'] = features['CR']/features['CR'].sum()
            
            # Compute Checkin Ration (MR)
            features['MR'] = features['MR']/features['MR'].sum()
            
            # Compute Checkin Ration (EDR)
            features['EDR'] = features['EDR']/features['EDR'].sum()
            
            # Compute Checkin Ration (EIDR)
            features['EIDR'] = features['EIDR']/features['EIDR'].sum()
            
            if training:
                # Label the clusters that contain the home location
                features['Is_home'] = features['Is_home'] == features['Is_home'].max()
            
            # Compute PageRank and ReversePageRank
            features['PR'],features['RPR'] = compute_PR_RPR(df_user)
            
            # Append results to the output dataframe
            df_tmp = df_tmp.append(features)
    
    return df_tmp.reset_index(drop=True)

We apply the function on the foursquare dataset which is also our training dataset

In [22]:
df_training = build_features(path = 'data/foursquare_checkin_data.csv.zip')
df_training.dropna(inplace = True)
df_training.head()

Unnamed: 0,user,CR,MR,EDR,EIDR,PR,RPR,Is_home,lat,lon,country
0,19,0.756158,0.846154,0.777027,0.772414,0.821211,0.801806,True,38.652989,-73.973113,US
1,19,0.05665,0.0,0.054054,0.055172,0.02295,0.037273,False,40.725046,-73.992639,US
2,19,0.059113,0.0,0.114865,0.117241,0.026504,0.026315,False,40.726305,-73.984104,US
3,19,0.036946,0.0,0.013514,0.013793,0.024816,0.024017,False,40.724191,-73.997563,US
4,19,0.03202,0.0,0.013514,0.013793,0.02473,0.026114,False,40.72294,-73.995724,US


Exporting results

In [23]:
df_training.to_csv('data/training_dataset.csv')

Buil

In [18]:
df_gowalla_features = build_features(path = 'data/loc-gowalla_totalCheckins.txt.gz',training=False)
df_gowalla_features.dropna(inplace = True)
df_gowalla_features.head()

Unnamed: 0,user,CR,MR,EDR,EIDR,PR,RPR,lat,lon
0,0,0.725962,0.894737,1.0,1.0,0.748403,0.800407,34.862264,-98.091842
1,0,0.110577,0.0,0.0,0.0,0.07198,0.051105,30.269103,-97.749395
2,0,0.0625,0.0,0.0,0.0,0.056549,0.04833,30.26791,-97.749312
3,0,0.033654,0.017544,0.0,0.0,0.03548,0.027163,30.24486,-97.757163
4,0,0.028846,0.0,0.0,0.0,0.038535,0.035271,30.264854,-97.743845


In [19]:
df_gowalla_features.to_csv('data/gowalla_checkin_features.csv')

In [20]:
df_brightkite_features = build_features(path = 'data/loc-brightkite_totalCheckins.txt.gz',training=False)
df_brightkite_features.dropna(inplace = True)
df_brightkite_features.head()

Unnamed: 0,user,CR,MR,EDR,EIDR,PR,RPR,lat,lon
0,0,0.285204,0.27795,1.0,1.0,0.420218,0.241797,39.69395,-98.427854
1,0,0.011437,0.021739,0.0,0.0,0.005258,0.02078,39.891383,-105.070814
2,0,0.076483,0.141304,0.0,0.0,0.015565,0.054985,39.89112,-105.068526
3,0,0.042888,0.013975,0.0,0.0,0.017623,0.023467,39.750728,-104.999579
4,0,0.012152,0.006211,0.0,0.0,0.015682,0.007052,39.75279,-104.996794


In [21]:
df_brightkite_features.to_csv('data/brightkite_checkin_features.csv')