# Human Mobility Prediction

Welcome to my project on Human Mobility Prediction. For an introduction to the project and an overview of what this notebook and the subsequent ones cover, please click [here](https://d-lim.github.io/indigo-jekyll-theme/).

## 1. Cleaning the dataset
In this first notebook, the initial cleaning process is carried out:
* Removing duplicates and homogeneous features
* Transforming the "POLYLINE" datatype
* Extracting start and end coordinates
* Extracting datetime information

In [1]:
import pandas as pd
import numpy as np
%config InlineBackend.figure_format = 'retina'

In [2]:
taxi = pd.read_csv("../train.csv") #import train dataset

In [3]:
# 1. Write the initial_clean function and use it to remove duplicates and columns which contain homogeneous values
def initial_clean(df, keep_last = "last"):
    '''
    Basic cleaning which consists of dropping duplicates
    and columns which contain homogeneous values.
    '''
    # Each step returns a new DataFrame instead of modifying a slice in place,
    # which avoids pandas' SettingWithCopyWarning.
    df = df.drop_duplicates(subset = "TRIP_ID", keep = keep_last) #TRIP_ID should be unique, so duplicated rows are dropped
    df = df[df["MISSING_DATA"] == 0] #Remove all trips flagged as having missing coordinates
    df = df.drop(["MISSING_DATA", "DAY_TYPE"], axis = 1) #MISSING_DATA is now constant and all DAY_TYPE values are 'A'
    df = df[df["POLYLINE"] != "[]"] #Remove trips with an empty trajectory
    df = df.fillna(0) #Replace the remaining missing values with 0
    return df

taxi = initial_clean(taxi)


In [4]:
# 2. Write both the list_coordinates and convert_coordinates functions and use them to convert 'POLYLINE' from a str into a list.
import json

def list_coordinates(string):
    """
    Loads the list of coordinates from the given string and swaps longitudes & latitudes.
    Swapping is done because the convention is to have latitude values first, but
    the coordinates are given as (longitude, latitude).
    """
    return [(lat, long) for (long, lat) in json.loads(string)] # json.loads converts a str into a Python object (i.e. list/dict)

def convert_coordinates(df):
    """
    Transforms the POLYLINE values from a string into a list
    """
    df['POLYLINE'] = df['POLYLINE'].apply(list_coordinates) #converts the str in each row of 'POLYLINE' into a list
    return df

taxi = convert_coordinates(taxi)
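
To make the swap concrete, here is a small sketch of what `list_coordinates` does to a single value. The `sample` string is a made-up two-point trajectory (an assumption for illustration, not a row from the dataset), written in the same `[longitude, latitude]` order as the raw 'POLYLINE' column.

```python
import json

def list_coordinates(string):
    # Same logic as the cell above: parse the JSON string,
    # then swap each (long, lat) pair into (lat, long)
    return [(lat, long) for (long, lat) in json.loads(string)]

# Hypothetical POLYLINE value with two [longitude, latitude] points
sample = "[[-8.618643, 41.141412], [-8.618499, 41.141376]]"

converted = list_coordinates(sample)
print(converted)  # a list of (latitude, longitude) tuples
```

After conversion, each row holds a Python list of tuples, so the points can be indexed directly (for example `converted[0]` for the first point).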

In [5]:
# 3. Write the start_pos function and use it to extract the coordinates of the start point
def start_pos(df):
    """
    Returns the first latitude and longitude in the 'POLYLINE' as two separate
    columns, labelled 'START_LAT' for latitudes and 'START_LONG'
    for longitudes.
    """
    df['START_LAT'] = df['POLYLINE'].apply(lambda x: x[0][0]) #extracts the first latitude in the polyline
    df['START_LONG'] = df['POLYLINE'].apply(lambda x: x[0][1]) #extracts the first longitude in the polyline
    return df

taxi = start_pos(taxi)

In [6]:
# 4. Write the last_pos function and use it to extract the last coordinate of the 'POLYLINE'
# Note: last_pos() extracts the destination for the training set, but the last coordinate of the truncated 'POLYLINE' for the test set
def last_pos(df, lat_label, long_label):
    """
    Returns the last latitude and longitude in the 'POLYLINE' as two separate
    columns, with the labels defined in lat_label for latitudes
    and long_label for longitudes.
    """
    df[lat_label] = df['POLYLINE'].apply(lambda x: x[-1][0])
    df[long_label] = df['POLYLINE'].apply(lambda x: x[-1][1])
    return df

taxi = last_pos(taxi, "END_LAT", "END_LONG")
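
As a quick sanity check of the start and end extraction, the same indexing can be tried on a toy frame. This is only a sketch: `toy` and its coordinates are invented for illustration, and the 'POLYLINE' values are assumed to be already converted to lists of (lat, long) tuples.

```python
import pandas as pd

# Hypothetical toy frame: POLYLINE already holds lists of (lat, long) tuples
toy = pd.DataFrame({
    "POLYLINE": [
        [(41.14, -8.61), (41.15, -8.62), (41.16, -8.63)],
        [(41.20, -8.60)],  # a single-point trip: start and end coincide
    ]
})

# x[0] is the first point and x[-1] the last; index 0 is latitude, 1 is longitude
toy["START_LAT"] = toy["POLYLINE"].apply(lambda x: x[0][0])
toy["START_LONG"] = toy["POLYLINE"].apply(lambda x: x[0][1])
toy["END_LAT"] = toy["POLYLINE"].apply(lambda x: x[-1][0])
toy["END_LONG"] = toy["POLYLINE"].apply(lambda x: x[-1][1])

print(toy[["START_LAT", "START_LONG", "END_LAT", "END_LONG"]])
```

Note that `x[0]` and `x[-1]` would raise an `IndexError` on an empty list, which is why trips with an empty 'POLYLINE' need to be removed before this step.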

In [7]:
# 5. Write the date_time function and use it to extract the week, day and quarter hour of the trips
def date_time(df):
    """
    Converts Unix time in seconds to datetime and splits it
    into 'WEEK', 'DAY' and 'Q_HOUR'
    """
    df['date_time'] = pd.to_datetime(df['TIMESTAMP'], unit='s')
    df["WEEK"] = df["date_time"].map(lambda x : x.weekofyear) # ISO week of the year (1 to 52, or 53 in some years)
    df["DAY"] = df["date_time"].map(lambda x : x.dayofweek) # day of the week, Mon = 0 and Sun = 6
    df["Q_HOUR"] = df["date_time"].map(lambda x : x.hour * 4 + x.minute // 15) # quarter hour of the day (0 to 95); // keeps it an integer in Python 3
    return df

taxi = date_time(taxi)
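
Here is a worked example of the three time features for a single timestamp. The value of `ts` is made up for illustration, and the quarter-hour slot is assumed to be an integer index from 0 to 95, which is why the sketch uses floor division (`//`).

```python
import pandas as pd

# Hypothetical Unix timestamp in seconds: 2013-07-01 08:47:00 UTC
ts = pd.to_datetime(1372668420, unit="s")

week = ts.weekofyear                      # week of the year
day = ts.dayofweek                        # Mon = 0 ... Sun = 6
q_hour = ts.hour * 4 + ts.minute // 15    # 8 * 4 + 47 // 15 = 32 + 3

print(ts, week, day, q_hour)
```

With true division (`/`) the last expression would give a fractional value in Python 3 rather than a slot index.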

Besides the training set, the same cleaning is applied to the test set.

In [8]:
test = pd.read_csv("./test.csv")

In [9]:
test = initial_clean(test)
test = convert_coordinates(test)
test = last_pos(test, "END_LAT", "END_LONG") # for the test set this is the last point of the truncated trip, not the true destination
test = start_pos(test)
test = date_time(test)

In [11]:
taxi.to_pickle('./Pickles/taxi_cleaned') #pickling the cleaned training set to be used for EDA
test.to_pickle('./Pickles/test_cleaned') #pickling the cleaned test set to be used for EDA
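
For reference, a pickled frame can be loaded back with `pd.read_pickle`. The sketch below round-trips a small stand-in frame through a temporary directory. The frame `df` and the temporary path are assumptions for illustration, not the notebook's actual data or folder layout.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a cleaned frame
df = pd.DataFrame({
    "TRIP_ID": [1, 2],
    "POLYLINE": [[(41.14, -8.61)], [(41.15, -8.62)]],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "taxi_cleaned")
    df.to_pickle(path)                 # write the frame to disk
    restored = pd.read_pickle(path)    # load it back in a later notebook

# POLYLINE comes back as lists of tuples, not as strings
print(type(restored.loc[0, "POLYLINE"]))
```

Pickling keeps the Python objects inside each cell, so the 'POLYLINE' lists do not have to be parsed again, as they would after a CSV round trip.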