## Business: Understanding the business problem

**Problem statement**: Our task is to forecast demand, i.e. to predict the number of ride requests for a specific location (latitude and longitude) and a specific future time window.

**Business objective**: Increase revenue and improve customer satisfaction directly by fulfilling demand.

**Dataset**: The dataset contains 8381556 ride requests made by 94274 users, collected over the past 12 months.

**Dataset description**: The dataset is built from booking logs and contains the following features:
- 'number' (user id) - a unique identifier for each user
- 'ts' (booking timestamp) - the date and time at which the booking was made
- 'pick_lat' (pickup location latitude) & 'pick_lng' (pickup location longitude) - the location where the user wants to be picked up
- 'drop_lat' (drop location latitude) & 'drop_lng' (drop location longitude) - the location where the user wants to be dropped off

To help us solve this problem, the company's management gave us guidelines that define a good ride request:
- Count only 1 ride request per user if there are multiple bookings from the same latitude and longitude within 1 hour of the last booking time.
    - Reason: based on the logs, the management team noticed that users often re-book depending on the arrival time. If the arrival time is too long, a user retries the booking several times, hoping that another driver with a shorter arrival time gets allocated. In another scenario, a user sees a low driver rating or a car they dislike, cancels the booking, and books again. Consequently, the logs contain many duplicate rides from the same user with the same pickup and drop locations, and we must remove those duplicate entries.
- If there are ride requests within 8 minutes of the last booking time, count only 1 ride request per user (latitude and longitude may or may not be the same).
- If the geodesic distance between the pickup point and the drop point is less than 50 meters, treat the ride request as fraudulent.
- Treat ride requests whose pickup or drop location is outside the India bounding box ['6.2325274', '35.6745457', '68.1113787', '97.395561'] as system errors.
- Karnataka is our prime region, where we have a lot of drivers and ride requests to fulfill. We do not want to serve rides that are outside Karnataka and have a geodesic distance between pickup and drop of more than 500 km. Karnataka bounding box: ['11.5945587', '18.4767308', '74.0543908', '78.588083']
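
The location-based guidelines can be expressed as row-level flags. The following is a minimal sketch, not the project's actual implementation. It assumes each bounding box is ordered as [lat_min, lat_max, lng_min, lng_max], approximates the geodesic distance with the haversine formula on a spherical Earth, and treats a ride as outside Karnataka when either endpoint falls outside the box. The helper names (`haversine_km`, `in_bbox`, `flag_bad_requests`) are hypothetical, and all sample rows except the first are made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumed ordering: (lat_min, lat_max, lng_min, lng_max).
INDIA_BBOX = (6.2325274, 35.6745457, 68.1113787, 97.395561)
KARNATAKA_BBOX = (11.5945587, 18.4767308, 74.0543908, 78.588083)


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km (a spherical approximation of the geodesic)."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0088 * np.arcsin(np.sqrt(a))


def in_bbox(lat, lng, bbox):
    """Boolean mask: True where the point lies inside the bounding box."""
    lat_min, lat_max, lng_min, lng_max = bbox
    return lat.between(lat_min, lat_max) & lng.between(lng_min, lng_max)


def flag_bad_requests(rides):
    """Add one boolean column per location-based guideline."""
    out = rides.copy()
    out["dist_km"] = haversine_km(out["pick_lat"], out["pick_lng"],
                                  out["drop_lat"], out["drop_lng"])
    # Pickup and drop less than 50 meters apart.
    out["is_fraud"] = out["dist_km"] < 0.05
    # Pickup or drop outside the India bounding box.
    both_in_india = (in_bbox(out["pick_lat"], out["pick_lng"], INDIA_BBOX)
                     & in_bbox(out["drop_lat"], out["drop_lng"], INDIA_BBOX))
    out["is_system_error"] = ~both_in_india
    # Assumption: "outside Karnataka" means either endpoint is outside the box.
    both_in_karnataka = (in_bbox(out["pick_lat"], out["pick_lng"], KARNATAKA_BBOX)
                         & in_bbox(out["drop_lat"], out["drop_lng"], KARNATAKA_BBOX))
    out["is_out_of_service"] = ~both_in_karnataka & (out["dist_km"] > 500)
    return out


# Row 0 is taken from the dataset preview; the other rows are hypothetical.
sample = pd.DataFrame({
    "pick_lat": [12.943947, 12.9, 0.0, 28.6139],
    "pick_lng": [77.560745, 77.5, 0.0, 77.2090],
    "drop_lat": [12.954014, 12.9, 12.954014, 19.0760],
    "drop_lng": [77.54377, 77.5, 77.54377, 72.8777],
})
flagged = flag_bad_requests(sample)
good_rides = flagged[~(flagged["is_fraud"]
                       | flagged["is_system_error"]
                       | flagged["is_out_of_service"])]
```

Keeping the flags as separate columns, rather than dropping rows immediately, makes it easy to count how many requests each guideline removes.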

Based on management's definition of a good ride request, we will preprocess the data to keep only good ride requests, which will then be used to train our machine learning model.
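
The two time-window guidelines (1 hour for the same pickup location, 8 minutes for any location) could be sketched as follows. This is an illustration under stated assumptions, not the project's final preprocessing code. Each booking is compared with the user's immediately preceding booking, "same latitude and longitude" is taken as exact equality of the pickup coordinates, and a gap exactly equal to the window counts as a re-booking. The function name `drop_rebookings` and most of the sample rows are hypothetical.

```python
import pandas as pd


def drop_rebookings(rides, same_place_window="1h", any_place_window="8min"):
    """Keep one request per user within each re-booking window."""
    rides = rides.sort_values(["number", "ts"])
    # Gap to the user's previous booking from the same pickup location.
    gap_same_place = rides.groupby(["number", "pick_lat", "pick_lng"])["ts"].diff()
    # Gap to the user's previous booking from any location.
    gap_any_place = rides.groupby("number")["ts"].diff()
    # The first booking of each group has a NaT gap, which compares as False.
    is_rebooking = ((gap_same_place <= pd.Timedelta(same_place_window))
                    | (gap_any_place <= pd.Timedelta(any_place_window)))
    return rides[~is_rebooking].sort_values("ts").reset_index(drop=True)


# The first two rows mirror user 5408 in the dataset preview; the rest are made up.
sample = pd.DataFrame({
    "ts": pd.to_datetime([
        "2020-03-26 07:36:44", "2020-03-26 07:39:29",  # 8-minute rule
        "2020-03-26 08:00:00", "2020-03-26 08:30:00",  # 1-hour rule, same place
        "2020-03-26 09:45:00",                         # same place, 75 minutes later
        "2020-03-26 08:00:00", "2020-03-26 08:30:00",  # different places
    ]),
    "number": [5408, 5408, 1, 1, 1, 2, 2],
    "pick_lat": [12.899603, 12.89949, 12.90, 12.90, 12.90, 12.95, 12.97],
    "pick_lng": [77.5873, 77.58727, 77.60, 77.60, 77.60, 77.55, 77.64],
})
kept = drop_rebookings(sample)
```

In this sample, 5 of the 7 requests are kept: the second booking of user 5408 falls under the 8-minute rule, and the 08:30 booking of user 1 falls under the 1-hour rule.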

The company gave us data ranging from 2020-03-26 (26th March 2020) to 2021-03-26 (26th March 2021). For deployment, we must build a prediction pipeline, which we will use to evaluate our model by making a prediction for 2021-03-27 (27th March 2021).

## Theory: Project flow and structure

1. Data exploration
2. Data preprocessing
3. Data preparation
4. Modelling and prediction

## Theory: Multi-step time-series forecasting

To solve this demand forecasting problem, we first need to understand what multi-step time series forecasting is.

Predicting ride requests is a multi-step time series problem.

Multi-step-ahead prediction is the task of predicting a sequence of future values in a time series. A typical approach, known as multi-stage prediction, applies a predictive model step by step: the value predicted for the current time step is fed back in as an input for predicting the next time step.
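
To make the step-by-step idea concrete, here is a minimal sketch of multi-stage (recursive) forecasting. It is not the model used in this project: it swaps in a plain linear autoregressive model fitted with least squares, and it runs on a synthetic, noise-free hourly signal rather than on the ride data. The function names `fit_ar` and `recursive_forecast` are hypothetical.

```python
import numpy as np


def fit_ar(series, n_lags):
    """Least-squares fit of y[t] = c + w[0]*y[t-n_lags] + ... + w[-1]*y[t-1]."""
    y = np.asarray(series, dtype=float)
    # Each row holds n_lags consecutive values, oldest first.
    lags = np.column_stack([y[i:len(y) - n_lags + i] for i in range(n_lags)])
    X = np.column_stack([np.ones(len(lags)), lags])
    coef, *_ = np.linalg.lstsq(X, y[n_lags:], rcond=None)
    return coef  # coef[0] is the intercept, coef[1:] are the lag weights


def recursive_forecast(series, coef, n_steps):
    """Predict n_steps ahead, feeding each prediction back in as an input."""
    n_lags = len(coef) - 1
    window = list(np.asarray(series, dtype=float)[-n_lags:])
    predictions = []
    for _ in range(n_steps):
        next_value = float(coef[0] + np.dot(coef[1:], window))
        predictions.append(next_value)
        # Slide the window: drop the oldest value, append the new prediction.
        window = window[1:] + [next_value]
    return predictions


# Synthetic demand with a 24-hour cycle (illustrative values, not real data).
t = np.arange(72 + 6)
signal = 100 + 50 * np.sin(2 * np.pi * t / 24)
history, actual = signal[:72], signal[72:]

coef = fit_ar(history, n_lags=2)
forecast = recursive_forecast(history, coef, n_steps=6)
```

Because every step consumes earlier predictions, errors can accumulate over the forecast horizon, which is the main trade-off of the recursive approach on real, noisy data.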

(TODO)

In [41]:
import pandas as pd
from IPython.display import display

df = pd.read_csv("../../data/raw_data.csv", low_memory=False, compression='gzip', parse_dates=["ts"])

In [42]:
display(df.head())
display(df.shape)

Unnamed: 0,ts,number,pick_lat,pick_lng,drop_lat,drop_lng
0,2020-03-26 07:07:17,14626,12.313621,76.658195,12.287301,76.60228
1,2020-03-26 07:32:27,85490,12.943947,77.560745,12.954014,77.54377
2,2020-03-26 07:36:44,5408,12.899603,77.5873,12.93478,77.56995
3,2020-03-26 07:38:00,58940,12.918229,77.607544,12.968971,77.636375
4,2020-03-26 07:39:29,5408,12.89949,77.58727,12.93478,77.56995


(8381556, 6)

### Data cleaning 

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8381556 entries, 0 to 8381555
Data columns (total 6 columns):
 #   Column    Dtype         
---  ------    -----         
 0   ts        datetime64[ns]
 1   number    object        
 2   pick_lat  float64       
 3   pick_lng  float64       
 4   drop_lat  float64       
 5   drop_lng  float64       
dtypes: datetime64[ns](1), float64(4), object(1)
memory usage: 383.7+ MB


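'number' is a numeric unique identifier of the user, so it should be stored as a number rather than an object. We convert it with pd.to_numeric; because errors="coerce" turns values that cannot be parsed into NaN, the column ends up as float64.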

In [48]:
df["number"] = pd.to_numeric(df["number"], errors="coerce")

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8381556 entries, 0 to 8381555
Data columns (total 6 columns):
 #   Column    Dtype         
---  ------    -----         
 0   ts        datetime64[ns]
 1   number    float64       
 2   pick_lat  float64       
 3   pick_lng  float64       
 4   drop_lat  float64       
 5   drop_lng  float64       
dtypes: datetime64[ns](1), float64(5)
memory usage: 383.7 MB
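
The float64 dtype is a side effect of coercion: NaN cannot be stored in a regular integer column. If integer identifiers are preferred, one option (a suggestion, not something the original notebook does) is the pandas nullable integer dtype "Int64". The sketch below uses hypothetical raw values, including one unparseable entry.

```python
import pandas as pd

# Hypothetical raw identifiers; "not-a-number" stands in for a corrupt log entry.
raw_ids = pd.Series(["14626", "85490", "not-a-number", "5408"], dtype="object")

as_float = pd.to_numeric(raw_ids, errors="coerce")  # unparseable -> NaN, dtype float64
as_int = as_float.astype("Int64")                   # NaN -> <NA>, values stay integers
```

The missing entries remain visible to `isnull()` either way, so the later missing-value step works the same on both dtypes.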


**Step 1**: The same customer cannot book multiple rides at the same timestamp (these duplicates may have appeared due to system logging issues)

We remove duplicate ["ts", "number"] entries, since a customer ID ("number") can have only one entry at a particular timestamp ("ts")

In [34]:
duplicates = df[df.duplicated(subset=["ts", "number"], keep=False)]
print(duplicates.shape[0])

113540


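There are 113540 rows that share a ("ts", "number") pair with at least one other row (keep=False counts every member of a duplicate group). We will remove these duplicates, keeping only the last occurrence of each pair.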
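
Note that the count from `duplicated(keep=False)` is not the number of rows that get removed, because one row per duplicate group survives. With the shapes reported in this notebook, 8381556 - 8315498 = 66058 rows are dropped. A small sketch on hypothetical toy data shows the difference:

```python
import pandas as pd

# Toy data: user 1 logged three times and user 2 twice at the same timestamp.
toy = pd.DataFrame({
    "ts": pd.to_datetime(["2020-03-26 07:00:00"] * 3
                         + ["2020-03-26 07:05:00"] * 2
                         + ["2020-03-26 07:10:00"]),
    "number": [1, 1, 1, 2, 2, 3],
})

# Every row that belongs to a duplicate group.
rows_in_duplicate_groups = int(toy.duplicated(subset=["ts", "number"], keep=False).sum())
# Rows that drop_duplicates(keep="last") would actually remove.
rows_to_remove = int(toy.duplicated(subset=["ts", "number"], keep="last").sum())

deduplicated = toy.drop_duplicates(subset=["ts", "number"], keep="last")
```

Here 5 rows belong to duplicate groups, 3 of them are removed, and 3 rows remain.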

In [38]:
df.drop_duplicates(subset=["ts", "number"], inplace=True, keep="last")
df.reset_index(inplace=True, drop=True)
display(df.shape[0])

8315498

**Step 2**: Handling missing values 

Let's see if we have any missing values in our dataset

In [49]:
df.isnull().sum()

ts            0
number      121
pick_lat      0
pick_lng      0
drop_lat      0
drop_lng      0
dtype: int64

We see that we have 121 missing values in the 'number' column, which is the unique identifier of each user.

To handle missing values, we have 3 possible options:
- Drop the entries with missing values
- Drop the column which contains missing values
- Impute the missing values

Dropping the user identifier column is not possible for our use case. Imputing the missing values with the mode, mean, or median would not make sense, since the column is an identifier. So, the only option we have is to drop the entries with missing values. Furthermore, given that we have over 8300000 entries, dropping 121 of them is unlikely to have a noticeable negative effect on the performance of our model.

In [52]:
df.dropna(inplace=True, subset=["number"])
display(df.shape[0])

8381435

### Breaking the timestamp into separate features

In [53]:
df['hour'] = df['ts'].dt.hour
df['minute'] = df['ts'].dt.minute
df['day'] = df['ts'].dt.day
df['month'] = df['ts'].dt.month
df['year'] = df['ts'].dt.year
df['dayofweek'] = df['ts'].dt.dayofweek

In [54]:
df.head()

Unnamed: 0,ts,number,pick_lat,pick_lng,drop_lat,drop_lng,hour,minute,day,month,year,dayofweek
0,2020-03-26 07:07:17,14626.0,12.313621,76.658195,12.287301,76.60228,7,7,26,3,2020,3
1,2020-03-26 07:32:27,85490.0,12.943947,77.560745,12.954014,77.54377,7,32,26,3,2020,3
2,2020-03-26 07:36:44,5408.0,12.899603,77.5873,12.93478,77.56995,7,36,26,3,2020,3
3,2020-03-26 07:38:00,58940.0,12.918229,77.607544,12.968971,77.636375,7,38,26,3,2020,3
4,2020-03-26 07:39:29,5408.0,12.89949,77.58727,12.93478,77.56995,7,39,26,3,2020,3


In [57]:
import os

# Create the output directory if it does not exist yet
os.makedirs("../../output", exist_ok=True)

df.to_csv("../../output/preprocessed_1.csv", index=False, compression="gzip")