### OLA Bike - Rides Request Demand Forecast Project Overview

#### Business Problem
Ola Bikes is losing money and losing ground to its competitors because it cannot fulfill the ride requests of many users. To tackle this problem, you are asked to predict the demand for rides in a given region and a given future time window. This would help Ola allocate drivers more intelligently to meet the ride requests from users.



#### Goal
You have to predict ride requests (demand forecast) for a particular latitude and longitude for a requested future time window/duration.
#### Data Description
The raw data contains a `number` (unique for every user), the ride request DateTime (IST time),
and the pickup and drop location latitude and longitude.



#### Data Fields
1. number: unique id for every user
2. ts: DateTime of booking ride (IST time)
3. pick_lat: ride request pickup latitude
4. pick_lng: ride request pickup longitude
5. drop_lat: ride request drop latitude
6. drop_lng: ride request drop longitude



#### Defining a Good Ride Request
Ola management knows the task is not easy and is very important for their business to grow.
Hence, their business team has provided you with some guidelines to follow.
1. Count only 1 ride request per user if there are multiple bookings from the same latitude and longitude within 1 hour of the last booking time.
2. If there are ride requests within 8 minutes of the last booking time, consider only 1 ride
request from that user (latitude and longitude may or may not be the same).
3. If the geodesic distance between the pickup and drop points is less than 50 meters,
consider that ride request a fraud ride request.
4. Consider all ride requests where the pickup or drop location is outside the India bounding box ['6.2325274', '35.6745457', '68.1113787', '97.395561'] as system errors.
5. Karnataka is our prime region, where we have a lot of drivers and ride requests to fulfill. We would prefer not to serve rides that are outside Karnataka and have a pickup-to-drop geodesic distance > 500 km. Karnataka bounding box: ['11.5945587', '18.4767308', '74.0543908', '78.588083']
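
Guidelines 3 to 5 can be sketched as a small labelling function. This is only an illustration under several assumptions: the bounding boxes are read as `[min_lat, max_lat, min_lng, max_lng]` (inferred from the values), the haversine great-circle distance stands in for a true geodesic distance, "outside Karnataka" is taken to mean that either endpoint falls outside the Karnataka box, and the helper `classify_request` with its label strings is hypothetical. Guidelines 1 and 2 are not covered here.

```python
import math

# Bounding boxes from the guidelines, read as (min_lat, max_lat, min_lng, max_lng).
# The ordering is an assumption inferred from the values.
INDIA_BBOX = (6.2325274, 35.6745457, 68.1113787, 97.395561)
KARNATAKA_BBOX = (11.5945587, 18.4767308, 74.0543908, 78.588083)

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def in_bbox(lat, lng, bbox):
    """Return True if the point lies inside the bounding box (edges included)."""
    min_lat, max_lat, min_lng, max_lng = bbox
    return min_lat <= lat <= max_lat and min_lng <= lng <= max_lng


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km (an approximation of the geodesic distance)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def classify_request(pick_lat, pick_lng, drop_lat, drop_lng):
    """Hypothetical helper: label one ride request using guidelines 3 to 5."""
    # Guideline 4: pickup or drop outside India is a system error.
    if not (in_bbox(pick_lat, pick_lng, INDIA_BBOX) and in_bbox(drop_lat, drop_lng, INDIA_BBOX)):
        return "system_error"

    dist_km = haversine_km(pick_lat, pick_lng, drop_lat, drop_lng)

    # Guideline 3: trips shorter than 50 meters are treated as fraud.
    if dist_km < 0.05:
        return "fraud"

    # Guideline 5: outside Karnataka (either endpoint, by assumption) and longer than 500 km.
    outside_karnataka = not (
        in_bbox(pick_lat, pick_lng, KARNATAKA_BBOX) and in_bbox(drop_lat, drop_lng, KARNATAKA_BBOX)
    )
    if outside_karnataka and dist_km > 500:
        return "out_of_service"

    return "good"


# First row of the dataset preview: a short trip inside Karnataka.
print(classify_request(12.313621, 76.658195, 12.287301, 76.60228))
```

For the full dataset the same checks would normally be vectorized over the DataFrame columns, and a geodesic library could replace the haversine approximation if the 50 meter threshold needs higher accuracy.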

### Data Exploration and Cleanup (basic) script
Our aim here is to understand the dataset and do a basic cleanup, removing NaNs and duplicates.

In [39]:
import pandas as pd 
import numpy as np 

In [40]:
df = pd.read_csv('/Users/chetanhalai/Documents/code base for projects/7) Ola Bike Rides/data/raw_data.csv', compression='gzip')




In [41]:
df.shape

(8381556, 6)

In [42]:
df.head()

Unnamed: 0,ts,number,pick_lat,pick_lng,drop_lat,drop_lng
0,2020-03-26 07:07:17,14626,12.313621,76.658195,12.287301,76.60228
1,2020-03-26 07:32:27,85490,12.943947,77.560745,12.954014,77.54377
2,2020-03-26 07:36:44,5408,12.899603,77.5873,12.93478,77.56995
3,2020-03-26 07:38:00,58940,12.918229,77.607544,12.968971,77.636375
4,2020-03-26 07:39:29,5408,12.89949,77.58727,12.93478,77.56995


In [43]:
# Removing duplicate entries: a customer ID (number) can only have one entry at a given timestamp,
# because a customer cannot book multiple rides at the same timestamp, so duplicate entries have to be removed.
df[df.duplicated(subset=['ts','number'],keep=False)]

Unnamed: 0,ts,number,pick_lat,pick_lng,drop_lat,drop_lng
235,2020-03-26 18:10:35,16795,12.967236,77.641594,13.014504,77.650856
236,2020-03-26 18:10:35,16795,12.967236,77.641594,13.014504,77.650856
407,2020-03-26 21:35:50,65856,12.917173,77.586400,12.913940,77.685280
408,2020-03-26 21:35:50,65856,12.917173,77.586400,12.913940,77.685280
443,2020-03-26 23:26:29,27554,12.933715,77.619300,12.938208,77.587520
...,...,...,...,...,...,...
8381231,2021-03-26 22:23:12,61636,12.975229,77.620370,13.017285,77.618200
8381245,2021-03-26 22:25:13,61636,12.975229,77.620370,13.017285,77.618200
8381246,2021-03-26 22:25:13,61636,12.975229,77.620370,13.017285,77.618200
8381248,2021-03-26 22:25:27,61636,12.975229,77.620370,13.017285,77.618200


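**Observation**

* 113540 rows share a (`ts`, `number`) pair with at least one other row (`keep=False` flags every copy, not just the extras). Keeping one row per pair leaves 8315498 unique (timestamp, customer id) rows, so 66058 of the original 8381556 rows are removed.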


In [44]:
df.drop_duplicates(subset=['ts','number'], inplace=True, keep='last')
df.reset_index(inplace=True, drop=True)
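
The number of rows flagged by `duplicated(keep=False)` is larger than the number of rows that `drop_duplicates` removes, because the first counts every copy while the second keeps one copy per group. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame: one (ts, number) pair appears three times, another twice, one once.
toy = pd.DataFrame({
    "ts": ["t1", "t1", "t1", "t2", "t2", "t3"],
    "number": [1, 1, 1, 2, 2, 3],
})

key = ["ts", "number"]

# keep=False marks every row that belongs to a duplicated group.
flagged = toy.duplicated(subset=key, keep=False).sum()

# keep="last" marks only the rows that drop_duplicates(keep="last") would remove.
removed = toy.duplicated(subset=key, keep="last").sum()

remaining = len(toy.drop_duplicates(subset=key, keep="last"))

print(flagged, removed, remaining)  # prints: 5 3 3
```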

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8315498 entries, 0 to 8315497
Data columns (total 6 columns):
 #   Column    Dtype  
---  ------    -----  
 0   ts        object 
 1   number    object 
 2   pick_lat  float64
 3   pick_lng  float64
 4   drop_lat  float64
 5   drop_lng  float64
dtypes: float64(4), object(2)
memory usage: 380.7+ MB



**Observation**

Both `ts` and `number` are stored as `object`. `number` should be an integer and `ts` should be a datetime, so I will convert them.


In [47]:
# Convert the number column from object to numeric; values that cannot be parsed become NaN
df['number'] = pd.to_numeric(df['number'], errors='coerce', downcast='integer')

In [48]:
df.isnull().sum()

ts            0
number      116
pick_lat      0
pick_lng      0
drop_lat      0
drop_lng      0
dtype: int64

**Observation**

There are 116 rows with NaN in `number`; these need to be dropped.
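
The drop itself is not shown in this notebook. One possible way to do it, sketched on a made-up toy column, is to drop the rows where `number` is NaN and then cast the column to an integer dtype (the cast only works once the NaNs are gone):

```python
import pandas as pd

# Toy column with one value that cannot be parsed as a number.
toy = pd.DataFrame({"number": ["14626", "85490", "not-a-number", "5408"]})

# errors="coerce" turns unparseable values into NaN, which forces a float dtype.
toy["number"] = pd.to_numeric(toy["number"], errors="coerce")

# Drop the rows with a missing id, then cast to a plain integer dtype.
toy = toy.dropna(subset=["number"]).reset_index(drop=True)
toy["number"] = toy["number"].astype("int64")

print(toy["number"].tolist())  # prints: [14626, 85490, 5408]
```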

In [51]:
# Convert the timestamp column to datetime
df['ts'] = pd.to_datetime(df['ts'])

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8315498 entries, 0 to 8315497
Data columns (total 6 columns):
 #   Column    Dtype         
---  ------    -----         
 0   ts        datetime64[ns]
 1   number    float64       
 2   pick_lat  float64       
 3   pick_lng  float64       
 4   drop_lat  float64       
 5   drop_lng  float64       
dtypes: datetime64[ns](1), float64(5)
memory usage: 380.7 MB




In [54]:
df.head()

Unnamed: 0,ts,number,pick_lat,pick_lng,drop_lat,drop_lng
0,2020-03-26 07:07:17,14626.0,12.313621,76.658195,12.287301,76.60228
1,2020-03-26 07:32:27,85490.0,12.943947,77.560745,12.954014,77.54377
2,2020-03-26 07:36:44,5408.0,12.899603,77.5873,12.93478,77.56995
3,2020-03-26 07:38:00,58940.0,12.918229,77.607544,12.968971,77.636375
4,2020-03-26 07:39:29,5408.0,12.89949,77.58727,12.93478,77.56995


#### Breaking Time into Features

* Many machine learning models, especially those not specifically designed for time series (e.g., linear regression, decision trees), don't inherently understand the continuous and cyclical nature of time data. Converting a timestamp into more granular features (hour, minute, day, month, year, day of week) makes the patterns more explicit and the data more amenable to these models.

In [55]:
df['hour'] = df['ts'].dt.hour
df['mins'] = df['ts'].dt.minute
df['day'] = df['ts'].dt.day
df['month'] = df['ts'].dt.month
df['year'] = df['ts'].dt.year
df['dayofweek'] = df['ts'].dt.dayofweek
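
The integer `hour` feature still places 23:00 and 00:00 far apart. A common optional follow-up, which is not part of this notebook, is a sine/cosine encoding so that the ends of the cycle sit next to each other. A minimal sketch on a toy frame, where the column names `hour_sin` and `hour_cos` are hypothetical:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ts": pd.to_datetime([
        "2020-03-26 00:15:00",
        "2020-03-26 06:00:00",
        "2020-03-26 23:45:00",
    ])
})
toy["hour"] = toy["ts"].dt.hour

# Map the hour onto a circle: one full turn per 24 hours.
angle = 2 * np.pi * toy["hour"] / 24
toy["hour_sin"] = np.sin(angle)
toy["hour_cos"] = np.cos(angle)

print(toy[["hour", "hour_sin", "hour_cos"]])
```

In this encoding the points for hour 23 and hour 0 are closer to each other than the points for hour 0 and hour 6, which matches how the clock actually wraps around.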


In [56]:
df.head()

Unnamed: 0,ts,number,pick_lat,pick_lng,drop_lat,drop_lng,hour,mins,day,month,year,dayofweek
0,2020-03-26 07:07:17,14626.0,12.313621,76.658195,12.287301,76.60228,7,7,26,3,2020,3
1,2020-03-26 07:32:27,85490.0,12.943947,77.560745,12.954014,77.54377,7,32,26,3,2020,3
2,2020-03-26 07:36:44,5408.0,12.899603,77.5873,12.93478,77.56995,7,36,26,3,2020,3
3,2020-03-26 07:38:00,58940.0,12.918229,77.607544,12.968971,77.636375,7,38,26,3,2020,3
4,2020-03-26 07:39:29,5408.0,12.89949,77.58727,12.93478,77.56995,7,39,26,3,2020,3
