## ETA-PREDICTION-FOR-DELIVERY-COMPANY

### Business Understanding

Ride-hailing apps like Uber and Yassir depend heavily on real-time data and sophisticated machine learning algorithms to streamline and enhance their services. Accurate ETA predictions are crucial for several reasons:

- Customer Satisfaction:
Accurate ETAs tell customers when their ride will arrive, which improves their overall experience. Reliable estimates also build trust, supporting higher retention rates and positive reviews.

- Operational Efficiency:
More accurate ETAs let Yassir assign drivers to rides based on demand and proximity, reducing idle time. Shorter wait times and fewer inefficient routes also lower fuel usage and overall operating costs.

- Competitive Advantage:
In a competitive market, more accurate ETAs can differentiate Yassir from its competitors and attract users and partners who value reliability. Business partners also benefit from dependable scheduling information.

- Impact on Business Strategy:
Savings from improved efficiency can be reinvested into other areas of the business, such as technology upgrades, marketing, or expansion efforts. Real-time data and analytics can also inform strategic decisions.

#### Objectives
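The goal of this project is to develop a machine learning model that predicts the estimated time of arrival (ETA) at the dropoff point for a single Yassir journey. The model is intended to:

- Enhance accuracy: give riders more reliable arrival estimates.

- Improve efficiency: support better driver allocation and routing.

- Drive innovation: feed data-driven decisions across the business.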




In [2]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Load the train, test, weather and sample-submission files into DataFrames
train_df = pd.read_csv('../Dataset/Train.csv')
test_df = pd.read_csv('../Dataset/Test.csv')
weather_df = pd.read_csv('../Dataset/Weather.csv')
sample_df = pd.read_csv('../Dataset/SampleSubmission.csv')

### EDA

In [58]:
train_df.head()

,ID,Timestamp,Origin_lat,Origin_lon,Destination_lat,Destination_lon,Trip_distance,ETA
0,000FLWA8,2019-12-04T20:01:50Z,3.258,36.777,3.003,36.718,39627,2784
1,000RGOAM,2019-12-10T22:37:09Z,3.087,36.707,3.081,36.727,3918,576
2,001QSGIH,2019-11-23T20:36:10Z,3.144,36.739,3.088,36.742,7265,526
3,002ACV6R,2019-12-01T05:43:21Z,3.239,36.784,3.054,36.763,23350,3130
4,0039Y7A8,2019-12-17T20:30:20Z,2.912,36.707,3.207,36.698,36613,2138


In [6]:
test_df.head()

,ID,Timestamp,Origin_lat,Origin_lon,Destination_lat,Destination_lon,Trip_distance
0,000V4BQX,2019-12-21T05:52:37Z,2.981,36.688,2.978,36.754,17549
1,003WBC5J,2019-12-25T21:38:53Z,3.032,36.769,3.074,36.751,7532
2,004O4X3A,2019-12-29T21:30:29Z,3.035,36.711,3.01,36.758,10194
3,006CEI5B,2019-12-31T22:51:57Z,2.902,36.738,3.208,36.698,32768
4,009G0M2T,2019-12-28T21:47:22Z,2.86,36.692,2.828,36.696,4513


In [7]:
weather_df.head()

,date,dewpoint_2m_temperature,maximum_2m_air_temperature,mean_2m_air_temperature,mean_sea_level_pressure,minimum_2m_air_temperature,surface_pressure,total_precipitation,u_component_of_wind_10m,v_component_of_wind_10m
0,2019-11-01,290.630524,296.434662,294.125061,101853.617188,292.503998,100806.351562,0.004297,3.561323,0.941695
1,2019-11-02,289.135284,298.432404,295.551666,101225.164062,293.337921,100187.25,0.001767,5.318593,3.258237
2,2019-11-03,287.667694,296.612122,295.182831,100806.617188,293.674316,99771.414062,0.000797,8.447649,3.172982
3,2019-11-04,287.634644,297.173737,294.368134,101240.929688,292.376221,100200.84375,0.000393,5.991428,2.2367
4,2019-11-05,286.413788,294.284851,292.496979,101131.75,289.143066,100088.5,0.004658,6.96273,2.655364
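
The weather table has one row per day, while each trip carries a full timestamp, so one way to combine them is to join on the calendar date. The sketch below is illustrative only: `attach_weather` and the `mean_temp_c` column are hypothetical names, the two small frames are hand-built from rows shown above, and the Celsius conversion assumes the temperature columns are in kelvin (the values of roughly 290 to 298 suggest this, but the notebook does not state the unit).

```python
import pandas as pd

def attach_weather(trips: pd.DataFrame, weather: pd.DataFrame) -> pd.DataFrame:
    """Left-join daily weather onto trips by calendar date (UTC)."""
    trips = trips.copy()
    weather = weather.copy()
    # Reduce both keys to plain dates so a trip timestamp matches its day
    trips["date"] = pd.to_datetime(trips["Timestamp"], utc=True).dt.date
    weather["date"] = pd.to_datetime(weather["date"]).dt.date
    # Assumption: temperatures are in kelvin, so subtract 273.15 for Celsius
    weather["mean_temp_c"] = weather["mean_2m_air_temperature"] - 273.15
    # A left join keeps every trip, even when no weather row exists for its day
    return trips.merge(weather, on="date", how="left")

# Tiny hand-built frames for illustration
trips = pd.DataFrame({
    "ID": ["a", "b"],
    "Timestamp": ["2019-11-01T20:01:50Z", "2019-11-03T05:43:21Z"],
})
weather = pd.DataFrame({
    "date": ["2019-11-01", "2019-11-02"],
    "mean_2m_air_temperature": [294.125061, 295.551666],
})
merged = attach_weather(trips, weather)
print(merged[["ID", "date", "mean_temp_c"]])
```

Trips on days without a weather row come back with missing weather values rather than being dropped, which makes gaps in the weather coverage easy to spot.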


Checking the shape and data types

In [59]:
train_df.shape

(83924, 8)

In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83924 entries, 0 to 83923
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               83924 non-null  object 
 1   Timestamp        83924 non-null  object 
 2   Origin_lat       83924 non-null  float64
 3   Origin_lon       83924 non-null  float64
 4   Destination_lat  83924 non-null  float64
 5   Destination_lon  83924 non-null  float64
 6   Trip_distance    83924 non-null  int64  
 7   ETA              83924 non-null  int64  
dtypes: float64(4), int64(2), object(2)
memory usage: 5.1+ MB


In [5]:
train_df['Timestamp'] = pd.to_datetime(train_df['Timestamp'])

Checking statistical info

In [28]:
train_df.describe(include='all').T

,count,unique,top,freq,mean,min,25%,50%,75%,max,std
ID,83924.0,83924.0,ZZZY11ZN,1.0,,,,,,,
Timestamp,83924.0,,,,2019-12-04 14:22:20.568883712+00:00,2019-11-19 23:00:08+00:00,2019-11-27 01:53:00.500000+00:00,2019-12-04 01:46:50.500000+00:00,2019-12-11 21:36:44+00:00,2019-12-19 23:59:29+00:00,
Origin_lat,83924.0,,,,3.052406,2.807,2.994,3.046,3.095,3.381,0.096388
Origin_lon,83924.0,,,,36.739358,36.589,36.721,36.742,36.76,36.82,0.032074
Destination_lat,83924.0,,,,3.056962,2.807,2.995,3.049,3.109,3.381,0.10071
Destination_lon,83924.0,,,,36.737732,36.596,36.718,36.742,36.76,36.819,0.032781
Trip_distance,83924.0,,,,13527.82141,1.0,6108.0,11731.5,19369.0,62028.0,9296.716006
ETA,83924.0,,,,1111.697762,1.0,701.0,1054.0,1456.0,5238.0,563.565486
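
The summary shows a minimum of 1 for both `Trip_distance` and `ETA`, which hints that some rows may not be real journeys. One possible check is the implied average speed. This is a sketch under assumptions the notebook does not confirm: `Trip_distance` is taken to be in metres and `ETA` in seconds, the `flag_implausible` helper and its thresholds are hypothetical, and the last two sample rows are synthetic extremes built from the min and max values above.

```python
import pandas as pd

def flag_implausible(df: pd.DataFrame, min_kmh: float = 1.0, max_kmh: float = 150.0) -> pd.DataFrame:
    """Add implied speed (km/h) and flag rows outside [min_kmh, max_kmh]."""
    out = df.copy()
    # metres / seconds = m/s; multiply by 3.6 to get km/h
    out["speed_kmh"] = out["Trip_distance"] / out["ETA"] * 3.6
    out["implausible"] = ~out["speed_kmh"].between(min_kmh, max_kmh)
    return out

sample = pd.DataFrame({
    "Trip_distance": [39627, 3918, 1, 62028],  # first two rows come from train_df.head()
    "ETA": [2784, 576, 5238, 1],               # last two rows are synthetic extremes
})
checked = flag_implausible(sample)
print(checked)
```

Flagging rather than dropping keeps the decision about what to do with such rows separate from the check itself.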


Checking for null values

In [29]:
train_df.isna().sum()

ID                 0
Timestamp          0
Origin_lat         0
Origin_lon         0
Destination_lat    0
Destination_lon    0
Trip_distance      0
ETA                0
dtype: int64

Checking for duplicates

In [30]:
train_df.duplicated().sum()

np.int64(0)

In [6]:
train_df1 = train_df.copy()

In [7]:
train_df1.set_index('Timestamp', inplace=True)


In [8]:
train_df1.head(10)

Timestamp,ID,Origin_lat,Origin_lon,Destination_lat,Destination_lon,Trip_distance,ETA
2019-12-04 20:01:50+00:00,000FLWA8,3.258,36.777,3.003,36.718,39627,2784
2019-12-10 22:37:09+00:00,000RGOAM,3.087,36.707,3.081,36.727,3918,576
2019-11-23 20:36:10+00:00,001QSGIH,3.144,36.739,3.088,36.742,7265,526
2019-12-01 05:43:21+00:00,002ACV6R,3.239,36.784,3.054,36.763,23350,3130
2019-12-17 20:30:20+00:00,0039Y7A8,2.912,36.707,3.207,36.698,36613,2138
2019-12-01 04:21:03+00:00,003B9LE9,2.995,36.738,3.207,36.698,25342,1341
2019-12-10 23:08:35+00:00,004K2C9W,3.054,36.773,3.059,36.785,2814,606
2019-11-26 20:41:42+00:00,004LD40Z,3.092,36.711,3.035,36.734,7026,451
2019-12-02 05:24:25+00:00,005H5Q6S,3.178,36.722,3.197,36.713,2454,587
2019-12-05 23:10:00+00:00,006DWCWR,3.001,36.736,3.015,36.752,3855,423


In [9]:


olat = train_df1.groupby(['Origin_lat'])['ETA'].mean()
olat


Origin_lat
2.807    2419.0
2.808    1618.5
2.809    1546.0
2.810    1268.0
2.812    1779.0
          ...  
3.373    1808.0
3.374    1874.0
3.379     992.0
3.380    1158.0
3.381    1703.0
Name: ETA, Length: 568, dtype: float64

In [10]:
train_df1['Origin_lat'].nunique()

568
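
With 568 distinct `Origin_lat` values, many groups in the per-value mean above may rest on only a few trips. A coarser grouping that also reports the group size can make the averages easier to judge. The snippet below is a minimal sketch: `eta_by_origin_bin` is a hypothetical helper, rounding to two decimals is an arbitrary choice, and the sample rows are taken from `train_df1.head(10)`.

```python
import pandas as pd

def eta_by_origin_bin(df: pd.DataFrame, decimals: int = 2) -> pd.DataFrame:
    """Mean ETA and trip count per rounded Origin_lat value."""
    binned = df["Origin_lat"].round(decimals)
    return df.groupby(binned)["ETA"].agg(["mean", "count"])

sample = pd.DataFrame({
    "Origin_lat": [3.258, 3.087, 3.144, 3.239, 2.912, 3.092],
    "ETA": [2784, 576, 526, 3130, 2138, 451],
})
result = eta_by_origin_bin(sample)
print(result)
```

Reading the `count` column next to the mean shows which bins have enough trips for the average to be meaningful.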

In [12]:
# Generate the complete date range
complete_date_range = pd.date_range(start=train_df1.index.min(), end=train_df1.index.max(), freq='min')

Origin_lats = train_df1['Origin_lat'].unique()
Origin_lons = train_df1['Origin_lon'].unique()
Destination_lats = train_df1['Destination_lat'].unique()
Destination_lons = train_df1['Destination_lon'].unique()

# Create a DataFrame with all possible combinations of timestamps, origin latitudes and origin longitudes
all_combinations = pd.MultiIndex.from_product([complete_date_range, Origin_lats, Origin_lons], names=['Timestamp', 'Origin_lat', 'Origin_lon'])
all_df = pd.DataFrame(index=all_combinations).reset_index()


# Merge with the original DataFrame to fill missing values
train_df_filled = pd.merge(all_df, train_df1, how='left', on=['Timestamp', 'Origin_lat', 'Origin_lon', 'Destination_lat', 'Destination_lon'])


# Fill remaining missing values with zeros
train_df_filled['Trip_distance'] = train_df_filled['Trip_distance'].fillna(0)
train_df_filled['ETA'] = train_df_filled['ETA'].fillna(0)


# Reindex the DataFrame
#df_reindexed = train_df1.reindex(complete_date_range)

# Identify missing dates
#missing_dates = df_reindexed[df_reindexed.isnull().any(axis=1)].index
#print("Missing dates:")
#print(missing_dates)


MemoryError: Unable to allocate 21.1 GiB for an array with shape (5651486400,) and data type int32
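
The size in the error message can be estimated before building the index, because a full cross product has as many rows as the product of its level sizes. The sketch below is a back-of-the-envelope check, not part of the original pipeline. `product_size_gib` is a hypothetical helper. The 43,260 minute stamps follow from the timestamp range in the summary above, 568 is the `Origin_lat` count shown earlier, and 230 distinct `Origin_lon` values is inferred here as the factor that reproduces the array shape in the error; the notebook does not print it.

```python
import math

def product_size_gib(level_sizes, bytes_per_code=4):
    """Row count and memory (GiB) of one int32 array spanning a full cross product."""
    rows = math.prod(level_sizes)
    return rows, rows * bytes_per_code / 2**30

# minute stamps x distinct origin latitudes x distinct origin longitudes (230 is inferred)
rows, gib = product_size_gib([43260, 568, 230])
print(f"{rows:,} rows, about {gib:.1f} GiB per array")
```

The row count matches the shape `(5651486400,)` reported in the error, and that is the cost of a single array. The index needs several of them, so the cross product cannot be materialised on an ordinary machine.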

In [None]:
#train_df1 = train_df1[~train_df1.index.duplicated(keep='first')]

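# Generate complete date range with minute frequency
# ('min' matches the one-minute steps in the output further down; the original 'S' would step by second)
complete_date_range = pd.date_range(start=train_df1.index.min(), end=train_df1.index.max(), freq='min')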

# Reindex DataFrame
df_reindexed = train_df1.reindex(complete_date_range)

# Identify missing dates
missing_dates = df_reindexed[df_reindexed.isnull().any(axis=1)].index
print("Missing dates:")
print(missing_dates)
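
Reindexing against a regular grid only matches trips whose timestamp falls exactly on a grid point, and `reindex` raises a `ValueError` when the index contains duplicate labels. A possible alternative for finding gaps is to count trips per minute and look for minutes with a count of zero. This is a minimal sketch on three hand-made timestamps; `minutes_without_trips` is a hypothetical helper, not something defined elsewhere in the notebook.

```python
import pandas as pd

def minutes_without_trips(df: pd.DataFrame, ts_col: str = "Timestamp") -> pd.DatetimeIndex:
    """Return the start of every one-minute bin that contains no trips."""
    ts = pd.DatetimeIndex(pd.to_datetime(df[ts_col], utc=True))
    # One marker per trip, summed into one-minute bins; empty bins sum to 0
    per_minute = pd.Series(1, index=ts).resample("min").sum()
    return per_minute[per_minute == 0].index

sample = pd.DataFrame({
    "Timestamp": [
        "2019-11-19T23:00:08Z",
        "2019-11-19T23:00:40Z",  # second trip in the same minute
        "2019-11-19T23:02:15Z",
    ]
})
gaps = minutes_without_trips(sample)
print(gaps)
```

Because trips are binned rather than matched exactly, several trips in the same minute are handled without dropping rows, and no placeholder rows need to be added to the training data.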

In [53]:
#train_df_filled = pd.merge(train_df1, missing_dates, how='left', on=['Timestamp'])

# Reset index to get 'Timestamp' back as a column
train_df_filled.reset_index(inplace=True)
train_df_filled.rename(columns={'index': 'Timestamp'}, inplace=True)

print("DataFrame with missing dates filled:")
print(train_df_filled)

DataFrame with missing dates filled:
      Timestamp                 Timestamp        ID  Origin_lat  Origin_lon  \
0             0 2019-11-19 23:00:08+00:00  UYFJUFF0       3.021      36.751   
1             1 2019-11-19 23:01:08+00:00         0       0.000       0.000   
2             2 2019-11-19 23:02:08+00:00  LV8809ED       3.050      36.738   
3             3 2019-11-19 23:03:08+00:00         0       0.000       0.000   
4             4 2019-11-19 23:04:08+00:00         0       0.000       0.000   
...         ...                       ...       ...         ...         ...   
43255     43255 2019-12-19 23:55:08+00:00         0       0.000       0.000   
43256     43256 2019-12-19 23:56:08+00:00         0       0.000       0.000   
43257     43257 2019-12-19 23:57:08+00:00         0       0.000       0.000   
43258     43258 2019-12-19 23:58:08+00:00         0       0.000       0.000   
43259     43259 2019-12-19 23:59:08+00:00         0       0.000       0.000   

       Destina

In [54]:
train_df_filled.shape

(43260, 9)

In [56]:
train_df_filled.head()

Unnamed: 0,Timestamp,Timestamp.1,ID,Origin_lat,Origin_lon,Destination_lat,Destination_lon,Trip_distance,ETA
0,0,2019-11-19 23:00:08+00:00,UYFJUFF0,3.021,36.751,3.031,36.769,3898.0,556.0
1,1,2019-11-19 23:01:08+00:00,0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,2019-11-19 23:02:08+00:00,LV8809ED,3.05,36.738,3.044,36.741,776.0,167.0
3,3,2019-11-19 23:03:08+00:00,0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,2019-11-19 23:04:08+00:00,0,0.0,0.0,0.0,0.0,0.0,0.0
