**Glovo Orders Prediction**

Based on 4 weeks of hourly order data, we need to predict how many orders will be placed at each hour of the following week.


**Loading data**

The data was provided as an Excel workbook with 4 sheets, one for each week.
Each sheet contains the number of orders per hour.
The workbook has been converted to CSV files, which need to be loaded and reshaped before modeling.

In [65]:
import pandas as pd
import numpy as np

# load 4 data files
df_w17 = pd.read_csv('data_forecast_w17.csv')
df_w18 = pd.read_csv('data_forecast_w18.csv')
df_w19 = pd.read_csv('data_forecast_w19.csv')
df_w20 = pd.read_csv('data_forecast_w20.csv')

# add week column to all dataframes
df_w17['WEEK']=17
df_w18['WEEK']=18
df_w19['WEEK']=19
df_w20['WEEK']=20

# concatenate all dataframes (DataFrame.append was removed in pandas 2.0)
df_all = pd.concat([df_w17, df_w18, df_w19, df_w20])
df_all.index.names = [None]
df_all = df_all.reset_index(drop=True)
df_all

Unnamed: 0,HOURS,MON,TUE,WED,THU,FRI,SAT,SUN,WEEK
0,0,63,66,90,84,144,111,138,17
1,1,48,42,48,57,54,66,60,17
2,2,21,48,42,30,36,75,42,17
3,3,18,33,12,39,27,48,36,17
4,4,6,12,21,3,15,42,24,17
5,5,12,21,0,21,24,69,24,17
6,6,9,3,3,12,15,168,30,17
7,7,18,12,18,15,30,39,24,17
8,8,51,51,33,42,42,57,45,17
9,9,96,75,138,117,123,102,78,17


The idea is to predict the number of orders for a given hour from the previous weeks' information. For this we want to use a linear regression that predicts the number of orders for a given day and hour. With it we can predict the number of orders per hour for the whole week, as asked.
Before running the regression we need to prepare the features.

**Features preparation:**
 
We expect the same day and hour in the previous weeks to have an important effect on the prediction, and the day of the week in particular.
We will use 7 binary features, one for each day.

In [66]:
# create empty numpy array with the new column format:
# ['WEEK','HOUR','IS_MONDAY','IS_TUESDAY','IS_WEDNESDAY','IS_THURSDAY','IS_FRIDAY','IS_SATURDAY','IS_SUNDAY','NUM_ORDERS']
# 672 rows = 4 weeks x 24 hours x 7 days
features_array = np.zeros(shape=(672,10))

day_columns = ['MON','TUE','WED','THU','FRI','SAT','SUN']

# fill the array looping once over df_all for every day of the week,
# so rows are ordered by day first (all Mondays, then all Tuesdays, ...)
index=0
for day_number, day in enumerate(day_columns):
    for idx, row in df_all.iterrows():
        features_array[index,0]=row['WEEK']
        features_array[index,1]=row['HOURS']
        # the binary flag for this day lives in columns 2..8
        features_array[index,2+day_number]=1
        features_array[index,9]=row[day]
        index=index+1

    

print(features_array[21])
print(features_array[300])
print(features_array[671])

[  17.   21.    1.    0.    0.    0.    0.    0.    0.  684.]
[  17.   12.    0.    0.    0.    1.    0.    0.    0.  306.]
[  20.   23.    0.    0.    0.    0.    0.    0.    1.  375.]
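
The reshaping above turns a wide table (one column per day) into a long one (one row per week, day and hour). As a rough sketch of the same idea, and not the method used in this notebook, pandas `melt` and `get_dummies` can build an equivalent table. The frame `df_demo` is an illustrative stand-in that reuses the first two rows of the `df_all` output shown earlier; the names `long_df`, `day_flags` and `features_df` are assumptions made for this example.

```python
import pandas as pd

DAYS = ['MON', 'TUE', 'WED', 'THU', 'FRI', 'SAT', 'SUN']

# Stand-in for df_all: hours 0 and 1 of week 17, as printed earlier
df_demo = pd.DataFrame({
    'HOURS': [0, 1],
    'MON': [63, 48], 'TUE': [66, 42], 'WED': [90, 48], 'THU': [84, 57],
    'FRI': [144, 54], 'SAT': [111, 66], 'SUN': [138, 60],
    'WEEK': [17, 17],
})

# Wide -> long: one row per (WEEK, HOURS, DAY)
long_df = df_demo.melt(id_vars=['WEEK', 'HOURS'], value_vars=DAYS,
                       var_name='DAY', value_name='NUM_ORDERS')

# Keep Monday-to-Sunday column order instead of alphabetical order
long_df['DAY'] = pd.Categorical(long_df['DAY'], categories=DAYS)

# One binary column per day: IS_MON ... IS_SUN
day_flags = pd.get_dummies(long_df['DAY'], prefix='IS').astype(int)

features_df = pd.concat(
    [long_df[['WEEK', 'HOURS']], day_flags, long_df['NUM_ORDERS']], axis=1)
print(features_df.head())
```

Keeping the result as a DataFrame preserves the column names, which makes later selections by week or day easier to read than positional slices.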


**Prepare X and y features:**

To keep things simple, we will use weeks 17, 18 and 19 as training data and week 20 to validate the results.
In a second phase the whole set will be used as training data, so that we can provide full results for a hypothetical week 21.

In [67]:
# prepare final training sets including all data
X_train_final = features_array[:,1:9]
y_train_final = features_array[:,9]

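# training part of the validation split: the first 504 of the 672 rows
# NOTE: features_array is ordered by day and then by week, so this positional
# slice does not isolate weeks 17-19; a mask on the WEEK column would
X_train = X_train_final[0:24*7*3,:]
y_train = y_train_final[0:24*7*3]
# test part of the validation split: the remaining 168 rows
X_test = X_train_final[24*7*3:,:]
y_test = y_train_final[24*7*3:]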

print("X_train shape: ",X_train.shape)
print("y_train shape: ",y_train.shape)
print("X_test shape: ",X_test.shape)
print("y_test shape: ",y_test.shape)

print("X_train_final shape: ",X_train_final.shape)
print("y_train_final shape: ",y_train_final.shape)

X_train shape:  (504, 8)
y_train shape:  (504,)
X_test shape:  (168, 8)
y_test shape:  (168,)
X_train_final shape:  (672, 8)
y_train_final shape:  (672,)
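
Because `features_array` is filled day by day (all Mondays of the four weeks first, then all Tuesdays, and so on), the first 504 rows are not the rows of weeks 17 to 19. A hedged sketch of a split that does hold out week 20 uses a boolean mask on the WEEK column (column 0). The array `demo` below is a made-up stand-in with the same column layout, and `is_validation_week` is a name chosen for this example.

```python
import numpy as np

# Made-up stand-in for features_array. Columns:
# WEEK, HOUR, 7 day flags, NUM_ORDERS (order counts are random)
rng = np.random.default_rng(0)
demo = np.zeros((96, 10))
demo[:, 0] = np.repeat([17, 18, 19, 20], 24)   # 4 weeks x 24 hours of one day
demo[:, 1] = np.tile(np.arange(24), 4)
demo[:, 2] = 1                                 # pretend every row is a Monday
demo[:, 9] = rng.integers(0, 500, size=96)

# Select rows by week instead of by position
is_validation_week = demo[:, 0] == 20

X_train = demo[~is_validation_week, 1:9]
y_train = demo[~is_validation_week, 9]
X_test = demo[is_validation_week, 1:9]
y_test = demo[is_validation_week, 9]

print(X_train.shape, X_test.shape)
```

With the real array the shapes would stay at 504 and 168 rows, but the test rows would all come from week 20 regardless of how the array is ordered.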


**Linear Regression**

Once the features are ready, the linear regression can be fitted.

In [68]:
from sklearn import linear_model
import math

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((regr.predict(X_test) - y_test) ** 2))
print("Root mean squared error: %.2f"
     % math.sqrt(np.mean((regr.predict(X_test) - y_test) ** 2)))
# R^2 score (coefficient of determination): 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_test, y_test))

Coefficients: 
 [ 30.21397516 -32.55208333 -35.08333333 -16.11458333   4.10416667
  40.01041667  39.63541667   0.        ]
Mean squared error: 86358.78
Root mean squared error: 293.87
Variance score: 0.45
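
The three numbers above are the mean squared error, its square root, and the value returned by `regr.score`, which for `LinearRegression` is the coefficient of determination R². As a small sketch on made-up numbers (the arrays `y_true` and `y_pred` are hypothetical and unrelated to the order data), all three can be computed with NumPy alone:

```python
import numpy as np

# Hypothetical observed and predicted values
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

# Mean squared error and its square root
mse = np.mean((y_pred - y_true) ** 2)
rmse = np.sqrt(mse)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MSE: %.2f  RMSE: %.2f  R^2: %.3f" % (mse, rmse, r2))
```

R² has no lower bound: it becomes negative whenever the predictions are worse than always predicting the mean of the test targets.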


In [70]:
regr.predict(X_test)

array([ -27.83571429,    2.37826087,   32.59223602,   62.80621118,
         93.02018634,  123.23416149,  153.44813665,  183.6621118 ,
        213.87608696,  244.09006211,  274.30403727,  304.51801242,
        334.73198758,  364.94596273,  395.15993789,  425.37391304,
        455.5878882 ,  485.80186335,  516.01583851,  546.22981366,
        576.44378882,  606.65776398,  636.87173913,  667.08571429,
        -27.83571429,    2.37826087,   32.59223602,   62.80621118,
         93.02018634,  123.23416149,  153.44813665,  183.6621118 ,
        213.87608696,  244.09006211,  274.30403727,  304.51801242,
        334.73198758,  364.94596273,  395.15993789,  425.37391304,
        455.5878882 ,  485.80186335,  516.01583851,  546.22981366,
        576.44378882,  606.65776398,  636.87173913,  667.08571429,
        -27.83571429,    2.37826087,   32.59223602,   62.80621118,
         93.02018634,  123.23416149,  153.44813665,  183.6621118 ,
        213.87608696,  244.09006211,  274.30403727,  304.51801

**Results:**

The root mean squared error is too high, so the results are not good so far; a different approach to the features is needed.
Negative values should not appear either, since an order count cannot be negative; they are further evidence that this model does not work.
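
The negative values follow from treating the hour as a single numeric feature: the model can only add a fixed amount per hour. Using two numbers printed above (the first prediction, -27.83571429, and the HOUR coefficient, 30.21397516), a short sketch reproduces the repeated block of 24 predictions as a straight line in the hour. The variable names are chosen for this example.

```python
import numpy as np

hour_coef = 30.21397516         # first coefficient printed by the model
pred_at_hour_0 = -27.83571429   # first value returned by regr.predict(X_test)

# A linear model with a numeric HOUR feature adds hour_coef per hour
hours = np.arange(24)
line = pred_at_hour_0 + hour_coef * hours

print(np.round(line[:4], 2))
print(round(float(line[-1]), 2))
```

A straight line of this kind cannot follow demand that falls and rises again during the day, which is one motivation for encoding each hour separately.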

**Prepare features again:**

Let's modify the features. The next idea is to encode the hours as 24 binary features, as we did with the days of the week.

In [71]:
# create array for new features with 23 extra columns
features2_array = np.zeros(shape=(672,33))
# column names would be (ordered)
# WEEK,IS_MON,IS_TUE,IS_WED,IS_THU,IS_FRI,IS_SAT,IS_SUN,IS_0,IS_1,IS_2,IS_3,IS_4,IS_5,IS_6,IS_7,IS_8,IS_9,IS_10,
# IS_11,IS_12,IS_13,IS_14,IS_15,IS_16,IS_17,IS_18,IS_19,IS_20,IS_21,IS_22,IS_23,NUM_ORDERS

# source columns in features_array (ordered):
# WEEK,HOUR,IS_MONDAY,IS_TUESDAY,IS_WEDNESDAY,IS_THURSDAY,IS_FRIDAY,IS_SATURDAY,IS_SUNDAY,NUM_ORDERS

index=0
# loop over every hour of the day, so rows end up ordered by hour first
for hour in range(24):
    # rows of the first features array that belong to this hour
    for row in features_array[features_array[:,1]==hour]:
        # set week
        features2_array[index,0]=row[0]
        # set days binary info
        features2_array[index,1:8]=row[2:9]
        # set hour info
        features2_array[index,8+hour]=1
        # set num_orders info
        features2_array[index,32]=row[9]
        index=index+1

print(features2_array)
print(features2_array[20,:])
print(features2_array[220,:])
print(features2_array[250,:])
print(features2_array[671,:])

[[  17.    1.    0. ...,    0.    0.   63.]
 [  18.    1.    0. ...,    0.    0.  105.]
 [  19.    1.    0. ...,    0.    0.  141.]
 ..., 
 [  18.    0.    0. ...,    0.    1.  431.]
 [  19.    0.    0. ...,    0.    1.  390.]
 [  20.    0.    0. ...,    0.    1.  375.]]
[  17.    0.    0.    0.    0.    0.    1.    0.    1.    0.    0.    0.
    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
    0.    0.    0.    0.    0.    0.    0.    0.  111.]
[ 17.   0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.
   1.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
   0.   0.  24.]
[ 19.   0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.
   0.   1.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
   0.   0.  24.]
[  20.    0.    0.    0.    0.    0.    0.    1.    0.    0.    0.    0.
    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
    0.    0.    0.    0.    0.    0.    0.

**Prepare X and y features:**

In [72]:
# prepare final training sets including all data
X_train_final = features2_array[:,1:32]
y_train_final = features2_array[:,32]

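# training part of the validation split: the first 504 of the 672 rows
# NOTE: features2_array is ordered by hour first, so this positional slice
# holds hours 0-17 and does not isolate weeks 17-19
X_train = X_train_final[0:24*7*3,:]
y_train = y_train_final[0:24*7*3]
# test part of the validation split: the remaining 168 rows (hours 18-23)
X_test = X_train_final[24*7*3:,:]
y_test = y_train_final[24*7*3:]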

print("X_train shape: ",X_train.shape)
print("y_train shape: ",y_train.shape)
print("X_test shape: ",X_test.shape)
print("y_test shape: ",y_test.shape)

print("X_train_final shape: ",X_train_final.shape)
print("y_train_final shape: ",y_train_final.shape)

X_train shape:  (504, 31)
y_train shape:  (504,)
X_test shape:  (168, 31)
y_test shape:  (168,)
X_train_final shape:  (672, 31)
y_train_final shape:  (672,)


**Linear regression:**

In [73]:
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((regr.predict(X_test) - y_test) ** 2))
print("Root mean squared error: %.2f"
     % math.sqrt(np.mean((regr.predict(X_test) - y_test) ** 2)))
# R^2 score (coefficient of determination): 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_test, y_test))

Coefficients: 
 [ -1.85763931e+15  -1.85763931e+15  -1.85763931e+15  -1.85763931e+15
  -1.85763931e+15  -1.85763931e+15  -1.85763931e+15  -5.29959101e+13
  -5.29959101e+13  -5.29959101e+13  -5.29959101e+13  -5.29959101e+13
  -5.29959101e+13  -5.29959101e+13  -5.29959101e+13  -5.29959101e+13
  -5.29959101e+13  -5.29959101e+13  -5.29959101e+13  -5.29959101e+13
  -5.29959101e+13  -5.29959101e+13  -5.29959101e+13  -5.29959101e+13
  -5.29959101e+13   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00]
Mean squared error: 2808566487945819987141197824.00
Root mean squared error: 52995910105835.71
Variance score: -17371868258535900971008.00


The results are even worse; this way of handling the binary features is not the way to go.
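
One possible contributor to coefficients of this magnitude, offered as a hypothesis rather than a diagnosis of this run: the seven day flags sum to 1 on every row, and so do the 24 hour flags, so together with the intercept the columns are linearly dependent and least squares has no unique solution. A small sketch on a synthetic one-hot matrix (not the notebook's data) shows the rank deficiency, and that dropping one flag per group restores full column rank.

```python
import numpy as np

# Synthetic design matrix: every (day, hour) combination exactly once
days, hours = np.meshgrid(np.arange(7), np.arange(24), indexing='ij')
day_flags = np.eye(7)[days.ravel()]       # shape (168, 7)
hour_flags = np.eye(24)[hours.ravel()]    # shape (168, 24)
intercept = np.ones((168, 1))

# All flags plus intercept versus one flag dropped per group
X_full = np.hstack([intercept, day_flags, hour_flags])
X_reduced = np.hstack([intercept, day_flags[:, 1:], hour_flags[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("full:    %d columns, rank %d" % (X_full.shape[1], rank_full))
print("reduced: %d columns, rank %d" % (X_reduced.shape[1], rank_reduced))
```

The six trailing zero coefficients are also consistent with the ordering of `features2_array`: it is filled hour by hour, so the first 504 rows contain only hours 0 to 17 and the flags for hours 18 to 23 are zero on every training row.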

I would now go back and use day-of-week tokens ('MON','TUE',...,'SUN') and hour tokens ('0','1',...,'23') as features. These strings could then be vectorized with Python tools such as sklearn.feature_extraction.DictVectorizer, to see whether that fixes the surprisingly bad results.
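
A minimal sketch of what that could look like, assuming one dictionary per observation with hypothetical keys `'day'` and `'hour'` (the four records below are invented for illustration):

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical token records: one dict per (day, hour) observation
records = [
    {'day': 'MON', 'hour': '0'},
    {'day': 'MON', 'hour': '1'},
    {'day': 'TUE', 'hour': '0'},
    {'day': 'SUN', 'hour': '23'},
]

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix
vectorizer = DictVectorizer(sparse=False)
X_tokens = vectorizer.fit_transform(records)

# Each distinct "key=value" pair becomes one binary column
print(vectorizer.feature_names_)
print(X_tokens)
```

`DictVectorizer` one-hot encodes string values, so the resulting matrix has the same 0/1 structure as the flags built by hand above; what it mainly saves is the manual bookkeeping of column positions.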

There's no time for more in this task.