# Introduction

Stuart is the leading on-demand solution powering the way goods are transported in a
customised way. We connect businesses across all industries and of all sizes to high quality
independent couriers to offer customised delivery solutions.

As a data analyst at Stuart, your role is to abstract the complexity of the business and the
underlying data to provide meaningful insights to decision makers. To assess your suitability
for the role, we’re providing you with a realistic data set and questions similar to what you
may encounter in day-to-day work at Stuart.


# Data
The data set attached consists of records of the principal business events generated from
the moment a package appears in our system until it is delivered to its final destination.

There are 22 tables in the data set recording events around 5 principal objects in Stuart’s
universe: packages, deliveries, tasks, drivers and invitations.

- Packages are the physical articles that need to be delivered and can vary in size,
type and urgency
- Deliveries represent the job a driver needs to perform to deliver one or more
packages
- Tasks are the different segments of a delivery. For example: a simple delivery
consists of two tasks: a pickup task and a drop off task
- Drivers are the persons performing the deliveries and can do so using different
transport types
- Invitations are the outcome of our dispatching algorithms to connect free Drivers to
Deliveries

The data is provided in SQLite format and can be queried using any of the free SQLite
clients such as DB Browser or TablePlus.

Please note that no data dictionary is provided and you’ll need to make sense of the data
based on the table names and the above information.
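
Since no data dictionary is provided, a first step can be to list every table and its columns directly from the database. The following is a minimal sketch using Python's built-in `sqlite3` module; the helper name `describe_tables` and the two toy tables are illustrative assumptions, and in practice the connection would point at the provided SQLite file.

```python
import sqlite3


def describe_tables(conn):
    """Return {table_name: [column names]} for every user table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info yields one row per column:
        # (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [col[1] for col in columns]
    return schema


# Toy in-memory database standing in for the real file (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE package (package_id INTEGER, zone_name TEXT)")
conn.execute("CREATE TABLE driver (driver_id INTEGER, created_at TEXT)")

schema = describe_tables(conn)
for table, cols in schema.items():
    print(table, cols)
```

Running this against the real file gives a quick inventory of the 22 tables before writing any analysis queries.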


## Question 1
What is the success rate of packages per Zone? (% of Packages delivered on time)

In [None]:
-- SQL query

SELECT
    pd.zone_id,
    p.zone_name,
    ROUND(SUM(CASE WHEN pd.created_at <= time_window_do_end THEN 1 ELSE 0 END) * 100.0 / COUNT(p.created_at), 1) AS success_rate
FROM package p
INNER JOIN package_delivered pd
    ON pd.package_id = p.package_id
GROUP BY pd.zone_id, p.zone_name
ORDER BY success_rate DESC
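
One definitional choice in this metric is whether packages that were never delivered count toward the denominator. The sketch below runs the same calculation on toy data with an inner join and with a left join to show the difference. The toy rows are invented for illustration, and placing `zone_name` and `time_window_do_end` on `package` is an assumption about the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE package (package_id INTEGER, zone_name TEXT, time_window_do_end TEXT);
CREATE TABLE package_delivered (package_id INTEGER, created_at TEXT);
INSERT INTO package VALUES
    (1, 'Paris', '2020-12-12 10:00:00'),
    (2, 'Paris', '2020-12-12 11:00:00'),
    (3, 'Lyon',  '2020-12-12 10:00:00'),
    (4, 'Paris', '2020-12-12 12:00:00');  -- never delivered
INSERT INTO package_delivered VALUES
    (1, '2020-12-12 09:50:00'),   -- on time
    (2, '2020-12-12 11:30:00'),   -- late
    (3, '2020-12-12 09:00:00');   -- on time
""")


def success_rate(conn, join_type):
    """Percent of packages delivered on time per zone, for a given join type."""
    query = f"""
        SELECT p.zone_name,
               ROUND(100.0 * SUM(CASE WHEN pd.created_at <= p.time_window_do_end
                                      THEN 1 ELSE 0 END) / COUNT(*), 1) AS success_rate
        FROM package p
        {join_type} JOIN package_delivered pd ON pd.package_id = p.package_id
        GROUP BY p.zone_name
        ORDER BY success_rate DESC
    """
    return conn.execute(query).fetchall()


inner_rows = success_rate(conn, "INNER")  # undelivered packages are excluded
left_rows = success_rate(conn, "LEFT")    # undelivered packages count as failures
print(inner_rows)
print(left_rows)
```

With the inner join, Paris is scored on its two delivered packages only; with the left join, the undelivered package lowers the rate. Which definition is right depends on how the business defines success.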

## Question 2
Is the driver invitation acceptance rate higher or lower among drivers who have been
on the platform for longer?

In [None]:
-- SQL query

SELECT
    ROUND(AVG(a.rate_accepted), 0) AS avg_rate_accepted,
    a.years_worked
FROM (
    SELECT
        d.driver_id,
        COUNT(ia.delivery_invitation_id) AS total_invites_accepted,
        COUNT(ic.delivery_invitation_id) AS total_invites,
        COUNT(ia.delivery_invitation_id) * 100.0 / COUNT(ic.delivery_invitation_id) AS rate_accepted,
        julianday('now') - julianday(d.created_at) AS days_on_platform,
        -- CASE returns the first matching branch, so each branch only needs an upper bound
        CASE
            WHEN julianday('now') - julianday(d.created_at) < 365 THEN '0-1 Year'
            WHEN julianday('now') - julianday(d.created_at) < 730 THEN '1-2 Years'
            WHEN julianday('now') - julianday(d.created_at) < 1095 THEN '2-3 Years'
            WHEN julianday('now') - julianday(d.created_at) < 1460 THEN '3-4 Years'
            WHEN julianday('now') - julianday(d.created_at) < 1825 THEN '4-5 Years'
            WHEN julianday('now') - julianday(d.created_at) < 2190 THEN '5-6 Years'
            WHEN julianday('now') - julianday(d.created_at) < 2555 THEN '6-7 Years'
            WHEN julianday('now') - julianday(d.created_at) >= 2555 THEN '7+ Years'
        END AS years_worked
    FROM driver d
    INNER JOIN invitation_created ic
        ON d.driver_id = ic.driver_id
    LEFT JOIN invitation_accepted ia
        ON ic.delivery_invitation_id = ia.delivery_invitation_id
    GROUP BY d.driver_id
) a
GROUP BY a.years_worked
ORDER BY a.years_worked ASC
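
The tenure buckets can also be derived with integer division rather than one `CASE` branch per year. The sketch below shows that alternative on toy data; the driver rows and the fixed `REFERENCE_DATE` are assumptions chosen so the result is reproducible. Measuring tenure against `'now'` on a historical data set shifts every driver into older buckets as time passes, so a fixed reference date (for example the latest event in the data) may be worth considering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE driver (driver_id INTEGER, created_at TEXT);
INSERT INTO driver VALUES
    (1, '2020-06-01'),
    (2, '2019-03-15'),
    (3, '2016-01-01');
""")

# Assumed reference date; not taken from the original data set
REFERENCE_DATE = "2020-12-31"

tenure_rows = conn.execute(
    """
    SELECT driver_id,
           -- whole 365-day years between sign-up and the reference date
           CAST((julianday(?) - julianday(created_at)) / 365 AS INTEGER) AS full_years
    FROM driver
    ORDER BY driver_id
    """,
    (REFERENCE_DATE,),
).fetchall()
print(tenure_rows)
```

Grouping the acceptance rate by `full_years` gives the same kind of breakdown as the labelled buckets, and the numeric column sorts naturally.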

## Question 3

What might be other relevant metrics to track? (Propose 2 metrics at least, and explain how they can be computed from the provided data).

In [None]:
-- SQL query

-- Metric 1: Average Delivery Time

SELECT
    -- julianday() differences are in days; multiply by 24 * 60 to get minutes
    ROUND(AVG((JULIANDAY(ts.created_at) - JULIANDAY(t.created_at)) * 24 * 60), 0) AS Average_Delivery_Time_Minutes
FROM task t
INNER JOIN task_succeeded ts
    ON t.task_id = ts.task_id
WHERE type = 'DropoffTask'
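
As a quick sanity check on the units: `julianday()` returns a day count, so the difference between two timestamps is a fraction of a day, and there are 24 * 60 = 1440 minutes in a day. The sketch below verifies the conversion on two literal timestamps chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

MINUTES_PER_DAY = 24 * 60  # julianday() differences are expressed in days

minutes = conn.execute(
    """
    SELECT ROUND(
        (julianday('2020-12-12 10:45:00') - julianday('2020-12-12 10:00:00')) * ?,
        1
    )
    """,
    (MINUTES_PER_DAY,),
).fetchone()[0]
print(minutes)  # 45.0
```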

-- Metric 2: % Online Currently

SELECT
    ROUND(SUM(off_on) * 100.0 / COUNT(driver_id), 0) || '%' AS Percent_Online_Now
FROM (
    SELECT DISTINCT driver_id, created_at, 1 AS off_on
    FROM driver_online
    UNION
    SELECT DISTINCT driver_id, created_at, 0 AS off_on
    FROM driver_offline
) a
WHERE created_at BETWEEN datetime('now', 'start of day') AND datetime('now', 'localtime')

-- Adjusted query that evaluates the metric at a fixed point in time.
-- 2020-12-12 is only an example date used to showcase results; the live metric would use
-- the current time, and users would refresh the dashboard to see the current status.
SELECT
    ROUND(SUM(off_on) * 100.0 / COUNT(driver_id), 0) || '%' AS Percent_Online_Now
FROM (
    SELECT DISTINCT driver_id, created_at, 1 AS off_on
    FROM driver_online
    UNION
    SELECT DISTINCT driver_id, created_at, 0 AS off_on
    FROM driver_offline
) a
WHERE created_at BETWEEN datetime('2020-12-12', 'start of day') AND '2020-12-12 06:17:07.616000'
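
The queries above compute the share of online events among all status events in the window, so a driver who toggles several times is counted several times. An alternative reading of "currently online" is each driver's most recent status as of a given moment. The sketch below illustrates that variant on toy data; the rows and the `as_of` timestamp are invented, and it assumes `driver_online` and `driver_offline` each hold one row per status change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE driver_online (driver_id INTEGER, created_at TEXT);
CREATE TABLE driver_offline (driver_id INTEGER, created_at TEXT);
INSERT INTO driver_online VALUES
    (1, '2020-12-12 05:00:00'),
    (2, '2020-12-12 05:10:00'),
    (3, '2020-12-12 05:30:00'),
    (1, '2020-12-12 06:00:00');
INSERT INTO driver_offline VALUES
    (1, '2020-12-12 05:40:00'),
    (2, '2020-12-12 06:05:00');
""")

query = """
WITH events AS (
    SELECT driver_id, created_at, 1 AS is_online FROM driver_online
    UNION ALL
    SELECT driver_id, created_at, 0 AS is_online FROM driver_offline
),
latest AS (
    -- keep only each driver's most recent event at or before :as_of
    SELECT e.driver_id, e.is_online
    FROM events e
    WHERE e.created_at = (
        SELECT MAX(e2.created_at)
        FROM events e2
        WHERE e2.driver_id = e.driver_id AND e2.created_at <= :as_of
    )
)
SELECT ROUND(100.0 * SUM(is_online) / COUNT(*), 1) FROM latest
"""

percent_online = conn.execute(query, {"as_of": "2020-12-12 06:17:07"}).fetchone()[0]
print(percent_online)
```

Here drivers 1 and 3 end the window online and driver 2 offline, so two of three drivers are online. The denominator is drivers with at least one status event up to `as_of`; using all registered drivers instead would be another reasonable choice.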


## Question 4

As a BI Analyst you are asked to create a new table in the DWH optimized for
Tableau consumption in order to build a dashboard to track driver performances. The
current event level data is too granular and unwieldy.
1.  Propose a new flat table schema to consolidate the relevant events,
2.  Write the necessary SQL to build that table,
3.  Explain the methodology you used.

In [None]:
CREATE TABLE model_driver_performance as
WITH CTE_BASE as (
    select
    distance,
    delivery_created_id,
    ad.driver_id,
    ad.task_id,
    package_id,
    delivery_id,
   
    (case when type = 'PICKUP' then 1 else 0 END) as num_pickup,
    (case when type = 'DROPOFF' then 1 else 0 END) as num_dropoff
    from
        assign_driver ad
        inner join delivery_created dc on ad.task_id = dc.task_id

       
),

cte_final as (
    select cb.*,
    (case when status = 'delivered' then 1 else 0 END) as delivered,
    (case when status = 'not_delivered' then 1 else 0 END) as not_delivered
    from
        CTE_BASE CB
        inner join package p on cb.package_id = p.package_id
       
)


-- One row per driver: only driver_id and aggregates are selected, so every
-- column is well defined under GROUP BY driver_id
select
driver_id,
sum(distance) as total_distance,
sum(delivered) as total_delivered,
sum(not_delivered) as total_not_delivered,
(select sum(delivered) from cte_final) as total_delivery_Stuart,
sum(num_pickup) as total_pickup,
sum(num_dropoff) as total_dropoff
from cte_final
group by driver_id
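
A flat table for dashboarding is easiest to reason about when its grain is explicit, here one row per driver. The sketch below builds a small flat table with conditional aggregation and then checks the grain. The tables `toy_driver_task` and `toy_driver_performance` and their columns are hypothetical stand-ins, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical event-level input: one row per task assigned to a driver
CREATE TABLE toy_driver_task (driver_id INTEGER, task_id INTEGER, type TEXT, distance REAL);
INSERT INTO toy_driver_task VALUES
    (1, 100, 'PICKUP',  1.5),
    (1, 101, 'DROPOFF', 2.0),
    (2, 102, 'PICKUP',  0.7);

-- Flat table: collapse the events to one row per driver
CREATE TABLE toy_driver_performance AS
SELECT driver_id,
       COUNT(*) AS total_tasks,
       SUM(CASE WHEN type = 'PICKUP'  THEN 1 ELSE 0 END) AS total_pickup,
       SUM(CASE WHEN type = 'DROPOFF' THEN 1 ELSE 0 END) AS total_dropoff,
       SUM(distance) AS total_distance
FROM toy_driver_task
GROUP BY driver_id;
""")

flat_rows = conn.execute(
    "SELECT * FROM toy_driver_performance ORDER BY driver_id"
).fetchall()

# Grain check: the row count must equal the number of distinct drivers
row_count, driver_count = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT driver_id) FROM toy_driver_performance"
).fetchone()
print(flat_rows)
print(row_count == driver_count)
```

The same grain check can be run against the real flat table after each rebuild to catch joins that fan out rows.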

## Question 5

The data provided is missing one critical component: weather data.
Weather plays a critical role in the availability of drivers as well as in the success of
deliveries. Using any of the free weather APIs available (e.g. DarkSky, Weatherbit,
AccuWeather), collect weather data for the locations and dates of the provided
deliveries, and propose a new table schema to store them and make them accessible to
all analysts at Stuart.

In [None]:
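import os

import requests
import json
import pandas as pd
from datetime import datetime

# Static variables
# Read the API key from the environment instead of hard-coding it in the notebook.
# OPENWEATHERMAP_API_KEY is an assumed variable name; use whatever your setup provides.
api_key = os.environ["OPENWEATHERMAP_API_KEY"]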
# [latitude, longitude]; longitudes west of Greenwich are negative
locations = {
    'Paris': [48.856613, 2.352222],
    'Barcelona': [41.3874, 2.1686],
    'Brighton': [50.8225, -0.1372],
    'Lyon': [45.7640, 4.8357],
    'Madrid': [40.4168, -3.7038],
    'Manchester': [53.4804, -2.2426],
}
current_conditions = []

# Functions
def extract_weather(api_key, lat, lon):
    """Extract current weather data for a single location.

    Args:
        api_key: Your personal API key
        lat: Latitude
        lon: Longitude
    Returns:
        The "current" block of the API response, as a dict
    """
    url = "https://api.openweathermap.org/data/2.5/onecall?lat=%s&lon=%s&appid=%s&units=metric" % (lat, lon, api_key)
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail on HTTP errors rather than on a missing "current" key
    data = json.loads(response.text)
    return data["current"]

def extract_city_weather():
    for city, (lat, lon) in locations.items():
        # The timestamp allows filtering by date; the City column allows joining
        # to other tables through zone_name/zone_id
        temp = {'City': city, 'Date': datetime.timestamp(datetime.now())}
        temp.update(extract_weather(api_key, lat, lon))
        current_conditions.append(temp)

extract_city_weather()


df = pd.DataFrame(current_conditions)
print(df)

# The query will be run daily. Initially no historical data will be present; it will build up over time.
# Users can join to other tables through the City column via zone_name/zone_id.
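
To make the collected data available to other analysts, the records could be written to a table alongside the delivery data. The sketch below is one possible schema, not a definitive design: the table name `weather_observation`, its columns, and the helper `store_observation` are assumptions, and the sample record only mimics the shape of the `current` block (`dt`, `temp`, `humidity`, `wind_speed`, `weather`) returned by the API call above.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # in practice, the shared warehouse or SQLite file
conn.execute("""
CREATE TABLE IF NOT EXISTS weather_observation (
    city          TEXT NOT NULL,   -- joins to zone_name
    observed_at   TEXT NOT NULL,   -- UTC timestamp, 'YYYY-MM-DD HH:MM:SS'
    temp_c        REAL,
    humidity_pct  REAL,
    wind_speed_ms REAL,
    conditions    TEXT,
    PRIMARY KEY (city, observed_at)
)
""")


def store_observation(conn, city, current):
    """Insert one record shaped like the API's `current` block."""
    observed_at = datetime.fromtimestamp(current["dt"], tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S"
    )
    weather = current.get("weather") or [{}]
    conn.execute(
        "INSERT OR REPLACE INTO weather_observation VALUES (?, ?, ?, ?, ?, ?)",
        (
            city,
            observed_at,
            current.get("temp"),
            current.get("humidity"),
            current.get("wind_speed"),
            weather[0].get("main"),
        ),
    )


# Sample record with invented values, for illustration only
sample = {
    "dt": 1607753827,
    "temp": 7.2,
    "humidity": 81,
    "wind_speed": 3.1,
    "weather": [{"main": "Clouds"}],
}
store_observation(conn, "Paris", sample)
stored = conn.execute("SELECT * FROM weather_observation").fetchall()
print(stored)
```

The composite primary key on `(city, observed_at)` keeps repeated daily runs from creating duplicate rows, and storing timestamps in UTC keeps them comparable with the event tables if those are also in UTC.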