# Fall 2021 Shopify Data Science Intern Challenge

Charles Qian [GitHub](https://github.com/charlescqian/shopify-data-science)

May 9, 2021

The given data has been downloaded into a CSV file called data.csv, which we load into a pandas DataFrame.

In [3]:
import pandas as pd

df = pd.read_csv('data.csv')

## a. Think about what could be going wrong with our calculation. Think about a better way to evaluate this data. 

First we confirm the given AOV of $3145.13.

In [4]:
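# Number of orders is the number of rows in the dataframe
no_orders = len(df)
# Use the vectorized pandas sum instead of Python's built-in sum
total_order_amount = df['order_amount'].sum()
aov = total_order_amount / no_orders

print(f'Average Order Value: ${round(aov,2)}')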

Average Order Value: $3145.13


A quick look at the data shows that most orders are for small quantities of 1-5 items, with order amounts in the range of a few hundred dollars. However, there are a few very large orders in which 2000 units are ordered at once, each with an order value of \$704,000. These all come from a single customer, user_id 607, and are likely wholesale orders. There is also a very expensive item that costs \$25,725 per unit. These two groups of orders are clear outliers, and they are why the AOV is higher than expected.
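Rather than spotting these outliers by eye, one option is to group the orders by amount and look at the largest values together with a derived per-unit price. The following is a minimal sketch, assuming the columns user_id, order_amount, and total_items used elsewhere in this notebook. The function name summarize_outliers, the unit_price column, and the small toy dataframe are hypothetical and for illustration only; the toy rows are not the challenge data.

```python
import pandas as pd

def summarize_outliers(orders: pd.DataFrame, top_n: int = 3) -> pd.DataFrame:
    """Return the top_n largest distinct order amounts and how often each occurs."""
    out = orders.copy()
    # Hypothetical helper column: price per unit, derived from the order total.
    out["unit_price"] = out["order_amount"] / out["total_items"]
    summary = (
        out.groupby(["order_amount", "total_items", "unit_price"])
        .agg(n_orders=("user_id", "size"), n_users=("user_id", "nunique"))
        .reset_index()
        .sort_values("order_amount", ascending=False)
        .head(top_n)
        .reset_index(drop=True)
    )
    return summary

# Made-up rows for illustration only (not the challenge data).
toy = pd.DataFrame({
    "user_id": [607, 607, 1, 2, 3],
    "order_amount": [704000, 704000, 25725, 224, 306],
    "total_items": [2000, 2000, 1, 2, 2],
})
summary = summarize_outliers(toy)
print(summary)
```

On the real data, the same call would be `summarize_outliers(df)`. A large order_amount with a modest unit_price points to a bulk order, while a large unit_price points to an expensive item.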

A better way to report this metric would be to exclude the wholesale orders and calculate the AOV for retail orders only. We can do this either by excluding all orders from user_id 607 or by setting a limit on what counts as a reasonably sized retail order.

We have to be careful when defining this limit because of the item that costs \$25,725 per unit. Should the limit be based on the value of the order (order_amount) or on the number of items in it (total_items)? For now, let's define any order with an order_amount over \$10,000 as wholesale. Note that this threshold also excludes every order of the \$25,725 item, even a single-unit one.
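To make the difference between the two candidate definitions concrete, here is a small sketch that applies each one to a toy dataframe. The rows are made up for illustration and are not the challenge data, the helper name aov is hypothetical, and the cap of 10 items is an assumption chosen only for this example; the \$10,000 cap is the one used in this notebook.

```python
import pandas as pd

def aov(orders: pd.DataFrame) -> float:
    """Average order value: total order_amount divided by the number of orders."""
    return orders["order_amount"].sum() / len(orders)

# Made-up rows for illustration only (not the challenge data).
toy = pd.DataFrame({
    "user_id": [607, 1, 2, 3],
    "order_amount": [704000, 25725, 224, 306],
    "total_items": [2000, 1, 2, 2],
})

# Definition A: cap on order value (the $10,000 limit used in this notebook).
by_amount = toy[toy["order_amount"] <= 10000]
# Definition B: cap on item count; the limit of 10 items is an assumption.
by_items = toy[toy["total_items"] <= 10]

print(f"AOV, order_amount <= 10000: {aov(by_amount):.2f}")
print(f"AOV, total_items <= 10: {aov(by_items):.2f}")
```

In this toy example the value cap drops the single-unit order of the expensive item, while the item cap keeps it, so the two definitions give very different averages.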

In [26]:
# Retail orders: order_amount at or below the $10,000 threshold
retail_df = df[df['order_amount'] <= 10000]

total_retail_order_amount = retail_df['order_amount'].sum()
no_retail_orders = len(retail_df)

retail_aov = total_retail_order_amount / no_retail_orders

print(f'Average Retail Order Value: ${round(retail_aov,2)}')

Average Retail Order Value: $302.58


As shown above, excluding the two groups of outliers gives a much more reasonable average order value for sneaker stores of about $302.58.

## b. What metric would you report for this dataset? 

I believe a very useful metric for this dataset is the percentage of monthly revenue that comes from retail sales versus wholesale. It helps the business operator allocate resources such as marketing or customer relationship management. For example, if they want more of their revenue to come from retail sales, they can increase spending on marketing and promotions. If, on the other hand, they rely on wholesale for the majority of their revenue, they may prefer to improve customer relationship management so that wholesale customers keep coming back.

## c. What is its value? 

To compute the % of monthly revenue from retail sales vs wholesale, we reuse the earlier definition: any order with an order_amount over $10,000 is categorized as wholesale.

Since total_retail_order_amount and total_order_amount have already been computed, the metric is a single division.

In [29]:
perc_revenue_retail = total_retail_order_amount / total_order_amount

print(f'% Of Monthly Revenue from Retail: {round(perc_revenue_retail*100,2)}%')
print(f'% Of Monthly Revenue from Wholesale: {round((1-perc_revenue_retail)*100,2)}%')

% Of Monthly Revenue from Retail: 9.5%
% Of Monthly Revenue from Wholesale: 90.5%


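The vast majority of revenue (90.5%) comes from orders classified as wholesale, so the business operator should likely invest in maintaining relationships with their wholesale buyers. Note that, because the classification is based on order_amount, the wholesale share also includes orders of the $25,725 item, which are not necessarily bulk purchases.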