# Shopify Summer 2022 Data Science Intern Challenge

## Question 1

On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.

### Part a. 

Think about what could be going wrong with our calculation. Think about a better way to evaluate this data. 

### Part a. Solution

Before looking at the dataset, I can think of several things that could be going wrong with the calculation:

- a sum of order values rather than an average
- an average of the wrong metric (perhaps average spent by unique users or average earned by unique shops)
- a misplaced decimal value
- outliers raising the average
- incorrect values in the dataset

Now it's time to get into the data to try to validate some of these theories.

In [1]:
# imports
import pandas as pd

In [2]:
# data
file = "2019 Winter Data Science Intern Challenge Data Set - Sheet1.csv"
df = pd.read_csv(file)

In [3]:
# dataset statistics
df.describe()

| | order_id | shop_id | user_id | order_amount | total_items |
| --- | --- | --- | --- | --- | --- |
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 2500.5 | 50.0788 | 849.0924 | 3145.128 | 8.7872 |
| std | 1443.520003 | 29.006118 | 87.798982 | 41282.539349 | 116.32032 |
| min | 1.0 | 1.0 | 607.0 | 90.0 | 1.0 |
| 25% | 1250.75 | 24.0 | 775.0 | 163.0 | 1.0 |
| 50% | 2500.5 | 50.0 | 849.0 | 284.0 | 2.0 |
| 75% | 3750.25 | 75.0 | 925.0 | 390.0 | 3.0 |
| max | 5000.0 | 100.0 | 999.0 | 704000.0 | 2000.0 |


The summary statistics confirm that the mean of the order_amount column is indeed 3145.13. However, the maximum order_amount is 704000 and the standard deviation (41282.54) is far larger than the mean, while the median is only 284. This suggests that one of the last two possibilities is the most likely cause of the inflated AOV: either outliers are raising the average or the dataset contains incorrect values.
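
The gap between a typical order and the largest one can be made concrete with Tukey's 1.5 × IQR rule. This rule is a common screening convention rather than part of the original analysis, and the quartiles below are copied by hand from the df.describe() output above.

```python
# Quartiles and maximum of order_amount, copied from df.describe().
q1, q3 = 163.0, 390.0
max_order = 704000.0

# Tukey's rule (an assumed convention, not the author's method):
# values above Q3 + 1.5 * IQR are flagged as potential outliers.
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

print(f"IQR = {iqr}, upper fence = {upper_fence}")
print(f"max order is about {max_order / upper_fence:.0f}x the upper fence")
```

A fence like this only flags orders for a closer look; it does not by itself show that they are wrong.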


Now it's time to take a look at some of the largest orders:

In [4]:
# investigating large orders
df.sort_values(by='order_amount', ascending=False).head(65)

| | order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2153 | 2154 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-12 4:00:00 |
| 3332 | 3333 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-24 4:00:00 |
| 520 | 521 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-02 4:00:00 |
| 1602 | 1603 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-17 4:00:00 |
| 60 | 61 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-04 4:00:00 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1419 | 1420 | 78 | 912 | 25725 | 1 | cash | 2017-03-30 12:23:43 |
| 3440 | 3441 | 78 | 982 | 25725 | 1 | debit | 2017-03-19 19:02:54 |
| 1204 | 1205 | 78 | 970 | 25725 | 1 | credit_card | 2017-03-17 22:32:21 |
| 1364 | 1365 | 42 | 797 | 1760 | 5 | cash | 2017-03-10 6:28:21 |


Looking further into the large order_amount values, two shops and one user stand out: shops 42 and 78
account for all of the largest orders, and user 607 placed every one of the 704000 orders. User 607
purchases only from shop 42, whereas shops 42 and 78 each sell to multiple users.

Now it's time to investigate this user and these shops.

In [5]:
# investigating user id 607
user_607 = df[df['user_id'] == 607]
user_607.describe()

| | order_id | shop_id | user_id | order_amount | total_items |
| --- | --- | --- | --- | --- | --- |
| count | 17.0 | 17.0 | 17.0 | 17.0 | 17.0 |
| mean | 2336.235294 | 42.0 | 607.0 | 704000.0 | 2000.0 |
| std | 1603.584872 | 0.0 | 0.0 | 0.0 | 0.0 |
| min | 16.0 | 42.0 | 607.0 | 704000.0 | 2000.0 |
| 25% | 1363.0 | 42.0 | 607.0 | 704000.0 | 2000.0 |
| 50% | 2154.0 | 42.0 | 607.0 | 704000.0 | 2000.0 |
| 75% | 3333.0 | 42.0 | 607.0 | 704000.0 | 2000.0 |
| max | 4883.0 | 42.0 | 607.0 | 704000.0 | 2000.0 |


In [6]:
user_607.head()

| | order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 15 | 16 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 4:00:00 |
| 60 | 61 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-04 4:00:00 |
| 520 | 521 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-02 4:00:00 |
| 1104 | 1105 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-24 4:00:00 |
| 1362 | 1363 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-15 4:00:00 |


User 607 placed 17 orders, all from shop 42. Each order was for 2000 items at a total of 704000,
which implies that shop 42's single shoe model costs 704000 / 2000 = 352.

In [7]:
# investigating shop id 42
shop_42 = df[df['shop_id'] == 42]
shop_42.describe()

| | order_id | shop_id | user_id | order_amount | total_items |
| --- | --- | --- | --- | --- | --- |
| count | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 |
| mean | 2441.921569 | 42.0 | 758.588235 | 235101.490196 | 667.901961 |
| std | 1484.456801 | 0.0 | 125.993044 | 334860.641587 | 951.308641 |
| min | 16.0 | 42.0 | 607.0 | 352.0 | 1.0 |
| 25% | 1366.5 | 42.0 | 607.0 | 352.0 | 1.0 |
| 50% | 2154.0 | 42.0 | 770.0 | 704.0 | 2.0 |
| 75% | 3801.0 | 42.0 | 863.5 | 704000.0 | 2000.0 |
| max | 4883.0 | 42.0 | 975.0 | 704000.0 | 2000.0 |


In [8]:
shop_42.head()

| | order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 15 | 16 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 4:00:00 |
| 40 | 41 | 42 | 793 | 352 | 1 | credit_card | 2017-03-24 14:15:41 |
| 60 | 61 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-04 4:00:00 |
| 308 | 309 | 42 | 770 | 352 | 1 | credit_card | 2017-03-11 18:14:39 |
| 409 | 410 | 42 | 904 | 704 | 2 | credit_card | 2017-03-04 14:32:58 |


The orders from shop 42 show that the item price of 352 is consistent across orders. Although 352 is a plausible price for a shoe, 2000 items is an implausibly large quantity for a single Shopify order, which leads me to believe that the orders from user 607 may be illegitimate. If possible, I would ask for this user to be investigated further; for this analysis, it seems best to remove the user's orders, as they are most likely illegitimate.
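
The price check can be sketched in a few lines. The rows are the five shop 42 orders shown by shop_42.head() above, typed in by hand, and the helper name unit_prices is hypothetical.

```python
# (user_id, order_amount, total_items) for the shop 42 rows shown above.
shop_42_rows = [
    (607, 704000, 2000),
    (793, 352, 1),
    (607, 704000, 2000),
    (770, 352, 1),
    (904, 704, 2),
]


def unit_prices(rows):
    """Return the set of distinct per-item prices implied by the rows."""
    return {amount / items for _, amount, items in rows}


prices = unit_prices(shop_42_rows)
print(prices)
```

A single value in the result means every displayed order implies the same per-item price. On the full DataFrame, the equivalent would be dividing the order_amount column by the total_items column and comparing the result within each shop_id.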

In [9]:
# investigating shop id 78
shop_78 = df[df['shop_id'] == 78]
shop_78.describe()

| | order_id | shop_id | user_id | order_amount | total_items |
| --- | --- | --- | --- | --- | --- |
| count | 46.0 | 46.0 | 46.0 | 46.0 | 46.0 |
| mean | 2663.021739 | 78.0 | 867.73913 | 49213.043478 | 1.913043 |
| std | 1338.52002 | 0.0 | 81.314871 | 26472.227449 | 1.029047 |
| min | 161.0 | 78.0 | 707.0 | 25725.0 | 1.0 |
| 25% | 1428.25 | 78.0 | 812.5 | 25725.0 | 1.0 |
| 50% | 2796.5 | 78.0 | 866.5 | 51450.0 | 2.0 |
| 75% | 3720.25 | 78.0 | 935.75 | 51450.0 | 2.0 |
| max | 4919.0 | 78.0 | 997.0 | 154350.0 | 6.0 |


In [10]:
shop_78.head()

| | order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 160 | 161 | 78 | 990 | 25725 | 1 | credit_card | 2017-03-12 5:56:57 |
| 490 | 491 | 78 | 936 | 51450 | 2 | debit | 2017-03-26 17:08:19 |
| 493 | 494 | 78 | 983 | 51450 | 2 | cash | 2017-03-16 21:39:35 |
| 511 | 512 | 78 | 967 | 51450 | 2 | cash | 2017-03-09 7:23:14 |
| 617 | 618 | 78 | 760 | 51450 | 2 | cash | 2017-03-18 11:18:42 |


Shop 78's item is consistently priced at 25725 throughout the data. This price seems very high, but the
reason is difficult to deduce: it could be a legitimate price for a luxury shoe, or it could be a data
error (for example, a misplaced decimal in a true price of 257.25). Since this is a Shopify store selling
sneakers, it seems more likely than not that the price is incorrect, so it is probably best to remove the
shop's orders from the analysis.

In [11]:
# dataset without shop 78 and user 607
new_df = df[(df['shop_id'] != 78) & (df['user_id'] != 607)]
new_df.describe()

| | order_id | shop_id | user_id | order_amount | total_items |
| --- | --- | --- | --- | --- | --- |
| count | 4937.0 | 4937.0 | 4937.0 | 4937.0 | 4937.0 |
| mean | 2499.551347 | 49.846465 | 849.752279 | 302.580514 | 1.994734 |
| std | 1444.069407 | 29.061131 | 86.840313 | 160.804912 | 0.982821 |
| min | 1.0 | 1.0 | 700.0 | 90.0 | 1.0 |
| 25% | 1248.0 | 24.0 | 775.0 | 163.0 | 1.0 |
| 50% | 2497.0 | 50.0 | 850.0 | 284.0 | 2.0 |
| 75% | 3751.0 | 74.0 | 925.0 | 387.0 | 3.0 |
| max | 5000.0 | 100.0 | 999.0 | 1760.0 | 8.0 |


After removing user 607 and shop 78 (63 orders in total), the mean order_amount is 302.58, which is
far more reasonable than the original 3145.13.
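
As a sanity check, the filtered mean can be reproduced from the summary statistics alone, without re-reading the CSV. This is a sketch that assumes the two removed groups do not overlap, which holds here because user 607 only buys from shop 42. All figures are copied by hand from the describe() outputs above.

```python
# Figures copied from the describe() outputs above.
n_total, mean_total = 5000, 3145.128   # full dataset
n_607, amount_607 = 17, 704000         # user 607: 17 identical orders
n_78, mean_78 = 46, 49213.043478       # shop 78 (mean as rounded in the output)

total_sum = n_total * mean_total
# Assumes no order belongs to both groups (user 607 never buys from shop 78).
removed_sum = n_607 * amount_607 + n_78 * mean_78
n_kept = n_total - n_607 - n_78

filtered_mean = (total_sum - removed_sum) / n_kept
share_removed = removed_sum / total_sum

print(n_kept)
print(round(filtered_mean, 2))
print(f"{share_removed:.1%} of total order value came from the removed orders")
```

The same arithmetic shows how concentrated the problem is: the 63 removed orders are about 1.3% of the rows but carry roughly 90% of the total order value.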

### Part b.

What metric would you report for this dataset?

### Part b. Solution

Given the analysis above, and given that I cannot definitively determine the legitimacy of the suspicious
orders, a better metric to report for this dataset is the median order_amount. The median is less
susceptible to the influence of outliers than the mean and therefore better portrays the spending of a
typical customer in this dataset.
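
The robustness argument can be illustrated with a tiny sample. The five values below are the minimum, quartiles and maximum reported by df.describe(), used here only as an illustrative stand-in and not as the real distribution; the second list swaps the extreme value for 1760, the largest order left after filtering.

```python
from statistics import mean, median

# Illustrative five-order sample (min, Q1, median, Q3, max from df.describe()).
with_outlier = [90, 163, 284, 390, 704000]
# Same sample with the extreme order replaced by 1760.
without_outlier = [90, 163, 284, 390, 1760]

print(mean(with_outlier), median(with_outlier))
print(mean(without_outlier), median(without_outlier))
```

The mean moves by several orders of magnitude between the two samples, while the median stays at 284 in both.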

### Part c.

What is its value?

### Part c. Solution

In [12]:
df["order_amount"].median()

284.0

The median order_amount in the original dataset is 284.

In [13]:
new_df["order_amount"].median()

284.0

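The median order_amount in the filtered dataset (without user 607 or shop 78) is also 284. This is expected: every removed order lies far above the median, so dropping them barely shifts the middle of the distribution.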

## Question 2

For this question you’ll need to use SQL. Follow this link to access the data set required for the challenge. Please use queries to answer the following questions. Paste your queries along with your final numerical answers below.

### Part a.

How many orders were shipped by Speedy Express in total?

### Part a. Solution

Running the following query, we can determine that Speedy Express shipped 54 orders in total. <br />

" <br />
SELECT COUNT(*) <br />
FROM Orders INNER JOIN Shippers ON Orders.ShipperID = Shippers.ShipperID <br />
WHERE ShipperName = 'Speedy Express' <br />
"

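The query can be tried locally with Python's built-in sqlite3 module. The tables below are a hypothetical miniature of the schema with made-up rows, so the printed count (3) says nothing about the real answer of 54.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Shippers (ShipperID INTEGER PRIMARY KEY, ShipperName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, ShipperID INTEGER);
    -- Toy rows for illustration only, not the challenge data.
    INSERT INTO Shippers VALUES (1, 'Speedy Express'), (2, 'United Package');
    INSERT INTO Orders VALUES (10, 1), (11, 2), (12, 1), (13, 1);
""")

query = """
    SELECT COUNT(*)
    FROM Orders
    INNER JOIN Shippers ON Orders.ShipperID = Shippers.ShipperID
    WHERE ShipperName = 'Speedy Express'
"""
speedy_orders = conn.execute(query).fetchone()[0]
print(speedy_orders)
```
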
### Part b.

What is the last name of the employee with the most orders?

### Part b. Solution

Running the following query, we can determine that the last name of the employee with the most orders is Peacock. <br />

" <br />
SELECT LastName, COUNT(*) <br />
FROM Employees INNER JOIN Orders ON Employees.EmployeeID = Orders.EmployeeID <br />
GROUP BY LastName <br />
ORDER BY COUNT(*) DESC <br />
LIMIT 1 <br />
"

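One fragility of grouping by LastName is that two employees who share a last name would be merged into a single group. Whether that happens in the challenge data is not checked here; the toy rows below (hypothetical names and counts) only illustrate the effect and an EmployeeID-based alternative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, LastName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, EmployeeID INTEGER);
    -- Toy rows: two different employees share the last name 'Smith'.
    INSERT INTO Employees VALUES (1, 'Smith'), (2, 'Smith'), (3, 'Jones');
    INSERT INTO Orders VALUES
        (10, 1), (11, 1), (12, 2), (13, 2), (14, 3), (15, 3), (16, 3);
""")

template = """
    SELECT LastName, COUNT(*)
    FROM Employees
    INNER JOIN Orders ON Employees.EmployeeID = Orders.EmployeeID
    GROUP BY {key}
    ORDER BY COUNT(*) DESC
    LIMIT 1
"""
# Grouping by name merges both Smiths; grouping by ID keeps them apart.
by_last_name = conn.execute(template.format(key="LastName")).fetchone()
by_employee = conn.execute(
    template.format(key="Employees.EmployeeID, LastName")
).fetchone()
print(by_last_name, by_employee)
```

With these toy rows, grouping by name reports Smith with 4 orders, while grouping by EmployeeID reports Jones with 3, since neither Smith individually has more than 2.
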
### Part c.

What product was ordered the most by customers in Germany?

### Part c. Solution

Running the following query, we can determine that the product ordered most often by customers in Germany, counted by the number of order lines it appears in, is Gorgonzola Telino. <br />

" <br />
SELECT ProductName, COUNT(*) <br />
FROM Products <br />
INNER JOIN OrderDetails ON Products.ProductID = OrderDetails.ProductID <br />
INNER JOIN Orders ON OrderDetails.OrderID = Orders.OrderID <br />
INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID <br />
WHERE Country = 'Germany' <br />
GROUP BY ProductName <br />
ORDER BY COUNT(*) DESC <br />
LIMIT 1 <br />
"