# 2021: Week 2

January 13, 2021

Building on Week 1's challenge, we are going to take your data prep skills one step further. The techniques we are introducing this week are:

* Aggregation - changing the level of granularity of your data. The combination of categorical fields usually determines what each row represents, so aggregating the data changes this. Aggregation in Tableau Prep works differently from aggregation in Tableau Desktop.
* Calculations - if the value or variable that you need isn't in your data set, you can often create it from the data fields you do have.

As with last week, we've attached some help links that will teach you the techniques if you need a few nudges. One of the main challenges in data preparation is thinking not just about what you want to do but also about the order in which to do those steps. This week's challenge is a good example: planning the order up front helps you avoid repeating steps. Here's a post that might help you with your planning.

Also, thank you to everyone who posted their solutions, in all the different tools, on Twitter and the community forums last week. Keep sharing those solutions; you are helping more people learn these important skills than you think!
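
As a small illustration of why the order of steps matters, the sketch below compares calculating Order Value per row before aggregating with aggregating first and multiplying afterwards. The brand 'AAA' and its numbers are made up for this example and are not taken from the input file. Only the first approach gives the true total.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the challenge input
orders = pd.DataFrame({
    'Brand': ['AAA', 'AAA'],
    'Quantity': [2, 1],
    'Value per Bike': [100, 300],
})

# Calculate at row level first, then aggregate
orders['Order Value'] = orders['Quantity'] * orders['Value per Bike']
calc_then_aggregate = orders.groupby('Brand')['Order Value'].sum()

# Aggregate first, then calculate: the row-level pairing is lost
totals = orders.groupby('Brand')[['Quantity', 'Value per Bike']].sum()
aggregate_then_calc = totals['Quantity'] * totals['Value per Bike']

print(calc_then_aggregate['AAA'])  # 500
print(aggregate_then_calc['AAA'])  # 1200
```

Once rows are summed, the link between each order's quantity and its price is gone, which is why the row-level calculation comes before the aggregation.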

# Challenge
Challenge by: Carl Allchin

This week we are looking at the different brands of bikes available in our stores. We need to understand which brands are the most popular sellers and whether customers of the different brands have the same experience as other customers.

We are creating simple summaries this week to get a quick, tabular view of the answers. If you want to visualise the data to highlight those trends even more clearly, go for it! 

# Input
One CSV file of bike sales.

<img src='https://1.bp.blogspot.com/-z_heCV23gPQ/X_rk1Pa7a8I/AAAAAAAACGM/v4DDRd218DcDw71jZPbyVVGgxkee-wSsQCLcBGAsYHQ/w640-h214/Screenshot%2B2021-01-10%2Bat%2B11.27.51.png'>

# Requirements
* Input the data (help)
* Clean up the Model field to leave only the letters to represent the Brand of the bike (help)
* Work out the Order Value using Value per Bike and Quantity
* Aggregate Value per Bike, Order Value and Quantity by Brand and Bike Type to form: (help)
  * Quantity Sold
  * Order Value
  * Average Value Sold per Brand, Type
* Calculate 'Days to Ship' as the difference between when an order was placed and when it was shipped (help)
* Aggregate Order Value, Quantity and Days to Ship by Brand and Store to form:
  * Total Quantity Sold
  * Total Order Value
  * Average Days to Ship
* Round any averaged values to one decimal place to make the values easier to read
* Output both data sets

# Output
Two files:

1. Sales by Brand and Type

<img src='https://1.bp.blogspot.com/-pIoOV-rm3JM/X_8FXjz9FbI/AAAAAAAACGs/mWo2N0mn1vwHqnH-_n8X60M_kvKiIUCwACLcBGAsYHQ/w640-h244/Screenshot%2B2021-01-13%2Bat%2B14.35.41.png'>

5 Data Fields:
* Brand
* Bike Type
* Quantity Sold
* Order Value
* Avg Bike Value per Brand

15 Rows (16 including headers)

2. Sales by Brand and Store

<img src='https://1.bp.blogspot.com/-vKiMGxugMJg/X_8FpFFpm5I/AAAAAAAACG0/BVU8dfchDR8z_3Oyfc8ruJ7j9S1OSVgqQCLcBGAsYHQ/w640-h264/Screenshot%2B2021-01-13%2Bat%2B14.37.14.png'>

5 Data Fields:
* Brand
* Store
* Total Quantity Sold
* Total Order Value
* Avg Days to Ship

25 Rows (26 including headers)

In [1]:
import pandas as pd

In [2]:
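# Import the input and parse the date columns (day/month/year format)
# Use input_path rather than input to avoid shadowing the built-in function
input_path = 'PD 2021 Wk 2 Input - Bike Model Sales.csv'
df = pd.read_csv(input_path)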
df['Order Date'] = pd.to_datetime(df['Order Date'], format=r'%d/%m/%Y')
df['Shipping Date'] = pd.to_datetime(df['Shipping Date'], format=r'%d/%m/%Y')
print(df.head(5))
print(df.info())

  Bike Type       Store Order Date  Quantity  Value per Bike Shipping Date  \
0  Mountain  Manchester 2020-05-15         4            1543    2020-06-01   
1    Gravel  Manchester 2020-06-16         2            2076    2020-06-24   
2      Road  Birmingham 2020-05-04         1            2616    2020-05-13   
3    Gravel        York 2020-09-05         2            1359    2020-09-19   
4    Gravel  Birmingham 2020-03-28         4            1599    2020-04-04   

          Model  
0  GIA31292/003  
1  GIA21312/001  
2  GIA94221/129  
3  GIA12442/120  
4  GIA12492/123  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Bike Type       2000 non-null   object        
 1   Store           2000 non-null   object        
 2   Order Date      2000 non-null   datetime64[ns]
 3   Quantity        2000 non-null   int64         
 4   

In [3]:
# Match every character that is not an uppercase letter
pattern = r'[^A-Z]'

In [4]:
# Clean up the Model field to leave only the letters that represent the Brand of the bike
# regex=True is passed explicitly: the default changed to False in pandas 2.0
df['Model'] = df['Model'].str.replace(pattern, '', regex=True)
df.rename(columns={'Model':'Brand'}, inplace=True)
print(df.head(5))

  Bike Type       Store Order Date  Quantity  Value per Bike Shipping Date  \
0  Mountain  Manchester 2020-05-15         4            1543    2020-06-01   
1    Gravel  Manchester 2020-06-16         2            2076    2020-06-24   
2      Road  Birmingham 2020-05-04         1            2616    2020-05-13   
3    Gravel        York 2020-09-05         2            1359    2020-09-19   
4    Gravel  Birmingham 2020-03-28         4            1599    2020-04-04   

  Brand  
0   GIA  
1   GIA  
2   GIA  
3   GIA  
4   GIA  
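
To show the cleaning step in isolation, here is a minimal sketch of two ways to keep only the brand letters. The first model code appears in the notebook output above; the second ('BROM10000/001') is a hypothetical value invented for this example. The two options agree here, but they could differ if a model code contained letters after the digits.

```python
import pandas as pd

# The first value appears in the notebook output; the second is a hypothetical model code
models = pd.Series(['GIA31292/003', 'BROM10000/001'])

# Option 1: delete every character that is not an uppercase letter
brand_replace = models.str.replace(r'[^A-Z]', '', regex=True)

# Option 2: capture the first run of uppercase letters
brand_extract = models.str.extract(r'([A-Z]+)', expand=False)

print(brand_replace.tolist())  # ['GIA', 'BROM']
print(brand_extract.tolist())  # ['GIA', 'BROM']
```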




In [5]:
# Work out the Order Value using Value per Bike and Quantity
df['Order Value'] = df['Value per Bike'] * df['Quantity']
df.drop(columns='Value per Bike', inplace=True)
print(df.head(5))
print(df.info())

  Bike Type       Store Order Date  Quantity Shipping Date Brand  Order Value
0  Mountain  Manchester 2020-05-15         4    2020-06-01   GIA         6172
1    Gravel  Manchester 2020-06-16         2    2020-06-24   GIA         4152
2      Road  Birmingham 2020-05-04         1    2020-05-13   GIA         2616
3    Gravel        York 2020-09-05         2    2020-09-19   GIA         2718
4    Gravel  Birmingham 2020-03-28         4    2020-04-04   GIA         6396
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Bike Type      2000 non-null   object        
 1   Store          2000 non-null   object        
 2   Order Date     2000 non-null   datetime64[ns]
 3   Quantity       2000 non-null   int64         
 4   Shipping Date  2000 non-null   datetime64[ns]
 5   Brand          2000 non-null   object        
 6   Order Value  

In [6]:
# Aggregate Order Value and Quantity by Brand and Bike Type
# Note: the average below is the mean Order Value per order, because 'Value per Bike' was dropped in cell 5
output1 = df.groupby(['Brand', 'Bike Type']).agg(
    Quantity_Sold=('Quantity', 'sum'),
    Order_Value=('Order Value', 'sum'),
    Average_Value_Sold_per_Brand_Type=('Order Value', 'mean'),
)
output1.reset_index(inplace=True)
output1 = output1.round(1)
print(output1.head(5))
print(output1.info())

  Brand Bike Type  Quantity_Sold  Order_Value  \
0  BROM    Gravel            186       433885   
1  BROM  Mountain            277       674770   
2  BROM      Road            257       656539   
3   GIA    Gravel            323       733087   
4   GIA  Mountain            425      1021329   

   Average_Value_Sold_per_Brand_Type  
0                             6675.2  
1                             7581.7  
2                             8872.1  
3                             6604.4  
4                             7243.5  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 5 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Brand                              15 non-null     object 
 1   Bike Type                          15 non-null     object 
 2   Quantity_Sold                      15 non-null     int64  
 3   Order_Value                        15 non-nu
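
The cell above averages Order Value, whereas the requirement lists Value per Bike among the aggregated fields and the expected output names the field 'Avg Bike Value per Brand'. The sketch below computes both averages side by side so the difference is visible. It uses only the first five rows from the input preview, and the column names `Avg_Bike_Value` and `Avg_Order_Value` are chosen for this example. Which average the challenge intends is an assumption to check against the expected output image.

```python
import pandas as pd

# First five rows of the notebook's input preview (Brand already cleaned to 'GIA')
sample = pd.DataFrame({
    'Brand': ['GIA'] * 5,
    'Bike Type': ['Mountain', 'Gravel', 'Road', 'Gravel', 'Gravel'],
    'Quantity': [4, 2, 1, 2, 4],
    'Value per Bike': [1543, 2076, 2616, 1359, 1599],
})
sample['Order Value'] = sample['Value per Bike'] * sample['Quantity']

summary = (
    sample.groupby(['Brand', 'Bike Type'], as_index=False)
    .agg(
        Quantity_Sold=('Quantity', 'sum'),
        Order_Value=('Order Value', 'sum'),
        Avg_Bike_Value=('Value per Bike', 'mean'),   # mean price of a bike
        Avg_Order_Value=('Order Value', 'mean'),     # mean value of an order
    )
    .round(1)
)
print(summary)
```

For the three Gravel rows in this sample, the mean Value per Bike is 1678.0 while the mean Order Value is 4422.0, so the two definitions are not interchangeable.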

In [7]:
# Calculate 'Days to Ship' as the number of whole days between the order date and the shipping date
df['Days to Ship'] = (df['Shipping Date'] - df['Order Date']).dt.days
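
As a quick check of the date arithmetic, the following sketch applies the same subtraction to the first three order and shipping dates from the input preview. The dates are written here as day/month/year strings, which is an assumption based on the `format` argument used when the file was loaded.

```python
import pandas as pd

# Dates from the first three rows of the input preview, written day/month/year
dates = pd.DataFrame({
    'Order Date': ['15/05/2020', '16/06/2020', '04/05/2020'],
    'Shipping Date': ['01/06/2020', '24/06/2020', '13/05/2020'],
})
for col in ['Order Date', 'Shipping Date']:
    dates[col] = pd.to_datetime(dates[col], format='%d/%m/%Y')

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days
dates['Days to Ship'] = (dates['Shipping Date'] - dates['Order Date']).dt.days
print(dates['Days to Ship'].tolist())  # [17, 8, 9]
```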

In [8]:
# Aggregate Order Value, Quantity and Days to Ship by Brand and Store
output2 = df.groupby(['Brand', 'Store']).agg(
    Total_Quantity_Sold=('Quantity', 'sum'),
    Total_Order_Value=('Order Value', 'sum'),
    Avg_Days_to_Ship=('Days to Ship', 'mean'),
)
output2.reset_index(inplace=True)
output2 = output2.round(1)
print(output2.head(5))
print(output2.info())

  Brand       Store  Total_Quantity_Sold  Total_Order_Value  Avg_Days_to_Ship
0  BROM  Birmingham                  155             349759              11.8
1  BROM       Leeds                  150             389116               9.8
2  BROM      London                  133             324635              11.0
3  BROM  Manchester                  137             339832              10.9
4  BROM        York                  145             361852               9.8
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Brand                25 non-null     object 
 1   Store                25 non-null     object 
 2   Total_Quantity_Sold  25 non-null     int64  
 3   Total_Order_Value    25 non-null     int64  
 4   Avg_Days_to_Ship     25 non-null     float64
dtypes: float64(1), int64(2), object(2)
memory usage: 1.1+ KB
None
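
The final requirement, outputting both data sets, is not shown in the notebook. Below is one possible sketch of that step, using a two-row stand-in for `output2` built from the values printed above. The file name `sales_by_brand_and_store.csv` and the temporary folder are assumptions made for this example; the same pattern would apply to `output1`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for output2, using the first two rows printed above
summary = pd.DataFrame({
    'Brand': ['BROM', 'BROM'],
    'Store': ['Birmingham', 'Leeds'],
    'Total_Quantity_Sold': [155, 150],
    'Total_Order_Value': [349759, 389116],
    'Avg_Days_to_Ship': [11.8, 9.8],
})

# Replace underscores so the headers match the field names in the Output section
summary.columns = summary.columns.str.replace('_', ' ', regex=False)

# Hypothetical file name; a temporary folder keeps the example self-contained
out_path = Path(tempfile.mkdtemp()) / 'sales_by_brand_and_store.csv'
summary.to_csv(out_path, index=False)

# Read the file back to confirm the headers and row count
check = pd.read_csv(out_path)
print(check.columns.tolist())
```

Passing `index=False` keeps the row index out of the file, so the CSV contains only the five data fields.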
