# 2021: Week 4

January 27, 2021

This is the last in the 'Starter Challenges' series to get you up and Preppin' for the new year. We've enjoyed running this mini-series so much that we're already looking at creating a similar series later in the year.

This week's challenge introduces one more fundamental skill and gives you the chance to practice the skills you've picked up over the last few weeks. As always, we'll guide you along the way with some useful help links in case you need a reminder or a chance to explore the new techniques.

The new technique for you to learn this week is Joins. If you've worked with different data solutions for a number of years, you'll be familiar with Joins, but if you are new, you're in for a treat! Joins allow us to bring two data sources together by matching rows on one or more shared fields. This allows for much easier, richer and deeper analysis, as data is often held in many different locations. Use the help links if this is a new technique for you. Joins are one of the harder concepts to pick up, so make sure you've set aside a good amount of time to explore.
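
If you prefer to see the idea in code first, here is a minimal sketch of a join in pandas. The two small tables and their values are made up for illustration and are not taken from the challenge file; only `pd.merge` and its `on`, `how` and `validate` parameters are real pandas features.

```python
import pandas as pd

# Hypothetical toy data: actual sales and targets for two stores over two quarters
sales = pd.DataFrame({
    "Store": ["Leeds", "York", "Leeds", "York"],
    "Quarter": [1, 1, 2, 2],
    "Products Sold": [488, 499, 510, 470],
})
targets = pd.DataFrame({
    "Store": ["Leeds", "York", "Leeds", "York"],
    "Quarter": [1, 1, 2, 2],
    "Target": [490, 490, 500, 500],
})

# Inner join on the two shared key fields.
# validate="one_to_one" raises an error if either side repeats a Store/Quarter pair,
# which is a cheap guard against accidentally multiplying rows.
joined = pd.merge(sales, targets, on=["Store", "Quarter"], how="inner", validate="one_to_one")
print(joined)
```

Because both tables share the key field names, the keys appear only once in the result and the row count stays the same as the inputs.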

Challenge by: Carl Allchin

# Input

The input file may seem familiar from last week. We still have 5 worksheets, each containing one Store's product sales.

<img src='https://1.bp.blogspot.com/-sJ7fitvkMMk/YA69_pgQbII/AAAAAAAACHk/NQ8TVmd0fLcGOyipZhPWl5_VTsfvZIfhgCLcBGAsYHQ/w640-h192/Screenshot%2B2021-01-25%2Bat%2B12.47.37.png'>

What's new is a set of Quarterly Targets that each store is expected to achieve.

<img src='https://1.bp.blogspot.com/-P5YUkfcO1hM/YA6-YPuJ1CI/AAAAAAAACHs/-mdb5K7xB5YEg5xRyWbl0b8PfNEF7DVMgCLcBGAsYHQ/w400-h211/Screenshot%2B2021-01-25%2Bat%2B12.49.20.png'>

# Requirements
- Input the file 
- Union the Stores data together (help)
- Remove any unnecessary data fields your Input step might create and rename the 'Table Names' as 'Store' 
- Pivot the product columns (help)
- Split the 'Customer Type - Product' field to create: (help)
    - Customer Type
    - Product
    - Also rename the Values column resulting from your pivot as 'Products Sold'
- Turn the date into a 'Quarter' number (help)
- Sum up the products sold by Store and Quarter (help)
- Add the Targets data 
- Join the Targets data with the aggregated Stores data (help)
    - Note: this should give you 20 rows of data
- Remove any duplicate fields formed by the Join
- Calculate the Variance between each Store's Quarterly actual sales and the target. Call this field 'Variance to Target' (help)
- Rank the Stores based on the Variance to Target in each quarter (help)
    - The greater the variance the better the rank
- Output the data (help)
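
The ranking requirement leaves one detail open: what happens when two stores tie. The sketch below uses made-up variances (not the challenge data) to compare two of the tie-handling options that pandas' `rank` offers. Which one the challenge expects is not stated, so treat the choice as an assumption to check against the expected output.

```python
import pandas as pd

# Hypothetical variances for a single quarter, including a tie between stores B and C
df = pd.DataFrame({
    "Quarter": [1, 1, 1, 1],
    "Store": ["A", "B", "C", "D"],
    "Variance to Target": [9, 2, 2, -35],
})

grouped = df.groupby("Quarter")["Variance to Target"]

# ascending=False: the greatest variance gets rank 1
# "dense" leaves no gap after a tie; "min" skips the next rank after a tie
df["Rank (dense)"] = grouped.rank(method="dense", ascending=False).astype(int)
df["Rank (min)"] = grouped.rank(method="min", ascending=False).astype(int)
print(df)
```

With no ties in a quarter, both methods give identical ranks.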

# Output

<img src='https://1.bp.blogspot.com/-h9JR5XeRNBM/YA6_dbPM22I/AAAAAAAACH4/C0D2jBnLwgcv-YEPX0YTq__u2yPkY5CwQCLcBGAsYHQ/w640-h246/Screenshot%2B2021-01-25%2Bat%2B12.42.54.png'>

One file:

6 Data Fields:

* Quarter
* Rank
* Store
* Products Sold
* Target 
* Variance to Target

20 Rows (21 rows including headers)
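
The solution code further down stops after the ranking step, so here is a hedged sketch of how the final shaping and output could look in pandas. The two sample rows and the column name `Products_Sold` mirror what the solution prints. Writing to an in-memory buffer instead of a file is an assumption made to keep the example self-contained; in practice you would pass a file path of your choice to `to_csv`.

```python
import io
import pandas as pd

# Hypothetical rows shaped like the joined and ranked result
output = pd.DataFrame({
    "Quarter": [1, 1],
    "Store": ["Birmingham", "York"],
    "Products_Sold": [477, 499],
    "Target": [475, 490],
    "Variance to Target": [2, 9],
    "Rank": [2.0, 1.0],
})

field_order = ["Quarter", "Rank", "Store", "Products Sold", "Target", "Variance to Target"]

final = (
    output
    .rename(columns={"Products_Sold": "Products Sold"})  # match the required field name
    .astype({"Rank": int})                               # rank() returns floats
    .loc[:, field_order]                                 # required field order
    .sort_values(["Quarter", "Rank"])
)

# Write without the index so the file holds only the 6 data fields
buffer = io.StringIO()
final.to_csv(buffer, index=False)
print(buffer.getvalue())
```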

In [1]:
import pandas as pd

In [2]:
# Input the file & Union the Stores data together
input = 'PD 2021 Wk 4 Input.xlsx'
excel_sheets = pd.ExcelFile(input).sheet_names
print(excel_sheets)
print('=========================================')

# Build a new list without 'Targets' so the original sheet list is not modified
store_table = [sheet for sheet in excel_sheets if sheet != 'Targets']
print(store_table)
print('=========================================')

array_df = []
for i in store_table:
    temp_df = pd.read_excel(input, sheet_name=i)
    temp_df['Store'] = i
    array_df.append(temp_df)
df = pd.concat(array_df, axis=0)
print(df.head(5))
print(df.info())

['Manchester', 'London', 'Leeds', 'York', 'Birmingham', 'Targets']
['Manchester', 'London', 'Leeds', 'York', 'Birmingham']
        Date  New - Saddles  New - Mudguards  New - Wheels  New - Bags  \
0 2021-01-21           13.0             42.0          19.0        38.0   
1 2021-02-21            1.0              9.0          14.0         6.0   
2 2021-03-21            8.0             22.0           6.0        35.0   
3 2021-04-21            3.0              9.0           8.0        16.0   
4 2021-05-21            2.0              8.0           5.0        34.0   

   Existing - Saddles  Existing - Mudguards  Existing - Wheels  \
0                17.0                  48.0               19.0   
1                 2.0                   4.0               19.0   
2                 0.0                  48.0               17.0   
3                18.0                  50.0               18.0   
4                17.0                   3.0               12.0   

   Existing - Bags       Store  
0 

In [3]:
# Correct data types
# loop through columns in dataframe
for col in df.columns:
    # check if column is of float type
    if df[col].dtype == 'float64':
        # check whether every value is a whole number (NaN does not count as one)
        if df[col].apply(lambda x: x.is_integer()).all():
            # if all values are integers, convert column to integer type
            df[col] = df[col].astype(int)

# print the updated dataframe
print(df.head(5))
print(df.info())

        Date  New - Saddles  New - Mudguards  New - Wheels  New - Bags  \
0 2021-01-21             13               42            19          38   
1 2021-02-21              1                9            14           6   
2 2021-03-21              8               22             6          35   
3 2021-04-21              3                9             8          16   
4 2021-05-21              2                8             5          34   

   Existing - Saddles  Existing - Mudguards  Existing - Wheels  \
0                  17                    48                 19   
1                   2                     4                 19   
2                   0                    48                 17   
3                  18                    50                 18   
4                  17                     3                 12   

   Existing - Bags       Store  
0               13  Manchester  
1               24  Manchester  
2               16  Manchester  
3               25  Manche

In [4]:
# List the columns that need to be pivoted
column_to_pivot = df.drop(columns=['Date','Store']).columns
print(column_to_pivot)
print('=========================================')

# Pivot the product column
df_pivot = df.melt(id_vars=['Date', 'Store'], value_vars=column_to_pivot, var_name='Pivot_name', value_name='Pivot_value')
print(df_pivot.head(5))
print(df_pivot.info())

Index(['New - Saddles', 'New - Mudguards', 'New - Wheels', 'New - Bags',
       'Existing - Saddles', 'Existing - Mudguards', 'Existing - Wheels',
       'Existing - Bags'],
      dtype='object')
        Date       Store     Pivot_name  Pivot_value
0 2021-01-21  Manchester  New - Saddles           13
1 2021-02-21  Manchester  New - Saddles            1
2 2021-03-21  Manchester  New - Saddles            8
3 2021-04-21  Manchester  New - Saddles            3
4 2021-05-21  Manchester  New - Saddles            2
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Date         480 non-null    datetime64[ns]
 1   Store        480 non-null    object        
 2   Pivot_name   480 non-null    object        
 3   Pivot_value  480 non-null    int32         
dtypes: datetime64[ns](1), int32(1), object(2)
memory usage: 13.2+ KB
None
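
To see why the row count grows, note that `melt` turns each wide row into one long row per pivoted column, which is how 8 product columns produce the 480 rows above. A minimal sketch on made-up data (two dates, two product columns) follows; the values are illustrative only.

```python
import pandas as pd

# Hypothetical wide table: one row per date, one column per customer type and product
wide = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-21", "2021-02-21"]),
    "Store": ["Leeds", "Leeds"],
    "New - Bags": [38, 6],
    "Existing - Bags": [13, 24],
})

# Columns not listed in id_vars are pivoted: 2 rows x 2 columns -> 4 long rows
long = wide.melt(
    id_vars=["Date", "Store"],
    var_name="Customer Type - Product",
    value_name="Products Sold",
)
print(long)
```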


In [5]:
# Split the 'Customer Type - Product' field
df_pivot[['Customer Type', 'Product']] = df_pivot['Pivot_name'].str.split(' - ', expand=True)
df_pivot.drop(columns='Pivot_name', inplace=True)

# Also rename the Values column resulting from your pivot as 'Products Sold'
df_pivot.rename(columns={'Pivot_value':'Products Sold'}, inplace=True)
print(df_pivot.head(5))
print(df_pivot.info())

        Date       Store  Products Sold Customer Type  Product
0 2021-01-21  Manchester             13           New  Saddles
1 2021-02-21  Manchester              1           New  Saddles
2 2021-03-21  Manchester              8           New  Saddles
3 2021-04-21  Manchester              3           New  Saddles
4 2021-05-21  Manchester              2           New  Saddles
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           480 non-null    datetime64[ns]
 1   Store          480 non-null    object        
 2   Products Sold  480 non-null    int32         
 3   Customer Type  480 non-null    object        
 4   Product        480 non-null    object        
dtypes: datetime64[ns](1), int32(1), object(3)
memory usage: 17.0+ KB
None


In [6]:
# Turn the date into a 'Quarter' number
df_pivot['Quarter'] = df_pivot['Date'].dt.quarter
df_pivot.drop(columns='Date', inplace=True)
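
This cell prints nothing, so here is a small standalone check of what `.dt.quarter` returns. The sample dates are made up; the second line shows the equivalent arithmetic on the month number.

```python
import pandas as pd

# Hypothetical dates, one from each calendar quarter
dates = pd.Series(pd.to_datetime(["2021-01-21", "2021-04-21", "2021-09-21", "2021-12-21"]))

quarters = dates.dt.quarter
# Same result computed by hand: months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
by_hand = (dates.dt.month - 1) // 3 + 1

print(quarters.tolist())
print(by_hand.tolist())
```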

In [7]:
# Sum up the products sold by Store and Quarter
df_pivot = df_pivot.groupby(['Quarter', 'Store']).agg(Products_Sold=('Products Sold','sum')).reset_index()
print(df_pivot.head(5))
print(df_pivot.info())

   Quarter       Store  Products_Sold
0        1  Birmingham            477
1        1       Leeds            488
2        1      London            425
3        1  Manchester            440
4        1        York            499
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Quarter        20 non-null     int64 
 1   Store          20 non-null     object
 2   Products_Sold  20 non-null     int32 
dtypes: int32(1), int64(1), object(1)
memory usage: 528.0+ bytes
None


In [8]:
# Add the Targets data 
target = pd.read_excel(input, sheet_name='Targets')

# Correct data type
for col in target.columns:
    if target[col].dtype == 'float64':
        if target[col].apply(lambda x: x.is_integer()).all():
            target[col] = target[col].astype(int)

print(target.head(5))
print(target.info())


   Quarter       Store  Target
0        1  Manchester     475
1        1      London     475
2        1       Leeds     490
3        1        York     490
4        1  Birmingham     475
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Quarter  20 non-null     int32 
 1   Store    20 non-null     object
 2   Target   20 non-null     int32 
dtypes: int32(2), object(1)
memory usage: 448.0+ bytes
None


In [9]:
# Join the Targets data with the aggregated Stores data.
# Merging on the shared key names keeps a single copy of Quarter and Store, so no duplicate fields are created.
output = pd.merge(left=df_pivot, right=target, on=['Quarter', 'Store'], how='inner')

# Calculate the Variance between each Store's Quarterly actual sales and the target. Call this field 'Variance to Target'
output['Variance to Target'] = output['Products_Sold'] - output['Target']

# Rank the Stores based on the Variance to Target in each quarter (greatest variance = rank 1)
output['Rank'] = output.groupby('Quarter')['Variance to Target'].rank(method='dense', ascending=False)

print(output.head(5))
print(output.info())

   Quarter       Store  Products_Sold  Target  Variance to Target  Rank
0        1  Birmingham            477     475                   2   2.0
1        1       Leeds            488     490                  -2   3.0
2        1      London            425     475                 -50   5.0
3        1  Manchester            440     475                 -35   4.0
4        1        York            499     490                   9   1.0
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Quarter             20 non-null     int64  
 1   Store               20 non-null     object 
 2   Products_Sold       20 non-null     int32  
 3   Target              20 non-null     int32  
 4   Variance to Target  20 non-null     int32  
 5   Rank                20 non-null     float64
dtypes: float64(1), int32(3), int64(1), object(1)
memory usage: 880.0+ byt