# Sales EDA

## Goal:
Perform an exploratory data analysis on the sales data provided by a store.

The data was provided as .csv files and ingested into BigQuery tables. This notebook reads that data, transforms it, and loads the tidy data back into BigQuery to be used in Tableau.

The data dictionary for the clean data is at the end of this notebook, along with the link to the public Tableau dashboard.

## The raw data
The raw data is read from BigQuery and is composed of 8 tables:

- Address customers: Information about customer addresses
- Business goal: The sales goal of each store per day
- Business unit: The business units
- Channel: The types of channel
- Customer type: Information about the types of customers
- Products: The products, with the category and sub-category of each product
- Sales: About 4 million rows of sales from January 2021 to December 2022
- Store: The stores and their business units


## 1. Extraction

Load the tables into dataframes to be manipulated. There are two options: option 1 reads the tables from BigQuery, and option 2 loads the local CSV files.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from google.cloud import bigquery
import pandas_gbq
import db_dtypes
from tqdm import tqdm
import glob
import os

# define style for charts
plt.style.use('ggplot')

# expand number of columns to better viz
pd.set_option('display.max_columns', 50)

In [None]:
# use local json file to authenticate
SERVICE_ACCOUNT_JSON = r"D:\DataAnalytics\projects\sales-store-383520-cd6fdfbcbcb6.json"
client = bigquery.Client.from_service_account_json(SERVICE_ACCOUNT_JSON)

# find all tables in dataset (raw)
tables = client.list_tables("sales-store-383520.raw")

# iterate through tables and create a dataframe for each table
for table in tables:
    print("{}.{}.{}".format(table.project, table.dataset_id, table.table_id))

    query = 'select * from `sales-store-383520.raw.{}`'.format(table.table_id)

    globals()[f"df_{table.table_id}"] = pandas_gbq.read_gbq(query, project_id='sales-store-383520', progress_bar_type='tqdm_notebook')

[Optional] Load the data from local CSV files

In [2]:
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, 'files', '*.csv'))

# iterate through csv files and create a dataframe for each file
for file in csv_files:
    # basename + splitext handle both Windows and POSIX path separators
    file_name = os.path.splitext(os.path.basename(file))[0]
    print(file_name)

    globals()[f"df_{file_name}"] = pd.read_csv(file, sep=';', encoding='iso-8859-1')

address_customers
business_goal
business_unit
channel
customer_type
products
sales
stores
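
As an alternative to creating variables through `globals()`, the same loop can collect the dataframes in a dictionary keyed by file name. This is only a minimal sketch, not the method used in this notebook: the helper name `load_csv_folder` and the `frames` dictionary are hypothetical, and the `sep` and `encoding` defaults simply mirror the cell above.

```python
from pathlib import Path

import pandas as pd


def load_csv_folder(folder, sep=';', encoding='iso-8859-1'):
    """Read every *.csv in `folder` into a dict keyed by file name (no extension)."""
    frames = {}
    for csv_path in sorted(Path(folder).glob('*.csv')):
        # Path.stem drops the directory and the extension on any OS
        frames[csv_path.stem] = pd.read_csv(csv_path, sep=sep, encoding=encoding)
    return frames
```

With this helper, `frames = load_csv_folder('files')` would make the sales table available as `frames['sales']`.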


In [3]:
df_channel.columns = ['id_channel', 'channel']

df_address_customers.columns = ['id_address_sale', 'customer_state']

df_customer_type.columns = ['id_customer_type', 'customer_type'] 

df_stores.columns = ['id_store', 'store_code', 'start_date', 'branch', 'district', 'city', 'state']

df_business_goal.columns = ['date', 'id_store', 'id_business_unit', 'id_channel', 'sales_goal']

df_products.columns = ['id_product', 'supplier', 'product_name', 'category', 'sub-category']

df_business_unit.columns = ['id_business_unit', 'business_unit']

df_sales.columns = ['date', 'id_store', 'id_business_unit', 'id_channel', 
                        'id_product', 'id_coupon', 'id_customer', 'id_address_sale', 
                        'id_customer_type', 'items', 'gross_revenue', 'tax_value', 'costs']

## 2. Transforming
Perform some adjustments on the imported data.

### 2.1 Change datatypes
All columns were provided as object (string), but some of them should be converted to proper types such as numbers, dates, or categories.


##### 2.1.1 Convert the date columns, originally stored as strings

In [4]:
def convert_to_date(df, cols):
    for c in cols:
        df[c] = pd.to_datetime(df[c], format="%Y/%m/%d")

convert_to_date(df_sales, ['date'])

convert_to_date(df_business_goal, ['date' ])
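
With an explicit `format`, `pd.to_datetime` raises an error on the first value that does not match. A minimal sketch, on hypothetical toy values, of using `errors='coerce'` to locate such rows instead of stopping at the first failure:

```python
import pandas as pd

# toy data (hypothetical values); the second entry does not match the format
raw = pd.DataFrame({'date': ['2021/01/02', '02-01-2021', '2022/12/31']})

# errors='coerce' turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(raw['date'], format='%Y/%m/%d', errors='coerce')

# rows that failed to parse can be inspected before deciding how to handle them
bad_rows = raw[parsed.isna()]
print(bad_rows)
```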


##### 2.1.2 Convert the numeric columns, originally stored as strings, to float

In [5]:
def convert_to_float(df, cols):
    for c in cols:
        df[c] = df[c].str.replace(',', '.').astype(float)

convert_to_float(df_sales, ['items', 'gross_revenue', 'tax_value', 'costs'])

convert_to_float(df_business_goal, ['sales_goal' ])
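
The conversion above assumes every value is a number written with a decimal comma. A minimal sketch, on hypothetical toy values, of the same replacement combined with `pd.to_numeric(..., errors='coerce')`, which yields `NaN` for anything that is not a number:

```python
import pandas as pd

# toy values (hypothetical) written with a decimal comma; the last one is not a number
raw = pd.Series(['16,2', '4,41', 'n/a'])

# regex=False makes the replacement a plain string substitution
as_text = raw.str.replace(',', '.', regex=False)

# to_numeric with errors='coerce' yields NaN for values that are not numbers
as_float = pd.to_numeric(as_text, errors='coerce')
print(as_float)
```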

##### 2.1.3 Change some columns to category datatype

In [6]:
def convert_to_category(df, cols):
    for c in cols:
        df[c] = df[c].astype('category')

convert_to_category(df_channel, ['channel'])
convert_to_category(df_address_customers, ['customer_state'])
convert_to_category(df_customer_type, ['customer_type'])
convert_to_category(df_stores, ['store_code', 'branch', 'district', 'city', 'state'])
convert_to_category(df_business_goal, ['id_store', 'id_business_unit', 'id_channel'])
convert_to_category(df_products, ['supplier', 'category', 'sub-category'])
convert_to_category(df_business_unit, ['id_business_unit','business_unit'])
convert_to_category(df_sales, ['id_store','id_business_unit', 'id_channel', 'id_customer_type'])
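
The `category` dtype stores each distinct value once and keeps small integer codes per row, which usually reduces memory for columns with few distinct values. A minimal sketch on a toy column; the `'Online'` label is hypothetical and only there to have a second value next to `'Loja'`:

```python
import pandas as pd

# toy column: two distinct labels repeated many times
labels = pd.Series(['Loja', 'Online'] * 50_000)

# deep=True includes the memory used by the Python strings themselves
as_object = labels.astype(object).memory_usage(deep=True)
as_category = labels.astype('category').memory_usage(deep=True)

print(f"object: {as_object:,} bytes, category: {as_category:,} bytes")
```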

##### 2.1.4 Remove rows with null or 0 items from the Sales dataframe

In [7]:
print("Before: {}".format(df_sales[df_sales['items'] == 0]['items'].count()))

# .copy() makes the filtered frame independent, so later column assignments do not warn
df_sales = df_sales[df_sales['items'] > 0].copy()

print("After: {}".format(df_sales[df_sales['items'] == 0]['items'].count()))

Before: 11397
After: 0


### 2.2 Create new columns to be analyzed/plotted

##### 2.2.1 Create the tax rate column to understand the percentage of taxes in each sale


In [8]:
df_sales['tax_rate'] = round(df_sales['tax_value'] / df_sales['gross_revenue'], 4)

df_sales[['gross_revenue', 'tax_value', 'tax_rate']].sample(5)

Unnamed: 0,gross_revenue,tax_value,tax_rate
3780813,42.594,3.942,0.0925
374240,3.594,0.984,0.2738
204782,4.722,0.0,0.0
1746935,30.564,8.334,0.2727
1002599,0.258,0.072,0.2791


##### 2.2.2 Create a net revenue column

In [9]:
df_sales['net_revenue'] = df_sales['gross_revenue'] - (df_sales['tax_value'] + df_sales['costs'])

df_sales[['gross_revenue', 'net_revenue', 'tax_value' , 'costs']].sample(5)

Unnamed: 0,gross_revenue,net_revenue,tax_value,costs
483786,11.994,2.958,3.27,5.766
1394292,38.97,20.484,7.014,11.472
3281429,4.188,2.448,0.39,1.35
2940909,8.64,3.186,2.352,3.102
841912,7.794,4.068,2.13,1.596
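
Note that "net revenue" here is the gross revenue minus both taxes and costs. As a quick arithmetic check, the first sample row above can be recomputed by hand:

```python
import math

# values copied from the first sample row above (index 483786)
gross_revenue, tax_value, costs = 11.994, 3.27, 5.766

net_revenue = gross_revenue - (tax_value + costs)

# compare with a tolerance because of floating-point rounding
assert math.isclose(net_revenue, 2.958, abs_tol=1e-9)
```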


##### 2.2.3 New column with the cumulative count of purchase days for each customer

In [10]:
df_sales['cumulative_sales'] = df_sales.assign(temp=~df_sales.duplicated(subset=['id_customer','date'])).groupby('id_customer')['temp'].cumsum()
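
The expression above flags the first row of each (customer, date) pair and then takes a running sum of those flags per customer, so `cumulative_sales` counts distinct purchase days rather than individual sale rows. A minimal sketch of the same technique on hypothetical toy data:

```python
import pandas as pd

# toy data (hypothetical): customer 'a' buys twice on the same day, then on another day
toy = pd.DataFrame({
    'id_customer': ['a', 'a', 'b', 'a'],
    'date': pd.to_datetime(['2021-01-02', '2021-01-02', '2021-01-02', '2021-01-05']),
})

# True only for the first row of each (customer, date) pair
first_of_day = ~toy.duplicated(subset=['id_customer', 'date'])

# running count of distinct purchase days per customer, in row order
toy['cumulative_sales'] = first_of_day.groupby(toy['id_customer']).cumsum()
print(toy)
```

The count follows the row order of the dataframe, so a chronological count would require sorting by `date` first.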

### 2.3 Create a new dataframe 'Customers' to be used in analysis and data visualization

##### 2.3.1 Create the new dataframe using information from other tables

In [11]:
# get the customer information in the sales dataframe
df_customers = df_sales.groupby(['id_customer', 'id_customer_type'])\
                                        .agg({'id_customer': 'count', 'items': 'sum','date': ['min', 'max'], 'gross_revenue': 'sum'})\
                                        .reset_index()

# rename columns
df_customers.columns = ['id_customer', 'id_customer_type', 'purchase_count', 'items_purchased', 'first_purchase', 'last_purchase', 'total_spent']

# get the customer type information
df_customers = df_customers.merge(df_customer_type, on='id_customer_type', how='left')

# create a column to identify the customer type
# vectorized comparison: 1 for 'Identificado', 0 otherwise
df_customers['customer_type_code'] = (df_customers['customer_type'] == 'Identificado').astype(int)

df_customers.sample(5)

Unnamed: 0,id_customer,id_customer_type,purchase_count,items_purchased,first_purchase,last_purchase,total_spent,customer_type,customer_type_code
478053,"E,1%"">>U%U3UC(=/@`V^D!",F+9C/:YY=_[^&$L90;9D_%,5,3.6,2022-01-27,2022-03-06,70.824,Identificado,1
354398,"D*?-C^,UDXR?/G2(3:""XP""","N3ZH'W$AE#+&45Z8N8""S*#",14,9.6,2021-08-23,2022-02-19,190.548,Não Identificado,0
1558580,ND02+LH;$-MZY3;\ATVTO/,A-Z4<6#[I<TA\FNKYY]%:+,0,0.0,NaT,NaT,0.0,Não Identificado,0
802392,GZE<.(6#KU.8N;JC6O?^_%,EQJ7X$INM^[%CO5KH82M_!,0,0.0,NaT,NaT,0.0,Não Identificado,0
1778145,"P>TB*K8W#+SD#N34?>4^!""",KGS;BJ!<S1'[<<P*O&1(X+,0,0.0,NaT,NaT,0.0,Identificado,1
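
The rows with `purchase_count` 0 and `NaT` dates in the sample above are consistent with how pandas groups on categorical keys: `id_customer_type` was converted to `category` in 2.1.3, and with `observed=False` the result is expanded to every combination of the grouping keys, including combinations that never occur in the data. A minimal sketch on hypothetical toy data of how `observed=True` keeps only the combinations that are present:

```python
import pandas as pd

# toy data (hypothetical values); only the type column is categorical
toy = pd.DataFrame({
    'id_customer': ['a', 'a', 'b'],
    'id_customer_type': pd.Categorical(['x', 'x', 'y']),
    'items': [1.0, 2.0, 3.0],
})

keys = ['id_customer', 'id_customer_type']

# observed=False: every customer x type combination, unobserved ones filled with 0
all_combos = toy.groupby(keys, observed=False)['items'].sum().reset_index()

# observed=True: only the combinations that actually appear in the data
seen_only = toy.groupby(keys, observed=True)['items'].sum().reset_index()

print(len(all_combos), len(seen_only))
```

The extra rows do not change the per-customer totals computed in the next step, but on millions of rows they can add noticeable memory and time.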


##### 2.3.2 Aggregate information by customer

In [12]:
df_customers = df_customers.groupby('id_customer')\
                                        .agg({'purchase_count': 'sum', 
                                              'items_purchased': 'sum',
                                              'first_purchase': 'min', 
                                              'last_purchase': 'max', 
                                              'total_spent': 'sum', 'customer_type_code': 'max'})\
                                        .reset_index()
df_customers.sample(5)

Unnamed: 0,id_customer,purchase_count,items_purchased,first_purchase,last_purchase,total_spent,customer_type_code
87770,"HIGY!8QYQ4K1QN`2,N@;\&",2,1.2,2021-03-31,2021-03-31,103.134,1
136563,"LT/F""M4<^/\<W<F7PA2*H/",3,1.8,2022-06-16,2022-06-16,38.328,1
21057,"BU^MQY"":=DVG.[&#S."">R)",26,18.6,2022-10-09,2022-12-30,304.038,1
67915,FV3&*7[(<PGO??8%$P54B),6,4.2,2021-02-01,2021-04-16,181.188,1
124702,KVETGTA=+Y:*]YC3?SNQD$,15,9.0,2021-07-07,2021-11-05,604.17,1


##### 2.3.3 Get the customer state from the address customers table

In [13]:
temp_state_customer = df_sales.merge(df_address_customers, on='id_address_sale', how='left')

temp_state_customer[['id_customer', 'customer_state']]

df_customers = df_customers.merge(temp_state_customer[['id_customer', 'customer_state']], on='id_customer', how='left')

df_customers.sample(5)

Unnamed: 0,id_customer,purchase_count,items_purchased,first_purchase,last_purchase,total_spent,customer_type_code,customer_state
1463580,"GY5V/-+W,VVI[#Y`60&:@0",88,55.8,2021-01-20,2022-12-30,2597.172,1,SP
375400,"BRF,_NV`8OG.9_98`)4Z.)",43,36.0,2021-02-25,2022-12-17,2151.294,1,SP
148291,"AL*Z(,;3^@O2%OZSR'&8@!",125,90.6,2021-01-03,2022-09-26,1548.642,1,SP
3334306,"M]X;GRX3YFUIIRU_,;:AB+",678206,527065.8,2021-01-02,2022-12-31,12646910.0,1,
1519075,H+VU7HZN>W16![HOR6PSG$,127,94.8,2021-01-18,2022-12-02,1947.192,1,SP
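
Because `temp_state_customer` has one row per sale, the left merge above returns one row per sale rather than one row per customer, which is consistent with the large index values in the sample. A hedged sketch, on hypothetical toy data, of keeping a single state per customer before merging. Picking the first state seen for each customer is an assumption, not a rule stated in this notebook:

```python
import pandas as pd

# toy data (hypothetical values)
customers = pd.DataFrame({'id_customer': ['a', 'b'], 'total_spent': [10.0, 5.0]})
sales_states = pd.DataFrame({
    'id_customer': ['a', 'a', 'b'],
    'customer_state': ['SP', 'SP', 'RJ'],
})

# keep one row per customer: the first state seen (assumption)
state_per_customer = sales_states.drop_duplicates(subset='id_customer')

# validate='one_to_one' raises if either side still has duplicated keys
merged = customers.merge(state_per_customer, on='id_customer', how='left',
                         validate='one_to_one')
print(merged)
```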


### 2.4 Add lookup data to the Sales dataframe

New: df_sales
- Sales information from "df_sales"
- Business unit information from "df_business_unit"
- Channel information from "df_channel"
- Customer type information from "df_customer_type"

In [14]:
df_sales = df_sales.merge(df_business_unit, on='id_business_unit', how='left')

df_sales = df_sales.merge(df_channel, on='id_channel', how='left')

df_sales = df_sales.merge(df_customer_type, on='id_customer_type', how='left')

In [15]:
df_sales = df_sales[['date', 
                     'id_store', 'business_unit', 
                     'channel', 'id_product',
                     'id_customer', 'customer_type',
                     'id_coupon', 'id_address_sale', 
                     'items', 'gross_revenue', 'tax_value', 'costs', 'tax_rate','net_revenue', 
                     'cumulative_sales']]

df_sales

Unnamed: 0,date,id_store,business_unit,channel,id_product,id_customer,customer_type,id_coupon,id_address_sale,items,gross_revenue,tax_value,costs,tax_rate,net_revenue,cumulative_sales
0,2022-09-21,F)T`P;^+F]5F7YX^S\=+?&,Produtos,Loja,"AR^$EA+5@,Q][""V`\\VQC,",N$P5WZFC9VKQM(XS1DBJZ*,Não Identificado,"N_N,M-K1I34E(DW*-FHTX.","H=ZO(L""MR+7D](@#\""/NG)",0.6,16.200,4.410,5.736,0.2722,6.054,1
1,2022-08-06,F!25!6;D=./F%2(E)D;]P0,Produtos,Loja,FC^22=\(:F=0J=F6TNPD.&,"P0K'8UWIADS?T""+9:-W@6*",Não Identificado,"OEUPW7V[BY>]:>T;Y3""KM(",F'^..@O;\5E;#O4(^_'$0+,0.6,11.994,2.154,3.084,0.1796,6.756,1
2,2022-08-08,"FO1G""YC0G6I&C(,H&(MT3-",Produtos,Loja,C2O9ATWBXT.B)L-4@Y-FI$,E\^N9TRHKU5ABQ1=?;J./',Não Identificado,"B7,9,^VTQPPN)M\$G""/I,,","E0(*AW^9CG6ACQ2*,&@LL-",0.6,11.172,3.048,3.192,0.2728,4.932,1
3,2022-07-28,"FO)5JW59TP?&C:?,ZG$$L*",Produtos,Loja,"CIJ6#@,X$@9;"",SB)891P$","K*""!]6VAE;=C*BS-]/@_*%",Não Identificado,JT=[J-@0OCG!X.YRK&Q&3!,"H:E2P/CNLQ@T$!(,0BJ,G+",0.6,2.394,0.222,0.468,0.0927,1.704,1
4,2022-10-26,F!25!6;D=./F%2(E)D;]P0,Produtos,Loja,"NLPPIQIA=`-_>0I1)P[$5""",EO2UJ\6RWGUP-OSON/+Y)%,Não Identificado,"E+VB+$M>3GM`:$V#W?V,P*","A$ER-0#JA]ENP.WRSZ)#""+",1.8,4.482,1.224,1.296,0.2731,1.962,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4048285,2022-11-26,"FO1G""YC0G6I&C(,H&(MT3-",Produtos,Loja,"C+]_@QHN^Z""=13$34#KGF#",GC9:='0DCELH82JYG)CE(!,Não Identificado,"DUUZ-0-*N=1SDL-JN#[>R,","OBDILQJOCNA?0*%*,8Q:F0",0.6,64.794,5.994,34.380,0.0925,24.420,193
4048286,2022-08-25,"FO1G""YC0G6I&C(,H&(MT3-",Produtos,Loja,"CK&\:`,K_NP=E*Y<]:\GM%",AZ^6'B)X#8HHSQ^9M5PJP*,Não Identificado,"KK\WXP)(JCG_H'8N$\*#B""","D&K,JBR?""-U2O&$.SR<;@$",0.6,53.964,4.986,30.816,0.0924,18.162,24
4048287,2022-11-02,F!25!6;D=./F%2(E)D;]P0,Produtos,Loja,M90DRT:D\`0DC!C8'J:-M#,"DS"";/NU6Y#[GLME[D<I/7&",Não Identificado,"CR:""Y2$*\ZQ:ND49WDPA$+","P`4O-RQI48+`O%\OK?YZ(,",0.6,36.114,6.498,18.498,0.1799,11.118,2
4048288,2022-10-28,"FO1G""YC0G6I&C(,H&(MT3-",Produtos,Loja,C3VP.=?FF8`!%X^4TLZG-0,"C!LPVVC?4RXH?GP_?OL5Q""",Não Identificado,"MP<Q*-GE8""R^DH?JF$G;W""","I>--[,+0/3X)N`'TLHBG+-",0.6,6.714,1.830,2.406,0.2726,2.478,36


### 2.5 Add more information to the Business Goals dataframe

##### 2.5.1 Get the sales for each day and store

In [16]:
sales_period_store = df_sales.groupby(['date', 'id_store'])\
                .agg({
                    'items': 'sum',
                    'gross_revenue': 'sum'}).reset_index()

sales_period_store

Unnamed: 0,date,id_store,items,gross_revenue
0,2021-01-02,F!25!6;D=./F%2(E)D;]P0,1119.6,25898.100
1,2021-01-02,"F%#+YX,X!FRF<FHD):`=9+",907.8,23662.638
2,2021-01-02,F)T`P;^+F]5F7YX^S\=+?&,1154.4,30992.874
3,2021-01-02,"FO)5JW59TP?&C:?,ZG$$L*",916.2,19486.866
4,2021-01-02,"FO1G""YC0G6I&C(,H&(MT3-",0.0,0.000
...,...,...,...,...
3630,2022-12-31,F!25!6;D=./F%2(E)D;]P0,706.8,19767.834
3631,2022-12-31,"F%#+YX,X!FRF<FHD):`=9+",699.0,18471.312
3632,2022-12-31,F)T`P;^+F]5F7YX^S\=+?&,1224.6,35505.378
3633,2022-12-31,"FO)5JW59TP?&C:?,ZG$$L*",844.2,18085.818


##### 2.5.2 Get the goal for each day and store

In [17]:
goal_period_store = df_business_goal.groupby(['date', 'id_store'])\
                .agg({
                    'sales_goal': 'sum' }).reset_index()

goal_period_store

Unnamed: 0,date,id_store,sales_goal
0,2021-01-01,F!25!6;D=./F%2(E)D;]P0,0.000
1,2021-01-01,"F%#+YX,X!FRF<FHD):`=9+",0.000
2,2021-01-01,F)T`P;^+F]5F7YX^S\=+?&,0.000
3,2021-01-01,"FO)5JW59TP?&C:?,ZG$$L*",0.000
4,2021-01-01,"FO1G""YC0G6I&C(,H&(MT3-",0.000
...,...,...,...
3645,2022-12-31,F!25!6;D=./F%2(E)D;]P0,18323.364
3646,2022-12-31,"F%#+YX,X!FRF<FHD):`=9+",16336.596
3647,2022-12-31,F)T`P;^+F]5F7YX^S\=+?&,33015.048
3648,2022-12-31,"FO)5JW59TP?&C:?,ZG$$L*",19925.220


##### 2.5.3 Merge the Goal per day and store with Sales per day and store.

In [18]:
df_goal_store_period = goal_period_store.merge(sales_period_store, on=['id_store', 'date'], how='left')

# drop the days without a goal or without sales
df_goal_store_period = df_goal_store_period.dropna(subset=['sales_goal', 'gross_revenue'])

df_goal_store_period

Unnamed: 0,date,id_store,sales_goal,items,gross_revenue
5,2021-01-02,F!25!6;D=./F%2(E)D;]P0,22445.046,1119.6,25898.100
6,2021-01-02,"F%#+YX,X!FRF<FHD):`=9+",19573.326,907.8,23662.638
7,2021-01-02,F)T`P;^+F]5F7YX^S\=+?&,25188.930,1154.4,30992.874
8,2021-01-02,"FO)5JW59TP?&C:?,ZG$$L*",17902.728,916.2,19486.866
9,2021-01-02,"FO1G""YC0G6I&C(,H&(MT3-",0.000,0.0,0.000
...,...,...,...,...,...
3645,2022-12-31,F!25!6;D=./F%2(E)D;]P0,18323.364,706.8,19767.834
3646,2022-12-31,"F%#+YX,X!FRF<FHD):`=9+",16336.596,699.0,18471.312
3647,2022-12-31,F)T`P;^+F]5F7YX^S\=+?&,33015.048,1224.6,35505.378
3648,2022-12-31,"FO)5JW59TP?&C:?,ZG$$L*",19925.220,844.2,18085.818


##### 2.5.4 Create new columns to be used further on

In [19]:
df_goal_store_period['goal_%'] = round((df_goal_store_period['gross_revenue'] / df_goal_store_period['sales_goal']), 2)

# vectorized: 1 when the goal was reached, 0 otherwise (NaN counts as not reached)
df_goal_store_period['goal_result'] = (df_goal_store_period['goal_%'] >= 1).astype(int)

df_goal_store_period['goal_result'] = df_goal_store_period['goal_result'].astype('category')

df_goal_store_period

Unnamed: 0,date,id_store,sales_goal,items,gross_revenue,goal_%,goal_result
5,2021-01-02,F!25!6;D=./F%2(E)D;]P0,22445.046,1119.6,25898.100,1.15,1
6,2021-01-02,"F%#+YX,X!FRF<FHD):`=9+",19573.326,907.8,23662.638,1.21,1
7,2021-01-02,F)T`P;^+F]5F7YX^S\=+?&,25188.930,1154.4,30992.874,1.23,1
8,2021-01-02,"FO)5JW59TP?&C:?,ZG$$L*",17902.728,916.2,19486.866,1.09,1
9,2021-01-02,"FO1G""YC0G6I&C(,H&(MT3-",0.000,0.0,0.000,,0
...,...,...,...,...,...,...,...
3645,2022-12-31,F!25!6;D=./F%2(E)D;]P0,18323.364,706.8,19767.834,1.08,1
3646,2022-12-31,"F%#+YX,X!FRF<FHD):`=9+",16336.596,699.0,18471.312,1.13,1
3647,2022-12-31,F)T`P;^+F]5F7YX^S\=+?&,33015.048,1224.6,35505.378,1.08,1
3648,2022-12-31,"FO)5JW59TP?&C:?,ZG$$L*",19925.220,844.2,18085.818,0.91,0
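
When `sales_goal` is 0, the division above yields `NaN` (0/0, as in row 9) or `inf` (positive revenue over a zero goal). A minimal sketch, on hypothetical toy values, of masking zero goals so the ratio is simply missing for those days. Treating a zero goal as "no goal" is an assumption:

```python
import numpy as np
import pandas as pd

# toy data (hypothetical values); the second day has no goal
toy = pd.DataFrame({
    'sales_goal': [100.0, 0.0, 200.0],
    'gross_revenue': [115.0, 50.0, 180.0],
})

# replace zero goals with NaN so the division never produces inf
goal = toy['sales_goal'].replace(0, np.nan)
toy['goal_%'] = (toy['gross_revenue'] / goal).round(2)

# NaN >= 1 evaluates to False, so days without a goal count as not reached
toy['goal_result'] = (toy['goal_%'] >= 1).astype(int)
print(toy)
```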


## 3. Load data

After transforming the data, we are ready to send the dataframes to BigQuery so they can be used in Tableau for analysis and data visualization.

In [20]:
def send_to_bq(df, table_name):
    pandas_gbq.to_gbq(df, table_name, project_id='sales-store-383520', if_exists='replace')

In [21]:
send_to_bq(df_products, 'data.products')

100%|██████████| 1/1 [00:00<?, ?it/s]


In [22]:
send_to_bq(df_stores, 'data.stores')

100%|██████████| 1/1 [00:00<?, ?it/s]


In [23]:
send_to_bq(df_sales, 'data.sales')

100%|██████████| 1/1 [00:00<?, ?it/s]


In [24]:
send_to_bq(df_customers, 'data.customers')

100%|██████████| 1/1 [00:00<?, ?it/s]


In [25]:
send_to_bq(df_goal_store_period, 'data.business_goals')

100%|██████████| 1/1 [00:00<?, ?it/s]


Save the CSV locally if needed.

In [26]:
df_sales.to_csv('data_load/sales.csv', index=None)

df_customers.to_csv('data_load/customers.csv', index=None)

df_goal_store_period.to_csv('data_load/business_goals.csv', index=None)

df_stores.to_csv('data_load/stores.csv', index=None)

df_products.to_csv('data_load/products.csv', index=None)
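
`DataFrame.to_csv` does not create missing folders, so the cell above assumes that `data_load/` already exists. A minimal sketch that creates the folder first; the helper name `save_csv` is hypothetical:

```python
from pathlib import Path

import pandas as pd


def save_csv(df, folder, name):
    """Write `df` to <folder>/<name>.csv, creating the folder if needed."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}.csv"
    df.to_csv(out_path, index=False)
    return out_path
```

For example, `save_csv(df_sales, 'data_load', 'sales')` would write `data_load/sales.csv`.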

## Next Steps

1. Automate data extraction
2. Fine-tune the code to run faster
3. Run it periodically