# ETL PROJECT - RETAIL TRANSACTIONS

The dataset used in this project comes from the Kaggle platform. It contains information about online retail transactions: customer purchases, including the invoice number, stock code, description of the items purchased, quantity, unit price, invoice date, customer ID, and country.


## 1.1 - Column Dictionary



**1) InvoiceNo:** A unique identifier for each transaction or invoice.

**2) StockCode:** A code representing the stock or item purchased.

**3) Description:** A textual description of the item purchased.

**4) Quantity:** The quantity of the item purchased in each transaction.

**5) InvoiceDate:** The date and time when the transaction occurred.

**6) UnitPrice:** The price per unit of the item purchased.

**7) CustomerID:** The unique identifier for the customer making the purchase.

**8) Country:** The country where the transaction took place.
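As a quick sanity check, the dictionary above can be turned into a list of expected column names and compared against whatever table is loaded. The sketch below is only an illustration: `EXPECTED_COLUMNS`, `missing_columns`, and the toy `sample` frame are hypothetical names and made-up values, not part of the original notebook.

```python
import pandas as pd

# Expected columns, taken from the dictionary above.
EXPECTED_COLUMNS = [
    "InvoiceNo", "StockCode", "Description", "Quantity",
    "InvoiceDate", "UnitPrice", "CustomerID", "Country",
]


def missing_columns(frame: pd.DataFrame) -> list[str]:
    """Return the expected columns that are absent from `frame` (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in frame.columns]


# Toy frame with made-up values, used only to illustrate the check.
sample = pd.DataFrame({"InvoiceNo": ["X1"], "StockCode": ["S1"], "Quantity": [1]})
missing = missing_columns(sample)
print(missing)
```

Running the same helper on the extracted table should return an empty list if the schema matches the dictionary.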


## 1- Extract


The table used in this project is stored in Google BigQuery, so the first step is to extract all of its data from there.

In [4]:
#import dependencies

import os
import pandas as pd
from google.cloud import bigquery
from google.oauth2 import service_account
import pandas_gbq


In [5]:

# Set the path to your service account key JSON file
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "GBQ.json"

# Query the whole table, load it into a DataFrame, and save a local CSV copy
selectQuery = """SELECT * FROM etl-project-416319.project_1.customers"""
bigqueryClient = bigquery.Client()
df = bigqueryClient.query(selectQuery).to_dataframe()
df.to_csv("customer.csv", index=False)


In [6]:
df

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,A563185,B,Adjust bad debt,1,2011-08-12 14:50:00+00:00,11062.06,,United Kingdom
1,A563186,B,Adjust bad debt,1,2011-08-12 14:51:00+00:00,-11062.06,,United Kingdom
2,A563187,B,Adjust bad debt,1,2011-08-12 14:52:00+00:00,-11062.06,,United Kingdom
3,C552650,D,Discount,-18,2011-05-10 14:03:00+00:00,1.45,16672.0,United Kingdom
4,C545478,D,Discount,-720,2011-03-03 11:08:00+00:00,0.01,16422.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,561513,gift_0001_40,Dotcomgiftshop Gift Voucher £40.00,1,2011-07-27 15:12:00+00:00,33.33,,United Kingdom
541905,539958,gift_0001_50,Dotcomgiftshop Gift Voucher £50.00,1,2010-12-23 13:26:00+00:00,42.55,,United Kingdom
541906,552232,gift_0001_50,Dotcomgiftshop Gift Voucher £50.00,1,2011-05-06 15:54:00+00:00,41.67,,United Kingdom
541907,558066,gift_0001_50,Dotcomgiftshop Gift Voucher £50.00,1,2011-06-24 15:45:00+00:00,41.67,,United Kingdom


## 2- Transform

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype              
---  ------       --------------   -----              
 0   InvoiceNo    541909 non-null  object             
 1   StockCode    541909 non-null  object             
 2   Description  540455 non-null  object             
 3   Quantity     541909 non-null  Int64              
 4   InvoiceDate  541909 non-null  datetime64[us, UTC]
 5   UnitPrice    541909 non-null  float64            
 6   CustomerID   406829 non-null  float64            
 7   Country      541909 non-null  object             
dtypes: Int64(1), datetime64[us, UTC](1), float64(2), object(4)
memory usage: 33.6+ MB


In [7]:
# checking column types

df.dtypes

InvoiceNo                   object
StockCode                   object
Description                 object
Quantity                     Int64
InvoiceDate    datetime64[us, UTC]
UnitPrice                  float64
CustomerID                 float64
Country                     object
dtype: object

In [11]:
df.shape

(541909, 8)

In [14]:
# checking for duplicates

df.duplicated().sum()

5268
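Note that `duplicated()` with its default `keep="first"` flags only the repeated occurrences of a row, not the first copy, so the count above is the number of rows that `drop_duplicates()` would remove. A minimal sketch on a toy frame (made-up values, not taken from the dataset) shows the difference from `keep=False`, which flags every copy:

```python
import pandas as pd

# Toy frame with made-up values: the first three rows are identical.
toy = pd.DataFrame({
    "InvoiceNo": ["X1", "X1", "X1", "X2"],
    "StockCode": ["S1", "S1", "S1", "S2"],
    "Quantity": [2, 2, 2, 5],
})

# Default keep="first": only the second and third copies are flagged.
repeats_only = int(toy.duplicated().sum())

# keep=False: every row that has at least one identical twin is flagged.
all_copies = int(toy.duplicated(keep=False).sum())

# drop_duplicates() removes exactly the rows counted by `repeats_only`.
rows_after_drop = len(toy.drop_duplicates())

print(repeats_only, all_copies, rows_after_drop)
```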

In [19]:

# select the duplicated rows
duplicated = df[df.duplicated()]


print(duplicated)

# Optionally, count the number of duplicates found
print("Number of duplicates:", len(duplicated))


       InvoiceNo StockCode                      Description  Quantity  \
449       572344         M                           Manual        48   
450       572344         M                           Manual        48   
451       572344         M                           Manual        48   
452       572344         M                           Manual        48   
453       572344         M                           Manual        48   
...          ...       ...                              ...       ...   
540144    568212    90199C  5 STRAND GLASS NECKLACE CRYSTAL         1   
540153   C568370    90199C  5 STRAND GLASS NECKLACE CRYSTAL        -1   
540332    554112    90200D         PINK SWEETHEART BRACELET         1   
540633    575258    90206C     CRYSTAL DIAMANTE STAR BROOCH         1   
540676    562204    90209B     GREEN ENAMEL+GLASS HAIR COMB         1   

                     InvoiceDate  UnitPrice  CustomerID         Country  
449    2011-10-24 10:43:00+00:00       1.50     1

In [20]:
# dropping duplicated data

df = df.drop_duplicates()
df

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,A563185,B,Adjust bad debt,1,2011-08-12 14:50:00+00:00,11062.06,,United Kingdom
1,A563186,B,Adjust bad debt,1,2011-08-12 14:51:00+00:00,-11062.06,,United Kingdom
2,A563187,B,Adjust bad debt,1,2011-08-12 14:52:00+00:00,-11062.06,,United Kingdom
3,C552650,D,Discount,-18,2011-05-10 14:03:00+00:00,1.45,16672.0,United Kingdom
4,C545478,D,Discount,-720,2011-03-03 11:08:00+00:00,0.01,16422.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,561513,gift_0001_40,Dotcomgiftshop Gift Voucher £40.00,1,2011-07-27 15:12:00+00:00,33.33,,United Kingdom
541905,539958,gift_0001_50,Dotcomgiftshop Gift Voucher £50.00,1,2010-12-23 13:26:00+00:00,42.55,,United Kingdom
541906,552232,gift_0001_50,Dotcomgiftshop Gift Voucher £50.00,1,2011-05-06 15:54:00+00:00,41.67,,United Kingdom
541907,558066,gift_0001_50,Dotcomgiftshop Gift Voucher £50.00,1,2011-06-24 15:45:00+00:00,41.67,,United Kingdom


In [21]:
# checking for missing values

df.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135037
Country             0
dtype: int64

In [23]:
df[df['Description'].isna()]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
715,575506,C2,,150,2011-11-10 10:30:00+00:00,0.0,,United Kingdom
859,547966,DOT,,1000,2011-03-28 15:49:00+00:00,0.0,,United Kingdom
1589,540699,POST,,1000,2011-01-11 09:32:00+00:00,0.0,,United Kingdom
1590,554857,POST,,800,2011-05-27 10:08:00+00:00,0.0,,United Kingdom
1591,565556,POST,,750,2011-09-05 12:14:00+00:00,0.0,,United Kingdom
...,...,...,...,...,...,...,...,...
541774,542531,DCGS0072,,-1,2011-01-28 13:08:00+00:00,0.0,,United Kingdom
541776,542532,DCGS0074,,-1,2011-01-28 13:09:00+00:00,0.0,,United Kingdom
541824,561255,DCGS0066P,,-3,2011-07-26 11:52:00+00:00,0.0,,United Kingdom
541875,564762,gift_0001_10,,30,2011-08-30 10:48:00+00:00,0.0,,United Kingdom


In most rows where the Description is missing, the CustomerID is also missing and the UnitPrice is equal to 0. These rows carry no usable sales information, so they will be deleted.
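One way to back this observation with numbers, rather than by inspecting the preview, is to measure what share of the rows without a Description also lack a CustomerID or have a zero UnitPrice. The sketch below runs on a toy frame with made-up values; the variable names are illustrative, and the same expressions could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, used only to illustrate the check.
toy = pd.DataFrame({
    "Description": [None, None, "Manual", None],
    "UnitPrice": [0.0, 0.0, 1.5, 0.0],
    "CustomerID": [np.nan, np.nan, 16672.0, 12346.0],
})

# Keep only the rows whose Description is missing.
no_description = toy[toy["Description"].isna()]

# Share of those rows that also have no CustomerID.
share_no_customer = no_description["CustomerID"].isna().mean()

# Share of those rows whose UnitPrice is exactly 0.
share_zero_price = (no_description["UnitPrice"] == 0).mean()

print(share_no_customer, share_zero_price)
```

Shares close to 1 would support dropping these rows; a noticeably lower share would suggest looking at the exceptions first.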

In [25]:
df = df.dropna(subset=['Description'])

In [26]:
df.isnull().sum()

InvoiceNo           0
StockCode           0
Description         0
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     133583
Country             0
dtype: int64

In [28]:
df_null_customer=df[df['CustomerID'].isna()]

In [29]:
df_null_customer.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,133583.0,133583.0,0.0
mean,2.120479,8.166272,
std,62.072857,152.747198,
min,-9600.0,-11062.06,
25%,1.0,1.63,
50%,1.0,3.29,
75%,3.0,5.79,
max,4000.0,17836.46,


In [30]:
df.describe()


Unnamed: 0,Quantity,UnitPrice,CustomerID
count,535187.0,535187.0,401604.0
mean,9.671593,4.645242,15281.160818
std,219.059056,97.36481,1714.006089
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13939.0
50%,3.0,2.08,15145.0
75%,10.0,4.13,16784.0
max,80995.0,38970.0,18287.0


The dataframe still has missing values in the CustomerID column. If the rest of the data in these rows suggested that the purchase never took place, deleting the records would be one approach to consider. Likewise, if the objective of the analysis were to understand the customer profile and how customers shop on the site, records without identification would hinder the analysis, and deleting them would be the best approach.

Here, the objective is to understand sales metrics, which will later be used to build the dashboard, so these values will be replaced instead. In this column, the smallest ID is 12346 and the largest is 18287. To make the formerly null records easy to identify, they will be replaced by the value 1, which lies outside that range. Thus, whenever the value 1 appears in the CustomerID column, it means the original value was missing.
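The replacement described above relies on the placeholder not colliding with any real ID. A minimal sketch of that check and of the replacement itself, on a toy column with made-up rows (`SENTINEL_ID` and the other names are illustrative assumptions, not from the original notebook):

```python
import numpy as np
import pandas as pd

SENTINEL_ID = 1  # placeholder meaning "customer unknown", as chosen in the text

# Toy column with made-up rows: two known IDs and two missing ones.
toy = pd.DataFrame({"CustomerID": [16672.0, np.nan, 12346.0, np.nan]})

# The sentinel is only safe if it is smaller than every real ID.
known_ids = toy["CustomerID"].dropna()
sentinel_is_safe = bool(SENTINEL_ID < known_ids.min())

# Replace the missing IDs and convert the column to integers.
filled = toy["CustomerID"].fillna(SENTINEL_ID).astype(int)

# Count the rows that now carry the placeholder.
n_unknown = int((filled == SENTINEL_ID).sum())

print(sentinel_is_safe, filled.tolist(), n_unknown)
```

Keeping the placeholder in a named constant makes it easier to filter these rows out later, for example when computing per-customer metrics.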

In [32]:
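# assign the result back instead of using inplace=True on a column selection,
# which may not update df in recent pandas versions
df['CustomerID'] = df['CustomerID'].fillna(1)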


In [33]:
# transform CustomerID column type into integer

df['CustomerID'] = df['CustomerID'].astype(int)

In [34]:
df

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,A563185,B,Adjust bad debt,1,2011-08-12 14:50:00+00:00,11062.06,1,United Kingdom
1,A563186,B,Adjust bad debt,1,2011-08-12 14:51:00+00:00,-11062.06,1,United Kingdom
2,A563187,B,Adjust bad debt,1,2011-08-12 14:52:00+00:00,-11062.06,1,United Kingdom
3,C552650,D,Discount,-18,2011-05-10 14:03:00+00:00,1.45,16672,United Kingdom
4,C545478,D,Discount,-720,2011-03-03 11:08:00+00:00,0.01,16422,United Kingdom
...,...,...,...,...,...,...,...,...
541904,561513,gift_0001_40,Dotcomgiftshop Gift Voucher £40.00,1,2011-07-27 15:12:00+00:00,33.33,1,United Kingdom
541905,539958,gift_0001_50,Dotcomgiftshop Gift Voucher £50.00,1,2010-12-23 13:26:00+00:00,42.55,1,United Kingdom
541906,552232,gift_0001_50,Dotcomgiftshop Gift Voucher £50.00,1,2011-05-06 15:54:00+00:00,41.67,1,United Kingdom
541907,558066,gift_0001_50,Dotcomgiftshop Gift Voucher £50.00,1,2011-06-24 15:45:00+00:00,41.67,1,United Kingdom
