# Nortal - Data Engineer Assignment

Written by: Enlik

## Mapping Data

Our client sent us their data in CSV format, and we need to map it to our internal
schema. In this task, we expect you to load the CSV file, rename the columns according to the
mapping below, and cast the columns to the correct format. In addition to the mapping:

- Timestamps are in GMT+3; convert them to GMT+1.
- Generate an id column for the items table by concatenating the transaction id column with
the product id column, separated by a dash (-). Ex.: 64532-676
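
The mapping table itself is not reproduced in this notebook, so the following is only a minimal sketch of the rename-and-cast step. The target names in `TRX_MAPPING` are hypothetical placeholders, and the sample rows are copied from the transactions preview in this notebook.

```python
import io
import pandas as pd

# Sample rows copied from the Transactions.csv preview in this notebook.
raw = io.StringIO(
    "ID,Timestamp,Guset ID,Amount\n"
    "111,08/24/2021 06:08:08,,1.64\n"
    "112,08/24/2021 06:13:45,22.0,3.27\n"
    "113,08/24/2021 06:15:11,11.0,6.12\n"
)

# Hypothetical mapping: the real target names come from the assignment's mapping table.
TRX_MAPPING = {
    "ID": "id",
    "Timestamp": "created_at",
    "Guset ID": "guest_id",
    "Amount": "amount",
}

mapped = pd.read_csv(raw).rename(columns=TRX_MAPPING)

# Cast each column explicitly instead of relying on type inference.
mapped["created_at"] = pd.to_datetime(mapped["created_at"], format="%m/%d/%Y %H:%M:%S")
# The nullable integer dtype keeps missing guests as <NA> instead of forcing floats.
mapped["guest_id"] = mapped["guest_id"].astype("Int64")
mapped["amount"] = mapped["amount"].astype("float64")
```

Passing an explicit `format` to `pd.to_datetime` avoids any ambiguity between month-first and day-first dates.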

In [204]:
# Libraries needed
import pandas as pd
from datetime import timedelta

In [54]:
items = pd.read_csv('dataset/items.csv')
items.head()

Unnamed: 0,Transaction ID,Product ID
0,110,1
1,111,1
2,112,2
3,112,3
4,113,1


In [55]:
trx = pd.read_csv('dataset/Transactions.csv')
trx.head()

Unnamed: 0,ID,Timestamp,Guset ID,Amount
0,111,08/24/2021 06:08:08,,1.64
1,112,08/24/2021 06:13:45,22.0,3.27
2,113,08/24/2021 06:15:11,11.0,6.12
3,114,08/24/2021 06:24:30,11.0,15.23
4,115,08/24/2021 06:27:43,33.0,5.41


### Timestamps are in GMT+3, convert them to GMT+1

- The code below converts the timestamps manually by subtracting two hours.
- This is necessary because the raw data in 'Transactions.csv' carries no UTC offset or time zone information.
- If it did, the pandas method `dt.tz_convert()` could convert the timestamps automatically based on the time zone region.

In [50]:
trx = pd.read_csv('dataset/Transactions.csv')
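# Parse the naive timestamps (the file carries no time zone information)
trx['Timestamp'] = pd.to_datetime(trx['Timestamp'])
# GMT+3 -> GMT+1 is a fixed shift of two hours
trx['Timestamp'] = trx['Timestamp'] - timedelta(hours=2)
# Alternative to the manual shift: a naive column must be localized before tz_convert works
# trx['Timestamp_GMT+3'] = trx['Timestamp'].dt.tz_localize('Europe/Tallinn')
# trx['Timestamp_GMT+1'] = trx['Timestamp_GMT+3'].dt.tz_convert('Europe/London')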

trx.head(5)

Unnamed: 0,ID,Timestamp,Guset ID,Amount
0,111,2021-08-24 04:08:08,,1.64
1,112,2021-08-24 04:13:45,22.0,3.27
2,113,2021-08-24 04:15:11,11.0,6.12
3,114,2021-08-24 04:24:30,11.0,15.23
4,115,2021-08-24 04:27:43,33.0,5.41
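
As a hedged alternative to the manual shift, the known source offset can be attached to the naive timestamps and pandas can do the conversion. This is a sketch that assumes the data really is at a fixed GMT+3 offset; the constants `GMT_PLUS_3` and `GMT_PLUS_1` and the two sample rows are introduced here for illustration only.

```python
from datetime import timedelta, timezone

import pandas as pd

# Fixed-offset time zones (assumption: no daylight-saving rules apply to the data).
GMT_PLUS_3 = timezone(timedelta(hours=3))
GMT_PLUS_1 = timezone(timedelta(hours=1))

sample = pd.DataFrame({"Timestamp": ["08/24/2021 06:08:08", "08/24/2021 06:13:45"]})
naive = pd.to_datetime(sample["Timestamp"], format="%m/%d/%Y %H:%M:%S")

# Attach the source offset, convert to the target offset, then drop the offset
# again so the column stays naive like the rest of the notebook.
converted = (
    naive.dt.tz_localize(GMT_PLUS_3)
    .dt.tz_convert(GMT_PLUS_1)
    .dt.tz_localize(None)
)
print(converted)
```

With fixed offsets this gives the same values as subtracting two hours, but the source and target offsets are stated explicitly in the code.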


### Generating the `ID` column for items

Generate an id column for the items table by concatenating the transaction id column with
the product id column, separated by a dash (-). Ex.: 64532-676

In [43]:
items = pd.read_csv('dataset/items.csv')
items.head()

Unnamed: 0,Transaction ID,Product ID
0,110,1
1,111,1
2,112,2
3,112,3
4,113,1


In [46]:
items['ID'] = items['Transaction ID'].astype(str) + '-' + items['Product ID'].astype(str)
items

Unnamed: 0,Transaction ID,Product ID,ID
0,110,1,110-1
1,111,1,111-1
2,112,2,112-2
3,112,3,112-3
4,113,1,113-1
5,116,5,116-5
6,116,4,116-4
7,116,7,116-7
8,117,8,117-8
9,118,5,118-5


## Generate Guest Table

- Our client didn’t send the guests CSV, but the transactions file has the columns `Customer ID` (named `Guset ID` in the file) and `Timestamp`, which can be used to generate the guest table.
- Since the same `Customer ID` may appear in several transactions and we don’t want duplicate guests, take the first timestamp for each guest.

In [233]:
# trx_clean = trx[['Guset ID','Timestamp']].dropna().drop_duplicates().reset_index(drop=True)
trx_clean = trx[['Guset ID','Timestamp']].dropna().reset_index(drop=True)

trx_clean

Unnamed: 0,Guset ID,Timestamp
0,22.0,08/24/2021 06:13:45
1,11.0,08/24/2021 06:15:11
2,11.0,08/24/2021 06:24:30
3,33.0,08/24/2021 06:27:43
4,44.0,08/24/2021 06:28:43
5,22.0,08/24/2021 06:31:18
6,11.0,08/24/2021 06:34:00
7,55.0,08/24/2021 06:34:19
8,66.0,08/24/2021 06:35:42
9,77.0,08/24/2021 06:36:31


In [202]:
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for guest_id in trx_clean['Guset ID'].unique():
    # Earliest timestamp among this guest's transactions
    first_seen = trx_clean.loc[trx_clean['Guset ID'] == guest_id, 'Timestamp'].min()
    rows.append({'guest_id': guest_id, 'created_at': first_seen})

guest_df = pd.DataFrame(rows, columns=['guest_id', 'created_at'])

In [203]:
guest_df = guest_df.sort_values(by=['guest_id'])
guest_df

Unnamed: 0,guest_id,created_at
1,11.0,08/24/2021 06:15:11
0,22.0,08/24/2021 06:13:45
2,33.0,08/24/2021 06:27:43
3,44.0,08/24/2021 06:28:43
4,55.0,08/24/2021 06:34:19
5,66.0,08/24/2021 06:35:42
6,77.0,08/24/2021 06:36:31
7,88.0,08/24/2021 06:37:15
8,99.0,08/24/2021 07:33:16
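
The same guest table can probably be built without an explicit loop. The sketch below uses `groupby(...).min()` on a handful of rows copied from the output above; it parses the timestamps first so that the minimum is chronological rather than a string comparison. The names `sample` and `guests` are illustrative only.

```python
import pandas as pd

# A few guest/timestamp pairs copied from the trx_clean output above.
sample = pd.DataFrame({
    "Guset ID": [22.0, 11.0, 11.0, 33.0, 22.0],
    "Timestamp": [
        "08/24/2021 06:13:45",
        "08/24/2021 06:15:11",
        "08/24/2021 06:24:30",
        "08/24/2021 06:27:43",
        "08/24/2021 06:31:18",
    ],
})
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"], format="%m/%d/%Y %H:%M:%S")

# One row per guest, keeping the earliest timestamp as created_at.
guests = (
    sample.groupby("Guset ID", as_index=False)["Timestamp"]
    .min()
    .rename(columns={"Guset ID": "guest_id", "Timestamp": "created_at"})
)
print(guests)
```

`groupby` sorts by the group key by default, so the result is already ordered by `guest_id`.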


## Ensure Data Integrity


### All transactions contain items. Transactions without items need to be excluded.

In [220]:
# Transactions without items
trx[~trx['ID'].isin(items['Transaction ID'])].reset_index(drop=True)

Unnamed: 0,ID,Timestamp,Guset ID,Amount
0,114,08/24/2021 06:24:30,11.0,15.23
1,115,08/24/2021 06:27:43,33.0,5.41
2,125,08/24/2021 06:49:41,88.0,15.98


In [223]:
# All valid transactions
trx[trx['ID'].isin(items['Transaction ID'])].reset_index(drop=True)

Unnamed: 0,ID,Timestamp,Guset ID,Amount
0,111,08/24/2021 06:08:08,,1.64
1,112,08/24/2021 06:13:45,22.0,3.27
2,113,08/24/2021 06:15:11,11.0,6.12
3,116,08/24/2021 06:28:43,44.0,9.19
4,117,08/24/2021 06:31:18,22.0,8.13
5,118,08/24/2021 06:34:00,11.0,13.96
6,119,08/24/2021 06:34:19,55.0,2.07
7,120,08/24/2021 06:35:42,66.0,3.06
8,121,08/24/2021 06:36:31,77.0,7.44
9,122,08/24/2021 06:37:15,88.0,9.19


### All items belong to a transaction. Items without a transaction need to be excluded.

In [225]:
# Items without a transaction
items[~items['Transaction ID'].isin(trx['ID'])].reset_index(drop=True)

Unnamed: 0,Transaction ID,Product ID
0,110,1
1,131,8


In [226]:
# All valid items
items[items['Transaction ID'].isin(trx['ID'])].reset_index(drop=True)

Unnamed: 0,Transaction ID,Product ID
0,111,1
1,112,2
2,112,3
3,113,1
4,116,5
5,116,4
6,116,7
7,117,8
8,118,5
9,119,3


### All transactions contain guests. Transactions without guests need to be excluded.

In [240]:
# Transactions without a guest
trx[trx['Guset ID'].isna()].reset_index(drop=True)

Unnamed: 0,ID,Timestamp,Guset ID,Amount
0,111,08/24/2021 06:08:08,,1.64
1,128,08/24/2021 09:48:48,,26.05
2,129,08/24/2021 08:42:50,,2.07


In [239]:
# Transactions that contain a guest (only the guest column is checked for missing values)
trx.dropna(subset=['Guset ID']).reset_index(drop=True)

Unnamed: 0,ID,Timestamp,Guset ID,Amount
0,112,08/24/2021 06:13:45,22.0,3.27
1,113,08/24/2021 06:15:11,11.0,6.12
2,114,08/24/2021 06:24:30,11.0,15.23
3,115,08/24/2021 06:27:43,33.0,5.41
4,116,08/24/2021 06:28:43,44.0,9.19
5,117,08/24/2021 06:31:18,22.0,8.13
6,118,08/24/2021 06:34:00,11.0,13.96
7,119,08/24/2021 06:34:19,55.0,2.07
8,120,08/24/2021 06:35:42,66.0,3.06
9,121,08/24/2021 06:36:31,77.0,7.44
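
The three rules above are checked one at a time. A minimal sketch of applying them together is shown here, on a few rows copied from the tables in this notebook. It assumes the exclusions should cascade, so an item is dropped when its transaction is dropped for lacking a guest; that ordering is an assumption, not something the assignment states.

```python
import pandas as pd

# Rows copied from the transactions and items tables in this notebook.
trx_sample = pd.DataFrame({
    "ID": [111, 112, 113, 114],
    "Guset ID": [None, 22.0, 11.0, 11.0],
})
items_sample = pd.DataFrame({
    "Transaction ID": [110, 111, 112, 112, 113],
    "Product ID": [1, 1, 2, 3, 1],
})

# Keep transactions that have both a guest and at least one item.
has_guest = trx_sample["Guset ID"].notna()
has_items = trx_sample["ID"].isin(items_sample["Transaction ID"])
valid_trx = trx_sample[has_guest & has_items].reset_index(drop=True)

# Keep only items that belong to a surviving transaction.
belongs_to_valid_trx = items_sample["Transaction ID"].isin(valid_trx["ID"])
valid_items = items_sample[belongs_to_valid_trx].reset_index(drop=True)

print(valid_trx)
print(valid_items)
```

Here transaction 111 is removed for having no guest and 114 for having no items, so the items of 110 and 111 are removed as well.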


## Unit Test

In [37]:
# example
# https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-zone-handling
# assert pd.Timestamp(d_2037, tz=DST) != pd.Timestamp(d_2037, tz="GMT")
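
A minimal sketch of what unit tests for the steps in this notebook could look like, using plain `assert` statements. The helper functions `shift_gmt3_to_gmt1` and `make_item_id` are hypothetical wrappers around the notebook's logic and are not part of the original code.

```python
from datetime import timedelta

import pandas as pd


def shift_gmt3_to_gmt1(timestamps: pd.Series) -> pd.Series:
    """Hypothetical helper: shift naive GMT+3 timestamps to GMT+1."""
    return timestamps - timedelta(hours=2)


def make_item_id(items: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: build '<transaction id>-<product id>' strings."""
    return items["Transaction ID"].astype(str) + "-" + items["Product ID"].astype(str)


def test_shift_gmt3_to_gmt1():
    ts = pd.Series(pd.to_datetime(["2021-08-24 06:08:08", "2021-08-24 01:30:00"]))
    shifted = shift_gmt3_to_gmt1(ts)
    # The second value crosses midnight into the previous day.
    assert shifted.tolist() == [
        pd.Timestamp("2021-08-24 04:08:08"),
        pd.Timestamp("2021-08-23 23:30:00"),
    ]


def test_make_item_id():
    items = pd.DataFrame({"Transaction ID": [110, 112], "Product ID": [1, 3]})
    assert make_item_id(items).tolist() == ["110-1", "112-3"]


test_shift_gmt3_to_gmt1()
test_make_item_id()
```

The functions follow the `test_` naming convention, so a runner such as pytest could collect them if they were moved into a module.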