# Nortal - Data Engineer Assignment

Written by: Enlik

## Mapping Data

Our client sent us their data in CSV format, and we need to map it to our internal
schema. In this task, we expect you to load the CSV file, rename the columns according to the
mapping below, and cast the columns to the correct types. In addition to the mapping:

- Timestamps are in GMT+3; convert them to GMT+1.
- Generate an id column for the items table by concatenating the transaction id column with
the product id column, separated by a dash (-). Ex.: 64532-676

In [1]:
# Libraries needed
import pandas as pd
from datetime import timedelta

In [2]:
items = pd.read_csv('dataset/items.csv')
items.head()

Unnamed: 0,Transaction ID,Product ID
0,110,1
1,111,1
2,112,2
3,112,3
4,113,1


In [3]:
trx = pd.read_csv('dataset/Transactions.csv')
trx.head()

Unnamed: 0,ID,Timestamp,Guset ID,Amount
0,111,08/24/2021 06:08:08,,1.64
1,112,08/24/2021 06:13:45,22.0,3.27
2,113,08/24/2021 06:15:11,11.0,6.12
3,114,08/24/2021 06:24:30,11.0,15.23
4,115,08/24/2021 06:27:43,33.0,5.41


### Timestamps are in GMT+3, convert them to GMT+1

- The code below converts the time manually, because the raw data in 'Transactions.csv' carries no UTC offset information.
- If the raw data included the offset, the pandas function `dt.tz_convert()` could do the conversion automatically based on the time zone region.

In [4]:
trx = pd.read_csv('dataset/Transactions.csv')
trx['Timestamp'] = pd.to_datetime(trx['Timestamp'], utc=True)

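# The raw values are GMT+3 wall-clock times, but they were parsed above as if they were UTC.
# Europe/London is on BST (UTC+1) in August 2021, so it stands in for GMT+1 here.
trx['Timestamp'] = trx['Timestamp'].dt.tz_convert('Europe/London')
# Subtract the 3 hours by which the GMT+3 source is ahead of UTC.
trx['Timestamp'] = trx['Timestamp'] - timedelta(hours=3)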

# trx['Timestamp_GMT+3'] = trx['Timestamp'].dt.tz_convert('Europe/Tallinn')
# trx['Timestamp_GMT+1'] = trx['Timestamp_GMT+3'].dt.tz_convert('Europe/London')

trx.head(5)

Unnamed: 0,ID,Timestamp,Guset ID,Amount
0,111,2021-08-24 04:08:08+01:00,,1.64
1,112,2021-08-24 04:13:45+01:00,22.0,3.27
2,113,2021-08-24 04:15:11+01:00,11.0,6.12
3,114,2021-08-24 04:24:30+01:00,11.0,15.23
4,115,2021-08-24 04:27:43+01:00,33.0,5.41
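
The same conversion can be sketched by attaching the source offset explicitly instead of shifting by hand. This is a minimal sketch, assuming the raw strings are wall-clock times at a fixed GMT+3 offset; the names `raw`, `gmt_plus_3`, `gmt_plus_1`, `naive`, and `converted` are illustrative, and the two sample values are copied from the first rows of the raw file.

```python
from datetime import timedelta, timezone

import pandas as pd

# Sample values copied from the raw file; assumed to be GMT+3 wall-clock times.
raw = pd.Series(["08/24/2021 06:08:08", "08/24/2021 06:13:45"])

# Fixed-offset zones, so daylight saving rules never enter the picture.
gmt_plus_3 = timezone(timedelta(hours=3))
gmt_plus_1 = timezone(timedelta(hours=1))

# Parse as naive datetimes, attach the source offset, then convert.
naive = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S")
converted = naive.dt.tz_localize(gmt_plus_3).dt.tz_convert(gmt_plus_1)

print(converted.dt.strftime("%Y-%m-%d %H:%M:%S%z").tolist())
```

Fixed offsets are used here rather than a region such as `Europe/London`, because a region only matches GMT+1 while daylight saving time is in effect.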


### Generating the `ID` column for items

Generate an id column for the items table by concatenating the transaction id column with
the product id column, separated by a dash (-). Ex.: 64532-676

In [5]:
items = pd.read_csv('dataset/items.csv')
items.head()

Unnamed: 0,Transaction ID,Product ID
0,110,1
1,111,1
2,112,2
3,112,3
4,113,1


In [6]:
items['ID'] = items['Transaction ID'].astype(str) + '-' + items['Product ID'].astype(str)
items

Unnamed: 0,Transaction ID,Product ID,ID
0,110,1,110-1
1,111,1,111-1
2,112,2,112-2
3,112,3,112-3
4,113,1,113-1
5,116,5,116-5
6,116,4,116-4
7,116,7,116-7
8,117,8,117-8
9,118,5,118-5


## Generate Guest Table

- Our client didn’t send the guests CSV, but the transactions file has the columns `Customer ID` (spelled `Guset ID` in the CSV header) and `Timestamp`,
which can be used to generate the guest table.
- Since the same `Customer ID` may appear in several transactions and we don’t want duplicate guests, take the first timestamp for each guest.

In [7]:
# trx_clean = trx[['Guset ID','Timestamp']].dropna().drop_duplicates().reset_index(drop=True)
trx_clean = trx[['Guset ID','Timestamp']].dropna().reset_index(drop=True)

trx_clean

Unnamed: 0,Guset ID,Timestamp
0,22.0,2021-08-24 04:13:45+01:00
1,11.0,2021-08-24 04:15:11+01:00
2,11.0,2021-08-24 04:24:30+01:00
3,33.0,2021-08-24 04:27:43+01:00
4,44.0,2021-08-24 04:28:43+01:00
5,22.0,2021-08-24 04:31:18+01:00
6,11.0,2021-08-24 04:34:00+01:00
7,55.0,2021-08-24 04:34:19+01:00
8,66.0,2021-08-24 04:35:42+01:00
9,77.0,2021-08-24 04:36:31+01:00


In [8]:
# DataFrame.append was removed in pandas 2.0, so the table is built with a groupby instead.
# sort=False keeps guests in order of first appearance; min() picks the earliest timestamp.
guest_df = (
    trx_clean.groupby('Guset ID', sort=False)['Timestamp'].min()
    .reset_index()
    .rename(columns={'Guset ID': 'guest_id', 'Timestamp': 'created_at'})
)

In [9]:
guest_df = guest_df.sort_values(by=['guest_id'])
guest_df

Unnamed: 0,guest_id,created_at
1,11.0,2021-08-24 04:15:11+01:00
0,22.0,2021-08-24 04:13:45+01:00
2,33.0,2021-08-24 04:27:43+01:00
3,44.0,2021-08-24 04:28:43+01:00
4,55.0,2021-08-24 04:34:19+01:00
5,66.0,2021-08-24 04:35:42+01:00
6,77.0,2021-08-24 04:36:31+01:00
7,88.0,2021-08-24 04:37:15+01:00
8,99.0,2021-08-24 05:33:16+01:00
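
The "first timestamp per guest" rule can also be expressed without a loop by sorting and dropping duplicates. The following is a minimal sketch on a toy frame, not the full dataset; `trx_sample` and `guests` are illustrative names, and the rows only mirror a few of the values shown above.

```python
import pandas as pd

# Toy transactions; one row has no guest and two guests appear twice.
trx_sample = pd.DataFrame({
    "Guset ID": [22.0, 11.0, 11.0, None, 22.0],
    "Timestamp": pd.to_datetime([
        "2021-08-24 04:13:45",
        "2021-08-24 04:15:11",
        "2021-08-24 04:24:30",
        "2021-08-24 04:08:08",
        "2021-08-24 04:31:18",
    ]),
})

guests = (
    trx_sample.dropna(subset=["Guset ID"])      # ignore transactions without a guest
    .sort_values("Timestamp")                   # earliest transaction first
    .drop_duplicates("Guset ID", keep="first")  # keep one row per guest
    .rename(columns={"Guset ID": "guest_id", "Timestamp": "created_at"})
    .sort_values("guest_id")
    .reset_index(drop=True)
)
print(guests)
```

Sorting by timestamp before `drop_duplicates` is what makes `keep="first"` mean "earliest"; without the sort it would mean "first row in file order".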


## Ensure Data Integrity


### All transactions contain items. Transactions without items need to be excluded.

In [10]:
# transactions without items
trx[~trx['ID'].isin(items['Transaction ID'])].reset_index(drop=True)

Unnamed: 0,ID,Timestamp,Guset ID,Amount
0,114,2021-08-24 04:24:30+01:00,11.0,15.23
1,115,2021-08-24 04:27:43+01:00,33.0,5.41
2,125,2021-08-24 04:49:41+01:00,88.0,15.98


In [11]:
# all valid transactions (those with at least one item)
trx[trx['ID'].isin(items['Transaction ID'])].reset_index(drop=True)

Unnamed: 0,ID,Timestamp,Guset ID,Amount
0,111,2021-08-24 04:08:08+01:00,,1.64
1,112,2021-08-24 04:13:45+01:00,22.0,3.27
2,113,2021-08-24 04:15:11+01:00,11.0,6.12
3,116,2021-08-24 04:28:43+01:00,44.0,9.19
4,117,2021-08-24 04:31:18+01:00,22.0,8.13
5,118,2021-08-24 04:34:00+01:00,11.0,13.96
6,119,2021-08-24 04:34:19+01:00,55.0,2.07
7,120,2021-08-24 04:35:42+01:00,66.0,3.06
8,121,2021-08-24 04:36:31+01:00,77.0,7.44
9,122,2021-08-24 04:37:15+01:00,88.0,9.19


### All items belong to a transaction. Items without a transaction need to be excluded.

In [12]:
# items without a transaction
items[~items['Transaction ID'].isin(trx['ID'])].reset_index(drop=True)

Unnamed: 0,Transaction ID,Product ID,ID
0,110,1,110-1
1,131,8,131-8


In [13]:
# all valid items (those that belong to a known transaction)
items[items['Transaction ID'].isin(trx['ID'])].reset_index(drop=True)

Unnamed: 0,Transaction ID,Product ID,ID
0,111,1,111-1
1,112,2,112-2
2,112,3,112-3
3,113,1,113-1
4,116,5,116-5
5,116,4,116-4
6,116,7,116-7
7,117,8,117-8
8,118,5,118-5
9,119,3,119-3


### All transactions contain guests. Transactions without guests need to be excluded.

In [14]:
# transactions without a guest
trx[trx['Guset ID'].isna()].reset_index(drop=True)

Unnamed: 0,ID,Timestamp,Guset ID,Amount
0,111,2021-08-24 04:08:08+01:00,,1.64
1,128,2021-08-24 07:48:48+01:00,,26.05
2,129,2021-08-24 06:42:50+01:00,,2.07


In [15]:
# transactions that contain a guest; only a missing guest id disqualifies a row
trx.dropna(subset=['Guset ID']).reset_index(drop=True)

Unnamed: 0,ID,Timestamp,Guset ID,Amount
0,112,2021-08-24 04:13:45+01:00,22.0,3.27
1,113,2021-08-24 04:15:11+01:00,11.0,6.12
2,114,2021-08-24 04:24:30+01:00,11.0,15.23
3,115,2021-08-24 04:27:43+01:00,33.0,5.41
4,116,2021-08-24 04:28:43+01:00,44.0,9.19
5,117,2021-08-24 04:31:18+01:00,22.0,8.13
6,118,2021-08-24 04:34:00+01:00,11.0,13.96
7,119,2021-08-24 04:34:19+01:00,55.0,2.07
8,120,2021-08-24 04:35:42+01:00,66.0,3.06
9,121,2021-08-24 04:36:31+01:00,77.0,7.44
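
The three rules in this section are shown one at a time; they could be combined into a single cleaning step along the following lines. This is a sketch on toy data with illustrative ids, and it makes one assumption that goes beyond the text: items are checked against the transactions that survive the other two rules, not against the raw transaction list.

```python
import pandas as pd

# Toy data; the ids are illustrative and only loosely mirror the tables above.
trx_sample = pd.DataFrame({
    "ID": [111, 112, 114, 116],
    "Guset ID": [None, 22.0, 11.0, 44.0],
})
items_sample = pd.DataFrame({
    "Transaction ID": [110, 111, 112, 116],
    "Product ID": [1, 1, 2, 5],
})

# Keep transactions that have at least one item and a guest.
valid_trx = trx_sample[
    trx_sample["ID"].isin(items_sample["Transaction ID"])
    & trx_sample["Guset ID"].notna()
].reset_index(drop=True)

# Keep items whose transaction survived the filters above (assumption: stricter
# than checking against the raw transaction list).
valid_items = items_sample[
    items_sample["Transaction ID"].isin(valid_trx["ID"])
].reset_index(drop=True)

print(valid_trx)
print(valid_items)
```

Filtering transactions first and items second avoids a second pass: once the items are restricted to surviving transactions, every remaining transaction still has at least one item.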


## Simple Unit Test for Data Integrity

In [16]:
# reference: 
# https://stackoverflow.com/questions/40172281/unit-tests-for-functions-in-a-jupyter-notebook

def red(text):
    print('\x1b[31m{}\x1b[0m'.format(text))

def assertEquals(a, b):
    res = a == b
    # Scalars compare to a plain bool; array-likes need every element to match.
    ok = res if isinstance(res, bool) else res.all()
    if not ok:
        red('Invalid Data')
        return

    print('Assert okay.')

In [17]:
def checkValidTrx(x):
    return x in items['Transaction ID'].unique()

def checkValidItem(x):
    return x in trx['ID'].unique()
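
The two helpers above test one id at a time. A set-based variant could report every orphaned id at once and back a plain `assert` after cleaning. The sketch below uses a hypothetical helper, `find_orphans`, and toy id series rather than the notebook's `trx` and `items` frames.

```python
import pandas as pd


def find_orphans(child_keys: pd.Series, parent_keys: pd.Series) -> list:
    """Return the sorted unique values of child_keys that are missing from parent_keys."""
    missing = child_keys[~child_keys.isin(parent_keys)]
    return sorted(missing.unique().tolist())


# Toy data (illustrative ids, not the full dataset).
trx_ids = pd.Series([111, 112, 114])
item_trx_ids = pd.Series([110, 111, 112, 112])

# Before cleaning, each check lists the offending ids.
print(find_orphans(trx_ids, item_trx_ids))   # transactions without items
print(find_orphans(item_trx_ids, trx_ids))   # items without a transaction

# After cleaning, the same helper can back a plain assert.
clean_trx_ids = trx_ids[trx_ids.isin(item_trx_ids)]
clean_item_ids = item_trx_ids[item_trx_ids.isin(clean_trx_ids)]
assert find_orphans(clean_trx_ids, clean_item_ids) == []
assert find_orphans(clean_item_ids, clean_trx_ids) == []
```

Returning the offending ids, rather than printing one status line per row, keeps the output short on large tables and makes a failure message actionable.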

### All transactions contain items. Transactions without items need to be excluded.

In [18]:
for i in range(len(trx)):
    print(trx['ID'][i])
    assertEquals(checkValidTrx(trx['ID'][i]), True)

111
Assert okay.
112
Assert okay.
113
Assert okay.
114
Invalid Data
115
Invalid Data
116
Assert okay.
117
Assert okay.
118
Assert okay.
119
Assert okay.
120
Assert okay.
121
Assert okay.
122
Assert okay.
123
Assert okay.
124
Assert okay.
125
Invalid Data
126
Assert okay.
127
Assert okay.
128
Assert okay.
129
Assert okay.
130
Assert okay.


### All items belong to a transaction. Items without a transaction need to be excluded.

In [19]:
for i in range(len(items)):
    print(items['Transaction ID'][i])
    assertEquals(checkValidItem(items['Transaction ID'][i]), True)

110
Invalid Data
111
Assert okay.
112
Assert okay.
112
Assert okay.
113
Assert okay.
116
Assert okay.
116
Assert okay.
116
Assert okay.
117
Assert okay.
118
Assert okay.
119
Assert okay.
120
Assert okay.
121
Assert okay.
122
Assert okay.
123
Assert okay.
124
Assert okay.
124
Assert okay.
126
Assert okay.
127
Assert okay.
128
Assert okay.
129
Assert okay.
130
Assert okay.
131
Invalid Data
