# <center>Dunnhumby - The Complete Journey</center>

** ** 
    
# <center>*02 - Customer Information Layer*</center>
    

In this notebook, the total spend per department is added to the Customer Information (hh) table by joining the *Transaction* and *Product* datasets and then joining the result to the *HH Demographics* dataset.


<br>
    
This project was developed by <br><br>

*<center>António Oliveira | NTT Data Summer Internship 2024</center>*

** **

<a class="anchor" id="0"></a>

# Table of Contents

1. [Importing Libraries & Data](#1.-Importing-Libraries-&-Data)

    1.1 [Libraries](#1.1-Libraries)
    
    1.2 [Data](#1.2-Data) <br><br>
  
    
2. [First Join](#2.-First-Join)

    2.1 [Dataset Transformation](#2.1-Dataset-Transformation) <br><br>
    
3. [Second Join](#3.-Second-Join)    <br>

    3.1 [Feature Engineering](#3.1-Feature-Engineering) <br><br>
    
4. [Export](#4.-Export) <br><br>

## 1. Importing Libraries & Data

### 1.1 Libraries

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
import ast

import functions

### 1.2 Data

In [2]:
path = '/Users/antoniooliveira/Downloads/NTT project'
#path = "C:/Users/aprataso/Downloads/final_data"

transaction = pd.read_csv(f"{path}/trans.csv")
prod = pd.read_csv(f"{path}/prod.csv")
hh = pd.read_csv(f"{path}/Gold/hh_treated.csv")

## 2. First Join

<a class='anchor' id='1'></a>
[Top &#129033;](#0) 

Merge the *Transaction* and *Product* Datasets

In [3]:
merger = transaction.merge(prod, on = 'product_id', how = 'left')
merger.head(2)

Unnamed: 0,household_key,basket_id,day,product_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,sales_value_eu,loyalty_card_price,non_loyalty_card_price,days_,transaction_date,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,curr_size_of_product_value,curr_size_of_product_units
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0,1.49,1.99,1.39,1,2021-01-01,69,PRODUCE,Private,POTATOES,POTATOES RUSSET (BULK&BAG),5 LB,5.0,LB
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0,0.88,0.82,0.82,1,2021-01-01,2,PRODUCE,National,ONIONS,ONIONS SWEET (BULK&BAG),40 LB,40.0,LB
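
A left join like the one above can silently duplicate rows if `product_id` is not unique in the product table, and it leaves missing values wherever a product has no match. The following is a minimal sketch of how both cases could be checked, using invented toy frames (`transaction_demo`, `prod_demo`) rather than the real datasets:

```python
import pandas as pd

# Toy stand-ins for the transaction and product tables (values are made up).
transaction_demo = pd.DataFrame({
    "household_key": [1, 1, 2],
    "product_id": [10, 11, 99],
    "sales_value": [1.5, 2.0, 3.0],
})
prod_demo = pd.DataFrame({
    "product_id": [10, 11],
    "department": ["PRODUCE", "GROCERY"],
})

# validate="many_to_one" raises a MergeError if product_id repeats in prod_demo;
# indicator=True adds a "_merge" column recording where each row found a match.
merged_demo = transaction_demo.merge(
    prod_demo, on="product_id", how="left",
    validate="many_to_one", indicator=True,
)

# A left join on a unique key keeps exactly one row per transaction.
row_count_preserved = len(merged_demo) == len(transaction_demo)

# Transactions whose product_id is absent from the product table.
unmatched = merged_demo[merged_demo["_merge"] == "left_only"]
print(row_count_preserved, unmatched["product_id"].tolist())
```

The `_merge` column is only a diagnostic and would normally be dropped before further processing.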


Looking at the variables in the new Dataframe

In [4]:
merger.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2595732 entries, 0 to 2595731
Data columns (total 25 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   household_key               int64  
 1   basket_id                   int64  
 2   day                         int64  
 3   product_id                  int64  
 4   quantity                    int64  
 5   sales_value                 float64
 6   store_id                    int64  
 7   retail_disc                 float64
 8   trans_time                  int64  
 9   week_no                     int64  
 10  coupon_disc                 float64
 11  coupon_match_disc           float64
 12  sales_value_eu              float64
 13  loyalty_card_price          float64
 14  non_loyalty_card_price      float64
 15  days_                       int64  
 16  transaction_date            object 
 17  manufacturer                int64  
 18  department                  object 
 19  brand                

### 2.1 Dataset Transformation

Because this notebook uses the non-Gold versions of the datasets, the number of departments has to be reduced again.

In [5]:
print(merger['department'].unique())

replacement_mapping = {
    'MEAT': 'Meat',
    'MEAT-PCKGD': 'Meat',
    'MEAT-WHSE': 'Meat',
    'PORK': 'Meat',
    'SEAFOOD': 'Seafood',
    'SEAFOOD-PCKGD': 'Seafood',
    'FROZEN GROCERY': 'Groceries',
    'PRODUCE': 'Groceries',
    'GROCERY': 'Groceries',
    'NUTRITION': 'Groceries',
    'GRO BAKERY': 'Bakery',
    'PASTRY': 'Bakery',
    'DELI': 'Delicacies',
    'DAIRY DELI': 'Delicacies',
    'DELI/SNACK BAR': 'DELI/SNACK BAR',
    'PHOTO': 'Photo/Video',
    'VIDEO': 'Photo/Video'
}
merger['department'] = merger['department'].replace(replacement_mapping)


['PRODUCE' 'GROCERY' 'DRUG GM' 'MEAT' 'MEAT-PCKGD' 'DELI' 'SEAFOOD-PCKGD'
 ' ' 'PASTRY' 'NUTRITION' 'VIDEO RENTAL' 'MISC SALES TRAN' 'FLORAL'
 'SEAFOOD' 'SALAD BAR' 'AUTOMOTIVE' 'SPIRITS' 'COSMETICS' 'MISC. TRANS.'
 'GARDEN CENTER' 'CHEF SHOPPE' 'TRAVEL & LEISUR' 'COUP/STR & MFG'
 'KIOSK-GAS' 'FROZEN GROCERY' 'RESTAURANT' 'HOUSEWARES' 'PORK'
 'POSTAL CENTER' 'GM MERCH EXP' 'CNTRL/STORE SUP' 'PROD-WHS SALES'
 'DAIRY DELI' 'HBC' 'CHARITABLE CONT' 'RX' 'DELI/SNACK BAR'
 'PHARMACY SUPPLY' 'PHOTO' 'ELECT &PLUMBING' 'MEAT-WHSE' 'TOYS'
 'GRO BAKERY' 'VIDEO']
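
`Series.replace` with a dictionary only changes values that appear as keys, so any department missing from the mapping keeps its original label. A small sketch on invented values (`departments` and `mapping_demo` are illustrative, not the notebook's data) shows how the untouched labels could be listed for review:

```python
import pandas as pd

departments = pd.Series(["MEAT", "PORK", "PRODUCE", "FLORAL", " "])
mapping_demo = {"MEAT": "Meat", "PORK": "Meat", "PRODUCE": "Groceries"}

# Values found among the keys are swapped; everything else is left unchanged.
consolidated = departments.replace(mapping_demo)

# Labels that no mapping key covered, including the blank ' ' department.
untouched = sorted(set(departments) - set(mapping_demo))
print(consolidated.tolist())
print(untouched)
```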


Verifying the success of the operation

In [6]:
merger.head(1)

Unnamed: 0,household_key,basket_id,day,product_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,sales_value_eu,loyalty_card_price,non_loyalty_card_price,days_,transaction_date,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,curr_size_of_product_value,curr_size_of_product_units
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0,1.49,1.99,1.39,1,2021-01-01,69,Groceries,Private,POTATOES,POTATOES RUSSET (BULK&BAG),5 LB,5.0,LB


Grouping the data by `household_key` so that it can then be joined with *hh*

In [7]:
agg_funcs = {
    'day': list, 
    #'basket_id': list,
    'product_id': list,  
    'quantity': lambda x: x.tolist(),   
    'sales_value': list,  
    'store_id': 'first',  
    'retail_disc':  'mean',  
    'trans_time': 'first', 
    'week_no': 'first',  
    'coupon_disc':  'mean',  
    'coupon_match_disc':  'mean',  
    'sales_value_eu':list,  
    'loyalty_card_price': list,  
    'non_loyalty_card_price': list, 
    'days_': 'first',  
    'transaction_date': 'first', 
    'manufacturer': list,  
    'department': list, 
    'brand': list,  
    'commodity_desc': list, 
    'sub_commodity_desc': list #,
    #'curr_size_of_product': list,  
    #'curr_size_of_product_value': list,  
    #'curr_size_of_product_units': list  
}

# Group by household_key and apply aggregations
grouped_df = merger.groupby(['household_key']).agg(agg_funcs).reset_index()

grouped_df['first_transaction_date'] = grouped_df['transaction_date']
grouped_df = grouped_df.drop('transaction_date', axis = 1)
grouped_df.head(2)

Unnamed: 0,household_key,day,product_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,sales_value_eu,loyalty_card_price,non_loyalty_card_price,days_,manufacturer,department,brand,commodity_desc,sub_commodity_desc,first_transaction_date
0,1,"[51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 5...","[825123, 831447, 840361, 845307, 852014, 85498...","[1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, ...","[3.99, 2.99, 1.09, 3.71, 2.79, 7.19, 2.5, 1.49...",436,-0.403613,1456,8,-0.046647,-0.015142,"[4.27, 3.2, 1.17, 3.97, 2.99, 7.69, 2.68, 1.59...","[3.99, 2.99, 1.39, 4.33, 3.99, 8.49, 2.99, 1.4...","[3.99, 2.99, 1.09, 3.71, 2.79, 7.19, 2.5, 1.49...",51,"[1179, 317, 69, 4084, 69, 1011, 159, 407, 2082...","[Groceries, Groceries, Groceries, Delicacies, ...","[National, National, Private, National, Privat...","[SALD DRSNG/SNDWCH SPRD, CHEESE, EGGS, DELI ME...","[SEMI-SOLID SALAD DRESSING MAY, SHREDDED CHEES...",2021-02-20
1,2,"[103, 103, 103, 103, 103, 103, 103, 103, 103, ...","[854852, 930118, 1077555, 1098066, 5567388, 55...","[1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, ...","[0.65, 4.47, 1.0, 0.79, 2.5, 2.5, 2.5, 5.0, 3....",401,-0.469174,1904,15,-0.012605,0.0,"[0.7, 4.78, 1.07, 0.85, 2.68, 2.68, 2.68, 5.35...","[0.65, 9.5, 1.59, 0.99, 4.69, 4.69, 4.69, 4.69...","[0.65, 4.47, 1.0, 0.79, 2.5, 2.5, 2.5, 2.5, 3....",103,"[2, 2, 69, 69, 1208, 1208, 1208, 1208, 1323, 6...","[Groceries, Groceries, Groceries, Groceries, G...","[National, National, Private, Private, Nationa...","[TOMATOES, STONE FRUIT, MILK BY-PRODUCTS, BAKE...","[TOMATOES HOTHOUSE ON THE VINE, CHERRIES RED, ...",2021-04-13
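
To make the aggregation pattern easier to follow, here is a reduced sketch on an invented three-row table (`lines`). It collects line-level values into one list per household and takes `'first'` for the date. Note that `'first'` returns the first value in row order, so it is the earliest date only if the rows are already sorted chronologically; that ordering is an assumption about the input.

```python
import pandas as pd

lines = pd.DataFrame({
    "household_key": [1, 1, 2],
    "sales_value_eu": [4.27, 3.20, 0.70],
    "department": ["Groceries", "Delicacies", "Groceries"],
    "transaction_date": ["2021-02-20", "2021-02-21", "2021-04-13"],
})

per_household = (
    lines.groupby("household_key")
    .agg({
        "sales_value_eu": list,       # one list of amounts per household
        "department": list,           # aligned list of departments
        "transaction_date": "first",  # first value in row order
    })
    .reset_index()
)
print(per_household)
```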


## 3. Second Join

<a class='anchor' id='1'></a>
[Top &#129033;](#0) 

Joining *Transaction* and *Product* Datasets to *HH Demographics* Dataset.

In [11]:
hh.head(2)

Unnamed: 0,marital_status_code,homeowner_desc,household_key,marital_status,age_group,adult_category_size,has_kids,avg_age,avg_income,n_kids,n_household,gender(s)
0,A,Homeowner,1,married,senior,2.0,0.0,65.0,42000,0.0,2,2.0
1,A,Homeowner,7,married,middle-aged,2.0,0.0,49.5,62000,0.0,2,2.0


**2 Options**

1. Left-joining *hh* with *grouped_df*, which keeps only 801 rows and discards the transaction data of the 1699 households for which we have no customer information.

2. Left-joining *grouped_df* with *hh*, which keeps all 2500 rows, even though 1699 of them will have missing demographic values because we do not have information about those customers yet.
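
The difference between the two options comes down to which table sits on the left of the left join. A minimal sketch with invented frames (`grouped_demo`, `hh_demo`) illustrates the effect on row counts:

```python
import pandas as pd

# Three households with transactions, only one with demographics (toy data).
grouped_demo = pd.DataFrame({
    "household_key": [1, 2, 3],
    "retail_disc": [-0.4, -0.5, -0.7],
})
hh_demo = pd.DataFrame({"household_key": [1], "age_group": ["senior"]})

# Option 1: demographics on the left, so only households present in hh_demo survive.
option_1 = hh_demo.merge(grouped_demo, on="household_key", how="left")

# Option 2: transactions on the left, so every household survives and
# demographic columns are NaN where no match exists.
option_2 = grouped_demo.merge(hh_demo, on="household_key", how="left")

missing_demographics = int(option_2["age_group"].isna().sum())
print(len(option_1), len(option_2), missing_demographics)
```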

In [12]:
'''clients = hh.merge(grouped_df, on='household_key', how='left')
clients'''

"clients = hh.merge(grouped_df, on='household_key', how='left')\nclients"

In [13]:
clients = grouped_df.merge(hh, on='household_key', how='left')
clients

Unnamed: 0,household_key,day,product_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,sales_value_eu,loyalty_card_price,non_loyalty_card_price,days_,manufacturer,department,brand,commodity_desc,sub_commodity_desc,first_transaction_date,marital_status_code,homeowner_desc,marital_status,age_group,adult_category_size,has_kids,avg_age,avg_income,n_kids,n_household,gender(s)
0,1,"[51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 5...","[825123, 831447, 840361, 845307, 852014, 85498...","[1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, ...","[3.99, 2.99, 1.09, 3.71, 2.79, 7.19, 2.5, 1.49...",436,-0.403613,1456,8,-0.046647,-0.015142,"[4.27, 3.2, 1.17, 3.97, 2.99, 7.69, 2.68, 1.59...","[3.99, 2.99, 1.39, 4.33, 3.99, 8.49, 2.99, 1.4...","[3.99, 2.99, 1.09, 3.71, 2.79, 7.19, 2.5, 1.49...",51,"[1179, 317, 69, 4084, 69, 1011, 159, 407, 2082...","[Groceries, Groceries, Groceries, Delicacies, ...","[National, National, Private, National, Privat...","[SALD DRSNG/SNDWCH SPRD, CHEESE, EGGS, DELI ME...","[SEMI-SOLID SALAD DRESSING MAY, SHREDDED CHEES...",2021-02-20,A,Homeowner,married,senior,2.0,0.0,65.0,42000.0,0.0,2.0,2.0
1,2,"[103, 103, 103, 103, 103, 103, 103, 103, 103, ...","[854852, 930118, 1077555, 1098066, 5567388, 55...","[1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, ...","[0.65, 4.47, 1.0, 0.79, 2.5, 2.5, 2.5, 5.0, 3....",401,-0.469174,1904,15,-0.012605,0.000000,"[0.7, 4.78, 1.07, 0.85, 2.68, 2.68, 2.68, 5.35...","[0.65, 9.5, 1.59, 0.99, 4.69, 4.69, 4.69, 4.69...","[0.65, 4.47, 1.0, 0.79, 2.5, 2.5, 2.5, 2.5, 3....",103,"[2, 2, 69, 69, 1208, 1208, 1208, 1208, 1323, 6...","[Groceries, Groceries, Groceries, Groceries, G...","[National, National, Private, Private, Nationa...","[TOMATOES, STONE FRUIT, MILK BY-PRODUCTS, BAKE...","[TOMATOES HOTHOUSE ON THE VINE, CHERRIES RED, ...",2021-04-13,,,,,,,,,,,
2,3,"[113, 113, 113, 113, 113, 113, 113, 113, 113, ...","[866211, 878996, 882830, 904360, 921345, 93194...","[1, 1, 1, 1, 3, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, ...","[3.08, 1.64, 1.29, 0.99, 5.0, 2.08, 4.0, 1.69,...",401,-0.732278,1549,17,-0.066367,-0.021475,"[3.3, 1.75, 1.38, 1.06, 5.35, 2.23, 4.28, 1.81...","[4.12, 2.19, 1.29, 0.99, 1.99, 1.04, 2.0, 1.69...","[3.08, 1.64, 1.29, 0.99, 1.67, 1.04, 2.0, 1.69...",113,"[2, 2, 764, 673, 1094, 1377, 1071, 910, 69, 2,...","[Groceries, Groceries, Groceries, Groceries, M...","[National, National, National, National, Natio...","[GRAPES, GRAPES, WAREHOUSE SNACKS, VEGETABLES ...","[GRAPES WHITE, GRAPES RED, CANISTER POTATO/TOR...",2021-04-23,,,,,,,,,,,
3,4,"[104, 104, 104, 104, 104, 104, 104, 104, 104, ...","[836163, 857849, 877523, 878909, 883932, 89142...","[2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, ...","[3.0, 0.44, 1.79, 1.5, 1.83, 3.59, 2.0, 0.99, ...",298,-0.384219,1452,16,-0.008306,0.000000,"[3.21, 0.47, 1.92, 1.61, 1.96, 3.84, 2.14, 1.0...","[2.79, 0.89, 1.79, 1.59, 2.49, 3.59, 2.89, 1.1...","[1.5, 0.44, 1.79, 1.5, 1.83, 3.59, 2.0, 0.99, ...",104,"[586, 71, 194, 69, 586, 1075, 69, 1529, 499, 1...","[Groceries, Groceries, Groceries, Groceries, G...","[National, National, National, Private, Nation...","[COOKIES/CONES, MARGARINES, SPICES & EXTRACTS,...","[CHOCOLATE COVERED COOKIES, MARGARINE STICK, ...",2021-04-14,,,,,,,,,,,
4,5,"[85, 85, 87, 88, 88, 88, 97, 97, 97, 97, 97, 9...","[938983, 5980822, 1012352, 825538, 1002499, 69...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, ...","[1.99, 2.49, 5.54, 7.59, 2.59, 3.5, 3.34, 6.49...",374,-0.533018,1540,13,0.000000,0.000000,"[2.13, 2.66, 5.93, 8.12, 2.77, 3.75, 3.57, 6.9...","[1.99, 2.49, 7.39, 7.59, 3.49, 3.99, 5.19, 6.4...","[1.99, 2.49, 5.54, 7.59, 2.59, 3.5, 3.34, 6.49...",85,"[5143, 5143, 870, 1216, 410, 5268, 69, 397, 80...","[DRUG GM, DRUG GM, DRUG GM, Meat, Groceries, D...","[National, National, National, National, Natio...","[GREETING CARDS/WRAP/PARTY SPLY, GREETING CARD...","[CARDS SEASONAL, CARDS SEASONAL, SUNBURN AFTER...",2021-03-26,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2495,2496,"[117, 117, 117, 117, 117, 117, 117, 117, 117, ...","[840361, 852159, 871756, 886703, 899624, 91612...","[1, 1, 1, 1, 1, 2, 1, 1, 3, 1, 1, 6, 1, 1, 1, ...","[0.79, 3.71, 2.95, 1.09, 2.99, 5.32, 16.24, 1....",370,-0.748885,1345,17,-0.034654,-0.008328,"[0.85, 3.97, 3.16, 1.17, 3.2, 5.69, 17.38, 1.8...","[1.39, 3.71, 3.69, 1.09, 2.99, 6.4, 16.24, 1.9...","[0.79, 3.71, 2.95, 1.09, 2.99, 2.66, 16.24, 1....",117,"[69, 2946, 317, 69, 69, 4314, 2908, 69, 1276, ...","[Groceries, Meat, Groceries, Groceries, Grocer...","[Private, National, National, Private, Private...","[EGGS, BEEF, SALD DRSNG/SNDWCH SPRD, MISC. DAI...","[EGGS - LARGE, CHOICE BEEF, SEMI-SOLID SALAD D...",2021-04-27,A,Homeowner,married,middle-aged,2.0,0.0,49.5,87000.0,0.0,3.0,2.0
2496,2497,"[78, 78, 78, 78, 78, 80, 80, 80, 80, 80, 82, 8...","[838220, 1037840, 1052294, 5569230, 8090537, 1...","[1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, ...","[25.33, 3.04, 29.34, 5.0, 3.34, 7.99, 5.0, 5.0...",339,-0.683502,1121,12,-0.008583,-0.000382,"[27.1, 3.25, 31.39, 5.35, 3.57, 8.55, 5.35, 5....","[25.33, 3.04, 29.34, 4.69, 4.59, 7.99, 4.69, 4...","[25.33, 3.04, 29.34, 2.5, 3.34, 7.99, 2.5, 2.5...",78,"[539, 539, 539, 1208, 103, 2, 1208, 1208, 1208...","[DRUG GM, DRUG GM, DRUG GM, Groceries, Groceri...","[National, National, National, National, Natio...","[CIGARETTES, CIGARETTES, CIGARETTES, SOFT DRIN...","[CIGARETTES, CIGARETTES, CIGARETTES, SOFT DRIN...",2021-03-19,U,Unknown,single,middle-aged,1.0,0.0,49.5,42000.0,0.0,1.0,0.0
2497,2498,"[105, 105, 105, 105, 105, 105, 105, 105, 105, ...","[824555, 835576, 901776, 904023, 911215, 91749...","[1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, ...","[0.1, 0.1, 2.69, 0.1, 0.1, 0.1, 0.1, 0.2, 0.1,...",309,-0.325262,1823,16,0.000000,0.000000,"[0.11, 0.11, 2.88, 0.11, 0.11, 0.11, 0.11, 0.2...","[0.2, 0.2, 3.99, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2,...","[0.1, 0.1, 2.69, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,...",105,"[1046, 1046, 436, 1046, 1046, 1046, 1046, 1046...","[Groceries, Groceries, Groceries, Groceries, G...","[National, National, National, National, Natio...","[PWDR/CRYSTL DRNK MX, PWDR/CRYSTL DRNK MX, REF...","[SOFT DRINK POWDER POUCHES, SOFT DRINK POWDER ...",2021-04-15,U,Homeowner,unknown,adult,2.0,0.0,29.5,62000.0,0.0,2.0,2.0
2498,2499,"[70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 85, 8...","[838186, 853197, 864143, 883665, 932949, 93383...","[1, 1, 6, 7, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, ...","[3.99, 1.5, 4.74, 19.53, 2.38, 4.9, 3.34, 3.35...",31782,-0.427187,1056,11,-0.003045,-0.000386,"[4.27, 1.61, 5.07, 20.9, 2.55, 5.24, 3.57, 3.5...","[3.99, 1.89, 0.89, 3.29, 2.38, 2.94, 3.99, 3.3...","[3.99, 1.5, 0.79, 2.79, 2.38, 2.45, 3.34, 3.35...",70,"[1790, 1914, 69, 1046, 3869, 3705, 103, 3787, ...","[Groceries, Groceries, Groceries, Groceries, D...","[National, National, Private, National, Nation...","[BAKED SWEET GOODS, BAKED BREAD/BUNS/ROLLS, FL...","[SW GDS:DONUTS, RYE BREADS, FLUID MILK WHITE O...",2021-03-11,U,Unknown,unknown,adult,2.0,1.0,29.5,15000.0,1.0,3.0,2.0


### 3.1 Feature Engineering

Dropping columns that will not be necessary

In [14]:
columns_to_drop = [
    'day', 'product_id', 'sales_value',
    'store_id','trans_time', 'week_no',
    'loyalty_card_price',
    'non_loyalty_card_price', 'days_', 'manufacturer',
    'brand', 'commodity_desc', 'sub_commodity_desc'
]


clients_cleaned = clients.drop(columns=columns_to_drop)
clients_cleaned.head(2)

Unnamed: 0,household_key,quantity,retail_disc,coupon_disc,coupon_match_disc,sales_value_eu,department,first_transaction_date,marital_status_code,homeowner_desc,marital_status,age_group,adult_category_size,has_kids,avg_age,avg_income,n_kids,n_household,gender(s)
0,1,"[1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, ...",-0.403613,-0.046647,-0.015142,"[4.27, 3.2, 1.17, 3.97, 2.99, 7.69, 2.68, 1.59...","[Groceries, Groceries, Groceries, Delicacies, ...",2021-02-20,A,Homeowner,married,senior,2.0,0.0,65.0,42000.0,0.0,2.0,2.0
1,2,"[1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, ...",-0.469174,-0.012605,0.0,"[0.7, 4.78, 1.07, 0.85, 2.68, 2.68, 2.68, 5.35...","[Groceries, Groceries, Groceries, Groceries, G...",2021-04-13,,,,,,,,,,,


Create a column for each department that represents the amount of money spent per department per household

In [15]:
# Convert string representations of lists to actual lists if needed
clients_cleaned['department_'] = clients_cleaned['department'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
clients_cleaned['sales_value_eu_'] = clients_cleaned['sales_value_eu'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
clients_cleaned['quantity_'] = clients_cleaned['quantity'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# Cast the parsed list elements to float (read from the parsed columns, not the raw ones)
clients_cleaned['sales_value_eu_'] = clients_cleaned['sales_value_eu_'].apply(lambda x: list(map(float, x)))
clients_cleaned['quantit_y'] = clients_cleaned['quantity_'].apply(lambda x: list(map(float, x)))

# Initialize an empty list to collect rows
rows = []

# Iterate over each row in the DataFrame
for index, row in clients_cleaned.iterrows():
    household_key = row['household_key']
    departments = row['department_']  # parsed list of department names
    sales_values = row['sales_value_eu_']  # parsed list of sales values (float)
    quantities = row['quantit_y']  # parsed list of quantities (float); not used in the sum below

    # Create a dictionary to store the total spending for each department
    total_spending = {}
    
    for dept, sales_value, quantity in zip(departments, sales_values, quantities):
        if dept not in total_spending:
            total_spending[dept] = 0
        total_spending[dept] += sales_value
    
    # Convert the dictionary to a DataFrame row
    spending_row = {'household_key': household_key}
    spending_row.update(total_spending)
    
    # Append the row to the list
    rows.append(spending_row)

# Convert the list of rows into a DataFrame
spending_per_department = pd.DataFrame(rows)

# Fill NaN values with 0 (for departments with no spending)
spending_per_department.fillna(0, inplace=True)


# Join the spending DataFrame to clients_cleaned
clients_cleaned = clients_cleaned.merge(spending_per_department, on='household_key', how='left')

# Rename spending columns to include '_spend'
spending_columns = spending_per_department.columns.drop('household_key')
spending_column_mapping = {col: f"{col}_spend" for col in spending_columns}

# Rename the columns
clients_cleaned.rename(columns=spending_column_mapping, inplace=True)

# Display the result
clients_cleaned.head(2)


Unnamed: 0,household_key,quantity,retail_disc,coupon_disc,coupon_match_disc,sales_value_eu,department,first_transaction_date,marital_status_code,homeowner_desc,marital_status,age_group,adult_category_size,has_kids,avg_age,avg_income,n_kids,n_household,gender(s),department_,sales_value_eu_,quantity_,quantit_y,Groceries_spend,Delicacies_spend,Meat_spend,Bakery_spend,DRUG GM_spend,_spend,SALAD BAR_spend,MISC SALES TRAN_spend,RESTAURANT_spend,FLORAL_spend,Seafood_spend,COSMETICS_spend,KIOSK-GAS_spend,CHEF SHOPPE_spend,GARDEN CENTER_spend,MISC. TRANS._spend,SPIRITS_spend,AUTOMOTIVE_spend,TRAVEL & LEISUR_spend,CNTRL/STORE SUP_spend,COUP/STR & MFG_spend,GM MERCH EXP_spend,POSTAL CENTER_spend,DELI/SNACK BAR_spend,Photo/Video_spend,CHARITABLE CONT_spend,RX_spend,VIDEO RENTAL_spend,PROD-WHS SALES_spend,PHARMACY SUPPLY_spend,TOYS_spend,HBC_spend,ELECT &PLUMBING_spend,HOUSEWARES_spend
0,1,"[1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, ...",-0.403613,-0.046647,-0.015142,"[4.27, 3.2, 1.17, 3.97, 2.99, 7.69, 2.68, 1.59...","[Groceries, Groceries, Groceries, Delicacies, ...",2021-02-20,A,Homeowner,married,senior,2.0,0.0,65.0,42000.0,0.0,2.0,2.0,"[Groceries, Groceries, Groceries, Delicacies, ...","[4.27, 3.2, 1.17, 3.97, 2.99, 7.69, 2.68, 1.59...","[1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, ...",3285.02,226.99,373.56,102.01,568.23,0.0,44.09,21.4,4.47,8.55,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,"[1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, ...",-0.469174,-0.012605,0.0,"[0.7, 4.78, 1.07, 0.85, 2.68, 2.68, 2.68, 5.35...","[Groceries, Groceries, Groceries, Groceries, G...",2021-04-13,,,,,,,,,,,,"[Groceries, Groceries, Groceries, Groceries, G...","[0.7, 4.78, 1.07, 0.85, 2.68, 2.68, 2.68, 5.35...","[1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, ...",1361.5,50.01,238.86,26.55,353.11,0.0,0.0,4.56,0.0,23.53,9.62,23.62,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
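
The row-by-row loop above can also be expressed with pandas reshaping. The following is a sketch only, not the notebook's method: it assumes the department and sales lists in each row have equal length and that pandas 1.3 or later is available (multi-column `explode`). The frame `demo` and its values are invented.

```python
import pandas as pd

demo = pd.DataFrame({
    "household_key": [1, 2],
    "department": [["Groceries", "Meat", "Groceries"], ["Meat"]],
    "sales_value_eu": [[4.5, 3.0, 1.25], [2.5]],
})

# One row per (household, line item); both list columns are unpacked in step.
long = demo.explode(["department", "sales_value_eu"])
long["sales_value_eu"] = long["sales_value_eu"].astype(float)

# Sum spend per household and department, one column per department,
# with 0 where a household bought nothing in that department.
spend = (
    long.pivot_table(index="household_key", columns="department",
                     values="sales_value_eu", aggfunc="sum", fill_value=0)
    .add_suffix("_spend")
    .reset_index()
)
print(spend)
```

On large tables this tends to be faster than `iterrows`, because the summation runs inside pandas rather than in a Python loop.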


Any column whose name contains `_quantity` is not needed and is dropped. With the columns shown above, no name matches, so this cell leaves the table unchanged.

In [16]:
quantity_columns = [col for col in clients_cleaned.columns if '_quantity' in col]

clients_cleaned.drop(columns=quantity_columns, inplace=True)

Drop temporary variables

In [17]:
clients_cleaned = clients_cleaned.drop([' _spend', 'quantit_y', 'quantity_', 'sales_value_eu_', 'department_'], axis = 1) #'dep_quantity'

Serialize the list columns so that they can be restored as lists after the CSV export

In [18]:
# Apply serialization
clients_cleaned['quantity'] = clients_cleaned['quantity'].apply(functions.serialize_list)
clients_cleaned['sales_value_eu'] = clients_cleaned['sales_value_eu'].apply(functions.serialize_list)
clients_cleaned['department'] = clients_cleaned['department'].apply(functions.serialize_list)
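
The body of `functions.serialize_list` is not shown in this notebook. As an illustration of the general idea only, the sketch below uses a hypothetical stand-in (`serialize_list_demo`, based on `json.dumps`) and checks that a list column survives a CSV round trip; the real helper may work differently.

```python
import json
from io import StringIO

import pandas as pd


def serialize_list_demo(values):
    """Hypothetical stand-in for functions.serialize_list: list -> JSON string."""
    return json.dumps(values)


frame = pd.DataFrame({"household_key": [1], "department": [["Groceries", "Meat"]]})
frame["department"] = frame["department"].apply(serialize_list_demo)

# Write to an in-memory buffer instead of a file to keep the sketch self-contained.
buffer = StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# After reading back, the column holds strings; json.loads restores the lists.
restored = pd.read_csv(buffer)
restored["department"] = restored["department"].apply(json.loads)
print(restored.loc[0, "department"])
```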

## 4. Export

<a class='anchor' id='1'></a>
[Top &#129033;](#0) 

In [20]:
clients_cleaned.to_csv(f'{path}/Gold/Customer_info_dataset.csv', index = True)