**Fetch Rewards Coding Exercise**

Pt3 - Data Quality

Section 3 - rewardsReceiptItemList Variability

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as pyplot
import seaborn as sns

import duckdb

* [Overview](#Overview)
    * [Review rewardsReceiptItemList](#format)
    * [Review df_items](#items)

# Overview <a class="anchor" id="Overview"></a>

Background: 
- Across the Receipts dataset, the nested JSON in rewardsReceiptItemList holds different sets of key-value pairs from item to item.
- df_items contains every possible column found within rewardsReceiptItemList. Its overview (.info()) shows that the majority of the columns are mostly null.

Why This Matters:
The variation in this column raises two concerns:
- Standardization - We want to standardize the values we track within this field to ensure data quality.
- Future Organization of Data - Given that the majority of the rows in df_items contain null values, we may be cluttering our datasets with unnecessary information, and we may want to create new tables for any values we find necessary.

------------------------

## rewardsReceiptItemList Format <a id="format"></a>

**Summary:** 
- Rows within rewardsReceiptItemList hold items with different numbers of key-value pairs
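
One way to quantify this variability is to count how often each distinct set of keys occurs across all items. The sketch below is a minimal illustration on toy data modeled on the item structure in this notebook; `sample_item_lists` and `key_set_counts` are hypothetical names, and the values are illustrative rather than taken from receipts.gz.

```python
from collections import Counter

# Toy item lists modeled on rewardsReceiptItemList (illustrative values only)
sample_item_lists = [
    [{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1',
      'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}],
    [{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '27.00',
      'itemPrice': '27.00', 'needsFetchReview': False, 'partnerItemId': '1',
      'quantityPurchased': 5}],
    None,  # a receipt with no item list
]

def key_set_counts(item_lists):
    """Count how often each distinct set of keys appears across all items."""
    counts = Counter()
    for items in item_lists:
        if not isinstance(items, list):  # skip missing lists (None / NaN)
            continue
        for item in items:
            counts[frozenset(item)] += 1
    return counts

counts = key_set_counts(sample_item_lists)
for keys, n in counts.most_common():
    print(n, len(keys), sorted(keys))
```

Assuming the real column holds lists of dicts (or NaN where a receipt has no items), the same function could be called as `key_set_counts(df_receipts['rewardsReceiptItemList'])`.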

In [12]:
#Data cleaning
df_receipts = pd.read_json('receipts.gz', lines=True, compression='gzip')
df_receipts['receipt_id'] = pd.json_normalize(df_receipts['_id'])['$oid']
df_receipts.drop('_id', axis = 1, inplace = True)


#Identify the date columns
date_columns = ['createDate', 'finishedDate', 'dateScanned', 'modifyDate', 'pointsAwardedDate', 'purchaseDate']

#Extract the UNIX timestamp (ms) from each {'$date': ...} dict; missing values are left as-is
#Checking every value (not just the first row) keeps columns whose first entry is null from being skipped
#Convert the UNIX timestamp into datetime
for col in date_columns:
    df_receipts[col] = df_receipts[col].apply(lambda v: v.get('$date') if isinstance(v, dict) else v)
    df_receipts[col] = pd.to_datetime(df_receipts[col], unit = 'ms', errors = 'coerce')
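
To make the date handling concrete, here is a small sketch of the same two steps on a toy column: pull the millisecond timestamp out of each `{'$date': ...}` dict, then convert it to a datetime. The values are illustrative and are not taken from receipts.gz; `raw`, `ms`, and `dates` are hypothetical names.

```python
import pandas as pd

# Toy column shaped like the nested date fields (illustrative values only)
raw = pd.Series([{'$date': 1609687531000}, None, {'$date': 1609687536000}])

# Pull the millisecond timestamp out of each dict; missing values stay missing
ms = raw.apply(lambda v: v.get('$date') if isinstance(v, dict) else None)

# Convert epoch milliseconds to datetimes; values that cannot be parsed become NaT
dates = pd.to_datetime(ms, unit='ms', errors='coerce')
print(dates)
```

The missing entry comes through as `NaT` rather than raising an error, which is the effect of `errors='coerce'`.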

In [7]:
df_receipts['rewardsReceiptItemList'].iloc[10]

[{'barcode': '4011',
  'description': 'ITEM NOT FOUND',
  'finalPrice': '1',
  'itemPrice': '1',
  'partnerItemId': '1',
  'quantityPurchased': 1}]

In [8]:
df_receipts['rewardsReceiptItemList'].iloc[25]

[{'barcode': '4011',
  'description': 'ITEM NOT FOUND',
  'finalPrice': '27.00',
  'itemPrice': '27.00',
  'needsFetchReview': False,
  'partnerItemId': '1',
  'preventTargetGapPoints': True,
  'quantityPurchased': 5,
  'userFlaggedBarcode': '4011',
  'userFlaggedNewItem': True,
  'userFlaggedPrice': '27.00',
  'userFlaggedQuantity': 5}]

In [9]:
df_receipts['rewardsReceiptItemList'].iloc[42]

[{'barcode': '025800000135',
  'description': 'SMART MADE Rosemary Grilled Beef & Vegetables 9 OZ Box',
  'finalPrice': '5',
  'itemPrice': '5',
  'partnerItemId': '1',
  'quantityPurchased': 1,
  'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'},
 {'barcode': '025800000135',
  'description': 'SMART MADE Rosemary Grilled Beef & Vegetables 9 OZ Box',
  'finalPrice': '5',
  'itemPrice': '5',
  'partnerItemId': '2',
  'quantityPurchased': 1,
  'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'},
 {'barcode': '025800000135',
  'description': 'SMART MADE Rosemary Grilled Beef & Vegetables 9 OZ Box',
  'finalPrice': '5',
  'itemPrice': '5',
  'partnerItemId': '3',
  'quantityPurchased': 1,
  'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'},
 {'barcode': '025800000135',
  'description': 'SMART MADE Rosemary Grilled Beef & Vegetables 9 OZ Box',
  'finalPrice': '5',
  'itemPrice': '5',
  'partnerItemId': '4',
  'quantityPurchased': 1,
  'rewardsProductPartnerId': '559c2234e4b06aca

## df_items Explore <a id="items"></a>
**Items Table** - Extracting fields from the rewardsReceiptItemList in the Receipts table

**Summary**: The majority of the columns within df_items are mostly null.
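
The sparsity can also be expressed as a null share per column rather than read off `.info()`. The sketch below assumes a toy stand-in for df_items (`toy_items`, with illustrative values) and a hypothetical cutoff `SPARSE_THRESHOLD`; neither comes from the original analysis.

```python
import pandas as pd

# Toy stand-in for df_items; column names appear in this notebook, values are illustrative
toy_items = pd.DataFrame({
    'barcode': ['4011', None, '025800000135', None],
    'description': ['ITEM NOT FOUND', 'ITEM NOT FOUND', None, 'ITEM NOT FOUND'],
    'userFlaggedPrice': [None, '27.00', None, None],
})

# Share of null values per column, highest first
null_share = toy_items.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff: flag columns that are more than half null
SPARSE_THRESHOLD = 0.5
sparse_columns = null_share[null_share > SPARSE_THRESHOLD].index.tolist()

print(null_share)
print(sparse_columns)
```

Columns flagged this way would be candidates for review when deciding what to standardize or move to a separate table; the right threshold is a judgment call.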

In [10]:
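#Explode so that each item on a receipt gets its own row
df_receipts = df_receipts.explode('rewardsReceiptItemList')
#receipt_id was extracted from _id during data cleaning (before the explode), so it stays aligned with each item row
df_items = pd.concat([df_receipts['receipt_id'].rename('_id'), df_receipts['rewardsReceiptItemList'].apply(pd.Series)], axis = 1)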

In [11]:
df_items.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7381 entries, 0 to 1118
Data columns (total 36 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   _id                                 7381 non-null   object 
 1   barcode                             3090 non-null   object 
 2   description                         6560 non-null   object 
 3   finalPrice                          6767 non-null   object 
 4   itemPrice                           6767 non-null   object 
 5   needsFetchReview                    813 non-null    object 
 6   partnerItemId                       6941 non-null   object 
 7   preventTargetGapPoints              358 non-null    object 
 8   quantityPurchased                   6767 non-null   float64
 9   userFlaggedBarcode                  337 non-null    object 
 10  userFlaggedNewItem                  323 non-null    object 
 11  userFlaggedPrice                    299 non