### Data Quality Issues in the Receipts, Users, and Brands data

In [4]:
import numpy as np
import pandas as pd
import json
import datetime
from collections import defaultdict

In [2]:
with open("data/receipts.json") as receipts_file:
    receipts_raw = [json.loads(line) for line in receipts_file]
with open("data/users.json") as users_file:
    users_raw = [json.loads(line) for line in users_file]
with open("data/brands.json") as brands_file:
    brands_raw = [json.loads(line) for line in brands_file]

#### Data Completeness

In [103]:
def dataCompletenessReport(list_of_dicts):
    """
    Accepts a list of dictionaries and returns a pandas DataFrame summarizing,
    for each field, the data types observed, the number of records that
    contain the field, and the proportion of records that contain it.
    Assumes:  Dictionaries in the input list have keys in common

    """
    key_count_dict = defaultdict(int)  # field -> number of records containing it
    n_lines = len(list_of_dicts)
    datatype_dict = defaultdict(set)  # field -> set of value types observed
    for curr in list_of_dicts:  # loop over raw records
        for key, value in curr.items():
            key_count_dict[key] += 1  # key is present (its value may still be None)
            datatype_dict[key].add(type(value))
    # Build one row per field by transposing the two per-field dictionaries
    dataCompDf = pd.DataFrame([datatype_dict, key_count_dict]).T
    dataCompDf.columns = ["Data Types", "Count Populated"]
    dataCompDf["Proportion Populated"] = dataCompDf["Count Populated"] / n_lines

    return dataCompDf

In [104]:
dataCompletenessReport(receipts_raw)

Field,Data Types,Count Populated,Proportion Populated
_id,{<class 'dict'>},1119,1.0
bonusPointsEarned,{<class 'int'>},544,0.486148
bonusPointsEarnedReason,{<class 'str'>},544,0.486148
createDate,{<class 'dict'>},1119,1.0
dateScanned,{<class 'dict'>},1119,1.0
finishedDate,{<class 'dict'>},568,0.507596
modifyDate,{<class 'dict'>},1119,1.0
pointsAwardedDate,{<class 'dict'>},537,0.479893
pointsEarned,{<class 'str'>},609,0.544236
purchaseDate,{<class 'dict'>},671,0.599643


Initial observations:

* rewardsReceiptItemList has type list.  We should investigate data completeness of it as well.
* pointsAwardedDate is populated on fewer receipts (537) than pointsEarned (609), so at least 72 receipts with points earned have no award date
* Total spent, item list, and item count are not populated for every receipt
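
The observation about missing award dates can be checked record by record. The following is a minimal sketch, assuming a receipt counts as having earned points whenever `pointsEarned` is present and not null; the helper name `receipts_missing_award_date` and the toy records (including the placeholder shapes of their dict values) are hypothetical and not taken from the dataset.

```python
def receipts_missing_award_date(receipts):
    """Return receipts that have pointsEarned but no pointsAwardedDate."""
    return [
        receipt
        for receipt in receipts
        if receipt.get("pointsEarned") is not None
        and receipt.get("pointsAwardedDate") is None
    ]


# Hypothetical toy records; the real input would be receipts_raw
sample_receipts = [
    {"_id": {"id": "a"}, "pointsEarned": "500.0", "pointsAwardedDate": {"date": 1}},
    {"_id": {"id": "b"}, "pointsEarned": "150.0"},  # earned points, no award date
    {"_id": {"id": "c"}},  # no points at all, so not flagged
]

missing_award_date = receipts_missing_award_date(sample_receipts)
print(len(missing_award_date), "receipt(s) earned points without an award date")
```

In the notebook, the same helper could be called on `receipts_raw` to list the affected receipts.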

In [74]:
dataCompletenessReport(users_raw)

Field,Data Types,Count Populated,Proportion Populated
_id,{<class 'dict'>},495,1.0
active,{<class 'bool'>},495,1.0
createdDate,{<class 'dict'>},495,1.0
lastLogin,{<class 'dict'>},433,0.874747
role,{<class 'str'>},495,1.0
signUpSource,{<class 'str'>},447,0.90303
state,{<class 'str'>},439,0.886869


Initial observations:
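* No major indications of data quality issues that would affect reporting on Users: lastLogin, signUpSource, and state are each roughly 87% to 90% populated, and the other fields shown are fully populated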

In [75]:
dataCompletenessReport(brands_raw)

Field,Data Types,Count Populated,Proportion Populated
_id,{<class 'dict'>},1167,1.0
barcode,{<class 'str'>},1167,1.0
category,{<class 'str'>},1012,0.867181
categoryCode,{<class 'str'>},517,0.443016
cpg,{<class 'dict'>},1167,1.0
name,{<class 'str'>},1167,1.0
topBrand,{<class 'bool'>},555,0.475578
brandCode,{<class 'str'>},933,0.799486


Initial observations:
* brandCode is populated for only about 80% of brand records (933 of 1167).  Since receipt lines identify a brand by brandCode, a brand that appears in a receipt might not be matchable to a record in the brands data

In [76]:
#Build a flat list of raw receipt item lines to check for data completeness and quality
#Receipts without a rewardsReceiptItemList are skipped
rewardsReceiptItemLines_raw = [curr_receipt['rewardsReceiptItemList'] for curr_receipt in receipts_raw if curr_receipt.get('rewardsReceiptItemList') is not None]
rewardsReceiptItemLines_flat = [curr_item for curr_list in rewardsReceiptItemLines_raw for curr_item in curr_list]
dataCompletenessReport(rewardsReceiptItemLines_flat)

Field,Data Types,Count Populated,Proportion Populated
barcode,{<class 'str'>},3090,0.445181
description,{<class 'str'>},6560,0.945109
finalPrice,{<class 'str'>},6767,0.974932
itemPrice,{<class 'str'>},6767,0.974932
needsFetchReview,{<class 'bool'>},813,0.11713
partnerItemId,{<class 'str'>},6941,1.0
preventTargetGapPoints,{<class 'bool'>},358,0.051578
quantityPurchased,{<class 'int'>},6767,0.974932
userFlaggedBarcode,{<class 'str'>},337,0.048552
userFlaggedNewItem,{<class 'bool'>},323,0.046535


Initial observations:
* brandCode is only populated for 37.45% of receipt lines
* barcode is not populated for every receipt line
* Approximately 2.5% of item lines are missing item prices (itemPrice is populated for 97.49% of lines)
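
The share of lines missing a given field is one minus the proportion populated, and it can also be computed straight from the raw records. The sketch below is an illustration under stated assumptions: the helper name `missing_share` and the toy item lines are hypothetical, and unlike `dataCompletenessReport` it treats a key whose value is `None` as missing as well.

```python
def missing_share(records, field):
    """Return the fraction of records where `field` is absent or None."""
    if not records:
        return 0.0  # avoid dividing by zero on an empty input
    n_missing = sum(1 for record in records if record.get(field) is None)
    return n_missing / len(records)


# Hypothetical toy item lines; the real input would be rewardsReceiptItemLines_flat
sample_item_lines = [
    {"partnerItemId": "1", "itemPrice": "26.00", "barcode": "4011"},
    {"partnerItemId": "2", "itemPrice": "1.00"},
    {"partnerItemId": "3", "itemPrice": "3.25", "barcode": None},
    {"partnerItemId": "4"},
]

for field in ("itemPrice", "barcode"):
    share = missing_share(sample_item_lines, field)
    print(f"{field}: {share:.1%} of lines missing")
```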

## Key data quality issues relating to data completeness 

Issue 1: Only 37.45% of receipt item lines contain brand codes.  

This is a serious concern for our ability to answer the first, second, fifth, and sixth questions asked by the stakeholder.  brandCode is currently the only way we can uniquely identify a brand from a receipt line, so we won't be able to summarize by brand reliably when so few receipt lines can be associated with a brand.
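
To gauge how much of the receipt data can actually be tied to a brand, one option is to measure both how many item lines carry a brandCode and how many of those codes exist in the brands data. This is a rough sketch, assuming brandCode values match exactly between the two sources (no case or whitespace normalization); the function name `brand_code_coverage` and the toy records are hypothetical.

```python
def brand_code_coverage(item_lines, brands):
    """Summarize how many item lines can be linked to a brand via brandCode."""
    # Brand codes that exist in the brands data (empty or missing codes ignored)
    known_codes = {b["brandCode"] for b in brands if b.get("brandCode")}
    # Brand codes found on receipt item lines, one entry per line that has a code
    line_codes = [line["brandCode"] for line in item_lines if line.get("brandCode")]
    n_lines = len(item_lines)
    n_matched = sum(1 for code in line_codes if code in known_codes)
    return {
        "share_with_code": len(line_codes) / n_lines if n_lines else 0.0,
        "share_matched": n_matched / n_lines if n_lines else 0.0,
        "unmatched_codes": set(line_codes) - known_codes,
    }


# Hypothetical toy data; the real inputs would be
# rewardsReceiptItemLines_flat and brands_raw
sample_lines = [
    {"brandCode": "BRAND A"},
    {"brandCode": "BRAND X"},  # code not present in the brands data
    {"description": "item without a brand code"},
    {"brandCode": "BRAND A"},
]
sample_brands = [
    {"name": "Brand A", "brandCode": "BRAND A"},
    {"name": "Brand B"},  # brand record without a brandCode
]

coverage = brand_code_coverage(sample_lines, sample_brands)
print(coverage)
```

Reporting `share_matched` alongside `share_with_code` separates lines that have no brand code at all from lines whose code simply has no counterpart in the brands data.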

Issue 2:  
