In [46]:
import numpy as np
import pandas as pd
import json
import datetime

In [38]:
with open("data/receipts.json") as receipts_file:
    receipts_raw = [json.loads(line) for line in receipts_file]
with open("data/users.json") as users_file:
    users_raw = [json.loads(line) for line in users_file]
with open("data/brands.json") as brands_file:
    brands_raw = [json.loads(line) for line in brands_file]

### Initial exploration of Receipts Data schema

In [58]:
print(json.dumps(receipts_raw[1],indent=2))

{
  "_id": {
    "$oid": "5ff1e1bb0a720f052300056b"
  },
  "bonusPointsEarned": 150,
  "bonusPointsEarnedReason": "Receipt number 5 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",
  "createDate": {
    "$date": 1609687483000
  },
  "dateScanned": {
    "$date": 1609687483000
  },
  "finishedDate": {
    "$date": 1609687483000
  },
  "modifyDate": {
    "$date": 1609687488000
  },
  "pointsAwardedDate": {
    "$date": 1609687483000
  },
  "pointsEarned": "150.0",
  "purchaseDate": {
    "$date": 1609601083000
  },
  "purchasedItemCount": 2,
  "rewardsReceiptItemList": [
    {
      "barcode": "4011",
      "description": "ITEM NOT FOUND",
      "finalPrice": "1",
      "itemPrice": "1",
      "partnerItemId": "1",
      "quantityPurchased": 1
    },
    {
      "barcode": "028400642255",
      "description": "DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ",
      "finalPrice": "10.00",
      "itemPrice": "10.00",
      "needsFetchReview": true,
      "

#### Observations/questions about this example data point:

* Value of $oid within the "_id" node identifies the receipt
* Dates (createDate, dateScanned, finishedDate, and modifyDate) appear to be in milliseconds since the Unix epoch (1970-01-01), but this should be confirmed with the stakeholder:

In [72]:
print(datetime.datetime.fromtimestamp(receipts_raw[0]["dateScanned"]["$date"]/1000.0))

2021-01-03 09:25:31


* pointsEarned is in string format with one decimal place, but bonusPointsEarned is in int format.  This should be considered when choosing data types for database tables
* For stakeholder: What is the DEFAULT (code) at the end of bonusPointsEarnedReason?
* purchasedItemCount does *not* necessarily match the number of items in rewardsReceiptItemList
* The "barcode" or "userFlaggedBarcode" value of each item in rewardsReceiptItemList might serve as a foreign key to the Brand schema, but it doesn't appear to match directly
* rewardsReceiptItemList holds the item-level detail, so it is the part of each receipt that deserves the closest look
* userID can hopefully be used as a foreign key to the Users data schema
* Some items have the description "ITEM NOT FOUND" (here with barcode 4011)
* totalSpent is in string format
* brandCode is not populated for every receipt line  
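
The date and data type observations above can be sketched as a small normalization step. This is only an illustration: the helper name `parse_mongo_date` is my own, the sample values are copied from the example receipt, and it assumes the `$date` values really are milliseconds since the Unix epoch in UTC (still to be confirmed with the stakeholder).

```python
import datetime


def parse_mongo_date(node):
    """Convert a {"$date": <ms since epoch>} node to a timezone-aware UTC datetime.

    Assumption: the value is milliseconds since 1970-01-01 UTC.
    """
    if node is None:
        return None
    return datetime.datetime.fromtimestamp(
        node["$date"] / 1000.0, tz=datetime.timezone.utc
    )


# Values copied from the example receipt printed above
sample_receipt = {
    "dateScanned": {"$date": 1609687483000},
    "pointsEarned": "150.0",
    "bonusPointsEarned": 150,
}

scanned_utc = parse_mongo_date(sample_receipt["dateScanned"])

# pointsEarned arrives as a string, bonusPointsEarned as an int;
# casting both to float gives them a common numeric type
points_earned = float(sample_receipt["pointsEarned"])
bonus_points = float(sample_receipt["bonusPointsEarned"])

print(scanned_utc.isoformat(), points_earned, bonus_points)
```

Passing an explicit time zone avoids the local-time conversion that `datetime.datetime.fromtimestamp` applies by default, so the result does not depend on the machine running the notebook.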

Outstanding questions for data analyst:
* Which keys have missing values?
* How can we reliably join to the Brand schema?
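
One possible way to approach the first question is to count, for every key seen anywhere in the file, how many records lack it. The sketch below uses a made-up `sample_records` list and a hypothetical helper `key_coverage`; in the notebook it would be called on `receipts_raw` instead.

```python
from collections import Counter


def key_coverage(records):
    """Return {key: number of records where the key is absent or None}."""
    # Collect every top-level key that appears in at least one record
    all_keys = set()
    for record in records:
        all_keys.update(record)

    # Count the records in which each key is missing or explicitly null
    missing = Counter()
    for record in records:
        for key in all_keys:
            if record.get(key) is None:
                missing[key] += 1
    return {key: missing[key] for key in sorted(all_keys)}


# Made-up records for illustration only (not taken from receipts.json)
sample_records = [
    {"_id": {"$oid": "a"}, "bonusPointsEarned": 150, "totalSpent": "11.00"},
    {"_id": {"$oid": "b"}, "totalSpent": None},
    {"_id": {"$oid": "c"}, "bonusPointsEarned": 5},
]

for key, n_missing in key_coverage(sample_records).items():
    print(f"{key}: missing in {n_missing} of {len(sample_records)} records")
```

This only inspects top-level keys; the items nested in rewardsReceiptItemList would need the same treatment applied to each list.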

### Initial exploration of Users Data schema

In [73]:
print(json.dumps(users_raw[0],indent=2))
print(json.dumps(users_raw[1],indent=2))

{
  "_id": {
    "$oid": "5ff1e194b6a9d73a3a9f1052"
  },
  "active": true,
  "createdDate": {
    "$date": 1609687444800
  },
  "lastLogin": {
    "$date": 1609687537858
  },
  "role": "consumer",
  "signUpSource": "Email",
  "state": "WI"
}
{
  "_id": {
    "$oid": "5ff1e194b6a9d73a3a9f1052"
  },
  "active": true,
  "createdDate": {
    "$date": 1609687444800
  },
  "lastLogin": {
    "$date": 1609687537858
  },
  "role": "consumer",
  "signUpSource": "Email",
  "state": "WI"
}


#### Observations/questions about these example data points:

* They are duplicates - we should check other files for the same issue
* Dates also appear to be in milliseconds since the Unix epoch (1970-01-01)
* The user $oid values do appear in receipts.json, so I plan to use $oid as the link between the two schemas
* What's the point of the "role" value if it is always "consumer"?
* Nothing else major of note
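
A minimal sketch of how the duplicate and "role" observations could be checked. The helper `duplicate_ids` is hypothetical, the first two sample records are trimmed copies of the users printed above, and the third is made up for contrast. In the notebook the same calls would run on `users_raw`.

```python
from collections import Counter


def duplicate_ids(records):
    """Return {$oid: count} for every $oid that occurs more than once."""
    counts = Counter(record["_id"]["$oid"] for record in records)
    return {oid: n for oid, n in counts.items() if n > 1}


sample_users = [
    # Trimmed copies of the two identical records shown above
    {"_id": {"$oid": "5ff1e194b6a9d73a3a9f1052"}, "role": "consumer", "state": "WI"},
    {"_id": {"$oid": "5ff1e194b6a9d73a3a9f1052"}, "role": "consumer", "state": "WI"},
    # Made-up record, included only so the sample has one non-duplicate
    {"_id": {"$oid": "hypothetical-user-id"}, "role": "consumer", "state": "WI"},
]

duplicates = duplicate_ids(sample_users)
print("Duplicated $oid values:", duplicates)

# Tally the distinct values of "role" to see whether it is ever
# anything other than "consumer"
role_counts = Counter(user.get("role") for user in sample_users)
print("Role values:", dict(role_counts))
```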

### Initial exploration of Brand Data schema

In [74]:
print(json.dumps(brands_raw[0],indent=2))
print(json.dumps(brands_raw[50],indent=2))

{
  "_id": {
    "$oid": "601ac115be37ce2ead437551"
  },
  "barcode": "511111019862",
  "category": "Baking",
  "categoryCode": "BAKING",
  "cpg": {
    "$id": {
      "$oid": "601ac114be37ce2ead437550"
    },
    "$ref": "Cogs"
  },
  "name": "test brand @1612366101024",
  "topBrand": false
}
{
  "_id": {
    "$oid": "5d602d9d6d5f3b23d1bc7907"
  },
  "name": "Kevita Sparkling Drinks",
  "cpg": {
    "$ref": "Cogs",
    "$id": {
      "$oid": "5332f5fbe4b03c9a25efd0ba"
    }
  },
  "category": "Beverages",
  "barcode": "511111704935",
  "brandCode": "KEVITA"
}


#### Observations/questions about these example data points:

* The first data point is from a test brand.  Test data points should be removed
* The oid within cpg appears to correspond to rewardsProductPartnerID in the rewardsReceiptItemList, but it is not unique within brands.json
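
Both observations could be checked roughly as follows. This is a sketch under assumptions: `is_test_brand` is a hypothetical helper that treats any name starting with "test brand" as test data (a rule inferred from the single example above, not confirmed), and the sample records are trimmed copies of the two brands printed above.

```python
from collections import Counter


def is_test_brand(brand):
    """Guess whether a brand record is test data.

    Assumption: test records have a name starting with "test brand".
    """
    return brand.get("name", "").lower().startswith("test brand")


# Trimmed copies of the two brand records shown above
sample_brands = [
    {
        "_id": {"$oid": "601ac115be37ce2ead437551"},
        "barcode": "511111019862",
        "cpg": {"$id": {"$oid": "601ac114be37ce2ead437550"}, "$ref": "Cogs"},
        "name": "test brand @1612366101024",
    },
    {
        "_id": {"$oid": "5d602d9d6d5f3b23d1bc7907"},
        "barcode": "511111704935",
        "cpg": {"$id": {"$oid": "5332f5fbe4b03c9a25efd0ba"}, "$ref": "Cogs"},
        "name": "Kevita Sparkling Drinks",
        "brandCode": "KEVITA",
    },
]

# Drop the records flagged as test data
real_brands = [brand for brand in sample_brands if not is_test_brand(brand)]

# Count how many brands share each cpg oid; any count above 1 means
# the cpg oid cannot identify a single brand
cpg_counts = Counter(
    brand["cpg"]["$id"]["$oid"] for brand in sample_brands if "cpg" in brand
)
shared_cpg = {oid: n for oid, n in cpg_counts.items() if n > 1}

print(len(real_brands), "non-test brands;", len(shared_cpg), "cpg oids shared")
```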

Programmatic evaluation of data quality will be in DataQualityIssues.ipynb.