## Andrew Byrnes: Fetch Rewards Coding Exercise - Data Analyst
### 1-Data_Prep.ipynb

This notebook preps the provided data files and uploads them to a SQLite database. It includes a an entity relationship diagram of the how I've modeled this data.
I chose SQLite for this challenge because it is lightweight and lends itself well to sharing.  
SQLite's flexible typing rules could be a liability for a database that would be continually updated and primarily used for analysis. For that use case I would choose something that more strictly follows SQL standard.

### Data Sources
- data-modeling.html : coding exercise instructions
- brands.json.gz, receipts.json.gz, users.json.gz : raw data files provided for competition of the challenge

### Changes
- 09-17-2022 : Started project, first look at data, identified transformation tasks 
- 09-18-2022 : cleaned df_brands _id and cpg columns
- 09-19-2022 : wrote function to clean columns with dicts, wrote function to convert epoch time to timestamps
- 09-20-2022 : refactored functions, applied functions cleaning and converting data, explored df_receipts.rewardsReceiptItemList, notes on stakeholder questions
- 09-21-2022 : brainstorming notes on receipt_items, created df_receipt_items dataframe, loaded all 4 dataframes to SQLite fetch.db, drew fetch.db ERD
- 09-22-2022 : added a dupe_barcodes column to the brands data and reloaded, restarted kernel and re-ran everything
- 09-24-2022 : adjusting receipt_items ETL to bring in brandcode, regex to extract brandcodes, add extracted_brand_code to receipt_items
- 09-25-2022 : reset kernel and re-ran all code prior to submission

In [1]:
import pandas as pd
from pathlib import Path
import os
from datetime import datetime
import gzip
import json
import sqlite3
import re

### File Locations

In [2]:
today = datetime.today()
print(today)
in_brands = Path.cwd() / "data" / "raw" / "brands.json.gz"
in_receipts = Path.cwd() / "data" / "raw" / "receipts.json.gz"
in_users = Path.cwd() / "data" / "raw" / "users.json.gz"
db_path = Path.cwd() / "data" / "processed" / "fetch.db"

2022-09-25 21:04:39.125360


### Drop database if exists

In [3]:
if os.path.exists(db_path):
    os.remove(db_path)
    print("The db has been removed successfully")
else:
    print("The db does not exist!")

The db does not exist!


### Formatting and options

In [4]:
pd.set_option('display.max_colwidth', None)
# pd.set_option('display.max_rows', None)
pd.reset_option('display.max_rows')
pd.set_option('display.max_columns', None)
# pd.reset_option('display.max_columns')
# surpressing a warning related to renaming columns just prior to loading to sqlite
pd.options.mode.chained_assignment = None

### Load JSON data to Panda's dataframes

In [5]:
df_brands = pd.read_json(in_brands,lines=True,compression='gzip')
df_receipts = pd.read_json(in_receipts,lines=True,compression='gzip')
df_users = pd.read_json(in_users,lines=True,compression='gzip')

### First look at data

### **brands**  
**to-do**:
- ~extract _ids~
- ~extract cpg ids~

In [6]:
df_brands.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1167 entries, 0 to 1166
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   _id           1167 non-null   object 
 1   barcode       1167 non-null   int64  
 2   category      1012 non-null   object 
 3   categoryCode  517 non-null    object 
 4   cpg           1167 non-null   object 
 5   name          1167 non-null   object 
 6   topBrand      555 non-null    float64
 7   brandCode     933 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 73.1+ KB


**Brand Data Schema**
- _id: brand uuid
- barcode: the barcode on the item
- brandCode: String that corresponds with the brand column in a partner product file
- category: The category name for which the brand sells products in
- categoryCode: The category code that references a BrandCategory
- cpg: reference to CPG collection
- topBrand: Boolean indicator for whether the brand should be featured as a 'top brand'
- name: Brand name

In [7]:
df_brands

Unnamed: 0,_id,barcode,category,categoryCode,cpg,name,topBrand,brandCode
0,{'$oid': '601ac115be37ce2ead437551'},511111019862,Baking,BAKING,"{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}",test brand @1612366101024,0.0,
1,{'$oid': '601c5460be37ce2ead43755f'},511111519928,Beverages,BEVERAGES,"{'$id': {'$oid': '5332f5fbe4b03c9a25efd0ba'}, '$ref': 'Cogs'}",Starbucks,0.0,STARBUCKS
2,{'$oid': '601ac142be37ce2ead43755d'},511111819905,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",test brand @1612366146176,0.0,TEST BRANDCODE @1612366146176
3,{'$oid': '601ac142be37ce2ead43755a'},511111519874,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",test brand @1612366146051,0.0,TEST BRANDCODE @1612366146051
4,{'$oid': '601ac142be37ce2ead43755e'},511111319917,Candy & Sweets,CANDY_AND_SWEETS,"{'$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}, '$ref': 'Cogs'}",test brand @1612366146827,0.0,TEST BRANDCODE @1612366146827
...,...,...,...,...,...,...,...,...
1162,{'$oid': '5f77274dbe37ce6b592e90c0'},511111116752,Baking,BAKING,"{'$ref': 'Cogs', '$id': {'$oid': '5f77274dbe37ce6b592e90bf'}}",test brand @1601644365844,,
1163,{'$oid': '5dc1fca91dda2c0ad7da64ae'},511111706328,Breakfast & Cereal,,"{'$ref': 'Cogs', '$id': {'$oid': '53e10d6368abd3c7065097cc'}}",Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
1164,{'$oid': '5f494c6e04db711dd8fe87e7'},511111416173,Candy & Sweets,CANDY_AND_SWEETS,"{'$ref': 'Cogs', '$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}}",test brand @1598639215217,,TEST BRANDCODE @1598639215217
1165,{'$oid': '5a021611e4b00efe02b02a57'},511111400608,Grocery,,"{'$ref': 'Cogs', '$id': {'$oid': '5332f5f6e4b03c9a25efd0b4'}}",LIPTON TEA Leaves,0.0,LIPTON TEA Leaves


### **receipts**  
**to-do**:
- ~extract _ids~
- ~extract and convert createDate~
- ~extract and convert dateScanned~
- ~extract and convert finishedDate~
- ~extract and convert modifyDate~
- ~extract and convert pointsAwardedDate~
- ~extract and convert purchaseDate~
- create receipt_items table using the rewardsReceiptItemList

In [8]:
df_receipts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1119 entries, 0 to 1118
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   _id                      1119 non-null   object 
 1   bonusPointsEarned        544 non-null    float64
 2   bonusPointsEarnedReason  544 non-null    object 
 3   createDate               1119 non-null   object 
 4   dateScanned              1119 non-null   object 
 5   finishedDate             568 non-null    object 
 6   modifyDate               1119 non-null   object 
 7   pointsAwardedDate        537 non-null    object 
 8   pointsEarned             609 non-null    float64
 9   purchaseDate             671 non-null    object 
 10  purchasedItemCount       635 non-null    float64
 11  rewardsReceiptItemList   679 non-null    object 
 12  rewardsReceiptStatus     1119 non-null   object 
 13  totalSpent               684 non-null    float64
 14  userId                  

**Receipts Data Schema**
- _id: uuid for this receipt
- bonusPointsEarned: Number of bonus points that were awarded upon receipt completion
- bonusPointsEarnedReason: event that triggered bonus points
- createDate: The date that the event was created
- dateScanned: Date that the user scanned their receipt
- finishedDate: Date that the receipt finished processing
- modifyDate: The date the event was modified
- pointsAwardedDate: The date we awarded points for the transaction
- pointsEarned: The number of points earned for the receipt
- purchaseDate: the date of the purchase
- purchasedItemCount: Count of number of items on the receipt
- rewardsReceiptItemList: The items that were purchased on the receipt
- rewardsReceiptStatus: status of the receipt through receipt validation and processing
- totalSpent: The total amount on the receipt
- userId: string id back to the User collection for the user who scanned the receipt

In [9]:
df_receipts

Unnamed: 0,_id,bonusPointsEarned,bonusPointsEarnedReason,createDate,dateScanned,finishedDate,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
0,{'$oid': '5ff1e1eb0a720f0523000575'},500.0,"Receipt number 2 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1609687531000},{'$date': 1609687531000},{'$date': 1609687531000},{'$date': 1609687536000},{'$date': 1609687531000},500.0,{'$date': 1609632000000},5.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '26.00', 'itemPrice': '26.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 5, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '26.00', 'userFlaggedQuantity': 5}]",FINISHED,26.00,5ff1e1eacfcf6c399c274ae6
1,{'$oid': '5ff1e1bb0a720f052300056b'},150.0,"Receipt number 5 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1609687483000},{'$date': 1609687483000},{'$date': 1609687483000},{'$date': 1609687488000},{'$date': 1609687483000},150.0,{'$date': 1609601083000},2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '028400642255', 'description': 'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ', 'finalPrice': '10.00', 'itemPrice': '10.00', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'pointsNotAwardedReason': 'Action not allowed for user and CPG', 'pointsPayerId': '5332f5fbe4b03c9a25efd0ba', 'preventTargetGapPoints': True, 'quantityPurchased': 1, 'rewardsGroup': 'DORITOS SPICY SWEET CHILI SINGLE SERVE', 'rewardsProductPartnerId': '5332f5fbe4b03c9a25efd0ba', 'userFlaggedBarcode': '028400642255', 'userFlaggedDescription': 'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ', 'userFlaggedNewItem': True, 'userFlaggedPrice': '10.00', 'userFlaggedQuantity': 1}]",FINISHED,11.00,5ff1e194b6a9d73a3a9f1052
2,{'$oid': '5ff1e1f10a720f052300057a'},5.0,All-receipts receipt bonus,{'$date': 1609687537000},{'$date': 1609687537000},,{'$date': 1609687542000},,5.0,{'$date': 1609632000000},1.0,"[{'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '26.00', 'userFlaggedQuantity': 3}]",REJECTED,10.00,5ff1e1f1cfcf6c399c274b0b
3,{'$oid': '5ff1e1ee0a7214ada100056f'},5.0,All-receipts receipt bonus,{'$date': 1609687534000},{'$date': 1609687534000},{'$date': 1609687534000},{'$date': 1609687539000},{'$date': 1609687534000},5.0,{'$date': 1609632000000},4.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '28.00', 'itemPrice': '28.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 4, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '28.00', 'userFlaggedQuantity': 4}]",FINISHED,28.00,5ff1e1eacfcf6c399c274ae6
4,{'$oid': '5ff1e1d20a7214ada1000561'},5.0,All-receipts receipt bonus,{'$date': 1609687506000},{'$date': 1609687506000},{'$date': 1609687511000},{'$date': 1609687511000},{'$date': 1609687506000},5.0,{'$date': 1609601106000},2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '1234', 'finalPrice': '2.56', 'itemPrice': '2.56', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'preventTargetGapPoints': True, 'quantityPurchased': 3, 'userFlaggedBarcode': '1234', 'userFlaggedDescription': '', 'userFlaggedNewItem': True, 'userFlaggedPrice': '2.56', 'userFlaggedQuantity': 3}]",FINISHED,1.00,5ff1e194b6a9d73a3a9f1052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1114,{'$oid': '603cc0630a720fde100003e6'},25.0,COMPLETE_NONPARTNER_RECEIPT,{'$date': 1614594147000},{'$date': 1614594147000},,{'$date': 1614594148000},,25.0,{'$date': 1597622400000},2.0,"[{'barcode': 'B076FJ92M4', 'description': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'discountedItemPrice': '22.97', 'finalPrice': '22.97', 'itemPrice': '22.97', 'originalReceiptItemText': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'partnerItemId': '0', 'priceAfterCoupon': '22.97', 'quantityPurchased': 1}, {'barcode': 'B07BRRLSVC', 'description': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'discountedItemPrice': '11.99', 'finalPrice': '11.99', 'itemPrice': '11.99', 'originalReceiptItemText': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'partnerItemId': '1', 'priceAfterCoupon': '11.99', 'quantityPurchased': 1}]",REJECTED,34.96,5fc961c3b8cfca11a077dd33
1115,{'$oid': '603d0b710a720fde1000042a'},,,{'$date': 1614613361873},{'$date': 1614613361873},,{'$date': 1614613361873},,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
1116,{'$oid': '603cf5290a720fde10000413'},,,{'$date': 1614607657664},{'$date': 1614607657664},,{'$date': 1614607657664},,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
1117,{'$oid': '603ce7100a7217c72c000405'},25.0,COMPLETE_NONPARTNER_RECEIPT,{'$date': 1614604048000},{'$date': 1614604048000},,{'$date': 1614604049000},,25.0,{'$date': 1597622400000},2.0,"[{'barcode': 'B076FJ92M4', 'description': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'discountedItemPrice': '22.97', 'finalPrice': '22.97', 'itemPrice': '22.97', 'originalReceiptItemText': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'partnerItemId': '0', 'priceAfterCoupon': '22.97', 'quantityPurchased': 1}, {'barcode': 'B07BRRLSVC', 'description': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'discountedItemPrice': '11.99', 'finalPrice': '11.99', 'itemPrice': '11.99', 'originalReceiptItemText': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'partnerItemId': '1', 'priceAfterCoupon': '11.99', 'quantityPurchased': 1}]",REJECTED,34.96,5fc961c3b8cfca11a077dd33


### **users**  
**to-do**:
- ~extract _ids~
- ~extract and convert createdDate~
- ~extract and convert lastLogin~

In [10]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 495 entries, 0 to 494
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   _id           495 non-null    object
 1   active        495 non-null    bool  
 2   createdDate   495 non-null    object
 3   lastLogin     433 non-null    object
 4   role          495 non-null    object
 5   signUpSource  447 non-null    object
 6   state         439 non-null    object
dtypes: bool(1), object(6)
memory usage: 23.8+ KB


**Users Data Schema**
- _id: user Id
- state: state abbreviation
- createdDate: when the user created their account
- lastLogin: last time the user was recorded logging in to the app
- role: constant value set to 'CONSUMER'
- active: indicates if the user is active; only Fetch will de-activate an account with this flag

In [11]:
df_users

Unnamed: 0,_id,active,createdDate,lastLogin,role,signUpSource,state
0,{'$oid': '5ff1e194b6a9d73a3a9f1052'},True,{'$date': 1609687444800},{'$date': 1609687537858},consumer,Email,WI
1,{'$oid': '5ff1e194b6a9d73a3a9f1052'},True,{'$date': 1609687444800},{'$date': 1609687537858},consumer,Email,WI
2,{'$oid': '5ff1e194b6a9d73a3a9f1052'},True,{'$date': 1609687444800},{'$date': 1609687537858},consumer,Email,WI
3,{'$oid': '5ff1e1eacfcf6c399c274ae6'},True,{'$date': 1609687530554},{'$date': 1609687530597},consumer,Email,WI
4,{'$oid': '5ff1e194b6a9d73a3a9f1052'},True,{'$date': 1609687444800},{'$date': 1609687537858},consumer,Email,WI
...,...,...,...,...,...,...,...
490,{'$oid': '54943462e4b07e684157a532'},True,{'$date': 1418998882381},{'$date': 1614963143204},fetch-staff,,
491,{'$oid': '54943462e4b07e684157a532'},True,{'$date': 1418998882381},{'$date': 1614963143204},fetch-staff,,
492,{'$oid': '54943462e4b07e684157a532'},True,{'$date': 1418998882381},{'$date': 1614963143204},fetch-staff,,
493,{'$oid': '54943462e4b07e684157a532'},True,{'$date': 1418998882381},{'$date': 1614963143204},fetch-staff,,


## Cleaning the data

### First attempt of extracting the values from columns containing dictionaries
I've chosen to include the following section of python code to help illustrate my thought process that lead to writing the value_from_dict() function. This code includes some notes as comments, but full explanation of my process is included within the documentation of the resulting function.  
The function should account for executing the following code, but if you are stepping through this notebook on a fresh kernel you can skip the cells between **Start** and **End**.

**Start** - you *can* skip executing the code starting here

In [12]:
# confirming the values in _id are being recoginized python objects, in this case a dictionary
type(df_brands['_id'][0])

dict

In [13]:
df_brands['_id'][0]['$oid']

'601ac115be37ce2ead437551'

In [14]:
# extract the _id column as a series
_id_series = df_brands['_id']
# create an list to collect the values from the dictionary objects in _id
_id_clean = []

# iterate through _id_series appending values them to _id_clean
for index, value in _id_series.items():
    _id_clean.append(value['$oid'])
    
# confirm no nulls in _id_clean
assert None not in _id_clean, "there is at least one None/null value in _id_clean"
# confirm _id_clean is the same length is the original _id column in df_brands
assert len(_id_clean) == len(df_brands['_id']), "the length of the original column and the cleaned column are not the same"

# add _id_clean to df_brands after the _id column
df_brands.insert(1, '_id_clean', _id_clean)

df_brands

Unnamed: 0,_id,_id_clean,barcode,category,categoryCode,cpg,name,topBrand,brandCode
0,{'$oid': '601ac115be37ce2ead437551'},601ac115be37ce2ead437551,511111019862,Baking,BAKING,"{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}",test brand @1612366101024,0.0,
1,{'$oid': '601c5460be37ce2ead43755f'},601c5460be37ce2ead43755f,511111519928,Beverages,BEVERAGES,"{'$id': {'$oid': '5332f5fbe4b03c9a25efd0ba'}, '$ref': 'Cogs'}",Starbucks,0.0,STARBUCKS
2,{'$oid': '601ac142be37ce2ead43755d'},601ac142be37ce2ead43755d,511111819905,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",test brand @1612366146176,0.0,TEST BRANDCODE @1612366146176
3,{'$oid': '601ac142be37ce2ead43755a'},601ac142be37ce2ead43755a,511111519874,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",test brand @1612366146051,0.0,TEST BRANDCODE @1612366146051
4,{'$oid': '601ac142be37ce2ead43755e'},601ac142be37ce2ead43755e,511111319917,Candy & Sweets,CANDY_AND_SWEETS,"{'$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}, '$ref': 'Cogs'}",test brand @1612366146827,0.0,TEST BRANDCODE @1612366146827
...,...,...,...,...,...,...,...,...,...
1162,{'$oid': '5f77274dbe37ce6b592e90c0'},5f77274dbe37ce6b592e90c0,511111116752,Baking,BAKING,"{'$ref': 'Cogs', '$id': {'$oid': '5f77274dbe37ce6b592e90bf'}}",test brand @1601644365844,,
1163,{'$oid': '5dc1fca91dda2c0ad7da64ae'},5dc1fca91dda2c0ad7da64ae,511111706328,Breakfast & Cereal,,"{'$ref': 'Cogs', '$id': {'$oid': '53e10d6368abd3c7065097cc'}}",Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
1164,{'$oid': '5f494c6e04db711dd8fe87e7'},5f494c6e04db711dd8fe87e7,511111416173,Candy & Sweets,CANDY_AND_SWEETS,"{'$ref': 'Cogs', '$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}}",test brand @1598639215217,,TEST BRANDCODE @1598639215217
1165,{'$oid': '5a021611e4b00efe02b02a57'},5a021611e4b00efe02b02a57,511111400608,Grocery,,"{'$ref': 'Cogs', '$id': {'$oid': '5332f5f6e4b03c9a25efd0b4'}}",LIPTON TEA Leaves,0.0,LIPTON TEA Leaves


In [15]:
# examining the values in cpg
type(df_brands['cpg'])

pandas.core.series.Series

In [16]:
df_brands['cpg'][0]

{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}

In [17]:
df_brands['cpg'][0]['$id']['$oid']

'601ac114be37ce2ead437550'

In [18]:
# extract the cpg column as a series
cpg_series = df_brands['cpg']
# create an list to collect the values from the dictionary objects in _id
cpg_clean = []

# iterate through cpg_series appending values them to cpg_clean
for index, value in cpg_series.items():
    cpg_clean.append(value['$id']['$oid'])
    
# confirm no nulls in _id_clean
assert None not in cpg_clean, "there is at least one None/null value in cpg_clean"
# confirm cpg_clean is the same length is the original cpg column in df_brands
assert len(cpg_clean) == len(df_brands['cpg']), "the length of the original column and the cleaned column are not the same"

# add cpg_clean to df_brands after the cpg column
df_brands.insert(6, 'cpg_clean', cpg_clean)

df_brands

Unnamed: 0,_id,_id_clean,barcode,category,categoryCode,cpg,cpg_clean,name,topBrand,brandCode
0,{'$oid': '601ac115be37ce2ead437551'},601ac115be37ce2ead437551,511111019862,Baking,BAKING,"{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}",601ac114be37ce2ead437550,test brand @1612366101024,0.0,
1,{'$oid': '601c5460be37ce2ead43755f'},601c5460be37ce2ead43755f,511111519928,Beverages,BEVERAGES,"{'$id': {'$oid': '5332f5fbe4b03c9a25efd0ba'}, '$ref': 'Cogs'}",5332f5fbe4b03c9a25efd0ba,Starbucks,0.0,STARBUCKS
2,{'$oid': '601ac142be37ce2ead43755d'},601ac142be37ce2ead43755d,511111819905,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",601ac142be37ce2ead437559,test brand @1612366146176,0.0,TEST BRANDCODE @1612366146176
3,{'$oid': '601ac142be37ce2ead43755a'},601ac142be37ce2ead43755a,511111519874,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",601ac142be37ce2ead437559,test brand @1612366146051,0.0,TEST BRANDCODE @1612366146051
4,{'$oid': '601ac142be37ce2ead43755e'},601ac142be37ce2ead43755e,511111319917,Candy & Sweets,CANDY_AND_SWEETS,"{'$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}, '$ref': 'Cogs'}",5332fa12e4b03c9a25efd1e7,test brand @1612366146827,0.0,TEST BRANDCODE @1612366146827
...,...,...,...,...,...,...,...,...,...,...
1162,{'$oid': '5f77274dbe37ce6b592e90c0'},5f77274dbe37ce6b592e90c0,511111116752,Baking,BAKING,"{'$ref': 'Cogs', '$id': {'$oid': '5f77274dbe37ce6b592e90bf'}}",5f77274dbe37ce6b592e90bf,test brand @1601644365844,,
1163,{'$oid': '5dc1fca91dda2c0ad7da64ae'},5dc1fca91dda2c0ad7da64ae,511111706328,Breakfast & Cereal,,"{'$ref': 'Cogs', '$id': {'$oid': '53e10d6368abd3c7065097cc'}}",53e10d6368abd3c7065097cc,Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
1164,{'$oid': '5f494c6e04db711dd8fe87e7'},5f494c6e04db711dd8fe87e7,511111416173,Candy & Sweets,CANDY_AND_SWEETS,"{'$ref': 'Cogs', '$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}}",5332fa12e4b03c9a25efd1e7,test brand @1598639215217,,TEST BRANDCODE @1598639215217
1165,{'$oid': '5a021611e4b00efe02b02a57'},5a021611e4b00efe02b02a57,511111400608,Grocery,,"{'$ref': 'Cogs', '$id': {'$oid': '5332f5f6e4b03c9a25efd0b4'}}",5332f5f6e4b03c9a25efd0b4,LIPTON TEA Leaves,0.0,LIPTON TEA Leaves


In [19]:
dataframe = df_brands
column_to_clean ='cpg'
dirty_series = dataframe[column_to_clean]
dirty_series
insert_at = dataframe.columns.get_loc(column_to_clean) + 1
cleaned_list = []
dict_key = "['$id']['$oid']"
dict_key_value = "value" + dict_key


for index, value in dirty_series.items():
    cleaned_list.append(eval(dict_key_value))

cleaned_list



['601ac114be37ce2ead437550',
 '5332f5fbe4b03c9a25efd0ba',
 '601ac142be37ce2ead437559',
 '601ac142be37ce2ead437559',
 '5332fa12e4b03c9a25efd1e7',
 '601ac142be37ce2ead437559',
 '601ac142be37ce2ead437559',
 '559c2234e4b06aca36af13c6',
 '5a734034e4b0d58f376be874',
 '59ba6f1ce4b092b29c167346',
 '5f4bf556be37ce0b44915549',
 '5332f5f2e4b03c9a25efd0aa',
 '559c2234e4b06aca36af13c6',
 '5d5d4fd16d5f3b23d1bc7905',
 '5332f5fbe4b03c9a25efd0ba',
 '5332f709e4b03c9a25efd0f1',
 '5d9b4f591dda2c6225a284aa',
 '5f358338be37ce443bf9d557',
 '5fb28549be37ce522e165cb4',
 '5332f5f6e4b03c9a25efd0b4',
 '55b62995e4b0d8e685c14213',
 '5d9b4f591dda2c6225a284aa',
 '559c2234e4b06aca36af13c6',
 '53e10d6368abd3c7065097cc',
 '5332f5ebe4b03c9a25efd0a8',
 '5e9f12f5be37ce3e45b6a77e',
 '5332f5f6e4b03c9a25efd0b4',
 '5d5d4fd16d5f3b23d1bc7905',
 '5f493e72be37ce64d0ae36c2',
 '5f4936dcbe37ce52f8314fd8',
 '559c2234e4b06aca36af13c6',
 '5fd2a0aebe37ce49eb72c0ed',
 '53e10d6368abd3c7065097cc',
 '5f494c5d04db711dd8fe87e2',
 '5332f5f3e4b0

In [20]:
dataframe

Unnamed: 0,_id,_id_clean,barcode,category,categoryCode,cpg,cpg_clean,name,topBrand,brandCode
0,{'$oid': '601ac115be37ce2ead437551'},601ac115be37ce2ead437551,511111019862,Baking,BAKING,"{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}",601ac114be37ce2ead437550,test brand @1612366101024,0.0,
1,{'$oid': '601c5460be37ce2ead43755f'},601c5460be37ce2ead43755f,511111519928,Beverages,BEVERAGES,"{'$id': {'$oid': '5332f5fbe4b03c9a25efd0ba'}, '$ref': 'Cogs'}",5332f5fbe4b03c9a25efd0ba,Starbucks,0.0,STARBUCKS
2,{'$oid': '601ac142be37ce2ead43755d'},601ac142be37ce2ead43755d,511111819905,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",601ac142be37ce2ead437559,test brand @1612366146176,0.0,TEST BRANDCODE @1612366146176
3,{'$oid': '601ac142be37ce2ead43755a'},601ac142be37ce2ead43755a,511111519874,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",601ac142be37ce2ead437559,test brand @1612366146051,0.0,TEST BRANDCODE @1612366146051
4,{'$oid': '601ac142be37ce2ead43755e'},601ac142be37ce2ead43755e,511111319917,Candy & Sweets,CANDY_AND_SWEETS,"{'$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}, '$ref': 'Cogs'}",5332fa12e4b03c9a25efd1e7,test brand @1612366146827,0.0,TEST BRANDCODE @1612366146827
...,...,...,...,...,...,...,...,...,...,...
1162,{'$oid': '5f77274dbe37ce6b592e90c0'},5f77274dbe37ce6b592e90c0,511111116752,Baking,BAKING,"{'$ref': 'Cogs', '$id': {'$oid': '5f77274dbe37ce6b592e90bf'}}",5f77274dbe37ce6b592e90bf,test brand @1601644365844,,
1163,{'$oid': '5dc1fca91dda2c0ad7da64ae'},5dc1fca91dda2c0ad7da64ae,511111706328,Breakfast & Cereal,,"{'$ref': 'Cogs', '$id': {'$oid': '53e10d6368abd3c7065097cc'}}",53e10d6368abd3c7065097cc,Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
1164,{'$oid': '5f494c6e04db711dd8fe87e7'},5f494c6e04db711dd8fe87e7,511111416173,Candy & Sweets,CANDY_AND_SWEETS,"{'$ref': 'Cogs', '$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}}",5332fa12e4b03c9a25efd1e7,test brand @1598639215217,,TEST BRANDCODE @1598639215217
1165,{'$oid': '5a021611e4b00efe02b02a57'},5a021611e4b00efe02b02a57,511111400608,Grocery,,"{'$ref': 'Cogs', '$id': {'$oid': '5332f5f6e4b03c9a25efd0b4'}}",5332f5f6e4b03c9a25efd0b4,LIPTON TEA Leaves,0.0,LIPTON TEA Leaves


In [21]:
# reset df_brands to the inital load of raw data - if excuted above code, 
# uncomment the following line to avoid any exceptions with value_from_dict() 
df_brands = pd.read_json(in_brands,lines=True,compression='gzip')

**End** You can continue executing the following cells

### Writting a function to extract values from a dataframe's column that contains a dictionary. Then adds those values back to the dataframe as a new column.

In [22]:
# I realized I'd be doing this multiple times, better to make a function
# define a function that cleans a df column that contains a dictionary by returning the specified 
# values and adding them as new column in the dataframe
def value_from_dict(dataframe, column_to_clean, dict_key, allow_nulls = False):
    """Returns dataframe with a 'cleaned' column inserted after the column that was cleaned.
    
    :param dataframe: A dataframe with a column containing dictionaries, from which one value is
        to be extracted
    :type dataframe: Pandas DataFrame
    :param column_to_clean: The name of the column containing dictionaries
    :type column_to_clean: str
    :param dict_key: A str containing the key associated with the value we want to extract,
        e.g, "['$id']['$oid']"
    :type dict_key: str
    :param allow_nulls: A boolean value indicating if None/Null/NaN/NaT values should be allowed,
        defaults to False
    :type allow_nulls: bool
    
    :raises AssertionError: 'there is at least one None/Null/NaN/NaT value in the cleaned data' if param allow_nulls = False
    :raises AssertionError: 'the length of the original column and the cleaned column are not the same'
    :excepts VlueError: If a column has already been cleaned, print message confirming
        no values were added to the dataframe
    
    :rtype: Pandas DataFrame
    :return: the original DataFrame with an additional column containing 'cleaned' values
    """
    # setting variables:
    # extract the column we want to clean from the dataframe as a series
    dirty_series = dataframe[column_to_clean]
    # create a list to store the cleaned values in
    cleaned_list = []
    # name of the column we'll be adding to the DataFrame
    cleaned_column_name = column_to_clean + '_cleaned'
    # location to insert the cleaned column, after the 'dirty' column
    insert_at = dataframe.columns.get_loc(column_to_clean) + 1
    # translate dict_key str into a format useable in the following for loop
    value_dict_key = "value" + dict_key
    
    # iterate through dirty_series appending extracted values to cleaned_list
    for index, value in dirty_series.items():
        # if there is no dictionary or any other issue, append None
        try:
            cleaned_list.append(eval(value_dict_key))
        except:
            cleaned_list.append(None)

    # handle allow_nulls param flag
    if not allow_nulls:
        # confirm no nulls in cleaned_list
        assert None not in cleaned_list, "there is at least one None value in the cleaned data"
    
    # confirm cleaned_list is the same length as dirty_series
    assert len(cleaned_list) == len(dirty_series), "the length of the original column and the cleaned column are not the same"
    
    # add the cleaned_list data to the originl dataframe following column_to_clean
    try:
        dataframe.insert(insert_at, cleaned_column_name, cleaned_list)
    except ValueError as error:
        print(f"{str(error)}, {cleaned_column_name} was not added to the dataframe")

    # return the modified dataframe
    return dataframe

### Using  value_from_dict() to exatract values from all the columns containing dictionaries and add them to the dataframe as a new column:

#### df_brands._id :

In [23]:
# look at the first value in df_brands._id
df_brands['_id'][0]

{'$oid': '601ac115be37ce2ead437551'}

In [24]:
#extract the value
df_brands['_id'][0]['$oid']

'601ac115be37ce2ead437551'

In [25]:
# access the value, set a varbale to use for the dict_key param of value_from_dict() function
brand_id_dict_key = "['$oid']"

In [26]:
# clean df_brands._id and confirm by viewing a sample of the dataframe
value_from_dict(df_brands, "_id", brand_id_dict_key)
df_brands.sample(2)

Unnamed: 0,_id,_id_cleaned,barcode,category,categoryCode,cpg,name,topBrand,brandCode
1120,{'$oid': '5332f769e4b03c9a25efd121'},5332f769e4b03c9a25efd121,511111803676,,,"{'$ref': 'Cpgs', '$id': {'$oid': '5332f5ebe4b03c9a25efd0a8'}}",Glaceau smartwater,,
35,{'$oid': '59dfaad1e4b0a56a2fa69abc'},59dfaad1e4b0a56a2fa69abc,511111100621,Baking,,"{'$ref': 'Cogs', '$id': {'$oid': '559c2234e4b06aca36af13c6'}}",GODIVA Instant Pudding Mix,0.0,GODIVA DRY PACKAGED DESSERTS


#### df_brands.cpg:

In [27]:
# look at the first value in df_brands.cpg
df_brands['cpg'][0]

{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}

In [28]:
#extract the value
df_brands['cpg'][0]['$id']['$oid']

'601ac114be37ce2ead437550'

In [29]:
# set a variable to use for the dict_key param of value_from_dict() function
brand_cpg_dict_key = "['$id']['$oid']"

In [30]:
# clean df_brands.cpg and confirm by viewing a sample of the dataframe
value_from_dict(df_brands, "cpg", brand_cpg_dict_key)
df_brands.sample(2)

Unnamed: 0,_id,_id_cleaned,barcode,category,categoryCode,cpg,cpg_cleaned,name,topBrand,brandCode
715,{'$oid': '5a5d1e8fe4b06ba572cf249e'},5a5d1e8fe4b06ba572cf249e,511111400325,Grocery,,"{'$ref': 'Cogs', '$id': {'$oid': '559c2234e4b06aca36af13c6'}}",559c2234e4b06aca36af13c6,OVEN FRY,0.0,OVEN FRY
99,{'$oid': '5332f7b8e4b03c9a25efd145'},5332f7b8e4b03c9a25efd145,511111903413,,,"{'$ref': 'Cogs', '$id': {'$oid': '5332f7a7e4b03c9a25efd134'}}",5332f7a7e4b03c9a25efd134,Stella Artois,,


#### df_receipts._id:

In [31]:
# look at the first value in df_receipts._id
df_receipts['_id'][0]

{'$oid': '5ff1e1eb0a720f0523000575'}

In [32]:
#extract the value
df_receipts['_id'][0]['$oid']

'5ff1e1eb0a720f0523000575'

In [33]:
# set a variable to use for the dict_key param of value_from_dict() function
receipts_id_dict_key = "['$oid']"

In [34]:
# clean df_receipts._id and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "_id", receipts_id_dict_key)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,dateScanned,finishedDate,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
521,{'$oid': '6012e65b0a720f05f800003d'},6012e65b0a720f05f800003d,,,{'$date': 1611851355000},{'$date': 1611851355000},{'$date': 1611851356000},{'$date': 1611851356000},{'$date': 1611851356000},840.0,{'$date': 1611792000000},4.0,"[{'barcode': '036000320893', 'description': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'discountedItemPrice': '21.00', 'finalPrice': '21.00', 'itemNumber': '4023', 'itemPrice': '21.00', 'originalReceiptItemText': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'partnerItemId': '0', 'pointsEarned': '210.0', 'pointsPayerId': '550b2565e4b001d5e9e4146f', 'quantityPurchased': 1, 'rewardsGroup': 'HUGGIES ONE AND DONE SIMPLY CLEAN BABY WIPES 200 COUNT OR LARGER', 'rewardsProductPartnerId': '550b2565e4b001d5e9e4146f'}, {'barcode': '036000320893', 'description': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'discountedItemPrice': '21.00', 'finalPrice': '21.00', 'itemNumber': '4023', 'itemPrice': '21.00', 'originalReceiptItemText': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'partnerItemId': '1', 'pointsEarned': '210.0', 'pointsPayerId': '550b2565e4b001d5e9e4146f', 'quantityPurchased': 1, 'rewardsGroup': 'HUGGIES ONE AND DONE SIMPLY CLEAN BABY WIPES 200 COUNT OR LARGER', 'rewardsProductPartnerId': '550b2565e4b001d5e9e4146f'}, {'barcode': '036000320893', 'description': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'discountedItemPrice': '21.00', 'finalPrice': '21.00', 'itemNumber': '4023', 'itemPrice': '21.00', 'originalReceiptItemText': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'partnerItemId': '2', 'pointsEarned': '210.0', 'pointsPayerId': '550b2565e4b001d5e9e4146f', 'quantityPurchased': 1, 'rewardsGroup': 'HUGGIES ONE AND DONE SIMPLY CLEAN BABY WIPES 200 COUNT OR LARGER', 'rewardsProductPartnerId': '550b2565e4b001d5e9e4146f'}, {'barcode': '036000320893', 'description': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'discountedItemPrice': '21.00', 'finalPrice': '21.00', 'itemNumber': '4023', 'itemPrice': '21.00', 'originalReceiptItemText': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'partnerItemId': '3', 'pointsEarned': '210.0', 'pointsPayerId': '550b2565e4b001d5e9e4146f', 'quantityPurchased': 1, 'rewardsGroup': 'HUGGIES ONE AND DONE SIMPLY CLEAN BABY WIPES 200 COUNT OR LARGER', 'rewardsProductPartnerId': '550b2565e4b001d5e9e4146f'}]",FINISHED,84.0,5fa41775898c7a11a6bcef3e
975,{'$oid': '6026fc2f0a720f05a8000311'},6026fc2f0a720f05a8000311,,,{'$date': 1613167663589},{'$date': 1613167663589},,{'$date': 1613167663589},,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33


#### df_receipts.createDate:

In [35]:
# look at the first value in XXXX.YYY
df_receipts['createDate'][0]

{'$date': 1609687531000}

In [36]:
# extract the value
df_receipts['createDate'][0]['$date']

1609687531000

In [37]:
# set a variable to use for the dict_key param of value_from_dict() function
receipts_createDate_dict_key = "['$date']"

In [38]:
# clean df_receipts.createDate and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "createDate", receipts_createDate_dict_key)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,finishedDate,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
722,{'$oid': '601af2280a7214ad28000209'},601af2280a7214ad28000209,,,{'$date': 1612378664384},1612378664384,{'$date': 1612378664384},,{'$date': 1612378664384},,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
157,{'$oid': '5ff8da540a7214adca000017'},5ff8da540a7214adca000017,5.0,All-receipts receipt bonus,{'$date': 1610144340000},1610144340000,{'$date': 1610144340000},{'$date': 1610144345000},{'$date': 1610144345000},{'$date': 1610144340000},5.0,{'$date': 1610057940000},2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '1234', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '1234', 'userFlaggedDescription': '', 'userFlaggedNewItem': True}]",FINISHED,1.0,5ff8da28b3348b11c9337ac6


#### df_receipts.dateScanned:
same format as df_receipts.createDate

In [39]:
# set a varbale to use for the dict_key param of value_from_dict() function
receipts_dateScanned_dict_key = "['$date']"

In [40]:
# clean df_receipts.dateScanned and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "dateScanned", receipts_dateScanned_dict_key)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,dateScanned_cleaned,finishedDate,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
948,{'$oid': '6025386a0a7214d8e9000241'},6025386a0a7214d8e9000241,500.0,"Receipt number 2 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1613052010000},1613052010000,{'$date': 1613052010000},1613052010000,{'$date': 1613052018000},{'$date': 1613052018000},{'$date': 1613052011000},500.0,{'$date': 1612965610000},2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '1234', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '1234', 'userFlaggedDescription': '', 'userFlaggedNewItem': True}]",FINISHED,1.0,60253861efa6017a44dc6b50
12,{'$oid': '5ff1e1b60a7214ada100055c'},5ff1e1b60a7214ada100055c,150.0,"Receipt number 5 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1609687478000},1609687478000,{'$date': 1609687478000},1609687478000,,{'$date': 1609687478000},,8850.0,{'$date': 1612365878000},10.0,"[{'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '1', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '2', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '3', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '4', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '5', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '6', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '7', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '8', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '9', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '10', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}]",FLAGGED,290.0,5ff1e194b6a9d73a3a9f1052


#### df_receipts.finishedDate:
same format as df_receipts.createDate - can include None/Null/NaN

In [41]:
# set a varbale to use for the dict_key param of value_from_dict() function
receipts_finishedDate_dict_key = "['$date']"

In [42]:
# clean df_receipts.finishedDate and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "finishedDate", receipts_finishedDate_dict_key)
df_receipts.sample(2)

AssertionError: there is at least one None value in the cleaned data

In [43]:
# clean df_receipts.finishedDate and confirm by viewing a sample of the dataframe
# setting allow_nulls = True
value_from_dict(df_receipts, "finishedDate", receipts_finishedDate_dict_key, allow_nulls=True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,dateScanned_cleaned,finishedDate,finishedDate_cleaned,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
1046,{'$oid': '603a90430a7217c72c000226'},603a90430a7217c72c000226,,,{'$date': 1614450755382},1614450755382,{'$date': 1614450755382},1614450755382,,,{'$date': 1614450755382},,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
568,{'$oid': '601442570a7214ad5000004b'},601442570a7214ad5000004b,5.0,All-receipts receipt bonus,{'$date': 1611940439000},1611940439000,{'$date': 1611940439000},1611940439000,{'$date': 1611940441000},1611940000000.0,{'$date': 1611940441000},{'$date': 1611940441000},5.0,{'$date': 1611854039000},1.0,"[{'barcode': '021000049325', 'description': 'KRAFT Catalina Dressing 24.00-fl oz', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}]",FINISHED,1.0,54943462e4b07e684157a532


#### df_receipts.modifyDate:
same format as df_receipts.createDate

In [44]:
# set a varbale to use for the dict_key param of value_from_dict() function
receipts_modifyDate_dict_key = "['$date']"

In [45]:
# clean df_receipts.modifyDate and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "modifyDate", receipts_modifyDate_dict_key)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,dateScanned_cleaned,finishedDate,finishedDate_cleaned,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
557,{'$oid': '601487f80a720f05f800018d'},601487f80a720f05f800018d,,,{'$date': 1611958264138},1611958264138,{'$date': 1611958264138},1611958264138,,,{'$date': 1611958264138},1611958264138,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
67,{'$oid': '5ff473f30a720f05230005c0'},5ff473f30a720f05230005c0,750.0,"Receipt number 1 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1609855987000},1609855987000,{'$date': 1609855987000},1609855987000,{'$date': 1609855988000},1609856000000.0,{'$date': 1609855988000},1609855988000,{'$date': 1609855988000},1800.0,{'$date': 1609855987000},1.0,"[{'barcode': '043000077467', 'description': 'O THAT'S GOOD Baked Potato Soup 16 OZ', 'finalPrice': '10', 'itemPrice': '10', 'partnerItemId': '1', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'O THAT'S GOOD - SOUPS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}]",FINISHED,10.0,5ff473f3c1e2d0121a9b2707


#### df_receipts.pointsAwardedDate:
same format as df_receipts.createDate - can include None/Null/NaN

In [46]:
# set a varbale to use for the dict_key param of value_from_dict() function
receipts_pointsAwardedDate_dict_key = "['$date']"

In [47]:
# clean df_receipts.purchaseDate and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "pointsAwardedDate", receipts_pointsAwardedDate_dict_key)
df_receipts.sample(2)

AssertionError: there is at least one None value in the cleaned data

In [48]:
# clean df_receipts.purchaseDate and confirm by viewing a sample of the dataframe allowing nulls
value_from_dict(df_receipts, "pointsAwardedDate", receipts_pointsAwardedDate_dict_key, allow_nulls=True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,dateScanned_cleaned,finishedDate,finishedDate_cleaned,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
254,{'$oid': '5fff4c8c0a7214ad4c000026'},5fff4c8c0a7214ad4c000026,5.0,All-receipts receipt bonus,{'$date': 1610566796000},1610566796000,{'$date': 1610566796000},1610566796000,{'$date': 1610566797000},1610567000000.0,{'$date': 1610566797000},1610566797000,{'$date': 1610566797000},1610567000000.0,5.0,{'$date': 1609567200000},5.0,"[{'barcode': '070085035006', 'competitiveProduct': True, 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '1', 'quantityPurchased': 1, 'rewardsGroup': 'TOTINO'S PIZZA ROLLS 15 COUNT - 40 COUNT', 'rewardsProductPartnerId': '5332f5f3e4b03c9a25efd0ae'}, {'barcode': '070085035006', 'competitiveProduct': True, 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '2', 'quantityPurchased': 1, 'rewardsGroup': 'TOTINO'S PIZZA ROLLS 15 COUNT - 40 COUNT', 'rewardsProductPartnerId': '5332f5f3e4b03c9a25efd0ae'}, {'barcode': '070085035006', 'competitiveProduct': True, 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '3', 'quantityPurchased': 1, 'rewardsGroup': 'TOTINO'S PIZZA ROLLS 15 COUNT - 40 COUNT', 'rewardsProductPartnerId': '5332f5f3e4b03c9a25efd0ae'}, {'barcode': '070085035006', 'competitiveProduct': True, 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '4', 'quantityPurchased': 1, 'rewardsGroup': 'TOTINO'S PIZZA ROLLS 15 COUNT - 40 COUNT', 'rewardsProductPartnerId': '5332f5f3e4b03c9a25efd0ae'}, {'barcode': '070085035006', 'competitiveProduct': True, 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '5', 'quantityPurchased': 1, 'rewardsGroup': 'TOTINO'S PIZZA ROLLS 15 COUNT - 40 COUNT', 'rewardsProductPartnerId': '5332f5f3e4b03c9a25efd0ae'}]",FINISHED,49.95,5fff4beedf9ace121f0c17ea
69,{'$oid': '5ff473a90a7214ada10005c2'},5ff473a90a7214ada10005c2,150.0,"Receipt number 5 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1609855913000},1609855913000,{'$date': 1609855913000},1609855913000,{'$date': 1609855914000},1609856000000.0,{'$date': 1609855914000},1609855914000,{'$date': 1609855914000},1609856000000.0,275.0,{'$date': 1609394400000},5.0,"[{'barcode': '043000008836', 'description': 'MAXWELL HOUSE International Cinnamon Dulce Cappuccino Cafe-Style Beverage Mix 9.1 oz. Tub', 'finalPrice': '5', 'itemPrice': '5', 'partnerItemId': '1', 'pointsEarned': '25.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'MAXWELL HOUSE INTERNATIONAL INSTANT COFFEE', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '043000008836', 'description': 'MAXWELL HOUSE International Cinnamon Dulce Cappuccino Cafe-Style Beverage Mix 9.1 oz. Tub', 'finalPrice': '5', 'itemPrice': '5', 'partnerItemId': '2', 'pointsEarned': '25.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'MAXWELL HOUSE INTERNATIONAL INSTANT COFFEE', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '043000008836', 'description': 'MAXWELL HOUSE International Cinnamon Dulce Cappuccino Cafe-Style Beverage Mix 9.1 oz. Tub', 'finalPrice': '5', 'itemPrice': '5', 'partnerItemId': '3', 'pointsEarned': '25.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'MAXWELL HOUSE INTERNATIONAL INSTANT COFFEE', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '043000008836', 'description': 'MAXWELL HOUSE International Cinnamon Dulce Cappuccino Cafe-Style Beverage Mix 9.1 oz. Tub', 'finalPrice': '5', 'itemPrice': '5', 'partnerItemId': '4', 'pointsEarned': '25.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'MAXWELL HOUSE INTERNATIONAL INSTANT COFFEE', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '043000008836', 'description': 'MAXWELL HOUSE International Cinnamon Dulce Cappuccino Cafe-Style Beverage Mix 9.1 oz. Tub', 'finalPrice': '5', 'itemPrice': '5', 'partnerItemId': '5', 'pointsEarned': '25.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'MAXWELL HOUSE INTERNATIONAL INSTANT COFFEE', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}]",FINISHED,25.0,5ff47392c3d63511e2a47881


#### df_receipts.purchaseDate:
same format as df_receipts.createDate - can include None/Null/NaN

In [49]:
# set a varbale to use for the dict_key param of value_from_dict() function
receipts_purchaseDate_dict_key = "['$date']"

In [50]:
# clean df_receipts.purchaseDate and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "purchaseDate", receipts_purchaseDate_dict_key)
df_receipts.sample(2)

AssertionError: there is at least one None value in the cleaned data

In [51]:
# clean df_receipts.purchaseDate and confirm by viewing a sample of the dataframe allowing nulls
value_from_dict(df_receipts, "purchaseDate", receipts_purchaseDate_dict_key, allow_nulls=True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,dateScanned_cleaned,finishedDate,finishedDate_cleaned,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
1040,{'$oid': '6039d4bc0a7217c72c000184'},6039d4bc0a7217c72c000184,25.0,COMPLETE_NONPARTNER_RECEIPT,{'$date': 1614402748000},1614402748000,{'$date': 1614402748000},1614402748000,,,{'$date': 1614402749000},1614402749000,,,25.0,{'$date': 1597622400000},1597622000000.0,2.0,"[{'barcode': 'B076FJ92M4', 'description': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'discountedItemPrice': '22.97', 'finalPrice': '22.97', 'itemPrice': '22.97', 'originalReceiptItemText': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'partnerItemId': '0', 'priceAfterCoupon': '22.97', 'quantityPurchased': 1}, {'barcode': 'B07BRRLSVC', 'description': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'discountedItemPrice': '11.99', 'finalPrice': '11.99', 'itemPrice': '11.99', 'originalReceiptItemText': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'partnerItemId': '1', 'priceAfterCoupon': '11.99', 'quantityPurchased': 1}]",REJECTED,34.96,5fc961c3b8cfca11a077dd33
773,{'$oid': '601d83930a7214ad59000932'},601d83930a7214ad59000932,,,{'$date': 1612546962685},1612546962685,{'$date': 1612546962685},1612546962685,,,{'$date': 1612546962685},1612546962685,,,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33


#### df_users._id:

In [52]:
# look at the first value in df_users._id
df_users['_id'][0]

{'$oid': '5ff1e194b6a9d73a3a9f1052'}

In [53]:
#extract the value
df_users['_id'][0]['$oid']

'5ff1e194b6a9d73a3a9f1052'

In [54]:
# set a varbale to use for the dict_key param of value_from_dict() function
users_id_dict_key = "['$oid']"

In [55]:
# clean df_users._id and confirm by viewing a sample of the dataframe
value_from_dict(df_users, "_id", users_id_dict_key)
df_users.sample(2)

Unnamed: 0,_id,_id_cleaned,active,createdDate,lastLogin,role,signUpSource,state
231,{'$oid': '60088e5d633aab121bb8e5cf'},60088e5d633aab121bb8e5cf,True,{'$date': 1611173469989},{'$date': 1611173470035},consumer,Email,WI
380,{'$oid': '601ac195af4b1a1205f7560f'},601ac195af4b1a1205f7560f,True,{'$date': 1612366229397},{'$date': 1612366451043},consumer,Email,WI


#### df_users.createdDate:  
same format as df_receipts.createDate

In [56]:
# set a variable to use for the dict_key param of value_from_dict() function
users_createdDate_dict_key = "['$date']"

In [57]:
# clean df_users.createDate and confirm by viewing a sample of the dataframe
value_from_dict(df_users, "createdDate", users_createdDate_dict_key)
df_users.sample(2)

Unnamed: 0,_id,_id_cleaned,active,createdDate,createdDate_cleaned,lastLogin,role,signUpSource,state
417,{'$oid': '60255883efa60114d20e5d4e'},60255883efa60114d20e5d4e,True,{'$date': 1613060227502},1613060227502,{'$date': 1613060309219},consumer,Email,WI
479,{'$oid': '54943462e4b07e684157a532'},54943462e4b07e684157a532,True,{'$date': 1418998882381},1418998882381,{'$date': 1614963143204},fetch-staff,,


#### df_users.lastLogin:  
same format as df_receipts.createDat - can include None/Null/NaN

In [58]:
# set a variable to use for the dict_key param of value_from_dict() function
users_lastLogin_dict_key = "['$date']"

In [59]:
# clean df_users.createDate and confirm by viewing a sample of the dataframe
value_from_dict(df_users, "lastLogin", users_lastLogin_dict_key)
df_users.sample(2)

AssertionError: there is at least one None value in the cleaned data

In [60]:
# clean df_users.createDate and confirm by viewing a sample of the dataframe allowing nulls
value_from_dict(df_users, "lastLogin", users_lastLogin_dict_key, allow_nulls=True)
df_users.sample(2)

Unnamed: 0,_id,_id_cleaned,active,createdDate,createdDate_cleaned,lastLogin,lastLogin_cleaned,role,signUpSource,state
158,{'$oid': '5fff26f2b3348b03eb45bbb9'},5fff26f2b3348b03eb45bbb9,True,{'$date': 1610557170518},1610557170518,{'$date': 1610557170567},1610557000000.0,consumer,Email,WI
130,{'$oid': '5ffc9d87b3348b11c9338920'},5ffc9d87b3348b11c9338920,True,{'$date': 1610390919076},1610390919076,{'$date': 1610391013461},1610391000000.0,consumer,Email,WI


### Writing a function to convert date data from epoch to timestamps

In [61]:
def epoch_to_timestamp(dataframe, column_to_convert, allow_nulls = False):
    """Returns dataframe with a new column containing timestamps converted from epoch.
    
    :param dataframe: A dataframe with a column containing epoch seconds as ints or floats
    :type dataframe: Pandas DataFrame
    :param column_to_convert: The name of the column containing epoch seconds
    :type column_to_convert: str
    :param allow_nulls: A boolean value indicating if None(Null) values should be allowed,
        defaults to False
    :type allow_nulls: bool
    
    :raises AssertionError: 'there is at least one None/Null/NaN/NaT value in the converted timestamp data' 
        if param allow_nulls = False
    :raises AssertionError: 'the length of the original column and the converted column are not the same'
    :excepts VlueError: If a column has already been converted, print message confirming
        no values were added to the dataframe
    
    :rtype: Pandas DataFrame
    :return: the original DataFrame with an additional column containing converted epoch values as timestamps
    """
    #setting variables
    # name of the new column we'll be adding to the dataframe
    converted_column_name = column_to_convert + "_ts"
    # location to insert the converted column, after the column_to_clean
    insert_at = dataframe.columns.get_loc(column_to_convert) + 1
    # create a series of timestamps from the epoch time column_to_convert
    # pd.to_datetime() converts a scalar, array-like, Series or DataFrame/dict-like to a pandas datetime object
    # the data in the epoch columns is miliseconds from epoch start and we round to 1ms for consistency 
    time_stamps = pd.to_datetime(dataframe[column_to_convert], unit='ms').round('1ms')
    
    # handle allow_nulls param flag
    if not allow_nulls:
        # confirm no nulls in time_stamps
#         assert None not in time_stamps, "there is at least one None/Null/NaN/NaT value in the converted timestamp data"
        assert not time_stamps.isnull().values.any(), "there is at least one None/Null/NaN/NaT value in the converted timestamp data" 
            # df_receipts['finishedDate_cleaned'].isnull().values.any()
    
    # confirm time_stamps is the same length as column_to_convert
    assert len(time_stamps) == len(dataframe[column_to_convert]), "the length of the original column and the converted column are not the same"
    
    # add the timestamps data to the originl dataframe following column_to_convert
    try:
        dataframe.insert(insert_at, converted_column_name, time_stamps)
    except ValueError as error:
        print(f"{str(error)}, {converted_column_name} was not added to the dataframe")
    
    # return the modified dataframe
    return dataframe
    

### Using epoch_to_timestamp() to convert columns with epoch values to timestamps and add them to the dataframe as a new column:

In [62]:
# convert df_users.createdDate_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_users, 'createdDate_cleaned')
df_users.sample(2)

Unnamed: 0,_id,_id_cleaned,active,createdDate,createdDate_cleaned,createdDate_cleaned_ts,lastLogin,lastLogin_cleaned,role,signUpSource,state
99,{'$oid': '5ff79464b3348b11c933738b'},5ff79464b3348b11c933738b,True,{'$date': 1610060900153},1610060900153,2021-01-07 23:08:20.153,{'$date': 1610060900301},1610061000000.0,consumer,Email,WI
289,{'$oid': '600fb1ac73c60b12049027bb'},600fb1ac73c60b12049027bb,True,{'$date': 1611641260879},1611641260879,2021-01-26 06:07:40.879,{'$date': 1611641483950},1611641000000.0,consumer,Email,WI


In [63]:
# convert df_users.lastLogin_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_users, 'lastLogin_cleaned')
df_users.sample(2)

AssertionError: there is at least one None/Null/NaN/NaT value in the converted timestamp data

In [64]:
# convert df_users.lastLogin_cleaned to timestamps and confirm by viewing a sample of the dataframe, allowing for Nulls
epoch_to_timestamp(df_users, 'lastLogin_cleaned', allow_nulls=True)
df_users.sample(2)

Unnamed: 0,_id,_id_cleaned,active,createdDate,createdDate_cleaned,createdDate_cleaned_ts,lastLogin,lastLogin_cleaned,lastLogin_cleaned_ts,role,signUpSource,state
214,{'$oid': '600741d06e6469120a787853'},600741d06e6469120a787853,True,{'$date': 1611088337000},1611088337000,2021-01-19 20:32:17.000,{'$date': 1611088743299},1611089000000.0,2021-01-19 20:39:03.299,consumer,Email,WI
27,{'$oid': '5ff36be7135e7011bcb856d3'},5ff36be7135e7011bcb856d3,True,{'$date': 1609788391239},1609788391239,2021-01-04 19:26:31.239,{'$date': 1609788592726},1609789000000.0,2021-01-04 19:29:52.726,consumer,Email,WI


In [65]:
# convert df_receipts.createDate_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'createDate_cleaned')
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,finishedDate,finishedDate_cleaned,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
129,{'$oid': '5ff794bf0a7214ada1000650'},5ff794bf0a7214ada1000650,25.0,COMPLETE_NONPARTNER_RECEIPT,{'$date': 1609974591000},1609974591000,2021-01-06 23:09:51,{'$date': 1609974591000},1609974591000,,,{'$date': 1610039393000},1610039393000,,,25.0,{'$date': 1609888191000},1609888000000.0,1.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}]",FLAGGED,1.0,5ff7930fb3348b11c93372a6
146,{'$oid': '5ff794130a7214ada1000640'},5ff794130a7214ada1000640,5.0,All-receipts receipt bonus,{'$date': 1609888019000},1609888019000,2021-01-05 23:06:59,{'$date': 1609888019000},1609888019000,{'$date': 1610060828000},1610061000000.0,{'$date': 1610060828000},1610060828000,{'$date': 1609996022000},1609996000000.0,2005.0,{'$date': 1609801618000},1609802000000.0,2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '1234', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '1234', 'userFlaggedDescription': '', 'userFlaggedNewItem': True}]",FINISHED,1.0,5ff7930fb3348b11c93372a6


In [66]:
# convert df_receipts.dateScanned_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'dateScanned_cleaned')
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
18,{'$oid': '5ff1e1eb0a720f0523000576'},5ff1e1eb0a720f0523000576,300.0,"Receipt number 4 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1609687531000},1609687531000,2021-01-03 15:25:31,{'$date': 1609687531000},1609687531000,2021-01-03 15:25:31,{'$date': 1609687532000},1609688000000.0,{'$date': 1609687536000},1609687536000,{'$date': 1609687532000},1609688000000.0,300.0,{'$date': 1609632000000},1609632000000.0,2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '26.00', 'itemPrice': '26.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 2, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '26.00', 'userFlaggedQuantity': 2}]",FINISHED,26.0,5ff1e1eacfcf6c399c274ae6
404,{'$oid': '600988980a7214ad89000130'},600988980a7214ad89000130,5.0,All-receipts receipt bonus,{'$date': 1611237528000},1611237528000,2021-01-21 13:58:48,{'$date': 1611237528000},1611237528000,2021-01-21 13:58:48,{'$date': 1611237528000},1611238000000.0,{'$date': 1611237528000},1611237528000,{'$date': 1611237528000},1611238000000.0,5.0,{'$date': 1611151128000},1611151000000.0,1.0,"[{'barcode': '028400081382', 'competitiveProduct': True, 'finalPrice': '10', 'itemPrice': '10', 'partnerItemId': '1', 'quantityPurchased': 1, 'rewardsGroup': 'OLD EL PASO SAUCE', 'rewardsProductPartnerId': '5332f5f3e4b03c9a25efd0ae'}]",FINISHED,10.0,600987d77d983a11f63cfa92


In [67]:
# convert df_receipts.finishedDate_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'finishedDate_cleaned')
df_receipts.sample(2)

AssertionError: there is at least one None/Null/NaN/NaT value in the converted timestamp data

In [68]:
# convert df_receipts.finishedDate_cleaned to timestamps and confirm by viewing a sample of the dataframe, allowing nulls
epoch_to_timestamp(df_receipts, 'finishedDate_cleaned', allow_nulls = True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
705,{'$oid': '601b0bf00a720f05f400021b'},601b0bf00a720f05f400021b,,,{'$date': 1612385264167},1612385264167,2021-02-03 20:47:44.167,{'$date': 1612385264167},1612385264167,2021-02-03 20:47:44.167,,,NaT,{'$date': 1612385264167},1612385264167,,,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
404,{'$oid': '600988980a7214ad89000130'},600988980a7214ad89000130,5.0,All-receipts receipt bonus,{'$date': 1611237528000},1611237528000,2021-01-21 13:58:48.000,{'$date': 1611237528000},1611237528000,2021-01-21 13:58:48.000,{'$date': 1611237528000},1611238000000.0,2021-01-21 13:58:48,{'$date': 1611237528000},1611237528000,{'$date': 1611237528000},1611238000000.0,5.0,{'$date': 1611151128000},1611151000000.0,1.0,"[{'barcode': '028400081382', 'competitiveProduct': True, 'finalPrice': '10', 'itemPrice': '10', 'partnerItemId': '1', 'quantityPurchased': 1, 'rewardsGroup': 'OLD EL PASO SAUCE', 'rewardsProductPartnerId': '5332f5f3e4b03c9a25efd0ae'}]",FINISHED,10.0,600987d77d983a11f63cfa92


In [69]:
# convert df_receipts.modifyDate_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'modifyDate_cleaned')
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,modifyDate_cleaned_ts,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
634,{'$oid': '601711900a720f05f800029f'},601711900a720f05f800029f,,,{'$date': 1612124560099},1612124560099,2021-01-31 20:22:40.099,{'$date': 1612124560099},1612124560099,2021-01-31 20:22:40.099,,,NaT,{'$date': 1612124560099},1612124560099,2021-01-31 20:22:40.099,,,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
367,{'$oid': '600841da0a7214ad89000051'},600841da0a7214ad89000051,150.0,"Receipt number 5 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1611153882000},1611153882000,2021-01-20 14:44:42.000,{'$date': 1611153882000},1611153882000,2021-01-20 14:44:42.000,{'$date': 1611153883000},1611154000000.0,2021-01-20 14:44:43,{'$date': 1611153883000},1611153883000,2021-01-20 14:44:43.000,{'$date': 1611153883000},1611154000000.0,173.6,{'$date': 1611067482000},1611067000000.0,4.0,"[{'barcode': '044700072813', 'description': 'OSCAR MAYER Deli Fresh Maple Honey 97% Fat Free Ham, 8 oz', 'finalPrice': '4.66', 'itemPrice': '4.66', 'partnerItemId': '1', 'pointsEarned': '23.6', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 4, 'rewardsGroup': 'OSCAR MAYER LUNCH MEAT - DELI FRESH', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}]",FINISHED,4.66,6008412f6e64697abedcd5d5


In [70]:
# convert df_receipts.pointsAwardedDate_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'pointsAwardedDate_cleaned')
df_receipts.sample(2)

AssertionError: there is at least one None/Null/NaN/NaT value in the converted timestamp data

In [71]:
# convert df_receipts.pointsAwardedDate_cleaned to timestamps and confirm by viewing a sample of the dataframe allowing nulls
epoch_to_timestamp(df_receipts, 'pointsAwardedDate_cleaned', allow_nulls=True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,modifyDate_cleaned_ts,pointsAwardedDate,pointsAwardedDate_cleaned,pointsAwardedDate_cleaned_ts,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
676,{'$oid': '60189cc00a720f05f4000063'},60189cc00a720f05f4000063,25.0,COMPLETE_NONPARTNER_RECEIPT,{'$date': 1612225728000},1612225728000,2021-02-02 00:28:48.000,{'$date': 1612225728000},1612225728000,2021-02-02 00:28:48.000,{'$date': 1612226629000},1612227000000.0,2021-02-02 00:43:49,{'$date': 1612226629000},1612226629000,2021-02-02 00:43:49.000,{'$date': 1612225729000},1612226000000.0,2021-02-02 00:28:49,25.0,{'$date': 1612137600000},1612138000000.0,5.0,"[{'barcode': '013562300631', 'description': 'Annie's Homegrown Organic White Cheddar Macaroni & Cheese Shells, 6 Oz', 'discountedItemPrice': '50.00', 'finalPrice': '50.00', 'itemNumber': '013562300631', 'itemPrice': '50.00', 'needsFetchReview': True, 'needsFetchReviewReason': 'POINTS_GREATER_THAN_THRESHOLD', 'originalMetaBriteQuantityPurchased': 1, 'partnerItemId': '1', 'pointsNotAwardedReason': 'Action not allowed for user and CPG', 'pointsPayerId': '5332f5f3e4b03c9a25efd0ae', 'quantityPurchased': 5, 'rewardsGroup': 'ANNIE'S HOMEGROWN MULTI-SERVING MAC & CHEESE', 'rewardsProductPartnerId': '5332f5f3e4b03c9a25efd0ae'}]",FINISHED,50.0,60189c74c8b50e11d8454eff
596,{'$oid': '6015268a0a7214ad50000155'},6015268a0a7214ad50000155,,,{'$date': 1611998858821},1611998858821,2021-01-30 09:27:38.821,{'$date': 1611998858821},1611998858821,2021-01-30 09:27:38.821,,,NaT,{'$date': 1611998858821},1611998858821,2021-01-30 09:27:38.821,,,NaT,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33


In [72]:
# convert df_receipts.purchaseDate_cleaned to timestamps and confirm by viewing a sample of the dataframe 
epoch_to_timestamp(df_receipts, 'purchaseDate_cleaned')
df_receipts.sample(2)

AssertionError: there is at least one None/Null/NaN/NaT value in the converted timestamp data

In [73]:
# convert df_receipts.purchaseDate_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'purchaseDate_cleaned', allow_nulls = True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,modifyDate_cleaned_ts,pointsAwardedDate,pointsAwardedDate_cleaned,pointsAwardedDate_cleaned_ts,pointsEarned,purchaseDate,purchaseDate_cleaned,purchaseDate_cleaned_ts,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
972,{'$oid': '602686ff0a7214d8e9000301'},602686ff0a7214d8e9000301,,,{'$date': 1613137663139},1613137663139,2021-02-12 13:47:43.139,{'$date': 1613137663139},1613137663139,2021-02-12 13:47:43.139,,,NaT,{'$date': 1613137663139},1613137663139,2021-02-12 13:47:43.139,,,NaT,,,,NaT,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
617,{'$oid': '601602db0a7214ad500001b3'},601602db0a7214ad500001b3,,,{'$date': 1612055259603},1612055259603,2021-01-31 01:07:39.603,{'$date': 1612055259603},1612055259603,2021-01-31 01:07:39.603,,,NaT,{'$date': 1612055259603},1612055259603,2021-01-31 01:07:39.603,,,NaT,,,,NaT,,,SUBMITTED,,5fc961c3b8cfca11a077dd33


In [74]:
# convert df_receipts.purchaseDate_cleaned to timestamps and confirm by viewing a sample of the dataframe, allowing nulls
epoch_to_timestamp(df_receipts, 'purchaseDate_cleaned', allow_nulls = True)
df_receipts.sample(2)

cannot insert purchaseDate_cleaned_ts, already exists, purchaseDate_cleaned_ts was not added to the dataframe


Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,modifyDate_cleaned_ts,pointsAwardedDate,pointsAwardedDate_cleaned,pointsAwardedDate_cleaned_ts,pointsEarned,purchaseDate,purchaseDate_cleaned,purchaseDate_cleaned_ts,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
437,{'$oid': '600f48940a720f0535000043'},600f48940a720f0535000043,5.0,All-receipts receipt bonus,{'$date': 1611614356000},1611614356000,2021-01-25 22:39:16.000,{'$date': 1611614356000},1611614356000,2021-01-25 22:39:16.000,{'$date': 1611614357000},1611614000000.0,2021-01-25 22:39:17,{'$date': 1611614357000},1611614357000,2021-01-25 22:39:17.000,{'$date': 1611614357000},1611614000000.0,2021-01-25 22:39:17,5.0,{'$date': 1610491156000},1610491000000.0,2021-01-12 22:39:16,1.0,"[{'barcode': '022174070214', 'description': 'CJN INJ & LSN GLD GRLN KIT BOX 1 CT', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}]",FINISHED,1.0,54943462e4b07e684157a532
828,{'$oid': '601fee520a720f053c000120'},601fee520a720f053c000120,,,{'$date': 1612705362952},1612705362952,2021-02-07 13:42:42.952,{'$date': 1612705362952},1612705362952,2021-02-07 13:42:42.952,,,NaT,{'$date': 1612705362952},1612705362952,2021-02-07 13:42:42.952,,,NaT,,,,NaT,,,SUBMITTED,,5fc961c3b8cfca11a077dd33


### visually check that all the cleaned and converted columns I expect are present

In [75]:
df_brands.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1167 entries, 0 to 1166
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   _id           1167 non-null   object 
 1   _id_cleaned   1167 non-null   object 
 2   barcode       1167 non-null   int64  
 3   category      1012 non-null   object 
 4   categoryCode  517 non-null    object 
 5   cpg           1167 non-null   object 
 6   cpg_cleaned   1167 non-null   object 
 7   name          1167 non-null   object 
 8   topBrand      555 non-null    float64
 9   brandCode     933 non-null    object 
dtypes: float64(1), int64(1), object(8)
memory usage: 91.3+ KB


In [76]:
df_receipts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1119 entries, 0 to 1118
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   _id                           1119 non-null   object        
 1   _id_cleaned                   1119 non-null   object        
 2   bonusPointsEarned             544 non-null    float64       
 3   bonusPointsEarnedReason       544 non-null    object        
 4   createDate                    1119 non-null   object        
 5   createDate_cleaned            1119 non-null   int64         
 6   createDate_cleaned_ts         1119 non-null   datetime64[ns]
 7   dateScanned                   1119 non-null   object        
 8   dateScanned_cleaned           1119 non-null   int64         
 9   dateScanned_cleaned_ts        1119 non-null   datetime64[ns]
 10  finishedDate                  568 non-null    object        
 11  finishedDate_cleaned          

In [77]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 495 entries, 0 to 494
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   _id                     495 non-null    object        
 1   _id_cleaned             495 non-null    object        
 2   active                  495 non-null    bool          
 3   createdDate             495 non-null    object        
 4   createdDate_cleaned     495 non-null    int64         
 5   createdDate_cleaned_ts  495 non-null    datetime64[ns]
 6   lastLogin               433 non-null    object        
 7   lastLogin_cleaned       433 non-null    float64       
 8   lastLogin_cleaned_ts    433 non-null    datetime64[ns]
 9   role                    495 non-null    object        
 10  signUpSource            447 non-null    object        
 11  state                   439 non-null    object        
dtypes: bool(1), datetime64[ns](2), float64(1), int64(1

### What might I need to answer the stakeholder questions?  
This collection of cells is representative of some of my brainstorming/planning process. I've attempted to 'think out loud' a bit here, but will more fully document what code is doing in the following section.

- What are the top 5 brands by receipts scanned for most recent month?
  - need to join to brands from receipts, only way there is via barcode: in rewardsReceiptItemList  
  
  
- How does the ranking of the top 5 brands by receipts scanned for the recent month compare to the ranking for the previous month?  
  - same as above, barcode 


- When considering average spend from receipts with 'rewardsReceiptStatus’ of ‘Accepted’ or ‘Rejected’, which is greater?  
  - this can be answered with df_receipts.totalSpent

- When considering total number of items purchased from receipts with 'rewardsReceiptStatus’ of ‘Accepted’ or ‘Rejected’, which is greater?  
  - df_receipts.purchasedItemCount


- Which brand has the most spend among users who were created within the past 6 months?  
  - barcode
  - df_users.createdDate_cleaned_ts

- Which brand has the most transactions among users who were created within the past 6 months?
  - barcode
  
  
Questions:
  - 1 receipt = 1 transaction?
  - There is no 'Accepted' value for rewardsReceiptStatus. Assume 'Finished' is 'Accepted' or anything but 'Rejected' or something else?
  - Re: receipts data - is this data a snapshot in time, if taken again might some statuses change, along the contents of rewardsReceiptItemList? If so, what are the final statuses - FINISHED and REJECTED?
    - looking at status by daterange might give some indication, there are a number of date fields - modifyDate could be representative of some sort of updated at reference 



**to-do:**
- explore what keys are included in a dictionary that includes barcode:, is it a consistent set?
  - it is not a consistent set, it looks like most 'FINISHED' receipts have the best quality of data. I'm curious what status implies 
- decide what else I should include in addition to barcode from rewardsReceiptItemList?
    - With the following I can get to brands via barcode / userFlaggedBarcode. I can sum quantity and prices and provide descriptions - potentially useful for the next level of drill down and easy to grab now.
  - 'barcode':
  - 'userFlaggedBarcode':
  - 'description':
  - 'userFlaggedDescription':
  - 'finalPrice':
  - 'userFlaggedPrice':
  - 'quantityPurchased':
  - 'userFlaggedQuantity':
- create a new data source that will act as a look up table, receipt_items. rows to include the original receipt id, and the above fields from  where available. If neither barcode or userFlaggedBarcode are available, don't include those receipt items


#### Update a few days later (9/23) after initially loading all the data to sql
- when attempting to answer 'What are the top 5 brands by receipts scanned for most recent month?' I realized my assumption that brands.barcode would join on coalesce(receipt_items.barcode, receipt_items.userFlaggedBarcode) was a bad one. A quick visual check would have confirmed next to no matches as it appears almost all of brands.barcode start with 511111 and practically none of the barcode values from receipt_items do.
- Will add the brandCode value from receipts.rewardsReceiptItemList to my derived receipt_items table. 
- For records that don't contain a brandCode in receipts.rewardsReceiptItemList I may be able to extract one from the description - will also add description and explore opportunities there. Could lead to an additional extracted brand code.
- I'm going to leave the code I used to troubleshoot this in notebook: 2.1-EDA_first_pass.ipynb and will keep the original barcode values present in the final tables so that the sql continues to work.

In [78]:
# from df_receipts extract _id_cleaned and rewardsReceiptItemList to series and look at a few samplesabs
df_receipt_items = df_receipts[['_id_cleaned','rewardsReceiptItemList']]
# df_receipt_items
df_receipt_items.sample(3)

Unnamed: 0,_id_cleaned,rewardsReceiptItemList
739,601bbbb50a720f05f400026a,
60,5ff4ce430a7214ada10005d8,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '25.00', 'itemPrice': '25.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 4, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '25.00', 'userFlaggedQuantity': 4}]"
750,601c11410a720f05f400028e,


In [79]:
df_receipts.groupby('rewardsReceiptStatus').count()

Unnamed: 0_level_0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,modifyDate_cleaned_ts,pointsAwardedDate,pointsAwardedDate_cleaned,pointsAwardedDate_cleaned_ts,pointsEarned,purchaseDate,purchaseDate_cleaned,purchaseDate_cleaned_ts,purchasedItemCount,rewardsReceiptItemList,totalSpent,userId
rewardsReceiptStatus,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1
FINISHED,518,518,456,456,518,518,518,518,518,518,518,518,518,518,518,518,514,514,514,518,518,518,518,518,516,518,518
FLAGGED,46,46,30,30,46,46,46,46,46,46,0,0,0,46,46,46,19,19,19,33,35,35,35,46,46,46,46
PENDING,50,50,0,0,50,50,50,50,50,50,50,50,50,50,50,50,0,0,0,0,49,49,49,0,49,49,50
REJECTED,71,71,58,58,71,71,71,71,71,71,0,0,0,71,71,71,4,4,4,58,69,69,69,71,68,71,71
SUBMITTED,434,434,0,0,434,434,434,434,434,434,0,0,0,434,434,434,0,0,0,0,0,0,0,0,0,0,434


In [80]:
df_receipts[['_id_cleaned','rewardsReceiptStatus','rewardsReceiptItemList']].sample(40)

Unnamed: 0,_id_cleaned,rewardsReceiptStatus,rewardsReceiptItemList
83,5ff473b20a720f05230005b7,FINISHED,"[{'barcode': '021000667543', 'description': 'Dressing KRAFT Free Catalina 90 Ounce', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '1', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '021000667543', 'description': 'Dressing KRAFT Free Catalina 90 Ounce', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '2', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '021000667543', 'description': 'Dressing KRAFT Free Catalina 90 Ounce', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '3', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '021000667543', 'description': 'Dressing KRAFT Free Catalina 90 Ounce', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '4', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '021000667543', 'description': 'Dressing KRAFT Free Catalina 90 Ounce', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '5', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}]"
331,600741fe0a7214ad89000001,FINISHED,"[{'barcode': '043000068649', 'description': 'GEVALIA 100% Arabica Signature Blend K Mild Roast Coffee - K -Cups Pods - 36ct', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '1', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'GEVALIA KAFFE KEURIG COFFEE PODS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '043000068649', 'description': 'GEVALIA 100% Arabica Signature Blend K Mild Roast Coffee - K -Cups Pods - 36ct', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '2', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'GEVALIA KAFFE KEURIG COFFEE PODS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '043000068649', 'description': 'GEVALIA 100% Arabica Signature Blend K Mild Roast Coffee - K -Cups Pods - 36ct', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '3', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'GEVALIA KAFFE KEURIG COFFEE PODS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '043000068649', 'description': 'GEVALIA 100% Arabica Signature Blend K Mild Roast Coffee - K -Cups Pods - 36ct', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '4', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'GEVALIA KAFFE KEURIG COFFEE PODS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '043000068649', 'description': 'GEVALIA 100% Arabica Signature Blend K Mild Roast Coffee - K -Cups Pods - 36ct', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '5', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'GEVALIA KAFFE KEURIG COFFEE PODS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}]"
918,6023a2530a720f05a8000144,SUBMITTED,
797,601f190c0a720f053c0000c4,SUBMITTED,
1097,603c5ba30a720fde1000038d,SUBMITTED,
387,60088d5f0a720f05fa0000fc,FINISHED,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '25.00', 'itemPrice': '25.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 1, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '25.00', 'userFlaggedQuantity': 1}]"
968,6025495f0a720f05a8000231,SUBMITTED,
468,60107e3b0a720f0535000051,REJECTED,"[{'description': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'discountedItemPrice': '22.97', 'finalPrice': '22.97', 'itemPrice': '22.97', 'originalReceiptItemText': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'partnerItemId': '0', 'priceAfterCoupon': '22.97', 'quantityPurchased': 1}, {'description': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'discountedItemPrice': '11.99', 'finalPrice': '11.99', 'itemPrice': '11.99', 'originalReceiptItemText': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'partnerItemId': '1', 'priceAfterCoupon': '11.99', 'quantityPurchased': 1}]"
185,5ffc9daa0a7214adca000054,FINISHED,"[{'barcode': '001111147332', 'brandCode': 'BRAND', 'description': 'CARESS FINE FRAGRANCE LOVE FOREVER WASH SOAP PLASTIC BOTTLE RP 3 OZ', 'finalPrice': '10.00', 'itemPrice': '10.00', 'partnerItemId': '0', 'pointsEarned': '100.0', 'pointsPayerId': '5332f5f6e4b03c9a25efd0b4', 'quantityPurchased': 1, 'rewardsGroup': 'CARESS BODY WASH AND SOAP', 'rewardsProductPartnerId': '5332f5f6e4b03c9a25efd0b4'}]"
398,6009f5200a720f0535000005,FINISHED,"[{'description': 'Tomato Sauce', 'discountedItemPrice': '2.34', 'finalPrice': '2.34', 'itemPrice': '2.34', 'originalReceiptItemText': 'CVVEESE SAUCE', 'partnerItemId': '1031', 'quantityPurchased': 1}, {'description': 'CHEESE SAUCE', 'discountedItemPrice': '2.34', 'finalPrice': '2.34', 'itemPrice': '2.34', 'originalReceiptItemText': 'CHEESE SAUCE', 'partnerItemId': '1032', 'quantityPurchased': 1}, {'description': 'CHEESESAUCE', 'discountedItemPrice': '2.34', 'finalPrice': '2.34', 'itemPrice': '2.34', 'originalReceiptItemText': 'CHEESESAUCE', 'partnerItemId': '1033', 'quantityPurchased': 1}, {'description': 'CHEESESAUCE', 'discountedItemPrice': '2.34', 'finalPrice': '2.34', 'itemPrice': '2.34', 'originalReceiptItemText': 'CHEESESAUCE', 'partnerItemId': '1034', 'quantityPurchased': 1}, {'description': '10X103D PL', 'discountedItemPrice': '12.62', 'finalPrice': '12.62', 'itemPrice': '12.62', 'originalReceiptItemText': '10X103D PL', 'partnerItemId': '1035', 'quantityPurchased': 1}, {'description': 'CRAFTS', 'discountedItemPrice': '12.62', 'finalPrice': '12.62', 'itemPrice': '12.62', 'originalReceiptItemText': 'CRAFTS', 'partnerItemId': '1036', 'quantityPurchased': 1}, {'description': 'MSLRG RLNG', 'discountedItemPrice': '45.86', 'finalPrice': '45.86', 'itemPrice': '45.86', 'originalReceiptItemText': 'MSLRG RLNG', 'partnerItemId': '1037', 'quantityPurchased': 1}]"


In [81]:
id_125 = df_receipt_items.loc[df_receipt_items['_id_cleaned'] == '6008ee0e0a7214ad89000125']
id_125

Unnamed: 0,_id_cleaned,rewardsReceiptItemList
392,6008ee0e0a7214ad89000125,"[{'barcode': '012000809965', 'description': 'MTN DEW REVOLUTION SODA WILDBERRY FRUIT FLVR CANS IN BOX 12 CT 144 OZ', 'discountedItemPrice': '8.99', 'finalPrice': '8.99', 'itemNumber': '012000809965', 'itemPrice': '8.99', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'ILDBERRY FRUIT FLVR CANS IN BOX 12 C', 'partnerItemId': '1032', 'pointsNotAwardedReason': 'Action not allowed for user and CPG', 'pointsPayerId': '5332f5fbe4b03c9a25efd0ba', 'quantityPurchased': 1, 'rewardsGroup': 'MOUNTAIN DEW 12 OZ 12 PACK', 'rewardsProductPartnerId': '5332f5fbe4b03c9a25efd0ba'}, {'barcode': '511111101451', 'description': 'QUAKER', 'discountedItemPrice': '3.99', 'finalPrice': '3.99', 'itemNumber': '511111101451', 'itemPrice': '3.99', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': '2.99 10 OUAKER OATS Q', 'partnerItemId': '1042', 'pointsNotAwardedReason': 'Action not allowed for user and CPG', 'pointsPayerId': '53e10d6368abd3c7065097cc', 'quantityPurchased': 1, 'rewardsProductPartnerId': '53e10d6368abd3c7065097cc'}, {'barcode': '005111116022', 'description': 'TTER BLUE KRAZY KRITTER BLUE 1', 'discountedItemPrice': '1.49', 'finalPrice': '1.49', 'itemNumber': '005111116022', 'itemPrice': '1.49', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'TTER BLUE KRAZY KRITTER BLUE 1', 'partnerItemId': '1048', 'quantityPurchased': 1}, {'barcode': '511111602118', 'description': 'JELL-O', 'discountedItemPrice': '1.99', 'finalPrice': '1.99', 'itemNumber': '511111602118', 'itemPrice': '1.99', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'LO JELL-O', 'partnerItemId': '1051', 'pointsEarned': '10.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '311111536044', 'description': 'LUCKY CHARMS UNICORN CEREAL FAMILY SIZE', 'discountedItemPrice': '6.58', 'finalPrice': '6.58', 'itemNumber': '311111536044', 'itemPrice': '6.58', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'SI HIDDEN VALLEY SALAD DRESSING 21OZ', 'partnerItemId': '1088', 'pointsNotAwardedReason': 'Action not allowed for user and CPG', 'pointsPayerId': '5332f5f3e4b03c9a25efd0ae', 'quantityPurchased': 1, 'rewardsGroup': 'LUCKY CHARMS UNICORN CEREAL FAMILY SIZE', 'rewardsProductPartnerId': '5332f5f3e4b03c9a25efd0ae'}, {'barcode': '074682200294', 'description': 'R W KND FML BT VGTB JC BTL RFRG AFTR OPNN 32 FL OZ', 'discountedItemPrice': '7.89', 'finalPrice': '7.89', 'itemNumber': '074682200294', 'itemPrice': '7.89', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'ML BT VGTB JC BTL RFRG AFTR OPNN 32', 'partnerItemId': '1091', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '011594404013', 'description': 'HWN ONN RNG SWT M GLDN CRSP BAG 4 OZ', 'discountedItemPrice': '1.49', 'finalPrice': '1.49', 'itemNumber': '011594404013', 'itemPrice': '1.49', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'AIIAN SWT HWN ONN RNG SWT M GLDN CRS', 'partnerItemId': '1112', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}]"


In [82]:
# extract a sample value from rewardsReceiptItemList
receiptlist = id_125.iloc[0]['rewardsReceiptItemList']
receiptlist
len(receiptlist)

for item in receiptlist:
    print(item['barcode'])

012000809965
511111101451
005111116022
511111602118
311111536044
074682200294
011594404013


### Creating df_receipt_items data source

In [83]:
# create a list containing '_id_cleaned','rewardsReceiptItemList' values from df_receipts
list_receipt_items_in = df_receipts[['_id_cleaned', 'rewardsReceiptItemList']].values.tolist()

In [84]:
# create an emptly list to store values from list_receipt_items_in, a list of lists
list_receipt_items_expand = []

for _id_cleaned, rewardsReceiptItemList in list_receipt_items_in:
    item_index = 0
    try:
        for item in rewardsReceiptItemList:
            # create the list to add to list_receipt_items_expand
            list_out = [_id_cleaned, item_index, item]
            list_receipt_items_expand.append(list_out)
            item_index += 1
    except:
        pass

# confirm the list is composed as intended, slicing the first 5 items
list_receipt_items_expand[0:6]

[['5ff1e1eb0a720f0523000575',
  0,
  {'barcode': '4011',
   'description': 'ITEM NOT FOUND',
   'finalPrice': '26.00',
   'itemPrice': '26.00',
   'needsFetchReview': False,
   'partnerItemId': '1',
   'preventTargetGapPoints': True,
   'quantityPurchased': 5,
   'userFlaggedBarcode': '4011',
   'userFlaggedNewItem': True,
   'userFlaggedPrice': '26.00',
   'userFlaggedQuantity': 5}],
 ['5ff1e1bb0a720f052300056b',
  0,
  {'barcode': '4011',
   'description': 'ITEM NOT FOUND',
   'finalPrice': '1',
   'itemPrice': '1',
   'partnerItemId': '1',
   'quantityPurchased': 1}],
 ['5ff1e1bb0a720f052300056b',
  1,
  {'barcode': '028400642255',
   'description': 'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ',
   'finalPrice': '10.00',
   'itemPrice': '10.00',
   'needsFetchReview': True,
   'needsFetchReviewReason': 'USER_FLAGGED',
   'partnerItemId': '2',
   'pointsNotAwardedReason': 'Action not allowed for user and CPG',
   'pointsPayerId': '5332f5fbe4b03c9a25efd0ba',
   'preve

In [85]:
# create an empty list to store values from list_receipt_items_expand, another list of lists
# 9/24 - take two - adding brandCode - adjusted conditions for saving values by adding brandCode and Description
# I would drop my barcode and userFlaggedBarcode conditons but I want to perserve my exploratory code in 2.1-EDA_first_pass.ipynb
list_receipt_items_extract = []

for _id_cleaned, item_index, item in list_receipt_items_expand:
    # only save values to list_receipt_items_extract if there is a 'barcode', 'userFlaggedBarcode',
    # 'description', OR 'brandCode' present 
    if item.get('barcode', None) or item.get('userFlaggedBarcode', None) or item.get('description', None) \
    or item.get('brandCode', None):
        # assign variables values from item dictionaries, if the key doesn't exist default None
        barcode = item.get('barcode', None)
        userFlaggedBarcode = item.get('userFlaggedBarcode', None)
        description = item.get('description', None)
        userFlaggedDescription = item.get('userFlaggedDescription', None)
        finalPrice = item.get('finalPrice', None)
        userFlaggedPrice = item.get('userFlaggedPrice', None)
        quantityPurchased = item.get('quantityPurchased', None)
        userFlaggedQuantity = item.get('userFlaggedQuantity', None)
        brandCode = item.get('brandCode', None)
        #create the list to add to list_receipt_items_extract
        list_out = [
            _id_cleaned, item_index, barcode, userFlaggedBarcode, description, userFlaggedDescription, \
            finalPrice, userFlaggedPrice, quantityPurchased, userFlaggedQuantity, brandCode
            ]
        list_receipt_items_extract.append(list_out)

In [86]:
# inspect the first items of list_receipt_items_extract
list_receipt_items_extract[0:6]

[['5ff1e1eb0a720f0523000575',
  0,
  '4011',
  '4011',
  'ITEM NOT FOUND',
  None,
  '26.00',
  '26.00',
  5,
  5,
  None],
 ['5ff1e1bb0a720f052300056b',
  0,
  '4011',
  None,
  'ITEM NOT FOUND',
  None,
  '1',
  None,
  1,
  None,
  None],
 ['5ff1e1bb0a720f052300056b',
  1,
  '028400642255',
  '028400642255',
  'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ',
  'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ',
  '10.00',
  '10.00',
  1,
  1,
  None],
 ['5ff1e1f10a720f052300057a',
  0,
  None,
  '4011',
  None,
  None,
  None,
  '26.00',
  None,
  3,
  None],
 ['5ff1e1ee0a7214ada100056f',
  0,
  '4011',
  '4011',
  'ITEM NOT FOUND',
  None,
  '28.00',
  '28.00',
  4,
  4,
  None],
 ['5ff1e1d20a7214ada1000561',
  0,
  '4011',
  None,
  'ITEM NOT FOUND',
  None,
  '1',
  None,
  1,
  None,
  None]]

In [87]:
# create an empty dataframe, df_receipt_items with columns
# 9/24 - take two - adding brandCode
ri_columns = [
                "receipt_id",
                "item_index",
                "barcode",
                "userFlaggedBarcode",
                "description",
                "userFlaggedDescription",
                "finalPrice",
                "userFlaggedPrice",
                "quantityPurchased",
                "userFlaggedQuantity",
                "brandCode"
                ]
# load all the values from list_receipt_items_extract into this dataframe
df_receipt_items = pd.DataFrame(list_receipt_items_extract, columns = ri_columns)

In [88]:
"""9/24 - take two - adding brandCode and adjusting 
Take one output follows:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3240 entries, 0 to 3239
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   receipt_id              3240 non-null   object 
 1   item_index              3240 non-null   int64  
 2   barcode                 3090 non-null   object 
 3   userFlaggedBarcode      337 non-null    object 
 4   description             2859 non-null   object 
 5   userFlaggedDescription  205 non-null    object 
 6   finalPrice              3066 non-null   object 
 7   userFlaggedPrice        299 non-null    object 
 8   quantityPurchased       3066 non-null   float64
 9   userFlaggedQuantity     299 non-null    float64
 10  brandCode               1746 non-null   object 
dtypes: float64(2), int64(1), object(8)
memory usage: 278.6+ KB

Many more records to work with after pivoting to brandCode.
"""
df_receipt_items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6941 entries, 0 to 6940
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   receipt_id              6941 non-null   object 
 1   item_index              6941 non-null   int64  
 2   barcode                 3090 non-null   object 
 3   userFlaggedBarcode      337 non-null    object 
 4   description             6560 non-null   object 
 5   userFlaggedDescription  205 non-null    object 
 6   finalPrice              6767 non-null   object 
 7   userFlaggedPrice        299 non-null    object 
 8   quantityPurchased       6767 non-null   float64
 9   userFlaggedQuantity     299 non-null    float64
 10  brandCode               2600 non-null   object 
dtypes: float64(2), int64(1), object(8)
memory usage: 596.6+ KB


### Extracting brandCodes from description
Once I realized the better join between receipt_items and brand would be on brandCode I had the thought to apply a similar process that I used in my Fetch Catalog Analyst exercise. I wondered, are there items in receipt_items that have a value for description but None for brandCode. In those situations, might I be able to indentify a brand based on comparing known brands and the content of description.

The following cells capature my exploration of this idea.

In [89]:
# What's the potential set of items that would be eligible for this comparison. ie, non-null description and null brandCode
df_non_null_description = df_receipt_items[df_receipt_items.description.notnull()]
df_non_null_description.info()
"""There are 6,560 items with a description, and 3,960 of those items have no value for brandCode.
It also appears, based on comparing the result of the above cell, that when there is a brandCode there is 
a description. The potential to extract potentially 3,960 brandCodes makes it worth trying.
"""

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6560 entries, 0 to 6940
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   receipt_id              6560 non-null   object 
 1   item_index              6560 non-null   int64  
 2   barcode                 2859 non-null   object 
 3   userFlaggedBarcode      136 non-null    object 
 4   description             6560 non-null   object 
 5   userFlaggedDescription  17 non-null     object 
 6   finalPrice              6560 non-null   object 
 7   userFlaggedPrice        122 non-null    object 
 8   quantityPurchased       6560 non-null   float64
 9   userFlaggedQuantity     122 non-null    float64
 10  brandCode               2600 non-null   object 
dtypes: float64(2), int64(1), object(8)
memory usage: 615.0+ KB


'There are 6,560 items with a description, and 3,960 of those items have no value for brandCode.\nIt also appears, based on comparing the result of the above cell, that when there is a brandCode there is \na description. The potential to extract potentially 3,960 brandCodes makes it worth trying.\n'

Plan: 
- get a list of known brand codes
- use that list with regex to extract matches from df_receipt_items.description
- add those extracted values to a new column in df_receipt_items
- in the future join brands to receipt_items on b.brandCode = coalesce(ri.brandCode, ri.extracted-brandCode)

In [90]:
# get a list of all the known brand codes
brand_codes = list(set(df_brands['brandCode']))
# remove the '' and nan
brand_codes.remove('')
brand_codes.remove(brand_codes[0])

# get a list of all the descriptions from df_receipt_items
descriptions = list(df_receipt_items['description'])

# brandCodes are capitalized, I want my extracted matches to be capitalized as well, capitlaize descriptions:
# list comprehenison to upper case every str in description
# ref: https://stackoverflow.com/questions/1801668/convert-a-list-with-strings-all-to-lowercase-or-uppercase
descriptionsUp = []
for des in descriptions:
    try:
        descriptionsUp.append(des.upper())
    except:
        descriptionsUp.append(None)
        
# make a list to store matches in
extracted_brandCodes = []

# create a regex pattern of the values in brand_codes
regex = r'(' + '|'.join(brand_codes) + r')'

for des in descriptionsUp:
    try:
        matches = re.findall(regex, des)
    except:
        pass
    
    try:
        extracted_brandCodes.append(matches[0])
    except:
        extracted_brandCodes.append(None)
        

In [91]:
# add extracted_brandCodes ti df_receipt_items
df_receipt_items['extracted_brandCode'] = extracted_brandCodes

In [92]:
df_receipt_items.info()
# was able to extract 1854 brandCodes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6941 entries, 0 to 6940
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   receipt_id              6941 non-null   object 
 1   item_index              6941 non-null   int64  
 2   barcode                 3090 non-null   object 
 3   userFlaggedBarcode      337 non-null    object 
 4   description             6560 non-null   object 
 5   userFlaggedDescription  205 non-null    object 
 6   finalPrice              6767 non-null   object 
 7   userFlaggedPrice        299 non-null    object 
 8   quantityPurchased       6767 non-null   float64
 9   userFlaggedQuantity     299 non-null    float64
 10  brandCode               2600 non-null   object 
 11  extracted_brandCode     1854 non-null   object 
dtypes: float64(2), int64(1), object(9)
memory usage: 650.8+ KB


In [93]:
# How many of these extracted brandCodes could replace a brandCodes? ie,  non-null extracted_brandCodes and null brandCode
df_non_null_extracted_brandCodes = df_receipt_items[df_receipt_items.extracted_brandCode.notnull()]
df_non_null_extracted_brandCodes.info()
# Looks like I'll be able to join an additional ~1,000 receipt_items to brands. 


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1854 entries, 2 to 6863
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   receipt_id              1854 non-null   object 
 1   item_index              1854 non-null   int64  
 2   barcode                 1586 non-null   object 
 3   userFlaggedBarcode      119 non-null    object 
 4   description             1679 non-null   object 
 5   userFlaggedDescription  80 non-null     object 
 6   finalPrice              1792 non-null   object 
 7   userFlaggedPrice        109 non-null    object 
 8   quantityPurchased       1792 non-null   float64
 9   userFlaggedQuantity     109 non-null    float64
 10  brandCode               885 non-null    object 
 11  extracted_brandCode     1854 non-null   object 
dtypes: float64(2), int64(1), object(9)
memory usage: 188.3+ KB


### Convert data types and load dataframes to SQLite database 

In [94]:
# convert topBrand to dtype boolean, allowing nulls
df_brands.topBrand = df_brands.topBrand.astype("boolean")

In [95]:
df_brands.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1167 entries, 0 to 1166
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   _id           1167 non-null   object 
 1   _id_cleaned   1167 non-null   object 
 2   barcode       1167 non-null   int64  
 3   category      1012 non-null   object 
 4   categoryCode  517 non-null    object 
 5   cpg           1167 non-null   object 
 6   cpg_cleaned   1167 non-null   object 
 7   name          1167 non-null   object 
 8   topBrand      555 non-null    boolean
 9   brandCode     933 non-null    object 
dtypes: boolean(1), int64(1), object(8)
memory usage: 84.5+ KB


In [96]:
# create a new datafame with only the columns I want to load to sqlite
df_brands_load = df_brands[[
                            '_id_cleaned', 'barcode', 'category', 'categoryCode', 
                            'cpg_cleaned', 'name', 'topBrand', 'brandCode'
                            ]]
# rename some columns for better formating when loading into sqlite
df_brands_load.rename(columns={'_id_cleaned': 'id', 'cpg_cleaned': 'cpg'}, inplace=True)
df_brands_load


Unnamed: 0,id,barcode,category,categoryCode,cpg,name,topBrand,brandCode
0,601ac115be37ce2ead437551,511111019862,Baking,BAKING,601ac114be37ce2ead437550,test brand @1612366101024,False,
1,601c5460be37ce2ead43755f,511111519928,Beverages,BEVERAGES,5332f5fbe4b03c9a25efd0ba,Starbucks,False,STARBUCKS
2,601ac142be37ce2ead43755d,511111819905,Baking,BAKING,601ac142be37ce2ead437559,test brand @1612366146176,False,TEST BRANDCODE @1612366146176
3,601ac142be37ce2ead43755a,511111519874,Baking,BAKING,601ac142be37ce2ead437559,test brand @1612366146051,False,TEST BRANDCODE @1612366146051
4,601ac142be37ce2ead43755e,511111319917,Candy & Sweets,CANDY_AND_SWEETS,5332fa12e4b03c9a25efd1e7,test brand @1612366146827,False,TEST BRANDCODE @1612366146827
...,...,...,...,...,...,...,...,...
1162,5f77274dbe37ce6b592e90c0,511111116752,Baking,BAKING,5f77274dbe37ce6b592e90bf,test brand @1601644365844,,
1163,5dc1fca91dda2c0ad7da64ae,511111706328,Breakfast & Cereal,,53e10d6368abd3c7065097cc,Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
1164,5f494c6e04db711dd8fe87e7,511111416173,Candy & Sweets,CANDY_AND_SWEETS,5332fa12e4b03c9a25efd1e7,test brand @1598639215217,,TEST BRANDCODE @1598639215217
1165,5a021611e4b00efe02b02a57,511111400608,Grocery,,5332f5f6e4b03c9a25efd0b4,LIPTON TEA Leaves,False,LIPTON TEA Leaves


In [97]:
# when trying to load, the unique constraint fails on barcode: IntegrityError: UNIQUE constraint failed: brands.barcode
# retrun all rows where barcodes are repeated ref: https://stackoverflow.com/questions/14657241/how-do-i-get-a-list-of-all-the-duplicate-items-using-pandas-in-python
pd.concat(g for _, g in df_brands_load.groupby("barcode") if len(g) > 1)

Unnamed: 0,id,barcode,category,categoryCode,cpg,name,topBrand,brandCode
467,5c409ab4cd244a3539b84162,511111004790,Baking,,55b62995e4b0d8e685c14213,alexa,True,ALEXA
1071,5cdacd63166eb33eb7ce0fa8,511111004790,Condiments & Sauces,,559c2234e4b06aca36af13c6,Bitten Dressing,,BITTEN
152,5c45f91b87ff3552f950f027,511111204923,Grocery,,5c45f8b087ff3552f950f026,Brand1,True,0987654321
536,5d6027f46d5f3b23d1bc7906,511111204923,Snacks,,5332f5fbe4b03c9a25efd0ba,CHESTER'S,,CHESTERS
20,5c4699f387ff3577e203ea29,511111305125,Baby,,55b62995e4b0d8e685c14213,Chris Image Test,,CHRISIMAGE
651,5d642d65a3a018514994f42d,511111305125,Magazines,,5d5d4fd16d5f3b23d1bc7905,Rachael Ray Everyday,,511111305125
129,5a7e0604e4b0aedb3b84afd3,511111504139,Beverages,,55b62995e4b0d8e685c14213,Chris Brand XYZ,,CHRISXYZ
299,5a8c33f3e4b07f0a2dac8943,511111504139,Grocery,,5a734034e4b0d58f376be874,Pace,False,PACE
9,5c408e8bcd244a1fdb47aee7,511111504788,Baking,,59ba6f1ce4b092b29c167346,test,,TEST
412,5ccb2ece166eb31bbbadccbe,511111504788,Condiments & Sauces,,559c2234e4b06aca36af13c6,The Pioneer Woman,,PIONEER WOMAN


In [98]:
# add a column to flag if the barcode is a duplicate
# start by creating a list of the duplicated barcodes
dupe_barcodes = pd.concat(g for _, g in df_brands_load.groupby("barcode") if len(g) > 1)['barcode']
# remove duplicates
dupe_barcodes = list(set(dupe_barcodes))

# create a list to collect a bool value indicating if df_brands_load['barcode'] is duplicated
dupe_barcode_flag = []

# extract barcodes to evaluate
brand_barcodes = df_brands_load['barcode']

# loop through brand_barcodes evaluating if the barcode is included in dupe_barcodes
for barcode in brand_barcodes:
    if barcode in dupe_barcodes:
        dupe_barcode_flag.append(True)
    else:
        dupe_barcode_flag.append(None)

# add dupe_barcode to df_brands_load after the barcode column
df_brands_load.insert(2, 'dupe_barcode', dupe_barcode_flag)

# convert dupe_barcode to dtype boolean, allowing nulls
df_brands_load.dupe_barcode = df_brands_load.dupe_barcode.astype("boolean")

df_brands_load

Unnamed: 0,id,barcode,dupe_barcode,category,categoryCode,cpg,name,topBrand,brandCode
0,601ac115be37ce2ead437551,511111019862,,Baking,BAKING,601ac114be37ce2ead437550,test brand @1612366101024,False,
1,601c5460be37ce2ead43755f,511111519928,,Beverages,BEVERAGES,5332f5fbe4b03c9a25efd0ba,Starbucks,False,STARBUCKS
2,601ac142be37ce2ead43755d,511111819905,,Baking,BAKING,601ac142be37ce2ead437559,test brand @1612366146176,False,TEST BRANDCODE @1612366146176
3,601ac142be37ce2ead43755a,511111519874,,Baking,BAKING,601ac142be37ce2ead437559,test brand @1612366146051,False,TEST BRANDCODE @1612366146051
4,601ac142be37ce2ead43755e,511111319917,,Candy & Sweets,CANDY_AND_SWEETS,5332fa12e4b03c9a25efd1e7,test brand @1612366146827,False,TEST BRANDCODE @1612366146827
...,...,...,...,...,...,...,...,...,...
1162,5f77274dbe37ce6b592e90c0,511111116752,,Baking,BAKING,5f77274dbe37ce6b592e90bf,test brand @1601644365844,,
1163,5dc1fca91dda2c0ad7da64ae,511111706328,,Breakfast & Cereal,,53e10d6368abd3c7065097cc,Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
1164,5f494c6e04db711dd8fe87e7,511111416173,,Candy & Sweets,CANDY_AND_SWEETS,5332fa12e4b03c9a25efd1e7,test brand @1598639215217,,TEST BRANDCODE @1598639215217
1165,5a021611e4b00efe02b02a57,511111400608,,Grocery,,5332f5f6e4b03c9a25efd0b4,LIPTON TEA Leaves,False,LIPTON TEA Leaves


In [99]:
# 9/24 - Take two - need to repeat the origianl process for creating a dupe_barcode, but for brandCode
# when trying to load, the unique constraint fails on brandCode: IntegrityError: UNIQUE constraint failed: brands.brandCode
# retrun all rows where brandCode are repeated ref: https://stackoverflow.com/questions/14657241/how-do-i-get-a-list-of-all-the-duplicate-items-using-pandas-in-python
pd.concat(g for _, g in df_brands_load.groupby("brandCode") if len(g) > 1)

Unnamed: 0,id,barcode,dupe_barcode,category,categoryCode,cpg,name,topBrand,brandCode
124,57ebc11fe4b0ac389136a33a,511111802075,,Baking,,559c2234e4b06aca36af13c6,Kraft Caramels,False,
153,58861c7d4e8d0d20bc42c4d6,511111601449,,Snacks,,559c2234e4b06aca36af13c6,Jell-O Refrigerated Pudding & Gelatin,False,
163,57ebc2ace4b0ac389136a346,511111801962,,Deli,,559c2234e4b06aca36af13c6,P3,False,
188,58b59989e4b0857c2ddb7255,511111400998,,Beer Wine Spirits,,5332f709e4b03c9a25efd0f1,Redd's Wicked,False,
234,58b5988ce4b0857c2ddb7252,511111301028,,Beer Wine Spirits,,5332f709e4b03c9a25efd0f1,Henry's Hard Sparkling,False,
236,57ebc125e4b0ac389136a33b,511111302063,,Grocery,,559c2234e4b06aca36af13c6,Kraft Macaroni & Cheese,False,
278,585a967fe4b03e62d1ce0e80,511111801689,,Snacks,,5332f5fbe4b03c9a25efd0ba,Lay's Kettle Cooked,True,
288,585a96d2e4b03e62d1ce0e89,511111001607,,Beverages,,5332f5fbe4b03c9a25efd0ba,Pepsi Diet,False,
297,580e015be4b0f32b2de21385,511111501824,,Condiments & Sauces,,559c2234e4b06aca36af13c6,Kraft BBQ Sauce,False,
302,588b9ff4e4b02187f85cdadb,511111901099,,Condiments & Sauces,,559c2234e4b06aca36af13c6,Kraft Salad Dressing,False,


9/24 - Take two - there aren't nearly as many dupes as barcode, there are a few nulls which won't create duplicates in joins
There are two sets of repeated repeated brandCodes - GOODNITES and HUGGIES.
I'm not going to recreate the dupe_barcode process for this, but when I'm looking to get brands.name in queries I will 
join to a CTE of unique brands.name, brands.brandCode to avoid duplicates.



In [100]:
df_brands_load.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1167 entries, 0 to 1166
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            1167 non-null   object 
 1   barcode       1167 non-null   int64  
 2   dupe_barcode  14 non-null     boolean
 3   category      1012 non-null   object 
 4   categoryCode  517 non-null    object 
 5   cpg           1167 non-null   object 
 6   name          1167 non-null   object 
 7   topBrand      555 non-null    boolean
 8   brandCode     933 non-null    object 
dtypes: boolean(2), int64(1), object(6)
memory usage: 68.5+ KB


In [101]:
# create brands table
conn = sqlite3.connect(db_path)
c = conn.cursor()

c.execute('DROP TABLE IF EXISTS brands')

c.execute("""CREATE TABLE IF NOT EXISTS brands (
        id uuid PRIMARY KEY,
        -- there are duplicates in barcode, for now we'll load the table without this contraint and note it
        -- barcode numeric UNIQUE,
        barcode numeric,
        dupe_barcode bool,
        category text,
        categoryCode text,
        cpg text,
        name text,
        topBrand bool,
        -- there are duplicates in brandCode for now we'll load the table without this contraint and note it
        -- brandCode text UNIQUE
        brandCode text
    )""")

df_brands_load.to_sql('brands', conn, if_exists='append', index=False)

conn.commit()
conn.close()

In [102]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 495 entries, 0 to 494
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   _id                     495 non-null    object        
 1   _id_cleaned             495 non-null    object        
 2   active                  495 non-null    bool          
 3   createdDate             495 non-null    object        
 4   createdDate_cleaned     495 non-null    int64         
 5   createdDate_cleaned_ts  495 non-null    datetime64[ns]
 6   lastLogin               433 non-null    object        
 7   lastLogin_cleaned       433 non-null    float64       
 8   lastLogin_cleaned_ts    433 non-null    datetime64[ns]
 9   role                    495 non-null    object        
 10  signUpSource            447 non-null    object        
 11  state                   439 non-null    object        
dtypes: bool(1), datetime64[ns](2), float64(1), int64(1

In [103]:
# create a new datafame with only the columns I want to load to sqlite
df_users_load = df_users[[
                            '_id_cleaned', 'active', 'createdDate_cleaned_ts', 'lastLogin_cleaned_ts', 'role', 
                            'signUpSource', 'state'
                            ]]
# rename some columns for better formating when loading into sqlite
df_users_load.rename(columns={
                                '_id_cleaned': 'id', 'createdDate_cleaned_ts': 'createdDate', 
                                'lastLogin_cleaned_ts': 'lastLogin'
                                }, inplace=True)
df_users_load

Unnamed: 0,id,active,createdDate,lastLogin,role,signUpSource,state
0,5ff1e194b6a9d73a3a9f1052,True,2021-01-03 15:24:04.800,2021-01-03 15:25:37.858,consumer,Email,WI
1,5ff1e194b6a9d73a3a9f1052,True,2021-01-03 15:24:04.800,2021-01-03 15:25:37.858,consumer,Email,WI
2,5ff1e194b6a9d73a3a9f1052,True,2021-01-03 15:24:04.800,2021-01-03 15:25:37.858,consumer,Email,WI
3,5ff1e1eacfcf6c399c274ae6,True,2021-01-03 15:25:30.554,2021-01-03 15:25:30.597,consumer,Email,WI
4,5ff1e194b6a9d73a3a9f1052,True,2021-01-03 15:24:04.800,2021-01-03 15:25:37.858,consumer,Email,WI
...,...,...,...,...,...,...,...
490,54943462e4b07e684157a532,True,2014-12-19 14:21:22.381,2021-03-05 16:52:23.204,fetch-staff,,
491,54943462e4b07e684157a532,True,2014-12-19 14:21:22.381,2021-03-05 16:52:23.204,fetch-staff,,
492,54943462e4b07e684157a532,True,2014-12-19 14:21:22.381,2021-03-05 16:52:23.204,fetch-staff,,
493,54943462e4b07e684157a532,True,2014-12-19 14:21:22.381,2021-03-05 16:52:23.204,fetch-staff,,


### Removing user duplicates

In [104]:
# when trying to load, the unique and Primary Key constraint fails on barcode: IntegrityError: UNIQUE constraint failed: users.id
# retrun all rows where barcodes are repeated ref: https://stackoverflow.com/questions/14657241/how-do-i-get-a-list-of-all-the-duplicate-items-using-pandas-in-python
pd.concat(g for _, g in df_users_load.groupby("id") if len(g) > 1)

Unnamed: 0,id,active,createdDate,lastLogin,role,signUpSource,state
475,54943462e4b07e684157a532,True,2014-12-19 14:21:22.381,2021-03-05 16:52:23.204,fetch-staff,,
476,54943462e4b07e684157a532,True,2014-12-19 14:21:22.381,2021-03-05 16:52:23.204,fetch-staff,,
477,54943462e4b07e684157a532,True,2014-12-19 14:21:22.381,2021-03-05 16:52:23.204,fetch-staff,,
478,54943462e4b07e684157a532,True,2014-12-19 14:21:22.381,2021-03-05 16:52:23.204,fetch-staff,,
479,54943462e4b07e684157a532,True,2014-12-19 14:21:22.381,2021-03-05 16:52:23.204,fetch-staff,,
...,...,...,...,...,...,...,...
374,60189c94c8b50e11d8454f6b,True,2021-02-02 00:28:04.020,2021-02-02 00:28:04.073,consumer,Email,WI
385,601c2c05969c0b11f7d0b097,True,2021-02-04 17:16:53.700,2021-02-04 17:20:30.228,consumer,Email,WI
387,601c2c05969c0b11f7d0b097,True,2021-02-04 17:16:53.700,2021-02-04 17:20:30.228,consumer,Email,WI
393,60229990b57b8a12187fe9e0,True,2021-02-09 14:17:52.581,2021-02-09 14:17:52.626,consumer,Email,WI


In [105]:
# drop the duplicate rows (complete rows, may still result in dupe ids), then attempt to load 
df_backup = df_users_load
df_users_load.drop_duplicates(inplace = True)
# deduping worked, there were not repeating ids with unqiue values in any other column
df_users_load.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 212 entries, 0 to 475
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            212 non-null    object        
 1   active        212 non-null    bool          
 2   createdDate   212 non-null    datetime64[ns]
 3   lastLogin     172 non-null    datetime64[ns]
 4   role          212 non-null    object        
 5   signUpSource  207 non-null    object        
 6   state         206 non-null    object        
dtypes: bool(1), datetime64[ns](2), object(4)
memory usage: 11.8+ KB


In [106]:
# create users table
conn = sqlite3.connect(db_path)
c = conn.cursor()

c.execute('DROP TABLE IF EXISTS users')

c.execute("""CREATE TABLE IF NOT EXISTS users (
        id uuid PRIMARY KEY,
        active bool,
        createdDate timestamp,
        lastLogin timestamp,
        role text,
        signUpSource text,
        state text     
    )""")

df_users_load.to_sql('users', conn, if_exists='append', index=False)

conn.commit()
conn.close()

In [107]:
df_receipts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1119 entries, 0 to 1118
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   _id                           1119 non-null   object        
 1   _id_cleaned                   1119 non-null   object        
 2   bonusPointsEarned             544 non-null    float64       
 3   bonusPointsEarnedReason       544 non-null    object        
 4   createDate                    1119 non-null   object        
 5   createDate_cleaned            1119 non-null   int64         
 6   createDate_cleaned_ts         1119 non-null   datetime64[ns]
 7   dateScanned                   1119 non-null   object        
 8   dateScanned_cleaned           1119 non-null   int64         
 9   dateScanned_cleaned_ts        1119 non-null   datetime64[ns]
 10  finishedDate                  568 non-null    object        
 11  finishedDate_cleaned          

In [108]:
# create a new datafame with only the columns I want to load to sqlite
df_receipts_load = df_receipts[[
                            '_id_cleaned', 'bonusPointsEarned', 'bonusPointsEarnedReason', 'createDate_cleaned_ts', 
                            'dateScanned_cleaned_ts', 'finishedDate_cleaned_ts', 'modifyDate_cleaned_ts', 
                            'pointsAwardedDate_cleaned_ts', 'pointsEarned', 'purchaseDate_cleaned_ts', 
                            'purchasedItemCount', 'rewardsReceiptItemList', 'rewardsReceiptStatus', 'totalSpent', 
                            'userId'
                            ]]
# rename some columns for better formating when loading into sqlite
df_receipts_load.rename(columns={
                                '_id_cleaned': 'id', 'createDate_cleaned_ts': 'createDate', 
                                'dateScanned_cleaned_ts': 'dateScanned', 'finishedDate_cleaned_ts': 'finishedDate',
                                'modifyDate_cleaned_ts': 'modifyDate', 'pointsAwardedDate_cleaned_ts': 'pointsAwardedDate',
                                'purchaseDate_cleaned_ts': 'purchaseDate'
                                }, inplace=True)
df_receipts_load

Unnamed: 0,id,bonusPointsEarned,bonusPointsEarnedReason,createDate,dateScanned,finishedDate,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
0,5ff1e1eb0a720f0523000575,500.0,"Receipt number 2 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",2021-01-03 15:25:31.000,2021-01-03 15:25:31.000,2021-01-03 15:25:31,2021-01-03 15:25:36.000,2021-01-03 15:25:31,500.0,2021-01-03 00:00:00,5.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '26.00', 'itemPrice': '26.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 5, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '26.00', 'userFlaggedQuantity': 5}]",FINISHED,26.00,5ff1e1eacfcf6c399c274ae6
1,5ff1e1bb0a720f052300056b,150.0,"Receipt number 5 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",2021-01-03 15:24:43.000,2021-01-03 15:24:43.000,2021-01-03 15:24:43,2021-01-03 15:24:48.000,2021-01-03 15:24:43,150.0,2021-01-02 15:24:43,2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '028400642255', 'description': 'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ', 'finalPrice': '10.00', 'itemPrice': '10.00', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'pointsNotAwardedReason': 'Action not allowed for user and CPG', 'pointsPayerId': '5332f5fbe4b03c9a25efd0ba', 'preventTargetGapPoints': True, 'quantityPurchased': 1, 'rewardsGroup': 'DORITOS SPICY SWEET CHILI SINGLE SERVE', 'rewardsProductPartnerId': '5332f5fbe4b03c9a25efd0ba', 'userFlaggedBarcode': '028400642255', 'userFlaggedDescription': 'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ', 'userFlaggedNewItem': True, 'userFlaggedPrice': '10.00', 'userFlaggedQuantity': 1}]",FINISHED,11.00,5ff1e194b6a9d73a3a9f1052
2,5ff1e1f10a720f052300057a,5.0,All-receipts receipt bonus,2021-01-03 15:25:37.000,2021-01-03 15:25:37.000,NaT,2021-01-03 15:25:42.000,NaT,5.0,2021-01-03 00:00:00,1.0,"[{'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '26.00', 'userFlaggedQuantity': 3}]",REJECTED,10.00,5ff1e1f1cfcf6c399c274b0b
3,5ff1e1ee0a7214ada100056f,5.0,All-receipts receipt bonus,2021-01-03 15:25:34.000,2021-01-03 15:25:34.000,2021-01-03 15:25:34,2021-01-03 15:25:39.000,2021-01-03 15:25:34,5.0,2021-01-03 00:00:00,4.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '28.00', 'itemPrice': '28.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 4, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '28.00', 'userFlaggedQuantity': 4}]",FINISHED,28.00,5ff1e1eacfcf6c399c274ae6
4,5ff1e1d20a7214ada1000561,5.0,All-receipts receipt bonus,2021-01-03 15:25:06.000,2021-01-03 15:25:06.000,2021-01-03 15:25:11,2021-01-03 15:25:11.000,2021-01-03 15:25:06,5.0,2021-01-02 15:25:06,2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '1234', 'finalPrice': '2.56', 'itemPrice': '2.56', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'preventTargetGapPoints': True, 'quantityPurchased': 3, 'userFlaggedBarcode': '1234', 'userFlaggedDescription': '', 'userFlaggedNewItem': True, 'userFlaggedPrice': '2.56', 'userFlaggedQuantity': 3}]",FINISHED,1.00,5ff1e194b6a9d73a3a9f1052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1114,603cc0630a720fde100003e6,25.0,COMPLETE_NONPARTNER_RECEIPT,2021-03-01 10:22:27.000,2021-03-01 10:22:27.000,NaT,2021-03-01 10:22:28.000,NaT,25.0,2020-08-17 00:00:00,2.0,"[{'barcode': 'B076FJ92M4', 'description': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'discountedItemPrice': '22.97', 'finalPrice': '22.97', 'itemPrice': '22.97', 'originalReceiptItemText': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'partnerItemId': '0', 'priceAfterCoupon': '22.97', 'quantityPurchased': 1}, {'barcode': 'B07BRRLSVC', 'description': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'discountedItemPrice': '11.99', 'finalPrice': '11.99', 'itemPrice': '11.99', 'originalReceiptItemText': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'partnerItemId': '1', 'priceAfterCoupon': '11.99', 'quantityPurchased': 1}]",REJECTED,34.96,5fc961c3b8cfca11a077dd33
1115,603d0b710a720fde1000042a,,,2021-03-01 15:42:41.873,2021-03-01 15:42:41.873,NaT,2021-03-01 15:42:41.873,NaT,,NaT,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
1116,603cf5290a720fde10000413,,,2021-03-01 14:07:37.664,2021-03-01 14:07:37.664,NaT,2021-03-01 14:07:37.664,NaT,,NaT,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
1117,603ce7100a7217c72c000405,25.0,COMPLETE_NONPARTNER_RECEIPT,2021-03-01 13:07:28.000,2021-03-01 13:07:28.000,NaT,2021-03-01 13:07:29.000,NaT,25.0,2020-08-17 00:00:00,2.0,"[{'barcode': 'B076FJ92M4', 'description': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'discountedItemPrice': '22.97', 'finalPrice': '22.97', 'itemPrice': '22.97', 'originalReceiptItemText': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'partnerItemId': '0', 'priceAfterCoupon': '22.97', 'quantityPurchased': 1}, {'barcode': 'B07BRRLSVC', 'description': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'discountedItemPrice': '11.99', 'finalPrice': '11.99', 'itemPrice': '11.99', 'originalReceiptItemText': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'partnerItemId': '1', 'priceAfterCoupon': '11.99', 'quantityPurchased': 1}]",REJECTED,34.96,5fc961c3b8cfca11a077dd33


In [109]:
# convert to str - causing an error when loading to SQLite - InterfaceError: Error binding parameter 11 - probably unsupported type.
df_receipts_load.rewardsReceiptItemList = df_receipts_load.rewardsReceiptItemList.astype(str)

In [110]:
# create receipts table
conn = sqlite3.connect(db_path)
c = conn.cursor()

c.execute('DROP TABLE IF EXISTS receipts')

c.execute("""CREATE TABLE IF NOT EXISTS receipts (
        id uuid PRIMARY KEY,
        'bonusPointsEarned' numeric,
        'bonusPointsEarnedReason' text,
        'createDate' timestamp,
        'dateScanned' timestamp,
        'finishedDate' timestamp,
        'modifyDate' timestamp,
        'pointsAwardedDate' timestamp,
        'pointsEarned' numeric,
        'purchaseDate' timestamp,
        'purchasedItemCount' numeric,
        'rewardsReceiptItemList' text,
        'rewardsReceiptStatus' text,
        'totalSpent' numeric,
        'userId' text    
    )""")

df_receipts_load.to_sql('receipts', conn, if_exists='append', index=False)


conn.commit()
conn.close()

In [111]:
df_receipt_items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6941 entries, 0 to 6940
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   receipt_id              6941 non-null   object 
 1   item_index              6941 non-null   int64  
 2   barcode                 3090 non-null   object 
 3   userFlaggedBarcode      337 non-null    object 
 4   description             6560 non-null   object 
 5   userFlaggedDescription  205 non-null    object 
 6   finalPrice              6767 non-null   object 
 7   userFlaggedPrice        299 non-null    object 
 8   quantityPurchased       6767 non-null   float64
 9   userFlaggedQuantity     299 non-null    float64
 10  brandCode               2600 non-null   object 
 11  extracted_brandCode     1854 non-null   object 
dtypes: float64(2), int64(1), object(9)
memory usage: 650.8+ KB


In [112]:
# df_receipt_items already has all the columns I want
df_receipt_items_load = df_receipt_items

In [113]:
# create receipt_items table
conn = sqlite3.connect(db_path)
c = conn.cursor()

c.execute('DROP TABLE IF EXISTS receipt_items')

c.execute("""CREATE TABLE IF NOT EXISTS receipt_items (
        receipt_id text,
        item_index numeric,
        barcode text,
        userFlaggedBarcode text,
        description text,
        userFlaggedDescription text,
        finalPrice numeric,
        userFlaggedPrice numeric,
        quantityPurchased numeric,
        userFlaggedQuantity numeric,
        brandCode text,
        extracted_brandCode text
    )""")

df_receipt_items_load.to_sql('receipt_items', conn, if_exists='append', index=False)


conn.commit()
conn.close()

In [114]:
# scratch pad to quick check tables are correct

# table_name = 'brands'
# table_name = 'users'
# table_name = 'receipts'
table_name = 'receipt_items'

# query=f"""
# pragma table_info({table_name})
# """

query=f"""
    select 
        * 
    from 
        {table_name};
"""

# query=f"""
#     select 
#          *
#         -- count(*) 
#     from 
#         {table_name}
#     where
#         finishedDate is Null;
# """

conn = sqlite3.connect(db_path)
c = conn.cursor()

df = pd.read_sql_query(query,conn)

conn.commit()
conn.close()

df

Unnamed: 0,receipt_id,item_index,barcode,userFlaggedBarcode,description,userFlaggedDescription,finalPrice,userFlaggedPrice,quantityPurchased,userFlaggedQuantity,brandCode,extracted_brandCode
0,5ff1e1eb0a720f0523000575,0,4011,4011,ITEM NOT FOUND,,26.00,26.0,5.0,5.0,,
1,5ff1e1bb0a720f052300056b,0,4011,,ITEM NOT FOUND,,1.00,,1.0,,,
2,5ff1e1bb0a720f052300056b,1,028400642255,028400642255,DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ,DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ,10.00,10.0,1.0,1.0,,DORITOS
3,5ff1e1f10a720f052300057a,0,,4011,,,,26.0,,3.0,,DORITOS
4,5ff1e1ee0a7214ada100056f,0,4011,4011,ITEM NOT FOUND,,28.00,28.0,4.0,4.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...
6936,603cc2bc0a720fde100003e9,1,B07BRRLSVC,,thindust summer face mask - sun protection neck gaiter for outdooractivities,,11.99,,1.0,,,
6937,603cc0630a720fde100003e6,0,B076FJ92M4,,"mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white",,22.97,,1.0,,,
6938,603cc0630a720fde100003e6,1,B07BRRLSVC,,thindust summer face mask - sun protection neck gaiter for outdooractivities,,11.99,,1.0,,,
6939,603ce7100a7217c72c000405,0,B076FJ92M4,,"mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white",,22.97,,1.0,,,


### fetch.db schema

![fetch_db_erd-5.jpg](attachment:fetch_db_erd-5.jpg)