## Andrew Byrnes: Fetch Rewards Coding Exercise - Data Analyst

This notebook prepares the provided data files and loads them into a SQLite database. It includes an entity relationship diagram of how I've modeled this data.
I chose SQLite for this challenge because it is lightweight and lends itself well to sharing.  
SQLite's flexible typing rules could be a liability for a database that is continually updated and primarily used for analysis. For that use case I would choose something that follows the SQL standard more strictly.
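
To make the typing concern concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `demo` and column `points` are made up for illustration and are not part of the model in this notebook. It shows that a column declared `INTEGER` will still store a non-numeric string as text.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (points INTEGER)")

# SQLite accepts a non-numeric string in an INTEGER column and stores it as text.
conn.execute("INSERT INTO demo (points) VALUES (?)", ("not a number",))
conn.execute("INSERT INTO demo (points) VALUES (?)", (500,))

# typeof() reports the storage class actually used for each stored value.
stored_types = [
    row[0] for row in conn.execute("SELECT typeof(points) FROM demo ORDER BY rowid")
]
print(stored_types)
conn.close()
```

A stricter engine would reject the first insert, which is why validation has to happen in the notebook before loading.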

### Data Sources
- data-modeling.html : coding exercise instructions
- brands.json.gz, receipts.json.gz, users.json.gz : raw data files provided for completion of the challenge

### Changes
- 09-17-2022 : Started project, first look at data, identified transformation tasks 
- 09-18-2022 : cleaned df_brands _id and cpg columns
- 09-19-2022 : wrote function to clean columns with dicts, wrote function to convert epoch time to timestamps
- 09-20-2022 : refactored functions, applied functions for cleaning and converting data, explored df_receipts.rewardsReceiptItemList, notes on stakeholder questions
- 09-21-2022 : brainstorming notes on receipt_items, created df_receipt_items dataframe

In [1]:
import pandas as pd
from pathlib import Path
import os
from datetime import datetime
import gzip
import json
import sqlite3

### File Locations

In [2]:
today = datetime.today()
print(today)
in_brands = Path.cwd() / "data" / "raw" / "brands.json.gz"
in_receipts = Path.cwd() / "data" / "raw" / "receipts.json.gz"
in_users = Path.cwd() / "data" / "raw" / "users.json.gz"
db_path = Path.cwd() / "data" / "processed" / "fetch.db"

2022-09-21 21:49:01.541415


### Drop database if exists

In [3]:
if os.path.exists(db_path):
    os.remove(db_path)
    print("The db has been removed successfully")
else:
    print("The db does not exist!")

The db does not exist!


### Formatting and options

In [103]:
pd.set_option('display.max_colwidth', None)
# pd.set_option('display.max_rows', None)
pd.reset_option('display.max_rows')
pd.set_option('display.max_columns', None)
# pd.reset_option('display.max_columns')
# suppressing a warning related to renaming columns just prior to loading to sqlite
pd.options.mode.chained_assignment = None

### Load JSON data into pandas DataFrames

In [5]:
df_brands = pd.read_json(in_brands,lines=True,compression='gzip')
df_receipts = pd.read_json(in_receipts,lines=True,compression='gzip')
df_users = pd.read_json(in_users,lines=True,compression='gzip')
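
The `lines=True` argument is needed because each raw file holds one JSON object per line rather than a single JSON array. As a rough illustration of that layout, the sketch below builds a small gzip-compressed sample in memory and reads it with only the standard library. The two records are abbreviated to the `_id` and `name` fields of the first two brands, and the variable names are illustrative.

```python
import gzip
import io
import json

# Two abbreviated sample records in the same shape as brands.json.gz:
# one JSON object per line, gzip-compressed.
sample_lines = [
    '{"_id": {"$oid": "601ac115be37ce2ead437551"}, "name": "test brand @1612366101024"}',
    '{"_id": {"$oid": "601c5460be37ce2ead43755f"}, "name": "Starbucks"}',
]
buffer = io.BytesIO(gzip.compress("\n".join(sample_lines).encode("utf-8")))

# Read the compressed stream line by line, parsing each line as its own JSON document.
with gzip.open(buffer, mode="rt", encoding="utf-8") as handle:
    records = [json.loads(line) for line in handle if line.strip()]

# Nested objects such as _id come back as Python dicts, which is why they need cleaning later.
print(type(records[0]["_id"]), records[1]["name"])
```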

### First look at data

### **brands**  
**to-do**:
- ~extract _ids~
- ~extract cpg ids~

In [6]:
df_brands.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1167 entries, 0 to 1166
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   _id           1167 non-null   object 
 1   barcode       1167 non-null   int64  
 2   category      1012 non-null   object 
 3   categoryCode  517 non-null    object 
 4   cpg           1167 non-null   object 
 5   name          1167 non-null   object 
 6   topBrand      555 non-null    float64
 7   brandCode     933 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 73.1+ KB


**Brand Data Schema**
- _id: brand uuid
- barcode: the barcode on the item
- brandCode: String that corresponds with the brand column in a partner product file
- category: The category name under which the brand sells products
- categoryCode: The category code that references a BrandCategory
- cpg: reference to CPG collection
- topBrand: Boolean indicator for whether the brand should be featured as a 'top brand'
- name: Brand name

In [7]:
df_brands

Unnamed: 0,_id,barcode,category,categoryCode,cpg,name,topBrand,brandCode
0,{'$oid': '601ac115be37ce2ead437551'},511111019862,Baking,BAKING,"{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}",test brand @1612366101024,0.0,
1,{'$oid': '601c5460be37ce2ead43755f'},511111519928,Beverages,BEVERAGES,"{'$id': {'$oid': '5332f5fbe4b03c9a25efd0ba'}, '$ref': 'Cogs'}",Starbucks,0.0,STARBUCKS
2,{'$oid': '601ac142be37ce2ead43755d'},511111819905,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",test brand @1612366146176,0.0,TEST BRANDCODE @1612366146176
3,{'$oid': '601ac142be37ce2ead43755a'},511111519874,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",test brand @1612366146051,0.0,TEST BRANDCODE @1612366146051
4,{'$oid': '601ac142be37ce2ead43755e'},511111319917,Candy & Sweets,CANDY_AND_SWEETS,"{'$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}, '$ref': 'Cogs'}",test brand @1612366146827,0.0,TEST BRANDCODE @1612366146827
...,...,...,...,...,...,...,...,...
1162,{'$oid': '5f77274dbe37ce6b592e90c0'},511111116752,Baking,BAKING,"{'$ref': 'Cogs', '$id': {'$oid': '5f77274dbe37ce6b592e90bf'}}",test brand @1601644365844,,
1163,{'$oid': '5dc1fca91dda2c0ad7da64ae'},511111706328,Breakfast & Cereal,,"{'$ref': 'Cogs', '$id': {'$oid': '53e10d6368abd3c7065097cc'}}",Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
1164,{'$oid': '5f494c6e04db711dd8fe87e7'},511111416173,Candy & Sweets,CANDY_AND_SWEETS,"{'$ref': 'Cogs', '$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}}",test brand @1598639215217,,TEST BRANDCODE @1598639215217
1165,{'$oid': '5a021611e4b00efe02b02a57'},511111400608,Grocery,,"{'$ref': 'Cogs', '$id': {'$oid': '5332f5f6e4b03c9a25efd0b4'}}",LIPTON TEA Leaves,0.0,LIPTON TEA Leaves


### **receipts**  
**to-do**:
- ~extract _ids~
- ~extract and convert createDate~
- ~extract and convert dateScanned~
- ~extract and convert finishedDate~
- ~extract and convert modifyDate~
- ~extract and convert pointsAwardedDate~
- ~extract and convert purchaseDate~
- create receipt_items table using the rewardsReceiptItemList
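
The date fields arrive as dictionaries of the form `{'$date': 1609687531000}`, where the number looks like milliseconds since the Unix epoch. The sketch below shows one way such a value could be converted. The helper name `epoch_ms_to_datetime` is hypothetical, and treating the values as UTC is an assumption, since the source data does not state a time zone.

```python
from datetime import datetime, timezone


def epoch_ms_to_datetime(date_dict):
    """Hypothetical helper: convert {'$date': <epoch ms>} to a UTC datetime, or None."""
    # Missing dates show up as NaN/None rather than a dict, so pass those through as None.
    if not isinstance(date_dict, dict) or "$date" not in date_dict:
        return None
    # The stored value is in milliseconds; fromtimestamp() expects seconds.
    return datetime.fromtimestamp(date_dict["$date"] / 1000, tz=timezone.utc)


# createDate of the first receipt in df_receipts
converted = epoch_ms_to_datetime({"$date": 1609687531000})
print(converted.isoformat())

# A missing value (NaN) is returned as None instead of raising.
missing = epoch_ms_to_datetime(float("nan"))
print(missing)
```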

In [8]:
df_receipts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1119 entries, 0 to 1118
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   _id                      1119 non-null   object 
 1   bonusPointsEarned        544 non-null    float64
 2   bonusPointsEarnedReason  544 non-null    object 
 3   createDate               1119 non-null   object 
 4   dateScanned              1119 non-null   object 
 5   finishedDate             568 non-null    object 
 6   modifyDate               1119 non-null   object 
 7   pointsAwardedDate        537 non-null    object 
 8   pointsEarned             609 non-null    float64
 9   purchaseDate             671 non-null    object 
 10  purchasedItemCount       635 non-null    float64
 11  rewardsReceiptItemList   679 non-null    object 
 12  rewardsReceiptStatus     1119 non-null   object 
 13  totalSpent               684 non-null    float64
 14  userId                  

**Receipts Data Schema**
- _id: uuid for this receipt
- bonusPointsEarned: Number of bonus points that were awarded upon receipt completion
- bonusPointsEarnedReason: event that triggered bonus points
- createDate: The date that the event was created
- dateScanned: Date that the user scanned their receipt
- finishedDate: Date that the receipt finished processing
- modifyDate: The date the event was modified
- pointsAwardedDate: The date we awarded points for the transaction
- pointsEarned: The number of points earned for the receipt
- purchaseDate: the date of the purchase
- purchasedItemCount: Count of number of items on the receipt
- rewardsReceiptItemList: The items that were purchased on the receipt
- rewardsReceiptStatus: status of the receipt through receipt validation and processing
- totalSpent: The total amount on the receipt
- userId: string id back to the User collection for the user who scanned the receipt
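
Because `rewardsReceiptItemList` holds a list of item dictionaries per receipt, a receipt_items table needs one row per item, each carrying the id of its parent receipt. The following is only a plain-Python sketch of that reshaping, not necessarily how df_receipt_items is built in this notebook. The sample receipts are abbreviated and the `receipt_id` and `item_index` field names are assumptions.

```python
# Abbreviated sample receipts: the cleaned receipt id plus two item fields.
receipts = [
    {
        "_id": "5ff1e1bb0a720f052300056b",
        "rewardsReceiptItemList": [
            {"barcode": "4011", "finalPrice": "1"},
            {"barcode": "028400642255", "finalPrice": "10.00"},
        ],
    },
    # A receipt with no item list, as seen for SUBMITTED receipts.
    {"_id": "603d0b710a720fde1000042a", "rewardsReceiptItemList": None},
]

receipt_items = []
for receipt in receipts:
    items = receipt["rewardsReceiptItemList"]
    if not isinstance(items, list):
        continue  # skip receipts without an item list
    for position, item in enumerate(items):
        # Carry the parent receipt id so each item row can be joined back to its receipt.
        receipt_items.append({"receipt_id": receipt["_id"], "item_index": position, **item})

print(len(receipt_items))
```

In pandas, a similar result could likely be reached with `DataFrame.explode` followed by `pd.json_normalize`.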

In [9]:
df_receipts

Unnamed: 0,_id,bonusPointsEarned,bonusPointsEarnedReason,createDate,dateScanned,finishedDate,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
0,{'$oid': '5ff1e1eb0a720f0523000575'},500.0,"Receipt number 2 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1609687531000},{'$date': 1609687531000},{'$date': 1609687531000},{'$date': 1609687536000},{'$date': 1609687531000},500.0,{'$date': 1609632000000},5.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '26.00', 'itemPrice': '26.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 5, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '26.00', 'userFlaggedQuantity': 5}]",FINISHED,26.00,5ff1e1eacfcf6c399c274ae6
1,{'$oid': '5ff1e1bb0a720f052300056b'},150.0,"Receipt number 5 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1609687483000},{'$date': 1609687483000},{'$date': 1609687483000},{'$date': 1609687488000},{'$date': 1609687483000},150.0,{'$date': 1609601083000},2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '028400642255', 'description': 'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ', 'finalPrice': '10.00', 'itemPrice': '10.00', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'pointsNotAwardedReason': 'Action not allowed for user and CPG', 'pointsPayerId': '5332f5fbe4b03c9a25efd0ba', 'preventTargetGapPoints': True, 'quantityPurchased': 1, 'rewardsGroup': 'DORITOS SPICY SWEET CHILI SINGLE SERVE', 'rewardsProductPartnerId': '5332f5fbe4b03c9a25efd0ba', 'userFlaggedBarcode': '028400642255', 'userFlaggedDescription': 'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ', 'userFlaggedNewItem': True, 'userFlaggedPrice': '10.00', 'userFlaggedQuantity': 1}]",FINISHED,11.00,5ff1e194b6a9d73a3a9f1052
2,{'$oid': '5ff1e1f10a720f052300057a'},5.0,All-receipts receipt bonus,{'$date': 1609687537000},{'$date': 1609687537000},,{'$date': 1609687542000},,5.0,{'$date': 1609632000000},1.0,"[{'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '26.00', 'userFlaggedQuantity': 3}]",REJECTED,10.00,5ff1e1f1cfcf6c399c274b0b
3,{'$oid': '5ff1e1ee0a7214ada100056f'},5.0,All-receipts receipt bonus,{'$date': 1609687534000},{'$date': 1609687534000},{'$date': 1609687534000},{'$date': 1609687539000},{'$date': 1609687534000},5.0,{'$date': 1609632000000},4.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '28.00', 'itemPrice': '28.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 4, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '28.00', 'userFlaggedQuantity': 4}]",FINISHED,28.00,5ff1e1eacfcf6c399c274ae6
4,{'$oid': '5ff1e1d20a7214ada1000561'},5.0,All-receipts receipt bonus,{'$date': 1609687506000},{'$date': 1609687506000},{'$date': 1609687511000},{'$date': 1609687511000},{'$date': 1609687506000},5.0,{'$date': 1609601106000},2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '1234', 'finalPrice': '2.56', 'itemPrice': '2.56', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'preventTargetGapPoints': True, 'quantityPurchased': 3, 'userFlaggedBarcode': '1234', 'userFlaggedDescription': '', 'userFlaggedNewItem': True, 'userFlaggedPrice': '2.56', 'userFlaggedQuantity': 3}]",FINISHED,1.00,5ff1e194b6a9d73a3a9f1052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1114,{'$oid': '603cc0630a720fde100003e6'},25.0,COMPLETE_NONPARTNER_RECEIPT,{'$date': 1614594147000},{'$date': 1614594147000},,{'$date': 1614594148000},,25.0,{'$date': 1597622400000},2.0,"[{'barcode': 'B076FJ92M4', 'description': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'discountedItemPrice': '22.97', 'finalPrice': '22.97', 'itemPrice': '22.97', 'originalReceiptItemText': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'partnerItemId': '0', 'priceAfterCoupon': '22.97', 'quantityPurchased': 1}, {'barcode': 'B07BRRLSVC', 'description': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'discountedItemPrice': '11.99', 'finalPrice': '11.99', 'itemPrice': '11.99', 'originalReceiptItemText': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'partnerItemId': '1', 'priceAfterCoupon': '11.99', 'quantityPurchased': 1}]",REJECTED,34.96,5fc961c3b8cfca11a077dd33
1115,{'$oid': '603d0b710a720fde1000042a'},,,{'$date': 1614613361873},{'$date': 1614613361873},,{'$date': 1614613361873},,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
1116,{'$oid': '603cf5290a720fde10000413'},,,{'$date': 1614607657664},{'$date': 1614607657664},,{'$date': 1614607657664},,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
1117,{'$oid': '603ce7100a7217c72c000405'},25.0,COMPLETE_NONPARTNER_RECEIPT,{'$date': 1614604048000},{'$date': 1614604048000},,{'$date': 1614604049000},,25.0,{'$date': 1597622400000},2.0,"[{'barcode': 'B076FJ92M4', 'description': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'discountedItemPrice': '22.97', 'finalPrice': '22.97', 'itemPrice': '22.97', 'originalReceiptItemText': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'partnerItemId': '0', 'priceAfterCoupon': '22.97', 'quantityPurchased': 1}, {'barcode': 'B07BRRLSVC', 'description': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'discountedItemPrice': '11.99', 'finalPrice': '11.99', 'itemPrice': '11.99', 'originalReceiptItemText': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'partnerItemId': '1', 'priceAfterCoupon': '11.99', 'quantityPurchased': 1}]",REJECTED,34.96,5fc961c3b8cfca11a077dd33


### **users**  
**to-do**:
- ~extract _ids~
- ~extract and convert createdDate~
- ~extract and convert lastLogin~

In [10]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 495 entries, 0 to 494
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   _id           495 non-null    object
 1   active        495 non-null    bool  
 2   createdDate   495 non-null    object
 3   lastLogin     433 non-null    object
 4   role          495 non-null    object
 5   signUpSource  447 non-null    object
 6   state         439 non-null    object
dtypes: bool(1), object(6)
memory usage: 23.8+ KB


**Users Data Schema**
- _id: user Id
- state: state abbreviation
- createdDate: when the user created their account
- lastLogin: last time the user was recorded logging in to the app
- role: constant value set to 'CONSUMER'
- active: indicates if the user is active; only Fetch will de-activate an account with this flag

In [11]:
df_users

Unnamed: 0,_id,active,createdDate,lastLogin,role,signUpSource,state
0,{'$oid': '5ff1e194b6a9d73a3a9f1052'},True,{'$date': 1609687444800},{'$date': 1609687537858},consumer,Email,WI
1,{'$oid': '5ff1e194b6a9d73a3a9f1052'},True,{'$date': 1609687444800},{'$date': 1609687537858},consumer,Email,WI
2,{'$oid': '5ff1e194b6a9d73a3a9f1052'},True,{'$date': 1609687444800},{'$date': 1609687537858},consumer,Email,WI
3,{'$oid': '5ff1e1eacfcf6c399c274ae6'},True,{'$date': 1609687530554},{'$date': 1609687530597},consumer,Email,WI
4,{'$oid': '5ff1e194b6a9d73a3a9f1052'},True,{'$date': 1609687444800},{'$date': 1609687537858},consumer,Email,WI
...,...,...,...,...,...,...,...
490,{'$oid': '54943462e4b07e684157a532'},True,{'$date': 1418998882381},{'$date': 1614963143204},fetch-staff,,
491,{'$oid': '54943462e4b07e684157a532'},True,{'$date': 1418998882381},{'$date': 1614963143204},fetch-staff,,
492,{'$oid': '54943462e4b07e684157a532'},True,{'$date': 1418998882381},{'$date': 1614963143204},fetch-staff,,
493,{'$oid': '54943462e4b07e684157a532'},True,{'$date': 1418998882381},{'$date': 1614963143204},fetch-staff,,


## Cleaning the data

### First attempt of extracting the values from columns containing dictionaries
I've included the following section of Python code to illustrate the thought process that led to writing the value_from_dict() function. The code includes some notes as comments, but a full explanation of my process is in the documentation of the resulting function.  
The function covers everything the following code does, so if you are stepping through this notebook on a fresh kernel you can skip the cells between **Start** and **End**.

**Start** - you *can* skip executing the code starting here

In [12]:
# confirming the values in _id are recognized as Python objects, in this case a dictionary
type(df_brands['_id'][0])

dict

In [13]:
df_brands['_id'][0]['$oid']

'601ac115be37ce2ead437551'

In [14]:
# extract the _id column as a series
_id_series = df_brands['_id']
# create a list to collect the values from the dictionary objects in _id
_id_clean = []

# iterate through _id_series, appending the extracted values to _id_clean
for index, value in _id_series.items():
    _id_clean.append(value['$oid'])
    
# confirm no nulls in _id_clean
assert None not in _id_clean, "there is at least one None/null value in _id_clean"
# confirm _id_clean is the same length as the original _id column in df_brands
assert len(_id_clean) == len(df_brands['_id']), "the length of the original column and the cleaned column are not the same"

# add _id_clean to df_brands after the _id column
df_brands.insert(1, '_id_clean', _id_clean)

df_brands

Unnamed: 0,_id,_id_clean,barcode,category,categoryCode,cpg,name,topBrand,brandCode
0,{'$oid': '601ac115be37ce2ead437551'},601ac115be37ce2ead437551,511111019862,Baking,BAKING,"{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}",test brand @1612366101024,0.0,
1,{'$oid': '601c5460be37ce2ead43755f'},601c5460be37ce2ead43755f,511111519928,Beverages,BEVERAGES,"{'$id': {'$oid': '5332f5fbe4b03c9a25efd0ba'}, '$ref': 'Cogs'}",Starbucks,0.0,STARBUCKS
2,{'$oid': '601ac142be37ce2ead43755d'},601ac142be37ce2ead43755d,511111819905,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",test brand @1612366146176,0.0,TEST BRANDCODE @1612366146176
3,{'$oid': '601ac142be37ce2ead43755a'},601ac142be37ce2ead43755a,511111519874,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",test brand @1612366146051,0.0,TEST BRANDCODE @1612366146051
4,{'$oid': '601ac142be37ce2ead43755e'},601ac142be37ce2ead43755e,511111319917,Candy & Sweets,CANDY_AND_SWEETS,"{'$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}, '$ref': 'Cogs'}",test brand @1612366146827,0.0,TEST BRANDCODE @1612366146827
...,...,...,...,...,...,...,...,...,...
1162,{'$oid': '5f77274dbe37ce6b592e90c0'},5f77274dbe37ce6b592e90c0,511111116752,Baking,BAKING,"{'$ref': 'Cogs', '$id': {'$oid': '5f77274dbe37ce6b592e90bf'}}",test brand @1601644365844,,
1163,{'$oid': '5dc1fca91dda2c0ad7da64ae'},5dc1fca91dda2c0ad7da64ae,511111706328,Breakfast & Cereal,,"{'$ref': 'Cogs', '$id': {'$oid': '53e10d6368abd3c7065097cc'}}",Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
1164,{'$oid': '5f494c6e04db711dd8fe87e7'},5f494c6e04db711dd8fe87e7,511111416173,Candy & Sweets,CANDY_AND_SWEETS,"{'$ref': 'Cogs', '$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}}",test brand @1598639215217,,TEST BRANDCODE @1598639215217
1165,{'$oid': '5a021611e4b00efe02b02a57'},5a021611e4b00efe02b02a57,511111400608,Grocery,,"{'$ref': 'Cogs', '$id': {'$oid': '5332f5f6e4b03c9a25efd0b4'}}",LIPTON TEA Leaves,0.0,LIPTON TEA Leaves


In [15]:
# examining the values in cpg
type(df_brands['cpg'])

pandas.core.series.Series

In [16]:
df_brands['cpg'][0]

{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}

In [17]:
df_brands['cpg'][0]['$id']['$oid']

'601ac114be37ce2ead437550'

In [18]:
# extract the cpg column as a series
cpg_series = df_brands['cpg']
# create a list to collect the values from the dictionary objects in cpg
cpg_clean = []

# iterate through cpg_series, appending the extracted values to cpg_clean
for index, value in cpg_series.items():
    cpg_clean.append(value['$id']['$oid'])
    
# confirm no nulls in cpg_clean
assert None not in cpg_clean, "there is at least one None/null value in cpg_clean"
# confirm cpg_clean is the same length as the original cpg column in df_brands
assert len(cpg_clean) == len(df_brands['cpg']), "the length of the original column and the cleaned column are not the same"

# add cpg_clean to df_brands after the cpg column
df_brands.insert(6, 'cpg_clean', cpg_clean)

df_brands

Unnamed: 0,_id,_id_clean,barcode,category,categoryCode,cpg,cpg_clean,name,topBrand,brandCode
0,{'$oid': '601ac115be37ce2ead437551'},601ac115be37ce2ead437551,511111019862,Baking,BAKING,"{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}",601ac114be37ce2ead437550,test brand @1612366101024,0.0,
1,{'$oid': '601c5460be37ce2ead43755f'},601c5460be37ce2ead43755f,511111519928,Beverages,BEVERAGES,"{'$id': {'$oid': '5332f5fbe4b03c9a25efd0ba'}, '$ref': 'Cogs'}",5332f5fbe4b03c9a25efd0ba,Starbucks,0.0,STARBUCKS
2,{'$oid': '601ac142be37ce2ead43755d'},601ac142be37ce2ead43755d,511111819905,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",601ac142be37ce2ead437559,test brand @1612366146176,0.0,TEST BRANDCODE @1612366146176
3,{'$oid': '601ac142be37ce2ead43755a'},601ac142be37ce2ead43755a,511111519874,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",601ac142be37ce2ead437559,test brand @1612366146051,0.0,TEST BRANDCODE @1612366146051
4,{'$oid': '601ac142be37ce2ead43755e'},601ac142be37ce2ead43755e,511111319917,Candy & Sweets,CANDY_AND_SWEETS,"{'$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}, '$ref': 'Cogs'}",5332fa12e4b03c9a25efd1e7,test brand @1612366146827,0.0,TEST BRANDCODE @1612366146827
...,...,...,...,...,...,...,...,...,...,...
1162,{'$oid': '5f77274dbe37ce6b592e90c0'},5f77274dbe37ce6b592e90c0,511111116752,Baking,BAKING,"{'$ref': 'Cogs', '$id': {'$oid': '5f77274dbe37ce6b592e90bf'}}",5f77274dbe37ce6b592e90bf,test brand @1601644365844,,
1163,{'$oid': '5dc1fca91dda2c0ad7da64ae'},5dc1fca91dda2c0ad7da64ae,511111706328,Breakfast & Cereal,,"{'$ref': 'Cogs', '$id': {'$oid': '53e10d6368abd3c7065097cc'}}",53e10d6368abd3c7065097cc,Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
1164,{'$oid': '5f494c6e04db711dd8fe87e7'},5f494c6e04db711dd8fe87e7,511111416173,Candy & Sweets,CANDY_AND_SWEETS,"{'$ref': 'Cogs', '$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}}",5332fa12e4b03c9a25efd1e7,test brand @1598639215217,,TEST BRANDCODE @1598639215217
1165,{'$oid': '5a021611e4b00efe02b02a57'},5a021611e4b00efe02b02a57,511111400608,Grocery,,"{'$ref': 'Cogs', '$id': {'$oid': '5332f5f6e4b03c9a25efd0b4'}}",5332f5f6e4b03c9a25efd0b4,LIPTON TEA Leaves,0.0,LIPTON TEA Leaves


In [19]:
dataframe = df_brands
column_to_clean ='cpg'
dirty_series = dataframe[column_to_clean]
dirty_series
insert_at = dataframe.columns.get_loc(column_to_clean) + 1
cleaned_list = []
dict_key = "['$id']['$oid']"
dict_key_value = "value" + dict_key


for index, value in dirty_series.items():
    cleaned_list.append(eval(dict_key_value))

cleaned_list



['601ac114be37ce2ead437550',
 '5332f5fbe4b03c9a25efd0ba',
 '601ac142be37ce2ead437559',
 '601ac142be37ce2ead437559',
 '5332fa12e4b03c9a25efd1e7',
 '601ac142be37ce2ead437559',
 '601ac142be37ce2ead437559',
 '559c2234e4b06aca36af13c6',
 '5a734034e4b0d58f376be874',
 '59ba6f1ce4b092b29c167346',
 '5f4bf556be37ce0b44915549',
 '5332f5f2e4b03c9a25efd0aa',
 '559c2234e4b06aca36af13c6',
 '5d5d4fd16d5f3b23d1bc7905',
 '5332f5fbe4b03c9a25efd0ba',
 '5332f709e4b03c9a25efd0f1',
 '5d9b4f591dda2c6225a284aa',
 '5f358338be37ce443bf9d557',
 '5fb28549be37ce522e165cb4',
 '5332f5f6e4b03c9a25efd0b4',
 '55b62995e4b0d8e685c14213',
 '5d9b4f591dda2c6225a284aa',
 '559c2234e4b06aca36af13c6',
 '53e10d6368abd3c7065097cc',
 '5332f5ebe4b03c9a25efd0a8',
 '5e9f12f5be37ce3e45b6a77e',
 '5332f5f6e4b03c9a25efd0b4',
 '5d5d4fd16d5f3b23d1bc7905',
 '5f493e72be37ce64d0ae36c2',
 '5f4936dcbe37ce52f8314fd8',
 '559c2234e4b06aca36af13c6',
 '5fd2a0aebe37ce49eb72c0ed',
 '53e10d6368abd3c7065097cc',
 '5f494c5d04db711dd8fe87e2',
 '5332f5f3e4b0

In [20]:
dataframe

Unnamed: 0,_id,_id_clean,barcode,category,categoryCode,cpg,cpg_clean,name,topBrand,brandCode
0,{'$oid': '601ac115be37ce2ead437551'},601ac115be37ce2ead437551,511111019862,Baking,BAKING,"{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}",601ac114be37ce2ead437550,test brand @1612366101024,0.0,
1,{'$oid': '601c5460be37ce2ead43755f'},601c5460be37ce2ead43755f,511111519928,Beverages,BEVERAGES,"{'$id': {'$oid': '5332f5fbe4b03c9a25efd0ba'}, '$ref': 'Cogs'}",5332f5fbe4b03c9a25efd0ba,Starbucks,0.0,STARBUCKS
2,{'$oid': '601ac142be37ce2ead43755d'},601ac142be37ce2ead43755d,511111819905,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",601ac142be37ce2ead437559,test brand @1612366146176,0.0,TEST BRANDCODE @1612366146176
3,{'$oid': '601ac142be37ce2ead43755a'},601ac142be37ce2ead43755a,511111519874,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",601ac142be37ce2ead437559,test brand @1612366146051,0.0,TEST BRANDCODE @1612366146051
4,{'$oid': '601ac142be37ce2ead43755e'},601ac142be37ce2ead43755e,511111319917,Candy & Sweets,CANDY_AND_SWEETS,"{'$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}, '$ref': 'Cogs'}",5332fa12e4b03c9a25efd1e7,test brand @1612366146827,0.0,TEST BRANDCODE @1612366146827
...,...,...,...,...,...,...,...,...,...,...
1162,{'$oid': '5f77274dbe37ce6b592e90c0'},5f77274dbe37ce6b592e90c0,511111116752,Baking,BAKING,"{'$ref': 'Cogs', '$id': {'$oid': '5f77274dbe37ce6b592e90bf'}}",5f77274dbe37ce6b592e90bf,test brand @1601644365844,,
1163,{'$oid': '5dc1fca91dda2c0ad7da64ae'},5dc1fca91dda2c0ad7da64ae,511111706328,Breakfast & Cereal,,"{'$ref': 'Cogs', '$id': {'$oid': '53e10d6368abd3c7065097cc'}}",53e10d6368abd3c7065097cc,Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
1164,{'$oid': '5f494c6e04db711dd8fe87e7'},5f494c6e04db711dd8fe87e7,511111416173,Candy & Sweets,CANDY_AND_SWEETS,"{'$ref': 'Cogs', '$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}}",5332fa12e4b03c9a25efd1e7,test brand @1598639215217,,TEST BRANDCODE @1598639215217
1165,{'$oid': '5a021611e4b00efe02b02a57'},5a021611e4b00efe02b02a57,511111400608,Grocery,,"{'$ref': 'Cogs', '$id': {'$oid': '5332f5f6e4b03c9a25efd0b4'}}",5332f5f6e4b03c9a25efd0b4,LIPTON TEA Leaves,0.0,LIPTON TEA Leaves


In [21]:
# reset df_brands to the initial load of raw data - if you executed the code above,
# run the following line to avoid any exceptions with value_from_dict()
df_brands = pd.read_json(in_brands,lines=True,compression='gzip')

**End** You can continue executing the following cells

### Writing a function that extracts values from a dataframe column containing dictionaries, then adds those values back to the dataframe as a new column

In [22]:
# I realized I'd be doing this multiple times, better to make a function
# define a function that cleans a df column that contains a dictionary by returning the specified 
# values and adding them as new column in the dataframe
def value_from_dict(dataframe, column_to_clean, dict_key, allow_nulls = False):
    """Returns dataframe with a 'cleaned' column inserted after the column that was cleaned.
    
    :param dataframe: A dataframe with a column containing dictionaries, from which one value is
        to be extracted
    :type dataframe: Pandas DataFrame
    :param column_to_clean: The name of the column containing dictionaries
    :type column_to_clean: str
    :param dict_key: A str containing the key associated with the value we want to extract,
        e.g, "['$id']['$oid']"
    :type dict_key: str
    :param allow_nulls: A boolean value indicating if None/Null/NaN/NaT values should be allowed,
        defaults to False
    :type allow_nulls: bool
    
    :raises AssertionError: 'there is at least one None value in the cleaned data' if param allow_nulls = False
    :raises AssertionError: 'the length of the original column and the cleaned column are not the same'
    :excepts ValueError: If a column has already been cleaned, prints a message confirming
        no values were added to the dataframe
    
    :rtype: Pandas DataFrame
    :return: the original DataFrame with an additional column containing 'cleaned' values
    """
    # setting variables:
    # extract the column we want to clean from the dataframe as a series
    dirty_series = dataframe[column_to_clean]
    # create a list to store the cleaned values in
    cleaned_list = []
    # name of the column we'll be adding to the DataFrame
    cleaned_column_name = column_to_clean + '_cleaned'
    # location to insert the cleaned column, after the 'dirty' column
    insert_at = dataframe.columns.get_loc(column_to_clean) + 1
    # translate dict_key str into a format useable in the following for loop
    value_dict_key = "value" + dict_key
    
    # iterate through dirty_series appending extracted values to cleaned_list
    for index, value in dirty_series.items():
        # if there is no dictionary or any other issue, append None
        try:
            cleaned_list.append(eval(value_dict_key))
        except Exception:
            cleaned_list.append(None)

    # handle allow_nulls param flag
    if not allow_nulls:
        # confirm no nulls in cleaned_list
        assert None not in cleaned_list, "there is at least one None value in the cleaned data"
    
    # confirm cleaned_list is the same length as dirty_series
    assert len(cleaned_list) == len(dirty_series), "the length of the original column and the cleaned column are not the same"
    
    # add the cleaned_list data to the original dataframe following column_to_clean
    try:
        dataframe.insert(insert_at, cleaned_column_name, cleaned_list)
    except ValueError as error:
        print(f"{str(error)}, {cleaned_column_name} was not added to the dataframe")

    # return the modified dataframe
    return dataframe
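
value_from_dict() builds a Python expression from the `dict_key` string and evaluates it with `eval()`. As a possible alternative, the sketch below walks a tuple of keys instead, which avoids evaluating strings as code. The helper name `extract_nested` and the tuple-based `key_path` parameter are hypothetical and not part of the function above.

```python
def extract_nested(value, key_path):
    """Hypothetical helper: follow key_path through nested dicts, returning None on a miss."""
    current = value
    for key in key_path:
        # Stop as soon as the current level is not a dict or lacks the key (e.g. NaN cells).
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# First value of df_brands.cpg
cpg = {"$id": {"$oid": "601ac114be37ce2ead437550"}, "$ref": "Cogs"}
print(extract_nested(cpg, ("$id", "$oid")))

# Non-dict inputs such as None or NaN yield None rather than raising.
print(extract_nested(None, ("$id", "$oid")))
```

Applied to a column, this could look like `dataframe[column_to_clean].map(lambda v: extract_nested(v, ("$oid",)))`.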

### Using value_from_dict() to extract values from all the columns containing dictionaries and add them to the dataframe as new columns:

#### df_brands._id :

In [23]:
# look at the first value in df_brands._id
df_brands['_id'][0]

{'$oid': '601ac115be37ce2ead437551'}

In [24]:
#extract the value
df_brands['_id'][0]['$oid']

'601ac115be37ce2ead437551'

In [25]:
# set a variable to use for the dict_key param of value_from_dict() function
brand_id_dict_key = "['$oid']"

In [26]:
# clean df_brands._id and confirm by viewing a sample of the dataframe
value_from_dict(df_brands, "_id", brand_id_dict_key)
df_brands.sample(2)

Unnamed: 0,_id,_id_cleaned,barcode,category,categoryCode,cpg,name,topBrand,brandCode
274,{'$oid': '592486bfe410d61fcea3d13f'},592486bfe410d61fcea3d13f,511111500667,Personal Care,PERSONAL_CARE,"{'$ref': 'Cogs', '$id': {'$oid': '5332f5f6e4b03c9a25efd0b4'}}",TRESEMME,0.0,TRESEMME
169,{'$oid': '5332f7d3e4b03c9a25efd14e'},5332f7d3e4b03c9a25efd14e,511111803393,Snacks,,"{'$ref': 'Cpgs', '$id': {'$oid': '5332f5f2e4b03c9a25efd0aa'}}",Cheez-It,,


#### df_brands.cpg:

In [27]:
# look at the first value in df_brands.cpg
df_brands['cpg'][0]

{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}

In [28]:
#extract the value
df_brands['cpg'][0]['$id']['$oid']

'601ac114be37ce2ead437550'

In [29]:
# set a variable to use for the dict_key param of value_from_dict() function
brand_cpg_dict_key = "['$id']['$oid']"

In [30]:
# clean df_brands.cpg and confirm by viewing a sample of the dataframe
value_from_dict(df_brands, "cpg", brand_cpg_dict_key)
df_brands.sample(2)

Unnamed: 0,_id,_id_cleaned,barcode,category,categoryCode,cpg,cpg_cleaned,name,topBrand,brandCode
873,{'$oid': '5fff3aa0be37ce2bb7930117'},5fff3aa0be37ce2bb7930117,511111319566,Baking,BAKING,"{'$ref': 'Cogs', '$id': {'$oid': '5fff3aa0be37ce2bb7930116'}}",5fff3aa0be37ce2bb7930116,test brand @1610562208600,0.0,TEST BRANDCODE @1610562208600
0,{'$oid': '601ac115be37ce2ead437551'},601ac115be37ce2ead437551,511111019862,Baking,BAKING,"{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}",601ac114be37ce2ead437550,test brand @1612366101024,0.0,
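
The `dict_key` strings used above (for example `"['$id']['$oid']"`) describe a path into a nested dictionary. The full body of `value_from_dict()` is not repeated in this section, so the following is only a sketch of one way such a path could be followed without evaluating a string: the path is passed as a tuple of keys. The helper name `get_nested` and the tuple-based path are assumptions for illustration, not part of the notebook's function.

```python
from functools import reduce

# Sentinel marking "path could not be followed"; distinct from a stored None
_MISSING = object()


def get_nested(record, keys, default=None):
    """Follow keys into nested dicts; return default if any step is missing.

    Hypothetical helper, for illustration only.
    """
    def step(current, key):
        # Descend only while we are still looking at a dict that has the key
        if isinstance(current, dict) and key in current:
            return current[key]
        return _MISSING

    result = reduce(step, keys, record)
    return default if result is _MISSING else result


# Same shape as the first value in df_brands.cpg
cpg = {'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}
brand_cpg_id = get_nested(cpg, ('$id', '$oid'))

# A missing cell (None) yields the default instead of raising
no_value = get_nested(None, ('$date',))
print(brand_cpg_id, no_value)
```

A tuple path keeps the keys as plain data, which may be easier to validate than a string of bracket expressions.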


#### df_receipts._id:

In [31]:
# look at the first value in df_receipts._id
df_receipts['_id'][0]

{'$oid': '5ff1e1eb0a720f0523000575'}

In [32]:
#extract the value
df_receipts['_id'][0]['$oid']

'5ff1e1eb0a720f0523000575'

In [33]:
# set a variable to use for the dict_key param of value_from_dict() function
receipts_id_dict_key = "['$oid']"

In [34]:
# clean df_receipts._id and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "_id", receipts_id_dict_key)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,dateScanned,finishedDate,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
171,{'$oid': '5ff8c4fb0a7214adca000006'},5ff8c4fb0a7214adca000006,500.0,"Receipt number 2 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1610138875000},{'$date': 1610138875000},{'$date': 1610138876000},{'$date': 1610138876000},{'$date': 1610138876000},600.0,{'$date': 1609459200000},1.0,"[{'barcode': '001111132666', 'brandCode': 'BRAND', 'description': 'DOVE MEN PLUS CARE SOAP 4 CT AQUA IMPACT SP 16 OZ', 'finalPrice': '10.00', 'itemPrice': '10.00', 'partnerItemId': '0', 'pointsEarned': '100.0', 'pointsPayerId': '5332f5f6e4b03c9a25efd0b4', 'quantityPurchased': 1, 'rewardsGroup': 'DOVE MEN+CARE BODY WASH AND SOAP', 'rewardsProductPartnerId': '5332f5f6e4b03c9a25efd0b4'}]",FINISHED,10.0,5ff8c4e8b3348b11c9337a11
551,{'$oid': '601431db0a720f05f80000a8'},601431db0a720f05f80000a8,5.0,All-receipts receipt bonus,{'$date': 1611936219000},{'$date': 1611936219000},{'$date': 1611936225000},{'$date': 1611936225000},{'$date': 1611936220000},5.0,{'$date': 1611849819000},2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '1234', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '1234', 'userFlaggedDescription': '', 'userFlaggedNewItem': True}]",FINISHED,1.0,6014319173c60b3ca7f3bf01


#### df_receipts.createDate:

In [35]:
# look at the first value in df_receipts.createDate
df_receipts['createDate'][0]

{'$date': 1609687531000}

In [36]:
# extract the value
df_receipts['createDate'][0]['$date']

1609687531000

In [37]:
# set a variable to use for the dict_key param of value_from_dict() function
receipts_createDate_dict_key = "['$date']"

In [38]:
# clean df_receipts.createDate and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "createDate", receipts_createDate_dict_key)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,finishedDate,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
45,{'$oid': '5ff36d9d0a720f05230005aa'},5ff36d9d0a720f05230005aa,100.0,"Receipt number 6 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1609788829000},1609788829000,{'$date': 1609788829000},{'$date': 1609788829000},{'$date': 1609788829000},{'$date': 1609788829000},225.0,{'$date': 1609480800000},5.0,"[{'barcode': '044700002810', 'description': 'OSCAR MAYER XXL Premium Fully-Cooked Beef Franks 16 OZ 004470000281', 'finalPrice': '5', 'itemPrice': '5', 'partnerItemId': '1', 'pointsEarned': '25.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER HOT DOG - BEEF FRANKS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '044700002810', 'description': 'OSCAR MAYER XXL Premium Fully-Cooked Beef Franks 16 OZ 004470000281', 'finalPrice': '5', 'itemPrice': '5', 'partnerItemId': '2', 'pointsEarned': '25.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER HOT DOG - BEEF FRANKS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '044700002810', 'description': 'OSCAR MAYER XXL Premium Fully-Cooked Beef Franks 16 OZ 004470000281', 'finalPrice': '5', 'itemPrice': '5', 'partnerItemId': '3', 'pointsEarned': '25.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER HOT DOG - BEEF FRANKS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '044700002810', 'description': 'OSCAR MAYER XXL Premium Fully-Cooked Beef Franks 16 OZ 004470000281', 'finalPrice': '5', 'itemPrice': '5', 'partnerItemId': '4', 'pointsEarned': '25.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER HOT DOG - BEEF FRANKS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '044700002810', 'description': 'OSCAR MAYER XXL Premium 
Fully-Cooked Beef Franks 16 OZ 004470000281', 'finalPrice': '5', 'itemPrice': '5', 'partnerItemId': '5', 'pointsEarned': '25.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER HOT DOG - BEEF FRANKS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}]",FINISHED,25.0,5ff36d0362fde912123a5535
55,{'$oid': '5ff371240a7214ada10005b3'},5ff371240a7214ada10005b3,750.0,"Receipt number 1 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1609789732000},1609789732000,{'$date': 1609789732000},{'$date': 1609789732000},{'$date': 1609789732000},{'$date': 1609789732000},755.0,{'$date': 1609703332000},1.0,"[{'barcode': '044700073377', 'description': 'OSCAR MAYER Jumbo Angus Beef Uncured Franks, 15.0 OZ', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'pointsEarned': '5.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER HOT DOG - BEEF FRANKS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}]",FINISHED,1.0,5ff37124135e7011bcb86bc3


#### df_receipts.dateScanned:
same format as df_receipts.createDate

In [39]:
# set a variable to use for the dict_key param of value_from_dict() function
receipts_dateScanned_dict_key = "['$date']"

In [40]:
# clean df_receipts.dateScanned and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "dateScanned", receipts_dateScanned_dict_key)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,dateScanned_cleaned,finishedDate,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
565,{'$oid': '601465f40a720f05f8000137'},601465f40a720f05f8000137,,,{'$date': 1611949556000},1611949556000,{'$date': 1611949556000},1611949556000,{'$date': 1611949557000},{'$date': 1611949557000},{'$date': 1611949557000},100.0,{'$date': 1611273600000},1.0,"[{'barcode': '007940018942', 'brandCode': 'BRAND', 'description': 'SUAVE KIDS PURELY AWESOME TEARLESS SHAMPOO AND CONDITIONER IN ONE LIQUID PLASTIC BOTTLE RP 12 OZ', 'finalPrice': '10.00', 'itemPrice': '10.00', 'partnerItemId': '0', 'pointsEarned': '100.0', 'pointsPayerId': '5332f5f6e4b03c9a25efd0b4', 'quantityPurchased': 1, 'rewardsGroup': 'SUAVE KIDS HAIR CARE', 'rewardsProductPartnerId': '5332f5f6e4b03c9a25efd0b4'}]",FINISHED,10.0,6014658e67804a1228b20ef4
1051,{'$oid': '603aa55a0a7217c72c000233'},603aa55a0a7217c72c000233,,,{'$date': 1614456154525},1614456154525,{'$date': 1614456154525},1614456154525,,{'$date': 1614456154525},,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33


#### df_receipts.finishedDate:
same format as df_receipts.createDate - can include None/Null/NaN

In [41]:
# set a variable to use for the dict_key param of value_from_dict() function
receipts_finishedDate_dict_key = "['$date']"

In [42]:
# clean df_receipts.finishedDate and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "finishedDate", receipts_finishedDate_dict_key)
df_receipts.sample(2)

AssertionError: there is at least one None value in the cleaned data

In [43]:
# clean df_receipts.finishedDate and confirm by viewing a sample of the dataframe
# setting allow_nulls = True
value_from_dict(df_receipts, "finishedDate", receipts_finishedDate_dict_key, allow_nulls=True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,dateScanned_cleaned,finishedDate,finishedDate_cleaned,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
590,{'$oid': '6014e99b0a7214ad50000137'},6014e99b0a7214ad50000137,,,{'$date': 1611983259747},1611983259747,{'$date': 1611983259747},1611983259747,,,{'$date': 1611983259747},,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
1003,{'$oid': '603877bb0a720fde10000074'},603877bb0a720fde10000074,,,{'$date': 1614313403497},1614313403497,{'$date': 1614313403497},1614313403497,,,{'$date': 1614313403497},,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
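
The AssertionError above signals that some receipts have no `finishedDate`. One way to see how many values are missing before choosing `allow_nulls=True` is to count them per column. The sketch below uses a small made-up dataframe (`df_demo`) standing in for `df_receipts`; the helper name `null_counts` is hypothetical.

```python
import pandas as pd

# Made-up stand-in for df_receipts with one missing finishedDate
df_demo = pd.DataFrame({
    "createDate": [{"$date": 1609687531000}, {"$date": 1611983259747}],
    "finishedDate": [{"$date": 1609687531000}, None],
})


def null_counts(dataframe, columns):
    """Return the number of missing values in each of the given columns."""
    return dataframe[columns].isna().sum()


counts = null_counts(df_demo, ["createDate", "finishedDate"])
print(counts)
```

A non-zero count for a column indicates that the strict (default) call would raise for it.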


#### df_receipts.modifyDate:
same format as df_receipts.createDate

In [44]:
# set a variable to use for the dict_key param of value_from_dict() function
receipts_modifyDate_dict_key = "['$date']"

In [45]:
# clean df_receipts.modifyDate and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "modifyDate", receipts_modifyDate_dict_key)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,dateScanned_cleaned,finishedDate,finishedDate_cleaned,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
476,{'$oid': '600fb21a0a720f053500004f'},600fb21a0a720f053500004f,5.0,All-receipts receipt bonus,{'$date': 1611641370000},1611641370000,{'$date': 1611641370000},1611641370000,{'$date': 1611641370000},1611641000000.0,{'$date': 1611641370000},1611641370000,{'$date': 1611641370000},5.0,{'$date': 1611381600000},5.0,"[{'barcode': '021000049318', 'description': 'KRAFT Classic Ranch Dressing, 24 fl oz', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '1', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '021000049318', 'description': 'KRAFT Classic Ranch Dressing, 24 fl oz', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '2', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '021000049318', 'description': 'KRAFT Classic Ranch Dressing, 24 fl oz', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '3', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '021000049318', 'description': 'KRAFT Classic Ranch Dressing, 24 fl oz', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '4', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '021000049318', 'description': 'KRAFT Classic Ranch Dressing, 24 fl oz', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '5', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}]",FINISHED,49.95,600fb1ac73c60b12049027bb
453,{'$oid': '600f012c0a720f0535000025'},600f012c0a720f0535000025,,,{'$date': 1611596076000},1611596076000,{'$date': 1611596076000},1611596076000,{'$date': 1611596077000},1611596000000.0,{'$date': 1611596077000},1611596077000,{'$date': 1611596077000},125.0,{'$date': 1611122400000},5.0,"[{'barcode': '041129412152', 'description': 'CLASSICO All Natural Peeled Ground Tomatoes 28 OZ Can', 'finalPrice': '5', 'itemPrice': '5', 'partnerItemId': '1', 'pointsEarned': '25.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'CLASSICO SAUCE', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '041129412152', 'description': 'CLASSICO All Natural Peeled Ground Tomatoes 28 OZ Can', 'finalPrice': '5', 'itemPrice': '5', 'partnerItemId': '2', 'pointsEarned': '25.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'CLASSICO SAUCE', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '041129412152', 'description': 'CLASSICO All Natural Peeled Ground Tomatoes 28 OZ Can', 'finalPrice': '5', 'itemPrice': '5', 'partnerItemId': '3', 'pointsEarned': '25.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'CLASSICO SAUCE', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '041129412152', 'description': 'CLASSICO All Natural Peeled Ground Tomatoes 28 OZ Can', 'finalPrice': '5', 'itemPrice': '5', 'partnerItemId': '4', 'pointsEarned': '25.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'CLASSICO SAUCE', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '041129412152', 'description': 'CLASSICO All Natural Peeled Ground Tomatoes 28 OZ Can', 'finalPrice': '5', 'itemPrice': '5', 'partnerItemId': '5', 'pointsEarned': '25.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 
'rewardsGroup': 'CLASSICO SAUCE', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}]",FINISHED,25.0,600f008f4329897eac237bd8


#### df_receipts.pointsAwardedDate:
same format as df_receipts.createDate - can include None/Null/NaN

In [46]:
# set a variable to use for the dict_key param of value_from_dict() function
receipts_pointsAwardedDate_dict_key = "['$date']"

In [47]:
# clean df_receipts.pointsAwardedDate and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "pointsAwardedDate", receipts_pointsAwardedDate_dict_key)
df_receipts.sample(2)

AssertionError: there is at least one None value in the cleaned data

In [48]:
# clean df_receipts.pointsAwardedDate and confirm by viewing a sample of the dataframe, allowing nulls
value_from_dict(df_receipts, "pointsAwardedDate", receipts_pointsAwardedDate_dict_key, allow_nulls=True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,dateScanned_cleaned,finishedDate,finishedDate_cleaned,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
60,{'$oid': '5ff4ce430a7214ada10005d8'},5ff4ce430a7214ada10005d8,5.0,All-receipts receipt bonus,{'$date': 1609879107000},1609879107000,{'$date': 1609879107000},1609879107000,{'$date': 1609879108000},1609879000000.0,{'$date': 1609879113000},1609879113000,{'$date': 1609879108000},1609879000000.0,5.0,{'$date': 1609804800000},4.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '25.00', 'itemPrice': '25.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 4, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '25.00', 'userFlaggedQuantity': 4}]",FINISHED,25.0,5ff4ce3dc3d63511e2a484dc
450,{'$oid': '600ed4d90a7214ada200001c'},600ed4d90a7214ada200001c,,,{'$date': 1611584729000},1611584729000,{'$date': 1611584729000},1611584729000,{'$date': 1611584729000},1611585000000.0,{'$date': 1611584729000},1611584729000,{'$date': 1611584729000},1611585000000.0,250.0,{'$date': 1610323200000},5.0,"[{'barcode': '029000020351', 'description': 'P3 Peanuts Ham Jerky Sunflower Kernels Portable Protein Pack - 5.4oz - 3ct', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '1', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER P3', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '029000020351', 'description': 'P3 Peanuts Ham Jerky Sunflower Kernels Portable Protein Pack - 5.4oz - 3ct', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '2', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER P3', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '029000020351', 'description': 'P3 Peanuts Ham Jerky Sunflower Kernels Portable Protein Pack - 5.4oz - 3ct', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '3', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER P3', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '029000020351', 'description': 'P3 Peanuts Ham Jerky Sunflower Kernels Portable Protein Pack - 5.4oz - 3ct', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '4', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER P3', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '029000020351', 'description': 'P3 Peanuts Ham Jerky Sunflower Kernels Portable Protein Pack - 5.4oz - 3ct', 
'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '5', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER P3', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}]",FINISHED,49.95,600ed42e43298911ce45d1fa


#### df_receipts.purchaseDate:
same format as df_receipts.createDate - can include None/Null/NaN

In [49]:
# set a variable to use for the dict_key param of value_from_dict() function
receipts_purchaseDate_dict_key = "['$date']"

In [50]:
# clean df_receipts.purchaseDate and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "purchaseDate", receipts_purchaseDate_dict_key)
df_receipts.sample(2)

AssertionError: there is at least one None value in the cleaned data

In [51]:
# clean df_receipts.purchaseDate and confirm by viewing a sample of the dataframe allowing nulls
value_from_dict(df_receipts, "purchaseDate", receipts_purchaseDate_dict_key, allow_nulls=True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,dateScanned_cleaned,finishedDate,finishedDate_cleaned,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
673,{'$oid': '6018a95f0a7214ad28000056'},6018a95f0a7214ad28000056,,,{'$date': 1612228959277},1612228959277,{'$date': 1612228959277},1612228959277,,,{'$date': 1612228959277},1612228959277,,,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
826,{'$oid': '60204ae80a7214ad25000111'},60204ae80a7214ad25000111,,,{'$date': 1612729064002},1612729064002,{'$date': 1612729064002},1612729064002,,,{'$date': 1612729064002},1612729064002,,,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
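
The date columns in `df_receipts` all share the `{'$date': <epoch ms>}` shape, so the per-column cells above could in principle be collapsed into a loop. The sketch below is an assumption about how that might look; it uses a made-up dataframe (`df_demo`) and a plain `Series.map` with a hypothetical `date_value` helper in place of `value_from_dict()`, whose full body is not shown in this section.

```python
import pandas as pd

# Made-up stand-in for df_receipts with two of its date columns
df_demo = pd.DataFrame({
    "createDate": [{"$date": 1612228959277}, {"$date": 1609735138000}],
    "purchaseDate": [None, {"$date": 1609632000000}],
})


def date_value(cell):
    """Return the epoch value from a {'$date': ...} dict, or None for a missing cell."""
    return cell["$date"] if isinstance(cell, dict) else None


# Apply the same extraction to every date column
for column in ["createDate", "purchaseDate"]:
    df_demo[column + "_cleaned"] = df_demo[column].map(date_value)

print(df_demo[["createDate_cleaned", "purchaseDate_cleaned"]])
```

Unlike the cells above, this loop does not insert each new column directly after its source column; it appends them at the end.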


#### df_users._id:

In [52]:
# look at the first value in df_users._id
df_users['_id'][0]

{'$oid': '5ff1e194b6a9d73a3a9f1052'}

In [53]:
#extract the value
df_users['_id'][0]['$oid']

'5ff1e194b6a9d73a3a9f1052'

In [54]:
# set a variable to use for the dict_key param of value_from_dict() function
users_id_dict_key = "['$oid']"

In [55]:
# clean df_users._id and confirm by viewing a sample of the dataframe
value_from_dict(df_users, "_id", users_id_dict_key)
df_users.sample(2)

Unnamed: 0,_id,_id_cleaned,active,createdDate,lastLogin,role,signUpSource,state
240,{'$oid': '6008622ebe5fc9247bab4eb9'},6008622ebe5fc9247bab4eb9,False,{'$date': 1611162158662},{'$date': 1611162158931},consumer,Email,WI
149,{'$oid': '5fff26bab3348b03eb45bb22'},5fff26bab3348b03eb45bb22,True,{'$date': 1610557114027},{'$date': 1610557114075},consumer,Email,WI


#### df_users.createdDate:  
same format as df_receipts.createDate

In [56]:
# set a variable to use for the dict_key param of value_from_dict() function
users_createdDate_dict_key = "['$date']"

In [57]:
# clean df_users.createdDate and confirm by viewing a sample of the dataframe
value_from_dict(df_users, "createdDate", users_createdDate_dict_key)
df_users.sample(2)

Unnamed: 0,_id,_id_cleaned,active,createdDate,createdDate_cleaned,lastLogin,role,signUpSource,state
294,{'$oid': '600fb1ac73c60b12049027bb'},600fb1ac73c60b12049027bb,True,{'$date': 1611641260879},1611641260879,{'$date': 1611641483950},consumer,Email,WI
203,{'$oid': '6004a965e257124ec6b9a39f'},6004a965e257124ec6b9a39f,True,{'$date': 1610918245327},1610918245327,,consumer,Email,WI


#### df_users.lastLogin:  
same format as df_receipts.createDate - can include None/Null/NaN

In [58]:
# set a variable to use for the dict_key param of value_from_dict() function
users_lastLogin_dict_key = "['$date']"

In [59]:
# clean df_users.lastLogin and confirm by viewing a sample of the dataframe
value_from_dict(df_users, "lastLogin", users_lastLogin_dict_key)
df_users.sample(2)

AssertionError: there is at least one None value in the cleaned data

In [60]:
# clean df_users.lastLogin and confirm by viewing a sample of the dataframe, allowing nulls
value_from_dict(df_users, "lastLogin", users_lastLogin_dict_key, allow_nulls=True)
df_users.sample(2)

Unnamed: 0,_id,_id_cleaned,active,createdDate,createdDate_cleaned,lastLogin,lastLogin_cleaned,role,signUpSource,state
417,{'$oid': '60255883efa60114d20e5d4e'},60255883efa60114d20e5d4e,True,{'$date': 1613060227502},1613060227502,{'$date': 1613060309219},1613060000000.0,consumer,Email,WI
388,{'$oid': '55308179e4b0eabd8f99caa2'},55308179e4b0eabd8f99caa2,True,{'$date': 1429242233186},1429242233186,{'$date': 1525713820003},1525714000000.0,consumer,,WI


### Writing a function to convert date data from epoch to timestamps

In [61]:
def epoch_to_timestamp(dataframe, column_to_convert, allow_nulls = False):
    """Returns dataframe with a new column containing timestamps converted from epoch.
    
    :param dataframe: A dataframe with a column containing epoch milliseconds as ints or floats
    :type dataframe: Pandas DataFrame
    :param column_to_convert: The name of the column containing epoch milliseconds
    :type column_to_convert: str
    :param allow_nulls: A boolean value indicating if None(Null) values should be allowed,
        defaults to False
    :type allow_nulls: bool
    
    :raises AssertionError: 'there is at least one None/Null/NaN/NaT value in the converted timestamp data' 
        if param allow_nulls = False
    :raises AssertionError: 'the length of the original column and the converted column are not the same'
    :excepts ValueError: If a column has already been converted, print message confirming
        no values were added to the dataframe
    
    :rtype: Pandas DataFrame
    :return: the original DataFrame with an additional column containing converted epoch values as timestamps
    """
    #setting variables
    # name of the new column we'll be adding to the dataframe
    converted_column_name = column_to_convert + "_ts"
    # location to insert the converted column, after the column_to_convert
    insert_at = dataframe.columns.get_loc(column_to_convert) + 1
    # create a series of timestamps from the epoch time column_to_convert
    # pd.to_datetime() converts a scalar, array-like, Series or DataFrame/dict-like to a pandas datetime object
    # the data in the epoch columns is milliseconds from epoch start and we round to 1ms for consistency
    # .dt.round() is the datetime accessor for rounding a Series of timestamps to a frequency
    time_stamps = pd.to_datetime(dataframe[column_to_convert], unit='ms').dt.round('1ms')
    
    # handle allow_nulls param flag
    if not allow_nulls:
        # confirm there are no nulls in time_stamps
        assert not time_stamps.isnull().values.any(), "there is at least one None/Null/NaN/NaT value in the converted timestamp data"
    
    # confirm time_stamps is the same length as column_to_convert
    assert len(time_stamps) == len(dataframe[column_to_convert]), "the length of the original column and the converted column are not the same"
    
    # add the timestamps data to the original dataframe following column_to_convert
    try:
        dataframe.insert(insert_at, converted_column_name, time_stamps)
    except ValueError as error:
        print(f"{str(error)}, {converted_column_name} was not added to the dataframe")
    
    # return the modified dataframe
    return dataframe
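
As a quick illustration of the conversion step inside `epoch_to_timestamp()`, the sketch below applies `pd.to_datetime(..., unit='ms')` followed by rounding to 1 ms (via the `.dt` accessor here) to a small hand-made Series. The Series `epoch_ms` is invented for the example; its first value is the `createDate` seen earlier in `df_receipts`, and the second is a missing value.

```python
import pandas as pd

# One real-looking epoch value in milliseconds and one missing value
epoch_ms = pd.Series([1609687531000, None])

# Convert milliseconds since 1970-01-01 to timestamps, rounded to 1 ms
time_stamps = pd.to_datetime(epoch_ms, unit="ms").dt.round("1ms")
print(time_stamps)

# The same null check the function uses when allow_nulls is False
has_nulls = bool(time_stamps.isnull().values.any())
print(has_nulls)
```

The missing value becomes `NaT`, which is what the function's null check reacts to.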
    

### Using epoch_to_timestamp() to convert columns with epoch values to timestamps and add them to the dataframe as a new column:

In [62]:
# convert df_users.createdDate_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_users, 'createdDate_cleaned')
df_users.sample(2)

Unnamed: 0,_id,_id_cleaned,active,createdDate,createdDate_cleaned,createdDate_cleaned_ts,lastLogin,lastLogin_cleaned,role,signUpSource,state
195,{'$oid': '60023de5fb296c121a81b955'},60023de5fb296c121a81b955,True,{'$date': 1610759653653},1610759653653,2021-01-16 01:14:13.653,,,consumer,Email,CO
254,{'$oid': '6009e60450b3311194385009'},6009e60450b3311194385009,True,{'$date': 1611261445244},1611261445244,2021-01-21 20:37:25.244,,,consumer,Email,WI


In [63]:
# convert df_users.lastLogin_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_users, 'lastLogin_cleaned')
df_users.sample(2)

AssertionError: there is at least one None/Null/NaN/NaT value in the converted timestamp data

In [64]:
# convert df_users.lastLogin_cleaned to timestamps and confirm by viewing a sample of the dataframe, allowing for Nulls
epoch_to_timestamp(df_users, 'lastLogin_cleaned', allow_nulls=True)
df_users.sample(2)

Unnamed: 0,_id,_id_cleaned,active,createdDate,createdDate_cleaned,createdDate_cleaned_ts,lastLogin,lastLogin_cleaned,lastLogin_cleaned_ts,role,signUpSource,state
195,{'$oid': '60023de5fb296c121a81b955'},60023de5fb296c121a81b955,True,{'$date': 1610759653653},1610759653653,2021-01-16 01:14:13.653,,,NaT,consumer,Email,CO
77,{'$oid': '5ff5d15aeb7c7d12096d91a2'},5ff5d15aeb7c7d12096d91a2,True,{'$date': 1609945434680},1609945434680,2021-01-06 15:03:54.680,{'$date': 1609945690009},1609946000000.0,2021-01-06 15:08:10.009,consumer,Email,WI
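
In the output above, `lastLogin_cleaned` is shown as `1609946000000.0` while the converted timestamp still ends in `:10.009`, which suggests the underlying value is intact and only its display is rounded. A likely explanation is that columns containing nulls are stored as float64 (with NaN for missing values), whereas null-free columns stay integer; that the notebook's columns are actually float64 is an assumption. The sketch below, on a made-up Series, shows that float64 holds a millisecond epoch value exactly and that pandas' nullable `Int64` dtype is one option if integer display is preferred.

```python
import pandas as pd

# An integer epoch value next to a missing value is stored as float64
with_null = pd.Series([1609945690009, None])
print(with_null.dtype)

# The stored float still carries every millisecond digit
exact = int(with_null[0])
print(exact)

# Nullable integer dtype keeps integers and marks the gap as <NA>
as_nullable_int = with_null.astype("Int64")
print(as_nullable_int)
```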


In [65]:
# convert df_receipts.createDate_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'createDate_cleaned')
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,finishedDate,finishedDate_cleaned,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
992,{'$oid': '6025f0fe0a720f05a800028f'},6025f0fe0a720f05a800028f,,,{'$date': 1613099262814},1613099262814,2021-02-12 03:07:42.814,{'$date': 1613099262814},1613099262814,,,{'$date': 1613099262814},1613099262814,,,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
47,{'$oid': '5ff29be20a7214ada1000571'},5ff29be20a7214ada1000571,25.0,COMPLETE_NONPARTNER_RECEIPT,{'$date': 1609735138000},1609735138000,2021-01-04 04:38:58.000,{'$date': 1609735138000},1609735138000,{'$date': 1609735146000},1609735000000.0,{'$date': 1609735149000},1609735149000,{'$date': 1609735146000},1609735000000.0,25.0,{'$date': 1609632000000},1609632000000.0,1.0,"[{'barcode': '044000000745', 'brandCode': 'KRAFT EASY CHEESE', 'competitiveProduct': True, 'competitorRewardsGroup': 'SARGENTO RICOTTA CHEESE', 'description': '-Cheddar', 'discountedItemPrice': '1.00', 'finalPrice': '1.00', 'itemNumber': '044000000745', 'itemPrice': '1.00', 'partnerItemId': '1030', 'quantityPurchased': 1, 'rewardsGroup': 'SARGENTO RICOTTA CHEESE', 'rewardsProductPartnerId': '5e7cf838f221c312e698a628'}]",FINISHED,1.0,5964eb07e4b03efd0c0f267b


In [66]:
# convert df_receipts.dateScanned_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'dateScanned_cleaned')
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
115,{'$oid': '5ff5d2030a720f05230005df'},5ff5d2030a720f05230005df,25.0,COMPLETE_NONPARTNER_RECEIPT,{'$date': 1609945603000},1609945603000,2021-01-06 15:06:43,{'$date': 1609945603000},1609945603000,2021-01-06 15:06:43,{'$date': 1609945603000},1609946000000.0,{'$date': 1609945603000},1609945603000,{'$date': 1609945603000},1609946000000.0,25.0,{'$date': 1609945603000},1609946000000.0,1.0,"[{'barcode': '021000001361', 'description': 'PHILADELPHIA Snack BAR Strawberry CheeseC 1 CT 002100000136', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '1', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}]",FINISHED,9.99,5ff5d15aeb7c7d12096d91a2
365,{'$oid': '60083ab50a7214ad89000049'},60083ab50a7214ad89000049,5.0,All-receipts receipt bonus,{'$date': 1611152053000},1611152053000,2021-01-20 14:14:13,{'$date': 1611152053000},1611152053000,2021-01-20 14:14:13,{'$date': 1611152053000},1611152000000.0,{'$date': 1611152053000},1611152053000,{'$date': 1611152053000},1611152000000.0,5.0,{'$date': 1611152053000},1611152000000.0,1.0,"[{'barcode': '021000029778', 'competitiveProduct': True, 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '1', 'quantityPurchased': 1, 'rewardsGroup': 'SARGENTO STRING OR STICK CHEESE', 'rewardsProductPartnerId': '5e7cf838f221c312e698a628'}]",FINISHED,9.99,60083a1e325c8a17946255de
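
The timestamps produced here carry no time zone. Epoch values count from 1970-01-01 00:00:00 UTC, so the naive timestamps can be read as UTC times. If time-zone-aware values were wanted, `pd.to_datetime` accepts `utc=True`. The sketch below uses the `dateScanned` value from row 115 above; making the value time-zone aware is an illustrative option, not something the notebook does.

```python
import pandas as pd

scanned_ms = 1609945603000  # dateScanned epoch milliseconds from row 115

# Naive timestamp, as produced by the notebook's conversion
naive = pd.to_datetime(scanned_ms, unit="ms")

# Time-zone-aware timestamp pinned to UTC
aware = pd.to_datetime(scanned_ms, unit="ms", utc=True)

print(naive)
print(aware)
```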


In [67]:
# convert df_receipts.finishedDate_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'finishedDate_cleaned')
df_receipts.sample(2)

AssertionError: there is at least one None/Null/NaN/NaT value in the converted timestamp data

In [68]:
# convert df_receipts.finishedDate_cleaned to timestamps and confirm by viewing a sample of the dataframe, allowing nulls
epoch_to_timestamp(df_receipts, 'finishedDate_cleaned', allow_nulls = True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
870,{'$oid': '602145ae0a720f0573000007'},602145ae0a720f0573000007,,,{'$date': 1612793262418},1612793262418,2021-02-08 14:07:42.418,{'$date': 1612793262418},1612793262418,2021-02-08 14:07:42.418,,,NaT,{'$date': 1612793262418},1612793262418,,,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
294,{'$oid': '6000d4bc0a7214ad4c000070'},6000d4bc0a7214ad4c000070,,,{'$date': 1610667196000},1610667196000,2021-01-14 23:33:16.000,{'$date': 1610667196000},1610667196000,2021-01-14 23:33:16.000,,,NaT,{'$date': 1610667197000},1610667197000,,,8700.0,{'$date': 1613345597000},1613346000000.0,10.0,"[{'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '1', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '2', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '3', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '4', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '5', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': 
'5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '6', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '7', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '8', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '9', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}, {'barcode': '034100573065', 'description': 'MILLER LITE 24 PACK 12OZ CAN', 'finalPrice': '29', 'itemPrice': '29', 'partnerItemId': '10', 'pointsEarned': '870.0', 'pointsPayerId': '5332f709e4b03c9a25efd0f1', 'quantityPurchased': 1, 'rewardsGroup': 'MILLER LITE 24 PACK', 'rewardsProductPartnerId': '5332f709e4b03c9a25efd0f1', 'targetPrice': '77'}]",FLAGGED,290.0,6000d46cfb296c121a81b20c


In [69]:
# convert df_receipts.modifyDate_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'modifyDate_cleaned')
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,modifyDate_cleaned_ts,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
1017,{'$oid': '60391f210a7217c72c0000de'},60391f210a7217c72c0000de,,,{'$date': 1614356257288},1614356257288,2021-02-26 16:17:37.288,{'$date': 1614356257288},1614356257288,2021-02-26 16:17:37.288,,,NaT,{'$date': 1614356257288},1614356257288,2021-02-26 16:17:37.288,,,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
1022,{'$oid': '6038f2950a7217c72c0000ca'},6038f2950a7217c72c0000ca,,,{'$date': 1614344853064},1614344853064,2021-02-26 13:07:33.064,{'$date': 1614344853064},1614344853064,2021-02-26 13:07:33.064,,,NaT,{'$date': 1614344853064},1614344853064,2021-02-26 13:07:33.064,,,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33


In [70]:
# convert df_receipts.pointsAwardedDate_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'pointsAwardedDate_cleaned')
df_receipts.sample(2)

AssertionError: there is at least one None/Null/NaN/NaT value in the converted timestamp data

In [71]:
# convert df_receipts.pointsAwardedDate_cleaned to timestamps and confirm by viewing a sample of the dataframe allowing nulls
epoch_to_timestamp(df_receipts, 'pointsAwardedDate_cleaned', allow_nulls=True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,modifyDate_cleaned_ts,pointsAwardedDate,pointsAwardedDate_cleaned,pointsAwardedDate_cleaned_ts,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
336,{'$oid': '600746ce0a720f05fa000017'},600746ce0a720f05fa000017,45.0,COMPLETE_PARTNER_RECEIPT,{'$date': 1611089614000},1611089614000,2021-01-19 20:53:34.000,{'$date': 1611089614000},1611089614000,2021-01-19 20:53:34.000,{'$date': 1611089615000},1611090000000.0,2021-01-19 20:53:35,{'$date': 1611089615000},1611089615000,2021-01-19 20:53:35.000,{'$date': 1611089615000},1611090000000.0,2021-01-19 20:53:35,50.0,{'$date': 1611003214000},1611003000000.0,1.0,"[{'barcode': '044700070840', 'description': 'OSCAR MAYER LUNCHABLES LUNCHABLES Chicken Sliders 3.00-oz', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'pointsEarned': '5.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'LUNCHABLES LUNCH COMBINATIONS - FUN PACK', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}]",FINISHED,1.0,6007464b6e64691717e8c1f0
888,{'$oid': '6021efe50a7214d8e9000081'},6021efe50a7214d8e9000081,,,{'$date': 1612836837436},1612836837436,2021-02-09 02:13:57.436,{'$date': 1612836837436},1612836837436,2021-02-09 02:13:57.436,,,NaT,{'$date': 1612836837436},1612836837436,2021-02-09 02:13:57.436,,,NaT,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33


In [72]:
# convert df_receipts.purchaseDate_cleaned to timestamps and confirm by viewing a sample of the dataframe 
epoch_to_timestamp(df_receipts, 'purchaseDate_cleaned')
df_receipts.sample(2)

AssertionError: there is at least one None/Null/NaN/NaT value in the converted timestamp data

In [73]:
# convert df_receipts.purchaseDate_cleaned to timestamps and confirm by viewing a sample of the dataframe, allowing nulls
epoch_to_timestamp(df_receipts, 'purchaseDate_cleaned', allow_nulls = True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,modifyDate_cleaned_ts,pointsAwardedDate,pointsAwardedDate_cleaned,pointsAwardedDate_cleaned_ts,pointsEarned,purchaseDate,purchaseDate_cleaned,purchaseDate_cleaned_ts,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
799,{'$oid': '601ea1830a720f053c000079'},601ea1830a720f053c000079,,,{'$date': 1612620163464},1612620163464,2021-02-06 14:02:43.464,{'$date': 1612620163464},1612620163464,2021-02-06 14:02:43.464,,,NaT,{'$date': 1612620163464},1612620163464,2021-02-06 14:02:43.464,,,NaT,,,,NaT,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
162,{'$oid': '5ff8cea10a7214adca00000b'},5ff8cea10a7214adca00000b,250.0,"Receipt number 3 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1609968545000},1609968545000,2021-01-06 21:29:05.000,{'$date': 1609968545000},1609968545000,2021-01-06 21:29:05.000,{'$date': 1610141355000},1610141000000.0,2021-01-08 21:29:15,{'$date': 1610141355000},1610141355000,2021-01-08 21:29:15.000,{'$date': 1610141355000},1610141000000.0,2021-01-08 21:29:15,1999.6,{'$date': 1609882145000},1609882000000.0,2021-01-05 21:29:05,2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '1234', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '1234', 'userFlaggedDescription': '', 'userFlaggedNewItem': True}]",FINISHED,1.0,5ff8ce8504929111f6e913cb


In [74]:
# re-run the purchaseDate_cleaned conversion (allowing nulls) to confirm that the existing purchaseDate_cleaned_ts column is not added a second time
epoch_to_timestamp(df_receipts, 'purchaseDate_cleaned', allow_nulls = True)
df_receipts.sample(2)

cannot insert purchaseDate_cleaned_ts, already exists, purchaseDate_cleaned_ts was not added to the dataframe


Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,modifyDate_cleaned_ts,pointsAwardedDate,pointsAwardedDate_cleaned,pointsAwardedDate_cleaned_ts,pointsEarned,purchaseDate,purchaseDate_cleaned,purchaseDate_cleaned_ts,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
1110,{'$oid': '603c6adf0a720fde1000039a'},603c6adf0a720fde1000039a,,,{'$date': 1614572255736},1614572255736,2021-03-01 04:17:35.736,{'$date': 1614572255736},1614572255736,2021-03-01 04:17:35.736,,,NaT,{'$date': 1614572255736},1614572255736,2021-03-01 04:17:35.736,,,NaT,,,,NaT,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
38,{'$oid': '5ff3710b0a7214ada10005a7'},5ff3710b0a7214ada10005a7,100.0,"Receipt number 6 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1609789707000},1609789707000,2021-01-04 19:48:27.000,{'$date': 1609789707000},1609789707000,2021-01-04 19:48:27.000,{'$date': 1609789708000},1609790000000.0,2021-01-04 19:48:28,{'$date': 1609789708000},1609789708000,2021-01-04 19:48:28.000,{'$date': 1609789708000},1609790000000.0,2021-01-04 19:48:28,600.0,{'$date': 1609703307000},1609703000000.0,2021-01-03 19:48:27,9.0,"[{'barcode': '044700033302', 'description': 'OSCAR MAYER Carving Board Oven Roasted Turkey Breast, 7.5 oz', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '1', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER CARVING BOARD', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '044700033302', 'description': 'OSCAR MAYER Carving Board Oven Roasted Turkey Breast, 7.5 oz', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '2', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER CARVING BOARD', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '044700033302', 'description': 'OSCAR MAYER Carving Board Oven Roasted Turkey Breast, 7.5 oz', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '3', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER CARVING BOARD', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '044700033302', 'description': 'OSCAR MAYER Carving Board Oven Roasted Turkey Breast, 7.5 oz', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '4', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER 
CARVING BOARD', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '044700033302', 'description': 'OSCAR MAYER Carving Board Oven Roasted Turkey Breast, 7.5 oz', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '5', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER CARVING BOARD', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '044700033302', 'description': 'OSCAR MAYER Carving Board Oven Roasted Turkey Breast, 7.5 oz', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '6', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER CARVING BOARD', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '044700033302', 'description': 'OSCAR MAYER Carving Board Oven Roasted Turkey Breast, 7.5 oz', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '7', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER CARVING BOARD', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '044700033302', 'description': 'OSCAR MAYER Carving Board Oven Roasted Turkey Breast, 7.5 oz', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '8', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER CARVING BOARD', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '044700033302', 'description': 'OSCAR MAYER Carving Board Oven Roasted Turkey Breast, 7.5 oz', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '9', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER CARVING BOARD', 'rewardsProductPartnerId': 
'559c2234e4b06aca36af13c6', 'targetPrice': '800'}]",FINISHED,89.91,5ff370c562fde912123a5e0e


### Visually check that all the expected cleaned and converted columns are present

In [75]:
df_brands.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1167 entries, 0 to 1166
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   _id           1167 non-null   object 
 1   _id_cleaned   1167 non-null   object 
 2   barcode       1167 non-null   int64  
 3   category      1012 non-null   object 
 4   categoryCode  517 non-null    object 
 5   cpg           1167 non-null   object 
 6   cpg_cleaned   1167 non-null   object 
 7   name          1167 non-null   object 
 8   topBrand      555 non-null    float64
 9   brandCode     933 non-null    object 
dtypes: float64(1), int64(1), object(8)
memory usage: 91.3+ KB


In [76]:
df_receipts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1119 entries, 0 to 1118
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   _id                           1119 non-null   object        
 1   _id_cleaned                   1119 non-null   object        
 2   bonusPointsEarned             544 non-null    float64       
 3   bonusPointsEarnedReason       544 non-null    object        
 4   createDate                    1119 non-null   object        
 5   createDate_cleaned            1119 non-null   int64         
 6   createDate_cleaned_ts         1119 non-null   datetime64[ns]
 7   dateScanned                   1119 non-null   object        
 8   dateScanned_cleaned           1119 non-null   int64         
 9   dateScanned_cleaned_ts        1119 non-null   datetime64[ns]
 10  finishedDate                  568 non-null    object        
 11  finishedDate_cleaned          

In [77]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 495 entries, 0 to 494
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   _id                     495 non-null    object        
 1   _id_cleaned             495 non-null    object        
 2   active                  495 non-null    bool          
 3   createdDate             495 non-null    object        
 4   createdDate_cleaned     495 non-null    int64         
 5   createdDate_cleaned_ts  495 non-null    datetime64[ns]
 6   lastLogin               433 non-null    object        
 7   lastLogin_cleaned       433 non-null    float64       
 8   lastLogin_cleaned_ts    433 non-null    datetime64[ns]
 9   role                    495 non-null    object        
 10  signUpSource            447 non-null    object        
 11  state                   439 non-null    object        
dtypes: bool(1), datetime64[ns](2), float64(1), int64(1

### What might I need to answer the stakeholder questions?  
This collection of cells is representative of my brainstorming and planning process. I've attempted to 'think out loud' a bit here, and I document what the code is doing more fully in the following section.

- What are the top 5 brands by receipts scanned for most recent month?
  - need to join receipts to brands; the only path is via the barcode key in rewardsReceiptItemList  
  
  
- How does the ranking of the top 5 brands by receipts scanned for the recent month compare to the ranking for the previous month?  
  - same as above, barcode 


- When considering average spend from receipts with 'rewardsReceiptStatus' of 'Accepted' or 'Rejected', which is greater?  
  - this can be answered with df_receipts.totalSpent

- When considering total number of items purchased from receipts with 'rewardsReceiptStatus' of 'Accepted' or 'Rejected', which is greater?  
  - df_receipts.purchasedItemCount


- Which brand has the most spend among users who were created within the past 6 months?  
  - barcode
  - df_users.createdDate_cleaned_ts

- Which brand has the most transactions among users who were created within the past 6 months?
  - barcode
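
Since several of the questions above depend on reaching brands through barcode, here is a minimal sketch of that join on toy data. The frames `brands` and `items`, the `receipt_id` column, and all values are hypothetical stand-ins, not the notebook's actual tables. It assumes brand barcodes are stored as int64 (as df_brands.info() reports) while item barcodes are strings, so the keys are cast to a common type first. An int64 barcode cannot carry a leading zero, so whether zero-padding is also needed is an open question not checked here.

```python
import pandas as pd

# Toy stand-ins for the notebook's data; all values are illustrative only.
brands = pd.DataFrame({
    "barcode": [511111019862, 511111519928],  # int64, like df_brands.barcode
    "name": ["Brand A", "Brand B"],
})
items = pd.DataFrame({
    "receipt_id": ["r1", "r1", "r2"],  # hypothetical column name
    "barcode": ["511111019862", "034100573065", "511111519928"],  # strings
})

# Cast the brand barcodes to strings so both join keys share a dtype.
brands["barcode_str"] = brands["barcode"].astype(str)

# A left join keeps every receipt item; unmatched items get a NaN brand name.
joined = items.merge(
    brands[["barcode_str", "name"]],
    left_on="barcode",
    right_on="barcode_str",
    how="left",
)

# Count distinct receipts per brand, since one receipt can list a brand many times.
receipts_per_brand = (
    joined.dropna(subset=["name"])
    .groupby("name")["receipt_id"]
    .nunique()
    .sort_values(ascending=False)
)
print(receipts_per_brand)
```

Counting distinct receipts rather than rows is one reading of "receipts scanned"; counting item rows instead would answer a different question.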
  
  
Questions:
  - 1 receipt = 1 transaction?
  - There is no 'Accepted' value for rewardsReceiptStatus. Should I assume 'FINISHED' means 'Accepted', treat anything but 'REJECTED' as accepted, or something else?
  - Re: receipts data - is this data a snapshot in time? If it were taken again, might some statuses change, along with the contents of rewardsReceiptItemList? If so, what are the final statuses - FINISHED and REJECTED?
    - looking at status by daterange might give some indication, there are a number of date fields - modifyDate could be representative of some sort of updated at reference 
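
One way to make the status question concrete is to list the statuses that are present and then compare spend and item counts under an explicit mapping. The sketch below uses a toy frame with made-up values. The `status_map` dictionary and the `status_group` column are hypothetical, and treating 'FINISHED' as 'Accepted' is an assumption, not something the data confirms.

```python
import pandas as pd

# Toy receipts; column names mirror df_receipts, values are illustrative only.
receipts = pd.DataFrame({
    "rewardsReceiptStatus": ["FINISHED", "FINISHED", "REJECTED", "FLAGGED", "SUBMITTED"],
    "totalSpent": [10.0, 30.0, 5.0, 290.0, None],
    "purchasedItemCount": [2.0, 4.0, 1.0, 10.0, None],
})

# First look at which statuses exist at all.
print(receipts["rewardsReceiptStatus"].value_counts())

# ASSUMPTION: 'FINISHED' stands in for 'Accepted'; other statuses are left unmapped.
status_map = {"FINISHED": "Accepted", "REJECTED": "Rejected"}
receipts["status_group"] = receipts["rewardsReceiptStatus"].map(status_map)

# Average spend and total items for the two mapped groups only.
summary = (
    receipts.dropna(subset=["status_group"])
    .groupby("status_group")
    .agg(avg_spend=("totalSpent", "mean"), total_items=("purchasedItemCount", "sum"))
)
print(summary)
```

Keeping the mapping in one dictionary makes it easy to rerun the comparison if the stakeholder defines 'Accepted' differently.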



**to-do:**
- explore what keys are included in a dictionary that includes barcode: is it a consistent set?
  - it is not a consistent set; it looks like most 'FINISHED' receipts have the best-quality data. I'm curious what status implies.
- decide what else I should include in addition to barcode from rewardsReceiptItemList
  - With the following fields I can get to brands via barcode / userFlaggedBarcode. I can also sum quantities and prices and provide descriptions, which is potentially useful for the next level of drill-down and easy to grab now:
    - 'barcode'
    - 'userFlaggedBarcode'
    - 'description'
    - 'userFlaggedDescription'
    - 'finalPrice'
    - 'userFlaggedPrice'
    - 'quantityPurchased'
    - 'userFlaggedQuantity'
- create a new data source, receipt_items, that will act as a lookup table. Rows include the original receipt id and the above fields from rewardsReceiptItemList where available. If neither barcode nor userFlaggedBarcode is available, don't include that receipt item.
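
The receipt_items plan above could be sketched roughly as follows. This is not the notebook's final implementation: the toy `receipts` frame, the `wanted` list, and the `receipt_id` column name are hypothetical. It also assumes each rewardsReceiptItemList cell holds a Python list of dicts (or is null), which may differ from how the column is actually stored.

```python
import pandas as pd

# Toy receipts; each cell is assumed to hold a list of item dicts, or None.
receipts = pd.DataFrame({
    "_id_cleaned": ["r1", "r2", "r3"],
    "rewardsReceiptItemList": [
        [
            {"barcode": "4011", "description": "ITEM NOT FOUND",
             "finalPrice": "1", "quantityPurchased": 1},
            {"userFlaggedBarcode": "1234", "userFlaggedDescription": ""},
        ],
        [{"description": "NO BARCODE HERE", "finalPrice": "2.50"}],
        None,  # receipt without an item list
    ],
})

# Fields to keep from each item dict, per the to-do list.
wanted = [
    "barcode", "userFlaggedBarcode", "description", "userFlaggedDescription",
    "finalPrice", "userFlaggedPrice", "quantityPurchased", "userFlaggedQuantity",
]

# One row per item; receipts with no item list are dropped.
exploded = (
    receipts.explode("rewardsReceiptItemList")
    .dropna(subset=["rewardsReceiptItemList"])
    .reset_index(drop=True)
)

# Expand each item dict into columns; keys absent from an item become NaN.
fields = pd.json_normalize(exploded["rewardsReceiptItemList"].tolist())
fields = fields.reindex(columns=wanted)

receipt_items = pd.concat(
    [exploded[["_id_cleaned"]].rename(columns={"_id_cleaned": "receipt_id"}), fields],
    axis=1,
)

# Keep only items that carry at least one of the two barcode fields.
receipt_items = receipt_items.dropna(
    subset=["barcode", "userFlaggedBarcode"], how="all"
).reset_index(drop=True)
print(receipt_items)
```

Using `reindex(columns=wanted)` keeps the column set stable even when no item in the sample carries a given key, which matters because the keys are not a consistent set.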


In [78]:
# from df_receipts, extract _id_cleaned and rewardsReceiptItemList into a new dataframe and look at a few samples
df_receipt_items = df_receipts[['_id_cleaned','rewardsReceiptItemList']]
# df_receipt_items
df_receipt_items.sample(3)

Unnamed: 0,_id_cleaned,rewardsReceiptItemList
320,60037b500a7214ad4c000086,"[{'barcode': '085718300512', 'competitiveProduct': True, 'description': 'BANZA SPAGH', 'discountedItemPrice': '2.88', 'finalPrice': '2.88', 'itemNumber': '085718300512', 'itemPrice': '2.88', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'BANZA SPAGH', 'partnerItemId': '1235', 'quantityPurchased': 1, 'rewardsGroup': 'ANNIE'S HOMEGROWN MULTI-SERVING MAC & CHEESE', 'rewardsProductPartnerId': '5332f5f3e4b03c9a25efd0ae'}, {'barcode': '085718300512', 'competitiveProduct': True, 'description': 'BANZA SPAGH', 'discountedItemPrice': '2.88', 'finalPrice': '2.88', 'itemNumber': '085718300512', 'itemPrice': '2.88', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'BANZA SPAGH', 'partnerItemId': '1236', 'quantityPurchased': 1, 'rewardsGroup': 'ANNIE'S HOMEGROWN MULTI-SERVING MAC & CHEESE', 'rewardsProductPartnerId': '5332f5f3e4b03c9a25efd0ae'}, {'barcode': '078742112138', 'description': 'FRZ.ORGBLUE', 'discountedItemPrice': '2.74', 'finalPrice': '2.74', 'itemNumber': '078742112138', 'itemPrice': '2.74', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'FRZ.ORGBLUE', 'partnerItemId': '1237', 'quantityPurchased': 1}, {'barcode': '078742112138', 'description': 'FRZ ORGBLUE', 'discountedItemPrice': '2.74', 'finalPrice': '2.74', 'itemNumber': '078742112138', 'itemPrice': '2.74', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'FRZ ORGBLUE', 'partnerItemId': '1238', 'quantityPurchased': 1}, {'barcode': '633472322587', 'description': 'CLEAR BLUE', 'discountedItemPrice': '7.98', 'finalPrice': '7.98', 'itemNumber': '633472322587', 'itemPrice': '7.98', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'CLEAR BLUE', 'partnerItemId': '1239', 'quantityPurchased': 1}, {'barcode': '029000019034', 'description': 'PLANTERS Deluxe Whole Cashews - Lightly Salted 18.25oz', 'discountedItemPrice': '9.34', 'finalPrice': '9.34', 'itemNumber': '002900001903', 'itemPrice': '9.34', 'originalMetaBriteBarcode': 
'002900001903', 'originalReceiptItemText': 'LS WHL CASHW', 'partnerItemId': '1240', 'pointsEarned': '46.7', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'PLANTERS CASHEWS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '4011', 'description': 'BANANAS', 'discountedItemPrice': '1.38', 'finalPrice': '1.38', 'itemNumber': '4011', 'itemPrice': '1.38', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'BANANAS', 'partnerItemId': '1241', 'quantityPurchased': 1}, {'barcode': '4011', 'description': 'BANANAS', 'discountedItemPrice': '0.98', 'finalPrice': '0.98', 'itemNumber': '4011', 'itemPrice': '0.98', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'BANANAS', 'partnerItemId': '1243', 'quantityPurchased': 1}, {'barcode': '052100587967', 'description': 'PPR GRNDR', 'discountedItemPrice': '1.87', 'finalPrice': '1.87', 'itemNumber': '052100587967', 'itemPrice': '1.87', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'PPR GRNDR', 'partnerItemId': '1245', 'quantityPurchased': 1}, {'barcode': '078742348315', 'description': 'GV LINER GRE', 'discountedItemPrice': '1.00', 'finalPrice': '1.00', 'itemNumber': '078742348315', 'itemPrice': '1.00', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'GV LINER GRE', 'partnerItemId': '1246', 'quantityPurchased': 1}, {'barcode': '042222302005', 'description': 'JN O TRKY GRND 7 PCT FAT 93 PCT LN LN KP RFRG TRY IN SLV 16 OZ', 'discountedItemPrice': '3.62', 'finalPrice': '3.62', 'itemNumber': '042222302005', 'itemPrice': '3.62', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'GRD TURKEY', 'partnerItemId': '1247', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '681131122764', 'description': 'MS12L CFBRN', 'discountedItemPrice': '2.48', 'finalPrice': '2.48', 'itemNumber': '681131122764', 'itemPrice': '2.48', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'MS12L CFBRN', 
'partnerItemId': '1248', 'quantityPurchased': 1}, {'barcode': '4093', 'description': 'ONIONS', 'discountedItemPrice': '0.55', 'finalPrice': '0.55', 'itemNumber': '4093', 'itemPrice': '0.55', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'ONIONS', 'partnerItemId': '1249', 'quantityPurchased': 1}, {'barcode': '857484001159', 'description': 'YUMMALLO RAINBOW MARSHMALLOW 6.5 OZ - 0857484001151', 'discountedItemPrice': '1.97', 'finalPrice': '1.97', 'itemNumber': '857484001159', 'itemPrice': '1.97', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'RAINBOW MAR', 'partnerItemId': '1251', 'quantityPurchased': 1, 'rewardsProductPartnerId': '5e825d64f221c312e698a62a'}, {'barcode': '689544083405', 'competitiveProduct': True, 'description': 'FAGE YOG', 'discountedItemPrice': '3.22', 'finalPrice': '3.22', 'itemNumber': '689544083405', 'itemPrice': '3.22', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'FAGE YOG', 'partnerItemId': '1252', 'quantityPurchased': 1, 'rewardsGroup': 'YOPLAIT GREEK YOGURT WHIPS', 'rewardsProductPartnerId': '5332f5f3e4b03c9a25efd0ae'}, {'barcode': '887214041018', 'description': 'AVOCADO', 'discountedItemPrice': '2.98', 'finalPrice': '2.98', 'itemNumber': '887214041018', 'itemPrice': '2.98', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'AVOCADO', 'partnerItemId': '1253', 'quantityPurchased': 1}, {'barcode': '045255148947', 'description': 'ORG BELL 2CT', 'discountedItemPrice': '3.46', 'finalPrice': '3.46', 'itemNumber': '045255148947', 'itemPrice': '3.46', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'ORG BELL 2CT', 'partnerItemId': '1254', 'quantityPurchased': 1}, {'barcode': '681131091206', 'description': 'ORG CARROTS', 'discountedItemPrice': '1.96', 'finalPrice': '1.96', 'itemNumber': '681131091206', 'itemPrice': '1.96', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'ORG CARROTS', 'partnerItemId': '1255', 'quantityPurchased': 1}, {'barcode': '026542084018', 'description': 
'CHEESE PIMTO', 'discountedItemPrice': '4.08', 'finalPrice': '4.08', 'itemNumber': '026542084018', 'itemPrice': '4.08', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'CHEESE PIMTO', 'partnerItemId': '1256', 'quantityPurchased': 1}, {'barcode': '068113135477', 'description': 'ORG SALAD', 'discountedItemPrice': '2.96', 'finalPrice': '2.96', 'itemNumber': '068113135477', 'itemPrice': '2.96', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'ORG SALAD', 'partnerItemId': '1257', 'quantityPurchased': 1}, {'barcode': '041235000687', 'description': 'MRS RNF MDM PCNT SHLF STBL SC JAR 16 FL OZ', 'discountedItemPrice': '2.96', 'finalPrice': '2.96', 'itemNumber': '041235000687', 'itemPrice': '2.96', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'MED SALSA', 'partnerItemId': '1258', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '037466017631', 'description': 'LNDT EXC DRK 70 PCT CC BAR BOX 3.5 OZ', 'discountedItemPrice': '2.78', 'finalPrice': '2.78', 'itemNumber': '037466017631', 'itemPrice': '2.78', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'COCOA', 'partnerItemId': '1259', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '037466017631', 'description': 'LNDT EXC DRK 70 PCT CC BAR BOX 3.5 OZ', 'discountedItemPrice': '2.78', 'finalPrice': '2.78', 'itemNumber': '037466017631', 'itemPrice': '2.78', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'COCOA', 'partnerItemId': '1260', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '076828048432', 'competitiveProduct': True, 'description': 'WET ONE FRE', 'discountedItemPrice': '1.47', 'finalPrice': '1.47', 'itemNumber': '076828048432', 'itemPrice': '1.47', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'WET ONE FRE', 'partnerItemId': '1261', 'quantityPurchased': 1, 'rewardsGroup': 'KLEENEX WET WIPES', 'rewardsProductPartnerId': 
'550b2565e4b001d5e9e4146f'}, {'barcode': '021000040247', 'competitiveProduct': True, 'description': 'CREAM CHEESE', 'discountedItemPrice': '3.84', 'finalPrice': '3.84', 'itemNumber': '021000040247', 'itemPrice': '3.84', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'CREAM CHEESE', 'partnerItemId': '1262', 'quantityPurchased': 1, 'rewardsGroup': 'SARGENTO RICOTTA CHEESE', 'rewardsProductPartnerId': '5e7cf838f221c312e698a628'}, {'barcode': '021000612239', 'competitiveProduct': True, 'description': 'CREAM CHEESE', 'discountedItemPrice': '1.96', 'finalPrice': '1.96', 'itemNumber': '021000612239', 'itemPrice': '1.96', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'CREAM CHEESE', 'partnerItemId': '1263', 'quantityPurchased': 1, 'rewardsGroup': 'SARGENTO RICOTTA CHEESE', 'rewardsProductPartnerId': '5e7cf838f221c312e698a628'}, {'barcode': '078000082401', 'competitiveProduct': True, 'description': 'DR PEPPER', 'discountedItemPrice': '1.88', 'finalPrice': '1.88', 'itemNumber': '078000082401', 'itemPrice': '1.88', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'DR PEPPER', 'partnerItemId': '1264', 'quantityPurchased': 1, 'rewardsGroup': 'PEPSI 20 OZ', 'rewardsProductPartnerId': '5332f5fbe4b03c9a25efd0ba'}, {'barcode': '013120014178', 'description': 'ORE-IDA Extra Crispy Fast Food Fries - 26oz', 'discountedItemPrice': '2.58', 'finalPrice': '2.58', 'itemNumber': '013120014178', 'itemPrice': '2.58', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'FAST FRIES', 'partnerItemId': '1265', 'pointsEarned': '12.9', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'ORE-IDA FROZEN POTATOES', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '021000040247', 'competitiveProduct': True, 'description': 'CREAM CHEESE', 'discountedItemPrice': '3.84', 'finalPrice': '3.84', 'itemNumber': '021000040247', 'itemPrice': '3.84', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'CREAM 
CHEESE', 'partnerItemId': '1266', 'quantityPurchased': 1, 'rewardsGroup': 'SARGENTO RICOTTA CHEESE', 'rewardsProductPartnerId': '5e7cf838f221c312e698a628'}, {'barcode': '075706304783', 'description': 'PAL RC CHE', 'discountedItemPrice': '5.00', 'finalPrice': '5.00', 'itemNumber': '075706304783', 'itemPrice': '5.00', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'PAL RC CHE', 'partnerItemId': '1267', 'quantityPurchased': 1}, {'barcode': '036000485967', 'description': 'COTTONELLE ULTRA COMFORT CARE MEGA ROLL 2 PLY 284 COTTON TOILET TISSUE 12 CT', 'discountedItemPrice': '12.48', 'finalPrice': '12.48', 'itemNumber': '036000485967', 'itemPrice': '12.48', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'COTT CFT 12M', 'partnerItemId': '1268', 'pointsEarned': '124.8', 'pointsPayerId': '550b2565e4b001d5e9e4146f', 'quantityPurchased': 1, 'rewardsGroup': 'COTTONELLE ULTRA COMFORTCARE BATH TISSUE', 'rewardsProductPartnerId': '550b2565e4b001d5e9e4146f'}, {'barcode': '000980000069', 'description': 'TIC TAC WNTR DXTR SGR PC BRTH SWTN CNST BP 6PK 0.5 OZ', 'discountedItemPrice': '1.48', 'finalPrice': '1.48', 'itemNumber': '000980000069', 'itemPrice': '1.48', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'KJ XMAS EGG', 'partnerItemId': '1269', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '078742052328', 'description': 'GV ALM UNORG', 'discountedItemPrice': '2.36', 'finalPrice': '2.36', 'itemNumber': '078742052328', 'itemPrice': '2.36', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'GV ALM UNORG', 'partnerItemId': '1270', 'quantityPurchased': 1}, {'barcode': '041483022226', 'description': 'KMPS SLC FRM CWS NOT TRTD WTH RBST PSTR HMGN DRY RFRG MLK JUG 64 FL OZ', 'discountedItemPrice': '2.48', 'finalPrice': '2.48', 'itemNumber': '041483022226', 'itemPrice': '2.48', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'KMPS VIT D', 'partnerItemId': '1271', 'quantityPurchased': 1, 
'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '078742333960', 'description': 'ORG VEG BRTH', 'discountedItemPrice': '1.98', 'finalPrice': '1.98', 'itemNumber': '078742333960', 'itemPrice': '1.98', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'ORG VEG BRTH', 'partnerItemId': '1272', 'quantityPurchased': 1}, {'barcode': '015800030621', 'description': 'C & H 454 SRVG WHT CN GRNL SGR BAG 4 LB', 'discountedItemPrice': '2.63', 'finalPrice': '2.63', 'itemNumber': '015800030621', 'itemPrice': '2.63', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'C H GRAN 4', 'partnerItemId': '1273', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '078742333960', 'description': 'ORG VEG BRTH', 'discountedItemPrice': '1.98', 'finalPrice': '1.98', 'itemNumber': '078742333960', 'itemPrice': '1.98', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'ORG VEG BRTH', 'partnerItemId': '1274', 'quantityPurchased': 1}, {'barcode': '007874242956', 'description': 'GV WORCESTE', 'discountedItemPrice': '0.88', 'finalPrice': '0.88', 'itemNumber': '007874242956', 'itemPrice': '0.88', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'GV WORCESTE', 'partnerItemId': '1275', 'quantityPurchased': 1}, {'barcode': '078742372198', 'description': 'GV PWD 2LB', 'discountedItemPrice': '1.62', 'finalPrice': '1.62', 'itemNumber': '078742372198', 'itemPrice': '1.62', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'GV PWD 2LB', 'partnerItemId': '1276', 'quantityPurchased': 1}]"
115,5ff5d2030a720f05230005df,"[{'barcode': '021000001361', 'description': 'PHILADELPHIA Snack BAR Strawberry CheeseC 1 CT 002100000136', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '1', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}]"
564,60145a960a7214ad50000095,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '1234', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '1234', 'userFlaggedDescription': '', 'userFlaggedNewItem': True}]"


In [79]:
df_receipts.groupby('rewardsReceiptStatus').count()

rewardsReceiptStatus,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,modifyDate_cleaned_ts,pointsAwardedDate,pointsAwardedDate_cleaned,pointsAwardedDate_cleaned_ts,pointsEarned,purchaseDate,purchaseDate_cleaned,purchaseDate_cleaned_ts,purchasedItemCount,rewardsReceiptItemList,totalSpent,userId
FINISHED,518,518,456,456,518,518,518,518,518,518,518,518,518,518,518,518,514,514,514,518,518,518,518,518,516,518,518
FLAGGED,46,46,30,30,46,46,46,46,46,46,0,0,0,46,46,46,19,19,19,33,35,35,35,46,46,46,46
PENDING,50,50,0,0,50,50,50,50,50,50,50,50,50,50,50,50,0,0,0,0,49,49,49,0,49,49,50
REJECTED,71,71,58,58,71,71,71,71,71,71,0,0,0,71,71,71,4,4,4,58,69,69,69,71,68,71,71
SUBMITTED,434,434,0,0,434,434,434,434,434,434,0,0,0,434,434,434,0,0,0,0,0,0,0,0,0,0,434
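
The counts above show that SUBMITTED receipts never carry an item list. A narrower check on just that column could look like the sketch below. The toy frame `df_demo` is hypothetical and only stands in for `df_receipts`.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for df_receipts with only the two columns of interest
df_demo = pd.DataFrame({
    "rewardsReceiptStatus": ["FINISHED", "FINISHED", "SUBMITTED", "REJECTED"],
    "rewardsReceiptItemList": [
        [{"barcode": "4011"}],
        np.nan,
        np.nan,
        [{"barcode": "1234"}],
    ],
})

# "count" skips nulls (receipts with an item list), "size" counts every receipt
items_per_status = (
    df_demo.groupby("rewardsReceiptStatus")["rewardsReceiptItemList"]
    .agg(with_items="count", total="size")
)
print(items_per_status)
```

Comparing `with_items` against `total` per status makes the gap visible without scanning all 27 columns.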


In [80]:
df_receipts[['_id_cleaned','rewardsReceiptStatus','rewardsReceiptItemList']].sample(40)

Unnamed: 0,_id_cleaned,rewardsReceiptStatus,rewardsReceiptItemList
665,6017ab130a720f05f80002e0,SUBMITTED,
617,601602db0a7214ad500001b3,SUBMITTED,
603,601542ab0a720f05f80001d5,SUBMITTED,
283,6000b9b70a720f05f300006d,FINISHED,"[{'brandCode': 'BORDEN', 'description': 'BORDEN 2% MILK, 1/2 GAL', 'discountedItemPrice': '4.81', 'finalPrice': '4.81', 'itemPrice': '4.81', 'originalReceiptItemText': 'BORDEN 2% MILK, 1/2 GAL', 'partnerItemId': '1010', 'quantityPurchased': 1}, {'description': 'EMIL S SAUSAGE MUSHROOM PIZZA', 'discountedItemPrice': '13.12', 'finalPrice': '13.12', 'itemPrice': '13.12', 'originalReceiptItemText': 'EMIL S SAUSAGE MUSHROOM PIZZA', 'partnerItemId': '1012', 'quantityPurchased': 2}, {'description': 'EMIL' S SAUSAGE MUSHROOM PIZZA', 'discountedItemPrice': '13.08', 'finalPrice': '13.08', 'itemPrice': '13.08', 'originalReceiptItemText': 'EMIL' S SAUSAGE MUSHROOM PIZZA', 'partnerItemId': '1015', 'quantityPurchased': 2}, {'barcode': '036000391718', 'brandCode': 'KLEENEX', 'description': 'KLEENEX POP UP RECTANGLE BOX FACIAL TISSUE 2 PLY 8PK 160 CT', 'discountedItemPrice': '5.38', 'finalPrice': '5.38', 'itemPrice': '5.38', 'metabriteCampaignId': 'KLEENEX TRUSTED CARE FACIAL TISSUES 120 - 179 COUNT, 8 PACK', 'originalReceiptItemText': 'KLEENEX TRUSTED CARE FACIAL TISSUES', 'partnerItemId': '1018', 'pointsEarned': '53.8', 'pointsPayerId': '550b2565e4b001d5e9e4146f', 'quantityPurchased': 2, 'rewardsGroup': 'KLEENEX TRUSTED CARE FACIAL TISSUES 120 - 179 COUNT, 8 PACK', 'rewardsProductPartnerId': '550b2565e4b001d5e9e4146f'}, {'barcode': '028435383253', 'brandCode': 'KLARBRUNN', 'description': 'KLRB CRBN WTR CLR FR LMN SPRK CAN 12PK 12 FL OZ', 'discountedItemPrice': '15.60', 'finalPrice': '15.60', 'itemPrice': '15.60', 'originalReceiptItemText': 'KLARBRUNN 12PX 12 FL OZ', 'partnerItemId': '1021', 'quantityPurchased': 5, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'description': 'EMIL.S SAUSAGE MOSHROOM PIZZA', 'discountedItemPrice': '13.58', 'finalPrice': '13.58', 'itemPrice': '13.58', 'originalReceiptItemText': 'EMIL.S SAUSAGE MOSHROOM PIZZA', 'partnerItemId': '1024', 'quantityPurchased': 2}, {'description': 'KLARBRUNN 12PK 12 SL 
OZ', 'discountedItemPrice': '6.39', 'finalPrice': '6.39', 'itemPrice': '6.39', 'originalReceiptItemText': 'KLARBRUNN 12PK 12 SL OZ', 'partnerItemId': '1027', 'quantityPurchased': 3}, {'barcode': '076840100354', 'brandCode': 'BEN AND JERRYS', 'description': 'BEN & JERRYS FROZEN CHUNKY MONKEY ICE CREAM REGULAR 16 OZ - 0076840100351', 'discountedItemPrice': '18.40', 'finalPrice': '18.40', 'itemPrice': '18.40', 'metabriteCampaignId': 'BEN AND JERRYS ICE CREAM', 'originalReceiptItemText': 'BEN 4 JERRY.A CHUN*Y MONKEY PINT', 'partnerItemId': '1030', 'pointsEarned': '184.0', 'pointsPayerId': '5332f5f6e4b03c9a25efd0b4', 'quantityPurchased': 4, 'rewardsGroup': 'BEN AND JERRYS ICE CREAM', 'rewardsProductPartnerId': '5332f5f6e4b03c9a25efd0b4'}]"
979,6025fde20a720f05a8000294,SUBMITTED,
47,5ff29be20a7214ada1000571,FINISHED,"[{'barcode': '044000000745', 'brandCode': 'KRAFT EASY CHEESE', 'competitiveProduct': True, 'competitorRewardsGroup': 'SARGENTO RICOTTA CHEESE', 'description': '-Cheddar', 'discountedItemPrice': '1.00', 'finalPrice': '1.00', 'itemNumber': '044000000745', 'itemPrice': '1.00', 'partnerItemId': '1030', 'quantityPurchased': 1, 'rewardsGroup': 'SARGENTO RICOTTA CHEESE', 'rewardsProductPartnerId': '5e7cf838f221c312e698a628'}]"
1016,6038a5400a720fde10000088,SUBMITTED,
522,6013410c0a7214ad5000002a,FINISHED,"[{'barcode': '036000320893', 'description': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'discountedItemPrice': '21.00', 'finalPrice': '21.00', 'itemNumber': '4023', 'itemPrice': '21.00', 'originalReceiptItemText': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'partnerItemId': '0', 'pointsEarned': '210.0', 'pointsPayerId': '550b2565e4b001d5e9e4146f', 'quantityPurchased': 1, 'rewardsGroup': 'HUGGIES ONE AND DONE SIMPLY CLEAN BABY WIPES 200 COUNT OR LARGER', 'rewardsProductPartnerId': '550b2565e4b001d5e9e4146f'}, {'barcode': '036000320893', 'description': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'discountedItemPrice': '21.00', 'finalPrice': '21.00', 'itemNumber': '4023', 'itemPrice': '21.00', 'originalReceiptItemText': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'partnerItemId': '1', 'pointsEarned': '210.0', 'pointsPayerId': '550b2565e4b001d5e9e4146f', 'quantityPurchased': 1, 'rewardsGroup': 'HUGGIES ONE AND DONE SIMPLY CLEAN BABY WIPES 200 COUNT OR LARGER', 'rewardsProductPartnerId': '550b2565e4b001d5e9e4146f'}, {'barcode': '036000320893', 'description': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'discountedItemPrice': '21.00', 'finalPrice': '21.00', 'itemNumber': '4023', 'itemPrice': '21.00', 'originalReceiptItemText': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'partnerItemId': '2', 'pointsEarned': '210.0', 'pointsPayerId': '550b2565e4b001d5e9e4146f', 'quantityPurchased': 1, 'rewardsGroup': 'HUGGIES ONE AND DONE SIMPLY CLEAN BABY WIPES 200 COUNT OR LARGER', 'rewardsProductPartnerId': '550b2565e4b001d5e9e4146f'}, {'barcode': '036000320893', 'description': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'discountedItemPrice': '21.00', 'finalPrice': '21.00', 'itemNumber': '4023', 'itemPrice': '21.00', 'originalReceiptItemText': 'HUGGIES SIMPLY 
CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'partnerItemId': '3', 'pointsEarned': '210.0', 'pointsPayerId': '550b2565e4b001d5e9e4146f', 'quantityPurchased': 1, 'rewardsGroup': 'HUGGIES ONE AND DONE SIMPLY CLEAN BABY WIPES 200 COUNT OR LARGER', 'rewardsProductPartnerId': '550b2565e4b001d5e9e4146f'}]"
6,5ff1e1cd0a720f052300056f,FINISHED,"[{'brandCode': 'MISSION', 'competitorRewardsGroup': 'TACO BELL TACO SHELLS', 'description': 'MSSN TORTLLA', 'discountedItemPrice': '2.23', 'finalPrice': '2.23', 'itemPrice': '2.23', 'originalReceiptItemText': 'MSSN TORTLLA', 'partnerItemId': '1009', 'quantityPurchased': 1}]"
796,601e63660a720f053c000055,SUBMITTED,


In [81]:
id_125 = df_receipt_items.loc[df_receipt_items['_id_cleaned'] == '6008ee0e0a7214ad89000125']
id_125

Unnamed: 0,_id_cleaned,rewardsReceiptItemList
392,6008ee0e0a7214ad89000125,"[{'barcode': '012000809965', 'description': 'MTN DEW REVOLUTION SODA WILDBERRY FRUIT FLVR CANS IN BOX 12 CT 144 OZ', 'discountedItemPrice': '8.99', 'finalPrice': '8.99', 'itemNumber': '012000809965', 'itemPrice': '8.99', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'ILDBERRY FRUIT FLVR CANS IN BOX 12 C', 'partnerItemId': '1032', 'pointsNotAwardedReason': 'Action not allowed for user and CPG', 'pointsPayerId': '5332f5fbe4b03c9a25efd0ba', 'quantityPurchased': 1, 'rewardsGroup': 'MOUNTAIN DEW 12 OZ 12 PACK', 'rewardsProductPartnerId': '5332f5fbe4b03c9a25efd0ba'}, {'barcode': '511111101451', 'description': 'QUAKER', 'discountedItemPrice': '3.99', 'finalPrice': '3.99', 'itemNumber': '511111101451', 'itemPrice': '3.99', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': '2.99 10 OUAKER OATS Q', 'partnerItemId': '1042', 'pointsNotAwardedReason': 'Action not allowed for user and CPG', 'pointsPayerId': '53e10d6368abd3c7065097cc', 'quantityPurchased': 1, 'rewardsProductPartnerId': '53e10d6368abd3c7065097cc'}, {'barcode': '005111116022', 'description': 'TTER BLUE KRAZY KRITTER BLUE 1', 'discountedItemPrice': '1.49', 'finalPrice': '1.49', 'itemNumber': '005111116022', 'itemPrice': '1.49', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'TTER BLUE KRAZY KRITTER BLUE 1', 'partnerItemId': '1048', 'quantityPurchased': 1}, {'barcode': '511111602118', 'description': 'JELL-O', 'discountedItemPrice': '1.99', 'finalPrice': '1.99', 'itemNumber': '511111602118', 'itemPrice': '1.99', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'LO JELL-O', 'partnerItemId': '1051', 'pointsEarned': '10.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '311111536044', 'description': 'LUCKY CHARMS UNICORN CEREAL FAMILY SIZE', 'discountedItemPrice': '6.58', 'finalPrice': '6.58', 'itemNumber': '311111536044', 'itemPrice': '6.58', 
'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'SI HIDDEN VALLEY SALAD DRESSING 21OZ', 'partnerItemId': '1088', 'pointsNotAwardedReason': 'Action not allowed for user and CPG', 'pointsPayerId': '5332f5f3e4b03c9a25efd0ae', 'quantityPurchased': 1, 'rewardsGroup': 'LUCKY CHARMS UNICORN CEREAL FAMILY SIZE', 'rewardsProductPartnerId': '5332f5f3e4b03c9a25efd0ae'}, {'barcode': '074682200294', 'description': 'R W KND FML BT VGTB JC BTL RFRG AFTR OPNN 32 FL OZ', 'discountedItemPrice': '7.89', 'finalPrice': '7.89', 'itemNumber': '074682200294', 'itemPrice': '7.89', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'ML BT VGTB JC BTL RFRG AFTR OPNN 32', 'partnerItemId': '1091', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '011594404013', 'description': 'HWN ONN RNG SWT M GLDN CRSP BAG 4 OZ', 'discountedItemPrice': '1.49', 'finalPrice': '1.49', 'itemNumber': '011594404013', 'itemPrice': '1.49', 'originalMetaBriteBarcode': '', 'originalReceiptItemText': 'AIIAN SWT HWN ONN RNG SWT M GLDN CRS', 'partnerItemId': '1112', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}]"


In [82]:
# extract a sample value from rewardsReceiptItemList
receiptlist = id_125.iloc[0]['rewardsReceiptItemList']

# print each item's barcode; .get() avoids a KeyError for items without one
for item in receiptlist:
    print(item.get('barcode'))

012000809965
511111101451
005111116022
511111602118
311111536044
074682200294
011594404013


### Creating df_receipt_items data source

In [83]:
# create a list containing '_id_cleaned','rewardsReceiptItemList' values from df_receipts
list_receipt_items_in = df_receipts[['_id_cleaned', 'rewardsReceiptItemList']].values.tolist()

In [84]:
# create an empty list to store values from list_receipt_items_in, a list of lists
list_receipt_items_expand = []

for _id_cleaned, rewardsReceiptItemList in list_receipt_items_in:
    try:
        for item_index, item in enumerate(rewardsReceiptItemList):
            # create the list to add to list_receipt_items_expand
            list_out = [_id_cleaned, item_index, item]
            list_receipt_items_expand.append(list_out)
    except TypeError:
        # receipts without items hold NaN (a float), which is not iterable; skip them
        pass

# confirm the list is composed as intended
list_receipt_items_expand[0:6]

[['5ff1e1eb0a720f0523000575',
  0,
  {'barcode': '4011',
   'description': 'ITEM NOT FOUND',
   'finalPrice': '26.00',
   'itemPrice': '26.00',
   'needsFetchReview': False,
   'partnerItemId': '1',
   'preventTargetGapPoints': True,
   'quantityPurchased': 5,
   'userFlaggedBarcode': '4011',
   'userFlaggedNewItem': True,
   'userFlaggedPrice': '26.00',
   'userFlaggedQuantity': 5}],
 ['5ff1e1bb0a720f052300056b',
  0,
  {'barcode': '4011',
   'description': 'ITEM NOT FOUND',
   'finalPrice': '1',
   'itemPrice': '1',
   'partnerItemId': '1',
   'quantityPurchased': 1}],
 ['5ff1e1bb0a720f052300056b',
  1,
  {'barcode': '028400642255',
   'description': 'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ',
   'finalPrice': '10.00',
   'itemPrice': '10.00',
   'needsFetchReview': True,
   'needsFetchReviewReason': 'USER_FLAGGED',
   'partnerItemId': '2',
   'pointsNotAwardedReason': 'Action not allowed for user and CPG',
   'pointsPayerId': '5332f5fbe4b03c9a25efd0ba',
   'preve

In [85]:
# create an empty list to store values from list_receipt_items_expand, another list of lists
list_receipt_items_extract = []

for _id_cleaned, item_index, item in list_receipt_items_expand:
    # only save values to list_receipt_items_extract if there is a 'barcode' or 'userFlaggedBarcode' present 
    if item.get('barcode', None) or item.get('userFlaggedBarcode', None):
        # assign variables values from item dictionaries, if the key doesn't exist default None
        barcode = item.get('barcode', None)
        userFlaggedBarcode = item.get('userFlaggedBarcode', None)
        description = item.get('description', None)
        userFlaggedDescription = item.get('userFlaggedDescription', None)
        finalPrice = item.get('finalPrice', None)
        userFlaggedPrice = item.get('userFlaggedPrice', None)
        quantityPurchased = item.get('quantityPurchased', None)
        userFlaggedQuantity = item.get('userFlaggedQuantity', None)
        # create the list to add to list_receipt_items_extract
        list_out = [
            _id_cleaned, item_index, barcode, userFlaggedBarcode, description, userFlaggedDescription,
            finalPrice, userFlaggedPrice, quantityPurchased, userFlaggedQuantity
            ]
        list_receipt_items_extract.append(list_out)

In [86]:
# inspect the first items of list_receipt_items_extract
list_receipt_items_extract[0:6]

[['5ff1e1eb0a720f0523000575',
  0,
  '4011',
  '4011',
  'ITEM NOT FOUND',
  None,
  '26.00',
  '26.00',
  5,
  5],
 ['5ff1e1bb0a720f052300056b',
  0,
  '4011',
  None,
  'ITEM NOT FOUND',
  None,
  '1',
  None,
  1,
  None],
 ['5ff1e1bb0a720f052300056b',
  1,
  '028400642255',
  '028400642255',
  'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ',
  'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ',
  '10.00',
  '10.00',
  1,
  1],
 ['5ff1e1f10a720f052300057a',
  0,
  None,
  '4011',
  None,
  None,
  None,
  '26.00',
  None,
  3],
 ['5ff1e1ee0a7214ada100056f',
  0,
  '4011',
  '4011',
  'ITEM NOT FOUND',
  None,
  '28.00',
  '28.00',
  4,
  4],
 ['5ff1e1d20a7214ada1000561',
  0,
  '4011',
  None,
  'ITEM NOT FOUND',
  None,
  '1',
  None,
  1,
  None]]
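
The two loops above could also be expressed with pandas built-ins. The sketch below is one possible alternative, not the approach used in this notebook: it assumes `DataFrame.explode` and `pd.json_normalize` are acceptable here, and `df_demo`, `df_items_demo`, and the reduced column list `wanted` are hypothetical names for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for df_receipts
df_demo = pd.DataFrame({
    "_id_cleaned": ["r1", "r2", "r3"],
    "rewardsReceiptItemList": [
        [{"barcode": "4011", "finalPrice": "1"}, {"userFlaggedBarcode": "1234"}],
        np.nan,  # receipt without items
        [{"description": "no barcode here"}],
    ],
})

# one row per item; receipts without items are dropped first
exploded = (
    df_demo.dropna(subset=["rewardsReceiptItemList"])
    .explode("rewardsReceiptItemList")
    .reset_index(drop=True)
)
# position of each item within its receipt
exploded["item_index"] = exploded.groupby("_id_cleaned").cumcount()

# spread the item dicts into columns; keys missing from an item become NaN
items = pd.json_normalize(exploded["rewardsReceiptItemList"].tolist())
wanted = ["barcode", "userFlaggedBarcode", "description", "finalPrice"]
items = items.reindex(columns=wanted)

df_items_demo = pd.concat(
    [exploded[["_id_cleaned", "item_index"]], items], axis=1
).rename(columns={"_id_cleaned": "receipt_id"})

# keep only items that have a barcode or a userFlaggedBarcode
has_code = df_items_demo["barcode"].notna() | df_items_demo["userFlaggedBarcode"].notna()
df_items_demo = df_items_demo[has_code].reset_index(drop=True)
print(df_items_demo)
```

One difference to keep in mind: the loop version tests truthiness, so an empty-string barcode is treated as missing, while `notna()` in this sketch would keep it.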

In [87]:
# create the dataframe df_receipt_items from list_receipt_items_extract
ri_columns = [
                "receipt_id",
                "item_index",
                "barcode",
                "userFlaggedBarcode",
                "description",
                "userFlaggedDescription",
                "finalPrice",
                "userFlaggedPrice",
                "quantityPurchased",
                "userFlaggedQuantity"
                ]
df_receipt_items = pd.DataFrame(list_receipt_items_extract, columns = ri_columns)

In [88]:
df_receipt_items

Unnamed: 0,receipt_id,item_index,barcode,userFlaggedBarcode,description,userFlaggedDescription,finalPrice,userFlaggedPrice,quantityPurchased,userFlaggedQuantity
0,5ff1e1eb0a720f0523000575,0,4011,4011,ITEM NOT FOUND,,26.00,26.00,5.0,5.0
1,5ff1e1bb0a720f052300056b,0,4011,,ITEM NOT FOUND,,1,,1.0,
2,5ff1e1bb0a720f052300056b,1,028400642255,028400642255,DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ,DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ,10.00,10.00,1.0,1.0
3,5ff1e1f10a720f052300057a,0,,4011,,,,26.00,,3.0
4,5ff1e1ee0a7214ada100056f,0,4011,4011,ITEM NOT FOUND,,28.00,28.00,4.0,4.0
...,...,...,...,...,...,...,...,...,...,...
3235,603cc2bc0a720fde100003e9,1,B07BRRLSVC,,thindust summer face mask - sun protection neck gaiter for outdooractivities,,11.99,,1.0,
3236,603cc0630a720fde100003e6,0,B076FJ92M4,,"mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white",,22.97,,1.0,
3237,603cc0630a720fde100003e6,1,B07BRRLSVC,,thindust summer face mask - sun protection neck gaiter for outdooractivities,,11.99,,1.0,
3238,603ce7100a7217c72c000405,0,B076FJ92M4,,"mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white",,22.97,,1.0,


### Loading brands to SQLite

In [92]:
# convert topBrand to the nullable "boolean" dtype, which allows nulls
df_brands.topBrand = df_brands.topBrand.astype("boolean")

In [93]:
df_brands.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1167 entries, 0 to 1166
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   _id           1167 non-null   object 
 1   _id_cleaned   1167 non-null   object 
 2   barcode       1167 non-null   int64  
 3   category      1012 non-null   object 
 4   categoryCode  517 non-null    object 
 5   cpg           1167 non-null   object 
 6   cpg_cleaned   1167 non-null   object 
 7   name          1167 non-null   object 
 8   topBrand      555 non-null    boolean
 9   brandCode     933 non-null    object 
dtypes: boolean(1), int64(1), object(8)
memory usage: 84.5+ KB
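
The nullable `"boolean"` dtype matters here because `topBrand` is only populated for 555 of 1167 rows. The small sketch below, on a hypothetical three-value series, contrasts it with a plain `bool` cast, which would silently turn missing values into `True`.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for df_brands.topBrand
top_brand = pd.Series([True, False, np.nan], dtype="object")

# nullable boolean keeps the missing value as <NA>
converted = top_brand.astype("boolean")

# plain bool has no missing value, and NaN is truthy, so it becomes True
plain = top_brand.astype(bool)

print(converted.tolist())
print(plain.tolist())
```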


In [107]:
# create a new dataframe with only the columns I want to load to sqlite
df_brands_load = df_brands[[
                            '_id_cleaned', 'barcode', 'category', 'categoryCode',
                            'cpg_cleaned', 'name', 'topBrand', 'brandCode'
                            ]]
# rename into a new object rather than in place, which avoids a SettingWithCopyWarning on the slice
df_brands_load = df_brands_load.rename(columns={'_id_cleaned': 'id', 'cpg_cleaned': 'cpg'})
df_brands_load


Unnamed: 0,id,barcode,category,categoryCode,cpg,name,topBrand,brandCode
0,601ac115be37ce2ead437551,511111019862,Baking,BAKING,601ac114be37ce2ead437550,test brand @1612366101024,False,
1,601c5460be37ce2ead43755f,511111519928,Beverages,BEVERAGES,5332f5fbe4b03c9a25efd0ba,Starbucks,False,STARBUCKS
2,601ac142be37ce2ead43755d,511111819905,Baking,BAKING,601ac142be37ce2ead437559,test brand @1612366146176,False,TEST BRANDCODE @1612366146176
3,601ac142be37ce2ead43755a,511111519874,Baking,BAKING,601ac142be37ce2ead437559,test brand @1612366146051,False,TEST BRANDCODE @1612366146051
4,601ac142be37ce2ead43755e,511111319917,Candy & Sweets,CANDY_AND_SWEETS,5332fa12e4b03c9a25efd1e7,test brand @1612366146827,False,TEST BRANDCODE @1612366146827
...,...,...,...,...,...,...,...,...
1162,5f77274dbe37ce6b592e90c0,511111116752,Baking,BAKING,5f77274dbe37ce6b592e90bf,test brand @1601644365844,,
1163,5dc1fca91dda2c0ad7da64ae,511111706328,Breakfast & Cereal,,53e10d6368abd3c7065097cc,Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
1164,5f494c6e04db711dd8fe87e7,511111416173,Candy & Sweets,CANDY_AND_SWEETS,5332fa12e4b03c9a25efd1e7,test brand @1598639215217,,TEST BRANDCODE @1598639215217
1165,5a021611e4b00efe02b02a57,511111400608,Grocery,,5332f5f6e4b03c9a25efd0b4,LIPTON TEA Leaves,False,LIPTON TEA Leaves


In [111]:
# when trying to load, the UNIQUE constraint fails on barcode: IntegrityError: UNIQUE constraint failed: brands.barcode
# return all rows where barcodes are repeated, ref: https://stackoverflow.com/questions/14657241/how-do-i-get-a-list-of-all-the-duplicate-items-using-pandas-in-python
pd.concat(g for _, g in df_brands_load.groupby("barcode") if len(g) > 1)

Unnamed: 0,id,barcode,category,categoryCode,cpg,name,topBrand,brandCode
467,5c409ab4cd244a3539b84162,511111004790,Baking,,55b62995e4b0d8e685c14213,alexa,True,ALEXA
1071,5cdacd63166eb33eb7ce0fa8,511111004790,Condiments & Sauces,,559c2234e4b06aca36af13c6,Bitten Dressing,,BITTEN
152,5c45f91b87ff3552f950f027,511111204923,Grocery,,5c45f8b087ff3552f950f026,Brand1,True,0987654321
536,5d6027f46d5f3b23d1bc7906,511111204923,Snacks,,5332f5fbe4b03c9a25efd0ba,CHESTER'S,,CHESTERS
20,5c4699f387ff3577e203ea29,511111305125,Baby,,55b62995e4b0d8e685c14213,Chris Image Test,,CHRISIMAGE
651,5d642d65a3a018514994f42d,511111305125,Magazines,,5d5d4fd16d5f3b23d1bc7905,Rachael Ray Everyday,,511111305125
129,5a7e0604e4b0aedb3b84afd3,511111504139,Beverages,,55b62995e4b0d8e685c14213,Chris Brand XYZ,,CHRISXYZ
299,5a8c33f3e4b07f0a2dac8943,511111504139,Grocery,,5a734034e4b0d58f376be874,Pace,False,PACE
9,5c408e8bcd244a1fdb47aee7,511111504788,Baking,,59ba6f1ce4b092b29c167346,test,,TEST
412,5ccb2ece166eb31bbbadccbe,511111504788,Condiments & Sauces,,559c2234e4b06aca36af13c6,The Pioneer Woman,,PIONEER WOMAN
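
The same duplicate listing can be produced without the groupby and concat. This is a minimal sketch using `DataFrame.duplicated` with `keep=False`; `df_demo` is a hypothetical stand-in for `df_brands_load`.

```python
import pandas as pd

# hypothetical stand-in for df_brands_load
df_demo = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "barcode": [511111004790, 511111204923, 511111004790, 511111519928],
    "name": ["alexa", "Brand1", "Bitten Dressing", "Starbucks"],
})

# keep=False marks every member of a duplicate group, not just the repeats
dupes = df_demo[df_demo.duplicated(subset="barcode", keep=False)].sort_values("barcode")
print(dupes)
```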


In [112]:
# create brands table
conn = sqlite3.connect(db_path)
c = conn.cursor()

c.execute('DROP TABLE IF EXISTS brands')

c.execute("""CREATE TABLE IF NOT EXISTS brands (
        id uuid PRIMARY KEY,
        -- there are duplicates in barcode, so for now we load the table without this constraint and note it
        -- barcode numeric UNIQUE,
        barcode numeric,
        category text,
        categoryCode text,
        cpg text,
        name text,
        topBrand bool,
        brandCode text       
    )""")

df_brands_load.to_sql('brands', conn, if_exists='append', index=False)

conn.commit()
conn.close()
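
After a load like this it can be worth reading the table back to confirm what SQLite actually stored. The sketch below uses an in-memory database so it does not touch `fetch.db`. The two-row `df_demo`, the trimmed three-column schema, and the `text` type for `id` are assumptions made for this example only.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# hypothetical two-row stand-in for df_brands_load
df_demo = pd.DataFrame({
    "id": ["601ac115be37ce2ead437551", "601c5460be37ce2ead43755f"],
    "barcode": [511111019862, 511111519928],
    "name": ["test brand", "Starbucks"],
})

# closing() guarantees the connection is closed even if a statement fails
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE brands (id text PRIMARY KEY, barcode numeric, name text)")
    df_demo.to_sql("brands", conn, if_exists="append", index=False)
    conn.commit()

    # row count as stored in SQLite
    n_rows = conn.execute("SELECT COUNT(*) FROM brands").fetchone()[0]
    # barcodes that appear more than once, found with SQL instead of pandas
    dup_barcodes = pd.read_sql_query(
        "SELECT barcode, COUNT(*) AS n FROM brands GROUP BY barcode HAVING COUNT(*) > 1",
        conn,
    )

print(n_rows, len(dup_barcodes))
```

Run against the real table, the second query should list the same repeated barcodes found with pandas earlier.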

In [198]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 495 entries, 0 to 494
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   _id                     495 non-null    object        
 1   _id_cleaned             495 non-null    object        
 2   active                  495 non-null    bool          
 3   createdDate             495 non-null    object        
 4   createdDate_cleaned     495 non-null    int64         
 5   createdDate_cleaned_ts  495 non-null    datetime64[ns]
 6   lastLogin               433 non-null    object        
 7   lastLogin_cleaned       433 non-null    float64       
 8   lastLogin_cleaned_ts    433 non-null    datetime64[ns]
 9   role                    495 non-null    object        
 10  signUpSource            447 non-null    object        
 11  state                   439 non-null    object        
dtypes: bool(1), datetime64[ns](2), float64(1), int64(1

In [209]:
df_receipts.info()
df_receipts.sample(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1119 entries, 0 to 1118
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   _id                           1119 non-null   object        
 1   _id_cleaned                   1119 non-null   object        
 2   bonusPointsEarned             544 non-null    float64       
 3   bonusPointsEarnedReason       544 non-null    object        
 4   createDate                    1119 non-null   object        
 5   createDate_cleaned            1119 non-null   int64         
 6   createDate_cleaned_ts         1119 non-null   datetime64[ns]
 7   dateScanned                   1119 non-null   object        
 8   dateScanned_cleaned           1119 non-null   int64         
 9   dateScanned_cleaned_ts        1119 non-null   datetime64[ns]
 10  finishedDate                  568 non-null    object        
 11  finishedDate_cleaned          

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,modifyDate_cleaned_ts,pointsAwardedDate,pointsAwardedDate_cleaned,pointsAwardedDate_cleaned_ts,pointsEarned,purchaseDate,purchaseDate_cleaned,purchaseDate_cleaned_ts,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
449,{'$oid': '600f41e60a720f0535000033'},600f41e60a720f0535000033,150.0,"Receipt number 5 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1611612646000},1611612646000,2021-01-25 22:10:46.000,{'$date': 1611612646000},1611612646000,2021-01-25 22:10:46.000,{'$date': 1611612647000},1611613000000.0,2021-01-25 22:10:47,{'$date': 1611612647000},1611612647000,2021-01-25 22:10:47.000,{'$date': 1611612647000},1611613000000.0,2021-01-25 22:10:47,150.0,{'$date': 1611526246000},1611526000000.0,2021-01-24 22:10:46,1.0,"[{'barcode': '043000079232', 'description': 'COOL WHIP Whipped Topping 4-Pack Original 4-8 OZ Cups', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}]",FINISHED,1.0,600f41b2bd196811e68ea219
411,{'$oid': '6009fd790a720f0535000006'},6009fd790a720f0535000006,,,{'$date': 1611267449450},1611267449450,2021-01-21 22:17:29.450,{'$date': 1611267449450},1611267449450,2021-01-21 22:17:29.450,,,NaT,{'$date': 1611267449450},1611267449450,2021-01-21 22:17:29.450,,,NaT,,,,NaT,,,SUBMITTED,,6009fd1550b33111943850b1
899,{'$oid': '602279db0a720f05a800009b'},602279db0a720f05a800009b,,,{'$date': 1612872155383},1612872155383,2021-02-09 12:02:35.383,{'$date': 1612872155383},1612872155383,2021-02-09 12:02:35.383,,,NaT,{'$date': 1612872155383},1612872155383,2021-02-09 12:02:35.383,,,NaT,,,,NaT,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
634,{'$oid': '601711900a720f05f800029f'},601711900a720f05f800029f,,,{'$date': 1612124560099},1612124560099,2021-01-31 20:22:40.099,{'$date': 1612124560099},1612124560099,2021-01-31 20:22:40.099,,,NaT,{'$date': 1612124560099},1612124560099,2021-01-31 20:22:40.099,,,NaT,,,,NaT,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
239,{'$oid': '5ffce8000a7214ad4e002b57'},5ffce8000a7214ad4e002b57,,,{'$date': 1610409982000},1610409982000,2021-01-12 00:06:22.000,{'$date': 1610409982000},1610409982000,2021-01-12 00:06:22.000,,,NaT,{'$date': 1610409982000},1610409982000,2021-01-12 00:06:22.000,,,NaT,,,,NaT,,,SUBMITTED,,59c124bae4b0299e55b0f330
689,{'$oid': '6018ace50a7214ad28000057'},6018ace50a7214ad28000057,,,{'$date': 1612229861168},1612229861168,2021-02-02 01:37:41.168,{'$date': 1612229861168},1612229861168,2021-02-02 01:37:41.168,,,NaT,{'$date': 1612229861168},1612229861168,2021-02-02 01:37:41.168,,,NaT,,,,NaT,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
537,{'$oid': '601341160a720f05f8000095'},601341160a720f05f8000095,,,{'$date': 1611874582000},1611874582000,2021-01-28 22:56:22.000,{'$date': 1611874582000},1611874582000,2021-01-28 22:56:22.000,{'$date': 1611874582000},1611875000000.0,2021-01-28 22:56:22,{'$date': 1611874582000},1611874582000,2021-01-28 22:56:22.000,{'$date': 1611874582000},1611875000000.0,2021-01-28 22:56:22,840.0,{'$date': 1611792000000},1611792000000.0,2021-01-28 00:00:00,4.0,"[{'barcode': '036000320893', 'description': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'discountedItemPrice': '21.00', 'finalPrice': '21.00', 'itemNumber': '4023', 'itemPrice': '21.00', 'originalReceiptItemText': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'partnerItemId': '0', 'pointsEarned': '210.0', 'pointsPayerId': '550b2565e4b001d5e9e4146f', 'quantityPurchased': 1, 'rewardsGroup': 'HUGGIES ONE AND DONE SIMPLY CLEAN BABY WIPES 200 COUNT OR LARGER', 'rewardsProductPartnerId': '550b2565e4b001d5e9e4146f'}, {'barcode': '036000320893', 'description': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'discountedItemPrice': '21.00', 'finalPrice': '21.00', 'itemNumber': '4023', 'itemPrice': '21.00', 'originalReceiptItemText': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'partnerItemId': '1', 'pointsEarned': '210.0', 'pointsPayerId': '550b2565e4b001d5e9e4146f', 'quantityPurchased': 1, 'rewardsGroup': 'HUGGIES ONE AND DONE SIMPLY CLEAN BABY WIPES 200 COUNT OR LARGER', 'rewardsProductPartnerId': '550b2565e4b001d5e9e4146f'}, {'barcode': '036000320893', 'description': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'discountedItemPrice': '21.00', 'finalPrice': '21.00', 'itemNumber': '4023', 'itemPrice': '21.00', 'originalReceiptItemText': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'partnerItemId': '2', 'pointsEarned': '210.0', 'pointsPayerId': '550b2565e4b001d5e9e4146f', 
'quantityPurchased': 1, 'rewardsGroup': 'HUGGIES ONE AND DONE SIMPLY CLEAN BABY WIPES 200 COUNT OR LARGER', 'rewardsProductPartnerId': '550b2565e4b001d5e9e4146f'}, {'barcode': '036000320893', 'description': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'discountedItemPrice': '21.00', 'finalPrice': '21.00', 'itemNumber': '4023', 'itemPrice': '21.00', 'originalReceiptItemText': 'HUGGIES SIMPLY CLEAN PREMOISTENED WIPE FRAGRANCE FREE BAG 216 COUNT', 'partnerItemId': '3', 'pointsEarned': '210.0', 'pointsPayerId': '550b2565e4b001d5e9e4146f', 'quantityPurchased': 1, 'rewardsGroup': 'HUGGIES ONE AND DONE SIMPLY CLEAN BABY WIPES 200 COUNT OR LARGER', 'rewardsProductPartnerId': '550b2565e4b001d5e9e4146f'}]",FINISHED,84.0,5fa41775898c7a11a6bcef3e
122,{'$oid': '5ff726910a720f05230005f1'},5ff726910a720f05230005f1,100.0,"Receipt number 6 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1610032785000},1610032785000,2021-01-07 15:19:45.000,{'$date': 1610032785000},1610032785000,2021-01-07 15:19:45.000,{'$date': 1610032785000},1610033000000.0,2021-01-07 15:19:45,{'$date': 1610032790000},1610032790000,2021-01-07 15:19:50.000,{'$date': 1610032785000},1610033000000.0,2021-01-07 15:19:45,100.0,{'$date': 1609977600000},1609978000000.0,2021-01-07 00:00:00,4.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '25.00', 'itemPrice': '25.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 4, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '25.00', 'userFlaggedQuantity': 4}]",FINISHED,25.0,5ff7268eeb7c7d12096da2a9
952,{'$oid': '6024f7570a7214d8e900021f'},6024f7570a7214d8e900021f,,,{'$date': 1613035351539},1613035351539,2021-02-11 09:22:31.539,{'$date': 1613035351539},1613035351539,2021-02-11 09:22:31.539,,,NaT,{'$date': 1613035351539},1613035351539,2021-02-11 09:22:31.539,,,NaT,,,,NaT,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
226,{'$oid': '5ffe23d70a7214ad280068f6'},5ffe23d70a7214ad280068f6,,,{'$date': 1610490839000},1610490839000,2021-01-12 22:33:59.000,{'$date': 1610490839000},1610490839000,2021-01-12 22:33:59.000,{'$date': 1610490839000},1610491000000.0,2021-01-12 22:33:59,{'$date': 1610490839000},1610490839000,2021-01-12 22:33:59.000,,,NaT,,{'$date': 1610409600000},1610410000000.0,2021-01-12 00:00:00,,"[{'description': 'flipbelt level terrain waist pouch, neon yellow, large/32-35', 'discountedItemPrice': '28.57', 'finalPrice': '28.57', 'itemPrice': '28.57', 'originalReceiptItemText': 'flipbelt level terrain waist pouch, neon yellow, large/32-35', 'partnerItemId': '0', 'priceAfterCoupon': '28.57', 'quantityPurchased': 1}]",PENDING,28.57,59c124bae4b0299e55b0f330


In [196]:
df_receipt_items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3240 entries, 0 to 3239
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   receipt_id              3240 non-null   object 
 1   item_index              3240 non-null   int64  
 2   barcode                 3090 non-null   object 
 3   userFlaggedBarcode      337 non-null    object 
 4   description             2859 non-null   object 
 5   userFlaggedDescription  205 non-null    object 
 6   finalPrice              3066 non-null   object 
 7   userFlaggedPrice        299 non-null    object 
 8   quantityPurchased       3066 non-null   float64
 9   userFlaggedQuantity     299 non-null    float64
dtypes: float64(2), int64(1), object(7)
memory usage: 253.2+ KB
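
`finalPrice` and `userFlaggedPrice` are still `object` dtype, which means the prices are held as strings. If numeric prices are wanted before loading, one option is `pd.to_numeric`; the sketch below runs on a hypothetical `df_demo` rather than on `df_receipt_items` itself.

```python
import pandas as pd

# hypothetical stand-in for the price columns of df_receipt_items
df_demo = pd.DataFrame({
    "finalPrice": ["26.00", "1", None, "10.00"],
    "userFlaggedPrice": ["26.00", None, "26.00", "10.00"],
})

price_cols = ["finalPrice", "userFlaggedPrice"]
# errors="coerce" turns anything unparseable into NaN instead of raising
df_demo[price_cols] = df_demo[price_cols].apply(pd.to_numeric, errors="coerce")
print(df_demo.dtypes)
```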
