## Andrew Byrnes: Fetch Rewards Coding Exercise - Data Analyst

This notebook preps the provided data files and uploads them to a SQLite database. It includes an entity relationship diagram of how I've modeled this data.
I chose SQLite for this challenge because it is lightweight and lends itself well to sharing.  
SQLite's flexible typing rules could be a liability for a database that would be continually updated and primarily used for analysis. For that use case I would choose something that follows the SQL standard more strictly.
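
To illustrate the flexible typing mentioned above, here is a minimal sketch against an in-memory database. The table name `demo` and the column `points` are made up for the demonstration and are not part of the model built in this notebook.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (points INTEGER)")

# SQLite accepts all three values even though the column is declared INTEGER.
conn.executemany(
    "INSERT INTO demo (points) VALUES (?)",
    [(500,), ("150",), ("not a number",)],
)

# typeof() reports the storage class actually used for each row.
stored_types = [
    row[0] for row in conn.execute("SELECT typeof(points) FROM demo ORDER BY rowid")
]
print(stored_types)
conn.close()
```

The numeric-looking string is coerced to an integer, but the last value is stored as text without any error, which is why type checks are best done in pandas before loading.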

### Data Sources
- data-modeling.html : coding exercise instructions
- brands.json.gz, receipts.json.gz, users.json.gz : raw data files provided for completion of the challenge

### Changes
- 09-17-2022 : Started project, first look at data, identified transformation tasks 
- 09-18-2022 : cleaned df_brands _id and cpg columns
- 09-19-2022 : wrote function to clean columns with dicts, wrote function to convert epoch time to timestamps
- 09-20-2022 : refactored functions, applied functions cleaning and converting data, explored df_receipts.rewardsReceiptItemList, notes on stakeholder questions

In [1]:
import pandas as pd
from pathlib import Path
import os
from datetime import datetime
import gzip
import json
import sqlite3

### File Locations

In [2]:
today = datetime.today()
print(today)
in_brands = Path.cwd() / "data" / "raw" / "brands.json.gz"
in_receipts = Path.cwd() / "data" / "raw" / "receipts.json.gz"
in_users = Path.cwd() / "data" / "raw" / "users.json.gz"
db_path = Path.cwd() / "data" / "processed" / "fetch.db"

2022-09-20 17:45:03.546243


### Drop database if exists

In [3]:
if os.path.exists(db_path):
    os.remove(db_path)
    print("The db has been removed successfully")
else:
    print("The db does not exist!")

The db does not exist!


### Formatting

In [4]:
pd.set_option('display.max_colwidth', None)
# pd.set_option('display.max_rows', None)
pd.reset_option('display.max_rows')
pd.set_option('display.max_columns', None)
# pd.reset_option('display.max_columns')


### Load JSON data into pandas DataFrames

In [5]:
df_brands = pd.read_json(in_brands,lines=True,compression='gzip')
df_receipts = pd.read_json(in_receipts,lines=True,compression='gzip')
df_users = pd.read_json(in_users,lines=True,compression='gzip')

### First look at data

### **brands**  
**to-do**:
- ~extract _ids~
- ~extract cpg ids~

In [6]:
df_brands.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1167 entries, 0 to 1166
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   _id           1167 non-null   object 
 1   barcode       1167 non-null   int64  
 2   category      1012 non-null   object 
 3   categoryCode  517 non-null    object 
 4   cpg           1167 non-null   object 
 5   name          1167 non-null   object 
 6   topBrand      555 non-null    float64
 7   brandCode     933 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 73.1+ KB


**Brand Data Schema**
- _id: brand uuid
- barcode: the barcode on the item
- brandCode: String that corresponds with the brand column in a partner product file
- category: The name of the category in which the brand sells products
- categoryCode: The category code that references a BrandCategory
- cpg: reference to CPG collection
- topBrand: Boolean indicator for whether the brand should be featured as a 'top brand'
- name: Brand name

In [7]:
df_brands

Unnamed: 0,_id,barcode,category,categoryCode,cpg,name,topBrand,brandCode
0,{'$oid': '601ac115be37ce2ead437551'},511111019862,Baking,BAKING,"{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}",test brand @1612366101024,0.0,
1,{'$oid': '601c5460be37ce2ead43755f'},511111519928,Beverages,BEVERAGES,"{'$id': {'$oid': '5332f5fbe4b03c9a25efd0ba'}, '$ref': 'Cogs'}",Starbucks,0.0,STARBUCKS
2,{'$oid': '601ac142be37ce2ead43755d'},511111819905,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",test brand @1612366146176,0.0,TEST BRANDCODE @1612366146176
3,{'$oid': '601ac142be37ce2ead43755a'},511111519874,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",test brand @1612366146051,0.0,TEST BRANDCODE @1612366146051
4,{'$oid': '601ac142be37ce2ead43755e'},511111319917,Candy & Sweets,CANDY_AND_SWEETS,"{'$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}, '$ref': 'Cogs'}",test brand @1612366146827,0.0,TEST BRANDCODE @1612366146827
...,...,...,...,...,...,...,...,...
1162,{'$oid': '5f77274dbe37ce6b592e90c0'},511111116752,Baking,BAKING,"{'$ref': 'Cogs', '$id': {'$oid': '5f77274dbe37ce6b592e90bf'}}",test brand @1601644365844,,
1163,{'$oid': '5dc1fca91dda2c0ad7da64ae'},511111706328,Breakfast & Cereal,,"{'$ref': 'Cogs', '$id': {'$oid': '53e10d6368abd3c7065097cc'}}",Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
1164,{'$oid': '5f494c6e04db711dd8fe87e7'},511111416173,Candy & Sweets,CANDY_AND_SWEETS,"{'$ref': 'Cogs', '$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}}",test brand @1598639215217,,TEST BRANDCODE @1598639215217
1165,{'$oid': '5a021611e4b00efe02b02a57'},511111400608,Grocery,,"{'$ref': 'Cogs', '$id': {'$oid': '5332f5f6e4b03c9a25efd0b4'}}",LIPTON TEA Leaves,0.0,LIPTON TEA Leaves


### **receipts**  
**to-do**:
- ~extract _ids~
- ~extract and convert createDate~
- ~extract and convert dateScanned~
- ~extract and convert finishedDate~
- ~extract and convert modifyDate~
- ~extract and convert pointsAwardedDate~
- ~extract and convert purchaseDate~
- create receipt_items table using the rewardsReceiptItemList
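
The "extract and convert" items above can be sketched as follows. This assumes the `$date` values are milliseconds since the Unix epoch in UTC, which the magnitudes suggest but the data files do not state. The helper name `epoch_ms_to_datetime` is hypothetical and is not the function used later in this notebook.

```python
from datetime import datetime, timezone


def epoch_ms_to_datetime(value):
    """Convert a {'$date': <epoch ms>} dict to a timezone-aware UTC datetime."""
    # Missing dates arrive as NaN/None rather than a dict; return None for those.
    if not isinstance(value, dict) or "$date" not in value:
        return None
    # Assumption: the stored number is milliseconds since 1970-01-01 UTC.
    return datetime.fromtimestamp(value["$date"] / 1000, tz=timezone.utc)


sample = {"$date": 1609687531000}  # createDate of the first receipt in the raw data
converted = epoch_ms_to_datetime(sample)
print(converted.isoformat())
```

Returning `None` for missing values keeps columns such as finishedDate, which are only partly populated, from raising during conversion.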

In [8]:
df_receipts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1119 entries, 0 to 1118
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   _id                      1119 non-null   object 
 1   bonusPointsEarned        544 non-null    float64
 2   bonusPointsEarnedReason  544 non-null    object 
 3   createDate               1119 non-null   object 
 4   dateScanned              1119 non-null   object 
 5   finishedDate             568 non-null    object 
 6   modifyDate               1119 non-null   object 
 7   pointsAwardedDate        537 non-null    object 
 8   pointsEarned             609 non-null    float64
 9   purchaseDate             671 non-null    object 
 10  purchasedItemCount       635 non-null    float64
 11  rewardsReceiptItemList   679 non-null    object 
 12  rewardsReceiptStatus     1119 non-null   object 
 13  totalSpent               684 non-null    float64
 14  userId                  

**Receipts Data Schema**
- _id: uuid for this receipt
- bonusPointsEarned: Number of bonus points that were awarded upon receipt completion
- bonusPointsEarnedReason: event that triggered bonus points
- createDate: The date that the event was created
- dateScanned: Date that the user scanned their receipt
- finishedDate: Date that the receipt finished processing
- modifyDate: The date the event was modified
- pointsAwardedDate: The date we awarded points for the transaction
- pointsEarned: The number of points earned for the receipt
- purchaseDate: the date of the purchase
- purchasedItemCount: Count of number of items on the receipt
- rewardsReceiptItemList: The items that were purchased on the receipt
- rewardsReceiptStatus: status of the receipt through receipt validation and processing
- totalSpent: The total amount on the receipt
- userId: string id back to the User collection for the user who scanned the receipt
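
Because rewardsReceiptItemList holds a list of items per receipt, a receipt_items table needs one row per item. The sketch below shows one way to flatten it in plain Python. The function name `flatten_receipt_items` and the `receipt_id` column are assumptions made for illustration, and the sample records are abbreviated versions of rows in the raw data.

```python
def flatten_receipt_items(receipts):
    """Return one row per purchased item, keyed by the parent receipt id."""
    rows = []
    for receipt in receipts:
        items = receipt.get("rewardsReceiptItemList")
        # Receipts without an item list (e.g. SUBMITTED ones) contribute no rows.
        if not isinstance(items, list):
            continue
        for item in items:
            # receipt_id is a hypothetical foreign-key column name.
            rows.append({"receipt_id": receipt["_id"]["$oid"], **item})
    return rows


# Two abbreviated records modelled on the raw data.
sample_receipts = [
    {
        "_id": {"$oid": "5ff1e1bb0a720f052300056b"},
        "rewardsReceiptItemList": [
            {"barcode": "4011", "partnerItemId": "1", "quantityPurchased": 1},
            {"barcode": "028400642255", "partnerItemId": "2", "quantityPurchased": 1},
        ],
    },
    {"_id": {"$oid": "603d0b710a720fde1000042a"}, "rewardsReceiptItemList": None},
]

receipt_items = flatten_receipt_items(sample_receipts)
print(len(receipt_items))
```

The resulting list of dicts can be passed to `pd.DataFrame(...)`. Items carry different keys, so fields missing from an item become NaN in the resulting frame.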

In [9]:
df_receipts

Unnamed: 0,_id,bonusPointsEarned,bonusPointsEarnedReason,createDate,dateScanned,finishedDate,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
0,{'$oid': '5ff1e1eb0a720f0523000575'},500.0,"Receipt number 2 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1609687531000},{'$date': 1609687531000},{'$date': 1609687531000},{'$date': 1609687536000},{'$date': 1609687531000},500.0,{'$date': 1609632000000},5.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '26.00', 'itemPrice': '26.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 5, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '26.00', 'userFlaggedQuantity': 5}]",FINISHED,26.00,5ff1e1eacfcf6c399c274ae6
1,{'$oid': '5ff1e1bb0a720f052300056b'},150.0,"Receipt number 5 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1609687483000},{'$date': 1609687483000},{'$date': 1609687483000},{'$date': 1609687488000},{'$date': 1609687483000},150.0,{'$date': 1609601083000},2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '028400642255', 'description': 'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ', 'finalPrice': '10.00', 'itemPrice': '10.00', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'pointsNotAwardedReason': 'Action not allowed for user and CPG', 'pointsPayerId': '5332f5fbe4b03c9a25efd0ba', 'preventTargetGapPoints': True, 'quantityPurchased': 1, 'rewardsGroup': 'DORITOS SPICY SWEET CHILI SINGLE SERVE', 'rewardsProductPartnerId': '5332f5fbe4b03c9a25efd0ba', 'userFlaggedBarcode': '028400642255', 'userFlaggedDescription': 'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ', 'userFlaggedNewItem': True, 'userFlaggedPrice': '10.00', 'userFlaggedQuantity': 1}]",FINISHED,11.00,5ff1e194b6a9d73a3a9f1052
2,{'$oid': '5ff1e1f10a720f052300057a'},5.0,All-receipts receipt bonus,{'$date': 1609687537000},{'$date': 1609687537000},,{'$date': 1609687542000},,5.0,{'$date': 1609632000000},1.0,"[{'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '26.00', 'userFlaggedQuantity': 3}]",REJECTED,10.00,5ff1e1f1cfcf6c399c274b0b
3,{'$oid': '5ff1e1ee0a7214ada100056f'},5.0,All-receipts receipt bonus,{'$date': 1609687534000},{'$date': 1609687534000},{'$date': 1609687534000},{'$date': 1609687539000},{'$date': 1609687534000},5.0,{'$date': 1609632000000},4.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '28.00', 'itemPrice': '28.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 4, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '28.00', 'userFlaggedQuantity': 4}]",FINISHED,28.00,5ff1e1eacfcf6c399c274ae6
4,{'$oid': '5ff1e1d20a7214ada1000561'},5.0,All-receipts receipt bonus,{'$date': 1609687506000},{'$date': 1609687506000},{'$date': 1609687511000},{'$date': 1609687511000},{'$date': 1609687506000},5.0,{'$date': 1609601106000},2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '1234', 'finalPrice': '2.56', 'itemPrice': '2.56', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'preventTargetGapPoints': True, 'quantityPurchased': 3, 'userFlaggedBarcode': '1234', 'userFlaggedDescription': '', 'userFlaggedNewItem': True, 'userFlaggedPrice': '2.56', 'userFlaggedQuantity': 3}]",FINISHED,1.00,5ff1e194b6a9d73a3a9f1052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1114,{'$oid': '603cc0630a720fde100003e6'},25.0,COMPLETE_NONPARTNER_RECEIPT,{'$date': 1614594147000},{'$date': 1614594147000},,{'$date': 1614594148000},,25.0,{'$date': 1597622400000},2.0,"[{'barcode': 'B076FJ92M4', 'description': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'discountedItemPrice': '22.97', 'finalPrice': '22.97', 'itemPrice': '22.97', 'originalReceiptItemText': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'partnerItemId': '0', 'priceAfterCoupon': '22.97', 'quantityPurchased': 1}, {'barcode': 'B07BRRLSVC', 'description': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'discountedItemPrice': '11.99', 'finalPrice': '11.99', 'itemPrice': '11.99', 'originalReceiptItemText': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'partnerItemId': '1', 'priceAfterCoupon': '11.99', 'quantityPurchased': 1}]",REJECTED,34.96,5fc961c3b8cfca11a077dd33
1115,{'$oid': '603d0b710a720fde1000042a'},,,{'$date': 1614613361873},{'$date': 1614613361873},,{'$date': 1614613361873},,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
1116,{'$oid': '603cf5290a720fde10000413'},,,{'$date': 1614607657664},{'$date': 1614607657664},,{'$date': 1614607657664},,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
1117,{'$oid': '603ce7100a7217c72c000405'},25.0,COMPLETE_NONPARTNER_RECEIPT,{'$date': 1614604048000},{'$date': 1614604048000},,{'$date': 1614604049000},,25.0,{'$date': 1597622400000},2.0,"[{'barcode': 'B076FJ92M4', 'description': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'discountedItemPrice': '22.97', 'finalPrice': '22.97', 'itemPrice': '22.97', 'originalReceiptItemText': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'partnerItemId': '0', 'priceAfterCoupon': '22.97', 'quantityPurchased': 1}, {'barcode': 'B07BRRLSVC', 'description': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'discountedItemPrice': '11.99', 'finalPrice': '11.99', 'itemPrice': '11.99', 'originalReceiptItemText': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'partnerItemId': '1', 'priceAfterCoupon': '11.99', 'quantityPurchased': 1}]",REJECTED,34.96,5fc961c3b8cfca11a077dd33


### **users**  
**to-do**:
- ~extract _ids~
- ~extract and convert createdDate~
- ~extract and convert lastLogin~

In [10]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 495 entries, 0 to 494
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   _id           495 non-null    object
 1   active        495 non-null    bool  
 2   createdDate   495 non-null    object
 3   lastLogin     433 non-null    object
 4   role          495 non-null    object
 5   signUpSource  447 non-null    object
 6   state         439 non-null    object
dtypes: bool(1), object(6)
memory usage: 23.8+ KB


**Users Data Schema**
- _id: user Id
- state: state abbreviation
- createdDate: when the user created their account
- lastLogin: last time the user was recorded logging in to the app
- role: constant value set to 'CONSUMER'
- active: indicates if the user is active; only Fetch will de-activate an account with this flag

In [11]:
df_users

Unnamed: 0,_id,active,createdDate,lastLogin,role,signUpSource,state
0,{'$oid': '5ff1e194b6a9d73a3a9f1052'},True,{'$date': 1609687444800},{'$date': 1609687537858},consumer,Email,WI
1,{'$oid': '5ff1e194b6a9d73a3a9f1052'},True,{'$date': 1609687444800},{'$date': 1609687537858},consumer,Email,WI
2,{'$oid': '5ff1e194b6a9d73a3a9f1052'},True,{'$date': 1609687444800},{'$date': 1609687537858},consumer,Email,WI
3,{'$oid': '5ff1e1eacfcf6c399c274ae6'},True,{'$date': 1609687530554},{'$date': 1609687530597},consumer,Email,WI
4,{'$oid': '5ff1e194b6a9d73a3a9f1052'},True,{'$date': 1609687444800},{'$date': 1609687537858},consumer,Email,WI
...,...,...,...,...,...,...,...
490,{'$oid': '54943462e4b07e684157a532'},True,{'$date': 1418998882381},{'$date': 1614963143204},fetch-staff,,
491,{'$oid': '54943462e4b07e684157a532'},True,{'$date': 1418998882381},{'$date': 1614963143204},fetch-staff,,
492,{'$oid': '54943462e4b07e684157a532'},True,{'$date': 1418998882381},{'$date': 1614963143204},fetch-staff,,
493,{'$oid': '54943462e4b07e684157a532'},True,{'$date': 1418998882381},{'$date': 1614963143204},fetch-staff,,


## Cleaning the data

### First attempt of extracting the values from columns containing dictionaries
I've chosen to include the following section of Python code to help illustrate the thought process that led to writing the value_from_dict() function. This code includes some notes as comments, but the full explanation of my process is in the documentation of the resulting function.  
The function should still work if you execute the following code, but if you are stepping through this notebook on a fresh kernel you can skip the cells between **Start** and **End**.

**Start** - you *can* skip executing the code starting here

In [12]:
# confirm the values in _id are recognized as Python objects, in this case a dictionary
type(df_brands['_id'][0])

dict

In [13]:
df_brands['_id'][0]['$oid']

'601ac115be37ce2ead437551'

In [14]:
# extract the _id column as a series
_id_series = df_brands['_id']
# create a list to collect the values from the dictionary objects in _id
_id_clean = []

# iterate through _id_series, appending the extracted values to _id_clean
for index, value in _id_series.items():
    _id_clean.append(value['$oid'])

# confirm no nulls in _id_clean
assert None not in _id_clean, "there is at least one None/null value in _id_clean"
# confirm _id_clean is the same length as the original _id column in df_brands
assert len(_id_clean) == len(df_brands['_id']), "the length of the original column and the cleaned column are not the same"

# add _id_clean to df_brands after the _id column
df_brands.insert(1, '_id_clean', _id_clean)

df_brands

Unnamed: 0,_id,_id_clean,barcode,category,categoryCode,cpg,name,topBrand,brandCode
0,{'$oid': '601ac115be37ce2ead437551'},601ac115be37ce2ead437551,511111019862,Baking,BAKING,"{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}",test brand @1612366101024,0.0,
1,{'$oid': '601c5460be37ce2ead43755f'},601c5460be37ce2ead43755f,511111519928,Beverages,BEVERAGES,"{'$id': {'$oid': '5332f5fbe4b03c9a25efd0ba'}, '$ref': 'Cogs'}",Starbucks,0.0,STARBUCKS
2,{'$oid': '601ac142be37ce2ead43755d'},601ac142be37ce2ead43755d,511111819905,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",test brand @1612366146176,0.0,TEST BRANDCODE @1612366146176
3,{'$oid': '601ac142be37ce2ead43755a'},601ac142be37ce2ead43755a,511111519874,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",test brand @1612366146051,0.0,TEST BRANDCODE @1612366146051
4,{'$oid': '601ac142be37ce2ead43755e'},601ac142be37ce2ead43755e,511111319917,Candy & Sweets,CANDY_AND_SWEETS,"{'$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}, '$ref': 'Cogs'}",test brand @1612366146827,0.0,TEST BRANDCODE @1612366146827
...,...,...,...,...,...,...,...,...,...
1162,{'$oid': '5f77274dbe37ce6b592e90c0'},5f77274dbe37ce6b592e90c0,511111116752,Baking,BAKING,"{'$ref': 'Cogs', '$id': {'$oid': '5f77274dbe37ce6b592e90bf'}}",test brand @1601644365844,,
1163,{'$oid': '5dc1fca91dda2c0ad7da64ae'},5dc1fca91dda2c0ad7da64ae,511111706328,Breakfast & Cereal,,"{'$ref': 'Cogs', '$id': {'$oid': '53e10d6368abd3c7065097cc'}}",Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
1164,{'$oid': '5f494c6e04db711dd8fe87e7'},5f494c6e04db711dd8fe87e7,511111416173,Candy & Sweets,CANDY_AND_SWEETS,"{'$ref': 'Cogs', '$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}}",test brand @1598639215217,,TEST BRANDCODE @1598639215217
1165,{'$oid': '5a021611e4b00efe02b02a57'},5a021611e4b00efe02b02a57,511111400608,Grocery,,"{'$ref': 'Cogs', '$id': {'$oid': '5332f5f6e4b03c9a25efd0b4'}}",LIPTON TEA Leaves,0.0,LIPTON TEA Leaves


In [15]:
# examining the values in cpg
type(df_brands['cpg'])

pandas.core.series.Series

In [16]:
df_brands['cpg'][0]

{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}

In [17]:
df_brands['cpg'][0]['$id']['$oid']

'601ac114be37ce2ead437550'

In [18]:
# extract the cpg column as a series
cpg_series = df_brands['cpg']
# create a list to collect the values from the dictionary objects in cpg
cpg_clean = []

# iterate through cpg_series, appending the extracted values to cpg_clean
for index, value in cpg_series.items():
    cpg_clean.append(value['$id']['$oid'])

# confirm no nulls in cpg_clean
assert None not in cpg_clean, "there is at least one None/null value in cpg_clean"
# confirm cpg_clean is the same length as the original cpg column in df_brands
assert len(cpg_clean) == len(df_brands['cpg']), "the length of the original column and the cleaned column are not the same"

# add cpg_clean to df_brands after the cpg column
df_brands.insert(6, 'cpg_clean', cpg_clean)

df_brands

Unnamed: 0,_id,_id_clean,barcode,category,categoryCode,cpg,cpg_clean,name,topBrand,brandCode
0,{'$oid': '601ac115be37ce2ead437551'},601ac115be37ce2ead437551,511111019862,Baking,BAKING,"{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}",601ac114be37ce2ead437550,test brand @1612366101024,0.0,
1,{'$oid': '601c5460be37ce2ead43755f'},601c5460be37ce2ead43755f,511111519928,Beverages,BEVERAGES,"{'$id': {'$oid': '5332f5fbe4b03c9a25efd0ba'}, '$ref': 'Cogs'}",5332f5fbe4b03c9a25efd0ba,Starbucks,0.0,STARBUCKS
2,{'$oid': '601ac142be37ce2ead43755d'},601ac142be37ce2ead43755d,511111819905,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",601ac142be37ce2ead437559,test brand @1612366146176,0.0,TEST BRANDCODE @1612366146176
3,{'$oid': '601ac142be37ce2ead43755a'},601ac142be37ce2ead43755a,511111519874,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",601ac142be37ce2ead437559,test brand @1612366146051,0.0,TEST BRANDCODE @1612366146051
4,{'$oid': '601ac142be37ce2ead43755e'},601ac142be37ce2ead43755e,511111319917,Candy & Sweets,CANDY_AND_SWEETS,"{'$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}, '$ref': 'Cogs'}",5332fa12e4b03c9a25efd1e7,test brand @1612366146827,0.0,TEST BRANDCODE @1612366146827
...,...,...,...,...,...,...,...,...,...,...
1162,{'$oid': '5f77274dbe37ce6b592e90c0'},5f77274dbe37ce6b592e90c0,511111116752,Baking,BAKING,"{'$ref': 'Cogs', '$id': {'$oid': '5f77274dbe37ce6b592e90bf'}}",5f77274dbe37ce6b592e90bf,test brand @1601644365844,,
1163,{'$oid': '5dc1fca91dda2c0ad7da64ae'},5dc1fca91dda2c0ad7da64ae,511111706328,Breakfast & Cereal,,"{'$ref': 'Cogs', '$id': {'$oid': '53e10d6368abd3c7065097cc'}}",53e10d6368abd3c7065097cc,Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
1164,{'$oid': '5f494c6e04db711dd8fe87e7'},5f494c6e04db711dd8fe87e7,511111416173,Candy & Sweets,CANDY_AND_SWEETS,"{'$ref': 'Cogs', '$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}}",5332fa12e4b03c9a25efd1e7,test brand @1598639215217,,TEST BRANDCODE @1598639215217
1165,{'$oid': '5a021611e4b00efe02b02a57'},5a021611e4b00efe02b02a57,511111400608,Grocery,,"{'$ref': 'Cogs', '$id': {'$oid': '5332f5f6e4b03c9a25efd0b4'}}",5332f5f6e4b03c9a25efd0b4,LIPTON TEA Leaves,0.0,LIPTON TEA Leaves


In [19]:
dataframe = df_brands
column_to_clean ='cpg'
dirty_series = dataframe[column_to_clean]
dirty_series
insert_at = dataframe.columns.get_loc(column_to_clean) + 1
cleaned_list = []
dict_key = "['$id']['$oid']"
dict_key_value = "value" + dict_key


for index, value in dirty_series.items():
    cleaned_list.append(eval(dict_key_value))

cleaned_list



['601ac114be37ce2ead437550',
 '5332f5fbe4b03c9a25efd0ba',
 '601ac142be37ce2ead437559',
 '601ac142be37ce2ead437559',
 '5332fa12e4b03c9a25efd1e7',
 '601ac142be37ce2ead437559',
 '601ac142be37ce2ead437559',
 '559c2234e4b06aca36af13c6',
 '5a734034e4b0d58f376be874',
 '59ba6f1ce4b092b29c167346',
 '5f4bf556be37ce0b44915549',
 '5332f5f2e4b03c9a25efd0aa',
 '559c2234e4b06aca36af13c6',
 '5d5d4fd16d5f3b23d1bc7905',
 '5332f5fbe4b03c9a25efd0ba',
 '5332f709e4b03c9a25efd0f1',
 '5d9b4f591dda2c6225a284aa',
 '5f358338be37ce443bf9d557',
 '5fb28549be37ce522e165cb4',
 '5332f5f6e4b03c9a25efd0b4',
 '55b62995e4b0d8e685c14213',
 '5d9b4f591dda2c6225a284aa',
 '559c2234e4b06aca36af13c6',
 '53e10d6368abd3c7065097cc',
 '5332f5ebe4b03c9a25efd0a8',
 '5e9f12f5be37ce3e45b6a77e',
 '5332f5f6e4b03c9a25efd0b4',
 '5d5d4fd16d5f3b23d1bc7905',
 '5f493e72be37ce64d0ae36c2',
 '5f4936dcbe37ce52f8314fd8',
 '559c2234e4b06aca36af13c6',
 '5fd2a0aebe37ce49eb72c0ed',
 '53e10d6368abd3c7065097cc',
 '5f494c5d04db711dd8fe87e2',
 '5332f5f3e4b0

In [20]:
dataframe

Unnamed: 0,_id,_id_clean,barcode,category,categoryCode,cpg,cpg_clean,name,topBrand,brandCode
0,{'$oid': '601ac115be37ce2ead437551'},601ac115be37ce2ead437551,511111019862,Baking,BAKING,"{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}",601ac114be37ce2ead437550,test brand @1612366101024,0.0,
1,{'$oid': '601c5460be37ce2ead43755f'},601c5460be37ce2ead43755f,511111519928,Beverages,BEVERAGES,"{'$id': {'$oid': '5332f5fbe4b03c9a25efd0ba'}, '$ref': 'Cogs'}",5332f5fbe4b03c9a25efd0ba,Starbucks,0.0,STARBUCKS
2,{'$oid': '601ac142be37ce2ead43755d'},601ac142be37ce2ead43755d,511111819905,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",601ac142be37ce2ead437559,test brand @1612366146176,0.0,TEST BRANDCODE @1612366146176
3,{'$oid': '601ac142be37ce2ead43755a'},601ac142be37ce2ead43755a,511111519874,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, '$ref': 'Cogs'}",601ac142be37ce2ead437559,test brand @1612366146051,0.0,TEST BRANDCODE @1612366146051
4,{'$oid': '601ac142be37ce2ead43755e'},601ac142be37ce2ead43755e,511111319917,Candy & Sweets,CANDY_AND_SWEETS,"{'$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}, '$ref': 'Cogs'}",5332fa12e4b03c9a25efd1e7,test brand @1612366146827,0.0,TEST BRANDCODE @1612366146827
...,...,...,...,...,...,...,...,...,...,...
1162,{'$oid': '5f77274dbe37ce6b592e90c0'},5f77274dbe37ce6b592e90c0,511111116752,Baking,BAKING,"{'$ref': 'Cogs', '$id': {'$oid': '5f77274dbe37ce6b592e90bf'}}",5f77274dbe37ce6b592e90bf,test brand @1601644365844,,
1163,{'$oid': '5dc1fca91dda2c0ad7da64ae'},5dc1fca91dda2c0ad7da64ae,511111706328,Breakfast & Cereal,,"{'$ref': 'Cogs', '$id': {'$oid': '53e10d6368abd3c7065097cc'}}",53e10d6368abd3c7065097cc,Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
1164,{'$oid': '5f494c6e04db711dd8fe87e7'},5f494c6e04db711dd8fe87e7,511111416173,Candy & Sweets,CANDY_AND_SWEETS,"{'$ref': 'Cogs', '$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}}",5332fa12e4b03c9a25efd1e7,test brand @1598639215217,,TEST BRANDCODE @1598639215217
1165,{'$oid': '5a021611e4b00efe02b02a57'},5a021611e4b00efe02b02a57,511111400608,Grocery,,"{'$ref': 'Cogs', '$id': {'$oid': '5332f5f6e4b03c9a25efd0b4'}}",5332f5f6e4b03c9a25efd0b4,LIPTON TEA Leaves,0.0,LIPTON TEA Leaves


In [21]:
# reset df_brands to the initial load of raw data; if you executed the code above,
# run this cell to avoid any exceptions with value_from_dict()
df_brands = pd.read_json(in_brands,lines=True,compression='gzip')

**End** - you can continue executing the following cells

### Writing a function that extracts values from a dataframe column containing dictionaries and adds them back to the dataframe as a new column

In [22]:
# I realized I'd be doing this multiple times, so it is better to make a function.
# Define a function that cleans a df column containing dictionaries by extracting the specified
# values and adding them as a new column in the dataframe
def value_from_dict(dataframe, column_to_clean, dict_key, allow_nulls = False):
    """Returns dataframe with a 'cleaned' column inserted after the column that was cleaned.
    
    :param dataframe: A dataframe with a column containing dictionaries, from which one value is
        to be extracted
    :type dataframe: Pandas DataFrame
    :param column_to_clean: The name of the column containing dictionaries
    :type column_to_clean: str
    :param dict_key: A str containing the key associated with the value we want to extract,
        e.g, "['$id']['$oid']"
    :type dict_key: str
    :param allow_nulls: A boolean value indicating if None/Null/NaN/NaT values should be allowed,
        defaults to False
    :type allow_nulls: bool
    
    :raises AssertionError: 'there is at least one None value in the cleaned data' if param allow_nulls = False
    :raises AssertionError: 'the length of the original column and the cleaned column are not the same'
    :excepts ValueError: If a column has already been cleaned, print message confirming
        no values were added to the dataframe
    
    :rtype: Pandas DataFrame
    :return: the original DataFrame with an additional column containing 'cleaned' values
    """
    # setting variables:
    # extract the column we want to clean from the dataframe as a series
    dirty_series = dataframe[column_to_clean]
    # create a list to store the cleaned values in
    cleaned_list = []
    # name of the column we'll be adding to the DataFrame
    cleaned_column_name = column_to_clean + '_cleaned'
    # location to insert the cleaned column, after the 'dirty' column
    insert_at = dataframe.columns.get_loc(column_to_clean) + 1
    # translate dict_key str into a format useable in the following for loop
    value_dict_key = "value" + dict_key
    
    # iterate through dirty_series appending extracted values to cleaned_list
    for index, value in dirty_series.items():
        # if there is no dictionary or any other lookup issue, append None;
        # catching Exception (not a bare except) lets KeyboardInterrupt through
        try:
            cleaned_list.append(eval(value_dict_key))
        except Exception:
            cleaned_list.append(None)

    # handle allow_nulls param flag
    if not allow_nulls:
        # confirm no nulls in cleaned_list
        assert None not in cleaned_list, "there is at least one None value in the cleaned data"
    
    # confirm cleaned_list is the same length as dirty_series
    assert len(cleaned_list) == len(dirty_series), "the length of the original column and the cleaned column are not the same"
    
    # add the cleaned_list data to the original dataframe following column_to_clean
    try:
        dataframe.insert(insert_at, cleaned_column_name, cleaned_list)
    except ValueError as error:
        print(f"{str(error)}, {cleaned_column_name} was not added to the dataframe")

    # return the modified dataframe
    return dataframe
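
value_from_dict() builds its lookup with eval(), which executes whatever string is passed as dict_key. As a possible alternative, not what this notebook uses, the same nested lookup can be written without eval() by passing the keys as a tuple. The helper name `get_nested` is hypothetical.

```python
def get_nested(value, keys, default=None):
    """Follow a sequence of keys into nested dicts, returning default on any miss."""
    current = value
    for key in keys:
        # Stop as soon as the current level is not a dict or lacks the key
        # (this also covers NaN/None cells that hold no dictionary at all).
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# First cpg value from df_brands, copied from the output above.
cpg = {'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}
print(get_nested(cpg, ('$id', '$oid')))
print(get_nested(float('nan'), ('$oid',)))
```

Applied to a column, this would look like `df_brands['cpg'].map(lambda v: get_nested(v, ('$id', '$oid')))`. The key tuple `('$id', '$oid')` plays the role of the string `"['$id']['$oid']"` passed to value_from_dict().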

### Using value_from_dict() to extract values from all the columns containing dictionaries and add them to the dataframe as new columns:

#### df_brands._id :

In [23]:
# look at the first value in df_brands._id
df_brands['_id'][0]

{'$oid': '601ac115be37ce2ead437551'}

In [24]:
#extract the value
df_brands['_id'][0]['$oid']

'601ac115be37ce2ead437551'

In [25]:
# set a variable to use for the dict_key param of value_from_dict() function
brand_id_dict_key = "['$oid']"

In [26]:
# clean df_brands._id and confirm by viewing a sample of the dataframe
value_from_dict(df_brands, "_id", brand_id_dict_key)
df_brands.sample(2)

Unnamed: 0,_id,_id_cleaned,barcode,category,categoryCode,cpg,name,topBrand,brandCode
479,{'$oid': '5f872b7fbe37ce66db5dd978'},5f872b7fbe37ce66db5dd978,511111217022,Baking,BAKING,"{'$ref': 'Cogs', '$id': {'$oid': '5f872b7fbe37ce66db5dd975'}}",test brand @1602694015743,,TEST BRANDCODE @1602694015743
568,{'$oid': '5a5d2bcbe4b0db471c2d0435'},5a5d2bcbe4b0db471c2d0435,511111600077,Beauty & Personal Care,,"{'$ref': 'Cogs', '$id': {'$oid': '53e10d6368abd3c7065097cc'}}",Salon Selectives,,SALON SELECTIVES


#### df_brands.cpg:

In [27]:
# look at the first value in df_brands.cpg
df_brands['cpg'][0]

{'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}

In [28]:
#extract the value
df_brands['cpg'][0]['$id']['$oid']

'601ac114be37ce2ead437550'

In [29]:
# set a variable to use for the dict_key param of value_from_dict() function
brand_cpg_dict_key = "['$id']['$oid']"

In [30]:
# clean df_brands.cpg and confirm by viewing a sample of the dataframe
value_from_dict(df_brands, "cpg", brand_cpg_dict_key)
df_brands.sample(2)

Unnamed: 0,_id,_id_cleaned,barcode,category,categoryCode,cpg,cpg_cleaned,name,topBrand,brandCode
513,{'$oid': '5332fa74e4b03c9a25efd21e'},5332fa74e4b03c9a25efd21e,511111003038,,,"{'$ref': 'Cpgs', '$id': {'$oid': '5332f5ebe4b03c9a25efd0a8'}}",5332f5ebe4b03c9a25efd0a8,Sprite,,
1163,{'$oid': '5dc1fca91dda2c0ad7da64ae'},5dc1fca91dda2c0ad7da64ae,511111706328,Breakfast & Cereal,,"{'$ref': 'Cogs', '$id': {'$oid': '53e10d6368abd3c7065097cc'}}",53e10d6368abd3c7065097cc,Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
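
The `cpg` values are nested dictionaries, so the key path has two levels (`'$id'`, then `'$oid'`). The sketch below shows one way to walk such a path while tolerating missing values. It is an illustration only: `get_nested` is a hypothetical helper, it takes the path as a list of keys rather than the string form passed to value_from_dict(), and it is not how value_from_dict() itself is implemented.

```python
def get_nested(value, keys):
    """Walk `keys` through nested dicts; return None if any level is missing."""
    for key in keys:
        # stop early on None/NaN or on a dict that lacks the key
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


# first value of df_brands.cpg, copied from the cell output above
cpg = {'$id': {'$oid': '601ac114be37ce2ead437550'}, '$ref': 'Cogs'}

oid = get_nested(cpg, ['$id', '$oid'])
missing = get_nested(None, ['$id', '$oid'])
print(oid)
print(missing)
```

Returning None for a missing level leaves the decision about whether nulls are acceptable to the caller, which mirrors the role of the allow_nulls flag.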


#### df_receipts._id:

In [31]:
# look at the first value in df_receipts._id
df_receipts['_id'][0]

{'$oid': '5ff1e1eb0a720f0523000575'}

In [32]:
#extract the value
df_receipts['_id'][0]['$oid']

'5ff1e1eb0a720f0523000575'

In [33]:
# set a variable to use for the dict_key param of value_from_dict() function
receipts_id_dict_key = "['$oid']"

In [34]:
# clean df_receipts._id and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "_id", receipts_id_dict_key)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,dateScanned,finishedDate,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
208,{'$oid': '5ffcb4a20a720f0515000006'},5ffcb4a20a720f0515000006,100.0,"Receipt number 6 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1610396834000},{'$date': 1610396834000},{'$date': 1610396839000},{'$date': 1610396839000},{'$date': 1610396834000},100.0,{'$date': 1610310434000},2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '1234', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '1234', 'userFlaggedDescription': '', 'userFlaggedNewItem': True}]",FINISHED,1.0,5ffcb47d04929111f6e9256c
142,{'$oid': '5ff7265a0a7214ada10005fb'},5ff7265a0a7214ada10005fb,750.0,"Receipt number 1 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1610032730000},{'$date': 1610032730000},{'$date': 1610032731000},{'$date': 1610032735000},{'$date': 1610032731000},760.0,{'$date': 1609946330000},1.0,"[{'barcode': '079400066619', 'competitiveProduct': True, 'description': 'SUAVE PROFESSIONALS MOISTURIZING SHAMPOO LIQUID PLASTIC BOTTLE RP 12.6 OZ - 0079400066612', 'finalPrice': '1', 'itemPrice': '1', 'needsFetchReview': False, 'originalMetaBriteBarcode': '080878042197', 'partnerItemId': '1', 'pointsEarned': '10.0', 'pointsPayerId': '5332f5f6e4b03c9a25efd0b4', 'preventTargetGapPoints': True, 'quantityPurchased': 1, 'rewardsGroup': 'SUAVE HAIR CARE', 'rewardsProductPartnerId': '5332f5f6e4b03c9a25efd0b4', 'targetPrice': '800', 'userFlaggedBarcode': '079400066619'}]",FINISHED,1.0,5ff7264e8f142f11dd189504


#### df_receipts.createDate:

In [35]:
# look at the first value in df_receipts.createDate
df_receipts['createDate'][0]

{'$date': 1609687531000}

In [36]:
# extract the value
df_receipts['createDate'][0]['$date']

1609687531000

In [37]:
# set a variable to use for the dict_key param of value_from_dict() function
receipts_createDate_dict_key = "['$date']"

In [38]:
# clean df_receipts.createDate and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "createDate", receipts_createDate_dict_key)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,finishedDate,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
435,{'$oid': '600ed47e0a7214ada2000014'},600ed47e0a7214ada2000014,5.0,All-receipts receipt bonus,{'$date': 1611584638000},1611584638000,{'$date': 1611584638000},{'$date': 1611584639000},{'$date': 1611584639000},{'$date': 1611584639000},5.0,{'$date': 1610461438000},1.0,"[{'barcode': '022174070214', 'description': 'CJN INJ & LSN GLD GRLN KIT BOX 1 CT', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}]",FINISHED,1.0,54943462e4b07e684157a532
183,{'$oid': '5ffc9d8c0a7214adca000049'},5ffc9d8c0a7214adca000049,750.0,"Receipt number 1 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1610390924000},1610390924000,{'$date': 1610390924000},,{'$date': 1610390925000},{'$date': 1610390925000},750.0,{'$date': 1610304524000},1.0,"[{'barcode': '075925306254', 'competitiveProduct': True, 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1, 'rewardsGroup': 'SARGENTO NATURAL SHREDDED CHEESE 6OZ OR LARGER', 'rewardsProductPartnerId': '5e7cf838f221c312e698a628'}, {'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '034100573065', 'userFlaggedDescription': 'MILLER LITE 24 PACK 12OZ CAN', 'userFlaggedNewItem': True, 'userFlaggedPrice': '29.00', 'userFlaggedQuantity': 1}, {'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '3', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '034100573065', 'userFlaggedDescription': 'MILLER LITE 24 PACK 12OZ CAN', 'userFlaggedNewItem': True, 'userFlaggedPrice': '29.00', 'userFlaggedQuantity': 1}, {'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '4', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '034100573065', 'userFlaggedDescription': 'MILLER LITE 24 PACK 12OZ CAN', 'userFlaggedNewItem': True, 'userFlaggedPrice': '29.00', 'userFlaggedQuantity': 1}, {'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '5', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '034100573065', 'userFlaggedDescription': 'MILLER LITE 24 PACK 12OZ CAN', 'userFlaggedNewItem': True, 'userFlaggedPrice': '29.00', 'userFlaggedQuantity': 1}, {'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '6', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '034100573065', 'userFlaggedDescription': 'MILLER LITE 24 PACK 12OZ 
CAN', 'userFlaggedNewItem': True, 'userFlaggedPrice': '29.00', 'userFlaggedQuantity': 1}, {'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '7', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '034100573065', 'userFlaggedDescription': 'MILLER LITE 24 PACK 12OZ CAN', 'userFlaggedNewItem': True, 'userFlaggedPrice': '29.00', 'userFlaggedQuantity': 1}, {'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '8', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '034100573065', 'userFlaggedDescription': 'MILLER LITE 24 PACK 12OZ CAN', 'userFlaggedNewItem': True, 'userFlaggedPrice': '29.00', 'userFlaggedQuantity': 1}, {'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '9', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '034100573065', 'userFlaggedDescription': 'MILLER LITE 24 PACK 12OZ CAN', 'userFlaggedNewItem': True, 'userFlaggedPrice': '29.00', 'userFlaggedQuantity': 1}, {'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '10', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '034100573065', 'userFlaggedDescription': 'MILLER LITE 24 PACK 12OZ CAN', 'userFlaggedNewItem': True, 'userFlaggedPrice': '29.00', 'userFlaggedQuantity': 1}, {'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '11', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '034100573065', 'userFlaggedDescription': 'MILLER LITE 24 PACK 12OZ CAN', 'userFlaggedNewItem': True, 'userFlaggedPrice': '29.00', 'userFlaggedQuantity': 1}]",FLAGGED,1.0,5ffc9d8cb3348b11c9338927


#### df_receipts.dateScanned:
same format as df_receipts.createDate

In [39]:
# set a variable to use for the dict_key param of value_from_dict() function
receipts_dateScanned_dict_key = "['$date']"

In [40]:
# clean df_receipts.dateScanned and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "dateScanned", receipts_dateScanned_dict_key)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,dateScanned_cleaned,finishedDate,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
569,{'$oid': '601455a80a720f05f8000107'},601455a80a720f05f8000107,5.0,All-receipts receipt bonus,{'$date': 1611945384000},1611945384000,{'$date': 1611945384000},1611945384000,{'$date': 1611945390000},{'$date': 1611945390000},{'$date': 1611945385000},5.0,{'$date': 1611858984000},2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '1234', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '1234', 'userFlaggedDescription': '', 'userFlaggedNewItem': True}]",FINISHED,1.0,6014554a67804a1228b20ca9
650,{'$oid': '601830ce0a7214ad500002df'},601830ce0a7214ad500002df,750.0,"Receipt number 1 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1612198094000},1612198094000,{'$date': 1612198094000},1612198094000,{'$date': 1612198095000},{'$date': 1612198099000},{'$date': 1612198095000},750.0,{'$date': 1612137600000},5.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '28.00', 'itemPrice': '28.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 5, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '28.00', 'userFlaggedQuantity': 5}]",FINISHED,28.0,601830ce9a1b091205b618e8


#### df_receipts.finishedDate:
same format as df_receipts.createDate - can include None/Null/NaN

In [41]:
# set a variable to use for the dict_key param of value_from_dict() function
receipts_finishedDate_dict_key = "['$date']"

In [42]:
# clean df_receipts.finishedDate and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "finishedDate", receipts_finishedDate_dict_key)
df_receipts.sample(2)

AssertionError: there is at least one None value in the cleaned data

In [43]:
# clean df_receipts.finishedDate and confirm by viewing a sample of the dataframe
# setting allow_nulls = True
value_from_dict(df_receipts, "finishedDate", receipts_finishedDate_dict_key, allow_nulls=True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,dateScanned_cleaned,finishedDate,finishedDate_cleaned,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
571,{'$oid': '6014323b0a720f05f80000b2'},6014323b0a720f05f80000b2,5.0,All-receipts receipt bonus,{'$date': 1611936315000},1611936315000,{'$date': 1611936315000},1611936315000,{'$date': 1611936319000},1611936000000.0,{'$date': 1611936323000},{'$date': 1611936317000},5.0,{'$date': 1611849915000},2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'needsFetchReview': False, 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '028400642255', 'description': 'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ', 'finalPrice': '10.00', 'itemPrice': '10.00', 'needsFetchReview': False, 'partnerItemId': '2', 'pointsNotAwardedReason': 'Action not allowed for user and CPG', 'pointsPayerId': '5332f5fbe4b03c9a25efd0ba', 'preventTargetGapPoints': True, 'quantityPurchased': 1, 'rewardsGroup': 'DORITOS SPICY SWEET CHILI SINGLE SERVE', 'rewardsProductPartnerId': '5332f5fbe4b03c9a25efd0ba', 'targetPrice': '800', 'userFlaggedBarcode': '028400642255', 'userFlaggedNewItem': True, 'userFlaggedPrice': '10.00', 'userFlaggedQuantity': 1}]",FINISHED,11.0,54943462e4b07e684157a532
382,{'$oid': '60088e670a720f05fa00011d'},60088e670a720f05fa00011d,5.0,All-receipts receipt bonus,{'$date': 1611173479000},1611173479000,{'$date': 1611173479000},1611173479000,{'$date': 1611173480000},1611173000000.0,{'$date': 1611173480000},{'$date': 1611173480000},5.0,{'$date': 1611087079000},1.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}]",FINISHED,1.0,54943462e4b07e684157a532
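
The `finishedDate_cleaned` values above display as floats such as 1611936000000.0. This appears to be because pandas stores a numeric column containing missing values as float64 by default and shows it with limited display precision; the stored values still carry the full millisecond figure. A minimal sketch, using a small stand-in Series rather than df_receipts, of how the nullable `Int64` dtype keeps such values as integers:

```python
import pandas as pd

# stand-in for a cleaned epoch column with one missing value
cleaned = pd.Series([1611936319000, None])
print(cleaned.dtype)  # float64: the missing value forces a float column

# the nullable integer dtype keeps whole numbers and marks the gap as <NA>
as_int = cleaned.astype("Int64")
print(as_int)
```

This is optional for the workflow in this notebook, since the values are converted to timestamps afterwards.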


#### df_receipts.modifyDate:
same format as df_receipts.createDate

In [44]:
# set a variable to use for the dict_key param of value_from_dict() function
receipts_modifyDate_dict_key = "['$date']"

In [45]:
# clean df_receipts.modifyDate and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "modifyDate", receipts_modifyDate_dict_key)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,dateScanned_cleaned,finishedDate,finishedDate_cleaned,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
451,{'$oid': '600f42050a7214ada200003b'},600f42050a7214ada200003b,5.0,All-receipts receipt bonus,{'$date': 1611612677000},1611612677000,{'$date': 1611612677000},1611612677000,{'$date': 1611612677000},1611613000000.0,{'$date': 1611612677000},1611612677000,{'$date': 1611612677000},5.0,{'$date': 1611122400000},5.0,"[{'barcode': '043000200520', 'description': 'JELL-O GELATIN RASPBERRY 6 OZ BOX', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '1', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '043000200520', 'description': 'JELL-O GELATIN RASPBERRY 6 OZ BOX', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '2', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '043000200520', 'description': 'JELL-O GELATIN RASPBERRY 6 OZ BOX', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '3', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '043000200520', 'description': 'JELL-O GELATIN RASPBERRY 6 OZ BOX', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '4', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}, {'barcode': '043000200520', 'description': 'JELL-O GELATIN RASPBERRY 6 OZ BOX', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '5', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}]",FINISHED,49.95,600f41b2bd196811e68ea219
697,{'$oid': '60189cbe0a720f05f4000062'},60189cbe0a720f05f4000062,5.0,All-receipts receipt bonus,{'$date': 1612225726000},1612225726000,{'$date': 1612225726000},1612225726000,{'$date': 1612225728000},1612226000000.0,{'$date': 1612225732000},1612225732000,{'$date': 1612225727000},5.0,{'$date': 1612139326000},2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '028400642255', 'description': 'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ', 'finalPrice': '10.00', 'itemPrice': '10.00', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'pointsNotAwardedReason': 'Action not allowed for user and CPG', 'pointsPayerId': '5332f5fbe4b03c9a25efd0ba', 'preventTargetGapPoints': True, 'quantityPurchased': 1, 'rewardsGroup': 'DORITOS SPICY SWEET CHILI SINGLE SERVE', 'rewardsProductPartnerId': '5332f5fbe4b03c9a25efd0ba', 'userFlaggedBarcode': '028400642255', 'userFlaggedDescription': 'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ', 'userFlaggedNewItem': True, 'userFlaggedPrice': '10.00', 'userFlaggedQuantity': 1}]",FINISHED,11.0,60189c74c8b50e11d8454eff


#### df_receipts.pointsAwardedDate:
same format as df_receipts.createDate - can include None/Null/NaN

In [46]:
# set a variable to use for the dict_key param of value_from_dict() function
receipts_pointsAwardedDate_dict_key = "['$date']"

In [47]:
# clean df_receipts.pointsAwardedDate and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "pointsAwardedDate", receipts_pointsAwardedDate_dict_key)
df_receipts.sample(2)

AssertionError: there is at least one None value in the cleaned data

In [48]:
# clean df_receipts.pointsAwardedDate and confirm by viewing a sample of the dataframe, allowing nulls
value_from_dict(df_receipts, "pointsAwardedDate", receipts_pointsAwardedDate_dict_key, allow_nulls=True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,dateScanned_cleaned,finishedDate,finishedDate_cleaned,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
726,{'$oid': '601af5aa0a720f05f4000214'},601af5aa0a720f05f4000214,,,{'$date': 1612379562686},1612379562686,{'$date': 1612379562686},1612379562686,,,{'$date': 1612379562686},1612379562686,,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
73,{'$oid': '5ff4ce430a720f05230005c9'},5ff4ce430a720f05230005c9,5.0,All-receipts receipt bonus,{'$date': 1609879107000},1609879107000,{'$date': 1609879107000},1609879107000,{'$date': 1609879107000},1609879000000.0,{'$date': 1609879112000},1609879112000,{'$date': 1609879107000},1609879000000.0,5.0,{'$date': 1609804800000},1.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '29.00', 'itemPrice': '29.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 1, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '29.00', 'userFlaggedQuantity': 1}]",FINISHED,29.0,5ff4ce3dc3d63511e2a484dc


#### df_receipts.purchaseDate:
same format as df_receipts.createDate - can include None/Null/NaN

In [49]:
# set a variable to use for the dict_key param of value_from_dict() function
receipts_purchaseDate_dict_key = "['$date']"

In [50]:
# clean df_receipts.purchaseDate and confirm by viewing a sample of the dataframe
value_from_dict(df_receipts, "purchaseDate", receipts_purchaseDate_dict_key)
df_receipts.sample(2)

AssertionError: there is at least one None value in the cleaned data

In [51]:
# clean df_receipts.purchaseDate and confirm by viewing a sample of the dataframe allowing nulls
value_from_dict(df_receipts, "purchaseDate", receipts_purchaseDate_dict_key, allow_nulls=True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,dateScanned,dateScanned_cleaned,finishedDate,finishedDate_cleaned,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
613,{'$oid': '60161ca50a7214ad500001be'},60161ca50a7214ad500001be,,,{'$date': 1612061861027},1612061861027,{'$date': 1612061861027},1612061861027,,,{'$date': 1612061861027},1612061861027,,,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
201,{'$oid': '5ffcb4c10a7214ad4e000014'},5ffcb4c10a7214ad4e000014,5.0,All-receipts receipt bonus,{'$date': 1610396865000},1610396865000,{'$date': 1610396865000},1610396865000,{'$date': 1610396865000},1610397000000.0,{'$date': 1610396870000},1610396870000,{'$date': 1610396865000},1610397000000.0,5.0,{'$date': 1610323200000},1610323000000.0,3.0,"[{'barcode': '4011', 'finalPrice': '20.00', 'itemPrice': '20.00', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 3, 'userFlaggedBarcode': '4011', 'userFlaggedDescription': '', 'userFlaggedNewItem': True, 'userFlaggedPrice': '20.00', 'userFlaggedQuantity': 3}]",FINISHED,20.0,5ffcb4bc04929111f6e92608


#### df_users._id:

In [52]:
# look at the first value in df_users._id
df_users['_id'][0]

{'$oid': '5ff1e194b6a9d73a3a9f1052'}

In [53]:
#extract the value
df_users['_id'][0]['$oid']

'5ff1e194b6a9d73a3a9f1052'

In [54]:
# set a variable to use for the dict_key param of value_from_dict() function
users_id_dict_key = "['$oid']"

In [55]:
# clean df_users._id and confirm by viewing a sample of the dataframe
value_from_dict(df_users, "_id", users_id_dict_key)
df_users.sample(2)

Unnamed: 0,_id,_id_cleaned,active,createdDate,lastLogin,role,signUpSource,state
252,{'$oid': '6009e60450b3311194385009'},6009e60450b3311194385009,True,{'$date': 1611261445244},,consumer,Email,WI
273,{'$oid': '600ed42e43298911ce45d1fa'},600ed42e43298911ce45d1fa,True,{'$date': 1611584559286},{'$date': 1611584964970},consumer,Email,WI


#### df_users.createdDate:  
same format as df_receipts.createDate

In [56]:
# set a variable to use for the dict_key param of value_from_dict() function
users_createdDate_dict_key = "['$date']"

In [57]:
# clean df_users.createdDate and confirm by viewing a sample of the dataframe
value_from_dict(df_users, "createdDate", users_createdDate_dict_key)
df_users.sample(2)

Unnamed: 0,_id,_id_cleaned,active,createdDate,createdDate_cleaned,lastLogin,role,signUpSource,state
332,{'$oid': '601465c567804a1228b20f89'},601465c567804a1228b20f89,True,{'$date': 1611949509523},1611949509523,{'$date': 1611949509567},consumer,Email,WI
13,{'$oid': '5ff1e1eacfcf6c399c274ae6'},5ff1e1eacfcf6c399c274ae6,True,{'$date': 1609687530554},1609687530554,{'$date': 1609687530597},consumer,Email,WI


#### df_users.lastLogin:  
same format as df_receipts.createDate - can include None/Null/NaN

In [58]:
# set a variable to use for the dict_key param of value_from_dict() function
users_lastLogin_dict_key = "['$date']"

In [59]:
# clean df_users.lastLogin and confirm by viewing a sample of the dataframe
value_from_dict(df_users, "lastLogin", users_lastLogin_dict_key)
df_users.sample(2)

AssertionError: there is at least one None value in the cleaned data

In [60]:
# clean df_users.lastLogin and confirm by viewing a sample of the dataframe, allowing nulls
value_from_dict(df_users, "lastLogin", users_lastLogin_dict_key, allow_nulls=True)
df_users.sample(2)

Unnamed: 0,_id,_id_cleaned,active,createdDate,createdDate_cleaned,lastLogin,lastLogin_cleaned,role,signUpSource,state
72,{'$oid': '5ff5d15aeb7c7d12096d91a2'},5ff5d15aeb7c7d12096d91a2,True,{'$date': 1609945434680},1609945434680,{'$date': 1609945690009},1609946000000.0,consumer,Email,WI
180,{'$oid': '6002475cfb296c121a81b98d'},6002475cfb296c121a81b98d,True,{'$date': 1610762076571},1610762076571,,,consumer,Email,WI
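
The date columns above all follow the same pattern: pull out the `'$date'` value, decide whether nulls are acceptable, and insert the result next to the source column. The sketch below shows how that pattern could be driven from a single mapping of column names to a nulls-allowed flag. It is an assumption-laden illustration: `extract_date`, `nullable` and the two-row dataframe are hypothetical stand-ins, not value_from_dict() or the real df_receipts.

```python
import pandas as pd


def extract_date(value):
    """Return the '$date' entry of a dict, or None for a missing value."""
    return value.get("$date") if isinstance(value, dict) else None


# tiny stand-in for df_receipts with one complete and one nullable date column
df = pd.DataFrame({
    "createDate": [{"$date": 1609687531000}, {"$date": 1610396834000}],
    "finishedDate": [{"$date": 1609687531000}, None],
})

# column name -> whether missing values are expected in that column
nullable = {"createDate": False, "finishedDate": True}

for column, allow_nulls in nullable.items():
    cleaned = df[column].map(extract_date)
    if not allow_nulls:
        assert cleaned.notna().all(), f"unexpected null in {column}"
    # place the cleaned column directly after its source column
    df.insert(df.columns.get_loc(column) + 1, column + "_cleaned", cleaned)

print(df.columns.tolist())
```

Keeping the nulls-allowed decision in one mapping documents which columns are expected to be incomplete, instead of discovering it through a failed assertion per column.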


### Writing a function to convert date data from epoch to timestamps

In [61]:
def epoch_to_timestamp(dataframe, column_to_convert, allow_nulls = False):
    """Returns dataframe with a new column containing timestamps converted from epoch.
    
    :param dataframe: A dataframe with a column containing epoch milliseconds as ints or floats
    :type dataframe: Pandas DataFrame
    :param column_to_convert: The name of the column containing epoch milliseconds
    :type column_to_convert: str
    :param allow_nulls: A boolean value indicating if None(Null) values should be allowed,
        defaults to False
    :type allow_nulls: bool
    
    :raises AssertionError: 'there is at least one None/Null/NaN/NaT value in the converted timestamp data' 
        if param allow_nulls = False
    :raises AssertionError: 'the length of the original column and the converted column are not the same'
    :excepts ValueError: If a column has already been converted, print message confirming
        no values were added to the dataframe
    
    :rtype: Pandas DataFrame
    :return: the original DataFrame with an additional column containing converted epoch values as timestamps
    """
    #setting variables
    # name of the new column we'll be adding to the dataframe
    converted_column_name = column_to_convert + "_ts"
    # location to insert the converted column, directly after column_to_convert
    insert_at = dataframe.columns.get_loc(column_to_convert) + 1
    # create a series of timestamps from the epoch time column_to_convert
    # pd.to_datetime() converts a scalar, array-like, Series or DataFrame/dict-like to a pandas datetime object
    # the data in the epoch columns is milliseconds since the epoch start; we round to 1ms for consistency
    time_stamps = pd.to_datetime(dataframe[column_to_convert], unit='ms').dt.round('1ms')
    
    # handle allow_nulls param flag
    if not allow_nulls:
        # confirm there are no nulls in time_stamps
        assert not time_stamps.isnull().values.any(), "there is at least one None/Null/NaN/NaT value in the converted timestamp data"
    
    # confirm time_stamps is the same length as column_to_convert
    assert len(time_stamps) == len(dataframe[column_to_convert]), "the length of the original column and the converted column are not the same"
    
    # add the timestamps data to the original dataframe following column_to_convert
    try:
        dataframe.insert(insert_at, converted_column_name, time_stamps)
    except ValueError as error:
        print(f"{str(error)}, {converted_column_name} was not added to the dataframe")
    
    # return the modified dataframe
    return dataframe
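
The core of epoch_to_timestamp() is the `pd.to_datetime(..., unit='ms')` call. A minimal sketch of that step in isolation, using two epoch values copied from the df_users samples in this notebook plus one missing value; the Series here is a stand-in, and the `.dt.round` accessor is used for the rounding:

```python
import pandas as pd

# epoch milliseconds from df_users.createdDate_cleaned, plus one missing value
epoch_ms = pd.Series([1612210743551, 1514389647059, None])

# unit="ms" tells pandas the numbers count milliseconds since 1970-01-01
time_stamps = pd.to_datetime(epoch_ms, unit="ms").dt.round("1ms")

print(time_stamps)
# the missing value becomes NaT, which is what the allow_nulls check looks for
print(time_stamps.isnull().values.any())
```

The first two results should match the `createdDate_cleaned_ts` values shown for the same rows in the df_users output (2021-02-01 20:19:03.551 and 2017-12-27 15:47:27.059).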
    

### Using epoch_to_timestamp() to convert columns with epoch values to timestamps and add them to the dataframe as a new column:

In [62]:
# convert df_users.createdDate_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_users, 'createdDate_cleaned')
df_users.sample(2)

Unnamed: 0,_id,_id_cleaned,active,createdDate,createdDate_cleaned,createdDate_cleaned_ts,lastLogin,lastLogin_cleaned,role,signUpSource,state
375,{'$oid': '60186237c8b50e11d8454d5f'},60186237c8b50e11d8454d5f,True,{'$date': 1612210743551},1612210743551,2021-02-01 20:19:03.551,,,consumer,Email,
428,{'$oid': '5a43c08fe4b014fd6b6a0612'},5a43c08fe4b014fd6b6a0612,True,{'$date': 1514389647059},1514389647059,2017-12-27 15:47:27.059,{'$date': 1613146957155},1613147000000.0,consumer,,


In [63]:
# convert df_users.lastLogin_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_users, 'lastLogin_cleaned')
df_users.sample(2)

AssertionError: there is at least one None/Null/NaN/NaT value in the converted timestamp data

In [64]:
# convert df_users.lastLogin_cleaned to timestamps and confirm by viewing a sample of the dataframe, allowing for Nulls
epoch_to_timestamp(df_users, 'lastLogin_cleaned', allow_nulls=True)
df_users.sample(2)

Unnamed: 0,_id,_id_cleaned,active,createdDate,createdDate_cleaned,createdDate_cleaned_ts,lastLogin,lastLogin_cleaned,lastLogin_cleaned_ts,role,signUpSource,state
428,{'$oid': '5a43c08fe4b014fd6b6a0612'},5a43c08fe4b014fd6b6a0612,True,{'$date': 1514389647059},1514389647059,2017-12-27 15:47:27.059,{'$date': 1613146957155},1613147000000.0,2021-02-12 16:22:37.155,consumer,,
488,{'$oid': '54943462e4b07e684157a532'},54943462e4b07e684157a532,True,{'$date': 1418998882381},1418998882381,2014-12-19 14:21:22.381,{'$date': 1614963143204},1614963000000.0,2021-03-05 16:52:23.204,fetch-staff,,


In [65]:
# convert df_receipts.createDate_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'createDate_cleaned')
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,finishedDate,finishedDate_cleaned,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
791,{'$oid': '601e71780a7214ad2500004f'},601e71780a7214ad2500004f,,,{'$date': 1612607864030},1612607864030,2021-02-06 10:37:44.030,{'$date': 1612607864030},1612607864030,,,{'$date': 1612607864030},1612607864030,,,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
312,{'$oid': '60024fa00a7214ad4c00007f'},60024fa00a7214ad4c00007f,,,{'$date': 1610764192583},1610764192583,2021-01-16 02:29:52.583,{'$date': 1610764192583},1610764192583,,,{'$date': 1610764192583},1610764192583,,,,,,,,SUBMITTED,,60024f24e257124ec6b99a13


In [66]:
# convert df_receipts.dateScanned_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'dateScanned_cleaned')
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
621,{'$oid': '6016e0570a720f05f8000289'},6016e0570a720f05f8000289,,,{'$date': 1612111959295},1612111959295,2021-01-31 16:52:39.295,{'$date': 1612111959295},1612111959295,2021-01-31 16:52:39.295,,,{'$date': 1612111959295},1612111959295,,,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
896,{'$oid': '602231350a7214d8e900009f'},602231350a7214d8e900009f,,,{'$date': 1612853557105},1612853557105,2021-02-09 06:52:37.105,{'$date': 1612853557105},1612853557105,2021-02-09 06:52:37.105,,,{'$date': 1612853557105},1612853557105,,,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33


In [67]:
# convert df_receipts.finishedDate_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'finishedDate_cleaned')
df_receipts.sample(2)

AssertionError: there is at least one None/Null/NaN/NaT value in the converted timestamp data

In [68]:
# convert df_receipts.finishedDate_cleaned to timestamps and confirm by viewing a sample of the dataframe, allowing nulls
epoch_to_timestamp(df_receipts, 'finishedDate_cleaned', allow_nulls = True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
824,{'$oid': '601f81560a720f053c0000f1'},601f81560a720f053c0000f1,,,{'$date': 1612677462031},1612677462031,2021-02-07 05:57:42.031,{'$date': 1612677462031},1612677462031,2021-02-07 05:57:42.031,,,NaT,{'$date': 1612677462031},1612677462031,,,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
345,{'$oid': '600746fd0a720f05fa00001f'},600746fd0a720f05fa00001f,750.0,"Receipt number 1 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1611089661000},1611089661000,2021-01-19 20:54:21.000,{'$date': 1611089661000},1611089661000,2021-01-19 20:54:21.000,{'$date': 1611089661000},1611090000000.0,2021-01-19 20:54:21,{'$date': 1611089661000},1611089661000,{'$date': 1611089661000},1611090000000.0,755.0,{'$date': 1611003261000},1611003000000.0,1.0,"[{'barcode': '087684002872', 'description': 'CAPRI SUN Roarin' Waters Fruit Punch Beverage, 10 count, 60 Fl Oz', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'pointsEarned': '5.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'CAPRI SUN ROARIN' WATERS BEVERAGE DRINK', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}]",FINISHED,1.0,600746fd6e64691717e8cfb5


In [69]:
# convert df_receipts.modifyDate_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'modifyDate_cleaned')
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,modifyDate_cleaned_ts,pointsAwardedDate,pointsAwardedDate_cleaned,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
1012,{'$oid': '60394dff0a720fde100000d1'},60394dff0a720fde100000d1,,,{'$date': 1614368255062},1614368255062,2021-02-26 19:37:35.062,{'$date': 1614368255062},1614368255062,2021-02-26 19:37:35.062,,,NaT,{'$date': 1614368255062},1614368255062,2021-02-26 19:37:35.062,,,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33
499,{'$oid': '6010be5a0a720f053500005a'},6010be5a0a720f053500005a,25.0,COMPLETE_NONPARTNER_RECEIPT,{'$date': 1611710042000},1611710042000,2021-01-27 01:14:02.000,{'$date': 1611710042000},1611710042000,2021-01-27 01:14:02.000,{'$date': 1611710042000},1611710000000.0,2021-01-27 01:14:02,{'$date': 1611710042000},1611710042000,2021-01-27 01:14:02.000,{'$date': 1611710042000},1611710000000.0,25.0,{'$date': 1611623642000},1611624000000.0,1.0,"[{'barcode': '016000493025', 'description': 'Cocoa Puffs Ice Cream Scoops Rte Cereal - Medium Size', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'pointsNotAwardedReason': 'Action not allowed for user and CPG', 'pointsPayerId': '5332f5f3e4b03c9a25efd0ae', 'quantityPurchased': 1, 'rewardsGroup': 'COCOA PUFFS CEREAL MEDIUM SIZE', 'rewardsProductPartnerId': '5332f5f3e4b03c9a25efd0ae', 'targetPrice': '800'}]",FINISHED,1.0,6010bddaa4b74c120bd19dfb


In [70]:
# convert df_receipts.pointsAwardedDate_cleaned to timestamps and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'pointsAwardedDate_cleaned')
df_receipts.sample(2)

AssertionError: there is at least one None/Null/NaN/NaT value in the converted timestamp data

In [71]:
# convert df_receipts.pointsAwardedDate_cleaned to timestamps, allowing nulls, and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'pointsAwardedDate_cleaned', allow_nulls=True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,modifyDate_cleaned_ts,pointsAwardedDate,pointsAwardedDate_cleaned,pointsAwardedDate_cleaned_ts,pointsEarned,purchaseDate,purchaseDate_cleaned,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
159,{'$oid': '5ff874170a7214ada100065f'},5ff874170a7214ada100065f,25.0,COMPLETE_NONPARTNER_RECEIPT,{'$date': 1610118167000},1610118167000,2021-01-08 15:02:47.000,{'$date': 1610118167000},1610118167000,2021-01-08 15:02:47.000,{'$date': 1610118167000},1610118000000.0,2021-01-08 15:02:47,{'$date': 1610118167000},1610118167000,2021-01-08 15:02:47.000,{'$date': 1610118167000},1610118000000.0,2021-01-08 15:02:47,25.0,{'$date': 1609459200000},1609459000000.0,1.0,"[{'barcode': '016000156234', 'brandCode': 'BRAND', 'description': 'Fiber One Oats & Chocolate Chewy Bar - 2 Count', 'finalPrice': '10.00', 'itemPrice': '10.00', 'partnerItemId': '0', 'pointsNotAwardedReason': 'Action not allowed for user and CPG', 'pointsPayerId': '5332f5f3e4b03c9a25efd0ae', 'quantityPurchased': 1, 'rewardsGroup': 'FIBER ONE BARS', 'rewardsProductPartnerId': '5332f5f3e4b03c9a25efd0ae'}]",FINISHED,10.0,5ff873d1b3348b11c9337716
1104,{'$oid': '603cf2ce0a7217c72c000413'},603cf2ce0a7217c72c000413,,,{'$date': 1614607054396},1614607054396,2021-03-01 13:57:34.396,{'$date': 1614607054396},1614607054396,2021-03-01 13:57:34.396,,,NaT,{'$date': 1614607054396},1614607054396,2021-03-01 13:57:34.396,,,NaT,,,,,,SUBMITTED,,5fc961c3b8cfca11a077dd33


In [72]:
# convert df_receipts.purchaseDate_cleaned to timestamps and confirm by viewing a sample of the dataframe 
epoch_to_timestamp(df_receipts, 'purchaseDate_cleaned')
df_receipts.sample(2)

AssertionError: there is at least one None/Null/NaN/NaT value in the converted timestamp data

In [73]:
# convert df_receipts.purchaseDate_cleaned to timestamps, allowing nulls, and confirm by viewing a sample of the dataframe
epoch_to_timestamp(df_receipts, 'purchaseDate_cleaned', allow_nulls=True)
df_receipts.sample(2)

Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,modifyDate_cleaned_ts,pointsAwardedDate,pointsAwardedDate_cleaned,pointsAwardedDate_cleaned_ts,pointsEarned,purchaseDate,purchaseDate_cleaned,purchaseDate_cleaned_ts,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
24,{'$oid': '5ff1e1c60a7214ada100055e'},5ff1e1c60a7214ada100055e,5.0,All-receipts receipt bonus,{'$date': 1609687494000},1609687494000,2021-01-03 15:24:54.000,{'$date': 1609687494000},1609687494000,2021-01-03 15:24:54.000,{'$date': 1609687499000},1609687000000.0,2021-01-03 15:24:59,{'$date': 1609687499000},1609687499000,2021-01-03 15:24:59.000,{'$date': 1609687494000},1609687000000.0,2021-01-03 15:24:54,5.0,{'$date': 1609601094000},1609601000000.0,2021-01-02 15:24:54,2.0,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '1234', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '1234', 'userFlaggedDescription': '', 'userFlaggedNewItem': True}]",FINISHED,1.0,5ff1e194b6a9d73a3a9f1052
979,{'$oid': '6025fde20a720f05a8000294'},6025fde20a720f05a8000294,,,{'$date': 1613102562814},1613102562814,2021-02-12 04:02:42.814,{'$date': 1613102562814},1613102562814,2021-02-12 04:02:42.814,,,NaT,{'$date': 1613102562814},1613102562814,2021-02-12 04:02:42.814,,,NaT,,,,NaT,,,SUBMITTED,,5fc961c3b8cfca11a077dd33


In [74]:
# convert df_receipts.purchaseDate_cleaned to timestamps and confirm by viewing a sample of the dataframe, allowing nulls
epoch_to_timestamp(df_receipts, 'purchaseDate_cleaned', allow_nulls=True)
df_receipts.sample(2)

cannot insert purchaseDate_cleaned_ts, already exists, purchaseDate_cleaned_ts was not added to the dataframe


Unnamed: 0,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,modifyDate_cleaned_ts,pointsAwardedDate,pointsAwardedDate_cleaned,pointsAwardedDate_cleaned_ts,pointsEarned,purchaseDate,purchaseDate_cleaned,purchaseDate_cleaned_ts,purchasedItemCount,rewardsReceiptItemList,rewardsReceiptStatus,totalSpent,userId
661,{'$oid': '6018682c0a720f05f4000011'},6018682c0a720f05f4000011,,,{'$date': 1612212268843},1612212268843,2021-02-01 20:44:28.843,{'$date': 1612212268843},1612212268843,2021-02-01 20:44:28.843,,,NaT,{'$date': 1612212268843},1612212268843,2021-02-01 20:44:28.843,,,NaT,,,,NaT,,,SUBMITTED,,60186237c8b50e11d8454d5f
444,{'$oid': '600f489d0a7214ada200004a'},600f489d0a7214ada200004a,750.0,"Receipt number 1 completed, bonus point schedule DEFAULT (5cefdcacf3693e0b50e83a36)",{'$date': 1611614365000},1611614365000,2021-01-25 22:39:25.000,{'$date': 1611614365000},1611614365000,2021-01-25 22:39:25.000,{'$date': 1611614366000},1611614000000.0,2021-01-25 22:39:26,{'$date': 1611614366000},1611614366000,2021-01-25 22:39:26.000,{'$date': 1611614366000},1611614000000.0,2021-01-25 22:39:26,755.0,{'$date': 1611527965000},1611528000000.0,2021-01-24 22:39:25,1.0,"[{'barcode': '759283001890', 'description': 'Boca Essentials Breakfast Scramble Burgers', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'pointsEarned': '5.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'BOCA ESSENTIALS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}]",FINISHED,1.0,600f489d6fd0dc1768a35a88
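
The behaviour seen in cells 70 to 74 (an assertion when the converted column contains nulls unless `allow_nulls=True`, and a message rather than an exception on a second run) can be sketched as below. This is not the notebook's `epoch_to_timestamp`, which is defined earlier; `epoch_ms_to_timestamp_sketch` is a hypothetical stand-in that assumes the epochs are in milliseconds and that the new column is placed with `DataFrame.insert`.

```python
import pandas as pd


def epoch_ms_to_timestamp_sketch(df, column, allow_nulls=False):
    """Hypothetical stand-in for the notebook's epoch_to_timestamp helper."""
    # Assumption: values are epoch milliseconds; missing values become NaT.
    converted = pd.to_datetime(df[column], unit="ms")
    if not allow_nulls:
        assert converted.notna().all(), (
            "there is at least one None/Null/NaN/NaT value "
            "in the converted timestamp data"
        )
    new_name = f"{column}_ts"
    try:
        # Place the new column directly to the right of its source column.
        df.insert(df.columns.get_loc(column) + 1, new_name, converted)
    except ValueError as err:
        # DataFrame.insert raises ValueError when the column already exists.
        print(f"{err}, {new_name} was not added to the dataframe")


# Toy frame (made-up rows) with one missing epoch value.
demo = pd.DataFrame({
    "purchaseDate_cleaned": [1609601094000.0, None],
    "userId": ["a", "b"],
})
epoch_ms_to_timestamp_sketch(demo, "purchaseDate_cleaned", allow_nulls=True)
# Second call only prints a message; the existing column is left untouched.
epoch_ms_to_timestamp_sketch(demo, "purchaseDate_cleaned", allow_nulls=True)
print(demo)
```

Catching the `ValueError` keeps the cell safe to re-run, at the cost of silently keeping a stale `_ts` column if the source column has changed in between.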


### Visually check that all expected cleaned and converted columns are present

In [75]:
df_brands.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1167 entries, 0 to 1166
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   _id           1167 non-null   object 
 1   _id_cleaned   1167 non-null   object 
 2   barcode       1167 non-null   int64  
 3   category      1012 non-null   object 
 4   categoryCode  517 non-null    object 
 5   cpg           1167 non-null   object 
 6   cpg_cleaned   1167 non-null   object 
 7   name          1167 non-null   object 
 8   topBrand      555 non-null    float64
 9   brandCode     933 non-null    object 
dtypes: float64(1), int64(1), object(8)
memory usage: 91.3+ KB


In [76]:
df_receipts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1119 entries, 0 to 1118
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   _id                           1119 non-null   object        
 1   _id_cleaned                   1119 non-null   object        
 2   bonusPointsEarned             544 non-null    float64       
 3   bonusPointsEarnedReason       544 non-null    object        
 4   createDate                    1119 non-null   object        
 5   createDate_cleaned            1119 non-null   int64         
 6   createDate_cleaned_ts         1119 non-null   datetime64[ns]
 7   dateScanned                   1119 non-null   object        
 8   dateScanned_cleaned           1119 non-null   int64         
 9   dateScanned_cleaned_ts        1119 non-null   datetime64[ns]
 10  finishedDate                  568 non-null    object        
 11  finishedDate_cleaned          

In [77]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 495 entries, 0 to 494
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   _id                     495 non-null    object        
 1   _id_cleaned             495 non-null    object        
 2   active                  495 non-null    bool          
 3   createdDate             495 non-null    object        
 4   createdDate_cleaned     495 non-null    int64         
 5   createdDate_cleaned_ts  495 non-null    datetime64[ns]
 6   lastLogin               433 non-null    object        
 7   lastLogin_cleaned       433 non-null    float64       
 8   lastLogin_cleaned_ts    433 non-null    datetime64[ns]
 9   role                    495 non-null    object        
 10  signUpSource            447 non-null    object        
 11  state                   439 non-null    object        
dtypes: bool(1), datetime64[ns](2), float64(1), int64(1

### What might I need to answer the stakeholder questions?
- What are the top 5 brands by receipts scanned for most recent month?
  - requires joining receipts to brands; the only link is the barcode inside rewardsReceiptItemList  
  
  
- How does the ranking of the top 5 brands by receipts scanned for the recent month compare to the ranking for the previous month?  
  - same as above: join via barcode


- When considering average spend from receipts with 'rewardsReceiptStatus' of 'Accepted' or 'Rejected', which is greater?  
  - this can be answered with df_receipts.totalSpent

- When considering total number of items purchased from receipts with 'rewardsReceiptStatus' of 'Accepted' or 'Rejected', which is greater?  
  - df_receipts.purchasedItemCount


- Which brand has the most spend among users who were created within the past 6 months?  
  - barcode
  - df_users.createdDate_cleaned_ts

- Which brand has the most transactions among users who were created within the past 6 months?
  - barcode
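
A sketch of the barcode join for the brand-ranking questions, on toy data. It assumes an item-level table (one row per receipt item, carrying `receipt_id`, `barcode` and the receipt's `dateScanned_cleaned_ts`) has already been built from `rewardsReceiptItemList`; `items` and `brands` below are made-up stand-ins, not the real data. Because `df_brands.barcode` is `int64` while item barcodes are strings, both keys are cast to strings before merging. Whether that cast is enough to match real barcodes (leading zeros, non-numeric codes such as `B076FJ92M4`) is untested here.

```python
import pandas as pd

# Toy stand-ins (assumed shapes and invented values, not the real data).
items = pd.DataFrame({
    "receipt_id": ["r1", "r1", "r2", "r3", "r4"],
    "barcode": ["511111000013", "511111000013", "511111000020",
                "511111000013", "511111000020"],
    "dateScanned_cleaned_ts": pd.to_datetime(
        ["2021-02-03", "2021-02-03", "2021-02-10", "2021-02-11", "2021-01-15"]
    ),
})
brands = pd.DataFrame({
    "barcode": [511111000013, 511111000020],
    "name": ["Brand A", "Brand B"],
})

# Cast both join keys to strings so the dtypes agree.
brands = brands.assign(barcode=brands["barcode"].astype(str))
items = items.assign(barcode=items["barcode"].astype(str))

joined = items.merge(brands, on="barcode", how="inner")

# Keep only the most recent calendar month present in the data.
month = joined["dateScanned_cleaned_ts"].dt.to_period("M")
recent = joined[month == month.max()]

# Count distinct receipts per brand, so repeated items on one receipt count once.
top5 = (
    recent.groupby("name")["receipt_id"]
    .nunique()
    .sort_values(ascending=False)
    .head(5)
)
print(top5)
```

Running the same aggregation on the month before `month.max()` would give the ranking needed for the month-over-month comparison.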
  
  
Questions:
  - Does 1 receipt = 1 transaction?
  - There is no 'Accepted' value for rewardsReceiptStatus. Should 'FINISHED' be treated as 'Accepted', should everything except 'REJECTED' count, or is another mapping intended?
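
The two status questions reduce to a grouped aggregate. The sketch below uses toy rows and assumes, purely for illustration, that `FINISHED` stands in for 'Accepted'; that mapping is the open question above, not something the data confirms.

```python
import pandas as pd

# Toy rows with the same column names as df_receipts (values are made up).
toy = pd.DataFrame({
    "rewardsReceiptStatus": ["FINISHED", "FINISHED", "REJECTED", "REJECTED", "SUBMITTED"],
    "totalSpent": [10.0, 30.0, 5.0, 15.0, None],
    "purchasedItemCount": [2.0, 4.0, 1.0, 1.0, None],
})

# Assumption: FINISHED is treated as "Accepted".
subset = toy[toy["rewardsReceiptStatus"].isin(["FINISHED", "REJECTED"])]

# Average spend and total item count per status.
summary = subset.groupby("rewardsReceiptStatus").agg(
    avg_spend=("totalSpent", "mean"),
    total_items=("purchasedItemCount", "sum"),
)
print(summary)
```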


What to extract from rewardsReceiptItemList:
- barcode

**to-do:**
- explore which keys appear in dictionaries that include barcode; is it a consistent set?
- decide what else to include in addition to barcode
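
One way to approach the first to-do is to tally the distinct key sets among item dictionaries that contain `barcode`. The sketch below is plain Python over a small hand-made sample shaped like `rewardsReceiptItemList` (including a null entry); `item_lists` is a hypothetical name, and on the real data it could be replaced by the column's values.

```python
from collections import Counter

# Hand-made sample: each entry is one receipt's item list, or None when missing.
item_lists = [
    [{"barcode": "4011", "description": "ITEM NOT FOUND", "partnerItemId": "1"}],
    None,
    [
        {"barcode": "1234", "partnerItemId": "2", "userFlaggedBarcode": "1234"},
        {"partnerItemId": "3", "userFlaggedBarcode": "4011"},  # no barcode key
    ],
    [{"barcode": "4011", "description": "ITEM NOT FOUND", "partnerItemId": "1"}],
]

key_sets = Counter()
for items in item_lists:
    if not isinstance(items, list):  # skip None/NaN entries
        continue
    for item in items:
        if "barcode" in item:
            # frozenset(dict) captures the dictionary's keys in a hashable form.
            key_sets[frozenset(item)] += 1

# More than one distinct key set means the barcode items are not uniform.
for keys, count in key_sets.most_common():
    print(count, sorted(keys))
```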


In [99]:
# from df_receipts, extract _id_cleaned and rewardsReceiptItemList into a new dataframe and look at a few samples
df_receipt_items = df_receipts[['_id_cleaned','rewardsReceiptItemList']]
# df_receipt_items
df_receipt_items.sample(3)

Unnamed: 0,_id_cleaned,rewardsReceiptItemList
358,600746e80a7214ad89000017,"[{'barcode': '021000644551', 'description': 'KRAFT Salad Dressing Thousand Island, 8 fl oz', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '1', 'quantityPurchased': 1, 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6'}]"
794,601ed3e80a720f053c0000a5,
331,600741fe0a7214ad89000001,"[{'barcode': '043000068649', 'description': 'GEVALIA 100% Arabica Signature Blend K Mild Roast Coffee - K -Cups Pods - 36ct', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '1', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'GEVALIA KAFFE KEURIG COFFEE PODS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '043000068649', 'description': 'GEVALIA 100% Arabica Signature Blend K Mild Roast Coffee - K -Cups Pods - 36ct', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '2', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'GEVALIA KAFFE KEURIG COFFEE PODS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '043000068649', 'description': 'GEVALIA 100% Arabica Signature Blend K Mild Roast Coffee - K -Cups Pods - 36ct', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '3', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'GEVALIA KAFFE KEURIG COFFEE PODS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '043000068649', 'description': 'GEVALIA 100% Arabica Signature Blend K Mild Roast Coffee - K -Cups Pods - 36ct', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '4', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'GEVALIA KAFFE KEURIG COFFEE PODS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}, {'barcode': '043000068649', 'description': 'GEVALIA 100% Arabica Signature Blend K Mild Roast Coffee - K -Cups Pods - 36ct', 'finalPrice': '9.99', 'itemPrice': '9.99', 'partnerItemId': '5', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'GEVALIA KAFFE KEURIG COFFEE PODS', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}]"


Each non-null value of df_receipts['rewardsReceiptItemList'] is a list of dictionaries.  
Some values are null.  
Each dictionary corresponds to one item.  
Keys that appear in the dictionaries include:  
- needsFetchReview
- needsFetchReviewReason
- partnerItemId
- preventTargetGapPoints
- userFlaggedBarcode
- userFlaggedDescription
- userFlaggedNewItem
- userFlaggedPrice
- userFlaggedQuantity 
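
Given that structure, one possible way to get one row per item is to drop the null lists, `explode` the column, and expand each dictionary with `pd.json_normalize`. The frame below is a three-receipt toy sample, and renaming `_id_cleaned` to `receipt_id` is an assumption made for this sketch.

```python
import pandas as pd

# Toy sample shaped like df_receipts[['_id_cleaned', 'rewardsReceiptItemList']].
toy = pd.DataFrame({
    "_id_cleaned": ["r1", "r2", "r3"],
    "rewardsReceiptItemList": [
        [
            {"barcode": "4011", "partnerItemId": "1"},
            {"barcode": "1234", "partnerItemId": "2", "needsFetchReview": True},
        ],
        None,  # receipt without an item list
        [{"partnerItemId": "1", "userFlaggedBarcode": "4011"}],
    ],
})

# Drop receipts without items, then emit one row per dictionary.
exploded = (
    toy.dropna(subset=["rewardsReceiptItemList"])
    .explode("rewardsReceiptItemList")
    .reset_index(drop=True)
)

# Expand each dictionary into columns; keys missing from an item become NaN.
item_cols = pd.json_normalize(exploded["rewardsReceiptItemList"].tolist())
df_items = pd.concat(
    [exploded[["_id_cleaned"]].rename(columns={"_id_cleaned": "receipt_id"}), item_cols],
    axis=1,
)
print(df_items)
```

Keeping `receipt_id` on every item row preserves the link back to the receipt, which the brand questions need.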

In [None]:
# column names for an item-level dataframe: the receipt id plus the item keys listed above
columns = [
    "receipt_id", "needsFetchReview", "needsFetchReviewReason", "partnerItemId", "preventTargetGapPoints",
    "userFlaggedBarcode", "userFlaggedDescription", "userFlaggedNewItem", "userFlaggedPrice", "userFlaggedQuantity"
]


In [90]:
df_receipts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1119 entries, 0 to 1118
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   _id                           1119 non-null   object        
 1   _id_cleaned                   1119 non-null   object        
 2   bonusPointsEarned             544 non-null    float64       
 3   bonusPointsEarnedReason       544 non-null    object        
 4   createDate                    1119 non-null   object        
 5   createDate_cleaned            1119 non-null   int64         
 6   createDate_cleaned_ts         1119 non-null   datetime64[ns]
 7   dateScanned                   1119 non-null   object        
 8   dateScanned_cleaned           1119 non-null   int64         
 9   dateScanned_cleaned_ts        1119 non-null   datetime64[ns]
 10  finishedDate                  568 non-null    object        
 11  finishedDate_cleaned          

In [92]:
df_receipts.groupby('rewardsReceiptStatus').count()

rewardsReceiptStatus,_id,_id_cleaned,bonusPointsEarned,bonusPointsEarnedReason,createDate,createDate_cleaned,createDate_cleaned_ts,dateScanned,dateScanned_cleaned,dateScanned_cleaned_ts,finishedDate,finishedDate_cleaned,finishedDate_cleaned_ts,modifyDate,modifyDate_cleaned,modifyDate_cleaned_ts,pointsAwardedDate,pointsAwardedDate_cleaned,pointsAwardedDate_cleaned_ts,pointsEarned,purchaseDate,purchaseDate_cleaned,purchaseDate_cleaned_ts,purchasedItemCount,rewardsReceiptItemList,totalSpent,userId
FINISHED,518,518,456,456,518,518,518,518,518,518,518,518,518,518,518,518,514,514,514,518,518,518,518,518,516,518,518
FLAGGED,46,46,30,30,46,46,46,46,46,46,0,0,0,46,46,46,19,19,19,33,35,35,35,46,46,46,46
PENDING,50,50,0,0,50,50,50,50,50,50,50,50,50,50,50,50,0,0,0,0,49,49,49,0,49,49,50
REJECTED,71,71,58,58,71,71,71,71,71,71,0,0,0,71,71,71,4,4,4,58,69,69,69,71,68,71,71
SUBMITTED,434,434,0,0,434,434,434,434,434,434,0,0,0,434,434,434,0,0,0,0,0,0,0,0,0,0,434


In [101]:
df_receipts[['rewardsReceiptStatus','rewardsReceiptItemList']].sample(20)

Unnamed: 0,rewardsReceiptStatus,rewardsReceiptItemList
143,FINISHED,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '1234', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '1234', 'userFlaggedDescription': '', 'userFlaggedNewItem': True}]"
591,SUBMITTED,
875,PENDING,
251,FINISHED,"[{'barcode': '044700071533', 'description': 'OSCAR MAYER Fun Pack Pepperoni Kabobble 8 oz', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'pointsEarned': '5.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'LUNCHABLES LUNCH COMBINATIONS - FUN PACK', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}]"
696,SUBMITTED,
468,REJECTED,"[{'description': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'discountedItemPrice': '22.97', 'finalPrice': '22.97', 'itemPrice': '22.97', 'originalReceiptItemText': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'partnerItemId': '0', 'priceAfterCoupon': '22.97', 'quantityPurchased': 1}, {'description': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'discountedItemPrice': '11.99', 'finalPrice': '11.99', 'itemPrice': '11.99', 'originalReceiptItemText': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'partnerItemId': '1', 'priceAfterCoupon': '11.99', 'quantityPurchased': 1}]"
585,SUBMITTED,
605,SUBMITTED,
231,PENDING,"[{'description': 'flipbelt level terrain waist pouch, neon yellow, large/32-35', 'discountedItemPrice': '28.57', 'finalPrice': '28.57', 'itemPrice': '28.57', 'originalReceiptItemText': 'flipbelt level terrain waist pouch, neon yellow, large/32-35', 'partnerItemId': '0', 'priceAfterCoupon': '28.57', 'quantityPurchased': 1}]"
52,FINISHED,"[{'barcode': '044700019917', 'description': 'OSCAR MAYER Lower Sodium Bacon 16 oz. Pack', 'finalPrice': '10', 'itemPrice': '10', 'partnerItemId': '1', 'pointsEarned': '50.0', 'pointsPayerId': '559c2234e4b06aca36af13c6', 'quantityPurchased': 1, 'rewardsGroup': 'OSCAR MAYER BACON', 'rewardsProductPartnerId': '559c2234e4b06aca36af13c6', 'targetPrice': '800'}]"


In [103]:
# re-create df_receipt_items (_id_cleaned and rewardsReceiptItemList) and display it, as a starting point for listing the keys found in the rewardsReceiptItemList dictionaries
df_receipt_items = df_receipts[['_id_cleaned','rewardsReceiptItemList']]
df_receipt_items

Unnamed: 0,_id_cleaned,rewardsReceiptItemList
0,5ff1e1eb0a720f0523000575,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '26.00', 'itemPrice': '26.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 5, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '26.00', 'userFlaggedQuantity': 5}]"
1,5ff1e1bb0a720f052300056b,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '028400642255', 'description': 'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ', 'finalPrice': '10.00', 'itemPrice': '10.00', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'pointsNotAwardedReason': 'Action not allowed for user and CPG', 'pointsPayerId': '5332f5fbe4b03c9a25efd0ba', 'preventTargetGapPoints': True, 'quantityPurchased': 1, 'rewardsGroup': 'DORITOS SPICY SWEET CHILI SINGLE SERVE', 'rewardsProductPartnerId': '5332f5fbe4b03c9a25efd0ba', 'userFlaggedBarcode': '028400642255', 'userFlaggedDescription': 'DORITOS TORTILLA CHIP SPICY SWEET CHILI REDUCED FAT BAG 1 OZ', 'userFlaggedNewItem': True, 'userFlaggedPrice': '10.00', 'userFlaggedQuantity': 1}]"
2,5ff1e1f10a720f052300057a,"[{'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '26.00', 'userFlaggedQuantity': 3}]"
3,5ff1e1ee0a7214ada100056f,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '28.00', 'itemPrice': '28.00', 'needsFetchReview': False, 'partnerItemId': '1', 'preventTargetGapPoints': True, 'quantityPurchased': 4, 'userFlaggedBarcode': '4011', 'userFlaggedNewItem': True, 'userFlaggedPrice': '28.00', 'userFlaggedQuantity': 4}]"
4,5ff1e1d20a7214ada1000561,"[{'barcode': '4011', 'description': 'ITEM NOT FOUND', 'finalPrice': '1', 'itemPrice': '1', 'partnerItemId': '1', 'quantityPurchased': 1}, {'barcode': '1234', 'finalPrice': '2.56', 'itemPrice': '2.56', 'needsFetchReview': True, 'needsFetchReviewReason': 'USER_FLAGGED', 'partnerItemId': '2', 'preventTargetGapPoints': True, 'quantityPurchased': 3, 'userFlaggedBarcode': '1234', 'userFlaggedDescription': '', 'userFlaggedNewItem': True, 'userFlaggedPrice': '2.56', 'userFlaggedQuantity': 3}]"
...,...,...
1114,603cc0630a720fde100003e6,"[{'barcode': 'B076FJ92M4', 'description': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'discountedItemPrice': '22.97', 'finalPrice': '22.97', 'itemPrice': '22.97', 'originalReceiptItemText': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'partnerItemId': '0', 'priceAfterCoupon': '22.97', 'quantityPurchased': 1}, {'barcode': 'B07BRRLSVC', 'description': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'discountedItemPrice': '11.99', 'finalPrice': '11.99', 'itemPrice': '11.99', 'originalReceiptItemText': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'partnerItemId': '1', 'priceAfterCoupon': '11.99', 'quantityPurchased': 1}]"
1115,603d0b710a720fde1000042a,
1116,603cf5290a720fde10000413,
1117,603ce7100a7217c72c000405,"[{'barcode': 'B076FJ92M4', 'description': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'discountedItemPrice': '22.97', 'finalPrice': '22.97', 'itemPrice': '22.97', 'originalReceiptItemText': 'mueller austria hypergrind precision electric spice/coffee grinder millwith large grinding capacity and hd motor also for spices, herbs, nuts,grains, white', 'partnerItemId': '0', 'priceAfterCoupon': '22.97', 'quantityPurchased': 1}, {'barcode': 'B07BRRLSVC', 'description': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'discountedItemPrice': '11.99', 'finalPrice': '11.99', 'itemPrice': '11.99', 'originalReceiptItemText': 'thindust summer face mask - sun protection neck gaiter for outdooractivities', 'partnerItemId': '1', 'priceAfterCoupon': '11.99', 'quantityPurchased': 1}]"
