**Fetch Rewards Coding Exercise**

Pt3 - Data Quality

Section 1 - Brands Identifiers

In [68]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as pyplot
import seaborn as sns

import duckdb

* [Overview](#Overview)
    * [Pt.1 barcode Review](#barcode)
    * [Pt.2 brandCode Review](#brandcode)
    * [Pt.3 - (brand) names Review](#names)

# Overview <a class="anchor" id="Overview"></a>
**Evaluating barcode vs. brandCode and exploring df_brands['name']**

Background:
- Per the documentation provided, the brands table contains the following identifiers:
    - barcode - barcode on the item
    - brandCode - String that corresponds with the brand column in a partner product file
- We expect df_brands['barcode'] to match the barcode values in df_receipts['rewardsReceiptItemList'], but in our analysis only about 3% of the unique receipt barcodes (16 of 569) appear in df_brands.
- Only 18% of the unique brandCode values in df_receipts (41 of 227) appear in df_brands.

Why This Matters:
- The near absence of overlapping barcode values raises the concern that either (i) barcodes aren't being correctly logged/imported from receipts or (ii) there is a missing source of truth on how barcodes are defined within df_brands.
- While it's good to see some overlap between the two tables on brandCode, 18% is extremely small. This limits our ability to understand and analyze our data.
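
The 3% and 18% figures can be reproduced from the unique-value counts printed in Pt.1 and Pt.2. The snippet below is a minimal sketch of that arithmetic; `overlap_rate` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def overlap_rate(left_total: int, left_unmatched: int) -> float:
    """Share of unique left-side keys that also appear on the right side."""
    matched = left_total - left_unmatched
    return matched / left_total


# Counts copied from the notebook outputs in Pt.1 (barcode) and Pt.2 (brandCode).
barcode_rate = overlap_rate(left_total=569, left_unmatched=553)    # 16 matches
brandcode_rate = overlap_rate(left_total=227, left_unmatched=186)  # 41 matches

print(f"barcode overlap:   {barcode_rate:.1%}")
print(f"brandCode overlap: {brandcode_rate:.1%}")
```

Both rates are measured against the unique values on the receipts side, which is the direction that matters when joining receipt items to brands.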

------------------------

In [69]:
df_brands = pd.read_json('brands.gz', lines=True, compression='gzip')
df_receipts = pd.read_json('receipts.gz', lines=True, compression='gzip')
df_users = pd.read_json('users.gz', lines=True, compression='gzip')

In [None]:
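# Expand each receipt into one row per item in rewardsReceiptItemList
df_receipts = df_receipts.explode('rewardsReceiptItemList')
# Spread each item dict into its own columns (barcode, brandCode, ...) next to the receipt-level fields
df_receipts = pd.concat([df_receipts.drop(['rewardsReceiptItemList'], axis=1), df_receipts['rewardsReceiptItemList'].apply(pd.Series)], axis=1)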

## Pt.1 | Barcode Comparisons <a id="barcode"></a>
Comparing barcodes in df_receipts vs. df_brands

**Summary:** 
- Only 3% of the unique barcodes in df_receipts are found in df_brands.
- df_brands['barcode'] values follow a consistent pattern: (i) 12 digits and (ii) a '511111' prefix.
- df_receipts['barcode'] values vary in pattern and length.
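
The pattern described above can be checked with a regular expression. This is a rough sketch that assumes the pattern is exactly the prefix '511111' followed by six more digits; the sample values are copied from the outputs later in this section, and `matches_brands_pattern` is a hypothetical helper.

```python
import re

# Assumed pattern: the literal prefix '511111' followed by six more digits (12 total).
BRANDS_BARCODE = re.compile(r"511111\d{6}")


def matches_brands_pattern(barcode: str) -> bool:
    """Return True when the whole string fits the assumed df_brands pattern."""
    return BRANDS_BARCODE.fullmatch(barcode) is not None


# Sample values taken from the df_brands and df_receipts outputs in this notebook.
brands_sample = ["511111001324", "511111700029", "511111515807"]
receipts_sample = ["012000809941", "021000604647", "075925306254"]

print([matches_brands_pattern(b) for b in brands_sample])
print([matches_brands_pattern(b) for b in receipts_sample])
```

The receipts samples are also 12 digits long, so length alone does not separate the two sources; the prefix does.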

In [53]:
unique_receipts_barcodes = df_receipts['barcode'].dropna().unique()
unique_brands_barcodes = df_brands['barcode'].unique()

In [55]:
# Identify the overlap of barcode values between df_receipts and df_brands
missing_barcodes = set(unique_receipts_barcodes) - set(unique_brands_barcodes)
missing_brands_nonreceipts = set(unique_brands_barcodes) - set(unique_receipts_barcodes)
print(f"Total barcodes in receipts: {len(unique_receipts_barcodes)}")
print(f"Total barcodes in brands: {len(unique_brands_barcodes)}")
print(f"Barcodes in receipts not found in brands: {len(missing_barcodes)}")
print(f"Barcodes in BRANDS not found in RECEIPTS: {len(missing_brands_nonreceipts)}")
# if missing_barcodes:
#     print("Some missing barcodes:", list(missing_barcodes)[:10])  # Show the first 10 missing barcodes as an example
# else:
#     print("No missing barcodes. All items in receipts have a matching brand barcode.")

Total barcodes in receipts: 569
Total barcodes in brands: 1160
Barcodes in receipts not found in brands: 553
Barcodes in BRANDS not found in RECEIPTS: 1144


In [60]:
# Quick look at what barcodes look like in df_brands
print(df_brands['barcode'].sample(10))

524     511111001324
276     511111700029
1100    511111515807
352     511111405535
1156    511111617853
701     511111100348
901     511111719724
586     511111502937
1118    511111015741
29      511111315957
Name: barcode, dtype: int64


In [62]:
# Quick look at what barcodes look like in df_receipts
print(df_receipts['barcode'].sample(10))

419    012000809941
543             NaN
543             NaN
870             NaN
543             NaN
306    021000604647
427             NaN
292    075925306254
543             NaN
409             NaN
Name: barcode, dtype: object


In [63]:
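# Let's compare the patterns of barcode in the two tables

# Cast to string so the lengths can be compared. Note: astype(str) turns missing
# values (NaN) into the literal string 'nan', so missing receipt barcodes are
# counted in the length-3 bucket below.
df_brands['barcode'] = df_brands['barcode'].astype(str)
df_receipts['barcode'] = df_receipts['barcode'].astype(str)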

df_brands['barcode_length'] = df_brands['barcode'].apply(len)
df_receipts['barcode_length'] = df_receipts['barcode'].apply(len)

# Summary of barcode lengths and sample values
brands_length_summary = df_brands['barcode_length'].value_counts()
receipts_length_summary = df_receipts['barcode_length'].value_counts()

print("Brands Barcode Lengths:\n", brands_length_summary)
print("Receipts Barcode Lengths:\n", receipts_length_summary)
# print("Sample Barcodes from Brands:\n", df_brands['barcode'].dropna().sample(10))
# print("Sample Barcodes from Receipts:\n", df_receipts['barcode'].dropna().sample(10))

Brands Barcode Lengths:
 12    1167
Name: barcode_length, dtype: int64
Receipts Barcode Lengths:
 3     4291
12    2619
4      332
10      82
11      46
13       8
5        2
2        1
Name: barcode_length, dtype: int64


**Pt.1 Initial Observations:**
- There is some overlap between the barcodes in df_brands and df_receipts, but it is small.
- df_receipts barcodes show that (i) many receipt items are missing a barcode value and (ii) the barcodes have varying patterns and lengths.
- df_brands['barcode'] values all seem to follow a similar pattern of 12 digits starting with '511111'.
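
The length-3 bucket in the receipts output above most likely comes from missing values, since converting NaN to a string yields 'nan'. A minimal sketch of a length profile that keeps missing values separate is shown here; the toy Series is illustrative and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_receipts['barcode']; the values are illustrative only.
barcodes = pd.Series(["012000809941", np.nan, "4011", np.nan], dtype=object)

# str() of a float NaN is the three-character string 'nan'.
print(len(str(float("nan"))))

# Dropping missing values first keeps them out of the length profile.
present = barcodes.dropna().astype(str)
length_counts = present.str.len().value_counts()
missing_count = int(barcodes.isna().sum())

print(length_counts.to_dict())
print(f"missing barcodes: {missing_count}")
```

Reporting the missing count on its own line makes it clear how many rows have no barcode at all versus a short or malformed one.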

In [65]:
valid_receipts_barcode = df_receipts[df_receipts['barcode_length'] == 12]['barcode']
matching_barcodes = valid_receipts_barcode[valid_receipts_barcode.isin(df_brands['barcode'])]
print(f"Total valid-length barcodes in receipts: {len(valid_receipts_barcode)}")
print(f"Matching barcodes between receipts and brands: {len(matching_barcodes)}")
if not matching_barcodes.empty:
    print("Sample of matching barcodes:", matching_barcodes.sample(min(10, len(matching_barcodes))))

Total valid-length barcodes in receipts: 2619
Matching barcodes between receipts and brands: 82
Sample of matching barcodes: 321    511111001485
469    511111904175
423    511111101451
425    511111802358
310    511111104537
324    511111204206
543    511111004127
462    511111902690
309    511111001485
543    511111802358
Name: barcode, dtype: object


-------------------------------------------

## Pt.2 | brandCode Comparisons <a id="brandcode"></a>
Comparing brandCode in df_receipts vs. df_brands

**Summary:**
- 18% of the unique brandCode values in df_receipts are found in df_brands
- There are a lot of TEST brandCode values within df_brands
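
To gauge how much of df_brands is test data, the TEST values can be flagged by prefix. The sketch below assumes every test value starts with the literal text 'TEST BRANDCODE', as in the sample output later in this part; the six values are copied from that sample.

```python
import numpy as np
import pandas as pd

# Values copied from the df_brands['brandCode'] sample output in this notebook.
brand_codes = pd.Series(
    [
        "TEST BRANDCODE @1598290604214",
        "ABSOLUT® JUICE APPLE",
        np.nan,
        "COLORADO NATIVE",
        "TEST BRANDCODE @1610493497005",
        "STUBBORN SODA",
    ],
    dtype=object,
)

# na=False treats missing brandCode values as "not a test value".
is_test = brand_codes.str.startswith("TEST BRANDCODE", na=False)
real_codes = brand_codes[~is_test].dropna()

print(f"test values: {int(is_test.sum())} of {len(brand_codes)}")
print(real_codes.tolist())
```

Excluding test and missing values before comparing against df_receipts gives a cleaner denominator for the brands-side overlap.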

In [57]:
unique_receipts_brandcode = df_receipts['brandCode'].dropna().unique()
unique_brands_brandcode = df_brands['brandCode'].dropna().unique()

# Identify the overlap of brandCode values between df_receipts and df_brands
missing_brandCode_receiptsnonbrand = set(unique_receipts_brandcode) - set(unique_brands_brandcode)
missing_brandcode_brandnonreceipts = set(unique_brands_brandcode) - set(unique_receipts_brandcode)

print(f"Total brandCodes in receipts: {len(unique_receipts_brandcode)}")
print(f"Total brandCodes in brands: {len(unique_brands_brandcode)}")
print(f"brandCode in receipts not found in brands: {len(missing_brandCode_receiptsnonbrand)}")
print(f"brandCode in BRANDS not found in RECEIPTS: {len(missing_brandcode_brandnonreceipts)}")

Total brandCodes in receipts: 227
Total brandCodes in brands: 897
brandCode in receipts not found in brands: 186
brandCode in BRANDS not found in RECEIPTS: 856


In [51]:
# Quick look at what brandCode looks like in df_brands
print(df_brands['brandCode'].sample(10))

1042    TEST BRANDCODE @1598290604214
323     TEST BRANDCODE @1598633693767
758              ABSOLUT® JUICE APPLE
836     TEST BRANDCODE @1599097539367
747                               NaN
398     TEST BRANDCODE @1597350074404
725                               NaN
265                   COLORADO NATIVE
549     TEST BRANDCODE @1610493497005
793                     STUBBORN SODA
Name: brandCode, dtype: object


In [67]:
# Quick look at what brandCode looks like in df_receipts
# print(df_receipts['brandCode'].sample(10))

------------------------------------------------------

## Pt.3 | df_brands['name'] Exploration <a id="names"></a>

**Summary**:
- A lot of "test brand @..." values within 'name'
- Some brand 'name' values have a null brandCode
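
Both points can be quantified with two boolean masks. The sketch below uses a handful of rows copied from the sample output later in this part, and assumes that a blank brandCode in that output represents a missing value.

```python
import numpy as np
import pandas as pd

# Rows copied from the df_brands[['name', 'brandCode']] sample in this notebook.
# Assumption: a blank brandCode in that output is a missing value (NaN).
sample = pd.DataFrame(
    {
        "name": [
            "Baken-Ets",
            "test brand @1610038527809",
            "Frosted Flakes",
            "Prego",
            "test brand @1608313051244",
        ],
        "brandCode": [
            "BAKEN ETS",
            np.nan,
            np.nan,
            "PREGO",
            "TEST BRANDCODE @1608313051244",
        ],
    }
)

is_test_name = sample["name"].str.startswith("test brand @", na=False)
missing_code = sample["brandCode"].isna()

# Real (non-test) brands that cannot be joined on brandCode.
real_without_code = sample.loc[~is_test_name & missing_code, "name"].tolist()

print(f"test names: {int(is_test_name.sum())}")
print(f"names with a null brandCode: {int(missing_code.sum())}")
print(real_without_code)
```

Separating test rows from real brands shows which missing brandCode values actually block analysis, such as 'Frosted Flakes' here.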

In [39]:
df_brands['name'].sort_values()

230                             .
861          1915 Bolthouse Farms
725           1_KRAFT Hockeyville
723                          7 up
61                   A&W Rootbeer
                  ...            
5       test brand @1612366146091
6       test brand @1612366146133
2       test brand @1612366146176
4       test brand @1612366146827
1166    test brand @1613158231643
Name: name, Length: 1167, dtype: object

In [49]:
df_brands[['name', 'brandCode']].sample(15)

| | name | brandCode |
|---|---|---|
| 574 | Baken-Ets | BAKEN ETS |
| 176 | I CAN'T BELIEVE IT'S NOT BUTTER! | I CAN'T BELIEVE IT'S NOT BUTTER! |
| 438 | test brand @1610038527809 | |
| 560 | test brand @1608313051244 | TEST BRANDCODE @1608313051244 |
| 1015 | Diet Chris Cola | DIETCHRIS2 |
| 231 | Frosted Flakes | |
| 1049 | test brand @1598633602226 | TEST BRANDCODE @1598633602226 |
| 1002 | Prego | PREGO |
| 347 | Berroca® | BERROCA® |
| 405 | test brand @1601939762881 | TEST BRANDCODE @1601939762881 |


### Workspace
Comparing brandCode in the two tables and how it appears in each table
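
As an alternative to comparing arrays with `np.isin`, an outer merge with `indicator=True` labels each brandCode by where it was found. This is a sketch on toy frames, not the notebook's actual approach; 'HYPOTHETICAL RECEIPT ONLY' is a made-up value used to show the receipts-only case.

```python
import pandas as pd

# Toy frames of unique brandCode values; 'HYPOTHETICAL RECEIPT ONLY' is invented.
receipts_codes = pd.DataFrame(
    {"brandCode": ["PEPSI", "DORITOS", "HYPOTHETICAL RECEIPT ONLY"]}
)
brands_codes = pd.DataFrame(
    {"brandCode": ["PEPSI", "DORITOS", "TEST BRANDCODE @1598290604214"]}
)

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
comparison = receipts_codes.merge(
    brands_codes, on="brandCode", how="outer", indicator=True
)
counts = comparison["_merge"].value_counts().to_dict()

print(comparison)
print(counts)
```

Here 'left_only' means the code appears only in receipts and 'right_only' means it appears only in brands, so one table answers both directions of the comparison.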

In [25]:
matchingbrandcode_receipts = unique_receipts_brandcode[np.isin(unique_receipts_brandcode, unique_brands_brandcode)]
matchingbrandcode_brands = unique_brands_brandcode[np.isin(unique_brands_brandcode, unique_receipts_brandcode)]

print(matchingbrandcode_receipts)
print('------------------------------------------------')
print(matchingbrandcode_brands)

['PEPSI' 'DORITOS' 'KLEENEX' 'KNORR' 'SWANSON' 'YUBAN'
 'DOLE CHILLED FRUIT JUICES' 'KRAFT' 'TOSTITOS' 'RICE-A-RONI'
 'KETTLE BRAND' 'PEPPERIDGE FARM' 'STOVE TOP' 'NATURE VALLEY' 'ARNOLD'
 'GREY POUPON' 'KLONDIKE' 'CRACKER BARREL' 'QUAKER' 'PHILADELPHIA'
 'TACO BELL' 'COTTONELLE' "HELLMANN'S/BEST FOODS" 'COOL WHIP'
 'MOUNTAIN DEW' 'JELL-O' 'CLASSICO' 'LUNCHABLES' 'OSCAR MAYER' 'PLANTERS'
 'VELVEETA' 'PREGO' 'PACIFIC FOODS' 'ORE-IDA' 'FINISH' 'JUST CRACK AN EGG'
 'SARGENTO' 'CHEETOS' 'V8' 'HUGGIES' 'VIVA']
------------------------------------------------
['TACO BELL' 'COTTONELLE' 'SWANSON' 'KETTLE BRAND' 'DORITOS' 'KLONDIKE'
 'PLANTERS' 'CHEETOS' 'COOL WHIP' 'TOSTITOS' 'NATURE VALLEY' 'GREY POUPON'
 'PACIFIC FOODS' 'KRAFT' 'PEPPERIDGE FARM' 'QUAKER' 'OSCAR MAYER' 'YUBAN'
 'SARGENTO' 'KNORR' 'FINISH' 'JELL-O' 'RICE-A-RONI' 'CRACKER BARREL'
 'LUNCHABLES' 'HUGGIES' 'VELVEETA' 'JUST CRACK AN EGG' 'PEPSI'
 "HELLMANN'S/BEST FOODS" 'DOLE CHILLED FRUIT JUICES' 'CLASSICO' 'ARNOLD'
 'MOUNTAIN DEW

In [37]:
df_brands[df_brands['brandCode'].str.contains('EGG', case=False, na=False)]


| | _id | barcode | category | categoryCode | cpg | name | topBrand | brandCode | barcode_length |
|---|---|---|---|---|---|---|---|---|---|
| 680 | {'$oid': '5aa1b53ae4b086c8aad5e097'} | 511111804277 | Breakfast & Cereal | | {'$ref': 'Cogs', '$id': {'$oid': '559c2234e4b0... | JUST CRACK AN EGG Scramble Kit | 0.0 | JUST CRACK AN EGG | 12 |


In [29]:
df_receipts[df_receipts['brandCode'] == 'VELVEETA']

Unnamed: 0,_id,bonusPointsEarned,bonusPointsEarnedReason,createDate,dateScanned,finishedDate,modifyDate,pointsAwardedDate,pointsEarned,purchaseDate,...,pointsEarned.1,targetPrice,competitiveProduct,originalFinalPrice,originalMetaBriteItemPrice,deleted,priceAfterCoupon,0,metabriteCampaignId,barcode_length
426,{'$oid': '600b420b0a7214ada200000d'},750.0,"Receipt number 1 completed, bonus point schedu...",{'$date': 1611350539000},{'$date': 1611350539000},{'$date': 1611351137000},{'$date': 1611351324000},{'$date': 1611351137000},1729.5,{'$date': 1611273600000},...,,,True,,,,,,VELVEETA CHEESY SKILLETS,12
446,{'$oid': '600f24970a720f053500002f'},,,{'$date': 1611605143000},{'$date': 1611605143000},,{'$date': 1611606325000},,,{'$date': 1611532800000},...,,,,,,,,,VELVEETA MACARONI & CHEESE DINNER,12
446,{'$oid': '600f24970a720f053500002f'},,,{'$date': 1611605143000},{'$date': 1611605143000},,{'$date': 1611606325000},,,{'$date': 1611532800000},...,,,,,,,,,VELVEETA MACARONI & CHEESE DINNER,12
447,{'$oid': '600f0cc70a720f053500002c'},,,{'$date': 1611599047000},{'$date': 1611599047000},,{'$date': 1611599887000},,,{'$date': 1611532800000},...,,,,,,,,,VELVEETA MACARONI & CHEESE DINNER,12
447,{'$oid': '600f0cc70a720f053500002c'},,,{'$date': 1611599047000},{'$date': 1611599047000},,{'$date': 1611599887000},,,{'$date': 1611532800000},...,,,,,,,,,VELVEETA MACARONI & CHEESE DINNER,12
469,{'$oid': '600f39c30a7214ada2000030'},750.0,"Receipt number 1 completed, bonus point schedu...",{'$date': 1611610563000},{'$date': 1611610563000},{'$date': 1611630363000},{'$date': 1611630460000},{'$date': 1611630363000},7137.2,{'$date': 1611446400000},...,,,True,,,,5.99,,VELVEETA MACARONI & CHEESE DINNER,12
469,{'$oid': '600f39c30a7214ada2000030'},750.0,"Receipt number 1 completed, bonus point schedu...",{'$date': 1611610563000},{'$date': 1611610563000},{'$date': 1611630363000},{'$date': 1611630460000},{'$date': 1611630363000},7137.2,{'$date': 1611446400000},...,,,True,,,,5.99,,VELVEETA MACARONI & CHEESE DINNER,12
543,{'$oid': '600f2fc80a720f0535000030'},750.0,"Receipt number 1 completed, bonus point schedu...",{'$date': 1611608008000},{'$date': 1611608008000},{'$date': 1611612263000},{'$date': 1611873422000},{'$date': 1611612263000},4944.7,{'$date': 1611446400000},...,,,True,,,,,,VELVEETA MACARONI & CHEESE DINNER,12
543,{'$oid': '600f2fc80a720f0535000030'},750.0,"Receipt number 1 completed, bonus point schedu...",{'$date': 1611608008000},{'$date': 1611608008000},{'$date': 1611612263000},{'$date': 1611873422000},{'$date': 1611612263000},4944.7,{'$date': 1611446400000},...,,,True,,,,,,VELVEETA MACARONI & CHEESE DINNER,12

