# Step 2b: Data Cleaning, Pt. 3

At this point, I am working with the following datasets: 
1) For Grocery Market Data: The U.S. Department of Agriculture Agricultural Marketing Service's (AMS) [Market News Report](https://marketnews.usda.gov/mnp/dataDownload), cleaned and filtered to just 2023 data. 
2) For Inflation Trend Data: The Bureau of Labor Statistics' Consumer Price Index (CPI) for [All Urban Consumers](https://www.bls.gov/cpi/tables/supplemental-files/), cleaned and filtered to just 2023 data.
3) For Verified Grocery Sales Data: The [Costco dataset](https://www.kaggle.com/datasets/bhavikjikadara/grocery-store-dataset) from Kaggle, which reported 2023 data, cleaned. 
4) For Regional & Demographic Data: I have switched from the [U.S. Census Dataset : Education, Finance, Industry](https://www.kaggle.com/datasets/mittvin/u-s-census-dataset-education-finance-industry) from Kaggle, which reported 2019-2020 data, to the U.S. Census dataset from the [Population Estimate Program (PEP)](https://www.census.gov/data/datasets/time-series/demo/popest/2020s-national-total.html), which reports 2020-2025 data and is therefore more useful to this analysis.  I may still try to incorporate education, finance, and industry data, though, if possible. 

I am going to begin this portion of the cleaning and preparation process with the dataset that handles product names most usefully: The Bureau of Labor Statistics' Consumer Price Index (CPI) for [All Urban Consumers](https://www.bls.gov/cpi/tables/supplemental-files/).  I want to see if I can get a reasonable list of products for which to find comparables in the other datasets.  The product column in each dataset can then serve as a foreign key for relating the tables and generating deeper insights. 
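
As a rough sketch of the foreign-key idea, here is how two tables sharing a `product` column could be related with a pandas merge. The frames below are toy stand-ins (the bacon figure and the prices are placeholders, not values from the real data), so treat this as an illustration under those assumptions rather than the final join.

```python
import pandas as pd

# Toy stand-ins for two cleaned tables; the bacon value and prices are placeholders.
cpi = pd.DataFrame({
    'product': ['ground beef', 'bacon', 'flour'],
    'unadjusted_percent_change_2023': [6.7, 1.0, 2.2],
})
prices = pd.DataFrame({
    'product': ['ground beef', 'ground beef', 'bacon'],
    'region': ['ALASKA', 'HAWAII', 'ALASKA'],
    'mean_weighted_price': [4.75, 4.59, 7.40],
})

# An inner join on the shared key keeps only products present in both tables.
joined = prices.merge(cpi, on='product', how='inner')
print(joined)
```

Because 'flour' has no price rows in the toy data, it drops out of the joined result, which is the behavior I would expect for products missing from one dataset.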

In [2]:
# import libraries

import pandas as pd
pd.set_option('display.max_rows', None)

import re

## A. Inflation Trend Data

In [3]:
# import cleaned BLS CPI dataset

bls_df = pd.read_csv('cleaned_bls_cpi_data.csv')

bls_df.head()

Unnamed: 0,product,unadjusted_percent_change_2023,seasonally_adjusted_effect_2023,unadjusted_effect_2023
0,All items,3.4,,
1,Food,2.7,0.028,0.366
2,Food at home,1.3,0.012,0.114
3,Cereals and bakery products,2.6,-0.003,0.03
4,Cereals and cereal products,0.4,-0.005,0.001


In [4]:
# How many unique products are listed?

bls_df['product'].nunique()

402

In [5]:
# Are there any duplicates? 

duplicates = bls_df[bls_df.duplicated(subset=['product'])]

print(duplicates)

# No duplicates

Empty DataFrame
Columns: [product, unadjusted_percent_change_2023, seasonally_adjusted_effect_2023, unadjusted_effect_2023]
Index: []


In [6]:
# I need to get my head around which products to keep and which to cut.
# What's the full list of products contained in this dataset? 

product_list = bls_df['product'].tolist()

print(product_list)

# It appears there are a lot of non-food items in this dataset.  Need to get those out. 

['All items', 'Food', 'Food at home', 'Cereals and bakery products', 'Cereals and cereal products', 'Flour and prepared flour mixes', 'Breakfast cereal(4)', 'Rice, pasta, cornmeal', 'Rice(4)(5)(6)', 'Bakery products(4)', 'Bread(4)(5)', 'White bread(4)(6)', 'Bread other than white(4)(6)', 'Fresh biscuits, rolls, muffins(5)', 'Cakes, cupcakes, and cookies(4)', 'Cookies(4)(6)', 'Fresh cakes and cupcakes(4)(6)', 'Other bakery products', 'Fresh sweetrolls, coffeecakes, doughnuts(4)(6)', 'Crackers, bread, and cracker products(6)', 'Frozen and refrigerated bakery products, pies, tarts, turnovers(6)', 'Meats, poultry, fish, and eggs', 'Meats, poultry, and fish', 'Meats', 'Beef and veal', 'Uncooked ground beef(4)', 'Uncooked beef roasts(5)', 'Uncooked beef steaks(5)', 'Uncooked other beef and veal(4)(5)', 'Pork', 'Bacon, breakfast sausage, and related products(5)', 'Bacon and related products(6)', 'Breakfast sausage and related products(5)(6)', 'Ham', 'Ham, excluding canned(6)', 'Pork chops(4)'

In [7]:
bls_df

# Row 119 is the last one with food-related data.  Slice everything else out.
# Note: bls_df[:119] keeps rows 0-118 only; use bls_df[:120] if row 119 itself should stay.

bls_df = bls_df[:119]

bls_df

Unnamed: 0,product,unadjusted_percent_change_2023,seasonally_adjusted_effect_2023,unadjusted_effect_2023
0,All items,3.4,,
1,Food,2.7,0.028,0.366
2,Food at home,1.3,0.012,0.114
3,Cereals and bakery products,2.6,-0.003,0.03
4,Cereals and cereal products,0.4,-0.005,0.001
5,Flour and prepared flour mixes,2.2,0.0,0.001
6,Breakfast cereal(4),0.3,-0.004,0.0
7,"Rice, pasta, cornmeal",-0.3,0.0,0.0
8,Rice(4)(5)(6),0.1,,
9,Bakery products(4),3.6,-0.003,0.028


In [8]:
# Prune the product column down to only wanted items. 

indexes_to_keep = [1, 2, 5, 6, 8, 9, 10, 15, 24, 25, 29, 31, 32, 33, 35, 40, 41, 42, 45, 46, 49, 50, 52, 55, 56, 60, 61, 62, 63, 66, 67, 68, 69, 74, 76, 78, 85, 96, 100, 113, 118]

bls_df = bls_df.loc[indexes_to_keep].reset_index(drop=True)

bls_df

Unnamed: 0,product,unadjusted_percent_change_2023,seasonally_adjusted_effect_2023,unadjusted_effect_2023
0,Food,2.7,0.028,0.366
1,Food at home,1.3,0.012,0.114
2,Flour and prepared flour mixes,2.2,0.0,0.001
3,Breakfast cereal(4),0.3,-0.004,0.0
4,Rice(4)(5)(6),0.1,,
5,Bakery products(4),3.6,-0.003,0.028
6,Bread(4)(5),3.1,-0.001,0.007
7,Cookies(4)(6),2.7,,
8,Beef and veal,8.7,0.001,0.038
9,Uncooked ground beef(4),6.7,-0.001,0.011
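
Hard-coded index positions depend on the row order of the source file. A possible alternative, sketched here on a toy frame, is to select rows by product name with `isin`. The `wanted` list is only an illustrative subset, not the full list of 41 rows kept above.

```python
import pandas as pd

# Toy frame using the first few rows shown in the output above.
df = pd.DataFrame({
    'product': ['All items', 'Food', 'Food at home', 'Cereals and bakery products'],
    'unadjusted_percent_change_2023': [3.4, 2.7, 1.3, 2.6],
})

# Illustrative subset of names to keep (an assumption, not the real list).
wanted = ['Food', 'Food at home']

# Keep rows whose product name is in the wanted list, then renumber the index.
kept = df[df['product'].isin(wanted)].reset_index(drop=True)
print(kept)
```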


In [9]:
# Looking further at it, I think unadjusted_percent_change is the numeric data I'm going to want to use from this table. 
# Drop the others. 

bls_df.drop(columns=['seasonally_adjusted_effect_2023', 'unadjusted_effect_2023'], inplace=True)

bls_df

Unnamed: 0,product,unadjusted_percent_change_2023
0,Food,2.7
1,Food at home,1.3
2,Flour and prepared flour mixes,2.2
3,Breakfast cereal(4),0.3
4,Rice(4)(5)(6),0.1
5,Bakery products(4),3.6
6,Bread(4)(5),3.1
7,Cookies(4)(6),2.7
8,Beef and veal,8.7
9,Uncooked ground beef(4),6.7


In [10]:
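# Strip parenthetical footnote markers, e.g. '(4)(5)', from product names.
# Match one parenthetical at a time so text between two parentheticals is not swallowed.

bls_df['product'] = bls_df['product'].str.replace(r'\s*\([^)]*\)', '', regex=True).str.strip()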

bls_df

Unnamed: 0,product,unadjusted_percent_change_2023
0,Food,2.7
1,Food at home,1.3
2,Flour and prepared flour mixes,2.2
3,Breakfast cereal,0.3
4,Rice,0.1
5,Bakery products,3.6
6,Bread,3.1
7,Cookies,2.7
8,Beef and veal,8.7
9,Uncooked ground beef,6.7
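
To check how the parenthetical stripping behaves, here is a small standalone sketch using Python's `re` module. The first three sample names come from the table above; the 'Pies ...' string is a hypothetical name, included only to show how a greedy `\(.*\)` pattern differs from one that matches a single parenthetical at a time.

```python
import re

def strip_footnotes(name):
    """Remove parentheticals such as '(4)' one at a time and trim whitespace."""
    return re.sub(r'\s*\([^)]*\)', '', name).strip()

samples = ['Rice(4)(5)(6)', 'Breakfast cereal(4)', 'Beef and veal']
cleaned = [strip_footnotes(s) for s in samples]
print(cleaned)

# Hypothetical name with text between two parentheticals.
hypothetical = 'Pies (frozen) and tarts (fresh)'
greedy = re.sub(r'\(.*\)', '', hypothetical)   # spans from the first '(' to the last ')'
careful = strip_footnotes(hypothetical)
print(repr(greedy), repr(careful))
```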


In [11]:
# Normalize product names

rename_dict = {
    'Flour and prepared flour mixes' : 'Flour',
    'Breakfast cereal' : 'Cereal', 
    'Bakery products' : 'Bakery', 
    'Uncooked ground beef' : 'Ground beef', 
    'Bacon and related products' : 'Bacon', 
    'Breakfast sausage and related products' : 'Bkfst sausage', 
    'Fresh whole chicken' : 'Whole chicken', 
    'Fish and seafood' : 'Fish', 
    'Fresh fish and seafood' : 'Fresh fish', 
    'Frozen fish and seafood' : 'Frozen fish', 
    'Cheese and related products' : 'Cheese', 
    'Ice cream and related products' : 'Ice cream', 
    'Dried beans, peas, and lentils' : 'Beans'
}

bls_df['product'] = bls_df['product'].replace(rename_dict)

bls_df

Unnamed: 0,product,unadjusted_percent_change_2023
0,Food,2.7
1,Food at home,1.3
2,Flour,2.2
3,Cereal,0.3
4,Rice,0.1
5,Bakery,3.6
6,Bread,3.1
7,Cookies,2.7
8,Beef and veal,8.7
9,Ground beef,6.7


In [75]:
# Export further cleaned bls_df

# index=False keeps the row index from being written out as an extra unnamed column
bls_df.to_csv('bls_cpi_final.csv', index=False)

### Takeaways:
* That's starting to look like a reasonable dataset to use to find matching products in the other datasets. 
* Now... how to do that...
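
One possible way to approach the matching, sketched on toy data: build a single regex from the short BLS names and use `str.extract` to tag each longer USDA product name with the BLS name it contains. The 'grapes' row is hypothetical, and the three BLS names are just a sample.

```python
import re
import pandas as pd

bls_products = ['ground beef', 'bacon', 'apples']   # sample of BLS names
usda = pd.DataFrame({'product': [
    'beef ground beef 70-79% per pound',
    'pork sliced bacon, 1 lb pkg per pound',
    'apples braeburn 3 lb bag',
    'grapes red seedless per pound',   # hypothetical non-matching row
]})

# Try longer names first when two names could match at the same position.
ordered = sorted(bls_products, key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(p) for p in ordered) + ')'

# expand=False returns a Series; rows with no match get NaN.
usda['bls_product'] = usda['product'].str.extract(pattern, expand=False)
print(usda)
```

Rows tagged NaN could then be dropped, leaving a `bls_product` column that works as the shared key.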

## B. Grocery Market Data

In [13]:
usda_df = pd.read_csv('cleaned_usda_data_grouped_and_filtered.csv')

usda_df.head()

Unnamed: 0,product_name,region,report_date,mean_weighted_price,year
0,APPLES BRAEBURN 3 lb bag,MIDWEST U.S.,2023-02-03,3.99,2023
1,APPLES BRAEBURN 3 lb bag,MIDWEST U.S.,2023-02-17,5.99,2023
2,APPLES BRAEBURN 3 lb bag,MIDWEST U.S.,2023-03-31,4.98,2023
3,APPLES BRAEBURN 3 lb bag,MIDWEST U.S.,2023-05-26,4.98,2023
4,APPLES BRAEBURN 3 lb bag,NATIONAL,2023-02-03,3.99,2023


In [14]:
# Let's take 'apples' as the first test case, here. 
# What results can I find for 'apples' in this dataset?

apples_matches = usda_df[usda_df['product_name'].str.contains('apples', case=False, na=False)]

apples_matches.head(5)

# So... a product like 'apples' returned over 3000 rows of data.  This isn't going to be useful,
# particularly given how many *different* ways apples are listed (variety, packaging, etc.).  I
# need to get this down to ONE row of data per product (i.e. 'apple'), per region, per report date.  

Unnamed: 0,product_name,region,report_date,mean_weighted_price,year
0,APPLES BRAEBURN 3 lb bag,MIDWEST U.S.,2023-02-03,3.99,2023
1,APPLES BRAEBURN 3 lb bag,MIDWEST U.S.,2023-02-17,5.99,2023
2,APPLES BRAEBURN 3 lb bag,MIDWEST U.S.,2023-03-31,4.98,2023
3,APPLES BRAEBURN 3 lb bag,MIDWEST U.S.,2023-05-26,4.98,2023
4,APPLES BRAEBURN 3 lb bag,NATIONAL,2023-02-03,3.99,2023


In [15]:
# First, let's get the year column out of there (they're all 2023) and rename product_name
# to match the BLS CPI dataset. 

usda_df.drop('year', axis=1, inplace=True)

usda_df.rename(columns={'product_name': 'product'}, inplace=True)

usda_df.head()

Unnamed: 0,product,region,report_date,mean_weighted_price
0,APPLES BRAEBURN 3 lb bag,MIDWEST U.S.,2023-02-03,3.99
1,APPLES BRAEBURN 3 lb bag,MIDWEST U.S.,2023-02-17,5.99
2,APPLES BRAEBURN 3 lb bag,MIDWEST U.S.,2023-03-31,4.98
3,APPLES BRAEBURN 3 lb bag,MIDWEST U.S.,2023-05-26,4.98
4,APPLES BRAEBURN 3 lb bag,NATIONAL,2023-02-03,3.99


In [16]:
# Okay... now... from the top. First up: flour. 

matches = usda_df[usda_df['product'].str.contains('flour', case=False, na=False)]

matches

# Apparently there are none?  Flour = out. What else ISN'T in the USDA dataset? 

# Flour, cereal, rice, bakery, bread, cookies, poultry, whole chicken, fish, fresh fish, frozen fish,
# eggs, milk, cheese, ice cream, fresh fruits, bananas, citrus fruits (though 'orange' is present),
# fresh vegetables, canned vegetables, frozen vegetables, coffee, butter, and peanut butter.
# 'Dried beans, peas, and lentils' appears only as 'beans'.

Unnamed: 0,product,region,report_date,mean_weighted_price


### QUESTION: 
* Am I only going to keep products that are found in all three of the grocery datasets (USDA, BLS, and Costco)?  Or, I could keep matches in USDA and BLS, and again BLS and Costco.  What about matches between USDA and Costco datasets, too?
* I think I need to stick to using the BLS dataset as a starting point.  
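
To weigh these options, it may help to see how much each overlap actually contains. A minimal sketch with Python sets follows; the Costco names are hypothetical placeholders and the other two sets are small samples, so only the technique carries over.

```python
# Small sample sets; the Costco names are hypothetical placeholders.
usda = {'ground beef', 'bacon', 'apples', 'butter'}
bls = {'ground beef', 'bacon', 'apples', 'butter', 'flour', 'rice'}
costco = {'bacon', 'butter', 'rice'}

in_all_three = usda & bls & costco      # strictest option
bls_and_usda = bls & usda               # BLS as the hub, paired with USDA
bls_and_costco = bls & costco           # BLS as the hub, paired with Costco

print(sorted(in_all_three))
print(sorted(bls_and_usda))
print(sorted(bls_and_costco))
```

Using BLS as the hub keeps more products than requiring a three-way match, at the cost of some products appearing in only one of the two pairings.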

### i. Find products in the usda_df that also exist in the bls_df

In [17]:
# Filter the usda_df to only include products also found in the bls_df. 

# Ensure lowercase and strip spaces
bls_df['product'] = bls_df['product'].str.lower().str.strip()
usda_df['product'] = usda_df['product'].str.lower().str.strip()

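# Create regex pattern to match any BLS product name within USDA product names.
# re.escape guards against product names that contain regex special characters.
pattern = '|'.join(re.escape(p) for p in bls_df['product'])  # Join all product names with '|' for OR matching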

# Filter USDA dataset where the product column contains any BLS product name
filtered_usda_df = usda_df[usda_df['product'].str.contains(pattern, case=False, na=False)]

filtered_usda_df.shape

# I think the usda_df was at least 10x that size before.  We're getting somewhere here! 


(26246, 4)
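
One caveat with plain substring matching is that short names can match inside longer, unrelated words. The sketch below uses two hypothetical product names ('graham crackers ...' and 'shampoo ...') to show the effect, along with word boundaries (`\b`) as one possible guard. Whether the USDA data actually contains such names is an assumption I have not checked.

```python
import re

names = [
    'pork ham, spiral per pound',
    'graham crackers 14 oz box',   # hypothetical
    'shampoo 12 oz',               # hypothetical
]

loose = re.compile('ham')          # plain substring match
strict = re.compile(r'\bham\b')    # 'ham' as a whole word only

loose_hits = [n for n in names if loose.search(n)]
strict_hits = [n for n in names if strict.search(n)]

print(loose_hits)
print(strict_hits)
```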

In [18]:
# Now, I need to clean up each product still remaining in the usda_df. 
# I need a list of products that matched in the usda_df from the bls_df. 

# Initialize an empty list to store matched product names
matched_products = []

# Loop over each product in bls_df
for product in bls_df['product']:
    # Check if the product from bls_df is found anywhere in usda_df
    if usda_df['product'].str.contains(product, case=False, na=False).any():
        matched_products.append(product)

# Display the list of matched products
matched_products

['ground beef',
 'pork',
 'bacon',
 'bkfst sausage',
 'ham',
 'chicken',
 'apples',
 'potatoes',
 'lettuce',
 'tomatoes',
 'beans',
 'butter']

In [19]:
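# I think there are a few more that should've matched, but didn't due to formatting differences.
# (Both keys below were already renamed in the normalization step, so this replace changes nothing.)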

rename_dict = {
    'breakfast sausage' : 'sausage bkfst', 
    'dried beans, peas, and lentils' : 'beans'
}

bls_df['product'] = bls_df['product'].replace(rename_dict)

bls_df


Unnamed: 0,product,unadjusted_percent_change_2023
0,food,2.7
1,food at home,1.3
2,flour,2.2
3,cereal,0.3
4,rice,0.1
5,bakery,3.6
6,bread,3.1
7,cookies,2.7
8,beef and veal,8.7
9,ground beef,6.7


In [20]:
# Re-run matching code

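# Create regex pattern to match any BLS product name within USDA product names.
# re.escape guards against product names that contain regex special characters.
pattern = '|'.join(re.escape(p) for p in bls_df['product'])  # Join all product names with '|' for OR matching

# Filter USDA dataset where the product column contains any BLS product name.
# .copy() avoids a SettingWithCopyWarning when columns are added to usda_df later.
usda_df = usda_df[usda_df['product'].str.contains(pattern, case=False, na=False)].copy()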

usda_df.shape

(26246, 4)

In [21]:
# List matches again

# Initialize an empty list to store matched product names
matched_products = []

# Loop over each product in bls_df
for product in bls_df['product']:
    # Check if the product from bls_df is found anywhere in usda_df
    if usda_df['product'].str.contains(product, case=False, na=False).any():
        matched_products.append(product)

# Display the list of matched products
matched_products

['ground beef',
 'pork',
 'bacon',
 'bkfst sausage',
 'ham',
 'chicken',
 'apples',
 'potatoes',
 'lettuce',
 'tomatoes',
 'beans',
 'butter']

### ii. Ground beef

In [22]:
# Okay... let's get to work, starting with 'ground beef' 

ground_beef = usda_df[usda_df['product'].str.contains(r'ground beef', case=False, na=False)]
                      
ground_beef.head()

Unnamed: 0,product,region,report_date,mean_weighted_price
11542,beef ground beef 70-79% per pound,ALASKA,2023-02-03,2.49
11543,beef ground beef 70-79% per pound,ALASKA,2023-03-03,2.49
11544,beef ground beef 70-79% per pound,HAWAII,2023-01-06,4.69
11545,beef ground beef 70-79% per pound,HAWAII,2023-01-20,4.59
11546,beef ground beef 70-79% per pound,HAWAII,2023-01-27,4.59


In [23]:
# Looking at this and thinking about how I'll want to analyze / visualize, it seems like having
# one data point per MONTH rather than per WEEK per region is what I want. So... let's make that.

# Ensure the report_date is in datetime format if not already
usda_df['report_date'] = pd.to_datetime(usda_df['report_date'])

# Extract year and month from report_date
usda_df['year_month'] = usda_df['report_date'].dt.to_period('M')

# Group by product, region, and the new year_month, and aggregate the price (mean in this case).
# The groupby already yields one row per (product, region, year_month), so no de-duplication is needed.
usda_df = usda_df.groupby(['product', 'region', 'year_month'], as_index=False)['mean_weighted_price'].mean()

usda_df.head()

Unnamed: 0,product,region,year_month,mean_weighted_price
0,apples braeburn 3 lb bag,MIDWEST U.S.,2023-02,4.99
1,apples braeburn 3 lb bag,MIDWEST U.S.,2023-03,4.98
2,apples braeburn 3 lb bag,MIDWEST U.S.,2023-05,4.98
3,apples braeburn 3 lb bag,NATIONAL,2023-02,4.99
4,apples braeburn 3 lb bag,NATIONAL,2023-03,4.98
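
As a quick check of the monthly roll-up, here is the same technique applied to just the four MIDWEST rows from the earlier preview. It is a standalone sketch on a toy frame, not a re-run on the full dataset.

```python
import pandas as pd

# The four weekly MIDWEST rows shown in the earlier preview.
weekly = pd.DataFrame({
    'product': ['apples braeburn 3 lb bag'] * 4,
    'region': ['MIDWEST U.S.'] * 4,
    'report_date': ['2023-02-03', '2023-02-17', '2023-03-31', '2023-05-26'],
    'mean_weighted_price': [3.99, 5.99, 4.98, 4.98],
})

# Convert to datetime, then bucket each report into its calendar month.
weekly['report_date'] = pd.to_datetime(weekly['report_date'])
weekly['year_month'] = weekly['report_date'].dt.to_period('M')

# One row per product, region, and month, holding the mean of the weekly prices.
monthly = (
    weekly.groupby(['product', 'region', 'year_month'], as_index=False)['mean_weighted_price']
    .mean()
)
print(monthly)
```

The two February reports (3.99 and 5.99) average to 4.99, matching the first row of the output above.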


In [24]:
# Okay... back to ground beef.  I don't need things like 70-79%, 80-89%, etc. because
# the bls_df just lists 'ground beef' as a product. I want to aggregate a mean price 
# for all varieties of ground beef.

# Clean the 'product' column to only contain 'ground beef'
usda_df['product'] = usda_df['product'].str.replace(r'.*ground beef.*', 'ground beef', case=False, regex=True)

# Now group by 'product', 'region', and 'year_month', then calculate the mean price
usda_df = usda_df.groupby(['product', 'region', 'year_month'])['mean_weighted_price'].mean().reset_index()

# Re-check 'ground beef' products... Looks like it worked. 

ground_beef = usda_df[usda_df['product'].str.contains(r'ground beef', case=False, na=False)]
                      
ground_beef.head(5)

Unnamed: 0,product,region,year_month,mean_weighted_price
2058,ground beef,ALASKA,2023-01,4.75125
2059,ground beef,ALASKA,2023-02,3.769583
2060,ground beef,ALASKA,2023-03,4.455833
2061,ground beef,ALASKA,2023-04,4.62125
2062,ground beef,ALASKA,2023-05,4.839375
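
One thing to keep in mind with this two-stage approach: averaging the per-variety monthly means weights every variety equally, which is not the same as averaging all the underlying weekly reports. The sketch below uses hypothetical prices to show the difference; which of the two is preferable here is a judgment call.

```python
import pandas as pd

# Hypothetical weekly reports for two ground beef varieties in one region and month.
rows = pd.DataFrame({
    'variety': ['70-79%', '70-79%', '70-79%', '80-89%'],
    'price': [2.00, 2.00, 2.00, 6.00],
})

per_variety = rows.groupby('variety')['price'].mean()
mean_of_means = per_variety.mean()   # each variety counts once
pooled_mean = rows['price'].mean()   # each weekly report counts once

print(mean_of_means, pooled_mean)
```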


### iii. Pork chops

In [25]:
# Next, pork chops... 

pork_chops = usda_df[usda_df['product'].str.contains(r'pork.*chops', case=False, na=False)]

pork_chops.head(5)


Unnamed: 0,product,region,year_month,mean_weighted_price
2500,pork assorted chops b/in per pound,ALASKA,2023-03,2.89
2501,pork assorted chops b/in per pound,ALASKA,2023-04,2.69
2502,pork assorted chops b/in per pound,ALASKA,2023-05,3.49
2503,pork assorted chops b/in per pound,ALASKA,2023-06,2.69
2504,pork assorted chops b/in per pound,ALASKA,2023-07,2.29


In [26]:
# Clean the 'product' column to only contain variations of 'pork chops'
usda_df['product'] = usda_df['product'].str.replace(r'.*pork.*chops.*', 'pork chops', case=False, regex=True)

# Filter for only the rows now labeled 'pork chops' (other pork products, like breakfast sausage, stay out)
pork_chops_df = usda_df[usda_df['product'].str.contains(r'pork chops', case=False, na=False)]

# Group by 'product', 'region', and 'year_month', then calculate the mean price for pork chops
pork_chops_df = pork_chops_df.groupby(['product', 'region', 'year_month'])['mean_weighted_price'].mean().reset_index()

# Preview the result
pork_chops_df.head(5)


Unnamed: 0,product,region,year_month,mean_weighted_price
0,pork chops,ALASKA,2023-01,3.226667
1,pork chops,ALASKA,2023-02,4.041111
2,pork chops,ALASKA,2023-03,3.529333
3,pork chops,ALASKA,2023-04,3.653
4,pork chops,ALASKA,2023-05,4.33


### iv. Bacon

In [27]:
# On to bacon... 

bacon_df = usda_df[usda_df['product'].str.contains(r'bacon', case=False, na=False)]
                      
bacon_df.head(5)

Unnamed: 0,product,region,year_month,mean_weighted_price
3198,pork canadian bacon per pound,ALASKA,2023-04,9.81
3199,pork canadian bacon per pound,ALASKA,2023-07,14.64
3200,pork canadian bacon per pound,ALASKA,2023-08,11.17
3201,pork canadian bacon per pound,ALASKA,2023-12,12.77
3202,pork canadian bacon per pound,HAWAII,2023-09,8.0


In [28]:
# What all different kinds of bacon are listed? 

bacon_df['product'].unique()

array(['pork canadian bacon per pound', 'pork pre-cooked bacon per pound',
       'pork sliced bacon, 1 lb pkg per pound'], dtype=object)

In [29]:
# 'pork sliced bacon' is what we generally consume here in the U.S., so I'm going to just keep that

# Rename any product that contains 'pork sliced bacon, 1 lb pkg per pound' to 'pork sliced bacon'
usda_df['product'] = usda_df['product'].str.replace(r'.*pork sliced bacon, 1 lb pkg per pound.*', 'pork sliced bacon', case=False, regex=True)

# Remove only unwanted bacon types while keeping all other products unchanged
usda_df = usda_df[~usda_df['product'].str.contains(r'pork canadian bacon|pork pre-cooked bacon', case=False, na=False)]

# Group by 'product', 'region', and 'year_month', then calculate the mean price
usda_df = usda_df.groupby(['product', 'region', 'year_month'])['mean_weighted_price'].mean().reset_index()

# Extract only the cleaned bacon data
bacon_df = usda_df[usda_df['product'] == 'pork sliced bacon']

# Preview results
bacon_df.head(5)


Unnamed: 0,product,region,year_month,mean_weighted_price
5547,pork sliced bacon,ALASKA,2023-01,7.395
5548,pork sliced bacon,ALASKA,2023-02,8.2325
5549,pork sliced bacon,ALASKA,2023-03,7.312
5550,pork sliced bacon,ALASKA,2023-04,7.18
5551,pork sliced bacon,ALASKA,2023-05,7.3625


### v. Breakfast sausage

In [30]:
# On to breakfast sausage... 

sausage_df = usda_df[usda_df['product'].str.contains(r'sausage', case=False, na=False)]
                      
sausage_df.head(5)

Unnamed: 0,product,region,year_month,mean_weighted_price
2608,"pork bkfst sausage, 1 lb roll per pound",ALASKA,2023-01,5.99
2609,"pork bkfst sausage, 1 lb roll per pound",ALASKA,2023-02,5.99
2610,"pork bkfst sausage, 1 lb roll per pound",ALASKA,2023-03,4.156667
2611,"pork bkfst sausage, 1 lb roll per pound",ALASKA,2023-04,6.195
2612,"pork bkfst sausage, 1 lb roll per pound",ALASKA,2023-06,5.29


In [31]:
# What all different kinds of sausage are listed? 

sausage_df['product'].unique()

array(['pork bkfst sausage, 1 lb roll per pound',
       'pork bkfst sausage, link/patty per pound',
       'pork dinner sausage per pound', 'pork italian sausage per pound',
       'pork pre-cooked sausage per pound'], dtype=object)

In [32]:
# 'pork bkfst sausage, 1 lb roll per pound' best aligns with the product in the bls_df.  Filter out all others. 

# Rename 'pork bkfst sausage, 1 lb roll per pound' to 'pork breakfast sausage'
usda_df['product'] = usda_df['product'].str.replace(
    r'.*pork bkfst sausage, 1 lb roll per pound.*', 'pork breakfast sausage', case=False, regex=True
)

# Remove only unwanted sausage types while keeping all other products unchanged
usda_df = usda_df[~usda_df['product'].str.contains(
    r'pork bkfst sausage, link/patty|pork dinner sausage|pork italian sausage|pork pre-cooked sausage', 
    case=False, na=False
)]

# Group by 'product', 'region', and 'year_month', then calculate the mean price
usda_df = usda_df.groupby(['product', 'region', 'year_month'])['mean_weighted_price'].mean().reset_index()

# Extract only the cleaned sausage data for preview
sausage_df = usda_df[usda_df['product'] == 'pork breakfast sausage']

# Preview results
sausage_df.head(5)


Unnamed: 0,product,region,year_month,mean_weighted_price
2673,pork breakfast sausage,ALASKA,2023-01,5.99
2674,pork breakfast sausage,ALASKA,2023-02,5.99
2675,pork breakfast sausage,ALASKA,2023-03,4.156667
2676,pork breakfast sausage,ALASKA,2023-04,6.195
2677,pork breakfast sausage,ALASKA,2023-06,5.29


### vi. Ham

In [33]:
# Now, ham... 

ham_df = usda_df[usda_df['product'].str.contains(r'ham', case=False, na=False)]
                      
ham_df.head(5)

Unnamed: 0,product,region,year_month,mean_weighted_price
1372,"beef ham, bnls per pound",NATIONAL,2023-01,4.59
1373,"beef ham, bnls per pound",SOUTH CENTRAL U.S.,2023-01,4.19
1374,"beef ham, bnls per pound",SOUTHEAST U.S.,2023-01,4.99
3557,pork deli ham per pound,ALASKA,2023-03,7.74
3558,pork deli ham per pound,ALASKA,2023-04,7.49


In [34]:
# What all different kinds of ham are listed? 

ham_df['product'].unique()

array(['beef ham, bnls per pound', 'pork deli ham per pound',
       'pork ham steak per pound', 'pork ham, b/in butt per pound',
       'pork ham, b/in per pound', 'pork ham, b/in shank per pound',
       'pork ham, bnls per pound', 'pork ham, spiral per pound',
       'pork pkg/slcd ham, 1 lb/less per pound'], dtype=object)

In [35]:
# Keep only 'pork ham, spiral per pound', which I'm assuming is the most commonly consumed type of ham in the U.S.

# Rename 'pork ham, spiral per pound' to 'pork spiral ham'
usda_df['product'] = usda_df['product'].str.replace(
    r'.*pork ham, spiral per pound.*', 'pork spiral ham', case=False, regex=True
)

# Remove only unwanted pork ham types while keeping all other products unchanged
# (note: 'beef ham, bnls per pound' does not match this pattern, so it stays in usda_df)
usda_df = usda_df[~usda_df['product'].str.contains(
    r'pork deli ham|pork ham steak|pork ham, b/in butt|pork ham, b/in|pork ham, b/in shank|pork ham, bnls|pork pkg/slcd ham', 
    case=False, na=False
)]

# Group by 'product', 'region', and 'year_month', then calculate the mean price
usda_df = usda_df.groupby(['product', 'region', 'year_month'])['mean_weighted_price'].mean().reset_index()

# Extract only the cleaned ham data for preview
ham_df = usda_df[usda_df['product'] == 'pork spiral ham']

# Preview results
ham_df.head(5)

Unnamed: 0,product,region,year_month,mean_weighted_price
4673,pork spiral ham,ALASKA,2023-01,3.69
4674,pork spiral ham,ALASKA,2023-03,4.81
4675,pork spiral ham,ALASKA,2023-04,3.11
4676,pork spiral ham,ALASKA,2023-08,5.99
4677,pork spiral ham,ALASKA,2023-11,4.39


### vii. Chicken

In [36]:
# Now, chicken... 

chicken_df = usda_df[usda_df['product'].str.contains(r'chicken', case=False, na=False)]
                      
chicken_df.head(5)

Unnamed: 0,product,region,year_month,mean_weighted_price
1375,chicken fresh bagged,ALASKA,2023-01,2.015
1376,chicken fresh bagged,ALASKA,2023-02,2.39
1377,chicken fresh bagged,ALASKA,2023-03,1.695
1378,chicken fresh bagged,ALASKA,2023-04,1.2825
1379,chicken fresh bagged,ALASKA,2023-05,2.43


In [37]:
# What all different kinds of chicken are listed? 

chicken_df['product'].unique()

array(['chicken fresh bagged', 'chicken fresh tray',
       'chicken fresh tray regular', 'chicken fresh tray value',
       'chicken fresh/frozen bagged', 'chicken frozen bagged',
       'chicken prepared 7-10 ounces'], dtype=object)

In [38]:
# Keep 'fresh bagged' as the likely most-common variety.  Filter out all others. 

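# Normalize any product name containing 'chicken fresh bagged' to exactly 'chicken fresh bagged'
# (a no-op for the names listed above, kept for consistency with the other products)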
usda_df['product'] = usda_df['product'].str.replace(
    r'.*chicken fresh bagged.*', 'chicken fresh bagged', case=False, regex=True
)

# Remove only unwanted chicken types while keeping all other products unchanged
usda_df = usda_df[~usda_df['product'].str.contains(
    r'chicken fresh tray|chicken fresh tray regular|chicken fresh tray value|chicken fresh/frozen bagged|chicken frozen bagged|chicken prepared 7-10 ounces', 
    case=False, na=False
)]

# Group by 'product', 'region', and 'year_month', then calculate the mean price
usda_df = usda_df.groupby(['product', 'region', 'year_month'])['mean_weighted_price'].mean().reset_index()

# Extract only the cleaned chicken data for preview
chicken_df = usda_df[usda_df['product'] == 'chicken fresh bagged']

# Preview results
chicken_df.head(5)


Unnamed: 0,product,region,year_month,mean_weighted_price
1375,chicken fresh bagged,ALASKA,2023-01,2.015
1376,chicken fresh bagged,ALASKA,2023-02,2.39
1377,chicken fresh bagged,ALASKA,2023-03,1.695
1378,chicken fresh bagged,ALASKA,2023-04,1.2825
1379,chicken fresh bagged,ALASKA,2023-05,2.43


### viii. Apples

In [39]:
# On to apples... 

apples_df = usda_df[usda_df['product'].str.contains(r'apples', case=False, na=False)]
                      
apples_df.head(5)

Unnamed: 0,product,region,year_month,mean_weighted_price
0,apples braeburn 3 lb bag,MIDWEST U.S.,2023-02,4.99
1,apples braeburn 3 lb bag,MIDWEST U.S.,2023-03,4.98
2,apples braeburn 3 lb bag,MIDWEST U.S.,2023-05,4.98
3,apples braeburn 3 lb bag,NATIONAL,2023-02,4.99
4,apples braeburn 3 lb bag,NATIONAL,2023-03,4.98


In [40]:
# What all different kinds of apples are listed? 

apples_df['product'].unique()

array(['apples braeburn 3 lb bag', 'apples braeburn per pound',
       'apples fuji 2 lb bag', 'apples fuji 3 lb bag',
       'apples fuji 5 lb bag', 'apples fuji per pound',
       'apples gala 2 lb bag', 'apples gala 3 lb bag',
       'apples gala 5 lb bag', 'apples gala per pound',
       'apples ginger gold 3 lb bag', 'apples ginger gold 5 lb bag',
       'apples ginger gold per pound', 'apples golden delicious 2 lb bag',
       'apples golden delicious 3 lb bag',
       'apples golden delicious 5 lb bag',
       'apples golden delicious per pound',
       'apples granny smith 2 lb bag', 'apples granny smith 3 lb bag',
       'apples granny smith 5 lb bag', 'apples granny smith per pound',
       'apples honeycrisp 2 lb bag', 'apples honeycrisp 3 lb bag',
       'apples honeycrisp 5 lb bag', 'apples honeycrisp per pound',
       'apples jonagold 3 lb bag', 'apples jonagold 5 lb bag',
       'apples jonagold per pound', 'apples jonathan 3 lb bag',
       'apples jonathan 5 lb bag', 

In [41]:
# Keep everything that's listed 'per pound' and aggregate. 

# Replace only apple products that have 'per pound' with 'apples per pound'
usda_df.loc[usda_df['product'].str.contains(r'apples.*per pound', case=False, na=False), 'product'] = 'apples per pound'

# Remove all apple products that are NOT 'per pound' (e.g., 2 lb bag, 3 lb bag, etc.)
usda_df = usda_df[~usda_df['product'].str.contains(r'apples', case=False, na=False) | (usda_df['product'] == 'apples per pound')]

# Group by 'product', 'region', and 'year_month', then calculate the mean price
usda_df = usda_df.groupby(['product', 'region', 'year_month'])['mean_weighted_price'].mean().reset_index()

# Preview the result for apples
apples_df = usda_df[usda_df['product'] == 'apples per pound']

# Display the top 5 rows
apples_df.head(5)

Unnamed: 0,product,region,year_month,mean_weighted_price
0,apples per pound,ALASKA,2023-01,2.028194
1,apples per pound,ALASKA,2023-02,2.14125
2,apples per pound,ALASKA,2023-03,1.806944
3,apples per pound,ALASKA,2023-04,1.928929
4,apples per pound,ALASKA,2023-05,2.220833


### ix. Potatoes

In [42]:
# Up next: potatoes

potatoes_df = usda_df[usda_df['product'].str.contains(r'potatoes', case=False, na=False)]
                      
potatoes_df.head(5)


Unnamed: 0,product,region,year_month,mean_weighted_price
3282,potatoes round red 10 lb bag,MIDWEST U.S.,2023-01,4.76
3283,potatoes round red 10 lb bag,MIDWEST U.S.,2023-03,4.99
3284,potatoes round red 10 lb bag,MIDWEST U.S.,2023-04,3.99
3285,potatoes round red 10 lb bag,NATIONAL,2023-01,4.76
3286,potatoes round red 10 lb bag,NATIONAL,2023-02,2.99


In [43]:
# What all different kinds of potatoes are listed? 

potatoes_df['product'].unique()

array(['potatoes round red 10 lb bag', 'potatoes round red 3 lb bag',
       'potatoes round red 5 lb bag', 'potatoes round red per pound',
       'potatoes round white 10 lb bag', 'potatoes round white 3 lb bag',
       'potatoes round white 5 lb bag', 'potatoes round white per pound',
       'potatoes russet 10 lb bag', 'potatoes russet 3 lb bag',
       'potatoes russet 5 lb bag', 'potatoes russet 8 lb bag',
       'potatoes russet per pound', 'potatoes yellow type 10 lb bag',
       'potatoes yellow type 3 lb bag', 'potatoes yellow type 5 lb bag',
       'potatoes yellow type 8 lb bag', 'potatoes yellow type per pound'],
      dtype=object)

In [44]:
# Keep everything that's listed 'per pound' and aggregate 

# Replace potato products that have 'per pound' with 'potatoes per pound'
usda_df['product'] = usda_df['product'].str.replace(r'.*potatoes.*per pound.*', 'potatoes per pound', case=False, regex=True)

# Remove all potato products that are not 'per pound' (e.g., 3 lb bag, 5 lb bag, etc.)
usda_df = usda_df[(~usda_df['product'].str.contains(r'potatoes', case=False, na=False)) | (usda_df['product'] == 'potatoes per pound')]

# Group by 'product', 'region', and 'year_month', then calculate the mean price
usda_df = usda_df.groupby(['product', 'region', 'year_month'])['mean_weighted_price'].mean().reset_index()

# Preview the result for potatoes
potatoes_df = usda_df[usda_df['product'] == 'potatoes per pound']

# Display the top 5 rows
potatoes_df.head(5)

Unnamed: 0,product,region,year_month,mean_weighted_price
3282,potatoes per pound,ALASKA,2023-01,1.393125
3283,potatoes per pound,ALASKA,2023-02,1.445556
3284,potatoes per pound,ALASKA,2023-03,1.265556
3285,potatoes per pound,ALASKA,2023-04,1.531667
3286,potatoes per pound,ALASKA,2023-05,1.539444


### x. Lettuce

In [45]:
# Up next: lettuce 

lettuce_df = usda_df[usda_df['product'].str.contains(r'lettuce', case=False, na=False)]
                      
lettuce_df.head(5)

Unnamed: 0,product,region,year_month,mean_weighted_price
383,"lettuce, other boston each",MIDWEST U.S.,2023-08,0.89
384,"lettuce, other boston each",MIDWEST U.S.,2023-09,0.99
385,"lettuce, other boston each",NATIONAL,2023-01,1.493333
386,"lettuce, other boston each",NATIONAL,2023-03,2.495
387,"lettuce, other boston each",NATIONAL,2023-05,1.89


In [46]:
# What all different kinds of lettuce are listed? 

lettuce_df['product'].unique()

array(['lettuce, other boston each', 'lettuce, other boston per pound',
       'lettuce, other green leaf each',
       'lettuce, other green leaf per pound',
       'lettuce, other red leaf each',
       'lettuce, other red leaf per pound',
       'lettuce, romaine hearts 3 count package'], dtype=object)

In [47]:
# Keep everything that's listed 'per pound' and aggregate

# Replace all lettuce products that have 'per pound' with 'lettuce per pound'
usda_df['product'] = usda_df['product'].str.replace(r'.*lettuce.*per pound.*', 'lettuce per pound', case=False, regex=True)

# Remove all lettuce products that are NOT sold per pound (e.g., "each" or "romaine hearts 3 count package")
usda_df = usda_df[(~usda_df['product'].str.contains(r'lettuce', case=False, na=False)) | (usda_df['product'] == 'lettuce per pound')]

# Group by 'product', 'region', and 'year_month', then calculate the mean price
usda_df = usda_df.groupby(['product', 'region', 'year_month'])['mean_weighted_price'].mean().reset_index()

# Verify the result
lettuce_df = usda_df[usda_df['product'] == 'lettuce per pound']
lettuce_df.head(5)


Unnamed: 0,product,region,year_month,mean_weighted_price
383,lettuce per pound,HAWAII,2023-04,1.99
384,lettuce per pound,HAWAII,2023-05,1.59
385,lettuce per pound,HAWAII,2023-06,1.89
386,lettuce per pound,HAWAII,2023-08,1.39
387,lettuce per pound,MIDWEST U.S.,2023-01,1.99


### xi. Tomatoes

In [48]:
# On to tomatoes

tomatoes_df = usda_df[usda_df['product'].str.contains(r'tomatoes', case=False, na=False)]
                      
tomatoes_df.head(5)

Unnamed: 0,product,region,year_month,mean_weighted_price
3132,tomatoes vine ripe - heirloom varieties per pound,ALASKA,2023-06,2.49
3133,tomatoes vine ripe - heirloom varieties per pound,ALASKA,2023-07,3.99
3134,tomatoes vine ripe - heirloom varieties per pound,MIDWEST U.S.,2023-01,5.99
3135,tomatoes vine ripe - heirloom varieties per pound,MIDWEST U.S.,2023-02,2.99
3136,tomatoes vine ripe - heirloom varieties per pound,MIDWEST U.S.,2023-03,3.405


In [49]:
# What all different kinds of tomatoes are listed? 

tomatoes_df['product'].unique()

array(['tomatoes vine ripe - heirloom varieties per pound',
       'tomatoes vine ripes per pound',
       'tomatoes vine ripes, on the vine per pound',
       'tomatoes, plum type roma per pound'], dtype=object)

In [50]:
# Since all are listed 'per pound', aggregate into a single tomato category

# Replace all tomato products that have 'per pound' with 'tomatoes per pound'
usda_df['product'] = usda_df['product'].str.replace(r'.*tomatoes.*per pound.*', 'tomatoes per pound', case=False, regex=True)

# Remove all tomato products that are NOT sold per pound (if any)
usda_df = usda_df[(~usda_df['product'].str.contains(r'tomatoes', case=False, na=False)) | (usda_df['product'] == 'tomatoes per pound')]

# Group by 'product', 'region', and 'year_month', then calculate the mean price
usda_df = usda_df.groupby(['product', 'region', 'year_month'])['mean_weighted_price'].mean().reset_index()

# Verify the result
tomatoes_df = usda_df[usda_df['product'] == 'tomatoes per pound']
tomatoes_df.head(5)


Unnamed: 0,product,region,year_month,mean_weighted_price
3132,tomatoes per pound,ALASKA,2023-01,2.831389
3133,tomatoes per pound,ALASKA,2023-02,2.647778
3134,tomatoes per pound,ALASKA,2023-03,2.393333
3135,tomatoes per pound,ALASKA,2023-04,2.249444
3136,tomatoes per pound,ALASKA,2023-05,2.1225
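
The lettuce and tomato cells repeat the same three steps (relabel the 'per pound' variants, drop the other units, re-aggregate). Below is a minimal sketch of how that could be wrapped in one helper. The function name `collapse_per_pound` and the toy rows are my own assumptions for illustration, not part of the actual dataset, and the helper only checks that both the keyword and 'per pound' appear in the product name, which is slightly looser than the regexes used above.

```python
import pandas as pd

def collapse_per_pound(df, keyword, price_col='mean_weighted_price'):
    """Hypothetical helper: merge all '<keyword> ... per pound' products into one."""
    out = df.copy()
    has_keyword = out['product'].str.contains(keyword, case=False, na=False)
    per_pound = out['product'].str.contains('per pound', case=False, na=False)
    # Relabel every per-pound variant of the keyword with a single product name
    out.loc[has_keyword & per_pound, 'product'] = f'{keyword} per pound'
    # Drop variants of the keyword sold in other units (e.g. 'each')
    out = out[~has_keyword | per_pound]
    # Re-aggregate so each product/region/month has one mean price
    return out.groupby(['product', 'region', 'year_month'], as_index=False)[price_col].mean()

# Toy rows with made-up prices, only to show the behavior
toy_df = pd.DataFrame({
    'product': ['lettuce, other boston each',
                'lettuce, other boston per pound',
                'lettuce, other green leaf per pound',
                'apples per pound'],
    'region': ['NATIONAL'] * 4,
    'year_month': ['2023-01'] * 4,
    'mean_weighted_price': [1.50, 2.00, 3.00, 1.00],
})

result = collapse_per_pound(toy_df, 'lettuce')
print(result)
```

With this in place, each product section would reduce to a single call such as `usda_df = collapse_per_pound(usda_df, 'tomatoes')`, assuming the column names stay the same.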


### xii. Beans

In [51]:
# Almost there! 2nd-to-last: beans

beans_df = usda_df[usda_df['product'].str.contains(r'beans', case=False, na=False)]
                      
beans_df.head(5)

Unnamed: 0,product,region,year_month,mean_weighted_price
81,beans round green type per pound,ALASKA,2023-01,2.48
82,beans round green type per pound,ALASKA,2023-03,2.435
83,beans round green type per pound,ALASKA,2023-04,2.45
84,beans round green type per pound,ALASKA,2023-05,2.98
85,beans round green type per pound,ALASKA,2023-06,2.59


In [52]:
# What all different kinds of beans are listed? 

beans_df['product'].unique()

# Looks like there's already only one type of beans listed *thumbs up*

array(['beans round green type per pound'], dtype=object)

### xiii. Butter

In [53]:
# Last but not least: butter

butter_df = usda_df[usda_df['product'].str.contains(r'butter', case=False, na=False)]
                      
butter_df.head(5)

Unnamed: 0,product,region,year_month,mean_weighted_price
366,"lamb/veal lb butterflied, bnls leg per pound",NATIONAL,2023-02,8.99
367,"lamb/veal lb butterflied, bnls leg per pound",NATIONAL,2023-03,8.99
368,"lamb/veal lb butterflied, bnls leg per pound",NATIONAL,2023-06,7.99
369,"lamb/veal lb butterflied, bnls leg per pound",NATIONAL,2023-07,8.88
370,"lamb/veal lb butterflied, bnls leg per pound",NATIONAL,2023-08,8.88


In [54]:
# What all different kinds of butter are listed? 

butter_df['product'].unique()

# Ah... that's not like... actually butter products

array(['lamb/veal lb butterflied, bnls leg per pound',
       'squash butternut per pound'], dtype=object)
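
The two hits above are substring matches: 'butter' sits inside 'butterflied' and 'butternut'. A small sketch of how a word-boundary pattern would have avoided the false positives, using just the two product names from the output above (the variable names are mine):

```python
import pandas as pd

products = pd.Series([
    'lamb/veal lb butterflied, bnls leg per pound',
    'squash butternut per pound',
])

# Plain substring search: matches 'butter' inside longer words
loose = products.str.contains(r'butter', case=False, na=False)

# \b anchors the pattern to whole words, so 'butterflied' and 'butternut' no longer match
strict = products.str.contains(r'\bbutter\b', case=False, na=False)

print(loose.tolist())
print(strict.tolist())
```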

In [55]:
# Filter out rows where 'product' contains 'butterflied' or 'butternut'
usda_df = usda_df[~usda_df['product'].str.contains(r'butterflied|butternut', case=False, na=False)]

# Verify the result by previewing the dataset
usda_df.head(5)


Unnamed: 0,product,region,year_month,mean_weighted_price
0,apples per pound,ALASKA,2023-01,2.028194
1,apples per pound,ALASKA,2023-02,2.14125
2,apples per pound,ALASKA,2023-03,1.806944
3,apples per pound,ALASKA,2023-04,1.928929
4,apples per pound,ALASKA,2023-05,2.220833


### xiv. Final usda_df checks

In [56]:
# So, what's going on in usda_df now? 

usda_df.shape

# IIRC, it had ~26k rows prior to aggregation.  So, we've made a much more manageable dataset out of it *thumbs up*

(3149, 4)

In [57]:
# What products do we have in usda_df now?

usda_df['product'].unique().tolist()

['apples per pound',
 'beans round green type per pound',
 'beef ham, bnls per pound',
 'chicken fresh bagged',
 'ground beef',
 'lettuce per pound',
 'pork backribs per pound',
 'pork bnls ribeye steak per pound',
 'pork bone-in cc loin per pound',
 'pork breakfast sausage',
 'pork butt fresh b/in per pound',
 'pork butt roast bnls per pound',
 'pork chops',
 'pork chorizo per pound',
 'pork country style rib b/in per pound',
 'pork country style ribs bnls per pound',
 'pork deli cooked backribs per pound',
 'pork deli cooked pork roast per pound',
 'pork deli cooked spareribs per pound',
 'pork feet per pound',
 'pork ground pork per pound',
 'pork loin roast bnls per pound',
 'pork neckbones per pound',
 'pork picnic fresh b/in per pound',
 'pork pork steak per pound',
 'pork porketta per pound',
 'pork pulled pork per pound',
 'pork rib end roast b/in per pound',
 'pork sirloin end roast b/in per pound',
 'pork sirloin roast bnls per pound',
 'pork sliced bacon',
 'pork spareribs p

In [58]:
# Seems like there's a lot of pork products that didn't get filtered out... 

usda_df[usda_df['product'].str.contains(r'pork', case=False, na=False)].head()

Unnamed: 0,product,region,year_month,mean_weighted_price
424,pork backribs per pound,ALASKA,2023-01,3.49
425,pork backribs per pound,ALASKA,2023-02,3.91
426,pork backribs per pound,ALASKA,2023-03,4.0675
427,pork backribs per pound,ALASKA,2023-04,4.323333
428,pork backribs per pound,ALASKA,2023-05,5.4125


In [59]:
# Filter the dataset to remove all unwanted pork products that fall outside of the aggregations we did for 
# pork chops, bacon, ham, and breakfast sausage above. 

usda_df = usda_df[~(
    usda_df['product'].str.contains(r'pork', case=False, na=False) &
    ~usda_df['product'].str.contains(r'pork chops|pork sliced bacon|pork spiral ham|pork\s*breakfast\s*sausage', case=False, na=False)
)]

# Display the first few rows to confirm the result
usda_df.head(5)


# Display the remaining pork products to confirm the result

usda_df[usda_df['product'].str.contains(r'pork', case=False, na=False)].head()

Unnamed: 0,product,region,year_month,mean_weighted_price
597,pork breakfast sausage,ALASKA,2023-01,5.99
598,pork breakfast sausage,ALASKA,2023-02,5.99
599,pork breakfast sausage,ALASKA,2023-03,4.156667
600,pork breakfast sausage,ALASKA,2023-04,6.195
601,pork breakfast sausage,ALASKA,2023-06,5.29


In [60]:
# What's the list look like now? 

usda_df['product'].unique().tolist()

['apples per pound',
 'beans round green type per pound',
 'beef ham, bnls per pound',
 'chicken fresh bagged',
 'ground beef',
 'lettuce per pound',
 'pork breakfast sausage',
 'pork chops',
 'pork sliced bacon',
 'pork spiral ham',
 'potatoes per pound',
 'tomatoes per pound']

In [61]:
# Alright... hopefully final changes... 

# Remove 'beef ham, bnls per pound'
usda_df = usda_df[~usda_df['product'].str.contains('beef ham, bnls per pound', case=False, na=False)]

# Change 'beans round green type per pound' to 'beans per pound'
usda_df['product'] = usda_df['product'].replace('beans round green type per pound', 'beans per pound')

# Preview to check changes
usda_df['product'].unique().tolist()

['apples per pound',
 'beans per pound',
 'chicken fresh bagged',
 'ground beef',
 'lettuce per pound',
 'pork breakfast sausage',
 'pork chops',
 'pork sliced bacon',
 'pork spiral ham',
 'potatoes per pound',
 'tomatoes per pound']

In [62]:
# Export usda_df to csv

usda_df.to_csv('usda_final.csv')

### Takeaways: 
* I think I'm pretty close to a finished usda_df, which, considering that this thing was a whopping 2.6GB of data to begin with, is some progress! 
* One thing that just struck me is that I'm not sure about certain products, such as 'chicken fresh bagged', 'ground beef', 'pork chops', etc.  Are those prices per pound?  For any product that's not specifically listed as 'per pound' right now, I need to go back and check what its unit of measurement was.
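
To make that follow-up concrete, here is a quick sketch that lists which of the final product names don't carry a 'per pound' label. The product list is copied from the output of the previous cell; the variable names are just for illustration.

```python
import pandas as pd

# Final product names, copied from the usda_df output above
final_products = pd.Series([
    'apples per pound', 'beans per pound', 'chicken fresh bagged',
    'ground beef', 'lettuce per pound', 'pork breakfast sausage',
    'pork chops', 'pork sliced bacon', 'pork spiral ham',
    'potatoes per pound', 'tomatoes per pound',
])

# Anything without 'per pound' in its name needs its unit checked in the raw data
needs_unit_check = final_products[
    ~final_products.str.contains('per pound', case=False, na=False)
].tolist()

print(needs_unit_check)
```

On the real dataframe, the same filter could be run against `usda_df['product'].unique()`.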

## C. Verified Grocery Sales Data

In [63]:
costco_df = pd.read_csv('cleaned_costco_data.csv')
costco_df.head()

Unnamed: 0,sub_category,price,title
0,Bakery & Desserts,56.99,"David’s Cookies Mile High Peanut Butter Cake, ..."
1,Bakery & Desserts,159.99,"The Cake Bake Shop 8"" Round Carrot Cake (16-22..."
2,Bakery & Desserts,44.99,"St Michel Madeleine, Classic French Sponge Cak..."
3,Bakery & Desserts,39.99,"David's Cookies Butter Pecan Meltaways 32 oz, ..."
4,Bakery & Desserts,59.99,"David’s Cookies Premier Chocolate Cake, 7.2 lb..."


In [64]:
# Take the list of products from the bls_df that had matches in the usda_df and find matches in the
# title column here in the costco_df

import pandas as pd

# List of matched products
matched_products = [
    'ground beef', 'pork', 'bacon', 'bkfst sausage', 'ham', 'chicken', 
    'apples', 'potatoes', 'lettuce', 'tomatoes', 'beans'
]

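# Create a function that returns the first product in matched_products
# that appears as a whole word in a given title
def find_matching_product(title):
    if pd.isna(title):
        return None  # Missing titles cannot match anything
    for product in matched_products:
        # re.escape keeps the product name literal if it ever contains regex metacharacters
        if re.search(rf'\b{re.escape(product)}\b', title, re.IGNORECASE):
            return product  # Return the first matching product
    return None  # Return None if no match is found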

# Apply function to create a new column in costco_df
costco_df['matched_product'] = costco_df['title'].apply(find_matching_product)

# Filter out rows where no match was found
filtered_costco_df = costco_df.dropna(subset=['matched_product'])

# Sort the filtered DataFrame alphabetically by 'matched_product'
filtered_costco_df = filtered_costco_df.sort_values(by='matched_product')

# Group by 'matched_product' for easier analysis
grouped_costco = filtered_costco_df.groupby('matched_product')

# Display grouped results
grouped_costco.head(5)



Unnamed: 0,sub_category,price,title,matched_product
612,Gift Baskets,59.99,The Fruit Company Vintage Crate with Pears & A...,apples
929,Meat & Seafood,89.99,Authentic Wagyu Kurobuta Applewood Smoked Thic...,bacon
1120,Pantry & Dry Goods,10.99,"Kirkland Signature, Bacon Crumbles, 20 oz",bacon
1316,Seafood,119.99,Northwest Fish Wild Alaskan Sockeye Salmon Che...,bacon
949,Meat & Seafood,249.99,Authentic Wagyu Japanese A5 Bacon Wrapped Cube...,bacon
839,Kirkland Signature Grocery,10.99,"Kirkland Signature, Bacon Crumbles, 20 oz",bacon
1180,Pantry & Dry Goods,10.99,"S&W, Organic Garbanzo Beans, 15.5 oz, 8-Count",beans
1161,Pantry & Dry Goods,9.99,"S&W, Organic Black Beans, 15 oz, 8-Count",beans
1058,Organic,9.99,"S&W, Organic Black Beans, 15 oz, 8-Count",beans
1108,Pantry & Dry Goods,12.99,"Del Monte, Canned Cut Green Beans, 14.5 oz, 12...",beans


### Takeaways:
* This sort of confirms what I've been suspecting from the beginning: that the Costco dataset isn't particularly good at providing information that aligns with the way products are represented in the USDA dataset (i.e. unprocessed, by the pound).  Almost everything in the Costco dataset is processed.  Even things like "apples" can't be found by the pound.  
* I think we need to find a new Verified Grocery Sales Data dataset.  Off we go... 

## D. Verified Grocery Sales Data - TAKE TWO!

So, after determining that the Costco dataset wasn't going to work well, I looked for alternatives.  After testing several, I've currently landed on the U.S. Bureau of Labor Statistics' [Average Retail Food and Energy Prices, U.S. City Average and West Region](https://www.bls.gov/regions/mid-atlantic/data/averageretailfoodandenergyprices_usandwest_table.htm) dataset.  Verified grocery sales data is hard to come by, as most retail chains consider it proprietary and don't release it to the public for competitive reasons, but this appears to give usable grocery price insights.  

To make use of the data, I copied it from the BLS's website into an Excel document, where I did most of the cleaning to narrow it down to products that are also available in both the USDA and BLS Consumer Price Index datasets.  

In [76]:
bls_arf_df = pd.read_excel('bls_arf.xlsx')
bls_arf_df.head()

Unnamed: 0,Item and unit,Unnamed: 1,US Dec 2023 Price,US Nov 2024 Price,US Dec 2024 Price,US Percent Change from Dec 2023,Percent Change from Nov 2024,Unnamed: 7,West Dec 2023 Price,West Nov 2024 Price,West Dec 2024 Price,West Percent Change from Dec 2023,Percent Change from Nov 2024.1
0,"Ground beef, per lb. (453.6 gm)",,5.566,5.874,5.863,5.3,-0.2,,5.755,5.916,5.886,2.3,-0.5
1,"Bacon, sliced, per lb. (453.6 gm)",,6.774,6.843,6.915,2.1,1.1,,6.92,7.465,7.3,5.5,-2.2
2,"All Pork Chops, per lb. (453.6 gm)",,4.256,4.43,4.308,1.2,-2.8,,4.73,4.672,4.542,-4.0,-2.8
3,"Ham, boneless, excluding canned, per lb. (453....",,5.497,5.63,5.458,-0.7,-3.1,,5.378,5.717,5.624,4.6,-1.6
4,"Chicken, fresh, whole, per lb. (453.6 gm)",,1.955,2.076,2.061,5.4,-0.7,,2.104,2.144,2.152,2.3,0.4


In [78]:
# Normalize column headers

bls_arf_df.columns = bls_arf_df.columns.str.replace(' ', '_').str.lower()
bls_arf_df.head()

Unnamed: 0,item_and_unit,unnamed:_1,us_dec_2023_price,us_nov_2024_price,us_dec_2024_price,us_percent_change_from_dec_2023,percent_change_from_nov_2024,unnamed:_7,west_dec_2023_price,west_nov_2024_price,west_dec_2024_price,west_percent_change_from_dec_2023,percent_change_from_nov_2024.1
0,"Ground beef, per lb. (453.6 gm)",,5.566,5.874,5.863,5.3,-0.2,,5.755,5.916,5.886,2.3,-0.5
1,"Bacon, sliced, per lb. (453.6 gm)",,6.774,6.843,6.915,2.1,1.1,,6.92,7.465,7.3,5.5,-2.2
2,"All Pork Chops, per lb. (453.6 gm)",,4.256,4.43,4.308,1.2,-2.8,,4.73,4.672,4.542,-4.0,-2.8
3,"Ham, boneless, excluding canned, per lb. (453....",,5.497,5.63,5.458,-0.7,-3.1,,5.378,5.717,5.624,4.6,-1.6
4,"Chicken, fresh, whole, per lb. (453.6 gm)",,1.955,2.076,2.061,5.4,-0.7,,2.104,2.144,2.152,2.3,0.4


In [None]:
# Drop 2 empty columns

bls_arf_df.drop(columns=['unnamed:_1', 'unnamed:_7'], inplace=True)
bls_arf_df.head()

Unnamed: 0,item_and_unit,us_dec_2023_price,us_nov_2024_price,us_dec_2024_price,us_percent_change_from_dec_2023,percent_change_from_nov_2024,west_dec_2023_price,west_nov_2024_price,west_dec_2024_price,west_percent_change_from_dec_2023,percent_change_from_nov_2024.1
0,"Ground beef, per lb. (453.6 gm)",5.566,5.874,5.863,5.3,-0.2,5.755,5.916,5.886,2.3,-0.5
1,"Bacon, sliced, per lb. (453.6 gm)",6.774,6.843,6.915,2.1,1.1,6.92,7.465,7.3,5.5,-2.2
2,"All Pork Chops, per lb. (453.6 gm)",4.256,4.43,4.308,1.2,-2.8,4.73,4.672,4.542,-4.0,-2.8
3,"Ham, boneless, excluding canned, per lb. (453....",5.497,5.63,5.458,-0.7,-3.1,5.378,5.717,5.624,4.6,-1.6
4,"Chicken, fresh, whole, per lb. (453.6 gm)",1.955,2.076,2.061,5.4,-0.7,2.104,2.144,2.152,2.3,0.4


In [None]:
# I don't think I care about just the West region, if I can't compare it to other regions

bls_arf_df.drop(columns=bls_arf_df.columns[bls_arf_df.columns.str.contains('west', case=False)], inplace=True)
bls_arf_df.head()

Unnamed: 0,item_and_unit,us_dec_2023_price,us_nov_2024_price,us_dec_2024_price,us_percent_change_from_dec_2023,percent_change_from_nov_2024,percent_change_from_nov_2024.1
0,"Ground beef, per lb. (453.6 gm)",5.566,5.874,5.863,5.3,-0.2,-0.5
1,"Bacon, sliced, per lb. (453.6 gm)",6.774,6.843,6.915,2.1,1.1,-2.2
2,"All Pork Chops, per lb. (453.6 gm)",4.256,4.43,4.308,1.2,-2.8,-2.8
3,"Ham, boneless, excluding canned, per lb. (453....",5.497,5.63,5.458,-0.7,-3.1,-1.6
4,"Chicken, fresh, whole, per lb. (453.6 gm)",1.955,2.076,2.061,5.4,-0.7,0.4


In [None]:
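# The leftover '.1' column isn't a true duplicate: it's the West region's change from Nov 2024
# (its header lacked the 'west' prefix, so the filter above missed it). Drop it too.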

bls_arf_df.drop('percent_change_from_nov_2024.1', axis=1, inplace=True)
bls_arf_df.head()

Unnamed: 0,item_and_unit,us_dec_2023_price,us_nov_2024_price,us_dec_2024_price,us_percent_change_from_dec_2023,percent_change_from_nov_2024
0,"Ground beef, per lb. (453.6 gm)",5.566,5.874,5.863,5.3,-0.2
1,"Bacon, sliced, per lb. (453.6 gm)",6.774,6.843,6.915,2.1,1.1
2,"All Pork Chops, per lb. (453.6 gm)",4.256,4.43,4.308,1.2,-2.8
3,"Ham, boneless, excluding canned, per lb. (453....",5.497,5.63,5.458,-0.7,-3.1
4,"Chicken, fresh, whole, per lb. (453.6 gm)",1.955,2.076,2.061,5.4,-0.7


In [None]:
# Rename the 'item_and_unit' column to be able to serve as a foreign key

bls_arf_df.rename(columns={'item_and_unit': 'product'}, inplace=True)
bls_arf_df.head()

Unnamed: 0,product,us_dec_2023_price,us_nov_2024_price,us_dec_2024_price,us_percent_change_from_dec_2023,percent_change_from_nov_2024
0,"Ground beef, per lb. (453.6 gm)",5.566,5.874,5.863,5.3,-0.2
1,"Bacon, sliced, per lb. (453.6 gm)",6.774,6.843,6.915,2.1,1.1
2,"All Pork Chops, per lb. (453.6 gm)",4.256,4.43,4.308,1.2,-2.8
3,"Ham, boneless, excluding canned, per lb. (453....",5.497,5.63,5.458,-0.7,-3.1
4,"Chicken, fresh, whole, per lb. (453.6 gm)",1.955,2.076,2.061,5.4,-0.7


In [84]:
# Correct name of 'percent_change_from_nov_2024' column

bls_arf_df.rename(columns={'percent_change_from_nov_2024': 'us_percent_change_from_nov_2024'}, inplace=True)
bls_arf_df.head()


Unnamed: 0,product,us_dec_2023_price,us_nov_2024_price,us_dec_2024_price,us_percent_change_from_dec_2023,us_percent_change_from_nov_2024
0,"Ground beef, per lb. (453.6 gm)",5.566,5.874,5.863,5.3,-0.2
1,"Bacon, sliced, per lb. (453.6 gm)",6.774,6.843,6.915,2.1,1.1
2,"All Pork Chops, per lb. (453.6 gm)",4.256,4.43,4.308,1.2,-2.8
3,"Ham, boneless, excluding canned, per lb. (453....",5.497,5.63,5.458,-0.7,-3.1
4,"Chicken, fresh, whole, per lb. (453.6 gm)",1.955,2.076,2.061,5.4,-0.7
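
As a sanity check on the two percent-change columns, here is a small sketch that recomputes them from the price columns for the five rows previewed above. The formula is the usual (new - old) / old * 100; the index labels are shorthand of my own, and the tolerance is an assumption that the published figures are rounded to one decimal.

```python
import numpy as np
import pandas as pd

# The five previewed rows, copied from the output above (index labels are shorthand)
check_df = pd.DataFrame({
    'us_dec_2023_price': [5.566, 6.774, 4.256, 5.497, 1.955],
    'us_nov_2024_price': [5.874, 6.843, 4.430, 5.630, 2.076],
    'us_dec_2024_price': [5.863, 6.915, 4.308, 5.458, 2.061],
    'us_percent_change_from_dec_2023': [5.3, 2.1, 1.2, -0.7, 5.4],
    'us_percent_change_from_nov_2024': [-0.2, 1.1, -2.8, -3.1, -0.7],
}, index=['ground beef', 'bacon', 'pork chops', 'ham', 'chicken'])

def pct_change(new, old):
    """Percent change from old to new."""
    return (new - old) / old * 100

recomputed_dec = pct_change(check_df['us_dec_2024_price'], check_df['us_dec_2023_price'])
recomputed_nov = pct_change(check_df['us_dec_2024_price'], check_df['us_nov_2024_price'])

# Reported values are rounded to one decimal, so allow just over 0.05 of slack
dec_ok = np.isclose(recomputed_dec, check_df['us_percent_change_from_dec_2023'], atol=0.051, rtol=0)
nov_ok = np.isclose(recomputed_nov, check_df['us_percent_change_from_nov_2024'], atol=0.051, rtol=0)

print(dec_ok.all(), nov_ok.all())
```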


In [88]:
# Normalize product names

bls_arf_df['product'].unique().tolist()

# Remove unwanted parts (and handle "All Pork Chops")
cleaned_products = [
    re.sub(r',.*$', '', product).replace('All ', '') if 'Pork Chops' in product else re.sub(r',.*$', '', product)
    for product in bls_arf_df['product']
]

# Apply the cleaned products back to the DataFrame
bls_arf_df['product'] = cleaned_products
bls_arf_df.head()

Unnamed: 0,product,us_dec_2023_price,us_nov_2024_price,us_dec_2024_price,us_percent_change_from_dec_2023,us_percent_change_from_nov_2024
0,Ground beef,5.566,5.874,5.863,5.3,-0.2
1,Bacon,6.774,6.843,6.915,2.1,1.1
2,Pork Chops,4.256,4.43,4.308,1.2,-2.8
3,Ham,5.497,5.63,5.458,-0.7,-3.1
4,Chicken,1.955,2.076,2.061,5.4,-0.7


In [89]:
# Export cleaned dataframe to csv

bls_arf_df.to_csv('bls_arf_final.csv')
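
Since the plan is to use the product column as a key across tables, here is a rough sketch of checking how the cleaned names line up with the final USDA product names. It only uses the five previewed rows, and lowercasing as the first normalization step is my assumption; anything that still doesn't match would need an explicit mapping, which I haven't decided on yet.

```python
# Final USDA product names, copied from the earlier output
usda_products = {
    'apples per pound', 'beans per pound', 'chicken fresh bagged',
    'ground beef', 'lettuce per pound', 'pork breakfast sausage',
    'pork chops', 'pork sliced bacon', 'pork spiral ham',
    'potatoes per pound', 'tomatoes per pound',
}

# Cleaned product names from the first five rows previewed above
arf_products = ['Ground beef', 'Bacon', 'Pork Chops', 'Ham', 'Chicken']

# Assumed first step: lowercase so casing differences don't block a match
arf_keys = [name.lower() for name in arf_products]

direct_matches = [key for key in arf_keys if key in usda_products]
still_unmatched = [key for key in arf_keys if key not in usda_products]

print('direct matches:', direct_matches)
print('need a manual mapping:', still_unmatched)
```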