# Step 1: Data Collection & Setup

In [7]:
# Import necessary libraries
import pandas as pd
import openpyxl  # engine pandas uses to read .xlsx files

## A. Grocery Market Data 

This dataset was taken from the U.S. Department of Agriculture Agricultural Marketing Service's (AMS) [Market News Report](https://marketnews.usda.gov/mnp/dataDownload), which combines data from several AMS Market News reporting categories into a single dataset: dairy and milk; fruits, vegetables, and specialty crops; livestock, meats, poultry, eggs, grain, and hay; organic; and local and regional foods.

In [None]:
# Create a dataframe from the USDA AMS Market News Report
usda_ams_retail_df = pd.read_csv('MNRetailDatasetCSV.CSV')


In [8]:
# Get an overview of the df with a .head()
usda_ams_retail_df.head(5)

Unnamed: 0,LEVEL_OF_TRADE,FREQUENCY,ISSUING_OFFICE,REPORT_DATE,PROGRAM,LEVEL_1,LEVEL_2,LEVEL_3,LEVEL_4,LEVEL_5,...,STORES_WITH_ADS,STORE_OUTLETS,FEATURE_RATE,SPECIAL_RATE,ACTIVITY_INDEX,LOCALLY_GROWN_PERCENTAGE,WEIGHTED_AVERAGE_PRICE,PRICE_LOW,PRICE_HIGH,PRODUCT_QUALITY
0,RETAIL,WEEKLY,"DES MOINES, IA",30-AUG-24,POULTRY,EGGS,,LIQUID,,,...,,5500.0,,,65.0,,,,,
1,RETAIL,WEEKLY,"DES MOINES, IA",30-AUG-24,POULTRY,EGGS,,LIQUID,,,...,,5500.0,1.2,,,,,,,
2,RETAIL,WEEKLY,"DES MOINES, IA",30-AUG-24,POULTRY,EGGS,,LIQUID,,,...,,29200.0,,,65.0,,,,,
3,RETAIL,WEEKLY,"DES MOINES, IA",30-AUG-24,POULTRY,EGGS,,LIQUID,,,...,,29200.0,0.2,,,,,,,
4,RETAIL,WEEKLY,"DES MOINES, IA",30-AUG-24,POULTRY,EGGS,,SHELL,ALL SHELL,,...,,100.0,,11.0,,,,,,


In [7]:
# Get more detail on what columns are in the df
usda_ams_retail_df.columns

Index(['LEVEL_OF_TRADE', 'FREQUENCY', 'ISSUING_OFFICE', 'REPORT_DATE',
       'PROGRAM', 'LEVEL_1', 'LEVEL_2', 'LEVEL_3', 'LEVEL_4', 'LEVEL_5',
       'ORGANIC', 'SPECIALTY', 'UNIT', 'REGION', 'STORES_WITH_ADS',
       'STORE_OUTLETS', 'FEATURE_RATE', 'SPECIAL_RATE', 'ACTIVITY_INDEX',
       'LOCALLY_GROWN_PERCENTAGE', 'WEIGHTED_AVERAGE_PRICE', 'PRICE_LOW',
       'PRICE_HIGH', 'PRODUCT_QUALITY'],
      dtype='object')

In [12]:
# I'm seeing a lot of NaNs.  What % of each column is NaN?
nan_percentage = usda_ams_retail_df.isna().sum() / len(usda_ams_retail_df) * 100
print(nan_percentage)

LEVEL_OF_TRADE               0.000000
FREQUENCY                    0.000000
ISSUING_OFFICE               0.000000
REPORT_DATE                  0.000000
PROGRAM                      0.000000
LEVEL_1                      0.097223
LEVEL_2                      8.058644
LEVEL_3                     66.054698
LEVEL_4                     66.778009
LEVEL_5                     87.909288
ORGANIC                     56.398690
SPECIALTY                   58.290505
UNIT                        11.750140
REGION                       0.000000
STORES_WITH_ADS             11.854112
STORE_OUTLETS               88.249860
FEATURE_RATE                93.781300
SPECIAL_RATE                97.072425
ACTIVITY_INDEX              93.617837
LOCALLY_GROWN_PERCENTAGE    99.917057
WEIGHTED_AVERAGE_PRICE      11.750159
PRICE_LOW                   33.093714
PRICE_HIGH                  33.048509
PRODUCT_QUALITY             44.232302
dtype: float64


### Takeaways:
* This dataset appears to have the requisite raw data. 
* This dataset is large, and it will benefit from dropping unnecessary columns (several, such as `LOCALLY_GROWN_PERCENTAGE` and `SPECIAL_RATE`, are more than 90% NaN) and filtering out rows with empty values.
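
The cleanup described above could be sketched as follows. This is a minimal sketch on a toy dataframe, not the final cleaning code: the 90% threshold, the helper name `drop_sparse_columns`, and the toy values are all assumptions made for illustration. It also parses `REPORT_DATE` strings such as `30-AUG-24` into datetimes, assuming that day-month-year format holds throughout the file.

```python
import pandas as pd

# Toy stand-in for usda_ams_retail_df; the values are illustrative only.
toy_retail_df = pd.DataFrame({
    "REPORT_DATE": ["30-AUG-24", "30-AUG-24", "23-AUG-24", "23-AUG-24"],
    "LEVEL_1": ["EGGS", "EGGS", "MILK", "MILK"],
    "LOCALLY_GROWN_PERCENTAGE": [None, None, None, None],
    "WEIGHTED_AVERAGE_PRICE": [2.49, None, 3.19, 3.29],
})


def drop_sparse_columns(df, max_nan_pct=90.0):
    """Keep only columns whose NaN percentage is at or below max_nan_pct."""
    nan_pct = df.isna().mean() * 100
    return df.loc[:, nan_pct <= max_nan_pct]


# Drop the mostly-empty columns (hypothetical 90% cutoff).
trimmed_df = drop_sparse_columns(toy_retail_df)

# Keep only rows that actually report a price.
priced_df = trimmed_df.dropna(subset=["WEIGHTED_AVERAGE_PRICE"]).copy()

# Title-case the month ("AUG" -> "Aug") before parsing with %b.
priced_df["REPORT_DATE"] = pd.to_datetime(
    priced_df["REPORT_DATE"].str.title(), format="%d-%b-%y"
)

print(priced_df)
```

Which columns count as "unnecessary" depends on the analysis, so the threshold is only a starting point; a column like `STORE_OUTLETS` is sparse but may still matter for some questions.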

## B. Inflation Trend Data

This dataset was taken from the Bureau of Labor Statistics Consumer Price Index (CPI) for [All Urban Consumers](https://www.bls.gov/cpi/tables/supplemental-files/). 

In [11]:
# Create a dataframe from the BLS's CPI for All Urban Consumers
bls_cpi_df = pd.read_excel('cpi-u-202412.xlsx')

In [14]:
# Get an overview of the dataframe with a .head()
bls_cpi_df.head()

Unnamed: 0,Expenditure category,Relative\nimportance\nNov.\n2024,Unadjusted indexes,Unadjusted indexes.1,Unadjusted indexes.2,Unadjusted indexes.3,Unadjusted indexes.4,Unadjusted indexes.5,Unadjusted indexes.6,Unadjusted indexes.7,...,Seasonally adjusted percent change.1,Seasonally adjusted percent change.2,One Month,One Month.1,One Month.2,One Month.3,Twelve Month,Twelve Month.1,Twelve Month.2,Twelve Month.3
0,,,Dec.\n2023,Jan.\n2024,Feb.\n2024,Mar.\n2024,Apr.\n2024,May\n2024,Jun.\n2024,Jul.\n2024,...,Oct.\n2024-\nNov.\n2024,Nov.\n2024-\nDec.\n2024,Seasonally adjusted effect on All Items\nNov. ...,"Standard error, median price change(2)",Largest (L) or Smallest (S) seasonally adjuste...,Largest (L) or Smallest (S) seasonally adjuste...,Unadjusted effect on All Items\nDec. 2023-\nDe...,"Standard error, median price change(2)",Largest (L) or Smallest (S) unadjusted change ...,Largest (L) or Smallest (S) unadjusted change ...
1,,,,,,,,,,,...,,,,,Date,Percent change,,,Date,Percent change
2,,,,,,,,,,,...,,,,,,,,,,
3,All items,100.0,306.746,308.417,310.326,312.332,313.548,314.069,314.175,314.54,...,0.3,0.4,,0.04,L-Mar. 2024,0.4,,0.1,L-Jul. 2024,2.9
4,Food,13.483,325.409,327.327,327.731,328.043,328.678,329.12,329.71,330.561,...,0.4,0.3,0.042,0.08,S-Oct. 2024,0.2,0.34,0.25,L-Jan. 2024,2.6


### Takeaways:
* I did some cleaning in Excel (deleting empty rows) before importing the file into a pandas dataframe, because the original table's layout did not read in cleanly.
* I believe this table will serve its intended purpose, but it will likely need a good deal of cleaning first (deleting unnecessary categories, renaming columns for ease of understanding, etc.).
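
The `.head()` output shows the month labels (e.g. `Dec.\n2023`) sitting in the first data row rather than in the column names. One possible way to handle the renaming is sketched below on a toy dataframe. The helper `flatten_header` and the `" | "` separator are hypothetical choices, and the sketch assumes the sub-labels live in row 0, as they appear to in the output above.

```python
import re

import pandas as pd

# Toy stand-in for bls_cpi_df, using a few values from the head() output.
toy_cpi_df = pd.DataFrame({
    "Expenditure category": [None, None, "All items", "Food"],
    "Unadjusted indexes": ["Dec.\n2023", None, 306.746, 325.409],
    "Unadjusted indexes.1": ["Jan.\n2024", None, 308.417, 327.327],
})


def flatten_header(df, header_row=0, label_col="Expenditure category"):
    """Merge the sub-labels stored in header_row into the column names."""
    new_names = []
    for col in df.columns:
        # Strip the ".1", ".2", ... suffix pandas adds to duplicate names.
        base = re.sub(r"\.\d+$", "", str(col))
        sub_label = df.loc[header_row, col]
        if pd.isna(sub_label):
            new_names.append(base)
        else:
            # Collapse embedded newlines and extra spaces in the sub-label.
            clean_label = " ".join(str(sub_label).split())
            new_names.append(f"{base} | {clean_label}")
    tidy = df.copy()
    tidy.columns = new_names
    # Drop the header and spacer rows, which have no category label.
    return tidy[tidy[label_col].notna()].reset_index(drop=True)


tidy_cpi_df = flatten_header(toy_cpi_df)

# The value columns were read as objects, so convert them to numbers.
for col in tidy_cpi_df.columns[1:]:
    tidy_cpi_df[col] = pd.to_numeric(tidy_cpi_df[col], errors="coerce")

print(tidy_cpi_df)
```

On the real table, columns that legitimately hold text (such as the `L-Mar. 2024` date labels) would need to be left out of the numeric conversion.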

## C. Verified Grocery Sales Data

I spent over an hour searching Kaggle and Google, and asking ChatGPT for recommendations, for a good, free grocery / supermarket dataset with detailed information on U.S. markets, but couldn't find one that checked all the boxes. I ended up going with this [Costco dataset](https://www.kaggle.com/datasets/bhavikjikadara/grocery-store-dataset) because it at least provides some actual data from 2024, even though it lacks many of the categories of data I'd like to include (store locations, profit margins, etc.).

In [15]:
# Create a dataframe from the Costco dataset
costco_df = pd.read_csv('GroceryDataset.csv')

In [16]:
# Get an overview of the dataframe with a .head()
costco_df.head()

Unnamed: 0,Sub Category,Price,Discount,Rating,Title,Currency,Feature,Product Description
0,Bakery & Desserts,$56.99,No Discount,Rated 4.3 out of 5 stars based on 265 reviews.,"David’s Cookies Mile High Peanut Butter Cake, ...",$,"""10"""" Peanut Butter Cake\nCertified Kosher OU-...",A cake the dessert epicure will die for!Our To...
1,Bakery & Desserts,$159.99,No Discount,Rated 5 out of 5 stars based on 1 reviews.,"The Cake Bake Shop 8"" Round Carrot Cake (16-22...",$,Spiced Carrot Cake with Cream Cheese Frosting ...,"Due to the perishable nature of this item, ord..."
2,Bakery & Desserts,$44.99,No Discount,Rated 4.1 out of 5 stars based on 441 reviews.,"St Michel Madeleine, Classic French Sponge Cak...",$,100 count\nIndividually wrapped\nMade in and I...,Moist and buttery sponge cakes with the tradit...
3,Bakery & Desserts,$39.99,No Discount,Rated 4.7 out of 5 stars based on 9459 reviews.,"David's Cookies Butter Pecan Meltaways 32 oz, ...",$,Butter Pecan Meltaways\n32 oz 2-Pack\nNo Prese...,These delectable butter pecan meltaways are th...
4,Bakery & Desserts,$59.99,No Discount,Rated 4.5 out of 5 stars based on 758 reviews.,"David’s Cookies Premier Chocolate Cake, 7.2 lb...",$,"""10"" Four Layer Chocolate Cake\nCertified Kosh...",A cake the dessert epicure will die for!To the...


In [33]:
# Am I going to be able to find products that match those listed in the USDA and BLS datasets in this dataset?
# Let's search for a few basic staples to see...

flour = costco_df[costco_df['Title'].str.contains('flour', case=False, na=False)]['Title']

print(flour)

808         Nouvelle Legende Flour Sack Towels, 12 Pack  
845               Kirkland Signature, Almond Flour, 3 lbs
1159    Namaste USDA Organic Gluten Free Perfect Sweet...
1186      Namaste Gluten Free Perfect Flour Blend, 6-pack
1206    Ardent Mills, Harvest Hotel & Restaurant, All-...
1476    Simple Mills Almond Flour Sea Salt Crackers, 1...
Name: Title, dtype: object


### Takeaways: 
* I've spent nearly 2 hours searching Kaggle, asking ChatGPT to help me find a dataset, and Googling to find free datasets with real supermarket data. Turns out, those are hard to find. Even fake datasets that cover the requirements for this project are proving hard to find.
* This dataset is imperfect (it doesn't have any location data, for one), and it's going to be hard to use (the way the products are named in the 'Title' column is going to require a lot of work to line up with the way products are named in the USDA and BLS datasets). But it's what I have for now.
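
The flour search above also matched a pack of flour sack towels, which hints at the kind of work the name matching will take. A minimal sketch of one approach is shown here: match a keyword on word boundaries and screen out known false positives. The helper `find_products`, the exclusion list, and the toy prices are all hypothetical; only the three titles come from the search output. The sketch also converts `Price` strings like `$56.99` to floats.

```python
import re

import pandas as pd

# Toy stand-in for costco_df. Titles are from the search output above;
# the prices are made up for illustration.
toy_costco_df = pd.DataFrame({
    "Title": [
        "Nouvelle Legende Flour Sack Towels, 12 Pack",
        "Kirkland Signature, Almond Flour, 3 lbs",
        "Namaste Gluten Free Perfect Flour Blend, 6-pack",
        None,
    ],
    "Price": ["$19.99", "$12.99", "$1,049.99", "$5.49"],
})


def find_products(df, keyword, exclude=()):
    """Return rows whose Title contains keyword as a whole word,
    skipping any title that contains one of the exclude terms."""
    titles = df["Title"].fillna("")
    pattern = rf"\b{re.escape(keyword)}\b"
    mask = titles.str.contains(pattern, case=False, regex=True)
    for term in exclude:
        mask &= ~titles.str.contains(re.escape(term), case=False, regex=True)
    return df[mask]


# Convert "$1,049.99"-style strings to floats.
toy_costco_df["price_usd"] = (
    toy_costco_df["Price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

flour_df = find_products(toy_costco_df, "flour", exclude=("towel",))
print(flour_df[["Title", "price_usd"]])
```

A hand-maintained exclusion list will not scale to every staple, so this is best treated as a first pass before building a proper mapping between Costco titles and the USDA and BLS category names.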

## D. Demographic / Regional Data

I used the [U.S. Census Dataset : Education, Finance, Industry](https://www.kaggle.com/datasets/mittvin/u-s-census-dataset-education-finance-industry) from Kaggle for this portion of the project. 

In [35]:
# Create dataframes from the U.S. Census dataset
education_df = pd.read_csv('Educationv.csv')
finance_df = pd.read_csv('Finance.csv')
industry_df = pd.read_csv('Industry.csv')

In [37]:
# Get an overview of each dataframe with a .head()
education_df.head()

Unnamed: 0,Year,cd,Bachelors_degree_or_higher,high_school_or_some_degree,Less_than_high_school_graduate
0,2020,0_AK,121098,309698,33572
1,2020,0_DC,277816,177505,34652
2,2020,0_DE,175338,351177,57053
3,2020,0_ND,137958,303148,26631
4,2020,0_PR,121098,309698,33572


In [38]:
finance_df.head()

Unnamed: 0,Year,cd,Less_than_$5000,$5000_to_$9999,$10000_to_$14999,$15000_to_$19999,$20000_to_$24999,$25000_to_$34999,$35000_to_$49999,$50000_to_$74999,$75000_to_$99999,$100000_to_$149999,$150000_or_more
0,2019,0_AK,5746,4600,7294,8276,8110,17476,26315,44593,35414,49254,46182
1,2019,0_DC,14138,10318,12304,9470,7695,16841,21906,34694,30240,46707,80073
2,2019,0_DE,11281,7942,12874,12469,14976,31208,43421,64673,52472,60199,52984
3,2019,0_ND,9110,9039,12923,13021,12824,27603,38832,57179,45844,54543,38625
4,2019,0_PR,181287,141265,140140,122766,94982,145595,141659,119535,49486,33679,21805


In [40]:
industry_df.head()

Unnamed: 0,Year,cd,Total_Agriculture_forestry_fishing_hunting_mining,Total_Construction,Total_Manufacturing,Total_Wholesale_trade,Total_Retail_trade,Total_Transportation_warehousing_utilities,Total_Information,Total_Finance_insurance_realestate_rental_leasing,...,Female_Wholesale_trade,Female_Retail_trade,Female_Transportation_warehousing_utilities,Female_Information,Female_Finance_insurance_realestate_rental_leasing,Female_Professional_scientific_management_administrative_waste_management_services,Female_Educationalservices_healthcare_socialassistance,Female_Arts_entertainment_recreation_accommodation_foodservices,Female_Otherservices_except_Public_administration,Female_Public_administration
0,2019,0_PR,6239,33555,73692,23153,70787,29104,12717,46604,...,6012,29811,6283,4613,27197,22920,120146,16304,9342,35707
1,2019,1_MA,1612,16225,37577,8364,25590,15131,4228,22218,...,2188,10088,3248,1727,12659,8124,51768,6292,4436,5532
2,2019,2_MA,1294,14207,32599,6414,20952,9529,6162,17239,...,1929,7766,1660,2253,8629,10494,44123,5242,4462,4097
3,2019,3_MA,541,18000,48852,7105,22137,11373,7342,17442,...,2058,8157,2782,2565,8799,13879,43803,6227,5280,4607
4,2019,4_MA,590,19548,36590,10322,27693,11294,7526,30364,...,2556,11035,2645,2841,13022,17985,53630,6214,5104,5422


### Takeaways:
* These tables don't have any population data in them, but they should allow for some useful insights (e.g. how prices changed in areas of high or low education, areas of high or low income, and areas dominated by particular types of industry).
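
To compare areas by education level, the `cd` column first needs to be split into something joinable. The sketch below assumes `cd` follows a `<district number>_<state abbreviation>` pattern, which is inferred from values like `0_AK` and `1_MA` and not confirmed by the dataset documentation. It also flags rows whose counts are identical to another row's, since the `0_PR` row in the `education_df.head()` output repeats the `0_AK` figures and may be worth checking.

```python
import pandas as pd

# Toy stand-in for education_df, using rows from the head() output above.
toy_education_df = pd.DataFrame({
    "Year": [2020, 2020, 2020],
    "cd": ["0_AK", "0_DC", "0_PR"],
    "Bachelors_degree_or_higher": [121098, 277816, 121098],
    "high_school_or_some_degree": [309698, 177505, 309698],
    "Less_than_high_school_graduate": [33572, 34652, 33572],
})

count_cols = [
    "Bachelors_degree_or_higher",
    "high_school_or_some_degree",
    "Less_than_high_school_graduate",
]

# Split "0_AK" into a district number and a state abbreviation
# (assumed format, see the note above).
split_cd = toy_education_df["cd"].str.split("_", n=1, expand=True)
toy_education_df["district"] = split_cd[0].astype(int)
toy_education_df["state"] = split_cd[1]

# Share of people counted in the table who hold a bachelor's or higher.
total_counted = toy_education_df[count_cols].sum(axis=1)
toy_education_df["bachelors_share"] = (
    toy_education_df["Bachelors_degree_or_higher"] / total_counted
)

# Flag rows whose counts exactly match another row's counts.
toy_education_df["repeated_counts"] = toy_education_df.duplicated(
    subset=count_cols, keep=False
)

print(toy_education_df[["state", "district", "bachelors_share", "repeated_counts"]])
```

Shares are used here instead of raw counts because the tables carry no population totals, so raw counts are not comparable across areas of different sizes.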

## E. Summary

If I were going to say anything at this point in the project, it would be: Finding datasets that do what you want seems *really* hard. It seems so much easier to be given a dataset to work with than to try to find data that fits certain parameters. I don't know how often in the "real world" we'll be asked to find our own datasets, but I hope for the most part I'm working with data that a company has already internally curated in some way!