**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Ryan Lindberg
- Nathan Mitchell
- Domonick Marshall
- Sean Notolli

# Research Question

"How have Amazon product prices across different market sectors changed between 2021-2024, and how do these trends compare to equivalent categories in the Consumer Price Index (CPI)? Is Amazon pricing in line with overall inflation, or does it diverge from broader economic trends?"



## Background and Prior Work


Amazon is a staple e-commerce service for many people across the globe. Because of its broad catalog and convenience, it is often the first marketplace shoppers consider when they need something. However, one might wonder how costly this convenience is.

From the Amazon Product Pricing Report 2024 on Issuu <a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1), we can see that Amazon prices are influenced by a large number of factors such as supply and demand, seasonal trends, competition, shifting seller fees, and algorithmic pricing. Other elements such as brand power, customer reviews, and holiday shopping behavior also contribute to pricing variability. This report is essential to this project because it provides context for interpreting and analyzing Amazon price trends. It covers pricing trends across different sectors of the market such as beauty, home and kitchen, arts and crafts, pet supplies, and baby products, and demonstrates how each sector faces different trends. For each market sector, the report also presents concise and clear visualizations of pricing data across multiple Amazon products.

The Consumer Price Index (CPI) Summary from the US Bureau of Labor Statistics <a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) offers guidance as to how to structure and analyze data related to our topic. Their report on CPI changes from 2023-2024 exemplifies how to organize large datasets and distill them into clear, actionable insights. The summary’s consistent formatting and emphasis on year-over-year percentage changes allow for a straightforward understanding of trends in consumer prices across different sectors. The structured approach will be instrumental in our own analysis of pricing data, helping us standardize our methodology and avoid potential misinterpretations. By adopting their organizational strategy, we can enhance the accuracy and credibility of our findings.

Recent research has also examined the practical implications of dynamic pricing algorithms for market behavior. Elmaghraby and Keskinocak <a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) review dynamic pricing models and explain that factors like consumer demand and supply constraints drive pricing decisions in various industries. The paper discusses how algorithmic pricing is not only a tool for optimizing revenue but can also contribute to pricing volatility and competitive differences in the digital market. Combining their findings with the data from the Amazon Product Pricing Report and the CPI helps us understand how algorithmic strategies and external market conditions can interact with price trends, and underscores how important computational methods are in predicting market behavior in a retail environment.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Jungle Scout (2024, Jan). Amazon Product Pricing Report 2024. Issuu. https://issuu.com/junglescoutcobalt/docs/jungle-scout-amazon-product-pricing-report-2024?utm_source=chatgpt.html 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Bureau of Labor Statistics (2024, Feb). Consumer Price Index Summary. U.S. Department of Labor. https://www.bls.gov/news.release/pdf/cpi.html
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Elmaghraby, W., & Keskinocak, P. (2003). Dynamic Pricing in the Presence of Inventory Considerations: Research Overview, Current Practices, and Future Directions. Management Science, 49(10), https://www.researchgate.net/publication/220534328_Dynamic_Pricing_in_the_Presence_of_Inventory_Considerations_Research_Overview_Current_Practices_and_Future_Directions.html



# Hypothesis



Amazon prices across different market sectors have generally increased from 2021-2024, but at rates that are inconsistent with the inflation trends in the Consumer Price Index. We predict that discretionary goods like electronics and other luxury items have had smaller price increases than essential goods like groceries and other necessities, whose price increases may even exceed the CPI inflation rate. We think this is because Amazon uses aggressive price matching and pricing algorithms to remain competitive in non-essential categories, whereas supply chain constraints and labor costs disproportionately impacted essential goods.

# Data

## Data overview

- Dataset #1
  - Dataset Name: Keepa
  - Link to the dataset: https://keepa.com/#!data
  - Number of observations: 998
  - Number of variables: 63
- Dataset #2
  - Dataset Name: Consumer Price Index (CPI)
  - Link to the dataset: https://www.bls.gov/cpi/tables/supplemental-files/ 
  - Number of observations:
  - Number of variables:

In [None]:
# Install the keepa library (run this cell if not already installed)
#!pip install keepa

import requests
import pandas as pd
import numpy as np
import json
from pathlib import Path
import keepa
import datetime
import matplotlib.pyplot as plt
import math
import os

ModuleNotFoundError: No module named 'requests'

In [None]:
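# Read the Keepa API key from an environment variable instead of hard-coding it, so the key never ends up in the repo.
# The variable name KEEPA_ACCESS_KEY is our own choice; set it in your shell before starting Jupyter.
ACCESS_KEY = os.environ["KEEPA_ACCESS_KEY"]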
api = keepa.Keepa(ACCESS_KEY, timeout=30)

To query the price of a given Amazon product, there are many different types of 'price' variable we can access. Two of them are the new price and the listing price. Here are the differences:

• NEW PRICE:
This is the current selling price for an item offered on Amazon in brand new condition. It reflects the actual market price that customers pay—typically the lowest available offer among Amazon and third‑party sellers. Because it’s influenced by promotions, competition, and real‑time market conditions, the NEW price can fluctuate over time.

• LISTING PRICE:
This is usually the manufacturer’s suggested retail price (MSRP) or the original price displayed on the product’s listing. It tends to be more stable and is often used as a reference to show discounts or price reductions. Even when the new price drops (for deals or competitive reasons), the listing price may remain unchanged.

The new price shows you what you’d pay right now, while the listing price is a reference value set by the manufacturer. This difference helps sellers and buyers gauge the discount depth and market dynamics.

Since we want to accurately capture the price a consumer is paying at a given time, we will use the NEW PRICE.

Other price variables we can access are NEW_FBM (Fulfilled by Merchant), NEW_FBA (Fulfilled by Amazon), USED (used items), REFURBISHED (refurbished items), and WAREHOUSE (prices from Amazon Warehouse Deals, usually returned items). These price variables offer spotty data coverage and don't accurately represent the typical consumer experience, so we will not be using them.

## Amazon Price History Dataset from Keepa API

First, we want to gather data for a variety of Amazon products from different sectors of the market. Luckily, each product on Amazon has a unique ASIN (Amazon Standard Identification Number) which we can use to identify it. With help from the Keepa Data Product Finder tool, we can collect these ASINs along with other product information. For example, filtering the Product Finder by 'product category' = baby products and rank = 1-1000 finds the top 1000 items categorized under 'baby products' on Amazon. Additionally, we can collect the name of the product and its subcategories. We made sure to exclude 'variations' of products, because we don't want to sample the same product 5 times just because there are 5 versions of it with slight variations. We also refined our query to only physical products, excluding digital products and ebooks. CPI data is only collected on physical goods, so this allows us to have a fairer comparison. Our final filter is 'tracking time', which allows us to select only products that have been tracked by Keepa from 2021-2024. The cells below load the grocery and foods export, which was produced the same way.

In [None]:
# read in the top 1000 grocery and foods products csv:
data_path = Path('data/keepa_data') / 'grocery_and_foods.csv'
products = pd.read_csv(data_path)

NameError: name 'Path' is not defined

In [None]:
products.head(10)

Unnamed: 0,Title,Categories: Sub,ASIN
0,"CELSIUS Sparkling Strawberry Guava, Functional...","Energy Drinks, Gluten-Free Groceries, Evergree...",B08PGXDHTC
1,"Nespresso Capsules Vertuo, Voltesso, Mild Roas...","Single-Serve Capsules & Pods, Kitchen & Dining...",B0768N9N6P
2,"Premier Protein Shake, Cookies & Cream, 30g Pr...","Protein Drinks, Beverages, Protein drinks",B07MFYYZ5B
3,"Sparkling Ice, Peach Nectarine Sparkling Water...","Carbonated Water, Subscribe & Save Prime Promo...",B009S2XFVW
4,"Sparkling Ice, Cranberry Frost Sparkling Water...","Soft Drinks, Sparkling water, Sparkling Water,...",B07KY58NFX
5,"Monster Energy Zero Ultra, Sugar Free Energy D...","Energy Drinks, Subscribe & Save Prime Promo, B...",B00ADYXY7E
6,"Core Power Protein Shake, Chocolate, 26g Bottl...","Protein Drinks, Balanced Nutrition, Prime Memb...",B07LD2NV9X
7,"Starbucks K-Cup Coffee Pods, Medium Roast Coff...","Single-Serve Capsules & Pods, Packaged Coffee,...",B00U3ODTTM
8,"Nespresso Capsules Vertuo, Double Espresso Scu...","Single-Serve Capsules & Pods, Kitchen & Dining...",B07M8YV12G
9,"Nespresso Capsules VertuoLine, Hazelino Muffin...","Single-Serve Capsules & Pods, Packaged Coffee,...",B0851ZVCGL


Now that we have the ASINs, product names, and subcategories, we've noticed that many products have multiple subcategories, most of which are irrelevant for the purposes of our analysis, such as "Christmas Store", "TEST ABCDEFGPD", and even random characters such as "d963aedb-8e7e-493c...". We want to clean the subcategory column so that each product has a single subcategory. Luckily for us, it seems that the first subcategory is the most descriptive one for a given product, so we can remove the others.

In [None]:
products['Categories: Sub'].value_counts().tail(15) # bottom 15 subcategories

Categories: Sub
Protein Drinks, Subscribe & Save Prime Promo, Spring Savings, New Year New You 2016, Organic Groceries, Grocery Back to School 17, Safe & Healthy Customer Favorites, Everyday grocery items, New Year, New You in Grocery, Grocery Subscribe and Save, Save up to 35% on Orgain Favorites, Grocery Subscribe & Save Event, Subscribe and Save for Back to School, Best Selling USDA Organic Products, Featured SNAP-eligible groceries, SNAP Beverages, d963aedb-8e7e-493c-80e7-b70e3c253824_9301, BTS B3G1 Grocery, Plant-Based Lifestyles, New Year New You - Healthy Eating, FREE Sample: Drazil Kids Tea, Punch Passion, Beverages, Back to School Everyday Essentials, Protein drinks    1
Crackers, Snacks & produce IA, Crackers, Game day snacks & dips, Snacks, Snack favorites, Snacks Under 5$, LMP_California_Snacks, 15% off coupon terms and conditions, Gluten Free, Snacks, Keto Friendly Fats, Gluten-Free Groceries, Featured SNAP-eligible groceries, SNAP Crackers, SNAP Snacks, d963aedb-8e7e-493c

In [None]:
# Rename columns and keep only the first (most descriptive) subcategory for each row
def clean_frame(df):
    df = df.rename(columns = {'Title' : 'product_title', 'Categories: Sub' : 'subcategory', 'ASIN' : 'asin'})
    # Vectorized string ops: take the text before the first comma, trim whitespace, lowercase
    df['subcategory'] = df['subcategory'].str.split(',').str[0].str.strip().str.lower()
    return df

products = clean_frame(products)

In [None]:
products['subcategory'].value_counts().tail(15)

subcategory
cooking & baking      1
frozen                1
herbs                 1
vegetables            1
potato side dishes    1
baking powder         1
canola                1
bitters               1
frosting              1
chile paste           1
mixed                 1
brown sugar           1
ground pepper         1
fruit                 1
crackers              1
Name: count, dtype: int64

In [None]:
products['subcategory'].nunique()

229

Now our least common subcategories are actually meaningful, and we are left with 229 subcategories for the grocery and foods category.


Moving forward, we can use the ASINs column to query the Keepa API for historical price data of each item.
We define multiple functions below to achieve this:

**1. keepa_time_to_datetime(kt):**

This function takes in a Keepa time integer and converts it to a standard datetime. Each price value has an attached time value, so we know when that price was in effect.

**2. def generate_monthly_headers(days):**

This function allows us to standardize which months we actually want to collect data for. When we query Keepa, it may return data from as far back as 2011 if it is available. We don't want that data, so we use this function to filter the data down to the months we need. It also ensures that every time we collect data within a given range, the column headers are the same.

**3. def get_monthly_avg_prices(asins, days):**

This function gathers historical price data for a list of Amazon products using their ASINs. It returns a dataframe of monthly average prices over the previous `days` days for each product. We want data since 2020, so we will be using (365 * 5) for our days variable. It also forward fills missing months. Keepa only records a price when the price changes, so a month with no price changes has no data; forward filling carries the most recent known monthly average into those months.

With our basic membership access to the Keepa API, we are limited with how much data we can request, so we will have to produce our price data in batches, and incrementally build up a large csv of price data for different product categories.

**4. def batch(iterable, n=20):**

This takes in a list of ASINs, and returns batches of them of size n.


**5. def query_keepa_in_batches(products, category, max_batches=10, batch_size=20, days=365 * 4, start_index=0, stop_index=None):**

To put it simply, this function just gets monthly avg price data for a batch of ASINs, then merges it to a master csv.

With the function query_keepa_in_batches, we can incrementally build a csv of historical price data for a given category by repeatedly requesting data from the Keepa API. The inputs to this function are: a dataframe such as grocery_and_foods, the category we are working with, i.e. the string 'grocery_and_foods', the number of batches we want, batch size, how many days of historical price data we want, and the start and stop indices of the ASINs we want to query from the dataframe. These start and stop indices allow us to pick up where we left off. So if for some reason the query stops, we can just figure out the last batch we completed, and keep going from there! If the CSV already exists, we just append to it. If it does not exist, we create a new one. This ensures we never overwrite data, and we can always keep track of where we are in the request process. This function takes many hours to run since we are limited to 1 token per minute with our API membership, and each ASIN costs 1 token. So it will have to run overnight to collect our data.
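
As a quick sanity check on the time conversion described in item 1, the sketch below repeats the same arithmetic on its own: Keepa timestamps count minutes, and adding 21564000 shifts them to minutes since the Unix epoch. The name `keepa_minutes_to_utc` is ours for illustration and is not part of the keepa library.

```python
import datetime

# Minutes between 1970-01-01 and 2011-01-01 (UTC), the offset used in this notebook.
KEEPA_EPOCH_OFFSET_MINUTES = 21564000

def keepa_minutes_to_utc(kt):
    """Convert Keepa minutes to a timezone-aware UTC datetime (illustrative helper)."""
    unix_seconds = (kt + KEEPA_EPOCH_OFFSET_MINUTES) * 60
    return datetime.datetime.fromtimestamp(unix_seconds, datetime.timezone.utc)

print(keepa_minutes_to_utc(0))        # 2011-01-01 00:00:00+00:00
print(keepa_minutes_to_utc(60 * 24))  # 2011-01-02 00:00:00+00:00
```

A Keepa time of 0 landing exactly on 2011-01-01 UTC confirms that the offset constant is consistent with the comment in `keepa_time_to_datetime`.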

In [None]:
def keepa_time_to_datetime(kt):
    # Convert Keepa time (minutes since 2011-01-01) to a Python datetime (UTC)
    if isinstance(kt, datetime.datetime):
        return kt
    return datetime.datetime.fromtimestamp((kt + 21564000) * 60, datetime.timezone.utc)

def generate_monthly_headers(days):
    """
    Generate a list of month headers (strings) in the format 'YYYY-MM',
    starting at the current month and walking backwards while the first day
    of the month is on or after (now - days). The month that contains
    (now - days) is therefore usually excluded.
    The headers are returned in ascending (chronological) order.
    This function standardizes which months we collect for each batch and ensures the columns are aligned.
    """
    now = datetime.datetime.now(datetime.timezone.utc)
    start_date = now - datetime.timedelta(days=days)
    
    headers = []
    current_year = now.year
    current_month = now.month

    while True:
        header = f"{current_year:04d}-{current_month:02d}"
        headers.append(header)
        # Move to the previous month
        if current_month == 1:
            current_month = 12
            current_year -= 1
        else:
            current_month -= 1
        
        # Create a timezone-aware date for the first day of the new month.
        month_start = datetime.datetime(current_year, current_month, 1, tzinfo=datetime.timezone.utc)
        # Stop if this month is before the start_date.
        if month_start < start_date:
            break
    return sorted(headers)


def get_monthly_avg_prices(asins, days=1460):
    """
    asins: list of ASIN strings
    days: number of days of history (default 1460 ~ 4 years)
    
    Returns a DataFrame:
        - Rows = ASINs
        - Columns = monthly time periods (e.g. '2025-02', '2025-01', etc.)
        - Values = average 'NEW' price for that month
    """
    products = api.query(asins, days=days)
    dfs = []
    for product in products:
        asin = product['asin']
        price_history = product['data'].get('NEW', [])
        time_history  = product['data'].get('NEW_time', [])
        
        dates = [keepa_time_to_datetime(t) for t in time_history]
        prices = [p for p in price_history]
        
        df = pd.DataFrame({'date': dates, asin: prices})
        df.set_index('date', inplace=True)
        
        # Resample to monthly average using month-end frequency
        monthly_avg = df.resample('ME').mean()
        dfs.append(monthly_avg)
    
    if not dfs:
        return pd.DataFrame()
    
    # Combine and transpose so that rows = ASIN and columns = dates
    combined = pd.concat(dfs, axis=1).T
    # Convert datetime columns to string format 'YYYY-MM'
    combined.columns = [col.strftime('%Y-%m') for col in combined.columns]
    
    # Generate the complete set of monthly headers (ascending, i.e. chronological order)
    headers = generate_monthly_headers(days)
    
    # Reindex with the generated headers so every batch has identical columns
    combined = combined.reindex(columns=headers, fill_value=np.nan)
    
    # Forward fill missing values along the row (in chronological order)
    combined = combined.ffill(axis=1)
    
    return combined


def batch(iterable, n=20):
    """Yield successive n-sized chunks from iterable."""
    for i in range(0, len(iterable), n):
        yield iterable[i:i + n]

In [None]:
# EXAMPLE USAGE of get_monthly_avg_prices
asins = ["B009S2XFVW"]  
monthly_prices = get_monthly_avg_prices(asins, days = 365 * 4)
monthly_prices

  0%|          | 0/1 [00:00<?, ?it/s]

100%|██████████| 1/1 [00:02<00:00,  2.42s/it]


Unnamed: 0,2021-03,2021-04,2021-05,2021-06,2021-07,2021-08,2021-09,2021-10,2021-11,2021-12,...,2024-05,2024-06,2024-07,2024-08,2024-09,2024-10,2024-11,2024-12,2025-01,2025-02
B009S2XFVW,,13.32358,14.412791,7.844828,7.844828,10.955,8.99,8.520645,8.169459,8.169459,...,11.99,11.99,11.7,14.995,13.163333,15.0,14.976,14.976,12.0,12.0
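
The repeated values in the output above (for example 2021-06 and 2021-07) are consistent with forward filling. The sketch below shows the same resample-then-fill idea on a tiny, made-up series so the effect is easy to see. The prices are hypothetical, and we use the month-start alias `"MS"` only because older pandas versions accept it too; the notebook itself uses `"ME"`, and the `YYYY-MM` labels come out the same.

```python
import pandas as pd

# Hypothetical price-change events for one product (not real Keepa data).
events = pd.DataFrame(
    {"price": [10.0, 12.0, 9.0]},
    index=pd.to_datetime(["2024-01-05", "2024-01-20", "2024-04-10"]),
)

# Monthly mean; months with no price events come back as NaN.
monthly = events["price"].resample("MS").mean()

# Carry the last known monthly average forward into the empty months.
filled = monthly.ffill()
filled.index = filled.index.strftime("%Y-%m")
print(filled)  # 2024-02 and 2024-03 are filled with 11.0
```

Note that this carries the previous month's average (11.0) forward rather than the last observed price in that month (12.0). That matches the order of operations in `get_monthly_avg_prices`, and it may be worth keeping in mind when interpreting flat stretches in the data.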


In [None]:
def query_keepa_in_batches(products, category, max_batches=10, batch_size=20,
                           days=365 * 4, start_index=0, stop_index=None):
    """
    Query Keepa for monthly average prices in batches and incrementally save 
    the *merged* results (column-aligned) to a CSV file.
    """
    # Slice the ASIN list based on start_index and stop_index.
    asins = list(products['asin'])[start_index:stop_index]
    csv_file = f'data/{category}_monthly_prices.csv'
    for i, asin_batch in enumerate(batch(asins, batch_size)):
        if i >= max_batches:
            break
        df_batch = get_monthly_avg_prices(asin_batch, days=days)
        if df_batch.empty:
            print(f"Batch {i+1} returned no data; skipping.")
            continue
        # If the CSV file exists, read it, merge columns, then write back
        if os.path.exists(csv_file):
            existing_df = pd.read_csv(csv_file, index_col=0)
            
            # Merge on row index (ASIN) and combine columns (outer join).
            # combine_first() will fill missing entries in existing_df with df_batch values.
            merged_df = existing_df.combine_first(df_batch)
            
            # Ensure columns are in the correct order (ascending monthly headers).
            # This step uses the same monthly headers function to reindex.
            headers = generate_monthly_headers(days)
            merged_df = merged_df.reindex(columns=headers, fill_value=np.nan)
            
            merged_df.to_csv(csv_file, index=True)
        else:
            # If no CSV yet, just write df_batch as the first chunk
            df_batch.to_csv(csv_file, index=True)
        
        print(f"Batch {i+1} processed and merged.")





Now for the boring part, waiting for the API requests as we slowly build up our csv. This code will take several hours to run as we have to accumulate tokens throughout the API request. It will probably run in the background and overnight.

Example:
query_keepa_in_batches(products, 'baby_products', batch_size = 20, max_batches=10, days = 365*2, start_index = 0, stop_index = None) 
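
To get a feel for how long a full category takes, here is a rough back-of-the-envelope estimate. It assumes the limits stated above (1 token per ASIN, 1 token refilled per minute); the function `estimate_hours` and its `starting_tokens` parameter are our own illustration, not part of the Keepa API.

```python
def estimate_hours(n_asins, tokens_per_minute=1.0, starting_tokens=0):
    """Rough wall-clock estimate in hours, assuming each ASIN costs 1 token."""
    tokens_needed = max(n_asins - starting_tokens, 0)
    return tokens_needed / tokens_per_minute / 60

print(round(estimate_hours(1000), 1))                     # 16.7
print(round(estimate_hours(200, starting_tokens=80), 1))  # 2.0
```

Under these assumptions a 1000-product category needs roughly 17 hours, which is why the queries are split into resumable batches.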

This entire process was just for the baby_products category. We will have to do this same process for every other product category we use. Luckily after defining all of the functions, it shouldn't be that hard. We only need to do a couple simple steps:
1. Load in a category of Amazon products as a dataframe
2. Clean the dataframe
3. Query the API to build historical price data csv

Here is an example:

In [None]:
# Actual use: 
category = 'grocery_and_foods'
data_path = Path('data/keepa_data') / f'{category}.csv'
grocery_and_foods = pd.read_csv(data_path)
grocery_and_foods = clean_frame(grocery_and_foods)

In [None]:
# Actual use:
# This code may run for multiple hours. 1 minute for each product queried. 
query_keepa_in_batches(grocery_and_foods, category, max_batches=5, batch_size=20, days = (365 * 5) + 60, start_index = 900, stop_index = None)

100%|██████████| 20/20 [00:17<00:00,  1.13it/s]


Batch 1 processed and merged.


100%|██████████| 20/20 [00:11<00:00,  1.76it/s]


Batch 2 processed and merged.


100%|██████████| 20/20 [00:11<00:00,  1.76it/s]


Batch 3 processed and merged.


100%|██████████| 20/20 [00:10<00:00,  1.90it/s]


Batch 4 processed and merged.


  0%|          | 0/20 [00:00<?, ?it/s]Waiting 1175 seconds for additional tokens


Response from server: NOT_ENOUGH_TOKEN


Waiting 2 seconds for additional tokens
100%|██████████| 20/20 [19:49<00:00, 59.48s/it]


Batch 5 processed and merged.


Note:
The code in the cells above was mostly generated by ChatGPT from the prompt "How can I load the Keepa API into an ipynb, and query price data every 30 days for the past 4 years?"

Now that we have both the products and their prices, let's combine the two dataframes and do a final tidying of the data.

In [None]:
category = 'grocery_and_foods'
data_path2 = Path('data/') / f'{category}_monthly_prices.csv'
price_data = pd.read_csv(data_path2)

price_data

Unnamed: 0.1,Unnamed: 0,2020-01,2020-02,2020-03,2020-04,2020-05,2020-06,2020-07,2020-08,2020-09,...,2024-05,2024-06,2024-07,2024-08,2024-09,2024-10,2024-11,2024-12,2025-01,2025-02
0,B0000E5JIU,6.461333,6.522500,5.575000,5.575000,7.614286,7.365000,6.297500,5.572500,5.865833,...,9.386000,9.386000,9.485000,9.522727,9.011000,9.777143,9.732500,9.732500,9.817500,10.024286
1,B0001UXQ9Q,,,22.232500,23.864118,11.935000,11.906154,11.318182,10.860000,10.650000,...,16.285000,19.070714,11.625000,11.625000,11.625000,11.625000,12.800000,12.800000,13.796667,18.970000
2,B0002LD9IW,7.810000,5.578182,4.962500,4.032000,4.756667,4.756667,4.756667,4.756667,3.818750,...,1.353333,1.353333,1.353333,4.035000,4.035000,4.050000,4.050000,4.032000,4.032000,4.032000
3,B00032BPCM,3.847778,3.553333,7.652326,4.737206,1.565556,1.565556,1.565556,1.637500,3.287568,...,1.588947,2.105714,5.160645,4.680000,5.980000,4.951667,5.269259,5.179375,5.166667,5.057308
4,B000AXW9XI,10.208286,11.024000,11.477500,14.070645,12.530238,11.881081,11.717692,11.456486,10.265370,...,10.024706,10.461000,11.595000,17.990000,17.980556,17.970000,18.993333,19.990000,16.180000,14.240000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,B08QR1MWHF,,,,,,,,,,...,20.690000,22.180000,22.886000,19.423333,19.423333,18.924706,17.882000,18.787143,18.394444,19.733571
994,B08QTTJ1NM,,,,,,,,,,...,24.990000,24.990000,24.990000,24.990000,8.990000,14.990000,14.990000,14.990000,14.990000,14.990000
995,B08R4K3LR6,,,,,,,,,,...,41.980000,41.980000,46.381053,45.826667,43.594762,45.162727,45.162727,45.162727,44.874444,46.166667
996,B08R6DZXYV,,,,,,,,,,...,13.270000,13.270000,13.864000,13.864000,13.036667,13.270000,14.405000,13.130000,12.920000,12.570000


It looks like during the gathering process we accidentally missed 2 rows, but that's okay. Also notice that rows near the end of the data tend to have less data in the earlier months. We suspect this is due to how Keepa sorts their ASINs. This shouldn't be too big of an issue, though, since we collected an extra year's worth of data (2020) beyond our 2021-2024 window.
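
One way the combining step could look is sketched below on toy data. The frames `products_demo` and `prices_demo` and the column name `avg_new_price` are hypothetical stand-ins. In the real `price_data`, the ASIN column appears to be read in as `Unnamed: 0`, so it would first need to be renamed to `asin`.

```python
import pandas as pd

# Toy stand-ins for the cleaned product table and the monthly price table.
products_demo = pd.DataFrame({
    "asin": ["A1", "A2", "A3"],
    "product_title": ["Item one", "Item two", "Item three"],
    "subcategory": ["energy drinks", "crackers", "herbs"],
})
prices_demo = pd.DataFrame({
    "asin": ["A1", "A2"],
    "2024-01": [10.0, 5.0],
    "2024-02": [11.0, None],
})

# Inner join keeps only products that have price history.
wide = products_demo.merge(prices_demo, on="asin", how="inner")

# Reshape to one row per (product, month) and drop months without a price.
long = wide.melt(
    id_vars=["asin", "product_title", "subcategory"],
    var_name="month",
    value_name="avg_new_price",
).dropna(subset=["avg_new_price"])

print(long)
```

The long format has one observation per row, which tends to be easier to group by subcategory and month when comparing against CPI series later.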

## Dataset #2 Consumer Price Index (CPI) for All Urban Consumers

### The following is a list of all separate data sets that we retrieved from the CPI database.

- [All items in U.S. city average, all urban consumers, not seasonally adjusted](http://data.bls.gov/dataViewer/view/timeseries/CUUR0000SA0)
- [All items in U.S. city average, all urban consumers, seasonally adjusted](http://data.bls.gov/dataViewer/view/timeseries/CUSR0000SA0)
- [Food and beverages in U.S. city average, all urban consumers, not seasonally adjusted](http://data.bls.gov/dataViewer/view/timeseries/CUUR0000SAF)
- [Food in U.S. city average, all urban consumers, not seasonally adjusted](http://data.bls.gov/dataViewer/view/timeseries/CUUR0000SAF1)
- [Prescription drugs in U.S. city average, all urban consumers, not seasonally adjusted](http://data.bls.gov/dataViewer/view/timeseries/CUUR0000SEMF01)
- [Commodities in U.S. city average, all urban consumers, seasonally adjusted](http://data.bls.gov/dataViewer/view/timeseries/CUSR0000SAC)
- [Durables in U.S. city average, all urban consumers, seasonally adjusted](http://data.bls.gov/dataViewer/view/timeseries/CUSR0000SAD)
- [Nondurables in U.S. city average, all urban consumers, seasonally adjusted](http://data.bls.gov/dataViewer/view/timeseries/CUSR0000SAN)
- [Recreation in U.S. city average, all urban consumers, seasonally adjusted](http://data.bls.gov/dataViewer/view/timeseries/CUSR0000SAR)
- [Appliances in U.S. city average, all urban consumers, seasonally adjusted](http://data.bls.gov/dataViewer/view/timeseries/CUSR0000SEHK)
- [Toys in U.S. city average, all urban consumers, seasonally adjusted](http://data.bls.gov/dataViewer/view/timeseries/CUSR0000SERE01)
- [Apparel in U.S. city average, all urban consumers, seasonally adjusted](http://data.bls.gov/dataViewer/view/timeseries/CUSR0000SAA)

Notes:
- Seasonally adjusted indicates that the data has been statistically modified (by the Bureau of Labor Statistics) to remove the effects of predictable seasonal fluctuations, like holiday shopping sprees or summer vacation trends, allowing for a clearer analysis of underlying trends without the influence of recurring seasonal patterns.
- Durables are products that are designed to last a long time and be purchased rather infrequently. Ex: cars, refrigerators, furniture, washing machines and dryers, musical instruments, etc.
- Nondurables are goods that are generally consumed quickly and purchased more frequently. They lose value after one use and/or after a short period of time. Ex: food, drinks, hygiene products, paper products, cosmetics, clothing items.
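
Because CPI values are index numbers while Keepa values are dollar prices, the two are easier to compare after rescaling both to a common base month or converting both to year-over-year percent changes. The sketch below illustrates both on made-up numbers. The series values and the helper names `rebase` and `year_over_year` are assumptions for illustration, not our final method.

```python
import pandas as pd

# Made-up monthly values for illustration only (not real CPI or Keepa numbers).
months = pd.period_range("2021-01", periods=14, freq="M")
cpi = pd.Series([260.0 + i for i in range(14)], index=months)
amazon_avg_price = pd.Series([20.0 + 0.25 * i for i in range(14)], index=months)

def rebase(series, base_period):
    """Rescale a series so that its value at base_period equals 100."""
    return series / series.loc[base_period] * 100

def year_over_year(series):
    """Percent change versus the same month one year earlier."""
    return series.pct_change(periods=12) * 100

base = pd.Period("2021-01", freq="M")
comparison = pd.DataFrame({
    "cpi_index": rebase(cpi, base),
    "amazon_index": rebase(amazon_avg_price, base),
})
print(comparison.tail(3))
print(year_over_year(cpi).dropna())
```

Since Keepa prices are not seasonally adjusted, the not seasonally adjusted CPI series may be the more like-for-like comparison, although year-over-year changes should reduce the influence of seasonality either way.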

In [None]:
#pip install pandas

def load_cpi_series(filename):
    """Load one CPI csv export and return only the Year, Month, and Value columns."""
    df = pd.read_csv(filename)
    # The month is the second whitespace-separated token of the 'Label' column
    df['Month'] = df['Label'].apply(lambda x: x.split()[1])
    # Selecting these three columns drops 'Series ID', 'Period', and 'Label'
    return df[['Year', 'Month', 'Value']]

# All items in U.S. city average, all urban consumers, not seasonally adjusted
all_items_NSA = load_cpi_series('all_items_NSA.csv')

# All items in U.S. city average, all urban consumers, seasonally adjusted
all_items_SA = load_cpi_series('all_items_SA.csv')

# Food and beverages in U.S. city average, all urban consumers, not seasonally adjusted
food_and_bev = load_cpi_series('food_and_bev.csv')

# Food in U.S. city average, all urban consumers, not seasonally adjusted
food = load_cpi_series('just_food.csv')

# Prescription drugs in U.S. city average, all urban consumers, not seasonally adjusted
presc_drugs = load_cpi_series('presc_drugs.csv')

# Commodities in U.S. city average, all urban consumers, seasonally adjusted
commodities = load_cpi_series('commodities.csv')

# Durables in U.S. city average, all urban consumers, seasonally adjusted
durables = load_cpi_series('durables.csv')

# Nondurables in U.S. city average, all urban consumers, seasonally adjusted
nondurables = load_cpi_series('nondurables.csv')

# Recreation in U.S. city average, all urban consumers, seasonally adjusted
rec = load_cpi_series('recreation.csv')

# Appliances in U.S. city average, all urban consumers, seasonally adjusted
appliances = load_cpi_series('appliances.csv')

# Toys in U.S. city average, all urban consumers, seasonally adjusted
toys = load_cpi_series('toys.csv')

# Apparel in U.S. city average, all urban consumers, seasonally adjusted
apparel = load_cpi_series('apparel.csv')

# Combine dataframes into one wide table, one column per CPI series.
# Columns are aligned by row index, so every series is expected to cover
# the same months in the same order.
all_cpi = pd.DataFrame({
    'Year': all_items_NSA['Year'],
    'Month': all_items_NSA['Month'],
    'all_items_NSA': all_items_NSA['Value'],
    'all_items_SA': all_items_SA['Value'],
    'food_and_bev': food_and_bev['Value'],
    'food': food['Value'],
    'presc_drugs': presc_drugs['Value'],
    'commodities': commodities['Value'],
    'durables': durables['Value'],
    'nondurables': nondurables['Value'],
    'recreation': rec['Value'],
    'appliances': appliances['Value'],
    'toys': toys['Value'],
    'apparel': apparel['Value'],
})

all_cpi.head()

all_cpi.shape
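
The combine step above lines the series up by row position, which works only if every CSV covers the same months in the same order. As a minimal sketch of an alternative, the series could be merged on the `Year` and `Month` keys so that a gap in one series shows up as `NaN` instead of shifting values onto the wrong row. The helper name `combine_on_keys` and the toy values below are hypothetical and are not real CPI data.

```python
from functools import reduce

import pandas as pd


def combine_on_keys(series_by_name):
    """Merge {name: DataFrame[Year, Month, Value]} into one wide table keyed on Year and Month."""
    renamed = [
        df.rename(columns={'Value': name})[['Year', 'Month', name]]
        for name, df in series_by_name.items()
    ]
    # An outer merge keeps months that are missing from some series and
    # fills the missing values with NaN.
    return reduce(
        lambda left, right: left.merge(right, on=['Year', 'Month'], how='outer'),
        renamed,
    )


# Hypothetical toy data (not real CPI values): 'toys' has no February row.
food = pd.DataFrame({'Year': [2021, 2021, 2021],
                     'Month': ['Jan', 'Feb', 'Mar'],
                     'Value': [100.0, 101.0, 102.0]})
toys = pd.DataFrame({'Year': [2021, 2021],
                     'Month': ['Jan', 'Mar'],
                     'Value': [50.0, 52.0]})

combined = combine_on_keys({'food': food, 'toys': toys})
print(combined)
```

With index-based assignment, the March value for `toys` would have landed on the February row; with the keyed merge it stays on March and February is left empty. Checking `combined.isna().sum()` afterwards is one way to see which series have gaps.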








# Ethics & Privacy

- Thoughtful discussion of ethical concerns included
- Ethical concerns consider the whole data science process (question asked, data collected, data being used, the bias in data, analysis, post-analysis, etc.)
- How your group handled bias/ethical concerns clearly described

Acknowledge and address any ethics & privacy related issues of your question(s), proposed dataset(s), and/or analyses. Use the information provided in lecture to guide your group discussion and thinking. If you need further guidance, check out [Deon's Ethics Checklist](http://deon.drivendata.org/#data-science-ethics-checklist). In particular:

- Are there any biases/privacy/terms of use issues with the data you proposed?
- Are there potential biases in your dataset(s), in terms of whom it includes and how it was collected, that may be problematic in terms of it allowing for equitable analysis? (For example, does your data exclude particular populations, or is it likely to reflect particular human biases in a way that could be a problem?)
- How will you set out to detect these specific biases before, during, and after/when communicating your analysis?
- Are there any other issues related to your topic area, data, and/or analyses that are potentially problematic in terms of data privacy and equitable impact?
- How will you handle issues you identified?

# Team Expectations 


Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* *Team Expectation 1*
* *Team Expectation 2*
* *Team Expectation 3*
* ...

# Project Timeline Proposal

Specify your team's project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor) | Discuss/edit Analysis; Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |