**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update the relevant sections where you lost those points, and respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Ant Man
- Hulk
- Iron Man
- Thor
- Wasp

# Research Question

-  Include a specific, clear data science question.
-  Make sure what you're measuring (variables) to answer the question is clear

What is your research question? Include the specific question you're setting out to answer. This question should be specific, answerable with data, and clear. A general question with specific subquestions is permitted. (1-2 sentences)



## Background and Prior Work


- Include a general introduction to your topic                       <---- I don't think we did this well enough on the proposal.
- Include explanation of what work has been done previously
- Include citations or links to previous work

This section will present the background and context of your topic and question in a few paragraphs. Include a general introduction to your topic and then describe what information you currently know about the topic after doing your initial research. Include references to other projects who have asked similar questions or approached similar problems. Explain what others have learned in their projects.

Find some relevant prior work, and reference those sources, summarizing what each did and what they learned. Even if you think you have a totally novel question, find the most similar prior work that you can and discuss how it relates to your project.

References can be research publications, but they need not be. Blogs, GitHub repositories, company websites, etc., are all viable references if they are relevant to your project. It must be clear which information comes from which references. (2-3 paragraphs, including at least 2 references)

 **Use inline citation through HTML footnotes to specify which references support which statements** 

From the Amazon Product Pricing Report 2024 on Issuu<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1), we can see that Amazon prices are influenced by a wide range of factors such as supply and demand, seasonal trends, competition, shifting seller fees, algorithmic pricing, and Amazon's buy box system. Other elements such as brand power, customer reviews, and holiday shopping behavior also contribute to pricing variability. This report is essential to our project because it provides the context we need to interpret and analyze Amazon price trends. It covers pricing trends across different sectors of the market such as beauty, home and kitchen, arts and crafts, pet supply, and baby products, and demonstrates how each sector faces different trends. For each market sector, the report also presents concise and clear visualizations of pricing data across multiple Amazon products.

The Consumer Price Index (CPI) Summary from the US Bureau of Labor Statistics<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) offers guidance as to how to structure and analyze data related to our topic. Their report on CPI changes from 2023-2024 exemplifies how to organize large datasets and distill them into clear, actionable insights. The summary’s consistent formatting and emphasis on year-over-year percentage changes allow for a straightforward understanding of trends in consumer prices across different sectors. The structured approach will be instrumental in our own analysis of pricing data, helping us standardize our methodology and avoid potential misinterpretations. By adopting their organizational strategy, we can enhance the accuracy and credibility of our findings.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Jungle Scout (2024, Jan). Amazon Product Pricing Report 2024. Issuu. https://issuu.com/junglescoutcobalt/docs/jungle-scout-amazon-product-pricing-report-2024?utm_source=chatgpt.html 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Bureau of Labor Statistics (2024, Feb). Consumer Price Index Summary. U.S. Department of Labor. https://www.bls.gov/news.release/pdf/cpi.html

# Hypothesis



- Include your team's hypothesis
- Ensure that this hypothesis is clear to readers
- Explain why you think this will be the outcome (what was your thinking?)

What is your main hypothesis/predictions about what the answer to your question is? Briefly explain your thinking. (2-3 sentences)

# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- Dataset #2 (if you have more than one!)
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- etc

Now write 2 - 5 sentences describing each dataset here. Include a short description of the important variables in the dataset; what the metrics and datatypes are, what concepts they may be proxies for. Include information about how you would need to wrangle/clean/preprocess the dataset

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

In [139]:
# Install the keepa library (run this cell if not already installed)
#!pip install keepa

import requests
import pandas as pd
import numpy as np
import json
from pathlib import Path
import keepa
import datetime
import matplotlib.pyplot as plt
import math
import os

In [157]:
ACCESS_KEY = "df2mtauj1tmrngcm95ubshd41fplpf2bfh1nba8s8hpd2m6golbbrj9bat7osb8o" # do not share outside of private repo!!
api = keepa.Keepa(ACCESS_KEY, timeout=30)
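
Since the key should not leave the private repo, one option is to keep it out of the notebook entirely. The sketch below reads it from an environment variable instead. The variable name `KEEPA_ACCESS_KEY` and the helper `load_keepa_key` are hypothetical names chosen for illustration, not part of the keepa library.

```python
import os

def load_keepa_key(env_var="KEEPA_ACCESS_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this notebook.")
    return key

# Possible usage in the cell above (assumes the variable has been set):
# api = keepa.Keepa(load_keepa_key(), timeout=30)
```

With this approach the notebook can be shared or committed without exposing the key.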

To query the price of a given Amazon product, there are many different types of the 'price' variable we can access. Two of them are new price and listing price. Here are the differences:

• NEW PRICE:
This is the current selling price for an item offered on Amazon in brand new condition. It reflects the actual market price that customers pay—typically the lowest available offer among Amazon and third‑party sellers. Because it’s influenced by promotions, competition, and real‑time market conditions, the NEW price can fluctuate over time.

• LISTING PRICE:
This is usually the manufacturer’s suggested retail price (MSRP) or the original price displayed on the product’s listing. It tends to be more stable and is often used as a reference to show discounts or price reductions. Even when the new price drops (for deals or competitive reasons), the listing price may remain unchanged.

The new price shows you what you’d pay right now, while the listing price is a reference value set by the manufacturer. This difference helps sellers and buyers gauge the discount depth and market dynamics.
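
To make "discount depth" concrete, here is a minimal sketch that expresses the gap between the two prices as a fraction of the listing price. The function name `discount_depth` and the prices used are made up for illustration and do not come from our data.

```python
def discount_depth(list_price, new_price):
    """Fraction by which the NEW price undercuts the listing price."""
    if list_price <= 0:
        raise ValueError("list_price must be positive")
    return (list_price - new_price) / list_price

# Illustrative numbers only: a $29.99 listing price with a $23.99 NEW price
print(f"{discount_depth(29.99, 23.99):.1%}")  # 20.0%
```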

Since we want to accurately capture the price a consumer is paying at a given time, we will use the NEW PRICE.

Other price variables we can access are NEW_FBM (Fulfilled by Merchant), NEW_FBA (Fulfilled by Amazon), USED (used items), REFURBISHED (refurbished items), and WAREHOUSE (prices from Amazon Warehouse deals, usually returned items). These price variables offer spotty data coverage and don't accurately represent the typical consumer experience, so we will not be using them.

## Amazon Price History Dataset from Keepa API

First, we want to gather the ASINs (Amazon Standard Identification Numbers) for each product category. Each product on Amazon has a unique ASIN. We can collect them manually through the Keepa Data Product Finder tool. Here we filter by category (baby products) and by rank (1-1000), which gathers the top 1000 product ASINs for baby products. We also include the name of each product and its subcategory. We made sure to exclude 'variations' of products, because we don't want to sample the same product 5 times just because there are 5 versions of it with slight variations. We also restricted our query to physical products, excluding digital products and ebooks. CPI data is only collected on physical goods, so this allows for a fair comparison. Our final product filter is 'tracking time', which lets us select only products that actually have price data from 2021-2024. If we tracked items that were first listed in 2023, we would only have 1 year of price data for those items.

In [131]:
# read in top 1000 baby products csv:
data_path = Path('data') / 'baby_products.csv'
products = pd.read_csv(data_path)

In [138]:
products.head(10)

Unnamed: 0,product_title,subcategory,asin
0,"Pampers Baby Diapers - Swaddlers - Size 8, 60 ...",disposable diapers,B0C88BKX13
1,WaterWipes Plastic-Free Original 99.9% Water B...,wipes & holders,B0C7LW9C7H
2,"Huggies Natural Care Sensitive Baby Wipes, Uns...",wipes & refills,B08QRT84WJ
3,"Pampers Baby Wipes Sensitive, Water Based Wipe...",wipes & refills,B0BJ14MYC9
4,The Honest Company Clean Conscious Unscented W...,wipes & refills,B0DCHPP188
5,"Huggies Size 1 Diapers, Little Snugglers Newbo...",disposable diapers,B09NS37ZHX
6,"No-Touch Thermometer for Adults and Kids, Digi...",thermometers,B0CC29Y419
7,"Huggies Size 4 Diapers, Little Movers Baby Dia...",disposable diapers,B08VC33TDN
8,Dr. Brown’s Natural Flow Level 2 & Level 3 Nar...,nipples,B09VCTN7M6
9,Huggies Simply Clean Unscented Baby Diaper Wip...,wipes & refills,B08QRKY3NJ


Now that we have the ASINs, product names, and subcategories, we've noticed that some products have multiple subcategories, many of which are irrelevant for the purposes of our analysis, such as "Baby Coupons", "TEST ABCDEFGPD", and even random characters such as "d963aedb-8e7e-493c...". We want to clean the subcategory column so that each product has a single subcategory. Luckily for us, it seems that the first subcategory listed is the most descriptive one for a given product, so we can remove the others.

In [134]:
products['subcategory'].value_counts().tail(15) # bottom 15 subcategories

subcategory
Home & Kitchen, Letters & Numbers, Baby                                                                                          1
Medicine Dispensers, Fresh | Household Coupons, Fresh | HPC Coupons, Fresh | Health & Household Coupons, Fresh | Baby Coupons    1
Safety, Safety                                                                                                                   1
Meals, d963aedb-8e7e-493c-80e7-b70e3c253824_9301, Gluten-Free Groceries, Baby Food                                               1
Tandem                                                                                                                           1
Toy Chests & Organizers, Categories                                                                                              1
Convertible Cribs, Kitchen & Dining Features, Nursery                                                                            1
Hair Care                                                              

In [149]:
# Rename columns and keep only the first (most descriptive) subcategory for each row
def clean_frame(df):
    df = df.rename(columns = {'Title' : 'product_title', 'Categories: Sub' : 'subcategory', 'ASIN' : 'asin'})
    # Take the text before the first comma, then trim whitespace and lowercase it
    df['subcategory'] = df['subcategory'].str.split(',').str[0].str.strip().str.lower()
    return df

products = clean_frame(products)

In [137]:
products['subcategory'].value_counts().tail(15)

subcategory
convertible cribs              1
tandem                         1
gate extensions                1
strap & belt covers            1
harnesses & leashes            1
travel carry bags              1
pillow covers                  1
glider & ottoman sets          1
furniture                      1
prenatal monitoring devices    1
bottle handles                 1
lamps & shades                 1
stroller hooks                 1
storage & organizers           1
scales                         1
Name: count, dtype: int64

In [112]:
products['subcategory'].nunique()

179

Now our least common subcategories are actually meaningful, and we are left with 179 subcategories for baby products.

Moving forward, we can use the asin column to query the Keepa API for the historical price data of each item. With our basic membership access to the Keepa API, we are limited in how much data we can request, so we will have to collect our price data in batches. Within the function get_monthly_avg_prices, we query specific ASINs to get their historical price data and return a dataframe of this data. Due to the API limitations, we have to incrementally build up a large CSV of price data for different product categories.

In [94]:
def keepa_time_to_datetime(kt):
    # Convert Keepa time (minutes since 2011-01-01) to a Python datetime (UTC)
    # According to Keepa docs:  Unix timestamp = (keepaTime + 21564000) * 60
    # If kt is already a datetime, just return it
    if isinstance(kt, datetime.datetime):
        return kt
    # Otherwise, assume it's a Keepa integer time
    return datetime.datetime.utcfromtimestamp((kt + 21564000) * 60)

def get_monthly_avg_prices(asins, days=1460):
    """
    asins: list of ASIN strings
    days: how many days of history to request (default 1460 ~ 4 years)
    
    Returns a DataFrame:
        - Rows = ASINs
        - Columns = monthly time periods (e.g. '2023-01', '2023-02', etc.)
        - Values = average 'NEW' price for that month
    """
    # Single API call for all ASINs
    products = api.query(asins, days=days)

    dfs = []
    for product in products:
        asin = product['asin']
        
        # Extract price/time arrays for the 'NEW' price
        price_history = product['data'].get('NEW', [])
        time_history  = product['data'].get('NEW_time', [])
        
        if len(price_history) == 0 or len(time_history) == 0:
            # No price data for this ASIN, so skip it
            continue
        
        # Convert Keepa times to datetimes where needed; prices are used exactly
        # as returned by the keepa library (no unit conversion is applied here)
        dates = [keepa_time_to_datetime(t) for t in time_history]
        prices = list(price_history)
        
        # Create a DF for this ASIN
        df = pd.DataFrame({'date': dates, asin: prices})
        df.set_index('date', inplace=True)
        
        # Resample to monthly averages, labeled by month end ('M').
        # Use 'MS' instead to label each month by its first day.
        monthly_avg = df.resample('M').mean()
        
        # monthly_avg now has a DateTimeIndex, and 1 column named after the ASIN.
        dfs.append(monthly_avg)
    
    if not dfs:
        return pd.DataFrame()  # No data, return empty
    
    # Concatenate on axis=1 => date index, one column per ASIN
    combined = pd.concat(dfs, axis=1)
    
    # Transpose so rows = ASIN, columns = monthly date
    combined = combined.T
    
    # Optionally convert the DatetimeIndex to strings like 'YYYY-MM'
    combined.columns = [col.strftime('%Y-%m') for col in combined.columns]
    
    return combined


def batch(iterable, n=25):
    """Yield successive n-sized chunks from iterable."""
    for i in range(0, len(iterable), n):
        yield iterable[i:i + n]
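
As a sanity check on the conversion formula quoted in the comment of keepa_time_to_datetime, the sketch below applies it to Keepa minute 0, which should land on 2011-01-01 UTC. The helper name `keepa_minutes_to_utc` is hypothetical, and it uses a timezone-aware call rather than `utcfromtimestamp`, so it is a variant of the function above rather than a copy of it.

```python
import datetime

KEEPA_OFFSET_MINUTES = 21564000  # offset taken from the comment in the cell above

def keepa_minutes_to_utc(kt):
    """Convert Keepa minutes to a timezone-aware UTC datetime."""
    unix_seconds = (kt + KEEPA_OFFSET_MINUTES) * 60
    return datetime.datetime.fromtimestamp(unix_seconds, tz=datetime.timezone.utc)

print(keepa_minutes_to_utc(0))        # 2011-01-01 00:00:00+00:00
print(keepa_minutes_to_utc(24 * 60))  # 2011-01-02 00:00:00+00:00
```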

In [173]:
# EXAMPLE USAGE, costs 1 token per ASIN (3 tokens for this query):
asins = ["B0C7FBH4QV",'B0DJ1Z3XW4','B0CZ4KLN8Q']  

test = api.query(asins, days = 365)

#monthly_prices = get_monthly_avg_prices(asins)
#monthly_prices.head()

100%|██████████| 3/3 [00:08<00:00,  2.77s/it]


In [178]:
#print(test[0])
test[0]['data'].get('NEW_time', [])

array([datetime.datetime(2023, 6, 8, 20, 8),
       datetime.datetime(2023, 6, 12, 3, 52),
       datetime.datetime(2023, 6, 14, 1, 4),
       datetime.datetime(2023, 6, 15, 7, 18),
       datetime.datetime(2023, 6, 20, 14, 44),
       datetime.datetime(2023, 6, 23, 6, 4),
       datetime.datetime(2023, 6, 26, 13, 40),
       datetime.datetime(2023, 6, 27, 21, 46),
       datetime.datetime(2023, 6, 28, 20, 56),
       datetime.datetime(2023, 6, 29, 4, 18),
       datetime.datetime(2023, 7, 1, 21, 28),
       datetime.datetime(2023, 7, 2, 13, 8),
       datetime.datetime(2023, 7, 3, 21, 8),
       datetime.datetime(2023, 7, 4, 5, 12),
       datetime.datetime(2023, 7, 4, 6, 48),
       datetime.datetime(2023, 7, 20, 16, 32),
       datetime.datetime(2023, 7, 21, 13, 40),
       datetime.datetime(2023, 7, 30, 22, 16),
       datetime.datetime(2023, 7, 31, 21, 22),
       datetime.datetime(2023, 8, 5, 10, 0),
       datetime.datetime(2023, 8, 5, 13, 32),
       datetime.datetime(2023, 8, 
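
The timestamps above are irregularly spaced, which is why we collapse each product's history into monthly averages. Here is a toy sketch of that step on made-up prices for a placeholder ASIN (`B000EXAMPLE` is not a real product). It groups by calendar month with `to_period("M")` instead of calling `resample`, which gives the same monthly means for this data but is a swapped-in technique, not the exact code in get_monthly_avg_prices.

```python
import pandas as pd

# Made-up observations for one placeholder ASIN at irregular timestamps
obs = pd.DataFrame(
    {"B000EXAMPLE": [10.0, 12.0, 11.0, 13.0]},
    index=pd.to_datetime(["2023-06-08", "2023-06-20", "2023-07-02", "2023-07-30"]),
)

# Average all observations that fall in the same calendar month
monthly = obs.groupby(obs.index.to_period("M")).mean()
monthly.index = monthly.index.astype(str)  # 'YYYY-MM' labels

# Transpose so that rows are ASINs and columns are months
wide = monthly.T
print(wide)
```

Note that this is a plain mean over recorded observations, so a price that held for three weeks counts the same as one that held for a day. A duration-weighted average would be a possible refinement.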

Here is where the magic happens. With the function **query_keepa_in_batches**, we can incrementally build a CSV of historical price data for a given category using the product dataframe we created earlier. The inputs to this function are: a dataframe of products (baby products in this case), the category name used for the output file (e.g. 'baby_products'), the maximum number of batches to run, the batch size, how many days of historical price data we want, and the start and stop indices into the products dataframe. These start and stop indices allow us to pick up where we left off if we have to stop querying: we find the last batch we completed, multiply by the batch size (25 by default), and continue from there. If the CSV already exists, we append to it; if it does not exist, we create a new one. This ensures we never overwrite data, and we can always keep track of where we are in the request process.
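
The resume arithmetic described above can be sketched as follows. The helper names `resume_start_index` and `batches_remaining` are hypothetical and only illustrate the calculation; they are not used elsewhere in this notebook.

```python
import math

def resume_start_index(completed_batches, batch_size=25):
    """Index of the first ASIN that has not been queried yet."""
    return completed_batches * batch_size

def batches_remaining(n_asins, start_index, batch_size=25):
    """Number of batches still needed to cover the rest of the ASIN list."""
    return math.ceil(max(0, n_asins - start_index) / batch_size)

# Example: 9 batches of 20 ASINs finished out of 1000 products
start = resume_start_index(9, batch_size=20)
print(start)                                        # 180
print(batches_remaining(1000, start, batch_size=20))  # 41
```

The resulting value would be passed as start_index on the next run, assuming the batch size is the same as in the interrupted run.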

In [154]:
# Actual usage. Do not run unless necessary to conserve API tokens. CSV file is saved in data folder after each batch.
# This may take several hours to run depending on max batches variable. Each ASIN costs 1 token, and current API subscription generates 1 token per minute. 

def query_keepa_in_batches(products, category, max_batches=10, batch_size=25, days=1460, start_index=0, stop_index=None):
    """
    Query Keepa for monthly average prices in batches and incrementally save the results to a CSV file.
    
    Parameters:
        products (DataFrame or dict): Contains product data with a key 'asin'.
        category (str): Base name for the CSV file (e.g., 'baby_products').
        max_batches (int): Number of batches to process (default 10). Each batch contains batch_size ASINs.
        batch_size (int): Number of ASINs per batch (default 25).
        days (int): Number of days of history to request (default 1460, ~4 years).
        start_index (int): Start index for slicing the ASIN list.
        stop_index (int or None): Stop index for slicing the ASIN list. If None, process until the end.
        
    Returns:
        None. The function saves the data incrementally to a CSV file in the 'data' folder.
    """
    # Slice the ASIN list based on start_index and stop_index.
    asins = list(products['asin'])[start_index:stop_index]
    
    csv_file = f'data/{category}_monthly_prices.csv'
    
    for i, asin_batch in enumerate(batch(asins, batch_size)):
        if i >= max_batches:
            break
        df_batch = get_monthly_avg_prices(asin_batch, days=days)
        # If the CSV file exists, append without headers; otherwise, create a new file with headers.
        if os.path.exists(csv_file):
            df_batch.to_csv(csv_file, mode='a', index=True, header=False)
        else:
            df_batch.to_csv(csv_file, index=True)
        print(f"Batch {i+1} processed and appended.")





Now for the boring part: waiting for the API requests as we slowly build up our CSV. This code takes several hours to run because tokens accumulate at a fixed rate while the requests are in progress, so we expect to leave it running in the background, likely overnight.
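
A rough estimate of the waiting time follows from the rates noted in the comments above (1 token per ASIN, 1 token generated per minute). The sketch below is a back-of-the-envelope calculation under those assumptions; `estimated_minutes` is a hypothetical helper and it ignores request latency.

```python
def estimated_minutes(n_asins, tokens_available=0, tokens_per_minute=1.0, tokens_per_asin=1):
    """Minutes spent waiting for tokens to query n_asins products."""
    tokens_needed = max(0, n_asins * tokens_per_asin - tokens_available)
    return tokens_needed / tokens_per_minute

# A full category of 1000 products, starting with no tokens banked
print(f"{estimated_minutes(1000) / 60:.1f} hours")  # 16.7 hours
```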

In [153]:
#query_keepa_in_batches(products, 'baby_products', max_batches=10) # default batch size is 25, default days is 1460 (4 yrs),
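
One caveat with appending batches without headers: each batch's dataframe only contains the months for which its products have data, so the columns of two batches may not line up in the CSV. A possible safeguard, sketched here under the assumption that we fix the month range in advance, is to reindex every batch to the same list of 'YYYY-MM' columns before writing. The helpers `month_columns` and `align_to_months` are hypothetical and are not part of query_keepa_in_batches as written.

```python
import pandas as pd

def month_columns(start, end):
    """List of 'YYYY-MM' labels from start to end, inclusive."""
    return [str(p) for p in pd.period_range(start, end, freq="M")]

def align_to_months(df_batch, columns):
    """Give a batch exactly the expected month columns, filling gaps with NaN."""
    return df_batch.reindex(columns=columns)

# Toy batch with a single month of data for a placeholder ASIN
cols = month_columns("2021-01", "2021-03")
toy_batch = pd.DataFrame({"2021-02": [5.0]}, index=["B000EXAMPLE"])
print(align_to_months(toy_batch, cols))
```

Months missing from a batch become NaN, and any month outside the chosen range is dropped, so the range should cover the full period requested from the API.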

This entire process was just for the baby_products category. We will have to repeat it for every other product category we use. Luckily, now that all of the functions are defined, this only takes a few simple steps:
1. Load in a category of Amazon products as a dataframe
2. Clean the dataframe
3. Query the API and build historical price data csv

Here is an example:

In [141]:
category = 'example_category'
data_path = Path('data/keepa_data') / f'{category}.csv'
example_category = pd.read_csv(data_path)
example_category = clean_frame(example_category)
query_keepa_in_batches(example_category, category, max_batches=10, batch_size=25, days = 1460, start_index = 0, stop_index = None)


In [152]:
category = 'grocery_and_foods'
data_path = Path('data/keepa_data') / f'{category}.csv'
grocery_and_foods = pd.read_csv(data_path)
grocery_and_foods = clean_frame(grocery_and_foods)
grocery_and_foods

Unnamed: 0,product_title,subcategory,asin
0,"CELSIUS Sparkling Oasis Vibe, Functional Essen...",energy drinks,B0C7FBH4QV
1,Nespresso Chocolate Fudge Flavor VertuoLine po...,single-serve capsules & pods,B08RJJ8VP7
2,"Premier Protein Shake, Winter Mint Chocolate, ...",protein drinks,B0CZ4KLN8Q
3,"Sparkling Ice, Berry Lemonade Sparkling Water,...",carbonated water,B0B192RX4X
4,"Monster Energy Ultra Violet, Sugar Free Energy...",energy drinks,B0BL6WMRGG
...,...,...,...
995,WATERMELON JOLLY RANCHER Hard Candy Original F...,hard candy,B0CCPPKG6B
996,"Starbucks Premium Instant Coffee, Dark Roast, ...",instant coffee,B08YPDCZCN
997,"Lipton Honey Ginger Green Tea Bags, Flavored, ...",green,B0D613P593
998,Luxardo Gourmet Cocktail Maraschino Cherries |...,cherries,B0CYP1ZS3K


In [156]:
query_keepa_in_batches(grocery_and_foods, category, max_batches=25, batch_size=20, days = 1460, start_index = 0, stop_index = None)

100%|██████████| 20/20 [00:05<00:00,  3.81it/s]
  monthly_avg = df.resample('M').mean()


Batch 1 processed and appended.


100%|██████████| 20/20 [00:09<00:00,  2.02it/s]
  monthly_avg = df.resample('M').mean()


Batch 2 processed and appended.


100%|██████████| 20/20 [00:05<00:00,  3.74it/s]
  monthly_avg = df.resample('M').mean()


Batch 3 processed and appended.


100%|██████████| 20/20 [00:11<00:00,  1.74it/s]
  monthly_avg = df.resample('M').mean()


Batch 4 processed and appended.


  0%|          | 0/20 [00:00<?, ?it/s]Waiting 1177 seconds for additional tokens


Response from server: NOT_ENOUGH_TOKEN


Waiting 1 seconds for additional tokens
100%|██████████| 20/20 [19:50<00:00, 59.54s/it]
  monthly_avg = df.resample('M').mean()


Batch 5 processed and appended.


  0%|          | 0/20 [00:00<?, ?it/s]Waiting 1190 seconds for additional tokens


Response from server: NOT_ENOUGH_TOKEN


Waiting 2 seconds for additional tokens
100%|██████████| 20/20 [20:03<00:00, 60.18s/it]
  monthly_avg = df.resample('M').mean()


Batch 6 processed and appended.


  0%|          | 0/20 [00:00<?, ?it/s]Waiting 1191 seconds for additional tokens


Response from server: NOT_ENOUGH_TOKEN


Waiting 2 seconds for additional tokens
100%|██████████| 20/20 [20:04<00:00, 60.20s/it]
  monthly_avg = df.resample('M').mean()


Batch 7 processed and appended.


  0%|          | 0/20 [00:00<?, ?it/s]Waiting 1191 seconds for additional tokens


Response from server: NOT_ENOUGH_TOKEN


Waiting 2 seconds for additional tokens
100%|██████████| 20/20 [20:08<00:00, 60.44s/it]
  monthly_avg = df.resample('M').mean()


Batch 8 processed and appended.


  0%|          | 0/20 [00:00<?, ?it/s]Waiting 1187 seconds for additional tokens


Response from server: NOT_ENOUGH_TOKEN


Waiting 2 seconds for additional tokens
100%|██████████| 20/20 [19:58<00:00, 59.93s/it]
  monthly_avg = df.resample('M').mean()


Batch 9 processed and appended.


  0%|          | 0/20 [00:00<?, ?it/s]Waiting 1192 seconds for additional tokens


Response from server: NOT_ENOUGH_TOKEN


Waiting 2 seconds for additional tokens


ReadTimeout: HTTPSConnectionPool(host='api.keepa.com', port=443): Read timed out. (read timeout=10.0)

Note:
The code of the cell above is mostly generated by ChatGPT from the prompt "How can I load the Keepa API into an ipynb, and query price data every 30 days for the past 4 years?"
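
The run above ended with a ReadTimeout after nine batches. One way to make long runs more robust is to retry a failed call a few times before giving up. The sketch below is a generic retry wrapper; `call_with_retries` is a hypothetical helper, and wiring it into query_keepa_in_batches is left as an assumption shown only in the trailing comment.

```python
import time

def call_with_retries(fn, attempts=3, delay_seconds=5.0, retry_on=(Exception,)):
    """Call fn(); on a listed exception, wait and retry. Re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)

# Possible usage inside the batch loop (assumes `requests` is imported as above):
# df_batch = call_with_retries(
#     lambda: get_monthly_avg_prices(asin_batch, days=days),
#     retry_on=(requests.exceptions.ReadTimeout,),
# )
```

Restricting `retry_on` to timeout errors keeps genuine bugs from being silently retried.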

## Dataset #2 (if you have more than one, use name instead of number here)

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION 

# Ethics & Privacy

- Thoughtful discussion of ethical concerns included
- Ethical concerns consider the whole data science process (question asked, data collected, data being used, the bias in data, analysis, post-analysis, etc.)
- How your group handled bias/ethical concerns clearly described

Acknowledge and address any ethics & privacy related issues of your question(s), proposed dataset(s), and/or analyses. Use the information provided in lecture to guide your group discussion and thinking. If you need further guidance, check out [Deon's Ethics Checklist](http://deon.drivendata.org/#data-science-ethics-checklist). In particular:

- Are there any biases/privacy/terms of use issues with the data you proposed?
- Are there potential biases in your dataset(s), in terms of who it composes, and how it was collected, that may be problematic in terms of it allowing for equitable analysis? (For example, does your data exclude particular populations, or is it likely to reflect particular human biases in a way that could be a problem?)
- How will you set out to detect these specific biases before, during, and after/when communicating your analysis?
- Are there any other issues related to your topic area, data, and/or analyses that are potentially problematic in terms of data privacy and equitable impact?
- How will you handle issues you identified?

# Team Expectations 


Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* *Team Expectation 1*
* *Team Expectation 2*
* *Team Expectation 3*
* ...

# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor) | Discuss/edit Analysis; Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |