# 00 – Data Collection and Acquisition

This notebook documents how the raw data for the project are acquired and
verified. It mirrors and explains the behavior of the script
`scripts/get_data.py`.

Goals of this notebook:

- Confirm that Kaggle API credentials are configured.
- Download the raw Coffee Sales and Coffee Shop datasets into `data/raw/`
  using `scripts/get_data.py`.
- Verify file existence and basic properties (e.g., shapes, column names).
- Run checksum verification using `data/checksums.sha256`.

For full command-line instructions, see also `data/README.md`.

## 1. Kaggle API setup (summary)

Before running this notebook, the Kaggle API must be configured.
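A quick way to confirm the configuration from Python is sketched below. It is only an illustration: the helper name `kaggle_credentials_found` is hypothetical, and besides the `KAGGLE_API_TOKEN` variable used in this project it also accepts the older `KAGGLE_USERNAME`/`KAGGLE_KEY` pair and a `~/.kaggle/kaggle.json` file, which are assumptions about how a given machine might be set up rather than something this notebook requires.

```python
import os
from pathlib import Path


def kaggle_credentials_found(env=None, home=None) -> bool:
    """Return True if any commonly used Kaggle credential source is present.

    `env` and `home` are injectable for testing; by default the real
    environment and the current user's home directory are inspected.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Token-based setup (the option recommended in this project)
    if env.get("KAGGLE_API_TOKEN"):
        return True
    # Assumption: legacy username/key environment variables
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Assumption: legacy credentials file in the default location
    return (home / ".kaggle" / "kaggle.json").is_file()
```

Calling `kaggle_credentials_found()` with no arguments checks the current session. The function only detects that credentials exist; it does not confirm that Kaggle accepts them.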

As described in `data/README.md`:

1. Create a Kaggle account and go to **Account → Settings → API**.
2. Set the `KAGGLE_API_TOKEN` environment variable (recommended), e.g.:

   ```bash
   export KAGGLE_API_TOKEN='KGAT_...your_token_here...'
   ```

In [8]:
%pip install kaggle

Note: you may need to restart the kernel to use updated packages.


In [9]:
# Check project paths

from pathlib import Path

PROJECT_ROOT = Path("..").resolve()
DATA_DIR = PROJECT_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"

PROJECT_ROOT, DATA_DIR, RAW_DIR

(PosixPath('/Users/ujjwal/Downloads/IS-477-Project-Ujjwal'),
 PosixPath('/Users/ujjwal/Downloads/IS-477-Project-Ujjwal/data'),
 PosixPath('/Users/ujjwal/Downloads/IS-477-Project-Ujjwal/data/raw'))

## 2. Download raw data via `scripts/get_data.py`

The script `scripts/get_data.py` wraps the Kaggle API calls that download
the two datasets:

- `ahmedabbas757/coffee-sales`
- `jawad3664/coffee-shop`

We invoke the script from this notebook using a shell command. This will:

- Download the datasets (if not already present).
- Unzip them into `data/raw/`.
- Produce:

  - `data/raw/coffee_sales.csv`
  - `data/raw/coffee_shop.csv`
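The behavior listed above can be sketched roughly as follows. This is not the code in `scripts/get_data.py`, which is not reproduced in this notebook; it is a minimal illustration that assumes the `kaggle` package's `KaggleApi.dataset_download_files` call. The function name `download_dataset` and the `force` parameter are hypothetical.

```python
from pathlib import Path

# Dataset slugs as listed in this notebook, keyed by the expected CSV stem
DATASETS = {
    "coffee_sales": "ahmedabbas757/coffee-sales",
    "coffee_shop": "jawad3664/coffee-shop",
}


def download_dataset(name: str, raw_dir: Path, force: bool = False) -> Path:
    """Download and unzip one dataset unless its CSV is already present."""
    slug = DATASETS[name]  # raises KeyError for an unknown dataset name
    raw_dir = Path(raw_dir)
    target = raw_dir / f"{name}.csv"

    if target.exists() and not force:
        print(f"{target} already present, skipping download")
        return target

    # Imported lazily so that the skip path above works without credentials;
    # importing the kaggle package may itself trigger authentication.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    raw_dir.mkdir(parents=True, exist_ok=True)
    api.dataset_download_files(slug, path=str(raw_dir), unzip=True)

    # Assumption: files inside the archive may be named differently from
    # `<name>.csv`; any renaming step is deliberately left out of this sketch.
    return target
```

Keeping the "already present" check ahead of the Kaggle import means a machine that already holds the raw files can re-run the notebook without network access or credentials.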

In [10]:
# Run the download subcommand (the script sits one level up, in scripts/)
!python ../scripts/get_data.py download


=== Downloading coffee_sales ===
Downloading ahmedabbas757/coffee-sales into data/raw ...
Dataset URL: https://www.kaggle.com/datasets/ahmedabbas757/coffee-sales
License(s): GNU Lesser General Public License 3.0
Downloading coffee-sales.zip to data/raw
 49%|██████████████████▍                   | 4.00M/8.23M [00:00<00:00, 32.2MB/s]
100%|██████████████████████████████████████| 8.23M/8.23M [00:00<00:00, 41.8MB/s]
Unzipping coffee-sales.zip ...

=== Downloading coffee_shop ===
Downloading jawad3664/coffee-shop into data/raw ...
Dataset URL: https://www.kaggle.com/datasets/jawad3664/coffee-shop
License(s): CC0-1.0
Downloading coffee-shop.zip to data/raw
  0%|                                               | 0.00/29.7k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 29.7k/29.7k [00:00<00:00,

### 2.1 Verify that raw files exist

After running the script, we confirm that the expected CSV files are present
under `data/raw/`.
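Listing the directory shows what is there, but it does not fail if a file is absent. A stricter check might look like the sketch below; the helper name `missing_raw_files` is hypothetical, and the two file names are the ones this notebook expects.

```python
from pathlib import Path

# File names this notebook expects under data/raw/
EXPECTED_RAW_FILES = ("coffee_sales.csv", "coffee_shop.csv")


def missing_raw_files(raw_dir, expected=EXPECTED_RAW_FILES):
    """Return the expected file names that are not present in raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in expected if not (raw_dir / name).is_file()]
```

In this notebook it could be used as `missing = missing_raw_files(RAW_DIR)`, raising `FileNotFoundError` when `missing` is non-empty so that later cells do not run against an incomplete download.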

In [11]:
import pandas as pd

raw_files = list(RAW_DIR.glob("*.csv"))
raw_files

[PosixPath('/Users/ujjwal/Downloads/IS-477-Project-Ujjwal/data/raw/coffee_shop.csv'),
 PosixPath('/Users/ujjwal/Downloads/IS-477-Project-Ujjwal/data/raw/coffee_sales.csv')]

In [12]:
# Load the two main raw CSV files and print their shapes
sales_raw_path = RAW_DIR / "coffee_sales.csv"
shop_raw_path = RAW_DIR / "coffee_shop.csv"

sales_raw = pd.read_csv(sales_raw_path)
shop_raw = pd.read_csv(shop_raw_path)

print("Raw coffee_sales.csv shape:", sales_raw.shape)
print("Raw coffee_shop.csv shape: ", shop_raw.shape)

Raw coffee_sales.csv shape: (149116, 11)
Raw coffee_shop.csv shape:  (3547, 11)
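Shapes alone do not show whether the expected columns arrived. One way to check column names without loading the full files is sketched here, using only the standard library. The helper names are hypothetical, the expected column lists are copied from the `head()` previews in this notebook, and UTF-8 encoding of the CSV files is an assumption.

```python
import csv
from pathlib import Path

# Column names as they appear in this notebook's head() previews
SALES_COLUMNS = [
    "transaction_id", "transaction_date", "transaction_time",
    "transaction_qty", "store_id", "store_location", "product_id",
    "unit_price", "product_category", "product_type", "product_detail",
]
SHOP_COLUMNS = [
    "hour_of_day", "cash_type", "money", "coffee_name", "Time_of_Day",
    "Weekday", "Month_name", "Weekdaysort", "Monthsort", "Date", "Time",
]


def read_header(csv_path):
    """Return the first row of a CSV file (an empty list for an empty file)."""
    # "utf-8-sig" also strips a byte-order mark if one is present
    with Path(csv_path).open(newline="", encoding="utf-8-sig") as handle:
        return next(csv.reader(handle), [])


def missing_columns(csv_path, expected):
    """Return the expected column names that are absent from the header."""
    header = set(read_header(csv_path))
    return [col for col in expected if col not in header]
```

An empty result from `missing_columns(sales_raw_path, SALES_COLUMNS)` would indicate that every expected column is present. Extra columns are tolerated by design.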


In [13]:
sales_raw.head()

|  | transaction_id | transaction_date | transaction_time | transaction_qty | store_id | store_location | product_id | unit_price | product_category | product_type | product_detail |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1/1/23 | 7:06:11 | 2 | 5 | Lower Manhattan | 32 | 3.0 | Coffee | Gourmet brewed coffee | Ethiopia Rg |
| 1 | 2 | 1/1/23 | 7:08:56 | 2 | 5 | Lower Manhattan | 57 | 3.1 | Tea | Brewed Chai tea | Spicy Eye Opener Chai Lg |
| 2 | 3 | 1/1/23 | 7:14:04 | 2 | 5 | Lower Manhattan | 59 | 4.5 | Drinking Chocolate | Hot chocolate | Dark chocolate Lg |
| 3 | 4 | 1/1/23 | 7:20:24 | 1 | 5 | Lower Manhattan | 22 | 2.0 | Coffee | Drip coffee | Our Old Time Diner Blend Sm |
| 4 | 5 | 1/1/23 | 7:22:41 | 2 | 5 | Lower Manhattan | 57 | 3.1 | Tea | Brewed Chai tea | Spicy Eye Opener Chai Lg |


In [14]:
shop_raw.head()

|  | hour_of_day | cash_type | money | coffee_name | Time_of_Day | Weekday | Month_name | Weekdaysort | Monthsort | Date | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10 | card | 38.7 | Latte | Morning | Fri | Mar | 5 | 3 | 01/03/2024 | 15:50.5 |
| 1 | 12 | card | 38.7 | Hot Chocolate | Afternoon | Fri | Mar | 5 | 3 | 01/03/2024 | 19:22.5 |
| 2 | 12 | card | 38.7 | Hot Chocolate | Afternoon | Fri | Mar | 5 | 3 | 01/03/2024 | 20:18.1 |
| 3 | 13 | card | 28.9 | Americano | Afternoon | Fri | Mar | 5 | 3 | 01/03/2024 | 46:33.0 |
| 4 | 13 | card | 38.7 | Latte | Afternoon | Fri | Mar | 5 | 3 | 01/03/2024 | 48:14.6 |


## 3. Checksum creation and verification

To ensure data integrity and reproducibility, the project uses SHA-256
checksums recorded in `data/checksums.sha256`.

The script `scripts/get_data.py` supports two subcommands:

- `write-checks` – compute checksums for the current raw files and write
  them to `data/checksums.sha256`.
- `verify` – recompute checksums for the raw files and compare them to
  the values stored in `data/checksums.sha256`.

This allows anyone to confirm that they are using the **exact same raw
inputs** as were used for the original analysis.
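For readers who want to see the idea without opening the script, the two operations can be sketched with `hashlib` as below. This is an illustration, not the code in `scripts/get_data.py`. The function names are hypothetical, and the checksum file layout (one `<hex digest>  <file name>` pair per line, as produced by the `sha256sum` tool) is an assumption, since the actual contents of `data/checksums.sha256` are not shown in this notebook.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(files, checksum_file):
    """Write one '<digest>  <file name>' line per file (assumed layout)."""
    paths = sorted(Path(p) for p in files)
    lines = [f"{sha256_of(p)}  {p.name}" for p in paths]
    Path(checksum_file).write_text("\n".join(lines) + "\n", encoding="utf-8")


def verify_checksums(raw_dir, checksum_file):
    """Return {file name: True/False} for each entry in the checksum file."""
    results = {}
    text = Path(checksum_file).read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with '*'
        path = Path(raw_dir) / name
        # A missing file counts as a failed verification
        results[name] = path.is_file() and sha256_of(path) == expected
    return results
```

Reading in chunks keeps memory use flat regardless of file size. Any `False` value in the dictionary returned by `verify_checksums` means the corresponding raw file is missing or differs from the recorded version.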

In [15]:
# OPTIONAL: (re)generate checksums. Run this once if data/checksums.sha256
# does not exist yet, or when you intend to update it.
# !python ../scripts/get_data.py write-checks

In [16]:
# Verify that the raw data match the expected SHA-256 checksums
!python ../scripts/get_data.py verify

Checksum file not found. Run 'write-checks' first.


## 4. Hand-off to storage and organization

At this point, the raw data have been:

- Downloaded from Kaggle using the documented script and API setup.
- Confirmed to exist under `data/raw/`.
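- Optionally verified against `data/checksums.sha256`. In the run recorded
  above, `verify` reported that the checksum file did not exist yet, so
  `write-checks` must be run once before verification can succeed.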

The next step in the pipeline is **Storage, organization, and integration**, which is
covered in:

- `notebooks/01_storage_and_organization.ipynb`
- `data/README.md`