## Overview  
This notebook (`OrderGrouping.ipynb`) automates the weekly aggregation of Capelli’s “Details” CSV export into a single summary file. It performs the following high-level steps: loading and preprocessing the raw CSV, cleaning column names and types, converting date and quantity fields, grouping by **Customer Reference** and **Club Name**, applying custom aggregation logic for order status, and writing the result to a new CSV.  

## What You Need to Update Weekly  
1. **Input file path**:  
   - In the **Main Execution** block, set `input_file` to point at the latest Capelli “Details” tab CSV, e.g.  
     ```python
     input_file = 'shippingdates/Rush Soccer <MM.DD> - Details.csv'
     ```  
   - This path should match the filename you download from the Capelli portal.  
2. **Output file path**:  
   - Also in **Main Execution**, set `output_file` to the desired aggregated filename, for example:  
     ```python
     output_file = 'shippingdates/aggregated_orders<MM.DD>.csv'
     ```  
   - The notebook creates this file, or overwrites it if it already exists, on every run.  

> **Note:** Both file paths live under the `shippingdates/` folder and must be updated to reflect the new report dates each week.  
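
Because both paths carry the same `<MM.DD>` tag, one option is to derive them from a single value so that only one string changes each week. The sketch below is illustrative only: `build_paths` and its `report_tag` argument are hypothetical names that do not exist in the notebook, and the filename patterns are copied from the examples above.

```python
from pathlib import Path


def build_paths(report_tag, folder="shippingdates"):
    """Return (input_file, output_file) for one MM.DD report tag.

    Hypothetical helper: the notebook itself sets both paths by hand.
    """
    base = Path(folder)
    input_file = base / f"Rush Soccer {report_tag} - Details.csv"
    output_file = base / f"aggregated_orders{report_tag}.csv"
    # as_posix() keeps forward slashes, matching the paths shown above
    return input_file.as_posix(), output_file.as_posix()


input_file, output_file = build_paths("05.11")
print(input_file)
print(output_file)
```

If the downloaded filename ever deviates from this pattern, setting `input_file` by hand as described above remains the safer route.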

## How It Works  

1. **Load & Preprocess**  
   - Uses `pd.read_csv()` with error handling to catch missing or malformed files.  
   - Renames the “Shipped Date” column to **Shipping Date**, strips whitespace from all column names, and converts common missing-value strings (`'N/A'`) to `pd.NA`.  
   - Converts date columns (`Date Created`, `Shipping Date`) into `datetime64[ns]` via `pd.to_datetime()`.  
   - Converts quantity fields (`Order Quantity`, `Shipped Quantity`, `Unshipped Quantity`) to numeric, coercing invalid entries to `NaN` and then filling with the column median.  

2. **Aggregation Logic**  
   - Groups data by `['Customer Reference', 'Club Name']`.  
   - For each group:  
     - **Date Created**: takes the earliest date.  
     - Quantity fields: sums across the group.  
     - **Shipping Date**: takes the latest date.  
     - **Sales Order Header Status**: uses a custom function that returns `'OPEN'` if *any* order in the group is open; otherwise it returns the mode (most frequent status).  

3. **Save Results**  
   - Writes the aggregated `DataFrame` to CSV using `DataFrame.to_csv()` in the `shippingdates/` directory.  
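
The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual cell: the function names `aggregate_status` and `aggregate_orders`, the `%m/%d/%Y` date format, the literal status strings, and the sample rows are all assumed for the example.

```python
import pandas as pd

GROUP_KEYS = ["Customer Reference", "Club Name"]
QTY_COLS = ["Order Quantity", "Shipped Quantity", "Unshipped Quantity"]
DATE_COLS = ["Date Created", "Shipping Date"]


def aggregate_status(statuses: pd.Series) -> str:
    """Return 'OPEN' if any row in the group is open, else the most frequent status."""
    values = statuses.dropna().astype(str).str.strip()
    if values.empty:
        return ""
    if (values.str.upper() == "OPEN").any():
        return "OPEN"
    # mode() returns sorted values, so ties resolve to the first in sort order
    return values.mode().iloc[0]


def aggregate_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """Preprocess the raw export and collapse it to one row per group."""
    df = raw.copy()
    df.columns = df.columns.str.strip()
    df = df.rename(columns={"Shipped Date": "Shipping Date"})
    df = df.replace("N/A", pd.NA)

    # Assumed date format; unparseable or missing values become NaT
    for col in DATE_COLS:
        df[col] = pd.to_datetime(df[col], format="%m/%d/%Y", errors="coerce")

    # Invalid quantities become NaN, then take the column median
    for col in QTY_COLS:
        numeric = pd.to_numeric(df[col], errors="coerce")
        df[col] = numeric.fillna(numeric.median())

    agg_spec = {
        "Date Created": "min",        # earliest creation date
        "Order Quantity": "sum",
        "Shipped Quantity": "sum",
        "Unshipped Quantity": "sum",
        "Shipping Date": "max",       # latest shipping date
        "Sales Order Header Status": aggregate_status,
    }
    return df.groupby(GROUP_KEYS, as_index=False).agg(agg_spec)


# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "Customer Reference": [1001, 1001, 1002, 1002],
    "Club Name": ["Rush A", "Rush A", "Rush B", "Rush B"],
    "Date Created": ["05/02/2024", "05/01/2024", "05/03/2024", "05/03/2024"],
    "Order Quantity": [2, 3, 1, 4],
    "Shipped Quantity": [2, 0, 1, 4],
    "Unshipped Quantity": [0, 3, 0, 0],
    "Shipped Date": ["05/06/2024", "N/A", "05/07/2024", "05/09/2024"],
    "Sales Order Header Status": ["CLOSED", "OPEN", "CLOSED", "CLOSED"],
})

result = aggregate_orders(sample)
print(result)
# Saving would then be: result.to_csv(output_file, index=False)
```

Passing `index=False` to `to_csv()` keeps the pandas row index out of the output file, which is usually what a downstream consumer of the summary expects.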

## Usage Instructions  
1. **Place your new Capelli export** (the Details tab) into `shippingdates/` with a clear filename (e.g., `Rush Soccer 05.11 - Details.csv`).  
2. **Open this notebook**, update the `input_file` and `output_file` variables in the **Main Execution** cell.  
3. **Run all cells** in order:  
   - Data loading & preprocessing → aggregation → CSV export.  
4. **Verify** that `shippingdates/aggregated_orders<MM.DD>.csv` appears and contains the summarized orders.  
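
The verification in step 4 could also be done programmatically. The sketch below is one possible check, assuming the summary keeps the `Customer Reference` and `Club Name` columns; `check_summary` is a hypothetical helper, and the in-memory CSV stands in for the real output file.

```python
import io

import pandas as pd


def check_summary(csv_source) -> int:
    """Load an aggregated CSV and confirm each group appears exactly once.

    Accepts a file path or any file-like object; returns the row count.
    """
    summary = pd.read_csv(csv_source)
    keys = ["Customer Reference", "Club Name"]
    duplicated = summary.duplicated(subset=keys)
    if duplicated.any():
        raise ValueError(f"{int(duplicated.sum())} duplicate group rows found")
    return len(summary)


# Made-up stand-in for the real output file
demo = io.StringIO(
    "Customer Reference,Club Name,Order Quantity\n"
    "1001,Rush A,5\n"
    "1002,Rush B,5\n"
)
row_count = check_summary(demo)
print(row_count)
```

In practice the call would take the `output_file` path instead of the `StringIO` object.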

---  
*By following these steps and updating only the two file-path variables each week, this notebook provides a reliable, repeatable process for consolidating weekly Capelli orders into a single, clean CSV for downstream analysis.*  


In [5]:

import pandas as pd

# -------------------------- Step 1: Load the Data -------------------------- #

# Define the file path
file_path = 'shippingdates/Rush Soccer 5.4 - Details.csv'  # Replace with your actual file path

# Load the CSV file into a pandas DataFrame
# The export is comma-separated, so the default separator is used (no sep='\t')
try:
    df = pd.read_csv(file_path)

    print("Data loaded successfully.")
except Exception as e:
    print(f"Error loading data: {e}")
    raise  # Stop here: every later step needs `df` to exist

# -------------------------- Step 2: Inspect the Data -------------------------- #

# Display the first few rows to understand the data structure
print("First 5 rows of the dataset:")
# print(df.head())

# Display the column names and their data types
print("\nColumn Names:")
print(df.columns)
print("\nData Types:")
print(df.dtypes)

# -------------------------- Step 3: Clean Column Names -------------------------- #

# Strip leading and trailing whitespace from all column names
df.columns = df.columns.str.strip()

# Verify column names after stripping
print("\nCleaned Column Names:")
print(df.columns)

# -------------------------- Step 4: Clean Specific Columns -------------------------- #

# Define columns that may contain whitespace and need to be stripped
columns_to_strip = ['Order Quantity', 'Shipped Quantity', 'Unshipped Quantity']

# Strip whitespace from these columns if they are of object type (strings)
df[columns_to_strip] = df[columns_to_strip].apply(
    lambda x: x.str.strip() if x.dtype == "object" else x
)

# Convert 'Shipped Quantity' and 'Unshipped Quantity' to numeric types
# Replace non-numeric entries with 0
df['Shipped Quantity'] = pd.to_numeric(df['Shipped Quantity'], errors='coerce').fillna(0).astype(int)
df['Unshipped Quantity'] = pd.to_numeric(df['Unshipped Quantity'], errors='coerce').fillna(0).astype(int)

# Verify the changes
print("\nData Types After Conversion:")
print(df.dtypes)

# -------------------------- Step 5: Convert Date Columns -------------------------- #

# Convert 'Date Created' and 'Shipped Date' to datetime format
# Coerce errors to NaT (Not a Time) for invalid dates
df['Date Created'] = pd.to_datetime(df['Date Created'], errors='coerce', format='%m/%d/%Y')
df['Shipped Date'] = pd.to_datetime(df['Shipped Date'], errors='coerce', format='%m/%d/%Y')

# Verify the conversion
print("\nDate Columns After Conversion:")
print(df[['Date Created', 'Shipped Date']].head())

# -------------------------- Step 6: Handle Invalid 'Shipped Date' Entries -------------------------- #

# Identify rows with invalid 'Shipped Date' (NaT)
invalid_shipping_dates = df['Shipped Date'].isna()
print(f"\nNumber of rows with invalid 'Shipped Date': {invalid_shipping_dates.sum()}")

# Option 1: Remove rows with invalid 'Shipped Date'
df_clean = df.dropna(subset=['Shipped Date']).copy()
print(f"Number of rows after removing invalid 'Shipped Date': {df_clean.shape[0]}")

# Optionally, you can choose to fill invalid 'Shipped Date' with 'Date Created' or another default date
# Uncomment the following lines if you prefer this approach
# df['Shipped Date'] = df['Shipped Date'].fillna(df['Date Created'])
# df_clean = df.dropna(subset=['Shipped Date']).copy()

# -------------------------- Step 7: Extract Month-Year from 'Shipped Date' -------------------------- #

# Create a new column 'Month-Year' in 'MMMM YYYY' format (e.g., July 2024)
df_clean['Month-Year'] = df_clean['Shipped Date'].dt.strftime('%B %Y')

# Verify the new column
print("\nSample 'Month-Year' Entries:")
print(df_clean[['Shipped Date', 'Month-Year']].head())

# -------------------------- Step 8: Remove Duplicate Tracking Numbers -------------------------- #

# Assuming each 'Tracking Number' uniquely identifies a package, remove duplicates
# If 'Tracking Number' is not unique per package, adjust accordingly
df_unique = df_clean.drop_duplicates(subset=['Tracking Number'])

# Verify the removal of duplicates
print(f"\nNumber of unique packages after removing duplicates: {df_unique.shape[0]}")

# -------------------------- Step 9: Group by 'Month-Year' and Count Unique Packages -------------------------- #

# Group by 'Month-Year' and count unique 'Tracking Number' to get the number of packages shipped each month
packages_shipped = df_unique.groupby('Month-Year')['Tracking Number'].nunique().reset_index()

# Rename the columns for clarity
packages_shipped.columns = ['Month-Year', 'Unique Packages Shipped']

# -------------------------- Step 10: Sort the Results Chronologically -------------------------- #

# Convert 'Month-Year' back to datetime for sorting
packages_shipped['Month-Year-Date'] = pd.to_datetime(packages_shipped['Month-Year'], format='%B %Y')

# Sort by the new datetime column
packages_shipped = packages_shipped.sort_values('Month-Year-Date')

# Drop the auxiliary datetime column
packages_shipped = packages_shipped.drop('Month-Year-Date', axis=1)

# -------------------------- Step 11: Display the Results -------------------------- #

print("\nNumber of Packages Shipped Each Month:")
print(packages_shipped)

# -------------------------- Step 12: (Optional) Save the Results to CSV -------------------------- #

# Define the output file path
output_file = 'packages_shipped_per_month.csv'

# Save the DataFrame to a new CSV file
packages_shipped.to_csv(output_file, index=False)
print(f"\nAggregated data saved to {output_file}")


Data loaded successfully.
First 5 rows of the dataset:

Column Names:
Index(['Customer Reference', 'Club Name', 'Date Created', 'Sold TO Name',
       'Sold TO Email', 'Ship TO Name', 'Order Quantity', 'Shipped Quantity',
       'Unshipped Quantity', 'Shipped Date', 'Tracking Number',
       'Sales Order Header Status', 'Material Code', 'Description', 'Size'],
      dtype='object')

Data Types:
Customer Reference            int64
Club Name                    object
Date Created                 object
Sold TO Name                 object
Sold TO Email                object
Ship TO Name                 object
Order Quantity                int64
Shipped Quantity              int64
Unshipped Quantity            int64
Shipped Date                 object
Tracking Number              object
Sales Order Header Status    object
Material Code                object
Description                  object
Size                         object
dtype: object

Cleaned Column Names:
Index(['Customer Referenc