# 1 Business Understanding
The goal of this notebook is to explore an online retail transactional dataset in order to:

* **Understand sales performance**
* **Identify customer behavior patterns**
* **Analyze product-level performance**
* **Detect operational issues** (returns, cancellations)
* **Generate actionable business insights**

---

## 2 Retail Concepts to Master in This Notebook
This notebook will not only analyze data; it will also build strong retail domain knowledge.

### Order vs Line Item
* Understand the difference between invoice/order-level rows and product-level (line-item) rows
* Calculate metrics correctly at the right level
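
To make the distinction concrete, here is a minimal sketch on a hypothetical toy frame that reuses the dataset's column names (`Invoice`, `StockCode`, `Quantity`, `Price`). The `LineTotal` column and the sample rows are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy rows (not real data): one row per product per invoice.
lines = pd.DataFrame({
    "Invoice": ["1001", "1001", "1002"],
    "StockCode": ["S1", "S2", "S1"],
    "Quantity": [2, 1, 4],
    "Price": [5.0, 10.0, 5.0],
})
lines["LineTotal"] = lines["Quantity"] * lines["Price"]

# Line-item level: average value of a single row.
avg_line_value = lines["LineTotal"].mean()

# Order level: collapse rows to one total per invoice, then average.
order_totals = lines.groupby("Invoice")["LineTotal"].sum()
avg_order_value = order_totals.mean()

print(avg_line_value, avg_order_value)
```

Averaging the raw rows gives a per-line value, not an average order value, so the two numbers differ even though they come from the same rows.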

###  SKU Granularity
* Analyze performance at:
    * **SKU level** (product-level analysis)
    * **Order level**
    * **Customer level**
* Understand the impact of granularity on metrics
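
As a sketch of how granularity changes the picture, the same hypothetical toy rows can be aggregated at the three levels listed above. The sample data and the `LineTotal` column are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy rows (not real data) using the dataset's column names.
lines = pd.DataFrame({
    "Invoice": ["1001", "1001", "1002", "1003"],
    "StockCode": ["S1", "S2", "S1", "S2"],
    "Customer ID": [1.0, 1.0, 2.0, 1.0],
    "Quantity": [2, 1, 4, 3],
    "Price": [5.0, 10.0, 5.0, 10.0],
})
lines["LineTotal"] = lines["Quantity"] * lines["Price"]

# Same rows, three different grains.
by_sku = lines.groupby("StockCode")["LineTotal"].sum()
by_order = lines.groupby("Invoice")["LineTotal"].sum()
by_customer = lines.groupby("Customer ID")["LineTotal"].sum()

# The grand total is identical, but the number of groups (and so any
# average) depends on the grain chosen.
print(by_sku.sum(), by_order.sum(), by_customer.sum())
print(by_sku.mean(), by_order.mean(), by_customer.mean())
```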

###  GMV vs Revenue
* **Calculate:**
    * **GMV (Gross Merchandise Value)** = total sales before returns
    * **Net Revenue** = after returns & cancellations
* Compare both over time
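
One possible way to compute both figures per month is sketched below. It assumes returns and cancellations are recorded as negative-quantity rows, so GMV sums only the positive lines while net revenue sums every line. The toy rows, `LineTotal`, and `Month` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows (not real data).
lines = pd.DataFrame({
    "Invoice": ["1001", "1002", "C1003", "1004"],
    "InvoiceDate": pd.to_datetime(
        ["2010-01-05", "2010-01-20", "2010-01-25", "2010-02-03"]
    ),
    "Quantity": [10, 5, -2, 4],
    "Price": [2.0, 4.0, 2.0, 5.0],
})
lines["LineTotal"] = lines["Quantity"] * lines["Price"]
lines["Month"] = lines["InvoiceDate"].dt.to_period("M")

# GMV: sales before returns, i.e. positive-quantity lines only (assumption).
gmv = lines[lines["Quantity"] > 0].groupby("Month")["LineTotal"].sum()

# Net revenue: all lines, so negative rows subtract from the total.
net = lines.groupby("Month")["LineTotal"].sum()

summary = pd.DataFrame({"GMV": gmv, "NetRevenue": net})
print(summary)
```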

###  Returns vs Cancellations
* **Identify:**
    * Canceled invoices
    * Negative quantities (returns)
* Understand their operational and financial impact
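
A minimal sketch of the two flags, assuming that a canceled invoice is one whose `Invoice` value starts with the letter "C". The toy rows and the `IsCancellation` / `IsNegativeQty` column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows (not real data).
lines = pd.DataFrame({
    "Invoice": ["1001", "C1002", "1003", "1004"],
    "Quantity": [12, -12, -3, 5],
})

# Assumption: cancellations carry a "C" prefix on the invoice number.
lines["IsCancellation"] = lines["Invoice"].astype(str).str.startswith("C")
lines["IsNegativeQty"] = lines["Quantity"] < 0

# Negative rows that are not flagged as cancellations deserve a closer look.
negative_not_cancelled = lines[lines["IsNegativeQty"] & ~lines["IsCancellation"]]
print(negative_not_cancelled)
```

Whether the remaining negative rows are customer returns or something else (for example stock adjustments) is a question to check against the real data rather than assume.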

###  Time Aggregation
* **Analyze:**
    * Daily sales
    * Weekly trends
    * Monthly growth
* Detect seasonality patterns
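
The three time grains can be sketched by grouping on a period derived from `InvoiceDate`. The toy rows and `LineTotal` values are hypothetical, and weeks here follow the pandas default for `to_period("W")`, which ends each week on Sunday.

```python
import pandas as pd

# Hypothetical toy rows (not real data).
lines = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2010-01-04 09:00", "2010-01-04 15:30",
        "2010-01-12 10:00", "2010-02-01 11:00",
    ]),
    "LineTotal": [10.0, 5.0, 15.0, 45.0],
})

dates = lines["InvoiceDate"]
daily = lines.groupby(dates.dt.to_period("D"))["LineTotal"].sum()
weekly = lines.groupby(dates.dt.to_period("W"))["LineTotal"].sum()
monthly = lines.groupby(dates.dt.to_period("M"))["LineTotal"].sum()

# Month-over-month growth as a fraction; the first month has no baseline.
monthly_growth = monthly.pct_change()
print(monthly_growth)
```

Only periods that contain at least one row appear in the result, so days or weeks without sales are absent rather than zero.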

###  Repeat Customers
* **Identify:**
    * New vs returning customers
* **Measure:**
    * Repeat purchase rate
    * Customer frequency
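
A rough sketch of these measures, assuming a repeat customer is one with more than one distinct invoice and that rows with a missing `Customer ID` are excluded. The toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows (not real data); the last row has no customer.
lines = pd.DataFrame({
    "Invoice": ["1001", "1001", "1002", "1003", "1004", "1005"],
    "Customer ID": [1.0, 1.0, 1.0, 2.0, 3.0, None],
})

# Rows without a customer cannot be attributed, so drop them first.
known = lines.dropna(subset=["Customer ID"])

# Customer frequency: distinct invoices per customer (not row counts).
orders_per_customer = known.groupby("Customer ID")["Invoice"].nunique()

# Repeat purchase rate: share of customers with more than one order.
repeat_rate = (orders_per_customer > 1).mean()
print(orders_per_customer, repeat_rate)
```

Counting distinct invoices matters here because a single order usually spans several line-item rows.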

###  Return Rate
* **Compute:**
    * Return rate per SKU
    * Return rate overall
* Identify problematic products
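
One way to define return rate is returned units divided by sold units. That unit-based definition is an assumption (a value-based rate is equally plausible), as is treating negative quantities as returns. The toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows (not real data).
lines = pd.DataFrame({
    "StockCode": ["S1", "S1", "S2", "S2", "S2"],
    "Quantity": [10, -2, 5, 5, -5],
})

# Split each quantity into its sold part and its returned part.
sold_units = lines["Quantity"].clip(lower=0)
returned_units = -lines["Quantity"].clip(upper=0)

sold = sold_units.groupby(lines["StockCode"]).sum()
returned = returned_units.groupby(lines["StockCode"]).sum()

return_rate_per_sku = returned / sold
overall_return_rate = returned.sum() / sold.sum()
print(return_rate_per_sku, overall_return_rate)
```

A SKU with returns but no recorded sales would divide by zero, so such cases need separate handling on real data.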

###  Data Cleaning with Business Rules
Apply retail-specific cleaning logic:
* Remove canceled invoices
* Handle negative quantities
* Remove invalid prices
* Filter out test SKUs or anomalies
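
These rules could be combined into a single filter, sketched below. The helper name `clean_sales`, the "C" prefix for canceled invoices, and the "TEST" prefix for test SKUs are assumptions to verify against the actual data. Dropping negative quantities is only one way to handle them.

```python
import pandas as pd

def clean_sales(lines: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows that look like valid sales (hypothetical helper)."""
    # Assumption: canceled invoices start with "C".
    is_cancelled = lines["Invoice"].astype(str).str.startswith("C")
    # Assumption: test SKUs start with "TEST" (case-insensitive).
    is_test_sku = lines["StockCode"].astype(str).str.upper().str.startswith("TEST")
    has_valid_qty = lines["Quantity"] > 0
    has_valid_price = lines["Price"] > 0
    keep = ~is_cancelled & ~is_test_sku & has_valid_qty & has_valid_price
    return lines[keep].copy()

# Hypothetical toy rows (not real data): only the first row is a valid sale.
toy = pd.DataFrame({
    "Invoice": ["1001", "C1002", "1003", "1004", "1005"],
    "StockCode": ["S1", "S2", "TEST001", "S3", "S4"],
    "Quantity": [12, -12, 5, -3, 48],
    "Price": [6.95, 2.95, 4.50, 1.25, 0.0],
})
cleaned = clean_sales(toy)
print(cleaned)
```

Keeping the removed rows in a separate frame, instead of discarding them, preserves the input needed for return and cancellation analysis.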

###  Customer & Product Segmentation
* **Segment by:**
    * High-value customers
    * High-return customers
    * Top-performing SKUs
    * Low-performing SKUs
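
A simple starting point for segmentation is ranking by total value. The sketch below uses hypothetical toy rows and an assumed `LineTotal` column, and picks segment sizes arbitrarily.

```python
import pandas as pd

# Hypothetical toy rows (not real data).
lines = pd.DataFrame({
    "StockCode": ["S1", "S1", "S2", "S3", "S4"],
    "Customer ID": [1.0, 1.0, 2.0, 3.0, 3.0],
    "LineTotal": [60.0, 40.0, 40.0, 5.0, 60.0],
})

revenue_by_sku = lines.groupby("StockCode")["LineTotal"].sum()
top_skus = revenue_by_sku.nlargest(2)      # top-performing SKUs
bottom_skus = revenue_by_sku.nsmallest(2)  # low-performing SKUs

spend_by_customer = lines.groupby("Customer ID")["LineTotal"].sum()
top_customers = spend_by_customer.nlargest(1)  # high-value customers
print(top_skus, bottom_skus, top_customers)
```

High-return customers could be ranked the same way by swapping `LineTotal` for a returned-units column.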

###  Retail Storytelling
Translate data into business insights:
* What drives revenue?
* Which products hurt profit?
* Who are our best customers?
* When do we sell the most?
* Where are operational inefficiencies?

---

##  Final Learning Objective
By the end of this notebook, I should be able to:
1.  **Think like a Retail Data Analyst**
2.  **Speak retail KPIs confidently** (GMV, SKU, return rate, repeat rate)
3.  **Transform raw transactional data into business decisions**
4.  **Tell a clear data-driven retail story**

In [3]:
# imports
import pandas as pd
import numpy as np

In [17]:
# A raw string keeps the Windows path readable; the f-prefix did nothing here.
df = pd.read_csv(r"D:\retail-projects\online-retail-analysis\data\raw\online_retail_II.csv")
df.head()


Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


In [14]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 1067371 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   Invoice      1067371 non-null  str    
 1   StockCode    1067371 non-null  str    
 2   Description  1062989 non-null  str    
 3   Quantity     1067371 non-null  int64  
 4   InvoiceDate  1067371 non-null  str    
 5   Price        1067371 non-null  float64
 6   Customer ID  824364 non-null   float64
 7   Country      1067371 non-null  str    
dtypes: float64(2), int64(1), str(5)
memory usage: 65.1 MB
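
The `df.info()` output above shows `InvoiceDate` stored as a string and `Customer ID` as `float64` with missing values. A possible next step, sketched here on a hypothetical two-row frame named `raw`, is to parse the dates and move the IDs to a nullable integer type. The format string is inferred from the sample rows shown by `df.head()`.

```python
import pandas as pd

# Hypothetical toy rows (not real data) mimicking the loaded dtypes.
raw = pd.DataFrame({
    "InvoiceDate": ["2009-12-01 07:45:00", "2009-12-01 09:06:00"],
    "Customer ID": [13085.0, None],
})

# Parse timestamps; an explicit format avoids per-row format guessing.
raw["InvoiceDate"] = pd.to_datetime(raw["InvoiceDate"], format="%Y-%m-%d %H:%M:%S")

# Nullable integer dtype keeps missing customers as <NA> instead of NaN floats.
raw["Customer ID"] = raw["Customer ID"].astype("Int64")

print(raw.dtypes)
```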
