# Sales Data Exploration

**Project:** Global Electronics Retailer Dataset Analysis  
**Author:** Ammar Siregar  
**Purpose:** Initial data exploration and understanding  

This notebook takes a first look at the sales dataset: it loads each raw table and checks its structure, data quality, and basic characteristics before any cleaning.

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set style for visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Libraries imported successfully!")

Libraries imported successfully!


## Load Raw Data

Load all the CSV files from the raw data directory and examine their basic structure.

In [2]:
# Load all datasets from raw folder
print("Loading raw data files...")
print("=" * 50)

try:
    customers = pd.read_csv('../data/raw/Customers.csv', encoding='latin-1')
    products = pd.read_csv('../data/raw/Products.csv')
    sales = pd.read_csv('../data/raw/Sales.csv')
    stores = pd.read_csv('../data/raw/Stores.csv')
    exchange_rates = pd.read_csv('../data/raw/Exchange_Rates.csv')
    data_dictionary = pd.read_csv('../data/raw/Data_Dictionary.csv')
    
    print(f"✓ Customers: {customers.shape}")
    print(f"✓ Products: {products.shape}")
    print(f"✓ Sales: {sales.shape}")
    print(f"✓ Stores: {stores.shape}")
    print(f"✓ Exchange Rates: {exchange_rates.shape}")
    print(f"✓ Data Dictionary: {data_dictionary.shape}")
    
except Exception as e:
    # Report the problem, then re-raise: later cells need every DataFrame defined
    print(f"Error loading data: {e}")
    raise

Loading raw data files...
✓ Customers: (15266, 10)
✓ Products: (2517, 10)
✓ Sales: (62884, 9)
✓ Stores: (67, 5)
✓ Exchange Rates: (11215, 3)
✓ Data Dictionary: (37, 3)
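
Only `Customers.csv` is read with `encoding='latin-1'` above. A minimal sketch of how that choice could be made automatically is shown below. It assumes a hypothetical helper, `read_csv_with_fallback`, that is not part of the original notebook, and it is demonstrated on a tiny illustrative file rather than on the real data. Because latin-1 can decode any byte sequence, it should come last in the list of candidates.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Return (DataFrame, encoding) for the first encoding that decodes the file.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError(f"None of {encodings} could decode {path}")


# Illustrative file only: a tiny CSV written in latin-1 with accented characters
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_customers.csv")
    with open(demo_path, "wb") as handle:
        handle.write("CustomerKey,Name\n1,Zoë Müller\n".encode("latin-1"))
    demo_df, used_encoding = read_csv_with_fallback(demo_path)

print(f"Decoded with: {used_encoding}")
```

In the notebook itself, the same helper could be pointed at `'../data/raw/Customers.csv'` to confirm that UTF-8 decoding really fails for that file.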


## Data Dictionary Review

Let's first understand what each field represents by examining the data dictionary.

In [3]:
print("Data Dictionary:")
print("=" * 50)
display(data_dictionary)

Data Dictionary:


|   | Table | Field | Description |
|---|---|---|---|
| 0 | Sales | Order Number | Unique ID for each order |
| 1 | Sales | Line Item | Identifies individual products purchased as pa... |
| 2 | Sales | Order Date | Date the order was placed |
| 3 | Sales | Delivery Date | Date the order was delivered |
| 4 | Sales | CustomerKey | Unique key identifying which customer placed t... |
| 5 | Sales | StoreKey | Unique key identifying which store processed t... |
| 6 | Sales | ProductKey | Unique key identifying which product was purch... |
| 7 | Sales | Quantity | Number of items purchased |
| 8 | Sales | Currency Code | Currency used to process the order |
| 9 | Customers | CustomerKey | Primary key to identify customers |
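
Several descriptions above are cut off with `...` because pandas truncates wide text columns when displaying a frame. The sketch below shows one way to lift that limit temporarily with `pd.option_context`. The frame and its description text are hypothetical stand-ins, not rows from the real data dictionary, and `repr` is used in place of `display` so the snippet runs outside a notebook.

```python
import pandas as pd

# Hypothetical row: the text is deliberately longer than the 50-character limit
long_description = (
    "Hypothetical description that is deliberately longer than fifty "
    "characters so the default display setting truncates it"
)
demo_dictionary = pd.DataFrame(
    {"Field": ["Example Field"], "Description": [long_description]}
)

# Default-like behaviour: cells wider than 50 characters end in "..."
with pd.option_context(
    "display.max_colwidth", 50, "display.width", 1000, "display.max_columns", None
):
    truncated_view = repr(demo_dictionary)

# max_colwidth=None removes the per-cell limit for the duration of the block
with pd.option_context(
    "display.max_colwidth", None, "display.width", 1000, "display.max_columns", None
):
    full_view = repr(demo_dictionary)

print(truncated_view)
print(full_view)
```

In the notebook, calling `display(data_dictionary)` inside the second `with` block should show the complete descriptions without changing the global display settings.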


## Dataset Overview

Examine the structure and basic information about each dataset.

In [4]:
# Function to display dataset overview
def dataset_overview(df, name):
    print(f"\n{name.upper()} DATASET")
    print("=" * 50)
    print(f"Shape: {df.shape}")
    print(f"\nData Types:")
    print(df.dtypes)
    print(f"\nFirst 5 rows:")
    display(df.head())
    print(f"\nMissing values:")
    missing = df.isnull().sum()
    print(missing[missing > 0] if missing.sum() > 0 else "No missing values")
    print("\n" + "="*80)

# Overview of each dataset
dataset_overview(customers, "Customers")
dataset_overview(products, "Products")
dataset_overview(sales, "Sales")
dataset_overview(stores, "Stores")
dataset_overview(exchange_rates, "Exchange Rates")


CUSTOMERS DATASET
Shape: (15266, 10)

Data Types:
CustomerKey     int64
Gender         object
Name           object
City           object
State Code     object
State          object
Zip Code       object
Country        object
Continent      object
Birthday       object
dtype: object

First 5 rows:


|   | CustomerKey | Gender | Name | City | State Code | State | Zip Code | Country | Continent | Birthday |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 301 | Female | Lilly Harding | WANDEARAH EAST | SA | South Australia | 5523 | Australia | Australia | 7/3/1939 |
| 1 | 325 | Female | Madison Hull | MOUNT BUDD | WA | Western Australia | 6522 | Australia | Australia | 9/27/1979 |
| 2 | 554 | Female | Claire Ferres | WINJALLOK | VIC | Victoria | 3380 | Australia | Australia | 5/26/1947 |
| 3 | 786 | Male | Jai Poltpalingada | MIDDLE RIVER | SA | South Australia | 5223 | Australia | Australia | 9/17/1957 |
| 4 | 1042 | Male | Aidan Pankhurst | TAWONGA SOUTH | VIC | Victoria | 3698 | Australia | Australia | 11/19/1965 |



Missing values:
State Code    10
dtype: int64


PRODUCTS DATASET
Shape: (2517, 10)

Data Types:
ProductKey         int64
Product Name      object
Brand             object
Color             object
Unit Cost USD     object
Unit Price USD    object
SubcategoryKey     int64
Subcategory       object
CategoryKey        int64
Category          object
dtype: object

First 5 rows:


|   | ProductKey | Product Name | Brand | Color | Unit Cost USD | Unit Price USD | SubcategoryKey | Subcategory | CategoryKey | Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Contoso 512MB MP3 Player E51 Silver | Contoso | Silver | $6.62 | $12.99 | 101 | MP4&MP3 | 1 | Audio |
| 1 | 2 | Contoso 512MB MP3 Player E51 Blue | Contoso | Blue | $6.62 | $12.99 | 101 | MP4&MP3 | 1 | Audio |
| 2 | 3 | Contoso 1G MP3 Player E100 White | Contoso | White | $7.40 | $14.52 | 101 | MP4&MP3 | 1 | Audio |
| 3 | 4 | Contoso 2G MP3 Player E200 Silver | Contoso | Silver | $11.00 | $21.57 | 101 | MP4&MP3 | 1 | Audio |
| 4 | 5 | Contoso 2G MP3 Player E200 Red | Contoso | Red | $11.00 | $21.57 | 101 | MP4&MP3 | 1 | Audio |



Missing values:
No missing values


SALES DATASET
Shape: (62884, 9)

Data Types:
Order Number      int64
Line Item         int64
Order Date       object
Delivery Date    object
CustomerKey       int64
StoreKey          int64
ProductKey        int64
Quantity          int64
Currency Code    object
dtype: object

First 5 rows:


|   | Order Number | Line Item | Order Date | Delivery Date | CustomerKey | StoreKey | ProductKey | Quantity | Currency Code |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 366000 | 1 | 1/1/2016 | NaN | 265598 | 10 | 1304 | 1 | CAD |
| 1 | 366001 | 1 | 1/1/2016 | 1/13/2016 | 1269051 | 0 | 1048 | 2 | USD |
| 2 | 366001 | 2 | 1/1/2016 | 1/13/2016 | 1269051 | 0 | 2007 | 1 | USD |
| 3 | 366002 | 1 | 1/1/2016 | 1/12/2016 | 266019 | 0 | 1106 | 7 | CAD |
| 4 | 366002 | 2 | 1/1/2016 | 1/12/2016 | 266019 | 0 | 373 | 1 | CAD |



Missing values:
Delivery Date    49719
dtype: int64


STORES DATASET
Shape: (67, 5)

Data Types:
StoreKey           int64
Country           object
State             object
Square Meters    float64
Open Date         object
dtype: object

First 5 rows:


|   | StoreKey | Country | State | Square Meters | Open Date |
|---|---|---|---|---|---|
| 0 | 1 | Australia | Australian Capital Territory | 595.0 | 1/1/2008 |
| 1 | 2 | Australia | Northern Territory | 665.0 | 1/12/2008 |
| 2 | 3 | Australia | South Australia | 2000.0 | 1/7/2012 |
| 3 | 4 | Australia | Tasmania | 2000.0 | 1/1/2010 |
| 4 | 5 | Australia | Victoria | 2000.0 | 12/9/2015 |



Missing values:
Square Meters    1
dtype: int64


EXCHANGE RATES DATASET
Shape: (11215, 3)

Data Types:
Date         object
Currency     object
Exchange    float64
dtype: object

First 5 rows:


|   | Date | Currency | Exchange |
|---|---|---|---|
| 0 | 1/1/2015 | USD | 1.0 |
| 1 | 1/1/2015 | CAD | 1.1583 |
| 2 | 1/1/2015 | AUD | 1.2214 |
| 3 | 1/1/2015 | EUR | 0.8237 |
| 4 | 1/1/2015 | GBP | 0.6415 |



Missing values:
No missing values
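
The overview prints raw missing-value counts per table, which makes it hard to compare scale: 49,719 missing delivery dates out of 62,884 sales rows is a very different situation from 10 missing state codes out of 15,266 customers. One possible way to put the counts side by side as percentages is sketched below. `missing_summary` is a hypothetical helper and the two small frames are illustrative stand-ins for the real tables.

```python
import numpy as np
import pandas as pd


def missing_summary(frames):
    """Build one row per column that has missing values, across several tables.

    Hypothetical helper for illustration; `frames` maps a table name to a DataFrame.
    """
    rows = []
    for table_name, df in frames.items():
        counts = df.isnull().sum()
        for column, n_missing in counts[counts > 0].items():
            rows.append(
                {
                    "table": table_name,
                    "column": column,
                    "missing": int(n_missing),
                    "pct_missing": round(float(n_missing) / len(df) * 100, 1),
                }
            )
    return pd.DataFrame(rows, columns=["table", "column", "missing", "pct_missing"])


# Illustrative frames only; the real call would pass the loaded tables
demo_frames = {
    "sales": pd.DataFrame(
        {
            "Order Number": [1, 2, 3, 4],
            "Delivery Date": [np.nan, "1/13/2016", np.nan, np.nan],
        }
    ),
    "stores": pd.DataFrame({"StoreKey": [1, 2], "Square Meters": [595.0, np.nan]}),
}
summary = missing_summary(demo_frames)
print(summary)
```

With the notebook's frames, the call would look like `missing_summary({"customers": customers, "sales": sales, "stores": stores})`.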



## Data Quality Assessment

In [5]:
# Check for duplicates
print("DUPLICATE RECORDS CHECK")
print("=" * 50)
print(f"Customers duplicates: {customers.duplicated().sum()}")
print(f"Products duplicates: {products.duplicated().sum()}")
print(f"Sales duplicates: {sales.duplicated().sum()}")
print(f"Stores duplicates: {stores.duplicated().sum()}")
print(f"Exchange rates duplicates: {exchange_rates.duplicated().sum()}")

DUPLICATE RECORDS CHECK
Customers duplicates: 0
Products duplicates: 0
Sales duplicates: 0
Stores duplicates: 0
Exchange rates duplicates: 0
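
`DataFrame.duplicated()` with no arguments only flags rows that match in every column, so a result of 0 does not by itself show that the business key is unique. The sketch below contrasts the full-row check with a key-based check. It assumes that `Order Number` plus `Line Item` is meant to identify a sales row, which the data dictionary suggests but this notebook does not verify, and it uses a small illustrative frame.

```python
import pandas as pd

# Illustrative rows: the last row reuses the key (366002, 1) with a different product
demo_sales = pd.DataFrame(
    {
        "Order Number": [366001, 366001, 366002, 366002],
        "Line Item": [1, 2, 1, 1],
        "ProductKey": [1048, 2007, 1106, 373],
    }
)

# Full-row check: no two rows are identical in every column
full_row_dupes = int(demo_sales.duplicated().sum())

# Key-based check: the assumed composite key repeats once
key_dupes = int(demo_sales.duplicated(subset=["Order Number", "Line Item"]).sum())

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Duplicate (Order Number, Line Item) keys: {key_dupes}")
```

Running the key-based line against the real `sales` frame would show whether the composite key holds there.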


In [6]:
# Check unique values in key columns
print("UNIQUE VALUES IN KEY COLUMNS")
print("=" * 50)
print(f"Unique customers: {customers['CustomerKey'].nunique()}")
print(f"Unique products: {products['ProductKey'].nunique()}")
print(f"Unique stores: {stores['StoreKey'].nunique()}")
print(f"Unique orders: {sales['Order Number'].nunique()}")
print(f"Total sales records: {len(sales)}")

UNIQUE VALUES IN KEY COLUMNS
Unique customers: 15266
Unique products: 2517
Unique stores: 67
Unique orders: 26326
Total sales records: 62884
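
Unique key counts confirm that each lookup table has one row per key, but they do not show whether every key used in `sales` actually exists in those tables. A minimal sketch of such a referential check follows. `orphan_keys` is a hypothetical helper, and the frames are illustrative.

```python
import pandas as pd


def orphan_keys(child, parent, key):
    """Return the sorted key values that appear in `child` but not in `parent`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    unmatched = child.loc[~child[key].isin(parent[key]), key]
    return sorted(unmatched.unique().tolist())


# Illustrative frames: ProductKey 99 is sold but never defined
demo_products = pd.DataFrame({"ProductKey": [1, 2, 3]})
demo_sales = pd.DataFrame({"ProductKey": [1, 3, 3, 99]})

orphans = orphan_keys(demo_sales, demo_products, "ProductKey")
print(f"Keys in sales with no matching product: {orphans}")
```

On the loaded data, the same function could be called with `CustomerKey`, `ProductKey`, and `StoreKey` in turn. The `StoreKey` case may be worth a look, since the sales sample contains `StoreKey` 0 while the first store rows start at 1.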


## Explore Categorical Variables

In [7]:
# Product categories and brands
print("PRODUCT ANALYSIS")
print("=" * 50)
print(f"Product categories: {products['Category'].unique()}")
print(f"\nNumber of products per category:")
print(products['Category'].value_counts())

print(f"\nTop 10 brands by product count:")
print(products['Brand'].value_counts().head(10))

PRODUCT ANALYSIS
Product categories: ['Audio' 'TV and Video' 'Computers' 'Cameras and camcorders' 'Cell phones'
 'Music, Movies and Audio Books' 'Games and Toys' 'Home Appliances']

Number of products per category:
Home Appliances                  661
Computers                        606
Cameras and camcorders           372
Cell phones                      285
TV and Video                     222
Games and Toys                   166
Audio                            115
Music, Movies and Audio Books     90
Name: Category, dtype: int64

Top 10 brands by product count:
Contoso                 710
Fabrikam                267
Litware                 264
Proseware               244
Adventure Works         192
Southridge Video        192
Wide World Importers    173
The Phone Company       152
Tailspin Toys           144
A. Datum                132
Name: Brand, dtype: int64


In [8]:
# Customer demographics
print("CUSTOMER DEMOGRAPHICS")
print("=" * 50)
print(f"Gender distribution:")
print(customers['Gender'].value_counts())

print(f"\nCountries: {customers['Country'].unique()}")
print(f"\nCustomers per country:")
print(customers['Country'].value_counts())

CUSTOMER DEMOGRAPHICS
Gender distribution:
Male      7748
Female    7518
Name: Gender, dtype: int64

Countries: ['Australia' 'Canada' 'Germany' 'France' 'Italy' 'Netherlands'
 'United Kingdom' 'United States']

Customers per country:
United States     6828
United Kingdom    1944
Canada            1553
Germany           1473
Australia         1420
Netherlands        733
France             670
Italy              645
Name: Country, dtype: int64


In [9]:
# Store locations
print("STORE LOCATIONS")
print("=" * 50)
print(f"Countries with stores: {stores['Country'].unique()}")
print(f"\nStores per country:")
print(stores['Country'].value_counts())

STORE LOCATIONS
Countries with stores: ['Australia' 'Canada' 'France' 'Germany' 'Italy' 'Netherlands'
 'United Kingdom' 'United States' 'Online']

Stores per country:
United States     24
Germany            9
France             7
United Kingdom     7
Australia          6
Canada             5
Netherlands        5
Italy              3
Online             1
Name: Country, dtype: int64
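
The store table lists nine `Country` values while the customer table lists eight. A quick set comparison makes the difference explicit. The values below are copied from the two outputs above rather than read from the files.

```python
# Country values copied from the customer and store outputs above
customer_countries = {
    "Australia", "Canada", "Germany", "France", "Italy",
    "Netherlands", "United Kingdom", "United States",
}
store_countries = {
    "Australia", "Canada", "France", "Germany", "Italy",
    "Netherlands", "United Kingdom", "United States", "Online",
}

# Values that appear on one side only
only_in_stores = store_countries - customer_countries
only_in_customers = customer_countries - store_countries

print(f"Only in stores: {only_in_stores}")
print(f"Only in customers: {only_in_customers}")
```

The only extra value is `Online`, which is a sales channel rather than a country, so any count of store countries includes it. With the loaded frames, the sets could be built from `stores['Country'].unique()` and `customers['Country'].unique()` instead.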


## Date Range Analysis

In [10]:
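# Convert date columns on a copy so the raw frame stays untouched.
# The raw values are month/day/year strings (e.g. 1/13/2016), so state the format explicitly.
sales_temp = sales.copy()
sales_temp['Order Date'] = pd.to_datetime(sales_temp['Order Date'], format='%m/%d/%Y')
sales_temp['Delivery Date'] = pd.to_datetime(sales_temp['Delivery Date'], format='%m/%d/%Y')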

print("DATE RANGE ANALYSIS")
print("=" * 50)
print(f"Sales date range: {sales_temp['Order Date'].min()} to {sales_temp['Order Date'].max()}")
print(f"Total time period: {(sales_temp['Order Date'].max() - sales_temp['Order Date'].min()).days} days")

# Missing delivery dates
missing_delivery = sales_temp['Delivery Date'].isnull().sum()
print(f"\nMissing delivery dates: {missing_delivery} ({missing_delivery/len(sales_temp)*100:.1f}%)")

DATE RANGE ANALYSIS
Sales date range: 2016-01-01 00:00:00 to 2021-02-20 00:00:00
Total time period: 1877 days

Missing delivery dates: 49719 (79.1%)
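
With 79.1% of delivery dates missing, two follow-up questions are how long deliveries take when a date is present and whether the gaps cluster in particular stores. The sketch below illustrates both on three rows taken from the sales sample shown earlier. The grouping by `StoreKey` is only a suggestion for where to look, not a finding about the full dataset.

```python
import numpy as np
import pandas as pd

# Three rows from the sales sample shown earlier
demo_sales = pd.DataFrame(
    {
        "Order Date": ["1/1/2016", "1/1/2016", "1/1/2016"],
        "Delivery Date": [np.nan, "1/13/2016", "1/12/2016"],
        "StoreKey": [10, 0, 0],
    }
)

# Parse month/day/year strings; missing values become NaT
for column in ["Order Date", "Delivery Date"]:
    demo_sales[column] = pd.to_datetime(demo_sales[column], format="%m/%d/%Y")

# Days between order and delivery (NaN where no delivery date exists)
demo_sales["Delivery Days"] = (
    demo_sales["Delivery Date"] - demo_sales["Order Date"]
).dt.days

# Share of rows without a delivery date, per store
missing_share_by_store = (
    demo_sales["Delivery Date"].isnull().groupby(demo_sales["StoreKey"]).mean()
)

print(demo_sales[["StoreKey", "Delivery Days"]])
print(missing_share_by_store)
```

Applied to `sales_temp`, the per-store share would show whether delivery dates are missing at random or tied to specific `StoreKey` values.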


## Price and Currency Analysis

In [11]:
# Examine price formats in products
print("PRICE FORMAT ANALYSIS")
print("=" * 50)
print("Sample unit costs:")
print(products['Unit Cost USD'].head(10))
print("\nSample unit prices:")
print(products['Unit Price USD'].head(10))

# Currency codes in sales
print(f"\nCurrency codes in sales:")
print(sales['Currency Code'].value_counts())

PRICE FORMAT ANALYSIS
Sample unit costs:
0     $6.62 
1     $6.62 
2     $7.40 
3    $11.00 
4    $11.00 
5    $11.00 
6    $11.00 
7    $30.58 
8    $30.58 
9    $30.58 
Name: Unit Cost USD, dtype: object

Sample unit prices:
0    $12.99 
1    $12.99 
2    $14.52 
3    $21.57 
4    $21.57 
5    $21.57 
6    $21.57 
7    $59.99 
8    $59.99 
9    $59.99 
Name: Unit Price USD, dtype: object

Currency codes in sales:
USD    33767
EUR    12621
GBP     8140
CAD     5415
AUD     2941
Name: Currency Code, dtype: int64
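
Both price columns are stored as text with a leading `$` and a trailing space, so they cannot be used in arithmetic yet. A minimal cleaning sketch is shown below. `clean_price` is a hypothetical helper, the first two sample values come from the output above, and the third value with a thousands separator is made up to illustrate comma handling.

```python
import pandas as pd


def clean_price(series):
    """Strip dollar signs, commas, and whitespace, then convert to float.

    Hypothetical helper for illustration; values that still fail to parse become NaN.
    """
    stripped = series.astype(str).str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")


# First two values appear in the output above; the third is a made-up example
demo_prices = pd.Series(["$6.62 ", "$11.00 ", "$1,060.22 "])
cleaned = clean_price(demo_prices)

print(cleaned)
print(cleaned.dtype)
```

Using `errors="coerce"` turns unexpected formats into NaN instead of raising, so counting `cleaned.isnull().sum()` afterwards is a quick way to catch values the pattern did not anticipate.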


## Summary of Findings

Based on the initial exploration, here are the key observations:

In [12]:
print("KEY FINDINGS FROM DATA EXPLORATION")
print("=" * 60)
print(f"""
📊 DATASET OVERVIEW:
   • Total customers: {customers.shape[0]:,}
   • Total products: {products.shape[0]:,}
   • Total sales records: {sales.shape[0]:,}
   • Total stores: {stores.shape[0]:,}
   • Exchange rate records: {exchange_rates.shape[0]:,}

🛒 SALES DATA:
   • Date range: {sales_temp['Order Date'].min().strftime('%Y-%m-%d')} to {sales_temp['Order Date'].max().strftime('%Y-%m-%d')}
   • Unique orders: {sales['Order Number'].nunique():,}
   • Missing delivery dates: {missing_delivery:,} records

🎯 PRODUCT DATA:
   • Categories: {products['Category'].nunique()}
   • Brands: {products['Brand'].nunique()}
   • Price format: Contains $ signs and commas (needs cleaning)

👥 CUSTOMER DATA:
   • Countries: {customers['Country'].nunique()}
   • Gender distribution: Balanced
   • Encoding: Latin-1 required for proper reading

🏪 STORE DATA:
   • Store locations across {stores['Country'].nunique()} countries
   • Store sizes vary significantly

💱 CURRENCY DATA:
   • Multiple currencies: {sales['Currency Code'].nunique()} types
   • Exchange rates available for conversion

🔧 DATA QUALITY ISSUES TO ADDRESS:
   • Price columns need cleaning (remove $, commas, spaces)
   • Date columns need proper datetime conversion
   • Some missing delivery dates
   • Customer data requires latin-1 encoding
""")

KEY FINDINGS FROM DATA EXPLORATION

📊 DATASET OVERVIEW:
   • Total customers: 15,266
   • Total products: 2,517
   • Total sales records: 62,884
   • Total stores: 67
   • Exchange rate records: 11,215

🛒 SALES DATA:
   • Date range: 2016-01-01 to 2021-02-20
   • Unique orders: 26,326
   • Missing delivery dates: 49,719 records

🎯 PRODUCT DATA:
   • Categories: 8
   • Brands: 11
   • Price format: Contains $ signs and commas (needs cleaning)

👥 CUSTOMER DATA:
   • Countries: 8
   • Gender distribution: Balanced
   • Encoding: Latin-1 required for proper reading

🏪 STORE DATA:
   • Store locations across 9 countries
   • Store sizes vary significantly

💱 CURRENCY DATA:
   • Multiple currencies: 5 types
   • Exchange rates available for conversion

🔧 DATA QUALITY ISSUES TO ADDRESS:
   • Price columns need cleaning (remove $, commas, spaces)
   • Date columns need proper datetime conversion
   • Some missing delivery dates
   • Customer data requires latin-1 encoding
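
The summary notes that exchange rates are available for conversion. A possible shape for that step is sketched below on illustrative rows. It assumes that `Exchange` is the number of local currency units per 1 USD (consistent with USD being 1.0 in the sample, but not confirmed by the data dictionary excerpt) and that `Unit Price USD` has already been converted to a number. The sale rows and their dates are made up for the example.

```python
import pandas as pd

# Rates copied from the exchange-rate sample; sale rows are illustrative
demo_rates = pd.DataFrame(
    {
        "Date": ["1/1/2015", "1/1/2015"],
        "Currency": ["USD", "CAD"],
        "Exchange": [1.0, 1.1583],
    }
)
demo_sales = pd.DataFrame(
    {
        "Order Date": ["1/1/2015", "1/1/2015"],
        "Currency Code": ["CAD", "USD"],
        "Quantity": [2, 1],
        "Unit Price USD": [12.99, 21.57],
    }
)

# Parse dates so both sides of the join share a datetime key
demo_rates["Date"] = pd.to_datetime(demo_rates["Date"], format="%m/%d/%Y")
demo_sales["Order Date"] = pd.to_datetime(demo_sales["Order Date"], format="%m/%d/%Y")

# Left join keeps every sale; validate guards against duplicate rate rows
merged = demo_sales.merge(
    demo_rates,
    how="left",
    left_on=["Order Date", "Currency Code"],
    right_on=["Date", "Currency"],
    validate="many_to_one",
)

# Assumption: Exchange = local currency units per 1 USD
merged["Revenue USD"] = merged["Quantity"] * merged["Unit Price USD"]
merged["Revenue Local"] = merged["Revenue USD"] * merged["Exchange"]

print(merged[["Currency Code", "Revenue USD", "Revenue Local"]])
```

After a join like this on the real data, checking `merged['Exchange'].isnull().sum()` would reveal any order dates that have no matching rate.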

