# 🛒 Day 1 - Superstore Sales Data Analysis
## ✅ Step 1: Load and Explore the Dataset
Let's begin by importing the necessary libraries and loading the dataset to understand its structure.

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('../data/SuperstoreSales.csv', encoding='ISO-8859-1')

## 🔍 Step 2: View Basic Information
`df.info()` reports the number of rows and columns, each column's non-null count and data type, and the frame's memory usage.

In [2]:
# Dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8399 entries, 0 to 8398
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Row ID                8399 non-null   int64  
 1   Order ID              8399 non-null   int64  
 2   Order Date            8399 non-null   object 
 3   Order Priority        8399 non-null   object 
 4   Order Quantity        8399 non-null   int64  
 5   Sales                 8399 non-null   float64
 6   Discount              8399 non-null   float64
 7   Ship Mode             8399 non-null   object 
 8   Profit                8399 non-null   float64
 9   Unit Price            8399 non-null   float64
 10  Shipping Cost         8399 non-null   float64
 11  Customer Name         8399 non-null   object 
 12  Province              8399 non-null   object 
 13  Region                8399 non-null   object 
 14  Customer Segment      8399 non-null   object 
 15  Product Category     

## 🧾 Step 3: Preview the First Few Rows
This gives us an idea of how the data is structured and what kind of values are present.

In [3]:
# Preview the data
df.head()

|   | Row ID | Order ID | Order Date | Order Priority | Order Quantity | Sales | Discount | Ship Mode | Profit | Unit Price | ... | Customer Name | Province | Region | Customer Segment | Product Category | Product Sub-Category | Product Name | Product Container | Product Base Margin | Ship Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 10/13/2010 | Low | 6 | 261.54 | 0.04 | Regular Air | -213.25 | 38.94 | ... | Muhammed MacIntyre | Nunavut | Nunavut | Small Business | Office Supplies | Storage & Organization | Eldon Base for stackable storage shelf, platinum | Large Box | 0.8 | 10/20/2010 |
| 1 | 49 | 293 | 10/1/2012 | High | 49 | 10123.02 | 0.07 | Delivery Truck | 457.81 | 208.16 | ... | Barry French | Nunavut | Nunavut | Consumer | Office Supplies | Appliances | 1.7 Cubic Foot Compact "Cube" Office Refrigera... | Jumbo Drum | 0.58 | 10/2/2012 |
| 2 | 50 | 293 | 10/1/2012 | High | 27 | 244.57 | 0.01 | Regular Air | 46.71 | 8.69 | ... | Barry French | Nunavut | Nunavut | Consumer | Office Supplies | Binders and Binder Accessories | Cardinal Slant-D® Ring Binder, Heavy Gauge Vinyl | Small Box | 0.39 | 10/3/2012 |
| 3 | 80 | 483 | 7/10/2011 | High | 30 | 4965.7595 | 0.08 | Regular Air | 1198.97 | 195.99 | ... | Clay Rozendal | Nunavut | Nunavut | Corporate | Technology | Telephones and Communication | R380 | Small Box | 0.58 | 7/12/2011 |
| 4 | 85 | 515 | 8/28/2010 | Not Specified | 19 | 394.27 | 0.08 | Regular Air | 30.94 | 21.78 | ... | Carlos Soltero | Nunavut | Nunavut | Consumer | Office Supplies | Appliances | Holmes HEPA Air Purifier | Medium Box | 0.5 | 8/30/2010 |
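
The `...` column in the preview appears because pandas hides the middle columns once a frame is wider than its `display.max_columns` setting. A minimal sketch, using a made-up 30-column frame rather than the real dataset, of lifting that limit temporarily with `pd.option_context`:

```python
import pandas as pd

# Toy frame for illustration only: 30 columns named col_0 .. col_29.
toy = pd.DataFrame({f"col_{i}": [i] for i in range(30)})

# Inside the context manager every column is rendered; the setting is
# restored automatically when the block exits.
with pd.option_context("display.max_columns", None):
    full_repr = repr(toy.head())

print(full_repr)
```

In a notebook, calling `display(df.head())` inside the same `with` block should have the equivalent effect on the rendered table.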


## 🧹 Step 4: Check for Duplicates
Let's see if there are any duplicate rows in the dataset that need to be cleaned.

In [4]:
# Check for duplicates
df.duplicated().sum()

np.int64(0)
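
If `Row ID` is a per-row identifier, a full-row check like the one above cannot flag two rows that are otherwise identical. A minimal sketch, assuming `Row ID` is unique and using a hypothetical helper `count_duplicates_ignoring` on made-up data, that repeats the check with the identifier left out:

```python
import pandas as pd

def count_duplicates_ignoring(frame: pd.DataFrame, id_cols: list[str]) -> int:
    """Count rows that repeat an earlier row once the identifier columns are ignored."""
    compare_cols = [c for c in frame.columns if c not in id_cols]
    return int(frame.duplicated(subset=compare_cols).sum())

# Toy data for illustration only (not taken from the Superstore file).
toy = pd.DataFrame({
    "Row ID": [1, 2, 3],
    "Order ID": [3, 3, 293],
    "Sales": [261.54, 261.54, 244.57],
})

full_row_dupes = int(toy.duplicated().sum())          # identifier makes every row distinct
dupes_without_id = count_duplicates_ignoring(toy, ["Row ID"])
print(full_row_dupes, dupes_without_id)
```

Whether rows that match on everything except the identifier are true duplicates is a judgment call that depends on how the data was collected.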

## 📏 Step 5: Check for Missing Values
We need to identify missing data so we can decide how to handle it in the cleaning phase.

In [5]:
# Check for missing values
df.isnull().sum()

Row ID                   0
Order ID                 0
Order Date               0
Order Priority           0
Order Quantity           0
Sales                    0
Discount                 0
Ship Mode                0
Profit                   0
Unit Price               0
Shipping Cost            0
Customer Name            0
Province                 0
Region                   0
Customer Segment         0
Product Category         0
Product Sub-Category     0
Product Name             0
Product Container        0
Product Base Margin     63
Ship Date                0
dtype: int64
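
Only `Product Base Margin` has missing values, and the count is easier to judge as a share of the rows. A minimal sketch, using a hypothetical helper `missing_summary` on made-up data, that reports counts and percentages for columns with at least one missing value:

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages for columns that have any missing values."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with gaps, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy data for illustration only.
toy = pd.DataFrame({
    "Sales": [261.54, 244.57, 394.27, 10123.02],
    "Product Base Margin": [0.8, np.nan, 0.5, 0.58],
})

summary = missing_summary(toy)
print(summary)
```

On the real data, 63 missing values out of 8399 rows is roughly 0.75%, so either dropping or imputing those rows would affect only a small part of the dataset.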

## ✍️ Step 6: Rename Columns
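We'll strip stray whitespace, replace spaces with underscores, and lowercase every column name for easier access in future steps. Hyphens are left untouched, so `Product Sub-Category` becomes `product_sub-category`.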

In [6]:
# Normalize column names: trim whitespace, spaces -> underscores, lowercase (optional step)
df.columns = df.columns.str.strip().str.replace(' ', '_').str.lower()
df.columns

Index(['row_id', 'order_id', 'order_date', 'order_priority', 'order_quantity',
       'sales', 'discount', 'ship_mode', 'profit', 'unit_price',
       'shipping_cost', 'customer_name', 'province', 'region',
       'customer_segment', 'product_category', 'product_sub-category',
       'product_name', 'product_container', 'product_base_margin',
       'ship_date'],
      dtype='object')
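
The hyphen that remains in `product_sub-category` rules out attribute-style access such as `df.product_sub_category`. A minimal sketch, with a hypothetical helper `to_snake_case`, of a stricter normalization that turns any run of non-alphanumeric characters into a single underscore:

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Lowercase a column name and replace runs of non-alphanumeric characters with one underscore."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()

# Toy frame with a few column names in the style of the original file.
toy = pd.DataFrame(columns=["Row ID", "Product Sub-Category", " Ship Date "])
toy.columns = [to_snake_case(c) for c in toy.columns]
print(list(toy.columns))
```

This is an alternative to the cell above, not a description of what it does; the outputs in the rest of this notebook keep the hyphenated name.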

## 📊 Step 7: Understand Unique Values in Each Column
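We'll count the distinct values in each text (`object`) column, which helps in EDA and feature engineering. Note that `order_date` and `ship_date` appear in this list because they are still stored as strings, not because they are categorical.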

In [7]:
# Check unique values for object columns
for col in df.select_dtypes(include='object').columns:
    print(f"{col}: {df[col].nunique()} unique values")

order_date: 1418 unique values
order_priority: 5 unique values
ship_mode: 3 unique values
customer_name: 795 unique values
province: 13 unique values
region: 8 unique values
customer_segment: 4 unique values
product_category: 3 unique values
product_sub-category: 17 unique values
product_name: 1263 unique values
product_container: 7 unique values
ship_date: 1450 unique values
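
Since `order_date` and `ship_date` are stored as strings, date arithmetic is not yet possible on them. A minimal sketch of converting them, assuming every value follows the month/day/year pattern seen in the preview (for example `10/13/2010`); the `shipping_days` column is a hypothetical addition for illustration:

```python
import pandas as pd

# Three date pairs copied from the preview rows shown earlier.
toy = pd.DataFrame({
    "order_date": ["10/13/2010", "10/1/2012", "7/10/2011"],
    "ship_date": ["10/20/2010", "10/2/2012", "7/12/2011"],
})

# errors="coerce" turns any value that does not match the format into NaT
# instead of raising, so unexpected entries can be inspected afterwards.
for col in ["order_date", "ship_date"]:
    toy[col] = pd.to_datetime(toy[col], format="%m/%d/%Y", errors="coerce")

# Days between ordering and shipping.
toy["shipping_days"] = (toy["ship_date"] - toy["order_date"]).dt.days
print(toy.dtypes)
print(toy["shipping_days"].tolist())
```

After a conversion like this, `toy["order_date"].isna().sum()` gives a quick count of values that failed to parse.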


## 📦 Step 8: Save Cleaned Dataset for Future Use
We'll save this version so we don't need to repeat the basic cleanup in upcoming days.

In [8]:
# Save cleaned dataset
df.to_csv('../data/superstore_cleaned.csv', index=False)
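
CSV stores no type information, so it can be worth confirming that a saved file reloads with the same shape and column names. A minimal sketch that writes a made-up frame to a temporary file rather than the path above:

```python
import os
import tempfile
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "row_id": [1, 49],
    "order_date": ["10/13/2010", "10/1/2012"],
    "sales": [261.54, 10123.02],
})

# Write to a throwaway directory and read the file straight back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_cleaned.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

Any column converted to a datetime type before saving would come back as strings on reload unless `parse_dates` is passed to `pd.read_csv`.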

## ✅ Summary of Day 1
- Imported and loaded the Superstore dataset
- Inspected structure, null values, and duplicates: no duplicate rows, and 63 missing values, all in `product_base_margin`
- Renamed columns for ease of use
- Saved a cleaned version for future analysis

🚀 Ready to move on to Day 2: Exploratory Data Analysis!