# 📊 Superstore Sales Dataset — Exploratory Data Analysis (EDA)

## 📌 Objective

The goal of this notebook is to perform a structured **Exploratory Data Analysis (EDA)** on the cleaned Superstore dataset.

We aim to uncover:
- 📦 Product category insights  
- 🌍 Regional and state-level performance  
- 💰 Profitability trends  
- 📈 Sales over time  
- 🔁 Customer segments & shipping trends  

---

## 🧭 Step 1: Import Libraries and Load Cleaned Data


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Setup for clean visuals
sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is an alias
plt.rcParams["figure.figsize"] = (10, 6)

# Load the cleaned data
df = pd.read_csv("/mnt/shared/Data_Analytics_Projects/Superstore_Sales_Data/Superstore.csv", encoding='ISO-8859-1')

# Preview data
df.head()

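Before the analysis, it can help to confirm that the columns used in the later steps are present and free of missing values. The following is a minimal sketch, not part of the original notebook: `sample` is a hypothetical stand-in for the loaded `df`, its values are made up for illustration, and `check_columns` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical stand-in for the loaded Superstore frame (made-up values);
# in the notebook, pass `df` from the cell above instead.
sample = pd.DataFrame({
    "Order Date": ["1/3/2016", "11/8/2016", "6/12/2017"],
    "Category": ["Furniture", "Office Supplies", "Technology"],
    "Sales": [250.0, 15.0, 900.0],
    "Profit": [40.0, 7.0, -90.0],
    "Discount": [0.0, 0.0, 0.2],
})

# Column names assumed from the cells that follow in this notebook.
required = ["Order Date", "Category", "Sales", "Profit", "Discount"]


def check_columns(frame, columns):
    """Return the required columns that are missing from the frame."""
    return [c for c in columns if c not in frame.columns]


missing = check_columns(sample, required)
null_counts = sample[required].isna().sum()

print("Missing columns:", missing)
print(null_counts)
```

If `missing` is non-empty or any null count is above zero, the later group-by and plotting cells may fail or silently drop rows, so it is worth checking first.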

## 🧐 Step 2: Basic Data Understanding

We look at general trends like:
- Most common categories/sub-categories
- Sales & profit distributions


In [None]:
# Top 5 most common sub-categories
df['Sub-Category'].value_counts().head()


In [None]:
# Distribution of Sales and Profit
sns.histplot(df['Sales'], bins=50, kde=True)
plt.title("Sales Distribution")
plt.xlabel("Sales")
plt.ylabel("Frequency")
plt.show()

sns.histplot(df['Profit'], bins=50, kde=True, color='green')
plt.title("Profit Distribution")
plt.xlabel("Profit")
plt.ylabel("Frequency")
plt.show()

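Histograms of monetary columns are often dominated by a few large values, so a numeric summary can complement the plots. This is a small sketch on made-up numbers (the `sales` series below is hypothetical, not taken from the dataset); in the notebook, `df['Sales']` or `df['Profit']` would be used instead.

```python
import pandas as pd

# Made-up values with one large outlier, standing in for df['Sales'].
sales = pd.Series([10.0, 12.0, 15.0, 20.0, 500.0], name="Sales")

stats = sales.describe()   # count, mean, std, min, quartiles, max
skew = sales.skew()        # positive skew indicates a long right tail

print(stats)
print("Skewness:", skew)
```

A mean well above the median (the `50%` row) together with positive skewness suggests a right-skewed distribution, in which case a log-scaled x-axis may make the histogram easier to read.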

## 🗃️ Step 3: Category-wise Performance

We analyze which product categories & sub-categories drive revenue and profit.


In [None]:
# Category-wise Sales and Profit
category_summary = df.groupby('Category')[['Sales', 'Profit']].sum().sort_values(by='Sales', ascending=False)
category_summary.plot(kind='bar', stacked=False, title="Sales and Profit by Category", colormap='viridis')
plt.ylabel("Amount")
plt.xticks(rotation=0)
plt.show()

# Sub-Category Analysis
subcat = df.groupby('Sub-Category')[['Sales', 'Profit']].sum().sort_values(by='Sales', ascending=True)
subcat.plot(kind='barh', stacked=False, title="Sales and Profit by Sub-Category", colormap='plasma')
plt.xlabel("Amount")
plt.show()

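Total sales and total profit are on very different scales, so a profit margin (profit divided by sales) can make categories easier to compare. The sketch below assumes the same `Category`, `Sales`, and `Profit` columns; the `toy` frame and the `Profit Margin %` column name are illustrative and not part of the original notebook.

```python
import pandas as pd

# Made-up rows standing in for the Superstore data.
toy = pd.DataFrame({
    "Category": ["Furniture", "Furniture", "Technology", "Office Supplies"],
    "Sales": [200.0, 300.0, 1000.0, 100.0],
    "Profit": [20.0, -10.0, 150.0, 25.0],
})

# Aggregate first, then divide, so the margin is weighted by sales volume.
summary = toy.groupby("Category")[["Sales", "Profit"]].sum()
summary["Profit Margin %"] = (summary["Profit"] / summary["Sales"] * 100).round(2)
summary = summary.sort_values("Profit Margin %", ascending=False)

print(summary)
```

Dividing the summed profit by the summed sales gives a volume-weighted margin; averaging per-row margins instead would give small orders the same weight as large ones.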

## 🧭 Step 4: Regional and State-wise Insights

We analyze how different regions and states are performing in terms of Sales and Profit.


In [None]:
# Region-Level Summary
region_summary = df.groupby('Region')[['Sales', 'Profit']].sum().sort_values(by='Sales', ascending=False)
region_summary.plot(kind='bar', title="Sales and Profit by Region", colormap='coolwarm')
plt.ylabel("Amount")
plt.xticks(rotation=0)
plt.show()

# Top 10 states by Profit
top_states = df.groupby('State')[['Sales', 'Profit']].sum().sort_values(by='Profit', ascending=False).head(10)
top_states.plot(kind='bar', title="Top 10 States by Profit", colormap='spring')
plt.ylabel("Amount")
plt.show()

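The top-10 view hides states that lose money overall. As a complementary sketch, the snippet below lists states whose total profit is negative. The state labels and numbers are placeholders, not results from the dataset.

```python
import pandas as pd

# Placeholder states and made-up values standing in for the Superstore data.
toy = pd.DataFrame({
    "State": ["State A", "State A", "State B", "State C", "State C"],
    "Profit": [50.0, 30.0, -40.0, 10.0, -25.0],
})

# Total profit per state, sorted so the largest losses come first.
state_profit = toy.groupby("State")["Profit"].sum().sort_values()
loss_states = state_profit[state_profit < 0]

print(loss_states)
```

On the real data, `df` would replace `toy`, and `loss_states.plot(kind='barh')` would give a chart in the same style as the cells above.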

## 📅 Step 5: Time Series Analysis — Sales Over Time

We evaluate how sales vary by year and by month.


In [None]:
# Ensure Order Date is in datetime format
df['Order Date'] = pd.to_datetime(df['Order Date'])

# Create new columns for year and month
df['Order Year'] = df['Order Date'].dt.year
df['Order Month'] = df['Order Date'].dt.month_name()

# Year-wise Sales
df.groupby('Order Year')['Sales'].sum().plot(marker='o')
plt.title("Total Sales per Year")
plt.ylabel("Sales")
plt.xlabel("Year")
plt.grid(True)
plt.show()

# Month-wise Sales trend (aggregated over all years)
df.groupby('Order Month')['Sales'].sum().reindex([
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December'
]).plot(kind='bar')
plt.title("Sales by Month (Aggregated)")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.show()

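Aggregating by month name merges the same month across all years, which can hide year-to-year changes. One alternative, sketched below, groups by a year-month period so that each month of each year stays separate. The `toy` frame is made up, and the `%m/%d/%Y` date format is an assumption about how `Order Date` is stored.

```python
import pandas as pd

# Made-up orders; the date format is assumed to be month/day/year.
toy = pd.DataFrame({
    "Order Date": ["1/5/2016", "1/20/2016", "2/3/2016", "1/10/2017"],
    "Sales": [100.0, 50.0, 75.0, 200.0],
})
toy["Order Date"] = pd.to_datetime(toy["Order Date"], format="%m/%d/%Y")

# Group by year-month period so January 2016 and January 2017 stay distinct.
monthly = toy.groupby(toy["Order Date"].dt.to_period("M"))["Sales"].sum()

print(monthly)
```

The resulting series is indexed chronologically, so `monthly.plot(marker='o')` would draw a continuous monthly trend line.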

## 🧍‍♂️ Step 6: Segment & Shipping Mode Analysis

We compare **Customer Segments** and **Ship Modes** by total sales and profit.


In [None]:
# Segment Performance
segment = df.groupby('Segment')[['Sales', 'Profit']].sum()
segment.plot(kind='bar', title="Sales and Profit by Customer Segment", colormap='Set2')
plt.xticks(rotation=0)
plt.ylabel("Amount")
plt.show()

# Ship Mode
ship_mode = df.groupby('Ship Mode')[['Sales', 'Profit']].sum()
ship_mode.plot(kind='bar', title="Sales and Profit by Ship Mode", colormap='Accent')
plt.xticks(rotation=0)
plt.ylabel("Amount")
plt.show()


## 🔗 Step 7: Relationship Between Sales, Profit, and Discount

We explore whether high discounts are associated with negative profits, using a correlation heatmap and a scatter plot.


In [None]:
# Correlation heatmap
sns.heatmap(df[['Sales', 'Profit', 'Discount']].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation between Sales, Profit, and Discount")
plt.show()

# Discount vs Profit
sns.scatterplot(data=df, x='Discount', y='Profit', hue='Category')
plt.title("Discount vs Profit by Category")
plt.show()

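A correlation coefficient and a scatter plot show the overall direction of the relationship, but a per-discount summary makes it easier to see where profit turns negative. The sketch below groups by discount level and reports the mean profit, the number of orders, and the share of loss-making orders. The `toy` values and the `share_loss` column name are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for the Superstore data.
toy = pd.DataFrame({
    "Discount": [0.0, 0.0, 0.2, 0.2, 0.5, 0.8],
    "Profit": [40.0, 20.0, 10.0, -2.0, -30.0, -60.0],
})

# Mean profit and order count at each discount level.
by_discount = toy.groupby("Discount")["Profit"].agg(["mean", "count"])

# Fraction of orders at each discount level that lost money.
by_discount["share_loss"] = (
    toy.assign(loss=toy["Profit"] < 0).groupby("Discount")["loss"].mean()
)

print(by_discount)
```

If the real data has many distinct discount values, `pd.cut` could be used to bin them before grouping.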

---
## ✅ EDA Completed

We’ve successfully explored the cleaned Superstore dataset from various angles:

- 🗂️ Categories and sub-categories
- 🏙️ Region and state performance
- 📆 Time trends (Year/Month)
- 👥 Segments and shipping methods
- 🧮 Impact of discount on profit

The data is now ready for:
- 📈 Dashboarding in Power BI / Excel
- 📦 Business insights presentation
- 📄 Sharing this notebook as part of your project/portfolio

---
