# Retail Sales EDA
This notebook contains:
- Dataset Generation (20,000 rows)
- Data Cleaning
- Exploratory Data Analysis
- Business Insights


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Step 1: Generate Large Dataset

In [None]:
np.random.seed(42)
n = 20000

# 'h' = hourly frequency; the uppercase 'H' alias is deprecated in pandas 2.2+
order_dates = pd.date_range(start='2022-01-01', periods=n, freq='h')
regions = ['North', 'South', 'East', 'West']
categories = ['Electronics', 'Clothing', 'Furniture', 'Grocery']
payment_modes = ['Cash', 'Card', 'UPI', 'Net Banking']

data = pd.DataFrame({
    'Order_ID': np.arange(100000, 100000+n),
    'Order_Date': np.random.choice(order_dates, n),
    'Region': np.random.choice(regions, n),
    'Category': np.random.choice(categories, n),
    # Stored as float so NaN can be assigned later without a dtype upcast
    'Customer_Age': np.random.randint(18, 70, n).astype(float),
    'Quantity': np.random.randint(1, 15, n),
    'Discount': np.round(np.random.uniform(0, 0.4, n), 2),
    'Payment_Mode': np.random.choice(payment_modes, n)
})

category_base_price = {
    'Electronics': 15000,
    'Clothing': 3000,
    'Furniture': 10000,
    'Grocery': 1500
}

data['Base_Price'] = data['Category'].map(category_base_price)

data['Sales'] = (
    data['Base_Price'] * data['Quantity'] * (1 - data['Discount'])
    * np.random.uniform(0.8, 1.2, n)
).round(2)

data['Profit'] = (
    data['Sales'] * np.random.uniform(0.05, 0.25, n)
    - (data['Discount'] * data['Sales'] * 0.5)
).round(2)

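# Flip the sign of Profit on 1,000 distinct rows
# (replace=False prevents the same row from being drawn twice)
loss_index = np.random.choice(data.index, 1000, replace=False)
data.loc[loss_index, 'Profit'] *= -1

# Inject missing values into 500 ages and 400 profits
data.loc[np.random.choice(data.index, 500, replace=False), 'Customer_Age'] = np.nan
data.loc[np.random.choice(data.index, 400, replace=False), 'Profit'] = np.nan

# Inflate Sales on 50 distinct rows to create outliers
outlier_index = np.random.choice(data.index, 50, replace=False)
data.loc[outlier_index, 'Sales'] *= 5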

data.drop(columns=['Base_Price'], inplace=True)

data.head()


# EDA TASK SHEET

## Scenario
You are a Data Analyst working for a retail company.
Perform complete Exploratory Data Analysis and generate business insights.

---

## PART 1: Basic Exploration

1. Load the dataset (the `data` DataFrame generated in Step 1).
2. Display first 5 rows.
3. Check shape of dataset.
4. Check data types.
5. Identify missing values.
6. Generate statistical summary.
7. Check duplicate records.
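
The Part 1 checks can be sketched as follows. This is a minimal illustration, not a reference solution: `sample` is a small hypothetical stand-in for the notebook's `data` frame, and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame; the last two rows
# are deliberate duplicates so the duplicate check has something to find.
sample = pd.DataFrame({
    "Order_ID": [100000, 100001, 100002, 100002],
    "Region": ["North", "South", "East", "East"],
    "Customer_Age": [25.0, np.nan, 41.0, 41.0],
    "Sales": [1200.0, 450.5, 9800.0, 9800.0],
    "Profit": [150.0, np.nan, 300.0, 300.0],
})

print(sample.head())        # first 5 rows (here: all 4)
print(sample.shape)         # (rows, columns)
print(sample.dtypes)        # data type of each column

missing_counts = sample.isna().sum()           # missing values per column
summary = sample.describe()                    # count, mean, std, quartiles
n_duplicates = int(sample.duplicated().sum())  # fully identical rows

print(missing_counts)
print(summary)
print("duplicate rows:", n_duplicates)
```

Replacing `sample` with `data` should run the same checks on the full 20,000-row frame.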

---

## PART 2: Data Cleaning

8. Handle missing values in:
   - Customer_Age
   - Profit
9. Convert Order_Date to datetime format.
10. Extract Year, Month, Day from Order_Date.
11. Create a new column:
    - Profit_Margin = Profit / Sales
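
One possible shape for the Part 2 cleaning steps is sketched below. Median imputation is an assumption chosen for simplicity (dropping rows or imputing per category are equally valid choices), and `sample` and `clean` are hypothetical names for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`, with one missing age and one missing profit.
sample = pd.DataFrame({
    "Order_Date": ["2022-01-15 10:00", "2022-02-20 14:00",
                   "2022-03-05 09:00", "2022-03-09 18:00"],
    "Customer_Age": [25.0, np.nan, 41.0, 60.0],
    "Sales": [1000.0, 500.0, 2000.0, 4000.0],
    "Profit": [100.0, -50.0, np.nan, 400.0],
})

clean = sample.copy()

# Assumption: fill gaps with the column median, which is robust to outliers.
clean["Customer_Age"] = clean["Customer_Age"].fillna(clean["Customer_Age"].median())
clean["Profit"] = clean["Profit"].fillna(clean["Profit"].median())

# Parse the date column, then split it into calendar parts.
clean["Order_Date"] = pd.to_datetime(clean["Order_Date"])
clean["Year"] = clean["Order_Date"].dt.year
clean["Month"] = clean["Order_Date"].dt.month
clean["Day"] = clean["Order_Date"].dt.day

# Profit as a fraction of revenue for each order.
clean["Profit_Margin"] = clean["Profit"] / clean["Sales"]

print(clean)
```

In the generated frame, `Order_Date` should already have a datetime dtype, in which case `pd.to_datetime` leaves it unchanged.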

---

## PART 3: Univariate Analysis

12. Plot distribution of:
    - Sales
    - Profit
13. Identify skewness of Sales.
14. Countplot for:
    - Region
    - Category
    - Payment_Mode
15. Detect outliers using boxplot.
15. Detect outliers using boxplot.
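
A rough sketch of the Part 3 steps follows. The helper names `iqr_outliers` and `plot_univariate` are hypothetical, `sample` is a stand-in for `data`, and the 1.5 × IQR rule is assumed because it is the whisker rule a default boxplot draws. The plotting imports sit inside the helper so the numeric checks run even where no plotting libraries are installed.

```python
import pandas as pd

# Hypothetical stand-in for `data`; one very large order acts as an outlier.
sample = pd.DataFrame({
    "Sales": [100.0, 120.0, 130.0, 150.0, 160.0, 180.0, 200.0, 5000.0],
    "Profit": [10.0, 12.0, -5.0, 20.0, 15.0, 18.0, 25.0, 400.0],
    "Region": ["North", "South", "East", "West",
               "North", "South", "East", "North"],
})

# Positive skew means a long right tail (a few very large orders).
sales_skew = sample["Sales"].skew()


def iqr_outliers(series, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]


sales_outliers = iqr_outliers(sample["Sales"])
region_counts = sample["Region"].value_counts()  # numbers behind a countplot


def plot_univariate(df):
    """Draw a histogram, a countplot and a boxplot side by side."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    sns.histplot(data=df, x="Sales", kde=True, ax=axes[0])
    sns.countplot(data=df, x="Region", ax=axes[1])
    sns.boxplot(data=df, x="Sales", ax=axes[2])
    fig.tight_layout()
    return fig


print("skewness:", sales_skew)
print(sales_outliers)
print(region_counts)
```

Calling `plot_univariate(data)` in the notebook would draw the three panels; swapping the column names covers Profit, Category and Payment_Mode.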

---

## PART 4: Bivariate Analysis

16. Sales vs Profit (scatterplot).
17. Category vs Sales (boxplot).
18. Region vs Profit (barplot).
19. Discount vs Profit relationship.
20. Correlation heatmap.
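
The Part 4 relationships could be explored along these lines. This is a sketch under assumptions: `sample` is a hypothetical stand-in for `data`, `plot_bivariate` is an illustrative helper name, and Pearson correlation (the pandas default) is used as the measure of association.

```python
import pandas as pd

# Hypothetical stand-in for `data`.
sample = pd.DataFrame({
    "Category": ["Grocery", "Grocery", "Clothing",
                 "Clothing", "Electronics", "Electronics"],
    "Region": ["North", "South", "North", "South", "North", "South"],
    "Sales": [1000.0, 1500.0, 3000.0, 2500.0, 15000.0, 12000.0],
    "Discount": [0.00, 0.10, 0.20, 0.30, 0.05, 0.40],
    "Profit": [200.0, 180.0, 300.0, -100.0, 2500.0, -600.0],
})

numeric_cols = ["Sales", "Discount", "Profit"]

# Pairwise Pearson correlations; this matrix is what the heatmap displays.
corr = sample[numeric_cols].corr()
discount_profit_corr = corr.loc["Discount", "Profit"]

# Mean profit per region, the quantity a default barplot shows.
profit_by_region = sample.groupby("Region")["Profit"].mean()


def plot_bivariate(df):
    """Scatter, boxplot, barplot and correlation heatmap in a 2x2 grid."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, axes = plt.subplots(2, 2, figsize=(12, 9))
    sns.scatterplot(data=df, x="Sales", y="Profit", ax=axes[0, 0])
    sns.boxplot(data=df, x="Category", y="Sales", ax=axes[0, 1])
    sns.barplot(data=df, x="Region", y="Profit", ax=axes[1, 0])
    sns.heatmap(df[numeric_cols].corr(), annot=True,
                cmap="coolwarm", ax=axes[1, 1])
    fig.tight_layout()
    return fig


print(corr)
print(profit_by_region)
```

A negative `discount_profit_corr` would suggest that larger discounts go with lower profit, though correlation alone does not establish the cause.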

---

## PART 5: Time Series Analysis

21. Monthly Sales Trend.
22. Region-wise Monthly Sales.
23. Identify best performing month.
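
For Part 5, one way to aggregate by month is to group on a monthly period derived from `Order_Date`. The sketch below assumes `Order_Date` is already a datetime column and uses total Sales as the measure of a "best" month; `sample` is a hypothetical stand-in for `data`.

```python
import pandas as pd

# Hypothetical stand-in for `data`.
sample = pd.DataFrame({
    "Order_Date": pd.to_datetime([
        "2022-01-05", "2022-01-20", "2022-02-10", "2022-02-11", "2022-03-01",
    ]),
    "Region": ["North", "South", "North", "North", "South"],
    "Sales": [1000.0, 2000.0, 4000.0, 1500.0, 500.0],
})

# Collapse each timestamp to its calendar month (e.g. 2022-02).
month = sample["Order_Date"].dt.to_period("M")

# Total sales per month.
monthly_sales = sample.groupby(month)["Sales"].sum()

# Months as rows, regions as columns; empty combinations become 0.
region_monthly = (
    sample.groupby([month, "Region"])["Sales"].sum().unstack(fill_value=0)
)

# Month with the highest total sales.
best_month = monthly_sales.idxmax()

print(monthly_sales)
print(region_monthly)
print("best month:", best_month)
```

In the notebook, `monthly_sales.plot(marker="o")` and `region_monthly.plot()` would draw the two trend charts.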

---

## PART 6: Business Insight Questions

24. Which category generates highest revenue?
25. Which region is underperforming?
26. Does higher discount reduce profit?
27. Which payment mode is most used?
28. Which age group contributes most to sales?
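
Each Part 6 question maps onto a short aggregation. The sketch below makes several assumptions that a real analysis may want to change: "underperforming" is read as lowest total profit, discounts are grouped into 10-point bands, and ages into decade-style bands. `sample` is a hypothetical stand-in for `data`.

```python
import pandas as pd

# Hypothetical stand-in for `data`.
sample = pd.DataFrame({
    "Category": ["Electronics", "Clothing", "Grocery",
                 "Electronics", "Furniture", "Grocery"],
    "Region": ["North", "South", "East", "West", "North", "South"],
    "Payment_Mode": ["UPI", "Card", "UPI", "Cash", "UPI", "Card"],
    "Customer_Age": [22, 35, 47, 58, 29, 64],
    "Discount": [0.05, 0.30, 0.10, 0.35, 0.00, 0.20],
    "Sales": [15000.0, 3000.0, 1500.0, 14000.0, 10000.0, 1200.0],
    "Profit": [2500.0, -150.0, 200.0, -900.0, 1800.0, 60.0],
})

# Q24: category with the highest total revenue.
top_category = sample.groupby("Category")["Sales"].sum().idxmax()

# Q25: region with the lowest total profit (assumed meaning of "underperforming").
weakest_region = sample.groupby("Region")["Profit"].sum().idxmin()

# Q26: mean profit per discount band; a downward trend suggests discounts hurt.
discount_band = pd.cut(
    sample["Discount"], bins=[0, 0.1, 0.2, 0.3, 0.4],
    labels=["0-10%", "10-20%", "20-30%", "30-40%"], include_lowest=True,
)
profit_by_discount = sample.groupby(discount_band, observed=True)["Profit"].mean()

# Q27: most frequently used payment mode.
most_used_payment = sample["Payment_Mode"].value_counts().idxmax()

# Q28: age band with the highest total sales (left-closed bins: 18-29, 30-39, ...).
age_band = pd.cut(
    sample["Customer_Age"], bins=[18, 30, 40, 50, 60, 70], right=False,
    labels=["18-29", "30-39", "40-49", "50-59", "60-69"],
)
top_age_band = sample.groupby(age_band, observed=True)["Sales"].sum().idxmax()

print(top_category, weakest_region, most_used_payment, top_age_band)
print(profit_by_discount)
```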

---

## Bonus Challenge

29. Segment customers into:
    - Young (18–30)
    - Adult (31–50)
    - Senior (51+)

30. Perform EDA on customer segments and compare profitability.
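
The segmentation can be sketched with `pd.cut`, assuming the Adult band ends at 50 and the Senior band starts just above it. `sample`, `segment` and `summary` are hypothetical names, and profit margin (total profit divided by total sales) is one assumed way to compare profitability.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`, with ages on each band edge.
sample = pd.DataFrame({
    "Customer_Age": [18, 30, 31, 50, 51, 69],
    "Sales": [1000.0, 3000.0, 2000.0, 2000.0, 5000.0, 5000.0],
    "Profit": [100.0, 300.0, -200.0, 400.0, 250.0, 1250.0],
})

# Right-closed bins: (17, 30] Young, (30, 50] Adult, (50, inf) Senior.
segment = pd.cut(
    sample["Customer_Age"],
    bins=[17, 30, 50, np.inf],
    labels=["Young", "Adult", "Senior"],
)

# Totals and order counts per segment.
summary = (
    sample.assign(Segment=segment)
    .groupby("Segment", observed=True)
    .agg(
        Total_Sales=("Sales", "sum"),
        Total_Profit=("Profit", "sum"),
        Orders=("Sales", "size"),
    )
)
summary["Profit_Margin"] = summary["Total_Profit"] / summary["Total_Sales"]

# Segment with the highest profit margin.
most_profitable = summary["Profit_Margin"].idxmax()

print(summary)
print("most profitable segment:", most_profitable)
```

Rows with a missing `Customer_Age` get no segment and are left out of the grouped summary, so it is worth deciding whether to impute ages first.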



