# Exploratory Data Analysis (EDA)
## Sales Forecasting Dataset

### Objective
- Understand sales patterns and trends
- Identify seasonality and anomalies
- Discover key factors influencing sales
- Generate business-driven insights to guide modeling

This EDA focuses on time-series behavior and sales distribution to support forecasting tasks.


In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

plt.style.use("seaborn-v0_8")


In [None]:
df = pd.read_csv("../data/raw/sales.csv")

df.columns = df.columns.str.lower().str.replace(" ", "_")
df.head()


In [None]:
df.info()


In [None]:
df.describe()


- Dataset contains historical daily sales data
- Sales values show high variance → potential seasonality & outliers
- The `order_date` column needs to be converted to datetime before any time-based resampling
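
The date conversion noted above could look like the following minimal sketch. It runs on a small hypothetical frame rather than `sales.csv`; the column names `order_date` and `sales` match the later cells, while the sample values and the `errors="coerce"` choice are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real data; values are illustrative only.
sample = pd.DataFrame(
    {
        "order_date": ["2023-01-05", "2023-01-20", "2023-02-11", "not a date"],
        "sales": [120.0, 80.5, 310.0, 45.0],
    }
)

# errors="coerce" turns unparseable strings into NaT instead of raising,
# so bad rows can be counted and inspected rather than stopping the notebook.
sample["order_date"] = pd.to_datetime(sample["order_date"], errors="coerce")

n_bad_dates = int(sample["order_date"].isna().sum())

# Keep only rows with a valid date and put them in chronological order.
clean = sample.dropna(subset=["order_date"]).sort_values("order_date")
```

Whether to drop or repair rows with invalid dates depends on how many there are; counting them first (`n_bad_dates`) makes that decision explicit.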


In [None]:
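# resample() needs a DatetimeIndex, so parse the date column first
df["order_date"] = pd.to_datetime(df["order_date"])

monthly_sales = (
    df
    .set_index("order_date")
    .resample("M")["sales"]  # month-end bins; pandas >= 2.2 prefers "ME"
    .sum()
    .reset_index()
)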

px.line(
    monthly_sales,
    x="order_date",
    y="sales",
    title="Monthly Sales Trend"
)


In [None]:
sns.histplot(df['sales'], bins=50, kde=True)
plt.title("Sales Distribution")
plt.show()


- Distribution is right-skewed
- Presence of high-value outliers
- Log transformation may help some models
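
The observations above can be checked numerically. The sketch below is one possible way to do it, using a hypothetical series of sales values (invented for illustration, not taken from the dataset): it compares skewness before and after a `log1p` transform and flags upper outliers with the 1.5 × IQR rule that box plots commonly use.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sales values; illustrative only.
sales = pd.Series([20.0, 35.0, 40.0, 55.0, 60.0, 75.0, 90.0, 2500.0], name="sales")

# log1p computes log(1 + x), which stays defined when sales are zero.
log_sales = np.log1p(sales)

# Sample skewness; values well above 0 indicate a long right tail.
skew_before = sales.skew()
skew_after = log_sales.skew()

# Flag values above Q3 + 1.5 * IQR as upper outliers.
q1, q3 = sales.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = sales[sales > upper_fence]

# expm1 inverts log1p, so predictions made on the log scale can be mapped back.
restored = np.expm1(log_sales)
```

If a model is trained on the transformed target, its predictions need the inverse transform (`np.expm1`) before they are compared with actual sales.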


In [None]:
sns.boxplot(x=df['sales'])
plt.title("Sales Outliers")
plt.show()


In [None]:
numeric_df = df.select_dtypes(include=np.number)

plt.figure(figsize=(10,6))
sns.heatmap(
    numeric_df.corr(),
    annot=True,
    cmap="coolwarm"
)
plt.title("Correlation Heatmap")
plt.show()


## Key Insights from EDA

1. Sales data shows strong temporal dependency
2. Clear trend and seasonality patterns exist
3. Lagged sales values are likely strong predictors
4. Rolling statistics may help capture short-term trends
5. Feature engineering is critical for accurate forecasting
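
As a starting point for the lag and rolling features mentioned above, a minimal sketch might look like this. The daily series is hypothetical, and the feature names (`sales_lag_1`, `sales_lag_7`, `sales_roll_mean_3`) and window sizes are illustrative assumptions rather than choices made in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales series; values are illustrative only.
dates = pd.date_range("2023-01-01", periods=10, freq="D")
daily = pd.DataFrame(
    {"order_date": dates, "sales": np.arange(10, 110, 10, dtype=float)}
)
daily = daily.set_index("order_date").sort_index()

# Lag features: the value observed 1 and 7 days earlier.
daily["sales_lag_1"] = daily["sales"].shift(1)
daily["sales_lag_7"] = daily["sales"].shift(7)

# Shift before rolling so each window only sees past values,
# which avoids leaking the current day's target into its own feature.
daily["sales_roll_mean_3"] = daily["sales"].shift(1).rolling(window=3).mean()

# Rows without a full history contain NaN and are dropped before modeling.
features = daily.dropna()
```

Suitable lags and window lengths depend on the seasonality seen in the monthly trend plot, so the values here should be treated as placeholders.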
