
In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
ayeshaseherr_buisness_sales_path = kagglehub.dataset_download('ayeshaseherr/buisness-sales')

print('Data source import complete.')


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Business Sales EDA

## Introduction

This notebook presents an exploratory data analysis (EDA) of the **Business Sales** dataset to understand sales patterns, pricing behavior, and product distribution.

The analysis focuses on:
- Exploring the distribution of sales volume and product prices.
- Analyzing categorical features such as origin, season, section, and product category.
- Visualizing relationships between price and sales volume.
- Identifying trends, patterns, and potential anomalies in the data.

Multiple visualization techniques including histograms, box plots, bar charts, and correlation analysis are used to extract meaningful business insights.

The goal of this analysis is to support data-driven decision making and provide a clear understanding of factors that influence sales performance.


## Step 1: Import Required Libraries

We begin by importing essential Python libraries for data manipulation, visualization, and analysis:

- **pandas**: For loading, cleaning, and manipulating structured data in dataframe format.
- **matplotlib**: For creating static, high-quality visualizations and plots.
- **plotly**: For interactive, web-based visualizations that enable hover details and zooming.
- **seaborn**: For statistical data visualization built on top of matplotlib with enhanced aesthetics.
- **numpy**: For numerical computations and array operations.
- **%matplotlib inline**: Jupyter magic command to display plots directly in notebook cells.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.io as pio

import warnings
warnings.filterwarnings('ignore')

## Load Data

In [None]:
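# The file is semicolon-delimited. This path exists only on Kaggle; elsewhere the file
# should be under ayeshaseherr_buisness_sales_path (set in the first cell).
df = pd.read_csv("/kaggle/input/buisness-sales/Business_sales_EDA.csv", delimiter=';', encoding='utf-8')
df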

## 1. Exploratory Data Analysis (EDA)
Dataset information and missing values

In [None]:
df.head()

In [None]:
df.tail(10)

In [None]:
df.describe()

In [None]:
df.describe(include = "O")

In [None]:
df.count()

In [None]:
# Count fully duplicated rows instead of listing one boolean per row
df.duplicated().sum()

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.isna().sum()
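
The separate checks above can also be gathered into a single table. The following is a minimal sketch on a small made-up frame; `sample` and `quality_summary` are hypothetical names introduced for illustration, and in the notebook the function would be called on `df` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
sample = pd.DataFrame({
    "Sales Volume": [120, 85, np.nan, 85],
    "price": [19.99, 49.90, 29.99, 49.90],
    "origin": ["Spain", "Turkey", None, "Turkey"],
})


def quality_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and missing percentage for each column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "missing_pct": (frame.isna().mean() * 100).round(1),
    })


summary = quality_summary(sample)
# duplicated() marks rows that repeat an earlier row in every column.
n_duplicates = int(sample.duplicated().sum())

print(summary)
print(f"Fully duplicated rows: {n_duplicates}")
```

Reporting the missing share as a percentage alongside the raw count makes it easier to judge whether a column is worth imputing or dropping.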

### Sales Volume Distribution

This histogram visualizes how sales volume values are distributed across the dataset.  
It helps identify common sales ranges, data spread, and potential outliers.


In [None]:
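plt.figure(figsize=(10,5))
plt.hist(df['Sales Volume'], bins=40)
plt.title('Sales Volume Distribution')
# Label the axes so the plot can be read on its own
plt.xlabel('Sales Volume')
plt.ylabel('Frequency')
plt.show()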
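
The histogram suggests where outliers might sit, but it does not list them. One common way to flag them numerically is the 1.5 × IQR rule, sketched below on made-up numbers. The `volumes` series and the `iqr_outliers` helper are hypothetical; the rule itself is an assumption about what counts as an outlier, not something this dataset prescribes.

```python
import pandas as pd

# Hypothetical sales volumes; in the notebook this would be df['Sales Volume'].
volumes = pd.Series([100, 110, 120, 130, 140, 150, 160, 900], name="Sales Volume")


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


outliers = iqr_outliers(volumes)
print(outliers)
```

A larger `k` (for example 3.0) flags only the most extreme values, which may suit heavily skewed sales data better.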


### Frequency of Product Origin

This bar chart shows how often each product origin appears in the dataset.  
Rotating the x-axis labels improves readability and prevents overlap.


In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x="origin", data=df)

plt.title("Frequency of Product Origin")
plt.xlabel("Origin")
plt.ylabel("Count")

plt.xticks(rotation=45)

plt.tight_layout()
plt.show()
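
The counts behind a chart like this can also be tabulated, which helps when several bars have similar heights. This is a small sketch on invented origins; `origins` and `freq_table` are hypothetical names, and in the notebook the series would be `df['origin']`.

```python
import pandas as pd

# Hypothetical origins; the real values come from df['origin'].
origins = pd.Series(
    ["Spain", "Turkey", "Spain", "Morocco", "Spain", "Turkey"], name="origin"
)

# Absolute counts and percentage shares, both sorted from most to least frequent.
counts = origins.value_counts()
shares = origins.value_counts(normalize=True).mul(100).round(1)

freq_table = pd.DataFrame({"count": counts, "share_pct": shares})
print(freq_table)
```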


### Frequency of Season

This count plot shows how frequently each season appears in the dataset.  
It shows whether products are spread evenly across seasons or concentrated in a few.


In [None]:
sns.countplot(x="season", data=df)
plt.title("Frequency of Season")
plt.xlabel("Season")
plt.ylabel("Count")
plt.show()

### Frequency of Section

This horizontal bar chart shows how often each section appears in the dataset.  
Horizontal bars keep the section labels readable and make the counts easy to compare.


In [None]:
sns.countplot(y="section", data=df)
plt.title("Frequency of Section")
plt.ylabel("Section")
plt.xlabel("Count")
plt.show()

## Exploratory Data Analysis (EDA) Visualizations

### 1. Distribution of Sales Volume and Price
The histograms with KDE curves show the distribution of **Sales Volume** and **Price**.
They help identify:
- The shape of the distribution (normal, skewed, etc.).
- Common value ranges.
- Potential outliers.

### 2. Sales Volume by Product Category
The box plot compares **Sales Volume** across different product categories.
It highlights:
- The median sales per category.
- Variability and spread.
- Outliers within each category.
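
The quantities a box plot draws can be computed directly with a grouped aggregation. The sketch below uses a made-up frame; `sample` and `category_stats` are hypothetical names, and the category labels are invented rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical data; in the notebook the grouping would run on df.
sample = pd.DataFrame({
    "Product Category": ["Clothing", "Clothing", "Clothing", "Shoes", "Shoes", "Shoes"],
    "Sales Volume": [100, 200, 300, 50, 60, 700],
})

# Median, spread, and group size per category, highest median first.
category_stats = (
    sample.groupby("Product Category")["Sales Volume"]
    .agg(["count", "median", "min", "max"])
    .sort_values("median", ascending=False)
)
print(category_stats)
```

In this toy example the hypothetical "Shoes" group has the largest single value but the lower median, which is the kind of contrast a box plot is meant to expose.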

### 3. Pair Plot of Numerical Features
The pair plot visualizes relationships between all numerical features.
It is useful for:
- Detecting correlations.
- Observing linear or non-linear trends.
- Identifying feature interactions.

*(Displayed only if the dataset has at least four numerical columns.)*

### 4. Correlation Heatmap
The heatmap shows correlation coefficients between numerical variables.
It helps:
- Identify strongly related features.
- Detect multicollinearity.
- Understand feature dependencies.

*(Displayed only if the dataset has at least four numerical columns.)*


In [None]:
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['Sales Volume'], kde=True, bins=30, color='skyblue')
plt.title('Distribution of Sales Volume')

plt.subplot(1, 2, 2)
sns.histplot(df['price'], kde=True, bins=30, color='salmon')
plt.title('Distribution of Price')
plt.tight_layout()
plt.show()

plt.figure(figsize=(10, 6))
sns.boxplot(x='Product Category', y='Sales Volume', data=df)
plt.title('Sales Volume by Product Category')
plt.xticks(rotation=45)
plt.show()

numeric_df = df.select_dtypes(include=[np.number])
if numeric_df.shape[1] >= 4:
    sns.pairplot(numeric_df)
    plt.show()
else:
    print('Not enough numeric columns for a pair plot.')

if numeric_df.shape[1] >= 4:
    plt.figure(figsize=(8, 6))
    corr = numeric_df.corr()
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
    plt.title('Correlation Heatmap')
    plt.show()
else:
    print('Not enough numeric columns for a correlation heatmap.')
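
A heatmap shows every coefficient twice and becomes hard to scan as the number of columns grows. One possible complement is to rank the unique column pairs by absolute correlation, as sketched here. The frame is made up, and the `stock` column is a hypothetical third feature added only so there is more than one pair to rank.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data; in the notebook this would be numeric_df.
sample = pd.DataFrame({
    "Sales Volume": [10, 20, 30, 40, 50],
    "price": [50, 40, 30, 20, 10],
    "stock": [3, 1, 4, 1, 5],  # hypothetical column, not from the dataset
})

corr = sample.corr()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal of 1.0 values is dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)
print(pairs)
```

Sorting by absolute value keeps strong negative relationships near the top, where they would otherwise be pushed to the bottom.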

### Origin vs Price

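This bar chart plots price against product origin, colored by origin.  
Because every row is drawn as its own stacked segment, each bar's height is the total of all prices for that origin, so it reflects how many products an origin has as well as how expensive they are.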


In [None]:
pio.renderers.default = "iframe"

fig = px.bar(
    df,
    x='origin',
    y='price',
    title='Origin and Price',
    color='origin'
)

iplot(fig)
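
To compare typical prices rather than totals, the rows can be aggregated per origin before plotting. This is a minimal sketch on invented values; `sample` and `price_by_origin` are hypothetical names, and the choice of the mean as the summary statistic is an assumption (the median may be preferable if prices are skewed).

```python
import pandas as pd

# Hypothetical data; in the notebook the grouping would run on df.
sample = pd.DataFrame({
    "origin": ["Spain", "Spain", "Spain", "Turkey"],
    "price": [10.0, 20.0, 30.0, 50.0],
})

# Total, mean, and number of products per origin.
price_by_origin = sample.groupby("origin")["price"].agg(
    total="sum", mean="mean", count="count"
)
print(price_by_origin)
```

Here the hypothetical "Spain" group has the higher total only because it contains more rows, while its mean price is lower. The aggregated frame could then be passed to `px.bar` with `x=price_by_origin.index` and `y='mean'`.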

### Pair Plot for Sales Volume and Price

This pair plot visualizes the relationship between **Sales Volume** and **Price**.  
The diagonal KDE plots show the distribution of each variable, while the scatter plot shows their interaction.


In [None]:
num_cols = ["Sales Volume", "price"]

sns.pairplot(df[num_cols], diag_kind="kde", height=3)
plt.show()
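
The scatter panels give a visual impression of the relationship; a coefficient can put a number on it. The sketch below computes Pearson's correlation and a rank-based (Spearman) correlation on made-up values. `sample` is a hypothetical frame, and the Spearman value is obtained as the Pearson correlation of the ranks, which is its definition.

```python
import pandas as pd

# Hypothetical data; in the notebook this would be df[["Sales Volume", "price"]].
sample = pd.DataFrame({
    "Sales Volume": [500, 400, 300, 200, 100],
    "price": [10.0, 12.0, 15.0, 30.0, 90.0],
})

# Pearson measures linear association.
pearson = sample["Sales Volume"].corr(sample["price"])

# Spearman measures monotonic association: Pearson applied to the ranks.
spearman = sample["Sales Volume"].rank().corr(sample["price"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

When the two coefficients differ noticeably, as they do for these invented numbers, the relationship is likely monotonic but not linear.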