# ***Grocery Store Sales Data Analysis***
# 🛒 Introduction
In today's fast-paced retail industry, analyzing sales data is crucial for making informed business decisions, improving inventory management, and understanding customer behavior. This project focuses on the exploratory data analysis (EDA) of a grocery sales dataset using Python.

The primary goal is to gain insights into sales performance across product categories and features by identifying patterns, trends, and anomalies in the data. Using the Python libraries Pandas, NumPy, Matplotlib, and Seaborn, we examine:

* The composition and distribution of product categories

* Sales statistics such as quantities and total revenue

* Detection of missing or inconsistent data

* Summary metrics to identify top-performing segments

By visualizing this data, we make it easier for stakeholders to interpret trends and take data-driven actions. The analysis not only supports decision-making for marketing and supply chain optimization but also serves as a practical example of using data science in the retail domain.

Whether you're a data analyst, business strategist, or aspiring data scientist, this project demonstrates how raw data can be transformed into valuable business insights.



# Overview
* This project analyzes sales data from a chain of grocery stores in Maharashtra. By examining parameters such as item categories, sales volume, profit margins, and customer ratings, we aim to find insights that help optimize inventory management and sales strategies.

# Objective
* The goal is to identify trends in sales data that will inform better inventory decisions, highlight profitable items, and reveal customer preferences.

# 🧰 1. Importing Libraries
"A well-prepared data analysis starts with importing the right tools."

In this section, we import the Python libraries used in the analysis:
* NumPy and Pandas for numerical operations and data manipulation.
* Matplotlib and Seaborn for static data visualization.

We also suppress warnings and set a custom Seaborn color palette so that the charts share the same pastel colors.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings
filterwarnings("ignore")
custom_palette = ["skyblue", "lightgreen", "lightblue", "lightpink", "lavender", "lightsalmon", "lightsteelblue"]
sns.set_palette(custom_palette)

# 📂 2. Loading the Dataset
"We begin the analysis by loading the dataset into memory."
* Here, we load Grocery_sales_dataset.csv with pd.read_csv(); index_col=0 uses the first column of the file as the row index.
* This dataset represents sales records for various grocery items across categories and possibly regions or stores.
* Displaying df previews how the data is structured, which helps verify a successful load and gives a first look at the content.


In [None]:
df = pd.read_csv(r"C:\Users\91801\OneDrive\Desktop\AI\Projects\grocery\Grocery_sales_dataset.csv",index_col=0)
df
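
The absolute path above only resolves on the machine where the notebook was written. As a minimal sketch of the same read_csv call, the snippet below loads a small in-memory sample instead of the real file. The sample rows and the three column names are assumptions chosen for illustration, not the actual schema.

```python
import io

import pandas as pd

# Hypothetical sample standing in for Grocery_sales_dataset.csv;
# the rows and column names are assumptions, not the real schema.
sample_csv = io.StringIO(
    ",category,product_name,total_revenue\n"
    "0,Dairy,Milk,120.5\n"
    "1,Bakery,Bread,80.0\n"
    "2,Dairy,Cheese,200.0\n"
)

# index_col=0 tells pandas to use the first CSV column as the row index
# instead of adding a fresh 0..n-1 index next to it.
sample_df = pd.read_csv(sample_csv, index_col=0)

print(sample_df.shape)   # (rows, columns): a quick sanity check on the load
print(sample_df.head())  # first rows only
```

With the real file, only the first argument changes, for example to a path relative to the notebook.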

# 🧭 3. Exploring Dataset Structure
"Understanding the data layout helps plan the analysis."
This section examines the structure of the dataset:

* Index (df.index): the row labels.

* Columns (df.columns): the names of the features (e.g., category, price, quantity).

* Values (df.values): the underlying data as a raw NumPy array.

This step is crucial for ensuring familiarity with how data is accessed and manipulated during the analysis.

Display the first few rows of the DataFrame.

In [None]:
df.head()

# 🧪 4. Data Types and Missing Values
"Detecting data types and missing information prepares us for cleaning and transformation."

Before any statistical analysis, it is important to inspect:

* Data types (.dtypes) to understand what transformations may be necessary (e.g., converting strings to dates).

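* Missing values (.isnull().sum() and .info()) to identify potential gaps that could distort insights. Note that .isnull().count() would return the number of rows for every column, not the number of missing cells.

Whether to clean or impute missing data depends on the results of this step. The cells below also print the index, columns, and values described in the previous section.



In [None]:
df.dtypes

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.values

In [None]:
df.isnull().sum()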

In [None]:
df.info()
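
To make the difference between counting rows and counting missing cells concrete, here is a small sketch on a made-up DataFrame. The toy values and column choice are invented for illustration; only the pandas calls mirror the ones discussed in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing cell in each column.
toy = pd.DataFrame({
    "price": [10.0, np.nan, 7.5, 12.0],
    "category": ["Dairy", "Bakery", None, "Dairy"],
})

missing_per_column = toy.isnull().sum()     # number of missing cells per column
rows_per_column = toy.isnull().count()      # number of rows, missing or not
missing_share = toy.isnull().mean() * 100   # percentage of missing cells

print(missing_per_column)
print(rows_per_column)
print(missing_share)
```

The percentage view is often easier to act on than raw counts when columns differ a lot in how complete they are.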

# 📉 5. Descriptive Statistics
"Descriptive statistics give a numerical snapshot of the dataset."

Using .describe(), we generate summary statistics for numerical columns, including:

* Count, mean, and standard deviation
* Min/max values and the 25th, 50th (median), and 75th percentiles

This helps spot anomalies, outliers or skewed distributions early in the process and guides which features may need normalization or transformation.

In [None]:
df.describe()
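
One common way to turn the quartiles from .describe() into an outlier check is Tukey's 1.5 * IQR rule. The notebook itself does not apply this rule; the sketch below only illustrates the idea on invented revenue values, with the last value made deliberately extreme.

```python
import pandas as pd

# Hypothetical revenue values; the last one is deliberately extreme.
revenue = pd.Series([100, 110, 95, 105, 120, 98, 102, 900], name="total_revenue")

summary = revenue.describe()  # count, mean, std, min, 25%, 50%, 75%, max
q1, q3 = summary["25%"], summary["75%"]
iqr = q3 - q1

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = revenue[(revenue < lower) | (revenue > upper)]

print(summary)
print(outliers)
```

A mean far above the median, as in this sample, is also a quick hint that the distribution is right-skewed.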

# 🧾 6. Distribution of Products Across Major Categories
"Understanding what sells most is key to any retail analysis."

In this step, we count how often each product category, product, and day of the week appears in the data using .value_counts(). It reveals:
* The categories and products with the most sales records
* Inventory focus areas
* Patterns that may correlate with seasonal or promotional events

This insight helps prioritize deeper analysis in high-volume categories.



In [None]:
df['category'].value_counts()

In [None]:
df['product_name'].value_counts()

In [None]:
df['day_of_week'].value_counts()
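
Because .value_counts() counts rows, a category with many small transactions can rank above one with fewer but larger sales. The sketch below contrasts row counts with units sold on invented data. It reuses the column names category and number_of_items_sold from this notebook, but the values are assumptions.

```python
import pandas as pd

# Hypothetical records: Dairy appears most often, Bakery moves more units.
toy = pd.DataFrame({
    "category": ["Dairy", "Dairy", "Dairy", "Bakery"],
    "number_of_items_sold": [1, 2, 1, 10],
})

row_counts = toy["category"].value_counts()                 # records per category
row_shares = toy["category"].value_counts(normalize=True)   # same, as fractions
units_sold = toy.groupby("category")["number_of_items_sold"].sum()

print(row_counts)
print(row_shares)
print(units_sold)
```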

# 📊 7. Data Visualization
"Visualizations turn raw numbers into compelling stories."

This section is dedicated to visually exploring patterns, trends, and relationships in the grocery sales dataset. While numerical summaries give foundational insight, data visualization reveals the deeper structure and meaning hidden within the data.

Using Matplotlib and Seaborn, we create static charts to:
* Analyze sales volume across product categories
* Understand quantity and revenue distribution
* Explore correlations between features like item type, price, and frequency
* Identify seasonal or periodic sales trends (if temporal data exists)
* Spot anomalies or outliers in sales behavior

These visuals help stakeholders quickly grasp complex relationships and support smarter decision-making in areas like inventory management, marketing strategy, and customer demand forecasting.

🔹 Total Sales Volume per Product Line

* Show the counts of observations in category and product.

In [None]:
plt.figure(figsize=(15,3))
plt.subplot(1,2,1)
sns.countplot(x='category', data=df)
plt.subplot(1,2,2)
sns.countplot(x='product_name', data=df)
plt.show()

* Visualize average number of items sold for each category and product.


In [None]:
plt.figure(figsize=(18,3))
plt.subplot(1,2,1)
sns.barplot(x='category', y='number_of_items_sold', data=df)
plt.subplot(1,2,2)
sns.barplot(x='product_name', y='number_of_items_sold', data=df)
plt.show()

* Visualize the average number of items sold and the average revenue per record for each day of the week.

In [None]:
plt.figure(figsize=(18,3))
plt.subplot(1,2,1)
sns.barplot(x='day_of_week', y='number_of_items_sold', data=df)
plt.subplot(1,2,2)
sns.barplot(x='day_of_week', y='total_revenue', data=df)
plt.show()

* Compare the total revenue generated on holidays versus regular days using a bar chart.

In [None]:
df['is_holiday'] = df['holiday'].map({True: 'Holiday', False: 'Regular Day'})
revenue_by_holiday = df.groupby('is_holiday')['total_revenue'].sum()
plt.figure(figsize=(8, 6))
revenue_by_holiday.plot(kind='bar', color=['skyblue', 'lightgreen'])
plt.title('Total Revenue Comparison: Holidays vs. Regular Days')
plt.xlabel('Day Type')
plt.ylabel('Total Revenue')
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.show()
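
Totals favor whichever group contains more days, and regular days usually outnumber holidays. As a hedged complement to the chart above, the sketch below compares average revenue per day instead. It assumes holiday is boolean and sales_date identifies the calendar day, as the column names suggest; the values are invented.

```python
import pandas as pd

# Hypothetical records: one holiday with two sales, three regular days.
toy = pd.DataFrame({
    "sales_date": ["2023-01-01", "2023-01-01", "2023-01-02",
                   "2023-01-03", "2023-01-04"],
    "holiday": [True, True, False, False, False],
    "total_revenue": [300.0, 200.0, 150.0, 250.0, 200.0],
})
toy["is_holiday"] = toy["holiday"].map({True: "Holiday", False: "Regular Day"})

# Sum revenue within each calendar day first, then average those daily totals
# per day type, so the number of days in each group no longer matters.
daily = toy.groupby(["is_holiday", "sales_date"])["total_revenue"].sum()
avg_daily_revenue = daily.groupby(level="is_holiday").mean()
total_revenue = toy.groupby("is_holiday")["total_revenue"].sum()

print(total_revenue)
print(avg_daily_revenue)
```

In this toy data, regular days win on total revenue while the single holiday wins on revenue per day, which shows how the two views can disagree.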

* Visualize the gender distribution of buyers for all products using a pie chart.

In [None]:
gender_distribution = df['buyer_gender'].value_counts()
plt.figure(figsize=(8, 6))
plt.pie(gender_distribution, labels=gender_distribution.index, autopct='%1.1f%%', colors=['lightblue', 'lightpink'], startangle=140)
plt.title('Gender Distribution of Buyers')
plt.axis('equal') 
plt.show()

* Visualize the relationship between total revenue and number of items sold.
* Mark each category and product with a different color using the hue parameter.

In [None]:
plt.figure(figsize=(15, 5))
sns.scatterplot(x='number_of_items_sold', y='total_revenue',hue='category', data=df)
plt.show()

In [None]:
plt.figure(figsize=(15, 5))
sns.scatterplot(x='number_of_items_sold', y='total_revenue',hue='product_name', data=df,palette=['skyblue', 'lightgreen', 'lavender', 'lightsalmon','lightsteelblue'])
plt.show()

* Visualize the distribution of the number of items sold for each day of the week.

In [None]:
plt.figure(figsize=(15, 5))
sns.boxplot(x='day_of_week', y='number_of_items_sold', data=df)
plt.xticks(rotation=45)
plt.show()

* Visualize average sales by sales hours, months and days.

In [None]:
df['sales_time'].head()

In [None]:
df['hours'] = df['sales_time'].str.split(":").str[0].astype(int)  # separate the hour from HH:MM and store it as an integer so it sorts numerically
df['hours'].head()

In [None]:
plt.figure(figsize=(15, 5))
sns.lineplot(x='hours',y='total_revenue',data=df.sort_values('hours'),hue='buyer_gender')
plt.show()
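
Hours extracted from a time string stay text unless they are cast to a number, and text sorts "10" before "9". The sketch below shows the difference on invented values, assuming sales_time holds strings in an HH:MM-like format.

```python
import pandas as pd

# Hypothetical sales times without zero padding.
toy = pd.DataFrame({
    "sales_time": ["9:15", "10:05", "21:40", "9:50"],
    "total_revenue": [100.0, 50.0, 80.0, 60.0],
})

hours_as_text = toy["sales_time"].str.split(":").str[0]
toy["hours"] = hours_as_text.astype(int)  # numeric hours sort as expected

text_order = sorted(hours_as_text.unique())
numeric_order = sorted(toy["hours"].unique())
print(text_order)     # lexicographic order of the strings
print(numeric_order)  # chronological order of the integers

# Average revenue per hour, indexed in chronological order.
hourly_mean = toy.groupby("hours")["total_revenue"].mean()
print(hourly_mean)
```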

In [None]:
df['sales_date'] = df['sales_date'].astype(str)
df[['year', 'month', 'day']] = df['sales_date'].str.split("-", expand=True)
df.head()

In [None]:
plt.figure(figsize=(15, 5))
sns.lineplot(x='month',y='total_revenue',data=df.sort_values('month'))
plt.show()

In [None]:
plt.figure(figsize=(15, 5))
sns.lineplot(x='day',y='total_revenue',data=df.sort_values('day'))
plt.show()

* Plot the daily total revenue over the observed period as a line chart.

In [None]:
df['sales_date'] = pd.to_datetime(df['sales_date'])
daily_revenue = df.groupby('sales_date')['total_revenue'].sum()
plt.figure(figsize=(10, 6))
plt.plot(daily_revenue.index, daily_revenue.values, color='skyblue', marker='o', linestyle='-')
plt.title('Daily Total Revenue Over Time')
plt.xlabel('Date')
plt.ylabel('Total Revenue')
plt.grid(True)
plt.show()
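
Grouping by sales_date returns only the dates that occur in the data, so a day without sales is silently skipped and the line connects its neighbors. Once the column is a datetime, resample can make such gaps explicit. The sketch below is an illustration on invented values, not part of the original analysis.

```python
import pandas as pd

# Hypothetical records with no sales on 2023-01-03.
toy = pd.DataFrame({
    "sales_date": ["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-04"],
    "total_revenue": [100.0, 50.0, 80.0, 60.0],
})
toy["sales_date"] = pd.to_datetime(toy["sales_date"])

# groupby keeps only the dates present in the data.
by_group = toy.groupby("sales_date")["total_revenue"].sum()

# resample("D") builds a continuous daily index; empty days sum to 0.
by_resample = toy.set_index("sales_date")["total_revenue"].resample("D").sum()

print(by_group)
print(by_resample)
```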

* Create a scatter plot to explore the relationship between the price of a product and the number of items sold.

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df['price'], df['number_of_items_sold'], alpha=0.7)
plt.title('Relationship between Price and Number of Items Sold')
plt.xlabel('Price')
plt.ylabel('Number of Items Sold')
plt.grid(True)
plt.show()
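
A scatter plot shows the shape of the relationship; a correlation coefficient puts a number on it. The sketch below computes Pearson's r and a rank-based (Spearman-style) coefficient on invented values, reusing the column names price and number_of_items_sold. It is one possible summary, not a result from the real dataset.

```python
import pandas as pd

# Hypothetical data where higher prices go with fewer items sold.
toy = pd.DataFrame({
    "price": [10, 20, 30, 40, 50],
    "number_of_items_sold": [50, 42, 35, 20, 12],
})

# Pearson's r measures the strength of a linear relationship.
pearson = toy["price"].corr(toy["number_of_items_sold"])

# Pearson's r on the ranks equals Spearman's coefficient, which only
# asks whether the relationship is monotonic.
spearman = toy["price"].rank().corr(toy["number_of_items_sold"].rank())

print(round(pearson, 3))
print(round(spearman, 3))
```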

* Show the top-selling products based on total revenue using a horizontal bar chart.

In [None]:
top_selling_products = df.groupby('product_name')['total_revenue'].sum().sort_values(ascending=False)
top_10_products = top_selling_products.head(10)
plt.figure(figsize=(10, 6))
top_10_products.sort_values().plot(kind='barh')
plt.title('Top Selling Products by Revenue')
plt.xlabel('Total Revenue')
plt.ylabel('Product Name')
plt.grid(axis='x')
plt.show()
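
The ranking above can also be written with nlargest, and expressing each product's revenue as a share of the total makes the bars easier to compare. A minimal sketch on invented values, using the notebook's column names product_name and total_revenue:

```python
import pandas as pd

# Hypothetical sales records.
toy = pd.DataFrame({
    "product_name": ["Milk", "Bread", "Milk", "Eggs", "Bread", "Rice"],
    "total_revenue": [120.0, 80.0, 60.0, 90.0, 30.0, 200.0],
})

revenue_by_product = toy.groupby("product_name")["total_revenue"].sum()

# nlargest(n) sorts descending and keeps the first n entries in one step.
top_3 = revenue_by_product.nlargest(3)

# Share of overall revenue contributed by each top product, in percent.
revenue_share = (top_3 / revenue_by_product.sum() * 100).round(1)

print(top_3)
print(revenue_share)
```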

* Analyze seasonal effects by plotting the total revenue trend over different months or quarters.

In [31]:
df['month'] = df['sales_date'].dt.month
df['quarter'] = df['sales_date'].dt.quarter
monthly_revenue = df.groupby('month')['total_revenue'].sum()
quarterly_revenue = df.groupby('quarter')['total_revenue'].sum()

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(monthly_revenue.index, monthly_revenue.values, marker='o', linestyle='-')
plt.title('Total Revenue Trend Over Different Months')
plt.xlabel('Month')
plt.ylabel('Total Revenue')
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.grid(True)
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(quarterly_revenue.index, quarterly_revenue.values, marker='o', linestyle='-')
plt.title('Total Revenue Trend Over Different Quarters')
plt.xlabel('Quarter')
plt.ylabel('Total Revenue')
plt.xticks(range(1, 5))
plt.grid(True)
plt.show()
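
If the data does not cover every month, groupby('month') returns only the months that occur, while the tick labels list all twelve. The sketch below makes the gaps explicit with reindex. It uses invented values and assumes the data spans a single year, since dt.month pools the same month from different years.

```python
import pandas as pd

# Hypothetical records with no sales in February.
toy = pd.DataFrame({
    "sales_date": pd.to_datetime(
        ["2023-01-15", "2023-01-20", "2023-03-05", "2023-04-10"]
    ),
    "total_revenue": [100.0, 50.0, 80.0, 60.0],
})
toy["month"] = toy["sales_date"].dt.month
toy["quarter"] = toy["sales_date"].dt.quarter

monthly = toy.groupby("month")["total_revenue"].sum()

# Reindex to all twelve months so missing months show up as 0
# instead of being dropped from the series.
monthly_full = monthly.reindex(range(1, 13), fill_value=0.0)

quarterly = toy.groupby("quarter")["total_revenue"].sum()

print(monthly_full)
print(quarterly)
```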

# 📌 8. Conclusion and Analysis Report
📊 Summary of Findings:
Throughout this grocery sales data analysis, we explored the structure, statistical properties, and distribution of the dataset to uncover meaningful business insights.
Here are the key takeaways:

* 🧾 Product Category Dominance: Certain product categories consistently appeared more frequently, suggesting they are core revenue drivers.

* 📉 Sales Distribution: The sales and quantity data followed expected patterns, though some outliers indicated unusually high or low sales, possibly due to promotions or stock-outs.

* 🧪 Data Quality: The dataset was mostly clean, with minor missing or inconsistent values, which were accounted for during preprocessing.

* 📈 Statistical Insight: Descriptive statistics revealed a balanced spread of item quantities but skewed revenue values, indicating a few high-priced items significantly impact total sales.

* 🧭 Data Structure Awareness: Early structural exploration (columns, types, indexes) ensured proper downstream analysis and visualization setup.

💡 Final Thoughts:
The visualizations made the sales dynamics much clearer, turning numerical data into actionable insights. 
This type of exploratory analysis can guide:

* Inventory Planning

* Sales Strategy Optimization

* Customer Demand Forecasting

Future extensions of this project could include time-series analysis, predictive modeling, or even clustering customer purchase behavior.