End-to-end SQL project analyzing Walmart sales data using PostgreSQL. Includes data cleaning with Python(pandas), business insights, and advanced SQL queries to solve real-world retail problems
This project is an end-to-end data analysis solution designed to extract critical business insights from Walmart sales data. We utilize Python for data processing and analysis, SQL for advanced querying, and structured problem-solving techniques to answer key business questions. The project is ideal for data analysts looking to develop skills in data manipulation and SQL querying.
- Tools Used: Visual Studio Code (VS Code), Python, SQL (PostgreSQL)
- Goal: Create a structured workspace within VS Code and organize project folders for smooth development and data handling.
- Data Source: Used Kaggle to download the Walmart sales dataset.
- Dataset Link: Walmart Sales Dataset
- Libraries: Installed the necessary Python libraries using:

```bash
pip install pandas numpy sqlalchemy psycopg2
```
- Loading Data: Read the data into a Pandas DataFrame for initial analysis and transformations.
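The loading step can be sketched as follows. A minimal example, using an in-memory CSV whose column names (`invoice_id`, `branch`, `unit_price`, etc.) are assumptions standing in for the Kaggle file:

```python
import io

import pandas as pd

# Hypothetical sample mirroring the Kaggle file's layout (column names are assumptions);
# in the project this would be pd.read_csv("Walmart.csv") on the downloaded file.
csv_data = io.StringIO(
    "invoice_id,branch,city,unit_price,quantity,payment_method\n"
    "1,WALM003,San Antonio,$74.69,7,Ewallet\n"
    "2,WALM048,Harlingen,$15.28,5,Cash\n"
)
df = pd.read_csv(csv_data)
print(df.shape)  # rows and columns loaded
```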
- Goal: Conduct an initial data exploration to understand data distribution, check column names, types, and identify potential issues.
- Analysis: Used functions like `.info()`, `.describe()`, and `.head()` to get a quick overview of the data structure and statistics.
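For example, on a tiny stand-in DataFrame (column names are assumptions), the exploration calls look like:

```python
import pandas as pd

# Tiny stand-in for the loaded Walmart DataFrame (column names are assumptions)
df = pd.DataFrame({
    "branch": ["WALM003", "WALM048"],
    "unit_price": [74.69, 15.28],
    "quantity": [7, 5],
})

df.info()             # column names, dtypes, and non-null counts
print(df.describe())  # summary statistics for the numeric columns
print(df.head())      # first few rows for a quick sanity check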
- Remove Duplicates: Identify and remove duplicate entries to avoid skewed results.
- Handle Missing Values: Drop rows or columns with missing values if they are insignificant; fill values where essential.
- Fix Data Types: Ensure all columns have consistent data types (e.g., dates as `datetime`, prices as `float`).
- Currency Formatting: Use `.replace()` to strip currency symbols so values can be parsed for analysis.
- Validation: Check for any remaining inconsistencies and verify the cleaned data.
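The cleaning steps above can be sketched in a few lines of pandas. A minimal example on illustrative data (column names are assumptions):

```python
import pandas as pd

# Illustrative raw data: one duplicate row and one missing price
df = pd.DataFrame({
    "unit_price": ["$74.69", "$15.28", "$15.28", None],
    "quantity": [7, 5, 5, 3],
})

df = df.drop_duplicates()               # remove exact duplicate rows
df = df.dropna(subset=["unit_price"])   # drop rows missing essential values
# strip the currency symbol and fix the dtype so the column is numeric
df["unit_price"] = df["unit_price"].str.replace("$", "", regex=False).astype(float)
print(df.dtypes)
```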
- Create New Columns: Calculated the `Total_price` for each transaction by multiplying `unit_price` by `quantity` and adding the result as a new column.
- Enhance Dataset: This calculated field streamlines further SQL analysis and aggregation tasks.
- Set Up Connection: Connect to PostgreSQL using `sqlalchemy` and load the cleaned data into the database.
- Table Creation: Set up tables in PostgreSQL via SQLAlchemy to automate table creation and data insertion.
- Verification: Run initial SQL queries to confirm that the data has been loaded accurately.
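A minimal sketch of the load-and-verify step. The PostgreSQL URL below uses placeholder credentials and a hypothetical database name; the runnable demo swaps in an in-memory SQLite engine so the snippet is self-contained:

```python
import pandas as pd
from sqlalchemy import create_engine

# Real project target (user, password, and walmart_db are placeholders):
# engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/walmart_db")
# Self-contained demo engine:
engine = create_engine("sqlite://")

df = pd.DataFrame({"branch": ["WALM003"], "Total_price": [522.83]})
df.to_sql("walmart", engine, if_exists="replace", index=False)

# Verification query: confirm the rows landed in the table
loaded = pd.read_sql("SELECT COUNT(*) AS n FROM walmart", engine)
print(loaded["n"].iloc[0])
```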
- Business Problem-Solving: Write and execute complex SQL queries to answer critical business questions, such as:
- Revenue trends across branches and categories.
- Identifying best-selling product categories.
- Sales performance by time, city, and payment method.
- Analyzing peak sales periods and customer buying patterns.
- Profit margin analysis by branch and category.
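One of the business questions above, revenue by branch, can be answered with a grouped aggregation. A sketch run through `pd.read_sql` against a toy table (column names and values are assumptions; SQLite stands in for PostgreSQL):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # stand-in for the project's PostgreSQL engine
pd.DataFrame({
    "branch": ["WALM003", "WALM003", "WALM048"],
    "Total_price": [522.83, 76.40, 100.00],
}).to_sql("walmart", engine, index=False)

# Revenue per branch, highest first
query = """
    SELECT branch, SUM(Total_price) AS revenue
    FROM walmart
    GROUP BY branch
    ORDER BY revenue DESC
"""
revenue = pd.read_sql(query, engine)
print(revenue)
```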
- Documentation: Maintain well-structured documentation of the entire process in Markdown or a Jupyter Notebook.
- Project Publishing: Published the completed project on GitHub, including:
  - The `README.md` file (this document).
  - Jupyter Notebooks.
  - SQL query scripts.
  - Data files or steps to access them.
- Python 3.8+
- SQL Database: PostgreSQL
- Python Libraries: `pandas`, `numpy`, `sqlalchemy`, `psycopg2`
```plaintext
|-- data/          # Raw data and transformed data
|-- sql_queries/   # SQL scripts for analysis and queries
|-- notebooks/     # Jupyter notebooks for Python analysis
|-- README.md      # Project documentation
```
This section summarizes the analysis findings:
- Sales Insights: Key categories, branches with highest sales, and preferred payment methods.
- Profitability: Insights into the most profitable product categories and locations.
- Customer Behavior: Trends in ratings, payment preferences, and peak shopping hours.