# Unveiling Insights: Python Data Analysis of E-Commerce Sales Dataset

<img src="Shopping.jpeg" width=600 height=450 />

## Introduction

In today’s digital age, E-commerce has revolutionized the way we shop and conduct business. With the rise of online platforms, enormous amounts of data are generated daily. This treasure trove of information holds valuable insights that can drive strategic decision-making. In this blog, we delve into a Python data analysis of an E-commerce sales dataset, unearthing meaningful patterns and trends that can inform business strategies and optimize performance.

> **What is ecommerce?** "Ecommerce" or "electronic commerce" is the trading of goods and services on the internet.

> Ecommerce is a retail method that enables people to buy and sell products. It offers diverse approaches, with some businesses exclusively operating online, while others integrate ecommerce into a wider strategy involving physical stores and various distribution channels. Regardless of the approach, ecommerce provides opportunities for startups, small businesses, and large companies to sell products on a global scale and connect with customers worldwide.


## Tools used for the data analysis

**Python:** With Python as our tool of choice, we can leverage powerful libraries like Pandas and NumPy to explore, clean, and analyze this data efficiently, and Matplotlib and Seaborn to visualize it. The main Python libraries used are:

- NumPy
- Pandas
- Matplotlib
- Seaborn

## Data Preprocessing
Before diving into the analysis, it’s crucial to preprocess the dataset. This step involves handling missing values, removing duplicates, and ensuring consistency in data formats. By employing Python’s Pandas library, we can perform these tasks seamlessly, ensuring the integrity of our analysis.
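
A minimal sketch of these preprocessing steps on made-up data. The column names `order_id`, `date`, and `amount` are assumptions for illustration, and filling missing amounts with 0 is only one possible choice:

```python
import pandas as pd

# Toy data standing in for a raw sales export (all values are made up)
raw = pd.DataFrame({
    "order_id": ["A1", "A2", "A2", "A3"],
    "date": ["04-30-22", "04-29-22", "04-29-22", "04-28-22"],
    "amount": [647.62, None, None, 329.0],
})

clean = (
    raw.drop_duplicates()  # remove exact duplicate rows
       .assign(
           # parse the text dates into a proper datetime column
           date=lambda d: pd.to_datetime(d["date"], format="%m-%d-%y"),
           # replace missing amounts with an explicit placeholder value
           amount=lambda d: d["amount"].fillna(0.0),
       )
       .reset_index(drop=True)
)

print(clean)
print(clean.dtypes)
```

Chaining the steps keeps the raw frame untouched, which makes it easy to compare the data before and after cleaning.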

## Understanding the Dataset

To begin our analysis, let’s look at the E-commerce sales dataset we’ll be working with. It comprises various attributes, including customer information, product details, transactional data, and sales records. In the present data analysis project, we have three datasets:

1. Sale Report
2. International Sale Report
3. Amazon Sale Report

They contain data on:
- Customer Information
- Product Details
- Transaction Data  
- Sales Records
- Promotions
- Promotion Types
- Promotion Targets
- Promotion Results
- Promotion Activities



## Exploratory Data Analysis (EDA)

With the dataset prepared, we can now unleash the power of Python to uncover fascinating insights through EDA. Let’s explore a few key aspects:

1. **Sales Performance Analysis:** By aggregating sales data, we can determine the top-selling products, identify peak sales periods, and evaluate the performance of different product categories. Python’s data visualization libraries such as Matplotlib and Seaborn enable us to create informative charts and graphs to better understand sales patterns.

2. **Customer Segmentation:** Using techniques like clustering and RFM (Recency, Frequency, Monetary) analysis, we can segment customers based on their purchasing behavior. This allows us to identify high-value customers, understand their preferences, and tailor marketing strategies accordingly.

3. **Geographic Analysis:** Analyzing sales data by geographical regions can reveal lucrative markets and highlight areas for potential expansion. Python’s geospatial libraries, such as GeoPandas and Folium, can help create interactive maps to visualize sales patterns geographically.

4. **Seasonal Trends and Forecasting:** We can identify seasonal trends and patterns by examining historical sales data. Python libraries such as Prophet and statsmodels (which provides ARIMA models) support time series analysis and sales forecasting, helping businesses plan inventory, marketing campaigns, and resource allocation.
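
As an illustration of the RFM idea from item 2, here is a small sketch on a made-up order log. The `customer_id` column is an assumption: the datasets used later in this post may not contain a customer identifier, so this shows only the shape of the computation:

```python
import pandas as pd

# Hypothetical order log; `customer_id` is an assumed column
orders = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c3", "c3", "c3"],
    "order_date": pd.to_datetime([
        "2022-04-01", "2022-04-20", "2022-03-15",
        "2022-04-25", "2022-04-28", "2022-04-30",
    ]),
    "amount": [400.0, 650.0, 300.0, 500.0, 750.0, 329.0],
})

# Reference date: the day after the last order in the data
snapshot = orders["order_date"].max() + pd.Timedelta(days=1)

rfm = orders.groupby("customer_id").agg(
    recency_days=("order_date", lambda s: (snapshot - s.max()).days),  # days since last order
    frequency=("order_date", "count"),                                 # number of orders
    monetary=("amount", "sum"),                                        # total spend
)
print(rfm)
```

Customers with low recency, high frequency, and high monetary value are the usual candidates for a "high-value" segment; the exact thresholds are a business decision.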


**Insights and Recommendations:** Upon completing our data analysis, we uncover valuable insights that can drive actionable recommendations for the E-commerce business:

- **Product Optimization:** Identify underperforming products and focus on improving their sales through marketing campaigns, product enhancements, or pricing strategies.

- **Targeted Marketing:** Tailor marketing efforts by leveraging customer segmentation insights. Create personalized campaigns to target high-value customers and improve customer retention.

- **Geographical Expansion:** Identify regions with high sales potential and consider expanding operations or targeting marketing efforts in those areas.

- **Inventory Planning:** Utilize sales forecasting models to optimize inventory management, ensuring sufficient stock levels during peak demand periods while minimizing excess inventory costs.

# Exploratory Data Analysis

## Importing the Python libraries

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

## Loading the important datasets

We have the following datasets:

1. Amazon Sale Report.csv
2. International sale Report.csv
3. May-2022.csv
4. P L March 2021.csv
5. Sale Report.csv

In the present analysis, we load datasets 1, 2, and 5.

In [2]:
df_amz = pd.read_csv('Amazon Sale Report.csv', low_memory = False)
df_int = pd.read_csv('International sale Report.csv', low_memory = False)
df_sale = pd.read_csv('Sale Report.csv', low_memory = False)

With the provided dataframes `df_amz` and `df_sale`, there are various data analysis tasks you can perform using Python. Here are some possible analyses you can conduct:

1. **Data exploration:**

  - Check the dimensions and basic information of the dataframes using functions like shape, head, tail, info, etc.
  - Examine the statistical summary of numeric columns using describe.

2. **Data cleaning and preprocessing:**

  - Handle missing values using methods like fillna or dropping missing values using dropna.
  - Remove duplicates using drop_duplicates.
  - Convert data types of columns using astype or other relevant conversion functions.
  - Merge or join the two dataframes based on common columns using merge or join.

3. **Data visualization:**

  - Plot various charts and graphs to visualize the data, such as bar charts, pie charts, line plots, scatter plots, histograms, etc. You can use libraries like Matplotlib, Seaborn, or Plotly for data visualization.

4. **Data aggregation and grouping:**

  - Perform aggregation operations like sum, count, mean, median, etc., on numeric columns using groupby.
  - Group the data based on different columns and perform calculations or analyses specific to those groups.

5. **Data filtering and selection:**

  - Filter the data based on specific conditions using boolean indexing.
  - Select specific columns or rows using column names, index positions, or conditions.

6. **Statistical analysis:**

  - Conduct statistical analyses on the data, such as correlation analysis, hypothesis testing, etc. You can use libraries like NumPy and SciPy for statistical calculations.

7. **Data merging and joining:**

  - Combine the data from both dataframes based on common columns using functions like merge or join.

8. **Data summarization:**

  - Calculate various summary statistics like mean, median, mode, standard deviation, etc., for numeric columns.
  - Compute frequency counts for categorical variables.
  - Calculate the total stock, sales quantities, or revenue based on different criteria.

These are just a few examples of the possible data analyses you can perform with the provided data. The specific analysis tasks you choose will depend on the goals and questions you have about the data and the insights you want to derive from it.
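
A few of the operations listed above (filtering, grouping, and merging) can be sketched as follows. The two small frames are made-up stand-ins for `df_amz` and `df_sale`, and the SKU values are illustrative only:

```python
import pandas as pd

# Toy stand-ins for the orders and stock tables; all values are made up
orders = pd.DataFrame({
    "SKU": ["AN201-RED-L", "AN201-RED-M", "AN201-RED-L", "XX999-BLU-S"],
    "Status": ["Shipped", "Cancelled", "Shipped", "Shipped"],
    "Qty": [1, 0, 2, 1],
    "Amount": [406.0, 0.0, 812.0, 329.0],
})
stock = pd.DataFrame({
    "SKU": ["AN201-RED-L", "AN201-RED-M"],
    "Stock": [5.0, 5.0],
})

# Filtering with boolean indexing: keep only non-cancelled orders
shipped = orders[orders["Status"] != "Cancelled"]

# Grouping and aggregation: units and revenue per SKU
per_sku = shipped.groupby("SKU", as_index=False).agg(
    units=("Qty", "sum"),
    revenue=("Amount", "sum"),
)

# Merging on the shared SKU column; a left join keeps SKUs without a stock record
merged = per_sku.merge(stock, on="SKU", how="left")
print(merged)
```

A left join is used here so that sold SKUs missing from the stock table show up with `NaN` stock instead of silently disappearing.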

# 1. Amazon sales dataframe
## 1.1. Checking the basic structure of the dataframe

In [3]:
df_amz.head() # top 5 rows of the dataset

| | index | Order ID | Date | Status | Fulfilment | Sales Channel | ship-service-level | Style | SKU | Category | ... | currency | Amount | ship-city | ship-state | ship-postal-code | ship-country | promotion-ids | B2B | fulfilled-by | Unnamed: 22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 405-8078784-5731545 | 04-30-22 | Cancelled | Merchant | Amazon.in | Standard | SET389 | SET389-KR-NP-S | Set | ... | INR | 647.62 | MUMBAI | MAHARASHTRA | 400081.0 | IN | NaN | False | Easy Ship | NaN |
| 1 | 1 | 171-9198151-1101146 | 04-30-22 | Shipped - Delivered to Buyer | Merchant | Amazon.in | Standard | JNE3781 | JNE3781-KR-XXXL | kurta | ... | INR | 406.0 | BENGALURU | KARNATAKA | 560085.0 | IN | Amazon PLCC Free-Financing Universal Merchant ... | False | Easy Ship | NaN |
| 2 | 2 | 404-0687676-7273146 | 04-30-22 | Shipped | Amazon | Amazon.in | Expedited | JNE3371 | JNE3371-KR-XL | kurta | ... | INR | 329.0 | NAVI MUMBAI | MAHARASHTRA | 410210.0 | IN | IN Core Free Shipping 2015/04/08 23-48-5-108 | True | NaN | NaN |
| 3 | 3 | 403-9615377-8133951 | 04-30-22 | Cancelled | Merchant | Amazon.in | Standard | J0341 | J0341-DR-L | Western Dress | ... | INR | 753.33 | PUDUCHERRY | PUDUCHERRY | 605008.0 | IN | NaN | False | Easy Ship | NaN |
| 4 | 4 | 407-1069790-7240320 | 04-30-22 | Shipped | Amazon | Amazon.in | Expedited | JNE3671 | JNE3671-TU-XXXL | Top | ... | INR | 574.0 | CHENNAI | TAMIL NADU | 600073.0 | IN | NaN | False | NaN | NaN |


In [4]:
df_amz.shape # dataset shape = 128975 rows, 24 columns

(128975, 24)

In [10]:
df_amz.columns

Index(['index', 'Order ID', 'Date', 'Status', 'Fulfilment', 'Sales Channel ',
       'ship-service-level', 'Style', 'SKU', 'Category', 'Size', 'ASIN',
       'Courier Status', 'Qty', 'currency', 'Amount', 'ship-city',
       'ship-state', 'ship-postal-code', 'ship-country', 'promotion-ids',
       'B2B', 'fulfilled-by', 'Unnamed: 22'],
      dtype='object')

We can see that there are too many columns and some of them are not relevant for us. 

In [13]:
# Removing unwanted columns from the table

df_amzcopy = df_amz.drop(labels = ['index' , 'Order ID', 'Unnamed: 22', 'ship-postal-code', 'promotion-ids'], axis = 1)
df_amzcopy.head()

| | Date | Status | Fulfilment | Sales Channel | ship-service-level | Style | SKU | Category | Size | ASIN | Courier Status | Qty | currency | Amount | ship-city | ship-state | ship-country | B2B | fulfilled-by |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-04-30 | Cancelled | Merchant | Amazon.in | Standard | SET389 | SET389-KR-NP-S | Set | S | B09KXVBD7Z | NaN | 0 | INR | 647.62 | MUMBAI | MAHARASHTRA | IN | False | Easy Ship |
| 1 | 2022-04-30 | Shipped - Delivered to Buyer | Merchant | Amazon.in | Standard | JNE3781 | JNE3781-KR-XXXL | kurta | 3XL | B09K3WFS32 | Shipped | 1 | INR | 406.0 | BENGALURU | KARNATAKA | IN | False | Easy Ship |
| 2 | 2022-04-30 | Shipped | Amazon | Amazon.in | Expedited | JNE3371 | JNE3371-KR-XL | kurta | XL | B07WV4JV4D | Shipped | 1 | INR | 329.0 | NAVI MUMBAI | MAHARASHTRA | IN | True | NaN |
| 3 | 2022-04-30 | Cancelled | Merchant | Amazon.in | Standard | J0341 | J0341-DR-L | Western Dress | L | B099NRCT7B | NaN | 0 | INR | 753.33 | PUDUCHERRY | PUDUCHERRY | IN | False | Easy Ship |
| 4 | 2022-04-30 | Shipped | Amazon | Amazon.in | Expedited | JNE3671 | JNE3671-TU-XXXL | Top | 3XL | B098714BZP | Shipped | 1 | INR | 574.0 | CHENNAI | TAMIL NADU | IN | False | NaN |


In [31]:
df_amzcopy.columns

Index(['Date', 'Status', 'Fulfilment', 'Sales Channel ', 'ship-service-level',
       'Style', 'SKU', 'Category', 'Size', 'ASIN', 'Courier Status', 'Qty',
       'currency', 'Amount', 'ship-city', 'ship-state', 'ship-country', 'B2B',
       'fulfilled-by'],
      dtype='object')

In [32]:
# Renaming the 'Sales Channel ' column (note the trailing space) to 'Sales_Channel'

df_amzcopy = df_amzcopy.rename(columns={'Sales Channel ': 'Sales_Channel'})
df_amzcopy.head()

| | Date | Status | Fulfilment | Sales_Channel | ship-service-level | Style | SKU | Category | Size | ASIN | Courier Status | Qty | currency | Amount | ship-city | ship-state | ship-country | B2B | fulfilled-by |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-04-30 | Cancelled | Merchant | Amazon.in | Standard | SET389 | SET389-KR-NP-S | Set | S | B09KXVBD7Z | NaN | 0 | INR | 647.62 | MUMBAI | MAHARASHTRA | IN | False | Easy Ship |
| 1 | 2022-04-30 | Shipped - Delivered to Buyer | Merchant | Amazon.in | Standard | JNE3781 | JNE3781-KR-XXXL | kurta | 3XL | B09K3WFS32 | Shipped | 1 | INR | 406.0 | BENGALURU | KARNATAKA | IN | False | Easy Ship |
| 2 | 2022-04-30 | Shipped | Amazon | Amazon.in | Expedited | JNE3371 | JNE3371-KR-XL | kurta | XL | B07WV4JV4D | Shipped | 1 | INR | 329.0 | NAVI MUMBAI | MAHARASHTRA | IN | True | NaN |
| 3 | 2022-04-30 | Cancelled | Merchant | Amazon.in | Standard | J0341 | J0341-DR-L | Western Dress | L | B099NRCT7B | NaN | 0 | INR | 753.33 | PUDUCHERRY | PUDUCHERRY | IN | False | Easy Ship |
| 4 | 2022-04-30 | Shipped | Amazon | Amazon.in | Expedited | JNE3671 | JNE3671-TU-XXXL | Top | 3XL | B098714BZP | Shipped | 1 | INR | 574.0 | CHENNAI | TAMIL NADU | IN | False | NaN |


In [35]:
# finding the unique values in column: Status

unique_status = df_amzcopy['Status'].unique()
unique_status

array(['Cancelled', 'Shipped - Delivered to Buyer', 'Shipped',
       'Shipped - Returned to Seller', 'Shipped - Rejected by Buyer',
       'Shipped - Lost in Transit', 'Shipped - Out for Delivery',
       'Shipped - Returning to Seller', 'Shipped - Picked Up', 'Pending',
       'Pending - Waiting for Pick Up', 'Shipped - Damaged', 'Shipping'],
      dtype=object)

>  **Similarly, we can find the unique values in every column. Doing this one column at a time is tedious, so we use the following loop to print the unique values for each column of interest:**

In [62]:
# Define the list of columns for which you want to print unique values
columns = ['Status', 'Fulfilment', 'Sales_Channel', 'ship-service-level', 'Category', 'Size',
           'Courier Status', 'Qty', 'currency', 'B2B', 'fulfilled-by']

# Iterate over the columns and print unique values in a tabular form
for column in columns:
    unique_values = df_amzcopy[column].unique()
    unique_values_count = len(unique_values)

    print(f"Column: {column}")
    print(f"Number of unique values: {unique_values_count}")
    print("Unique values in: {}".format(column))
    for value in unique_values:
        print(value)
    print("-----------------------")
    print("\n")

Column: Status
Number of unique values: 13
Unique values in: Status
Cancelled
Shipped - Delivered to Buyer
Shipped
Shipped - Returned to Seller
Shipped - Rejected by Buyer
Shipped - Lost in Transit
Shipped - Out for Delivery
Shipped - Returning to Seller
Shipped - Picked Up
Pending
Pending - Waiting for Pick Up
Shipped - Damaged
Shipping
-----------------------


Column: Fulfilment
Number of unique values: 2
Unique values in: Fulfilment
Merchant
Amazon
-----------------------


Column: Sales_Channel
Number of unique values: 2
Unique values in: Sales_Channel
Amazon.in
Non-Amazon
-----------------------


Column: ship-service-level
Number of unique values: 2
Unique values in: ship-service-level
Standard
Expedited
-----------------------


Column: Category
Number of unique values: 9
Unique values in: Category
Set
kurta
Western Dress
Top
Ethnic Dress
Bottom
Saree
Blouse
Dupatta
-----------------------


Column: Size
Number of unique values: 11
Unique values in: Size
S
3XL
XL
L
XXL
XS
6XL
M

## 1.2. Data cleaning

### 1.2.1. NaN values

Now checking NaN values or empty cells. 

In [72]:
df_amz.isna() # checking the null values in the dataset

| | index | Order ID | Date | Status | Fulfilment | Sales Channel | ship-service-level | Style | SKU | Category | ... | currency | Amount | ship-city | ship-state | ship-postal-code | ship-country | promotion-ids | B2B | fulfilled-by | Unnamed: 22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | True | False | False | True |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | True |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | True | False | False | True |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | True | False | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 128970 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | True | False | True | False |
| 128971 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
| 128972 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | True | False | True | False |
| 128973 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |


In [63]:
df_amzcopy.isnull().sum()

Date                      0
Status                    0
Fulfilment                0
Sales_Channel             0
ship-service-level        0
Style                     0
SKU                       0
Category                  0
Size                      0
ASIN                      0
Courier Status         6872
Qty                       0
currency               7795
Amount                 7795
ship-city                33
ship-state               33
ship-country             33
B2B                       0
fulfilled-by          89698
dtype: int64

So we have some columns filled with NaN values. We need to clean them and fill them with the appropriate values. 
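
To judge how serious the gaps are, it can help to look at the share of missing values per column and not only the raw counts. A minimal sketch on made-up data (the three columns mirror names from the Amazon table, but the values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing cells; values are made up
df = pd.DataFrame({
    "Amount": [647.62, np.nan, 329.0, np.nan],
    "ship-city": ["MUMBAI", "BENGALURU", None, "CHENNAI"],
    "Qty": [0, 1, 1, 1],
})

# isna().sum() counts missing cells, isna().mean() gives the fraction per column
missing = pd.DataFrame({
    "missing": df.isna().sum(),
    "percent": (df.isna().mean() * 100).round(1),
})

# Keep only columns that actually have gaps, worst first
missing = missing[missing["missing"] > 0].sort_values("percent", ascending=False)
print(missing)
```

A column that is mostly empty may be better dropped or treated as a flag, while a column with only a handful of gaps is usually a candidate for filling.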

#### Note on `NaN` values
Pandas provides several other methods for filling missing values. Here are some commonly used methods along with examples and explanations of when they are useful:

1. **Forward fill (ffill):** This method fills missing values with the previous non-null value along the specified axis. It propagates the last observed value forward until the next non-null value is encountered. **Example:**

    ```
    df['Column'].ffill()
    ``` 

    **Use case:** Forward fill is useful when you want to fill missing values with the most recent preceding value. This is applicable when you have data that is sorted chronologically or when you want to carry forward a value from a previous observation.

2. **Mean fill (mean):** This method fills missing values with the mean of the non-null values along the specified axis. **Example:**
    ```
    df['Column'].fillna(df['Column'].mean())
    ```
    
    **Use case:** Mean fill is useful when you want to replace missing values with the average value of the available data. This method helps to maintain the overall statistical characteristics of the data.

3. **Median fill (median):** This method fills missing values with the median of the non-null values along the specified axis. **Example:**

    ```
    df['Column'].fillna(df['Column'].median())
    ```

    **Use case:** Median fill is useful when you want to replace missing values with the middle value of the available data. It is less sensitive to outliers compared to mean fill, making it suitable when dealing with skewed distributions or data with extreme values.

4. **Mode fill (mode):** This method fills missing values with the mode (most frequent value) of the non-null values along the specified axis. **Example:**
    ```
    df['Column'].fillna(df['Column'].mode()[0])
    ```
    
    **Use case:** Mode fill is useful when you want to replace missing values with the most commonly occurring value in the available data. It is often used for categorical or discrete data.

5. **Constant fill:** This method fills missing values with a specified constant value. **Example:**

    ```
    df['Column'].fillna('Unknown')
    ```

    **Use case:** Constant fill is useful when you want to replace missing values with a specific predefined value, such as 'Unknown', 'Not Available', or any other meaningful constant.

These methods have different use cases depending on the nature of the data and the objective of the analysis. It's important to choose the appropriate fill method based on the specific requirements and characteristics of your dataset.
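
The following sketch applies the fill strategies above to one small made-up series so the results can be compared side by side. It uses `Series.ffill()` for the forward fill, since recent pandas versions deprecate passing `method=` to `fillna`:

```python
import numpy as np
import pandas as pd

# Numeric toy series with two gaps and one large value
s = pd.Series([10.0, np.nan, 30.0, np.nan, 100.0])

filled = pd.DataFrame({
    "original": s,
    "ffill": s.ffill(),              # carry the previous value forward
    "mean": s.fillna(s.mean()),      # mean of 10, 30, 100
    "median": s.fillna(s.median()),  # median is less affected by the 100
    "constant": s.fillna(0.0),       # fixed placeholder
})
print(filled)

# Mode fill is typically used for categorical data
sizes = pd.Series(["S", "M", None, "M"])
sizes_filled = sizes.fillna(sizes.mode()[0])
print(sizes_filled.tolist())
```

Comparing the `mean` and `median` columns shows how a single large value pulls the mean fill upward while the median fill stays at a typical value.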

In [64]:
# Changing the 'NaN' values in 'Currency' column to 'COD'
df_amzcopy['currency'] = df_amzcopy['currency'].fillna('COD')

In [65]:
# checking the unique values in column: currency

unique_currency = df_amzcopy['currency'].unique()
unique_currency

array(['INR', 'COD'], dtype=object)

In [66]:
# Re-checking the null counts after filling the 'currency' column
df_amzcopy.isnull().sum()

Date                      0
Status                    0
Fulfilment                0
Sales_Channel             0
ship-service-level        0
Style                     0
SKU                       0
Category                  0
Size                      0
ASIN                      0
Courier Status         6872
Qty                       0
currency                  0
Amount                 7795
ship-city                33
ship-state               33
ship-country             33
B2B                       0
fulfilled-by          89698
dtype: int64

### 1.2.2. Changing data types

In [14]:
df_amzcopy.dtypes # data type of each column

Date                  datetime64[ns]
Status                        object
Fulfilment                    object
Sales Channel                 object
ship-service-level            object
Style                         object
SKU                           object
Category                      object
Size                          object
ASIN                          object
Courier Status                object
Qty                            int64
currency                      object
Amount                       float64
ship-city                     object
ship-state                    object
ship-country                  object
B2B                             bool
fulfilled-by                  object
dtype: object

In [15]:
df_amzcopy.info() # non-null count and data type of each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128975 entries, 0 to 128974
Data columns (total 19 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   Date                128975 non-null  datetime64[ns]
 1   Status              128975 non-null  object        
 2   Fulfilment          128975 non-null  object        
 3   Sales Channel       128975 non-null  object        
 4   ship-service-level  128975 non-null  object        
 5   Style               128975 non-null  object        
 6   SKU                 128975 non-null  object        
 7   Category            128975 non-null  object        
 8   Size                128975 non-null  object        
 9   ASIN                128975 non-null  object        
 10  Courier Status      122103 non-null  object        
 11  Qty                 128975 non-null  int64         
 12  currency            121180 non-null  object        
 13  Amount              121180 no

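Clearly, the data types of a few columns need to be changed. For example, `Date` is read from the CSV as plain text (`object`) and has to be converted to a datetime type. The outputs above already show `datetime64[ns]` because, as the cell numbers indicate, the conversion cell below was executed before them.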

In [7]:
df_amz['Date'] = pd.to_datetime(df_amz['Date']) # converting 'Date' datatype from object to Datetime.
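
A hedged variant of this conversion passes an explicit format, which avoids ambiguity between month-first and day-first dates. The format string `%m-%d-%y` is an assumption based on values such as `04-30-22` shown in the `head()` output, and the small series below is made up:

```python
import pandas as pd

# Made-up sample in the same style as the raw 'Date' column
dates = pd.Series(["04-30-22", "05-01-22", "not a date"])

# Explicit month-day-year format; unparseable entries become NaT instead of raising
parsed = pd.to_datetime(dates, format="%m-%d-%y", errors="coerce")
print(parsed)

# Once parsed, the .dt accessor exposes calendar fields for grouping
print(parsed.dt.month_name())
```

With `errors="coerce"`, malformed entries can be counted afterwards via `parsed.isna().sum()` and inspected separately.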

# 2. Sales data

## 2.1. Checking shape and size of the dataframe

In [75]:
df_sale.head()

| | index | SKU Code | Design No. | Stock | Category | Size | Color |
|---|---|---|---|---|---|---|---|
| 0 | 0 | AN201-RED-L | AN201 | 5.0 | AN : LEGGINGS | L | Red |
| 1 | 1 | AN201-RED-M | AN201 | 5.0 | AN : LEGGINGS | M | Red |
| 2 | 2 | AN201-RED-S | AN201 | 3.0 | AN : LEGGINGS | S | Red |
| 3 | 3 | AN201-RED-XL | AN201 | 6.0 | AN : LEGGINGS | XL | Red |
| 4 | 4 | AN201-RED-XXL | AN201 | 3.0 | AN : LEGGINGS | XXL | Red |


Dropping the index column

In [103]:
# Removing unwanted columns from the table

df_sale = df_sale.drop(labels = ['index'], axis = 1)
df_sale.head()

| | SKU | Design No. | Stock | Category | Size | Color |
|---|---|---|---|---|---|---|
| 0 | AN201-RED-L | AN201 | 5.0 | Leggings | L | Red |
| 1 | AN201-RED-M | AN201 | 5.0 | Leggings | M | Red |
| 2 | AN201-RED-S | AN201 | 3.0 | Leggings | S | Red |
| 3 | AN201-RED-XL | AN201 | 6.0 | Leggings | XL | Red |
| 4 | AN201-RED-XXL | AN201 | 3.0 | Leggings | XXL | Red |


In [104]:
df_sale.shape

(9271, 6)

In [105]:
df_sale.columns

Index(['SKU', 'Design No.', 'Stock', 'Category', 'Size', 'Color'], dtype='object')

In [79]:
# renaming 'SKU Code' to 'SKU'
df_sale = df_sale.rename(columns={'SKU Code': 'SKU'})
df_sale.columns

Index(['index', 'SKU', 'Design No.', 'Stock', 'Category', 'Size', 'Color'], dtype='object')

In [84]:
# Define the list of columns for which you want to print unique values
columns_sales = ['Category', 'Size','Color']

# Iterate over the columns and print unique values in a tabular form
for column in columns_sales:
    unique_salevalues = df_sale[column].unique()
    unique_salevalues_count = len(unique_salevalues)

    print(f"Column: {column}")
    print(f"Number of unique values: {unique_salevalues_count}")
    print("Unique values in: {}".format(column))
    for value in unique_salevalues:
        print(value)
    print("-----------------------")
    print("\n")

Column: Category
Number of unique values: 22
Unique values in: Category
AN : LEGGINGS
BLOUSE
PANT
BOTTOM
PALAZZO
SHARARA
SKIRT
DRESS
KURTA SET
LEHENGA CHOLI
SET
TOP
KURTA
nan
CROP TOP
TUNIC
CARDIGAN
JUMPSUIT
CROP TOP WITH PLAZZO
SAREE
KURTI
NIGHT WEAR
-----------------------


Column: Size
Number of unique values: 12
Unique values in: Size
L
M
S
XL
XXL
FREE
XS
XXXL
4XL
5XL
6XL
nan
-----------------------


Column: Color
Number of unique values: 63
Unique values in: Color
Red
Orange
Maroon
Purple
Yellow
Green
Pink
Beige
Navy Blue
Black
White
Brown
Gold
Chiku
Blue
Multicolor
Peach
Grey
Olive
Dark Green
Turquoise Blue
Mustard
Teal
Khaki
Olive Green
TEAL BLUE 
Cream
OFF WHITE
Light Green
Light Pink
Lemon Yellow
Sea Green
Turquoise Green
LEMON 
LEMON
Sky Blue
LIME GREEN
nan
Light Blue
Dark Blue
Indigo
Rust
BURGUNDY
Wine
Light Brown
Mauve
MINT GREEN
CORAL ORANGE
CORAL PINK
Turquoise
AQUA GREEN
LIGHT YELLOW
Magenta
Powder Blue
CORAL 
TEAL GREEN 
Taupe
Charcoal
Teal Green
NAVY
MINT
NO REFERENC

## 2.2. Data Cleaning

### 2.2.1. Capitalizing letters and NaN values

In [87]:
# Convert the 'Category' column to title case (first letter of each word capitalized)
df_sale['Category'] = df_sale['Category'].str.title()
df_sale['Category'].unique()

array(['An : Leggings', 'Blouse', 'Pant', 'Bottom', 'Palazzo', 'Sharara',
       'Skirt', 'Dress', 'Kurta Set', 'Lehenga Choli', 'Set', 'Top',
       'Kurta', nan, 'Crop Top', 'Tunic', 'Cardigan', 'Jumpsuit',
       'Crop Top With Plazzo', 'Saree', 'Kurti', 'Night Wear'],
      dtype=object)

In [89]:
# Replace 'An : Leggings' with 'Leggings' in the 'Category' column
df_sale['Category'] = df_sale['Category'].replace('An : Leggings', 'Leggings')
df_sale['Category'].unique()

array(['Leggings', 'Blouse', 'Pant', 'Bottom', 'Palazzo', 'Sharara',
       'Skirt', 'Dress', 'Kurta Set', 'Lehenga Choli', 'Set', 'Top',
       'Kurta', nan, 'Crop Top', 'Tunic', 'Cardigan', 'Jumpsuit',
       'Crop Top With Plazzo', 'Saree', 'Kurti', 'Night Wear'],
      dtype=object)

In [90]:
df_sale.isnull().sum()

index          0
SKU           83
Design No.    36
Stock         36
Category      45
Size          36
Color         45
dtype: int64

In [91]:
# Changing the 'NaN' values in the 'Category' column to 'Unknown'
df_sale['Category'] = df_sale['Category'].fillna('Unknown')

In [94]:
# Changing the 'NaN' values in the 'Size' column to 'Unknown size'
df_sale['Size'] = df_sale['Size'].fillna('Unknown size')

In [97]:
# Changing the 'NaN' values in the 'Color' column to 'Unknown color'
df_sale['Color'] = df_sale['Color'].fillna('Unknown color')
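
The unique values of `Color` printed earlier include variants that differ only in case or trailing spaces (for example `LEMON ` and `LEMON`, or `TEAL GREEN ` and `Teal Green`). One possible way to collapse them, sketched here on a small made-up series and not applied to `df_sale` in this post:

```python
import pandas as pd

# A few values in the style of the 'Color' column, including a missing one
colors = pd.Series(["Red", "LEMON ", "LEMON", "TEAL GREEN ", "Teal Green", None])

normalized = (
    colors.str.strip()               # drop leading/trailing spaces
          .str.title()               # 'LEMON' -> 'Lemon', 'TEAL GREEN' -> 'Teal Green'
          .fillna("Unknown color")   # same placeholder as used above
)

print(colors.nunique(), "distinct raw values")
print(normalized.unique())
```

Normalizing before filling keeps the placeholder text exactly as written, and it reduces the number of distinct labels that later group-by operations have to deal with.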

### 2.2.2. Changing data type

In [102]:
df_sale.dtypes

index           int64
SKU            object
Design No.     object
Stock         float64
Category       object
Size           object
Color          object
dtype: object

The data types are already appropriate here, so no conversion is needed.

# 3. Visualization

Some possible cases for data visualization using different types of charts and graphs:

1. Bar Chart: Visualize categorical data by plotting bars of different heights. Useful for comparing data across different categories or groups.

2. Pie Chart: Represent data as slices of a pie, showing the proportion or percentage distribution of different categories.

3. Line Plot: Display the relationship between two continuous variables by plotting data points connected by lines. Suitable for visualizing trends or patterns over time.

4. Scatter Plot: Plot individual data points in a Cartesian coordinate system to show the relationship between two continuous variables. Useful for identifying correlations or clusters in the data.

5. Histogram: Display the distribution of a single numeric variable by dividing the data into bins and showing the frequency or count of observations in each bin.

6. Box Plot: Visualize the distribution of a continuous variable through quartiles, outliers, and other statistical measures. Helps in identifying skewness, outliers, and variability in the data.

7. Heatmap: Display a matrix of data as a grid of colored squares, where the colors represent the values. Useful for visualizing correlations or patterns in a tabular dataset.

8. Area Chart: Plot the cumulative values of multiple variables over time, showing the contribution of each variable to the total.

9. Violin Plot: Combine the features of a box plot and a kernel density plot to visualize the distribution and density of a variable.

10. Bubble Chart: Represent data points as bubbles on a scatter plot, where the size or color of the bubble represents a third variable.

11. TreeMap: Display hierarchical data as nested rectangles, where the size of each rectangle represents a value.

12. Radar Chart: Display multivariate data on a two-dimensional chart with multiple axes, showing the values of each variable relative to a central point.
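
As a small example of the line plot case from the list above, the sketch below aggregates made-up daily sales into monthly totals and plots them. The data, column names, and the `Agg` backend line are assumptions for a standalone script; inside a notebook the backend line can be dropped:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up daily sales records
sales = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2022-03-31", "2022-04-05", "2022-04-30", "2022-05-02", "2022-05-20"]
    ),
    "Amount": [406.0, 329.0, 574.0, 647.62, 753.33],
})

# Sum the amounts per calendar month ("MS" labels each bin by its month start)
monthly = sales.set_index("Date")["Amount"].resample("MS").sum()

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(monthly.index, monthly.values, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Total amount")
ax.set_title("Monthly sales (toy data)")
fig.autofmt_xdate()   # tilt the date labels so they do not overlap
fig.tight_layout()
```

Aggregating before plotting matters: a line through every individual order would be too noisy to show a trend.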

## 3.1. Bar Chart

1. Visualize the sales quantities or revenue by category using a bar chart:

In [108]:
df_amzcopy.columns

Index(['Date', 'Status', 'Fulfilment', 'Sales_Channel', 'ship-service-level',
       'Style', 'SKU', 'Category', 'Size', 'ASIN', 'Courier Status', 'Qty',
       'currency', 'Amount', 'ship-city', 'ship-state', 'ship-country', 'B2B',
       'fulfilled-by'],
      dtype='object')

In [110]:
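# Aggregate revenue per category first; plotting the raw rows would draw one bar per order
category_revenue = df_amzcopy.groupby('Category')['Amount'].sum().sort_values(ascending=False)

plt.bar(category_revenue.index, category_revenue.values)

# Add labels and title
plt.xlabel('Category')
plt.ylabel('Total amount (INR)')
plt.title('Total sales amount by category')
plt.xticks(rotation=45, ha='right')

# Display the chart
plt.show()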

## Conclusion

Python’s data analysis capabilities empower businesses to extract meaningful insights from complex E-commerce sales datasets. By leveraging Python libraries and techniques, we can uncover patterns, segment customers, identify sales trends, and make informed decisions to drive growth and success. The analysis we conducted is just a glimpse into the vast possibilities that data-driven approaches offer in the realm of E-commerce. So, let’s embrace the power of Python and unlock the potential hidden within our data for a competitive edge in the dynamic world of online retail.

## E-Commerce Sales Dataset

- https://data.world/anilsharma87/sales
- https://www.kaggle.com/datasets/thedevastator/unlock-profits-with-e-commerce-sales-data
- https://www.kaggle.com/code/jaysonli/e-commerce-sales-analysis/notebook