<a href="https://colab.research.google.com/github/cpython-projects/da_vn/blob/main/session_06_part_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Data Visualization with Plotly

You are given an e-commerce dataset (`ecommerce_data.csv`).
Your task is to explore this dataset using **different types of visualizations** and interpret the business or analytical value of each.
For each visualization type below, describe:

- What the plot shows
- What kind of insights or patterns we might expect to find

---

### E-commerce Legend


| Column Name         | Description |
|---------------------|-------------|
| `order_id`          | Unique identifier for each order |
| `customer_id`       | Unique identifier for the customer |
| `order_date`        | Date when the order was placed |
| `product_id`        | Unique identifier for the product |
| `product_name`      | Name of the purchased product |
| `category`          | Product category (e.g., Electronics, Fashion) |
| `price`             | Unit price of the product (in USD) |
| `quantity`          | Quantity of the product ordered |
| `weight`            | Weight of the product (e.g., "0.5kg") |
| `discount`          | Discount applied to the product, as a decimal fraction (e.g., 0.15 = 15%) |
| `shipping_cost`     | Cost to ship the product |
| `payment_method`    | Method used for payment (e.g., Credit Card, PayPal, Debit) |
| `delivery_status`   | Status of delivery (e.g., Delivered, Shipped, Processing) |
| `customer_city`     | Customer's city |
| `customer_state`    | Customer's state |
| `customer_country`  | Customer's country |
| `return_requested`  | 1 if a return was requested, 0 otherwise |
| `review_score`      | Customer review rating (1 to 5) |
| `days_to_deliver`   | Number of days it took to deliver the product |

---

### Data Reading

In [18]:
from google.colab import files
uploaded = files.upload()

Saving ecommerce_data.csv to ecommerce_data (2).csv


In [19]:
import pandas as pd
df = pd.read_csv('ecommerce_data.csv')
df.head()

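|   | order_id | customer_id | order_date | product_id | product_name | category | price | quantity | weight | discount | shipping_cost | payment_method | delivery_status | customer_city | customer_state | customer_country | return_requested | review_score | days_to_deliver |
|---|----------|-------------|------------|------------|--------------|----------|-------|----------|--------|----------|---------------|----------------|-----------------|---------------|----------------|------------------|------------------|--------------|-----------------|
| 0 | 1001 | C101 | 2023-01-15 | P001 | Smartphone X | Electronics | 599.99 | 1 | 0.5kg | 0.1 | 5.99 | Credit Card | Delivered | New York | NY | USA | 0 | 5.0 | 3.0 |
| 1 | 1002 | C102 | 2023-01-16 | P002 | Laptop Pro | Electronics | 1299.99 | 1 | 2.2kg | 0.15 | 12.99 | paypal | Delivered | los angeles | CA | USA | 1 | 4.0 | 5.0 |
| 2 | 1003 | C103 | 2023-01-17 | P003 | Wireless Earbuds | Electronics | 79.99 | 2 | 0.1kg | 0.0 | NaN | Credit Card | Shipped | Chicago | IL | USA | 0 | NaN | NaN |
| 3 | 1004 | C104 | 2023-01-18 | P004 | Smart Watch | Electronics | 199.99 | 1 | 0.3kg | 0.05 | 4.99 | debit | Delivered | Houston | TX | USA | 0 | 5.0 | 4.0 |
| 4 | 1005 | C105 | 2023-01-19 | P005 | Tablet Mini | Electronics | 299.99 | 1 | 0.7kg | NaN | 6.99 | credit | Processing | PHOENIX | AZ | USA | 1 | 2.0 | NaN |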


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   order_id          40 non-null     int64  
 1   customer_id       40 non-null     object 
 2   order_date        40 non-null     object 
 3   product_id        40 non-null     object 
 4   product_name      40 non-null     object 
 5   category          40 non-null     object 
 6   price             40 non-null     float64
 7   quantity          40 non-null     int64  
 8   weight            40 non-null     object 
 9   discount          26 non-null     float64
 10  shipping_cost     30 non-null     float64
 11  payment_method    40 non-null     object 
 12  delivery_status   40 non-null     object 
 13  customer_city     40 non-null     object 
 14  customer_state    40 non-null     object 
 15  customer_country  40 non-null     object 
 16  return_requested  40 non-null     int64  
 17 

In [21]:
duplicate_rows = df.duplicated().sum()
print(duplicate_rows)

5


In [22]:
df = df.drop_duplicates()

In [23]:
df['discount'] = df.discount.fillna(0)

shipping_cost_median = df.shipping_cost.median()
df['shipping_cost'] = df.shipping_cost.fillna(shipping_cost_median)

In [24]:
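def convert_weight(x):
    """Convert a weight string such as '0.5kg' or '2lbs' to kilograms."""
    if not isinstance(x, str):
        return x  # NaN or already numeric: leave unchanged

    value = x.strip().lower()
    if value.endswith('kg'):
        return float(value[:-2])
    if value.endswith('lbs'):
        return float(value[:-3]) * 0.453592  # pounds to kilograms
    # Unrecognized unit: mark as missing instead of silently returning None
    return float('nan')

df['weight_kg'] = df['weight'].apply(convert_weight)
df = df.drop(columns='weight')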

In [26]:
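# Standardize text fields
# Title-casing only fixes capitalization, so the leftover variants are mapped
# onto the legend's labels (assumes 'credit' means 'Credit Card').
df['payment_method'] = df['payment_method'].str.title().replace(
    {'Credit': 'Credit Card', 'Paypal': 'PayPal'}
)
df['delivery_status'] = df['delivery_status'].str.title()
df['customer_city'] = df['customer_city'].str.title()
df['customer_country'] = df['customer_country'].replace(['U.S.A', 'United States'], 'USA')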

In [27]:
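from dateutil import parser

def date_parse(item):
    """Normalize a date string in any common format to 'YYYY-MM-DD'."""
    if pd.notna(item):
        return parser.parse(item).strftime('%Y-%m-%d')
    return item  # keep missing values as they are

# First unify the string format, then convert the column to datetime64
df['order_date'] = df['order_date'].apply(date_parse)
df['order_date'] = pd.to_datetime(df['order_date'], format='%Y-%m-%d')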

In [29]:
df.describe()

|       | order_id | order_date | price | quantity | discount | shipping_cost | return_requested | review_score | days_to_deliver | weight_kg |
|-------|----------|------------|-------|----------|----------|---------------|------------------|--------------|-----------------|-----------|
| count | 35.0 | 35 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 33.0 | 20.0 | 35.0 |
| mean  | 1044.114286 | 2023-01-29 13:01:42.857142784 | 268.704286 | 1.4 | 0.05 | 8.161429 | 0.142857 | 4.0 | 4.1 | 1.808511 |
| min   | 1001.0 | 2023-01-15 00:00:00 | 24.99 | 1.0 | 0.0 | 2.99 | 0.0 | 1.0 | 3.0 | 0.05 |
| 25%   | 1008.5 | 2023-01-21 12:00:00 | 74.99 | 1.0 | 0.0 | 5.49 | 0.0 | 4.0 | 3.0 | 0.25 |
| 50%   | 1016.0 | 2023-01-29 00:00:00 | 179.99 | 1.0 | 0.0 | 6.99 | 0.0 | 4.0 | 4.0 | 0.5 |
| 75%   | 1024.5 | 2023-02-06 12:00:00 | 324.99 | 1.0 | 0.1 | 8.99 | 0.0 | 5.0 | 5.0 | 1.8 |
| max   | 2006.0 | 2023-02-13 00:00:00 | 1299.99 | 4.0 | 0.25 | 19.99 | 1.0 | 5.0 | 6.0 | 15.4 |
| std   | 167.595938 | NaN | 285.45975 | 0.811679 | 0.068599 | 3.964814 | 0.355036 | 1.118034 | 1.020836 | 3.077948 |


### Histograms  
- Visualize the distribution of `price`, `quantity`, `discount`, `shipping_cost`, or `review_score`
- Look for skewed distributions, price clusters, or popular discount levels.

*insight (write 1–3 sentences):*

### Boxplot  
- Compare price or review scores across different `category` values.
- Spot outliers, detect variability in product pricing or customer feedback by product type.

*insight (write 1–3 sentences):*

### KDE Plot (Density Plot)
- Show the smoothed distribution of `price` or `days_to_deliver`.
- Useful for identifying where most values are concentrated.

*insight (write 1–3 sentences):*

### Scatter Plot  
- Plot relationships such as `price` vs. `review_score`, or `shipping_cost` vs. `days_to_deliver`.
- Identify trends, e.g., whether more expensive items get better reviews.


*insight (write 1–3 sentences):*

### Line Plot
- Track total sales or number of orders over time using `order_date`.
- Reveal seasonal trends, growth, or drops in sales volume.

*insight (write 1–3 sentences):*

### Bar Plot  
- Show totals or averages grouped by categories, such as:
     - Average `review_score` per `category`
     - Total revenue (`price * quantity`) per `customer_state`
- Useful for comparing performance across groups.

*insight (write 1–3 sentences):*

### Heatmap
- Display correlations between numeric variables like `price`, `discount`, `review_score`, `days_to_deliver`, `shipping_cost`.
- Spot which features are closely related.

*insight (write 1–3 sentences):*