# Corrected Analysis: Quarterly Sales Report

This notebook contains the corrected version of the "Fix My Analysis" exercise. It serves as a solution guide against which students can compare their results.

**Key Fixes:**
1.  Changed join strategy to `left` to retain all order data.
2.  Fixed the `total_price` calculation (multiplication instead of addition).
3.  Identified and removed invalid data (negative quantity, zero unit cost).
4.  Checked for and handled missing data across all columns.
5.  Corrected the aggregation logic to use `sum` for total sales instead of `mean`.
6.  Fixed visualisations to plot the correct metrics with accurate labels.
7.  Corrected the geographical analysis to sum sales and count orders rather than customers.
8.  Fixed time-series analysis to sum sales instead of counting orders.

## Step 1: Load the Data

In [None]:
import pandas as pd

# Load the datasets
customers = pd.read_csv("customers.csv")
products = pd.read_csv("products.csv")
orders = pd.read_csv("orders.csv")

print("Datasets loaded!")
print(f"Orders: {len(orders)} rows")
print(f"Customers: {len(customers)} rows")
print(f"Products: {len(products)} rows")
orders.head()

## Step 2: Corrected Join Strategy

**Problem:** The original `inner` join silently dropped every order whose `customer_id` or `product_id` had no match in the `customers` or `products` table.

**Solution:** Use a `left` join starting from the `orders` DataFrame to ensure all orders are kept.

In [None]:
# Corrected Join Strategy
merged_data = orders.merge(customers, on="customer_id", how="left")
merged_data = merged_data.merge(products, on="product_id", how="left")

print(f"Original number of orders: {len(orders)}")
print(f"Number of rows after left join: {len(merged_data)}")
print(f"Number of rows with at least one missing value: {merged_data.isnull().any(axis=1).sum()}")
merged_data.info()
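
To see why the join type matters, here is a minimal sketch on made-up data. The toy tables and their values are illustrative assumptions, not taken from the exercise files. It compares an `inner` join with a `left` join, uses the `indicator` option of `merge` to flag orders with no matching customer, and uses `validate` to guard against duplicate keys inflating the row count.

```python
import pandas as pd

# Toy data (hypothetical): order 3 refers to a customer that does not exist
toy_orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 11, 99]})
toy_customers = pd.DataFrame({"customer_id": [10, 11], "region": ["North", "South"]})

# Inner join: the unmatched order disappears without any warning
inner = toy_orders.merge(toy_customers, on="customer_id", how="inner")

# Left join: every order is kept.
# validate="many_to_one" raises an error if customer_id is duplicated in toy_customers.
# indicator=True adds a "_merge" column showing where each row found a match.
left = toy_orders.merge(
    toy_customers,
    on="customer_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

unmatched = left[left["_merge"] == "left_only"]

print(f"Inner join rows: {len(inner)}")
print(f"Left join rows: {len(left)}")
print(f"Orders without a matching customer: {unmatched['order_id'].tolist()}")
```

Counting `left_only` rows isolates the orders that failed to match, which is narrower than counting rows with any missing value.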

## Step 3: Corrected Data Cleaning

**Problem 1:** The original calculation used addition instead of multiplication.

**Problem 2:** The dataset contains invalid data points (negative quantities, zero costs).

**Solution:**
1.  Convert `order_date` to datetime.
2.  Fix the `total_price` calculation (multiply, not add).
3.  Identify and remove orders with invalid data.

In [None]:
# Convert order_date to datetime
merged_data["order_date"] = pd.to_datetime(merged_data["order_date"])

# CORRECTED: Calculate total price using multiplication
merged_data["total_price"] = merged_data["quantity"] * merged_data["unit_cost"]

print("Initial data statistics:")
merged_data[["quantity", "unit_cost", "total_price"]].describe()
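
The effect of the two fixes above can be illustrated on a few made-up rows (all values here are hypothetical). The sketch shows how far apart the added and multiplied totals are. As an optional extra that the notebook itself does not use, it also passes `errors="coerce"` to `pd.to_datetime` so that unparseable dates become `NaT` and can be counted.

```python
import pandas as pd

# Toy data (hypothetical values), including one unparseable date
toy = pd.DataFrame({
    "order_date": ["2024-01-15", "2024-02-03", "not a date"],
    "quantity": [3, 1, 2],
    "unit_cost": [4.00, 10.00, 2.50],
})

# errors="coerce" turns unparseable dates into NaT instead of raising an error
toy["order_date"] = pd.to_datetime(toy["order_date"], errors="coerce")
bad_dates = int(toy["order_date"].isna().sum())

toy["wrong_total"] = toy["quantity"] + toy["unit_cost"]  # the original bug
toy["total_price"] = toy["quantity"] * toy["unit_cost"]  # the corrected formula

print(toy[["quantity", "unit_cost", "wrong_total", "total_price"]])
print(f"Unparseable dates: {bad_dates}")
```

For the first row, three units at 4.00 each should cost 12.00, whereas addition gives 7.00.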

In [None]:
# Identify invalid data
print("\nIdentifying invalid data...\n")

invalid_quantity = merged_data[merged_data["quantity"] <= 0]
print(f"Found {len(invalid_quantity)} order(s) with invalid quantity (<=0):")
if len(invalid_quantity) > 0:
    print(invalid_quantity[["order_id", "product_name", "quantity", "unit_cost", "total_price"]])

invalid_cost = merged_data[merged_data["unit_cost"] <= 0]
print(f"\nFound {len(invalid_cost)} order(s) with invalid unit cost (<=0):")
if len(invalid_cost) > 0:
    print(invalid_cost[["order_id", "product_name", "quantity", "unit_cost", "total_price"]])

In [None]:
# Remove invalid data
original_rows = len(merged_data)
cleaned_data = merged_data[(merged_data["quantity"] > 0) & (merged_data["unit_cost"] > 0)].copy()
print(f"\nRemoved {original_rows - len(cleaned_data)} order(s) with invalid or missing quantity or unit cost.")

cleaned_data[["quantity", "unit_cost", "total_price"]].describe()
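
One subtlety of the filter above is that a comparison such as `NaN > 0` evaluates to `False`, so rows with a missing `quantity` or `unit_cost` are removed together with the invalid ones. The following sketch, on hypothetical toy data, counts the two groups separately so the removal message can be interpreted correctly.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one negative quantity, one zero cost, one missing cost
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "quantity": [2, -1, 5, 3],
    "unit_cost": [9.99, 4.50, 0.0, np.nan],
})

# Rows that are present but invalid (non-positive values)
invalid = (toy["quantity"] <= 0) | (toy["unit_cost"] <= 0)

# Rows where either value is missing
missing = toy[["quantity", "unit_cost"]].isna().any(axis=1)

# Same filter as in the notebook: the row with a missing cost is dropped as well
kept = toy[(toy["quantity"] > 0) & (toy["unit_cost"] > 0)]

print(f"Invalid rows: {int(invalid.sum())}")
print(f"Rows with missing values: {int(missing.sum())}")
print(f"Rows kept by the filter: {len(kept)}")
```

If the missing rows should be kept for later inspection, the filter could be written as `toy[~invalid]` instead.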

## Step 4: Check for Missing Data

**Problem:** Missing data wasn't systematically checked across all columns.

**Solution:** Count and analyse missing values for all columns, then handle appropriately.

In [None]:
# Check for missing data across all columns
print("Missing Data Analysis:\n")
missing_counts = cleaned_data.isnull().sum()
missing_percentages = (missing_counts / len(cleaned_data) * 100).round(2)

missing_summary = pd.DataFrame({
    'Column': missing_counts.index,
    'Missing Count': missing_counts.values,
    'Missing Percentage': missing_percentages.values
})

# Show only columns with missing data
missing_summary_filtered = missing_summary[missing_summary['Missing Count'] > 0]

missing_summary_filtered

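All of the remaining missing values are in customer columns, so no rows need to be removed. Note, however, that `groupby` excludes rows with a missing key by default, so orders without a matching customer will not appear in the regional summary in Step 7.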
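
The missing-data check can also be wrapped in a small reusable function. This is only a sketch: `summarise_missing` is a hypothetical helper name, and the toy table is made up for illustration.

```python
import pandas as pd

def summarise_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    summary = pd.DataFrame({
        "Missing Count": df.isna().sum(),
        # The mean of a boolean mask is the fraction of missing values
        "Missing Percentage": (df.isna().mean() * 100).round(2),
    })
    summary = summary[summary["Missing Count"] > 0]
    return summary.sort_values("Missing Count", ascending=False)

# Toy data (hypothetical): two missing regions and one missing category
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region": ["North", None, "South", None],
    "category": ["Toys", "Books", None, "Toys"],
})

report = summarise_missing(toy)
print(report)
```

Keeping the column names as the index avoids building a separate `Column` field and makes lookups such as `report.loc["region"]` straightforward.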

## Step 5: Sales Analysis by Category

**Problem:** The original analysis used `mean` for the sales summary, which reports the average order value per category rather than the total sales that management wants to see.

**Solution:** Use `sum` to calculate the total sales value for each category.

In [None]:
# Corrected Aggregation
category_summary = cleaned_data.groupby("category").agg({
    "total_price": "sum",
    "order_id": "count"
}).reset_index()

category_summary.columns = ["Category", "Total Sales Value", "Number of Orders"]

print("Sales Summary by Category:")
category_summary.sort_values("Total Sales Value", ascending=False)
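
The difference between `sum` and `mean` is easy to see on a toy example (hypothetical data). The sketch below uses pandas named aggregation, an alternative to renaming columns afterwards, and adds a simple reconciliation check that the category totals add up to the overall total.

```python
import pandas as pd

# Toy data (hypothetical): many small Books orders, one large Toys order
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "category": ["Books", "Books", "Books", "Toys"],
    "total_price": [10.0, 20.0, 30.0, 40.0],
})

# Named aggregation: output column name = (input column, aggregation function)
summary = (
    toy.groupby("category")
    .agg(
        total_sales=("total_price", "sum"),
        average_order_value=("total_price", "mean"),
        number_of_orders=("order_id", "count"),
    )
    .reset_index()
)

print(summary)

# Sanity check: category totals should add up to the overall total
totals_reconcile = summary["total_sales"].sum() == toy["total_price"].sum()
print(f"Category totals reconcile with overall total: {totals_reconcile}")
```

In this example Books has the highest total sales while Toys has the highest average order value, so the choice of aggregation changes which category appears to lead.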

## Step 6: Corrected Visualisation

**Problem:** The original chart plotted "Number of Orders" but labelled it as "Total Sales Value".

**Solution:** Plot the correct metric (Total Sales Value) with accurate labels and sorted bars.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Corrected Visualisation: Sales by Category
plt.figure(figsize=(12, 7))
category_sorted = category_summary.sort_values("Total Sales Value", ascending=False)
sns.barplot(data=category_sorted, x="Category", y="Total Sales Value")
plt.title("Total Sales Value by Category", fontsize=16)
plt.xlabel("Product Category", fontsize=12)
plt.ylabel("Total Sales Value (£)", fontsize=12)
plt.xticks(rotation=0)
plt.show()
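
One way to reduce the risk of a label and its data drifting apart is to drive both from a single variable. The sketch below is an illustration using plain matplotlib on a hypothetical summary table, not the notebook's seaborn code. The `Agg` backend line is only there so the example runs as a script without a display and can be removed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy summary table (hypothetical values)
toy_summary = pd.DataFrame({
    "Category": ["Books", "Toys"],
    "Total Sales Value": [60.0, 40.0],
    "Number of Orders": [3, 1],
})

metric = "Total Sales Value"  # used for both the plotted data and the labels

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(toy_summary["Category"], toy_summary[metric])
ax.set_title(f"{metric} by Category")
ax.set_xlabel("Product Category")
ax.set_ylabel(f"{metric} (£)")

# Read the bar heights back so they can be compared with the labelled metric
bar_heights = [patch.get_height() for patch in ax.patches]
print(f"Plotted values: {bar_heights}")
plt.close(fig)
```

Changing `metric` to `"Number of Orders"` would update the bars and the text together, although the currency symbol in the y-axis label would then need to be dropped.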

## Step 7: Corrected Geographical Sales Analysis

**Problem 1:** Used `mean` instead of `sum` for sales.

**Problem 2:** Counted `customer_id`, which counts a repeat customer once per order, instead of counting `order_id`.

**Solution:** Use `sum` for total sales and count `order_id` for number of orders.

In [None]:
# Corrected Geographical Sales Analysis
region_summary = cleaned_data.groupby("region").agg({
    "total_price": "sum",
    "order_id": "count"
}).reset_index()

region_summary.columns = ["Region", "Total Sales Value", "Number of Orders"]

print("Sales Summary by Region:")
region_summary.sort_values("Total Sales Value", ascending=False)
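
The sketch below, on hypothetical toy data, shows how `count` and `nunique` differ for repeat customers. It also uses `dropna=False`, which the notebook does not use, so that orders with a missing region are reported as their own group instead of being left out of the summary.

```python
import pandas as pd

# Toy data (hypothetical): customer 10 ordered twice, customer 99 has no region
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 11, 99],
    "region": ["North", "North", "South", None],
    "total_price": [10.0, 20.0, 30.0, 40.0],
})

# dropna=False keeps rows with a missing region as a separate group
by_region = (
    toy.groupby("region", dropna=False)
    .agg(
        total_sales=("total_price", "sum"),
        number_of_orders=("order_id", "count"),
        unique_customers=("customer_id", "nunique"),
    )
    .reset_index()
)

# Give the missing-region group a readable label
by_region["region"] = by_region["region"].fillna("Unknown")

print(by_region)
```

With the default `dropna=True`, the order from customer 99 would be excluded and the regional totals would no longer add up to the overall sales figure.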

In [None]:
# Corrected Visualisation: Sales by Region
plt.figure(figsize=(12, 7))
region_sorted = region_summary.sort_values("Total Sales Value", ascending=False)
sns.barplot(data=region_sorted, x="Region", y="Total Sales Value")
plt.title("Total Sales Value by Region", fontsize=16)
plt.xlabel("Customer Region", fontsize=12)
plt.ylabel("Total Sales Value (£)", fontsize=12)
plt.show()

## Step 8: Corrected Monthly Sales Trend Analysis

**Problem:** The original analysis counted orders instead of summing sales value.

**Solution:** Use `sum` of `total_price` to get actual monthly sales revenue.

In [None]:
# Corrected Monthly Sales Trend Analysis
cleaned_data["order_month"] = cleaned_data["order_date"].dt.to_period("M")
monthly_summary = cleaned_data.groupby("order_month").agg({
    "total_price": "sum"
}).reset_index()

monthly_summary.columns = ["Month", "Total Sales Value"]
monthly_summary["Month"] = monthly_summary["Month"].dt.to_timestamp()

monthly_summary.head(10)
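
Grouping by period only returns months that contain at least one order. Where a gap-free monthly series is wanted for the trend line, `resample` is one alternative. This is a sketch on hypothetical toy data rather than the approach used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical): no orders in February 2024
toy = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-03-10"]),
    "total_price": [10.0, 20.0, 30.0],
})

# Group by calendar month: months without orders do not appear at all
by_period = toy.groupby(toy["order_date"].dt.to_period("M"))["total_price"].sum()

# Resample by month start ("MS"): empty months appear with a total of 0
by_resample = toy.set_index("order_date")["total_price"].resample("MS").sum()

print(by_period)
print(by_resample)
```

On a line chart the first version draws a straight line from January to March, while the second shows the drop to zero in February.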

In [None]:
# Corrected Visualisation: Monthly Sales Trend
plt.figure(figsize=(14, 7))
sns.lineplot(data=monthly_summary, x="Month", y="Total Sales Value")
plt.title("Monthly Sales Trend", fontsize=16)
plt.xlabel("Month", fontsize=12)
plt.ylabel("Total Sales Value (£)", fontsize=12)
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## Final Summary

By correcting all the errors in the original analysis, we have produced a reliable and accurate report. The key fixes were:

1.  **Using a `left` join** to ensure no order data was lost during the merge.
2.  **Fixing the `total_price` calculation** to use multiplication instead of addition.
3.  **Removing invalid data** including negative quantities and zero unit costs.
4.  **Systematically checking for missing data** across all columns and handling it appropriately.
5.  **Correcting all aggregations** to use `sum` for total sales instead of `mean`.
6.  **Fixing visualisations** to plot the correct metrics with accurate labels.
7.  **Correcting the regional analysis** to count orders (not customers) and sum sales.
8.  **Fixing the time-series analysis** to sum sales revenue instead of counting orders.

This corrected analysis provides an accurate picture of the company's performance across categories, regions, and time periods, and can be confidently shared with management.