# AtliQ Hardware product analysis
The company would like to know how its popularity has changed over time and how it differs by region, which of its items have sold the most, and hopefully as a result how it can increase sales. The relevant tables are these:

* dim_product: It has product names and codes along with categorisations of the product, like "keyboard" or "storage", on three levels. Or four, if you count the fact that variants of the same named product (e.g., premium vs. plus, blue vs. red) have different codes.
* fact_manufacturing_cost: Manufacturing cost by product code by year.
* fact_gross_price: Price by product code by year.
* fact_sales_monthly: Number of each product code bought by customer and month.

There's a dashboard [available here](https://public.tableau.com/app/profile/hollis.krause/viz/s12_dash_a2/Salesquantities) where you can look at some of the charts in this notebook in a more interactive way.

## Testing the connection

In [None]:
import pandas as pd
import sqlite3
import plotly.express as px
import plotly.graph_objects as go

con = sqlite3.connect('atliq_db.sqlite3')

cursor = con.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())

query="""Select * from 
dim_customer
LIMIT 10
"""
dim_customer=pd.read_sql_query(query, con)
dim_customer.head()

I know it was for testing, but that table is so boring I'm doing another query.

In [None]:
query="""
SELECT DISTINCT customer
FROM dim_customer
LIMIT 10
"""
read = pd.read_sql_query(query, con)
read

Note from this that the table of customers has different rows for the same entity operating in different countries. That is good for regional comparisons; you just have to remember to combine them when necessary.
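
As a rough sketch of what "combining them" could look like, the query below groups on the customer name instead of the customer code. The two tables here are tiny made-up stand-ins built in memory (the shop names and quantities are hypothetical), and only the column names are taken from the real schema.

```python
import sqlite3

# Hypothetical miniature of dim_customer and fact_sales_monthly; the real
# tables have more columns and many more rows.
customer_demo = sqlite3.connect(":memory:")
customer_demo.executescript("""
CREATE TABLE dim_customer (customer_code INTEGER, customer TEXT, market TEXT);
CREATE TABLE fact_sales_monthly (customer_code INTEGER, sold_quantity INTEGER);
INSERT INTO dim_customer VALUES
    (1, 'Shop A', 'India'), (2, 'Shop A', 'Canada'), (3, 'Shop B', 'India');
INSERT INTO fact_sales_monthly VALUES (1, 10), (2, 5), (3, 7), (1, 3);
""")

# Grouping on the name rather than the code merges the per-country rows.
rows = customer_demo.execute("""
SELECT c.customer,
       COUNT(DISTINCT c.market) AS markets,
       SUM(s.sold_quantity) AS total_quan
FROM fact_sales_monthly AS s
JOIN dim_customer AS c ON s.customer_code = c.customer_code
GROUP BY c.customer
ORDER BY c.customer
""").fetchall()
print(rows)
```

Grouping on `customer_code` instead would keep the two 'Shop A' rows apart, which is what you want for the regional charts later on.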

## Basic checks and initial looks
### What do some of these things look like?
dim_product is hard to visualise from the schema, so let's look at it.

In [None]:
query="""
SELECT *
FROM dim_product
LIMIT 5
"""
read = pd.read_sql_query(query, con)
read

How many of these products are there?

In [None]:
query="""
SELECT COUNT(DISTINCT product_code) as product_codes, COUNT(DISTINCT product) as product_names
FROM dim_product
"""
read = pd.read_sql_query(query, con)
read

73, with an average of about 5.4 variations each. This makes me wonder if there are even enough products in each category to do any category-based analysis confidently.

In [None]:
query="""
SELECT COUNT(DISTINCT category)
FROM dim_product
"""
read = pd.read_sql_query(query, con)
read

So about five products per category. Not much, but not useless. But all this doesn't matter; the sales data only has four products in a total of fourteen variations. Here's every code:

In [None]:
query="""
SELECT DISTINCT product_code
FROM fact_sales_monthly
"""
read = pd.read_sql_query(query, con)
read

The weird A0 row is mostly null and doesn't represent anything real. Here are the products:

In [None]:
query="""
SELECT dim_product.product AS product, SUM(sold_quantity) as total_quan
FROM fact_sales_monthly LEFT JOIN dim_product ON fact_sales_monthly.product_code = dim_product.product_code
WHERE product IS NOT NULL
GROUP BY dim_product.product
ORDER BY total_quan DESC
"""
read = pd.read_sql_query(query, con)
read

Obviously it isn't a lot, but it's what we've got, so they're getting compared, such as by total sales in the table above.

Moving on, what dates are we covering?

In [None]:
query="""
SELECT DISTINCT date
FROM fact_sales_monthly
ORDER BY date ASC
LIMIT 10
"""
read = pd.read_sql_query(query, con)
read

In [None]:
query="""
SELECT MIN(date),MAX(date)
FROM fact_sales_monthly
"""
read = pd.read_sql_query(query, con)
read

September 2017 to December 2021. Also, "date" is really month, as the table's name suggests. And do we have every month?

In [None]:
query="""
SELECT COUNT(DISTINCT date)
FROM fact_sales_monthly
"""
read = pd.read_sql_query(query, con)
read

Yep, 4*12+4=52.
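
Counting to 52 only proves completeness because the range is known to span 52 months. A more direct check, sketched here, builds the expected list of months and looks for gaps. The `observed` set is a made-up stand-in with one month removed on purpose, and the sketch assumes months are compared as `YYYY-MM` strings.

```python
def month_range(start, end):
    """Return every 'YYYY-MM' string from start to end, inclusive."""
    year, month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    months = []
    while (year, month) <= (end_year, end_month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months

expected = month_range("2017-09", "2021-12")

# Hypothetical stand-in for the distinct months in fact_sales_monthly,
# with one month deliberately left out to show what a gap looks like.
observed = set(expected) - {"2019-02"}

missing = [m for m in expected if m not in observed]
print(len(expected), missing)
```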

### Checking for duplicates and missing values

In [None]:
query="""
SELECT COUNT(DISTINCT product_code) as distinct_codes, COUNT(*) as rows
FROM dim_product
"""
read = pd.read_sql_query(query, con)
read

In [None]:
query="""
SELECT *
FROM fact_manufacturing_cost
GROUP BY product_code, cost_year
HAVING COUNT(*) > 1
"""
read = pd.read_sql_query(query, con)
read

In [None]:
query="""
SELECT *
FROM fact_gross_price
GROUP BY product_code, fiscal_year
HAVING COUNT(*) > 1
"""
read = pd.read_sql_query(query, con)
read

In [None]:
query="""
SELECT *
FROM fact_sales_monthly
GROUP BY date, product_code, customer_code
HAVING COUNT(*) > 1
"""
read = pd.read_sql_query(query, con)
read

In [None]:
query="""
SELECT *
FROM dim_product
WHERE product_code IS NULL OR division IS NULL OR segment IS NULL OR category IS NULL OR product IS NULL OR variant IS NULL
"""
read = pd.read_sql_query(query, con)
read

In [None]:
query="""
SELECT *
FROM fact_manufacturing_cost
WHERE product_code IS NULL OR cost_year IS NULL OR manufacturing_cost IS NULL
"""
read = pd.read_sql_query(query, con)
read

In [None]:
query="""
SELECT *
FROM fact_gross_price
WHERE product_code IS NULL OR fiscal_year IS NULL OR gross_price IS NULL
"""
read = pd.read_sql_query(query, con)
read

In [None]:
query="""
SELECT *
FROM fact_sales_monthly
WHERE date IS NULL OR product_code IS NULL OR customer_code IS NULL OR sold_quantity IS NULL OR fiscal_year IS NULL
"""
read = pd.read_sql_query(query, con)
read

There's the A0 row I mentioned.

In [None]:
query="""
SELECT *
FROM dim_product
WHERE product_code = 'A0'
"""
read = pd.read_sql_query(query, con)
read

It doesn't even have a corresponding row in dim_product. Again, that row doesn't represent anything real. But it is just a single row, at least.
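
The null checks above spell out every column by hand. One possible shortcut, sketched below, reads the column names from SQLite's `PRAGMA table_info` and counts the NULLs per column. The helper name `null_counts` and the in-memory demo table are my own inventions for illustration, and the table name is pasted straight into the SQL, so it should only be used with trusted names.

```python
import sqlite3

def null_counts(connection, table):
    """Count NULLs in every column of `table` (table name must be trusted)."""
    info = connection.execute(f"PRAGMA table_info({table})").fetchall()
    columns = [row[1] for row in info]  # second field is the column name
    sums = ", ".join(f'SUM("{c}" IS NULL)' for c in columns)
    counts = connection.execute(f"SELECT {sums} FROM {table}").fetchone()
    return dict(zip(columns, counts))

# Hypothetical two-row table imitating the mostly-null A0 row.
null_demo = sqlite3.connect(":memory:")
null_demo.executescript("""
CREATE TABLE fact_sales_monthly (date TEXT, product_code TEXT, sold_quantity INTEGER);
INSERT INTO fact_sales_monthly VALUES
    ('2021-01-01', 'A0118150101', 12),
    (NULL, 'A0', NULL);
""")
result = null_counts(null_demo, "fact_sales_monthly")
print(result)
```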

## Main analysis
Perhaps the most obvious thing of all is comparing gross price to manufacturing cost, a quick check to make sure every individual sale is actually profitable.

In [None]:
query="""
SELECT fact_manufacturing_cost.product_code, fact_manufacturing_cost.cost_year, fact_manufacturing_cost.manufacturing_cost, fact_gross_price.gross_price, fact_gross_price.gross_price - fact_manufacturing_cost.manufacturing_cost AS difference
FROM fact_manufacturing_cost LEFT JOIN fact_gross_price ON fact_manufacturing_cost.product_code=fact_gross_price.product_code AND fact_manufacturing_cost.cost_year=fact_gross_price.fiscal_year
LIMIT 10
"""
read = pd.read_sql_query(query, con)
read

Everything has four pairs of figures, one for each year, which makes things tricky. We could just use the latest year, that'd probably be best for suggesting changes anyway.

In [None]:
query="""
SELECT fact_manufacturing_cost.product_code, fact_manufacturing_cost.cost_year AS year, fact_manufacturing_cost.manufacturing_cost, fact_gross_price.gross_price, fact_gross_price.gross_price - fact_manufacturing_cost.manufacturing_cost AS difference
FROM fact_manufacturing_cost LEFT JOIN fact_gross_price ON fact_manufacturing_cost.product_code=fact_gross_price.product_code AND year=fact_gross_price.fiscal_year
WHERE year = 2021
ORDER BY difference ASC
"""
diffs = pd.read_sql_query(query, con)
diffs

Even the narrowest of margins is over 2 per unit, which is nice to see. The wide range of those margins here also makes me want to know what counts as an ordinary one.

In [None]:
fig = px.histogram(diffs, x="difference")
fig.show()

A little over half of them fall into that <50 bracket, but it doesn't tail off in a way you might expect. No number is really "weird". We could also look at it in terms of the ratio.

In [None]:
query="""
SELECT fact_manufacturing_cost.product_code, fact_manufacturing_cost.cost_year AS year, fact_manufacturing_cost.manufacturing_cost, fact_gross_price.gross_price, (fact_gross_price.gross_price / fact_manufacturing_cost.manufacturing_cost) - 1 AS rev_pc
FROM fact_manufacturing_cost LEFT JOIN fact_gross_price ON fact_manufacturing_cost.product_code=fact_gross_price.product_code AND year=fact_gross_price.fiscal_year
WHERE year = 2021
ORDER BY rev_pc ASC
"""
read = pd.read_sql_query(query, con)
read

And now things become rather even. It makes more sense that prices would be based on the ratio anyway.
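
To see why the two views can disagree so much, here is a tiny worked example with made-up costs and prices (not taken from the database). The absolute difference scales with how expensive the item is, while the ratio used for `rev_pc` does not.

```python
# Hypothetical (manufacturing_cost, gross_price) pairs for illustration only.
examples = {"cheap item": (5.0, 15.0), "expensive item": (100.0, 300.0)}

results = {}
for name, (cost, price) in examples.items():
    difference = price - cost   # absolute margin per unit
    rev_pc = price / cost - 1   # margin relative to cost, as in the query
    results[name] = (difference, rev_pc)
    print(f"{name}: difference={difference:.2f}, rev_pc={rev_pc:.2f}")
```

Both items have the same `rev_pc` even though their differences are far apart, which is the same flattening seen when moving from the histogram to the ratio table.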

But are there any trends to worry about?

In [None]:
query="""
SELECT fact_manufacturing_cost.product_code, fact_manufacturing_cost.cost_year AS year, fact_manufacturing_cost.manufacturing_cost, fact_gross_price.gross_price, (fact_gross_price.gross_price / fact_manufacturing_cost.manufacturing_cost) - 1 AS rev_pc
FROM fact_manufacturing_cost LEFT JOIN fact_gross_price ON fact_manufacturing_cost.product_code=fact_gross_price.product_code AND year=fact_gross_price.fiscal_year
WHERE fact_manufacturing_cost.product_code IN ('A0118150101', 'A0118150102', 'A0118150103', 'A0118150104', 'A0219150201', 'A0219150202', 'A0219150203', 'A0320150301', 'A0320150302', 'A0320150303', 'A0418150101', 'A0418150102', 'A0418150103', 'A0418150104') AND year >= 2018 AND year <= 2021
ORDER BY year, fact_manufacturing_cost.product_code ASC
"""
ppu_table = pd.read_sql_query(query, con)

fig = px.line(ppu_table, x="year", y="rev_pc", color="product_code")
names = {"A0118150101": "Dracula Standard", "A0118150102": "Dracula Plus", "A0118150103": "Dracula Premium", "A0118150104": "Dracula Premium Plus", "A0219150201": "WereWolf Standard", "A0219150202": "WereWolf Plus", "A0219150203": "WereWolf Premium", "A0320150301": "Zion Saga Standard", "A0320150302": "Zion Saga Plus", "A0320150303": "Zion Saga Premium", "A0418150101": "Mforce Gen X Std. 1", "A0418150102": "Mforce Gen X Std. 2", "A0418150103": "Mforce Gen X Std. 3", "A0418150104": "Mforce Gen X Plus 1"}

fig.update_layout(title="No variant's profitability per unit is falling dramatically", xaxis_title="Year", yaxis_title="Price/cost ratio minus 1", legend_title_text="Variant")
fig.update_yaxes(range=[0, 2.5])
fig.update_xaxes(nticks=4)
fig.for_each_trace(lambda t: t.update(name = names[t.name]))
fig.show()

Nope. Good to see.

Now let's finally answer something that was directly asked: which items have the most sales.

In [None]:
query="""
SELECT dim_product.product AS product, dim_product.variant AS variant, SUM(sold_quantity) as total_quan
FROM fact_sales_monthly LEFT JOIN dim_product ON fact_sales_monthly.product_code = dim_product.product_code
WHERE product IS NOT NULL
GROUP BY fact_sales_monthly.product_code
ORDER BY total_quan DESC
"""
read = pd.read_sql_query(query, con)
read

The Mforce Gen X has the single best-selling variant, but is that really what we're looking for? Maybe grouping by the name would be better.

In [None]:
query="""
SELECT dim_product.product AS product, SUM(sold_quantity) as total_quan
FROM fact_sales_monthly LEFT JOIN dim_product ON fact_sales_monthly.product_code = dim_product.product_code
WHERE product IS NOT NULL
GROUP BY dim_product.product
ORDER BY total_quan DESC
"""
read = pd.read_sql_query(query, con)
read

It has the best single variant, but the worst total sales. This is at least partly because it only had one variant being sold from September 2020 onward. The Dracula is on top, and by a good margin, but there are only four things, after all.

Another thing to look at is which items are most popular in the sense of having been bought by the most customers.

In [None]:
query="""
SELECT product, COUNT(DISTINCT customer_code) as customers
FROM fact_sales_monthly LEFT JOIN dim_product ON fact_sales_monthly.product_code=dim_product.product_code
WHERE product IS NOT NULL
GROUP BY product
ORDER BY customers DESC
"""
read = pd.read_sql_query(query, con)
read

All the same. Also, if you do it with just 2021, you get the same numbers.

In [None]:
query="""
SELECT product, COUNT(DISTINCT customer_code) as customers
FROM fact_sales_monthly LEFT JOIN dim_product ON fact_sales_monthly.product_code=dim_product.product_code
WHERE SUBSTRING(date, 1, 4) = '2021'
GROUP BY product
ORDER BY customers DESC
"""
read = pd.read_sql_query(query, con)
read

### Differences over time and comparing products

The products over time:

In [None]:
query="""
SELECT date, product, SUM(sold_quantity) as quantity
FROM fact_sales_monthly LEFT JOIN dim_product ON fact_sales_monthly.product_code=dim_product.product_code
GROUP BY product, date
"""
sales_mp = pd.read_sql_query(query, con)
sales_mp = sales_mp.dropna()

fig = px.line(sales_mp, x="date", y="quantity", color="product")
names = {"AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache": "AQ Dracula HDD", "AQ Mforce Gen X": "AQ Mforce Gen X", "AQ WereWolf NAS Internal Hard Drive HDD – 8.89 cm": "AQ WereWolf HDD", "AQ Zion Saga": "AQ Zion Saga"}
fig.for_each_trace(lambda t: t.update(name = names[t.name]))
fig.update_layout(title="Mforce Gen X has been behind since late 2020", xaxis_title="Month", yaxis_title="Quantity sold", legend_title_text="Product")
fig.show()

They all have a similar shape. We see a seasonal trend from this, with an increase over the last third of the year before a drop and things being very level again. And look at how big 2021 was, especially if you aren't the Mforce Gen X. But going back to a point stated a few tables earlier, if you look into the individual variants, you find that only one got to experience the rise of 2021:

In [None]:
query="""
SELECT date, fact_sales_monthly.product_code, product, variant, SUM(sold_quantity) as quantity
FROM fact_sales_monthly LEFT JOIN dim_product ON fact_sales_monthly.product_code=dim_product.product_code
WHERE product='AQ Mforce Gen X'
GROUP BY date, product, variant
"""
sales_mv = pd.read_sql_query(query, con)
sales_mv = sales_mv.dropna()

fig = px.line(sales_mv, x="date", y="quantity", color="product_code")
names = {"A0118150101": "Dracula Standard", "A0118150102": "Dracula Plus", "A0118150103": "Dracula Premium", "A0118150104": "Dracula Premium Plus", "A0219150201": "WereWolf Standard", "A0219150202": "WereWolf Plus", "A0219150203": "WereWolf Premium", "A0320150301": "Zion Saga Standard", "A0320150302": "Zion Saga Plus", "A0320150303": "Zion Saga Premium", "A0418150101": "Mforce Gen X Std. 1", "A0418150102": "Mforce Gen X Std. 2", "A0418150103": "Mforce Gen X Std. 3", "A0418150104": "Mforce Gen X Plus 1"}
fig.for_each_trace(lambda t: t.update(name = names[t.name]))
fig.update_layout(title="The abandonment of Mforce Gen X variants", xaxis_title="Month", yaxis_title="Quantity sold", legend_title_text="Variant")
fig.show()

Everything was selling very close to equally well, but the Plus 1's last month is June 2019 and the last month for two of the Standard variants is August 2020. So should this product actually be the most popular? Perhaps not, because the other variants wouldn't have been stopped for no reason, and the fact that seemingly none of their sales redirected to the Standard 3 is curious. This kind of makes it feel like even within products the data is incomplete, but if that's true, this whole analysis becomes pointless.
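
Rather than reading the last months off the chart, a query along these lines would list each variant's first and last month of sales. The rows and the `X1`/`X2` codes below are hypothetical stand-ins; only the column names match fact_sales_monthly.

```python
import sqlite3

# Hypothetical sales rows for two made-up variant codes.
variant_demo = sqlite3.connect(":memory:")
variant_demo.executescript("""
CREATE TABLE fact_sales_monthly (date TEXT, product_code TEXT, sold_quantity INTEGER);
INSERT INTO fact_sales_monthly VALUES
    ('2019-05-01', 'X1', 10), ('2019-06-01', 'X1', 12),
    ('2019-05-01', 'X2', 11), ('2019-06-01', 'X2', 9), ('2019-07-01', 'X2', 14);
""")

# First month, last month, and number of months with sales for each variant.
spans = variant_demo.execute("""
SELECT product_code,
       MIN(date) AS first_month,
       MAX(date) AS last_month,
       COUNT(DISTINCT date) AS months_sold
FROM fact_sales_monthly
GROUP BY product_code
ORDER BY product_code
""").fetchall()
print(spans)
```

Run against the real table, a variant whose `last_month` falls well before December 2021 would be one of the dropped ones.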

The Dracula had a couple variants dropped too, and right before this rise:

In [None]:
query="""
SELECT date, fact_sales_monthly.product_code, product, variant, SUM(sold_quantity) as quantity
FROM fact_sales_monthly LEFT JOIN dim_product ON fact_sales_monthly.product_code=dim_product.product_code
WHERE product='AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache'
GROUP BY date, product, variant
"""
sales_mv = pd.read_sql_query(query, con)
sales_mv = sales_mv.dropna()

fig = px.line(sales_mv, x="date", y="quantity", color="product_code")
names = {"A0118150101": "Dracula Standard", "A0118150102": "Dracula Plus", "A0118150103": "Dracula Premium", "A0118150104": "Dracula Premium Plus", "A0219150201": "WereWolf Standard", "A0219150202": "WereWolf Plus", "A0219150203": "WereWolf Premium", "A0320150301": "Zion Saga Standard", "A0320150302": "Zion Saga Plus", "A0320150303": "Zion Saga Premium", "A0418150101": "Mforce Gen X Std. 1", "A0418150102": "Mforce Gen X Std. 2", "A0418150103": "Mforce Gen X Std. 3", "A0418150104": "Mforce Gen X Plus 1"}
fig.for_each_trace(lambda t: t.update(name = names[t.name]))
fig.update_layout(title="The abandonment of Dracula variants", xaxis_title="Month", yaxis_title="Quantity sold", legend_title_text="Variant")
fig.show()

But if we do say the loss of these variants had a good cause and we trust the totals throughout as meaningful, let's see if the differences in sales numbers are significant, in the statistical sense. We'll compare each product's monthly quantities sold, yielding a table like this:
<table>
<tr>
 <td>Product</td>
 <td>Dracula HDD</td>
 <td>Mforce Gen X</td>
 <td>WereWolf HDD</td>
 <td>Zion Saga</td>
</tr>
<tr>
 <td>Dracula HDD</td>
 <td>—</td>
 <td>TBD</td>
 <td>TBD</td>
 <td>TBD</td>
</tr>
<tr>
 <td>Mforce Gen X</td>
 <td>TBD</td>
 <td>—</td>
 <td>TBD</td>
 <td>TBD</td>
</tr>
<tr>
 <td>WereWolf HDD</td>
 <td>TBD</td>
 <td>TBD</td>
 <td>—</td>
 <td>TBD</td>
</tr>
<tr>
 <td>Zion Saga</td>
 <td>TBD</td>
 <td>TBD</td>
 <td>TBD</td>
 <td>—</td>
</tr>
</table>

In [None]:
import scipy.stats as st
dracula = sales_mp[sales_mp["product"] == "AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache"]["quantity"]
gen_x = sales_mp[sales_mp["product"] == "AQ Mforce Gen X"]["quantity"]

results = st.mannwhitneyu(dracula, gen_x, use_continuity=True, alternative="greater")
print("p-value:", results.pvalue)

In [None]:
dracula = sales_mp[sales_mp["product"] == "AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache"]
dracula = dracula[dracula["date"] >= "2018-09-01"]["quantity"]
werewolf = sales_mp[sales_mp["product"] == "AQ WereWolf NAS Internal Hard Drive HDD – 8.89 cm"]["quantity"]

results = st.mannwhitneyu(dracula, werewolf, use_continuity=True, alternative="greater")
print("p-value:", results.pvalue)

In [None]:
dracula = sales_mp[sales_mp["product"] == "AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache"]
dracula = dracula[dracula["date"] >= "2019-09-01"]["quantity"]
zion = sales_mp[sales_mp["product"] == "AQ Zion Saga"]["quantity"]

results = st.mannwhitneyu(dracula, zion, use_continuity=True, alternative="greater")
print("p-value:", results.pvalue)

In [None]:
gen_x = sales_mp[sales_mp["product"] == "AQ Mforce Gen X"]
gen_x = gen_x[gen_x["date"] >= "2018-09-01"]["quantity"]
werewolf = sales_mp[sales_mp["product"] == "AQ WereWolf NAS Internal Hard Drive HDD – 8.89 cm"]["quantity"]

results = st.mannwhitneyu(gen_x, werewolf, use_continuity=True, alternative="less")
print("p-value:", results.pvalue)

In [None]:
gen_x = sales_mp[sales_mp["product"] == "AQ Mforce Gen X"]
gen_x = gen_x[gen_x["date"] >= "2019-09-01"]["quantity"]
zion = sales_mp[sales_mp["product"] == "AQ Zion Saga"]["quantity"]

results = st.mannwhitneyu(gen_x, zion, use_continuity=True, alternative="less")
print("p-value:", results.pvalue)

In [None]:
werewolf = sales_mp[sales_mp["product"] == "AQ WereWolf NAS Internal Hard Drive HDD – 8.89 cm"]
werewolf = werewolf[werewolf["date"] >= "2019-09-01"]["quantity"]
zion = sales_mp[sales_mp["product"] == "AQ Zion Saga"]["quantity"]

results = st.mannwhitneyu(werewolf, zion, use_continuity=True, alternative="greater")
print("p-value:", results.pvalue)
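
For anyone who wants to see what the test in these cells is measuring, here is a bare-bones sketch of a one-sided Mann-Whitney U test. It is not scipy's implementation: it uses the normal approximation with a continuity correction and skips the tie correction, and scipy may pick an exact method for small samples, so the p-values will not necessarily match. The function name and the sample numbers are made up.

```python
import math

def mann_whitney_greater(x, y):
    """One-sided Mann-Whitney U test of 'x tends to be greater than y'.

    Normal approximation with continuity correction and no tie
    correction, so the p-value is only approximate.
    """
    n1, n2 = len(x), len(y)
    # U counts the (x, y) pairs that x wins; a tie counts as half a win.
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean - 0.5) / sd
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return u, p_value

# Hypothetical monthly quantities: every x value beats every y value.
u, p = mann_whitney_greater([5, 6, 7], [1, 2, 3])
print(u, p)
```

The key point is that the test only looks at which series wins each pairwise comparison, not by how much, which is why it copes with the very uneven 2021 spike.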

Because we're doing multiple tests, in addition to an alpha of 1/20, the Benjamini-Hochberg method will be used:

In [None]:
print(0.0002 < (1/6)*0.05, 0.005 < (2/6)*0.05, 0.009 < (3/6)*0.05, 0.020 < (4/6)*0.05, 0.096 < (5/6)*0.05, 0.165 < (6/6)*0.05)
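
The one-line comparison above works for these six values. As a slightly more general sketch, the function below applies the Benjamini-Hochberg step-up rule to any list of p-values: sort them, find the largest rank k whose p-value is at or below (k/m) times alpha, and reject everything up to that rank. The function name is my own; the p-values are the six from the cell above.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= cutoff_rank
    return rejected

p_values = [0.0002, 0.005, 0.009, 0.020, 0.096, 0.165]
decisions = benjamini_hochberg(p_values)
print(decisions)
```

The step-up rule can reject a p-value that fails its own threshold if a larger one passes, which a plain element-by-element comparison would miss. That doesn't happen with these numbers, so both approaches agree here.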

Here's the table. Each cell is the p-value for the row's product having more monthly sales than the column's product, and bold means significance.
<table>
<tr>
 <td>Product</td>
 <td>Dracula HDD</td>
 <td>Mforce Gen X</td>
 <td>WereWolf HDD</td>
 <td>Zion Saga</td>
</tr>
<tr>
 <td>Dracula HDD</td>
 <td>—</td>
 <td><b>.000</b></td>
 <td><b>.020</b></td>
 <td><b>.009</b></td>
</tr>
<tr>
 <td>Mforce Gen X</td>
 <td>1</td>
 <td>—</td>
 <td>.995</td>
 <td>.835</td>
</tr>
<tr>
 <td>WereWolf HDD</td>
 <td>.980</td>
 <td><b>.005</b></td>
 <td>—</td>
 <td>.096</td>
</tr>
<tr>
 <td>Zion Saga</td>
 <td>.991</td>
 <td>.165</td>
 <td>.904</td>
 <td>—</td>
</tr>
</table>

The Dracula doesn't climb as much in the big 2021 rise, but thanks to it consistently beating everything in every other time period, it is found to be the best. The WereWolf also comfortably performs better than the Mforce Gen X.

*Profit* by year, as opposed to quantity (or revenue):

In [None]:
query="""
SELECT date, product_code, SUM(sold_quantity) as quantity
FROM fact_sales_monthly
GROUP BY product_code, date
"""
revenues = pd.read_sql_query(query, con)
revenues = revenues.dropna()

query="""
SELECT * FROM fact_gross_price
WHERE product_code IN (SELECT product_code FROM fact_sales_monthly)
"""
read = pd.read_sql_query(query, con)

In [None]:
# Keep only the year part of each date string (vectorised, avoids chained assignment)
revenues["date"] = revenues["date"].str[0:4]
revenues_grouped = revenues.groupby(["date", "product_code"], as_index=False).agg({"quantity": "sum"})
revenues_grouped = revenues_grouped.rename(columns={"date": "year"})
revenues_grouped["year"] = revenues_grouped["year"].astype("int64")
query="""
SELECT fact_manufacturing_cost.product_code, fact_manufacturing_cost.cost_year AS year, fact_manufacturing_cost.manufacturing_cost, fact_gross_price.gross_price, fact_gross_price.gross_price - fact_manufacturing_cost.manufacturing_cost AS difference
FROM fact_manufacturing_cost LEFT JOIN fact_gross_price ON fact_manufacturing_cost.product_code=fact_gross_price.product_code AND year=fact_gross_price.fiscal_year
ORDER BY difference ASC
"""
diffs = pd.read_sql_query(query, con)
merged = diffs.merge(revenues_grouped, how="inner", on=["product_code", "year"])

query="""
SELECT product_code, product FROM dim_product
"""
codes = pd.read_sql_query(query, con)
merged = merged.merge(codes, how="left", on="product_code")
merged["year"].value_counts()

merged["profit"] = merged["difference"] * merged["quantity"]
merged_grouped = merged.groupby(["year", "product"], as_index=False).agg({"profit": "sum"})

fig = px.line(merged_grouped, x="year", y="profit", color="product")
names = {"AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache": "AQ Dracula HDD", "AQ Mforce Gen X": "AQ Mforce Gen X", "AQ WereWolf NAS Internal Hard Drive HDD – 8.89 cm": "AQ WereWolf HDD", "AQ Zion Saga": "AQ Zion Saga"}
fig.for_each_trace(lambda t: t.update(name = names[t.name]))
fig.update_layout(title="The rise of everything, but especially the Zion Saga and WereWolf", xaxis_title="Year", yaxis_title="Total profit", legend_title_text="Product")
fig.update_xaxes(nticks=4)
fig.show()
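
Stripped of the merges, the profit calculation in the cell above is just unit margin times quantity, summed per year and product. Here it is on a handful of made-up rows (the products, margins and quantities are hypothetical).

```python
# Hypothetical rows: (year, product, unit_margin, quantity).
rows = [
    (2020, "Product A", 10.0, 100),
    (2020, "Product A", 12.0, 50),   # a second variant of the same product
    (2021, "Product A", 10.0, 300),
    (2021, "Product B", 4.0, 200),
]

# Sum margin * quantity for each (year, product) pair.
profit = {}
for year, product, margin, quantity in rows:
    key = (year, product)
    profit[key] = profit.get(key, 0.0) + margin * quantity

print(profit)
```

Variants with different margins get weighted by their own quantities before being added together, which is why the merge on product code has to happen before the grouping by product name.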

As you could guess from the sales chart, we again see everything rise from 2020 to 2021, with the WereWolf and Zion Saga seeing awfully dramatic ones. It's worth noting that the Mforce Gen X is the only non-HDD:

In [None]:
query="""
SELECT DISTINCT product, category FROM dim_product
WHERE product IN ('AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache', 'AQ Mforce Gen X', 'AQ WereWolf NAS Internal Hard Drive HDD – 8.89 cm', 'AQ Zion Saga')
"""
read = pd.read_sql_query(query, con)
read

So maybe beating up on it isn't fair.

### Regional differences

If you look at the groupings, "market" is the country, and there are so many little ones they're begging to be merged. "region" is larger than continental, and its groups are too broad to feel like they'll have much to say. "sub_zone" is nicely in the middle. Specifically, the subzones are:
* ANZ: Australia and New Zealand
* India: India
* LATAM: South America and Mexico (includes Brazil)
* NA: US and Canada (excludes Mexico, despite the name)
* NE: Northern Europe, Central Europe, UK, and Benelux
* ROA: Asia sans India
* SE: Southern Europe and France

However, this doesn't mean they're of roughly equal size saleswise. Here are the sales numbers by sub-zone and product across the whole period measured:

In [None]:
query="""
SELECT date, product, customer_code, SUM(sold_quantity) as quantity
FROM fact_sales_monthly LEFT JOIN dim_product ON fact_sales_monthly.product_code=dim_product.product_code
GROUP BY product, date, customer_code
"""
sales_mpc = pd.read_sql_query(query, con)
sales_mpc = sales_mpc.dropna()
sales_mpc["customer_code"] = sales_mpc["customer_code"].astype("int64")
sales_mpc["quantity"] = sales_mpc["quantity"].astype("int64")
sales_mpc.head(10)

query="SELECT customer_code, market, sub_zone, region FROM dim_customer"
customer_areas = pd.read_sql_query(query, con) # There are a few misspelled countries, but that's okay

sales_mpcm = sales_mpc.merge(customer_areas, how="left", on="customer_code")
# sales_mpcm.head(10)

In [None]:
sales_mpcm.pivot_table(index="product", columns="sub_zone", values="quantity", aggfunc="sum", margins=True, margins_name="Total")

In [None]:
sales_mpcm_sums = sales_mpcm.groupby(["date", "product", "sub_zone"], as_index=False).agg({"quantity": "sum"})
sales_mpcm_oce = sales_mpcm_sums[sales_mpcm_sums["sub_zone"] == "ANZ"]
sales_mpcm_ind = sales_mpcm_sums[sales_mpcm_sums["sub_zone"] == "India"]
sales_mpcm_lat = sales_mpcm_sums[sales_mpcm_sums["sub_zone"] == "LATAM"]
sales_mpcm_aam = sales_mpcm_sums[sales_mpcm_sums["sub_zone"] == "NA"]
sales_mpcm_eun = sales_mpcm_sums[sales_mpcm_sums["sub_zone"] == "NE"]
sales_mpcm_roa = sales_mpcm_sums[sales_mpcm_sums["sub_zone"] == "ROA"]
sales_mpcm_eus = sales_mpcm_sums[sales_mpcm_sums["sub_zone"] == "SE"]

In [None]:
fig = px.line(sales_mpcm_oce, x="date", y="quantity", color="product")
names = {"AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache": "AQ Dracula HDD", "AQ Mforce Gen X": "AQ Mforce Gen X", "AQ WereWolf NAS Internal Hard Drive HDD – 8.89 cm": "AQ WereWolf HDD", "AQ Zion Saga": "AQ Zion Saga"}
fig.for_each_trace(lambda t: t.update(name = names[t.name]))
fig.update_layout(title="Tasman countries sales chart", xaxis_title="Month", yaxis_title="Quantity sold", legend_title_text="Product")
fig.show()

Late 2019 and late 2020 are more similar in volume, but other than that there isn't much difference from the world overall chart.

In [None]:
fig = px.line(sales_mpcm_ind, x="date", y="quantity", color="product")
names = {"AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache": "AQ Dracula HDD", "AQ Mforce Gen X": "AQ Mforce Gen X", "AQ WereWolf NAS Internal Hard Drive HDD – 8.89 cm": "AQ WereWolf HDD", "AQ Zion Saga": "AQ Zion Saga"}
fig.for_each_trace(lambda t: t.update(name = names[t.name]))
fig.update_layout(title="Indian sales chart", xaxis_title="Month", yaxis_title="Quantity sold", legend_title_text="Product")
fig.show()

Not much is very different here at all. The WereWolf continues to rise dramatically in December 2021 rather than being reasonably flat, which makes the chart look a bit different, but there really isn't much special to say.

In [None]:
fig = px.line(sales_mpcm_roa, x="date", y="quantity", color="product")
names = {"AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache": "AQ Dracula HDD", "AQ Mforce Gen X": "AQ Mforce Gen X", "AQ WereWolf NAS Internal Hard Drive HDD – 8.89 cm": "AQ WereWolf HDD", "AQ Zion Saga": "AQ Zion Saga"}
fig.for_each_trace(lambda t: t.update(name = names[t.name]))
fig.update_layout(title="Rest of Asia sales chart", xaxis_title="Month", yaxis_title="Quantity sold", legend_title_text="Product")
fig.show()

Late 2019 looks better for this sub-zone than it does in the world total. Also, the Zion Saga and Mforce Gen X oddly don't rise in October 2021. There's nothing too dramatic, though.

In [None]:
fig = px.line(sales_mpcm_lat, x="date", y="quantity", color="product")
names = {"AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache": "AQ Dracula HDD", "AQ Mforce Gen X": "AQ Mforce Gen X", "AQ WereWolf NAS Internal Hard Drive HDD – 8.89 cm": "AQ WereWolf HDD", "AQ Zion Saga": "AQ Zion Saga"}
fig.for_each_trace(lambda t: t.update(name = names[t.name]))
fig.update_layout(title="South American-Mexican sales chart", xaxis_title="Month", yaxis_title="Quantity sold", legend_title_text="Product")
fig.show()

With 7 customer codes and, as shown above, so few units sold compared to the other areas, this is easily the smallest region, so it makes sense that it would be a bit weird. With an extra bump in mid-2020, the shape there is different, and the late 2019 bump goes higher than the late 2020 one. Also, the WereWolf is awfully volatile throughout those periods, but again, that could just be small numbers. The fact that the first half of 2020 isn't a valley is worth thinking about, though.

In [None]:
fig = px.line(sales_mpcm_aam, x="date", y="quantity", color="product")
names = {"AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache": "AQ Dracula HDD", "AQ Mforce Gen X": "AQ Mforce Gen X", "AQ WereWolf NAS Internal Hard Drive HDD – 8.89 cm": "AQ WereWolf HDD", "AQ Zion Saga": "AQ Zion Saga"}
fig.for_each_trace(lambda t: t.update(name = names[t.name]))
fig.update_layout(title="Anglo-American sales chart", xaxis_title="Month", yaxis_title="Quantity sold", legend_title_text="Product")
fig.show()

Believe it or not, this is a different chart from the world overall.

In [None]:
fig = px.line(sales_mpcm_eun, x="date", y="quantity", color="product")
names = {"AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache": "AQ Dracula HDD", "AQ Mforce Gen X": "AQ Mforce Gen X", "AQ WereWolf NAS Internal Hard Drive HDD – 8.89 cm": "AQ WereWolf HDD", "AQ Zion Saga": "AQ Zion Saga"}
fig.for_each_trace(lambda t: t.update(name = names[t.name]))
fig.update_layout(title="Europe North sales chart", xaxis_title="Month", yaxis_title="Quantity sold", legend_title_text="Product")
fig.show()

The rises to December 2019 are a bit straighter. That's about it.

In [None]:
fig = px.line(sales_mpcm_eus, x="date", y="quantity", color="product")
names = {"AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache": "AQ Dracula HDD", "AQ Mforce Gen X": "AQ Mforce Gen X", "AQ WereWolf NAS Internal Hard Drive HDD – 8.89 cm": "AQ WereWolf HDD", "AQ Zion Saga": "AQ Zion Saga"}
fig.for_each_trace(lambda t: t.update(name = names[t.name]))
fig.update_layout(title="Europe South sales chart", xaxis_title="Month", yaxis_title="Quantity sold", legend_title_text="Product")
fig.show()

As if to be a counterpart to Europe North's sharp triangles, here we have smooth humps. But it's honestly minor differences.

## Conclusions
First off, all of these products are good. There's no reason to get rid of or ease off on any of them. Even though the Mforce Gen X does amusingly badly in the comparisons, it still makes plenty of money. As for which is the best or what should be promoted particularly hard, perhaps the Dracula. It didn't have the same rise as the other HDDs, but surely you can put that down to some of the versions disappearing. If the sales of the dropped variants didn't go to other Dracula variants, the product feels like it should be higher, because the absent ones had been selling just fine. If some of those sales did move over, that makes it feel like it was underpromoted or otherwise underrecommended, and again it "should have" been the most popular in late 2021. Or maybe the build quality or reputation took a hit. With the way it happened right before that huge rise got started, and without any kind of advertising data, it isn't possible to know, but these numbers are big enough that there's got to be some reason behind it all. Again, assuming there isn't a big chunk of missing data.

There aren't any notable regional differences except for the LATAM bump. To be honest, it is probably just the sub-zone's small size in terms of customers and income making for funny numbers, or it could be the early days of COVID-19 affecting different areas differently – we do have to remember that with this timespan, pretty much every conclusion has a COVID-19 asterisk on it – but it is a deviation from what every other year in every other area has, so it is worth looking into what was going on there at that time or if there was something different being done. As for categories, the hard drives did better than the graphics card, but it's hard to say it means anything with a total of four products compared, even before the matter of the disappearing variants.

And that is a matter to look into. The Standard 3 variant of the Mforce Gen X has a much higher sales total than the others, seemingly because it was the only one still being sold after August 2020, but why were the others dropped? They were selling just as well. Same deal for the Dracula.