In [None]:
# First make sure to install all required packages.
# You can do it by running the following command:

In [None]:
# ]add Arrow CSV DataFrames Plots Clustering Distances FreqTables

In [None]:
# If you launched Jupyter in a directory that contains Project.toml and Manifest.toml,
# use the command below to install the required packages with fixed versions.
# Check the Project introduction for more information.

In [None]:
# ]instantiate

In [None]:
# Import required libraries
using Arrow
using CSV
using DataFrames
using Plots
using Clustering
using Distances
using Random
using Statistics
using FreqTables

In [None]:
# Read clean dataset from Arrow file into DataFrame
sales_norm = DataFrame(Arrow.Table("sales_norm.arrow"))

In [None]:
# Convert the DataFrame to a Matrix and make sure that each product is one column of the resulting matrix
# (kmeans from Clustering.jl treats each column as one observation)
cluster_data = Matrix(sales_norm)

In [None]:
# Check the extrema of the column means - they should be very close to 0
extrema(mean(cluster_data, dims=1))

In [None]:
# Check the extrema of the column standard deviations - they should be very close to 1
extrema(std(cluster_data, dims=1))

In [None]:
# Set the seed for reproducibility
# If you do not set a seed in your solution, you may get slightly different
# results because of the probabilistic character of the K-means algorithm
Random.seed!(42)

In [None]:
# Produce clustering for 2-20 clusters with kmpp (K-means++) seeding
res_kmpp = [kmeans(cluster_data, i, init=:kmpp) for i in 2:20];

In [None]:
# Produce clustering for 2-20 clusters with rand (random) seeding
res_rand = [kmeans(cluster_data, i, init=:rand) for i in 2:20];

In [None]:
# Check how many iterations each seeding algorithm required before converging, for all values of 'k'
# We might expect K-means++ to converge faster due to its 'intelligent' seeding, but this is not the case for our task
plot(2:20, hcat(getfield.(res_kmpp,:iterations), getfield.(res_rand,:iterations)),
    xlab="Number of clusters", ylab="Iterations until convergence", label=["K-means++" "Random seeding"])

In [None]:
# Visualize the total cost (Sum of Squared Errors) for both seeding algorithms and all values of 'k'
# We can choose the number of clusters using the 'elbow' method
# The curve is really smooth and it's hard to pick the proper number of clusters
# This is a common situation in practice
# We'll use 4 clusters as we want to interpret the results with a reasonable number of groups
# and there seems to be a slight change of slope there
plot(2:20, hcat(getfield.(res_kmpp,:totalcost), getfield.(res_rand,:totalcost)), 
    xlab="Number of clusters", ylab="Cost (SSE)", label=["K-means++" "Random seeding"])
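
When the elbow is hard to see by eye, one rough heuristic is to look for the largest second difference of the cost curve, i.e. the point where the slope flattens most abruptly. The sketch below is only an illustration under that assumption: `elbow_k` is a hypothetical helper, not part of Clustering.jl, and the toy costs are made-up numbers rather than results from this notebook.

```julia
# Hypothetical helper: return the k at which the cost curve bends most sharply,
# measured by the largest second difference of the cost values.
function elbow_k(ks, costs)
    length(ks) == length(costs) || throw(ArgumentError("ks and costs must have equal length"))
    length(ks) >= 3 || throw(ArgumentError("need at least three points"))
    # Second difference at every interior point: c[i-1] - 2*c[i] + c[i+1]
    curvature = [costs[i-1] - 2 * costs[i] + costs[i+1] for i in 2:length(costs)-1]
    # curvature[j] belongs to the interior point ks[j + 1]
    return ks[argmax(curvature) + 1]
end

# Toy cost curve (made-up values) with a visible bend at k = 4
toy_ks = 2:7
toy_costs = [100.0, 70.0, 45.0, 40.0, 36.0, 33.0]
elbow_k(toy_ks, toy_costs)
```

With the objects defined above, the call would be `elbow_k(2:20, getfield.(res_kmpp, :totalcost))`. On a curve as smooth as ours the largest second difference may be driven by noise, so it is best treated as a hint to check against the plot, not as a decision rule.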

In [None]:
# Check how similar the clusterings for random seeding and k-means++ are
# (index 3 corresponds to k = 4, because the results start at k = 2)
freqtable(res_kmpp[3].assignments, res_rand[3].assignments)
# The results are quite consistent, but not identical
# therefore we decide to run the k-means algorithm for 4 clusters 1000 times to get the best assignment
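
The contingency table above depends on how each run happened to number its clusters, so agreement can be easier to summarize with a score that ignores label order. A minimal sketch, assuming the `randindex` function from Clustering.jl, which returns a tuple whose first element is the adjusted Rand index; the toy assignment vectors are made up for illustration.

```julia
using Clustering

# Two toy assignment vectors describing the same partition with permuted labels
toy_a = [1, 1, 2, 2, 3, 3]
toy_b = [2, 2, 3, 3, 1, 1]

# randindex returns (adjusted Rand index, Rand index, Mirkin's index, Hubert's index)
ari, ri, _, _ = randindex(toy_a, toy_b)
ari
```

In this notebook the same call would be `randindex(res_kmpp[3].assignments, res_rand[3].assignments)`. An adjusted Rand index close to 1 means the two seedings found nearly the same partition, while a value near 0 means agreement no better than chance.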

In [None]:
# Produce clustering for 4 clusters with kmpp (K-means++) seeding
kmpp4 = [kmeans(cluster_data, 4, init=:kmpp) for _ in 1:1000];

In [None]:
# Check the coefficient of variation of the total cost across the 1000 K-means++ runs
tc_kmpp4 = getfield.(kmpp4, :totalcost);
string("Coefficient of variation: ", std(tc_kmpp4)/mean(tc_kmpp4)*100, "%")

In [None]:
# Produce clustering for 4 clusters with random seeding
krand4 = [kmeans(cluster_data, 4, init=:rand) for _ in 1:1000];

In [None]:
# Check the coefficient of variation of the total cost across the 1000 random-seeding runs
tc_krand4 = getfield.(krand4, :totalcost);
string("Coefficient of variation: ", std(tc_krand4)/mean(tc_krand4)*100, "%")
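
The coefficient of variation used in the two cells above is the standard deviation divided by the mean, expressed here as a percentage. Since the same formula is written out twice, it could be wrapped in a small function; `cv_percent` below is a hypothetical name, and the toy values are made up.

```julia
using Statistics

# Hypothetical helper: coefficient of variation in percent (sample std / mean * 100)
cv_percent(x) = std(x) / mean(x) * 100

# Toy total costs (made-up values): std = 2, mean = 100
toy_totalcosts = [98.0, 100.0, 102.0]
cv_percent(toy_totalcosts)
```

Applied to this notebook, `cv_percent(tc_kmpp4)` and `cv_percent(tc_krand4)` would give the same numbers as the inline expressions. A low value suggests that repeated runs land on solutions of similar quality.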

In [None]:
# Pick the best clustering based on conducted evaluations
opt_clustering = if minimum(tc_kmpp4) <= minimum(tc_krand4)
    kmpp4[argmin(tc_kmpp4)]
else
    krand4[argmin(tc_krand4)]
end;

In [None]:
# Check the number of members in each cluster
# The clusters look quite balanced
freqtable(opt_clustering.assignments)

In [None]:
# Plot cluster averages
# Each cluster has some distinct characteristic we should summarize for the recipients of the report
plot(hcat([mean(cluster_data[:, opt_clustering.assignments .== i], dims=2) for i in 1:4]...), 
    xlab="Week", ylab="Normalized sales", labels=[1 2 3 4], linewidth=2)

In [None]:
# Plot cluster standard deviations
# No huge difference here, but we can spot that some clusters have higher variability in general
plot(hcat([std(cluster_data[:, opt_clustering.assignments .== i], dims=2) for i in 1:4]...), 
    xlab="Week", ylab="Std. dev. of normalized sales", labels=[1 2 3 4], linewidth=2)

In [None]:
# Save the k-means assignments to a text file
open("kmeans_assignments.txt", "w") do io
  foreach(e -> println(io, e), opt_clustering.assignments)
end

**Analysis of clustering results**

Based on the evaluation of the clustering results, we grouped all products in the dataset into 4 clusters.

Each cluster has distinct characteristics, summarized below:
* Cluster 1 sales were increasing steadily until mid-year, when they dropped to extremely low values. After that we again see a steady increase in sales. We should investigate what exactly happened around that mid-year period - maybe there was a new hot release (initial increase in sales), but it was faulty and our customers abandoned the product?
* Cluster 2 is similar to Cluster 3, as it has a positive sales trend and its absolute sales values are on a similar level. However, there is a rapid rise mid-year, followed by an equally sudden drop in sales back to the previous level. There is also a huge sales boost at the end of the tracked period. This group of products has been really popular recently for some reason.
* Cluster 3 is the most stable: there is no spike in sales and it shows a slight increasing trend.
* Cluster 4 maintained high, steady sales for 20 weeks, which was followed by a sharp drop in revenue. Sales were recovering after the drop, but we can spot another decrease during the last weeks. Maybe our other products are cannibalizing sales for that cluster?

Variance analysis revealed no significant difference in the clusters' sales stability. Clusters 2 and 3, however, switch from very small variance at the beginning of the year to high variance mid-year and at the end of the period.