## Step 02 - Create Clusters
We need to group our learning data into clusters. Once we have the clusters, we can find the closest cluster for each test data vector. Then, based on the learning data in that cluster, we can make peak load predictions for our test data.

To make it possible to compute the distance between our clusters and the test data, the test data vectors and the vectors in the clusters must have the same dimension. As the test data has a lower dimension (its resolution is daily, not half-hourly), we first need to reduce the resolution of the half-hourly learning data to daily. Note: we need to keep the original resolution as well for the peak load predictions later.

After reducing the resolution we can create the clusters. We use [spectral clustering](https://towardsdatascience.com/spectral-clustering-aba2640c0d5b) for this. The number of clusters needs to be set in the program; usually fewer than 10 is a good choice.
Note: the project could be improved by trying different numbers of clusters and keeping the one with the best accuracy.

After clustering the reduced-dimension learning data vectors, we compute the average of the vectors in each cluster, which gives a cluster-center vector. At this point we can start using the prepared test data.

For each test data vector we compute the distance to every cluster-center vector and assign the test data vector to the closest cluster. Then, based on the (high-resolution version of the) learning data vectors in that cluster, we can give a half-hourly load prediction for the low-resolution test data vectors.
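
The three steps described above can be written compactly as formulas. This is only a sketch of the notation: the symbols $C_k$, $x_i$, $\tilde{x}_i$, $y$, $c_k$ and $\hat{x}$ are introduced here for illustration and do not appear in the code, and the Euclidean norm is assumed because it is the default of `distance_matrix`.

```latex
% C_k         : index set of the learning vectors in cluster k
% x_i         : full-resolution (half-hourly) learning vector
% \tilde{x}_i : reduced-resolution version of x_i
% y           : a low-resolution test vector
\begin{align}
c_k        &= \frac{1}{|C_k|} \sum_{i \in C_k} \tilde{x}_i
            && \text{(cluster center)} \\
k^*(y)     &= \arg\min_{k} \lVert y - c_k \rVert_2
            && \text{(closest cluster)} \\
\hat{x}(y) &= \frac{1}{|C_{k^*(y)}|} \sum_{i \in C_{k^*(y)}} x_i
            && \text{(half-hourly prediction)}
\end{align}
```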

### Note:
The code below uses some hand-typed example data. We do not use the real, large data sets here yet (not even the cleaned and interpolated meta reading data). The code only illustrates how the process works.

In [8]:
%pylab inline
from sklearn.cluster import SpectralClustering
from scipy.spatial import distance_matrix

Populating the interactive namespace from numpy and matplotlib


In [2]:
#smd: smart meter data -> 9 points / record
# "learning data"
smd = [[1,3,4, 1,0,1, 2,1,3],    # (X._.X)   1   # <-- clustered by hand, to see how well the
       [0,1,2, 4,3,1, 1,1,0],    # |_.X._|   2   #     algorithm works
       [.1,.8,.2, 3,3,2, 3,4,3], # |_.X.X|   2
       [.9,.1,1, 0,0,1, 4,4,3],  # <_._.X>   0
       [.2, .9,1, 0,3,2, 3,3,5], # |_.X.X|   2
       [3,4,4, 1,0,0, 3,1,2],    # (X._.X)   1
       [0,0,1, 0,1,0, 4,1,5]]    # <_._.X>   0

# mr: meta reading data
# "test data"
mr = [[9,1,3],                    # (X._.x)    1
      [0.8,7,15],                 # |_.X.X|    2   <-- pay attention to this (between type 2 and type 0)
      [3,3,12]]                   # <_._.X>    0  

smd2 = []

# reduce the resolution to match the test data's resolution
for vec in smd:
    smd2.append([sum(vec[:3]), sum(vec[3:6]), sum(vec[6:])]) # needs to be automated, for now split "by hand"

D = distance_matrix(smd2, smd2)
smd2  # low-resolution version of the smart meter ("learning") data

[[8, 2, 6],
 [3, 8, 2],
 [1.1, 8, 10],
 [2.0, 1, 11],
 [2.1, 5, 11],
 [11, 1, 6],
 [1, 1, 10]]
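
The hand-written slicing above could be automated by reshaping each vector into equally sized blocks and summing each block. The following is a minimal sketch, assuming that the vector length is an exact multiple of the number of blocks; the helper name `reduce_resolution` and the variable names are illustrative and not part of the notebook.

```python
import numpy as np

def reduce_resolution(vectors, n_blocks):
    """Sum consecutive, equally sized blocks of each vector.

    Assumption: every vector has the same length and that length
    is divisible by n_blocks.
    """
    data = np.asarray(vectors, dtype=float)
    n_vectors, length = data.shape
    if length % n_blocks != 0:
        raise ValueError("vector length must be divisible by n_blocks")
    block_size = length // n_blocks
    # (n_vectors, length) -> (n_vectors, n_blocks, block_size), then sum each block
    return data.reshape(n_vectors, n_blocks, block_size).sum(axis=2)

# First three example vectors from the learning data above
smd_example = [[1, 3, 4, 1, 0, 1, 2, 1, 3],
               [0, 1, 2, 4, 3, 1, 1, 1, 0],
               [.1, .8, .2, 3, 3, 2, 3, 4, 3]]

smd2_auto = reduce_resolution(smd_example, n_blocks=3)
print(smd2_auto)
```

With real data, `n_blocks` would be the length of one test data vector.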

In [3]:
N = 3 # number of clusters
clustering = SpectralClustering(n_clusters=N,
         assign_labels="discretize",
         random_state=0).fit(D)
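
One detail worth knowing: `SpectralClustering` uses `affinity="rbf"` by default, so `fit(D)` treats each row of the distance matrix as a feature vector. A possible alternative is to turn the distances into similarities and pass them with `affinity="precomputed"`. The sketch below shows that variant; the Gaussian kernel and the choice of the median distance as bandwidth are assumptions made for illustration, not something the project prescribes, and the resulting labels may differ from the ones shown in this notebook.

```python
import numpy as np
from scipy.spatial import distance_matrix
from sklearn.cluster import SpectralClustering

# Reduced-resolution learning data (same values as smd2 above)
reduced = np.array([[8, 2, 6], [3, 8, 2], [1.1, 8, 10], [2.0, 1, 11],
                    [2.1, 5, 11], [11, 1, 6], [1, 1, 10]], dtype=float)

dist = distance_matrix(reduced, reduced)

# Assumption: Gaussian kernel, bandwidth = median of the non-zero distances.
# Large distance -> small similarity, as a precomputed affinity requires.
sigma = np.median(dist[dist > 0])
affinity = np.exp(-dist ** 2 / (2 * sigma ** 2))

model = SpectralClustering(n_clusters=3,
                           affinity="precomputed",
                           assign_labels="discretize",
                           random_state=0)
labels = model.fit_predict(affinity)
print(labels)
```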



In [4]:
smLabels = clustering.labels_ # cluster ids, in the same order as the input vectors
smLabels   # smart meter labels: the algorithm forms the same groups we made "by hand" (only the ids 1 and 2 are swapped)

array([2, 1, 1, 0, 1, 2, 0], dtype=int64)
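
Cluster ids are arbitrary, so the labels from the algorithm should be compared with the hand-made labels as groupings, not number by number. A small sketch of such a check follows; the function name `same_partition` is hypothetical, and the hand-made labels are copied from the comments in the data cell.

```python
def same_partition(labels_a, labels_b):
    """Return True if both labelings group the items identically,
    regardless of which integer names which group."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # each id on one side must always map to the same id on the other side
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True

by_hand = [1, 2, 2, 0, 2, 1, 0]       # from the comments next to smd
by_algorithm = [2, 1, 1, 0, 1, 2, 0]  # smLabels as shown above

print(same_partition(by_hand, by_algorithm))  # True
```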

In [9]:
centers = [zeros(len(smd2[0])) for _ in range(N)]
# one independent null vector per cluster

for i in range(len(smd2)): # sum of vectors in each cluster
    vec = smd2[i]
    smLabel = smLabels[i] # actual cluster ID
    centers[smLabel] = [sum(x) for x in zip(vec, centers[smLabel])]
    
for smLabel in range(N): # divide it by the number of vectors in each cluster
    centers[smLabel] = [x / smLabels.tolist().count(smLabel) for x in centers[smLabel]]

D2 = distance_matrix(mr, centers)
mrLabels = [] # gives the id of the closest cluster for each 
            # meta reading data point
for row in D2:
    mrLabels.append(argmin(row))
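
The same two steps (cluster centers, then closest center) can also be written with NumPy arrays instead of explicit loops. This is a minimal sketch that reuses the example values and the cluster labels shown above; the variable names are illustrative.

```python
import numpy as np

# Reduced-resolution learning data and the cluster labels shown above
reduced = np.array([[8, 2, 6], [3, 8, 2], [1.1, 8, 10], [2.0, 1, 11],
                    [2.1, 5, 11], [11, 1, 6], [1, 1, 10]], dtype=float)
labels = np.array([2, 1, 1, 0, 1, 2, 0])

# Test data ("meta readings")
readings = np.array([[9, 1, 3], [0.8, 7, 15], [3, 3, 12]], dtype=float)

n_clusters = labels.max() + 1

# Center of cluster k = mean of the vectors whose label is k
centers = np.array([reduced[labels == k].mean(axis=0)
                    for k in range(n_clusters)])

# Euclidean distance of every reading to every center,
# shape (n_readings, n_clusters), computed by broadcasting
dists = np.linalg.norm(readings[:, None, :] - centers[None, :, :], axis=2)

nearest = dists.argmin(axis=1)
print(nearest.tolist())  # [2, 1, 0]
```

The second reading is almost equally far from clusters 1 and 0 (about 7.44 versus 7.53), which is why the data cell asks to pay attention to it.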

In [40]:
# now we know the closest cluster for each meta reading
# next step: fill in the half-hourly loads

# We make one prediction per cluster.
# Each meta reading vector gets the prediction
# of its closest cluster.
prediction = [[0. for i in range(len(smd[0]))] for j in range(N)]


#   Sum the vectors in each cluster:
for i in range(len(smd[0])):  # assumes that all vectors in smd have the same length (== len(smd[0]))
    for k in range(len(smd)):
        prediction[smLabels[k]][i] += smd[k][i]

#   Divide the sum by the number of vectors in each cluster
for i in range(N):
    norm = smLabels.tolist().count(i)
    prediction[i] = [j / norm for j in prediction[i]]

In [34]:
# create the half-hourly load prediction for each meta reading vector
# (built as a new list, so the original readings in mr are not overwritten):
mr2 = [prediction[label] for label in mrLabels]

mr2

[[2.0, 3.5, 4.0, 1.0, 0.0, 0.5, 2.5, 1.0, 2.5],
 [0.10000000000000002, 0.9, 1.0666666666666667, 2.3333333333333335, 3.0, 1.6666666666666667, 2.3333333333333335, 2.6666666666666665, 2.6666666666666665],
 [0.45, 0.05, 1.0, 0.0, 0.5, 0.5, 4.0, 2.5, 4.0]]
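
The per-cluster average and the final lookup can also be expressed with NumPy indexing. The sketch below assumes the cluster labels and closest-cluster ids obtained earlier in this section; the names `full`, `profiles` and `predicted` are illustrative. Indexing with an integer array returns a new array, so the original readings stay untouched.

```python
import numpy as np

# Full-resolution learning data and its cluster labels (values from above)
full = np.array([[1, 3, 4, 1, 0, 1, 2, 1, 3],
                 [0, 1, 2, 4, 3, 1, 1, 1, 0],
                 [.1, .8, .2, 3, 3, 2, 3, 4, 3],
                 [.9, .1, 1, 0, 0, 1, 4, 4, 3],
                 [.2, .9, 1, 0, 3, 2, 3, 3, 5],
                 [3, 4, 4, 1, 0, 0, 3, 1, 2],
                 [0, 0, 1, 0, 1, 0, 4, 1, 5]], dtype=float)
labels = np.array([2, 1, 1, 0, 1, 2, 0])

# Closest cluster of each meta reading, as found in the previous step
nearest = np.array([2, 1, 0])

# Average full-resolution profile of each cluster, shape (n_clusters, 9)
profiles = np.array([full[labels == k].mean(axis=0)
                     for k in range(labels.max() + 1)])

# One predicted profile per meta reading, shape (n_readings, 9)
predicted = profiles[nearest]
print(predicted.round(2))
```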