# Action plan

## Goal

The goal is to take a very large point cloud and use mini-batch k-means to cluster the points into texture/land types.

### For features:

For a point at (lon, lat) = (x0, y0), I take the point and its 8 nearest neighbours {(x1, y1), ..., (x8, y8)} and look up their data to build the feature array ((z0, r0, g0, b0), (z1, r1, g1, b1), ..., (z8, r8, g8, b8)).
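The neighbour lookup described above can be sketched with a KD-tree. This is only an illustration, not the code in `helper_functions`: the function name `gather_neighbour_features` and the column order (lon, lat, z, r, g, b) are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree


def gather_neighbour_features(points, k=8):
    """Build one feature row per point from itself and its k nearest neighbours.

    `points` is an (n, 6) array with columns (lon, lat, z, r, g, b);
    this column order is an assumption.
    """
    xy = points[:, :2]
    attrs = points[:, 2:6]  # (z, r, g, b) for every point
    tree = cKDTree(xy)
    # Ask for k + 1 neighbours because the nearest hit is the point itself.
    _, idx = tree.query(xy, k=k + 1)
    # attrs[idx] has shape (n, k + 1, 4); flatten it to (n, 4 * (k + 1)).
    return attrs[idx].reshape(len(points), -1)


rng = np.random.default_rng(0)
demo = rng.random((100, 6))  # random stand-in for the real point cloud
features = gather_neighbour_features(demo, k=8)
print(features.shape)  # (100, 36)
```

Each row holds the point's own (z, r, g, b) first, followed by those of its neighbours in order of increasing distance.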

## Import packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
from DataLoader import DataLoader
from helper_functions import *

## Read in data

In [None]:
POINT_RESOLUTION = 1e-6 # ~10 cm
dl = DataLoader()

In [None]:
dl.readData()

In [None]:
dl.data[0].head()

We've read in a total of 5 000 000 rows... but we'll use fewer than that.

In [None]:
dl.max_rows

## Visualize colour distribution

Below we plot a 3D histogram approximating the distribution of colours in the image; the (x, y, z) axes are (r, g, b) respectively. Honestly, this is mostly because it's neat, but there could be useful visualizations here, such as showing what the dominant colours (and hence land types?) are.

In [None]:
hist3d(dl.getScaledRgbArray(), azim=-105)
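To get at the dominant colours numerically, one option is to bin the RGB values in 3D and list the fullest bins. The sketch below is not what `hist3d` does internally; `dominant_colour_bins` is a hypothetical helper, and it assumes the RGB array is scaled to [0, 1].

```python
import numpy as np


def dominant_colour_bins(rgb, bins=8, top=3):
    """Return the `top` fullest RGB bins as ((r, g, b) bin centre, count) pairs.

    `rgb` is an (n, 3) array with values in [0, 1] (an assumption).
    """
    hist, edges = np.histogramdd(rgb, bins=bins, range=[(0, 1)] * 3)
    centres = [(e[:-1] + e[1:]) / 2 for e in edges]
    # Indices of the fullest bins, largest count first.
    order = np.argsort(hist, axis=None)[::-1][:top]
    result = []
    for flat in order:
        i, j, k = np.unravel_index(flat, hist.shape)
        centre = (centres[0][i], centres[1][j], centres[2][k])
        result.append((centre, int(hist[i, j, k])))
    return result


# Toy data: 70 greenish points and 30 brownish points.
rgb = np.vstack([
    np.tile([0.10, 0.60, 0.10], (70, 1)),
    np.tile([0.55, 0.40, 0.30], (30, 1)),
])
top_bins = dominant_colour_bins(rgb, bins=8, top=2)
print(top_bins[0])  # fullest bin: the greenish colour, count 70
```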

# Making clustering happen

### Operate on a subset of the data

In [None]:
# Use 500 000 points
data = dl.data[0]

### Generate the point features

Collect the (z, r, g, b) tuples of the 10 nearest neighbours and scale them by the inverse of the distance to each neighbour.

In [None]:
ptFtrs = generatePointFeatures(data.values)
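A minimal sketch of the inverse-distance scaling step, assuming the neighbour attributes and distances have already been gathered. The name `inverse_distance_scale` is hypothetical, and normalising the weights so they sum to 1 per point is my own assumption (it keeps the feature scale independent of the point spacing); `generatePointFeatures` may do this differently.

```python
import numpy as np


def inverse_distance_scale(neigh_attrs, neigh_dist, eps=1e-12):
    """Scale neighbour attributes by normalised inverse distance.

    neigh_attrs: (n, k, 4) array of (z, r, g, b) per neighbour.
    neigh_dist:  (n, k) array of distances to those neighbours.
    """
    weights = 1.0 / (neigh_dist + eps)  # eps guards against zero distance
    weights /= weights.sum(axis=1, keepdims=True)  # weights sum to 1 per point
    return neigh_attrs * weights[:, :, None]


# Toy example: one point with two neighbours at distances 1 and 3.
attrs = np.array([[[10.0, 255.0, 0.0, 0.0],
                   [20.0, 0.0, 255.0, 0.0]]])
dist = np.array([[1.0, 3.0]])
scaled = inverse_distance_scale(attrs, dist)
print(scaled)  # weights are approximately 0.75 and 0.25
```

Closer neighbours contribute more to the feature vector, so a point's features are dominated by its immediate surroundings.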

In [None]:
# number of clusters
n_clusters=20
km = miniBatchKMeans(ptFtrs, n_clusters=n_clusters)

In [None]:
clusterMembership = getClusterMembership(km, ptFtrs)
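The helpers `miniBatchKMeans` and `getClusterMembership` live in `helper_functions` and are not shown here. As a rough sketch of what the two steps could look like with scikit-learn directly (the random stand-in data, the standardisation step, and the parameter values are all assumptions):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((2000, 36))  # random stand-in for ptFtrs

# Standardise so height and colour columns contribute on a similar scale
# (an assumption; the helper may not do this).
X_scaled = StandardScaler().fit_transform(X)

model = MiniBatchKMeans(n_clusters=20, batch_size=1024, n_init=3, random_state=0)
labels = model.fit_predict(X_scaled)  # one cluster index per point

print(labels.shape, model.cluster_centers_.shape)
```

Mini-batch k-means updates the centroids from small random batches rather than the full dataset, which is what makes it practical for a point cloud of this size.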

## Plotting the data

The original data

In [None]:
# number of points to plot
N = 100000

In [None]:
plt.figure(figsize=(20,15))
# c holds explicit RGB values, so no colormap is needed
plt.scatter(x=data.values[:N,0], y=data.values[:N,1], c=data.values[:N, 3:6]/255, s=1)
plt.xlabel('Longitude')
plt.ylabel('Latitude');

All the clusters together

In [None]:
makeScatterPlot(data.values[:N, :2], clusterMembership[:N], nClusters=n_clusters)

## Individual clusters

In [None]:
clusterData = {j : data.values[clusterMembership==j, :] for j in range(n_clusters)}

In [None]:
ncols = 4
# Integer row count (rounded up); plt.subplot needs ints
nrows = (n_clusters + ncols - 1) // ncols

plt.figure(figsize=(5*ncols, 7*nrows))


for j in range(n_clusters):
    clusterj = clusterData[j]
    plt.subplot(nrows, ncols, j+1)
    plt.scatter(clusterj[:N,0], clusterj[:N, 1], c=clusterj[:N, 3:6]/255, s=1)

## Colour histograms of the clusters

In [None]:
ncols = 4
# Integer row count (rounded up); plt.subplot needs ints
nrows = (n_clusters + ncols - 1) // ncols

plt.figure(figsize=(5*ncols, 4*nrows))


for j in range(n_clusters):
    clusterj = clusterData[j]
    plt.subplot(nrows, ncols, j+1)
    plt.hist(clusterj[:N, 3:6], bins=256, color=['r', 'g', 'b']);
    plt.title('Cluster {}'.format(j))

Some of these distributions look nearly identical, which suggests that we've chosen too many clusters.
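One way to make "nearly identical" less of an eyeball judgement is to compare the per-cluster colour histograms pairwise and flag the pairs that almost coincide. The sketch below is only one possible check; `similar_cluster_pairs`, the bin count, and the threshold are all assumptions rather than tuned values.

```python
import numpy as np


def similar_cluster_pairs(rgb, labels, n_clusters, bins=32, threshold=0.1):
    """Flag cluster pairs whose normalised RGB histograms nearly coincide.

    rgb: (n, 3) array with values in 0..255; labels: (n,) cluster indices.
    Returns (cluster_a, cluster_b, distance) tuples, distance in [0, 1].
    """
    hists = []
    for j in range(n_clusters):
        members = rgb[labels == j]
        # One histogram per channel, concatenated into a single vector.
        h = np.concatenate([
            np.histogram(members[:, c], bins=bins, range=(0, 255))[0]
            for c in range(3)
        ]).astype(float)
        hists.append(h / max(h.sum(), 1.0))  # normalise; guard empty clusters
    pairs = []
    for a in range(n_clusters):
        for b in range(a + 1, n_clusters):
            # Half the L1 distance between two normalised histograms.
            d = 0.5 * np.abs(hists[a] - hists[b]).sum()
            if d < threshold:
                pairs.append((a, b, d))
    return pairs


# Toy data: clusters 0 and 1 share a colour, cluster 2 differs.
rgb = np.vstack([
    np.tile([100, 150, 50], (50, 1)),
    np.tile([100, 150, 50], (50, 1)),
    np.tile([200, 20, 20], (50, 1)),
])
labels = np.repeat([0, 1, 2], 50)
pairs = similar_cluster_pairs(rgb, labels, n_clusters=3)
print(pairs)  # only the pair (0, 1) is flagged
```

Note that this compares colour only; two clusters with matching colour histograms could still differ in height, so a flagged pair is a candidate for merging rather than proof of redundancy.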