### CE88: Data Science for Smart Cities - 11/27/17

# *K-means Clustering: Electric Car Battery Level Clustering*

In this lab session, we will analyze how battery state-of-charge (SOC) profiles of electric cars can be classified into 'k' different categories using the k-means clustering algorithm. We will use sklearn.cluster.KMeans to fit the model and predict the cluster of each vehicle's SOC profile.

In [None]:
from datascience import *
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.cm as cmx

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

# 1. Introduction to Clustering
[Scikit-learn](http://scikit-learn.org/stable/) is a Python package that helps you do more advanced predictive and exploratory analysis with data. Today we are going to learn about a [clustering method](http://scikit-learn.org/stable/modules/clustering.html#k-means) used for systematically grouping similar data points.
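
To make the idea concrete before using scikit-learn, here is a minimal sketch of the two alternating steps behind k-means (assign each point to its nearest center, then move each center to the mean of its points). The function name `kmeans_sketch` and the toy data are made up for illustration; this is not scikit-learn's implementation, which adds smarter initialization and convergence checks.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Illustrative k-means loop; not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# Hypothetical toy data: two well-separated groups of three points
toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_labels, toy_centers = kmeans_sketch(toy_points, k=2)
print(toy_labels)
```

The label numbers themselves are arbitrary; what matters is which points share a label.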

In [None]:
plt.figure(figsize=(12, 12))

# Sample number
n_samples = 1500

# Number of blobs (number of clusters)
n_blobs = 7

# This is an arbitrary seed for random generator
random_state = 33

# Generate blobs that have 7 clusters
X, y = make_blobs(n_samples=n_samples, centers=n_blobs, random_state=random_state)

# Scatter plot of the generated data
plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=np.ones(n_samples))
plt.title("Data")

# Cluster with the correct number of clusters
y_pred = KMeans(n_clusters=n_blobs, random_state=random_state).fit_predict(X)

plt.subplot(222)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("Correct number of clusters")

# Cluster with the wrong (fewer than correct) number of clusters
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X)

plt.subplot(223)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("Wrong number of clusters: too few")


# Cluster with the wrong (more than correct) number of clusters
y_pred = KMeans(n_clusters=8, random_state=random_state).fit_predict(X)

plt.subplot(224)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("Wrong number of clusters: too many")

# 2. Clustering EV users with similar driving and charging patterns

## 2.1 About the dataset
'EV_soc.csv' contains data on the state of charge (SOC), meaning the % battery remaining, for 1023 Electric Vehicles (EVs). The dataset has the SOC for each car for every 5-minute interval in the day. The driver column holds the driver ID, and the day column holds the day of the week, where 1=Sunday. The ##_soc columns each correspond to a 5-minute interval of the day.

In this lab we will see how we can use clustering to identify drivers with similar driving and charging habits.


In [None]:
soc = Table.read_table('EV_soc.csv')
soc

### 2.1.1 Calculate rolling averages of SOC profile
The rolling_window=6 parameter is used to compute the rolling average over a half-hour timespan (6 intervals of 5 minutes), rather than considering each 5-minute interval independently.
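
As a rough illustration of what a rolling average does, the sketch below smooths a made-up two-vehicle table. The values, the column names `t0`..`t7`, and the window of 3 are assumptions chosen to keep the example short; the lab itself uses a window of 6.

```python
import pandas as pd

# Hypothetical toy data: 2 vehicles (rows) x 8 five-minute intervals (columns)
toy_soc = pd.DataFrame(
    [[1.0, 1.0, 0.9, 0.8, 0.7, 0.7, 0.8, 0.9],
     [0.5, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]],
    columns=["t%d" % i for i in range(8)],
)
window = 3  # smaller than the lab's 6 so the toy example stays short

# rolling() works down the rows, so transpose to put time on the rows,
# average, drop the first window-1 incomplete (NaN) rows, and transpose back
toy_rolling = toy_soc.T.rolling(window=window).mean().iloc[window - 1:].T
print(toy_rolling.shape)  # (2, 6)
```

Each remaining column is the mean of that interval and the two before it, so the first `window - 1` columns disappear.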

In [None]:
# Rolling window size -- this is to get the rolling average (6*5 = 30 mins)
rolling_window = 6

# Your task: only select data of Tuesday (day == 3)
# Answer key: 
soc_tuesday = 

# Your task: drop the first two columns ('driver' and 'day') of the table and convert Table to dataframe
# Note - sklearn is not guaranteed to work with datascience Table objects.
#        Therefore, when you use sklearn, a pandas DataFrame is recommended.
# (HINT: use Table.to_df() function to convert a Table to pandas dataframe)
# Answer key:
X = 

# Your task: get rolling average of SOC over time with rolling_window = 6
# (HINT: use the DataFrame.rolling(...).mean() method to get the rolling average;
#        the older pd.rolling_mean() function has been removed from recent pandas versions)
# Answer key:
X_rolling = 

## 2.2 Clustering EVs with similar Tuesday Charging Habits

In this section we use the SOC data where day=3 (Tuesday). The scikit-learn k-means implementation does all of the heavy lifting for us, grouping drivers whose SOC profiles are most similar throughout the day. We found that 5 clusters work well for identifying distinct driving/charging habits.
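
The sketch below shows, on made-up data, what a fitted KMeans estimator exposes: `labels_` (one cluster index per row) and `cluster_centers_` (one centroid per cluster). The toy "high" and "low" SOC profiles are assumptions for illustration only and are not drawn from 'EV_soc.csv'.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical toy profiles: 20 "high SOC" and 20 "low SOC" rows, 12 time steps each
high = 0.9 + 0.02 * rng.standard_normal((20, 12))
low = 0.3 + 0.02 * rng.standard_normal((20, 12))
toy_profiles = np.vstack([high, low])

toy_estimator = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_profiles)

for i in range(2):
    # Rows assigned to cluster i
    members = toy_profiles[toy_estimator.labels_ == i]
    # Compare the fitted centroid with the mean profile of the cluster's members
    same = np.allclose(members.mean(axis=0), toy_estimator.cluster_centers_[i])
    print(i, len(members), same)
```

On data this cleanly separated, each centroid should coincide with the mean profile of its members, which is why the mean of a cluster's SOC profiles can stand in for its centroid when plotting.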

### 2.2.1 Find optimal 'k'
The algorithm is somewhat naive: it clusters the data into k clusters, even if k is not the right number of clusters to use. Therefore, when using k-means clustering, we need a way to determine whether we are using the right number of clusters.

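One method to validate the number of clusters is the **elbow method**. The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k (say, k from 1 to 10), and for each value of k calculate the sum of squared errors (SSE). When SSE is plotted against k, the curve typically drops steeply at first and then flattens; the k at the bend (the "elbow") is a reasonable choice, because adding more clusters beyond it reduces the error only slightly.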
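
As a compact illustration of the elbow idea, the sketch below computes the SSE for several values of k on synthetic data. It relies on the fitted estimator's `inertia_` attribute, which scikit-learn documents as the sum of squared distances of samples to their closest cluster center. The blob centers and variable names are assumptions made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical toy data: 3 groups placed at hand-picked, well-separated centers
toy_X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                      cluster_std=0.5, random_state=0)

ks = np.arange(1, 7)
# inertia_ is the sum of squared distances from each sample to its nearest centroid
toy_sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_X).inertia_
           for k in ks]

for k, s in zip(ks, toy_sse):
    print(k, round(s, 1))
```

For this toy data the SSE should fall sharply up to k = 3 and only slowly afterwards, which is the bend the elbow method looks for.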

In [None]:
def getSSE(X, maxK=10):
    '''
    Return the Sum of Squared Errors (SSE) for each number of clusters from 1 to maxK (used by the elbow method)
    
    <Input>
    (Dataframe) X: data to determine SSE
    (Integer) maxK: maximum number of clusters (default = 10)
    
    <Output>
    (ndarray) an array of SSEs, one per number of clusters
    '''
    sse = np.zeros((maxK,1))
    for k in np.arange(1,maxK+1):
        estimator = KMeans(n_clusters=k)
        estimator.fit(X)
        # Visit each cluster label exactly once
        for l in np.unique(estimator.labels_):
            data = X[estimator.labels_==l]
            # Column-wise mean of the cluster members, i.e. the cluster centroid
            data_mean = data.mean(axis=0)
            sse[k-1] += np.sum((data-data_mean).values**2)
    return sse

# NOTE: Running this cell may take a few seconds
maxK=6
plt.plot(np.arange(1,maxK+1),getSSE(X_rolling, maxK))
plt.xlabel('Number of clusters (k)')
plt.ylabel('Sum of Squared Errors')

### 2.2.2 Fit the KMeans estimator to the rolling-averaged SOC profiles

We have already completed the code to fit the estimator using sklearn. You will see how simple it is!

In [None]:
# Fit the estimator
n_clusters = 5
estimator = KMeans(n_clusters=n_clusters)
estimator.fit(X_rolling)

### 2.2.3 Visualize the clusters

Effective visualization is very important for analyzing your clusters. In this section, you will learn how to plot multiple graphs in the same figure using for loops.

In [None]:
# Get color map array -- this gives N distinct colors, one assigned to each of the N clusters
jet = plt.get_cmap('jet') 
cNorm  = colors.Normalize(vmin=0, vmax=n_clusters-1)
scalarMap = cmx.ScalarMappable(norm=cNorm, cmap=jet)
colorVals = [scalarMap.to_rgba(i) for i in range(n_clusters)]

**Example: Visualize SOC profiles of label = 1**

In [None]:
# x-values
x_ticks = np.arange(rolling_window-1,soc.drop([0,1]).num_columns)/12.

# create a new figure for i-th cluster
plt.figure()

# Y-values
data = X_rolling[estimator.labels_== 1]

for j in range(data.shape[0]):
    # SOC profile of each EV driver
    plt.plot(x_ticks, data.T[data.index[j]], color=colorVals[1], alpha=.05)

# Your task: plot centroid of the cluster. In this case, the centroid is the mean of SOC profiles over EV users.
# Answer key:

# x, ylabel, title...
plt.xlabel('Hour')
plt.ylabel('State of charge')
plt.ylim(0,1.2)
plt.title('State of charge per EV')
textstr = 'N = %i, %.1f%% of vehicles'%(len(data),float(len(data)*100/len(X_rolling)))
plt.text(7, 1.1, textstr, fontsize=10,verticalalignment='top')

**Visualize multiple figures**

In [None]:
# Multiple figures 
for i in range(n_clusters):
    # Your task: show SOC profiles of each cluster in each figure
    # Answer key:

## 2.3 Overlaying the clusters and plotting the derivatives
In this section, we will analyze how the clusters differ from one another in terms of the change in SOC over time.
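
The change in SOC can be approximated with a finite difference: the difference between consecutive SOC values divided by the time between them. The sketch below uses `np.diff` on a made-up profile; the numbers are assumptions for illustration, not values from the dataset.

```python
import numpy as np

# Hypothetical toy profile: SOC sampled every 5 minutes (1/12 hour)
hours = np.arange(6) / 12.0
soc_profile = np.array([0.50, 0.50, 0.55, 0.60, 0.60, 0.58])

# Forward difference: change in SOC divided by change in time, per interval
slope = np.diff(soc_profile) / np.diff(hours)

# Positive values mean the battery is charging, negative values mean it is
# discharging, and zero means the SOC is unchanged
print(np.round(slope, 2))
```

Note that the result has one element fewer than the input, which is why the slope is plotted against all but the last time value.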

In [None]:
def get_slope(y, x):
    '''
    Get the derivative of y with respect to x
    '''
    dys = y[1:]-y[:-1]
    dxs = x[1:]-x[:-1]
    return dys/dxs

In [None]:
#Plot the derivative
plt.figure(figsize = (8,8))
for i in range(n_clusters):
    # Mean SOC profile of cluster i; axis=0 makes the column-wise mean explicit
    dy_dx = get_slope(X_rolling[estimator.labels_==i].mean(axis=0).values, x_ticks)
    plt.plot(x_ticks[:-1], dy_dx, color=colorVals[i], linewidth=2)
    
plt.plot(x_ticks[:-1], np.zeros(len(x_ticks[:-1])), color = 'grey')
plt.xlabel('Hour')
plt.ylabel('Change in state of charge')
plt.title('Change in state of charge per EV')


## Exercise 
My initial guess was that there would be 2 distinct charging patterns: one for commuters, and another for families who use an EV as a second, non-commuter vehicle.

**Task 1 -** Adjust the number of clusters to 2 and describe the trends in the two clusters. 

In [None]:
# Your answers here:




**Task 2 -** Now adjust the number of clusters to 10. Do you see multiple clusters that show very similar SOC patterns? If so, these can probably be combined, and we can reduce the number of clusters.

In [None]:
# Your answers here:




## (Optional) IF time allows: clustering behavior for the whole workweek
In the previous section we clustered EV data for a single workday. Now we will cluster similar driving and charging behavior for the whole workweek. Each row in 'workweek_soc.csv' contains EV SOC data for the entire workweek rather than a single day.

Again I found that 5 clusters seemed to capture the unique charging behavior well. Run the code below to see the workweek clustering results.

**Task 3 -** Find 5 clusters and plot SOC per EV for each cluster. 

In [None]:
# Use this dataframe
week_soc = Table.read_table('workweek_soc.csv')
week_df = week_soc.drop('driver').to_df()

n_clusters=5
rolling_window = 6

# get color map array
jet = plt.get_cmap('jet') 
cNorm  = colors.Normalize(vmin=0, vmax=n_clusters-1)
scalarMap = cmx.ScalarMappable(norm=cNorm, cmap=jet)
colorVals = [scalarMap.to_rgba(i) for i in range(n_clusters)]

# Get rolling average
# (pd.rolling_mean() is no longer available in recent pandas; use DataFrame.rolling().mean())
X_rolling = week_df.T.rolling(window=rolling_window).mean().iloc[rolling_window-1:].T

In [None]:
# Your answers here:
