Objective: Find behaviour patterns in driver activities via clustering, then (manually) classify these patterns to help recognize common faults while driving.

Example: A driver frequently misses B_T3 after a B_T2, making the whole sequence illegal.

Things worth considering:
- Variable length sequences.
- Activity representation. Possibilities:
  - activity, daytype, sequence, breaktype, token, legal -> [2,0,1,0,1]
  - Not sure if we should consider duration. If so, should it be normalized?
    - Maybe not, because we are keeping the token, so duration would be redundant information
- Sequence representation. Possibilities:
  - List of activities (should span a day at most)
  - One activity at a time
  - Just one Sequence
- Distance function depends on the two previous points. Possibilities:
  - Subtract durations and, for the rest of the variables, only check whether they are equal. Probably with different weights, as legal is more valuable.
- Categorical variables. We need to be careful with the distance function
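
A minimal sketch of the kind of distance function described above, assuming one record per activity. The `ActivityRecord` class, the weight values and the `max_duration` cap are hypothetical and only illustrate the idea: categorical fields contribute a weighted mismatch, duration contributes a normalized absolute difference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivityRecord:
    """Hypothetical container for one logged activity."""
    activity: str
    daytype: str
    sequence: str
    breaktype: str
    token: str
    legal: int
    duration: float  # minutes


# Assumed weights: "legal" counts more than the other categorical fields.
WEIGHTS = {
    "activity": 1.0,
    "daytype": 1.0,
    "sequence": 1.0,
    "breaktype": 1.0,
    "token": 1.0,
    "legal": 3.0,
}


def activity_distance(a, b, max_duration=60.0, duration_weight=1.0):
    """Weighted mismatch on categorical fields plus a capped duration gap."""
    # Categorical part: add the field weight whenever the two values differ.
    mismatch = sum(
        weight
        for field, weight in WEIGHTS.items()
        if getattr(a, field) != getattr(b, field)
    )
    # Numerical part: absolute difference scaled to [0, 1].
    duration_gap = min(abs(a.duration - b.duration) / max_duration, 1.0)
    return mismatch + duration_weight * duration_gap


base = ActivityRecord("Driving", "ndd", "first", "split_1", "A", 1, 30.0)
illegal = ActivityRecord("Driving", "ndd", "first", "split_1", "A", 0, 30.0)
longer = ActivityRecord("Driving", "ndd", "first", "split_1", "A", 1, 60.0)
```

With these assumed weights, two activities that differ only in `legal` are farther apart than two that differ only by 30 minutes of duration.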

---

Possible method:
1. Encode activities as words in a list representation
2. Group activities as sentences according to the Sequence column
3. Group sentences as documents, one for each driver
4. Apply sentence clustering or topic modelling (LDA?)
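
Steps 1 to 3 above can be sketched in plain Python as follows. The `rows` tuples and the `build_documents` helper are hypothetical stand-ins for the real log; step 4 (clustering or topic modelling) is deliberately left out.

```python
from collections import defaultdict

# Hypothetical rows: (driver, sequence id, activity word).
rows = [
    ("driver1", 1, "Break"),
    ("driver1", 1, "Driving"),
    ("driver1", 2, "Other"),
    ("driver2", 1, "Driving"),
    ("driver2", 1, "Break"),
]


def build_documents(rows):
    """Group words into sentences (per sequence) and sentences into documents (per driver)."""
    per_driver = defaultdict(dict)
    for driver, sequence_id, word in rows:
        # One sentence per (driver, sequence); words keep their log order.
        per_driver[driver].setdefault(sequence_id, []).append(word)
    # One document per driver: the list of that driver's sentences.
    return {
        driver: list(sentences.values())
        for driver, sentences in per_driver.items()
    }


documents = build_documents(rows)
```

Here `documents["driver1"]` holds two sentences, one per sequence, and each sentence is the ordered list of activity words.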

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import OrdinalEncoder

from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel
from gensim.matutils import corpus2dense, corpus2csc

from sklearn.cluster import KMeans

In [42]:
#########################################################################
# Load data
#########################################################################
data_path = "./combined-log.csv"
df = pd.read_csv(data_path, sep="\t",)

# To timestamp format
df.DateTimeStart = pd.to_datetime(df.DateTimeStart)
df.DateTimeEnd = pd.to_datetime(df.DateTimeEnd)

# Rename column
df = df_original = df.rename(columns={"#Driver":"Driver", "Duration(min)":"Duration"})


# To numerical
df.Legal = df.Legal.map({"yes": 1, "no": 0}) # Not sure if [-1,1] is better

# Drop columns
df = df.drop(columns=['Duration', 'ZenoInfo', "DateTimeStart", "DateTimeEnd", 'Week'])

df

Unnamed: 0,Driver,Activity,Day,DayType,Sequence,BreakType,Token,Legal
0,driver1,Break,1.0,ndd,first,split_1,B_T0,1
1,driver1,Driving,1.0,ndd,first,split_1,A,1
2,driver1,Other,1.0,ndd,first,split_1,A,1
3,driver1,Driving,1.0,ndd,first,split_1,A,1
4,driver1,Other,1.0,ndd,first,split_1,A,1
...,...,...,...,...,...,...,...,...
4170,driver25,Driving,13.0,ndd,unique,uninterrupted,A,1
4171,driver25,Other,13.0,ndd,unique,uninterrupted,A,1
4172,driver25,Driving,13.0,ndd,unique,uninterrupted,A,1
4173,driver25,Other,13.0,ndd,unique,uninterrupted,A,1


In [43]:
#########################################################################
# Transform data
# Encode each column as numeric and group them
#########################################################################

# Reorder columns
cols = ['Driver', 'Day', 'Activity', 'DayType', 'Sequence', 'BreakType', 'Token', 'Legal']
df = df[cols]

x = df.to_numpy()

ordinalencoder_X = OrdinalEncoder(dtype=np.int8)
x[:,2:] = ordinalencoder_X.fit_transform(x[:,2:])

df2 = pd.DataFrame(x, columns=cols)

# Group columns into one (as string)
df2['Encoding'] = df2[df2.columns[2:]].apply(
    lambda x: ''.join(x.dropna().astype(str)),
    axis=1
)

# Remove encoded columns
df2 = df2[['Driver','Day','Encoding']]

df2

Unnamed: 0,Driver,Day,Encoding
0,driver1,1,010111
1,driver1,1,110101
2,driver1,1,310101
3,driver1,1,110101
4,driver1,1,310101
...,...,...,...
4170,driver25,13,114301
4171,driver25,13,314301
4172,driver25,13,114301
4173,driver25,13,314301
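
One caveat with joining the ordinal codes without a separator: as soon as any column has ten or more categories, different code tuples can collapse into the same word. The sketch below shows the collision and a possible fix using a separator; `join_codes` and `split_codes` are hypothetical helpers, not part of the notebook.

```python
def join_codes(codes, sep="-"):
    """Join per-column ordinal codes into one word, keeping column boundaries."""
    return sep.join(str(int(code)) for code in codes)


def split_codes(word, sep="-"):
    """Recover the per-column codes from a word built by join_codes."""
    return [int(part) for part in word.split(sep)]


# Two different code tuples that collapse to the same word without a separator.
codes_a = [0, 1, 2, 3, 10, 1]
codes_b = [0, 1, 2, 31, 0, 1]

plain_a = "".join(str(code) for code in codes_a)
plain_b = "".join(str(code) for code in codes_b)

word_a = join_codes(codes_a)
word_b = join_codes(codes_b)
```

With the separator the two tuples stay distinct and the original codes can be recovered, which also makes the clusters easier to interpret afterwards.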


In [44]:
# Group activities by day and join them in a list

# Group rows by driver first
groups = df2.groupby('Driver', sort=False) # False to keep driver ordering

# Each sentence is the sequence of activities in a day
sentences = []

for name, group in groups:
  # One list of Encodings per Day, in the original row order
  daily = group.groupby('Day')['Encoding'].apply(list)

  sentences.extend(daily)

In [45]:
dicti = df2.Encoding.unique()
dicti

array(['010111', '110101', '310101', '010131', '110201', '010241',
       '112301', '312301', '012351', '114301', '314301', '014351',
       '100101', '000131', '100201', '300201', '000241', '102101',
       '002131', '102201', '002241', '103301', '303301', '003361',
       '102301', '002321', '103101', '003131', '103201', '003211',
       '003251', '010211', '310201', '012361', '110301', '010321',
       '112101', '012131', '112201', '012211', '012251', '012311',
       '0123101', '121300', '321300', '221390', '021320', '121000',
       '221090', '0210120', '021010', '321000', '114101', '014131',
       '114201', '014271', '0143101', '100301', '000311', '300301',
       '000321', '002111', '302101', '403251', '000111', '300101',
       '302201', '002211', '310301', '012271', '014381', '312101',
       '012111', '312201', '121100', '021130', '121200', '021240',
       '012261', '321100', '021210', '221290', '321200', '314101',
       '014111', '0142101', '014311', '003351', '021310', '

In [46]:
# TODO: THIS METHOD DOES NOT CONSIDER ORDER, TRY WORD2VEC
# It seems to work without considering order (maybe because the tags already
# add information about it?)
# https://stackoverflow.com/questions/50933591/how-to-perform-kmean-clustering-from-gensim-tfidf-values

# Each sentence will be a document
corpus_lists = sentences

# Get the dictionary of our corpus
dictionary = Dictionary(corpus_lists)
num_docs = dictionary.num_docs
num_terms = len(dictionary.keys())

# Transform into bow (bag-of-words)
# It's a small dictionary so I think this representation shouldn't be a problem
corpus_bow = [dictionary.doc2bow(doc) for doc in corpus_lists]

# Transform into tf-idf (term frequency – inverse document frequency)
tfidf = TfidfModel(corpus_bow)
corpus_tfidf = tfidf[corpus_bow]

# Now you can transform into sparse/dense matrix:
corpus_tfidf_dense = corpus2dense(corpus_tfidf, num_terms, num_docs)
# corpus_tfidf_sparse = corpus2csc(corpus_tfidf, num_terms, num_docs)
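
To make the weighting step concrete, here is a small pure-Python sketch of TF-IDF. It assumes raw term counts, a base-2 logarithm for the inverse document frequency and L2 normalization, which is my understanding of the gensim defaults; treat those choices and the `tfidf_vectors` helper as assumptions rather than a description of gensim's internals.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (L2-normalized TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Term count times log2(total documents / document frequency).
        weights = {
            term: count * math.log2(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0:
            vectors.append({})
            continue
        # Drop zero-weight terms and scale the rest to unit length.
        vectors.append({t: w / norm for t, w in weights.items() if w > 0})
    return vectors


# Two toy "days", each a list of encoded activities.
docs = [["110101", "310101", "110101"], ["110101", "010131"]]
vectors = tfidf_vectors(docs)
```

Note that a word occurring in every day (here `110101`) gets zero weight, so very common activities contribute nothing to the clustering.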

In [9]:
print(dictionary.token2id)

{'010111': 0, '010131': 1, '010241': 2, '012351': 3, '110101': 4, '110201': 5, '112301': 6, '310101': 7, '312301': 8, '014351': 9, '114301': 10, '314301': 11, '000131': 12, '000241': 13, '002131': 14, '002241': 15, '003361': 16, '100101': 17, '100201': 18, '102101': 19, '102201': 20, '103301': 21, '300201': 22, '303301': 23, '002321': 24, '003131': 25, '003211': 26, '003251': 27, '102301': 28, '103101': 29, '103201': 30, '010211': 31, '012361': 32, '310201': 33, '010321': 34, '012131': 35, '012211': 36, '012251': 37, '110301': 38, '112101': 39, '112201': 40, '0123101': 41, '012311': 42, '021010': 43, '0210120': 44, '021320': 45, '121000': 46, '121300': 47, '221090': 48, '221390': 49, '321000': 50, '321300': 51, '014131': 52, '014271': 53, '114101': 54, '114201': 55, '0143101': 56, '000311': 57, '000321': 58, '002111': 59, '100301': 60, '300301': 61, '302101': 62, '403251': 63, '302201': 64, '303201': 65, '014311': 66, '014371': 67, '014111': 68, '014211': 69, '014281': 70, '021110': 71

In [47]:
#########################################################################
# Train data
#########################################################################

n_clusters = 10

# Apply KMeans clustering
model = KMeans(n_clusters=n_clusters, random_state=12345)
clusters = model.fit_predict(corpus_tfidf_dense.T)  # Transposed!

In [48]:
clusters

array([2, 5, 0, 3, 2, 9, 7, 2, 6, 4, 7, 0, 0, 1, 5, 8, 8, 4, 5, 0, 2, 3,
       5, 4, 3, 2, 9, 6, 1, 9, 6, 5, 2, 7, 9, 0, 9, 5, 5, 1, 5, 8, 7, 3,
       5, 2, 5, 7, 6, 0, 3, 9, 0, 9, 1, 0, 0, 5, 2, 3, 8, 3, 1, 8, 5, 9,
       8, 3, 2, 0, 2, 0, 7, 5, 9, 9, 6, 0, 5, 7, 0, 8, 1, 6, 4, 3, 1, 8,
       5, 5, 6, 7, 5, 0, 0, 9, 6, 9, 3, 5, 4, 7, 7, 6, 3, 9, 4, 4, 6, 4,
       1, 8, 8, 0, 3, 1, 1, 9, 5, 3, 0, 1, 0, 8, 2, 8, 4, 1, 7, 1, 1, 9,
       3, 9, 3, 3, 8, 8, 8, 1, 7, 2, 0, 2, 5, 4, 4, 6, 2, 4, 0, 3, 4, 9,
       6, 6, 3, 8, 7, 4, 5, 5, 7, 3, 0, 9, 8, 1, 5, 7, 9, 5, 7, 5, 2, 3,
       7, 7, 2, 5, 0, 3, 8, 4, 7, 7, 5, 6, 1, 2, 2, 7, 6, 0, 6, 1, 6, 7,
       0, 6, 1, 6, 6, 9, 5, 7, 3, 3, 7, 4, 7, 5, 6, 0, 5, 8, 0, 9, 9, 4,
       5, 9, 2, 3, 3, 7, 4, 4, 1, 6, 2, 7, 6, 2, 0, 8, 1, 1, 0, 2, 5],
      dtype=int32)

Not sure whether the idea will work, but there are at least some promising signs:

- Cluster 4 (driver1, days 1, 5 and 8) seems to capture an NDD made of two sequences, where the first one is split and the second one is uninterrupted.

- On the other hand, day 8 contains illegal actions while days 1 and 5 do not; fine-tuning the number of clusters might separate them.

- As for detecting illegal activity: the days in cluster 5 (driver1 day 10, driver10 day 9) both contain activities with unrecognized DayType, Sequence and BreakType, preceded by activities where only the BreakType is recognized and followed by a fully recognized NDD. Something similar happens for driver11 on day 2, although there the last NDD contains two recognized sequences.
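
One possible way to tune the number of clusters is to compare silhouette scores over a range of candidates. The sketch below is an assumption about how that could be done, not something the notebook already does; `best_k_by_silhouette` is a hypothetical helper and the toy data stands in for `corpus_tfidf_dense.T`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, candidates, random_state=12345):
    """Fit KMeans for each candidate k and return the k with the best silhouette."""
    scores = {}
    for k in candidates:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        # Mean silhouette over all samples; higher means tighter, better separated clusters.
        scores[k] = silhouette_score(X, labels)
    best = max(scores, key=scores.get)
    return best, scores


# Toy data: three well-separated blobs (rows are samples).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_toy = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

best_k, scores = best_k_by_silhouette(X_toy, range(2, 6))
```

On the real data the call would be something like `best_k_by_silhouette(corpus_tfidf_dense.T, range(2, 21))`; the score is only a heuristic, so the resulting clusters still need manual inspection.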

In [49]:
results = []

groups = df2.groupby(['Driver','Day'], sort=False)

for (name, group), cluster in zip(groups, clusters):
  results.append("{} in day {}: {}".format(name[0], int(name[1]), cluster))

results

['driver1 in day 1: 2',
 'driver1 in day 2: 5',
 'driver1 in day 3: 0',
 'driver1 in day 4: 3',
 'driver1 in day 5: 2',
 'driver1 in day 6: 9',
 'driver1 in day 7: 7',
 'driver1 in day 8: 2',
 'driver1 in day 9: 6',
 'driver1 in day 10: 4',
 'driver1 in day 11: 7',
 'driver1 in day 12: 0',
 'driver2 in day 1: 0',
 'driver2 in day 2: 1',
 'driver2 in day 3: 5',
 'driver2 in day 4: 8',
 'driver2 in day 5: 8',
 'driver2 in day 6: 4',
 'driver3 in day 1: 5',
 'driver3 in day 2: 0',
 'driver3 in day 3: 2',
 'driver3 in day 4: 3',
 'driver3 in day 5: 5',
 'driver3 in day 6: 4',
 'driver3 in day 7: 3',
 'driver3 in day 8: 2',
 'driver3 in day 9: 9',
 'driver3 in day 10: 6',
 'driver3 in day 11: 1',
 'driver4 in day 1: 9',
 'driver4 in day 2: 6',
 'driver4 in day 3: 5',
 'driver4 in day 4: 2',
 'driver4 in day 5: 7',
 'driver4 in day 6: 9',
 'driver4 in day 7: 0',
 'driver4 in day 8: 9',
 'driver4 in day 9: 5',
 'driver4 in day 10: 5',
 'driver4 in day 11: 1',
 'driver4 in day 12: 5',
 'driver

In [67]:
#########################################################################
# Save results
#########################################################################

# Final dataset columns
cols = df_original.columns.to_numpy()
cols = np.insert(cols, -1, 'Cluster')

groups = df_original.groupby(['Driver','Day'], sort=False)

# Add clusters to log
# (DataFrame.append was removed in pandas 2.0: collect the groups, concatenate once)
parts = []
for (name, group), cluster in zip(groups, clusters):
  parts.append(group.assign(Cluster=cluster))

df_out = pd.concat(parts)[list(cols)]

df_out

Unnamed: 0,Driver,DateTimeStart,DateTimeEnd,Duration,Activity,Week,Day,DayType,Sequence,BreakType,Token,Legal,Cluster,ZenoInfo
0,driver1,2017-02-01 17:59:00,2017-02-01 18:13:00,14.0,Break,1.0,1.0,ndd,first,split_1,B_T0,1,2,
1,driver1,2017-02-01 18:13:00,2017-02-01 18:16:00,3.0,Driving,1.0,1.0,ndd,first,split_1,A,1,2,
2,driver1,2017-02-01 18:16:00,2017-02-01 18:18:00,2.0,Other,1.0,1.0,ndd,first,split_1,A,1,2,
3,driver1,2017-02-01 18:18:00,2017-02-01 18:20:00,2.0,Driving,1.0,1.0,ndd,first,split_1,A,1,2,
4,driver1,2017-02-01 18:20:00,2017-02-01 18:43:00,23.0,Other,1.0,1.0,ndd,first,split_1,A,1,2,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4170,driver25,2017-01-15 09:56:00,2017-01-15 09:59:00,3.0,Driving,3.0,13.0,ndd,unique,uninterrupted,A,1,5,
4171,driver25,2017-01-15 09:59:00,2017-01-15 10:01:00,2.0,Other,3.0,13.0,ndd,unique,uninterrupted,A,1,5,
4172,driver25,2017-01-15 10:01:00,2017-01-15 10:08:00,7.0,Driving,3.0,13.0,ndd,unique,uninterrupted,A,1,5,
4173,driver25,2017-01-15 10:08:00,2017-01-15 10:12:00,4.0,Other,3.0,13.0,ndd,unique,uninterrupted,A,1,5,


In [69]:
# Save as CSV
df_out.to_csv("log-clustering.csv", sep="\t", index=False)

# Download
from google.colab import files
files.download("log-clustering.csv")


In [79]:
# Save each cluster to a separate file
groups = df_out.groupby(['Cluster'], sort=False)

# Write and download one CSV per cluster
for name, group in groups:
  cluster = group['Cluster'].to_numpy()[0]
  
  path = "log-clustering-c{}.csv".format(cluster)
  group.to_csv(path, sep="\t", index=False)
  files.download(path)
