# Clustering User Behavior

Now we can use the spatial and time discretization to obtain clusters of user behavior.

First we load a dataset.

In [1]:
%matplotlib inline
from Code.STData import STData
from Code.Constants import homepath, cityparams
from Code.Clustering import cluster_events, cluster_cache
from Code.Transactions import DailyDiscretizedTransactions, DailyClusteredTransactions
from Code.TimeDiscretizer import TimeDiscretizer
import folium

data = STData('../', cityparams['berlin'], 'twitter')

data.read_data()
data.info()

Reading Data ...
A=  twitter
C=  (None, (52.32, 52.62, 13.11, 13.6), 'berlin', None, 120, None)
D=  (140510,)


Now we discretize the geographical positions using clustering. If a matching clustering is already stored in the /Clusters directory, it is read from there instead of being recomputed.

In [2]:
# Clustering algorithm for spatial discretization: 'leader' or 'kmeans'
dclalg = 'kmeans'
# radius = 0.005 represents a circle of around 500m diameter depending on latitude/longitude
radius = 0.005
nclusters = 500
cluster = cluster_cache(data, alg=dclalg, radius=radius, nclusters=nclusters)
if cluster is None:
    print('Computing Clustering')
    cluster, _ = cluster_events(data, alg=dclalg, radius=radius, nclusters=nclusters)

Computing Clustering
Clustering ... kmeans
Generating the events plot ...

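To make the role of `radius` more concrete, here is a minimal sketch of a leader-style clustering on (latitude, longitude) pairs. It is not the implementation in `Code.Clustering`; the function name `leader_clustering`, the plain Euclidean distance in degrees, and the toy coordinates are all assumptions made for illustration.

```python
import math

def leader_clustering(points, radius):
    """Assign each (lat, lon) point to the first leader within `radius`
    (Euclidean distance in degrees); otherwise it becomes a new leader."""
    leaders = []
    labels = []
    for lat, lon in points:
        for idx, (llat, llon) in enumerate(leaders):
            if math.hypot(lat - llat, lon - llon) <= radius:
                labels.append(idx)
                break
        else:
            # No existing leader is close enough: start a new cluster
            leaders.append((lat, lon))
            labels.append(len(leaders) - 1)
    return leaders, labels

# Made-up coordinates inside the Berlin bounding box
toy_points = [(52.500, 13.400), (52.502, 13.401),
              (52.520, 13.450), (52.521, 13.452)]
toy_leaders, toy_labels = leader_clustering(toy_points, radius=0.005)
```

With this scheme the number of clusters follows from `radius`, whereas k-means takes the number of clusters (`nclusters`) as input.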

Now we generate the transactions for each user by joining all of the user's events over the period covered by the data, with each event expressed as a (position, time) pair.

In [3]:
timedis = [6, 18] # Time discretization
trans = DailyClusteredTransactions(data, cluster=cluster, timeres=TimeDiscretizer(timedis))
trans.info()

Generating Transactions ...
Trans Size = 10000


We obtain a data matrix by computing the attribute values as binary values ('bin'), absolute frequency ('af'), or normalized frequency ('nf'). We can append 'idf' to any of these modes (for example 'binidf') to normalize by IDF.

In [4]:
# Minimum number of events
minloc = 5
# Attribute types 'bin'=[0,1] ; 'binidf'=[0,1]/IDF 
mode = 'bin'
datamat, users = trans.generate_data_matrix(minloc=minloc, mode=mode)

Generating data matrix ...
Generating colapsed Transactions ...
(2455, 1000)

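The shape (2455, 1000) is consistent with 500 spatial clusters times 2 time intervals, although the column layout used by `generate_data_matrix` is not shown here. As a rough sketch of what the 'bin' and 'binidf' modes could compute, the example below builds a binary user-by-item matrix and weights it by IDF. The helper names are hypothetical, and the formula log(N / df) is one common IDF definition assumed here, not necessarily the one used by the library.

```python
import math

def binary_matrix(user_items, n_columns):
    """One row per user; 1 where the user has the item (column), else 0."""
    users = sorted(user_items)
    rows = [[1 if col in user_items[u] else 0 for col in range(n_columns)]
            for u in users]
    return users, rows

def idf_weights(rows):
    """IDF per column, assumed here to be log(N / df); 0 for unused columns."""
    n_users = len(rows)
    weights = []
    for col in range(len(rows[0])):
        df = sum(row[col] for row in rows)
        weights.append(math.log(float(n_users) / df) if df else 0.0)
    return weights

# Made-up collapsed transactions: user -> set of item (column) indices
toy_user_items = {'u1': set([0, 2]), 'u2': set([0]), 'u3': set([0, 3])}
toy_users, toy_rows = binary_matrix(toy_user_items, n_columns=4)
toy_idf = idf_weights(toy_rows)
toy_binidf = [[v * w for v, w in zip(row, toy_idf)] for row in toy_rows]
```

An item shared by every user (column 0 here) gets weight 0, so the IDF variant emphasizes positions and times that distinguish users from each other.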

We obtain the clusters from the data matrix

In [5]:
from Code.Clustering import cluster_colapsed_events

# Clustering Algorithms 'kmeans', 'spectral', 'affinity'
calg = 'kmeans'
# affinity damping parameter 0.1 - 1
damping = 0.5
# number of clusters for kmeans and spectral clustering
nclust = 5

# minsize = Minimum number of elements in a cluster
cls = cluster_colapsed_events(datamat, users, alg=calg, damping=damping, nclust=nclust, minsize=10)

[(c, len(cls[c])) for c in cls]

Clustering Transactions ...  kmeans


[('c3', 229), ('c2', 212), ('c1', 1413), ('c0', 329), ('c4', 272)]
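The result maps cluster names to lists of users. The sketch below shows one way such a dictionary could be built from per-user cluster labels and then listed by size. The function `group_by_label` is hypothetical, and it assumes that clusters smaller than `minsize` are simply dropped, which may not match how `cluster_colapsed_events` handles them.

```python
from collections import defaultdict

def group_by_label(users, labels, minsize):
    """Group users by cluster label ('c0', 'c1', ...) and keep only
    the groups with at least `minsize` members."""
    groups = defaultdict(list)
    for user, label in zip(users, labels):
        groups['c%d' % label].append(user)
    return dict((name, members) for name, members in groups.items()
                if len(members) >= minsize)

# Made-up users and cluster labels
toy_users = ['u1', 'u2', 'u3', 'u4', 'u5', 'u6']
toy_labels = [0, 0, 1, 0, 1, 2]
toy_cls = group_by_label(toy_users, toy_labels, minsize=2)

# List clusters from largest to smallest
toy_sizes = sorted(((c, len(m)) for c, m in toy_cls.items()),
                   key=lambda pair: -pair[1])
```

Sorting by size makes a dominant cluster, such as 'c1' in the output above, easy to spot.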

Now we can select a cluster and show on a map how frequently its users appear at the different positions.

In [10]:
cluster_name = 'c1'
dataclus = data.select_data_users(cls[cluster_name])
dataclus.info()
mymap = dataclus.plot_events_cluster(cluster=cluster, dataname=cluster_name) 
mymap

Selecting Users ...
A=  twitter
C=  (None, (52.32, 52.62, 13.11, 13.6), 'berlin', None, 120, None)
D=  (41846,)
Generating the events plot ...
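The quantity shown on the map can be thought of as a relative frequency per position. As a minimal sketch, assuming each event of the selected users has already been assigned a spatial cluster id, the frequencies could be computed as below; `position_frequencies` and the toy ids are hypothetical and not part of `STData`.

```python
from collections import Counter

def position_frequencies(event_clusters):
    """Fraction of events that fall in each spatial cluster."""
    counts = Counter(event_clusters)
    total = float(sum(counts.values()))
    return dict((cid, n / total) for cid, n in counts.items())

# Made-up spatial cluster ids, one per event of the selected users
toy_event_clusters = [12, 12, 40, 12, 7]
toy_freqs = position_frequencies(toy_event_clusters)
```

These values could then drive the size or color of the markers drawn at each cluster center.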

