# Clustering User Behavior

Now we can use the spatial and temporal discretization to obtain clusters of user behavior.

First we load a dataset.

In [1]:
%matplotlib inline
from Code.STData import STData
from Code.Constants import homepath, cityparams
from Code.Clustering import cluster_events, cluster_cache
from Code.Transactions import DailyDiscretizedTransactions, DailyClusteredTransactions
from Code.TimeDiscretizer import TimeDiscretizer
import folium

data = STData('../', cityparams['bcn'], 'twitter')

data.read_data()
data.info()

Reading Data ...
A=  twitter
C=  (None, (41.2, 41.65, 1.9, 2.4), 'bcn', None, 120, None)
D=  (220853,)


Now we discretize the geographical positions using clustering (if the clustering is already stored in the /Clusters directory, it is read from there).

In [2]:
# radius = 0.005 corresponds to a circle of around 500 m in diameter; the exact size depends on the latitude/longitude

cluster = cluster_cache(data, alg='kmeans', radius=0.005, nclusters=50)
if cluster is None:
    # Nothing cached for these parameters, so compute the clustering now
    print('Computing Clustering')
    cluster, _ = cluster_events(data, alg='kmeans', radius=0.005, nclusters=50)

Clustering in cache ...
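
To get a feeling for the scale of the `radius` parameter, the following sketch converts a span in degrees to meters. It assumes that `radius` is expressed in degrees (as the comment in the cell suggests) and uses a spherical Earth approximation; `degrees_to_meters` is a hypothetical helper, not part of the `Code` package.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def degrees_to_meters(deg, latitude_deg):
    """Return (north_south_m, east_west_m) spanned by `deg` degrees at a latitude."""
    meters_per_degree = math.pi * EARTH_RADIUS_M / 180.0
    north_south = deg * meters_per_degree
    # Meridians converge towards the poles, so east-west spans shrink with cos(latitude)
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# 41.4 is roughly the latitude of Barcelona (inside the bounding box printed above)
ns, ew = degrees_to_meters(0.005, 41.4)
print(round(ns), round(ew))
```

Under these assumptions, 0.005 degrees spans about 556 m north-south and about 417 m east-west at this latitude.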


Now we generate the transactions for each user, joining the user's events over the period covered by the data as (position, time) items.

In [3]:
timedis = [6, 18] # Time discretization
trans = DailyClusteredTransactions(data, cluster=cluster, timeres=TimeDiscretizer(timedis))
trans.info()

Generating Transactions ...
Trans Size = 9950
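
The idea behind this step can be sketched as follows. This is a minimal illustration, not the implementation of `DailyClusteredTransactions` or `TimeDiscretizer`: it assumes that `[6, 18]` are hour boundaries defining two intervals (06:00-18:00, and 18:00-06:00 wrapping past midnight) and that events are grouped per user and day, as the class name suggests. The functions `time_interval` and `daily_transactions` and the toy events are hypothetical.

```python
from collections import defaultdict


def time_interval(hour, boundaries):
    """Index of the interval containing `hour`; the last interval wraps past midnight."""
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= hour < boundaries[i + 1]:
            return i
    return len(boundaries) - 1


def daily_transactions(events, boundaries):
    """Group (user, day, cluster_id, hour) events into one item set per (user, day)."""
    trans = defaultdict(set)
    for user, day, cluster_id, hour in events:
        trans[(user, day)].add((cluster_id, time_interval(hour, boundaries)))
    return dict(trans)


# Toy events (made up for illustration): user, day, spatial cluster id, hour of day
events = [
    ('u1', '2014-01-01', 3, 9),
    ('u1', '2014-01-01', 3, 20),
    ('u1', '2014-01-02', 7, 2),
    ('u2', '2014-01-01', 3, 9),
]
toy_trans = daily_transactions(events, [6, 18])
print(toy_trans)
```

Each item is a (spatial cluster, time interval) pair, so with 50 spatial clusters and two time intervals there are at most 100 distinct items.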


We obtain a data matrix whose attribute values can be binary ('bin'), absolute frequency ('af'), or normalized frequency ('nf'); we can append 'idf' to any of these to normalize by IDF.

In [4]:
# Minimum number of events
minloc = 5
# Attribute types 'bin'=[0,1] ; 'binidf'=[0,1]/IDF 
mode = 'bin'
datamat, users = trans.generate_data_matrix(minloc=minloc, mode=mode)

Generating data matrix ...
Generating colapsed Transactions ...
(1719, 100)
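
The attribute modes described above can be illustrated with a small NumPy sketch. It is only an approximation of what `generate_data_matrix` might do: it starts from a users x items count matrix, reads 'af' as raw counts and 'nf' as counts divided by the user's total, and applies IDF in the usual text-mining sense (multiplying by log(N / document frequency)). The function `build_matrix` is hypothetical, and the `minloc` filter is not covered.

```python
import numpy as np


def build_matrix(counts, mode='bin'):
    """Turn a users x items count matrix into attribute values for the given mode."""
    counts = np.asarray(counts, dtype=float)
    if mode.startswith('bin'):
        mat = (counts > 0).astype(float)        # 1 if the user ever produced the item
    elif mode.startswith('af'):
        mat = counts.copy()                     # raw counts
    elif mode.startswith('nf'):
        totals = counts.sum(axis=1, keepdims=True)
        mat = counts / np.maximum(totals, 1.0)  # each non-empty row sums to 1
    else:
        raise ValueError('unknown mode: %s' % mode)
    if mode.endswith('idf'):
        n_users = float(counts.shape[0])
        doc_freq = (counts > 0).sum(axis=0)     # number of users having each item
        idf = np.log(n_users / np.maximum(doc_freq, 1))
        mat = mat * idf                         # down-weight items shared by many users
    return mat


# Toy counts: 3 users, 3 (cluster, time) items
toy_counts = [[2, 0, 1],
              [0, 0, 3],
              [1, 0, 0]]
print(build_matrix(toy_counts, 'bin'))
print(build_matrix(toy_counts, 'nfidf'))
```

In the run above, the shape `(1719, 100)` is consistent with 1719 users remaining after the `minloc` filter and 100 items (50 spatial clusters times two time intervals).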


Next we cluster the users, using their rows in the data matrix as descriptions.

In [5]:
from Code.Clustering import cluster_colapsed_events

# Clustering Algorithms 'kmeans', 'spectral', 'affinity'
calg = 'kmeans'
# damping parameter for affinity propagation (0.1 - 1)
damping = 0.5
# number of clusters for kmeans and spectral clustering
nclust = 5
# Minimum number of elements in a cluster
minsize = 10

cls = cluster_colapsed_events(datamat, users, alg=calg, damping=damping, nclust=nclust, minsize=minsize)

[(c, len(cls[c])) for c in cls]

Clustering Transactions ...  kmeans


[('c3', 404), ('c2', 325), ('c1', 251), ('c0', 276), ('c4', 463)]
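
The result is a dictionary mapping cluster names to lists of users. A minimal sketch of how such a structure could be built from per-user cluster labels is shown below. The `'c<n>'` naming follows the keys printed above; that clusters smaller than `minsize` are discarded is an assumption, and `group_users` is a hypothetical helper rather than the code behind `cluster_colapsed_events`.

```python
from collections import defaultdict


def group_users(users, labels, minsize=10):
    """Map 'c<label>' to the list of its users, dropping clusters below minsize."""
    groups = defaultdict(list)
    for user, label in zip(users, labels):
        groups['c%d' % label].append(user)
    return {name: members for name, members in groups.items()
            if len(members) >= minsize}


# Toy example: five users and the cluster label assigned to each one
toy_users = ['a', 'b', 'c', 'd', 'e']
toy_labels = [0, 0, 1, 0, 2]
toy_cls = group_users(toy_users, toy_labels, minsize=2)
print([(c, len(toy_cls[c])) for c in toy_cls])
```

In the toy example only the cluster with at least two members survives the filter.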

Finally, we select the events of the users in one cluster and plot them on a map.

In [7]:
from Code.STData import STData

cluster_name = 'c0'
dataclus = data.select_data_users(cls[cluster_name])
dataclus.info()
mymap = dataclus.plot_events_cluster(cluster=cluster, dataname=cluster_name) 
mymap

Selecting Users ...
276
[False False False ..., False False False]
A=  twitter
C=  (None, (41.2, 41.65, 1.9, 2.4), 'bcn', None, 120, None)
D=  (0,)
Generating the events plot ...
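
Selecting the events of a group of users amounts to applying a boolean mask over the event array, like the one printed above. The sketch below shows this with NumPy on toy data; `select_events_by_users` is a hypothetical helper and does not reproduce the internals of `select_data_users`.

```python
import numpy as np


def select_events_by_users(event_users, event_coords, selected_users):
    """Return (mask, coords) keeping only the events produced by selected users."""
    event_users = np.asarray(event_users)
    # True for every event whose user id is in the selected set
    mask = np.isin(event_users, list(selected_users))
    return mask, np.asarray(event_coords)[mask]


# Toy events (made up): user id and (latitude, longitude) of each event
toy_event_users = ['u1', 'u2', 'u1', 'u3']
toy_event_coords = [[41.38, 2.17], [41.40, 2.15], [41.39, 2.18], [41.41, 2.20]]
mask, selected = select_events_by_users(toy_event_users, toy_event_coords, {'u1', 'u3'})
print(mask)
print(selected.shape)
```

The number of `True` entries in the mask equals the number of rows kept, which makes for a quick sanity check on the selection.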

