I want to use a sequential pattern mining algorithm to find the visitors' most frequent movements: I sort the data chronologically, from oldest to most recent, and apply the algorithm to extract the most frequent patterns.

In [2]:
from prefixspan import PrefixSpan
import numpy as np
import pandas as pd
from datetime import datetime
from pandas import Timestamp

In [3]:
year = '2016' #change if needed
csv_folder = '../csv/'+year

#### EXPLANATION:
<i>ps.frequent(3)</i> => returns all patterns that appear in at least 3 sequences (so in our case, it returns every sequence of cluster labels that occurs in the rows of at least 3 users).

<i>ps.topk(5)</i> => returns the 5 most frequent patterns in the dataset.
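
The notion of support behind both calls can be sketched in plain Python, without the prefixspan library. This is only an illustration under the usual definition of a sequential pattern as a subsequence that may contain gaps; `is_subsequence`, `support` and `toy_db` are hypothetical names and data, not part of the notebook.

```python
def is_subsequence(pattern, sequence):
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator until the item is found,
    # so later items are only searched for after the earlier ones.
    return all(item in remaining for item in pattern)


def support(pattern, db):
    """Number of sequences (users) in `db` that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in db)


# Hypothetical toy database: one list of cluster labels per user.
toy_db = [
    [46, 46, 4],
    [46, 9, 4],
    [9, 46, 46],
    [46, 4, 46, 4],  # contains [46, 4] twice but still counts once
    [16, 16],
]

print(support([46, 4], toy_db))   # counted once per user
print(support([46, 46], toy_db))
```

A pattern is counted once per user, however many times it repeats inside that user's row, which is why a minimum support of 3 means "at least 3 users".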

# Test with the df labeled by Bisecting k-means

This analysis takes the output of the clustering step, in which each tweet is labeled with its cluster, and groups it by user, so that each user ends up with the ordered list of all the movements he or she made. We then apply the PrefixSpan algorithm to find the most frequent movement patterns among users, which gives a better understanding of what interests visitors to the Internet Festival most.

In [4]:
# read the csv
df_bkmeans = pd.read_csv(csv_folder+'/df_bkmeans.csv')
# drop the unused index column
df_bkmeans.drop(['Unnamed: 0'], axis='columns', inplace=True)
# convert Created_At to datetime
df_bkmeans['Created_At'] = pd.to_datetime(df_bkmeans['Created_At'])
# sort the rows by date
df_bkmeans = df_bkmeans.sort_values(by=['Created_At'])

df_bkmeans

Unnamed: 0,Screen_name,UserID,TweetID,Coords,Lat,Lon,Created_At,Text,Labels
169,kiaruzza,16315538,779002375612796928,"[43.7167, 10.3833]",43.716700,10.383300,2016-09-22 17:00:14+00:00,Stacco la spina.\n.\n.\n.#corona #beer #salt #...,16
649,liviofotografie,376767411,779011990358556672,"[43.71266, 10.39692]",43.712660,10.396920,2016-09-22 17:38:26+00:00,"Doppio anniversario per Filippo e Valentina, c...",13
513,recasensroger,2789960074,779013476882657280,"[43.7167, 10.3833]",43.716700,10.383300,2016-09-22 17:44:20+00:00,Torre de Pisa🇮🇹 #Canon #Pisa #torredepisa #pi...,16
381,djstephfloss,16136553,779021131378548741,"[43.72263, 10.3948]",43.722630,10.394800,2016-09-22 18:14:45+00:00,"In honor of the first day of Fall, let's not f...",9
75,curropar,10651072,779022459802685440,"[43.7229514, 10.39497213]",43.722951,10.394972,2016-09-22 18:20:02+00:00,"I'm at Tower of Pisa in Pisa, PI https://t.co/...",9
...,...,...,...,...,...,...,...,...,...
14,Mattar79,67786279,783022414187880449,"[43.72040817, 10.39710583]",43.720408,10.397106,2016-10-03 19:14:26+00:00,Seguimos el periplo por este país (@ Pisa in P...,35
234,LaScalettaPisa,2327332688,783023774195077120,"[43.72725, 10.38945]",43.727250,10.389450,2016-10-03 19:19:50+00:00,Astici per una magnifica serata. Il segreto? U...,26
235,LaScalettaPisa,2327332688,783024417005645824,"[43.72725, 10.38945]",43.727250,10.389450,2016-10-03 19:22:23+00:00,Questa sera per un meeting importantissimo di ...,26
162,_miReyaRedondo,239979352,783038120497389568,"[43.72305556, 10.39641667]",43.723056,10.396417,2016-10-03 20:16:50+00:00,|• BELLA ITALIA •| 🇮🇹❣\n\n#pisa #italy #Murien...,46


### Procedure:
Each row is a user, and each number in the row identifies the geohash cell or the cluster in which the user tweeted. The data is in chronological order. Note that we decided to keep the noise points in the analysis; a movement towards a noise point, however, is not interesting for the purposes of this analysis.
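
The row-per-user layout described above can be illustrated with a small plain-Python sketch. It is a stand-in for the pandas `groupby` used in the notebook, not a replacement for it; the `records` data and the `build_sequences` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy records: (screen_name, created_at, label).
# ISO-formatted timestamps sort chronologically as plain strings.
records = [
    ("bob",   "2016-09-22 18:00:00", 9),
    ("alice", "2016-09-22 17:00:00", 16),
    ("alice", "2016-09-23 10:00:00", 46),
    ("bob",   "2016-09-22 17:30:00", 35),
    ("alice", "2016-09-22 19:00:00", 16),
]


def build_sequences(records):
    """Return {user: [labels in chronological order]}."""
    sequences = defaultdict(list)
    # Sort once by timestamp, then append each label to its user's row.
    for user, _, label in sorted(records, key=lambda r: r[1]):
        sequences[user].append(label)
    return dict(sequences)


sequences = build_sequences(records)
print(sequences)
# The sequence database passed to a miner is just the list of rows.
print(list(sequences.values()))
```

Sorting before grouping is what guarantees that each row reads from the oldest tweet to the most recent one.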

In [6]:
# group the df by user
Users = df_bkmeans.groupby('Screen_name')

# one chronologically ordered list of cluster labels per user
lista = [group['Labels'].tolist() for _, group in Users]

ps = PrefixSpan(lista)

### Frequent patterns in at least 3 users

In [7]:
ps_3 = ps.frequent(3, closed=True)

# keep only patterns of length 3 or more
frequent_moves = []

for item in ps_3:
    if len(item[1]) > 2:
        frequent_moves.append(item)

frequent_moves.sort(reverse=True)
frequent_moves

[(7, [46, 46, 46]),
 (7, [16, 16, 16]),
 (4, [46, 46, 4]),
 (4, [46, 4, 4]),
 (4, [16, 16, 16, 16]),
 (4, [4, 4, 4]),
 (3, [46, 46, 46, 46]),
 (3, [46, 46, 4, 4]),
 (3, [46, 4, 4, 4]),
 (3, [35, 9, 46]),
 (3, [9, 46, 46]),
 (3, [9, 46, 4]),
 (3, [9, 9, 9]),
 (3, [4, 46, 46])]

Bisecting k-means labels: predictably, the patterns with the highest support are those that involve no real movement, i.e. users who tweeted several times from the same place. Excluding those, the most frequent paths include 46, 46, 4 (4 users) and 35, 9, 46 (3 users) (image).
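
Excluding the "no real movement" patterns can also be done programmatically. The following is a minimal sketch, not part of the original notebook: `has_real_movement` and `collapse_repeats` are hypothetical helpers, and `mined` is a subset of the output shown above.

```python
from itertools import groupby


def has_real_movement(pattern):
    """True if the pattern visits at least two distinct labels."""
    return len(set(pattern)) > 1


def collapse_repeats(pattern):
    """Merge consecutive identical labels, e.g. [46, 46, 4] -> [46, 4]."""
    return [label for label, _ in groupby(pattern)]


# Subset of the (support, pattern) pairs printed above.
mined = [
    (7, [46, 46, 46]),
    (7, [16, 16, 16]),
    (4, [46, 46, 4]),
    (3, [35, 9, 46]),
    (3, [9, 9, 9]),
]

# Drop patterns in which the user never leaves the same cluster.
moves = [(sup, pat) for sup, pat in mined if has_real_movement(pat)]

for sup, pat in moves:
    print(sup, pat, "->", collapse_repeats(pat))
```

Collapsing repeats is optional: it turns repeated tweets from one place into a single stop, which makes the remaining paths easier to read on a map.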

### Frequent patterns in at least 2 users

In [8]:
ps_2 = ps.frequent(2, closed=True)

# keep only patterns of length 3 or more
frequent_moves = []

for item in ps_2:
    if len(item[1]) > 2:
        frequent_moves.append(item)

frequent_moves.sort(reverse=True)
frequent_moves

[(7, [46, 46, 46]),
 (7, [16, 16, 16]),
 (4, [46, 46, 4]),
 (4, [46, 4, 4]),
 (4, [16, 16, 16, 16]),
 (4, [4, 4, 4]),
 (3, [46, 46, 46, 46]),
 (3, [46, 46, 4, 4]),
 (3, [46, 4, 4, 4]),
 (3, [35, 9, 46]),
 (3, [9, 46, 46]),
 (3, [9, 46, 4]),
 (3, [9, 9, 9]),
 (3, [4, 46, 46]),
 (2, [46, 46, 4, 46, 46]),
 (2, [46, 46, 4, 4, 46]),
 (2, [46, 9, 9]),
 (2, [46, 4, 46, 4, 46]),
 (2, [21, 21, 21]),
 (2, [16, 46, 46]),
 (2, [9, 46, 46, 46]),
 (2, [9, 46, 46, 4]),
 (2, [4, 46, 46, 46]),
 (2, [4, 16, 16]),
 (2, [4, 4, 9])]

# Test with the df labeled by Dbscan3

In [12]:
# read the csv
df_dbscan3 = pd.read_csv(csv_folder+'/df_dbscan3.csv')
# drop the unused index column
df_dbscan3.drop(['Unnamed: 0'], axis='columns', inplace=True)
# convert Created_At to datetime
df_dbscan3['Created_At'] = pd.to_datetime(df_dbscan3['Created_At'])
# sort the rows by date
df_dbscan3 = df_dbscan3.sort_values(by=['Created_At'])

df_dbscan3

Unnamed: 0,Screen_name,UserID,TweetID,Coords,Lat,Lon,Created_At,Text,Labels
169,kiaruzza,16315538,779002375612796928,"[43.7167, 10.3833]",43.716700,10.383300,2016-09-22 17:00:14+00:00,Stacco la spina.\n.\n.\n.#corona #beer #salt #...,1
649,liviofotografie,376767411,779011990358556672,"[43.71266, 10.39692]",43.712660,10.396920,2016-09-22 17:38:26+00:00,"Doppio anniversario per Filippo e Valentina, c...",21
513,recasensroger,2789960074,779013476882657280,"[43.7167, 10.3833]",43.716700,10.383300,2016-09-22 17:44:20+00:00,Torre de Pisa🇮🇹 #Canon #Pisa #torredepisa #pi...,1
381,djstephfloss,16136553,779021131378548741,"[43.72263, 10.3948]",43.722630,10.394800,2016-09-22 18:14:45+00:00,"In honor of the first day of Fall, let's not f...",3
75,curropar,10651072,779022459802685440,"[43.7229514, 10.39497213]",43.722951,10.394972,2016-09-22 18:20:02+00:00,"I'm at Tower of Pisa in Pisa, PI https://t.co/...",3
...,...,...,...,...,...,...,...,...,...
14,Mattar79,67786279,783022414187880449,"[43.72040817, 10.39710583]",43.720408,10.397106,2016-10-03 19:14:26+00:00,Seguimos el periplo por este país (@ Pisa in P...,3
234,LaScalettaPisa,2327332688,783023774195077120,"[43.72725, 10.38945]",43.727250,10.389450,2016-10-03 19:19:50+00:00,Astici per una magnifica serata. Il segreto? U...,15
235,LaScalettaPisa,2327332688,783024417005645824,"[43.72725, 10.38945]",43.727250,10.389450,2016-10-03 19:22:23+00:00,Questa sera per un meeting importantissimo di ...,15
162,_miReyaRedondo,239979352,783038120497389568,"[43.72305556, 10.39641667]",43.723056,10.396417,2016-10-03 20:16:50+00:00,|• BELLA ITALIA •| 🇮🇹❣\n\n#pisa #italy #Murien...,3


In [13]:
# group the df by user
Users = df_dbscan3.groupby('Screen_name')

# one chronologically ordered list of cluster labels per user
lista = [group['Labels'].tolist() for _, group in Users]

ps = PrefixSpan(lista)

In [14]:
ps_3 = ps.frequent(3, closed=True)
# print only patterns of length 3 or more
for item in ps_3:
    if len(item[1]) > 2:
        print(item)

(4, [1, 3, 3])
(7, [1, 1, 1])
(4, [1, 1, 1, 1])
(20, [3, 3, 3])
(11, [3, 3, 3, 3])
(6, [3, 3, 3, 3, 3])
(4, [3, 3, 3, 3, 3, 3])
(3, [3, 3, 3, 3, 3, 3, 3, 3, 3])
(3, [3, 3, 3, 4])
(4, [3, 3, 4])
(3, [3, 3, 1])
(3, [3, 1, 1])
(3, [3, 1, 3])
(6, [4, 4, 4])
(3, [4, 4, 4, 3])
(5, [4, 4, 3])
(3, [4, 4, 3, 1])
(5, [4, 4, 1])
(3, [4, 3, 3])
(3, [4, 1, 4])


In [15]:
ps_2 = ps.frequent(2, closed=True)
# print only patterns of length 3 or more
for item in ps_2:
    if len(item[1]) > 2:
        print(item)

(2, [-1, 3, 3, 3])
(4, [1, 3, 3])
(2, [1, 3, 3, 3])
(2, [1, 1, 7])
(2, [1, 1, 4])
(7, [1, 1, 1])
(4, [1, 1, 1, 1])
(2, [1, 4, 3])
(2, [1, 4, 1])
(20, [3, 3, 3])
(11, [3, 3, 3, 3])
(6, [3, 3, 3, 3, 3])
(4, [3, 3, 3, 3, 3, 3])
(3, [3, 3, 3, 3, 3, 3, 3, 3, 3])
(2, [3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
(2, [3, 3, 3, 3, 4])
(3, [3, 3, 3, 4])
(2, [3, 3, 3, 1])
(2, [3, 3, 3, 1, 4])
(4, [3, 3, 4])
(3, [3, 3, 1])
(3, [3, 1, 1])
(3, [3, 1, 3])
(2, [3, 1, 3, 1])
(2, [3, 4, 4, 4])
(2, [3, 4, 4, 3, 1])
(2, [3, 7, 3])
(2, [3, 7, 1])
(2, [7, 7, 7])
(2, [7, 7, 1])
(2, [7, 4, 7])
(2, [7, 4, 19, 3])
(2, [7, 4, 3])
(2, [7, 19, 3])
(6, [4, 4, 4])
(3, [4, 4, 4, 3])
(2, [4, 4, 4, 3, 1])
(5, [4, 4, 3])
(3, [4, 4, 3, 1])
(2, [4, 4, 3, 4])
(5, [4, 4, 1])
(2, [4, 4, 1, 1])
(2, [4, 4, 1, 4])
(3, [4, 3, 3])
(3, [4, 1, 4])
(2, [4, -1, 1])
(2, [4, -1, 3])
(2, [5, 5, 5])


# Test with the df labeled by Geohash

In [16]:
# read the csv
df_geohash = pd.read_csv(csv_folder+'/df_geohash_to_numbers.csv')
# drop the unused index column
df_geohash.drop(['Unnamed: 0'], axis='columns', inplace=True)
# convert Created_At to datetime
df_geohash['Created_At'] = pd.to_datetime(df_geohash['Created_At'])
# sort the rows by date
df_geohash = df_geohash.sort_values(by=['Created_At'])

df_geohash

Unnamed: 0,Screen_name,UserID,TweetID,Coords,Lat,Lon,Created_At,Text,geohash
169,kiaruzza,16315538,779002375612796928,"[43.7167, 10.3833]",43.716700,10.383300,2016-09-22 17:00:14+00:00,Stacco la spina.\n.\n.\n.#corona #beer #salt #...,6
649,liviofotografie,376767411,779011990358556672,"[43.71266, 10.39692]",43.712660,10.396920,2016-09-22 17:38:26+00:00,"Doppio anniversario per Filippo e Valentina, c...",9
513,recasensroger,2789960074,779013476882657280,"[43.7167, 10.3833]",43.716700,10.383300,2016-09-22 17:44:20+00:00,Torre de Pisa🇮🇹 #Canon #Pisa #torredepisa #pi...,6
381,djstephfloss,16136553,779021131378548741,"[43.72263, 10.3948]",43.722630,10.394800,2016-09-22 18:14:45+00:00,"In honor of the first day of Fall, let's not f...",12
75,curropar,10651072,779022459802685440,"[43.7229514, 10.39497213]",43.722951,10.394972,2016-09-22 18:20:02+00:00,"I'm at Tower of Pisa in Pisa, PI https://t.co/...",12
...,...,...,...,...,...,...,...,...,...
14,Mattar79,67786279,783022414187880449,"[43.72040817, 10.39710583]",43.720408,10.397106,2016-10-03 19:14:26+00:00,Seguimos el periplo por este país (@ Pisa in P...,12
234,LaScalettaPisa,2327332688,783023774195077120,"[43.72725, 10.38945]",43.727250,10.389450,2016-10-03 19:19:50+00:00,Astici per una magnifica serata. Il segreto? U...,17
235,LaScalettaPisa,2327332688,783024417005645824,"[43.72725, 10.38945]",43.727250,10.389450,2016-10-03 19:22:23+00:00,Questa sera per un meeting importantissimo di ...,17
162,_miReyaRedondo,239979352,783038120497389568,"[43.72305556, 10.39641667]",43.723056,10.396417,2016-10-03 20:16:50+00:00,|• BELLA ITALIA •| 🇮🇹❣\n\n#pisa #italy #Murien...,12


In [17]:
# group the df by user
Users = df_geohash.groupby('Screen_name')

# one chronologically ordered list of geohash cell ids per user
lista = [group['geohash'].tolist() for _, group in Users]

ps = PrefixSpan(lista)

In [18]:
ps_3 = ps.frequent(3, closed=True)
# print only patterns of length 3 or more
for item in ps_3:
    if len(item[1]) > 2:
        print(item)

(21, [12, 12, 12])
(10, [12, 12, 12, 12])
(6, [12, 12, 12, 12, 12])
(4, [12, 12, 12, 12, 12, 12])
(3, [12, 12, 12, 12, 12, 12, 12, 12, 12])
(3, [12, 12, 12, 11])
(5, [12, 12, 11])
(5, [12, 12, 6])
(3, [12, 12, 6, 11])
(4, [12, 6, 6])
(4, [12, 6, 12])
(3, [12, 11, 11, 6])
(3, [12, 11, 11, 11])
(5, [9, 9, 9])
(3, [9, 9, 6])
(4, [6, 12, 12])
(3, [6, 6, 9])
(7, [6, 6, 6])
(4, [6, 6, 6, 6])
(6, [11, 11, 12])
(5, [11, 11, 6])
(7, [11, 11, 11])
(3, [11, 11, 11, 12, 6])
(3, [11, 12, 12])
(3, [11, 6, 11])


In [19]:
ps_2 = ps.frequent(2, closed=True)
# print only patterns of length 3 or more
for item in ps_2:
    if len(item[1]) > 2:
        print(item)

(21, [12, 12, 12])
(10, [12, 12, 12, 12])
(6, [12, 12, 12, 12, 12])
(4, [12, 12, 12, 12, 12, 12])
(3, [12, 12, 12, 12, 12, 12, 12, 12, 12])
(2, [12, 12, 12, 12, 12, 12, 12, 12, 12, 12])
(3, [12, 12, 12, 11])
(2, [12, 12, 12, 6])
(2, [12, 12, 12, 6, 11])
(5, [12, 12, 11])
(2, [12, 12, 11, 11, 6])
(2, [12, 12, 11, 11, 11])
(5, [12, 12, 6])
(2, [12, 12, 6, 6])
(2, [12, 12, 6, 6, 11])
(3, [12, 12, 6, 11])
(4, [12, 6, 6])
(4, [12, 6, 12])
(2, [12, 6, 12, 12])
(2, [12, 6, 12, 6])
(3, [12, 11, 11, 6])
(2, [12, 11, 11, 6, 11])
(3, [12, 11, 11, 11])
(2, [12, 11, 11, 11, 11, 11])
(2, [12, 11, 11, 11, 11, 12, 6])
(2, [12, 9, 6])
(2, [9, 6, 6])
(5, [9, 9, 9])
(2, [9, 9, 9, 9, 9])
(2, [9, 9, 9, 6])
(3, [9, 9, 6])
(2, [9, 9, 11])
(2, [9, 11, 9, 11])
(2, [9, 11, 11, 12])
(2, [9, 11, 12])
(4, [6, 12, 12])
(2, [6, 12, 12, 12])
(3, [6, 6, 9])
(7, [6, 6, 6])
(4, [6, 6, 6, 6])
(2, [6, 9, 6])
(2, [6, 11, 12])
(2, [6, 11, 6])
(2, [19, 19, 19])
(2, [11, 11, 13, 12])
(6, [11, 11, 12])
(2, [11, 11, 12, 11])
(5