### Research goal: predict revisit intention from indoor moving patterns.

User Wi-Fi logs were crawled and indexed by moving pattern; suitable features were then selected to analyze which moving patterns are related to a user's later revisit.

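#### Feature description:
Features used by the supervised learning (classification) model that predicts the revisit intention of a given moving pattern. The dwell-time threshold is 100, as in the code below; the column name prob_dwell_10 is kept as it appears in the outputs.
1. Total number of logs recorded for the user: num_logs
2. Total time the user was detected by Wi-Fi: total_dwell_time
3. Number of indoor area logs with dwell_time > 100: num_sp_100
4. Fraction of indoor logs with dwell_time > 100: prob_dwell_10
5. Fraction of all logs with deny=True (staff): prob_deny
6. Total time spent in indoor areas with dwell_time > 100: time_sp_100
7. Standard deviation of dwell_time over indoor areas with dwell_time > 100: std_sp_100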


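#### Details
1. Moving pattern: the sequence of Wi-Fi areas one user passed through in the store during one day (e.g., out,in,b1,b1-left-1,1f,out,in,b1-left).
2. In the raw data, each row is one Wi-Fi log: which device_id was connected to the Wi-Fi receiver of which area, from what time and for how many seconds, whether the device belongs to staff, the revisit count so far, and the most recent revisit interval. One moving pattern therefore spans several log rows, so the log date and device_id are combined into a key and the logs are aggregated so that each moving pattern becomes a single row.
3. The existing revisit_count data was often inconsistent; for example, the same device_id could have a revisit count of 69 one day and 7 the next. This column was therefore not used, and the count was rebuilt from the logs.
4. revisit_intention is a binary variable and the label we want to predict. If there is another store visit within n days after a visit, revisit_intention = 1.
5. Definition of a store visit: a user is considered to have visited the store on a given day if that day's moving pattern contains at least one 'in' log. A pattern consisting only of repeated out logs is treated as a passer-by and not counted as a visit.
6. Accordingly, for each user, the visit with the oldest timestamp that has an 'in' log is taken as the first visit, and subsequent visits with an 'in' log are counted to compute revisit_count. Consecutive visits are at least 1 day apart. revisit_count is used as a feature so that the history of each visit pattern is not lost.
7. Note: the earlier model, which indexed by User-ID and naively predicted the per-user revisit_count, is documented in supervised_model(basic_features).ipynb.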
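
As a minimal sketch of points 2, 5 and 6, the toy logs below (made-up device IDs, dates and areas, not taken from the data) are aggregated per (device_id, date), a day counts as a visit only if its pattern contains an 'in' log, and the visit count is accumulated per device. The column names `visited` and `visit_count` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy logs: one row per Wi-Fi log (all values are made up).
logs = pd.DataFrame({
    "device_id": ["a", "a", "a", "a", "b", "b"],
    "date":      [100, 100, 101, 103, 100, 102],
    "area":      ["out", "in", "out", "in", "out", "out"],
})

# One row per moving pattern: key = (device_id, date).
patterns = (
    logs.groupby(["device_id", "date"])["area"]
        .agg(lambda s: ",".join(s))
        .to_frame(name="traj")
)

# A day is a store visit only if some log's area is exactly 'in'.
patterns["visited"] = (
    patterns["traj"].str.split(",").apply(lambda areas: "in" in areas).astype(int)
)

# Cumulative number of visits per device, in date order.
patterns["visit_count"] = patterns.groupby(level="device_id")["visited"].cumsum()

print(patterns)
```

Device `b` only ever appears outside, so its visit count stays at 0, while the out-only day of device `a` keeps the count from its earlier visit.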

__Data__
1. Data from stores 781 and 786

In [92]:
### import libraries
import pandas as pd
import datetime
import numpy as np
import re

In [9]:
df = pd.read_pickle("../data/781/781.p")
df.head(5)   ### Show raw data

Unnamed: 0,area,deny,device_id,dwell_time,key,revisit_count,revisit_period,ts
0,1f-right-1,True,cae0f3cb170db4ae18897d6af8497c38,3330,781:7fea91f80d36bcc5:1f-right-1,,,1472645246100
1,right-test,True,cae0f3cb170db4ae18897d6af8497c38,3330,781:7fea91f80d36bcc5:right-test,,,1472645246100
2,out,,c41e7932a5fedc55d61950559232c0bf,0,781:7fea91f80e6add4d:out,,,1472645241170
3,out,,249e7ac1b1cd6ba9bee1cd81ef1e013f,85,781:7fea91f80ea49f2d:out,,,1472645240246
4,out,,35e9e62a01594c31a9297c397cee0390,34,781:7fea91f80ee44f5d:out,,,1472645239227


In [11]:
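# ts is in milliseconds, so integer division by 86,400,000 gives days since the Unix epoch
df['date'] = df['ts'] // 86400000
# df = df.loc[(df['deny']!= True) & (df['dwell_time'] > 0)]
df = df.loc[(df['dwell_time'] > 0)].copy()   # drop zero-dwell logs; copy to avoid SettingWithCopyWarning
# Key that identifies one moving pattern: one device on one day
df['date_device_id'] = df.date.map(str) + "_" + df.device_id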

In [12]:
remainder = (df['ts']%604800000)/1000

def timestamp_to_day(x):
    a = x / 86400
    switcher = {
        0: "Thu",
        1: "Fri",
        2: "Sat",
        3: "Sun",
        4: "Mon",
        5: "Tue",
        6: "Wed"
    }
    return switcher.get(int(a))

df['day'] = remainder.apply(lambda x: timestamp_to_day(x))

df.head(5) ### show difference after basic preprocessing

Unnamed: 0,area,deny,device_id,dwell_time,key,revisit_count,revisit_period,ts,date,date_device_id,day
0,1f-right-1,True,cae0f3cb170db4ae18897d6af8497c38,3330,781:7fea91f80d36bcc5:1f-right-1,,,1472645246100,17044,17044_cae0f3cb170db4ae18897d6af8497c38,Wed
1,right-test,True,cae0f3cb170db4ae18897d6af8497c38,3330,781:7fea91f80d36bcc5:right-test,,,1472645246100,17044,17044_cae0f3cb170db4ae18897d6af8497c38,Wed
3,out,,249e7ac1b1cd6ba9bee1cd81ef1e013f,85,781:7fea91f80ea49f2d:out,,,1472645240246,17044,17044_249e7ac1b1cd6ba9bee1cd81ef1e013f,Wed
4,out,,35e9e62a01594c31a9297c397cee0390,34,781:7fea91f80ee44f5d:out,,,1472645239227,17044,17044_35e9e62a01594c31a9297c397cee0390,Wed
6,out,,6370140ec2920e3f1878d877cd9a7b2a,776,781:7fea91f80f441dc9:out,,,1472645237694,17044,17044_6370140ec2920e3f1878d877cd9a7b2a,Wed
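
The lookup table in `timestamp_to_day` relies on day 0 of the Unix epoch having been a Thursday. As a cross-check, here is a sketch that derives the weekday with pandas' datetime accessors, assuming `ts` is in milliseconds and interpreted as UTC. `ts_to_weekday` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

def ts_to_weekday(ts_ms):
    """Map a Series of epoch timestamps in milliseconds (UTC) to 'Mon'..'Sun'."""
    return pd.to_datetime(ts_ms, unit="ms").dt.day_name().str[:3]

# The first value is a ts from the table above; 0 is the epoch itself.
sample = pd.Series([1472645246100, 0])

print(ts_to_weekday(sample).tolist())
print((sample // 86400000).tolist())   # days since the epoch, as in df['date']
```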


In [125]:
def feature_generator(df_toy):
    print('Generating features from raw data')
    ### F1: total number of logs
    f1 = df_toy.groupby(['date_device_id'])['ts'].count()
#     print(f1.head(5))

    ### F2: total time detected by Wi-Fi
    f2 = df_toy.groupby(['date_device_id'])['dwell_time'].sum()
#     print(f2.head(5))

    ### F3: number of indoor area logs with dwell_time > 100
    df_toy_indoor = df_toy.loc[df_toy['area']!='out']
    df_toy_indoor2 = df_toy_indoor.loc[df_toy_indoor['dwell_time']>100]
    f3 = df_toy_indoor2.groupby(['date_device_id'])['area'].count()
#     print(f3.head(5))

    ### F4: fraction of indoor logs with dwell_time > 100
    f3_2 = df_toy_indoor.groupby(['date_device_id'])['area'].count()
    f4 = f3.div(f3_2)
#     print(f4.head(5))

    ### F5: total time spent in indoor areas with dwell_time > 100
    f5 = df_toy_indoor2.groupby(['date_device_id'])['dwell_time'].sum()
#     print(f5.head(5))

    ### F6: standard deviation of dwell_time in indoor areas with dwell_time > 100
    f6 = df_toy_indoor2.groupby(['date_device_id'])['dwell_time'].std()
#     print(f6.head(5))

    ### F7: fraction of logs with deny = True
    a = df_toy.groupby(['date_device_id']).deny.count()
    b = df_toy['date_device_id'].value_counts()
    f7 = a.div(b)
#     print(f7.head(5))


    
#     ### F8: total number of logs, by day of week
#     days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
#     f8 = df_toy_indoor2.groupby(['day', 'date_device_id'])['dwell_time'].sum()
#     f8 = f8.reindex(days, level='day')
#     f8 = f8.to_frame(name='count').reset_index()
# #     print(f8.head(5))
    
#     ### Label: Maximum revisit count from the log
#     label_toy = df_toy.groupby(['date_device_id'])['revisit_count'].max()
# #     print(label_toy.head(5))

    return f1, f2, f3, f4, f5, f6, f7
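
F3 to F6 are computed only from indoor logs above the dwell threshold, so a moving pattern with no such log is missing from those Series and shows up as NaN once the features are aligned. A small sketch on made-up data, mirroring F3 and F4 (the variable names are illustrative):

```python
import pandas as pd

# Hypothetical logs for two moving patterns (values are made up).
toy = pd.DataFrame({
    "date_device_id": ["1_a", "1_a", "1_a", "1_b", "1_b"],
    "area":           ["out", "in", "b1", "out", "in"],
    "dwell_time":     [50, 300, 150, 20, 40],
})

indoor = toy.loc[toy["area"] != "out"]
long_stay = indoor.loc[indoor["dwell_time"] > 100]

n_long = long_stay.groupby("date_device_id")["area"].count()   # like F3
n_indoor = indoor.groupby("date_device_id")["area"].count()    # like f3_2
share = n_long.div(n_indoor)                                   # like F4

print(share)
```

Pattern `1_b` has an indoor log but none above the threshold, so its share is NaN and not 0; the later `fillna(0)` is what turns these into zeros.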

In [156]:
def df_generator(df, f1, f2, f3, f4, f5, f6, f7):
    print('Generating a data frame which aggergated features')

#     days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
#     days_numlogs = ['num_logs_' + s for s in days]
    columns = ['num_logs', 'total_dwell_time', 'num_sp_100', 'prob_dwell_10', 'time_sp_100', 'std_sp_100', 'prob_deny']
#     columns = columns + days_numlogs

    # Sort the ids with np.sort so the index is ordered consistently with the feature Series.
    device_ids = np.sort(df['date_device_id'].unique())
    df2 = pd.DataFrame(columns=columns, index=device_ids)

    # Insert the features into the data frame
    df2["num_logs"] = f1
    df2["total_dwell_time"] = f2
    df2["num_sp_100"] = f3
    df2["prob_dwell_10"] = f4   # must match the name in columns; "prob_dwell_100" would add an extra column
    df2["time_sp_100"] = f5
    df2["std_sp_100"] = f6
    df2["prob_deny"] = f7

#     ### Part that merges F8 into df
#     for day in days:
#         f8_certain_day = f8.loc[f8['day']==day]
#         f8_certain_day = f8_certain_day[["date_device_id", "count"]].set_index(['date_device_id'])
#         columnName = 'num_logs_'+day
#         df2[columnName] = f8_certain_day

#     # Merge the label into df
#     df2 = pd.concat([df2, label], axis=1)   

    # Return a dataframe that can be used directly for machine learning
    return df2
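
An alternative sketch of the same assembly step: `pd.concat` with `axis=1` aligns Series on their index labels, so the ids do not need to be sorted first and the column names are stated only once. `build_feature_frame` is a hypothetical helper, not part of the notebook, and the Series below are made up.

```python
import pandas as pd

def build_feature_frame(features):
    """Combine a {column_name: Series} dict into one frame, aligned on the index."""
    return pd.concat(features, axis=1)

# Made-up feature Series; the second one lacks an entry for "1_b".
f_a = pd.Series({"1_a": 3, "1_b": 2})
f_b = pd.Series({"1_a": 450.0})

frame = build_feature_frame({"num_logs": f_a, "time_sp_100": f_b})
print(frame)
```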

In [126]:
f1, f2, f3, f4, f5, f6, f7 = feature_generator(df)

Generating features from raw data


In [128]:
df2 = df_generator(df, f1, f2, f3, f4, f5, f6, f7)

Generating a data frame which aggergated features


In [130]:
df2.shape

(1876809, 7)

### Parts not yet merged into the functions above
Trajectories for each moving pattern, revisit_count, etc.

In [15]:
### traj1: Generate trajectories for each moving patterns
traj = df.groupby(['date', 'device_id'])['area']
trajformovings = traj.apply(lambda x: ','.join(x.sort_index(ascending=False))[::])
traj1 = trajformovings.to_frame(name='traj').reset_index()

### traj2: Count visit_counts for each user (history)
trajsum = traj.sum()
def checkin(x):
    """Return 1 if the concatenated area string contains 'in', else 0."""
    return 1 if "in" in x else 0
trajcount = trajsum.map(lambda x: checkin(x)).groupby(level=1).cumsum()
traj2 = trajcount.to_frame(name='new_visit_count').reset_index()
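
`checkin` tests for the substring 'in' in the concatenated area names, so it would also match any area whose name merely contains those letters. Below is a sketch of a stricter variant, on made-up data, that compares whole area names. 'main-hall' is an invented area name used only to show the difference; whether such names occur in the real data is not known here.

```python
import pandas as pd

# Hypothetical logs: day 1 has a real 'in' log, day 2 does not.
toy = pd.DataFrame({
    "date":      [1, 1, 2, 2],
    "device_id": ["a", "a", "a", "a"],
    "area":      ["out", "in", "out", "main-hall"],
})

grouped = toy.groupby(["date", "device_id"])["area"]

# Substring test on the concatenated area names (the checkin approach).
substring_hit = grouped.apply(lambda s: int("in" in "".join(s)))

# Exact test: 1 only if some log's area is exactly 'in'.
exact_hit = grouped.apply(lambda s: int((s == "in").any()))

print(substring_hit.tolist(), exact_hit.tolist())
```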

In [71]:
traj1['date_device_id'] = traj1.date.map(str) + "_" + traj1.device_id
traj2['date_device_id'] = traj2.date.map(str) + "_" + traj2.device_id

traj1 = traj1.set_index(['date_device_id'])
traj2 = traj2.set_index(['date_device_id'])

trajs_full = pd.concat([traj1, traj2[['new_visit_count']]], axis=1)

### show some sample moving patterns having large history
trajs_full.loc[trajs_full['new_visit_count']>100].tail(5)

Unnamed: 0_level_0,date,device_id,traj,new_visit_count
date_device_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
17044_7961e337158a5b699a0d6314a7c680b1,17044,7961e337158a5b699a0d6314a7c680b1,"out,in,1f,in,1f-right-2,1f,in,1f,in,1f,in,1f,i...",107
17044_8b4357a377e3673d44489d67eee696ac,17044,8b4357a377e3673d44489d67eee696ac,"out,in,b1",362
17044_cae0f3cb170db4ae18897d6af8497c38,17044,cae0f3cb170db4ae18897d6af8497c38,"out,in,1f,right-test,1f-right-1,1f-right-2,lef...",258
17044_ceb0005b0d1e75950118f92c16d9a619,17044,ceb0005b0d1e75950118f92c16d9a619,"out,in,b1,in,b1,in,b1,in,b1,in,b1,in,b1",261
17044_f6258edf9145d1c0404e6f3d7a27a29d,17044,f6258edf9145d1c0404e6f3d7a27a29d,"out,in,b1,out,in,b1,out,in,1f,out,in,b1,out,in...",311


In [73]:
print(trajs_full.shape)
print(trajs_full.loc[trajs_full.new_visit_count>=1].shape)
print(trajs_full.loc[trajs_full.new_visit_count>=2].shape)
print(trajs_full.loc[trajs_full.new_visit_count>=3].shape)  

## This means there are very few proper logs (only 50,233 of 1,876,809 moving patterns have new_visit_count >= 1).

(1876809, 4)
(50233, 4)
(13724, 4)
(9772, 4)


In [41]:
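# To print the intermediate steps, uncomment all the commented-out lines and rerun.

trajs = trajs_full.loc[trajs_full.new_visit_count>=1].copy()   # trajs is not defined before this line, so filter on trajs_full
trajs['revisit_intention'] = 0
revisit_interval_thres = 120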

for ids in trajs['device_id'].unique():
    dff = trajs.loc[trajs['device_id']==ids]   
    a = 0
    date = 16672
    prev_idx = ''
    for index, row in dff.iterrows():
        if a+1 == row['new_visit_count']:
            if date+revisit_interval_thres > row['date']:
#                 print('regular revisit: {0} days interval'.format(row['date']-date))
#                 print('previous index: ',prev_idx)
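                # DataFrame.set_value was removed in pandas 1.0; .loc does the same assignment.
                # Skip the first visit, which has no previous pattern (prev_idx is still '').
                if prev_idx != '':
                    trajs.loc[prev_idx, 'revisit_intention'] = 1
                ## Set the label only in this case.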
                
#             elif row['new_visit_count'] == 1:
#                 print('regular revisit: {0} days interval'.format(row['date']-date))
                
#             else:
#                 print('Irregular revisit: {0} days interval'.format(row['date']-date))

            prev_idx = index
                
#             print(row,'\n')
            a = row['new_visit_count']
            date = row['date']
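
The loop above iterates over every device and row, which can be slow. Here is a vectorized sketch of the same idea on made-up data: keep one row per visit day, look up each device's next visit date with `groupby().shift(-1)`, and set the label when the gap is below the threshold. The column names `visit_date` and `next_visit` are assumptions for illustration, and this sketch is not guaranteed to reproduce the loop's output on the real data.

```python
import pandas as pd

revisit_interval_thres = 120  # days, same value as in the loop above

# Hypothetical visits: one row per (device, day) that contains an 'in' log.
visits = pd.DataFrame({
    "device_id":  ["a", "a", "a", "b"],
    "visit_date": [100, 150, 400, 120],
}).sort_values(["device_id", "visit_date"])

# Date of the same device's next visit (NaN for its last visit).
visits["next_visit"] = visits.groupby("device_id")["visit_date"].shift(-1)

# Label a visit 1 when the next visit follows within the threshold.
gap = visits["next_visit"] - visits["visit_date"]
visits["revisit_intention"] = (gap < revisit_interval_thres).astype(int)

print(visits)
```

A device's last visit has no next visit, so its gap is NaN, the comparison is False, and the label stays 0.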

In [137]:
trajs_combined = pd.concat([trajs, df2], axis=1, join='inner')

In [140]:
trajs_combined.loc[trajs_combined['revisit_intention'] == 1]['traj'].value_counts()

out,in,b1       606
out,in,1f       233
out,in,b1,1f

In [143]:
trajs_combined.head(5)

Unnamed: 0,date,device_id,traj,new_visit_count,revisit_intention,num_logs,total_dwell_time,num_sp_100,prob_dwell_10,time_sp_100,std_sp_100,prob_deny
16673_028a1f4dbca00ed06814fdda60f1b599,16673.0,028a1f4dbca00ed06814fdda60f1b599,"out,in,b1,1f,b1-left-3,b1-left-2,1f-right-1,1f...",1.0,1.0,14,11092,11.0,0.846154,8329.0,870.565772,0.0
16673_0bc0852bb3b760c270585483cea24b4a,16673.0,0bc0852bb3b760c270585483cea24b4a,"out,in,b1,1f,b1-right",1.0,0.0,5,776,4.0,1.0,591.0,0.5,0.0
16673_0d4fd55bb363bf6f6f7f8b3342cd0467,16673.0,0d4fd55bb363bf6f6f7f8b3342cd0467,"out,in,b1,1f,in,b1,1f,in,b1,1f,b1-right,b1-lef...",1.0,1.0,19,76002,16.0,0.888889,53715.0,4072.029735,1.0
16673_17490e4c91a3ad7d1bdef7e61ea469c3,16673.0,17490e4c91a3ad7d1bdef7e61ea469c3,"out,in,1f,1f-left-1",1.0,0.0,4,453,,,,,0.0
16673_1de880af544e89c437bd624454615e36,16673.0,1de880af544e89c437bd624454615e36,"out,in,1f",1.0,0.0,3,240,,,,,0.0


In [162]:
### Move revisit_intention (the label to predict) to the last column.
cols = trajs_combined.columns.tolist()
newcols = cols[:4]+cols[5:]+cols[4:5]
trajs_combined = trajs_combined[newcols]

In [163]:
trajs_combined.head(5)

Unnamed: 0,date,device_id,traj,new_visit_count,num_logs,total_dwell_time,num_sp_100,prob_dwell_10,time_sp_100,std_sp_100,prob_deny,revisit_intention
16673_028a1f4dbca00ed06814fdda60f1b599,16673.0,028a1f4dbca00ed06814fdda60f1b599,"out,in,b1,1f,b1-left-3,b1-left-2,1f-right-1,1f...",1.0,14,11092,11.0,0.846154,8329.0,870.565772,0.0,1.0
16673_0bc0852bb3b760c270585483cea24b4a,16673.0,0bc0852bb3b760c270585483cea24b4a,"out,in,b1,1f,b1-right",1.0,5,776,4.0,1.0,591.0,0.5,0.0,0.0
16673_0d4fd55bb363bf6f6f7f8b3342cd0467,16673.0,0d4fd55bb363bf6f6f7f8b3342cd0467,"out,in,b1,1f,in,b1,1f,in,b1,1f,b1-right,b1-lef...",1.0,19,76002,16.0,0.888889,53715.0,4072.029735,1.0,1.0
16673_17490e4c91a3ad7d1bdef7e61ea469c3,16673.0,17490e4c91a3ad7d1bdef7e61ea469c3,"out,in,1f,1f-left-1",1.0,4,453,,,,,0.0,0.0
16673_1de880af544e89c437bd624454615e36,16673.0,1de880af544e89c437bd624454615e36,"out,in,1f",1.0,3,240,,,,,0.0,0.0


### Testing with XGBoost

In [173]:
import datetime
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
import xgboost as xgb
import random
import zipfile
import time
import shutil
from sklearn.metrics import log_loss
from sklearn.metrics import mean_squared_error
import json

random.seed(2016)
datadir = "../data/781/781.p"

In [164]:
def run_xgb(train, test, features, target, random_state=0):
    start_time = time.time()
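    # The binary label is fit as a regression target; newer XGBoost releases name this objective "reg:squarederror"
    objective = "reg:linear"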
    booster = "gbtree"
    eval_metric = ["auc", "rmse"]
    eta = 0.1
    max_depth = 3
    subsample = 0.7
    colsample_bytree = 0.7
    silent = 1

    print('XGBoost params. ETA: {}, MAX_DEPTH: {}, SUBSAMPLE: {}, COLSAMPLE_BY_TREE: {}'.format(eta, max_depth, subsample, colsample_bytree))
    params = {
        "objective": objective,
    #         "num_class": 2,
        "booster" : booster,
        "eval_metric": eval_metric,
        "eta": eta,
        "max_depth": max_depth,
        "subsample": subsample,
        "colsample_bytree": colsample_bytree,
        "silent": silent,
        "seed": random_state,
    }
    num_boost_round = 200
    early_stopping_rounds = 20
    test_size = 0.2

    X_train, X_valid = train_test_split(train, test_size=test_size, random_state=random_state)
    print('Length train:', len(X_train.index))
    print('Length valid:', len(X_valid.index))
    y_train = X_train[target]
    y_valid = X_valid[target]
    dtrain = xgb.DMatrix(X_train[features], y_train)
    dvalid = xgb.DMatrix(X_valid[features], y_valid)

    watchlist = [(dtrain, 'train'), (dvalid, 'eval')]
    gbm = xgb.train(params, dtrain, num_boost_round, evals=watchlist, early_stopping_rounds=early_stopping_rounds, verbose_eval=True)

    print("Validating...")
    check = gbm.predict(xgb.DMatrix(X_valid[features]), ntree_limit=gbm.best_iteration)

    score = mean_squared_error(y_valid.tolist(), check)

    print("Predict test set...")
    test_prediction = gbm.predict(xgb.DMatrix(test[features]), ntree_limit=gbm.best_iteration)

    training_time = round((time.time() - start_time)/60, 2)
    print('Training time: {} minutes'.format(training_time))

    print(gbm)

    # To save logs
    explog = {}
    explog['features'] = features
    explog['target'] = target
    explog['params'] = {}
    explog['params']['objective'] = objective
    explog['params']['booster'] = booster
    explog['params']['eval_metric'] = eval_metric
    explog['params']['eta'] = eta
    explog['params']['max_depth'] = max_depth
    explog['params']['subsample'] = subsample
    explog['params']['colsample_bytree'] = colsample_bytree
    explog['params']['silent'] = silent
    explog['params']['seed'] = random_state
    explog['params']['num_boost_round'] = num_boost_round
    explog['params']['early_stopping_rounds'] = early_stopping_rounds
    explog['params']['test_size'] = test_size
    explog['length_train']= len(X_train.index)
    explog['length_valid']= len(X_valid.index)
    # explog['gbm_best_iteration']= 
    explog['score'] = score
    explog['training_time'] = training_time




    return test_prediction.tolist(), score, explog

In [165]:
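def updateLog(explog, logPath):
    """Append one experiment log entry to the JSON list stored at logPath."""
    try:
        with open(logPath, 'r') as f:
            obob = json.load(f)
    except (IOError, ValueError):
        # No log file yet, or it is not valid JSON: start a new list
        obob = []

    obob.append(explog)

    with open(logPath, 'w') as f:
        json.dump(obob, f)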

In [174]:
trajs_combined = trajs_combined.fillna(0)
trajs_combined = trajs_combined.reindex(np.random.permutation(trajs_combined.index))

idx = int(len(trajs_combined.index)*9/10)
train = trajs_combined[:idx]
test = trajs_combined[idx:]
features = list(trajs_combined.columns)[3:-1]
target = 'revisit_intention'

print('Length of train: ', len(train))
print('Length of test: ', len(test))
print('Features [{}]: {}'.format(len(features), sorted(features)))

test_prediction, score, explog = run_xgb(train, test, features, target)
print('Score: ', score)

logPath = '../result/results.json'

explog['dataset']= datadir
explog['ts']= time.strftime('%Y-%m-%d %H:%M:%S')

updateLog(explog, logPath)

Length of train:  45209
Length of test:  5024
Features [8]: ['new_visit_count', 'num_logs', 'num_sp_100', 'prob_deny', 'prob_dwell_10', 'std_sp_100', 'time_sp_100', 'total_dwell_time']
XGBoost params. ETA: 0.1, MAX_DEPTH: 3, SUBSAMPLE: 0.7, COLSAMPLE_BY_TREE: 0.7
Length train: 36167
Length valid: 9042


Will train until eval error hasn't decreased in 20 rounds.
Multiple eval metrics have been passed: 'rmse' will be used for early stopping.

[0]	train-auc:0.948440	train-rmse:0.455263	eval-auc:0.943472	eval-rmse:0.455341
[1]	train-auc:0.953184	train-rmse:0.416022	eval-auc:0.947812	eval-rmse:0.416135
[2]	train-auc:0.956822	train-rmse:0.380169	eval-auc:0.951259	eval-rmse:0.380416
[3]	train-auc:0.957118	train-rmse:0.349273	eval-auc:0.951330	eval-rmse:0.349681
[4]	train-auc:0.961599	train-rmse:0.320764	eval-auc:0.955957	eval-rmse:0.321220
[5]	train-auc:0.961730	train-rmse:0.295706	eval-auc:0.956028	eval-rmse:0.296308
[6]	train-auc:0.961817	train-rmse:0.273754	eval-auc:0.956163	eval-rmse:0.274493
[7]	train-auc:0.961844	train-rmse:0.255379	eval-auc:0.956620	eval-rmse:0.256155
[8]	train-auc:0.961947	train-rmse:0.238682	eval-auc:0.956163	eval-rmse:0.239625
[9]	train-auc:0.962768	train-rmse:0.223666	eval-auc:0.957623	eval-rmse:0.224644
[10]	train-auc:0.962898	train-rmse:0.210612	eval-auc:0.95783

Validating...
Predict test set...
Training time: 0.02 minutes
<xgboost.core.Booster object at 0x16d6385c0>
Score:  0.0200542615548


[95]	train-auc:0.970112	train-rmse:0.136797	eval-auc:0.964221	eval-rmse:0.141717
[96]	train-auc:0.970112	train-rmse:0.136786	eval-auc:0.964220	eval-rmse:0.141722
Stopping. Best iteration:
[76]	train-auc:0.970098	train-rmse:0.137499	eval-auc:0.964368	eval-rmse:0.141565
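
`Score` is the mean squared error on the validation split, so it should be close to the square of the best eval-rmse in the log. A quick arithmetic check using the two numbers printed above:

```python
best_eval_rmse = 0.141565          # eval-rmse at the best iteration [76]
reported_score = 0.0200542615548   # Score printed by the notebook

squared = best_eval_rmse ** 2
difference = abs(squared - reported_score)

print(round(squared, 6), round(reported_score, 6), difference < 1e-4)
```

The two values agree to about four decimal places. The small remaining gap may come from `ntree_limit=gbm.best_iteration` predicting with one tree fewer than the best round; that explanation is an assumption, not something the log confirms.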



### Notes from data validity checks

In [67]:
### Among moving patterns with visit_count (number of proper store visits) >= 1, count those with a revisit within the threshold (revisit_interval_thres = 120 days)
trajs['revisit_intention'].value_counts()

0.0    46331
1.0     3903
Name: revisit_intention, dtype: int64

In [66]:
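### The counts do not decrease monotonically in new_visit_count (e.g. 179 patterns at 6 but 349 at 7); this turned out to be because logs unrelated to a revisit, such as out,out recorded after the n-th visit, are included.
print(trajs['new_visit_count'].value_counts().sort_index().head(10))

### The number of regular moving patterns with revisit_intention = 1 decreases monotonically.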
print(trajs.loc[(trajs['new_visit_count']==8) & (trajs['revisit_intention'] == 1)]['traj'].count())
print(trajs.loc[(trajs['new_visit_count']==9) & (trajs['revisit_intention'] == 1)]['traj'].count())

1.0     36509
2.0      3952
3.0      1115
4.0       608
5.0       527
6.0       179
7.0       349
8.0       125
9.0       218
10.0      133
Name: new_visit_count, dtype: int64
42
39


In [86]:
trajs_full['traj'].value_counts()

out                1597404
out,out             198370
out,out,out          34865
out,out,out,out

In [107]:
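### The gap between out and in can be large (revisit count was previously increased only when out and in appeared in that order; the code probably needs to be changed)
# trajs_full.loc[bool(re.search('in', trajs_full['traj']))]

### Build a separate df, trajs_in, from the rows of trajs_full whose traj contains 'in', to check which trajectories exist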
trajs_in = trajs_full.loc[trajs_full['traj'].str.contains('in')]
# trajs_in['traj'].value_counts()

In [115]:
print(trajs_full.index.size)
print(trajs_in.index.size)
print(trajs_full.index.difference(trajs_in.index).size)

1876809
19929
1856880


In [118]:
### Trajectories without 'in' among all trajectories (to check whether there is anything other than out-out-out)
trajs_notin = trajs_full.loc[trajs_full.index.difference(trajs_in.index)]

In [120]:
### (There are some patterns other than out-out-out, but very few, so only trajectories containing 'in' are treated as usable.)
trajs_notin['traj'].value_counts()

out            1597404
out,out         198370
out,out,out