# Why is there missing start/tlm data?

This is an exploration of patterns in the missing cdb_starts and tlm_session data. As the data set to work with, we will use a 7-day period (to eliminate the weekly cyclicality caused by weekends).

In [16]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xlrd
import sys
import pytz
import datetime
pacific = pytz.timezone('US/Pacific')

sys.path.append('/Users/dane/src/datatools')

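# attach the pytz zone with localize(); passing it as tzinfo would use the zone's LMT offset
end_time = pacific.localize(datetime.datetime(2016, 12, 17, 0, 0, 0, 0))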
start_time = end_time - datetime.timedelta(days=7)
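
One detail worth checking when building timezone-aware bounds with pytz: attaching a pytz zone through `tzinfo` generally picks up the zone's earliest recorded offset (local mean time) rather than standard time, whereas `localize` resolves the offset in effect on the given date. A minimal sketch comparing the two for the midnight used here (the variable names `midnight`, `attached` and `localized` are illustrative only):

```python
import datetime
import pytz

pacific = pytz.timezone('US/Pacific')
midnight = datetime.datetime(2016, 12, 17)

# Attaching the zone via tzinfo does not resolve the offset for this date
attached = midnight.replace(tzinfo=pacific)

# localize() looks up the offset in effect on this date (PST in December)
localized = pacific.localize(midnight)

print(attached.utcoffset())
print(localized.utcoffset())
```

If the two offsets differ, window bounds built the first way are shifted by that difference, which can move sessions near midnight in or out of the week.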

## Clean up the data


### Session

The session data already contains the corresponding start data, since that is imported when the session is created. We can therefore trim the session data to a specified time range without worrying about losing the corresponding starts.

In [21]:
sdf = pd.read_pickle('./session.df')

# clean up the data
# Convert dates to pacific time (to match starts)
sdf['created_pst'] = sdf.created_time.apply(lambda a: a.astimezone(pacific))

# remove features
sdf.drop(['features', 'project_id', 'disp_name', 'user_name'], axis=1, inplace=True)

# set the index to be the session id
sdf = sdf.set_index('sess_id')
sdf.head()

# limit to desired time period
sess_df = sdf[(start_time <= sdf.created_pst) & (sdf.created_pst <= end_time)]

print('Total sessions = {}, Sessions in desired week = {}'.format(len(sdf), len(sess_df)))

Total sessions = 39049, Sessions in desired week = 3279
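
The timezone conversion and window filter above can also be written with vectorized pandas operations. This is a sketch on a toy frame, assuming `created_time` is a timezone-aware (UTC) datetime column as the `+00:00` values suggest; `toy`, `win_start`, `win_end` and `in_window` are hypothetical names:

```python
import pandas as pd

# Toy stand-in for sdf; assumes created_time is tz-aware UTC
toy = pd.DataFrame({
    'sess_id': [48855, 50649],
    'created_time': pd.to_datetime(
        ['2016-12-13 05:00:48', '2016-12-16 00:59:20'], utc=True),
})

# Vectorized counterpart of apply(lambda a: a.astimezone(pacific))
toy['created_pst'] = toy.created_time.dt.tz_convert('US/Pacific')

# Window bounds as tz-aware Timestamps
win_end = pd.Timestamp('2016-12-17', tz='US/Pacific')
win_start = win_end - pd.Timedelta(days=7)

# between() is inclusive on both ends by default, like the <= mask above
in_window = toy[toy.created_pst.between(win_start, win_end)]
print(in_window)
```

On a frame of this size the difference is negligible, but `dt.tz_convert` avoids a Python-level call per row.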


In [23]:
sess_df.head(2)

| sess_id | build_number | guid | instid | sess_user | company | serial_num | created_time | has_commands | user_type | user_id | custid | runtime | state | start_user | proj_name | created_pst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48855 | 8275 | 1436D665-0B94-4F27-BC90-E2AB9EDD2875_1 | 1436D665-0B94-4F27-BC90-E2AB9EDD2875 | Cust# 8905 | Qorvo - FL | 52989 | 2016-12-13 05:00:48+00:00 | True | customer | 334 | 8905 | 5390 | D | pzayas | | 2016-12-12 21:00:48-08:00 |
| 50649 | 8289 | 8D979B81-D596-49C9-B300-A8B31552B515_8 | 8D979B81-D596-49C9-B300-A8B31552B515 | Cust# 8905 | Qorvo - FL | 52989 | 2016-12-16 00:59:20+00:00 | True | customer | 309 | 8905 | 0 | C | stanuz | Cust# 8905:50653 | 2016-12-15 16:59:20-08:00 |


### SessionData


In [27]:
sddf = pd.read_pickle('./sessiondata.df')

# set the index to database id
sddf = sddf.set_index('id')

# the per-type count columns (newcnt, opncnt) are not needed
sddf.drop(['newcnt', 'opncnt'], axis=1, inplace=True)

# only keep session data for the sessions in the week of interest
sessdata_df = sddf[sddf.session_id.isin(sess_df.index)]

print('Total sessiondata = {}, Sessiondata in desired week = {}'.format(len(sddf), len(sessdata_df)))

Total sessiondata = 581063, Sessiondata in desired week = 51342
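
Since the goal is to find missing data, one possible follow-up check is which sessions in the window have no sessiondata rows at all. A minimal sketch on toy frames: the column names mirror those used above, the values are invented for illustration, and `sess_toy`, `sessdata_toy`, `counts` and `missing` are hypothetical names:

```python
import pandas as pd

# Toy stand-ins for sess_df and sessdata_df
sess_toy = pd.DataFrame({'sess_id': [1, 2, 3]}).set_index('sess_id')
sessdata_toy = pd.DataFrame({
    'id': [10, 11, 12],
    'session_id': [1, 1, 3],
}).set_index('id')

# Number of sessiondata rows per session
rows_per_session = sessdata_toy.groupby('session_id').size()

# Align to every session in the window; sessions with no rows get 0
counts = rows_per_session.reindex(sess_toy.index, fill_value=0)

# Session ids that have no sessiondata at all
missing = counts[counts == 0].index.tolist()
print(missing)
```

Reindexing against the session index is what surfaces the empty sessions; a plain `groupby` only reports sessions that have at least one row.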


### CDB Starts

When looking at starts to find missing sessions, we only process instances that ended in the week of interest, since a terminating instance should always produce at least one session.

In [44]:
start_df = pd.read_pickle('./cdb_starts.df')

# set the index to database id
start_df = start_df.set_index('start_id')

# instid arrives as the string form of a bytes value (e.g. "b'...'"); strip the leading b' and trailing '
start_df['instid'] = start_df.instid.apply(lambda x: x[2:-1])

# prune to the right data range
start_df = start_df[(start_time <= start_df.end_time) & (start_df.end_time <= end_time)]

print('There are {} starts'.format(len(start_df)))
start_df.head(2)

There are 2858 starts


| start_id | build_number | serial_num | custid | user_name | machine_name | end_time | country | errlog | instid | runtime |
|---|---|---|---|---|---|---|---|---|---|---|
| 16431577 | 8289 | 52989 | 8905.0 | marichardson | APKMRICHARDSOND | 2016-12-10 10:01:38.777000 | US | 160 | 4AF21BB4-79F4-481B-B64C-0EDF20959FAF | 12621 |
| 16431589 | 8289 | 90063341 | | rahul () | RAHUL | 2016-12-10 08:00:53.403000 | IN | 160 | 6DE6CA79-25F9-46AF-9EC6-0725066D2228 | 4255 |
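
If every instance that ended in the week should have at least one session, then starts with no matching session are the candidates for missing data. A minimal sketch of that check on toy frames, assuming `instid` is the key shared by the starts and session tables (both show an `instid` column, but using it as the join key is an assumption here); the frame names and values are invented:

```python
import pandas as pd

# Toy stand-ins for start_df and sess_df
starts_toy = pd.DataFrame({
    'start_id': [101, 102, 103],
    'instid': ['AAA', 'BBB', 'CCC'],
}).set_index('start_id')
sess_toy = pd.DataFrame({
    'sess_id': [1, 2],
    'instid': ['AAA', 'CCC'],
}).set_index('sess_id')

# Starts whose instance id never appears in the session data
no_session = starts_toy[~starts_toy.instid.isin(sess_toy.instid)]
print(no_session)
```

This only works if `instid` has been cleaned to the same form in both frames, which is why the stripping step above matters.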


In [40]:
x = start_df.loc[16425783]
y = x['instid']
print('{}, {}'.format(y, y[2:-1]))

b'2F9A760E-FA81-4253-BC16-31C83F79DE8F', 2F9A760E-FA81-4253-BC16-31C83F79DE8F
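
Slicing with `[2:-1]` assumes every value still carries the `b'...'` wrapper, so applying it to an already-cleaned column would cut off real characters. A more defensive variant, sketched here with a hypothetical helper `clean_instid`, parses the wrapper only when it is present:

```python
import ast

def clean_instid(value):
    """Return the plain instance id from bytes, a "b'...'" string, or a clean string."""
    if isinstance(value, bytes):
        return value.decode('ascii')
    if isinstance(value, str) and value.startswith(("b'", 'b"')):
        # Parse the bytes literal instead of slicing by position
        return ast.literal_eval(value).decode('ascii')
    # Already clean (or not a string): leave unchanged
    return value

raw = "b'2F9A760E-FA81-4253-BC16-31C83F79DE8F'"
print(clean_instid(raw))
print(clean_instid(clean_instid(raw)))
```

It could be applied with `start_df.instid.apply(clean_instid)`, and re-running the cell would then leave the ids unchanged.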
