# Getting Started with the Featured DataFrame

Telemetry data comes in two representations.  The first is the raw, session-based telemetry data.  The second, which this notebook describes, has one row per user-week with a column for each feature.

First, let's start with our standard imports:

In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import datetime
import dtutil.configs as dtc  # import the datatools config variables
from helper import get_name_and_company

Next, we need to load in the feature file.  This file has one row for each user-week and a column for each feature in addition to information describing the user-week.

In [10]:
df = pd.read_csv(dtc.paths.tlm_uwdata_file)  # this is the path to the features file
df.columns

Index(['Unnamed: 0', 'HB_aplac_b', 'HB_legacy_b', 'add_measurement_b',
       'add_mwo_elem_b', 'add_sim_doc_b', 'add_vss_elem_b', 'ana_doc_b',
       'ana_sim_n', 'annotation_b', 'antenna_b', 'bindkeys_b',
       'board_import_b', 'cell_stretchers_b', 'commands_n', 'company',
       'create_process_b', 'disp_name', 'doc_sets_b', 'drc_b', 'ele_created_n',
       'em_sim_b', 'ems_doc_b', 'ems_highcnt_b', 'ems_sim_n',
       'enable_disable_b', 'envelope_sim_b', 'extract_b',
       'floating_load_pull_b', 'freeze_traces_b', 'gen_poly_b', 'ifilter_b',
       'insert_window_b', 'job_rich_text_b', 'job_scheduler_b', 'layout_b',
       'layout_checks_b', 'lin_noise_b', 'lin_sim_n', 'load_pull_b',
       'load_pull_script_b', 'lvs_b', 'markers_b', 'measurements_n',
       'network_synthesis_b', 'new_HB_meas_b', 'no_sim_b', 'nonlin_sim_n',
       'optim_b', 'osc_b', 'output_eq_b', 'pcb_improved_clipping_b',
       'pcb_point_ports_b', 'pcb_shape_flow_b', 'pdk_b', 'phased_array_b',
       'proj

### Column Description

Feature columns end in '_n' or '_b': '_n' columns hold counts (the count of the operation for the user-week) and '_b' columns hold boolean flags (0 or 1).  In addition to these, we have the derived features of activity and scope.

Descriptive columns, which don't represent features, include week, user_id, user_type, company and disp_name.
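
As a minimal sketch of the naming convention, the columns can be split by suffix. The toy frame below is made up for illustration; only the column names are borrowed from the listing above.

```python
import pandas as pd

# Toy stand-in for the feature file; the values are invented.
toy = pd.DataFrame({
    'week': [803, 804],
    'user_id': [1, 2],
    'user_type': ['customer', 'academic'],
    'commands_n': [120, 45],
    'phased_array_b': [0, 1],
})

# Count features end in '_n', boolean flag features end in '_b'.
count_cols = [c for c in toy.columns if c.endswith('_n')]
flag_cols = [c for c in toy.columns if c.endswith('_b')]

# Everything else: descriptive columns (and any unsuffixed derived features).
other_cols = [c for c in toy.columns if c not in count_cols + flag_cols]

print(count_cols)
print(flag_cols)
print(other_cols)
```

On the real frame, columns without a suffix (such as the ones holding activity and scope) would land in the last list, so it is not purely descriptive.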

#### Week

The week is an integer encoding of the week: the last 2 digits are the zero-based week number (0-51) and the leading digits are (year - 2010), so 803 is the 4th week of 2018.
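
The encoding can be unpacked with integer division. This is a small sketch; `decode_week` and `encode_week` are hypothetical helper names, not part of the datatools code.

```python
def decode_week(week):
    """Split the integer week code into (calendar year, week number)."""
    year_offset, week_number = divmod(int(week), 100)
    return 2010 + year_offset, week_number


def encode_week(year, week_number):
    """Inverse of decode_week: build the integer week code."""
    return (year - 2010) * 100 + week_number


print(decode_week(803))       # (2018, 3): week number 3 is the 4th week
print(encode_week(2018, 3))   # 803
```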

#### User Type

The user type is determined by serial number and can be internal, academic, demo, customer, pirate or licensed.  Licensed represents non-AWR NI users.  The data also contains a small number of user-weeks typed as loan or unknown.

### Analyzing the Data

##### Filtering

One of the most common tasks will be filtering the data:

In [33]:
# just keep 2018 so we have 1 year of data
df2018 = df[(df.week >= 800) & (df.week < 900)]

# filter out internal users
external_df = df2018[df2018.user_type != 'internal']

The simplest way to filter a dataframe is to create a boolean vector and then subset the dataframe using square brackets.  In the user_type example we take the column user_type and compare it to 'internal'.  This performs the comparison on each row of the dataframe and returns a vector (actually a series) of the results.  When we pass this to the indexing function (the square brackets) it returns a new dataframe containing only the rows where the vector is True.

In the week example, we are actually creating two series, performing an and between them, and then passing that result to the indexer.  Note that the parentheses are required here: in Python, & binds more tightly than comparison operators such as >= and <, so without them the expression is not evaluated in the intended order.
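
A minimal sketch of the same filters on made-up data, with each mask given a name so the intermediate boolean series is visible. `Series.between` is shown as an alternative that avoids the parentheses issue; it is inclusive on both ends by default, hence the upper bound of 899.

```python
import pandas as pd

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    'week': [752, 800, 815, 900],
    'user_type': ['customer', 'internal', 'academic', 'customer'],
})

# Each comparison yields a boolean series; parentheses keep & from
# grabbing the operands before the comparisons are evaluated.
in_2018 = (toy.week >= 800) & (toy.week < 900)
not_internal = toy.user_type != 'internal'

# Masks combine with & and are then passed to the indexer.
result = toy[in_2018 & not_internal]
print(result)

# Equivalent range test for integer week codes.
in_2018_alt = toy.week.between(800, 899)
```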

##### Getting Unique Values

In [34]:
# Get all the user_types
df2018.user_type.unique()

array(['internal', 'customer', 'licensed', 'academic', 'pirate', 'demo',
       'loan', 'unknown'], dtype=object)

In [35]:
# or get them with counts (# user-weeks)
df2018.user_type.value_counts()

customer    23390
academic    22491
pirate      10945
demo         3104
internal     2730
licensed     1056
loan          283
unknown         4
Name: user_type, dtype: int64

In [36]:
user_weeks_by_user = external_df.user_id.value_counts()
len(user_weeks_by_user)

11320

In [37]:
# wow, that's a lot, how many have only 1 or 2 user-weeks
print('Number of casual users:', len(user_weeks_by_user[user_weeks_by_user < 3]))
print('Number of heavy users:', len(user_weeks_by_user[user_weeks_by_user > 26]))

Number of casual users: 5629
Number of heavy users: 358


In [41]:
for user_id, count in user_weeks_by_user[user_weeks_by_user > 50].items():  # .iteritems() was removed in pandas 2.0
    n, c = get_name_and_company(external_df, user_id)
    print('{:3d} - {}, {} ({})'.format(count, c, n, user_id))

 53 - Friends, nan (1399)
 53 - Werlatone, Inc. - Patterson, Mariama Barrie @ Werlatone (1023)
 53 - Qualcomm Technologies, Inc. - Headquarters, Qualcomm Technologies, Inc. - Headquarters (15926)
 53 - USER, Pirate No user name (963)
 53 - Qualcomm Technologies, Inc. - Headquarters, nan (11420)
 53 - Qorvo - Osaka, nan (6574)
 52 - Arizona State University - School of Earth and Space Exploration, HAMDI MANI (799)
 52 - Qualcomm Technologies, Inc. - Headquarters, nan (11408)
 51 - Akoustis, Inc., nan (1352)
 51 - National Instruments - Santa Rosa, Justin Majers @ NISR (350)
 51 - Qorvo - TQ Apopka, Aziz Alakan @ Qorvo FL (1048)
 51 - Qorvo - TQTX IDP Richardson, Scott Schafer @ Qorvo (363)
 51 - Qualcomm Technologies, Inc. - Headquarters, nan (11460)
 51 - Qualcomm Technologies, Inc. - Headquarters, nan (11301)
 51 - Qualcomm Technologies, Inc. - Headquarters, nan (11171)
 51 - Qorvo - TQ Apopka, Hailing Yue? at Qorvo FL (742)


Let's explain this a little more.  When we do a value_counts() we get back a series.  Think of this as a two column table where the first column is the user_id and the second column is the count.  When we compare this to a value as in:
```
user_weeks_by_user < 3
```
we get back a vector of boolean values which we can then use to filter the series, just as we did with the dataframe.
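
The same steps on a tiny made-up series, as a sketch of what each intermediate object looks like:

```python
import pandas as pd

# One entry per user-week; the ids are invented.
user_ids = pd.Series([10, 10, 10, 20, 30, 30], name='user_id')

# Index is the user_id, values are the number of user-weeks.
counts = user_ids.value_counts()

# Comparing the series to a scalar gives a boolean series with the same index.
mask = counts < 3

# Indexing with the mask keeps only the entries where it is True.
casual = counts[mask]
print(casual)
print('Number of casual users:', len(casual))
```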

### Tips and Tricks

Because some operations are pretty common, I've created functions such as get_name_and_company() to simplify the code for them.  For example, if we want to count the number of user-weeks in which each customer used the phased array feature, we could use:

In [17]:
# customer user-weeks in 2018
cust2018 = df[(df.user_type == 'customer') & (df.week >= 800) & (df.week < 900)]
# the flag is 0 or 1, so its sum is the number of user-weeks that used phased arrays
pharr_use = cust2018.groupby(['user_id']).phased_array_b.sum()

print('cnt  Company, User')
# keep users with more than one such week; .iteritems() was removed in pandas 2.0
for user_id, count in pharr_use[pharr_use > 1].items():
    n, c = get_name_and_company(cust2018, user_id)
    print('{:3d} - {}, {} ({})'.format(count, c, n, user_id))

cnt  Company, User
  3 - Syrlinks, Simon Mener @ Syrlinks (678)
  2 - SARAS Technology Limited, nan (11832)
  2 - Qorvo - TQTX IDP Richardson, nan (13209)
  2 - Oculus/Facebook, nan (16234)
  3 - Dynetics, nan (16994)
  2 - Konkuk University - Electronic Engineering, nan (17395)
  3 - Konkuk University - Electronic Engineering, nan (17854)
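
The group-and-sum step can be sketched on toy data. This omits get_name_and_company(), whose implementation is not shown in this notebook, and prints the bare user_id instead; all values are invented.

```python
import pandas as pd

# One row per user-week with a 0/1 flag, mirroring the '_b' convention.
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'week': [801, 802, 803, 801, 802, 801],
    'phased_array_b': [1, 1, 0, 0, 1, 0],
})

# Summing a 0/1 flag per user counts the user-weeks where it was set.
use = toy.groupby('user_id').phased_array_b.sum()

# Keep only users with more than one such week.
repeat_users = use[use > 1]

for user_id, count in repeat_users.items():
    print('{:3d} - user {}'.format(count, user_id))
```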


### Features

Here is how you get the description of all the feature columns.  To import the featuring code, we first need to add the directory containing it to the system path.

In [6]:
import sys
sys.path.append('\\src\\datatools\\ipy\\notebooks\\session_data\\dev')
from features import define_feature

In [7]:
define_feature.print_docs()

ele_created_n not in pct

 1 - ele_created_n 
     Defined as: Number of elements created
               : Counts the number of MWOElement and VSSElement with the new tag
commands_n not in pct

 2 - commands_n 
     Defined as: Number of commands executed
               : Sums the total count on all items in the Command category
no_sim_b not in pct

 3 - no_sim_b 
     Defined as: Contain no simulations
               : True if there are no items in the Simulate category
lin_sim_n not in pct

 4 - lin_sim_n 
     Defined as: Linear Simulation Count
               : Count of the LinCktSimAWR, Default Linear and APLAC Linear items
nonlin_sim_n not in pct

 5 - nonlin_sim_n 
     Defined as: Nonlinear Simulation Count
               : Count of all HB, AC, stability, transient (including HSPICE) simulations
HB_legacy_b not in pct

 6 - HB_legacy_b 
     Defined as: Uses legacy HB simulator
               : True if HB legacy or AC-HB legacy simulations have been run
new_HB_meas_b not in pct