## Preprocessing ITSM Data
This notebook prepares the data in a format suitable for machine learning. The dataset consists of a few numerical attributes and many categorical ones. The numerical attributes are discretized. Embeddings for the categorical values are developed in the same way as word embeddings in NLP (see https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html). Each categorical value is mapped to a unique integer, so the encoded data presented to the embedding layer is a sequence of integers, with each integer playing the role of a word. This notebook performs this mapping. It also encodes unknown values as an 'UNKNOWN' category.
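
To make the idea concrete, here is a minimal plain-Python sketch of how a row of integer-encoded categories could be looked up in an embedding table. The table size, the vector dimension, and the names `embedding_table` and `embed_row` are illustrative assumptions and do not appear elsewhere in this notebook; a real model would use a trainable layer such as `torch.nn.Embedding` instead of fixed random vectors.

```python
import random

random.seed(0)

NUM_EMBEDDINGS = 10   # hypothetical: one table row per integer id (0 .. 9)
EMBEDDING_DIM = 3     # hypothetical vector size

# One randomly initialised vector per integer id; a real layer would learn these.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(NUM_EMBEDDINGS)
]

def embed_row(encoded_row):
    """Return one vector per integer id in the encoded row."""
    return [embedding_table[i] for i in encoded_row]

# A record with three categorical attributes, already mapped to integers.
encoded_row = [2, 7, 1]   # 1 is the code this notebook uses for 'UNKNOWN'
vectors = embed_row(encoded_row)
print(len(vectors), len(vectors[0]))
```

Each attribute value contributes one vector, so a record with three categorical attributes becomes three vectors of length `EMBEDDING_DIM`.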

In [None]:
import pandas as pd
fp = "./data/incident_event_log.csv"
df = pd.read_csv(fp)
# Binary flag: 1 if the incident was reassigned at least once, otherwise 0
df['reassigned'] = df['reassignment_count'].apply(lambda x: 0 if x == 0 else 1)

## Discretize the numerical attributes

In [None]:
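numeric = ['sys_mod_count', 'reopen_count']
dfn = df[numeric].copy()  # copy so the new columns are not written to a slice of df
dcols = []
for col in numeric:
    dlabel = 'D_' + col
    labels = [dlabel + '_' + str(c) for c in range(5)]
    dcols.append(dlabel)
    # Rank first so that ties are broken and qcut can form five equal-sized bins
    dfn[dlabel] = pd.qcut(dfn[col].rank(method='first'), 5, labels=labels, duplicates='drop')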
    

In [None]:
dfn.head()
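
The rank-then-bin step above can be illustrated without pandas. The sketch below is an approximation of the same idea, not a reimplementation of `pd.qcut`: values are ranked with ties broken by order of appearance, and the ranks are split into equal-sized bins. The helper names `rank_first` and `quantile_bins` and the toy counts are hypothetical, and bin boundaries may differ slightly from what `pd.qcut` produces.

```python
def rank_first(values):
    """Rank values 1..n; ties are broken by order of appearance."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # stable sort
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def quantile_bins(values, n_bins=5, prefix="D_col"):
    """Assign each value to one of n_bins roughly equal-sized bins."""
    n = len(values)
    ranks = rank_first(values)
    return [f"{prefix}_{min((r - 1) * n_bins // n, n_bins - 1)}" for r in ranks]

# Toy, heavily tied counts (made up for illustration).
counts = [0, 0, 0, 0, 1, 1, 2, 3, 5, 8]
labels = quantile_bins(counts, prefix="D_reopen_count")
print(labels)
```

Because the ranks are unique, identical raw values can land in different bins. That is the price of guaranteeing five non-empty bins when a column is dominated by a single value.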

## Isolate the attributes used for the analysis
1. Remove the timestamp attributes and the other attributes listed in `remove`.
2. Remove the numeric attributes. The discretized versions of these attributes are added subsequently.

In [None]:
attributes = df.columns.tolist()
remove = ['made_sla', 'opened_at', 'resolved_at', 'sys_created_at', 'caller_id', 'closed_at',
          'notify', 'sys_updated_by', 'sys_created_by', 'number', 'sys_updated_at', 'reassigned']
exclude = remove + numeric
# Preserve the original column order so the integer coding below is reproducible across runs
keep = [a for a in attributes if a not in exclude]
keep

In [None]:
df_cat_vars = df[keep]
df_cat_vars = df_cat_vars.replace(to_replace = '?', value = 'UNKNOWN')
df_cat_vars =  pd.concat([df_cat_vars, dfn[dcols]], axis = 1)
df['made_sla'] = df['made_sla'].map({True: 1, False: 0})

df = df.reset_index()

In [None]:
df['reassigned'].value_counts()

In [None]:
cols = df_cat_vars.columns.tolist()
vocab_size = 0
for c in cols:
    print("Num unique vals for category " + str(c) + " = " + str(df_cat_vars[c].nunique()))
    vocab_size += df_cat_vars[c].nunique()
print("Vocab size: %s" % vocab_size)

## Recode the categorical values to integers

In [None]:
UNKNOWN_VAL = 1  # integer reserved for the 'UNKNOWN' category in every column
cat_cols = df_cat_vars.columns.tolist()
cat_int_map = {col: dict() for col in cat_cols}
int_index = 2  # next free integer; a single counter is shared by all columns
for c in cat_cols:
    unique_col_values = df_cat_vars[c].unique().tolist()
    col_int_map = cat_int_map[c]
    for uv in unique_col_values:
        if uv == 'UNKNOWN':
            col_int_map[uv] = UNKNOWN_VAL
        else:
            col_int_map[uv] = int_index
            int_index += 1
    # Replace the column's values with their integer codes
    df_cat_vars[c] = df_cat_vars[c].map(col_int_map)

In [None]:
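# Flatten the per-column maps into a single value -> integer map.
# Note: a value that occurs in more than one column keeps only the integer assigned last.
combined_cat_int_map = dict()
for col in cat_int_map.keys():
    for cat_val, int_map in cat_int_map[col].items():
        combined_cat_int_map[cat_val] = int_map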
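
The mapping built above only covers values present in this dataset. As a hedged sketch of how such a mapping might be applied to a record seen later, the example below falls back to the `UNKNOWN` code for any value that has no entry. The function `encode_record`, the toy map, and its column names and values are hypothetical and are not taken from the incident log.

```python
UNKNOWN_VAL = 1  # same reserved integer as in the notebook

# Toy per-column mapping in the same shape as cat_int_map (values are made up).
toy_cat_int_map = {
    "priority": {"UNKNOWN": UNKNOWN_VAL, "low": 2, "high": 3},
    "category": {"UNKNOWN": UNKNOWN_VAL, "network": 4, "email": 5},
}

def encode_record(record, cat_int_map, unknown_val=UNKNOWN_VAL):
    """Map each column's value to its integer; unseen values get unknown_val."""
    return {
        col: cat_int_map[col].get(value, unknown_val)
        for col, value in record.items()
    }

new_record = {"priority": "high", "category": "printer"}  # 'printer' is unseen
encoded = encode_record(new_record, toy_cat_int_map)
print(encoded)
```

Looking values up per column, rather than in one flattened map, avoids ambiguity when the same value appears in several columns.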
    

## Write preprocessed raw data to disk

In [None]:
fp_cat_int_map = "./data/category_to_integer_map.csv"
df_map = pd.DataFrame(combined_cat_int_map, index = [0])
df_map = df_map.T
df_map = df_map.reset_index()
df_map.columns = ["cat_value", "assigned_integer"]
df_map.to_csv(fp_cat_int_map, index = False)
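
A downstream step would typically need to read this map back. The following is a minimal sketch using only the standard library, assuming the two-column layout written above (`cat_value`, `assigned_integer`). The function `load_category_map` is hypothetical, and an in-memory string with made-up rows stands in for the real file.

```python
import csv
import io

# Stand-in for the file written above; the two data rows are made-up examples.
csv_text = "cat_value,assigned_integer\nUNKNOWN,1\nPhone,2\n"

def load_category_map(handle):
    """Read cat_value -> assigned_integer pairs from an open CSV file."""
    reader = csv.DictReader(handle)
    return {row["cat_value"]: int(row["assigned_integer"]) for row in reader}

category_map = load_category_map(io.StringIO(csv_text))
print(category_map)
```

Keys come back as strings, so values that were booleans or numbers in the original columns would need to be converted before lookup.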

In [None]:
add_to_cat_vars = ['number','sys_updated_at', 'reassigned'] 
df = pd.concat([df[add_to_cat_vars], df_cat_vars], axis = 1)

In [None]:
df['sys_updated_at'] = pd.to_datetime(df['sys_updated_at']) 

In [None]:
df['sys_updated_at'].dtype

In [None]:
df = df.sort_values(by = ['number', 'sys_updated_at'])

In [None]:
fp = './data/pp_batch_incident_event_log.csv'
df.to_csv(fp, index = False)

In [None]:
df['reassigned'].value_counts()

In [None]:
int_index

## Generate data for learning
The data used for learning summarizes the raw data by incident: the event records of each incident are grouped, and the most recently updated record is kept. A sample of the resulting data is displayed with `df_pp.head()`.

In [None]:
# Keep, for each incident, the row with the latest sys_updated_at
dfgb = df.groupby(by=['number'])
df_pp = df.loc[dfgb.sys_updated_at.idxmax()]
df_pp = df_pp.reset_index(drop=True)  # discard the old row labels
fprp = "pp_recoded_incident_event_log.csv"
df_pp.to_csv(fprp, index = False)
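
The per-incident summary can be pictured as "keep the latest event for each incident number". The sketch below shows that logic in plain Python on a toy event log. The function `latest_per_incident`, the incident numbers, and the timestamps are made up for illustration and are not drawn from the dataset.

```python
from datetime import datetime

# Toy event log: (incident number, update time, encoded attribute); values are made up.
events = [
    ("INC001", datetime(2016, 2, 29, 1, 0), 12),
    ("INC001", datetime(2016, 2, 29, 9, 30), 15),
    ("INC002", datetime(2016, 3, 1, 8, 0), 7),
    ("INC001", datetime(2016, 2, 29, 4, 15), 13),
]

def latest_per_incident(rows):
    """Keep, for each incident number, the row with the greatest update time."""
    latest = {}
    for row in rows:
        number, updated_at, _ = row
        if number not in latest or updated_at > latest[number][1]:
            latest[number] = row
    return [latest[number] for number in sorted(latest)]

summary = latest_per_incident(events)
for row in summary:
    print(row)
```

The result has one row per incident, carrying the attribute values from that incident's final update.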

In [None]:
df_pp['reassigned'].value_counts()

In [None]:
df_pp.head()

In [None]:
vocab_size