<a href="https://colab.research.google.com/github/arangoml/networkx-adapter/blob/master/examples/ITSM_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install Required Libraries

In [None]:
%%capture
!git clone -b 0.0.0.2.5.3 https://github.com/arangoml/networkx-adapter.git
!rsync -av networkx-adapter/examples/ ./ --exclude=.git
!pip3 install adbnx-adapter==0.0.0.2.5.3
!pip3 install networkx
!pip3 install matplotlib
!pip3 install pyarango
!pip3 install python-arango

## Data Characteristics

The data is an event log extracted from the audit system of a __ServiceNow__ platform, an enterprise service help desk application. It is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Incident+management+process+enriched+event+log); see the link for more details. This notebook captures the salient aspects of an exploratory analysis of this dataset.

## Read the data

In [None]:
import pandas as pd
fp = "data/incident_event_log.csv"
df = pd.read_csv(fp)

## What are the main characteristics?
1. What does a sample of the dataset look like?
2. How many incidents are reported in this dataset?

In [None]:
df.head()

In [None]:
df['number'].nunique()

## List the data types of the various attributes

In [None]:
df.dtypes

## Convert the $\texttt{sys_updated_at}$ attribute to be a timestamp

In [None]:
df['sys_updated_at'] = pd.to_datetime(df['sys_updated_at'])
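One thing worth checking about this conversion: a string such as `01/02/2016` can be read month-first or day-first, and passing an explicit `format` removes the ambiguity. The sketch below uses made-up timestamps (not rows from the dataset) to show the difference. Whether the log is actually stored day-first is an assumption to verify against `df.head()`.

```python
import pandas as pd

# Hypothetical sample values, not taken from the dataset.
raw = pd.Series(["01/02/2016 10:00", "03/02/2016 09:30"])

# The same strings parsed under two different conventions.
month_first = pd.to_datetime(raw, format="%m/%d/%Y %H:%M")
day_first = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")

print(month_first[0], "vs", day_first[0])
```

If the two readings disagree on the real data, the ordering of events within an incident can change, which matters for any step that picks the latest update.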

## Machine Learning Task for this Dataset 

The contributors of this dataset used it to predict the time to resolution of a ticket. In this work, the data is used for a classification task. The model will be a [graph convolutional network for relational data (GCN)](https://arxiv.org/abs/1703.06103), which we will use to predict a property of a particular node. Which property would be useful to predict? What are the characteristics of this property in the data? The cells below explore these questions.

### Explore candidate list of tags
Note: For the experiment, we will pick a tag that is fairly evenly distributed in the data. This avoids the imbalanced class label problem.

In [None]:
# Candidate label columns to inspect for class balance
dfcc = df[['made_sla', 'urgency', 'impact', 'reassignment_count']]
for c in dfcc.columns:
    # Print the count of each level of the column
    print(dfcc[c].value_counts())

A review of the level counts of the categorical variables in this dataset suggests that $\texttt{made_sla}$ and $\texttt{urgency}$ are both highly imbalanced. The minority levels are almost anomalies. The $\texttt{reassignment_count}$ attribute seems promising. We can derive a new attribute $\texttt{reassigned}$ that captures whether the ticket has been reassigned, i.e., whether it was assigned to someone else after the initial assignment. Such an attribute captures inefficiencies in triaging the ticket and is a useful indicator for an organization to track. A $0$ for this attribute indicates that there was no reassignment and a $1$ indicates that there was a reassignment. This attribute has an almost even spread of $0$ and $1$ in the data. The cells below create this attribute.
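One way to put a number on "imbalanced" versus "evenly spread" is the relative frequency of the least common level. The sketch below is a minimal illustration on made-up labels; the helper name `minority_share` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def minority_share(series: pd.Series) -> float:
    """Return the relative frequency of the least common level."""
    return float(series.value_counts(normalize=True).min())

# Hypothetical toy labels, not taken from the dataset.
toy = pd.DataFrame({
    "made_sla": [True] * 9 + [False],  # 9 to 1 split
    "reassigned": [0, 1] * 5,          # even split
})

shares = {col: minority_share(toy[col]) for col in toy.columns}
print(shares)
```

For a binary label, a share close to 0.5 indicates a balanced attribute, while a share close to 0 indicates that the minority level is rare.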

## Feature Creation (reassigned):
It looks like tracking ticket reassignment can create a variable that is somewhat evenly distributed in the data. About half the tickets have the correct assignment at first. About half are reassigned to various degrees.

In [None]:
# 1 if the ticket was reassigned at least once, 0 otherwise
df['reassigned'] = (df['reassignment_count'] != 0).astype(int)
df['reassigned'].value_counts()

In [None]:
# Keep only the most recent event (latest sys_updated_at) for each incident
dfpp = df.loc[df.groupby(by=['number']).sys_updated_at.idxmax()]
# Discard the old row labels and the incident identifier column
dfpp = dfpp.reset_index(drop=True).drop(columns=['number'])
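The cell above reduces the event log to one row per incident by keeping the most recently updated event. The sketch below shows the same `groupby` and `idxmax` pattern on a small made-up log; the incident numbers and timestamps are hypothetical.

```python
import pandas as pd

# Hypothetical event log: two incidents with several updates each.
events = pd.DataFrame({
    "number": ["INC1", "INC1", "INC2", "INC2", "INC2"],
    "sys_updated_at": pd.to_datetime([
        "2016-03-01 08:00", "2016-03-02 09:00",
        "2016-03-01 10:00", "2016-03-03 11:00", "2016-03-02 12:00",
    ]),
    "reassignment_count": [0, 1, 0, 2, 1],
})

# idxmax returns, per incident, the row label of the latest update.
latest_idx = events.groupby("number")["sys_updated_at"].idxmax()

# Select those rows and renumber them from zero.
latest = events.loc[latest_idx].reset_index(drop=True)
print(latest)
```

Note that the latest update for `INC2` is not its last row in the log, so selecting by timestamp rather than by position is what makes this robust to unsorted events.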

Now that we have characterized the data and identified the machine learning task to be performed, the next step is to transform the data into a form amenable to machine learning.