# Identifying 'no-action' tickets in order to size support team correctly

Many of the support tickets raised do not need any action by the support team. Identifying such tickets is important to understand the workload and to size the team accordingly. Depending on the size of the firm, the ticket repositories could contain hundreds of thousands of tickets. Hence, an automated approach for identifying no-action tickets is required.

## Approach: 
Ticket repositories contain structured (like "Ticket priority") and unstructured data (like "Ticket description"). By extracting relevant information from both types of data, a more useful dataset could be built for downstream machine learning algorithms.

### Unstructured data: 
"Ticket descriptions" can contain important indicators. To convert such unstructured text to numeric form, Word2Vec-style word embeddings are used. In essence, Word2Vec learns contextual relationships between words and represents them in a high-dimensional space, so that words and short phrases with similar "meaning" tend to lie close together. In this example, we convert the unstructured "Short description" column into a 300-dimensional numeric representation.

### Word2Vec
Word2Vec represents words and the relationships between them in a multi-dimensional space.
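
To make the idea of "similar meaning, nearby vectors" concrete, the sketch below compares toy vectors using cosine similarity. The three-dimensional vectors and their values are invented for illustration, and `cosine_similarity` is a hypothetical helper; real embeddings, such as the 300-dimensional one used later, are learned from a large corpus.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" (values invented for this example)
toy_vectors = {
    "password": [0.9, 0.1, 0.0],
    "reset":    [0.8, 0.2, 0.1],
    "cable":    [0.0, 0.2, 0.9],
}

sim_related = cosine_similarity(toy_vectors["password"], toy_vectors["reset"])
sim_unrelated = cosine_similarity(toy_vectors["password"], toy_vectors["cable"])
print(round(sim_related, 3), round(sim_unrelated, 3))
```

With these made-up values, "password" and "reset" point in nearly the same direction while "password" and "cable" are almost orthogonal, which is the kind of structure a clustering algorithm can exploit.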

### Machine Learning: 
The dataset is unlabeled, so K-means clustering (an unsupervised method) is applied to the concatenation of the structured data and the 300-dimensional representation of the unstructured data.

In [1]:
#%%Loading required libraries
import os
import pandas as pd  
import numpy as np

import re

from bs4 import BeautifulSoup

import nltk.data
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords 

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
import joblib  # formerly sklearn.externals.joblib, which was removed in scikit-learn 0.23
from sklearn.model_selection import cross_val_score

import logging

import gensim
from gensim.models import word2vec
from gensim.models.keyedvectors import KeyedVectors

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

from IPython.display import display

In [2]:
#%%workaround for change in Pandas 0.18 to 0.21.0
import sys
import pandas.core.indexes
sys.modules['pandas.indexes'] = pandas.core.indexes

import pandas.core.base, pandas.core.indexes.frozen
setattr(sys.modules['pandas.core.base'],'FrozenNDArray', pandas.core.indexes.frozen.FrozenNDArray)


In [3]:
#%%Loading data
os.chdir('/home/bala/Documents/DSA0001')
# Read the CSV directly; pandas opens and closes the file itself
trainRawDSA000102 = pd.read_csv("InputOutputData, edited.csv", encoding='latin-1')

We explore the ticket dataset with standard summary statistics. Transformations are performed in a later step.

In [4]:
display(trainRawDSA000102.shape)
display(trainRawDSA000102.head())
display(trainRawDSA000102.describe())

(99, 10)

| | Number | Short description | Created | State | Priority | Assignment Group | Resolved | Month | Resolution time, time | Resolution time, int |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Network - Password Reset | 2017-03-20 23:59:34 | Closed | 4 - Low | Assignment Group Team 1 | 2017-03-21 00:19:48 | Mar | 00:20:14 | 0.0141 |
| 1 | 1002 | Delay in start of PF.ORACLE_RETAIL_BATCH in Pr... | 2017-03-20 23:28:43 | Closed | 1 - Critical | Assignment Group Team 2 | 2017-04-03 17:02:29 | Mar | 17:33:46 | 13.7318 |
| 2 | 1003 | Open Manifest Report - sev 2 7:15PM | 2017-03-20 23:04:58 | Closed | 2 - High | Assignment Group Team 3 | 2017-03-22 21:43:44 | Mar | 22:38:46 | 1.9436 |
| 3 | 1004 | Hang Up / Wrong Number | 2017-03-20 22:56:36 | Closed | 4 - Low | Assignment Group Team 4 | 2017-03-20 22:56:55 | Mar | 00:00:19 | 0.0002 |
| 4 | 1005 | User id is not created properly | 2017-03-20 22:30:08 | Closed | 3 - Moderate | Assignment Group Team 5 | 2017-03-21 08:13:13 | Mar | 09:43:05 | 0.4049 |


| | Number | Resolution time, int |
|---|---|---|
| count | 99.0 | 99.0 |
| mean | 1050.0 | 0.697442 |
| std | 28.722813 | 2.388784 |
| min | 1001.0 | 0.0002 |
| 25% | 1025.5 | 0.0005 |
| 50% | 1050.0 | 0.001 |
| 75% | 1074.5 | 0.0687 |
| max | 1099.0 | 13.7318 |


Given that there are only 99 records in this dataset, it is not viable to train an adequate word embedding on it. Instead, we load a published GloVe word embedding with a vocabulary of about 400,000 words, trained on the 2014 Wikipedia dump and Gigaword 5 (news articles from the NY Times, Associated Press, etc.).

In [5]:
#loading the word embedding
WordEmbGlove6B300d = gensim.models.KeyedVectors.load_word2vec_format('/home/bala/Documents/glove.6B/glove.6B.300d.bin')

#code used for converting the embedding from GloVe format to Word2Vec format
#gensim.scripts.glove2word2vec.glove2word2vec('/home/bala/Documents/glove.6B/glove.6B.300d.txt', '/home/bala/Documents/glove.6B/1glove.6B.300d.bin')
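
As far as the text file layout goes, the main difference between the two formats is that the Word2Vec text format starts with a header line giving the vocabulary size and the vector dimension, which GloVe files lack. The sketch below illustrates that conversion on a tiny made-up file; `glove_to_word2vec_text` is a hypothetical helper written for this example, not the gensim function referenced in the comment above.

```python
import os
import tempfile

def glove_to_word2vec_text(glove_path, word2vec_path):
    """Copy a GloVe text file, prepending the "<vocab_size> <dimensions>"
    header line used by the word2vec text format.
    Returns (vocab_size, dimensions)."""
    # First pass: count the lines and read the dimension from the first line.
    with open(glove_path, "r", encoding="utf-8") as src:
        first_line = src.readline()
        # Each GloVe line is: word value_1 value_2 ... value_d
        dimensions = len(first_line.split()) - 1
        vocab_size = 1 + sum(1 for _ in src)
    # Second pass: write the header, then stream the vectors unchanged.
    with open(glove_path, "r", encoding="utf-8") as src, \
         open(word2vec_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {dimensions}\n")
        for line in src:
            dst.write(line)
    return vocab_size, dimensions

# Tiny made-up GloVe-style file, for demonstration only
tmp_dir = tempfile.mkdtemp()
glove_file = os.path.join(tmp_dir, "toy_glove.txt")
w2v_file = os.path.join(tmp_dir, "toy_word2vec.txt")
with open(glove_file, "w", encoding="utf-8") as f:
    f.write("password 0.9 0.1 0.0\nreset 0.8 0.2 0.1\n")

shape = glove_to_word2vec_text(glove_file, w2v_file)
with open(w2v_file, encoding="utf-8") as f:
    header = f.readline().strip()
print(shape, header)
```

A file produced this way is plain text regardless of its extension, so it would be read with the default `binary=False` setting of `load_word2vec_format`.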

The code comments below explain key details of code chunks.

In [6]:
#%%Creating training data by averaging vectors for the words in the "Short description" column, by ticket

def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given short description
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,), dtype="float32")
    #
    nwords = 0
    #
    # Loop over each word in the SHORT_DESCRIPTION and, if it is in the model's
    # vocabulary, add its feature vector to the total. The membership test
    # "word in model" is a lookup on the KeyedVectors object, so there is no
    # need to rebuild a set of the whole vocabulary on every call.
    for word in words:
        if word in model:
            nwords = nwords + 1
            featureVec = np.add(featureVec, model[word])
    #
    # Divide the result by the number of words to get the average. If none of
    # the words are in the vocabulary, keep the zero vector instead of
    # dividing by zero (which would fill the vector with NaNs).
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec


def getAvgFeatureVecs(SHORT_DESCRIPTIONs, model, num_features):
    # Given a set of SHORT_DESCRIPTIONs (each one a list of words), calculate
    # the average feature vector for each one and return a 2D numpy array
    #
    # Preallocate a 2D numpy array, for speed
    SHORT_DESCRIPTIONFeatureVecs = np.zeros((len(SHORT_DESCRIPTIONs), num_features), dtype="float32")
    #
    # Loop through the SHORT_DESCRIPTIONs, using enumerate to track the row index
    for counter, SHORT_DESCRIPTION in enumerate(SHORT_DESCRIPTIONs):
        # Call the function (defined above) that makes average feature vectors
        SHORT_DESCRIPTIONFeatureVecs[counter] = makeFeatureVec(SHORT_DESCRIPTION, model, num_features)
    return SHORT_DESCRIPTIONFeatureVecs


def SHORT_DESCRIPTION_to_wordlist( SHORT_DESCRIPTION, remove_stopwords=False ):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words.  Returns a list of words.
    #
    # 1. Remove HTML
    # (name the parser explicitly so BeautifulSoup does not have to guess one)
    SHORT_DESCRIPTION_text = BeautifulSoup(SHORT_DESCRIPTION, "html.parser").get_text()
    #  
    # 2. Remove non-letters
    SHORT_DESCRIPTION_text = re.sub("[^a-zA-Z]"," ", SHORT_DESCRIPTION_text)
    #
    # 3. Convert words to lower case and split them
    words = SHORT_DESCRIPTION_text.lower().split()
    #
    # 4. Optionally remove stop words (false by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    #
    # 5. Return a list of words
    return words
    

def getCleanSHORT_DESCRIPTIONs(SHORT_DESCRIPTIONs):
    clean_SHORT_DESCRIPTIONs = []
    for SHORT_DESCRIPTION in SHORT_DESCRIPTIONs:
        
        clean_SHORT_DESCRIPTIONs.append( SHORT_DESCRIPTION_to_wordlist( SHORT_DESCRIPTION, remove_stopwords=True ))
    return clean_SHORT_DESCRIPTIONs

#The three lines of code below create the Short Description vectors.
#They have been commented out; instead, the pre-computed vectors are loaded later with joblib.load.
#DSA0001ShortDescVecs = getAvgFeatureVecs( getCleanSHORT_DESCRIPTIONs(trainRawDSA000102['Short description']), WordEmbGlove6B300d, 300 )
#DSA0001ShortDescVecs = pd.DataFrame(DSA0001ShortDescVecs)
#
## Saving the Vectors created from the Short Descriptions
#joblib.dump(DSA0001ShortDescVecs, 'DSA0001ShortDescVecs.pkl')
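
The averaging step can be checked by hand on a toy example. The sketch below uses a made-up two-dimensional embedding stored in a plain dictionary (a stand-in for the GloVe model) and a hypothetical helper `average_vector`. Words missing from the vocabulary are skipped, and a description with no known words maps to the zero vector.

```python
def average_vector(words, embedding, num_features):
    """Average the vectors of the words that are present in `embedding`."""
    known = [embedding[w] for w in words if w in embedding]
    if not known:
        # No word is in the vocabulary: return zeros rather than divide by zero
        return [0.0] * num_features
    # zip(*known) groups the values dimension by dimension
    return [sum(values) / len(known) for values in zip(*known)]

# Made-up 2-dimensional embedding, for illustration only
toy_embedding = {
    "password": [1.0, 0.0],
    "reset": [0.0, 1.0],
}

# "network" is not in the toy vocabulary, so only two vectors are averaged
avg = average_vector(["network", "password", "reset"], toy_embedding, 2)
empty = average_vector(["pf", "oracle"], toy_embedding, 2)
print(avg, empty)
```

One consequence of averaging is that word order is lost: "password reset" and "reset password" map to the same vector, which is usually acceptable for short ticket descriptions.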

In [7]:
#%%Data munging

#One hot encoding categorical variables 'Priority' & 'Assignment Group'
def one_hot_encode(df, cols):
    #df1 = []    
    for each in cols:
        data = list(df[each])
        values = array(data)
        #print(values)
        # integer encode
        label_encoder = LabelEncoder()
        integer_encoded = label_encoder.fit_transform(values)
        #print(integer_encoded)
        # binary encode
        onehot_encoder = OneHotEncoder(sparse=False)
        integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
        onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
        df = pd.concat([df, pd.DataFrame(onehot_encoded)], axis=1)
    return df

trainRawDSA000103 = one_hot_encode(trainRawDSA000102, ['Priority', 'Assignment Group'])

#Retaining some of the columns we think we can use to cluster [retain 'Resolution time, int']
trainRawDSA000104 = trainRawDSA000103.drop(['Number',
 'Short description',
 'Created',
 'State',
 'Priority',
 'Assignment Group',
 'Resolved',
 'Month',
 'Resolution time, time'], axis=1)

#Retaining some of the columns we think we can use to cluster [remove 'Resolution time, int']
trainRawDSA000105 = trainRawDSA000104.drop(['Resolution time, int'], axis=1)


#Merge Average Vectors of Short Descriptions to the rest of the input file
DSA0001ShortDescVecs = joblib.load('DSA0001ShortDescVecs.pkl')
DSA0001DatwDescVecsResTimeInc = pd.concat([trainRawDSA000104,DSA0001ShortDescVecs], axis=1)
DSA0001DatwDescVecsResTimeExc = pd.concat([trainRawDSA000105,DSA0001ShortDescVecs], axis=1)
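
Because `one_hot_encode` wraps the encoder output in a bare `pd.DataFrame`, the new columns are labelled 0, 1, 2, ... for every encoded variable, and the description vectors reuse the same integer labels. One possible alternative, sketched here on made-up rows, is `pd.get_dummies` with a prefix, so that each indicator column keeps a readable and unique name. The helper name `one_hot_encode_named` and the toy values are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd

def one_hot_encode_named(df, cols):
    """Append one indicator column per category, named "<column>_<category>"."""
    dummies = pd.get_dummies(df[cols], prefix=cols, dtype=float)
    return pd.concat([df, dummies], axis=1)

# Invented rows that mimic the 'Priority' and 'Assignment Group' columns
toy = pd.DataFrame({
    "Priority": ["4 - Low", "1 - Critical", "4 - Low"],
    "Assignment Group": ["Team 1", "Team 2", "Team 1"],
})

encoded = one_hot_encode_named(toy, ["Priority", "Assignment Group"])
print(list(encoded.columns))
```

Named columns make it easier to see which features the clustering step received, and they avoid several columns sharing the label 0.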

In [8]:
#%%Clustering the tickets into 4 groups using K Means
# Initialize a k-means object and use it to extract centroids
kmeans_clustering = KMeans( n_clusters = 4 )

#With 'Resolution time, int'
idx = kmeans_clustering.fit_predict( DSA0001DatwDescVecsResTimeInc ) 

# Show the number of tickets per cluster (a bare expression in the middle of a cell is not displayed)
display(np.unique(idx, return_counts=True))
trainRawDSA000108 = pd.concat([trainRawDSA000102,pd.DataFrame(idx)], axis=1)
trainRawDSA000108.rename(columns = {0:'Cluster'}, inplace = True)

#Without 'Resolution time, int'
idx = kmeans_clustering.fit_predict( DSA0001DatwDescVecsResTimeExc ) 

display(np.unique(idx, return_counts=True))
trainRawDSA000109 = pd.concat([trainRawDSA000102,pd.DataFrame(idx)], axis=1)
trainRawDSA000109.rename(columns = {0:'Cluster'}, inplace = True)
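
K-means starts from randomly chosen centroids, so the cluster numbers can change from run to run unless the seed is fixed. The sketch below, on made-up two-dimensional points, fixes `random_state` and reports the cluster sizes. The seed value, the `n_init` setting and the toy data are assumptions chosen for illustration, not settings taken from the analysis above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy groups of points (invented data)
points = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
    [5.0, 5.1], [5.1, 5.0], [5.0, 5.0],
])

# Fixing random_state makes the assignment repeatable across runs
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

# Cluster sizes; the numeric label given to each group is arbitrary
cluster_ids, counts = np.unique(labels, return_counts=True)
print(dict(zip(cluster_ids.tolist(), counts.tolist())))
```

Even with a fixed seed, the label numbers carry no meaning by themselves; only the grouping of tickets does.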

## Conclusion:
The output below shows how the tickets cluster when their descriptions are represented as averaged word vectors. In clusters "0" and "1", the similarity between tickets is apparent. For clusters "2" and "3", we can only tell that their tickets are dissimilar to those in "0" and "1"; we cannot tell what the clustering uses to separate "2" from "3".
We infer that, in a larger dataset, if the descriptions contain indicators of no-action tickets, this approach would use them to group such tickets together.
The unsupervised machine learning component could also be improved by exploring alternatives to K-means clustering, such as spectral clustering or autoencoders.

In [9]:
trainRawDSA000108[['Short description', 'Cluster']].sort_values(['Cluster']).groupby('Cluster').head()

| | Short description | Cluster |
|---|---|---|
| 0 | Network - Password Reset | 0 |
| 67 | Performance Manager - PW Reset | 0 |
| 66 | Performance Manager - PW Reset | 0 |
| 64 | Performance Manager - PW Reset | 0 |
| 63 | Performance Manager - PW Reset | 0 |
| 74 | I cannot see any associates in self service. | 1 |
| 1 | Delay in start of PF.ORACLE_RETAIL_BATCH in Pr... | 1 |
| 8 | User vision Plan amount is displayed wrongly | 1 |
| 59 | S - Mag Cables x3 | 1 |
| 56 | Network Alert | 1 |
