All the modules used in creating this notebook are open-source and free.



# Reading and Analyzing data

In [23]:
#reading the excel file (without sheet_name, only the first sheet is loaded)
import pandas as pd
dfs = pd.read_excel('/content/Datasets EY GDS.xlsx')

In [24]:
input = pd.read_excel('/content/Datasets EY GDS.xlsx', sheet_name='Input')

In [25]:
input.head()

Unnamed: 0,Exception (input),Queue,Process
0,Invite not found in both CBS mailbox,Queue-12,Process-9
1,Interviewer xxxx mail ID not found in invite.,Queue-12,Process-9
2,Invite not found in both TAX and PAS mailbox,Queue-12,Process-9
3,InternalFailed to evaluate expression 'Replace...,Queue-14,Process-11
4,Could not execute code stage because exception...,Queue-16,Process-13


In [26]:
tag = pd.read_excel('/content/Datasets EY GDS.xlsx', sheet_name='Keywords for tagging')

In [27]:
tag.head()

Unnamed: 0,Keywords for System Exception,Keywords for Business Exception
0,Unable to launch service,Missing the mandatory fields
1,Failed to fetch data,Mail Id not found
2,Failed fetching Site issue,Invite is not found
3,Code stage not executed,Template is not found
4,Exception occured in processing,Error in Input File


# Data Cleaning

Regular expressions (regex) help clean text data by matching patterns. <br>
Using regular expressions, we clean our data in the following ways:
1. Converting the entire text to lower case
2. Removing text between brackets
3. Removing digits and unwanted symbols
4. Removing extra white space

In [28]:
#importing regex for data cleaning
import re   

def cleaning_data(text): #function that cleans the data
  text=text.lower()   #converting text to lower case
  text=re.sub(r"([\(\[]).*?([\)\]])", r"\g<1>\g<2>", text)  #emptying brackets; the leftover "()" or "[]" is removed by the next step
  text=re.sub(r'[^a-zA-Z ]+', '', text)   #removing all characters except letters and spaces
  text=re.sub(r"\s+"," ",text)   #collapsing runs of white space into a single space
  return text

In [29]:
#passing text to get cleaned
text = input["Exception (input)"][0]
text = cleaning_data(text)

In [30]:
print(text)

invite not found in both cbs mailbox
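
To see what each cleaning step contributes, here is a minimal, self-contained sketch that applies the same four steps to a made-up message. The function name `clean_exception_text` and the sample string are assumptions for illustration and do not come from the dataset.

```python
import re

def clean_exception_text(text):
    """Apply the four cleaning steps described above (illustrative copy)."""
    text = text.lower()                                          # 1. lower case
    text = re.sub(r"([\(\[]).*?([\)\]])", r"\g<1>\g<2>", text)   # 2. empty the brackets
    text = re.sub(r"[^a-zA-Z ]+", "", text)                      # 3. keep letters and spaces only
    text = re.sub(r"\s+", " ", text)                             # 4. collapse white space
    return text

# Hypothetical message containing brackets, digits and a hyphen
sample = "Failed to fetch data (HTTP 503) from Queue-12"
print(clean_exception_text(sample))
# prints: failed to fetch data from queue
```

Note that the bracketed text, the digits and the hyphen are all removed, so `Queue-12` is reduced to `queue`.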


# Extracting Keywords

We have implemented YAKE! for extracting keywords from text.<br>
Reasons to use YAKE! over many other keyword extraction techniques are:
1. It is a lightweight, unsupervised approach
2. Its authors report that it outperforms several existing state-of-the-art methods
3. It is corpus- and domain-independent

In [31]:
#installing and importing yake
!pip install yake

import yake



In [32]:
#providing the parameters
language = "en"
max_ngram_size = 4
numOfKeywords = 2

kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, top=numOfKeywords, features=None)

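#extracting the keywords; each item is a (keyword, score) pair, and a lower score means a more relevant keyword
keywords = kw_extractor.extract_keywords(text)

#list holding the cleaned text followed by the extracted keywords
sentences=[]
sentences.append(text)
for kw in keywords:
  sentences.append(kw[0])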

# Classifying Keywords

In [33]:
#list of system exceptions (the first 11 rows of the column)
Sys_Exception = tag['Keywords for System Exception'].iloc[:11].tolist()

In [34]:
#list of business exceptions (the first 8 rows of the column)
Bus_Exception = tag['Keywords for Business Exception'].iloc[:8].tolist()

In [13]:
!pip install sent2vec



In [35]:
from scipy import spatial
from sent2vec.vectorizer import Vectorizer

In [36]:
#converting the keywords into vectors
vectorizer1 = Vectorizer()
vectorizer1.bert(sentences)
vectors_bert1 = vectorizer1.vectors

In [37]:
#converting the system exception into vectors
vectorizer2 = Vectorizer()
vectorizer2.bert(Sys_Exception)
vectors_bert2 = vectorizer2.vectors

In [38]:
#converting the business exception into vectors
vectorizer3 = Vectorizer()
vectorizer3.bert(Bus_Exception)
vectors_bert3 = vectorizer3.vectors

In [39]:
#the class whose keywords are closest (smallest cosine distance) to the text or its keywords is printed
min1 = float("inf")   #smallest distance to any system exception keyword
min2 = float("inf")   #smallest distance to any business exception keyword
for vec in vectors_bert1:
  for sys_vec in vectors_bert2:
    min1 = min(min1, spatial.distance.cosine(vec, sys_vec))
  for bus_vec in vectors_bert3:
    min2 = min(min2, spatial.distance.cosine(vec, bus_vec))
if (min1<min2):
  print("System Exception")
else:
  print("Business Exception")

Business Exception
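
The classification rule above compares cosine distances, that is, one minus the cosine similarity of two vectors. The sketch below reproduces that rule in plain Python on toy two-dimensional vectors. The helper names `cosine_distance` and `nearest_label` and all vectors are assumptions made up for illustration; the real BERT embeddings have far more dimensions.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity, the quantity scipy's spatial.distance.cosine computes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_label(query_vectors, labelled_vectors):
    """Return the label whose reference vectors hold the closest match to any query vector."""
    best = {
        label: min(cosine_distance(q, r) for q in query_vectors for r in refs)
        for label, refs in labelled_vectors.items()
    }
    return min(best, key=best.get), best

# Toy vectors invented for this example (hypothetical, not real embeddings)
query_vectors = [[1.0, 0.0], [0.8, 0.6]]
labelled_vectors = {
    "System Exception": [[0.0, 1.0]],
    "Business Exception": [[1.0, 0.0], [0.6, 0.8]],
}

label, distances = nearest_label(query_vectors, labelled_vectors)
print(label)
# prints: Business Exception
```

Because only the single closest pair decides the class, one near-identical keyword outweighs many moderately similar ones. Averaging the distances per class would be an alternative design, at the cost of diluting exact matches.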


# Extracting similar issues

We use BM25 (the Okapi variant provided by `rank_bm25`) to retrieve the top-n exceptions that are most similar to the query text.
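
For intuition about what a BM25 score measures, here is a minimal pure-Python sketch of a textbook BM25 formula. It is an illustration under stated assumptions, not the internals of `rank_bm25`: the function name `bm25_scores`, the `+ 1` smoothing inside the logarithm, the values `k1=1.5` and `b=0.75`, and the three-document toy corpus are all choices made for this example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Number of documents in which each term occurs
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a larger weight; the +1 keeps the weight positive (assumption)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates with k1; b penalises documents longer than average
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

# Toy corpus loosely based on the sample rows, already lower-cased (hypothetical)
docs = [
    "invite not found in both cbs mailbox",
    "invite not found in both tax and pas mailbox",
    "could not execute code stage",
]
tokenized = [doc.split(" ") for doc in docs]
scores = bm25_scores("invite not found in both cbs mailbox".split(" "), tokenized)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print([docs[i] for i in ranking])
```

The identical document ranks first, the near-duplicate second, and the document that shares only the word `not` last.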

In [40]:
!pip install rank_bm25



In [41]:
from rank_bm25 import BM25Okapi

#tokenizing the corpus from where to retrieve similar issues
corpus = input["Exception (input)"]

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

In [42]:
#using the cleaned exception text (the first entry of sentences) as the query

query = sentences[0]
tokenized_query = query.split(" ")

In [43]:
#retrieving the top-3 issues from the corpus that are most similar to the query
bm25.get_top_n(tokenized_query, corpus, n=3)

['Invite not found in both CBS mailbox',
 'Invite not found in both TAX and PAS mailbox',
 'Invite not found in Assurance Mailbox']
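
One point worth checking in this retrieval step is that the query was lower-cased by the cleaning function while the corpus was tokenized as-is, and BM25 matches tokens exactly. The small sketch below shows how that affects token overlap. The helper `overlap` is a hypothetical name introduced for illustration.

```python
def overlap(query_tokens, doc_tokens):
    """Return the tokens shared by the query and the document, sorted alphabetically."""
    return sorted(set(query_tokens) & set(doc_tokens))

raw_doc = "Invite not found in both CBS mailbox"        # as stored in the corpus
cleaned_query = "invite not found in both cbs mailbox"  # after cleaning

case_sensitive = overlap(cleaned_query.split(" "), raw_doc.split(" "))
case_folded = overlap(cleaned_query.split(" "), raw_doc.lower().split(" "))

print(case_sensitive)
# prints: ['both', 'found', 'in', 'mailbox', 'not']
print(case_folded)
# prints: ['both', 'cbs', 'found', 'in', 'invite', 'mailbox', 'not']
```

With exact matching, `Invite` and `CBS` do not contribute to the score. Applying the same cleaning to the corpus before tokenizing would make every shared word count, although it may change the ranking shown above.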