All the modules used in creating this notebook are open-source and free.



# Data Cleaning

Regular expressions (regex) clean text data by matching patterns. <br>
The cleaning function below applies the following steps:
1. Converting the entire text to lower case
2. Removing text between brackets
3. Removing digits and unwanted symbols
4. Removing extra white space

In [59]:
import re   # regular expressions for data cleaning

def cleaning_data(text):   # cleans a raw message before keyword extraction
  text = text.lower()   # convert text to lower case
  text = re.sub(r"([\(\[]).*?([\)\]])", r"\g<1>\g<2>", text)   # empty out bracketed text; the brackets themselves are dropped by the next step
  text = re.sub(r"[^a-zA-Z ]+", "", text)   # keep only letters and spaces (drops digits and symbols)
  text = re.sub(r"\s+", " ", text)   # collapse runs of white space into one space
  return text

In [60]:
#passing text to get cleaned
text = "Internal:Failed to evaluate expression '[Filtered output.P2] - The collection has no current row"
text = cleaning_data(text)
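
To see what each cleaning step contributes, the sketch below records the intermediate string after every substitution. The helper name `cleaning_steps` and the sample message are illustrative assumptions and are not part of the original notebook; the regular expressions are the same as in `cleaning_data`.

```python
import re

def cleaning_steps(text):
    """Return the intermediate result after each cleaning step (illustrative helper)."""
    steps = {}
    text = text.lower()
    steps["lowercase"] = text
    # Empty out anything between (...) or [...], keeping the bare brackets for now
    text = re.sub(r"([\(\[]).*?([\)\]])", r"\g<1>\g<2>", text)
    steps["brackets_emptied"] = text
    # Drop every character that is not a letter or a space
    text = re.sub(r"[^a-zA-Z ]+", "", text)
    steps["letters_only"] = text
    # Collapse runs of spaces left behind by the removals
    text = re.sub(r"\s+", " ", text)
    steps["single_spaces"] = text
    return steps

# Hypothetical log message, chosen to exercise every step
sample = "Job 42 FAILED [node-7] (retry 3):  Timeout"
for name, value in cleaning_steps(sample).items():
    print(f"{name:>16}: {value!r}")
```

Note that symbols are deleted rather than replaced by a space, so two words separated only by punctuation (as in `Internal:Failed`) are merged into one token.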

# Extracting Keywords

We use YAKE! to extract keywords from the text.<br>
Reasons to use YAKE! over other keyword extraction techniques:
1. It is a lightweight, unsupervised approach
2. Its authors report that it outperforms several existing state-of-the-art methods
3. It is corpus- and domain-independent

In [61]:
#installing and importing yake
!pip install yake

import yake



In [62]:
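language = "en"   # language of the input text
max_ngram_size = 4   # longest keyword, in words
deduplication_threshold = 0.9   # similarity limit above which a candidate counts as a duplicate
deduplication_algo = 'seqm'   # similarity function used for deduplication
windowSize = 1
numOfKeywords = 2   # number of keywords to return
kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, dedupFunc=deduplication_algo, windowsSize=windowSize, top=numOfKeywords, features=None)
keywords = kw_extractor.extract_keywords(text)   # list of (keyword, score) pairs
sentences = [kw[0] for kw in keywords]   # keep only the keyword strings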
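
The deduplication threshold controls how similar two candidate keywords may be before the lower-ranked one is dropped. The sketch below illustrates that idea only: it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, which is an assumption and not necessarily what YAKE! computes internally. The function name `dedupe` and the candidate list are hypothetical.

```python
from difflib import SequenceMatcher

def dedupe(candidates, threshold=0.9):
    """Keep a candidate only if it is not too similar to any keyword already kept.

    `candidates` is assumed to be ordered from most to least relevant.
    """
    kept = []
    for cand in candidates:
        # Similarity ratio lies in [0, 1]; 1.0 means the strings are identical
        if all(SequenceMatcher(None, cand, k).ratio() <= threshold for k in kept):
            kept.append(cand)
    return kept

# Hypothetical ranked candidates; the second is nearly a copy of the first
ranked = [
    "collection has no current row",
    "collection has no current",
    "failed to evaluate expression",
]
print(dedupe(ranked, threshold=0.9))
```

With a threshold of 0.9 the near-duplicate second candidate is dropped; raising the threshold to 0.95 keeps all three, so a higher limit returns more overlapping keywords.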

# Classifying Keywords

In [63]:
#Getting labeled Keywords
import pandas as pd
dfs = pd.read_excel('/content/Datasets EY GDS.xlsx', sheet_name='Output')

In [64]:
tag = pd.read_excel('/content/Datasets EY GDS.xlsx', sheet_name='Keywords for tagging')

In [65]:
tag.head()

| | Keywords for System Exception | Keywords for Business Exception |
|---|---|---|
| 0 | Unable to launch service | Missing the mandatory fields |
| 1 | Failed to fetch data | Mail Id not found |
| 2 | Failed fetching Site issue | Invite is not found |
| 3 | Code stage not executed | Template is not found |
| 4 | Exception occured in processing | Error in Input File |


In [66]:
# the first 11 rows of this column hold the System Exception keywords
Sys_Exception = tag['Keywords for System Exception'].iloc[:11].tolist()

In [67]:
# the first 8 rows of this column hold the Business Exception keywords
Bus_Exception = tag['Keywords for Business Exception'].iloc[:8].tolist()

In [68]:
!pip install sent2vec



In [69]:
from scipy import spatial
from sent2vec.vectorizer import Vectorizer

In [70]:
vectorizer1 = Vectorizer()
vectorizer1.bert(sentences)   # embed the extracted keywords
vectors_bert1 = vectorizer1.vectors

In [71]:
vectorizer2 = Vectorizer()
vectorizer2.bert(Sys_Exception)   # embed the System Exception keywords
vectors_bert2 = vectorizer2.vectors

In [72]:
vectorizer3 = Vectorizer()
vectorizer3.bert(Bus_Exception)   # embed the Business Exception keywords
vectors_bert3 = vectorizer3.vectors

In [75]:
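# smallest cosine distance from any extracted keyword to each labelled keyword set
min1 = float("inf")   # closest System Exception keyword
min2 = float("inf")   # closest Business Exception keyword
for kw_vec in vectors_bert1:
  for sys_vec in vectors_bert2:
    min1 = min(min1, spatial.distance.cosine(kw_vec, sys_vec))
  for bus_vec in vectors_bert3:
    min2 = min(min2, spatial.distance.cosine(kw_vec, bus_vec))
# the label whose keyword set contains the nearest vector wins
if min1 < min2:
  print("System Exception")
else:
  print("Business Exception")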

System Exception
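
The decision rule above is a nearest-neighbour comparison on cosine distance. The sketch below restates it with plain Python and small hand-made vectors so it can run without BERT embeddings. The names `cosine_distance` and `nearest_label` and the 3-dimensional toy vectors are illustrative assumptions, not values from the notebook.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_label(query_vectors, labelled_vectors):
    """Return the label (and distance) of the labelled vector closest to any query vector."""
    best_label, best_dist = None, float("inf")
    for q in query_vectors:
        for label, vectors in labelled_vectors.items():
            for v in vectors:
                d = cosine_distance(q, v)
                if d < best_dist:
                    best_label, best_dist = label, d
    return best_label, best_dist

# Toy embeddings standing in for the BERT vectors (hypothetical values)
toy_labels = {
    "System Exception": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "Business Exception": [[0.0, 1.0, 0.0]],
}
label, dist = nearest_label([[0.8, 0.2, 0.0]], toy_labels)
print(label, round(dist, 4))
```

Because only the single nearest labelled keyword decides the outcome, one unusually close match can outweigh several moderately close matches from the other class; averaging the distances per class would be a possible alternative.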
