# IT Tickets Classification Project

## Previous Notebooks

- [Data Collection](0-Data Collection.ipynb)
- [Data Cleaning and EDA](1-Data Cleaning and EDA.ipynb)
- [Document-Term Matrix](2-Document-Term Matrix.ipynb)

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from scipy.sparse.linalg import svds

## Topic Modeling

In this notebook I use singular value decomposition (SVD) to model the topics in the tickets' text, keeping the 20 singular vectors with the largest singular values to represent the 20 most relevant topics in the collection. Before applying SVD I normalize the document-term matrix using TF-IDF.
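As a reminder of the notation, here is the standard rank-$k$ truncated SVD (the general textbook form, not something specific to this project). $X$ is the TF-IDF matrix with $n$ tickets and $m$ terms, and $k = 20$ in this notebook:

$$
X \approx U_k \Sigma_k V_k^\top, \qquad U_k \in \mathbb{R}^{n \times k}, \quad \Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k), \quad V_k^\top \in \mathbb{R}^{k \times m}
$$

In the code, `u`, `s` and `vt` play the roles of $U_k$, the diagonal of $\Sigma_k$ and $V_k^\top$. Each row of `vt` weights the terms of one topic, and each row of `u` gives one ticket's coordinates along the $k$ topics.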

In [2]:
doc_term = pd.read_pickle('../data/interim/document_term.pkl')

In [3]:
tfidf = TfidfTransformer()
X = tfidf.fit_transform(doc_term)
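To make the normalization step concrete, here is a minimal sketch on a made-up 3×3 count matrix (`counts` and `X_toy` are illustrative names, not part of the project data). With its default settings, `TfidfTransformer` reweights the counts by inverse document frequency and then scales every row to unit Euclidean length:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical term counts: 3 documents (rows) x 3 terms (columns)
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [0, 4, 1]])

# Default settings: smoothed IDF weighting followed by L2 row normalization
X_toy = TfidfTransformer().fit_transform(counts)

# Each document vector should now have Euclidean norm 1
row_norms = np.linalg.norm(X_toy.toarray(), axis=1)
print(row_norms)
```

Because of the row normalization, long and short tickets contribute vectors of the same length to the decomposition.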

In [4]:
k = 20
u, s, vt = svds(X, k=k)

In [5]:
s

array([ 10.48890729,  10.695562  ,  10.88343826,  11.62480531,
        11.94193887,  11.96500652,  12.72272519,  13.15786524,
        13.66620399,  14.02286417,  14.15552572,  15.24216734,
        16.09869209,  17.74408135,  18.18617037,  19.91073291,
        21.9617521 ,  24.93779523,  28.14551286,  33.79456837])
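The singular values above are listed in ascending order, which is why topic 20 ends up being the strongest component. As far as I know, the SciPy documentation does not promise a particular order for the output of `svds`, so code that relies on it may want to sort explicitly. The following is a minimal sketch on random toy data (`X_toy` and the `*_sorted` names are hypothetical), not a change to the notebook's pipeline:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy sparse matrix standing in for the TF-IDF matrix (random, made-up data)
rng = np.random.default_rng(0)
dense = rng.random((50, 30))
dense[dense < 0.8] = 0.0
X_toy = csr_matrix(dense)

k = 5
u, s, vt = svds(X_toy, k=k)

# Reorder the components so that index 0 holds the largest singular value
order = np.argsort(s)[::-1]
u_sorted, s_sorted, vt_sorted = u[:, order], s[order], vt[order, :]

# Share of each component in the squared singular values of the k kept components
share = s_sorted ** 2 / np.sum(s_sorted ** 2)
print(s_sorted)
print(share)
```

Reordering all three outputs with the same permutation leaves the rank-`k` approximation unchanged; only the component numbering differs.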

Below are the 10 highest-weighted keywords for each selected topic. Topic 20 seems to be related to technical terms, while the other topics seem to be related to agreements (e.g. topics 16, 15 and 8), suppliers (e.g. topics 19, 18 and 11) or IT procedures (e.g. topics 12 and 11).

In [32]:
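# Collect the 10 highest-weighted terms of each right singular vector (one row of vt per topic)
topics = []
for i in range(k):
    weights = pd.Series(vt[i], index=doc_term.columns)
    topics.append(list(weights.sort_values(ascending=False).head(10).index.values))

topics = pd.DataFrame(topics)
topics.columns = ['word{}'.format(i+1) for i in topics.columns.values]
# The singular values in s are ascending, so reverse the rows to list the strongest topic first
topics.sort_index(ascending=False, inplace=True)
topics.index = ['topic{}'.format(i+1) for i in topics.index.values]
# Show topics 20 to 11, one topic per column
topics.T.iloc[:, 0:10]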

| | topic20 | topic19 | topic18 | topic17 | topic16 | topic15 | topic14 | topic13 | topic12 | topic11 |
|---|---|---|---|---|---|---|---|---|---|---|
| word1 | managedreferencemethodinterceptorfactory | octo | box | dat | copertur | smart | graz | recuper | fil | riapertur |
| word2 | managedreferencemethodinterceptor | box | attiv | stat | recuper | disdett | sospes | sospes | alleg | octo |
| word3 | initialinterceptor | ticket | terminal | append | graz | car | alleg | verif | import | atr |
| word4 | componentdispatcherinterceptor | fluss | conferm | pag | incass | retroatt | recuper | manc | elabor | voucher |
| word5 | isadocbusinessdelegateimpl | disinstall | portal | prem | import | contratt | cia | alleg | produzion | aggiorn |
| word6 | isadocproxyclient | terminal | forz | eur | cia | recess | buongiorn | fil | procedur | fluss |
| word7 | weavedinterceptor | attiv | disinstall | copertur | verif | client | fil | accm | lanc | covell |
| word8 | userinterceptorfactory | portal | smont | regol | alleg | intest | chied | esit | esit | pol |
| word9 | jaxws | forz | stat | graz | richied | disposit | targ | elabor | sospes | scad |
| word10 | synchronization | conferm | telematics | martell | box | verific | produzion | covell | titol | lavor |


In [33]:
topics.T.iloc[:, 10:20]

| | topic10 | topic9 | topic8 | topic7 | topic6 | topic5 | topic4 | topic3 | topic2 | topic1 |
|---|---|---|---|---|---|---|---|---|---|---|
| word1 | lanc | atr | convenzion | convenzion | riattiv | riattiv | riattiv | error | anagraf | sinistr |
| word2 | procedur | caric | dipendent | dipendent | visibil | targ | covell | alleg | trasfer | elabor |
| word3 | produzion | buon | verif | inser | potrest | graz | atr | sinistr | script | trasfer |
| word4 | fil | lavor | manc | migrazion | client | propost | utenz | pag | esecu | fil |
| word5 | elabor | ani | inser | dat | verific | convenzion | pol | fas | produzion | anagraf |
| word6 | ebaas | chied | atr | ebaas | targ | cia | caric | sin | covell | oggett |
| word7 | input | pol | incass | abit | oggett | dipendent | convenzion | serviz | ambient | esit |
| word8 | dat | scad | sostitu | famigl | iban | rinnov | contraent | dispon | esegu | liquid |
| word9 | esit | rinnov | elabor | scadenz | manc | richied | abilit | buongiorn | dat | estrazion |
| word10 | batc | cia | lanc | anagraf | covell | finocc | visibil | stat | sinistr | cortes |
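One caveat when reading these keyword lists: singular vectors are only defined up to sign, so flipping the sign of a row of `vt` together with the matching column of `u` gives an equally valid decomposition. Sorting by signed weight, as done above, therefore shows only one side of each component. A possible alternative, sketched here on made-up weights (`top_terms` is a hypothetical helper, not part of the notebook), ranks terms by absolute weight and keeps the sign for inspection:

```python
import numpy as np
import pandas as pd

def top_terms(component, terms, n=10):
    """Return the n terms with the largest absolute weight, keeping their sign."""
    weights = pd.Series(component, index=terms)
    strongest = weights.abs().sort_values(ascending=False).index[:n]
    return weights.loc[strongest]

# Hypothetical vocabulary and one made-up topic vector
terms = ['alpha', 'beta', 'gamma', 'delta']
component = np.array([0.1, -0.9, 0.4, -0.2])

top = top_terms(component, terms, n=2)
print(top)
```

In the notebook this would be called as `top_terms(vt[i], doc_term.columns)`; whether the negative side of a component is informative for these tickets would have to be checked on the real data.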


In [7]:
issues = pd.read_pickle('../data/interim/bag_of_words.pkl')

Finally, I build the dataset I will use to predict the tickets' labels, joining each ticket's topic scores with its label and issue type:

In [8]:
processed_data = pd.DataFrame(u, index=issues['key'])
processed_data.columns = ['topic{}'.format(i+1) for i in processed_data.columns.values]
processed_data = processed_data.merge(issues[['key', 'issue_type', 'label']].set_index('key'),
                                      how='inner', left_index=True, right_index=True)

In [9]:
processed_data.head()

| key | topic1 | topic2 | topic3 | topic4 | topic5 | topic6 | topic7 | topic8 | topic9 | topic10 | ... | topic13 | topic14 | topic15 | topic16 | topic17 | topic18 | topic19 | topic20 | issue_type | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ISAHD-31941 | -0.009479 | -0.003791 | 0.017397 | -0.002119 | -0.01136 | 0.004534 | -0.004413 | 0.001417 | -0.010153 | -0.001711 | ... | -0.003778 | 0.00476 | 0.000375 | 0.002743 | 0.003082 | -0.00067 | -0.000691 | -0.000491 | Modifica dati | ISAHD Rilascio |
| ISAHD-31940 | 0.003563 | 0.003421 | -0.000765 | 0.002659 | -0.004842 | -0.000941 | -0.001263 | 0.000591 | 0.004206 | -0.000204 | ... | -0.003044 | 0.001408 | 2.7e-05 | 0.001142 | 0.00105 | -0.000507 | -0.000266 | -0.000249 | Modifica dati | ISAHD Forzatura dati |
| ISAHD-31939 | -0.010522 | -0.003833 | 0.010525 | -0.004286 | -0.014718 | 0.007967 | -0.006912 | -0.000352 | -0.007102 | -9.5e-05 | ... | -0.00386 | 0.004374 | 0.000327 | 0.001815 | 0.00287 | 0.000539 | -0.003514 | -0.002006 | Modifica dati | ISAHD Altro |
| ISAHD-31938 | -0.007514 | -0.002789 | 0.011523 | -0.002412 | -0.014195 | 0.006187 | -0.006845 | -0.000117 | -0.009371 | -2.2e-05 | ... | -0.001613 | 0.005189 | 0.000371 | 0.002673 | 0.003931 | -0.000812 | -0.000766 | -0.000688 | Modifica dati | ISAHD Rilascio |
| ISAHD-31937 | -0.003378 | -0.004804 | -0.00833 | -0.00693 | 0.001492 | 0.002304 | -0.001676 | 7.9e-05 | -0.00302 | -0.001176 | ... | -0.004654 | 0.002205 | 0.000179 | 0.003854 | -0.002536 | -0.00758 | 0.001632 | -0.010511 | Modifica dati | ISAHD Forzatura dati |
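The construction of `processed_data` relies on two assumptions that the notebook does not check: the rows of `u` follow the same ticket order as `issues`, and every `key` appears only once (otherwise the index-on-index merge would duplicate rows). Below is a minimal sketch of how these could be guarded, using made-up data (`issues_toy`, `u_toy` and the ticket keys are hypothetical) and the `validate` option of `DataFrame.merge`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `issues` and the document-topic matrix `u`
issues_toy = pd.DataFrame({
    'key': ['T-3', 'T-2', 'T-1'],
    'issue_type': ['a', 'b', 'a'],
    'label': ['x', 'y', 'x'],
})
u_toy = np.arange(6, dtype=float).reshape(3, 2)

# One row of topic scores per ticket, and no repeated ticket keys
assert u_toy.shape[0] == len(issues_toy)
assert issues_toy['key'].is_unique

features = pd.DataFrame(u_toy, index=issues_toy['key'],
                        columns=['topic1', 'topic2'])

# validate='one_to_one' makes pandas raise if either side has duplicate keys
merged = features.merge(issues_toy.set_index('key')[['issue_type', 'label']],
                        how='inner', left_index=True, right_index=True,
                        validate='one_to_one')
print(merged)
```

Row order itself cannot be verified from `u` alone, so it still depends on the document-term matrix having been built from `issues` in the same order.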


In [10]:
processed_data.to_pickle('../data/processed/proc_data.pkl')

## Following Notebooks

- [Random Forest Prediction](4-Model.ipynb)