# IT Tickets Classification Project

## Previous Notebooks

- [Data Collection](0-Data Collection.ipynb)
- [Data Cleaning and EDA](1-Data Cleaning and EDA.ipynb)

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from scipy.sparse.linalg import svds

## Document-Term Matrix

In this notebook I transform each ticket's bag of words into a document-term matrix (one row per ticket, one column per term), which I will use later for topic modeling.

Along the way I discard the terms that appear in every ticket or only once, since neither kind helps tell tickets apart.
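
To make the target layout concrete, here is a minimal sketch in plain Python on made-up data. The ticket keys (`T-1`, ...) and the names `toy_bags`, `toy_terms` and `toy_matrix` are hypothetical and are not used anywhere else in the project; the real matrix is built with pandas in the cells below.

```python
# Toy bags of words: one {term: count} dict per ticket (made-up data).
toy_bags = {
    "T-1": {"error": 2, "sistem": 1},
    "T-2": {"sistem": 1, "alleg": 3},
    "T-3": {"error": 1},
}

# Columns are the distinct terms, in first-seen order.
toy_terms = []
for counts in toy_bags.values():
    for term in counts:
        if term not in toy_terms:
            toy_terms.append(term)

# Each row holds one ticket's count for every term (0 when the term is absent).
toy_matrix = {
    key: [counts.get(term, 0) for term in toy_terms]
    for key, counts in toy_bags.items()
}

print(toy_terms)
for key, row in toy_matrix.items():
    print(key, row)
```

Each row is a ticket and each column a term, so most cells are zero once the vocabulary grows.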

In [2]:
bag = pd.read_pickle('../data/interim/bag_of_words.pkl')

In [3]:
n = len(bag)
n

20178

In [4]:
# Sum each term's count over all tickets (corpus-wide totals)
all_terms = {}
for row in bag['bag_of_words']:
    for term in row:
        if term in all_terms:
            all_terms[term] += row[term]
        else:
            all_terms[term] = row[term]
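
As a side note, the same accumulation could probably be written with `collections.Counter`, whose `update` method adds counts instead of overwriting them. This is only a sketch on made-up rows (`toy_rows` and `toy_totals` are hypothetical names), assuming each bag is a plain `{term: count}` mapping.

```python
from collections import Counter

# Made-up bags of words, one dict per ticket.
toy_rows = [
    {"error": 2, "sistem": 1},
    {"sistem": 1, "alleg": 3},
    {"error": 1},
]

toy_totals = Counter()
for row in toy_rows:
    # Counter.update with a mapping adds its counts to the running totals.
    toy_totals.update(row)

print(dict(toy_totals))
```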

In [5]:
m = len(all_terms)
m

14690

In [6]:
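# Coarse pre-filter on corpus-wide totals: keep a term only if it occurs more than
# once overall and its total stays below m (the vocabulary size from In [5]).
# The per-ticket check (exactly one ticket, or every ticket) is done in In [11].
terms = {}
for term in all_terms:
    if 1 < all_terms[term] < m:
        terms[term] = all_terms[term]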
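
Note that a corpus-wide total is not the same thing as the number of tickets a term appears in, which is why the per-ticket check in In [11] is still needed. The sketch below illustrates the difference on made-up data; `toy_rows`, `total_count`, `doc_freq` and `kept` are hypothetical names, and the terms are invented for the example.

```python
from collections import Counter

# Made-up bags of words, one dict per ticket.
toy_rows = [
    {"printer": 4, "login": 1, "vpn": 1},
    {"login": 1, "vpn": 2},
    {"login": 2},
]

total_count = Counter()  # occurrences summed over all tickets
doc_freq = Counter()     # number of tickets containing the term
for row in toy_rows:
    total_count.update(row)
    doc_freq.update(row.keys())  # each ticket contributes at most 1 per term

# Keep terms found in more than one ticket but not in every ticket.
kept = sorted(t for t in doc_freq if 1 < doc_freq[t] < len(toy_rows))

print(dict(total_count))
print(dict(doc_freq))
print(kept)
```

Here `printer` has a total of 4 but appears in a single ticket, and `login` appears in every ticket, so only `vpn` survives the document-frequency filter.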

In [7]:
m = len(terms)
m

8016

In [8]:
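# One row per ticket (indexed by key), one zero-initialised column per retained term
doc_term = pd.DataFrame(np.zeros([n, m]), columns=list(terms), index=bag['key'])
# Temporarily attach each ticket's bag of words, aligned on the ticket key
doc_term['bag_of_words'] = bag.set_index('key')['bag_of_words']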

In [9]:
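# Fill each term column with that term's count in every ticket (0 if absent).
# Iterating over `terms` skips the helper 'bag_of_words' column, and `row`
# avoids reusing the name of the `bag` DataFrame inside the comprehension.
for col in terms:
    doc_term[col] = [row[col] if col in row else 0 for row in doc_term['bag_of_words']]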

In [10]:
doc_term.drop('bag_of_words', axis=1, inplace=True)

In [11]:
# Number of tickets in which each term appears (document frequency)
doc_freq = (doc_term != 0).sum(axis=0)
# Drop the terms found in exactly one ticket or in all n tickets
doc_term = doc_term.loc[:, (doc_freq != 1) & (doc_freq != n)]

In [12]:
doc_term.head()

| key | perfezion | restitu | sostitu | serviz | clicc | sistem | error | riesc | dispon | alleg | ... | diacon | corl | monopol | usal | galantin | toig | cammaller | ebran | camozz | piccalug |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ISAHD-31941 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ISAHD-31940 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ISAHD-31939 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ISAHD-31938 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ISAHD-31937 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [13]:
doc_term.to_pickle('../data/interim/document_term.pkl')

## Following Notebooks

- [Topic Modeling](3-Topic Modeling.ipynb)
- [Random Forest Prediction](4-Model.ipynb)