# IT Tickets Classification Project

## Previous Notebooks

- [Data Collection](0-Data Collection.ipynb)
- [Data Cleaning and EDA](1-Data Cleaning and EDA.ipynb)
- [Document-Term Matrix](2-Document-Term Matrix.ipynb)
- [Topic Modeling](3-Topic Modeling.ipynb)

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import f1_score, precision_recall_curve, confusion_matrix, accuracy_score, recall_score, precision_score, make_scorer

## Random Forest

In this last notebook I use the topics determined in the previous steps, together with the issue type, to predict the tickets' labels with a random forest.

I try different numbers of topics and pick the best one with 5-fold cross-validation, scored by weighted F1.
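
As a minimal sketch of this comparison, the loop below runs one small grid search per feature subset and keeps the best cross-validated score of each. It assumes synthetic data from `make_classification` in place of the real tickets, and the subset names, the reduced grid and the `shuffle=True` setting are illustrative choices, not the ones used in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the ticket features (assumption: not the real data).
X_arr, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                               n_classes=3, random_state=0)
X = pd.DataFrame(X_arr, columns=['topic{}'.format(i) for i in range(1, 9)])

# Hypothetical nested feature subsets, from fewest to most topics.
feature_sets = {
    'last 3 topics': list(X.columns[-3:]),
    'last 5 topics': list(X.columns[-5:]),
    'all 8 topics': list(X.columns),
}

scorer = make_scorer(f1_score, average='weighted')
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=74)
small_grid = {'n_estimators': [10, 30], 'max_features': [0.3, 0.6]}

# One grid search per subset; store the best CV score and its parameters.
results = {}
for name, cols in feature_sets.items():
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid=small_grid, cv=cv, scoring=scorer)
    search.fit(X[cols], y)
    results[name] = (search.best_score_, search.best_params_)

best_name = max(results, key=lambda k: results[k][0])
for name, (score, params) in results.items():
    print('{}: weighted F1 = {:.3f} with {}'.format(name, score, params))
print('Best subset:', best_name)
```

Keeping the same `cv` object and scorer across subsets makes the scores directly comparable, which is the point of the comparison.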

In [2]:
processed_data = pd.read_pickle('../data/processed/proc_data.pkl')

In [3]:
le1 = LabelEncoder()
processed_data['issue_type'] = le1.fit_transform(processed_data['issue_type'])
le2 = LabelEncoder()
processed_data['label'] = le2.fit_transform(processed_data['label'])

In [4]:
Xtr, Xts, ytr, yts = train_test_split(processed_data.drop('label', axis=1),
                                      processed_data['label'],
                                      test_size=0.2, random_state=42, stratify=processed_data['label'])

In [5]:
rfc = RandomForestClassifier()

In [6]:
param_grid = {'n_estimators':np.arange(30, 110, 10),
              'max_features':np.arange(0.1, 1.1, 0.1),
              'min_samples_split': np.arange(2, 6, 1)}
f1scorer = make_scorer(f1_score, average='weighted')
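# Note: random_state only has an effect when shuffle=True; recent scikit-learn
# versions raise a ValueError if it is set while shuffle is left at False.
kf = StratifiedKFold(n_splits=5, random_state=74)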

In [7]:
gs10 = GridSearchCV(estimator=rfc, param_grid=param_grid, n_jobs=-1, cv=kf, verbose=10, scoring=f1scorer)
gs10.fit(Xtr[['issue_type', 'topic10','topic11', 'topic12', 'topic13', 'topic14',
              'topic15', 'topic16', 'topic17', 'topic18', 'topic19', 'topic20']], ytr)

Fitting 5 folds for each of 320 candidates, totalling 1600 fits


[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    8.2s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   10.6s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   15.1s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:   20.8s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   29.5s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   37.4s
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:   42.7s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   50.7s
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  2.7min
[...]

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=74, shuffle=False),
       error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'n_estimators': array([ 30,  40,  50,  60,  70,  80,  90, 100]), 'max_features': array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ]), 'min_samples_split': array([2, 3, 4, 5])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=make_scorer(f1_score, average=weighted), verbose=10)

In [8]:
print(gs10.best_score_)
print(gs10.best_params_)

0.678711405911
{'max_features': 0.30000000000000004, 'min_samples_split': 5, 'n_estimators': 100}


In [9]:
gs15 = GridSearchCV(estimator=rfc, param_grid=param_grid, n_jobs=-1, cv=kf, verbose=10, scoring=f1scorer)
gs15.fit(Xtr[['issue_type', 'topic5','topic6', 'topic7', 'topic8', 'topic9',
              'topic10','topic11', 'topic12', 'topic13', 'topic14',
              'topic15', 'topic16', 'topic17', 'topic18', 'topic19', 'topic20']], ytr)

Fitting 5 folds for each of 320 candidates, totalling 1600 fits


[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    8.6s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   11.2s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   16.1s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:   21.8s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   29.9s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   36.9s
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:   42.1s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   50.1s
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  2.7min
[...]

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=74, shuffle=False),
       error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'n_estimators': array([ 30,  40,  50,  60,  70,  80,  90, 100]), 'max_features': array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ]), 'min_samples_split': array([2, 3, 4, 5])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=make_scorer(f1_score, average=weighted), verbose=10)

In [10]:
print(gs15.best_score_)
print(gs15.best_params_)

0.690890917993
{'max_features': 0.20000000000000001, 'min_samples_split': 4, 'n_estimators': 70}


In [11]:
gs20 = GridSearchCV(estimator=rfc, param_grid=param_grid, n_jobs=-1, cv=kf, verbose=10, scoring=f1scorer)
gs20.fit(Xtr[['issue_type', 'topic1','topic2', 'topic3', 'topic4',
              'topic5','topic6', 'topic7', 'topic8', 'topic9',
              'topic10','topic11', 'topic12', 'topic13', 'topic14',
              'topic15', 'topic16', 'topic17', 'topic18', 'topic19', 'topic20']], ytr)

Fitting 5 folds for each of 320 candidates, totalling 1600 fits


[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    9.4s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   12.8s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   19.6s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:   28.3s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   40.1s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   50.8s
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:   58.9s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  3.9min
[...]

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=74, shuffle=False),
       error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'n_estimators': array([ 30,  40,  50,  60,  70,  80,  90, 100]), 'max_features': array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ]), 'min_samples_split': array([2, 3, 4, 5])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=make_scorer(f1_score, average=weighted), verbose=10)

In [12]:
print(gs20.best_score_)
print(gs20.best_params_)

0.697161725092
{'max_features': 0.30000000000000004, 'min_samples_split': 2, 'n_estimators': 90}


The best model is the one that uses all 20 topics (cross-validated weighted F1 of about 0.697), although its score is only slightly higher than that of the `gs15` model (about 0.691).

In [13]:
rfc.set_params(max_features=0.3, min_samples_split=2, n_estimators=90)
rfc.fit(Xtr, ytr)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=0.3, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=90, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [14]:
for imp, col in sorted(zip(rfc.feature_importances_, Xtr.columns), reverse=True):
    print('{} importance: {:.5f}'.format(col, imp))

topic19 importance: 0.12848
issue_type importance: 0.08766
topic20 importance: 0.07580
topic6 importance: 0.05249
topic7 importance: 0.05021
topic13 importance: 0.04892
topic18 importance: 0.04671
topic12 importance: 0.04375
topic9 importance: 0.04246
topic10 importance: 0.04217
topic11 importance: 0.04078
topic17 importance: 0.03844
topic2 importance: 0.03708
topic3 importance: 0.03657
topic5 importance: 0.03517
topic1 importance: 0.03403
topic4 importance: 0.03336
topic14 importance: 0.03253
topic8 importance: 0.03199
topic16 importance: 0.03072
topic15 importance: 0.03068


In [15]:
ypred_ts = rfc.predict(Xts)
ypred_ts_proba = rfc.predict_proba(Xts)
print('Accuracy: {}'.format(accuracy_score(yts, ypred_ts)))
print('Recall: {}'.format(recall_score(yts, ypred_ts, average='weighted')))
print('Precision: {}'.format(precision_score(yts, ypred_ts, average='weighted')))
print('F1: {}'.format(f1_score(yts, ypred_ts, average='weighted')))

Accuracy: 0.7106045589692765
Recall: 0.7106045589692765
Precision: 0.6895518954054446
F1: 0.6964414982971537


This model reaches a weighted F1 score of almost 0.7 on the test set, with accuracy (0.71), recall (0.71) and precision (0.69) all very close to each other.
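
Accuracy and weighted recall are identical in the output above, and this is not a coincidence. As a short derivation, write $n_k$ for the number of test tickets whose true class is $k$, $TP_k$ for how many of them are predicted correctly, and $N$ for the total number of test tickets (symbols introduced here only for illustration):

```latex
\text{Recall}_{\text{weighted}}
  = \sum_{k} \frac{n_k}{N} \cdot \frac{TP_k}{n_k}
  = \frac{1}{N} \sum_{k} TP_k
  = \text{Accuracy}
```

So, with `average='weighted'`, only precision and F1 add information beyond accuracy.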

The confusion matrix below shows that the model struggles with the "ISAHD Altro" label: it is a generic catch-all category and is often confused with many of the others. The model also has trouble identifying the least frequent categories, such as "OTHER", "ISAHD Rilascio" and "ISAHD Estrazione effettuata".

In [16]:
pd.DataFrame(confusion_matrix(le2.inverse_transform(yts), le2.inverse_transform(ypred_ts)),
             index=le2.classes_,
             columns=le2.classes_)

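| True label (rows) / predicted label (columns) | ISAHD Altro | ISAHD Estrazione effettuata | ISAHD Forzatura dati | ISAHD Intervento | ISAHD Riesecuzione procedura | ISAHD Rilascio | ISAITAMDWH | ISAITAMS | OTHER | SCAI |
|---|---|---|---|---|---|---|---|---|---|---|
| ISAHD Altro | 189 | 9 | 151 | 88 | 82 | 1 | 12 | 43 | 1 | 15 |
| ISAHD Estrazione effettuata | 13 | 6 | 15 | 6 | 2 | 0 | 14 | 1 | 0 | 0 |
| ISAHD Forzatura dati | 79 | 0 | 1520 | 22 | 15 | 0 | 2 | 14 | 0 | 14 |
| ISAHD Intervento | 67 | 0 | 69 | 298 | 22 | 0 | 2 | 5 | 0 | 7 |
| ISAHD Riesecuzione procedura | 53 | 3 | 20 | 7 | 612 | 2 | 3 | 11 | 0 | 4 |
| ISAHD Rilascio | 2 | 0 | 4 | 1 | 2 | 0 | 1 | 1 | 0 | 2 |
| ISAITAMDWH | 15 | 3 | 0 | 0 | 1 | 0 | 82 | 0 | 0 | 0 |
| ISAITAMS | 88 | 0 | 33 | 7 | 25 | 0 | 0 | 108 | 1 | 2 |
| OTHER | 3 | 0 | 0 | 2 | 0 | 0 | 1 | 7 | 4 | 0 |
| SCAI | 36 | 0 | 23 | 4 | 18 | 0 | 0 | 12 | 0 | 49 |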
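
To make the per-class behaviour easier to read, the counts in the confusion matrix above can be turned into recall, precision and F1 for each label. The following is a minimal sketch in plain Python that copies the counts by hand; the variable names are my own, and in the notebook itself the same figures could be computed directly from `yts` and `ypred_ts`.

```python
# Labels and counts copied from the confusion matrix above
# (rows = true label, columns = predicted label).
labels = ['ISAHD Altro', 'ISAHD Estrazione effettuata', 'ISAHD Forzatura dati',
          'ISAHD Intervento', 'ISAHD Riesecuzione procedura', 'ISAHD Rilascio',
          'ISAITAMDWH', 'ISAITAMS', 'OTHER', 'SCAI']
cm = [
    [189, 9, 151, 88, 82, 1, 12, 43, 1, 15],
    [13, 6, 15, 6, 2, 0, 14, 1, 0, 0],
    [79, 0, 1520, 22, 15, 0, 2, 14, 0, 14],
    [67, 0, 69, 298, 22, 0, 2, 5, 0, 7],
    [53, 3, 20, 7, 612, 2, 3, 11, 0, 4],
    [2, 0, 4, 1, 2, 0, 1, 1, 0, 2],
    [15, 3, 0, 0, 1, 0, 82, 0, 0, 0],
    [88, 0, 33, 7, 25, 0, 0, 108, 1, 2],
    [3, 0, 0, 2, 0, 0, 1, 7, 4, 0],
    [36, 0, 23, 4, 18, 0, 0, 12, 0, 49],
]

total = sum(sum(row) for row in cm)
support = [sum(row) for row in cm]            # tickets per true class (row sums)
predicted = [sum(col) for col in zip(*cm)]    # tickets per predicted class (column sums)
tp = [cm[i][i] for i in range(len(cm))]       # correct predictions (diagonal)

# Per-class metrics; a class that is never predicted gets precision 0.
recall = [t / s for t, s in zip(tp, support)]
precision = [t / p if p else 0.0 for t, p in zip(tp, predicted)]
f1 = [2 * t / (s + p) if (s + p) else 0.0
      for t, s, p in zip(tp, support, predicted)]

# Support-weighted averages, as with average='weighted' in scikit-learn.
accuracy = sum(tp) / total
weighted_precision = sum(s * p for s, p in zip(support, precision)) / total
weighted_f1 = sum(s * f for s, f in zip(support, f1)) / total

for name, s, r, p, f in zip(labels, support, recall, precision, f1):
    print('{:30s} support={:5d} recall={:.3f} precision={:.3f} f1={:.3f}'.format(
        name, s, r, p, f))
print('accuracy={:.4f} weighted precision={:.4f} weighted F1={:.4f}'.format(
    accuracy, weighted_precision, weighted_f1))
```

The weighted averages should match the figures printed by cell 15, while the per-class lines show where the score is lost: "ISAHD Altro", for instance, has only 189 correct predictions out of 591 tickets, and "ISAHD Rilascio" has none out of 13.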


## Conclusions

In this project I used some NLP techniques to analyze textual data about incoming IT requests, using topic modeling to try to predict how an issue has to be addressed in order to solve it correctly. Because an issue can be solved in many ways, this is a multi-class prediction problem, and some of the classes are very hard to predict correctly because they occur rarely.

The final F1 score is close to 0.7, which leaves room for improvement: I could be more careful in eliminating noise words, and I could collect more tickets in order to have more data for the least frequent categories.

## Further Analysis

With more time, I would try:

- adding more topics to my dataset
- compiling a list of words typical of the language used when requesting something (greetings, thanks and so on)
- checking how the predictions change when the name of the person who created the ticket is added
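
The second idea in the list above could be sketched as an extra stop-word filter applied before building the document-term matrix. This is only an illustration: the word list and the `strip_courtesy_words` helper are hypothetical, and a real list would have to be built by inspecting the tickets.

```python
import re

# Hypothetical list of courtesy words (Italian and English); not taken from the data.
COURTESY_WORDS = {'buongiorno', 'buonasera', 'salve', 'grazie', 'saluti',
                  'cordiali', 'hello', 'thanks', 'regards'}


def strip_courtesy_words(text, extra_stop_words=COURTESY_WORDS):
    """Lowercase the text, split it into word tokens and drop courtesy words."""
    tokens = re.findall(r'\w+', text.lower())
    return ' '.join(t for t in tokens if t not in extra_stop_words)


example = 'Buongiorno, il report mensile non si apre. Grazie, cordiali saluti'
cleaned = strip_courtesy_words(example)
print(cleaned)
```

Only the words that describe the actual problem are left, so topics are less likely to form around greetings and sign-offs.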