# IT Tickets Classification Project

The goal of this project is to use NLP to analyze IT tickets and understand how to route them so that they are resolved correctly.

The first advantage of such a project is to pre-sort tickets for the IT team that handles them, speeding up their work; a second and more ambitious goal is to create a process that automatically assigns tickets to the correct workgroups.

In [1]:
import requests
import json
import pandas as pd

## Data Collection

[JIRA](https://www.atlassian.com/software/jira) is a tool for managing request workflows; it provides a web interface and a [REST API](https://developer.atlassian.com/cloud/jira/platform/rest/) that returns ticket data in response to a `GET` request.

I'm going to collect some basic data about the tickets created during 2017:
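As a minimal sketch of what such a `GET` request looks like, the snippet below builds a search URL with the standard library only. The host `jira.example.com` and the helper name `build_search_url` are hypothetical; `jql`, `startAt` and `maxResults` are query parameters of the JIRA search endpoint. Encoding the query explicitly avoids problems with the spaces and quotes that JQL strings contain.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical host, used only for illustration
BASE_URL = 'https://jira.example.com'


def build_search_url(jql, start_at=0, max_results=50):
    '''Return the URL of a search request for the given JQL query.'''
    query = urlencode({'jql': jql, 'startAt': start_at, 'maxResults': max_results})
    return '{}/rest/api/2/search?{}'.format(BASE_URL, query)


url = build_search_url('project = "ISA HELP DESK" and createdDate >= startOfDay(-1)')
print(url)

# Decoding the query string gives back the original parameters
decoded = parse_qs(urlsplit(url).query)
print(decoded['jql'][0])
```

The resulting URL can be passed to `requests.get` together with the credentials.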

- ticket key
- date of the request and name of the applicant
- issue type
- ticket's summary and description
- the ticket's type of resolution and/or workgroup assignment, if any

To gather all this data I'm creating three functions:
- one that reads data from a single ticket
- one that loops through a single day of tickets (taking into account that the API returns at most 50 issues per request)
- one that loops through all the days in a timespan
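The paging logic needed for the second function can be sketched independently of JIRA. In the example below, `fetch_all` and `fake_fetch_page` are hypothetical names, and the fake function stands in for the real search request: it returns a dictionary with the same `total` and `issues` keys used in this notebook.

```python
def fetch_all(fetch_page, page_size=50):
    '''Collect every issue by requesting one page at a time.'''
    issues = []
    start_at = 0
    while True:
        page = fetch_page(start_at)
        issues.extend(page['issues'])
        start_at += page_size
        # stop when every issue has been requested or a page comes back empty
        if start_at >= page['total'] or not page['issues']:
            break
    return issues


# Hypothetical stand-in for the JIRA search endpoint: 120 fake issues
FAKE_ISSUES = [{'key': 'DEMO-{}'.format(n)} for n in range(120)]
requested_offsets = []


def fake_fetch_page(start_at, page_size=50):
    requested_offsets.append(start_at)
    return {'total': len(FAKE_ISSUES),
            'issues': FAKE_ISSUES[start_at:start_at + page_size]}


all_issues = fetch_all(fake_fetch_page)
print(len(all_issues), requested_offsets)
```

With 120 issues and pages of 50, three requests are needed, starting at offsets 0, 50 and 100.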

In [2]:
def get_issue_data(issue):
    '''
    Given the JSON of an issue it extracts its features
    '''
    fields = issue['fields']
    key = issue['key']
    creation_date = fields['created']
    creator = fields['creator']['displayName']
    description = fields['description']
    issue_type = fields['issuetype']['name']
    reporter = fields['reporter']['displayName']
    summary = fields['summary']
    # customfield_15600 (stored as the ticket's solution) can be missing or null
    solution = None
    if fields.get('customfield_15600') is not None:
        solution = fields['customfield_15600']['value']
    # keys of the issues linked through an outward "Relates" link
    rel_issue = []
    for sub_issue in fields['issuelinks']:
        if sub_issue['type']['name'] == 'Relates' and 'outwardIssue' in sub_issue:
            rel_issue.append(sub_issue['outwardIssue']['key'])
    return [key, creation_date, creator, description, issue_type, reporter, summary, solution, rel_issue]

In [3]:
def get_issues(day):
    '''
    Extracts all issues created in one day. day has to be a non-positive integer indicating how many
    days in the past you are interested in (0 = today, -1 = yesterday). Issues are requested in pages
    of 50 because JIRA returns at most 50 issues per request.
    '''
    url = 'http://iaasjira01.spvita.sanpaoloimiwm.local:8080/rest/api/2/search/'
    jql = 'project = "ISA HELP DESK" and createdDate >= startOfDay({}) and createdDate <= endOfDay({})'.format(day, day)
    auth = ('alessandro.diantonio', 'password')  # placeholder credentials
    issues = []
    start_at = 0
    while True:
        # passing the query through params lets requests URL-encode the JQL string
        response = requests.get(url, params={'jql': jql, 'startAt': start_at}, auth=auth)
        response.raise_for_status()
        response_data = response.json()
        for issue in response_data['issues']:
            issues.append(get_issue_data(issue))
        start_at += 50
        # 'total' is the number of issues matching the query, not the size of the page
        if start_at >= response_data['total']:
            break
    return issues

In [4]:
def get_all_issues(start_day, end_day):
    '''
    Gets all issues in a timespan (start_day and end_day have to be non-positive integers, with start_day <= end_day).
    Days are visited backwards, from end_day (included) down to start_day (excluded).
    '''
    all_issues = []
    for day in range(end_day, start_day, -1):
        all_issues.extend(get_issues(day))
        # progress indicator: print every tenth day
        if day % 10 == 0:
            print(day)
    return all_issues

In [5]:
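# day offsets are relative to the date on which this cell is run
# -419 = 02/01/2017 (excluded by the range in get_all_issues, so the first day collected is 03/01/2017)
# -57 = 30/12/2017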
issues = get_all_issues(-419, -57)

-60
-70
-80
-90
-100
-110
-120
-130
-140
-150
-160
-170
-180
-190
-200
-210
-220
-230
-240
-250
-260
-270
-280
-290
-300
-310
-320
-330
-340
-350
-360
-370
-380
-390
-400
-410
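The day offsets passed to `get_all_issues` do not have to be counted by hand. As a small sketch, the hypothetical helper `day_offset` below computes them from calendar dates. The run date of 25 February 2018 is an assumption, inferred from the two comments in the cell above (it is the only date for which -419 and -57 map to 02/01/2017 and 30/12/2017).

```python
from datetime import date


def day_offset(target, today=None):
    '''Return the non-positive number of days from today back to target.'''
    today = today or date.today()
    return (target - today).days


# Assumed run date, inferred from the offsets in the comments above
run_date = date(2018, 2, 25)

start_day = day_offset(date(2017, 1, 2), run_date)
end_day = day_offset(date(2017, 12, 30), run_date)
print(start_day, end_day)
```

Because the offsets depend on the day the notebook is run, recomputing them this way keeps the collected timespan stable across runs.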


In [6]:
issues_df = pd.DataFrame(issues, columns=['key', 'creation_date', 'creator', 'description', 'issue_type', 'reporter', 'summary', 'solution', 'rel_issue'])

In [9]:
issues_df.loc[issues_df['reporter']=='Back Office'].tail()

| | key | creation_date | creator | description | issue_type | reporter | summary | solution | rel_issue |
|---|---|---|---|---|---|---|---|---|---|
| 20141 | ISAHD-11765 | 2017-01-03T13:13:31.000+0100 | Back Office | Ciao\nchiedo di generare la PRP per la polizza... | Modifica dati | Back Office | GENERARE PRP 104011318615 TARGA CD282RE POLET... | Riesecuzione procedura | [] |
| 20154 | ISAHD-11752 | 2017-01-03T11:46:06.000+0100 | Back Office | Ciao nel preventivo di sostituzione n° PRE0020... | Modifica dati | Back Office | VARIAZIONE DICITURA ASSICURATIVA CIGNETTI - MA... | Forzatura dati | [] |
| 20163 | ISAHD-11743 | 2017-01-03T10:45:28.000+0100 | Back Office | Ciao,\nper questa posizione 304012069489 01/0... | Modifica dati | Back Office | generare tasto paga su psp | Riesecuzione procedura | [] |
| 20166 | ISAHD-11740 | 2017-01-03T10:27:20.000+0100 | Back Office | GENERARE LA PRP PER IL CONTRATTO 304012079294 ... | HD - segnalazione anomalie | Back Office | GENERARE PRP 304012079294 - COSTA S | Riesecuzione procedura | [] |
| 20172 | ISAHD-11734 | 2017-01-03T09:50:04.000+0100 | Back Office | Buongiorno,\nvi chiedo di emettere un preventi... | Modifica dati | Back Office | GENERARE PRP 734015167154 DY526PF DE LAURENTII... | Riesecuzione procedura | [] |


In [11]:
issues_df.to_pickle('../data/raw/issues.pkl')

## Next Notebooks

- [Data Cleaning and EDA](1-Data Cleaning and EDA.ipynb)
- [Document-Term Matrix](2-Document-Term Matrix.ipynb)
- [Topic Modeling](3-Topic Modeling.ipynb)
- [Random Forest Prediction](4-Model.ipynb)