In [1]:
%pip install pandas lightgbm sklearn numpy shap

Looking in indexes: https://pypi.org/simple, https://nexus.ccl/nexus/repository/pypi-hosted/simple
You should consider upgrading via the '/usr/local/bin/python3 -m pip install --upgrade pip' command.[0m
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
from datetime import datetime, date
import lightgbm as lgb
from sklearn.metrics import mean_squared_error, average_precision_score
from collections import Counter, defaultdict
import numpy as np  # need numpy 1.21
import shap


pd.set_option('display.max_columns', None)

In [3]:
# load data, look a the values
df = pd.read_csv("data.csv", parse_dates=False)
df

Unnamed: 0.1,Unnamed: 0,ID,ID_status,active,count_reassign,count_opening,count_updated,ID_caller,opened_by,opened_time,Created_by,created_at,updated_by,updated_at,type_contact,location,category_ID,user_symptom,Support_group,support_incharge,Doc_knowledge,confirmation_check,impact,notify,problem_ID,change_request
0,1,INC0000045,New,True,0,0,0,Caller 2403,Opened by 8,29-02-2016 01:16,Created by 6,29-02-2016 01:23,Updated by 21,29-02-2016 01:23,Phone,Location 143,Category 55,Symptom 72,Group 56,?,True,False,2 - Medium,Do Not Notify,?,?
1,3,INC0000045,Resolved,True,0,0,3,Caller 2403,Opened by 8,29-02-2016 01:16,Created by 6,29-02-2016 01:23,Updated by 804,29-02-2016 11:29,Phone,Location 143,Category 55,Symptom 72,Group 56,?,True,False,2 - Medium,Do Not Notify,?,?
2,4,INC0000045,Closed,False,0,0,4,Caller 2403,Opened by 8,29-02-2016 01:16,Created by 6,29-02-2016 01:23,Updated by 908,05-03-2016 12:00,Phone,Location 143,Category 55,Symptom 72,Group 56,?,True,False,2 - Medium,Do Not Notify,?,?
3,6,INC0000047,Active,True,1,0,1,Caller 2403,Opened by 397,29-02-2016 04:40,Created by 171,29-02-2016 04:57,Updated by 21,29-02-2016 05:30,Phone,Location 165,Category 40,Symptom 471,Group 24,Resolver 31,True,False,2 - Medium,Do Not Notify,?,?
4,7,INC0000047,Active,True,1,0,2,Caller 2403,Opened by 397,29-02-2016 04:40,Created by 171,29-02-2016 04:57,Updated by 21,29-02-2016 05:33,Phone,Location 165,Category 40,Symptom 471,Group 24,Resolver 31,True,False,2 - Medium,Do Not Notify,?,?
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99193,141707,INC0120835,Resolved,True,1,0,3,Caller 116,Opened by 12,16-02-2017 09:09,?,?,Updated by 27,16-02-2017 09:53,Email,Location 204,Category 42,Symptom 494,Group 31,Resolver 10,False,True,2 - Medium,Do Not Notify,?,?
99194,141708,INC0120835,Closed,False,1,0,4,Caller 116,Opened by 12,16-02-2017 09:09,?,?,Updated by 27,16-02-2017 09:53,Email,Location 204,Category 42,Symptom 494,Group 31,Resolver 10,False,True,2 - Medium,Do Not Notify,?,?
99195,141709,INC0121064,Active,True,0,0,0,Caller 116,Opened by 12,16-02-2017 14:17,?,?,Updated by 908,16-02-2017 14:17,Email,Location 204,Category 42,Symptom 494,Group 70,Resolver 10,False,False,2 - Medium,Do Not Notify,?,?
99196,141710,INC0121064,Active,True,1,0,1,Caller 116,Opened by 12,16-02-2017 14:17,?,?,Updated by 60,16-02-2017 15:20,Email,Location 204,Category 42,Symptom 494,Group 31,?,False,False,2 - Medium,Do Not Notify,?,?


In [4]:
# Some exploratory data analysis
print(f"Unique impact values with counts:\n{df['impact'].value_counts()}")
print()

print(f"Unique ID_status values with counts:\n{df['ID_status'].value_counts()}")
print()

Unique impact values with counts:
2 - Medium    94034
3 - Low        2720
1 - High       2444
Name: impact, dtype: int64

Unique ID_status values with counts:
Active                27075
New                   25515
Resolved              18158
Closed                17387
Awaiting User Info    10235
Awaiting Vendor         493
Awaiting Problem        307
Awaiting Evidence        26
-100                      2
Name: ID_status, dtype: int64



In [5]:
# I have decided to predict the "impact" column, so that the IT teams can prioritize tasks with high impact.
# Here I am selecting columns which should be used as features for the prediction.
#
# I am not using all the feature columns.
# Data in some columns are provided after Incident is raised, so they will not be available when predicting.
# Also it doesn't make sense to me to use ID as model of the feature.
# 
# The chosen model, LightGBM supports categorical as well as numerical features and I have to encode these differently.
# Also I have decided to encode timestamp columns in a different way, hopefully improving the results.
# And I also have one special column, "previous_impact" which is computed differently than other columns.

numerical_columns = ["count_reassign", "count_opening", "count_updated"]
categorical_columns = ["ID_status", "active", "ID_caller", "opened_by", "Created_by", "updated_by",
                      "type_contact","location", "category_ID", "user_symptom"]
datetime_columns = ["opened_time", "created_at", "updated_at"]
special_columns = ["previous_impact"]

In [6]:
# I have decided to split the data by time, specifically by "updated_at" column.
# This is the safest option, which should prevent any information leakage between splits of datasets.
# (By splits I mean "train", "test" and "valid" parts of the input data.

# sort by 'updated_at' so that we don't have any data spillage
df["sort_index"] = df["updated_at"].apply(lambda x: datetime.strptime(x, '%d-%m-%Y %H:%M'))
df = df.sort_values("sort_index")  # sort by time so that we don't have any data spillage

# Add new column split with values "train", "valid" and "test". Train contains oldest data, test contains newest.
TRAIN_RATIO, VALID_RATIO = 0.8, 0.1
df["split"] = "test"  # add split column
train_end = int(len(df.index) * TRAIN_RATIO)  
valid_end = int(len(df.index) * (TRAIN_RATIO + VALID_RATIO))
df.iloc[0:train_end, df.columns.get_loc("split")] = "train"
df.iloc[train_end:valid_end, df.columns.get_loc("split")] = "valid"

print(df['split'].value_counts())

train    79358
test      9920
valid     9920
Name: split, dtype: int64


In [7]:
df

Unnamed: 0.1,Unnamed: 0,ID,ID_status,active,count_reassign,count_opening,count_updated,ID_caller,opened_by,opened_time,Created_by,created_at,updated_by,updated_at,type_contact,location,category_ID,user_symptom,Support_group,support_incharge,Doc_knowledge,confirmation_check,impact,notify,problem_ID,change_request,sort_index,split
0,1,INC0000045,New,True,0,0,0,Caller 2403,Opened by 8,29-02-2016 01:16,Created by 6,29-02-2016 01:23,Updated by 21,29-02-2016 01:23,Phone,Location 143,Category 55,Symptom 72,Group 56,?,True,False,2 - Medium,Do Not Notify,?,?,2016-02-29 01:23:00,train
3,6,INC0000047,Active,True,1,0,1,Caller 2403,Opened by 397,29-02-2016 04:40,Created by 171,29-02-2016 04:57,Updated by 21,29-02-2016 05:30,Phone,Location 165,Category 40,Symptom 471,Group 24,Resolver 31,True,False,2 - Medium,Do Not Notify,?,?,2016-02-29 05:30:00,train
4,7,INC0000047,Active,True,1,0,2,Caller 2403,Opened by 397,29-02-2016 04:40,Created by 171,29-02-2016 04:57,Updated by 21,29-02-2016 05:33,Phone,Location 165,Category 40,Symptom 471,Group 24,Resolver 31,True,False,2 - Medium,Do Not Notify,?,?,2016-02-29 05:33:00,train
9,14,INC0000057,New,True,0,0,0,Caller 4416,Opened by 8,29-02-2016 06:10,?,?,Updated by 21,29-02-2016 06:26,Phone,Location 204,Category 20,Symptom 471,Group 70,?,True,False,2 - Medium,Do Not Notify,?,?,2016-02-29 06:26:00,train
10,15,INC0000057,New,True,0,0,1,Caller 4416,Opened by 8,29-02-2016 06:10,?,?,Updated by 21,29-02-2016 06:38,Phone,Location 204,Category 20,Symptom 471,Group 70,?,True,False,2 - Medium,Do Not Notify,?,?,2016-02-29 06:38:00,train
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99196,141710,INC0121064,Active,True,1,0,1,Caller 116,Opened by 12,16-02-2017 14:17,?,?,Updated by 60,16-02-2017 15:20,Email,Location 204,Category 42,Symptom 494,Group 31,?,False,False,2 - Medium,Do Not Notify,?,?,2017-02-16 15:20:00,test
99197,141711,INC0121064,Resolved,True,1,0,2,Caller 116,Opened by 12,16-02-2017 14:17,?,?,Updated by 27,16-02-2017 16:38,Email,Location 204,Category 42,Symptom 494,Group 31,Resolver 10,False,True,2 - Medium,Do Not Notify,?,?,2017-02-16 16:38:00,test
99182,141695,INC0120304,Resolved,True,0,0,1,Caller 90,Opened by 8,15-02-2017 02:02,?,?,Updated by 21,17-02-2017 00:47,Email,Location 188,Category 52,Symptom 494,Group 64,Resolver 6,False,True,2 - Medium,Do Not Notify,?,?,2017-02-17 00:47:00,test
99183,141696,INC0120304,Closed,False,0,0,2,Caller 90,Opened by 8,15-02-2017 02:02,?,?,Updated by 21,17-02-2017 00:50,Email,Location 188,Category 52,Symptom 494,Group 64,Resolver 6,False,True,2 - Medium,Do Not Notify,?,?,2017-02-17 00:50:00,test


In [8]:
# Now I encode the data so that they are easily usable by the model.
# First, because I want to use Regression approach, I map "impact" column to integer values
#
# Second, I convert the categorical types and encode the timestamp columns.
# I have decided to encode timestamp into 2 separate columns:
#   1. Hour (integer 0-24)
#   2. Day of the week (Monday, Tuesday, ...)
# And I also have special value for missing the timestamp.
# Reasoning behind this is following:
#  1. Standard representation of time, Unix timestamp, will not allow the model to generalize.
#     This is the case, because if model was trained on some date, all the training timestamps will be lower then the training date,
#     but the model will likely be used to predict future data, with timestamps higher than in training.
#     So it will get only Out-Of-Distribution data during inference.
#  2. I thought that if someone creates an Incident for example at midnight or during weekend,
#     it is more likely it is important then those created during normal working hours. And I wanted the model to capture this.
#
# Third, I created a special column "previous_impact".
# If a single Incident was updated/reopened and we want to predict the new impact,
# it is quite likely it will be correlated with the previous impact.
# And we have the previous impact and can use it for prediction.


def prepare_data(df):
    # Encode impact as integer & convert categorical columns
    df["impact"] = df["impact"].map({'1 - High': 1, '2 - Medium': 2, '3 - Low': 3})
    for categorical_column in categorical_columns:
        df[categorical_column] = df[categorical_column].astype('category')

    # Encode timestamp columns as hour + day in week
    for datetime_column in datetime_columns:
        df[datetime_column + "_parsed"] = df[datetime_column].apply(lambda x: None if x == "?" else datetime.strptime(x, '%d-%m-%Y %H:%M'))
        df[datetime_column + "_hour"] = df[datetime_column + "_parsed"].apply(lambda x: "-1" if x == None else str(x.hour))
        df[datetime_column + "_hour"] = df[datetime_column + "_hour"].astype('category')
        
        df[datetime_column + "_weekday"] = df[datetime_column + "_parsed"].apply(lambda x: "-1" if x == None else str(x.weekday()))
        df[datetime_column + "_weekday"] = df[datetime_column + "_weekday"].astype('category')
        
        df.drop(datetime_column + "_parsed", axis=1)

    # Create a previous impact_column
    df["previous_impact"] = "unknown"
        
    # Group records by ID
    rows_by_id = defaultdict(lambda: [])
    for index, row in df.iterrows():
        rows_by_id[row["ID"]].append(row)

    for index, row in df.iterrows():
        updated_time = datetime.strptime(row["updated_at"], '%d-%m-%Y %H:%M')
        incident_id = row["ID"]
        
        # For each row find the previous record. This is a record which:
        # - is older than the current one
        # - is newest between all the records with the same ID, that are older than the current record
        previous_impact = "unknown"  # If no previous record is found, use this value
        previous_impact_timestamp = datetime(1970, 1, 1, 0, 0, 0)
        
        for potential_previous_row in rows_by_id[incident_id]:
            potential_previous_time = datetime.strptime(potential_previous_row["updated_at"], '%d-%m-%Y %H:%M')
            if updated_time > potential_previous_time and potential_previous_time > previous_impact_timestamp:
                previous_impact = potential_previous_row["impact"]
                previous_impact_timestamp = potential_previous_time

        df.at[index, "previous_impact"] = previous_impact

    df["previous_impact"] = df["previous_impact"].astype('category')
    return df

df = prepare_data(df) 

In [9]:
df

Unnamed: 0.1,Unnamed: 0,ID,ID_status,active,count_reassign,count_opening,count_updated,ID_caller,opened_by,opened_time,Created_by,created_at,updated_by,updated_at,type_contact,location,category_ID,user_symptom,Support_group,support_incharge,Doc_knowledge,confirmation_check,impact,notify,problem_ID,change_request,sort_index,split,opened_time_parsed,opened_time_hour,opened_time_weekday,created_at_parsed,created_at_hour,created_at_weekday,updated_at_parsed,updated_at_hour,updated_at_weekday,previous_impact
0,1,INC0000045,New,True,0,0,0,Caller 2403,Opened by 8,29-02-2016 01:16,Created by 6,29-02-2016 01:23,Updated by 21,29-02-2016 01:23,Phone,Location 143,Category 55,Symptom 72,Group 56,?,True,False,2,Do Not Notify,?,?,2016-02-29 01:23:00,train,2016-02-29 01:16:00,1,0,2016-02-29 01:23:00,1,0,2016-02-29 01:23:00,1,0,unknown
3,6,INC0000047,Active,True,1,0,1,Caller 2403,Opened by 397,29-02-2016 04:40,Created by 171,29-02-2016 04:57,Updated by 21,29-02-2016 05:30,Phone,Location 165,Category 40,Symptom 471,Group 24,Resolver 31,True,False,2,Do Not Notify,?,?,2016-02-29 05:30:00,train,2016-02-29 04:40:00,4,0,2016-02-29 04:57:00,4,0,2016-02-29 05:30:00,5,0,unknown
4,7,INC0000047,Active,True,1,0,2,Caller 2403,Opened by 397,29-02-2016 04:40,Created by 171,29-02-2016 04:57,Updated by 21,29-02-2016 05:33,Phone,Location 165,Category 40,Symptom 471,Group 24,Resolver 31,True,False,2,Do Not Notify,?,?,2016-02-29 05:33:00,train,2016-02-29 04:40:00,4,0,2016-02-29 04:57:00,4,0,2016-02-29 05:33:00,5,0,2
9,14,INC0000057,New,True,0,0,0,Caller 4416,Opened by 8,29-02-2016 06:10,?,?,Updated by 21,29-02-2016 06:26,Phone,Location 204,Category 20,Symptom 471,Group 70,?,True,False,2,Do Not Notify,?,?,2016-02-29 06:26:00,train,2016-02-29 06:10:00,6,0,NaT,,,2016-02-29 06:26:00,6,0,unknown
10,15,INC0000057,New,True,0,0,1,Caller 4416,Opened by 8,29-02-2016 06:10,?,?,Updated by 21,29-02-2016 06:38,Phone,Location 204,Category 20,Symptom 471,Group 70,?,True,False,2,Do Not Notify,?,?,2016-02-29 06:38:00,train,2016-02-29 06:10:00,6,0,NaT,,,2016-02-29 06:38:00,6,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99196,141710,INC0121064,Active,True,1,0,1,Caller 116,Opened by 12,16-02-2017 14:17,?,?,Updated by 60,16-02-2017 15:20,Email,Location 204,Category 42,Symptom 494,Group 31,?,False,False,2,Do Not Notify,?,?,2017-02-16 15:20:00,test,2017-02-16 14:17:00,14,3,NaT,,,2017-02-16 15:20:00,15,3,2
99197,141711,INC0121064,Resolved,True,1,0,2,Caller 116,Opened by 12,16-02-2017 14:17,?,?,Updated by 27,16-02-2017 16:38,Email,Location 204,Category 42,Symptom 494,Group 31,Resolver 10,False,True,2,Do Not Notify,?,?,2017-02-16 16:38:00,test,2017-02-16 14:17:00,14,3,NaT,,,2017-02-16 16:38:00,16,3,2
99182,141695,INC0120304,Resolved,True,0,0,1,Caller 90,Opened by 8,15-02-2017 02:02,?,?,Updated by 21,17-02-2017 00:47,Email,Location 188,Category 52,Symptom 494,Group 64,Resolver 6,False,True,2,Do Not Notify,?,?,2017-02-17 00:47:00,test,2017-02-15 02:02:00,2,2,NaT,,,2017-02-17 00:47:00,0,4,2
99183,141696,INC0120304,Closed,False,0,0,2,Caller 90,Opened by 8,15-02-2017 02:02,?,?,Updated by 21,17-02-2017 00:50,Email,Location 188,Category 52,Symptom 494,Group 64,Resolver 6,False,True,2,Do Not Notify,?,?,2017-02-17 00:50:00,test,2017-02-15 02:02:00,2,2,NaT,,,2017-02-17 00:50:00,0,4,2


In [10]:
# Create 3 datasets (train, test, valid) out of the single dataframe.
# The datasets are in LightGBM compatible format
def create_dataset(input_df, split, feature_columns, target_column="impact"):
    assert target_column not in feature_columns
    split_df = input_df.loc[input_df['split'] == split]
    print(f"Creating dataset '{split}' with {len(split_df.index)} examples")
    if split == "test":
        dataset = split_df[feature_columns], split_df[target_column]
    else:
        dataset = lgb.Dataset(split_df[feature_columns], split_df[target_column])
        
    return dataset

columns_to_use = numerical_columns + categorical_columns + special_columns
for datetime_column in datetime_columns:
    columns_to_use.append(datetime_column + "_hour")
    columns_to_use.append(datetime_column + "_weekday")

train = create_dataset(df, "train", columns_to_use)
valid = create_dataset(df, "valid", columns_to_use)
test_features, test_target = create_dataset(df, "test", columns_to_use)

Creating dataset 'train' with 79358 examples
Creating dataset 'valid' with 9920 examples
Creating dataset 'test' with 9920 examples


In [11]:
# I have decided to use LightGBM
# The reason is that we are dealing with tabular data of moderate size, 
# which is a perfect situation for Gradient Boosted Trees.
# And LightGBM is a my favorite Gradient Boosted Tree library :).

# We frame this problem as a regression. The advantages of this are 2:
# 1. It better fits the problem - impact is naturally a scale (number), not a class.
#    And if the model doesn't know what to predict, it will probably predict around the middle (2, Medium impact),
#    which is much more reasonable behaviour guessing some class randomly.
# 2. It will allow us to easily sort the tasks according to predict impact,
#    from the estimated highest priority (Impact) to the lowest

# Model hyperparameters
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 15,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# Train the model. 
# The LightGBM package automatically evaluates the model after addition of each tree using the validation set.
# The returned model is the best one according to validation.
gbm = lgb.train(params,
                train,
                num_boost_round=100,
                valid_sets=valid)

You can set `force_col_wise=true` to remove the overhead.
[1]	valid_0's l2: 0.0435263	valid_0's l1: 0.0486187
[2]	valid_0's l2: 0.0404992	valid_0's l1: 0.0469848
[3]	valid_0's l2: 0.0398952	valid_0's l1: 0.0474439
[4]	valid_0's l2: 0.0372424	valid_0's l1: 0.0459708
[5]	valid_0's l2: 0.0348476	valid_0's l1: 0.0445941
[6]	valid_0's l2: 0.0326821	valid_0's l1: 0.043338
[7]	valid_0's l2: 0.0322081	valid_0's l1: 0.0439174
[8]	valid_0's l2: 0.0303166	valid_0's l1: 0.0428147
[9]	valid_0's l2: 0.0286223	valid_0's l1: 0.0417718
[10]	valid_0's l2: 0.027079	valid_0's l1: 0.0407445
[11]	valid_0's l2: 0.0256961	valid_0's l1: 0.0397903
[12]	valid_0's l2: 0.0244482	valid_0's l1: 0.0388988
[13]	valid_0's l2: 0.0233204	valid_0's l1: 0.0380437
[14]	valid_0's l2: 0.0223065	valid_0's l1: 0.0373437
[15]	valid_0's l2: 0.0214244	valid_0's l1: 0.0365843
[16]	valid_0's l2: 0.020586	valid_0's l1: 0.035833
[17]	valid_0's l2: 0.0198751	valid_0's l1: 0.0351936
[18]	valid_0's l2: 0.0191804	valid_0's l1: 0.0344966
[

Overriding the parameters from Reference Dataset.
categorical_column in param dict is overridden.


[29]	valid_0's l2: 0.0151915	valid_0's l1: 0.03062
[30]	valid_0's l2: 0.0151596	valid_0's l1: 0.0309103
[31]	valid_0's l2: 0.0149657	valid_0's l1: 0.0306291
[32]	valid_0's l2: 0.0147794	valid_0's l1: 0.0303261
[33]	valid_0's l2: 0.0146235	valid_0's l1: 0.0301032
[34]	valid_0's l2: 0.0144844	valid_0's l1: 0.0298922
[35]	valid_0's l2: 0.014358	valid_0's l1: 0.0296908
[36]	valid_0's l2: 0.0142389	valid_0's l1: 0.0294672
[37]	valid_0's l2: 0.0141497	valid_0's l1: 0.0292623
[38]	valid_0's l2: 0.0140706	valid_0's l1: 0.0290477
[39]	valid_0's l2: 0.0139971	valid_0's l1: 0.0288366
[40]	valid_0's l2: 0.0139356	valid_0's l1: 0.0286546
[41]	valid_0's l2: 0.0138554	valid_0's l1: 0.0285217
[42]	valid_0's l2: 0.0138667	valid_0's l1: 0.0287101
[43]	valid_0's l2: 0.0137965	valid_0's l1: 0.02856
[44]	valid_0's l2: 0.0137286	valid_0's l1: 0.0284431
[45]	valid_0's l2: 0.013685	valid_0's l1: 0.0282891
[46]	valid_0's l2: 0.0137035	valid_0's l1: 0.0284394
[47]	valid_0's l2: 0.01366	valid_0's l1: 0.0283327
[

In [12]:
# Predict the test set using the model.
prediction = gbm.predict(test_features, num_iteration=gbm.best_iteration)
prediction = np.clip(prediction, 1, 3)  # Clip numbers smaller than 1 and higher than 3

In [13]:
# Implement baseline which uses following rules:
#  - if there is a previous record for same incident, use the previous impact (it is saved in the "previous_impact" column)
#  - otherwise, predict 2, Medium impact
baseline_prediction = []
for _, row in test_features.iterrows():
    if row["previous_impact"] != "unknown":
        baseline_prediction.append(row["previous_impact"])
    else:
        baseline_prediction.append(2)

In [16]:
# I have implemented multiple metrics, because I have not found a single one, that would be perfect.
def evaluate(prediction, test_target, model_name):
    print(f"\n\nEvaluating model {model_name}")

    # Standard classification accuracy.
    # Here the LightGBM model probably has little lower results than it could have,
    # because we just used thresholds of 1.5 and 2.5 (i.e. rounding) to convert the numerical prediction to a class.
    # With tuned thresholds the accuracy might be a little better.
    correct = 0
    for pred, target in zip(prediction, test_target):
        if round(pred) == int(target):
            correct += 1

    print(f"Accuracy: {correct / len(test_target)*100:.1f}%")
    
    # Standard mean squared error
    mse = mean_squared_error(prediction, test_target)
    print(f"Mean squared error: {mse:.3f}")
    
    # Since I imagine the model will be used for ranking (ordering) the Incidents, I wanted to measure how good the ordering is.
    # Ideally, if we sort the test examples by predicted score, first there will be the High priority ones,
    # followed by the Medium priority ones and the Low priority ones.
    # To evaluate this, I computed on the average position on which the High, Medium and Low examples are,
    # when they are sorted according to the prediction.
    sorted_targets = sorted(zip(prediction, test_target), key=lambda x: x[0])
    ranks = defaultdict(lambda: [])
    for rank, (pred, target) in enumerate(sorted_targets):
        ranks[target].append(rank)

    print(f"Priority 1 average rank: {round(np.mean(ranks[1]))}")
    print(f"Priority 2 average rank: {round(np.mean(ranks[2]))}")
    print(f"Priority 3 average rank: {round(np.mean(ranks[3]))}")
    
    counts = dict(Counter(test_target))
    print(f"Counts in test: {counts}")
    
    # Another "metric" about the ordering - if the IT will use the Incidents in the order given by the scores of the model
    # and they will be able to accomplish only e.g. 10% of the overall amount of Incidents, how many Incidents with
    # high priority will be solved?
    # Ideally, all the High priority tasks should be solved first.
    for ratio in [0.01, 0.1, 0.5]:
        top_part = sorted_targets[0:int(len(test_target) * ratio)]
        counts_solved = dict(Counter(target for pred, target in top_part))
        print(f"If IT Teams solve top {ratio * 100}% of tasks ordered by the model, they will solve approximately {counts_solved[1] / counts[1] * 100:.0f}% high priority tasks")
        print(counts_solved)

# Now we evaluate our model, baseline and an optimal prediction, just to put our numbers into context.
evaluate(prediction, test_target, "LightGBM")
evaluate(baseline_prediction, test_target, "Baseline")
evaluate(test_target, test_target, "Oracle")



Evaluating model LightGBM
Accuracy: 98.7%
Mean squared error: 0.012
Priority 1 average rank: 865
Priority 2 average rank: 4961
Priority 3 average rank: 9440
Counts in test: {2: 9450, 3: 223, 1: 247}
If IT Teams solve top 1.0% of tasks ordered by the model, they will solve approximately 39% high priority tasks
{1: 96, 2: 3}
If IT Teams solve top 10.0% of tasks ordered by the model, they will solve approximately 85% high priority tasks
{1: 210, 2: 780, 3: 2}
If IT Teams solve top 50.0% of tasks ordered by the model, they will solve approximately 94% high priority tasks
{1: 231, 2: 4720, 3: 9}


Evaluating model Baseline
Accuracy: 98.8%
Mean squared error: 0.012
Priority 1 average rank: 1247
Priority 2 average rank: 4977
Priority 3 average rank: 8334
Counts in test: {2: 9450, 3: 223, 1: 247}
If IT Teams solve top 1.0% of tasks ordered by the model, they will solve approximately 39% high priority tasks
{1: 97, 2: 2}
If IT Teams solve top 10.0% of tasks ordered by the model, they will sol

## Results discussion
We can see that both the accuracy and MSE is very good for both LightGBM as well as Baseline.
This doesn't mean that the models are perfect, it is caused by the imbalanced dataset, so it would be the best to not look at these metrics too much.

There are some differences between the LightGBM and Baseline in the ranking metrics,
e.g. Priority 1 average rank: 865 vs 1247. So the LightGBM model is better than the baseline in ranking. But when compared to the Oracle, we can see that the model is far from perfect and
it will be doing some mistakes.

In [17]:
exp = shap.TreeExplainer(gbm)
sv = exp.shap_values(test_features)

# We want importance per column without regard whether the impact of the value is negative or positive for a certain prediction.
result = np.sum(np.absolute(sv), axis=0)

for column_name, importance in sorted(zip(test_features.columns, result), key=lambda x: x[1]):
    print(f"Feature {column_name} has importance {importance:.4}")

Feature count_opening has importance 0.0
Feature type_contact has importance 0.0
Feature active has importance 0.1289
Feature updated_at_weekday has importance 0.6937
Feature created_at_weekday has importance 0.7405
Feature opened_time_weekday has importance 1.292
Feature count_reassign has importance 1.958
Feature created_at_hour has importance 2.511
Feature updated_at_hour has importance 2.861
Feature location has importance 3.462
Feature opened_time_hour has importance 6.044
Feature Created_by has importance 6.631
Feature count_updated has importance 7.021
Feature ID_status has importance 12.04
Feature category_ID has importance 15.24
Feature user_symptom has importance 22.19
Feature updated_by has importance 26.97
Feature ID_caller has importance 90.87
Feature opened_by has importance 112.4
Feature previous_impact has importance 391.6


### Feature importance discussion
We can see that the most important feature is *previous_impact* - this is not suprising, since the Baseline model, which is based on this feature, is quite good. Other important features are related to who opened and reported the incident. Also *category_ID* and *user_symptom* are somewhat important.

### PART 2 Answers



*Q1: What would be your next steps if IT team likes your PoC from part 1?*

I would probably start creating some plan for productionalization of the solution. Discussing with the right people,
where the model will be running, whether it will predict in batches or in streaming fashion, or how the integration
to the system IT Teams are using would look like. Also a meaningful question is, whether the Baseline wouldn't be enough,
since it is much easier to deploy.

Also I would do some more experiments, to find whether all the data will really be available in Production,
how often the model should be retrained and also I would think about some online metrics 
(e.g. how often model predicts some Impact, but the IT Teams change it something else). 



*Q2: IT team would like to understand if they can utilize ML/AI for other processes. Think about use-cases and problems that could be solved by ML capabilities, list your ideas. (Base your ideas on provided dataset).*

Yes, it would be possible to predict the time needed for closing an Incident, so that they can first solve the fast ones.

Also it might be possible to predict which group/resolver should be assigned for a specific incident,
so that reassignment does not happen often.



*Q3: How would you productionize suggested solution?*

It depends a lot on the existing infrastructure and how the model should be used. One possibility is
creating a server with REST API in a Docker container, which then can be run on e.g. Kubernetes and predict in real time.

In this case I would make sure the data preparation is the same as in my experiments. I would probably abandon pandas and
rewrite it in pure Python, because if we are predicting in real time, we are getting a single example instead 
of a whole dataframe. Also I would add some integration tests for checking whether the model is really behaving
the way it behaved during experiments.

Also I would add some online metric, so that it is possible to detect some degradation in performance.