# Using Machine Learning to Assign Changes in a Request Management System


The chosen dataset comes from the Change Request system of a fictitious cloud company and contains all of their change data from the previous year. Using this data, we will train and test a machine learning model with several classification algorithms; the model can later be used through a prediction API to assign live changes to technical teams. The dataset is limited: for each change we only have the Change ID, Requestor, Client, Location, and Classification, plus the Team that implemented it. We will aim for an accuracy above 80%.

Note: All names in this data are fictitious and were generated using the name generator at https://www.name-generator.org.uk/


### Importing Historical Change Data from the Previous Year

In [15]:
import pandas as pd               # pandas is a dataframe library
import matplotlib.pyplot as plt   # matplotlib.pyplot plots data
import numpy as np                # numpy provides N-dimensional object support

df = pd.read_csv("ChangeHistoryData.csv")   # This is our fictitious change history data; df stands for dataframe

In [16]:
df.shape   # returns (rows, columns); here 387 rows and 6 columns of data

(387, 6)

In [17]:
df.head()   # show first 5 rows

Unnamed: 0,ChangeID,Requestor,Classification,Client,Location,Team
0,31001,Lavinia Marshall,Configuration,,Toronto,Development Team
1,31002,Samirah Calvert,Hardware,All,Regina,Network Team
2,31003,Marisa Parra,Automation,Client D,Vancouver,Data Team
3,31004,Saniya Steadman,Application,All,Toronto,Development Team
4,31005,Mikey Beaumont,Automation,,Toronto,Automation Team


In [18]:
df.tail()    # show last 5 rows

Unnamed: 0,ChangeID,Requestor,Classification,Client,Location,Team
382,31408,Murtaza Finch,Application,Client D,Toronto,Development Team
383,31409,Rex Swift,Security,All,Toronto,Production Team
384,31410,Rex Swift,Security,All,Toronto,Production Team
385,31411,Rex Swift,Security,All,Toronto,Production Team
386,31412,Ailish Kumar,Automation,,Vancouver,Data Team


### Inspecting and getting familiar with the data to make sure it's good data

In [19]:
df.isnull().values.any()    # will return True if there are any cells with a null value

False

In [20]:
df.groupby(["Requestor"])['Team'].count()  # Here we can group the data by requestor

Requestor
Ailish Kumar          31
Ajwa Kim               2
Alejandro Grainger     6
Angus Bowler          12
Arielle Olson         15
Arwel Erickson         5
Ava-Rose Marsh         5
Bogdan Jenkins         1
Borys Battle           1
Calista Reese          4
Derren Legge           1
Efa Nichols            1
Emanuel Tomlinson      1
Fahima Guzman          2
Grayson Parkinson     10
Hakim Hurley           1
Harriet Marquez        1
Izzie Heath            7
Jai Carson             1
Jarod Novak            2
Jolie Holden          33
Jonathan Nash          1
Kaeden Justice         1
Kairo Mclean          20
Kameron Mac            1
Kamile Turner          9
Karan Douglas          4
Katelin Cabrera        3
Kennedy Cameron       22
Kirstin Schneider      1
Kurtis Levy           10
Lavinia Marshall       1
Lewie Peterson         6
Marisa Parra          32
Maximus Ahmad          2
Mikey Beaumont         2
Murtaza Finch         25
Naima Hutchings        1
Neive Casey            1
Nieve Winter   

In [21]:
df.groupby(["Location"])['Team'].count()  # By Location

Location
Edmonton      19
Halifax        2
Montreal       1
Regina         2
Toronto      270
Vancouver     93
Name: Team, dtype: int64

In [22]:
df.groupby(["Client"])['Team'].count()  # By Client

Client
All         133
Client A      8
Client C      2
Client D     45
Client E      8
Client F     10
Client G      2
None        179
Name: Team, dtype: int64

In [23]:
df.groupby(["Classification"])['Team'].count()  # By Classification or Type of change

Classification
Application      159
Automation        95
Configuration     24
Hardware          30
Network            5
Project           27
Security          29
Software          18
Name: Team, dtype: int64

In [24]:
df.groupby(["Team"])['Team'].count()  # By Team that implemented the Change, this is what we hope to predict

Team
Automation Team                  1
Data Team                       93
Database Team                   56
Development Team               106
Linux Team                      45
Network Team                     8
Production Team                 60
Project Management Team          2
Technical Architecture Team     10
Web Services Team                2
Windows Team                     4
Name: Team, dtype: int64

### Shaping the Data and removing any unnecessary columns

In [25]:
del df['ChangeID']  # The ChangeID field is of no real value when making predictions so we will remove it


In [26]:
df.head()  # You can see here the ChangeID column is now gone

Unnamed: 0,Requestor,Classification,Client,Location,Team
0,Lavinia Marshall,Configuration,,Toronto,Development Team
1,Samirah Calvert,Hardware,All,Regina,Network Team
2,Marisa Parra,Automation,Client D,Vancouver,Data Team
3,Saniya Steadman,Application,All,Toronto,Development Team
4,Mikey Beaumont,Automation,,Toronto,Automation Team


## Preparing and Cleaning the Data

This is usually the most time-consuming step. In many cases you aren't lucky enough to get "good" data and will end up with rows that are missing values or with bad characters caused by the file's encoding. Since we are focusing on building and training the model, the data here has been prepared for you in advance and is relatively good.

We have to convert this data into a format the machine learning algorithm can work with, which in this case means numbers. We can do this by mapping the values in each column to a number using a map file. I tend to use map files so that pairings such as name-to-number stay consistent when you later need to convert live data to numbers for your prediction API.

###### Note: This is why I grouped the data above: it makes it easy to copy and paste those values into a JSON file to build the map files I have provided for you.
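As an alternative to copying values by hand, a map can also be generated from a column. This is only a minimal sketch on a small stand-in series: the helper name `build_map` and the numbering scheme (sorted order, starting at 1) are assumptions for illustration, not necessarily how the provided map files were built.

```python
import json
import pandas as pd

def build_map(column):
    """Hypothetical helper: map each distinct value to an integer, in sorted order, starting at 1."""
    return {value: number for number, value in enumerate(sorted(column.unique()), start=1)}

# Stand-in data; in the notebook this would be a column such as df["Location"]
locations = pd.Series(["Toronto", "Regina", "Vancouver", "Toronto"])
location_map_example = build_map(locations)

# Serialize to JSON text; writing this text to a .json file gives a reusable map file
print(json.dumps(location_map_example, indent=2))
```

Saving the result once and loading it everywhere else keeps the pairing fixed, which is the point of using map files in the first place.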

In [13]:
import json  # here we import all the map files into dictionary variables
with open('team_map.json', 'r', encoding='utf-8') as file1:
    team_map = json.load(file1)
    
with open('classification_map.json', 'r') as file2:
    classification_map = json.load(file2)
   
with open('requestor_map.json', 'r', encoding='utf-8') as file3:
    requestor_map = json.load(file3)

with open('location_map.json', 'r') as file4:
    location_map = json.load(file4)
  
with open('client_map.json', 'r', encoding='utf-8') as file5:
    client_map = json.load(file5)
    
df['Team'] = df['Team'].map(team_map)   # Now use pandas to iterate through each column and map the values to numbers
df['Requestor'] = df['Requestor'].map(requestor_map) 
df['Classification'] = df['Classification'].map(classification_map)
df['Client'] = df['Client'].map(client_map)
df['Location'] = df['Location'].map(location_map)

In [14]:
df.head() # These values have to be specific because we are going to use them to match the teams later on in our prediction API

Unnamed: 0,Requestor,Classification,Client,Location,Team
0,32,3,,5,4
1,45,4,1.0,4,6
2,34,2,4.0,6,2
3,46,1,1.0,5,4
4,36,2,,5,1


Here you can see we clearly messed up somewhere because some of the values for the Client column are showing up as NaN or null. Checking our client_map.json file we can see that we are missing an entry for the "None" or no client. I have provided a corrected map file for this called client_fixed_map.json. 
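A quick way to catch this kind of gap before mapping is to compare the distinct values in a column against the keys of its map. The sketch below uses stand-in data and a deliberately incomplete map (the variable names are illustrative, not from the notebook):

```python
import pandas as pd

# Stand-in data and an incomplete map (no entry for "None"), mirroring the problem above
clients = pd.Series(["None", "All", "Client D", "All", "None"])
incomplete_map = {"All": 1, "Client D": 4}

# Values present in the column but absent from the map become NaN after .map()
missing = sorted(set(clients.unique()) - set(incomplete_map))
print("Values with no map entry:", missing)

mapped = clients.map(incomplete_map)
print("NaN count after mapping:", int(mapped.isnull().sum()))
```

An empty `missing` list means every value in the column has a number assigned.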

In [27]:
with open('client_fixed_map.json', 'r', encoding='utf-8') as file6:
    client_map = json.load(file6)

Now we can't just remap the values in the Client column here, because the data has already been converted to numbers above; mapping the converted columns a second time would turn every value into NaN. We have to go all the way back to the top and rerun Cells 1-12. Don't forget to do this or you will get errors. Make sure not to run Cell 13, where we did the first mapping. We will redo that code below using the corrected map file for Client. Normally this is not necessary and you would just change the code above, but I am trying to illustrate a point here.
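One way to avoid rerunning the notebook in a situation like this is to keep the unmapped data untouched and write the numbers into a copy. This is a sketch on stand-in data, not the approach the notebook itself takes; the names `raw`, `encoded`, `first_map`, and `fixed_map` are illustrative.

```python
import pandas as pd

raw = pd.DataFrame({"Client": ["None", "All", "Client D"]})  # stand-in for the unmapped data
first_map = {"All": 1, "Client D": 4}                         # missing an entry for "None"
fixed_map = {"All": 1, "Client D": 4, "None": 8}              # corrected map

encoded = raw.copy()
encoded["Client"] = raw["Client"].map(first_map)   # first attempt leaves a NaN

# Because raw still holds the original strings, the mapping can simply be redone
encoded["Client"] = raw["Client"].map(fixed_map)
print(encoded["Client"].tolist())
```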

In [28]:
import json  # here we import all the map files into dictionary variables
with open('team_map.json', 'r', encoding='utf-8') as file1:
    team_map = json.load(file1)
    
with open('classification_map.json', 'r') as file2:
    classification_map = json.load(file2)
   
with open('requestor_map.json', 'r', encoding='utf-8') as file3:
    requestor_map = json.load(file3)

with open('location_map.json', 'r') as file4:
    location_map = json.load(file4)
  
with open('client_fixed_map.json', 'r', encoding='utf-8') as file5:  # Using the fixed client map file.
    client_map = json.load(file5)
    
df['Team'] = df['Team'].map(team_map)   # Now use pandas to iterate through each column and map the values to numbers
df['Requestor'] = df['Requestor'].map(requestor_map) 
df['Classification'] = df['Classification'].map(classification_map)
df['Client'] = df['Client'].map(client_map)
df['Location'] = df['Location'].map(location_map)

In [29]:
df.head(10)  # the data looks much better now!

Unnamed: 0,Requestor,Classification,Client,Location,Team
0,32,3,8,5,4
1,45,4,1,4,6
2,34,2,4,6,2
3,46,1,1,5,4
4,36,2,8,5,1
5,34,2,2,6,2
6,34,2,4,6,2
7,29,8,8,5,10
8,31,1,1,1,4
9,50,3,8,5,7


In [30]:
len(df) - len(df.dropna())  # counts rows that still contain NaN or null values; luckily we don't have any

0
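The same maps are what keep live data and predictions consistent later on. The sketch below shows the idea with small subsets of the maps, using only pairings visible in the outputs above; the live record and the predicted number are hypothetical stand-ins.

```python
# Illustrative subsets of the maps, using pairings visible in the outputs above
classification_map = {"Application": 1, "Automation": 2, "Configuration": 3, "Hardware": 4}
location_map = {"Regina": 4, "Toronto": 5, "Vancouver": 6}
team_map = {"Automation Team": 1, "Data Team": 2, "Development Team": 4, "Network Team": 6}

# Hypothetical live record that needs converting to numbers
live_record = {"Classification": "Hardware", "Location": "Regina"}
encoded_record = [
    classification_map[live_record["Classification"]],
    location_map[live_record["Location"]],
]
print(encoded_record)

# Invert the team map to turn a predicted number back into a team name
team_names = {number: name for name, number in team_map.items()}
predicted_number = 6  # stand-in for a model prediction
print(team_names[predicted_number])
```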

### Splitting our data into Training and Test data sets
We have to split our data into one set for training the model and a second set for testing its accuracy. The test set must not contain data used for training, because the trained model should not have seen the test data before; that is why we split the data.

In [31]:
from sklearn.model_selection import train_test_split  # importing from SciKit-Learn
train, test = train_test_split(df, test_size=0.3)  # Split into 70% training data and 30% for testing data
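Because the split is shuffled randomly, the rows in each set (and the accuracy figures further down) change from run to run. If repeatable results are wanted, `train_test_split` accepts a `random_state` argument. A small sketch on a stand-in frame; the seed value 42 is arbitrary:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with 10 rows; the notebook would pass df instead
data = pd.DataFrame({"Location": range(10), "Team": range(10)})

# random_state fixes the shuffle so the same rows land in each set on every run
train_a, test_a = train_test_split(data, test_size=0.3, random_state=42)
train_b, test_b = train_test_split(data, test_size=0.3, random_state=42)

print(len(train_a), len(test_a))             # sizes of the two sets
print(train_a.index.equals(train_b.index))   # identical split both times
```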

In [32]:
train.head()  # train group, note the row IDs are randomized now

Unnamed: 0,Requestor,Classification,Client,Location,Team
380,29,1,8,5,9
215,4,4,1,5,5
168,37,1,8,5,4
69,21,6,1,5,3
56,1,2,8,6,2


In [33]:
test.head()   # test group

Unnamed: 0,Requestor,Classification,Client,Location,Team
48,37,1,8,1,4
378,37,1,8,5,4
195,4,3,1,5,11
190,1,2,4,6,2
8,31,1,1,1,4


### Using a Decision Tree Classifier algorithm for our Model

In [34]:
from sklearn.tree import DecisionTreeClassifier  # import Decision Tree Classifier from SKLearn
classifier = DecisionTreeClassifier(max_leaf_nodes=10)  # create the classifier, limiting the number of leaf nodes; 10 seemed to work best
feature_training_columns = ["Requestor", "Classification", "Client", "Location"] # Here we are using all the columns except Team
classifier = classifier.fit(train[feature_training_columns], train["Team"])  # The Team Column is what we are trying to predict

In [35]:
classifier   # prints out description of decision tree classifier

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=10,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [36]:
classifier.feature_importances_  # order matches feature_training_columns; Location and Classification are the most important factors

array([0.19711903, 0.24994659, 0.02656151, 0.52637286])
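The array is easier to read when each value is paired with its column name. A minimal sketch that copies the values from the output above instead of refitting the model:

```python
import pandas as pd

feature_names = ["Requestor", "Classification", "Client", "Location"]
importances = [0.19711903, 0.24994659, 0.02656151, 0.52637286]  # copied from the output above

# Label each importance with its feature and sort from most to least important
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked)
```

In the notebook itself, `classifier.feature_importances_` and `feature_training_columns` could be passed in directly.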

In [37]:
from sklearn import tree  # using tree from sklearn to output a GraphViz .dot file
with open("DecisionTreeModel.dot", "w") as file:
    # write directly to the open file; the return value is not needed
    tree.export_graphviz(classifier,
                         feature_names=feature_training_columns, out_file=file)

<h3><center>Decision Tree GraphViz file</center></h3>
                   
![title](DecisionTreeModel.png)

### Testing Our Trained Model

In [38]:
predictions = classifier.predict(test[feature_training_columns]) # Use the trained model to make predictions

In [39]:
from sklearn.metrics import accuracy_score      # Checking the accuracy of our trained model
accuracy_score(test["Team"], predictions)  # The result varies between runs because the split is random; this run scored about 70%

0.7008547008547008

The accuracy is not bad considering how few columns we have to work with, but it falls short of our target. Let's try a different algorithm to see if we can achieve above 80% accuracy.
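For context, it helps to compare the score with the accuracy of always guessing the most common team. The sketch below works this out from the full-dataset counts in the Team grouping shown earlier; the figure for the test split alone will differ slightly.

```python
# Team counts copied from the groupby output earlier in the notebook
team_counts = {
    "Automation Team": 1, "Data Team": 93, "Database Team": 56,
    "Development Team": 106, "Linux Team": 45, "Network Team": 8,
    "Production Team": 60, "Project Management Team": 2,
    "Technical Architecture Team": 10, "Web Services Team": 2, "Windows Team": 4,
}

total = sum(team_counts.values())
majority_team = max(team_counts, key=team_counts.get)

# Accuracy of a "model" that always predicts the most common team
baseline = team_counts[majority_team] / total
print(majority_team, round(baseline, 3))
```

Always predicting the Development Team would be right about 27% of the time, so the decision tree is doing much better than a naive guess.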

### Using Random Forest Tree Algorithm

In [40]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100,  max_leaf_nodes=20)
clf

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=20,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [41]:
def checkAccuracy(clf):
    clf=clf.fit(train[feature_training_columns], train["Team"])
    predictions = clf.predict(test[feature_training_columns])
    return accuracy_score(test["Team"], predictions)

In [42]:
checkAccuracy(clf)

0.7264957264957265
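Since `checkAccuracy` scores a single random split, the number it returns moves around between runs. One common way to get a steadier estimate is k-fold cross-validation, sketched here on synthetic stand-in data from `make_classification` (not the change data); the fold count of 5 and the seed are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 rows, 4 features, 3 classes
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)

model = RandomForestClassifier(n_estimators=100, max_leaf_nodes=20, random_state=0)

# Train and score the model on 5 different train/test partitions
scores = cross_val_score(model, X, y, cv=5)
print(scores)
print(scores.mean())
```

With the change data, teams that have only one or two rows would need some care, because each fold needs examples to train on.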

### Using Gradient Boosted Tree Algorithm

In [43]:
from xgboost.sklearn import XGBClassifier
clf = XGBClassifier(learning_rate=.25)
clf

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.25, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [44]:
checkAccuracy(clf)

0.7863247863247863

#### Much better accuracy here, though still just short of our 80% target. We could do better with some hyper-parameter tuning, but we will leave that for another lesson.

#### In another lesson I will also show you how to serve this Machine Learning Model as a Prediction API that you can use on the live data set in RealData.csv.
