# Overview

Use this notebook when you have a model trained on a set of drops and want to predict on a drop that was not included in the model (e.g. an ongoing drop or a drop that has just finished).


# How to run this notebook

You will need the training dataset that was used to build the model.  
If you only have the model, run the scraper again on the same drops the model was trained on.  
- This should generate the dataset for you. Make sure to name it "training_dataset.csv"

You then need an assertion dataset.  
- This is the dataset that has the drop you want to assert on.  

Here is how you can generate the assertion dataset:  

1. Run the scraper on the same drops as the model plus the drop you want to assert on  
     
   - For example: if the model is drops 20.12 to 20.16 and you want to assert on 20.17 only, run the scraper against drops 20.12 to 20.17
    

2. The scraper will generate the following file: training_dataset.csv. Rename it to assertion_dataset.csv

3. Open assertion_dataset.csv and delete all rows that are not for the drop that you want to assert on.  
     
   - Following on the example above: delete all rows for the drops 20.12 to 20.16 and leave all rows belonging to drop 20.17


4. Save the file.

5. Copy the following files into the same workspace as this notebook (delete any existing copies in the workspace first)  
    - model.pickle
    - training_dataset.csv
    - assertion_dataset.csv
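
Steps 2 to 4 can also be done in pandas rather than by hand. The sketch below is one possible approach, assuming the `PS` column holds version strings such as `20.17.99` whose first two components identify the drop. `keep_drop` is a hypothetical helper name, and the small in-memory frame is a stand-in for the scraper output.

```python
import pandas as pd

def keep_drop(df, drop):
    """Return only the rows whose PS version belongs to the given drop.

    Assumes PS values look like '<drop>.<build>', e.g. '20.17.99'.
    """
    # The trailing dot stops '20.1' from also matching '20.17.x'.
    mask = df['PS'].astype(str).str.startswith(drop + '.')
    return df[mask].reset_index(drop=True)

# Stand-in for the scraper output covering drops 20.12 to 20.17.
scraped = pd.DataFrame({
    'PS': ['20.12.1', '20.12.2', '20.17.99', '20.17.100'],
    'DG Number': [47975, 47978, 50662, 50663],
    'obsoleted': [0, 0, 0, 0],
})

assertion_only = keep_drop(scraped, '20.17')
print(assertion_only)
```

On real data, the frame would come from `pd.read_csv` on the scraper output, and the filtered result would be written back with `to_csv` under the name assertion_dataset.csv.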


# Set-up

First, we'll do some basic set-up.
The `numpy` library provides numerical algorithms.
The `pandas` library provides tools for data manipulation, 
and is used extensively for data mining and machine learning with Python.
Here are some tutorials on `pandas`: https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

The code below is used to left align tables

In [2]:
%%html
<style>
table {float:left}
</style>

# Load the data

In [3]:
training_df = pd.read_csv('training_dataset.csv')
training_df

Unnamed: 0,PS,DG Number,ERICnetworkscopewidget_CXP9032438,ERICtopologybrowser_CXP9030753,ERICdomainproxypersistence_CXP9035433,ERICenmsgasrlforwarderdef_CXP9034691,ERICbulkconfigurationgui_CXP9034995,ERICvipstpnodemodelcommon_CXP9035070,ERICvdscnodemodelcommon_CXP9034432,ERICtransportcimnormalization_CXP9035508,...,SoftwareHardwareManager_Upgrade_5GNodes_MT - Ameya,PM_STATISTICAL_MSC - Eklavya,FM_Acceptance_Test - Quarks,FM_FMX - Dhruva,SSR Configaration Management- CreativeCoders,SoftwareHardwareManager_Upgrade_GSM_RFA250 - Royals,SoftwareHardwareManager_MINI-LINK - Scorpions,TCU02 CM Add/Sync node- Starks,AutoProvisioning - GreatWall,obsoleted
0,20.12.1,47975,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
1,20.12.1,47974,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
2,20.12.1,47973,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
3,20.12.1,47972,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
4,20.12.2,47978,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1735,20.17.99,50662,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
1736,20.17.99,50648,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
1737,20.17.100,50663,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,1,0
1738,20.17.100,50661,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,1,0


In [4]:
assertion_df = pd.read_csv('assertion_dataset.csv')
assertion_df

Unnamed: 0,PS,DG Number,ERICnetworkscopewidget_CXP9032438,ERICtopologybrowser_CXP9030753,ERICdomainproxypersistence_CXP9035433,ERICenmsgasrlforwarderdef_CXP9034691,ERICmediationmrsnodemodelcommon_CXP9036681,ERICbulkconfigurationgui_CXP9034995,ERICvipstpnodemodelcommon_CXP9035070,ERICvdscnodemodelcommon_CXP9034432,...,SoftwareHardwareManager_Upgrade_5GNodes_MT - Ameya,PM_STATISTICAL_MSC - Eklavya,FM_Acceptance_Test - Quarks,FM_FMX - Dhruva,SSR Configaration Management- CreativeCoders,SoftwareHardwareManager_Upgrade_GSM_RFA250 - Royals,SoftwareHardwareManager_MINI-LINK - Scorpions,TCU02 CM Add/Sync node- Starks,AutoProvisioning - GreatWall,obsoleted
0,21.01.2,50678,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
1,21.01.2,50683,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
2,21.01.4,50696,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,1,0
3,21.01.4,50690,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,1,0
4,21.01.4,50687,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110,21.01.48,50844,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,1,0
111,21.01.49,50846,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,0,1,1,0
112,21.01.49,50767,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,0,1,1,0
113,21.01.49,50822,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,0,1,1,0


# Clean up the assertion dataframe

We cannot simply run the model on the new assertion dataset as it is.  
The assertion dataset may contain features (columns in the CSV) that the model was not trained on.  
The number of columns in the dataframe must exactly match what the model expects.  
Therefore, we take the columns of the training dataframe and drop any column of the assertion dataset that is not among them.  

In [5]:
for column in assertion_df.columns:
    if column not in training_df.columns:
        del assertion_df[column]
assertion_df

Unnamed: 0,PS,DG Number,ERICnetworkscopewidget_CXP9032438,ERICtopologybrowser_CXP9030753,ERICdomainproxypersistence_CXP9035433,ERICenmsgasrlforwarderdef_CXP9034691,ERICbulkconfigurationgui_CXP9034995,ERICvipstpnodemodelcommon_CXP9035070,ERICvdscnodemodelcommon_CXP9034432,ERICtransportcimnormalization_CXP9035508,...,SoftwareHardwareManager_Upgrade_5GNodes_MT - Ameya,PM_STATISTICAL_MSC - Eklavya,FM_Acceptance_Test - Quarks,FM_FMX - Dhruva,SSR Configaration Management- CreativeCoders,SoftwareHardwareManager_Upgrade_GSM_RFA250 - Royals,SoftwareHardwareManager_MINI-LINK - Scorpions,TCU02 CM Add/Sync node- Starks,AutoProvisioning - GreatWall,obsoleted
0,21.01.2,50678,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
1,21.01.2,50683,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
2,21.01.4,50696,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,1,0
3,21.01.4,50690,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,1,0
4,21.01.4,50687,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110,21.01.48,50844,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,1,0
111,21.01.49,50846,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,0,1,1,0
112,21.01.49,50767,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,0,1,1,0
113,21.01.49,50822,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,0,1,1,0
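
The loop above only removes columns that the training data lacks. It does not add columns that exist in training but are missing from the assertion data, and it does not enforce column order. The sketch below shows one way to handle both with `DataFrame.reindex`, assuming that a missing feature can safely be filled with 0 (an assumption to check against how the scraper encodes features). The column names are shortened placeholders, not real package names.

```python
import pandas as pd

# Placeholder frames; the real ones would come from the two CSV files.
training_cols = pd.DataFrame(columns=['PS', 'pkg_a', 'pkg_b', 'obsoleted'])
assertion_example = pd.DataFrame({
    'PS': ['21.01.2'],
    'pkg_new': [1],      # not seen in training: will be dropped
    'pkg_b': [1],
    'obsoleted': [0],
})

# reindex keeps only the training columns, in training order,
# and creates any missing column filled with 0 (assumed default).
aligned = assertion_example.reindex(columns=training_cols.columns, fill_value=0)
print(aligned)
```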


# Make new Predictions

We are almost ready to begin making predictions.  
Let's drop the PS and DG Number columns, as the model does not use them. We keep the DG numbers aside so that we can label the results later.

In [6]:
dg_numbers = assertion_df['DG Number']
assertion_df = assertion_df.drop(columns=['PS','DG Number'])

Now let's move the answers for whether a DG was obsoleted or not into a separate Series called y_answers.  
We also drop that column from the original dataframe.  

In [7]:
y_answers = assertion_df['obsoleted']
assertion_df = assertion_df.drop(columns=['obsoleted'])


Now we'll use the model to predict whether or not any delivery groups should be obsoleted.  
We'll read the model and use it for prediction.

In [8]:
import pickle
with open('model.pickle', 'rb') as f:
    trained_estimator = pickle.load(f)

y_new_pred = trained_estimator.predict(assertion_df)
y_new_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0])

# Analyse results
If we took the results as they are, the model would pretty much say not to obsolete anything.  
However, we want to apply a confidence threshold so that content is only let through when the model is very sure.  
- This helps us avoid letting bad content through
- The higher the threshold, the more good content will be incorrectly obsoleted, but the less likely the model is to let bad content through

Let's get the probability of each prediction into an array called prob_predictions. The first value in each row is the probability that the DG should not be obsoleted.

In [9]:
prob_predictions = trained_estimator.predict_proba(assertion_df)

The code below will filter out the results based on the confidence threshold set.  
Feel free to play around with the threshold to fine tune it to best cater for what you want.  

In [15]:
final_predictions = []
obsolete = []
no_obsolete = []
confidence_threshold = 0.973

for item in prob_predictions:
    if item[0] >= confidence_threshold:
        final_predictions.append(0)
        no_obsolete.append(0)
    else:
        final_predictions.append(1)
        obsolete.append(1)

print('obsolete: ' + str(len(obsolete)))
print('no_obsolete: ' + str(len(no_obsolete)))
matrix = confusion_matrix(y_answers,final_predictions)   
matrix

obsolete: 104
no_obsolete: 11


array([[11, 94],
       [ 0, 10]])
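
To see how the threshold changes the outcome without editing the cell by hand, the same rule can be applied to several thresholds in a loop. This is a minimal sketch with NumPy. `apply_threshold` is a hypothetical helper, and the probabilities and labels below are made-up placeholders rather than the notebook's data.

```python
import numpy as np

def apply_threshold(prob_not_obsolete, threshold):
    """Return 0 (do not obsolete) where confidence >= threshold, else 1."""
    return np.where(np.asarray(prob_not_obsolete) >= threshold, 0, 1)

# Placeholder values: probability of class 0 and the true labels.
probs = np.array([0.9711, 0.9725, 0.9749, 0.9720, 0.9600])
truth = np.array([0, 0, 0, 0, 1])

for threshold in (0.96, 0.972, 0.973):
    preds = apply_threshold(probs, threshold)
    let_through = int((preds == 0).sum())
    # Bad content let through: truly obsolete (1) but predicted 0.
    missed = int(((preds == 0) & (truth == 1)).sum())
    print(f"threshold={threshold}: let through {let_through}, bad content missed {missed}")
```

In this notebook the first argument would be `prob_predictions[:, 0]`, the column holding the confidence that a DG should not be obsoleted.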

##### How to read the matrix above:
| | Predicted: do not obsolete (0) | Predicted: obsolete (1) |
| ------------- | ------------- | ------------- |
| Actual: do not obsolete (0) | Should not obsolete and did not obsolete | Should not obsolete but obsoleted |
| Actual: obsolete (1) | Should obsolete but did not obsolete | Should obsolete and obsoleted |
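
In code, the four cells can be unpacked into named variables, which makes later calculations easier to read. The sketch below rebuilds the matrix printed above as a NumPy array; the variable names are illustrative only.

```python
import numpy as np

# The matrix printed above: rows are actual labels, columns are predictions.
example_matrix = np.array([[11, 94],
                           [0, 10]])

# ravel() flattens row by row:
# actual 0 / predicted 0, actual 0 / predicted 1,
# actual 1 / predicted 0, actual 1 / predicted 1
kept_good, wrongly_obsoleted, bad_let_through, caught_bad = example_matrix.ravel()

print('good content let through:', kept_good)
print('good content obsoleted:  ', wrongly_obsoleted)
print('bad content let through: ', bad_let_through)
print('bad content obsoleted:   ', caught_bad)
```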

The output below lists each DG prediction and how confident the model was.  
Ideally, DGs that are not obsoleted should have a very high confidence and DGs that are obsoleted a very low one (the lower the confidence in not obsoleting, the more sure the model was about obsoleting).  
As you can see, our results aren't great, as almost every result is around 97.2%.  
This means the model is having a hard time telling which DGs are bad and which aren't.  
This contributes to the low accuracy we will see further down.  

In [11]:
for i in range(len(final_predictions)):
    if final_predictions[i] == 0:
        action = "Do not obsolete"
    else:
        action = "Obsolete"
    print(action + ": " + str(dg_numbers[i]) + ' - Confidence that should not be obsoleted: ' + ("%.2f" % (prob_predictions[i][0] * 100)) + '%')

Obsolete: 50678 - Confidence that should not be obsoleted: 97.11%
Obsolete: 50683 - Confidence that should not be obsoleted: 97.25%
Obsolete: 50696 - Confidence that should not be obsoleted: 97.22%
Obsolete: 50690 - Confidence that should not be obsoleted: 97.21%
Obsolete: 50687 - Confidence that should not be obsoleted: 97.20%
Obsolete: 50685 - Confidence that should not be obsoleted: 97.21%
Do not obsolete: 50703 - Confidence that should not be obsoleted: 97.49%
Obsolete: 50699 - Confidence that should not be obsoleted: 97.26%
Obsolete: 50705 - Confidence that should not be obsoleted: 97.20%
Obsolete: 50706 - Confidence that should not be obsoleted: 97.22%
Obsolete: 50710 - Confidence that should not be obsoleted: 97.26%
Obsolete: 50708 - Confidence that should not be obsoleted: 97.21%
Obsolete: 50711 - Confidence that should not be obsoleted: 97.20%
Obsolete: 50712 - Confidence that should not be obsoleted: 97.20%
Obsolete: 50713 - Confidence that should not be obsoleted: 97.20%
...
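
One quick way to quantify how tightly the confidences cluster is to look at their range. The sketch below uses only the fifteen values visible in the output above, so the numbers describe that sample and not the full set of 115 predictions.

```python
import numpy as np

# Confidence (in %) that each DG should not be obsoleted,
# copied from the fifteen complete lines of output above.
confidences = np.array([97.11, 97.25, 97.22, 97.21, 97.20, 97.21, 97.49, 97.26,
                        97.20, 97.22, 97.26, 97.21, 97.20, 97.20, 97.20])

spread = confidences.max() - confidences.min()
print(f"min={confidences.min():.2f}%  max={confidences.max():.2f}%  "
      f"spread={spread:.2f} points")
```

A spread of well under one percentage point suggests that small changes to the threshold can flip many predictions at once.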

# Evaluation

Have a look at the matrix above.  
We want to let as much content through as possible without letting bad content through. Adjust the confidence threshold to get the most content through while still catching the bad content.  
To get the overall accuracy, add the correct predictions (the diagonal of the matrix) and divide by the total number of predictions.  
Example from the matrix above:  
- ( (11 + 10) / (11 + 94 + 0 + 10) ) * 100 = 18.26%

In [12]:
accuracy = (matrix[0][0] + matrix[1][1]) / (sum(matrix[0]) + sum(matrix[1])) * 100
accuracy

18.26086956521739

This result should match the accuracy in the report below.  
Precision, recall, and F1 are the most commonly used metrics for classification tasks.  
Scikit-Learn's metrics module contains the classification_report function, which reports the values of these metrics for each class.  

In [13]:
from sklearn.metrics import classification_report

print(classification_report(y_answers,final_predictions))

              precision    recall  f1-score   support

           0       1.00      0.10      0.19       105
           1       0.10      1.00      0.18        10

    accuracy                           0.18       115
   macro avg       0.55      0.55      0.18       115
weighted avg       0.92      0.18      0.19       115
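
The per-class figures in the report can be reproduced from the confusion matrix by hand, which helps when reading them. This is a minimal sketch in plain Python using the four counts from the matrix above; `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(correct, predicted_total, actual_total):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = correct / predicted_total if predicted_total else 0.0
    recall = correct / actual_total if actual_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Counts from the matrix [[11, 94], [0, 10]] (rows: actual, columns: predicted).
# Class 0: 11 correct, 11 + 0 predicted as 0, 11 + 94 actually 0.
class_0 = precision_recall_f1(11, 11 + 0, 11 + 94)
# Class 1: 10 correct, 94 + 10 predicted as 1, 0 + 10 actually 1.
class_1 = precision_recall_f1(10, 94 + 10, 0 + 10)

print('class 0: precision=%.2f recall=%.2f f1=%.2f' % class_0)
print('class 1: precision=%.2f recall=%.2f f1=%.2f' % class_1)
```

The precision of 0.10 for class 1 reflects that only 10 of the 104 DGs flagged for obsoletion actually needed it.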

