# Algorithm Tuning
### ToDo
* Build a separate run for each tuning parameter

## Goal
The goal is an automated procedure that tells us what a good number of won tenders (Ausschreibungen) is and how diverse those tenders should be. This can then be combined with different kinds of attributes and different ratios of positive to negative training data. We should build an automated, data-driven test that shows the best tuning parameters.
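
One way such an automated test could look is a small parameter sweep that calls a scoring function once per parameter combination and ranks the results. The sketch below is only an illustration: `sweep`, `dummy_run`, and the parameter names `min_wins` and `neg_pos_ratio` are hypothetical, and the scoring function is a placeholder for a real model run.

```python
from itertools import product


def sweep(run, grid):
    """Call run(**params) for every combination in grid and rank the results."""
    names = list(grid)
    rows = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        rows.append({**params, "score": run(**params)})
    # Highest score first
    return sorted(rows, key=lambda row: row["score"], reverse=True)


def dummy_run(min_wins, neg_pos_ratio):
    # Placeholder score (hypothetical); a real run would train and evaluate a model
    return 1.0 / (1 + abs(min_wins - 10)) - 0.01 * neg_pos_ratio


results = sweep(dummy_run, {"min_wins": [5, 10, 20], "neg_pos_ratio": [1, 2, 3]})
best = results[0]
```

Keeping the sweep separate from the scoring function means each tuning parameter can get its own run by passing a grid with a single varying entry.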

## Setup

In [None]:
%matplotlib inline
import math
import matplotlib.pyplot as plt
from sklearn import tree
from db import connection, engine
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc
import pandas as pd
import numpy as np
import helpers as fdn

In [None]:
select_an = (
    "anbieter.anbieter_id, "
    "anbieter.anbieter_plz, "
    "anbieter.institution as anbieter_institution, "
    "cpv_dokument.cpv_nummer as anbieter_cpv, "
    "ausschreibung.meldungsnummer"
)
# anbieter_cpv holds all CPVs for which the Anbieter (bidder) has ever won a procurement, i.e. all the CPVs they are interested in.
select_aus = (
    "anbieter.anbieter_id, "
    "auftraggeber.institution as beschaffungsstelle_institution, "
    "auftraggeber.beschaffungsstelle_plz, "
    "ausschreibung.gatt_wto, "
    "cpv_dokument.cpv_nummer as ausschreibung_cpv, "
    "ausschreibung.meldungsnummer"
)

## Optimal Amount of Public Tenders by Institution

**Findings**: It is a bit early to tell, but it looks like about 10 won procurements are needed for a first satisfying result. This could also be a coincidence: one test run currently takes about 12-14 minutes, so not many runs have been conducted. We could test this better with a fully automated test suite, but so far the low number of 10 sounds promising.

In [None]:
data_an = fdn.getFromSimap(select_an)
data_aus = fdn.getFromSimap(select_aus)

In [None]:
# Create a list of Anbieter (bidders) that have won different numbers of procurements
# Note: pandas >= 2.0 names the counts column "count" instead of "anbieter_institution"
inst_count = pd.DataFrame(data_an["anbieter_institution"].value_counts())

def createInstBin(lower, upper):
    # Half-open range [lower, upper) so that neighbouring bins do not overlap
    wins = inst_count["anbieter_institution"]
    return inst_count[(wins >= lower) & (wins < upper)]

# Create different sized bins
bin_0_5 = createInstBin(0, 5)
bin_5_10 = createInstBin(5, 10)
bin_10_15 = createInstBin(10, 15)
bin_15_20 = createInstBin(15, 20)
bin_20_30 = createInstBin(20, 30)
bin_30_40 = createInstBin(30, 40)
bin_40_50 = createInstBin(40, 50)
bin_50_75 = createInstBin(50, 75)
bin_75_100 = createInstBin(75, 100)
bin_100_n = inst_count[inst_count["anbieter_institution"] >= 100]
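
The manual bins above could also be built in one step with `pd.cut`. The following is a minimal sketch on toy data, assuming half-open bins `[lower, upper)`; the helper names `bin_institutions` and `sample_one_per_bin` and the institution labels are hypothetical.

```python
import pandas as pd


def bin_institutions(win_counts, edges):
    """Assign each institution to a half-open bin [lower, upper)."""
    labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]
    return pd.cut(win_counts, bins=edges, right=False, labels=labels)


def sample_one_per_bin(win_counts, bins, seed=0):
    """Pick one random institution from every non-empty bin."""
    picked = win_counts.groupby(bins, observed=True).sample(n=1, random_state=seed)
    return list(picked.index)


# Toy data: number of won procurements per institution
win_counts = pd.Series({"A": 3, "B": 5, "C": 7, "D": 12, "E": 100})
edges = [0, 5, 10, 15, 20, 30, 40, 50, 75, 100, float("inf")]

bins = bin_institutions(win_counts, edges)
chosen = sample_one_per_bin(win_counts, bins)
```

Unlike sampling from each bin separately, this variant skips empty bins instead of raising an error.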

In [None]:
createInstBin(150,160)

In [None]:

# Pick one random institution from each bin to see how the different bins perform in the algorithm
def chooseFromInstBins(bins):
    l = list()
    for eachBin in bins:
        l.append(eachBin.sample(n=1).index[0])
    return l

institutionList = chooseFromInstBins(
    [bin_0_5,
    bin_5_10,
    bin_10_15,
    bin_15_20,
    bin_20_30,
    bin_30_40,
    bin_40_50,
    bin_50_75,
    bin_75_100,
    bin_100_n])

institutionList

In [None]:
import time

def treeRunPerSize(instList):
    # Run the decision tree once per institution and measure the total wall-clock time
    start_time = time.time()
    results = []
    for inst in instList:
        df_pos_full, df_neg_full = fdn.createAnbieterDf(select_an, select_aus, inst)
        # The third argument is twice the number of positives
        x, y, z = fdn.decisionTreeRun(df_pos_full, df_neg_full, len(df_pos_full) * 2)
        results.append([x, y, z])
    elapsed_time = time.time() - start_time
    return results, elapsed_time


In [None]:
v, t = treeRunPerSize(institutionList)
print(t)

In [None]:
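# Second return value of each run (collected as pos_neg_ratio further below)
for e in v:
    print(e[1])
# Third return value of each run (collected as confusion matrix further below)
for e in v:
    print(e[2])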

## Optimal Amount of CPV Diversity of Institution
Do the same with CPV diversity: fdn.getCpvCount('Swisscom').

In [None]:
fdn.getCpvDiversity('Swisscom')

**Findings**: This sort of analysis should be done at a later stage, when we have a full test suite and can repeatedly train a model on different data samples. We can then investigate the impact of the CPV diversity of a subject (Anbieter).
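
As a rough idea of what a diversity measure could look like, the sketch below counts distinct CPV codes and distinct CPV divisions (the first two digits of a code). It is not the implementation of `fdn.getCpvDiversity`, which is not shown here; the function name `cpv_diversity` and the example codes are assumptions for illustration.

```python
def cpv_diversity(cpv_codes):
    """Count distinct CPV codes and distinct CPV divisions (first two digits)."""
    codes = [str(code) for code in cpv_codes]
    return {
        "distinct_codes": len(set(codes)),
        "distinct_divisions": len({code[:2] for code in codes}),
    }


# Illustrative codes only: three distinct codes spread over two divisions
example = cpv_diversity(["72000000", "72200000", "64200000", "72000000"])
```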

## Decision Tree Run (multiple runs)
### ToDo
* Try with Random Forest
* Add other attributes (generic, so we are able to test which work best)
* Add optimal Tender and CPV Amount (see above)
* Extend Evaluation

In [None]:
# Create a df with all negative and all positive responses for a specific Anbieter
df_pos_full, df_neg_full = fdn.createAnbieterDf(select_an, select_aus, "Adecco AG")

In [None]:
positives_count = len(df_pos_full)
step = math.ceil(positives_count / 10)
max_negative_count = step * 100
print(positives_count, step)

In [None]:
# TODO insert for 50000 --> len(df_neg_full) modulo...

positives_count = len(df_pos_full)
step = math.ceil(positives_count / 10)
max_negative_count = step * 100

# Create list placeholders
precision, pos_neg_ratio, confusion_matrices, fns, fps = ([] for _ in range(5))

# Run the decision tree multiple times with an increasing number of negatives
for i in range(positives_count, max_negative_count, step):
    x, y, z = fdn.decisionTreeRun(df_pos_full, df_neg_full, i)
    precision.append(x)
    pos_neg_ratio.append(y)
    confusion_matrices.append(z)
    # Assuming the scikit-learn layout [[TN, FP], [FN, TP]]
    fns.append(z[1][0])
    fps.append(z[0][1])
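
The indices `z[1][0]` and `z[0][1]` used above only give false negatives and false positives if the matrix follows the scikit-learn convention (rows are true labels, columns are predicted labels). The following pure-Python sketch reproduces that layout on toy labels; `binary_confusion` is a hypothetical helper, and it is an assumption that `fdn.decisionTreeRun` returns a matrix in this layout.

```python
def binary_confusion(y_true, y_pred):
    """Build a 2x2 matrix with rows = true label, columns = predicted label."""
    matrix = [[0, 0], [0, 0]]
    for truth, prediction in zip(y_true, y_pred):
        matrix[truth][prediction] += 1
    return matrix


y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 0, 1]

z = binary_confusion(y_true, y_pred)
tn, fp = z[0]
fn, tp = z[1]

precision_score = tp / (tp + fp)  # share of predicted positives that are correct
recall_score = tp / (tp + fn)     # share of actual positives that were found
```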

## Evaluation of positive to negative Datapoints Ratio

**Findings**: Both False Positives (FPS) and False Negatives (FNS) show a positive linear trend as the share of negative data points increases. One might conclude that the negatives should be kept low to reduce both. However, the most likely reason for the rising FPS is that a bigger negative pool contains more similar procurements that could be considered positives. So these FPS could be procurements in which the bidder might actually be interested.
As for the rising number of FNS: if there are very few positives, the training can become inaccurate when, by bad luck, many of the positives end up in the test set rather than in the training set.
*Conclusion*: Positives should probably not make up less than 25% of the test set.
*Try Suggestion*: The negatives in the test and training sets should cover a more similar range of CPVs so that the model trains more accurately.
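
The 25% rule of thumb can be turned into a cap on the number of negatives: with p positives and n negatives, p / (p + n) >= s gives n <= p * (1 - s) / s, so for s = 0.25 at most three negatives per positive. A minimal sketch follows, assuming the share of positives in the sampled data carries over to the test set; `max_negatives` is a hypothetical helper.

```python
import math
from fractions import Fraction


def max_negatives(positives, min_positive_share=0.25):
    """Largest number of negatives that keeps positives at or above the given share."""
    if not 0 < min_positive_share <= 1:
        raise ValueError("min_positive_share must be in (0, 1]")
    # Use exact fractions to avoid floating point rounding before the floor
    share = Fraction(min_positive_share).limit_denominator(1000)
    return math.floor(positives * (1 - share) / share)


cap = max_negatives(40)  # 40 positives, at least 25% positives
```

With such a cap, the upper bound of the negative-count loop could be derived from the number of positives instead of being fixed at `step * 100`.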

In [None]:
# Display False Negatives
print(positives_count)
plt.plot(range(positives_count, max_negative_count, step), fns)

In [None]:
# Display False Positives
print(positives_count)
plt.plot(range(positives_count, max_negative_count, step), fps)

## TODO: Look at the individual ones