# [KDD Cup|Humanities Track Tutorial Q-Learning](https://compete.hexagon-ml.com/tutorial/kdd-cuphumanities-track-tutorial/)

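This tutorial builds on the previous one to demonstrate a baseline implementation of a standard reinforcement learning (RL) algorithm. The baseline shown here, `BanditRPM`, is a multi-armed bandit agent that chooses actions by Thompson sampling.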

### State

$S \in \{1,2,3,4,5\}$

### Action
$A_S = [a_{ITN},a_{IRS}]$

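where $a_{ITN} \in [0,1]$ is the coverage of insecticide-treated nets (ITN) and $a_{IRS} \in [0,1]$ is the coverage of indoor residual spraying (IRS).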

### Reward
$R_{\pi} \in (- \infty,\infty)$

![](image/rewards2.png)
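
To make the notation concrete, a policy can be written as a mapping from each state to an action pair, which is the form the agent code in this tutorial passes to `evaluatePolicy`. The sketch below is only an illustration of that structure; the helper name `is_valid_policy` is hypothetical and is not part of `netsapi`.

```python
def is_valid_policy(policy, n_states=5):
    """Check that a policy maps each state 1..n_states to [a_ITN, a_IRS] in [0, 1]."""
    # Every state from 1 to n_states must appear exactly once as a key.
    if set(policy) != set(range(1, n_states + 1)):
        return False
    for action in policy.values():
        # Each action is a pair: ITN coverage and IRS coverage.
        if len(action) != 2:
            return False
        if not all(0.0 <= a <= 1.0 for a in action):
            return False
    return True


# Example: the same intervention mix in every state.
example_policy = {state: [0.3, 0.7] for state in range(1, 6)}
print(is_valid_policy(example_policy))
```

A check like this can catch malformed policies locally, before any of the limited evaluations are spent.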

import numpy as np
from collections import defaultdict
import random
# !pip3 install git+https://github.com/slremy/netsapi --user --upgrade
from netsapi.challenge import * 

### Creating a Valid Submission from Agent Code:

class BanditRPM(object):
    def __init__(self,env):
        self.env = env
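        self.action_resolution = 0.1  # grid step used to discretise a_ITN
        self.actions = self.actionSpace()

        # One Beta(a, b) posterior per action, stored as a tuple (a, b).
        # Every action starts from the same prior, Beta(2, 5).
        self.ActionValue = {}
        self.init = (2, 5)
        for key in self.actions:
            self.ActionValue[key] = self.init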
        
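    def actionSpace(self):
        # Discretise a_ITN on [0, 1] and set a_IRS = 1 - a_ITN,
        # so the two components of every action sum to 1.
        x = np.arange(0, 1 + self.action_resolution, self.action_resolution)
        y = 1 - x
        x = x.reshape(len(x), 1)
        y = y.reshape(len(y), 1)
        xy = np.concatenate((x, y), axis=1)
        xy = xy.round(2)  # remove floating-point noise from np.arange
        xy = [tuple(row) for row in xy]  # tuples are hashable, so they can be dict keys

        return xy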
        
    
    def choose_action(self):
        """
        Use Thompson sampling to choose action. Sample from each posterior and choose the max of the samples.
        """
        samples = {}
        for key in self.ActionValue:
            a, b = self.ActionValue[key]
            samples[key] = np.random.beta(a, b)
        # Return the action (dictionary key) whose sampled value is largest.
        best_action = max(samples, key=samples.get)
        return best_action

    def update(self,action,reward):
        """
        Update parameters of posteriors, which are Beta distributions
        """
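        a, b = self.ActionValue[action]
        # Treat reward/100 as a fractional success and 1 - reward/100 as a fractional failure.
        a = a + reward/100
        b = b + 1 - reward/100
        # Beta parameters must be strictly positive, so clamp non-positive values.
        a = 0.001 if a <= 0 else a
        b = 0.001 if b <= 0 else b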
        
        self.ActionValue[action] = (a, b)
        
    def train(self):
        for _ in range(20): #Do not change
            self.env.reset()
            while True:
                action =  self.choose_action()
                nextstate, reward, done, _ = self.env.evaluateAction(list(action))
                self.update(action,reward)
                if done:
                    break


    def generate(self):
        # Train first, then sample one action per state (1 to 5) to build the policy.
        self.train()
        best_policy = {state: list(self.choose_action()) for state in range(1,6)}
        best_reward = self.env.evaluatePolicy(best_policy)
        
        print(best_policy, best_reward)
        
        return best_policy, best_reward                    
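
The Thompson sampling loop in `BanditRPM` can be tried out without the challenge environment. The sketch below is a standalone illustration under stated assumptions: the function names (`make_action_grid`, `thompson_choose`, `beta_update`, `toy_reward`, `run_toy_experiment`) are hypothetical, the reward function is invented purely for the demo, and the standard library's `random.betavariate` stands in for `np.random.beta`. The update rule mirrors the one in `update` above.

```python
import random


def make_action_grid(resolution=0.1):
    """Actions (a_ITN, a_IRS) with a_IRS = 1 - a_ITN, like actionSpace above."""
    n = int(round(1 / resolution))
    return [(round(i * resolution, 2), round(1 - i * resolution, 2))
            for i in range(n + 1)]


def thompson_choose(posteriors, rng):
    """Draw one sample from each action's Beta posterior; return the argmax action."""
    samples = {action: rng.betavariate(a, b)
               for action, (a, b) in posteriors.items()}
    return max(samples, key=samples.get)


def beta_update(posteriors, action, reward):
    """Same rule as BanditRPM.update: reward/100 counts as a fractional success."""
    a, b = posteriors[action]
    a = a + reward / 100
    b = b + 1 - reward / 100
    a = 0.001 if a <= 0 else a
    b = 0.001 if b <= 0 else b
    posteriors[action] = (a, b)


def toy_reward(action, rng):
    """Invented reward in [0, 100] that peaks near a_ITN = 0.7 (demo assumption)."""
    a_itn, _ = action
    noisy = 100 * (1 - abs(a_itn - 0.7)) + rng.gauss(0, 5)
    return min(100.0, max(0.0, noisy))


def run_toy_experiment(rounds=300, seed=0):
    rng = random.Random(seed)
    # Every action starts from the Beta(2, 5) prior used in BanditRPM.
    posteriors = {action: (2, 5) for action in make_action_grid()}
    for _ in range(rounds):
        action = thompson_choose(posteriors, rng)
        beta_update(posteriors, action, toy_reward(action, rng))
    return posteriors


posteriors = run_toy_experiment()
best = max(posteriors, key=lambda k: posteriors[k][0] / sum(posteriors[k]))
print("action with the highest posterior mean:", best)
```

Because actions are sampled in proportion to how promising their posteriors look, pulls tend to concentrate on high-reward actions as evidence accumulates, while poorly explored actions still get occasional draws.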

### Run the EvaluateChallengeSubmission Method with your Agent Class

EvaluateChallengeSubmission(ChallengeSeqDecEnvironment, BanditRPM, "BanditRPM_submission.csv")

105  Evaluations Remaining
104  Evaluations Remaining
103  Evaluations Remaining
...

[Output truncated: the notebook prints four such countdowns of the remaining evaluations, each cut off partway through.]

<netsapi.challenge.EvaluateChallengeSubmission at 0x1ddc2161c50>
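
The countdown above starts at 105, which is consistent with 20 training episodes of 5 actions each (100 evaluations) plus 5 more for the final policy. To rehearse an agent's control flow without spending that budget, a stand-in environment can be useful. The sketch below is hypothetical: `StubEnvironment` is not part of `netsapi`, its reward is invented, and its interface (`reset`, `evaluateAction`, `evaluatePolicy`, the 5-step episode, and the 105-evaluation budget) is only inferred from how `BanditRPM` calls its environment and from the log, so the real environment may behave differently.

```python
import random


class StubEnvironment:
    """Hypothetical stand-in that mimics the calls BanditRPM makes on its env."""

    def __init__(self, budget=105, n_states=5, seed=0):
        self.budget = budget      # assumed from the countdown starting at 105
        self.n_states = n_states
        self.state = 1
        self.rng = random.Random(seed)

    def reset(self):
        self.state = 1

    def _spend(self):
        if self.budget <= 0:
            raise ValueError("evaluation budget exhausted")
        self.budget -= 1

    def _reward(self, action):
        # Invented reward: highest when a_ITN is near 0.7, plus Gaussian noise.
        a_itn, _a_irs = action
        return 100 * (1 - abs(a_itn - 0.7)) + self.rng.gauss(0, 5)

    def evaluateAction(self, action):
        self._spend()
        reward = self._reward(action)
        self.state += 1
        done = self.state > self.n_states
        return self.state, reward, done, {}

    def evaluatePolicy(self, policy):
        # Assumption: scoring a full policy costs one evaluation per state.
        total = 0.0
        for state in sorted(policy):
            self._spend()
            total += self._reward(policy[state])
        return total


def run_random_agent(env, episodes=20, seed=1):
    """Same loop shape as BanditRPM.train and generate, with random actions."""
    rng = random.Random(seed)
    for _ in range(episodes):
        env.reset()
        while True:
            a_itn = round(rng.random(), 1)
            _, _, done, _ = env.evaluateAction([a_itn, round(1 - a_itn, 1)])
            if done:
                break
    policy = {s: [0.7, 0.3] for s in range(1, env.n_states + 1)}
    return env.evaluatePolicy(policy)


env = StubEnvironment()
score = run_random_agent(env)
print("evaluations left:", env.budget)
```

With these assumptions, 20 episodes of 5 steps plus one 5-state policy evaluation use the whole budget, which is why the episode count in `train` is marked "Do not change".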