[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/aynetdia/Budget_Constrained_Bidding/blob/master/Tutorial.ipynb) 

# Setting up the notebook in Colab

In [None]:
# Mount on GDrive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Cloning the repo (only when setting up for the first time)
%cd drive/MyDrive
!git clone https://github.com/aynetdia/budget_constrained_bidding.git

Here's a link to the dataset: https://drive.google.com/drive/folders/1YYyxGMDW0EuZA2BI-uR_2j60lpngKgy8?usp=sharing

To set up the dataset when running the notebook locally, download the linked folder containing the dataset and put it into `/budget_constrained_bidding/data/ipinyou/`. To run the notebook in Colab, add a GDrive shortcut instead (go to: Shared with me -> Right click on the dataset folder -> Add shortcut to Drive) and select the same `/budget_constrained_bidding/data/ipinyou/` folder as the location for the shortcut.

In [None]:
# Go into the project folder and pull the latest changes (do that every time before running the notebook)
# If executing after cloning the repo: %cd budget_constrained_bidding
%cd drive/MyDrive/budget_constrained_bidding
!git pull

# Table of Contents

1. Introduction to the application domain
2. Methods
3. DRLB framework
4. Implementation
5. Results

# 1. Introduction

# 2. Methods

## 2.1 Introduction and Methods Comparison


Reinforcement Learning (RL) is a tool for sequential decision-making and a highly active research field within machine learning. It is widely applied, e.g. in real-time bidding for advertisement auctions.

RL is a core control problem in which an agent sequentially interacts with an unknown environment to maximize its cumulative reward [1]. It mimics how humans learn in dynamic environments without supervision. To find the strategy that maximizes reward, the agent repeatedly interacts with the environment and learns which action is best in each state. Roughly speaking, the agent’s goal is to get as much reward as possible over the long run.
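The "cumulative reward" is commonly formalized as the discounted return. The notation below is an assumption for illustration, since the text does not fix any symbols: $r_t$ denotes the reward received at step $t$, $T$ the last step of an episode, and $\gamma$ with $0 \le \gamma \le 1$ a discount factor.

$$G_t = \sum_{k=0}^{T-t-1} \gamma^{k} \, r_{t+k+1}$$

With $\gamma$ close to 1 the agent weighs future rewards almost as much as immediate ones, which corresponds to collecting as much reward as possible over the long run.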

For instance, playing a board game is a classic dynamic decision-making process. Players need to react to their opponents. Moves bring immediate rewards and rely on intuitive judgment, and each action influences the final outcome. The cumulative reward decides whether the player wins or loses [1].

In reality, such scenarios comprise several problems that can be tackled by a whole class of solution methods, but not all of these methods achieve the objective of maximizing reward. In RL, each action the agent takes affects not only the immediate reward but also future states. In other words, RL deals with a dynamic decision-making problem: the agent continuously receives rewards and punishments, adjusts its behavior according to the environment's feedback, and thereby learns to maximize the cumulative reward.

To make the process clearer, we explain some basic concepts of RL:

- Policy: the strategy the agent follows, i.e. a mapping from states to the actions it takes
- Agent: the one who takes the actions
- Environment: the world in which the agent acts and with which it interacts
- Action: a move the agent makes to interact with the environment
- State: the situation the agent currently perceives
- Reward: the feedback on the agent’s action

Image 1 illustrates the whole RL process.

![Image 1: Reinforcement Learning Process](process_RL.png)
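The interaction loop in Image 1 can be sketched in a few lines of Python. This is only an illustration: `ToyEnvironment`, `copy_policy` and `run_episode` are hypothetical names invented for this example and are not part of the project code. The toy environment pays a reward of 1 whenever the action equals the current state.

```python
class ToyEnvironment:
    """Hypothetical two-state environment, used only for illustration."""

    def __init__(self, n_steps=5):
        self.n_steps = n_steps
        self.t = 0
        self.state = 0

    def reset(self):
        # start a new episode in state 0
        self.t = 0
        self.state = 0
        return self.state

    def step(self, action):
        # reward 1 if the action matches the current state, else 0
        reward = 1.0 if action == self.state else 0.0
        self.t += 1
        self.state = self.t % 2          # the state alternates between 0 and 1
        done = self.t >= self.n_steps    # the episode ends after n_steps
        return self.state, reward, done


def copy_policy(state):
    """A policy maps a state to an action; this one simply copies the state."""
    return state


def run_episode(env, policy):
    """Let the agent interact with the environment and sum up the rewards."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                   # agent chooses an action
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward                   # cumulative reward
    return total_reward


total_reward = run_episode(ToyEnvironment(n_steps=5), copy_policy)
print(total_reward)
```

Because `copy_policy` always picks the rewarded action, it collects the maximum reward of one per step. A policy that always returns 0 would only be rewarded in the steps where the state is 0.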
	

	
Besides RL, supervised learning and unsupervised learning are the most classic machine learning methods. In this section, we explain why RL is the best choice for solving dynamic decision-making problems. The core differences among these three methods are shown in Table 1.

![Table 1: Machine Learning Methods Comparison](comparison_methods.png)
	



Supervised learning is mainly used to solve regression and classification problems. It learns the relation between a labeled target variable and the input variables and produces a model that predicts the target for new observations.

In comparison, unsupervised learning receives unlabeled observations and searches for the hidden structure in the input data. Neither method is suitable for optimizing behavior in a dynamic process. Unlike supervised and unsupervised learning, which do not account for capacity limits or other environmental constraints, reinforcement learning starts with a clear initial setting, a goal, and an interactive agent. Every action the agent takes influences the environment. Over the course of training, the agent learns to take better actions and thus collects more and more reward, with the aim of maximizing the cumulative reward.

Reinforcement learning models a problem as a Markov Decision Process (MDP) and matches the output to the problem. In theory, whenever we can formulate a problem as an MDP, we can expect reinforcement learning to be a useful tool for solving it. In the next section, we explain how RL works with Markov Decision Processes and give more details about Deep Q-Learning, which is a core concept in our project.

	





# 3. DRLB framework

# 4. Implementation

In [None]:
#you should find yourself inside the project folder after you've pulled the latest changes
!ls
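Before loading the data, it can help to confirm that the dataset ended up where the loading cell expects it. The check below is a small sketch: the relative path is the one used by the loading cell in this notebook, while the helper name `dataset_ready` is made up for illustration.

```python
from pathlib import Path

# Relative path used by the loading cell of this notebook.
TRAIN_LOG = Path("data/ipinyou/1458/train.log.txt")


def dataset_ready(project_root="."):
    """Return True if the training log exists under the given project root."""
    return (Path(project_root) / TRAIN_LOG).is_file()


if not dataset_ready():
    print("Dataset not found at", TRAIN_LOG,
          "- check the download or the GDrive shortcut.")
```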

In [15]:
# import the dataset
import pandas as pd
bid_requests = pd.read_csv('data/ipinyou/1458/train.log.txt', sep="\t")

In [26]:
bid_requests

Unnamed: 0,click,weekday,hour,min,bidid,timestamp,logtype,ipinyouid,useragent,IP,...,slotheight,slotvisibility,slotformat,slotprice,creative,bidprice,payprice,keypage,advertiser,usertag
0,0,4,0,00,81aced04baad90f9358aa39a4521cd6f,20130606000104828,1,Vhk7ZAnxDIuOjCn,windows_ie,115.45.195.*,...,280,2,1,0,77819d3e0b3467fe5c7b16d68ad923a1,300,51,bebefa5efe83beee17a3d245e7c5085b,1458,1000610110
1,0,4,0,00,572fa35095e8b6c30b1aa871e52b2d,20130606000105075,1,Z0n7Ce1GPe5-toc,windows_chrome,120.40.95.*,...,280,0,1,0,77819d3e0b3467fe5c7b16d68ad923a1,300,87,bebefa5efe83beee17a3d245e7c5085b,1458,100311304210110
2,0,4,0,00,e1e44b8a725b957a626991ecb15b56f,20130606000105119,1,VhkE1w9iOeu2eWz,windows_ie,60.163.144.*,...,280,2,1,0,77819d3e0b3467fe5c7b16d68ad923a1,300,33,bebefa5efe83beee17a3d245e7c5085b,1458,1000610110
3,0,4,0,00,f250658c8cfc615c824f9552497d6325,20130606000105254,1,Vh1K15crOZ1djGn,windows_ie,123.120.244.*,...,280,2,1,0,77819d3e0b3467fe5c7b16d68ad923a1,300,65,bebefa5efe83beee17a3d245e7c5085b,1458,100521000610110
4,0,4,0,00,125f1b570af0b42341eb60a532544f0b,20130606000105284,1,Vh2EPxqxLTLWFgB,windows_chrome,222.75.4.*,...,90,0,1,0,48f2e9ba15708c0146bda5e1dd653caa,300,238,bebefa5efe83beee17a3d245e7c5085b,1458,"10074,10083,13800,10077,10006,10063,10075,1004..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3083051,0,3,23,30,b3cf4e52fc8970885daf8145e9c71f83,20130612233441177,1,VhnE1aMrOolywVk,windows_chrome,113.200.138.*,...,90,0,0,20,fb5afa9dba1274beaf3dad86baf97e89,300,20,bebefa5efe83beee17a3d245e7c5085b,1458,100631005210110
3083052,0,3,23,30,41b79d540e6df7e8097a40f410cdcbbe,20130612233514614,1,Vh2SC55LPc52Msz,android_safari,186.254.179.*,...,600,2,1,0,cb7c76e7784031272e37af8e7e9b062c,300,18,bebefa5efe83beee17a3d245e7c5085b,1458,10063100061386610111
3083053,0,3,23,30,d53b09eff85e28357015aad4293b848a,20130612233549014,1,Vh5z1aNYOH5L3Mb,windows_firefox,208.172.112.*,...,90,0,0,20,832b91d59d0cb5731431653204a76c0e,300,20,bebefa5efe83beee17a3d245e7c5085b,1458,
3083054,0,3,23,30,81fc22e0c022225f635c1889f197b7ce,20130612234017010,1,Vh5z1u1TOHT6aGb,other_other,117.136.12.*,...,90,0,0,70,832b91d59d0cb5731431653204a76c0e,300,70,bebefa5efe83beee17a3d245e7c5085b,1458,


In [21]:
# add the 15-minute time intervals

def get_time_interval(timestamp):
    # timestamps have the format yyyymmddHHMMSSmmm, so characters 10-11 hold the minute
    minute = int(str(timestamp)[10:12])
    if 0 <= minute < 60:
        # start of the 15-minute interval: "00", "15", "30" or "45"
        return "{:02d}".format(minute // 15 * 15)
    return None

# skip the insertion if the 'min' column is already part of the (saved) dataset
if "min" not in bid_requests.columns:
    min_intervals = bid_requests["timestamp"].apply(get_time_interval)
    bid_requests.insert(3, "min", min_intervals) # insert the new column after the 'hour' column

In [25]:
# save the updated dataset. do not run again!
# bid_requests.to_csv('data/ipinyou/1458/train.log.txt', sep="\t", header=True, index=False)
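The interval mapping can also be expressed without a row-wise `apply`. The sketch below is one possible vectorized variant, assuming the timestamps follow the `yyyymmddHHMMSSmmm` layout seen in the table above. The helper name `minute_interval` is made up for illustration and is not part of the repository. The first three sample timestamps are taken from the table above, the last one is invented to cover the "45" interval.

```python
import pandas as pd


def minute_interval(timestamps):
    """Map yyyymmddHHMMSSmmm timestamps to the start of their 15-minute interval."""
    # characters 10-11 of the timestamp hold the minute
    minutes = timestamps.astype(str).str[10:12].astype(int)
    # round down to a multiple of 15 and format as a two-digit string
    return (minutes // 15 * 15).astype(str).str.zfill(2)


sample = pd.Series([
    20130606000104828,  # minute 01
    20130612233441177,  # minute 34
    20130612234017010,  # minute 40
    20130606004500000,  # minute 45 (invented example)
])
print(minute_interval(sample).tolist())
```

For the three timestamps taken from the table, the result agrees with the `min` column shown there.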

In [7]:
# model training
%run -i 'src/rtb_agent/rl_bid_agent.py'

KeyError: 'pCTR'

In [None]:
#random bidding
import numpy as np
import pandas as pd


class Random_Bidding():
    # Expects budget, bid_requests and total_bids to be set on the instance
    # before random_bidding() is called.

    def random_bidding(self, highest_bid):
        # Draw a random integer bid in [0, highest_bid) for every bid request and
        # collect the columns needed to evaluate the bids.
        self.df = pd.DataFrame({
            'click': self.bid_requests['click'].to_numpy(),
            'day': self.bid_requests['weekday'].to_numpy(),
            'slotprice': self.bid_requests['slotprice'].to_numpy(),
            'payprice': self.bid_requests['payprice'].to_numpy(),
            'r_bid_price': np.random.randint(0, highest_bid, size=self.total_bids),
        })

        days = self.df['day'].to_numpy()
        bids = self.df['r_bid_price'].to_numpy(dtype=float)
        self.rem_budget = self.budget
        self.cur_day = str(int(days[0]))
        for i in range(len(bids)):
            # a new day starts: reset the budget
            if str(int(days[i])) != self.cur_day:
                self.cur_day = str(int(days[i]))
                self.rem_budget = self.budget

            # never bid more than the remaining budget; bid 0 once it is used up
            bids[i] = max(min(bids[i], self.rem_budget), 0)
            # the bid is divided by 1e9 before it is deducted from the budget
            self.rem_budget -= bids[i] / 1e9

        # store the capped bids in the dataframe
        self.df['r_bid_price'] = bids

        # a bid wins if it reaches the slot's floor price and exceeds the logged pay price
        self.df['wins'] = ((self.df['r_bid_price'] >= self.df['slotprice'])
                           & (self.df['r_bid_price'] > self.df['payprice'])).astype(int)
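The win rule used by the random bidder can be checked on a few rows by hand. The toy frame below is a sketch with invented values; only the column names and the rule itself (the bid must reach `slotprice` and exceed `payprice`) come from the cell above.

```python
import pandas as pd

# Invented example rows, not taken from the dataset.
toy = pd.DataFrame({
    'slotprice':   [0, 20, 70],
    'payprice':    [51, 20, 70],
    'r_bid_price': [60, 20, 90],
})

# A bid wins if it reaches the floor price and is strictly above the pay price.
toy['wins'] = ((toy['r_bid_price'] >= toy['slotprice'])
               & (toy['r_bid_price'] > toy['payprice'])).astype(int)
print(toy)
```

The second row loses although the bid equals both prices, because the comparison with `payprice` is strict.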

In [None]:
rb=Random_Bidding()
rb.budget = 1000
rb.bid_requests=bid_requests
rb.total_bids=len(bid_requests)
rb.random_bidding(highest_bid=100)

click1=sum(rb.df['click'])  # clicks over all logged impressions
wins1=rb.total_bids  # number of logged impressions
wins2=sum(rb.df['wins'])  # impressions won by random bidding
click2=sum(rb.df.loc[rb.df['wins']==1]['click'])  # clicks among the won impressions


print("Total actual (logged) impressions = {} clicks = {};\n".format(wins1,click1),
      "Total random winning impressions = {} clicks = {}".format(wins2, click2))

#rb.df.to_csv(os.getcwd() + '\data\rd_bid.txt', header=True,  sep=' ', mode='a')


# 5. Results