* We are going to solve the FrozenLake problem from OpenAI Gym
* The FrozenLake problem consists of a 4 x 4 grid of blocks
* Each block is either the start block, the goal block, a safe frozen block, or a dangerous hole
* The objective is for an agent to learn to navigate from the start to the goal without moving onto a hole. At any given time the agent can choose to move up, down, left, or right. The catch is that a wind occasionally blows the agent onto a space it did not choose. As such, perfect performance every time is impossible, but learning to avoid the holes and reach the goal is still doable. The reward at every step is 0, except for entering the goal, which gives a reward of 1.
* In its simplest implementation, Q-learning is a table of values for every state (row) and action (column) possible in the environment. Within each cell of the table, we learn a value for how good it is to take a given action in a given state. In the FrozenLake environment, we have 16 possible states (one for each block) and 4 possible actions (the four directions of movement), giving us a 16x4 table of Q-values. We start by initializing the table uniformly (all zeros), and then, as we observe the rewards we obtain for various actions, we update the table accordingly.
* For the update, we use the Bellman optimality equation for the Q-values, applied incrementally with learning rate `lr`: $Q(s,a) \leftarrow Q(s,a) + \mathrm{lr}\,\big(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big)$
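
To make the update rule concrete, here is a minimal sketch of two update steps on an all-zero table, assuming the same `lr = 0.8` and `gamma = 0.95` used later in this notebook. The helper name `q_update` and the two transitions are illustrative assumptions, not part of the original code.

```python
import numpy as np

def q_update(Q, s, a, r, s1, lr=0.8, gamma=0.95):
    """Apply one tabular Q-learning update in place and return the new Q[s, a]."""
    td_target = r + gamma * np.max(Q[s1, :])   # reward plus discounted best next value
    Q[s, a] += lr * (td_target - Q[s, a])      # move a fraction lr toward the target
    return Q[s, a]

Q = np.zeros((16, 4))

# Hypothetical transition: from state 14, action 2 reaches the goal (state 15) with reward 1
first = q_update(Q, s=14, a=2, r=1.0, s1=15)

# Hypothetical transition: from state 13, action 2 reaches state 14 with reward 0
second = q_update(Q, s=13, a=2, r=0.0, s1=14)

print(first, second)
```

The second value is nonzero even though its reward is 0: the value learned for state 14 is propagated one step back, discounted by `gamma`.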

In [1]:
import gym
import numpy as np

In [2]:
# Create the 4x4 FrozenLake environment
env = gym.make('FrozenLake-v0')

The environment is the following 4 x 4 map (S = start, F = frozen, H = hole, G = goal):
* SFFF
* FHFH
* FFFH
* HFFG
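
The environment reports each state as a single integer rather than a (row, column) pair. The sketch below maps between the two, assuming row-major numbering (state = row * 4 + column), which is consistent with the all-zero rows of the final Q-table at the hole and goal states. The names `lake_map` and `state_to_cell` are illustrative.

```python
# The 4x4 map shown above, one string per row
lake_map = ["SFFF", "FHFH", "FFFH", "HFFG"]
n_cols = len(lake_map[0])

def state_to_cell(state):
    """Return (row, col, tile) for a state index, assuming row-major numbering."""
    row, col = divmod(state, n_cols)
    return row, col, lake_map[row][col]

# State indices of every hole and of the goal
holes = [r * n_cols + c
         for r, line in enumerate(lake_map)
         for c, tile in enumerate(line) if tile == "H"]
goal = next(r * n_cols + c
            for r, line in enumerate(lake_map)
            for c, tile in enumerate(line) if tile == "G")

print(holes)               # [5, 7, 11, 12]
print(goal)                # 15
print(state_to_cell(9))    # (2, 1, 'F')
```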

In [3]:
# A simple tabular Q-learning model: one row per state, one column per action
Q = np.zeros([env.observation_space.n, env.action_space.n])

# Set the learning parameters
lr = 0.8             # learning rate
gamma = 0.95         # discount factor
num_episodes = 1000

# Total reward collected in each episode
rList = []
for i in range(num_episodes):
    # Reset the environment and get the first observation
    s = env.reset()
    rALL = 0                # total reward for this episode
    d = False               # set to True when the episode ends
    j = 0                   # step counter; an episode is capped at 99 steps
                            # (the fourth value returned by env.step is an info dict
                            #  holding the transition probability; it is unused here)

    while j < 99:
        j += 1

        # Choose an action greedily from the Q-table, with added noise for exploration.
        # randn draws from the standard normal distribution; scaling by 1/(i+1) shrinks
        # the noise as the episode index i grows, so the randomness decreases over time.
        a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n) * (1. / (i + 1)))
        
        # Take the action and get the new state, reward, and done flag from the environment
        s1, r, d, _ = env.step(a)
        # Update Q[s, a] toward the target r + gamma * max(Q[s1, :]),
        # moving a fraction lr of the way from the old value
        Q[s, a] = Q[s, a] + lr * (r + gamma * np.max(Q[s1, :]) - Q[s, a])

        rALL += r           # accumulate the episode reward
        s = s1              # move to the next state
        if d:               # end of the episode
            break
    rList.append (rALL)
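
The action-selection line in the loop above can be studied in isolation. The following is a sketch, not the notebook's code: it uses NumPy's `default_rng` instead of `np.random.randn`, and the helper `noisy_greedy_action` and the example row `q_row` are assumptions made for illustration. It estimates how often the greedy action is picked at different episode indices.

```python
import numpy as np

def noisy_greedy_action(q_row, episode, rng):
    """Pick argmax of the Q-values plus Gaussian noise scaled by 1/(episode+1)."""
    noise = rng.standard_normal(q_row.shape) / (episode + 1)
    return int(np.argmax(q_row + noise))

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.0, 0.0, 0.0])   # hypothetical Q-values; action 0 is greedy

greedy_fraction = {}
for episode in (0, 9, 999):
    picks = [noisy_greedy_action(q_row, episode, rng) for _ in range(1000)]
    greedy_fraction[episode] = picks.count(0) / len(picks)
    print(episode, greedy_fraction[episode])
```

In episode 0 the noise has standard deviation 1 and dominates the Q-values, so the choice is close to random. By episode 999 the standard deviation is 0.001, far smaller than the 0.1 gap between actions, so the choice is effectively greedy.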

In [4]:
print ("Score over time : " , sum(rList)/num_episodes)

Score over time :  0.434
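
The single average above mixes the early, mostly random episodes with the later ones. One way to see the trend is to average the rewards in consecutive blocks of episodes. The sketch below uses a made-up reward list, `demo_rewards`, purely for illustration; in this notebook one would pass `rList` with a larger block size. The helper name `block_means` is hypothetical.

```python
def block_means(rewards, block_size):
    """Average the rewards in consecutive, non-overlapping blocks of episodes."""
    means = []
    for start in range(0, len(rewards), block_size):
        block = rewards[start:start + block_size]
        means.append(sum(block) / len(block))
    return means

# Made-up per-episode rewards (1.0 = reached the goal), not results from this notebook
demo_rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0]

print(block_means(demo_rewards, 5))   # [0.0, 0.6]
```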


In [5]:
print ("final q_values ")
print (np.around(Q, 3))

final q_values 
[[0.106 0.009 0.006 0.009]
 [0.002 0.002 0.001 0.167]
 [0.007 0.006 0.006 0.097]
 [0.001 0.    0.001 0.038]
 [0.185 0.    0.011 0.002]
 [0.    0.    0.    0.   ]
 [0.009 0.    0.    0.   ]
 [0.    0.    0.    0.   ]
 [0.002 0.001 0.002 0.353]
 [0.002 0.632 0.001 0.002]
 [0.841 0.    0.002 0.002]
 [0.    0.    0.    0.   ]
 [0.    0.    0.    0.   ]
 [0.002 0.005 0.506 0.002]
 [0.    0.    0.    0.985]
 [0.    0.    0.    0.   ]]
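
The table is easier to read as a policy: for each state, take the action with the largest Q-value. The sketch below copies the rounded values printed above and lays the greedy actions out on the map. It assumes FrozenLake's action encoding (0 = left, 1 = down, 2 = right, 3 = up) and row-major state numbering; the names `Q_final`, `lake_map`, and `policy_grid` are illustrative.

```python
import numpy as np

# Rounded Q-values copied from the output above
Q_final = np.array([
    [0.106, 0.009, 0.006, 0.009], [0.002, 0.002, 0.001, 0.167],
    [0.007, 0.006, 0.006, 0.097], [0.001, 0.000, 0.001, 0.038],
    [0.185, 0.000, 0.011, 0.002], [0.000, 0.000, 0.000, 0.000],
    [0.009, 0.000, 0.000, 0.000], [0.000, 0.000, 0.000, 0.000],
    [0.002, 0.001, 0.002, 0.353], [0.002, 0.632, 0.001, 0.002],
    [0.841, 0.000, 0.002, 0.002], [0.000, 0.000, 0.000, 0.000],
    [0.000, 0.000, 0.000, 0.000], [0.002, 0.005, 0.506, 0.002],
    [0.000, 0.000, 0.000, 0.985], [0.000, 0.000, 0.000, 0.000],
])

lake_map = ["SFFF", "FHFH", "FFFH", "HFFG"]
action_names = "LDRU"   # assumed encoding: 0 = left, 1 = down, 2 = right, 3 = up

greedy = np.argmax(Q_final, axis=1)   # best action index for each of the 16 states

policy_grid = []
for row, line in enumerate(lake_map):
    cells = ""
    for col, tile in enumerate(line):
        state = row * len(line) + col
        # Holes and the goal end the episode, so show the tile instead of an action
        cells += tile if tile in "HG" else action_names[greedy[state]]
    policy_grid.append(cells)

print("\n".join(policy_grid))
```

Some choices look odd at first, such as moving left from the start. Because the agent can be blown off course, the action with the highest learned value is not always the one that points toward the goal.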
