# Network Friendly Recommendations Project: 2nd assignment

- You need to be able to solve the Q-learning problem (i.e. you don't know $u_{min}$ and $a$) again, but now you need to make it work for large $K$ (e.g. 1000s) and larger $N$ (e.g. up to $N$ = 5 or 10).
- If you learned anything from this project and the class, it should be clear that no "tabular" solution can work here. You'll need tricks and approximations. Which tricks is up to you.
- Some initial ideas I have provided are:
    - what we have done to deal with large $N$ (see my team's paper[$^{[\ast]}$](https://hal.science/hal-03578013/document), cited in the description), but this is in the MDP case, not Q-learning;
    - what Google has tried to deal with this problem in a paper[$^{[\ast]}$](https://arxiv.org/pdf/1905.12767.pdf) I also cite there;
    - other (Deep/Approximate) RL methods you can think of, or find, to deal with your main problem, the large action space, e.g. [Deep Q-learning with experience replay](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf).
- You could even try to model some other user type, e.g. someone who has "memory" of bad recommendations for more than one content: you give me one recommendation below $u_{min}$, and I might keep ignoring your good recommendations for X contents in the future, with some probability, until you regain my trust.
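
One possible reading of the "memory" user type from the last bullet is sketched below. Everything beyond the bullet's wording is an assumption made for illustration: the function name `memory_user_step`, the counter `distrust_left`, the trigger probability `p_distrust`, and the rule that the counter only decreases when a fully relevant list is shown (that is how trust is "regained" here).

```python
import random

def memory_user_step(K, N, q, u, u_min, a, recom, curr_content,
                     distrust_left, X=3, p_distrust=0.5, rng=random):
    """One step of a hypothetical user who remembers bad recommendations.

    Returns (new_content, distrust_left); new_content is -1 when the session ends.
    N is kept only to mirror the signature of user_chooses.
    """
    # The session ends with probability q
    if rng.random() < q:
        return -1, distrust_left

    others = [x for x in range(K) if x != curr_content]
    relevant = all(u[curr_content][r] >= u_min for r in recom)

    if not relevant:
        # A list with an irrelevant item is ignored now and, with
        # probability p_distrust, (re)starts X steps of distrust
        if rng.random() < p_distrust:
            distrust_left = X
        return rng.choice(others), distrust_left

    if distrust_left > 0:
        # The list is relevant, but the user still ignores it;
        # each good list brings the user one step closer to trusting again
        return rng.choice(others), distrust_left - 1

    # Trusting user: follows a recommendation with probability a
    if rng.random() < a:
        return rng.choice(recom), distrust_left
    return rng.choice(others), distrust_left
```

With `X=0` or `p_distrust=0` this reduces to the same behaviour as `user_chooses`, so the original user model is a special case. Note that the counter becomes part of the state, so an agent that cannot observe it faces a partially observable problem.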



## Libraries:

In [14]:
import numpy as np
import matplotlib.pyplot as plt
import sys, os, time, copy, math, random
import matplotlib.cm as cm
import torch

## Functions:

In [15]:
# Print u matrix with colors based on relations
def print_matrix(matrix,u_min):
    RED = '\033[91m'  # ANSI escape code for red text
    YELLOW = '\033[93m'  # ANSI escape code for yellow text
    RESET = '\033[0m'  # ANSI escape code to reset the text color

    for i, row in enumerate(matrix):
        for j, element in enumerate(row):
            if i == j:
                print(f"{YELLOW}{element:.3f}{RESET}", end=" ")  # Print diagonal element in yellow
            elif element < u_min:
                print(f"{RED}{element:.3f}{RESET}", end=" ")  # Print in red if smaller than min_value
            else:
                print(f"{element:.3f}", end=" ")
        print()

# Create a symmetric matrix
def create_symmetric_matrix(K):
    matrix = [[random.random() if i != j else 0 for j in range(K)] for i in range(K)]

    # Make the matrix symmetric by copying the upper triangle to the lower triangle
    for i in range(K):
        for j in range(i+1, K):
            matrix[j][i] = matrix[i][j]

    return matrix

# Choose which C of the K items will be cached
def random_cached_items(K, C):
    reward = [-1] * K  # Create a vector of length K with all elements set to -1 (not cached)

    # Select C random indices
    indices = random.sample(range(K), C)

    # Set the reward of the selected (cached) items to 0
    for index in indices:
        reward[index] = 0

    return reward

# Random recommendation for current content watched
def random_recommendation(K, N, curr_content):
    recom = []  # Create an empty vector

    for _ in range(N):
        random_number = random.randint(0, K-1)
        while random_number == curr_content or random_number in recom:  # Check if random number is equal to curr_content or is already in recom
            random_number = random.randint(0, K-1)  # Generate a new random number
        recom.append(random_number)
    recom.sort()
    return recom

# Are all the recommended videos relevant to the current content being watched?
def all_relevant(N, curr_content, u, recom, u_min):
    all_relevant = True
    for i in range(N):
        if u[curr_content][recom[i]] < u_min: # Check if at least one is irrelevant
            all_relevant = False                # If one is irrelevant return false
            break
    return all_relevant

# User chooses the next video to watch
def user_chooses(K, N, q, u, u_min, a, recom, curr_content):
    if random.uniform(0, 1) > q:
        if all_relevant(N, curr_content, u, recom, u_min):  # If all recommended are relevant
            if random.uniform(0, 1) < a:  # User chooses one of the recommended
                new_content = recom[random.randint(0, N-1)]
            else:
                new_content = random.choice([x for x in range(K) if x != curr_content])
        else:                                               # If at least one recommended isn't relevant
            new_content = random.choice([x for x in range(K) if x != curr_content])
    else:
        new_content = -1
    return new_content

# Environment probability of user moving to next content given that he watches current content
# and he is being recommended the list recom
def env_prob(K, N, u, u_min, a, recom, curr_content, next_content):
    if curr_content==next_content:  # The user never watches the same content twice in a row
        env_prob = 0.0
    else:
        if all_relevant(N, curr_content, u, recom, u_min):  # If all recommended are relevant
            #print("All Relevant")
            if next_content in recom:     # If the next content was recommended
                env_prob = a/N + (1-a)/(K-1)
            else:                         # If the next content wasn't recommended
                env_prob = (1-a)/(K-1)
        else:                           # If at least one recommended isn't relevant
            #print("Not Relevant")
            env_prob = 1/(K-1)
    return env_prob

# All possible recommendations for state s
def possible_recom(K, N, s):
    items = list(range(K))
    items.remove(s)  # Remove 's' from the list of items

    def generate_combinations(curr_set, remaining_items):
        if len(curr_set) == N:
            return [curr_set]

        all_combinations = []
        for i, item in enumerate(remaining_items):
            new_set = curr_set + [item]
            new_remaining = remaining_items[i+1:]
            all_combinations.extend(generate_combinations(new_set, new_remaining))

        return all_combinations

    combinations = generate_combinations([], items)
    return list(map(list, combinations))

# All possible recommendations for all states
def all_states_possible_recom(K, N):

    all_combinations = []
    for s in range(K):
        state_combination = possible_recom(K, N, s)
        all_combinations.append(state_combination)

    return all_combinations

# Run "sessions_num" Monte Carlo episodes and compute the mean loss and the total loss
def monte_carlo_sessions(sessions_num, policy, reward, K, N, u_min, a, q, u):
    print("> Running Monte Carlo sessions...")
    total_loss = 0
    content_watched = 0
    for _ in range(sessions_num):
        curr_content = random.randint(0, K-1) # The first item viewed is random

        while True:
            recom = policy[curr_content]  # Recommend N items based on the policy
            curr_content = user_chooses(K, N, q, u, u_min, a, recom, curr_content)
            if curr_content == -1:
                break
            if reward[curr_content]==-1:
                total_loss += 1
            content_watched+=1

    if content_watched == 0:
        mean_loss = 0
    else:
        mean_loss = total_loss/content_watched

    return mean_loss, total_loss

# For each row, find the column indices whose value is above u_min
def find_values_above_min(matrix, u_min):
    result = []
    for i, row in enumerate(matrix):
        row_result = []
        for j, value in enumerate(row):
            if value > u_min:
                row_result.append(j)
        result.append(row_result)
    return result

# Suppress all print output by redirecting stdout to the null device
def blockPrint():
    sys.stdout = open(os.devnull, 'w')

# Restore print output to the original stdout
def enablePrint():
    sys.stdout = sys.__stdout__

# Convert a sequence of rows (e.g. tuples) into a list of lists
def create_matrix(data):
    return [list(row) for row in data]

# Convert every entry of a matrix to int (int() truncates toward zero, it does not round)
def create_tuple_list(matrix):
    return [[int(value) for value in row] for row in matrix]
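
As a quick sanity check on `env_prob`, the transition probabilities out of a state $i$ should sum to one over all next contents $j \neq i$. Reading the two branches of the function directly, when all $N$ recommended items are relevant:

$$
\sum_{j \neq i} P(j \mid i, \text{recom})
= N\left(\frac{a}{N} + \frac{1-a}{K-1}\right) + (K-1-N)\,\frac{1-a}{K-1}
= a + (1-a) = 1 ,
$$

and when at least one recommended item is irrelevant:

$$
\sum_{j \neq i} P(j \mid i, \text{recom}) = (K-1)\,\frac{1}{K-1} = 1 .
$$

Note that `env_prob` does not take $q$ as an argument, so these are probabilities conditional on the session continuing.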


## Environment (Content Catalogue):

In [17]:
K = 100      # Number of content items
u_min = 0.5 # Relevance threshold: two contents with u below this value are "irrelevant"
C = int(np.floor(0.2*K))   # Number of cached content items
u = create_symmetric_matrix(K)        # Create the relevance matrix
reward = random_cached_items(K, C) # Reward vector: 0 if the item is cached, -1 otherwise
#print_matrix(u,u_min)
print(reward)

[0, -1, -1, -1, -1, -1, -1, 0, -1, -1, -1, -1, -1, -1, -1, -1, 0, -1, -1, -1, 0, -1, 0, -1, 0, -1, -1, -1, -1, -1, -1, -1, -1, 0, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, -1, -1, -1, -1, 0, -1, -1, -1, -1, 0, -1, 0, -1, 0, -1, -1, -1, -1, 0, -1, -1, -1, -1, -1, 0, -1, -1, -1, -1, -1, 0, -1, -1, -1, -1, 0, -1, 0, -1, 0, -1]


## Environment (User Model):

In [18]:
N = 10         # Number of recommended content items
q = 0.05      # Probability of a session ending
a = 1         # Probability of choosing a recommended content item
round_num = 0 # Number of content items viewed during this session (avoids shadowing the built-in round)
history = []  # The history of content items viewed during this session
curr_content = random.randint(0, K-1) # The first item viewed is random
history.append(curr_content)  # Append first item in history

while True:
    #print("Round: "+str(round_num))
    #print("Current content: "+str(curr_content))
    #### THIS WILL BE REPLACED BY OUR ALGORITHMS ####
    recom = random_recommendation(K, N, curr_content)  # Recommend N random items
    #################################################
    #print("Recommendation: "+str(recom))
    curr_content = user_chooses(K, N, q, u, u_min, a, recom, curr_content)
    if curr_content == -1:
        break
    history.append(curr_content)
    round_num+=1

print(history)

[28, 34, 56, 31, 28, 16, 12, 68, 59, 51, 91, 85, 72, 80, 82, 72, 93, 90, 22, 90, 21, 44, 13, 97, 0, 71, 55, 6, 28, 99, 93, 79, 20, 46, 83, 20, 8]


In [19]:
# Number of combinations: N-item subsets of the other K-1 items
comb = math.comb(K-1, N)
print("There are " + str(comb) + " actions per state")
print("Q table should be a " + str(K) + "x" + str(comb) + " table")

There are 15579278510796 actions per state
Q table should be a 100x15579278510796 table
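
The count above is the number of ways to choose $N$ of the other $K-1$ items, $\binom{K-1}{N}$. To get a feel for how quickly this grows towards the sizes asked for in the assignment, a small sketch can tabulate it for a few $(K, N)$ pairs. The helper name `actions_per_state` and the particular pairs are chosen only for illustration.

```python
import math

def actions_per_state(K, N):
    """Number of unordered N-item recommendation lists that exclude the current item."""
    return math.comb(K - 1, N)

# Illustrative (K, N) pairs; the second one matches the setting used in this notebook
for K, N in [(100, 2), (100, 10), (1000, 5), (1000, 10)]:
    n = actions_per_state(K, N)
    print(f"K={K:5d}, N={N:2d}: {n:.3e} actions per state, {K * n:.3e} Q-table entries")
```

Even before reaching $K$ in the thousands, storing one Q-value per (state, action) pair is out of reach, which is why the rest of the notebook has to rely on approximation rather than a table.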


In [20]:
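theta_label = 0               # Label-network parameters
theta_main = random.random()  # Main-network parameters, randomly initialized
theta_pctr = random.random()  # Randomly initialized
D_training = []               # Training data set (not yet specified)

T = 10**6    # Number of iterations (** is exponentiation; ^ would be bitwise XOR)
M = 10**2    # Interval to update label network
for i in range(1, T + 1):
    if i % M == 0:
        theta_label = theta_main  # Periodically copy the main parameters into the label network
    for j in D_training:  # Assumed: iterate over the training examples
        pass  # TODO: per-example update step (not yet implemented)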
