# Predicting whether a user will play a game, and for how long

There are two tasks:
1. **Play prediction**: Given a (user, game) pair from ‘pairs_Played.csv’, predict whether the user would play the
game (0 or 1). Performance is measured by classification accuracy (the fraction of correct
predictions). The test set has been constructed so that exactly 50% of the pairs correspond to played
games and the other 50% do not.

2. **Time played prediction**: Predict how long a person will play a game, transformed as log2(hours + 1), for
those (user, game) pairs in ‘pairs_Hours.csv’. Performance is measured by the mean squared
error (MSE).
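As a quick illustration of the target transform and the two metrics named above, here is a minimal sketch. The helper names (`transform_hours`, `inverse_transform`, `mean_squared_error`, `accuracy`) are made up for this example and are not part of the assignment code.

```python
import math

def transform_hours(hours):
    """Map raw play time in hours to the prediction target log2(hours + 1)."""
    return math.log2(hours + 1)

def inverse_transform(value):
    """Recover raw hours from a transformed value."""
    return 2 ** value - 1

def mean_squared_error(targets, predictions):
    """Average squared difference between targets and predictions (task 2 metric)."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

def accuracy(labels, predictions):
    """Fraction of predictions that equal their label (task 1 metric)."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)

# 63.5 hours maps to roughly 6.011 on the transformed scale
print(transform_hours(63.5))
print(accuracy([1, 0, 1, 0], [1, 1, 1, 0]))
```

Because the target is logarithmic, an error of 1 on the transformed scale corresponds roughly to a factor of two in raw hours.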

data: https://cseweb.ucsd.edu/classes/fa23/cse258-a/files/assignment1.tar.gz

The data contains:
- **train.json.gz** 175,000 instances to be used for training. This data should be used for both the ‘play prediction’
and ‘time played prediction’ tasks. It is not necessary to use all observations for training, for example if
doing so proves too computationally intensive.
    - **userID** The ID of the user. This is a hashed user identifier from Steam.
    - **gameID** The ID of the game. This is a hashed game identifier from Steam.
    - **text** Text of the user’s review of the game.
    - **date** Date when the review was entered.
    - **hours** How many hours the user played the game.
    - **hours_transformed** log2(hours+1). *This transformed value is the one we are trying to predict.*

- **pairs_Played.csv** Pairs (userIDs and gameIDs) on which you are to predict whether a game was played. (This is for grading.)

- **pairs_Hours.csv** Pairs (userIDs and gameIDs) on which you are to predict time played. (This is for grading.)

- **baselines.py** A simple baseline for each task, described below.

In [1]:
import gzip
from collections import defaultdict
import math
import scipy.optimize
import scipy.sparse  # Needed for the sparse user-game matrix built for BPR
from sklearn import svm
import numpy
import string
import random
from sklearn import linear_model

In [2]:
def readGz(path):
    for l in gzip.open(path, 'rt'):
        yield eval(l)

In [3]:
def readJSON(path):
    f = gzip.open(path, 'rt', encoding="utf-8")
    f.readline()
    for l in f:
        d = eval(l)
        u = d['userID']
        g = d['gameID']
        yield u,g,d

In [4]:
answers = {}

In [5]:
# Some data structures that will be useful

In [13]:
!wget https://cseweb.ucsd.edu/classes/fa23/cse258-a/files/assignment1.tar.gz

--2023-12-20 02:11:31--  https://cseweb.ucsd.edu/classes/fa23/cse258-a/files/assignment1.tar.gz
Resolving cseweb.ucsd.edu (cseweb.ucsd.edu)... 132.239.8.30
Connecting to cseweb.ucsd.edu (cseweb.ucsd.edu)|132.239.8.30|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 36046793 (34M) [application/x-gzip]
Saving to: ‘assignment1.tar.gz’


2023-12-20 02:11:31 (152 MB/s) - ‘assignment1.tar.gz’ saved [36046793/36046793]



In [15]:
# run this cell to extract tar.gz files
!tar -xzvf "/content/assignment1.tar.gz" -C "/content/"

assignment1/
assignment1/baselines.py
assignment1/train.json.gz
assignment1/pairs_Played.csv
assignment1/pairs_Hours.csv


In [16]:
# Read the data into a list
allHours = []
for l in readJSON("assignment1/train.json.gz"):
    allHours.append(l)

In [17]:
len(allHours), allHours[0]

(174999,
 ('u70666506',
  'g49368897',
  {'userID': 'u70666506',
   'early_access': False,
   'hours': 63.5,
   'hours_transformed': 6.011227255423254,
   'found_funny': 1,
   'text': 'If you want to sit in queue for 10-20min and have 140 ping then this game is perfect for you :)',
   'gameID': 'g49368897',
   'user_id': '76561198030408772',
   'date': '2017-05-20'}))

## Task 1. Play Prediction

Since we don’t have access to the test labels, we’ll need to simulate validation/test sets of our own.
So, let’s split the training data (‘train.json.gz’) as follows:
1. Reviews 1-165,000 for training
2. Reviews 165,001-175,000 for validation

In [18]:
hoursTrain = allHours[:165000]
hoursValid = allHours[165000:]

In [19]:
# From baseline code
# This baseline just returns True if the item in question is ‘popular,’ using a threshold of
# the 50th percentile of popularity (totalPlayed/2).
gameCount = defaultdict(int)
totalPlayed = 0 # Total number of (user, game) play records in the training split

for u,g,_ in hoursTrain:
    gameCount[g] += 1
    totalPlayed += 1

mostPopular = [(gameCount[x], x) for x in gameCount]
mostPopular.sort() # Sort the games according to the number of times they have been played
mostPopular.reverse()

return1 = set()
count = 0
for ic, i in mostPopular:
    count += ic
    return1.add(i)
    if count > totalPlayed/2: break

In [20]:
# Helpful data structures
hoursPerUser = defaultdict(list)
hoursPerItem = defaultdict(list)
for u,g,d in hoursTrain:
    r = d['hours_transformed']
    hoursPerUser[u].append((g,r))
    hoursPerItem[g].append((u,r))

In [21]:
# Take a look at them
# Games and hours the user played
list(hoursPerUser.keys())[0], list(hoursPerUser.values())[0]

('u70666506',
 [('g49368897', 6.011227255423254),
  ('g48657523', 3.7224660244710908),
  ('g03488908', 4.733354340613827),
  ('g29657740', 3.2172307162206693),
  ('g02273341', 7.936048963676615),
  ('g75228197', 7.424586226251101),
  ('g78278951', 6.048759311919856),
  ('g89200271', 6.617651119427331),
  ('g29741733', 1.84799690655495),
  ('g23175113', 5.307428525192248),
  ('g15360200', 2.765534746362977),
  ('g42729032', 6.4561490346479955),
  ('g49157768', 2.9818526532897405)])

In [22]:
# Users who played the game, with their transformed hours
list(hoursPerItem.keys())[0], list(hoursPerItem.values())[0]

('g49368897',
 [('u70666506', 6.011227255423254),
  ('u70647035', 5.58796498888268),
  ('u73312807', 2.035623909730721),
  ('u02922169', 0.37851162325372983),
  ('u20470173', 1.4329594072761063),
  ('u38296376', 1.070389327891398),
  ('u90606033', 3.263034405833794),
  ('u48738997', 4.722466024471091),
  ('u69205212', 6.997744026059632),
  ('u51944505', 3.3074285251922473),
  ('u36112555', 2.263034405833794),
  ('u58254457', 0.765534746362977),
  ('u28301519', 0.2630344058337938),
  ('u40167309', 6.2890967024199895),
  ('u92778705', 4.035623909730721),
  ('u24964143', 5.371558862611963),
  ('u35200970', 8.909893083770042),
  ('u93364692', 6.407692648665677),
  ('u41303810', 3.867896463992655),
  ('u19119107', 1.0),
  ('u75222565', 1.5360529002402097),
  ('u49621496', 2.035623909730721),
  ('u96883621', 0.2630344058337938),
  ('u17759890', 6.330916878114617),
  ('u48087366', 6.632268215499513),
  ('u75656042', 3.548436624696042),
  ('u65793926', 1.3785116232537298),
  ('u31096754', 8.18

### Question 1

Although we have built a validation set, it only consists of positive samples. For this task we also need
examples of user/item pairs that weren’t played. For each entry (user,game) in the validation set, sample
a negative entry by randomly choosing a game that user hasn’t played. Evaluate the performance
(accuracy) of the baseline model on the validation set you have built (1 mark).

In [23]:
# Generate a negative set
userSet = set()
gameSet = set()
playedSet = set()

# Collect every observed (user, game) pair
for u,g,d in allHours:
    userSet.add(u)
    gameSet.add(g)
    playedSet.add((u,g))

lGameSet = list(gameSet)

notPlayed = set()
for u,g,d in hoursValid:
    g = random.choice(lGameSet)
    # Resample until the (user, game) pair is neither played nor already sampled
    while (u,g) in playedSet or (u,g) in notPlayed:
        g = random.choice(lGameSet)
    notPlayed.add((u,g))

playedValid = set()
for u,g,r in hoursValid:
    playedValid.add((u,g))

In [25]:
# Evaluate baseline strategy on the validation set
correct = 0
for (label,sample) in [(1, playedValid), (0, notPlayed)]:
    for (u,g) in sample:
        pred = 0
        if g in return1:
            pred = 1
        if pred == label:
            correct += 1

acc = correct / (len(playedValid) + len(notPlayed))
print(f"Accuracy: {acc}")

Accuracy: 0.6810681068106811


### Question 2

The existing ‘played prediction’ baseline just returns True if the item in question is ‘popular,’ using a
threshold of the 50th percentile of popularity (totalPlayed/2). Assuming that the ‘non-played’ test
examples are a random sample of user-game pairs, this threshold may not be the best one. See if you
can find a better threshold and report its performance on your validation set (1 mark).

In [27]:
# Set the threshold to be totalPlayed * 0.75
return1 = set()
count = 0
for ic, i in mostPopular:
    count += ic
    return1.add(i)
    if count > totalPlayed * 0.75: break

In [28]:
# Evaluate the improved strategy
correct = 0
for (label,sample) in [(1, playedValid), (0, notPlayed)]:
    for (u,b) in sample:
        pred = 0
        if b in return1:
            pred = 1
        if pred == label:
            correct += 1

In [29]:
acc2 = correct / (len(playedValid) + len(notPlayed))
print(f"Accuracy: {acc2}")

Accuracy: 0.6923192319231923
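The cells above try a single alternative threshold (0.75). A more systematic approach is to sweep several coverage fractions and keep the one with the best validation accuracy. The sketch below does this on toy data; `popular_set`, `popularity_accuracy`, `best_fraction` and all `toy_*` values are hypothetical names and numbers for illustration, not results from this dataset.

```python
from collections import Counter

def popular_set(game_counts, fraction):
    """Most popular games that together cover more than `fraction` of all plays."""
    total = sum(game_counts.values())
    chosen, covered = set(), 0
    # Sort by descending play count; break ties by game ID for determinism
    for game, count in sorted(game_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        covered += count
        chosen.add(game)
        if covered > total * fraction:
            break
    return chosen

def popularity_accuracy(chosen, positives, negatives):
    """Accuracy of predicting 'played' exactly when the game is in `chosen`."""
    correct = sum(g in chosen for _, g in positives)
    correct += sum(g not in chosen for _, g in negatives)
    return correct / (len(positives) + len(negatives))

def best_fraction(game_counts, positives, negatives, fractions):
    """Return (accuracy, fraction) for the best-performing coverage fraction."""
    scored = []
    for fraction in fractions:
        chosen = popular_set(game_counts, fraction)
        scored.append((popularity_accuracy(chosen, positives, negatives), fraction))
    return max(scored)

# Toy data (hypothetical): play counts plus labelled validation pairs
toy_counts = Counter({"gA": 50, "gB": 30, "gC": 15, "gD": 5})
toy_pos = [("u1", "gA"), ("u2", "gB"), ("u3", "gC"), ("u4", "gA")]
toy_neg = [("u1", "gD"), ("u2", "gD"), ("u3", "gA"), ("u4", "gD")]

acc, frac = best_fraction(toy_counts, toy_pos, toy_neg, [0.25, 0.5, 0.75, 0.9])
print(acc, frac)
```

On the real data the same sweep would use `gameCount`, `playedValid` and `notPlayed` in place of the toy inputs.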


### Question 3

A stronger baseline than the one provided might make use of the Jaccard similarity (or another similarity
metric). Given a pair (u, g) in the validation set, consider all training items g' that user u has played.
For each, compute the Jaccard similarity between g and g', i.e., between the set of users (in the training set)
who have played g and the set of users who have played g'. Predict ‘played’ if the maximum of these Jaccard
similarities exceeds a threshold (you may choose the threshold that works best). Report the performance on your
validation set (1 mark).

In [30]:
def Jaccard(s1, s2):
    numer = len(s1.intersection(s2))
    denom = len(s1.union(s2))
    if denom == 0:
        return 0
    return numer / denom

In [32]:
# Slow implementation, could easily be improved following the code from Lecture 7
correct = 0
p0, p1 = 0, 0 # Number of negative / positive predictions

# Iterate over the validation pairs; label 1 means played, 0 means not played
for (label, sample) in [(1, playedValid), (0, notPlayed)]:
    for (u, g) in sample:
        maxSim = 0
        # NOTE: hoursPerItem[g] holds (user, hours) tuples, so these sets compare
        # (user, hours) pairs rather than user IDs. Two games only overlap when the
        # same user logged identical hours on both, which is consistent with the
        # near-chance accuracy reported below.
        users = set(hoursPerItem[g])
        for g2,_ in hoursPerUser[u]: # Games played by user u in the training set
            sim = Jaccard(users, set(hoursPerItem[g2])) # Similarity between games g and g2
            if sim > maxSim:
                maxSim = sim
        pred = 0
        # Predict 'played' if some game the user already played is similar enough to g
        if maxSim > 0.025:
            pred = 1
            p1 += 1
        else:
            p0 += 1
        if pred == label:
            correct += 1

In [33]:
acc3 = correct / (len(playedValid) + len(notPlayed))
print(f"Accuracy: {acc3}")

Accuracy: 0.5003500350035004
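The Jaccard similarity in the question is defined over sets of users. Since `hoursPerItem[g]` stores `(user, hours)` tuples, a set built directly from it behaves very differently from a set of user IDs. The sketch below shows the difference on toy data; `jaccard`, `max_similarity` and the `toy_*` structures are illustrative names, and the numbers are not from the assignment dataset.

```python
def jaccard(s1, s2):
    """Jaccard similarity |s1 & s2| / |s1 | s2|, defined as 0 for two empty sets."""
    union = len(s1 | s2)
    return len(s1 & s2) / union if union else 0.0

# Same shape as hoursPerItem: game -> list of (user, transformed hours)
toy_hours_per_item = {
    "gA": [("u1", 6.0), ("u2", 3.5), ("u3", 1.0)],
    "gB": [("u1", 2.0), ("u2", 4.5), ("u4", 0.5)],
}

# Sets of (user, hours) tuples: overlap requires identical hours on both games
tuple_sets = {g: set(pairs) for g, pairs in toy_hours_per_item.items()}
# Sets of user IDs only: overlap means the same user played both games
user_sets = {g: {u for u, _ in pairs} for g, pairs in toy_hours_per_item.items()}

sim_tuples = jaccard(tuple_sets["gA"], tuple_sets["gB"])
sim_users = jaccard(user_sets["gA"], user_sets["gB"])
print(sim_tuples, sim_users)

def max_similarity(user, game, users_per_game, games_per_user):
    """Largest Jaccard similarity between `game` and any game `user` has played."""
    candidates = users_per_game.get(game, set())
    return max(
        (jaccard(candidates, users_per_game.get(g2, set()))
         for g2 in games_per_user.get(user, ())),
        default=0.0,
    )

toy_games_per_user = {"u3": ["gA"]}
print(max_similarity("u3", "gB", user_sets, toy_games_per_user))
```

Precomputing the user-ID sets once per game also avoids rebuilding a set for every validation pair. Using `.get` instead of indexing keeps a `defaultdict` from silently gaining empty entries for unseen users or games.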


### Question 4


Improve the above predictor by incorporating both a Jaccard-based threshold and a popularity based
threshold. Report the performance on your validation set.

In [37]:
# Predict 'played' if the Jaccard threshold is met OR the game has more than 60 training plays
correct = 0
p0, p1 = 0,0
for (label,sample) in [(1, playedValid), (0, notPlayed)]:
    for (u,g) in sample:
        maxSim = 0
        users = set(hoursPerItem[g])
        for g2,_ in hoursPerUser[u]:
            sim = Jaccard(users,set(hoursPerItem[g2]))
            if sim > maxSim:
                maxSim = sim
        pred = 0
        if maxSim > 0.025 or len(hoursPerItem[g]) > 60:
            pred = 1
            p1 += 1
        else:
            p0 += 1
        if pred == label:
            correct += 1

In [35]:
acc4 = correct / (len(playedValid) + len(notPlayed))
print(f"Accuracy: {acc4}")

Accuracy: 0.7018201820182018


### Assignment 1. Use Bayesian Personalized Ranking (Implicit)

Implicit BPR: https://benfred.github.io/implicit/api/models/cpu/bpr.html

Example code from lecture: https://cseweb.ucsd.edu/~jmcauley/pml/code/chap5.html

In [52]:
!pip install implicit
from implicit import bpr

Collecting implicit
  Downloading implicit-0.7.2-cp310-cp310-manylinux2014_x86_64.whl (8.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.9/8.9 MB[0m [31m19.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: implicit
Successfully installed implicit-0.7.2


In [56]:
# Map user and game IDs to consecutive integer indices for the BPR model
userIDs, gameIDs = {}, {}

for d in hoursTrain:
  u, g = d[0], d[1]
  if u not in userIDs: userIDs[u] = len(userIDs)
  if g not in gameIDs: gameIDs[g] = len(gameIDs)

nUsers, nGames = len(userIDs), len(gameIDs)
nUsers, nGames

(6710, 2437)

In [57]:
# Reverse lookup: integer index -> game ID
games = defaultdict(str)
for g, idx in gameIDs.items():
  games[idx] = g

In [58]:
# Convert dataset to sparse matrix. Only storing positive feedback instances (i.e., played game).
Xui = scipy.sparse.lil_matrix((nUsers, nGames))
for d in hoursTrain:
  Xui[userIDs[d[0]], gameIDs[d[1]]] = 1

Xui_csr = scipy.sparse.csr_matrix(Xui)

In [118]:
# Bayesian Personalized Ranking model with 100 latent factors
model = bpr.BayesianPersonalizedRanking(factors=100)

In [119]:
# Fit the model
model.fit(Xui_csr)

  0%|          | 0/100 [00:00<?, ?it/s]

In [117]:
# Get N recommendations for a particular user (the first one)
recommended = model.recommend(0, Xui_csr[0], N=10) # Return games' id and scores
recommended

(array([ 666, 1965,  272, 1078,  686,  252,  731,  580, 1140, 1524],
       dtype=int32),
 array([1.5485433, 1.5159506, 1.4521528, 1.4054072, 1.3455342, 1.3383043,
        1.32     , 1.3181386, 1.3127204, 1.2926066], dtype=float32))

In [79]:
import numpy as np
np.set_printoptions(suppress=True)

In [106]:
correct = 0
for (label, sample) in [(1, playedValid), (0, notPlayed)]:
    for (u, g) in sample:
        pred = 0
        recommended = model.recommend(userIDs[u], Xui_csr[userIDs[u]], N=nGames, filter_already_liked_items=False)
        # If the score of game g exceeds a threshold, predict that user u would play it
        if recommended[1][np.where(recommended[0]==gameIDs[g])] > 0.8:
            pred = 1

        if pred == label:
            correct += 1

acc = correct / (len(playedValid) + len(notPlayed))
print(f"acc: {acc}")

acc: 0.5492049204920492


Combine with popularity

In [120]:
correct = 0
for (label, sample) in [(1, playedValid), (0, notPlayed)]:
    for (u, g) in sample:
        pred = 0
        recommended = model.recommend(userIDs[u], Xui_csr[userIDs[u]], N=nGames, filter_already_liked_items=False)
        if recommended[1][np.where(recommended[0]==gameIDs[g])] > 0.8 or len(hoursPerItem[g]) > 70:
            pred = 1

        if pred == label:
            correct += 1

acc = correct / (len(playedValid) + len(notPlayed))
print(f"acc: {acc}")

acc: 0.7122212221222122
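The loops above call `model.recommend` over all games for every validation pair just to read off one score. In a factor model of this kind, the score of a single (user, game) pair is the dot product of the two latent vectors, so it can be computed directly. The sketch below uses plain Python lists as stand-ins for those vectors. It assumes, and this should be checked against the implicit documentation, that the fitted factors are available as `model.user_factors` and `model.item_factors` and that `recommend` ranks by this same dot product. `pair_score`, `predict_played` and the `toy_*` vectors are hypothetical.

```python
def pair_score(user_vector, item_vector):
    """Dot product of a user's and an item's latent factor vectors."""
    return sum(a * b for a, b in zip(user_vector, item_vector))

def predict_played(user_vector, item_vector, item_play_count,
                   score_threshold=0.8, popularity_threshold=70):
    """Predict 1 if the latent score or the game's popularity exceeds its threshold."""
    is_similar = pair_score(user_vector, item_vector) > score_threshold
    is_popular = item_play_count > popularity_threshold
    return int(is_similar or is_popular)

# Hypothetical factor vectors standing in for rows of the fitted factor matrices
toy_user = [1.0, 2.0, 0.5]
toy_item = [0.5, 0.25, 2.0]

print(pair_score(toy_user, toy_item))
print(predict_played(toy_user, toy_item, item_play_count=10))
```

The default thresholds mirror the values 0.8 and 70 used in the cell above; both would need tuning on the validation set.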


The dataset is strongly popularity-driven: most plays go to popular games, so taking popularity into account matters. Here, adding the popularity rule raises validation accuracy from about 0.549 to about 0.712.

The number of latent factors in the BPR model also affects the result.

### Question 5 (For assignment 1)

To run our model on the test set, we’ll have to use the file ‘pairs_Played.csv’ to find the userID/gameID
pairs about which we have to make predictions. Using that data, run the above model and upload your
solution to the Assignment 1 Gradescope. If you’ve already uploaded a better solution to Gradescope,
that’s fine too!

In [None]:
predictions = open("HWpredictions_Played.csv", 'w')
for l in open("assignment1/pairs_Played.csv"): # Path inside the extracted archive
    if l.startswith("userID"):
        predictions.write(l)
        continue
    u,g = l.strip().split(',')

    # Logic...
    maxSim = 0
    users = set(hoursPerItem[g])
    for g2,_ in hoursPerUser[u]:
        sim = Jaccard(users,set(hoursPerItem[g2]))
        if sim > maxSim:
            maxSim = sim
    pred = 0
    if maxSim > 0.025 or len(hoursPerItem[g]) > 60:
        pred = 1

    _ = predictions.write(u + ',' + g + ',' + str(pred) + '\n')

predictions.close()

In [None]:
answers['Q5'] = "I confirm that I have uploaded an assignment submission to gradescope"

## Task 2. Time played prediction

Let’s start by building our training/validation sets much as we did for the first task. This time building a
validation set is more straightforward: you can simply use part of the data for validation, and do not need to
randomly sample non-played users/games.
Note that you should use the ‘hours_transformed’ field, which is computed as log2(hours + 1). This
is the quantity we are trying to predict.

In [38]:
trainHours = [r[2]['hours_transformed'] for r in hoursTrain]
globalAverage = sum(trainHours) * 1.0 / len(trainHours)

In [39]:
validMSE = 0
for u,g,d in hoursValid:
    r = d['hours_transformed']
    se = (r - globalAverage)**2
    validMSE += se

validMSE /= len(hoursValid)

print("Validation MSE (average only) = " + str(validMSE))

Validation MSE (average only) = 5.316020858088501



### Question 6

Fit a predictor of the form

 - time(user, item) ≃ α + β_user + β_item,

by fitting the mean and the two bias terms as described in the lecture notes. Use a regularization
parameter of λ = 1. Report the MSE on the validation set.

Page 83: https://cseweb.ucsd.edu/classes/fa23/cse258-a/slides/recommendation.pdf
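For reference, the coordinate updates implemented in the code for this question can be written out as below. The notation is introduced here for illustration: $R_{u,g}$ is the `hours_transformed` value for user $u$ and game $g$, $I_u$ is the set of games user $u$ played in the training set, $U_g$ is the set of users who played game $g$, and $N$ is the number of training records. The first line is the regularized squared-error objective from which updates of this form are usually derived.

```latex
\begin{aligned}
\text{objective} &= \sum_{(u,g) \in \text{train}} \bigl(\alpha + \beta_u + \beta_g - R_{u,g}\bigr)^2
  + \lambda \Bigl[\sum_u \beta_u^2 + \sum_g \beta_g^2\Bigr] \\
\alpha &\leftarrow \frac{1}{N} \sum_{(u,g) \in \text{train}} \bigl(R_{u,g} - (\beta_u + \beta_g)\bigr) \\
\beta_u &\leftarrow \frac{\sum_{g \in I_u} \bigl(R_{u,g} - (\alpha + \beta_g)\bigr)}{\lambda + |I_u|} \\
\beta_g &\leftarrow \frac{\sum_{u \in U_g} \bigl(R_{u,g} - (\alpha + \beta_u)\bigr)}{\lambda + |U_g|}
\end{aligned}
```

Each update sets one group of parameters to its optimum with the others held fixed, so repeating the three steps in turn should not increase the objective when it is computed with the summed (not averaged) squared error.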

In [123]:
betaU = {}
betaI = {}
for u in hoursPerUser:
    betaU[u] = 0

for g in hoursPerItem:
    betaI[g] = 0

In [124]:
alpha = globalAverage # Could initialize anywhere, this is a guess
alpha

3.716088074007024

In [125]:
def MSE(y, ypred):
    diffs = [(a-b)**2 for (a,b) in zip(y,ypred)]
    return sum(diffs) / len(diffs)

In [126]:
# hoursPerItem #  'g36137304': {0.13750352374993502,
              # 0.2630344058337938,
              # 0.5849625007211562,
              # 0.6780719051126377,
              # 0.925999418556223,
              # 1.4329594072761063,
              # ...

In [127]:
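def iterate(lamb):
    # One round of coordinate updates for alpha, betaU and betaI
    # with regularization strength lamb.
    ## Update alpha
    # NOTE: there is no `global alpha` declaration, so this assignment creates a
    # local variable; the module-level alpha used for validation keeps its initial value.
    newAlpha = 0
    for u,g,d in hoursTrain:
        r = d['hours_transformed']
        newAlpha += r - (betaU[u] + betaI[g])
    alpha = newAlpha / len(hoursTrain)

    ## Update BetaU
    for u in hoursPerUser:
        newBetaU = 0
        for g,r in hoursPerUser[u]:
            newBetaU += r - (alpha + betaI[g])
        betaU[u] = newBetaU / (lamb + len(hoursPerUser[u]))

    ## Update BetaI
    for g in hoursPerItem:
        newBetaI = 0
        for u,r in hoursPerItem[g]:
            newBetaI += r - (alpha + betaU[u])
        betaI[g] = newBetaI / (lamb + len(hoursPerItem[g]))

    ## Calculate MSE
    mse = 0
    for u,g,d in hoursTrain:
        r = d['hours_transformed']
        prediction = alpha + betaU[u] + betaI[g]
        mse += (r - prediction)**2
    regularizer = 0
    for u in betaU:
        regularizer += betaU[u]**2
    for g in betaI:
        regularizer += betaI[g]**2
    mse /= len(hoursTrain)
    # Returns the training MSE and MSE + lamb * regularizer. The updates above minimize
    # the *sum* of squared errors plus lamb * regularizer, so the second value mixes
    # scales and need not decrease at every iteration.
    return mse, mse + lamb*regularizer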

In [128]:
mse,objective = iterate(1)
newMSE,newObjective = iterate(1)
iterations = 2

In [129]:
## See if it will converge
while iterations < 10 or objective - newObjective > 0.01:
    mse, objective = newMSE, newObjective
    newMSE, newObjective = iterate(1)
    iterations += 1
    print("Objective after "
        + str(iterations) + " iterations = " + str(newObjective))
    print("MSE after "
        + str(iterations) + " iterations = " + str(newMSE))

Objective after 3 iterations = 6916.291258826528
MSE after 3 iterations = 2.756414053005335
Objective after 4 iterations = 6935.23715550776
MSE after 4 iterations = 2.755604333875777
Objective after 5 iterations = 6924.6062017768245
MSE after 5 iterations = 2.755486661616215
Objective after 6 iterations = 6905.833993834695
MSE after 6 iterations = 2.755457716152593
Objective after 7 iterations = 6885.30918519728
MSE after 7 iterations = 2.7554456693587497
Objective after 8 iterations = 6864.742698779786
MSE after 8 iterations = 2.7554377123048535
Objective after 9 iterations = 6844.576219662253
MSE after 9 iterations = 2.755430906995692
Objective after 10 iterations = 6824.91582215507
MSE after 10 iterations = 2.7554245073468486
Objective after 11 iterations = 6805.779255606049
MSE after 11 iterations = 2.7554183159504917
Objective after 12 iterations = 6787.161156344411
MSE after 12 iterations = 2.755412278370264
Objective after 13 iterations = 6769.050270855048
MSE after 13 iteration

In [130]:
# Compute the MSE on the validation set
validMSE = 0
for u,g,d in hoursValid:
    r = d['hours_transformed']
    bu = 0
    bi = 0
    if u in betaU:
        bu = betaU[u]
    if g in betaI:
        bi = betaI[g]
    prediction = alpha + bu + bi
    validMSE += (r - prediction)**2

validMSE /= len(hoursValid)
print("Validation MSE = " + str(validMSE))

Validation MSE = 3.3620657269506733


### Assignment 1: Implement early stopping

In [131]:
def get_validMSE(alpha, betaU, betaI):
  # Validation MSE; users or games unseen in training get a bias of 0
  validMSE = 0
  for u,g,d in hoursValid:
      r = d['hours_transformed']
      prediction = alpha + betaU.get(u, 0) + betaI.get(g, 0)
      validMSE += (r - prediction)**2

  validMSE /= len(hoursValid)
  return validMSE

In [132]:
def iterate(lambU, lambI):
    # Here we use different lamb for betaU and betaI
    ## Update alpha
    newAlpha = 0
    for u,g,d in hoursTrain:
        r = d['hours_transformed']
        newAlpha += r - (betaU[u] + betaI[g])
    alpha = newAlpha / len(hoursTrain)

    ## Update BetaU
    for u in hoursPerUser:
        newBetaU = 0
        for g,r in hoursPerUser[u]:
            newBetaU += r - (alpha + betaI[g])
        betaU[u] = newBetaU / (lambU + len(hoursPerUser[u]))

    ## Update BetaI
    for g in hoursPerItem:
        newBetaI = 0
        for u,r in hoursPerItem[g]:
            newBetaI += r - (alpha + betaU[u])
        betaI[g] = newBetaI / (lambI + len(hoursPerItem[g]))

    ## Calculate MSE
    mse = 0
    for u,g,d in hoursTrain:
        r = d['hours_transformed']
        prediction = alpha + betaU[u] + betaI[g]
        mse += (r - prediction)**2
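    # Weight each group of biases by its own regularization strength
    regularizer = 0
    for u in betaU:
        regularizer += lambU * betaU[u]**2
    for g in betaI:
        regularizer += lambI * betaI[g]**2
    mse /= len(hoursTrain) # Average over the training records summed above
    return mse, mse + regularizer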

In [133]:
alpha = globalAverage # Could initialize anywhere, this is a guess
betaU = {}
betaI = {}
for u in hoursPerUser:
    betaU[u] = 0

for g in hoursPerItem:
    betaI[g] = 0

lambU, lambI = 7.9, 1.2
newMSE,newObjective = iterate(lambU, lambI)
iterations = 1
validMSE = get_validMSE(alpha, betaU, betaI)
print(f"validMSE: {validMSE}")
while iterations < 10 or objective - newObjective > 0.01:
    mse, objective = newMSE, newObjective
    newMSE, newObjective = iterate(lambU, lambI) # Separate regularization for users and games
    iterations += 1

    # Early stopping: stop iterating once the validation MSE no longer improves
    new_validMSE = get_validMSE(alpha, betaU, betaI)
    print(f"validMSE: {new_validMSE}")
    if validMSE - new_validMSE < 0.000000001:
      print(f"validMSE - new_validMSE: {validMSE} - {new_validMSE} = {validMSE - new_validMSE}")
      break
    validMSE = new_validMSE
    # print("MSE after " + str(iterations) + " iterations = " + str(newMSE))

validMSE: 3.096501336157259
validMSE: 2.986833884688678
validMSE: 2.9851903043698162
validMSE: 2.9855709275170716
validMSE - new_validMSE: 2.9851903043698162 - 2.9855709275170716 = -0.0003806231472553989


In [135]:
print(f"Valid MSE: {validMSE}")

Valid MSE: 2.9851903043698162


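Early stopping helps prevent overfitting: here the validation MSE bottoms out at about 2.9852 after the third update and rises on the fourth, so training stops there.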
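One detail of the loop above: when it breaks, `betaU` and `betaI` have already been updated by the iteration that made the validation MSE worse, so the reported best MSE belongs to the previous iteration's parameters. A common remedy is to snapshot the parameters at each improvement and restore the best snapshot on exit. The sketch below shows that pattern on a toy one-parameter problem; `train_with_early_stopping` and the `toy_*` helpers are hypothetical names, not part of the notebook.

```python
import copy

def train_with_early_stopping(step, evaluate, get_state, set_state,
                              max_iterations=50, min_improvement=1e-9):
    """Run `step` until the validation score stops improving, then restore the best state."""
    best_score = evaluate()
    best_state = copy.deepcopy(get_state())
    for _ in range(max_iterations):
        step()
        score = evaluate()
        if best_score - score < min_improvement:
            set_state(best_state)  # Roll back the update that did not help
            break
        best_score = score
        best_state = copy.deepcopy(get_state())
    return best_score

# Toy problem: x moves in unit steps, validation error is smallest near x = 2.6
toy = {"x": 0.0}

def toy_step():
    toy["x"] += 1.0

def toy_eval():
    return (toy["x"] - 2.6) ** 2

def toy_get():
    return toy

def toy_set(saved):
    toy.clear()
    toy.update(saved)

best = train_with_early_stopping(toy_step, toy_eval, toy_get, toy_set)
print(best, toy["x"])
```

For the bias model, the state would be the tuple of `alpha`, `betaU` and `betaI`, and `evaluate` would compute the validation MSE.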

### Question 7

Report the user and game IDs that have the largest and smallest values of β.

In [48]:
betaUs = [(betaU[u], u) for u in betaU]
betaIs = [(betaI[i], i) for i in betaI]
betaUs.sort()
betaIs.sort()

print("Maximum betaU = " + str(betaUs[-1][1]) + ' (' + str(betaUs[-1][0]) + ')')
print("Maximum betaI = " + str(betaIs[-1][1]) + ' (' + str(betaIs[-1][0]) + ')')
print("Minimum betaU = " + str(betaUs[0][1]) + ' (' + str(betaUs[0][0]) + ')')
print("Minimum betaI = " + str(betaIs[0][1]) + ' (' + str(betaIs[0][0]) + ')')

Maximum betaU = u60898505 (5.828316739259239)
Maximum betaI = g17604638 (5.495973739724736)
Minimum betaU = u13037838 (-3.0057870148761894)
Minimum betaI = g84397720 (-2.809328679823356)


### Question 8

Find a better value of λ using your validation set. Report the value you chose, its MSE, and upload your
solution to the Assignment 1 gradescope.

In [49]:
# Try lambda = 5 (the biases are not re-initialized, so this continues from their current values)
iterations = 1
while iterations < 10 or objective - newObjective > 0.01:
    mse, objective = newMSE, newObjective
    newMSE, newObjective = iterate(5) # Use lambda = 5
    iterations += 1
    print("Objective after " + str(iterations) + " iterations = " + str(newObjective))
    print("MSE after " + str(iterations) + " iterations = " + str(newMSE))

Objective after 2 iterations = 23723.40581939076
MSE after 2 iterations = 2.7788624145856393
Objective after 3 iterations = 23510.585432916247
MSE after 3 iterations = 2.77950918417825
Objective after 4 iterations = 23487.108448891875
MSE after 4 iterations = 2.779632986564118
Objective after 5 iterations = 23482.603926859705
MSE after 5 iterations = 2.7796579991997605
Objective after 6 iterations = 23481.074962137227
MSE after 6 iterations = 2.7796657598050394
Objective after 7 iterations = 23480.130273496496
MSE after 7 iterations = 2.7796701526871908
Objective after 8 iterations = 23479.338493177987
MSE after 8 iterations = 2.7796737556337834
Objective after 9 iterations = 23478.61426569989
MSE after 9 iterations = 2.7796770737497773
Objective after 10 iterations = 23477.938819973006
MSE after 10 iterations = 2.7796802099553215
Objective after 11 iterations = 23477.30667015269
MSE after 11 iterations = 2.7796831871553858
Objective after 12 iterations = 23476.715002575234
MSE after 1

In [50]:
validMSE = 0
for u,g,d in hoursValid:
    r = d['hours_transformed']
    bu = 0
    bi = 0
    if u in betaU:
        bu = betaU[u]
    if g in betaI:
        bi = betaI[g]
    prediction = alpha + bu + bi
    validMSE += (r - prediction)**2

validMSE /= len(hoursValid)
print("Validation MSE = " + str(validMSE))

Validation MSE = 3.3246506094357864
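The cell above tries a single alternative value (λ = 5) without re-initializing the biases. A more systematic search fits a fresh model for each candidate λ and compares validation MSE. The sketch below does this on toy data. It is a self-contained variant rather than the notebook's own `iterate`: `fit_bias_model`, `validation_mse`, `sweep_lambda` and the `toy_*` records are hypothetical, and here `alpha` is returned from the fit so that the value used for prediction is the fitted one.

```python
from collections import defaultdict

def fit_bias_model(train, lamb, iterations=20):
    """Fit alpha + beta_u + beta_g by coordinate updates; train is a list of (user, game, target)."""
    per_user, per_game = defaultdict(list), defaultdict(list)
    for u, g, r in train:
        per_user[u].append((g, r))
        per_game[g].append((u, r))
    alpha = sum(r for _, _, r in train) / len(train)
    beta_u = {u: 0.0 for u in per_user}
    beta_g = {g: 0.0 for g in per_game}
    for _ in range(iterations):
        alpha = sum(r - (beta_u[u] + beta_g[g]) for u, g, r in train) / len(train)
        for u, plays in per_user.items():
            beta_u[u] = sum(r - (alpha + beta_g[g]) for g, r in plays) / (lamb + len(plays))
        for g, plays in per_game.items():
            beta_g[g] = sum(r - (alpha + beta_u[u]) for u, r in plays) / (lamb + len(plays))
    return alpha, beta_u, beta_g

def validation_mse(model, valid):
    """MSE on held-out records; unseen users or games get a bias of 0."""
    alpha, beta_u, beta_g = model
    errors = [(r - (alpha + beta_u.get(u, 0.0) + beta_g.get(g, 0.0))) ** 2
              for u, g, r in valid]
    return sum(errors) / len(errors)

def sweep_lambda(train, valid, candidates):
    """Return the best (mse, lambda) pair and the full list of results."""
    results = [(validation_mse(fit_bias_model(train, lamb), valid), lamb)
               for lamb in candidates]
    return min(results), results

# Toy records (hypothetical): (user, game, transformed hours)
toy_train = [("u1", "gA", 6.0), ("u1", "gB", 2.0), ("u2", "gA", 5.0),
             ("u2", "gC", 1.0), ("u3", "gB", 3.0), ("u3", "gC", 2.0)]
toy_valid = [("u1", "gC", 2.5), ("u2", "gB", 2.0), ("u4", "gA", 5.5)]

best, results = sweep_lambda(toy_train, toy_valid, [0.1, 1.0, 10.0])
print(best)
```

On the real data the candidates could be spread on a log scale first and then refined around the best value, including separate values for users and games as in the early-stopping cell.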
