# Introduction

In this notebook, I will explore matrix factorization using singular value decomposition and its utility in designing simple recommender systems

In [1]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

# Data: Twitch Stream Viewers

Below is a dataset of users consuming streaming content on Twitch. We retrieved all streamers, and all users connected in their respective chats, every 10 minutes during 43 days. `100k.csv` is a subset of 100k users. For the purposes of this demonstration, we shall only use a subset of this data. It contains the following attributes:
- User ID (anonymized)
- Stream ID
- Streamer username
- Time start
- Time stop

Start and stop times are provided as integers and represent periods of 10 minutes. Stream ID could be used to retrieve a single broadcast segment from a streamer (not used in our work).

<b>Data citation:<br></b>
Recommendation on Live-Streaming Platforms: Dynamic Availability and Repeat Consumption<br>
Jérémie Rappaz, Julian McAuley and Karl Aberer<br>
RecSys, 2021

The dataset can be found here: https://drive.google.com/drive/folders/1BD8m7a8m7onaifZay05yYjaLxyVV40si/

In [2]:
df = pd.read_csv('100k_a.csv', names=['UserID', 'StreamID', 'Streamer', 'Start', 'Stop']).head(1000000)
df

Unnamed: 0,UserID,StreamID,Streamer,Start,Stop
0,1,33842865744,mithrain,154,156
1,1,33846768288,alptv,166,169
2,1,33886469056,mithrain,587,588
3,1,33887624992,wtcn,589,591
4,1,33890145056,jrokezftw,591,594
...,...,...,...,...,...
999995,33132,34130329216,sheriffeli,3158,3159
999996,33132,34139249856,grizzly_ben,3268,3269
999997,33132,34140588288,kotton,3305,3308
999998,33132,34150050464,iipog,3412,3431


## Recommendation Task

We wish to recommend new Twitch streamers to users from their Twitch stream video consumption habits. To achieve this task,  we shall engineer a new feature `Time` from the `Start` and `Stop` attributes

In [3]:
df['Time'] = df['Stop'] - df['Start']
df = df[['UserID', 'Streamer', 'StreamID', 'Time']]
df

Unnamed: 0,UserID,Streamer,StreamID,Time
0,1,mithrain,33842865744,2
1,1,alptv,33846768288,3
2,1,mithrain,33886469056,1
3,1,wtcn,33887624992,2
4,1,jrokezftw,33890145056,3
...,...,...,...,...
999995,33132,sheriffeli,34130329216,1
999996,33132,grizzly_ben,34139249856,1
999997,33132,kotton,34140588288,3
999998,33132,iipog,34150050464,19


# Data Wrangling

We shall only recommend streamers who stream frequently. To achieve this, we shall count the number of `StreamID`'s by `Streamer`

In [4]:
streams = df.groupby('Streamer').agg({'StreamID': 'count'})
streams.columns = ['Streams']
streams = streams.reset_index().sort_values('Streams')
df = df.merge(streams, on='Streamer')

However, if a user has not spenT any time watching streams, the corresponding matrix factorisation rating will be 0. SO there is a cold start problem. Therefore, we shall only recommend streamers to users who have watched enough streams

In [5]:
streams_watched = df.groupby('UserID').agg({'StreamID': 'count'})
streams_watched.columns = ['Streams Watched']
streams_watched = streams_watched.reset_index().sort_values('Streams Watched')
df = df.merge(streams_watched, on='UserID')
df

Unnamed: 0,UserID,Streamer,StreamID,Time,Streams,Streams Watched
0,1,mithrain,33842865744,2,2215,50
1,1,mithrain,33886469056,1,2215,50
2,1,mithrain,34060922080,1,2215,50
3,1,mithrain,34077379792,2,2215,50
4,1,mithrain,34157036272,1,2215,50
...,...,...,...,...,...,...
999995,33033,foolz0104,34024157728,5,1,9
999996,33033,royalstsp,34048345920,5,1,9
999997,33033,mutombo_gold,34163700464,4,1,9
999998,33033,rolfmans,34166886416,6,1,9


To achieve this, we shall eliminate the bottom quartile of streamers and stream watchers

In [6]:
df.describe()

Unnamed: 0,UserID,StreamID,Time,Streams,Streams Watched
count,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0
mean,16649.328595,34129870000.0,3.124577,1135.873188,67.445316
std,9503.445895,168134100.0,4.214894,2663.644648,50.291451
min,1.0,33802120000.0,1.0,1.0,5.0
25%,8420.0,33988900000.0,1.0,25.0,27.0
50%,16662.0,34130370000.0,1.0,186.0,57.0
75%,24811.0,34273490000.0,3.0,893.0,98.0
max,33132.0,34416420000.0,83.0,15309.0,309.0


In [7]:
df = df[(df['Streams'] > 25) & (df['Streams Watched'] > 27)]

# Generate Recommendation Matrix

We wish to capture the relationship between user's stream consumption habits and the streamers they watch. We can achieve this by generating a sparse matrix of `UserID`s and `Streamer`s where the values are `Time`

In [11]:
def overall_time(watch_time, norm_method = 'std'):
    M = watch_time.groupby(['UserID', 'Streamer']).agg({'Time': 'sum'}).reset_index()
    M = M.pivot(index = 'UserID', columns = 'Streamer', values = 'Time')
    
    # optional normalization
    if norm_method == 'std':
        M = (M - M.mean()) / M.std()
    elif norm_method == 'norm':
        M = (M - M.min()) / (M.max() - M.min())
    
    M = M.fillna(0)
    
    return M.astype('float32')

In [12]:
M = overall_time(df)
M

Streamer,0nuqtive,10000days,109ace,1adrianaries1,1buhorpaduha1,1drakonz,1mpala,1pvcs,1tsz1rk,1ukeofficial,...,zugorow,zuliezulie,zulu,zulzorander,zumi,zuthar13,zvonimirtv,zwag,zzamtiger0310,zzzerrr
UserID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33108,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
33116,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
33117,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
33129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We have 11,303 unique users and 4699 unique streamers in our sparse matrix

In [13]:
user_ids = M.index
streamers = M.columns

# Matrix Factorization with SVD

Suppose we have a (pivot table) sparse matrix M of all users and all contents and values are time spent. We can factorize that matrix `M` $(n \times m)$ into three matrices `u` $(n \times n)$, `s` $(n \times m)$, and `v` $(m \times m)$ using SVD. Implicitly, `u` contains generic features that describe users and `v` contains features describing streamers, while `s` contains feature weights along the diagonal.

We can then multiply any user row in `u` with its associated weight in `s` and any streamer column in `v` transpose to determine a score (1x1) for that streamer for said user. We can then use this score to determine and sort suggestions.

This is an alternative to content-based filtering where we decide explicitly the features of the content. Here, the features describing the users and streamers are implicit in the time spent watching the streams.

In [14]:
def factorize_matrix(M):
    return np.linalg.svd(M, full_matrices=True)

def recommendation(u, s, v, user_num, top_n):
    true_s = np.zeros((u.shape[1], v.shape[0]))
    true_s[:s.size, :s.size] = np.diag(s)

    user = np.dot(u[user_num], true_s)
    ratings = np.dot(user, v)

    top_n_indices = np.argsort(ratings)[-top_n:] - 1

    return top_n_indices, ratings

In [15]:
u, s, v = factorize_matrix(M)

# Results

Let us closely examine the generated recommendations for a particular user. Expect a long runtime for the cell below

In [19]:
indices, ratings = recommendation(u, s, v, 1, 5)

In [20]:
df[df['UserID'] == 1]

Unnamed: 0,UserID,Streamer,StreamID,Time,Streams,Streams Watched
0,1,mithrain,33842865744,2,2215,50
1,1,mithrain,33886469056,1,2215,50
2,1,mithrain,34060922080,1,2215,50
3,1,mithrain,34077379792,2,2215,50
4,1,mithrain,34157036272,1,2215,50
5,1,mithrain,34195515568,1,2215,50
6,1,mithrain,34233524528,1,2215,50
7,1,mithrain,34405646144,1,2215,50
8,1,alptv,33846768288,3,104,50
9,1,wtcn,33887624992,2,1526,50


It appears that this user enjoys watching Turkish FIFA and live gameplay Twitch streamers

The algorithm has correctly identified this trend and is recommending more Turkish gameplay channels as well as a few Turkish musicians. This is impressive considering that there was no feature engineering involved!

In [21]:
streamers[indices]

Index(['yanni', 'jungjil', 'handiofiblood', 'saddota2tv', 'sakonoko_game'], dtype='object', name='Streamer')

In [22]:
ratings[indices]

array([ 1.52197052e-09,  6.36350059e-10,  3.11045193e-10, -2.42000280e-09,
       -2.39867151e-10])

# Discussion

Matrix factorization is an alternative to content-based filtering where we decide explicitly the features of the content. Here, the features describing the users and streamers are implicit in the time spent watching the streams. This makes it very attractive as as a recommender system as it eliminates the feature engineering process.

However, This huge sparse matrix requires a lot of computing power and time to factorize. Parallelising this process is a necessity with large datasets.

Therefore, although matrix factorisation is one of the easiest ways to generate recommendations using some simple linear algebra concepts, it poses difficulties dealing with large amounts of data. 