# Implementation of the PLSI algorithm using Spark

In [1]:
import pandas as pd
import time
import numpy as np
import matplotlib.pyplot as plt
import findspark
findspark.init("spark-2.2.3-bin-hadoop2.6")
import pyspark
from numpy import random


In [2]:
from pyspark import sql

## 1. Prepare the environment

### Import the data

We will implement the PLSI algorithm on the MovieLens dataset. To simplify the task, we start with a reduced version of the dataset ("ratings_short"), which contains 100 836 observations.

In [3]:
ratings_data = pd.read_csv("ratings.csv")
ratings_data.head()

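|   | user_id | user_username | movie_id | rating |
|---|---------|---------------|----------|--------|
| 0 | 2 | William | 1768 | 1 |
| 1 | 3 | James | 615 | 3 |
| 2 | 7 | Joseph | 82 | 3 |
| 3 | 7 | Joseph | 532 | 3 |
| 4 | 8 | Thomas | 698 | 3 |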


**Description of the dataset:**
- user_id: identifies the user
- user_username: the user's name (not used by the algorithm)
- movie_id: identifies the movie
- rating: the rating the user gave to the corresponding movie. Ratings range from 1 to 5, but here we only use the seen / not seen information to provide movie recommendations.
- timestamp: we will not be using this column (it does not appear in the reduced file shown above)

More information about the movies is available in the "movies.csv" dataset: the movieId gives us access to the corresponding movie information, such as the title and the genres.

In [4]:
movies = pd.read_csv("movies.csv")
movies.head()

|   | movieId | title | genres |
|---|---------|-------|--------|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [112]:
movies.shape

(9742, 3)

### Create a Spark environment

In [5]:
sc = pyspark.SparkContext()

To be able to implement the PLSI algorithm in Spark, we will need to transform the dataset into an RDD, and then perform pyspark operations on it. 

In [116]:
rdd = sc.textFile("ratings.csv")

#Remove header line 
header = rdd.first()
rdd = rdd.filter(lambda x: x != header)

rdd.collect()[0:10]

['2,William,1768,1',
 '3,James,615,3',
 '7,Joseph,82,3',
 '7,Joseph,532,3',
 '8,Thomas,698,3',
 '10,Robert,1693,3',
 '11,Edward,615,1',
 '18,David,1,3',
 '18,David,28,3',
 '18,David,1596,5']

In [110]:
rdd.count()

295

## 2. Probabilistic Latent Semantic Indexing algorithm (PLSI)

The PLSI algorithm implemented here follows the description by Das et al. of the Google News recommendation system. It is based on the following model:
- u (users) and s (movies) are random variables
- The relationship between users and movies is learned by modeling the joint distribution of users and items as a mixture distribution
- To capture this relationship, we introduce a hidden (latent) variable z, which can be read as a community: a group of users with similar preferences and of movies with similar genres.

For each (user, movie) couple, we compute p(s|u) = ∑ p(s|z) p(z|u) (summing over z), which is the probability that a given user sees a given movie. For each community z, we multiply the probability that movie s is seen within community z by the probability that user u belongs to community z, then we sum these products over all communities.
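Written out in LaTeX, the model reads as below. The two normalisation constraints are not stated in the text above; they are added here under the usual assumption that p(s|z) and p(z|u) are conditional probability distributions.

```latex
p(s \mid u) = \sum_{z} p(s \mid z)\, p(z \mid u),
\qquad
\sum_{s} p(s \mid z) = 1 \ \text{for every } z,
\qquad
\sum_{z} p(z \mid u) = 1 \ \text{for every } u.
```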

### Going through the algorithm : main steps

**INITIALISATION**

**E-STEP - Compute q( z | (u,s) ) : the probability that the (user, movie) couple belongs to the class z**
This step is first initialized at random:
- To each couple (u,s), assign each possible community
- Example with number of classes = 2: the lines (Marie, Star Wars) and (Gaëlle, Matrix) will give (Marie, Star Wars, 1), (Marie, Star Wars, 2), (Gaëlle, Matrix, 1), (Gaëlle, Matrix, 2)
- To each line, assign a random probability, normalised so that the probabilities of one couple sum to 1 over its classes. This random probability corresponds to q( z | (u,s) ). For example, the line (Marie, Star Wars, 1, 0.3) means that the probability that the couple (Marie, Star Wars) is in class 1 is 0.3.
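The random initialisation can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `init_q` and its `seed` argument are hypothetical names, classes are numbered from 0 rather than 1, and a dictionary stands in for the RDD used later in the notebook.

```python
import random

def init_q(pairs, n_classes, seed=0):
    """Return {(user, movie, z): q} with random values summing to 1 over z for each pair."""
    rng = random.Random(seed)  # fixed seed only so that the sketch is repeatable
    q = {}
    for user, movie in pairs:
        weights = [rng.random() for _ in range(n_classes)]
        total = sum(weights)
        for z, weight in enumerate(weights):
            q[(user, movie, z)] = weight / total  # normalise within the pair
    return q

pairs = [("Marie", "Star Wars"), ("Gaëlle", "Matrix")]
q0 = init_q(pairs, n_classes=2)
for key, value in sorted(q0.items()):
    print(key, round(value, 3))
```

Each couple produces `n_classes` lines, and the values of one couple add up to 1.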

LogLik = 0

**ITERATION**

**M-STEP - Compute p(s|z) and p(z|u) based on q( z | (u,s) )**
- Compute p(s | z): sum the probabilities q associated with every couple (s,z) and divide by the sum of the probabilities associated with this z
- Compute p(z | u): sum the probabilities q associated with every couple (u,z) and divide by the sum of the probabilities associated with this u
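A minimal sketch of the M-step on a toy `q`, assuming a plain dictionary instead of an RDD. The function name `m_step` and the numeric values are hypothetical; the key layouts (movie, z) and (z, user) are chosen to mirror the `Psz` and `Pzu` RDDs built later in this notebook.

```python
from collections import defaultdict

def m_step(q):
    """q maps (user, movie, z) -> q(z|(u,s)). Returns (p_s_given_z, p_z_given_u)."""
    n_sz, n_z = defaultdict(float), defaultdict(float)
    n_zu, n_u = defaultdict(float), defaultdict(float)
    for (user, movie, z), prob in q.items():
        n_sz[(movie, z)] += prob  # numerator of p(s|z)
        n_z[z] += prob            # denominator of p(s|z)
        n_zu[(z, user)] += prob   # numerator of p(z|u)
        n_u[user] += prob         # denominator of p(z|u)
    p_s_given_z = {(s, z): v / n_z[z] for (s, z), v in n_sz.items()}
    p_z_given_u = {(z, u): v / n_u[u] for (z, u), v in n_zu.items()}
    return p_s_given_z, p_z_given_u

# Made-up initial values, normalised per couple
q = {
    ("Marie", "Star Wars", 0): 0.3, ("Marie", "Star Wars", 1): 0.7,
    ("Gaëlle", "Matrix", 0): 0.6, ("Gaëlle", "Matrix", 1): 0.4,
}
psz, pzu = m_step(q)
print(psz)
print(pzu)
```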

**E-STEP - Compute new q( z | (u,s) ) = p(s|z)p(z|u) / ∑p(s|z)p(z|u)**
- For each (u,s,z), compute p(s | z) * p(z | u)
- For each (u,s), compute ∑ p(s | z)* p(z | u) (summing over z)     ***(this corresponds to p(s|u))***
- For each (u,s,z), compute p(s|z)p(z|u) / ∑p(s|z)p(z|u)             ***(this corresponds to the new q( z | (u,s) ))***
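The three bullet points above can be sketched in plain Python as below. `e_step` is a hypothetical helper, the probability tables are made up, and classes are numbered from 0. The function also returns the sum of log p(s|u), since p(s|u) is already available at this point.

```python
import math
from collections import defaultdict

def e_step(pairs, p_s_given_z, p_z_given_u, n_classes):
    """Return (new_q, log_lik) for the observed (user, movie) pairs."""
    joint = {}
    p_s_given_u = defaultdict(float)
    for user, movie in pairs:
        for z in range(n_classes):
            value = p_s_given_z[(movie, z)] * p_z_given_u[(z, user)]
            joint[(user, movie, z)] = value       # p(s|z) * p(z|u)
            p_s_given_u[(user, movie)] += value   # sum over z
    new_q = {(u, s, z): v / p_s_given_u[(u, s)] for (u, s, z), v in joint.items()}
    log_lik = sum(math.log(p) for p in p_s_given_u.values())
    return new_q, log_lik

# Made-up tables: p(s|z) sums to 1 over movies, p(z|u) sums to 1 over classes
p_s_given_z = {("Star Wars", 0): 0.5, ("Matrix", 0): 0.5,
               ("Star Wars", 1): 0.2, ("Matrix", 1): 0.8}
p_z_given_u = {(0, "Marie"): 0.9, (1, "Marie"): 0.1,
               (0, "Gaëlle"): 0.4, (1, "Gaëlle"): 0.6}
pairs = [("Marie", "Star Wars"), ("Gaëlle", "Matrix")]
new_q, log_lik = e_step(pairs, p_s_given_z, p_z_given_u, n_classes=2)
print(new_q)
print(log_lik)
```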
	    
**Update LogLik** = sum( log( ∑ p(s | z) * p(z | u) ) ) = sum( log( p(s | u) ) ), where the outer sum runs over the observed (u,s) couples

**Iterate again until LogLik converges** : once LogLik stops increasing, the EM procedure has reached a maximum of the likelihood (possibly only a local one), and the current p(z | u) and p(s | z) are our estimates.
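One way to express the stopping rule is sketched below. The names `run_em`, `tol` and `max_iter` are hypothetical, and `step` stands for any callable that performs one M-step followed by one E-step and returns the new LogLik. Unlike the LogLik = 0 initialisation above, this sketch starts the comparison from minus infinity so that the first iteration never triggers the stop.

```python
def run_em(step, tol=1e-6, max_iter=100):
    """Call step() until LogLik changes by less than tol, or max_iter is reached."""
    history = []
    previous = float("-inf")
    for _ in range(max_iter):
        current = step()
        history.append(current)
        if abs(current - previous) < tol:
            break  # LogLik has stopped moving: treat as converged
        previous = current
    return history

# Stand-in for a real EM step: replays a made-up LogLik sequence
fake_values = iter([-10.0, -5.0, -4.0, -4.0, -3.0])
history = run_em(lambda: next(fake_values))
print(history)
```

The loop stops at the second -4.0, so the last made-up value is never consumed.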
	    
**We can now predict the probability that Gaëlle will watch Star Wars** :
p(Star Wars | Gaëlle) = p( 1 | Gaëlle) * p(Star Wars |1) + p(2 | Gaëlle) * p(Star Wars | 2)
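As an illustration of how this prediction could be used, the sketch below scores several candidate movies for one user with the formula above and sorts them. `rank_movies` is a hypothetical helper, the probabilities are made up, and classes are numbered from 0.

```python
def rank_movies(user, candidates, p_s_given_z, p_z_given_u, n_classes):
    """Return [(movie, p(s|u))] sorted from most to least likely."""
    scores = {
        movie: sum(
            p_s_given_z.get((movie, z), 0.0) * p_z_given_u.get((z, user), 0.0)
            for z in range(n_classes)
        )
        for movie in candidates
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up estimates for two classes
p_z_given_u = {(0, "Gaëlle"): 0.8, (1, "Gaëlle"): 0.2}
p_s_given_z = {("Star Wars", 0): 0.1, ("Star Wars", 1): 0.5,
               ("Matrix", 0): 0.4, ("Matrix", 1): 0.1}
ranking = rank_movies("Gaëlle", ["Star Wars", "Matrix"], p_s_given_z, p_z_given_u, 2)
print(ranking)
```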

### Implementing the algorithm 

Keep only (user, movie) information : 

In [117]:
rdd = rdd.map(lambda line : line.split(',')).map(lambda line : line[0] + ',' + line[2])

In [118]:
rdd.count()

295
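The projection onto (user, movie) can also be sketched in plain Python, keeping each couple as a tuple instead of re-joining it into a string. `to_pair` is a hypothetical helper; the sample lines are the first rows shown earlier in this notebook. The same function could presumably be passed to `rdd.map`.

```python
def to_pair(line):
    """Keep (user_id, movie_id) from a 'user_id,user_username,movie_id,rating' line."""
    fields = line.split(",")
    return fields[0], fields[2]

lines = [
    "2,William,1768,1",
    "3,James,615,3",
    "7,Joseph,82,3",
    "7,Joseph,532,3",
]
pairs = [to_pair(line) for line in lines]
print(pairs)
```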

#### Initialisation of q : (first E-Step)

To each couple (u,s), assign each possible community z : 

In [119]:
nb_z = 3  # number of classes
classes = sc.parallelize(range(nb_z))
# Pair every (user, movie) line with every class, then drop duplicated lines
rdd = rdd.cartesian(classes)
rdd = rdd.distinct()

In [120]:
rdd.count()

882

In [121]:
ordered_rdd = rdd.map(lambda x: (x[0].split(','), x[1])).sortBy(lambda x : (x[0][0], x[0][1], x[1]))
ordered_rdd.count()

882

To each line, assign a random probability :

In [59]:
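n_pairs = ordered_rdd.count() // nb_z  # number of distinct (user, movie) couples
proba0 = np.random.rand(n_pairs, nb_z)
# Normalise each row so that the nb_z probabilities of one couple sum to 1
random_p = (proba0 / proba0.sum(axis=1, keepdims=True)).flatten()
random_p = list(random_p)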

In [60]:
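# Index by row position: random_p.pop(0) in the lambda would pop from a separate copy of the list in each task
q = ordered_rdd.zipWithIndex().map(lambda x: (x[0], random_p[x[1]]))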

In [122]:
q.count()

882

#### One iteration step 

M-STEP - Compute p(s|z) and p(z|u) based on q( z | (u,s) )

Compute p(s | z): sum the probabilities q associated with every couple (s,z) and divide by the sum of the probabilities associated with this z

In [149]:
SZ_probas = q.map(lambda x: ((x[0][0][1], x[0][1]), x[1]))

In [150]:
Nsz = SZ_probas.reduceByKey(lambda x,y: x + y)

In [151]:
Z_probas = q.map(lambda x: (x[0][1], x[1]))

In [152]:
Nz = Z_probas.reduceByKey(lambda x,y: x+y)

In [153]:
Nsz = Nsz.map(lambda x : (x[0][1], (x[0][0], x[1])))

In [154]:
Psz = Nsz.join(Nz)

In [155]:
Psz = Psz.map(lambda x : ((x[1][0][0], x[0]), x[1][0][1] / x[1][1]))

In [156]:
Psz.collect()[0:10]

[(('68', 0), 0.03256301978953673),
 (('564', 0), 0.0456296914649065),
 (('598', 0), 0.039794696974641275),
 (('517', 0), 0.03873465730959445),
 (('1', 0), 0.0024997187053983397),
 (('1596', 0), 0.016210888135171533),
 (('1293', 0), 0.007316101600084628),
 (('130', 0), 0.03199415695945645),
 (('2767', 0), 0.0058923521597445214),
 (('3390', 0), 0.007129522877872921)]
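A quick sanity check on this result is that p(s|z) sums to 1 over the movies for each class. The sketch below assumes rows shaped like the output above, ((movie, z), probability); `sums_by_class` is a hypothetical helper and the sample rows are made up. With the notebook's data, one could pass it `Psz.collect()`.

```python
from collections import defaultdict

def sums_by_class(psz_rows):
    """Sum p(s|z) over movies for each class z; every total should be close to 1."""
    totals = defaultdict(float)
    for (movie, z), prob in psz_rows:
        totals[z] += prob
    return dict(totals)

# Made-up rows in the same layout as Psz
rows = [(("68", 0), 0.25), (("564", 0), 0.75), (("68", 1), 0.5), (("564", 1), 0.5)]
totals = sums_by_class(rows)
print(totals)
```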

Compute p(z | u): sum the probabilities q associated with every couple (u,z) and divide by the sum of the probabilities associated with this u

In [157]:
ZU_probas = q.map(lambda x: ((x[0][0][0], x[0][1]), x[1]))

In [158]:
Nzu = ZU_probas.reduceByKey(lambda x,y: x + y)

In [159]:
U_probas = q.map(lambda x: (x[0][0][0], x[1]))

In [160]:
Nu = U_probas.reduceByKey(lambda x,y: x+y)

In [161]:
Nzu = Nzu.map(lambda x : (x[0][0], (x[0][1], x[1])))

In [162]:
Pzu = Nzu.join(Nu)

In [163]:
Pzu = Pzu.map(lambda x : ((x[1][0][0], x[0]), x[1][0][1] / x[1][1]))

In [164]:
Pzu.collect()[0:10]

[((1, '10'), 0.16963470139079387),
 ((0, '10'), 0.4221572109040528),
 ((2, '10'), 0.4082080877051534),
 ((1, '102'), 0.37803579896367123),
 ((0, '102'), 0.0804472599717988),
 ((2, '102'), 0.54151694106453),
 ((1, '112'), 0.02336607334999523),
 ((0, '112'), 0.3769991829941053),
 ((2, '112'), 0.5996347436558995),
 ((1, '121'), 0.2545663277686507)]

E-STEP - Compute new q( z | (u,s) ) = p(s|z)p(z|u) / ∑p(s|z)p(z|u)

For each (u,s,z), compute p(s | z) * p(z | u)

In [165]:
Pzu.collect()[0:10]

[((1, '10'), 0.16963470139079387),
 ((0, '10'), 0.4221572109040528),
 ((2, '10'), 0.4082080877051534),
 ((1, '102'), 0.37803579896367123),
 ((0, '102'), 0.0804472599717988),
 ((2, '102'), 0.54151694106453),
 ((1, '112'), 0.02336607334999523),
 ((0, '112'), 0.3769991829941053),
 ((2, '112'), 0.5996347436558995),
 ((1, '121'), 0.2545663277686507)]

In [166]:
Psz.collect()[0:10]

[(('68', 0), 0.03256301978953673),
 (('564', 0), 0.0456296914649065),
 (('598', 0), 0.039794696974641275),
 (('517', 0), 0.03873465730959445),
 (('1', 0), 0.0024997187053983397),
 (('1596', 0), 0.016210888135171533),
 (('1293', 0), 0.007316101600084628),
 (('130', 0), 0.03199415695945645),
 (('2767', 0), 0.0058923521597445214),
 (('3390', 0), 0.007129522877872921)]

In [168]:
q.collect()[0:10]

[((['10', '1693'], 0), 0.4221572109040528),
 ((['10', '1693'], 1), 0.16963470139079387),
 ((['10', '1693'], 2), 0.4082080877051534),
 ((['102', '750'], 0), 0.0804472599717988),
 ((['102', '750'], 1), 0.37803579896367123),
 ((['102', '750'], 2), 0.54151694106453),
 ((['104', '68'], 0), 0.08620273757421953),
 ((['104', '68'], 1), 0.6216334844783418),
 ((['104', '68'], 2), 0.29216377794743864),
 ((['109', '641'], 0), 0.21377081825163968)]

In [169]:
q_int = q.map(lambda x : ((x[0][1], x[0][0][0]), (x[0][0][1], x[0][1])))

In [179]:
q_int.collect()[0:10]

[((0, '10'), ('1693', 0)),
 ((1, '10'), ('1693', 1)),
 ((2, '10'), ('1693', 2)),
 ((0, '102'), ('750', 0)),
 ((1, '102'), ('750', 1)),
 ((2, '102'), ('750', 2)),
 ((0, '104'), ('68', 0)),
 ((1, '104'), ('68', 1)),
 ((2, '104'), ('68', 2)),
 ((0, '109'), ('641', 0))]

In [187]:
q_int2 = q_int.join(Pzu)
q_int2.collect()[0:10]

[((2, '10'), (('1693', 2), 0.4082080877051534)),
 ((2, '102'), (('750', 2), 0.54151694106453)),
 ((2, '112'), (('681', 2), 0.5996347436558995)),
 ((0, '116'), (('3753', 0), 0.42782101132556444)),
 ((2, '119'), (('564', 2), 0.20088974585296776)),
 ((2, '121'), (('564', 2), 0.5656429411196668)),
 ((2, '121'), (('710', 2), 0.5656429411196668)),
 ((2, '121'), (('781', 2), 0.5656429411196668)),
 ((1, '196'), (('1768', 1), 0.35805914103750713)),
 ((1, '196'), (('615', 1), 0.35805914103750713))]

In [200]:
q_int3 = q_int2.map(lambda x : (x[1][0], (x[0], x[1][1])))
q_int3.collect()[0:10]

[(('1693', 2), ((2, '10'), 0.4082080877051534)),
 (('750', 2), ((2, '102'), 0.54151694106453)),
 (('681', 2), ((2, '112'), 0.5996347436558995)),
 (('3753', 0), ((0, '116'), 0.42782101132556444)),
 (('564', 2), ((2, '119'), 0.20088974585296776)),
 (('564', 2), ((2, '121'), 0.5656429411196668)),
 (('710', 2), ((2, '121'), 0.5656429411196668)),
 (('781', 2), ((2, '121'), 0.5656429411196668)),
 (('1768', 1), ((1, '196'), 0.35805914103750713)),
 (('615', 1), ((1, '196'), 0.35805914103750713))]

In [193]:
Psz.collect()[0:10]

[(('68', 0), 0.03256301978953673),
 (('564', 0), 0.0456296914649065),
 (('598', 0), 0.039794696974641275),
 (('517', 0), 0.03873465730959445),
 (('1', 0), 0.0024997187053983397),
 (('1596', 0), 0.016210888135171533),
 (('1293', 0), 0.007316101600084628),
 (('130', 0), 0.03199415695945645),
 (('2767', 0), 0.0058923521597445214),
 (('3390', 0), 0.007129522877872921)]

In [236]:
PzuPsz = q_int3.join(Psz)

In [237]:
PzuPsz.collect()[0:10]

[(('797', 2), (((2, '248'), 0.6866864862497908), 0.01089515621074776)),
 (('797', 2), (((2, '183'), 0.3868519783148643), 0.01089515621074776)),
 (('496', 2), (((2, '191'), 0.4721365194894196), 0.03624050251823825)),
 (('496', 2), (((2, '780'), 0.06586349378526116), 0.03624050251823825)),
 (('496', 2), (((2, '131'), 0.03484969453588014), 0.03624050251823825)),
 (('496', 2), (((2, '117'), 0.46928662376653096), 0.03624050251823825)),
 (('496', 2), (((2, '491'), 0.5965023967390713), 0.03624050251823825)),
 (('496', 2), (((2, '159'), 0.5880009614472013), 0.03624050251823825)),
 (('496', 2), (((2, '368'), 0.006222634220917525), 0.03624050251823825)),
 (('496', 2), (((2, '296'), 0.20092487544523233), 0.03624050251823825))]

In [238]:
PzuPsz = PzuPsz.map(lambda x: ((x[1][0][0][1], x[0][0]), (x[0][1], x[1][0][1]*x[1][1])))

In [239]:
PzuPsz.collect()[0:10]

[(('248', '797'), (2, 0.007481556535500965)),
 (('183', '797'), (2, 0.004214812734177252)),
 (('191', '496'), (2, 0.017110464723508554)),
 (('780', '496'), (2, 0.0023869261123847262)),
 (('131', '496'), (2, 0.0012629704425873981)),
 (('117', '496'), (2, 0.017007183070386494)),
 (('491', '496'), (2, 0.021617546611157466)),
 (('159', '496'), (2, 0.021309450324053812)),
 (('368', '496'), (2, 0.0002255113911532371)),
 (('296', '496'), (2, 0.007281618454549649)),
 (('585', '496'), (2, 0.020460741434082354)),
 (('62', '496'), (2, 0.02231592026986537)),
 (('119', '564'), (0, 0.01449950935987998)),
 (('933', '564'), (0, 0.02701312056532757)),
 (('221', '564'), (0, 0.015235335573305019)),
 (('747', '564'), (0, 0.010984935558201646)),
 (('159', '564'), (0, 0.010048838006823914)),
 (('313', '564'), (0, 0.01503892283565035)),
 (('161', '564'), (0, 0.028266208578808738)),
 (('766', '564'), (0, 0.03223630442095555)),
 (('121', '564'), (0, 0.00820379558887604)),
 (('355', '564'), (0, 0.01147085586766

For each (u,s), compute ∑ p(s | z)* p(z | u) (summing over z) (this corresponds to p(s|u))

In [240]:
SumPzuPsz = PzuPsz.map(lambda x : (x[0], x[1][1])).reduceByKey(lambda x,y : x+y)

In [241]:
SumPzuPsz.collect()[0:10]

[(('296', '496'), 0.030686521647097387),
 (('775', '1241'), 0.007127781535276998),
 (('571', '3915'), 0.019366064417274828),
 (('244', '1550'), 0.007851704704439282),
 (('126', '681'), 0.038186213220138765),
 (('74', '130'), 0.03694488165304033),
 (('485', '1935'), 0.007644547890965291),
 (('313', '3252'), 0.021040793474339257),
 (('663', '517'), 0.036117372141881626),
 (('170', '82'), 0.0439722326771784)]

For each (u,s,z), compute p(s|z)p(z|u) / ∑p(s|z)p(z|u) (this corresponds to the new q( z | (u,s) ))

In [242]:
q1 = PzuPsz.join(SumPzuPsz)

In [243]:
q1.collect()[0:10]

[(('296', '496'), ((2, 0.007281618454549649), 0.030686521647097387)),
 (('296', '496'), ((1, 0.013070706183897877), 0.030686521647097387)),
 (('296', '496'), ((0, 0.010334197008649862), 0.030686521647097387)),
 (('485', '1935'), ((1, 0.0005287743149323337), 0.007644547890965291)),
 (('485', '1935'), ((0, 0.003291170126037254), 0.007644547890965291)),
 (('485', '1935'), ((2, 0.0038246034499957032), 0.007644547890965291)),
 (('313', '3252'), ((1, 0.004589641296092026), 0.021040793474339257)),
 (('313', '3252'), ((0, 0.008070044426856389), 0.021040793474339257)),
 (('313', '3252'), ((2, 0.00838110775139084), 0.021040793474339257)),
 (('170', '82'), ((1, 0.030479099910464647), 0.0439722326771784))]

In [244]:
q1.count()

882

In [245]:
q1 = q1.map(lambda x : (x[0], x[1][0][0], x[1][0][1]/x[1][1]))

In [246]:
q1.collect()[0:10]

[(('296', '496'), 2, 0.23729044752253345),
 (('296', '496'), 1, 0.42594290529940937),
 (('296', '496'), 0, 0.33676664717805727),
 (('485', '1935'), 1, 0.06917012261212538),
 (('485', '1935'), 0, 0.4305251498165017),
 (('485', '1935'), 2, 0.5003047275713729),
 (('313', '3252'), 1, 0.2181306185857211),
 (('313', '3252'), 0, 0.3835427802044814),
 (('313', '3252'), 2, 0.3983266012097974),
 (('170', '82'), 1, 0.6931442425092804)]

In [247]:
q1.count()

882

Update LogLik = sum( log( ∑ p(s | z) * p(z | u) ) ) = sum( log( p(s | u) ) )

In [248]:
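# Log-likelihood = sum of log p(s|u); the value recorded below came from summing p(s|u) without the log
LogLik = SumPzuPsz.map(lambda x : np.log(x[1])).sum()
LogLik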

8.464520735594743

End of the iteration step 