### Recommendation System

**Description**

- Collaborative filtering is one of the most common approaches to building recommendation systems.
- It analyzes past data to learn the behavior of people/entities.
- Recommendations are made from similar behavior: users who behaved alike in the past are expected to like similar items.
- Recommendations can be user-based or item-based.
- Recommendation algorithms expect to receive data in a specific format: [user_id, item_id, score].
- The score (rating) indicates a user's preference for an item. It can be a boolean value, a rating, or even a sales volume.
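
As a minimal sketch of that format, the snippet below parses comma-separated lines into typed (user_id, item_id, score) triples in plain Python. The sample lines are copied from the ratings file used in this section; the helper name `parse_rating` is illustrative only and not part of any library.

```python
from typing import List, Tuple

Rating = Tuple[int, int, float]


def parse_rating(line: str) -> Rating:
    """Turn 'user_id,item_id,score' into a typed (user, item, score) triple."""
    user_id, item_id, score = line.strip().split(',')
    return int(user_id), int(item_id), float(score)


# Sample lines in the same layout as ratings.txt
sample_lines = ['1001,9001,10', '1001,9002,1', '1001,9003,9']
ratings: List[Rating] = [parse_rating(line) for line in sample_lines]
print(ratings)
```

The same parsing is done with Spark `map` calls in the cells that follow.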

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

In [2]:
spSession = SparkSession.builder.master('local').appName('RecommendationSystem').getOrCreate()

In [3]:
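# Use the session's SparkContext (a bare `sc` only exists in the pyspark shell)
rddRatings = spSession.sparkContext.textFile('aux/datasets/ratings.txt')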

In [4]:
rddRatings.cache()

aux/datasets/ratings.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [5]:
rddRatings.count()

38

In [6]:
rddRatings.take(5)

['1001,9001,10', '1001,9002,1', '1001,9003,9', '1002,9001,3', '1002,9002,5']

In [7]:
# Parse each CSV line into a (user, item, rating) tuple
rddRatings02 = (
    rddRatings
    .map(lambda line: line.split(','))
    .map(lambda fields: (int(fields[0]), int(fields[1]), float(fields[2])))
)

In [8]:
dfRatings = spSession.createDataFrame(rddRatings02, ['user', 'item', 'rating'])

In [9]:
dfRatings.show(5)

+----+----+------+
|user|item|rating|
+----+----+------+
|1001|9001|  10.0|
|1001|9002|   1.0|
|1001|9003|   9.0|
|1002|9001|   3.0|
|1002|9002|   5.0|
+----+----+------+
only showing top 5 rows



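**ALS (Alternating Least Squares)**<br />
A matrix factorization algorithm for recommendation systems. It approximates the user-item rating matrix as the product of two low-rank factor matrices, and minimizes the loss function by alternately fixing one matrix and solving a least-squares problem for the other. Because each user's (or item's) factors can be solved independently, it works very well in parallelized environments.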
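
To make the alternating step concrete, here is a minimal pure-Python sketch of ALS with a single latent factor per user and item (rank 1), where each least-squares sub-problem has a closed-form solution. It is an illustration under simplifying assumptions, not Spark's implementation: the function names `als_rank1` and `objective`, the plain L2 penalty `reg`, and the all-ones initialization are choices made for this example. The toy ratings are the first five rows of the dataset used in this section.

```python
from collections import defaultdict


def als_rank1(ratings, n_iters=10, reg=0.1):
    """Rank-1 ALS: every user and every item gets one latent factor."""
    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for (user, item), rating in ratings.items():
        by_user[user].append((item, rating))
        by_item[item].append((user, rating))

    user_f = {u: 1.0 for u in by_user}
    item_f = {i: 1.0 for i in by_item}

    for _step in range(n_iters):
        # Fix item factors; solve a 1-D regularized least squares per user
        for u, rated in by_user.items():
            num = sum(r * item_f[i] for i, r in rated)
            den = reg + sum(item_f[i] ** 2 for i, _r in rated)
            user_f[u] = num / den
        # Fix user factors; solve the same problem per item
        for i, rated in by_item.items():
            num = sum(r * user_f[u] for u, r in rated)
            den = reg + sum(user_f[u] ** 2 for u, _r in rated)
            item_f[i] = num / den
    return user_f, item_f


def objective(ratings, user_f, item_f, reg=0.1):
    """Squared error on observed ratings plus the L2 penalty on all factors."""
    squared_error = sum(
        (r - user_f[u] * item_f[i]) ** 2 for (u, i), r in ratings.items()
    )
    penalty = reg * (
        sum(x * x for x in user_f.values()) + sum(x * x for x in item_f.values())
    )
    return squared_error + penalty


toy_ratings = {
    (1001, 9001): 10.0, (1001, 9002): 1.0, (1001, 9003): 9.0,
    (1002, 9001): 3.0, (1002, 9002): 5.0,
}
user_f, item_f = als_rank1(toy_ratings)
print(objective(toy_ratings, user_f, item_f))
```

Each inner update exactly minimizes the objective for one user or one item while everything else is held fixed, so the objective never increases from one sweep to the next. The per-user and per-item updates are independent of each other, which is what makes the method easy to distribute.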

In [10]:
# rank = number of latent factors; userCol/itemCol/ratingCol default to 'user', 'item', 'rating'
als = ALS(rank=10, maxIter=5)

In [11]:
model = als.fit(dfRatings)

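**Affinity Score**

`userFactors` holds one learned latent-factor vector of length `rank` (10 here) per user; these vectors encode each user's affinities.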

In [12]:
model.userFactors.orderBy('id').take(5)

[Row(id=1001, features=[-1.020338773727417, 0.15884457528591156, 0.43435177206993103, 0.42108067870140076, -0.08316300064325333, -0.22315619885921478, 0.43449509143829346, -0.016361376270651817, -0.09652017056941986, 1.2523633241653442]),
 Row(id=1002, features=[-0.7278647422790527, -0.3039030432701111, -1.4372807741165161, 0.4101669490337372, -0.20219333469867706, 0.30689162015914917, 0.2394382208585739, -0.31496909260749817, -0.42265287041664124, -0.5618031024932861]),
 Row(id=1003, features=[-0.2754066288471222, -0.02016931213438511, -1.2622171640396118, 0.6222057938575745, -0.3387295603752136, 0.6792862415313721, -0.059917986392974854, 0.23600462079048157, -0.24518680572509766, -0.4790261387825012]),
 Row(id=1004, features=[-1.1808589696884155, -0.198384091258049, 0.1094837412238121, 0.714041531085968, -0.03430582955479622, -0.05891014635562897, 0.40418773889541626, -0.05229669436812401, -0.19879166781902313, 0.6823390126228333]),
 Row(id=1005, features=[-0.520517110824585, 0.53832

In [13]:
dfTest = spSession.createDataFrame([(1001, 9003), (1001, 9004), (1001, 9005)], ['user', 'item'])

In [14]:
dfTest.show()

+----+----+
|user|item|
+----+----+
|1001|9003|
|1001|9004|
|1001|9005|
+----+----+



In [15]:
predictions = model.transform(dfTest)

In [16]:
predictions.collect()

[Row(user=1001, item=9004, prediction=-0.6660881042480469),
 Row(user=1001, item=9005, prediction=-2.7070765495300293),
 Row(user=1001, item=9003, prediction=9.008316993713379)]
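
Each prediction above is the dot product of a user's latent-factor vector and an item's latent-factor vector. The sketch below shows that computation in plain Python. The three-dimensional vectors and the names `predict`, `item_a`, and `item_b` are made up for illustration and are not values from the trained model.

```python
def predict(user_vec, item_vec):
    """Predicted score = dot product of the two latent-factor vectors."""
    if len(user_vec) != len(item_vec):
        raise ValueError('factor vectors must have the same length (the rank)')
    return sum(u * v for u, v in zip(user_vec, item_vec))


# Hypothetical factors, for illustration only
user_vec = [1.0, 0.5, -2.0]
item_vecs = {
    'item_a': [2.0, 4.0, 0.5],
    'item_b': [-1.0, 0.0, 1.0],
}

scores = {name: predict(user_vec, vec) for name, vec in item_vecs.items()}
# Recommend the highest-scoring items first
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

Nothing bounds a dot product to the rating scale, which is why items the user has not rated, such as 9004 and 9005, can receive negative predictions. What matters for recommendation is the relative ordering of the scores.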