## ALS

The final goal was to build a basic recommendation system from past reservations. It uses [Alternating Least Squares](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html) (ALS), a recommender built into PySpark's ML library. The PySpark documentation and [this walkthrough](https://github.com/shashwatwork/Building-Recommeder-System-in-PySpark/blob/master/Crafting%20Recommedation%20System%20with%20PySpark.ipynb) guided the implementation.

ALS is fairly straightforward. It builds the model from three input columns, all stored as integers in this dataset:
- userCol - the person record in the transaction; `customerzip` was used in this model
- itemCol - `facilityid`, the product identifier
- ratingCol - the score the user assigned to the item; a "days stayed" calculation was used to simulate this value
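
The "days stayed" value can be derived from a reservation's start and end dates. The following is a minimal sketch, assuming ISO-formatted date strings. The helper name `days_stayed` and the one-day floor are illustrative assumptions, since the original preprocessing is not shown here.

```python
from datetime import date


def days_stayed(start_date: str, end_date: str) -> int:
    """Return the number of days between two ISO dates (YYYY-MM-DD).

    Hypothetical helper: the floor of 1 (so that same-day visits still
    count as a rating of 1) is an assumption, not the original logic.
    """
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    return max((end - start).days, 1)


print(days_stayed("2021-07-01", "2021-07-03"))  # 2
print(days_stayed("2021-07-01", "2021-07-01"))  # 1 (floored)
```

Because a ZIP code stands in for the user, every household in the same ZIP code shares one set of preferences, and a ZIP code with a leading zero loses that zero once it is stored as an integer.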

In [1]:
from pyspark.sql import SparkSession

MAX_MEMORY = "8g"

spark = SparkSession.builder.appName('recreation.gov reservations') \
    .config("spark.executor.memory", MAX_MEMORY) \
    .config("spark.driver.memory", MAX_MEMORY) \
    .getOrCreate()

In [11]:
from pyspark.sql.types import StructType, StructField, IntegerType

schemaRating = StructType([
    StructField("user", IntegerType(), True),
    StructField("item", IntegerType(), True),
    StructField("rating", IntegerType(), True),
])

In [12]:
# read the CSV with the explicit schema; header=True skips the header row
dfReservations2021 = spark.read.csv(
    'REC_Collaborative_Facility.csv',
    schema=schemaRating,
    header=True,
    ignoreTrailingWhiteSpace=True,
)

In [13]:
dfReservations2021.show(truncate=False)

+-----+------+------+
|user |item  |rating|
+-----+------+------+
|99709|252494|1     |
|99706|252494|1     |
|99709|252494|1     |
|84401|252494|2     |
|99709|252494|2     |
|99709|252494|1     |
|99743|252494|2     |
|99708|252494|3     |
|84401|252494|2     |
|99710|252494|1     |
|99705|252494|1     |
|99709|252494|1     |
|99709|252494|1     |
|99712|252494|2     |
|99709|252494|1     |
|99712|252494|1     |
|99705|252494|2     |
|99775|252494|1     |
|99709|252494|2     |
|99755|252494|1     |
+-----+------+------+
only showing top 20 rows



In [14]:
dfReservations2021 = dfReservations2021.dropna()

In [15]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

# split into train (80%) and test (20%) sets
trainDF, testDF = dfReservations2021.randomSplit([0.8, 0.2])
trainDF.cache()

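# build model; ALS defaults to the column names "user", "item" and "rating"
# coldStartStrategy="drop" - drops rows whose prediction is NaN
#   (users or items that were not seen during training)
# implicitPrefs=True - treats the rating as implicit feedback rather than an explicit score
als = ALS(coldStartStrategy="drop", implicitPrefs=True)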
model = als.fit(trainDF)
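
To give a feel for what `als.fit` does, the sketch below runs alternating least squares on a tiny ratings list in plain Python. It is a simplification under stated assumptions: rank 1 (one factor per user and per item), explicit feedback, and a fixed regularization term. Spark's implementation is distributed, defaults to rank 10, and uses a confidence-weighted variant when `implicitPrefs=True`. The function name `als_rank1` and the toy data are hypothetical.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iters=50):
    """Fit rating(u, i) ~ x[u] * y[i] by alternating least squares.

    ratings: list of (user_index, item_index, rating) tuples.
    Illustrative rank-1 sketch only, not Spark's implementation.
    """
    x = [1.0] * n_users  # user factors
    y = [1.0] * n_items  # item factors
    for _ in range(iters):
        # Hold item factors fixed; solve the regularized least-squares
        # problem for each user in closed form.
        for u in range(n_users):
            num = sum(r * y[i] for (uu, i, r) in ratings if uu == u)
            den = sum(y[i] ** 2 for (uu, i, r) in ratings if uu == u) + reg
            x[u] = num / den
        # Hold user factors fixed; solve for each item the same way.
        for i in range(n_items):
            num = sum(r * x[u] for (u, ii, r) in ratings if ii == i)
            den = sum(x[u] ** 2 for (u, ii, r) in ratings if ii == i) + reg
            y[i] = num / den
    return x, y


# Toy data: user 2 has not rated item 1. The observed ratings follow a
# rank-1 pattern (user factors 1, 2, 3 times item factors 1, 2).
toy_ratings = [(0, 0, 1), (0, 1, 2), (1, 0, 2), (1, 1, 4), (2, 0, 3)]
user_factors, item_factors = als_rank1(toy_ratings, n_users=3, n_items=2)

# Predicted rating for the unseen (user 2, item 1) pair; close to 6,
# slightly shrunk by the regularization term.
prediction = user_factors[2] * item_factors[1]
print(round(prediction, 2))
```

Each half-step is an ordinary least-squares problem with a closed-form solution, which is what makes the alternating scheme cheap to parallelize.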

In [16]:
# generate predictions
predictions = model.transform(testDF)

# evaluate the model using mean absolute error (MAE)
evaluator = RegressionEvaluator(metricName="mae", labelCol="rating", predictionCol="prediction")
evaluator.evaluate(predictions)

1.4178938687063154
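
The evaluator is configured with `metricName="mae"`, so the score above is a mean absolute error rather than a root mean squared error. The sketch below computes both metrics on made-up values to show how they differ; the numbers are illustrative and are not taken from the reservation data.

```python
import math


def mae(labels, preds):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(l - p) for l, p in zip(labels, preds)) / len(labels)


def rmse(labels, preds):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((l - p) ** 2 for l, p in zip(labels, preds)) / len(labels))


# Illustrative values only: one prediction is off by 2 days.
labels = [1, 2, 3, 1]
preds = [1.5, 1.5, 1.0, 1.0]

print(mae(labels, preds))             # 0.75
print(round(rmse(labels, preds), 4))  # 1.0607
```

Note that with `implicitPrefs=True` the predictions are preference scores rather than predicted day counts, so either metric is probably best read as a relative measure for comparing models, not as an error in days.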

In [17]:
# select the test-set rows for a sample user (ZIP code 20901, Silver Spring, MD)
test_user = testDF.filter('user == 20901').select('user', 'item', 'rating')
test_user.show()

+-----+------+------+
| user|  item|rating|
+-----+------+------+
|20901|232432|     2|
|20901|232433|     1|
|20901|232459|     1|
|20901|232459|     1|
|20901|232459|     2|
|20901|232459|     2|
|20901|232459|     2|
|20901|232459|     3|
|20901|232459|     3|
|20901|232507|     2|
|20901|232507|     2|
|20901|232507|     3|
|20901|232507|     3|
|20901|232507|     3|
|20901|232507|     3|
|20901|233379|     1|
|20901|233379|     2|
|20901|234059|     1|
|20901|247762|     1|
|20901|247762|     1|
+-----+------+------+
only showing top 20 rows



In [18]:
# get recommendations for test user
recommendations = model.recommendForUserSubset(test_user, 5)
recommendations.sort('recommendations', ascending=False).show()

+-----+--------------------+
| user|     recommendations|
+-----+--------------------+
|20901|[{232507, 1.50426...|
+-----+--------------------+



In [19]:
from pyspark.sql.functions import explode

recommendations.select(explode('recommendations').alias('recs')).select('recs.item', 'recs.rating').sort('recs.rating', ascending=False).show()

+------+---------+
|  item|   rating|
+------+---------+
|232507|1.5042671|
|232508|1.4093786|
|232459|1.4052066|
|251431| 1.385287|
|233626|1.3784753|
+------+---------+
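
Two of the recommended facilities (232507 and 232459) already appear in this user's reservation history, so a practical recommender may want to filter out items the user has already visited. The following is a minimal sketch in plain Python. The helper `filter_seen` is hypothetical, and the `seen` set holds only the item IDs visible in the 20 rows displayed earlier; a fuller version would also include the user's training rows.

```python
def filter_seen(recommendations, seen_items, top_n=3):
    """Keep the highest-scoring recommendations the user has not visited.

    recommendations: list of (item_id, score) pairs.
    seen_items: set of item IDs already in the user's history.
    """
    unseen = [(item, score) for item, score in recommendations
              if item not in seen_items]
    unseen.sort(key=lambda pair: pair[1], reverse=True)
    return unseen[:top_n]


# Values copied from the outputs shown above.
recs = [
    (232507, 1.5042671),
    (232508, 1.4093786),
    (232459, 1.4052066),
    (251431, 1.385287),
    (233626, 1.3784753),
]
seen = {232432, 232433, 232459, 232507, 233379, 234059, 247762}

print(filter_seen(recs, seen))
# [(232508, 1.4093786), (251431, 1.385287), (233626, 1.3784753)]
```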



In [20]:
%pip install requests

Note: you may need to restart the kernel to use updated packages.


In [21]:
import requests
r = requests.get('https://ridb.recreation.gov/api/v1/facilities/232507', headers={'apikey': 'XXXXX'})
r.json()

{'message': 'Unauthorized Access'}
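
The `Unauthorized Access` response is presumably the result of sending the placeholder key `XXXXX`. One way to keep a real key out of the notebook is to read it from an environment variable. The sketch below builds the URL and headers without sending a request; the variable name `RIDB_API_KEY` and the helper `build_facility_request` are assumptions for illustration.

```python
import os

RIDB_BASE_URL = "https://ridb.recreation.gov/api/v1"


def build_facility_request(facility_id, env_var="RIDB_API_KEY"):
    """Return (url, headers) for a facility lookup.

    Hypothetical helper: reads the API key from an environment variable
    (the name is an assumption) and fails early if it is not set.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to a valid API key")
    url = f"{RIDB_BASE_URL}/facilities/{facility_id}"
    headers = {"apikey": api_key}
    return url, headers


# Usage with the requests library, as in the cell above:
#   url, headers = build_facility_request(232507)
#   r = requests.get(url, headers=headers)
#   r.json()
```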

In [22]:
from pyspark.ml.tuning import ParamGridBuilder

# set parameters for tuning
paramGrid = ParamGridBuilder()\
    .addGrid(als.maxIter, [5, 10, 15])\
    .addGrid(als.regParam, [0.001, 0.01, 0.1])\
    .build()
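
The grid above is the Cartesian product of the two parameter lists. The plain-Python sketch below enumerates the same combinations and estimates how many models the cross-validation step will train, assuming `CrossValidator` keeps its default of 3 folds. The dictionaries are a stand-in for Spark's param maps, not the actual objects.

```python
from itertools import product

max_iter_values = [5, 10, 15]
reg_param_values = [0.001, 0.01, 0.1]

# One entry per (maxIter, regParam) combination, mirroring the grid above.
grid = [
    {"maxIter": m, "regParam": r}
    for m, r in product(max_iter_values, reg_param_values)
]

num_folds = 3  # CrossValidator's default numFolds (assumed unchanged)
# Each combination is fit once per fold, then the best combination is
# refit once on the full training set.
total_fits = len(grid) * num_folds + 1

print(len(grid), total_fits)  # 9 28
```

Adding a third parameter such as `rank` multiplies the grid size again, so the training cost grows quickly with each extra `addGrid` call.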

In [23]:
from pyspark.ml.tuning import CrossValidator

crossval = CrossValidator(estimator=als,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator)

# cross-validate over the parameter grid to find the best model
cvModel = crossval.fit(trainDF)

In [24]:
# evaluate the best cross-validated model on the test set (MAE)
cvPred = cvModel.bestModel.transform(testDF)
evaluator.evaluate(cvPred)

1.3887643411175716

In [25]:
cvModel.bestModel

ALSModel: uid=ALS_a0c6f9b7676a, rank=10

In [26]:
cvRecommendations = cvModel.bestModel.recommendForUserSubset(test_user, 5)
cvRecommendations.sort('recommendations', ascending=False).show()

# flatten the recommendations into one row per item (explode was imported in an earlier cell)
cvRecommendations.select(explode('recommendations').alias('recs')).select('recs.item', 'recs.rating').sort('recs.rating', ascending=False).show()



+-----+--------------------+
| user|     recommendations|
+-----+--------------------+
|20901|[{232507, 1.56666...|
+-----+--------------------+

+------+---------+
|  item|   rating|
+------+---------+
|232507|1.5666603|
|232508|1.5243237|
|251431|1.4806706|
|232459|1.4672366|
|233626|1.4087949|
+------+---------+



In [27]:
r = requests.get('https://ridb.recreation.gov/api/v1/facilities/233379', headers={'apikey': 'XXXXX'})
r.json()

{'message': 'Unauthorized Access'}