## ALS

The final goal was to build a basic recommendation system from past reservations. It uses [Alternating Least Squares](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html) (ALS), the collaborative-filtering recommender built into PySpark's ML library. The API documentation, along with [this walkthrough](https://github.com/shashwatwork/Building-Recommeder-System-in-PySpark/blob/master/Crafting%20Recommedation%20System%20with%20PySpark.ipynb), guided the implementation.

ALS is fairly straightforward. It builds the model from three input columns, all stored as integers here:
- userCol - the person in the transaction; customerzip was used in this model
- itemCol - the product identifier; facilityid was used
- ratingCol - the score the user assigned to the item; a "days stayed" calculation was used to simulate this value
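
The "days stayed" rating is not shown in this notebook, so the following is only a minimal sketch of how such a value could be derived. The function name, the reservation dates, and the floor of one day are all assumptions made for illustration; only the zip codes and facility id come from the data shown later.

```python
from datetime import date


def days_stayed(start: date, end: date) -> int:
    """Days between check-in and check-out, floored at 1 (assumed rule)."""
    return max((end - start).days, 1)


# hypothetical reservation rows: (customerzip, facilityid, start date, end date)
reservations = [
    (99709, 252494, date(2021, 6, 1), date(2021, 6, 2)),
    (84401, 252494, date(2021, 7, 10), date(2021, 7, 12)),
]

# reshape into the (user, item, rating) triples that ALS expects
ratings = [
    (zip_code, facility, days_stayed(start, end))
    for zip_code, facility, start, end in reservations
]
print(ratings)
```

Because the user is a zip code rather than an individual, several rows can share the same user and item, which matches the repeated pairs visible in the loaded data.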

## Load Data

Create a required Spark session, define a schema and load the data into a dataframe.

In [1]:
from pyspark.sql import SparkSession

MAX_MEMORY = "8g"

spark = SparkSession.builder.appName('recreation.gov reservations') \
    .config("spark.executor.memory", MAX_MEMORY) \
    .config("spark.driver.memory", MAX_MEMORY) \
    .getOrCreate()

In [28]:
from pyspark.sql.types import StructType, StructField, IntegerType

# define schema
schemaRating = StructType([
    StructField("user", IntegerType(), True),
    StructField("item", IntegerType(), True),
    StructField("rating", IntegerType(), True),
])

In [29]:
# load data with schema
dfReservations2021 = spark.read.schema(schemaRating).csv(
    'REC_Collaborative_Facility.csv',
    header=True,
    ignoreTrailingWhiteSpace=True,
)

In [30]:
# inspect data
dfReservations2021.show(truncate=False)

+-----+------+------+
|user |item  |rating|
+-----+------+------+
|99709|252494|1     |
|99706|252494|1     |
|99709|252494|1     |
|84401|252494|2     |
|99709|252494|2     |
|99709|252494|1     |
|99743|252494|2     |
|99708|252494|3     |
|84401|252494|2     |
|99710|252494|1     |
|99705|252494|1     |
|99709|252494|1     |
|99709|252494|1     |
|99712|252494|2     |
|99709|252494|1     |
|99712|252494|1     |
|99705|252494|2     |
|99775|252494|1     |
|99709|252494|2     |
|99755|252494|1     |
+-----+------+------+
only showing top 20 rows



In [31]:
# drop any nulls
dfReservations2021 = dfReservations2021.dropna()

## Basic Model

Create a basic ALS model with no hyperparameter tuning. Use an 80/20 train/test split and score the model with mean absolute error (MAE). Then pick a test user and generate predictions using the native API.

In [32]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

# split train and test
trainDF, testDF = dfReservations2021.randomSplit([0.8, 0.2])
trainDF.cache()

# build model
# coldStartStrategy="drop" - drops rows whose prediction is NaN
#   (users or items in the test set that were not seen in training)
# implicitPrefs=True - the ratings are not "hard" ratings, but implied
als = ALS(coldStartStrategy="drop", implicitPrefs=True)
model = als.fit(trainDF)

In [33]:
# generate predictions
predictions = model.transform(testDF)

# evaluate model using mean absolute error (MAE)
evaluator = RegressionEvaluator(metricName="mae", labelCol="rating", predictionCol="prediction")
evaluator.evaluate(predictions)

1.4170840733953038
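
For reference, the `mae` metric is the average absolute difference between the `rating` column and the `prediction` column. The sketch below reproduces that calculation in plain Python on a handful of made-up values; the numbers are illustrative only and are not taken from this dataset.

```python
def mean_absolute_error(labels, predictions):
    """Average of |label - prediction| over paired values."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / len(labels)


# made-up "days stayed" labels and model predictions
labels = [1, 2, 3, 2]
predictions = [1.5, 1.0, 2.0, 2.5]

mae = mean_absolute_error(labels, predictions)
print(mae)
```

One caveat when reading the score above: with `implicitPrefs=True`, Spark's ALS models a preference strength rather than the rating itself, so the gap between prediction and days stayed is probably better treated as a rough comparison between models than as an error measured in days.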

In [34]:
# create a test user from Silver Spring, MD
test_user = testDF.filter('user == 20901').select('user', 'item', 'rating')
test_user.show()

+-----+------+------+
| user|  item|rating|
+-----+------+------+
|20901|232432|     2|
|20901|232433|     2|
|20901|232433|     2|
|20901|232459|     1|
|20901|232459|     2|
|20901|232459|     3|
|20901|232463|     2|
|20901|232507|     1|
|20901|232507|     1|
|20901|232507|     1|
|20901|232507|     2|
|20901|232507|     3|
|20901|232507|     3|
|20901|232507|     3|
|20901|232507|     3|
|20901|232507|     3|
|20901|232507|     3|
|20901|232507|     3|
|20901|232508|     2|
|20901|233379|     2|
+-----+------+------+
only showing top 20 rows



In [35]:
from pyspark.sql.functions import explode

# get recommendations for test user
recommendations = model.recommendForUserSubset(test_user, 5)
(recommendations
    .select(explode('recommendations').alias('recs'))
    .select('recs.item', 'recs.rating')
    .sort('recs.rating', ascending=False)
    .show())

+------+---------+
|  item|   rating|
+------+---------+
|232507|1.4859892|
|232508|1.4284883|
|232459|1.4030318|
|251431|1.3911495|
|233563|1.3731725|
+------+---------+



In [36]:
%pip install requests

Note: you may need to restart the kernel to use updated packages.


In [38]:
import requests

r = requests.get('https://ridb.recreation.gov/api/v1/facilities/232507', headers={'apikey': 'XXXXX'})
r.json()

{'message': 'Unauthorized Access'}
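
The `Unauthorized Access` response is what one would expect from the placeholder key in the request. A minimal sketch of one way to keep a real key out of the notebook is to read it from an environment variable. The variable name `RIDB_API_KEY` and the helper `facility_request` are hypothetical, not part of the RIDB API.

```python
import os

RIDB_BASE_URL = "https://ridb.recreation.gov/api/v1"


def facility_request(facility_id, env=None):
    """Return (url, headers) for a facility lookup.

    The key is read from the hypothetical RIDB_API_KEY environment variable.
    """
    env = os.environ if env is None else env
    key = env.get("RIDB_API_KEY")
    if not key:
        raise RuntimeError("RIDB_API_KEY is not set")
    return f"{RIDB_BASE_URL}/facilities/{facility_id}", {"apikey": key}


# a stand-in environment is passed here so the example runs without a real key
url, headers = facility_request(232507, env={"RIDB_API_KEY": "example-key"})
print(url)
```

The returned pair can then be passed to the same call used above, as in `requests.get(url, headers=headers)`.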

## Cross Validation

Create an optimized model using a hyperparameter grid and cross-validation. Evaluate the model with MAE and use the best model to generate recommendations for the test user.

In [39]:
from pyspark.ml.tuning import ParamGridBuilder

# set parameters for tuning
paramGrid = ParamGridBuilder()\
    .addGrid(als.maxIter, [5, 10, 15])\
    .addGrid(als.regParam, [0.001, 0.01, 0.1])\
    .build()

In [40]:
from pyspark.ml.tuning import CrossValidator

crossval = CrossValidator(estimator=als,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator)

# cross-validate over the grid and keep the best model
cvModel = crossval.fit(trainDF)
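
To give a sense of how much work this fit involves, the sketch below enumerates the same grid in plain Python and counts the model fits. It assumes `CrossValidator` is left at its default of 3 folds, as in the cell above, where `numFolds` is not set.

```python
from itertools import product

# the same values passed to ParamGridBuilder above
max_iter_values = [5, 10, 15]
reg_param_values = [0.001, 0.01, 0.1]
NUM_FOLDS = 3  # CrossValidator's default numFolds

# every combination of maxIter and regParam
grid = [
    {"maxIter": max_iter, "regParam": reg_param}
    for max_iter, reg_param in product(max_iter_values, reg_param_values)
]

# one fit per combination per fold; the best combination is then
# refit once more on the full training set
cv_fits = len(grid) * NUM_FOLDS
print(len(grid), cv_fits)
```

Adding another parameter to the grid multiplies this count, so the run time grows quickly with each new parameter.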

In [41]:
# assess prediction model
cvPred = cvModel.bestModel.transform(testDF)
evaluator.evaluate(cvPred)

1.3874826119513284

In [42]:
cvModel.bestModel

ALSModel: uid=ALS_514f1170bcf6, rank=10

In [43]:
# get recommendations for user
cvRecommendations = cvModel.bestModel.recommendForUserSubset(test_user, 5)
(cvRecommendations
    .select(explode('recommendations').alias('recs'))
    .select('recs.item', 'recs.rating')
    .sort('recs.rating', ascending=False)
    .show())

+------+---------+
|  item|   rating|
+------+---------+
|232507|1.5436949|
|251431|1.4961312|
|232508|1.4867854|
|232459|1.4550209|
|233563|1.4480174|
+------+---------+



In [44]:
r = requests.get('https://ridb.recreation.gov/api/v1/facilities/233379', headers={'apikey': 'XXXXX'})
r.json()

{'message': 'Unauthorized Access'}