## ALS

The final goal was to build a basic recommendation system from past reservations. It uses [Alternating Least Squares](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html) (ALS), a recommender native to PySpark's ML library. That documentation, along with [this walkthrough](https://github.com/shashwatwork/Building-Recommeder-System-in-PySpark/blob/master/Crafting%20Recommedation%20System%20with%20PySpark.ipynb), guided the implementation.

ALS is fairly straightforward. The model is built from three input columns, all loaded as integers here:
- userCol - the person record in the transaction; this model uses customerzip
- itemCol - the product identifier; this model uses facilityid
- ratingCol - the score the user assigned to the item; a "days stayed" calculation simulates this value
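
The reservations carry no explicit rating, so "days stayed" stands in for one. A minimal sketch of such a calculation, assuming each reservation has a start date and an end date (the function and argument names are hypothetical, not taken from the dataset):

```python
from datetime import date


def days_stayed(start_date: date, end_date: date) -> int:
    """Days between check-in and check-out (hypothetical helper)."""
    return (end_date - start_date).days


# a reservation from July 1 to July 3 counts as a 2-day stay
rating = days_stayed(date(2021, 7, 1), date(2021, 7, 3))
print(rating)
```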

## Load Data

Create the required Spark session, define a schema, and load the data into a DataFrame.

In [1]:
from pyspark.sql import SparkSession

MAX_MEMORY = "8g"

spark = SparkSession.builder.appName('recreation.gov reservations') \
    .config("spark.executor.memory", MAX_MEMORY) \
    .config("spark.driver.memory", MAX_MEMORY) \
    .getOrCreate()

In [2]:
from pyspark.sql.types import StructType, StructField, IntegerType

# define schema
schemaRating = StructType([
    StructField("user", IntegerType(), True),
    StructField("item", IntegerType(), True),
    StructField("rating", IntegerType(), True),
])

In [3]:
# load data with schema
dfReservations2021 = spark.read.schema(schemaRating).csv(
    './data/REC_Collaborative_Facility.csv',
    header=True,
    ignoreTrailingWhiteSpace=True,
)

In [4]:
# inspect data
dfReservations2021.show(truncate=False)

+-----+------+------+
|user |item  |rating|
+-----+------+------+
|99709|252494|1     |
|99706|252494|1     |
|99709|252494|1     |
|84401|252494|2     |
|99709|252494|2     |
|99709|252494|1     |
|99743|252494|2     |
|99708|252494|3     |
|84401|252494|2     |
|99710|252494|1     |
|99705|252494|1     |
|99709|252494|1     |
|99709|252494|1     |
|99712|252494|2     |
|99709|252494|1     |
|99712|252494|1     |
|99705|252494|2     |
|99775|252494|1     |
|99709|252494|2     |
|99755|252494|1     |
+-----+------+------+
only showing top 20 rows



In [5]:
# drop any nulls
dfReservations2021 = dfReservations2021.dropna()

## Basic Model

Create a basic ALS model with no hyperparameter tuning.  Use an 80/20 train/test split and score the model with MAE (and RMSE for comparison).  Create a test user and generate predictions using the native API.

In [6]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

# split train and test
trainDF, testDF = dfReservations2021.randomSplit([0.8, 0.2])
trainDF.cache()

# build model
# coldStartStrategy="drop" - drops rows with NaN predictions (users or items not seen in training)
# implicitPrefs=True - the ratings are not "hard" ratings, but implied
als = ALS(coldStartStrategy="drop", implicitPrefs=True)
model = als.fit(trainDF)
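
For intuition about what "alternating" means, here is a toy sketch in plain Python. It is not Spark's implementation: it assumes explicit ratings, a single latent factor per user and item (rank 1), and a hypothetical function name `als_rank1`. Each pass holds one side fixed and solves the regularized least-squares problem for the other side in closed form.

```python
def als_rank1(ratings, n_iters=20, reg=0.01):
    """Toy rank-1 ALS. ratings maps (user, item) -> observed rating."""
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}

    for _ in range(n_iters):
        # hold item factors fixed, solve each user factor in closed form
        for u in users:
            seen = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            num = sum(r * item_f[i] for i, r in seen)
            den = reg + sum(item_f[i] ** 2 for i, _ in seen)
            user_f[u] = num / den
        # hold user factors fixed, solve each item factor in closed form
        for i in items:
            seen = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            num = sum(r * user_f[u] for u, r in seen)
            den = reg + sum(user_f[u] ** 2 for u, _ in seen)
            item_f[i] = num / den
    return user_f, item_f


# made-up ratings where item 20 is always rated twice as high as item 10;
# the pair (3, 20) is left out on purpose
ratings = {(1, 10): 1, (1, 20): 2, (2, 10): 2, (2, 20): 4, (3, 10): 3}
user_f, item_f = als_rank1(ratings, n_iters=50, reg=1e-6)

# the prediction for the unseen pair should land close to 6 (3 * 2)
print(user_f[3] * item_f[20])
```

The real estimator generalizes this to `rank` factors per user and item (10 by default) and, with `implicitPrefs=True`, optimizes a confidence-weighted objective instead of the plain squared error used here.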

In [7]:
# generate predictions
predictions = model.transform(testDF)

# evaluate model using mean absolute error (MAE)
evaluator = RegressionEvaluator(metricName="mae", labelCol="rating", predictionCol="prediction")
evaluator.evaluate(predictions)

1.4182925862033209

In [8]:
# evaluate RMSE
rmse_evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse_evaluator.evaluate(predictions)

1.6245429718418
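
As a reference for the two scores above, a plain-Python sketch of what MAE and RMSE compute. The sample labels and predictions are made up for illustration. RMSE squares each error before averaging, so large misses weigh more and RMSE is never smaller than MAE.

```python
import math


def mae(labels, predictions):
    """Mean absolute error: average size of the miss."""
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / len(labels)


def rmse(labels, predictions):
    """Root mean squared error: penalizes large misses more heavily."""
    squared = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    return math.sqrt(squared / len(labels))


labels = [1, 2, 3]           # made-up "days stayed" ratings
preds = [1.0, 2.0, 5.0]      # made-up model predictions
print(mae(labels, preds), rmse(labels, preds))
```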

In [9]:
# create a test user from Silver Spring, MD
test_user = testDF.filter('user == 20901').select('user', 'item', 'rating')
test_user.show()

+-----+------+------+
| user|  item|rating|
+-----+------+------+
|20901|232459|     2|
|20901|232459|     2|
|20901|232459|     3|
|20901|232490|     3|
|20901|232507|     1|
|20901|232507|     2|
|20901|232507|     3|
|20901|232507|     3|
|20901|232507|     3|
|20901|232507|     3|
|20901|232507|     3|
|20901|232507|     3|
|20901|232507|     3|
|20901|233321|     2|
|20901|234059|     1|
|20901|234685|     3|
|20901|247762|     2|
|20901|252968|     1|
|20901|252968|     3|
|20901|258830|     2|
+-----+------+------+
only showing top 20 rows



In [10]:
from pyspark.sql.functions import explode

# get recommendations for test user
recommendations = model.recommendForUserSubset(test_user, 5)
dfRecommendations = recommendations.select(explode('recommendations').alias('recs')).select('recs.item', 'recs.rating').sort('recs.rating', ascending=False)
dfRecommendations.show()



+------+---------+
|  item|   rating|
+------+---------+
|232507|1.4952216|
|232508|1.4192376|
|232459| 1.399731|
|251431|1.3828542|
|233626|1.3669988|
+------+---------+
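
`recommendForUserSubset` returns one row per user with an array of (item, rating) structs, which is why the cell above uses `explode`. A plain-Python analogue of that flatten-and-sort step, assuming rows shaped like dictionaries (the values are copied from the output above, with the order shuffled for illustration):

```python
def explode_recommendations(rows):
    """Flatten per-user recommendation lists into (item, rating) pairs,
    sorted by rating from highest to lowest."""
    flat = [
        (rec["item"], rec["rating"])
        for row in rows
        for rec in row["recommendations"]
    ]
    return sorted(flat, key=lambda pair: pair[1], reverse=True)


rows = [
    {
        "user": 20901,
        "recommendations": [
            {"item": 232459, "rating": 1.399731},
            {"item": 232507, "rating": 1.4952216},
            {"item": 232508, "rating": 1.4192376},
        ],
    },
]

for item, rating in explode_recommendations(rows):
    print(item, rating)
```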



## Requests

Use the RIDB RESTful API to fetch more information about a recommendation.

NOTE - requires an API key

In [22]:
%env RIDB_API_KEY=XXXXX

env: RIDB_API_KEY=XXXXX


In [12]:
%pip install requests

Note: you may need to restart the kernel to use updated packages.


In [13]:
import os
import requests

# load RIDB API key environment variable
RIDB_API_KEY = os.environ.get('RIDB_API_KEY')

# fetch more data from RESTful API
if RIDB_API_KEY:
    r = requests.get('https://ridb.recreation.gov/api/v1/facilities/' + str(dfRecommendations.collect()[0][0]), headers={'apikey': RIDB_API_KEY})
    print(r.json())
else:
    print('Request not available')

{'FacilityID': '232507', 'LegacyFacilityID': '70989', 'OrgFacilityID': 'AN370989', 'ParentOrgID': '128', 'ParentRecAreaID': '2576', 'FacilityName': 'ASSATEAGUE ISLAND NATIONAL SEASHORE CAMPGROUND', 'FacilityDescription': '<h2>Overview</h2>\n<p>Assateague Island, famed for its wild horses, lies off the Delmarva Peninsula on the Atlantic Coast. This barrier island is a constantly shifting ribbon of sand, altered daily by powerful wind and waves. <br> <br>The Assateague Island National Seashore, Assateague State Park, and the Chincoteague National Wildlife Refuge each manage and protect this unique, diverse strip of land. <br> <br>For more information go to https://www.nps.gov/asis</p>\n<h2>Recreation</h2>\nActivities are abundant on the island, with crabbing and clamming, and a long stretch of beach for swimming, kayaking and fishing.<h2>Facilities</h2>\n<p>The campground is open year-round. Advance reservations are available up to 6 months in advance during the following dates:\xa0</p>\
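
A slightly more defensive variant of the lookup, sketched with the standard library's `urllib` instead of `requests` so it needs no extra install. The endpoint and the `apikey` header come from the cell above; the helper names and the 10 second timeout are assumptions.

```python
import json
import urllib.request

RIDB_FACILITIES_URL = "https://ridb.recreation.gov/api/v1/facilities/"


def build_facility_request(facility_id, api_key):
    """Build the GET request for one facility (hypothetical helper)."""
    url = RIDB_FACILITIES_URL + str(facility_id)
    return urllib.request.Request(url, headers={"apikey": api_key})


def fetch_facility(facility_id, api_key, timeout=10):
    """Fetch and decode the facility record.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses,
    so a bad key or unknown facility does not pass silently.
    """
    request = build_facility_request(facility_id, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# building the request does not touch the network
request = build_facility_request(232507, "XXXXX")
print(request.full_url)
```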

## Cross Validation

Create an optimized model using hyperparameter tuning and cross validation.  Evaluate the model with MAE and use the best model to generate recommendations for the test user.

In [14]:
from pyspark.ml.tuning import ParamGridBuilder

# set parameters for tuning
paramGrid = ParamGridBuilder()\
    .addGrid(als.maxIter, [5, 10, 15])\
    .addGrid(als.regParam, [0.001, 0.01, 0.1])\
    .build()
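
The grid above is the Cartesian product of the two value lists. A plain-Python sketch of how many model fits that implies, assuming `CrossValidator` is left at its default of 3 folds (variable names are illustrative):

```python
from itertools import product

max_iter_values = [5, 10, 15]
reg_param_values = [0.001, 0.01, 0.1]

# every (maxIter, regParam) pairing becomes one candidate parameter map
param_combinations = [
    {"maxIter": m, "regParam": r}
    for m, r in product(max_iter_values, reg_param_values)
]

num_folds = 3  # CrossValidator's default numFolds
# one fit per combination per fold, plus a final refit of the best combination
total_fits = len(param_combinations) * num_folds + 1

print(len(param_combinations), total_fits)
```

Widening the grid multiplies the cost quickly, so it helps to keep the lists short on a large dataset.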

In [15]:
from pyspark.ml.tuning import CrossValidator

crossval = CrossValidator(estimator=als,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator)

# cross-validate to find the best model
cvModel = crossval.fit(trainDF)
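
Cross validation averages the evaluator's metric over the folds for each parameter map and keeps the map with the best average (lowest, for an error metric such as MAE). A plain-Python sketch of that selection step; the fold scores are made up, and `pick_best` is a hypothetical helper, not part of PySpark. On the fitted model, the real averages are available as `cvModel.avgMetrics`.

```python
def pick_best(param_maps, fold_scores, larger_is_better=False):
    """Return the parameter map with the best average fold score,
    together with the list of averages."""
    averages = [sum(scores) / len(scores) for scores in fold_scores]
    choose = max if larger_is_better else min
    best_index = choose(range(len(averages)), key=averages.__getitem__)
    return param_maps[best_index], averages


param_maps = [
    {"maxIter": 5, "regParam": 0.1},
    {"maxIter": 10, "regParam": 0.1},
    {"maxIter": 15, "regParam": 0.1},
]
# made-up MAE per fold for each parameter map (3 folds)
fold_scores = [
    [1.5, 1.5, 1.5],
    [1.25, 1.5, 1.0],
    [1.5, 1.25, 1.75],
]

best_params, averages = pick_best(param_maps, fold_scores)
print(best_params, averages)
```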

In [16]:
# assess prediction model
cvPred = cvModel.bestModel.transform(testDF)
evaluator.evaluate(cvPred)

1.388201987081891

In [17]:
# get RMSE score
rmse_evaluator.evaluate(cvPred)

1.594626026508609

In [18]:
cvModel.bestModel

ALSModel: uid=ALS_1b818a5a6f87, rank=10

In [19]:
# get recommendations for user
cvRecommendations = cvModel.bestModel.recommendForUserSubset(test_user, 5)
dfCVRecommendations = cvRecommendations.select(explode('recommendations').alias('recs')).select('recs.item', 'recs.rating').sort('recs.rating', ascending=False)
dfCVRecommendations.show()



+------+---------+
|  item|   rating|
+------+---------+
|232507|1.5563093|
|251431|1.5072519|
|232508|1.5059693|
|232459|1.4619796|
|233563|1.4107742|
+------+---------+



In [20]:
# fetch more data from RESTful API
if RIDB_API_KEY:
    r = requests.get('https://ridb.recreation.gov/api/v1/facilities/' + str(dfCVRecommendations.collect()[0][0]), headers={'apikey': RIDB_API_KEY})
    print(r.json())
else:
    print('Request not available')

{'FacilityID': '232507', 'LegacyFacilityID': '70989', 'OrgFacilityID': 'AN370989', 'ParentOrgID': '128', 'ParentRecAreaID': '2576', 'FacilityName': 'ASSATEAGUE ISLAND NATIONAL SEASHORE CAMPGROUND', 'FacilityDescription': '<h2>Overview</h2>\n<p>Assateague Island, famed for its wild horses, lies off the Delmarva Peninsula on the Atlantic Coast. This barrier island is a constantly shifting ribbon of sand, altered daily by powerful wind and waves. <br> <br>The Assateague Island National Seashore, Assateague State Park, and the Chincoteague National Wildlife Refuge each manage and protect this unique, diverse strip of land. <br> <br>For more information go to https://www.nps.gov/asis</p>\n<h2>Recreation</h2>\nActivities are abundant on the island, with crabbing and clamming, and a long stretch of beach for swimming, kayaking and fishing.<h2>Facilities</h2>\n<p>The campground is open year-round. Advance reservations are available up to 6 months in advance during the following dates:\xa0</p>\

## Export Model

Save the model for future use.

In [21]:
# save model
cvModel.bestModel.write().overwrite().save('./model/als.model')