### News Recommendation ALS w/ AML Example Databricks Notebook
##### by Daniel Ciborowski, dciborow@microsoft.com

##### Copyright (c) Microsoft Corporation. All rights reserved.

##### Licensed under the MIT License.

##### Setup
1. Create a new cluster: DB 4.1, Spark 2.3.0, Python 3
1. (Optional, for ranking metrics) Add the following Maven package to the cluster: Azure:mmlspark:0.15
1. Add the Cosmos DB uber JAR to the cluster: https://repo1.maven.org/maven2/com/microsoft/azure/azure-cosmosdb-spark_2.3.0_2.11/1.2.7/azure-cosmosdb-spark_2.3.0_2.11-1.2.7-uber.jar

##### This notebook is broken down into three sections.
1. Experimentation
1. Training & Scoring
1. Serving

##### The following Azure services will be deployed into a new or existing resource group.
1. [ML Service](https://docs.databricks.com/user-guide/libraries.html)
1. [Cosmos DB](https://azure.microsoft.com/en-us/services/cosmos-db/)
1. [Container Registry](https://docs.microsoft.com/en-us/azure/container-registry/)
1. [Container Instances](https://docs.microsoft.com/en-us/azure/container-instances/)
1. [Application Insights](https://azure.microsoft.com/en-us/services/monitor/)
1. Storage Account
1. Key Vault

In a news recommendation scenario, each item has an active lifespan during which it should be recommended. Once that lifespan expires, the story is no longer recommended, and newer stories take its place. Recommendations should therefore include only active stories. This example shows how to train a model on historical data and make recommendations for the latest news stories.
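As a minimal sketch of the "active story" idea, the snippet below keeps only the stories that have activity after a cutoff timestamp, which is the same rule the notebook applies later with a Spark filter. The story ids, timestamps, and the helper name `active_stories` are hypothetical and exist only for illustration.

```python
from typing import Dict, List


def active_stories(last_seen: Dict[str, int], cutoff: int) -> List[str]:
    """Return the ids of stories whose most recent activity is after `cutoff`."""
    return sorted(story for story, ts in last_seen.items() if ts > cutoff)


# Hypothetical data: story id -> Unix timestamp of its most recent page view
last_seen = {
    "story-a": 1483000000,
    "story-b": 1483800000,
    "story-c": 1483900000,
}

# Only stories seen after the cutoff are candidates for recommendation
candidates = active_stories(last_seen, cutoff=1483747200)
print(candidates)
```

Stories that fall out of the active window simply stop appearing as candidates, so no retraining is needed just to retire an expired story.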

![Design](https://raw.githubusercontent.com/dciborow/DB-Recs/master/NewsRecs/reco_news_design.JPG)


To turn new stories from cold items into warm items, 1% of the recommendations served should include a random new (cold) story. This population should also be used as a baseline for measuring online model performance.
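One way to picture this exploration step is sketched below. It is an illustration under assumptions, not the serving code used later in this notebook: the function name `serve_with_exploration`, the choice to replace the last slot, and the returned `explored` flag are all hypothetical.

```python
import random
from typing import List, Optional, Tuple


def serve_with_exploration(
    recommended: List[str],
    cold_items: List[str],
    explore_rate: float = 0.01,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], bool]:
    """Return the list to serve and whether a cold story was injected."""
    rng = rng or random.Random()
    served = list(recommended)
    explored = bool(served) and bool(cold_items) and rng.random() < explore_rate
    if explored:
        # Replace the last slot so the strongest recommendations stay in place
        served[-1] = rng.choice(cold_items)
    return served, explored


served, explored = serve_with_exploration(
    ["story-1", "story-2", "story-3"], ["new-story-1", "new-story-2"]
)
print(served, explored)
```

Logging the `explored` flag with each response would make it possible to compare click-through on the random population against the model-driven population.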

The news recommendation dataset (Adressa) can be found here: http://reclab.idi.ntnu.no/dataset/

##### Citation
Gulla, J. A., Zhang, L., Liu, P., Özgöbek, Ö., & Su, X. (2017, August). The Adressa dataset for news recommendation. In Proceedings of the International Conference on Web Intelligence (pp. 1042-1048). ACM.

In [2]:
import pandas as pd
import random

from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql.types import *
from pyspark.sql.functions import col, collect_list

In [3]:
from azure.common.client_factory import get_client_from_cli_profile

import azureml.core
from azureml.core import Workspace
from azureml.core.run import Run
from azureml.core.experiment import Experiment


from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import Row

import numpy as np
import os
import pandas as pd
import pprint
import shutil
import time, timeit
import urllib
import yaml

# Check core SDK version number - based on build number of preview/master.
print("SDK version:", azureml.core.VERSION)

prefix = "dcib_igor_"
subscription_id = ''
data = 'news'

workspace_region = "westus2"
resource_group = prefix + "_" + data
workspace_name = prefix + "_"+data+"_aml"
experiment_name = data + "_als_Experiment"
aks_name = "dcibigoraks"
service_name = "dcibigoraksals"

# Create the workspace, or reuse it if it already exists (exist_ok=True).
ws = Workspace.create(name = workspace_name,
                      subscription_id = subscription_id,
                      resource_group = resource_group, 
                      location = workspace_region,
                      exist_ok=True)

# persist the subscription id, resource group name, and workspace name in aml_config/config.json.
ws.write_config()

# start a training run by defining an experiment
myexperiment = Experiment(ws, experiment_name)
root_run = myexperiment.start_logging()


AzureML is a way to organize your machine learning development process. It can be used directly from the Azure portal or programmatically from a notebook, as in this example.
![Design](https://raw.githubusercontent.com/dciborow/DB-Recs/master/NewsRecs/workspace.JPG)

# I. Experimentation

In [6]:
spark = SparkSession.builder.getOrCreate()

data = spark.read.json("wasb://sampledata@dcibviennadata.blob.core.windows.net/one_week.json") \
  .cache()

from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline, PipelineModel

df = data \
  .filter(col("sessionStart") != 'true') \
  .filter(col("sessionStop") != 'true') \
  .filter(col("url") != "http://adressa.no") \
  .filter(col("activeTime") > 10) \
  .select("userId","url", "activeTime", "time") \
  .cache()

indexerContacts = StringIndexer(inputCol='userId', outputCol='userIdIndex', handleInvalid='keep').fit(df)
indexerRules = StringIndexer(inputCol='url', outputCol='itemIdIndex', handleInvalid='keep').fit(df)

ratings = indexerRules.transform(indexerContacts.transform(df)) \
  .select("userIdIndex","itemIdIndex","activeTime","time") \
  .withColumnRenamed('userIdIndex',"userId") \
  .withColumnRenamed('itemIdIndex',"itemId") \
  .withColumnRenamed('activeTime',"rating") \
  .withColumnRenamed('time',"timestamp") \
  .cache()

display(ratings.select('userId','itemId','rating','timestamp').orderBy('userId','itemId'))

In [8]:
# Build the recommendation model using ALS on the rating data
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
algo = ALS(userCol="userId", itemCol="itemId", implicitPrefs=True, coldStartStrategy="drop")
model = algo.fit(ratings)
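Because `implicitPrefs=True` is set, the `rating` column (here, active time on a page) is treated as evidence of interest rather than as an explicit score. The sketch below illustrates the usual implicit-feedback formulation, in which a positive observation becomes a binary preference weighted by a confidence of `1 + alpha * |rating|`. It assumes the default `alpha` of 1.0 and is a simplified illustration, not Spark's actual implementation; the helper name is hypothetical.

```python
from typing import Tuple


def preference_and_confidence(rating: float, alpha: float = 1.0) -> Tuple[float, float]:
    """Map an implicit signal to (binary preference, confidence weight)."""
    preference = 1.0 if rating > 0 else 0.0
    confidence = 1.0 + alpha * abs(rating)
    return preference, confidence


# A 30-second read counts as a positive preference with a high confidence weight
print(preference_and_confidence(30.0))
# An unobserved pair (rating 0) counts as no preference with baseline confidence
print(preference_and_confidence(0.0))
```

Under this view, long reading times do not predict a larger score; they make the model trust the "user liked this story" observation more.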

In [9]:
# Evaluate the model by computing ranking metrics on the rating data
from mmlspark.RankingAdapter import RankingAdapter
from mmlspark.RankingEvaluator import RankingEvaluator

output = RankingAdapter(mode='allUsers', k=5, recommender=algo) \
  .fit(ratings) \
  .transform(ratings)

metrics = ['ndcgAt','map','recallAtK','mrr','fcp']
metrics_dict = {}
for metric in metrics:
    metrics_dict[metric] = RankingEvaluator(k=3, metricName=metric).evaluate(output)

for k in metrics_dict:
    root_run.log(k, metrics_dict[k])

display(spark.createDataFrame(pd.DataFrame(list(root_run.get_metrics().items()), columns=['metric','value'])))

metric,value
map,0.3410084486390418
ndcgAt,0.3831677112362854
rmse,94.22848622877932
recallAtK,0.1854026957431714
mrr,0.4864189232443509
fcp,0.2125321225090765
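To make two of the metric names above more concrete, here is a small sketch of recall at k and reciprocal rank for a single user, using textbook definitions. The mmlspark evaluator may differ in details such as averaging or tie handling, so treat this as an illustration only; the function names and sample data are hypothetical.

```python
from typing import List, Sequence


def recall_at_k(recommended: Sequence[int], relevant: Sequence[int], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(set(relevant))


def reciprocal_rank(recommended: Sequence[int], relevant: Sequence[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is recommended."""
    relevant_set = set(relevant)
    for rank, item in enumerate(recommended, start=1):
        if item in relevant_set:
            return 1.0 / rank
    return 0.0


recommended: List[int] = [10, 20, 30]
relevant: List[int] = [20, 40]
print(recall_at_k(recommended, relevant, k=3))
print(reciprocal_rank(recommended, relevant))
```

Averaging the reciprocal rank over all users gives the mean reciprocal rank (`mrr`).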


In [10]:
%%writefile recommend.py

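import pyspark
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import col, collect_list
from pyspark.sql.types import ArrayType, FloatType, IntegerType, StructField, StructType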

# Recommend Subset Wrapper
def recommendSubset(self, df, timestamp):
  def Func(lines):
    out = []
    for i in range(len(lines[1])):
      out += [(lines[1][i],lines[2][i])]
    return lines[0], out

  tup = StructType([
    StructField('itemId', IntegerType(), True),
    StructField('rating', FloatType(), True)
  ])
  array_type = ArrayType(tup, True)
  active_items = df.filter(col("timestamp") > timestamp).select("itemId").distinct()
  users = df.select("userId").distinct()

  users_active_items = users.crossJoin(active_items)
  scored = self.transform(users_active_items)

  recs = scored \
    .groupBy(col('userId')) \
    .agg(collect_list(col("itemId")),collect_list(col("prediction"))) \
    .rdd \
    .map(Func) \
    .toDF() \
    .withColumnRenamed("_1","userId") \
    .withColumnRenamed("_2","recommendations") \
    .select(col("userId"),col("recommendations").cast(array_type))

  return recs

# Attach recommendSubset to ALSModel so it can be called as a method on fitted models
pyspark.ml.recommendation.ALSModel.recommendSubset = recommendSubset

# Train ALS on the historical ratings, then recommend only items active after `timestamp`
def recommend(historic, timestamp):   
  algo = ALS(userCol="userId", itemCol="itemId", implicitPrefs=True, coldStartStrategy="drop")
  model = algo.fit(historic)  
  recs = model.recommendSubset(historic, timestamp)
  return recs

In [11]:
with open('recommend.py', 'r') as myfile:
    data=myfile.read()

exec(data)

recs = recommend(ratings, 1483747200) \
  .cache()
# display(recs.orderBy('userId'))

recs.take(1)
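Each row of `recs` holds every active item for a user as `(itemId, rating)` pairs, collected without sorting. If only the strongest few are needed, they can be picked per user as in the sketch below, which works on a plain Python list as it would look after collecting one row. The helper name `top_k` and the sample values are hypothetical.

```python
import heapq
from typing import List, Tuple


def top_k(recommendations: List[Tuple[int, float]], k: int) -> List[Tuple[int, float]]:
    """Return the k (itemId, rating) pairs with the highest predicted rating."""
    return heapq.nlargest(k, recommendations, key=lambda pair: pair[1])


# Hypothetical recommendations for one user
user_recs = [(101, 0.2), (102, 0.9), (103, 0.5)]
print(top_k(user_recs, 2))
```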

In [12]:
root_run.upload_file("outputs/recommend.py",'recommend.py')
root_run.complete()

print("Run Url: " + root_run.get_portal_url())

AzureML has recorded this experiment and provides a configurable dashboard for run history.
![Design](https://raw.githubusercontent.com/dciborow/DB-Recs/master/NewsRecs/experiment.JPG)

In [14]:
# Register as model
from azureml.core.model import Model
mymodel = Model.register(model_path = 'recommend.py', # this points to a local file
                       model_name = 'als', # this is the name the model is registered as, am using same name for both path and name.                 
                       description = "ADB trained model by Dan",
                       workspace = ws)

print(mymodel.name, mymodel.description, mymodel.version)
print("URL: " + mymodel.url)

The model becomes a trackable asset in the Machine Learning workspace.

![Model](https://raw.githubusercontent.com/dciborow/DB-Recs/master/NewsRecs/model.JPG)

# II. Training and Scoring

In [17]:
from azureml.core.model import Model

mymodel = Model.list(ws)[0]
mymodel.download('./o16n/',exists_ok=True)
print(mymodel.name, mymodel.description, mymodel.version)

with open('./o16n/recommend.py', 'r') as myfile:
    data=myfile.read()

exec(data)

recs = recommend(ratings, 1483747200) \
  .cache()

recs.take(1)

In [18]:
account_name = "movies-ds-sql"
endpoint = "https://" + account_name + ".documents.azure.com:443/"
master_key = ""

writeConfig = {
  "Endpoint": endpoint,
  "Masterkey": master_key,
  "Database": 'recommendations',
  "Collection": 'news',
  "Upsert": "true"
}

# recs \
#   .withColumn("id",recs['userid'].cast("string")) \
#   .select("id", "recommendations.itemid")\
#   .write \
#   .format("com.microsoft.azure.cosmosdb.spark") \
#   .mode('overwrite') \
#   .options(**writeConfig) \
#   .save()

Store recommendations in Cosmos DB for low latency serving.
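The commented-out write above casts the user id to a string `id` and keeps only the recommended item ids. A rough sketch of the resulting document shape, built in plain Python for a single user, might look like the following. The helper name `to_document` and the sample values are hypothetical.

```python
import json
from typing import Dict, List, Tuple


def to_document(user_id: int, recommendations: List[Tuple[int, float]]) -> Dict[str, object]:
    """Build one Cosmos DB document: string id plus the list of recommended item ids."""
    return {
        "id": str(user_id),
        "itemid": [item_id for item_id, _ in recommendations],
    }


doc = to_document(1, [(42, 0.9), (7, 0.4)])
print(json.dumps(doc))
```

Keying each document by `id` lets the serving layer fetch a user's recommendations with a single lookup by `c.id`.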

![Cosmos DB](https://raw.githubusercontent.com/dciborow/DB-Recs/master/NewsRecs/cosmosdb.JPG)

# III. Serving

In [21]:
from azureml.core.model import Model

mymodel = Model.list(ws)[0]
print(mymodel.name, mymodel.description, mymodel.version)

In [22]:
%%writefile score_sparkml.py

import json
def init(local=False):
    global client, collection
    try:
      # Query them in SQL
      import pydocumentdb.document_client as document_client

      MASTER_KEY = '{key}'
      HOST = '{endpoint}'
      DATABASE_ID = "{database}"
      COLLECTION_ID = "{collection}"
      database_link = 'dbs/' + DATABASE_ID
      collection_link = database_link + '/colls/' + COLLECTION_ID
      
      client = document_client.DocumentClient(HOST, {'masterKey': MASTER_KEY})
      collection = client.ReadCollection(collection_link=collection_link)
    except Exception as e:
      collection = e

def run(input_json):

    try:
      # The request body is a JSON list whose first element is a JSON-encoded object with an 'id' field
      id = json.loads(json.loads(input_json)[0])['id']

      # Use a parameterized query instead of building the SQL string by concatenation
      query = {
        'query': 'SELECT * FROM c WHERE c.id = @id',
        'parameters': [{'name': '@id', 'value': str(id)}]
      }
      options = {}

      result_iterable = client.QueryDocuments(collection['_self'], query, options)
      result = list(result_iterable)

    except Exception as e:
      result = str(e)
    return json.dumps(str(result))
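The `run` function expects a doubly encoded payload: a JSON list whose first element is itself a JSON string containing the `id`. The round trip can be sketched without any Azure dependency as below. The helper names `encode_request` and `decode_request` are hypothetical and not part of the scoring script.

```python
import json


def encode_request(user_id: int) -> str:
    """Wrap the user id the way the scoring script expects to receive it."""
    inner = json.dumps({"id": str(user_id)})
    return json.dumps([inner])


def decode_request(input_json: str) -> str:
    """Mirror the parsing done at the top of run()."""
    return json.loads(json.loads(input_json)[0])["id"]


payload = encode_request(1)
print(payload)
print(decode_request(payload))
```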

In [23]:
# Test Web Service Code
with open('score_sparkml.py', 'r') as myfile:
    score_sparkml=myfile.read()
    
# Fill in the connection placeholders from the Cosmos DB write configuration
score_sparkml = score_sparkml \
  .replace("{key}", writeConfig['Masterkey']) \
  .replace("{endpoint}", writeConfig['Endpoint']) \
  .replace("{database}", writeConfig['Database']) \
  .replace("{collection}", writeConfig['Collection'])

exec(score_sparkml)

In [24]:
%%writefile myenv_sparkml.yml

name: myenv
channels:
  - defaults
dependencies:
  - pip:
    - numpy==1.14.2
    - scikit-learn==0.19.1
    - pandas
    # Required packages for AzureML execution, history, and data preparation.
    - --extra-index-url https://azuremlsdktestpypi.azureedge.net/sdk-release/Preview/E7501C02541B433786111FE8E140CAA1
    - azureml-core
    - pydocumentdb

In [25]:
# Create Image for Web Service
models = [mymodel]
runtime = "spark-py"
conda_file = 'myenv_sparkml.yml'
driver_file = "score_sparkml.py"

# image creation
from azureml.core.image import ContainerImage
myimage_config = ContainerImage.image_configuration(execution_script = driver_file, 
                                    runtime = runtime, 
                                    conda_file = conda_file)

image = ContainerImage.create(name = "news-als",
                                # this is the model object
                                models = [mymodel],
                                image_config = myimage_config,
                                workspace = ws)

# Wait for the create process to complete
image.wait_for_creation(show_output = True)

AzureML tracks images used for web services.

![Image](https://raw.githubusercontent.com/dciborow/DB-Recs/master/NewsRecs/image.JPG)

In [27]:
#create AKS compute
#it may take 20-25 minutes to create a new cluster

from azureml.core.compute import AksCompute, ComputeTarget

# Use the default configuration (can also provide parameters to customize)
prov_config = AksCompute.provisioning_configuration()

# Create the cluster
aks_target = ComputeTarget.create(workspace = ws, 
                                  name = aks_name, 
                                  provisioning_configuration = prov_config)

aks_target.wait_for_completion(show_output = True)

print(aks_target.provisioning_state)
print(aks_target.provisioning_errors)

Azure Kubernetes Service (AKS) is used to host the endpoint.

![Image](https://raw.githubusercontent.com/dciborow/DB-Recs/master/NewsRecs/aks.JPG)

In [29]:
# Deploy image to AKS

from azureml.core.webservice import Webservice, AksWebservice
from azureml.core.image import ContainerImage

#Set the web service configuration (using default here with app insights)
aks_config = AksWebservice.deploy_configuration(enable_app_insights=True)

# Create the web service with a single command; there is a variant that uses the image directly as well.
# If deployment fails (for example, because the service already exists), fall back to the existing service with the same name.
try:
    aks_service = Webservice.deploy_from_image(
        workspace=ws,
        name=service_name,
        deployment_config=aks_config,
        image=image,
        deployment_target=aks_target
    )
    aks_service.wait_for_deployment(show_output=True)
except Exception:
    aks_service = next(s for s in Webservice.list(ws) if s.name == service_name)


AzureML provides more details about this deployment.

![Deployment](https://raw.githubusercontent.com/dciborow/DB-Recs/master/NewsRecs/deployment.JPG)

This includes ML lineage back to the model used to produce the recommendations served by this service.

![Deployment](https://raw.githubusercontent.com/dciborow/DB-Recs/master/NewsRecs/deployment_models.JPG)

In [31]:
import urllib.request
import time
import json

scoring_url = aks_service.scoring_uri
service_key = aks_service.get_keys()[0]

input_data = '["{\\"id\\":\\"1\\"}"]'.encode()

req = urllib.request.Request(scoring_url,data=input_data)
req.add_header("Authorization","Bearer {}".format(service_key))
req.add_header("Content-Type","application/json")

tic = time.time()
with urllib.request.urlopen(req) as result:
    res = result.readlines()
    print(res)
    
toc = time.time()
print("Full run took %.2f seconds" % (toc - tic))

Application Insights is automatically set up, and tracks response time, requests, and more!

![Deployment](https://raw.githubusercontent.com/dciborow/DB-Recs/master/NewsRecs/appinsights.JPG)