# Implicit Bayesian Personalized Ranking for Open edX


To use the implicit BPR algorithm in AWS SageMaker, subscribe to it on the AWS Marketplace: https://aws.amazon.com/marketplace/pp/prodview-xgpovurjdgtga?ref_=beagle&applicationId=AWSMPContessa

You will also have to create an IAM role with access to SageMaker and S3 resources.

Once you have done both, paste the ARN of the role and the ARN of the algorithm here:

In [None]:
role_arn = "Paste the role ARN here"
algo_arn = "Paste the algorithm ARN here"
bucket = "Paste here the name of the bucket to store the data"
test_size = 50

## Step 1 - Get the data

<div class="alert alert-block alert-danger">
<b>Warning:</b> do not run this notebook on a production Open edX server. You can run the following command on a production server to extract the data, but the notebook itself should be hosted locally or on a separate instance.
</div> 

Extract the enrollment data into a CSV file.
Remember to set the correct user and password.

Options to do this successfully:
- Work with a local copy of the database
- Add a `-h <hostname>` option to the `mysql` command to connect to a remote database (check that you have access to it)
- Run the following command on the Open edX instance and copy the output file to the Jupyter notebook directory

In [None]:
!export MYSQL_PWD=<mysql password>; mysql -u root -b edxapp -e "select user_id, course_id as item_id from student_courseenrollment;" | tr '\t' ',' > student_courseenrollment.csv

In [None]:
!head student_courseenrollment.csv
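
As a side note to the extraction command above, the tab-separated output that `mysql` prints when piped can also be read by pandas directly, without converting tabs to commas first. The following is only a sketch: the sample text and the name `toy_enrollments` are made up for illustration, and it assumes the output starts with a header row as in the command above.

```python
import io

import pandas as pd

# Made-up sample of the piped mysql output: a header row, then tab-separated values.
sample_output = (
    "user_id\titem_id\n"
    "5\tcourse-v1:edX+DemoX+Demo_Course\n"
    "7\tcourse-v1:edX+DemoX+Demo_Course\n"
)

# sep="\t" tells pandas to split columns on tabs instead of commas.
toy_enrollments = pd.read_csv(io.StringIO(sample_output), sep="\t")

print(toy_enrollments.columns.tolist())  # ['user_id', 'item_id']
```

With a real file, the `io.StringIO(...)` argument would be replaced by the path of the tab-separated file.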

### Prepare course enrollments training data <a id="prepare-data"></a>

In [None]:
import pandas as pd

df = pd.read_csv('student_courseenrollment.csv')
print(df.shape)
df.head(10)

Now we want to keep only the users with more than N enrollments.

First we count the enrollments per user:

In [None]:
enrollments_counter = df.pivot_table(index=['user_id'], aggfunc={'user_id': 'size'})

Then we apply the filter

In [None]:
N = 3
user_index = df.set_index('user_id').index
filtered_users = enrollments_counter[enrollments_counter.user_id > N].index
# Reset the index on a new dataframe instead of in place on a slice of df
filtered_df = df[user_index.isin(filtered_users)].reset_index(drop=True)
filtered_df
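
The counting and filtering steps above can also be sketched in a single pass with `groupby` and `transform`. This is an alternative illustration on a toy dataframe, not the approach used in this notebook; the user and course ids are made up.

```python
import pandas as pd

# Toy enrollment data; the ids are hypothetical.
toy_df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "item_id": ["c1", "c2", "c3", "c4", "c1", "c2", "c1", "c2", "c3", "c4", "c5"],
})

N = 3

# Attach to every row the number of enrollments of that row's user.
enrollments_per_user = toy_df.groupby("user_id")["item_id"].transform("count")

# Keep only the rows of users with more than N enrollments.
toy_filtered = toy_df[enrollments_per_user > N].reset_index(drop=True)

print(toy_filtered["user_id"].unique().tolist())  # [1, 3]
```

User 2 has only two enrollments and is dropped, while users 1 and 3 are kept with all their rows.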

### Create the test dataframe

We now take a random sample of `test_size` records out of the filtered dataset and use them as a test dataset. Later on, we will compare this test dataset against the inferred ranking.

In [None]:
test_df = filtered_df.sample(test_size)
test_df

Then we remove the test records from the dataframe. The result is our final training dataframe.

In [None]:
train_df = filtered_df.drop(test_df.index)
train_df.shape
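
The split above relies on the dataframe index being unique, which is why the index was reset after filtering. The sketch below reproduces the idea on made-up data. The `random_state` argument is an assumption added here to make the sample repeatable; the notebook itself draws a different sample on every run.

```python
import pandas as pd

# Toy enrollment data with a default, unique integer index.
toy_df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "item_id": ["c1", "c2", "c3", "c4", "c1", "c2", "c3", "c5"],
})

# Hold out two random rows as the test set (random_state is only for repeatability).
toy_test = toy_df.sample(2, random_state=0)

# Dropping by index label removes exactly the held-out rows.
toy_train = toy_df.drop(toy_test.index)

# Every row ends up in exactly one of the two sets.
assert len(toy_train) + len(toy_test) == len(toy_df)
assert toy_train.index.intersection(toy_test.index).empty
```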

### Create training data file <a id="create-training-data-file"></a>

Create a CSV file from the dataframe above. Do not include the index, but do include the headers `user_id` and `item_id`; each row is an enrollment. Then show the head of the file and its number of lines.

In [None]:
train_data_dir = 'aulasneo-training'
train_data_file = '{}/student_courseenrollment-train.csv'.format(train_data_dir)

!mkdir -p {train_data_dir}
train_df[["user_id", "item_id"]].to_csv(train_data_file, index=False)

!head {train_data_file}
!wc -l {train_data_file}

### Upload training data to S3 <a id="upload-training-data"></a>

Choose a bucket in the correct region, optionally customize the prefix, and upload the csv created above.

In [None]:
%%time

import sagemaker
import boto3

prefix = "implicit-bpr-test"

boto_session = boto3.Session()
sagemaker_session = sagemaker.Session(boto_session=boto_session)

training_data = sagemaker_session.upload_data(train_data_file, bucket, "{}/training".format(prefix))
"uploaded training data file to {}".format(training_data)

## Step 2 - Create a model <a id="create-model"></a>

### Run a SageMaker training job <a id="run-training-job"></a>

Provide the IAM role and the algorithm ARN from your subscription, both in the correct region. This code will start a training job, wait for it to be done, and report its status.

In [None]:
%%time

import time

job_name_prefix = 'implicit-bpr-test'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp

estimator = sagemaker.AlgorithmEstimator(
    sagemaker_session=sagemaker_session,
    algorithm_arn = algo_arn,
    role=role_arn,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    input_mode='File',
    output_path='s3://{}/{}/output'.format(bucket, job_name_prefix),
    base_job_name=job_name_prefix
)

inputs = {"training": training_data}

estimator.fit(inputs)

### Create a SageMaker model <a id="create-sagemaker-model"></a>

This will set up a model package and a model within SageMaker from the artifacts created during training. They will be used later for recommendations.

In [None]:
%%time

model_name = estimator.latest_training_job.name
sagemaker_session.create_model_package_from_algorithm(model_name, 'test', algo_arn, estimator.model_data)
sagemaker_session.wait_for_model_package(model_name, poll=5)
sagemaker_session.create_model(model_name, role_arn, [{'ModelPackageName': model_name}], enable_network_isolation=True)

## Step 3 - Get recommendations (Inference) <a id="get-recommendations"></a>

### Create batch transform input file <a id="create-batch-input"></a>

Each row is a JSON object containing two keys:

* `user_id`: the ID of the user
* `top_n`: the number of top-scoring recommendations to return

The head of the batch input file is shown.

In [None]:
%%time

import json
ranking_size = 10

batch_input_dir = 'batch_input'
batch_input_file = batch_input_dir + '/recommendation.requests'

!mkdir -p {batch_input_dir}

unique_users_in_test_df = test_df["user_id"].drop_duplicates()

with open(batch_input_file, 'w') as outfile:
    for user_id in unique_users_in_test_df:
        json.dump({"user_id": str(user_id), "top_n": str(ranking_size)}, outfile)
        outfile.write("\n")
   
!head {batch_input_file}
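
To make the request format concrete, here is a small sketch that builds the same kind of lines in memory and parses them back. The user ids are made up; the keys and the string-typed values mirror the cell above.

```python
import io
import json

ranking_size = 10
toy_user_ids = [17, 42]  # hypothetical user ids

# Write one JSON object per line, as in the batch input file.
buffer = io.StringIO()
for user_id in toy_user_ids:
    json.dump({"user_id": str(user_id), "top_n": str(ranking_size)}, buffer)
    buffer.write("\n")

request_lines = buffer.getvalue().splitlines()

# Each line is a complete JSON document on its own.
parsed = [json.loads(line) for line in request_lines]

print(request_lines[0])  # {"user_id": "17", "top_n": "10"}
```

Because every line parses independently, the transform job can split the file by line and send one request per user.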

### Upload the batch transform input file to S3 <a id="upload-batch-input"></a>

In [None]:
batch_input = sagemaker_session.upload_data(batch_input_dir, bucket, "{}/batch_input".format(prefix))
"uploaded batch input to {}".format(batch_input)

### Run the Batch Transform Job <a id="run-transform"></a>

This code will start a batch transform job, wait for it to be done, and report its status.

In [None]:
%%time

import boto3
boto3_session = boto3.Session()
sage = boto3_session.client(service_name='sagemaker')

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
batch_job_name = "implicit-bpr-test" + timestamp
batch_output = 's3://{}/{}/output'.format(bucket, batch_job_name)
request = \
{
  "TransformJobName": batch_job_name,
  "ModelName": model_name,
  "BatchStrategy": "SingleRecord",
  "TransformInput": {
    "DataSource": {
      "S3DataSource": {
        "S3DataType": "S3Prefix",
        "S3Uri": batch_input
      }
    },
    "ContentType": "application/json",
    "CompressionType": "None",
    "SplitType": "Line"
  },
  "TransformOutput": {
    "S3OutputPath": batch_output,
    "Accept": "text/csv",
    "AssembleWith": "Line"
  },
  "TransformResources": {
    "InstanceType": "ml.c5.xlarge",
    "InstanceCount": 1
  }
}

sage.create_transform_job(**request)

print("Created Transform job with name: ", batch_job_name)

while True:
    job_info = sage.describe_transform_job(TransformJobName=batch_job_name)
    status = job_info['TransformJobStatus']
    if status == 'Completed':
        print("Transform job ended with status: " + status)
        break
    if status in ('Failed', 'Stopped'):
        # A stopped job has no FailureReason, so fall back to the status itself
        message = job_info.get('FailureReason', status)
        print('Transform failed with the following error: {}'.format(message))
        raise Exception('Transform job failed')
    time.sleep(30)
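
The polling logic above could be factored into a small helper, which also makes it possible to try it out without starting a real job. This is a sketch: `wait_for_transform_job`, `TERMINAL_STATUSES` and `fake_describe` are hypothetical names introduced here, and the stub only imitates the `TransformJobStatus` field of the real response.

```python
import time

# Statuses after which polling should stop.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_transform_job(describe, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll `describe` until the job reaches a terminal status and return that status."""
    while True:
        job_info = describe(TransformJobName=job_name)
        status = job_info["TransformJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Stub standing in for the SageMaker client: two in-progress polls, then completion.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(TransformJobName):
    return {"TransformJobStatus": next(fake_statuses)}


# Pass a no-op sleep so the example finishes immediately.
final_status = wait_for_transform_job(fake_describe, "demo-job", sleep=lambda seconds: None)
print(final_status)  # Completed
```

In this notebook the helper would be called as `wait_for_transform_job(sage.describe_transform_job, batch_job_name)`.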

### Download the batch results <a id="download-batch-results"></a>

Download the results from S3 and show the head of the file.

In [None]:
!aws s3 cp {batch_output + '/recommendation.requests.out'} .

!head recommendation.requests.out

### Recommendations with scores <a id="recommendations"></a>

Import the recommendations from the batch output file downloaded above. These are the top course recommendations for our example users.

In [None]:
!wc -l recommendation.requests.out

In [None]:
recommendations_df = pd.read_csv('recommendation.requests.out', 
                                 header=None, 
                                 names=["user_id", "item_id", "score"])

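# Number each user's recommendations from 1 upwards. This assumes the output lists
# each user's items in ranking order, and also works if a user gets fewer than ranking_size items.
recommendations_df["position"] = recommendations_df.groupby("user_id").cumcount() + 1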

recommendations_df

### Testing

Let's now randomly pick one of the users from our test dataset.

In [None]:
sample_user = test_df.sample()["user_id"].iloc[0]

sample_user

These are the courses that our selected user took in the past. Note that we have hidden at least one enrollment from the algorithm.

In [None]:
train_df[train_df["user_id"]==sample_user]

This is the enrollment of our sample user that we have hidden from the algorithm.

In [None]:
test_df[test_df["user_id"]==sample_user]

Let's see where this course would be positioned in the personal recommendation ranking for our sample user if they hadn't taken it before. If it is in the ranking, it will be highlighted in the next table.

In [None]:
check_for_user = recommendations_df[recommendations_df["user_id"]==sample_user]
course = test_df[test_df["user_id"]==sample_user]['item_id'].iloc[0]

def highlight_item(x, item):
    if x == item:
        return 'background-color: yellow'
    else:
        return 'background-color: white'

check_for_user.style.applymap(highlight_item, item=course)

### Benchmarking

Let's repeat this for other users:

In [None]:
sample_user = test_df.sample()["user_id"].iloc[0]
check_for_user = recommendations_df[recommendations_df["user_id"]==sample_user]
course = test_df[test_df["user_id"]==sample_user]['item_id'].iloc[0]
print("User: {}, Course: {}".format(sample_user, course))
check_for_user.style.applymap(highlight_item, item=course)

Let's see how many of the actual enrollments that we have hidden from the algorithm would have appeared in the personal recommendations, and at which position in the ranking (1 = top recommendation).

In [None]:
test_df_to_check = test_df.set_index(["user_id", "item_id"])

check_df = recommendations_df.set_index(["user_id", "item_id"]).join(test_df_to_check, on=["user_id", "item_id"], how="inner")

print("{} out of {} actual enrollments fell into the top {} personal ranking".format(check_df['position'].count(), test_size, ranking_size))
check_df
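
The same check can be sketched with a plain inner `merge` on the two key columns, which may be easier to follow than the index-based join. The data below is made up, and `hit_rate` is a name introduced here for the fraction of hidden enrollments found in the rankings.

```python
import pandas as pd

# Toy recommendations: three ranked courses for each of two users.
toy_recs = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "item_id": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "position": [1, 2, 3, 1, 2, 3],
})

# Toy hidden enrollments: user 1 took c2 (recommended), user 2 took c9 (not recommended).
toy_test = pd.DataFrame({"user_id": [1, 2], "item_id": ["c2", "c9"]})

# An inner merge keeps only the recommendations that match a hidden enrollment.
hits = toy_recs.merge(toy_test, on=["user_id", "item_id"], how="inner")

hit_rate = len(hits) / len(toy_test)
print(len(hits), hit_rate)  # 1 0.5
```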

This histogram shows how many actual enrollments from the test dataset appeared at each position in the user's personal ranking.

In [None]:
bars = check_df.groupby("position").count()
zeros = pd.DataFrame(index=list(range(1,ranking_size + 1)), columns=["score"], data=[0] * ranking_size)
(bars + zeros).fillna(0).plot.bar()
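
The zero-filling in the cell above makes sure that positions without any hit still show up in the chart. A sketch of the same idea with `value_counts` and `reindex`, on made-up hit positions:

```python
import pandas as pd

ranking_size = 5
hit_positions = pd.Series([1, 1, 3, 5, 3, 3])  # hypothetical positions of the hits

# Count the hits per position, then add the missing positions with a count of 0.
counts = (
    hit_positions.value_counts()
    .reindex(range(1, ranking_size + 1), fill_value=0)
)

print(counts.tolist())  # [2, 0, 3, 0, 1]
```

Calling `counts.plot.bar()` on the result would draw the bars in position order, including the empty positions 2 and 4.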

The next cells append the results of each run of this notebook to two CSV files. These files are then used to calculate how often, on average, the user's choice falls into the ranking, and to draw a box plot comparing the positions in the ranking.

In [None]:
from csv import writer

with open('scores.csv', 'a+', newline='') as f:
    csv_writer = writer(f)
    csv_writer.writerow((bars + zeros).fillna(0)['score'])

In [None]:
with open('top10counts.csv', 'a+', newline='') as f:
    csv_writer = writer(f)
    csv_writer.writerow([check_df['position'].count()])

In [None]:
top10counts = pd.read_csv('top10counts.csv', header=None)
# Use ranking_size instead of a hard-coded 10 so the column names match the logged rows
scores = pd.read_csv('scores.csv', header=None, names=list(range(1, ranking_size + 1)))

print("{}% of the test enrollments fell into the top-{} ranking (average of {} runs)".format(top10counts.mean()[0]*100/test_size, ranking_size, len(scores)))

ax=scores.boxplot(grid=False)
ax.set_xlabel("Position in ranking")
ax.set_ylabel("Frequency")

## Step 4 - Optional Clean up <a id="cleanup"></a>

In [None]:
def cleanup():
    sagemaker_session.delete_model(model_name)
    
# optionally uncomment and run the code to clean everything up  

#cleanup()