# How to train and deploy Learning To Rank

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/afoucret/elasticsearch-labs/blob/ltr-notebook/notebooks/learning-to-rank/01-learning-to-rank.ipynb)

In this notebook, we will walk through an example of how to train a Learning To Rank model using [XGBoost](https://xgboost.ai/) and how to deploy it for use as a rescorer in Elasticsearch.

## Install required packages

First, we install the packages required for this example.

In [10]:
!pip install elasticsearch "eland[xgboost]"


## Configure your Elasticsearch deployment

For this example, we will be using an [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) deployment (available with a [free trial](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook)).

In [11]:
import getpass
from elasticsearch import Elasticsearch

# Found in the "Manage Deployment" page
try:
    CLOUD_ID
except NameError:
    CLOUD_ID = getpass.getpass("Enter Elastic Cloud ID: ")

# Password for the "elastic" user generated by Elasticsearch
try:
    ELASTIC_PASSWORD
except NameError:
    ELASTIC_PASSWORD = getpass.getpass("Enter Elastic password: ")

# Create the client instance
es_client = Elasticsearch(
    cloud_id=CLOUD_ID,
    basic_auth=("elastic", ELASTIC_PASSWORD)
)

client_info = es_client.info()

f"Successfully connected to cluster {client_info['cluster_name']} (version {client_info['version']['number']})"

'Successfully connected to cluster 2a391d1ee27d46078c34e6d04a30c384 (version 8.12.0)'
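
When the notebook runs unattended, the interactive prompts above are not practical. A minimal sketch of one alternative is shown below: it reads a value from an environment variable and falls back to `getpass` only when the variable is missing. The helper `read_secret` and the variable names in the comments are hypothetical and are not part of the original notebook.

```python
import getpass
import os


def read_secret(env_var: str, prompt: str) -> str:
    """Return the value of env_var if it is set and non-empty, otherwise prompt for it."""
    value = os.environ.get(env_var)
    if value:
        return value
    return getpass.getpass(prompt)


# Hypothetical environment variable names; adjust them to your own setup.
# CLOUD_ID = read_secret("ELASTIC_CLOUD_ID", "Enter Elastic Cloud ID: ")
# ELASTIC_PASSWORD = read_secret("ELASTIC_PASSWORD", "Enter Elastic password: ")
```

The resulting values can be passed to `Elasticsearch(...)` exactly as in the cell above.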

## Download the dataset

In this example notebook we will use a dataset derived from [MSRD](https://github.com/metarank/msrd/tree/master) (Movie Search Ranking Dataset).

The dataset contains the following files:

- **movies_corpus.jsonl.gz**
- **movies_judgments.csv.gz**
- **movies_feature_extractors.json**
- **movies_index_settings.json**
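
The `.jsonl.gz` extension suggests that the corpus is a gzip-compressed JSON Lines file, with one JSON document per line. As a rough sketch of what reading such a file involves, the snippet below parses a small made-up payload using only the standard library. The helper `read_jsonl_gz` and the sample documents are assumptions for illustration; the notebook itself relies on pandas for this step.

```python
import gzip
import io
import json
from typing import Any, Dict, List


def read_jsonl_gz(payload: bytes) -> List[Dict[str, Any]]:
    """Decompress a gzip payload and parse each non-empty line as a JSON object."""
    with gzip.open(io.BytesIO(payload), mode="rt", encoding="utf-8") as lines:
        return [json.loads(line) for line in lines if line.strip()]


# Made-up two-document payload standing in for the real corpus file.
sample = gzip.compress(b'{"id": "1", "title": "A"}\n{"id": "2", "title": "B"}\n')
docs = read_jsonl_gz(sample)
print(len(docs), docs[0]["title"])  # 2 A
```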

In [12]:
import os
from urllib.parse import urljoin

DATASET_BASE_URL = "https://raw.githubusercontent.com/afoucret/elasticsearch-labs/ltr-notebook/notebooks/learning-to-rank/sample_data/"

CORPUS_URL = urljoin(DATASET_BASE_URL, "movies_corpus.jsonl.gz")
JUDGEMENTS_FILE_URL = urljoin(DATASET_BASE_URL, "movies_judgments.csv.gz")
INDEX_SETTINGS_URL = urljoin(DATASET_BASE_URL, "movies_index_settings.json")
FEATURE_EXTRACTORS_URL = urljoin(DATASET_BASE_URL, "movies_feature_extractors.json")
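
Note that `urljoin` only appends the file name when the base URL ends with a slash, as `DATASET_BASE_URL` does. Without the trailing slash, the last path segment is replaced instead. The short sketch below illustrates the difference with a hypothetical `example.com` base URL.

```python
from urllib.parse import urljoin

# Hypothetical base URLs used only for illustration.
base_with_slash = "https://example.com/data/sample_data/"
base_without_slash = "https://example.com/data/sample_data"

with_slash = urljoin(base_with_slash, "movies_corpus.jsonl.gz")
without_slash = urljoin(base_without_slash, "movies_corpus.jsonl.gz")

print(with_slash)     # https://example.com/data/sample_data/movies_corpus.jsonl.gz
print(without_slash)  # https://example.com/data/movies_corpus.jsonl.gz
```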


## Importing the document corpus

In [13]:
import json
import elasticsearch.helpers as es_helpers
import pandas as pd
from urllib.request import urlopen

MOVIE_INDEX = "movies"

# Delete index
print("Deleting index if it already exists:", MOVIE_INDEX)
es_client.options(ignore_status=[400, 404]).indices.delete(index=MOVIE_INDEX)

print("Creating index:", MOVIE_INDEX)
index_settings = json.load(urlopen(INDEX_SETTINGS_URL))
es_client.indices.create(index=MOVIE_INDEX, **index_settings)

print(f"Loading the corpus from {CORPUS_URL}")
corpus_df = pd.read_json(CORPUS_URL, lines=True)

print(f"Indexing the corpus into {MOVIE_INDEX} ...")
bulk_result = es_helpers.bulk(
    es_client,
    actions=[
        {"_id": movie["id"], "_index": MOVIE_INDEX, **movie}
        for movie in corpus_df.to_dict("records")
    ],
)
print(f"Indexed {bulk_result[0]} documents into {MOVIE_INDEX}")

Deleting index if it already exists: movies
Creating index: movies
Loading the corpus from https://raw.githubusercontent.com/afoucret/elasticsearch-labs/ltr-notebook/notebooks/learning-to-rank/sample_data/movies_corpus.jsonl.gz
Indexing the corpus into movies ...
Indexed 9751 documents into movies
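
The indexing cell above builds the full list of bulk actions in memory, which is fine for a corpus of this size. For larger corpora, one option is to yield the actions from a generator, since the bulk helper accepts any iterable of actions. The following is a minimal sketch: `generate_actions` is a hypothetical helper, and the sample records are made up for illustration.

```python
from typing import Any, Dict, Iterable, Iterator


def generate_actions(
    records: Iterable[Dict[str, Any]], index_name: str
) -> Iterator[Dict[str, Any]]:
    """Yield one bulk action per record, using the record's 'id' field as the document _id."""
    for record in records:
        yield {"_id": record["id"], "_index": index_name, **record}


# Made-up sample records; the real corpus documents have more fields.
sample_records = [
    {"id": "1", "title": "First movie"},
    {"id": "2", "title": "Second movie"},
]
actions = list(generate_actions(sample_records, "movies"))
```

In the notebook, this could be used as `es_helpers.bulk(es_client, actions=generate_actions(corpus_df.to_dict("records"), MOVIE_INDEX))`. The DataFrame is still fully loaded in that case, so the generator only avoids building a second list of actions.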
