# Relevancy Training Beta in Watson Discovery Service
This notebook provides a basic example of how to use the Relevancy Training beta capabilities in the Watson Discovery service. See the note on beta capabilities in the release notes: https://www.ibm.com/watson/developercloud/doc/discovery/release-notes.html

Relevancy Training allows developers to train Watson Discovery to find signals in the language of questions and documents, so that the most relevant documents surface at the top of the results. Documentation for the capability is available at https://www.ibm.com/watson/developercloud/doc/discovery/train.html

## Step 1: Collect Representative Queries
To perform Relevancy Training, you first need a set of representative queries that reflect what real users will ask of the Discovery service once it is integrated in your app. In general, Relevancy Training is best suited to queries expressed in natural language or to phrases that contain multiple important terms. 
There are two common ways to collect these questions. One is to work with Subject Matter Experts (SMEs) to create questions. The other is to deploy a simple application to a set of pilot or alpha users and log their usage. For this example, we will assume the questions have already been collected. 

## Step 2: Collect Training Examples
After collecting questions, you need to provide Relevancy Training with examples of good and bad answer documents for those questions. To do this, we will prepare a file containing the training queries and the results returned for them by the untrained Discovery service. 

In [11]:
import watson_developer_cloud
import json
import csv
import requests

#create a new Discovery object using the Python SDK and credentials from Bluemix. 
username="<INSERT CREDENTIALS HERE>" 
password="<INSERT CREDENTIALS HERE>"

discovery = watson_developer_cloud.DiscoveryV1(
    '2016-11-07',
    username=username,
    password=password)

#specify the environment and collection where the content lives. These ids can be collected from 
#the discovery web tooling collection details page.
environment = "<INSERT ENVIRONMENT_ID HERE>"
collection = "<INSERT COLLECTION_ID HERE>"

In this sample, we will work with a predefined set of questions. When using this example, you will need to fill in the path to a txt file containing a single training question per line. 

This step may take a few minutes to run through all the queries.

In [9]:
with open ("c:/users/IBM_ADMIN/documents/data/wimbledon/wimbledon/questions/questions1.txt") as questions:
    #open an output file to place the responses 
    filestr = "c:/users/IBM_ADMIN/documents/data/wimbledon/gtQuestions_2.tsv"
    of = open(filestr, "w")
    writer = csv.writer(of, delimiter="\t")
    
    #go through each question in the file and prepare Discovery query parameters 
    for line in questions:
        question = line.replace("\n", "")
        params = {}
        params["query"] = "%s" % (question)
        params["return"] = "_id,body,title" #these fields may need to be updated depending on the content being used 
        params["count"] = 4 
        
        #run Discovery query to get results from untrained service 
        result = discovery.query(environment_id=environment, collection_id=collection, query_options=params)
        
        #create a row for each query and results 
        result_list = [question.encode("utf8")]
        for resultDoc in result["results"]:
            id = resultDoc["id"]
            body = resultDoc["body"].encode("utf8")
            title = resultDoc["title"].encode("utf8")
            result_list.extend([id,title,body,' ']) #leave a space to enter a relevance label for each doc 
        
        #write the row to the file 
        writer.writerow(result_list)
    
    of.close()

The resulting file contains the question and potential answers from an untrained Discovery instance. This file can be shared with SMEs to help rate each of the answers. These ratings will be used as relevance labels for the training data. 
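
Once the SMEs have filled in the label cells, each row can be read back into a query plus a list of (document id, relevance) pairs. The following is a minimal sketch, assuming the column layout written by the cell above (the question, then id, title, body, and label repeated for each result) and assuming labels are entered as integers. `parse_labeled_row` is a hypothetical helper for illustration and is not part of the SDK.

```python
def parse_labeled_row(row):
    """Split one TSV row into (query, [(document_id, relevance), ...]).

    Assumed layout: question, then id, title, body, label for each result.
    Results whose label cell is blank or not an integer are skipped.
    """
    query = row[0]
    examples = []
    # each result occupies 4 columns, starting at index 1
    for start in range(1, len(row) - 3, 4):
        doc_id = row[start]
        label = row[start + 3].strip()
        try:
            relevance = int(label)
        except ValueError:
            continue  # unlabeled or malformed label cell
        examples.append((doc_id, relevance))
    return query, examples
```

Skipping unlabeled results, rather than failing on them, lets a partially rated file be processed while the remaining ratings are still being collected.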

## Step 3: Upload Training Data
The next step is to take the queries, documents, and relevance labels, and create training data objects to send to the Discovery service. 
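
As a rough illustration of what one training data object looks like, the sketch below builds the dictionary for a single query. The field names (`natural_language_query`, `examples`, `document_id`, `relevance`) are the ones used in this notebook; `build_training_object` itself is a hypothetical helper, and the sample ids and labels are made up.

```python
import json

def build_training_object(query, labeled_docs):
    """Build one training data object for a single query.

    labeled_docs is a list of (document_id, relevance) pairs.
    """
    return {
        "natural_language_query": query,
        "examples": [
            {"document_id": doc_id, "relevance": relevance}
            for doc_id, relevance in labeled_docs
        ],
    }

# illustrative values only
obj = build_training_object("who won the first title?", [("doc-1", 10), ("doc-2", 0)])
print(json.dumps(obj, indent=2))
```

Building a fresh object per query keeps the examples of one question from leaking into the next.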

In [23]:
#function for posting to training data endpoint 
def training_post(discovery_path, training_obj):
    training_json = json.dumps(training_obj)
    headers = {
        'content-type': "application/json"
        }
    auth = (username, password)
    r = requests.request(method="POST",url=discovery_path,data=training_json,headers=headers,auth=auth)
    #return the response so the caller can inspect the status code 
    return r
 
#open the training file and create new training data objects
with open(filestr,'r') as training_doc:
    training_csv = csv.reader(training_doc, delimiter='\t')    
    
    discovery_path = "https://gateway.watsonplatform.net/discovery/api/v1/environments/" + environment + "/collections/" + collection 
    discovery_training_path = discovery_path + "/training_data?version=2016-11-07"
    
    count = 0 
    for row in training_csv:
        #use first 100 ratings for training, rest will be used for testing 
        if count >= 100:
            break

        #create a new training object for each query so examples do not carry over between rows 
        training_obj = {}
        training_obj["natural_language_query"] = row[0]
        training_obj["examples"] = []

        #each result occupies 4 columns (id, title, body, label); only the first two results are used 
        i = 1 
        for j in range(1,3):
            example_obj = {}
            example_obj["relevance"] = row[i+3]
            example_obj["document_id"] = row[i]
            training_obj["examples"].append(example_obj)
            i = i + 4 

        #send the training data to the discovery service 
        training_post(discovery_training_path, training_obj)

        count = count + 1

    

## Step 4: Check Training Status
After uploading data, you can check the status of the training data to determine if all criteria have been met. The training data requirements are listed in the documentation here https://www.ibm.com/watson/developercloud/doc/discovery/train.html#training-data-requirements

In [27]:
status = discovery.get_collection(environment,collection)
print(json.dumps(status))

{"status": "active", "updated": "2017-04-30T15:28:24.819Z", "name": "WimbledonSmall", "language": "en_us", "created": "2017-04-30T15:28:24.819Z", "document_counts": {"available": 0, "failed": 5, "processing": 0}, "configuration_id": "6739fd67-f347-44c9-b60b-5395345ce9a8", "training_status": {"available": true, "successfully_trained": "1969-12-20T19:04:22.992+0000", "total_examples": 297, "processing": false, "sufficient_label_diversity": true, "minimum_examples_added": true, "notices": 0, "minimum_queries_added": true, "data_updated": "1969-12-31T00:53:39.992+0000"}, "collection_id": "800ef70c-7ac9-4a77-9311-6ab96adc7751", "description": null}
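
The `training_status` block in the output above carries the individual requirement flags. A small sketch for listing which of them are still unmet follows. It assumes the status is a dictionary shaped like the output above and treats the three flags named in the code as the requirements to check; `unmet_training_requirements` is a hypothetical helper, not an SDK call.

```python
def unmet_training_requirements(collection_status):
    """Return the names of requirement flags that are not yet true.

    Assumes collection_status is a dict shaped like the get_collection
    output shown above.
    """
    training = collection_status.get("training_status", {})
    required = [
        "minimum_queries_added",
        "minimum_examples_added",
        "sufficient_label_diversity",
    ]
    return [flag for flag in required if not training.get(flag, False)]

# trimmed-down copy of the status shown above
sample_status = {
    "training_status": {
        "available": True,
        "minimum_queries_added": True,
        "minimum_examples_added": True,
        "sufficient_label_diversity": True,
    }
}
print(unmet_training_requirements(sample_status))
```

An empty list means every flag checked here is set; any names returned point to the requirement that still needs more training data.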


## Step 5: Run natural_language_query 
Once training is ready, you can start querying the service with the training applied and see the results. To do this, you can set aside some of the collected questions to use as a test set. Training is applied when you use the natural_language_query parameter in the Discovery query language. The function below prints the JSON results of this query. These results could be incorporated into your application. 

In [25]:
def relevance_query(path, query):
    headers = {
        'content-type': "application/json"
        }
    params = {}
    params["natural_language_query"] = query
    params["version"] = "2016-11-07"
    params["return"] = "_id,body,title"
    params["count"] = 3
    auth = (username, password)
    r = requests.request(method="GET",url=path,params=params,headers=headers,auth=auth)
    print(r.text)

#replace with path to your questions 
test_questions_path = "c:/users/IBM_ADMIN/documents/data/wimbledon/wimbledon/questions/questions_test.txt" 

discovery_query_path = discovery_path + "/query"

#perform a natural_language_query 
with open(test_questions_path, 'r') as test_questions:
    for question in test_questions:
        print(discovery_query_path)
        #strip the trailing newline so it is not sent as part of the query 
        relevance_query(discovery_query_path, question.strip())
    

https://gateway.watsonplatform.net/discovery/api/v1/environments/7455992d-3f0c-4936-b4d3-1dc8d0277e50/collections/800ef70c-7ac9-4a77-9311-6ab96adc7751/query
{
  "matching_results": 3,
  "results": [
    {
      "id": "c3776995-5fe6-409b-97f1-059988ffe79e",
      "score": 1,
      "body": "The Field Cup, 1877–1883. The original Gentlemen’s Singles trophy, won by the first Wimbledon Champion, S.W. Gore, in 1877, was the Field Cup. This Challenge Cup was presented to The All England Croquet and Lawn Tennis Club, especially for the event, by “The Field” newspaper. Mr. J.H. Walsh, who was the Honorary Secretary of the Club and the editor of “The Field”, persuaded his Proprietors to support the new venture by providing the 25 guineas Cup.",
      "title": "1.1.1 The Field Cup, 1877–1883"
    },
    {
      "id": "eb5b62fa-5364-4b26-b1ff-bf63f979ed22",
      "score": 1,
      "body": "l895. Crown Princess Stephanie of Austria was the first royal visitor to Wimbledon.",
      "title": "1.1.49.

## Step 6: Measure Results
To judge the effectiveness of the training, you can compute standard information retrieval metrics. There are a number of options for this. One common metric is NDCG (Normalized Discounted Cumulative Gain): 
https://en.wikipedia.org/wiki/Discounted_cumulative_gain
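
A minimal sketch of NDCG for a single query is shown below. It uses the linear-gain form of DCG (relevance divided by log2 of rank + 1) and, as a simplifying assumption, computes the ideal ordering from the labels of the returned documents only rather than from every labeled document in the collection. The relevance values in the usage lines are made up for illustration.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance labels in ranked order."""
    # rank is 0-based here, so the discount is log2(rank + 2)
    return sum(rel / math.log(rank + 2, 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the DCG of the same labels sorted best-first."""
    ideal = dcg(sorted(relevances, reverse=True))
    if ideal == 0:
        return 0.0  # no relevant documents were returned
    return dcg(relevances) / ideal

# labels of the documents returned for one test query, in ranked order
untrained = ndcg([0, 10, 5])
trained = ndcg([10, 5, 0])
print(untrained, trained)
```

Averaging this score over the held-out test questions, once with the untrained `query` parameter and once with `natural_language_query`, gives one way to compare results before and after training.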

## Notes
* This code is not officially maintained or validated, and is used only for demonstration purposes. 
