# Set Up Azure Cognitive Search Service
This notebook will set up your Azure Cognitive Search Service for the COVID-19 example described at https://aka.ms/Covid19CognitiveSearchCode.  Data is pulled from two folders in the same Azure blob storage container.  The main indexer runs data in json format through a skillset which reshapes the data and extracts medical entities, and puts the enriched data in the search index.  A second metadata indexer pulls additional metadata into the same search index.   

First, you will need an Azure account.  If you don't already have one, you can start a free trial of Azure [here](https://azure.microsoft.com/free/).  

Secondly, create a new Azure search service using the Azure portal at <https://portal.azure.com/#create/Microsoft.Search>.  Select your Azure subscription.  You may create a new resource group (you can name it something like "covid19-search-rg").  You will need a globally-unique URL as the name of your search service (try something like "covid19-search-" plus your name, organization, or numbers).  Finally, choose a nearby location to host your search service - please remember the location that you chose, as your Cognitive Services instance will need to be based in the same location.  Click "Review + create" and then (after validation) click "Create" to instantiate and deploy the service.  

After deployment is complete, click "Go to resource" to navigate to your new search service.  We will need some information about your search service to fill in the "Azure Search variables" section in the cell below.  First, on the "Overview" main page, you should see a "Url" value.  Copy that value into the "azsearch_url" variable in the cell below (you can just update the "<YourSearchServiceName>" section of the URL with the name of your Azure search service).  Then, on the Azure portal page in the left-hand pane under "Settings", click on "Keys".  Update the azsearch_key value below with one of the keys from your service on the Azure portal page.  

Finally, you will need to create an Azure storage account and upload the COVID-19 data set. The data set can be downloaded from https://www.semanticscholar.org/cord19/download. There are two different sections to download: the metadata and document parses. Then, back on the Azure portal, you can create a new Azure storage account at https://portal.azure.com/#create/Microsoft.StorageAccount. Use the same subscription, resource group, and location that you did for the Azure search service. Choose your own unique storage account name (it must be lowercase letters and numbers only). You can change the replication to LRS. You can use the defaults for everything else, and then create the storage. Once it has been deployed, update the blob_connection_string variable in the cell below. Then create a container in your blob storage called "covid19". Inside of that container, create a folder called "json" and upload the document parses data there. Then create a folder called "metadata" in the same blob container, and upload the metadata.csv file to that folder. If you modify those names, update their respective values below.

In [None]:
# Azure Search variables
azsearch_url = "<YourSearchServiceName>.search.windows.net"  # If you copy this value from the portal, leave off the "https://" from the beginning
azsearch_key = "TODO" 

# Data source which contains documents to process
blob_connection_string = "DefaultEndpointsProtocol=https;AccountName=TODO;AccountKey=TODO;EndpointSuffix=core.windows.net"
blob_container = "covid19"
data_folder = "json"
metadata_folder = "metadata"

# Prefix for elements of the Cognitive Search service
search_prefix = "covid19"  # Note that if you change this value, you will also have to change the values in the indexer json.

print("The variables are initialized.")
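
Before calling the service, it can help to confirm that the settings above were actually filled in. The following is a minimal sketch, not part of the original setup: `normalize_search_host` and `find_placeholders` are hypothetical helper names, and the placeholder markers are simply the ones used in the template values above.

```python
def normalize_search_host(url):
    """Return the bare host name, dropping any scheme and trailing slash."""
    host = url.strip()
    for scheme in ("https://", "http://"):
        if host.lower().startswith(scheme):
            host = host[len(scheme):]
    return host.rstrip("/")


def find_placeholders(settings):
    """Return the names of settings that still contain a template placeholder."""
    markers = ("TODO", "<YourSearchServiceName>")  # markers used in the cell above
    return sorted(
        name for name, value in settings.items()
        if any(marker in value for marker in markers)
    )
```

For example, `find_placeholders({"azsearch_url": azsearch_url, "azsearch_key": azsearch_key, "blob_connection_string": blob_connection_string})` should return an empty list once every value has been updated.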

We will first create a simple function to wrap REST requests to the Azure Search service.  If called with no parameters, it will get the service statistics.  

In [None]:
import http.client
import json
import urllib.parse

def azsearch_rest(request_type="GET", endpoint="servicestats", body=None):
    # Request headers: JSON payloads, authenticated with the search service key.
    headers = {
        'Content-Type': 'application/json',
        'api-key': azsearch_key
    }

    # Request parameters: every call specifies the REST API version.
    params = urllib.parse.urlencode({
        'api-version': '2019-05-06-Preview'
    })

    # Execute the REST API call and get the response.
    conn = http.client.HTTPSConnection(azsearch_url)
    try:
        # Strip any leading slash so the path never starts with "//".
        request_path = "/{0}?{1}".format(endpoint.lstrip("/"), params)
        conn.request(request_type, request_path, body, headers)
        response = conn.getresponse()
        print(response.status)
        data = response.read().decode("UTF-8")
    finally:
        conn.close()

    # Some calls return an empty body; return None in that case.
    return json.loads(data) if data else None

# Test the function
try:
    response = azsearch_rest()
    if response is not None:
        print(json.dumps(response, sort_keys=True, indent=2))
except Exception as ex:
    # Python 3 exceptions have no .message attribute, so print the exception itself.
    print(ex)
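
To make the URL handling in the wrapper easier to see, here is a small sketch of how the request path is composed from an endpoint and the `api-version` query parameter. `build_request_path` is a hypothetical helper introduced only for illustration. It also strips a leading slash from the endpoint, since the path template already supplies one and an endpoint such as `/indexers/x/reset` would otherwise yield a path beginning with `//`.

```python
import urllib.parse

API_VERSION = "2019-05-06-Preview"  # the version used by the wrapper in this notebook


def build_request_path(endpoint, api_version=API_VERSION):
    """Compose the path and query string for a call to the search service."""
    params = urllib.parse.urlencode({"api-version": api_version})
    # Drop any leading slash so the result always starts with exactly one "/".
    return "/{0}?{1}".format(endpoint.lstrip("/"), params)


print(build_request_path("servicestats"))
print(build_request_path("/indexers/covid19-indexer/status"))
```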

First, let's set up data sources for your search service.  In this service, we have two data sources, one that pulls data from a json folder and one that pulls data from a metadata folder.  

In [None]:
def create_datasource(datasource_name, blob_connection_string, blob_container, folder):

    # Define the request body with details of the data source we want to create
    body = {   
        "name": datasource_name,  
        "description": "",  
        "type": "azureblob",
        "credentials": 
        { 
            "connectionString": blob_connection_string
        },  
        "container": { 
            "name": blob_container, 
            "query": folder 
        }
    } 

    try:
        # Call the REST API's 'datasources' endpoint to create a data source
        result = azsearch_rest(request_type="POST", endpoint="datasources", body=json.dumps(body))
        if result is not None:
            print(json.dumps(result, sort_keys=True, indent=2))
    except Exception as ex:
        print(ex)
        

# Create two datasources
datasource_name = search_prefix + "-ds"
metadata_datasource_name = "metadata-ds"

create_datasource(datasource_name, blob_connection_string, blob_container, data_folder)
create_datasource(metadata_datasource_name, blob_connection_string, blob_container, metadata_folder)
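
If you expect to rerun this notebook, a create-or-update call may be more convenient than a plain create, because the REST API also accepts `PUT` on `datasources/<name>`. The sketch below is an illustration rather than part of the original setup: `upsert_datasource` is a hypothetical name, and the `rest` argument is assumed to be a callable with the same keyword parameters as the `azsearch_rest` wrapper defined earlier.

```python
import json


def upsert_datasource(rest, name, connection_string, container, folder):
    """Create the blob data source, or update it if it already exists."""
    body = {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        # "query" restricts the data source to one folder of the container.
        "container": {"name": container, "query": folder},
    }
    # PUT to datasources/<name> is the create-or-update form of the call.
    return rest(request_type="PUT", endpoint="datasources/" + name, body=json.dumps(body))
```

In this notebook it could be called as `upsert_datasource(azsearch_rest, datasource_name, blob_connection_string, blob_container, data_folder)`.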

Then let's set up your search index.  

In [None]:
index_name = search_prefix + "-index"

# Define the request body
with open("index.json") as datafile:
    index_json = json.load(datafile)

try:
    # PUT creates the index, or updates it if it already exists
    result = azsearch_rest(request_type="PUT", endpoint="indexes/" + index_name, body=json.dumps(index_json))
    if result is not None:
        print(json.dumps(result, sort_keys=True, indent=2))

except Exception as e:
    print('Error:')
    print(e)

Next, we will set up your skillset.  

In [None]:
skillset_name = search_prefix + "-skillset"

# Define the request body
with open("skillset.json") as datafile:
    skillset_json = json.load(datafile)

try:
    # PUT creates the skillset, or updates it if it already exists
    result = azsearch_rest(request_type="PUT", endpoint="skillsets/" + skillset_name, body=json.dumps(skillset_json))
    if result is not None:
        print(json.dumps(result, sort_keys=True, indent=2))

except Exception as e:
    print('Error:')
    print(e)

Now, we will set up your main indexer.  This indexer will take the data from the json folder in your Azure blob container, run it through the skillset, and put the results in the search index.  

In [None]:
def create_indexer(indexer_name, filename):

    # Define the request body
    with open(filename) as datafile:
        indexer_json = json.load(datafile)

    try:
        # PUT creates the indexer, or updates it if it already exists
        result = azsearch_rest(request_type="PUT", endpoint="indexers/" + indexer_name, body=json.dumps(indexer_json))
        if result is not None:
            print(json.dumps(result, sort_keys=True, indent=2))

    except Exception as e:
        print('Error:')
        print(e)
        

# Create main indexer
indexer_name = search_prefix + "-indexer"
create_indexer(indexer_name, filename="data-indexer.json")
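
Because the indexer definition refers to the data source, index, and skillset by name, changing `search_prefix` means the indexer JSON must be changed to match. The following sketch shows one way to spot a mismatch before sending the request. `find_reference_mismatches` is a hypothetical helper, and it assumes the definition uses the `dataSourceName`, `targetIndexName`, and `skillsetName` fields of the indexer REST API.

```python
def find_reference_mismatches(indexer_definition, expected):
    """Return {field: (actual, expected)} for every reference that differs."""
    return {
        field: (indexer_definition.get(field), wanted)
        for field, wanted in expected.items()
        if indexer_definition.get(field) != wanted
    }


# Illustrative values only; in the notebook these would come from
# datasource_name, index_name, and skillset_name.
example_definition = {
    "name": "covid19-indexer",
    "dataSourceName": "covid19-ds",
    "targetIndexName": "covid19-index",
    "skillsetName": "covid19-skillset",
}
expected_names = {
    "dataSourceName": "mydata-ds",
    "targetIndexName": "covid19-index",
    "skillsetName": "covid19-skillset",
}
print(find_reference_mismatches(example_definition, expected_names))
```

An empty dictionary means every reference in the definition agrees with the names created earlier.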

Finally, we will set up your metadata indexer.  This indexer pulls the data from the metadata folder in your Azure blob container and adds it to the search index.  

In [None]:
metadata_indexer_name = "metadata-indexer"
create_indexer(metadata_indexer_name, filename="metadata-indexer.json")

If this is your first time running an indexer, you won't need to reset it.  But just in case you want to reuse this code and rerun your indexer with changes (perhaps pointing to your own dataset in Azure blob storage instead of ours), you will need to reset the indexer before making changes.  

In [None]:
def reset_indexer(indexer_name):
    # Reset the indexer so that the next run reprocesses all documents.
    # The endpoint has no leading slash because azsearch_rest adds one.
    result = azsearch_rest(request_type="POST", endpoint="indexers/{0}/reset".format(indexer_name), body=None)
    if result is not None:
        print(json.dumps(result, sort_keys=True, indent=2))

def run_indexer(indexer_name):
    # Rerun the indexer.
    result = azsearch_rest(request_type="POST", endpoint="indexers/{0}/run".format(indexer_name), body=None)
    if result is not None:
        print(json.dumps(result, sort_keys=True, indent=2))


# Reset and rerun main indexer.  
reset_indexer(indexer_name)
run_indexer(indexer_name)

In [None]:
# Reset and rerun the metadata indexer.
reset_indexer(metadata_indexer_name)
run_indexer(metadata_indexer_name)

The indexer run can take a while, so let's check the status to see when it is ready.  Below we are checking the main indexer, not the metadata indexer, but you can do both if you want.  

In [None]:
import time

def check_indexer_status(indexer_name):
    try:
        while True:
            result = azsearch_rest(request_type="GET", endpoint="indexers/{0}/status".format(indexer_name))
            # Prefer the status of the most recent run; fall back to the
            # overall indexer status if the indexer has not run yet.
            state = result["status"]
            if result.get("lastResult") is not None:
                state = result["lastResult"]["status"]
            print(state)
            # Stop on success or failure; otherwise keep polling.
            if state in ("success", "error", "transientFailure"):
                break
            time.sleep(1)

    except Exception as e:
        print('Error:')
        print(e)


# Check the main indexer
check_indexer_status(indexer_name)
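
Since a full indexer run can take a long time, you may prefer a polling loop that waits longer between checks and gives up after a deadline. The sketch below is one possible approach under stated assumptions: `wait_for_state` is a hypothetical helper, `get_state` is assumed to be any zero-argument callable that returns the current status string, and the default interval and timeout are arbitrary illustrative values.

```python
import time


def wait_for_state(get_state, terminal_states=("success", "error"),
                   interval=10.0, timeout=3600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns a terminal state or the timeout passes."""
    deadline = clock() + timeout
    while True:
        state = get_state()
        if state in terminal_states:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                "no terminal state within {0} seconds (last state: {1})".format(timeout, state)
            )
        sleep(interval)
```

Here `get_state` could wrap the same status request used in the cell above and return the `lastResult` status. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.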