Lab 2: Re-create a Cognitive Search Skillset with Image Skills
In this lab, we will verify the lack of image processing results we got from the previous lab and fix it by adding image analysis skill sets to our pipeline.
There were png and jpg images within the provided dataset. If you decided to bring your own data, it was suggested to also include images. But we did not add any predefined skillsets for image analysis. This is exactly what we will do now, but first, let's check out the problem with steps 1 and 2.
Step 1 - Checking part of the problem
Let's check the indexer status again; it has valuable information about our "images problem". You can use the same command we used in the previous lab (pasted below for convenience). If you used another indexer name, just change it in the URL.
GET https://[servicename].search.windows.net/indexers/demoindexer/status?api-version=2017-11-11-Preview
api-key: [api-key]
Content-Type: application/json
If you check the response messages for any of the png or jpg files, you will see warnings and no data.
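If the status response is long, a small helper can pull out just the warnings. The following is a minimal Python sketch, assuming the response body has a `lastResult` object containing a `warnings` array whose items carry `key` and `message` properties. The helper name `list_warnings` and the sample payload are hypothetical and only illustrate the idea; check the shape of your own response before relying on it.

```python
from typing import Any, Dict, List


def list_warnings(status: Dict[str, Any]) -> List[str]:
    """Return 'key: message' strings for every warning in the last indexer run."""
    # Assumption: warnings live under lastResult.warnings in the status body.
    last_result = status.get("lastResult") or {}
    warnings = last_result.get("warnings") or []
    return [f"{w.get('key', '<no key>')}: {w.get('message', '')}" for w in warnings]


# Hypothetical, heavily trimmed response body used only to exercise the helper.
sample_status = {
    "status": "running",
    "lastResult": {
        "status": "success",
        "errors": [],
        "warnings": [
            {"key": "doc-1", "message": "example warning text"},
        ],
    },
}

for line in list_warnings(sample_status):
    print(line)
```

To use it with real data, paste the JSON body returned by the status request into `json.loads(...)` and pass the resulting dictionary to `list_warnings`.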
Step 2 - Checking the other part of the problem
Now let's repeat another request from the previous lab, this time with a different goal. We will re-run the step that verifies the content, but query all fields.
GET https://[servicename].search.windows.net/indexes/demoindex/docs?search=*&$select=*&api-version=2017-11-11-Preview
api-key: [api-key]
Content-Type: application/json
You will probably see something similar to the image below - no information for the images we have.
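If you prefer to script this check, the query URL can be assembled programmatically. This is a minimal Python sketch that only builds the URL shown above; the function name `docs_query_url` and the service name `myservice` are hypothetical placeholders, and the request still needs the `api-key` header when you send it.

```python
from urllib.parse import urlencode

API_VERSION = "2017-11-11-Preview"


def docs_query_url(service: str, index: str, select: str = "*", search: str = "*") -> str:
    """Build the 'search documents' URL for the given index and field selection."""
    params = {"search": search, "$select": select, "api-version": API_VERSION}
    # Keep '*', '$' and ',' readable so the URL matches the form used in the lab.
    query = urlencode(params, safe="*$,")
    return f"https://{service}.search.windows.net/indexes/{index}/docs?{query}"


# 'myservice' is a placeholder; replace it with your own service name.
print(docs_query_url("myservice", "demoindex"))
```

Passing a comma-separated list of field names as `select` narrows the response to those fields, which makes it easier to spot the ones that came back empty.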
How can we fix it?
We will fix it, but there is a challenge for you to increase your learning about Predefined Skills. The next steps will guide you through the challenge. Don't worry if you get stuck (that's why it's a challenge!); we will share the solution, too.
Step 3 - Learning
We will add OCR to our cognitive search pipeline. This skill will read text from the images within our dataset. Here is a link where you can read more details.
Step 4 - Cleaning the environment
We need to prepare the environment to add the image analysis we will create. The most practical approach is to delete the objects from Azure Search and rebuild them. With the exception of the data source, we will delete everything else. Resource names are unique, so by deleting an object, you can recreate it using the same name.
Save all the scripts (API calls) you have run so far, including the JSON definition files you used in the "body" field. Let's start by deleting the index and the indexer. You can use the Azure Portal or API calls:
- Deleting the indexer - Just use your service, key and indexer name
- Deleting the index - Just use your service, key and index name
Skillsets can only be deleted through an HTTP command, so let's use another API call to delete it. If you used another skillset name, just change it in the URL.
DELETE https://[servicename].search.windows.net/skillsets/demoskillset?api-version=2017-11-11-Preview
api-key: [api-key]
Content-Type: application/json
Status code 204 is returned on successful deletion.
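The three deletions follow the same pattern, so they can be prepared with one helper. Below is a minimal Python sketch that builds the DELETE requests with the standard library without sending them. The function name `delete_request`, the service name `myservice`, and the key value are hypothetical placeholders; the object names are the ones used in this lab.

```python
import urllib.request

API_VERSION = "2017-11-11-Preview"
COLLECTIONS = ("indexers", "indexes", "skillsets")


def delete_request(service: str, api_key: str, collection: str, name: str) -> urllib.request.Request:
    """Prepare (but do not send) a DELETE request for an indexer, index, or skillset."""
    if collection not in COLLECTIONS:
        raise ValueError(f"collection must be one of {COLLECTIONS}")
    url = f"https://{service}.search.windows.net/{collection}/{name}?api-version={API_VERSION}"
    headers = {"api-key": api_key, "Content-Type": "application/json"}
    return urllib.request.Request(url, method="DELETE", headers=headers)


# Placeholders: replace 'myservice' and '<api-key>' with your own values.
targets = [("indexers", "demoindexer"), ("indexes", "demoindex"), ("skillsets", "demoskillset")]
requests_to_send = [delete_request("myservice", "<api-key>", c, n) for c, n in targets]

for req in requests_to_send:
    print(req.get_method(), req.full_url)
    # To actually delete, uncomment the next lines; a 204 status means success.
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.status)
```

The data source is deliberately left out of `targets`, since this step keeps it.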
Step 5 - Recreating the environment - Challenge!!
Now it is your time to guide the work. We are using a basic Azure Search service, so we can create skillsets with up to 5 skills. Since we currently are using 4, from the previous lab, we can add one more for image processing.
Use the same skillset definition from Lab 1, but add the OCR image analysis skill you read about in Step 3. We suggest you add it at the end of the JSON body definition.
Skipping the services and the data source creation, repeat the other steps of the Lab 1, in the same order. Use the previous lab as a reference.
- Create the services at the portal - Not required, we did not delete them.
- Create the Data Source - Not required, we did not delete it.
- Recreate the Skillset
- Recreate the Index
- Recreate the Indexer
- Check Indexer Status - Here you can repeat the same verification of Lab 2, Step 1. If you don't have a different result, something went wrong.
- Check the Index Fields - Check the image fields you just created.
- Check the data - Here you can repeat the same verification of Lab 2, Step 2. If you don't have a different result, something went wrong.
TIP 1: What you need to do:
- Create a new skillset exactly like the one we did in Lab 1, but with an extra skill, the OCR skill. You can use the same JSON body and add the new OCR skill at the end.
- Create a new index exactly like the one we did in Lab 1, but with an extra field for the OCR text from the images. Name suggestion: myOcrText. You can use the same JSON body and add the new OCR field at the end.
- Create a new indexer exactly like the one we did in Lab 1, but with an extra mapping for the new skill and the new field listed above. You can use the same JSON body and add the new OCR mapping at the end.
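TIP 2: Your new field in the Index must have a Collection data type (for example, Collection(Edm.String)), because a document can contain several images and each one produces its own text.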
TIP 3: You can query only the OCR field, to better visualize the results. Suppose that your new index field name is myOcrText. You can query it using:
GET https://[servicename].search.windows.net/indexes/demoindex/docs?search=*&$select=myOcrText&api-version=2017-11-11-Preview
api-key: [api-key]
Content-Type: application/json
TIP 4: Your indexer sourceFieldName for the OCR text field has to be /document/normalized_images/*/myOcrText if your field is named myOcrText.
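The tips above can be sketched as three small additions to the Lab 1 definitions. This is a hedged Python sketch, not the official solution: the property names (`#Microsoft.Skills.Vision.OcrSkill`, `imageAction`, `generateNormalizedImages`, and so on) reflect our understanding of the preview API and should be verified against the documentation from Step 3. The helper `add_ocr` and the constant names are hypothetical. Note the assumption that the indexer must be configured to generate normalized images for the `/document/normalized_images/*` path to exist.

```python
import copy
from typing import Any, Dict, Tuple

OCR_FIELD = "myOcrText"
IMAGES_PATH = "/document/normalized_images/*"

# Assumed shape of the OCR skill; verify against the OCR skill documentation.
OCR_SKILL = {
    "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
    "context": IMAGES_PATH,
    "defaultLanguageCode": "en",
    "detectOrientation": True,
    "inputs": [{"name": "image", "source": IMAGES_PATH}],
    "outputs": [{"name": "text", "targetName": OCR_FIELD}],
}
# One text entry per image, hence a collection type (TIP 2).
OCR_INDEX_FIELD = {"name": OCR_FIELD, "type": "Collection(Edm.String)", "searchable": True}
# Output field mapping from the enriched document to the index field (TIP 4).
OCR_MAPPING = {"sourceFieldName": f"{IMAGES_PATH}/{OCR_FIELD}", "targetFieldName": OCR_FIELD}
# Assumption: the indexer needs this configuration to extract images at all.
IMAGE_CONFIG = {"dataToExtract": "contentAndMetadata", "imageAction": "generateNormalizedImages"}

Definition = Dict[str, Any]


def add_ocr(skillset: Definition, index: Definition, indexer: Definition) -> Tuple[Definition, Definition, Definition]:
    """Return copies of the Lab 1 definitions with the OCR pieces appended at the end."""
    skillset, index, indexer = (copy.deepcopy(d) for d in (skillset, index, indexer))
    skillset.setdefault("skills", []).append(copy.deepcopy(OCR_SKILL))
    index.setdefault("fields", []).append(dict(OCR_INDEX_FIELD))
    indexer.setdefault("outputFieldMappings", []).append(dict(OCR_MAPPING))
    indexer.setdefault("parameters", {}).setdefault("configuration", {}).update(IMAGE_CONFIG)
    return skillset, index, indexer
```

To use it, load the three JSON bodies you saved in Step 4 with `json.load`, pass them to `add_ocr`, and send the results with `json.dumps` as the bodies of the create requests from Lab 1.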
If you could not make it, here is the challenge solution. You just need to follow the steps.