# Document Intelligence with Azure Cognitive Search Skillset

AI enrichment is the application of machine learning models over content that isn't full text searchable in its raw form. Through enrichment, analysis and inference are used to create searchable content and structure where none previously existed.
In this notebook, we use an [AI enrichment pipeline with Azure Cognitive Search](https://learn.microsoft.com/en-us/azure/search/cognitive-search-concept-intro) to extract information and create new content from various formats of document files.
After we extract the enriched documents, we define a Feathr feature, materialize it, and use it in NLP (Natural Language Processing) scenarios such as [question-answering](https://huggingface.co/docs/transformers/tasks/question_answering) and [summarization](https://huggingface.co/tasks/summarization).

The overall workflow is:
1. Deploy Azure Cognitive Search and Feathr resources.
2. Prepare mixed media sample documents.
3. Extract text by using an Azure Cognitive Search skillset and store the results.
4. Define a Feathr feature with the enriched documents, register the feature, and materialize it.
5. Use the feature for NLP scenarios.

## 1. Deployment

### Deploy Azure Cognitive Search service
Please follow [this link](https://learn.microsoft.com/en-us/azure/search/search-create-service-portal) to create the search service from the Azure portal.

### Deploy Feathr resources
Please follow [this link](https://feathr-ai.github.io/feathr/how-to-guides/azure-deployment-arm.html) to deploy necessary resources via ARM template and grant permissions to access them.

## 2. Prepare Dataset
In this notebook, we use mixed media documents from an [Azure sample repository](https://github.com/Azure-Samples/azure-search-knowledge-mining).

1. Download the document files from [https://github.com/Azure-Samples/azure-search-knowledge-mining/tree/main/sample_documents](https://github.com/Azure-Samples/azure-search-knowledge-mining/tree/main/sample_documents).
2. Create a container called `cogsearch` in the storage account of the resource group where you deployed Cognitive Search.
3. Upload the document files to the container 
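
Step 3 can also be scripted instead of done in the portal. The following is a minimal sketch, assuming the sample documents were downloaded to a local folder. `collect_documents` and `upload_documents` are hypothetical helper names, and `container_client` is expected to behave like an `azure.storage.blob.ContainerClient` (it only needs an `upload_blob` method).

```python
from pathlib import Path


def collect_documents(folder: str) -> list:
    """Return all regular files under `folder`, sorted by path."""
    return sorted(p for p in Path(folder).rglob("*") if p.is_file())


def upload_documents(container_client, folder: str) -> list:
    """Upload every file under `folder`, using the file name as the blob name.

    `container_client` is assumed to be an azure.storage.blob.ContainerClient.
    """
    uploaded = []
    for path in collect_documents(folder):
        with open(path, "rb") as data:
            # overwrite=True replaces a blob that already has the same name
            container_client.upload_blob(name=path.name, data=data, overwrite=True)
        uploaded.append(path.name)
    return uploaded
```

A client for the `cogsearch` container can be obtained with `ContainerClient.from_connection_string`, as done later in section 3.1.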

## 3. AI Enrichment by using Azure Cognitive Search Skillset

Before moving on to the following steps, let's go over some of the important concepts in [Azure Cognitive Search](https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search):
* **Azure Cognitive Search** is a cloud search service that gives developers infrastructure, APIs, and tools for building a rich search experience over private, heterogeneous content in web, mobile, and enterprise applications.
* **Indexing** is an intake process that loads content into your search service and makes it searchable. AI enrichment through cognitive skills is an extension of indexing. If your content needs image or language analysis before it can be indexed, AI enrichment can extract text embedded in application files, translate text, and also infer text and structure from non-text files by analyzing the content.
* **Skillset** is a reusable resource in Azure Cognitive Search that's attached to an indexer. It contains one or more skills that call built-in AI or external custom processing over documents retrieved from an external data source.
* **Knowledge store** is a data sink created by a Cognitive Search enrichment pipeline that stores AI-enriched content in tables and blob containers in Azure Storage for independent analysis or downstream processing in non-search scenarios like knowledge mining.

In this notebook, we use two built-in AI skills:
* Text translation, which also detects the source language
* Optical Character Recognition (OCR), which recognizes printed and handwritten text in binary files

Two utility skills, Text Merge and Shaper, combine the OCR output with the document text and shape the result for the knowledge store.

For more details about the built-in skills, see [here](https://learn.microsoft.com/en-us/azure/search/cognitive-search-predefined-skills).

![Cognitive Search](../images/cognitive-search-enrichment-architecture.png)


### 3.1 Set parameters for connecting to the resources

In [None]:
import json
from pprint import pprint
import requests
import time

In [None]:
# TODO fill the values:
SERVICE_NAME = None  # Search Service name and API key
API_KEY = None
STORAGE_NAME = None  # Storage account for Azure Cognitive Search datasource and knowledge store
STORAGE_KEY = None

CONTAINER_NAME = "cogsearch"
API_VERSION = "2021-04-30-Preview"

In [None]:
# Storage account connection string
storage_connection_str = f"DefaultEndpointsProtocol=https;AccountName={STORAGE_NAME};AccountKey={STORAGE_KEY};EndpointSuffix=core.windows.net"

To verify access to the storage account with the `STORAGE_NAME` and `STORAGE_KEY` set in the previous cell, let's list the blob names in the `cogsearch` container. Make sure you have created the container and uploaded the documents as described in **2. Prepare Dataset**.

In [None]:
from azure.storage.blob import ContainerClient

container_client = ContainerClient.from_connection_string(
    storage_connection_str,
    container_name=CONTAINER_NAME,
)

# Get name of the blobs
output = [blob.name for blob in container_client.list_blobs()]
output

### 3.2 Create DataSource, Index, and Indexer

In Azure Cognitive Search, AI enrichment processing occurs during indexing (or data ingestion). The pipeline consists of:
* Data source
* Skillset
* Index
* Indexer

In this notebook, we use Search REST APIs to create them. First, let's define some helper functions.

In [None]:
# Cognitive search endpoint
endpoint = f"https://{SERVICE_NAME}.search.windows.net"

# Cognitive search REST API
headers = {
    "Content-Type": "application/json",
    "api-key": API_KEY,
}

# Cognitive search resource names
datasource_name = f"{CONTAINER_NAME}-ds"
skillset_name = f"{CONTAINER_NAME}-ss"
index_name = f"{CONTAINER_NAME}-idx"
indexer_name = f"{CONTAINER_NAME}-idxr"

In [None]:
def construct_url(
    endpoint: str,
    resource_type: str,
    resource_name: str = None,
    action: str = None,
    api_version: str = API_VERSION,
):
    """Construct url for REST API calls."""
    components = [endpoint, resource_type]
    
    if resource_name:
        components.append(resource_name)
    if action:
        components.append(action)
    
    return "/".join(components) + f"?api-version={api_version}"


def create_or_update_resource(resource_type: str, resource_name: str, resource_def: dict):
    """Create or update Azure Cognitive Search resources."""
    r = requests.put(
        construct_url(endpoint, resource_type, resource_name, None, API_VERSION),
        data=json.dumps(resource_def),
        headers=headers,
    )

    # 201 means the resource was created; 204 means an existing resource was updated.
    if r.status_code == 201:
        print(f"Successfully created {resource_type} {resource_name}")
    elif r.status_code == 204:
        print(f"Successfully updated {resource_type} {resource_name}")
    else:
        print(r.json()["error"]["message"])
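
To make the URL shapes concrete, the sketch below repeats `construct_url` in a self-contained form and prints the URLs it yields for a hypothetical service named `my-search`.

```python
API_VERSION = "2021-04-30-Preview"


def construct_url(endpoint, resource_type, resource_name=None, action=None, api_version=API_VERSION):
    """Same logic as the helper above, repeated so that this sketch runs on its own."""
    components = [endpoint, resource_type]
    if resource_name:
        components.append(resource_name)
    if action:
        components.append(action)
    return "/".join(components) + f"?api-version={api_version}"


# "my-search" is a placeholder service name, not a real resource
demo_endpoint = "https://my-search.search.windows.net"

collection_url = construct_url(demo_endpoint, "indexers")
resource_url = construct_url(demo_endpoint, "indexers", "cogsearch-idxr")
status_url = construct_url(demo_endpoint, "indexers", "cogsearch-idxr", action="status")

for url in (collection_url, resource_url, status_url):
    print(url)
```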


In [None]:
# Clean up previously created Cognitive Search resources, if they exist
for resource, resource_name in {
    "datasources": datasource_name,
    "skillsets": skillset_name,
    "indexes": index_name,
    "indexers": indexer_name,
}.items():
    r = requests.delete(
        construct_url(endpoint, resource, resource_name),
        headers=headers,
    )
    if r.status_code == 204:
        print(f"{resource} {resource_name} is successfully deleted.")
    else:
        print(r.json()["error"]["message"])

#### Create a DataSource

In [None]:
# Create a data source
datasource_def = {
    "name": datasource_name,
    "description": "Feathr with Cognitive Search example",
    "type": "azureblob",
    "credentials": {
        "connectionString": storage_connection_str,
    },
    "container": {
        "name": CONTAINER_NAME,
    },
}

create_or_update_resource("datasources", datasource_name, datasource_def)

#### Create a Skillset

Each skill executes on the content of the document. During processing, Azure Cognitive Search cracks each document to read content from different file formats. Found text originating in the source file is placed into a generated `content` field, one for each document.

For more details, see [Cognitive Search predefined skills](https://learn.microsoft.com/en-us/azure/search/cognitive-search-predefined-skills).

In [None]:
# Create a skillset
skillset_def = {
    "name": skillset_name,
    "description": "Apply OCR, ",
    "skills": [
        {
            # Recognizes text and numbers in image files.
            "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
            "context": "/document/normalized_images/*",
            "defaultLanguageCode": "en",
            "detectOrientation": True,
            "inputs": [
                {
                    "name": "image",
                    "source": "/document/normalized_images/*"
                }
            ],
            "outputs": [
                {
                    "name": "text"
                }
            ]
        },
        {
            # Images and text are separated during the document cracking phase. The merge skill recombines them.
            "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
            "context": "/document",
            "insertPreTag": " ",
            "insertPostTag": " ",
            "inputs": [
                {
                    "name":"text", 
                    "source": "/document/content"
                },
                {
                    "name": "itemsToInsert", 
                    "source": "/document/normalized_images/*/text"
                },
                {
                    "name":"offsets", 
                    "source": "/document/normalized_images/*/contentOffset" 
                }
            ],
            "outputs": [
                {
                    "name": "mergedText", 
                    "targetName" : "merged_text"
                }
            ]
        },
        {
            # Translates different language text to English.
            "@odata.type": "#Microsoft.Skills.Text.TranslationSkill",
            "defaultToLanguageCode": "en",
            "suggestedFrom": "es",
            "context": "/document",
            "inputs": [
                {
                    "name": "text",
                    "source": "/document/merged_text"
                }
            ],
            "outputs": [
                {
                    "name": "translatedText",
                    "targetName": "translated_text"
                },
                {
                    "name": "translatedFromLanguageCode",
                    "targetName": "translated_from_language_code"
                },
                {
                    "name": "translatedToLanguageCode",
                    "targetName": "translated_to_language_code"
                }
            ]
        },
        # Shaper skill to determine the schema and contents of the projection to Knowledge store.
        {
            "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
            "context": "/document",
            "inputs": [
                {
                    # metadata_storage_name is the document file (blob) name
                    "name": "metadata_storage_name",
                    "source": "/document/metadata_storage_name"
                },
                {
                    "name": "text",
                    "source": "/document/translated_text"
                }
            ],
            "outputs": [
                {
                    "name": "output",
                    "targetName": "shaper_output"
                },
            ],
        },
    ],
    # Free skillset execution quota is 20 documents. To process more, set Azure Cognitive Service here.
    "cognitiveServices": None,
    # A knowledge store makes enriched content available in Azure Storage for downstream apps and workloads.
    "knowledgeStore": {
        "storageConnectionString": storage_connection_str,
        "projections": [
            {
                # Store enrichment results into blobs:
                "objects": [{"storageContainer": f"{CONTAINER_NAME}output", "source": "/document/shaper_output"}],
            },
        ],
    },
}

create_or_update_resource("skillsets", skillset_name, skillset_def)
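
Each skill writes its outputs under its `context`, using `targetName` when one is given and the output `name` otherwise. That is why the merge skill can read `/document/normalized_images/*/text` and the translation skill can read `/document/merged_text`. A minimal sketch of this rule follows, with a hypothetical helper `output_paths` and trimmed copies of two skills.

```python
def output_paths(skill: dict) -> list:
    """Return the enrichment-tree paths that a skill definition writes to."""
    # A skill without an explicit context is assumed to run at /document
    context = skill.get("context", "/document")
    return [
        f"{context}/{output.get('targetName', output['name'])}"
        for output in skill["outputs"]
    ]


ocr_skill = {
    "context": "/document/normalized_images/*",
    "outputs": [{"name": "text"}],
}
merge_skill = {
    "context": "/document",
    "outputs": [{"name": "mergedText", "targetName": "merged_text"}],
}

print(output_paths(ocr_skill))    # ['/document/normalized_images/*/text']
print(output_paths(merge_skill))  # ['/document/merged_text']
```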

#### Create an Index

Provide the schema of the search index. A fields collection requires one field to be designated as the key. For blob content, this field is often the `metadata_storage_path` that uniquely identifies each blob in the container.

In this schema, the "text" field receives OCR output, "raw_content" receives merged output, and "content" receives translation output.

In [None]:
# Create an index
index_def = {
    "name": index_name,
    "fields": [
        {
            "name": "text",
            "type": "Collection(Edm.String)",
            "searchable": True,
            "sortable": False,
            "filterable": True,
            "facetable": False
        },
        {
            "name": "content",
            "type": "Edm.String",
            "searchable": True,
            "sortable": False,
            "filterable": False,
            "facetable": False
        },
        {
            "name": "raw_content",
            "type": "Edm.String",
            "searchable": False,
            "sortable": False,
            "filterable": False,
            "facetable": False
        },
        {
            "name": "metadata_storage_path",
            "type": "Edm.String",
            "key": True,
            "searchable": True,
            "sortable": False,
            "filterable": False,
            "facetable": False
        },
        {
            "name": "metadata_storage_name",
            "type": "Edm.String",
            "searchable": True,
            "sortable": False,
            "filterable": True,
            "facetable": False
        }
    ]
}

create_or_update_resource("indexes", index_name, index_def)
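
Two mistakes are easy to make with the schema: forgetting the key field, or mapping skill output to a field that the index does not define. The sketch below checks for both before any request is sent. `check_index` is a hypothetical helper, and the sample definition is a trimmed copy of the one in this notebook.

```python
def check_index(index_def: dict, target_fields: list) -> list:
    """Return a list of problems found in an index definition (empty if none)."""
    problems = []
    fields = index_def.get("fields", [])
    names = {field["name"] for field in fields}

    # Exactly one field must be designated as the document key
    keys = [field["name"] for field in fields if field.get("key")]
    if len(keys) != 1:
        problems.append(f"expected exactly one key field, found {len(keys)}")

    # Every mapping target must exist as a field in the index
    for target in target_fields:
        if target not in names:
            problems.append(f"mapping target '{target}' is not defined in the index")
    return problems


sample_index = {
    "fields": [
        {"name": "metadata_storage_path", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String"},
    ]
}

print(check_index(sample_index, ["content"]))      # []
print(check_index(sample_index, ["raw_content"]))  # one problem reported
```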

#### Create and Run an Indexer

Creating an indexer invokes the pipeline.

In [None]:
indexer_def = {
    "name": indexer_name,
    "dataSourceName": datasource_name,
    "targetIndexName": index_name,
    "skillsetName": skillset_name,
    "cache": {
        "enableReprocessing": True,
        "storageConnectionString": storage_connection_str,
    },
    # fieldMappings are processed before the skillset, sending content from the data source to target fields in an index.
    "fieldMappings": [
        {
            "sourceFieldName": "metadata_storage_path",
            "targetFieldName": "metadata_storage_path",
            "mappingFunction": {"name": "base64Encode"}
        },
        {
            "sourceFieldName": "metadata_storage_name",
            "targetFieldName": "metadata_storage_name"
        }
    ],
    # outputFieldMappings are for fields created by skills, after skillset execution.
    # The references to sourceFieldName in outputFieldMappings don't exist until document cracking or enrichment creates them.
    # The targetFieldName is a field in an index, defined in the index schema.
    "outputFieldMappings": [
        {
            "sourceFieldName": "/document/merged_text",
            "targetFieldName": "raw_content"
        },
        {
            "sourceFieldName": "/document/translated_text",
            "targetFieldName": "content"
        },
        {
            "sourceFieldName": "/document/normalized_images/*/text",
            "targetFieldName": "text"
        }
    ],
    "parameters":
    {
        "batchSize": 1,
        "maxFailedItems": -1,  # -1 to ignore errors during data import
        "maxFailedItemsPerBatch": -1,
        "configuration": 
        {
            "dataToExtract": "contentAndMetadata",  # automatically extract the content from different file formats as well as metadata related to each file
            "imageAction": "generateNormalizedImages"  # combined with the OCR Skill and Text Merge Skill, tells the indexer to extract text from the images
        }
    }   
}

create_or_update_resource("indexers", indexer_name, indexer_def)
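
Because of the `base64Encode` mapping, the `metadata_storage_path` values returned by the index are encoded keys rather than readable URLs. The sketch below reproduces the encoding locally so that a key can be turned back into a blob URL. It assumes the mapping function runs with its default URL-token behavior (URL-safe Base64 with the padding replaced by a digit that gives the padding count), so verify this against your service before relying on it. The function names and the sample URL are hypothetical.

```python
import base64


def url_token_encode(text: str) -> str:
    """Encode text as URL-safe Base64 and replace the padding with its count."""
    raw = base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")
    stripped = raw.rstrip("=")
    padding_count = len(raw) - len(stripped)
    return stripped + str(padding_count)


def url_token_decode(token: str) -> str:
    """Reverse url_token_encode: restore the padding, then decode."""
    padding_count = int(token[-1])
    body = token[:-1] + "=" * padding_count
    return base64.urlsafe_b64decode(body).decode("utf-8")


# Made-up blob URL for illustration only
sample_url = "https://example.blob.core.windows.net/cogsearch/report.pdf"
key = url_token_encode(sample_url)
print(key)
print(url_token_decode(key))
```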

The previous cell creates the indexer and runs it. Now we wait until the indexer run finishes and then check the result.

In [None]:
while True:
    r = requests.get(
        construct_url(endpoint, "indexers", indexer_name, action="status"),
        headers=headers,
    )
    
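    try:
        if r.json()["lastResult"]["endTime"] is not None:
            break
    except (KeyError, TypeError, ValueError):
        # The status is not available yet (e.g. no run has been recorded); keep polling
        pass

    time.sleep(3)

print(json.dumps(r.json(), indent=1))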
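
The loop above polls until `lastResult.endTime` is set and never gives up. A variant with a timeout is sketched below. `wait_for_indexer` is a hypothetical helper that takes any zero-argument callable returning the status as a dict, so it can wrap the `requests.get(...).json()` call used above.

```python
import time


def wait_for_indexer(fetch_status, timeout_sec: float = 600, poll_interval_sec: float = 3) -> dict:
    """Poll `fetch_status()` until the last indexer run has an end time.

    Raises TimeoutError if no finished run is seen within `timeout_sec`.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        status = fetch_status()
        # lastResult may be missing or None before the first run is recorded
        last_result = status.get("lastResult") or {}
        if last_result.get("endTime") is not None:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Indexer did not finish within {timeout_sec} seconds")
        time.sleep(poll_interval_sec)
```

Injecting the callable keeps the waiting logic independent of the HTTP client, which also makes it easy to try out with canned responses.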

#### Test Search

With a REST API call, we can verify the output of the Cognitive Search skills.

In [None]:
r = requests.post(
    construct_url(
        endpoint=endpoint,
        resource_type="indexes",
        resource_name=index_name + "/docs",
        action="search",
    ),
    data=json.dumps({
        "search": "*",
        # Another sample document to try: 'Mesh_for_Microsoft_Teams.docx'
        "filter": "metadata_storage_name eq 'Cognitive Services and Bots  (spanish).pdf'",
        "select": "raw_content, content",
    }),
    headers=headers,
)

result = r.json()['value'][0]
print(
    f"[Content]{result['content'][:100]}",
    f"[Raw content]{result['raw_content'][:100]}",
    sep="\n=========\n",
)
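
The `filter` value is an OData expression, so a single quote inside a file name has to be doubled. A small sketch of building the request body for an arbitrary document name follows. `build_search_body` is a hypothetical helper and the file name is made up.

```python
import json


def build_search_body(document_name: str, select: str = "raw_content, content") -> str:
    """Build the JSON body of a search request that filters on one document name."""
    # OData string literals escape a single quote by doubling it
    escaped = document_name.replace("'", "''")
    return json.dumps({
        "search": "*",
        "filter": f"metadata_storage_name eq '{escaped}'",
        "select": select,
    })


print(build_search_body("O'Reilly report.pdf"))
```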


## 4. Feathr Feature Store

In [None]:
import os
from pathlib import Path
import shutil

from pyspark.sql import DataFrame

import feathr
from feathr import (
    FeathrClient,
    # Feature data types
    STRING, ValueType,
    # Feature data sources
    HdfsSource,
    # Feature key
    TypedKey,
    # Feature types and anchor
    Feature, FeatureAnchor,
    # Materialization
    MaterializationSettings, RedisSink,
    # Offline feature computation
    FeatureQuery, ObservationSettings,
)
from feathr.spark_provider.feathr_configurations import SparkExecutionConfiguration
from feathr.utils.config import generate_config
from feathr.utils.job_utils import get_result_df
from feathr.utils.platform import is_databricks

print(f"Feathr version: {feathr.__version__}")

In [None]:
RESOURCE_PREFIX = None  # TODO fill the value used to deploy the Feathr resources via ARM template
PROJECT_NAME = "cogsearch"

# Currently supported: 'azure_synapse', 'databricks', and 'local'
SPARK_CLUSTER = "local"

# TODO fill values to use databricks cluster:
DATABRICKS_CLUSTER_ID = None             # Set Databricks cluster id to use an existing cluster
if is_databricks():
    # If this notebook is running on Databricks, its context can be used to retrieve token and instance URL
    ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
    DATABRICKS_WORKSPACE_TOKEN_VALUE = ctx.apiToken().get()
    SPARK_CONFIG__DATABRICKS__WORKSPACE_INSTANCE_URL = f"https://{ctx.tags().get('browserHostName').get()}"
else:
    DATABRICKS_WORKSPACE_TOKEN_VALUE = None                  # Set Databricks workspace token to use databricks
    SPARK_CONFIG__DATABRICKS__WORKSPACE_INSTANCE_URL = None  # Set Databricks workspace url to use databricks

# TODO fill values to use Azure Synapse cluster:
AZURE_SYNAPSE_SPARK_POOL = None  # Set Azure Synapse Spark pool name
AZURE_SYNAPSE_URL = None         # Set Azure Synapse workspace url to use Azure Synapse
ADLS_KEY = None                  # Set Azure Data Lake Storage key to use Azure Synapse

# If set True, use an interactive browser authentication to get the redis password.
USE_CLI_AUTH = False

In [None]:
# Use dbfs if the notebook is running on Databricks
if is_databricks():
    WORKING_DIR = f"/dbfs/{PROJECT_NAME}"
else:
    WORKING_DIR = PROJECT_NAME

In [None]:
# Get an authentication credential to access Azure resources and register features
if USE_CLI_AUTH:
    # Use AZ CLI interactive browser authentication
    !az login --use-device-code
    from azure.identity import AzureCliCredential
    credential = AzureCliCredential(additionally_allowed_tenants=['*'],)
elif "AZURE_TENANT_ID" in os.environ and "AZURE_CLIENT_ID" in os.environ and "AZURE_CLIENT_SECRET" in os.environ:
    # Use Environment variable secret
    from azure.identity import EnvironmentCredential
    credential = EnvironmentCredential()
else:
    # Try to use the default credential
    from azure.identity import DefaultAzureCredential
    credential = DefaultAzureCredential(
        exclude_interactive_browser_credential=False,
        additionally_allowed_tenants=['*'],
    )

In [None]:
# Redis password
if "REDIS_PASSWORD" not in os.environ:
    from azure.keyvault.secrets import SecretClient
    vault_url = f"https://{RESOURCE_PREFIX}kv.vault.azure.net"
    secret_client = SecretClient(vault_url=vault_url, credential=credential)
    retrieved_secret = secret_client.get_secret('FEATHR-ONLINE-STORE-CONN').value
    os.environ['REDIS_PASSWORD'] = retrieved_secret.split(",")[1].split("password=", 1)[1]
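
The last line assumes the password is always the second comma-separated item of the secret. A slightly more defensive parse is sketched below, assuming the secret has the form `host:port,password=...,ssl=True`. `parse_redis_password` is a hypothetical helper and the sample value is made up.

```python
def parse_redis_password(connection_string: str) -> str:
    """Extract the value of the `password=` item from a Redis connection string."""
    for item in connection_string.split(","):
        # partition splits on the first "=", so "=" inside the password is kept
        key, sep, value = item.strip().partition("=")
        if sep and key == "password":
            return value
    raise ValueError("No 'password=' item found in the connection string")


# Made-up value for illustration only
sample = "example.redis.cache.windows.net:6380,password=abc123=,ssl=True"
print(parse_redis_password(sample))
```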

### 4.1 Initialize Feathr client

In [None]:
config_path = generate_config(
    resource_prefix=RESOURCE_PREFIX,
    project_name=PROJECT_NAME,
    spark_config__spark_cluster=SPARK_CLUSTER,
    spark_config__azure_synapse__dev_url=AZURE_SYNAPSE_URL,
    spark_config__azure_synapse__pool_name=AZURE_SYNAPSE_SPARK_POOL,
    spark_config__databricks__workspace_instance_url=SPARK_CONFIG__DATABRICKS__WORKSPACE_INSTANCE_URL,
    databricks_cluster_id=DATABRICKS_CLUSTER_ID,
)

with open(config_path, 'r') as f: 
    print(f.read())

In [None]:
client = FeathrClient(config_path=config_path, credential=credential)

### 4.2 Prepare Dataset

To generate the data source for the features, let's get the outputs of the Cognitive Search AI skillset from the knowledge store.
Because we defined the knowledge store projection as objects, the outputs are stored as JSON records.

In [None]:
if "spark" in locals() or "spark" in globals():
    spark.conf.set(f"fs.azure.account.key.{STORAGE_NAME}.blob.core.windows.net", STORAGE_KEY)
else:
    from pyspark.sql import SparkSession
    spark = (
        SparkSession
        .builder
        .appName("feathr")
        .config(
            "spark.jars.packages",
            ",".join([
                "org.apache.spark:spark-avro_2.12:3.3.0",
                "io.delta:delta-core_2.12:2.1.1",
                "org.apache.hadoop:hadoop-azure:3.3.0",
                "com.microsoft.azure:azure-storage:8.6.6",
            ])
        )
        .config(f"fs.azure.account.key.{STORAGE_NAME}.blob.core.windows.net", STORAGE_KEY)
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .config("spark.ui.port", "8080")  # Set ui port other than the default one (4040) so that feathr spark job doesn't fail. 
        .getOrCreate()
    )
    

In [None]:

# Read all the json records of the AI enrichment output
df = spark.read.option("recursiveFileLookup", "true").json(f"wasbs://{CONTAINER_NAME}output@{STORAGE_NAME}.blob.core.windows.net/")
df.limit(5).toPandas()

In [None]:
data_file_path = f"{WORKING_DIR}/documents.parquet"

In [None]:
if Path(data_file_path).exists():
    print(f"Remove existing data file: {data_file_path}")
    shutil.rmtree(data_file_path)
print(f"Write data file to: {data_file_path}")
df.write.parquet(data_file_path)

In [None]:
# Upload files to cloud if needed
if client.spark_runtime == "local":
    # In local mode, we can use the same data path as the source.
    data_source_path = data_file_path
elif client.spark_runtime == "databricks" and is_databricks():
    # If the notebook is running on databricks, we can use the same data path as the source.
    data_source_path = data_file_path.replace("/dbfs", "dbfs:")
else:
    # Otherwise, upload the local file to the cloud storage (either dbfs or adls).
    data_source_path = client.feathr_spark_launcher.upload_or_get_cloud_path(data_file_path)    

### 4.3 Define Features with UDF (User Defined Function)

We preprocess the texts so that the later NLP models can consume them.

In [None]:
def preprocessing(df: DataFrame) -> DataFrame:
    import pyspark.sql.functions as F
    
    # Any kind of text preprocessing can go here
    return df.withColumn("text", F.regexp_replace("text", "\n", " "))
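
The UDF above only replaces newlines. If the extracted text also contains tabs or runs of spaces, a broader cleanup may help. The sketch below shows the idea in plain Python so it can be tried without a Spark session. `normalize_whitespace` is a hypothetical name, not part of the notebook.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (newlines, tabs, spaces) into one space."""
    return re.sub(r"\s+", " ", text).strip()


print(normalize_whitespace("Annual\nReport\t 2015\n\n"))  # Annual Report 2015
```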


Now, define features using the UDF.

In [None]:
hdfs_source = HdfsSource(
    name="ai_enrichment",
    path=data_source_path,
    preprocessing=preprocessing,
)

# key is required for the features from non-INPUT_CONTEXT source
key = TypedKey(
    key_column="metadata_storage_name",
    key_column_type=ValueType.STRING,
    description="Document name",
    full_name=f"{PROJECT_NAME}.doc_name",
)

features = [
    Feature(
        name="f_text",
        key=key,
        feature_type=STRING,
        transform="text",
    ),
]

feature_anchor = FeatureAnchor(
    name="data_feature_anchor",
    source=hdfs_source,
    features=features,
)

In [None]:
client.build_features(
    anchor_list=[feature_anchor],
)

In [None]:
query = FeatureQuery(
    feature_list=["f_text"],
    key=key,
)

settings = ObservationSettings(
    observation_path=data_source_path,
)

client.get_offline_features(
    observation_settings=settings,
    feature_query=query,
    # For more details, see https://feathr-ai.github.io/feathr/how-to-guides/feathr-job-configuration.html
    execution_configurations=SparkExecutionConfiguration({
        "spark.feathr.outputFormat": "parquet",
    }),
    output_path="./text_features.parquet",
)

client.wait_job_to_finish(timeout_sec=5000)

In [None]:
get_result_df(client, data_format="parquet").head(5)

### 4.4 Register Features

In [None]:
try:
    client.register_features()
except Exception as e:
    print(e)  
print(client.list_registered_features(project_name=PROJECT_NAME))
# You can get the actual features too by calling client.get_features_from_registry(PROJECT_NAME)

### 4.5 Materialize Features to Redis

In [None]:
FEATURE_TABLE_NAME = "text_features"

In [None]:
redis_sink = RedisSink(table_name=FEATURE_TABLE_NAME)

settings = MaterializationSettings(
    name=FEATURE_TABLE_NAME + ".job",  # job name
    sinks=[redis_sink],
    feature_names=["f_text"],
)

client.materialize_features(
    settings=settings,
    execution_configurations={"spark.feathr.outputFormat": "parquet"},
    allow_materialize_non_agg_feature=True,
)

client.wait_job_to_finish(timeout_sec=5000)

In [None]:
# Get the feature values for a single key. To look up several keys at once, use client.multi_get_online_features instead
materialized_feature_values = client.get_online_features(
    feature_table=FEATURE_TABLE_NAME,
    key="NYSE_LNKD_2015.PDF",
    feature_names=["f_text"],
)
materialized_feature_values[0][:1000]

## 5. NLP Scenarios

We use the [Hugging Face Transformers package](https://huggingface.co/docs/transformers/installation) to demonstrate simple NLP scenarios with the materialized features. Specifically, we use the [summarization and question-answering pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines), which come with pre-trained models, for simplicity's sake.

First, install the *Transformers* and *PyTorch* packages:

In [None]:
%pip install -U transformers torch --extra-index-url https://download.pytorch.org/whl/cu116

In [None]:
from transformers import pipeline

Now, let's use the question-answering pipeline from the Transformers package.

In [None]:
qa_model = pipeline("question-answering")
qa_model(
    question="what is the LinkedIn's financial goal",
    context=materialized_feature_values[0],
)

Next, let's use the summarization pipeline:

In [None]:
summarizer = pipeline("summarization")

# The pre-trained model accepts at most 1024 tokens as input, so we set truncation=True
summarizer(materialized_feature_values[0], truncation=True)
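
With `truncation=True`, everything past the model's input limit is ignored. One common workaround is to split the text into chunks, summarize each chunk, and join the results. The sketch below only does the splitting, by word count, which is a rough stand-in for the token count. `chunk_words` and the default chunk size of 400 are assumptions, not values taken from the model.

```python
def chunk_words(text: str, max_words: int = 400) -> list:
    """Split `text` into consecutive chunks of at most `max_words` words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


chunks = chunk_words("one two three four five", max_words=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Each chunk could then be passed to `summarizer(chunk, truncation=True)` and the partial summaries concatenated.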


## 6. Advanced Topics

In this notebook, we used Azure Cognitive Search skills to extract and translate text from documents in various formats, and used the results with the Feathr Feature Store for NLP scenarios.

Here is a list of advanced topics that this notebook does not cover:

* [Deploy a model to AKS (Azure Kubernetes Service) via AzureML (Azure Machine Learning) SDK](https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-deploy-azure-kubernetes-service?tabs=python)
* [Use Python and AI to generate searchable content from Azure blobs](https://learn.microsoft.com/en-us/azure/search/cognitive-search-tutorial-blob-python)
* [Enrich cognitive search index with custom classes](https://learn.microsoft.com/en-us/azure/cognitive-services/language-service/custom-text-classification/tutorials/cognitive-search?tabs=multi-classification%2CLanguage-studio)
* [Build and deploy a form recognizer custom skill](https://learn.microsoft.com/en-us/training/modules/build-form-recognizer-custom-skill-for-azure-cognitive-search/4-exercise-build-deploy)