# Azure Document Translation

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/erikhopf/CognitiveServicesPyNotebooks/HEAD?filepath=Document_Translator.ipynb) 

Document Translation is a feature of the Microsoft Translator service that lets you batch translate documents, PowerPoint presentations, and other text-based files into one or many languages. In this notebook, we'll focus on submitting a translation request, checking the job status, and retrieving the translation outputs. We cover translating a single document and a batch of documents. You don't need to do both, but if you do, you'll see a bit of repetition.

<!-- Beyond this sample, you can use Document Translation in conjunction with services like Batch Transcription (Speech to Text) or Sentiment Analysis (Text Analytics).  -->

<!-- For example, a company may have audio data that it needs to transcribe before it can analyze the data for trends. However, your call center may serve areas that speak a different language than your analysts. Following transcription, you can submit the artifacts for translation in batches, and provide them to your analysts.  -->

Key benefits of Document Translation:

* Translate one document to one or many languages
* Translate many documents into one or many languages

At this time, Document Translation relies on Azure Blob Storage. This means that you'll need an Azure Storage account and two blob containers. 

* A container to read from
* A container to write to

## Supported input formats

Document Translation currently supports these file types and file extensions. For the latest list, please see [What is Document Translation (preview)](https://docs.microsoft.com/en-us/azure/cognitive-services/translator/document-translation/overview).
  
  
| File type | File extension | Description|
|-----------|----------------|------------|
|Adobe PDF|.pdf|Adobe Acrobat portable document format|
|HTML|.html|Hyper Text Markup Language.|
|Localization Interchange File Format|.xlf, .xliff| A parallel document format, export of Translation Memory systems. The languages used are defined inside the file.|
|Microsoft Excel|.xlsx|A spreadsheet file for data analysis and documentation.|
|Microsoft Outlook|.msg|An email message created or saved within Microsoft Outlook.|
|Microsoft PowerPoint|.pptx| A presentation file used to display content in a slideshow format.|
|Microsoft Word|.docx| A text document file.|
|Tab Separated Values/TAB|.tsv/.tab| A tab-delimited raw-data file used by spreadsheet programs.|
|Text|.txt| An unformatted text document.|
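
If you plan to upload your own files, it can help to check their extensions before you submit a job. Here's a minimal sketch that does this. The `SUPPORTED_EXTENSIONS` set is copied from the table above, and `is_supported` is a hypothetical helper name, not part of the service or any SDK. The service's list may have changed since this notebook was written.

```python
from pathlib import Path

# Extensions copied from the table above. This is a local convenience
# list, not something returned by the service.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".html", ".xlf", ".xliff", ".xlsx", ".msg",
    ".pptx", ".docx", ".tsv", ".tab", ".txt",
}

def is_supported(filename: str) -> bool:
    """Return True if the file's extension appears in the table above."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS

print(is_supported("dubliners_sisters_snippet.txt"))
print(is_supported("notes.md"))
```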

## STOP! THIS SERVICE WILL INCUR COSTS 

This service is **not** free. Please be aware that any calls to the service will incur costs. 

Now that we've got that out of the way: if you're a new user, you can sign up for a free Azure account that includes $200 of credit to use in the first 30 days after your account is created. For more information about account credits, [click here](https://azure.microsoft.com/en-us/free/cognitive-services/).

For detailed pricing information, [click here](https://azure.microsoft.com/pricing/details/cognitive-services/translator/).

## Before you get started

You'll need:

* An [Azure subscription](https://azure.microsoft.com/en-us/free/cognitive-services/)
* An [Azure Blob storage account with two containers](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-portal#create-a-container) -- I'll walk you through the setup below, since I made some mistakes on my first run. 
  * The read container must have **READ** and **LIST** permission. 
  * The write container must have **WRITE** and **LIST** permission.
* An [Azure Translator resource](https://docs.microsoft.com/en-us/azure/cognitive-services/translator/translator-how-to-signup) in the S1 pricing tier. This notebook **won't** work with a *Free (F0)* key.
* The `requests` library installed in your environment. We strongly recommend that you run all of these notebooks in a virtual environment (virtualenv, venv, pyenv, pipenv, etc.). Run this command from your terminal/command line: `pip install requests`.

## Download text files for Translation

For this notebook, I've added two text snippets downloaded from [Project Gutenberg](https://www.gutenberg.org/). These are purposefully small to keep testing costs down. Both source files are in English and `.txt` files. 

* [Download text files from GitHub](https://github.com/erikhopf/CognitiveServicesPyNotebooks/tree/main/text_snippets)

## Set up your Azure Blob Storage account and create containers

1. Sign in to the Azure portal and search for "storage account". Locate "storage account" and select **Create**.
   ![Screen Shot 2021-04-14 at 11.06.27 AM.png](attachment:c2243866-f7bf-4070-83f1-17443793d77f.png)
2. After you've created a storage account, we need to create two containers. One for our books and one for our translation. Locate and select **Containers** from your storage account.
  ![Screen Shot 2021-04-14 at 11.08.06 AM.png](attachment:5f7e914b-23fd-4f24-9a47-2537c8b7ded7.png)
3. Select **+ Container** and name your container **books**. This is where we'll upload the sample text files that you've downloaded.
![Screen Shot 2021-04-22 at 10.30.50 AM.png](attachment:10922daf-9acc-4947-93fd-c0f5722df447.png)
4. Upload sample text files to your **books** container.
   ![Screen Shot 2021-04-22 at 11.10.20 AM.png](attachment:babef205-95ff-40e0-90da-e4defc4e5aef.png)
5. To read from this container we need to update the permissions and get an SAS token. From the left-nav select **Shared access signature**. Locate **Permissions** and select **READ** and **LIST**.
  ![Screen Shot 2021-04-22 at 10.32.11 AM.png](attachment:564b4346-76fd-49ca-ab56-d26f0d63bfe9.png)
  Next, generate an SAS token. This gives you access to the container. Make sure to save your **SAS token** and **URL**. You'll need these soon.
  ![Screen Shot 2021-04-22 at 10.31.50 AM.png](attachment:ebfc0a25-0c8d-4052-8811-51af9936fc18.png)
6. Now we're going to repeat the process. Select **+ Container** and name the container **translations**.
  ![Screen Shot 2021-04-22 at 10.33.03 AM.png](attachment:06cc68a9-1ae6-4a75-ba43-3d647f85a340.png)
7. Finally, you need to configure your translations container and generate an SAS token. Make sure when configuring permissions that you select **WRITE** and **LIST**. Make sure to save your **SAS token** and **URL**. You'll need these soon.
  ![Screen Shot 2021-04-22 at 10.33.32 AM.png](attachment:b19b5bf7-c015-43a4-bfc5-f2536ddc1952.png)
8. Alright, you're ready to use Document Translation.

## Import modules

In [None]:
import requests
import json
import time

## Set key and endpoint

Each call you make to the Document Translation APIs requires both a key and an endpoint with your custom domain name. Please update the values below with the key and endpoint from your Translator resource.

**CUSTOM_SUBDOMAIN_NAME** is the name of your Translator resource.

In [None]:
endpoint = 'https://CUSTOM_SUBDOMAIN_NAME.cognitiveservices.azure.com/translator/text/batch/v1.0-preview.1'
key =  'YOUR_TRANSLATOR_KEY'
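
If you'd rather not paste your key into the notebook, one option is to read it from environment variables. This is only a sketch: `TRANSLATOR_SUBDOMAIN` and `TRANSLATOR_KEY` are hypothetical variable names that I made up for this example, and `build_endpoint` is a hypothetical helper. The URL pattern is the same one used in the cell above.

```python
import os

def build_endpoint(subdomain: str) -> str:
    """Build the Document Translation endpoint used in this notebook."""
    return (
        f"https://{subdomain}.cognitiveservices.azure.com"
        "/translator/text/batch/v1.0-preview.1"
    )

# Hypothetical environment variable names. If they aren't set, the
# placeholders from the cell above are used instead.
endpoint_from_env = build_endpoint(
    os.environ.get("TRANSLATOR_SUBDOMAIN", "CUSTOM_SUBDOMAIN_NAME")
)
key_from_env = os.environ.get("TRANSLATOR_KEY", "YOUR_TRANSLATOR_KEY")

print(endpoint_from_env)
```

To use these in the rest of the notebook, assign them to `endpoint` and `key`.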

## Set read and write locations

The Document Translation service needs authorized access to your source and target containers. We'll grant it with SAS tokens. In the snippet below, you'll need to add your source location and SAS token, and your target location and SAS token.

When generating your SAS token, make sure that you:

* Set **READ** and **LIST** permissions for the source container. 
* Set **WRITE** and **LIST** permissions for the target container.
* Set a relevant timeframe for your SAS token expiration. By default it's about a day. 

These variables will be used if translating a single document or a batch of documents. 

In [None]:
# Each call to the Document Translation service must be authenticated.
# For both source and target URLs you'll need an SAS signature.

# Here's an example url: https://YOUR_RESOURCE_.blob.core.windows.net/books
# Here's a sample SAS token: sp=rl&st=2021-04-22T16:48:42Z&se=2021-06-01T00:48:42Z&spr=https&sv=2020-02-10&sr=c&sig=XLQtslJDJIRLZ4x3.........

source_url = 'REPLACE_WITH_URL_FOR_BOOKS_CONTAINER'
source_sas = 'REPLACE_WITH_SAS_TOKEN_FOR_BOOKS_CONTAINER'
target_url = 'REPLACE_WITH_URL_FOR_TRANSLATIONS_CONTAINER'
target_sas = 'REPLACE_WITH_SAS_TOKEN_FOR_TRANSLATION_CONTAINER'
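
Since I made permission mistakes on my first run, here's a small sketch that checks a SAS token before you submit a job. It reads the `sp` (signed permissions) field of the token, where `r`, `w`, and `l` stand for read, write, and list. The helper names `sas_permissions`, `has_permissions`, and `blob_url` are hypothetical, and the token in the example is a shortened placeholder with a fake signature.

```python
from urllib.parse import parse_qs

def sas_permissions(sas_token: str) -> set:
    """Return the permission letters found in the token's `sp` field."""
    fields = parse_qs(sas_token.lstrip("?"))
    return set(fields.get("sp", [""])[0])

def has_permissions(sas_token: str, required: str) -> bool:
    """Check that every letter in `required` (e.g. 'rl') is granted."""
    return set(required) <= sas_permissions(sas_token)

def blob_url(container_url: str, blob_path: str, sas_token: str) -> str:
    """Join a container URL, an optional blob path, and a SAS token."""
    base = container_url.rstrip("/")
    path = blob_path.strip("/")
    joined = f"{base}/{path}" if path else base
    return f"{joined}?{sas_token.lstrip('?')}"

# Placeholder token for illustration only; the signature is fake.
example_sas = "sp=rl&st=2021-04-22T16:48:42Z&se=2021-06-01T00:48:42Z&spr=https&sv=2020-02-10&sr=c&sig=FAKE"

print(has_permissions(example_sas, "rl"))  # what the source container needs
print(has_permissions(example_sas, "wl"))  # what the target container needs
```

With your own values, `has_permissions(source_sas, "rl")` and `has_permissions(target_sas, "wl")` should both be true before you continue.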


## Set request headers

In this notebook, each request will use the same headers. So we're going to set them here and reuse them to simplify code blocks.

In [None]:
headers = {
  'Ocp-Apim-Subscription-Key': key,
  'Content-Type': 'application/json'
}

## Translate a single document

Let's take a look at how you translate a single source document into more than one language using the Document Translation service.

In the sample below, you're going to translate a `.txt` file in English to Spanish, German, and Japanese. Both the source and target require an SAS signature, which we've separated into multiple variables for illustrative purposes below. 

**Before you continue**: This sample assumes that you are using the sample text files, specifically: `dubliners_sisters_snippet.txt`. If you're using a different file, please update the `sourceUrl` parameter in the sample.

In [None]:
# The path used for translation is /batches.
path = '/batches'
constructed_url = endpoint + path

# In this example, you're translating a single document. 
# To account for this, the JSON payload of the request must have
# storageType set to File.
payload= {
    "inputs": [
        {
            "storageType": "File",
            "source": {
                "sourceUrl": f'{source_url}/dubliners_sisters_snippet.txt?{source_sas}'
            },
            "targets": [
                {
                    "targetUrl": f'{target_url}/single/spanish_translation.txt?{target_sas}',
                    "language": "es"
                },
                {
                    "targetUrl": f'{target_url}/single/german_translation.txt?{target_sas}',
                    "language": "de"
                },
                {
                    "targetUrl": f'{target_url}/single/ja_translation.txt?{target_sas}',
                    "language": "ja"
                }
            ]
        }
    ]
}

response = requests.post(constructed_url, headers=headers, json=payload)

# This code block checks for a valid request to the Document Translation service.
if response.status_code == 202:
    # The operation location contains your job ID.
    # We're going to store this value in a variable and use it
    # to make calls to the service.
    job_id = response.headers['Operation-Location'].replace(constructed_url, '')
    print(f'Job submitted successfully...\nStatus code: {response.status_code}\nJob status: {response.reason}\nJob ID: {job_id.replace("/", "")}')
else:
    print(f'Status code: {response.status_code}')
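
The cell above gets the job ID by removing `constructed_url` from the `Operation-Location` header. If the returned URL ever differs slightly from the one you built (for example, in letter case), that replacement silently does nothing. As an alternative sketch, you can take the last path segment of the URL. `job_id_from_operation_location` is a hypothetical helper name, and the header value below is a made-up example.

```python
from urllib.parse import urlparse

def job_id_from_operation_location(operation_location: str) -> str:
    """Return the last path segment of the Operation-Location URL."""
    path = urlparse(operation_location).path
    return path.rstrip("/").rsplit("/", 1)[-1]

# Made-up header value for illustration.
example_location = (
    "https://example.cognitiveservices.azure.com"
    "/translator/text/batch/v1.0-preview.1/batches/1234-abcd"
)
print(job_id_from_operation_location(example_location))
```

Note that the notebook's `job_id` variable keeps its leading `/`, while this helper returns the bare ID, so you'd build later URLs as `constructed_url + '/' + bare_id`.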

### Check job status

After you submit a document (or documents) for translation, the Document Translation service processes the documents asynchronously. However, you can get the status for the job. This means that we need to make a series of calls to the Document Translation service to retrieve the state, and when the operation is complete, we can get the translation output.

Here's a code block that will poll for status and print the result when complete.

In [None]:
# GET request to check the status of your translation job.
response = requests.get(constructed_url + job_id, headers=headers).json()

# As long as the request hasn't failed or timed out, this code block
# is going to poll for the job status, and when complete it will print 
# the response.
if response['status'] in ('Succeeded', 'Running'):
    while response['summary']['inProgress'] != 0:
        print(f'Translation in progress... Trying again in 5 seconds' \
              f'\nTranslations completed: {response["summary"]["success"]} / {response["summary"]["total"]} '
              f'\nTranslations in progress: {response["summary"]["inProgress"]} / {response["summary"]["total"]}\n')
        time.sleep(5)
        response = requests.get(constructed_url + job_id, headers=headers).json()
        
if response['status'] == 'Failed':
    print(json.dumps(response, sort_keys=True, indent=4, ensure_ascii=False, separators=(',', ': ')))
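
The polling loop above can also be wrapped in a small helper that stops on any finished state and gives up after a timeout. This is a sketch, not the way the service requires you to poll: `wait_for_job` and `TERMINAL_STATES` are hypothetical names, and the status values in the set are my assumption about what the service returns, so please check them against the API reference. The example uses simulated responses so it runs without calling the service.

```python
import time

# Assumed terminal status values; verify against the API reference.
TERMINAL_STATES = {"Succeeded", "Failed", "Cancelled", "ValidationFailed"}

def wait_for_job(fetch_status, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call fetch_status() until the job finishes or the timeout passes.

    fetch_status is any function that returns the job status as a dict.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status.get("status") in TERMINAL_STATES:
            return status
        if waited >= timeout:
            raise TimeoutError(
                f"Job still {status.get('status')!r} after {waited:.0f} seconds"
            )
        sleep(interval)
        waited += interval

# Simulated status responses, so this sketch doesn't call the service.
fake_responses = iter([
    {"status": "NotStarted"},
    {"status": "Running"},
    {"status": "Succeeded"},
])
final = wait_for_job(lambda: next(fake_responses), interval=0.0)
print(final["status"])
```

In this notebook, you could pass `lambda: requests.get(constructed_url + job_id, headers=headers).json()` as `fetch_status`.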
    

### Get a list of translation outputs

After your job completes, you can retrieve individual translations. In our case, we should have three: Spanish, German, and Japanese.

Here we're going to make a get request using the `job_id` and the `/documents` path. The response will include a list of translated documents, as well as their status.

The translated documents are all stored in your Azure Blob storage container, where you can open any of them to view the translation.

In [None]:
response = requests.get(constructed_url + job_id + '/documents', headers=headers).json()
print(json.dumps(response, sort_keys=True, indent=4, ensure_ascii=False, separators=(',', ': ')))
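
The full JSON can be a lot to read. Here's a sketch that pulls out one line per document. It assumes the response has a `value` list whose items include `to`, `status`, and `path` fields. Those field names are my assumption about the response shape, so compare them with the JSON printed above. `summarize_documents` is a hypothetical helper, and the sample data is made up.

```python
def summarize_documents(documents_response: dict) -> list:
    """Return (language, status, path) tuples, one per document."""
    rows = []
    for doc in documents_response.get("value", []):
        rows.append((doc.get("to"), doc.get("status"), doc.get("path")))
    return rows

# Made-up sample with the assumed field names.
sample_response = {
    "value": [
        {
            "to": "es",
            "status": "Succeeded",
            "path": "https://example.blob.core.windows.net/translations/single/spanish_translation.txt",
        },
        {
            "to": "de",
            "status": "Succeeded",
            "path": "https://example.blob.core.windows.net/translations/single/german_translation.txt",
        },
    ]
}

for language, status, path in summarize_documents(sample_response):
    print(f"{language}: {status} -> {path}")
```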

## Translate a batch of documents

Alright, so you've translated a single file and that's great. But what happens when you have many files that you want to translate into many languages?

Similar to translating a single document, you'll need to read from and write to an Azure Blob storage container. The main differences are:
* Rather than reading a specific file, you're reading from a location in an Azure Blob storage container. 
* Rather than specifying the filename and extension of the output file, it will inherit the name of the file that was translated. For this reason, we're adding a directory structure to our write container. For example, `/es`.

**TIP**: Feel free to try other languages for translation. Just remember to update `language` and `targetUrl`.

In [None]:
path = '/batches'
constructed_url = endpoint + path

payload= {
    "inputs": [
        {
            "source": {
                "sourceUrl": source_url + '?' + source_sas,
                "storageSource": "AzureBlob",
                "language": "en"
            },
            "targets": [
                {
                    "targetUrl": f'{target_url}/es?{target_sas}',
                    "storageSource": "AzureBlob",
                    "language": "es"
                },
                {
                    "targetUrl": f'{target_url}/ja?{target_sas}',
                    "storageSource": "AzureBlob",
                    "language": "ja"
                },
                {
                    "targetUrl": f'{target_url}/de?{target_sas}',
                    "storageSource": "AzureBlob",
                    "language": "de"
                },
                
            ]
        }
    ]
}

response = requests.post(constructed_url, headers=headers, json=payload)

if response.status_code == 202:
    # The operation location contains your job ID.
    # We're going to store this value in a variable and use it
    # to make calls to the service.
    job_id = response.headers['Operation-Location'].replace(constructed_url, '')
    print(f'Job submitted successfully...\nStatus code: {response.status_code}\nJob status: {response.reason}\nJob ID: {job_id.replace("/", "")}')
else:
    print(f'Status code: {response.status_code}')
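
As the tip above says, `language` and `targetUrl` need to change together when you try other languages. One way to keep them in sync is to build the `targets` list from a list of language codes. This is a sketch that mirrors the payload above; `build_targets` is a hypothetical helper, and the URL and token in the example are placeholders.

```python
def build_targets(target_url: str, target_sas: str, languages: list) -> list:
    """Build one target entry per language, writing to a /<language> folder."""
    return [
        {
            "targetUrl": f"{target_url}/{lang}?{target_sas}",
            "storageSource": "AzureBlob",
            "language": lang,
        }
        for lang in languages
    ]

# Placeholder values for illustration.
example_targets = build_targets(
    "https://example.blob.core.windows.net/translations",
    "sp=wl&sig=FAKE",
    ["es", "ja", "de"],
)
for target in example_targets:
    print(target["language"], target["targetUrl"])
```

In the payload above, you could then set `"targets": build_targets(target_url, target_sas, ["es", "ja", "de"])`.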

### Check job status

After you submit a document (or documents) for translation, the Document Translation service processes the documents asynchronously. However, you can get the status for the job. This means that we need to make a series of calls to the Document Translation service to retrieve the state, and when the operation is complete, we can get the translation output.

Here's a code block that will poll for status and print the result when complete.

In [None]:
response = requests.get(constructed_url + job_id, headers=headers).json()

if response['status'] in ('Succeeded', 'Running'):
    while response['summary']['inProgress'] != 0:
        print(f'Translation in progress... Trying again in 5 seconds' \
              f'\nTranslations completed: {response["summary"]["success"]} / {response["summary"]["total"]} '
              f'\nTranslations in progress: {response["summary"]["inProgress"]} / {response["summary"]["total"]}\n')
        time.sleep(5)
        response = requests.get(constructed_url + job_id, headers=headers).json()
        
if response['status'] == 'Failed':
    print(json.dumps(response, sort_keys=True, indent=4, ensure_ascii=False, separators=(',', ': ')))
    

### Get a list of translation outputs

After your job completes, you can retrieve individual translations. In our case, we should have translations in three languages: Spanish, German, and Japanese.

Here we're going to make a get request using the `job_id` and the `/documents` path. The response will include a list of translated documents, as well as their status.

The translated documents are all stored in your Azure Blob storage container, where you can open any of them to view the translation.

In [None]:
response = requests.get(constructed_url + job_id + '/documents', headers=headers).json()
print(json.dumps(response, sort_keys=True, indent=4, ensure_ascii=False, separators=(',', ': ')))

## Next steps

Document Translation is a powerful service, and this notebook only showcases two basic tasks. If you want to learn more, I encourage you to take a look at the API reference.

* [Document Translation API reference](https://docs.microsoft.com/azure/cognitive-services/translator/document-translation/reference/start-translation)