### SageMaker large-scale prediction

In [1]:
import sagemaker
import boto3

sagemaker_session = sagemaker.Session()
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

# When running inside SageMaker, the execution role can be looked up instead:
# role = sagemaker.get_execution_role()
role = "arn:aws:iam::{}:role/service-role/AmazonSageMaker-ExecutionRole-20190118T115449".format(account_id)


In [2]:
pytorch_custom_image_name="ppi-extractor:cpu-1.0.0-202101020146"
instance_type = "ml.m5.large" 

In [3]:
docker_repo = "{}.dkr.ecr.{}.amazonaws.com/{}".format(account_id, region, pytorch_custom_image_name)

### Step 1: Convert pubtator format to inference json

The input PubTator files look like the sample below. They are converted into the JSON input used for inference.

```text
20791654|a|Liver scan characteristics and liver function tests of 72 patients with proved hepatic malignancy (54 metastatic, 18 primary) were evaluated. Well-defined focal defects were observed in 83% of patients with metastatic and 77% of patients with primary liver carcinoma. In 10% of the patients with metastatic liver disease the distribution of radioactivity was normal. Four or more biochemical liver function tests were normal in 33% of metastatic and 29% of primary liver cancer patients. Hepatic enlargement was present in the scan in 94% of the patients with liver metastases; however, data obtained from 104 necropsies of patients with hepatic metastases showed that only 46% had hepatomegaly. We recommend, therefore, that a liver scan should be performed before major tumour surgery in every patient with known malignancy regardless of normal liver size or normal liver function tests.
20791654	58	66	patients	Species	9606
20791654	193	201	patients	Species	9606
20791654	229	237	patients	Species	9606
20791654	282	290	patients	Species	9606
20791654	478	486	patients	Species	9606
20791654	546	554	patients	Species	9606
20791654	624	632	patients	Species	9606
20791654	796	803	patient	Species	9606

20791817|a|5-Aminosalicylic acid given to rats as a single intravenous injection led to necrosis of the proximal convoluted tubules and of the renal papilla. These two lesions developed at the same time and the cortical lesions did not appear to be a consequence of the renal papillary necrosis. Since the compound possesses the molecular structure both of a phenacetin derivative and of a salicylate these observations may be relevant to the problem of renal damage incident to abuse of analgesic compounds and suggest the possibility that in this syndrome cortical lesions may develop independently of renal papillary necrosis.
20791817	31	35	rats	Species	10116

```
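To make the layout concrete, here is a minimal sketch of reading this format. It is not the project's `pubtator_annotations_inference_transformer.py`, which is not shown in this notebook; the function name `parse_pubtator` and the output dictionary keys are assumptions for illustration. It assumes text lines of the form `PMID|t|text` or `PMID|a|text`, tab-separated annotation lines (`PMID`, start, end, mention, type, identifier), and blank lines between documents. The sample abstract is shortened from the one above.

```python
def parse_pubtator(text):
    """Parse PubTator text into a list of dicts, one per PMID (illustrative only)."""
    docs = {}

    def get_doc(pmid):
        # Create the document record the first time a PMID is seen
        return docs.setdefault(
            pmid, {"id": pmid, "title": "", "abstract": "", "annotations": []}
        )

    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        head = line.split("|", 2)
        if len(head) == 3 and head[0].isdigit() and head[1] in ("t", "a"):
            pmid, kind, body = head
            get_doc(pmid)["title" if kind == "t" else "abstract"] = body
            continue
        fields = line.split("\t")
        if len(fields) >= 6:
            pmid, start, end, mention, entity_type, entity_id = fields[:6]
            get_doc(pmid)["annotations"].append({
                "start": int(start),
                "end": int(end),
                "mention": mention,
                "type": entity_type,
                "id": entity_id,
            })
    return list(docs.values())


sample = (
    "20791817|a|5-Aminosalicylic acid given to rats as a single intravenous injection.\n"
    "20791817\t31\t35\trats\tSpecies\t10116\n"
)
docs = parse_pubtator(sample)
annotation = docs[0]["annotations"][0]
# The character offsets index directly into the abstract text
print(docs[0]["abstract"][annotation["start"]:annotation["end"]])  # rats
```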

In [83]:
import datetime
date_fmt = datetime.datetime.today().strftime("%Y%m%d%H")

In [None]:
#s3_input_pubtator = "s3://aegovan-data/pubmed_json_parts_annotation_iseries/pubmed19n0550.json.txt"
s3_input_pubtator = "s3://aegovan-data/pubmed_json_parts_annotation_iseries/"
s3_id_mapping_file="s3://aegovan-data/settings/HUMAN_9606_idmapping.dat"

s3_output_pubmed_asbtract = f"s3://aegovan-data/pubmed_asbtract/inference_multi_{date_fmt}/"
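
The data-preparation job further down reads the PubTator prefix with `s3_data_distribution_type="ShardedByS3Key"`, so each instance receives only a subset of the objects under the prefix, while the ID-mapping file is `FullyReplicated` to every instance. The sketch below only illustrates the idea: the round-robin assignment, the helper name `shard_keys`, and the example file names are assumptions, since this notebook does not show how SageMaker actually assigns keys to instances.

```python
def shard_keys(keys, instance_count):
    """Split keys into instance_count disjoint groups (round-robin, illustrative only)."""
    shards = [[] for _ in range(instance_count)]
    for index, key in enumerate(sorted(keys)):
        shards[index % instance_count].append(key)
    return shards


# Hypothetical object names, modelled on the file names in the job log
keys = [f"pubmed19n{n:04d}.json.txt" for n in range(1, 26)]
shards = shard_keys(keys, instance_count=10)
for instance, shard in enumerate(shards):
    print(instance, len(shard))
```

Whatever the exact assignment, the point is that the groups are disjoint and together cover every file, so with 10 instances each one handles roughly a tenth of the input.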

In [12]:
from sagemaker.network import NetworkConfig
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.processing import ScriptProcessor

script_processor = ScriptProcessor(image_uri=docker_repo,
                                   command=["python"],
                                   env={'mode': 'python', 'PYTHONPATH': '/opt/ml/code'},
                                   role=role,
                                   instance_type=instance_type,
                                   instance_count=10,
                                   max_runtime_in_seconds=172800,  # 48 hours
                                   volume_size_in_gb=50,
                                   network_config=NetworkConfig(enable_network_isolation=False),
                                   base_job_name="ppi-large-inference-data-prep")


sm_local_input_pubtator_txt = "/opt/ml/processing/input/data/json"
sm_local_input_idmapping = "/opt/ml/processing/input/data/mapping"
sm_local_output = "/opt/ml/processing/output"


script_processor.run(
    code='source/datatransformer/pubtator_annotations_inference_transformer.py',
    arguments=[
        sm_local_input_pubtator_txt,
        sm_local_output,
        "{}/{}".format(sm_local_input_idmapping, s3_id_mapping_file.split("/")[-1]),
    ],
    inputs=[
        # Each instance receives a different subset of the PubTator files
        ProcessingInput(
            source=s3_input_pubtator,
            destination=sm_local_input_pubtator_txt,
            s3_data_distribution_type="ShardedByS3Key"),
        # Every instance receives a full copy of the ID-mapping file
        ProcessingInput(
            source=s3_id_mapping_file,
            destination=sm_local_input_idmapping,
            s3_data_distribution_type="FullyReplicated"),
    ],
    outputs=[
        ProcessingOutput(
            source=sm_local_output,
            destination=s3_output_pubmed_asbtract,
            output_name='inferenceabstracts'),
    ],
)

Parameter 'session' will be renamed to 'sagemaker_session' in SageMaker Python SDK v2.



Job Name:  ppi-large-inference-data-prep-2020-12-31-12-29-53-262
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://aegovan-data/pubmed_json_parts_annotation_iseries/', 'LocalPath': '/opt/ml/processing/input/data/json', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'input-2', 'S3Input': {'S3Uri': 's3://aegovan-data/settings/HUMAN_9606_idmapping.dat', 'LocalPath': '/opt/ml/processing/input/data/mapping', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-2-324346001917/ppi-large-inference-data-prep-2020-12-31-12-29-53-262/input/code/pubtator_annotations_inference_transformer.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}

2020-12-31 12:40:01,616 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0031.json.txt with records 446
2020-12-31 12:40:02,453 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0037.json.txt with records 373
2020-12-31 12:40:02,373 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0139.json.txt with records 17
2020-12-31 12:40:02,914 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0680.json.txt with records 18458
2020-12-31 12:40:02,313 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0061.json.txt with records 9113
2020-12-31 12:40:03,143 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0763.json.txt with records 19677
2020-12-31 12:40:03,043 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0557.json.txt with records 171

2020-12-31 12:40:17,704 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0541.json.txt with records 13627
2020-12-31 12:40:18,180 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0885.json.txt with records 14066
2020-12-31 12:40:18,319 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0368.json.txt with records 16314
2020-12-31 12:40:19,301 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0937.json.txt with records 10316
2020-12-31 12:40:19,176 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0756.json.txt with records 17433
2020-12-31 12:40:18,909 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0370.json.txt with records 17931
2020-12-31 12:40:19,265 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0643.json.txt with rec

2020-12-31 12:40:35,575 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0199.json.txt with records 4
2020-12-31 12:40:35,605 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0326.json.txt with records 16995
2020-12-31 12:40:36,194 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0065.json.txt with records 5735
2020-12-31 12:40:36,365 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0230.json.txt with records 261
2020-12-31 12:40:35,878 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0864.json.txt with records 17211
2020-12-31 12:40:36,355 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0167.json.txt with records 11
2020-12-31 12:40:36,681 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0607.json.txt with records 10132

2020-12-31 12:40:55,189 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0596.json.txt with records 17483
2020-12-31 12:40:55,109 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0322.json.txt with records 16977
2020-12-31 12:40:54,880 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0430.json.txt with records 1
2020-12-31 12:40:54,943 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0330.json.txt with records 1621
2020-12-31 12:40:55,101 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0383.json.txt with records 17140
2020-12-31 12:40:55,820 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0274.json.txt with records 6796
2020-12-31 12:40:55,839 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0390.json.txt with records 1

2020-12-31 12:41:15,609 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0955.json.txt with records 12871
2020-12-31 12:41:16,198 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0103.json.txt with records 4229
2020-12-31 12:41:16,776 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0948.json.txt with records 11626
2020-12-31 12:41:16,594 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0842.json.txt with records 16117
2020-12-31 12:41:16,785 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0470.json.txt with records 18
2020-12-31 12:41:16,821 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0340.json.txt with records 1
2020-12-31 12:41:17,357 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0711.json.txt with records 153

2020-12-31 12:41:37,061 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0826.json.txt with records 17067
2020-12-31 12:41:37,306 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0684.json.txt with records 15091
2020-12-31 12:41:38,033 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0161.json.txt with records 41
2020-12-31 12:41:38,041 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0611.json.txt with records 0
2020-12-31 12:41:38,041 - __main__ - INFO - No records generated
2020-12-31 12:41:38,373 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0262.json.txt with records 8609
2020-12-31 12:41:38,109 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0627.json.txt with records 18086
2020-12-31 12:41:38,621 - __main__ - INFO - Processed file 

2020-12-31 12:41:58,152 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0335.json.txt with records 6745
2020-12-31 12:41:59,704 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0815.json.txt with records 16758
2020-12-31 12:42:00,348 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0422.json.txt with records 18122
2020-12-31 12:42:00,178 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0415.json.txt with records 17030
2020-12-31 12:42:00,915 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0728.json.txt with records 17263
2020-12-31 12:42:00,643 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0029.json.txt with records 510
2020-12-31 12:42:00,686 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0271.json.txt with record

2020-12-31 12:42:15,111 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0356.json.txt with records 15985
2020-12-31 12:42:14,900 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0373.json.txt with records 18818
2020-12-31 12:42:15,829 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0419.json.txt with records 15380
2020-12-31 12:42:16,188 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0949.json.txt with records 1994
2020-12-31 12:42:16,755 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0594.json.txt with records 17365
2020-12-31 12:42:17,208 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0498.json.txt with records 17390
2020-12-31 12:42:17,141 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0265.json.txt with reco

2020-12-31 12:42:38,967 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0570.json.txt with records 16857
2020-12-31 12:42:39,606 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0138.json.txt with records 116
2020-12-31 12:42:39,632 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0198.json.txt with records 5
2020-12-31 12:42:39,275 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0776.json.txt with records 14393
2020-12-31 12:42:39,032 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0203.json.txt with records 1981
2020-12-31 12:42:39,883 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0527.json.txt with records 16352
2020-12-31 12:42:39,729 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0469.json.txt with records 9

2020-12-31 12:42:57,193 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0568.json.txt with records 17410
2020-12-31 12:42:57,195 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0369.json.txt with records 17366
2020-12-31 12:42:58,441 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0653.json.txt with records 9185
2020-12-31 12:42:58,872 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0848.json.txt with records 16858
2020-12-31 12:42:58,574 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0782.json.txt with records 16253
2020-12-31 12:42:59,322 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0636.json.txt with records 17131
2020-12-31 12:43:00,734 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0924.json.txt with reco

2020-12-31 12:43:20,757 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0869.json.txt with records 17061
2020-12-31 12:43:20,503 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0343.json.txt with records 13479
2020-12-31 12:43:20,680 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0677.json.txt with records 2395
2020-12-31 12:43:20,969 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0652.json.txt with records 16255
2020-12-31 12:43:21,706 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0556.json.txt with records 18330
2020-12-31 12:43:22,988 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0529.json.txt with records 19465
2020-12-31 12:43:23,529 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0481.json.txt with reco

2020-12-31 12:43:39,532 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0898.json.txt with records 12671
2020-12-31 12:43:40,085 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0813.json.txt with records 19509
2020-12-31 12:43:40,415 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0060.json.txt with records 2325
2020-12-31 12:43:40,233 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0395.json.txt with records 13737
2020-12-31 12:43:40,950 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0446.json.txt with records 2
2020-12-31 12:43:41,352 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0606.json.txt with records 11918
2020-12-31 12:43:40,827 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0802.json.txt with records 

2020-12-31 12:43:58,756 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0958.json.txt with records 13340
2020-12-31 12:43:59,673 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0432.json.txt with records 10
2020-12-31 12:44:00,206 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0532.json.txt with records 15738
2020-12-31 12:44:00,459 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0771.json.txt with records 16205
2020-12-31 12:44:01,127 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0009.json.txt with records 156
2020-12-31 12:44:01,411 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0258.json.txt with records 9765
2020-12-31 12:44:01,719 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0964.json.txt with records 1

2020-12-31 12:44:17,394 - __main__ - INFO - Completed with 97 files and 1030971 records
2020-12-31 12:44:17,639 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0279.json.txt with records 10483
2020-12-31 12:44:18,297 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0411.json.txt with records 16031
2020-12-31 12:44:17,931 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0731.json.txt with records 17886
2020-12-31 12:44:19,591 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0482.json.txt with records 17367
2020-12-31 12:44:20,031 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0960.json.txt with records 9020
2020-12-31 12:44:19,756 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0036.json.txt with records 542
2020-12-31 12:44:20,540 - __ma
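
Because the instances interleave their output, it can be handy to total the per-file counts from the log. A minimal sketch, assuming the log lines keep the `Processed file <path> with records <n>` wording shown above; the helper name `summarise_log` is hypothetical. Lines that were cut off before the record count are skipped.

```python
import re

RECORD_PATTERN = re.compile(r"Processed file (\S+) with records (\d+)")


def summarise_log(lines):
    """Return (file_count, record_count) over lines that contain a record count."""
    files = 0
    records = 0
    for line in lines:
        match = RECORD_PATTERN.search(line)
        if match:
            files += 1
            records += int(match.group(2))
    return files, records


sample = [
    "2020-12-31 12:40:01,616 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0031.json.txt with records 446",
    "2020-12-31 12:40:02,453 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0037.json.txt with records 373",
    "2020-12-31 12:40:19,265 - __main__ - INFO - Processed file /opt/ml/processing/input/data/json/pubmed19n0643.json.txt with rec",
]
print(summarise_log(sample))  # (2, 819)
```

A line truncated in the middle of its count would still match with a partial number, so totals taken from a truncated log like the one above are only a lower bound.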

### Step 2: Run predictions

In [13]:
prepare_models=False

In [14]:
jobs = [
"ppi-bert-2020-12-28-06-14-27-510",
"ppi-bert-2020-12-28-06-13-28-613",
"ppi-bert-2020-12-28-06-12-43-937",
"ppi-bert-2020-12-28-06-11-48-471",
"ppi-bert-2020-12-28-06-10-53-005",
"ppi-bert-2020-12-28-06-10-00-183",
"ppi-bert-2020-12-28-06-09-00-491",
"ppi-bert-2020-12-28-06-08-02-139",
"ppi-bert-2020-12-28-06-07-01-234",
"ppi-bert-2020-12-28-06-06-07-198"
]

s3_model_path_format = "s3://aegovan-data/results/{}/output/model.tar.gz"

s3_model_paths = [s3_model_path_format.format(j) for j in jobs]

In [15]:
s3_output_ensemble_models= "s3://aegovan-data/ensemble_models/{}".format("2020-12-28-06-part")

### Prepare ensemble models
TODO: This is a hack to untar a set of compressed model archives and upload them to a single S3 location. Running a separate processing job just for this is overkill.
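
For orientation, the unpacking step might look roughly like the sketch below. The actual `ensemble_inference_prepare_models.py` is not shown in this notebook, so the function name `unpack_models`, the search for `*.tar.gz` files under numbered input directories, and the output layout (one sub-directory per archive) are all assumptions.

```python
import tarfile
import tempfile
from pathlib import Path


def unpack_models(input_dir, dest_dir):
    """Extract every *.tar.gz under input_dir into its own sub-directory of dest_dir."""
    extracted = []
    for archive in sorted(Path(input_dir).rglob("*.tar.gz")):
        # Name the target after the archive's parent directory (e.g. "0", "1", ...)
        target = Path(dest_dir) / archive.parent.name
        target.mkdir(parents=True, exist_ok=True)
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)  # only use on archives you trust
        extracted.append(target)
    return extracted


# Small self-check with a throwaway archive
with tempfile.TemporaryDirectory() as root:
    source_file = Path(root) / "weights.bin"
    source_file.write_text("dummy weights")
    model_dir = Path(root) / "input" / "0"
    model_dir.mkdir(parents=True)
    with tarfile.open(model_dir / "model.tar.gz", "w:gz") as tar:
        tar.add(source_file, arcname="weights.bin")

    unpacked = unpack_models(Path(root) / "input", Path(root) / "output")
    unpacked_names = sorted(path.name for path in unpacked)
    restored = (Path(root) / "output" / "0" / "weights.bin").read_text()

print(unpacked_names, restored)
```

With a layout like this, directory names restart at 0 for every job, so several jobs writing to the same S3 destination would need names that are unique across jobs.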

In [16]:
def get_processing_inputs_s3_local_path(s3_model_paths, sm_local_input):
    # Map each S3 model path to its own numbered local input directory
    inputs = []
    for i, s3_path in enumerate(s3_model_paths):
        p = ProcessingInput(
            source=s3_path,
            destination="{}/{}".format(sm_local_input.rstrip("/"), i)
        )
        inputs.append(p)
    return inputs


In [17]:
from sagemaker.network import NetworkConfig
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.processing import ScriptProcessor


sm_local_input = "/opt/ml/processing/input/models"
sm_local_output = "/opt/ml/processing/output"

script_processor = ScriptProcessor(image_uri=docker_repo,
                                       command=["python"],
                                       env={'mode': 'python', 'PYTHONPATH':'/opt/ml/code'},
                                       role=role,
                                       instance_type=instance_type,
                                       instance_count=1,
                                       max_runtime_in_seconds=172800,
                                       volume_size_in_gb = 50,
                                       network_config=NetworkConfig(enable_network_isolation=False),
                                       base_job_name ="ppi-ensemble-model-packer"
                                       )


In [18]:

if prepare_models:
    # Workaround for the limit on the number of inputs per processing job: submit the models in chunks
    chunk_size=5
    for i in range(0, len(s3_model_paths), chunk_size ):

        script_processor.run(
                code='source/algorithms/ensemble_inference_prepare_models.py',

                arguments=[
                    "--input-dir",
                    sm_local_input,
                    "--dest-dir",
                    sm_local_output

                ],

                inputs=get_processing_inputs_s3_local_path(s3_model_paths[i:i+chunk_size], sm_local_input),


                outputs=[ProcessingOutput(
                        source=sm_local_output, 
                        destination=s3_output_ensemble_models,
                        output_name='models')]
            )
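
The slicing in the loop above splits the ten model archives into batches of `chunk_size`, with one processing job per batch. A stand-alone sketch of the same slicing, where the helper name `chunked` and the placeholder paths are illustrative assumptions:

```python
def chunked(items, chunk_size):
    """Yield consecutive slices of items, each at most chunk_size long."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# Placeholder paths standing in for the ten model archives
paths = [f"s3://example-bucket/results/job-{n}/output/model.tar.gz" for n in range(10)]
batches = list(chunked(paths, 5))
print([len(batch) for batch in batches])  # [5, 5]
```

The last batch is simply shorter when the number of models is not a multiple of the chunk size.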



### Run ensemble prediction

In [84]:
s3_output_predictions = "s3://aegovan-data/pubmed_asbtract/predictions_multi_{}_{}/".format("2020_12_28_06_m_",date_fmt)

In [85]:
pytorch_custom_image_name="ppi-extractor:gpu-1.0.0-202101020146"
instance_type = "ml.p3.16xlarge" 

In [86]:
# Temporary override: reuse the output of an earlier data-preparation run
s3_output_pubmed_asbtract = "s3://aegovan-data/pubmed_asbtract/inference_multi_2020123123/"

In [87]:
docker_repo = "{}.dkr.ecr.{}.amazonaws.com/{}".format(account_id, region, pytorch_custom_image_name)
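
The same URI pattern is used for both the CPU and the GPU image. A small helper, named `ecr_image_uri` here purely for illustration, would keep the two in step; the account ID and region in the example are placeholders.

```python
def ecr_image_uri(account_id, region, image_name):
    """Build the ECR image URI from the account, region, and 'repository:tag' name."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{image_name}"


# Placeholder account ID and region
print(ecr_image_uri("123456789012", "us-east-2", "ppi-extractor:gpu-1.0.0-202101020146"))
```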

In [None]:
from sagemaker.network import NetworkConfig
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.processing import ScriptProcessor

script_processor = ScriptProcessor(image_uri=docker_repo,
                                       command=["python"],
                                       env={'mode': 'python', 'PYTHONPATH':'/opt/ml/code'},
                                       role=role,
                                       instance_type=instance_type,
                                       instance_count=4,
                                       max_runtime_in_seconds=172800,
                                       volume_size_in_gb = 250,
                                       network_config=NetworkConfig(enable_network_isolation=False),
                                       base_job_name ="ppi-ensemble-inference"
                                       )


sm_local_input_models = "/opt/ml/processing/input/data/models"
sm_local_input_data = "/opt/ml/processing/input/data/jsonlines"
sm_local_output = "/opt/ml/processing/output"



script_processor.run(
    code='source/algorithms/main_predict.py',
    arguments=[
        "PpiMulticlassDatasetFactory",
        sm_local_input_data,
        sm_local_input_models,
        sm_local_output,
    ],
    inputs=[
        ProcessingInput(
            source=s3_output_pubmed_asbtract,
            destination=sm_local_input_data,
            s3_data_distribution_type="ShardedByS3Key"),
        ProcessingInput(
            source=s3_output_ensemble_models,
            destination=sm_local_input_models,
            s3_data_distribution_type="FullyReplicated"),
    ],
    outputs=[
        ProcessingOutput(
            source=sm_local_output,
            destination=s3_output_predictions,
            output_name='predictions'),
    ],
)




Parameter 'session' will be renamed to 'sagemaker_session' in SageMaker Python SDK v2.



Job Name:  ppi-ensemble-inference-2021-01-02-01-55-48-933
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://aegovan-data/pubmed_asbtract/inference_multi_2020123123/', 'LocalPath': '/opt/ml/processing/input/data/jsonlines', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'input-2', 'S3Input': {'S3Uri': 's3://aegovan-data/ensemble_models/2020-12-28-06-part', 'LocalPath': '/opt/ml/processing/input/data/models', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-2-324346001917/ppi-ensemble-inference-2021-01-02-01-55-48-933/input/code/main_predict.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'pred
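
How `main_predict.py` combines the ten models is not shown in this notebook. One common way to form an ensemble prediction, given here only as an assumption, is to average each model's class probabilities and pick the highest-scoring class. The function name `average_ensemble` and the example scores are hypothetical.

```python
def average_ensemble(model_probabilities):
    """Average per-class probabilities across models; return (best_class, averaged)."""
    n_models = len(model_probabilities)
    n_classes = len(model_probabilities[0])
    averaged = [
        sum(probs[c] for probs in model_probabilities) / n_models
        for c in range(n_classes)
    ]
    best_class = max(range(n_classes), key=lambda c: averaged[c])
    return best_class, averaged


# Made-up probabilities from three models over three classes
scores = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.6, 0.1, 0.3],
]
best, averaged = average_ensemble(scores)
print(best, averaged)
```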