# Sequence for processing on an AWS cluster

All of the code is based on the boto3 library. To run it, AWS credentials must be configured on the machine as described at https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html.
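
Before running the cells below, it can help to confirm that boto3 actually finds credentials. The following is a minimal sketch, not part of the original notebook; the helper name `credentials_configured` is hypothetical, and it only checks that credentials can be resolved, not that they are valid or have the required permissions.

```python
def credentials_configured(session):
    """Return True if the given boto3 session can resolve AWS credentials.

    `session` is expected to be a boto3.Session (hypothetical helper,
    not part of the original notebook).
    """
    # Session.get_credentials() returns None when boto3 finds no credentials
    # in the environment, the shared config files, or instance metadata.
    return session.get_credentials() is not None
```

With boto3 installed, a call such as `credentials_configured(boto3.Session())` should return `True` once the quickstart configuration has been completed.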

In [1]:
try:
    !pip install boto3=="1.13.1" --quiet
except Exception:
    print("Running from a .py file.")

ERROR: awsebcli 3.18.1 has requirement botocore<1.16,>=1.15, but you'll have botocore 1.16.26 which is incompatible.


In [2]:
import boto3
import os
import json

In [3]:
dirpath = os.getcwd()

## Configuring AWS services
Sequence of activities to configure the AWS environment for storing and processing the PySpark model.

#### Defining the variables used to configure the AWS environment.

In [4]:
my_bucket = "escale-fk"
app_key = "escale-test-fk"
my_tag = [{'Key': app_key, 'Value': ''}]
my_resource_group = "rg-escale-test-fk"
my_emr_cluster = "spark-escale-test-fk"
files_to_upload = ['Escale_Challenge.py', 'Escale_Challenge.html']

### Creating an S3 bucket "escale-fk" to store the PySpark model.

In [5]:
s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
s3_client.create_bucket(Bucket=my_bucket)

{'ResponseMetadata': {'RequestId': '73848C1D9E2FA4DA',
  'HostId': 'esuZlBUI3qSrC8rIzjsDC9ZTnMqFRqAHb+CYQuTLzjp/c+yVd/hH+aJPaTvFRgG5Ax6Nm1TgY9M=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'esuZlBUI3qSrC8rIzjsDC9ZTnMqFRqAHb+CYQuTLzjp/c+yVd/hH+aJPaTvFRgG5Ax6Nm1TgY9M=',
   'x-amz-request-id': '73848C1D9E2FA4DA',
   'date': 'Sat, 29 Aug 2020 14:00:40 GMT',
   'location': '/escale-fk',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'Location': '/escale-fk'}
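
The call above works without further arguments because the client targets `us-east-1` (the region that appears in the ARNs later in this notebook). In any other region, `create_bucket` needs a `CreateBucketConfiguration` with a `LocationConstraint`. A minimal sketch of building the arguments, assuming a hypothetical helper `create_bucket_kwargs`:

```python
def create_bucket_kwargs(bucket_name, region="us-east-1"):
    """Build keyword arguments for s3_client.create_bucket().

    Hypothetical helper, not part of the original notebook.
    """
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default and must not be passed as a LocationConstraint;
    # every other region has to be named explicitly.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs
```

It could then be used as `s3_client.create_bucket(**create_bucket_kwargs(my_bucket, "sa-east-1"))`, where the region name is only an illustration.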

Defining a tag for the newly created bucket

In [6]:
s3_client.put_bucket_tagging(Bucket=my_bucket, Tagging= {'TagSet': my_tag} )

{'ResponseMetadata': {'RequestId': '3079691F05DB24BA',
  'HostId': 'kim3BWFQXJvow7yHqZxyDnuYVlsmHGb4UCiylY+77Ks5Ft7PGMZC1tyD1NgnVtMfT7mejEk10PE=',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': 'kim3BWFQXJvow7yHqZxyDnuYVlsmHGb4UCiylY+77Ks5Ft7PGMZC1tyD1NgnVtMfT7mejEk10PE=',
   'x-amz-request-id': '3079691F05DB24BA',
   'date': 'Sat, 29 Aug 2020 14:00:55 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}
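
Note that `put_bucket_tagging` replaces the bucket's whole tag set rather than appending to it. To confirm which tags ended up on the bucket, the tag set can be read back with `get_bucket_tagging`. A small sketch, assuming a hypothetical helper `bucket_tag_keys`:

```python
def bucket_tag_keys(s3_client, bucket_name):
    """Return the set of tag keys currently attached to the bucket.

    Hypothetical helper; `s3_client` is expected to be boto3.client('s3').
    """
    # get_bucket_tagging returns {'TagSet': [{'Key': ..., 'Value': ...}, ...]}
    response = s3_client.get_bucket_tagging(Bucket=bucket_name)
    return {tag["Key"] for tag in response["TagSet"]}
```

For the bucket above, `app_key in bucket_tag_keys(s3_client, my_bucket)` would be the check to run.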

Upload the files to the bucket: the PySpark model goes into the model folder and the HTML page into the bucket root.

In [7]:
import logging

from boto3.exceptions import S3UploadFailedError
from botocore.exceptions import ClientError

for file in files_to_upload:
    file_name = os.path.join(dirpath, file)
    try:
        if file.endswith('.html'):
            # The HTML page goes to the bucket root.
            s3_client.upload_file(file_name, my_bucket, file)

            # Replace the object's metadata so S3 serves it as text/html.
            html_object = s3.Object(my_bucket, file)
            html_object.copy_from(CopySource={'Bucket': my_bucket, 'Key': file},
                                  MetadataDirective="REPLACE",
                                  ContentType="text/html",
                                  ACL='public-read')
        else:
            # The PySpark script goes under the model/ prefix.
            s3_client.upload_file(file_name, my_bucket, "model/" + file,
                                  ExtraArgs={'ACL': 'public-read'})

        print("It was uploaded the file", "'" + file + "'", ".")
    except (ClientError, S3UploadFailedError) as e:
        logging.error(e)

It was uploaded the file 'Escale_Challenge.py' .
It was uploaded the file 'Escale_Challenge.html' .


Open the page in a browser: https://escale-fk.s3.amazonaws.com/Escale_Challenge.html
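
The upload cell sets the HTML content type with a second request (`copy_from`). As an alternative sketch, the content type and ACL can be passed at upload time through `ExtraArgs`, which `upload_file` accepts. The helper name `upload_extra_args` is hypothetical, and guessing the type from the file extension with `mimetypes` is an assumption of this example, not something the original notebook does.

```python
import mimetypes


def upload_extra_args(file_name, acl="public-read"):
    """Build the ExtraArgs dict for s3_client.upload_file().

    Hypothetical helper: guesses ContentType from the file extension.
    """
    extra_args = {"ACL": acl}
    content_type, _encoding = mimetypes.guess_type(file_name)
    # Leave ContentType out when the extension is unknown so S3 falls
    # back to its default.
    if content_type is not None:
        extra_args["ContentType"] = content_type
    return extra_args
```

A call would then look like `s3_client.upload_file(file_name, my_bucket, file, ExtraArgs=upload_extra_args(file))`, which avoids the extra copy step.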

### Configuring a Resource Group

In [8]:
RG_client = boto3.client('resource-groups')

#AWS::AllSupported
#AWS::S3::Bucket
query = {
    "ResourceTypeFilters": ["AWS::AllSupported"],
    "TagFilters":  [{
        "Key": my_tag[0].get("Key"),
        "Values": [""]
    }] 
}
resource_query = {
    'Type': 'TAG_FILTERS_1_0',
    'Query': json.dumps(query)
}

try:
    resp = RG_client.create_group(Name=my_resource_group,ResourceQuery=resource_query)
    print("Resource Group was created.")
except Exception as e:
    print(e)

#print(query)
#print(my_tag)

Resource Group was created.
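
The query above is hard-wired to a single tag key with an empty value. A small sketch of the same structure as a reusable function may make the shape of a `TAG_FILTERS_1_0` query easier to see; the helper name `tag_resource_query` and its default arguments are assumptions for illustration.

```python
import json


def tag_resource_query(tag_key, tag_values=("",),
                       resource_types=("AWS::AllSupported",)):
    """Build the ResourceQuery argument for create_group().

    Hypothetical helper mirroring the query used in the cell above.
    """
    query = {
        "ResourceTypeFilters": list(resource_types),
        "TagFilters": [{"Key": tag_key, "Values": list(tag_values)}],
    }
    # The Query field must be a JSON string, not a nested dict.
    return {"Type": "TAG_FILTERS_1_0", "Query": json.dumps(query)}
```

Restricting the group to buckets only would be `tag_resource_query(app_key, resource_types=["AWS::S3::Bucket"])`, matching the commented-out alternative in the cell.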


### Creating an EMR cluster

In [9]:
emr_client = boto3.client('emr') #region_name='us-east-1'

cluster_id = emr_client.run_job_flow(Name=my_emr_cluster, 
    ReleaseLabel='emr-5.30.1',
    LogUri='s3://' + my_bucket + '/log/',
    Applications=[
        {
            'Name': 'Spark'
        },
    ],
    Instances={
        'InstanceGroups': [
            {
                'Name': "Master",
                'Market': 'ON_DEMAND',
                'InstanceRole': 'MASTER',
                'InstanceType': 'm4.large',
                'InstanceCount': 1,
            },
            {
                'Name': "Slave",
                'Market': 'ON_DEMAND',
                'InstanceRole': 'CORE',
                'InstanceType': 'm4.large',
                'InstanceCount': 2,
            }
        ],
        'KeepJobFlowAliveWhenNoSteps': False,
        'TerminationProtected': False,
    },
    Steps=[
        {
            'Name': 'Spark application',
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ["spark-submit", "--deploy-mode", "cluster",
                         "s3://" + my_bucket + "/model/" + files_to_upload[0]]
            }
        }
    ],
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    Tags=my_tag
)

In [10]:
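from datetime import datetime, timedelta, timezone

# Use a timezone-aware timestamp slightly in the past so that the cluster
# that was just started is included in the listing, then pick it by name.
clusters = emr_client.list_clusters(
        CreatedAfter = datetime.now(timezone.utc) - timedelta(hours=1)
)
my_cluster = [i for i in clusters['Clusters'] if i['Name'] == my_emr_cluster][0]
my_cluster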

{'Id': 'j-391EZQ2PVVO2Y',
 'Name': 'spark-escale-test-fk',
 'Status': {'State': 'STARTING',
  'StateChangeReason': {},
  'Timeline': {'CreationDateTime': datetime.datetime(2020, 8, 29, 11, 3, 47, 285000, tzinfo=tzlocal())}},
 'NormalizedInstanceHours': 0,
 'ClusterArn': 'arn:aws:elasticmapreduce:us-east-1:032594213725:cluster/j-391EZQ2PVVO2Y'}
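
Looking the cluster up by name works, but the response of `run_job_flow` already carries the new cluster's ID under the `JobFlowId` key, so the lookup can be skipped. A minimal sketch, assuming a hypothetical wrapper `launch_cluster`:

```python
def launch_cluster(emr_client, **job_flow_args):
    """Start an EMR cluster and return its cluster ID.

    Hypothetical wrapper; `emr_client` is expected to be boto3.client('emr')
    and `job_flow_args` the same arguments passed to run_job_flow above.
    """
    response = emr_client.run_job_flow(**job_flow_args)
    # run_job_flow returns a dict; the ID (e.g. 'j-...') is under 'JobFlowId'.
    return response["JobFlowId"]
```

The returned ID can be passed directly to `describe_cluster(ClusterId=...)`.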

In [11]:
import time

# Both states mean the cluster is gone; stop polling on either one.
TERMINAL_STATES = ('TERMINATED', 'TERMINATED_WITH_ERRORS')

response = emr_client.describe_cluster(ClusterId = my_cluster['Id'])
print('The current state is', response['Cluster']['Status']['State'], '-', datetime.today())
i = 0

# Poll once a minute, for at most 30 checks.
while response['Cluster']['Status']['State'] not in TERMINAL_STATES and i < 30:
    response = emr_client.describe_cluster(ClusterId = my_cluster['Id'])
    print('The current state is', response['Cluster']['Status']['State'], '-', datetime.today(), i)
    i += 1
    time.sleep(60)

The current state is STARTING - 2020-08-29 11:03:52.664575
The current state is STARTING - 2020-08-29 11:03:52.852447 0
The current state is STARTING - 2020-08-29 11:04:53.809218 1
The current state is STARTING - 2020-08-29 11:05:54.801854 2
The current state is STARTING - 2020-08-29 11:06:55.750853 3
The current state is STARTING - 2020-08-29 11:07:56.658490 4
The current state is STARTING - 2020-08-29 11:08:57.460025 5
The current state is STARTING - 2020-08-29 11:09:58.379422 6
The current state is RUNNING - 2020-08-29 11:10:59.240086 7
The current state is WAITING - 2020-08-29 11:12:00.187990 8
The current state is WAITING - 2020-08-29 11:13:01.101256 9
The current state is WAITING - 2020-08-29 11:14:02.056952 10
The current state is WAITING - 2020-08-29 11:15:02.984443 11
The current state is WAITING - 2020-08-29 11:16:03.894022 12
The current state is WAITING - 2020-08-29 11:17:04.938533 13
The current state is WAITING - 2020-08-29 11:18:05.849133 14
The current state is WAITING 
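
The polling loop can also be expressed with the waiter that boto3 provides for EMR. The sketch below assumes a hypothetical helper `wait_until_terminated`; the delay and attempt count mirror the 60-second sleep and 30 iterations of the loop above. Unlike the loop, the waiter prints nothing and raises `botocore.exceptions.WaiterError` if the cluster does not reach `TERMINATED` within the allowed attempts.

```python
def wait_until_terminated(emr_client, cluster_id,
                          delay_seconds=60, max_attempts=30):
    """Block until the EMR cluster reaches the TERMINATED state.

    Hypothetical helper; `emr_client` is expected to be boto3.client('emr').
    """
    waiter = emr_client.get_waiter("cluster_terminated")
    waiter.wait(
        ClusterId=cluster_id,
        # Poll every `delay_seconds`, giving up after `max_attempts` checks.
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )
```

It would be called as `wait_until_terminated(emr_client, my_cluster['Id'])`.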

## Deactivating/removing the AWS configuration

Removing the Resource Group

In [12]:
RG_client.delete_group(GroupName=my_resource_group)

{'ResponseMetadata': {'RequestId': 'f47db126-a691-43cf-8a73-ee4ee7970d95',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Sat, 29 Aug 2020 14:24:10 GMT',
   'content-type': 'application/json',
   'content-length': '147',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'f47db126-a691-43cf-8a73-ee4ee7970d95',
   'x-amz-apigw-id': 'SCS5pE2uIAMFYRw=',
   'x-amzn-trace-id': 'Root=1-5f4a650a-c9b749d88a6c6050f4d236e8'},
  'RetryAttempts': 0},
 'Group': {'GroupArn': 'arn:aws:resource-groups:us-east-1:032594213725:group/rg-escale-test-fk',
  'Name': 'rg-escale-test-fk'}}

Removing all files from the bucket

In [13]:
bucket = s3.Bucket(my_bucket)
bucket.objects.all().delete()

[{'ResponseMetadata': {'RequestId': 'E83DBB219EBF86D8',
   'HostId': '8rHRlAQHcfq7iadroQ8KkHOzPFTjqK5HgohOZOjHwn8zgU2j5fWxb/h16GGudhuCuCOMHOaaHUE=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': '8rHRlAQHcfq7iadroQ8KkHOzPFTjqK5HgohOZOjHwn8zgU2j5fWxb/h16GGudhuCuCOMHOaaHUE=',
    'x-amz-request-id': 'E83DBB219EBF86D8',
    'date': 'Sat, 29 Aug 2020 14:24:12 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'log/j-391EZQ2PVVO2Y/node/i-05f5bc674774734c1/applications/hadoop-hdfs/hadoop-hdfs-namenode-ip-172-31-62-174.out.gz'},
   {'Key': 'log/j-391EZQ2PVVO2Y/node/i-005b54950ba330ba0/applications/hadoop-hdfs/hadoop-hdfs-datanode-ip-172-31-55-26.out.gz'},
   {'Key': 'log/j-391EZQ2PVVO2Y/node/i-05f5bc674774734c1/daemons/instance-state/console.log-2020-08-29-14-05.gz'},
   {'Key': 'log/j-391EZQ2PVVO2Y/node/i-005b54950ba330ba0/provision-node/8836bf34-d308
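
Deleting all objects is enough for the bucket used here. If versioning were ever enabled on the bucket (an assumption, the notebook never enables it), old object versions would also have to be removed before the bucket could be deleted. A sketch covering both cases, with a hypothetical helper `empty_bucket`:

```python
def empty_bucket(bucket):
    """Delete every object in the bucket, including old versions if any.

    Hypothetical helper; `bucket` is expected to be a boto3 s3.Bucket resource.
    """
    # Same call as in the cell above: removes the current objects.
    bucket.objects.all().delete()
    # On a versioned bucket the call above only adds delete markers;
    # this removes the stored versions and the markers themselves.
    bucket.object_versions.delete()
```

Usage would be `empty_bucket(s3.Bucket(my_bucket))`.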

Removing the bucket

In [14]:
s3_client.delete_bucket(Bucket=my_bucket)

{'ResponseMetadata': {'RequestId': '56E6F52804EFDA2A',
  'HostId': 'yolByYwnyCnk62DATfXCGTdPwJ+YfrUNTfZdL7zuPtlWl8fL65hakSfjxSQZSdkxOPPgJ1mDiu8=',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': 'yolByYwnyCnk62DATfXCGTdPwJ+YfrUNTfZdL7zuPtlWl8fL65hakSfjxSQZSdkxOPPgJ1mDiu8=',
   'x-amz-request-id': '56E6F52804EFDA2A',
   'date': 'Sat, 29 Aug 2020 14:24:14 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}
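
The teardown steps above do not touch the EMR cluster itself. If the cluster is still alive (for example in the `WAITING` state seen in the polling output), it can be shut down explicitly with `terminate_job_flows`. A minimal sketch, assuming a hypothetical helper `terminate_cluster`:

```python
def terminate_cluster(emr_client, cluster_id):
    """Request termination of a single EMR cluster.

    Hypothetical helper; `emr_client` is expected to be boto3.client('emr').
    """
    # terminate_job_flows takes a list of cluster IDs and returns nothing;
    # the shutdown itself happens asynchronously.
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
```

For the cluster started in this notebook the call would be `terminate_cluster(emr_client, my_cluster['Id'])`.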