## Schedule the Pipeline Using Amazon EventBridge

Let us schedule the pipeline using Amazon EventBridge. We can set up the schedule from either the Lambda console or the EventBridge console.
* To catch up, we can first schedule the job to run every 2 minutes and later change the schedule to every 15 minutes.
* As we are using 45 days as the baseline, the job should catch up within a couple of hours.
* We will also clean up everything before scheduling the job.
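
As a rough check on the catch-up estimate, the arithmetic can be sketched as below. The helper `catch_up_minutes` is hypothetical, and it assumes each scheduled run ingests exactly one day of the baseline, which is an assumption about the job's behavior rather than something stated in this section.

```python
import math


def catch_up_minutes(baseline_days, schedule_minutes, days_per_run=1):
    """Estimate the minutes needed to work through the baseline.

    days_per_run=1 is an assumption: each scheduled run is taken to
    ingest one day of messages.
    """
    runs_needed = math.ceil(baseline_days / days_per_run)
    return runs_needed * schedule_minutes


print(catch_up_minutes(45, 2))   # 90 minutes at the 2-minute schedule
print(catch_up_minutes(45, 15))  # 675 minutes at the 15-minute schedule
```

Under that assumption, the 2-minute schedule clears the 45-day baseline in about an hour and a half, while starting at 15 minutes would take more than 11 hours.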

In [1]:
!aws s3 rm s3://itversitydata/messages --recursive

delete: s3://itversitydata/messages/part-1415c2d6-d4e3-11ec-8d5b-3e22fbd03f7b.json
delete: s3://itversitydata/messages/part-fef7d1d2-d4e2-11ec-8d5b-3e22fbd03f7b.json


In [2]:
import boto3

In [3]:
dynamodb = boto3.resource('dynamodb')

In [4]:
for table in dynamodb.tables.iterator():
    print(table)

dynamodb.Table(name='emails')
dynamodb.Table(name='ghmarker')
dynamodb.Table(name='ghrepos')
dynamodb.Table(name='gmail_job_run_details')
dynamodb.Table(name='gmail_jobs')
dynamodb.Table(name='posts')


In [5]:
jobs_table = dynamodb.Table('gmail_jobs')
# Remove the existing job entry (keyed by job_id) so its bookmark is reset
jobs_table.delete_item(Key={'job_id': 'gmail_ingest'})
item = {
    'job_id': 'gmail_ingest',
    'job_description': 'Ingest data from gmail to s3',
    'is_active': 'Y',
    'baseline_days': 45
}
jobs_table.put_item(Item=item)

{'ResponseMetadata': {'RequestId': '2USQFQMI7JBCNAJMN1UBQ4SVUNVV4KQNSO5AEMVJF66Q9ASUAAJG',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'server': 'Server',
   'date': 'Mon, 16 May 2022 06:53:15 GMT',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '2',
   'connection': 'keep-alive',
   'x-amzn-requestid': '2USQFQMI7JBCNAJMN1UBQ4SVUNVV4KQNSO5AEMVJF66Q9ASUAAJG',
   'x-amz-crc32': '2745614147'},
  'RetryAttempts': 0}}

In [6]:
jrd_table = dynamodb.Table('gmail_job_run_details')

In [7]:
for item in jrd_table.scan()['Items']:
    jrd_table.delete_item(Key={'job_id': item['job_id'], 'job_run_time': item['job_run_time']})
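
A single `scan` call returns at most 1 MB of data, so the loop above can miss items once the table grows. A minimal sketch of a cleanup that follows pagination is shown below; `delete_all_items` is a hypothetical helper, not part of the pipeline, and it expects a boto3 `Table` resource plus the names of the table's key attributes.

```python
def delete_all_items(table, key_names):
    """Delete every item in a DynamoDB table, following scan pagination.

    `table` is a boto3 DynamoDB Table resource; `key_names` lists the
    table's key attributes (partition key, plus sort key if any).
    Returns the number of delete requests issued.
    """
    deleted = 0
    scan_kwargs = {}
    while True:
        response = table.scan(**scan_kwargs)
        # batch_writer groups the deletes into BatchWriteItem requests
        with table.batch_writer() as batch:
            for item in response['Items']:
                batch.delete_item(Key={name: item[name] for name in key_names})
                deleted += 1
        # LastEvaluatedKey is present only when more pages remain
        last_key = response.get('LastEvaluatedKey')
        if last_key is None:
            return deleted
        scan_kwargs['ExclusiveStartKey'] = last_key
```

For the run details table this could be called as `delete_all_items(jrd_table, ['job_id', 'job_run_time'])`.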

* Here are the cron expressions to schedule the job every 2 minutes and every 15 minutes.

```
cron(0/2 * * * ? *)
cron(0/15 * * * ? *)
```
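
The same schedule can also be created programmatically instead of through the console. The following is a sketch only: `schedule_expression` and `schedule_lambda` are hypothetical helpers, and the rule name and Lambda function ARN passed to them are placeholders you would supply, since neither is named in this section.

```python
def schedule_expression(every_minutes):
    """Build a cron expression firing every N minutes from minute 0 of each hour."""
    if not 1 <= every_minutes <= 59:
        raise ValueError('every_minutes must be between 1 and 59')
    return f'cron(0/{every_minutes} * * * ? *)'


def schedule_lambda(rule_name, function_arn, every_minutes):
    """Create or update a scheduled rule and point it at a Lambda function."""
    import boto3  # imported here so schedule_expression works without boto3

    events = boto3.client('events')
    lambda_client = boto3.client('lambda')

    # put_rule creates the rule, or updates it if the name already exists
    rule = events.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule_expression(every_minutes),
        State='ENABLED',
    )
    # Allow EventBridge to invoke the function; this call fails if a
    # statement with the same StatementId already exists.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f'{rule_name}-invoke',
        Action='lambda:InvokeFunction',
        Principal='events.amazonaws.com',
        SourceArn=rule['RuleArn'],
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{'Id': '1', 'Arn': function_arn}],
    )
    return rule['RuleArn']
```

Because `put_rule` updates an existing rule of the same name, switching from the 2-minute schedule to the 15-minute one only needs another `put_rule` call with the new expression; the permission and target do not have to be added again.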

* Once the job is scheduled to run every 2 minutes to catch up, we can validate whether the emails from Gmail are being copied to S3.

In [8]:
!aws s3 ls s3://itversitydata/messages/

In [4]:
jobs_table = dynamodb.Table('gmail_jobs')

jobs_table.scan()

{'Items': [{'job_description': 'Ingest data from gmail to s3',
   'is_active': 'Y',
   'job_id': 'gmail_ingest',
   'baseline_days': Decimal('45'),
   'job_run_bookmark_details': {'last_run_max_message_id': '1800b8e80b0a8b02',
    'last_run_start_time_epoch': Decimal('1649376000'),
    'last_run_end_time_epoch': Decimal('1649462400')}}],
 'Count': 1,
 'ScannedCount': 1,
 'ResponseMetadata': {'RequestId': 'UOE2LCD50LDOSLNHCV9GBLBD3VVV4KQNSO5AEMVJF66Q9ASUAAJG',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'server': 'Server',
   'date': 'Mon, 16 May 2022 00:08:56 GMT',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '352',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'UOE2LCD50LDOSLNHCV9GBLBD3VVV4KQNSO5AEMVJF66Q9ASUAAJG',
   'x-amz-crc32': '962851235'},
  'RetryAttempts': 0}}
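
The bookmark values are easier to interpret as dates. The sketch below converts them with a hypothetical helper, `epoch_to_utc`; it assumes the stored values are epoch seconds in UTC, which the magnitudes suggest but the section does not state.

```python
from datetime import datetime, timezone
from decimal import Decimal


def epoch_to_utc(epoch):
    """Convert an epoch value in seconds (int or Decimal) to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(int(epoch), tz=timezone.utc).isoformat()


# Values copied from the scan output above
bookmark = {
    'last_run_start_time_epoch': Decimal('1649376000'),
    'last_run_end_time_epoch': Decimal('1649462400'),
}
for name, value in bookmark.items():
    print(name, epoch_to_utc(value))
# last_run_start_time_epoch 2022-04-08T00:00:00+00:00
# last_run_end_time_epoch 2022-04-09T00:00:00+00:00
```

Read this way, the last run covered a single one-day window, from 8 April 2022 to 9 April 2022.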

In [5]:
jrd_table = dynamodb.Table('gmail_job_run_details')

jrd_table.scan()

{'Items': [{'job_id': 'gmail_ingest',
   'file_name': None,
   'job_run_bookmark_details': {'start_time_epoch': Decimal('1648684800'),
    'end_time_epoch': Decimal('1648771200'),
    'max_message_id': '17fe245c9e5f60c3'},
   'rows_processed': Decimal('77'),
   'job_run_time': Decimal('1652658769')},
  {'job_id': 'gmail_ingest',
   'file_name': None,
   'job_run_bookmark_details': {'start_time_epoch': Decimal('1648771200'),
    'end_time_epoch': Decimal('1648857600'),
    'max_message_id': '17fe77331a39bff5'},
   'rows_processed': Decimal('89'),
   'job_run_time': Decimal('1652658886')},
  {'job_id': 'gmail_ingest',
   'file_name': None,
   'job_run_bookmark_details': {'start_time_epoch': Decimal('1648857600'),
    'end_time_epoch': Decimal('1648944000'),
    'max_message_id': '17fec72ffbd966a0'},
   'rows_processed': Decimal('38'),
   'job_run_time': Decimal('1652659006')},
  {'job_id': 'gmail_ingest',
   'file_name': None,
   'job_run_bookmark_details': {'start_time_epoch': Decimal('