# Scheduling

In [1]:
import dask
import os
from pandas import DataFrame
from boto3 import resource

from grizly.scheduling.registry import Job
from grizly import Email, S3, config
import logging

In [2]:
# os.environ["GRIZLY_REDIS_HOST"] = "10.125.68.177"
os.environ["GRIZLY_REDIS_HOST"] = "pytest_redis"
os.environ["GRIZLY_DASK_SCHEDULER_ADDRESS"] = "10.125.68.177:8999"

## Register jobs

Before you register a job, you have to define the tasks that the job will run. Let's define a function that returns the last-modified date of a file in S3.

In [11]:
@dask.delayed
def get_last_modified_date(full_s3_key):
    """Return the last-modified timestamp of an S3 object as a string."""
    import time

    bucket = "acoe-s3"
    date = resource("s3").Object(bucket, full_s3_key).last_modified
    time.sleep(5)  # artificial 5-second delay
    return str(date)

def main():
    return get_last_modified_date(full_s3_key="grizly/test_scheduling.csv").compute()

In [5]:
task = get_last_modified_date(full_s3_key="grizly/test_scheduling.csv")

Jobs that listen for changes are called **listener jobs**. A good practice is to end their names with the `listener` suffix so that they are easy to list.
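
To illustrate why the suffix helps, here is a minimal sketch that filters a plain list of job names. The list is hard-coded for the example (the names are the ones used in this notebook); how the names are fetched from the registry is not shown here.

```python
# Hypothetical list of job names; in practice these would come from the registry.
job_names = [
    "s3_grizly_test_scheduling_listener",
    "email_upstream_success",
    "email_upstream_result_change",
]

# Keep only the jobs that follow the `listener` naming convention.
listener_jobs = [name for name in job_names if name.endswith("listener")]
print(listener_jobs)
```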

In [12]:
job = Job("s3_grizly_test_scheduling_listener")

job.register(main, 
#              "grizly/test_scheduling.csv",
             if_exists="replace"
            )

2020-12-07 16:24:12,463 | INFO : Job s3_grizly_test_scheduling_listener successfully removed from registry
2020-12-07 16:24:12,472 | INFO : Job s3_grizly_test_scheduling_listener successfully registered


Job(name='s3_grizly_test_scheduling_listener')

We just registered a job called `s3_grizly_test_scheduling_listener`. The name of a job is unique, and you can always check its details with the `info()` method.

In [5]:
job = Job("s3_grizly_test_scheduling_listener")

job.info()

name: s3_grizly_test_scheduling_listener
owner: None
description: None
timeout: 3600
created_at: 2020-12-07 16:08:37.715409+00:00
crons: []
downstream: {}
upstream: {}
triggers: []


As you can see, this job is not scheduled yet: it has no cron strings, no upstream jobs, and no triggers. You can pass these parameters during registration or overwrite them later using the `crons`, `upstream`, or `triggers` attributes.

## Add cron string

Let's now add a cron string to our job so that it runs every two hours. You can generate a cron string using this website: https://crontab.guru/.

In [9]:
job.crons = "0 */2 * * *"

job.info()

name: s3_grizly_test_scheduling_listener
owner: None
description: None
timeout: 3600
created_at: 2020-12-07 16:05:50.334821+00:00
crons: ['0 */2 * * *']
downstream: {}
upstream: {}
triggers: []
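
To make the `0 */2 * * *` expression concrete: the five fields are minute, hour, day of month, month, and day of week, so this string means "at minute 0 of every second hour". The sketch below expands a single cron field using only the standard library. `expand_field` is a helper written for this illustration; it handles only `*`, `*/n`, and a plain number, and it is not part of grizly.

```python
def expand_field(field, lo, hi):
    """Expand one cron field into the sorted list of values it matches.

    Illustrative helper: supports only '*', '*/n' and a single number.
    """
    if field == "*":
        return list(range(lo, hi + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(lo, hi + 1, step))
    return [int(field)]


minute, hour, day, month, weekday = "0 */2 * * *".split()

minutes = expand_field(minute, 0, 59)
hours = expand_field(hour, 0, 23)

# Every HH:MM at which the job is due on any given day.
run_times = [f"{h:02d}:{m:02d}" for h in hours for m in minutes]
print(run_times)
```

The job is due twelve times a day, at 00:00, 02:00, and so on up to 22:00.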


## Submit job

You can run your job immediately using the `submit()` method.

In [14]:
job.submit(scheduler_address="10.125.68.177:8999")

2020-12-07 16:24:43,041 | INFO : Submitting job s3_grizly_test_scheduling_listener...
2020-12-07 16:24:48,552 | INFO : Job s3_grizly_test_scheduling_listener finished with status success


'2020-12-01 08:35:52+00:00'

## Check job's last run details

After the first run, you can access the `last_run` property, which holds information about the most recent run of your job.

In [8]:
job.last_run.info()

id: 2
name: None
created_at: 2020-12-07 16:18:22.445392+00:00
finished_at: 2020-12-07 16:18:24.343976+00:00
duration: 1
status: success
error: None
result: 2020-12-01 08:35:52+00:00


Let's now update the file and run the job again.

In [10]:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = DataFrame(data=d)

s3 = S3(s3_key="grizly/", file_name="test_scheduling.csv").from_df(df)

2020-12-01 08:35:20,698 | INFO : Found credentials in shared credentials file: ~/.aws/credentials
2020-12-01 08:35:22,167 | INFO : Successfully uploaded 'test_scheduling.csv' to S3


In [11]:
job.submit(scheduler_address="10.125.68.177:8999")

2020-12-01 08:35:23,838 | INFO : Submitting job s3_grizly_test_scheduling_listener...
2020-12-01 08:35:26,598 | INFO : Job s3_grizly_test_scheduling_listener finished with status success


['2020-12-01 08:35:22+00:00']

In [12]:
job.last_run.info()

id: 2
name: None
created_at: 2020-12-01 08:35:24.074831+00:00
finished_at: 2020-12-01 08:35:26.600356+00:00
duration: 1
status: success
error: None
result: ['2020-12-01 08:35:22+00:00']


## Register jobs with upstream job

Let's now register two jobs that have `s3_grizly_test_scheduling_listener` as their upstream job. One will send an email whenever the upstream job finishes with status `success`; the other will send an email whenever the upstream job's result changes.

In [13]:
@dask.delayed
def send_email(subject, body, to):
    logger = logging.getLogger("distributed.worker").getChild("email")
    e = Email(subject=subject, body=body, logger=logger)
    e.send(to=to)

In [14]:
to = config.get_service("email").get("address")

task = send_email(subject="Job success",
                   body="Job `s3_grizly_test_scheduling_listener` finished with status success.", 
                   to=to)

job = Job("email_upstream_success")

job.register(tasks=[task], 
             if_exists="replace",
             upstream={"s3_grizly_test_scheduling_listener": "success"}
             )

job.info()

2020-12-01 08:35:32,926 | INFO : Job email_upstream_success successfully registered


name: email_upstream_success
owner: None
description: None
timeout: 3600
created_at: 2020-12-01 08:35:30.801676+00:00
crons: []
downstream: {}
upstream: {'s3_grizly_test_scheduling_listener': 'success'}
triggers: []


In [15]:
to = config.get_service("email").get("address")

task = send_email(subject="File changed",
                   body="Somebody changed 'grizly/test_scheduling.csv' file!", 
                   to=to)

job = Job("email_upstream_result_change")

job.register(tasks=[task], 
               if_exists="replace",
               upstream={"s3_grizly_test_scheduling_listener": "result_change"}
              )

job.info()

2020-12-01 08:35:37,652 | INFO : Job email_upstream_result_change successfully registered


name: email_upstream_result_change
owner: None
description: None
timeout: 3600
created_at: 2020-12-01 08:35:35.535265+00:00
crons: []
downstream: {}
upstream: {'s3_grizly_test_scheduling_listener': 'result_change'}
triggers: []


You can see now that `s3_grizly_test_scheduling_listener` has two downstream jobs.

In [16]:
job = Job("s3_grizly_test_scheduling_listener")
job.info()

name: s3_grizly_test_scheduling_listener
owner: None
description: None
timeout: 3600
created_at: 2020-12-01 08:35:06.592377+00:00
crons: ['0 */2 * * *']
downstream: {'email_upstream_success': 'success', 'email_upstream_result_change': 'result_change'}
upstream: {}
triggers: []


Let's now submit the listener job.

In [17]:
job.submit(scheduler_address="10.125.68.177:8999")

2020-12-01 08:35:43,590 | INFO : Submitting job s3_grizly_test_scheduling_listener...
2020-12-01 08:35:46,352 | INFO : Job s3_grizly_test_scheduling_listener finished with status success
2020-12-01 08:35:50,294 | INFO : Job email_upstream_success has been enqueued


['2020-12-01 08:35:22+00:00']

As you can see, the `s3_grizly_test_scheduling_listener` job finished with status success and enqueued its downstream job `email_upstream_success`. Let's now change the file in S3 and run our listener job again.

In [18]:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = DataFrame(data=d)

s3 = S3(s3_key="grizly/", file_name="test_scheduling.csv").from_df(df)

2020-12-01 08:35:51,819 | INFO : Successfully uploaded 'test_scheduling.csv' to S3


In [19]:
job.submit(scheduler_address="10.125.68.177:8999")

2020-12-01 08:35:53,480 | INFO : Submitting job s3_grizly_test_scheduling_listener...
2020-12-01 08:35:56,230 | INFO : Job s3_grizly_test_scheduling_listener finished with status success
2020-12-01 08:36:00,158 | INFO : Job email_upstream_success has been enqueued
2020-12-01 08:36:01,920 | INFO : Job email_upstream_result_change has been enqueued


['2020-12-01 08:35:52+00:00']
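
The two runs above suggest the rule behind each upstream condition: `success` fires whenever the upstream run succeeds, while `result_change` fires only when the result differs from that of the previous run. The following is a minimal sketch of that rule in plain Python. The function `downstream_to_enqueue` and its arguments are hypothetical and written for this illustration; this is not grizly's actual implementation.

```python
def downstream_to_enqueue(downstream, status, previous_result, new_result):
    """Return the downstream job names whose condition is met.

    `downstream` maps job name -> condition, like the `downstream`
    field printed by `job.info()`. Hypothetical helper for illustration.
    """
    enqueued = []
    for name, condition in downstream.items():
        if condition == "success" and status == "success":
            enqueued.append(name)
        elif condition == "result_change" and new_result != previous_result:
            enqueued.append(name)
    return enqueued


downstream = {
    "email_upstream_success": "success",
    "email_upstream_result_change": "result_change",
}

# First run: the file is unchanged, so the result equals the previous one.
unchanged = downstream_to_enqueue(
    downstream, "success",
    "2020-12-01 08:35:22+00:00", "2020-12-01 08:35:22+00:00",
)

# Second run: the file was re-uploaded, so the last-modified date differs.
changed = downstream_to_enqueue(
    downstream, "success",
    "2020-12-01 08:35:22+00:00", "2020-12-01 08:35:52+00:00",
)
print(unchanged)
print(changed)
```

With the unchanged result only `email_upstream_success` is selected; with the changed result both downstream jobs are, which matches the log output of the two submissions.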

## Unregister jobs

In [20]:
Job("s3_grizly_test_scheduling_listener").unregister(remove_job_runs=True)
Job("email_upstream_success").unregister(remove_job_runs=True)
Job("email_upstream_result_change").unregister(remove_job_runs=True)

2020-12-01 08:36:10,159 | INFO : Job s3_grizly_test_scheduling_listener successfully removed from registry
2020-12-01 08:36:14,514 | INFO : Job email_upstream_success successfully removed from registry
2020-12-01 08:36:17,853 | INFO : Job email_upstream_result_change successfully removed from registry
