# Scheduling

In [1]:
import dask
import os
from pandas import DataFrame
from boto3 import resource

from grizly.scheduling.registry import Job
from grizly import Email, S3, config
import logging

In [2]:
os.environ["GRIZLY_REDIS_HOST"] = "10.125.68.177"

## Register jobs

Before you register a job, you have to define the tasks that it will run. Let's define a function that returns the last-modified date of a file in S3.

In [3]:
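@dask.delayed
def get_last_modified_date(full_s3_key):
    # Read the bucket name from the grizly config
    bucket = config.get_service("s3").get("bucket")
    # Ask S3 for the object's last-modified timestamp
    date = resource("s3").Object(bucket, full_s3_key).last_modified
    # Convert the datetime to a string before returning it
    return str(date)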

In [4]:
task = get_last_modified_date(full_s3_key="grizly/test_scheduling.csv")

Jobs that listen for changes are called **listener jobs**. It is good practice to start their names with the `listener` prefix so that they are easy to list.

In [5]:
job = Job("listener_s3_grizly_test_scheduling")

job.register(tasks=[task], if_exists="replace")

2020-11-18 14:16:22,026 | INFO : Job listener_s3_grizly_test_scheduling successfully removed from registry
2020-11-18 14:16:23,008 | INFO : Job listener_s3_grizly_test_scheduling successfully registered


Job(name='listener_s3_grizly_test_scheduling')

We just registered a job called `listener_s3_grizly_test_scheduling`. Job names are unique, and you can always check a job's details with the `info()` method.

In [6]:
job = Job("listener_s3_grizly_test_scheduling")

job.info()

name: listener_s3_grizly_test_scheduling
owner: None
description: None
timeout: 3600
created_at: 2020-11-18 14:16:22.028037+00:00
crons: []
downstream: {}
upstream: {}
triggers: []


As you can see, this job is not scheduled yet: it has no cron strings, no upstream jobs and no triggers. You can pass these parameters during registration or overwrite them later using the `crons`, `upstream` or `triggers` attributes.

## Add cron string

Let's now add a cron string so that the job runs every two hours. You can generate cron strings at https://crontab.guru/.

In [7]:
job.crons = "0 */2 * * *"

job.info()

name: listener_s3_grizly_test_scheduling
owner: None
description: None
timeout: 3600
created_at: 2020-11-18 14:16:22.028037+00:00
crons: ['0 */2 * * *']
downstream: {}
upstream: {}
triggers: []
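
For readers unfamiliar with cron syntax, the five fields are minute, hour, day of month, month and day of week. The sketch below is a minimal, stdlib-only illustration of how the hour field `*/2` expands. `expand_cron_field` is a hypothetical helper written for this example, not part of grizly, and it only handles `*`, `*/n` and a single number.

```python
def expand_cron_field(field, low, high):
    """Expand a single cron field into the list of values it matches.

    Hypothetical helper for illustration only; it supports "*", "*/n" and a
    single integer, which is enough for "0 */2 * * *".
    """
    if field == "*":
        return list(range(low, high + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(low, high + 1, step))
    return [int(field)]


# Split the cron string into its five fields.
minute, hour, day_of_month, month, day_of_week = "0 */2 * * *".split()

minutes = expand_cron_field(minute, 0, 59)
hours = expand_cron_field(hour, 0, 23)

# The job is due at minute 0 of every second hour, starting at midnight.
print(minutes)  # [0]
print(hours)    # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22]
```

The remaining three fields are all `*`, so the schedule applies on every day of every month.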


## Submit job

You can run your job immediately using the `submit()` method.

In [8]:
job.submit()

Mismatched versions found

+-------------+--------+-----------+---------+
| Package     | client | scheduler | workers |
+-------------+--------+-----------+---------+
| cloudpickle | 1.4.1  | 1.6.0     | 1.4.1   |
+-------------+--------+-----------+---------+
2020-11-18 14:16:32,786 | INFO : Submitting job listener_s3_grizly_test_scheduling...
2020-11-18 14:16:37,670 | INFO : Job listener_s3_grizly_test_scheduling finished with status success


['2020-11-18 14:15:19+00:00']

## Check job's last run details

After the first run, you can access the `last_run` property, which holds information about the most recent run of your job.

In [9]:
job.last_run.info()

id: 1
name: None
created_at: 2020-11-18 14:16:33.115320+00:00
finished_at: 2020-11-18 14:16:37.671481+00:00
duration: 3
status: success
error: None
result: ['2020-11-18 14:15:19+00:00']


Let's now update the file and run the job again.

In [10]:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = DataFrame(data=d)

s3 = S3(s3_key="grizly/", file_name="test_scheduling.csv").from_df(df)

2020-11-18 14:16:41,550 | INFO : Found credentials in shared credentials file: ~/.aws/credentials
2020-11-18 14:16:45,338 | INFO : Successfully uploaded 'test_scheduling.csv' to S3


In [11]:
job.submit()

2020-11-18 14:16:46,512 | INFO : Submitting job listener_s3_grizly_test_scheduling...
2020-11-18 14:16:51,305 | INFO : Job listener_s3_grizly_test_scheduling finished with status success


['2020-11-18 14:16:46+00:00']

In [12]:
job.last_run.info()

id: 2
name: None
created_at: 2020-11-18 14:16:46.799387+00:00
finished_at: 2020-11-18 14:16:51.306814+00:00
duration: 3
status: success
error: None
result: ['2020-11-18 14:16:46+00:00']
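
The `result` of each run is a list holding the string returned by `get_last_modified_date`, so the results of two runs can be compared directly. As a minimal stdlib sketch (the variable names are illustrative and the values are copied from runs 1 and 2 above), the strings can also be parsed back into timezone-aware datetimes:

```python
from datetime import datetime

# Results copied from the two job runs shown above.
first_run_result = ["2020-11-18 14:15:19+00:00"]
second_run_result = ["2020-11-18 14:16:46+00:00"]

# A plain comparison is enough to tell that the file was modified.
result_changed = first_run_result != second_run_result

# Parsing the strings lets us measure how far apart the modifications were.
first_modified = datetime.fromisoformat(first_run_result[0])
second_modified = datetime.fromisoformat(second_run_result[0])
seconds_between = (second_modified - first_modified).total_seconds()

print(result_changed)   # True
print(seconds_between)  # 87.0
```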


## Register jobs with upstream job

Let's now register two jobs with `listener_s3_grizly_test_scheduling` as their upstream job. One will send an email whenever the upstream job finishes with status `success`, and the other will send an email whenever the upstream job's result changes.

In [13]:
@dask.delayed
def send_email(subject, body, to):
    logger = logging.getLogger("distributed.worker").getChild("email")
    e = Email(subject=subject, body=body, logger=logger)
    e.send(to=to)

In [14]:
to = config.get_service("email").get("address")

task = send_email(subject="Job success",
                   body="Job `listener_s3_grizly_test_scheduling` finished with status success.", 
                   to=to)

job = Job("email_upstream_succcess")

job.register(tasks=[task], 
             if_exists="replace",
             upstream={"listener_s3_grizly_test_scheduling": "success"}
             )

job.info()

2020-11-18 14:17:02,071 | INFO : Job email_upstream_succcess successfully removed from registry
2020-11-18 14:17:04,937 | INFO : Job email_upstream_succcess successfully registered


name: email_upstream_succcess
owner: None
description: None
timeout: 3600
created_at: 2020-11-18 14:17:02.355410+00:00
crons: []
downstream: {}
upstream: {'listener_s3_grizly_test_scheduling': 'success'}
triggers: []


In [15]:
to = config.get_service("email").get("address")

task = send_email(subject="File changed",
                   body="Somebody changed 'grizly/test_scheduling.csv' file!", 
                   to=to)

job = Job("email_upstream_result_change")

job.register(tasks=[task],
             if_exists="replace",
             upstream={"listener_s3_grizly_test_scheduling": "result_change"}
             )

job.info()

2020-11-18 14:17:12,310 | INFO : Job email_upstream_result_change successfully removed from registry
2020-11-18 14:17:15,410 | INFO : Job email_upstream_result_change successfully registered


name: email_upstream_result_change
owner: None
description: None
timeout: 3600
created_at: 2020-11-18 14:17:12.605310+00:00
crons: []
downstream: {}
upstream: {'listener_s3_grizly_test_scheduling': 'result_change'}
triggers: []


You can see now that `listener_s3_grizly_test_scheduling` has two downstream jobs.

In [16]:
job = Job("listener_s3_grizly_test_scheduling")
job.info()

name: listener_s3_grizly_test_scheduling
owner: None
description: None
timeout: 3600
created_at: 2020-11-18 14:16:22.028037+00:00
crons: ['0 */2 * * *']
downstream: {'email_upstream_succcess': 'success', 'email_upstream_result_change': 'result_change'}
upstream: {}
triggers: []


Let's now submit the listener job.

In [17]:
job.submit()

2020-11-18 14:17:23,453 | INFO : Submitting job listener_s3_grizly_test_scheduling...
2020-11-18 14:17:28,069 | INFO : Job listener_s3_grizly_test_scheduling finished with status success
2020-11-18 14:17:33,259 | INFO : Job email_upstream_succcess has been enqueued


['2020-11-18 14:16:46+00:00']

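As you can see, the `listener_s3_grizly_test_scheduling` job finished with status success and enqueued its downstream job `email_upstream_succcess`. The other downstream job, `email_upstream_result_change`, was not enqueued because the result is the same as in the previous run. Let's now change the file in S3 and run our listener job again.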

In [18]:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = DataFrame(data=d)

s3 = S3(s3_key="grizly/", file_name="test_scheduling.csv").from_df(df)

2020-11-18 14:17:37,077 | INFO : Successfully uploaded 'test_scheduling.csv' to S3


In [19]:
job.submit()

2020-11-18 14:17:38,385 | INFO : Submitting job listener_s3_grizly_test_scheduling...
2020-11-18 14:17:43,419 | INFO : Job listener_s3_grizly_test_scheduling finished with status success
2020-11-18 14:17:48,239 | INFO : Job email_upstream_succcess has been enqueued
2020-11-18 14:17:50,439 | INFO : Job email_upstream_result_change has been enqueued


['2020-11-18 14:17:37+00:00']
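
The last two runs suggest how the upstream conditions behave: `success` fires on every successful run, while `result_change` fires only when the returned value differs from the previous run. The following is a small model of that observed behavior in plain Python. `jobs_to_enqueue` and its arguments are hypothetical, and this is not grizly's actual implementation. Whether a failed run can fire `result_change` is not shown above, so the model simply ignores the status for that condition.

```python
def jobs_to_enqueue(downstream, status, previous_result, current_result):
    """Return the downstream job names whose trigger condition is met.

    Hypothetical model of the behaviour seen in the logs above:
    - "success" matches when the upstream run finished with status success
    - "result_change" matches when the result differs from the previous run
    """
    enqueued = []
    for job_name, condition in downstream.items():
        if condition == "success" and status == "success":
            enqueued.append(job_name)
        elif condition == "result_change" and current_result != previous_result:
            enqueued.append(job_name)
    return enqueued


downstream = {
    "email_upstream_succcess": "success",
    "email_upstream_result_change": "result_change",
}

# Run from In [17]: the file was not modified since the previous run.
unchanged = jobs_to_enqueue(
    downstream,
    "success",
    previous_result=["2020-11-18 14:16:46+00:00"],
    current_result=["2020-11-18 14:16:46+00:00"],
)

# Run from In [19]: the file was uploaded again, so the date changed.
changed = jobs_to_enqueue(
    downstream,
    "success",
    previous_result=["2020-11-18 14:16:46+00:00"],
    current_result=["2020-11-18 14:17:37+00:00"],
)

print(unchanged)  # ['email_upstream_succcess']
print(changed)    # ['email_upstream_succcess', 'email_upstream_result_change']
```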

## Unregister jobs

In [20]:
Job("listener_s3_grizly_test_scheduling").unregister(remove_job_runs=True)
Job("email_upstream_succcess").unregister(remove_job_runs=True)
Job("email_upstream_result_change").unregister(remove_job_runs=True)

2020-11-18 14:18:00,619 | INFO : Job listener_s3_grizly_test_scheduling successfully removed from registry
2020-11-18 14:18:06,036 | INFO : Job email_upstream_succcess successfully removed from registry
2020-11-18 14:18:10,217 | INFO : Job email_upstream_result_change successfully removed from registry
