# Scheduling

In [1]:
import dask
import os
from pandas import DataFrame
from boto3 import resource

from grizly.scheduling.registry import Job
from grizly import Email, S3, config
import logging

In [2]:
os.environ["GRIZLY_REDIS_HOST"] = "10.125.68.177"

## Register jobs

Before you register a job, you have to define the function that the job will run. Let's define a function that returns the last modified date of a file in S3.

In [3]:
def get_last_modified_date():
    bucket = "acoe-s3"
    date = resource("s3").Object(bucket, "grizly/test_scheduling.csv").last_modified
    return str(date)

Jobs that listen for changes are called **listener jobs**. It is good practice to end their names with the `listener` suffix so that they are easy to list.
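
As a rough illustration of why the suffix helps, the sketch below filters a plain list of job names with `str.endswith`. The `job_names` list is a hypothetical stand-in for names you would obtain from your own registry. No grizly API is used or implied here.

```python
# Hypothetical job names; in practice these would come from your job registry.
job_names = [
    "s3_grizly_test_scheduling_listener",
    "email_upstream_success",
    "email_upstream_result_change",
]

# Keep only the names that follow the "listener" suffix convention.
listener_jobs = [name for name in job_names if name.endswith("listener")]
print(listener_jobs)
```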

In [4]:
job = Job("s3_grizly_test_scheduling_listener")

job.register(func=get_last_modified_date, 
             if_exists="replace"
            )

2020-12-10 17:22:38,207 - grizly.scheduling.registry - INFO - Job s3_grizly_test_scheduling_listener successfully removed from registry
2020-12-10 17:22:39,146 - grizly.scheduling.registry - INFO - Job s3_grizly_test_scheduling_listener successfully registered


Job(name='s3_grizly_test_scheduling_listener')

We just registered a job called `s3_grizly_test_scheduling_listener`. Job names are unique, and you can always check a job's details with the `info()` method.

In [5]:
job = Job("s3_grizly_test_scheduling_listener")

job.info()

name: s3_grizly_test_scheduling_listener
owner: None
description: None
timeout: 3600
created_at: 2020-12-10 17:22:38.209627+00:00
crons: []
downstream: {}
upstream: {}
triggers: []


As you can see, this job is not scheduled yet: it has no cron strings, no upstream jobs, and no triggers. You can pass these parameters during registration or overwrite them later using the `crons`, `upstream`, or `triggers` attributes.

## Add cron string

Let's now add a cron string so that our job runs every two hours. You can generate cron strings using https://crontab.guru/.

In [6]:
job.crons = "0 */2 * * *"

job.info()

name: s3_grizly_test_scheduling_listener
owner: None
description: None
timeout: 3600
created_at: 2020-12-10 17:22:38.209627+00:00
crons: ['0 */2 * * *']
downstream: {}
upstream: {}
triggers: []
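
To see what `0 */2 * * *` means, the sketch below expands the minute and hour fields by hand. The helper `expand_cron_field` is hypothetical and written only for this illustration. It handles `*`, `*/n`, single values, and comma lists, but not ranges such as `1-5`. It is not how grizly parses cron strings.

```python
def expand_cron_field(field, low, high):
    """Expand one cron field ('*', '*/n', 'a', or 'a,b') into a sorted list of ints."""
    values = set()
    for part in field.split(","):
        if part == "*":
            # Every value in the allowed range.
            values.update(range(low, high + 1))
        elif part.startswith("*/"):
            # Every n-th value, starting from the lowest allowed value.
            step = int(part[2:])
            values.update(range(low, high + 1, step))
        else:
            # A single literal value.
            values.add(int(part))
    return sorted(values)


# The first two fields of a cron string are minute and hour.
minute, hour, *_ = "0 */2 * * *".split()
minutes = expand_cron_field(minute, 0, 59)
hours = expand_cron_field(hour, 0, 23)
print(minutes)
print(hours)
```

The job is therefore due at minute 0 of every even hour (00:00, 02:00, and so on up to 22:00).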


## Submit job

You can run your job immediately using the `submit()` method.

In [7]:
job.submit(scheduler_address="10.125.68.177:8999")

2020-12-10 17:22:49,912 - grizly.scheduling.registry - INFO - Submitting job s3_grizly_test_scheduling_listener...
2020-12-10 17:22:51,572 - botocore.credentials - INFO - Found credentials in shared credentials file: ~/.aws/credentials
2020-12-10 17:22:55,667 - grizly.scheduling.registry - INFO - Job s3_grizly_test_scheduling_listener finished with status success


'2020-12-10 17:21:22+00:00'

## Check job's last run details

After the first run, you can access the `last_run` property, which holds information about the last run of your job.

In [8]:
job.last_run.info()

id: 1
name: None
created_at: 2020-12-10 17:22:50.261603+00:00
finished_at: 2020-12-10 17:22:55.668267+00:00
duration: 4
status: success
error: None
result: 2020-12-10 17:21:22+00:00


Let's now update the file and run the job again.

In [9]:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = DataFrame(data=d)

s3 = S3(s3_key="grizly/", file_name="test_scheduling.csv").from_df(df)

2020-12-10 17:23:02,719 - grizly.sources.filesystem.old_s3 - INFO - Successfully uploaded 'test_scheduling.csv' to S3


In [10]:
job.submit(scheduler_address="10.125.68.177:8999")

2020-12-10 17:23:04,934 - grizly.scheduling.registry - INFO - Submitting job s3_grizly_test_scheduling_listener...
2020-12-10 17:23:10,431 - grizly.scheduling.registry - INFO - Job s3_grizly_test_scheduling_listener finished with status success


'2020-12-10 17:23:03+00:00'

In [11]:
job.last_run.info()

id: 2
name: None
created_at: 2020-12-10 17:23:05.253426+00:00
finished_at: 2020-12-10 17:23:10.432068+00:00
duration: 4
status: success
error: None
result: 2020-12-10 17:23:03+00:00


## Register jobs with upstream job

Let's now register two jobs whose upstream job is `s3_grizly_test_scheduling_listener`. One will send an email whenever the upstream job finishes with status `success`, and the other will send an email whenever the upstream job's result changes.

In [12]:
def send_email(subject, body, to):
    logger = logging.getLogger("grizly").getChild("email")
    e = Email(subject=subject, body=body, logger=logger)
    e.send(to=to)
    
to = config.get_service("email").get("address")

In [13]:
job = Job("email_upstream_success")

job.register(func=send_email, 
             subject="Job success",
             body="Job `s3_grizly_test_scheduling_listener` finished with status success.", 
             to=to,
             if_exists="replace",
             upstream={"s3_grizly_test_scheduling_listener": "success"}
             )

job.info()

2020-12-10 17:23:18,627 - grizly.scheduling.registry - INFO - Job email_upstream_success successfully registered


name: email_upstream_success
owner: None
description: None
timeout: 3600
created_at: 2020-12-10 17:23:15.796617+00:00
crons: []
downstream: {}
upstream: {'s3_grizly_test_scheduling_listener': 'success'}
triggers: []


In [14]:
job = Job("email_upstream_result_change")

job.register(func=send_email, 
             subject="File changed",
             body="Somebody changed 'grizly/test_scheduling.csv' file!", 
             to=to,
             if_exists="replace",
             upstream={"s3_grizly_test_scheduling_listener": "result_change"}
            )

job.info()

2020-12-10 17:23:24,944 - grizly.scheduling.registry - INFO - Job email_upstream_result_change successfully registered


name: email_upstream_result_change
owner: None
description: None
timeout: 3600
created_at: 2020-12-10 17:23:22.105970+00:00
crons: []
downstream: {}
upstream: {'s3_grizly_test_scheduling_listener': 'result_change'}
triggers: []


You can see now that `s3_grizly_test_scheduling_listener` has two downstream jobs.

In [15]:
job = Job("s3_grizly_test_scheduling_listener")
job.info()

name: s3_grizly_test_scheduling_listener
owner: None
description: None
timeout: 3600
created_at: 2020-12-10 17:22:38.209627+00:00
crons: ['0 */2 * * *']
downstream: {'email_upstream_success': 'success', 'email_upstream_result_change': 'result_change'}
upstream: {}
triggers: []


Let's now submit the listener job.

In [16]:
job.submit(scheduler_address="10.125.68.177:8999")

2020-12-10 17:23:32,821 - grizly.scheduling.registry - INFO - Submitting job s3_grizly_test_scheduling_listener...
2020-12-10 17:23:38,143 - grizly.scheduling.registry - INFO - Job s3_grizly_test_scheduling_listener finished with status success
2020-12-10 17:23:43,995 - grizly.scheduling.registry - INFO - Job email_upstream_success has been enqueued


'2020-12-10 17:23:03+00:00'

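As you can see, the `s3_grizly_test_scheduling_listener` job finished with status success and enqueued its downstream job `email_upstream_success`. Its result was the same as in the previous run, so `email_upstream_result_change` was not enqueued. Let's now change the file in S3 and run our listener job again.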

In [17]:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = DataFrame(data=d)

s3 = S3(s3_key="grizly/", file_name="test_scheduling.csv").from_df(df)

2020-12-10 17:23:47,104 - grizly.sources.filesystem.old_s3 - INFO - Successfully uploaded 'test_scheduling.csv' to S3


In [18]:
job.submit(scheduler_address="10.125.68.177:8999")

2020-12-10 17:23:49,326 - grizly.scheduling.registry - INFO - Submitting job s3_grizly_test_scheduling_listener...
2020-12-10 17:23:54,687 - grizly.scheduling.registry - INFO - Job s3_grizly_test_scheduling_listener finished with status success
2020-12-10 17:24:00,563 - grizly.scheduling.registry - INFO - Job email_upstream_success has been enqueued
2020-12-10 17:24:03,567 - grizly.scheduling.registry - INFO - Job email_upstream_result_change has been enqueued


'2020-12-10 17:23:47+00:00'
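
The two runs above can be modeled with a small stand-alone function. This is only a sketch of the behavior seen in the logs, not grizly's implementation: `downstream_to_enqueue` is a hypothetical helper, and it assumes that `result_change` compares the new result with the previous one regardless of status.

```python
def downstream_to_enqueue(downstream, status, previous_result, new_result):
    """Return the downstream job names whose trigger condition is met."""
    enqueued = []
    for name, condition in downstream.items():
        if condition == "success" and status == "success":
            enqueued.append(name)
        elif condition == "result_change" and new_result != previous_result:
            # Assumption: a result change is detected by simple inequality.
            enqueued.append(name)
    return enqueued


downstream = {
    "email_upstream_success": "success",
    "email_upstream_result_change": "result_change",
}

# First run: the file was not modified, so the result is unchanged.
unchanged = downstream_to_enqueue(
    downstream, "success", "2020-12-10 17:23:03+00:00", "2020-12-10 17:23:03+00:00"
)
# Second run: the file was uploaded again, so the result differs.
changed = downstream_to_enqueue(
    downstream, "success", "2020-12-10 17:23:03+00:00", "2020-12-10 17:23:47+00:00"
)
print(unchanged)
print(changed)
```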

## Failing job's traceback

If your job failed, you can check the details of the last run through `Job.last_run` and its `traceback` property.

In [19]:
def failing_func():
    logger = logging.getLogger("grizly").getChild("failing_function")
    a = 2
    b = '2'
    logger.info(f"I'm adding {str(a)} + {str(b)}...")
    return a + b

job = Job("failing_job")

job.register(func=failing_func,
             if_exists="replace"
            )

job.submit(scheduler_address="10.125.68.177:8999")

2020-12-10 17:24:04,819 - grizly.scheduling.registry - INFO - Job failing_job successfully registered
2020-12-10 17:24:06,393 - grizly.scheduling.registry - INFO - Submitting job failing_job...
2020-12-10 17:24:07,967 - grizly.failing_function - INFO - I'm adding 2 + 2...
2020-12-10 17:24:11,542 - grizly.scheduling.registry - INFO - Job failing_job finished with status fail


In [20]:
print(job.last_run.traceback)

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/grizly-0.4.2rc0-py3.8.egg/grizly/scheduling/registry.py", line 1155, in submit
    result = self.func(*args, **kwargs)
  File "<ipython-input-19-41af6e778f93>", line 6, in failing_func
    return a + b
TypeError: unsupported operand type(s) for +: 'int' and 'str'
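
A traceback string like the one above can be produced with the standard library alone. The sketch below wraps a function call in `try`/`except` and stores the output of `traceback.format_exc()`. The wrapper `run_and_capture` is hypothetical and only illustrates the general technique, on the assumption that the scheduler does something similar around the job function.

```python
import traceback


def failing_func():
    a = 2
    b = "2"
    return a + b  # raises TypeError: int + str


def run_and_capture(func):
    """Run func and return (status, result, traceback_text)."""
    try:
        return "success", func(), None
    except Exception:
        # format_exc() renders the active exception as a string.
        return "fail", None, traceback.format_exc()


status, result, tb = run_and_capture(failing_func)
print(status)
print(tb)
```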



## Job's logs

If you use the grizly logger (as in `failing_func` in the example above), you can read your job's logs using the `Job.last_run.logs` property.

In [21]:
print(job.last_run.logs)

I'm adding 2 + 2...
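
One plausible reason the logger name matters is that records from child loggers propagate to handlers attached to their parent. The sketch below uses only the standard `logging` module: it attaches an in-memory handler to the `grizly` logger and captures a message emitted by a child logger. Whether grizly collects job logs this way is an assumption; the buffer and handler are purely illustrative.

```python
import io
import logging

# In-memory handler attached to the parent "grizly" logger.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(message)s"))

parent = logging.getLogger("grizly")
parent.setLevel(logging.INFO)
parent.addHandler(handler)

# A child logger inherits the parent's level and propagates records to it.
child = parent.getChild("failing_function")
a, b = 2, "2"
child.info(f"I'm adding {a} + {b}...")

parent.removeHandler(handler)
captured = buffer.getvalue()
print(captured)
```

A logger created outside the `grizly` hierarchy would not reach this handler, so its messages would not appear in `captured`.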



## Unregister jobs

In [22]:
Job("s3_grizly_test_scheduling_listener").unregister(remove_job_runs=True)
Job("email_upstream_success").unregister(remove_job_runs=True)
Job("email_upstream_result_change").unregister(remove_job_runs=True)
Job("failing_job").unregister(remove_job_runs=True)

2020-12-10 17:24:26,911 - grizly.scheduling.registry - INFO - Job s3_grizly_test_scheduling_listener successfully removed from registry
2020-12-10 17:24:32,766 - grizly.scheduling.registry - INFO - Job email_upstream_success successfully removed from registry
2020-12-10 17:24:37,189 - grizly.scheduling.registry - INFO - Job email_upstream_result_change successfully removed from registry
2020-12-10 17:24:40,330 - grizly.scheduling.registry - INFO - Job failing_job successfully removed from registry
