# Google Analytics in Airflow 101.

## Quick Overview: Hooks and Operators.
The hook contains all the code needed to manage the connection, while the operator focuses on using that connection
to execute a task.

This leaves the DAG file itself as close to a config file as possible.

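As a rough illustration of that split, here is a minimal sketch in plain Python. The class names `ExampleHook` and `ExampleOperator`, the `get_report` method, and the connection id are all hypothetical; they do not inherit from Airflow's base classes and only show where each responsibility would live.

```python
# Hypothetical names throughout: plain Python, not the Airflow base classes.


class ExampleHook:
    """Owns the connection details and exposes methods for talking to a service."""

    def __init__(self, conn_id):
        self.conn_id = conn_id

    def get_report(self, query):
        # A real hook would authenticate and call the external API here.
        return {'conn_id': self.conn_id, 'query': query, 'rows': []}


class ExampleOperator:
    """Describes one unit of work and delegates all I/O to the hook."""

    def __init__(self, task_id, conn_id, query):
        self.task_id = task_id
        self.conn_id = conn_id
        self.query = query

    def execute(self, context):
        # The operator decides *what* to fetch; the hook decides *how*.
        hook = ExampleHook(self.conn_id)
        return hook.get_report(self.query)


task = ExampleOperator(task_id='example', conn_id='my_conn', query='ga:sessions')
result = task.execute(context={})
```

With this shape, the DAG file only needs to instantiate operators with the right arguments.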
## Implementation:

Before any code is written, the Google Analytics connection information needs to be accessible to Airflow.

### Configure the connections in the Airflow connections panel.
![connections](img/ucg_connections.png)


### Create a new connection with the name that will be referenced in the DAG
![new_connection](img/ucg_config_connections.png)

### The Astronomer version of the google-analytics hook is configured to work with the JSON file placed under "client_secret" in the Extras field:

![extras_field](img/ucg_extras_field.png)

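To make the Extras layout concrete, the sketch below parses an Extras string and pulls out the value stored under "client_secret". The helper name `load_client_secret` and the placeholder credential fields are assumptions for illustration only; the actual hook may read the field differently (Airflow itself exposes the parsed Extras through the connection's `extra_dejson` property).

```python
import json


def load_client_secret(extra_json):
    """Return the credentials dict stored under 'client_secret' in the Extras JSON."""
    extras = json.loads(extra_json)
    if 'client_secret' not in extras:
        raise KeyError("Extras field has no 'client_secret' key")
    secret = extras['client_secret']
    # Accept either a nested JSON object or a JSON-encoded string.
    if isinstance(secret, str):
        secret = json.loads(secret)
    return secret


# Placeholder values only; a real key file contains more fields than this.
example_extra = json.dumps({
    'client_secret': {
        'type': 'service_account',
        'client_email': 'placeholder@example.com',
    }
})
secret = load_client_secret(example_extra)
```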

## Writing the DAG file:

The DAG file itself should be as close to a config file as possible, simply importing operators and setting connections.

In [None]:
# from uncommon_good_ga.py: - full file:
# /base_workflow/dags/uncommon_goods_ga_dag.py

## Set imports.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from plugins.google_analytics_plugin.operators.google_analytics_reporting_to_s3_operator import GoogleAnalyticsReportingToS3Operator

# TODO: UCG specific credentials:
s3_bucket = 'astronomer-workflows-dev'
s3_conn_id = 'astronomer-s3'

time_string = '{{ ts_nodash }}'
google_analytics_conn_id = 'google_analytics_connection'

# TODO: Schedule
end_date = datetime.today()
start_date = end_date - timedelta(days=6)

# UCG viewid.
view_id = '120725274'

default_args = {
    # TODO: UCG SPECIFICS:
    'start_date': datetime(2018, 3, 20, 0, 0),
    'email': ['l5t3o4a9m9q9v1w9@astronomerteam.slack.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 0,
}
# TODO: Naming convention.
dag = DAG(
    'core_reporting_test_one',
    schedule_interval='@daily',
    default_args=default_args,
    catchup=False
)

### Now to define the work that's going to get done:

In [None]:
# from uncommon_good_ga.py: - full file:
# /base_workflow/dags/uncommon_goods_ga_dag.py

# Let's define the reports needed here:

# These are just some basic core reporting reports.
pipelines = [
    {
        'name': 'demographics',
        'dimensions': [
            {'name': 'ga:date'},
            {'name': 'ga:userAgeBracket'},
            {'name': 'ga:userGender'}
        ],
        'metrics': [
            {'expression': 'ga:sessions'}
        ],

        # TODO: Destination schema.
        'schema': [
            {}
        ]
    },
    {
        'name': 'date_sessions',
        'dimensions': [
            {'name': 'ga:date'},
        ],
        'metrics': [
            {'expression': 'ga:sessions'}
        ],

        # TODO: Destination schema.
        'schema': [
            {}
        ]
    },
    {
        'name': 'medium_source',
        'dimensions': [
            {'name': 'ga:medium'},
            {'name': 'ga:source'},
        ],
        'metrics': [
            {'expression': 'ga:sessions'},
            {'expression': 'ga:avgTimeOnPage'},
        ],

        # TODO: Destination schema.
        'schema': [
            {}
        ]
    },
]

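Each report's name is reused further down to label its task and its output file, so names need to be unique, and a metric listed twice in one report is most likely a copy-paste slip. A small check along these lines could catch both before the DAG is deployed; `check_pipelines` is a hypothetical helper, not part of the plugin.

```python
def check_pipelines(pipelines):
    """Return a list of human-readable problems found in the report definitions."""
    problems = []
    seen_names = set()
    for pipeline in pipelines:
        name = pipeline['name']
        if name in seen_names:
            problems.append('duplicate pipeline name: {}'.format(name))
        seen_names.add(name)

        expressions = [m['expression'] for m in pipeline['metrics']]
        for expression in sorted(set(expressions)):
            if expressions.count(expression) > 1:
                problems.append(
                    '{}: metric {} listed more than once'.format(name, expression))
    return problems


# Deliberately flawed sample data to show what gets reported.
sample = [
    {'name': 'medium_source',
     'metrics': [{'expression': 'ga:sessions'},
                 {'expression': 'ga:avgTimeOnPage'},
                 {'expression': 'ga:avgTimeOnPage'}]},
    {'name': 'medium_source',
     'metrics': [{'expression': 'ga:sessions'}]},
]
problems = check_pipelines(sample)
```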
### Defining the DAG itself:

In [None]:
# from uncommon_good_ga.py: - full file:
# /base_workflow/dags/uncommon_goods_ga_dag.py

with dag:

    start = DummyOperator(task_id='start')

    for pipeline in pipelines:
        google_analytics = GoogleAnalyticsReportingToS3Operator(
            task_id='ga_reporting_{endpoint}_to_s3'.format(
                endpoint=pipeline['name']),
            google_analytics_conn_id=google_analytics_conn_id,
            view_id=view_id,
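            # Passed as template strings so Airflow renders them for each run
            # (assumes the operator lists since/until in its template_fields).
            since='{{ execution_date }}',
            until='{{ next_execution_date }}',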
            sampling_level='LARGE',
            dimensions=pipeline['dimensions'],
            metrics=pipeline['metrics'],
            page_size=100,
            include_empty_rows=True,
            s3_conn_id=s3_conn_id,
            s3_bucket=s3_bucket,
            s3_key='ucg_ga_reporting_{endpoint}_{time_string}'.format(
                endpoint=pipeline['name'], time_string=time_string)
        )


        start >> google_analytics

__Full file: /base_workflow/dags/ga_dag.py__

### This DAG will look like:
![dag](img/ucg_dag.png)

All additions (such as downstream file manipulations) will depend on the GoogleAnalyticsReportingToS3Operator first successfully retrieving the data.

Each entry in the pipelines list in the DAG file is written to a separate file.

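The sketch below shows how the format strings from the DAG file yield one task id and one S3 key per report. The `output_names` helper is hypothetical, and `rendered_ts` is only a stand-in for what Airflow would substitute for `{{ ts_nodash }}` at run time.

```python
pipelines = [
    {'name': 'demographics'},
    {'name': 'date_sessions'},
    {'name': 'medium_source'},
]

# Stand-in for the rendered '{{ ts_nodash }}' template; the real value
# is supplied by Airflow for each run.
rendered_ts = '20180320T000000'


def output_names(pipeline_name, time_string):
    """Build the task id and S3 key using the same patterns as the DAG file."""
    task_id = 'ga_reporting_{endpoint}_to_s3'.format(endpoint=pipeline_name)
    s3_key = 'ucg_ga_reporting_{endpoint}_{time_string}'.format(
        endpoint=pipeline_name, time_string=time_string)
    return task_id, s3_key


names = [output_names(p['name'], rendered_ts) for p in pipelines]
for task_id, s3_key in names:
    print(task_id, '->', s3_key)
```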
## How it works:

The work that needs to get done is defined in the Operator, which actually executes the request:

In [None]:
# These get fed into the Google Analytics Operator:
# full-file: /baseworkflows/plugins/google_analytics_plugin/operators/google_analytics_reporting_to_s3_operator.py

# operators execute the execute() function. 

# Line 92-110.
def execute(self, context):

    # Instantiate the hooks.
    ga_conn = GoogleAnalyticsHook(self.google_analytics_conn_id)
    s3_conn = S3Hook(self.s3_conn_id)

    # Convert 'YYYY-MM-DD HH:MM:SS' timestamps to 'YYYY-MM-DD';
    # any other value is passed through as a string.
    try:
        since_formatted = datetime.strptime(self.since, '%Y-%m-%d %H:%M:%S').strftime('%Y-%m-%d')
    except (TypeError, ValueError):
        since_formatted = str(self.since)
    try:
        until_formatted = datetime.strptime(self.until, '%Y-%m-%d %H:%M:%S').strftime('%Y-%m-%d')
    except (TypeError, ValueError):
        until_formatted = str(self.until)

    # Make request.
    report = ga_conn.get_analytics_report(self.view_id,
                                          since_formatted,
                                          until_formatted,
                                          self.sampling_level,
                                          self.dimensions,
                                          self.metrics,
                                          self.page_size,
                                          self.include_empty_rows)
    ...

Full-File: /baseworkflows/plugins/google_analytics_plugin/operators/google_analytics_reporting_to_s3_operator.py

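The date handling in `execute()` can be isolated into a small function to see how different inputs behave. `to_ga_date` is a hypothetical name used here for illustration; the logic mirrors the try/fallback pattern in the excerpt above.

```python
from datetime import datetime


def to_ga_date(value):
    """Convert 'YYYY-MM-DD HH:MM:SS' to 'YYYY-MM-DD'; pass anything else through as str."""
    try:
        return datetime.strptime(value, '%Y-%m-%d %H:%M:%S').strftime('%Y-%m-%d')
    except (TypeError, ValueError):
        # ValueError: the string is not in the timestamp format.
        # TypeError: the value is not a string at all.
        return str(value)


full_timestamp = to_ga_date('2018-03-20 00:00:00')
already_a_date = to_ga_date('2018-03-20')
relative_date = to_ga_date('7daysAgo')
datetime_object = to_ga_date(datetime(2018, 3, 20))
```

Note that a `datetime` object falls into the fallback branch and comes out as `'2018-03-20 00:00:00'`, time portion included, so the values are best supplied as strings.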
#### Operators use hooks to interact with the external systems involved in the task they're trying to execute.

In [None]:
# full-file: /baseworkflows/plugins/google_analytics_plugin/hooks/google_analytics_hook.py
# Line 101-132 - the hook takes the config and provides an interface to interact with the analytics api.
# It doesn't determine what actually gets done - just provides a method to do it.

# Note here to integrate this with MCF funnels.

def get_analytics_report(self,
                         view_id,
                         since,
                         until,
                         sampling_level,
                         dimensions,
                         metrics,
                         page_size,
                         include_empty_rows):

    analytics = self.get_service_object(name='reporting')

    reportRequest = {
        'viewId': view_id,
        'dateRanges': [{'startDate': since, 'endDate': until}],
        'samplingLevel': sampling_level or 'LARGE',
        'dimensions': dimensions,
        'metrics': metrics,
        'pageSize': page_size or 1000,
        'includeEmptyRows': include_empty_rows or False
    }

    response = (analytics.
                reports().
                batchGet(body={'reportRequests': [reportRequest]}).
                execute())
    ...

Full-File: /baseworkflows/plugins/google_analytics_plugin/hooks/google_analytics_hook.py
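
To inspect the request body without calling the API, the dictionary construction can be pulled into a standalone function. This is a sketch only: `build_report_request` is a hypothetical helper, the view id is a placeholder, and the defaults simply mirror the `or` fallbacks in the hook excerpt above.

```python
def build_report_request(view_id, since, until, dimensions, metrics,
                         sampling_level=None, page_size=None,
                         include_empty_rows=None):
    """Build one report request dict using the same defaults as the hook excerpt."""
    return {
        'viewId': view_id,
        'dateRanges': [{'startDate': since, 'endDate': until}],
        'samplingLevel': sampling_level or 'LARGE',
        'dimensions': dimensions,
        'metrics': metrics,
        # `or` treats 0 and None alike, so page_size=0 also falls back to 1000.
        'pageSize': page_size or 1000,
        'includeEmptyRows': include_empty_rows or False,
    }


body = {
    'reportRequests': [
        build_report_request(
            view_id='000000000',  # placeholder view id
            since='2018-03-20',
            until='2018-03-21',
            dimensions=[{'name': 'ga:date'}],
            metrics=[{'expression': 'ga:sessions'}],
        )
    ]
}
request = body['reportRequests'][0]
```

Keeping the request construction separate from the network call makes it possible to check the defaults in a unit test.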