# Calling the Google Analytics API From a Notebook

This sample code uses the google-analytics-data Python library to connect to and read data from the Google Analytics API.

To run this code, start a cluster and add the google-analytics-data Python library to the cluster's library configuration. The code also requires your ```credentials.json``` file and an environment variable that points to it. For testing, you can upload your credentials file to DBFS and then set the environment variable on the cluster configuration page like so:

```GOOGLE_APPLICATION_CREDENTIALS="/dbfs/FileStore/credentials.json"```

Note: This is not a production-grade solution. You will need to determine the best authentication method for your environment and how best to secure your credentials. This is for demonstration purposes only!
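
Before creating the client, it can help to confirm that the environment variable is visible to the notebook and points at an existing file. The following is a minimal sketch; `check_credentials_path` is a hypothetical helper name, not part of the google-analytics-data library, and it only checks that the file exists, not that the credentials are valid.

```python
import os


def check_credentials_path(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return (ok, detail) describing whether env_var names an existing file.

    Hypothetical helper for illustration only; it does not validate the
    contents of the credentials file.
    """
    path = os.environ.get(env_var)
    if not path:
        return False, f"{env_var} is not set"
    if not os.path.isfile(path):
        return False, f"{env_var} points to a missing file: {path}"
    return True, path


ok, detail = check_credentials_path()
print(ok, detail)
```

If the check fails on a cluster, the variable was probably set after the cluster started; in that case, restarting the cluster is likely needed for the change to take effect.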

In [0]:
from pyspark.sql.functions import current_timestamp
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange,
    Dimension,
    Metric,
    RunReportRequest,
)
import json

### Call the API

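Using the Python library, this code calls the Google Analytics API to run a report of ```screenPageViews``` by ```pageTitle``` and collects the rows into a list of dictionaries.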

In [0]:
# Note: set your own property ID before running
property_id = ""
client = BetaAnalyticsDataClient()

# Note: adjust the date range to capture as much historical data as you want
request = RunReportRequest(
    property=f"properties/{property_id}",
    dimensions=[Dimension(name="pageTitle")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2020-03-31", end_date="today")],
)
response = client.run_report(request)
json_data = []

print("Report result:")
for row in response.rows:
    data = {
        "pageTitle": row.dimension_values[0].value,
        "screenPageViews": row.metric_values[0].value,
    }
    json_data.append(data)
    print(row.dimension_values[0].value, row.metric_values[0].value)
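
The loop above hard-codes one dimension and one metric. If the request is later extended, the key names can instead be read from the response's `dimension_headers` and `metric_headers` fields. The following is a sketch under that assumption; `rows_to_dicts` is a hypothetical helper name, and the `SimpleNamespace` objects are stand-ins for a real response so the example runs without calling the API.

```python
from types import SimpleNamespace


def rows_to_dicts(response):
    """Build one dict per report row, keyed by dimension and metric header names.

    Hypothetical helper: it relies only on the attributes used below
    (dimension_headers, metric_headers, rows, and each value's .value).
    """
    dim_names = [h.name for h in response.dimension_headers]
    metric_names = [h.name for h in response.metric_headers]
    records = []
    for row in response.rows:
        record = {n: v.value for n, v in zip(dim_names, row.dimension_values)}
        record.update(
            {n: v.value for n, v in zip(metric_names, row.metric_values)}
        )
        records.append(record)
    return records


# Stand-in for a real response object, for illustration only.
fake_response = SimpleNamespace(
    dimension_headers=[SimpleNamespace(name="pageTitle")],
    metric_headers=[SimpleNamespace(name="screenPageViews")],
    rows=[
        SimpleNamespace(
            dimension_values=[SimpleNamespace(value="Home")],
            metric_values=[SimpleNamespace(value="42")],
        )
    ],
)

records = rows_to_dicts(fake_response)
print(records)
```

With a real report, `rows_to_dicts(response)` would take the place of the manual `json_data` loop.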

In [0]:
print(response)

In [0]:
json_report = json.dumps(json_data)

In [0]:
print(json_report)
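
The API returns metric values as strings, so `screenPageViews` is serialized as a JSON string and Spark will infer a string column for it. If a numeric column is preferred, the values can be cast before serializing. The following is a minimal sketch; `cast_metric` is a hypothetical helper name and `sample` is made-up data in the same shape as `json_data`.

```python
import json


def cast_metric(records, metric_name="screenPageViews"):
    """Return copies of the records with the named metric parsed as an int."""
    return [{**r, metric_name: int(r[metric_name])} for r in records]


# Made-up sample in the same shape as json_data, for illustration only.
sample = [{"pageTitle": "Home", "screenPageViews": "42"}]
typed_report = json.dumps(cast_metric(sample))
print(typed_report)
```

In the notebook, `json.dumps(cast_metric(json_data))` would take the place of `json.dumps(json_data)`. This assumes the metric is integer-valued, as a page view count is.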

Using the report data obtained from the API, convert it to an RDD and then read the RDD into a Spark DataFrame. We also add a ```captureDateTime``` column to each result row. Its value is the timestamp at which Spark evaluates the query, which approximates when the report was run.

In [0]:
json_rdd = sc.parallelize([json_report])

In [0]:
df = spark.read.json(json_rdd)
df2 = df.withColumn("captureDateTime", current_timestamp())

In [0]:
display(df2)

### Create a temporary view of the data, and query it using SQL

In [0]:
df2.createOrReplaceTempView("latestPageViews")

In [0]:
%sql
SELECT * FROM latestPageViews