This repository has been archived by the owner on Feb 28, 2023. It is now read-only.

Commit dbe699d: Update readme and naming
hermanschaaf committed Jan 24, 2023 · 1 parent 8b7fb9d

Showing 4 changed files with 110 additions and 46 deletions.

README.md (80 additions, 7 deletions)

This is a Simple Analytics source plugin for [CloudQuery](https://www.cloudquery.io/): it syncs data from Simple Analytics to any database, data warehouse, or data lake supported by CloudQuery, such as PostgreSQL, BigQuery, Athena, and many more.

## Links

- [CloudQuery Quickstart Guide](https://www.cloudquery.io/docs/quickstart)
- [Supported Tables](docs/tables/README.md)

## Configuration

The following source configuration file will sync all data points for `mywebsite.com` to a PostgreSQL database. See [the CloudQuery Quickstart](https://www.cloudquery.io/docs/quickstart) for more information on how to configure the source and destination.

```yaml
kind: source
spec:
  name: "simple-analytics"
  path: "simpleanalytics/simple-analytics"
  version: "${VERSION}"
  backend: "local" # remove this to always sync all data
  tables: ["*"]
  destinations:
    - "postgresql"
  spec:
    # plugin spec section
    user_id: "${SA_USER_ID}"
    api_key: "${SA_API_KEY}"
    websites:
      # - ...
```

### Plugin Spec

- `user_id` (string, required):

A user ID from Simple Analytics, obtained from the [account settings](https://simpleanalytics.com/account) page. It should start with `sa_user_id...`

- `api_key` (string, required):

An API Key from Simple Analytics, obtained from the [account settings](https://simpleanalytics.com/account) page. It should start with `sa_api_key...`

- `websites` (array, required):

  A list of websites to sync data for. Each entry has the following fields (see the example after this list):

  - `hostname` (string, required):

    The hostname of the website to sync data for. This should be the same as the hostname in Simple Analytics.

  - `metadata_fields` (array[string], optional):

    A list of metadata fields to sync, e.g. `["path_text", "created_at_time"]`. If not specified, no metadata fields are synced.

- `start_date` (string, optional):

  The date to start syncing data from. If not specified, the plugin syncs data from the beginning of time (or from a start time derived from `period`, if set).

- `end_date` (string, optional):

  The date to stop syncing data at. If not specified, the plugin syncs data up to the current date.

- `period` (string, optional):

  The duration of the time window to fetch historical data for, in days, months or years. It is used to calculate `start_date` when `start_date` is not specified; if `start_date` is specified, `period` is ignored. Examples:
  - `7d`: last 7 days
  - `3m`: last 3 months
  - `1y`: last year
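
Putting these together, a complete plugin-level `spec` section might look like the following sketch (the hostname, metadata fields, and dates are illustrative, and the `YYYY-MM-DD` date format is an assumption; check the plugin's accepted time layout):

```yaml
spec:
  user_id: "${SA_USER_ID}"
  api_key: "${SA_API_KEY}"
  websites:
    - hostname: mywebsite.com
      metadata_fields: ["path_text", "created_at_time"]
  start_date: "2022-01-01"
  # Alternatively, sync a rolling window instead of a fixed start date:
  # period: 3m
```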


## Example Queries

### List the top 10 pages by views for a given period, excluding robots
```sql
select
  path,
  count(*)
from
  simple_analytics_page_views
where
  hostname = 'mywebsite.com'
  and is_robot is false
  -- ... (the date filter for the period is elided in this diff)
group by
  path
order by
  count desc
limit
  10
```

```text
+----------------------------------+---------+
| path                             | count   |
|----------------------------------+---------|
| /page                            | 91234   |
| /another-page                    | 84567   |
| /blog/introduction               | 74342   |
| /google                          | 69333   |
| /index                           | 69333   |
| /another/page                    | 64935   |
| /deeply/nested/page              | 50404   |
| /yet/another                     | 42309   |
| /some/page                       | 34433   |
| /about-us                        | 20334   |
+----------------------------------+---------+
```


### List events

```sql
select
  added_iso,
  datapoint,
  path,
  browser_name
from
  simple_analytics_events
order by
  added_iso desc
limit
  5
```

```text
+-------------------------+-----------+-----------------------------------------------+---------------+
| added_iso | datapoint | path | browser_name |
|-------------------------+-----------+-----------------------------------------------+---------------|
| 2023-01-23 19:32:25.68 | 404 | /security | Google Chrome |
| 2023-01-22 20:23:23.379 | 404 | /blog/running-cloudquery-in-gcp | Google Chrome |
| 2023-01-19 12:04:57.095 | 404 | /docs/plugins/sources/vercel/configuration.md | Brave |
| 2023-01-19 12:04:36.567 | 404 | /docsss | Firefox |
| 2023-01-19 01:50:19.259 | 404 | /imgs/gcp-cross-project-service-account | Google Chrome |
+-------------------------+-----------+-----------------------------------------------+---------------+
```

client/spec.go (26 additions, 35 deletions)
type Spec struct {
    // Websites is a list of websites to fetch data for.
    Websites []WebsiteSpec `json:"websites"`

    // StartDateStr is the time to start fetching data from. If specified, it must use AllowedTimeLayout.
    StartDateStr string `json:"start_date"`

    // EndDateStr is the time at which to stop fetching data. If not specified, the current time is used.
    // If specified, it must use AllowedTimeLayout.
    EndDateStr string `json:"end_date"`

    // PeriodStr is the duration of the time window to fetch historical data for, in days, months or years.
    // Examples:
    //   "7d": past 7 days
    //   "3m": last 3 months
    //   "1y": last year
    // It is used to calculate start_date if it is not specified. If start_date is specified,
    // period is ignored.
    PeriodStr string `json:"period"`
}

type WebsiteSpec struct {
    // ... (fields elided in this diff)
}

func (s Spec) Validate() error {
    for _, w := range s.Websites {
        if w.Hostname == "" {
            return fmt.Errorf("every website entry must have a hostname")
        }
    }
    if s.StartDateStr != "" {
        _, err := time.Parse(AllowedTimeLayout, s.StartDateStr)
        if err != nil {
            return fmt.Errorf("could not parse start_date: %v", err)
        }
    }
    if s.EndDateStr != "" {
        _, err := time.Parse(AllowedTimeLayout, s.EndDateStr)
        if err != nil {
            return fmt.Errorf("could not parse end_date: %v", err)
        }
    }
    if s.PeriodStr != "" {
        _, err := parsePeriod(s.PeriodStr)
        if err != nil {
            return fmt.Errorf("could not validate period: %v (should be a number followed by \"d\", \"m\" or \"y\", e.g. \"7d\", \"1m\" or \"3y\")", err)
        }
    }
    return nil
}

func (s *Spec) SetDefaults() {
    if s.StartDateStr == "" && s.PeriodStr == "" {
        s.StartDateStr = DefaultStartTime.Format(AllowedTimeLayout)
    }
    if s.EndDateStr == "" {
        s.EndDateStr = time.Now().Format(AllowedTimeLayout)
    }
}

func (s Spec) StartTime() time.Time {
    if s.StartDateStr == "" && s.PeriodStr != "" {
        return time.Now().Add(-s.Period())
    }
    t, _ := time.Parse(AllowedTimeLayout, s.StartDateStr) // any error should be caught by Validate()
    return t
}

func (s Spec) EndTime() time.Time {
    t, _ := time.Parse(AllowedTimeLayout, s.EndDateStr) // any error should be caught by Validate()
    return t
}

func (s Spec) Period() time.Duration {
    d, _ := parsePeriod(s.PeriodStr) // any error should be caught by Validate()
    return d
}

func parsePeriod(s string) (time.Duration, error) {
    m := reValidDuration.FindStringSubmatch(s)
    if m == nil {
        return 0, errors.New("invalid period")
    }
    n, err := strconv.Atoi(m[1])
    if err != nil {
        return 0, err
    }
    // ... (the "d" and "m" unit cases are elided in this diff)
    if m[2] == "y" {
        return time.Duration(n) * 365 * 24 * time.Hour, nil
    }
    // should never happen, we already validated using regex
    panic("unhandled period unit")
}
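
The `reValidDuration` pattern referenced above is defined outside this diff. A minimal sketch of a regular expression compatible with `parsePeriod` (the count captured in `m[1]`, the unit letter in `m[2]`) could look like this; the exact definition in the repository may differ:

```go
package client

import "regexp"

// Sketch only: an integer count followed by a single unit letter,
// e.g. "7d", "3m" or "1y". Group 1 holds the count, group 2 the unit.
var reValidDuration = regexp.MustCompile(`^(\d+)([dmy])$`)
```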

resources/events.go (2 additions, 2 deletions)
func fetchEvents(ctx context.Context, meta schema.ClientMeta, parent *schema.Resource /* ... */) error {
    // ...

    // Save cursor state to the backend.
    if c.Backend != nil {
        // We subtract a day from the end time to allow delayed data points
        // to be fetched on the next sync. This will cause some duplicates, but
        // allows us to guarantee at-least-once delivery. Duplicates can be removed
        // by using overwrite-delete-stale write mode, by de-duplicating in queries,
        // or by running a post-processing step.
        newCursor := end.Add(-24 * time.Hour).Format(client.AllowedTimeLayout)
        err = c.Backend.Set(ctx, tableEvents, c.ID(), newCursor)
        if err != nil {
            return fmt.Errorf("failed to save cursor to backend: %w", err)
        }
    }
    // ...
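
The comment above mentions de-duplicating in queries as one way to handle the duplicates introduced by the one-day cursor overlap. A hedged sketch for the events table (the column list follows the README example above, and it assumes duplicated rows are identical across all selected columns):

```sql
-- Rows re-fetched from the overlap window are full-row copies,
-- so DISTINCT collapses them at query time.
select distinct
  added_iso,
  datapoint,
  path,
  browser_name
from
  simple_analytics_events
```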

resources/page_views.go (2 additions, 2 deletions)
func fetchPageViews(ctx context.Context, meta schema.ClientMeta, parent *schema.Resource /* ... */) error {
    // ...

    // Save cursor state to the backend.
    if c.Backend != nil {
        // We subtract a day from the end time to allow delayed data points
        // to be fetched on the next sync. This will cause some duplicates, but
        // allows us to guarantee at-least-once delivery. Duplicates can be removed
        // by using overwrite-delete-stale write mode, by de-duplicating in queries,
        // or by running a post-processing step.
        newCursor := end.Add(-24 * time.Hour).Format(client.AllowedTimeLayout)
        err = c.Backend.Set(ctx, tablePageViews, c.ID(), newCursor)
        if err != nil {
            return fmt.Errorf("failed to save cursor to backend: %w", err)
        }
    }
    // ...
