Use new Chronicle Schema work from chronicle-core #73

Merged: 22 commits, Apr 26, 2024
46 changes: 2 additions & 44 deletions .rubocop.yml
@@ -1,44 +1,2 @@
-AllCops:
-  EnabledByDefault: true
-  TargetRubyVersion: 2.7
-
-Style/FrozenStringLiteralComment:
-  SafeAutoCorrect: true
-
-Style/StringLiterals:
-  Enabled: false
-
-Layout/MultilineAssignmentLayout:
-  Enabled: false
-
-Layout/MultilineMethodCallIndentation:
-  EnforcedStyle: indented
-
-Layout/RedundantLineBreak:
-  Enabled: false
-
-Style/MethodCallWithArgsParentheses:
-  Enabled: false
-
-Style/MethodCalledOnDoEndBlock:
-  Exclude:
-    - 'spec/**/*'
-
-Style/OpenStructUse:
-  Enabled: false
-
-Style/Copyright:
-  Enabled: false
-
-Style/MissingElse:
-  Enabled: false
-
-Style/SymbolArray:
-  EnforcedStyle: brackets
-
-Style/WordArray:
-  EnforcedStyle: brackets
-
-Lint/ConstantResolution:
-  Enabled: false
-
+inherit_gem:
+  chronicle-core: .rubocop.yml
4 changes: 2 additions & 2 deletions Gemfile
@@ -1,6 +1,6 @@
-source "https://rubygems.org"
+source 'https://rubygems.org'

-git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }
+git_source(:github) { |repo_name| "https://github.com/#{repo_name}" }

 # Specify your gem's dependencies in chronicle-etl.gemspec
 gemspec
6 changes: 3 additions & 3 deletions Guardfile
@@ -1,7 +1,7 @@
-guard :rspec, cmd: "bundle exec rspec" do
-  require "guard/rspec/dsl"
+guard :rspec, cmd: 'bundle exec rspec' do
+  require 'guard/rspec/dsl'

   watch(%r{^spec/.+_spec\.rb$})
   watch(%r{^lib/(.+)\.rb$}) { |m| "spec/#{m[1]}_spec.rb" }
-  watch('spec/spec_helper.rb') { "spec" }
+  watch('spec/spec_helper.rb') { 'spec' }
 end
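
The second `watch` rule in the Guardfile maps a changed file under `lib/` to the spec that should run. A minimal sketch of that mapping in plain Ruby, outside Guard; the helper name `spec_for` is hypothetical and the sample paths are only illustrations:

```ruby
# Same pattern as the Guardfile's second watch rule.
LIB_PATTERN = %r{^lib/(.+)\.rb$}

# Hypothetical helper (not part of Guard): returns the spec path for a
# changed lib file, or nil when the path does not match the pattern.
def spec_for(path)
  match = LIB_PATTERN.match(path)
  match && "spec/#{match[1]}_spec.rb"
end

puts spec_for('lib/chronicle/etl/runner.rb')
puts spec_for('README.md').inspect
```

Paths that do not match the pattern yield `nil`, which corresponds to Guard ignoring the change for this rule.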
54 changes: 20 additions & 34 deletions README.md
@@ -6,14 +6,14 @@

 Are you trying to archive your digital history or incorporate it into your own projects? You’ve probably discovered how frustrating it is to get machine-readable access to your own data. While [building a memex](https://hyfen.net/memex/), I learned first-hand what great efforts must be made before you can begin using the data in interesting ways.

-If you don’t want to spend all your time writing scrapers, reverse-engineering APIs, or parsing takeout data, this tool is for you! (_If you do enjoy these things, please see the [open issues](https://github.com/chronicle-app/chronicle-etl/issues)._)
+If you don’t want to spend all your time writing scrapers, reverse-engineering APIs, or parsing export data, this tool is for you! (_If you do enjoy these things, please see the [open issues](https://github.com/chronicle-app/chronicle-etl/issues)._)

 **`chronicle-etl` is a CLI tool that gives you a unified interface to your personal data.** It uses the ETL pattern to _extract_ data from a source (e.g. your local browser history, a directory of images, goodreads.com reading history), _transform_ it (into a given schema), and _load_ it to a destination (e.g. a CSV file, JSON, external API).

 ## What does `chronicle-etl` give you?

 - **A CLI tool for working with personal data**. You can monitor progress of exports, manipulate the output, set up recurring jobs, manage credentials, and more.
-- **Plugins for many third-party providers** (see [list](#available-plugins-and-connectors)). This plugin system allows you to access data from dozens of third-party services, all accessible through a common CLI interface.
+- **Plugins for many third-party sources** (see [list](#available-plugins-and-connectors)). This plugin system allows you to access data from dozens of third-party services, all accessible through a common CLI interface.
 - **A common, opinionated schema**: You can normalize different datasets into a single schema so that, for example, all your iMessages and emails are represented in a common schema. (Don’t want to use this schema? `chronicle-etl` always allows you to fall back on working with the raw extraction data.)

 ## Chronicle-ETL in action
@@ -58,10 +58,10 @@ $ chronicle-etl --extractor csv --input data.csv --loader table

 # Show available plugins and install one
 $ chronicle-etl plugins:list
-$ chronicle-etl plugins:install shell
+$ chronicle-etl plugins:install imessage

-# Retrieve shell commands run in the last 5 hours
-$ chronicle-etl -e shell --since 5h
+# Retrieve imessage messages from the last 5 hours
+$ chronicle-etl -e imessage --since 5h

 # Get email senders from an .mbox email archive file
 $ chronicle-etl --extractor email:mbox -i sample-email-archive.mbox -t email --fields actor.slug
@@ -80,12 +80,16 @@ Options:
 [--extractor-opts=key:value] # Extractor options
 -t, [--transformer=NAME] # Transformer class. Default: null
 [--transformer-opts=key:value] # Transformer options
--l, [--loader=NAME] # Loader class. Default: table
+-l, [--loader=NAME] # Loader class. Default: json
 [--loader-opts=key:value] # Loader options
 -i, [--input=FILENAME] # Input filename or directory
 [--since=DATE] # Load records SINCE this date (or fuzzy time duration)
 [--until=DATE] # Load records UNTIL this date (or fuzzy time duration)
 [--limit=N] # Only extract the first LIMIT records
+[--schema=SCHEMA_NAME] # Which Schema to transform
+# Possible values: chronicle, activitystream, schemaorg, chronobase
+[--format=SCHEMA_NAME] # How to serialize results
+# Possible values: jsonapi, jsonld
 -o, [--output=OUTPUT] # Output filename
 [--fields=field1 field2 ...] # Output only these fields
 [--header-row], [--no-header-row] # Output the header row of tabular output
@@ -119,7 +123,7 @@ $ chronicle-etl jobs:list

 ## Connectors and plugins

-Connectors let you work with different data formats or third-party providers.
+Connectors let you work with different data formats or third-party sources.

 ### Built-in Connectors

@@ -139,13 +143,16 @@ $ chronicle-etl connectors:list
 #### Transformers

 - [`null`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/transformers/null_transformer.rb) - (default) Don’t do anything and pass on raw extraction data
+- [`sampler`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/transformers/sampler_transformer.rb) - Sample `percent` records from the extraction
+- [`sort`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/transformers/sampler_transformer.rb) - sort extracted results by `key` and `direction`
+

 #### Loaders

-- [`table`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/table_loader.rb) - (default) Output an ascii table of records. Useful for exploring data.
+- [`json`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/json_loader.rb) - (default) Load records serialized as JSON
+- [`table`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/table_loader.rb) - Output an ascii table of records. Useful for exploring data.
 - [`csv`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/extractors/csv_extractor.rb) - Load records to CSV
-- [`json`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/json_loader.rb) - Load records serialized as JSON
-- [`rest`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/rest_loader.rb) - Serialize records with [JSONAPI](https://jsonapi.org/) and send to a REST API
+- [`rest`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/rest_loader.rb) - Send JSON to a REST API

 ### Chronicle Plugins for third-party services

@@ -161,8 +168,8 @@ $ chronicle-etl plugins:list
 $ chronicle-etl plugins:install NAME

 # Use a plugin
-$ chronicle-etl plugins:install shell
-$ chronicle-etl --extractor shell:history --limit 10
+$ chronicle-etl plugins:install imessage
+$ chronicle-etl --extractor imessage --limit 10

 # Uninstall a plugin
 $ chronicle-etl plugins:uninstall NAME
@@ -219,28 +226,7 @@ If you want to work together on a connector, please [get in touch](#get-in-touch
 #### Sample custom Extractor class

 ```ruby
-module Chronicle
-  module FooService
-    class FooExtractor < Chronicle::ETL::Extractor
-      register_connector do |r|
-        r.identifier = 'foo'
-        r.description = 'from foo.com'
-      end
-
-      setting :access_token, required: true
-
-      def prepare
-        @records = # load from somewhere
-      end
-
-      def extract
-        @records.each do |record|
-          yield Chronicle::ETL::Extraction.new(data: row.to_h)
-        end
-      end
-    end
-  end
-end
+# TODO
 ```

 ## Secrets Management
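
The README text in this diff describes the flow as _extract_, _transform_, _load_. As a rough illustration of that pattern only, here is a plain-Ruby sketch. It does not use `chronicle-etl` classes; every method name, the input format, and the two-field "schema" are assumptions made up for the example:

```ruby
require 'json'

# Hypothetical stand-ins for the three ETL stages; none of these names
# come from chronicle-etl itself.

# Extract: turn simple comma-separated text into one hash per row.
def extract(text)
  header, *rows = text.lines(chomp: true).map { |line| line.split(',') }
  rows.map { |row| header.zip(row).to_h }
end

# Transform: map a raw row onto a small, made-up common schema.
def transform(row)
  { 'actor' => row['from'], 'body' => row['message'].strip }
end

# Load: serialize the transformed records as one JSON document per line.
def load_records(records)
  records.map { |record| JSON.generate(record) }.join("\n")
end

raw = "from,message\nalice, hello \nbob,hi\n"
puts load_records(extract(raw).map { |row| transform(row) })
```

Keeping the three stages as separate functions mirrors the CLI's separate `--extractor`, `--transformer`, and `--loader` options, so any one stage can be swapped without touching the others.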
4 changes: 2 additions & 2 deletions Rakefile
@@ -1,5 +1,5 @@
-require "bundler/gem_tasks"
-require "rspec/core/rake_task"
+require 'bundler/gem_tasks'
+require 'rspec/core/rake_task'
 RSpec::Core::RakeTask.new(:spec)

 require 'yard'
9 changes: 4 additions & 5 deletions bin/console
@@ -1,26 +1,25 @@
 #!/usr/bin/env ruby

-require "bundler/setup"
-require "chronicle/etl"
+require 'bundler/setup'
+require 'chronicle/etl'

 # You can add fixtures and/or initialization code here to make experimenting
 # with your gem easier. You can also use a different console, if you like.

 # (If you use this, don't forget to add pry to your Gemfile!)
-require "pry"
+require 'pry'
 Pry.start

 def reload!(print = true)
   puts 'Reloading ...' if print
   # Main project directory.
   root_dir = File.expand_path('..', __dir__)
   # Directories within the project that should be reloaded.
-  reload_dirs = %w{lib}
+  reload_dirs = %w[lib]
   # Loop through and reload every file in all relevant project directories.
   reload_dirs.each do |dir|
     Dir.glob("#{root_dir}/#{dir}/**/*.rb").each { |f| load(f) }
   end
   # Return true when complete.
   true
 end
-
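
The `reload!` helper in `bin/console` calls `load` rather than `require`. A small standalone sketch of why that matters, assuming nothing beyond the Ruby standard library (the temp file and the `$load_count` counter are made up for the demonstration):

```ruby
require 'tempfile'

$load_count = 0

# Write a tiny Ruby file that bumps a global counter each time it is evaluated.
file = Tempfile.new(['reload_demo', '.rb'])
file.write("$load_count += 1\n")
file.close

2.times { require file.path } # evaluated once; the second require is a no-op
after_require = $load_count

2.times { load file.path }    # evaluated again on every call
after_load = $load_count

puts "after require: #{after_require}, after load: #{after_load}"
file.unlink
```

Because `require` skips files it has already loaded, only `load` picks up edits made to `lib/` while the console session is open.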
102 changes: 51 additions & 51 deletions chronicle-etl.gemspec
@@ -1,73 +1,73 @@
# frozen_string_literal: true

-lib = File.expand_path("../lib", __FILE__)
+lib = File.expand_path('lib', __dir__)
 $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
-require "chronicle/etl/version"
+require 'chronicle/etl/version'

 Gem::Specification.new do |spec|
-  spec.name = "chronicle-etl"
+  spec.name = 'chronicle-etl'
   spec.version = Chronicle::ETL::VERSION
-  spec.authors = ["Andrew Louis"]
-  spec.email = ["andrew@hyfen.net"]
+  spec.authors = ['Andrew Louis']
+  spec.email = ['andrew@hyfen.net']

-  spec.summary = "ETL tool for personal data"
-  spec.description = "Chronicle-ETL allows you to extract personal data from a variety of services, transformer it, and load it."
-  spec.homepage = "https://github.com/chronicle-app"
-  spec.license = "MIT"
+  spec.summary = 'ETL tool for personal data'
+  spec.description = 'Chronicle-ETL allows you to extract personal data from a variety of services, transformer it, and load it.'
+  spec.homepage = 'https://github.com/chronicle-app'
+  spec.license = 'MIT'

   # Prevent pushing this gem to RubyGems.org. To allow pushes either set the 'allowed_push_host'
   # to allow pushing to a single host or delete this section to allow pushing to any host.
   if spec.respond_to?(:metadata)
-    spec.metadata['allowed_push_host'] = "https://rubygems.org"
+    spec.metadata['allowed_push_host'] = 'https://rubygems.org'

-    spec.metadata["homepage_uri"] = spec.homepage
-    spec.metadata["source_code_uri"] = "https://github.com/chronicle-app/chronicle-etl"
-    spec.metadata["changelog_uri"] = "https://github.com/chronicle-app/chronicle-etl/releases"
+    spec.metadata['homepage_uri'] = spec.homepage
+    spec.metadata['source_code_uri'] = 'https://github.com/chronicle-app/chronicle-etl'
+    spec.metadata['changelog_uri'] = 'https://github.com/chronicle-app/chronicle-etl/releases'
   else
-    raise "RubyGems 2.0 or newer is required to protect against " \
-      "public gem pushes."
+    raise 'RubyGems 2.0 or newer is required to protect against ' \
+      'public gem pushes.'
   end

   # Specify which files should be added to the gem when it is released.
   # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
-  spec.files = Dir.chdir(File.expand_path('..', __FILE__)) do
+  spec.files = Dir.chdir(File.expand_path(__dir__)) do
     `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
   end
-  spec.bindir = "exe"
+  spec.bindir = 'exe'
   spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
-  spec.require_paths = ["lib"]
-  spec.required_ruby_version = ">= 2.7"
+  spec.require_paths = ['lib']
+  spec.required_ruby_version = '>= 3.1'
+  spec.metadata['rubygems_mfa_required'] = 'true'

-  spec.add_dependency "activesupport", "~> 7.0"
-  spec.add_dependency "chronicle-core", "~> 0.2.2"
-  spec.add_dependency "chronic_duration", "~> 0.10.6"
-  spec.add_dependency "colorize", "~> 0.8.1"
-  spec.add_dependency "gems", ">= 1"
-  spec.add_dependency "launchy"
-  spec.add_dependency "marcel", "~> 1.0.2"
-  spec.add_dependency "mini_exiftool", "~> 2.10"
-  spec.add_dependency "nokogiri", "~> 1.13"
-  spec.add_dependency "omniauth", "~> 2"
-  spec.add_dependency "sequel", "~> 5.35"
-  spec.add_dependency "sinatra", "~> 2"
-  spec.add_dependency "sqlite3", "~> 1.4"
-  spec.add_dependency "thor", "~> 1.2"
-  spec.add_dependency "thor-hollaback", "~> 0.2"
-  spec.add_dependency "tty-progressbar", "~> 0.17"
-  spec.add_dependency "tty-prompt", "~> 0.23"
-  spec.add_dependency "tty-spinner"
-  spec.add_dependency "tty-table", "~> 0.11"
-  spec.add_dependency "xdg", ">= 4.0"
+  spec.add_dependency 'activesupport', '~> 7.0'
+  spec.add_dependency 'chronic_duration', '~> 0.10.6'
+  spec.add_dependency 'chronicle-core', '~> 0.3'
+  spec.add_dependency 'colorize', '~> 0.8.1'
+  spec.add_dependency 'gems', '>= 1'
+  spec.add_dependency 'launchy'
+  spec.add_dependency 'marcel', '~> 1.0.2'
+  spec.add_dependency 'omniauth', '~> 2'
+  spec.add_dependency 'sequel', '~> 5.35'
+  spec.add_dependency 'sinatra', '~> 2'
+  spec.add_dependency 'sqlite3', '~> 1.4'
+  spec.add_dependency 'thor', '~> 1.2'
+  spec.add_dependency 'thor-hollaback', '~> 0.2'
+  spec.add_dependency 'tty-progressbar', '~> 0.17'
+  spec.add_dependency 'tty-prompt', '~> 0.23'
+  spec.add_dependency 'tty-spinner'
+  spec.add_dependency 'tty-table', '~> 0.12'
+  spec.add_dependency 'xdg', '>= 4.0'

-  spec.add_development_dependency "bundler", "~> 2.1"
-  spec.add_development_dependency "fakefs", "~> 1.4"
-  spec.add_development_dependency "guard-rspec", "~> 4.7.3"
-  spec.add_development_dependency "pry-byebug", "~> 3.9"
-  spec.add_development_dependency "rake", "~> 13.0"
-  spec.add_development_dependency "rspec", "~> 3.9"
-  spec.add_development_dependency "rubocop", "~> 1.25.1"
-  spec.add_development_dependency "simplecov", "~> 0.21"
-  spec.add_development_dependency "vcr", "~> 6.1"
-  spec.add_development_dependency "webmock", "~> 3"
-  spec.add_development_dependency "webrick", "~> 1.7"
-  spec.add_development_dependency "yard", "~> 0.9.7"
+  spec.add_development_dependency 'bundler', '~> 2.1'
+  spec.add_development_dependency 'fakefs', '~> 1.4'
+  spec.add_development_dependency 'guard-rspec', '~> 4.7.3'
+  spec.add_development_dependency 'pry-byebug', '~> 3.9'
+  spec.add_development_dependency 'rake', '~> 13.0'
+  spec.add_development_dependency 'rspec', '~> 3.9'
+  spec.add_development_dependency 'rubocop', '~> 1.57'
+  spec.add_development_dependency 'simplecov', '~> 0.21'
+  spec.add_development_dependency 'vcr', '~> 6.1'
+  spec.add_development_dependency 'webmock', '~> 3'
+  spec.add_development_dependency 'webrick', '~> 1.7'
+  spec.add_development_dependency 'yard', '~> 0.9.7'
 end
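
The gemspec moves `chronicle-core` from `~> 0.2.2` to `~> 0.3`, which widens the accepted range as well as raising the floor. A short sketch of how RubyGems evaluates the two pessimistic constraints; the helper `allowed?` and the sample version numbers are hypothetical and chosen only to show the boundaries:

```ruby
require 'rubygems'

# Hypothetical helper: does `version` satisfy the given constraint string?
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# Old constraint: '~> 0.2.2' accepts >= 0.2.2 and < 0.3.0.
puts allowed?('~> 0.2.2', '0.2.9')
puts allowed?('~> 0.2.2', '0.3.0')

# New constraint: '~> 0.3' accepts >= 0.3 and < 1.0.
puts allowed?('~> 0.3', '0.3.0')
puts allowed?('~> 0.3', '0.9.1')
puts allowed?('~> 0.3', '1.0.0')
```

With two version segments the `~>` operator lets the minor version float, so `~> 0.3` accepts any `0.x` release from `0.3` upward, whereas the old three-segment form pinned the `0.2` series.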
2 changes: 1 addition & 1 deletion exe/chronicle-etl
@@ -1,5 +1,5 @@
 #!/usr/bin/env ruby

-require "chronicle/etl/cli"
+require 'chronicle/etl/cli'

 Chronicle::ETL::CLI::Main.start(ARGV)
5 changes: 4 additions & 1 deletion lib/chronicle/etl.rb
@@ -1,11 +1,15 @@
 # frozen_string_literal: true

+require 'chronicle/schema'
+require 'chronicle/models/base'
+
 require_relative 'etl/registry/registry'
 require_relative 'etl/authorizer'
 require_relative 'etl/config'
 require_relative 'etl/configurable'
 require_relative 'etl/exceptions'
 require_relative 'etl/extraction'
+require_relative 'etl/record'
 require_relative 'etl/job_definition'
 require_relative 'etl/job_log'
 require_relative 'etl/job_logger'
@@ -14,7 +18,6 @@
 require_relative 'etl/runner'
 require_relative 'etl/secrets'
 require_relative 'etl/utils/binary_attachments'
-require_relative 'etl/utils/text_recognition'
 require_relative 'etl/utils/progress_bar'
 require_relative 'etl/version'
