moved from existing repo to its own repo

commit a7d0b7870099a317795548646f31c2e0a576767f (0 parents), aeden committed Nov 4, 2008
Showing with 7,357 additions and 0 deletions.
  1. +1 −0 .gitignore
  2. +6 −0 0.9-UPGRADE
  3. +190 −0 CHANGELOG
  4. +8 −0 HOW_TO_RELEASE
  5. +7 −0 LICENSE
  6. +85 −0 README
  7. +153 −0 Rakefile
  8. +28 −0 TODO
  9. +78 −0 active_support_logger.patch
  10. +28 −0 bin/etl
  11. +8 −0 bin/etl.cmd
  12. +16 −0 examples/database.example.yml
  13. +78 −0 lib/etl.rb
  14. +2 −0 lib/etl/batch.rb
  15. +111 −0 lib/etl/batch/batch.rb
  16. +55 −0 lib/etl/batch/directives.rb
  17. +2 −0 lib/etl/builder.rb
  18. +96 −0 lib/etl/builder/date_dimension_builder.rb
  19. +31 −0 lib/etl/builder/time_dimension_builder.rb
  20. +89 −0 lib/etl/commands/etl.rb
  21. +3 −0 lib/etl/control.rb
  22. +403 −0 lib/etl/control/control.rb
  23. +420 −0 lib/etl/control/destination.rb
  24. +95 −0 lib/etl/control/destination/database_destination.rb
  25. +124 −0 lib/etl/control/destination/file_destination.rb
  26. +109 −0 lib/etl/control/source.rb
  27. +220 −0 lib/etl/control/source/database_source.rb
  28. +11 −0 lib/etl/control/source/enumerable_source.rb
  29. +90 −0 lib/etl/control/source/file_source.rb
  30. +39 −0 lib/etl/control/source/model_source.rb
  31. +1 −0 lib/etl/core_ext.rb
  32. +5 −0 lib/etl/core_ext/time.rb
  33. +42 −0 lib/etl/core_ext/time/calculations.rb
  34. +552 −0 lib/etl/engine.rb
  35. +20 −0 lib/etl/execution.rb
  36. +9 −0 lib/etl/execution/base.rb
  37. +8 −0 lib/etl/execution/batch.rb
  38. +8 −0 lib/etl/execution/job.rb
  39. +85 −0 lib/etl/execution/migration.rb
  40. +18 −0 lib/etl/execution/record.rb
  41. +2 −0 lib/etl/generator.rb
  42. +20 −0 lib/etl/generator/generator.rb
  43. +39 −0 lib/etl/generator/surrogate_key_generator.rb
  44. +139 −0 lib/etl/http_tools.rb
  45. +11 −0 lib/etl/parser.rb
  46. +49 −0 lib/etl/parser/apache_combined_log_parser.rb
  47. +74 −0 lib/etl/parser/delimited_parser.rb
  48. +65 −0 lib/etl/parser/fixed_width_parser.rb
  49. +41 −0 lib/etl/parser/parser.rb
  50. +218 −0 lib/etl/parser/sax_parser.rb
  51. +65 −0 lib/etl/parser/xml_parser.rb
  52. +11 −0 lib/etl/processor.rb
  53. +14 −0 lib/etl/processor/block_processor.rb
  54. +81 −0 lib/etl/processor/bulk_import_processor.rb
  55. +80 −0 lib/etl/processor/check_exist_processor.rb
  56. +35 −0 lib/etl/processor/check_unique_processor.rb
  57. +26 −0 lib/etl/processor/copy_field_processor.rb
  58. +55 −0 lib/etl/processor/encode_processor.rb
  59. +55 −0 lib/etl/processor/hierarchy_exploder_processor.rb
  60. +12 −0 lib/etl/processor/print_row_processor.rb
  61. +25 −0 lib/etl/processor/processor.rb
  62. +24 −0 lib/etl/processor/rename_processor.rb
  63. +26 −0 lib/etl/processor/require_non_blank_processor.rb
  64. +17 −0 lib/etl/processor/row_processor.rb
  65. +23 −0 lib/etl/processor/sequence_processor.rb
  66. +53 −0 lib/etl/processor/surrogate_key_processor.rb
  67. +35 −0 lib/etl/processor/truncate_processor.rb
  68. +20 −0 lib/etl/row.rb
  69. +14 −0 lib/etl/screen.rb
  70. +20 −0 lib/etl/screen/row_count_screen.rb
  71. +2 −0 lib/etl/transform.rb
  72. +13 −0 lib/etl/transform/block_transform.rb
  73. +20 −0 lib/etl/transform/date_to_string_transform.rb
  74. +51 −0 lib/etl/transform/decode_transform.rb
  75. +20 −0 lib/etl/transform/default_transform.rb
  76. +122 −0 lib/etl/transform/foreign_key_lookup_transform.rb
  77. +49 −0 lib/etl/transform/hierarchy_lookup_transform.rb
  78. +12 −0 lib/etl/transform/ordinalize_transform.rb
  79. +13 −0 lib/etl/transform/sha1_transform.rb
  80. +16 −0 lib/etl/transform/string_to_date_transform.rb
  81. +14 −0 lib/etl/transform/string_to_datetime_transform.rb
  82. +11 −0 lib/etl/transform/string_to_time_transform.rb
  83. +61 −0 lib/etl/transform/transform.rb
  84. +26 −0 lib/etl/transform/trim_transform.rb
  85. +35 −0 lib/etl/transform/type_transform.rb
  86. +59 −0 lib/etl/util.rb
  87. +9 −0 lib/etl/version.rb
  88. +2 −0 test/.ignore
  89. +6 −0 test/all.ebf
  90. +11 −0 test/apache_combined_log.ctl
  91. +41 −0 test/batch_test.rb
  92. +6 −0 test/batch_with_error.ebf
  93. 0 test/batched1.ctl
  94. 0 test/batched2.ctl
  95. +6 −0 test/block_processor.ctl
  96. +1 −0 test/block_processor_error.ctl
  97. +4 −0 test/block_processor_pre_post_process.ctl
  98. +5 −0 test/block_processor_remove_rows.ctl
  99. +38 −0 test/block_processor_test.rb
  100. +9 −0 test/connection/native_mysql/connection.rb
  101. +36 −0 test/connection/native_mysql/schema.sql
  102. +13 −0 test/connection/postgresql/connection.rb
  103. +43 −0 test/connection/postgresql/schema.sql
  104. +43 −0 test/control_test.rb
  105. +3 −0 test/data/apache_combined_log.txt
  106. +3 −0 test/data/bulk_import.txt
  107. +3 −0 test/data/bulk_import_with_empties.txt
  108. +3 −0 test/data/decode.txt
  109. +3 −0 test/data/delimited.txt
  110. +2 −0 test/data/encode_source_latin1.txt
  111. +3 −0 test/data/fixed_width.txt
  112. +3 −0 test/data/multiple_delimited_1.txt
  113. +3 −0 test/data/multiple_delimited_2.txt
  114. +3 −0 test/data/people.txt
  115. +14 −0 test/data/sax.xml
  116. +16 −0 test/data/xml.xml
  117. +18 −0 test/database.example.yml
  118. +18 −0 test/database.mysql.yml
  119. +18 −0 test/database.postgres.yml
  120. +18 −0 test/database.yml
  121. +96 −0 test/date_dimension_builder_test.rb
  122. +30 −0 test/delimited.ctl
  123. +33 −0 test/delimited_absolute.ctl
  124. +25 −0 test/delimited_destination_db.ctl
  125. +34 −0 test/delimited_with_bulk_load.ctl
  126. +171 −0 test/destination_test.rb
  127. +23 −0 test/directive_test.rb
  128. +31 −0 test/encode_processor_test.rb
  129. +32 −0 test/engine_test.rb
  130. +24 −0 test/errors.ctl
  131. +42 −0 test/etl_test.rb
  132. +35 −0 test/fixed_width.ctl
  133. +14 −0 test/generator_test.rb
  134. +17 −0 test/inline_parser.ctl
  135. +26 −0 test/mocks/mock_destination.rb
  136. +25 −0 test/mocks/mock_source.rb
  137. +14 −0 test/model_source.ctl
  138. +22 −0 test/multiple_delimited.ctl
  139. +39 −0 test/multiple_source_delimited.ctl
  140. +1 −0 test/output/.ignore
  141. +3 −0 test/output/delimited.txt
  142. +2 −0 test/output/encode_destination_utf-8.txt
  143. +3 −0 test/output/fixed_width.txt
  144. +3 −0 test/output/inline_parser.txt
  145. +1 −0 test/output/scd_test_type_1.txt
  146. +1 −0 test/output/scd_test_type_1_1.txt
  147. +1 −0 test/output/scd_test_type_1_2.txt
  148. +2 −0 test/output/scd_test_type_2.txt
  149. +2 −0 test/output/test_file_destination.2.txt
  150. +2 −0 test/output/test_file_destination.txt
  151. +1 −0 test/output/test_multiple_unique.txt
  152. +2 −0 test/output/test_unique.txt
  153. +200 −0 test/parser_test.rb
  154. +30 −0 test/performance/delimited.ctl
  155. +38 −0 test/processor_test.rb
  156. +17 −0 test/row_processor_test.rb
  157. +26 −0 test/sax.ctl
  158. +1 −0 test/scd/1.txt
  159. +1 −0 test/scd/2.txt
  160. +1 −0 test/scd/3.txt
  161. +271 −0 test/scd_test.rb
  162. +43 −0 test/scd_test_type_1.ctl
  163. +42 −0 test/scd_test_type_2.ctl
  164. +9 −0 test/screen_test.rb
  165. +3 −0 test/screen_test_error.ctl
  166. +3 −0 test/screen_test_fatal.ctl
  167. +139 −0 test/source_test.rb
  168. +33 −0 test/test_helper.rb
  169. +101 −0 test/transform_test.rb
  170. +31 −0 test/xml.ctl
@@ -0,0 +1 @@
+pkg/*
@@ -0,0 +1,6 @@
+The 0.9 revision of ActiveWarehouse ETL significantly changes how connections are maintained. This release is not backwards compatible.
+
+To upgrade, you must do the following:
+
+1.) All database connections used in ETL control files must be declared in database.yml in the directory that contains your ETL control files.
+2.) All sources, destinations, transforms and processors that use a database connection must include the configuration name/value pair of :target => 'name' where name is replaced with the connection name defined in database.yml. Connection information should no longer be included in control files.
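Reviewer note: the upgrade steps above boil down to "declare each connection once, by name, and refer to it through :target". The sketch below illustrates that lookup in plain Ruby. It is not the library's implementation; the connection names, the inline YAML, and the helper `connection_config_for` are assumptions made up for illustration.

```ruby
require 'yaml'

# Hypothetical database.yml content; names and settings are illustrative only.
DATABASE_YML = <<~YAML
  data_warehouse:
    adapter: mysql
    database: dw_production
  operational:
    adapter: postgresql
    database: app_production
YAML

# Load every named connection configuration once.
CONNECTIONS = YAML.safe_load(DATABASE_YML)

# Resolve the :target option of a source, destination, transform or
# processor configuration to the settings declared under that name.
def connection_config_for(configuration)
  name = configuration.fetch(:target).to_s
  CONNECTIONS.fetch(name) do
    raise ArgumentError, "No connection named '#{name}' in database.yml"
  end
end

# The control file carries only the connection name, never the credentials.
destination_config = { :target => :data_warehouse, :table => 'people' }
puts connection_config_for(destination_config)['adapter']
```

Keeping only the name in the control file means credentials live in one place and a control file can be pointed at another environment by swapping database.yml.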
@@ -0,0 +1,190 @@
+0.1.0 - Dec 6, 2006
+* Initial release
+
+0.2.0 - Dec 7, 2006
+* Added an XML parser for source parsing
+* Added support for compound key constraints in destinations via the
+ :unique => [] option
+* Added ability to declare explicit columns in bulk import
+* Added support for generators in destinations
+* Added a SurrogateKeyGenerator for cases where the database doesn't support
+ auto generated surrogate keys
+
+0.3.0 - Dec 19, 2006
+* Added support for calculated values in virtual fields with Proc
+
+0.4.0 - Jan 11, 2007
+* Added :skip_lines option to file source configurations, which can be used
+ to skip the first n lines in the source data file
+* Added better error handling in delimited parser - an error is now raised
+ if the expected and actual field lengths do not match
+* Added :truncate option for database destination. Set to true to truncate
+ before importing data.
+* Added support for :unique => [] option and virtual fields for the database
+ destination
+
+0.5.0 - Feb 17, 2007
+* Changed require_gem to gem and added alias to allow for older versions of
+ rubygems.
+* Added support for Hash in the source configuration where :name => :parser_name
+ defines the parser to use and :options => {} defines options to pass to the
+ parser.
+* Added support for passing a custom Parser class in the source configuration.
+* Removed the need to include Enumerable in each parser implementation.
+* Added new date_to_string and string_to_date transformers.
+* Implemented foreign_key_lookup transform including an ActiveRecordResolver.
+* Added real time activity logging which is called when the etl bin script is
+ invoked.
+* Improved error handling.
+* Default logger level is now WARN.
+
+0.5.1 - Feb 18, 2007
+* Fixed up truncate processor.
+* Updated HOW_TO_RELEASE doc.
+
+0.5.2 - Feb 19, 2007
+* Added error threshold.
+* Fixed problem with transform error handling.
+
+0.6.0 - Mar 8, 2007
+* Fixed missing method problem in validate in Control class.
+* Removed control validation for now (source could be code in the control file).
+* Transform interface now defined as taking 3 arguments, the field name, field
+ value and the row. This is not backwards compatible.
+* Added HierarchyLookupTransform.
+* Added DefaultTransform which will return a specified value if the initial
+ value is blank.
+* Added row-level processing.
+* Added HierarchyExploderProcessor which takes a single hierarchy row and
+ explodes it to multiple rows as used in a hierarchy bridge.
+* Added ApacheCombinedLogParser which parses Apache Combined Log format,
+ including parsing of the user agent string and the URI, returning a Hash.
+* Fixed bug in SAX parser so that attributes are now set when the start_element
+ event is received.
+* Added an HttpTools module which provides some parsing methods (for user agent
+ and URI).
+* Database source now uses its own class for establishing an ActiveRecord
+ connection.
+* Log files are now timestamped.
+* Source files are now archived automatically during the extraction process
+* Added a :condition option to the destination configuration Hash that accepts
+ a Proc with a single argument passed to it (the row).
+* Added an :append_rows option to the destination configuration Hash that
+ accepts either a Hash (to append a single row) or an Array of Hashes (to
+ append multiple rows).
+* Only print the read and written row counts if there is at least one source
+ and one destination respectively.
+* Added a depends_on directive that accepts a list of arguments of either strings
+ or symbols. Each symbol is converted to a string and .ctl is appended;
+ strings are passed through directly. The dependencies are executed in the order
+ they are specified.
+* The default field separator in the bulk loader is now a comma (was a tab).
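Reviewer note: the depends_on entry above describes a small naming rule (symbols get ".ctl" appended, strings pass through, order is preserved). A minimal sketch of that rule follows; `resolve_dependencies` is a hypothetical helper, not a method of the library, and raising on other argument types is an assumption.

```ruby
# Hypothetical helper mirroring the changelog's description of depends_on.
def resolve_dependencies(*args)
  args.map do |arg|
    case arg
    when Symbol then "#{arg}.ctl"   # symbols name a control file without its extension
    when String then arg            # strings are used exactly as given
    else raise ArgumentError, "Expected a String or Symbol, got #{arg.class}"
    end
  end
end

# Dependencies come back in the order they were declared.
p resolve_dependencies(:dimensions, 'facts/sales.ctl', :aggregates)
```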
+
+0.6.1 - Mar 22, 2007
+* Added support for absolute paths in file sources
+* Added CopyFieldProcessor
+
+0.7 - Apr 8, 2007
+* Job execution is now tracked in a database. This means that ActiveRecord is
+ required regardless of the sources being used in the ETL scripts. An example
+ database configuration for the etl can be found in test/database.example.yml.
+ This file is loaded from either a.) the current working directory or b.) the
+ location specified using the -c command line argument when running the etl
+ command.
+* etl script now supports the following command line arguments:
+** -h or --help: Prints the usage
+** -l or --limit: Specifies a limit for the number of source rows to read,
+ useful for testing your control files before executing a full ETL process
+** -o or --offset: Specifies a start offset for reading from the source, useful
+ for testing your control files before executing a full ETL process
+** -c or --config: Specify the database.yml file to configure the ETL
+ execution data store
+** -n or --newlog: Write to a new logfile rather than appending to the last one
+* Database source now supports specifying the select, join and order parts of
+ the query.
+* Database source understands the limit argument specified on the etl command
+ line
+* Added CheckExistProcessor
+* Added CheckUniqueProcessor
+* Added SurrogateKeyProcessor. The SurrogateKey processor should be used in
+ conjunction with the CheckExistProcessor and CheckUniqueProcessor to provide
+ surrogate keys for all dimension records.
+* Added SequenceProcessor
+* Added OrdinalizeTransform
+* Fixed a bug in the trim transform
+* Sources now provide a trigger file which can be used to indicate that the
+ original source data has been completely extracted to the local file system.
+ This is useful if you need to recover from a failed ETL process.
+* Updated README
+
+0.7.1 - Apr 8, 2007
+* Fixed source caching
+
+0.7.2 - Apr 8, 2007
+* Fixed quoting bug in CheckExistProcessor
+
+0.8.0 - Apr 12, 2007
+* Source now available through the current row source accessor.
+* Added new_rows_only configuration option to DatabaseSource. A date field must
+ be specified and only records that are greater than the date value in that
+ field, relative to the last successful execution, will be returned from the
+ source.
+* Added an (untested) count feature which returns the number of rows for
+ processing.
+* If no natural key is defined then an empty array will now be used, resulting
+ in the row being written to the output without going through change checks.
+* Mapping argument in destination is now optional. An empty hash will be used
+ if the mapping hash is not specified. If the mapping hash is not specified
+ then the order will be determined using the originating source's order.
+* ActiveRecord configurations loaded from database.yml by the etl tool will be
+ merged with ActiveRecord::Base.configurations.
+* Fixed several bugs in how record change detection was implemented.
+* Fixed how the read_locally functionality was implemented so that it will find
+ the last completed local source copy using the source's trigger file (untested).
+
+0.8.1 - Apr 12, 2007
+* Added EnumerableSource
+* Added :type configuration option to the source directive, allowing the source
+ type to be explicitly specified. The source type can be a string or symbol
+ (in which case the class will be constructed by appending Source to the type
+ name), a class (which will be instantiated and passed the control,
+ configuration and mapping) or an actual Source instance.
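Reviewer note: the three accepted forms of :type can be sketched as a single dispatch. Everything below is a stand-in: the `SketchSources` module, its classes, and `build_source` are hypothetical, and the way a name such as :enumerable is turned into a class name is an assumption about the behavior described above, not the library's code.

```ruby
# Hypothetical stand-ins for the library's source classes.
module SketchSources
  class Source
    attr_reader :control, :configuration, :mapping

    def initialize(control, configuration, mapping)
      @control = control
      @configuration = configuration
      @mapping = mapping
    end
  end

  class EnumerableSource < Source; end
  class FileSource < Source; end
end

# Build a source from a name, a class, or a ready-made instance.
def build_source(type, control, configuration, mapping)
  case type
  when String, Symbol
    # "enumerable" -> "EnumerableSource" (assumed naming rule)
    class_name = type.to_s.split('_').map(&:capitalize).join + 'Source'
    SketchSources.const_get(class_name).new(control, configuration, mapping)
  when Class
    type.new(control, configuration, mapping)
  else
    type # already a source instance; use it as is
  end
end

by_name  = build_source(:enumerable, :control, { :file => 'people.txt' }, [])
by_class = build_source(SketchSources::FileSource, :control, {}, [])
puts by_name.class
puts by_class.class
```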
+
+0.8.2 - April 15, 2007
+* Fixed bug with premature destination closing.
+* Added indexes to execution records table.
+* Added a PrintRowProcessor.
+* Added support for conditions and "group by" in the database source.
+* Added after_initialize hook in Processor base class.
+* Added examples directory
+
+0.8.3 - May 13, 2007
+* Added patches from Andy Triboletti
+
+0.8.4 - May 24, 2007
+* Added fix for backslash in file writer
+
+0.9.0 - August 9, 2007
+* Added support for batch processing through .ebf files. These files are
+ essentially control files that apply settings to an entire ETL process.
+* Implemented support for screen blocks. These blocks can be used to test
+ the data and raise an error if the screens do not pass.
+* Connections are now cached in a Hash available through
+ ETL::Engine.connection(name). This should be used rather than including
+ connection information in the control files.
+* Implemented temp table support throughout.
+* DateDimensionBuilder now included in ActiveWarehouse ETL directly.
+* Time calculations for fiscal year now included in ActiveWarehouse ETL.
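Reviewer note: the 0.9.0 entry says connections are cached in a Hash and fetched by name through ETL::Engine.connection(name). The sketch below shows the general memoization idea only; `ConnectionCache` and its factory block are hypothetical and the string placeholder stands in for a real database handle.

```ruby
# Hypothetical cache illustrating "open once, reuse by name".
class ConnectionCache
  def initialize(&factory)
    @factory = factory   # block that opens a connection for a given name
    @connections = {}    # name => connection
  end

  # Return the cached connection, opening it on first use.
  def connection(name)
    key = name.to_sym
    @connections[key] ||= @factory.call(key)
  end
end

opened = []
cache = ConnectionCache.new do |name|
  opened << name
  "connection to #{name}" # placeholder object instead of a real DB handle
end

cache.connection(:data_warehouse)
cache.connection('data_warehouse') # same name as a string: served from the cache
puts opened.length
```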
+
+0.9.1 -
+* SQLResolver now uses ETL::Engine.table so it may utilize temp tables. (aeden)
+* Added Thibaut Barrère's encode processor.
+* Added MockSource and MockDestination test helpers (thbar)
+* Added the block processor. Can call a block once (pre/post processor)
+ or once for each row (after_read/before_write row processor) (thbar)
+* Changed temp table to use new AdapterExtension copy_table method (aeden)
+* Added bin/etl.cmd windows batch - just add the bin folder to your PATH
+ and it will let you call etl on an unpacked/pistoned version of AW-ETL (thbar)
+* Upgraded to support Rails 2.1. No longer compatible with older versions of Rails.
+* Added ETL::Builder::TimeDimensionBuilder
@@ -0,0 +1,8 @@
+cd trunk
+rake release
+cd ..
+svn cp trunk tags/release-x.y.z
+cd tags/release-x.y.z
+svn commit
+cd ../../trunk
+rake pdoc
@@ -0,0 +1,7 @@
+Copyright (c) 2006-2007 Anthony Eden
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,85 @@
+Ruby Extract-Transform-Load (ETL) tool.
+
+== Requirements
+
+* Ruby 1.8.5 or higher
+* Rubygems
+
+== Online Documentation
+
+Available at http://activewarehouse.rubyforge.org/docs/activewarehouse-etl.html
+
+== Features
+
+Current supported features:
+
+* ETL Domain Specific Language (DSL) - Control files are specified in a Ruby-based DSL
+* Multiple source types. Current supported types:
+ * Fixed-width and delimited text files
+ * XML files through SAX
+ * Apache combined log format
+* Multiple destination types - file and database destinations
+* Support for extracting from multiple sources in a single job
+* Support for writing to multiple destinations in a single job
+* A variety of built-in transformations are included:
+ * Date-to-string, string-to-date, string-to-datetime, string-to-timestamp
+ * Type transformation supporting strings, integers, floats and big decimals
+ * Trim
+ * SHA-1
+ * Decode from an external decode file
+ * Default replacement for empty values
+ * Ordinalize
+ * Hierarchy lookup
+ * Foreign key lookup
+ * Ruby blocks
+ * Any custom transformation class
+* A variety of built-in row-level processors
+ * Check exists processor to determine if the record already exists in the destination database
+ * Check unique processor to determine whether a matching record was processed during this job execution
+ * Copy field
+ * Rename field
+ * Hierarchy exploder which takes a tree structure defined through a parent id and explodes it into a hierarchy bridge table
+ * Surrogate key generator including support for looking up the last surrogate key from the target table using a custom query
+ * Sequence generator including support for context-sensitive sequences where the context can be defined as a combination of fields from the source data
+ * New row-level processors can easily be defined and applied
+* Pre-processing
+ * Truncate processor
+* Post-processing
+ * Bulk import using native RDBMS bulk loader tools
+* Virtual fields - Add a field to the destination data which doesn't exist in the source data
+* Built-in job and record metadata
+* Support for type 1 and type 2 slowly changing dimensions
+ * Automated effective date and end date time stamping for type 2
+ * CRC checking
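Reviewer note: the CHANGELOG in this commit states that a transform takes three arguments (the field name, the field value and the row), and the feature list above mentions both a default replacement for empty values and Ruby blocks as transforms. The sketch below illustrates that calling convention. `DefaultValueTransform` and the lambda are hypothetical examples written for this note, not the library's classes.

```ruby
# Hypothetical transform: replace a blank value with a configured default.
class DefaultValueTransform
  def initialize(default_value)
    @default_value = default_value
  end

  # name:  the field being transformed
  # value: the current value of that field
  # row:   the whole row, in case the result depends on other fields
  def transform(name, value, row)
    blank?(value) ? @default_value : value
  end

  private

  def blank?(value)
    value.nil? || value.to_s.strip.empty?
  end
end

# A block-style transform using the same three arguments.
upcase_value = lambda { |name, value, row| value.to_s.upcase }

row = { :first_name => 'ada', :country => '' }
row[:country] = DefaultValueTransform.new('Unknown').transform(:country, row[:country], row)
row[:first_name] = upcase_value.call(:first_name, row[:first_name], row)
p row
```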
+
+== Dependencies
+ActiveWarehouse ETL depends on the following gems:
+* ActiveSupport Gem
+* ActiveRecord Gem
+* FasterCSV Gem
+* AdapterExtensions Gem
+
+== Usage
+Once the ActiveWarehouse ETL gem is installed, jobs can be invoked using the
+included `etl` script. The etl script accepts several command line options
+and can process multiple control files at a time.
+
+Command line options:
+* <tt>--help, -h</tt>: Display the usage message.
+* <tt>--config, -c</tt>: Specify a database.yml configuration file to use.
+* <tt>--limit, -l</tt>: Specify a limit to the number of rows to process. This option is currently only applicable to database sources.
+* <tt>--offset, -o</tt>: Specify the start offset for reading from the source. This option is currently only applicable to database sources.
+* <tt>--newlog, -n</tt>: Instruct the engine to create a new ETL log rather than append to the last ETL log.
+* <tt>--skip-bulk-import, -s</tt>: Skip any bulk imports.
+* <tt>--read-locally</tt>: Read from the local cache (skip source extraction)
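Reviewer note: the option list above can be modelled with Ruby's standard OptionParser. This is a sketch of how such flags might be parsed, not the actual bin/etl implementation; `parse_etl_options`, the option keys and the banner text are assumptions. OptionParser supplies -h/--help on its own.

```ruby
require 'optparse'

# Hypothetical parser for the flags documented in the README.
# Returns the parsed options and the remaining arguments (control files).
def parse_etl_options(argv)
  options = {
    :config => nil, :limit => nil, :offset => nil,
    :newlog => false, :skip_bulk_import => false, :read_locally => false
  }

  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: etl [options] file.ctl [file.ctl ...]' # illustrative
    opts.on('-c', '--config FILE', 'database.yml to use') { |v| options[:config] = v }
    opts.on('-l', '--limit N', Integer, 'Limit rows read') { |v| options[:limit] = v }
    opts.on('-o', '--offset N', Integer, 'Start offset') { |v| options[:offset] = v }
    opts.on('-n', '--newlog', 'Start a new log') { options[:newlog] = true }
    opts.on('-s', '--skip-bulk-import', 'Skip bulk imports') { options[:skip_bulk_import] = true }
    opts.on('--read-locally', 'Read from the local cache') { options[:read_locally] = true }
  end

  control_files = parser.parse(argv) # non-option arguments, in order
  [options, control_files]
end

options, control_files = parse_etl_options(%w[-l 10 --newlog people.ctl sales.ctl])
puts options[:limit]
puts control_files.join(', ')
```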
+
+== Control File Examples
+Control file examples can be found in the examples directory.
+
+== Running Tests
+The tests require Shoulda 1.x.
+
+== Feedback
+This is a work in progress. Comments should be made on the
+activewarehouse-discuss mailing list at the moment. Contributions are always
+welcome.