zubair-arbi Merge pull request #704 from edx/zub/ENT-1232
ENT-1232 Change field type of unenrollment_timestamp to datetime
Latest commit 4284c67 Dec 15, 2018
Type Name Latest commit message Commit time
.github Created pull request template. Jun 14, 2018
config Merge pull request #516 from edx/brian/basic-spark Jun 19, 2018
docs/source Add docker analyticstack guides May 23, 2018
edx ENT-1232 Change field type of unenrollment_timestamp to datetime Dec 14, 2018
gpg-keys Initial content of repository, migrated from analytics-tasks repo. Aug 18, 2014
images Update docker analyticstack information in readme May 25, 2018
requirements OEP-18 implementation. Jun 4, 2018
scripts include a context argument Jul 26, 2017
share Add support for templating to Luigi configuration files, redux. (#407) Jun 22, 2017
.coveragerc Added Dockerfile to aid local development Dec 5, 2017
.gitignore support extra repos and packages Dec 15, 2016
.isort.cfg Sorting imports with isort Dec 5, 2017
.travis.yml Only build travis push on master Dec 6, 2018
AUTHORS Add the capability for analysts to define their own tasks, or SQL Dec 15, 2016
LICENSE Initial content of repository, migrated from analytics-tasks repo. Aug 18, 2014
MANIFEST.in Initial content of repository, migrated from analytics-tasks repo. Aug 18, 2014
Makefile Fix pip version before uninstall. Oct 10, 2018
README.md Update docker analyticstack information in readme May 25, 2018
client.cfg Support ssh connections to arbitrary hosts Aug 12, 2015
logging.cfg Support log analysis for performance benchmark Jan 27, 2015
openedx.yaml Specify an owner Aug 23, 2016
pylintrc Upgrade pylint to 1.5 and pep8 to 1.6. Feb 18, 2016
setup.cfg Add LoadAllVideoToVertica task. Dec 7, 2018
setup.py Initial content of repository, migrated from analytics-tasks repo. Aug 18, 2014

README.md

Open edX Data Pipeline

A data pipeline for analyzing Open edX data. This is a batch analysis engine that is capable of running complex data processing workflows.

The data pipeline takes large amounts of raw data, analyzes it and produces higher value outputs that are used by various downstream tools.

The primary consumer of this data is Open edX Insights.

It is also used to generate a variety of packaged outputs for research, business intelligence and other reporting.

It gathers input from a variety of sources including (but not limited to):

  • Tracking log files - This is the primary data source.
  • LMS database
  • Otto database
  • LMS APIs (course blocks, course listings)
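As a rough illustration of what working with the primary data source can look like, here is a minimal sketch that counts events per course from tracking log lines. It assumes each line is a JSON object with an `event_type` field and a `context.course_id` field; those field names, the sample lines, and the helper `count_events_by_course` are assumptions made for this example and are not taken from the pipeline's own code.

```python
import json
from collections import Counter


def count_events_by_course(lines):
    """Count (course_id, event_type) pairs, skipping lines that cannot be used."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # malformed line: not valid JSON
        if not isinstance(event, dict):
            continue
        context = event.get("context")
        if not isinstance(context, dict):
            continue
        course_id = context.get("course_id")
        if course_id:
            counts[(course_id, event.get("event_type"))] += 1
    return counts


# Hypothetical sample lines, assumed to resemble tracking log entries.
sample = [
    '{"event_type": "play_video", "context": {"course_id": "course-v1:edX+DemoX+Demo"}}',
    '{"event_type": "play_video", "context": {"course_id": "course-v1:edX+DemoX+Demo"}}',
    'not json',
    '{"event_type": "page_close", "context": {}}',
]
counts = count_events_by_course(sample)
```

Malformed lines and events without a course are skipped rather than raising, since a batch job over large log files usually cannot afford to stop on a single bad record.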

It outputs to:

  • S3 - CSV reports, packaged exports
  • MySQL - This is known as the "result store" and is consumed by Insights
  • Elasticsearch - This is also used by Insights
  • Vertica - This is used for business intelligence and reporting purposes

This tool uses spotify/luigi as the core of the workflow engine.
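To give a feel for the kind of workflow such an engine manages, here is a minimal sketch of the general task pattern: each task declares what it requires and what it outputs, and a task whose output already exists is skipped. This is a standard-library stand-in written for illustration, not Luigi's actual API, and the class names `ExtractEvents` and `CountEvents` are hypothetical.

```python
import os
import tempfile


class Task:
    """Stand-in for a workflow task: complete when its output file exists."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self):
        return os.path.join(self.workdir, type(self).__name__ + ".txt")

    def complete(self):
        return os.path.exists(self.output())

    def run(self):
        raise NotImplementedError


class ExtractEvents(Task):
    def run(self):
        # Hypothetical raw data standing in for extracted events.
        with open(self.output(), "w") as f:
            f.write("play_video\npause_video\nplay_video\n")


class CountEvents(Task):
    def requires(self):
        return [ExtractEvents(self.workdir)]

    def run(self):
        with open(self.requires()[0].output()) as f:
            events = f.read().split()
        with open(self.output(), "w") as f:
            f.write(str(len(events)))


def build(task, executed):
    """Run dependencies first, then the task, skipping completed work."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, executed)
    task.run()
    executed.append(type(task).__name__)


workdir = tempfile.mkdtemp()
first_run, second_run = [], []
build(CountEvents(workdir), first_run)
build(CountEvents(workdir), second_run)  # nothing to do: output already exists
```

Because completion is defined by the presence of the output, re-running the same workflow does no extra work, which is what makes periodic batch invocation safe to repeat.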

Data transformation and analysis is performed with the assistance of the following third party tools (among others):

The data pipeline is designed to be invoked on a periodic basis by an external scheduler. This can be cron, Jenkins or any other system that can periodically run shell commands.
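A scheduler entry typically calls a small wrapper that works out which time window to process and then runs a shell command. The sketch below shows one way to do that for a daily job. The command name `run-pipeline-task`, the `--interval` flag, and the task name are placeholders invented for this example; substitute whatever command your installation actually uses.

```python
import datetime
import subprocess


def previous_day_interval(today):
    """Return a 'start-end' date string covering the day before `today`."""
    start = today - datetime.timedelta(days=1)
    return "{}-{}".format(start.isoformat(), today.isoformat())


def build_command(task_name, today):
    # "run-pipeline-task" and "--interval" are placeholders, not a real CLI.
    return [
        "run-pipeline-task",
        task_name,
        "--interval",
        previous_day_interval(today),
    ]


def main(task_name, today=None, runner=subprocess.call):
    """Build the command for the previous day and hand it to `runner`.

    `runner` defaults to subprocess.call, so the return value is the exit
    status of the command, which the scheduler can use to detect failures.
    """
    today = today or datetime.date.today()
    return runner(build_command(task_name, today))
```

A cron or Jenkins job would then call `main("ExampleTask")` once a day. Passing `today` and `runner` as parameters keeps the date arithmetic testable without launching a real process.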

Here is a simplified, high level, view of the architecture:

Open edX Analytics Architectural Overview

Setting up Docker-based Development Environment

As part of our move toward adopting OEP-5, we have ported our development setup from Vagrant to Docker, using a multi-container approach driven by Docker Compose. The guide Setting up Docker Analyticstack in the devstack repository can help you set up a new analyticstack.

Here is a diagram showing how the components are related and connected to one another:

the analyticstack

Setting up a Vagrant-based Development Environment

We call this environment the Vagrant "analyticstack". It contains many of the services needed to develop new features for Insights and the data pipeline.

A few of the services included are:

  • LMS (edx-platform)
  • Studio (edx-platform)
  • Insights (edx-analytics-dashboard)
  • Analytics API (edx-analytics-data-api)

We currently maintain a development environment separate from the core edx-platform devstack because the data pipeline depends on several services that dramatically increase the footprint of the virtual machine. Because only a small fraction of Open edX contributors develop features that leverage the data pipeline, we chose to build a variant of the devstack that includes these services. In the future we hope to adopt OEP-5, which would allow developers to mix and match the services they use for development at a much more granular level. In the meantime, if you are also running a traditional Open edX devstack, you will need to ensure that it and the analyticstack do not run at the same time, because they compete for the same ports.

If you are running a generic Open edX devstack, navigate to the directory that contains the Vagrantfile for it and run vagrant halt.

Please follow the analyticstack installation guide.

Note: Official support for the Vagrant "analyticstack" is coming to an end after Hawthorn.

Running In Production

For small installations, you may want to use our single instance installation guide.

For larger installations, we do not have a similarly detailed guide, but you can start with our installation guide.

How to Contribute

Contributions are very welcome, but for legal reasons, you must submit a signed individual contributor's agreement before we can accept your contribution. See our CONTRIBUTING file for more information -- it also contains guidelines for how to maintain high code quality, which will make your contribution more likely to be accepted.