Added nofollow to all external links
abti committed Jan 1, 2021
1 parent c3ae848 commit efc2a69aa5c1a7bc4ac42e29c590b6bc745023b8
Showing 96 changed files with 505 additions and 505 deletions.
@@ -67,8 +67,8 @@ <h2 id="external-project-resources">External project resources<a class="header-l

<ul>
<li><a href="https://issues.apache.org/jira/browse/GOBBLIN" target="_blank">JIRA</a> for tracking issues</li>
<li><a href="https://github.com/apache/incubator-gobblin/pulls" target="_blank">GitHub</a> for code reviews (via pull requests)</li>
<li><a href="https://travis-ci.org/apache/incubator-gobblin" target="_blank">TravisCI</a> for automated builds</li>
<li><a href="https://github.com/apache/incubator-gobblin/pulls" target="_blank" rel="nofollow">GitHub</a> for code reviews (via pull requests)</li>
<li><a href="https://travis-ci.org/apache/incubator-gobblin" target="_blank" rel="nofollow">TravisCI</a> for automated builds</li>
</ul>

<h2 id="development-docs">Development docs<a class="header-link" href="#development-docs"><i class="fa fa-link"></i></a></h2>
@@ -505,7 +505,7 @@ <h1 id="404-page-not-found">404</h1>

</div>

Built with <a href="http://www.mkdocs.org">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
Built with <a href="http://www.mkdocs.org" rel="nofollow">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme" rel="nofollow">theme</a> provided by <a href="https://readthedocs.org" rel="nofollow">Read the Docs</a>.
</footer>

</div>
@@ -530,7 +530,7 @@
<li>Getting Started</li>
<li class="wy-breadcrumbs-aside">

<a href="https://github.com/apache/incubator-gobblin/edit/master/docs/Getting-Started.md"> Edit on Gobblin</a>
<a href="https://github.com/apache/incubator-gobblin/edit/master/docs/Getting-Started.md" rel="nofollow"> Edit on Gobblin</a>

</li>
</ul>
@@ -620,14 +620,14 @@ <h1 id="running-gobblin-as-a-daemon">Running Gobblin as a Daemon</h1>
<p>Here we show how to run a Gobblin daemon. A Gobblin daemon tracks a directory and finds job configuration files in it (jobs with extensions <code>*.pull</code>). Job files can be either run once or scheduled jobs. Gobblin will automatically execute these jobs as they are received following the schedule.</p>
<p>For this example, we will once again run the Wikipedia example. The records will be stored as Avro files.</p>
<h2 id="preliminary">Preliminary</h2>
<p>Each Gobblin job minimally involves several constructs, e.g. <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-api/src/main/java/org/apache/gobblin/source/Source.java">Source</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-api/src/main/java/org/apache/gobblin/source/extractor/Extractor.java">Extractor</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-api/src/main/java/org/apache/gobblin/writer/DataWriter.java">DataWriter</a> and <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-api/src/main/java/org/apache/gobblin/publisher/DataPublisher.java">DataPublisher</a>. As the names suggest, Source defines the source to pull data from, Extractor implements the logic to extract data records, DataWriter defines the way the extracted records are output, and DataPublisher publishes the data to the final output location. A job may optionally have one or more Converters, which transform the extracted records, as well as one or more PolicyCheckers that check the quality of the extracted records and determine whether they conform to certain policies.</p>
<p>Some of the classes relevant to this example include <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/java/org/apache/gobblin/example/wikipedia/WikipediaSource.java">WikipediaSource</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/java/org/apache/gobblin/example/wikipedia/WikipediaExtractor.java">WikipediaExtractor</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/java/org/apache/gobblin/example/wikipedia/WikipediaConverter.java">WikipediaConverter</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-core/src/main/java/org/apache/gobblin/writer/AvroHdfsDataWriter.java">AvroHdfsDataWriter</a> and <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-core/src/main/java/org/apache/gobblin/publisher/BaseDataPublisher.java">BaseDataPublisher</a>.</p>
<p>To run Gobblin in standalone daemon mode we need a Gobblin configuration file (such as uses <a href="https://github.com/apache/incubator-gobblin/blob/master/conf/standalone/application.conf">application.conf</a>). And for each job we wish to run, we also need a job configuration file (such as <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/resources/wikipedia.pull">wikipedia.pull</a>). The Gobblin configuration file, which is passed to Gobblin as a command line argument, should contain a property <code>jobconf.dir</code> which specifies where the job configuration files are located. By default, <code>jobconf.dir</code> points to environment variable <code>GOBBLIN_JOB_CONFIG_DIR</code>. Each file in <code>jobconf.dir</code> with extension <code>.job</code> or <code>.pull</code> is considered a job configuration file, and Gobblin will launch a job for each such file. For more information on Gobblin deployment in standalone mode, refer to the <a href="user-guide/Gobblin-Deployment#Standalone-Deployment">Standalone Deployment</a> page.</p>
<p>Each Gobblin job minimally involves several constructs, e.g. <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-api/src/main/java/org/apache/gobblin/source/Source.java" rel="nofollow">Source</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-api/src/main/java/org/apache/gobblin/source/extractor/Extractor.java" rel="nofollow">Extractor</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-api/src/main/java/org/apache/gobblin/writer/DataWriter.java" rel="nofollow">DataWriter</a> and <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-api/src/main/java/org/apache/gobblin/publisher/DataPublisher.java" rel="nofollow">DataPublisher</a>. As the names suggest, Source defines the source to pull data from, Extractor implements the logic to extract data records, DataWriter defines the way the extracted records are output, and DataPublisher publishes the data to the final output location. A job may optionally have one or more Converters, which transform the extracted records, as well as one or more PolicyCheckers that check the quality of the extracted records and determine whether they conform to certain policies.</p>
<p>Some of the classes relevant to this example include <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/java/org/apache/gobblin/example/wikipedia/WikipediaSource.java" rel="nofollow">WikipediaSource</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/java/org/apache/gobblin/example/wikipedia/WikipediaExtractor.java" rel="nofollow">WikipediaExtractor</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/java/org/apache/gobblin/example/wikipedia/WikipediaConverter.java" rel="nofollow">WikipediaConverter</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-core/src/main/java/org/apache/gobblin/writer/AvroHdfsDataWriter.java" rel="nofollow">AvroHdfsDataWriter</a> and <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-core/src/main/java/org/apache/gobblin/publisher/BaseDataPublisher.java" rel="nofollow">BaseDataPublisher</a>.</p>
<p>To run Gobblin in standalone daemon mode we need a Gobblin configuration file (such as uses <a href="https://github.com/apache/incubator-gobblin/blob/master/conf/standalone/application.conf" rel="nofollow">application.conf</a>). And for each job we wish to run, we also need a job configuration file (such as <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/resources/wikipedia.pull" rel="nofollow">wikipedia.pull</a>). The Gobblin configuration file, which is passed to Gobblin as a command line argument, should contain a property <code>jobconf.dir</code> which specifies where the job configuration files are located. By default, <code>jobconf.dir</code> points to environment variable <code>GOBBLIN_JOB_CONFIG_DIR</code>. Each file in <code>jobconf.dir</code> with extension <code>.job</code> or <code>.pull</code> is considered a job configuration file, and Gobblin will launch a job for each such file. For more information on Gobblin deployment in standalone mode, refer to the <a href="user-guide/Gobblin-Deployment#Standalone-Deployment">Standalone Deployment</a> page.</p>
<p>A list of commonly used configuration properties can be found here: <a href="user-guide/Configuration-Properties-Glossary">Configuration Properties Glossary</a>.</p>
<h2 id="steps_1">Steps</h2>
<ul>
<li>
<p>Create a folder to store the job configuration file. Put <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/resources/wikipedia.pull">wikipedia.pull</a> in this folder, and set environment variable <code>GOBBLIN_JOB_CONFIG_DIR</code> to point to this folder. Also, make sure that the environment variable <code>JAVA_HOME</code> is set correctly.</p>
<p>Create a folder to store the job configuration file. Put <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/resources/wikipedia.pull" rel="nofollow">wikipedia.pull</a> in this folder, and set environment variable <code>GOBBLIN_JOB_CONFIG_DIR</code> to point to this folder. Also, make sure that the environment variable <code>JAVA_HOME</code> is set correctly.</p>
</li>
<li>
<p>Create a folder as Gobblin's working directory. Gobblin will write job output as well as other information there, such as locks and state-store (for more information, see the <a href="user-guide/Gobblin-Deployment#Standalone-Deployment">Standalone Deployment</a> page). Set environment variable <code>GOBBLIN_WORK_DIR</code> to point to that folder.</p>
@@ -682,11 +682,11 @@ <h2 id="steps_1">Steps</h2>
</code></pre>

<p><code>output.json</code> will contain all retrieved records in JSON format.</p>
<p>Note that since this job configuration file we used (<a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/resources/wikipedia.pull">wikipedia.pull</a>) doesn't specify a job schedule, the job will run immediately and will run only once. To schedule a job to run at a certain time and/or repeatedly, set the <code>job.schedule</code> property with a cron-based syntax. For example, <code>job.schedule=0 0/2 * * * ?</code> will run the job every two minutes. See <a href="http://www.quartz-scheduler.org/documentation/quartz-2.1.x/tutorials/crontrigger.html">this link</a> (Quartz CronTrigger) for more details.</p>
<p>Note that since this job configuration file we used (<a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/resources/wikipedia.pull" rel="nofollow">wikipedia.pull</a>) doesn't specify a job schedule, the job will run immediately and will run only once. To schedule a job to run at a certain time and/or repeatedly, set the <code>job.schedule</code> property with a cron-based syntax. For example, <code>job.schedule=0 0/2 * * * ?</code> will run the job every two minutes. See <a href="http://www.quartz-scheduler.org/documentation/quartz-2.1.x/tutorials/crontrigger.html" rel="nofollow">this link</a> (Quartz CronTrigger) for more details.</p>
<h1 id="other-example-jobs">Other Example Jobs</h1>
<p>Besides the Wikipedia example, we have another example job <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/resources/simplejson.pull">SimpleJson</a>, which extracts records from JSON files and store them in Avro files.</p>
<p>To create your own jobs, simply implement the relevant interfaces such as <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-api/src/main/java/org/apache/gobblin/source/Source.java">Source</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-api/src/main/java/org/apache/gobblin/source/extractor/Extractor.java">Extractor</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-api/src/main/java/org/apache/gobblin/converter/Converter.java">Converter</a> and <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-api/src/main/java/org/apache/gobblin/writer/DataWriter.java">DataWriter</a>. In the job configuration file, set properties such as <code>source.class</code> and <code>converter.class</code> to point to these classes.</p>
<p>On a side note: while users are free to directly implement the Extractor interface (e.g., WikipediaExtractor), Gobblin also provides several extractor implementations based on commonly used protocols, e.g., <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-modules/gobblin-kafka-common/src/main/java/org/apache/gobblin/source/extractor/extract/kafka/KafkaExtractor.java">KafkaExtractor</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-core/src/main/java/org/apache/gobblin/source/extractor/extract/restapi/RestApiExtractor.java">RestApiExtractor</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-modules/gobblin-sql/src/main/java/org/apache/gobblin/source/jdbc/JdbcExtractor.java">JdbcExtractor</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-core/src/main/java/org/apache/gobblin/source/extractor/extract/sftp/SftpExtractor.java">SftpExtractor</a>, etc. Users are encouraged to extend these classes to take advantage of existing implementations.</p>
<p>Besides the Wikipedia example, we have another example job <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/resources/simplejson.pull" rel="nofollow">SimpleJson</a>, which extracts records from JSON files and store them in Avro files.</p>
<p>To create your own jobs, simply implement the relevant interfaces such as <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-api/src/main/java/org/apache/gobblin/source/Source.java" rel="nofollow">Source</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-api/src/main/java/org/apache/gobblin/source/extractor/Extractor.java" rel="nofollow">Extractor</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-api/src/main/java/org/apache/gobblin/converter/Converter.java" rel="nofollow">Converter</a> and <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-api/src/main/java/org/apache/gobblin/writer/DataWriter.java" rel="nofollow">DataWriter</a>. In the job configuration file, set properties such as <code>source.class</code> and <code>converter.class</code> to point to these classes.</p>
<p>On a side note: while users are free to directly implement the Extractor interface (e.g., WikipediaExtractor), Gobblin also provides several extractor implementations based on commonly used protocols, e.g., <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-modules/gobblin-kafka-common/src/main/java/org/apache/gobblin/source/extractor/extract/kafka/KafkaExtractor.java" rel="nofollow">KafkaExtractor</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-core/src/main/java/org/apache/gobblin/source/extractor/extract/restapi/RestApiExtractor.java" rel="nofollow">RestApiExtractor</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-modules/gobblin-sql/src/main/java/org/apache/gobblin/source/jdbc/JdbcExtractor.java" rel="nofollow">JdbcExtractor</a>, <a href="https://github.com/apache/incubator-gobblin/blob/master/gobblin-core/src/main/java/org/apache/gobblin/source/extractor/extract/sftp/SftpExtractor.java" rel="nofollow">SftpExtractor</a>, etc. Users are encouraged to extend these classes to take advantage of existing implementations.</p>

</div>
</div>
@@ -709,7 +709,7 @@ <h1 id="other-example-jobs">Other Example Jobs</h1>

</div>

Built with <a href="http://www.mkdocs.org">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
Built with <a href="http://www.mkdocs.org" rel="nofollow">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme" rel="nofollow">theme</a> provided by <a href="https://readthedocs.org" rel="nofollow">Read the Docs</a>.
</footer>

</div>
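
The commit does not show how the `rel="nofollow"` attributes were added across the 96 files. The following is a minimal sketch of one way such a change could be scripted, not the author's actual method. It assumes that "external" means an absolute URL whose host is outside `apache.org` (consistent with the JIRA link on `issues.apache.org` being left unchanged in the first hunk), and that `href` values are double-quoted. The names `INTERNAL_SUFFIX`, `is_external`, and `add_nofollow` are hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumption: hosts equal to or ending in this suffix are treated as internal.
INTERNAL_SUFFIX = "apache.org"

# Matches an opening <a ...> tag that carries a double-quoted href attribute.
ANCHOR_RE = re.compile(r'<a\b([^>]*?)href="([^"]*)"([^>]*)>')


def is_external(href, internal_suffix=INTERNAL_SUFFIX):
    """Return True for absolute URLs whose host is outside the internal suffix."""
    host = urlparse(href).hostname
    if host is None:
        # Relative links and fragment-only links have no host: not external.
        return False
    return not (host == internal_suffix or host.endswith("." + internal_suffix))


def add_nofollow(html, internal_suffix=INTERNAL_SUFFIX):
    """Append rel="nofollow" to external anchors that have no rel attribute yet."""
    def fix(match):
        tag = match.group(0)
        if "rel=" in tag or not is_external(match.group(2), internal_suffix):
            return tag
        # Drop the closing '>' and re-add it after the new attribute.
        return tag[:-1] + ' rel="nofollow">'
    return ANCHOR_RE.sub(fix, html)


if __name__ == "__main__":
    sample = '<a href="https://travis-ci.org/apache/incubator-gobblin" target="_blank">TravisCI</a>'
    print(add_nofollow(sample))
```

Because anchors that already carry a `rel` attribute are skipped, running the function twice over the same file leaves it unchanged. A regular expression is used here only to keep the sketch short; a real HTML parser would be more robust against unusual attribute quoting.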
