Commit e0a9caa (0 parents): 52 changed files with 10,794 additions and 0 deletions.
@@ -0,0 +1,118 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.DS_Store

# gitbook
_book

# node.js
node_modules

# windows
Thumbs.db

# word
~$*.docx
~$*.doc

# custom
docs/README.md
Empty file.
@@ -0,0 +1,21 @@
language: python
python: 3.6

install:
  - ':'

script:
  - ':'

after_script:
  - git config user.name ${GH_UN}
  - git config user.email ${GH_EMAIL}
  - git push "https://${GH_TOKEN}@github.com/${GH_USER}/${GH_REPO}.git" v2.2.0-md:${GH_BRANCH} -f

env:
  global:
    - GH_UN=jiangzhonglian
    - GH_EMAIL=jiang-s@163.com
    - GH_USER=apachecn
    - GH_REPO=spark-doc-zh
    - GH_BRANCH=gh-pages
@@ -0,0 +1 @@
spark.apachecn.org
@@ -0,0 +1,26 @@
# Contributing Guide

## Joining the Translation & Reporting Errors

1. Fork this repository on GitHub.
2. Translate the md files under doc/zh, e.g. index.md.
3. Then open a New pull request from your fork on GitHub.
4. For tooling, see the notes below.

## Tooling (for beginners)

Sharpen your tools before you start; any tool is fine as long as it gets the job done. I use the `VSCode` editor. For a quick guide, see [VSCode getting-started guide for Windows](help/vscode-windows-usage.md), which shows a simple workflow combining `VSCode` with `github`. To switch VSCode's Markdown preview to the GitHub style, see [Changing VSCode's markdown preview style to the GitHub style](help/vscode-markdown-preview-github-style.md).

## Roles

The following roles are currently available:

* Translator: translates article content.
* Reviewer: checks article content, e.g. formatting and accuracy.
* Maintainer: keeps the whole project on track so it does not degrade into an abandoned one; requires somewhat more Spark experience.

If you are interested in joining, see the contact information at the end.
@@ -0,0 +1,50 @@
# Apache Spark Official Documentation (Chinese Translation)

![](docs/img/spark-logo-hd.png)

Apache Spark™ is a fast, general-purpose engine for large-scale data processing.

## Maintained At

+ [Github](https://github.com/apachecn/spark-doc-zh/)
+ [Read online](http://spark.apachecn.org)

## Previous Versions

+ [Apache Spark 2.0.2 official documentation, Chinese version](http://cwiki.apachecn.org/pages/viewpage.action?pageId=2883613)
+ [Chinese documentation in EPUB format](https://github.com/apachecn/spark-doc-zh/raw/dl/Spark%202.0.2%20%E4%B8%AD%E6%96%87%E6%96%87%E6%A1%A3.epub)

## Contributing Guide

[See here](CONTRIBUTING.md)

## Maintainer

* [@wangyangting](https://github.com/wangyangting) (那伊抹微笑)

## Contributors

Contributors are free to edit the entries below themselves.

### 2.2.0

* [@wangyangting](https://github.com/wangyangting) (那伊抹微笑)
* [@jiangzhonglian](https://github.com/jiangzhonglian) (片刻)
* [@chenyyx](https://github.com/chenyyx) (Joy yx)
* [@XiaoLiz](https://github.com/XiaoLiz) (VoLi)
* [@ruilintian](https://github.com/ruilintian) (ruilintian)
* [@huangtianan](https://github.com/huangtianan) (huangtianan)
* [@kris37](https://github.com/kris37) (kris37)
* [@sehriff](https://github.com/sehriff) (sehriff)
* [@windyqinchaofeng](https://github.com/windyqinchaofeng) (qinchaofeng)
* [@stealthsMrs](https://github.com/stealthsMrs) (stealthsMrs)

### 2.0.2

See: [http://cwiki.apachecn.org/pages/viewpage.action?pageId=2887089](http://cwiki.apachecn.org/pages/viewpage.action?pageId=2887089)

## Contact

For feedback or to join the translation effort, please reach out on QQ:

* QQ: 1042658081
@@ -0,0 +1,30 @@
+ [Spark Overview](docs/1.md)
+ [Programming Guides](docs/2.md)
+ [Quick Start](docs/3.md)
+ [Spark Programming Guide](docs/4.md)
+ [Modules Built on Spark](docs/5.md)
+ [Spark Streaming Programming Guide](docs/6.md)
+ [Spark SQL, DataFrames and Datasets Guide](docs/7.md)
+ [MLlib](docs/8.md)
+ [GraphX Programming Guide](docs/9.md)
+ [API Docs](docs/10.md)
+ [Deploying](docs/11.md)
+ [Cluster Mode Overview](docs/12.md)
+ [Submitting Applications](docs/13.md)
+ [Deploy Modes](docs/14.md)
+ [Spark Standalone Mode](docs/15.md)
+ [Running Spark on Mesos](docs/16.md)
+ [Running Spark on YARN](docs/17.md)
+ [Other](docs/18.md)
+ [More](docs/19.md)
+ [Spark Configuration](docs/20.md)
+ [Monitoring and Instrumentation](docs/21.md)
+ [Tuning Spark](docs/22.md)
+ [Job Scheduling](docs/23.md)
+ [Spark Security](docs/24.md)
+ [Hardware Provisioning](docs/25.md)
+ [Accessing OpenStack Swift from Spark](docs/26.md)
+ [Building Spark](docs/27.md)
+ [Other](docs/28.md)
+ [External Resources](docs/29.md)
+ [Translation Progress](docs/30.md)
@@ -0,0 +1,63 @@
# Spark Overview

Apache Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including [Spark SQL](sql-programming-guide.html) for SQL and structured data processing, [MLlib](ml-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).

# Downloading

Get Spark from the [downloads page](http://spark.apache.org/downloads.html) of the project website. This documentation is for Spark version 2.2.0. Spark uses Hadoop's client libraries for HDFS and YARN. Downloads are pre-packaged for a handful of popular Hadoop versions. Users can also download a pre-built "Hadoop free" binary and run Spark with any Hadoop version by [augmenting Spark's classpath](hadoop-provided.html). Scala and Java users can include Spark in their projects via Maven, and in the future Python users will also be able to install Spark from PyPI.

If you'd like to build Spark from source, visit [Building Spark](building-spark.html).

Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy to run locally on one machine: all you need is to have Java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.

Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.2.0 uses Scala 2.11. You will need to use a compatible Scala version (2.11.x).

Note that support for Java 7, Python 2.6 and old Hadoop versions before 2.6.5 has been removed as of Spark 2.2.0.

Note that support for Scala 2.10 is deprecated as of Spark 2.1.0, and may be removed in Spark 2.3.0.

# Running the Examples and Shell

Spark comes with several sample programs. Scala, Java, Python and R examples are in the `examples/src/main` directory. To run one of the Java or Scala sample programs, use `bin/run-example <class> [params]` in the top-level Spark directory. (Behind the scenes, this invokes the [`spark-submit` script](submitting-applications.html) to launch applications.) For example,

```
./bin/run-example SparkPi 10
```

You can also run Spark interactively through a modified version of the Scala shell. This is a great way to learn the framework.

```
./bin/spark-shell --master local[2]
```

The `--master` option specifies the [master URL for a distributed cluster](submitting-applications.html#master-urls), or `local` to run locally with one thread, or `local[N]` to run locally with N threads. You should start with `local` for testing. Run spark-shell with the `--help` option for a full list of options. Spark also provides a Python API. To run Spark interactively in a Python interpreter, use `bin/pyspark`:

```
./bin/pyspark --master local[2]
```

Example applications are also provided in Python. For example,

```
./bin/spark-submit examples/src/main/python/pi.py 10
```
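The `pi.py` example estimates π with a Monte Carlo method. Stripped of Spark's parallelism, its core computation can be sketched in plain Python; the `estimate_pi` helper below is illustrative, not part of the Spark sources:

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate pi by sampling random points in the unit square and
    counting how many fall inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi/4 of the unit square,
    # so scale the observed hit ratio by 4.
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))
```

The Spark version distributes the sampling loop across the cluster as parallel tasks and sums the hit counts with a reduce, which is why `pi.py` takes a partition count (`10` above) as its argument.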
Spark also provides an experimental [R API](sparkr.html) since 1.4 (only the DataFrames APIs are included). To run Spark interactively in an R interpreter, use `bin/sparkR`:

```
./bin/sparkR --master local[2]
```

Example applications are also provided in R. For example,

```
./bin/spark-submit examples/src/main/r/dataframe.R
```

# Launching on a Cluster

The Spark [cluster mode overview](cluster-overview.html) explains the key concepts of running on a cluster. Spark can run both by itself and on several existing Cluster Managers. It currently provides several options for deployment:

* [Standalone Deploy Mode](spark-standalone.html): the simplest way to deploy Spark on a private cluster
* [Apache Mesos](running-on-mesos.html)
* [Hadoop YARN](running-on-yarn.html)
@@ -0,0 +1,6 @@
# API Docs

* [Spark Scala API (Scaladoc)](http://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.package)
* [Spark Java API (Javadoc)](http://spark.apache.org/docs/2.2.0/api/java/index.html)
* [Spark Python API (Sphinx)](http://spark.apache.org/docs/2.2.0/api/python/index.html)
* [Spark R API (Roxygen2)](http://spark.apache.org/docs/2.2.0/api/R/index.html)
@@ -0,0 +1 @@
# Deploying
@@ -0,0 +1,57 @@
# Cluster Mode Overview

This document gives a short overview of how Spark runs on clusters, to make it easier to understand the components involved. Read through the [application submission guide](submitting-applications.html) to learn about launching applications on a cluster.

# Components

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext in your main program (called the driver program).

Specifically, to run on a cluster, the SparkContext can connect to several types of Cluster Managers (either Spark's own Standalone Cluster Manager, Mesos, or YARN), which allocate resources across applications. Once connected, Spark acquires Executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the Executors. Finally, the SparkContext sends Tasks to the Executors to run.

![Spark cluster components](img/1b193ef9791313508d0c806587f136fd.jpg "Spark cluster components")

There are several useful things to note about this architecture:

1. Each application gets its own Executor processes, which stay up for the duration of the whole application and run Tasks in multiple threads. This has the benefit of isolating applications from each other, both on the scheduling side (each driver schedules its own tasks) and on the Executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
2. Spark is agnostic to the underlying Cluster Manager. As long as it can acquire Executor processes, and these can communicate with each other, it is relatively easy to run it even on a Cluster Manager that also supports other applications (e.g. Mesos / YARN).
3. The driver program must listen for and accept incoming connections from its Executors throughout its lifetime (e.g., see [spark.driver.port in the networking section](configuration.html#networking)). As such, the driver program must be network-addressable from the worker nodes.
4. Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you'd like to send requests to the cluster remotely, it's better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
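The flow described above (connect to a cluster manager, acquire executors, ship the application, then dispatch tasks) can be sketched as a toy simulation; this is not Spark code, and every class and method name below is hypothetical, standing in for the real components:

```python
from concurrent.futures import ThreadPoolExecutor

class ClusterManager:
    """Toy stand-in for a cluster manager that allocates executors."""
    def allocate_executors(self, count):
        # Each "executor" runs tasks in multiple threads,
        # like a Spark executor process does.
        return [ThreadPoolExecutor(max_workers=2) for _ in range(count)]

class Driver:
    """Toy stand-in for a driver program's SparkContext."""
    def __init__(self, cluster_manager, num_executors=2):
        # 1. Connect to the cluster manager and acquire executors.
        self.executors = cluster_manager.allocate_executors(num_executors)

    def run(self, tasks):
        # 2. Dispatch tasks to executors round-robin, then
        # 3. collect the results back at the driver.
        futures = [self.executors[i % len(self.executors)].submit(t)
                   for i, t in enumerate(tasks)]
        return [f.result() for f in futures]

driver = Driver(ClusterManager())
results = driver.run([lambda i=i: i * i for i in range(4)])
print(results)  # each "task" squares its own index
```

Note how each driver owns its executors for its whole lifetime, mirroring point 1 above: two `Driver` instances here would never share thread pools.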
# Cluster Manager Types

The system currently supports three Cluster Managers:

* [Standalone](spark-standalone.html) – a simple Cluster Manager included with Spark that makes it easy to set up a cluster.
* [Apache Mesos](running-on-mesos.html) – a general-purpose Cluster Manager that can also run Hadoop MapReduce and service applications.
* [Hadoop YARN](running-on-yarn.html) – the resource manager in Hadoop 2.
* [Kubernetes (experimental)](https://github.com/apache-spark-on-k8s/spark) – in addition to the above, there is experimental support for Kubernetes. Kubernetes is an open-source platform for providing container-centric infrastructure. Kubernetes support is being actively developed in the apache-spark-on-k8s GitHub organization. For documentation, refer to that project's README.

# Submitting Applications

Applications can be submitted to a cluster of any type using the spark-submit script. The [application submission guide](submitting-applications.html) describes how to do this.

# Monitoring

Each driver program has a Web UI, typically on port 4040, that displays information about running tasks, executors, and storage usage. Simply visit `http://<driver-node>:4040` in a web browser to access this UI. The [monitoring guide](monitoring.html) also describes other monitoring options.

# Job Scheduling

Spark gives control over resource allocation both across applications (at the Cluster Manager level) and within applications (if multiple computations are happening on the same SparkContext). The [job scheduling overview](job-scheduling.html) describes this in more detail.

# Glossary

The following table summarizes terms you'll see used to refer to cluster concepts:

| Term | Meaning |
| --- | --- |
| Application | User program built on Spark. Consists of a driver program and executors on the cluster. |
| Application jar | A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; however, these will be added at runtime. |
| Driver program | The process running the main() function of the application and creating the SparkContext. |
| Cluster manager | An external service for acquiring resources on the cluster (e.g. Standalone Manager, Mesos, YARN). |
| Deploy mode | Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside the cluster. In "client" mode, the submitter launches the driver outside the cluster. |
| Worker node | Any node that can run application code in the cluster. |
| Executor | A process launched for an application on a worker node, which runs tasks and keeps data in memory or disk storage. Each application has its own executors. |
| Task | A unit of work that will be sent to one executor. |
| Job | A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. `save`, `collect`); you'll see this term used in the driver's logs. |
| Stage | Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs. |
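The glossary's containment hierarchy (an application spawns jobs, each job splits into stages, each stage holds parallel tasks) can be made concrete with a small illustrative data model; the class and field names below are ours, not Spark's:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: int  # a unit of work sent to one executor

@dataclass
class Stage:
    stage_id: int
    tasks: list = field(default_factory=list)  # parallel tasks in this stage

@dataclass
class Job:
    job_id: int  # spawned by one action, e.g. collect()
    stages: list = field(default_factory=list)

# One job split into two dependent stages (as in a map + reduce shuffle),
# each stage holding its own set of parallel tasks.
job = Job(job_id=0, stages=[
    Stage(stage_id=0, tasks=[Task(0), Task(1)]),
    Stage(stage_id=1, tasks=[Task(2)]),
])
total_tasks = sum(len(s.tasks) for s in job.stages)
print(total_tasks)  # 3
```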