[SPARK-21434][Python][DOCS] Add pyspark pip documentation. #18698

Closed
27 changes: 26 additions & 1 deletion docs/quick-start.md
@@ -66,6 +66,11 @@ res3: Long = 15

./bin/pyspark


Or if PySpark is installed with pip in your current environment:

pyspark

Spark's primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Due to Python's dynamic nature, we don't need the Dataset to be strongly-typed in Python. As a result, all Datasets in Python are Dataset[Row], and we call it `DataFrame` to be consistent with the data frame concept in Pandas and R. Let's make a new DataFrame from the text of the README file in the Spark source directory:

{% highlight python %}
@@ -206,7 +211,7 @@ a cluster, as described in the [RDD programming guide](rdd-programming-guide.htm

# Self-Contained Applications
Suppose we wish to write a self-contained application using the Spark API. We will walk through a
simple application in Scala (with sbt), Java (with Maven), and Python.
simple application in Scala (with sbt), Java (with Maven), and Python (pip).

<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -367,6 +372,16 @@ Lines with a: 46, Lines with b: 23

Now we will show how to write an application using the Python API (PySpark).


If you are building a packaged PySpark application or library, you can add it to your setup.py file as:

{% highlight python %}
install_requires=[
    'pyspark=={{site.SPARK_VERSION}}'
]
{% endhighlight %}
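
For context, here is where `install_requires` fits in a minimal `setup.py`. This is only an illustrative sketch assuming setuptools; the package name and version below are placeholders:

{% highlight python %}
# Illustrative setup.py for a packaged PySpark application.
# The package name and version are placeholders.
from setuptools import setup, find_packages

setup(
    name='myapp',
    version='0.1.0',
    packages=find_packages(),
    install_requires=[
        'pyspark=={{site.SPARK_VERSION}}'
    ],
)
{% endhighlight %}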


As an example, we'll create a simple Spark application, `SimpleApp.py`:

{% highlight python %}
@@ -406,6 +421,16 @@ $ YOUR_SPARK_HOME/bin/spark-submit \
Lines with a: 46, Lines with b: 23
{% endhighlight %}

If you have PySpark pip installed into your environment (e.g., `pip install pyspark`), you can run your application with the regular Python interpreter or use the provided `spark-submit` as you prefer.

{% highlight bash %}
# Use the Python interpreter to run your application
$ python SimpleApp.py
...
Lines with a: 46, Lines with b: 23
{% endhighlight %}
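
As a quick sanity check that the pip-installed package is the one being picked up, a short script such as the following sketch (the file name and app name are illustrative) can be run with the plain interpreter:

{% highlight python %}
# check_pyspark.py -- illustrative sanity check for a pip-installed PySpark.
import pyspark
from pyspark.sql import SparkSession

print(pyspark.__version__)  # should match the version you pip installed

spark = SparkSession.builder.appName("PipCheck").master("local[*]").getOrCreate()
print(spark.range(10).count())  # trivial job to confirm the local session works
spark.stop()
{% endhighlight %}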


</div>
</div>

13 changes: 12 additions & 1 deletion docs/rdd-programming-guide.md
@@ -89,7 +89,18 @@ import org.apache.spark.SparkConf;
Spark {{site.SPARK_VERSION}} works with Python 2.7+ or Python 3.4+. It can use the standard CPython interpreter,
so C libraries like NumPy can be used. It also works with PyPy 2.3+.

To run Spark applications in Python, use the `bin/spark-submit` script located in the Spark directory.
Python 2.6 support was removed in Spark 2.2.0.

Spark applications in Python can either be run with the `bin/spark-submit` script, which includes Spark at runtime, or by including PySpark in your setup.py as:

{% highlight python %}
install_requires=[
    'pyspark=={{site.SPARK_VERSION}}'
]
{% endhighlight %}
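
With PySpark declared as a dependency this way, the application constructs its own SparkContext rather than relying on the one created by `bin/pyspark`. A minimal sketch (the app name and master URL are illustrative):

{% highlight python %}
# Illustrative: an application that depends on PySpark via install_requires
# creates its own SparkContext. The app name and master URL are placeholders.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local[*]")
sc = SparkContext(conf=conf)

print(sc.parallelize(range(100)).sum())  # trivial RDD job

sc.stop()
{% endhighlight %}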


To run Spark applications in Python without pip installing PySpark, use the `bin/spark-submit` script located in the Spark directory.
This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster.
You can also use `bin/pyspark` to launch an interactive Python shell.
