
Commit 4e19c0a

Update Spark-with-Python.md
1 parent f21771d commit 4e19c0a

File tree

1 file changed: +15 -1 lines changed

Spark-with-Python-writeup/Spark-with-Python.md

Lines changed: 15 additions & 1 deletion
@@ -4,6 +4,10 @@
It runs fast (up to 100x faster than traditional [Hadoop MapReduce](https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm) due to in-memory operation), offers robust, distributed, fault-tolerant data objects (called [RDD](https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm)), and integrates beautifully with the world of machine learning and graph analytics through supplementary packages like [MLlib](https://spark.apache.org/mllib/) and [GraphX](https://spark.apache.org/graphx/).

+<p align='center'>
+<img src="https://raw.githubusercontent.com/tirthajyoti/PySpark_Basics/master/Images/Spark%20ecosystem.png" width="400" height="400">
+</p>
+
Spark is implemented on [Hadoop/HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) and written mostly in [Scala](https://www.scala-lang.org/), a functional programming language that, like Java, runs on the JVM. In fact, Scala requires a recent Java installation on your system. However, for most beginners, Scala is not the first language they learn when venturing into the world of data science. Fortunately, Spark provides a wonderful Python integration, called **PySpark**, which lets Python programmers interface with the Spark framework and learn how to manipulate data at scale and work with objects and algorithms over a distributed file system.

In this article, we will learn the basics of Python integration with Spark. There are many concepts (constantly evolving, with new ones being introduced), so we focus on the fundamentals with a few simple examples. Readers are encouraged to build on these and explore more on their own.
@@ -36,7 +40,11 @@ If you’re already familiar with Python and libraries such as Pandas and Numpy,
The exact process of installing and setting up the PySpark environment (on a standalone machine) is somewhat involved and can vary slightly depending on your system and environment. The goal is to get your regular Jupyter data science environment working with Spark in the background using PySpark.

-Read this article to know more details on the setup process, step-by-step.
+**[Read this article](https://medium.com/free-code-camp/how-to-set-up-pyspark-for-your-jupyter-notebook-7399dd3cb389)** for step-by-step details on the setup process.
+
+<p align='center'>
+<img src="https://raw.githubusercontent.com/tirthajyoti/PySpark_Basics/master/Images/Components.png" width="500" height="300">
+</p>
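As a minimal sketch of what that setup boils down to (assuming a working Java installation and `pip install pyspark`; the app name below is just an illustrative choice), you can spin up a local Spark context straight from a notebook cell:

```python
# A minimal local-setup sketch, assuming a working Java installation and
# `pip install pyspark`; the app name is an illustrative choice, not a required value.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("PySparkBasics").setMaster("local[*]")  # use all local cores
sc = SparkContext(conf=conf)

print(sc.version)  # confirm the context is up and which Spark version it runs on
sc.stop()          # release local resources when done
```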

Alternatively, you can use the Databricks setup for practicing Spark. This company was founded by the original creators of Spark and has an excellent ready-to-launch environment for doing distributed analysis with Spark.

@@ -48,6 +56,8 @@ It will be much easier to start working with real-life large clusters if you hav
## RDD and SparkContext

Many Spark programs revolve around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. The SparkContext resides in the driver program and manages the distributed data over the worker nodes through the cluster manager. The good thing about using PySpark is that all this complexity of data partitioning and task management is handled automatically behind the scenes, and the programmer can focus on the specific analytics or machine learning job at hand.

+![rdd-1](https://raw.githubusercontent.com/tirthajyoti/Spark-with-Python/master/Images/RDD-1.png)
+
There are two ways to create RDDs (both are sketched below):
- parallelizing an existing collection in your driver program, or
- referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
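A minimal sketch of both approaches, assuming a local PySpark installation (the text-file path is a placeholder):

```python
# Sketch of the two RDD-creation routes, assuming a local PySpark installation.
# The text-file path is a placeholder, not a file shipped with this repo.
from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDBasics")

# 1) Parallelize an existing collection from the driver program
nums = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)   # split into 2 partitions
print(nums.map(lambda x: x * x).collect())             # [1, 4, 9, 16, 25]

# 2) Reference a dataset in an external storage system (local file, HDFS, S3, ...)
lines = sc.textFile("data/sample.txt")                  # placeholder path
print(lines.count())                                    # number of lines in the file

sc.stop()
```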
@@ -285,6 +295,8 @@ A DataFrame is a distributed collection of rows under named columns. It is conce
- Lazy evaluation: a task is not executed until an action is performed (see the sketch below).
- Distributed: both RDDs and DataFrames are distributed in nature.

+<p align='center'><img src="https://cdn-images-1.medium.com/max/1202/1*wiXLNwwMyWdyyBuzZnGrWA.png" width="600" height="400"></p>
+
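A small sketch of that lazy behaviour with an in-memory DataFrame (the rows and column names are made up for illustration):

```python
# Lazy-evaluation sketch: transformations only build a plan; an action triggers execution.
# The rows and column names are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameBasics").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)

# Transformations: nothing is computed yet, Spark just records the lineage/plan
adults = df.filter(df.age > 30).select("name")

# Action: only now does Spark actually execute the plan and materialize results
adults.show()

spark.stop()
```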

### Advantages of the DataFrame
- DataFrames are designed for processing large collections of structured or semi-structured data.
- Observations in a Spark DataFrame are organised under named columns, which helps Apache Spark understand the schema of the DataFrame. This helps Spark optimize the execution plan for queries on it.
@@ -306,6 +318,8 @@ We have had success in the domain of Big Data analytics with Hadoop and the MapR

Spark SQL essentially tries to bridge the gap between the two models we mentioned previously—the relational and procedural models. Spark SQL works through the DataFrame API that can perform relational operations on both external data sources and Spark's built-in distributed collections—at scale!

+![sparksql-1](https://raw.githubusercontent.com/tirthajyoti/Spark-with-Python/master/Images/SparkSQL-1.png)
+
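As a quick sketch of that idea, a DataFrame can be registered as a temporary view and queried with plain SQL, with the result coming back as a DataFrame for further procedural work (names and data below are illustrative):

```python
# Spark SQL sketch: register a DataFrame as a temp view, query it with SQL,
# and keep working with the result procedurally. Names and data are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLBasics").getOrCreate()

people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)
people.createOrReplaceTempView("people")  # expose the DataFrame to the SQL engine

older = spark.sql("SELECT name, age FROM people WHERE age > 30")  # relational query
older.orderBy("age", ascending=False).show()                      # procedural follow-up

spark.stop()
```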
Why is Spark SQL so fast and optimized? The reason is a new extensible optimizer, **Catalyst**, based on functional programming constructs in Scala. Catalyst supports both rule-based and cost-based optimization. While extensible optimizers have been proposed in the past, they have typically required a complex domain-specific language to specify rules. Usually, this leads to a significant learning curve and maintenance burden. In contrast, Catalyst uses standard features of the Scala programming language, such as pattern-matching, to let developers use the full programming language while still making rules easy to specify.
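If you want to peek at what Catalyst produces for a particular query, `DataFrame.explain()` prints the logical and physical plans; a small illustrative sketch (data and names are made up):

```python
# Sketch of inspecting what Catalyst produces for a query; data and names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CatalystPeek").getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")

query = spark.sql("SELECT name FROM people WHERE age > 30")

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan
query.explain(extended=True)

spark.stop()
```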

Refer to the following Jupyter notebook for an introduction to database operations with Spark SQL,
