## <mark>PySpark

PySpark is the Python API for Apache Spark. Apache Spark is an **analytical processing engine for large-scale distributed data processing** and machine learning applications.

PySpark is a Spark library written in Python to run Python applications using Apache Spark capabilities. Using PySpark, we can **run applications in parallel on a distributed cluster (multiple nodes)**.

Spark is written in Scala; later, owing to its industry adoption, its Python API, PySpark, was released using Py4J. **Py4J is a Java library that is integrated within PySpark and allows Python to dynamically interface with JVM objects**, so to run PySpark you also need Java installed along with Python and Apache Spark.

If you are working with a smaller dataset and don’t have a Spark cluster but still want benefits similar to a Spark DataFrame, you can use Python pandas DataFrames. The main difference is that a pandas DataFrame is not distributed and runs on a single node.

## <mark>Advantages:

![image.png](attachment:7030dd08-6e95-4ff6-897c-d8704f688f6c.png)
    
    In-memory computation
    Distributed processing using parallelize
    Can be used with many cluster managers (Spark standalone, YARN, Mesos, etc.)
    Fault-tolerant
    Immutable
    Lazy evaluation
    Cache & persistence
    Built-in optimization when using DataFrames
    Supports ANSI SQL
    In-memory, distributed processing engine that lets you process data efficiently across a cluster
    Applications running on PySpark can be up to 100x faster than traditional disk-based systems such as Hadoop MapReduce (for in-memory workloads)
    Can process data from Hadoop HDFS, AWS S3, and many other file systems
    Can process real-time data using Streaming and Kafka
    Natively includes machine learning and graph libraries
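
Several of the items above (parallelize, lazy evaluation, cache & persistence) can be seen in a few lines. The following is a minimal sketch, assuming PySpark and Java are installed locally. The function name `advantages_demo`, the app name, and the `local[2]` master are illustrative choices, not something these notes prescribe.

```python
def advantages_demo():
    """Run a tiny local job that touches parallelize, lazy evaluation and caching."""
    # Imported inside the function so this file still loads where PySpark is absent.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[2]")          # two local threads stand in for a cluster
        .appName("advantages-demo")  # hypothetical app name
        .getOrCreate()
    )
    try:
        # parallelize: split a local collection into 4 partitions.
        rdd = spark.sparkContext.parallelize(range(10), numSlices=4)
        partitions = rdd.getNumPartitions()

        # Lazy evaluation: filter() only records a transformation in the plan.
        evens = spark.range(1_000_000).filter("id % 2 = 0")

        # Cache & persistence: cache() is lazy too; the first action fills the cache.
        evens.cache()
        first_count = evens.count()   # action: computes the result and caches it
        second_count = evens.count()  # action: can reuse the cached data
        evens.unpersist()

        return partitions, first_count, second_count
    finally:
        spark.stop()
```

On a machine with a working PySpark setup, `advantages_demo()` should return `(4, 500000, 500000)`: four partitions, then the same row count from both actions.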

<mark> PySpark Architecture

Apache Spark works in a **master-slave architecture where the master is called the “Driver” and the slaves are called “Workers”.** When you run a Spark application, the Spark **Driver creates a context that is the entry point to your application; all operations (transformations and actions) are executed on worker nodes, and the resources are managed by the Cluster Manager.**
    
A Spark Application consists of a Driver Program and a group of Executors on the cluster. The Driver is a process that executes the main program of your Spark application and creates the SparkContext that coordinates the execution of jobs (more on this later). The executors are processes running on the worker nodes of the cluster which are responsible for executing the tasks the driver process has assigned to them. The cluster manager (such as Mesos or YARN) is responsible for the allocation of physical resources to Spark Applications.
    
![image.png](attachment:c3cf621b-6df7-43f4-8220-ba8fa56f77eb.png)

##### <mark>Entry Points

Every Spark Application needs an entry point that **allows it to communicate with data sources and perform certain operations such as reading and writing data.** In Spark 1.x, three entry points were introduced: **SparkContext, SQLContext and HiveContext.** Since Spark 2.x, a new entry point called **SparkSession** has been available; it essentially combines all the functionality of the three aforementioned contexts. Note that all contexts are still available even in the newest Spark releases, mostly for backward compatibility.
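
As a rough illustration of the unified entry point, the sketch below builds a `SparkSession` and reaches the older `SparkContext` through it. It assumes PySpark and Java are installed; `create_session`, `describe_entry_points`, and the default app name are hypothetical names used only for this example.

```python
def create_session(app_name="entry-point-demo", master="local[1]"):
    """Create (or reuse) a SparkSession, the single entry point since Spark 2.x."""
    # Imported inside the function so this file still loads where PySpark is absent.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).master(master).getOrCreate()


def describe_entry_points(spark):
    """Show that the Spark 1.x-style SparkContext is reachable from the session."""
    sc = spark.sparkContext
    return {
        "app_name": sc.appName,
        "master": sc.master,
        "spark_version": spark.version,
    }
```

A typical use would be `spark = create_session()`, then `describe_entry_points(spark)`, and finally `spark.stop()` to release the local resources.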

<mark>Spark supports the following cluster managers:

    Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
    Apache Mesos – a cluster manager that can also run Hadoop MapReduce and PySpark applications.
    Hadoop YARN – the resource manager in Hadoop 2. This is the most commonly used cluster manager.
    Kubernetes – an open-source system for automating deployment, scaling, and management of containerized applications.
    
local – not really a cluster manager, but worth mentioning because we pass “local” to master() to run Spark on a laptop or desktop computer.
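
Each option above corresponds to a different string passed to `master()`. The helper below is a small sketch of those URL shapes. `master_url` and its parameters are hypothetical names for illustration, and the default ports (7077, 5050, 443) are the commonly documented ones, which a real cluster may override.

```python
# URL scheme and commonly documented default port per cluster manager.
_REMOTE_MANAGERS = {
    "standalone": ("spark://", 7077),
    "mesos": ("mesos://", 5050),
    "kubernetes": ("k8s://https://", 443),
}


def master_url(manager, host=None, port=None, threads="*"):
    """Build the string passed to master() for the given cluster manager."""
    if manager == "local":
        return f"local[{threads}]"  # local[*] uses all cores of this machine
    if manager == "yarn":
        return "yarn"               # cluster location comes from the Hadoop config
    if manager not in _REMOTE_MANAGERS:
        raise ValueError(f"unknown cluster manager: {manager!r}")
    if not host:
        raise ValueError(f"{manager} needs a host name")
    scheme, default_port = _REMOTE_MANAGERS[manager]
    return f"{scheme}{host}:{port or default_port}"


if __name__ == "__main__":
    print(master_url("local"))                # local[*]
    print(master_url("standalone", "node1"))  # spark://node1:7077
```

The result would then be used as, for example, `SparkSession.builder.master(master_url("local"))`.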

<mark> PySpark Modules & Packages
    
    PySpark RDD (pyspark.RDD)
    PySpark DataFrame and SQL (pyspark.sql)
    PySpark Streaming (pyspark.streaming)
    PySpark MLlib (pyspark.ml, pyspark.mllib)
    PySpark GraphFrames (graphframes, a separate package)
    PySpark Resource (pyspark.resource)

![image.png](attachment:f21e34fc-0714-4391-85cf-190bf8236b7b.png)
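
To see which of these modules are present in a given environment, something like the following could be used. It is a sketch built only on the standard library; `PYSPARK_MODULES`, `is_importable`, and `module_report` are names made up for this example, and the module paths are assumed to match the list above (`pyspark.rdd` being the module that defines `pyspark.RDD`).

```python
import importlib.util

# Importable module paths behind the list above. `graphframes` is a separate
# package that is not bundled with PySpark.
PYSPARK_MODULES = [
    "pyspark.rdd",
    "pyspark.sql",
    "pyspark.streaming",
    "pyspark.ml",
    "pyspark.mllib",
    "pyspark.resource",
    "graphframes",
]


def is_importable(name):
    """Return True if `name` can be located without importing the module itself."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (for example pyspark) is missing.
        return False


def module_report(names=PYSPARK_MODULES):
    """Map each module name to whether it is available in this environment."""
    return {name: is_importable(name) for name in names}


if __name__ == "__main__":
    for name, found in module_report().items():
        print(f"{name:20s} {'available' if found else 'missing'}")
```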

<mark>PySpark installation
    
    Install Python or the Anaconda distribution
    Install Java 8:
        set the JAVA_HOME and PATH variables.
        JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201
        PATH = %PATH%;C:\Program Files\Java\jdk1.8.0_201\bin
    Install Apache Spark:
        SPARK_HOME  = C:\apps\spark-3.0.0-bin-hadoop2.7
        HADOOP_HOME = C:\apps\spark-3.0.0-bin-hadoop2.7
        PATH = %PATH%;C:\apps\spark-3.0.0-bin-hadoop2.7\bin
    Set up winutils.exe:
        copy it to the %SPARK_HOME%\bin folder. winutils.exe differs for each Hadoop version, so use the build that matches yours.
    Run: %SPARK_HOME%\bin\pyspark
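
Before launching the shell, it can help to confirm that the variables from the steps above are in place. The check below is a minimal sketch using only the standard library. `check_spark_env` is a hypothetical helper, and it only inspects environment variables, not whether the directories actually contain Java or Spark.

```python
import os


def check_spark_env(env=None):
    """Return a list of problems found in the Spark-related environment variables."""
    env = os.environ if env is None else env
    problems = []

    # Directories currently on PATH, normalised so comparisons are reliable.
    path_dirs = {
        os.path.normcase(os.path.normpath(p))
        for p in env.get("PATH", "").split(os.pathsep)
        if p
    }

    for var in ("JAVA_HOME", "SPARK_HOME"):
        home = env.get(var)
        if not home:
            problems.append(f"{var} is not set")
            continue
        bin_dir = os.path.normcase(os.path.normpath(os.path.join(home, "bin")))
        if bin_dir not in path_dirs:
            problems.append(f"{var} bin directory is not on PATH")

    # winutils.exe is looked up under HADOOP_HOME, which matters on Windows only.
    if os.name == "nt" and not env.get("HADOOP_HOME"):
        problems.append("HADOOP_HOME is not set")

    return problems


if __name__ == "__main__":
    for line in check_spark_env() or ["Environment looks consistent"]:
        print(line)
```

An empty list means the variables are consistent with the steps above; each string otherwise names the variable to fix.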
