# PySpark Tutorial
## What is PySpark
PySpark is the Python API for [Apache Spark](https://spark.apache.org/). It lets developers write Spark applications in Python and gives them access to Spark’s features from that language. PySpark has become a popular choice for data engineers, data scientists, and developers working with big data and distributed computing. It is widely used in the data science and machine learning community, partly because many popular data science libraries, including NumPy and TensorFlow, are Python libraries, and partly because Spark processes large datasets efficiently.

Spark is a multi-language engine that provides APIs (Application Programming Interfaces) and libraries for several programming languages, including Java, Scala, Python, and R, so developers can work with Spark in the language they are most comfortable with.

- Scala: Spark’s primary and native language is Scala. Many of Spark’s core components are written in Scala, and it provides the most extensive API for Spark.
- Java: Spark provides a Java API that allows developers to use Spark within Java applications. Java developers can access most of Spark’s functionality through this API.
- R: Spark also offers an R API, enabling R users to work with Spark data and perform distributed data analysis using their familiar R language.
- Python: Spark offers a Python API, called PySpark, which is popular among data scientists and developers who prefer Python for data analysis and machine learning tasks. PySpark provides a Pythonic way to interact with Spark.

## Initialize PySpark Environment

In [3]:
!pip list | grep spark

findspark                 2.0.1


In [1]:
import findspark
findspark.init()

## PySpark on Local 

In [14]:
import pyspark
from pyspark.sql import SparkSession

# alternative: create a Spark context first, then wrap it in a session
#sc = pyspark.SparkContext(master="local", appName="pyspark-basic")
#spark = SparkSession(sc)

spark = SparkSession.builder.master("local[1]").appName("pyspark-basic").getOrCreate()

# display spark session (local master)
spark

## DataFrame example
### What is DataFrame
A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows: in the Scala API, DataFrame is simply a type alias of `Dataset[Row]`, while in the Java API, users need to use `Dataset<Row>` to represent a DataFrame.

### Create DataFrame

In [15]:
# change the log level
spark.sparkContext.setLogLevel("ERROR")

In [16]:
# create DataFrame
data = [('001','Smith','M',40,'DA',4000),
        ('002','Rose','M',35,'DA',3000),
        ('003','Williams','M',30,'DE',2500),
        ('004','Anne','F',30,'DE',3000),
        ('005','Mary','F',35,'BE',4000),
        ('006','James','M',30,'FE',3500)]

columns = ["cd","name","gender","age","div","salary"]
df = spark.createDataFrame(data = data, schema = columns)

df.printSchema()
df.show()

root
 |-- cd: string (nullable = true)
 |-- name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- age: long (nullable = true)
 |-- div: string (nullable = true)
 |-- salary: long (nullable = true)



+---+--------+------+---+---+------+
| cd|    name|gender|age|div|salary|
+---+--------+------+---+---+------+
|001|   Smith|     M| 40| DA|  4000|
|002|    Rose|     M| 35| DA|  3000|
|003|Williams|     M| 30| DE|  2500|
|004|    Anne|     F| 30| DE|  3000|
|005|    Mary|     F| 35| BE|  4000|
|006|   James|     M| 30| FE|  3500|
+---+--------+------+---+---+------+



                                                                                

### Load from file
You can read a CSV file into a PySpark DataFrame with either `DataFrameReader.csv("path")` or `format("csv").load("path")`. Both methods take a file path as their parameter. With the `format("csv")` approach, you specify the data source by name, for example *csv* or *org.apache.spark.sql.csv*.

Download the zipcodes.csv file from the spark-examples [repository](https://github.com/spark-examples/pyspark-examples/blob/master/resources/zipcodes.csv).

In [17]:
# download the raw CSV; the /blob/ page URL returns an HTML page, not the file
!curl -LO https://raw.githubusercontent.com/spark-examples/pyspark-examples/master/resources/zipcodes.csv

In [18]:
# read CSV file, using the first line as column names
df = spark.read.csv("./zipcodes.csv", header=True)
df.printSchema()



In [19]:
# stop the current spark session before connecting to standalone cluster
spark.stop()

## PySpark on Standalone Cluster
Before you begin, make sure that your standalone Spark cluster is running. If it is not, follow the instructions to [run a Spark standalone cluster](https://github.com/Young-ook/data-lab-on-wsl?tab=readme-ov-file#launch-a-standalone-cluster).

In [25]:
import pyspark
sc = pyspark.SparkContext(master="spark://localhost:7077")
sc

## Pi example


In [26]:
import random

num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples

print(pi)

3.1416436
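
The estimate works because a point drawn uniformly from the unit square falls inside the quarter circle of radius 1 with probability π/4, so four times the observed fraction approximates π. The plain-Python sketch below runs without a cluster and shows the same logic on a smaller sample. The function name `estimate_pi`, the sample size, and the seed are illustrative choices, not part of the notebook.

```python
import random


def estimate_pi(num_samples, seed=None):
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)  # private generator so a seeded run is reproducible
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:  # the point lies inside the quarter circle
            hits += 1
    return 4 * hits / num_samples


estimate = estimate_pi(200_000, seed=42)
print(estimate)
```

The Spark version distributes this loop: `parallelize` splits the range into partitions, and `filter(inside).count()` counts the hits on the workers in parallel.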


                                                                                

In [28]:
# stop the current Spark context for cleanup (the session was already stopped above)
sc.stop()

## Additional Resources
- [Apache Spark Examples](https://spark.apache.org/examples.html)
- [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)