# PySpark overview

Apache Spark is written in Scala. To support Python, the Apache Spark community released PySpark, which lets you work with RDDs (Resilient Distributed Datasets) from Python. This is made possible by Py4J, a library that lets Python code access objects running in the JVM.

PySpark also provides the PySpark shell, which links the Python API to the Spark core and initializes the SparkContext for you. Many data scientists and analysts use Python because of its rich set of libraries, so integrating Python with Spark is a boon to them.

## Importing PySpark

In [None]:
import pyspark
sc = pyspark.SparkContext(appName="Intro_to_pyspark")

SparkContext is the entry point to Spark functionality. When a Spark application runs, a driver program starts; it contains the main function, and the SparkContext is created there. The driver program then runs the operations inside executors on the worker nodes.

**Parameters of the pyspark.SparkContext class**
```
class pyspark.SparkContext (
   # master --> It is the URL of the cluster it connects to
   master = None,
   # appName --> Name of your job
   appName = None,
   # sparkHome --> Spark installation directory
   sparkHome = None,
   # pyFiles --> The .zip or .py files to send to the cluster and add to the PYTHONPATH
   pyFiles = None,
   # environment --> Worker nodes environment variables
   environment = None,
   # batchSize --> The number of Python objects represented as a single Java object. Set 1 to disable batching,
   # 0 to automatically choose the batch size based on object sizes, or -1 to use an unlimited batch size.
   batchSize = 0,
   # serializer --> RDD serializer
   serializer = PickleSerializer(),
   # conf --> A SparkConf object used to set all the Spark properties.
   conf = None,
   # gateway --> Use an existing gateway and JVM; otherwise a new JVM is initialized.
   gateway = None,
   # jsc --> The JavaSparkContext instance.
   jsc = None,
   # profiler_cls --> A class of custom Profiler used to do profiling (the default is pyspark.profiler.BasicProfiler).
   profiler_cls = pyspark.profiler.BasicProfiler
)
```

Retrieve SparkContext version

In [None]:
sc.version

'2.4.4'

Retrieve Python version of SparkContext

In [None]:
sc.pythonVer

'3.6'

Retrieve the master of SparkContext: the URL of the cluster, or a "local" string when running in local mode

In [None]:
sc.master

'local[*]'

## Loading data in PySpark
SparkContext's *parallelize()* method creates an RDD from a Python collection

In [None]:
rdd = sc.parallelize([1, 2, 3, 4, 5])

SparkContext's *textFile()* method creates an RDD with one element per line of a text file

In [None]:
rdd2 = sc.textFile("sample_data/test.txt")

Your Turn:


*   Print the version of SparkContext in the PySpark shell.
*   Print the Python version of SparkContext in the PySpark shell.
*   What is the master of SparkContext in the PySpark shell?

In [None]:
# Print the version of SparkContext
print("The version of Spark Context in the PySpark shell is", sc.____)

# Print the Python version of SparkContext
print("The Python version of Spark Context in the PySpark shell is", sc.____)

# Print the master of SparkContext
print("The master of Spark Context in the PySpark shell is", sc.____)

* Create a Python list named numb containing the numbers 1 to 100.
* Load the list into Spark using SparkContext's parallelize method and assign it to a variable spark_data.


In [None]:
# Create a Python list of numbers from 1 to 100
numb = list(range(____, ____))

# Load the list into PySpark  
spark_data = sc.____(numb)



*   Load the local text file sample_data/README.md into the PySpark shell.

In [None]:
file_path = 'sample_data/README.md'
# Load a local file into PySpark shell
lines = sc.____(file_path)