# PySpark

PySpark is the Python API for Apache Spark. It lets you run distributed computations from Python for data analysis, Big Data processing, and machine learning.

PySpark communicates with Spark through a special library, Py4J. It lets Python programs running in the Python interpreter dynamically access Java objects in the JVM, where Spark's Scala code runs.

PySpark lets you run parallel processing without using Python's threading or multiprocessing modules.  
Spark handles all the complex communication and synchronization between threads, processes, and even different CPUs.

### Example:  
Here is a simple example of PySpark usage:
- Open a CSV file
- Count the number of rows
- Print the schema
- Show the top 10 rows

PySpark can connect to a Spark cluster and process data there, but it can also run locally and read many popular file formats. This is very convenient for debugging scripts locally.

In [5]:
# Import the PySpark entry point
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Load the CSV file, treating the first line as a header
# (the default field separator is a comma)
df = spark.read.csv("./data/example.csv", header=True)

# Count the number of rows in the DataFrame
row_count = df.count()

# Print the row count
print("Number of rows:", row_count)

# Print the schema of the DataFrame
df.printSchema()

# Show the top 10 rows
df.show(10)



Number of rows: 12
root
 |-- column1;column2: string (nullable = true)

+---------------+
|column1;column2|
+---------------+
|            1;2|
|            2;3|
|            3;4|
|            4;5|
|            5;6|
|            6;7|
|            7;8|
|            8;9|
|           9;10|
|          10;11|
+---------------+
only showing top 10 rows

