# Spark DataFrames

- Enable wider audiences beyond “Big Data” engineers to leverage the power of distributed processing
- Inspired by data frames in R and Python (Pandas)
- Designed from the ground up to support modern big data and data science applications
- Extension to the existing RDD API

## References
- [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)
- [Introduction to DataFrames - Python](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html)
- [PySpark Cheat Sheet: Spark DataFrames in Python](https://www.datacamp.com/community/blog/pyspark-sql-cheat-sheet)

### DataFrames are:
- The preferred abstraction in Spark
- Distributed collections of rows organized into named columns with a schema
- Built on Resilient Distributed Datasets (RDD)
- Immutable once constructed

### With DataFrames you can:
- Track lineage information to efficiently recompute lost data
- Run operations on a collection of elements in parallel

### You construct DataFrames
- by parallelizing existing collections (e.g., Pandas DataFrames) 
- by transforming an existing DataFrame
- from files in HDFS or any other storage system (e.g., Parquet)

### Features
- Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
- Support for a wide array of data formats and storage systems
- Integration with the wider big data tooling and infrastructure through Spark
- APIs for Python, Java, Scala, and R

### DataFrames versus RDDs
- For new users, a familiar API if they already know data frames from other programming languages
- For existing Spark users, an API that makes Spark easier to program than RDDs do
- For both groups, better performance through query optimization and code generation

## PySpark Shell

**Run the Spark shell:**

~~~ bash
pyspark
~~~

Output similar to the following will be displayed, followed by a `>>>` REPL prompt:

~~~
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
2018-09-18 17:13:13 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 3.6.5 (default, Apr 29 2018 16:14:56)
SparkSession available as 'spark'.
>>>
~~~

Read the data into a DataFrame:

~~~ py
df = sqlContext.read.csv("/tmp/irmar.csv", sep=';', header=True)
~~~

~~~
>>> df.show()
+---+--------------------+------------+------+------------+--------+-----+---------+--------+
|_c0|                name|       phone|office|organization|position|  hdr|    team1|   team2|
+---+--------------------+------------+------+------------+--------+-----+---------+--------+
|  0|      Alphonse Paul |+33223235223|   214|          R1|     DOC|False|      EDP|      NA|
|  1|        Ammari Zied |+33223235811|   209|          R1|      MC| True|      EDP|      NA|
.
.
.
| 18|    Bernier Joachim |+33223237558|   214|          R1|     DOC|False|   ANANUM|      NA|
| 19|   Berthelot Pierre |+33223236043|   601|          R1|      PE| True|       GA|      NA|
+---+--------------------+------------+------+------------+--------+-----+---------+--------+
only showing top 20 rows
~~~

## Transformations, Actions, Laziness

Like RDDs, DataFrames are lazy. Transformations contribute to the query plan, but they don't execute anything.
Actions cause the execution of the query.

### Transformation examples
- filter
- select
- drop
- intersect 
- join

### Action examples
- count 
- collect 
- show 
- head
- take

## Creating a DataFrame in Python

~~~ py
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# The following three lines are not necessary
# in the pyspark shell
conf = SparkConf().setAppName("people").setMaster("local[*]")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Read a JSON file into a DataFrame and display it
df = sqlContext.read.json("/tmp/people.json")
df.show()
~~~

~~~
+----------+---------+-----------+
| firstname| lastname|      login|
+----------+---------+-----------+
|      Uzel|    Simon|     uzel_d|
|   Perrine|   Moreau|   moreau_p|
|     Elise|    Negri|    negri_e|
|   Camille|   Cochet|   cochet_c|
|   Nolwenn| Giguelay| giguelay_n|
|     Youen|    Meyer|    meyer_y|
|    Emilie|  Lacoste|  lacoste_e|
|       Pia|  LeBihan|  lebihan_p|
|      Yann|    Evain|    evain_y|
|   Camille|    Guyon|    guyon_c|
|  Mathilde|  LeMener|  lemener_m|
|    Gildas| LeGuilly| liguilly_g|
|    Pierre| Gardelle| gardelle_p|
|Christophe|Boulineau|boulineau_c|
|      Omar| Aitichou| aitichou_o|
|     Lijun|      Chi|      chi_l|
|    Jiawei|      Liu|      lin_j|
+----------+---------+-----------+
~~~

Stop the SparkContext when you are done:

~~~ py
sc.stop()
~~~