## A basic tutorial on Spark DataFrames with PySpark
For more on Spark SQL and DataFrames, see the official guide: https://spark.apache.org/docs/latest/sql-getting-started.html

### Import and create the Spark context

In [1]:
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext('local', 'spark_dataframe')
sqlContext = SQLContext(sc)

### How to create a DataFrame?

A DataFrame in Apache Spark can be created in several ways:

+ By loading data from a file format such as JSON or CSV.
+ By converting an existing RDD.
+ By programmatically specifying a schema.

![title](DataFrame-in-Spark.png)

#### 1. Create from RDD

In [2]:
from pyspark.sql import Row

l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
people_df = sqlContext.createDataFrame(people)
people_df.show()

+---+--------+
|age|    name|
+---+--------+
| 25|   Ankit|
| 22|Jalfaizy|
| 20| saurabh|
| 26|    Bala|
+---+--------+



#### 2. Create from a CSV file

In [5]:
csv_df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('code/data/cars.csv')
csv_df.show()

+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment| null|
|1997| Ford| E350|Go get one now th...| null|
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+



#### Read using the csv function

In [31]:
df1 = sqlContext.read.csv('code/data/cars.csv', header=True)
df1.show()

# check the column data types (all strings here, because inferSchema is not set)
df1.printSchema()

+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment| null|
|1997| Ford| E350|Go get one now th...| null|
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+

root
 |-- year: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- comment: string (nullable = true)
 |-- blank: string (nullable = true)



##### Show the first 2 rows of the DataFrame

In [6]:
csv_df.head(2)

[Row(year=2012, make='Tesla', model='S', comment='No comment', blank=None),
 Row(year=1997, make='Ford', model='E350', comment='Go get one now they are going fast', blank=None)]

##### Show column names

In [8]:
csv_df.columns

['year', 'make', 'model', 'comment', 'blank']

##### Summary statistics

In [10]:
csv_df.describe().show()

+-------+-----------------+-----+-----+--------------------+-----+
|summary|             year| make|model|             comment|blank|
+-------+-----------------+-----+-----+--------------------+-----+
|  count|                3|    3|    3|                   2|    0|
|   mean|           2008.0| null| null|                null| null|
| stddev|9.643650760992955| null| null|                null| null|
|    min|             1997|Chevy| E350|Go get one now th...| null|
|    max|             2015|Tesla| Volt|          No comment| null|
+-------+-----------------+-----+-----+--------------------+-----+



##### Select column

In [11]:
csv_df.select('make','model').show()

+-----+-----+
| make|model|
+-----+-----+
|Tesla|    S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+



##### Filter

In [16]:
csv_df.filter(csv_df['year']==2012).show()  # equivalent to csv_df.filter('year==2012').show()

+----+-----+-----+----------+-----+
|year| make|model|   comment|blank|
+----+-----+-----+----------+-----+
|2012|Tesla|    S|No comment| null|
+----+-----+-----+----------+-----+



##### Map a select result to RDD

In [22]:
rdd_make = csv_df.select('make','model').rdd.map(lambda x: (x[0], x[1], 1))
print(rdd_make.collect())

[('Tesla', 'S', 1), ('Ford', 'E350', 1), ('Chevy', 'Volt', 1)]


##### Sort by year

In [24]:
csv_df.orderBy(csv_df.year).show()  # use csv_df.orderBy(csv_df.year.desc()).show() to sort in descending order

+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|1997| Ford| E350|Go get one now th...| null|
|2012|Tesla|    S|          No comment| null|
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+



##### Apply function during select and add new column

In [27]:
from pyspark.sql.functions import year,current_date
csv_df.withColumn('year_old', year(current_date()) - csv_df.year).show()

+----+-----+-----+--------------------+-----+--------+
|year| make|model|             comment|blank|year_old|
+----+-----+-----+--------------------+-----+--------+
|2012|Tesla|    S|          No comment| null|       7|
|1997| Ford| E350|Go get one now th...| null|      22|
|2015|Chevy| Volt|                null| null|       4|
+----+-----+-----+--------------------+-----+--------+



##### Drop a column

In [32]:
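# DataFrames are immutable: withColumn above returned a new DataFrame, so csv_df
# has no 'year_old' column and this drop leaves it unchanged
csv_df.drop('year_old').show()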

+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment| null|
|1997| Ford| E350|Go get one now th...| null|
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+



##### Aggregation with groupby

In [38]:
people_df.groupby('name').count().show()  # similar to people_df.groupby('name').agg({'*': 'count'}).show(), though the result column name differs

+--------+-----+
|    name|count|
+--------+-----+
|    Bala|    1|
|Jalfaizy|    1|
| saurabh|    1|
|   Ankit|    1|
+--------+-----+



#### Running SQL Queries

In [37]:
# register the DataFrame as a temporary view first
people_df.createOrReplaceTempView("people_view")
# then query
sqlContext.sql("SELECT * FROM people_view").show()

+---+--------+
|age|    name|
+---+--------+
| 25|   Ankit|
| 22|Jalfaizy|
| 20| saurabh|
| 26|    Bala|
+---+--------+

