# SparkSQL and DataFrames 

<a href = "http://yogen.io"><img src="http://yogen.io/assets/logo.svg" alt="yogen" style="width: 200px; float: right;"/></a>

## RDDs, DataSets, and DataFrames

RDDs are the original interface for Spark programming.

DataFrames were introduced in 1.3

Datasets were introduced in 1.6, and unified with DataFrames in 2.0

### Advantages of DataFrames:

from https://www.datacamp.com/community/tutorials/apache-spark-python:

> More specifically, the performance improvements are due to two things, which you’ll often come across when you’re reading up DataFrames: custom memory management (project Tungsten), which will make sure that your Spark jobs much faster given CPU constraints, and optimized execution plans (Catalyst optimizer), of which the logical plan of the DataFrame is a part.

## SparkSQL and DataFrames 


pyspark does not have the Dataset API, which is available only if you use Spark from a statically typed language: Scala or Java.

From https://spark.apache.org/docs/2.2.0/sql-programming-guide.html:

> A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row]. While, in Java API, users need to use Dataset&lt;Row> to represent a DataFrame.


### The pyspark.sql module

Important classes of Spark SQL and DataFrames:

* `pyspark.sql.SparkSession` Main entry point for DataFrame and SQL functionality.

* `pyspark.sql.DataFrame` A distributed collection of data grouped into named columns.

* `pyspark.sql.Column` A column expression in a DataFrame.

* `pyspark.sql.Row` A row of data in a DataFrame.

* `pyspark.sql.GroupedData` Aggregation methods, returned by DataFrame.groupBy().

* `pyspark.sql.DataFrameNaFunctions` Methods for handling missing data (null values).

* `pyspark.sql.DataFrameStatFunctions` Methods for statistics functionality.

* `pyspark.sql.functions` List of built-in functions available for DataFrame.

* `pyspark.sql.types` List of data types available.

* `pyspark.sql.Window` For working with window functions.

http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html

https://spark.apache.org/docs/2.2.0/sql-programming-guide.html

## SparkSession

The traditional way to interact with Spark is the SparkContext. In the notebooks we get that from the pyspark driver.

From 2.0 we can use SparkSession to replace SparkConf, SparkContext and SQLContext

In [1]:
from pyspark.sql import SparkSession

SparkSession.sparkContext

<property at 0x7f2f33aee408>

In [2]:
sc

In [3]:
#Hay que crear una sesión. A través de la sesion se puedr leer archivos
#session = SparkSession.builder.getOrCreate()
session=SparkSession.builder\
        .config('someoption','somevalue')\
        .config('anotheroption','anothervalue')\
        .getOrCreate()

In [3]:
#he tenido que hacer el rm del metadata para que funcione, en el shell

#### Passing other options to spark session:
    
    

We can check option values in the resulting session like this:

### Creating DataFrames

SparkSession.createDataFrame: from an RDD, a list or a pandas.DataFrame.

In [4]:
import random

random.choice(['widgeteer','smoke salesman','wizard','psycopath'])

'smoke salesman'

In [5]:
#revisar este sentencia del profesor, yo la tengo mal y va arrastrando el error hacia abajo
random.seed(42)

ids=range(20)
positions=[random.choices(['widgeteer','smoke salesman','wizard','psycopath']) for _ in range(20)]

In [6]:
[random.choices(['widgeteer','smoke salesman','wizard','psycopath']) for _ in range(20)]

TypeError: object of type 'generator' has no len()

In [46]:
rows=zip(ids,positions)
df=session.createDataFrame(rows)
df

DataFrame[_1: bigint, _2: array<string>]

In [37]:
df.take(5)

[Row(_1=0, _2=['wizard']),
 Row(_1=1, _2=['widgeteer']),
 Row(_1=2, _2=['smoke salesman']),
 Row(_1=3, _2=['widgeteer']),
 Row(_1=4, _2=['wizard'])]

In [38]:
df.show(5)

+---+----------------+
| _1|              _2|
+---+----------------+
|  0|        [wizard]|
|  1|     [widgeteer]|
|  2|[smoke salesman]|
|  3|     [widgeteer]|
|  4|        [wizard]|
+---+----------------+
only showing top 5 rows



In [68]:
from pyspark.sql import Row

In [69]:
#...

In [70]:
#podemos crear una esquema (definición de los campos de una tabla y de que tipo son)
rows=zip(ids,positions)
rows
df=session.createDataFrame(rows,schema=['id_number','position'])
df.show(5)

+---------+----------------+
|id_number|        position|
+---------+----------------+
|        0|        [wizard]|
|        1|     [widgeteer]|
|        2|[smoke salesman]|
|        3|     [widgeteer]|
|        4|        [wizard]|
+---------+----------------+
only showing top 5 rows



In [71]:
df.rdd
#el DF tiene un rdd asociado por debajo

MapPartitionsRDD[76] at javaToPython at NativeMethodAccessorImpl.java:0

In [13]:
#podemos crear dataframes de cualquier fuente...

### Creating DataFrames

* From RDDs
* from Hive tables
* From Spark sources: parquet (default), json, jdbc, orc, libsvm, csv, text


#### From RDDs

In [72]:
lines=session.sparkContext.textFile('coupon150720.csv')
#line es un textFile y puedo hacer todo lo que sé ya

In [73]:
lines.take(5)
#podemos procesar esto y cualquier otra cosa de este modo

['79062005698500,1,MAA,AUH,9W,9W,56.79,USD,1,H,H,0526,150904,OK,IAF0',
 '79062005698500,2,AUH,CDG,9W,9W,84.34,USD,1,H,H,6120,150905,OK,IAF0',
 '79062005924069,1,CJB,MAA,9W,9W,60.0,USD,1,H,H,2768,150721,OK,IAA0',
 '79065668570385,1,DEL,DXB,9W,9W,160.63,USD,2,S,S,0546,150804,OK,INA0',
 '79065668737021,1,AUH,IXE,9W,9W,152.46,USD,1,V,V,0501,150803,OK,INA0']

In [74]:
def parse(line):
    
    fields=line.split(',')
    coupon=[fields[0],fields[2],fields[3],fields[4],float(fields[6])]
    
    return coupon

In [75]:
#puedo mapear esta funcion sobre lines (copupons)
coupons=lines.map(parse)
coupons.take(5)

[['79062005698500', 'MAA', 'AUH', '9W', 56.79],
 ['79062005698500', 'AUH', 'CDG', '9W', 84.34],
 ['79062005924069', 'CJB', 'MAA', '9W', 60.0],
 ['79065668570385', 'DEL', 'DXB', '9W', 160.63],
 ['79065668737021', 'AUH', 'IXE', '9W', 152.46]]

In [18]:
session

In [78]:
#podemos convertir esto en un dataframe
df=session.createDataFrame(coupons)


In [79]:
df.show(5)

+--------------+---+---+---+------+
|            _1| _2| _3| _4|    _5|
+--------------+---+---+---+------+
|79062005698500|MAA|AUH| 9W| 56.79|
|79062005698500|AUH|CDG| 9W| 84.34|
|79062005924069|CJB|MAA| 9W|  60.0|
|79065668570385|DEL|DXB| 9W|160.63|
|79065668737021|AUH|IXE| 9W|152.46|
+--------------+---+---+---+------+
only showing top 5 rows



### Inferring and specifying schemas

In [None]:
from pyspark.sql import types

types.IntegerType()

In [None]:
#el esquema de un df comprende el nombre, el tipo y si los datos son nulables o no
#la forma de especificar completamenete un esquema es con un structtype que es como una lista.

#### Fully specifying a schema

We need to create a `StructType` composed of `StructField`s. each of those specifies afiled with name, type and `nullable` properties. 

In [None]:
#si no me molan los tipos que spark infiere, podemos especificar el esquema de esta forma

employee_schema=types.StructType([types.StructField('id_number', types.LongType(),nullable=False),
                 types.StructField('position',types.StringType(),nullable=True)])

employees=session.createDataFrame(zip(ids,positions),schema=employee_schema)
employees.printSchema()



#### From csv files

We can either read them directly into dataframes or read them as RDDs and transform that into a DataFrame. This second way will be very useful if we have unstructured data like web server logs.

In [None]:
#desde un csv
df_from_csv=session.read.csv('coupon150720.csv')
df_from_csv.show(5)

In [None]:
df_from_csv=session.sql('SELECT  _c0,_c1,_c2,_c3,_c4,_c5 from csv.`coupon150720.csv`')
df_from_csv.show(5)

In [None]:
df_from_csv.printSchema()

In [None]:
df_from_csv=session.sql('SELECT  _c0,_c1,_c2,_c3,_c4,_c5, CAST(_c6 AS FLOAT) from csv.`coupon150720.csv`')
df_from_csv.printSchema()

In [None]:
df_from_csv.show(5)

#### From other types of data

Apache Parquet is a free and open-source column-oriented data store of the Apache Hadoop ecosystem. It is similar to the other columnar storage file formats available in Hadoop namely RCFile and Optimized RCFile. It is compatible with most of the data processing frameworks in the Hadoop environment.

In [None]:
#parquet es un formato column oriented, tambien podemos leer de json, a traves de jdbc,...

session.read.parquet
session.read.json
session.read.jdbc

### Basic operations with DataFrames

In [None]:
employees.show(5)

In [None]:
employee_0=employees.first()
employee_0['id_number']

### Filtering and selecting

Syntax inspired in SQL.

In [None]:
#podemos seleccionar
employees.select('id_number')

If we want to filter, we will need to build an instance of `Column`, using square bracket notation.

In [None]:
type(employees['id_number'])

In [None]:
employees.filter(employees['id_number']<5).show()

That's because a comparison between str and int will error out, so spark will not even get the chance to infer to which column we are referring.

`where` is exactly synonimous with `filter`

In [None]:
#si nos mola mucho SQL

employees.where(employees['id_number']<7).show()

A column is quite different to a Pandas Series. It is just a reference to a column, and can only be used to construct sparkSQL expressions (select, where...). It can't be collected or taken as a one-dimensional sequence:

In [None]:
#una columna es distinto a pandas, no es una serie, es un puntero.


#### Exercise

Extract all employee ids which correspond to pyschopath

In [None]:
employees.filter(employees['position']=='psycopath').show()

In [None]:
#employees.select(('id_number').filter(employees['position']=='psychopath').show(5)
#se debe hacer el filter de las columnas antes de descartarlas con el select
employees.filter(employees['position']=='psycopath').select('id_number').show(5)

### Adding columns

Dataframes are immutable, since they are built on top of RDDs, so we can not assign to them. We need to create new DataFrames with the appropriate columns.

In [None]:
#employees['salary']=df['id_number']**2 no podría hacer esto (como se hace en pandas) pq el data frame es inmutable
employees.withColumn('square',employees['id_number']**2)
#esto genera un nuevo df

In [None]:
employees.select('id_number',
                'position',
                (employees['id_number']**2).alias('holi'))
#esta es otra forma de crear una columna

In [None]:
employees.show(10)

### User defined functions

There are many useful functions in pyspark.sql.functions. These work on columns, that is, they are vectorial.

We can write User Defined Functions (`udf`s), which allow us to "vectorize" operations: write a standard function to process single elements, then build a udf with that that works on columns in a DataFrame, like a SQL function.

In [None]:
from pyspark.sql import functions
#son funciones que nos permiten prcesar una columna

In [None]:
help(functions.log1p)

In [None]:
#tengo que hacer un withvolumn o un select para que haga algo con la funcion
df=employees.select('id_number',
                    'position',
                    functions.log1p(employees['id_number']))
df.show()

#podemos hacer esto tambien:

In [None]:
#podemos hacer esto tambien. ya que ahora estamos pasando a la funcion de spark un string, como es una funcion que opera sobre columnas, es lo bastente inteligente para saber que ese string es el nombre de una columna
df=employees.select('id_number',
                    'position',
                    functions.log1p('id_number'))
df.show(5)

In [None]:
import math
math.log1p(0)

This errors out because 

```python
math.log1p
```

is not a udf: it doesn't know how to work with strings or Column objects:

In [None]:
#udf es user defined function

But we can transform it into a udf:

In [None]:
udf_log1p=functions.udf(math.log1p)
#es una funcion que come una funcion y que da una funcion

df=employees.select('id_number',
                    'position',
                    udf_log1p('id_number'))
df.show(5)

#devuelve string la nueva columna ya que functions.udf devuelve por defecto tipos string. Hay qu especificarle.

We can do the same with any function we dream up:

In [None]:
#podemos hacer esto con cualquier cosa
f_a=functions.udf(lambda word:word[:3])

#aplicarla:
df=employees.select('id_number',
                    'position',
                    f_a('position'))
df.show(5)

If we want the resulting columns to be of a particular type, we need to specify the return type. This is because in Python return types can not be inferred.

Think about this function: what is its return type?

In [None]:
def anonymous(element):
    result=element+element
    return result

In [None]:
#si queremos especificar un tipo
udf_log1p_2=functions.udf(math.log1p,returnType=types.FloatType())
udf_log1p_2.returnType

#### Exercise: 

Create a 'salary' field in our df. make it 100000 for psychopath, 35000 for widgeteer and 50000 for smoke salesman 60000 for wizards.



In [None]:
def exercise2(x):
    if x=='psychopath':
        return '100000'
    if x=='widgeteer':
        return '35000'
    if x=='smoke salesman':
        return '50000'
    if x=='wizard':
        return '60000'
    else: return '0'


In [None]:
udf_e2=functions.udf(exercise2,returnType=types.IntegerType())
udf_e2.returnType

In [None]:
df=employees.select('id_number',
                    'position',
                    udf_e2('position'))
df.show(10)
df.printSchema()

In [None]:
df=df.withColumn('bonus',df['exercise2(position)']*0.1)
df.printSchema()

#es lo suficientemente inteligente para definir la columna bonus como un double

If we have a column that is not the desired type, we can convert it with `cast`.

In [None]:
df.select('id_number',
            'position',
            df['exercise2(position)'].cast(types.IntegerType()))

### Summary statistics

https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html

In [None]:
df.stat.corr('id_number','exercise2(position)')

In [None]:
df.stat.cov('id_number','exercise2(position)')

### .crosstab()

Crosstab returns the contingency table for two columns, as a DataFrame.

In [43]:
location_udf=functions.udf(lambda: random.choice(['Madrid','Barcelona']))
with_locs=df.select('id_number',
            'position',
             df['exercise2(position)'].alias('salary'),
             location_udf().alias('location')
            )
with_locs.show(5)
#si solo fuese añadir una columna tendría mas sentido usar withcolumns
#el df es lazy, cada vez que hacemos show se constituye desde 0 y cambia la localizacion.
#esto se solucionaría o fijando la semilla o lo cacheamos (le decimos que se acuerde de lo valores)


NameError: name 'functions' is not defined

In [None]:
with_locs.cache()
with_locs.show()


In [None]:
#para hacer la contigency table
with_locs.crosstab('position','location').show()
#ME HACE LA TABLA DE CONTINGENCIA, ME DICE COMO CONCURREN LAS DOS VARIABLES

### Grouping

Grouping works very similarly to Pandas: executing groupby (or groupBy) on a DataFrame will return an object (a GroupedData) that can then be aggregated to obtain the results.

In [None]:
#AGRUPAMIENTOS E INTERSECCIONES: MUY PARECIDO A PANDAS
gb = with_locs.groupby('location')
gb

GroupedData has several aggregation functions defined:

In [None]:
#gd tiene una serie de funciones para agregar
gb.avg().show()

In [None]:
gb = with_locs.groupby('salary')
gb.avg().show()

We can do several aggregations in a single step, with a number of different syntaxes:

In [None]:
gb.agg({'id_number':'max','salary':'mean'}).show()

In [None]:
#pero no podemos hacer distintas agregaciones sobre la misma columna, es una limitacion de python con los diccionarios.
#para ello tenemos que usar otro tipo de sintaxis

result=gb.agg(
    functions.mean('salary'),
    functions.count('salary'),
    functions.countDistinct('position')
)

result.show()

#podría hacer todas las agregaciones que quisiera

### Intersections

Very much like SQL joins. We can specify the columns and the join method (left, right, inner, outer) or we can let Spark infer them.

In [None]:
random.choice 

In [None]:
#muy parecidas a JOINS en SQL
[random.choice(['Barcelona', 'Sevilla', 'Madrid']) for _ in range(7)]


In [None]:
[random.randint(0,19) for _ in range(7)]

In [None]:
[random.random()* 10000 for _ in range(7)]

In [None]:
random.seed(42)

raises=session.createDataFrame(list(zip([random.choice(['Bcn', 'Sev', 'Mad']) for _ in range(7)],
                                        [random.randint(0,19) for _ in range(7)],
                                        [random.random()* 10000 for _ in range(7)])),
                               shema=['position','id_number','raises']).cache()
raises.show()

In [None]:
#cruzamos las tablas. si al join no le especifico nada, el mismo toma sus propias decisiones, pero ojo pq algunas veces puede salir error porque la tabla resultante doble su numero de datos (Producto cartesiano)
with_locs.join(raises,on='id_number').show()

In [None]:
with_locs.join(raises,on='id_number',how='left').show()

In [None]:
#con esto ya tengo flexibilidad completa sobre lo que quiera hacer puedo hacer conidciones todo lo completas que desee, incluso aunque las columnas no se llamen ingual
with_locs.join(raises,with_locs['id_number']==raises['position'],how='left').show()

In [26]:
joined=with_locs.join(raises,(with_locs['id_number']==raises['position']) &
               (with_locs['location']==raises['position']),how='left').show()

NameError: name 'with_locs' is not defined

Spark refuses to do cross joins by default. To perform them, we can 

a) Allow then explicitly:

```python
session.conf.set("spark.sql.crossJoin.enabled", "true")
```

b) Specify the join criterion

```python
df4.join(new_df, on='id').show()
```

#### Digression

We can monitor our running jobs and storage used at the Spark Web UI. We can get its url with sc.uiWebUrl.

StorageLevels represent how our DataFrame is cached: we can save the results of the computation up to that point, so that if we process several times the same data only the subsequent steps will be recomputed.

In [None]:
session.sparkContext.uiWebUrl

We can erase it with `unpersist`

In [None]:
joined.unpersist()

#### Exercise

Calculate the [z-score](http://www.statisticshowto.com/probability-and-statistics/z-score/) of each employee's salary for their location

In [None]:
#z-score: medida de como de lejos está un valor determinado del centro de la distribucion.
gb = with_locs.groupby('location')
gb.avg().show()

1) Calculate the mean and std of salary for each location

In [28]:
stats=joined.groupby('location')\
                        .agg(functions.mean('salary').alias('avg_salary'),
                              functions.stddev('salary').alias('std_salary'))

stats.show()

NameError: name 'joined' is not defined

2) Annotate each employee with the stats corresponding to their location

In [29]:
#tenemos stats y joins, vamos a anotar a cada empleado
annotated=starting_pont.join(stats,on='location',how='left').show()
#en este caso da igual, inner que outer que otra cosa
#ID number está repetido

NameError: name 'joined' is not defined

3) Calculate the z-score

In [None]:
annotated=starting_point.select('id_number','position')
#revisar con el del profesor.


Note that we can build more complex boolean conditions for joining, as well as joining on columns that do not have the same name:

### Handling null values

In [56]:
the_other=session.createDataFrame([
    (100,'superboss',12000000,None),
    (101,None,1000000,'Miami')])

df=with_locs.union(the_others)

#seguir con lo del profesor.
#para quitar los na: df.dropna(how='all',subset=['salary','location']).show(25)
# o con fill na

NameError: name 'with_locs' is not defined

## SQL querying

We need to register our DataFrame as a table in the SQL context in order to be able to query against it.

In [57]:
df.registerTempTable('df_table')

Once registered, we can perform queries as complex as we want.

In [58]:
session.sql('''SELECT * from df_table WHERE location="Madrid" AND salary>40000''').show()

AnalysisException: "cannot resolve '`location`' given input columns: [_1, _2]; line 1 pos 29;\n'Project [*]\n+- 'Filter (('location = Madrid) && ('salary > 40000))\n   +- SubqueryAlias df_table\n      +- LogicalRDD [_1#56L, _2#57]\n"

In [None]:
#para poder usar los dataframe de esta forma hay que registrarlos como tablas.
#Lo mismo pasa con las funciones.

In [60]:
session.sql('''SELECT id_number,log1p(id_number) from df_table WHERE location="Madrid" AND salary>40000''').show()

AnalysisException: "cannot resolve '`location`' given input columns: [_1, _2]; line 1 pos 54;\n'Project ['id_number, unresolvedalias('log1p('id_number), None)]\n+- 'Filter (('location = Madrid) && ('salary > 40000))\n   +- SubqueryAlias df_table\n      +- LogicalRDD [_1#56L, _2#57]\n"

In [61]:
#también podemos usar nuestras propias UDF
def classist(salary):
    return 'perroflauta' id salary < 42000 else 'burgues'

class_udf=functions.udf(classist)
df.select('id_number','position',class_udf('salary')).show()

SyntaxError: invalid syntax (<ipython-input-61-729afa8d0d49>, line 3)

In [63]:
#pero si la queremos meter dentro de una query sql tenemos que registrar la funcoin
session.udf.register('classist_sql',class_udf)

session.sql('''SELECT id_number,position,classist_sql(salary) from df_table WHERE location="Madrid"''').show()


NameError: name 'class_udf' is not defined

#### Exercise:

replicate the previous exercise, but with SparkSQL instead of dataframe methods.

## Interoperation with Pandas

Easy peasy. We can convert a spark DataFrame into a Pandas one, which will `collect` it, and viceversa, which will distribute it.

In [65]:
pandas_df=df.toPandas()

In [66]:
session.createDataFrame(pandas_df)

DataFrame[_1: bigint, _2: array<string>]

## Writing out


In [67]:
df.write.csv('df.csv')

Py4JJavaError: An error occurred while calling o357.csv.
: java.lang.UnsupportedOperationException: CSV data source does not support array<string> data type.
	at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.org$apache$spark$sql$execution$datasources$csv$CSVUtils$$verifyType$1(CSVUtils.scala:127)
	at org.apache.spark.sql.execution.datasources.csv.CSVUtils$$anonfun$verifySchema$1.apply(CSVUtils.scala:131)
	at org.apache.spark.sql.execution.datasources.csv.CSVUtils$$anonfun$verifySchema$1.apply(CSVUtils.scala:131)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:98)
	at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.verifySchema(CSVUtils.scala:131)
	at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.prepareWrite(CSVFileFormat.scala:65)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:142)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:438)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:474)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
	at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:598)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)


#### Exercise

Repeat the exercise from the previous notebook, but this time with DataFrames.

Get stats for all tickets with destination MAD from `coupons150720.csv`.

You will need to extract ticket amounts with destination MAD, and then calculate:

1. Total ticket amounts per origin
2. Top 10 airlines by average amount

+--------------+---+---+---+---+---+------+---+---+---+----+----+------+----+----+
|           _c0|_c1|_c2|_c3|_c4|_c5|   _c6|_c7|_c8|_c9|_c10|_c11|  _c12|_c13|_c14|
+--------------+---+---+---+---+---+------+---+---+---+----+----+------+----+----+
|79062005698500|  1|MAA|AUH| 9W| 9W| 56.79|USD|  1|  H|   H|0526|150904|  OK|IAF0|
|79062005698500|  2|AUH|CDG| 9W| 9W| 84.34|USD|  1|  H|   H|6120|150905|  OK|IAF0|
|79062005924069|  1|CJB|MAA| 9W| 9W|  60.0|USD|  1|  H|   H|2768|150721|  OK|IAA0|
|79065668570385|  1|DEL|DXB| 9W| 9W|160.63|USD|  2|  S|   S|0546|150804|  OK|INA0|
|79065668737021|  1|AUH|IXE| 9W| 9W|152.46|USD|  1|  V|   V|0501|150803|  OK|INA0|
+--------------+---+---+---+---+---+------+---+---+---+----+----+------+----+----+
only showing top 5 rows



1) Extract the fields you need (c0,c1,c2,c3,c4 and c6) into a dataframe with proper names and types

Remember, you want to calculate:

Total ticket amounts per origin

Top 10 airlines by average amount

In [93]:
df_coupons=session.sql('''SELECT _c0 AS TicketNum ,_c1 AS CouponNum ,_c2 Origin,_c3 Destination,_c4 AS Airline, CAST(_c6 AS FLOAT) AS Amount 
                        FROM csv.`coupon150720.csv`''')
df_coupons.show(5)

+--------------+----------+------+-----------+--------+------+
|            ID|COUPON_NUM|ORIGIN|DESTINATION|COMPANNY| PRICE|
+--------------+----------+------+-----------+--------+------+
|79062005698500|         1|   MAA|        AUH|      9W| 56.79|
|79062005698500|         2|   AUH|        CDG|      9W| 84.34|
|79062005924069|         1|   CJB|        MAA|      9W|  60.0|
|79065668570385|         1|   DEL|        DXB|      9W|160.63|
|79065668737021|         1|   AUH|        IXE|      9W|152.46|
+--------------+----------+------+-----------+--------+------+
only showing top 5 rows



In [None]:
df_coupons.where(df_coupons['Destination']=='MAD').head()
#esto mismo hay que hacerlo por origen

In [None]:
gb=df_coupons.where(df_coupons['Destination']=='MAD').groupby('Origin')
#hacemos la agregación y ordenamos
gd.sum('Amount').sort('sum(Amount)',ascending=False).show(20)

In [92]:
#REGISTRAMOS LA TABLA
df.registerTempTable('df_coupons')

In [97]:
session.sql('SELECT')

AnalysisException: "cannot resolve '`PRICE`' given input columns: [_4, _2, _3, _1, _5]; line 1 pos 7;\n'Project ['PRICE]\n+- SubqueryAlias df_coupons\n   +- LogicalRDD [_1#123, _2#124, _3#125, _4#126, _5#127]\n"

2) Total ticket amounts per origin

In [85]:
gb=df_coupons.where(df_coupons['Destination']=='MAD').groupby('Origin')
#hacemos la agregación y ordenamos
gd.sum('Amount').sort('sum(Amount)',ascending=False).show(20)

Help on method sql in module pyspark.sql.session:

sql(sqlQuery) method of pyspark.sql.session.SparkSession instance
    Returns a :class:`DataFrame` representing the result of the given query.
    
    :return: :class:`DataFrame`
    
    >>> df.createOrReplaceTempView("table1")
    >>> df2 = spark.sql("SELECT field1 AS f1, field2 as f2 from table1")
    >>> df2.collect()
    [Row(f1=1, f2='row1'), Row(f1=2, f2='row2'), Row(f1=3, f2='row3')]
    
    .. versionadded:: 2.0



3) Top 10 Airlines by average amount



In [None]:
gb=df_coupons.where(df_coupons['Destination']=='MAD').groupby('Airline')
#hacemos la agregación y ordenamos
gd.mean('Amount').sort('avg(Amount)',ascending=False).show(10)

In [None]:
#para un análisis rápido esto es maravilloso.

## Further Reading

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

https://www.datacamp.com/community/tutorials/apache-spark-python

https://spark.apache.org/docs/2.2.0/sql-programming-guide.html

https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-to-other-formats

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf