# Apache Spark - DataFrames

<p><strong>Objective: </strong> The goal of this notebook is to create a Spark DataFrame and apply different kinds of transformations and actions.</p>

## Load the data
This exercise uses the Movies dataset in Parquet format:

Read the data

In [0]:
df_movies = spark.read.parquet("/FileStore/tables/movies.parquet")

Print the schema

In [0]:
df_movies.printSchema()

root
 |-- actor_name: string (nullable = true)
 |-- movie_title: string (nullable = true)
 |-- produced_year: long (nullable = true)

Display the data

In [0]:
display(df_movies)

actor_name,movie_title,produced_year
"McClure, Marc (I)",Coach Carter,2005
"McClure, Marc (I)",Superman II,1980
"McClure, Marc (I)",Apollo 13,1995
"McClure, Marc (I)",Superman,1978
"McClure, Marc (I)",Back to the Future,1985
"McClure, Marc (I)",Back to the Future Part III,1990
"Cooper, Chris (I)","Me, Myself & Irene",2000
"Cooper, Chris (I)",October Sky,1999
"Cooper, Chris (I)",Capote,2005
"Cooper, Chris (I)",The Bourne Supremacy,2004


Show a specific number of rows

In [0]:
df_movies.take(10)

Out[4]: [Row(actor_name='McClure, Marc (I)', movie_title='Coach Carter', produced_year=2005),
 Row(actor_name='McClure, Marc (I)', movie_title='Superman II', produced_year=1980),
 Row(actor_name='McClure, Marc (I)', movie_title='Apollo 13', produced_year=1995),
 Row(actor_name='McClure, Marc (I)', movie_title='Superman', produced_year=1978),
 Row(actor_name='McClure, Marc (I)', movie_title='Back to the Future', produced_year=1985),
 Row(actor_name='McClure, Marc (I)', movie_title='Back to the Future Part III', produced_year=1990),
 Row(actor_name='Cooper, Chris (I)', movie_title='Me, Myself & Irene', produced_year=2000),
 Row(actor_name='Cooper, Chris (I)', movie_title='October Sky', produced_year=1999),
 Row(actor_name='Cooper, Chris (I)', movie_title='Capote', produced_year=2005),
 Row(actor_name='Cooper, Chris (I)', movie_title='The Bourne Supremacy', produced_year=2004)]

In [0]:
display(df_movies.take(10))

actor_name,movie_title,produced_year
"McClure, Marc (I)",Coach Carter,2005
"McClure, Marc (I)",Superman II,1980
"McClure, Marc (I)",Apollo 13,1995
"McClure, Marc (I)",Superman,1978
"McClure, Marc (I)",Back to the Future,1985
"McClure, Marc (I)",Back to the Future Part III,1990
"Cooper, Chris (I)","Me, Myself & Irene",2000
"Cooper, Chris (I)",October Sky,1999
"Cooper, Chris (I)",Capote,2005
"Cooper, Chris (I)",The Bourne Supremacy,2004


In [0]:
display(df_movies.tail(5))

actor_name,movie_title,produced_year
"Leacock, Viv",This Means War,2012.0
"Leacock, Viv",Hot Tub Time Machine,2010.0
"Leacock, Viv",Freddy vs. Jason,2003.0
"Leacock, Viv",Are We There Yet?,2005.0
"Leacock, Viv",Are We There Yet?,


Show a random fraction of the data

In [0]:
display(df_movies.sample(fraction=0.070))

actor_name,movie_title,produced_year
"Cassavetes, Frank",John Q,2002
"Jolie, Angelina",Lara Croft: Tomb Raider,2001
"Jolie, Angelina",Alexander,2004
"Yip, Françoise",Romeo Must Die,2000
"Cueto, Esteban",Collateral Damage,2002
"Danner, Blythe",Hauru no ugoku shiro,2004
"Butters, Mike",Titanic,1997
"Ruskin, Joseph",Indecent Proposal,1993
"Byrne, Michael (I)",Indiana Jones and the Last Crusade,1989
"Byrne, Michael (I)",Gangs of New York,2002


Show the column names of the DataFrame

In [0]:
df_movies.columns

Out[8]: ['actor_name', 'movie_title', 'produced_year']

## select(columns)

Select some columns

In [0]:
display(df_movies.select("movie_title","produced_year"))

movie_title,produced_year
Coach Carter,2005
Superman II,1980
Apollo 13,1995
Superman,1978
Back to the Future,1985
Back to the Future Part III,1990
"Me, Myself & Irene",2000
October Sky,1999
Capote,2005
The Bourne Supremacy,2004


In [0]:
df_movies.select("movie_title","produced_year").show(20)

+--------------------+-------------+
|         movie_title|produced_year|
+--------------------+-------------+
|        Coach Carter|         2005|
|         Superman II|         1980|
|           Apollo 13|         1995|
|            Superman|         1978|
|  Back to the Future|         1985|
|Back to the Futur...|         1990|
|  Me, Myself & Irene|         2000|
|         October Sky|         1999|
|              Capote|         2005|
|The Bourne Supremacy|         2004|
|         The Patriot|         2000|
|            The Town|         2010|
|          Seabiscuit|         2003|
|      A Time to Kill|         1996|
|Where the Wild Th...|         2009|
|         The Muppets|         2011|
|     American Beauty|         1999|
|             Syriana|         2005|
| The Horse Whisperer|         1998|
|             Jarhead|         2005|
+--------------------+-------------+
only showing top 20 rows



## selectExpr(expressions)

Add a computed column using a SQL expression

In [0]:
df_movies.selectExpr("*","(produced_year - (produced_year % 10)) as decade").show(5)

+-----------------+------------------+-------------+------+
|       actor_name|       movie_title|produced_year|decade|
+-----------------+------------------+-------------+------+
|McClure, Marc (I)|      Coach Carter|         2005|  2000|
|McClure, Marc (I)|       Superman II|         1980|  1980|
|McClure, Marc (I)|         Apollo 13|         1995|  1990|
|McClure, Marc (I)|          Superman|         1978|  1970|
|McClure, Marc (I)|Back to the Future|         1985|  1980|
+-----------------+------------------+-------------+------+
only showing top 5 rows



Keep the result in a new DataFrame

In [0]:
df_new = df_movies.selectExpr("*","(produced_year - (produced_year % 10)) as decade")

In [0]:
display(df_new)

actor_name,movie_title,produced_year,decade
"McClure, Marc (I)",Coach Carter,2005,2000
"McClure, Marc (I)",Superman II,1980,1980
"McClure, Marc (I)",Apollo 13,1995,1990
"McClure, Marc (I)",Superman,1978,1970
"McClure, Marc (I)",Back to the Future,1985,1980
"McClure, Marc (I)",Back to the Future Part III,1990,1990
"Cooper, Chris (I)","Me, Myself & Irene",2000,2000
"Cooper, Chris (I)",October Sky,1999,1990
"Cooper, Chris (I)",Capote,2005,2000
"Cooper, Chris (I)",The Bourne Supremacy,2004,2000


Using functions inside a SQL expression

In [0]:
df_movies.selectExpr("count(distinct(movie_title)) as movies","count(distinct(actor_name)) as actors").show(1)

+------+------+
|movies|actors|
+------+------+
|  1409|  6527|
+------+------+



## filter(condition), where(condition)

Filter rows using comparison operators on column values

In [0]:
df_filter = df_movies.filter(df_movies.produced_year < 2000)
display(df_filter)

actor_name,movie_title,produced_year
"McClure, Marc (I)",Superman II,1980
"McClure, Marc (I)",Apollo 13,1995
"McClure, Marc (I)",Superman,1978
"McClure, Marc (I)",Back to the Future,1985
"McClure, Marc (I)",Back to the Future Part III,1990
"Cooper, Chris (I)",October Sky,1999
"Cooper, Chris (I)",A Time to Kill,1996
"Cooper, Chris (I)",American Beauty,1999
"Cooper, Chris (I)",The Horse Whisperer,1998
"Knight, Shirley (I)",As Good as It Gets,1997


In [0]:
df_movies.filter(df_movies.produced_year == 2000).show(5)

+-----------------+--------------------+-------------+
|       actor_name|         movie_title|produced_year|
+-----------------+--------------------+-------------+
|Cooper, Chris (I)|  Me, Myself & Irene|         2000|
|Cooper, Chris (I)|         The Patriot|         2000|
|  Jolie, Angelina|Gone in Sixty Sec...|         2000|
|   Yip, Françoise|      Romeo Must Die|         2000|
|   Danner, Blythe|    Meet the Parents|         2000|
+-----------------+--------------------+-------------+
only showing top 5 rows



In [0]:
df_movies.select("movie_title","produced_year").filter(df_movies.produced_year != 2000).show(5)

+------------------+-------------+
|       movie_title|produced_year|
+------------------+-------------+
|      Coach Carter|         2005|
|       Superman II|         1980|
|         Apollo 13|         1995|
|          Superman|         1978|
|Back to the Future|         1985|
+------------------+-------------+
only showing top 5 rows



## count

Count the rows in the DataFrame

In [0]:
df_movies.count()

Out[18]: 31393

## distinct and sort

Remove duplicate values, then sort them

In [0]:
display(df_movies.select("movie_title").distinct().sort("movie_title"))

movie_title
'Crocodile' Dundee II
10 Things I Hate About You
"10,000 BC"
101 Dalmatians
102 Dalmatians
12
13 Going on 30
1408
17 Again
2 Fast 2 Furious


## orderBy

Sort by the values of a column

In [0]:
display(df_movies.orderBy("actor_name"))

actor_name,movie_title,produced_year
"Aaron, Caroline",Along Came Polly,2004
"Aaron, Caroline",Primary Colors,1998
"Aaron, Caroline",Cellular,2004
"Aaron, Caroline",21 Jump Street,2012
"Aaron, Caroline",Just Like Heaven,2005
"Aaron, Caroline",Sleepless in Seattle,1993
"Aarons, Bonnie",Drag Me to Hell,2009
"Aarons, Bonnie",Shallow Hal,2001
"Aarons, Bonnie",The Princess Diaries 2: Royal Engagement,2004
"Aarons, Bonnie",The Fighter,2010


In [0]:
display(df_movies.orderBy("actor_name", ascending=False))

actor_name,movie_title,produced_year
"von Sydow, Max (I)",Never Say Never Again,1983
"von Sydow, Max (I)",Hannah and Her Sisters,1986
"von Sydow, Max (I)",Rush Hour 3,2007
"von Sydow, Max (I)",Minority Report,2002
"von Sydow, Max (I)",Shutter Island,2010
"von Sydow, Max (I)",The Wolfman,2010
"von Sydow, Max (I)",Robin Hood,2010
"von Sydow, Max (I)",The Exorcist,1973
"von Sydow, Max (I)",What Dreams May Come,1998
"von Siegel, Matt",The Last Airbender,2010


## describe

It is often useful to get a general sense of the basic statistics of the data you
are working with. For string and numeric columns, describe computes the count,
mean, standard deviation, minimum, and maximum.
You can choose which columns to compute the statistics for by passing their names.

In [0]:
df_movies.describe("produced_year").show()

+-------+------------------+
|summary|     produced_year|
+-------+------------------+
|  count|             31392|
|   mean|2002.7964449541284|
| stddev| 6.377236851493877|
|    min|              1961|
|    max|              2012|
+-------+------------------+



## Save to a Parquet file in the file system

In [0]:
df_filter.write.parquet("/FileStore/Filter.parquet")

In [0]:
parquetFile = spark.read.parquet("/FileStore/Filter.parquet")
display(parquetFile)

actor_name,movie_title,produced_year
"McClure, Marc (I)",Superman II,1980
"McClure, Marc (I)",Apollo 13,1995
"McClure, Marc (I)",Superman,1978
"McClure, Marc (I)",Back to the Future,1985
"McClure, Marc (I)",Back to the Future Part III,1990
"Cooper, Chris (I)",October Sky,1999
"Cooper, Chris (I)",A Time to Kill,1996
"Cooper, Chris (I)",American Beauty,1999
"Cooper, Chris (I)",The Horse Whisperer,1998
"Knight, Shirley (I)",As Good as It Gets,1997


In [0]:
dbutils.fs.rm("/FileStore/Filter.parquet", True)

Out[26]: True

<h3>Your turn</h3>

Pick another Spark command and apply it to the dataset. Examples:

Rename a column

In [0]:
df_movies = df_movies.withColumnRenamed("movie_title", "movieTitle")

In [0]:
df_movies.describe().show()

+-------+------------------+--------------------+------------------+
|summary|        actor_name|          movieTitle|     produced_year|
+-------+------------------+--------------------+------------------+
|  count|             31393|               31393|             31392|
|   mean|              null|  312.61538461538464|2002.7964449541284|
| stddev|              null|   485.7043414390151| 6.377236851493877|
|    min|   Aaron, Caroline|'Crocodile' Dunde...|              1961|
|    max|von Sydow, Max (I)|                 xXx|              2012|
+-------+------------------+--------------------+------------------+



Flag rows by a numeric attribute and name the result with alias

In [0]:
df_movies.select("movieTitle","produced_year",(df_movies.produced_year.between(1980, 2000)).alias("Between")).show(100)

+--------------------+-------------+-------+
|          movieTitle|produced_year|Between|
+--------------------+-------------+-------+
|        Coach Carter|         2005|  false|
|         Superman II|         1980|   true|
|           Apollo 13|         1995|   true|
|            Superman|         1978|  false|
|  Back to the Future|         1985|   true|
|Back to the Futur...|         1990|   true|
|  Me, Myself & Irene|         2000|   true|
|         October Sky|         1999|   true|
|              Capote|         2005|  false|
|The Bourne Supremacy|         2004|  false|
|         The Patriot|         2000|   true|
|            The Town|         2010|  false|
|          Seabiscuit|         2003|  false|
|      A Time to Kill|         1996|   true|
|Where the Wild Th...|         2009|  false|
|         The Muppets|         2011|  false|
|     American Beauty|         1999|   true|
|             Syriana|         2005|  false|
| The Horse Whisperer|         1998|   true|
|         

Flag rows by a categorical attribute and name the result with alias

In [0]:
df_movies.select("actor_name", (df_movies.actor_name.startswith("Coo")).alias("Starts with Coo")).show(50)

+-------------------+---------------+
|         actor_name|Starts with Coo|
+-------------------+---------------+
|  McClure, Marc (I)|          false|
|  McClure, Marc (I)|          false|
|  McClure, Marc (I)|          false|
|  McClure, Marc (I)|          false|
|  McClure, Marc (I)|          false|
|  McClure, Marc (I)|          false|
|  Cooper, Chris (I)|           true|
|  Cooper, Chris (I)|           true|
|  Cooper, Chris (I)|           true|
|  Cooper, Chris (I)|           true|
|  Cooper, Chris (I)|           true|
|  Cooper, Chris (I)|           true|
|  Cooper, Chris (I)|           true|
|  Cooper, Chris (I)|           true|
|  Cooper, Chris (I)|           true|
|  Cooper, Chris (I)|           true|
|  Cooper, Chris (I)|           true|
|  Cooper, Chris (I)|           true|
|  Cooper, Chris (I)|           true|
|  Cooper, Chris (I)|           true|
|  Cooper, Chris (I)|           true|
|  Cassavetes, Frank|          false|
|  Cassavetes, Frank|          false|
|  Cassavete

In [0]:
df_movies.select("actor_name", (df_movies.actor_name.startswith("Coo")).alias("True")).filter(df_movies.actor_name.startswith("Coo")).show(50)

+-------------------+----+
|         actor_name|True|
+-------------------+----+
|  Cooper, Chris (I)|true|
|  Cooper, Chris (I)|true|
|  Cooper, Chris (I)|true|
|  Cooper, Chris (I)|true|
|  Cooper, Chris (I)|true|
|  Cooper, Chris (I)|true|
|  Cooper, Chris (I)|true|
|  Cooper, Chris (I)|true|
|  Cooper, Chris (I)|true|
|  Cooper, Chris (I)|true|
|  Cooper, Chris (I)|true|
|  Cooper, Chris (I)|true|
|  Cooper, Chris (I)|true|
|  Cooper, Chris (I)|true|
|  Cooper, Chris (I)|true|
|      Cook, Candice|true|
|      Cook, Candice|true|
|      Cook, Candice|true|
|         Cool, Greg|true|
|         Cool, Greg|true|
|         Cool, Greg|true|
|         Cool, Greg|true|
|       Coogan, Will|true|
|       Coogan, Will|true|
|       Coogan, Will|true|
|  Cooney, Kevin (I)|true|
|  Cooney, Kevin (I)|true|
|  Cooney, Kevin (I)|true|
|  Cooney, Kevin (I)|true|
|  Cooney, Kevin (I)|true|
|  Cooney, Kevin (I)|true|
|  Cooney, Kevin (I)|true|
|  Cooney, Kevin (I)|true|
|  Cooney, Kevin (I)|true|
|

Group, count, and sort

In [0]:
per_year = df_movies.groupBy("produced_year").count().sort("produced_year")
display(per_year)

produced_year,count
,1
1961.0,2
1967.0,2
1972.0,12
1973.0,5
1975.0,5
1977.0,40
1978.0,30
1979.0,37
1980.0,47
