# DataFrames in Spark

## Prerequisites

Install Spark and Java on the virtual machine (VM)

In [1]:
# install Java 8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download Spark 3.5.0
!wget -q https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

In [2]:
ls -l # verify that the .tgz is there

total 391020
drwxr-xr-x 1 root root      4096 Nov 17 14:29 sample_data/
-rw-r--r-- 1 root root 400395283 Sep  9  2023 spark-3.5.0-bin-hadoop3.tgz


In [3]:
# extract it
!tar xf spark-3.5.0-bin-hadoop3.tgz

In [4]:
!pip install -q findspark

In [5]:
!pip install py4j

# For maps
!pip install folium
!pip install plotly



Set up the environment

In [6]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"
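Before starting a session it can help to confirm that the two path variables actually point at directories. The following is a minimal sketch using only the standard library; the helper name `check_spark_env` is hypothetical and not part of Spark or findspark.

```python
import os

def check_spark_env(env=None):
    """Return {variable: bool}: True when the variable points to an existing directory."""
    env = os.environ if env is None else env
    required = ("JAVA_HOME", "SPARK_HOME")
    # A missing variable maps to "", which os.path.isdir reports as False.
    return {name: os.path.isdir(env.get(name, "")) for name in required}

status = check_spark_env()
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'missing or not a directory'}")
```

If either line reports a problem, the install or extract step above most likely did not finish, or the path does not match the extracted folder name.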

Start the Spark session

---

In [7]:
import findspark
findspark.init("spark-3.5.0-bin-hadoop3")  # SPARK_HOME directory

from pyspark.sql import SparkSession

# create the session
spark = SparkSession \
        .builder \
        .appName("DataFrames Basics") \
        .master("local[*]") \
        .getOrCreate()

spark.version

'3.5.0'

In [8]:
spark

In [9]:
# Enable Arrow to speed up conversion to pandas
# (the older key spark.sql.execution.arrow.enabled is deprecated since Spark 3.0)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

In [10]:
# Import SQL functions
from pyspark.sql.functions import *

Download and upload the datasets (books.csv, bank.csv, guitarPlayers.json)

In [11]:
from google.colab import files
uploaded = files.upload()

Saving vehicles.csv to vehicles.csv
Saving roles.csv to roles.csv
Saving planets.csv to planets.csv
Saving paychecks.csv to paychecks.csv
Saving leads.csv to leads.csv
Saving guitars.json to guitars.json
Saving guitarPlayers.json to guitarPlayers.json
Saving games.csv to games.csv
Saving developers.csv to developers.csv
Saving deptmanagers.csv to deptmanagers.csv
Saving cars_dates.json to cars_dates.json
Saving books.csv to books.csv
Saving bank.csv to bank.csv
Saving bands.json to bands.json
Saving titles.csv to titles.csv
Saving taxi_zones.csv to taxi_zones.csv
Saving taxi_data.csv to taxi_data.csv
Saving stocks.csv to stocks.csv
Saving salaries.csv to salaries.csv
Saving quijote.txt to quijote.txt
Saving population.json to population.json
Saving numbers.csv to numbers.csv
Saving movies.json to movies.json
Saving more_cars.json to more_cars.json
Saving frankenstein.txt to frankenstein.txt
Saving employees.csv to employees.csv
Saving characters.csv to characters.csv
Saving cars.json t

In [12]:
ls -l /dataset

ls: cannot access '/dataset': No such file or directory


## Examples

In [13]:
from pyspark.sql.types import Row
from pyspark.sql.functions import *

Create an RDD directly from a CSV file

In [14]:
bankText = spark.sparkContext.textFile("bank.csv")

# We have to drop the first row because it contains the headers
bank = bankText.map(lambda lineaCsv: lineaCsv.split(";"))\
.filter(lambda s: s[0] != "\"age\"") \
.map(lambda row: Row(int(row[0]), row[1].replace("\"", ""), row[2].replace("\"", ""), row[3].replace("\"", ""), row[5].replace("\"", ""))) \
.toDF(["age", "job", "marital", "education", "balance"]) \
.withColumn("age", col("age").cast("int"))

bank.show(3)

+---+----------+-------+---------+-------+
|age|       job|marital|education|balance|
+---+----------+-------+---------+-------+
| 30|unemployed|married|  primary|   1787|
| 33|  services|married|secondary|   4789|
| 35|management| single| tertiary|   1350|
+---+----------+-------+---------+-------+
only showing top 3 rows
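The pipeline above splits each line on `;` and strips the quotes by hand, which would misparse a quoted field that itself contains a semicolon. As a minimal sketch of the same parsing step outside Spark, Python's standard `csv` module handles the delimiter and the quoting together. The sample text is an assumption: it reuses the first row shown above, and the `default` column with its value `no` is made up to fill position 4.

```python
import csv
import io

# In-memory stand-in for bank.csv: semicolon-delimited, strings in double quotes.
# The "default" column and its value are illustrative assumptions.
sample = (
    '"age";"job";"marital";"education";"default";"balance"\n'
    '30;"unemployed";"married";"primary";"no";1787\n'
)

# DictReader consumes the header line, so no manual filter is needed.
reader = csv.DictReader(io.StringIO(sample), delimiter=";", quotechar='"')
rows = [
    {
        "age": int(r["age"]),
        "job": r["job"],
        "marital": r["marital"],
        "education": r["education"],
        "balance": int(r["balance"]),
    }
    for r in reader
]
print(rows)
```

Within Spark, the same delimiter handling is available through `spark.read.option("delimiter", ";")`, the reader option used for books.csv elsewhere in this notebook.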



Read a JSON file directly into a DF

In [15]:
guitarPlayersDF = spark.read.json("guitarPlayers.json")
# The JSON reader infers the schema by default; inferSchema is a CSV reader option (default False) and is not needed here.
# You can also pass the schema manually.

Read directly from CSV

In [16]:
booksDF = spark.read.option("header", "true") \
                   .option("delimiter", ",") \
                   .csv("books.csv")

booksDF.show(3)


+------+--------------------+--------------------+--------------+----------+-------------+-------------+-----------+-------------+------------------+----------------+---------------+
|bookID|               title|             authors|average_rating|      isbn|       isbn13|language_code|  num_pages|ratings_count|text_reviews_count|publication_date|      publisher|
+------+--------------------+--------------------+--------------+----------+-------------+-------------+-----------+-------------+------------------+----------------+---------------+
|     1|Harry Potter and ...|J.K. Rowling/Mary...|          4.57|0439785960|9780439785969|          eng|        652|      2095690|             27591|       9/16/2006|Scholastic Inc.|
|     2|Harry Potter and ...|J.K. Rowling/Mary...|          4.49|0439358078|9780439358071|          eng|        870|      2153167|             29221|        9/1/2004|Scholastic Inc.|
|     4|Harry Potter and ...|        J.K. Rowling|          4.42|0439554896|978043955

Show a DF and print its schema

In [17]:
guitarPlayersDF.show(2)
guitarPlayersDF.printSchema()

+----+-------+---+-----------+
|band|guitars| id|       name|
+----+-------+---+-----------+
|   0|    [0]|  0| Jimmy Page|
|   1|    [1]|  1|Angus Young|
+----+-------+---+-----------+
only showing top 2 rows

root
 |-- band: long (nullable = true)
 |-- guitars: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
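The schema printed above was derived from the values in the file. The sketch below is a toy illustration of that idea in plain Python, not Spark's actual inference code. It assumes the file holds one JSON object per line, and the two records are rebuilt from the rows shown above.

```python
import json

def infer_type(value):
    """Map a parsed JSON value to a Spark-like type name (toy version)."""
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list):
        # Use the first element as the element type; fall back to string when empty.
        inner = infer_type(value[0]) if value else "string"
        return f"array<{inner}>"
    return "string"

# Records reconstructed from the output above (assumed JSON Lines layout).
lines = [
    '{"band": 0, "guitars": [0], "id": 0, "name": "Jimmy Page"}',
    '{"band": 1, "guitars": [1], "id": 1, "name": "Angus Young"}',
]

schema = {}
for line in lines:
    for key, value in json.loads(line).items():
        schema[key] = infer_type(value)

for name in sorted(schema):
    print(f"{name}: {schema[name]}")
```

Whole numbers come out as `long` rather than `int`, which matches the `long` columns in the printed schema.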



Get rows

In [18]:
guitarPlayersDF.take(2)

[Row(band=0, guitars=[0], id=0, name='Jimmy Page'),
 Row(band=1, guitars=[1], id=1, name='Angus Young')]

Count

In [19]:
guitarPlayersDF.count()

4

Schema

In [20]:
# get the schema
guitarPlayersSchema = guitarPlayersDF.schema
print(type(guitarPlayersSchema))
print(guitarPlayersSchema)

<class 'pyspark.sql.types.StructType'>
StructType([StructField('band', LongType(), True), StructField('guitars', ArrayType(LongType(), True), True), StructField('id', LongType(), True), StructField('name', StringType(), True)])


Custom schemas

In [21]:
players_rdd = spark.sparkContext.parallelize([
    ("Leo Messi", "Delantero", 10),
    ("Virgil van Dijk", "Defensa", 4),
    ("David Villa", "Delantero", 7)
])

In [22]:
exampleDF = spark.createDataFrame(players_rdd)
exampleDF.printSchema()

root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: long (nullable = true)



With column names

In [23]:
names = ["Nombre", "Posicion", "Numero"]

In [24]:
example2DF = players_rdd.toDF(names)
example2DF.printSchema()

root
 |-- Nombre: string (nullable = true)
 |-- Posicion: string (nullable = true)
 |-- Numero: long (nullable = true)



In [25]:
# import SQL types
from pyspark.sql.types import *

In [26]:
# custom schema; Numero is declared as StringType, so the integers are stored as strings
customSchema = StructType([
    StructField('Nombre', StringType(), True),
    StructField('Posicion', StringType(), True),
    StructField('Numero', StringType(), True)])

In [27]:
example3DF = spark.createDataFrame(players_rdd, customSchema)
example3DF.printSchema()

root
 |-- Nombre: string (nullable = true)
 |-- Posicion: string (nullable = true)
 |-- Numero: string (nullable = true)



In [28]:
example3DF.show(2, False)

+---------------+---------+------+
|Nombre         |Posicion |Numero|
+---------------+---------+------+
|Leo Messi      |Delantero|10    |
|Virgil van Dijk|Defensa  |4     |
+---------------+---------+------+
only showing top 2 rows



In [29]:
# we can also specify the schema with DDL (Data Definition Language)
customSchema2 = "`Nombre` STRING NOT NULL, `Posicion` STRING, `Numero` INT"

In [30]:
example4DF = spark.createDataFrame(players_rdd, customSchema2)
example4DF.printSchema()

root
 |-- Nombre: string (nullable = false)
 |-- Posicion: string (nullable = true)
 |-- Numero: integer (nullable = true)
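To make the DDL string easier to read, the following toy parser breaks it into name, type, and nullability. It is a sketch for flat schemas only and is not how Spark parses DDL; the function name `parse_ddl` is hypothetical.

```python
def parse_ddl(ddl):
    """Split a flat DDL schema string into (name, type, nullable) tuples.

    Toy parser: it does not handle nested or parameterized types such as
    STRUCT<...> or DECIMAL(10,2), whose definitions contain commas.
    """
    fields = []
    for part in ddl.split(","):
        tokens = part.strip().split()
        name = tokens[0].strip("`")  # drop the backticks around the name
        dtype = tokens[1].upper()
        # a column is nullable unless it is followed by NOT NULL
        nullable = [t.upper() for t in tokens[2:]] != ["NOT", "NULL"]
        fields.append((name, dtype, nullable))
    return fields

customSchema2 = "`Nombre` STRING NOT NULL, `Posicion` STRING, `Numero` INT"
for field in parse_ddl(customSchema2):
    print(field)
```

The three tuples line up with the printed schema above: only `Nombre` is non-nullable, because it is the only column followed by `NOT NULL`.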



In [31]:
print(type(example2DF.collect()[0]["Numero"]))
print(type(example3DF.collect()[0]["Numero"]))

<class 'int'>
<class 'str'>
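The difference between the two types affects more than display. As a small plain-Python sketch using the shirt numbers from the example, sorting the same values as strings and as integers gives different orders. Spark likewise compares a string column lexicographically when ordering by it.

```python
numbers_as_str = ["10", "4", "7"]   # what a StringType column holds
numbers_as_int = [10, 4, 7]         # what a numeric column holds

sorted_str = sorted(numbers_as_str)  # lexicographic: "10" sorts before "4"
sorted_int = sorted(numbers_as_int)  # numeric order

print(sorted_str)
print(sorted_int)
```

Declaring `Numero` as a numeric type in the schema avoids this surprise when the column is later sorted or aggregated.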


## Exercises

1. Manually create a DF describing beverages 🍹

    manufacturer
    flavor
    container_type
    sugar_content_grams

2. Load any other file from the data folder 🏙️

    print its schema
    count the number of rows by calling `count()`

3. Take a look at taxi_zones.csv. Read the file into a DF, but this time with your own schema 🎤

Exercise 1

Exercise 2

Exercise 3