# Aggregations and Groupings

## Prerequisites

Install Spark and Java on the virtual machine (VM)

In [1]:
# install Java 8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download Spark 3.5.0
!wget -q https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

In [2]:
ls -l  # verify that the .tgz is there

total 391020
drwxr-xr-x 1 root root      4096 Nov 17 14:29 sample_data/
-rw-r--r-- 1 root root 400395283 Sep  9  2023 spark-3.5.0-bin-hadoop3.tgz


In [3]:
# extract it
!tar xf spark-3.5.0-bin-hadoop3.tgz

In [4]:
!pip install -q findspark

In [5]:

!pip install py4j

# For maps
!pip install folium
!pip install plotly



Set up the environment

In [6]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"

Start the Spark session

---

In [7]:
import findspark
findspark.init("spark-3.5.0-bin-hadoop3")  # SPARK_HOME

from pyspark.sql import SparkSession

# create the session
spark = SparkSession \
        .builder \
        .appName("DataFrames Basics") \
        .master("local[*]") \
        .getOrCreate()

spark.version

'3.5.0'

In [8]:
spark

In [9]:
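# Speeds up conversion to pandas (Arrow-based transfer).
# Since Spark 3.0 the key is spark.sql.execution.arrow.pyspark.enabled;
# the older spark.sql.execution.arrow.enabled is deprecated.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")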

In [10]:
# Import the SQL functions
# Note: the wildcard import shadows Python built-ins such as sum, min and max
from pyspark.sql.functions import *

Download and upload the dataset (planets.csv)

In [11]:
from google.colab import files
uploaded = files.upload()

Saving planets.csv to planets.csv


In [12]:
# Read the semicolon-delimited CSV, using the first row as the header
planetasDF = spark.read \
    .option("inferSchema", True) \
    .option("header", True) \
    .option("delimiter", ";") \
    .csv("planets.csv")

# Keep only all-digit values and cast them to int; anything else becomes NULL
planetasDF = planetasDF \
    .withColumn("population",
                when(col("population").rlike("^[0-9]+$"), col("population")).otherwise(None).cast("int")) \
    .withColumn("diameter",
                when(col("diameter").rlike("^[0-9]+$"), col("diameter")).otherwise(None).cast("int"))
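
The cleaning rule above can be checked without Spark. The following is a minimal plain-Python sketch on made-up sample values; the helper `to_int_or_none` is hypothetical and only mirrors the `rlike("^[0-9]+$")` test followed by the cast: all-digit strings are kept, everything else becomes a missing value. It covers the pattern check only, not Spark's 32-bit `int` range.

```python
import re

DIGITS_ONLY = re.compile(r"[0-9]+")

def to_int_or_none(value):
    """Hypothetical helper mirroring when(rlike("^[0-9]+$")).otherwise(None).cast("int")."""
    if value is not None and DIGITS_ONLY.fullmatch(value):
        return int(value)
    return None

# Made-up sample values, not taken from planets.csv
samples = ["12500", "unknown", "NA", "10.5", "", None]
cleaned = [to_int_or_none(v) for v in samples]
print(cleaned)
```

Only the first sample survives; the rest would show up as NULL in the DataFrame and be skipped by the aggregations that follow.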

In [13]:
planetasDF.show(2, False)
print(planetasDF.schema.fields)
planetasDF.columns

+--------+---------------+--------------+--------+-------------------+----------+---------------------+-------------+----------+
|name    |rotation_period|orbital_period|diameter|climate            |gravity   |terrain              |surface_water|population|
+--------+---------------+--------------+--------+-------------------+----------+---------------------+-------------+----------+
|Alderaan|24             |364           |12500   |temperate          |1 standard|grasslands, mountains|40           |2000000000|
|Yavin IV|24             |4818          |10200   |temperate, tropical|1 standard|jungle, rainforests  |8            |1000      |
+--------+---------------+--------------+--------+-------------------+----------+---------------------+-------------+----------+
only showing top 2 rows

[StructField('name', StringType(), True), StructField('rotation_period', StringType(), True), StructField('orbital_period', StringType(), True), StructField('diameter', IntegerType(), True), StructFiel

['name',
 'rotation_period',
 'orbital_period',
 'diameter',
 'climate',
 'gravity',
 'terrain',
 'surface_water',
 'population']

## Examples

Count

In [14]:
# Row count of the DataFrame, including rows that contain NULLs
planetasDF.count()

61

In [15]:
# using SQL functions: count(column) does NOT count NULLs
climasCountDF = planetasDF.select(count(col("climate")))
climasCountDF.show()

+--------------+
|count(climate)|
+--------------+
|            61|
+--------------+
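
Both counts come out as 61 because `count(column)` only skips true nulls. The grouped output later in this notebook lists the value `NA` for some climates, which suggests that missing climates are stored as ordinary strings rather than nulls. A minimal plain-Python sketch of the two counting rules, using made-up values (`None` stands in for a SQL NULL):

```python
# Made-up column values; None plays the role of a SQL NULL
climates = ["temperate", "NA", None, "arid", None]

# DataFrame.count(): every row counts, nulls included
row_count = len(climates)

# count(col("climate")): only non-null values count; the string "NA" still counts
non_null_count = len([c for c in climates if c is not None])

print(row_count, non_null_count)
```

The two numbers differ only when the column holds real nulls, which is why they agree for `climate` and `terrain` here.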



In [16]:
terrenosCountDF = planetasDF.select(count(planetasDF.terrain))
terrenosCountDF.show()

+--------------+
|count(terrain)|
+--------------+
|            61|
+--------------+



In [17]:
planetasDF.select(count(planetasDF.climate).alias("countClimas"), count(planetasDF.terrain)).show()

+-----------+--------------+
|countClimas|count(terrain)|
+-----------+--------------+
|         61|            61|
+-----------+--------------+



In [18]:
# using SQL expression syntax
planetasDF.select(expr("count(terrain)")).show()
planetasDF.selectExpr("count(terrain) as count").show()

+--------------+
|count(terrain)|
+--------------+
|            61|
+--------------+

+-----+
|count|
+-----+
|   61|
+-----+



In [19]:
# using SQL (creating a temporary view)
planetasDF.createOrReplaceTempView("planetas")

In [20]:
spark.sql("select count(terrain) from planetas").show()

+--------------+
|count(terrain)|
+--------------+
|            61|
+--------------+



In [21]:
spark.sql("select count(terrain) as countTerrenos, count(climate) from planetas").show()

+-------------+--------------+
|countTerrenos|count(climate)|
+-------------+--------------+
|           61|            61|
+-------------+--------------+



Count Distinct

In [22]:
planetasDF.select(countDistinct(planetasDF.climate)).show()

+-----------------------+
|count(DISTINCT climate)|
+-----------------------+
|                     21|
+-----------------------+



In [23]:
spark.sql("select count(distinct climate) from planetas").show()

+-----------------------+
|count(DISTINCT climate)|
+-----------------------+
|                     21|
+-----------------------+



Min and max

In [24]:
planetasDF.select(min(planetasDF.population), max(planetasDF.population)).show()

+---------------+---------------+
|min(population)|max(population)|
+---------------+---------------+
|              0|     2000000000|
+---------------+---------------+



In [25]:
spark.sql("select min(population) from planetas").show()

+---------------+
|min(population)|
+---------------+
|              0|
+---------------+



Sum

In [26]:
planetasDF.select(sum(planetasDF.orbital_period).alias("suma_periodo_orbital")).show()
planetasDF.selectExpr("sum(orbital_period) as periodo_orbital_total").show()

+--------------------+
|suma_periodo_orbital|
+--------------------+
|             27726.0|
+--------------------+

+---------------------+
|periodo_orbital_total|
+---------------------+
|              27726.0|
+---------------------+



Average

In [27]:
planetasDF.select(avg(planetasDF.diameter)).show()
spark.sql("select avg(diameter) from planetas").show()

+-----------------+
|    avg(diameter)|
+-----------------+
|12388.34090909091|
+-----------------+

+-----------------+
|    avg(diameter)|
+-----------------+
|12388.34090909091|
+-----------------+
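
`avg` skips nulls the same way `count(column)` does: the divisor is the number of non-null values, not the number of rows. This matters here because the cleaning step turned non-numeric diameters into nulls. A plain-Python sketch with made-up numbers, using the standard `statistics` module:

```python
import statistics

# Made-up diameters; None plays the role of a SQL NULL
diameters = [12000, None, 10000, None, 8000]

non_null = [d for d in diameters if d is not None]

# What avg(diameter) computes: the mean over the 3 non-null values
avg_ignoring_nulls = statistics.mean(non_null)

# What you would get if nulls were (wrongly) treated as 0 over all 5 rows
avg_nulls_as_zero = statistics.mean([d if d is not None else 0 for d in diameters])

print(avg_ignoring_nulls, avg_nulls_as_zero)
```

Replacing nulls with 0 before averaging would therefore pull the result down, so it should only be done when 0 is a meaningful value.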



Statistics

In [28]:
planetasDF.select(mean(planetasDF.population)).show()
planetasDF.select(stddev(planetasDF.population)).show()

+---------------+
|avg(population)|
+---------------+
|    4.3004775E8|
+---------------+

+------------------+
|stddev(population)|
+------------------+
|6.37337099389144E8|
+------------------+
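
In Spark, `stddev` is an alias for `stddev_samp`, the sample standard deviation (divisor n - 1); `stddev_pop` gives the population version (divisor n). A plain-Python sketch of the difference on made-up numbers, using the standard `statistics` module:

```python
import statistics

# Made-up values, not taken from planets.csv
values = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = statistics.stdev(values)       # divisor n - 1, like stddev / stddev_samp
population_sd = statistics.pstdev(values)  # divisor n, like stddev_pop

print(sample_sd, population_sd)
```

The sample version is always the larger of the two, and the gap shrinks as the number of values grows.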



### Groupings

---

In [29]:
countByClimaDF = planetasDF.groupBy(planetasDF.climate).count().orderBy("count")
countByClimaDF.show()

+--------------------+-----+
|             climate|count|
+--------------------+-----+
|     temperate, arid|    1|
|               murky|    1|
|temperate, arid, ...|    1|
|            tropical|    1|
|          hot, humid|    1|
|  arid, rocky, windy|    1|
|              frozen|    1|
|         superheated|    1|
| tropical, temperate|    1|
|            polluted|    1|
| temperate, tropical|    1|
|temperate, arid, ...|    1|
|    temperate, artic|    1|
|artificial temper...|    1|
|    temperate, moist|    1|
|              frigid|    2|
|arid, temperate, ...|    2|
|                 hot|    3|
|                arid|    3|
|                  NA|   13|
+--------------------+-----+
only showing top 20 rows
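
Conceptually, `groupBy("climate").count()` buckets rows by their exact `climate` string and counts each bucket, so a combined value such as `temperate, arid` forms its own group rather than contributing to `temperate` or `arid`. A plain-Python sketch of the same idea with made-up rows, using `collections.Counter`:

```python
from collections import Counter

# Made-up climate values, one per row
climates = ["arid", "temperate", "temperate, arid", "arid", "NA", "temperate"]

# Equivalent of groupBy("climate").count(): one bucket per distinct string
counts = Counter(climates)

# Equivalent of orderBy("count"): ascending by group size
ordered = sorted(counts.items(), key=lambda item: item[1])

for climate, n in ordered:
    print(f"{climate}: {n}")
```

Counting planets per individual climate word would require splitting the string first, which is a different aggregation from the one shown in this notebook.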



In [30]:
spark.sql("select climate, count(climate) as count from planetas where climate is not null group by climate order by count").show()

+--------------------+-----+
|             climate|count|
+--------------------+-----+
|     temperate, arid|    1|
|               murky|    1|
|temperate, arid, ...|    1|
|            tropical|    1|
|          hot, humid|    1|
|  arid, rocky, windy|    1|
|              frozen|    1|
|         superheated|    1|
| tropical, temperate|    1|
|            polluted|    1|
| temperate, tropical|    1|
|temperate, arid, ...|    1|
|    temperate, artic|    1|
|artificial temper...|    1|
|    temperate, moist|    1|
|              frigid|    2|
|arid, temperate, ...|    2|
|                 hot|    3|
|                arid|    3|
|                  NA|   13|
+--------------------+-----+
only showing top 20 rows



In [31]:
avgDiametroByClimaDF = planetasDF.groupBy(col("climate")).avg("diameter").orderBy(col("avg(diameter)").desc())
avgDiametroByClimaDF.show()

+--------------------+------------------+
|             climate|     avg(diameter)|
+--------------------+------------------+
|           temperate|16481.052631578947|
|arid, temperate, ...|           16365.0|
| tropical, temperate|           15600.0|
|    temperate, artic|           14900.0|
|            polluted|           13490.0|
|temperate, arid, ...|           12900.0|
|         superheated|           12780.0|
|            tropical|           12765.0|
|     temperate, arid|           11370.0|
|temperate, arid, ...|           10600.0|
| temperate, tropical|           10200.0|
|              frigid|           10088.0|
|          hot, humid|            9100.0|
|               murky|            8900.0|
|                 hot| 8889.666666666666|
|              frozen|            7200.0|
|                  NA|            6095.0|
|                arid|3488.3333333333335|
|artificial temper...|               0.0|
|    temperate, moist|               0.0|
+--------------------+------------

In [32]:
planetasDF.groupBy(col("climate")).agg(avg("diameter") \
    .alias("avg")).orderBy(col("avg").desc()).show()

+--------------------+------------------+
|             climate|               avg|
+--------------------+------------------+
|           temperate|16481.052631578947|
|arid, temperate, ...|           16365.0|
| tropical, temperate|           15600.0|
|    temperate, artic|           14900.0|
|            polluted|           13490.0|
|temperate, arid, ...|           12900.0|
|         superheated|           12780.0|
|            tropical|           12765.0|
|     temperate, arid|           11370.0|
|temperate, arid, ...|           10600.0|
| temperate, tropical|           10200.0|
|              frigid|           10088.0|
|          hot, humid|            9100.0|
|               murky|            8900.0|
|                 hot| 8889.666666666666|
|              frozen|            7200.0|
|                  NA|            6095.0|
|                arid|3488.3333333333335|
|artificial temper...|               0.0|
|    temperate, moist|               0.0|
+--------------------+------------

In [33]:
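# Build the aggregated DataFrame first; show() returns None, so call it separately
aggregationsByClimateDF = planetasDF.groupBy("climate") \
    .agg(
        count("*").alias("Numero_de_planetas"),
        avg("diameter").alias("promedio")
    ) \
    .orderBy(col("promedio").desc())
aggregationsByClimateDF.show()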

+--------------------+------------------+------------------+
|             climate|Numero_de_planetas|          promedio|
+--------------------+------------------+------------------+
|           temperate|                23|16481.052631578947|
|arid, temperate, ...|                 2|           16365.0|
| tropical, temperate|                 1|           15600.0|
|    temperate, artic|                 1|           14900.0|
|            polluted|                 1|           13490.0|
|temperate, arid, ...|                 1|           12900.0|
|         superheated|                 1|           12780.0|
|            tropical|                 1|           12765.0|
|     temperate, arid|                 1|           11370.0|
|temperate, arid, ...|                 1|           10600.0|
| temperate, tropical|                 1|           10200.0|
|              frigid|                 2|           10088.0|
|          hot, humid|                 1|            9100.0|
|               murky|  

## Exercises

   1. Sum the cost_in_credits of ALL the vehicles in the vehicles.csv file. Then sum cost_in_credits by vehicle_class.
   
   2. Count how many distinct manufacturer values there are.
   
   3. Show the mean and standard deviation of passengers (across all vehicles), and then only for the vehicle_class repulsorcraft.
   
   4. Compute the mean max_atmosphering_speed and the mean length PER vehicle_class.
   
   5. Sum ALL the cargo_capacity values of ALL the vehicles in the DF. Then sum ALL the crew values by vehicle_class. Do you see null values? Why? How can you resolve this?

Exercise 1

Exercise 2

Exercise 3

Exercise 4

Exercise 5