# Columns and Expressions

## Prerequisites

Install Spark and Java on the virtual machine (VM)

In [1]:
# install Java 8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download Spark 3.5.0
!wget -q https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

In [2]:
ls -l # verify that the .tgz is there

total 391020
drwxr-xr-x 1 root root      4096 Nov 17 14:29 sample_data/
-rw-r--r-- 1 root root 400395283 Sep  9  2023 spark-3.5.0-bin-hadoop3.tgz


In [3]:
# extract it
!tar xf spark-3.5.0-bin-hadoop3.tgz

In [4]:
!pip install -q findspark

In [5]:

!pip install py4j

# For maps
!pip install folium
!pip install plotly



Set up the environment

In [6]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"

Start the Spark session (SparkSession)

---

In [7]:
import findspark
findspark.init("spark-3.5.0-bin-hadoop3")  # SPARK_HOME

from pyspark.sql import SparkSession

# create the session
spark = SparkSession \
        .builder \
        .appName("DataFrames Basics") \
        .master("local[*]") \
        .getOrCreate()

spark.version

'3.5.0'

In [8]:
spark

In [9]:
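# Enable Arrow to speed up conversion to pandas
# (since Spark 3.0 the key is spark.sql.execution.arrow.pyspark.enabled;
# the older spark.sql.execution.arrow.enabled is deprecated)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")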

In [10]:
# Import SQL functions (the wildcard import shadows Python built-ins such as sum, min, max and round)
from pyspark.sql.functions import *

Download and upload the datasets (cars_dates.json, deptmanagers.csv, movies.json)

In [11]:
from google.colab import files
uploaded = files.upload()

Saving vehicles.csv to vehicles.csv
Saving roles.csv to roles.csv
Saving planets.csv to planets.csv
Saving paychecks.csv to paychecks.csv
Saving leads.csv to leads.csv
Saving guitars.json to guitars.json
Saving guitarPlayers.json to guitarPlayers.json
Saving games.csv to games.csv
Saving developers.csv to developers.csv
Saving deptmanagers.csv to deptmanagers.csv
Saving cars_dates.json to cars_dates.json
Saving books.csv to books.csv
Saving bank.csv to bank.csv
Saving bands.json to bands.json
Saving titles.csv to titles.csv
Saving taxi_zones.csv to taxi_zones.csv
Saving taxi_data.csv to taxi_data.csv
Saving stocks.csv to stocks.csv
Saving salaries.csv to salaries.csv
Saving quijote.txt to quijote.txt
Saving population.json to population.json
Saving numbers.csv to numbers.csv
Saving movies.json to movies.json
Saving more_cars.json to more_cars.json
Saving frankenstein.txt to frankenstein.txt
Saving employees.csv to employees.csv
Saving characters.csv to characters.csv
Saving cars.json t

Read the JSON file

In [12]:
cochesDF = spark.read \
    .option("inferSchema", True) \
    .json("cars_dates.json")

## Examples

Select a column

In [13]:
cochesDF.select(col("Cylinders")).show(3, False)

+---------+
|Cylinders|
+---------+
|8        |
|8        |
|8        |
+---------+
only showing top 3 rows



There are several ways to refer to a column

In [14]:
# Several ways to reference columns in select
cochesDF.select(
    cochesDF.Name,
    col("Miles_per_Gallon"),
    "Displacement"
).show(3)

+--------------------+----------------+------------+
|                Name|Miles_per_Gallon|Displacement|
+--------------------+----------------+------------+
|chevrolet chevell...|              18|         307|
|   buick skylark 320|              15|         350|
|  plymouth satellite|              18|         318|
+--------------------+----------------+------------+
only showing top 3 rows



Expressions. We can use SQL syntax inside a select, as expressions, to work with and transform columns.

In [15]:
cochesenKgDF = cochesDF.select(
    col("Name"),
    col("Horsepower"),
    (col("Weight_in_lbs") / 2.2).cast("int").alias("Weight_in_kg_2"),  # cast the result to an int
    expr("Weight_in_lbs / 1000").cast("string").alias("Weight_in_T")  # cast the result to a string (the value is thousands of pounds, not metric tonnes)
)
cochesenKgDF.printSchema()
cochesenKgDF.show(3)

root
 |-- Name: string (nullable = true)
 |-- Horsepower: long (nullable = true)
 |-- Weight_in_kg_2: integer (nullable = true)
 |-- Weight_in_T: string (nullable = true)

+--------------------+----------+--------------+-----------+
|                Name|Horsepower|Weight_in_kg_2|Weight_in_T|
+--------------------+----------+--------------+-----------+
|chevrolet chevell...|       130|          1592|      3.504|
|   buick skylark 320|       165|          1678|      3.693|
|  plymouth satellite|       150|          1561|      3.436|
+--------------------+----------+--------------+-----------+
only showing top 3 rows
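
The `Weight_in_kg_2` values above (1592 for a 3504 lb car) suggest that `cast("int")` drops the fractional part rather than rounding to the nearest integer. The sketch below reproduces that arithmetic in plain Python, without Spark. `lbs_to_kg_truncated` is a hypothetical helper, and the weights are the three rows shown in the output; it assumes Python's `int()` behaves like the cast seen here.

```python
# Plain-Python sketch (no Spark): reproduce the Weight_in_kg_2 column by hand.
# Assumption: Python's int() mirrors the truncation seen in the cast("int") output.

LBS_PER_KG = 2.2  # same approximate factor the notebook uses

def lbs_to_kg_truncated(weight_in_lbs):
    """Convert pounds to kilograms and drop the fractional part."""
    return int(weight_in_lbs / LBS_PER_KG)

weights_in_lbs = [3504, 3693, 3436]  # first three rows shown in the output above

truncated = [lbs_to_kg_truncated(w) for w in weights_in_lbs]
rounded = [round(w / LBS_PER_KG) for w in weights_in_lbs]

print(truncated)  # matches the Weight_in_kg_2 column above
print(rounded)    # what rounding to the nearest integer would give instead
```

If nearest-integer values are wanted in Spark, applying `round` from `pyspark.sql.functions` before the cast is one option.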



In [16]:
# working with expressions
cochesConSelectExpresionDF = cochesDF.selectExpr(
    "Name",
    "Weight_in_lbs",
    "Weight_in_lbs / 2.2"
  )
cochesConSelectExpresionDF.show(3)

+--------------------+-------------+---------------------+
|                Name|Weight_in_lbs|(Weight_in_lbs / 2.2)|
+--------------------+-------------+---------------------+
|chevrolet chevell...|         3504|          1592.727273|
|   buick skylark 320|         3693|          1678.636364|
|  plymouth satellite|         3436|          1561.818182|
+--------------------+-------------+---------------------+
only showing top 3 rows
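
Note that `selectExpr` prints 1592.727273 here, while the Python expression `col("Weight_in_lbs") / 2.2` in the `withColumn` example of this notebook prints 1592.7272727272725. This is consistent with Spark SQL reading the literal `2.2` inside an expression string as a fixed-point decimal, whereas a Python float becomes a double. The sketch below imitates both behaviours with the standard `decimal` module. It is a plain-Python analogy under that assumption, and the six-digit scale is taken from the output above rather than derived from Spark's typing rules.

```python
# Plain-Python analogy (no Spark): decimal vs. floating-point division.
from decimal import Decimal, ROUND_HALF_UP

weights_in_lbs = [3504, 3693, 3436]  # first three rows shown above

# Fixed-point division, rounded to six decimal places as in the selectExpr output.
SIX_PLACES = Decimal("0.000001")
as_decimal = [
    (Decimal(w) / Decimal("2.2")).quantize(SIX_PLACES, rounding=ROUND_HALF_UP)
    for w in weights_in_lbs
]

# Binary floating-point division, as in col("Weight_in_lbs") / 2.2.
as_double = [w / 2.2 for w in weights_in_lbs]

for w, dec, dbl in zip(weights_in_lbs, as_decimal, as_double):
    print(w, dec, dbl)
```

Inside an expression string, writing the literal with a `D` suffix (`2.2D`) should mark it as a double, which is one way to get floating-point arithmetic from `selectExpr`.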



### Processing DataFrames

Add a column

In [17]:
cochesNuevaColumnaDF = cochesDF.withColumn("Weight_in_kg_3", col("Weight_in_lbs") / 2.2)
cochesNuevaColumnaDF.show(3)

+------------+---------+------------+----------+----------------+--------------------+------+-------------+----------+------------------+
|Acceleration|Cylinders|Displacement|Horsepower|Miles_per_Gallon|                Name|Origin|Weight_in_lbs|      Year|    Weight_in_kg_3|
+------------+---------+------------+----------+----------------+--------------------+------+-------------+----------+------------------+
|        12.0|        8|         307|       130|              18|chevrolet chevell...|   USA|         3504|01-01-1970|1592.7272727272725|
|        11.5|        8|         350|       165|              15|   buick skylark 320|   USA|         3693|01-01-1970|1678.6363636363635|
|        11.0|        8|         318|       150|              18|  plymouth satellite|   USA|         3436|01-01-1970|1561.8181818181818|
+------------+---------+------------+----------+----------------+--------------------+------+-------------+----------+------------------+
only showing top 3 rows



Rename a column

In [18]:
cochesColumnaRenombradaDF = cochesDF.withColumnRenamed("Weight_in_lbs", "Weight in pounds")
cochesColumnaRenombradaDF.show(3)

+------------+---------+------------+----------+----------------+--------------------+------+----------------+----------+
|Acceleration|Cylinders|Displacement|Horsepower|Miles_per_Gallon|                Name|Origin|Weight in pounds|      Year|
+------------+---------+------------+----------+----------------+--------------------+------+----------------+----------+
|        12.0|        8|         307|       130|              18|chevrolet chevell...|   USA|            3504|01-01-1970|
|        11.5|        8|         350|       165|              15|   buick skylark 320|   USA|            3693|01-01-1970|
|        11.0|        8|         318|       150|              18|  plymouth satellite|   USA|            3436|01-01-1970|
+------------+---------+------------+----------+----------------+--------------------+------+----------------+----------+
only showing top 3 rows



In [19]:
# the column name contains special characters (spaces), so we have to wrap it in backticks (``)
cochesColumnaRenombradaDF.selectExpr("`Weight in pounds`").show(3)

+----------------+
|Weight in pounds|
+----------------+
|            3504|
|            3693|
|            3436|
+----------------+
only showing top 3 rows
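
When expression strings are built programmatically, the backtick quoting can be wrapped in a small helper. The sketch below is a hypothetical utility, not part of PySpark. It assumes that a backtick inside a name is escaped by doubling it, which is the convention Spark SQL is understood to follow for quoted identifiers.

```python
# Hypothetical helper (not part of PySpark): wrap a column name in backticks
# so it can be embedded in an expression string such as selectExpr(...).

def quote_column(name):
    """Return the name wrapped in backticks, doubling any backtick it contains."""
    return "`" + name.replace("`", "``") + "`"

quoted = quote_column("Weight in pounds")
expression = quoted + " / 2.2"

print(quoted)      # the quoted identifier
print(expression)  # an expression string that could be passed to selectExpr
```

With the DataFrame from the previous cell, this would be used as `cochesColumnaRenombradaDF.selectExpr(quote_column("Weight in pounds"))`.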



Drop a column

In [20]:
cochesColumnaRenombradaDF.printSchema()

root
 |-- Acceleration: double (nullable = true)
 |-- Cylinders: long (nullable = true)
 |-- Displacement: long (nullable = true)
 |-- Horsepower: long (nullable = true)
 |-- Miles_per_Gallon: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Weight in pounds: long (nullable = true)
 |-- Year: string (nullable = true)



In [21]:
eliminarColsDF = cochesColumnaRenombradaDF.drop("Horsepower", "Displacement")
eliminarColsDF.printSchema()


root
 |-- Acceleration: double (nullable = true)
 |-- Cylinders: long (nullable = true)
 |-- Miles_per_Gallon: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Weight in pounds: long (nullable = true)
 |-- Year: string (nullable = true)



Filter a DataFrame

In [22]:
filtroCochesDF = cochesDF.filter(col("Origin") != "USA")
filtroCochesDF2 = cochesDF.where(col("Origin") != "USA")
filtroCochesDF.show(3)
print(f"{filtroCochesDF.count()} == {filtroCochesDF2.count()}")

+------------+---------+------------+----------+----------------+--------------------+------+-------------+----------+
|Acceleration|Cylinders|Displacement|Horsepower|Miles_per_Gallon|                Name|Origin|Weight_in_lbs|      Year|
+------------+---------+------------+----------+----------------+--------------------+------+-------------+----------+
|        17.5|        4|         133|       115|            NULL|citroen ds-21 pallas|Europe|         3090|01-01-1970|
|        15.0|        4|         113|        95|              24|toyota corona mar...| Japan|         2372|01-01-1970|
|        14.5|        4|          97|        88|              27|        datsun pl510| Japan|         2130|01-01-1970|
+------------+---------+------------+----------+----------------+--------------------+------+-------------+----------+
only showing top 3 rows

93 == 93
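
The first row above has a NULL in `Miles_per_Gallon`, so nulls do occur in this dataset. Under SQL semantics a comparison against NULL is neither true nor false, and `filter` keeps only the rows where the condition is true. A row with a NULL in the compared column therefore drops out of both `== "USA"` and `!= "USA"`. The sketch below models this in plain Python, with `None` standing in for NULL. `sql_not_equal`, `sql_filter` and the sample rows are hypothetical; they are neither Spark API nor data from the file.

```python
# Plain-Python model (no Spark) of SQL three-valued comparison; None stands in for NULL.

def sql_not_equal(left, right):
    """Return True/False like !=, or None when either side is NULL."""
    if left is None or right is None:
        return None
    return left != right

def sql_filter(rows, predicate):
    """Keep only rows whose predicate is exactly True (NULL counts as not matching)."""
    return [row for row in rows if predicate(row) is True]

# Hypothetical rows for illustration only.
rows = [
    {"Name": "car a", "Origin": "USA"},
    {"Name": "car b", "Origin": "Japan"},
    {"Name": "car c", "Origin": None},
]

not_usa = sql_filter(rows, lambda r: sql_not_equal(r["Origin"], "USA"))
print([r["Name"] for r in not_usa])  # "car c" is excluded along with "car a"
```

To keep such rows in Spark, the condition can be combined with `col("Origin").isNull()` using `|`.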


In [23]:
# Filter with expression strings
cochesUSADF = cochesDF.filter("Year='01-01-1970'")
cochesUSADF.show(3)

+------------+---------+------------+----------+----------------+--------------------+------+-------------+----------+
|Acceleration|Cylinders|Displacement|Horsepower|Miles_per_Gallon|                Name|Origin|Weight_in_lbs|      Year|
+------------+---------+------------+----------+----------------+--------------------+------+-------------+----------+
|        12.0|        8|         307|       130|              18|chevrolet chevell...|   USA|         3504|01-01-1970|
|        11.5|        8|         350|       165|              15|   buick skylark 320|   USA|         3693|01-01-1970|
|        11.0|        8|         318|       150|              18|  plymouth satellite|   USA|         3436|01-01-1970|
+------------+---------+------------+----------+----------------+--------------------+------+-------------+----------+
only showing top 3 rows



Chaining filters

In [24]:
cochesPotenciaDF = cochesDF.filter(col("Origin") == "USA").filter(col("Horsepower") > 150)
cochesPotenciaDF2 = cochesDF.filter((col("Origin") == "USA") & (col("Horsepower") > 150))
cochesPotenciaDF3 = cochesDF.filter("Origin = 'USA' and Horsepower > 150")
cochesPotenciaDF.show(3)
cochesPotenciaDF2.show(3)
cochesPotenciaDF3.show(3)

+------------+---------+------------+----------+----------------+-----------------+------+-------------+----------+
|Acceleration|Cylinders|Displacement|Horsepower|Miles_per_Gallon|             Name|Origin|Weight_in_lbs|      Year|
+------------+---------+------------+----------+----------------+-----------------+------+-------------+----------+
|        11.5|        8|         350|       165|              15|buick skylark 320|   USA|         3693|01-01-1970|
|        10.0|        8|         429|       198|              15| ford galaxie 500|   USA|         4341|01-01-1970|
|         9.0|        8|         454|       220|              14| chevrolet impala|   USA|         4354|01-01-1970|
+------------+---------+------------+----------+----------------+-----------------+------+-------------+----------+
only showing top 3 rows

+------------+---------+------------+----------+----------------+-----------------+------+-------------+----------+
|Acceleration|Cylinders|Displacement|Horsepower

## Exercises

1. Read the deptmanagers.csv file and select any two columns you like.

2. Add a new column to the movies.json dataset with the total generated by each movie = US_Gross + Worldwide_Gross + DVD sales. Do you get any nulls? How can you fix that?

3. Select all movies with an IMDB rating (IMDB_Rating) greater than or equal to 7; try to do it in every way you know.

Exercise 1

Exercise 2

Exercise 3