<a href="https://colab.research.google.com/github/gabrielfernandorey/EDVAI/blob/main/PySpark/PySpark_00.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PySpark

### Installing and loading PySpark

In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.2.tar.gz (281.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 281.4/281.4 MB 3.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.7/199.7 kB 10.0 MB/s eta 0:00:00
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.2-py2.py3-none-any.whl size=281824028 sha256=b8d4703ceb7ed225ec35440f2c4ed5c212f8eea2e02ad3b2deef1abaacf7d70c
  Stored in directory: /root/.cache/pip/wheels/6c/e3/9b/0525ce8a69478916513509d43693511463c6468db0de237c86
Successfully built pyspark
Installing collected packages: py4j, pyspa

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('test_pyspark').getOrCreate()

### Required libraries

In [4]:
from pyspark.sql.types import StringType, BooleanType, FloatType, IntegerType, DoubleType, DateType
import pyspark.sql.functions as F
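# Caution: this import shadows Python's built-in sum, round, max, and min for the rest of the notebook
from pyspark.sql.functions import sum, col, desc, asc, count, countDistinct, round, max, min, avg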
from pyspark.sql.functions import to_timestamp,date_format
from pyspark.sql.window import Window

from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol, HasInputCols, HasOutputCols, Param, Params, TypeConverters
from pyspark import keyword_only
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml import Model
from pyspark.ml import Estimator

from datetime import datetime
import numpy as np

### Importing the data

In [5]:
!wget https://data-engineer-edvai.s3.amazonaws.com/yellow_tripdata_2021-01.parquet

--2023-04-13 15:30:53--  https://data-engineer-edvai.s3.amazonaws.com/yellow_tripdata_2021-01.parquet
Resolving data-engineer-edvai.s3.amazonaws.com (data-engineer-edvai.s3.amazonaws.com)... 52.216.213.41, 54.231.138.97, 52.217.48.36, ...
Connecting to data-engineer-edvai.s3.amazonaws.com (data-engineer-edvai.s3.amazonaws.com)|52.216.213.41|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21686067 (21M) [application/x-www-form-urlencoded]
Saving to: ‘yellow_tripdata_2021-01.parquet’


2023-04-13 15:30:54 (31.7 MB/s) - ‘yellow_tripdata_2021-01.parquet’ saved [21686067/21686067]



In [6]:
# Parquet files carry their own schema, so the CSV-style "header" option is not needed
df = spark.read.parquet("*.parquet")

In [8]:
df.printSchema()

root
 |-- VendorID: long (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: double (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: double (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- DOLocationID: long (nullable = true)
 |-- payment_type: long (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)



In [9]:
df.show(10)

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|airport_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|       1| 2021-01-01 00:30:10|  2021-01-01 00:36:12|            1.0|          2.1|       1.0|                 N|         142|          43|           2|        8.0|  3.0|    0.5|       0.

### Show the following results
##### a. VendorId (integer)
##### b. tpep_pickup_datetime (date)
##### c. total_amount (double)
##### d. Only rows where the total (total_amount) is less than 10 dollars

In [10]:
# Create a temporary view so the DataFrame can be queried with SQL
df.createOrReplaceTempView("yellow_tripdata")

In [11]:
df_31 = spark.sql("select VendorId, tpep_pickup_datetime, total_amount from yellow_tripdata where total_amount < 10")

In [12]:
df_31.show(10)

+--------+--------------------+------------+
|VendorId|tpep_pickup_datetime|total_amount|
+--------+--------------------+------------+
|       1| 2021-01-01 00:51:20|         4.3|
|       2| 2021-01-01 00:42:11|         8.3|
|       2| 2021-01-01 00:04:21|        9.96|
|       2| 2021-01-01 00:43:41|         9.3|
|       2| 2021-01-01 00:36:08|         5.8|
|       1| 2021-01-01 00:03:13|         0.0|
|       1| 2021-01-01 00:30:32|         9.3|
|       2| 2021-01-01 00:16:19|         9.8|
|       2| 2021-01-01 00:57:26|         8.8|
|       2| 2021-01-01 00:33:33|        9.96|
+--------+--------------------+------------+
only showing top 10 rows



### Show the 10 days with the highest revenue (tpep_pickup_datetime, total_amount)

In [13]:
df_32 = spark.sql("select cast(tpep_pickup_datetime as date) as tpep_pickup_date , sum(total_amount) as TOTAL from yellow_tripdata group by tpep_pickup_date order by TOTAL Desc")

In [14]:
df_32.show(10)

+----------------+-----------------+
|tpep_pickup_date|            TOTAL|
+----------------+-----------------+
|      2021-01-28|959114.4900002397|
|      2021-01-22|933129.1800002002|
|      2021-01-29|929731.0600002115|
|      2021-01-21| 929307.270000204|
|      2021-01-14|925183.8200001806|
|      2021-01-15|924665.2000001943|
|      2021-01-27|894418.6400001668|
|      2021-01-19|889278.4600001582|
|      2021-01-07|886008.2300001475|
|      2021-01-13|873117.0800001248|
+----------------+-----------------+
only showing top 10 rows



These values are close to the actual result but do not match it exactly.
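
The long tails of digits in the TOTAL column (for example, 959114.4900002397) are typical of adding many `double` values, because most decimal fractions have no exact binary representation. The following is a minimal sketch in plain Python, independent of Spark and using made-up fares, that shows the effect and two common ways around it. It uses explicit loops rather than `sum`, since the notebook imports `sum` from `pyspark.sql.functions`.

```python
import math
from decimal import Decimal

# Ten hypothetical fares of 0.10 each; the exact total is 1.00.
fares = [0.1] * 10

# Plain left-to-right float addition accumulates rounding error.
naive_total = 0.0
for fare in fares:
    naive_total += fare

# math.fsum tracks the lost low-order bits and returns the correctly rounded sum.
exact_total = math.fsum(fares)

# Decimal arithmetic represents 0.10 exactly, so the total is exact too.
decimal_total = Decimal("0")
for _ in fares:
    decimal_total += Decimal("0.10")

print(naive_total)    # 0.9999999999999999
print(exact_total)    # 1.0
print(decimal_total)  # 1.00
```

In Spark SQL, an analogous choice would be to sum `cast(total_amount as decimal(12, 2))` instead of the raw `double` column; the precision and scale here are assumptions about the data, not something the dataset documents.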

### Show the 10 trips that brought in the least money among trips longer than 10 miles (trip_distance, total_amount)

In [15]:
df_33 = spark.sql("select trip_distance, total_amount from yellow_tripdata where trip_distance > 10 order by total_amount")

In [16]:
df_33.show(10)

+-------------+------------+
|trip_distance|total_amount|
+-------------+------------+
|        12.68|      -252.3|
|        34.35|     -176.42|
|        14.75|      -152.8|
|        33.96|     -127.92|
|         29.1|      -119.3|
|        26.94|      -111.3|
|        20.08|      -107.8|
|        19.55|      -102.8|
|        19.16|      -90.55|
|        25.83|      -88.54|
+-------------+------------+
only showing top 10 rows



### Show the trips with more than two passengers that were paid by credit card (show only the trip_distance and tpep_pickup_datetime columns)

In [17]:
df_34 = spark.sql("select trip_distance, cast(tpep_pickup_datetime as date) as tpep_pickup_date from yellow_tripdata where passenger_count >= 2 and payment_type==1")

In [None]:
df_34.show()

+-------------+----------------+
|trip_distance|tpep_pickup_date|
+-------------+----------------+
|          2.7|      2021-01-01|
|         6.11|      2021-01-01|
|         1.21|      2021-01-01|
|          1.7|      2021-01-01|
|         1.16|      2021-01-01|
|         3.15|      2021-01-01|
|         0.64|      2021-01-01|
|        10.74|      2021-01-01|
|         2.01|      2021-01-01|
|         3.45|      2021-01-01|
|         2.85|      2021-01-01|
|         1.68|      2021-01-01|
|         0.77|      2021-01-01|
|         0.52|      2021-01-01|
|          0.4|      2021-01-01|
|         1.05|      2021-01-01|
|         5.85|      2021-01-01|
|          3.7|      2021-01-01|
|        16.54|      2021-01-01|
|          4.0|      2021-01-01|
+-------------+----------------+
only showing top 20 rows



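This does not match the expected results. One likely reason is that the query filters on `passenger_count >= 2`, which also includes trips with exactly two passengers, whereas the task asks for more than two (`passenger_count > 2`).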
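
To see how much the comparison operator matters, both predicates can be run against a tiny table. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, not the notebook's own engine. The four rows are made up for illustration; only the table and column names come from the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table yellow_tripdata ("
    "trip_distance real, tpep_pickup_datetime text, "
    "passenger_count real, payment_type integer)"
)

# Hypothetical rows; payment_type 1 is credit card and 2 is cash.
rows = [
    (2.7, "2021-01-01 00:15:00", 2.0, 1),   # exactly two passengers, credit card
    (6.11, "2021-01-01 00:20:00", 3.0, 1),  # three passengers, credit card
    (1.21, "2021-01-01 00:25:00", 4.0, 2),  # four passengers, cash
    (1.7, "2021-01-01 00:30:00", 1.0, 1),   # one passenger, credit card
]
conn.executemany("insert into yellow_tripdata values (?, ?, ?, ?)", rows)

query = (
    "select trip_distance, tpep_pickup_datetime from yellow_tripdata "
    "where passenger_count {op} 2 and payment_type = 1 "
    "order by tpep_pickup_datetime"
)
at_least_two = conn.execute(query.format(op=">=")).fetchall()
more_than_two = conn.execute(query.format(op=">")).fetchall()
conn.close()

print(at_least_two)   # [(2.7, '2021-01-01 00:15:00'), (6.11, '2021-01-01 00:20:00')]
print(more_than_two)  # [(6.11, '2021-01-01 00:20:00')]
```

The two-passenger trip appears only under `>=`, so on real data the two filters can return very different row counts.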

### Show the 7 trips with the largest tips on distances greater than 10 miles (show the fields tpep_pickup_datetime, trip_distance, passenger_count, tip_amount)

In [18]:
# Strictly greater than 10 miles, as the task states
df_35 = spark.sql("select trip_distance, cast(tpep_pickup_datetime as date) as tpep_pickup_date, passenger_count, tip_amount from yellow_tripdata where trip_distance > 10 order by tip_amount desc")

In [19]:
df_35.show(7)

+-------------+----------------+---------------+----------+
|trip_distance|tpep_pickup_date|passenger_count|tip_amount|
+-------------+----------------+---------------+----------+
|        427.7|      2021-01-20|            1.0|   1140.44|
|        267.7|      2021-01-03|            1.0|     369.4|
|        326.1|      2021-01-12|            0.0|    192.61|
|        260.5|      2021-01-19|            1.0|    149.03|
|         11.1|      2021-01-31|            0.0|     100.0|
|        14.86|      2021-01-01|            2.0|      99.0|
|         13.0|      2021-01-18|            0.0|      90.0|
+-------------+----------------+---------------+----------+
only showing top 7 rows



### For each RatecodeID value, show the total amount and the average amount. Exclude trips where RatecodeID is ‘Group Ride’

In [20]:
# RatecodeID 6 corresponds to 'Group ride' in the TLC data dictionary
df_36 = spark.sql("select RatecodeID, sum(total_amount) as TOTAL, avg(total_amount) as AVERAGE from yellow_tripdata where RatecodeID != 6 group by RatecodeID")

In [21]:
df_36.show(7)

+----------+--------------------+------------------+
|RatecodeID|               TOTAL|           AVERAGE|
+----------+--------------------+------------------+
|       1.0|1.9496468430212937E7|15.606626116946773|
|       4.0|   90039.93000000082| 74.90842762063296|
|       3.0|   67363.26000000043| 78.69539719626219|
|       2.0|   973635.4700000732| 65.52937609369182|
|      99.0|  1748.0699999999997| 48.55749999999999|
|       5.0|  255075.08999999086|48.939963545662096|
+----------+--------------------+------------------+
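
One subtlety of the `RatecodeID != 6` filter is that it also drops rows whose RatecodeID is NULL, because in SQL a comparison with NULL is neither true nor false. The schema marks the column as nullable, so this may affect the totals if such rows exist. The sketch below illustrates the behavior with Python's built-in `sqlite3` module as a stand-in for Spark SQL; the four rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table yellow_tripdata (RatecodeID real, total_amount real)")

# Hypothetical rows: two standard-rate trips, one group ride, one with no rate code.
conn.executemany(
    "insert into yellow_tripdata values (?, ?)",
    [(1.0, 10.0), (1.0, 20.0), (6.0, 5.0), (None, 7.5)],
)

# Same shape as the notebook query: the NULL row silently disappears.
strict = conn.execute(
    "select RatecodeID, sum(total_amount), avg(total_amount) "
    "from yellow_tripdata where RatecodeID != 6 "
    "group by RatecodeID order by RatecodeID"
).fetchall()

# Keeping NULL rate codes requires an explicit 'is null' branch.
keep_nulls = conn.execute(
    "select RatecodeID, sum(total_amount), avg(total_amount) "
    "from yellow_tripdata where RatecodeID != 6 or RatecodeID is null "
    "group by RatecodeID order by RatecodeID"
).fetchall()
conn.close()

print(strict)      # [(1.0, 30.0, 15.0)]
print(keep_nulls)  # [(None, 7.5, 7.5), (1.0, 30.0, 15.0)]
```

Whether trips without a rate code should count toward the result depends on how the task is read; the point is only that the choice has to be made explicitly.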

