In [1]:
import os

# On a personal computer, if the JAVA_HOME variable is not defined, we must set it
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"

# In the teaching labs, the following setting is required
# os.environ["JAVA_HOME"] = "/usr/"

os.environ["JAVA_HOME"]

'/home/maes/.sdkman/candidates/java/current'

In [34]:
import pyspark

from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = (SparkSession.builder
    .master("local[*]")
    .config("spark.driver.cores", 1)
    .appName("Bike rental")
    .getOrCreate())
spark

# Bike Dataset

In this example we are going to explore a dataset generated by a bike rental system deployed in the San Francisco area. The dataset consists of two data sources:
- information about the renting stations
- information about trips done using this service

Next, we will explore both of them and derive some new information.

## Stations

### Initial exploration

The first dataset contains information about the renting stations. We can use Spark's CSV reader to take a look at the data. Because the dataset starts with a line containing the field names, we must set the `header` option so that the reader takes the header into account.

In [3]:
stations = spark.read.option("header", "true").csv("data/bike-data/201508_station_data.csv")

In [4]:
stations.show(2)

+----------+--------------------+---------+-----------+---------+--------+------------+
|station_id|                name|      lat|       long|dockcount|landmark|installation|
+----------+--------------------+---------+-----------+---------+--------+------------+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|San Jose|    8/6/2013|
|         3|San Jose Civic Ce...|37.330698|-121.888979|       15|San Jose|    8/5/2013|
+----------+--------------------+---------+-----------+---------+--------+------------+
only showing top 2 rows



In [5]:
stations.printSchema()

root
 |-- station_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- lat: string (nullable = true)
 |-- long: string (nullable = true)
 |-- dockcount: string (nullable = true)
 |-- landmark: string (nullable = true)
 |-- installation: string (nullable = true)



As we can see in the output of the previous statements, no types were inferred: every column was read as a string (the CSV reader only infers types when the `inferSchema` option is enabled). Moreover, the installation date uses a non-standard format.

To get the most out of our process we will provide a custom schema to coerce the data types to the proper ones. In addition, we pass the `dateFormat` option to the Spark DataFrameReader to parse the installation date values.

In [6]:
stationSchema = StructType([StructField("station_id", ByteType(), False), 
                           StructField("name", StringType(), False),
                           StructField("lat", DoubleType(), False),
                           StructField("long", DoubleType(), False),
                           StructField("dockcount", IntegerType(), False),
                           StructField("landmark", StringType(), False),
                           StructField("installation", DateType(), False)])

**Note**: In Spark 3.x, the date format `MM/dd/yyyy` no longer works for dates such as `8/6/2013`, which have no leading zeros; we need to use `M/d/yyyy` instead.

In [7]:
stations = spark.read.option("header", "true").option("dateFormat", "M/d/yyyy").csv("data/bike-data/201508_station_data.csv", schema=stationSchema)

In [8]:
stations.printSchema()

root
 |-- station_id: byte (nullable = true)
 |-- name: string (nullable = true)
 |-- lat: double (nullable = true)
 |-- long: double (nullable = true)
 |-- dockcount: integer (nullable = true)
 |-- landmark: string (nullable = true)
 |-- installation: date (nullable = true)



After providing the proper schema we are able to load the dataset without formatting issues.

In [9]:
stations.show(truncate=False)

+----------+---------------------------------+---------+-----------+---------+------------+------------+
|station_id|name                             |lat      |long       |dockcount|landmark    |installation|
+----------+---------------------------------+---------+-----------+---------+------------+------------+
|2         |San Jose Diridon Caltrain Station|37.329732|-121.901782|27       |San Jose    |2013-08-06  |
|3         |San Jose Civic Center            |37.330698|-121.888979|15       |San Jose    |2013-08-05  |
|4         |Santa Clara at Almaden           |37.333988|-121.894902|11       |San Jose    |2013-08-06  |
|5         |Adobe on Almaden                 |37.331415|-121.8932  |19       |San Jose    |2013-08-05  |
|6         |San Pedro Square                 |37.336721|-121.894074|15       |San Jose    |2013-08-07  |
|7         |Paseo de San Antonio             |37.333798|-121.886943|15       |San Jose    |2013-08-07  |
|8         |San Salvador at 1st              |37.330165

In [10]:
stations.count()

70
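To make the effect of the custom schema concrete, here is a plain-Python sketch (no Spark involved, meant to be run outside the notebook) that coerces the two raw rows displayed by `stations.show(2)` into typed values. The `parse_station` helper is hypothetical, and `datetime.strptime` merely stands in for Spark's `dateFormat` handling; it illustrates the idea of a typed schema, not how Spark implements it.

```python
import csv
import io
from datetime import datetime

# Header and first two rows as displayed by stations.show(2).
# The station names are truncated in that output and are kept truncated here.
RAW = """station_id,name,lat,long,dockcount,landmark,installation
2,San Jose Diridon ...,37.329732,-121.901782,27,San Jose,8/6/2013
3,San Jose Civic Ce...,37.330698,-121.888979,15,San Jose,8/5/2013
"""

def parse_station(row):
    """Hypothetical helper: coerce one row of strings to typed values."""
    return {
        "station_id": int(row["station_id"]),
        "name": row["name"],
        "lat": float(row["lat"]),
        "long": float(row["long"]),
        "dockcount": int(row["dockcount"]),
        "landmark": row["landmark"],
        # When parsing, %m and %d accept values without a leading zero
        "installation": datetime.strptime(row["installation"], "%m/%d/%Y").date(),
    }

typed_stations = [parse_station(r) for r in csv.DictReader(io.StringIO(RAW))]
for s in typed_stations:
    print(s["station_id"], s["dockcount"], s["installation"].isoformat())
```

Once the values are typed, numeric and date operations (sums, minimum and maximum dates) behave as expected instead of comparing strings.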

### Exercise
The station dataset contains information about the location and characteristics of the stations installed for the rental service.

Let's build a short summary that computes, for each landmark, the date when the first station was deployed, the date of the last update and the total number of docks available in the area so far.

In [11]:
landmarks = (
    stations.
    groupBy("landmark").
    agg(
        min("installation").alias("service_start"),
        max("installation").alias("last_update"),
        sum("dockcount").alias("total_docks")
    )
)

In [12]:
(landmarks.
 orderBy(col("service_start")).
 show()
)

+-------------+-------------+-----------+-----------+
|     landmark|service_start|last_update|total_docks|
+-------------+-------------+-----------+-----------+
|     San Jose|   2013-08-05| 2014-04-09|        264|
| Redwood City|   2013-08-12| 2014-02-20|        115|
|    Palo Alto|   2013-08-14| 2013-08-15|         75|
|Mountain View|   2013-08-15| 2013-12-31|        117|
|San Francisco|   2013-08-19| 2014-01-22|        665|
+-------------+-------------+-----------+-----------+
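For readers who want to see what the grouped aggregation computes, the following plain-Python sketch applies the same three aggregates (minimum date, maximum date, sum of docks) to the first six stations listed earlier. It is only an illustration on a subset of the data, so its numbers differ from the full San Jose row above; the variable names are hypothetical.

```python
from collections import defaultdict
from datetime import date

# (landmark, installation, dockcount) for the first six stations listed earlier.
rows = [
    ("San Jose", date(2013, 8, 6), 27),
    ("San Jose", date(2013, 8, 5), 15),
    ("San Jose", date(2013, 8, 6), 11),
    ("San Jose", date(2013, 8, 5), 19),
    ("San Jose", date(2013, 8, 7), 15),
    ("San Jose", date(2013, 8, 7), 15),
]

# groupBy("landmark"): collect the rows of each landmark together
groups = defaultdict(list)
for landmark, installation, dockcount in rows:
    groups[landmark].append((installation, dockcount))

# agg(...): reduce each group to one row with three aggregates
summary = {
    landmark: {
        "service_start": min(d for d, _ in items),
        "last_update": max(d for d, _ in items),
        "total_docks": sum(n for _, n in items),
    }
    for landmark, items in groups.items()
}
print(summary)
```

Run this in a separate Python session: inside the notebook, `from pyspark.sql.functions import *` replaces the built-in `min`, `max` and `sum` with Spark's column functions.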



## Trips

### Initial exploration

The second dataset contains information about registered trips using the rental service.

Again, we use the CSV reader to carry out the initial exploration.


In [13]:
trips = spark.read.option("header", "true").csv("data/bike-data/201508_trip_data.csv")

In [14]:
trips.show(2)

+-------+--------+---------------+--------------------+--------------+---------------+--------------------+------------+----+---------------+--------+
|trip_id|duration|     start_date|       start_station|start_terminal|       end_date|         end_station|end_terminal|bike|subscriber_type|zip_code|
+-------+--------+---------------+--------------------+--------------+---------------+--------------------+------------+----+---------------+--------+
| 913460|     765|8/31/2015 23:26|Harry Bridges Pla...|            50|8/31/2015 23:39|San Francisco Cal...|          70| 288|     Subscriber|    2139|
| 913459|    1036|8/31/2015 23:11|San Antonio Shopp...|            31|8/31/2015 23:28|Mountain View Cit...|          27|  35|     Subscriber|   95032|
+-------+--------+---------------+--------------------+--------------+---------------+--------------------+------------+----+---------------+--------+
only showing top 2 rows



In [15]:
trips.printSchema()

root
 |-- trip_id: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- start_date: string (nullable = true)
 |-- start_station: string (nullable = true)
 |-- start_terminal: string (nullable = true)
 |-- end_date: string (nullable = true)
 |-- end_station: string (nullable = true)
 |-- end_terminal: string (nullable = true)
 |-- bike: string (nullable = true)
 |-- subscriber_type: string (nullable = true)
 |-- zip_code: string (nullable = true)



As we can see from the previous output, the field types are not inferred and the timestamps do not use a standard format. To parse them properly we will define the schema manually and also provide the `timestampFormat` option to the DataFrameReader.

In [16]:
tripSchema = StructType([StructField("trip_id", IntegerType(), False), 
                         StructField("duration", IntegerType(), False),
                         StructField("start_date", TimestampType(), False),
                         StructField("start_station", StringType(), False),
                         StructField("start_terminal", ByteType(), False),
                         StructField("end_date", TimestampType(), False),
                         StructField("end_station", StringType(), False),
                         StructField("end_terminal", ByteType(), False),
                         StructField("bike", IntegerType(), False),
                         StructField("subscriber_type", StringType(), False),
                         StructField("zip_code", IntegerType(), False)])

In [17]:
trips = (spark.read.option("header", "true")
         .option("timestampFormat", "M/d/yyyy HH:mm")
         .csv("data/bike-data/201508_trip_data.csv", schema=tripSchema)
        )

In [18]:
trips.show(truncate = True)

+-------+--------+-------------------+--------------------+--------------+-------------------+--------------------+------------+----+---------------+--------+
|trip_id|duration|         start_date|       start_station|start_terminal|           end_date|         end_station|end_terminal|bike|subscriber_type|zip_code|
+-------+--------+-------------------+--------------------+--------------+-------------------+--------------------+------------+----+---------------+--------+
| 913460|     765|2015-08-31 23:26:00|Harry Bridges Pla...|            50|2015-08-31 23:39:00|San Francisco Cal...|          70| 288|     Subscriber|    2139|
| 913459|    1036|2015-08-31 23:11:00|San Antonio Shopp...|            31|2015-08-31 23:28:00|Mountain View Cit...|          27|  35|     Subscriber|   95032|
| 913455|     307|2015-08-31 23:13:00|      Post at Kearny|            47|2015-08-31 23:18:00|   2nd at South Park|          64| 468|     Subscriber|   94107|
| 913454|     409|2015-08-31 23:10:00|  San Jo

In [19]:
trips.count()

354152
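As a quick cross-check of the timestamp parsing, this plain-Python sketch parses the start and end timestamps of the two trips shown by `trips.show(2)` and compares the elapsed time with the recorded `duration` (in seconds). The format string is an approximate Python counterpart of the Spark pattern, not the same parser, and the variable names are hypothetical.

```python
from datetime import datetime

# Approximate Python counterpart of the Spark pattern "M/d/yyyy HH:mm"
FMT = "%m/%d/%Y %H:%M"

# (trip_id, duration in seconds, start_date, end_date) from trips.show(2)
raw_trips = [
    (913460, 765, "8/31/2015 23:26", "8/31/2015 23:39"),
    (913459, 1036, "8/31/2015 23:11", "8/31/2015 23:28"),
]

gaps = []
for trip_id, duration, start, end in raw_trips:
    elapsed = (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()
    # Difference between the timestamp-based and the recorded duration
    gaps.append(elapsed - duration)
    print(trip_id, "elapsed:", elapsed, "s, recorded:", duration, "s")
```

The timestamps only have minute resolution, so the two values are expected to agree to within about a minute rather than exactly.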

### Exercise

Compute the total number of trips, the total trip duration (hours) and the average trip duration (minutes) for each bike, and display a ranking of the top 5 most used bikes with the corresponding stats.

In [20]:
bike_info = (
    trips.
    groupBy("bike").
    agg(
        count("*").alias("total"),
        (round(sum("duration")/3600,2)).alias("total_duration(hours)"),
        (round(avg("duration")/60, 2)).alias("avg_duration(mins)")
    )
)

In [21]:
(bike_info.
 orderBy(bike_info.total.desc()).
 show(5))



+----+-----+---------------------+------------------+
|bike|total|total_duration(hours)|avg_duration(mins)|
+----+-----+---------------------+------------------+
| 878| 1121|               279.67|             14.97|
| 392| 1102|               284.41|             15.49|
| 489| 1101|               238.35|             12.99|
| 463| 1085|               279.98|             15.48|
| 532| 1074|               237.33|             13.26|
+----+-----+---------------------+------------------+
only showing top 5 rows
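The three columns are related: since `duration` is in seconds, the total in hours is `sum/3600` and the average in minutes is `avg/60`, so the total hours should equal `total * avg_duration(mins) / 60` up to rounding. The short sketch below checks this on the five rows of the ranking; it is plain Python on the printed values, not part of the Spark pipeline.

```python
# (bike, total trips, total_duration(hours), avg_duration(mins)) from the ranking above
ranking = [
    (878, 1121, 279.67, 14.97),
    (392, 1102, 284.41, 15.49),
    (489, 1101, 238.35, 12.99),
    (463, 1085, 279.98, 15.48),
    (532, 1074, 237.33, 13.26),
]

differences = {}
for bike, total, total_hours, avg_mins in ranking:
    # trips * average minutes per trip = total minutes; divide by 60 for hours
    implied_hours = total * avg_mins / 60
    differences[bike] = implied_hours - total_hours
    print(bike, f"{implied_hours:.2f}", total_hours)
```

The small differences come from both displayed columns being rounded to two decimals before the check.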



                                                                                

Display a summary (see the `describe()` function) of the aggregated dataset containing information about how the bikes are used.

In [22]:
bike_info.drop("bike").describe().show()

+-------+-----------------+---------------------+------------------+
|summary|            total|total_duration(hours)|avg_duration(mins)|
+-------+-----------------+---------------------+------------------+
|  count|              668|                  668|               668|
|   mean|530.1676646706587|    154.0480239520958| 22.65806886227546|
| stddev|398.3555876917163|   210.80905043525703| 32.21889509608674|
|    min|                4|                 0.54|              4.63|
|    max|             1121|               4920.8|            646.06|
+-------+-----------------+---------------------+------------------+



If we want to know what individual trips look like, we can describe the initial dataset before it is aggregated.

In [23]:
trips.select((col('duration')/60).alias('duration(mins)')).describe().show()

+-------+------------------+
|summary|    duration(mins)|
+-------+------------------+
|  count|            354152|
|   mean|17.433877685287612|
| stddev| 500.2822692821593|
|    min|               1.0|
|    max|          287840.0|
+-------+------------------+



### Exercise

Compute the minimum distance traveled on a bike trip. We will take the minimum trip distance to be the distance between the start and end stations.

We will use the haversine distance, implemented below, to compute the distance between two geographical points given by their (long, lat) coordinates.

In [24]:
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r
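Before applying the function to the whole dataset, it can be sanity-checked on a few known cases. The sketch below repeats a compact copy of the function so that it runs standalone, then evaluates it on identical points, on one degree of latitude, and on the first two San Jose stations from the stations table. The variable names are hypothetical.

```python
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers (same formula as above)."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371

# Identical points are at distance zero.
same_point = haversine(-121.901782, 37.329732, -121.901782, 37.329732)

# One degree of latitude along a meridian is 6371 * pi / 180, about 111.19 km.
one_degree = haversine(0.0, 0.0, 0.0, 1.0)

# San Jose Diridon Caltrain Station (id 2) to San Jose Civic Center (id 3),
# using the coordinates from the stations table.
diridon_to_civic = haversine(-121.901782, 37.329732, -121.888979, 37.330698)

print(same_point, one_degree, diridon_to_civic)
```

Note the argument order: longitude first, then latitude, for each point. The two San Jose stations come out a little over one kilometer apart.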

In [25]:
start_coords = stations.selectExpr("station_id as start_terminal", "lat as start_lat", "long as start_long")

In [26]:
end_coords = stations.selectExpr("station_id as end_terminal", "lat as end_lat", "long as end_long")

First, you will need to use the `join()` function to add the coordinates to each trip.

In [27]:
trip_coords = (
    trips.
    select('trip_id', 'bike', 'start_terminal', 'end_terminal', 'duration').
    join(start_coords, 'start_terminal', 'inner').
    join(end_coords, 'end_terminal') # default join type is inner
)
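Conceptually, each join looks up the station coordinates by terminal id and keeps only the trips for which a match exists. The following plain-Python sketch mimics that with a dictionary lookup, using the first two trips of the dataset and their station coordinates; the third trip and all variable names are hypothetical and only serve to show that an inner join drops unmatched rows.

```python
# station_id -> (lat, long) for the stations used by the first two trips
station_coords = {
    50: (37.795392, -122.394203),
    70: (37.776617, -122.39526),
    31: (37.400443, -122.108338),
    27: (37.389218, -122.081896),
}

sample_trips = [
    {"trip_id": 913460, "start_terminal": 50, "end_terminal": 70},
    {"trip_id": 913459, "start_terminal": 31, "end_terminal": 27},
    # Hypothetical trip whose end terminal has no matching station
    {"trip_id": -1, "start_terminal": 50, "end_terminal": 999},
]

joined = []
for trip in sample_trips:
    start = station_coords.get(trip["start_terminal"])
    end = station_coords.get(trip["end_terminal"])
    if start is None or end is None:
        continue  # inner join semantics: rows without a match are dropped
    joined.append({
        **trip,
        "start_lat": start[0], "start_long": start[1],
        "end_lat": end[0], "end_long": end[1],
    })

for row in joined:
    print(row)
```

With the actual DataFrames, comparing `trips.count()` with `trip_coords.count()` would show whether any trips were dropped for lack of a matching station.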

In [28]:
trip_coords.show(10)

+------------+--------------+-------+----+--------+---------+-----------+---------+-----------+
|end_terminal|start_terminal|trip_id|bike|duration|start_lat| start_long|  end_lat|   end_long|
+------------+--------------+-------+----+--------+---------+-----------+---------+-----------+
|          70|            50| 913460| 288|     765|37.795392|-122.394203|37.776617| -122.39526|
|          27|            31| 913459|  35|    1036|37.400443|-122.108338|37.389218|-122.081896|
|          64|            47| 913455| 468|     307|37.788975|-122.403452|37.782259|-122.392738|
|           8|            10| 913454|  68|     409|37.337391|-121.886995|37.330165|-121.885831|
|          60|            51| 913453| 487|     789|37.791464|-122.391034| 37.80477|-122.403234|
|          70|            68| 913452| 538|     293|37.784878|-122.401014|37.776617| -122.39526|
|          60|            51| 913451| 363|     896|37.791464|-122.391034| 37.80477|-122.403234|
|          74|            60| 913450| 47

The distance function is one we defined ourselves, so we have to wrap it with `udf(...)` (User Defined Function) to turn it into a Spark-compatible function. We can then pass the column names as arguments, and our function will receive the corresponding values for each row.

**Note**: Python UDFs are usually among the most computationally expensive operations, because Spark cannot optimise them and every row has to be passed to the Python function. More info: https://sparkbyexamples.com/pyspark/pyspark-udf-user-defined-function/

In [29]:
haversine_udf = udf(haversine, DoubleType())

After this, we can calculate the distance of each trip.

In [30]:
bike_trips = (
    trip_coords.
    select(
        'trip_id',
        'bike',
        'duration',
        haversine_udf('start_long', 'start_lat', 'end_long', 'end_lat').alias('distance')
    )
).cache()

In [33]:
bike_trips.orderBy(col("distance")).where(bike_trips.distance > 0).show()

+-------+----+--------+-------------------+
|trip_id|bike|duration|           distance|
+-------+----+--------+-------------------+
| 449868|  97|     593|0.01855324879480863|
| 757653| 418|    8682|0.01855324879480863|
| 648319| 392|     115|0.01855324879480863|
| 736456| 436|    4285|0.01855324879480863|
| 842718| 514|   17738|0.01855324879480863|
| 772900| 349|    1172|0.01855324879480863|
| 653113| 339|   14373|0.01855324879480863|
| 777616| 314|    1780|0.01855324879480863|
| 911100| 380|    2973|0.01855324879480863|
| 776085| 359|   33831|0.01855324879480863|
| 652903| 490|    2314|0.01855324879480863|
| 773425| 291|    1646|0.01855324879480863|
| 842984| 395|   26321|0.01855324879480863|
| 773303| 389|     862|0.01855324879480863|
| 652058| 490|   12081|0.01855324879480863|
| 770223| 579|    3909|0.01855324879480863|
| 488917| 629|    1400|0.01855324879480863|
| 770222| 326|    3895|0.01855324879480863|
| 648692| 424|    1040|0.01855324879480863|
| 763841| 524|     129|0.0185532