<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Spark Lab

*Authors: Christoph Rahmede (LDN)*

**In this lab, we will use Spark to dig into the Bay Area Bike Share data.**

Our goal is to calculate the average number of trips per hour for trips that start at the Caltrain station.

In [1]:
import pyspark as ps
from pyspark.sql import SQLContext

In [2]:
sc = ps.SparkContext('local[4]')
sqlContext = SQLContext(sc)
spark = ps.sql.SparkSession(sc)

## Load the data

In [3]:
trips = spark.read.csv(
    path="./data/201508_trip_data.csv",
    header=True,
    # Poorly formed rows in CSV are dropped rather than erroring entire operation
    mode="DROPMALFORMED",
    # Not always perfect but works well in most cases as of 2.1+
    inferSchema=True
)

In [4]:
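# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement
trips.createOrReplaceTempView("tripsSql_1")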

In [5]:
trips.show(10)

+-------+--------+---------------+--------------------+--------------+---------------+--------------------+------------+------+---------------+--------+
|Trip ID|Duration|     Start Date|       Start Station|Start Terminal|       End Date|         End Station|End Terminal|Bike #|Subscriber Type|Zip Code|
+-------+--------+---------------+--------------------+--------------+---------------+--------------------+------------+------+---------------+--------+
| 913460|     765|8/31/2015 23:26|Harry Bridges Pla...|            50|8/31/2015 23:39|San Francisco Cal...|          70|   288|     Subscriber|    2139|
| 913459|    1036|8/31/2015 23:11|San Antonio Shopp...|            31|8/31/2015 23:28|Mountain View Cit...|          27|    35|     Subscriber|   95032|
| 913455|     307|8/31/2015 23:13|      Post at Kearny|            47|8/31/2015 23:18|   2nd at South Park|          64|   468|     Subscriber|   94107|
| 913454|     409|8/31/2015 23:10|  San Jose City Hall|            10|8/31/2015 23

## Timestamps

The following functions cast a string column to a timestamp and extract parts of it.

In [6]:
from pyspark.sql.functions import date_format, to_date, to_timestamp, year, month, dayofweek, hour

In [7]:
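# Use the pre-3.0 datetime parser, which accepts single-digit months and hours with the pattern used below
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")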

In [8]:
df = trips.withColumn('time', to_timestamp('Start Date', format='MM/dd/yyyy HH:mm'))
df = df.withColumn('hour', hour('time'))
df = df.withColumn('day', to_date('time'))
df = df.withColumn('month', month('time'))
df = df.withColumn('weekday', dayofweek('time'))

In [9]:
df.select('Start Date', 'time', 'hour', 'day', 'month', 'weekday').show(10)

+---------------+-------------------+----+----------+-----+-------+
|     Start Date|               time|hour|       day|month|weekday|
+---------------+-------------------+----+----------+-----+-------+
|8/31/2015 23:26|2015-08-31 23:26:00|  23|2015-08-31|    8|      2|
|8/31/2015 23:11|2015-08-31 23:11:00|  23|2015-08-31|    8|      2|
|8/31/2015 23:13|2015-08-31 23:13:00|  23|2015-08-31|    8|      2|
|8/31/2015 23:10|2015-08-31 23:10:00|  23|2015-08-31|    8|      2|
|8/31/2015 23:09|2015-08-31 23:09:00|  23|2015-08-31|    8|      2|
|8/31/2015 23:07|2015-08-31 23:07:00|  23|2015-08-31|    8|      2|
|8/31/2015 23:07|2015-08-31 23:07:00|  23|2015-08-31|    8|      2|
|8/31/2015 22:16|2015-08-31 22:16:00|  22|2015-08-31|    8|      2|
|8/31/2015 22:12|2015-08-31 22:12:00|  22|2015-08-31|    8|      2|
|8/31/2015 21:57|2015-08-31 21:57:00|  21|2015-08-31|    8|      2|
+---------------+-------------------+----+----------+-----+-------+
only showing top 10 rows



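Our datetime parsing is not perfect: for some rows the timestamp could not be parsed, so the derived columns are null. We can check for these rows by filtering on null values. Ideally we would fix the parsing itself; here we simply drop the affected rows.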

In [10]:
from pyspark.sql.functions import isnan, when, count, col

df.where(col('day').isNull()).toPandas()

| | Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code | time | hour | day | month | weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 702587 | 699 | 3/29/2015 1:43 | San Jose Diridon Caltrain Station | 2 | 3/29/2015 1:55 | Japantown | 9 | 79 | Subscriber | 95112 | NaT | | | | |
| 1 | 702585 | 226 | 3/29/2015 1:30 | Grant Avenue at Columbus Avenue | 73 | 3/29/2015 1:34 | Broadway St at Battery St | 82 | 563 | Subscriber | 94114 | NaT | | | | |


In [11]:
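# Note: without arguments, na.drop() removes every row that has a null in any column,
# not only the rows whose timestamp failed to parse
df = df.na.drop()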

## Print the schema

## Create a temporary SQL table from the dataframe

## In the following exercises, where possible use both dataframe methods and SQL queries

Hint: In Spark SQL, as in Hive SQL, you can refer to column names that contain spaces by wrapping them in backticks:

```SQL
SELECT `column name` FROM table
```

## Determine the number of observations

## Calculate mean, standard deviation, minimum and maximum of the duration column

## For how many different days do you have observations?

## What are the first and last observed days?

## Obtain the counts of rides per hour

## Obtain the counts per hour averaged over all observed dates

## Obtain the average duration of trips per hour departing from terminal 70 sorted by the hour