<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Spark Lab


**In this lab, we will use Spark to dig into the Bay Area Bike Share data.**

Our goal is to calculate the average number of trips per hour for trips that start at the Caltrain station.

In [1]:
import pyspark as ps
from pyspark.sql import SQLContext

In [2]:
sc = ps.SparkContext('local[4]')
sqlContext = SQLContext(sc)
spark = ps.sql.SparkSession(sc)

## Load the data

In [3]:
trips = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('./data/201508_trip_data.csv')
trips.createOrReplaceTempView("tripsSql_1")

In [4]:
trips.show(10)

+-------+--------+---------------+--------------------+--------------+---------------+--------------------+------------+------+---------------+--------+
|Trip ID|Duration|     Start Date|       Start Station|Start Terminal|       End Date|         End Station|End Terminal|Bike #|Subscriber Type|Zip Code|
+-------+--------+---------------+--------------------+--------------+---------------+--------------------+------------+------+---------------+--------+
| 913460|     765|8/31/2015 23:26|Harry Bridges Pla...|            50|8/31/2015 23:39|San Francisco Cal...|          70|   288|     Subscriber|    2139|
| 913459|    1036|8/31/2015 23:11|San Antonio Shopp...|            31|8/31/2015 23:28|Mountain View Cit...|          27|    35|     Subscriber|   95032|
| 913455|     307|8/31/2015 23:13|      Post at Kearny|            47|8/31/2015 23:18|   2nd at South Park|          64|   468|     Subscriber|   94107|
| 913454|     409|8/31/2015 23:10|  San Jose City Hall|            10|8/31/2015 23

## Timestamps

The following functions cast a string column to a timestamp and extract parts of it.

In [5]:
from pyspark.sql.functions import date_format, to_date, to_timestamp, year, month, dayofweek, hour

In [6]:
df = trips.withColumn('time', to_timestamp('Start Date', format='M/d/yyyy H:mm'))
df = df.withColumn('hour', hour('time'))
df = df.withColumn('day', to_date('time'))
df = df.withColumn('month', month('time'))
df = df.withColumn('weekday', dayofweek('time'))

In [7]:
df.select('Start Date', 'time', 'hour', 'day', 'month', 'weekday').show(10)

+---------------+-------------------+----+----------+-----+-------+
|     Start Date|               time|hour|       day|month|weekday|
+---------------+-------------------+----+----------+-----+-------+
|8/31/2015 23:26|2015-08-31 23:26:00|  23|2015-08-31|    8|      2|
|8/31/2015 23:11|2015-08-31 23:11:00|  23|2015-08-31|    8|      2|
|8/31/2015 23:13|2015-08-31 23:13:00|  23|2015-08-31|    8|      2|
|8/31/2015 23:10|2015-08-31 23:10:00|  23|2015-08-31|    8|      2|
|8/31/2015 23:09|2015-08-31 23:09:00|  23|2015-08-31|    8|      2|
|8/31/2015 23:07|2015-08-31 23:07:00|  23|2015-08-31|    8|      2|
|8/31/2015 23:07|2015-08-31 23:07:00|  23|2015-08-31|    8|      2|
|8/31/2015 22:16|2015-08-31 22:16:00|  22|2015-08-31|    8|      2|
|8/31/2015 22:12|2015-08-31 22:12:00|  22|2015-08-31|    8|      2|
|8/31/2015 21:57|2015-08-31 21:57:00|  21|2015-08-31|    8|      2|
+---------------+-------------------+----+----------+-----+-------+
only showing top 10 rows



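The timestamp parsing is not perfect: some rows come back as null, so we check for missing values. The two rows returned below both fall on 3/29/2015 between 1:00 and 2:00, which may point to a daylight-saving gap in the session time zone rather than a malformed string. Ideally we would fix the parsing; here we simply drop the rows with missing values.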

In [8]:
from pyspark.sql.functions import isnan, when, count, col, countDistinct

df.where(col('day').isNull()).toPandas()

Unnamed: 0,Trip ID,Duration,Start Date,Start Station,Start Terminal,End Date,End Station,End Terminal,Bike #,Subscriber Type,Zip Code,time,hour,day,month,weekday
0,702587,699,3/29/2015 1:43,San Jose Diridon Caltrain Station,2,3/29/2015 1:55,Japantown,9,79,Subscriber,95112,,,,,
1,702585,226,3/29/2015 1:30,Grant Avenue at Columbus Avenue,73,3/29/2015 1:34,Broadway St at Battery St,82,563,Subscriber,94114,,,,,
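
Both unparsed rows start on 3/29/2015 between 1:00 and 2:00. One possible explanation, offered here as an assumption rather than something the lab states, is that the Spark session ran in a time zone whose clocks jumped forward during that hour (for example `Europe/London`), so those wall-clock times do not exist there. The sketch below checks this idea with the Python standard library only; the helper name `exists_in_zone` and the choice of zones are illustrative.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+, needs the system time zone database


def exists_in_zone(naive: datetime, zone_name: str) -> bool:
    """Return True if the naive wall-clock time exists in the given zone.

    A time inside a daylight-saving gap does not survive a round trip
    through UTC: it comes back shifted by the size of the gap.
    """
    tz = ZoneInfo(zone_name)
    local = naive.replace(tzinfo=tz)
    round_trip = local.astimezone(timezone.utc).astimezone(tz)
    return round_trip.replace(tzinfo=None) == naive


start = datetime(2015, 3, 29, 1, 43)  # 'Start Date' of one unparsed trip
print(exists_in_zone(start, "Europe/London"))        # assumed session zone
print(exists_in_zone(start, "America/Los_Angeles"))  # zone of the Bay Area
```

If this is indeed the cause, setting the session time zone to UTC before parsing (the `spark.sql.session.timeZone` setting) may avoid the nulls without dropping any rows.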


In [9]:
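# Caution: na.drop() removes rows with a null in ANY column; use subset=['time'] to drop only unparsed timestamps
df = df.na.drop()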

## Print the schema

In [10]:
df.printSchema()

root
 |-- Trip ID: integer (nullable = true)
 |-- Duration: integer (nullable = true)
 |-- Start Date: string (nullable = true)
 |-- Start Station: string (nullable = true)
 |-- Start Terminal: integer (nullable = true)
 |-- End Date: string (nullable = true)
 |-- End Station: string (nullable = true)
 |-- End Terminal: integer (nullable = true)
 |-- Bike #: integer (nullable = true)
 |-- Subscriber Type: string (nullable = true)
 |-- Zip Code: string (nullable = true)
 |-- time: timestamp (nullable = true)
 |-- hour: integer (nullable = true)
 |-- day: date (nullable = true)
 |-- month: integer (nullable = true)
 |-- weekday: integer (nullable = true)



## Create a temporary SQL table from the dataframe

In [11]:
df.createOrReplaceTempView("trips")

## In the following exercises, where possible use both dataframe methods and SQL queries

Hint: In Hive SQL, wrap column names that contain spaces in backticks:

```SQL
SELECT `column name` FROM table
```

## Determine the number of observations

In [12]:
df.count()

353872

In [13]:
sqlContext.sql("SELECT COUNT(*) FROM trips").show()

+--------+
|count(1)|
+--------+
|  353872|
+--------+



## Calculate mean, standard deviation, minimum and maximum of the duration column

In [14]:
df.describe().show(truncate=5)

+-------+-------+--------+----------+-------------+--------------+--------+-----------+------------+------+---------------+--------+-----+-----+-------+
|summary|Trip ID|Duration|Start Date|Start Station|Start Terminal|End Date|End Station|End Terminal|Bike #|Subscriber Type|Zip Code| hour|month|weekday|
+-------+-------+--------+----------+-------------+--------------+--------+-----------+------------+------+---------------+--------+-----+-----+-------+
|  count|  35...|   35...|     35...|        35...|         35...|   35...|      35...|       35...| 35...|          35...|   35...|35...|35...|  35...|
|   mean|  67...|   10...|      null|         null|         58...|    null|       null|       58...| 42...|           null|   26...|13...|6....|  3....|
|  st...|  13...|   30...|      null|         null|         16...|    null|       null|       16...| 15...|           null|   1....|4....|3....|  1....|
|    min|  43...|      60|     1/...|        2n...|             2|   1/...|      2

In [15]:
sqlContext.sql("SELECT AVG(Duration), STDDEV(Duration), MIN(Duration), MAX(Duration) FROM trips").show()

+-----------------+-------------------------------------+-------------+-------------+
|    avg(Duration)|stddev_samp(CAST(Duration AS DOUBLE))|min(Duration)|max(Duration)|
+-----------------+-------------------------------------+-------------+-------------+
|1043.044581091468|                   30027.518214567655|           60|     17270400|
+-----------------+-------------------------------------+-------------+-------------+
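
The output header shows that `STDDEV` resolves to `stddev_samp`, the sample standard deviation with an n - 1 denominator. As a small illustration on made-up durations (not values from the data set), the Python standard library distinguishes the sample and population estimators in the same way:

```python
import statistics

durations = [60, 300, 420, 765, 1036]  # made-up trip durations in seconds

sample_sd = statistics.stdev(durations)       # divides by n - 1, like stddev_samp
population_sd = statistics.pstdev(durations)  # divides by n, like stddev_pop

print(round(sample_sd, 2), round(population_sd, 2))
```

With several hundred thousand trips the two estimators are practically identical; the distinction only matters for small samples.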



## For how many different days do you have observations?

In [16]:
df.select('day').distinct().count()

365

In [17]:
sql = '''
SELECT COUNT(DISTINCT Day)
FROM trips
'''

sqlContext.sql(sql).show()

+-------------------+
|count(DISTINCT Day)|
+-------------------+
|                365|
+-------------------+



## What are the first and last observed days?

In [18]:
df.select('day').sort('day', ascending=True).show(1)

+----------+
|       day|
+----------+
|2014-09-01|
+----------+
only showing top 1 row



In [19]:
df.select('day').sort('day', ascending=False).show(1)

+----------+
|       day|
+----------+
|2015-08-31|
+----------+
only showing top 1 row



In [20]:
sql = '''
SELECT MIN(day), MAX(day)
FROM trips
'''

sqlContext.sql(sql).show()

+----------+----------+
|  min(day)|  max(day)|
+----------+----------+
|2014-09-01|2015-08-31|
+----------+----------+



## Obtain the counts of rides per hour

In [21]:
df.select('hour').groupby('hour').count().sort('hour').show(24)

+----+-----+
|hour|count|
+----+-----+
|   0| 1012|
|   1|  506|
|   2|  281|
|   3|  156|
|   4|  640|
|   5| 1848|
|   6| 8012|
|   7|24742|
|   8|49414|
|   9|34929|
|  10|15306|
|  11|14041|
|  12|15769|
|  13|14652|
|  14|12788|
|  15|16466|
|  16|31813|
|  17|45798|
|  18|30956|
|  19|14899|
|  20| 8245|
|  21| 5738|
|  22| 3658|
|  23| 2203|
+----+-----+



In [22]:
sql = '''
SELECT hour, COUNT(hour)
FROM trips
GROUP BY hour
ORDER BY hour ASC
'''

sqlContext.sql(sql).show(24)

+----+-----------+
|hour|count(hour)|
+----+-----------+
|   0|       1012|
|   1|        506|
|   2|        281|
|   3|        156|
|   4|        640|
|   5|       1848|
|   6|       8012|
|   7|      24742|
|   8|      49414|
|   9|      34929|
|  10|      15306|
|  11|      14041|
|  12|      15769|
|  13|      14652|
|  14|      12788|
|  15|      16466|
|  16|      31813|
|  17|      45798|
|  18|      30956|
|  19|      14899|
|  20|       8245|
|  21|       5738|
|  22|       3658|
|  23|       2203|
+----+-----------+



## Obtain the counts per hour averaged over all observed dates

In [23]:
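# Note: df.count() is the total number of trips, so 'avg' below is each hour's share of all trips;
# divide by df.select('day').distinct().count() instead to get the mean number of trips per day.
df2 = df.select('day','hour').groupby('hour').count()
df2.withColumn("avg", df2['count'] / df.count()).sort('hour').show(24)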

+----+-----+--------------------+
|hour|count|                 avg|
+----+-----+--------------------+
|   0| 1012|0.002859791110910...|
|   1|  506|0.001429895555455...|
|   2|  281|7.940724329701135E-4|
|   3|  156|  4.4083736492291E-4|
|   4|  640|0.001808563548401682|
|   5| 1848|0.005222227246009857|
|   6| 8012|0.022640954921553557|
|   7|24742| 0.06991793642899127|
|   8|49414| 0.13963806121987613|
|   9|34929| 0.09870518153456617|
|  10|15306|0.043252927612243974|
|  11|14041|0.039678188723606275|
|  12|15769|0.044561310304290815|
|  13|14652|0.041404801736221006|
|  14|12788| 0.03613736040150111|
|  15|16466| 0.04653094904372202|
|  16|31813| 0.08989973775828548|
|  17|45798| 0.12941967717140662|
|  18|30956| 0.08747795813175385|
|  19|14899| 0.04210279423068228|
|  20| 8245|0.023299385088393545|
|  21| 5738| 0.01621490256363883|
|  22| 3658|0.010337071031333363|
|  23| 2203|0.006225414839263915|
+----+-----+--------------------+
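
Dividing by `df.count()` gives each hour's share of all trips. To average over the observed dates, the divisor would instead be the number of distinct days, e.g. `df.select('day').distinct().count()`. The following plain-Python sketch on a made-up sample (the function name `hourly_stats` is hypothetical) shows how the two quantities differ:

```python
from collections import Counter


def hourly_stats(trips):
    """Summarise (day, hour) pairs per hour of day.

    Returns {hour: (count, share_of_all_trips, mean_trips_per_day)}.
    """
    trips = list(trips)
    counts = Counter(hour for _, hour in trips)
    n_trips = len(trips)
    n_days = len({day for day, _ in trips})  # distinct observed days
    return {
        hour: (count, count / n_trips, count / n_days)
        for hour, count in sorted(counts.items())
    }


# Made-up sample: four trips spread over two days.
sample = [
    ("2015-08-30", 8),
    ("2015-08-30", 8),
    ("2015-08-31", 8),
    ("2015-08-31", 17),
]
for hour, (count, share, per_day) in hourly_stats(sample).items():
    print(hour, count, share, per_day)
```

The shares sum to 1 across hours, whereas the per-day means sum to the average number of trips per day.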



In [24]:
sql = '''
SELECT t.hour, COUNT(t.day), COUNT(t.day) / a.total
FROM trips AS t
CROSS JOIN (SELECT COUNT(day) AS total FROM trips) AS a
GROUP BY t.hour, a.total
ORDER BY t.hour ASC
'''

sqlContext.sql(sql).show(24)
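
A per-day version of this query can be tried on a tiny in-memory table. The sketch below uses `sqlite3` from the standard library purely as a stand-in for Spark SQL, with made-up rows; the aliases `t`, `d`, `n_trips` and `n_days` are illustrative. The `* 1.0` is only there because SQLite divides integers as integers, which Spark SQL's `/` does not.

```python
import sqlite3

# In-memory stand-in for the 'trips' view, reduced to the two columns needed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (day TEXT, hour INTEGER)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("2015-08-30", 8), ("2015-08-30", 8), ("2015-08-31", 8), ("2015-08-31", 17)],
)

# The one-row subquery holds the number of distinct days; the cross join
# attaches it to every trip so it can serve as the divisor.
sql = '''
SELECT t.hour, COUNT(*) AS n_trips, COUNT(*) * 1.0 / d.n_days AS per_day
FROM trips AS t
CROSS JOIN (SELECT COUNT(DISTINCT day) AS n_days FROM trips) AS d
GROUP BY t.hour, d.n_days
ORDER BY t.hour ASC
'''
rows = conn.execute(sql).fetchall()
print(rows)
```

The subquery has to sit in a join (or be written as a scalar subquery) and its column has to appear in `GROUP BY`; placing it bare after `FROM trips` is not valid SQL.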

## Obtain the average duration of trips per hour departing from terminal 70 sorted by the hour

In [None]:
from pyspark.sql.functions import avg
# Keep trips that start at terminal 70, then average Duration within each hour
df2 = df.filter(col('Start Terminal') == 70).groupby('hour').agg(avg('Duration').alias('avg'))
df2.sort('hour').show(24)
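
The same question can also be phrased in SQL. Note that `Start Terminal` contains a space and therefore needs backticks. This is a sketch on made-up rows with `sqlite3` standing in for Spark SQL (SQLite also accepts backtick-quoted identifiers); with Spark, the query string would be passed to `sqlContext.sql` instead.

```python
import sqlite3

# In-memory stand-in for the 'trips' view with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (`Start Terminal` INTEGER, hour INTEGER, Duration INTEGER)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [(70, 8, 600), (70, 8, 900), (70, 17, 300), (50, 8, 10000)],
)

# Filter on the start terminal first, then average Duration within each hour.
sql = '''
SELECT hour, AVG(Duration) AS avg_duration
FROM trips
WHERE `Start Terminal` = 70
GROUP BY hour
ORDER BY hour ASC
'''
rows = conn.execute(sql).fetchall()
print(rows)
```

The trip from terminal 50 is excluded by the `WHERE` clause, so it does not distort the hour-8 average.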