# PySpark 101

> This notebook uses one chunk of the yellow taxi trip data from the NYC Taxi dataset.

## Overview
- Basic Operation
- Manipulate DataFrame
- Manipulate DataFrame using SQL Statement
- Simple Machine Learning with Logistic Regression

## Basic Operation

In [1]:
import findspark

# Point findspark at the local Spark installation; adjust this path for your machine.
findspark.init('/home/artthanan/spark-2.4.5-bin-hadoop2.7')

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

In [3]:
csv_file_path = 'nyc-taxi-data_export_data_yellow_tripdata_yellow_tripdata_000000000044.csv'

df = spark.read.csv(csv_file_path, inferSchema=True, header=True)

In [4]:
df.columns

['vendor_id',
 'pickup_datetime',
 'dropoff_datetime',
 'passenger_count',
 'trip_distance',
 'pickup_longtitude',
 'pickup_latitude',
 'pu_location_id',
 'rate_code_id',
 'store_and_fwd_flag',
 'dropoff_longtitude',
 'dropoff_latitude',
 'do_location_id',
 'payment_type',
 'fare_amount',
 'extra',
 'improvement_surcharge',
 'mta_tax',
 'tip_amount',
 'tolls_amount',
 'total_amount',
 'consgestion_surcharge']

In [5]:
df.printSchema()

root
 |-- vendor_id: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- pickup_longtitude: double (nullable = true)
 |-- pickup_latitude: string (nullable = true)
 |-- pu_location_id: integer (nullable = true)
 |-- rate_code_id: integer (nullable = true)
 |-- store_and_fwd_flag: boolean (nullable = true)
 |-- dropoff_longtitude: double (nullable = true)
 |-- dropoff_latitude: double (nullable = true)
 |-- do_location_id: integer (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: string (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- consgestion_surcharge: d

In [6]:
df.head(5)

[Row(vendor_id='1', pickup_datetime=datetime.datetime(2017, 9, 22, 12, 7, 56), dropoff_datetime=datetime.datetime(2017, 9, 22, 12, 36, 34), passenger_count=4, trip_distance=3.8, pickup_longtitude=None, pickup_latitude=None, pu_location_id=230, rate_code_id=1, store_and_fwd_flag=False, dropoff_longtitude=None, dropoff_latitude=None, do_location_id=148, payment_type=2, fare_amount=19.5, extra=None, improvement_surcharge=0.3, mta_tax=0.5, tip_amount=0.0, tolls_amount=0.0, total_amount=20.3, consgestion_surcharge=None),
 Row(vendor_id='1', pickup_datetime=datetime.datetime(2018, 10, 11, 21, 41, 3), dropoff_datetime=datetime.datetime(2018, 10, 11, 22, 2, 40), passenger_count=1, trip_distance=4.5, pickup_longtitude=None, pickup_latitude=None, pu_location_id=261, rate_code_id=1, store_and_fwd_flag=False, dropoff_longtitude=None, dropoff_latitude=None, do_location_id=50, payment_type=1, fare_amount=18.0, extra=None, improvement_surcharge=0.3, mta_tax=0.5, tip_amount=2.7, tolls_amount=0.0, tota

In [7]:
df.show()

+---------+-------------------+-------------------+---------------+-------------+------------------+------------------+--------------+------------+------------------+------------------+------------------+--------------+------------+-----------+-----+---------------------+-------+----------+------------+------------+---------------------+
|vendor_id|    pickup_datetime|   dropoff_datetime|passenger_count|trip_distance| pickup_longtitude|   pickup_latitude|pu_location_id|rate_code_id|store_and_fwd_flag|dropoff_longtitude|  dropoff_latitude|do_location_id|payment_type|fare_amount|extra|improvement_surcharge|mta_tax|tip_amount|tolls_amount|total_amount|consgestion_surcharge|
+---------+-------------------+-------------------+---------------+-------------+------------------+------------------+--------------+------------+------------------+------------------+------------------+--------------+------------+-----------+-----+---------------------+-------+----------+------------+------------+---

In [8]:
df.summary().show()

+-------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+-----------------+------------------+------------------+------------------+------------------+------------------+-----+---------------------+-------------------+------------------+------------------+------------------+---------------------+
|summary|         vendor_id|  passenger_count|    trip_distance| pickup_longtitude|  pickup_latitude|   pu_location_id|     rate_code_id|dropoff_longtitude|  dropoff_latitude|    do_location_id|      payment_type|       fare_amount|extra|improvement_surcharge|            mta_tax|        tip_amount|      tolls_amount|      total_amount|consgestion_surcharge|
+-------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+-----------------+------------------+------------------+------------------+------------------+------------------+-----+---------------------+-------------------+

## Manipulate DataFrame 

In [9]:
df.select(['vendor_id', 'pickup_datetime', 'dropoff_datetime']).show()

+---------+-------------------+-------------------+
|vendor_id|    pickup_datetime|   dropoff_datetime|
+---------+-------------------+-------------------+
|        1|2017-09-22 12:07:56|2017-09-22 12:36:34|
|        1|2018-10-11 21:41:03|2018-10-11 22:02:40|
|        2|2018-02-07 14:20:53|2018-02-07 14:52:58|
|        2|2018-01-01 10:47:27|2018-01-01 11:12:26|
|        2|2017-04-19 15:30:26|2017-04-19 16:07:42|
|        2|2018-10-18 11:13:23|2018-10-18 11:34:36|
|        1|2016-02-08 15:08:09|2016-02-08 15:30:45|
|        2|2017-05-23 23:12:21|2017-05-23 23:36:28|
|        2|2018-10-02 07:45:54|2018-10-02 07:58:23|
|     null|2019-07-02 12:16:00|2019-07-02 12:36:00|
|        1|2015-02-17 21:02:31|2015-02-17 21:27:01|
|        2|2017-01-21 02:23:48|2017-01-21 02:49:05|
|        2|2017-02-08 16:49:10|2017-02-08 17:26:40|
|        1|2017-05-22 02:28:22|2017-05-22 02:47:51|
|        2|2018-08-24 06:59:44|2018-08-24 07:29:56|
|        2|2015-12-11 01:04:57|2015-12-11 01:31:41|
|        2|2

In [10]:
# Import the functions module under an alias so that Spark's min/max
# do not shadow Python's built-in min/max.
from pyspark.sql import functions as F

temp_df = df.where(df['vendor_id'] == '1')

temp_df.select(
    F.mean('total_amount').alias('mean_total_amount'),
    F.min('total_amount').alias('min_total_amount'),
    F.max('total_amount').alias('max_total_amount'),
    F.corr('payment_type', 'total_amount').alias('correlation_btw_payment_type_and_total_amount')
).show()

# corr computes the Pearson correlation coefficient

+-----------------+----------------+----------------+---------------------------------------------+
|mean_total_amount|min_total_amount|max_total_amount|correlation_btw_payment_type_and_total_amount|
+-----------------+----------------+----------------+---------------------------------------------+
|16.33927471692185|             0.0|         2021.17|                         -0.13309223576965123|
+-----------------+----------------+----------------+---------------------------------------------+
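
The `corr` aggregate above returns the Pearson correlation coefficient. As a rough illustration of what that number measures, the sketch below computes it by hand in plain Python. The sample values are made up for the example and are not taken from the taxi data, and `pearson` is a hypothetical helper name.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variance numerators)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample: payment_type codes and total amounts
payment_type = [1, 1, 2, 2, 1, 2]
total_amount = [22.0, 28.3, 20.3, 34.8, 47.47, 9.8]

r = pearson(payment_type, total_amount)
print(round(r, 4))
```

Because `payment_type` is a categorical code rather than a quantity, a coefficient computed on it should be read with caution.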



In [11]:
df.groupby(df['vendor_id']).max().show()

+---------+--------------------+------------------+----------------------+-------------------+-----------------+-----------------------+---------------------+-------------------+-----------------+----------------+--------------------------+------------+---------------+-----------------+-----------------+--------------------------+
|vendor_id|max(passenger_count)|max(trip_distance)|max(pickup_longtitude)|max(pu_location_id)|max(rate_code_id)|max(dropoff_longtitude)|max(dropoff_latitude)|max(do_location_id)|max(payment_type)|max(fare_amount)|max(improvement_surcharge)|max(mta_tax)|max(tip_amount)|max(tolls_amount)|max(total_amount)|max(consgestion_surcharge)|
+---------+--------------------+------------------+----------------------+-------------------+-----------------+-----------------------+---------------------+-------------------+-----------------+----------------+--------------------------+------------+---------------+-----------------+-----------------+--------------------------+
|

## Manipulate DataFrame using SQL Statement

In [12]:
# createOrReplaceTempView lets this cell be re-run without a "view already exists" error.
df.createOrReplaceTempView("yellow_tripdata")

In [13]:
result = spark.sql('SELECT * FROM yellow_tripdata')

In [14]:
result.show()

+---------+-------------------+-------------------+---------------+-------------+------------------+------------------+--------------+------------+------------------+------------------+------------------+--------------+------------+-----------+-----+---------------------+-------+----------+------------+------------+---------------------+
|vendor_id|    pickup_datetime|   dropoff_datetime|passenger_count|trip_distance| pickup_longtitude|   pickup_latitude|pu_location_id|rate_code_id|store_and_fwd_flag|dropoff_longtitude|  dropoff_latitude|do_location_id|payment_type|fare_amount|extra|improvement_surcharge|mta_tax|tip_amount|tolls_amount|total_amount|consgestion_surcharge|
+---------+-------------------+-------------------+---------------+-------------+------------------+------------------+--------------+------------+------------------+------------------+------------------+--------------+------------+-----------+-----+---------------------+-------+----------+------------+------------+---

In [15]:
sql_statement = '''
    SELECT 
      vendor_id, 
      SUM(passenger_count) AS total_passenger,
      AVG(passenger_count) AS average_passenger,
      MIN(passenger_count) AS min_passenger,
      MAX(passenger_count) AS max_passenger
    FROM `yellow_tripdata`
    WHERE
      vendor_id IS NOT NULL
    GROUP BY 
      vendor_id
'''

result_2 = spark.sql(sql_statement)

In [16]:
result_2.show()

+---------+---------------+------------------+-------------+-------------+
|vendor_id|total_passenger| average_passenger|min_passenger|max_passenger|
+---------+---------------+------------------+-------------+-------------+
|      CMT|        2875916|1.2860624500831315|            0|            9|
|        3|             22|               1.0|            1|            1|
|      VTS|        4749170| 2.085879466044512|            0|          208|
|      DDS|          86120|1.3719929902819819|            0|            6|
|        1|        1436841|1.2559370790681086|            0|            8|
|        4|           3599|1.0395725014442518|            1|            4|
|        2|        2810183|1.9286295680151095|            0|            9|
+---------+---------------+------------------+-------------+-------------+
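
The shape of this aggregate query can be tried out without Spark. The sketch below runs an equivalent statement against an in-memory SQLite table using Python's standard `sqlite3` module. The rows are made up for illustration, and SQLite's dialect differs from Spark SQL in places, so treat it as a way to check the query logic rather than as a substitute for `spark.sql`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yellow_tripdata (vendor_id TEXT, passenger_count INTEGER)")

# Hypothetical rows, including one with a NULL vendor_id
rows = [("1", 4), ("1", 1), ("2", 1), ("2", 2), ("2", 3), (None, 1)]
conn.executemany("INSERT INTO yellow_tripdata VALUES (?, ?)", rows)

sql_statement = """
    SELECT
      vendor_id,
      SUM(passenger_count) AS total_passenger,
      AVG(passenger_count) AS average_passenger,
      MIN(passenger_count) AS min_passenger,
      MAX(passenger_count) AS max_passenger
    FROM yellow_tripdata
    WHERE vendor_id IS NOT NULL
    GROUP BY vendor_id
    ORDER BY vendor_id
"""

# The NULL vendor_id row is filtered out before grouping
result = conn.execute(sql_statement).fetchall()
for row in result:
    print(row)
conn.close()
```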



## Simple Machine Learning with Logistic Regression

In [17]:
list_of_selected_col = ['vendor_id', 'pickup_datetime', 'dropoff_datetime', 
                        'passenger_count', 'trip_distance', 'fare_amount', 
                        'tip_amount', 'tolls_amount', 'total_amount']

temp_df = df.select(list_of_selected_col).na.drop()

In [18]:
temp_df.show()

+---------+-------------------+-------------------+---------------+-------------+-----------+----------+------------+------------+
|vendor_id|    pickup_datetime|   dropoff_datetime|passenger_count|trip_distance|fare_amount|tip_amount|tolls_amount|total_amount|
+---------+-------------------+-------------------+---------------+-------------+-----------+----------+------------+------------+
|        1|2017-09-22 12:07:56|2017-09-22 12:36:34|              4|          3.8|       19.5|       0.0|         0.0|        20.3|
|        1|2018-10-11 21:41:03|2018-10-11 22:02:40|              1|          4.5|       18.0|       2.7|         0.0|        22.0|
|        2|2018-02-07 14:20:53|2018-02-07 14:52:58|              1|         5.92|       24.5|       3.0|         0.0|        28.3|
|        2|2018-01-01 10:47:27|2018-01-01 11:12:26|              1|        12.32|       34.0|       0.0|         0.0|        34.8|
|        2|2017-04-19 15:30:26|2017-04-19 16:07:42|              1|         9.97|  

### Find trip duration

In [19]:
from pyspark.sql.functions import unix_timestamp

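# In Spark datetime patterns, MM is the month and HH the 24-hour clock;
# lowercase mm means minutes and hh means the 12-hour clock.
datetime_format = "yyyy-MM-dd HH:mm:ss"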

time_diff = unix_timestamp('dropoff_datetime', format=datetime_format) - unix_timestamp('pickup_datetime', format=datetime_format)

final_df = temp_df.withColumn('trip_duration', time_diff)

In [20]:
final_df = final_df.drop('dropoff_datetime', 'pickup_datetime')
final_df.show()

+---------+---------------+-------------+-----------+----------+------------+------------+-------------+
|vendor_id|passenger_count|trip_distance|fare_amount|tip_amount|tolls_amount|total_amount|trip_duration|
+---------+---------------+-------------+-----------+----------+------------+------------+-------------+
|        1|              4|          3.8|       19.5|       0.0|         0.0|        20.3|         1718|
|        1|              1|          4.5|       18.0|       2.7|         0.0|        22.0|         1297|
|        2|              1|         5.92|       24.5|       3.0|         0.0|        28.3|         1925|
|        2|              1|        12.32|       34.0|       0.0|         0.0|        34.8|         1499|
|        2|              1|         9.97|       33.0|      7.91|        5.76|       47.47|         2236|
|        2|              2|         4.48|       18.0|      3.76|         0.0|       22.56|         1273|
|        1|              1|          6.2|       22.5|  
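
As a sanity check on `trip_duration`, the first two rows can be recomputed in plain Python from the pickup and dropoff timestamps shown earlier. This is only a cross-check sketch: it uses Python's `strptime` pattern syntax, which is different from Spark's datetime pattern syntax.

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"  # Python strptime pattern, not a Spark pattern

# (pickup, dropoff) pairs copied from the first two rows of the data
trips = [
    ("2017-09-22 12:07:56", "2017-09-22 12:36:34"),
    ("2018-10-11 21:41:03", "2018-10-11 22:02:40"),
]

durations = []
for pickup, dropoff in trips:
    delta = datetime.strptime(dropoff, fmt) - datetime.strptime(pickup, fmt)
    durations.append(int(delta.total_seconds()))

print(durations)  # [1718, 1297]
```

Both values match the `trip_duration` column, so the column is measured in seconds.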

In [21]:
final_df.select('vendor_id').distinct().show()

+---------+
|vendor_id|
+---------+
|      CMT|
|      VTS|
|      DDS|
|        1|
|        4|
|        2|
+---------+



In [22]:
final_df.summary().show()

+-------+------------------+------------------+-----------------+------------------+-----------------+------------------+------------------+-----------------+
|summary|         vendor_id|   passenger_count|    trip_distance|       fare_amount|       tip_amount|      tolls_amount|      total_amount|    trip_duration|
+-------+------------------+------------------+-----------------+------------------+-----------------+------------------+------------------+-----------------+
|  count|           6906443|           6906443|          6906443|           6906443|          6906443|           6906443|           6906443|          6906443|
|   mean|1.5662205509569063|1.6662836426797412|7.869465037212382|11.684649503369455|1.326532411836269|0.2529558906082214|14.253854121443874| 830.207139767895|
| stddev|0.5045072214619185| 1.308712968351337|6525.001220841516|10.204495781296464|2.230806249519298|1.3705993702462098|12.352342427778117|4078.789072656279|
|    min|                 1|                 0

In [23]:
del temp_df
del df

## Prepare feature vector

In [24]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

In [25]:
vendor_id_indexer = StringIndexer(inputCol='vendor_id', outputCol='vendor_id_index')
vendor_id_encoder = OneHotEncoder(inputCol='vendor_id_index', outputCol='vendor_id_vector')
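
To make the two stages above more concrete, here is a plain-Python sketch of the idea behind them, assuming the default settings: `StringIndexer` assigns index 0 to the most frequent label, and `OneHotEncoder` drops the last category so that it is encoded as all zeros. The sample values and the helper `one_hot` are hypothetical, and Spark itself produces sparse vectors rather than Python lists.

```python
from collections import Counter

# Hypothetical vendor_id values; not taken from the taxi data
vendor_ids = ["2", "1", "2", "CMT", "2", "1"]

# StringIndexer-style mapping: the most frequent label gets index 0
counts = Counter(vendor_ids)
labels = [label for label, _ in counts.most_common()]
index_of = {label: i for i, label in enumerate(labels)}

def one_hot(index, num_categories, drop_last=True):
    """Dense one-hot vector; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if i == index else 0.0 for i in range(size)]

encoded = [one_hot(index_of[v], len(labels)) for v in vendor_ids]

print(index_of)
for vendor, vector in zip(vendor_ids, encoded):
    print(vendor, vector)
```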

In [26]:
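# Note: 'vendor_id_vector' from the encoder above is not listed in inputCols,
# so vendor_id does not reach the model as a feature.
assembler = VectorAssembler(
    inputCols=['trip_distance', 'fare_amount', 'tip_amount',
               'tolls_amount', 'total_amount', 'trip_duration'],
    outputCol='features')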

In [27]:
feature_vector = assembler.transform(final_df)

In [28]:
feature_vector.show()

+---------+---------------+-------------+-----------+----------+------------+------------+-------------+--------------------+
|vendor_id|passenger_count|trip_distance|fare_amount|tip_amount|tolls_amount|total_amount|trip_duration|            features|
+---------+---------------+-------------+-----------+----------+------------+------------+-------------+--------------------+
|        1|              4|          3.8|       19.5|       0.0|         0.0|        20.3|         1718|[3.8,19.5,0.0,0.0...|
|        1|              1|          4.5|       18.0|       2.7|         0.0|        22.0|         1297|[4.5,18.0,2.7,0.0...|
|        2|              1|         5.92|       24.5|       3.0|         0.0|        28.3|         1925|[5.92,24.5,3.0,0....|
|        2|              1|        12.32|       34.0|       0.0|         0.0|        34.8|         1499|[12.32,34.0,0.0,0...|
|        2|              1|         9.97|       33.0|      7.91|        5.76|       47.47|         2236|[9.97,33.0,7.9

## Create Pipeline

In [29]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

In [30]:
log_reg_nyc_taxi = LogisticRegression(featuresCol='features', labelCol='passenger_count')

In [31]:
pipeline = Pipeline(stages=[vendor_id_indexer, vendor_id_encoder, assembler, log_reg_nyc_taxi])

In [32]:
train_data, test_data = final_df.randomSplit([0.7, 0.3])

In [33]:
model = pipeline.fit(train_data)

In [34]:
pred_results = model.transform(test_data)

In [36]:
pred_results.select(['passenger_count', 'prediction']).limit(20).show()

+---------------+----------+
|passenger_count|prediction|
+---------------+----------+
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
+---------------+----------+
only showing top 20 rows



## Sample Evaluation

In [37]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [38]:
# passenger_count takes more than two values, so use the multiclass evaluator
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction',
                                              labelCol='passenger_count',
                                              metricName='accuracy')

In [39]:
acc = evaluator.evaluate(pred_results)

In [40]:
print("Accuracy: {:.4f}".format(acc))

Accuracy: 0.6968
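
An accuracy figure is easier to judge next to a baseline. The 20 predictions shown earlier are all `1.0`, which hints (but does not prove) that the model may be predicting mostly one class. The sketch below shows, in plain Python with made-up labels, how to compute the accuracy of a predictor that always outputs the most frequent label. The helper `accuracy` and the sample labels are hypothetical.

```python
from collections import Counter

def accuracy(labels, predictions):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Hypothetical passenger_count labels; not taken from the taxi data
labels = [1, 1, 2, 1, 0, 1, 3, 1, 2, 1]

# Baseline: always predict the most frequent label
majority_label, _ = Counter(labels).most_common(1)[0]
baseline_predictions = [majority_label] * len(labels)

baseline_acc = accuracy(labels, baseline_predictions)
print("Baseline accuracy: {:.4f}".format(baseline_acc))
```

If the share of the most frequent `passenger_count` in the test set is close to the reported accuracy, the model adds little beyond this baseline.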
