# Let's check the data on HDFS 

### File Sizes 

In [16]:
!hdfs dfs -du -s -h /taxi/raw/

237.2 G  237.2 G  /taxi/raw


### Looks like we have all years (2009 to 2022)

In [1]:
!hdfs dfs -ls /taxi/raw/

Found 14 items
drwxr-xr-x   - cluster supergroup          0 2022-04-25 08:14 /taxi/raw/2009
drwxr-xr-x   - cluster supergroup          0 2022-04-25 09:20 /taxi/raw/2010
drwxr-xr-x   - cluster supergroup          0 2022-04-25 09:49 /taxi/raw/2011
drwxr-xr-x   - cluster supergroup          0 2022-04-25 16:31 /taxi/raw/2012
drwxr-xr-x   - cluster supergroup          0 2022-04-25 16:56 /taxi/raw/2013
drwxr-xr-x   - cluster supergroup          0 2022-04-25 17:13 /taxi/raw/2014
drwxr-xr-x   - cluster supergroup          0 2022-04-25 17:29 /taxi/raw/2015
drwxr-xr-x   - cluster supergroup          0 2022-04-25 17:41 /taxi/raw/2016
drwxr-xr-x   - cluster supergroup          0 2022-04-25 17:49 /taxi/raw/2017
drwxr-xr-x   - cluster supergroup          0 2022-04-25 17:56 /taxi/raw/2018
drwxr-xr-x   - cluster supergroup          0 2022-04-25 18:03 /taxi/raw/2019
drwxr-xr-x   - cluster supergroup          0 2022-04-25 18:06 /taxi/raw/2020
drwxr-xr-x   - cluster supergroup          0 2022-04-25 18:08

### However, the yearly data volume drops sharply from 2016 onward

In [3]:
!hdfs dfs -du -s -h /taxi/raw/*

28.9 G   28.9 G   /taxi/raw/2009
28.9 G   28.9 G   /taxi/raw/2010
30.3 G   30.3 G   /taxi/raw/2011
29.9 G   29.9 G   /taxi/raw/2012
27.1 G   27.1 G   /taxi/raw/2013
25.9 G   25.9 G   /taxi/raw/2014
21.3 G   21.3 G   /taxi/raw/2015
15.3 G   15.3 G   /taxi/raw/2016
9.2 G    9.2 G    /taxi/raw/2017
8.4 G    8.4 G    /taxi/raw/2018
7.2 G    7.2 G    /taxi/raw/2019
2.1 G    2.1 G    /taxi/raw/2020
2.3 G    2.3 G    /taxi/raw/2021
430.1 M  430.1 M  /taxi/raw/2022


### Let's see if the schema changes over time 

In [8]:
!hdfs dfs -ls -h /taxi/raw/2009/yellow_tripdata_2009-01.csv

-rw-r--r--   1 cluster supergroup      2.4 G 2022-04-25 07:52 /taxi/raw/2009/yellow_tripdata_2009-01.csv


In [9]:
!hdfs dfs -ls -h /taxi/raw/2017/yellow_tripdata_2017-01.csv

-rw-r--r--   1 cluster supergroup    815.3 M 2022-04-25 17:42 /taxi/raw/2017/yellow_tripdata_2017-01.csv


### Init Spark 

In [11]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Hands On") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-04-26 07:31:42,436 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [12]:
df_2009_01 = spark.read.csv("/taxi/raw/2009/yellow_tripdata_2009-01.csv", header=True)

                                                                                

In [13]:
df_2017_01 = spark.read.csv("/taxi/raw/2017/yellow_tripdata_2017-01.csv", header=True)

                                                                                

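### df_2009_01 vs df_2017_01

The two files use different schemas, and we have to deal with the differences: most columns are renamed (for example, `Trip_Pickup_DateTime` becomes `tpep_pickup_datetime`), the 2009 pickup and drop-off coordinates (`Start_Lon`, `Start_Lat`, `End_Lon`, `End_Lat`) are replaced by `PULocationID` and `DOLocationID` in 2017, and the 2017 file adds `improvement_surcharge`.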
- Description of the [schema](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf)
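
One way to start dealing with the renamed columns is a plain name mapping. The sketch below is only an illustration: the dictionary `COLUMN_MAP_2009_TO_2017` and the helper `harmonize` are hypothetical names, and the pairings `vendor_name` → `VendorID` and `surcharge` → `extra` are assumptions that should be checked against the data dictionary.

```python
# Column names as reported by the 2009-01 file.
COLUMNS_2009 = [
    "vendor_name", "Trip_Pickup_DateTime", "Trip_Dropoff_DateTime",
    "Passenger_Count", "Trip_Distance", "Start_Lon", "Start_Lat",
    "Rate_Code", "store_and_forward", "End_Lon", "End_Lat",
    "Payment_Type", "Fare_Amt", "surcharge", "mta_tax",
    "Tip_Amt", "Tolls_Amt", "Total_Amt",
]

# Hypothetical 2009 -> 2017 name mapping. The vendor and surcharge
# pairings are assumptions, not confirmed by the notebook.
COLUMN_MAP_2009_TO_2017 = {
    "vendor_name": "VendorID",
    "Trip_Pickup_DateTime": "tpep_pickup_datetime",
    "Trip_Dropoff_DateTime": "tpep_dropoff_datetime",
    "Passenger_Count": "passenger_count",
    "Trip_Distance": "trip_distance",
    "Rate_Code": "RatecodeID",
    "store_and_forward": "store_and_fwd_flag",
    "Payment_Type": "payment_type",
    "Fare_Amt": "fare_amount",
    "surcharge": "extra",
    "mta_tax": "mta_tax",
    "Tip_Amt": "tip_amount",
    "Tolls_Amt": "tolls_amount",
    "Total_Amt": "total_amount",
}


def harmonize(columns, mapping=COLUMN_MAP_2009_TO_2017):
    """Replace every mapped name; leave names without a mapping unchanged."""
    return [mapping.get(name, name) for name in columns]


new_names = harmonize(COLUMNS_2009)
# Columns with no 2017 counterpart (the coordinates) stay as they are.
unmapped = [name for name in COLUMNS_2009 if name not in COLUMN_MAP_2009_TO_2017]
print(new_names)
print(unmapped)
```

With Spark, the new names could then be applied with `df_2009_01.toDF(*new_names)`. Renaming alone does not reconcile the values: the 2009 file stores a vendor name such as `VTS` where the 2017 file stores a numeric ID, and the coordinates have no direct equivalent among the location IDs.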

In [14]:
df_2009_01.show(2)

+-----------+--------------------+---------------------+---------------+------------------+-------------------+------------------+---------+-----------------+-------------------+------------------+------------+------------------+---------+-------+-------+---------+------------------+
|vendor_name|Trip_Pickup_DateTime|Trip_Dropoff_DateTime|Passenger_Count|     Trip_Distance|          Start_Lon|         Start_Lat|Rate_Code|store_and_forward|            End_Lon|           End_Lat|Payment_Type|          Fare_Amt|surcharge|mta_tax|Tip_Amt|Tolls_Amt|         Total_Amt|
+-----------+--------------------+---------------------+---------------+------------------+-------------------+------------------+---------+-----------------+-------------------+------------------+------------+------------------+---------+-------+-------+---------+------------------+
|        VTS| 2009-01-04 02:52:00|  2009-01-04 03:02:00|              1|2.6299999999999999|-73.991956999999999|         40.721567|     null|     

                                                                                

In [15]:
df_2017_01.show(2)

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+
|       1| 2017-01-09 11:13:28|  2017-01-09 11:25:45|              1|         3.30|         1|                 N|         263|         161|           1|       12.5|    0|    0.5|         2|           0|                  0.3|        15.3|
|       1| 2017-01-09 11:32:27|  2017-01-09 11:3

                                                                                

### Row counts should be similar

In [17]:
print(df_2009_01.count())
print(df_2017_01.count())

                                                                                

14092413
9710124


                                                                                

The January 2017 file has about 31% fewer rows than the January 2009 file. We have to analyze whether fewer trips were recorded or whether this reflects a general downward trend.
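
As a rough cross-check, the row counts can be set against the file sizes reported earlier. This is a back-of-the-envelope sketch: it assumes that `hdfs dfs -ls -h` reports binary units (1 G = 1024³ bytes), and the sizes are rounded, so the per-row figures are approximate.

```python
# Row counts printed by the two count() calls.
ROWS_2009_01 = 14_092_413
ROWS_2017_01 = 9_710_124

# File sizes from `hdfs dfs -ls -h`, converted under the assumption
# that the -h flag uses binary units.
SIZE_2009_01 = 2.4 * 1024**3    # "2.4 G"
SIZE_2017_01 = 815.3 * 1024**2  # "815.3 M"

# Relative drop in the number of rows between the two months.
row_drop = 1 - ROWS_2017_01 / ROWS_2009_01

# Approximate average width of one CSV row in bytes.
bytes_per_row_2009 = SIZE_2009_01 / ROWS_2009_01
bytes_per_row_2017 = SIZE_2017_01 / ROWS_2017_01

print(f"row count drop:     {row_drop:.1%}")
print(f"bytes per row 2009: {bytes_per_row_2009:.0f}")
print(f"bytes per row 2017: {bytes_per_row_2017:.0f}")
```

If these assumptions hold, a 2017 row is roughly half as wide as a 2009 row, which would suggest that the smaller files result from both fewer rows and narrower rows rather than from missing trips alone.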

### The schema changed in 2016, between June and July

In [22]:
!hdfs dfs -du -s -h /taxi/raw/2016/*

1.6 G    1.6 G    /taxi/raw/2016/yellow_tripdata_2016-01.csv
1.7 G    1.7 G    /taxi/raw/2016/yellow_tripdata_2016-02.csv
1.8 G    1.8 G    /taxi/raw/2016/yellow_tripdata_2016-03.csv
1.7 G    1.7 G    /taxi/raw/2016/yellow_tripdata_2016-04.csv
1.7 G    1.7 G    /taxi/raw/2016/yellow_tripdata_2016-05.csv
1.6 G    1.6 G    /taxi/raw/2016/yellow_tripdata_2016-06.csv
884.7 M  884.7 M  /taxi/raw/2016/yellow_tripdata_2016-07.csv
854.3 M  854.3 M  /taxi/raw/2016/yellow_tripdata_2016-08.csv
870.0 M  870.0 M  /taxi/raw/2016/yellow_tripdata_2016-09.csv
933.4 M  933.4 M  /taxi/raw/2016/yellow_tripdata_2016-10.csv
868.7 M  868.7 M  /taxi/raw/2016/yellow_tripdata_2016-11.csv
897.8 M  897.8 M  /taxi/raw/2016/yellow_tripdata_2016-12.csv
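
The break can also be located programmatically from the listing above. The following sketch copies the sizes by hand and looks for the largest month-over-month drop. The helper names `to_bytes` and `largest_drop` are hypothetical, and binary units are assumed for the `-h` output.

```python
# Multipliers for the unit suffixes, assuming binary units.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# (month, size) pairs copied from `hdfs dfs -du -s -h /taxi/raw/2016/*`.
SIZES_2016 = [
    ("2016-01", "1.6 G"), ("2016-02", "1.7 G"), ("2016-03", "1.8 G"),
    ("2016-04", "1.7 G"), ("2016-05", "1.7 G"), ("2016-06", "1.6 G"),
    ("2016-07", "884.7 M"), ("2016-08", "854.3 M"), ("2016-09", "870.0 M"),
    ("2016-10", "933.4 M"), ("2016-11", "868.7 M"), ("2016-12", "897.8 M"),
]


def to_bytes(text):
    """Convert a size such as '884.7 M' into a number of bytes."""
    value, unit = text.split()
    return float(value) * UNITS[unit]


def largest_drop(sizes):
    """Return (month, ratio) for the month that is smallest relative to its predecessor."""
    best_month, best_ratio = None, float("inf")
    for (_, previous), (month, current) in zip(sizes, sizes[1:]):
        ratio = to_bytes(current) / to_bytes(previous)
        if ratio < best_ratio:
            best_month, best_ratio = month, ratio
    return best_month, best_ratio


month, ratio = largest_drop(SIZES_2016)
print(month, round(ratio, 2))
```

A size drop only hints at a schema change. Comparing the header lines of the June and July files would be needed to confirm it.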


## Column Names 

In [18]:
df_2009_01.printSchema()

root
 |-- vendor_name: string (nullable = true)
 |-- Trip_Pickup_DateTime: string (nullable = true)
 |-- Trip_Dropoff_DateTime: string (nullable = true)
 |-- Passenger_Count: string (nullable = true)
 |-- Trip_Distance: string (nullable = true)
 |-- Start_Lon: string (nullable = true)
 |-- Start_Lat: string (nullable = true)
 |-- Rate_Code: string (nullable = true)
 |-- store_and_forward: string (nullable = true)
 |-- End_Lon: string (nullable = true)
 |-- End_Lat: string (nullable = true)
 |-- Payment_Type: string (nullable = true)
 |-- Fare_Amt: string (nullable = true)
 |-- surcharge: string (nullable = true)
 |-- mta_tax: string (nullable = true)
 |-- Tip_Amt: string (nullable = true)
 |-- Tolls_Amt: string (nullable = true)
 |-- Total_Amt: string (nullable = true)



In [19]:
df_2017_01.printSchema()

root
 |-- VendorID: string (nullable = true)
 |-- tpep_pickup_datetime: string (nullable = true)
 |-- tpep_dropoff_datetime: string (nullable = true)
 |-- passenger_count: string (nullable = true)
 |-- trip_distance: string (nullable = true)
 |-- RatecodeID: string (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: string (nullable = true)
 |-- DOLocationID: string (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- fare_amount: string (nullable = true)
 |-- extra: string (nullable = true)
 |-- mta_tax: string (nullable = true)
 |-- tip_amount: string (nullable = true)
 |-- tolls_amount: string (nullable = true)
 |-- improvement_surcharge: string (nullable = true)
 |-- total_amount: string (nullable = true)
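
Every column is read as a string because the files were loaded without a schema. A possible next step is to pass an explicit schema to the reader. The sketch below builds a DDL-formatted schema string for the 2017 layout. The target types in `TYPES_2017` are assumptions based on the sample rows, and the helper `to_ddl` is a hypothetical name.

```python
# Assumed target types for the 2017 columns, guessed from the sample rows.
TYPES_2017 = {
    "VendorID": "INT",
    "tpep_pickup_datetime": "TIMESTAMP",
    "tpep_dropoff_datetime": "TIMESTAMP",
    "passenger_count": "INT",
    "trip_distance": "DOUBLE",
    "RatecodeID": "INT",
    "store_and_fwd_flag": "STRING",
    "PULocationID": "INT",
    "DOLocationID": "INT",
    "payment_type": "INT",
    "fare_amount": "DOUBLE",
    "extra": "DOUBLE",
    "mta_tax": "DOUBLE",
    "tip_amount": "DOUBLE",
    "tolls_amount": "DOUBLE",
    "improvement_surcharge": "DOUBLE",
    "total_amount": "DOUBLE",
}


def to_ddl(types):
    """Build a DDL schema string; names are backtick-quoted to be safe."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in types.items())


SCHEMA_2017_DDL = to_ddl(TYPES_2017)
print(SCHEMA_2017_DDL)
```

The string could then be passed to the reader, for example `spark.read.csv(path, header=True, schema=SCHEMA_2017_DDL)`. Depending on the Spark version, the timestamp columns may additionally need the `timestampFormat` option, and values that do not match the declared type are read as null, so the result should be checked before relying on it.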



### Stopping Spark 

In [20]:
spark.stop()