# Let's check the data on HDFS 

### File Sizes 

Output from last year's run:

```
!hdfs dfs -du -s -h /taxi/raw/
237.2 G  237.2 G  /taxi/raw
```

In [None]:
!hdfs dfs -du -s -h /taxi/raw/

### Looks like we have all years

In [None]:
!hdfs dfs -ls /taxi/raw/

### However, the yearly data volume drops sharply from 2016 onward, and again from 2020.

Output from last year's run:
```
28.9 G   28.9 G   /taxi/raw/2009
28.9 G   28.9 G   /taxi/raw/2010
30.3 G   30.3 G   /taxi/raw/2011
29.9 G   29.9 G   /taxi/raw/2012
27.1 G   27.1 G   /taxi/raw/2013
25.9 G   25.9 G   /taxi/raw/2014
21.3 G   21.3 G   /taxi/raw/2015
15.3 G   15.3 G   /taxi/raw/2016
9.2 G    9.2 G    /taxi/raw/2017
8.4 G    8.4 G    /taxi/raw/2018
7.2 G    7.2 G    /taxi/raw/2019
2.1 G    2.1 G    /taxi/raw/2020
2.3 G    2.3 G    /taxi/raw/2021
430.1 M  430.1 M  /taxi/raw/2022
```

In [None]:
!hdfs dfs -du -s -h /taxi/raw/*
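
To compare the yearly sizes programmatically rather than by eye, the listing can be parsed into byte counts. The following is a minimal sketch, assuming the line layout shown above (size, disk space consumed, path) and binary units (1 G = 1024 M). `parse_du_line` is a hypothetical helper name, not part of any Hadoop tooling.

```python
# Hypothetical helper: turn one line of `hdfs dfs -du -s -h` output into (path, bytes).
# Assumes the layout "<size> <unit>  <size> <unit>  <path>" and binary units.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_du_line(line):
    tokens = line.split()
    # First two tokens are the size and its unit; the last token is the path.
    size, unit, path = float(tokens[0]), tokens[1], tokens[-1]
    return path, int(size * UNITS[unit])


# Two lines copied from last year's listing.
sample = [
    "15.3 G   15.3 G   /taxi/raw/2016",
    "430.1 M  430.1 M  /taxi/raw/2022",
]
sizes = dict(parse_du_line(line) for line in sample)
for path, num_bytes in sizes.items():
    print(path, num_bytes)
```

Lines without a unit suffix (very small entries) are not handled by this sketch.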

### Let's see if the schema changes over time 

In [None]:
!hdfs dfs -ls -h /taxi/raw/2009/yellow_tripdata_2009-01.parquet

In [None]:
!hdfs dfs -ls -h /taxi/raw/2017/yellow_tripdata_2017-01.parquet

### Init Spark 

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Hands On") \
    .getOrCreate()

In [None]:
df_2009_01 = spark.read.parquet("/taxi/raw/2009/yellow_tripdata_2009-01.parquet")

In [None]:
df_2017_01 = spark.read.parquet("/taxi/raw/2017/yellow_tripdata_2017-01.parquet")

###  df_2009_01 vs df_2017_01

The two schemas differ, and we have to deal with that difference.
- Description of the [schema](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf)
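
One way to see the difference is to compare the column-name lists of the two DataFrames. This is a minimal sketch: `diff_columns` is a hypothetical helper, and the column names in the example are illustrative placeholders rather than the verified contents of the files. In the notebook, `df_2009_01.columns` and `df_2017_01.columns` would be passed instead.

```python
# Hypothetical helper: compare two column-name lists, ignoring case.
def diff_columns(cols_a, cols_b):
    lower_a = {c.lower() for c in cols_a}
    lower_b = {c.lower() for c in cols_b}
    return {
        "only_in_a": sorted(lower_a - lower_b),
        "only_in_b": sorted(lower_b - lower_a),
        "shared": sorted(lower_a & lower_b),
    }


# Illustrative names only; replace with df_2009_01.columns and df_2017_01.columns.
old_cols = ["vendor_name", "Passenger_Count", "Start_Lon", "Start_Lat"]
new_cols = ["VendorID", "passenger_count", "PULocationID"]

result = diff_columns(old_cols, new_cols)
for key, names in result.items():
    print(key, names)
```

Comparing case-insensitively is a design choice here: it separates columns that were merely re-capitalised from columns that were renamed or dropped.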

In [None]:
df_2009_01.show(2)

In [None]:
df_2017_01.show(2)

### The row counts should be similar

In [None]:
print(df_2009_01.count())
print(df_2017_01.count())

We have to analyse whether less data was recorded or whether this is just a general downward trend.
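
To make the comparison concrete, the second count can be expressed relative to the first. This is a small sketch: `relative_change` is a hypothetical helper, and the numbers in the example are placeholders, not real counts. In the notebook, the results of `df_2009_01.count()` and `df_2017_01.count()` would be passed in.

```python
def relative_change(old_count, new_count):
    """Return the change from old_count to new_count as a fraction of old_count."""
    if old_count == 0:
        raise ValueError("old_count must be non-zero")
    return (new_count - old_count) / old_count


# Placeholder values, not real row counts.
change = relative_change(1000, 700)
print(f"{change:+.1%}")
```

Repeating this for every month would show whether the decline is gradual, which would suggest a general trend, or abrupt, which would suggest a change in what was recorded.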

### The schema changed in 2016, between June and July

Output from last year's run:

```
1.6 G    1.6 G    /taxi/raw/2016/yellow_tripdata_2016-01.csv
1.7 G    1.7 G    /taxi/raw/2016/yellow_tripdata_2016-02.csv
1.8 G    1.8 G    /taxi/raw/2016/yellow_tripdata_2016-03.csv
1.7 G    1.7 G    /taxi/raw/2016/yellow_tripdata_2016-04.csv
1.7 G    1.7 G    /taxi/raw/2016/yellow_tripdata_2016-05.csv
1.6 G    1.6 G    /taxi/raw/2016/yellow_tripdata_2016-06.csv
884.7 M  884.7 M  /taxi/raw/2016/yellow_tripdata_2016-07.csv
854.3 M  854.3 M  /taxi/raw/2016/yellow_tripdata_2016-08.csv
870.0 M  870.0 M  /taxi/raw/2016/yellow_tripdata_2016-09.csv
933.4 M  933.4 M  /taxi/raw/2016/yellow_tripdata_2016-10.csv
868.7 M  868.7 M  /taxi/raw/2016/yellow_tripdata_2016-11.csv
897.8 M  897.8 M  /taxi/raw/2016/yellow_tripdata_2016-12.csv
```

In [None]:
!hdfs dfs -du -s -h /taxi/raw/2016/*
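
The break between June and July can also be located mechanically by finding the month whose size is smallest relative to the month before it. This is a minimal sketch using the sizes from last year's listing, converted to MiB on the assumption that 1 G = 1024 M. `largest_drop` is a hypothetical helper.

```python
# Monthly sizes for 2016 in MiB, copied from last year's listing
# (G values converted assuming 1 G = 1024 M).
sizes_mib = {
    "2016-01": 1.6 * 1024,
    "2016-02": 1.7 * 1024,
    "2016-03": 1.8 * 1024,
    "2016-04": 1.7 * 1024,
    "2016-05": 1.7 * 1024,
    "2016-06": 1.6 * 1024,
    "2016-07": 884.7,
    "2016-08": 854.3,
    "2016-09": 870.0,
    "2016-10": 933.4,
    "2016-11": 868.7,
    "2016-12": 897.8,
}


def largest_drop(series):
    """Return (label, ratio) for the entry that is smallest relative to its predecessor."""
    labels = list(series)  # dicts keep insertion order, so this is chronological
    best = None
    for prev, cur in zip(labels, labels[1:]):
        ratio = series[cur] / series[prev]
        if best is None or ratio < best[1]:
            best = (cur, ratio)
    return best


month, ratio = largest_drop(sizes_mib)
print(month, round(ratio, 2))
```

A file that shrinks to roughly half the size of the previous month is consistent with a change in the recorded columns, though the listing alone does not prove it.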

### Column Names

In [None]:
df_2009_01.printSchema()

In [None]:
df_2017_01.printSchema()

### Stopping Spark 

In [None]:
spark.stop()