### DataSet 

In [1]:
data_set = 's3://fcc-spark-example/dataset/flight-time.json'

### Schema 

In [2]:
schema_ddl = """FL_DATE STRING, OP_CARRIER STRING, OP_CARRIER_FL_NUM INT, ORIGIN STRING, 
                ORIGIN_CITY_NAME STRING, DEST STRING, DEST_CITY_NAME STRING, CRS_DEP_TIME INT, DEP_TIME INT, 
                WHEELS_ON INT, TAXI_IN INT, CRS_ARR_TIME INT, ARR_TIME INT, CANCELLED STRING, DISTANCE INT"""

##### The beauty of schema-on-read: we can define the schema at the time of reading the data.
- `FL_DATE` => String 
- `CANCELLED` => String 

Both columns are read as plain strings for now. We can change the schema later if needed.

### Reading the data 

In [3]:
raw_df = spark.read \
                .format("json") \
                .schema(schema_ddl) \
                .option("mode", "FAILFAST") \
                .option("dateFormat", "M/d/y") \
                .load(data_set)

### Fix the schema 

Explore the functions available in PySpark.

They are mostly found in two places:

- `pyspark.sql.functions` [link](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html)
- Built-in Spark SQL functions [link](https://spark.apache.org/docs/latest/api/sql/index.html#if)

In [5]:
import pyspark.sql.functions as F

raw_df_2 = raw_df \
            .withColumn("FL_DATE", F.to_date("FL_DATE", "M/d/y")) \
            .withColumn("CANCELLED", F.expr("if(CANCELLED==1, true, false)"))

### Move the data back to S3

In [8]:
processed_data_location = 's3://fcc-spark-example/output'

In [9]:
raw_df_2.write \
        .format("parquet") \
        .mode("overwrite") \
        .save(processed_data_location)

The different write `modes`:

- `APPEND`: appends the contents of the DataFrame to the existing data; creates the output if it doesn't exist.
- `OVERWRITE`: overwrites the existing data; creates the output if it doesn't exist.
- `ERROR` or `ERRORIFEXISTS` (the default): throws an exception if data already exists.
- `IGNORE`: writes the data if none exists at the target, and silently skips the write if data already exists.

In [11]:
!aws s3 ls s3://fcc-spark-example/output/

2023-03-06 21:00:40          0 _SUCCESS
2023-03-06 21:00:40    1829036 part-00000-03aaa650-3c6e-41d8-8696-1c8cac41187d-c000.snappy.parquet
2023-03-06 21:00:40    1658258 part-00001-03aaa650-3c6e-41d8-8696-1c8cac41187d-c000.snappy.parquet
