%md 
# Defining transformations:
- A transformation creates a new dataset from an existing one by:
    - combining two or more DataFrames
    - aggregating and summarizing data
    - applying functions to the data

- Transformations take two forms:
    - row transformations
    - column transformations

- Columns are Spark objects of type `Column`.
- A column cannot be referenced outside a DataFrame or manipulated independently; it only has meaning once Spark resolves it against a DataFrame.
- Columns are therefore always used within a transformation.

## Accessing columns 
- There are two ways to refer to a column in a DataFrame:
  - Column string
    - The simplest method is to pass the column name as a string, for example to `df.select()`.
    - Spark provides many transformations that take column strings as arguments, such as `select`, `drop`, `orderBy`, and `groupBy`.
  - Column object
    - The second way is to pass a column object.
    - The simplest form uses the `col` or `column` function, but there are other ways as well.
    - Column strings and column objects can be mixed in the same transformation.
    - Most transformations accept both forms, so the choice is largely a matter of personal preference.

## Creating Column expressions
- Column expressions are formulas or transformations applied to DataFrame columns. They help you manipulate, filter, or create new columns using existing data.
- Examples:
    - `col("age") + 5`
    - `col("name").substr(1, 3)`
- In PySpark, there are two main types of column expressions:
    - Column string expressions (SQL expressions)
        - These are expressions written as strings, using SQL syntax.
        - Use them when you want to write familiar SQL-style calculations or function calls concisely.
        - `df.selectExpr("fare * 0.9 as discounted_fare").show()`
        - `df.withColumn("discounted_fare", expr("fare * 0.9"))`

    - Column object expressions
        - These are built from PySpark functions and column objects (like `col("column_name")`).
        - They are more flexible and often more readable for complex transformations.
        - `df.withColumn("uppercase_name", upper(col("name")))`
        - `df.withColumn("adjusted_fare", (col("fare") * 0.9).cast("int"))`


In [2]:
# Load the flights dataset
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

if __name__ == "__main__":

    spark = (
        SparkSession.builder
        .appName('Spark col transformation')
        .getOrCreate()
    )
             

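    flights_df = (
        spark.read
        .format('csv')
        .option('inferSchema','true')      # let Spark infer the column types
        .option('header','true')           # the first line holds the column names
        .option('samplingRatio','0.001')   # infer the schema from a fraction of the rows
        .load(
            # Download the flights dataset first and point this path at your local copy
            path = r'C:\Users\shubh\OneDrive\Documents\Visual Studio 2017\datasets\flights.csv',
            encoding = 'utf-8'
        )
    )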

In [10]:
print(flights_df.count())
flights_df.show(5)
flights_df.printSchema()

100000
+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|
+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|2008|    1|         3|        4|   2003|      1955|   2211|      2225|           WN|  

In [12]:
# Referencing columns using column strings 
flights_df.select('FlightNum','AirTime','ArrDelay').show(5)

+---------+-------+--------+
|FlightNum|AirTime|ArrDelay|
+---------+-------+--------+
|      335|    116|     -14|
|     3231|    113|       2|
|      448|     76|      14|
|     1746|     78|      -6|
|     3920|     77|      34|
+---------+-------+--------+
only showing top 5 rows



In [13]:
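# Referencing columns using column objects; column() and col() are interchangeable,
# and a plain column string ('ArrDelay') can be mixed in.
# Spark resolves names case-insensitively by default, so 'Airtime' matches 'AirTime'.

flights_df.select(
    column('FlightNum'),
    col('Airtime'),
    'ArrDelay'
).show(5)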

+---------+-------+--------+
|FlightNum|Airtime|ArrDelay|
+---------+-------+--------+
|      335|    116|     -14|
|     3231|    113|       2|
|      448|     76|      14|
|     1746|     78|      -6|
|     3920|     77|      34|
+---------+-------+--------+
only showing top 5 rows



In [21]:
# Creating a column expression from a column string (SQL) expression
# Use the legacy (pre-Spark 3.0) datetime parser; this is presumably needed because
# the concatenated string is not zero-padded (e.g. '2008-1-3').
spark.sql("SET spark.sql.legacy.timeParserPolicy = LEGACY")

# select() treats a plain string as a column name, not as an expression,
# so expr() is needed to parse the SQL string into a column object.
flights_df.select(
    'Origin',
    'Dest',
    'Distance',
    expr("to_date(concat_ws('-',Year,Month,DayofMonth),'yyyy-MM-dd') as flight_date")
).show(5)

+------+----+--------+-----------+
|Origin|Dest|Distance|flight_date|
+------+----+--------+-----------+
|   IAD| TPA|     810| 2008-01-03|
|   IAD| TPA|     810| 2008-01-03|
|   IND| BWI|     515| 2008-01-03|
|   IND| BWI|     515| 2008-01-03|
|   IND| BWI|     515| 2008-01-03|
+------+----+--------+-----------+
only showing top 5 rows



In [27]:
# Creating the same column with a column object expression

flights_df.select(
  'Origin',
  'Dest',
  'Distance',
  to_date(
    concat_ws(
      "-",
      col("Year"),
      col("Month"),
      col("DayOfMonth")
    ),
    "yyyy-MM-dd"
  ).alias('flight_date')
).show(5)

+------+----+--------+-----------+
|Origin|Dest|Distance|flight_date|
+------+----+--------+-----------+
|   IAD| TPA|     810| 2008-01-03|
|   IAD| TPA|     810| 2008-01-03|
|   IND| BWI|     515| 2008-01-03|
|   IND| BWI|     515| 2008-01-03|
|   IND| BWI|     515| 2008-01-03|
+------+----+--------+-----------+
only showing top 5 rows

