In [None]:
from pyspark.sql.functions import *


# Column Transformation Methods

In this notebook, we will understand column transformation methods:

1. `withColumnRenamed`
2. `withColumn`
3. `select`
4. `where`
5. `groupBy`
6. `count`
7. `orderBy`

We will also cover actions and functions:

- Actions: `show`, `count`
- Functions: `display`, `expr`, `printSchema`


In [None]:
fire_df = (
    spark.read
    .format("csv")
    .option("header", True)
    .option("inferSchema", True) 
    .load("/databricks-datasets/learning-spark-v2/sf-fire/sf-fire-calls.csv")
)


In [None]:
display(fire_df)


## Column Names Are Not Standardized
To create a new transformed DataFrame with renamed columns, use the `.withColumnRenamed` method. Provide the current column name and the new name you want to assign, as in the code below. Keep in mind:
- You can chain Spark transformation methods one after the other.
- Each Spark transformation returns a new DataFrame rather than modifying the one it is called on.
- Spark DataFrames are immutable: you cannot change an existing DataFrame, but you can derive a new DataFrame from it that carries the changes.


In [None]:
renamed_fire_df = (
    fire_df
    .withColumnRenamed("Call Number", "CallNumber")
    .withColumnRenamed("Unit ID", "UnitID")
    .withColumnRenamed("Incident Number", "IncidentNumber")
    .withColumnRenamed("Call Type", "CallType")
    .withColumnRenamed("Call Date", "CallDate")
    .withColumnRenamed("Watch Date", "WatchDate")
    .withColumnRenamed("Call Final Disposition", "CallFinalDisposition")
    .withColumnRenamed("Available DtTm", "AvailableDtTm")
    .withColumnRenamed("Address", "Address")
    .withColumnRenamed("City", "City")
    .withColumnRenamed("Zipcode of Incident", "Zipcode")
    .withColumnRenamed("Battalion", "Battalion")
    .withColumnRenamed("Station Area", "StationArea")
    .withColumnRenamed("Box", "Box")
    .withColumnRenamed("OrigPriority", "OrigPriority")
    .withColumnRenamed("Priority", "Priority")
    .withColumnRenamed("Final Priority", "FinalPriority")
    .withColumnRenamed("ALS Unit", "ALSUnit")
    .withColumnRenamed("Call Type Group", "CallTypeGroup")
    .withColumnRenamed("NumAlarms", "NumAlarms")
    .withColumnRenamed("Unit Type", "UnitType")
    .withColumnRenamed("Unit sequence in call dispatch", "UnitSequenceInCallDispatch")
    .withColumnRenamed("Fire Prevention District", "FirePreventionDistrict")
    .withColumnRenamed("Supervisor District", "SupervisorDistrict")
    .withColumnRenamed("Neighborhood", "Neighborhood")
    .withColumnRenamed("Location", "Location")
    .withColumnRenamed("Row ID", "RowID")
    .withColumnRenamed("Delay", "Delay")
)
display(renamed_fire_df)


## Understanding DataFrame Schema
The `.printSchema()` method shows you the schema of the DataFrame, allowing you to understand the data types of each column.


In [None]:
renamed_fire_df.printSchema()


## Using .withColumn for Transformations
The `.withColumn` method is a transformation that applies a column expression to a DataFrame. It takes a column name and the expression that computes the column's values, such as a data type conversion or a rounding operation. If a column with that name already exists it is replaced; otherwise a new column is added. The method returns a new DataFrame with the transformed column.


In [None]:
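fire_df = (
    renamed_fire_df
    # Parse the string column into a timestamp using the given pattern.
    .withColumn("AvailableDtTm", to_timestamp("AvailableDtTm", "MM/dd/yyyy hh:mm:ss a"))
    # `round` here is pyspark.sql.functions.round (from the wildcard import), not Python's built-in.
    .withColumn("Delay", round("Delay", 2))
)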

display(fire_df)


In [None]:
fire_df.printSchema()


## Caching DataFrames
The `.cache()` method marks the DataFrame to be kept in memory, which allows for faster querying later. Caching is lazy: the data is stored the first time an action runs on the DataFrame, and subsequent actions then read it from memory rather than recomputing it or reading it again from the original source.


#### Q1. How many distinct types of calls were made to the Fire Department?


To query a DataFrame with SQL, first register it as a temporary view, then pass the query to the `sql` method of the SparkSession (`spark`).


In [None]:
fire_df.createOrReplaceTempView("fire_service_calls_view")

q1_sql_df = spark.sql(
    """
    select count(distinct calltype) as distinct_call_type_count
    from fire_service_calls_view
    where calltype is not null
    """
)

display(q1_sql_df)


DataFrame logic for Q1
1. Filter the records and keep only those where CallType is not null.
2. Select the CallType column.
3. Keep only the distinct call types.
4. Show the count.


In [None]:
q1_df = (fire_df
         .where("CallType is not null")
         .select("CallType")
         .distinct()
         )
print(q1_df.count())


## Spark DataFrame Transformation Methods

The `.where`, `.select`, and `.distinct` methods are essential transformation methods in Spark DataFrames. These methods allow you to filter, select, and retrieve distinct rows from your DataFrame, respectively. Each transformation returns a new DataFrame, as Spark DataFrames are immutable. Here is a brief overview of these methods:

- `.where(condition)`: Filters rows based on the given condition.
- `.select(*cols)`: Selects a subset of columns or expressions.
- `.distinct()`: Returns a new DataFrame with distinct rows.

These methods can be chained together to perform complex transformations efficiently.

## Spark DataFrame Action Methods

The `.count()` method is an action that triggers the execution of the Spark job and returns the count of rows to the Spark driver. It is used to get the number of elements in the DataFrame.


In [None]:
q1_df1 = fire_df.where("CallType is not null")
q1_df2 = q1_df1.select("CallType")
q1_df3 = q1_df2.distinct()
print(q1_df3.count())


#### Q2. What were the distinct types of calls made to the Fire Department?


In [None]:
fire_df.createOrReplaceTempView("fire_service_calls_view")

q2_sql_df = spark.sql(
    """
    select distinct calltype as distinct_call_type 
    from fire_service_calls_view 
    where calltype is not null
""")

display(q2_sql_df)


DataFrame logic
1. Filter the rows where CallType is not null.
2. Select CallType.
3. Keep only the distinct call types.
4. Show the call types.


In [None]:
q2_df = (
  fire_df.where("CallType is not null")
  .select(expr("CallType as distinct_call_type"))
  .distinct()
)

display(q2_df)


#### Q3. Find all responses with delay times greater than 5 minutes


In [None]:
fire_df.createOrReplaceTempView("fire_service_calls_view")

q3_sql_df = spark.sql(
    """
    select *
    from fire_service_calls_view
    where delay > 5 
    """
)

display(q3_sql_df)


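DataFrame logic
1. Keep only rows where the delay is greater than 5 minutes.
2. Exclude rows where delay is null (the comparison in step 1 already drops them, so this check is redundant but harmless).
3. Select all columns.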


In [None]:
q3_df = (
    fire_df
    .where("delay > 5 AND delay is not null")
    .select("*")
)

display(q3_df)


#### Q4. What were the most common call types?


In [None]:
fire_df.createOrReplaceTempView("fire_service_calls_view")

q4_sql_df  = spark.sql(
    """
    select calltype , count(calltype) as count 
    from fire_service_calls_view
    where calltype is not null 
    group by calltype 
    order by count desc 
    
    """
)
display(q4_sql_df)


DataFrame logic
1. Filter the rows where CallType is not null.
2. Select the CallType column.
3. Group the rows by CallType.
4. Count the rows in each group.
5. Sort the groups by count in descending order.
6. Show the results.


## GroupBy and Count in DataFrame

In Spark DataFrames, the `groupBy` and `count` methods are used for grouping and aggregating data.

- **`count` as an Action**: When you call `count` directly on a DataFrame, it is an action that triggers the execution of the Spark job and returns the number of rows in the DataFrame.

    ```python
    row_count = df.count()
    ```

- **`groupBy` and `count` as a Transformation**: When you call `groupBy` on a DataFrame, it returns a `GroupedData` object. You can then call `count` on this `GroupedData` object to get the count of rows for each group. This `count` is a transformation and does not trigger execution until an action is called.

    ```python
    grouped_df = df.groupBy("column_name").count()
    ```

In summary:
- `count` called directly on a DataFrame is an action and returns the total row count.
- `count` called after `groupBy` is a transformation on the `GroupedData` object and returns a new DataFrame with the row count for each group.


In [None]:
q4_df = (
    fire_df
    .select("CallType")                 # creates a new transformed DataFrame
    .where("CallType is not null")      # creates a new transformed DataFrame
    .groupBy("CallType")                # returns a GroupedData object, not a DataFrame
    .count()                            # a transformation here, not an action: returns a DataFrame with one row per group
    .orderBy("count", ascending=False)  # creates a new transformed DataFrame
)
q4_df.show()  # action: prints the rows and returns None, so it is kept out of the assignment


#### Q5. Which zip codes accounted for the most calls?


In [None]:
fire_df.createOrReplaceTempView("fire_calls_service_view")

q5_sql_df = spark.sql(
    """
    select zipcode , count(zipcode) as count 
    from fire_calls_service_view
    where zipcode is not null 
    group by zipcode 
    order by count desc 
    """
)
display(q5_sql_df)


DataFrame logic
1. Select the Zipcode column.
2. Filter the rows where Zipcode is not null.
3. Group the rows by Zipcode.
4. Count the rows in each group.
5. Order the groups by count in descending order.
6. Show the results.


In [None]:
q5_df = (
    fire_df
    .select("Zipcode")
    .where("Zipcode is not null")
    .groupBy("Zipcode")
    .count()
    .orderBy("count", ascending=False)
)
q5_df.show()  # show() returns None, so call it separately to keep q5_df a DataFrame


#### Q6. What San Francisco neighborhoods are in the zip codes 94102 and 94103?


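DataFrame logic
1. Filter out rows where Zipcode or Neighborhood is null.
2. Keep only rows where City is San Francisco.
3. Keep only rows where Zipcode is either 94102 or 94103.
4. Select the Zipcode and Neighborhood columns.
5. Take the distinct pairs so that each neighborhood is listed once per zip code.


In [None]:
q6_df = (
  fire_df
  .where("Zipcode is not null and Neighborhood is not null")
  .where("City = 'San Francisco'")  # filter on City before it is projected away
  .where("Zipcode = 94102 or Zipcode = 94103")
  .select("Zipcode", "Neighborhood")
  .distinct()                       # one row per (Zipcode, Neighborhood) pair
)
q6_df.show()  # show() returns None, so call it separately to keep q6_df a DataFrame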


#### Q7. What were the sum of all call alarms and the average, min, and max of the call response time?


## Aggregations with .agg() in DataFrames

In PySpark, the `.agg()` method is used to perform aggregate operations on DataFrames. This method allows you to compute summary statistics on your data, such as sums, averages, minimums, and maximums, similar to SQL's aggregate functions.

### How `.agg()` Works

- **Aggregating**: You use the `.agg()` method to specify the aggregation functions you want to apply to the data.

### Common Aggregation Functions

You can import common aggregation functions from `pyspark.sql.functions`, such as:
- `sum()`: Computes the sum of values.
- `avg()`: Computes the average of values.
- `min()`: Finds the minimum value.
- `max()`: Finds the maximum value.

### Example

Here is an example that shows how to use `.agg()` to compute the sum, average, minimum, and maximum of specific columns:

```python
# Note: these imports shadow Python's built-in sum, min, and max in this scope.
from pyspark.sql.functions import sum, avg, min, max

df_aggregated = df.agg(
    sum("numalarms").alias("total_alarms"),
    avg("delay").alias("average_delay"),
    min("delay").alias("min_delay"),
    max("delay").alias("max_delay")
)

display(df_aggregated)
```

In this example, `.agg()` computes the sum of `numalarms` and the average, minimum, and maximum of `delay`, and each `.alias()` call names the resulting column.

Note how the work is divided:

- `sum`, `avg`, `min`, `max` produce Column expressions.
- `.agg()` consumes those expressions and returns a new DataFrame.
- Only the `.agg()` call creates a new DataFrame; the individual aggregate functions do not.


In [None]:
q7_df = (
  fire_df
  .agg( 
    sum("numalarms").alias("total_alarms"), # col expression/object
    avg("delay").alias("average_delay"), # col expression/object
    min("delay").alias("min_delay"), # col expression/object
    max("delay").alias("max_delay") # col expression/object
  ) # agg creates a new transformed dataframe 
)
display(q7_df)


#### Q8. How many distinct years of data are in the dataset?


DataFrame logic
1. Extract the year from the CallDate column and select it.
2. Keep only the distinct years.
3. Order the result by distinct_years in ascending order.


## Extracting Year from a Column in a DataFrame

To extract the year from a date column in a DataFrame, use the `year` function from `pyspark.sql.functions`. It accepts a column object (created with `col`) or a column name, and the column should hold date or timestamp values. `CallDate` has not been converted so far in this notebook, so the cell below assumes it is still a string in `MM/dd/yyyy` form and parses it with `to_date` first.

1. **Select and Extract Year**: Use the `select` method to choose the column you want to extract the year from. Apply the `year` function to this column.


In [None]:
q8_df = (
    fire_df
    # Assumes CallDate is still a MM/dd/yyyy string; year() needs a date, so parse it first.
    .select(year(to_date(col("CallDate"), "MM/dd/yyyy")).alias("distinct_years"))
    .distinct()
    .orderBy("distinct_years")
)
display(q8_df)


#### Q9. What week of the year in 2018 had the most fire calls?


DataFrame logic
1. Create a new column week_year that holds the week of the year of the call date.
2. Filter the rows where the year of the call date is 2018.
3. Group by week_year.
4. Count the rows in each week_year group.
5. Order by the count in descending order.


### weekofyear Function

The `weekofyear` function in PySpark is used to extract the week number from a given date. This function returns an integer representing the week of the year for the specified date.

#### Syntax
```python
weekofyear(expr)
```

#### Arguments
- `expr`: A DATE expression. This can be a column containing date values.

#### Returns
- An INTEGER representing the week of the year.

#### Example
```python
from pyspark.sql.functions import weekofyear, col

df = spark.createDataFrame([('2015-04-08',), ('2024-10-31',)], ['dt'])
df.select("*", weekofyear(col("dt")).alias("week_of_year")).show()
```

In this example, the `weekofyear` function extracts the week number from the date column `dt`.
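
Spark's documentation describes `weekofyear` as following ISO 8601: weeks start on Monday, and week 1 is the first week with more than 3 days. As a quick cross-check that needs no cluster, the sketch below uses Python's standard `date.isocalendar()`, which follows the same convention, on the two dates from the example above. The third date is an added illustration of a year-boundary case: an ISO week can belong to the neighboring calendar year.

```python
from datetime import date

# The two dates used in the weekofyear example above.
sample_dates = [date(2015, 4, 8), date(2024, 10, 31)]
iso_weeks = [d.isocalendar()[1] for d in sample_dates]
print(iso_weeks)  # [15, 44]

# Year-boundary case: 2018-12-31 is a Monday and falls in ISO week 1 of 2019.
boundary_year, boundary_week, _ = date(2018, 12, 31).isocalendar()
print(boundary_year, boundary_week)  # 2019 1
```

For a question such as Q9, this means a row dated 2018-12-31 would pass a `year == 2018` filter but be grouped under week 1.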


In [None]:
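from pyspark.sql.functions import col, count, to_date, weekofyear, year

q9_df = (
  fire_df
  # Assumes CallDate is still a MM/dd/yyyy string; parse it before extracting date parts.
  .withColumn("CallDateParsed", to_date(col("CallDate"), "MM/dd/yyyy"))
  .where(year(col("CallDateParsed")) == 2018)
  .withColumn("week_year", weekofyear(col("CallDateParsed")))
  .groupBy("week_year")
  .agg(count("*").alias("total_calls"))
  .orderBy(col("total_calls").desc())
)
display(q9_df)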
