## Importing Libraries

In [0]:
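# Note: this wildcard import shadows Python built-ins such as round() and sum()
# with their Spark column-function counterparts, which the cells below rely on.
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window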

**3308. Find Top Performing Driver (Medium)**

**Table: Drivers**

| Column Name  | Type    |
|--------------|---------|
| driver_id    | int     |
| name         | varchar |
| age          | int     |
| experience   | int     |
| accidents    | int     |

(driver_id) is the unique key for this table.
Each row includes a driver's ID, their name, age, years of driving experience, and the number of accidents they’ve had.

**Table: Vehicles**

| Column Name  | Type    |
|--------------|---------|
| vehicle_id   | int     |
| driver_id    | int     |
| model        | varchar |
| fuel_type    | varchar |
| mileage      | int     |

(vehicle_id, driver_id, fuel_type) is the unique key for this table.
Each row includes the vehicle's ID, the driver who operates it, the model, fuel type, and mileage.

**Table: Trips**

| Column Name  | Type    |
|--------------|---------|
| trip_id      | int     |
| vehicle_id   | int     |
| distance     | int     |
| duration     | int     |
| rating       | int     |

(trip_id) is the unique key for this table.
Each row includes a trip's ID, the vehicle used, the distance covered (in miles), the trip duration (in minutes), and the passenger's rating (1-5).

Uber is analyzing drivers based on their trips.

**Write a solution to find the top-performing driver for each fuel type based on the following criteria:**
- A driver's performance is calculated as the average rating across all their trips. The average rating should be rounded to 2 decimal places.
- If two drivers have the same average rating, the driver with the longer total distance traveled should be ranked higher.
- If there is still a tie, choose the driver with the fewest accidents.

Return the result table ordered by fuel_type in ascending order.
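
The three ranking rules map naturally onto a composite sort key. The following is a minimal plain-Python sketch, not part of the original notebook; the helper `pick_top_driver` and the sample candidates are hypothetical and exist only to illustrate the tie-breaking order.

```python
# Hypothetical candidates for a single fuel type:
# (driver_id, avg_rating, total_distance, accidents)
candidates = [
    (10, 4.50, 120, 2),
    (11, 4.50, 150, 1),   # same rating as driver 10, longer distance
    (12, 4.50, 150, 0),   # same rating and distance as driver 11, fewer accidents
    (13, 4.20, 900, 0),   # lower rating loses regardless of distance
]

def pick_top_driver(rows):
    """Return the best row: highest rating, then longest distance, then fewest accidents."""
    # Negating rating and distance turns "largest first" into "smallest key first".
    return min(rows, key=lambda r: (-r[1], -r[2], r[3]))

top = pick_top_driver(candidates)
print(top)  # (12, 4.5, 150, 0)
```

Driver 12 wins only on the third criterion, which shows that each rule is consulted only when all earlier ones are tied.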

The result format is in the following example.

**Example:**

**Input:**

**Drivers table:**

| driver_id | name     | age | experience | accidents |
|-----------|----------|-----|------------|-----------|
| 1         | Alice    | 34  | 10         | 1         |
| 2         | Bob      | 45  | 20         | 3         |
| 3         | Charlie  | 28  | 5          | 0         |

**Vehicles table:**

| vehicle_id | driver_id | model   | fuel_type | mileage |
|------------|-----------|---------|-----------|---------|
| 100        | 1         | Sedan   | Gasoline  | 20000   |
| 101        | 2         | SUV     | Electric  | 30000   |
| 102        | 3         | Coupe   | Gasoline  | 15000   |

**Trips table:**

| trip_id | vehicle_id | distance | duration | rating |
|---------|------------|----------|----------|--------|
| 201     | 100        | 50       | 30       | 5      |
| 202     | 100        | 30       | 20       | 4      |
| 203     | 101        | 100      | 60       | 4      |
| 204     | 101        | 80       | 50       | 5      |
| 205     | 102        | 40       | 30       | 5      |
| 206     | 102        | 60       | 40       | 5      |

**Output:**

| fuel_type | driver_id | rating | distance |
|-----------|-----------|--------|----------|
| Electric  | 2         | 4.50   | 180      |
| Gasoline  | 3         | 5.00   | 100      |

**Explanation:**
- For fuel type Gasoline, both Alice (Driver 1) and Charlie (Driver 3) have trips. Charlie's average rating is 5.0 and Alice's is 4.5, so Charlie is selected.
- For fuel type Electric, Bob (Driver 2) is the only driver. His average rating is 4.5, so he is selected.

The output table is ordered by fuel_type in ascending order.
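
The figures in the explanation can be re-derived by hand from the example tables. As a minimal sketch in plain Python (not PySpark, and not part of the original notebook), with the example rows copied into hypothetical local variables:

```python
from collections import defaultdict

# vehicle_id -> (driver_id, fuel_type), copied from the example Vehicles table
vehicles = {100: (1, "Gasoline"), 101: (2, "Electric"), 102: (3, "Gasoline")}
# (vehicle_id, distance, rating), copied from the example Trips table
trips = [(100, 50, 5), (100, 30, 4), (101, 100, 4),
         (101, 80, 5), (102, 40, 5), (102, 60, 5)]

ratings = defaultdict(list)
distance = defaultdict(int)
for vehicle_id, dist, rating in trips:
    key = vehicles[vehicle_id]      # emulates the join on vehicle_id
    ratings[key].append(rating)
    distance[key] += dist

# (driver_id, fuel_type) -> (average rating rounded to 2 places, total distance)
stats = {
    key: (round(sum(r) / len(r), 2), distance[key])
    for key, r in ratings.items()
}
for key in sorted(stats):
    print(key, stats[key])
```

This gives Alice 4.5 over 80 miles, Bob 4.5 over 180 miles, and Charlie 5.0 over 100 miles, consistent with the expected output above.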

In [0]:
drivers_data_3308 = [
    (1, "Alice", 34, 10, 1),
    (2, "Bob", 45, 20, 3),
    (3, "Charlie", 28, 5, 0)
]

drivers_columns_3308 = ["driver_id", "name", "age", "experience", "accidents"]
drivers_df_3308 = spark.createDataFrame(drivers_data_3308, drivers_columns_3308)
drivers_df_3308.show()

vehicles_data_3308 = [
    (100, 1, "Sedan", "Gasoline", 20000),
    (101, 2, "SUV", "Electric", 30000),
    (102, 3, "Coupe", "Gasoline", 15000)
]

vehicles_columns_3308 = ["vehicle_id", "driver_id", "model", "fuel_type", "mileage"]
vehicles_df_3308 = spark.createDataFrame(vehicles_data_3308, vehicles_columns_3308)
vehicles_df_3308.show()

trips_data_3308 = [
    (201, 100, 50, 30, 5),
    (202, 100, 30, 20, 4),
    (203, 101, 100, 60, 4),
    (204, 101, 80, 50, 5),
    (205, 102, 40, 30, 5),
    (206, 102, 60, 40, 5)
]

trips_columns_3308 = ["trip_id", "vehicle_id", "distance", "duration", "rating"]
trips_df_3308 = spark.createDataFrame(trips_data_3308, trips_columns_3308)
trips_df_3308.show()


+---------+-------+---+----------+---------+
|driver_id|   name|age|experience|accidents|
+---------+-------+---+----------+---------+
|        1|  Alice| 34|        10|        1|
|        2|    Bob| 45|        20|        3|
|        3|Charlie| 28|         5|        0|
+---------+-------+---+----------+---------+

+----------+---------+-----+---------+-------+
|vehicle_id|driver_id|model|fuel_type|mileage|
+----------+---------+-----+---------+-------+
|       100|        1|Sedan| Gasoline|  20000|
|       101|        2|  SUV| Electric|  30000|
|       102|        3|Coupe| Gasoline|  15000|
+----------+---------+-----+---------+-------+

+-------+----------+--------+--------+------+
|trip_id|vehicle_id|distance|duration|rating|
+-------+----------+--------+--------+------+
|    201|       100|      50|      30|     5|
|    202|       100|      30|      20|     4|
|    203|       101|     100|      60|     4|
|    204|       101|      80|      50|     5|
|    205|       102|      40|      30|     5|
|    206|       102|      60|      40|     5|
+-------+----------+--------+--------+------+

In [0]:
# Attach each trip to its vehicle to pick up driver_id and fuel_type.
trip_with_vehicle_3308 = trips_df_3308\
                            .join(vehicles_df_3308, "vehicle_id")

In [0]:
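# Per driver and fuel type: rounded average rating and total distance.
# round() and sum() here are the Spark functions from the wildcard import.
driver_stats_3308 = trip_with_vehicle_3308\
                        .groupBy("driver_id", "fuel_type")\
                        .agg(
                            round(avg("rating"), 2).alias("avg_rating"),
                            sum("distance").alias("total_distance")
                        )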

In [0]:
# Bring in each driver's accident count for the final tie-breaker.
driver_stats_3308 = driver_stats_3308\
                        .join(drivers_df_3308.select("driver_id", "accidents"), "driver_id")

In [0]:
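# Rank drivers within each fuel type:
# highest rating first, then longest distance, then fewest accidents.
window = Window.partitionBy("fuel_type") \
               .orderBy(
                   col("avg_rating").desc(),
                   col("total_distance").desc(),
                   col("accidents").asc()
               )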
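
Conceptually, `row_number()` over this window sorts the rows inside each `fuel_type` partition and numbers them from 1. A rough plain-Python emulation follows, assuming the aggregated per-driver rows derived from the example data; the `number_rows` helper is hypothetical and only illustrates the idea, not how Spark executes it.

```python
from itertools import groupby

# (fuel_type, driver_id, avg_rating, total_distance, accidents),
# derived from the example tables
driver_stats = [
    ("Gasoline", 1, 4.5, 80, 1),
    ("Electric", 2, 4.5, 180, 3),
    ("Gasoline", 3, 5.0, 100, 0),
]

def number_rows(rows):
    """Yield (row_number, row) per fuel_type partition, best driver first."""
    # Sort by partition key, then rating desc, distance desc, accidents asc.
    ordered = sorted(rows, key=lambda r: (r[0], -r[2], -r[3], r[4]))
    for _, partition in groupby(ordered, key=lambda r: r[0]):
        for rn, row in enumerate(partition, start=1):
            yield rn, row

# Keeping only row number 1 selects the top driver of each partition.
top_rows = [row for rn, row in number_rows(driver_stats) if rn == 1]
print(top_rows)
```

Unlike `rank()`, `row_number()` always yields exactly one row numbered 1 per partition, so drivers tied on all three criteria would be separated arbitrarily.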

In [0]:
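# Keep the top-ranked driver per fuel type and shape the output columns.
# display() is Databricks-specific; use show() in plain PySpark.
driver_stats_3308\
    .withColumn("rn", row_number().over(window))\
    .filter(col("rn") == 1)\
    .select(
        col("fuel_type"),
        col("driver_id"),
        col("avg_rating").alias("rating"),
        col("total_distance").alias("distance")
    ).orderBy("fuel_type").display()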

fuel_type,driver_id,rating,distance
Electric,2,4.5,180
Gasoline,3,5.0,100
