## Importing Libraries

In [0]:
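# Note: the wildcard import replaces Python builtins such as min and sum with Spark column functions.
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window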

**2142. The Number of Passengers in Each Bus I (Medium)**

**Table: Buses**

| Column Name  | Type |
|--------------|------|
| bus_id       | int  |
| arrival_time | int  |

bus_id is the column with unique values for this table.
Each row of this table contains information about the arrival time of a bus at the LeetCode station.
No two buses will arrive at the same time.
 
**Table: Passengers**

| Column Name  | Type |
|--------------|------|
| passenger_id | int  |
| arrival_time | int  |

passenger_id is the column with unique values for this table.
Each row of this table contains information about the arrival time of a passenger at the LeetCode station.
 
Buses and passengers arrive at the LeetCode station. If a bus arrives at the station at time `t_bus` and a passenger arrived at time `t_passenger`, where `t_passenger <= t_bus` and the passenger has not already caught an earlier bus, the passenger will use that bus.

**Write a solution to report the number of passengers that used each bus.**

Return the result table ordered by bus_id in ascending order.

The result format is in the following example.

**Example 1:**

**Input:** 

**Buses table:**

| bus_id | arrival_time |
|--------|--------------|
| 1      | 2            |
| 2      | 4            |
| 3      | 7            |

**Passengers table:**

| passenger_id | arrival_time |
|--------------|--------------|
| 11           | 1            |
| 12           | 5            |
| 13           | 6            |
| 14           | 7            |

**Output:**

| bus_id | passengers_cnt |
|--------|----------------|
| 1      | 1              |
| 2      | 0              |
| 3      | 3              |

**Explanation:** 
- Passenger 11 arrives at time 1.
- Bus 1 arrives at time 2 and collects passenger 11.
- Bus 2 arrives at time 4 and does not collect any passengers.
- Passenger 12 arrives at time 5.
- Passenger 13 arrives at time 6.
- Passenger 14 arrives at time 7.
- Bus 3 arrives at time 7 and collects passengers 12, 13, and 14.
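
The boarding rule above can be checked without Spark. The following is a minimal plain-Python sketch, not part of the original notebook: it sorts both lists by arrival time and lets each bus collect every passenger who is still waiting. The function name `count_passengers` and the tuple layout are assumptions made for illustration.

```python
def count_passengers(buses, passengers):
    """buses: list of (bus_id, arrival_time); passengers: list of (passenger_id, arrival_time)."""
    waiting = sorted(t for _, t in passengers)  # passenger arrival times, earliest first
    counts = {}
    i = 0  # index of the first passenger who has not boarded yet
    for bus_id, bus_time in sorted(buses, key=lambda b: b[1]):
        boarded = 0
        # Everyone still waiting who arrived at or before this bus gets on it.
        while i < len(waiting) and waiting[i] <= bus_time:
            boarded += 1
            i += 1
        counts[bus_id] = boarded
    return sorted(counts.items())  # ordered by bus_id


buses = [(1, 2), (2, 4), (3, 7)]
passengers = [(11, 1), (12, 5), (13, 6), (14, 7)]
print(count_passengers(buses, passengers))  # [(1, 1), (2, 0), (3, 3)]
```

On the example input this reproduces the expected output: bus 1 takes one passenger, bus 2 takes none, and bus 3 takes three.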

In [0]:
buses_data_2142 = [
    (1, 2),
    (2, 4),
    (3, 7)
]

buses_columns_2142 = ["bus_id", "arrival_time"]
buses_df_2142 = spark.createDataFrame(buses_data_2142, buses_columns_2142)
buses_df_2142.show()

passengers_data_2142 = [
    (11, 1),
    (12, 5),
    (13, 6),
    (14, 7)
]

passengers_columns_2142 = ["passenger_id", "arrival_time"]
passengers_df_2142 = spark.createDataFrame(passengers_data_2142, passengers_columns_2142)
passengers_df_2142.show()

+------+------------+
|bus_id|arrival_time|
+------+------------+
|     1|           2|
|     2|           4|
|     3|           7|
+------+------------+

+------------+------------+
|passenger_id|arrival_time|
+------------+------------+
|          11|           1|
|          12|           5|
|          13|           6|
|          14|           7|
+------------+------------+



In [0]:
# Non-equi join: pair each passenger with every bus that arrives at or after them.
joined_df_2142 = passengers_df_2142.alias("p")\
    .join(buses_df_2142.alias("b"), col("p.arrival_time") <= col("b.arrival_time"))

In [0]:
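# One partition per passenger, so aggregates over this window see only that passenger's candidate buses.
windowSpec = Window.partitionBy("p.passenger_id")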

In [0]:
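# For each passenger, keep only the earliest qualifying bus.
# `min` is pyspark.sql.functions.min (from the wildcard import), not the Python builtin.
passenger_bus_df_2142 = joined_df_2142\
    .withColumn("min_bus_time", min("b.arrival_time").over(windowSpec))\
    .where(col("b.arrival_time") == col("min_bus_time"))\
    .select("p.passenger_id", "b.bus_id")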
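
To make the join and window steps easier to follow, here is a rough plain-Python mirror of the same logic on the example data. It is only an illustration under the assumption that the data fits in memory; the names `pairs`, `first_bus`, and `passenger_bus` are hypothetical and do not appear in the notebook.

```python
buses = [(1, 2), (2, 4), (3, 7)]                     # (bus_id, arrival_time)
passengers = [(11, 1), (12, 5), (13, 6), (14, 7)]    # (passenger_id, arrival_time)

# Non-equi join: every (passenger, bus) pair where the passenger arrived no later than the bus.
pairs = [
    (p_id, b_id, b_time)
    for p_id, p_time in passengers
    for b_id, b_time in buses
    if p_time <= b_time
]

# Per passenger, keep the candidate bus with the smallest arrival time.
# Ties cannot occur because no two buses arrive at the same time.
first_bus = {}
for p_id, b_id, b_time in pairs:
    if p_id not in first_bus or b_time < first_bus[p_id][1]:
        first_bus[p_id] = (b_id, b_time)

passenger_bus = sorted((p_id, b_id) for p_id, (b_id, _) in first_bus.items())
print(passenger_bus)  # [(11, 1), (12, 3), (13, 3), (14, 3)]
```

Passenger 11 matches all three buses and keeps bus 1; the other passengers match only bus 3. A passenger who arrives after the last bus produces no pair at all and so drops out, as in the inner join above.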

In [0]:
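# Count the passengers assigned to each bus; buses that picked up nobody are absent from this result.
bus_counts_df_2142 = passenger_bus_df_2142\
    .groupBy("bus_id")\
    .agg(count("passenger_id").alias("passengers_cnt"))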

In [0]:
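# Left join keeps buses with no passengers; fillna turns their null count into 0.
buses_df_2142\
    .join(bus_counts_df_2142, "bus_id", "left").fillna(0, subset=["passengers_cnt"])\
    .orderBy("bus_id").select("bus_id", "passengers_cnt").show()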

+------+--------------+
|bus_id|passengers_cnt|
+------+--------------+
|     1|             1|
|     2|             0|
|     3|             3|
+------+--------------+
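
As a cross-check on the result, one alternative formulation (not the approach used in this notebook) counts, for each bus, all passengers who arrived at or before it, then subtracts the same running total for the previous bus. The sketch below does this in plain Python with `bisect`; the function name `count_by_difference` is an assumption for illustration.

```python
from bisect import bisect_right


def count_by_difference(buses, passengers):
    """buses: list of (bus_id, arrival_time); passengers: list of (passenger_id, arrival_time)."""
    times = sorted(t for _, t in passengers)
    result = []
    previous_total = 0
    for bus_id, bus_time in sorted(buses, key=lambda b: b[1]):
        # Number of passengers whose arrival time is <= this bus's arrival time.
        total = bisect_right(times, bus_time)
        # Those not already counted for an earlier bus board this one.
        result.append((bus_id, total - previous_total))
        previous_total = total
    return sorted(result)  # ordered by bus_id


print(count_by_difference([(1, 2), (2, 4), (3, 7)],
                          [(11, 1), (12, 5), (13, 6), (14, 7)]))
# [(1, 1), (2, 0), (3, 3)]
```

The running totals for the example are 1, 1, and 4, so the differences are 1, 0, and 3, matching the table above. This works because every waiting passenger boards the first bus that arrives at or after them.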

