## Importing Libraries

In [0]:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

**3328. Find Cities in Each State II (Medium)**

**Table: cities**

| Column Name | Type    |
|-------------|---------|
| state       | varchar |
| city        | varchar |

(state, city) is the combination of columns with unique values for this table.
Each row of this table contains the state name and the city name within that state.

**Write a solution to find all the cities in each state and analyze them based on the following requirements:**
- Combine all cities into a comma-separated string for each state.
- Only include states that have at least 3 cities.
- Only include states where at least one city starts with the same letter as the state name.

Return the result table ordered by the count of matching-letter cities in descending order and then by state name in ascending order.

The result format is in the following example.

**Example:**

**Input:**

**cities table:**

| state        | city          |
|--------------|---------------|
| New York     | New York City |
| New York     | Newark        |
| New York     | Buffalo       |
| New York     | Rochester     |
| California   | San Francisco |
| California   | Sacramento    |
| California   | San Diego     |
| California   | Los Angeles   |
| Texas        | Tyler         |
| Texas        | Temple        |
| Texas        | Taylor        |
| Texas        | Dallas        |
| Pennsylvania | Philadelphia  |
| Pennsylvania | Pittsburgh    |
| Pennsylvania | Pottstown     |

**Output:**

| state        | cities                                    | matching_letter_count |
|--------------|-------------------------------------------|-----------------------|
| Pennsylvania | Philadelphia, Pittsburgh, Pottstown       | 3                     |
| Texas        | Dallas, Taylor, Temple, Tyler             | 3                     |
| New York     | Buffalo, Newark, New York City, Rochester | 2                     |


**Explanation:**
- **Pennsylvania:**
  - Has 3 cities (meets minimum requirement)
  - All 3 cities start with 'P' (same as state)
  - matching_letter_count = 3
- **Texas:**
  - Has 4 cities (meets minimum requirement)
  - 3 cities (Taylor, Temple, Tyler) start with 'T' (same as state)
  - matching_letter_count = 3
- **New York:**
  - Has 4 cities (meets minimum requirement)
  - 2 cities (Newark, New York City) start with 'N' (same as state)
  - matching_letter_count = 2
- **California** is not included in the output because:
  - It has 4 cities (meets minimum requirement)
  - But no cities start with 'C' (doesn't meet the matching-letter requirement)

**Note:**
- Results are ordered by matching_letter_count in descending order
- When matching_letter_count is the same (Pennsylvania and Texas both have 3), they are ordered by state name alphabetically
- Cities in each row are ordered alphabetically
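
As a sanity check on the requirements above, here is a minimal plain-Python sketch (no Spark) that applies the same three rules to the example rows. The function name `summarize_cities` is hypothetical, and the case-insensitive first-letter comparison is an assumption, since the problem statement does not say whether matching is case-sensitive. Cities are sorted with Python's default string ordering, which may differ from the example output for names that contain spaces.

```python
from collections import defaultdict


def summarize_cities(rows):
    """Return (state, cities, matching_letter_count) tuples for qualifying states."""
    by_state = defaultdict(list)
    for state, city in rows:
        by_state[state].append(city)

    result = []
    for state, cities in by_state.items():
        # Assumption: the first-letter comparison is case-insensitive.
        matches = sum(1 for c in cities if c[:1].lower() == state[:1].lower())
        # Keep states with at least 3 cities and at least one matching city.
        if len(cities) >= 3 and matches > 0:
            result.append((state, ", ".join(sorted(cities)), matches))

    # Match count descending, then state name ascending.
    result.sort(key=lambda r: (-r[2], r[0]))
    return result


rows = [
    ("New York", "New York City"), ("New York", "Newark"),
    ("New York", "Buffalo"), ("New York", "Rochester"),
    ("California", "San Francisco"), ("California", "Sacramento"),
    ("California", "San Diego"), ("California", "Los Angeles"),
    ("Texas", "Tyler"), ("Texas", "Temple"),
    ("Texas", "Taylor"), ("Texas", "Dallas"),
    ("Pennsylvania", "Philadelphia"), ("Pennsylvania", "Pittsburgh"),
    ("Pennsylvania", "Pottstown"),
]

for row in summarize_cities(rows):
    print(row)
```

California is dropped because none of its cities start with 'C', and Pennsylvania precedes Texas because both have a count of 3 and ties are broken by state name.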

In [0]:
cities_data_3328 = [
    ("New York", "New York City"),
    ("New York", "Newark"),
    ("New York", "Buffalo"),
    ("New York", "Rochester"),
    ("California", "San Francisco"),
    ("California", "Sacramento"),
    ("California", "San Diego"),
    ("California", "Los Angeles"),
    ("Texas", "Tyler"),
    ("Texas", "Temple"),
    ("Texas", "Taylor"),
    ("Texas", "Dallas"),
    ("Pennsylvania", "Philadelphia"),
    ("Pennsylvania", "Pittsburgh"),
    ("Pennsylvania", "Pottstown"),
]

cities_columns_3328 = ["state", "city"]
cities_df_3328 = spark.createDataFrame(cities_data_3328, cities_columns_3328)
cities_df_3328.show()

+------------+-------------+
|       state|         city|
+------------+-------------+
|    New York|New York City|
|    New York|       Newark|
|    New York|      Buffalo|
|    New York|    Rochester|
|  California|San Francisco|
|  California|   Sacramento|
|  California|    San Diego|
|  California|  Los Angeles|
|       Texas|        Tyler|
|       Texas|       Temple|
|       Texas|       Taylor|
|       Texas|       Dallas|
|Pennsylvania| Philadelphia|
|Pennsylvania|   Pittsburgh|
|Pennsylvania|    Pottstown|
+------------+-------------+



In [0]:
# Collect each state's cities into a single array, sorted in ascending order.
agg_df_3328 = cities_df_3328\
    .groupBy("state")\
    .agg(array_sort(collect_list("city")).alias("cities_list"))

In [0]:
# Count the cities whose first letter matches the state's first letter (case-insensitive).
agg_df_3328 = agg_df_3328\
    .withColumn(
        "matching_letter_count",
        expr("size(filter(cities_list, x -> lower(substring(x, 1, 1)) = lower(substring(state, 1, 1))))")
    )
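
The expression in this cell uses Spark SQL's higher-order `filter` function with a lambda and then takes the `size` of the filtered array. A rough plain-Python equivalent for a single row might look like the sketch below. The helper name `matching_letter_count` is hypothetical and only mirrors the column name.

```python
def matching_letter_count(state, cities_list):
    # Spark's substring(x, 1, 1) is 1-indexed; x[:1] is the Python counterpart.
    first = state[:1].lower()
    # Equivalent of size(filter(cities_list, x -> ...)): keep matches, then count them.
    return len([x for x in cities_list if x[:1].lower() == first])


print(matching_letter_count("Texas", ["Dallas", "Taylor", "Temple", "Tyler"]))
print(matching_letter_count(
    "California", ["Los Angeles", "Sacramento", "San Diego", "San Francisco"]
))
```

The first call counts Taylor, Temple, and Tyler; the second finds no city starting with 'C', which is why California is later filtered out.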

In [0]:
# Keep states with at least 3 cities and at least one matching-letter city.
agg_df_3328 = agg_df_3328\
    .withColumn("num_cities", size(col("cities_list")))\
    .filter((col("num_cities") >= 3) & (col("matching_letter_count") > 0))

In [0]:
# Join the sorted array into one comma-separated string.
agg_df_3328 = agg_df_3328\
    .withColumn("cities", concat_ws(", ", col("cities_list")))

In [0]:
# Order by match count (descending), then by state name (ascending).
agg_df_3328\
    .select("state", "cities", "matching_letter_count")\
    .orderBy(col("matching_letter_count").desc(), col("state").asc()).display()

state,cities,matching_letter_count
Pennsylvania,"Philadelphia, Pittsburgh, Pottstown",3
Texas,"Dallas, Taylor, Temple, Tyler",3
New York,"Buffalo, New York City, Newark, Rochester",2
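
Note that New York's cities appear here as `New York City, Newark`, while the example output in the problem statement lists `Newark` first. `array_sort` orders strings by their natural (binary) ordering, in which the space character sorts before every letter. The plain-Python sketch below shows the same effect. The space-insensitive key is only a guess at a rule that reproduces the example's order, not something the problem statement specifies.

```python
cities = ["New York City", "Newark", "Buffalo", "Rochester"]

# Default ordering compares code points: ' ' (32) sorts before 'a' (97),
# so "New York City" comes before "Newark".
code_point_order = sorted(cities)
print(code_point_order)
# ['Buffalo', 'New York City', 'Newark', 'Rochester']

# Assumption: ignoring spaces and case reproduces the example's ordering.
space_insensitive = sorted(cities, key=lambda s: s.replace(" ", "").lower())
print(space_insensitive)
# ['Buffalo', 'Newark', 'New York City', 'Rochester']
```

Whether the difference matters depends on how strictly the expected string is compared. The set of cities and the counts are the same either way.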
