This notebook loads the example data from the LeetCode SQL 50 problems into Spark DataFrames and solves each question with PySpark.

In [1]:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/09/20 22:02:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
def convert_pandas_to_spark_df(df: pd.DataFrame) -> "pyspark.sql.DataFrame":
    """Convert a pandas DataFrame to a Spark DataFrame using the active session."""
    return spark.createDataFrame(df)

## Q1) 1757. Recyclable and Low Fat Products

### Products Table

| Column Name | Type    |
|-------------|---------|
| product_id  | int     |
| low_fats    | enum    |
| recyclable  | enum    |

- `product_id` is the primary key (column with unique values) for this table.
- `low_fats` is an ENUM (category) of type ('Y', 'N') where 'Y' means this product is low fat and 'N' means it is not.
- `recyclable` is an ENUM (category) of types ('Y', 'N') where 'Y' means this product is recyclable and 'N' means it is not.

### Solution Requirement

Write a solution to find the ids of products that are both low fat and recyclable. Return the result table in any order.

### Example 1

#### Input

**Products Table:**

| product_id | low_fats | recyclable |
|------------|----------|------------|
| 0          | Y        | N          |
| 1          | Y        | Y          |
| 2          | N        | Y          |
| 3          | Y        | Y          |
| 4          | N        | N          |

#### Output

| product_id |
|------------|
| 1          |
| 3          |

### Explanation

Only products 1 and 3 are both low fat and recyclable.


In [18]:
data = [['0', 'Y', 'N'], ['1', 'Y', 'Y'], ['2', 'N', 'Y'], ['3', 'Y', 'Y'], ['4', 'N', 'N']]
products = pd.DataFrame(data, columns=['product_id', 'low_fats', 'recyclable']).astype({'product_id':'int64', 'low_fats':'category', 'recyclable':'category'})

In [19]:
q1_df = convert_pandas_to_spark_df(products)

In [20]:
q1_df.show()

+----------+--------+----------+
|product_id|low_fats|recyclable|
+----------+--------+----------+
|         0|       Y|         N|
|         1|       Y|         Y|
|         2|       N|         Y|
|         3|       Y|         Y|
|         4|       N|         N|
+----------+--------+----------+



In [21]:
res = q1_df \
    .filter((q1_df['low_fats'] == 'Y') & (q1_df['recyclable'] == 'Y')) \
    .select('product_id')

res.show()

+----------+
|product_id|
+----------+
|         1|
|         3|
+----------+



## Q2) 584. Find Customer Referee

### Customer Table

| Column Name | Type    |
|-------------|---------|
| id          | int     |
| name        | varchar |
| referee_id  | int     |

- In SQL, `id` is the primary key column for this table.
- Each row of this table indicates the id of a customer, their name, and the id of the customer who referred them.

### Solution Requirement

Find the names of the customers that are not referred by the customer with id = 2. Return the result table in any order.

### Example 1

#### Input

**Customer Table:**

| id | name | referee_id |
|----|------|------------|
| 1  | Will | null       |
| 2  | Jane | null       |
| 3  | Alex | 2          |
| 4  | Bill | null       |
| 5  | Zack | 1          |
| 6  | Mark | 2          |

#### Output

| name |
|------|
| Will |
| Jane |
| Bill |
| Zack |

### Explanation

Customers Will, Jane, Bill, and Zack are not referred by the customer with id = 2.


In [22]:
data = [[1, 'Will', None], [2, 'Jane', None], [3, 'Alex', 2], [4, 'Bill', None], [5, 'Zack', 1], [6, 'Mark', 2]]
customer = pd.DataFrame(data, columns=['id', 'name', 'referee_id']).astype({'id':'Int64', 'name':'object', 'referee_id':'Int64'})

In [23]:
q2_df = spark.createDataFrame(customer)

In [25]:
res = q2_df \
    .filter((q2_df['referee_id'] != 2) | q2_df['referee_id'].isNull()) \
    .select('name')

res.show()

+----+
|name|
+----+
|Will|
|Jane|
|Bill|
|Zack|
+----+
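
The null branch of this filter matters because, in SQL-style three-valued logic (which Spark also follows), comparing a null value with `!=` or `==` yields null rather than true, so such rows are dropped by a filter. The sketch below illustrates this on the same example data using Python's built-in `sqlite3` module; SQLite is an assumption made here purely so the snippet runs without a Spark session, and the table is rebuilt by hand from the example above.

```python
import sqlite3

# In-memory database mirroring the Customer example (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customer (id INTEGER PRIMARY KEY, name TEXT, referee_id INTEGER)"
)
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?, ?)",
    [(1, "Will", None), (2, "Jane", None), (3, "Alex", 2),
     (4, "Bill", None), (5, "Zack", 1), (6, "Mark", 2)],
)

# `referee_id != 2` evaluates to NULL for customers without a referee,
# so those rows are filtered out.
without_null_check = [row[0] for row in conn.execute(
    "SELECT name FROM Customer WHERE referee_id != 2 ORDER BY id"
)]

# An explicit IS NULL branch keeps customers who have no referee.
with_null_check = [row[0] for row in conn.execute(
    "SELECT name FROM Customer "
    "WHERE referee_id != 2 OR referee_id IS NULL ORDER BY id"
)]

print(without_null_check)  # ['Zack']
print(with_null_check)     # ['Will', 'Jane', 'Bill', 'Zack']
```

In PySpark, the counterpart of `IS NULL` is `Column.isNull()`.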



## Q3) 595. Big Countries

### World Table

| Column Name | Type    |
|-------------|---------|
| name        | varchar |
| continent   | varchar |
| area        | int     |
| population  | int     |
| gdp         | bigint  |

- `name` is the primary key (column with unique values) for this table.
- Each row of this table gives information about the name of a country, the continent to which it belongs, its area, the population, and its GDP value.

### Solution Requirement

A country is big if:
- It has an area of at least three million (i.e., 3000000 km²), or
- It has a population of at least twenty-five million (i.e., 25000000).

Write a solution to find the name, population, and area of the big countries. Return the result table in any order.

### Example 1

#### Input

**World Table:**

| name        | continent | area    | population | gdp          |
|-------------|-----------|---------|------------|--------------|
| Afghanistan | Asia      | 652230  | 25500100   | 20343000000  |
| Albania     | Europe    | 28748   | 2831741    | 12960000000  |
| Algeria     | Africa    | 2381741 | 37100000   | 188681000000 |
| Andorra     | Europe    | 468     | 78115      | 3712000000   |
| Angola      | Africa    | 1246700 | 20609294   | 100990000000 |

#### Output

| name        | population | area    |
|-------------|------------|---------|
| Afghanistan | 25500100   | 652230  |
| Algeria     | 37100000   | 2381741 |

### Explanation

Afghanistan has a population of 25,500,100, making it a big country. Algeria has a population of 37,100,000, also qualifying as a big country.


In [26]:
data = [['Afghanistan', 'Asia', 652230, 25500100, 20343000000], ['Albania', 'Europe', 28748, 2831741, 12960000000], ['Algeria', 'Africa', 2381741, 37100000, 188681000000], ['Andorra', 'Europe', 468, 78115, 3712000000], ['Angola', 'Africa', 1246700, 20609294, 100990000000]]
world = pd.DataFrame(data, columns=['name', 'continent', 'area', 'population', 'gdp']).astype({'name':'object', 'continent':'object', 'area':'Int64', 'population':'Int64', 'gdp':'Int64'})

In [27]:
q3_df = spark.createDataFrame(world)

In [29]:
res = q3_df \
    .filter((q3_df['area'] >= 3000000) | (q3_df['population'] >= 25000000)) \
    .select('name', 'population', 'area')

res.show()

+-----------+----------+-------+
|       name|population|   area|
+-----------+----------+-------+
|Afghanistan|  25500100| 652230|
|    Algeria|  37100000|2381741|
+-----------+----------+-------+



## Q4) 1148. Article Views I

### Views Table

| Column Name   | Type    |
|---------------|---------|
| article_id    | int     |
| author_id     | int     |
| viewer_id     | int     |
| view_date     | date    |

- There is no primary key (column with unique values) for this table; the table may have duplicate rows.
- Each row of this table indicates that some viewer viewed an article (written by some author) on some date. 
- Note that equal `author_id` and `viewer_id` indicate the same person.

### Solution Requirement

Write a solution to find all the authors that viewed at least one of their own articles. Return the result table sorted by id in ascending order.

### Example 1

#### Input

**Views Table:**

| article_id | author_id | viewer_id | view_date  |
|------------|-----------|-----------|------------|
| 1          | 3         | 5         | 2019-08-01 |
| 1          | 3         | 6         | 2019-08-02 |
| 2          | 7         | 7         | 2019-08-01 |
| 2          | 7         | 6         | 2019-08-02 |
| 4          | 7         | 1         | 2019-07-22 |
| 3          | 4         | 4         | 2019-07-21 |
| 3          | 4         | 4         | 2019-07-21 |

#### Output

| id   |
|------|
| 4    |
| 7    |

### Explanation

Author 4 viewed their own article, and author 7 also viewed at least one of their own articles.


In [30]:
data = [[1, 3, 5, '2019-08-01'], [1, 3, 6, '2019-08-02'], [2, 7, 7, '2019-08-01'], [2, 7, 6, '2019-08-02'], [4, 7, 1, '2019-07-22'], [3, 4, 4, '2019-07-21'], [3, 4, 4, '2019-07-21']]
views = pd.DataFrame(data, columns=['article_id', 'author_id', 'viewer_id', 'view_date']).astype({'article_id':'Int64', 'author_id':'Int64', 'viewer_id':'Int64', 'view_date':'datetime64[ns]'})

In [31]:
q4_df = spark.createDataFrame(views)

In [38]:
res = q4_df \
    .filter(q4_df['author_id'] == q4_df['viewer_id']) \
    .selectExpr('author_id as id') \
    .distinct() \
    .orderBy('id')

res.show()

+---+
| id|
+---+
|  4|
|  7|
+---+
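
The problem asks for each qualifying author once, in ascending order of id, so a solution needs both a de-duplication step and an explicit sort; a distinct step alone makes no ordering guarantee. As a cross-check, here is a minimal sketch of the equivalent SQL run through Python's built-in `sqlite3` module. SQLite is an assumption made for illustration so the snippet runs without Spark, and dates are stored as plain text.

```python
import sqlite3

# In-memory copy of the Views example (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Views "
    "(article_id INTEGER, author_id INTEGER, viewer_id INTEGER, view_date TEXT)"
)
conn.executemany(
    "INSERT INTO Views VALUES (?, ?, ?, ?)",
    [(1, 3, 5, "2019-08-01"), (1, 3, 6, "2019-08-02"),
     (2, 7, 7, "2019-08-01"), (2, 7, 6, "2019-08-02"),
     (4, 7, 1, "2019-07-22"), (3, 4, 4, "2019-07-21"),
     (3, 4, 4, "2019-07-21")],
)

# DISTINCT collapses the duplicate (3, 4, 4) rows; ORDER BY fixes the row order.
ids = [row[0] for row in conn.execute(
    "SELECT DISTINCT author_id AS id FROM Views "
    "WHERE author_id = viewer_id ORDER BY id"
)]

print(ids)  # [4, 7]
```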



## Q5) 1683. Invalid Tweets

### Tweets Table

| Column Name | Type    |
|-------------|---------|
| tweet_id    | int     |
| content     | varchar |

- `tweet_id` is the primary key (column with unique values) for this table.
- This table contains all the tweets in a social media app.

### Solution Requirement

Write a solution to find the IDs of the invalid tweets. The tweet is invalid if the number of characters used in the content of the tweet is strictly greater than 15. Return the result table in any order.

### Example 1

#### Input

**Tweets Table:**

| tweet_id | content                           |
|----------|-----------------------------------|
| 1        | Let us Code                       |
| 2        | More than fifteen chars are here! |

#### Output

| tweet_id |
|----------|
| 2        |

### Explanation

- Tweet 1 has length = 11. It is a valid tweet.
- Tweet 2 has length = 33. It is an invalid tweet.


In [39]:
data = [[1, 'Let us Code'], [2, 'More than fifteen chars are here!']]
tweets = pd.DataFrame(data, columns=['tweet_id', 'content']).astype({'tweet_id':'Int64', 'content':'object'})

In [40]:
q5_df = spark.createDataFrame(tweets)

In [44]:
res = q5_df.filter(F.length(q5_df['content']) > 15).select('tweet_id')

res.show()

+--------+
|tweet_id|
+--------+
|       2|
+--------+
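
The rule is stated in characters, and `F.length` counts characters for string columns. This distinction matters when porting the solution to MySQL, where `LENGTH()` counts bytes and `CHAR_LENGTH()` counts characters. The plain-Python sketch below applies the same rule to the example rows and shows how the two counts diverge for non-ASCII text; the `sample` string and the `MAX_CHARS` name are illustrative and not part of the problem data.

```python
# Example rows from the Tweets table.
tweets = [(1, "Let us Code"), (2, "More than fifteen chars are here!")]

MAX_CHARS = 15  # a tweet is invalid when strictly longer than this

# len() on a Python str counts characters, like F.length on a string column.
invalid_ids = [tweet_id for tweet_id, content in tweets if len(content) > MAX_CHARS]
print(invalid_ids)  # [2]

# Character count and UTF-8 byte count differ once non-ASCII text appears.
sample = "caf\u00e9"  # hypothetical content, not from the problem
char_count = len(sample)
byte_count = len(sample.encode("utf-8"))
print(char_count, byte_count)  # 4 5
```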



## Q6) 1378. Replace Employee ID With The Unique Identifier

### Employees Table

| Column Name | Type    |
|-------------|---------|
| id          | int     |
| name        | varchar |

- `id` is the primary key (column with unique values) for this table.
- Each row of this table contains the `id` and the name of an employee in a company.

### EmployeeUNI Table

| Column Name | Type    |
|-------------|---------|
| id          | int     |
| unique_id   | int     |

- `(id, unique_id)` is the primary key (combination of columns with unique values) for this table.
- Each row of this table contains the `id` and the corresponding unique id of an employee in the company.

### Solution Requirement

Write a solution to show the unique ID of each user. If a user does not have a unique ID, replace it with `null`. Return the result table in any order.

### Example 1

#### Input

**Employees Table:**

| id | name     |
|----|----------|
| 1  | Alice    |
| 7  | Bob      |
| 11 | Meir     |
| 90 | Winston  |
| 3  | Jonathan |

**EmployeeUNI Table:**

| id | unique_id |
|----|-----------|
| 3  | 1         |
| 11 | 2         |
| 90 | 3         |

#### Output

| unique_id | name     |
|-----------|----------|
| null      | Alice    |
| null      | Bob      |
| 2         | Meir     |
| 3         | Winston  |
| 1         | Jonathan |

### Explanation

- Alice and Bob do not have a unique ID; we will show `null` instead.
- The unique ID of Meir is 2.
- The unique ID of Winston is 3.
- The unique ID of Jonathan is 1.


In [46]:
data = [[1, 'Alice'], [7, 'Bob'], [11, 'Meir'], [90, 'Winston'], [3, 'Jonathan']]
employees = pd.DataFrame(data, columns=['id', 'name']).astype({'id':'int64', 'name':'object'})
data = [[3, 1], [11, 2], [90, 3]]
employee_uni = pd.DataFrame(data, columns=['id', 'unique_id']).astype({'id':'int64', 'unique_id':'int64'})

In [47]:
emp_df = spark.createDataFrame(employees)
emp_uni = spark.createDataFrame(employee_uni)

In [51]:
res = emp_df \
    .join(emp_uni, on='id', how='left') \
    .select('unique_id', 'name')

res.show()

+---------+--------+
|unique_id|    name|
+---------+--------+
|     NULL|   Alice|
|     NULL|     Bob|
|        2|    Meir|
|        3| Winston|
|        1|Jonathan|
+---------+--------+
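
A left join keeps every row of the left table and fills the right-hand columns with null where no match exists, which is why Alice and Bob appear with a null `unique_id`. For comparison, here is a minimal sketch of the same join in SQL, run through Python's built-in `sqlite3` module. SQLite is an assumption made for illustration, and the `ORDER BY` is added only to make the printed order deterministic.

```python
import sqlite3

# In-memory copies of the two example tables (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE EmployeeUNI "
    "(id INTEGER, unique_id INTEGER, PRIMARY KEY (id, unique_id))"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [(1, "Alice"), (7, "Bob"), (11, "Meir"), (90, "Winston"), (3, "Jonathan")],
)
conn.executemany(
    "INSERT INTO EmployeeUNI VALUES (?, ?)",
    [(3, 1), (11, 2), (90, 3)],
)

# LEFT JOIN keeps all employees; unmatched ones get NULL (None in Python).
rows = conn.execute(
    "SELECT u.unique_id, e.name "
    "FROM Employees AS e LEFT JOIN EmployeeUNI AS u ON e.id = u.id "
    "ORDER BY e.id"
).fetchall()

print(rows)
# [(None, 'Alice'), (1, 'Jonathan'), (None, 'Bob'), (2, 'Meir'), (3, 'Winston')]
```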



## Q7) 1068. Product Sales Analysis I

### Sales Table

| Column Name | Type  |
|-------------|-------|
| sale_id     | int   |
| product_id  | int   |
| year        | int   |
| quantity    | int   |
| price       | int   |

- `(sale_id, year)` is the primary key (combination of columns with unique values) of this table.
- `product_id` is a foreign key (reference column) to the Product table.
- Each row of this table shows a sale on the product `product_id` in a certain year. Note that the price is per unit.

### Product Table

| Column Name  | Type    |
|--------------|---------|
| product_id   | int     |
| product_name | varchar |

- `product_id` is the primary key (column with unique values) of this table.
- Each row of this table indicates the product name of each product.

### Solution Requirement

Write a solution to report the `product_name`, `year`, and `price` for each `sale_id` in the Sales table. Return the resulting table in any order.

### Example 1

#### Input

**Sales Table:**

| sale_id | product_id | year | quantity | price |
|---------|------------|------|----------|-------|
| 1       | 100        | 2008 | 10       | 5000  |
| 2       | 100        | 2009 | 12       | 5000  |
| 7       | 200        | 2011 | 15       | 9000  |

**Product Table:**

| product_id | product_name |
|------------|--------------|
| 100        | Nokia        |
| 200        | Apple        |
| 300        | Samsung      |

#### Output

| product_name | year  | price |
|--------------|-------|-------|
| Nokia        | 2008  | 5000  |
| Nokia        | 2009  | 5000  |
| Apple        | 2011  | 9000  |

### Explanation

- From `sale_id = 1`, we can conclude that Nokia was sold for 5000 in the year 2008.
- From `sale_id = 2`, we can conclude that Nokia was sold for 5000 in the year 2009.
- From `sale_id = 7`, we can conclude that Apple was sold for 9000 in the year 2011.


In [53]:
data = [[1, 100, 2008, 10, 5000], [2, 100, 2009, 12, 5000], [7, 200, 2011, 15, 9000]]
sales = pd.DataFrame(data, columns=['sale_id', 'product_id', 'year', 'quantity', 'price']).astype({'sale_id':'Int64', 'product_id':'Int64', 'year':'Int64', 'quantity':'Int64', 'price':'Int64'})
data = [[100, 'Nokia'], [200, 'Apple'], [300, 'Samsung']]
product = pd.DataFrame(data, columns=['product_id', 'product_name']).astype({'product_id':'Int64', 'product_name':'object'})

In [54]:
sales_df = spark.createDataFrame(sales)
product_df = spark.createDataFrame(product)

In [56]:
res = sales_df \
    .join(product_df, on='product_id', how='inner') \
    .select('product_name', 'year', 'price')

res.show()

+------------+----+-----+
|product_name|year|price|
+------------+----+-----+
|       Nokia|2008| 5000|
|       Nokia|2009| 5000|
|       Apple|2011| 9000|
+------------+----+-----+
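
Because every sale references an existing product, an inner join returns one row per sale, and products without any sales (Samsung in this example) do not appear in the result. The sketch below reproduces this with the equivalent SQL through Python's built-in `sqlite3` module. SQLite is an assumption made for illustration, and the `ORDER BY` is there only to make the printed order deterministic.

```python
import sqlite3

# In-memory copies of the Sales and Product examples (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales (sale_id INTEGER, product_id INTEGER, year INTEGER, "
    "quantity INTEGER, price INTEGER, PRIMARY KEY (sale_id, year))"
)
conn.execute(
    "CREATE TABLE Product (product_id INTEGER PRIMARY KEY, product_name TEXT)"
)
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?, ?, ?)",
    [(1, 100, 2008, 10, 5000), (2, 100, 2009, 12, 5000), (7, 200, 2011, 15, 9000)],
)
conn.executemany(
    "INSERT INTO Product VALUES (?, ?)",
    [(100, "Nokia"), (200, "Apple"), (300, "Samsung")],
)

# Inner join: only product_ids present in both tables survive.
rows = conn.execute(
    "SELECT p.product_name, s.year, s.price "
    "FROM Sales AS s JOIN Product AS p ON s.product_id = p.product_id "
    "ORDER BY s.sale_id"
).fetchall()

print(rows)
# [('Nokia', 2008, 5000), ('Nokia', 2009, 5000), ('Apple', 2011, 9000)]
```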

