<a href="https://colab.research.google.com/github/ankitarm/SQL_Data_Engineer/blob/main/SQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🔗[LeetCode 176 : Second Highest Salary](https://leetcode.com/problems/second-highest-salary/)


---
 🔹 Problem Statement

**Table: Employee**

| Column Name | Type |
|-------------|------|
| id          | int  |
| salary      | int  |

- `id` is the primary key.
- Each row in this table contains the salary of an employee.

---

 ✏️ Task

Write a SQL query to find the **second highest distinct salary** from the `Employee` table.  
If there is no second highest salary, return `null`.

---

 📥 Example 1

**Input:**

**Employee**

| id | salary |
|----|--------|
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |

**Output:**

| SecondHighestSalary |
|---------------------|
| 200                 |

---

 📥 Example 2

**Input:**

**Employee**

| id | salary |
|----|--------|
| 1  | 100    |

**Output:**

| SecondHighestSalary |
|---------------------|
| null                |

---

 ✅ SQL Solution

 - `OFFSET 1` skips the highest salary and `LIMIT 1` keeps the single row after it.
 - The outer `SELECT (...)` returns `NULL` if that entry doesn't exist.
```sql
SELECT (
    SELECT DISTINCT salary
    FROM Employee
    ORDER BY salary DESC
    LIMIT 1 OFFSET 1
) AS SecondHighestSalary;
```

 - The query below is **incorrect**: without the outer `SELECT`, it returns an empty result instead of `NULL` when the entry doesn't exist.
 ```sql
SELECT DISTINCT salary AS SecondHighestSalary
FROM Employee
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
```
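
The difference between the two queries can be checked locally. The sketch below is not part of the LeetCode solution; it assumes Python's built-in `sqlite3` module (as available in a Colab runtime) and loads the single-row table from Example 2. SQLite accepts the same `LIMIT ... OFFSET` syntax as MySQL here.

```python
import sqlite3

# In-memory SQLite database; table and column names follow the problem statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (id INTEGER PRIMARY KEY, salary INTEGER)")
conn.execute("INSERT INTO Employee VALUES (1, 100)")  # only one salary, as in Example 2

# Scalar subquery: an empty inner result becomes a single NULL value.
wrapped = conn.execute("""
    SELECT (
        SELECT DISTINCT salary
        FROM Employee
        ORDER BY salary DESC
        LIMIT 1 OFFSET 1
    ) AS SecondHighestSalary
""").fetchall()

# Same query without the wrapper: an empty result stays empty.
bare = conn.execute("""
    SELECT DISTINCT salary AS SecondHighestSalary
    FROM Employee
    ORDER BY salary DESC
    LIMIT 1 OFFSET 1
""").fetchall()

print(wrapped)  # [(None,)]  -> one row holding NULL
print(bare)     # []         -> no rows at all
```

The wrapped form always yields exactly one row, which is what the expected output table requires.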

# 🔗[LeetCode 177 : Nth Highest Salary](https://leetcode.com/problems/nth-highest-salary/)

---

 🔹 Problem Statement

**Table: Employee**

| Column Name | Type |
| ----------- | ---- |
| id          | int  |
| salary      | int  |

* `id` is the primary key.
* Each row in this table contains the salary of an employee.

---

 ✏️ Task

Write a SQL query to find the **nth highest distinct salary** from the `Employee` table.
If there are fewer than `n` distinct salaries, return `null`.

---

 📥 Example 1

**Input:**

**Employee**

| id | salary |
| -- | ------ |
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |

n = 2

**Output:**

| getNthHighestSalary(2) |
| ---------------------- |
| 200                    |

---

 📥 Example 2

**Input:**

**Employee**

| id | salary |
| -- | ------ |
| 1  | 100    |

n = 2

**Output:**

| getNthHighestSalary(2) |
| ---------------------- |
| null                   |

---

 ✅ SQL Solution

* Uses `LIMIT` and `OFFSET` to skip the top (n-1) salaries and fetch the nth one
* `SELECT (...)` ensures it returns `null` when nth salary doesn’t exist

```sql
CREATE FUNCTION getNthHighestSalary(N INT) RETURNS INT
BEGIN
  SET N = N - 1;
  RETURN (
    SELECT DISTINCT salary
    FROM Employee
    ORDER BY salary DESC
    LIMIT 1 OFFSET N
  );
END
```

---

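- `OFFSET N - 1` cannot be used directly: MySQL's `LIMIT`/`OFFSET` accepts only integer constants or variables, not expressions, which is why the function above decrements `N` first.
- The output column name must be `getNthHighestSalary(n)`, which implies the query has to be wrapped in a function.
- The following query is therefore invalid in MySQL:
```sql
SELECT DISTINCT salary
FROM Employee
ORDER BY salary DESC
LIMIT 1 OFFSET N - 1;
```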

- `DENSE_RANK()` version:
```sql
CREATE FUNCTION getNthHighestSalary(N INT) RETURNS INT
BEGIN
  RETURN (
    SELECT salary
    FROM (
      SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
      FROM Employee
    ) ranked
    WHERE rnk = N
    LIMIT 1
  );
END
```


| Data Size        | Recommended Query                  | Notes                            |
| ---------------- | ---------------------------------- | -------------------------------- |
| Small (< 100 GB) | `LIMIT` + `OFFSET`                 | Fast and lightweight             |
| Large (> 100 GB) | `DENSE_RANK()` via CTE or subquery | Scalable, better for performance |
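
SQLite has no stored functions, so the following sketch uses a hypothetical Python helper, `nth_highest_salary`, as a stand-in for `getNthHighestSalary(N)`. It assumes Python's built-in `sqlite3` module and the Example 1 data. The offset is computed outside the SQL text, mirroring `SET N = N - 1`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (id INTEGER PRIMARY KEY, salary INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)", [(1, 100), (2, 200), (3, 300)])


def nth_highest_salary(n):
    """Hypothetical stand-in for MySQL's getNthHighestSalary(N)."""
    if n < 1:
        # Assumption: treat a non-positive n as "no such salary".
        return None
    row = conn.execute(
        """
        SELECT (
            SELECT DISTINCT salary
            FROM Employee
            ORDER BY salary DESC
            LIMIT 1 OFFSET ?
        )
        """,
        (n - 1,),  # skip the top n-1 distinct salaries
    ).fetchone()
    return row[0]


print([nth_highest_salary(n) for n in (1, 2, 3, 4)])  # [300, 200, 100, None]
```

Asking for the 4th salary in a table with three distinct salaries returns `None`, the Python equivalent of SQL `NULL`.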




# 🔗[LeetCode 178 : Rank Scores](https://leetcode.com/problems/rank-scores/)



---

 🔹 Problem Statement

**Table: Scores**

| Column Name | Type    |
| ----------- | ------- |
| id          | int     |
| score       | decimal |

* `id` is the primary key.
* Each row contains a score from a game.
* `score` is a floating point number with two decimal places.

---

 ✏️ Task

Write a SQL query to rank scores from highest to lowest using the following rules:

* Scores are ranked in **descending order**.
* If two scores are the same, they receive the **same rank**.
* After a tie, the next rank should be the **next integer** (no gaps).

Return the result table **ordered by score descending**.

---

 📥 Example

**Input:**

**Scores**

| id | score |
| -- | ----- |
| 1  | 3.50  |
| 2  | 3.65  |
| 3  | 4.00  |
| 4  | 3.85  |
| 5  | 4.00  |

**Output:**

| score | rank |
| ----- | ---- |
| 4.00  | 1    |
| 4.00  | 1    |
| 3.85  | 2    |
| 3.65  | 3    |
| 3.50  | 4    |

---

 ✅ SQL Solution

* Uses `DENSE_RANK()` to handle ties with no gaps.
* Orders results by `score` descending.

```sql
SELECT score,
       DENSE_RANK() OVER (ORDER BY score DESC) AS `rank`
FROM Scores
ORDER BY score DESC;
```
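
To illustrate why `DENSE_RANK()` is the right function for the "no gaps" rule, here is a small sketch that computes `DENSE_RANK()` and `RANK()` side by side on the example data. It assumes Python's built-in `sqlite3` module with a SQLite version that supports window functions (3.25 or later); the column aliases `dense_rnk` and `plain_rnk` are made up for this comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Scores (id INTEGER PRIMARY KEY, score REAL)")
conn.executemany(
    "INSERT INTO Scores VALUES (?, ?)",
    [(1, 3.50), (2, 3.65), (3, 4.00), (4, 3.85), (5, 4.00)],
)

result = conn.execute("""
    SELECT score,
           DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk,
           RANK()       OVER (ORDER BY score DESC) AS plain_rnk
    FROM Scores
    ORDER BY score DESC
""").fetchall()

for score, dense_rnk, plain_rnk in result:
    print(score, dense_rnk, plain_rnk)
# dense_rnk runs 1, 1, 2, 3, 4 (no gap after the tie)
# plain_rnk runs 1, 1, 3, 4, 5 (rank 2 is skipped)
```

After the tie at 4.00, `RANK()` jumps to 3, while `DENSE_RANK()` continues with 2 as the task requires.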

---



# 🔗[LeetCode 180 : Consecutive Numbers](https://leetcode.com/problems/consecutive-numbers/)

---

 🔹 Problem Statement

**Table: Logs**

| Column Name | Type |
| ----------- | ---- |
| id          | int  |
| num         | int  |

* `id` is the unique identifier representing order.
* `num` is the number recorded.

---

 ✏️ Task

Find all numbers that appear **at least 3 times consecutively** in the `Logs` table.

---

 📥 Example

**Input:**

| id | num |
| -- | --- |
| 1  | 1   |
| 2  | 1   |
| 3  | 1   |
| 4  | 2   |
| 5  | 2   |
| 6  | 3   |

**Output:**

| ConsecutiveNums |
| --------------- |
| 1               |

---

 ✅ SQL Solutions

---

 Solution 1: Using Self-JOINs

```sql
SELECT DISTINCT L1.Num AS ConsecutiveNums
FROM Logs L1
JOIN Logs L2 ON L1.Id = L2.Id - 1
JOIN Logs L3 ON L1.Id = L3.Id - 2
WHERE L1.Num = L2.Num AND L2.Num = L3.Num;
```

---

 Solution 2: Using Window Functions (`LEAD` and `LAG`)

```sql
WITH L_L AS (
    SELECT num AS ConsecutiveNums,
           LEAD(num) OVER (ORDER BY id) AS Lead_val,
           LAG(num) OVER (ORDER BY id) AS Lag_val
    FROM Logs
)
SELECT DISTINCT ConsecutiveNums
FROM L_L
WHERE Lead_val = ConsecutiveNums AND Lag_val = ConsecutiveNums;
```

---

 ⚡ Performance & Optimization Notes

| Approach             | Best For                       | Why?                                                                                                                                 |
| -------------------- | ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------ |
| **Self-JOINs**       | Smaller datasets (< 100K rows) | Simple joins work well with smaller data, easy to understand.                                                                        |
| **Window Functions** | Larger datasets (100K+ rows)   | Window functions are optimized for sequential data processing and avoid multiple joins, better for big data and distributed systems. |

---

 🔑 Summary

* For **small datasets**, Self-JOINs are easy and fast enough.
* For **large datasets**, Window Functions (`LEAD`/`LAG`) scale better, reduce computation, and are preferred in modern SQL engines.
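
Besides performance, the two approaches differ in one behavioral detail: the self-join matches rows by `id` arithmetic, while `LEAD`/`LAG` only rely on the ordering of `id`. The sketch below runs both queries on the example data and on a hypothetical table where `id` 3 is missing. It assumes Python's built-in `sqlite3` module (SQLite 3.25+ for window functions); `run_both` is a helper name invented for this example.

```python
import sqlite3

SELF_JOIN = """
    SELECT DISTINCT l1.num AS ConsecutiveNums
    FROM Logs l1
    JOIN Logs l2 ON l1.id = l2.id - 1
    JOIN Logs l3 ON l1.id = l3.id - 2
    WHERE l1.num = l2.num AND l2.num = l3.num
"""

WINDOW = """
    WITH L_L AS (
        SELECT num AS ConsecutiveNums,
               LEAD(num) OVER (ORDER BY id) AS Lead_val,
               LAG(num)  OVER (ORDER BY id) AS Lag_val
        FROM Logs
    )
    SELECT DISTINCT ConsecutiveNums
    FROM L_L
    WHERE Lead_val = ConsecutiveNums AND Lag_val = ConsecutiveNums
"""


def run_both(rows):
    """Load rows into a fresh Logs table and return (self-join result, window result)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Logs (id INTEGER PRIMARY KEY, num INTEGER)")
    conn.executemany("INSERT INTO Logs VALUES (?, ?)", rows)
    join_result = [r[0] for r in conn.execute(SELF_JOIN)]
    window_result = [r[0] for r in conn.execute(WINDOW)]
    conn.close()
    return join_result, window_result


sample = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3)]  # example from the problem
gapped = [(1, 1), (2, 1), (4, 1)]                          # hypothetical: id 3 is missing

print(run_both(sample))  # ([1], [1])
print(run_both(gapped))  # ([], [1])
```

Which answer is correct for the gapped table depends on whether "consecutive" means consecutive `id` values or consecutive rows, so it is worth stating that assumption when choosing between the two.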



# 🔗[LeetCode 181 : Employees Earning More Than Their Managers](https://leetcode.com/problems/employees-earning-more-than-their-managers/)

---

 🔹 Problem Statement

**Table: Employee**

| Column Name | Type    |
| ----------- | ------- |
| id          | int     |
| name        | varchar |
| salary      | int     |
| managerId   | int     |

* `id` is the primary key.
* `managerId` is the id of the employee’s manager.
* Each row contains an employee’s information including their salary and manager.

---

 ✏️ Task

Write a SQL query to find the names of employees who earn more than their managers.

---

 📥 Example

**Input:**

**Employee**

| id | name  | salary | managerId |
| -- | ----- | ------ | --------- |
| 1  | Joe   | 70000  | 3         |
| 2  | Henry | 80000  | 4         |
| 3  | Sam   | 60000  | NULL      |
| 4  | Max   | 90000  | NULL      |
| 5  | Janet | 69000  | 3         |
| 6  | Randy | 85000  | 4         |

**Output:**

| name  |
| ----- |
| Joe   |
| Janet |

---

 ✅ SQL Solution

* Self-join `Employee` table to compare employee salaries with their manager’s salaries.

```sql
SELECT e.name
FROM Employee e
JOIN Employee m ON e.managerId = m.id
WHERE e.salary > m.salary;
```
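
As a quick check of the self-join, the sketch below loads the example rows and first lists every employee/manager pair the join produces, then applies the salary filter. It assumes Python's built-in `sqlite3` module; the variable names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, managerId INTEGER)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [
        (1, "Joe", 70000, 3),
        (2, "Henry", 80000, 4),
        (3, "Sam", 60000, None),
        (4, "Max", 90000, None),
        (5, "Janet", 69000, 3),
        (6, "Randy", 85000, 4),
    ],
)

# Every (employee, manager) pair; rows with a NULL managerId drop out of the inner join.
pairs = conn.execute("""
    SELECT e.name, e.salary, m.name, m.salary
    FROM Employee e
    JOIN Employee m ON e.managerId = m.id
    ORDER BY e.id
""").fetchall()

# The actual solution: keep only pairs where the employee earns more.
earners = [r[0] for r in conn.execute("""
    SELECT e.name
    FROM Employee e
    JOIN Employee m ON e.managerId = m.id
    WHERE e.salary > m.salary
""")]

print(pairs)
print(sorted(earners))  # ['Janet', 'Joe']
```

Joe and Janet both report to Sam (60000) and earn more; Henry and Randy report to Max (90000) and earn less.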

---


# 🔗[LeetCode 182 : Duplicate Emails](https://leetcode.com/problems/duplicate-emails/)

---

 🔹 Problem Statement

**Table: Person**

| Column Name | Type    |
| ----------- | ------- |
| id          | int     |
| email       | varchar |

* `id` is the primary key.
* Each row contains the email of a person.

---

 ✏️ Task

Write a SQL query to find all **duplicate emails** in the `Person` table.
Return the emails that appear **more than once**.

---

 📥 Example

**Input:**

**Person**

| id | email            |
| -- | ---------------- |
| 1  | john@example.com |
| 2  | bob@example.com  |
| 3  | john@example.com |

**Output:**

| Email            |
| ---------------- |
| john@example.com |

---

 ✅ SQL Solution

* Use `GROUP BY` and `HAVING` to find emails appearing more than once.

```sql
SELECT email
FROM Person
GROUP BY email
HAVING COUNT(email) > 1;
```
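
One detail of `HAVING COUNT(email) > 1` is that `COUNT(column)` ignores `NULL`s. The sketch below, assuming Python's built-in `sqlite3` module, adds two hypothetical rows with a `NULL` email to the example data and compares `COUNT(email)` with `COUNT(*)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?)",
    [
        (1, "john@example.com"),
        (2, "bob@example.com"),
        (3, "john@example.com"),
        (4, None),  # hypothetical rows with a missing email,
        (5, None),  # not part of the original example
    ],
)

by_column = conn.execute(
    "SELECT email FROM Person GROUP BY email HAVING COUNT(email) > 1"
).fetchall()
by_star = conn.execute(
    "SELECT email FROM Person GROUP BY email HAVING COUNT(*) > 1"
).fetchall()

print(by_column)  # only john@example.com
print(by_star)    # john@example.com and the NULL group
```

The LeetCode table contains no `NULL` emails, so both forms behave the same there; the difference only matters on real data with missing values.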

---

 ⚙️ Behavior on Different Dataset Sizes

| Dataset Size                | Behavior & Considerations                                                                                                                                                                                                                                  |
| --------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **< 100 GB (Small/Medium)** | Query runs efficiently if there's an index on `email`. `GROUP BY` and aggregation execute quickly. Can be handled in-memory or with modest disk I/O.                                                                                                       |
| **> 100 GB (Large)**        | Requires distributed processing or optimized execution. Without indexes, grouping large datasets can be expensive (high I/O, memory usage). Use partitioning, indexing, or distributed SQL engines (e.g., Hive, Presto, Spark SQL) to improve performance. |

---



#  🔗[LeetCode 183 : Customers Who Never Order](https://leetcode.com/problems/customers-who-never-order/)

---

 🔹 Problem Statement

**Table: Customers**

| Column Name | Type    |
| ----------- | ------- |
| id          | int     |
| name        | varchar |

* `id` is the primary key for this table.
* Each row represents a customer.

**Table: Orders**

| Column Name | Type |
| ----------- | ---- |
| id          | int  |
| customerId  | int  |

* `id` is the primary key for this table.
* `customerId` is a foreign key to `Customers.id`.

---

 ✏️ Task

Write a SQL query to find all customers who **never placed an order**.

---

 📥 Example

**Input:**

**Customers**

| id | name  |
| -- | ----- |
| 1  | Joe   |
| 2  | Henry |
| 3  | Sam   |
| 4  | Max   |

**Orders**

| id | customerId |
| -- | ---------- |
| 1  | 3          |
| 2  | 1          |

**Output:**

| Customers |
| --------- |
| Henry     |
| Max       |

---

 ✅ SQL Solutions

 🔹 Solution 1: `LEFT JOIN` with `IS NULL`

* Joins all customers with orders, and filters where no matching order exists.

```sql
SELECT c.name AS Customers
FROM Customers c
LEFT JOIN Orders o ON c.id = o.customerId
WHERE o.id IS NULL;
```

---

 🔹 Solution 2: `NOT IN` subquery

* Selects customers whose IDs are **not in** the list of customer IDs in Orders.

```sql
SELECT name AS Customers
FROM Customers
WHERE id NOT IN (
    SELECT customerId FROM Orders
);
```

⚠️ *Note:* If any `customerId` in `Orders` is `NULL`, `id NOT IN (...)` never evaluates to true, so the query returns no rows at all. Use `NOT EXISTS` for safer logic.

---

 🔹 Solution 3: `NOT EXISTS` correlated subquery

* Checks for absence of matching records using a correlated subquery.
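* `SELECT 1` is only a placeholder: `EXISTS` checks whether the subquery returns any row, so a customer's name is returned only when no matching order is found.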


```sql
SELECT name AS Customers
FROM Customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM Orders o
    WHERE o.customerId = c.id
);
```

✅ *Best practice when working with large datasets or when `NULL`s might exist.*
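
The `NULL` pitfall of `NOT IN` is easy to reproduce. The following sketch assumes Python's built-in `sqlite3` module and adds one hypothetical order with a `NULL` `customerId` to the example data; the foreign key is left undeclared to keep the setup short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE Orders (id INTEGER PRIMARY KEY, customerId INTEGER)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [(1, "Joe"), (2, "Henry"), (3, "Sam"), (4, "Max")],
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [(1, 3), (2, 1), (3, None)],  # order 3 is a hypothetical row with a NULL customerId
)

not_in = [r[0] for r in conn.execute("""
    SELECT name AS Customers
    FROM Customers
    WHERE id NOT IN (SELECT customerId FROM Orders)
""")]

not_exists = [r[0] for r in conn.execute("""
    SELECT name AS Customers
    FROM Customers c
    WHERE NOT EXISTS (
        SELECT 1 FROM Orders o WHERE o.customerId = c.id
    )
""")]

print(not_in)              # []
print(sorted(not_exists))  # ['Henry', 'Max']
```

With the `NULL` present, `2 NOT IN (3, 1, NULL)` evaluates to unknown rather than true, so `NOT IN` filters out every customer, while `NOT EXISTS` still returns Henry and Max.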

---


#  🔗[LeetCode 184 : Department Highest Salary](https://leetcode.com/problems/department-highest-salary/)

---

 🔹 Problem Statement

**Table: Employee**

| Column Name  | Type    |
| ------------ | ------- |
| id           | int     |
| name         | varchar |
| salary       | int     |
| departmentId | int     |

**Table: Department**

| Column Name | Type    |
| ----------- | ------- |
| id          | int     |
| name        | varchar |

* `id` is the primary key in both tables.
* `departmentId` in `Employee` is a foreign key to `Department.id`.

---

 ✏️ Task

Write a SQL query to find employees who have the **highest salary** in each department.
Return the department name, employee name, and salary.

---

 📥 Example

**Input:**

**Employee**

| id | name  | salary | departmentId |
| -- | ----- | ------ | ------------ |
| 1  | Joe   | 70000  | 1            |
| 2  | Jim   | 90000  | 1            |
| 3  | Henry | 80000  | 2            |
| 4  | Sam   | 60000  | 2            |
| 5  | Max   | 90000  | 1            |

**Department**

| id | name  |
| -- | ----- |
| 1  | IT    |
| 2  | Sales |

**Output:**

| Department | Employee | Salary |
| ---------- | -------- | ------ |
| IT         | Jim      | 90000  |
| IT         | Max      | 90000  |
| Sales      | Henry    | 80000  |

---

 ✅ SQL Solutions

 🔹 Solution 1: Correlated Subquery

```sql
SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary
FROM Employee e
JOIN Department d ON e.departmentId = d.id
WHERE e.salary = (
    SELECT MAX(salary)
    FROM Employee
    WHERE departmentId = e.departmentId
);
```

✅ **Important Points:**

* Uses a **correlated subquery** to find the max salary per department.
* Ensures employees with **tied top salaries** are also included.
* Simple and readable; works well for moderate datasets.

---

 🔹 Solution 2: Common Table Expression (CTE) with `DENSE_RANK()`

```sql
WITH RankedSalaries AS (
    SELECT e.name AS Employee, e.salary AS Salary, d.name AS Department,
           DENSE_RANK() OVER (PARTITION BY e.departmentId ORDER BY e.salary DESC) AS rnk
    FROM Employee e
    JOIN Department d ON e.departmentId = d.id
)
SELECT Department, Employee, Salary
FROM RankedSalaries
WHERE rnk = 1;
```

✅ **Important Points:**

* Uses `DENSE_RANK()` to rank salaries within each department.
* Cleaner and more powerful if you plan to extend logic (e.g., top 3 salaries).
* Handles ties automatically.
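
Both solutions can be checked against each other on the example data. This is a sketch assuming Python's built-in `sqlite3` module (SQLite 3.25+ for window functions); the constant names `CORRELATED` and `RANKED` are made up for this comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Department (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, departmentId INTEGER)"
)
conn.executemany("INSERT INTO Department VALUES (?, ?)", [(1, "IT"), (2, "Sales")])
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [
        (1, "Joe", 70000, 1),
        (2, "Jim", 90000, 1),
        (3, "Henry", 80000, 2),
        (4, "Sam", 60000, 2),
        (5, "Max", 90000, 1),
    ],
)

CORRELATED = """
    SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary
    FROM Employee e
    JOIN Department d ON e.departmentId = d.id
    WHERE e.salary = (
        SELECT MAX(salary) FROM Employee WHERE departmentId = e.departmentId
    )
"""

RANKED = """
    WITH RankedSalaries AS (
        SELECT e.name AS Employee, e.salary AS Salary, d.name AS Department,
               DENSE_RANK() OVER (PARTITION BY e.departmentId ORDER BY e.salary DESC) AS rnk
        FROM Employee e
        JOIN Department d ON e.departmentId = d.id
    )
    SELECT Department, Employee, Salary
    FROM RankedSalaries
    WHERE rnk = 1
"""

# Compare as sets because neither query guarantees a row order.
correlated_rows = set(conn.execute(CORRELATED).fetchall())
ranked_rows = set(conn.execute(RANKED).fetchall())

print(sorted(correlated_rows))
print(correlated_rows == ranked_rows)  # True
```

Jim and Max tie at 90000 in IT, and both queries keep both of them.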

---

 ⚙️ Behavior on Different Dataset Sizes

<table>
  <thead>
    <tr>
      <th style="width: 40%; word-wrap: break-word; white-space: normal;">Dataset Size</th>
      <th style="width: 60%; word-wrap: break-word; white-space: normal;">Behavior &amp; Considerations</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="word-wrap: break-word; white-space: normal;">&lt; 100 GB (Small/Medium)</td>
      <td style="word-wrap: break-word; white-space: normal;">
        Both solutions run efficiently. Use indexes on <code>departmentId</code> and <code>salary</code> to speed up subqueries or window functions. Most SQL engines will optimize correlated subqueries well.
      </td>
    </tr>
    <tr>
      <td style="word-wrap: break-word; white-space: normal;">&gt; 100 GB (Large)</td>
      <td style="word-wrap: break-word; white-space: normal;">
        Prefer the <code>DENSE_RANK()</code> solution for better performance in distributed engines. Correlated subqueries can become expensive due to repeated execution. Use partitioned tables and columnar storage (e.g., Parquet, ORC) with engines like Hive, Presto, or Spark SQL.
      </td>
    </tr>
  </tbody>
</table>

---


#  🔗[LeetCode 185 : Department Top Three Salaries](https://leetcode.com/problems/department-top-three-salaries/)

---

 🔹 Problem Statement

**Table: Employee**

| Column Name  | Type    |
| ------------ | ------- |
| id           | int     |
| name         | varchar |
| salary       | int     |
| departmentId | int     |

**Table: Department**

| Column Name | Type    |
| ----------- | ------- |
| id          | int     |
| name        | varchar |

* `id` is the primary key in both tables.
* `departmentId` is a foreign key to `Department.id`.

---

 ✏️ Task

Write a SQL query to find the employees who earn one of the **top three distinct salaries** in their department.
Return the **department name**, **employee name**, and **salary**.

---

 📥 Example

**Input:**

**Employee**

| id | name  | salary | departmentId |
| -- | ----- | ------ | ------------ |
| 1  | Joe   | 85000  | 1            |
| 2  | Henry | 80000  | 2            |
| 3  | Sam   | 60000  | 2            |
| 4  | Max   | 90000  | 1            |
| 5  | Janet | 69000  | 1            |
| 6  | Randy | 85000  | 1            |

**Department**

| id | name  |
| -- | ----- |
| 1  | IT    |
| 2  | Sales |

**Output:**

| Department | Employee | Salary |
| ---------- | -------- | ------ |
| IT         | Janet    | 69000  |
| IT         | Joe      | 85000  |
| IT         | Max      | 90000  |
| IT         | Randy    | 85000  |
| Sales      | Henry    | 80000  |
| Sales      | Sam      | 60000  |

---

 ✅ SQL Solutions

 🔹 Solution: `DENSE_RANK()` with CTE

```sql
WITH RankedSalaries AS (
  SELECT d.name AS Department,
         e.name AS Employee,
         e.salary AS Salary,
         DENSE_RANK() OVER (
             PARTITION BY e.departmentId
             ORDER BY e.salary DESC
         ) AS rnk
  FROM Employee e
  JOIN Department d ON e.departmentId = d.id
)
SELECT Department, Employee, Salary
FROM RankedSalaries
WHERE rnk <= 3;
```

✅ **Important Points:**

* Uses `DENSE_RANK()` to ensure duplicate salaries are ranked correctly.
* `PARTITION BY departmentId` restricts ranking within each department.
* Highly readable and extendable for top-N salaries (just change `rnk <= 3`).

---

 ⚙️ Behavior on Different Dataset Sizes

<table>
  <thead>
    <tr>
      <th style="width: 40%; word-wrap: break-word; white-space: normal;">Dataset Size</th>
      <th style="width: 60%; word-wrap: break-word; white-space: normal;">Behavior &amp; Considerations</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="word-wrap: break-word; white-space: normal;">&lt; 100 GB (Small/Medium)</td>
      <td style="word-wrap: break-word; white-space: normal;">
        Performs efficiently on most RDBMS engines. Index on <code>departmentId</code> and <code>salary</code> boosts performance. Query fits well in memory.
      </td>
    </tr>
    <tr>
      <td style="word-wrap: break-word; white-space: normal;">&gt; 100 GB (Large)</td>
      <td style="word-wrap: break-word; white-space: normal;">
        Window functions like <code>DENSE_RANK()</code> scale well in distributed SQL engines (e.g., Spark SQL, Presto). Partitioning and columnar formats like Parquet improve performance. Avoid nested subqueries; prefer CTEs and windowing.
      </td>
    </tr>
  </tbody>
</table>

---

 🔁 Alternate SQL Solution (Using a Correlated Subquery)

```sql
SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary
FROM Employee e
JOIN Department d ON e.departmentId = d.id
WHERE (
    SELECT COUNT(DISTINCT e2.salary)
    FROM Employee e2
    WHERE e2.departmentId = e.departmentId
      AND e2.salary > e.salary
) < 3;
```

✅ **Important Points:**

* This subquery counts how many **distinct higher salaries** exist in the same department.
* If fewer than 3 higher salaries exist, that means this salary is in the **top 3**.
* Handles **ties** correctly because it uses `DISTINCT` in the count.
* Avoids window functions; suitable for engines with limited analytic function support.
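
To confirm that the counting subquery and the `DENSE_RANK()` version select the same rows, the sketch below runs both on the example data plus one hypothetical employee ("Pat", 50000, IT) who holds the fourth distinct salary in IT and should be excluded. It assumes Python's built-in `sqlite3` module (SQLite 3.25+ for window functions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Department (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, departmentId INTEGER)"
)
conn.executemany("INSERT INTO Department VALUES (?, ?)", [(1, "IT"), (2, "Sales")])
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [
        (1, "Joe", 85000, 1),
        (2, "Henry", 80000, 2),
        (3, "Sam", 60000, 2),
        (4, "Max", 90000, 1),
        (5, "Janet", 69000, 1),
        (6, "Randy", 85000, 1),
        (7, "Pat", 50000, 1),  # hypothetical extra row: 4th distinct salary in IT
    ],
)

COUNTING = """
    SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary
    FROM Employee e
    JOIN Department d ON e.departmentId = d.id
    WHERE (
        SELECT COUNT(DISTINCT e2.salary)
        FROM Employee e2
        WHERE e2.departmentId = e.departmentId
          AND e2.salary > e.salary
    ) < 3
"""

RANKED = """
    WITH RankedSalaries AS (
        SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary,
               DENSE_RANK() OVER (PARTITION BY e.departmentId ORDER BY e.salary DESC) AS rnk
        FROM Employee e
        JOIN Department d ON e.departmentId = d.id
    )
    SELECT Department, Employee, Salary
    FROM RankedSalaries
    WHERE rnk <= 3
"""

# Compare as sets because neither query guarantees a row order.
counting_rows = set(conn.execute(COUNTING).fetchall())
ranked_rows = set(conn.execute(RANKED).fetchall())

print(sorted(counting_rows))
print(counting_rows == ranked_rows)  # True
```

In IT the distinct salaries are 90000, 85000, 69000 and 50000, so Max, Joe, Randy and Janet are returned and Pat is not.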

---

 ⚙️ Behavior on Different Dataset Sizes

<table>
  <thead>
    <tr>
      <th style="width: 40%; word-wrap: break-word; white-space: normal;">Dataset Size</th>
      <th style="width: 60%; word-wrap: break-word; white-space: normal;">Behavior &amp; Considerations</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="word-wrap: break-word; white-space: normal;">&lt; 100 GB (Small/Medium)</td>
      <td style="word-wrap: break-word; white-space: normal;">
        Performs reasonably well. Query is readable and doesn’t require window function support. Indexes on <code>departmentId</code> and <code>salary</code> recommended for better performance.
      </td>
    </tr>
    <tr>
      <td style="word-wrap: break-word; white-space: normal;">&gt; 100 GB (Large)</td>
      <td style="word-wrap: break-word; white-space: normal;">
        Subqueries may run slower due to repeated scans per row. Avoid in distributed systems. Prefer window function approach with <code>DENSE_RANK()</code> in large-scale or big data environments.
      </td>
    </tr>
  </tbody>
</table>

---



#  🔗[LeetCode 196 : Delete Duplicate Emails](https://leetcode.com/problems/delete-duplicate-emails/)

---

 🔹 Problem Statement

**Table: Person**

| Column Name | Type    |
| ----------- | ------- |
| id          | int     |
| email       | varchar |

* `id` is the primary key.
* Each row in the table has a person’s email.
* Some emails may appear more than once with different `id`s.

---

 ✏️ Task

Write a SQL query to **delete all duplicate emails**, keeping only the record with the **smallest id** for each email.

---

 📥 Example

**Input:**

**Person**

| id | email            |
| -- | ---------------- |
| 1  | john@example.com |
| 2  | bob@example.com  |
| 3  | john@example.com |

**Output After Deletion:**

| id | email            |
| -- | ---------------- |
| 1  | john@example.com |
| 2  | bob@example.com  |

---

 ✅ SQL Solutions

 🔹 Solution 1: Subquery with `NOT IN`

```sql
DELETE FROM Person
WHERE id NOT IN (
  SELECT minid
  FROM (
    SELECT MIN(id) AS minid
    FROM Person
    GROUP BY email
  ) AS A
);
```

✅ **Important Points:**

* **MySQL limitation:** MySQL does not allow a `DELETE` to select from its own target table in a subquery. The direct form `DELETE FROM Person WHERE id NOT IN (SELECT MIN(id) FROM Person GROUP BY email);` may fail with the error "You can't specify target table 'Person' for update in FROM clause".
* Wrapping the inner query in a derived table (`AS A`) avoids this error.
* Keeps only the **smallest id** per email by using `GROUP BY`.
* Deletes all other entries **not in the minimum id list**.
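
The effect of the `NOT IN` delete can be verified on the example rows. This is a sketch assuming Python's built-in `sqlite3` module; SQLite does not have MySQL's target-table restriction, so the derived table is kept here only to mirror Solution 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?)",
    [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")],
)

# Delete every row whose id is not the smallest id for its email.
cursor = conn.execute("""
    DELETE FROM Person
    WHERE id NOT IN (
        SELECT minid
        FROM (
            SELECT MIN(id) AS minid
            FROM Person
            GROUP BY email
        ) AS A
    )
""")
deleted = cursor.rowcount

remaining = conn.execute("SELECT id, email FROM Person ORDER BY id").fetchall()

print(deleted)    # 1
print(remaining)  # [(1, 'john@example.com'), (2, 'bob@example.com')]
```

Only the row with `id` 3 is removed, leaving the lowest `id` for each email.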

---

 🔹 Solution 2: Window Function with `ROW_NUMBER()` (if supported)

```sql
WITH RankedEmails AS (
  SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
  FROM Person
)
DELETE FROM Person
WHERE id IN (
  SELECT id FROM RankedEmails WHERE rn > 1
);
```

✅ **Important Points:**

* Numbers the rows within each email group, ordered by `id`.
* Deletes all rows with `rn > 1`, keeping the one with `rn = 1`.
* Requires support for **common table expressions** and **window functions** (not all RDBMS allow DELETE from CTE).

---

 ⚙️ Behavior on Different Dataset Sizes

<table>
  <thead>
    <tr>
      <th style="width: 40%; word-wrap: break-word; white-space: normal;">Dataset Size</th>
      <th style="width: 60%; word-wrap: break-word; white-space: normal;">Behavior &amp; Considerations</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="word-wrap: break-word; white-space: normal;">&lt; 100 GB (Small/Medium)</td>
      <td style="word-wrap: break-word; white-space: normal;">
        Both solutions run efficiently. Index on <code>email</code> helps speed up grouping. DELETE operations are manageable.
      </td>
    </tr>
    <tr>
      <td style="word-wrap: break-word; white-space: normal;">&gt; 100 GB (Large)</td>
      <td style="word-wrap: break-word; white-space: normal;">
        Subquery version more compatible across systems but less optimized. Window function version is faster in modern engines (e.g., BigQuery, Snowflake, PostgreSQL, SQL Server). For massive deletes, consider staging tables or batch deletes.
      </td>
    </tr>
  </tbody>
</table>

---


#  🔗[LeetCode 197 : Rising Temperature](https://leetcode.com/problems/rising-temperature/)

---

 🔹 Problem Statement

**Table: Weather**

| Column Name | Type |
| ----------- | ---- |
| id          | int  |
| recordDate  | date |
| temperature | int  |

* `id` is the primary key.
* Each row contains the temperature on a specific date.

---

 ✏️ Task

Write a SQL query to find the `id` of every date on which the temperature was **higher than the previous day's** temperature.

---

 📥 Example

**Input:**

**Weather**

| id | recordDate | temperature |
| -- | ---------- | ----------- |
| 1  | 2020-01-01 | 10          |
| 2  | 2020-01-02 | 25          |
| 3  | 2020-01-03 | 20          |
| 4  | 2020-01-04 | 30          |

**Output:**

| id |
| -- |
| 2  |
| 4  |

---

 ✅ SQL Solutions

 🔹 Solution 1: Self Join

```sql
SELECT w1.id
FROM Weather w1
JOIN Weather w2
  ON DATEDIFF(w1.recordDate, w2.recordDate) = 1
WHERE w1.temperature > w2.temperature;
```

✅ **Important Points:**

* Joins the `Weather` table to itself.
* `DATEDIFF(w1.recordDate, w2.recordDate) = 1` ensures that `w1` is the day after `w2`.
* `w1.temperature > w2.temperature` then checks that today's temperature is higher than yesterday's.
* The two-argument `DATEDIFF()` is MySQL syntax; other engines need different date arithmetic (for example, PostgreSQL subtracts dates directly and SQL Server's `DATEDIFF` takes a unit as its first argument).
* Only rows exactly one calendar day apart are compared, so a date whose previous day is missing from the table is never returned.

---

 🔹 Solution 2: Window Function (`LAG()`)

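```sql
SELECT id
FROM (
  SELECT id, temperature, recordDate,
         LAG(temperature) OVER (ORDER BY recordDate) AS prev_temp,
         LAG(recordDate)  OVER (ORDER BY recordDate) AS prev_date
  FROM Weather
) AS temp_table
WHERE temperature > prev_temp
  AND DATEDIFF(recordDate, prev_date) = 1;  -- previous row must be the previous day
```

✅ **Important Points:**

* Uses `LAG()` to access the previous row’s temperature and date.
* `LAG()` returns the previous *row*, not the previous calendar day, so the `DATEDIFF()` check is needed whenever dates can be missing.
* Avoids the self join: each row is compared with its predecessor in a single ordered pass.
* Requires SQL engine support for window functions (e.g., MySQL 8.0+, PostgreSQL, SQL Server, Snowflake); the `DATEDIFF()` call shown is MySQL syntax.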
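
The gap between "previous row" and "previous day" shows up as soon as a date is missing. The sketch below assumes Python's built-in `sqlite3` module (SQLite 3.25+ for window functions) and a hypothetical table with a gap between January 2 and January 5. Because SQLite has no `DATEDIFF()`, it uses SQLite's `date(x, '-1 day')` for the one-day check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Weather (id INTEGER PRIMARY KEY, recordDate TEXT, temperature INTEGER)"
)
conn.executemany(
    "INSERT INTO Weather VALUES (?, ?, ?)",
    [
        (1, "2020-01-01", 10),
        (2, "2020-01-02", 25),
        (3, "2020-01-05", 30),  # hypothetical gap: Jan 3 and Jan 4 are missing
    ],
)

# Previous-day comparison via self join.
self_join = [r[0] for r in conn.execute("""
    SELECT w1.id
    FROM Weather w1
    JOIN Weather w2 ON w2.recordDate = date(w1.recordDate, '-1 day')
    WHERE w1.temperature > w2.temperature
    ORDER BY w1.id
""")]

# Previous-row comparison: LAG() alone ignores how far apart the dates are.
lag_only = [r[0] for r in conn.execute("""
    SELECT id
    FROM (
        SELECT id, temperature,
               LAG(temperature) OVER (ORDER BY recordDate) AS prev_temp
        FROM Weather
    ) AS t
    WHERE temperature > prev_temp
    ORDER BY id
""")]

# LAG() plus an explicit check that the previous row is exactly one day earlier.
lag_checked = [r[0] for r in conn.execute("""
    SELECT id
    FROM (
        SELECT id, temperature, recordDate,
               LAG(temperature) OVER (ORDER BY recordDate) AS prev_temp,
               LAG(recordDate)  OVER (ORDER BY recordDate) AS prev_date
        FROM Weather
    ) AS t
    WHERE temperature > prev_temp
      AND prev_date = date(recordDate, '-1 day')
    ORDER BY id
""")]

print(self_join)    # [2]
print(lag_only)     # [2, 3]
print(lag_checked)  # [2]
```

Row 3 is warmer than the previous row but has no record for the previous day, so only the date-checked variants agree with the problem statement.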

---

 ⚙️ Behavior on Different Dataset Sizes

<table>
  <thead>
    <tr>
      <th style="width: 40%; word-wrap: break-word; white-space: normal;">Dataset Size</th>
      <th style="width: 60%; word-wrap: break-word; white-space: normal;">Behavior &amp; Considerations</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="word-wrap: break-word; white-space: normal;">&lt; 100 GB (Small/Medium)</td>
      <td style="word-wrap: break-word; white-space: normal;">
        Both queries work efficiently. Index on <code>recordDate</code> speeds up self join and window functions. Ideal for standard datasets.
      </td>
    </tr>
    <tr>
      <td style="word-wrap: break-word; white-space: normal;">&gt; 100 GB (Large)</td>
      <td style="word-wrap: break-word; white-space: normal;">
        Window function version scales better in distributed systems (e.g., Spark SQL, BigQuery). Self joins may be more expensive due to data shuffling and sorting. Partitioned data and columnar storage improve performance.
      </td>
    </tr>
  </tbody>
</table>

---
