Assume you're given a table of Twitter tweet data. Write a query to obtain a histogram of tweets posted per user in 2022. Output the tweet count per user as the bucket and the number of Twitter users who fall into that bucket.

In [None]:
SELECT tweet_count_per_user AS tweet_bucket, COUNT(user_id) users_num
FROM
  (SELECT user_id, COUNT(*) tweet_count_per_user
  FROM tweets
  WHERE to_char(tweet_date::DATE, 'YYYY') = '2022'
  GROUP BY user_id) tweet_per_user
GROUP BY 1;


SELECT 
  tweet_count_per_user AS tweet_bucket, 
  COUNT(user_id) AS users_num 
FROM (
  SELECT 
    user_id, 
    COUNT(tweet_id) AS tweet_count_per_user 
  FROM tweets 
  WHERE tweet_date >= '2022-01-01'  -- half-open range keeps tweets posted during 2022-12-31
    AND tweet_date < '2023-01-01'
  GROUP BY user_id) AS total_tweets 
GROUP BY tweet_count_per_user
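
For comparison with the pandas solutions used later in this notebook, the same two-level aggregation can be sketched in pandas. This is a minimal sketch, not a reference solution: the function name `tweet_histogram` and the sample rows are made up for illustration, and the column names `tweet_id`, `user_id`, and `tweet_date` are assumed from the SQL above.

```python
import pandas as pd

def tweet_histogram(tweets: pd.DataFrame, year: int = 2022) -> pd.DataFrame:
    # Keep only tweets posted in the requested calendar year.
    dates = pd.to_datetime(tweets["tweet_date"])
    in_year = tweets[dates.dt.year == year]
    # Inner aggregation: number of tweets per user.
    per_user = in_year.groupby("user_id").size()
    # Outer aggregation: number of users sharing each tweet count.
    hist = per_user.value_counts().sort_index()
    return pd.DataFrame({"tweet_bucket": hist.index, "users_num": hist.values})

# Illustrative data only (not taken from the original problem).
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5],
    "user_id": [111, 111, 111, 254, 148],
    "tweet_date": [
        "2021-12-30 00:00:00",
        "2022-01-01 00:00:00",
        "2022-02-14 00:00:00",
        "2022-03-01 00:00:00",
        "2022-12-31 23:00:00",
    ],
})
result = tweet_histogram(tweets)
print(result)
```

Filtering on `dt.year` sidesteps the end-of-year boundary question entirely, at the cost of parsing every date.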

Given a table of candidates and their skills, you're tasked with finding the candidates best suited for an open Data Science job. You want to find candidates who are proficient in Python, Tableau, and PostgreSQL.

Write a query to list the candidates who possess all of the required skills for the job. Sort the output by candidate ID in ascending order.

Assumption:

There are no duplicates in the `candidates` table, which has the following schema:

| Column Name  | Type    |
| :----------- | :------ |
| candidate_id | integer |
| skill        | varchar |

Example Input:

| candidate_id | skill      |
| :----------- | :--------- |
| 123          | Python     |
| 123          | Tableau    |
| 123          | PostgreSQL |
| 234          | R          |
| 234          | PowerBI    |
| 234          | SQL Server |
| 345          | Python     |
| 345          | Tableau    |


In [None]:
SELECT candidate_id
FROM candidates
GROUP BY candidate_id
HAVING 'Python' = ANY(ARRAY_AGG(skill))
AND 'Tableau' = ANY(ARRAY_AGG(skill))
AND 'PostgreSQL' = ANY(ARRAY_AGG(skill))
ORDER BY 1;

SELECT candidate_id
FROM candidates
WHERE skill IN ('Python', 'Tableau', 'PostgreSQL')
GROUP BY candidate_id
HAVING COUNT(*) = 3
ORDER BY candidate_id
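
The filter-then-count idea in the second query can also be sketched in pandas. The function name `qualified_candidates` is hypothetical, and the data is the example input from the problem statement. Unlike `COUNT(*) = 3`, this sketch counts distinct skills, so it would not depend on the no-duplicates assumption.

```python
import pandas as pd

REQUIRED_SKILLS = ["Python", "Tableau", "PostgreSQL"]

def qualified_candidates(candidates: pd.DataFrame) -> pd.DataFrame:
    # Keep only rows for the required skills (the WHERE ... IN step).
    relevant = candidates[candidates["skill"].isin(REQUIRED_SKILLS)]
    # Count distinct required skills per candidate (the GROUP BY / HAVING step).
    skill_counts = relevant.groupby("candidate_id")["skill"].nunique()
    matched = skill_counts[skill_counts == len(REQUIRED_SKILLS)]
    # Sort by candidate ID in ascending order.
    return pd.DataFrame({"candidate_id": sorted(matched.index)})

# Example input from the problem statement.
candidates = pd.DataFrame({
    "candidate_id": [123, 123, 123, 234, 234, 234, 345, 345],
    "skill": ["Python", "Tableau", "PostgreSQL", "R", "PowerBI",
              "SQL Server", "Python", "Tableau"],
})
result = qualified_candidates(candidates)
print(result)
```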

Write a solution to find the second highest salary from the Employee table. If there is no second highest salary, return null (return None in Pandas).

In [None]:
WITH rankings AS (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) rank_
    FROM Employee
)
SELECT COALESCE(
    (SELECT DISTINCT salary FROM rankings WHERE rank_ = 2), NULL) AS "SecondHighestSalary";


SELECT COALESCE(
    (
    SELECT DISTINCT salary
    FROM Employee
    ORDER BY salary DESC
    OFFSET 1 LIMIT 1),
    NULL
) AS "SecondHighestSalary";  -- no outer FROM: an empty Employee table must still yield one NULL row


SELECT MAX(salary) AS "SecondHighestSalary"
FROM Employee
WHERE salary < (SELECT MAX(salary) FROM Employee)

In [None]:
import pandas as pd

def second_highest_salary(employee: pd.DataFrame) -> pd.DataFrame:
    
    salaries = employee['salary'].drop_duplicates()
    salaries.sort_values(ascending=False, inplace=True)
    return pd.DataFrame({'SecondHighestSalary' : [salaries.iloc[1] if salaries.shape[0] > 1 else None]})

Write a solution to find the nth highest salary from the Employee table. If there is no nth highest salary, return null.

Example 1:

Input: 
Employee table:

| id | salary |
|----|--------|
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |

n = 2
Output: 

| getNthHighestSalary(2) |
|------------------------|
| 200                    |

Example 2:

Input: 
Employee table:

| id | salary |
|----|--------|
| 1  | 100    |

n = 2
Output: 

| getNthHighestSalary(2) |
|------------------------|
| null                   |

In [None]:
CREATE OR REPLACE FUNCTION NthHighestSalary(N INT) RETURNS TABLE (Salary INT) AS $$
BEGIN
  RETURN QUERY (

    WITH rankings AS (
        SELECT e.salary, DENSE_RANK() OVER (ORDER BY e.salary DESC) rank_
        FROM Employee e
        ORDER BY e.salary DESC
    )

    SELECT COALESCE(
        (SELECT r.salary
        FROM rankings r
        WHERE rank_ = N
        LIMIT 1),
        NULL
    ) AS "NthHighestSalary"

  );
END;
$$ LANGUAGE plpgsql;

In [None]:
import pandas as pd

def nth_highest_salary(employee: pd.DataFrame, N: int) -> pd.DataFrame:

    df = employee['salary'].drop_duplicates().sort_values(ignore_index=True, ascending=False)
    if N <= 0:
        return pd.DataFrame({f'getNthHighestSalary({N})':[None]})
    else:
        return pd.DataFrame({f'getNthHighestSalary({N})':[int(df.iloc[N-1]) if df.shape[0] >= N else None]})

Alternatively, we can select the column with double brackets if we want `df` to actually be a DataFrame when defined. In the previous snippet, it immediately became a Series.

In [None]:
import pandas as pd

def nth_highest_salary(employee: pd.DataFrame, N: int) -> pd.DataFrame:

    df = employee[['salary']].drop_duplicates().sort_values(by='salary', ignore_index=True, ascending=False)
    if N <= 0:
        return pd.DataFrame({f'getNthHighestSalary({N})':[None]})
    else:
        # df is a DataFrame here, so index row and column explicitly to get a scalar.
        return pd.DataFrame({f'getNthHighestSalary({N})':[int(df.iloc[N-1, 0]) if df.shape[0] >= N else None]})
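
The Series-versus-DataFrame distinction above can be checked directly. This is a small sketch using the Example 1 input; the variable names are illustrative only.

```python
import pandas as pd

employee = pd.DataFrame({"id": [1, 2, 3], "salary": [100, 200, 300]})

as_series = employee["salary"]    # single brackets -> pd.Series
as_frame = employee[["salary"]]   # double brackets -> one-column pd.DataFrame

series_is_series = isinstance(as_series, pd.Series)
frame_is_frame = isinstance(as_frame, pd.DataFrame)

# Positional lookup differs: a Series takes one position,
# a DataFrame takes a row position and a column position.
second_from_series = as_series.sort_values(
    ascending=False, ignore_index=True).iloc[1]
second_from_frame = as_frame.sort_values(
    by="salary", ascending=False, ignore_index=True).iloc[1, 0]

print(type(as_series).__name__, type(as_frame).__name__)
print(second_from_series, second_from_frame)
```

Note that `sort_values` needs `by=` only in the DataFrame case, since a Series has a single column of values to sort on.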

Table: `Customers`

| Column Name | Type    |
|-------------|---------|
| id          | int     |
| name        | varchar |

id is the primary key (column with unique values) for this table.
Each row of this table indicates the ID and name of a customer.
 

Table: `Orders`

| Column Name | Type |
|-------------|------|
| id          | int  |
| customerId  | int  |

id is the primary key (column with unique values) for this table.
customerId is a foreign key (reference columns) of the ID from the Customers table.
Each row of this table indicates the ID of an order and the ID of the customer who ordered it.
 

Write a solution to find all customers who never order anything.

Return the result table in any order.

In [None]:
SELECT name AS "Customers"
FROM Customers
WHERE id NOT IN (
    SELECT DISTINCT Orders.customerId
    FROM Orders
)

In [None]:
import pandas as pd

# Approach 1: explicit membership test, iterating over values rather than index labels.
def find_customers(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    buying_customers = set(orders['customerId'])
    return pd.DataFrame({'Customers': [name for id_, name in zip(customers['id'], customers['name']) if id_ not in buying_customers]})

# Approach 2: left join, then keep customers with no matching order.
def find_customers(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    df = pd.merge(customers, orders, how='left', left_on='id', right_on='customerId')
    return df[df['customerId'].isna()][['name']].rename(columns={'name':'Customers'})

# Approach 3: anti-join with a boolean mask.
def find_customers(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    return customers[~customers.id.isin(orders.customerId)][['name']].rename(columns={'name':'Customers'})