Assume you're given a table of Twitter tweet data. Write a query to obtain a histogram of tweets posted per user in 2022. Output the tweet count per user as the bucket and the number of Twitter users who fall into that bucket.

In [None]:
SELECT tweet_count_per_user AS tweet_bucket, COUNT(user_id) users_num
FROM
  (SELECT user_id, COUNT(*) tweet_count_per_user
  FROM tweets
  WHERE to_char(tweet_date::DATE, 'YYYY') = '2022'
  GROUP BY user_id) tweet_per_user
GROUP BY 1


SELECT 
  tweet_count_per_user AS tweet_bucket, 
  COUNT(user_id) AS users_num 
FROM (
  SELECT 
    user_id, 
    COUNT(tweet_id) AS tweet_count_per_user 
  FROM tweets 
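  -- Half-open range: BETWEEN ... '2022-12-31' would drop tweets posted after midnight on Dec 31 if tweet_date is a timestamp
  WHERE tweet_date >= '2022-01-01' 
    AND tweet_date < '2023-01-01'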
  GROUP BY user_id) AS total_tweets 
GROUP BY tweet_count_per_user
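
The same two-step aggregation can be sketched in pandas, in the spirit of the pandas solutions used for other problems in this notebook. This is a minimal sketch, not a reference solution: the function name `tweet_histogram` and the sample rows are hypothetical, and the column names `user_id` and `tweet_date` are taken from the queries above.

```python
import pandas as pd

def tweet_histogram(tweets: pd.DataFrame) -> pd.DataFrame:
    # Keep only tweets posted in 2022.
    dates = pd.to_datetime(tweets['tweet_date'])
    tweets_2022 = tweets[dates.dt.year == 2022]
    # Inner aggregation: number of tweets per user.
    per_user = tweets_2022.groupby('user_id').size()
    # Outer aggregation: number of users per tweet count.
    hist = per_user.value_counts().sort_index()
    return pd.DataFrame({'tweet_bucket': hist.index.tolist(),
                         'users_num': hist.tolist()})

# Hypothetical sample data, for illustration only.
tweets = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4, 5],
    'user_id': [111, 111, 222, 333, 333],
    'tweet_date': ['2022-01-05 00:00:00', '2022-12-31 23:59:00',
                   '2022-03-01 00:00:00', '2021-06-01 00:00:00',
                   '2022-07-04 00:00:00'],
})
result = tweet_histogram(tweets)
```

Filtering on the year component sidesteps the end-of-year boundary question that a date range has to handle explicitly.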

Given a table of candidates and their skills, you're tasked with finding the candidates best suited for an open Data Science job. You want to find candidates who are proficient in Python, Tableau, and PostgreSQL.

Write a query to list the candidates who possess all of the required skills for the job. Sort the output by candidate ID in ascending order.

Assumption:

There are no duplicates in the candidates table.

`candidates` table:

| Column Name  | Type    |
| :----------- | :------ |
| candidate_id | integer |
| skill        | varchar |

Example Input:

| candidate_id | skill      |
| :----------- | :--------- |
| 123          | Python     |
| 123          | Tableau    |
| 123          | PostgreSQL |
| 234          | R          |
| 234          | PowerBI    |
| 234          | SQL Server |
| 345          | Python     |
| 345          | Tableau    |


In [None]:
SELECT candidate_id
FROM candidates
GROUP BY candidate_id
HAVING 'Python' = ANY(ARRAY_AGG(skill))
AND 'Tableau' = ANY(ARRAY_AGG(skill))
AND 'PostgreSQL' = ANY(ARRAY_AGG(skill))
ORDER BY 1

SELECT candidate_id
FROM candidates
WHERE skill IN ('Python', 'Tableau', 'PostgreSQL')
GROUP BY candidate_id
HAVING COUNT(*) = 3
ORDER BY candidate_id
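
A pandas counterpart of the filter-then-count query might look like the sketch below. The function name `matching_candidates` and the constant `REQUIRED_SKILLS` are hypothetical; the sample rows are the Example Input from above. Counting distinct skills with `nunique` means the result would still hold if the no-duplicates assumption were violated.

```python
import pandas as pd

REQUIRED_SKILLS = {'Python', 'Tableau', 'PostgreSQL'}

def matching_candidates(candidates: pd.DataFrame) -> pd.DataFrame:
    # Keep only rows for the skills we care about.
    relevant = candidates[candidates['skill'].isin(REQUIRED_SKILLS)]
    # Count distinct required skills per candidate.
    skill_counts = relevant.groupby('candidate_id')['skill'].nunique()
    # A candidate qualifies only if every required skill is present.
    matched = skill_counts[skill_counts == len(REQUIRED_SKILLS)]
    return pd.DataFrame({'candidate_id': sorted(matched.index.tolist())})

# Rows copied from the Example Input above.
candidates = pd.DataFrame({
    'candidate_id': [123, 123, 123, 234, 234, 234, 345, 345],
    'skill': ['Python', 'Tableau', 'PostgreSQL', 'R', 'PowerBI',
              'SQL Server', 'Python', 'Tableau'],
})
result = matching_candidates(candidates)
```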

Write a solution to find the second highest salary from the Employee table. If there is no second highest salary, return null (return None in Pandas).

In [None]:
WITH rankings AS (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) rank_
    FROM Employee
)
SELECT COALESCE(
    (SELECT DISTINCT salary FROM rankings WHERE rank_ = 2), NULL) AS "SecondHighestSalary";


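-- No outer FROM clause: the scalar subquery yields NULL on its own,
-- and the query still returns one row when Employee is empty.
SELECT COALESCE(
    (
    SELECT DISTINCT salary
    FROM Employee
    ORDER BY salary DESC
    LIMIT 1 OFFSET 1),
    NULL
) AS "SecondHighestSalary";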


SELECT MAX(salary) AS "SecondHighestSalary"
FROM Employee
WHERE salary < (SELECT MAX(salary) FROM Employee)

In [None]:
import pandas as pd

def second_highest_salary(employee: pd.DataFrame) -> pd.DataFrame:
    
    salaries = employee['salary'].drop_duplicates()
    salaries.sort_values(ascending=False, inplace=True)
    return pd.DataFrame({'SecondHighestSalary' : [salaries.iloc[1] if salaries.shape[0] > 1 else None]})

Write a solution to find the nth highest salary from the Employee table. If there is no nth highest salary, return null.

Example 1:

Input: 
Employee table:

| id | salary |
|----|--------|
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |

n = 2
Output: 

| getNthHighestSalary(2) |
|------------------------|
| 200                    |

Example 2:

Input: 
Employee table:

| id | salary |
|----|--------|
| 1  | 100    |

n = 2
Output: 

| getNthHighestSalary(2) |
|------------------------|
| null                   |

In [None]:
CREATE OR REPLACE FUNCTION NthHighestSalary(N INT) RETURNS TABLE (Salary INT) AS $$
BEGIN
  RETURN QUERY (

    WITH rankings AS (
        SELECT e.salary, DENSE_RANK() OVER (ORDER BY e.salary DESC) rank_
        FROM Employee e
        ORDER BY e.salary DESC
    )

    SELECT COALESCE(
        (SELECT r.salary
        FROM rankings r
        WHERE rank_ = N
        LIMIT 1),
        NULL
    ) AS "NthHighestSalary"

  );
END;
$$ LANGUAGE plpgsql;

In [None]:
import pandas as pd

def nth_highest_salary(employee: pd.DataFrame, N: int) -> pd.DataFrame:

    df = employee['salary'].drop_duplicates().sort_values(ignore_index=True, ascending=False)
    if N <= 0:
        return pd.DataFrame({f'getNthHighestSalary({N})':[None]})
    else:
        return pd.DataFrame({f'getNthHighestSalary({N})':[int(df.iloc[N-1]) if df.shape[0] >= N else None]})

Alternatively, we can keep `df` as an actual DataFrame when it is defined. In the previous snippet, selecting a single column immediately turned it into a Series.

In [None]:
import pandas as pd

def nth_highest_salary(employee: pd.DataFrame, N: int) -> pd.DataFrame:

    df = employee[['salary']].drop_duplicates().sort_values(by='salary', ignore_index=True, ascending=False)
    if N <= 0:
        return pd.DataFrame({f'getNthHighestSalary({N})':[None]})
    else:
        return pd.DataFrame({f'getNthHighestSalary({N})':[int(df.iloc[N-1, 0]) if df.shape[0] >= N else None]})

Table: `Customers`

| Column Name | Type    |
|-------------|---------|
| id          | int     |
| name        | varchar |

id is the primary key (column with unique values) for this table.
Each row of this table indicates the ID and name of a customer.
 

Table: `Orders`

| Column Name | Type |
|-------------|------|
| id          | int  |
| customerId  | int  |

id is the primary key (column with unique values) for this table.
customerId is a foreign key (reference columns) of the ID from the Customers table.
Each row of this table indicates the ID of an order and the ID of the customer who ordered it.
 

Write a solution to find all customers who never order anything.

Return the result table in any order.

In [None]:
SELECT name AS "Customers"
FROM Customers
WHERE id NOT IN (
    SELECT customerId
    FROM Orders
    WHERE customerId IS NOT NULL  -- a NULL in the list would make NOT IN match no rows
)

In [None]:
import pandas as pd

def find_customers(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    # A set gives fast membership tests; zip avoids assuming a default RangeIndex.
    buying_customers = set(orders['customerId'])
    return pd.DataFrame({'Customers': [name for id_, name in zip(customers['id'], customers['name']) if id_ not in buying_customers]})

def find_customers(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    df = pd.merge(customers, orders, how='left', left_on='id', right_on='customerId')
    return df[df['customerId'].isna()][['name']].rename(columns={'name':'Customers'})

def find_customers(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    return customers[~customers.id.isin(orders.customerId)][['name']].rename(columns={'name':'Customers'})

`reference`: https://datalemur.com/questions/time-spent-snaps?referralCode=NpkU8goA

This is the same question as problem #25 in the SQL Chapter of Ace the Data Science Interview!

Assume you're given tables with information on Snapchat users, including their ages and time spent sending and opening snaps.

Write a query to obtain a breakdown of the time spent sending vs. opening snaps as a percentage of total time spent on these activities grouped by age group. Round the percentage to 2 decimal places in the output.

**Notes:**

* Calculate the following percentages:
    * Time spent sending / (Time spent sending + Time spent opening)
    * Time spent opening / (Time spent sending + Time spent opening)
* To avoid integer division in percentages, multiply by 100.0 and not 100.

In [None]:
WITH agg_sum AS (
  SELECT age_bucket, activity_type, SUM(time_spent) sum_time, SUM(SUM(time_spent)) OVER(PARTITION BY age_bucket) act_total
  FROM activities JOIN age_breakdown USING (user_id)
  WHERE activity_type IN ('send', 'open')
  GROUP BY age_bucket, activity_type
)
SELECT *
FROM

-- Multiply by 100.0 first so integer inputs are not truncated by integer division.
-- agg_sum already holds one row per (age_bucket, activity_type), so no further aggregation is needed.
(SELECT age_bucket, ROUND(100.0 * sum_time / act_total, 2) send_perc
FROM agg_sum
WHERE activity_type = 'send') send_

JOIN

(SELECT age_bucket, ROUND(100.0 * sum_time / act_total, 2) open_perc
FROM agg_sum
WHERE activity_type = 'open') open_

USING (age_bucket)
ORDER BY age_bucket
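
For comparison, the same breakdown can be sketched in pandas. This is only an illustration under stated assumptions: the function name `snap_time_breakdown` and the sample rows are hypothetical, while the column names (`user_id`, `activity_type`, `time_spent`, `age_bucket`) are taken from the query above.

```python
import pandas as pd

def snap_time_breakdown(activities: pd.DataFrame,
                        age_breakdown: pd.DataFrame) -> pd.DataFrame:
    # Attach each activity row to its user's age bucket.
    merged = activities.merge(age_breakdown, on='user_id')
    merged = merged[merged['activity_type'].isin(['send', 'open'])]
    # One row per age bucket, one column per activity type.
    totals = merged.pivot_table(index='age_bucket', columns='activity_type',
                                values='time_spent', aggfunc='sum',
                                fill_value=0)
    totals = totals.reindex(columns=['send', 'open'], fill_value=0)
    total_time = totals['send'] + totals['open']
    return pd.DataFrame({
        'age_bucket': totals.index.tolist(),
        'send_perc': (100.0 * totals['send'] / total_time).round(2).to_numpy(),
        'open_perc': (100.0 * totals['open'] / total_time).round(2).to_numpy(),
    })

# Hypothetical sample data, for illustration only.
activities = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'activity_type': ['send', 'open', 'chat', 'send', 'open'],
    'time_spent': [3.0, 1.0, 5.0, 2.0, 6.0],
})
age_breakdown = pd.DataFrame({
    'user_id': [1, 2],
    'age_bucket': ['21-25', '26-30'],
})
result = snap_time_breakdown(activities, age_breakdown)
```

One difference from the SQL: an age bucket with only one of the two activity types is kept here with a 0 for the missing type, whereas the inner join between the `send_` and `open_` subqueries would drop it.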

`reference`: https://leetcode.com/problems/game-play-analysis-i/

Write a solution to find the first login date for each player.

Return the result table in any order.

The result format is in the following example.

 

Example 1:

Input: 
Activity table:

| player_id | device_id | event_date | games_played |
|-----------|-----------|------------|--------------|
| 1         | 2         | 2016-03-01 | 5            |
| 1         | 2         | 2016-05-02 | 6            |
| 2         | 3         | 2017-06-25 | 1            |
| 3         | 1         | 2016-03-02 | 0            |
| 3         | 4         | 2018-07-03 | 5            |

Output: 

| player_id | first_login |
|-----------|-------------|
| 1         | 2016-03-01  |
| 2         | 2017-06-25  |
| 3         | 2016-03-02  |


In [None]:
SELECT player_id, event_date AS first_login
FROM
    (SELECT player_id, event_date, ROW_NUMBER() OVER(PARTITION BY player_id ORDER BY event_date) row_
    FROM Activity
    ORDER BY player_id, event_date) act_rows
WHERE row_ = 1

In [None]:
import pandas as pd

def game_analysis(activity: pd.DataFrame) -> pd.DataFrame:
    return activity[['player_id', 'event_date']].groupby(by='player_id', as_index=False).min().rename(columns={'event_date':'first_login'})

`reference`: https://leetcode.com/problems/duplicate-emails/description/

Write a solution to report all the duplicate emails. Note that it's guaranteed that the email field is not NULL.

Return the result table in any order.

The result format is in the following example.

Example 1:

Input: 
Person table:

| id | email   |
|----|---------|
| 1  | a@b.com |
| 2  | c@d.com |
| 3  | a@b.com |

Output: 

| Email   |
|---------|
| a@b.com |

Explanation: a@b.com is repeated two times.

In [None]:
import pandas as pd
from collections import Counter

def duplicate_emails(person: pd.DataFrame) -> pd.DataFrame:
    count_dict = Counter(person['email'])
    return pd.DataFrame({'Email': list(set([e for e, c in count_dict.items() if c > 1]))})

def duplicate_emails(person: pd.DataFrame) -> pd.DataFrame:
    count_dict = {}
    for e in person['email']:
        if e not in count_dict:
            count_dict[e] = 1
        else:
            count_dict[e] += 1
    return pd.DataFrame({'Email': list(set([e for e, c in count_dict.items() if c > 1]))})
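
The counting logic above can also be expressed with pandas' own vectorized methods. The sketch below is one possible variant, not the only idiomatic one; the name `duplicate_emails_vectorized` is hypothetical, and the sample rows are the Person table from Example 1.

```python
import pandas as pd

def duplicate_emails_vectorized(person: pd.DataFrame) -> pd.DataFrame:
    # duplicated() flags the second and later occurrences of each email.
    repeated = person.loc[person['email'].duplicated(), 'email']
    # An email seen three or more times is flagged more than once, so dedupe.
    return pd.DataFrame({'Email': repeated.drop_duplicates().tolist()})

# Rows copied from Example 1 above.
person = pd.DataFrame({
    'id': [1, 2, 3],
    'email': ['a@b.com', 'c@d.com', 'a@b.com'],
})
result = duplicate_emails_vectorized(person)
```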

In [None]:
SELECT email
FROM Person
GROUP BY email
HAVING count(email) > 1