# Project aggregation

Source: https://towardsdatascience.com/twenty-five-sql-practice-exercises-5fc791e24082
Source: https://www.hackerrank.com/challenges/sql-projects/problem

The `projects` table contains three columns: `task_id`, `start_date`, and `end_date`. The difference between `end_date` and `start_date` is 1 day for each row in the table. If task end dates are consecutive, the tasks are part of the same project. Projects do not overlap.

Write a query to return the start and end dates of each project, and the number of days it took to complete. Order by ascending project duration, and by ascending start date in the case of a tie.

In [2]:
%run Question.ipynb

 * postgresql://fknight:***@localhost/postgres
Done.
Done.
7 rows affected.
7 rows affected.
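
The setup in `Question.ipynb` is not shown, so here is a rough sketch of what the table might look like. The seven rows are inferred from the query outputs in this notebook, and an in-memory SQLite database stands in for the PostgreSQL connection; both are assumptions made for illustration only.

```python
import sqlite3
from datetime import date

# Rows inferred from the query outputs in this notebook (an assumption;
# the real setup lives in Question.ipynb, which is not shown here).
ROWS = [
    (1, "2020-10-01", "2020-10-02"),
    (2, "2020-10-02", "2020-10-03"),
    (3, "2020-10-03", "2020-10-04"),
    (4, "2020-10-13", "2020-10-14"),
    (5, "2020-10-14", "2020-10-15"),
    (6, "2020-10-28", "2020-10-29"),
    (7, "2020-10-30", "2020-10-31"),
]

# SQLite has no DATE type, so the dates are stored as ISO-8601 text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE projects ("
    "task_id INTEGER PRIMARY KEY, start_date TEXT, end_date TEXT)"
)
conn.executemany("INSERT INTO projects VALUES (?, ?, ?)", ROWS)

# Check the stated invariant: every task spans exactly one day.
durations = [
    (date.fromisoformat(end) - date.fromisoformat(start)).days
    for _, start, end in conn.execute("SELECT * FROM projects ORDER BY task_id")
]
print(durations)
```

If the inferred rows are right, every entry in `durations` is 1.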


# Part A

Write a query to find the start dates NOT present in the `end_date` column. (These are the "true" project start dates).

In [4]:
%%sql

SELECT task_id, start_date
FROM projects
WHERE start_date NOT IN (SELECT end_date FROM projects)

 * postgresql://fknight:***@localhost/postgres
4 rows affected.


task_id,start_date
1,2020-10-01
4,2020-10-13
6,2020-10-28
7,2020-10-30
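
One caveat worth keeping in mind: `NOT IN` returns no rows at all if the subquery produces a `NULL`. The sample data has no `NULL` dates, so the query above is fine here, but a `NOT EXISTS` version is a common defensive alternative. The sketch below uses an in-memory SQLite database and a hypothetical unfinished task (neither is part of the original exercise) to show the difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (task_id INTEGER, start_date TEXT, end_date TEXT)")
conn.executemany(
    "INSERT INTO projects VALUES (?, ?, ?)",
    [
        (1, "2020-10-01", "2020-10-02"),
        (2, "2020-10-02", "2020-10-03"),
        (3, "2020-10-13", "2020-10-14"),
        (4, "2020-10-20", None),  # hypothetical unfinished task with a NULL end_date
    ],
)

# NOT IN: the NULL in the subquery makes every non-matching comparison
# evaluate to NULL rather than TRUE, so no row passes the filter.
not_in = conn.execute(
    "SELECT start_date FROM projects "
    "WHERE start_date NOT IN (SELECT end_date FROM projects) "
    "ORDER BY start_date"
).fetchall()

# NOT EXISTS: only checks whether a matching row exists, so NULLs are harmless.
not_exists = conn.execute(
    "SELECT p.start_date FROM projects AS p "
    "WHERE NOT EXISTS ("
    "    SELECT 1 FROM projects AS q WHERE q.end_date = p.start_date"
    ") "
    "ORDER BY p.start_date"
).fetchall()

print(not_in)
print(not_exists)
```

The same `NULL` behaviour applies in PostgreSQL, so if the date columns are nullable, `NOT EXISTS` (or filtering `NULL`s out of the subquery) is the safer choice.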


# Part B

Write a query to find the end dates NOT present in the `start_date` column. (These are the "true" project end dates).

In [9]:
%%sql

SELECT task_id, end_date
FROM projects
WHERE end_date NOT IN (SELECT start_date FROM projects)

 * postgresql://fknight:***@localhost/postgres
4 rows affected.


task_id,end_date
3,2020-10-04
5,2020-10-15
6,2020-10-29
7,2020-10-31


# Part C

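Using the subqueries from Parts A and B, find the start date and end date for each project. Because projects do not overlap, the end date of a project is the earliest "true" end date that falls after its "true" start date.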

In [16]:
%%sql

WITH true_start_dates AS (
    SELECT task_id, start_date
    FROM projects
    WHERE start_date NOT IN (SELECT end_date FROM projects)
),

true_end_dates AS (
    SELECT task_id, end_date
    FROM projects
    WHERE end_date NOT IN (SELECT start_date FROM projects)
)

SELECT start_date, min(end_date) AS end_date 
FROM true_start_dates, true_end_dates
WHERE start_date < end_date
GROUP BY start_date 

 * postgresql://fknight:***@localhost/postgres
4 rows affected.


start_date,end_date
2020-10-30,2020-10-31
2020-10-13,2020-10-15
2020-10-01,2020-10-04
2020-10-28,2020-10-29
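
To see why `min(end_date)` picks the right end date, it can help to look at the candidate pairs before the `GROUP BY`. The sketch below runs the cross join against an in-memory SQLite database; the rows are inferred from the outputs in this notebook, and SQLite is a stand-in for PostgreSQL, so treat both as assumptions.

```python
import sqlite3

# Rows inferred from the notebook outputs (an assumption).
ROWS = [
    (1, "2020-10-01", "2020-10-02"), (2, "2020-10-02", "2020-10-03"),
    (3, "2020-10-03", "2020-10-04"), (4, "2020-10-13", "2020-10-14"),
    (5, "2020-10-14", "2020-10-15"), (6, "2020-10-28", "2020-10-29"),
    (7, "2020-10-30", "2020-10-31"),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (task_id INTEGER, start_date TEXT, end_date TEXT)")
conn.executemany("INSERT INTO projects VALUES (?, ?, ?)", ROWS)

# Every (true start, true end) pair with start < end, before any grouping.
pairs = conn.execute("""
    WITH true_start_dates AS (
        SELECT start_date FROM projects
        WHERE start_date NOT IN (SELECT end_date FROM projects)
    ),
    true_end_dates AS (
        SELECT end_date FROM projects
        WHERE end_date NOT IN (SELECT start_date FROM projects)
    )
    SELECT s.start_date, e.end_date
    FROM true_start_dates AS s CROSS JOIN true_end_dates AS e
    WHERE s.start_date < e.end_date
    ORDER BY s.start_date, e.end_date
""").fetchall()

# Pairs are sorted by end date within each start date, so the first pair
# seen for a start date carries the minimum end date.
nearest = {}
for start, end in pairs:
    nearest.setdefault(start, end)

for start, end in pairs:
    marker = "<- kept" if nearest[start] == end else ""
    print(start, end, marker)
```

The earliest start date pairs with all four true end dates, the next with three, and so on; grouping by `start_date` and taking the minimum keeps only the nearest end date for each.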


# Part D

Using the subqueries from Parts A, B, & C, find the solution to the original question.

```sql
WITH true_start_dates AS (
    SELECT task_id, start_date
    FROM projects
    WHERE start_date NOT IN (SELECT end_date FROM projects)
),

true_end_dates AS (
    SELECT task_id, end_date
    FROM projects
    WHERE end_date NOT IN (SELECT start_date FROM projects)
),

start_and_end_dates AS (
    SELECT start_date, min(end_date) AS end_date 
    FROM true_start_dates, true_end_dates
    WHERE start_date < end_date
    GROUP BY start_date 
)
```

In [19]:
%%sql

WITH true_start_dates AS (
    SELECT task_id, start_date
    FROM projects
    WHERE start_date NOT IN (SELECT end_date FROM projects)
),

true_end_dates AS (
    SELECT task_id, end_date
    FROM projects
    WHERE end_date NOT IN (SELECT start_date FROM projects)
),

start_and_end_dates AS (
    SELECT start_date, min(end_date) AS end_date 
    FROM true_start_dates, true_end_dates
    WHERE start_date < end_date
    GROUP BY start_date 
)

SELECT *, end_date - start_date AS project_duration 
FROM start_and_end_dates
ORDER BY project_duration ASC, start_date ASC;

 * postgresql://fknight:***@localhost/postgres
4 rows affected.


start_date,end_date,project_duration
2020-10-28,2020-10-29,1
2020-10-30,2020-10-31,1
2020-10-13,2020-10-15,2
2020-10-01,2020-10-04,3
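
The expression `end_date - start_date` relies on PostgreSQL returning an integer number of days when one `date` is subtracted from another. Other engines may need a different expression. As a rough portability sketch, the version below runs the same logic in an in-memory SQLite database, where dates are stored as text and `julianday()` does the arithmetic; the rows are inferred from the notebook outputs, so they are an assumption.

```python
import sqlite3

# Rows inferred from the notebook outputs (an assumption).
ROWS = [
    (1, "2020-10-01", "2020-10-02"), (2, "2020-10-02", "2020-10-03"),
    (3, "2020-10-03", "2020-10-04"), (4, "2020-10-13", "2020-10-14"),
    (5, "2020-10-14", "2020-10-15"), (6, "2020-10-28", "2020-10-29"),
    (7, "2020-10-30", "2020-10-31"),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (task_id INTEGER, start_date TEXT, end_date TEXT)")
conn.executemany("INSERT INTO projects VALUES (?, ?, ?)", ROWS)

rows = conn.execute("""
    WITH t1 AS (
        SELECT start_date FROM projects
        WHERE start_date NOT IN (SELECT end_date FROM projects)
    ),
    t2 AS (
        SELECT end_date FROM projects
        WHERE end_date NOT IN (SELECT start_date FROM projects)
    ),
    t3 AS (
        SELECT t1.start_date AS start_date, MIN(t2.end_date) AS end_date
        FROM t1, t2
        WHERE t1.start_date < t2.end_date
        GROUP BY t1.start_date
    )
    SELECT start_date,
           end_date,
           -- SQLite has no date subtraction operator; julianday() gives day counts
           CAST(julianday(end_date) - julianday(start_date) AS INTEGER)
               AS project_duration
    FROM t3
    ORDER BY project_duration ASC, start_date ASC
""").fetchall()

for row in rows:
    print(row)
```

The rows come back in the same order as the PostgreSQL output above, with the two one-day projects ordered by start date.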


## The full solution is given below

In [2]:
%%sql

-- get start dates not present in end date column (these are “true”
-- project start dates)

WITH t1 AS (
    SELECT start_date
    FROM projects
    WHERE start_date NOT IN (SELECT end_date FROM projects)
),

-- get end dates not present in start date column (these are “true” 
-- project end dates)

t2 AS (
    SELECT end_date
    FROM projects
    WHERE end_date NOT IN (SELECT start_date FROM projects) 
),

-- filter to plausible start-end pairs (start < end), then find 
-- correct end date for each start date (the minimum end date, since
-- there are no overlapping projects)

t3 AS (
    SELECT start_date, min(end_date) AS end_date FROM t1, t2
    WHERE start_date < end_date
    GROUP BY start_date 
)

SELECT *, end_date - start_date AS project_duration FROM t3
ORDER BY project_duration ASC, start_date ASC;

 * postgresql://fknight:***@localhost/postgres
4 rows affected.


start_date,end_date,project_duration
2020-10-28,2020-10-29,1
2020-10-30,2020-10-31,1
2020-10-13,2020-10-15,2
2020-10-01,2020-10-04,3
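
As an independent sanity check, the same answer can be recomputed without SQL by walking the tasks in date order and merging each task into the current project whenever it starts on the day the previous one ended. This is only a sketch: the task rows are inferred from the outputs in this notebook, and the variable names are made up for illustration.

```python
from datetime import date

# (start_date, end_date) per task, inferred from the notebook outputs (an assumption).
TASKS = [
    ("2020-10-01", "2020-10-02"), ("2020-10-02", "2020-10-03"),
    ("2020-10-03", "2020-10-04"), ("2020-10-13", "2020-10-14"),
    ("2020-10-14", "2020-10-15"), ("2020-10-28", "2020-10-29"),
    ("2020-10-30", "2020-10-31"),
]

tasks = sorted((date.fromisoformat(s), date.fromisoformat(e)) for s, e in TASKS)

# Merge consecutive tasks: a task that starts on the previous task's end
# date extends the current project; otherwise it opens a new one.
merged = []
for start, end in tasks:
    if merged and merged[-1][1] == start:
        merged[-1][1] = end
    else:
        merged.append([start, end])

# Same ordering as the query: duration ascending, then start date ascending.
result = sorted(
    ((s.isoformat(), e.isoformat(), (e - s).days) for s, e in merged),
    key=lambda r: (r[2], r[0]),
)

for row in result:
    print(row)
```

If the inferred rows match the real table, `result` lists the same four projects, in the same order, as the query output above.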
