## Danny Ma's Serious SQL Course - Data Exploration

### a. Select and Sort SQL Queries

1. What is the name of the category with the highest category_id in the dvd_rentals.category table?
~~~sql
SELECT
  category_id,
  name
FROM dvd_rentals.category
ORDER BY category_id DESC
LIMIT 1;
~~~

![image info](./week1_screenshots/s1.png)

2. For the films with the longest length, what is the title of the “R” rated film with the lowest replacement_cost in dvd_rentals.film table?
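~~~sql
SELECT
  title,
  replacement_cost,
  length,
  rating
FROM dvd_rentals.film
-- Longest films first, then cheapest replacement_cost;
-- the first "R" row among the longest films is the answer.
ORDER BY length DESC, replacement_cost
LIMIT 10;
~~~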

![image info](./week1_screenshots/s2.png)
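
The query above relies on reading the top rows by eye. A stricter variant could filter to the longest films and the "R" rating before sorting. The sketch below uses a hypothetical `toy_film` CTE with made-up rows instead of `dvd_rentals.film`, so it can be run on its own in PostgreSQL.

~~~sql
-- Toy rows standing in for dvd_rentals.film (values are made up for illustration).
WITH toy_film (title, length, rating, replacement_cost) AS (
  VALUES
    ('Film A', 185, 'PG', 9.99),
    ('Film B', 185, 'R', 14.99),
    ('Film C', 185, 'R', 12.99),
    ('Film D', 120, 'R', 9.99)
)
SELECT title, length, rating, replacement_cost
FROM toy_film
WHERE length = (SELECT MAX(length) FROM toy_film)  -- keep only the longest films
  AND rating = 'R'                                 -- then only the R-rated ones
ORDER BY replacement_cost                          -- cheapest replacement first
LIMIT 1;
-- returns the 'Film C' row
~~~

Filtering on `MAX(length)` first matters: sorting only R-rated films by length would return the longest R-rated film even when it is not among the longest films overall.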

3. Who was the manager of the store with the highest total_sales in the dvd_rentals.sales_by_store table?

~~~sql
SELECT
  manager,
  total_sales
FROM dvd_rentals.sales_by_store
ORDER BY total_sales DESC
LIMIT 2;
~~~

![image info](./week1_screenshots/s3.png)

4. What is the postal_code of the city with the 5th highest city_id in the dvd_rentals.address table?

~~~sql
SELECT
  city_id,
  postal_code
FROM dvd_rentals.address
ORDER BY city_id DESC
-- The fifth (last) row returned is the answer.
LIMIT 5;
~~~

![image info](./week1_screenshots/s4.png)

### b. Record Counts & Distinct Values

1. Which actor_id has the highest number of unique film_id records in the dvd_rentals.film_actor table?

~~~sql
SELECT
  actor_id,
  COUNT(DISTINCT film_id) AS film_count
FROM dvd_rentals.film_actor
GROUP BY actor_id
ORDER BY film_count DESC
LIMIT 2;
~~~

![image info](./week1_screenshots/s5.png)

2. How many distinct fid values are there for the 3rd most common price value in the dvd_rentals.nicer_but_slower_film_list table?

~~~sql
SELECT
  price,
  COUNT(DISTINCT fid) AS fid_count
FROM dvd_rentals.nicer_but_slower_film_list
GROUP BY price
-- The third row returned is the answer.
ORDER BY fid_count DESC;
~~~

![image info](./week1_screenshots/s6.png)
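
Instead of reading the third row from the full result, the row could be picked directly with `LIMIT 1 OFFSET 2`. This is a minimal sketch on a hypothetical `toy_list` CTE with made-up prices, and it assumes there are no ties in the counts.

~~~sql
-- Toy rows standing in for dvd_rentals.nicer_but_slower_film_list (made up).
WITH toy_list (fid, price) AS (
  VALUES
    (1, 0.99), (2, 0.99), (3, 0.99), (4, 0.99),
    (5, 2.99), (6, 2.99), (7, 2.99),
    (8, 4.99), (9, 4.99)
)
SELECT
  price,
  COUNT(DISTINCT fid) AS fid_count
FROM toy_list
GROUP BY price
ORDER BY fid_count DESC
LIMIT 1 OFFSET 2;  -- skip the two most common prices, keep the third
-- returns price 4.99 with fid_count 2
~~~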

3. How many unique country_id values exist in the dvd_rentals.city table?

~~~sql
SELECT
  COUNT(DISTINCT country_id)
FROM dvd_rentals.city;
~~~

![image info](./week1_screenshots/s7a.png)

4. What percentage of overall total_sales does the Sports category make up in the dvd_rentals.sales_by_film_category table?

~~~sql
SELECT
  category,
  ROUND(
    100 * total_sales::NUMERIC / SUM(total_sales) OVER (),
    2
  ) AS percent
FROM dvd_rentals.sales_by_film_category;
~~~

![image info](./week1_screenshots/s8.png)

5. What percentage of unique fid values are in the Children category in the dvd_rentals.film_list table?

~~~sql
SELECT
  category,
  -- Use the same distinct count in the numerator and the denominator.
  ROUND(
    100 * COUNT(DISTINCT fid)::NUMERIC / SUM(COUNT(DISTINCT fid)) OVER (),
    2
  ) AS percent
FROM dvd_rentals.film_list
GROUP BY category
ORDER BY category;
~~~

![image info](./week1_screenshots/s9.png)
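
The percentage queries in this section combine a grouped aggregate with a window `SUM(...) OVER ()` to get each group's share of the total. The sketch below shows the same pattern on a hypothetical `toy_film_list` CTE with made-up rows, so the arithmetic is easy to check by hand.

~~~sql
-- Toy rows standing in for dvd_rentals.film_list (made up for illustration).
WITH toy_film_list (fid, category) AS (
  VALUES
    (1, 'Children'), (2, 'Children'),
    (3, 'Sports'), (4, 'Sports'), (5, 'Sports')
)
SELECT
  category,
  COUNT(DISTINCT fid) AS fid_count,
  -- The window SUM runs over the grouped rows, so it adds up the per-category counts.
  ROUND(
    100 * COUNT(DISTINCT fid)::NUMERIC / SUM(COUNT(DISTINCT fid)) OVER (),
    2
  ) AS percent
FROM toy_film_list
GROUP BY category
ORDER BY category;
-- returns Children, 2, 40.00 and Sports, 3, 60.00
~~~

The `::NUMERIC` cast keeps the division from being carried out as integer division.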

### c. Identifying Duplicate Records

1. Which id value has the most duplicate records in the health.user_logs table?

~~~sql
WITH groupby_counts AS (
  SELECT
    id,
    log_date,
    measure,
    measure_value,
    systolic,
    diastolic,
    COUNT(*) AS frequency
  FROM health.user_logs
  GROUP BY
    id,
    log_date,
    measure,
    measure_value,
    systolic,
    diastolic
)

SELECT
  id,
  SUM(frequency) AS duplicate_records
FROM groupby_counts
WHERE frequency > 1
GROUP BY id
ORDER BY duplicate_records DESC;
~~~


![image info](./week1_screenshots/s10.png)

2. Which log_date value had the most duplicate records after removing the max duplicate id value from question 1?

~~~sql
WITH groupby_counts AS (
  SELECT
    id,
    log_date,
    measure,
    measure_value,
    systolic,
    diastolic,
    COUNT(*) AS frequency
  FROM health.user_logs
  WHERE id != '054250c692e07a9fa9e62e345231df4b54ff435d'
  GROUP BY
    id,
    log_date,
    measure,
    measure_value,
    systolic,
    diastolic
)

SELECT
  log_date,
  SUM(frequency) AS duplicate_records
FROM groupby_counts
WHERE frequency > 1
GROUP BY log_date
ORDER BY duplicate_records DESC;
~~~



![image info](./week1_screenshots/s11.png)

3. Which measure_value had the most occurrences in the health.user_logs table when measure = 'weight'?

~~~sql
SELECT
  measure_value,
  COUNT(measure_value) AS measure_count
FROM health.user_logs
WHERE measure = 'weight'
GROUP BY measure_value
ORDER BY measure_count DESC;
~~~

![image info](./week1_screenshots/s12.png)

4. How many single duplicated rows exist when measure = 'blood_pressure' in the health.user_logs? How about the total number of duplicate records in the same table?

~~~sql
WITH groupby_count AS (
  SELECT
    id,
    log_date,
    measure,
    measure_value,
    systolic,
    diastolic,
    COUNT(*) AS frequency
  FROM health.user_logs
  WHERE measure = 'blood_pressure'
  GROUP BY
    id,
    log_date,
    measure,
    measure_value,
    systolic,
    diastolic
)

SELECT
  COUNT(*) AS single_duplicated_row_count,
  SUM(frequency) AS total_duplicated_rows
FROM groupby_count
WHERE frequency > 1;
~~~


![image info](./week1_screenshots/s13.png)
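
The duplicate checks in this section all group by every column and keep the groups whose frequency is above 1. The sketch below runs that pattern on a hypothetical `toy_logs` CTE with made-up rows and fewer columns than `health.user_logs`, which makes the difference between the two counts visible.

~~~sql
-- Toy rows standing in for health.user_logs (made up, fewer columns).
WITH toy_logs (id, log_date, measure_value) AS (
  VALUES
    ('a', DATE '2020-01-01', 5.0),
    ('a', DATE '2020-01-01', 5.0),
    ('a', DATE '2020-01-01', 5.0),
    ('b', DATE '2020-01-02', 7.0),
    ('b', DATE '2020-01-02', 7.0),
    ('c', DATE '2020-01-03', 9.0)
),
groupby_counts AS (
  SELECT
    id,
    log_date,
    measure_value,
    COUNT(*) AS frequency  -- how many times each exact row appears
  FROM toy_logs
  GROUP BY id, log_date, measure_value
)
SELECT
  COUNT(*) AS single_duplicated_row_count,  -- distinct rows that appear more than once
  SUM(frequency) AS total_duplicated_rows   -- every copy of those rows
FROM groupby_counts
WHERE frequency > 1;
-- returns 2 and 5
~~~

`COUNT(*)` counts each duplicated row once, while `SUM(frequency)` counts all of its copies, so the second number is always at least twice the first.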

5. What percentage of records have measure_value = 0 when measure = 'blood_pressure' in the health.user_logs table? How many records are there for this same condition?

~~~sql
WITH groupby_counts AS (
  SELECT
    measure_value,
    COUNT(*) AS measure_count,
    SUM(COUNT(*)) OVER () AS overall_total
  FROM health.user_logs
  WHERE measure = 'blood_pressure'
  GROUP BY 1
)

SELECT
  measure_value,
  measure_count,
  overall_total,
  ROUND(100 * measure_count / overall_total, 2) AS percentage
FROM groupby_counts
WHERE measure_value = 0;
~~~


![image info](./week1_screenshots/s14.png)

6. What percentage of records are duplicates in the health.user_logs table?

~~~sql
WITH deduped_logs AS (
  SELECT DISTINCT *
  FROM health.user_logs
)
SELECT
  ROUND(
    100 * (
      (SELECT COUNT(*) FROM health.user_logs) -
      (SELECT COUNT(*) FROM deduped_logs)
    )::NUMERIC /
    (SELECT COUNT(*) FROM health.user_logs),
    2
  ) AS duplicate_percentage;
~~~

![image info](./week1_screenshots/s15.png)

### d. Summary Statistics

1. What are the average, median and mode of blood glucose values, to 2 decimal places?

~~~sql
SELECT
  ROUND(AVG(measure_value), 2) AS average_value,
  ROUND(
    CAST(
      PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY measure_value) AS NUMERIC
    ),
    2
  ) AS median_value,
  ROUND(
    CAST(
      MODE() WITHIN GROUP (ORDER BY measure_value) AS NUMERIC
    ),
    2
  ) AS mode_value
FROM health.user_logs
WHERE measure = 'blood_glucose';
~~~

![image info](./week1_screenshots/s16.png)

2. What is the most frequently occurring measure_value for all blood glucose measurements?

~~~sql
SELECT
  measure_value,
  COUNT(measure_value) AS measure_count
FROM health.user_logs
WHERE measure = 'blood_glucose'
GROUP BY measure_value
ORDER BY measure_count DESC
LIMIT 5;
~~~

![image info](./week1_screenshots/s17.png)

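3. Calculate the two Pearson coefficients of skewness for blood glucose measures, given the following formulas:
   - Coefficient 1 = (mean - mode) / standard deviation
   - Coefficient 2 = 3 * (mean - median) / standard deviation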

~~~sql
WITH blood_glucose AS (
  SELECT
    AVG(measure_value) AS average_value,
    MODE() WITHIN GROUP (ORDER BY measure_value) AS mode_value,
    STDDEV(measure_value) AS std_value,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY measure_value) AS median_value
  FROM health.user_logs
  WHERE measure = 'blood_glucose'
)
SELECT
  ROUND(CAST((average_value - mode_value) / std_value AS NUMERIC), 2) AS coeff1,
  ROUND(CAST(3 * (average_value - median_value) / std_value AS NUMERIC), 2) AS coeff2
FROM blood_glucose;
~~~

![image info](./week1_screenshots/s18.png)
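
To sanity-check the skewness query, it can help to run the same calculation on a handful of values small enough to verify by hand. The sketch below uses a hypothetical `toy_values` CTE with made-up numbers (mean 3, median 2, mode 2) in place of `health.user_logs`.

~~~sql
-- Toy measurements (made up for illustration), not rows from health.user_logs.
WITH toy_values (measure_value) AS (
  VALUES (1.0), (2.0), (2.0), (3.0), (7.0)
),
summary AS (
  SELECT
    AVG(measure_value) AS average_value,
    MODE() WITHIN GROUP (ORDER BY measure_value) AS mode_value,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY measure_value) AS median_value,
    STDDEV(measure_value) AS std_value  -- sample standard deviation
  FROM toy_values
)
SELECT
  ROUND(CAST((average_value - mode_value) / std_value AS NUMERIC), 2) AS coeff1,
  ROUND(CAST(3 * (average_value - median_value) / std_value AS NUMERIC), 2) AS coeff2
FROM summary;
-- returns roughly 0.43 and 1.28
~~~

Both coefficients come out positive because the single large value (7) pulls the mean above the median and the mode, which is the signature of a right-skewed distribution.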