# Case Study #8: Fresh Segments
The case study questions presented here are created by [**Data With Danny**](https://linktr.ee/datawithdanny). They are part of the [**8 Week SQL Challenge**](https://8weeksqlchallenge.com/).

My SQL queries are written in the `PostgreSQL 15` dialect and run from a `Jupyter Notebook`, which lets us view the query results immediately and document the queries alongside them.

For more details about the **Case Study #8**, click [**here**](https://8weeksqlchallenge.com/case-study-8/).

## Table of Contents
### [1. Importing Libraries](#Import)
### [2. Tables of the Database](#Tables)
### [3. Case Study Questions](#CaseStudyQuestions)
- [A. Data Exploration and Cleansing](#A)
- [B. Interest Analysis](#B)
- [C. Segment Analysis](#C)
- [D. Index Analysis](#D)

___
<a id = 'Import'></a>
## 1. Import Libraries

In [1]:
import os
import pandas as pd
import psycopg2 as pg2
import warnings

warnings.filterwarnings("ignore")

### Connecting the database from Jupyter Notebook

In [2]:
# Get PostgreSQL password
mypassword = os.getenv("POSTGRESQL_PASSWORD")

# Connect the database
try:
    conn = pg2.connect(user = 'postgres', password = mypassword, database = 'fresh_segments')
    cursor = conn.cursor()
    print("Database connection successful")
except pg2.Error as err:
    print(f"Error: '{err}'")

Database connection successful


___
<a id = 'Tables'></a>
## 2. Tables of the database

In [3]:
cursor.execute("""
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'fresh_segments'
""")

table_names = []
print('--- Tables within "fresh_segments" database --- ')
for table in cursor:
    print(table[1])
    table_names.append(table[1])

--- Tables within "fresh_segments" database --- 
json_data
interest_map
interest_metrics


In [4]:
for table in table_names:
    print('Table: ', table)
    display(pd.read_sql("SELECT * FROM fresh_segments." + table, conn))

Table:  json_data


Unnamed: 0,raw_data
0,"{'month': 7, 'year': 2018, 'month_year': '07-2..."
1,"{'month': 7, 'year': 2018, 'month_year': '07-2..."
2,"{'month': 7, 'year': 2018, 'month_year': '07-2..."
3,"{'month': 7, 'year': 2018, 'month_year': '07-2..."


Table:  interest_map


Unnamed: 0,id,interest_name,interest_summary,created_at,last_modified
0,1,Fitness Enthusiasts,Consumers using fitness tracking apps and webs...,2016-05-26 14:57:59,2018-05-23 11:30:12
1,2,Gamers,Consumers researching game reviews and cheat c...,2016-05-26 14:57:59,2018-05-23 11:30:12
2,3,Car Enthusiasts,Readers of automotive news and car reviews.,2016-05-26 14:57:59,2018-05-23 11:30:12
3,4,Luxury Retail Researchers,Consumers researching luxury product reviews a...,2016-05-26 14:57:59,2018-05-23 11:30:12
4,5,Brides & Wedding Planners,People researching wedding ideas and vendors.,2016-05-26 14:57:59,2018-05-23 11:30:12
...,...,...,...,...,...
1204,6391,HVAC Service Researchers,,2017-06-08 12:21:04,2017-06-08 12:21:04
1205,7483,Economy Grocery Shoppers,,2017-07-06 15:45:18,2017-07-06 15:45:18
1206,7527,Democratic Donors,,2017-07-17 17:24:48,2017-07-17 17:24:48
1207,7557,Tailgaters,,2017-07-20 17:31:31,2019-01-16 09:11:30


Table:  interest_metrics


Unnamed: 0,_month,_year,month_year,interest_id,composition,index_value,ranking,percentile_ranking
0,7,2018,07-2018,32486,11.89,6.19,1,99.86
1,7,2018,07-2018,6106,9.93,5.31,2,99.73
2,7,2018,07-2018,18923,10.85,5.29,3,99.59
3,7,2018,07-2018,6344,10.32,5.10,4,99.45
4,7,2018,07-2018,100,10.77,5.04,5,99.31
...,...,...,...,...,...,...,...,...
14268,,,,,1.60,0.72,1189,0.42
14269,,,,,1.62,0.71,1190,0.34
14270,,,,,1.62,0.68,1191,0.25
14271,,,,,1.51,0.63,1193,0.08


<a id = 'CaseStudyQuestions'></a>
## 3. Case Study Questions

<a id = 'A'></a>
## A. Data Exploration and Cleansing

#### 1. Update the `fresh_segments.interest_metrics` table by modifying the `month_year` column to be a date data type with the start of the month

### Checking the datatypes

Using the `pg_typeof()` function, the query below shows that `month_year` is stored as `character varying` rather than as a date.

In [5]:
pd.read_sql("""
SELECT
    pg_typeof(month_year) AS month_year
FROM fresh_segments.interest_metrics
LIMIT 1
""", conn)

Unnamed: 0,month_year
0,character varying


Alternatively, checking the datatype of every column shows that `_month`, `_year`, `month_year` and `interest_id` are all stored as `character varying`, which is not the appropriate type for any of them.

In [6]:
pd.read_sql("""
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'fresh_segments' AND table_name = 'interest_metrics' 
AND column_name IN ('_month', '_year', 'month_year', 'interest_id', 'composition', 'index_value', 'ranking', 'percentile_ranking');
""", conn)

Unnamed: 0,column_name,data_type
0,percentile_ranking,double precision
1,composition,double precision
2,index_value,double precision
3,ranking,integer
4,_month,character varying
5,_year,character varying
6,month_year,character varying
7,interest_id,character varying


When cleaning data in SQL, the most important practice is to protect the original dataset, so that it remains available whatever mistakes are made along the way. To achieve this, we create a **backup table** named `interest_metrics_backup` and copy the entire original `interest_metrics` table into it. We can then clean `interest_metrics` directly without any risk of losing the original data.

### Creating Backup Table

In [7]:
# Create the empty backup table
cursor.execute("DROP TABLE IF EXISTS fresh_segments.interest_metrics_backup;")
cursor.execute("""
CREATE TABLE fresh_segments.interest_metrics_backup
(
  "_month" VARCHAR(4),
  "_year" VARCHAR(4),
  "month_year" VARCHAR(7),
  "interest_id" VARCHAR(5),
  "composition" FLOAT,
  "index_value" FLOAT,
  "ranking" INTEGER,
  "percentile_ranking" FLOAT
);""")


# Copy/Paste the original "interest_metrics" into the "interest_metrics_backup"
cursor.execute("""
INSERT INTO fresh_segments.interest_metrics_backup
SELECT *
FROM fresh_segments.interest_metrics
""")

# Save the updates
conn.commit()
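
As a more compact alternative to the two-step approach above, a backup can be created and filled in one statement with `CREATE TABLE ... AS SELECT`. The sketch below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for PostgreSQL, the `demo_conn` and `demo_cur` names are hypothetical, and the table holds just a few sample rows.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the PostgreSQL database
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()

# Hypothetical miniature version of interest_metrics (assumption, not the real table)
demo_cur.execute(
    "CREATE TABLE interest_metrics (month_year TEXT, interest_id TEXT, composition REAL)"
)
demo_cur.executemany(
    "INSERT INTO interest_metrics VALUES (?, ?, ?)",
    [("07-2018", "32486", 11.89), ("07-2018", "6106", 9.93), (None, None, 1.51)],
)

# Create and fill the backup in a single statement
demo_cur.execute("DROP TABLE IF EXISTS interest_metrics_backup")
demo_cur.execute("CREATE TABLE interest_metrics_backup AS SELECT * FROM interest_metrics")
demo_conn.commit()

# Both tables should now hold the same number of rows
original_count = demo_cur.execute("SELECT COUNT(*) FROM interest_metrics").fetchone()[0]
backup_count = demo_cur.execute("SELECT COUNT(*) FROM interest_metrics_backup").fetchone()[0]
print(original_count, backup_count)
```

Keep in mind that `CREATE TABLE ... AS` copies the column data but not constraints or indexes, so the explicit `CREATE TABLE` used in the notebook gives more control over the backup's column definitions.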

### Data Cleaning on `interest_metrics` table

In [8]:
# Correct the datatypes of the columns
cursor.execute("""
ALTER TABLE fresh_segments.interest_metrics
    ALTER COLUMN _month TYPE INTEGER USING _month::INTEGER,
    ALTER COLUMN _year TYPE INTEGER USING _year::INTEGER,
    ALTER COLUMN interest_id TYPE INTEGER USING interest_id::INTEGER,
    ALTER COLUMN month_year TYPE DATE USING TO_DATE(month_year, 'mm-yyyy');
""")


# Rename the columns "_month" and "_year"
cursor.execute("""
ALTER TABLE fresh_segments.interest_metrics
    RENAME COLUMN _month TO month;
""")

cursor.execute("""
ALTER TABLE fresh_segments.interest_metrics
    RENAME COLUMN _year TO year;
""")


# Save the updates
conn.commit()
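
`TO_DATE(month_year, 'mm-yyyy')` fills in the missing day with the first of the month, which is why `07-2018` becomes `2018-07-01`. A minimal Python sketch of the same conversion, assuming the input is either an `mm-yyyy` string or `None` (the helper name `month_year_to_date` is hypothetical):

```python
from datetime import date, datetime


def month_year_to_date(value):
    """Convert an 'mm-yyyy' string to the first day of that month; pass None through."""
    if value is None:
        return None
    # strptime defaults the missing day to 1
    return datetime.strptime(value, "%m-%Y").date()


print(month_year_to_date("07-2018"))
```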

Let's check the datatypes of the columns once more.

In [9]:
pd.read_sql("""
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'fresh_segments' AND table_name = 'interest_metrics' 
AND column_name IN ('month', 'year', 'month_year', 'interest_id', 'composition', 'index_value', 'ranking', 'percentile_ranking');
""", conn)

Unnamed: 0,column_name,data_type
0,month,integer
1,year,integer
2,month_year,date
3,interest_id,integer
4,composition,double precision
5,index_value,double precision
6,ranking,integer
7,percentile_ranking,double precision


Here is the processed table:

In [10]:
pd.read_sql("SELECT * FROM fresh_segments.interest_metrics", conn)

Unnamed: 0,month,year,month_year,interest_id,composition,index_value,ranking,percentile_ranking
0,7.0,2018.0,2018-07-01,32486.0,11.89,6.19,1,99.86
1,7.0,2018.0,2018-07-01,6106.0,9.93,5.31,2,99.73
2,7.0,2018.0,2018-07-01,18923.0,10.85,5.29,3,99.59
3,7.0,2018.0,2018-07-01,6344.0,10.32,5.10,4,99.45
4,7.0,2018.0,2018-07-01,100.0,10.77,5.04,5,99.31
...,...,...,...,...,...,...,...,...
14268,,,,,1.60,0.72,1189,0.42
14269,,,,,1.62,0.71,1190,0.34
14270,,,,,1.62,0.68,1191,0.25
14271,,,,,1.51,0.63,1193,0.08


___
#### 2. What is count of records in the `fresh_segments.interest_metrics` for each `month_year` value sorted in chronological order (earliest to latest) with the null values appearing first?

In [11]:
pd.read_sql("""
SELECT 
    month_year, 
    COUNT(*) AS nb_records
FROM fresh_segments.interest_metrics
GROUP BY month_year
ORDER BY (month_year IS NOT NULL), month_year
""", conn)

Unnamed: 0,month_year,nb_records
0,,1194
1,2018-07-01,729
2,2018-08-01,767
3,2018-09-01,780
4,2018-10-01,857
5,2018-11-01,928
6,2018-12-01,995
7,2019-01-01,973
8,2019-02-01,1121
9,2019-03-01,1136


___
#### 3. What do you think we should do with these null values in the `fresh_segments.interest_metrics`?

**Result**</br>
- Retaining the `NULL` values in the `month`, `year`, and `month_year` columns would result in less accurate analysis since the date of the activity cannot be determined.
- Additionally, keeping the `NULL` values in the `interest_id` column provides no information, as we cannot identify the specific interest it refers to.
- Therefore, for our analysis, we will exclude all rows with `NULL` values by filtering them out instead of deleting them entirely.

___
#### 4. How many `interest_id` values exist in the `fresh_segments.interest_metrics` table but not in the `fresh_segments.interest_map` table? What about the other way around?

There is no `interest_id` that exists in the `interest_metrics` table but not in the `interest_map` table.

In [12]:
pd.read_sql("""
SELECT COUNT(interest_id) AS count_not_in_map
FROM fresh_segments.interest_metrics me
LEFT JOIN fresh_segments.interest_map ma ON me.interest_id = ma.id
WHERE interest_id IS NOT NULL AND ma.id IS NULL
""", conn)

Unnamed: 0,count_not_in_map
0,0


There are 7 `interest_id` values that exist in the `interest_map` table but not in the `interest_metrics` table.

In [13]:
pd.read_sql("""
SELECT COUNT(ma.id) AS count_not_in_metrics
FROM fresh_segments.interest_metrics me
RIGHT JOIN fresh_segments.interest_map ma ON me.interest_id = ma.id
WHERE me.interest_id IS NULL
""", conn)

Unnamed: 0,count_not_in_metrics
0,7


The 7 `interest_id` values are the following:

In [14]:
pd.read_sql("""
SELECT ma.id AS ids_not_in_metrics
FROM fresh_segments.interest_metrics me
RIGHT JOIN fresh_segments.interest_map ma ON me.interest_id = ma.id
WHERE me.interest_id IS NULL
""", conn)

Unnamed: 0,ids_not_in_metrics
0,19598
1,35964
2,40185
3,40186
4,42010
5,42400
6,47789
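
The `LEFT JOIN` / `RIGHT JOIN` queries with an `IS NULL` filter are anti-joins: they keep the ids from one table that have no match in the other. Conceptually this is a set difference, as the sketch below shows. The two id sets are small hypothetical samples, not the real table contents.

```python
# Hypothetical id sets, for illustration only
metrics_ids = {1, 2, 3}
map_ids = {1, 2, 3, 19598, 35964}

# Ids present in the metrics but missing from the map
not_in_map = metrics_ids - map_ids

# Ids present in the map but missing from the metrics
not_in_metrics = map_ids - metrics_ids

print(len(not_in_map), sorted(not_in_metrics))
```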


___
#### 5. Summarise the id values in the `fresh_segments.interest_map` by its total record count in this table

In [15]:
pd.read_sql("""
SELECT 
    me.interest_id AS id,
    ma.interest_name,
    COUNT(*) AS nb_records
FROM fresh_segments.interest_metrics me
JOIN fresh_segments.interest_map ma ON me.interest_id = ma.id
WHERE interest_id IS NOT NULL
GROUP BY me.interest_id, ma.interest_name
ORDER BY me.interest_id 
""", conn)

Unnamed: 0,id,interest_name,nb_records
0,1,Fitness Enthusiasts,12
1,2,Gamers,11
2,3,Car Enthusiasts,10
3,4,Luxury Retail Researchers,14
4,5,Brides & Wedding Planners,14
...,...,...,...
1197,49979,Cape Cod News Readers,5
1198,50860,Food Delivery Service Users,4
1199,51119,Skin Disorder Researchers,4
1200,51120,Foot Health Researchers,4


___
#### 6. What sort of table join should we perform for our analysis and why? Check your logic by checking the rows where interest_id = 21246 in your joined output and include all columns from `fresh_segments.interest_metrics` and all columns from `fresh_segments.interest_map` except from the id column.

In [16]:
pd.read_sql("""
SELECT 
    me.*, 
    ma.interest_name, 
    ma.interest_summary, 
    ma.created_at, 
    ma.last_modified
FROM fresh_segments.interest_metrics me
INNER JOIN fresh_segments.interest_map ma ON me.interest_id = ma.id
-- WHERE me.interest_id = 21246
""", conn)

Unnamed: 0,month,year,month_year,interest_id,composition,index_value,ranking,percentile_ranking,interest_name,interest_summary,created_at,last_modified
0,7.0,2018.0,2018-07-01,1,7.02,3.44,46,93.69,Fitness Enthusiasts,Consumers using fitness tracking apps and webs...,2016-05-26 14:57:59,2018-05-23 11:30:12
1,10.0,2018.0,2018-10-01,1,3.71,1.84,118,86.23,Fitness Enthusiasts,Consumers using fitness tracking apps and webs...,2016-05-26 14:57:59,2018-05-23 11:30:12
2,3.0,2019.0,2019-03-01,1,2.76,1.54,244,78.52,Fitness Enthusiasts,Consumers using fitness tracking apps and webs...,2016-05-26 14:57:59,2018-05-23 11:30:12
3,8.0,2019.0,2019-08-01,1,2.64,1.87,394,65.71,Fitness Enthusiasts,Consumers using fitness tracking apps and webs...,2016-05-26 14:57:59,2018-05-23 11:30:12
4,12.0,2018.0,2018-12-01,1,2.94,1.83,140,85.93,Fitness Enthusiasts,Consumers using fitness tracking apps and webs...,2016-05-26 14:57:59,2018-05-23 11:30:12
...,...,...,...,...,...,...,...,...,...,...,...,...
13075,5.0,2019.0,2019-05-01,51120,2.03,1.62,377,56.01,Foot Health Researchers,People reading news and advice on preventing a...,2019-04-26 18:00:00,2019-04-29 14:20:04
13076,7.0,2019.0,2019-07-01,51120,2.20,1.72,428,50.46,Foot Health Researchers,People reading news and advice on preventing a...,2019-04-26 18:00:00,2019-04-29 14:20:04
13077,8.0,2019.0,2019-08-01,51678,2.26,1.38,904,21.32,Plumbers,Professionals reading industry news and resear...,2019-05-06 22:00:00,2019-05-07 18:50:04
13078,7.0,2019.0,2019-07-01,51678,1.97,1.42,718,16.90,Plumbers,Professionals reading industry news and resear...,2019-05-06 22:00:00,2019-05-07 18:50:04


**Result**</br>
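For our analysis, we use an `INNER JOIN`. The goal of the join is to link each `interest_id` in the `interest_metrics` table to its corresponding `interest_name` in the `interest_map` table. Since every non-null `interest_id` in `interest_metrics` has a match in `interest_map` (see question 4), the `INNER JOIN` only drops the rows whose `interest_id` is `NULL`, which we have already decided to exclude.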
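
To make the effect of the join type concrete, here is a small sketch that compares the row counts of an `INNER JOIN` and a `LEFT JOIN`. It is an illustration only: it runs on an in-memory SQLite database rather than PostgreSQL, and the tables hold a handful of sample rows (the `5.5` composition value and the `demo_*` names are made up).

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()

# Miniature stand-ins for the two tables (assumed sample rows)
demo_cur.execute("CREATE TABLE interest_metrics (interest_id INTEGER, composition REAL)")
demo_cur.execute("CREATE TABLE interest_map (id INTEGER, interest_name TEXT)")
demo_cur.executemany(
    "INSERT INTO interest_metrics VALUES (?, ?)",
    [(1, 7.02), (2, 5.5), (None, 1.51)],
)
demo_cur.executemany(
    "INSERT INTO interest_map VALUES (?, ?)",
    [(1, "Fitness Enthusiasts"), (2, "Gamers"), (3, "Car Enthusiasts")],
)

join_sql = """
SELECT COUNT(*)
FROM interest_metrics me
{join_type} JOIN interest_map ma ON me.interest_id = ma.id
"""

# INNER JOIN drops the metrics row with a NULL interest_id; LEFT JOIN keeps it
inner_rows = demo_cur.execute(join_sql.format(join_type="INNER")).fetchone()[0]
left_rows = demo_cur.execute(join_sql.format(join_type="LEFT")).fetchone()[0]
print(inner_rows, left_rows)
```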

___
#### 7. Are there any records in your joined table where the `month_year` value is before the `created_at` value from the `fresh_segments.interest_map` table? Do you think these values are valid and why?

In [17]:
pd.read_sql("""
SELECT 
    me.*, 
    ma.interest_name, 
    ma.interest_summary, 
    ma.created_at, 
    ma.last_modified
FROM fresh_segments.interest_metrics me
INNER JOIN fresh_segments.interest_map ma ON me.interest_id = ma.id
WHERE me.month_year < ma.created_at
""", conn)

Unnamed: 0,month,year,month_year,interest_id,composition,index_value,ranking,percentile_ranking,interest_name,interest_summary,created_at,last_modified
0,7,2018,2018-07-01,32701,4.23,1.41,483,33.74,Womens Equality Advocates,People visiting sites advocating for womens eq...,2018-07-06 14:35:03,2018-07-06 14:35:03
1,7,2018,2018-07-01,32702,3.56,1.18,580,20.44,Romantics,People reading about romance and researching i...,2018-07-06 14:35:04,2018-07-06 14:35:04
2,7,2018,2018-07-01,32703,5.53,1.80,375,48.56,School Supply Shoppers,Consumers shopping for classroom supplies for ...,2018-07-06 14:35:04,2018-07-06 14:35:04
3,7,2018,2018-07-01,32704,8.04,2.27,225,69.14,Major Airline Customers,People visiting sites for major airline brands...,2018-07-06 14:35:04,2018-07-06 14:35:04
4,7,2018,2018-07-01,32705,4.38,1.34,505,30.73,Certified Events Professionals,Professionals reading industry news and resear...,2018-07-06 14:35:04,2018-07-06 14:35:04
...,...,...,...,...,...,...,...,...,...,...,...,...
183,4,2019,2019-04-01,49976,2.35,1.27,530,51.77,Agriculture and Climate Advocates,People supporting organizations for agricultur...,2019-04-15 18:00:00,2019-04-24 17:40:04
184,4,2019,2019-04-01,49977,2.17,1.15,722,34.30,DIY Upcycle Home Project Planners,People researching and planning home DIY and u...,2019-04-15 18:00:00,2019-04-24 17:40:04
185,4,2019,2019-04-01,49978,2.19,1.17,695,36.76,Homeschooling Parents,People researching academic projects and progr...,2019-04-15 18:00:00,2019-04-24 17:40:04
186,4,2019,2019-04-01,49979,2.56,1.70,145,86.81,Cape Cod News Readers,People interested in reading about local news ...,2019-04-15 18:00:00,2019-04-18 09:00:05


**Result**</br>
Yes, there are such records, and they are valid. Because we set every `month_year` value to the first day of its month, an interest created partway through a month has a `month_year` that precedes its `created_at` timestamp. In these records, the `created_at` values fall within the same month and year as the `month_year` values, so the data is consistent and correctly aligned.
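
One way to test this reasoning is to compare `month_year` with the start of the month in which the interest was created, instead of with the raw timestamp. The Python sketch below illustrates the comparison; the helper name is hypothetical, and in PostgreSQL the truncation step could be written with `DATE_TRUNC('month', created_at)`.

```python
from datetime import date, datetime


def created_in_or_before_month(month_year, created_at):
    """Return True if created_at falls in the month_year month or earlier."""
    created_month_start = created_at.date().replace(day=1)
    return created_month_start <= month_year


# Values taken from the first row of the query output
print(created_in_or_before_month(date(2018, 7, 1), datetime(2018, 7, 6, 14, 35, 3)))
```

A record for which this check returned `False` would be a genuine inconsistency, since its metrics would predate the month in which the interest was created.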

___
<a id = 'B'></a>
## B. Interest Analysis

#### 1. Which interests have been present in all `month_year` dates in our dataset?

In the Fresh Segments dataset, data has been recorded for a total of 14 months.

In [18]:
pd.read_sql("""
SELECT COUNT(DISTINCT month_year) AS nb_month_year_values
FROM fresh_segments.interest_metrics
""", conn)

Unnamed: 0,nb_month_year_values
0,14


Combining this with the count above, we can use a subquery in the `WHERE` clause to keep only the interests that appear in all 14 months.

In [19]:
pd.read_sql("""
WITH count_month_year_values_cte AS
(
    SELECT *, ROW_NUMBER() OVER (PARTITION BY interest_id) AS nb_month_year_values
    FROM fresh_segments.interest_metrics
    WHERE month_year IS NOT NULL
)
SELECT 
    cte.interest_id, 
    ma.interest_name, 
    nb_month_year_values
FROM count_month_year_values_cte cte
JOIN fresh_segments.interest_map ma ON cte.interest_id = ma.id
WHERE nb_month_year_values = (SELECT COUNT(DISTINCT month_year) FROM fresh_segments.interest_metrics)
ORDER BY ma.interest_name
""", conn)

Unnamed: 0,interest_id,interest_name,nb_month_year_values
0,6183,Accounting & CPA Continuing Education Researchers,14
1,18347,Affordable Hotel Bookers,14
2,129,Aftermarket Accessories Shoppers,14
3,7541,Alabama Trip Planners,14
4,10284,Alaskan Cruise Planners,14
...,...,...,...
475,19250,World Cup Enthusiasts,14
476,6234,Yachting Enthusiasts,14
477,22427,Yale University Fans,14
478,4902,Yogis,14
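
The underlying logic is "count the distinct months per interest and keep the interests whose count equals the total number of months". In SQL this could also be written with `GROUP BY interest_id` and `HAVING COUNT(DISTINCT month_year) = ...`. The Python sketch below shows the same logic on a few hypothetical records:

```python
from collections import defaultdict

# Hypothetical (interest_id, month_year) records, for illustration only
records = [
    (1, "2018-07-01"), (1, "2018-08-01"), (1, "2018-09-01"),
    (2, "2018-07-01"), (2, "2018-09-01"),
    (3, "2018-07-01"), (3, "2018-08-01"), (3, "2018-09-01"),
]

# Collect the distinct months seen for each interest
months_by_interest = defaultdict(set)
for interest_id, month_year in records:
    months_by_interest[interest_id].add(month_year)

# Keep the interests that appear in every month of the dataset
all_months = {month_year for _, month_year in records}
present_in_all = sorted(
    interest_id
    for interest_id, months in months_by_interest.items()
    if len(months) == len(all_months)
)
print(present_in_all)
```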


___
#### 2. Using this same `total_months` measure - calculate the cumulative percentage of all records starting at 14 months - which `total_months` value passes the 90% cumulative percentage value?

In [20]:
pd.read_sql("""
WITH nb_interests_per_month_cte AS
(
    -- Count the number of interests per month
    
    SELECT 
        total_months, 
        COUNT(interest_id) AS nb_interests
    FROM
    (
        -- Count the total number of month_year values for each interest_id
        
        SELECT 
            interest_id, 
            COUNT(DISTINCT month_year) AS total_months
        FROM fresh_segments.interest_metrics
        WHERE interest_id IS NOT NULL
        GROUP BY interest_id
    ) total_months
    GROUP BY total_months

), cumulative_percent_cte AS
(
    -- Compute the cumulative percentage for each month
    
    SELECT 
        total_months, 
        nb_interests, 
        ROUND(SUM(nb_interests) OVER (ORDER BY total_months DESC)/SUM(nb_interests) OVER () * 100,1) AS cumulative_percent
    FROM nb_interests_per_month_cte
)
SELECT *
FROM cumulative_percent_cte
-- WHERE cumulative_percent >= 90
""", conn)

Unnamed: 0,total_months,nb_interests,cumulative_percent
0,14,480,39.9
1,13,82,46.8
2,12,65,52.2
3,11,94,60.0
4,10,86,67.1
5,9,95,75.0
6,8,67,80.6
7,7,90,88.1
8,6,33,90.8
9,5,38,94.0


**Result**</br>
Based on the cumulative percentage of all records starting at 14 months, the `total_months` value that passes the 90% threshold is 6, where the cumulative percentage reaches 90.8%.
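
The cumulative percentage is a running total divided by the grand total. The sketch below reproduces that arithmetic in Python on a small hypothetical distribution (the counts are made up so that the numbers are easy to follow):

```python
from itertools import accumulate

# Hypothetical (total_months, nb_interests) pairs, sorted from most to fewest months
counts = [(4, 50), (3, 30), (2, 15), (1, 5)]

total = sum(n for _, n in counts)
running = accumulate(n for _, n in counts)

# Running total as a percentage of the grand total, rounded to 1 decimal place
cumulative = [
    (months, round(100 * run / total, 1))
    for (months, _), run in zip(counts, running)
]
print(cumulative)

# First total_months value whose cumulative percentage exceeds 90%
threshold_months = next(months for months, pct in cumulative if pct > 90)
print(threshold_months)
```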

___
#### 3. If we were to remove all `interest_id` values which are lower than the `total_months` value we found in the previous question - how many total data points would we be removing?

In [21]:
pd.read_sql("""
SELECT COUNT(interest_id) AS "Number of interest IDs where month_year value is less than 6"
FROM
(
    SELECT interest_id, COUNT(DISTINCT month_year) AS total_months
    FROM fresh_segments.interest_metrics
    WHERE interest_id IS NOT NULL 
    GROUP BY interest_id
) total_months
WHERE total_months < 6
""", conn)

Unnamed: 0,Number of interest IDs where month_year value is less than 6
0,110


In [22]:
pd.read_sql("""
SELECT COUNT(*) AS "Number of records (data points) where month_year value is less than 6"
FROM
(
    SELECT interest_id, COUNT(DISTINCT month_year) AS total_months
    FROM fresh_segments.interest_metrics
    WHERE interest_id IS NOT NULL 
    GROUP BY interest_id
) total_months
JOIN fresh_segments.interest_metrics me ON total_months.interest_id = me.interest_id
WHERE total_months < 6
""", conn)

Unnamed: 0,Number of records (data points) where month_year value is less than 6
0,400


___
#### 4. Does this decision make sense to remove these data points from a business perspective? Use an example where there are all 14 months present to a removed interest example for your arguments - think about what it means to have less months present from a segment perspective.

In [23]:
pd.read_sql("""
WITH total_months_cte AS
(
    SELECT interest_id, COUNT(DISTINCT month_year) AS total_months
    FROM fresh_segments.interest_metrics
    GROUP BY interest_id
),
full_table_cte AS
(
    SELECT me.interest_id, me.month_year, c.total_months
    FROM fresh_segments.interest_metrics me
    JOIN total_months_cte c ON me.interest_id = c.interest_id
)
SELECT 
    c.month_year, 
    COUNT(c.interest_id) AS nb_present_interest, 
    less.nb_removed_interest,
    CONCAT(ROUND(less.nb_removed_interest/COUNT(c.interest_id)::NUMERIC * 100, 1), ' %') AS "Removed Interest Percentage"
FROM full_table_cte c
JOIN
(
    SELECT month_year, COUNT(interest_id) AS nb_removed_interest
    FROM full_table_cte
    WHERE total_months < 6
    GROUP BY month_year
) less ON less.month_year = c.month_year

GROUP BY c.month_year, less.nb_removed_interest
ORDER BY c.month_year
""", conn)

Unnamed: 0,month_year,nb_present_interest,nb_removed_interest,Removed Interest Percentage
0,2018-07-01,729,20,2.7 %
1,2018-08-01,767,15,2.0 %
2,2018-09-01,780,6,0.8 %
3,2018-10-01,857,4,0.5 %
4,2018-11-01,928,3,0.3 %
5,2018-12-01,995,9,0.9 %
6,2019-01-01,973,7,0.7 %
7,2019-02-01,1121,49,4.4 %
8,2019-03-01,1136,58,5.1 %
9,2019-04-01,1099,64,5.8 %


___
#### 5. After removing these interests - how many unique interests are there for each month?

In [24]:
# Remove the interest IDs with fewer than 6 months' worth of data
cursor.execute("""
DELETE FROM fresh_segments.interest_metrics
WHERE interest_id IN 
(
    SELECT interest_id
    FROM fresh_segments.interest_metrics
    GROUP BY interest_id
    HAVING COUNT(month_year) < 6
);
""")

# Save the updates
conn.commit()

In [25]:
pd.read_sql("""
SELECT month_year, COUNT(DISTINCT interest_id) AS "Number of Interests"
FROM fresh_segments.interest_metrics
WHERE month_year IS NOT NULL
GROUP BY month_year
ORDER BY month_year
""", conn)

Unnamed: 0,month_year,Number of Interests
0,2018-07-01,709
1,2018-08-01,752
2,2018-09-01,774
3,2018-10-01,853
4,2018-11-01,925
5,2018-12-01,986
6,2019-01-01,966
7,2019-02-01,1072
8,2019-03-01,1078
9,2019-04-01,1035


___
<a id = 'C'></a>
## C. Segment Analysis

#### 1. Using our filtered dataset by removing the interests with less than 6 months worth of data, which are the top 10 and bottom 10 interests which have the largest composition values in any month_year? Only use the maximum composition value for each interest but you must keep the corresponding month_year

### Result: Top 10 interests

In [26]:
pd.read_sql("""
WITH max_composition_cte AS
(
    SELECT 
        me.month_year, 
        me.interest_id, 
        ma.interest_name, 
        MAX(me.composition) AS max_composition
    FROM fresh_segments.interest_metrics me
    JOIN fresh_segments.interest_map ma ON me.interest_id = ma.id
    GROUP BY me.month_year, me.interest_id, ma.interest_name 
)
SELECT 
    month_year, 
    interest_name AS "Top 10 interests", 
    max_composition
FROM max_composition_cte 
ORDER BY max_composition DESC
LIMIT 10
""", conn)

Unnamed: 0,month_year,Top 10 interests,max_composition
0,2018-12-01,Work Comes First Travelers,21.2
1,2018-10-01,Work Comes First Travelers,20.28
2,2018-11-01,Work Comes First Travelers,19.45
3,2019-01-01,Work Comes First Travelers,18.99
4,2018-07-01,Gym Equipment Owners,18.82
5,2019-02-01,Work Comes First Travelers,18.39
6,2018-09-01,Work Comes First Travelers,18.18
7,2018-07-01,Furniture Shoppers,17.44
8,2018-07-01,Luxury Retail Shoppers,17.19
9,2018-10-01,Luxury Boutique Hotel Researchers,15.15


### Result: Bottom 10 interests

In [27]:
pd.read_sql("""
WITH max_composition_cte AS
(
    -- Filter the maximum composition for each interest and each corresponding month_year
    
    SELECT 
        me.month_year, 
        me.interest_id, 
        ma.interest_name, 
        MAX(me.composition) AS max_composition
    FROM fresh_segments.interest_metrics me
    JOIN fresh_segments.interest_map ma ON me.interest_id = ma.id
    GROUP BY me.month_year, me.interest_id, ma.interest_name 
)
SELECT 
    month_year, 
    interest_name AS "Bottom 10 interests", 
    max_composition  
FROM max_composition_cte 
ORDER BY max_composition ASC
LIMIT 10
""", conn)

Unnamed: 0,month_year,Bottom 10 interests,max_composition
0,2019-05-01,Mowing Equipment Shoppers,1.51
1,2019-05-01,Philadelphia 76ers Fans,1.52
2,2019-06-01,Disney Fans,1.52
3,2019-06-01,New York Giants Fans,1.52
4,2019-05-01,Beer Aficionados,1.52
5,2019-04-01,United Nations Donors,1.52
6,2019-05-01,Gastrointestinal Researchers,1.52
7,2019-05-01,LED Lighting Shoppers,1.53
8,2019-05-01,Crochet Enthusiasts,1.53
9,2019-06-01,Online Directory Searchers,1.53


___
#### 2. Which 5 interests had the lowest average ranking value?

As the query below shows, the ranking value ranges from 1 to 1193.

In [28]:
pd.read_sql("""
SELECT DISTINCT ranking
FROM fresh_segments.interest_metrics
ORDER BY ranking ASC
""", conn)

Unnamed: 0,ranking
0,1
1,2
2,3
3,4
4,5
...,...
986,1189
987,1190
988,1191
989,1193


**Result**

In [29]:
pd.read_sql("""
SELECT 
    me.interest_id, 
    ma.interest_name, 
    ROUND(AVG(me.ranking),1) AS avg_ranking
FROM fresh_segments.interest_metrics me
JOIN fresh_segments.interest_map ma ON me.interest_id = ma.id
GROUP BY me.interest_id, ma.interest_name
ORDER BY avg_ranking ASC
LIMIT 5
""", conn)

Unnamed: 0,interest_id,interest_name,avg_ranking
0,41548,Winter Apparel Shoppers,1.0
1,42203,Fitness Activity Tracker Users,4.1
2,115,Mens Shoe Shoppers,5.9
3,171,Shoe Shoppers,9.4
4,6206,Preppy Clothing Shoppers,11.9


___
#### 3. Which 5 interests had the largest standard deviation in their `percentile_ranking` value?

In [30]:
pd.read_sql("""
SELECT 
    me.interest_id, 
    ma.interest_name, 
    ROUND(STDDEV(percentile_ranking)::NUMERIC, 2) AS standard_deviation
FROM fresh_segments.interest_metrics me
JOIN fresh_segments.interest_map ma ON me.interest_id = ma.id
GROUP BY me.interest_id, ma.interest_name
ORDER BY standard_deviation DESC
LIMIT 5
""", conn)

Unnamed: 0,interest_id,interest_name,standard_deviation
0,23,Techies,30.18
1,20764,Entertainment Industry Decision Makers,28.97
2,38992,Oregon Trip Planners,28.32
3,43546,Personalized Gift Shoppers,26.24
4,10839,Tampa and St Petersburg Trip Planners,25.61
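
In PostgreSQL, `STDDEV` is an alias for `STDDEV_SAMP`, the sample standard deviation, which divides by `n - 1` rather than `n`. The difference from the population version can be seen with Python's `statistics` module; the values below are hypothetical and chosen only to keep the arithmetic simple.

```python
import statistics

# Hypothetical percentile_ranking values for one interest
values = [10.0, 20.0, 30.0, 40.0]

sample_sd = statistics.stdev(values)       # divides by n - 1, like STDDEV / STDDEV_SAMP
population_sd = statistics.pstdev(values)  # divides by n, like STDDEV_POP
print(round(sample_sd, 2), round(population_sd, 2))
```

With only a handful of months per interest, the two versions can differ noticeably, so it is worth knowing which one a ranking is based on.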


___
#### 4. For the 5 interests found in the previous question - what was minimum and maximum percentile_ranking values for each interest and its corresponding year_month value? Can you describe what is happening for these 5 interests?

In [31]:
pd.read_sql("""
WITH ranking_percentile_cte AS
(
    SELECT 
        me.month_year, 
        me.interest_id, 
        ma.interest_name, 
        me.percentile_ranking, 
        RANK() OVER (PARTITION BY me.interest_id ORDER BY me.percentile_ranking) AS asc_rk,
        RANK() OVER (PARTITION BY me.interest_id ORDER BY me.percentile_ranking DESC) AS desc_rk
    FROM fresh_segments.interest_metrics me
    JOIN fresh_segments.interest_map ma ON me.interest_id = ma.id
    WHERE interest_id IN
    (
        -- Retrieve the 5 interests with the largest standard deviation in their percentile_ranking value
        
        SELECT interest_id
        FROM fresh_segments.interest_metrics
        WHERE interest_id IS NOT NULL
        GROUP BY interest_id
        ORDER BY STDDEV(percentile_ranking) DESC
        LIMIT 5
    ) 
)
SELECT 
    interest_name, 
    interest_id,
    month_year, 
    percentile_ranking, 
    CASE WHEN asc_rk = 1 THEN 'Min' ELSE 'Max' END AS percentile_rank_value_type
FROM ranking_percentile_cte
WHERE asc_rk = 1 OR desc_rk = 1
""", conn)

Unnamed: 0,interest_name,interest_id,month_year,percentile_ranking,percentile_rank_value_type
0,Techies,23,2019-08-01,7.92,Min
1,Techies,23,2018-07-01,86.69,Max
2,Tampa and St Petersburg Trip Planners,10839,2019-03-01,4.84,Min
3,Tampa and St Petersburg Trip Planners,10839,2018-07-01,75.03,Max
4,Entertainment Industry Decision Makers,20764,2019-08-01,11.23,Min
5,Entertainment Industry Decision Makers,20764,2018-07-01,86.15,Max
6,Oregon Trip Planners,38992,2019-07-01,2.2,Min
7,Oregon Trip Planners,38992,2018-11-01,82.44,Max
8,Personalized Gift Shoppers,43546,2019-06-01,5.7,Min
9,Personalized Gift Shoppers,43546,2019-03-01,73.15,Max


The same result can also be shown with the minimum and maximum values side by side, by joining the CTE to itself.

In [32]:
pd.read_sql("""
WITH ranking_percentile_cte AS
(
    SELECT 
        me.month_year, 
        me.interest_id, 
        ma.interest_name, 
        me.percentile_ranking, 
        RANK() OVER (PARTITION BY me.interest_id ORDER BY me.percentile_ranking) AS asc_rk,
        RANK() OVER (PARTITION BY me.interest_id ORDER BY me.percentile_ranking DESC) AS desc_rk
    FROM fresh_segments.interest_metrics me
    JOIN fresh_segments.interest_map ma ON me.interest_id = ma.id
    WHERE interest_id IN
    (
        -- Retrieve the 5 interests with the largest standard deviation in their percentile_ranking value
        
        SELECT interest_id
        FROM fresh_segments.interest_metrics
        WHERE interest_id IS NOT NULL
        GROUP BY interest_id
        ORDER BY STDDEV(percentile_ranking) DESC
        LIMIT 5
    ) 
)
SELECT 
    min.interest_name, 
    min.interest_id,
    min.month_year AS month_year_minval,
    min.percentile_ranking AS percentile_ranking_minval,
    max.month_year_maxval,
    max.percentile_ranking_maxval
FROM ranking_percentile_cte min
JOIN 
(
    SELECT 
        interest_id,
        month_year AS month_year_maxval, 
        percentile_ranking AS percentile_ranking_maxval
    FROM ranking_percentile_cte
    WHERE desc_rk = 1
    
) max ON min.interest_id = max.interest_id
WHERE asc_rk = 1
""", conn)

Unnamed: 0,interest_name,interest_id,month_year_minval,percentile_ranking_minval,month_year_maxval,percentile_ranking_maxval
0,Techies,23,2019-08-01,7.92,2018-07-01,86.69
1,Tampa and St Petersburg Trip Planners,10839,2019-03-01,4.84,2018-07-01,75.03
2,Entertainment Industry Decision Makers,20764,2019-08-01,11.23,2018-07-01,86.15
3,Oregon Trip Planners,38992,2019-07-01,2.2,2018-11-01,82.44
4,Personalized Gift Shoppers,43546,2019-06-01,5.7,2019-03-01,73.15


___
#### 5. How would you describe our customers in this segment based off their composition and ranking values? What sort of products or services should we show to these customers and what should we avoid?

**Result**
- Travel, fitness, furniture, and luxury lifestyle dominate the top 10 interests among our customers. Fresh Segments should therefore prioritize showing products and services in these categories.

- Products or services to avoid showing include anything related to mowing equipment, crochet (arts and crafts), social and charitable causes, and LED products.

___
<a id = 'D'></a>
## D. Index Analysis
The `index_value` is a measure which can be used to reverse calculate the average composition for Fresh Segments’ clients.

Average composition can be calculated by dividing the `composition` column by the `index_value` column and rounding the result to 2 decimal places.
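
As a quick worked example of this formula, the sketch below applies it to the first row of `interest_metrics` shown earlier (`composition = 11.89`, `index_value = 6.19`). The function name is hypothetical.

```python
def average_composition(composition, index_value):
    """Reverse-calculate the average composition, rounded to 2 decimal places."""
    return round(composition / index_value, 2)


# composition and index_value come from the first row of interest_metrics
print(average_composition(11.89, 6.19))
```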

#### 1. What is the top 10 interests by the average composition for each month?

In [33]:
pd.read_sql("""
WITH avg_comp_cte AS
(
    SELECT *, DENSE_RANK() OVER (PARTITION BY month_year ORDER BY avg_composition DESC) AS rank
    FROM
    (
        SELECT 
            me.month_year, 
            me.interest_id,             
            ma.interest_name, 
            ROUND(composition::NUMERIC/index_value::NUMERIC, 2) AS avg_composition
        FROM fresh_segments.interest_metrics me
        JOIN fresh_segments.interest_map ma ON me.interest_id = ma.id
    ) avg
)
SELECT *
FROM avg_comp_cte
WHERE month_year IS NOT NULL AND interest_id IS NOT NULL AND rank <= 10
""", conn)

Unnamed: 0,month_year,interest_id,interest_name,avg_composition,rank
0,2018-07-01,6324,Las Vegas Trip Planners,7.36,1
1,2018-07-01,6284,Gym Equipment Owners,6.94,2
2,2018-07-01,4898,Cosmetics and Beauty Shoppers,6.78,3
3,2018-07-01,77,Luxury Retail Shoppers,6.61,4
4,2018-07-01,39,Furniture Shoppers,6.51,5
...,...,...,...,...,...
145,2019-08-01,77,Luxury Retail Shoppers,2.59,6
146,2019-08-01,4931,Marijuana Legalization Advocates,2.56,7
147,2019-08-01,6253,Medicare Researchers,2.55,8
148,2019-08-01,6208,Recently Retired Individuals,2.53,9
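
One detail of the ranking step is worth illustrating: `DENSE_RANK()` gives tied values the same rank, so a `rank <= N` filter can return more than N rows for a month. The sketch below shows this on a tiny in-memory SQLite table. The table name `metrics`, the interest labels, and the values are made up for illustration, and it assumes an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics (month_year TEXT, interest TEXT, avg_composition REAL)"
)
# Hypothetical rows: interest_b and interest_c tie on avg_composition
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?)",
    [
        ("2018-07-01", "interest_a", 7.36),
        ("2018-07-01", "interest_b", 6.94),
        ("2018-07-01", "interest_c", 6.94),
        ("2018-07-01", "interest_d", 6.51),
    ],
)

# Keep the "top 2" per month; the tie at rank 2 yields three rows
top2 = conn.execute("""
    SELECT interest, rnk
    FROM (
        SELECT interest,
               DENSE_RANK() OVER (PARTITION BY month_year
                                  ORDER BY avg_composition DESC) AS rnk
        FROM metrics
    ) AS ranked
    WHERE rnk <= 2
    ORDER BY rnk, interest
""").fetchall()

print(top2)  # [('interest_a', 1), ('interest_b', 2), ('interest_c', 2)]
conn.close()
```

If exactly N rows per month were required, `ROW_NUMBER()` with a tie-breaking column in the `ORDER BY` would be one alternative.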


___
#### 2. For all of these top 10 interests - which interest appears the most often?

In [34]:
pd.read_sql("""
WITH avg_comp_cte AS
(
    SELECT *, DENSE_RANK() OVER (PARTITION BY month_year ORDER BY avg_composition DESC) AS rank
    FROM
    (
        SELECT 
            me.month_year, 
            me.interest_id,             
            ma.interest_name, 
            ROUND(composition::NUMERIC/index_value::NUMERIC, 2) AS avg_composition
        FROM fresh_segments.interest_metrics me
        JOIN fresh_segments.interest_map ma ON me.interest_id = ma.id
    ) avg
)
SELECT 
    interest_name, 
    COUNT(interest_name) AS nb_occurrences
FROM avg_comp_cte
WHERE month_year IS NOT NULL AND interest_id IS NOT NULL AND rank <= 10
GROUP BY interest_name
ORDER BY nb_occurrences DESC
""", conn)

Unnamed: 0,interest_name,nb_occurrences
0,Alabama Trip Planners,10
1,Solar Energy Researchers,10
2,Luxury Bedding Shoppers,10
3,Nursing and Physicians Assistant Journal Resea...,9
4,New Years Eve Party Ticket Purchasers,9
5,Readers of Honduran Content,9
6,Work Comes First Travelers,8
7,Teen Girl Clothing Shoppers,8
8,Christmas Celebration Researchers,7
9,Las Vegas Trip Planners,5


**Result**<br>
The interests that appear most often are those with the highest number of monthly top-10 appearances across all `month_year` values:
- **Alabama Trip Planners**, **Luxury Bedding Shoppers**, and **Solar Energy Researchers** appear most frequently, with 10 occurrences each.
- **Readers of Honduran Content**, **Nursing and Physicians Assistant Journal Researchers**, and **New Years Eve Party Ticket Purchasers** are the second most frequent, with 9 occurrences each.
- **Teen Girl Clothing Shoppers** and **Work Comes First Travelers** are the third most frequent, with 8 occurrences each.
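
The counting step can also be sketched in plain Python. The monthly lists below are hypothetical (they are not the real top-10 lists) and only show how `GROUP BY interest_name` with `COUNT` corresponds to a frequency count.

```python
from collections import Counter

# Hypothetical top-interest lists per month, for illustration only
top_by_month = {
    "2018-07-01": ["Las Vegas Trip Planners", "Luxury Bedding Shoppers"],
    "2018-08-01": ["Las Vegas Trip Planners", "Alabama Trip Planners"],
    "2018-09-01": ["Alabama Trip Planners", "Las Vegas Trip Planners"],
}

# Count how many monthly lists each interest appears in
counts = Counter(name for names in top_by_month.values() for name in names)

# Sort by number of occurrences, highest first
for name, nb_occurrences in counts.most_common():
    print(name, nb_occurrences)
```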

___
#### 3. What is the average of the average composition for the top 10 interests for each month?

In [35]:
pd.read_sql("""
WITH avg_comp_cte AS
(
    SELECT *, DENSE_RANK() OVER (PARTITION BY month_year ORDER BY avg_composition DESC) AS rank
    FROM
    (
        SELECT 
            month_year,
            interest_id,
            ROUND(composition::NUMERIC/index_value::NUMERIC, 2) AS avg_composition
        FROM fresh_segments.interest_metrics
    ) avg
)
SELECT 
    month_year, 
    ROUND(AVG(avg_composition),2) AS avg_composition
FROM avg_comp_cte
WHERE month_year IS NOT NULL AND rank <= 10
GROUP BY month_year
ORDER BY month_year
""", conn)

Unnamed: 0,month_year,avg_composition
0,2018-07-01,6.04
1,2018-08-01,5.95
2,2018-09-01,6.9
3,2018-10-01,7.01
4,2018-11-01,6.62
5,2018-12-01,6.65
6,2019-01-01,6.24
7,2019-02-01,6.58
8,2019-03-01,6.07
9,2019-04-01,5.75


**Result**
- October 2018 has the highest average composition of 7.01 among the top 10 interests.
- The month with the lowest average composition is June 2019, with only 2.39.

___
#### 4. What is the 3 month rolling average of the max average composition value from September 2018 to August 2019 and include the previous top ranking interests in the same output shown below.

Required output for question 4:

month_year | interest_name | max_index_composition | 3_month_moving_avg | 1_month_ago | 2_months_ago  
--- | --- | --- | --- | --- | --- 
2018-09-01 | Work Comes First Travelers | 8.26 | 7.61 | Las Vegas Trip Planners: 7.21 | Las Vegas Trip Planners: 7.36 
2018-10-01 | Work Comes First Travelers | 9.14 | 8.20 | Work Comes First Travelers: 8.26 | Las Vegas Trip Planners: 7.21 
2018-11-01 | Work Comes First Travelers | 8.28 | 8.56 | Work Comes First Travelers: 9.14 | Work Comes First Travelers: 8.26  
2018-12-01 | Work Comes First Travelers | 8.31 | 8.58 | Work Comes First Travelers: 8.28 | Work Comes First Travelers: 9.14  
2019-01-01 | Work Comes First Travelers | 7.66 | 8.08 | Work Comes First Travelers: 8.31 | Work Comes First Travelers: 8.28  
2019-02-01 | Work Comes First Travelers | 7.66 | 7.88 | Work Comes First Travelers: 7.66 | Work Comes First Travelers: 8.31 
2019-03-01 | Alabama Trip Planners | 6.54 | 7.29 | Work Comes First Travelers: 7.66 | Work Comes First Travelers: 7.66 
2019-04-01 | Solar Energy Researchers | 6.28 | 6.83 | Alabama Trip Planners: 6.54 | Work Comes First Travelers: 7.66 
2019-05-01 | Readers of Honduran Content | 4.41 | 5.74 | Solar Energy Researchers: 6.28 | Alabama Trip Planners: 6.54  
2019-06-01 | Las Vegas Trip Planners | 2.77 | 4.49 | Readers of Honduran Content: 4.41 | Solar Energy Researchers: 6.28 
2019-07-01 | Las Vegas Trip Planners | 2.82 | 3.33 | Las Vegas Trip Planners: 2.77 | Readers of Honduran Content: 4.41 
2019-08-01 | Cosmetics and Beauty Shoppers | 2.73 | 2.77 | Las Vegas Trip Planners: 2.82 | Las Vegas Trip Planners: 2.77 

**Result**

In [36]:
pd.read_sql("""
WITH ranking_interests_cte AS
(
    -- Rank each interest by the index_composition value for each month_year 
    
    SELECT 
        me.month_year, 
        ma.interest_name, 
        ROUND(me.composition::NUMERIC/me.index_value::NUMERIC, 2) AS index_composition,
        DENSE_RANK() OVER (PARTITION BY me.month_year ORDER BY me.composition::NUMERIC/me.index_value::NUMERIC DESC) AS rank
    FROM fresh_segments.interest_metrics me
    JOIN fresh_segments.interest_map ma ON me.interest_id = ma.id
)
SELECT *
FROM
(
    -- Calculate the 3-month rolling average of the max_index_composition.
    -- Determine the interests and their corresponding index_composition from 1 month and 2 months ago.
    
    SELECT 
        month_year, 
        interest_name, 
        index_composition AS max_index_composition, 
        ROUND(AVG(index_composition) OVER (ORDER BY month_year ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),2) AS "3_month_moving_avg",
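        -- LAG needs an explicit ORDER BY; with an empty OVER () the row order is not guaranteed
        LAG(interest_name) OVER (ORDER BY month_year) || ': ' || LAG(index_composition) OVER (ORDER BY month_year) AS "1_months_ago",
        LAG(interest_name, 2) OVER (ORDER BY month_year) || ': ' || LAG(index_composition, 2) OVER (ORDER BY month_year) AS "2_months_ago"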
    FROM ranking_interests_cte
    WHERE rank = 1
    
) computing
WHERE month_year BETWEEN '2018-09-01' AND '2019-08-01'
""", conn)

Unnamed: 0,month_year,interest_name,max_index_composition,3_month_moving_avg,1_months_ago,2_months_ago
0,2018-09-01,Work Comes First Travelers,8.26,7.61,Las Vegas Trip Planners: 7.21,Las Vegas Trip Planners: 7.36
1,2018-10-01,Work Comes First Travelers,9.14,8.2,Work Comes First Travelers: 8.26,Las Vegas Trip Planners: 7.21
2,2018-11-01,Work Comes First Travelers,8.28,8.56,Work Comes First Travelers: 9.14,Work Comes First Travelers: 8.26
3,2018-12-01,Work Comes First Travelers,8.31,8.58,Work Comes First Travelers: 8.28,Work Comes First Travelers: 9.14
4,2019-01-01,Work Comes First Travelers,7.66,8.08,Work Comes First Travelers: 8.31,Work Comes First Travelers: 8.28
5,2019-02-01,Work Comes First Travelers,7.66,7.88,Work Comes First Travelers: 7.66,Work Comes First Travelers: 8.31
6,2019-03-01,Alabama Trip Planners,6.54,7.29,Work Comes First Travelers: 7.66,Work Comes First Travelers: 7.66
7,2019-04-01,Solar Energy Researchers,6.28,6.83,Alabama Trip Planners: 6.54,Work Comes First Travelers: 7.66
8,2019-05-01,Readers of Honduran Content,4.41,5.74,Solar Energy Researchers: 6.28,Alabama Trip Planners: 6.54
9,2019-06-01,Las Vegas Trip Planners,2.77,4.49,Readers of Honduran Content: 4.41,Solar Energy Researchers: 6.28
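
The window functions used in the query above can be tried on a small in-memory SQLite table. This is a sketch under stated assumptions: the table name `top_interest` and the SQLite setup are hypothetical, window functions require SQLite 3.25 or later, and the four rows are the July to October 2018 top interests as they appear in the required-output table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE top_interest "
    "(month_year TEXT, interest_name TEXT, index_composition REAL)"
)
# Top interest per month, copied from the required-output table
conn.executemany(
    "INSERT INTO top_interest VALUES (?, ?, ?)",
    [
        ("2018-07-01", "Las Vegas Trip Planners", 7.36),
        ("2018-08-01", "Las Vegas Trip Planners", 7.21),
        ("2018-09-01", "Work Comes First Travelers", 8.26),
        ("2018-10-01", "Work Comes First Travelers", 9.14),
    ],
)

# Compute the windows over all months first, then filter in the outer
# query so that September can still see July and August.
rolling = conn.execute("""
    SELECT *
    FROM (
        SELECT
            month_year,
            ROUND(AVG(index_composition) OVER (
                ORDER BY month_year
                ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), 2) AS moving_avg,
            LAG(interest_name) OVER (ORDER BY month_year) AS interest_1_month_ago,
            LAG(interest_name, 2) OVER (ORDER BY month_year) AS interest_2_months_ago
        FROM top_interest
    ) AS computing
    WHERE month_year BETWEEN '2018-09-01' AND '2018-10-01'
    ORDER BY month_year
""").fetchall()

for row in rolling:
    print(row)
conn.close()
```

Filtering on `month_year` inside the inner query instead would drop July and August before the windows are evaluated, and the September moving average would then cover only one month.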


___
#### 5. Provide a possible reason why the max average composition might change from month to month? Could it signal something is not quite right with the overall business model for Fresh Segments?

**Result**<br>
One possible explanation for the month-to-month variation in the maximum average composition is that customers' interests shift over time, so a different topic can take the top spot in a given month.
- Even so, `Work Comes First Travelers` held the top spot for 6 consecutive months, from September 2018 to February 2019.
- `Las Vegas Trip Planners` was the top interest in four months: July 2018, August 2018, June 2019, and July 2019.

In [37]:
conn.close()