# Module: Main Analysis Types
## Sprint: Customer Segmentation and RFM
## Part 1: BigQuery Code

## Filtering data based on conditions from the task

In [None]:
WITH filtered_data AS (
  SELECT
    CustomerID,
    InvoiceNo,
    InvoiceDate,
    Quantity,
    UnitPrice,
    TIMESTAMP(InvoiceDate) AS TransactionDate
  FROM
    `turing_data_analytics.rfm`
  WHERE
    TIMESTAMP(InvoiceDate) BETWEEN TIMESTAMP('2010-12-01') AND TIMESTAMP('2011-12-01')
    AND CustomerID IS NOT NULL
    AND Quantity > 0
),

### Explanation
WITH filtered_data AS (...) is a Common Table Expression (CTE) that I created to temporarily hold the subset of data the task requires. The SELECT statement picks the columns needed in later steps, and I add a new column called TransactionDate by converting InvoiceDate to a TIMESTAMP. The WHERE clause keeps transactions dated from December 1, 2010 to December 1, 2011 (BETWEEN includes both endpoints) where the CustomerID isn’t missing and the Quantity is greater than zero, which excludes rows with zero or negative quantities because they do not represent a purchase.
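To make the three WHERE conditions concrete, here is a minimal Python sketch that applies the same checks to a handful of rows. The sample rows and the helper name `keep` are made up for illustration; the sketch only mirrors the logic and does not query the real table.

```python
from datetime import datetime

# Hypothetical sample rows shaped like the columns selected in the CTE.
rows = [
    {"CustomerID": 1, "InvoiceNo": "A1", "InvoiceDate": "2010-12-05", "Quantity": 3, "UnitPrice": 2.5},
    {"CustomerID": None, "InvoiceNo": "A2", "InvoiceDate": "2011-01-10", "Quantity": 1, "UnitPrice": 4.0},  # missing customer
    {"CustomerID": 2, "InvoiceNo": "A3", "InvoiceDate": "2011-03-01", "Quantity": -2, "UnitPrice": 1.0},  # negative quantity
    {"CustomerID": 3, "InvoiceNo": "A4", "InvoiceDate": "2012-01-01", "Quantity": 5, "UnitPrice": 1.0},  # outside the window
    {"CustomerID": 4, "InvoiceNo": "A5", "InvoiceDate": "2011-12-01", "Quantity": 1, "UnitPrice": 9.0},  # exactly on the upper bound
]

START = datetime(2010, 12, 1)
END = datetime(2011, 12, 1)


def keep(row):
    """Mirror the three WHERE conditions; like BETWEEN, both endpoints are included."""
    transaction_date = datetime.strptime(row["InvoiceDate"], "%Y-%m-%d")
    return (
        START <= transaction_date <= END
        and row["CustomerID"] is not None
        and row["Quantity"] > 0
    )


filtered = [row for row in rows if keep(row)]
print([row["InvoiceNo"] for row in filtered])  # ['A1', 'A5']
```

Only the first and last rows survive: the others fail the customer, quantity, and date checks respectively.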

## Calculating Frequency and Monetary Value

In [None]:
t1 AS (
  SELECT
    CustomerID,
    MAX(TransactionDate) AS last_purchase_date,
    COUNT(DISTINCT InvoiceNo) AS frequency,
    SUM(Quantity * UnitPrice) AS monetary
  FROM
    filtered_data
  GROUP BY
    CustomerID
),

### Explanation
Another CTE, called t1, has been created. It produces one row per customer.
MAX(TransactionDate) AS last_purchase_date: Finds the most recent purchase date for each customer.
COUNT(DISTINCT InvoiceNo) AS frequency: Counts the distinct invoices for each customer, so several rows that share an invoice number count as one purchase.
SUM(Quantity * UnitPrice) AS monetary: Adds up the total amount each customer spent.
GROUP BY CustomerID: Computes all of the above separately for each customer.
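The same three aggregates can be sketched in plain Python to show what each one does per customer. The rows below are invented for illustration, and the dates are ISO strings so that the string maximum is also the latest date.

```python
from collections import defaultdict

# Hypothetical filtered rows: (CustomerID, InvoiceNo, TransactionDate, Quantity, UnitPrice)
filtered = [
    (1, "A1", "2011-01-05", 2, 3.0),
    (1, "A1", "2011-01-05", 1, 4.0),  # second line item on the same invoice
    (1, "B7", "2011-06-20", 5, 1.0),
    (2, "C3", "2011-03-15", 1, 10.0),
]

invoices = defaultdict(set)    # distinct invoice numbers per customer
last_purchase = {}             # latest transaction date per customer
monetary = defaultdict(float)  # total spend per customer

for customer, invoice, date, quantity, price in filtered:
    invoices[customer].add(invoice)                                         # COUNT(DISTINCT InvoiceNo)
    last_purchase[customer] = max(last_purchase.get(customer, date), date)  # MAX(TransactionDate)
    monetary[customer] += quantity * price                                  # SUM(Quantity * UnitPrice)

t1 = {
    customer: {
        "last_purchase_date": last_purchase[customer],
        "frequency": len(invoices[customer]),
        "monetary": monetary[customer],
    }
    for customer in invoices
}
print(t1[1])  # {'last_purchase_date': '2011-06-20', 'frequency': 2, 'monetary': 15.0}
```

Customer 1 has three rows but only two distinct invoices, which is why frequency is 2 rather than 3.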

## Calculating Recency 

In [None]:
t2 AS (
  SELECT
    *,
    DATE_DIFF(TIMESTAMP('2011-12-01'), last_purchase_date, DAY) AS recency
  FROM
    t1
),

### Explanation
Another CTE, called t2, has been created.
DATE_DIFF(TIMESTAMP('2011-12-01'), last_purchase_date, DAY) AS recency: Calculates how many days passed between each customer's last purchase and December 1, 2011, the reference date set by the task requirements.
The * keeps every column from t1, i.e. CustomerID, last_purchase_date, frequency, and monetary.
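As a quick sanity check of the recency arithmetic, the day difference can be reproduced in Python. The function name `recency_days` is made up for this sketch, and it works on calendar dates only.

```python
from datetime import date

REFERENCE_DATE = date(2011, 12, 1)  # the cut-off date used in the query


def recency_days(last_purchase_date):
    """Whole days between the last purchase and the reference date."""
    return (REFERENCE_DATE - last_purchase_date).days


print(recency_days(date(2011, 11, 30)))  # 1
print(recency_days(date(2010, 12, 1)))   # 365
```

A smaller recency value means a more recent purchase, which matters later when the recency scores are assigned in reverse order.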

## Calculating Quartiles for R, F, M

In [None]:
t3 AS (
  SELECT 
    a.*,
    -- Quartiles for MONETARY
    b.percentiles[OFFSET(1)] AS m25, 
    b.percentiles[OFFSET(2)] AS m50,
    b.percentiles[OFFSET(3)] AS m75,
    -- Quartiles for FREQUENCY
    c.percentiles[OFFSET(1)] AS f25, 
    c.percentiles[OFFSET(2)] AS f50,
    c.percentiles[OFFSET(3)] AS f75,
    -- Quartiles for RECENCY
    d.percentiles[OFFSET(1)] AS r25, 
    d.percentiles[OFFSET(2)] AS r50,
    d.percentiles[OFFSET(3)] AS r75
  FROM 
    t2 a,
    (SELECT APPROX_QUANTILES(monetary, 4) AS percentiles FROM t2) b,
    (SELECT APPROX_QUANTILES(frequency, 4) AS percentiles FROM t2) c,
    (SELECT APPROX_QUANTILES(recency, 4) AS percentiles FROM t2) d
),

### Explanation
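Another CTE, called t3, has been created.
APPROX_QUANTILES(column, 4) AS percentiles: This function returns an array of five approximate boundaries (minimum, 25th, 50th and 75th percentiles, maximum) that split the data into four parts (quartiles).
For example, for monetary, OFFSET(1), OFFSET(2) and OFFSET(3) pick out the values at the 25th percentile (m25), 50th percentile (m50), and 75th percentile (m75).
This is done for monetary, frequency, and recency. Each subquery returns a single row, so the comma (cross) joins attach the same quartile values to every customer row in t2.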
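The array layout is easier to see on a small example. The sketch below builds a five-element list with the same shape using Python's `statistics.quantiles`; this is an exact calculation with its own interpolation rule, so on real data the numbers may differ slightly from what APPROX_QUANTILES returns. The helper name `quartile_array` and the sample values are assumptions for illustration.

```python
import statistics


def quartile_array(values):
    """Return [min, q25, q50, q75, max], the same layout as APPROX_QUANTILES(x, 4)."""
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return [min(values), q25, q50, q75, max(values)]


monetary = [10, 20, 30, 40, 50, 60, 70, 80, 90]  # hypothetical spend per customer
percentiles = quartile_array(monetary)

# Positions 1, 2 and 3 play the role of OFFSET(1), OFFSET(2) and OFFSET(3).
m25, m50, m75 = percentiles[1], percentiles[2], percentiles[3]
print(percentiles)  # [10, 30.0, 50.0, 70.0, 90]
```

Position 0 and position 4 hold the minimum and maximum, which is why the query starts reading at OFFSET(1).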

## Assigning R, F, M scores based on quartiles

In [None]:
t4 AS (
  SELECT
    *,
    CAST(ROUND((f_score + m_score) / 2, 0) AS INT64) AS fm_score
  FROM (
    SELECT
      *,
      CASE 
        WHEN monetary <= m25 THEN 1
        WHEN monetary <= m50 AND monetary > m25 THEN 2 
        WHEN monetary <= m75 AND monetary > m50 THEN 3 
        ELSE 4
      END AS m_score,
      CASE 
        WHEN frequency <= f25 THEN 1
        WHEN frequency <= f50 AND frequency > f25 THEN 2 
        WHEN frequency <= f75 AND frequency > f50 THEN 3 
        ELSE 4
      END AS f_score,
      CASE 
        WHEN recency <= r25 THEN 4
        WHEN recency <= r50 AND recency > r25 THEN 3 
        WHEN recency <= r75 AND recency > r50 THEN 2 
        ELSE 1
      END AS r_score
    FROM
      t3
  )
),

### Explanation
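Here I am giving scores from 1 to 4 based on the quartiles.
The CASE statements work like if-else conditions.
For monetary: If the value is in the lowest 25%, it gets a score of 1. If it’s between 25% and 50%, it gets a score of 2, and so on.
The same logic applies to frequency. For recency the scale is reversed: the fewest days since the last purchase (the lowest 25%) gets a score of 4 and the most days gets a score of 1, because a recent purchase is better.
CAST(ROUND((f_score + m_score) / 2, 0) AS INT64) AS fm_score: This calculates the average of f_score and m_score, rounds it to the nearest whole number (halves round up, so 2.5 becomes 3), and converts it to an integer.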
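The scoring rules can be sketched in Python as follows. The function names and the sample quartile values are made up for illustration. One detail worth noting: BigQuery's ROUND rounds halves away from zero, while Python's built-in `round()` rounds halves to the nearest even number, so the sketch uses `math.floor(x + 0.5)` to reproduce the query's behaviour for these positive values.

```python
import math


def quartile_score(value, q25, q50, q75, reverse=False):
    """Score 1-4 by quartile; reverse=True flips the scale (used for recency)."""
    if value <= q25:
        score = 1
    elif value <= q50:
        score = 2
    elif value <= q75:
        score = 3
    else:
        score = 4
    return 5 - score if reverse else score


def fm_score(f_score, m_score):
    """Average of the two scores with halves rounded up, as ROUND does in the query."""
    return math.floor((f_score + m_score) / 2 + 0.5)


# Hypothetical quartile boundaries: 30, 50, 70
print(quartile_score(20, 30, 50, 70))                # 1 (lowest spend quartile)
print(quartile_score(20, 30, 50, 70, reverse=True))  # 4 (most recent quartile)
print(fm_score(1, 4))                                # 3, whereas round(2.5) would give 2
```

The `reverse` flag is just a compact way to express the third CASE block, where the THEN values run 4, 3, 2, 1 instead of 1, 2, 3, 4.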

## Calculating rfm_score and Segmenting Customers

In [None]:
t5 AS (
  SELECT
    CustomerID,
    recency,
    frequency,
    monetary,
    r_score,
    f_score,
    m_score,
    fm_score,
    CONCAT(CAST(r_score AS STRING), CAST(f_score AS STRING), CAST(m_score AS STRING)) AS rfm_score,
    CASE 
      WHEN (r_score = 4 AND fm_score = 4) 
        OR (r_score = 4 AND fm_score = 3) 
        OR (r_score = 3 AND fm_score = 4) 
      THEN 'Best Customers'
      WHEN (r_score = 3 AND fm_score = 3)
        OR (r_score = 2 AND fm_score = 4)
        OR (r_score = 2 AND fm_score = 3)
      THEN 'Loyal Customers'
      WHEN m_score = 4 
      THEN 'Big Spenders'
      WHEN (r_score = 4 AND fm_score = 3) 
        OR (r_score = 3 AND fm_score = 2)
      THEN 'Potential Loyalists'
      WHEN r_score = 4 AND fm_score = 1 THEN 'Recent Customers'
      WHEN (r_score = 3 AND fm_score = 1) 
        OR (r_score = 2 AND fm_score = 1)
      THEN 'Promising'
      WHEN (r_score = 2 AND fm_score = 2) 
        OR (r_score = 1 AND fm_score = 3)
      THEN 'Customers Needing Attention'
      WHEN (r_score = 1 AND fm_score = 4) 
        OR (r_score = 1 AND fm_score = 3)        
      THEN 'At Risk'
      WHEN r_score = 1 AND fm_score = 2 THEN 'Hibernating'
      WHEN r_score = 1 AND fm_score = 1 THEN 'Lost'
    END AS rfm_segment
  FROM
    t4
),

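### Explanation
CONCAT(CAST(r_score AS STRING), CAST(f_score AS STRING), CAST(m_score AS STRING)) AS rfm_score: Combines the individual scores into one string, e.g. scores 4, 3 and 2 become '432'.
The CASE statement assigns customers to groups based on their r_score and fm_score (and, for 'Big Spenders', their m_score). For example, if they have high scores in all categories, they’re "Best Customers".
The WHEN branches are checked from top to bottom and the first match wins, so a combination that is listed twice, such as r_score = 4 with fm_score = 3, always gets the earlier label ('Best Customers'). There is no ELSE, so a customer that no branch matches, for example r_score = 4 with fm_score = 2 and m_score below 4, gets a NULL rfm_segment.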
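Because the order of the WHEN branches matters, it can help to replay the segment rules outside the database and check which score combinations they cover. The sketch below copies the conditions from the CASE statement into an ordered Python list; the names `RULES` and `segment` are made up for illustration.

```python
# Each rule is (label, condition); the order mirrors the WHEN branches in the query.
RULES = [
    ("Best Customers", lambda r, fm, m: (r, fm) in {(4, 4), (4, 3), (3, 4)}),
    ("Loyal Customers", lambda r, fm, m: (r, fm) in {(3, 3), (2, 4), (2, 3)}),
    ("Big Spenders", lambda r, fm, m: m == 4),
    ("Potential Loyalists", lambda r, fm, m: (r, fm) in {(4, 3), (3, 2)}),
    ("Recent Customers", lambda r, fm, m: (r, fm) == (4, 1)),
    ("Promising", lambda r, fm, m: (r, fm) in {(3, 1), (2, 1)}),
    ("Customers Needing Attention", lambda r, fm, m: (r, fm) in {(2, 2), (1, 3)}),
    ("At Risk", lambda r, fm, m: (r, fm) in {(1, 4), (1, 3)}),
    ("Hibernating", lambda r, fm, m: (r, fm) == (1, 2)),
    ("Lost", lambda r, fm, m: (r, fm) == (1, 1)),
]


def segment(r_score, fm_score, m_score):
    """Return the first matching label, or None where the CASE would yield NULL."""
    for label, matches in RULES:
        if matches(r_score, fm_score, m_score):
            return label
    return None


# Combinations left without a segment when m_score is below 4.
unmatched = [
    (r, fm)
    for r in range(1, 5)
    for fm in range(1, 5)
    if segment(r, fm, 1) is None
]
print(unmatched)          # [(4, 2)]
print(segment(4, 3, 2))   # Best Customers (the later 'Potential Loyalists' entry never applies)
print(segment(1, 3, 4))   # Big Spenders (checked before 'Customers Needing Attention')
```

A check like this is a cheap way to spot gaps or shadowed branches before running the full query.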

## Counting occurrences of each RFM score combination

In [None]:
rfm_counts AS (
  SELECT
    rfm_score,
    COUNT(*) AS rfm_score_count
  FROM
    t5
  GROUP BY
    rfm_score
),

### Explanation
This counts how many customers fall into each RFM score combination.
GROUP BY rfm_score: Groups the results by rfm_score to count the number of customers in each group.
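The counting step, together with the join that the final SELECT of the full query uses to attach each count back to its customers, can be sketched in Python with a `Counter`. The three customer rows are invented for illustration.

```python
from collections import Counter

# Hypothetical scored customers, standing in for rows of t5.
t5 = [
    {"CustomerID": 101, "r_score": 4, "f_score": 4, "m_score": 4},
    {"CustomerID": 102, "r_score": 4, "f_score": 4, "m_score": 4},
    {"CustomerID": 103, "r_score": 1, "f_score": 2, "m_score": 1},
]

# Equivalent of CONCAT(CAST(r_score AS STRING), ...): join the digits into one string.
for row in t5:
    row["rfm_score"] = f'{row["r_score"]}{row["f_score"]}{row["m_score"]}'

# Equivalent of GROUP BY rfm_score with COUNT(*).
rfm_counts = Counter(row["rfm_score"] for row in t5)

# Equivalent of joining rfm_counts back to t5 on rfm_score.
result = [{**row, "n": rfm_counts[row["rfm_score"]]} for row in t5]
print(dict(rfm_counts))  # {'444': 2, '121': 1}
```

Every customer row keeps its own values and gains a column `n` saying how many customers share the same score string.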

## Full Code

In [None]:
-- Filtering data based on task conditions
WITH filtered_data AS (
  SELECT
    CustomerID,
    InvoiceNo,
    InvoiceDate,
    Quantity,
    UnitPrice,
    TIMESTAMP(InvoiceDate) AS TransactionDate
  FROM
    `turing_data_analytics.rfm`
  WHERE
    TIMESTAMP(InvoiceDate) BETWEEN TIMESTAMP('2010-12-01') AND TIMESTAMP('2011-12-01')
    AND CustomerID IS NOT NULL
    AND Quantity > 0
),

-- Calculating Frequency and Monetary value
t1 AS (
  SELECT
    CustomerID,
    MAX(TransactionDate) AS last_purchase_date,
    COUNT(DISTINCT InvoiceNo) AS frequency,
    SUM(Quantity * UnitPrice) AS monetary
  FROM
    filtered_data
  GROUP BY
    CustomerID
),

-- Calculating Recency
t2 AS (
  SELECT
    *,
    DATE_DIFF(TIMESTAMP('2011-12-01'), last_purchase_date, DAY) AS recency
  FROM
    t1
),

-- Calculating Quartiles for R, F, M
t3 AS (
  SELECT 
    a.*,
    -- Quartiles for MONETARY
    b.percentiles[OFFSET(1)] AS m25, 
    b.percentiles[OFFSET(2)] AS m50,
    b.percentiles[OFFSET(3)] AS m75,
    -- Quartiles for FREQUENCY
    c.percentiles[OFFSET(1)] AS f25, 
    c.percentiles[OFFSET(2)] AS f50,
    c.percentiles[OFFSET(3)] AS f75,
    -- Quartiles for RECENCY
    d.percentiles[OFFSET(1)] AS r25, 
    d.percentiles[OFFSET(2)] AS r50,
    d.percentiles[OFFSET(3)] AS r75
  FROM 
    t2 a,
    (SELECT APPROX_QUANTILES(monetary, 4) AS percentiles FROM t2) b,
    (SELECT APPROX_QUANTILES(frequency, 4) AS percentiles FROM t2) c,
    (SELECT APPROX_QUANTILES(recency, 4) AS percentiles FROM t2) d
),

-- Assigning R, F, M scores based on Quartiles
t4 AS (
  SELECT
    *,
    CAST(ROUND((f_score + m_score) / 2, 0) AS INT64) AS fm_score
  FROM (
    SELECT
      *,
      CASE 
        WHEN monetary <= m25 THEN 1
        WHEN monetary <= m50 AND monetary > m25 THEN 2 
        WHEN monetary <= m75 AND monetary > m50 THEN 3 
        ELSE 4
      END AS m_score,
      CASE 
        WHEN frequency <= f25 THEN 1
        WHEN frequency <= f50 AND frequency > f25 THEN 2 
        WHEN frequency <= f75 AND frequency > f50 THEN 3 
        ELSE 4
      END AS f_score,
      CASE 
        WHEN recency <= r25 THEN 4
        WHEN recency <= r50 AND recency > r25 THEN 3 
        WHEN recency <= r75 AND recency > r50 THEN 2 
        ELSE 1
      END AS r_score
    FROM
      t3
  )
),

-- Calculating rfm_score and Segmenting Customers
t5 AS (
  SELECT
    CustomerID,
    recency,
    frequency,
    monetary,
    r_score,
    f_score,
    m_score,
    fm_score,
    CONCAT(CAST(r_score AS STRING), CAST(f_score AS STRING), CAST(m_score AS STRING)) AS rfm_score,
    CASE 
      WHEN (r_score = 4 AND fm_score = 4) 
        OR (r_score = 4 AND fm_score = 3) 
        OR (r_score = 3 AND fm_score = 4) 
      THEN 'Best Customers'
      WHEN (r_score = 3 AND fm_score = 3)
        OR (r_score = 2 AND fm_score = 4)
        OR (r_score = 2 AND fm_score = 3)
      THEN 'Loyal Customers'
      WHEN m_score = 4 
      THEN 'Big Spenders'
      WHEN (r_score = 4 AND fm_score = 3) 
        OR (r_score = 3 AND fm_score = 2)
      THEN 'Potential Loyalists'
      WHEN r_score = 4 AND fm_score = 1 THEN 'Recent Customers'
      WHEN (r_score = 3 AND fm_score = 1) 
        OR (r_score = 2 AND fm_score = 1)
      THEN 'Promising'
      WHEN (r_score = 2 AND fm_score = 2) 
        OR (r_score = 1 AND fm_score = 3)
      THEN 'Customers Needing Attention'
      WHEN (r_score = 1 AND fm_score = 4) 
        OR (r_score = 1 AND fm_score = 3)        
      THEN 'At Risk'
      WHEN r_score = 1 AND fm_score = 2 THEN 'Hibernating'
      WHEN r_score = 1 AND fm_score = 1 THEN 'Lost'
    END AS rfm_segment
  FROM
    t4
),

-- Counting occurrences of each RFM score combination
rfm_counts AS (
  SELECT
    rfm_score,
    COUNT(*) AS rfm_score_count
  FROM
    t5
  GROUP BY
    rfm_score
)

-- Selecting final RFM Scores, Segments, and Counts
SELECT
  t5.CustomerID,
  t5.recency,
  t5.frequency,
  t5.monetary,
  t5.r_score,
  t5.f_score,
  t5.m_score,
  t5.rfm_score,
  rfm_counts.rfm_score_count AS n,
  t5.fm_score,
  t5.rfm_segment
FROM
  t5
JOIN
  rfm_counts ON t5.rfm_score = rfm_counts.rfm_score
ORDER BY
  t5.r_score DESC,
  t5.fm_score DESC,
  t5.CustomerID;



This is the link that helped me create the code and complete this project: https://towardsdatascience.com/a-simple-way-to-segment-customers-using-google-bigquery-and-data-studio-f31c8896cc52