> I turned around to see this cute guy holding an item I had bought. He said, ‘I got the same thing!’ We laughed about it and wound up swapping items because I wanted the color he got.
> I asked him to get some food with me and we spent the rest of the day together.

In [177]:
import polars as pl

customers = pl.read_csv("data/noahs-customers.csv", try_parse_dates=True)
orders_items = pl.read_csv("data/noahs-orders_items.csv", try_parse_dates=True)
orders = pl.read_csv("data/noahs-orders.csv", try_parse_dates=True)
products = pl.read_csv("data/noahs-products.csv", try_parse_dates=True)

Step 1: Find all the items with colors in their names.
After inspecting the products table, it seems that colors are written in **lower-case**, in parentheses, at the end of the product description.
I'll split the color out of the description into its own column.

In [178]:
# Products with colors in the name always end in e.g. `... (blue)`, so use a regex that pulls that color name out
color_name_filter = r"(.+) \(([[:lower:]]+)\)"
colorful_products = products.filter(pl.col("desc").str.contains(color_name_filter))
products_with_colors_extracted = colorful_products.select(
    pl.col("sku"),
    pl.col("desc").str.extract(color_name_filter, group_index=1),
    pl.col("desc").str.extract(color_name_filter, group_index=2).alias("color")
)
print(products_with_colors_extracted.head())

shape: (5, 3)
┌─────────┬──────────────────────┬───────┐
│ sku     ┆ desc                 ┆ color │
│ ---     ┆ ---                  ┆ ---   │
│ str     ┆ str                  ┆ str   │
╞═════════╪══════════════════════╪═══════╡
│ COL0037 ┆ Noah's Jewelry       ┆ green │
│ COL0065 ┆ Noah's Jewelry       ┆ mauve │
│ COL0166 ┆ Noah's Action Figure ┆ blue  │
│ COL0167 ┆ Noah's Bobblehead    ┆ blue  │
│ COL0483 ┆ Noah's Action Figure ┆ mauve │
└─────────┴──────────────────────┴───────┘
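
To see what the two capture groups pull out, here is a minimal sketch using Python's built-in `re` module instead of polars. It rests on a few assumptions: `re` does not support the POSIX class `[[:lower:]]`, so `[a-z]` stands in for it; `fullmatch` is used to anchor the pattern to the whole string; and `split_color` and the "Plain Bagel" sample are names I made up for illustration.

```python
import re

# Stand-in for the notebook's pattern: Python's `re` has no [[:lower:]],
# so [a-z] is used here instead (an assumption, ASCII letters only).
COLOR_PATTERN = re.compile(r"(.+) \(([a-z]+)\)")


def split_color(desc: str):
    """Return (name, color) if desc ends in a parenthesised lower-case color, else None."""
    match = COLOR_PATTERN.fullmatch(desc)
    return match.groups() if match else None


# Sample descriptions; "Plain Bagel" is a made-up product without a color.
samples = ["Noah's Poster (azure)", "Noah's Jersey (red)", "Plain Bagel"]
for desc in samples:
    print(desc, "->", split_color(desc))
```

Group 1 is the product name without the color and group 2 is the color itself, which is what the two `str.extract` calls rely on.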


OK, now I'll find all the order items whose product is one of these "colorful products".

In [179]:
orders_with_colorful_products = orders_items.join(products_with_colors_extracted, on="sku", how="inner")
print(orders_with_colorful_products)

shape: (28_012, 6)
┌─────────┬─────────┬─────┬────────────┬──────────────────────┬─────────┐
│ orderid ┆ sku     ┆ qty ┆ unit_price ┆ desc                 ┆ color   │
│ ---     ┆ ---     ┆ --- ┆ ---        ┆ ---                  ┆ ---     │
│ i64     ┆ str     ┆ i64 ┆ f64        ┆ str                  ┆ str     │
╞═════════╪═════════╪═════╪════════════╪══════════════════════╪═════════╡
│ 1014    ┆ COL4117 ┆ 1   ┆ 4.55       ┆ Noah's Poster        ┆ yellow  │
│ 1015    ┆ COL8357 ┆ 1   ┆ 13.48      ┆ Noah's Lunchbox      ┆ mauve   │
│ 1018    ┆ COL6388 ┆ 1   ┆ 3.72       ┆ Noah's Gift Box      ┆ magenta │
│ 1040    ┆ COL7454 ┆ 1   ┆ 10.65      ┆ Noah's Jersey        ┆ mauve   │
│ 1041    ┆ COL2141 ┆ 1   ┆ 5.87       ┆ Noah's Bobblehead    ┆ puce    │
│ …       ┆ …       ┆ …   ┆ …          ┆ …                    ┆ …       │
│ 214217  ┆ COL9349 ┆ 1   ┆ 18.31      ┆ Noah's Action Figure ┆ orange  │
│ 214218  ┆ COL0837 ┆ 1   ┆ 4.97       ┆ Noah's Poster        ┆ mauve   │
│ 214227  ┆ COL1263

And now I have to find consecutive orders (the orderids auto-increment by one) which have the same item but a different color...
This is pretty hard for me conceptually, because I don't know how to compare rows only if they're sequential.

Maybe I could do a group_by on "desc", collect the orderids into a list, and somehow find the pairs of orderids in each list that differ by exactly one??

In [180]:
orders_by_colorful_products = (
    orders_with_colorful_products.group_by("desc")
    .agg(pl.col("orderid").sort()) # Sort orders for pairwise filter later
    .sort("desc")
)
print(orders_by_colorful_products)

shape: (7, 2)
┌──────────────────────┬────────────────────────┐
│ desc                 ┆ orderid                │
│ ---                  ┆ ---                    │
│ str                  ┆ list[i64]              │
╞══════════════════════╪════════════════════════╡
│ Noah's Action Figure ┆ [1073, 1147, … 214227] │
│ Noah's Bobblehead    ┆ [1041, 1071, … 214150] │
│ Noah's Gift Box      ┆ [1018, 1167, … 214232] │
│ Noah's Jersey        ┆ [1040, 1066, … 214184] │
│ Noah's Jewelry       ┆ [1053, 1086, … 214202] │
│ Noah's Lunchbox      ┆ [1015, 1086, … 214231] │
│ Noah's Poster        ┆ [1014, 1047, … 214218] │
└──────────────────────┴────────────────────────┘


For each product, I must find all the pairs of order ids that are consecutive.
That means walking the sorted list pairwise (each element together with the one after it), and luckily Python's `itertools.pairwise` implements that for me.

In [184]:
from itertools import pairwise

def filter_consecutive_ids(series: pl.Series) -> list[tuple[int, int]]:
    output = []
    for pair in pairwise(series):
        if pair[1] - pair[0] == 1:
            output.append(pair)
    return output

items_ordered_consecutively = orders_by_colorful_products.select(
    pl.col("desc"),
    pl.col("orderid").map_elements(
        filter_consecutive_ids, return_dtype=pl.List(pl.List(pl.Int64))
    ).alias("consecutive_orderid"),
).explode("consecutive_orderid")
print(items_ordered_consecutively)


shape: (550, 2)
┌──────────────────────┬─────────────────────┐
│ desc                 ┆ consecutive_orderid │
│ ---                  ┆ ---                 │
│ str                  ┆ list[i64]           │
╞══════════════════════╪═════════════════════╡
│ Noah's Action Figure ┆ [2220, 2221]        │
│ Noah's Action Figure ┆ [4573, 4574]        │
│ Noah's Action Figure ┆ [4946, 4947]        │
│ Noah's Action Figure ┆ [11945, 11946]      │
│ Noah's Action Figure ┆ [13250, 13251]      │
│ …                    ┆ …                   │
│ Noah's Poster        ┆ [201437, 201438]    │
│ Noah's Poster        ┆ [202011, 202012]    │
│ Noah's Poster        ┆ [204001, 204002]    │
│ Noah's Poster        ┆ [205182, 205183]    │
│ Noah's Poster        ┆ [210059, 210060]    │
└──────────────────────┴─────────────────────┘
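
As a sanity check on the pairwise idea, here is a tiny standalone sketch on a made-up list of order ids. For a list, `zip(ids, ids[1:])` yields the same pairs as `itertools.pairwise(ids)`, so I use it here to keep the example free of version requirements. The function name `consecutive_pairs` and the ids are hypothetical.

```python
def consecutive_pairs(sorted_ids):
    """Return the adjacent pairs (a, b) from a sorted list where b == a + 1."""
    # zip(xs, xs[1:]) walks the list pairwise: (x0, x1), (x1, x2), ...
    return [(a, b) for a, b in zip(sorted_ids, sorted_ids[1:]) if b - a == 1]


# Made-up order ids, already sorted
order_ids = [1014, 1047, 1048, 1049, 1100]
print(consecutive_pairs(order_ids))  # [(1047, 1048), (1048, 1049)]
```

Note that a run of three consecutive ids produces two overlapping pairs, which is what I want here because each pair is a separate candidate.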


This is a table of all the pairs of consecutive orders in which the same item (ignoring its color) was bought.
I now need to filter for the rows where the color differs between the two orders in the pair.

In [187]:
def get_orderids_to_item_color(series: pl.Series) -> pl.Series:
    # orders_with_colorful_products is a global variable :)
    # Return the "color" column as a Series so each row becomes a list[str]
    return orders_with_colorful_products.filter(
        pl.col("orderid").is_in(series)
    ).get_column("color")

# - Add a 'color' column listing the colors of every colorful item in the two orders
# - Filter for rows where each order has exactly one colorful item (list length 2)
# - Filter for rows where the color list has two unique elements
items_ordered_consecutively_with_color = items_ordered_consecutively.select(
    pl.col("desc"),
    pl.col("consecutive_orderid"),
    pl.col("consecutive_orderid")
    .map_elements(get_orderids_to_item_color, return_dtype=pl.List(pl.String))
    .alias("color"),
).filter(pl.col("color").list.len() == 2).filter(pl.col("color").list.n_unique() == 2)
print(items_ordered_consecutively_with_color)

shape: (401, 3)
┌──────────────────────┬─────────────────────┬─────────────────────┐
│ desc                 ┆ consecutive_orderid ┆ color               │
│ ---                  ┆ ---                 ┆ ---                 │
│ str                  ┆ list[i64]           ┆ list[str]           │
╞══════════════════════╪═════════════════════╪═════════════════════╡
│ Noah's Action Figure ┆ [2220, 2221]        ┆ ["orange", "amber"] │
│ Noah's Action Figure ┆ [4573, 4574]        ┆ ["white", "purple"] │
│ Noah's Action Figure ┆ [4946, 4947]        ┆ ["amber", "purple"] │
│ Noah's Action Figure ┆ [13250, 13251]      ┆ ["azure", "purple"] │
│ Noah's Action Figure ┆ [13649, 13650]      ┆ ["yellow", "green"] │
│ …                    ┆ …                   ┆ …                   │
│ Noah's Poster        ┆ [189701, 189702]    ┆ ["blue", "white"]   │
│ Noah's Poster        ┆ [197802, 197803]    ┆ ["red", "azure"]    │
│ Noah's Poster        ┆ [201437, 201438]    ┆ ["azure", "red"]    │
│ Noah's Poster   

The table above has 401 pairs of consecutive orders that contain the same item in different colors.
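
The "same item, next order id, different color" check can also be written without any DataFrame, which helped me reason about it. This is only a sketch on made-up rows, not the notebook's method: it assumes each order has at most one color per item (a later row for the same order and item would overwrite an earlier one), and `same_item_different_color` and the sample rows are hypothetical.

```python
def same_item_different_color(rows):
    """Find pairs of consecutive order ids that share an item but differ in color.

    `rows` is an iterable of (orderid, desc, color) tuples.
    Assumption: at most one color per (orderid, desc).
    """
    color_of = {(orderid, desc): color for orderid, desc, color in rows}
    matches = []
    for (orderid, desc), color in sorted(color_of.items()):
        # Look up the same item in the very next order
        next_color = color_of.get((orderid + 1, desc))
        if next_color is not None and next_color != color:
            matches.append((desc, (orderid, orderid + 1), (color, next_color)))
    return matches


# Made-up rows: (orderid, desc, color)
rows = [
    (100, "Poster", "red"),
    (101, "Poster", "blue"),   # same item as order 100, different color
    (101, "Jersey", "green"),
    (102, "Jersey", "green"),  # same item as order 101, same color
    (104, "Poster", "red"),    # not consecutive with anything
]
print(same_item_different_color(rows))  # [('Poster', (100, 101), ('red', 'blue'))]
```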

I'll find the rows where either orderid belongs to Sherri Long, the woman from part 6.

In [217]:
part6_customerid = 4167
part6_customer_orderids = (
    orders.filter(pl.col("customerid") == part6_customerid)
    .select("orderid")
    .to_series()
)

orders_from_part6_customer = items_ordered_consecutively_with_color.filter(
    pl.col("consecutive_orderid")
    .list.eval(pl.element().is_in(part6_customer_orderids))
    .list.any()
)
print(orders_from_part6_customer)

shape: (1, 3)
┌───────────────┬─────────────────────┬─────────────────────┐
│ desc          ┆ consecutive_orderid ┆ color               │
│ ---           ┆ ---                 ┆ ---                 │
│ str           ┆ list[i64]           ┆ list[str]           │
╞═══════════════╪═════════════════════╪═════════════════════╡
│ Noah's Poster ┆ [70502, 70503]      ┆ ["orange", "azure"] │
└───────────────┴─────────────────────┴─────────────────────┘


I'll programmatically get the order id and customer id of the customer who is not Sherri Long from the above list of orderids.

In [224]:
print(
    orders_from_part6_customer.explode("consecutive_orderid")
    .filter(pl.col("consecutive_orderid").is_in(part6_customer_orderids).not_())
    .join(orders, left_on="consecutive_orderid", right_on="orderid", how="inner")
    .join(customers, on="customerid", how="inner")
    .select("customerid", "name", "phone")
)


shape: (1, 3)
┌────────────┬──────────────┬──────────────┐
│ customerid ┆ name         ┆ phone        │
│ ---        ┆ ---          ┆ ---          │
│ i64        ┆ str          ┆ str          │
╞════════════╪══════════════╪══════════════╡
│ 5783       ┆ Carlos Myers ┆ 838-335-7157 │
└────────────┴──────────────┴──────────────┘
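
The "not Sherri Long" step boils down to keeping the ids in the pair that are not among the known customer's order ids. A minimal plain-Python sketch of that idea, with a hypothetical helper name and made-up ids:

```python
def other_order_ids(pair, known_ids):
    """Return the order ids in `pair` that are not in `known_ids`."""
    known = set(known_ids)  # set lookup keeps the membership test cheap
    return [orderid for orderid in pair if orderid not in known]


# Made-up ids: the known customer placed order 502, so 501 belongs to someone else
print(other_order_ids([501, 502], [12, 502, 900]))  # [501]
```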
