The goal of this notebook is to compute synergy metrics between stores, retailers, and categories from the cross-visit data. We use graph-based methods to analyze the relationships and interactions between these entities.

# Imports

In [None]:
import networkx as nx
import pandas as pd

# Data Loading

In [None]:
import constants.constants as cst
import constants.paths as pth

# Dim Tables
dim_blocks = pd.read_csv(pth.DIM_BLOCKS, **cst.CSV_PARAMS)
dim_malls = pd.read_csv(pth.DIM_MALLS, **cst.CSV_PARAMS)

# Fact Tables
fact_stores = pd.read_csv(pth.FACT_STORES, **cst.CSV_PARAMS)
fact_malls = pd.read_csv(pth.FACT_MALLS, **cst.CSV_PARAMS)
fact_sri_scores = pd.read_csv(pth.FACT_SRI_SCORES, **cst.CSV_PARAMS)

# Store financials table
store_financials = pd.read_csv(pth.STORE_FINANCIALS, **cst.CSV_PARAMS)

# Cross visit table
cross_visit = pd.read_csv(pth.CROSS_VISITS, **cst.CSV_PARAMS)

## Data enrichment

In [None]:
# Inspect store codes that appear more than once in dim_blocks
dim_blocks[dim_blocks["store_code"].duplicated()].sort_values("store_code")

We want to compute synergy metrics at the store, retailer, and category levels. To do so, we enrich the cross-visit data with retailer and category information. We also add the mall id to the enrichment columns so that we can build one graph per mall.

In [None]:
store_1_enrich = (
    dim_blocks[
        [
            "store_code",
            "mall_id",
            "retailer_code",
            "bl1_label",
            "bl2_label",
            "bl3_label",
        ]
    ]
    .drop_duplicates("store_code")
    .add_suffix("_1")
)

store_2_enrich = (
    dim_blocks[
        [
            "store_code",
            "mall_id",
            "retailer_code",
            "bl1_label",
            "bl2_label",
            "bl3_label",
        ]
    ]
    .drop_duplicates("store_code")
    .add_suffix("_2")
)

cross_visit_enriched = pd.merge(
    cross_visit,
    store_1_enrich,
    left_on="store_code_1",
    right_on="store_code_1",
    how="left",
    validate="m:1",
)

cross_visit_enriched = pd.merge(
    cross_visit_enriched,
    store_2_enrich,
    left_on="store_code_2",
    right_on="store_code_2",
    how="left",
    validate="m:1",
)
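
Because both merges are left joins, a store code missing from `dim_blocks` leaves the enrichment columns empty rather than raising an error. A minimal sketch of how such rows could be counted, using the `indicator` option of `merge` on toy data (the toy frames and their values are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for `cross_visit` and `store_1_enrich` (values are made up).
toy_cross_visit = pd.DataFrame(
    {"store_code_1": ["A", "B", "Z"], "total_cross_visits": [30, 20, 5]}
)
toy_enrich = pd.DataFrame(
    {"store_code_1": ["A", "B"], "retailer_code_1": ["R1", "R2"]}
)

merged = toy_cross_visit.merge(
    toy_enrich,
    on="store_code_1",
    how="left",
    validate="m:1",
    indicator=True,  # adds a `_merge` column: "both" or "left_only"
)

# Rows whose store code has no match in the dimension table.
unmatched = merged[merged["_merge"] == "left_only"]
n_unmatched = len(unmatched)
```

Here store `Z` has no match, so it is the only row flagged as `left_only`.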

We check that the two `mall_id` columns never conflict, i.e. that no pair has two different non-missing `mall_id` values. The only differences arise when one store has a `mall_id` and the other does not, so we can combine the two columns into a single complete one.

In [None]:
cross_visit_enriched[
    (cross_visit_enriched["mall_id_1"] != cross_visit_enriched["mall_id_2"])
    & (
        cross_visit_enriched["mall_id_1"].notna()
        & cross_visit_enriched["mall_id_2"].notna()
    )
]

In [None]:
cross_visit_enriched["mall_id"] = cross_visit_enriched["mall_id_1"].combine_first(
    cross_visit_enriched["mall_id_2"]
)

cross_visit_enriched = cross_visit_enriched.drop(columns=["mall_id_1", "mall_id_2"])

# Drop rows where mall_id is missing altogether
cross_visit_enriched = cross_visit_enriched.dropna(axis=0, subset="mall_id")
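
The check-then-combine logic above can be illustrated on a small toy frame. This is only a sketch: the values are made up, and the `assert` is one possible way to make the visual check fail loudly if a conflict ever appears.

```python
import numpy as np
import pandas as pd

# Toy pairs: both ids present, only one present, or none present.
toy = pd.DataFrame(
    {
        "mall_id_1": [22.0, np.nan, 22.0, np.nan],
        "mall_id_2": [22.0, 22.0, np.nan, np.nan],
    }
)

# A conflict is a row where both ids are present and differ.
both_present = toy["mall_id_1"].notna() & toy["mall_id_2"].notna()
conflicts = toy[both_present & (toy["mall_id_1"] != toy["mall_id_2"])]
assert conflicts.empty, "mall_id_1 and mall_id_2 disagree for some pairs"

# With no conflicts, take mall_id_1 and fall back to mall_id_2 when missing.
toy["mall_id"] = toy["mall_id_1"].combine_first(toy["mall_id_2"])

# Rows with no mall id on either side cannot be assigned to a mall.
resolved = toy.dropna(subset=["mall_id"])
```

Only the last toy row, where neither store has a `mall_id`, is dropped.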

In [None]:
cross_visit_enriched

We still need to normalize the edge weights so that they are comparable across store pairs. One option is:
$$
edge\_weight_{ij} = \frac{total\_cross\_visits_{ij}}{\sqrt{total\_visits_i \times total\_visits_j}}
$$

At this point, the issue is that some stores have more cross visits in `cross_visits` than total visits in `fact_stores`. This discrepancy is an open question that still needs to be raised. In the meantime, the sum of a store's cross visits could serve as a proxy for its total visits.
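
A minimal sketch of this normalization, with summed cross visits standing in for total visits. The helper name `add_normalized_weight` and the toy values are hypothetical. The sketch also assumes that each unordered store pair appears only once in the table; if both directions were present, the proxy would double count.

```python
import numpy as np
import pandas as pd


def add_normalized_weight(edges: pd.DataFrame) -> pd.DataFrame:
    """Add an `edge_weight` column using summed cross visits as a visit proxy."""
    # Proxy for total_visits_i: all cross visits in which store i appears,
    # on either side of a pair (assumes each unordered pair is listed once).
    proxy = (
        pd.concat(
            [
                edges.set_index("store_code_1")["total_cross_visits"],
                edges.set_index("store_code_2")["total_cross_visits"],
            ]
        )
        .groupby(level=0)
        .sum()
    )

    out = edges.copy()
    visits_1 = out["store_code_1"].map(proxy)
    visits_2 = out["store_code_2"].map(proxy)
    # Cross visits divided by the geometric mean of the two stores' visits.
    out["edge_weight"] = out["total_cross_visits"] / np.sqrt(visits_1 * visits_2)
    return out


# Toy data (made-up values): three stores, three pairs.
toy = pd.DataFrame(
    {
        "store_code_1": ["A", "A", "B"],
        "store_code_2": ["B", "C", "C"],
        "total_cross_visits": [30, 10, 20],
    }
)
weighted = add_normalized_weight(toy)
```

With this proxy, a pair's cross visits can never exceed either store's total, so the weights stay between 0 and 1. That bound would not hold with the `fact_stores` totals as long as the discrepancy noted above persists.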

# Graph Construction

In [None]:
def construct_graph(data: pd.DataFrame, mall_id: int, granularity: str) -> nx.Graph:
    """Construct a graph for a specific mall and granularity level.

    Parameters:
    - data: DataFrame containing cross visit data.
    - mall_id: The mall ID to filter the data.
    - granularity: The granularity level. Must be one of
                   ('store', 'retailer', 'cat_high', 'cat_mid', 'cat_low').

    Returns:
    - A NetworkX graph object.
    """
    # Map each granularity level to the column prefix identifying its nodes.
    node_cols = {
        "store": "store_code",
        "retailer": "retailer_code",
        "cat_high": "bl1_label",
        "cat_mid": "bl2_label",
        "cat_low": "bl3_label",
    }
    if granularity not in node_cols:
        raise ValueError(
            "Granularity must be one of 'store', 'retailer', 'cat_high', 'cat_mid' "
            "or 'cat_low'."
        )
    node_col_1 = f"{node_cols[granularity]}_1"
    node_col_2 = f"{node_cols[granularity]}_2"

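    mall_data = data[data["mall_id"] == mall_id]

    # Above store level, several store pairs map to the same node pair. Sum their
    # cross visits first; otherwise nx.Graph keeps only the last row per pair.
    # Note: groupby drops rows whose node label is missing.
    edges = mall_data.groupby([node_col_1, node_col_2], as_index=False)[
        "total_cross_visits"
    ].sum()

    graph = nx.from_pandas_edgelist(
        edges,
        source=node_col_1,
        target=node_col_2,
        edge_attr="total_cross_visits",
        create_using=nx.Graph(),
    )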

    return graph

In [None]:
store_graph = construct_graph(cross_visit_enriched, mall_id=22, granularity="store")
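
Once a graph is built, a quick sanity check is to look at its size and at each node's weighted degree, i.e. the total cross visits on all edges touching that node. The sketch below does this on a toy graph with made-up values. It is only a possible first look, not necessarily the synergy metric this notebook will settle on.

```python
import networkx as nx
import pandas as pd

# Toy edge list (made-up values) with the same columns as the store-level data.
toy_edges = pd.DataFrame(
    {
        "store_code_1": ["A", "A", "B"],
        "store_code_2": ["B", "C", "C"],
        "total_cross_visits": [30, 10, 20],
    }
)
toy_graph = nx.from_pandas_edgelist(
    toy_edges,
    source="store_code_1",
    target="store_code_2",
    edge_attr="total_cross_visits",
    create_using=nx.Graph(),
)

n_nodes = toy_graph.number_of_nodes()
n_edges = toy_graph.number_of_edges()

# Weighted degree: sum of `total_cross_visits` over the edges of each node.
strength = dict(toy_graph.degree(weight="total_cross_visits"))
```

The same calls can be applied to `store_graph` to check that the node and edge counts look plausible for the mall.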