## Dataset Construction for CAV Training and Evaluation

This notebook implements the dataset construction pipeline used for training and evaluating Concept Activation Vectors (CAVs) in our bias analysis and debiasing framework.

We focus on creating balanced and stratified datasets that support binary concept classification tasks (e.g., gender = female vs. not female) while minimizing confounding factors such as genre or language imbalance. The resulting splits are used for training concept classifiers (logistic regression) and computing TCAV scores. The CAVs can also be used for debiasing.




In [None]:

import pandas as pd  # needed below for pd.read_json and pd.concat

from cavmir.data.dataset_creation import get_genre_balanced_datasets, preprocess_metadata

#### Load Metadata
The notebook loads the full metadata table, including track IDs, genre, gender, language, and precomputed embeddings. This metadata serves as the basis for concept stratification.

##### Stratified Sampling
For each concept (e.g., gender, language), we:

- Define positive samples (e.g., gender = female)

- Define non-positive samples (e.g., gender ≠ female, such as male and non-binary)

- Sample an equal number of positive and non-positive examples for each concept within each genre, reducing the risk of genre-driven confounding.
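The sampling steps above can be sketched with plain pandas. This is a minimal illustration under stated assumptions, not the implementation inside `get_genre_balanced_datasets`: the helper name `sample_balanced_by_genre`, the column names, and the toy table are all hypothetical.

```python
import pandas as pd


def sample_balanced_by_genre(df, concept_col, positive_value, genre_col="genre", seed=0):
    """Return equal numbers of positive and non-positive rows within each genre.

    Hypothetical helper for illustration only.
    """
    # Rows with a missing concept or genre cannot be assigned to either class.
    df = df.dropna(subset=[concept_col, genre_col])
    parts = []
    for _, group in df.groupby(genre_col):
        positives = group[group[concept_col] == positive_value]
        negatives = group[group[concept_col] != positive_value]
        # The smaller class determines how many rows each class contributes.
        n = min(len(positives), len(negatives))
        if n == 0:
            continue  # this genre cannot contribute a balanced pair
        parts.append(positives.sample(n=n, random_state=seed))
        parts.append(negatives.sample(n=n, random_state=seed))
    if not parts:
        return df.iloc[0:0]
    return pd.concat(parts, ignore_index=True)


# Toy metadata: 5 rock tracks (2 female) and 3 jazz tracks (1 female).
toy = pd.DataFrame(
    {
        "track_id": range(8),
        "genre": ["rock"] * 5 + ["jazz"] * 3,
        "gender": ["female", "female", "male", "male", "non-binary",
                   "female", "male", "male"],
    }
)
balanced = sample_balanced_by_genre(toy, "gender", "female")
```

Because the smaller class caps the sample size in each genre, no genre can contribute more positives than non-positives, which is what limits genre-driven confounding.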

##### Train/Test Split with Genre Balance

- At most 50 samples per (concept, genre) subgroup are used for training.

- The remainder is reserved for testing.

- If data is scarce, the limit is scaled down proportionally.

This ensures broad subgroup diversity in the test set, while avoiding training/test overlap.
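One way to picture the capped split is the sketch below. The cap of 50 comes from the text; everything else is an assumption made for illustration. In particular, the text does not specify how the limit is scaled when data is scarce, so the `train_fraction` parameter and the rule `min(max_train, fraction * subgroup size)` are hypothetical, as is the helper name `split_by_subgroup`.

```python
import pandas as pd


def split_by_subgroup(df, group_cols, max_train=50, train_fraction=0.5, seed=0):
    """Split each subgroup into disjoint train and test parts.

    Hypothetical helper: the scaling rule is an assumption, not the
    rule used by the cavmir library.
    """
    train_parts, test_parts = [], []
    for _, group in df.groupby(group_cols):
        shuffled = group.sample(frac=1.0, random_state=seed)
        # Assumed rule: small subgroups give up only a fraction of their rows,
        # so some rows always remain for testing.
        n_train = min(max_train, int(len(shuffled) * train_fraction))
        train_parts.append(shuffled.iloc[:n_train])
        test_parts.append(shuffled.iloc[n_train:])
    return pd.concat(train_parts), pd.concat(test_parts)


# Toy metadata: one large subgroup (200 rows) and one scarce subgroup (10 rows).
toy = pd.DataFrame(
    {
        "track_id": range(210),
        "genre": ["rock"] * 200 + ["jazz"] * 10,
        "gender": ["female"] * 210,
    }
)
train, test = split_by_subgroup(toy, ["gender", "genre"])
```

In this toy case the large subgroup hits the cap of 50 training rows, while the scarce subgroup contributes 5 rows to each side. Each row lands in exactly one of the two sets, so the splits cannot overlap.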


In [None]:
song_info_path = "song_info.csv"
song_artist_path = "song_artist.csv"
artist_info_path = "artist_info.csv"
supplementary_dataset_path = "supplementary_dataset.txt"

metadata = preprocess_metadata(
    song_info_path=song_info_path,
    song_artist_path=song_artist_path,
    artist_info_path=artist_info_path,
)

# Append the rows of the supplementary dataset (stored as JSON Lines)
supplementary_metadata = pd.read_json(supplementary_dataset_path, lines=True)
metadata = pd.concat([metadata, supplementary_metadata], axis=0, ignore_index=True)

# Create genre-balanced datasets for different languages
train_concept_language = ("language", ["en", "fr", "it", "pt", "ja", "es", "de"])
(train_dataset_language, test_dataset_language) = get_genre_balanced_datasets(
    metadata=metadata, train_concept=train_concept_language
)

# Create genre-balanced datasets for different genders
train_concept_gender = ("gender", ["male", "female"])
(train_dataset_gender, test_dataset_gender) = get_genre_balanced_datasets(
    metadata=metadata, train_concept=train_concept_gender
)



#### Exporting Final Datasets
Final training and testing sets are saved per concept and used for:

- CAV training (logistic regression on embeddings)

- TCAV score computation

- Bias manipulation experiments
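As a rough illustration of the first use, a CAV can be taken as the normalized weight vector of a logistic regression fit on embeddings. The sketch below uses scikit-learn and synthetic embeddings; the helper name `fit_cav`, the embedding dimension, and the data are assumptions made for illustration and do not reflect the framework's actual training code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_cav(embeddings, labels):
    """Fit a linear concept classifier and return (unit-norm CAV, classifier).

    Hypothetical helper for illustration only.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embeddings, labels)
    direction = clf.coef_[0]  # normal vector of the decision boundary
    return direction / np.linalg.norm(direction), clf


# Synthetic stand-in for precomputed embeddings: positives are shifted
# along dimension 0, so the learned direction should point that way.
rng = np.random.default_rng(0)
dim = 8
negatives = rng.normal(size=(100, dim))
positives = rng.normal(size=(100, dim))
positives[:, 0] += 4.0

X = np.vstack([positives, negatives])
y = np.array([1] * 100 + [0] * 100)

cav, clf = fit_cav(X, y)
train_accuracy = clf.score(X, y)
```

In practice the embeddings and labels would come from the exported training sets, and classifier quality would be checked on the matching test set rather than on the training data as in this toy example.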

In [None]:
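import os

# to_csv does not create missing directories, so make sure the target exists.
os.makedirs("datasets", exist_ok=True)

# Write one train/test CSV pair per concept value (e.g., per language code).
for key in train_dataset_language: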
    train_dataset_language[key].to_csv(
        f"datasets/train_dataset_language_{key}.csv", index=False
    )
    test_dataset_language[key].to_csv(
        f"datasets/test_dataset_language_{key}.csv", index=False
    )

for key in train_dataset_gender:
    train_dataset_gender[key].to_csv(
        f"datasets/train_dataset_gender_{key}.csv", index=False
    )
    test_dataset_gender[key].to_csv(
        f"datasets/test_dataset_gender_{key}.csv", index=False
    )
