* **Author:** Alexander Staub
* **Last changed:** 2025.06.29
* **Purpose:** This notebook is the final step in the Chartmetric ID retrieval process. It runs **once** after all worker scripts have completed.
    1.  It automatically finds all worker checkpoint files.
    2.  It concatenates them into a single, complete file of results, handling duplicates created by the copy-paste setup.
    3.  It loads the original master dataset (with all metadata columns).
    4.  It performs a left merge to add the `chartmetric_ids` to the master dataset.
    5.  It saves the final, enriched dataset to a new CSV file.

In [16]:
# Import packages
import time
import requests
import logging
import pandas as pd
import os
import numpy as np
# glob is used to find the worker checkpoint files
import glob

In [17]:
# --- Configuration ---
# Define the paths and parameters for the merge process.

# The base directory where the worker output parts are stored.
# This should match the ID_OUTPUT_DIR from your worker scripts.
WORKER_OUTPUT_DIR = "//bigdata.wu.ac.at/delpero/Data_alexander/data/raw_data/chartmetric/chartmetric_ids/"

# The location of your original, full dataset (the one the controller used).
ORIGINAL_MASTER_INPUT_FILE = "//bigdata.wu.ac.at/delpero/Data_alexander/data/raw_data/Spotify/1980_2000_songs_artists/final_charted_from_spotify_1980_2000.csv"

# CHANGE the path for the chart songs
# ORIGINAL_MASTER_INPUT_FILE =

# The path where the single, combined checkpoint file will be saved.
FINAL_CONSOLIDATED_CHECKPOINT = os.path.join(WORKER_OUTPUT_DIR, "chartmetric_ids_complete_checkpoint.csv")

# Path where the final, enriched dataset will be saved.
# CHANGE the path for the chart songs
FINAL_OUTPUT_FILE_WITH_IDS = "//bigdata.wu.ac.at/delpero/Data_alexander/data/raw_data/chartmetric/chartmetric_ids/chartmetric_ids_chart_songs_matched.csv"

print("Configuration set.")
print(f"Worker output directory: {WORKER_OUTPUT_DIR}")
print(f"Final output file will be: {FINAL_OUTPUT_FILE_WITH_IDS}")

Configuration set.
Worker output directory: //bigdata.wu.ac.at/delpero/Data_alexander/data/raw_data/chartmetric/chartmetric_ids/
Final output file will be: //bigdata.wu.ac.at/delpero/Data_alexander/data/raw_data/chartmetric/chartmetric_ids/chartmetric_ids_chart_songs_matched.csv

### Part A: Find and Consolidate All Worker Checkpoint Files

This step finds all the `chartmetric_ids_checkpoint.csv` files inside the `part_*` subdirectories, loads them into a list of dataframes, and concatenates them into a single dataframe.

**Important:** Because of the "copy-paste" setup method, each worker checkpoint contains the original ~180k rows plus its own new results. This step **must** drop duplicates to create a clean, final list of all unique ISRCs that have been processed.
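
One caveat with `keep='first'`: if the same ISRC appears in several checkpoints, once with an ID and once without, the row that survives depends on the order in which the files were concatenated. The following is a minimal sketch on toy data, assuming (the notebook does not confirm this) that unprocessed rows carry a missing `chartmetric_ids` value. The helper name `dedupe_prefer_filled` and all values are hypothetical.

```python
import pandas as pd


def dedupe_prefer_filled(df, key="spotify_isrc", value="chartmetric_ids"):
    """Keep one row per key, preferring rows where `value` is not missing."""
    # Stable sort on "is missing": filled rows come first, missing rows last,
    # so keep='first' picks a filled row whenever one exists for that key.
    ordered = df.sort_values(by=value, key=lambda s: s.isna(), kind="stable")
    deduped = ordered.drop_duplicates(subset=[key], keep="first")
    # Restore the original row order and renumber the index.
    return deduped.sort_index().reset_index(drop=True)


# Toy stand-ins for two worker checkpoints (made-up ISRCs and IDs)
part_1 = pd.DataFrame({"spotify_isrc": ["A1", "B2"], "chartmetric_ids": [None, 222.0]})
part_2 = pd.DataFrame({"spotify_isrc": ["A1", "C3"], "chartmetric_ids": [111.0, None]})

combined = pd.concat([part_1, part_2], ignore_index=True)
result = dedupe_prefer_filled(combined)
print(result)
```

With a plain `drop_duplicates(keep='first')`, ISRC `A1` would keep the missing value from `part_1`; with the sort in place it keeps the ID found in `part_2`.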

In [18]:
# Use glob to automatically find all worker checkpoint files
search_pattern = os.path.join(WORKER_OUTPUT_DIR, "part_*", "chartmetric_ids_checkpoint.csv")
worker_checkpoint_files = glob.glob(search_pattern)

if not worker_checkpoint_files:
    raise FileNotFoundError(f"No worker checkpoint files were found using the pattern: {search_pattern}. Please ensure the workers have run and the WORKER_OUTPUT_DIR is correct.")

print(f"Found {len(worker_checkpoint_files)} worker checkpoint files:")
for f in worker_checkpoint_files:
    print(f" - {f}")

# Load all checkpoint files into a list of dataframes
all_worker_dfs = [pd.read_csv(f) for f in worker_checkpoint_files]

# Concatenate all dataframes into one
print("\nConcatenating all worker files...")
consolidated_df = pd.concat(all_worker_dfs, ignore_index=True)
print(f"Total rows before duplicate removal: {len(consolidated_df):,}")

# CRITICAL STEP: Drop duplicates based on 'spotify_isrc'
# This handles the overlap from the initial copy-paste and any potential runtime overlaps.
consolidated_df.drop_duplicates(subset=['spotify_isrc'], keep='first', inplace=True)
print(f"Total unique rows after duplicate removal: {len(consolidated_df):,}")

# Save the final, consolidated checkpoint file
consolidated_df.to_csv(FINAL_CONSOLIDATED_CHECKPOINT, index=False)
print(f"\nSuccessfully saved the complete, consolidated checkpoint file to:\n{FINAL_CONSOLIDATED_CHECKPOINT}")

Found 3 worker checkpoint files:
 - //bigdata.wu.ac.at/delpero/Data_alexander/data/raw_data/chartmetric/chartmetric_ids\part_3\chartmetric_ids_checkpoint.csv
 - //bigdata.wu.ac.at/delpero/Data_alexander/data/raw_data/chartmetric/chartmetric_ids\part_2\chartmetric_ids_checkpoint.csv
 - //bigdata.wu.ac.at/delpero/Data_alexander/data/raw_data/chartmetric/chartmetric_ids\part_1\chartmetric_ids_checkpoint.csv

Concatenating all worker files...
Total rows before duplicate removal: 24,056
Total unique rows after duplicate removal: 24,056

Successfully saved the complete, consolidated checkpoint file to:
//bigdata.wu.ac.at/delpero/Data_alexander/data/raw_data/chartmetric/chartmetric_ids/chartmetric_ids_complete_checkpoint.csv


### Part B: Merge Chartmetric IDs into the Master Dataset

Now, we load the original dataset with all its rich metadata and the complete list of Chartmetric IDs we just created. We then perform a `left` merge to add the `chartmetric_ids` column to the original data.
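
A left merge keeps every row of the master dataset and only adds rows if a key repeats on the right-hand side. As a hedged illustration on toy data (the column `track_name` and all values are made up), the `validate` and `indicator` options of `pd.merge` can be used to check both properties:

```python
import pandas as pd

# Toy stand-ins for the master data and the consolidated IDs (made-up values)
master = pd.DataFrame({
    "spotify_isrc": ["A1", "B2", "C3"],
    "track_name": ["Song A", "Song B", "Song C"],
})
ids = pd.DataFrame({
    "spotify_isrc": ["A1", "B2"],
    "chartmetric_ids": [111, 222],
})

merged = pd.merge(
    master,
    ids,
    on="spotify_isrc",
    how="left",
    validate="many_to_one",  # raises MergeError if an ISRC repeats in `ids`
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# A left merge never drops master rows, and many_to_one never adds any
assert len(merged) == len(master)

# 'left_only' rows are master rows for which no Chartmetric ID was found
print(merged["_merge"].value_counts())
```

Whether these checks are worth adding to the real merge depends on how much the deduplication in Part A is trusted; the `_merge` column would need to be dropped before saving.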

In [21]:
# Load the original master dataset
print(f"Loading original master data from:\n{ORIGINAL_MASTER_INPUT_FILE}")
master_df = pd.read_csv(ORIGINAL_MASTER_INPUT_FILE)
print(f"Loaded {len(master_df):,} rows from the master file.")

# Reuse the consolidated Chartmetric IDs created in Part A.
# Note: this copies the in-memory dataframe rather than re-reading the saved file.
print(f"\nLoading consolidated Chartmetric IDs from:\n{FINAL_CONSOLIDATED_CHECKPOINT}")
chartmetric_ids_df = consolidated_df.copy()
print(f"Loaded {len(chartmetric_ids_df):,} unique Chartmetric ID results.")


# Perform the left merge to add the new IDs to the master dataframe
print("\nMerging Chartmetric IDs into the master dataframe...")
spotify_fetch = pd.merge(master_df, chartmetric_ids_df, on='spotify_isrc', how='left')
print("Merge complete.")

Loading original master data from:
//bigdata.wu.ac.at/delpero/Data_alexander/data/raw_data/Spotify/1980_2000_songs_artists/final_charted_from_spotify_1980_2000.csv
Loaded 24,056 rows from the master file.

Loading consolidated Chartmetric IDs from:
//bigdata.wu.ac.at/delpero/Data_alexander/data/raw_data/chartmetric/chartmetric_ids/chartmetric_ids_complete_checkpoint.csv
Loaded 24,056 unique Chartmetric ID results.

Merging Chartmetric IDs into the master dataframe...
Merge complete.


In [22]:
# Convert chartmetric_ids from object to numeric (float); unparseable values become NaN
spotify_fetch["chartmetric_ids"] = pd.to_numeric(spotify_fetch["chartmetric_ids"], errors='coerce')

# Convert to pandas' nullable integer type so missing values are kept as <NA>
spotify_fetch["chartmetric_ids"] = spotify_fetch["chartmetric_ids"].astype("Int64")
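
The two-step conversion can be illustrated on a small made-up series. This is only a sketch; the raw values are assumptions about what the column might contain after a CSV round trip:

```python
import pandas as pd

# Made-up raw values as they might appear in a column read as object
raw = pd.Series(["101", "202.0", "not_found", None], dtype="object")

# Step 1: parse to float64; anything unparseable becomes NaN
as_float = pd.to_numeric(raw, errors="coerce")

# Step 2: cast to the nullable integer type; NaN becomes <NA>
as_int = as_float.astype("Int64")

print(as_int.dtype)         # Int64
print(as_int.isna().sum())  # 2
```

Going through float first is what allows values such as `"202.0"` to end up as the integer `202`. Note that `errors="coerce"` silently turns malformed IDs into missing values, so it can be worth comparing the missing count before and after the conversion.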



In [23]:
# Drop rows without a chartmetric_id (.copy() avoids modifying a slice in the next step)
spotify_fetch = spotify_fetch[spotify_fetch["chartmetric_ids"].notna()].copy()

In [24]:
# Drop duplicate ISRCs, keeping the first occurrence
spotify_fetch = spotify_fetch.drop_duplicates(subset=['spotify_isrc'], keep='first')

**Note on saving:** The save step has to take into account the version of the file that I have already saved in the past, because I am not able to run this code remotely.
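
One possible way to avoid overwriting an earlier version is to pick a new file name whenever the target already exists. The sketch below is an assumption about what "taking the earlier version into account" could mean, not the procedure actually used here; the helper `next_free_path` and the `_vN` suffix scheme are hypothetical.

```python
import os
import tempfile


def next_free_path(path):
    """Return `path` if it is unused, else the first `<stem>_vN<ext>` that does not exist."""
    if not os.path.exists(path):
        return path
    stem, ext = os.path.splitext(path)
    version = 2
    while os.path.exists(f"{stem}_v{version}{ext}"):
        version += 1
    return f"{stem}_v{version}{ext}"


# Demonstration in a temporary directory (no real data is touched)
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "chartmetric_ids_chart_songs_matched.csv")
    first_choice = next_free_path(target)   # file does not exist yet
    open(target, "w").close()               # simulate a previously saved version
    second_choice = next_free_path(target)  # falls back to the _v2 name
    print(os.path.basename(first_choice))
    print(os.path.basename(second_choice))
```

In the save cell this could be used as `final_filepath = next_free_path(os.path.join(filepath, file_name))`, at the cost of having to track which version is the current one.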

In [25]:

import os

# Define the output directory
# NEED TO CHECK: the suffix
filepath = "//bigdata.wu.ac.at/delpero/Data_alexander/data/raw_data/chartmetric/chartmetric_ids/"

# Give the file a name
# file_name = "chartmetric_ids_mb_matched.csv"

# For chart songs
file_name = "chartmetric_ids_chart_songs_matched.csv"

# Join filepath and file_name into final_filepath
final_filepath = os.path.join(filepath, file_name)

# Save the dataframe to the final_filepath
spotify_fetch.to_csv(final_filepath, index=False)
print(f"Saved file as: {final_filepath}")

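# Log a final message.
# Note: no logging handler or level is configured in this notebook, so with the
# default WARNING level this INFO message is not shown or written to a file.
logging.info("Completed processing all ISRC codes.")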

Saved file as: //bigdata.wu.ac.at/delpero/Data_alexander/data/raw_data/chartmetric/chartmetric_ids/chartmetric_ids_chart_songs_matched.csv


### Checks of the Consolidated Data

In [19]:
# What percentage of chartmetric_ids is missing in consolidated_df?
missing_percentage = (consolidated_df['chartmetric_ids'].isnull().sum() / len(consolidated_df)) * 100
print(f"Missing chartmetric_ids: {missing_percentage:.2f}%")

In [20]:
# How many unique chartmetric_ids are in consolidated_df?
unique_chartmetric_ids_count = consolidated_df['chartmetric_ids'].nunique()
print(f"Unique chartmetric_ids: {unique_chartmetric_ids_count:,}")
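
The checks above can be tried out on a toy dataframe. This sketch uses made-up values and adds one further check (rows whose ID already appeared on an earlier row) that is not part of the notebook:

```python
import pandas as pd

# Toy stand-in for consolidated_df (made-up values)
toy = pd.DataFrame({
    "spotify_isrc": ["A1", "B2", "C3", "D4"],
    "chartmetric_ids": [111.0, None, 111.0, 333.0],
})

# Share of rows without an ID, in percent
missing_pct = toy["chartmetric_ids"].isna().mean() * 100

# Number of distinct IDs; nunique() ignores missing values by default
unique_ids = toy["chartmetric_ids"].nunique()

# Rows whose ID was already seen on an earlier row (several ISRCs, same ID)
shared_ids = int(toy["chartmetric_ids"].dropna().duplicated().sum())

print(f"Missing: {missing_pct:.1f}%")
print(f"Unique IDs: {unique_ids}")
print(f"Rows sharing an ID with an earlier row: {shared_ids}")
```

A gap between the number of non-missing rows and the number of unique IDs indicates that several ISRCs map to the same Chartmetric ID, which may or may not be expected for this dataset.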

In [None]:
#using spotify_fetch dataframe, what 