# Synopsis

This Jupyter Notebook is designed to interact with the CDISC Library API to retrieve and process metadata from various CDISC Implementation Guides (IGs). The primary objectives of this notebook are:

1. **Installation and Setup**:
	- Install the `cdisc_library_client` package from PyPI.
	- Import necessary modules and set up the CDISC Library client using an API key stored in the environment variables.

2. **Retrieving Metadata**:
	- Use the CDISC Library client to fetch metadata for different versions of Implementation Guides such as SENDIG, SENDIG-DART, SDTMIG, and SDTMIG-MD.
	- Store the retrieved metadata in a variable named `ig`.

3. **Processing Metadata**:
	- Define functions to extract classes and datasets from the IG metadata.
	- Define a function to retrieve specific codelists from the CDISC Library using provided hrefs.
	- Define a function to rearrange the extracted classes and datasets into a columnar format suitable for creating a pandas DataFrame.

4. **Data Transformation and Output**:
	- Extract classes and datasets from the IG metadata.
	- Flatten the extracted data into a pandas DataFrame.
	- Output the results to CSV files named after the IG, with the suffixes `-datasets.csv` and `-datasets-variables.csv`.

5. **Documentation and Additional Information**:
	- Provide instructions on setting environment variables and additional APIs available in the CDISC Library client.
	- Include notes on using short names for codelists and the expected time for processing.

Overall, this notebook aims to automate the retrieval, processing, and transformation of CDISC IG metadata into a structured format that can be easily analyzed and exported for further use.

Install the CDISC Library client from PyPI, plus pandas and NumPy if they are not already in your Python environment.

In [None]:
%pip install cdisc_library_client
%pip install pandas
%pip install numpy

In [1]:
import os
from cdisc_library_client import CDISCLibraryClient
import pandas as pd

To set a permanent environment variable in user scope using PowerShell (replace `MY_VAR` and `HelloWorld` with `CDISC_LIBRARY_API_KEY` and your API key): `[System.Environment]::SetEnvironmentVariable("MY_VAR", "HelloWorld", "User")`

In [2]:
api_key = os.environ.get("CDISC_LIBRARY_API_KEY")
client = CDISCLibraryClient(api_key=api_key)
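
If the environment variable is not set, `os.environ.get` returns `None`, and the problem may only surface later as a failed request. A minimal sketch of an early check, assuming the variable name `CDISC_LIBRARY_API_KEY` used above; the helper name `require_api_key` is hypothetical and not part of the client library:

```python
import os


def require_api_key(var_name="CDISC_LIBRARY_API_KEY"):
    """Return the API key from the environment, or fail with a clear message."""
    api_key = os.environ.get(var_name)
    if not api_key:
        # Covers both an unset variable (None) and an empty string.
        raise RuntimeError(
            f"Environment variable {var_name} is not set; "
            "set it before creating the CDISC Library client."
        )
    return api_key
```

The client could then be created with `CDISCLibraryClient(api_key=require_api_key())`.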

In [8]:
def extract_classes_datasets(ig):
	"""
	Extracts datasets from the IG classes.

	Args:
		ig (dict): The IG metadata returned by the CDISC Library API.

	Returns:
		dict: A dictionary where keys are class names and values are dictionaries
		with dataset names as keys and dataset details as values.
	"""
	return {
		cls['name']: {
			dataset['name']: {
				'class_name': cls['name'],
				'ordinal': dataset.get('ordinal'),
				'name': dataset.get('name'),
				'label': dataset.get('label'),
				'description': dataset.get('description'),
				'datasetStructure': dataset.get('datasetStructure')
			}
			for dataset in cls.get('datasets', [])
		}
		for cls in ig.get('classes', [])
	}

In [9]:
def extract_datasets_variables(ig):
	"""
	Extracts variables from the IG datasets.

	Args:
		ig (dict): The IG metadata returned by the CDISC Library API.

	Returns:
		dict: A dictionary where keys are class names and values are dictionaries
		with dataset names as keys and lists of variables as values.
	"""
	return {
		cls['name']: {
			dataset['name']: [
				{
					'core': variable.get('core'),
					'description': variable.get('description'),
					'label': variable.get('label'),
					'name': variable.get('name'),
					'ordinal': variable.get('ordinal'),
					'role': variable.get('role'),
					'simpleDatatype': variable.get('simpleDatatype'),
					'codelist': variable.get('_links', {}).get('codelist', [{}])[0].get('href'),
					'valueList': variable.get('valueList'),
					'describedValueDomain': variable.get('describedValueDomain')
				}
				for variable in dataset.get('datasetVariables', [])
			]
			for dataset in cls.get('datasets', [])
		}
		for cls in ig.get('classes', [])
	}
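
The `codelist` entry above relies on a chain of `.get` calls with defaults, so that a variable without a linked codelist yields `None` instead of raising. A small sketch of that pattern on made-up data; the two sample variables are illustrative assumptions about the payload shape, not actual API output:

```python
def codelist_href(variable):
    """Return the first codelist href of a variable, or None if it has none."""
    return variable.get('_links', {}).get('codelist', [{}])[0].get('href')


# Hypothetical variable entries, shaped like the fields read above.
with_codelist = {
    'name': 'SEX',
    '_links': {'codelist': [{'href': '/mdr/root/ct/sdtmct/codelists/C66731'}]},
}
without_codelist = {'name': 'USUBJID', '_links': {}}

print(codelist_href(with_codelist))     # /mdr/root/ct/sdtmct/codelists/C66731
print(codelist_href(without_codelist))  # None
```

Note that the pattern still raises `IndexError` if `codelist` is present but is an empty list.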

In [10]:
def get_codelist_from_rootct_href(href):
	"""
	Retrieves a specific version of a CDISC CT codelist from the CDISC Library using the provided href.

	The CT package version comes from the global `ct_package`, which must be set before this function is called.

	Args:
		href (str): The href of the root codelist, e.g., /mdr/root/ct/sdtmct/codelists/C66742

	Returns:
		dict: The specific version of the codelist retrieved from the CDISC Library.
	"""
	request = f"/mdr/ct/packages/{ct_package}/codelists/{href.split('/')[-1]}"
	codelist = client.get_api_json(request)
	return codelist
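
The request path is built by taking the last path segment of the root href (the C-code) and placing it under a specific CT package. A sketch of that string manipulation in isolation, with the package passed as an argument rather than read from the global `ct_package`; the function name `codelist_request_path` is hypothetical:

```python
def codelist_request_path(href, ct_package):
    """Build the versioned codelist path from a root CT codelist href."""
    c_code = href.rstrip('/').split('/')[-1]  # last path segment, e.g. C66742
    return f"/mdr/ct/packages/{ct_package}/codelists/{c_code}"


path = codelist_request_path(
    "/mdr/root/ct/sdtmct/codelists/C66742", "sdtmct-2024-09-27"
)
print(path)  # /mdr/ct/packages/sdtmct-2024-09-27/codelists/C66742
```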

In [11]:
def create_datasets_table(classes_datasets, ct_shortname=False):
	"""
	Rearrange classes and datasets dictionary into a columnar arrangement.

	Args:
		classes_datasets (dict): A dictionary where keys are class names and values are dictionaries
			with dataset names as keys and lists of variables as values.
		ct_shortname (bool): If True, look up the short name (submission value) of the codelist. Otherwise, use the NCIt C-code parsed from the root CT href.

	Returns:
		pd.DataFrame: A DataFrame containing the flattened data from the classes_datasets dictionary.
	"""
	data = [
		{
			'Class': class_name,
			'Dataset Name': dataset_name,
			'Order': variable['ordinal'],
			'Variable Name': variable['name'],
			'Label': variable['label'],
			'Type': variable['simpleDatatype'],
			'Codelist': get_codelist_from_rootct_href(variable['codelist']).get('submissionValue') if ct_shortname and variable['codelist'] else (variable['codelist'].split('/')[-1] if variable['codelist'] else None),
			'Value list': '; '.join(variable['valueList']) if isinstance(variable['valueList'], list) else variable['valueList'],
			'Format': variable['describedValueDomain'],
			'Role': variable['role'],
			'Notes': variable['description'],
			'Core': variable['core']
		}
		for class_name, datasets in classes_datasets.items()
		for dataset_name, variables in datasets.items()
		for variable in variables
	]
	return pd.DataFrame(data)
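
The nested comprehension above walks three levels (class, dataset, variable) and emits one row per variable. A reduced sketch of the same traversal on made-up data, without pandas or API calls, including the handling of a list-valued `valueList`; the sample content is illustrative only:

```python
# Hypothetical input, shaped like the output of extract_datasets_variables.
sample = {
    'Special-Purpose': {
        'DM': [
            {'name': 'STUDYID', 'valueList': None},
            {'name': 'DOMAIN', 'valueList': ['DM']},
        ],
    },
    'Findings': {
        'VS': [{'name': 'VSPOS', 'valueList': ['SITTING', 'STANDING']}],
    },
}

rows = [
    {
        'Class': class_name,
        'Dataset Name': dataset_name,
        'Variable Name': variable['name'],
        # Join list values into one cell; pass anything else through unchanged.
        'Value list': '; '.join(variable['valueList'])
        if isinstance(variable['valueList'], list)
        else variable['valueList'],
    }
    for class_name, datasets in sample.items()
    for dataset_name, variables in datasets.items()
    for variable in variables
]

for row in rows:
    print(row)
```

Each dictionary in `rows` corresponds to one row of the resulting table, which is why the list can be passed directly to `pd.DataFrame`.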

In [15]:
# SDTMIG v3.4
ig = client.get_sdtmig(version="3-4")
ct_package = "sdtmct-2024-09-27"

Additional CDISC Library client API request examples are provided below. You can automate further using the `/mdr/product` endpoint. Refer to the client's GitHub repo for details: https://github.com/cdisc-org/cdisc-library-client

In [19]:
# SENDIG v3.1.1
ig = client.get_sendig(version="3-1-1")
ct_package = "sendct-2024-09-27"

In [5]:
# SENDIG-DART v1.1
ig = client.get_sendig(version="dart-1-1")
ct_package = "sendct-2024-09-27"

In [23]:
# SDTMIG v3.2
ig = client.get_sdtmig(version="3-2")
ct_package = "sdtmct-2024-09-27"

In [27]:
# SDTMIG v3.3
ig = client.get_sdtmig(version="3-3")
ct_package = "sdtmct-2024-09-27"

In [7]:
# SDTMIG-MD v1.1
ig = client.get_sdtmig(version="md-1-1")
ct_package = "sdtmct-2024-09-27"
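
Each cell above sets the same two variables, so only the last one executed takes effect. One way to make the choice explicit is a lookup table keyed by a label. The sketch below copies the method names, versions, and CT packages from the cells above; the `IG_CHOICES` table and the `load_ig` helper are hypothetical, and the stub client exists only so the example runs without an API key:

```python
# (client method name, version argument, CT package) per Implementation Guide,
# copied from the notebook cells above.
IG_CHOICES = {
    "SDTMIG v3.4": ("get_sdtmig", "3-4", "sdtmct-2024-09-27"),
    "SDTMIG v3.3": ("get_sdtmig", "3-3", "sdtmct-2024-09-27"),
    "SDTMIG v3.2": ("get_sdtmig", "3-2", "sdtmct-2024-09-27"),
    "SDTMIG-MD v1.1": ("get_sdtmig", "md-1-1", "sdtmct-2024-09-27"),
    "SENDIG v3.1.1": ("get_sendig", "3-1-1", "sendct-2024-09-27"),
    "SENDIG-DART v1.1": ("get_sendig", "dart-1-1", "sendct-2024-09-27"),
}


def load_ig(client, label):
    """Fetch the IG for a label and return it with its CT package name."""
    method_name, version, ct_package = IG_CHOICES[label]
    ig = getattr(client, method_name)(version=version)
    return ig, ct_package


class StubClient:
    """Stand-in for the real client; returns a placeholder, makes no request."""

    def get_sdtmig(self, version):
        return {"name": f"stub sdtmig {version}"}

    def get_sendig(self, version):
        return {"name": f"stub sendig {version}"}


ig, ct_package = load_ig(StubClient(), "SENDIG v3.1.1")
print(ig["name"], ct_package)  # stub sendig 3-1-1 sendct-2024-09-27
```

With the real client, the call would be `ig, ct_package = load_ig(client, "SDTMIG v3.4")`.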

1. Retrieve the metadata from the CDISC Library using the client
1. Arrange it to resemble the specification tables in the SDTMIG PDF
1. Output to a flat file

In [28]:
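import re  # used to drop only the "v" that directly precedes the version number, not every "v" in the name
filename_prefix = re.sub(r"-v(?=\d)", "-", ig["name"].lower().replace(" ", "-").replace(".", "-"))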

Output dataset metadata

In [29]:
# Extract classes_datasets
classes_datasets = extract_classes_datasets(ig)

# Flatten the classes_datasets dictionary and convert it to a DataFrame
flattened_data = [
	{**details}
	for class_name, datasets in classes_datasets.items()
	for dataset_name, details in datasets.items()
]

# Convert the flattened data to a DataFrame and output as CSV
filename = f"{filename_prefix}-datasets.csv"
pd.DataFrame(flattened_data).to_csv(filename, index=False)

Output variable metadata, mimicking the domain table specification layout. 

Set `ct_shortname` to `True` to use the short name of the codelist instead of the NCIt C-code. This process will take approximately two minutes, as it includes an additional step to look up each submission value from a C-code.

In [30]:
# Extract datasets variables
datasets_variables = extract_datasets_variables(ig)

# Flatten datasets variables to a columnar arrangement
df = create_datasets_table(datasets_variables, ct_shortname=False)

# Output as CSV
filename = f"{filename_prefix}-datasets-variables.csv"
df.to_csv(filename, index=False)
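
With `ct_shortname=True`, the same codelist is looked up once for every variable that references it. A sketch of memoizing those lookups with `functools.lru_cache`, which may reduce the number of requests; `fetch_codelist` is a hypothetical stand-in for the real API call, and the returned value is a placeholder so the example runs offline:

```python
from functools import lru_cache

calls = []  # records each path for which the (stand-in) API call actually runs


def fetch_codelist(request_path):
    """Hypothetical stand-in for client.get_api_json(request_path)."""
    calls.append(request_path)
    return {"submissionValue": f"placeholder for {request_path}"}


@lru_cache(maxsize=None)
def cached_submission_value(request_path):
    """Return the submission value, fetching each distinct path only once."""
    return fetch_codelist(request_path).get("submissionValue")


paths = [
    "/mdr/ct/packages/sdtmct-2024-09-27/codelists/C66742",
    "/mdr/ct/packages/sdtmct-2024-09-27/codelists/C66742",
    "/mdr/ct/packages/sdtmct-2024-09-27/codelists/C66731",
]
values = [cached_submission_value(p) for p in paths]
print(len(calls))  # 2: the repeated path is served from the cache
```

The cache key here is the full request path, so switching `ct_package` produces distinct entries rather than stale results.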