# Creating a Static Hierarchical Data Structure (JSON file)

This notebook takes the in-house ChemDict Smartsheet and converts it into a JSON file that JavaScript can read easily, for the purpose of creating the Hierarchical Edge Bundling viz.

The final output can then be loaded into another IDE, where HTML, JS, and D3 can be used to build a hierarchical viz of your liking.

## Step 1: Clean up & standardize category data (from chemical dictionary)

Load in the newest version of the ChemDict using the following code. This gets direct access to the Smartsheet. 

In [None]:
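import os

import smartsheet
import pandas as pd

# Initialize client. Read the API token from an environment variable
# (here SMARTSHEET_ACCESS_TOKEN) instead of hard-coding it in the notebook.
smartsheet_client = smartsheet.Smartsheet(os.environ['SMARTSHEET_ACCESS_TOKEN'])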

# Specify the sheet ID
sheet_id = '8604420338044804'

# Load entire sheet
sheet = smartsheet_client.Sheets.get_sheet(sheet_id)

# Convert sheet to DataFrame
columns = [col.title for col in sheet.columns]
rows = []
for row in sheet.rows:
    row_data = []
    for cell in row.cells:
        row_data.append(cell.value)
    rows.append(row_data)

chemdict = pd.DataFrame(rows, columns=columns)

# Save DataFrame to CSV
chemdict.to_csv('chemdict.csv', index=False)

chemdict

Unnamed: 0,substance_ID,assignment,substances,PubChemCID,CAS,synonyms,chemical,pharmco,street_smarts,trending,...,status,R1 complete,tags,opioid,stimulant,psychedelic,cannabinoid,sedative,steroid,category
0,1.0,Nab,"1,3-Diacetin",66924.0,105-70-4,"glyceryl diacetate;2-Hydroxypropane-1,3-diyl d...",fentanyl impurity,"human effects uncertain, inert","cut, flavor, non-toxic","established, uncommon",...,finalize,done,fentanyl impurity;human effects uncertain;iner...,,,,,,,other
1,2.0,Nab,"1,4-Butanediol",8064.0,110-63-4,,"solvent, synthetic",,downer,,...,finalize,done,solvent;synthetic;downer;GHB impurity;night life;,,,,,,,other
2,3.0,Nab,1-2-propanol,,,"3,4-Methylenedioxyphenyl",,human effects uncertain,,,...,to do,to do,human effects uncertain;,,,,,,,
3,4.0,Nab,1-[methyl]cyclopentanol,73830.0,,1-Methylcyclopentanol,"impurity, ketamine impurity, synthetic",human effects uncertain,,uncommon,...,finalize,done,impurity;ketamine impurity;synthetic;human eff...,,,,,,,other
4,5.0,Nab,1-Boc-4-piperidine,,,N-BOC-piperidine-4-carboxylic acid,,human effects uncertain,,,...,to do,to do,human effects uncertain;,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
266,269.0,Erin,vitamin E acetate,86472.0,58-95-7,,vitamin,,,"concern, emerging",...,to do,to do,vitamin;concern;emerging;,,,,,,,
267,270.0,Erin,xylazine,5707.0,7361-61-7,,"alpha-2 agonist, synthetic","anesthetic, dissociative anesthetic, sedative","active agent, downer, up-and-down, veterinary","concern, established",...,tagging done,to do,alpha-2 agonist;synthetic;anesthetic;dissociat...,,,,,sedative,,sedative
268,271.0,Nate,zolpidem,,,"Ambien, Ambien CR",synthetic,sedative,"active agent, downer",established,...,tagging done,done,synthetic;sedative;active agent;downer;establi...,,,,,sedative,,sedative
269,272.0,Erin,α±-Ethylaminopentiophenone,,,,,human effects uncertain,,,...,to do,to do,human effects uncertain;,,,,,,,
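
The sheet-to-DataFrame conversion above can be tried without network access. The sketch below is only an illustration: `fake_sheet` is a hypothetical stand-in built from `SimpleNamespace` objects that mimic just the attributes the loop relies on (`columns`, `title`, `rows`, `cells`, `value`), and `sheet_to_frame` is a helper name introduced here, not part of the Smartsheet SDK.

```python
from types import SimpleNamespace as NS

import pandas as pd

# Hypothetical stand-in for a Smartsheet sheet object (not real SDK objects).
fake_sheet = NS(
    columns=[NS(title="substances"), NS(title="category")],
    rows=[
        NS(cells=[NS(value="xylazine"), NS(value="sedative")]),
        NS(cells=[NS(value="1-2-propanol"), NS(value=None)]),
    ],
)


def sheet_to_frame(sheet) -> pd.DataFrame:
    """Collect column titles and cell values into a DataFrame."""
    columns = [col.title for col in sheet.columns]
    rows = [[cell.value for cell in row.cells] for row in sheet.rows]
    return pd.DataFrame(rows, columns=columns)


toy_chemdict = sheet_to_frame(fake_sheet)
print(toy_chemdict)
```

Empty cells come through as missing values, which is why the category column needs a fill step later on.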


Remove unnecessary information from the dictionary. 

In [None]:
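# Select the 'substances' and 'category' columns.
# .copy() gives an independent DataFrame, so later assignments do not
# raise SettingWithCopyWarning.
categories = chemdict[['substances', 'category']].copy()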

categories

Unnamed: 0,substances,category
0,"1,3-Diacetin",other
1,"1,4-Butanediol",other
2,1-2-propanol,
3,1-[methyl]cyclopentanol,other
4,1-Boc-4-piperidine,
...,...,...
266,vitamin E acetate,
267,xylazine,sedative
268,zolpidem,sedative
269,α±-Ethylaminopentiophenone,


For this visualization, each substance should have exactly one category. This may change as the dictionary is developed further.

In [None]:
# Replace doubled categories with one of them
categories['category'] = categories['category'].replace({
    r'opioid,.*': 'opioid',
    r'cannabinoid,sedative': 'cannabinoid',
    r'psychedelic,stimulant': 'psychedelic',
    r'psychedelic,sedative': 'psychedelic',
    r'sedative,psychedelic': 'psychedelic'
}, regex=True)

# Replace missing values in the 'category' column with 'notcat' (not categorized)
categories['category'] = categories['category'].fillna('notcat')
categories


Unnamed: 0,substances,category
0,"1,3-Diacetin",other
1,"1,4-Butanediol",other
2,1-2-propanol,notcat
3,1-[methyl]cyclopentanol,other
4,1-Boc-4-piperidine,notcat
...,...,...
266,vitamin E acetate,notcat
267,xylazine,sedative
268,zolpidem,sedative
269,α±-Ethylaminopentiophenone,notcat
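
To see what the collapse-and-fill step does in isolation, here is a minimal sketch on made-up data. The frame `toy` and its values are hypothetical; only the pattern (regex replacement, then `fillna`) mirrors the cell above.

```python
import pandas as pd

# Hypothetical toy data; the real values come from the ChemDict 'category' column.
toy = pd.DataFrame({
    "substances": ["A", "B", "C", "D"],
    "category": ["opioid,sedative", "psychedelic,stimulant", "sedative", None],
})

# Work on an explicit copy so the assignments below modify a real DataFrame.
toy_categories = toy[["substances", "category"]].copy()

# Collapse multi-valued categories to a single one.
toy_categories["category"] = toy_categories["category"].replace(
    {r"opioid,.*": "opioid", r"psychedelic,stimulant": "psychedelic"},
    regex=True,
)

# Anything still missing is labelled 'notcat'.
toy_categories["category"] = toy_categories["category"].fillna("notcat")
print(toy_categories)
```

Assigning the result back, rather than using `inplace=True` on a selected column, avoids modifying a temporary object.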


Make some manual changes to individual substance categories.

In [None]:
# Update the categories for the specified substances
substances_to_update = {
    '1-Boc-4-piperidine': 'opioid',
    'bipiperidinyl 4-ANPP': 'opioid',
    '3-methoxy-PCE': 'psychedelic',
    '4-acetoxy DMT': 'psychedelic',
    '5-methoxy NiPT': 'psychedelic',
    'norketamine': 'psychedelic',
    'mescaline': 'psychedelic', 
    '2-fluoro deschloroketamine' : 'psychedelic',
    'deschloroketamine' : 'psychedelic', 
    
}

for substance, new_category in substances_to_update.items():
    categories.loc[categories['substances'] == substance, 'category'] = new_category

categories

Unnamed: 0,substances,category
0,"1,3-Diacetin",other
1,"1,4-Butanediol",other
2,1-2-propanol,notcat
3,1-[methyl]cyclopentanol,other
4,1-Boc-4-piperidine,opioid
...,...,...
266,vitamin E acetate,notcat
267,xylazine,sedative
268,zolpidem,sedative
269,α±-Ethylaminopentiophenone,notcat
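
One possible alternative to the loop above is a vectorized `map`, combined with a check for override keys that match nothing (a misspelled name would otherwise be ignored silently). This is a sketch on hypothetical data, not a replacement the notebook requires.

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "substances": ["mescaline", "norketamine", "caffeine"],
    "category": ["notcat", "notcat", "stimulant"],
})
overrides = {
    "mescaline": "psychedelic",
    "norketamine": "psychedelic",
    "deschloroketamine": "psychedelic",  # not present in the toy frame
}

# map() yields NaN where no override exists; fillna() keeps the old category there.
toy["category"] = toy["substances"].map(overrides).fillna(toy["category"])

# Override keys that did not match any row, worth reviewing for typos.
unmatched = set(overrides) - set(toy["substances"])
print(toy)
print(unmatched)
```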


Remove substances that you do not want in your viz.

In [None]:
# List of substances to remove
substances_to_remove = [
    'cannabigerol', '1-2-propanol', 'delta-8-THC acetate', 'delta-9-THC acetate', 'delta-9-THCP', 'ecgonine methylester',
    'ibogamine', 'methoxisopropamine', 'methyl salicylate', 'N-thebaol', 'N-acetyl 2C-B', 'N-ethyl hexedrone',
    'N-methyltryptamine', 'no compounds of interest detected', 'non-specific hydrocarbon', 'non-specific organic acid',
    'non-specific organic acids', 'pending nitazene', 'vitamin E acetate', 'α±-Ethylaminopentiophenone',
    'α±-Pyrrolidinoisohexanophenone', '1-phenethyl-4-hydroxypiperidine', '1-propionyl-4-anilinopiperidine',
    '3-monoacetylmorphine', '4-anilinopiperidine', 'despropionyl ortho-chlorofentanyl', 'despropionyl p-fluorofentanyl',
    'naloxone', 'naltrexone', 'narceine', 'O-desmethyl-cis-tramadol', 'paynantheine', '1-[methyl]cyclopentanol',
    '1-phenethyl-4-propionyloxypiperidine', '1-phenyl-1-propanamine', '1-phenyl-2-propanol', '1-piperidino-1-cyclohexene',
    '1-Piperidinocyclohexanecarbonitrile', '4-benzylpyrimidine', 'alpha-benzyl-N-methylphenethylamine', 'aminopyrine',
    'ethyl vanillin', 'N-benzylcyclohexanamine', 'N-butyl-aniline', 'N-formylmethamphetamine', 'N-isopropylbenzylamine',
    'N-methyl-cyclohexanamine', 'N-phenylacetyl-L-prolylglycine ethyl ester', 'non-specific phthalate', 'non-specific sugar',
    'non-specific sugars', 'phenethyl chloride', 'tianeptine metabolite', 'urea', 'clomiphene', '3,4-methylenedioxy-N-benzylcathinone',
    '3,4-Methylenedioxy-α±-Cyclohexylaminopropiophenone', '4-methyl-5-phenylpyrimide'
]



# Remove the specified substances from the dataframe.
# .copy() avoids SettingWithCopyWarning when columns are reassigned later.
categories = categories[~categories['substances'].isin(substances_to_remove)].copy()
categories

Unnamed: 0,substances,category
0,"1,3-Diacetin",other
1,"1,4-Butanediol",other
4,1-Boc-4-piperidine,opioid
12,2-fluoro deschloroketamine,psychedelic
13,2-Fluoro-2-oxo PCE,psychedelic
...,...,...
263,venlafaxine,other
264,vitamin D3,other
265,vitamin E,other
267,xylazine,sedative


Rename substances here so that they match the names you want displayed in the viz.

In [None]:
# Replace specified substances with their new names
categories['substances'] = categories['substances'].replace({
    'gamma-butyrolactone': 'GBL',
    'gamma-hydroxybutyrate': 'GHB',
    '3-Methoxy-PCP': '3-methoxy-PCP',
    '3-chlorophenmetrazine': '3-CPM',
    '4-Fluoromethylphenidate': '4-fluoromethylphenidate', 
    'mitragynine': 'kratom', 
    '2-Fluoro-2-oxo PCE' : '2-fluoro-2-oxo PCE'
})

categories


Unnamed: 0,substances,category
0,"1,3-Diacetin",other
1,"1,4-Butanediol",other
4,1-Boc-4-piperidine,opioid
12,2-fluoro deschloroketamine,psychedelic
13,2-fluoro-2-oxo PCE,psychedelic
...,...,...
263,venlafaxine,other
264,vitamin D3,other
265,vitamin E,other
267,xylazine,sedative


## Step 2: Clean up / Standardize lab samples data

Load in lab.dta, which lists every substance detected in every sample as it comes into the lab.
"sampleid" and "substance" have a many-to-many relationship.

In [None]:
import pandas as pd

# Load the lab.dta file as a DataFrame
lab_df = pd.read_stata('lab.dta')
lab_df

Unnamed: 0,sampleid,substance,abundance,method,peak,date_complete
0,200236,fentanyl,,GCMS,9.28,2022-01-26
1,200236,heroin,,GCMS,8.96,2022-01-26
2,200236,4-ANPP,,GCMS,8.42,2022-01-26
3,200236,xylazine,,GCMS,6.96,2022-01-26
4,200236,acetylcodeine,,GCMS,8.58,2022-01-26
...,...,...,...,...,...,...
20985,804413,xylazine,,GCMS,7.76,2024-06-10
20986,804413,4-ANPP,,GCMS,9.43,2024-06-10
20987,804413,fentanyl,,GCMS,10.70,2024-06-10
20988,804397,methamphetamine,,GCMS,4.14,2024-06-10


There are some discrepancies in how the lab.dta file and the ChemDict file list their substances. These discrepancies are cleaned up here to fit the ChemDict standard.

In [None]:
# Remove synonym values in parentheses from the 'substance' column
lab_df['substance'] = lab_df['substance'].str.replace(r'\s*\(.*\)', '', regex=True)


# Correct the names in the lab_df DataFrame
lab_df['substance'] = lab_df['substance'].replace({
    '1--2-propanol': '1-2-propanol',
    '': '3,4-Methylenedioxy-α±-Cyclohexylaminopropiophenone',
    '1-Boc-4--piperidine': '1-Boc-4-piperidine',
    'N--thebaol': 'N-thebaol',
    'N-benzyl-N-cyclohexylamine': 'N-benzylcyclohexanamine',
    'Psilocybin / Psilocin': 'psilocin',
    'ephedrine/pseudoephedrine': 'pseudoephedrine',
    'α-Ethylaminopentiophenon': 'α±-Ethylaminopentiophenone',
    'α-Ethylaminopentiophenone': 'α±-Ethylaminopentiophenone',
    'α-Pyrrolidinoisohexanophenone': 'α±-Pyrrolidinoisohexanophenone',
    'phenethylbromide': 'phenethyl bromide',
    'ethyl 4-ANPP':'ethyl-4-ANPP', 
    'gamma-butyrolactone': 'GBL',
    'gamma-hydroxybutyrate': 'GHB',
    '3-Methoxy-PCP': '3-methoxy-PCP',
    '3-chlorophenmetrazine': '3-CPM',
    '4-Fluoromethylphenidate': '4-fluoromethylphenidate', 
    'mitragynine': 'kratom', 
    '2-Fluoro-2-oxo PCE' : '2-fluoro-2-oxo PCE'

})


lab_df

Unnamed: 0,sampleid,substance,abundance,method,peak,date_complete
0,200236,fentanyl,,GCMS,9.28,2022-01-26
1,200236,heroin,,GCMS,8.96,2022-01-26
2,200236,4-ANPP,,GCMS,8.42,2022-01-26
3,200236,xylazine,,GCMS,6.96,2022-01-26
4,200236,acetylcodeine,,GCMS,8.58,2022-01-26
...,...,...,...,...,...,...
20985,804413,xylazine,,GCMS,7.76,2024-06-10
20986,804413,4-ANPP,,GCMS,9.43,2024-06-10
20987,804413,fentanyl,,GCMS,10.70,2024-06-10
20988,804397,methamphetamine,,GCMS,4.14,2024-06-10
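
The two normalization steps (strip parenthesised synonyms, then rename exact matches) can be checked on a handful of names. The input strings below are invented for illustration; the regex and the rename pattern are the ones used in the cell above.

```python
import pandas as pd

# Hypothetical raw names, invented for illustration.
raw = pd.Series(
    ["fentanyl (fent)", "gamma-hydroxybutyrate", "phenethylbromide", "heroin"]
)

# 1. Drop everything from the first '(' to the last ')' plus preceding whitespace.
stripped = raw.str.replace(r"\s*\(.*\)", "", regex=True)

# 2. Map remaining variants onto the dictionary spelling (whole-string matches only).
renames = {"gamma-hydroxybutyrate": "GHB", "phenethylbromide": "phenethyl bromide"}
normalized = stripped.replace(renames)
print(normalized.tolist())
```

Because the pattern is greedy, a name with two separate parenthesised parts would lose everything between the first opening and the last closing parenthesis, so it is worth spot-checking names that contain parentheses in the middle.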


Remove the same substances that were dropped from the chemical dictionary in Step 1 (you won't be able to use these values).

In [None]:
lab_df = lab_df[~lab_df['substance'].isin(substances_to_remove)]
lab_df

Unnamed: 0,sampleid,substance,abundance,method,peak,date_complete
0,200236,fentanyl,,GCMS,9.28,2022-01-26
1,200236,heroin,,GCMS,8.96,2022-01-26
2,200236,4-ANPP,,GCMS,8.42,2022-01-26
3,200236,xylazine,,GCMS,6.96,2022-01-26
4,200236,acetylcodeine,,GCMS,8.58,2022-01-26
...,...,...,...,...,...,...
20985,804413,xylazine,,GCMS,7.76,2024-06-10
20986,804413,4-ANPP,,GCMS,9.43,2024-06-10
20987,804413,fentanyl,,GCMS,10.70,2024-06-10
20988,804397,methamphetamine,,GCMS,4.14,2024-06-10


Check if there are values in the lab samples that are not in the ChemDict. Our viz will only work if there is a category for every substance found in the lab. Start by getting the list of unique substances in the lab samples data frame. 

In [None]:
# Create a DataFrame with the unique substances
unique_substances_df = pd.DataFrame(lab_df['substance'].unique(), columns=['substance'])
unique_substances_df

Unnamed: 0,substance
0,fentanyl
1,heroin
2,4-ANPP
3,xylazine
4,acetylcodeine
...,...
224,2-methylmethcathinone
225,N-phenethyl-N-phenylpropionamide
226,4-piperidone
227,p-fluoro norfentanyl


Return the values that occur in the lab samples but do not exist in the ChemDict. You'll want to remove these values from the lab samples DF.

In [None]:
# Which substances are in the lab data but not in categories?
missing_substances = set(unique_substances_df['substance']) - set(categories['substances'])
missing_substances

{'1,2-Dibromo-4,5-methylenedioxybenzene',
 '2-methylmethcathinone',
 '2-phenylacetamide',
 '3,4-Methylenedioxy-α-Cyclohexylaminopropiophenone',
 '3,4-methylenedioxy-N,N-dimethylamphetamine',
 '3,4-methylenedioxypropiophenone',
 '4-bromo-2,5-Dimethoxyamphetamine',
 '4-piperidone',
 'Fluoroamphetamine',
 'N,N-diamine',
 'N-phenethyl-N-phenylpropionamide',
 'N-pyrrolidino isotonitazene',
 'cyclohexylamine',
 'deschloroetizolam',
 'p-fluoro 4-anilinopiperidine',
 'p-fluoro norfentanyl',
 'p-fluoroacetylfentanyl'}
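
The check above is a plain set difference, and it is directional: swapping the operands answers the opposite question. A small sketch with hypothetical names:

```python
# Hypothetical name sets for illustration.
lab_names = {"fentanyl", "heroin", "4-piperidone"}
dict_names = {"fentanyl", "heroin", "p-fluoro 4-ANPP"}

# In the lab data but missing from the dictionary: these rows must be dropped.
only_in_lab = lab_names - dict_names

# In the dictionary but never seen in the lab: harmless for the viz.
only_in_dict = dict_names - lab_names

print(sorted(only_in_lab))
print(sorted(only_in_dict))
```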

Return all substances except for the ones in the above object. You should see a reduction in the number of rows. 

In [None]:
# .copy() so that adding the 'imports' column later does not raise SettingWithCopyWarning
lab_cleaned = lab_df[~lab_df.substance.isin(missing_substances)].copy()
lab_cleaned

Unnamed: 0,sampleid,substance,abundance,method,peak,date_complete
0,200236,fentanyl,,GCMS,9.28,2022-01-26
1,200236,heroin,,GCMS,8.96,2022-01-26
2,200236,4-ANPP,,GCMS,8.42,2022-01-26
3,200236,xylazine,,GCMS,6.96,2022-01-26
4,200236,acetylcodeine,,GCMS,8.58,2022-01-26
...,...,...,...,...,...,...
20985,804413,xylazine,,GCMS,7.76,2024-06-10
20986,804413,4-ANPP,,GCMS,9.43,2024-06-10
20987,804413,fentanyl,,GCMS,10.70,2024-06-10
20988,804397,methamphetamine,,GCMS,4.14,2024-06-10


As a sanity check, show the list of unique substances in the new lab samples data frame. The number of unique substances should be lower than before.

In [None]:
# Create a DataFrame with the unique substances
unique_substances_df = pd.DataFrame(lab_cleaned['substance'].unique(), columns=['substance'])
unique_substances_df

Unnamed: 0,substance
0,fentanyl
1,heroin
2,4-ANPP
3,xylazine
4,acetylcodeine
...,...
207,etaqualone
208,ibogaine
209,flubromazepam
210,norketamine


## Step 3: Create the hierarchical format

Finally, we can convert our lab samples data into a list of unique substances and their co-occurrences (what is found alongside each substance in samples). First, create the imports column and populate it with whatever else is found in that same sample. Do not include the substance itself.

In [None]:
# Create the 'imports' column
lab_cleaned['imports'] = lab_cleaned.groupby('sampleid')['substance'].transform(lambda x: x.apply(lambda y: list(x[x != y])))

lab_cleaned


Unnamed: 0,sampleid,substance,abundance,method,peak,date_complete,imports
0,200236,fentanyl,,GCMS,9.28,2022-01-26,"[heroin, 4-ANPP, xylazine, acetylcodeine]"
1,200236,heroin,,GCMS,8.96,2022-01-26,"[fentanyl, 4-ANPP, xylazine, acetylcodeine]"
2,200236,4-ANPP,,GCMS,8.42,2022-01-26,"[fentanyl, heroin, xylazine, acetylcodeine]"
3,200236,xylazine,,GCMS,6.96,2022-01-26,"[fentanyl, heroin, 4-ANPP, acetylcodeine]"
4,200236,acetylcodeine,,GCMS,8.58,2022-01-26,"[fentanyl, heroin, 4-ANPP, xylazine]"
...,...,...,...,...,...,...,...
20985,804413,xylazine,,GCMS,7.76,2024-06-10,"[caffeine, diphenhydramine, 4-ANPP, fentanyl]"
20986,804413,4-ANPP,,GCMS,9.43,2024-06-10,"[caffeine, diphenhydramine, xylazine, fentanyl]"
20987,804413,fentanyl,,GCMS,10.70,2024-06-10,"[caffeine, diphenhydramine, xylazine, 4-ANPP]"
20988,804397,methamphetamine,,GCMS,4.14,2024-06-10,[]


Remove the repetitions. In other words, each sample should only have 1 substance with imports values. All other substances within that sample ID should have an empty imports column. 

In [None]:
# Function to clean imports column by keeping only the first substance and replacing others with an empty list
def clean_imports(df):
    df['imports'] = df.groupby('sampleid')['substance'].transform(lambda x: [[] if idx != 0 else x.iloc[1:].tolist() for idx in range(len(x))])
    return df

# Apply the function to the DataFrame
lab_cleaned = clean_imports(lab_cleaned)

lab_cleaned


Unnamed: 0,sampleid,substance,abundance,method,peak,date_complete,imports
0,200236,fentanyl,,GCMS,9.28,2022-01-26,"[heroin, 4-ANPP, xylazine, acetylcodeine]"
1,200236,heroin,,GCMS,8.96,2022-01-26,[]
2,200236,4-ANPP,,GCMS,8.42,2022-01-26,[]
3,200236,xylazine,,GCMS,6.96,2022-01-26,[]
4,200236,acetylcodeine,,GCMS,8.58,2022-01-26,[]
...,...,...,...,...,...,...,...
20985,804413,xylazine,,GCMS,7.76,2024-06-10,[]
20986,804413,4-ANPP,,GCMS,9.43,2024-06-10,[]
20987,804413,fentanyl,,GCMS,10.70,2024-06-10,[]
20988,804397,methamphetamine,,GCMS,4.14,2024-06-10,[]
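
The end result of the two cells above (first row of each sample lists the other substances, every later row gets an empty list) can also be produced in a single pass. The following is a sketch on hypothetical data, with the index used for alignment so that it does not depend on rows of one sample being adjacent.

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy_lab = pd.DataFrame({
    "sampleid": [1, 1, 1, 2],
    "substance": ["fentanyl", "heroin", "xylazine", "methamphetamine"],
})

parts = []
for _, group in toy_lab.groupby("sampleid")["substance"]:
    names = group.tolist()
    # First row of the sample gets the remaining substances; the rest get [].
    values = [names[1:]] + [[] for _ in names[1:]]
    parts.append(pd.Series(values, index=group.index))

# Concatenated pieces align with toy_lab on the original row index.
toy_lab["imports"] = pd.concat(parts)
print(toy_lab)
```

Storing each co-occurrence once per sample keeps the later aggregation from counting the same pair from both sides.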


Now convert the lab samples dataframe into a list of unique substances found in it. Add a size column that indicates how many times that substance was found in the substance column. 

In [None]:
# Group by 'substance' and aggregate the data
new_df = lab_cleaned.groupby('substance').agg(
    size=('substance', 'size'),
    imports=('imports', lambda x: [item for sublist in x for item in sublist])
).reset_index()

# Rename the columns
new_df.columns = ['name', 'size', 'imports']


new_df

Unnamed: 0,name,size,imports
0,"1,3-Diacetin",96,"[procaine, 4-ANPP, fentanyl, phenethyl 4-ANPP,..."
1,"1,4-Butanediol",5,[]
2,1-Boc-4-piperidine,3,[]
3,2-fluoro deschloroketamine,2,[]
4,2-fluoro-2-oxo PCE,57,"[xylazine, procaine, cocaine, 4-ANPP, heroin, ..."
...,...,...,...
207,venlafaxine,2,"[amphetamine, pseudoephedrine]"
208,vitamin D3,1,[vitamin E]
209,vitamin E,1,[]
210,xylazine,1019,"[heroin, fentanyl, caffeine, 4-ANPP, lidocaine..."
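
A small worked example of the aggregation, on hypothetical rows, shows how `size` counts rows per substance while `imports` flattens the per-row lists. Unlike the cell above, this sketch counts rows via the `imports` column and renames with `rename`; both are choices made here for illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "substance": ["fentanyl", "heroin", "fentanyl", "xylazine"],
    "imports": [["heroin"], [], ["xylazine"], []],
})

summary = (
    toy.groupby("substance")
    .agg(
        # Number of rows (detections) per substance.
        size=("imports", "size"),
        # Flatten the list of lists into one list per substance.
        imports=("imports", lambda lists: [name for sub in lists for name in sub]),
    )
    .reset_index()
    .rename(columns={"substance": "name"})
)
print(summary)
```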


## Step 4: Add in the categories

First, another sanity check: are there values in ChemDict that are not in new_df? Only one, so no biggie! We won't need it anyway.

In [None]:
# One substance in categories is not found in new_df (not an issue)
set(categories['substances']) - set(new_df['name']) 

{'p-fluoro 4-ANPP'}

Using the values from categories, prefix every substance name in new_df (in both the name and imports columns) with its category, in the form substance.CATEGORY.NAME. This lets us store more information in this flat dataset.

In [None]:
# Create a dictionary for quick lookup of categories
category_dict = dict(zip(categories['substances'], categories['category']))

# Function to update the substance names with their categories
def update_substance_name(substance):
    category = category_dict.get(substance, 'notcat')
    return f"substance.{category}.{substance}"

# Update the 'name' column in new_df
new_df['name'] = new_df['name'].apply(update_substance_name)

# Update the 'imports' column in new_df
new_df['imports'] = new_df['imports'].apply(lambda imports: [update_substance_name(substance) for substance in imports])

new_df

Unnamed: 0,name,size,imports
0,"substance.other.1,3-Diacetin",96,"[substance.other.procaine, substance.opioid.4-..."
1,"substance.other.1,4-Butanediol",5,[]
2,substance.opioid.1-Boc-4-piperidine,3,[]
3,substance.psychedelic.2-fluoro deschloroketamine,2,[]
4,substance.psychedelic.2-fluoro-2-oxo PCE,57,"[substance.sedative.xylazine, substance.other...."
...,...,...,...
207,substance.other.venlafaxine,2,"[substance.stimulant.amphetamine, substance.st..."
208,substance.other.vitamin D3,1,[substance.other.vitamin E]
209,substance.other.vitamin E,1,[]
210,substance.sedative.xylazine,1019,"[substance.opioid.heroin, substance.opioid.fen..."
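
The naming scheme can be exercised on its own. In this sketch the lookup table and the helper names (`qualified_name`, `split_name`) are hypothetical, and it assumes the downstream D3 code recovers the hierarchy by splitting names on ".".

```python
# Hypothetical lookup; the notebook builds the real one from the categories frame.
category_by_name = {"fentanyl": "opioid", "xylazine": "sedative"}


def qualified_name(substance: str, lookup: dict, default: str = "notcat") -> str:
    """Build 'substance.<category>.<name>', falling back to a default category."""
    return f"substance.{lookup.get(substance, default)}.{substance}"


def split_name(qualified: str) -> tuple:
    """Recover (category, name); maxsplit=2 keeps any later dots in the name."""
    _, category, name = qualified.split(".", 2)
    return category, name


full = qualified_name("fentanyl", category_by_name)
unknown = qualified_name("mystery", category_by_name)
print(full, unknown, split_name(full))
```

If the consumer splits on every dot rather than the first two, a substance name that itself contains a period would be read as an extra hierarchy level, so such names are worth checking before export.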


## Step 5: Export as a JSON file

Last sanity check: Check that every occurrence of a substance in imports column has its own row. The output should be an empty set. If it's not, your viz will throw an error. 

In [None]:
# Extract all unique substances from the 'imports' column
imported_substances = set(substance for sublist in new_df['imports'] for substance in sublist)

# Extract all substances from the 'name' column
existing_substances = set(new_df['name'])

# Find substances that are in 'imports' but not in 'name'
missing_substances = imported_substances - existing_substances

missing_substances

set()

final_df.json will be downloadable from your files in Deepnote. This file can be plugged into the code that creates the viz!

In [None]:
import json

# Convert the DataFrame to JSON format
final_df_json = new_df.to_json(orient='records')

# Write the JSON to a file for inspection
with open('final_df.json', 'w') as f:
    f.write(final_df_json)

# Display the JSON string
final_df_json

'[{"name":"substance.other.1,3-Diacetin","size":96,"imports":["substance.other.procaine","substance.opioid.4-ANPP","substance.opioid.fentanyl","substance.opioid.phenethyl 4-ANPP","substance.other.menthol","substance.other.N-phenylpropanamide","substance.opioid.4-ANPP","substance.opioid.ethyl-4-ANPP","substance.opioid.fentanyl","substance.opioid.phenethyl 4-ANPP","substance.sedative.xylazine","substance.opioid.4-ANPP","substance.opioid.heroin","substance.opioid.p-fluorofentanyl","substance.opioid.fentanyl","substance.opioid.phenethyl 4-ANPP","substance.sedative.xylazine","substance.opioid.4-ANPP","substance.opioid.p-fluorofentanyl","substance.opioid.fentanyl","substance.other.lidocaine","substance.opioid.4-ANPP","substance.opioid.fentanyl","substance.other.lidocaine","substance.sedative.xylazine","substance.opioid.4-ANPP","substance.opioid.p-fluorofentanyl","substance.opioid.phenethyl 4-ANPP","substance.opioid.tramadol","substance.opioid.p-fluoro phenethyl 4-ANPP","substance.opioid.ethy
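
Before handing the file to the viz code, it can be worth reading it back and repeating the dangling-import check on what was actually written. The sketch below uses hypothetical records and a temporary path; it writes with the standard `json` module rather than `DataFrame.to_json`, purely to keep the example self-contained.

```python
import json
import os
import tempfile

# Hypothetical records in the same shape as the exported rows.
records = [
    {"name": "substance.opioid.fentanyl", "size": 2,
     "imports": ["substance.sedative.xylazine"]},
    {"name": "substance.sedative.xylazine", "size": 1, "imports": []},
]

# Write to a temporary location so nothing in the project is overwritten.
path = os.path.join(tempfile.mkdtemp(), "toy.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Read the file back and confirm every import target has its own record.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

names = {record["name"] for record in loaded}
dangling = {target for record in loaded for target in record["imports"]} - names
print(len(loaded), dangling)
```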
