# **Substance Lookup with Source Information**
This notebook retrieves chemical information such as CAS number, synonyms, descriptions, and source metadata from PubChem and Wikipedia.

## **Libraries Used**
- `requests`: makes API calls to PubChem and fetches Wikipedia pages.
- `pandas`: creates and manipulates structured tabular data.
- `BeautifulSoup`: parses HTML from Wikipedia.

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

## **Fetch Data from PubChem REST API**

**Function:** `lookup_pubchem_substance`

This function takes a single **substance name** and queries the **PubChem PUG REST API** to retrieve:
1. The **CAS number** of the given substance.
2. A list of **synonyms** for the substance.
3. The **PubChem compound source URL**.
4. The **PubChem synonym source URL**.
5. The description and source metadata returned by `get_pubchem_description_and_source`.

In [2]:
from urllib.parse import quote

def lookup_pubchem_substance(substance_name):
    # Returned whenever the lookup fails at any step.
    not_found = ("N/A", [], "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "N/A")
    base_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
    # Percent-encode the name so spaces or slashes cannot break the URL path.
    url = f"{base_url}{quote(substance_name, safe='')}/property/IUPACName,MolecularFormula,MolecularWeight,CanonicalSMILES,IsomericSMILES/JSON"
    try:
        response = requests.get(url, timeout=30)
        if response.status_code != 200:
            return not_found
        # Each row of the property table also carries the compound's CID.
        compound_id = response.json()['PropertyTable']['Properties'][0]['CID']
        cas_url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{compound_id}/synonyms/JSON"
        cas_response = requests.get(cas_url, timeout=30)
        if cas_response.status_code != 200:
            return not_found
        synonyms = cas_response.json()['InformationList']['Information'][0]['Synonym']
        # Heuristic: the first synonym shaped like digits-digits-digits is taken as the CAS number.
        cas_number = next((s for s in synonyms if s.count('-') == 2 and s.replace('-', '').isdigit()), 'N/A')
        compound_source = f"https://pubchem.ncbi.nlm.nih.gov/compound/{compound_id}"
        synonym_source = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{cas_number}/synonyms/JSON"
        description, record_title, record_source, record_url, source_description, source_license = get_pubchem_description_and_source(compound_id)
        return cas_number, synonyms, compound_source, synonym_source, description, record_title, record_source, record_url, source_description, source_license
    except (requests.RequestException, KeyError, IndexError, ValueError, TypeError):
        # Network failure, malformed JSON, or an unexpected response shape.
        return not_found
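
The CAS heuristic in `lookup_pubchem_substance` accepts any synonym made of digits with two hyphens, so a date-like string such as `2020-01-15` would also pass. A stricter check could validate the CAS format and its check digit. The sketch below is one possible approach, not part of the notebook; `is_valid_cas` and `first_cas` are hypothetical helper names.

```python
import re

# CAS Registry Numbers look like "437-38-7": 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(candidate):
    """Return True if `candidate` has CAS format and a correct check digit."""
    match = CAS_PATTERN.match(candidate.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


def first_cas(synonyms):
    """Return the first synonym that is a valid CAS number, or 'N/A'."""
    return next((s for s in synonyms if is_valid_cas(s)), "N/A")


print(first_cas(["Fentanyl", "437-38-7", "Sublimaze"]))  # prints 437-38-7
```

For fentanyl, the weighted sum is 8·1 + 3·2 + 7·3 + 3·4 + 4·5 = 67, and 67 mod 10 = 7 matches the final digit.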


## **Fetch Source Data from PubChem REST View API**

**Function:** `get_pubchem_description_and_source`

This function takes a compound **CID** and queries the **PubChem PUG View API** to retrieve:
1. The **description** of the given substance.
2. The **record title**.
3. **Source metadata** from one reference, preferring a non-PubChem source:
   * Source Name
   * Source Description
   * Source URL
   * Source License

In [3]:
def get_pubchem_description_and_source(cid):
    api_url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/index/compound/{cid}/JSON"
    try:
        # A timeout keeps a stalled connection from hanging the notebook.
        response = requests.get(api_url, timeout=30)
        if response.status_code == 200:
            data = response.json()
            record = data.get("Record", {})
            record_title = record.get("RecordTitle", "N/A")
            info_entries = record.get("Information", [])
            references = record.get("Reference", [])

            def select_reference():
                for ref in references:
                    if ref.get("SourceName") != "PubChem":
                        return ref
                for ref in references:
                    if ref.get("SourceName") == "PubChem":
                        return ref
                return {}

            selected_ref = select_reference()
            record_source = selected_ref.get("SourceName", "N/A")
            record_url = selected_ref.get("URL", "N/A")
            source_description = selected_ref.get("Description", "N/A")
            source_license = selected_ref.get("LicenseURL", "N/A")

            for entry in info_entries:
                if "Description" in entry:
                    value = entry.get("Value", {})
                    if "StringWithMarkup" in value:
                        for string_obj in value["StringWithMarkup"]:
                            desc_text = string_obj.get("String", "")
                            if desc_text and len(desc_text.split()) > 10:
                                return desc_text, record_title, record_source, record_url, source_description, source_license

            for entry in info_entries:
                value = entry.get("Value", {})
                if "StringWithMarkup" in value:
                    for string_obj in value["StringWithMarkup"]:
                        desc_text = string_obj.get("String", "")
                        if desc_text and len(desc_text.split()) > 10:
                            return desc_text, record_title, record_source, record_url, source_description, source_license

            return "N/A", record_title, record_source, record_url, source_description, source_license
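    except Exception:
        return "N/A", "N/A", "N/A", "N/A", "N/A", "N/A"
    # A non-200 response falls through to here; return the same six-field
    # fallback so callers can always unpack the result.
    return "N/A", "N/A", "N/A", "N/A", "N/A", "N/A"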
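
The description search in `get_pubchem_description_and_source` runs in two passes: entries that carry a `Description` key are tried first, then any entry at all, and a string counts only if it has more than ten words. The sketch below isolates that logic so it can be tried offline. `first_long_description` is a hypothetical helper, and `sample_entries` is made-up data shaped like the fields the function reads, not a real PubChem response.

```python
def first_long_description(info_entries, min_words=10):
    """Two-pass search over PUG View-style 'Information' entries."""

    def long_strings(entry):
        # Yield every string in the entry with more than `min_words` words.
        for item in entry.get("Value", {}).get("StringWithMarkup", []):
            text = item.get("String", "")
            if len(text.split()) > min_words:
                yield text

    # Pass 1: only entries that carry a "Description" key.
    for entry in info_entries:
        if "Description" in entry:
            for text in long_strings(entry):
                return text
    # Pass 2: any entry at all.
    for entry in info_entries:
        for text in long_strings(entry):
            return text
    return "N/A"


# Made-up entries: the first is long but has no "Description" key.
sample_entries = [
    {"Value": {"StringWithMarkup": [
        {"String": "An entry without a Description key that is nevertheless longer than ten words in total."}
    ]}},
    {"Description": "Record Description", "Value": {"StringWithMarkup": [
        {"String": "Too short."},
        {"String": "A made-up description that is long enough to pass the ten word threshold used above."},
    ]}},
]

print(first_long_description(sample_entries))
```

The second entry wins even though the first one appears earlier, because pass 1 only considers entries with a `Description` key.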


## **Fetch Data from Wikipedia**

**Function:** `lookup_wikipedia_substance`

When the **PubChem API** returns no CAS number, this function falls back to scraping the substance's **Wikipedia** infobox for:
1. The **CAS number** (if available).
2. **Synonyms** of the substance.
3. The **Wikipedia page URL**.

In [4]:
def lookup_wikipedia_substance(substance_name):
    search_url = f"https://en.wikipedia.org/wiki/{substance_name.replace(' ', '_')}"
    # A timeout keeps a stalled connection from hanging the notebook.
    response = requests.get(search_url, timeout=30)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        infobox = soup.find('table', {'class': 'infobox'})
        if infobox:
            cas_number = "N/A"
            synonyms = []
            for row in infobox.find_all('tr'):
                header = row.find('th')
                cell = row.find('td')
                # Skip rows that lack either a header or a value cell.
                if header and cell:
                    if 'CAS Number' in header.text:
                        cas_number = cell.text.strip()
                    if 'Other names' in header.text:
                        synonyms = cell.text.strip().split(', ')
            return cas_number, synonyms, search_url, "N/A", "N/A", "N/A", "N/A", "N/A", "N/A"
    return "N/A", [], "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "N/A"
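
Splitting the "Other names" cell on `', '` only works when the names are comma-separated on one line. Infobox cells may also separate names with semicolons or line breaks and may carry footnote markers. The sketch below shows a more tolerant splitter. `split_synonyms` is a hypothetical helper, and it assumes the cell text was extracted with a newline between elements (for example via BeautifulSoup's `get_text("\n")`).

```python
import re


def split_synonyms(cell_text):
    """Split infobox 'Other names' text into a clean list of names."""
    # Split on comma-plus-whitespace, semicolons, or newlines. A bare comma
    # is not a separator, so names like "1,2-Dichloroethane" stay intact.
    parts = re.split(r",\s+|;|\n", cell_text)
    # Remove footnote markers such as "[1]" and surrounding whitespace.
    cleaned = (re.sub(r"\[\d+\]", "", part).strip() for part in parts)
    return [name for name in cleaned if name]


print(split_synonyms("Fentanil; Sublimaze[1]\nDuragesic, Actiq"))
# prints ['Fentanil', 'Sublimaze', 'Duragesic', 'Actiq']
```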


## **Data Compilation**

**Function:** `compile_substance_info`

Compiles the following fields from PubChem (or from Wikipedia as a fallback) into one row per synonym and saves the result to a CSV file:
* Substance Name
* CAS Number
* Record Title
* Substance Description
* Synonyms
* Synonym Source
* Compound Source
* Record Source
* Source URL
* Source Description
* Source License

In [12]:
def compile_substance_info(substance_list):
    results = []
    for substance in substance_list:
        cas, synonyms, compound_source, synonym_source, description, record_title, record_source, record_url, source_description, source_license = lookup_pubchem_substance(substance)
        if cas == "N/A":
            cas, synonyms, compound_source, description, record_title, record_source, record_url, source_description, source_license = lookup_wikipedia_substance(substance)
            synonym_source = compound_source

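        # Emit a placeholder row when no synonyms were found, so the
        # substance still appears in the output.
        for synonym in synonyms or ["N/A"]: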
            results.append({
                'Substance Name': substance,
                'CAS Number': cas,
                'Record Title': record_title,
                'Substance Description': description,
                'Synonyms': synonym,
                'Synonym Source': synonym_source,
                'Compound Source': compound_source,
                'Record Source': record_source,
                'Source URL': record_url,
                'Source Description': source_description,
                'Source License': source_license
            })

    df = pd.DataFrame(results)
    file_path = "substance_data_with_sources.csv"
    df.to_csv(file_path, index=False)
    print(f"Data saved to {file_path}")
    print(df)
    return file_path
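
The lookup functions return long positional tuples (ten fields from PubChem, nine from Wikipedia), which are easy to unpack in the wrong order. One possible alternative is a small record type whose fields all default to `"N/A"`. The sketch below is illustrative only; `SubstanceRecord` and `to_rows` are hypothetical names, not part of the notebook.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class SubstanceRecord:
    """Hypothetical container for one lookup result."""
    cas_number: str = "N/A"
    synonyms: list = field(default_factory=list)
    compound_source: str = "N/A"
    synonym_source: str = "N/A"
    description: str = "N/A"
    record_title: str = "N/A"
    record_source: str = "N/A"
    record_url: str = "N/A"
    source_description: str = "N/A"
    source_license: str = "N/A"


def to_rows(substance, record):
    """Expand a record into one flat dict per synonym."""
    base = asdict(record)
    # Fall back to a single placeholder when there are no synonyms.
    synonyms = base.pop("synonyms") or ["N/A"]
    return [{"substance_name": substance, "synonym": s, **base} for s in synonyms]


record = SubstanceRecord(cas_number="437-38-7", synonyms=["Fentanyl", "Sublimaze"])
rows = to_rows("fentanyl", record)
print(len(rows), rows[0]["synonym"], rows[0]["record_title"])  # prints 2 Fentanyl N/A
```

A list of such dicts can be passed to `pd.DataFrame` in the same way as the `results` list above.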


## **Run the Substance Scraping**

List the substances you want to look up, then run the cell.

In [16]:
substances = ["fentanyl"]
compile_substance_info(substances)


Data saved to substance_data_with_sources.csv
    Substance Name CAS Number Record Title  \
0         fentanyl   437-38-7     Fentanyl   
1         fentanyl   437-38-7     Fentanyl   
2         fentanyl   437-38-7     Fentanyl   
3         fentanyl   437-38-7     Fentanyl   
4         fentanyl   437-38-7     Fentanyl   
..             ...        ...          ...   
141       fentanyl   437-38-7     Fentanyl   
142       fentanyl   437-38-7     Fentanyl   
143       fentanyl   437-38-7     Fentanyl   
144       fentanyl   437-38-7     Fentanyl   
145       fentanyl   437-38-7     Fentanyl   

                                 Substance Description  \
0    Fentanyl is a monocarboxylic acid amide result...   
1    Fentanyl is a monocarboxylic acid amide result...   
2    Fentanyl is a monocarboxylic acid amide result...   
3    Fentanyl is a monocarboxylic acid amide result...   
4    Fentanyl is a monocarboxylic acid amide result...   
..                                                 ..

'substance_data_with_sources.csv'