# **Fetching CAS Numbers and Synonyms for Chemical Substances**

This Jupyter Notebook fetches **CAS numbers**, **synonyms**, and **source links** for given chemical substances. 

## **Sources Used:**
1. **PubChem API** - To get structured chemical information.
2. **Wikipedia** - As a secondary source for synonyms and CAS numbers.


## **Import Required Libraries**

We need the following Python libraries:
- `requests` for making API calls and web scraping.
- `pandas` for handling tabular data.
- `BeautifulSoup` (from the `bs4` package) for parsing Wikipedia HTML pages.


In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

## **Fetch Data from PubChem API**

This function queries the **PubChem API** to retrieve:
1. The **CAS number** of the given substance.
2. A list of **synonyms** for the substance.
3. The **PubChem compound source URL**.
4. The **PubChem synonym source URL**.
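
PubChem's PUG REST interface takes the substance name as part of the URL path, so names containing spaces, slashes, or brackets are safer when percent-encoded first. The sketch below shows one way to build such a URL with the standard library. The helper name `pubchem_synonyms_url` and the constant `PUG_REST` are illustrative assumptions and are not defined elsewhere in this notebook.

```python
from urllib.parse import quote

# Base path of the PUG REST compound domain.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound"


def pubchem_synonyms_url(substance_name: str) -> str:
    """Build a PUG REST synonyms URL for a compound name (hypothetical helper)."""
    # safe="" also encodes "/", which would otherwise be read as a path separator.
    encoded_name = quote(substance_name, safe="")
    return f"{PUG_REST}/name/{encoded_name}/synonyms/JSON"


print(pubchem_synonyms_url("(2-bromoethyl)benzene"))
```

Encoding the name up front keeps the request path unambiguous even for systematic names that contain characters with a special meaning in URLs.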


In [2]:
import re

# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
# The fixed two-digit middle group excludes EC numbers such as 203-130-8.
CAS_PATTERN = re.compile(r"\d{2,7}-\d{2}-\d")

def fetch_pubchem_data(substance_name):
    """Fetch CAS number, synonyms, and sources from PubChem API."""
    not_found = ("N/A", [], "N/A", "N/A")
    base_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound"
    try:
        # Resolve the name to a compound ID (CID); nothing else is needed from this call.
        response = requests.get(f"{base_url}/name/{substance_name}/cids/JSON", timeout=30)
        if response.status_code != 200:
            return not_found
        compound_id = response.json()['IdentifierList']['CID'][0]

        # Fetch all synonyms registered for that CID.
        synonyms_url = f"{base_url}/cid/{compound_id}/synonyms/JSON"
        synonyms_response = requests.get(synonyms_url, timeout=30)
        if synonyms_response.status_code != 200:
            return not_found
        synonyms = synonyms_response.json()['InformationList']['Information'][0]['Synonym']
    except (requests.RequestException, KeyError, IndexError, ValueError):
        # Network failure, malformed JSON, or an unexpected response shape.
        return not_found

    # PubChem lists the CAS number among the synonyms; take the first match.
    cas_number = next((s for s in synonyms if CAS_PATTERN.fullmatch(s)), 'N/A')
    compound_source = f"https://pubchem.ncbi.nlm.nih.gov/compound/{compound_id}"
    # Report the URL that was actually queried; it stays valid even without a CAS number.
    return cas_number, synonyms, compound_source, synonyms_url
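
The CAS number is picked out of the synonym list purely by its digit-and-hyphen shape, so other identifiers with a similar shape can slip through. CAS Registry Numbers carry a check digit, which allows a stricter test. The following is a minimal sketch of that check; the helper `is_valid_cas` is an illustrative assumption and is not called anywhere in this notebook.

```python
import re

# Three hyphen-separated groups: 2-7 digits, exactly 2 digits, 1 check digit.
CAS_RE = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_cas(candidate: str) -> bool:
    """Return True if the string has CAS layout and a matching check digit."""
    match = CAS_RE.fullmatch(candidate.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... from right to left, then sum modulo 10.
    total = sum(
        position * int(digit)
        for position, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas("7732-18-5"))  # CAS number of water
print(is_valid_cas("203-130-8"))  # three-digit middle group, not CAS layout
```

Such a check could be used as the filter predicate when scanning the synonym list, at the cost of rejecting any synonym that is mistyped in the source data.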

## **Fetch Data from Wikipedia**

If **PubChem** does not return a CAS number, this function scrapes the infobox of the substance's English **Wikipedia** page for:
1. The **CAS number** (if available).
2. **Synonyms** of the substance.
3. The **Wikipedia page URL**.


In [None]:
def fetch_wikipedia_data(substance_name):
    """Scrape Wikipedia for CAS number and synonyms."""
    search_url = f"https://en.wikipedia.org/wiki/{substance_name.replace(' ', '_')}"
    try:
        response = requests.get(search_url, timeout=30)
    except requests.RequestException:
        return "N/A", [], "N/A"
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        infobox = soup.find('table', {'class': 'infobox'})
        if infobox:
            cas_number = "N/A"
            synonyms = []
            for row in infobox.find_all('tr'):
                header = row.find('th')
                cell = row.find('td')
                if header is None or cell is None:
                    continue  # Skip rows without a header/value pair
                if 'CAS Number' in header.text:
                    cas_number = cell.text.strip()
                if 'Other names' in header.text:
                    # Drop empty fragments left over from splitting
                    synonyms = [name.strip() for name in cell.text.split(', ') if name.strip()]
            return cas_number, synonyms, search_url
    return "N/A", [], "N/A"
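
Splitting the "Other names" cell on `', '` only works when the names are comma-separated. Infobox cells may instead separate names with line breaks or semicolons, and chemical names themselves can contain commas (as in "1,2-Dichloroethane"). The sketch below shows a more tolerant splitter. It assumes the cell text was extracted with a newline between elements, for example via BeautifulSoup's `get_text(separator='\n')`, and the helper name `split_synonyms` is hypothetical.

```python
import re

# Split on newlines or semicolons, or on a comma that is followed by whitespace.
# A comma with no following space (as in "1,2-Dichloroethane") is left alone.
SEPARATOR_RE = re.compile(r"\s*[;\n]\s*|,\s+")


def split_synonyms(cell_text: str) -> list[str]:
    """Split an infobox cell into individual names, dropping empty pieces."""
    pieces = SEPARATOR_RE.split(cell_text)
    return [piece.strip() for piece in pieces if piece.strip()]


sample = "1,2-Dichloroethane, Ethylene dichloride\nEDC; Dutch oil"
print(split_synonyms(sample))
```

A name that legitimately contains a comma followed by a space would still be split, so the result is best treated as a heuristic.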

## **Process the List of Substances**

This function:
1. Loops through each substance in the provided list.
2. **Queries PubChem** for the CAS number and synonyms.
3. If PubChem returns no CAS number, **falls back to Wikipedia**.
4. Stores the results in a **Pandas DataFrame** with one row per synonym.
5. Saves the data to a **CSV file**.


In [None]:
def main(substance_list):
    """Processes each substance by fetching data from PubChem and Wikipedia."""
    results = []
    for substance in substance_list:
        cas, synonyms, compound_source, synonym_source = fetch_pubchem_data(substance)
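        if cas == "N/A":  # Fall back to Wikipedia when PubChem yields no CAS number
            cas, synonyms, synonym_source = fetch_wikipedia_data(substance)
            compound_source = synonym_source
        
        # Emit one row per synonym; keep a placeholder row when none were found
        for synonym in synonyms or ["N/A"]: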
            results.append({
                'Substance Name': substance,
                'CAS Number': cas,
                'Synonym': synonym,
                'Compound Source': compound_source,
                'Synonym Source': synonym_source
            })
    
    df = pd.DataFrame(results)
    file_path = "substance_data.csv"
    df.to_csv(file_path, index=False)
    print(f"Data saved to {file_path}")
    return df
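
The row-building step in `main()` produces a "long" table: the substance, CAS number, and sources are repeated on every row, and only the synonym changes. The sketch below isolates that step as a plain function so the resulting shape is easy to inspect without network access. The helper `build_rows` and the placeholder row for substances without synonyms are assumptions made for illustration.

```python
def build_rows(substance, cas, synonyms, compound_source, synonym_source):
    """Expand one lookup result into one dictionary per synonym."""
    # Assumption: a substance with no synonyms still gets a single placeholder row.
    names = synonyms if synonyms else ["N/A"]
    return [
        {
            "Substance Name": substance,
            "CAS Number": cas,
            "Synonym": name,
            "Compound Source": compound_source,
            "Synonym Source": synonym_source,
        }
        for name in names
    ]


rows = build_rows("water", "7732-18-5", ["Water", "Dihydrogen oxide"], "N/A", "N/A")
for row in rows:
    print(row["Substance Name"], row["CAS Number"], row["Synonym"])
```

A list of dictionaries in this shape can be passed directly to `pd.DataFrame`, which uses the dictionary keys as column names.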

## **Run the Script**

Here, we define a **list of sample substances** and execute the `main()` function to fetch and store the data.


In [None]:
# Sample list of substances
substances = ['(2-Bromoethyl)Benzene','(2-bromoethyl)benzene','(2-bromoethyl)-benzene',
              '(2-chloroethyl)-benzene','(2-nitroprop-1-en-1-yl)benzene',
              '1-(2-Phenylethyl)-4-phenyl-4-acetoxypiperidine','1-(4-bromophenyl)propan-1-one',
              '1-(4-chlorophenyl)propan-1-one','1-(4-methylphenyl)propan-1-one',
              '1-(phenylmethyl)-4-piperidinone']

# Execute the main function
df_results = main(substances)
print(df_results)
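
The sample list contains entries that differ only in capitalization, so each variant is looked up separately. If such variants are expected to resolve to the same record (an assumption that has not been verified here for PubChem or Wikipedia), the list could be de-duplicated before calling `main()`. A minimal sketch, with the hypothetical helper `dedupe_names`:

```python
def dedupe_names(names):
    """Drop names that differ only by letter case, keeping the first spelling."""
    seen = set()
    unique = []
    for name in names:
        key = name.casefold()  # case-insensitive comparison key
        if key not in seen:
            seen.add(key)
            unique.append(name)
    return unique


sample = ["(2-Bromoethyl)Benzene", "(2-bromoethyl)benzene", "(2-bromoethyl)-benzene"]
print(dedupe_names(sample))
```

Spellings that differ in punctuation, such as the extra hyphen in the third entry, are kept as distinct names because they may not be interchangeable as search terms.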