# <span style="color:green">**Challenge #1: NLP extraction to create ML training data**</span>
## <span style="color:green">**Classify drugs that are BCRP inhibitors or non-inhibitors**</span>


# Setup Environment and import all libraries

In [1]:
!nvidia-smi
import os
os.cpu_count()

Sun Nov 19 20:10:37 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   22C    P8     9W /  70W |      5MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

16

In [3]:
from drug_named_entity_recognition import find_drugs
from bs4 import BeautifulSoup
import re
import spacy
import pandas as pd 
import tenacity
import requests,time
from xml.etree import ElementTree as ET
from tqdm import tqdm

# Data Collection through API

### Function to search documents through api in various databases available on Pubmed Central

The `search_pmc` function uses the NCBI E-utilities `esearch` endpoint to find articles matching a query, restricted to records with free full text. As written, the `db` entry in the `params` dictionary is fixed to `'pubmed'`, so the returned identifiers are PubMed IDs (PMIDs) even though the variable names refer to PMC. Changing that entry (for example to `'pmc'`) targets a different database. The `retmax` argument caps the number of IDs returned. The function returns the ID list, or an empty list if the request fails.

In [5]:
def search_pmc(query, retmax=10):
    base_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
    endpoint = 'esearch.fcgi'
    
    params = {
        'db': 'pubmed',
        'term': query + ' AND free full text[filter]',
        'retmax': retmax,
        'retmode': 'json'
    }

    try:
        response = requests.get(base_url + endpoint, params=params)
        data = response.json()
        
        # Extract the list of matching IDs (PMIDs, since db='pubmed')
        pmc_ids = data['esearchresult']['idlist']
        time.sleep(0.3)
        return pmc_ids
    except Exception as e:
        print(f"Error searching PMC: {e}")
        return []

# Example usage
query = ['((BCRP) OR (ABCG2) OR (Breast Cancer Resistance Protein))']
# query = ['((bcrp) AND (inhibitor)) OR ((bcrp) AND (substrate))']
pmc_ids = [id for keyword in query for id in search_pmc(keyword, retmax=5000)]
pmc_ids = list(set(pmc_ids))  # Get unique PMC IDs
print("PubMed Central IDs:", len(pmc_ids))

PubMed Central IDs: 5000
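
`search_pmc` as written always queries the `pubmed` database. The following is a minimal sketch of how the request parameters could be assembled with the database exposed as an argument, and how the ID list could be de-duplicated while keeping its order. `build_esearch_params` and `extract_ids` are hypothetical helpers, not part of the notebook, and `sample_reply` is a hand-written stand-in for an ESearch JSON reply. The filter string is copied from the notebook; whether it is meaningful for databases other than `pubmed` is not checked here.

```python
def build_esearch_params(query, db="pubmed", retmax=10, free_full_text=True):
    """Assemble ESearch query parameters (hypothetical helper)."""
    term = query + " AND free full text[filter]" if free_full_text else query
    return {"db": db, "term": term, "retmax": retmax, "retmode": "json"}


def extract_ids(esearch_json):
    """Return the ID list from an ESearch JSON reply, de-duplicated in order."""
    ids = esearch_json.get("esearchresult", {}).get("idlist", [])
    return list(dict.fromkeys(ids))


# Hand-written stand-in for a reply; real replies carry more fields.
sample_reply = {"esearchresult": {"idlist": ["37767694", "34829741", "37767694"]}}

params = build_esearch_params("(BCRP) OR (ABCG2)", db="pubmed", retmax=5000)
unique_ids = extract_ids(sample_reply)
print(params["term"])
print(unique_ids)
```

The resulting dictionary can be passed as `params=` to `requests.get`, as the notebook already does.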


### Function to get abstract and create a pandas dataframe


The `get_abstract` function fetches the record for a given ID through the E-utilities `efetch` endpoint, requesting `db='pubmed'` with `rettype='abstract'`. It returns the raw response text without parsing it, so each entry in the resulting DataFrame holds the plain-text citation (journal, date, DOI) followed by the abstract, as the output below shows. Failed requests are caught and return `None`, and the calling loop pauses for a third of a second between requests to stay within the API usage guidelines. Although the column is named `PMC_ID`, the values are the PubMed IDs returned by the search step.

In [6]:

# Function to get abstract for a given PMC ID
def get_abstract(pmc_id):
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    db = "pubmed"
    retmode = "json"
    rettype = "abstract"

    try:
        # Construct the URL
        url = f"{base_url}?db={db}&id={pmc_id}&retmode={retmode}&rettype={rettype}"

        # Make the request
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for bad responses

        # Parse the XML response
        # root = ET.fromstring(response.text)

        # Extract the abstract
        # abstract = root.find(".//AbstractText").text

        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error for PMC_ID {pmc_id}: {str(e)}")
        return None  # Return None for failed requests

# Sample PMC IDs
# pmc_ids = ["PMC123", "PMC456", "PMC789"]

pbar = tqdm(total=len(pmc_ids), desc="Processing PMC IDs")
# Create a DataFrame to store the results
df = pd.DataFrame(columns=["PMC_ID", "Abstract"])

# Populate the DataFrame with PMC IDs and their respective abstracts
for pmc_id in pmc_ids:
    abstract = get_abstract(pmc_id)

    if abstract is not None:
        temp_df = pd.DataFrame({"PMC_ID": [pmc_id], "Abstract": [abstract]})
        df = pd.concat([df, temp_df], ignore_index=True)

    time.sleep(1 / 3)
    pbar.update(1)

# Display the DataFrame
df.head()


Processing PMC IDs: 100%|██████████| 5000/5000 [46:57<00:00,  1.77it/s] 

Unnamed: 0,PMC_ID,Abstract
0,37767694,Cell Physiol Biochem. 2023 Sep 27;57(5):360-37...
1,34829741,Biomedicines. 2021 Oct 21;9(11):1512. doi: 10....
2,33392077,Front Oncol. 2020 Dec 17;10:580176. doi: 10.33...
3,36410675,Pharmacol Res. 2023 Jan;187:106558. doi: 10.10...
4,32206124,Theranostics. 2020 Feb 19;10(8):3816-3832. doi...
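
The commented-out lines in `get_abstract` hint at parsing an XML reply instead of keeping the raw text. A sketch of that idea follows, assuming the request is made with `retmode=xml` and that the abstract sections arrive as `AbstractText` elements. `SAMPLE_XML` is a hand-written, heavily abbreviated stand-in for a real reply, and `extract_abstract_text` is a hypothetical helper.

```python
from xml.etree import ElementTree as ET

# Hand-written, abbreviated stand-in for an efetch XML reply.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <Abstract>
          <AbstractText Label="BACKGROUND">BCRP limits drug absorption.</AbstractText>
          <AbstractText Label="RESULTS">Compound <i>X</i> inhibited transport.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_abstract_text(xml_text):
    """Join all AbstractText sections into one string, or return None if absent."""
    root = ET.fromstring(xml_text)
    parts = []
    for node in root.iter("AbstractText"):
        # itertext() also collects text nested inside inline tags such as <i>
        text = "".join(node.itertext()).strip()
        label = node.get("Label")
        parts.append(f"{label}: {text}" if label else text)
    return " ".join(parts) if parts else None


abstract_text = extract_abstract_text(SAMPLE_XML)
print(abstract_text)
```

Using `find(".//AbstractText").text`, as in the commented-out line, would return only the first section and would stop at the first inline tag, which is why the sketch iterates over all sections.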


### Basic cleaning of text for the raw Abstracts
The basic_clean function is designed to clean input text by performing two main tasks. First, it utilizes the BeautifulSoup library to remove HTML tags, ensuring that the text is free from any markup. Next, it employs a regular expression to replace newline and carriage return characters with spaces, maintaining a consistent and readable format. This function is useful for preprocessing text data, such as extracting clean content from HTML documents, making it suitable for further analysis or natural language processing tasks.

In [7]:
def basic_clean(text):
    """
    Cleans the input text by removing any HTML tags and replacing
    newline and carriage return characters with spaces.

    Args:
        text (str): The text to be cleaned.

    Returns:
        str: The cleaned text.
    """

    # Use BeautifulSoup to remove HTML tags from the text
    soup = BeautifulSoup(text, 'html.parser')
    text = soup.get_text()

    # Use a regular expression to replace newline and carriage return characters with spaces
    text = re.sub(r'[\n\r]', ' ', text)

    # Return the cleaned text
    return text

In [8]:
df['Abstract_clean'] = df['Abstract'].apply(basic_clean)
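
`basic_clean` replaces each newline or carriage return with a single space, so blank lines and indentation in the raw text leave runs of spaces behind. An optional follow-up step could collapse them. This is a sketch only; `collapse_whitespace` is a hypothetical helper and the notebook does not apply it.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


raw = "Cell Physiol Biochem.\n\n  BCRP   transports\r\nmitoxantrone. "
cleaned = collapse_whitespace(raw)
print(cleaned)
```

It could be chained after the existing step, for example with `df['Abstract_clean'].apply(collapse_whitespace)`.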

# Drug Name extraction from cleaned Abstract

### Using Spacy NER to find drug names
This code snippet utilizes the spaCy library to process text data and extract unique drug names from a DataFrame column ('Abstract_clean'). It first loads the spaCy English model ('en_core_web_sm') and selectively disables certain pipeline components for efficiency. The find_drugs function is applied to extract drug names from each processed document, and the unique drug names are stored in a new column ('drug_names_multiple') in the DataFrame. The code efficiently handles text processing and drug name extraction for further analysis.

In [9]:
!python -m spacy download en_core_web_sm -qq

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [10]:
nlp = spacy.load('en_core_web_sm')
disabled = nlp.select_pipes(disable= ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer'])

In [11]:
# Reuse the nlp pipeline configured above, with the unused components disabled

# Create an empty list to store the unique drug names
unique_drug_names_list = []

# Iterate through the rows and extract the 'name' from the list of dictionaries
for doc in nlp.pipe(df['Abstract_clean'].values):
    drug_names = find_drugs([token.text for token in doc], is_ignore_case=True)
    # Extract the 'name' from the first element of the tuple
    drug_names_for_row = [drug[0].get('name') for drug in drug_names]
    
    # Remove duplicate drug names
    unique_drug_names_for_row = list(set(drug_names_for_row))
    
    unique_drug_names_list.append(unique_drug_names_for_row)

# Create a new column with the extracted unique drug names
df['drug_names_multiple'] = unique_drug_names_list


In [12]:
df['no_of_drugs'] = df['drug_names_multiple'].apply(len)

In [13]:
df.head()

Unnamed: 0,PMC_ID,Abstract,Abstract_clean,drug_names_multiple,no_of_drugs
0,37767694,Cell Physiol Biochem. 2023 Sep 27;57(5):360-37...,Cell Physiol Biochem. 2023 Sep 27;57(5):360-37...,"[Paclitaxel, Doxorubicin, Propidium]",3
1,34829741,Biomedicines. 2021 Oct 21;9(11):1512. doi: 10....,Biomedicines. 2021 Oct 21;9(11):1512. doi: 10....,[],0
2,33392077,Front Oncol. 2020 Dec 17;10:580176. doi: 10.33...,Front Oncol. 2020 Dec 17;10:580176. doi: 10.33...,[],0
3,36410675,Pharmacol Res. 2023 Jan;187:106558. doi: 10.10...,Pharmacol Res. 2023 Jan;187:106558. doi: 10.10...,[Tamoxifen],1
4,32206124,Theranostics. 2020 Feb 19;10(8):3816-3832. doi...,Theranostics. 2020 Feb 19;10(8):3816-3832. doi...,"[Chloroquine, Tamoxifen, Ciclosporin]",3


In [14]:
# Count whitespace-separated words and estimate the token count
def calculate_word_token_counts(text):
    words = len(text.split())
    # Rough estimate assuming about 4 characters per token; not a tokenizer count
    tokens = len(text) // 4
    return words, tokens

# Apply the function to create new columns
df[['no_of_words', 'no_of_tokens']] = df['Abstract_clean'].apply(calculate_word_token_counts).tolist()

# Print the updated DataFrame
df.head()


Unnamed: 0,PMC_ID,Abstract,Abstract_clean,drug_names_multiple,no_of_drugs,no_of_words,no_of_tokens
0,37767694,Cell Physiol Biochem. 2023 Sep 27;57(5):360-37...,Cell Physiol Biochem. 2023 Sep 27;57(5):360-37...,"[Paclitaxel, Doxorubicin, Propidium]",3,398,782
1,34829741,Biomedicines. 2021 Oct 21;9(11):1512. doi: 10....,Biomedicines. 2021 Oct 21;9(11):1512. doi: 10....,[],0,280,524
2,33392077,Front Oncol. 2020 Dec 17;10:580176. doi: 10.33...,Front Oncol. 2020 Dec 17;10:580176. doi: 10.33...,[],0,363,680
3,36410675,Pharmacol Res. 2023 Jan;187:106558. doi: 10.10...,Pharmacol Res. 2023 Jan;187:106558. doi: 10.10...,[Tamoxifen],1,498,1007
4,32206124,Theranostics. 2020 Feb 19;10(8):3816-3832. doi...,Theranostics. 2020 Feb 19;10(8):3816-3832. doi...,"[Chloroquine, Tamoxifen, Ciclosporin]",3,437,794
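
The `no_of_tokens` column does not come from a tokenizer: it is the character count divided by four, a common rule of thumb for English text. A small worked example makes the arithmetic explicit. The function is copied from the cell above and the input string is made up for illustration.

```python
def calculate_word_token_counts(text):
    words = len(text.split())
    tokens = len(text) // 4  # character-based estimate, not a real token count
    return words, tokens


example = "BCRP transports mitoxantrone"
words, tokens = calculate_word_token_counts(example)

# 28 characters, 3 whitespace-separated words, 28 // 4 = 7 estimated tokens
print(len(example), words, tokens)
```

An estimate like this is mainly useful for a quick check of prompt size before calling a language model; an exact count would need the model's own tokenizer.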


In [15]:
# Take an explicit copy so that adding a column below does not write to a slice of df
df_filtered_1_drug = df[df['no_of_drugs'] == 1].copy()

In [16]:
df_filtered_1_drug['drug_name'] = df_filtered_1_drug['drug_names_multiple'].apply(lambda x: x[0])


In [17]:
df_filtered_1_drug=df_filtered_1_drug.reset_index(drop=True)

In [23]:
df_filtered_1_drug.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1335 entries, 0 to 1334
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   PMC_ID               1335 non-null   object
 1   Abstract             1335 non-null   object
 2   Abstract_clean       1335 non-null   object
 3   drug_names_multiple  1335 non-null   object
 4   no_of_drugs          1335 non-null   int64 
 5   no_of_words          1335 non-null   int64 
 6   no_of_tokens         1335 non-null   int64 
 7   drug_name            1335 non-null   object
dtypes: int64(3), object(5)
memory usage: 83.6+ KB


In [28]:
df_filtered_1_drug.to_csv('df_filtered_1_drug.csv',index=False)

In [3]:
df_filtered_1_drug = pd.read_csv('df_filtered_1_drug.csv',index_col=False)

In [11]:
sample = df_filtered_1_drug[:500]

In [12]:
sample =sample.reset_index(drop=True)
sample

Unnamed: 0,PMC_ID,Abstract,Abstract_clean,drug_names_multiple,no_of_drugs,no_of_words,no_of_tokens,drug_name
0,36410675,Pharmacol Res. 2023 Jan;187:106558. doi: 10.10...,Pharmacol Res. 2023 Jan;187:106558. doi: 10.10...,['Tamoxifen'],1,498,1007,Tamoxifen
1,32699032,Cancer Discov. 2020 Oct;10(10):1475-1488. doi:...,Cancer Discov. 2020 Oct;10(10):1475-1488. doi:...,['Aspirin'],1,607,1136,Aspirin
2,31908168,Nano Lett. 2020 Feb 12;20(2):1183-1191. doi: 1...,Nano Lett. 2020 Feb 12;20(2):1183-1191. doi: 1...,['Fibronectin'],1,347,658,Fibronectin
3,36418168,J Nucl Med. 2023 May;64(5):724-730. doi: 10.29...,J Nucl Med. 2023 May;64(5):724-730. doi: 10.29...,['Trastuzumab'],1,575,1005,Trastuzumab
4,37011773,Toxicol Lett. 2023 May 1;380:23-30. doi: 10.10...,Toxicol Lett. 2023 May 1;380:23-30. doi: 10.10...,['Mitoxantrone'],1,403,768,Mitoxantrone
...,...,...,...,...,...,...,...,...
495,36104100,J Immunother Cancer. 2022 Sep;10(9):e005068. d...,J Immunother Cancer. 2022 Sep;10(9):e005068. d...,['Tryptophan'],1,483,872,Tryptophan
496,32805491,Mol Ther Nucleic Acids. 2020 Sep 4;21:885-899....,Mol Ther Nucleic Acids. 2020 Sep 4;21:885-899....,['Gefitinib'],1,315,594,Gefitinib
497,32006616,Cancer Lett. 2020 Apr 10;475:53-64. doi: 10.10...,Cancer Lett. 2020 Apr 10;475:53-64. doi: 10.10...,['Lapatinib'],1,452,874,Lapatinib
498,32695297,Iran J Basic Med Sci. 2020 Jun;23(6):800-809. ...,Iran J Basic Med Sci. 2020 Jun;23(6):800-809. ...,['Methotrexate'],1,371,736,Methotrexate


# Drug Name Classification with the OpenAI API (GPT-3.5 Turbo) and a Retry Mechanism

This section defines `classify_drug_names`, which asks the OpenAI GPT-3.5 Turbo chat model to label the drug named in an abstract. The function builds a prompt from the abstract and the drug name, sends it to the model with `max_tokens=1`, and returns the reply: `1` for a BCRP inhibitor, `0` for a non-inhibitor or substrate, and `99` when the abstract does not say. The function is decorated with Tenacity's `retry` (random exponential backoff, up to six attempts), but it also catches every exception itself and returns `"Error occurred"`, so a failed call is recorded as that string rather than retried. The function is applied to each row of the 500-abstract `sample` DataFrame, and the replies are stored in a new column, `label_digits`.

In [6]:
import os
from openai import OpenAI

api_key = os.environ.get('OPENAI_API_KEY')
client = OpenAI(api_key=api_key)

In [21]:
# Example usage to see if the api is correctly working
response = client.completions.create(
  model="gpt-3.5-turbo-instruct",
  prompt="Hi! How are you?", 
  max_tokens=10 
)

In [22]:
response

Completion(id='cmpl-8MRYmnx1Zk2psJ5Jhbj5HOjOZAXsR', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text='\n\nI am an AI and do not have the')], created=1700359848, model='gpt-3.5-turbo-instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=10, prompt_tokens=6, total_tokens=16))

In [24]:

response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
  ],
  max_tokens = 10 
)

response

ChatCompletion(id='chatcmpl-8M1RwG2pP5w8Gq4KbC65oIat81bpr', choices=[Choice(finish_reason='length', index=0, message=ChatCompletionMessage(content='The Los Angeles Dodgers won the World Series in ', role='assistant', function_call=None, tool_calls=None))], created=1700259480, model='gpt-3.5-turbo-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=10, prompt_tokens=27, total_tokens=37))

In [49]:
abstract = df['Abstract_clean'][3]
drug_names = df['drug_names_multiple'][3]

prompt = f"Label from below abstract if the {drug_names} in below abstract is BCRP_substrate, BCRP_inhibitor  or Not defined only give me label Abstract - {abstract} "
response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
  ],
  max_tokens = 5,

)

response

ChatCompletion(id='chatcmpl-8M2D1lyTxLRYeKVX22gQfuRYvOHbO', choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content='BCRP_substrate', role='assistant', function_call=None, tool_calls=None))], created=1700262399, model='gpt-3.5-turbo-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=4, prompt_tokens=1010, total_tokens=1014))

In [50]:
response.model_dump()['choices'][0]['message']['content']

'BCRP_substrate'

In [51]:
import time

In [13]:
from tenacity import retry, stop_after_attempt, wait_random_exponential
from tqdm import tqdm
import time

# Function to classify drug names
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def classify_drug_names(abstract, drug_names):

    prompt = f"Classify if the {drug_names} in below abstract in three classes dont give any explanation, 1: if BCRP (inhibitor, transport inhibitor, inhibits, non transporter, unable to transport), 0: if BCRP (non inhibitor, substrate, transport substrate, transporter) or 99: (if we cannot specify), Abstract - {abstract}"

    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                # {"role": "system", "content": "You are a helpful assistant who can label my data"},
                {"role": "user", "content": prompt}
            ],
            max_tokens=1
        )
        # Extract the model's response correctly
        model_response = response.model_dump()['choices'][0]['message']['content']
        return model_response
    except Exception as e:
        # Handle the exception (you can customize this part based on the specific exception)
        print(f"An error occurred: {e}")
        return "Error occurred"

# Apply the classification function to each row with tqdm
tqdm.pandas()
sample['label_digits'] = sample.progress_apply(
    lambda row: classify_drug_names(row['Abstract_clean'], row['drug_name']), axis=1
)


 17%|█▋        | 87/500 [54:55<30:37:00, 266.88s/it]

An error occurred: Error code: 503 - {'error': {'code': 503, 'message': 'Service Unavailable.', 'param': None, 'type': 'cf_service_unavailable'}}


100%|██████████| 500/500 [5:25:55<00:00, 39.11s/it]    
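
A retry wrapper can only retry a call that raises. The sketch below illustrates that pattern in plain Python: the exception propagates through each attempt and is re-raised only after the last one. It is a hand-rolled stand-in for illustration, not Tenacity's API. `retry_with_backoff` and `flaky_classifier` are hypothetical names, and the `sleep` argument is injectable so the example runs without waiting.

```python
import random
import time


def retry_with_backoff(func, max_attempts=6, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    Between attempts, waits a random time up to an exponentially growing cap.
    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))


# Hypothetical stand-in for an API call that fails twice, then succeeds.
calls = {"n": 0}


def flaky_classifier():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("503 Service Unavailable")
    return "1"


waits = []  # record the requested delays instead of sleeping
label = retry_with_backoff(flaky_classifier, sleep=waits.append)
print(label, calls["n"], len(waits))
```

With this shape, a fallback value such as `"Error occurred"` would be assigned by the caller once all attempts are exhausted, not inside the function being retried.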


In [14]:
# df_filtered_1_drug_labeled.to_csv('df_filtered_1_drug.csv',index=False)
sample.to_csv('sample_1_drug.csv',index=False)

In [8]:
sample = pd.read_csv('sample_1_drug.csv',index_col=False)

In [9]:
sample.head()

Unnamed: 0,PMC_ID,Abstract,Abstract_clean,drug_names_multiple,no_of_drugs,no_of_words,no_of_tokens,drug_name,label_digits
0,36410675,Pharmacol Res. 2023 Jan;187:106558. doi: 10.10...,Pharmacol Res. 2023 Jan;187:106558. doi: 10.10...,['Tamoxifen'],1,498,1007,Tamoxifen,1
1,32699032,Cancer Discov. 2020 Oct;10(10):1475-1488. doi:...,Cancer Discov. 2020 Oct;10(10):1475-1488. doi:...,['Aspirin'],1,607,1136,Aspirin,99
2,31908168,Nano Lett. 2020 Feb 12;20(2):1183-1191. doi: 1...,Nano Lett. 2020 Feb 12;20(2):1183-1191. doi: 1...,['Fibronectin'],1,347,658,Fibronectin,99
3,36418168,J Nucl Med. 2023 May;64(5):724-730. doi: 10.29...,J Nucl Med. 2023 May;64(5):724-730. doi: 10.29...,['Trastuzumab'],1,575,1005,Trastuzumab,99
4,37011773,Toxicol Lett. 2023 May 1;380:23-30. doi: 10.10...,Toxicol Lett. 2023 May 1;380:23-30. doi: 10.10...,['Mitoxantrone'],1,403,768,Mitoxantrone,0


In [10]:
df_filtered_1_drug = sample

In [11]:
df_filtered_1_drug['label_digits'].value_counts()

label_digits
99                256
1                 155
0                  83
BC                  2
Unfortunately       1
Error occurred      1
AB                  1
C                   1
Name: count, dtype: int64
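
The counts above include a few replies that are not valid labels (`BC`, `Unfortunately`, `AB`, `C`, `Error occurred`). A small validation step could map every reply either to one of the three expected labels or to `None`, so that invalid rows are easy to find and re-run. This is a sketch; `normalize_label` is a hypothetical helper and the sample replies are taken from the value counts above.

```python
VALID_LABELS = {"0", "1", "99"}


def normalize_label(raw):
    """Return '0', '1' or '99' if the reply is exactly one of them, else None."""
    if raw is None:
        return None
    candidate = str(raw).strip()  # also handles integers read back from a CSV
    return candidate if candidate in VALID_LABELS else None


replies = ["99", "1", " 0 ", "BC", "Unfortunately", "Error occurred", 1]
normalized = [normalize_label(reply) for reply in replies]
print(normalized)
```

Filtering on `normalized is not None` would give the same rows for labels `0` and `1` as the `isin(['1','0'])` filter used in the notebook, while keeping `99` available as an explicit class.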

In [12]:
df_filtered_1_drug['drug_name'].value_counts()

drug_name
Doxorubicin    60
Tamoxifen      58
Trastuzumab    31
Paclitaxel     29
Cisplatin      28
               ..
Losartan        1
Lazertinib      1
Cimetidine      1
Cordycepin      1
Artemisinin     1
Name: count, Length: 142, dtype: int64

In [13]:
df_clean = df_filtered_1_drug[df_filtered_1_drug['label_digits'].isin(['1','0'])]

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Countplot for drug names with multiple labels
plt.figure(figsize=(12, 6))
sns.countplot(x='drug_name', hue='label_digits', data=df_clean)
plt.title('Drug Name vs Labels')
plt.xlabel('Drug Name')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.legend(title='Label Digits')
plt.show()


# Creating smiles and references dataframe

In [14]:
df_smiles_subset =df_clean[['PMC_ID','drug_name','label_digits']].reset_index(drop=True)
df_smiles_subset

Unnamed: 0,PMC_ID,drug_name,label_digits
0,36410675,Tamoxifen,1
1,37011773,Mitoxantrone,0
2,35007696,Primaquine,1
3,32601199,Lapatinib,1
4,36449545,Stavudine,1
...,...,...,...
233,33524223,Doxorubicin,1
234,35798092,Tariquidar,0
235,32805491,Gefitinib,0
236,32695297,Methotrexate,1


#### Resolving drug names labeled as both classes by reading the abstracts manually, with FDA labels as a reference

In [15]:
# Count the occurrences of each label_digits within each drug_name group
label_counts = df_smiles_subset.groupby(['drug_name', 'label_digits']).size().unstack(fill_value=0)

# Identify the majority class for each drug_name
majority_class = label_counts.idxmax(axis=1)

# Keep only the rows whose label matches the majority class for that drug.
# On a tie, idxmax returns the first column ('0'), so ties resolve to class 0.
is_majority = df_smiles_subset['label_digits'] == df_smiles_subset['drug_name'].map(majority_class)
df_smiles_subset_filtered = df_smiles_subset[is_majority].reset_index(drop=True)

# Print the result
df_smiles_subset_filtered 

Unnamed: 0,PMC_ID,drug_name,label_digits
0,36410675,Tamoxifen,1
1,37011773,Mitoxantrone,0
2,35007696,Primaquine,1
3,32601199,Lapatinib,1
4,36449545,Stavudine,1
...,...,...,...
179,33524223,Doxorubicin,1
180,35798092,Tariquidar,0
181,32805491,Gefitinib,0
182,32695297,Methotrexate,1
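
The majority rule can be sketched without pandas to make the tie behaviour explicit. In the pandas version, `idxmax` returns the first column holding the maximum, which is `'0'` when both labels have equal counts. The names `DrugA` and `DrugB` and their labels below are made-up placeholders, and `majority_label` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def majority_label(labels, tie_winner="0"):
    """Return the most frequent label; on a tie prefer tie_winner if it is tied."""
    counts = Counter(labels)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    return tie_winner if tie_winner in winners else sorted(winners)[0]


# Placeholder rows: (drug name, label from one abstract)
rows = [("DrugA", "1"), ("DrugA", "1"), ("DrugA", "0"),
        ("DrugB", "0"), ("DrugB", "1")]

labels_by_drug = defaultdict(list)
for name, label in rows:
    labels_by_drug[name].append(label)

majority = {name: majority_label(labels) for name, labels in labels_by_drug.items()}

# Keep only the rows that agree with the majority label of their drug
kept = [(name, label) for name, label in rows if label == majority[name]]
print(majority)
print(kept)
```

Here `DrugA` keeps its two rows labeled `1`, and the tied `DrugB` keeps only its row labeled `0`.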


In [273]:
# Assuming df_smiles_subset is your DataFrame
# and pmcid_list is the list of PMC_ID values you want to drop
pmcid_list = ['36386188', '24660104','34830383','22335402','22778859']

# Use the isin method to filter out rows with PMC_ID in pmcid_list
df_smiles_subset_filtered = df_smiles_subset[~df_smiles_subset['PMC_ID'].isin(pmcid_list)].reset_index(drop=True)

In [274]:
df_smiles_subset_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74 entries, 0 to 73
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   PMC_ID        74 non-null     object
 1   drug_name     74 non-null     object
 2   label_digits  74 non-null     object
 3   Reference     74 non-null     object
dtypes: object(4)
memory usage: 2.4+ KB


In [27]:
len(filtered_df['drug_name'].unique())

88

In [16]:
df_smiles_subset_filtered['Reference'] = 'Ref ' + (df_smiles_subset_filtered.index + 1).astype(str)

In [17]:
df_smiles_subset_filtered

Unnamed: 0,PMC_ID,drug_name,label_digits,Reference
0,36410675,Tamoxifen,1,Ref 1
1,37011773,Mitoxantrone,0,Ref 2
2,35007696,Primaquine,1,Ref 3
3,32601199,Lapatinib,1,Ref 4
4,36449545,Stavudine,1,Ref 5
...,...,...,...,...
179,33524223,Doxorubicin,1,Ref 180
180,35798092,Tariquidar,0,Ref 181
181,32805491,Gefitinib,0,Ref 182
182,32695297,Methotrexate,1,Ref 183


In [18]:
# One-hot encode the Reference codes (one column per reference)
df_pmc_onehot = pd.get_dummies(df_smiles_subset_filtered['Reference'])

# Attach the one-hot columns to the drug name and label columns
df_combined = pd.concat([df_smiles_subset_filtered[['drug_name', 'label_digits']], df_pmc_onehot], axis=1)

# Collapse to one row per (drug_name, label_digits); summing the one-hot columns
# marks every reference that supports that drug and label
df_result = df_combined.groupby(['drug_name', 'label_digits']).sum().reset_index()

df_result.head()

Unnamed: 0,drug_name,label_digits,Ref 1,Ref 10,Ref 100,Ref 101,Ref 102,Ref 103,Ref 104,Ref 105,...,Ref 90,Ref 91,Ref 92,Ref 93,Ref 94,Ref 95,Ref 96,Ref 97,Ref 98,Ref 99
0,Abiraterone,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Alectinib,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Ampicillin,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Aspirin,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Atorvastatin,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
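
The one-hot and group-by steps turn a table with one row per reference into a table with one row per drug and label, where each `Ref` column is 1 if that reference supports the row. The same transformation is sketched below in plain Python on made-up placeholder rows (`DrugA`, `DrugB`) so that the result can be read directly.

```python
# Placeholder rows: (drug name, label, reference code)
rows = [("DrugA", "1", "Ref 1"),
        ("DrugB", "0", "Ref 2"),
        ("DrugA", "1", "Ref 3")]

ref_columns = [ref for _, _, ref in rows]

# One indicator dictionary per (drug, label) pair, initialised to all zeros
matrix = {}
for name, label, ref in rows:
    indicator = matrix.setdefault((name, label), {col: 0 for col in ref_columns})
    indicator[ref] += 1

for key, indicator in matrix.items():
    print(key, indicator)
```

`DrugA` ends up with a 1 under both `Ref 1` and `Ref 3`, which is how a single row of the training table can cite several abstracts.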


#### Function to get smiles 
The `get_cid_and_smiles` function queries the PubChem PUG REST API by compound name and reads the first matching record. From that record it extracts the PubChem Compound Identifier (CID) and the SMILES (Simplified Molecular Input Line Entry System) string stored under the label `SMILES` with the name `Canonical`. If the name is not found or the response cannot be decoded, it returns `(None, None)`, as happens further down for the monoclonal antibodies and for fibronectin. The example fetches the CID and SMILES for "aspirin" and prints them.

In [20]:
import requests
import json,time
# from tenacity import retry, stop_after_attempt, wait_fixed

# @retry(stop=stop_after_attempt(3), wait=wait_fixed(5))
def get_cid_and_smiles(drug_name):
    base_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound"
    
    # Step 1: Search for CID using drug name
    search_url = f"{base_url}/name/{drug_name}/json"
    response = requests.get(search_url)
    
    try:
        data = response.json()

        # Print the entire response content for debugging
        # print(f"Response : {response} \n Response Content:{response.content}")
        # Check if the response contains the expected structure
        if "PC_Compounds" in data and data["PC_Compounds"]:
            compound_info = data["PC_Compounds"][0]
            cid = compound_info.get("id", {}).get("id", {}).get("cid")

            # Look for the SMILES property in the properties list
            smiles_info = next((prop for prop in compound_info.get("props", []) if prop.get("urn", {}).get("label") == "SMILES" and prop.get("urn", {}).get("name") == "Canonical"), None)
            smiles = smiles_info.get("value", {}).get("sval") if smiles_info else None

            return cid, smiles
        else:
            print(f"No CID and SMILES found for drug name: {drug_name}")
            return None, None
    except json.JSONDecodeError:
        print(f"Error decoding JSON response for drug name: {drug_name}")
        return None, None

# Example usage:
drug_name = "aspirin"
cid, smiles = get_cid_and_smiles(drug_name)

if cid and smiles:
    print(f"Drug Name: {drug_name}")
    print(f"PubChem CID: {cid}")
    print(f"Canonical SMILES: {smiles}")


Drug Name: aspirin
PubChem CID: 2244
Canonical SMILES: CC(=O)OC1=CC=CC=C1C(=O)O


In [21]:
# Apply the function to the 'drug_name' column
df_result[['Pubchem_CID', 'SMILES']] = df_result['drug_name'].apply(lambda x: pd.Series(get_cid_and_smiles(x)))

No CID and SMILES found for drug name: Bevacizumab
No CID and SMILES found for drug name: Brazikumab
No CID and SMILES found for drug name: Cetuximab
No CID and SMILES found for drug name: Fibronectin
No CID and SMILES found for drug name: Nivolumab
No CID and SMILES found for drug name: Trastuzumab
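
Two details of the lookup can be checked offline: building the request URL for names that contain spaces or other special characters, and pulling the CID and SMILES out of a decoded record. The sketch below assumes the record layout that `get_cid_and_smiles` already relies on. `build_name_url` and `extract_cid_and_smiles` are hypothetical helpers, and `sample_record` is a hand-written, heavily trimmed stand-in that reuses the aspirin values printed above.

```python
from urllib.parse import quote

BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound"


def build_name_url(drug_name):
    """Percent-encode the name so spaces and slashes do not break the URL path."""
    return f"{BASE_URL}/name/{quote(drug_name, safe='')}/json"


def extract_cid_and_smiles(record):
    """Return (cid, smiles) from a decoded record, or (None, None) if absent."""
    compounds = record.get("PC_Compounds") or []
    if not compounds:
        return None, None
    compound = compounds[0]
    cid = compound.get("id", {}).get("id", {}).get("cid")
    smiles = None
    for prop in compound.get("props", []):
        urn = prop.get("urn", {})
        if urn.get("label") == "SMILES" and urn.get("name") == "Canonical":
            smiles = prop.get("value", {}).get("sval")
            break
    return cid, smiles


# Hand-written, trimmed stand-in for a decoded reply.
sample_record = {
    "PC_Compounds": [{
        "id": {"id": {"cid": 2244}},
        "props": [{
            "urn": {"label": "SMILES", "name": "Canonical"},
            "value": {"sval": "CC(=O)OC1=CC=CC=C1C(=O)O"},
        }],
    }]
}

print(build_name_url("folic acid"))
print(extract_cid_and_smiles(sample_record))
print(extract_cid_and_smiles({"Fault": {"Message": "not found"}}))
```

Separating the parsing from the HTTP call also makes it straightforward to wrap only the request in a retry, as the commented-out Tenacity decorator in the notebook suggests.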


In [22]:
df_result.head()

Unnamed: 0,drug_name,label_digits,Ref 1,Ref 10,Ref 100,Ref 101,Ref 102,Ref 103,Ref 104,Ref 105,...,Ref 92,Ref 93,Ref 94,Ref 95,Ref 96,Ref 97,Ref 98,Ref 99,Pubchem_CID,SMILES
0,Abiraterone,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,132971.0,CC12CCC(CC1=CCC3C2CCC4(C3CC=C4C5=CN=CC=C5)C)O
1,Alectinib,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,49806720.0,CCC1=CC2=C(C=C1N3CCC(CC3)N4CCOCC4)C(C5=C(C2=O)...
2,Ampicillin,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,6249.0,CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O...
3,Aspirin,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2244.0,CC(=O)OC1=CC=CC=C1C(=O)O
4,Atorvastatin,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,60823.0,CC(C)C1=C(C(=C(N1CCC(CC(CC(=O)O)O)O)C2=CC=C(C=...


#### Cleaning the dataframe into the required submission format

In [23]:
df_result = df_result.rename(columns={'drug_name': 'Name', 'label_digits': 'Activity'})

# Map values in the 'Activity' column to 'BCRP_inhibitor' and 'BCRP_substrate'
df_result['Activity'] = df_result['Activity'].map({'1': 'BCRP_inhibitor', '0': 'BCRP_substrate'})

In [24]:
df_result.columns

Index(['Name', 'Activity', 'Ref 1', 'Ref 10', 'Ref 100', 'Ref 101', 'Ref 102',
       'Ref 103', 'Ref 104', 'Ref 105',
       ...
       'Ref 92', 'Ref 93', 'Ref 94', 'Ref 95', 'Ref 96', 'Ref 97', 'Ref 98',
       'Ref 99', 'Pubchem_CID', 'SMILES'],
      dtype='object', length=188)

In [25]:
# Order the columns: Name, Activity, SMILES, the Ref columns in numeric order, then Pubchem_CID
column_order = ['Name', 'Activity', 'SMILES'] + \
                sorted([col for col in df_result.columns if 'Ref' in col], key=lambda x: int(x.split(' ')[-1])) + \
                ['Pubchem_CID']

# Reorder the columns
df_reordered = df_result[column_order]

In [26]:
df_reordered.head()

Unnamed: 0,Name,Activity,SMILES,Ref 1,Ref 2,Ref 3,Ref 4,Ref 5,Ref 6,Ref 7,...,Ref 176,Ref 177,Ref 178,Ref 179,Ref 180,Ref 181,Ref 182,Ref 183,Ref 184,Pubchem_CID
0,Abiraterone,BCRP_substrate,CC12CCC(CC1=CCC3C2CCC4(C3CC=C4C5=CN=CC=C5)C)O,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,132971.0
1,Alectinib,BCRP_inhibitor,CCC1=CC2=C(C=C1N3CCC(CC3)N4CCOCC4)C(C5=C(C2=O)...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,49806720.0
2,Ampicillin,BCRP_inhibitor,CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,6249.0
3,Aspirin,BCRP_inhibitor,CC(=O)OC1=CC=CC=C1C(=O)O,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2244.0
4,Atorvastatin,BCRP_substrate,CC(C)C1=C(C(=C(N1CCC(CC(CC(=O)O)O)O)C2=CC=C(C=...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,60823.0
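
The `key` function in the column ordering matters because the reference columns come out of `get_dummies` in string order (`Ref 1`, `Ref 10`, `Ref 100`, ...). A short illustration of the difference between string order and numeric order follows, using a handful of column names of the same form.

```python
columns = ["Ref 1", "Ref 10", "Ref 100", "Ref 2", "Ref 99"]

# Default sort compares strings character by character
lexicographic = sorted(columns)

# Sorting on the integer after the space gives the intended order
numeric = sorted(columns, key=lambda col: int(col.split(" ")[-1]))

print(lexicographic)
print(numeric)
```

The second list is `Ref 1, Ref 2, Ref 10, Ref 99, Ref 100`, matching the order seen in the reordered table.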


In [31]:
import re
cols_to_replace = [col for col in df_result.columns if 'Ref' in col]

# Custom sorting function
def extract_number(ref):
    return int(re.search(r'\d+', ref).group())

sorted_columns = sorted(cols_to_replace, key=extract_number)

df_reordered[sorted_columns] = df_reordered[sorted_columns]

In [32]:
df_reordered.to_csv('training_data.tsv',index = False,sep='\t')

#### Creating references dataframe

In [33]:
references = df_smiles_subset_filtered[['Reference','PMC_ID']].rename(columns={'Reference': 'Code', 'PMC_ID': 'Source PMID'})
references.to_csv('references.tsv',index = False,sep='\t')

#### Creating metrics_summary

In [9]:
training_data = pd.read_csv('/root/capsule/code/Final_Workflow_Increased_Data/training_data.tsv',sep = '\t',index_col =False)
references = pd.read_csv('/root/capsule/code/Final_Workflow_Increased_Data/references.tsv',sep = '\t',index_col =False)

In [10]:
team_name = 'Data_Maniac'
team_contact = ('axp210092@utdallas.edu', 'akshatpatil7@gmail.com')
num_samples = training_data['Name'].count()
num_references = references['Code'].count()
num_inhibitor = (training_data['Activity'] == 'BCRP_inhibitor').sum()
num_substrate = (training_data['Activity'] == 'BCRP_substrate').sum()
percent_minority = min(num_inhibitor,num_substrate)/num_samples
num_smiles = training_data['SMILES'].count()
percent_smiles = num_smiles/num_samples

In [11]:
# Create a summary DataFrame
metrics_summary = pd.DataFrame({
    'team_name': [team_name],
    'team_contact': [team_contact],
    'num_samples': [num_samples],
    'num_references': [num_references],
    'num_inhibitor': [num_inhibitor],
    'num_substrate': [num_substrate],
    'percent_minority': [percent_minority],
    'num_smiles': [num_smiles],
    'percent_smiles': [percent_smiles]
})

In [12]:
metrics_summary

Unnamed: 0,team_name,team_contact,num_samples,num_references,num_inhibitor,num_substrate,percent_minority,num_smiles,percent_smiles
0,Data_Maniac,"(axp210092@utdallas.edu, akshatpatil7@gmail.com)",88,184,51,37,0.420455,82,0.931818
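
The two ratio columns can be checked by hand from the counts in the table. Note that, despite their names, `percent_minority` and `percent_smiles` are fractions between 0 and 1 and not percentages. The counts below are copied from the output above.

```python
num_samples = 88
num_inhibitor = 51
num_substrate = 37
num_smiles = 82

# Share of the smaller class and share of rows with a SMILES string
percent_minority = min(num_inhibitor, num_substrate) / num_samples
percent_smiles = num_smiles / num_samples

print(num_inhibitor + num_substrate == num_samples)
print(round(percent_minority, 6), round(percent_smiles, 6))
```

The rounded values, 0.420455 and 0.931818, agree with the summary row, and the two class counts add up to the number of samples.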


In [8]:
metrics_summary.to_csv('metrics_summary.tsv',index = False,sep='\t')