<a href="https://colab.research.google.com/github/chyylee/PythonDemos/blob/main/webscraping_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Webscraping From clinicaltrials.gov
Recently we had a company give a talk on what data analysis can look like in a typical life science consulting project. One topic that came up, and that was likely unfamiliar to many of our consultants, is webscraping.

In simple terms, webscraping means extracting data from websites, which is useful when pulling large amounts of data from online repositories or databases.

A common technique is to use a programming language called Python, which is available within Google's suite of tools through Google Colab.

Below I will walk through an example of how to webscrape from clinicaltrials.gov and save that data to be analyzed in Excel or other programming languages.



# Webscraping Basics

Python is flexible and can import packages, libraries, and/or modules that make certain tasks easier. Below we import several packages/modules to use for webscraping and data analysis.

In [None]:
import bs4 # for accessing web data
from collections import defaultdict # data processing
from bs4 import BeautifulSoup  # accessing web data
import requests # accessing web data 
import pandas as pd # data processing


One additional step we need is to mount Google Drive so that our code can read and write files there (not needed if you run Python locally on your computer).

In [None]:
from google.colab import drive
drive.mount('/drive')

Mounted at /drive
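
If the same notebook might be run both in Colab and locally, one option is to detect the environment before deciding where to save files. The sketch below is only an illustration: the helper names `in_colab` and `output_dir`, and the local fallback of the current directory, are assumptions and not part of the original notebook.

```python
def in_colab():
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (normally present only in Colab)
        return True
    except ImportError:
        return False


def output_dir():
    """Pick a folder for saved files (hypothetical helper)."""
    if in_colab():
        # Assumes Drive has been mounted at /drive as in the cell above
        return "/drive/My Drive/Colab Notebooks/"
    return "./"  # assumption: save next to the notebook when running locally


print(output_dir())
```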


After initializing these modules, we can write a function that takes in an NCT ID for a clinical trial.

In [None]:
# Function that takes an NCT ID from clinicaltrials.gov and returns a one-row dataframe
def clinicalTrialsGov(nctid, printflag=False):
    data = defaultdict(list)  # each key maps to a list of the values found for that tag

    # request the trial's XML record and parse it with BeautifulSoup
    soup = BeautifulSoup(requests.get("https://clinicaltrials.gov/ct2/show/" +
                                      nctid + "?displayxml=true").text, "xml")

    # Select information to collect
    subset = ['intervention_type', 'study_type', 'allocation',
              'intervention_model', 'primary_purpose', 'masking',
              'enrollment', 'official_title', 'condition', 'minimum_age',
              'maximum_age', 'gender', 'healthy_volunteers', 'phase',
              'primary_outcome', 'secondary_outcome', 'number_of_arms',
              'nct_id']

    # Loop through every tag named in subset and store its text in data
    for tag in soup.find_all(subset):
        data['ct{}'.format(tag.name.capitalize())].append(tag.get_text(strip=True))

    # Print the data if printflag is True (default is False)
    if printflag:
        for key in data:
            print('{}: {}'.format(key, ', '.join(data[key])))

    # Join each field's values into one string so every trial becomes exactly one row
    row = {key: ",".join(values) for key, values in data.items()}

    # Save the information in a dataframe
    df = pd.DataFrame([row])
    return df

Now we can input an NCT ID of interest and quickly view the fields we selected.

In [None]:
dout = clinicalTrialsGov('NCT02170532',True)

ctNct_id: NCT02170532
ctOfficial_title: Aerosolized Beta-Agonist Isomers in Asthma
ctPhase: Phase 4
ctStudy_type: Interventional
ctAllocation: Non-Randomized
ctIntervention_model: Crossover Assignment
ctPrimary_purpose: Treatment
ctMasking: None (Open Label)
ctPrimary_outcome: Change in Maximum Forced Expiratory Volume at One Second (FEV1)Baseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment
ctSecondary_outcome: Change in 8 Hour Area-under-the-curve FEV10 to 8 hours post dose, Change in Heart RateBaseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment, Change in Tremor Assessment Measured by a ScaleBaseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatmentTremor assessment will be made on outstretched hands (0 = none, 1+ = fine tremor, barely perceptible, 2+ = obvious tremor)., Change in Dyspnea Response as Measured by the University of California, San Diego (UCSD) Dyspnea ScaleBaseline (before treatment), 30 minute
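
In the printed output, nested text seems to run together (for example "(FEV1)Baseline"), which is what one would expect when the text of child tags is joined with no separator. The sketch below illustrates the same collect-and-join idea on a small made-up XML snippet, using the standard library's `xml.etree.ElementTree` in place of BeautifulSoup. The snippet, the `collect_fields` helper, and the " | " and "; " separators are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up XML in the general shape of a trial record (not real data)
SAMPLE_XML = """
<clinical_study>
  <nct_id>NCT00000000</nct_id>
  <condition>Asthma</condition>
  <condition>Allergy</condition>
  <primary_outcome>
    <measure>Change in FEV1</measure>
    <time_frame>Baseline, 30 minutes</time_frame>
  </primary_outcome>
</clinical_study>
"""


def collect_fields(xml_text, wanted):
    """Collect the text of every wanted tag, keeping repeated tags together."""
    root = ET.fromstring(xml_text.strip())
    data = defaultdict(list)
    for elem in root.iter():
        if elem.tag in wanted:
            # Join nested pieces with " | " so they do not run together
            parts = [t.strip() for t in elem.itertext() if t.strip()]
            data[elem.tag].append(" | ".join(parts))
    # One string per field, so each trial fits in a single table row
    return {key: "; ".join(values) for key, values in data.items()}


fields = collect_fields(SAMPLE_XML, {"nct_id", "condition", "primary_outcome"})
for name, value in fields.items():
    print(name, "->", value)
```

With BeautifulSoup itself, passing a separator as the first argument of `get_text` has a similar effect.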

# Pulling NCT IDs for a given medical condition or treatment
Webscraping is often most beneficial when there are set criteria that we would like to pull across many different studies. For example, what if we wanted to pull the top 100 search results for clinical trials pertaining to COVID-19?

Using the APIs set up by clinicaltrials.gov, this is relatively easy to implement in Python.

In [None]:
# Function to get NCT IDs for a certain medical condition.
# condition is the search term (e.g. "covid"), and numstud is the number of studies to return

# aside: xml is harder to understand than json - I'll do a json example later
def getStudyNCT(condition,numstud):
  base_api = 'https://www.clinicaltrials.gov/api/query/full_studies?expr='
  out_NCT = []
  url = base_api + condition.replace(" ", "+") + '&min_rnk=1&max_rnk=' + str(numstud) + '&fmt=xml'
  response = requests.get(url)
  soup = BeautifulSoup(response.content, 'lxml')
  study_list = soup.find_all("fullstudy") # the lxml parser lowercases tag names

  for study in study_list:
    nctid = study.find("field", {"name" : "NCTId"})
    if nctid is not None: # skip any study without an NCTId field
      out_NCT.append(nctid.get_text(strip=True))
  
  return out_NCT
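
The search URL in `getStudyNCT` is assembled by string concatenation. As a possible alternative, the standard library's `urllib.parse.urlencode` can build the same query string and takes care of escaping spaces and other special characters. The helper name `build_search_url` is hypothetical, and the sketch only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

BASE_API = "https://www.clinicaltrials.gov/api/query/full_studies"


def build_search_url(condition, numstud, fmt="xml"):
    """Return the search URL for a condition (hypothetical helper)."""
    params = {
        "expr": condition,    # search term, e.g. "covid vaccine"
        "min_rnk": 1,         # rank of the first result to return
        "max_rnk": numstud,   # rank of the last result to return
        "fmt": fmt,           # response format
    }
    # urlencode escapes each value; spaces become "+"
    return BASE_API + "?" + urlencode(params)


url = build_search_url("covid vaccine", 10)
print(url)
```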

Now we can call `getStudyNCT` to get a list of studies whose design and status we want to compile. Here we will just look at the first 10 results of the search.

In [None]:
NCT_lst = getStudyNCT('covid vaccine',10)
NCT_lst

['NCT05387343',
 'NCT05208983',
 'NCT04817657',
 'NCT05130320',
 'NCT05256602',
 'NCT04834726',
 'NCT05258760',
 'NCT05060354',
 'NCT04751734',
 'NCT05057936']
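
Every ID in this list has the form "NCT" followed by eight digits, so a quick sanity check can catch a malformed entry before it is passed on. A minimal sketch, where `valid_nct_ids` is a hypothetical helper and the sample list is made up apart from the first ID shown above:

```python
import re

# "NCT" followed by exactly eight digits
NCT_PATTERN = re.compile(r"NCT\d{8}")


def valid_nct_ids(candidates):
    """Keep only the strings that look like complete NCT IDs."""
    return [c for c in candidates if NCT_PATTERN.fullmatch(c)]


sample = ["NCT05387343", "None", "NCT123"]
checked = valid_nct_ids(sample)
print(checked)
```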

Going through each of these one at a time and compiling the data we wish to extract from each study would be cumbersome. Luckily, it is pretty easy to loop through a list of NCT IDs and save that information into a .csv for further analysis in Excel.

# Putting it all together: Scraping data for probiotics
We will pull the top 100 results returned for clinical trials pertaining to probiotics and save this information into a .csv file.

Call `getStudyNCT` to get a list of NCT IDs for probiotic clinical trials.

In [None]:
NCT_lst = getStudyNCT('probiotics',100)
NCT_lst[0:10] # preview first 10

['NCT03330678',
 'NCT01648075',
 'NCT01445704',
 'NCT05032027',
 'NCT02650869',
 'NCT05389033',
 'NCT04175392',
 'NCT02589964',
 'NCT05316064',
 'NCT04050189']

Call `clinicalTrialsGov` to then process the results for each trial.

In [None]:
# loop through generated list and collect one dataframe per NCT ID
frames = []
for ncid in NCT_lst:
  frames.append(clinicalTrialsGov(ncid))

# combine everything into a single dataframe
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df = pd.concat(frames, ignore_index=True)
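
Because trials do not all report the same fields, the one-row dataframes returned for different trials can have different columns. The small sketch below, using made-up rows rather than real trial data, shows how `pd.concat` lines the columns up by name and leaves a missing value where a trial lacks a field.

```python
import pandas as pd

# Made-up one-row dataframes standing in for the output of clinicalTrialsGov
trial_a = pd.DataFrame([{"ctNct_id": "NCT00000001", "ctPhase": "Phase 2"}])
trial_b = pd.DataFrame([{"ctNct_id": "NCT00000002", "ctEnrollment": "40"}])

# pd.concat aligns on column names and fills fields a trial lacks with NaN;
# ignore_index=True renumbers the rows 0, 1, ... in the combined table
combined = pd.concat([trial_a, trial_b], ignore_index=True)
print(combined)
```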

Now we can preview the compiled data and save it as a .csv. After running the code below, check your Colab Notebooks folder for the file.

In [None]:
# Save to a csv file
condit = 'probiotics'
fnm = "/drive/My Drive/Colab Notebooks/" + condit + "_data.csv"
df.to_csv(fnm)
df.head() # preview the results (placed last so the notebook displays it)
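
To continue the analysis in Python rather than Excel, the saved file can be read back with `pd.read_csv`. The sketch below uses a made-up table and an in-memory buffer in place of the file on Drive, so it runs without Colab; with a real file, the path `fnm` would be passed instead of the buffer.

```python
import io

import pandas as pd

# Made-up table standing in for the scraped probiotics data
df_demo = pd.DataFrame({"ctNct_id": ["NCT00000001", "NCT00000002"],
                        "ctPhase": ["Phase 2", None]})

buffer = io.StringIO()   # in-memory stand-in for the .csv file
df_demo.to_csv(buffer)   # to_csv writes the row index as the first column
buffer.seek(0)           # rewind so the contents can be read back

# index_col=0 tells pandas the first column is the saved index, not data
reloaded = pd.read_csv(buffer, index_col=0)
print(reloaded)
```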
