# Elsevier Shadow Health Transcript Scraper 
Author: Muhammad Ashiq <br>
This script will scrape the necessary information from the transcripts found in the HTML of the Shadow Health app. 

## Limitations 
The scraper cannot fill in everything, because much of the data entry requires human judgement. <br>
As such, it is still necessary to: <br>
a) manually enter exam action data (e.g. the student answers to the exam found in Objective Data Collection) <br>
b) manually enter documentation data <br>
c) add other system interactions outside of clarifications <br>

However, participants, timestamps, most interaction types, and utterances (including clarifications from the system for the student) can be successfully scraped. A simple Google Sheets rule can classify actions as either Subjective Data Collection or Objective Data Collection. The rest of the metadata can be entered easily in Google Sheets. <br>

## Installations
Dependencies required include: <br>
&ensp;requests<br>
&ensp;html5lib<br>
&ensp;BeautifulSoup (bs4)<br>
&ensp;pandas<br>
&ensp;python-dateutil (imported as dateutil)<br>
&ensp;datetime (part of the Python standard library, so no installation is needed)<br>
These can be installed via: <br>
&ensp;pip3 install requests<br>
&ensp;pip3 install html5lib<br>
&ensp;pip3 install bs4<br>
&ensp;pip3 install pandas<br>
&ensp;pip3 install python-dateutil<br>
on your command line, PowerShell, or terminal. <br>
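
To confirm the installation worked before running the notebook, a quick check along these lines may help. It is only a sketch: the helper name `missing_modules` is made up for illustration, and the list uses the import names of the packages (for example `bs4` and `dateutil`), which differ from some of the pip names.

```python
import importlib.util

# Import names (not pip package names) of the third-party modules the script uses.
REQUIRED_MODULES = ["requests", "html5lib", "bs4", "pandas", "dateutil"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```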

## Steps 
Before running, get: <br>
a) The URL of your transcript, corresponding to the student you'd like to make a sheet for<br> 
b) Your username and password. Be sure not to share this with anyone else! <br>
c) The file path you want to save the resulting CSV in <br>
d) The name of the student you'd like to transcribe <br>
e) The name of the patient you'd like to transcribe <br>

Then, <br>
1) Update the transcript URL, your username and password, the name of the transcript student, the name of the patient, and the file path in the code below. Locations to change code are denoted with # CHANGE THIS. <br>
2) Rerun the script<br>
3) Import the CSV into the Google Sheet (TMA-SH)<br>
4) Update necessary areas of the sheet (see Limitations)<br>
5) Enter necessary metadata<br>
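
Since the username and password should not be shared, one option is to keep them out of the notebook entirely. The sketch below is an assumption rather than part of the original script: the environment variable names `SHADOW_HEALTH_USERNAME` and `SHADOW_HEALTH_PASSWORD` and the helper `load_credentials` are hypothetical, and the function falls back to prompting when a variable is not set.

```python
import os
from getpass import getpass


def load_credentials(env=os.environ):
    """Read credentials from the environment, prompting for any that are missing.

    The variable names below are hypothetical; pick whatever names suit your setup.
    """
    username = env.get("SHADOW_HEALTH_USERNAME")
    password = env.get("SHADOW_HEALTH_PASSWORD")
    if not username:
        username = input("Shadow Health email: ")
    if not password:
        # getpass hides the password while it is typed.
        password = getpass("Shadow Health password: ")
    return username, password


# Usage in the notebook (instead of hard-coding the values):
# username, password = load_credentials()
```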

In [69]:
# Necessary imports for code
import requests 
import html5lib  
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
from dateutil import parser
print("Imports successful!")

Imports successful!


In [70]:
# URL of the authentication page
auth_url = "https://app.shadowhealth.com/users/sign_in"
# URL of the locked web server page; this is the link to the transcript 
locked_url = "https://app.shadowhealth.com/assignment_attempts/13096835" # CHANGE THIS (Student-Patient)

# Credentials for authentication
username = "[REDACTED]" # CHANGE THIS
password = "[REDACTED]" # CHANGE THIS 

# Begin a new session 
session = requests.Session()

# Send a GET request to the authentication URL to get the authenticity token
auth_response = session.get(auth_url)
auth_soup = BeautifulSoup(auth_response.content, 'html.parser')
auth_token = auth_soup.find('input', {'name': 'authenticity_token'}).get('value')

# Prepare the login data with the credentials and authenticity token
login_data = {
    'authenticity_token': auth_token,
    'user[email]': username,
    'user[password]': password,
    'commit': 'Sign in'
}

# Send a POST request to authenticate
session.post(auth_url, data=login_data)

# Send a GET request to the target URL with the authenticated session
response = session.get(locked_url)

# Check if the request was successful and print the response content
if response.status_code == 200:
    print("Request successful!")
    # print(response.text)
else:
    print("Request failed with status code:", response.status_code)


Request successful!
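
A 200 status code alone may not prove that the login worked, because a site can answer a failed login by serving the sign-in page again with status 200. The sketch below is a hedged extra check, not part of the original script. It assumes that a failed login ends up back on the `/users/sign_in` path and that a real transcript page contains the `transcript-table` class the scraper looks for; the helper name `login_appears_successful` is hypothetical.

```python
from urllib.parse import urlparse

# Assumption: an unauthenticated request is redirected back to this path.
SIGN_IN_PATH = "/users/sign_in"


def login_appears_successful(final_url, html):
    """Heuristic check that we reached a transcript page rather than the login form."""
    path = urlparse(final_url).path.rstrip("/")
    if path == SIGN_IN_PATH:
        return False
    # The scraper searches for a table with this class, so its absence is a bad sign.
    return "transcript-table" in html


# Usage with the objects from the cell above:
# if not login_appears_successful(response.url, response.text):
#     print("Login may have failed; check the credentials and the transcript URL.")
```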


In [71]:

# Initialize a new BeautifulSoup object with the response text
soup = BeautifulSoup(response.text, 'html.parser')

# Find the transcript table
table = soup.find('table', class_='transcript-table')

# Initialize empty lists for each column
participants = []
timestamps = []
interaction_types = []
utterances = []

# Iterate over each row in the table
for row in table.find_all('tr'):
    # Get the participant (guard against rows that have no participant cell)
    participant_cell = row.find('td', class_='align-top')
    participant_img = participant_cell.find('img') if participant_cell else None
    if participant_img:
        participant = participant_img.get('alt')
    else:
        participant = 'Unknown'  # Handle cases where participant is not specified
    participants.append(participant)

    # Get the timestamp
    timestamp_p = row.find('p', class_='topic')
    if timestamp_p:
        timestamp = timestamp_p.find('br').next_sibling.strip()
    else:
        # Rows without their own timestamp reuse the previous row's timestamp
        timestamp = timestamps[-1] if timestamps else ''
    timestamps.append(timestamp)

    # Get the interaction type
    interaction_type_span = row.find('span', class_='strong')
    if interaction_type_span:
        interaction_type = interaction_type_span.get_text()
    else:
        interaction_type = ''
    interaction_types.append(interaction_type)

    # Get the utterance
    utterance_div = row.find('div', class_='line')
    if utterance_div:
        utterance = utterance_div.get_text().strip()
    else:
        utterance = ''
    utterances.append(utterance)

# Create the dataframe
data = {
    'Participant': participants,
    'Timestamp': timestamps,
    'Interaction Type': interaction_types,
    'Utterances': utterances
}
df = pd.DataFrame(data)


# Replace 'Your Avatar.' with the student's name in the Participant column
df.loc[df['Participant'] == 'Your Avatar.', 'Participant'] = 'Sunayana' # CHANGE THIS (Student)

# Remove the trailing period from the patient's name (e.g. 'Lucas Callahan.' becomes 'Lucas Callahan')
df.loc[df['Participant'] == 'Lucas Callahan.', 'Participant'] = 'Lucas Callahan' # CHANGE THIS (Patient)

# Forward-fill any patient timestamps that are still blank (denoted as '') from the preceding patient row
df.loc[df['Participant'] == 'Lucas Callahan', 'Timestamp'] = df.loc[df['Participant'] == 'Lucas Callahan', 'Timestamp'].replace('', pd.NA).ffill() # CHANGE THIS (Patient)

# Set the Interaction Type of every patient row to Response
df.loc[df['Participant'] == 'Lucas Callahan', 'Interaction Type'] = 'Response' # CHANGE THIS (Patient)

# When a student utterance contains "(Clarified to ...)", remove that part from the utterance and add a new row
# with interaction type "Clarification" and participant System, using the same timestamp as the utterance.
# Each clarification row is placed directly after the utterance it was taken from.
# The rows are rebuilt in a list so that inserting a row never shifts the positions of rows still to be processed.
rows = []
for _, row in df.iterrows():
    record = row.to_dict()
    rows.append(record)
    if record['Participant'] == 'Sunayana' and '(Clarified to' in record['Utterances']: # CHANGE THIS (Student)
        utterance, _sep, remainder = record['Utterances'].partition('(Clarified to')
        record['Utterances'] = utterance
        rows.append({
            'Participant': 'System',
            'Timestamp': record['Timestamp'],
            'Interaction Type': 'Clarification',
            'Utterances': remainder.split(')')[0]
        })
df = pd.DataFrame(rows).reset_index(drop=True)

# Parse the raw timestamp strings into datetime objects.
df['Timestamp'] = df['Timestamp'].apply(lambda x: parser.parse(x))
# df = df.sort_values(by=['Timestamp'])

# Format the timestamps as MM/DD/YYYY with a 12-hour clock, e.g. 10/02/2022 11:08:00 AM
df['Timestamp'] = df['Timestamp'].apply(lambda x: x.strftime('%m/%d/%Y %I:%M:%S %p'))

# Append the time zone label. It is hard-coded, so adjust it if the transcript was not recorded in EDT.
df['Timestamp'] = df['Timestamp'] + ' EDT'

# Change all interaction types that are "Empathize" or "Educate" to "Statement"
df.loc[df['Interaction Type'] == 'Empathize', 'Interaction Type'] = 'Statement'
df.loc[df['Interaction Type'] == 'Educate', 'Interaction Type'] = 'Statement'

# Sort the dataframe by index
df = df.sort_index()
# Print the dataframe
df




Unnamed: 0,Participant,Timestamp,Interaction Type,Utterances
0,Sunayana,10/02/2022 11:08:00 AM EDT,Exam Action,Assessed vitals
1,Sunayana,10/02/2022 11:09:00 AM EDT,Exam Action,Inspected right eye
2,Sunayana,10/02/2022 11:09:00 AM EDT,Exam Action,Inspected left eye
3,Sunayana,10/02/2022 11:09:00 AM EDT,Exam Action,Performed otoscopic examination of right naris
4,Sunayana,10/02/2022 11:09:00 AM EDT,Exam Action,Performed otoscopic examination of left naris
...,...,...,...,...
241,Lucas Callahan,10/02/2022 01:13:00 PM EDT,Response,"Listen, I KNOW Im at risk for injury! Remember..."
242,Sunayana,10/02/2022 01:14:00 PM EDT,Statement,Okay. What is the fear of government?
243,Lucas Callahan,10/02/2022 01:14:00 PM EDT,Response,"Right, right, you gotta keep me safe like a li..."
244,Sunayana,10/02/2022 01:14:00 PM EDT,Statement,The intervention i have is that supervision o...
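
To make the parsing step easier to test without logging in, the row extraction can be tried on a small hand-written table. The sketch below is illustrative only: `SAMPLE_HTML` is a made-up fragment whose structure is inferred from the selectors used in the cell above (it is not copied from the real Shadow Health page), and `parse_transcript` is a hypothetical helper that mirrors the same logic with guards for missing elements.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, inferred from the selectors the scraper uses.
SAMPLE_HTML = """
<table class="transcript-table">
  <tr>
    <td class="align-top"><img alt="Your Avatar."></td>
    <td>
      <p class="topic"><span class="strong">Question</span><br>10/02/22 11:08 AM</p>
      <div class="line">How are you feeling today?</div>
    </td>
  </tr>
  <tr>
    <td class="align-top"><img alt="Lucas Callahan."></td>
    <td><div class="line">I feel fine.</div></td>
  </tr>
</table>
"""


def parse_transcript(html):
    """Return one dict per table row; rows without a timestamp reuse the previous one."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="transcript-table")
    if table is None:
        return []
    rows = []
    last_timestamp = ""
    for tr in table.find_all("tr"):
        cell = tr.find("td", class_="align-top")
        img = cell.find("img") if cell is not None else None
        participant = img.get("alt", "Unknown") if img is not None else "Unknown"

        # The timestamp is the text that follows the <br> inside <p class="topic">.
        topic = tr.find("p", class_="topic")
        br = topic.find("br") if topic is not None else None
        if br is not None and br.next_sibling:
            last_timestamp = str(br.next_sibling).strip()

        kind = tr.find("span", class_="strong")
        line = tr.find("div", class_="line")
        rows.append({
            "Participant": participant,
            "Timestamp": last_timestamp,
            "Interaction Type": kind.get_text() if kind is not None else "",
            "Utterances": line.get_text().strip() if line is not None else "",
        })
    return rows


if __name__ == "__main__":
    for parsed_row in parse_transcript(SAMPLE_HTML):
        print(parsed_row)
```

Passing the resulting list to `pd.DataFrame` would give the same four columns as the dataframe built above.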


In [72]:
# Write resulting dataframe to a CSV in a local directory 
df.to_csv("~/Desktop/epistemic_analytics/mamta_elsevier/datasets/sunayana_lucas_transcript.csv") # CHANGE THIS (Student-Patient)