# <font color='violet'> Scrape Erowid for Narrative Experience Reports
    
Here, I'll create a dataframe out of information from Erowid, which has a large "experience vault," where there are thousands of narrative descriptions of psychoactive drugs that could be compared with ratings and reviews of prescription psych meds using the model I created based on more formal psychiatric studies. 

In [1]:
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
import pandas as pd
import json

from IPython.display import Audio
sound_file = './alert.wav'

<font color='violet'> Explore the basic structure of the front page of the experience vault.
    
Get a list of drug names, then use that list to extract links associated with those names. The links can then be followed to find the narratives connected to each drug. 

In [2]:
front_url = 'https://erowid.org/experiences/exp_list.shtml'
front_page = requests.get(front_url)
front_soup = BeautifulSoup(front_page.content, "html.parser")
front_pretty = front_soup.prettify().splitlines()
front_pretty[:500]

['<html>',
 ' <head>',
 '  <title>',
 '   Complete Substance and Category List : Erowid Experience Vaults',
 '  </title>',
 '  <meta content="The full list of substances and categories covered by Erowid\'s collection of first-hand experience reports with psychoactive plants and drugs." name="description"/>',
 '  <meta content="Experience Report Vaults, trip reports, stories, descriptions" name="keywords"/>',
 '  <link href="/includes/general_default.css" rel="stylesheet" type="text/css"/>',
 '  <link href="includes/exp.css" rel="stylesheet" type="text/css"/>',
 '  <script src="/includes/javascript/jquery-3.2.1.min.js" type="text/javascript">',
 '  </script>',
 '  <script language="javascript" src="/includes/javascript/erowid_combobox_lib.js" type="text/javascript">',
 '  </script>',
 '  <script language="javascript" src="/includes/javascript/external/mobile-detect.min.js" type="text/javascript">',
 '  </script>',
 ' </head>',
 ' <body alink="#008080" bgcolor="#000000" link="#7777AA" te

It appears as though the name attribute, when inside an a element, only connects with a letter of the alphabet or with drug names. I can get a list of all drugs by pulling the contents of the name attributes into a list and then just deleting the single letters "A," "B," "C," etc.

In [3]:
a_element_name = front_soup.find_all('a', attrs={"name": True})
drug_names = []
for result in a_element_name:
    drug_names.append(result.attrs['name'])
drug_names

['A',
 'AB-001',
 'AB-CHMINACA',
 'AB-FUBINACA',
 'Absinthe',
 'Acacia',
 'Acacia confusa',
 'Acacia maidenii',
 'Acacia phlebophylla',
 'Acepromazine',
 'Acetaminophen',
 'Acetildenafil',
 'Acetylfentanyl',
 'Aconitum napellus',
 'ADB-FUBINACA',
 'Adrafinil',
 'Adrenochrome',
 'AET',
 'AH-7921',
 'AL-LAD',
 'Albizia julibrissin',
 'Alcohol',
 'Alcohol - Beer/Wine',
 'Alcohol - Hard',
 'ALD-52',
 'ALEPH',
 'Aleph-4',
 'Allylescaline',
 'Aloes',
 'alpha-GPC',
 'alpha-PCYP',
 'alpha-PHiP',
 'alpha-PHP',
 'alpha-PVP',
 'AM-2201',
 'AM-DIPT',
 'Amanitas',
 'Amanitas - A. muscaria',
 'Amanitas - A. pantherina',
 'Amphetamines',
 'Amphetamines - Substituted',
 'AMT',
 'Anabolic Steroids',
 'Anadenanthera colubrina',
 'Anadenanthera peregrina',
 'Anadenanthera spp.',
 'Animals',
 'Animals - Black Widow Spider',
 'Animals - Fire Ants',
 'Animals - Frogs',
 'Aniracetam',
 'AP-238',
 'Argemone spp. ',
 'Armodafinil',
 'Arundo donax',
 'Arylcyclohexylamines',
 'Ashwagandha',
 'Aspirin',
 'Atropin

In [4]:
# Get a list of just psychedelic drugs of interest to me. 
psychedelic_drugs = ['AET', 'AL-LAD', 'ALD-52', 'ALEPH', 'Aleph-4', 'Allylescaline',
                     'AMT', 'Arylcyclohexylamines', 'Ayahuasca', 'Banisteriopsis caapi', 
                     'BOD', 'BOH-2C-B', 'Bufotenin', 'Cacti - Mescaline-containing', 'DALT', 
                     'Deschloroketamine', 'DET', 'DiPT', 'DMT', 'DMT-Containing', 'DMXE', 
                     'DOB', 'DOC', 'DOET', 'DOF', 'DOI', 'DOIP', 'DOM', 'DON', 'DOPR', 'DPT', 
                     'EIPLA', 'EPT', 'Escaline', 'ETH-LAD', 'Fluorexetamine', 'H.B. Woodrose',
                     'Harmaline', 'Harmine', 'Herbal Ecstasy', 'HOT-17', 'HOT-2', 'HOT-7',
                     'Huasca Brew', 'Huasca Brew Group', 'Huasca Combo', 'Huasca Group', 'HXE',
                     'Iboga Alkaloid Group', 'Ibogaine', 'Isoproscaline', 'Ketamine', 'LSA',
                     'LSD', 'LSM-775', 'LSZ', 'MALT', 'MDA', 'MDAI', 'MDE', 'MDMA', 'MEM',
                     'Mescaline', 'MET', 'Methallylescaline', 'Methoxetamine', 
                     'Methoxpropamine', 'Mimosa ophthalmocentra', 'Mimosa spp.',
                     'Mimosa tenuiflora', 'MIPLA', 'MIPT', 'MMDA', 'MMDA-3a', 'MPT',
                     'Mushrooms', 'Mushrooms - G. spectabilis', 'Mushrooms - P. atlantis',
                     'Mushrooms - P. azurescens', 'Mushrooms - P. cubensis', 
                     'Mushrooms - P. cyanescens', 'Mushrooms - P. mexicana',
                     'Mushrooms - P. semilanceata', 'Mushrooms - P. subaeruginosa',
                     'Mushrooms - P. tampanensis', 'Mushrooms - P. weilii',
                     'Mushrooms - Panaeolus cyanescens', 'MXiPr', 'PCE', 'PCP', 'Peyote',
                     'Phenethylamine', 'Phenethylamines', 'Phenethylamines - Other',
                     'PIPT', 'Proscaline', 'Psilocin', 'Psilocybin', 'S-Ketamine',
                     'Tabernanthe iboga', 'TCB-2', 'Tetrahydroharmine', 'TMA', 'TMA-2', 
                     'TMA-6', 'Tryptamines - Substituted', '1B-LSD', '1cP-AL-LAD', '1cP-LSD',
                     '1F-LSD', '1P-ETH-LAD', '1P-LSD', '1V-LSD', "2'-Oxo-PCE",
                     '2-Fluorodeschloroketamine', '2-Me-DMT', '2C-B', '2C-B-Fly', '2C-C',
                     '2C-CN', '2C-D', '2C-E', '2C-EF', '2C-G-N', '2C-H', '2C-I', '2C-IP',
                     '2C-N', '2C-P', '2C-T', '2C-T-13', '2C-T-2', '2C-T-21', '2C-T-4', 
                     '2C-T-7', '2C-TFM', '3,4-MD-PCP', '3-Cl-PCP', '3-HO-PCE', '3-HO-PCP',
                     '3-Me-PCE', '3-Me-PCPy', '3-MEO-PCE', '3-MeO-PCMo', '3-MeO-PCP',
                     '3-Methyl-PCP', '3C-E', '3C-P', '3F-PCP', '4-AcO-DALT', '4-AcO-DET',
                     '4-AcO-DiPT', '4-AcO-DMT', '4-AcO-DPT', '4-AcO-EIPT', '4-AcO-EPT',
                     '4-AcO-MALT', '4-AcO-MET', '4-AcO-MiPT', '4-AcO-MPT', '4-HO-DET',
                     '4-HO-DiPT', '4-HO-DPT', '4-HO-EPT', '4-HO-MALT', '4-HO-MCPT', '4-HO-MET',
                     '4-HO-MiPT', '4-HO-MPT', '4-HO-PIPT', '4-MeO-DMT', '4-MeO-MiPT',
                     '4-MeO-PCP', '4-MTA', '4-PrO-DMT', '4C-D', '5-Chloro-AMT', '5-MeO-AET',
                     '5-MeO-AMT', '5-MeO-DALT', '5-MeO-DET', '5-MeO-DiPT', '5-MeO-DMT', 
                     '5-MeO-DPT', '5-MeO-EIPT', '5-MeO-MALT', '5-MeO-MET', '5-MeO-MIPT',
                     '5-MeO-PIPT', '5-MeO-TMT', '5-Methoxy-Tryptamine']
len(psychedelic_drugs)

191

create list of links associated with these drugs. The format is: 
https://erowid.org/experiences/subs/exp_<DRUG>.shtml

In [5]:
# Dashes and periods need to be deleted and spaces need to be replaced with underscores
drug_names_for_links = []
for drug in psychedelic_drugs:
    no_dash = drug.replace('-', '')
    no_period = no_dash.replace('.', '')
    no_double_space = no_period.replace('  ', '_')
    no_space = no_double_space.replace(' ', '_')
    drug_names_for_links.append(no_space)
drug_names_for_links

['AET',
 'ALLAD',
 'ALD52',
 'ALEPH',
 'Aleph4',
 'Allylescaline',
 'AMT',
 'Arylcyclohexylamines',
 'Ayahuasca',
 'Banisteriopsis_caapi',
 'BOD',
 'BOH2CB',
 'Bufotenin',
 'Cacti_Mescalinecontaining',
 'DALT',
 'Deschloroketamine',
 'DET',
 'DiPT',
 'DMT',
 'DMTContaining',
 'DMXE',
 'DOB',
 'DOC',
 'DOET',
 'DOF',
 'DOI',
 'DOIP',
 'DOM',
 'DON',
 'DOPR',
 'DPT',
 'EIPLA',
 'EPT',
 'Escaline',
 'ETHLAD',
 'Fluorexetamine',
 'HB_Woodrose',
 'Harmaline',
 'Harmine',
 'Herbal_Ecstasy',
 'HOT17',
 'HOT2',
 'HOT7',
 'Huasca_Brew',
 'Huasca_Brew_Group',
 'Huasca_Combo',
 'Huasca_Group',
 'HXE',
 'Iboga_Alkaloid_Group',
 'Ibogaine',
 'Isoproscaline',
 'Ketamine',
 'LSA',
 'LSD',
 'LSM775',
 'LSZ',
 'MALT',
 'MDA',
 'MDAI',
 'MDE',
 'MDMA',
 'MEM',
 'Mescaline',
 'MET',
 'Methallylescaline',
 'Methoxetamine',
 'Methoxpropamine',
 'Mimosa_ophthalmocentra',
 'Mimosa_spp',
 'Mimosa_tenuiflora',
 'MIPLA',
 'MIPT',
 'MMDA',
 'MMDA3a',
 'MPT',
 'Mushrooms',
 'Mushrooms_G_spectabilis',
 'Mushrooms_

In [6]:
# Create strings for urls
drug_urls = []
for drug in drug_names_for_links:
    drug_urls.append('https://erowid.org/experiences/subs/exp_' + drug + '.shtml')
drug_urls[:5]

['https://erowid.org/experiences/subs/exp_AET.shtml',
 'https://erowid.org/experiences/subs/exp_ALLAD.shtml',
 'https://erowid.org/experiences/subs/exp_ALD52.shtml',
 'https://erowid.org/experiences/subs/exp_ALEPH.shtml',
 'https://erowid.org/experiences/subs/exp_Aleph4.shtml']

These work correctly. Navigagte to each page and gather the link to "Show All" experience reports. 

<font color='violet'> Explore the structure of a drug's page.

In [7]:
aet_url = 'https://erowid.org/experiences/subs/exp_AET.shtml'
aet_page = requests.get(aet_url)
aet_soup = BeautifulSoup(aet_page.content, "html.parser")
aet_pretty = aet_soup.prettify().splitlines()
aet_pretty[:100]

['<html>',
 ' <head>',
 '  <title>',
 '   AET (also Alpha-ethyltryptamine; Monase) : Erowid Exp: Main Index',
 '  </title>',
 '  <meta content="A categorized index of first-person experiences with AET" name="description"/>',
 '  <meta content="Experience Report Vaults, trip reports, stories, descriptions" name="keywords"/>',
 '  <link href="/includes/general_default.css" rel="stylesheet" type="text/css"/>',
 '  <link href="includes/exp.css" rel="stylesheet" type="text/css"/>',
 '  <script src="/includes/javascript/jquery-3.2.1.min.js" type="text/javascript">',
 '  </script>',
 '  <script language="javascript" src="/includes/javascript/erowid_combobox_lib.js" type="text/javascript">',
 '  </script>',
 '  <script language="javascript" src="/includes/javascript/external/mobile-detect.min.js" type="text/javascript">',
 '  </script>',
 ' </head>',
 ' <body alink="#008080" bgcolor="#000000" link="#7777AA" text="#999977" vlink="#999999">',
 '  <table align="CENTER" border="0" cellpadding="0" 

The href for the link to all of a drug's reports will be inside the a element where the img alt text = "Show All Reports."

In [8]:
# Find the correct href on one of the pages
aet_soup.find('img', attrs={'alt':'Show New Reports'}).parent['href']

'/experiences/exp.cgi?New&S1=299'

In [9]:
# Collect all hrefs
vault_hrefs = []
bad_urls = []

for url in tqdm(drug_urls):
    # Some urls could be wrong if I didn't change the drug names properly. 
    try:
        drug_page = requests.get(url)
        drug_soup = BeautifulSoup(drug_page.content, "html.parser")
        href = drug_soup.find('img', attrs={'alt':'Show All Reports'}).parent['href']
        vault_hrefs.append(href)
    except: bad_urls.append(url)

vault_hrefs[:5]

100%|██████████| 191/191 [07:06<00:00,  2.23s/it]


['/experiences/exp.cgi?S1=299',
 '/experiences/exp.cgi?S1=603',
 '/experiences/exp.cgi?S1=748',
 '/experiences/exp.cgi?S1=807',
 '/experiences/exp.cgi?S1=557']

In [None]:
Audio(sound_file, autoplay=True)

In [10]:
len(bad_urls)

0

In [11]:
# Turn hrefs into proper urls
vault_urls = []
for href in vault_hrefs:
    vault_urls.append('https://erowid.org' + href)
vault_urls[:5]

['https://erowid.org/experiences/exp.cgi?S1=299',
 'https://erowid.org/experiences/exp.cgi?S1=603',
 'https://erowid.org/experiences/exp.cgi?S1=748',
 'https://erowid.org/experiences/exp.cgi?S1=807',
 'https://erowid.org/experiences/exp.cgi?S1=557']

In [12]:
# There should be 191 vault urls
len(vault_urls)

191

The pages in the vault_urls list are now just full of direct links to each experience report. Gather the links to all the reports. 

Check out how to do this using just one of the pages, for the drug 5-MEO-DALT

In [13]:
dalt_vault_url = 'https://erowid.org/experiences/exp.cgi?S1=321'
dalt_page = requests.get(dalt_vault_url)
dalt_soup = BeautifulSoup(dalt_page.content, "html.parser")
dalt_pretty = dalt_soup.prettify().splitlines()
dalt_pretty[:200]

['<html>',
 ' <head>',
 '  <title>',
 '   Search Results : Erowid Experience Vaults',
 '  </title>',
 '  <meta content="Erowid Experience Vaults: An Experience" name="description"/>',
 '  <meta content="Experience Report Vaults, trip reports, stories, descriptions" name="keywords"/>',
 '  <link href="/includes/general_default.css" rel="stylesheet" type="text/css"/>',
 '  <link href="includes/exp.css" rel="stylesheet" type="text/css"/>',
 '  <!-- Sperowider <noindex/> -->',
 '  <script src="/includes/javascript/jquery-3.2.1.min.js" type="text/javascript">',
 '  </script>',
 '  <script language="javascript" src="/includes/javascript/erowid_combobox_lib.js" type="text/javascript">',
 '  </script>',
 '  <script language="javascript" src="/includes/javascript/external/mobile-detect.min.js" type="text/javascript">',
 '  </script>',
 ' </head>',
 ' <body alink="#008080" bgcolor="#000000" link="#7777AA" text="#999977" vlink="#999999">',
 '  <table align="CENTER" border="0" cellpadding="0" cell

There's only one mention of colspan=3; it's an attribute of a td tag for a table, and every single link inside the table (hrefs located inside a tags) is one that I want to collect. 

In [14]:
# Try with one page first
dalt_reports = dalt_soup.find('td', attrs={'colspan':3}).find_all('a')
dalt_hrefs = []
for a in range(len(dalt_reports)):
    href = dalt_reports[a]['href']
    dalt_hrefs.append(href)
dalt_hrefs[:5]    

['exp.php?ID=105518',
 'exp.php?ID=86869',
 'exp.php?ID=86866',
 'exp.php?ID=37775',
 'exp.php?ID=35721']

In [15]:
# That worked. How many reports were linked on this one page?
len(dalt_hrefs)

71

In [16]:
# work through all links in vault_urls to get all hrefs 
report_hrefs = []
for url in tqdm(vault_urls):
    this_page = requests.get(url)
    this_soup = BeautifulSoup(this_page.content, "html.parser")
    this_reports = this_soup.find('td', attrs={'colspan':3}).find_all('a')
    for a in range(len(this_reports)):
        href = this_reports[a]['href']
        report_hrefs.append(href)    
report_hrefs[:5]

100%|██████████| 191/191 [07:45<00:00,  2.44s/it]


['exp.php?ID=58149',
 'exp.php?ID=61822',
 'exp.php?ID=61874',
 'exp.php?ID=63071',
 'exp.php?ID=28238']

In [None]:
Audio(sound_file, autoplay=True)

In [17]:
# How many total reports are there?
len(report_hrefs)

6371

In [18]:
# Turn these hrefs into proper urls
report_urls = []
for href in report_hrefs:
    report_urls.append('https://erowid.org/experiences/' + href)
report_urls[:5]

['https://erowid.org/experiences/exp.php?ID=58149',
 'https://erowid.org/experiences/exp.php?ID=61822',
 'https://erowid.org/experiences/exp.php?ID=61874',
 'https://erowid.org/experiences/exp.php?ID=63071',
 'https://erowid.org/experiences/exp.php?ID=28238']

I now have a url for each experience report. 

Each experience report page could have information about drugs the person was on, the dose they took, their body weight, the year of their experience, their gender, age at time of experience, a title, an alias for the author, and the narrative itself.

I won't need all of this information to meet my primary objective of assigning a rating based on the narrative content, but it would be interesting to explore some of the other detials as well, so pull everything into a dataframe. 

<font color='violet'> Figure out how to turn each report page's contents into a row of a dataframe, with page elements as columns.
    
There are tables on these pages, but most of the information is not in a table. There may be more efficient ways, but I'll start out by just pulling various elements separately and then joining. 

In [19]:
# Try with just the first report url
trial_df = pd.read_html('https://erowid.org/experiences/exp.php?ID=116975')
trial_df

[                                                   0
 0  #message { background: #2233cc; border: 1px so...,
     0   1
 0 NaN NaN,
        0         1       2         3
 0  DOSE:  repeated  smoked       DMT
 1    NaN  repeated  smoked  Cannabis,
               0       1
 0  BODY WEIGHT:  102 kg,
                                                    0  \
 0                                     Exp Year: 2022   
 1                                       Gender: Male   
 2                      Age at time of experience: 22   
 3                            Published: Jan 28, 2023   
 4  [ View as PDF (for printing) ] [ View as LaTeX...   
 5  DMT (18), Cannabis (1) : First Times (2), Comb...   
 
                                                    1  
 0                                      ExpID: 116975  
 1                                                NaN  
 2                                                NaN  
 3                                         Views: 104  
 4  [ View as PDF (fo

In [20]:
# I want information out of dfs 2, 3, 4 
trial_df[2]

Unnamed: 0,0,1,2,3
0,DOSE:,repeated,smoked,DMT
1,,repeated,smoked,Cannabis


In [21]:
drugs_reviewed = trial_df[2][3].to_list()
drugs_reviewed

['DMT', 'Cannabis']

In [22]:
trial_df[3]

Unnamed: 0,0,1
0,BODY WEIGHT:,102 kg


In [23]:
weight = trial_df[3][1].to_list()
weight

['102 kg']

In [24]:
trial_df[4]

Unnamed: 0,0,1
0,Exp Year: 2022,ExpID: 116975
1,Gender: Male,
2,Age at time of experience: 22,
3,"Published: Jan 28, 2023",Views: 104
4,[ View as PDF (for printing) ] [ View as LaTeX...,[ View as PDF (for printing) ] [ View as LaTeX...
5,"DMT (18), Cannabis (1) : First Times (2), Comb...","DMT (18), Cannabis (1) : First Times (2), Comb..."


In [25]:
remainiing_relevant_info = trial_df[4][0][0:3]
remainiing_relevant_info

0                   Exp Year: 2022
1                     Gender: Male
2    Age at time of experience: 22
Name: 0, dtype: object

In [26]:
year = remainiing_relevant_info[0]
year

'Exp Year: 2022'

In [27]:
gender = remainiing_relevant_info[1]
gender

'Gender: Male'

In [28]:
age = remainiing_relevant_info[2]
age

'Age at time of experience: 22'

I can clean these up as part of the process of adding them to a dataframe, i.e. turn 'Exp Year: 2022' into just '2022.' I'll be creating duplicates, the way the reviews in the study came to me, but I can just deal with that later. Here, for example, the cannabis row will just be deleted because it's not one of the drugs in my list of target drugs. 

In [29]:
trial_df = pd.DataFrame({'drug':drugs_reviewed, 'weight':weight[0], 
                         'year':year.replace('Exp Year: ', ''), 
                         'gender':gender.replace('Gender: ', ''), 
                         'age':age.replace('Age at time of experience: ', '')})
trial_df

Unnamed: 0,drug,weight,year,gender,age
0,DMT,102 kg,2022,Male,22
1,Cannabis,102 kg,2022,Male,22


In [30]:
# Pull the text of the report into the dataframe; it's inside a div.
trial_url = 'https://erowid.org/experiences/exp.php?ID=116975'
trial_page = requests.get(trial_url)
trial_soup = BeautifulSoup(trial_page.content, "html.parser")
trial_text = trial_soup.find('div', attrs={'class':'report-text-surround'}).get_text()
trial_text

"\n\n\xa0\n\n\n\n\nDOSE:\n\xa0 repeated\nsmoked\nDMT\n\n\n\xa0\n\xa0 repeated\nsmoked\nCannabis\n\n\n\n\n\nBODY WEIGHT:\n102 kg\n\n\n\n\nSeeing my Buddha-Nature on DMT \r\n\nI am a 22 year old male around 102kg. What I am about to tell you is my experience of using DMT for the first time. I took around 100-150mg of DMT about a month ago. I have no clue as to what the exact dosage is I have no clue as to what the exact dosage is, because I eventually started eyeballing it trying to take bigger dosages in my attempt to “break through”, which I believe was unsuccessful. My only other psychedelic experience is LSD which I tripped heavily on around a year ago, but I stopped completely a couple months before this experience. I am writing this from memory so all of the details may not be 100 percent accurate. \r\n\nThis trip happened about a month ago. I decided to try DMT for the first time because the guy I usually see had it and it tested clean. I also live in student accommodation and eve

Some information from the tables and a bunch of formatting characters remain, but I can clean those out of all the strings later, given that I don't know of a better way to use beautiful soup to do it now because the text is not inside its own element apart from the div that also contains all this other stuff. 

Add this review to the dataframe I started. 

In [31]:
trial_df['report'] = trial_text
trial_df['url'] = trial_url
trial_df

Unnamed: 0,drug,weight,year,gender,age,report,url
0,DMT,102 kg,2022,Male,22,\n\n \n\n\n\n\nDOSE:\n repeated\nsmoked\nDMT\...,https://erowid.org/experiences/exp.php?ID=116975
1,Cannabis,102 kg,2022,Male,22,\n\n \n\n\n\n\nDOSE:\n repeated\nsmoked\nDMT\...,https://erowid.org/experiences/exp.php?ID=116975


I can build from this, if I streamline the process I used to creat it. I can creat other mini-dfs and concatenate them onto this one. 

<font color='violet'> Create a function for building a df from a drug report page.
    
Include an alternate option to create a dictionary if the function can't create a table due to inconsistency in html structure. That way, I can use the dictionary entries to at least add the narratives to my final dataframe. I can optionally also use the urls to go back and try to extract the remaining information. 

In [32]:
# Put together all the steps I just took

def page_to_df_or_dict(url): 
    # Many pages may not have the same table structure as the page I tried. 
    try:
        # extract information from the tables as before. 
        mini_df = pd.read_html(url)
        drugs_reviewed = mini_df[2][3].to_list()
        weight = mini_df[3][1].to_list()
        remainiing_relevant_info = mini_df[4][0][0:3]
        year = remainiing_relevant_info[0]
        gender = remainiing_relevant_info[1]
        age = remainiing_relevant_info[2]
        # Create a dataframe from the table information.
        mini_df = pd.DataFrame({'drug':drugs_reviewed, 'weight':weight[0], 
                                'year':year.replace('Exp Year: ', ''), 
                                'gender':gender.replace('Gender: ', ''), 
                                'age':age.replace('Age at time of experience: ', '')})
        # Access, extract, and add to the dataframe the narrative text. 
        this_page = requests.get(trial_url)
        this_soup = BeautifulSoup(trial_page.content, "html.parser")
        this_text = trial_soup.find('div', attrs={'class':'report-text-surround'}).get_text()
        mini_df['report'] = this_text
        # Add the url to the dataframe
        mini_df['url'] = url
        return mini_df
    # If I can't get all the table information, maybe I can at least get just the text. 
    except: 
        text_dict = {}
        this_page = requests.get(url)
        this_soup = BeautifulSoup(this_page.content, "html.parser")
        this_text = this_soup.find('div', attrs={'class':'report-text-surround'}).get_text()
        text_dict[url] = this_text
        return text_dict

# See what happens. Should match trial_df
try_function = page_to_df_or_dict(trial_url)
try_function

Unnamed: 0,drug,weight,year,gender,age,report,url
0,DMT,102 kg,2022,Male,22,\n\n \n\n\n\n\nDOSE:\n repeated\nsmoked\nDMT\...,https://erowid.org/experiences/exp.php?ID=116975
1,Cannabis,102 kg,2022,Male,22,\n\n \n\n\n\n\nDOSE:\n repeated\nsmoked\nDMT\...,https://erowid.org/experiences/exp.php?ID=116975


That worked

<font color='violet'> Build a full dataframe

Start with what I already created, build from there. Try a small chunk to start with. 

In [33]:
df = trial_df
just_texts = {}

for url in tqdm(report_urls[:5]):
    # Create options for whether the function output is a df or the alternate dict
    try:
        new_rows = page_to_df_or_dict(url)
        df = pd.concat([df, new_rows])
    except: 
        page_text_dict = page_to_df_or_dict(url)
        just_texts.update(page_text_dict)

df.info()

100%|██████████| 5/5 [00:36<00:00,  7.28s/it]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16 entries, 0 to 0
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drug    16 non-null     object
 1   weight  16 non-null     object
 2   year    16 non-null     object
 3   gender  16 non-null     object
 4   age     16 non-null     object
 5   report  16 non-null     object
 6   url     16 non-null     object
dtypes: object(7)
memory usage: 1.0+ KB





In [34]:
df

Unnamed: 0,drug,weight,year,gender,age,report,url
0,DMT,102 kg,2022,Male,22,\n\n \n\n\n\n\nDOSE:\n repeated\nsmoked\nDMT\...,https://erowid.org/experiences/exp.php?ID=116975
1,Cannabis,102 kg,2022,Male,22,\n\n \n\n\n\n\nDOSE:\n repeated\nsmoked\nDMT\...,https://erowid.org/experiences/exp.php?ID=116975
0,AET,150 lb,2006,Male,Not Given,\n\n \n\n\n\n\nDOSE:\n repeated\nsmoked\nDMT\...,https://erowid.org/experiences/exp.php?ID=58149
1,AET,150 lb,2006,Male,Not Given,\n\n \n\n\n\n\nDOSE:\n repeated\nsmoked\nDMT\...,https://erowid.org/experiences/exp.php?ID=58149
2,Kratom,150 lb,2006,Male,Not Given,\n\n \n\n\n\n\nDOSE:\n repeated\nsmoked\nDMT\...,https://erowid.org/experiences/exp.php?ID=58149
3,Morphine,150 lb,2006,Male,Not Given,\n\n \n\n\n\n\nDOSE:\n repeated\nsmoked\nDMT\...,https://erowid.org/experiences/exp.php?ID=58149
0,AET,170 lb,2007,Male,Not Given,\n\n \n\n\n\n\nDOSE:\n repeated\nsmoked\nDMT\...,https://erowid.org/experiences/exp.php?ID=61822
1,Hydromorphone,170 lb,2007,Male,Not Given,\n\n \n\n\n\n\nDOSE:\n repeated\nsmoked\nDMT\...,https://erowid.org/experiences/exp.php?ID=61822
2,Hydromorphone,170 lb,2007,Male,Not Given,\n\n \n\n\n\n\nDOSE:\n repeated\nsmoked\nDMT\...,https://erowid.org/experiences/exp.php?ID=61822
3,Cannabis - Hash,170 lb,2007,Male,Not Given,\n\n \n\n\n\n\nDOSE:\n repeated\nsmoked\nDMT\...,https://erowid.org/experiences/exp.php?ID=61822


In [35]:
len(just_texts)

1

This resulted in a nice dataframe. And only one dictionary entry of text without the rest of the detail. Keep building the dataframe chunk by chunk so it doesn't take too long per scraping attempt. And in case I get blocked and need to change ip addresses. 

In [36]:
# Function didn't work at all for some urls. Might be a totally bad url. Gather those. 
invalid_urls = []

for url in tqdm(report_urls[5:100]):
    try:
        new_rows = page_to_df_or_dict(url)
        df = pd.concat([df, new_rows])
    except: 
        try:
            page_text_dict = page_to_df_or_dict(url)
            just_texts.update(page_text_dict)
        except: invalid_urls.append(url)

df.info()

100%|██████████| 95/95 [10:44<00:00,  6.78s/it]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 238 entries, 0 to 2
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drug    238 non-null    object
 1   weight  238 non-null    object
 2   year    238 non-null    object
 3   gender  238 non-null    object
 4   age     238 non-null    object
 5   report  238 non-null    object
 6   url     238 non-null    object
dtypes: object(7)
memory usage: 14.9+ KB





In [37]:
Audio(sound_file, autoplay=True)

In [38]:
len(just_texts)

6

In [39]:
just_texts

{'https://erowid.org/experiences/exp.php?ID=61874': "\n\n\xa0\n\n\n\n\nDOSE:\n125 mg\noral\nAET\n(capsule)\n\n\n\n\n\nWell, I recently came across a little AET, and I have been wanting to try this since I first read about it in TIHKAL. I never had a desire to experiment with AMT, even though it's been popular for quite some time, but AET seemed to call out to me... eventually, I answered back... by taking the 125mg in a capsule and awaiting the onset.\r\n\nI took the capsule at about 9:30pm, while I still had another hour and a half left to go at work. I figured by the time I left there I would be feeling it, and would then just take the quick ride home where I would lie around lazily listening to music and enjoying some euphoria of the tryptamine variety (my favorite type of euphoria!!). \r\n\nAt 11:00 I went home, and was BARELY feeling it. I attribute this to having swallowed it in a gelcap, which was the intention anyway. By the time I got home, I was still not feeling it that much

In [40]:
# Just 6% ended up in the text-only dictionary. And the dataframe looks solid. Keep going. 

for url in tqdm(report_urls[100:200]):
    try:
        new_rows = page_to_df_or_dict(url)
        df = pd.concat([df, new_rows])
    except: 
        try:
            page_text_dict = page_to_df_or_dict(url)
            just_texts.update(page_text_dict)
        except: invalid_urls.append(url)

df.info()

100%|██████████| 100/100 [11:57<00:00,  7.18s/it]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 485 entries, 0 to 0
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drug    485 non-null    object
 1   weight  485 non-null    object
 2   year    485 non-null    object
 3   gender  485 non-null    object
 4   age     485 non-null    object
 5   report  485 non-null    object
 6   url     485 non-null    object
dtypes: object(7)
memory usage: 30.3+ KB





In [41]:
Audio(sound_file, autoplay=True)

In [42]:
len(just_texts)

18

In [43]:
len(invalid_urls)

0

In [48]:
# No invalid urls yet, and not too many dictionary-only pages. Run through remaining chunks. 
for url in tqdm(report_urls[200:500]):
    try:
        new_rows = page_to_df_or_dict(url)
        df = pd.concat([df, new_rows])
    except: 
        try:
            page_text_dict = page_to_df_or_dict(url)
            just_texts.update(page_text_dict)
        except: invalid_urls.append(url)

df.info()

100%|██████████| 300/300 [36:10<00:00,  7.24s/it]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1249 entries, 0 to 1
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drug    1249 non-null   object
 1   weight  1249 non-null   object
 2   year    1249 non-null   object
 3   gender  1249 non-null   object
 4   age     1249 non-null   object
 5   report  1249 non-null   object
 6   url     1249 non-null   object
dtypes: object(7)
memory usage: 78.1+ KB





In [49]:
Audio(sound_file, autoplay=True)

In [50]:
len(just_texts)

34

In [51]:
len(invalid_urls)

8

In [56]:
for url in tqdm(report_urls[500:1000]):
    try:
        new_rows = page_to_df_or_dict(url)
        df = pd.concat([df, new_rows])
    except: 
        try:
            page_text_dict = page_to_df_or_dict(url)
            just_texts.update(page_text_dict)
        except: invalid_urls.append(url)

df.info()

100%|██████████| 500/500 [55:20<00:00,  6.64s/it]  

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2443 entries, 0 to 0
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drug    2443 non-null   object
 1   weight  2443 non-null   object
 2   year    2443 non-null   object
 3   gender  2443 non-null   object
 4   age     2443 non-null   object
 5   report  2443 non-null   object
 6   url     2443 non-null   object
dtypes: object(7)
memory usage: 152.7+ KB





In [57]:
Audio(sound_file, autoplay=True)

In [58]:
len(just_texts)

69

In [59]:
len(invalid_urls)

9

In [64]:
for url in tqdm(report_urls[1000:1500]):
    try:
        new_rows = page_to_df_or_dict(url)
        df = pd.concat([df, new_rows])
    except: 
        try:
            page_text_dict = page_to_df_or_dict(url)
            just_texts.update(page_text_dict)
        except: invalid_urls.append(url)

df.info()

100%|██████████| 500/500 [59:04<00:00,  7.09s/it] 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3591 entries, 0 to 1
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drug    3591 non-null   object
 1   weight  3591 non-null   object
 2   year    3591 non-null   object
 3   gender  3591 non-null   object
 4   age     3591 non-null   object
 5   report  3591 non-null   object
 6   url     3591 non-null   object
dtypes: object(7)
memory usage: 224.4+ KB





In [65]:
Audio(sound_file, autoplay=True)

In [66]:
len(just_texts)

89

In [67]:
len(invalid_urls)

41

In [72]:
for url in tqdm(report_urls[1500:2000]):
    try:
        new_rows = page_to_df_or_dict(url)
        df = pd.concat([df, new_rows])
    except: 
        try:
            page_text_dict = page_to_df_or_dict(url)
            just_texts.update(page_text_dict)
        except: invalid_urls.append(url)

df.info()

100%|██████████| 500/500 [53:15<00:00,  6.39s/it]  

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4709 entries, 0 to 0
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drug    4709 non-null   object
 1   weight  4709 non-null   object
 2   year    4709 non-null   object
 3   gender  4709 non-null   object
 4   age     4709 non-null   object
 5   report  4709 non-null   object
 6   url     4709 non-null   object
dtypes: object(7)
memory usage: 294.3+ KB





In [73]:
Audio(sound_file, autoplay=True)

In [74]:
len(just_texts)

107

In [75]:
len(invalid_urls)

48

In [80]:
for url in tqdm(report_urls[2000:2500]):
    try:
        new_rows = page_to_df_or_dict(url)
        df = pd.concat([df, new_rows])
    except: 
        try:
            page_text_dict = page_to_df_or_dict(url)
            just_texts.update(page_text_dict)
        except: invalid_urls.append(url)

df.info()

100%|██████████| 500/500 [48:55<00:00,  5.87s/it] 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6148 entries, 0 to 2
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drug    6148 non-null   object
 1   weight  6148 non-null   object
 2   year    6148 non-null   object
 3   gender  6148 non-null   object
 4   age     6148 non-null   object
 5   report  6148 non-null   object
 6   url     6148 non-null   object
dtypes: object(7)
memory usage: 384.2+ KB





In [81]:
Audio(sound_file, autoplay=True)

In [82]:
len(just_texts)

142

In [83]:
len(invalid_urls)

52

In [88]:
for url in tqdm(report_urls[2500:3000]):
    try:
        new_rows = page_to_df_or_dict(url)
        df = pd.concat([df, new_rows])
    except: 
        try:
            page_text_dict = page_to_df_or_dict(url)
            just_texts.update(page_text_dict)
        except: invalid_urls.append(url)

df.info()

100%|██████████| 500/500 [44:47<00:00,  5.37s/it]  

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7682 entries, 0 to 1
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drug    7682 non-null   object
 1   weight  7682 non-null   object
 2   year    7682 non-null   object
 3   gender  7682 non-null   object
 4   age     7682 non-null   object
 5   report  7682 non-null   object
 6   url     7682 non-null   object
dtypes: object(7)
memory usage: 480.1+ KB





In [89]:
Audio(sound_file, autoplay=True)

In [90]:
len(just_texts)

150

In [91]:
len(invalid_urls)

54

In [None]:
for url in tqdm(report_urls[3000:3500]):
    try:
        new_rows = page_to_df_or_dict(url)
        df = pd.concat([df, new_rows])
    except: 
        try:
            page_text_dict = page_to_df_or_dict(url)
            just_texts.update(page_text_dict)
        except: invalid_urls.append(url)

df.info()

  0%|          | 0/500 [00:00<?, ?it/s]

In [None]:
Audio(sound_file, autoplay=True)

In [None]:
len(just_texts)

In [None]:
len(invalid_urls)

In [None]:
for url in tqdm(report_urls[3500:4000]):
    try:
        new_rows = page_to_df_or_dict(url)
        df = pd.concat([df, new_rows])
    except: 
        try:
            page_text_dict = page_to_df_or_dict(url)
            just_texts.update(page_text_dict)
        except: invalid_urls.append(url)

df.info()

In [None]:
Audio(sound_file, autoplay=True)

In [None]:
len(just_texts)

In [None]:
len(invalid_urls)

In [None]:
for url in tqdm(report_urls[4000:4500]):
    try:
        new_rows = page_to_df_or_dict(url)
        df = pd.concat([df, new_rows])
    except: 
        try:
            page_text_dict = page_to_df_or_dict(url)
            just_texts.update(page_text_dict)
        except: invalid_urls.append(url)

df.info()

In [None]:
Audio(sound_file, autoplay=True)

In [None]:
len(just_texts)

In [None]:
len(invalid_urls)

In [None]:
for url in tqdm(report_urls[4500:5000]):
    try:
        new_rows = page_to_df_or_dict(url)
        df = pd.concat([df, new_rows])
    except: 
        try:
            page_text_dict = page_to_df_or_dict(url)
            just_texts.update(page_text_dict)
        except: invalid_urls.append(url)

df.info()

In [None]:
Audio(sound_file, autoplay=True)

In [None]:
len(just_texts)

In [None]:
len(invalid_urls)

In [None]:
for url in tqdm(report_urls[5000:5500]):
    try:
        new_rows = page_to_df_or_dict(url)
        df = pd.concat([df, new_rows])
    except: 
        try:
            page_text_dict = page_to_df_or_dict(url)
            just_texts.update(page_text_dict)
        except: invalid_urls.append(url)

df.info()

In [None]:
Audio(sound_file, autoplay=True)

In [None]:
len(just_texts)

In [None]:
len(invalid_urls)

In [None]:
for url in tqdm(report_urls[5500:6000]):
    try:
        new_rows = page_to_df_or_dict(url)
        df = pd.concat([df, new_rows])
    except: 
        try:
            page_text_dict = page_to_df_or_dict(url)
            just_texts.update(page_text_dict)
        except: invalid_urls.append(url)

df.info()

In [None]:
Audio(sound_file, autoplay=True)

In [None]:
len(just_texts)

In [None]:
len(invalid_urls)

In [None]:
for url in tqdm(report_urls[6000:]):
    try:
        new_rows = page_to_df_or_dict(url)
        df = pd.concat([df, new_rows])
    except: 
        try:
            page_text_dict = page_to_df_or_dict(url)
            just_texts.update(page_text_dict)
        except: invalid_urls.append(url)

df.info()

In [None]:
Audio(sound_file, autoplay=True)

In [None]:
len(just_texts)

In [None]:
len(invalid_urls)

I want to process the dictionary and invalid_urls but in a new notebook; this one's getting too long. Cleaning notebook: https://github.com/fractaldatalearning/psychedelic_efficacy/blob/main/notebooks/9-kl-reports-clean-engineer.ipynb

In [92]:
# Save this df. 
df.to_csv('../data/raw/erowid/raw_reports.csv')

In [93]:
# Save the list of psychedelic drugs for later use
file=open('../data/raw/erowid/psychedelic_drugs.txt','w')
for items in psychedelic_drugs:
    file.writelines(items + ',')
file.close()

In [94]:
# Save the list of urls that couldn't get scraped
file=open('../data/raw/erowid/unscraped_urls.txt','w')
for items in invalid_urls:
    file.writelines(items + ',')
file.close()

In [95]:
# Save the dictionary of urls & text from sites where only the text could be scraped. 
with open('../data/raw/erowid/url_text_dict.txt', 'w') as convert_file:
     convert_file.write(json.dumps(just_texts))