# <font color='violet'> Scrape Erowid for Narrative Experience Reports
    
Here, I'll build a dataframe from Erowid's "experience vault," which holds thousands of narrative descriptions of experiences with psychoactive drugs. These narratives could then be compared with ratings and reviews of prescription psych meds, using the model I created based on more formal psychiatric studies.

In [1]:
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm


<font color='violet'> Explore the basic structure of the front page of the experience vault.
    
Get a list of drug names, then use that list to extract links associated with those names. The links can then be followed to find the narratives connected to each drug. 

In [2]:
front_url = 'https://erowid.org/experiences/exp_list.shtml'
front_page = requests.get(front_url)
front_soup = BeautifulSoup(front_page.content, "html.parser")
front_pretty = front_soup.prettify().splitlines()
front_pretty[:500]

['<html>',
 ' <head>',
 '  <title>',
 '   Complete Substance and Category List : Erowid Experience Vaults',
 '  </title>',
 '  <meta content="The full list of substances and categories covered by Erowid\'s collection of first-hand experience reports with psychoactive plants and drugs." name="description"/>',
 '  <meta content="Experience Report Vaults, trip reports, stories, descriptions" name="keywords"/>',
 '  <link href="/includes/general_default.css" rel="stylesheet" type="text/css"/>',
 '  <link href="includes/exp.css" rel="stylesheet" type="text/css"/>',
 '  <script src="/includes/javascript/jquery-3.2.1.min.js" type="text/javascript">',
 '  </script>',
 '  <script language="javascript" src="/includes/javascript/erowid_combobox_lib.js" type="text/javascript">',
 '  </script>',
 '  <script language="javascript" src="/includes/javascript/external/mobile-detect.min.js" type="text/javascript">',
 '  </script>',
 ' </head>',
 ' <body alink="#008080" bgcolor="#000000" link="#7777AA" te

It appears that a elements carrying a name attribute are used only for single letters of the alphabet and for drug names. I can get a list of all drugs by pulling the values of the name attributes into a list and then deleting the single letters "A," "B," "C," etc.

In [3]:
a_element_name = front_soup.find_all('a', attrs={"name": True})
drug_names = []
for result in a_element_name:
    drug_names.append(result.attrs['name'])
drug_names

['A',
 'AB-001',
 'AB-CHMINACA',
 'AB-FUBINACA',
 'Absinthe',
 'Acacia',
 'Acacia confusa',
 'Acacia maidenii',
 'Acacia phlebophylla',
 'Acepromazine',
 'Acetaminophen',
 'Acetildenafil',
 'Acetylfentanyl',
 'Aconitum napellus',
 'ADB-FUBINACA',
 'Adrafinil',
 'Adrenochrome',
 'AET',
 'AH-7921',
 'AL-LAD',
 'Albizia julibrissin',
 'Alcohol',
 'Alcohol - Beer/Wine',
 'Alcohol - Hard',
 'ALD-52',
 'ALEPH',
 'Aleph-4',
 'Allylescaline',
 'Aloes',
 'alpha-GPC',
 'alpha-PCYP',
 'alpha-PHiP',
 'alpha-PHP',
 'alpha-PVP',
 'AM-2201',
 'AM-DIPT',
 'Amanitas',
 'Amanitas - A. muscaria',
 'Amanitas - A. pantherina',
 'Amphetamines',
 'Amphetamines - Substituted',
 'AMT',
 'Anabolic Steroids',
 'Anadenanthera colubrina',
 'Anadenanthera peregrina',
 'Anadenanthera spp.',
 'Animals',
 'Animals - Black Widow Spider',
 'Animals - Fire Ants',
 'Animals - Frogs',
 'Aniracetam',
 'AP-238',
 'Argemone spp. ',
 'Armodafinil',
 'Arundo donax',
 'Arylcyclohexylamines',
 'Ashwagandha',
 'Aspirin',
 'Atropin
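
The filtering step described above (dropping the single-letter entries) is not shown in the notebook, so here is a minimal sketch of one way it could be done. The function name `drop_letter_anchors` is hypothetical, and stripping whitespace is an extra assumption prompted by entries such as `'Argemone spp. '`, which carries a trailing space in the output.

```python
import string


def drop_letter_anchors(names):
    """Return names with the single-letter A-Z entries removed.

    Hypothetical helper: assumes the only unwanted entries are the
    uppercase letters used as alphabet headings on the page.
    """
    letters = set(string.ascii_uppercase)
    cleaned = []
    for name in names:
        name = name.strip()  # optional: removes stray leading/trailing spaces
        if name and name not in letters:
            cleaned.append(name)
    return cleaned


# A few entries copied from the output above, plus 'B' as a second heading
sample = ['A', 'AB-001', 'Absinthe', 'Argemone spp. ', 'B']
print(drop_letter_anchors(sample))
```

If the names are later matched back against the page's name attributes, the `strip()` call should be dropped so that the strings stay identical to the source.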

In [4]:
# Get a list of just psychedelic drugs of interest to me. 
psychedelic_drugs = ['AET', 'AL-LAD', 'ALD-52', 'ALEPH', 'Aleph-4', 'Allylescaline',
                     'AMT', 'Arylcyclohexylamines', 'Ayahuasca', 'Banisteriopsis caapi', 
                     'BOD', 'BOH-2C-B', 'Bufotenin', 'Cacti - Mescaline-containing', 'DALT', 
                     'Deschloroketamine', 'DET', 'DiPT', 'DMT', 'DMT-Containing', 'DMXE', 
                     'DOB', 'DOC', 'DOET', 'DOF', 'DOI', 'DOIP', 'DOM', 'DON', 'DOPR', 'DPT', 
                     'EIPLA', 'EPT', 'Escaline', 'ETH-LAD', 'Fluorexetamine', 'H.B. Woodrose',
                     'Harmaline', 'Harmine', 'Herbal Ecstasy', 'HOT-17', 'HOT-2', 'HOT-7',
                     'Huasca Brew', 'Huasca Brew Group', 'Huasca Combo', 'Huasca Group', 'HXE',
                     'Iboga Alkaloid Group', 'Ibogaine', 'Isoproscaline', 'Ketamine', 'LSA',
                     'LSD', 'LSM-775', 'LSZ', 'MALT', 'MDA', 'MDAI', 'MDE', 'MDMA', 'MEM',
                     'Mescaline', 'MET', 'Methallylescaline', 'Methoxetamine', 
                     'Methoxpropamine', 'Mimosa ophthalmocentra', 'Mimosa spp.',
                     'Mimosa tenuiflora', 'MIPLA', 'MIPT', 'MMDA', 'MMDA-3a', 'MPT',
                     'Mushrooms', 'Mushrooms - G. spectabilis', 'Mushrooms - P. atlantis',
                     'Mushrooms - P. azurescens', 'Mushrooms - P. cubensis', 
                     'Mushrooms - P. cyanescens', 'Mushrooms - P. mexicana',
                     'Mushrooms - P. semilanceata', 'Mushrooms - P. subaeruginosa',
                     'Mushrooms - P. tampanensis', 'Mushrooms - P. weilii',
                     'Mushrooms - Panaeolus cyanescens', 'MXiPr', 'PCE', 'PCP', 'Peyote',
                     'Phenethylamine', 'Phenethylamines', 'Phenethylamines - Other',
                     'PIPT', 'Proscaline', 'Psilocin', 'Psilocybin', 'S-Ketamine',
                     'Tabernanthe iboga', 'TCB-2', 'Tetrahydroharmine', 'TMA', 'TMA-2', 
                     'TMA-6', 'Tryptamines - Substituted', '1B-LSD', '1cP-AL-LAD', '1cP-LSD',
                     '1F-LSD', '1P-ETH-LAD', '1P-LSD', '1V-LSD', "2'-Oxo-PCE",
                     '2-Fluorodeschloroketamine', '2-Me-DMT', '2C-B', '2C-B-Fly', '2C-C',
                     '2C-CN', '2C-D', '2C-E', '2C-EF', '2C-G-N', '2C-H', '2C-I', '2C-IP',
                     '2C-N', '2C-P', '2C-T', '2C-T-13', '2C-T-2', '2C-T-21', '2C-T-4', 
                     '2C-T-7', '2C-TFM', '3,4-MD-PCP', '3-Cl-PCP', '3-HO-PCE', '3-HO-PCP',
                     '3-Me-PCE', '3-Me-PCPy', '3-MEO-PCE', '3-MeO-PCMo', '3-MeO-PCP',
                     '3-Methyl-PCP', '3C-E', '3C-P', '3F-PCP', '4-AcO-DALT', '4-AcO-DET',
                     '4-AcO-DiPT', '4-AcO-DMT', '4-AcO-DPT', '4-AcO-EIPT', '4-AcO-EPT',
                     '4-AcO-MALT', '4-AcO-MET', '4-AcO-MiPT', '4-AcO-MPT', '4-HO-DET',
                     '4-HO-DiPT', '4-HO-DPT', '4-HO-EPT', '4-HO-MALT', '4-HO-MCPT', '4-HO-MET',
                     '4-HO-MiPT', '4-HO-MPT', '4-HO-PIPT', '4-MeO-DMT', '4-MeO-MiPT',
                     '4-MeO-PCP', '4-MTA', '4-PrO-DMT', '4C-D', '5-Chloro-AMT', '5-MeO-AET',
                     '5-MeO-AMT', '5-MeO-DALT', '5-MeO-DET', '5-MeO-DiPT', '5-MeO-DMT', 
                     '5-MeO-DPT', '5-MeO-EIPT', '5-MeO-MALT', '5-MeO-MET', '5-MeO-MIPT',
                     '5-MeO-PIPT', '5-MeO-TMT', '5-Methoxy-Tryptamine']
len(psychedelic_drugs)

191

Create a list of links associated with these drugs. The format is:
https://erowid.org/experiences/subs/exp_<DRUG>.shtml

In [5]:
# Dashes and periods need to be deleted and spaces need to be replaced with underscores.
# Double spaces (left behind once ' - ' loses its dash) are handled first so they
# become a single underscore.
psychedelic_drugs = [
    drug.replace('-', '').replace('.', '').replace('  ', '_').replace(' ', '_')
    for drug in psychedelic_drugs
]
psychedelic_drugs

['AET',
 'ALLAD',
 'ALD52',
 'ALEPH',
 'Aleph4',
 'Allylescaline',
 'AMT',
 'Arylcyclohexylamines',
 'Ayahuasca',
 'Banisteriopsis_caapi',
 'BOD',
 'BOH2CB',
 'Bufotenin',
 'Cacti_Mescalinecontaining',
 'DALT',
 'Deschloroketamine',
 'DET',
 'DiPT',
 'DMT',
 'DMTContaining',
 'DMXE',
 'DOB',
 'DOC',
 'DOET',
 'DOF',
 'DOI',
 'DOIP',
 'DOM',
 'DON',
 'DOPR',
 'DPT',
 'EIPLA',
 'EPT',
 'Escaline',
 'ETHLAD',
 'Fluorexetamine',
 'HB_Woodrose',
 'Harmaline',
 'Harmine',
 'Herbal_Ecstasy',
 'HOT17',
 'HOT2',
 'HOT7',
 'Huasca_Brew',
 'Huasca_Brew_Group',
 'Huasca_Combo',
 'Huasca_Group',
 'HXE',
 'Iboga_Alkaloid_Group',
 'Ibogaine',
 'Isoproscaline',
 'Ketamine',
 'LSA',
 'LSD',
 'LSM775',
 'LSZ',
 'MALT',
 'MDA',
 'MDAI',
 'MDE',
 'MDMA',
 'MEM',
 'Mescaline',
 'MET',
 'Methallylescaline',
 'Methoxetamine',
 'Methoxpropamine',
 'Mimosa_ophthalmocentra',
 'Mimosa_spp',
 'Mimosa_tenuiflora',
 'MIPLA',
 'MIPT',
 'MMDA',
 'MMDA3a',
 'MPT',
 'Mushrooms',
 'Mushrooms_G_spectabilis',
 'Mushrooms_

In [6]:
# Create strings for urls
drug_urls = []
for drug in psychedelic_drugs:
    drug_urls.append('https://erowid.org/experiences/subs/exp_' + drug + '.shtml')
drug_urls[:5]

['https://erowid.org/experiences/subs/exp_AET.shtml',
 'https://erowid.org/experiences/subs/exp_ALLAD.shtml',
 'https://erowid.org/experiences/subs/exp_ALD52.shtml',
 'https://erowid.org/experiences/subs/exp_ALEPH.shtml',
 'https://erowid.org/experiences/subs/exp_Aleph4.shtml']
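
The cells above overwrite each drug name with its URL slug, so the original display names are no longer available once the URLs are built. As a hedged alternative, the sketch below keeps both by returning a dictionary from original name to URL. The names `to_slug`, `build_url_map`, and `URL_TEMPLATE` are hypothetical; the replacement rules are the ones stated in the notebook.

```python
URL_TEMPLATE = 'https://erowid.org/experiences/subs/exp_{slug}.shtml'


def to_slug(name):
    """Apply the notebook's rules: drop dashes and periods, spaces to underscores."""
    return (
        name.replace('-', '')
        .replace('.', '')
        .replace('  ', '_')  # double space left behind by ' - '
        .replace(' ', '_')
    )


def build_url_map(names):
    """Map each original drug name to its page URL (hypothetical helper)."""
    return {name: URL_TEMPLATE.format(slug=to_slug(name)) for name in names}


url_map = build_url_map(['AL-LAD', 'H.B. Woodrose', 'Cacti - Mescaline-containing'])
for name, url in url_map.items():
    print(name, '->', url)
```

Keeping the original name as the key should make it easier to label rows later, when the narratives are assembled into a dataframe.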

These work correctly. Navigate to each page and gather the link to "Show All" experience reports. 

<font color='violet'> Explore the structure of a drug's page.

In [7]:
aet_url = 'https://erowid.org/experiences/subs/exp_AET.shtml'
aet_page = requests.get(aet_url)
aet_soup = BeautifulSoup(aet_page.content, "html.parser")
aet_pretty = aet_soup.prettify().splitlines()
aet_pretty[:100]

['<html>',
 ' <head>',
 '  <title>',
 '   AET (also Alpha-ethyltryptamine; Monase) : Erowid Exp: Main Index',
 '  </title>',
 '  <meta content="A categorized index of first-person experiences with AET" name="description"/>',
 '  <meta content="Experience Report Vaults, trip reports, stories, descriptions" name="keywords"/>',
 '  <link href="/includes/general_default.css" rel="stylesheet" type="text/css"/>',
 '  <link href="includes/exp.css" rel="stylesheet" type="text/css"/>',
 '  <script src="/includes/javascript/jquery-3.2.1.min.js" type="text/javascript">',
 '  </script>',
 '  <script language="javascript" src="/includes/javascript/erowid_combobox_lib.js" type="text/javascript">',
 '  </script>',
 '  <script language="javascript" src="/includes/javascript/external/mobile-detect.min.js" type="text/javascript">',
 '  </script>',
 ' </head>',
 ' <body alink="#008080" bgcolor="#000000" link="#7777AA" text="#999977" vlink="#999999">',
 '  <table align="CENTER" border="0" cellpadding="0" 

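The href for the link to a drug's report listing is on the a element that wraps an img tag, so it can be found through the img's alt text. The cell below matches the alt text "Show New Reports," which returns the new-reports query (`exp.cgi?New&S1=299` for AET) rather than a "Show All Reports" link.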

In [8]:
# Find the correct href on one of the pages
aet_soup.find('img', attrs={'alt':'Show New Reports'}).parent['href']

'/experiences/exp.cgi?New&S1=299'
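
The href returned above carries a query string, and splitting it into parts may be useful later when building other listing URLs. The sketch below uses only the standard library. The helper name `query_params` is hypothetical, and reading `S1` as a per-substance identifier is an assumption based on this one example, not something the page confirms.

```python
from urllib.parse import parse_qs, urlsplit


def query_params(href):
    """Return the query parameters of an href as a dict of lists."""
    query = urlsplit(href).query
    # keep_blank_values=True retains flags without a value, such as 'New'
    return parse_qs(query, keep_blank_values=True)


params = query_params('/experiences/exp.cgi?New&S1=299')
print(params)
print(params['S1'][0])  # assumed to identify the substance (AET here)
```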

In [None]:
# Collect all hrefs
vault_hrefs = []
bad_urls = []

for url in tqdm(drug_urls):
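    # Some urls could be wrong: the request can fail, or the page can lack
    # the expected img (find returns None) or the href on its parent
    try:
        drug_page = requests.get(url)
        drug_soup = BeautifulSoup(drug_page.content, "html.parser")
        href = drug_soup.find('img', attrs={'alt':'Show New Reports'}).parent['href']
        vault_hrefs.append(href)
    except (requests.RequestException, AttributeError, KeyError):
        bad_urls.append(url)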

vault_hrefs[:5]

  1%|          | 1/191 [00:01<04:59,  1.57s/it]
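
The loop above issues 191 requests back to back. A short pause between requests is a common courtesy when scraping, and recording the reason for each failure makes `bad_urls` easier to debug. The sketch below is one possible shape for that, not the notebook's method: `collect_with_delay` and its `fetch` parameter are hypothetical, and `fetch` stands in for whatever function downloads and parses a page.

```python
import time


def collect_with_delay(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each url, pausing between calls.

    Returns (results, failures): results maps url to fetch's return value,
    failures is a list of (url, error description) pairs.
    """
    results = {}
    failures = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        try:
            results[url] = fetch(url)
        except Exception as error:  # broad on purpose: the error is recorded, not hidden
            failures.append((url, repr(error)))
    return results, failures


def demo_fetch(url):
    """Stand-in for a real download; fails for one url to show the bookkeeping."""
    if url.endswith('missing'):
        raise ValueError('page not found')
    return 'contents of ' + url


pages, failed = collect_with_delay(['a', 'missing', 'b'], demo_fetch, delay_seconds=0)
print(pages)
print(failed)
```

In the notebook, `fetch` could wrap the `requests.get` and `BeautifulSoup` calls from the cell above; the one-second default delay is an arbitrary choice.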

In [None]:
# Turn hrefs into proper urls
vault_urls = []
for href in vault_hrefs:
    vault_urls.append('https://erowid.org' + href)
vault_urls[:5]
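
String concatenation works here because every collected href starts with a slash. As a hedged alternative, `urllib.parse.urljoin` from the standard library also handles hrefs that are relative to the current page or already absolute. The helper name `absolutize` is hypothetical.

```python
from urllib.parse import urljoin


def absolutize(hrefs, base_url='https://erowid.org/'):
    """Resolve each href against base_url and return absolute URLs."""
    return [urljoin(base_url, href) for href in hrefs]


# Root-relative href, as returned for AET earlier in the notebook
print(absolutize(['/experiences/exp.cgi?New&S1=299']))

# A page-relative href resolves against the page it was found on
page = 'https://erowid.org/experiences/subs/exp_AET.shtml'
print(urljoin(page, 'exp_LSD.shtml'))
```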