# <font color='violet'> Cleaning, Parsing, Feature Engineering on Psychedelic Experience Reports
    
Clean the report texts with the same methods I used for the reviews from studies, and handle any issues that are unique to the Erowid texts and need additional cleaning. Then engineer the same features as in the previously modeled data: complexity level, similarity to a meta-perfect-review, subjectivity, and polarity.

In [78]:
import pandas as pd
from tqdm import tqdm
from collections import Counter
from textstat import flesch_kincaid_grade
import contractions

In [98]:
# prepare to add local python functions; import modules from src directory
import sys
src = '../src'
sys.path.append(src)

# import local functions
from nlp.parse import (remove_accented_chars, strip_most_punc, strip_apostrophe,
                       strip_emoji_like_if_spaces, strip_non_emoji_emoji_symbol)

ImportError: cannot import name 'strip_emoji_like_if_spaces' from 'nlp.parse' (/Users/admin/Documents/GitHub/psychedelic_efficacy/notebooks/../src/nlp/parse.py)

In [2]:
df = pd.read_csv('../data/raw/erowid/raw_reports_final.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15994 entries, 0 to 15993
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  15994 non-null  int64 
 1   drug        15994 non-null  object
 2   weight      15994 non-null  object
 3   year        15994 non-null  object
 4   gender      15994 non-null  object
 5   age         15994 non-null  object
 6   report      15994 non-null  object
 7   url         15994 non-null  object
dtypes: int64(1), object(7)
memory usage: 999.8+ KB


In [3]:
df = df.drop(columns=['Unnamed: 0'])
df.columns

Index(['drug', 'weight', 'year', 'gender', 'age', 'report', 'url'], dtype='object')

In [4]:
# Are there rows that are total duplicates of one another?
df = df.drop_duplicates().reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9458 entries, 0 to 9457
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drug    9458 non-null   object
 1   weight  9458 non-null   object
 2   year    9458 non-null   object
 3   gender  9458 non-null   object
 4   age     9458 non-null   object
 5   report  9458 non-null   object
 6   url     9458 non-null   object
dtypes: object(7)
memory usage: 517.4+ KB


In [5]:
# Are there rows that are identical in everything except for the url?
without_urls = df.drop(columns=['url']).drop_duplicates().reset_index(drop=True)
len(without_urls)

9458

In [6]:
# Get my list of target drugs I actually want to analyze.
# A context manager closes the file once it has been read.
with open('../data/raw/erowid/psychedelic_drugs.txt', 'r') as drugs_file:
    drugs_as_string = drugs_file.read()
psychedelic_drugs = drugs_as_string.split(',')
psychedelic_drugs[:10]

['AET',
 'AL-LAD',
 'ALD-52',
 'ALEPH',
 'Aleph-4',
 'Allylescaline',
 'AMT',
 'Arylcyclohexylamines',
 'Ayahuasca',
 'Banisteriopsis caapi']

In [7]:
# Rename the dataframes I'm working with
with_urls = df.copy()

# Drop rows from without_urls for non-target drugs
df = without_urls[without_urls.drug.isin(psychedelic_drugs)].copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5458 entries, 0 to 9456
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drug    5458 non-null   object
 1   weight  5458 non-null   object
 2   year    5458 non-null   object
 3   gender  5458 non-null   object
 4   age     5458 non-null   object
 5   report  5458 non-null   object
dtypes: object(6)
memory usage: 298.5+ KB


In [8]:
# How many reviews are still duplicated?
len(set(df.report))

4554

<font color='violet'> Remove rows with duplicate reviews. 

Some reviews appear more than once because the same report is associated with multiple drugs. For each such report, remove the row with the less common drug, so that I keep as much data as possible on the most popular psychedelics.

In [25]:
# I'll need an index without gaps
df = df.reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5458 entries, 0 to 5457
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drug    5458 non-null   object
 1   weight  5458 non-null   object
 2   year    5458 non-null   object
 3   gender  5458 non-null   object
 4   age     5458 non-null   object
 5   report  5458 non-null   object
dtypes: object(6)
memory usage: 256.0+ KB


In [20]:
# Count the rows for each drug, most common first
drug_prevalance = dict(Counter(df['drug']).most_common())
drug_prevalance

{'MDMA': 223,
 'LSD': 181,
 'DMT': 157,
 '2C-I': 127,
 'Ketamine': 126,
 'Mushrooms': 123,
 'Mimosa tenuiflora': 123,
 '4-AcO-DMT': 120,
 '2C-E': 120,
 '2C-B': 113,
 'Methoxetamine': 109,
 '5-MeO-DMT': 109,
 'Mushrooms - P. cubensis': 109,
 'DPT': 107,
 'H.B. Woodrose': 105,
 '2C-T-2': 104,
 '2C-C': 98,
 '2C-T-7': 98,
 'Ayahuasca': 97,
 '5-MeO-DiPT': 96,
 'DOC': 95,
 '1P-LSD': 94,
 '5-MeO-MIPT': 94,
 'AMT': 94,
 '4-HO-MET': 94,
 '5-MeO-AMT': 93,
 'PCP': 91,
 '4-HO-MiPT': 89,
 '2C-P': 88,
 'MDA': 86,
 'Mescaline': 85,
 'AL-LAD': 80,
 'Banisteriopsis caapi': 79,
 'DiPT': 76,
 'Mushrooms - P. semilanceata': 71,
 '2C-D': 69,
 'Tabernanthe iboga': 61,
 '3-MeO-PCP': 60,
 '4-HO-DiPT': 57,
 'DOM': 54,
 '4-AcO-DET': 54,
 '4-AcO-DiPT': 54,
 'DOB': 54,
 'Mushrooms - P. cyanescens': 47,
 'Peyote': 45,
 'DOI': 43,
 'Ibogaine': 43,
 'MDAI': 42,
 'Mushrooms - Panaeolus cyanescens': 36,
 'Mushrooms - P. mexicana': 35,
 'Deschloroketamine': 34,
 'TMA-2': 33,
 'ALD-52': 32,
 '4-AcO-MiPT': 32,
 '3-MEO-PC

In [30]:
# Remove the less-common drug's row for duplicate reviews
indices_to_drop = []

for row in tqdm(range(len(df))):
    # n is an offset from the target row; n = 0 compares the row with itself.
    for n in range(row-1):
        # Not all combos of row+n exist: row+n can run past the last index,
        # in which case df.loc raises a KeyError that is skipped below.
        try:
            same_report = df.loc[row, 'report'] == df.loc[row+n, 'report']
            row_count = drug_prevalance[df.loc[row, 'drug']]
            other_count = drug_prevalance[df.loc[row+n, 'drug']]
            # Only delete the row with a less-common drug, if there is one.
            if same_report and row_count < other_count:
                indices_to_drop.append(row)
            # The later row (row+n) has the less-common drug: drop that one instead.
            elif same_report and row_count > other_count:
                indices_to_drop.append(row+n)
            # If drug prevalance for row == for row+n, it's probably the same row: keep this row.
        except KeyError:
            pass

len(indices_to_drop)

100%|██████████| 5458/5458 [14:01<00:00,  6.49it/s]


1144

In [33]:
# This is close to what I'd expect the length of duplicates to be. Drop these rows.
df = df.drop(index=indices_to_drop)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4562 entries, 0 to 5457
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drug    4562 non-null   object
 1   weight  4562 non-null   object
 2   year    4562 non-null   object
 3   gender  4562 non-null   object
 4   age     4562 non-null   object
 5   report  4562 non-null   object
dtypes: object(6)
memory usage: 249.5+ KB

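The rule applied above (when the same report text appears under several drugs, keep the row for the most common drug) can also be written as a single pass instead of a nested loop. The following is only a sketch, not the code that produced the counts above: `rows_to_keep` is a hypothetical helper, and keeping the first row seen when two drugs are equally common is an assumption.

```python
from collections import Counter

def rows_to_keep(drugs, reports):
    """Return positions of the rows to keep, one per distinct report text."""
    prevalence = Counter(drugs)          # rows per drug
    best = {}                            # report text -> position of the row kept so far
    for position, (drug, report) in enumerate(zip(drugs, reports)):
        kept = best.get(report)
        # Keep the first row seen, unless a later row has a more common drug.
        if kept is None or prevalence[drug] > prevalence[drugs[kept]]:
            best[report] = position
    return sorted(best.values())

# Toy data: report 'a' is listed under LSD and DMT, report 'c' under LSD and AET.
print(rows_to_keep(['LSD', 'LSD', 'DMT', 'LSD', 'AET'], ['a', 'b', 'a', 'c', 'c']))
```

With the dataframe in this notebook, it could be applied as `df.iloc[rows_to_keep(df['drug'].tolist(), df['report'].tolist())]`. Because it considers every duplicate of a report, the number of rows it removes may differ from the loop's result.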

<font color='violet'> Parse text 
    
In general, use the same processes I used on the previously modeled reviews of psych meds from studies.

Before removing punctuation, though: I had noticed that most of the reports start with a block of leftover content from the HTML page. Remove that first, then create a function or functions that clean up the core text.

In [34]:
# Take a look at one of the reports
df.report[0]

"\n\n\xa0\n\n\n\n\nDOSE:\n\xa0 repeated\nsmoked\nDMT\n\n\n\xa0\n\xa0 repeated\nsmoked\nCannabis\n\n\n\n\n\nBODY WEIGHT:\n102 kg\n\n\n\n\nSeeing my Buddha-Nature on DMT \r\n\nI am a 22 year old male around 102kg. What I am about to tell you is my experience of using DMT for the first time. I took around 100-150mg of DMT about a month ago. I have no clue as to what the exact dosage is I have no clue as to what the exact dosage is, because I eventually started eyeballing it trying to take bigger dosages in my attempt to “break through”, which I believe was unsuccessful. My only other psychedelic experience is LSD which I tripped heavily on around a year ago, but I stopped completely a couple months before this experience. I am writing this from memory so all of the details may not be 100 percent accurate. \r\n\nThis trip happened about a month ago. I decided to try DMT for the first time because the guy I usually see had it and it tested clean. I also live in student accommodation and eve

I want to remove everything through the run of 5 line breaks (\n\n\n\n\n) that follows the BODY WEIGHT section, and everything from "Exp Year" onward. Is this a consistent pattern across many of the strings?

Start by removing everything up through the word WEIGHT. There are multiple \n\n\n\n\n runs before that point, so removing them first makes it easier to find the final one.

In [55]:
# Test with the first text
text = df.report[0]
cut_beg_loc = text.index('WEIGHT')
beg_to_cut = text[:cut_beg_loc] + 'WEIGHT'
beg_to_cut

'\n\n\xa0\n\n\n\n\nDOSE:\n\xa0 repeated\nsmoked\nDMT\n\n\n\xa0\n\xa0 repeated\nsmoked\nCannabis\n\n\n\n\n\nBODY WEIGHT'

In [56]:
# Reset the index again
df = df.reset_index(drop=True)
df.head(2)

Unnamed: 0,drug,weight,year,gender,age,report
0,DMT,102 kg,2022,Male,22,\n\n \n\n\n\n\nDOSE:\n repeated\nsmoked\nDMT\...
1,AET,150 lb,2006,Male,Not Given,\n\n\n \n\n\n\n\nDOSE:\n repeated\ninsufflate...

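The header and footer removal planned above can be sketched as one function, using the three markers named in this notebook: `WEIGHT`, a run of five line breaks, and `Exp Year: `. `strip_report_wrapper` is a hypothetical helper, not part of the project's `src` package, and leaving a report untouched when a marker is missing is an assumption.

```python
def strip_report_wrapper(text: str) -> str:
    """Drop the page header before the report body and the footer after it."""
    # Header, step 1: drop everything up through the first 'WEIGHT'.
    _, found_weight, rest = text.partition('WEIGHT')
    if found_weight:
        # Header, step 2: drop everything through the next run of five line breaks.
        _, found_breaks, body = rest.partition('\n\n\n\n\n')
        text = body if found_breaks else rest
    # Footer: keep only what comes before 'Exp Year: ' (the whole text if it is absent).
    head, _, _ = text.partition('Exp Year: ')
    return head

sample = ('\n\nDOSE:\nsmoked\nDMT\n\n\n\n\n\nBODY WEIGHT:\n102 kg\n\n\n\n\n'
          'Seeing things. Exp Year: 2022ExpID: 1\nGender: Male')
print(repr(strip_report_wrapper(sample)))
```

`str.partition` splits at the first occurrence of the marker, so it mirrors the `text.index(...)` lookups used in the cells that follow without raising when the marker is not there.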

In [57]:
# Remove everything up through WEIGHT on every report. 

for row in tqdm(range(len(df))):
    text = df.loc[row,'report']
    cut_beg_idx = text.index('WEIGHT')
    beg_to_cut = text[:cut_beg_idx] + 'WEIGHT'
    new_text = text.replace(beg_to_cut,'')
    df.loc[row,'report'] = new_text

# check if that worked
df.report[:3]

100%|██████████| 4562/4562 [00:00<00:00, 5742.82it/s]


0    :\n102 kg\n\n\n\n\nSeeing my Buddha-Nature on ...
1    :\n150 lb\n\n\n\n\n\nOver the past week, I've ...
2    :\n170 lb\n\n\n\n\n\n8:00 p.m.\t130 mg AET ora...
Name: report, dtype: object

In [60]:
# Looking good so far. Now remove up through the \n\n\n\n\n. Test with the first text first.
text = df.report[0]
cut_beg_loc = text.index('\n\n\n\n\n')
beg_to_cut = text[:cut_beg_loc] + '\n\n\n\n\n'
beg_to_cut

':\n102 kg\n\n\n\n\n'

In [61]:
# Repeat removal of this substring from the beginning of each review. 

for row in tqdm(range(len(df))):
    text = df.loc[row,'report']
    cut_beg_idx = text.index('\n\n\n\n\n')
    beg_to_cut = text[:cut_beg_idx] + '\n\n\n\n\n'
    new_text = text.replace(beg_to_cut,'')
    df.loc[row,'report'] = new_text

# check if that worked
df.report[:3]

100%|██████████| 4562/4562 [00:00<00:00, 7838.53it/s]


0    Seeing my Buddha-Nature on DMT \r\n\nI am a 22...
1    \nOver the past week, I've had the chance to t...
2    \n8:00 p.m.\t130 mg AET oral capsule \r\nT+3:1...
Name: report, dtype: object

In [69]:
# Repeat for cutting off the endings. Start by finding in the first report. 
text = df.report[0]
cut_end_loc = text.index('Exp Year: ')
end_to_cut = text[cut_end_loc:]
end_to_cut

'Exp Year: 2022ExpID: 116975\nGender: Male\xa0\nAge at time of experience: 22\xa0\nPublished: Jan 28, 2023Views: 141\n[ View as PDF (for printing) ] [ View as LaTeX (for geeks) ]\n[ Switch Colors ]\n\nDMT (18), Cannabis (1) : First Times (2), Combinations (3), Entities / Beings (37), Alone (16)\n\n\n'

In [71]:
# Remove the tail of all reports
for row in tqdm(range(len(df))):
    text = df.loc[row,'report']
    # Some don't contain the substring; str.index raises ValueError for those
    try:
        cut_end_loc = text.index('Exp Year: ')
    except ValueError:
        continue
    df.loc[row,'report'] = text[:cut_end_loc]

# check if that worked
df.report[0]

100%|██████████| 4562/4562 [00:00<00:00, 6837.20it/s] 


"Seeing my Buddha-Nature on DMT \r\n\nI am a 22 year old male around 102kg. What I am about to tell you is my experience of using DMT for the first time. I took around 100-150mg of DMT about a month ago. I have no clue as to what the exact dosage is I have no clue as to what the exact dosage is, because I eventually started eyeballing it trying to take bigger dosages in my attempt to “break through”, which I believe was unsuccessful. My only other psychedelic experience is LSD which I tripped heavily on around a year ago, but I stopped completely a couple months before this experience. I am writing this from memory so all of the details may not be 100 percent accurate. \r\n\nThis trip happened about a month ago. I decided to try DMT for the first time because the guy I usually see had it and it tested clean. I also live in student accommodation and everyone had left for christmas and I was the only one still there, so I had no one to bother me when I was tripping and to smell the DMT i

<font color='violet'> Remove remaining html code
    
What remains seems to be \r, \n, or some combination of these, which are line-break characters rather than HTML tags. Replace each with a space wherever it appears. Using a space rather than '' may produce extra spaces, but at least it won't fuse adjacent words together.

In [72]:
for row in tqdm(range(len(df))):
    text = df.loc[row,'report']
    remove_rs = text.replace('\r', ' ')
    remove_ns = remove_rs.replace('\n', ' ')
    df.loc[row,'report'] = remove_ns

df.report[0]

100%|██████████| 4562/4562 [00:00<00:00, 6488.62it/s]


"Seeing my Buddha-Nature on DMT    I am a 22 year old male around 102kg. What I am about to tell you is my experience of using DMT for the first time. I took around 100-150mg of DMT about a month ago. I have no clue as to what the exact dosage is I have no clue as to what the exact dosage is, because I eventually started eyeballing it trying to take bigger dosages in my attempt to “break through”, which I believe was unsuccessful. My only other psychedelic experience is LSD which I tripped heavily on around a year ago, but I stopped completely a couple months before this experience. I am writing this from memory so all of the details may not be 100 percent accurate.    This trip happened about a month ago. I decided to try DMT for the first time because the guy I usually see had it and it tested clean. I also live in student accommodation and everyone had left for christmas and I was the only one still there, so I had no one to bother me when I was tripping and to smell the DMT if I sm

Before removing any more punctuation, I need to get the text complexity of each report, like I did with the previous psych med reviews.

<font color='violet'> Create feature: text complexity 

In [74]:
df['complexity'] = df['report'].apply(flesch_kincaid_grade)
df.head()

Unnamed: 0,drug,weight,year,gender,age,report,complexity
0,DMT,102 kg,2022,Male,22,Seeing my Buddha-Nature on DMT I am a 22 ye...,7.9
1,AET,150 lb,2006,Male,Not Given,"Over the past week, I've had the chance to tr...",7.8
2,AET,170 lb,2007,Male,Not Given,8:00 p.m.\t130 mg AET oral capsule T+3:15\t...,7.2
3,AET,220 lb,2007,Male,Not Given,I read about people insuffulating this to bri...,5.6
4,AET,200 lb,1986,Male,Not Given,Back in the mid to late 80s AET was legal and...,6.9

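Complexity has to be computed at this point because the Flesch-Kincaid grade depends on sentence boundaries, which disappear once punctuation is stripped. The sketch below applies the published formula to made-up counts to show the effect. `fk_grade` is a hypothetical helper, and textstat's own word, sentence, and syllable counting may differ in detail.

```python
def fk_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level computed from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Made-up counts: the same 100 words and 140 syllables, scored twice.
with_punctuation = fk_grade(words=100, sentences=5, syllables=140)
# Without punctuation the whole text reads as one long sentence.
without_punctuation = fk_grade(words=100, sentences=1, syllables=140)

print(round(with_punctuation, 2), round(without_punctuation, 2))  # prints 8.73 39.93
```

The grade jumps only because the sentence count collapses, which is why the score is taken before any further parsing.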

<font color='violet'> Continue parsing with previously-created functions

In [76]:
# Remove accented characters if there are any.
df['report'] = df['report'].apply(remove_accented_chars)
df.head()

Unnamed: 0,drug,weight,year,gender,age,report,complexity
0,DMT,102 kg,2022,Male,22,Seeing my Buddha-Nature on DMT I am a 22 ye...,7.9
1,AET,150 lb,2006,Male,Not Given,"Over the past week, I've had the chance to tr...",7.8
2,AET,170 lb,2007,Male,Not Given,8:00 p.m.\t130 mg AET oral capsule T+3:15\t...,7.2
3,AET,220 lb,2007,Male,Not Given,I read about people insuffulating this to bri...,5.6
4,AET,200 lb,1986,Male,Not Given,Back in the mid to late 80s AET was legal and...,6.9

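`remove_accented_chars` is imported from `src/nlp/parse.py` and is not shown in this notebook. For readers without that module, a common standard-library idiom for the same job is sketched below. `strip_accents` is a hypothetical stand-in and may not match the project's function exactly; note that it also drops any non-ASCII character that has no ASCII decomposition, such as curly quotes.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Replace accented letters with their unaccented ASCII base letters."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with errors ignored drops the combining marks.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents('café déjà vu'))  # prints: cafe deja vu
```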

In [79]:
# Expand contractions. Check results on row 1, which I can already see has a contraction. 
df['report'] = df['report'].apply(contractions.fix)
df.report[1]

" Over the past week, I have had the chance to try out AET at 3 different dosage levels.  Considering the rarity of the substance, I have decided that, even though I encountered nothing mindblowing, I should most definitely write up a summary of those experiences.   FIRST TRIAL   On the first trial, I took 100mg orally at about 10:00pm.  I would have done it earlier but I checked my mail, totally not expecting to see it yet, but it had in fact arrived.  I was going to start with 80mg but in reading the TIHKAL entries, it seemed that 80 would have been very weak indeed, so I started with 100.  The taste, texture, and shape of the granules reminds me greatly of AMT, as does the smell.  It is a less powerful whiff of the same shit-like stench that pervades the air when opening AMT up.  I held it in my mouth with some spit and swished it around, attempting to force sublingual absorbtion.  That skatole taste filled my mouth, weaker than with AMT but still offensive.     After about 5 minute

In [81]:
# That worked. Now, strip most of the punctuation with function strip_most_punc
strip_most_punc(df, 'report')
df.report[1]

100%|██████████| 4562/4562 [00:00<00:00, 6006.21it/s]


" Over the past week I have had the chance to try out AET at 3 different dosage levels  Considering the rarity of the substance I have decided that even though I encountered nothing mindblowing I should most definitely write up a summary of those experiences   FIRST TRIAL   On the first trial I took 100mg orally at about 10:00pm  I would have done it earlier but I checked my mail totally not expecting to see it yet but it had in fact arrived  I was going to start with 80mg but in reading the TIHKAL entries it seemed that 80 would have been very weak indeed so I started with 100  The taste texture and shape of the granules reminds me greatly of AMT as does the smell  It is a less powerful whiff of the same shitlike stench that pervades the air when opening AMT up  I held it in my mouth with some spit and swished it around attempting to force sublingual absorbtion  That skatole taste filled my mouth weaker than with AMT but still offensive     After about 5 minutes my mucous mebranes sta

In [82]:
# Most punctuation is now gone. Strip apostrophes. First, find one so I know it worked. 
df[df['report'].str.find("'")!=-1].head(1)

Unnamed: 0,drug,weight,year,gender,age,report,complexity
0,DMT,102 kg,2022,Male,22,Seeing my BuddhaNature on DMT I am a 22 yea...,7.9


In [83]:
df.report[0]

"Seeing my BuddhaNature on DMT    I am a 22 year old male around 102kg What I am about to tell you is my experience of using DMT for the first time I took around 100150mg of DMT about a month ago I have no clue as to what the exact dosage is I have no clue as to what the exact dosage is because I eventually started eyeballing it trying to take bigger dosages in my attempt to break through which I believe was unsuccessful My only other psychedelic experience is LSD which I tripped heavily on around a year ago but I stopped completely a couple months before this experience I am writing this from memory so all of the details may not be 100 percent accurate    This trip happened about a month ago I decided to try DMT for the first time because the guy I usually see had it and it tested clean I also live in student accommodation and everyone had left for christmas and I was the only one still there so I had no one to bother me when I was tripping and to smell the DMT if I smoked it in my ro

In [84]:
# There's an apostrophe toward the bottom in earth's
strip_apostrophe(df, 'report')
df.report[0]

100%|██████████| 4562/4562 [00:01<00:00, 2953.38it/s]


'Seeing my BuddhaNature on DMT    I am a 22 year old male around 102kg What I am about to tell you is my experience of using DMT for the first time I took around 100150mg of DMT about a month ago I have no clue as to what the exact dosage is I have no clue as to what the exact dosage is because I eventually started eyeballing it trying to take bigger dosages in my attempt to break through which I believe was unsuccessful My only other psychedelic experience is LSD which I tripped heavily on around a year ago but I stopped completely a couple months before this experience I am writing this from memory so all of the details may not be 100 percent accurate    This trip happened about a month ago I decided to try DMT for the first time because the guy I usually see had it and it tested clean I also live in student accommodation and everyone had left for christmas and I was the only one still there so I had no one to bother me when I was tripping and to smell the DMT if I smoked it in my ro

In [85]:
# Stripping apostrophes worked. Find a ) to see if it gets stripped properly with 
# the strip_non_emoji_emoji_symbol function. 

df[df['report'].str.find(")")!=-1].head(1)

Unnamed: 0,drug,weight,year,gender,age,report,complexity
1,AET,150 lb,2006,Male,Not Given,Over the past week I have had the chance to t...,7.8


In [86]:
df.report[1]

' Over the past week I have had the chance to try out AET at 3 different dosage levels  Considering the rarity of the substance I have decided that even though I encountered nothing mindblowing I should most definitely write up a summary of those experiences   FIRST TRIAL   On the first trial I took 100mg orally at about 10:00pm  I would have done it earlier but I checked my mail totally not expecting to see it yet but it had in fact arrived  I was going to start with 80mg but in reading the TIHKAL entries it seemed that 80 would have been very weak indeed so I started with 100  The taste texture and shape of the granules reminds me greatly of AMT as does the smell  It is a less powerful whiff of the same shitlike stench that pervades the air when opening AMT up  I held it in my mouth with some spit and swished it around attempting to force sublingual absorbtion  That skatole taste filled my mouth weaker than with AMT but still offensive     After about 5 minutes my mucous mebranes sta

In [87]:
# ( and ) show up toward the end: (even when I did not combine. . .). Now strip. 
strip_non_emoji_emoji_symbol(df,'report')
df.report[1]

100%|██████████| 4562/4562 [00:06<00:00, 758.41it/s] 


' Over the past week I have had the chance to try out AET at 3 different dosage levels  Considering the rarity of the substance I have decided that even though I encountered nothing mindblowing I should most definitely write up a summary of those experiences   FIRST TRIAL   On the first trial I took 100mg orally at about 10:00pm  I would have done it earlier but I checked my mail totally not expecting to see it yet but it had in fact arrived  I was going to start with 80mg but in reading the TIHKAL entries it seemed that 80 would have been very weak indeed so I started with 100  The taste texture and shape of the granules reminds me greatly of AMT as does the smell  It is a less powerful whiff of the same shitlike stench that pervades the air when opening AMT up  I held it in my mouth with some spit and swished it around attempting to force sublingual absorbtion  That skatole taste filled my mouth weaker than with AMT but still offensive     After about 5 minutes my mucous mebranes sta

Parsing so far has worked well. Originally, I kept some punctuation intact because it might appear as part of an emoji. Here are the steps I took with the remaining punctuation after the parsing above:

- Keep wherever they exist: ! $ + = ? % 0-9
- Delete everywhere: #
- Delete if surrounded by spaces or adjacent to numbers, rather than adjacent to another symbol: ( ) : ;

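The "delete everywhere" rule and the "surrounded by spaces" case of the last rule can be sketched with the standard library. This is an assumption about the intent of the rules, not a copy of the project's helper functions: `strip_isolated_symbols` is a hypothetical name, and the "adjacent to numbers" case is left out.

```python
import re

# ( ) : ; with whitespace, or the start/end of the string, on both sides
ISOLATED_SYMBOL = re.compile(r'(?<!\S)[():;](?!\S)')

def strip_isolated_symbols(text: str) -> str:
    """Blank out every '#' and any ( ) : ; that stands alone between spaces."""
    text = text.replace('#', ' ')
    # A symbol touching another non-space character, as in ':)', is left alone.
    return ISOLATED_SYMBOL.sub(' ', text)

print(strip_isolated_symbols('beer #2 was ( fine ) :)').split())
```

Replacing with a space rather than an empty string follows the same reasoning as for the line breaks: it may leave extra spaces but never joins two words.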

In [88]:
# Delete hashtag symbols if there are any. 
df[df['report'].str.find("#")!=-1].head(1)

Unnamed: 0,drug,weight,year,gender,age,report,complexity
45,4-AcO-DMT,245 lb,2013,Male,29,Background 29 year old male 62 250lbs Very ...,4.3


In [89]:
df.report[45]

' Background 29 year old male  62 250lbs  Very experienced with psychedelic drugs    Going into this experience I had been quite stressed from personal financial matters and work related complications  I have the house to myself for the night  Girlfriend is gone for the night and the roommate is out of town  My dog will be keeping me company  I had a few beers in the afternoonbetween 2:30 and 3:30)  Followed by hydrating and relaxing  I ate an early dinner  Around 5:30  2 slices of pizza   7:52pm I opened a beer to relax my nerves  I plan to consume about half of it before dosing the ALLAD  I will wash down the 4acoDMT with a few sips of beer   8:08pm dose placed on tongue   8:16pm ALLAD onset is always super quick  Warmth slight alertness light headed buzz and euphoria  Tab still is in my mouth   8:20pm tab swallowed   8:23pm 4acoDMT consumed orally   8:25pm Very alert  At a ++  Still nursing my beer  About to put on some music   8:31pm I let the dog out to pee plus took a few rips of

In [90]:
# That showed up toward the bottom beer #2...beer #3
for row in tqdm(range(len(df))):
    text = df.loc[row,'report']
    df.loc[row,'report'] = text.replace('#', ' ')

df.report[45]

100%|██████████| 4562/4562 [00:01<00:00, 3604.96it/s]


' Background 29 year old male  62 250lbs  Very experienced with psychedelic drugs    Going into this experience I had been quite stressed from personal financial matters and work related complications  I have the house to myself for the night  Girlfriend is gone for the night and the roommate is out of town  My dog will be keeping me company  I had a few beers in the afternoonbetween 2:30 and 3:30)  Followed by hydrating and relaxing  I ate an early dinner  Around 5:30  2 slices of pizza   7:52pm I opened a beer to relax my nerves  I plan to consume about half of it before dosing the ALLAD  I will wash down the 4acoDMT with a few sips of beer   8:08pm dose placed on tongue   8:16pm ALLAD onset is always super quick  Warmth slight alertness light headed buzz and euphoria  Tab still is in my mouth   8:20pm tab swallowed   8:23pm 4acoDMT consumed orally   8:25pm Very alert  At a ++  Still nursing my beer  About to put on some music   8:31pm I let the dog out to pee plus took a few rips of

That worked. Now deal with ( ) : and ; symbols. First, remove them where they are surrounded only by spaces.

In [91]:
# Find an example.
df[df['report'].str.find(" ( ")!=-1].head(1)

Unnamed: 0,drug,weight,year,gender,age,report,complexity
20,AL-LAD,140 lb,2013,Male,44,Received ten x 100mcg blotters of ALLAD to tr...,7.9


In [92]:
df.report[20]

' Received ten x 100mcg blotters of ALLAD to trial for a vendor had it synthed up in a European lab licensed to handle LSD as a kind of proof of concept thing low volume run hoping it would be commercially viable NMRs and LCMS are available for it I trust the source based on past dealings my two experiences with it and those of others I have read from those who have also trialled it from the same source are entirely consistent with Shulgins reports on it and those I have read elsewhere previously I have every reason to believe what I got is exactly what it says on the tin despite it being a very rare chem generally   Trialled 300mcg one week previous to good effect though not quite where I wanted it to be at that dose Experienced with a good number of your traditional and notsotraditional psychedelics Years since last had LSD but plenty familiar enough with it to know the similarities and the differences here subtle though they are Running a tolerance generally due to somewhat frequent

In [95]:
# There's a solitary ( toward the end. Run the function to remove these. 
strip_emoji_like_if_spaces(df,'report')

df.report[20]

' Received ten x 100mcg blotters of ALLAD to trial for a vendor had it synthed up in a European lab licensed to handle LSD as a kind of proof of concept thing low volume run hoping it would be commercially viable NMRs and LCMS are available for it I trust the source based on past dealings my two experiences with it and those of others I have read from those who have also trialled it from the same source are entirely consistent with Shulgins reports on it and those I have read elsewhere previously I have every reason to believe what I got is exactly what it says on the tin despite it being a very rare chem generally   Trialled 300mcg one week previous to good effect though not quite where I wanted it to be at that dose Experienced with a good number of your traditional and notsotraditional psychedelics Years since last had LSD but plenty familiar enough with it to know the similarities and the differences here subtle though they are Running a tolerance generally due to somewhat frequent

In [None]:
# index=False keeps the index from being saved as an extra 'Unnamed: 0' column
df.to_csv('../data/processed/erowid_cleaned.csv', index=False)