# <font color='violet'> Cleaning, Parsing, Feature Engineering on Psychedelic Experience Reports </font>
    
Clean the report texts with the same methods I used for the reviews from studies, and address any issues that are unique to the Erowid texts and require additional cleaning. Then engineer the same feature columns as in the previously modeled data: complexity level, similarity with a meta-perfect-review, subjectivity, and polarity.

In [1]:
import pandas as pd
from tqdm import tqdm

In [2]:
df = pd.read_csv('../data/raw/erowid/raw_reports_final.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15555 entries, 0 to 15554
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  15555 non-null  int64 
 1   drug        15555 non-null  object
 2   weight      15555 non-null  object
 3   year        15555 non-null  object
 4   gender      15555 non-null  object
 5   age         15555 non-null  object
 6   report      15555 non-null  object
 7   url         15555 non-null  object
dtypes: int64(1), object(7)
memory usage: 972.3+ KB


In [3]:
df = df.drop(columns=['Unnamed: 0'])
df.columns

Index(['drug', 'weight', 'year', 'gender', 'age', 'report', 'url'], dtype='object')

In [4]:
# Are there rows that are total duplicates of one another?
df = df.drop_duplicates().reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9330 entries, 0 to 9329
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drug    9330 non-null   object
 1   weight  9330 non-null   object
 2   year    9330 non-null   object
 3   gender  9330 non-null   object
 4   age     9330 non-null   object
 5   report  9330 non-null   object
 6   url     9330 non-null   object
dtypes: object(7)
memory usage: 510.4+ KB


In [5]:
# Are there rows that are identical in everything except for the url?
without_urls = df.drop(columns=['url']).drop_duplicates().reset_index(drop=True)
len(without_urls)

8469
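The drop from 9330 to 8469 rows means 861 rows differed from another row only in their url. Before discarding them, it can help to look at the affected rows side by side. The following is a minimal sketch on a made-up toy frame (the rows and the `url_1`-style values are hypothetical placeholders, not Erowid data); it assumes only that the columns match those shown above.

```python
import pandas as pd

# Toy frame with the same columns as the real one; every value is made up.
toy = pd.DataFrame({
    'drug':   ['LSD', 'LSD', 'DMT'],
    'weight': ['70 kg', '70 kg', '60 kg'],
    'year':   ['2005', '2005', '2010'],
    'gender': ['Male', 'Male', 'Female'],
    'age':    ['25', '25', '30'],
    'report': ['text a', 'text a', 'text b'],
    'url':    ['url_1', 'url_2', 'url_3'],
})

# Compare rows on everything except the url.
content_cols = [c for c in toy.columns if c != 'url']

# keep=False flags every member of a duplicate group, not just the later copies,
# so the matching rows can be inspected together.
same_content = toy[toy.duplicated(subset=content_cols, keep=False)]

# The default keep='first' flags only the extra copies, i.e. the rows that
# drop_duplicates() would remove.
n_extra = int(toy.duplicated(subset=content_cols).sum())

print(same_content[['drug', 'url']])
print(n_extra)
```

On the real data, the same `duplicated(subset=...)` calls could be run on `df` before the url column is dropped.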

In [6]:
# Get my list of target drugs I actually want to analyze. 
# Use a context manager so the file handle is closed after reading.
with open('../data/raw/erowid/psychedelic_drugs.txt', 'r') as drugs_file:
    drugs_as_string = drugs_file.read()
psychedelic_drugs = drugs_as_string.split(',')
psychedelic_drugs[:10]

['AET',
 'AL-LAD',
 'ALD-52',
 'ALEPH',
 'Aleph-4',
 'Allylescaline',
 'AMT',
 'Arylcyclohexylamines',
 'Ayahuasca',
 'Banisteriopsis caapi']
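The first ten names above look clean, so a plain `split(',')` may be all this file needs. If the list ever contained spaces after commas, a trailing newline, or a trailing comma, those names would silently fail to match later. Here is a small defensive sketch; `parse_drug_list` is a hypothetical helper and the sample string is invented for illustration.

```python
def parse_drug_list(raw):
    """Split a comma-separated string into names, trimming whitespace and dropping empties."""
    return [name.strip() for name in raw.split(',') if name.strip()]

# Invented sample with the kinds of stray characters a hand-edited file might contain.
sample = 'AET, AL-LAD,ALD-52 ,\nALEPH,'
parsed = parse_drug_list(sample)
print(parsed)
```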

In [11]:
# Keep a copy that still has the url column, then reuse the name df for the filtered data
with_urls = df.copy()

# Drop rows from without_urls for non-target drugs
df = without_urls[without_urls.drug.isin(psychedelic_drugs)].copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4985 entries, 0 to 8467
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drug    4985 non-null   object
 1   weight  4985 non-null   object
 2   year    4985 non-null   object
 3   gender  4985 non-null   object
 4   age     4985 non-null   object
 5   report  4985 non-null   object
dtypes: object(6)
memory usage: 272.6+ KB
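Because `isin` matches strings exactly, a drug name that differs from the target list only in case or surrounding whitespace is dropped without warning. The sketch below shows one way to check for such near misses, and for target names that match no row at all. The toy values and the `normalize` helper are assumptions for illustration; whether the real data contains any near misses is not known from the output above.

```python
import pandas as pd

# Toy stand-ins with made-up values, not the Erowid data.
toy = pd.DataFrame({'drug': ['LSD', 'Cannabis', 'ayahuasca ', 'AMT']})
targets = ['LSD', 'Ayahuasca', 'AMT', 'AET']

def normalize(name):
    """Hypothetical normalization: trim whitespace and ignore case."""
    return name.strip().casefold()

# Exact matching, as in the cell above.
kept = toy[toy.drug.isin(targets)]

# Rows the exact filter drops that would match after normalization.
normalized_targets = {normalize(t) for t in targets}
near_misses = toy[~toy.drug.isin(targets)
                  & toy.drug.map(normalize).isin(normalized_targets)]

# Target names that appear in no row at all under exact matching.
unmatched_targets = sorted(set(targets) - set(toy.drug))

print(kept.drug.tolist())
print(near_misses.drug.tolist())
print(unmatched_targets)
```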


In [None]:
# How many report texts are still duplicated? (rows beyond the first copy of each text)
df.report.duplicated().sum()

<font color='violet'> Next step. </font>

In [8]:
df.to_csv('../data/processed/erowid_cleaned.csv')