## Create Pre-Processed Datasets and Vocabularies
The goal of this file is to create data sets with pre-tokenized words and vocabularies than can be loaded into the iterators that will eventually feed our models.  Because our tokenizer of choice (spacy) is relatively slow, this process saves us enormous time.  Three datasets and four vocabularies are created in this file:

Datasets
1. small dataset for testing code
2. full dataset
3. dataset that Grasser et al used (which includes duplicates within and across the training and testing dataset)

Datasets are pre-processed then split into train, valid, and test.  Details on pre-processing steps available below

Vocabularies:
1. Vocabulary from full training dataset
2. Vocabulary of lemmatized full training set
3. Vocabulary of Grasser's training dataset
4. Vocabulary of lemmatized Grasser's training dataset

##### Note: this file runs faster locally than on google colab - go figure...

## Setup Env

### Import Packages

In [1]:
import json
import re

import pandas as pd
import spacy
from sklearn.model_selection import train_test_split

from torchtext.data.utils import get_tokenizer
from collections import Counter
from csv import reader
from torchtext.vocab import Vocab

# Pre-processing
Pre-processing involves the following steps:
1. Drop examples with no rating.  Rating is our target so we cannot use examples without that information.
2. Make conditional coluns into strings.  We did not use the condition field during our analysis.  However, this could be important for future work.
3. Drop duplicates.  In some cases there are review duplicates associated with multiple conditions in different rows.  In these instances the rows are compressed to one and the conditions are made into a list.  Note: For grassers dataset duplicates are not dropped in order to maintain an apples to apples comparison.
4. Replacing drug names with a token THISDRUG.  This allows the recurrent network to differentiate between the drug that is the subject of the review (if the drug is spelled correctly) and other drugs that are mentioned as comparisons.
5. Execute data cleaning steps.  These include:
    - stripping quotations that are wrapping the reviews
    - replacing html symbols with the corresponding punctuation


6. Creating the rating category.  This bins the rating column according to the follwoing rules:

    |Ratings | Bin|
    |--------|----|
    |0 - 3   | Negative|
    |4 - 6   | Neutral|
    |7 - 10  | Positive|
    
7. Tokenize review column.  The spacy tokenizer creates numerous useful fields from text.  These include parts of speech, shape of the word (for example lengths and capitalizations), and lemmas (there are others but these are the ones we kept).  We decided to stringfigy the list of tokens for each review and create a column out of those.  Thus, for each example we'd have one column that is a stringified list of the parts of speech for each word, one columne that is a stringified list of the lemmas for each word, etc.
8. We also created one hot encodings for each column created from the tokenized data.  For example, the afformentioned stringified list of parts of speech would be transformed a stringified list of one hot encodings of all parts of speech found in the corpus.


## Handle Duplicates

In [6]:
def handle_dups(df):
    """
    df: pandas dataframe
    returns: pandas dataframe
    This drops duplicates for reviews
    For the few situations where the same review is associated with
    multiple conditions, this concatenates the conditions with a '~'
    """
    print(f"len before drop: {len(df)}")
    new_df = df.drop_duplicates(['review', 'condition'])
    new_df.groupby('review')['condition'].apply(lambda x: '~'.join(x)).reset_index()
    print(f"len after drop: {len(new_df)}")
    return new_df

## Creating Columns

In [7]:
def create_cols(df):
    """
    df: pandas dataframe
    returns: pandas dataframe
    
    bins rating column and interpolates empty conditions with 'Not Entered'
    """

    # code NAs as a "Not Entered" category
    df.loc[df['condition'].isna(), 'condition'] = 'Not Entered'

    # creates ratings category by binning ratings
    df['rating_category'] = 'Positive'
    df.loc[df['rating'] < 7, 'rating_category'] = 'Neutral'
    df.loc[df['rating'] < 4, 'rating_category'] = 'Negative'

    # # create daily useful count
    # max_date = df['date'].max()
    # df['useful_daily'] = df['usefulCount'] / ((max_date - df['date']).dt.days + 1)

    return df

## Clean Review

In [8]:
def unwrap_quotes(rev):
    """
    rev (str): a review
    returns the same review without the quotes around it
    """
    if rev[-1] == '"' and rev[0] == '"':
        return rev[1:-1]
    else:
        return rev


In [9]:
def strip_quotes(df):
    """
    df: pandas dataframe
    returns: pandas dataframe
    applies unwrap quotes to the dataframe
    (i know it's a poorly named function)
    """
    df['review'] = df['review'].apply(unwrap_quotes)
    return df

In [10]:
def replace_html(df):
    """
    df: pandas dataframe
    returns: pandas dataframe
    
    replaces html representation of an apostrophe with an apostrophe
    """
    df['review'] = df['review'].str.replace("&#039;", "'")
    return df

In [11]:
def all_cleaning_steps(df):
    """
    df: pandas dataframe
    returns: pandas dataframe
    
    executes all (both) cleaning steps
    """
    df = strip_quotes(df)
    df = replace_html(df)
    return df

## Tokenized Columns

In [13]:
nlp = spacy.load("en_core_web_sm")

In [None]:
def make_tokens(review, idx, nlp):
    """
    review (str): a review
    idx (int): index (used for checking progress)
    nlp: spacy object - the object returned from spacy.load()
    
    tokenizes review and returns a tuple of jsonfied lists of 
    token data.  currently takes the text, lemma, two parts of 
    speech, and the shape
    """
#     print("idx", idx)

    if not idx % 10000:
        print(f"\tokenizing review {idx}")
    # print("review", review)
    doc = nlp(review)
    token_data = list(zip(*[(token.text, token.lemma_, token.pos_, 
              token.shape_, token.dep_) for token in doc]))
    # print(token_data)
    return (json.dumps(data) for data in token_data)
    

In [16]:
def create_spacy_cols(df, nlp):
    """
    df: pandas dataframe
    nlp: spacy object - the object returned from spacy.load()
    returns: pandas dataframe
    
    applies make tokens to pandas dataframe and configures output to make sense
    
    """
    (df['tokens'], df['lemmas'], df['pos'], df['shape'], df['dep']) = \
        zip(*df.apply(lambda x: make_tokens(x['review'], x.name, nlp), axis=1))
    return df

In [17]:
train_df.columns

Index(['condition', 'review', 'rating', 'date', 'usefulCount',
       'rating_category'],
      dtype='object')

### One hot cols

In [18]:
def get_indices(df, col):
    """
    df: pandas dataframe
    col: name of column
    returns: list of unique values in col
    """
    lists = list(df[col].apply(json.loads))
    flat_list = [item for sublist in lists for item in sublist]
    indices = list(set(flat_list))
    return indices

In [19]:
def encode_col(s, indices):
    """
    s: stringified list of tokens (they can be parts of speech, shapes, etc) 
    indices: list of unique values in a column
    returns same thing as s but with tokens replaced with indices.
    """
    l = json.loads(s)
    new_l = [indices.index(i) for i in l]
    return json.dumps(new_l)

def one_hot_index(df, col):
    """
    df: pandas dataframe
    col: name of column
    returns: pandas df with one hot column
    """
    indices = get_indices(df, col)
    df[col + "_encoding"] = df[col].apply(lambda x: encode_col(x,  indices))
    df[col + "_encoding_count"] = len(indices) - 1
    return df



## Replace drug name in review

In [21]:
def replace_drug_name(row):
    """
    row: row from pandas df (function is meant to be applied)
    returns row 'THISDRUG' replacing the drugname field
    """
    drug_name = row['drugName']
    pattern = re.compile(drug_name, re.IGNORECASE)
    row['review'] = pattern.sub('THISDRUG', row['review'])
    return row

def replace_drug_names(df):
    """
    applies replace_drug_name to df
    """
    df = df.apply(replace_drug_name, axis=1)
    return df

## All steps Together

In [22]:
def execute_preprocessing(df):
    """
    executes all functions above to df
    returns transformed dataframe
    """
    print("dropping null ratings...")
    df = df[df['rating'].notna()]
    print("making condition columns str...")
    df['condition'] = df['condition'].astype(str)
    print("handling duplicates...")
    df = handle_dups(df)
    print("executing data cleaning steps...")
    df = all_cleaning_steps(df)
    print("replacing drug names...")
    df = replace_drug_names(df)
    print("creating rating category...")
    df = create_cols(df)
    print("making spacy token columns..")
    df = create_spacy_cols(df, nlp)
    print("creating one hot encording columns...")
    cols_to_encode = ['pos', 'shape', 'dep']
    for col in cols_to_encode:
        df = one_hot_index(df, col)
    return df


## Create Full Dataset

### Load data

In [3]:
DATA_PATH = "./drive-download-20210514T232544Z-001"

test = DATA_PATH + "/drugsComTest_raw.tsv"
train = DATA_PATH + "/drugsComTrain_raw.tsv"
scraped = DATA_PATH + "/drugComScrapedData.tsv"
 

scraped_df = pd.read_csv(scraped, sep="\t")
test_df = pd.read_csv(test, sep="\t", index_col=0)
train_df = pd.read_csv(train, sep="\t", index_col=0)

full_df = pd.concat([scraped_df, test_df, train_df]).reset_index(drop=True)

## Split Datasets

In [159]:

def split_dfs(df):
    """
    creates train, valid, and test dfs
    valid is stratified along label
    """
    RANDOM_STATE = 1123
    train, test = train_test_split(df, train_size=.85, random_state=RANDOM_STATE)
    train, valid = train_test_split(train, 
                                    train_size=.85, 
                                    random_state=RANDOM_STATE, 
                                    stratify=train['rating_category'])
    return train, valid, test


In [160]:
def save_splits(root, name, train, valid, test):
    """
    saves train, valid, and test dfs
    """
    train.to_csv(f"{root}/train_{name}.csv", index=False)
    valid.to_csv(f"{root}/valid_{name}.csv", index=False)
    test.to_csv(f"{root}/test_{name}.csv", index=False)


rma

## Make New DFs

In [161]:
# sample for testing
small_df = full_df.sample(n=50, random_state=6624)
small_df_pp = execute_preprocessing(small_df)
save_splits(f"{DATA_PATH}/full_processed", 'small_df', *split_dfs(small_df_pp))


dropping null ratings...
making condition columns str...
handling duplicates...
len before drop: 48
len after drop: 48
executing data cleaning steps...
replacing drug names...
creating rating category...
making spacy token columns..


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['condition'] = df['condition'].astype(str)


creating one hot encording columns...
dropping null ratings...
making condition columns str...
handling duplicates...
len before drop: 421480


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['condition'] = df['condition'].astype(str)


len after drop: 181540
executing data cleaning steps...
replacing drug names...
creating rating category...
making spacy token columns..
	okenizing review 0
	okenizing review 20000
	okenizing review 30000
	okenizing review 50000
	okenizing review 70000
	okenizing review 110000
	okenizing review 130000
	okenizing review 230000
	okenizing review 240000
	okenizing review 250000
	okenizing review 260000
	okenizing review 270000
	okenizing review 280000
	okenizing review 310000
	okenizing review 330000
	okenizing review 350000
	okenizing review 370000
	okenizing review 410000
	okenizing review 430000
creating one hot encording columns...


In [None]:
# Full df
full_df_pp = execute_preprocessing(full_df)
full_df_pp.to_csv(f"{DATA_PATH}/full_processed/full_df.csv", index=False)
save_splits(f"{DATA_PATH}/full_processed", 'full_df', *split_dfs(full_df_pp))


### Process pure Grasser data
Alternative versions of functions because Grasser data is already split into train-test we shouldn't remove duplicates

In [24]:
def split_dfs_valid(df):
    """
    creates train, valid, and test dfs
    valid is stratified along label
    """
    RANDOM_STATE = 1123
    train, valid = train_test_split(df, train_size=.85, random_state=RANDOM_STATE)
    return train, valid


In [25]:
def save_splits_valid(root, name, train, valid):
    """
    saves train, valid, and test dfs
    """
    train.to_csv(f"{root}/train_{name}.csv", index=False)
    valid.to_csv(f"{root}/valid_{name}.csv", index=False)

In [26]:
def execute_preprocessing_keep_dups(df):
    """
    executes all functions above to df
    returns transformed dataframe
    """
    print("dropping null ratings...")
    df = df[df['rating'].notna()]
    print("making condition columns str...")
    df['condition'] = df['condition'].astype(str)
#     print("handling duplicates...")
#     df = handle_dups(df)
    print("executing data cleaning steps...")
    df = all_cleaning_steps(df)
    print("replacing drug names...")
    df = replace_drug_names(df)
    print("creating rating category...")
    df = create_cols(df)
    print("making spacy token columns..")
    df = create_spacy_cols(df, nlp)
    print("creating one hot encording columns...")
    cols_to_encode = ['pos', 'shape', 'dep']
    for col in cols_to_encode:
        df = one_hot_index(df, col)
    return df

In [122]:
DATA_PATH = "./grasser_data/"

test = DATA_PATH + "test_original.csv"
train = DATA_PATH + "train_original.csv"
test_df = pd.read_csv(test)
train_df = pd.read_csv(train)
print(train_df.columns)

grasser_train_processed = execute_preprocessing_keep_dups(train_df)
save_splits_valid(f"{DATA_PATH}", 'grasser_data', *split_dfs_valid(grasser_train_processed))
grasser_test_processed = execute_preprocessing_keep_dups(test_df)
grasser_test_processed.to_csv(f"{DATA_PATH}/test_grasser_data.csv", index=False)

Index(['drugName', 'condition', 'review', 'rating', 'date', 'usefulCount',
       'rating_category'],
      dtype='object')
dropping null ratings...
making condition columns str...
executing data cleaning steps...
replacing drug names...
creating rating category...
making spacy token columns..
	okenizing review 0
	okenizing review 10000
	okenizing review 20000
	okenizing review 30000
	okenizing review 40000
	okenizing review 50000
	okenizing review 60000
	okenizing review 70000
	okenizing review 80000
	okenizing review 90000
	okenizing review 100000
	okenizing review 110000
	okenizing review 120000
	okenizing review 130000
	okenizing review 140000
	okenizing review 150000
	okenizing review 160000
creating one hot encording columns...
dropping null ratings...
making condition columns str...
executing data cleaning steps...
replacing drug names...
creating rating category...
making spacy token columns..
	okenizing review 0
	okenizing review 10000
	okenizing review 20000
	okenizing review

### Vocab

In [124]:
DATA_PATH = "."

In [171]:
def create_vocab(csv_file, counter, tokenizer, min_freq):
    """
    creates pytorch vocab utility
    
    csv_file: string, self exlanatory
    counter: counter object (should be empty)
    tokenizer: tokenizer object
    min_freq: int (minimum frequency)
    
    returns vocab object
    
    """
    with open(csv_file, 'r') as f:
        csv_reader = reader(f)
        
        for i, row in enumerate(csv_reader):
            tokens = tokenizer(row[2].lower())
            counter.update(tokens)
            if not i % 10000:
                print(f"{i} examples completed")
    vocab_reviews = Vocab(counter_reviews, min_freq=min_freq)
    return vocab_reviews

In [172]:

tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
counter_reviews = Counter()

review_vocab = create_vocab(f"{DATA_PATH}/full_processed/train_full_df.csv", counter_reviews,  tokenizer, 100)


0 examples completed
10000 examples completed
20000 examples completed
30000 examples completed
40000 examples completed
50000 examples completed
60000 examples completed
70000 examples completed
80000 examples completed
90000 examples completed
100000 examples completed
110000 examples completed
120000 examples completed
130000 examples completed


In [174]:
s = json.dumps(counter_reviews)
with open('full_vocab.json', 'w') as f:
    f.write(s)

### Vocab for Grasser

In [None]:
counter_reviews = Counter()
review_vocab = create_vocab(f"{DATA_PATH}/grasser_data/train_grasser_data.csv", counter_reviews,  tokenizer, 100)
s = json.dumps(counter_reviews)
with open('grasser_vocab.json', 'w') as f:
    f.write(s)

### Vocab for lemmas

In [114]:
nlp = spacy.load("en_core_web_sm")
counter_reviews = Counter()

In [115]:
def create_vocab_lemma(csv_file, counter, tokenizer, min_freq):
    """
    creates pytorch vocab utility
    
    csv_file: string, self exlanatory
    counter: counter object (should be empty)
    tokenizer: tokenizer object
    min_freq: int (minimum frequency)
    
    returns vocab object    
    """
    with open(csv_file, 'r') as f:
        csv_reader = reader(f)
        
        for i, row in enumerate(csv_reader):
            doc = nlp(row[2].lower())
            lemmas = [token.lemma_ for token in doc]
#             print("token", type(token))
#             print("lemma", lemmas)
            counter.update(lemmas)
#             if i == 2: break
            if not i % 10000:
                print(f"{i} examples completed")
    vocab_reviews = Vocab(counter_reviews, min_freq=min_freq)
    return vocab_reviews

In [117]:

lemma_vocab = create_vocab_lemma(f"{DATA_PATH}/full_processed/train_full_df.csv", 
                            counter_reviews, 100)


0 examples completed
10000 examples completed
20000 examples completed
30000 examples completed
40000 examples completed
50000 examples completed
60000 examples completed
70000 examples completed
80000 examples completed
90000 examples completed
100000 examples completed
110000 examples completed
120000 examples completed
130000 examples completed


In [120]:
lemma_vocab.stoi

defaultdict(<bound method Vocab._default_unk_index of <torchtext.vocab.Vocab object at 0x7ff8c28cc6a0>>,
            {'<unk>': 0,
             '<pad>': 1,
             'I': 2,
             '.': 3,
             'be': 4,
             'and': 5,
             ',': 6,
             'the': 7,
             'have': 8,
             'to': 9,
             'it': 10,
             'my': 11,
             'a': 12,
             'for': 13,
             'of': 14,
             ' ': 15,
             'this': 16,
             'take': 17,
             'on': 18,
             'in': 19,
             'do': 20,
             'but': 21,
             'that': 22,
             'with': 23,
             'day': 24,
             '!': 25,
             'not': 26,
             "n't": 27,
             'so': 28,
             'get': 29,
             'go': 30,
             'feel': 31,
             'thisdrug': 32,
             'month': 33,
             'at': 34,
             'year': 35,
             'after': 36,
             'work':

In [121]:
s = json.dumps(counter_reviews)
with open('full_vocab_lemmas.json', 'w') as f:
    f.write(s)

### Grasser Lemma

In [126]:
counter_reviews = Counter()
lemma_vocab = create_vocab_lemma(f"{DATA_PATH}/grasser_data/train_grasser_data.csv", 
                            counter_reviews, 100)
s = json.dumps(counter_reviews)
with open('grasser_vocab_lemmas.json', 'w') as f:
    f.write(s)

0 examples completed
10000 examples completed
20000 examples completed
30000 examples completed
40000 examples completed
50000 examples completed
60000 examples completed
70000 examples completed
80000 examples completed
90000 examples completed
100000 examples completed
110000 examples completed
120000 examples completed
130000 examples completed


In [139]:
nlp = spacy.load("en_core_web_sm")
sentence = "They said that he would say to her what she said"

doc = nlp(sentence)
print("lemmas")
print([token.lemma_ for token in doc])
print()
print("tags")
print([token.tag_ for token in doc])
print()
print("part of speech")
print([token.pos_ for token in doc])
print()
print("dependencies")
print([token.dep_ for token in doc])



lemmas
['they', 'say', 'that', 'he', 'would', 'say', 'to', 'she', 'what', 'she', 'say']

tags
['PRP', 'VBD', 'IN', 'PRP', 'MD', 'VB', 'IN', 'PRP', 'WP', 'PRP', 'VBD']

part of speech
['PRON', 'VERB', 'SCONJ', 'PRON', 'AUX', 'VERB', 'ADP', 'PRON', 'PRON', 'PRON', 'VERB']

dependencies
['nsubj', 'ROOT', 'mark', 'nsubj', 'aux', 'ccomp', 'prep', 'pobj', 'ccomp', 'nsubj', 'ROOT']
