"Standard datasets are used to train natural language models... when crunched by a model, the output reflects quirks in the algorithm rather then the data."

This notebook contains feature extraction, weighting, and modeling portions of the drug-review-text-to-star-rating prediction project. Other aspects of the process--e.g., scraping, cleaning, or visualizing the data--are not included here. Furthermore, there are many possible directions this project can go (re: feature/model selection, weighting, etc.); this is but a starting place. Plus it only looks at "satisfaction" ratings for now.

Feature extraction is handled by sklearn's CountVectorizer class. For now, we're using it to tokenize our review text into unigrams (single words) and bigrams (pairs of adjacent words). These features are then weighted by tf-idf scores (term frequency-inverse document frequency): a token that shows up in a lot of documents isn't given much weight, while an otherwise rare token that shows up many times in the same document is weighted heavily.

We then pass the features into a few different classifiers (decision tree, random forest, logistic regression, and multinomial naive Bayes) and see how they perform.

```
QUESTIONS/TODO:
    Why use CountVectorizer if we're going to be using tfidf? sklearn has a TfidfVectorizer class.
    How is performance measured? What is that cross validation metric that's being spewed out? Is it accuracy?
    How to use a regression model instead of classifiers?
    Use non-deprecated modules instead of cross_validation.
    How should models be evaluated? Accuracy alone isn't necessarily the best; should explore the data first and see
        how skewed it is towards any given star rating. Also try precision/recall/F1.          
```
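On the first question above: as far as I can tell, TfidfVectorizer does the same thing as CountVectorizer followed by TfidfTransformer, just in one step. Here is a minimal sketch that checks this on a toy corpus. The `toy_reviews` list is made up for illustration (it is not from the dataset), and `min_df` is left at its default so the tiny corpus keeps its vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical stand-ins for the review text (not real data).
toy_reviews = [
    "terrible side effects for me",
    "no side effects at first",
    "this drug helped me a lot",
    "side effects were terrible at first",
]

# Same settings as the notebook's vectorizer, except min_df is left at its default.
params = dict(stop_words="english", lowercase=True, ngram_range=(1, 2))

# Two-step route: raw counts first, then tf-idf weighting.
counts = CountVectorizer(**params).fit_transform(toy_reviews)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route.
one_step = TfidfVectorizer(**params).fit_transform(toy_reviews)

# Both routes should produce the same matrix.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

Keeping the two steps separate is still handy if we want to look at the raw counts, which this notebook does.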
    

In [17]:
# This cell just sets up some data to use.
import pickle

with open('abilify.p', 'rb') as f:
    data = pickle.load(f)
    
reviews = [datum['comment'] for datum in data]
satisfaction_ratings = [datum['satisfaction'] for datum in data]

In [18]:
print(data[0])

{'drugName': 'abilify', 'site': 'webMD', 'condition': 'Additional Medications to Treat Depression', 'reviewDate': '10/15/2017', 'userName': 'caner1', 'ageRange': '65-74', 'gender': 'Female', 'role': 'Patient', 'medDuration': '1 to 6 months', 'effectiveness': 2, 'ease_of_use': 3, 'satisfaction': 1, 'genRating': None, 'comment': 'Terrible made me clench my teeth and rock back & forth. Two yrs later I still do it. Terrible side effects for me. not recommended by me.', 'upVotes': 2}


In [19]:
reviews[:10]

['Terrible made me clench my teeth and rock back & forth. Two yrs later I still do it. Terrible side effects for me. not recommended by me.',
 "I was put on abilify for depression along with Lexapro.  At first there was no side affects.  Then after about 2 years I gained about 90 pounds.  Also my right hand began to shake.  But the worst thing was not being able to control my bowel movements.  The thing is I did not realize it was the abilify causing these problems.  The way I figured out the cause was I ran out of abilify and was off it for about a week.  I started regaining control of my bowels (this is one thing they don't list as a side affect).  That had to be the most embarrassing period of my life.  Since I quit taking the drug I lost 80 pounds and my hand stopped shaking.  But best of all I have control of my bowels.  They say the drug has different affects on people but it's not the drug for me.  To those that it helped I say hurrah but I've read more negative comments than po

In [20]:
satisfaction_ratings[:10]

[1, 1, 5, 5, 5, 5, 1, 1, 5, 3]

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
# min_df=2: a token has to occur in at least 2 documents to be considered a feature
# ngram_range=(1,2): extract both unigrams and bigrams
count_vect = CountVectorizer(stop_words='english', min_df=2, lowercase=True, ngram_range=(1,2))
X_train_counts = count_vect.fit_transform(reviews)
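
To see what `ngram_range` and `min_df` actually do, here is a small sketch on a two-document toy corpus. The documents are made up for illustration, and `stop_words` is left out so the result depends only on the two parameters being shown.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus (not from the dataset).
toy_reviews = ["terrible nausea", "mild nausea"]

# Unigrams and bigrams, keeping every token (min_df=1).
vect_all = CountVectorizer(ngram_range=(1, 2), min_df=1)
vect_all.fit(toy_reviews)
features = sorted(vect_all.vocabulary_)
print(features)

# Same extraction, but a token must appear in at least 2 documents.
vect_common = CountVectorizer(ngram_range=(1, 2), min_df=2)
vect_common.fit(toy_reviews)
common = sorted(vect_common.vocabulary_)
print(common)
```

Only "nausea" appears in both documents, so it is the only feature that survives `min_df=2`. On the real data the same filter drops tokens that show up in a single review.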

In [25]:
len(count_vect.vocabulary_)

2831

In [27]:
X_train_counts[0]

<1x2831 sparse matrix of type '<class 'numpy.int64'>'
	with 65 stored elements in Compressed Sparse Row format>

Because text data is high dimensional and sparse, any given token probably doesn't appear in any given document, so most entries of the matrix are zero.

NB about sparse matrices: they don't store the 0s; they save only the non-zero values and their locations and assume everything else is 0.
**Occasionally this causes problems, because some algorithms don't play well with sparse matrices.
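
A minimal sketch of what the Compressed Sparse Row format stores, using scipy directly. The small `dense` array is made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 2x6 count matrix that is mostly zeros.
dense = np.array([
    [0, 0, 3, 0, 0, 0],
    [0, 1, 0, 0, 0, 2],
])
sparse = csr_matrix(dense)

print(sparse.nnz)      # number of stored (non-zero) entries
print(sparse.data)     # the stored values
print(sparse.indices)  # column index of each stored value
print(sparse.indptr)   # offsets into data/indices where each row starts

# Converting back gives the full array, zeros included.
restored = sparse.toarray()
```

Calling `.toarray()` is the escape hatch when an algorithm needs dense input, at the cost of storing every zero.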

In [31]:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
X_train_tfidf = transformer.fit_transform(X_train_counts)

TF-IDF = term frequency times inverse document frequency

term frequency = how often a token occurs in a given document

inverse document frequency = the inverse of the fraction of documents the token occurs in, usually log-scaled (the fraction itself is the document frequency)

This gives higher weights to tokens that occur in few documents.
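
To make the definitions concrete, here is a sketch that computes the weights by hand and compares them with TfidfTransformer. It assumes sklearn's default settings, which use a smoothed variant, idf = ln((1 + n) / (1 + df)) + 1, and then scale each row to unit L2 length; other references define idf slightly differently. The `counts` array is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: rows are documents, columns are tokens.
counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [3, 0, 0],
    [4, 0, 0],
    [3, 2, 0],
    [3, 0, 2],
])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)              # number of documents containing each token
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf (sklearn default, as assumed above)

weighted = counts * idf                    # term frequency times idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize each row

from_sklearn = TfidfTransformer().fit_transform(counts).toarray()
matches = np.allclose(manual, from_sklearn)
print(matches)
```

The first token appears in every document, so it gets the smallest idf; the second appears in only one document and gets the largest.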

In [32]:
print(X_train_tfidf)
#prints the location in the sparse matrix and the tfidf score

  (0, 2465)	0.648544967256
  (0, 2818)	0.324272483628
  (0, 1475)	0.319461115508
  (0, 920)	0.188646037496
  (0, 2071)	0.400795196495
  (0, 2819)	0.419472598021
  (1, 139)	0.120911521184
  (1, 694)	0.04734921795
  (1, 1508)	0.0926364780062
  (1, 269)	0.224565730471
  (1, 2792)	0.0606429065346
  (1, 1140)	0.0662752751989
  (1, 133)	0.117545127139
  (1, 1958)	0.160055106797
  (1, 2121)	0.0838451765612
  (1, 1230)	0.212518205619
  (1, 396)	0.0977690865135
  (1, 2191)	0.129262897828
  (1, 2775)	0.101586709676
  (1, 2476)	0.267837210652
  (1, 209)	0.0850772352367
  (1, 581)	0.296897772178
  (1, 1784)	0.112282865236
  (1, 748)	0.0718725675552
  (1, 2047)	0.124590504694
  :	:
  (704, 452)	0.120529340715
  (704, 1778)	0.134237160689
  (704, 733)	0.12877124694
  (704, 1622)	0.141283949118
  (704, 734)	0.145749925833
  (704, 2829)	0.12877124694
  (704, 19)	0.151215839582
  (704, 385)	0.141283949118
  (704, 201)	0.158262628011
  (704, 1164)	0.158262628011
  (704, 499)	0.158262628011
  (704, 1304)

In [33]:
# cross_val_score lives in model_selection; the old cross_validation module was deprecated and later removed
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB



Statistically, you would want to normalize word counts to between 0 and 1, but in practice TF-IDF is useful because it gives different weight to rare terms.

In [34]:
X_train_tfidf_dense = X_train_tfidf.toarray()

In [38]:
tree = DecisionTreeClassifier()
cross_val_score(tree, X_train_tfidf_dense, satisfaction_ratings, cv=5)

array([ 0.32167832,  0.34507042,  0.40140845,  0.32857143,  0.27536232])

In [39]:
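forest = RandomForestClassifier(n_estimators = 10)
# Score the forest itself. The output recorded below came from a run that passed `tree` here by mistake.
cross_val_score(forest, X_train_tfidf_dense, satisfaction_ratings, cv=5)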

array([ 0.36363636,  0.34507042,  0.40140845,  0.29285714,  0.2826087 ])

In [40]:
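logreg = LogisticRegression()
# Score the logistic regression itself. The output recorded below came from a run that passed `tree` here by mistake.
cross_val_score(logreg, X_train_tfidf_dense, satisfaction_ratings, cv=5)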

array([ 0.34265734,  0.40140845,  0.3943662 ,  0.30714286,  0.30434783])

In [41]:
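nb = MultinomialNB()
# Score the naive Bayes model itself. The output recorded below came from a run that passed `tree` here by mistake.
cross_val_score(nb, X_train_tfidf_dense, satisfaction_ratings, cv=5)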

array([ 0.35664336,  0.36619718,  0.40140845,  0.33571429,  0.25362319])

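Results are not so great, but at least they're better than what we would expect from uniform random guessing (0.20, or 1/5, picking a random star). If the ratings are skewed toward particular stars, always guessing the most common rating would be a stronger baseline to compare against. Each value above is the accuracy on one of the five held-out folds: cross_val_score falls back to the estimator's own score method, which is accuracy for classifiers, and with cv=5 it trains on four fifths of the data and tests on the remaining fifth, five times. Note that all four recorded outputs came from runs that scored `tree`, so they show four runs of the decision tree rather than four different models. See https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation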
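
A possible way to tackle the evaluation questions: name the metric explicitly, compare against a majority-class baseline, and look at macro F1 alongside accuracy. This is only a sketch. The data below is synthetic (random features, deliberately skewed labels) and stands in for the real tf-idf matrix and ratings, so the model scores mean nothing; the point is the mechanics.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (hypothetical): 100 samples, skewed 1-5 star labels.
rng = np.random.RandomState(0)
y = np.array([1] * 40 + [5] * 30 + [3] * 10 + [2] * 10 + [4] * 10)
X = rng.rand(len(y), 5)

# Stratified folds keep the label proportions the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline: always predict the most common rating seen in training.
baseline = DummyClassifier(strategy="most_frequent")
baseline_acc = cross_val_score(baseline, X, y, cv=cv, scoring="accuracy")

# A real model, scored with two explicitly named metrics.
model = LogisticRegression(max_iter=1000)
model_acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
model_f1 = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")

print(baseline_acc.mean(), model_acc.mean(), model_f1.mean())
```

With 40% of the synthetic labels being 1 star, the baseline scores 0.40 accuracy without looking at the text at all, which is why 0.20 can be too generous a bar when ratings are skewed.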