"Standard datasets are used to train natural language models... when crunched by a model, the output reflects quirks in the algorithm rather than the data."

This notebook contains feature extraction, weighting, and modeling portions of the drug-review-text-to-star-rating prediction project. Other aspects of the process--e.g., scraping, cleaning, or visualizing the data--are not included here. Furthermore, there are many possible directions this project can go (re: feature/model selection, weighting, etc.); this is but a starting place. Plus it only looks at "satisfaction" ratings for now.

Feature extraction is handled by sklearn's CountVectorizer class. For now, we're using it to tokenize our review text data into unigrams (words) and bigrams (pairs of adjacent words). These features are then weighted by tf-idf scores (term frequency-inverse document frequency: a token that shows up in a lot of documents isn't given much weight, but an otherwise rare token showing up many times in the same document is heavily weighted).
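
As a rough sketch of the pipeline described above (and of the TfidfVectorizer question in the TODO list), the two-step CountVectorizer + TfidfTransformer route can be compared with the one-step TfidfVectorizer. The `toy_reviews` list is made up for illustration and is not from the project's data; the vectorizer settings mirror the ones used later in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy reviews, NOT taken from the scraped dataset.
toy_reviews = [
    "terrible nausea and headache",
    "nausea went away",
    "headache gone, no nausea",
    "terrible headache",
]

# Same settings as the notebook's CountVectorizer call.
params = dict(stop_words="english", min_df=2, lowercase=True, ngram_range=(1, 2))

# Two-step route: raw n-gram counts, then tf-idf weighting.
counts = CountVectorizer(**params).fit_transform(toy_reviews)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: tokenizing and weighting in a single estimator.
one_step = TfidfVectorizer(**params).fit_transform(toy_reviews)

# With default weighting options the two routes should agree.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

Keeping the two steps separate is still handy when the raw counts are wanted on their own (e.g. for inspecting the vocabulary); otherwise the one-step version is less code.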

We then pass the features into a couple different classifier models and see how they perform.

```
QUESTIONS/TODO:
    Why use CountVectorizer if we're going to be using tfidf? sklearn has a TfidfVectorizer class.
    How is performance measured? What is that cross validation metric that's being spewed out? Is it accuracy?
    How to use a regression model instead of classifiers?
    Use non-deprecated modules instead of cross_validation.
    How should models be evaluated? Accuracy alone isn't necessarily the best; should explore the data first and see
        how skewed it is towards any given star rating. Also try precision/recall/F1.          
```
    

In [2]:
# This cell just sets up some data to use.
import pickle
# pickle.load needs these definitions importable to rebuild the stored objects
from drugSite_scrapers3 import drug, review

with open('drug_list_ddc2.p', 'rb') as f:
    data = pickle.load(f)

TypeError: 'review' object is not subscriptable

In [17]:
from textblob import TextBlob

In [22]:
test = TextBlob(data[0]['comment'])

In [24]:
test.sentiment

Sentiment(polarity=-0.5, subjectivity=0.5)

In [20]:
print(data[0]['comment'])

Terrible made me clench my teeth and rock back & forth. Two yrs later I still do it. Terrible side effects for me. not recommended by me.


In [38]:
import sys
print(sys.executable)
print()
print(sys.path)

C:\Users\comra_000\Anaconda3\python.exe

['', 'C:\\Users\\comra_000\\Anaconda3\\python36.zip', 'C:\\Users\\comra_000\\Anaconda3\\DLLs', 'C:\\Users\\comra_000\\Anaconda3\\lib', 'C:\\Users\\comra_000\\Anaconda3', 'C:\\Users\\comra_000\\Anaconda3\\lib\\site-packages', 'C:\\Users\\comra_000\\Anaconda3\\lib\\site-packages\\Sphinx-1.5.1-py3.6.egg', 'C:\\Users\\comra_000\\Anaconda3\\lib\\site-packages\\win32', 'C:\\Users\\comra_000\\Anaconda3\\lib\\site-packages\\win32\\lib', 'C:\\Users\\comra_000\\Anaconda3\\lib\\site-packages\\Pythonwin', 'C:\\Users\\comra_000\\Anaconda3\\lib\\site-packages\\setuptools-27.2.0-py3.6.egg', 'C:\\Users\\comra_000\\Anaconda3\\lib\\site-packages\\IPython\\extensions', 'C:\\Users\\comra_000\\.ipython']


In [39]:
blobs = []

for review in reviews[:10]:
    print(review)
    print()
    blobs.append(TextBlob(review))

Terrible made me clench my teeth and rock back & forth. Two yrs later I still do it. Terrible side effects for me. not recommended by me.

I was put on abilify for depression along with Lexapro.  At first there was no side affects.  Then after about 2 years I gained about 90 pounds.  Also my right hand began to shake.  But the worst thing was not being able to control my bowel movements.  The thing is I did not realize it was the abilify causing these problems.  The way I figured out the cause was I ran out of abilify and was off it for about a week.  I started regaining control of my bowels (this is one thing they don't list as a side affect).  That had to be the most embarrassing period of my life.  Since I quit taking the drug I lost 80 pounds and my hand stopped shaking.  But best of all I have control of my bowels.  They say the drug has different affects on people but it's not the drug for me.  To those that it helped I say hurrah but I've read more negative comments than positiv

In [41]:
for blob in blobs:
    print(blob.sentiment)
    print()

Sentiment(polarity=-0.5, subjectivity=0.5)

Sentiment(polarity=0.19629870129870128, subjectivity=0.5339502164502165)

Sentiment(polarity=0.38611111111111107, subjectivity=0.6041666666666666)

Sentiment(polarity=0.13, subjectivity=0.45999999999999996)

Sentiment(polarity=0.05357142857142857, subjectivity=0.5714285714285714)

Sentiment(polarity=0.8, subjectivity=0.75)

Sentiment(polarity=-0.4, subjectivity=0.7)

Sentiment(polarity=0.06, subjectivity=0.18)

Sentiment(polarity=0.0, subjectivity=0.0)

Sentiment(polarity=0.6, subjectivity=0.8)



In [4]:
satisfaction_ratings[:10]

[1, 1, 5, 5, 5, 5, 1, 1, 5, 3]

In [5]:
len(reviews)

705

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(stop_words='english', min_df = 2, lowercase=True, ngram_range=(1,2))
X_train_counts = count_vect.fit_transform(reviews)
# min_df: a token has to occur in at least this many documents to be kept as a feature

In [7]:
len(count_vect.vocabulary_)

2831

In [8]:
X_train_counts[0]

<1x2831 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>

Because text data is high-dimensional and sparse, a given word probably doesn't appear in a given document.

NB about sparse matrices: they don't store 0s; they save only each nonzero value and its location and assume everything else is 0.
**Occasionally this fails because an algorithm doesn't play well with sparse matrices.
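
To make the storage idea concrete, here is a minimal sketch using scipy's `csr_matrix` (the same Compressed Sparse Row format reported in the output above). The small `dense` array is a made-up example, not project data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3x4 count matrix with only two nonzero entries.
dense = np.array([[0, 0, 3, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 0]])
sparse = csr_matrix(dense)

print(sparse.nnz)      # number of explicitly stored values
print(sparse.data)     # the stored values themselves
print(sparse.indices)  # column index of each stored value
print(sparse.indptr)   # where each row starts/ends within data and indices

# Converting back to a dense array fills in the implicit zeros.
# This is the workaround when an estimator won't accept sparse input,
# at the cost of holding every zero in memory.
restored = sparse.toarray()
```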

In [9]:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
X_train_tfidf = transformer.fit_transform(X_train_counts)

TF-IDF = term frequency times inverse document frequency

term frequency = how often a word occurs in a given document

inverse document frequency = a score that shrinks as the share of documents containing the word grows (roughly the log of total documents over documents containing the word)

gives higher weights to words that occur in few documents
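
As a sanity check on the definitions above, the weights can be reproduced by hand. This sketch assumes TfidfTransformer's default settings (smooth_idf=True, norm='l2', sublinear_tf=False), under which sklearn uses idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then scales each row to unit length. The three `docs` are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical mini-corpus, not from the review data.
docs = ["nausea nausea headache", "headache", "headache dizzy"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
transformer = TfidfTransformer()  # defaults: smooth_idf=True, norm='l2'
tfidf = transformer.fit_transform(counts).toarray()

# Manual computation under the same assumptions.
tf = counts.toarray().astype(float)         # raw term counts per document
n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)                   # documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed inverse document frequency
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize each row

matches = np.allclose(manual, tfidf)
print(matches)

# Terms that appear in fewer documents get a larger idf.
idf_by_term = dict(zip(sorted(vectorizer.vocabulary_), idf))
print(idf_by_term)
```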

In [10]:
print(X_train_tfidf)
# prints each stored (row, column) location in the sparse matrix and its tf-idf score

  (0, 2465)	0.648544967256
  (0, 2818)	0.324272483628
  (0, 1475)	0.319461115508
  (0, 920)	0.188646037496
  (0, 2071)	0.400795196495
  (0, 2819)	0.419472598021
  (1, 139)	0.120911521184
  (1, 694)	0.04734921795
  (1, 1508)	0.0926364780062
  (1, 269)	0.224565730471
  (1, 2792)	0.0606429065346
  (1, 1140)	0.0662752751989
  (1, 133)	0.117545127139
  (1, 1958)	0.160055106797
  (1, 2121)	0.0838451765612
  (1, 1230)	0.212518205619
  (1, 396)	0.0977690865135
  (1, 2191)	0.129262897828
  (1, 2775)	0.101586709676
  (1, 2476)	0.267837210652
  (1, 209)	0.0850772352367
  (1, 581)	0.296897772178
  (1, 1784)	0.112282865236
  (1, 748)	0.0718725675552
  (1, 2047)	0.124590504694
  :	:
  (704, 452)	0.120529340715
  (704, 1778)	0.134237160689
  (704, 733)	0.12877124694
  (704, 1622)	0.141283949118
  (704, 734)	0.145749925833
  (704, 2829)	0.12877124694
  (704, 19)	0.151215839582
  (704, 385)	0.141283949118
  (704, 201)	0.158262628011
  (704, 1164)	0.158262628011
  (704, 499)	0.158262628011
  (704, 1304)

In [11]:
# Classification
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Regression
from sklearn.svm import SVR


In [12]:
X_train_tfidf_dense = X_train_tfidf.toarray()
print(len(X_train_tfidf_dense))

705


In [13]:
models = []
models.append(('DTree', DecisionTreeClassifier()))
models.append(('RandForest', RandomForestClassifier(n_estimators = 10)))
models.append(('LogReg', LogisticRegression()))
models.append(('NaiveBayes', MultinomialNB()))

In [14]:
results = []
names = []
for name, model in models:
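    # random_state only has an effect with shuffle=True; recent scikit-learn
    # versions raise an error if it is set without shuffling, so it is dropped here
    kfold = KFold(n_splits=10)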
    cv_results = cross_val_score(model, X_train_tfidf_dense, satisfaction_ratings, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

DTree: 0.362837 (0.050343)
RandForest: 0.416801 (0.054316)
LogReg: 0.480584 (0.056945)
NaiveBayes: 0.477767 (0.055140)


In [15]:
results

[array([ 0.3943662 ,  0.3943662 ,  0.4084507 ,  0.45070423,  0.36619718,
         0.27142857,  0.37142857,  0.31428571,  0.31428571,  0.34285714]),
 array([ 0.45070423,  0.3943662 ,  0.50704225,  0.49295775,  0.3943662 ,
         0.41428571,  0.41428571,  0.32857143,  0.34285714,  0.42857143]),
 array([ 0.57746479,  0.47887324,  0.56338028,  0.45070423,  0.52112676,
         0.45714286,  0.37142857,  0.45714286,  0.47142857,  0.45714286]),
 array([ 0.5915493 ,  0.46478873,  0.54929577,  0.45070423,  0.50704225,
         0.44285714,  0.38571429,  0.47142857,  0.45714286,  0.45714286])]

Results are not great, but they are at least better than what we would expect from uniform random guessing (0.20, or 1/5, picking a random star). Because the cells above use scoring='accuracy', each value in the arrays is the accuracy on one held-out fold of the 10-fold cross-validation, and the printed summary gives the mean and standard deviation across folds: see https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation
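
One caveat on the 0.20 figure: it assumes the five star ratings are equally common. If the ratings are skewed, always predicting the most common star can beat 0.20 without learning anything. A possible way to check, sketched here on made-up stand-in data (random features `X` and skewed labels `y`, both hypothetical), is to cross-validate a majority-class DummyClassifier next to a real model and to report macro F1 alongside accuracy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Hypothetical stand-in data: random features and skewed 1-5 star labels.
rng = np.random.default_rng(0)
X = rng.random((200, 20))
y = rng.choice([1, 2, 3, 4, 5], size=200, p=[0.3, 0.05, 0.1, 0.15, 0.4])

cv = KFold(n_splits=5)
candidates = [
    ("MajorityClass", DummyClassifier(strategy="most_frequent")),
    ("LogReg", LogisticRegression(max_iter=1000)),
]

scores = {}
for name, model in candidates:
    # cross_validate accepts several metrics at once, unlike cross_val_score.
    out = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1_macro"])
    scores[name] = {m: out["test_" + m].mean() for m in ("accuracy", "f1_macro")}
    print(name, scores[name])
```

In the real notebook, `X` and `y` would be replaced by the tf-idf matrix and `satisfaction_ratings`; a model is only doing useful work if it clearly beats the majority-class row on both metrics.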

In [16]:
reg_models = []
reg_models.append(('rbf', SVR(kernel='rbf', C=1e3, gamma=0.1)))
reg_models.append(('linear', SVR(kernel='linear', C=1e3)))
reg_models.append(('quadratic', SVR(kernel='poly', C=1e3, degree=2)))

reg_results = []
reg_names = []

for name, model in reg_models:
    # random_state only has an effect with shuffle=True, so it is dropped here
    kfold = KFold(n_splits=10)
    cv_results = cross_val_score(model, X_train_tfidf_dense, satisfaction_ratings, cv=kfold, scoring='neg_mean_absolute_error')
    reg_results.append(cv_results)
    reg_names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)


rbf: -1.191598 (0.062649)
linear: -1.256151 (0.040214)
quadratic: -1.515732 (0.084087)
