In [13]:
import numpy as np
import pandas as pd




Consider a data set of customer reviews of musical instruments. In this notebook we work with the review text to build a classifier that predicts the user's rating (source: http://jmcauley.ucsd.edu/data/amazon/).

In [157]:
data = pd.read_json('../data/Musical_Instruments_5.json', lines=True)
# for the moment we only keep reviewText and overall;
# .copy() makes df independent of data, so adding columns later does not warn
df = data[['reviewText', 'overall']].copy()

In [158]:
data.head()

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,1384719342,"[0, 0]",5,"Not much to write about here, but it does exac...","02 28, 2014",A2IBPI20UZIR0U,"cassandra tu ""Yeah, well, that's just like, u...",good,1393545600
1,1384719342,"[13, 14]",5,The product does exactly as it should and is q...,"03 16, 2013",A14VAT5EAX3D9S,Jake,Jake,1363392000
2,1384719342,"[1, 1]",5,The primary job of this device is to block the...,"08 28, 2013",A195EZSQDW3E21,"Rick Bennette ""Rick Bennette""",It Does The Job Well,1377648000
3,1384719342,"[0, 0]",5,Nice windscreen protects my MXL mic and preven...,"02 14, 2014",A2C00NNG1ZQQG2,"RustyBill ""Sunday Rocker""",GOOD WINDSCREEN FOR THE MONEY,1392336000
4,1384719342,"[0, 0]",5,This pop filter is great. It looks and perform...,"02 21, 2014",A94QU4C90B1AX,SEAN MASLANKA,No more pops when I record my vocals.,1392940800


In [159]:
df.head()

Unnamed: 0,reviewText,overall
0,"Not much to write about here, but it does exac...",5
1,The product does exactly as it should and is q...,5
2,The primary job of this device is to block the...,5
3,Nice windscreen protects my MXL mic and preven...,5
4,This pop filter is great. It looks and perform...,5


Exercise
---------
- Calculate the percentage of reviews in each rating class (from 1 to 5).
- Which problems and difficulties could we face when predicting the rating?

Print a couple of examples of bad reviews (overall = 1 or 2) and of good reviews (overall = 4 or 5) to get a feeling for how users write.
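A minimal sketch of these two tasks, run on a small made-up DataFrame rather than on the real reviews (the toy texts and ratings are assumptions for illustration only). `value_counts(normalize=True)` returns the share of each class, and boolean masks select low-rated and high-rated examples.

```python
import pandas as pd

# Toy stand-in for df (hypothetical reviews, not the real data set)
toy = pd.DataFrame({
    "reviewText": ["great", "ok", "love it", "broke fast", "works", "perfect"],
    "overall": [5, 3, 5, 1, 4, 5],
})

# Share of each rating class in percent, ordered by rating
class_pct = toy["overall"].value_counts(normalize=True).sort_index() * 100
print(class_pct.round(1))

# A couple of low-rated and high-rated examples
bad = toy.loc[toy["overall"] <= 2, "reviewText"].head(2).tolist()
good = toy.loc[toy["overall"] >= 4, "reviewText"].head(2).tolist()
print(bad)
print(good)
```

On the real data the same calls would be applied to `df` instead of `toy`.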

Split data into train and test

In [162]:
from sklearn.model_selection import train_test_split

X = df.reviewText.values
y = df.overall.values

X_tr, X_te, y_tr, y_te = train_test_split(X, 
                                          y,
                                          test_size=0.1,
                                          random_state=0)

In [163]:
print('X_tr shape',np.shape(X_tr))
print('X_te shape',np.shape(X_te))
print('y_tr shape',np.shape(y_tr))
print('y_te shape',np.shape(y_te))

X_tr shape (9234,)
X_te shape (1027,)
y_tr shape (9234,)
y_te shape (1027,)


In [164]:
len(df)

10261
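The split above is purely random. When some rating classes are rare, one option (not used in this notebook) is to pass `stratify=` to `train_test_split` so that train and test keep the same class proportions. The sketch below uses made-up labels to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical): 80 five-star and 20 one-star reviews
X_toy = np.array(["review %d" % i for i in range(100)])
y_toy = np.array([5] * 80 + [1] * 20)

X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=0, stratify=y_toy)

# Class counts in the 10-sample test set
labels, counts = np.unique(y_b, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```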

## Bag of words

We want to extract numerical features from the text so that we can use them for classification.
One way to achieve this is `sklearn.feature_extraction.text.CountVectorizer`: this scikit-learn class finds the tokens (words) in the text and creates a matrix of occurrence counts. Another possibility is `sklearn.feature_extraction.text.TfidfVectorizer`: instead of absolute counts, it returns *weights* that also depend on how many documents of the whole corpus contain each word. The idea is that a word that appears in most documents of a corpus is not very informative for prediction and should be given a lower weight.




Different types of vectorizers:

<ul>
<li>```sklearn.feature_extraction.text.CountVectorizer``` - Counts the number of times a word appears in the text</li>
<li>```sklearn.feature_extraction.text.TfidfVectorizer``` - Weighs the words according to the importance of the word in the context of whole collection. Is the word ```the``` important if it appears in all documents?</li>
<li>```sklearn.feature_extraction.text.HashingVectorizer``` - Useful when you don't know the vocabulary upfront. Feature number is calculated as ```hash(token) % vocabulary_size```.</li>
</ul>
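The three vectorizers listed above can be compared on a tiny made-up corpus. This is only a sketch: the documents and the choice of `n_features=16` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfVectorizer)

# Toy corpus (made up for illustration)
docs = ["the strap is great",
        "the strap broke",
        "the amp is loud"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
vocab = sorted(count_vec.vocabulary_)   # learned tokens, in column order
print(vocab)
print(counts.toarray())                 # raw occurrence counts

tfidf_vec = TfidfVectorizer()
tfidf_vec.fit(docs)
# idf_ is aligned with the (alphabetically sorted) vocabulary
idf = dict(zip(sorted(tfidf_vec.vocabulary_), tfidf_vec.idf_))
lowest = min(idf, key=idf.get)
print(lowest)                           # token with the lowest idf weight

# HashingVectorizer needs no fit: the output width is fixed up front
hashed = HashingVectorizer(n_features=16).transform(docs)
print(hashed.shape)
```

Here the token with the lowest idf is the one that occurs in every document, which is the behaviour the `the` question above hints at.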

Exercise
---------
https://github.com/logicai-io/pipelines-sklearn
1. Use ```CountVectorizer``` / ```TfidfVectorizer``` to fit the collection of documents.
2. How many unique tokens are there in the text? Print some examples (e.g. the first few hundred).
3. Which methods can you use to reduce this number?
   - Check out and experiment with the arguments ```ngram_range``` and ```min_df```. How does the vocabulary size change with each setting?
   - Would you make use of ```stop_words```? Check the ```CountVectorizer``` documentation.
   - What would you replace / delete from the text?
4. Write a custom function `clean_text` that accepts a text as input and transforms it (remove/hash numbers, delete short/long words etc.).

**Solution**
<div>
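import re

from sklearn.feature_extraction.text import CountVectorizer


def clean_text(t):
    """Lowercase the text, drop punctuation and hash out numbers."""
    t = t.lower()
    # replace every non-alphanumeric character with a space
    t = re.sub("[^A-Za-z0-9]", " ", t)
    # replace each run of digits with '#'; the default token_pattern only
    # keeps word characters, so '#' itself never becomes a feature
    t = re.sub("[0-9]+", "#", t)
    return t


vectorizers = [
    ("simple",
         CountVectorizer()),
    ("preprocessing",
         CountVectorizer(preprocessor=clean_text)),
    ("preprocessing + min_df=10",
         CountVectorizer(preprocessor=clean_text,
                         min_df=10)),
    # stop_words='english' selects scikit-learn's built-in English stop word list
    ("preprocessing + min_df=10 + stop_words",
         CountVectorizer(preprocessor=clean_text,
                         min_df=10,
                         stop_words='english')),
]

for vect_name, vect in vectorizers:
    print(vect_name)
    vect.fit(X_tr)

    # get_feature_names_out() needs scikit-learn >= 1.0
    # (older releases call it get_feature_names())
    feature_names = vect.get_feature_names_out()
    print(list(feature_names[:10]))
    print(len(feature_names))
    print('and' in vect.vocabulary_)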
</div>
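To see what `clean_text` does to a single review, the sketch below copies the function from the solution (so that the block runs on its own) and applies it to a made-up sentence. It also shows that the `#` placeholder does not survive the default tokenizer of `CountVectorizer`, which only keeps tokens of two or more word characters.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def clean_text(t):
    # Same steps as the solution: lowercase, strip non-alphanumerics, hash numbers
    t = t.lower()
    t = re.sub("[^A-Za-z0-9]", " ", t)
    t = re.sub("[0-9]+", "#", t)
    return t


sample = "Great amp!! Cost $120, 5/5."   # hypothetical review text
cleaned_tokens = clean_text(sample).split()
print(cleaned_tokens)

# Fit a vectorizer on this single document and list the learned vocabulary
sample_vect = CountVectorizer(preprocessor=clean_text).fit([sample])
learned = sorted(sample_vect.vocabulary_)
print(learned)
```

If numbers should remain visible as a feature, one possibility is to replace them with a word-like placeholder instead of `#`, or to pass a custom `token_pattern`.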

Stemming and Lemmatization
------------------

Stemming and lemmatization are two linguistic normalization techniques that group together words that derive from the same origin and share the same core meaning:

    connection
    connections
    connective     --->   connect
    connected
    connecting

**Stemming** reduces a group of words to the same *stem*, based on the common origin of the words in the group. The stem does not necessarily have to be a word in the language. **Lemmatization**, on the other hand, makes sure that the root word is a proper word (*lemma*) in the language.
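As a quick illustration of the word family shown above, the sketch below runs NLTK's `PorterStemmer` over the five words. The Porter algorithm is one stemmer among several, and it needs no extra NLTK data download.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()
words = ["connection", "connections", "connective", "connected", "connecting"]

# Map every word to its stem
stems = {w: porter.stem(w) for w in words}
for word, stem in stems.items():
    print(word, "->", stem)
```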

In [110]:
import nltk
nltk.download('wordnet')



[nltk_data] Downloading package wordnet to
[nltk_data]     /home/francesca/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [111]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/francesca/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [113]:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [114]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [121]:
stemmer.stem('universality')

'univers'

In [122]:
stemmer.stem('universally')

'univers'

In [123]:
lemmatizer.lemmatize('universally')

'universally'

In [124]:
lemmatizer.lemmatize('universality')

'universality'

In [127]:
lemmatizer.lemmatize('university')

'university'

In [14]:
from nltk.corpus import wordnet

lemmatizer.lemmatize('universally',pos=wordnet.ADV)

'universally'

In [117]:
stemmer.stem('is')

'is'

In [128]:
lemmatizer.lemmatize('is')

'is'

Exercise
---------

Extend the function `clean_text` created above to a function that, after cleaning the text, applies a stemming technique to it.

**Solutions**
<div>
def clean_text_stem(text):
    """Clean the text with clean_text, then stem every token."""
    t = clean_text(text)
    tokens = nltk.word_tokenize(t)    # alternative: t.split()
    stemmed_tokens = [stemmer.stem(w) for w in tokens]
    return ' '.join(stemmed_tokens)


vectorizers = [
    ("simple",
         CountVectorizer()),
    ("preprocessing",
         CountVectorizer(preprocessor=clean_text)),
    ("preprocessing + min_df=10",
         CountVectorizer(preprocessor=clean_text,
                         min_df=10)),
    ("preprocessing + min_df=10 + stop_words",
         CountVectorizer(preprocessor=clean_text,
                         min_df=10,
                         stop_words='english')),
    # stop words are matched after stemming, so stop words whose stem
    # differs from the word itself are no longer filtered
    ("stemming + min_df=10 + stop_words",
         CountVectorizer(preprocessor=clean_text_stem,
                         min_df=10,
                         stop_words='english')),
]


for vect_name, vect in vectorizers:
    print(vect_name)
    vect.fit(X_tr)

    # get_feature_names_out() needs scikit-learn >= 1.0
    feature_names = vect.get_feature_names_out()
    print(list(feature_names[:10]))
    print(len(feature_names))
    print('and' in vect.vocabulary_)
</div>

Pipeline
----------------------

Now that we know how to transform text data, let's put it into a pipeline.
1. Create a pipeline with `CountVectorizer`, `StandardScaler` and a classifier (e.g. `RandomForestClassifier`).
2. Use ```sklearn.metrics.classification_report``` to create a report on your classifier.
3. Play with the text preprocessing in ```CountVectorizer``` to see if the model improves.



In [231]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

from sklearn.ensemble import RandomForestClassifier
# cross_val_predict and GridSearchCV live in sklearn.model_selection
# (sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20)
from sklearn.model_selection import cross_val_predict, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

# with_mean=False keeps the count matrix sparse (centering would densify it)
clf = Pipeline([('vect', CountVectorizer(min_df=10, preprocessor=clean_text)),
                ('scaling', StandardScaler(with_mean=False)),
                ('clf', RandomForestClassifier())])

# out-of-fold predictions for every training sample (8-fold cross-validation)
preds = cross_val_predict(clf,
                          X_tr,
                          y_tr,
                          cv=8, n_jobs=-1, verbose=True)

print(classification_report(y_tr, preds))

             precision    recall  f1-score   support

          1       0.20      0.00      0.01       202
          2       0.33      0.01      0.02       225
          3       0.20      0.03      0.05       696
          4       0.33      0.12      0.17      1862
          5       0.70      0.95      0.80      6249

avg / total       0.57      0.67      0.58      9234



[Parallel(n_jobs=-1)]: Done   2 out of   8 | elapsed:    2.4s remaining:    7.2s
[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:    2.5s finished


## Check some mis-predicted reviews

In [233]:
cv = CountVectorizer(min_df=10,preprocessor=clean_text)

In [242]:
#create train dataframe from X_tr (this can be avoided by using train_test_split above directly on the df)
df_tr = pd.DataFrame(X_tr,columns=['reviewText'])

In [243]:
#df_tr_label

In [246]:
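X_tr_ = cv.fit_transform(X_tr)
# one column per token; get_feature_names_out() needs scikit-learn >= 1.0
df_tr_ = pd.DataFrame(X_tr_.toarray(), columns=cv.get_feature_names_out())
# note: these assignments overwrite a token column of the same name, if the vocabulary has one
df_tr_['overall'] = y_tr
df_tr_['preds'] = preds
df_tr_['reviewText'] = df_tr['reviewText']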

In [251]:
X_tr[0]

'I ordered two of these stands and they work very well! Definitely keeps my guitars a lot safer than having them leaning against things.'

In [252]:
df_tr_[['reviewText','overall','preds']].head()

Unnamed: 0,reviewText,overall,preds
0,I ordered two of these stands and they work ve...,5,5
1,"Works well, pretty straight forward. A little ...",5,5
2,Installed on two guitars and I love them! I ca...,5,5
3,"I've been using this for roughly a year now, a...",4,4
4,"This pedal essentialy balances out the high, m...",5,5


In [254]:
# check reviews that contain the word "terrible" but are actually labelled as positive
df_tr_[(df_tr_.terrible>0) & (df_tr_.overall==5)][['reviewText']].head()

Unnamed: 0,reviewText
128,I recently bought this for my first electric g...
1849,This guitar strap was so inexpensive compared ...
2147,I bought a case of these. I can't handle #12 ...
2202,I have to admit this up front - I'm a terrible...
3219,I bough this amp for my 9 year old son who is ...


In [256]:
df_tr_[(df_tr_.terrible>0) & (df_tr_.overall==5)].iloc[0]['reviewText']

'I recently bought this for my first electric guitar and I thoroughly enjoy it. It has a lot of power and noise behind it. It is a little starter amp, just in case the picture was a little deceiving. I cranked it up quite loud and it produces clean sound without terrible distortion. With distortion, it sounds great. It is a solid piece of work, and it makes a great practice amp and possibly a great starter amp.'

Exercise
----------------------

1. Use `GridSearchCV` or `RandomizedSearchCV` to find the best parameters for the models. Check at least 2 parameters.
   - **GridSearchCV**: you specify the exact values to test for each parameter, and every combination is evaluated.
   - **RandomizedSearchCV**: you specify lists or distributions for the parameters, and a fixed number of random combinations (`n_iter`) is sampled.
2. Inspect the attribute `cv_results_` after fitting. It gives a tabular summary of the scores obtained for each parameter combination.

**Solutions**
<div>
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

print("Grid search")
print()

params = {'vect__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)],
          'vect__analyzer': ["word", "char"],
          'vect__binary': [True, False]}

grid_clf = GridSearchCV(clf, params, scoring='f1_weighted', n_jobs=1, verbose=True)
grid_clf.fit(X_tr, y_tr)

# cv_results_ replaces the old grid_scores_ attribute;
# sort the parameter combinations by mean cross-validated score
results = pd.DataFrame(grid_clf.cv_results_)
results = results.sort_values('mean_test_score', ascending=False)

for score, p in zip(results['mean_test_score'], results['params']):
    print(score, p)
#
#print("Randomized search")
#print()
#
# 'clf__n_estimators' is an assumed replacement for the original
# 'model__lr__dimensions', which does not match any step of this pipeline
#params = {'vect__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)],
#          'vect__analyzer': ["word","char"],
#          'clf__n_estimators': [100, 200]}
#
#grid_clf = RandomizedSearchCV(clf, params, n_jobs=1, verbose=True, n_iter=8)
#grid_clf.fit(np.array(X_tr[:10000]), y_tr[:10000])
#
#results = pd.DataFrame(grid_clf.cv_results_).sort_values('mean_test_score', ascending=False)
#
#for score, p in zip(results['mean_test_score'], results['params']):
#    print(score, p)
</div>
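For comparison, here is a small self-contained sketch of `RandomizedSearchCV` and of reading `cv_results_` as a DataFrame. The toy reviews, the two-step pipeline and the parameter ranges are assumptions chosen to keep the example fast; they are not the settings used in this notebook.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Toy data (made up): 12 short reviews, two rating classes
texts = ["great sound", "love this strap", "works great", "very good cable",
         "great value", "love the tone", "broke quickly", "bad cable",
         "terrible noise", "stopped working", "poor quality", "bad sound"]
labels = [5] * 6 + [1] * 6

pipe = Pipeline([("vect", CountVectorizer()),
                 ("clf", RandomForestClassifier(random_state=0))])

# Lists are sampled uniformly; scipy distributions are sampled at random
param_dist = {"vect__ngram_range": [(1, 1), (1, 2)],
              "vect__binary": [True, False],
              "clf__n_estimators": randint(5, 30)}

search = RandomizedSearchCV(pipe, param_dist, n_iter=4, cv=3,
                            scoring="f1_weighted", random_state=0)
search.fit(texts, labels)

# One row per sampled parameter combination
results = pd.DataFrame(search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]]
      .sort_values("rank_test_score"))
print(search.best_params_)
```

Only `n_iter` combinations are evaluated, regardless of how large the search space is, which is the main practical difference from `GridSearchCV`.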


## Extract further information from text

The review text contains additional pieces of information that can be used for classification.
Examples:
- length of text
- punctuation
- emoji usage 

Example:
"The product does exactly as it should and is quite affordable.I did not realized it was double screened until it arrived, so it was even better than I had expected.As an added bonus, one of the screens carries a small hint of the smell of an old grape candy I used to buy, so for reminiscent's sake, I cannot stop putting the pop filter next to my nose and smelling it after recording. :DIf you needed a pop filter, this will work just as well as the expensive ones, and it may even come with a pleasing aroma like mine did!Buy this product! :]"


In [142]:
df['review_length'] = df['reviewText'].apply(lambda x: len(x))
df.groupby('overall')['review_length'].median()


overall
1    391.0
2    363.0
3    351.5
4    319.0
5    265.0
Name: review_length, dtype: float64

Exercise
---------

1. Create a custom Transformer that extracts the length of the review.
2. Create a custom Transformer that counts the positive emoticons in the review.
3. Add the transformers above to the feature extraction part of the pipeline above. **ATTENTION**: Make sure each `transform` method returns a 2-D result with one row per review, for example a sparse column matrix (use `from scipy import sparse` and `sparse.csr_matrix`); a flat 1-D array leads to a dimension error when `FeatureUnion` concatenates the feature blocks.

**Solution**
<div>
import re

from sklearn.base import BaseEstimator, TransformerMixin
from scipy import sparse


class LengthExtractor(BaseEstimator, TransformerMixin):
    """Single feature: the number of characters in each review."""

    def fit(self, X, y=None):
        # nothing to learn from the training data
        return self

    def transform(self, X, y=None):
        lengths = np.array([len(x) for x in X])
        # sparse column of shape (n_samples, 1), so FeatureUnion can stack it
        return sparse.csr_matrix(lengths.reshape(-1, 1))


class CountEmojis(BaseEstimator, TransformerMixin):
    """Single feature: the number of positive emoticons in each review."""

    def fit(self, X, y=None):
        # nothing to learn from the training data
        return self

    def transform(self, X, y=None):
        # raw string, so the escaped brackets reach the regex engine unchanged
        n_pos = np.array([len(re.findall(r'(:\)|:D|;\)|:\])', x)) for x in X])
        return sparse.csr_matrix(n_pos.reshape(-1, 1))


from sklearn.pipeline import Pipeline, FeatureUnion

# text counts, review length and emoticon count side by side
clf = Pipeline([
    ('feats', FeatureUnion([
        ('vect', CountVectorizer(min_df=10, preprocessor=clean_text)),
        ('len', LengthExtractor()),
        ('emoji', CountEmojis())
    ])),
    ('scaling', StandardScaler(with_mean=False)),
    ('clf', RandomForestClassifier())
])

preds = cross_val_predict(clf,
                          X_tr,
                          y_tr,
                          cv=8, n_jobs=-1, verbose=True)

print(classification_report(y_tr, preds))
</div>
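A quick way to check that a custom transformer plugs into `FeatureUnion` correctly is to look at the shape of the combined matrix on a few toy documents. In the sketch below, `ToyLength` is a hypothetical stand-in for `LengthExtractor`, and the three documents are made up.

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion


class ToyLength(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for LengthExtractor: one column with len(text)."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        lengths = np.array([len(x) for x in X]).reshape(-1, 1)
        return sparse.csr_matrix(lengths)


docs = ["nice strap :)", "too loud", "love it :D :)"]

union = FeatureUnion([("vect", CountVectorizer()),
                      ("len", ToyLength())])
features = union.fit_transform(docs)

# Vocabulary size of the text part alone, for comparison
n_tokens = len(CountVectorizer().fit(docs).vocabulary_)
length_column = features[:, -1].toarray().ravel().tolist()

print(features.shape)    # (n_docs, n_tokens + 1)
print(length_column)     # the appended length feature
```

The extra feature ends up as the last column because `FeatureUnion` stacks the blocks in the order in which the transformers are listed.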

How to improve it?