In [1]:
import metapy
import pandas as pd
import time
from numpy.random import shuffle

Load the JSON dataset and save the text and labels in separate files (line corpus format). This needs to be done only once.

In [2]:
%%time
path=''
file='kindle_reviews.json'
df = pd.read_json(path_or_buf=path+file, lines=True, encoding='utf-8')    # lines=True: one JSON object per line
print('Length of text: {}'.format(len(df)))

Length of text: 982619
Wall time: 13.3 s


In [3]:
df.head()

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,B000F83SZQ,"[0, 0]",5,I enjoy vintage books and movies so I enjoyed ...,"05 5, 2014",A1F6404F1VG29J,Avidreader,Nice vintage story,1399248000
1,B000F83SZQ,"[2, 2]",4,This book is a reissue of an old one; the auth...,"01 6, 2014",AN0N05A9LIJEQ,critters,Different...,1388966400
2,B000F83SZQ,"[2, 2]",4,This was a fairly interesting read. It had ol...,"04 4, 2014",A795DMNCJILA6,dot,Oldie,1396569600
3,B000F83SZQ,"[1, 1]",5,I'd never read any of the Amy Brewster mysteri...,"02 19, 2014",A1FV0SX13TWVXQ,"Elaine H. Turley ""Montana Songbird""",I really liked it.,1392768000
4,B000F83SZQ,"[0, 1]",4,"If you like period pieces - clothing, lingo, y...","03 19, 2014",A3SPTOKDG7WBLN,Father Dowling Fan,Period Mystery,1395187200


In [4]:
df_text = df[['reviewText', 'overall']].copy()        # copy only certain columns to another df
df = None
df_text['overall'] = df_text['overall'].apply(lambda x: 'pos' if x > 3 else 'neg' if x < 3 else 'mixed')
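The nested conditional in the lambda above can be hard to read and to test. As a minimal sketch, the same rule can be written as a named function; `rating_to_label` is a hypothetical helper name, not part of the original notebook.

```python
def rating_to_label(stars):
    """Map a 1-5 star rating to a sentiment label (same rule as the lambda)."""
    if stars > 3:
        return 'pos'
    if stars < 3:
        return 'neg'
    return 'mixed'

# Quick check over every possible star rating.
star_labels = [rating_to_label(s) for s in (1, 2, 3, 4, 5)]
print(star_labels)
```

In the notebook this could be applied with `df_text['overall'].apply(rating_to_label)`.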

In [5]:
df_text.head()

Unnamed: 0,reviewText,overall
0,I enjoy vintage books and movies so I enjoyed ...,pos
1,This book is a reissue of an old one; the auth...,pos
2,This was a fairly interesting read. It had ol...,pos
3,I'd never read any of the Amy Brewster mysteri...,pos
4,"If you like period pieces - clothing, lingo, y...",pos


Create the corpus file and the labels file on disk; they will be used to build a forward index.

In [6]:
%%time
df_text['reviewText'].to_csv('ceeaus2/ceeaus2.dat', header=False, index=False)
df_text['overall'].to_csv('ceeaus2/ceeaus2.dat.labels', header=False, index=False)

Wall time: 14.2 s
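A line corpus expects exactly one document per line, while `to_csv` by default quotes fields that contain commas or newlines and keeps embedded newlines inside the quotes. The following is a hedged sketch of an alternative writer that collapses whitespace so that each review stays on one line; `write_line_corpus` and the demo file names are hypothetical, and the sketch uses only the standard library.

```python
import os
import tempfile

def write_line_corpus(texts, labels, directory, name):
    """Write one document per line plus a parallel labels file."""
    os.makedirs(directory, exist_ok=True)  # create the dataset folder if missing
    corpus_path = os.path.join(directory, name + '.dat')
    labels_path = corpus_path + '.labels'
    with open(corpus_path, 'w', encoding='utf-8') as docs, \
         open(labels_path, 'w', encoding='utf-8') as labs:
        for text, label in zip(texts, labels):
            # Collapse all whitespace (including newlines) so that each
            # document occupies exactly one line.
            docs.write(' '.join(str(text).split()) + '\n')
            labs.write(label + '\n')
    return corpus_path, labels_path

# Demo on two toy reviews written to a temporary directory.
demo_dir = tempfile.mkdtemp()
corpus_path, labels_path = write_line_corpus(
    ['Great book,\nloved it.', 'Not for me.'], ['pos', 'neg'], demo_dir, 'demo')
with open(corpus_path, encoding='utf-8') as f:
    corpus_lines = f.read().splitlines()
with open(labels_path, encoding='utf-8') as f:
    label_lines = f.read().splitlines()
print(corpus_lines, label_lines)
```

With the notebook's data, the call would presumably look like `write_line_corpus(df_text['reviewText'], df_text['overall'], 'ceeaus2', 'ceeaus2')`.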


Create the configuration file for building the forward index. It specifies how language features are constructed from the text.

In [7]:
config = """stop-words = "lemur-stopwords.txt"

prefix = "."
dataset = "ceeaus2"
corpus = "line.toml"
index = "ceeaus2-idx"

[[analyzers]]
#method = "ngram-word"
#ngram = 1
#filter = "default-unigram-chain"

method = "ngram-word"
ngram = 1
    [[analyzers.filter]]
    type = "icu-tokenizer"
    
    [[analyzers.filter]]
    type = "lowercase"
    
    [[analyzers.filter]]
    type = "length"
    min = 1
    max = 35
    
    #[[analyzers.filter]]
    #type = "alpha"
    
    [[analyzers.filter]]
    type = "english-normalizer"    
"""
with open('ceeaus-config.toml', 'w') as f:
    f.write(config)
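The configuration refers to a corpus specification file, `line.toml`, which this notebook does not create. A minimal sketch of writing it follows, under the assumption that MeTA looks for it inside the dataset directory and that a single `type = "line-corpus"` entry is sufficient; `write_line_corpus_spec` is a hypothetical helper.

```python
import os
import tempfile

def write_line_corpus_spec(dataset_dir):
    """Write a minimal line.toml (assumed content) into the dataset directory."""
    os.makedirs(dataset_dir, exist_ok=True)
    spec_path = os.path.join(dataset_dir, 'line.toml')
    with open(spec_path, 'w') as f:
        # Assumption: declaring the corpus type is all a line corpus needs.
        f.write('type = "line-corpus"\n')
    return spec_path

# Demo in a temporary directory; in the notebook the argument would be 'ceeaus2'.
spec_path = write_line_corpus_spec(os.path.join(tempfile.mkdtemp(), 'demo'))
with open(spec_path) as f:
    spec_text = f.read()
print(spec_text)
```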

Build the `ForwardIndex` and save it to disk.

In [8]:
%%time
fidx = metapy.index.make_forward_index('ceeaus-config.toml')

Wall time: 59.9 s


In [9]:
# inverted index - not needed for these classifiers
#iidx = metapy.index.make_inverted_index('ceeaus-config.toml')

The feature set used for classification depends on the settings in the configuration file _at the time of indexing_. If you change your `analyzer` pipeline (or other settings), you must **reindex** your documents.
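Reindexing usually means removing the old index directory first. The sketch below assumes that `make_forward_index` reuses an index already present at the configured `index` path (here `ceeaus2-idx`); `remove_index` is a hypothetical helper built on the standard library only.

```python
import os
import shutil
import tempfile

def remove_index(index_dir):
    """Delete an existing index directory; return True if something was removed."""
    if os.path.isdir(index_dir):
        shutil.rmtree(index_dir)
        return True
    return False

# Demo on a throwaway directory standing in for a stale index.
stale = os.path.join(tempfile.mkdtemp(), 'demo-idx')
os.makedirs(stale)
removed_first = remove_index(stale)   # directory existed and was deleted
removed_second = remove_index(stale)  # nothing left to delete
print(removed_first, removed_second)
```

In the notebook, `remove_index('ceeaus2-idx')` would be called before `make_forward_index`.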

Decide which kind of dataset to use: one for binary classification (MeTA's `BinaryDataset`) or one for multi-class classification (`MulticlassDataset`). To see the number of labels:

In [10]:
fidx.num_labels()

3

Looks like we need a `MulticlassDataset` to predict which of these three labels a document should have (but if we are interested in one particular class only, we might use a `BinaryDataset`).

For now, let's focus on the multi-class case, as that likely makes the most sense for this kind of data. Since the dataset is small enough, we can load all documents into memory at once like this.

In [11]:
dset = metapy.classify.MulticlassDataset(fidx)
len(dset)

982619

Because datasets may be large, it is beneficial to avoid copying them (e.g. in order to shuffle them). Instead, operate on a `DatasetView` (`MulticlassDatasetView` or `BinaryDatasetView`), which can be shuffled or rotated without modifying the underlying dataset. Create a view either with Python's slicing syntax or by constructing one directly; the check after shuffling confirms that the underlying dataset keeps its original order.

In [12]:
#view = dset[0:len(dset)+1]
# or
view = metapy.classify.MulticlassDatasetView(dset)

Shuffle the view without changing the underlying dataset.

In [13]:
view.shuffle()
print("Is {}, was {}".format(view[0].id, dset[0].id))

Is 244438, was 0


The view is shuffled in random order, but the underlying dataset is still sorted by id.

Slice the shuffled view for a train and test data split.

In [14]:
training = view[0:int(0.8*len(view))]
testing = view[int(0.8*len(view)):len(view)+1]
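To illustrate why shuffling and slicing a view leaves the dataset untouched, here is a toy sketch in plain Python. `IndexView` is a hypothetical stand-in, not MeTA's `MulticlassDatasetView`: it stores only a list of positions into the data.

```python
import random

class IndexView:
    """Toy view: a list of positions into a dataset it never modifies."""

    def __init__(self, data, indices=None):
        self.data = data
        self.indices = list(range(len(data))) if indices is None else list(indices)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Slicing returns another view over the same underlying data.
            return IndexView(self.data, self.indices[key])
        return self.data[self.indices[key]]

    def shuffle(self, seed=None):
        # Only the index list is permuted; self.data keeps its order.
        random.Random(seed).shuffle(self.indices)

docs = ['doc{}'.format(i) for i in range(10)]
toy_view = IndexView(docs)
toy_view.shuffle(seed=0)

cut = int(0.8 * len(toy_view))          # 80/20 split, as in the cell above
toy_train, toy_test = toy_view[:cut], toy_view[cut:]
print(len(toy_train), len(toy_test))
```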

Train a Naive Bayes classifier on the training view.

In [15]:
%%time
nb = metapy.classify.NaiveBayes(training, alpha=0.7, beta=0.7)

Wall time: 24.7 s


In [16]:
# classify individual documents
#nb.classify(testing[0].weights)

Classify the test set

In [17]:
mtrx = nb.test(testing)
print(mtrx)


           mixed    neg      pos
         ---------------------------
   mixed | 0.459    0.191    0.351
     neg | 0.21     0.647    0.143
     pos | 0.0843   0.0229   0.893




The `test()` method returns a `ConfusionMatrix`, which computes many of the metrics used in classifier evaluation. Because the view is shuffled randomly, results may differ between runs. **Each row shows what fraction of the documents with that _true_ label were assigned to each label**, so the diagonal holds the per-class recall.
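The row-wise reading of the matrix can be sketched in plain Python. This is an illustration of the concept on made-up labels, not MeTA's `ConfusionMatrix` implementation; `row_normalized_confusion` is a hypothetical helper.

```python
from collections import Counter

def row_normalized_confusion(true_labels, predicted_labels):
    """Return {true_label: {predicted_label: fraction}} with rows summing to 1."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = {}
    for t in classes:
        row_total = sum(counts[(t, p)] for p in classes)
        matrix[t] = {p: (counts[(t, p)] / row_total if row_total else 0.0)
                     for p in classes}
    return matrix

# Made-up example: 4 'pos', 2 'neg' and 2 'mixed' documents.
y_true = ['pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'mixed', 'mixed']
y_pred = ['pos', 'pos', 'pos', 'mixed', 'neg', 'pos', 'mixed', 'pos']
toy_matrix = row_normalized_confusion(y_true, y_pred)
for label, row in toy_matrix.items():
    print(label, row)
```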

In [18]:
mtrx.print_stats()

------------------------------------------------------------
Class       F1 Score    Precision   Recall      Class Dist
------------------------------------------------------------
mixed       0.397       0.35        0.459       0.0981
neg         0.563       0.498       0.647       0.0583
pos         0.919       0.946       0.893       0.844
------------------------------------------------------------
Total       0.849       0.862       0.836
------------------------------------------------------------
196524 predictions attempted, overall accuracy: 0.836
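As a worked check, the per-class statistics can be approximately recovered from the row-normalized confusion matrix and the class distribution using the standard definitions of precision, recall and F1. The numbers are copied from the Naive Bayes output above; small differences are expected because the printed values are rounded, and this sketch is not MeTA's own computation.

```python
classes = ['mixed', 'neg', 'pos']
# Rows: true label; columns: predicted label (values copied from the output).
confusion = {
    'mixed': {'mixed': 0.459, 'neg': 0.191, 'pos': 0.351},
    'neg':   {'mixed': 0.21, 'neg': 0.647, 'pos': 0.143},
    'pos':   {'mixed': 0.0843, 'neg': 0.0229, 'pos': 0.893},
}
class_dist = {'mixed': 0.0981, 'neg': 0.0583, 'pos': 0.844}

def per_class_stats(confusion, class_dist, classes):
    """Derive precision, recall and F1 for each class."""
    stats = {}
    for c in classes:
        recall = confusion[c][c]  # diagonal entry of the row-normalized matrix
        # Share of all documents that were predicted as class c.
        predicted_mass = sum(class_dist[t] * confusion[t][c] for t in classes)
        precision = class_dist[c] * recall / predicted_mass
        f1 = 2 * precision * recall / (precision + recall)
        stats[c] = {'precision': precision, 'recall': recall, 'f1': f1}
    return stats

nb_stats = per_class_stats(confusion, class_dist, classes)
# Overall accuracy is the class-weighted average of the per-class recalls.
accuracy = sum(class_dist[c] * confusion[c][c] for c in classes)
for c in classes:
    print(c, {k: round(v, 3) for k, v in nb_stats[c].items()})
print('accuracy', round(accuracy, 3))
```

The low precision for `mixed` and `neg` follows from the class imbalance: even a small fraction of misclassified `pos` documents outweighs the correctly classified minority documents.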



[Cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) helps to detect overfitting: running it across the whole dataset gives an idea of how well the classifier might generalize to new data.

In [19]:
mtrx = metapy.classify.cross_validate(lambda fold: metapy.classify.NaiveBayes(fold), view, 5)

`cross_validate()` also returns a `ConfusionMatrix`. Its arguments are a function that creates a trained classifier for each fold, a dataset view (covering all documents), and the number of folds.
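The fold logic can be sketched in plain Python. This is an illustration of k-fold splitting in general, not MeTA's implementation; `k_fold_indices` is a hypothetical helper, and dropping the leftover documents when the dataset size is not divisible by the number of folds is an assumption made for the sketch.

```python
def k_fold_indices(n_docs, k):
    """Yield (train_indices, test_indices) for k equal-sized folds.

    Assumption for illustration: the n_docs % k leftover documents are dropped.
    """
    fold_size = n_docs // k
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        test = list(range(start, stop))
        # Train on every other fold; each document is tested exactly once.
        train = list(range(0, start)) + list(range(stop, k * fold_size))
        yield train, test

folds = list(k_fold_indices(11, 5))
total_predictions = sum(len(test) for _, test in folds)
print(total_predictions)
```

Under that assumption, 5 folds over 982619 documents would give 5 * (982619 // 5) = 982615 predictions, which is consistent with the count printed by `print_stats()` for the cross-validated runs; this is an inference from the output, not documented behavior.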

In [20]:
print(mtrx)
mtrx.print_stats()


           mixed    neg      pos
         ---------------------------
   mixed | 0.448    0.259    0.293
     neg | 0.176    0.722    0.103
     pos | 0.094    0.0369   0.869


------------------------------------------------------------
Class       F1 Score    Precision   Recall      Class Dist
------------------------------------------------------------
mixed       0.379       0.329       0.448       0.0979
neg         0.536       0.426       0.722       0.0582
pos         0.91        0.955       0.869       0.844
------------------------------------------------------------
Total       0.84        0.863       0.819
------------------------------------------------------------
982615 predictions attempted, overall accuracy: 0.819



Now the same for [SVM](https://en.wikipedia.org/wiki/Support_vector_machine).

MeTA's SVM implementation is an approximation that runs [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) on the [hinge loss](https://en.wikipedia.org/wiki/Hinge_loss), and it is implemented as a `BinaryClassifier`. For multi-class classification, wrap it in an adapter: [One-vs-All](https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) or [One-vs-One](https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-one). Construct the `OneVsAll` reduction by passing the dataset, the binary classifier, and the classifier's arguments:

In [27]:
ova = metapy.classify.OneVsAll(training, metapy.classify.SGD, loss_id='hinge')
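The idea behind the reduction can be sketched in plain Python: train one binary hinge-loss classifier per class ("this class" versus "everything else") and predict the class whose classifier gives the highest score. The code below is a toy illustration with unregularized subgradient steps on made-up one-hot features, not MeTA's `SGD` or `OneVsAll`; all function names are hypothetical.

```python
def train_hinge_sgd(samples, targets, epochs=10, lr=0.5):
    """Train a linear binary classifier; targets are +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, targets):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1:  # hinge loss is active: take a subgradient step
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def train_one_vs_all(samples, labels, **kwargs):
    """Train one binary classifier per class against all other classes."""
    models = {}
    for c in sorted(set(labels)):
        targets = [1 if label == c else -1 for label in labels]
        models[c] = train_hinge_sgd(samples, targets, **kwargs)
    return models

def classify(models, x):
    """Return the class whose binary classifier scores x highest."""
    def score(model):
        w, b = model
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=lambda c: score(models[c]))

# Made-up data: each class is signalled by one distinct feature.
toy_x = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
toy_y = ['pos', 'neg', 'mixed']
toy_models = train_one_vs_all(toy_x, toy_y)
toy_predictions = [classify(toy_models, x) for x in toy_x]
print(toy_predictions)
```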

Then use `OneVsAll` as a classifier.

In [28]:
mtrx = ova.test(testing)
print(mtrx)
mtrx.print_stats()


           mixed    neg      pos
         ---------------------------
   mixed | 0.199    0.109    0.691
     neg | 0.0872   0.544    0.369
     pos | 0.00936  0.00387  0.987


------------------------------------------------------------
Class       F1 Score    Precision   Recall      Class Dist
------------------------------------------------------------
mixed       0.299       0.601       0.199       0.0981
neg         0.61        0.694       0.544       0.0583
pos         0.943       0.903       0.987       0.844
------------------------------------------------------------
Total       0.872       0.861       0.884
------------------------------------------------------------
196524 predictions attempted, overall accuracy: 0.884



In [29]:
mtrx = metapy.classify.cross_validate(lambda fold: metapy.classify.OneVsAll(fold, metapy.classify.SGD, loss_id='hinge'), view, 5)
print(mtrx)
mtrx.print_stats()


           mixed    neg      pos
         ---------------------------
   mixed | 0.108    0.104    0.789
     neg | 0.041    0.469    0.49
     pos | 0.00412  0.00278  0.993


------------------------------------------------------------
Class       F1 Score    Precision   Recall      Class Dist
------------------------------------------------------------
mixed       0.184       0.643       0.108       0.0979
neg         0.557       0.686       0.469       0.0582
pos         0.938       0.888       0.993       0.844
------------------------------------------------------------
Total       0.864       0.852       0.876
------------------------------------------------------------
982615 predictions attempted, overall accuracy: 0.876



To see a list of what's included in the bindings: