## Classification with Sklearn

## Load the IMDb dataset

In [3]:
# get_file is exported from keras.utils; the keras.utils.data_utils module is gone in recent Keras releases
from keras.utils import get_file
train_file = get_file('imdb_train.txt', origin='https://goo.gl/FPFnfh', cache_subdir='data')
test_file = get_file('imdb_test.txt', origin='https://goo.gl/mg8bsD', cache_subdir='data')

Each file contains one review per line, followed by a numeric score; the two fields are separated by a TAB.

In [4]:
import csv
x_train = []
y_train = []
with open(train_file, encoding='utf-8', newline='') as infile:
    reader = csv.reader(infile, delimiter='\t')
    for row in reader:
        x_train.append(row[0])
        y_train.append(int(row[1]))

x_test = []
y_test = []
with open(test_file, encoding='utf-8', newline='') as infile:
    reader = csv.reader(infile, delimiter='\t')
    for row in reader:
        x_test.append(row[0])
        y_test.append(int(row[1]))
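
The two loading loops above differ only in the file they read, so they can be folded into one helper. The following is a minimal sketch; the name `load_tsv` is an assumption introduced here, not part of the original notebook.

```python
import csv


def load_tsv(path):
    """Read (text, label) pairs from a TAB-separated file, one review per line."""
    texts, labels = [], []
    with open(path, encoding='utf-8', newline='') as infile:
        for row in csv.reader(infile, delimiter='\t'):
            texts.append(row[0])        # first field: the review text
            labels.append(int(row[1]))  # second field: the numeric score
    return texts, labels
```

With this helper, the cell reduces to `x_train, y_train = load_tsv(train_file)` and `x_test, y_test = load_tsv(test_file)`.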


## Set up a pipeline

In [5]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

`CountVectorizer` converts a sequence of text documents to a matrix of token counts.

In [6]:
vect = CountVectorizer()

Perform tokenization and learn a vocabulary dictionary of all tokens.

In [7]:
vect.fit(x_train) 

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

Transform the documents into a document-term matrix.

In [8]:
X_train = vect.transform(x_train) 
X_test = vect.transform(x_test)
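
To make the two steps concrete, here is a rough pure-Python sketch of what fitting and transforming amount to with the default settings shown in the repr above (lowercasing and the token pattern `(?u)\b\w\w+\b`). The function names and the toy documents are assumptions for illustration; the real `CountVectorizer` returns a SciPy sparse matrix rather than a list of dicts.

```python
import re
from collections import Counter

# Default token pattern from the CountVectorizer repr: words of 2+ characters.
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')


def tokenize(text):
    """Lowercase the text and extract the tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(docs):
    """Map every distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(docs, vocabulary):
    """Return one sparse row per document as {column index: count}."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocabulary)
        rows.append({vocabulary[tok]: n for tok, n in counts.items()})
    return rows


docs = ['The movie was great', 'The movie was bad, really bad']
vocab = fit_vocabulary(docs)
rows = transform(docs, vocab)
print(vocab)
print(rows)
```

Tokens that were not seen during fitting are silently dropped at transform time, which is why the test set can be transformed with the vocabulary learned from the training set.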

The features are simply the tokens.

In [9]:
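# get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
vect.get_feature_names()[3000:3020]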

['ameliorated',
 'ameliorative',
 'amell',
 'amemiya',
 'amen',
 'amenabar',
 'amenable',
 'amend',
 'amendment',
 'amends',
 'amenities',
 'amenábar',
 'amer',
 'amercan',
 'amercian',
 'ameriac',
 'amerian',
 'america',
 'americain',
 'americaine']

Each document is represented as a sparse vector of token counts.

In [13]:
X_train[0,:]

<1x74850 sparse matrix of type '<class 'numpy.int64'>'
	with 88 stored elements in Compressed Sparse Row format>

In [19]:
x_train[0]

"Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

In [20]:
fn = vect.get_feature_names()
fn[1343], fn[3167]

('absurd', 'an')

In [21]:
X_train[0, 3167]

1

Select the k features with the highest chi-square scores.

Recall that the chi-square test measures dependence between stochastic variables, so this step “weeds out” the features that are most likely to be independent of the class and therefore irrelevant for classification.

In [22]:
sel = SelectKBest(chi2, k=5000)
sel.fit(X_train, y_train)

SelectKBest(k=5000, score_func=<function chi2 at 0x7f0d81269840>)
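
As an illustration of the score being ranked, the sketch below computes a chi-square statistic per feature in plain Python: for each class, the observed total count of the feature is compared with the count expected if the feature were independent of the class. The function names and the toy data are assumptions; the zero-count guard is a simplification chosen here, not necessarily how the library handles that case.

```python
def chi2_scores(X, y):
    """Chi-square statistic of each feature column against the class labels.

    X is a list of rows of non-negative counts, y a list of class labels.
    """
    n_docs, n_features = len(X), len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in X)
        if total == 0:
            scores.append(0.0)  # feature never occurs: treat as uninformative
            continue
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = total * y.count(c) / n_docs  # count under independence
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]


# Feature 0 occurs only in class 1; feature 1 occurs equally in both classes.
X_toy = [[3, 1], [2, 1], [0, 1], [0, 1]]
y_toy = [1, 1, 0, 0]
scores = chi2_scores(X_toy, y_toy)
print(scores, top_k(scores, 1))
```

The feature whose counts are spread evenly over the classes scores zero and would be the first to be discarded.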

Reduce the data to the selected features.

In [23]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

In [24]:
X_train

<25000x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 2128612 stored elements in Compressed Sparse Row format>

In [25]:
X_train[0,:]

<1x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 55 stored elements in Compressed Sparse Row format>

Transform a count matrix to a normalized tf or tf-idf representation.

Tf means term-frequency, while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval that has also found good use in document classification.

Using tf-idf instead of raw token counts scales down the impact of tokens that occur very frequently in the corpus and are therefore empirically less informative than tokens that occur in only a small fraction of the training documents.

In [26]:
tfidf = TfidfTransformer()  # weighting
tfidf.fit(X_train)
X_train = tfidf.transform(X_train)
X_test = tfidf.transform(X_test)

In [27]:
X_train[0,:]

<1x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 55 stored elements in Compressed Sparse Row format>
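
The weighting itself is short enough to write out. The sketch below assumes the default settings of `TfidfTransformer` (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the function name `tfidf_rows` and the toy counts are introduced here for illustration only.

```python
import math


def tfidf_rows(counts):
    """Apply smoothed tf-idf weighting and L2-normalize each row of counts."""
    n_docs, n_features = len(counts), len(counts[0])
    # Document frequency: in how many documents does each feature occur?
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_features)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


# Feature 0 occurs in both documents, feature 1 only in the first.
weights = tfidf_rows([[1, 1], [1, 0]])
print(weights)
```

In the first row both tokens occur once, yet the rarer token receives the larger weight. The number of stored elements is unchanged by the transformation, which matches the output above: only the values and the dtype change.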

### SVM classifier with linear kernel

In [28]:
learner = LinearSVC()
classifier = learner.fit(X_train, y_train)
predictions = classifier.predict(X_test)

In [29]:
predictions

array([0, 1, 0, ..., 1, 1, 1])

## Evaluation of accuracy

In [30]:
accuracy = 0
for prediction,correct in zip(predictions, y_test):
    if prediction == correct:
        accuracy += 1
accuracy/len(predictions)

0.8782048718051277
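
The counting loop can be written more compactly. This is a small sketch with a hypothetical helper name, `accuracy`; `sklearn.metrics.accuracy_score(y_test, predictions)` reports the same fraction.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that equal the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError('predictions and gold labels differ in length')
    # True counts as 1, so the sum is the number of correct predictions.
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))
```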

## Using the sklearn Pipeline object

In [31]:
pipeline = Pipeline([
    ('vect', CountVectorizer()),  # feature extraction
    ('sel', SelectKBest(chi2, k=5000)),  # feature selection
    ('tfidf', TfidfTransformer()),  # weighting
    ('learner', LinearSVC())  # learning algorithm
])

classifier = pipeline.fit(x_train, y_train)
predictions = classifier.predict(x_test)
accuracy = 0
for prediction,correct in zip(predictions, y_test):
    if prediction == correct:
        accuracy += 1
accuracy/len(predictions)

0.8782048718051277
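
The accuracy is identical because the pipeline performs the same sequence of calls as the manual cells: each intermediate step is fitted on the training data and used to transform it, and at prediction time the already fitted steps transform the new data before the final estimator is applied. The sketch below mimics that control flow with toy components; `MiniPipeline`, `CountWord` and `ThresholdClassifier` are hypothetical classes written for this illustration, not scikit-learn APIs.

```python
class MiniPipeline:
    """Chain transformers and a final estimator, in the spirit of Pipeline."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)  # fit each step, then pass data on
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)  # reuse what was learned during fit
        return self.steps[-1][1].predict(X)


class CountWord:
    """Toy transformer: how often a fixed word occurs in each text."""

    def __init__(self, word):
        self.word = word

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.lower().split().count(self.word) for text in X]


class ThresholdClassifier:
    """Toy estimator: predict 1 when the feature exceeds the training mean."""

    def fit(self, X, y):
        self.threshold = sum(X) / len(X)
        return self

    def predict(self, X):
        return [int(v > self.threshold) for v in X]


pipe = MiniPipeline([('count', CountWord('good')), ('clf', ThresholdClassifier())])
pipe.fit(['good good film', 'dull film'], [1, 0])
preds = pipe.predict(['a good good story', 'a dull story'])
print(preds)
```

Because both `fit` and `predict` take raw text, there is no way to forget a transformation on the test set or to fit a step on the wrong input.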

## Feature extraction function that uses POS tagging and SentiWordNet

In [66]:
from nltk.corpus import sentiwordnet as swn
from nltk import pos_tag
from nltk.sentiment.util import mark_negation
from nltk.tokenize.casual import TweetTokenizer

count=0
def swn_tokenizer(text, threshold=0.1, verbose=True):
    """
    Extracts words, sentiwordnet features, handles negation
    """
    global count
    count += 1
    if verbose:
        if count%1000 == 0:
            print('|', end='')
        elif count%100 == 0:
            print('.', end='')
    tokenizer = TweetTokenizer()
    tokens = tokenizer.tokenize(text)
    # Append _NEG suffix to words that appear in the scope between a negation and a punctuation mark.
    neg_tokens = mark_negation(tokens, double_neg_flip=True)
    swntokens = []
    taggedtokens = pos_tag(tokens)
    for (token, pos),neg_token in zip(taggedtokens, neg_tokens):
        # Translate the first letter of the nltk POS tag to a SentiWordNet POS tag.
        swnpos = {'R': 'r', 'N': 'n', 'V': 'v', 'J': 'a'}.get(pos[0])
        if swnpos is not None:
            values = list(swn.senti_synsets(token, swnpos))
            if len(values) > 0:
                # Weighted average of the polarity of all senses: sense i gets
                # weight 1/i, so senses listed first count more.
                score = 0.0
                weight_sum = 0.0  # avoids shadowing the builtin sum()
                for i, value in enumerate(values, start=1):
                    score += (value.pos_score() - value.neg_score()) / i
                    weight_sum += 1.0 / i  # same weight as applied to the score
                score /= weight_sum
                if score > threshold:
                    if neg_token.endswith('_NEG'):
                        swntokens.append('_SWN_NEG_%s' % swnpos)
                    else:
                        swntokens.append('_SWN_POS_%s' % swnpos)
                elif score < -threshold:
                    if neg_token.endswith('_NEG'):
                        swntokens.append('_SWN_POS_%s' % swnpos)
                    else:
                        swntokens.append('_SWN_NEG_%s' % swnpos)
    
    neg_tokens.extend(swntokens)
    return neg_tokens
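
The scoring logic of the tokenizer can be isolated from NLTK and tried on hand-made numbers. The sketch below assumes a rank-weighted average in which sense i has weight 1/i and the result is normalized by the sum of those same weights; `weighted_polarity` and `polarity_feature` are hypothetical helper names, and the (positive, negative) score pairs are made up for the example.

```python
def weighted_polarity(senses):
    """Rank-weighted mean polarity.

    senses: list of (pos_score, neg_score) pairs in the order the senses
    are listed; the sense at rank i contributes with weight 1/i.
    """
    if not senses:
        return 0.0
    score = 0.0
    weight_sum = 0.0
    for rank, (pos, neg) in enumerate(senses, start=1):
        weight = 1.0 / rank
        score += weight * (pos - neg)
        weight_sum += weight
    return score / weight_sum


def polarity_feature(score, swnpos, negated, threshold=0.1):
    """Map a polarity score to a feature name, flipping it under negation."""
    if abs(score) <= threshold:
        return None  # too close to neutral: emit no feature
    positive = score > 0
    if negated:
        positive = not positive
    return '_SWN_%s_%s' % ('POS' if positive else 'NEG', swnpos)


print(weighted_polarity([(1.0, 0.0), (0.0, 1.0)]))
print(polarity_feature(0.5, 'a', negated=True))
```

A word whose first sense is fully positive and whose second sense is fully negative still comes out positive, because the first sense carries twice the weight of the second.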

## Using sentiment-oriented text processing

In [67]:
vect = CountVectorizer(analyzer=swn_tokenizer)
vect.fit(x_train)
print('vect fitted')
X_train = vect.transform(x_train) 
X_test = vect.transform(x_test)
sel = SelectKBest(chi2, k=5000)
sel.fit(x_train)
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)
tfidf = TfidfTransformer()
tfidf.fit(x_train)
X_train = tfidf.transform(X_train)
X_test = tfidf.transform(X_test)

.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|vect fitted
.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|.........|

TypeError: fit() missing 1 required positional argument: 'y'

In [None]:
learner = LinearSVC()
classifier = learner.fit(X_train, y_train)

This will take several minutes.

In [None]:
pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer=swn_tokenizer)),  # feature extraction
    ('sel', SelectKBest(chi2, k=5000)), 
    ('tfidf', TfidfTransformer()), 
    ('learner', LinearSVC())
])

classifier = pipeline.fit(x_train, y_train)

### Test accuracy

In [None]:
predictions = classifier.predict(x_test)
accuracy = 0
for prediction,correct in zip(predictions, y_test):
    if prediction == correct:
        accuracy += 1
accuracy/len(predictions)