# Combining heterogeneous feature spaces

At times, we want to use different kinds of features in our classifier. In `sklearn`, there are two ways to combine features from different feature spaces:

1. Use `FeatureUnion` to combine features from heterogeneous sources.
2. Use `DictVectorizer` and define your own feature spaces.

Each approach comes with its advantages and disadvantages, which we will discuss further below. Weigh the pros and cons for yourself before deciding which approach to use.

## Option 1: Using FeatureUnion

Let's take our sentiment classification example and combine two feature sources: word and character n-grams. To do so, we will use the class `FeatureUnion`, which lets us easily combine built-in featurizers.

In [1]:
from sklearn.pipeline import FeatureUnion

Let's first train the classifier using word unigrams only, to see how the code changes when we include `FeatureUnion`.

In [2]:
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
import numpy as np
import random
import sys

# using a seed for replicability
random.seed(113)

def load_sentiment_sentences_and_labels():
    """
    loads the movie review data
    """
    positive_sentences = open("exercise/rt-polaritydata/rt-polarity.pos").readlines()
    negative_sentences = open("exercise/rt-polaritydata/rt-polarity.neg").readlines()

    positive_labels = [1 for sentence in positive_sentences]
    negative_labels = [0 for sentence in negative_sentences]

    sentences = np.concatenate([positive_sentences,negative_sentences], axis=0)
    labels = np.concatenate([positive_labels,negative_labels],axis=0)

    ## make sure we have a label for every data instance
    assert(len(sentences)==len(labels))
    data = list(zip(sentences,labels))
    random.shuffle(data)
    print("split data..", file=sys.stderr)
    split_point = int(0.75*len(data))
    
    sentences = [sentence for sentence, label in data]
    labels = [label for sentence, label in data]
    X_train, X_test = sentences[:split_point], sentences[split_point:]
    y_train, y_test = labels[:split_point], labels[split_point:]

    assert(len(X_train)==len(y_train))
    assert(len(X_test)==len(y_test))

    return X_train, y_train, X_test, y_test

## read input data
print("load data..", file=sys.stderr)
X_train, y_train, X_test, y_test = load_sentiment_sentences_and_labels()

print("vectorize data..", file=sys.stderr)
#vectorizer = CountVectorizer()
#pipeline = Pipeline( [('vec', vectorizer),
#                        ('clf', LogisticRegression())] )

# use FeatureUnion instead
pipeline = Pipeline([
        ('features', FeatureUnion([
            ('words', CountVectorizer()),
        ])),
        ('classifier', LogisticRegression())])

print("train model..", file=sys.stderr)
pipeline.fit(X_train, y_train)
##
print("predict..", file=sys.stderr)
y_predicted = pipeline.predict(X_test)
###
print("Accuracy:", accuracy_score(y_test, y_predicted), file=sys.stderr)


load data..
split data..
vectorize data..
train model..
predict..
Accuracy: 0.767066766692


Once we have `FeatureUnion` in place, we can add further transformers to the list (a transformer needs to implement the `fit` and `transform` functions; for more information see http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html). Let's see if adding character n-grams helps.

In [3]:
pipeline = Pipeline([
        ('features', FeatureUnion([
            ('words', CountVectorizer()),
            ('chars', CountVectorizer(analyzer='char',ngram_range=(4,5), binary=True)),
        ])),
        ('classifier', LogisticRegression())])

print("train model..", file=sys.stderr)
pipeline.fit(X_train, y_train)
##
print("predict..", file=sys.stderr)
y_predicted = pipeline.predict(X_test)
###
print("Accuracy:", accuracy_score(y_test, y_predicted), file=sys.stderr)


train model..
predict..
Accuracy: 0.764066016504
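
To make concrete what `FeatureUnion` does with the two vectorizers, here is a minimal sketch on three made-up sentences (`toy_docs` and the two helper functions are assumptions for illustration, not part of the movie review setup). It fits each vectorizer on its own and then inside a `FeatureUnion`, and compares the shapes of the resulting matrices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# toy sentences, made up for illustration only
toy_docs = ["a good movie", "a bad movie", "good fun"]

def word_vectorizer():
    # same settings as the 'words' step above
    return CountVectorizer()

def char_vectorizer():
    # same settings as the 'chars' step above
    return CountVectorizer(analyzer='char', ngram_range=(4, 5), binary=True)

# fit each vectorizer on its own ...
X_words = word_vectorizer().fit_transform(toy_docs)
X_chars = char_vectorizer().fit_transform(toy_docs)

# ... and both together inside a FeatureUnion
union = FeatureUnion([('words', word_vectorizer()),
                      ('chars', char_vectorizer())])
X_union = union.fit_transform(toy_docs)

# the union has one row per document and the columns of both vectorizers
print(X_words.shape, X_chars.shape, X_union.shape)
```

The number of columns of `X_union` is the sum of the columns of `X_words` and `X_chars`: the union places the feature blocks side by side, and the classifier then sees one wide sparse matrix.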


## Feature ablation test

Using `FeatureUnion` has a couple of pros and cons. Its **advantages** are:

+ it's easy to combine different feature spaces; and
+ it's quick to implement

However, with `FeatureUnion` it is no longer straightforward to access the weight coefficients and thus inspect the features. As we will see later, it no longer supports the `show_most_informative_features` method ([method taken from here](http://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers)).

Nevertheless, there are still ways to gauge the effect of a certain feature group. In particular, we can do a **feature ablation test**. As the name suggests, you leave out certain features, train a model without them, and compare it to the model that includes them.

More precisely, a feature ablation test is a set of experiments in which you remove one feature group at a time and observe how much your performance drops. This gives you an indication of how useful the feature group is for your prediction task: the larger the drop, the more useful the group was. `FeatureUnion` is particularly helpful here, as it lets you quickly turn certain feature groups 'on' or 'off'.
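
A feature ablation test can be sketched as a loop that rebuilds the pipeline once with all feature groups and once per left-out group. The following is only a minimal sketch under stated assumptions: the toy sentences and labels are made up, and `make_feature_groups` and `run_ablation` are hypothetical helper names, not part of `sklearn`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import FeatureUnion, Pipeline

def make_feature_groups():
    """Return fresh (unfitted) transformers, one per feature group."""
    return {
        'words': CountVectorizer(),
        'chars': CountVectorizer(analyzer='char', ngram_range=(4, 5), binary=True),
    }

def run_ablation(X_train, y_train, X_test, y_test):
    """Train with all groups, then with each group left out; return accuracies."""
    results = {}
    group_names = list(make_feature_groups())
    for left_out in [None] + group_names:
        groups = make_feature_groups()
        # keep every feature group except the one we ablate in this round
        active = [(name, vec) for name, vec in groups.items() if name != left_out]
        pipeline = Pipeline([
            ('features', FeatureUnion(active)),
            ('classifier', LogisticRegression())])
        pipeline.fit(X_train, y_train)
        label = 'all features' if left_out is None else 'without ' + left_out
        results[label] = accuracy_score(y_test, pipeline.predict(X_test))
    return results

# toy data, made up for illustration only (1 = positive, 0 = negative)
train_texts = ["a great and fun movie", "wonderful acting , great fun",
               "a dull and boring mess", "boring plot , bad acting",
               "great cast , wonderful movie", "a bad and dull movie"]
train_labels = [1, 1, 0, 0, 1, 0]
test_texts = ["a fun and wonderful movie", "a boring mess"]
test_labels = [1, 0]

results = run_ablation(train_texts, train_labels, test_texts, test_labels)
for setting, accuracy in results.items():
    print(setting, accuracy)
```

On real data you would pass the `X_train`, `y_train`, `X_test`, `y_test` from above and compare each "without ..." score to the "all features" score.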

Another advantage of `FeatureUnion` is that it is easy to use if you are working in a team and team members contribute different features. However, if your aim is rather to understand which features are important, it might be more fruitful to go with option 2, using the `DictVectorizer`.

How can we add our own featurizer? Before we go into the details of the `DictVectorizer`, let's have a look at a simple example (inspired by [1]).

### Writing your own vectorizer

Suppose we want to add a feature that measures the length of a text (suppose for a minute that, say, tweet length is indicative of sentiment). To do so, we create our own featurizer, which is a subclass of `TransformerMixin`.

In [4]:
from sklearn.base import TransformerMixin

Our new class needs to implement both `fit` and `transform`, and it will extract features that are subsequently given to the `DictVectorizer`.

Regarding `fit`, we can leave the function essentially empty: it just returns `self`, as no new vocabulary needs to be created. The action happens in `transform`: given the input data `X` (a list of texts), we convert it to a representation where for every data instance we return a **dictionary** that holds our new feature and its feature value. In this simple case we represent each text by its length, so we add a feature 'length' whose value is the length of the text in characters.


In [5]:
class TextStats(TransformerMixin):
    """Extract features from each document for DictVectorizer"""

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        " extract length of each data instance "
        out= [{'length': len(text)}
                for text in X]
        return out

Let's use this new feature and add it to our pipeline. Note: after calling `TextStats` we need to call `DictVectorizer`, which takes care of converting the feature dictionaries into the format `sklearn` uses internally (a sparse feature representation). Let's import `DictVectorizer` (notice that it is no longer in the `text` package, but in the more general `feature_extraction` package!).

In [6]:
from sklearn.feature_extraction import DictVectorizer
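
To see what the two steps produce before wiring them into the pipeline, here is a small sketch on two made-up texts (`toy_texts` is an assumption for illustration). `TextStats` is repeated so that the snippet runs on its own.

```python
from sklearn.base import TransformerMixin
from sklearn.feature_extraction import DictVectorizer

class TextStats(TransformerMixin):
    """Same featurizer as above: one dictionary with a 'length' feature per text."""

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        return [{'length': len(text)} for text in X]

# toy texts, made up for illustration only
toy_texts = ["great", "a dull mess"]

# step 1: texts -> list of feature dictionaries
stats = TextStats().fit_transform(toy_texts)
print(stats)                      # [{'length': 5}, {'length': 11}]

# step 2: feature dictionaries -> sparse matrix with one column per feature name
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(stats)
print(vectorizer.feature_names_)  # ['length']
print(X.toarray())                # one row per text, one column for 'length'
```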


In [7]:

pipeline = Pipeline([
        ('features', FeatureUnion([
            ('words', CountVectorizer()),
            ('stats', Pipeline([
                ('selector', TextStats()),
                ('statsFeats', DictVectorizer()),
                            ])),
        ])),
        ('classifier', LogisticRegression())])

print("train model..", file=sys.stderr)
pipeline.fit(X_train, y_train)
##
print("predict..", file=sys.stderr)
y_predicted = pipeline.predict(X_test)
###
print("Accuracy:", accuracy_score(y_test, y_predicted), file=sys.stderr)


train model..
predict..
Accuracy: 0.765566391598


Adding the length feature doesn't help. This makes sense (would you really expect the length of a text to tell us much about sentiment?). However, this example shows how you could add further features of your own.

## POS tagging with Spacy

Assume we want to add POS tag information to our sentiment classifier. First, we need to tag our data.

Here we use a simple off-the-shelf tagger that is available for English, `spacy`. See: https://spacy.io/docs/usage/models

In [14]:
import spacy
nlp = spacy.load('en')

We now tag the sentiment example data.

In [15]:
positive_sentences = open("exercise/rt-polaritydata/rt-polarity.pos").readlines()
negative_sentences = open("exercise/rt-polaritydata/rt-polarity.neg").readlines()

In [16]:
def tag(tokens):
    doc = nlp(tokens)
    return [t.pos_ for t in doc]

In [17]:
tag("the rock is destined to be ...")

['DET', 'NOUN', 'VERB', 'VERB', 'PART', 'VERB', 'PUNCT']

Note: for speed reasons it might be more fruitful to store the POS-tagged text once, instead of tagging it from scratch for every experiment.
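
One way to store the tags is a small cache file. The following is a minimal sketch, not part of the original notebook: `load_or_tag`, the JSON format and the file name are assumptions, and `dummy_tag` is a stand-in for the spaCy-based `tag` function so that the snippet runs without a language model.

```python
import json
import os
import tempfile

def load_or_tag(sentences, cache_path, tag_fn):
    """Return the tags for `sentences`, reading them from `cache_path` if it exists."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    tagged = [tag_fn(sentence) for sentence in sentences]
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(tagged, f)
    return tagged

# stand-in tagger (hypothetical) that records how often it is called;
# in the notebook you would pass the spaCy-based `tag` function instead
calls = []
def dummy_tag(sentence):
    calls.append(sentence)
    return ["X" for _ in sentence.split()]

sentences = ["the rock is destined", "a dull movie"]
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "pos_tags.json")
    first = load_or_tag(sentences, cache_path, dummy_tag)   # tags and writes the cache
    second = load_or_tag(sentences, cache_path, dummy_tag)  # only reads the cache

print(first == second, len(calls))
```

Note that this sketch does not check whether the cached tags belong to the same sentences in the same order. If you shuffle or change the data, delete the cache file or tag before shuffling.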

## Adding different transformers


In [18]:
from sklearn.base import BaseEstimator,TransformerMixin
import spacy

class PosFeatures(TransformerMixin):
    """ using POS tags from Spacy """
    def __init__(self):
        # keep the loaded model on the instance so that _tag can use it
        self.nlp = spacy.load('en')

    def _tag(self, text):
        doc = self.nlp(text)
        return [t.pos_ for t in doc]

    def transform(self, X):
        # one list of POS tags per input text
        return [self._tag(text) for text in X]

    def fit(self, x, y=None):
        return self

In [148]:
class DataHandler(BaseEstimator, TransformerMixin):
    """Extract features from each document for DictVectorizer"""

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        data={}
        data['raw'] = X
        data['pos'] = [" ".join(tag(str(sentence))) for sentence in X]
        print(len(X))
        return data

In [149]:
class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> len(data[key]) == n_samples

    Please note that this is the opposite convention to scikit-learn feature
    matrixes (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[key]).  Examples include: a dict of lists, 2D numpy array, Pandas
    DataFrame, numpy record array, etc.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample.  (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]
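
The data flow through these helpers can be sketched as follows. The dictionary below mimics what `DataHandler.transform` returns, but the sentences and POS tags are written by hand for illustration (they are assumptions, not spaCy output). `ItemSelector` is repeated in shortened form so that the snippet runs on its own.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer

class ItemSelector(BaseEstimator, TransformerMixin):
    """Shortened copy of the class above: pick one entry of a dict of lists."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

# hand-written stand-in for the output of DataHandler.transform
data = {
    'raw': ["a good movie", "it is dull"],
    'pos': ["DET ADJ NOUN", "PRON VERB ADJ"],
}

# the selector hands only the POS strings to the next step ...
pos_column = ItemSelector(key='pos').transform(data)
print(pos_column)

# ... which treats every tag as a 'word' (CountVectorizer lowercases by default)
pos_vectorizer = CountVectorizer()
X_pos = pos_vectorizer.fit_transform(pos_column)
print(sorted(pos_vectorizer.vocabulary_))
print(X_pos.shape)
```

Each branch of the `FeatureUnion` thus starts from the same dictionary and picks out its own view of the data.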

In [150]:
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import DictVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
import numpy as np
import random
random.seed(113)

def load_sentiment_sentences_and_labels():
    """
    loads the movie review data
    """
    positive_sentences = open("exercise/rt-polaritydata/rt-polarity.pos").readlines()
    negative_sentences = open("exercise/rt-polaritydata/rt-polarity.neg").readlines()

    positive_labels = [1 for sentence in positive_sentences]
    negative_labels = [0 for sentence in negative_sentences]

    sentences = np.concatenate([positive_sentences,negative_sentences], axis=0)
    labels = np.concatenate([positive_labels,negative_labels],axis=0)

    ## make sure we have a label for every data instance
    assert(len(sentences)==len(labels))
    data = list(zip(sentences,labels))
    random.shuffle(data)
    ## return the data (instances + labels)
    return data

## read input data
print("load data..")
data = load_sentiment_sentences_and_labels()
print(len(data))

print("split data..")
split_point = int(0.75*len(data))

print("tag data..")
sentences = [sentence for sentence, label in data]
#sentences_tagged = [tag(str(sentence)) for sentence, _ in data]
labels = [label for sentence, label in data]
X_train, X_test = sentences[:split_point], sentences[split_point:]
#X_train_pos, X_test_pos = sentences_tagged[:split_point], sentences_tagged[split_point:]

y_train, y_test = labels[:split_point], labels[split_point:]

assert(len(X_train)==len(y_train))
assert(len(X_test)==len(y_test))

print("vectorize data..")
vectorizer = CountVectorizer()

pipeline = Pipeline([
        ('data',DataHandler()),
        ('features', FeatureUnion([
            ('bow', Pipeline([
                ('selector', ItemSelector(key='raw')),
                ('words', CountVectorizer()),
                            ])),
            ('pos', Pipeline([
                ('selector', ItemSelector(key='pos')),
                ('words', CountVectorizer())
                            ]))
        ])),
        ('classifier', LogisticRegression())])


print("train model..")
pipeline.fit(X_train, y_train)

print("predict..")
y_predicted = pipeline.predict(X_test)

###
print("Accuracy:", accuracy_score(y_test, y_predicted))


load data..
10662
split data..
tag data..
vectorize data..
train model..
7996
predict..
2666
Accuracy: 0.766316579145


The `FeatureUnion` here gets pretty involved, but it lets you nicely join heterogeneous feature spaces (although adding POS tags didn't help in our example).

However, in this setup we cannot obtain the feature names from the `FeatureUnion` (the nested `Pipeline` steps do not provide `get_feature_names`), so we cannot run the `show_most_informative_features` method:

In [151]:
def show_most_informative_features(vectorizer, clf, n=10):
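    # note: in scikit-learn >= 1.0 this method is called get_feature_names_out()
    # (get_feature_names() was removed in version 1.2)
    feature_names = vectorizer.get_feature_names()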
    for i in range(0,len(clf.coef_)):
        coefs_with_fns = sorted(zip(clf.coef_[i], feature_names))
        top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
        print("i",i)
        for (coef_1, fn_1), (coef_2, fn_2) in top:
            print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))


show_most_informative_features(pipeline.named_steps['features'], pipeline.named_steps['classifier'], n=20)

AttributeError: Transformer bow (type Pipeline) does not provide get_feature_names.
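
For a flat `FeatureUnion` whose members are all `CountVectorizer`s, the feature names can still be reconstructed by hand. The following is only a sketch under that assumption: `union_feature_names` is a hypothetical helper, the toy data is made up, and the approach does not cover nested pipelines such as the POS example above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion

# toy data, made up for illustration only (1 = positive, 0 = negative)
toy_docs = ["a great fun movie", "a dull boring movie", "great acting", "boring plot"]
toy_labels = [1, 0, 1, 0]

union = FeatureUnion([
    ('words', CountVectorizer()),
    ('chars', CountVectorizer(analyzer='char', ngram_range=(4, 5), binary=True)),
])
X = union.fit_transform(toy_docs)
clf = LogisticRegression().fit(X, toy_labels)

def union_feature_names(fitted_union):
    """Collect 'group:term' names in the column order of the union's output."""
    names = []
    for group, vec in fitted_union.transformer_list:
        vocab = vec.vocabulary_                  # term -> column index within the group
        ordered = sorted(vocab, key=vocab.get)   # terms in column order
        names.extend(group + ":" + term for term in ordered)
    return names

names = union_feature_names(union)
weights = sorted(zip(clf.coef_[0], names))
print(weights[:3])    # most negative weights
print(weights[-3:])   # most positive weights
```

This relies on the union stacking the blocks in the order of `transformer_list`, so the name list lines up with the columns of `clf.coef_`. It is more manual work than with a single vectorizer.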

## Option 2: Using the `DictVectorizer`

You can use the `DictVectorizer` directly instead of using `FeatureUnion`. In that case, you create a dictionary with its features for every instance, and then give the list of dictionaries to your `DictVectorizer`. This has the advantage that you can later inspect the features.
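
As a minimal sketch of this idea, the following snippet builds two feature dictionaries by hand (the feature names and values are made up for illustration) and shows how `DictVectorizer` maps every feature name to a column.

```python
from sklearn.feature_extraction import DictVectorizer

# one dictionary per instance; features from different "spaces" share one dict
instances = [
    {'W:good': 1, 'W:movie': 1, 'length': 10},
    {'W:bad': 1, 'W:movie': 1, 'length': 9},
]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(instances)

# feature names are sorted; features missing from a dictionary get the value 0
print(vectorizer.feature_names_)   # ['W:bad', 'W:good', 'W:movie', 'length']
print(X.toarray())
```

Because the column names are simply the dictionary keys, every weight of a classifier trained on `X` can be traced back to a readable feature name.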

Let's write our own featurizer. This way you have full control over what happens with the data.

### Writing your own Featurizer 


In [156]:
import nltk
from nltk.corpus import stopwords
from collections import defaultdict 
   
class Featurizer(TransformerMixin):
    """Our own featurizer: extract features from each document for DictVectorizer"""

    PREFIX_WORD_NGRAM="W:"
    PREFIX_CHAR_NGRAM="C:"
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, X):
        """
        here we could add more features!
        """
        out= [self._word_ngrams(text,ngram=self.word_ngrams)
                for text in X]
        return out

    def __init__(self,word_ngrams="1",binary=True,lowercase=False,remove_stopwords=False):
        """
        binary: whether to use 1/0 values or counts
        lowercase: convert text to lowercase
        remove_stopwords: True/False
        """
        self.DELIM=" "
        self.data = [] # will hold data (list of dictionaries, one for every instance)
        self.lowercase=lowercase
        self.binary=binary
        self.remove_stopwords = remove_stopwords
        self.stopwords = stopwords.words('english')
        self.word_ngrams=word_ngrams

        
    def _word_ngrams(self,text,ngram="1-2-3"):
        """
        extracts word n-grams

        >>> f=Featurizer()
        >>> d = f._word_ngrams("this is a test",ngram="1-3")
        >>> len(d)
        6
        """
        d={} #dictionary that holds features for current instance
        if self.lowercase:
            text = text.lower()
        words=text.split(self.DELIM)
        if self.remove_stopwords:
            words = [w for w in words if w not in self.stopwords]

        for n in ngram.split("-"):
            for gram in nltk.ngrams(words, int(n)):
                gram = self.PREFIX_WORD_NGRAM + "_".join(gram)
                if self.binary:
                    d[gram] = 1 #binary
                else:
                    # count occurrences; start from 0 if the n-gram is new
                    d[gram] = d.get(gram, 0) + 1
        return d
    
if __name__ == "__main__":
    import doctest
    doctest.testmod()
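
To see what kind of dictionary `_word_ngrams` builds, here is a dependency-free sketch of the same idea. `word_ngram_features` is a hypothetical stand-alone function that uses plain list slicing instead of `nltk.ngrams`; it is an illustration, not a replacement for the class above.

```python
def word_ngram_features(text, ngram="1-2", prefix="W:", binary=True):
    """Map a text to a dictionary of prefixed word n-gram features."""
    words = text.split(" ")
    features = {}
    # the string lists the n-gram sizes, so "1-3" means unigrams and trigrams
    for n in ngram.split("-"):
        n = int(n)
        # slide a window of size n over the words
        for start in range(len(words) - n + 1):
            gram = prefix + "_".join(words[start:start + n])
            if binary:
                features[gram] = 1
            else:
                features[gram] = features.get(gram, 0) + 1
    return features

binary_feats = word_ngram_features("this is a test", ngram="1-2")
count_feats = word_ngram_features("so so good", ngram="1", binary=False)
print(binary_feats)
print(count_feats)   # {'W:so': 2, 'W:good': 1}
```

The first call yields four unigram and three bigram features, each with value 1. The second call counts how often each word occurs.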

In [157]:
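random.seed(113)
# note: this uses the version of load_sentiment_sentences_and_labels from the
# first example, which returns four values (re-run that cell if you have
# executed the POS example in between)
X_train, y_train, X_test, y_test = load_sentiment_sentences_and_labels()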

print("vectorize data..", file=sys.stderr)
featurizer = Featurizer(word_ngrams="1-2")
vectorizer = DictVectorizer()

# first extract the features (as dictionaries)
X_train_dict = featurizer.fit_transform(X_train)
X_test_dict = featurizer.transform(X_test)

# then convert them to the internal representation (maps each feature to an id)
X_train = vectorizer.fit_transform(X_train_dict)
X_test = vectorizer.transform(X_test_dict)

classifier = LogisticRegression()

print("train model..", file=sys.stderr)
classifier.fit(X_train, y_train)
##
print("predict..", file=sys.stderr)
y_predicted = classifier.predict(X_test)
###
print("Accuracy:", accuracy_score(y_test, y_predicted), file=sys.stderr)


split data..
vectorize data..
train model..
predict..
Accuracy: 0.765191297824


In [158]:
show_most_informative_features(vectorizer, classifier, n=20)

i 0
	-1.6217	W:too          		1.2806	W:works        
	-1.5755	W:bad          		1.2107	W:rare         
	-1.5396	W:dull         		1.1733	W:entertaining 
	-1.2782	W:boring       		1.0934	W:beautiful    
	-1.2193	W:worst        		1.0891	W:engrossing   
	-1.1327	W:tedious      		1.0795	W:cinema       
	-1.1248	W:lacks        		1.0736	W:the_best     
	-1.0898	W:mess         		1.0552	W:funny        
	-1.0781	W:plodding     		1.0408	W:always       
	-1.0755	W:stupid       		1.0213	W:wonderful    
	-1.0569	W:no           		1.0171	W:fun          
	-1.0526	W:flat         		1.0157	W:culture      
	-1.0454	W:only         		0.9988	W:brilliant    
	-1.0204	W:video        		0.9711	W:solid        
	-1.0043	W:the_worst    		0.9650	W:beautifully  
	-1.0002	W:mediocre     		0.9564	W:powerful     
	-0.9967	W:neither      		0.9540	W:still        
	-0.9903	W:barely       		0.9437	W:delivers     
	-0.9826	W:pretentious  		0.9264	W:refreshing   
	-0.9805	W:tv           		0.9073	W:charming     
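
The two columns in this output come from the two slices in `show_most_informative_features`. The following sketch shows the slicing on made-up weights (the numbers and feature names are assumptions for illustration only).

```python
# made-up weights and feature names, for illustration only
coefs = [0.5, -1.2, 0.1, 1.3, -0.4]
names = ["W:fun", "W:bad", "W:the", "W:great", "W:dull"]
n = 2

# sort (weight, name) pairs from most negative to most positive weight
coefs_with_fns = sorted(zip(coefs, names))

most_negative = coefs_with_fns[:n]            # first n pairs
most_positive = coefs_with_fns[:-(n + 1):-1]  # last n pairs, in reverse order

print(most_negative)   # [(-1.2, 'W:bad'), (-0.4, 'W:dull')]
print(most_positive)   # [(1.3, 'W:great'), (0.5, 'W:fun')]
```

For a binary `LogisticRegression`, `coef_` has a single row, and positive weights push the prediction towards the second entry of `classes_` (here label 1, the positive reviews).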


**Exercises**

1. Extend the `Featurizer` to include character n-grams.
2. Extend the `Featurizer` with POS features.

## How do I know which class number corresponds to which target?

A very handy class that takes care of the mapping between class number and name is the `LabelEncoder`.

Here is an example from the documentation (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html):

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam", "paris"]) # give it your y_train labels!
list(le.classes_)


In [None]:
le.transform(["tokyo", "tokyo", "paris"]) 
example_class_nums = [2, 2, 1]
list(le.inverse_transform(example_class_nums))
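
Applied to a sentiment setting, the same idea could look as follows. This is a small sketch with made-up string labels (`string_labels` is an assumption; the data above already uses the numbers 0 and 1).

```python
from sklearn.preprocessing import LabelEncoder

# made-up string labels, for illustration only
string_labels = ["pos", "neg", "pos", "neg"]

le = LabelEncoder()
y = le.fit_transform(string_labels)

# classes are sorted, and the position in classes_ is the class number
print(list(le.classes_))                    # ['neg', 'pos']
print(list(y))                              # class numbers for the four labels
print(list(le.inverse_transform([0, 1])))   # ['neg', 'pos']
```

After training a classifier on `y`, you can pass its predictions through `le.inverse_transform` to get the label names back.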

## References

1. A detailed example on the 20-newsgroup dataset, from which parts of this tutorial are taken: [sklearn feature union](http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py)
2. http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html
3. https://michelleful.github.io/code-blog/2015/06/20/pipelines/
4. My own code for the recent IJCNLP 2017 shared task 4: https://github.com/bplank/ijcnlp2017-customer-feedback/blob/master/src/classifier.py (described in [this paper](http://www.let.rug.nl/~bplank/papers/ijcnlp2017_shared_task4_bplank.pdf))