In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Text Feature Extraction with Bag-of-Words
In many tasks, such as classic spam detection, your input data is text.
Free text of variable length is far from the fixed-length numeric representation that we need for machine learning with scikit-learn.
However, there is an easy and effective way to go from text data to a numeric representation that we can use with our models, called bag-of-words.

<img src="figures/bag_of_words.svg" width="100%">


Let's assume that each sample in your dataset is represented as one string, which could be just a sentence, an email, or a whole news article or book. To represent the sample, we first split the string into a list of tokens, which correspond to words. A simple way to do this is to split on whitespace and then lowercase each word.


Then, we build a vocabulary of all tokens (lowercased words) that appear in our whole dataset. This is usually a very large vocabulary.
Finally, looking at our single sample, we count how often each word in the vocabulary appears.
We represent our string by a vector, where each entry is how often a given word in the vocabulary appears in the string.

As each sample will only contain very few of the words, most entries will be zero, leading to a very high-dimensional but sparse representation.

The method is called bag-of-words as the order of the words is lost entirely.
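
The three steps described above (tokenize, build a vocabulary, count) can be sketched in plain Python. This is only an illustration under simplifying assumptions: it splits on whitespace and lowercases, so punctuation would stay attached to words, and the helper names `tokenize` and `bag_of_words` as well as the two toy sentences are made up for this sketch rather than taken from scikit-learn.

```python
from collections import Counter

def tokenize(text):
    # Naive tokenizer (assumption): lowercase, then split on whitespace.
    return text.lower().split()

def bag_of_words(samples):
    # Vocabulary: every distinct token in the dataset, in sorted order.
    vocabulary = sorted({token for text in samples for token in tokenize(text)})
    # One count vector per sample, with one entry per vocabulary word.
    vectors = []
    for text in samples:
        counts = Counter(tokenize(text))
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Toy data, invented for illustration only.
samples = ["the cat sat on the mat", "The dog sat"]
vocabulary, vectors = bag_of_words(samples)
print(vocabulary)
print(vectors)
```

Each row has one entry per vocabulary word, so every sample ends up with the same length no matter how long the original string was. Word order plays no role in the result.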

In [18]:
X = ["Some say the world will end in fire,",
     "Some say in ice."]

In [19]:
len(X)

2

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(X)


CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [21]:
vectorizer.vocabulary_

{'end': 0,
 'fire': 1,
 'ice': 2,
 'in': 3,
 'say': 4,
 'some': 5,
 'the': 6,
 'will': 7,
 'world': 8}

In [22]:
X_bag_of_words = vectorizer.transform(X)

In [23]:
X_bag_of_words.shape

(2, 9)

In [24]:
X_bag_of_words

<2x9 sparse matrix of type '<class 'numpy.int64'>'
	with 12 stored elements in Compressed Sparse Row format>

In [25]:
X_bag_of_words.toarray()

array([[1, 1, 0, 1, 1, 1, 1, 1, 1],
       [0, 0, 1, 1, 1, 1, 0, 0, 0]])

In [26]:
vectorizer.get_feature_names()

['end', 'fire', 'ice', 'in', 'say', 'some', 'the', 'will', 'world']

In [27]:
vectorizer.inverse_transform(X_bag_of_words)

[array(['end', 'fire', 'in', 'say', 'some', 'the', 'will', 'world'], 
       dtype='<U5'), array(['ice', 'in', 'say', 'some'], 
       dtype='<U5')]

# tf-idf weighting
A useful transformation that is often applied to the bag-of-words encoding is the so-called term frequency-inverse document frequency (tf-idf) scaling, which is a non-linear transformation of the word counts.

The tf-idf method gives less weight to words that appear in many documents:

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [29]:
import numpy as np
np.set_printoptions(precision=2)

print(tfidf_vectorizer.transform(X).toarray())

[[ 0.39  0.39  0.    0.28  0.28  0.28  0.39  0.39  0.39]
 [ 0.    0.    0.63  0.45  0.45  0.45  0.    0.    0.  ]]
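
To see where these numbers come from, the weights can be recomputed by hand from the raw counts. The sketch below assumes the settings shown in the `TfidfVectorizer` repr above (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`) and uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term. It is a plain-Python illustration, not the library's implementation.

```python
import math

# Raw counts copied from the CountVectorizer output shown earlier.
counts = [[1, 1, 0, 1, 1, 1, 1, 1, 1],
          [0, 0, 1, 1, 1, 1, 0, 0, 0]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term occurs.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf (assumed to match smooth_idf=True): ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    # Scale each document vector to unit Euclidean length (norm='l2').
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 2) for v in row])
```

Words that occur in both documents ("in", "say", "some") get an idf of exactly 1, while words that occur in only one document get a larger idf, which is why they end up with the larger entries in each row.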


# Bigrams and N-Grams
Entirely discarding word order is not always a good idea, as composite phrases often have specific meaning, and modifiers like "not" can invert the meaning of words.
A simple way to include some word order is to use n-grams, which look not only at single tokens but also at sequences of neighboring tokens. Bigrams, for example, are all pairs of neighboring tokens:

In [30]:
# look at sequences of tokens of minimum length 2 and maximum length 2
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit(X)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(2, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [31]:
bigram_vectorizer.get_feature_names()

['end in',
 'in fire',
 'in ice',
 'say in',
 'say the',
 'some say',
 'the world',
 'will end',
 'world will']

In [32]:
bigram_vectorizer.transform(X).toarray()

array([[1, 1, 0, 0, 1, 1, 1, 1, 1],
       [0, 0, 1, 1, 0, 1, 0, 0, 0]])
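
Under the hood, word n-grams amount to sliding a window over the token list. The following is a minimal sketch, assuming tokens are extracted with the `token_pattern` shown in the repr above (two or more word characters, after lowercasing). The function name `word_ngrams` is hypothetical and not part of scikit-learn.

```python
import re

def word_ngrams(text, n_min, n_max):
    # Tokens of two or more word characters, lowercased, mirroring the
    # token_pattern shown in the CountVectorizer repr.
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens over the token list.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

X = ["Some say the world will end in fire,",
     "Some say in ice."]

# Collect the distinct bigrams over both samples, in sorted order.
bigram_vocabulary = sorted({g for text in X for g in word_ngrams(text, 2, 2)})
print(bigram_vocabulary)
```

For these two lines the sketch yields the same nine bigrams as the feature names listed above. Calling `word_ngrams(text, 1, 2)` returns single tokens and bigrams together.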

Often we want to include both unigrams (single tokens) and bigrams:

In [33]:
gram_vectorizer = CountVectorizer(ngram_range=(1, 2))
gram_vectorizer.fit(X)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [34]:
gram_vectorizer.get_feature_names()

['end',
 'end in',
 'fire',
 'ice',
 'in',
 'in fire',
 'in ice',
 'say',
 'say in',
 'say the',
 'some',
 'some say',
 'the',
 'the world',
 'will',
 'will end',
 'world',
 'world will']

In [35]:
gram_vectorizer.transform(X).toarray()

array([[1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0]])

# Character n-grams
Sometimes it is also helpful to look not at words but at single characters.
That is particularly useful if you have very noisy data, want to identify the language, or want to predict something about a single word.
We can simply look at characters instead of words by setting ``analyzer="char"``.
Looking at single characters is usually not very informative, but looking at longer n-grams of characters can be:

In [36]:
char_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char")
char_vectorizer.fit(X)

CountVectorizer(analyzer='char', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(2, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [37]:
print(char_vectorizer.get_feature_names())

[' e', ' f', ' i', ' s', ' t', ' w', 'ay', 'ce', 'd ', 'e ', 'e,', 'e.', 'en', 'fi', 'he', 'ic', 'il', 'in', 'ir', 'l ', 'ld', 'll', 'me', 'n ', 'nd', 'om', 'or', 're', 'rl', 'sa', 'so', 'th', 'wi', 'wo', 'y ']
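
Character n-grams follow the same sliding-window idea, applied to the characters of the string instead of its tokens. The sketch below assumes the text is lowercased and that runs of whitespace are collapsed to a single space before the window is applied. The helper `char_ngrams` is a made-up name for illustration.

```python
import re

def char_ngrams(text, n):
    # Lowercase and collapse runs of whitespace to a single space
    # (an assumption about the preprocessing, made for this sketch).
    text = re.sub(r"\s+", " ", text.lower())
    # Slide a window of n characters over the string.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

X = ["Some say the world will end in fire,",
     "Some say in ice."]

# Collect the distinct character bigrams over both samples, in sorted order.
char_vocabulary = sorted({g for text in X for g in char_ngrams(text, 2)})
print(len(char_vocabulary))
print(char_vocabulary[:6])
```

Note that spaces and punctuation are part of the n-grams, which is why features such as `' e'` and `'e,'` show up in the list above.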


<img src="figures/supervised_scikit_learn.png" width="100%">

# Let's do it for real now!

In [41]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [11]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             categories=('rec.sport.hockey', 'rec.sport.baseball'),
                             remove=('headers', 'footers', 'quotes'))

In [12]:
docs, y = dataset['data'], dataset['target']

In [53]:
print(docs[0])

if
team! 
Yeah but Soderstrom's mask has always appeared to be a lot bigger than the  
average helmet-and-cage variety.  It has a certain appeal on its own

josh



In [54]:
print(y[0])

1


In [51]:
dataset.target_names  # 0 = baseball, 1 = hockey

['rec.sport.baseball', 'rec.sport.hockey']

### Leave out a validation set

In [13]:
docs_train, docs_val, y_train, y_val = train_test_split(docs, y, test_size=0.25, random_state=0)

In [39]:
vect = CountVectorizer()

In [40]:
X_train = vect.fit_transform(docs_train)

In [42]:
clf = LogisticRegression().fit(X_train, y_train)

In [44]:
print("Training set accuracy: {:.2f}".format(clf.score(X_train, y_train)))
X_val = vect.transform(docs_val)
print("Validation accuracy: {:.2f}".format(clf.score(X_val, y_val)))

Training set accuracy: 0.98
Validation accuracy: 0.87


### What cases are we getting wrong?

In [45]:
y_proba = clf.predict_proba(X_val)

In [52]:
y_proba

array([[  7.53e-05,   1.00e+00],
       [  9.95e-01,   5.04e-03],
       [  9.39e-01,   6.07e-02],
       [  1.36e-03,   9.99e-01],
       [  8.33e-01,   1.67e-01],
       [  4.20e-04,   1.00e+00],
       [  9.90e-06,   1.00e+00],
       [  9.33e-01,   6.72e-02],
       [  4.09e-01,   5.91e-01],
       [  8.43e-04,   9.99e-01],
       [  9.96e-01,   4.49e-03],
       [  3.99e-01,   6.01e-01],
       [  1.46e-04,   1.00e+00],
       [  3.03e-01,   6.97e-01],
       [  9.42e-01,   5.76e-02],
       [  2.56e-03,   9.97e-01],
       [  5.86e-01,   4.14e-01],
       [  1.01e-01,   8.99e-01],
       [  9.98e-01,   2.43e-03],
       [  4.73e-01,   5.27e-01],
       [  1.96e-01,   8.04e-01],
       [  8.73e-01,   1.27e-01],
       [  1.16e-05,   1.00e+00],
       [  3.13e-01,   6.87e-01],
       [  4.77e-02,   9.52e-01],
       [  9.81e-04,   9.99e-01],
       [  7.93e-01,   2.07e-01],
       [  1.72e-10,   1.00e+00],
       [  9.56e-01,   4.36e-02],
       [  1.21e-02,   9.88e-01],
       [  

In [66]:
confidence_in_hockey = y_proba[y_val == 1, 1]
confidence_in_baseball = y_proba[y_val == 0, 0]

docs_about_hockey = np.array(docs_val)[y_val == 1]
for ix in np.argsort(confidence_in_hockey)[:3]:
    print(confidence_in_hockey[ix])
    print(docs_about_hockey[ix])
    print()

0.0791647833328

In case anyone missed it, I'm reposting this and I'm also selling some other
stuff.




I've sold one, but I still have 2 left for sale. I also realize that $45 is
alot of money, especially if you don't normally collect cards. So if enough
people are interested, I'll break up the set into team sets. I'm not sure
how much for each. It would be nice to just sell them for $3 each, but then
the people who get the Whalers and Devils (Note, I'm not bagging on these teams
its just that they don't have alot of good rookie cards in this set) would
be subsidizing the people who want Chicago or Pittsburgh. So I'll have to make
it varialble pricing. But most of them should be about $2 or $3 dollars.






Ok someone asked for this one, but he's from Canada, if he can get me the
be the alternate.

Also I would like to sell 2 Upperdeck Pavel Bure rookie cards (note these
are not in the UD low #'s set mentioned above). $16 each. They are $15 in
the book, but the $1 goes for postage, 

In [65]:
docs_about_baseball = np.array(docs_val)[y_val == 0]
for ix in np.argsort(confidence_in_baseball)[:3]:
    print(confidence_in_baseball[ix])
    print(docs_about_baseball[ix])
    print()

0.112150564452


What makes you think Buck will still be in New York at year's end with
George back?  :-)

--
    Keith Keller				LET'S GO RANGERS!!!!!
						LET'S GO QUAKERS!!!!!
	kkeller@mail.sas.upenn.edu		IVY LEAGUE CHAMPS!!!!

0.139498219786
Here's an easy question for someone who knows nothing about baseball...

   What city do the California Angels play out of?



-- 
Richard J. Rauser        "You have no idea what you're doing."
rauser@sfu.ca            "Oh, don't worry about that. We're professional
WNI                          outlaws - we do this for a living."

0.168970202287
Hello, I'm doing a paper on censorship in music and I would appreciate it if you took the time to participate in this survey.  Please answer as each question asks ('why?' simply means that you have room to explain your answer, if you chose.).  The last question is for any comments, questions, or suggestions.  Thank you in advance, please E-mail to the address at the end.

I)  are you [male/female]
II) 

# Discuss