In [1]:
%run convention.ipynb



# Types of Data Represented as Strings

<ul>
<li>Categorical data</li>
<li>Free strings that can be semantically mapped to categories</li>
<li>Structured string data</li>
<li>Text data</li>
</ul>

# Sentiment Analysis

## Representing Text Data as a Bag of Words

<p class = 'note'> Only count how often each word appears in each text in the corpus.</p>

Computing the bag-of-words representation for a corpus of documents consists of
the following three steps:
<ol>
<li><i>Tokenization</i>. Split each document into the words that appear in it (called tokens),
for example by splitting them on whitespace and punctuation.
</li>
<li>
<i>Vocabulary building</i>. Collect a vocabulary of all words that appear in any of the
documents, and number them (say, in alphabetical order).
</li>
<li>
<i>Encoding</i>. For each document, count how often each of the words in the vocabulary
appears in this document.
</li>
</ol>
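
The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how <code>CountVectorizer</code> is implemented: the helper names <code>tokenize</code>, <code>build_vocabulary</code>, and <code>encode</code> are hypothetical, and the regular expression simply keeps runs of two or more word characters.

```python
import re
from collections import Counter

def tokenize(document):
    # Step 1: lowercase, then keep runs of at least two word characters.
    return re.findall(r"\b\w\w+\b", document.lower())

def build_vocabulary(documents):
    # Step 2: collect every distinct token and number them alphabetically.
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def encode(document, vocabulary):
    # Step 3: count how often each vocabulary word appears in the document.
    counts = Counter(tokenize(document))
    return [counts.get(tok, 0) for tok in vocabulary]

docs = ["the cat sat", "the cat saw the dog"]
vocab = build_vocabulary(docs)
bag_of_words = [encode(doc, vocab) for doc in docs]
print(vocab)
print(bag_of_words)
```

Each row of <code>bag_of_words</code> has one entry per vocabulary word, so word order within a document is lost by construction.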

## Applying Bag-of-Words to a Toy Dataset

In [2]:
lyric = ['An empty street', 'An empty house']

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
vec.fit(lyric)
bow = vec.transform(lyric)
bow

<2x4 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [11]:
vec.get_feature_names()

['an', 'empty', 'house', 'street']

In [15]:
df = pd.DataFrame(bow.toarray(), columns = vec.get_feature_names())
df

   an  empty  house  street
0   1      1      0       1
1   1      1      1       0


## Example

In [17]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
groups = fetch_20newsgroups()
X, y = CountVectorizer().fit_transform(groups.data), groups.target 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 6)
clf = MultinomialNB()
clf.fit(X_train, y_train)
print('Train score: %.2f' % clf.score(X_train, y_train))
print('Test score: %.2f' % clf.score(X_test, y_test))

Train score: 0.92
Test score: 0.83


<code>MultinomialNB</code> has a smoothing parameter <code>alpha</code>, which acts as a regularizer and can be tuned via cross-validation:

In [20]:
from sklearn.model_selection import GridSearchCV
grid_params = {'alpha' : [.001, .01, .1, 1, 10, 100]}
grid = GridSearchCV(MultinomialNB(), grid_params, cv = 5)
grid.fit(X_train, y_train)
'Best validation score: %.2f' % grid.best_score_

'Best validation score: 0.87'

In [21]:
'Test score: %.2f' % grid.score(X_test, y_test)

'Test score: 0.87'

In [22]:
grid.best_params_

{'alpha': 0.001}
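
To see what <code>alpha</code> controls, recall that multinomial naive Bayes estimates the probability of a word given a class with additive smoothing: (count + alpha) / (total + alpha * vocabulary size). The sketch below is a hand-rolled illustration of that formula only; the function name <code>smoothed_word_probs</code> and the tiny token list are hypothetical.

```python
from collections import Counter

def smoothed_word_probs(class_tokens, vocabulary, alpha):
    # Estimate P(word | class) as (count + alpha) / (total + alpha * |V|).
    counts = Counter(class_tokens)
    total = sum(counts[w] for w in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

vocabulary = ["ball", "game", "vote"]
sports_tokens = ["ball", "game", "game", "ball", "ball"]  # "vote" never occurs

weak = smoothed_word_probs(sports_tokens, vocabulary, alpha=0.001)
strong = smoothed_word_probs(sports_tokens, vocabulary, alpha=10)
print(weak["vote"], strong["vote"])
```

A small <code>alpha</code> keeps the estimates close to the raw counts, while a large <code>alpha</code> pulls every word toward the same probability. The small best value found above suggests that this dataset needs little smoothing.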

## Feature selection

<p class = 'note'>A token that appears only in a single document is unlikely to appear in the test
set and is therefore not helpful. We can set the minimum number of documents a
token needs to appear in with the <code>min_df</code> parameter:</p>

In [23]:
vec = CountVectorizer(min_df = 2) #select tokens that appear in at least 2 documents
corpus = [' I Love You',
          ' You are my heart, you are my soul',
          'Love the way you lie']
vec.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=2,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [25]:
df = pd.DataFrame(vec.transform(corpus).toarray(), columns = vec.get_feature_names())
df

   love  you
0     1    1
1     0    2
2     1    1


## Stop words

<p class = 'note'>Another way to get rid of uninformative words is to discard words that
are too frequent to be informative.</p>
<p>There are two strategies:</p>
<ul>
    <li>using a language-specific list of stop words</li>
    <li>discarding words that appear too frequently</li>
</ul>

<p class = 'highlight'>scikit-learn has a built-in list of English stop words in the <code>feature_extraction.text</code>
module:</p>


In [26]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
ENGLISH_STOP_WORDS

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

<p class = 'note'>Removing the stop words in the list can decrease the number of features by at most
the length of the list, but it might still lead to an improvement in performance.</p>

Let's try removing stop words to see whether it improves our model's performance:

In [28]:
from sklearn.model_selection import cross_val_score
vec = CountVectorizer(min_df = 5, stop_words='english')
X, y = vec.fit_transform(groups.data), groups.target
clf = MultinomialNB(alpha = .001)
scores = cross_val_score(clf, X, y, cv = 10)
scores

array([0.87982456, 0.87609842, 0.86906854, 0.8943662 , 0.88162544,
       0.8591674 , 0.88120567, 0.87833037, 0.88434164, 0.88512912])

In [29]:
scores.mean()

0.8789157366848009

<p class = 'note'>To discard words that appear too frequently, set the <code>max_df</code> parameter:</p>

In [30]:
corpus

[' I Love You', ' You are my heart, you are my soul', 'Love the way you lie']

In [33]:
vec = CountVectorizer(max_df = 2) # discard tokens that appear in more than 2 documents
vec.fit(corpus)
vec.get_feature_names()

['are', 'heart', 'lie', 'love', 'my', 'soul', 'the', 'way']
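
Both <code>min_df</code> and <code>max_df</code> act on the document frequency of a token, that is, the number of documents that contain it, not the total number of occurrences. The following plain-Python sketch reproduces the two filters for integer thresholds; <code>document_frequency</code> and <code>filter_vocabulary</code> are hypothetical helpers, and the regular expression is assumed to mimic the default token pattern shown in the <code>CountVectorizer</code> output above.

```python
import re
from collections import Counter

def document_frequency(documents):
    # Count, for each token, the number of documents that contain it.
    df = Counter()
    for doc in documents:
        df.update(set(re.findall(r"\b\w\w+\b", doc.lower())))
    return df

def filter_vocabulary(documents, min_df=1, max_df=None):
    # Keep tokens whose document frequency lies in [min_df, max_df].
    df = document_frequency(documents)
    upper = len(documents) if max_df is None else max_df
    return sorted(tok for tok, n in df.items() if min_df <= n <= upper)

songs = [" I Love You",
         " You are my heart, you are my soul",
         "Love the way you lie"]
at_least_two = filter_vocabulary(songs, min_df=2)
at_most_two = filter_vocabulary(songs, max_df=2)
print(at_least_two)
print(at_most_two)
```

Here "you" occurs in all three documents, so it is the only token removed by <code>max_df = 2</code>, while "love" (two documents) survives both filters.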

## Rescaling data with tf-idf

<p class = 'note'>
Instead of dropping features that are deemed unimportant, another approach is to
rescale features by how informative we expect them to be. One of the most common
ways to do this is the <b>term frequency–inverse document frequency (tf–idf)</b>
method. The intuition is to give high weight to any term that appears often in a
particular document but not in many documents in the corpus. Such a term is likely
to be very descriptive of the content of that document. scikit-learn implements the
tf–idf method in two classes: <code>TfidfTransformer</code>, which takes in the sparse matrix
output produced by <code>CountVectorizer</code> and transforms it, and <code>TfidfVectorizer</code>,
which takes in the text data and does both the bag-of-words feature extraction and
the tf–idf transformation.
</p>
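
The weighting can be worked through by hand. The sketch below assumes the smoothed variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document row, which matches the <code>smooth_idf=True</code> and <code>norm='l2'</code> defaults as I understand them; the function <code>tfidf_rows</code> is hypothetical and is meant as an illustration, not a replacement for <code>TfidfVectorizer</code>.

```python
import math
import re
from collections import Counter

def tfidf_rows(documents):
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in documents]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = [tf[t] * idf[t] for t in vocab]
        # Scale each document vector to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf_rows(["the cat sat", "the dog sat", "the dog barked loudly"])
for row in rows:
    print({t: round(w, 2) for t, w in zip(vocab, row) if w > 0})
```

In the first document, "cat" receives a higher weight than "the" even though both occur once, because "cat" appears in only one document of the corpus.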

<p class = 'warning'>
Because tf–idf makes use of the statistical properties of the training data, we
will use a pipeline so that the vectorizer is refit on each training fold and the
results of our grid search are valid.
</p>


In [35]:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
grid_params = {'multinomialnb__alpha' : [.0001, .001, .01, .1, 1, 10, 100]}
grid = GridSearchCV(pipe, grid_params, cv = 5)
grid.fit(groups.data, groups.target)
'Best cross-validation score: %.2f' % grid.best_score_

'Best cross-validation score: 0.91'

<p class = 'highlight'>As you can see, there is some improvement when using tf–idf instead of just word
counts. </p>

We can also inspect which words tf–idf found most important: 

In [37]:
# get the fitted TfidfVectorizer step from the best pipeline
vec = grid.best_estimator_.named_steps['tfidfvectorizer']
vec

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [43]:
corpus_vector = vec.transform(groups.data)
importance = corpus_vector.max(axis = 0).toarray().ravel()
arg = np.argsort(importance)

order = np.array(vec.get_feature_names())[arg]

In [46]:
'features with highest tfidf: {}'.format(order[-10:])

"features with highest tfidf: ['kk' 'db' 'scsi' 'blah' 'donoghue' '00' '___' '25' 'forged' 'ax']"

In [47]:
'features with lowest tfidf: {}'.format(order[:10])

"features with lowest tfidf: ['giz1z4' 'newwhjnux' 'newwhj' 'ne1wj' 'ne1whj' 'biz1pmf' 'mzwt' '9j6e1t'\n '9tad9' '9tagm']"

## Bag-of-words with more than one word

One of the main disadvantages of using a bag-of-words representation is that word
order is completely discarded. Therefore, the two strings “it’s bad, not good at all” and
“it’s good, not bad at all” have exactly the same representation, even though the
meanings are inverted. Putting “not” in front of a word is only one example (if an extreme
one) of how context matters. Fortunately, there is a way of capturing context when
using a bag-of-words representation: consider not only the counts of single
tokens, but also the counts of pairs or triplets of tokens that appear next to each other.
Pairs of tokens are known as bigrams, triplets of tokens are known as trigrams, and
more generally sequences of tokens are known as n-grams. We can change the range
of tokens that are considered as features by changing the <code>ngram_range</code> parameter of
<code>CountVectorizer</code> or <code>TfidfVectorizer</code>.
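
A small plain-Python check makes the point concrete: the two sentences have identical unigram counts but different bigram counts. The helper <code>ngrams</code> is hypothetical, and its regular expression is an assumption chosen to keep tokens of two or more word characters.

```python
import re
from collections import Counter

def ngrams(text, n):
    # Tokenize, then join every run of n consecutive tokens.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

first = "It's bad, not good at all"
second = "It's good, not bad at all"

same_unigrams = Counter(ngrams(first, 1)) == Counter(ngrams(second, 1))
same_bigrams = Counter(ngrams(first, 2)) == Counter(ngrams(second, 2))
print(same_unigrams, same_bigrams)
print(ngrams(first, 2))
print(ngrams(second, 2))
```

Only the bigram representation contains features such as "not good" and "not bad" that can tell the two sentences apart.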

In [50]:
corpus = ["It's good, not bad at all!", "It's bad, not good at all"]
vec = TfidfVectorizer(ngram_range = (2,2)) #select 2 consecutive words at a time
vec.fit(corpus)
vec.get_feature_names()

['at all',
 'bad at',
 'bad not',
 'good at',
 'good not',
 'it bad',
 'it good',
 'not bad',
 'not good']

In [51]:
vec = TfidfVectorizer(ngram_range = (1,2)) #select 1 word at a time, then 2 words at a time
vec.fit(corpus)
vec.get_feature_names()

['all',
 'at',
 'at all',
 'bad',
 'bad at',
 'bad not',
 'good',
 'good at',
 'good not',
 'it',
 'it bad',
 'it good',
 'not',
 'not bad',
 'not good']

<span class = 'warning'>NOTE:</span>
<p class = 'highlight'>
For most applications, the minimum number of tokens should be one, as single
words often capture a lot of meaning. Adding bigrams helps in most cases. Adding
longer sequences (up to 5-grams) might help too, but this will lead to an explosion
of the number of features and might lead to overfitting, as there will be many very
specific features. In principle, the number of bigrams could be the number of
unigrams squared and the number of trigrams could be the number of unigrams to
the power of three, leading to very large feature spaces. In practice, the number of
higher n-grams that actually appear in the data is much smaller, because of the
structure of the (English) language, though it is still large.
</p>
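
The gap between the theoretical and the observed number of n-grams can be checked on a tiny, made-up corpus. The helper <code>distinct_ngrams</code> and the two sentences are hypothetical; on real data the numbers are far larger, but the pattern is the same.

```python
import re

def distinct_ngrams(documents, n):
    # Collect every distinct run of n consecutive tokens in the corpus.
    seen = set()
    for doc in documents:
        tokens = re.findall(r"\b\w\w+\b", doc.lower())
        seen.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return seen

docs = ["the quick brown fox jumps over the lazy dog",
        "the lazy dog sleeps while the quick fox runs"]

vocab_size = len(distinct_ngrams(docs, 1))
for n in (1, 2, 3):
    observed = len(distinct_ngrams(docs, n))
    print(n, "possible:", vocab_size ** n, "observed:", observed)
```

The upper bound grows exponentially with n, while the observed count is limited by the number of token positions in the corpus.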

### Example:

In [55]:
# test whether adding bigrams and trigrams improves our model's performance
vec = TfidfVectorizer(min_df = 5, stop_words = 'english')
pipe = make_pipeline(vec, MultinomialNB())
grid_params = {
    'tfidfvectorizer__ngram_range' : [(1,1), (1,2), (1,3)],
    'multinomialnb__alpha' : [.001, .01, .1, 1, 10, 100]
}
grid = GridSearchCV(pipe, grid_params, cv = 5)
X_train, X_test, y_train, y_test = train_test_split(groups.data, groups.target, random_state = 6)
grid.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('tfidfvectorizer',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=5,
                                                        ngram_range=(1, 1),
                               

In [56]:
grid.best_params_

{'multinomialnb__alpha': 0.1, 'tfidfvectorizer__ngram_range': (1, 1)}

In [57]:
'Best cross-validation-score: %.2f' % grid.best_score_

'Best cross-validation-score: 0.89'

## Advanced Tokenization, Stemming, and Lemmatization

As mentioned previously, the feature extraction in the <code>CountVectorizer</code> and
<code>TfidfVectorizer</code> is relatively simple, and much more elaborate methods are possible. One
particular step that is often improved in more sophisticated text-processing
applications is the first step in the bag-of-words model: tokenization. This step defines what
constitutes a word for the purpose of feature extraction.

We saw earlier that the vocabulary often contains singular and plural versions of
some words, as in "drawback" and "drawbacks", "drawer" and "drawers", and
"drawing" and "drawings". For the purposes of a bag-of-words model, the semantics
of "drawback" and "drawbacks" are so close that distinguishing them will only
increase overfitting, and not allow the model to fully exploit the training data.
Similarly, we found the vocabulary includes words like "replace", "replaced",
"replacement", "replaces", and "replacing", which are different verb forms and a noun
relating to the verb “to replace.” As with singular and plural forms of a
noun, treating different verb forms and related words as distinct tokens is
disadvantageous for building a model that generalizes well.

This problem can be overcome by representing each word using its word stem, which
involves identifying (or conflating) all the words that have the same word stem. If this
is done by using a rule-based heuristic, like dropping common suffixes, it is usually
referred to as stemming. If instead a dictionary of known word forms is used (an
explicit and human-verified system), and the role of the word in the sentence is taken
into account, the process is referred to as lemmatization and the standardized form of
the word is referred to as the lemma. Both processing methods, lemmatization and
stemming, are forms of normalization that try to extract some normal form of a
word. Another interesting case of normalization is spelling correction, which can be
helpful in practice but is outside of the scope of this book.
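
To make the idea of rule-based stemming concrete, here is a deliberately crude suffix stripper. It is a toy written for this example, not the Porter stemmer or any library implementation; the suffix list and the name <code>toy_stem</code> are assumptions chosen so that the word families mentioned above collapse.

```python
SUFFIXES = ("ment", "ing", "ed", "es", "s")

def toy_stem(word):
    stem = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            # Strip a known suffix, but keep at least three characters.
            if stem.endswith(suffix) and len(stem) - len(suffix) >= 3:
                stem = stem[: -len(suffix)]
                changed = True
                break
    # Drop a trailing "e" so that "replace" and "replacing" agree.
    if stem.endswith("e") and len(stem) > 3:
        stem = stem[:-1]
    return stem

replace_family = ["replace", "replaced", "replacement", "replaces", "replacing"]
stems = {w: toy_stem(w) for w in replace_family + ["drawback", "drawbacks"]}
print(stems)
```

The heuristic has no notion of meaning: it maps "drawing" to "draw" but leaves "drawer" untouched, which is the kind of inconsistency that dictionary-based lemmatization is meant to avoid.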

# Topic modeling and Document Clustering