# Working with Text Data


Follow _Introduction to Machine Learning with Python_ [Chapter 7](https://github.com/amueller/introduction_to_ml_with_python/blob/master/07-working-with-text-data.ipynb):
-  Chapter 7. Working with text data - 7.7 Bag of Words with more than one word (p.329-350)


We will look at supervised classification of text, in particular sentiment analysis: given a text commenting on a movie, is it a *positive* or a *negative* comment?

Similar problems include spam detection: given the text of an email, is it a *spam* or a *legitimate* message?

Our classifiers will only accept numerical features as input. Therefore, we have to **transform** text to a **numerical representation** first.

We will start by looking at pre-processing techniques using toy data in this lecture.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import mglearn

## Bag of words
Computing the bag-of-words representation for a **corpus** of **documents** consists of the following three steps:

1. Tokenization. Split each document into the words that appear in it (called tokens), for example by splitting them on whitespace and punctuation.

2. Vocabulary building. Collect a vocabulary of all words that appear in any of the documents, and number them (say, in alphabetical order).

3. Encoding. For each document, count how often each of the words in the vocabulary appears in this document.
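
The three steps above can be sketched by hand in plain Python. This is only an illustration, not how scikit-learn implements them: the helper names `tokenize` and `encode` are made up for this example, and the regular expression is assumed to mirror the default behaviour of `CountVectorizer` (lowercase everything, keep runs of two or more word characters).

```python
import re
from collections import Counter

corpus = ["The fool doth think he is wise,",
          "but the wise man knows himself to be a fool"]

# 1. Tokenization: lowercase, then keep runs of two or more word characters.
#    (hypothetical helper; single-character words such as "a" are dropped)
def tokenize(document):
    return re.findall(r"\b\w\w+\b", document.lower())

tokenized = [tokenize(document) for document in corpus]

# 2. Vocabulary building: collect every distinct token and number the
#    tokens in alphabetical order.
all_tokens = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {token: index for index, token in enumerate(all_tokens)}

# 3. Encoding: one count per vocabulary entry for each document.
def encode(tokens):
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]

encoded = [encode(tokens) for tokens in tokenized]

print("Vocabulary size:", len(vocabulary))
for row in encoded:
    print(row)
```

Each row of `encoded` has one entry per vocabulary word, so every document gets a vector of the same length regardless of how many words it contains.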

In [None]:
bards_words = ["The fool doth think he is wise,",
               "but the wise man knows himself to be a fool"]

`bards_words` is a corpus.

`"The fool doth think he is wise,"` is a document.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(bards_words)

In [None]:
print("Vocabulary size: {}".format(len(vect.vocabulary_)))
print("Vocabulary content:\n {}".format(vect.vocabulary_))

In [None]:
bag_of_words = vect.transform(bards_words)
print("bag_of_words: {}".format(repr(bag_of_words)))

In [None]:
print("Dense representation of bag_of_words:\n{}".format(
    bag_of_words.toarray()))

In [None]:
vect.get_feature_names_out()

In [None]:
pd.DataFrame(bag_of_words.toarray(), columns=vect.get_feature_names_out())

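Let's look at the tokenization process. `build_analyzer` returns the function that the vectorizer applies to each document (preprocessing, tokenization, and n-gram generation).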

In [None]:
tokenizer = vect.build_analyzer()
print(bards_words[0])
print(tokenizer(bards_words[0]))

In [None]:
tokenizer("I am sure that this is   an #awesome hyper-text at https://ucalgary.ca. ")

### Another example with repeating words

In [None]:
repeating_words = ["The sun, the sun, shines so bright, so bright.",
                   "The moon, the moon, reflects so bright, so bright."]
vect = CountVectorizer()
vect.fit(repeating_words)
print(vect.vocabulary_)
bag_of_words = vect.transform(repeating_words)
pd.DataFrame(bag_of_words.toarray(), columns=vect.get_feature_names_out())

## Improving Bag-of-words: min_df
> One way to cut back on uninformative features is to only use tokens that appear in at least two documents (or at least five documents, and so on). A token that appears only in a single document is unlikely to appear in the test set and is therefore not helpful. We can set the minimum number of documents a token needs to appear in with the `min_df` parameter:

In [None]:
vect = CountVectorizer(min_df=2)
vect.fit(bards_words)
bag_of_words = vect.transform(bards_words)
pd.DataFrame(bag_of_words.toarray(), columns=vect.get_feature_names_out())

## Improving Bag-of-words: removing stopwords
>Another way that we can get rid of uninformative words is by discarding words that are too frequent to be informative. There are two main approaches: using a language-specific list of stopwords, or discarding words that appear too frequently. scikit-learn has a built-in list of English stopwords in the `feature_extraction.text` module:

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))
print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))

In [None]:
vect = CountVectorizer(stop_words="english")
vect.fit(bards_words)
bag_of_words = vect.transform(bards_words)
pd.DataFrame(bag_of_words.toarray(), columns=vect.get_feature_names_out())

## Tf-idf
[Sklearn Doc - TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)

>Instead of dropping features that are deemed unimportant, another approach is to rescale features by how informative we expect them to be. One of the most common ways to do this is using the term frequency–inverse document frequency (tf–idf) method. 

>The intuition of this method is to give high weight to any term that appears often in a particular document, but not in many documents in the corpus. 

>If a word appears often in a particular document, but not in very many documents, it is likely to be very descriptive of the content of that document.
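
To make the rescaling concrete, the sketch below recomputes the tf-idf weights by hand and compares them with `TfidfVectorizer`. It assumes the scikit-learn defaults (`smooth_idf=True`, `norm="l2"`), under which the documented formula is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The variable names are chosen for this example only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["The fool doth think he is wise,",
        "but the wise man knows himself to be a fool"]

# Term frequencies: the plain bag-of-words counts.
counts = CountVectorizer().fit_transform(docs).toarray()

# Document frequency: in how many documents does each term occur?
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed default, smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit Euclidean length (norm="l2").
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print("Manual result matches TfidfVectorizer:", np.allclose(manual, reference))
```

Terms that occur in both documents get the smallest idf (exactly 1 here), so after normalization they carry less weight than terms unique to one document.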

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
vect.fit(bards_words)

In [None]:
print("Vocabulary size: {}".format(len(vect.vocabulary_)))
print("Vocabulary content:\n {}".format(vect.vocabulary_))

In [None]:
tfidf_words = vect.transform(bards_words)
print("tfidf_words: {}".format(repr(tfidf_words)))

In [None]:
print("Dense representation of tfidf_words:\n{}".format(
    tfidf_words.toarray()))

In [None]:
tfidf_df = pd.DataFrame(tfidf_words.toarray(), columns=vect.get_feature_names_out())
tfidf_df

### Another example with repeating words

In [None]:
repeating_words = ["The sun, the sun, shines so bright, so bright.",
                   "The moon, the moon, reflects so bright, so bright."]
vect = TfidfVectorizer()
vect.fit(repeating_words)
tfidf_words = vect.transform(repeating_words)
pd.DataFrame(tfidf_words.toarray(), columns=vect.get_feature_names_out())

### Without stop-words

In [None]:
repeating_words = ["The sun, the sun, shines so bright, so bright.",
                   "The moon, the moon, reflects so bright, so bright."]
vect = TfidfVectorizer(stop_words="english")
vect.fit(repeating_words)
tfidf_words = vect.transform(repeating_words)
pd.DataFrame(tfidf_words.toarray(), columns=vect.get_feature_names_out())

## Bag-of-Words with More Than One Word (n-Grams)
>One of the main disadvantages of using a bag-of-words representation is that word order is completely discarded. Therefore, the two strings “it’s bad, not good at all” and “it’s good, not bad at all” have exactly the same representation, even though the meanings are inverted. Putting “not” in front of a word is only one example (if an extreme one) of how context matters. 

>Fortunately, there is a way of capturing context when using a bag-of-words representation, by not only considering the counts of single tokens, but also the counts of pairs or triplets of tokens that appear next to each other. 

>Pairs of tokens are known as bigrams, triplets of tokens are known as trigrams, and more generally sequences of tokens are known as n-grams. 

>We can change the range of tokens that are considered as features by changing the ngram_range parameter of CountVectorizer or TfidfVectorizer.

In [None]:
bards_words

In [None]:
cv = CountVectorizer(ngram_range=(1, 1)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names_out()))

In [None]:
cv = CountVectorizer(ngram_range=(2, 2)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names_out()))

In [None]:
cv = CountVectorizer(ngram_range=(1, 2)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names_out()))

### Another example with repeating words

In [None]:
repeating_words = ["The sun, the sun, shines so bright, so bright.",
                   "The moon, the moon, reflects so bright, so bright."]
cv = CountVectorizer(ngram_range=(1, 2)).fit(repeating_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names_out()))

In [None]:
ngram_words = cv.transform(repeating_words)
pd.DataFrame(ngram_words.toarray(), columns=cv.get_feature_names_out())

### Tf-idf n-gram

In [None]:
repeating_words = ["The sun, the sun, shines so bright, so bright.",
                   "The moon, the moon, reflects so bright, so bright."]
vect = TfidfVectorizer(ngram_range=(1,2))
vect.fit(repeating_words)
tfidf_words = vect.transform(repeating_words)
pd.DataFrame(tfidf_words.toarray(), columns=vect.get_feature_names_out())

### Tf-idf n-gram no stop-words

In [None]:
repeating_words = ["The sun, the sun, shines so bright, so bright.",
                   "The moon, the moon, reflects so bright, so bright."]
vect = TfidfVectorizer(ngram_range=(1,2), stop_words="english")
vect.fit(repeating_words)
tfidf_words = vect.transform(repeating_words)
pd.DataFrame(tfidf_words.toarray(), columns=vect.get_feature_names_out())